
Introduction to Machine Learning

Laurent Younes

January 27, 2025


Contents

Preface

1 General Notation and Background Material
  1.1 Linear algebra
    1.1.1 Sets and functions
    1.1.2 Vectors
    1.1.3 Matrices
    1.1.4 Multilinear maps
  1.2 Topology
    1.2.1 Open and closed sets in Rd
    1.2.2 Compact sets
    1.2.3 Metric spaces
  1.3 Calculus
    1.3.1 Differentials
    1.3.2 Important examples
    1.3.3 Higher order derivatives
    1.3.4 Taylor's theorem
  1.4 Probability theory
    1.4.1 General assumptions and notation
    1.4.2 Conditional probabilities and expectation
    1.4.3 Measure theoretic probability
    1.4.4 Product of measures
    1.4.5 Relative absolute continuity and densities
    1.4.6 Measure-theoretic probability
    1.4.7 Conditional expectations (general case)
    1.4.8 Conditional probabilities (general case)

2 A Few Results in Matrix Analysis
  2.1 Notation and basic facts
  2.2 The trace inequality
  2.3 Applications
  2.4 Some matrix norms

3 Introduction to Optimization
  3.1 Basic Terminology
  3.2 Unconstrained Optimization Problems
    3.2.1 Conditions for optimality (general case)
    3.2.2 Convex sets and functions
    3.2.3 Relative interior
    3.2.4 Derivatives of convex functions and optimality conditions
    3.2.5 Direction of descent and steepest descent
    3.2.6 Convergence
    3.2.7 Line search
  3.3 Stochastic gradient descent
    3.3.1 Stochastic approximation methods
    3.3.2 Deterministic approximation and convergence study
    3.3.3 The ADAM algorithm
  3.4 Constrained optimization problems
    3.4.1 Lagrange multipliers
    3.4.2 Convex constraints
    3.4.3 Applications
    3.4.4 Projected gradient descent
  3.5 General convex problems
    3.5.1 Epigraphs
    3.5.2 Subgradients
    3.5.3 Directional derivatives
    3.5.4 Subgradient descent
    3.5.5 Proximal Methods
  3.6 Duality
    3.6.1 Generalized KKT conditions
    3.6.2 Dual problem
    3.6.3 Example: Quadratic programming
    3.6.4 Proximal iterations and augmented Lagrangian
    3.6.5 Alternating direction method of multipliers
  3.7 Convex separation theorems and additional proofs
    3.7.1 Proof of proposition 3.44
    3.7.2 Proof of theorem 3.45
    3.7.3 Proof of theorem 3.46

4 Introduction: Bias and Variance
  4.1 Parameter estimation and sieves
  4.2 Kernel density estimation

5 Prediction: Basic Concepts
  5.1 General Setting
  5.2 Bayes predictor
  5.3 Examples: model-based approach
    5.3.1 Gaussian models and naive Bayes
    5.3.2 Kernel regression
    5.3.3 A classification example
  5.4 Empirical risk minimization
    5.4.1 General principles
    5.4.2 Bias and variance
  5.5 Evaluating the error
    5.5.1 Generalization error
    5.5.2 Cross validation

6 Inner Products and Reproducing Kernels
  6.1 Introduction
  6.2 Basic Definitions
    6.2.1 Inner-product spaces
    6.2.2 Feature spaces and kernels
  6.3 First examples
    6.3.1 Inner product
    6.3.2 Polynomial Kernels
    6.3.3 Functional Features
    6.3.4 General construction theorems
    6.3.5 Operations on kernels
    6.3.6 Canonical Feature Spaces
  6.4 Projection on a finite-dimensional subspace

7 Linear Regression
  7.1 Least-Square Regression
    7.1.1 Notation and Basic Estimator
    7.1.2 Limit behavior
    7.1.3 Gauss-Markov theorem
    7.1.4 Kernel Version
  7.2 Ridge regression and Lasso
    7.2.1 Ridge Regression
    7.2.2 Equivalence of constrained and penalized formulations
    7.2.3 Lasso regression
  7.3 Other Sparsity Estimators
    7.3.1 LARS estimator
    7.3.2 The Dantzig selector
  7.4 Support Vector Machines for regression
    7.4.1 Linear SVM
    7.4.2 The kernel trick and SVMs

8 Models for linear classification
  8.1 Logistic regression
    8.1.1 General Framework
    8.1.2 Conditional log-likelihood
    8.1.3 Training algorithm
    8.1.4 Penalized Logistic Regression
    8.1.5 Kernel logistic regression
  8.2 Linear Discriminant analysis
    8.2.1 Generative model in classification and LDA
    8.2.2 Dimension reduction
    8.2.3 Fisher's LDA
    8.2.4 Kernel LDA
  8.3 Optimal Scoring
    8.3.1 Kernel optimal scoring
  8.4 Separating hyperplanes and SVMs
    8.4.1 One-layer perceptron and margin
    8.4.2 Maximizing the margin
    8.4.3 KKT conditions and dual problem
    8.4.4 Kernel version

9 Nearest-Neighbor Methods
  9.1 Nearest neighbors for regression
    9.1.1 Consistency
    9.1.2 Optimality
  9.2 p-NN classification
  9.3 Designing the distance

10 Tree-based algorithms
  10.1 Recursive Partitioning
    10.1.1 Binary prediction trees
    10.1.2 Training algorithm
    10.1.3 Resulting predictor
    10.1.4 Stopping rule
    10.1.5 Leaf predictors
    10.1.6 Binary features
    10.1.7 Pruning
  10.2 Random Forests
    10.2.1 Bagging
    10.2.2 Feature randomization
  10.3 Top-Scoring Pairs
  10.4 Adaboost
    10.4.1 General set-up
    10.4.2 The Adaboost algorithm
    10.4.3 Adaboost and greedy gradient descent
  10.5 Gradient boosting and regression
    10.5.1 Notation
    10.5.2 Translation-invariant loss
    10.5.3 General loss functions
    10.5.4 Return to classification
    10.5.5 Gradient tree boosting

11 Neural Nets
  11.1 First definitions
  11.2 Neural nets
    11.2.1 Transitions
    11.2.2 Output
    11.2.3 Image data
  11.3 Geometry
  11.4 Objective function
    11.4.1 Definitions
    11.4.2 Differential
    11.4.3 Complementary computations
  11.5 Stochastic Gradient Descent
    11.5.1 Mini-batches
    11.5.2 Dropout
  11.6 Continuous time limit and dynamical systems
    11.6.1 Neural ODEs
    11.6.2 Adding a running cost

12 Comparing probability distributions
  12.1 Total variation distance
  12.2 Divergences
  12.3 Monge-Kantorovich distance

13 Monte-Carlo Sampling
  13.1 General sampling procedures
  13.2 Rejection sampling
  13.3 Markov chain sampling
    13.3.1 Definitions
    13.3.2 Convergence
    13.3.3 Invariance and reversibility
    13.3.4 Irreducibility and recurrence
    13.3.5 Speed of convergence
    13.3.6 Models on finite state spaces
    13.3.7 Examples on Rd
  13.4 Gibbs sampling
    13.4.1 Definition
    13.4.2 Example: Ising model
  13.5 Metropolis-Hastings
    13.5.1 Definition
    13.5.2 Sampling methods for continuous variables
  13.6 Perfect sampling methods
  13.7 Markovian Stochastic Approximation

14 Markov Random Fields
  14.1 Independence and conditional independence
    14.1.1 Definitions
    14.1.2 Fundamental properties
    14.1.3 Mutual independence
    14.1.4 Relation with Information Theory
  14.2 Models on undirected graphs
    14.2.1 Graphical representation of conditional independence
    14.2.2 Reduction of the Markov property
    14.2.3 Restricted graph and partial evidence
    14.2.4 Marginal distributions
  14.3 The Hammersley-Clifford theorem
    14.3.1 Families of local interactions
    14.3.2 Characterization of positive G-Markov processes
  14.4 Models on acyclic graphs
    14.4.1 Finite Markov chains
    14.4.2 Undirected acyclic graph models and trees
  14.5 Examples of general "loopy" Markov random fields
  14.6 General state spaces

15 Probabilistic Inference for MRF
  15.1 Monte Carlo sampling
  15.2 Inference with acyclic graphs
  15.3 Belief propagation and free energy approximation
    15.3.1 BP stationarity
    15.3.2 Free-energy approximations
  15.4 Computing the most likely configuration
  15.5 General sum-prod and max-prod algorithms
    15.5.1 Factor graphs
    15.5.2 Junction trees
  15.6 Building junction trees
    15.6.1 Triangulated graphs
    15.6.2 Building triangulated graphs
    15.6.3 Computing maximal cliques
    15.6.4 Characterization of junction trees

16 Bayesian Networks
  16.1 Definitions
  16.2 Conditional independence graph
    16.2.1 Moral graph
    16.2.2 Reduction to d-separation
    16.2.3 Chain-graph representation
    16.2.4 Markov equivalence
    16.2.5 Probabilistic inference: Sum-prod algorithm
    16.2.6 Conditional probabilities and interventions
  16.3 Structural equation models

17 Latent Variables and Variational Methods
  17.1 Introduction
  17.2 Variational principle
  17.3 Examples
    17.3.1 Mode approximation
    17.3.2 Gaussian approximation
    17.3.3 Mean-field approximation
  17.4 Maximum likelihood estimation
    17.4.1 The EM algorithm
    17.4.2 Application: Mixtures of Gaussian
    17.4.3 Stochastic approximation EM
    17.4.4 Variational approximation
  17.5 Remarks
    17.5.1 Variations on the EM
    17.5.2 Direct minimization
    17.5.3 Product measure assumption

18 Learning Graphical Models
  18.1 Learning Bayesian networks
    18.1.1 Learning a Single Probability
    18.1.2 Learning a Finite Probability Distribution
    18.1.3 Conjugate Prior for Bayesian Networks
    18.1.4 Structure Scoring
    18.1.5 Reducing the Parametric Dimension
  18.2 Learning Loopy Markov Random Fields
    18.2.1 Maximum Likelihood with Exponential Models
    18.2.2 Maximum likelihood with stochastic gradient ascent
    18.2.3 Relation with Maximum Entropy
    18.2.4 Iterative Scaling
    18.2.5 Pseudo likelihood
    18.2.6 Continuous variables and score matching
  18.3 Incomplete observations for graphical models
    18.3.1 The EM Algorithm
    18.3.2 Stochastic gradient ascent
    18.3.3 Pseudo-EM Algorithm
    18.3.4 Partially-observed Bayesian networks on trees
    18.3.5 General Bayesian networks

19 Deep Generative Methods
  19.1 Normalizing flows
    19.1.1 General concepts
    19.1.2 A greedy computation
    19.1.3 Neural implementation
    19.1.4 Time-continuous version
  19.2 Non-diffeomorphic models and variational autoencoders
    19.2.1 General framework
    19.2.2 Generative model for VAEs
    19.2.3 Discrete data
  19.3 Generative Adversarial Networks (GAN)
    19.3.1 Basic principles
    19.3.2 Objective function
    19.3.3 Algorithm
    19.3.4 Associated probability metric and Wasserstein GANs
  19.4 Reversed Markov chain models
    19.4.1 General principles
    19.4.2 Binary model
    19.4.3 Model with continuous variables
    19.4.4 Continuous-time limit
    19.4.5 Differential of neural functions

20 Clustering
  20.1 Introduction
  20.2 Hierarchical clustering and dendrograms
    20.2.1 Partition trees
    20.2.2 Bottom-up construction
    20.2.3 Top-down construction
    20.2.4 Thresholding
  20.3 K-medoids and K-means
    20.3.1 K-medoids
    20.3.2 Mixtures of Gaussian and deterministic annealing
    20.3.3 Kernel (soft) K-means
    20.3.4 Convex relaxation
  20.4 Spectral clustering
    20.4.1 Spectral approximation of minimum discrepancy
  20.5 Graph partitioning
  20.6 Deciding the number of clusters
    20.6.1 Detecting elbows
    20.6.2 The Caliński and Harabasz index
    20.6.3 The "silhouette" index
    20.6.4 Comparing to homogeneous data
  20.7 Bayesian Clustering
    20.7.1 Introduction
    20.7.2 Model with a bounded number of clusters
    20.7.3 Non-parametric priors

21 Dimension Reduction and Factor Analysis
  21.1 Principal component analysis
    21.1.1 General Framework
    21.1.2 Computation of the principal components
  21.2 Kernel PCA
  21.3 Statistical interpretation and probabilistic PCA
  21.4 Generalized PCA
  21.5 Nuclear norm minimization and robust PCA
    21.5.1 Low-rank approximation
    21.5.2 The nuclear norm
    21.5.3 Robust PCA
  21.6 Independent component analysis
    21.6.1 Identifiability
    21.6.2 Measuring independence and non-Gaussianity
    21.6.3 Maximization over orthogonal matrices
    21.6.4 Parametric ICA
    21.6.5 Probabilistic ICA
  21.7 Non-negative matrix factorization
  21.8 Variational Autoencoders
  21.9 Bayesian factor analysis and Poisson point processes
    21.9.1 A feature selection model
    21.9.2 Non-negative and count variables
    21.9.3 Feature assignment model
  21.10 Point processes and random measures
    21.10.1 Poisson processes
    21.10.2 The gamma process
    21.10.3 The beta process
    21.10.4 Beta Process and feature selection

22 Data Visualization and Manifold Learning
  22.1 Multidimensional scaling
    22.1.1 Similarity matching (Euclidean case)
    22.1.2 Dissimilarity matching
  22.2 Manifold learning
    22.2.1 Isomap
    22.2.2 Local Linear Embedding
    22.2.3 Graph Embedding
    22.2.4 Stochastic neighbor embedding
    22.2.5 Uniform manifold approximation and projection (UMAP)

23 Generalization Bounds
  23.1 Notation
  23.2 Penalty-based Methods and Minimum Description Length
    23.2.1 Akaike's information criterion
    23.2.2 Bayesian information criterion and minimum description length
  23.3 Concentration inequalities
    23.3.1 Cramér's theorem
    23.3.2 Sub-Gaussian variables
    23.3.3 Bennett's inequality
    23.3.4 Hoeffding's inequality
    23.3.5 McDiarmid's inequality
    23.3.6 Boucheron-Lugosi-Massart inequality
  23.4 Bounding the empirical error with the VC-dimension
    23.4.1 Introduction
    23.4.2 Vapnik's theorem
    23.4.3 VC dimension
    23.4.4 Examples
    23.4.5 Data-based estimates
  23.5 Covering numbers and chaining
    23.5.1 Covering, packing and entropy numbers
    23.5.2 A first union bound
    23.5.3 Evaluating covering numbers
    23.5.4 Chaining
    23.5.5 Metric entropy
    23.5.6 Application
  23.6 Other complexity measures
    23.6.1 Fat-shattering and margins
    23.6.2 Maximum discrepancy
    23.6.3 Rademacher complexity
    23.6.4 Algorithmic Stability
    23.6.5 PAC-Bayesian bounds
  23.7 Application to model selection
Preface

Machine learning addresses the issue of analyzing, reproducing and predicting var-
ious mechanisms and processes observable through experiments and data acquisi-
tion. With the impetus of large technological companies in need of leveraging in-
formation included in the gigantic datasets that they produced or obtained through
user data, with the development of new data acquisition techniques in biology, physics
or astronomy, with the improvement of storage capacity and high-performance com-
puting, this field has experienced an explosive growth over the past decades, in
terms of scientific production and technological impact.

While it is being recognized in some places as a scientific discipline in itself, ma-


chine learning (which has received a few almost synonymic denominations across
time, including artificial intelligence, machine intelligence or statistical learning),
can also be seen as an interdisciplinary field interfacing techniques from traditional
domains such as computer science, applied mathematics, and statistics. From statistics, and more specifically nonparametric statistics, it borrows its main formalism, asymptotic results and generalization bounds. It also builds on many classical methods that have been developed for estimation and prediction. From computer science, it involves the construction and implementation of efficient algorithms, programming design and architecture. Finally, machine learning leverages classical methods from linear algebra and functional analysis, as well as from convex and nonlinear optimization, fields to which it has also provided new problems and discoveries. It forms a significant part of the larger field commonly called "data science," which includes methods for storing, sharing and managing data, the development of powerful computer architectures for increasingly demanding algorithms, and, importantly, the definition of ethical limits and processes through which data should be used in the modern world.

This book, which originates from lecture notes for a series of graduate courses taught in the Department of Applied Mathematics and Statistics at Johns Hopkins
University, adopts a viewpoint (or bias) mainly focused on the mathematical and sta-
tistical aspects of the subject. Its goal is to introduce the mathematical foundations
and techniques that lead to the development and analysis of many of the algorithms
that are used today. It is written with the hope to provide the reader with a deeper


understanding of the algorithms made available to her in multiple machine learn-


ing packages and software, and that she will be able to assess their prerequisites and
limitations, and to extend them and develop new algorithms. Note that, while adopt-
ing a presentation with a strong mathematical flavor, we will still make explicit the
details of many important machine learning algorithms.

Unsurprisingly, the book will be more accessible to a reader with some back-
ground in mathematics and statistics. It assumes familiarity with basic concepts in
linear algebra and matrix analysis, in multivariate calculus and in probability and
statistics. We tried to limit the use of measure-theoretic tools, which are avoided up to a few exceptions; these exceptions are localized and accompanied with alternative interpretations allowing for a reading at a more elementary level.

The book starts with an introductory chapter that describes the notation used throughout the book and serves as a reminder of basic concepts in calculus, linear algebra and
probability. It also introduces some measure theoretic terminology, and can be used
as a reading guide for the sections that use these tools. This chapter is followed by
two chapters offering background material on matrix analysis and optimization. The
latter chapter, which is relatively long, provides necessary references to many algo-
rithms that are used in the book, including stochastic gradient descent, proximal
methods, etc.

Chapter 4, which is also introductory, illustrates the bias-variance dilemma in


machine learning through the angle of density estimation and motivates chapter 5
in which basic concepts for statistical prediction are provided. Chapter 6 provides
an introduction to reproducing kernel theory and Hilbert space techniques that are
used in many places, before tackling, with chapters 7 to 11, the description of vari-
ous algorithms for supervised statistical learning, including linear methods, support
vector machines, decision trees, boosting, or neural networks.

Chapter 13, which presents sampling methods and an introduction to the theory
of Markov chains, starts a series of chapters on generative models, and associated
learning algorithms. Graphical models are described in chapters 14 to 16. Chap-
ter 17 introduces variational methods for models with latent variables, with applica-
tions to graphical models in chapter 18. Generative techniques using deep learning
are presented in chapter 19.

Chapters 20 to 22 focus on unsupervised learning methods, for clustering, factor


analysis and manifold learning. The final chapter of the book is theory-oriented and
discusses concentration inequalities and generalization bounds.
Chapter 1

General Notation and Background Material

1.1 Linear algebra

1.1.1 Sets and functions

If A is a set, the set of all subsets of A is denoted P (A). If A and B are two sets, the
notation BA refers to the set of all functions f : A → B. In particular, RA is the space
of real-valued functions, and forms a vector space. When A is finite, this space is
finite dimensional and can be identified with R|A| , where |A| denotes the cardinality
(number of elements) of A.

The indicator function of a subset C of A will be denoted 1C : A → {0, 1}, with


1C (x) = 1 if x ∈ C and 0 otherwise. We will sometimes write 1x∈C for 1C (x).

1.1.2 Vectors

Elements of the d-dimensional Euclidean space Rd will be denoted with letters such
as x, y, z, and their coordinates will be indexed as parenthesized exponents, so that
 (1) 
x 
x =  ... 
 
 
x(d)

(we will always identify elements of Rd with column vectors). We will not distinguish
in the notation between “points” in Rd , seen as an affine space, and “vectors” in Rd ,
seen as a vector space. The vectors 0d and 1d will denote the d-dimensional vectors
with all coordinates equal to 0 and 1, respectively. The identity matrix in Rd will be
denoted IdRd . The canonical basis of Rd , provided by the columns of IdRd will be
denoted e1 , . . . , ed .


The Euclidean norm of a vector x ∈ Rd is denoted |x|, with
\[ |x| = \left( (x^{(1)})^2 + \cdots + (x^{(d)})^2 \right)^{1/2}. \]

It will sometimes be denoted |x|_2, identifying it as a member of the family of ℓ^p norms
\[ |x|_p = \left( |x^{(1)}|^p + \cdots + |x^{(d)}|^p \right)^{1/p} \tag{1.1} \]
for p ≥ 1. One can also define |x|_p for 0 < p < 1, using (1.1), but in this case one does not get a norm, because the triangle inequality |x + y|_p ≤ |x|_p + |y|_p is not true in general. The family is interesting, however, because |x|_p^p approximates, in the limit p → 0, the number of non-zero components of x, denoted |x|_0, which is a measure of sparsity. Note that we also use the notation |A| to denote the cardinality (number of elements) of a set A, hopefully without risk of confusion.
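As a quick numerical illustration of this sparsity-surrogate behavior (a minimal NumPy sketch; the vector x and the values of p below are arbitrary choices, not from the text), one can check that |x|_p^p approaches |x|_0 as p → 0:

```python
import numpy as np

x = np.array([0.0, 2.0, 0.0, -0.5, 3.0])  # arbitrary vector with 3 non-zero entries

def lp_norm(x, p):
    # |x|_p as in (1.1); this is a norm only for p >= 1
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

print(lp_norm(x, 2.0))            # Euclidean norm |x|_2
for p in [1.0, 0.5, 0.1, 0.01]:
    # |x|_p^p = sum_i |x^(i)|^p tends to the number of non-zero entries
    print(p, np.sum(np.abs(x) ** p))
print(np.count_nonzero(x))        # |x|_0 = 3
```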

While we use single bars (|x|) to represent norms of finite-dimensional vectors, we will use double bars (‖h‖) for infinite-dimensional objects.

1.1.3 Matrices

The set of m × d matrices with real entries is denoted Mm,d(R), or simply Mm,d (Md,d will also be denoted Md). The set of invertible d × d matrices will be denoted GLd(R).

Given m column vectors x1 , . . . , xm ∈ Rd , the notation [x1 , . . . , xm ] refers to the d by


m matrix with j th column equal to xj , so that, for example, IdRd = [e1 , . . . , ed ].

Entry (i, j) in a matrix A ∈ Mm,d(R) will either be denoted A(i, j) or A_j^{(i)}. The rows of A will be denoted A^{(1)}, . . . , A^{(m)} and the columns A_1, . . . , A_d.

The operator norm of a matrix A ∈ Mm,d is defined by
\[ |A|_{\mathrm{op}} = \max\{ |Ax| : x \in \mathbb{R}^d,\ |x| = 1 \}. \]

The space of d × d real symmetric matrices is denoted Sd, and its subsets containing positive semi-definite (resp. positive definite) matrices are denoted Sd+ (resp. Sd++). If m ≤ d, Om,d denotes the set of m × d matrices A such that AA^T = IdRm, and one writes Od for Od,d, the space of d-dimensional orthogonal matrices. Finally, SOd is the subset of Od containing orthogonal matrices with determinant 1, i.e., rotation matrices.
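For the Euclidean norm, |A|_op coincides with the largest singular value of A (a standard fact from matrix analysis, used here without proof). A minimal NumPy sketch, with an arbitrary random matrix, compares a crude sampling estimate of the maximum in the definition with the spectral norm:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))   # an arbitrary matrix in M_{3,5}

# Crude estimate of |A|_op = max{|Ax| : |x| = 1} by sampling many unit vectors
xs = rng.standard_normal((5, 100_000))
xs /= np.linalg.norm(xs, axis=0)
estimate = np.max(np.linalg.norm(A @ xs, axis=0))

# np.linalg.norm(A, 2) returns the largest singular value of A
print(estimate, np.linalg.norm(A, 2))   # the sampled maximum slightly underestimates
```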

1.1.4 Multilinear maps

A k-linear map is a function a : (x1, . . . , xk) ↦ a(x1, . . . , xk) defined on (Rd)^k with values in Rq which is linear in each of its variables. The mapping is symmetric if its value is unchanged after any permutation of the variables. If k = 2 and q = 1, one also says that a is a bilinear form. The norm of a k-linear map is defined as
\[ |a| = \max\{ |a(x_1, \dots, x_k)| : |x_j| \le 1,\ j = 1, \dots, k \} \]
so that
\[ |a(x_1, \dots, x_k)| \le |a| \prod_{j=1}^{k} |x_j| \]
for all x1, . . . , xk ∈ Rd.

A symmetric bilinear (i.e., 2-linear) form a is called positive semi-definite if a(x, x) ≥ 0 for all x ∈ Rd, and positive definite if it is positive semi-definite and a(x, x) = 0 if and only if x = 0. Symmetric bilinear forms can always be expressed in the form a(x, y) = x^T A y for some symmetric matrix A, and a is positive (semi-)definite if and only if A is. Analogous statements hold for negative (semi-)definite forms and matrices. We will use the notation A ≻ 0 (resp. A ⪰ 0) to indicate that A is positive definite (resp. positive semi-definite). Note that, if a(x, y) = x^T A y for A ∈ Sd, then |a| = |A|_op.

1.2 Topology

1.2.1 Open and closed sets in Rd

The open balls in Rd will be denoted


B(x, r) = {y ∈ Rd : |y − x| < r},
with x ∈ Rd and r > 0. The closed balls are denoted B̄(x, r) and contain all y’s such
that |y − x| ≤ r. A set U ⊂ Rd is open if and only if for any x ∈ U , there exists r > 0
such that B(x, r) ⊂ U . A set Γ ⊂ Rd is closed if its complement, denoted
Γ^c = {x ∈ Rd : x ∉ Γ}
is open. The topological interior of a set A ⊂ Rd is the largest open set included in
A. It will be denoted either by Å or int(A). A point x belongs to Å if and only if
B(x, r) ⊂ A for some r > 0.

The closure of A is the smallest closed set that contains A and will be denoted
either Ā or cl(A). A point x belongs to Ā if and only if B(x, r) ∩ A ≠ ∅ for all r > 0.
Alternatively, x belongs to Ā if and only if there exists a sequence (xk ) that converges
to x with xk ∈ A for all k.

1.2.2 Compact sets

A compact set in Rd is a set Γ such that any sequence of points in Γ contains a subse-
quence that converges to some point in Γ . An alternate definition is that, whenever
Γ is covered by a collection of open sets, there exists a finite subcollection that still
covers Γ .

One can show that compact subsets of Rd are exactly its bounded and closed
subsets.

1.2.3 Metric spaces

A metric space is a space B equipped with a distance, i.e., a function ρ : B × B →


[0, +∞) that satisfies the following three properties.
∀x, y ∈ B : ρ(x, y) = 0 ⇔ x = y, (1.2a)
∀x, y ∈ B : ρ(x, y) = ρ(y, x), (1.2b)
∀x, y, z ∈ B : ρ(x, z) ≤ ρ(x, y) + ρ(y, z). (1.2c)
Equation (1.2c) is called the triangle inequality. The norm of the difference between
two points: ρ(x, y) = |x − y|, is a distance on Rd . The definition of open and closed
subsets in metric spaces is the same as above, with ρ(x, y) replacing |x − y|, and one
says that (xn ) converges to x if and only if ρ(xn , x) → 0.

Compact subsets are also defined in the same way, but are not necessarily char-
acterized as bounded and closed.

1.3 Calculus

1.3.1 Differentials

If x, y ∈ Rd , we will denote by [x, y] the closed segment delimited by x and y, i.e., the
set of all points (1 − t)x + ty for 0 ≤ t ≤ 1. One denotes by [x, y), (x, y] and (x, y) the
semi-open or open segments, with appropriate strict inequality for t. (Similarly to
the notation for open intervals, whether (x, y) denotes an open segment or a pair of
points will always be clear from the context.)

The derivative of a differentiable function f : t ↦ f(t) from an interval I ⊂ R to


R will be denoted by ∂f , or ∂t f if the variable t is well identified. Its value at t0 ∈ I
is denoted either as ∂f (t0 ) or ∂f |t=t0 . Higher derivatives are denoted as ∂k f , k ≥ 0,
with the usual convention ∂^0 f = f . Note that notation such as f′, f′′, f^(3) will never
refer to derivatives.

In the following, U is an open subset of Rd . If f is a function from U to Rm , we


let f^{(i)} denote the i-th component of f , so that
\[ f(x) = \begin{pmatrix} f^{(1)}(x) \\ \vdots \\ f^{(m)}(x) \end{pmatrix} \]

for x ∈ U . If d = 1, and f is differentiable, the derivative of f at x is the column


vector of the derivatives of its components,
\[ \partial f(x) = \begin{pmatrix} \partial f^{(1)}(x) \\ \vdots \\ \partial f^{(m)}(x) \end{pmatrix}. \]

For d ≥ 1 and j ∈ {1, . . . , d}, the j-th partial derivative of f at x is
\[ \partial_j f(x) = \partial\big(t \mapsto f(x + t e_j)\big)\Big|_{t=0} \in \mathbb{R}^m, \]
where e1, . . . , ed form the canonical basis of Rd. If the notation for the variables on which f depends is well understood from the context, we will alternatively use ∂_{x_j} f. (For example, if f : (α, β) ↦ f(α, β), we will prefer ∂_α f to ∂_1 f.) The differential of f at x is the linear mapping from Rd to Rm represented by the matrix
\[ df(x) = [\partial_1 f(x), \dots, \partial_d f(x)]. \]

It is defined so that, for all h ∈ Rd

\[ df(x)h = \partial\big(t \mapsto f(x + th)\big)\Big|_{t=0}, \]

where the right-hand side is the directional derivative of f at x in the direction h.


Note that, if f : Rd → R (i.e., m = 1), df (x) is a row vector. If f is differentiable
on U and df (x) is continuous as a function of x, one says that f is continuously
differentiable, or C 1 .

Differentials obey the product rule and the chain rule. If f , g : U → R, then

d(f g)(x) = f (x)dg(x) + g(x)df (x).

If f : U → Rm , g : Ũ ⊂ Rk → U , then

d(f ◦ g)(x) = df (g(x))dg(x).
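The chain rule can be checked numerically by approximating each differential with central finite differences. The maps f and g below are hypothetical examples, not taken from the text:

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    # Numerical differential df(x): the matrix [d_1 f(x), ..., d_d f(x)]
    cols = [(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(x.size)]
    return np.column_stack(cols)

g = lambda u: np.array([u[0] * u[1], np.sin(u[0]), u[1] ** 2])  # g : R^2 -> R^3
f = lambda v: np.array([v[0] + v[2] ** 2, np.exp(v[1])])        # f : R^3 -> R^2

x = np.array([0.3, -1.2])
lhs = jacobian(lambda u: f(g(u)), x)       # d(f o g)(x), a 2 x 2 matrix
rhs = jacobian(f, g(x)) @ jacobian(g, x)   # df(g(x)) dg(x)
print(np.allclose(lhs, rhs, atol=1e-5))    # True
```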

If d = m (so that df (x) is a square matrix), we let ∇ · f (x) = trace(df (x)), the
divergence of f .

The terms “derivative” and “differential” are mostly interchangeable, although


one often uses “derivative” for differentiation with respect to a scalar variable.

The Euclidean gradient of a differentiable function f : U → R is ∇f (x) = df (x)T .


More generally, one defines the gradient of f with respect to a tensor field x ↦ A(x)
taking values in Sd++ , as the vector ∇A f (x) that satisfies the relation

df (x)h = ∇A f (x)T A(x)h


for all h ∈ Rd , so that
∇A f (x) = A(x)−1 df (x)T . (1.3)
In particular, the Euclidean gradient is associated with A(x) = IdRd for all x. With
some abuse of notation, we will denote ∇A f = A−1 ∇f when A is a fixed matrix,
therefore identified with the constant tensor field x ↦ A.
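Relation (1.3) can be illustrated with a fixed matrix A ∈ S_2^{++} and a hypothetical function f (both chosen arbitrarily for this sketch): the vector A^{-1}∇f(x) reproduces the directional derivative through the A-weighted inner product:

```python
import numpy as np

f = lambda x: np.sin(x[0]) + x[1] ** 2               # hypothetical smooth f : R^2 -> R
grad = lambda x: np.array([np.cos(x[0]), 2 * x[1]])  # its Euclidean gradient

A = np.array([[2.0, 0.5], [0.5, 1.0]])               # a fixed matrix in S_2^{++}
x = np.array([0.7, -0.4])
h = np.array([0.3, 1.1])

nabla_A = np.linalg.solve(A, grad(x))                # nabla_A f(x) = A^{-1} grad f(x), cf. (1.3)
eps = 1e-6
dfh = (f(x + eps * h) - f(x - eps * h)) / (2 * eps)  # directional derivative df(x)h
print(np.isclose(dfh, nabla_A @ (A @ h)))            # df(x)h = nabla_A f(x)^T A h -> True
```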

1.3.2 Important examples

We here compute, as an illustration and because they will be useful later, the differ-
ential of the determinant and the inversion in matrix spaces.

Recall that, if A = [a1, . . . , ad] ∈ Md is a d by d matrix, with a1, . . . , ad ∈ Rd, then det(A) is a d-linear form δ(a1, . . . , ad) which vanishes when two columns coincide and such that δ(e1, . . . , ed) = 1. In particular, δ changes sign when two of its columns are exchanged. It follows from this that

\[ \partial_{a_{ij}} \det(A) = \delta(a_1, \dots, a_{j-1}, e_i, a_{j+1}, \dots, a_d) = (-1)^{j-1}\, \delta(e_i, a_1, \dots, a_{j-1}, a_{j+1}, \dots, a_d) = (-1)^{i+j} \det A^{(ij)}, \]

where A^{(ij)} is the matrix A with row i and column j removed. We therefore find that the differential of A ↦ det(A) is the mapping
\[ H \mapsto \mathrm{trace}(\mathrm{cof}(A)^T H) \tag{1.4} \]
where cof(A) is the matrix composed of the co-factors (−1)^{i+j} det A^{(ij)}. As a consequence, if A is invertible, then the differential of log |det(A)| is the mapping
\[ H \mapsto \mathrm{trace}(\det(A)^{-1}\, \mathrm{cof}(A)^T H) = \mathrm{trace}(A^{-1} H). \tag{1.5} \]

Consider now the inversion map I : A ↦ A^{-1}, defined on GLd(R), which is an open subset of Md(R). Using A I(A) = IdRd and the product rule, we get
\[ A\,(dI(A)H) + H\, I(A) = 0, \]
or
\[ dI(A)H = -A^{-1} H A^{-1}. \tag{1.6} \]
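Both differentials can be verified numerically with finite differences; the sketch below (arbitrary random matrices, a naive cofactor computation) checks (1.4) and (1.6):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
H = rng.standard_normal((4, 4))
eps = 1e-6

def cof(A):
    # Cofactor matrix: entry (i, j) is (-1)^(i+j) det A^(ij)
    d = A.shape[0]
    C = np.empty_like(A)
    for i in range(d):
        for j in range(d):
            minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
            C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return C

# (1.4): d(det)(A)H = trace(cof(A)^T H)
fd = (np.linalg.det(A + eps * H) - np.linalg.det(A - eps * H)) / (2 * eps)
print(np.isclose(fd, np.trace(cof(A).T @ H)))             # True

# (1.6): dI(A)H = -A^{-1} H A^{-1}
fd_inv = (np.linalg.inv(A + eps * H) - np.linalg.inv(A - eps * H)) / (2 * eps)
Ainv = np.linalg.inv(A)
print(np.allclose(fd_inv, -Ainv @ H @ Ainv, atol=1e-4))   # True
```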

1.3.3 Higher order derivatives

Higher-order partial derivatives ∂_{i_k} · · · ∂_{i_1} f : U → Rm are defined by iterating the definition of first-order derivatives, namely
\[ \partial_{i_k} \cdots \partial_{i_1} f(x) = \partial_{i_k}\big(\partial_{i_{k-1}} \cdots \partial_{i_1} f\big)(x). \]
If all order-k partial derivatives of f exist and are continuous, one says that f is k-times continuously differentiable, or C^k, and, when this holds, the order in which the derivatives are taken does not matter. In this case, one typically groups repeated derivatives using a power notation, writing, for example,
\[ \partial_1 \partial_2 \partial_1 f = \partial_1^2 \partial_2 f \]
for a C 3 function.

If f is C k , its k th differential at x is a symmetric k-multilinear map that can also


be iteratively defined by (for h1, . . . , hk ∈ Rd)
\[ d^k f(x)(h_1, \dots, h_k) = d\big(d^{k-1} f(x)(h_1, \dots, h_{k-1})\big)\, h_k \in \mathbb{R}^m. \]
It is related to partial derivatives through the relation
\[ d^k f(x)(h_1, \dots, h_k) = \sum_{i_1, \dots, i_k = 1}^{d} h_1^{(i_1)} \cdots h_k^{(i_k)}\, \partial_{i_k} \cdots \partial_{i_1} f(x). \]

When m = 1 and k = 2, one denotes by ∇²f(x) = (∂_i ∂_j f(x), i, j = 1, . . . , d) the symmetric matrix formed by the partial derivatives of order 2 of f at x. It is called the Hessian of f at x and satisfies
\[ h_1^T\, \nabla^2 f(x)\, h_2 = d^2 f(x)(h_1, h_2). \]

The Laplacian of f is the trace of ∇2 f and denoted ∆f .
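A short numerical sketch (with a hypothetical C^2 function, chosen only for illustration) recovers the Hessian by central differences and checks its symmetry and its trace, the Laplacian:

```python
import numpy as np

f = lambda x: x[0] ** 2 * x[1] + np.cos(x[1])   # hypothetical C^2 function on R^2

def hessian(f, x, eps=1e-4):
    # Matrix of second partial derivatives (d_i d_j f(x)) by central differences
    d = x.size
    H = np.empty((d, d))
    E = np.eye(d)
    for i in range(d):
        for j in range(d):
            H[i, j] = (f(x + eps * (E[i] + E[j])) - f(x + eps * (E[i] - E[j]))
                       - f(x - eps * (E[i] - E[j])) + f(x - eps * (E[i] + E[j]))) / (4 * eps ** 2)
    return H

x = np.array([1.0, 2.0])
H = hessian(f, x)
print(np.allclose(H, H.T, atol=1e-4))   # the Hessian of a C^2 function is symmetric
print(np.trace(H))                      # Laplacian; exact value here: 2*x[1] - cos(x[1])
```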

1.3.4 Taylor’s theorem

Taylor’s theorem, in its integral form, generalizes the fundamental theorem of cal-
culus to higher derivatives. It expresses the fact that, if f is C k on U and x, y ∈ U are
such that the closed segment [x, y] is included in U , then, letting h = y − x:
\[ f(x+h) = f(x) + df(x)h + \frac{1}{2} d^2 f(x)(h,h) + \cdots + \frac{1}{(k-1)!} d^{k-1} f(x)(h, \dots, h) + \frac{1}{(k-1)!} \int_0^1 (1-t)^{k-1}\, d^k f(x+th)(h, \dots, h)\, dt \tag{1.7} \]

The last term (the remainder) can also be written as
\[ \frac{1}{k!}\; \frac{\int_0^1 (1-t)^{k-1}\, d^k f(x+th)(h, \dots, h)\, dt}{\int_0^1 (1-t)^{k-1}\, dt}. \]

If f takes scalar values, then d k f (x + th)(h, . . . , h) is real and the intermediate value
theorem implies that there exists some z in [x, y] such that

\[ f(x+h) = f(x) + df(x)h + \frac{1}{2} d^2 f(x)(h,h) + \cdots + \frac{1}{(k-1)!} d^{k-1} f(x)(h, \dots, h) + \frac{1}{k!} d^k f(z)(h, \dots, h). \tag{1.8} \]

This is not true if f takes vector values. However, for any M such that |d k f (z)| ≤
M for z ∈ [x, y] (such M’s always exist because f is C k ), one has
\[ \left| \frac{1}{(k-1)!} \int_0^1 (1-t)^{k-1}\, d^k f(x+th)(h, \dots, h)\, dt \right| \le \frac{M}{k!}\, |h|^k. \]

Equation (1.7) can be written as

\[ f(x+h) = f(x) + df(x)h + \frac{1}{2} d^2 f(x)(h,h) + \cdots + \frac{1}{k!} d^k f(x)(h, \dots, h) + \frac{1}{(k-1)!} \int_0^1 (1-t)^{k-1} \big( d^k f(x+th)(h, \dots, h) - d^k f(x)(h, \dots, h) \big)\, dt. \tag{1.9} \]

Let
\[ \varepsilon_x(r) = \max\big\{ |d^k f(x+h) - d^k f(x)| : |h| \le r \big\}. \]
Since d^k f is continuous, ε_x(r) tends to 0 when r → 0, and we have
\[ \int_0^1 (1-t)^{k-1}\, \big| d^k f(x+th)(h, \dots, h) - d^k f(x)(h, \dots, h) \big|\, dt \le \frac{|h|^k}{k}\, \varepsilon_x(|h|). \]

This shows that (1.7) implies that
\[ f(x+h) = f(x) + df(x)h + \frac{1}{2} d^2 f(x)(h,h) + \cdots + \frac{1}{k!} d^k f(x)(h, \dots, h) + \frac{|h|^k}{k!}\, \varepsilon_x(|h|) \tag{1.10} \]
\[ \hphantom{f(x+h)} = f(x) + df(x)h + \frac{1}{2} d^2 f(x)(h,h) + \cdots + \frac{1}{k!} d^k f(x)(h, \dots, h) + o(|h|^k). \tag{1.11} \]

1.4 Probability theory

1.4.1 General assumptions and notation

When discussing probabilistic concepts, we will make the convenient assumption


that all random variables are defined on a fixed probability space (Ω, P). This means
that Ω is large enough to include enough randomness to generate all required vari-
ables (and implicitly enlarged when needed).

We assume that the reader is familiar with concepts related to discrete random
variables (which take values in a discrete or countable space) and their probability
mass function (p.m.f.) or continuous variables (with values in Rd for some d) and
their probability density functions (p.d.f.) when they exist. In particular, X : Ω →
Rd is a random variable with p.d.f. f if and only if the expectation of ϕ(X) is given
by
\[ E(\varphi(X)) = \int_{\mathbb{R}^d} \varphi(x)\, f(x)\, dx \]

for all bounded and continuous functions ϕ : Rd → [0, +∞). Not all random vari-
ables of interest can be categorized as discrete or continuous with a p.d.f., however,
and the others are more conveniently handled using measure-theoretic notation as
introduced below.
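As a sanity check of the defining identity E(ϕ(X)) = ∫ϕ(x)f(x)dx (a minimal sketch; the standard Gaussian X and the test function ϕ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# X standard Gaussian on R, with p.d.f. f(x) = exp(-x^2/2) / sqrt(2 pi)
f = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
phi = lambda x: 1.0 / (1.0 + x ** 2)      # a bounded continuous test function

mc = phi(rng.standard_normal(1_000_000)).mean()   # Monte Carlo estimate of E(phi(X))

xs = np.linspace(-10.0, 10.0, 200_001)            # Riemann-sum approximation of
quad = np.sum(phi(xs) * f(xs)) * (xs[1] - xs[0])  # the integral of phi * f

print(mc, quad)   # both close to the same value (about 0.656)
```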

With a few exceptions, we will use capital letters for random variables and small
letters for scalars and vectors that represent realizations of these variables. One of
these exceptions will be our notation for training data, defined as an independent
and identically distributed (i.i.d.) sample of a given random variable. A realization
of such a sample will always be denoted T = (x1 , . . . , xN ), which is therefore a series
of observations. We will use the notation T = (X1 , . . . , XN ) for the collection of i.i.d.
random variables that generate the training set, so that T = (X1 (ω), . . . , XN (ω)) = T (ω)
for some ω ∈ Ω. Another exception will apply to variables denoted using Greek
letters, for which we will use boldface fonts (such as α, β, . . .).

For a random variable X, the notation [X = x], or [X ∈ A] refers to subsets of Ω,


for example,
[X = x] = {ω ∈ Ω : X(ω) = x} .

1.4.2 Conditional probabilities and expectation

If X : Ω → RX and Y : Ω → RY are discrete random variables, then

P(Y = y | X = x) = P(Y = y, X = x)/ P(X = x)



if P(X = x) > 0 and is undefined otherwise. Then, if Y is scalar- or vector-valued and


discrete, one defines the conditional expectation of Y given X, denoted E(Y | X), by
\[ E(Y \mid X)(\omega) = \sum_{y \in R_Y} y\, P(Y = y \mid X = X(\omega)) \]

for all ω such that P(X = X(ω)) > 0. Note that E(Y | X) is a random variable, defined
over Ω. It however only depends on the values of X, in the sense that E(Y | X)(ω) = E(Y | X)(ω′) if X(ω) = X(ω′). We will use the notation
\[ E(Y \mid X = x) = \sum_{y \in R_Y} y\, P(Y = y \mid X = x), \]

which is now a function defined on RX , satisfying E(Y | X)(ω) = E(Y | X = X(ω)).

If X and Y are scalar- or vector-valued and their joint distribution has a p.d.f. ϕ_{X,Y}, one defines similarly the conditional p.d.f. of Y given X by
\[ \varphi_Y(y \mid X = x) = \frac{\varphi_{X,Y}(y, x)}{\int_{R_Y} \varphi_{X,Y}(y', x)\, dy'} \]
provided that the denominator does not vanish. We will also use the notation ϕ_Y(y | X)(ω) = ϕ_Y(y | X = X(ω)) for ω ∈ Ω. One then defines
\[ E(Y \mid X)(\omega) = \int_{R_Y} y\, \varphi_Y(y \mid X = X(\omega))\, dy. \]

In both cases considered above, it is easily checked that the conditional expecta-
tion satisfies the properties

(CE1) E(Y | X)(ω) only depends on X(ω)


(CE2) For all functions f : RX → [0, +∞) (continuous in the case of continuous ran-
dom variables), one has E(E(Y | X)f (X)) = E(Y f (X)).

The proof that our definition of E(Y | X) for discrete random variables is the only
one satisfying these properties is left to the reader. For continuous random variables,
assume that a function g : RX → RY satisfies

E(g(X)f (X)) = E(Y f (X))


for all continuous f ≥ 0. Then, letting ϕ_X(x) = \int_{R_Y} ϕ_{X,Y}(x, y)\, dy, which is the marginal p.d.f. of X,
\[ \int_{R_X} g(x) f(x)\, \varphi_X(x)\, dx = \int_{R_X \times R_Y} y f(x)\, \varphi_{X,Y}(x, y)\, dx\, dy = \int_{R_X} f(x) \left( \int_{R_Y} y\, \varphi_{X,Y}(x, y)\, dy \right) dx. \]

If we assume that ϕX,Y is continuous, then this identity being true for all f implies
that
\[ g(x)\, \varphi_X(x) = \int_{R_Y} y\, \varphi_{X,Y}(x, y)\, dy, \]
so that g is the conditional expectation. If ϕX,Y is not continuous, then the identity
holds everywhere except on an exceptional “negligible” set (see the measure theo-
retic introduction below). Properties (CE1) and (CE2) provide the definition of the
conditional expectation for general random variables.

Taking f(x) = 1 for all x ∈ RX in (CE2) yields the well-known identity
\[ E(E(Y \mid X)) = E(Y). \]
Moreover, for any function g defined on RX we have
\[ E(Y g(X) \mid X) = g(X)\, E(Y \mid X), \]
which can be checked by proving that the right-hand side satisfies conditions (CE1) and (CE2).
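Both identities can be checked by simulation on a small discrete model (the model below, with X uniform on {0, 1} and Y binomial given X, is a hypothetical example):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

X = rng.integers(0, 2, size=n)               # X uniform on {0, 1}
Y = rng.binomial(3, 0.2 + 0.5 * X)           # Y | X = x ~ Binomial(3, 0.2 + 0.5 x)

cond_exp = 3 * (0.2 + 0.5 * X)               # E(Y | X), known exactly here

# Tower property E(E(Y | X)) = E(Y): both means close to 3 * 0.45 = 1.35
print(cond_exp.mean(), Y.mean())

# E(Y g(X) | X) = g(X) E(Y | X), checked in expectation with g(x) = x:
# both close to 0.5 * 3 * 0.7 = 1.05
print((cond_exp * X).mean(), (Y * X).mean())
```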

1.4.3 Measure theoretic probability

As much as possible—but not always—we will avoid relying on measure theory in


our discussions, at the expense of sometimes making not fully rigorous or incom-
plete statements (that readers familiar with this theory will easily complete). How-
ever, there will be situations in which the flexibility of the measure-theoretic formal-
ism is needed for the exposition. The following notions may help the reader navigate
through these situations (basic references in measure theory are Rudin [172], Dudley
[66], Billingsley [31]).

A measurable space is a pair (S, S ) where S is a set and S ⊂ P (S) contains S, is


stable by complementation (if A ∈ S , then Ac = S \ A ∈ S ), by countable unions and
intersections. Such an S is called a σ -algebra and elements of S form the measurable
subsets of S (relative to the σ -algebra).

A (positive) measure µ on (S, S) is a mapping from S to [0, +∞] that associates to A ∈ S its measure µ(A), such that the measure of a countable union of disjoint sets is the countable sum of their measures. A function f : S → Rd is called measurable if the inverse images by f of open subsets of Rd are measurable. More generally, if (S, S) and (S′, S′) are measurable spaces, f : S → S′ is measurable if f^{-1}(A′) ∈ S for all A′ ∈ S′.
Remark 1.1 In these notes, measurable spaces S will always (and without mention)
be a complete metric (or “metrizable”) space with a dense countable subset (also
called a Polish space), and S the smallest σ -algebra containing all open subsets of S,
which is called the Borel σ -algebra. 

If µ is a measure, a set A is µ-negligible (or negligible for µ) if there exists B ∈ S such that A ⊂ B and µ(B) = 0, and events are said to happen almost everywhere if their complements are negligible. A countable union of negligible sets is negligible, but this is not true for non-countable unions, which may not even be measurable. It is convenient, and always possible, to extend a σ-algebra so that it contains all µ-negligible sets.

The integral of a function f : S → R^d with respect to a measure µ is denoted ∫_S f dµ or ∫_S f(x) µ(dx). This integral is defined, using a limit argument, as a functional which is linear in f and such that, for all A ∈ S,

    ∫_A µ(dx) = ∫_S 1_A(x) µ(dx) = µ(A).

More precisely, this uniquely defines the integral of linear combinations of indicator functions of sets with finite measure (called "simple functions"), and one then defines ∫_S f dµ for f : S → [0, +∞) as the supremum of the integrals among all simple functions that are no larger than f. After showing that the result is well defined and linear in f, one defines the integral of f : S → R as the difference between those of max(f, 0) and max(−f, 0), which is well defined as soon as ∫_S |f| dµ < ∞, in which case one says that f is µ-integrable.

The Lebesgue measure, L^d, on R^d provides an important example. For this measure, S is the σ-algebra generated by open subsets of R^d (the smallest one that contains all open subsets), and ∫_{R^d} f(x) L^d(dx) extends the Riemann integral, justifying the alternative notation ∫_{R^d} f(x) dx that we will preferably use. Another important example, especially when S is finite or countable, is the counting measure, denoted card, that returns the number of elements of a set, so that card(A) = |A|. If S is finite or countable, one generally takes S = P(S) (every subset of S is measurable) and the integral is simply the sum:

    ∫_S f(x) card(dx) = Σ_{x∈S} f(x).

1.4.4 Product of measures

Let µ1 be a measure on (S1, S1) and µ2 a measure on (S2, S2), both in the situation described in remark 1.1. The "tensor product" of µ1 and µ2 is denoted µ1 ⊗ µ2. It is a measure on S1 × S2 defined by µ1 ⊗ µ2(A1 × A2) = µ1(A1)µ2(A2) for A1 ∈ S1 and A2 ∈ S2 (the σ-algebra on S1 × S2 is the smallest one that contains all sets A1 × A2, A1 ∈ S1, A2 ∈ S2).

The integral, with respect to the product measure, of a function f : S1 × S2 → R^d is denoted

    ∫_{S1×S2} f(x1, x2) µ1(dx1) µ2(dx2)

(rather than ∫_{S1×S2} f(x1, x2) µ1 ⊗ µ2(dx1, dx2)).

Fubini's theorem justifies computing such integrals iteratively: if f is integrable with respect to the product measure, then (integrating first in the second variable)

    F(x1) = ∫_{S2} f(x1, x2) µ2(dx2)

is well defined and

    ∫_{S1×S2} f(x1, x2) d(µ1 ⊗ µ2) = ∫_{S1} F(x1) dµ1.

(And one has a symmetric statement by integrating first in the first variable.)
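For counting measures on finite sets, Fubini's theorem reduces to the elementary fact that a finite double sum can be evaluated in either order; for Lebesgue measure, it is the familiar iterated-integral computation. A small numerical sketch (Python/NumPy, with an arbitrarily chosen integrand):

```python
import numpy as np

# Counting measures: the integral over S1 x S2 is a double sum, and
# the two iterated sums agree.
f = np.arange(12, dtype=float).reshape(3, 4)
assert f.sum() == f.sum(axis=1).sum() == f.sum(axis=0).sum()

# Lebesgue measure on [0,1]^2: midpoint Riemann sums for
# f(x1, x2) = x1 * exp(x2), integrated in either order.
t = (np.arange(1000) + 0.5) / 1000     # midpoint grid on [0, 1]
F = np.outer(t, np.exp(t))             # F[i, j] = f(t[i], t[j])
print(F.mean(axis=1).mean())           # integrate in x2 first
print(F.mean(axis=0).mean())           # integrate in x1 first
print(0.5 * (np.e - 1))                # exact value of the integral
```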

The tensor product between more than two measures is defined similarly, with notation

    µ1 ⊗ · · · ⊗ µn = ⊗_{k=1}^n µk.

1.4.5 Relative absolute continuity and densities

If µ and ν are measures on (S, S), one says that ν is absolutely continuous with respect to µ, and writes ν ≪ µ, if

    ∀A ∈ S : µ(A) = 0 ⇒ ν(A) = 0. (1.12)

The Radon-Nikodym theorem states that ν ≪ µ with ν(S) < ∞ if and only if ν has a density with respect to µ, i.e., there exists a µ-integrable function ϕ : S → [0, +∞) such that

    ∫_S f(x) ν(dx) = ∫_S f(x) ϕ(x) µ(dx)

for all measurable f : S → [0, +∞).
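As a concrete illustration of a density, take µ and ν to be the N(0, 1) and N(1, 1) distributions on R; then ν ≪ µ with density ϕ(x) = exp(x − 1/2) (the ratio of the two Gaussian p.d.f.'s). The following Monte Carlo sketch (Python/NumPy; the distributions are chosen only for illustration) computes ν-integrals as ϕ-weighted µ-integrals:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)     # samples from mu = N(0, 1)

# Radon-Nikodym density of nu = N(1, 1) with respect to mu.
phi = np.exp(x - 0.5)

print(phi.mean())                  # integral of phi dmu = nu(R) = 1
print((x * phi).mean())            # integral of x dnu = 1
print((x**2 * phi).mean())         # integral of x^2 dnu = 2
```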



1.4.6 Measure-theoretic probability

When using measure-theoretic probability, we will therefore assume that the pair
(Ω, P) is completed to a triple (Ω, A, P) where A is a σ -algebra and P a probability
measure, that is a (positive) measure on (Ω, A) such that P(Ω) = 1. This triple is
called a probability space. For probability spaces, measurable sets are also called
“events” and events that happen with probability one are said to happen “almost
surely.”

A random variable X must then also take values in a measurable space, say (R_X, S_X), and must be such that, for all C ∈ S_X, the set [X ∈ C] belongs to A. This justifies the computation of P(X ∈ C), which is also denoted P_X(C).

A random variable X taking values in R^d has a p.d.f. if and only if P_X ≪ L^d, and the p.d.f. is the density provided by the Radon-Nikodym theorem. For a discrete random variable (i.e., taking values in a finite or countable set), the p.m.f. of X is also the density of P_X with respect to the counting measure card (every discrete measure is absolutely continuous with respect to card).

If X is a random variable with values in Rd , the integral of X with respect to P is


the expectation of X, denoted E(X). More generally, if (S, S , P ) is a probability space,
we will use the notation

    E_P(f) = ∫_S f(x) P(dx).
If P = PX for some random variable X : Ω → S, we will use EX rather than EPX .

1.4.7 Conditional expectations (general case)

We use (CE1) and (CE2) as a definition of conditional expectation in the general case.
We assume that (RX , S X ) and (RY , S Y ) are measurable spaces.
Definition 1.2 Assume that RY = Rd . Let X : Ω → RX and Y : Ω → RY be two random
variables with E(|Y |) < ∞. The conditional expectation of Y given X is a random variable
Z : Ω → RY such that:

(i) There exists a measurable function h : RX → Rd such that Z = h ◦ X almost surely.


(ii) For any measurable function g : RX → [0, +∞), one has
E(Y g ◦ X) = E(Zg ◦ X).

The variable Z is then denoted E(Y | X) and the function h in (i) is denoted E(Y | X = ·).

Importantly, random variables Z satisfying conditions (i) and (ii) always exist and are almost surely unique, in the sense that, if another random variable Z′ satisfies these conditions, then Z = Z′ with probability one. One obtains an equivalent definition if one restricts the functions g in (ii) to indicators of measurable sets, yielding the condition that, for all measurable A ⊂ R_X,

    E(Y 1_{X∈A}) = E(Z 1_{X∈A}).

With this general definition, we still have

E(E(Y | X)) = E(Y )


and, for any function g defined on RX , E(Y g ◦ X | X) = (g ◦ X)E(Y | X).

Conditional expectations share many of the properties of simple expectations. They are linear with respect to the Y variable. Moreover, if Y ≤ Y′, both taking scalar values, then E(Y | X) ≤ E(Y′ | X) almost surely. Jensen's inequality also holds: if γ : R^d → R is convex and γ ◦ Y is integrable, then

    γ ◦ E(Y | X) ≤ E(γ ◦ Y | X).

We will discuss convex functions in chapter 3, but two important examples for this section are γ(y) = |y| and γ(y) = |y|². The first one implies that |E(Y | X)| ≤ E(|Y| | X) and, taking expectations on both sides, E(|E(Y | X)|) ≤ E(|Y|), the upper bound being finite by assumption. For the square norm, we find that, if Y is square integrable, then so is E(Y | X) and

    E(|E(Y | X)|²) ≤ E(|Y|²).

If Y is square integrable, this inequality shows that E(Y | X) minimizes E(|Y − Z|²) among all square integrable random variables Z : Ω → R_Y that satisfy (i). In other terms, the conditional expectation is the optimal least-squares approximation of Y by a function of X. To see this, just write

    E(|Y − Z|² | X) = E(|Y|² | X) − 2E(YᵀZ | X) + |Z|²
                    = E(|Y|² | X) − 2E(Y | X)ᵀZ + |Z|²
                    = E(|Y|² | X) − |E(Y | X)|² + |E(Y | X) − Z|²
                    = E(|Y − E(Y | X)|² | X) + |E(Y | X) − Z|²
                    ≥ E(|Y − E(Y | X)|² | X)

and taking expectations on both sides yields the desired result.
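This least-squares characterization can be observed numerically. In the sketch below (Python/NumPy, with an arbitrary illustrative model), Y = sin(X) + noise, so that E(Y | X) = sin(X), and the mean squared error of sin(X) is smaller than that of other predictors based on X:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
X = rng.uniform(-np.pi, np.pi, size=n)
Y = np.sin(X) + rng.normal(scale=0.3, size=n)   # E(Y | X) = sin(X)

def mse(pred):
    return np.mean((Y - pred) ** 2)

print(mse(np.sin(X)))        # ~ 0.09 = noise variance: the optimum
print(mse(X / 2))            # any other function of X does worse
print(mse(np.zeros(n)))      # ~ 0.09 + Var(sin(X)) = 0.59
```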

1.4.8 Conditional probabilities (general case)

If A is a measurable subset of RY , then 1A is a random variable and its condi-


tional expectation E(1A | X) (resp. E(1A | X = x)) is denoted P(Y ∈ A | X) (resp.

P(Y ∈ A | X = x)), or P_Y(A | X) (resp. P_Y(A | X = x)). Note that, for each A, these conditional probabilities are defined up to modifications on sets of probability zero, and it is not obvious that they can be defined for all A together (up to a modification on a common set of probability zero), since there is generally an uncountable number of sets A. This can be done, however, with some mild assumptions on the set R_Y and its σ-algebra (always satisfied in our discussions, see remark 1.1), ensuring that, for all ω ∈ Ω, A ↦ P_Y(A | X)(ω) is a probability distribution on R_Y such that, for any measurable function h : R_Y → R such that h ◦ Y is integrable,

    E(h(Y) | X) = ∫_{R_Y} h(y) P_Y(dy | X).

Assume now that the sets R_X and R_Y are equipped with measures, say µ_X and µ_Y, such that the joint distribution of (X, Y) is absolutely continuous with respect to µ_X ⊗ µ_Y, so that there exists a function ϕ : R_X × R_Y → R (the p.d.f. of (X, Y) with respect to µ_X ⊗ µ_Y) such that

    P(X ∈ A, Y ∈ B) = ∫_{A×B} ϕ(x, y) µ_X(dx) µ_Y(dy).

Then P_Y(· | X) is absolutely continuous with respect to µ_Y, with density given by the conditional p.d.f. of Y given X, namely,

    ϕ(y | X) : ω ↦ ϕ(X(ω), y) / ∫_{R_Y} ϕ(X(ω), y′) µ_Y(dy′) = ϕ(y | X = X(ω)). (1.13)

Note that

    P({ω : ∫_{R_Y} ϕ(X(ω), y′) µ_Y(dy′) = 0}) = 0,

so that the conditional density can be defined arbitrarily when the denominator vanishes¹.

We retrieve here as special cases the definitions of conditional probabilities and densities given in section 1.4.2. As an additional example, take R_X = R^d, with µ_X being Lebesgue's measure, and assume that Y is discrete (so that µ_Y = card); then

    ϕ(y | X = X(ω)) = ϕ(X(ω), y) / Σ_{y′∈R_Y} ϕ(X(ω), y′).
¹ Letting ϕ_X(x) = ∫_{R_Y} ϕ(x, y′) µ_Y(dy′), which is the marginal p.d.f. of X with respect to µ_X, we have

    P(ϕ_X(X) = 0) = ∫_{R_X} 1_{ϕ_X(x)=0} ϕ_X(x) µ_X(dx) = 0.
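The mixed continuous/discrete example above is exactly the posterior class-probability computation performed by Bayes classifiers: with X continuous and Y a class label, one normalizes the joint p.d.f. over y. A minimal sketch (Python/NumPy, with made-up Gaussian class-conditional densities):

```python
import numpy as np

# phi(x, y) = pi_y * N(x; mu_y, 1): X continuous (Lebesgue measure),
# Y in {0, 1} discrete (counting measure).
pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])

def phi(x, y):
    return pi[y] * np.exp(-0.5 * (x - mu[y]) ** 2) / np.sqrt(2 * np.pi)

def posterior(x):
    """phi(y | X = x): normalize the joint density over the discrete y."""
    joint = np.array([phi(x, 0), phi(x, 1)])
    return joint / joint.sum()

print(posterior(-1.0))   # mass concentrated on class 0
print(posterior(2.0))    # mass concentrated on class 1
```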
Chapter 2

A Few Results in Matrix Analysis

This chapter collects a few results in linear algebra that will be useful in the rest of
this book.

2.1 Notation and basic facts

We denote by M_{n,d}(R) the space of all n × d matrices with real coefficients¹. For a matrix A ∈ M_{n,d}(R) and integers k ≤ n and l ≤ d, we let A_{⌈k,l⌉} ∈ M_{k,l}(R) denote the matrix A restricted to its first k rows and first l columns. The (i, j) entry of A will be denoted A(i, j) or A^{(ij)}.

We assume that the reader is familiar with elementary matrix analysis, including,
in particular the fact that symmetric matrices are diagonalizable in an orthonormal
basis, i.e., if A ∈ Md,d (R) is a symmetric matrix (whose space is denoted Sd ), there
exists an orthogonal matrix U ∈ Od (i.e., satisfying U T U = U U T = IdRd ) and a diag-
onal matrix D ∈ Md,d (R) such that
A = U DU T .
The identity AU = U D then implies that the columns of U form an orthonormal
basis of eigenvectors of A.

If A ∈ S_d^+ is positive semi-definite (i.e., uᵀAu ≥ 0 for all u ∈ R^d), the entries of D in the decomposition A = UDUᵀ are non-negative, and one can define the matrix square root of A as S = UD^{1/2}Uᵀ, where D^{1/2} is the diagonal matrix formed by taking the square roots of all diagonal coefficients of D. We will use the notation S = A^{1/2}. Note that, when D is diagonal and positive semi-definite, its matrix square root coincides with its entrywise square root, so the notation is unambiguous.
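In practice, the matrix square root is computed exactly as in this definition. A short NumPy sketch (the random test matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M @ M.T                       # a positive semi-definite matrix

d, U = np.linalg.eigh(A)          # A = U diag(d) U^T (symmetric A)
S = U @ np.diag(np.sqrt(np.clip(d, 0, None))) @ U.T   # S = A^{1/2}

print(np.allclose(S @ S, A))      # True: S is a square root of A
print(np.allclose(S, S.T))        # True: S is symmetric
```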

If A ∈ Sd++ is positive definite (i.e., A is positive semi-definite and u T Au = 0 im-


plies u = 0) and B is positive semi-definite, both being d ×d matrices, the generalized
1 Unless mentioned otherwise, all matrices are assumed to be real.


eigenvalue problem associated with A and B consists in finding a diagonal matrix D and a matrix U such that BU = AUD and UᵀAU = Id_{R^d}. Letting Ũ = A^{1/2}U, the problem is equivalent to solving A^{-1/2}BA^{-1/2}Ũ = ŨD with ŨᵀŨ = Id_{R^d}, i.e., to finding the eigenvalue decomposition of the symmetric positive semi-definite matrix A^{-1/2}BA^{-1/2}.
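This reduction can be implemented directly. Below is a sketch in Python/NumPy following the construction above (the random A ≻ 0 and B ⪰ 0 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
M = rng.normal(size=(d, d)); A = M @ M.T + d * np.eye(d)  # positive definite
N = rng.normal(size=(d, d)); B = N @ N.T                  # positive semi-definite

da, Ua = np.linalg.eigh(A)
A_inv_sqrt = Ua @ np.diag(da ** -0.5) @ Ua.T              # A^{-1/2}

dm, Ut = np.linalg.eigh(A_inv_sqrt @ B @ A_inv_sqrt)      # eig of A^{-1/2} B A^{-1/2}
U = A_inv_sqrt @ Ut                                       # U = A^{-1/2} U~
D = np.diag(dm)

print(np.allclose(B @ U, A @ U @ D))                      # B U = A U D
print(np.allclose(U.T @ A @ U, np.eye(d)))                # U^T A U = Id
```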

If A ∈ Mn,d (R), it can be decomposed as

A = U DV T

where U ∈ O_n(R) and V ∈ O_d(R) are orthogonal matrices and D ∈ M_{n,d}(R) is diagonal (i.e., such that D(i, j) = 0 whenever i ≠ j) with non-negative diagonal coefficients. These coefficients are called the singular values of A, and the procedure is called a singular value decomposition (SVD) of A. An equivalent formulation is that there exist orthonormal bases u1, . . . , un of R^n and v1, . . . , vd of R^d (forming the columns of U and V) such that

    Av_i = λ_i u_i

for i ≤ min(n, d), where λ1, . . . , λ_{min(n,d)} are the singular values. Of course, if A is square and symmetric positive semi-definite, an eigenvalue decomposition of A is also a singular value decomposition (and the singular values coincide with the eigenvalues). More generally, if A = UDVᵀ, then AAᵀ = UDDᵀUᵀ and AᵀA = VDᵀDVᵀ are eigenvalue decompositions of AAᵀ and AᵀA. Singular values are uniquely defined, up to reordering. The matrices U and V, however, are not uniquely determined in general, even up to column reordering.

If m = min(n, d), then, forming the matrices Ũ = U_{⌈n,m⌉} (resp. Ṽ = V_{⌈d,m⌉}) by removing from U (resp. V) its last n − m (resp. d − m) columns, and D̃ = D_{⌈m,m⌉} by removing from D its last n − m rows and d − m columns, one has

    A = Ũ D̃ Ṽᵀ

with Ũ, D̃ and Ṽ having respective sizes n × m, m × m and d × m, ŨᵀŨ = ṼᵀṼ = Id_{R^m}, and D̃ diagonal with non-negative coefficients. This representation provides a reduced SVD of A, and one can recover a full SVD from a reduced one by completing the missing columns of Ũ and Ṽ to form orthogonal matrices, and by adding the required number of zeros to D̃.
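Numerical libraries provide both forms. In NumPy, for instance, the `full_matrices` flag of `np.linalg.svd` switches between the full and reduced decompositions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                          # n = 6, d = 4, m = 4

U, s, Vt = np.linalg.svd(A, full_matrices=True)      # U: 6x6, Vt: 4x4
Ur, sr, Vtr = np.linalg.svd(A, full_matrices=False)  # Ur: 6x4 (reduced)

print(np.allclose(A, Ur @ np.diag(sr) @ Vtr))        # A = U~ D~ V~^T
print(np.allclose(Ur.T @ Ur, np.eye(4)))             # U~^T U~ = Id
print(np.allclose(s, sr))                            # same singular values
```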

2.2 The trace inequality

We now describe Von Neumann's trace theorem. Its justification follows the proof given in Mirsky [138].

Theorem 2.1 (Von Neumann) Let A, B ∈ M_{n,d}(R) have singular values (λ1, . . . , λm) and (µ1, . . . , µm), respectively, where m = min(n, d). Assume that these singular values are listed in decreasing order, so that λ1 ≥ · · · ≥ λm and µ1 ≥ · · · ≥ µm. Then,

    trace(AᵀB) ≤ Σ_{i=1}^m λiµi. (2.1)

Moreover, if trace(AᵀB) = Σ_{i=1}^m λiµi, then there exist n × n and d × d orthogonal matrices U and V such that UᵀAV and UᵀBV are both diagonal, i.e., one can find SVDs of A and B in the same bases of R^n and R^d.


Proof We can assume without loss of generality that d ≤ n because, if the result
holds for A and B, it also holds for AT and BT . Let A = U1 ΛV1T and B = U2 MV2T be
the singular values decompositions of A and B (both Λ and M are n × d matrices).
Then
trace(AT B) = trace(V1 ΛT U1T U2 MV2 ) = trace(ΛT U MV T )
with U = U1T U2 and V = V1T V2 . Let u(i, j), 1 ≤ i, j ≤ n and v(i, j), 1 ≤ i, j ≤ d be the
coefficients of the orthogonal matrices U and V . Then
d d d
X 1X 1X
trace(ΛT U MV T ) = u(i, j)v(i, j)λi µj ≤ λi µj u(i, j)2 + λi µj v(i, j)2 (2.2)
2 2
i,j=1 i,j=1 i,j=1

Let us consider the first sum in the upper bound. Let ξd = λd (resp. ηd = µd) and ξi = λi − λ_{i+1} (resp. ηi = µi − µ_{i+1}) for i = 1, . . . , d − 1. Since singular values are non-increasing, we have ξi, ηi ≥ 0 and

    λi = Σ_{j=i}^d ξj,   µi = Σ_{j=i}^d ηj

for i = 1, . . . , d. We have

    Σ_{i,j=1}^d λiµj u(i, j)² = Σ_{i,j=1}^d Σ_{i′=i}^d Σ_{j′=j}^d ξ_{i′}η_{j′} u(i, j)²
                             = Σ_{i′,j′=1}^d ξ_{i′}η_{j′} Σ_{i=1}^{i′} Σ_{j=1}^{j′} u(i, j)²
                             ≤ Σ_{i′,j′=1}^d ξ_{i′}η_{j′} min(i′, j′) (2.3)

where we used the fact that U is orthogonal, which implies that Σ_{j=1}^{j′} u(i, j)² and Σ_{i=1}^{i′} u(i, j)² are both at most 1. Notice also that, when u(i, j) = δij (i.e., u(i, j) = 1 if i = j and zero otherwise), then

    Σ_{i=1}^{i′} Σ_{j=1}^{j′} u(i, j)² = min(i′, j′),

so that the last inequality is then an identity, and the chain of equalities leading to (2.3) implies

    Σ_{i′,j′=1}^d ξ_{i′}η_{j′} min(i′, j′) = Σ_{i=1}^d λiµi.

We therefore obtain, for any orthogonal U, the fact that

    Σ_{i,j=1}^d λiµj u(i, j)² ≤ Σ_{i=1}^d λiµi.

The same inequality obviously holds with v in place of u, and combining the two yields (2.1).

We now consider conditions for equality. Clearly, if one can find SVDs of A and B with U1 = U2 and V1 = V2, then U = Id_{R^n}, V = Id_{R^d} and (2.1) is an identity. We want to prove the converse statement.

For (2.1) to be an equality, we first need (2.2) to be an identity, which requires that u(i, j) = v(i, j) as soon as λiµj > 0. We also need an equality in (2.3), which requires

    Σ_{i=1}^{i′} Σ_{j=1}^{j′} u(i, j)² = min(i′, j′)

as soon as λ_{i′} > λ_{i′+1} and µ_{j′} > µ_{j′+1}. The same identity must be true with v(i, j) replacing u(i, j).

In view of this, denote by i1 < · · · < ip (resp. j1 < · · · < jq) the indexes at which the singular values of A (resp. B) differ from their successors, with the convention λ_{d+1} = µ_{d+1} = 0. Let, for k = 1, . . . , p and l = 1, . . . , q,

    C(k, l) = Σ_{i=1}^{i_k} Σ_{j=1}^{j_l} u(i, j)².

Then, we must have C(k, l) = min(i_k, j_l) for all k, l, and u(i, j) = v(i, j) for i = 1, . . . , i_p and j = 1, . . . , j_q.

If, for all i, j ≤ d, we let U_{⌈i,j⌉} be the matrix formed by the first i rows and j columns of U, the condition C(k, l) = min(i_k, j_l) requires that U_{⌈i_k,j_l⌉}U_{⌈i_k,j_l⌉}ᵀ = Id_{R^{i_k}} if i_k ≤ j_l, and U_{⌈i_k,j_l⌉}ᵀU_{⌈i_k,j_l⌉} = Id_{R^{j_l}} if j_l ≤ i_k. This shows that, if i_k ≤ j_l, the rows of U_{⌈i_k,j_l⌉} form an orthonormal family and, necessarily, all elements u(i, j) for i ≤ i_k and j > j_l vanish. The symmetric situation holds if j_l ≤ i_k.

Let r_k = i_k − i_{k−1} and s_l = j_l − j_{l−1} (with i_0 = j_0 = 0). We now consider possible changes in the SVDs of A and B. With our notation, the matrix Λ takes the block-diagonal form

    Λ = diag(λ_{i1}Id_{R^{r1}}, λ_{i2}Id_{R^{r2}}, . . . , λ_{ip}Id_{R^{rp}}, 0),

where the final 0 stands for an (n − i_p) × (d − i_p) block of zeros (all singular values with index larger than i_p vanish, by the definition of i_p).

Let W, W̃ be n × n and d × d orthogonal matrices taking the block-diagonal form

    W = diag(W1, W2, . . . , Wp, W_{p+1}),   W̃ = diag(W1, W2, . . . , Wp, W̃_{p+1}),

where W1, . . . , Wp are orthogonal with respective sizes r1, . . . , rp, W_{p+1} is orthogonal with size n − i_p, and W̃_{p+1} is orthogonal with size d − i_p. Then we have

    W Λ W̃ᵀ = Λ,

proving that U1 can be replaced by U1W provided that V1 is replaced by V1W̃. Similar transformations can be made on U2 and V2, with U2 replaced by U2Z and V2 by V2Z̃, with
   
    Z = diag(Z1, Z2, . . . , Zq, Z_{q+1}),   Z̃ = diag(Z1, Z2, . . . , Zq, Z̃_{q+1}),

with a structure similar to that of W and W̃, replacing r1, . . . , rp by s1, . . . , sq. As a consequence, U = U1ᵀU2 can be replaced by WᵀUZ, and V by W̃ᵀVZ̃. To complete the proof, we need to show that, when (2.1) is an equality, these matrices can be chosen so that WᵀUZ = Id_{R^n} and W̃ᵀVZ̃ = Id_{R^d}.

Let us consider a first step in this direction, assuming that i1 ≤ j1, so that

    U_{⌈i1,j1⌉}U_{⌈i1,j1⌉}ᵀ = Id_{R^{i1}}.

Complete U_{⌈i1,j1⌉}ᵀ into an orthogonal matrix Z1 = [U_{⌈i1,j1⌉}ᵀ, Ũ]. Build a matrix Z as above by taking Z2, . . . , Z_{q+1} equal to the identity. Then UZ has a first i1 × i1 block equal to Id_{R^{i1}}, which implies that all coefficients to the right of and below this block are zeros. If j1 ≤ i1, a similar construction can be made on the other side, letting W1 = [U_{⌈i1,j1⌉}, Ũ], with the first j1 × j1 block of the new matrix U equal to the identity. Note that, since V_{⌈i_p,j_q⌉} = U_{⌈i_p,j_q⌉}, the same result is obtained on V at the same time.

Pursuing this way (and skipping the formal induction argument, which is a bit tedious), we can progressively introduce identity blocks into U and V and transform them into new matrices (that we still denote by U and V) taking the form (letting k = min(i_p, j_q))

    U = ( Id_{R^k}  0 ; 0  Ū )   and   V = ( Id_{R^k}  0 ; 0  V̄ ).

If k = i_p (resp. k = j_q), the final reduction can be obtained by choosing W_{p+1} = Ū and W̃_{p+1} = V̄ (resp. Z_{q+1} = Ūᵀ and Z̃_{q+1} = V̄ᵀ), leading to SVDs of A and B with identical matrices U1 = U2 and V1 = V2. □

Remark 2.2 Note that, since the singular values of −A and of A coincide, theorem 2.1 also applies to −A and B, and the two bounds combine into

    |trace(AᵀB)| ≤ Σ_{i=1}^m λiµi (2.4)

for all matrices A and B, with equality if either A and B, or −A and B, have SVDs using the same bases. □
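The inequality and its equality case are easy to check numerically. A sketch (Python/NumPy; the random matrices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
A, B = rng.normal(size=(n, d)), rng.normal(size=(n, d))

lam = np.linalg.svd(A, compute_uv=False)    # decreasing singular values
mu = np.linalg.svd(B, compute_uv=False)

print(abs(np.trace(A.T @ B)) <= np.sum(lam * mu))   # (2.4): always True

# Equality when B shares the singular bases of A:
U, _, Vt = np.linalg.svd(A)
B2 = U[:, :d] @ np.diag(mu) @ Vt
print(np.isclose(np.trace(A.T @ B2), np.sum(lam * mu)))  # True
```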

2.3 Applications

Let p and d be integers with p ≤ d. Let A ∈ S_d(R), B ∈ S_p(R) be symmetric matrices. We consider the following optimization problem: maximize, over matrices U ∈ M_{d,p}(R) such that UᵀU = Id_{R^p}, the function

    F(U) = trace(UᵀAUB) = trace(AUBUᵀ).

We first note that the singular values of UBUᵀ, which is d × d, are the same as the eigenvalues of B completed with zeros. Letting λ1 ≥ · · · ≥ λd be the eigenvalues of A and µ1 ≥ · · · ≥ µp those of B, we therefore have, from theorem 2.1,

    F(U) ≤ Σ_{i=1}^p λiµi.

Introduce the eigenvalue decompositions of A and B in the form A = VΛVᵀ and B = WMWᵀ. For F(U) to be equal to its upper bound, we know that we must arrange UBUᵀ to take the form

    UBUᵀ = V ( M  0 ; 0  0 ) Vᵀ.

Use, as before, the notation V_{⌈d,p⌉} to denote the matrix formed with the p first columns of V. Take U = V_{⌈d,p⌉}Wᵀ, which satisfies UᵀU = Id_{R^p}. We then have

    UBUᵀ = V_{⌈d,p⌉}WᵀBWV_{⌈d,p⌉}ᵀ = V_{⌈d,p⌉}MV_{⌈d,p⌉}ᵀ = V ( M  0 ; 0  0 ) Vᵀ,

which shows that U is optimal. We summarize this discussion in the next theorem.
Theorem 2.3 Let A ∈ S_d(R) and B ∈ S_p(R) be symmetric matrices, with p ≤ d. Let eigenvalue decompositions of A and B be given by A = VΛVᵀ and B = WMWᵀ, where the diagonal elements of Λ (resp. M) are λ1 ≥ · · · ≥ λd (resp. µ1 ≥ · · · ≥ µp).

Define F(U) = trace(AUBUᵀ) for U ∈ M_{d,p}(R). Then

    max{F(U) : UᵀU = Id_{R^p}} = Σ_{i=1}^p λiµi.

This maximum is attained at U = V_{⌈d,p⌉}Wᵀ.
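Theorem 2.3 is easy to verify numerically, at least when B is positive semi-definite (as in corollary 2.4 below). A sketch in Python/NumPy comparing the optimal U to random choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 6, 3
M = rng.normal(size=(d, d)); A = (M + M.T) / 2     # symmetric A
N = rng.normal(size=(p, p)); B = N @ N.T           # positive semi-definite B

lam, V = np.linalg.eigh(A)                         # ascending order
mu, W = np.linalg.eigh(B)
lam, V = lam[::-1], V[:, ::-1]                     # sort decreasingly
mu, W = mu[::-1], W[:, ::-1]

bound = np.sum(lam[:p] * mu)
U = V[:, :p] @ W.T                                 # maximizer of theorem 2.3
print(np.isclose(np.trace(A @ U @ B @ U.T), bound))     # True

Q, _ = np.linalg.qr(rng.normal(size=(d, p)))       # random orthonormal columns
print(np.trace(A @ Q @ B @ Q.T) <= bound + 1e-9)        # True
```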

The following corollary applies theorem 2.3 with B = diag(µ1, . . . , µp).

Corollary 2.4 Let A ∈ S_d(R) be a symmetric matrix with eigenvalues λ1 ≥ · · · ≥ λd. For p ≤ d, let µ1 ≥ · · · ≥ µp > 0 and define

    F(e1, . . . , ep) = Σ_{i=1}^p µi eiᵀAei.

Then the maximum of F over all orthonormal families e1, . . . , ep in R^d is Σ_{i=1}^p λiµi, and it is attained when e1, . . . , ep are eigenvectors of A with eigenvalues λ1, . . . , λp.

The minimum of F over all orthonormal families e1, . . . , ep in R^d is Σ_{i=1}^p λ_{d−i+1}µi, and it is attained when e1, . . . , ep are eigenvectors of A with eigenvalues λd, . . . , λ_{d−p+1}.

Proof The statement about the maximum is just a special case of theorem 2.3, with B = diag(µ1, . . . , µp), noting that the ith diagonal element of UᵀAU is eiᵀAei, where ei is the ith column of U.

The statement about the minimum is deduced by replacing A by −A. □



Applying this corollary with p = 1, we retrieve the elementary result that λ1 = max{uᵀAu : |u| = 1} and λd = min{uᵀAu : |u| = 1}.

To complete this chapter, we quickly state and prove Rayleigh’s theorem.

Theorem 2.5 Let A ∈ M_{d,d}(R) be a symmetric matrix with eigenvalues λ1 ≥ · · · ≥ λd. Then

    λk = max_{V : dim(V)=k} min{uᵀAu : u ∈ V, |u| = 1} = min_{V : dim(V)=d−k+1} max{uᵀAu : u ∈ V, |u| = 1},

where the max and min over V are taken over linear subspaces of R^d.

Proof Let e1, . . . , ed be an orthonormal basis of eigenvectors of A associated with λ1, . . . , λd. Let, for k ≤ l, W_{k,l} = span(e_k, . . . , e_l). Let V be a subspace of dimension k. Then V ∩ W_{k,d} ≠ {0} (because the sum of the dimensions of these two spaces is d + 1). Taking u0 with norm 1 in this intersection, we have

    min{uᵀAu : u ∈ V, |u| = 1} ≤ u0ᵀAu0 ≤ max{uᵀAu : u ∈ W_{k,d}, |u| = 1} = λk,

where the last identity follows by considering the eigenvalues of A restricted to W_{k,d}. So the max-min in the first identity is indeed no larger than λk, and this value is attained for V = W_{1,k}. This proves the first identity, and the second one can be obtained by applying the first one to −A. □

2.4 Some matrix norms

The operator norm of a matrix A ∈ M_{n,d}(R) is defined as

|A|op = max{|Ax| : x ∈ Rd , |x| = 1}.

It is equal to the square root of the largest eigenvalue of AT A, i.e., to the largest
singular value of A.

The Frobenius norm of A is

    |A|_F = (trace(AᵀA))^{1/2} = ( Σ_{i=1}^n Σ_{j=1}^d A(i, j)² )^{1/2},

so that

    |A|_F = ( Σ_{k=1}^m σk² )^{1/2},

where σ1 , . . . , σm are the singular values of A (and m = min(n, d)).

The nuclear norm of A is defined by

    |A|_* = Σ_{k=1}^m σk.

One can prove that this is a norm using an equivalent definition, provided by the following proposition.

Proposition 2.6 Let A be an n × d matrix. Then

    |A|_* = max{trace(UAVᵀ) : U ∈ M_{n,n} and UᵀU = Id, V ∈ M_{d,d} and VᵀV = Id}.

Proof The fact that trace(UAVᵀ) ≤ |A|_* for any such U and V is a consequence of the trace inequality applied with B = [Id, 0] or its transpose, depending on whether n ≤ d or not. The upper bound being attained when U and V are the matrices forming a singular value decomposition of A, the proof is complete. □

The fact that |A|_* is a norm, for which the only non-trivial point is the triangle inequality, is now an easy consequence of this proposition, because the maximum of the sum of two functions is at most the sum of their maxima. More precisely, we have

    |A + B|_* = max{trace(UAVᵀ) + trace(UBVᵀ) : UᵀU = Id, VᵀV = Id}
              ≤ max{trace(UAVᵀ) : UᵀU = Id, VᵀV = Id} + max{trace(UBVᵀ) : UᵀU = Id, VᵀV = Id}
              = |A|_* + |B|_*.

The nuclear norm is also called the Ky Fan norm of order d. The Ky Fan norm of order k (for 1 ≤ k ≤ d) associates to a matrix A the quantity

    |A|_{(k)} = λ1 + · · · + λk,

i.e., the sum of its k largest singular values. One has the following proposition.

Proposition 2.7 The Ky Fan norms satisfy the triangle inequality.

Proof We prove this following the argument suggested in Bhatia [28]. For A ∈ M_{d,d} and k = 1, . . . , d, let trace_{(k)}(A) be the sum of the k largest diagonal elements of A. For a symmetric matrix A, let |A|′_{(k)} denote the sum of the k largest eigenvalues of A (it is equal to |A|_{(k)} if A is positive semi-definite, but can also involve negative eigenvalues otherwise).

Then, for any symmetric matrix A ∈ S_d,

    |A|′_{(k)} = max{trace_{(k)}(UAUᵀ) : U ∈ O_d}. (2.5)

To show this, assume that V ∈ O_d diagonalizes A, so that D = VAVᵀ is a diagonal matrix. Assume, without loss of generality, that the coefficients λj = D(j, j) are non-increasing. Fix U ∈ O_d, let B = UAUᵀ and W = VUᵀ, so that D = WBWᵀ, or B = WᵀDW. Then, for any j ≤ d,

    B(j, j) = Σ_{i=1}^d W(i, j)² D(i, i).

Then, for any 1 ≤ j1 < · · · < jk ≤ d,

    Σ_{l=1}^k B(j_l, j_l) = Σ_{i=1}^d D(i, i) Σ_{l=1}^k W(i, j_l)²
        = Σ_{i=1}^k D(i, i) + Σ_{i=1}^k D(i, i)( Σ_{l=1}^k W(i, j_l)² − 1 ) + Σ_{i=k+1}^d D(i, i) Σ_{l=1}^k W(i, j_l)²
        = Σ_{i=1}^k D(i, i) + Σ_{i=1}^k (D(i, i) − D(k, k))( Σ_{l=1}^k W(i, j_l)² − 1 )
          + Σ_{i=k+1}^d (D(i, i) − D(k, k)) Σ_{l=1}^k W(i, j_l)² + D(k, k)( Σ_{i=1}^d Σ_{l=1}^k W(i, j_l)² − k ).

Because W is orthogonal, we have Σ_{l=1}^k W(i, j_l)² ≤ 1 for each i, and

    Σ_{i=1}^d Σ_{l=1}^k W(i, j_l)² = k.

This shows that the terms after Σ_{i=1}^k D(i, i) in the upper bound are nonpositive, so that

    Σ_{l=1}^k B(j_l, j_l) ≤ Σ_{i=1}^k D(i, i).

The maximum of the left-hand side over the choices of j1 < · · · < jk is trace_{(k)}(B). Noting that we get an equality when choosing U = V, the proof of (2.5) is complete.

Using the same argument as that made above for the nuclear norm, one deduces from this that

    |A + B|′_{(k)} ≤ |A|′_{(k)} + |B|′_{(k)}

for all A, B ∈ S_d and all k = 1, . . . , d.

Now, let A ∈ M_{n,d} and consider the symmetric matrix

    Ã = ( 0  Aᵀ ; A  0 ) ∈ S_{n+d}.

Write a vector u ∈ R^{n+d} as u = (u1; u2), with u1 ∈ R^d and u2 ∈ R^n. Then u is an eigenvector of Ã for an eigenvalue λ if and only if Aᵀu2 = λu1 and Au1 = λu2, which implies that AᵀAu1 = λ²u1, so that λ² is an eigenvalue of AᵀA and |λ| is a singular value of A. Conversely, if µ is a nonzero eigenvalue of AᵀA, associated with an eigenvector u1, then √µ and −√µ are eigenvalues of Ã, associated with the eigenvectors (u1; ±Au1/√µ). It follows from this that |A|_{(k)} = |Ã|′_{(k)} for k ≤ min(n, d), so that |A|_{(k)} also satisfies the triangle inequality. □

We refer to [28] for more examples of matrix norms, including, in particular, those obtained by taking pth powers in Ky Fan's norms, defining

    |A|_{(k,p)} = (λ1^p + · · · + λk^p)^{1/p}.
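These norms are straightforward to compute from an SVD, and the triangle inequality can be checked numerically. A small sketch (Python/NumPy):

```python
import numpy as np

def ky_fan(A, k):
    """Sum of the k largest singular values of A."""
    return np.linalg.svd(A, compute_uv=False)[:k].sum()

rng = np.random.default_rng(0)
A, B = rng.normal(size=(5, 4)), rng.normal(size=(5, 4))

for k in range(1, 5):
    assert ky_fan(A + B, k) <= ky_fan(A, k) + ky_fan(B, k) + 1e-12

# Order min(n, d) recovers the nuclear norm, order 1 the operator norm.
print(ky_fan(A, 4), np.linalg.svd(A, compute_uv=False).sum())
print(ky_fan(A, 1), np.linalg.norm(A, 2))
```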
Chapter 3

Introduction to Optimization

This chapter summarizes some fundamental concepts in optimization that will be


used later in the book. The reader is referred to textbooks, such as Beck [22], Eiselt
et al. [68], Nocedal and Wright [147], Boyd et al. [39] and many others for proofs and
deeper results.

3.1 Basic Terminology

1. If I is a subset of R, a lower bound of I is an element u ∈ [−∞, +∞] such that u ≤ x


for all x ∈ I. Among these lower bounds, there exists a largest element, denoted
inf I ∈ [−∞, +∞], called the infimum of I (by convention, the infimum of an empty
set is +∞). Similarly, one defines the supremum of I, denoted sup I, as the smallest
upper bound of I (and the supremum of an empty set is −∞). Every set in R has an
infimum and a supremum, but these numbers do not necessarily belong to I. When
they do, they are respectively called minimal and maximal elements of I, and are
denoted min I and max I. So, the statement “u = min I” means u ∈ I and u ≤ v for all
v ∈ I.
2. If F : Ω → R is a real-valued function defined on a subset Ω ⊂ Rd , the infimum
of F over Ω is defined by
inf F = inf{F(x) : x ∈ Ω}

and its supremum is
sup F = sup{F(x) : x ∈ Ω}.

As seen above both numbers are well defined, and can take infinite values. One says
that x ∈ Ω is a (global) minimizer (resp. maximizer) of F if F(y) ≥ F(x) (resp. F(y) ≤
F(x)) for all y ∈ Ω. One also says that F reaches its minimum (resp. maximum), or is
minimized (resp. maximized) at x. Equivalently, x is a minimizer (resp. maximizer)
of F if and only if x ∈ Ω and
F(x) = min{F(y) : y ∈ Ω} (resp. max{F(y) : y ∈ Ω}).


In such cases, one also writes F(x) = minΩ F or F(x) = maxΩ F. In particular, the
notation u = minΩ F indicates that u = infΩ F and that there exists an x in Ω such
that F(x) = u (i.e., that the infimum of F over Ω is realized at some x ∈ Ω). Note
that the infimum of a function always exists, but not necessarily its minimum. Also
note that minimizers, when they exist, are not necessarily unique. We will denote by argmin_Ω F (resp. argmax_Ω F) the (possibly empty) set of minimizers (resp. maximizers) of F.
3. One says that x is a local minimizer (resp. maximizer) of F on Ω if there exists an
open ball B ⊂ Rd such that x ∈ B and F(x) = minΩ∩B F (resp. F(x) = maxΩ∩B F).
4. An optimization problem consists in finding a minimizer or maximizer of an “ob-
jective function” F. Focusing from now on on minimization problems (statements
for maximization problems are symmetric), we will always implicitly assume that
a minimizer exists. The following provides some general assumptions on F and Ω
that ensure this fact.
The sublevel sets of F in Ω are denoted [F ≤ u]_Ω (or simply [F ≤ u] when Ω = R^d) for u ∈ [−∞, +∞], with

    [F ≤ u]_Ω = {x ∈ Ω : F(x) ≤ u}.

Note that

    argmin_Ω F = ∩_{u > inf F} [F ≤ u]_Ω.

A typical requirement for F is that its sublevel sets are closed in R^d, which means that, if a sequence (x_n) in Ω satisfies F(x_n) ≤ u for all n (for some u ∈ R) and converges to a limit x, then x ∈ Ω and F(x) ≤ u. If this is true, one says that F is lower semi-continuous, or l.s.c., on Ω. If, in addition to being closed, the sublevel sets of F are bounded (at least for u small enough, while larger than inf F), then argmin_Ω F is an intersection of nested nonempty compact sets, and is therefore not empty (so that the optimization problem has at least one solution).
5. Different assumptions on F and Ω lead to different types of minimization prob-
lems, with specific underlying theory and algorithms.
1. If F is C 1 or smoother and Ω = Rd , one speaks of an unconstrained smooth
optimization problem.
2. For constrained problems, Ω is often specified by a finite number of inequali-
ties, i.e.,
Ω = {x ∈ Rd : γi (x) ≤ 0, i = 1, . . . , q}.
If F and all functions γ1, . . . , γq are C¹, one speaks of a smooth constrained problem.
3. If Ω is a convex set (i.e., x, y ∈ Ω ⇒ [x, y] ⊂ Ω, where [x, y] is the closed line segment connecting x and y) and F is a convex function (i.e., F((1 − t)x + ty) ≤ (1 − t)F(x) + tF(y) for all x, y ∈ Ω and t ∈ [0, 1]), one speaks of a convex optimization problem.

4. Non-smooth problems are often considered in data science, and lead to inter-
esting algorithms and solutions.
5. When both F and γ1 , . . . , γq are affine functions, one speaks of a linear program-
ming problem (or a linear program). (An affine function is a mapping x 7→ bT x + β,
b ∈ Rd , β ∈ R.)
If F is quadratic (F(x) = (1/2)xᵀAx − bᵀx), and all γi's are affine, one speaks of a quadratic programming problem.
6. Finally, some machine learning problems are specified over discrete or finite
sets Ω (for example Zd , or {0, 1}d ), leading to combinatorial optimization problems.

3.2 Unconstrained Optimization Problems

3.2.1 Conditions for optimality (general case)

Consider a function F : Ω → R where Ω is an open subset of Rd . We first discuss the


unconstrained optimization problem of finding

x∗ ∈ argmin F. (3.1)

The following result summarizes (non-identical) necessary and sufficient conditions


that are applicable to such a solution.

Theorem 3.1 Necessary conditions. Assume that F is differentiable over Ω, and that
x∗ is a local minimum of F. Then ∇F(x∗ ) = 0.
If F is C 2 , then, in addition, ∇2 F(x∗ ) must be positive semidefinite.
Sufficient conditions. Assume that F ∈ C 2 (Ω). If x∗ ∈ Ω is such that ∇F(x∗ ) = 0 and
∇2 F(x∗ ) is positive definite, then x∗ is a local minimum of F.

Proof Necessary conditions: Since Ω is open, it contains an open ball centered at x*, with some radius ε0 > 0, and therefore all segments [x*, x* + εh] for ε ∈ [0, ε0] and all unit-norm vectors h. Since x* is a local minimum, we can choose ε0 so that F(x* + εh) ≥ F(x*) for all such ε and h.

Using the Taylor formula, we get (for ε ∈ [0, ε0], |h| = 1)

    0 ≤ F(x* + εh) − F(x*) = ε ∫₀¹ dF(x* + tεh)h dt.

If dF(x*)h ≠ 0 for some h, then, for small enough ε, dF(x* + tεh)h cannot change sign for t ∈ [0, 1], and therefore ∫₀¹ dF(x* + tεh)h dt has the same sign as dF(x*)h, which must therefore be positive. But the same argument can be made with h replaced by

−h, implying that dF(x∗ )(−h) = −dF(x∗ )h is also positive, and this gives a contradic-
tion. We therefore have dF(x∗ )(h) = 0 for all h, i.e., ∇F(x∗ ) = 0.

Assume that F is C². Then, making a second-order Taylor expansion (and using ∇F(x*) = 0), one gets

    0 ≤ F(x* + εh) − F(x*) = ε² ∫₀¹ (1 − t) d²F(x* + tεh)(h, h) dt.

The same argument as above shows that, if d²F(x*)(h, h) ≠ 0, then it must be positive. This shows that d²F(x*)(h, h) ≥ 0 for all h, and d²F(x*) (or its associated matrix ∇²F(x*)) is positive semidefinite.

Now, assume that F is C² and ∇²F(x*) positive definite. One still has

    F(x* + εh) − F(x*) = ε² ∫₀¹ (1 − t) d²F(x* + tεh)(h, h) dt.

If ∇²F(x*) ≻ 0, then ∇²F(x* + tεh) ≻ 0 for small enough ε, showing that the r.h.s. of the identity is positive for h ≠ 0, and that F(x* + εh) > F(x*). □

Because maximizing F is the same as minimizing −F, necessary (resp. sufficient)


conditions for optimality in maximization problems are immediately deduced from
the above: it suffices to replace positive semidefinite (resp. positive definite) by
negative semidefinite (resp. negative definite).

3.2.2 Convex sets and functions

Definition 3.2 One says that a set Ω ⊂ R^d is convex if and only if, for all x, y ∈ Ω, the closed segment [x, y] is included in Ω.

A function F : R^d → (−∞, +∞] is convex if, for all λ ∈ [0, 1] and all x, y ∈ R^d, one has

    F((1 − λ)x + λy) ≤ (1 − λ)F(x) + λF(y). (3.2)

If, whenever the right-hand side is finite and x ≠ y, the inequality above is strict for λ ∈ (0, 1), one says that F is strictly convex.

Note that, with our definition, convex functions can take the value +∞ but not
−∞. In order for the upper-bound to make sense when F takes infinite values, one
makes the following convention: a + (+∞) = +∞ for any a ∈ (−∞, +∞]; λ · (+∞) = +∞
for any λ > 0; 0 · (+∞) is not defined but 0 · (+∞) + (+∞) = +∞.

Definition 3.3 The domain of F, denoted dom(F), is the set of x ∈ R^d such that F(x) < ∞. One says that F is proper if dom(F) ≠ ∅.

We will only consider proper convex functions in our discussions, which will simply
be referred to as convex functions for brevity.

Proposition 3.4 If F is a convex function, then dom(F) is a convex subset of R^d. Conversely, if Ω is a convex set and F satisfies (3.2) for all x, y ∈ Ω (i.e., F is convex on Ω), then the extension F̂ defined by F̂(x) = F(x) if x ∈ Ω and F̂(x) = +∞ otherwise is a convex function defined on R^d (such that dom(F̂) = Ω).

Proof The first statement is a direct consequence of (3.2), which implies that F is finite on [x, y] as soon as it is finite at x and at y. For the second statement, (3.2) for F̂ is true for x, y ∈ Ω, since it is true for F, and the upper bound is +∞ otherwise. □

This proposition shows that there was no real loss of generality in requiring convex
functions to be defined on the full space Rd . Note also that the upper bound in (3.2)
is infinite unless both x and y belong to dom(F), so that the inequality only needs to
be checked in that case.

One says that a function F is concave if and only if −F is convex. All definitions and properties stated for convex functions then easily transcribe into similar statements for concave functions. We say that a function f : I → (−∞, +∞] (where I is an interval) is non-decreasing if, for all x, y ∈ I, x < y implies f(x) ≤ f(y). We say that f is increasing if, for all x, y ∈ I with x < y, one has f(x) < f(y) when f(x) < ∞, and f(y) = ∞ otherwise.

Inequality (3.2) has important consequences on minimization problems. For ex-


ample, it implies the following proposition.
Proposition 3.5 Let F be a convex (resp. strictly convex) function on Rd . If x ∈ dom(F)
and y ∈ Rd , the function
    λ ∈ (0, 1] ↦ (1/λ)(F((1 − λ)x + λy) − F(x)) (3.3)
is non-decreasing (resp. increasing).

Conversely, let Ω ⊂ Rd be a convex set and F : Ω → (−∞, +∞) be a function such that
the expression in (3.3) is non-decreasing (resp. increasing) for all x ∈ dom(F) and y ∈ Rd .
Then, the extension F̂ of F defined in proposition 3.4 is convex (resp. strictly convex).
Proof Let f(λ) = (1/λ)(F((1 − λ)x + λy) − F(x)). For 0 < µ ≤ λ ≤ 1, denote z_λ = (1 − λ)x + λy and z_µ = (1 − µ)x + µy. One has z_µ = (1 − ν)x + νz_λ, with ν = µ/λ, so that

    F(z_µ) ≤ (1 − µ/λ)F(x) + (µ/λ)F(z_λ).

Subtracting F(x) from both sides (which is allowed since F(x) < ∞) and dividing by µ yields

    f(µ) ≤ f(λ).
If F is strictly convex, then either F(z_µ) = ∞, in which case f(µ) = f(λ) = ∞, or

    F(z_µ) < (1 − µ/λ)F(x) + (µ/λ)F(z_λ)

as soon as 0 < µ < λ, yielding

    f(µ) < f(λ).

Now consider the converse statement. By comparing the expression in (3.3) to its value at λ = 1, we find, for all x, y ∈ Ω,

    (1/λ)(F((1 − λ)x + λy) − F(x)) ≤ F(y) − F(x),

which is (3.2). Since F̂ satisfies (3.2) on its domain, it is convex. If the function in (3.3) is increasing, then the inequality is strict for 0 < λ < 1 as soon as the left-hand side is finite, and F is strictly convex.

Corollary 3.6 If F is convex, any local minimum of F is a global minimum.

Proof If x is a local minimum of F, then, obviously, x ∈ dom(F), and for any y ∈ Rd


and small enough µ > 0, F(x) ≤ F((1 − µ)x + µy). Using the function in (3.3) for λ = µ
and for λ = 1, we get
    0 ≤ (1/µ)(F((1 − µ)x + µy) − F(x)) ≤ F(y) − F(x),
so that x is a global minimum. 

3.2.3 Relative interior

If Ω is convex, then Ω̊ and Ω̄ (its topological interior and closure) are convex too (the easy proof is left to the reader). However, the topological interior of an interesting convex set is often empty, and the better-adapted notion of relative interior is preferable.

Define the affine hull of a set Ω, denoted aff(Ω), as the smallest affine subset of R^d that contains Ω. The vector space parallel to aff(Ω) (generated by all differences x − y, x, y ∈ Ω) will be denoted aff⃗(Ω). Their common dimension k is the largest integer such that there exist x0, x1, . . . , xk ∈ Ω such that x1 − x0, . . . , xk − x0 are linearly independent. Moreover, given these points, elements of the affine hull are described through barycentric coordinates, yielding

    aff(Ω) = {x = λ^{(0)}x0 + · · · + λ^{(k)}xk : λ^{(0)} + · · · + λ^{(k)} = 1}.



The coordinates (λ(0) , . . . , λ(k) ) are uniquely associated to x ∈ aff(Ω) and depend con-
tinuously on x. They are indeed obtained by solving the linear system

x − x0 = λ(1) (x1 − x0 ) + · · · + λ(k) (xk − x0 )

which has a unique solution for x ∈ aff(Ω) by linear independence. To see continuity,
one can introduce the k × k matrix G with entries G(ij) given by the inner products
(xi − x0 )T (xj − x0 ) and the vector h(x) ∈ Rk with entries h(j) (x) = (x − x0 )T (xj − x0 ).
Continuity is then clear since λ = G−1 h(x).

Definition 3.7 If Ω is a convex set, then its relative interior, denoted relint(Ω), is the set of all x ∈ Ω such that there exists ε > 0 with aff(Ω) ∩ B(x, ε) ⊂ Ω.

We have the following important property.

Proposition 3.8 Let Ω be a nonempty convex set. If x ∈ relint(Ω) and y ∈ Ω, then


xλ = (1 − λ)x + λy ∈ relint(Ω) for all λ ∈ [0, 1).

Moreover relint(Ω) is a nonempty convex set.

Proof Take ε such that B(x, ε) ∩ aff(Ω) ⊂ Ω. Take any z ∈ B(x_λ, (1 − λ)ε) ∩ aff(Ω). Define z̃ such that z = (1 − λ)z̃ + λy, i.e.,

    z̃ = (z − λy)/(1 − λ).

Then z̃ ∈ aff(Ω) and

    |z̃ − x| = |z − x_λ|/(1 − λ) < ε,

so that z̃, and therefore z, belongs to Ω. This proves that B(x_λ, (1 − λ)ε) ∩ aff(Ω) ⊂ Ω, so that x_λ ∈ relint(Ω).

If both x and y belong to relint(Ω), then xλ ∈ relint(Ω) for λ ∈ [0, 1], showing that
this set is convex.

We now show that relint(Ω) ≠ ∅. Let k be the dimension of aff(Ω), so that there exist x0, x1, . . . , xk ∈ Ω such that x1 − x0, . . . , xk − x0 are linearly independent. Consider the "simplex"

    S = {λ^{(0)}x0 + · · · + λ^{(k)}xk : λ^{(0)} + · · · + λ^{(k)} = 1, λ^{(j)} ≥ 0, j = 0, . . . , k},

which is included in Ω. Then the average x = (x0 + · · · + xk)/(k + 1) is such that B(x, ε) ∩ aff(Ω) ⊂ S for small enough ε. Otherwise, there would exist a sequence y(n) = λ^{(0)}(n)x0 + · · · + λ^{(k)}(n)xk converging to x, with λ^{(0)}(n) + · · · + λ^{(k)}(n) = 1 and at least one λ^{(j)}(n) < 0 for each n. For at least one index j, the set of n such that λ^{(j)}(n) < 0 is infinite, and provides a subsequence of y(n) that also converges to x. But this would imply that the jth barycentric coordinate of x, which depends continuously on x, is non-positive, a contradiction since all barycentric coordinates of x equal 1/(k + 1).

We therefore have x ∈ relint(Ω), which completes the proof. □

The following proposition provides an equivalent definition of the relative inte-


rior.
Proposition 3.9 If Ω is a convex set, then

    relint(Ω) = {x ∈ Ω : ∀y ∈ Ω, ∃ε > 0 such that x − ε(y − x) ∈ Ω}. (3.4)

So x belongs to the relative interior of Ω if, for all y ∈ Ω, the segment [x, y] can be extended slightly on the x side while remaining included in Ω.
Proof Let A be the set on the r.h.s. of (3.4). The proof that relint(Ω) ⊂ A is straightforward and left to the reader. We consider the reverse inclusion.

Let x ∈ A, and let y ∈ relint(Ω), which is not empty. Then, for some ε > 0, we have z = x − ε(y − x) ∈ Ω. Since

    x = (εy + z)/(1 + ε),

proposition 3.8 implies that x ∈ relint(Ω). □

Convex functions have important regularity properties in the relative interior of their domain, which we will denote ridom(F). Importantly,

    ridom(F) = relint(dom(F)),

which is in general different from int(dom(F)). A first such property is provided by the next proposition.
Proposition 3.10 Let F be a convex function. Then F is locally Lipschitz continuous on
ridom(F), i.e., for every compact subset C ⊂ ridom(F), there exists a constant L > 0 such
that |F(x) − F(y)| ≤ L|x − y| for all x, y ∈ C.

This implies, in particular, that F is continuous on ridom(F).


Proof Take x ∈ ridom(F). Let K = {h ∈ aff⃗(dom(F)) : |h| = 1}. Then the segment [x − ah, x + ah] is included in ridom(F) for small enough a > 0 and all h ∈ K. Since F is convex, we have, for 0 ≤ t ≤ a,

    F(x + th) − F(x) ≤ (t/a)(F(x + ah) − F(x)).

Writing x = λ(x − ah) + (1 − λ)(x + th) with λ = t/(t + a), we also have

    F(x) ≤ (t/(t + a))F(x − ah) + (a/(t + a))F(x + th),

which can be rewritten as

    F(x) − F(x + th) ≤ (t/a)(F(x − ah) − F(x)).
These two inequalities show that F is continuous at x along any direction in aff⃗(dom(F)), which implies that F is continuous at x. Given this, the differences F(x + ah) − F(x) are bounded, by some constant M, over the compact set C, and the previous inequalities show that

    |F(y) − F(x)| ≤ (M/a)|x − y|

for y ∈ ridom(F) with |y − x| ≤ a. □

3.2.4 Derivatives of convex functions and optimality conditions

The following theorem provides a stronger version of optimality conditions for the
minimization of differentiable convex functions. Note that we have only defined
differentiability of functions defined over open sets.
Theorem 3.11 Let F be a convex function, with int(dom(F)) ≠ ∅. Assume that x ∈ int(dom(F)) and that F is differentiable at x. Then, for all y ∈ R^d:

    ∇F(x)ᵀ(y − x) ≤ F(y) − F(x). (3.5)

If F is strictly convex, the inequality is strict for y ≠ x. In particular, ∇F(x) = 0 implies that x is a global minimizer of F. It is the unique minimizer if F is strictly convex.

Conversely, if F is C 1 on an open convex set Ω and satisfies (3.5) for all x, y ∈ Ω, then
F is convex.
Proof Equation (3.3) implies

    (1/λ)(F((1 − λ)x + λy) − F(x)) ≤ F(y) − F(x),  0 < λ ≤ 1.

Taking the limit of the left-hand side as λ ↓ 0 yields (3.5). If F is strictly convex, the inequality is strict for λ < 1 and, since the l.h.s. is increasing in λ, it remains strict when λ ↓ 0.

Conversely, assume (3.5) holds for all x, y ∈ Ω. The derivative of λ ↦ (1/λ)(F((1 − λ)x + λy) − F(x)) is

    (1/λ²)(λ∇F(x + λh)ᵀh − F(x + λh) + F(x))

with h = y − x, which is non-negative by (3.5). This proves that the function in (3.3) is non-decreasing, hence that F is convex by proposition 3.5. If (3.5) holds with a strict inequality, then this derivative is positive and (1/λ)(F((1 − λ)x + λy) − F(x)) is increasing. □

The next proposition describes C 2 convex functions in terms of their second


derivatives.
Proposition 3.12 Let F be convex and twice differentiable at x ∈ int(dom(F)). Then
∇2 F(x) is positive semi-definite.

Conversely, assume that Ω = dom(F) is an open set and that F is C 2 on Ω with a


positive semi-definite second derivative. Then F (or, rather, its extension F̂) is convex. If
the second derivative is everywhere positive definite, then F is strictly convex.
Proof Using the Taylor formula (1.10) at order 2, we get, for any h ∈ R^d with |h| = 1,

    (1/2)d²F(x)(h, h) = (1/(2t²))d²F(x)(th, th) = (1/t²)(F(x + th) − F(x) − t∇F(x)ᵀh) + ε(t) ≥ ε(t)

with ε(t) → 0 when t → 0, the last inequality deriving from (3.5). This shows that d²F(x)(h, h) ≥ 0.

To prove the second statement, assume that F is C² and ∇²F is positive semi-definite everywhere. Then (1.8) implies

    F(y) − F(x) − ∇F(x)ᵀ(y − x) = (1/2)(y − x)ᵀ∇²F(z)(y − x)

for some z ∈ [x, y]. Since the r.h.s. is non-negative, (3.5) holds. If ∇²F is positive definite everywhere, then the r.h.s. is positive if y ≠ x and (3.5) holds with a strict inequality. □

If F is C² with ∇²F everywhere positive definite (so that F is strictly convex), then (1.8) implies that, for some z ∈ [x, y],

    F(y) − F(x) − ∇F(x)ᵀ(y − x) = (1/2)(y − x)ᵀ∇²F(z)(y − x) ≥ (ρ_min(∇²F(z))/2)|y − x|²,
where ρmin (A) denotes the smallest eigenvalue of A. If this smallest eigenvalue is
bounded from below away from zero, there exists a constant m > 0 such that
    F(y) − F(x) − ∇F(x)ᵀ(y − x) − (m/2)|y − x|² ≥ 0. (3.6)
This property is captured by the following definition, which does not require F to be
C 2.
Definition 3.13 A C¹ function F is strongly convex if

1. int(dom(F)) ≠ ∅;

2. there exists m > 0 such that (3.6) holds for all x ∈ int(dom(F)) and y ∈ R^d.

We have the following proposition.


Proposition 3.14 If F is strongly convex, then it is strictly convex, so that, in particular
argmin F has at most one element.

If dom(F) = Rd , then argmin F is not empty.


Proof The first part is a direct consequence of (3.6) and theorem 3.11.

For the second part, (3.6) implies that

    F(x) − F(0) ≥ ∇F(0)ᵀx + (m/2)|x|² ≥ |x|((m/2)|x| − |∇F(0)|).

This shows that F(x) > F(0) if |x| > 2|∇F(0)|/m =: r, so that

    argmin F = argmin_{B̄(0,r)} F.

The set on the r.h.s. involves the minimization of a continuous function on a compact set, and is therefore not empty. □

We will use the following definition.


Definition 3.15 A function F : Ω → R^m is L-C^k, L being a positive number, if it is C^k and

    |d^kF(x) − d^kF(y)| ≤ L|x − y|.

If F is L-C^k, then the Taylor formula (1.9) implies

    |F(x + h) − F(x) − dF(x)h − (1/2)d²F(x)(h, h) − · · · − (1/k!)d^kF(x)(h, . . . , h)| ≤ L|h|^{k+1}/(k + 1)!, (3.7)

for which we used the fact that

    ∫₀¹ t(1 − t)^{k−1} dt = ∫₀¹ (1 − t)^{k−1} dt − ∫₀¹ (1 − t)^k dt = 1/k − 1/(k + 1) = 1/(k(k + 1)).

If F is strongly convex and is, in addition, L-C 1 for some L, then using (3.7), one
gets the double inequality, for all x, y ∈ int(dom(F)):
    (m/2)|y − x|² ≤ F(y) − F(x) − ∇F(x)ᵀ(y − x) ≤ (L/2)|y − x|². (3.8)

The following proposition will be used later.



Proposition 3.16 Assume that F is strongly convex, satisfying (3.6), and that argmin F =
{x∗ } with x∗ ∈ int(dom(F)). Then, for all x ∈ int(dom(F)):

    (m/2)|x − x*|² ≤ F(x) − F(x*) ≤ (1/(2m))|∇F(x)|². (3.9)

Proof Since ∇F(x*) = 0, the first inequality is a consequence of (3.6) applied with x = x* and y = x. Switching the roles of x and x*, we have

    F(x*) − F(x) − ∇F(x)ᵀ(x* − x) ≥ (m/2)|x − x*|²,

so that

    0 ≤ F(x) − F(x*) ≤ −∇F(x)ᵀ(x* − x) − (m/2)|x − x*|² ≤ |∇F(x)||x − x*| − (m/2)|x − x*|². (3.10)

The maximum of the r.h.s. with respect to |x − x*| is attained at |x − x*| = |∇F(x)|/m, showing that

    F(x) − F(x*) ≤ (1/(2m))|∇F(x)|²,

which is the second inequality. □

3.2.5 Direction of descent and steepest descent

Gradient-based algorithms for optimization iteratively update the variable x, creat-


ing a sequence governed by an equation taking the form xt+1 = xt + αt ht with αt > 0
and ht ∈ Rd . To ensure that the objective function F decreases at each step, ht is cho-
sen to be a direction of descent for F at xt , a notion which, as seen below, is closely
connected with the direction of ∇F(xt ).

Definition 3.17 Let Ω be open in R^d and F : Ω → R be a C¹ function. A direction of descent for F at x ∈ Ω is a vector h ≠ 0 in R^d such that there exists ε0 > 0 such that F(x + εh) < F(x) for all ε ∈ (0, ε0].

Proposition 3.18 Assume that F : Ω → R is C 1 and take x ∈ Ω. Then any direction h


such that hT ∇F(x) < 0 is a direction of descent for F at x. Conversely, if h is a direction of
descent, then hT ∇F(x) ≤ 0.

Proof We have the first-order expansion F(x + εh) − F(x) = εhᵀ∇F(x) + o(ε). If hᵀ∇F(x) < 0, the r.h.s. is negative for small enough ε, and h is a direction of descent. Similarly, if hᵀ∇F(x) > 0, the r.h.s. is positive for small enough ε, and h cannot be a direction of descent. □

In particular, h = −∇F(x) is always a direction of descent. It is called the steep-


est descent direction because it minimizes h 7→ ∂α F(x + αh)|α=0 over all h such that
|h|2 = 1. However, this designation has a character of optimality that may be mis-
leading, because using the Euclidean norm for the condition |h|2 = 1 is not neces-
sarily adapted to the optimization problem at hand. In the absence of additional
information on the problem, it does have a canonical nature, as it is (up to rescaling)
the only norm invariant to rotations (including permutations) of the coordinates.
Such invariance is not necessarily desirable when the variable x has a known struc-
ture (e.g., it is organized on a graph) which would be broken by permutation. Also,
steepest refers to a local “greedy” evaluation, but may not be optimal from a global
perspective. A simple example to illustrate this is the case of a quadratic function

    F(x) = (1/2)xᵀAx − bᵀx
where A ∈ Sn++ is a positive definite symmetric matrix. Then ∇F(x) = Ax − b, but one
may argue that ∇A F(x) = A−1 ∇F(x) (defined in (1.3)) is a better choice, because it
allows the algorithm to reach the minimizer of F in one step, since x − ∇A F(x) = A−1 b
(this statement disregards the cost associated in solving the system Ax = b, which
can be an important factor in large dimension). Importantly, if F is any C 1 function,
and A ∈ Sn++ , the minimizer of h 7→ ∂α F(x + αh)|α=0 over all h such that hT Ah = 1 is
given by −∇A F(x), i.e., −∇A F(x) is the steepest descent for the norm associated with
A. This yields a general version of steepest descent methods, iterating

xt+1 = xt − αt ∇At F(xt )

with αt > 0 and At ∈ Sn++ .

One can also notice that −∇_AF(x) minimizes, with respect to h,

    F(x) + ∇F(x)ᵀh + (1/2)hᵀAh.

When ∇²F(x) is positive definite, it is then natural to choose it as the matrix A, therefore taking h = −∇²F(x)^{−1}∇F(x). This provides Newton's method for optimization. However, Newton's method requires computing second derivatives of F, which can be computationally costly. It is, moreover, not a gradient-based method, which is the focus of this discussion.
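The quadratic example above is easy to reproduce numerically. The following sketch (Python/NumPy; the ill-conditioned matrix A is an arbitrary illustrative choice) compares plain gradient descent with the single preconditioned step x − ∇_AF(x):

```python
import numpy as np

# F(x) = x^T A x / 2 - b^T x, minimized at x* = A^{-1} b.
A = np.array([[10.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

def grad(x):
    return A @ x - b

x = np.zeros(2)
for _ in range(100):                   # plain gradient descent
    x = x - 0.1 * grad(x)              # fixed step, 0.1 < 2/L with L = 10

x0 = np.zeros(2)
x_prec = x0 - np.linalg.solve(A, grad(x0))   # one step with h = -A^{-1} grad F

x_star = np.linalg.solve(A, b)
print(np.linalg.norm(x - x_star))      # small, but not zero
print(np.linalg.norm(x_prec - x_star)) # zero up to rounding
```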

3.2.6 Convergence

We now consider a descent algorithm

xt+1 = xt + αt ht (3.11)

where ht is a direction of descent at xt for the objective function F. To ensure con-


vergence, suitable choices for the direction of descent and the step must be made at
each iteration, and some assumptions on the objective function are needed.

Regarding the direction of descent, which must satisfy h_tᵀ∇F(x_t) ≤ 0, we will assume a uniform control away from orthogonality to the gradient, with the condition

    −h_tᵀ∇F(x_t) ≥ ε|h_t||∇F(x_t)| (3.12a)

for some fixed ε > 0. Without loss of generality (given that a multiplicative step α_t must also be chosen), we assume that h_t is commensurable to the gradient, namely, that

    γ1|∇F(x_t)| ≤ |h_t| ≤ γ2|∇F(x_t)| (3.12b)

for fixed 0 < γ1 ≤ γ2. If h_t = ∇_{A_t}F(x_t), these assumptions are satisfied as soon as the smallest and largest eigenvalues of A_t are controlled along the trajectory.

We have the following proposition.


Proposition 3.19 Assume that F is L-C¹, that x_t satisfies (3.11), and that (3.12a) and (3.12b) hold. Then, there exist constants ᾱ > 0 and C > 0, depending on γ1, γ2 and ε, such that, for α_t ≤ ᾱ, one has

    F(x_{t+1}) − F(x_t) ≤ −Cα_t|∇F(x_t)|². (3.13)

Proof Applying (3.7) to x_t and x_{t+1}, we get

    F(x_{t+1}) − F(x_t) − α_t∇F(x_t)ᵀh_t ≤ (L/2)α_t²|h_t|².

Using (3.12a) and (3.12b), this gives

    F(x_{t+1}) − F(x_t) + α_tεγ1|∇F(x_t)|² ≤ (L/2)α_t²γ2²|∇F(x_t)|²,

so that

    F(x_{t+1}) − F(x_t) ≤ −α_t(εγ1 − α_tγ2²L/2)|∇F(x_t)|².

It suffices to take ᾱ = εγ1/(Lγ2²) and C = εγ1/2 to obtain (3.13). □

Iterating (3.13) for t = 1, . . . , T yields

    Σ_{t=1}^T α_t|∇F(x_t)|² ≤ (1/C)(F(x_1) − F(x_{T+1})).

If F is bounded from below, and one takes α_t = ᾱ for all t, one deduces that

    min{|∇F(x_t)|² : t = 1, . . . , T} ≤ (F(x_1) − inf F)/(CTᾱ).

We can deduce from this, for example, that there exists a sequence t1 < · · · < tn < · · ·
such that ∇F(xtk ) → 0 when k → ∞. In particular, if one runs (3.11) until |∇F(xt )| is
smaller than a given tolerance level (which is standard), the procedure is guaranteed
to terminate in a finite number of steps.

Stronger results may be obtained under stronger assumptions on F and on the


algorithm. The first assumption is an inequality similar to (3.13) and requires that,
for some constant C > 0,

F(xt+1 ) − F(xt ) ≤ −C|∇F(xt )|2 . (3.14)

Such an inequality can be deduced from (3.13) under the additional assumption that α_t is bounded from below; line-search strategies ensuring its validity will be discussed in section 3.2.7. The second assumption is that F is convex.

Theorem 3.20 Assume that F is convex and finite and that its sublevel set [F ≤ F(x_0)] is bounded. Assume that argmin F is not empty and let x* be a minimizer of F. If (3.14) is true, then

    F(x_t) − F(x*) ≤ R²/(C(t + 1))

with R = max{|x − x*| : F(x) ≤ F(x_0)}.

Proof Note that the algorithm never leaves [F ≤ F(x0 )]. We have

F(xt+1 ) − F(x∗ ) ≤ F(xt ) − F(x∗ ) − C|∇F(xt )|2 .

Moreover, by convexity, F(x∗ ) − F(xt ) ≥ ∇F(xt )T (x∗ − xt ), so that

F(xt ) − F(x∗ ) ≤ ∇F(xt )T (xt − x∗ ) ≤ |∇F(xt )|R.

Combining these two inequalities, we get

    F(x_{t+1}) − F(x*) ≤ F(x_t) − F(x*) − (C/R²)(F(x_t) − F(x*))².

Introducing δ_t = (C/R²)(F(x_t) − F(x*)), this inequality implies

    δ_{t+1} ≤ δ_t(1 − δ_t).

Taking inverses, we get

    1/δ_{t+1} ≥ 1/δ_t + 1/(1 − δ_t) ≥ 1/δ_t + 1,

which implies 1/δ_t ≥ 1/δ_0 + t ≥ t + 1 (the inequalities above also give δ_0 ≤ 1), i.e., δ_t ≤ 1/(t + 1), which in turn implies the statement of the theorem. □

A faster convergence rate can be obtained if F is assumed to be strongly convex. Indeed, if (3.6) and (3.14) are satisfied, then (using proposition 3.16),

    F(x_{t+1}) − F(x*) ≤ F(x_t) − F(x*) − C|∇F(x_t)|²
                      ≤ F(x_t) − F(x*) − 2Cm(F(x_t) − F(x*))
                      = (1 − 2Cm)(F(x_t) − F(x*)).
We therefore get the proposition:

Proposition 3.21 If F is finite and satisfies (3.6), and if the descent algorithm satisfies
(3.14), then
F(xt ) − F(x∗ ) ≤ (1 − 2Cm)t (F(x0 ) − F(x∗ )).

3.2.7 Line search

Proposition 3.19 states that, to ensure that (3.14) holds, it suffices to take a small
enough step parameter α. However, the values of α that are acceptable depend on
properties of the objective function that are rarely known in practice. Moreover,
even if a valid choice is determined (this can sometimes be done in practice by trial
and error), setting a fixed value of α for the whole algorithm is often too conserva-
tive, as the best α when starting the algorithm may be different from the best one
close to convergence.

For this reason, most gradient descent procedures select a parameter αt at each
step using a line search. Given a current position and direction of descent h, a line
search explores the values of F(x + αh), α ∈ (0, αmax ] in order to discover some α ∗
that satisfies some desirable properties. We will assume in the following that x and
h satisfy (3.12a) and (3.12b) for fixed ε, γ1 , γ2 .

One possible strategy is to define α ∗ as a minimizer of the scalar function

fh (α) = F(x + αh)

over (0, αmax ] for a given upper bound αmax . This can be implemented using, e.g., binary or ternary search algorithms, but such algorithms would typically require a large number of evaluations of the function F, and would be too costly to be run at each iteration of a gradient descent procedure.

Based on the previous convergence study, we should be happy with a line search
procedure that ensures that (3.14) is satisfied for some fixed value of the constant C.
One such condition is the so-called Armijo rule that requires (with a fixed, typically
small, value of c1 > 0):
fh (α) ≤ fh (0) + c1 αhT ∇F(x) . (3.15)

We know that, under the assumptions of proposition 3.19, this condition can always be satisfied with a small enough value of α. Such a value can be determined using a "backtracking procedure," which, given αmax and ρ ∈ (0, 1), takes α = ρ^k αmax where k is the smallest integer such that (3.15) is satisfied. This value of k is determined iteratively, trying αmax , ραmax , ρ²αmax , . . . until (3.15) is true.
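For illustration, here is a minimal Python sketch of the backtracking procedure (the code and names are illustrative; h is assumed to be a direction of descent at x, and F, grad_F are user-supplied callables):

```python
import numpy as np

def backtracking(F, grad_F, x, h, alpha_max=1.0, rho=0.5, c1=1e-4):
    """Backtracking line search: return alpha = rho**k * alpha_max with k the
    smallest integer such that the Armijo condition (3.15) holds."""
    slope = np.dot(h, grad_F(x))   # h^T grad F(x) < 0 for a descent direction
    alpha = alpha_max
    while F(x + alpha * h) > F(x) + c1 * alpha * slope:
        alpha *= rho               # try alpha_max, rho*alpha_max, rho^2*...
    return alpha
```

Under the assumptions of proposition 3.19, the loop above terminates after finitely many halvings.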

A stronger requirement in the line search is to ensure that ∂fh (α) is not "too negative," since one would otherwise be able to further reduce fh by taking a larger value of α. This leads to the weak Wolfe conditions, which combine Armijo's rule in (3.15) and
∂fh (α) = hT ∇F(x + αh) ≥ c2 hT ∇F(x) (3.16a)
for some constant c2 ∈ (c1 , 1). The strong Wolfe conditions require (3.15) and

|hT ∇F(x + αh)| ≤ c2 |hT ∇F(x)|. (3.16b)

(Since h is a direction of descent, (3.16b) implies (3.16a), and in addition prevents hT ∇F(x + αh) from taking large positive values.) If F is L-C 1 , these conditions, with (3.12a) and (3.12b), imply (3.14). Indeed, (3.16a) and the L-C 1 condition imply

−(1 − c2 )hT ∇F(x) ≤ hT (∇F(x + αh) − ∇F(x)) ≤ Lα|h|²

and (3.12a) and (3.12b) give

(1 − c2 )εγ1 |∇F(x)|² ≤ αLγ2² |∇F(x)|²

showing that α ≥ (1 − c2 )εγ1 /(Lγ2² ). Moreover,

F(x + αh) ≤ F(x) + c1 αhT ∇F(x) ≤ F(x) − c1 αεγ1 |∇F(x)|²

so that

F(x + αh) ≤ F(x) − (c1 (1 − c2 )ε²γ1² /(Lγ2² )) |∇F(x)|² .
We have just proved the following proposition.

Proposition 3.22 Assume that F is L-C 1 and that (3.12a), (3.12b), (3.15) and (3.16a) are satisfied. Then there exists C > 0, depending only on L, ε, γ1 , γ2 , c1 and c2 , such that

F(x + αh) ≤ F(x) − C|∇F(x)|2 .

The Wolfe conditions can always be satisfied by some α as soon as F is C 1 and bounded from below, and hT ∇F(x) < 0. The next proposition shows this result for the weak condition, while providing an algorithm finding an α that satisfies it in a finite number of steps.

Proposition 3.23 Let f : α ↦ f (α) be a C 1 function defined on [0, +∞) such that f is
bounded from below and ∂α f (0) < 0. Let 0 < c1 < c2 < 1.

Let α0,0 = α0,1 = 0 and α0 > 0. Define recursively sequences αn,0 , αn,1 and αn as
follows.

(i) If f (αn ) ≤ f (0) + c1 αn ∂α f (0) and ∂α f (αn ) ≥ c2 ∂α f (0), stop the construction.
(ii) If f (αn ) > f (0) + c1 αn ∂α f (0), let αn+1 = (αn + αn,0 )/2, αn+1,1 = αn and αn+1,0 = αn,0 .
(iii) If f (αn ) ≤ f (0) + c1 αn ∂α f (0) and ∂α f (αn ) < c2 ∂α f (0):
(a) If αn,1 = 0, let αn+1 = 2αn , αn+1,0 = αn and αn+1,1 = αn,1 .
(b) If αn,1 > 0, let αn+1 = (αn + αn,1 )/2, αn+1,0 = αn and αn+1,1 = αn,1 .

Then the sequences are always finite, i.e., the algorithm terminates in a finite number of
steps.

Proof Assume, to get a contradiction, that the algorithm runs indefinitely, so that case (i) never occurs. If case (ii) never occurs, then one runs step (iii-a) indefinitely, so that αn → ∞ with f (αn ) ≤ f (0) + c1 αn ∂α f (0), and f cannot be bounded from below, yielding a contradiction. As soon as case (ii) occurs, we have, at every subsequent step, αn,0 ≥ αn−1,0 , αn,1 ≤ αn−1,1 , αn ∈ [αn,0 , αn,1 ], f (αn,1 ) > f (0) + c1 αn,1 ∂α f (0), f (αn,0 ) ≤ f (0) + c1 αn,0 ∂α f (0) and ∂α f (αn,0 ) < c2 ∂α f (0). This implies that

f (αn,1 ) − f (αn,0 ) > c1 (αn,1 − αn,0 )∂α f (0).

Moreover, the updates imply that αn+1,1 − αn+1,0 = (αn,1 − αn,0 )/2. This implies that the three sequences αn , αn,0 and αn,1 converge to the same limit, α. We have

∂α f (α) = lim_{n→∞} (f (αn,1 ) − f (αn,0 ))/(αn,1 − αn,0 ) ≥ c1 ∂α f (0)

and

∂α f (α) = lim_{n→∞} ∂α f (αn,0 ) ≤ c2 ∂α f (0),

yielding c1 ∂α f (0) ≤ c2 ∂α f (0), which is impossible since c2 > c1 and ∂α f (0) < 0. □
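The construction of proposition 3.23 can be sketched as follows (an illustrative implementation, not from the text; f stands for α ↦ F(x + αh) and df for α ↦ hT ∇F(x + αh), with df(0) < 0, so that the proposition guarantees termination):

```python
def weak_wolfe(f, df, c1=1e-4, c2=0.9, alpha=1.0):
    """Bracketing/bisection search of proposition 3.23 for a step
    satisfying the weak Wolfe conditions."""
    f0, df0 = f(0.0), df(0.0)
    lo, hi = 0.0, 0.0   # hi == 0.0 encodes alpha_{n,1} = 0 (no upper bracket yet)
    while True:
        if f(alpha) > f0 + c1 * alpha * df0:       # case (ii): Armijo fails
            hi = alpha
            alpha = 0.5 * (alpha + lo)
        elif df(alpha) < c2 * df0:                 # case (iii): curvature fails
            lo = alpha
            alpha = 2.0 * alpha if hi == 0.0 else 0.5 * (alpha + hi)
        else:                                      # case (i): weak Wolfe holds
            return alpha
```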

The existence of α satisfying the strong Wolfe conditions is a consequence of the following proposition, which also provides an algorithm.

Proposition 3.24 Let f : α ↦ f (α) be a C 1 function defined on [0, +∞) such that f is
bounded from below and ∂α f (0) < 0. Let 0 < c1 < c2 < 1.

Let α0,0 = α0,1 = 0 and α0 > 0. Define recursively sequences αn,0 , αn,1 and αn as
follows.

(i) If f (αn ) ≤ f (0) + c1 αn ∂α f (0) and |∂α f (αn )| ≤ c2 |∂α f (0)| stop the construction.
(ii) If f (αn ) > f (0) + c1 αn ∂α f (0) let αn+1 = (αn + αn,0 )/2, αn+1,1 = αn and αn+1,0 = αn,0 .
(iii) If f (αn ) ≤ f (0) + c1 αn ∂α f (0) and |∂α f (αn )| > c2 |∂α f (0)|:
(a) If αn,1 = 0 and ∂α f (αn ) > −c2 ∂α f (0), let αn+1 = 2αn , αn+1,0 = αn,0 and αn+1,1 =
αn,1 .
(b) If αn,1 = 0 and ∂α f (αn ) < c2 ∂α f (0), let αn+1 = 2αn , αn+1,0 = αn and αn+1,1 =
αn,1 .
(c) If αn,1 > 0 and ∂α f (αn ) > −c2 ∂α f (0), let αn+1 = (αn + αn,0 )/2, αn+1,1 = αn and
αn+1,0 = αn,0 .
(d) If αn,1 > 0 and ∂α f (αn ) < c2 ∂α f (0), let αn+1 = (αn + αn,1 )/2, αn+1,0 = αn and
αn+1,1 = αn,1 .

Then the sequences are always finite, i.e., the algorithm terminates in a finite number of
steps.
Proof Assume, to get a contradiction, that the algorithm runs indefinitely. If the algorithm never enters case (ii), then αn,1 = 0 for all n, αn tends to infinity and f (αn ) ≤ f (0) + c1 αn ∂α f (0), which contradicts the fact that f is bounded from below.

As soon as the algorithm enters (ii), we have, for all subsequent iterations: αn,0 ≤ αn ≤ αn,1 , αn+1,0 ≥ αn,0 , αn+1,1 ≤ αn,1 and αn+1,1 − αn+1,0 = (αn,1 − αn,0 )/2. This implies that both αn,0 and αn,1 converge to the same limit α.

Moreover, we have, at each step:

f (αn,1 ) > f (0) + c1 αn,1 ∂α f (0) or ∂α f (αn,1 ) > −c2 ∂α f (0)

and

f (αn,0 ) ≤ f (0) + c1 αn,0 ∂α f (0) and ∂α f (αn,0 ) ≤ c2 ∂α f (0) .

This implies that, at each step:

(f (αn,1 ) − f (αn,0 ))/(αn,1 − αn,0 ) > c1 ∂α f (0) or ∂α f (αn,1 ) > −c2 ∂α f (0)

and

∂α f (αn,0 ) ≤ c2 ∂α f (0) .

These inequalities remain satisfied (as non-strict inequalities) at the limit, and we must have

∂α f (α) ≥ c1 ∂α f (0) or ∂α f (α) ≥ −c2 ∂α f (0)

and

∂α f (α) ≤ c2 ∂α f (0) ,

which is a contradiction since c2 > c1 and ∂α f (0) < 0. □

3.3 Stochastic gradient descent

3.3.1 Stochastic approximation methods

In some situations, the computation of ∇F needed to run gradient descent updates can be too costly, if not intractable, while a low-cost stochastic approximation is available. For example, if F is an average of many terms, the approximation may simply be based on averaging over a randomly selected subset of the terms. This leads to a stochastic approximation algorithm [164, 114, 25, 67] called stochastic gradient descent (SGD).

A general stochastic approximation algorithm of the Robbins-Monro type updates a parameter, denoted x ∈ Rd , using stochastic rules. One associates to each x a probability distribution πx on some set S, and, for some function H : Rd × S → Rd , considers the sequence of random iterations:

ξ t+1 ∼ πXt ,   Xt+1 = Xt + αt+1 H(Xt , ξ t+1 ) (3.17)

where ξ t+1 is a random variable and the notation ξ t+1 ∼ πXt should be interpreted as the more precise statement that the conditional distribution of ξ t+1 given all past random variables Ut = (ξ 1 , X1 , . . . , ξ t , Xt ) only depends on Xt and is given by πXt .
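As a minimal illustration of iteration (3.17), consider the following sketch (ours, not from the text), in which the sampling of ξ t+1 ∼ πXt is folded into the function H: each call draws a fresh sample internally. The decreasing steps αt = 1/t used in the example are a standard choice (they satisfy assumption (H3) introduced below).

```python
import numpy as np

rng = np.random.default_rng(0)

def robbins_monro(H, x0, alpha, T):
    """Iteration (3.17): X_{t+1} = X_t + alpha(t+1) * H(X_t, xi_{t+1}),
    with the sampling of xi folded into H."""
    x = np.asarray(x0, dtype=float)
    for t in range(1, T + 1):
        x = x + alpha(t) * H(x)
    return x

# Toy SGD example: F(x) = E[(x - xi)^2]/2 with xi ~ N(1, 1), so that
# H(x) = -(x - xi) is an unbiased estimate of -grad F(x) = -(x - 1).
x_hat = robbins_monro(lambda x: -(x - rng.normal(1.0, 1.0)),
                      x0=0.0, alpha=lambda t: 1.0 / t, T=10_000)
```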

It is sometimes assumed in the literature that πx does not depend on x. This is no real loss of generality because, under mild assumptions, a random variable ξ following πx can be generated as a function U (x, ξ̃) where ξ̃ follows a fixed distribution (such as that of a family of independent uniformly distributed variables), and one can then replace H(x, ξ) by H(x, U (x, ξ̃)). On the other hand, allowing π to depend on x brings little additional complication in the notation, and corresponds to the natural form of many applications.

More complex situations can also be considered, in which ξ t+1 is not conditionally independent of the past variables given Xt . For example, the conditional distribution of ξ t+1 given the past may also depend on ξ t , which allows for the combination of stochastic gradient methods with Markov chain Monte-Carlo methods. This situation is studied, for example, in Métivier and Priouret [139], Benveniste et al. [25], and we will discuss an example in section 18.2.2.

3.3.2 Deterministic approximation and convergence study

Introduce the function

H̄(x) = Eπx (H(x, ·))

and write

Xt+1 = Xt + αt+1 H̄(Xt ) + αt+1 η t+1

with η t+1 = H(Xt , ξ t+1 ) − H̄(Xt ), in order to represent the evolution of Xt in (3.17) as a perturbation of the deterministic algorithm

x̄t+1 = x̄t + αt+1 H̄(x̄t ) (3.18)

by the “noise term” αt+1 η t+1 . In many cases, the deterministic algorithm provides
the limit behavior of the stochastic sequence, and one should ensure that this limit
is as desired. By definition, the conditional expectation of η t+1 given Ut (the past) is
zero and one says that αt+1 η t+1 is a “martingale increment.” Then,
MT = ∑_{t=0}^{T−1} αt+1 η t+1 (3.19)

is called a “martingale.” The theory of martingales offers numerous tools for con-
trolling the size of MT and is often a key element in proving the convergence of the
method.

Many convergence results have been provided in the literature and can be found
in textbooks or lecture notes such as Benaı̈m [23], Kushner and Yin [114], Benveniste
et al. [25]. These results rely on some smoothness and growth assumptions made on
the function H, and on the dynamics of the deterministic equation (3.18). Depend-
ing on these assumptions, proofs may become quite technical. We will restrict ourselves here to a reasonably simple context and assume that

(H1) There exists a constant C such that, for all x ∈ Rd ,

Eπx (|H(x, ·)|2 ) ≤ C(1 + |x|2 ).

(H2) There exists x∗ ∈ Rd and µ > 0 such that, for all x ∈ Rd

(x − x∗ )T H̄(x) ≤ −µ|x − x∗ |2 .

Assuming this, let At = |Xt − x∗ |² and at = E(At ). Then, using (3.17),

At+1 = At + 2αt+1 (Xt − x∗ )T H(Xt , ξ t+1 ) + αt+1² |H(Xt , ξ t+1 )|² .

Taking the conditional expectation given past variables yields

E(At+1 | Ut ) = At + 2αt+1 (Xt − x∗ )T H̄(Xt ) + αt+1² EπXt (|H(Xt , ·)|²)
 ≤ At − 2αt+1 µAt + αt+1² C(1 + |Xt |²)
 ≤ (1 − 2αt+1 µ + 2Cαt+1² )At + αt+1² C̃

with C̃ = C(1 + 2|x∗ |²), using |Xt |² ≤ 2At + 2|x∗ |². Taking expectations on both sides yields

at+1 ≤ (1 − 2αt+1 µ + 2Cαt+1² )at + αt+1² C̃. (3.20)

We state the next step in the computation as a lemma.

Lemma 3.25 Assume that the sequence at satisfies the recursive inequality

at+1 ≤ (1 − δt+1 )at + εt+1 (3.21)

with 0 ≤ δt ≤ 1. Let vk,t = ∏_{j=k+1}^{t} (1 − δj ). Then

at ≤ a0 v0,t + ∑_{k=1}^{t} εk vk,t . (3.22)

Proof Letting bt = at /v0,t , we get

bt+1 ≤ bt + εt+1 /v0,t+1

so that

bt ≤ b0 + ∑_{k=1}^{t} εk /v0,k ,

and

at ≤ a0 v0,t + ∑_{k=1}^{t} εk vk,t . □

Using (3.20), we can apply this lemma with εt = C̃αt² and δt = 2αt µ − 2Cαt² , making the additional assumption that, for all t, αt < min(1/(2µ), µ/C), which ensures that 0 < δt < 1.

Starting with a simple case, assume that the steps αt are constant, equal to some value α (yielding also constant δ and ε). Then, (3.22) gives

at ≤ a0 (1 − δ)^t + ε ∑_{k=1}^{t} (1 − δ)^{t−k} ≤ a0 (1 − δ)^t + ε/δ. (3.23)

Returning to the expression of δ and ε as functions of α, this gives

at ≤ a0 (1 − 2αµ + 2α²C)^t + αC̃/(2µ − 2αC).

This shows that lim sup at = O(α).

Returning now to the general case in which the steps depend on t, we will use the following simple result, which we state as a lemma for future reference.

Lemma 3.26 Assume that the doubly indexed sequence wst , s ≤ t, of non-negative numbers is bounded and such that, for all s, limt→∞ wst = 0. Let β1 , β2 , . . . be such that

∑_{t=1}^{∞} |βt | < ∞.

Then

lim_{t→∞} ∑_{s=1}^{t} βs wst = 0.

Proof For any t0 and t > t0 , we have

∑_{s=1}^{t} βs wst ≤ max_s |βs | ∑_{s=1}^{t0} wst + max_{s,t} |wst | ∑_{s=t0+1}^{∞} |βs | .

Since wst → 0 as t → ∞ for each fixed s, the first term (a finite sum) tends to 0, so that

lim sup_{t→∞} ∑_{s=1}^{t} βs wst ≤ max_{s,t} |wst | ∑_{s=t0+1}^{∞} |βs |

and since this upper bound can be made arbitrarily small by increasing t0 , the result follows. □

Lemma 3.25 implies that

at ≤ a0 v0,t + C̃ ∑_{s=1}^{t} αs² vs,t .

Assume that

(H3) ∑_{k=1}^{∞} αk = ∞ and ∑_{k=1}^{∞} αk² < ∞.

Then limt→∞ vs,t = 0 for all s, and lemma 3.26 implies that at tends to zero. So, we have just proved that, if (H1), (H2) and (H3) are true, the sequence Xt converges in the L2 sense to x∗ . Actually, under these conditions, one can show that Xt converges to x∗ almost surely, and we refer to Benveniste et al. [25], Chapter 5, for a proof (the argument above for L2 convergence follows the one given in Nemirovski et al. [146]).

Under (H3), one can say much more on the asymptotic behavior of the algorithm
by comparing it with an ordinary differential equation. The “ODE method,” intro-
duced in Ljung [121], is indeed a fundamental tool for the analysis of stochastic
approximation algorithms. The correspondence between discrete and continuous

times is provided by the sequence αt . More precisely, let τ0 = 0 and τt = τt−1 + αt , t ≥ 1. From (H3), τt → ∞ when t → ∞. Define the piecewise linear interpolation X ℓ (ρ) of the sequence Xt by

X ℓ (ρ) = Xt + ((ρ − τt )/αt+1 )(Xt+1 − Xt ), ρ ∈ [τt , τt+1 ).
Switching to continuous time allows us to interpret the average iteration x̄t+1 = x̄t +
αt+1 H̄(x̄t ) as an Euler discretization scheme for the ordinary differential equation
(ODE)
∂ρ x̄ = H̄(x̄). (3.24)

Most of the insight on the long-term behavior of stochastic approximations results from the fact that the random process X ℓ behaves asymptotically like solutions of this ODE. One has, for example, the following result, for which we introduce some additional notation.

Assume that (3.24) has unique solutions for given initial conditions on any finite interval, and denote by ϕ(ρ, ω) its solution at time ρ initialized with x̄(0) = ω. Let α c (ρ) and η c (ρ) be piecewise constant interpolations of (αt ) and (η t ) defined by α c (ρ) = αt+1 and η c (ρ) = η t+1 on the interval [τt , τt+1 ). Finally, let

∆(ρ, T ) = max_{s∈[ρ,ρ+T ]} | ∫_ρ^s η c (u) du | .

The following proposition (see [23]) compares the tails of the process X ℓ (i.e., the functions X ℓ (ρ + s), s ≥ 0) with the solutions of the ODE over finite intervals.

Proposition 3.27 (Benaïm) Assume that H̄ is Lipschitz and bounded. Then, for some constant C(T ) that only depends on T and H̄, one has, for all ρ ≥ 0,

sup_{h∈[0,T ]} |X ℓ (ρ + h) − ϕ(h, X ℓ (ρ))| ≤ C(T ) ( ∆(ρ − 1, T + 1) + max_{s∈[ρ,ρ+T ]} α c (s) ). (3.25)

Recall that H̄ being Lipschitz means that there exists a constant C such that

|H̄(w) − H̄(w′ )| ≤ C|w − w′ |

for all w, w′ ∈ Rd .

In the upper bound in (3.25), the term ∆(ρ − 1, T + 1) is a random variable. It can be related to the variations

∆0 (t, N ) = max_{k=0,...,N} |Mt+k − Mt |,

where M is defined in (3.19), because, if m(ρ) is the largest integer t such that τt ≤ ρ,
then

∆0 (m(ρ) + 1, m(ρ + T ) − m(ρ)) ≤ ∆(ρ, T ) ≤ ∆0 (m(ρ), m(ρ + T ) − m(ρ) + 1).

In the case we are considering, one can use martingale inequalities (called Doob's inequalities) to control ∆0 . One has, for example,

P( max_{0≤k≤N} |Mt+k − Mt | > λ ) ≤ E(|Mt+N − Mt |²)/λ² . (3.26)

Furthermore, using the fact that E(η k+1 η l+1 ) = 0 if k ≠ l, one has

E(|Mt+N − Mt |²) = ∑_{k=t}^{t+N−1} αk+1² E(|η k+1 |²).

If we assume (to simplify) that H is bounded and ∑_{k=1}^{∞} αk² < ∞, then, for some constant C, we have

E(|Mt+N − Mt |²) ≤ C ∑_{k=t}^{∞} αk+1² → 0 (as t → ∞)

and inequality (3.26) can then be used in (3.25) to control the probability of devia-
tion of the stochastic approximation from the solution of the ODE over finite inter-
vals (a little more work is required under weaker assumptions on H, such as (H1)).

Proposition 3.27 cannot be used with T = ∞ because the constant C(T ) typically grows exponentially with T . In order to draw conclusions on the limit of the process Xt , one needs additional assumptions on the stability of the ODE. We refer to [23] for a collection of results on the relationship between invariant sets and attractors of the ODE and limit trajectories of the stochastic approximation. We here quote one of these results, which is especially relevant for SGD.

Proposition 3.28 Assume that H̄ = −∇E for some function E, and that ∇E only vanishes at a finite number of points. Assume also that Xt is bounded. Then Xt converges to a point x∗ such that ∇E(x∗ ) = 0.

Some additional conditions on H̄ can ensure that stochastic approximation trajectories remain bounded. The simplest one assumes the existence of a "Lyapunov function" that controls the ODE at infinity. The following result is a simplified version of Theorem 17 in Benveniste et al. [25].

Theorem 3.29 In addition to the hypotheses previously made, assume that there exists a C 2 function U with bounded second derivatives and K0 > 0 such that, for all x such that |x| ≥ K0 ,

∇U (x)T H̄(x) ≤ 0 and U (x) ≥ γ|x|² for some γ > 0.

Then, the trajectories X ℓ (ρ) are almost surely bounded.

Note that hypothesis (H2) above implies the theorem’s assumptions.

3.3.3 The ADAM algorithm

ADAM (for adaptive moment estimation [103]) is a popular variant of stochastic gradient descent. When dealing with high-dimensional vectors W , using a single "gain" parameter (γn+1 in (11.4)) is a limiting assumption, since all parameters do not need to scale in the same way. This can sometimes be handled by reweighting the components of H, i.e., using iterations

Xt+1 = Xt + αt Dt H(Xt , ξ t+1 )

where Dt is a (typically diagonal) matrix. The previous theory can be applied to situations in which D may be random, provided it converges almost surely to a fixed matrix.

The ADAM algorithm provides such a construction (without the theoretical guarantees) in which Dt is computed using past iterations of the algorithm. It requires several parameters, namely: α, the algorithm gain, taken as constant (e.g., α = 0.001); two parameters β1 and β2 for moment estimates (e.g., β1 = 0.9 and β2 = 0.999); a small number ε (e.g., ε = 10−8 ) to avoid divisions by 0. In addition, ADAM defines two vectors, a mean m and a second moment v, both initialized at 0. The ADAM iterations are given below, in which g ⊗2 denotes the vector obtained by squaring each coefficient of a vector g.

Algorithm 3.1 (ADAM)


1. Let Xt be the current state, mt and vt the current mean and variance.
2. Generate ξ t+1 and let gt+1 = H(Xt , ξ t+1 ).
3. Update mt+1 = β1 mt + (1 − β1 )gt+1 .
4. Update vt+1 = β2 vt + (1 − β2 ) gt+1⊗2 .
5. Let m̂t+1 = mt+1 /(1 − β1t+1 ) and v̂t+1 = vt+1 /(1 − β2t+1 ).
6. Set

Xt+1 = Xt − α m̂t+1 /(√v̂t+1 + ε)

(with coefficient-wise square root and division).

Note that the iterations on mt and vt correspond to defining

m̂t = ((1 − β1 )/(1 − β1t )) ∑_{k=1}^{t} β1^{t−k} gk

and

v̂t = ((1 − β2 )/(1 − β2t )) ∑_{k=1}^{t} β2^{t−k} gk⊗2 .
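Algorithm 3.1 is straightforward to implement. The sketch below (ours, for illustration) follows the steps above, with the sampling of ξ t+1 folded into H, which here plays the role of a stochastic gradient of the function being minimized.

```python
import numpy as np

def adam(H, x0, T, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Algorithm 3.1 (ADAM): stochastic gradient steps rescaled
    coordinatewise by running moment estimates of g = H(X, xi)."""
    x = np.asarray(x0, dtype=float)
    m, v = np.zeros_like(x), np.zeros_like(x)
    for t in range(1, T + 1):
        g = H(x)                              # g_t = H(X_{t-1}, xi_t)
        m = beta1 * m + (1 - beta1) * g       # first-moment update (step 3)
        v = beta2 * v + (1 - beta2) * g**2    # second-moment update (step 4)
        m_hat = m / (1 - beta1**t)            # bias corrections (step 5)
        v_hat = v / (1 - beta2**t)
        x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)   # step 6
    return x
```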

3.4 Constrained optimization problems

3.4.1 Lagrange multipliers

A constrained optimization problem minimizes a function F over a closed subset Ω of Rd , with Ω ≠ Rd . This restriction invalidates, in a large part, the optimality conditions discussed in section 3.2. These conditions indeed apply to minimizers belonging to the interior of Ω, and therefore do not hold when they lie at its boundary, which is a very common situation in practice (Ω often has an empty interior).

In this section, which follows the discussion given in Wright and Recht [207], we
review conditions for optimality for constrained minimization of smooth functions,
in two cases. The first one, discussed in this section, is when Ω is defined by a finite
number of smooth constraints, leading, under some assumptions, to the Karush-
Kuhn-Tucker (or KKT) conditions. The second one, in the next section, specializes
to closed convex Ω.

KKT conditions

We introduce some notation. Let γi , for i ∈ C, be C 1 functions γi : Rd → R, where C is a finite set of indices. We assume that C is divided into two non-intersecting parts, C = E ∪ I , and consider minimization problems searching for

x∗ ∈ argmin F (3.27)

where
Ω = {x ∈ Rd : γi (x) = 0, i ∈ E and γi (x) ≤ 0, i ∈ I }. (3.28)

The set Ω of all x that satisfy the constraints is called the feasible set for the considered problem. We will always assume that it is non-empty. If x ∈ Ω, one defines the set A(x) of active constraints at x to be

A(x) = {i ∈ C : γi (x) = 0} .

One obviously has E ⊂ A(x) for x ∈ Ω.

To be valid, the KKT conditions require some additional assumptions on potential minimizers, called "constraint qualifications." An instance of such assumptions is provided by the next definition.
Definition 3.30 A point x ∈ Ω satisfies the Mangasarian-Fromovitz constraint qualifi-
cations (MF-CQ) if the following two conditions are satisfied.

(MF1) The vectors (∇γi (x), i ∈ E) are linearly independent.


(MF2) There exists a vector h ∈ Rd such that hT ∇γi (x) = 0 for all i ∈ E and hT ∇γi (x) < 0
for all i ∈ A(x) ∩ I .

A sufficient (and easier to check) condition for x to satisfy these constraint qualifications is that the vectors (∇γi (x), i ∈ A(x)) be linearly independent [36]. Indeed, if the latter "LI-CQ" condition is true, then for any prescribed values (ci , i ∈ A(x)) there exists a vector h such that hT ∇γi (x) = ci for all i ∈ A(x).

We introduce the Lagrangian

L(x, λ) = F(x) + ∑_{i∈C} λi γi (x) (3.29)

where the real numbers λi , i ∈ C are called Lagrange multipliers. The following the-
orem (stated without proof, see, e.g., [147, 34]) provides necessary conditions satis-
fied by solutions of the constrained minimization problem that satisfy the constraint
qualifications.
Theorem 3.31 Assume x∗ ∈ Ω is a solution of (3.27), and that x∗ satisfies the MF-CQ
conditions. Then there exist Lagrange multipliers λi , i ∈ C, such that
∂x L(x∗ , λ) = 0,  λi ≥ 0 if i ∈ I , with λi = 0 when i ∉ A(x∗ ) (3.30)

Conditions (3.30) are the KKT conditions for the constrained optimization problem. The second set of conditions is often called the complementary slackness conditions and states that λi = 0 for an inequality constraint unless this constraint is satisfied with an equality. The next section provides examples in which the MF-CQ conditions are not satisfied and Theorem 3.31 does not hold. However, these conditions are not needed in the special case when the constraints are affine.

Theorem 3.32 Assume that for all i ∈ A(x∗ ), the functions γi are affine, i.e., γi (x) = biT x + βi for some bi ∈ Rd and βi ∈ R. Then (3.30) holds at any solution of (3.27).

Remark 3.33 We have taken the convention to express the inequality constraints as
γi (x) ≤ 0, i ∈ I . With the reverse convention, i.e., γi (x) ≥ 0, i ∈ I , one generally
defines the Lagrangian as
L(x, λ) = F(x) − ∑_{i∈C} λi γi (x)

and the KKT conditions remain unchanged. 

Examples. Constraint qualifications are important to ensure the validity of the the-
orem. Consider a problem with equality constraints only, and replace it by

x∗ ∈ argmin F

subject to γ̃i (x) = 0, i ∈ E, with γ̃i = γi2 . We clearly did not change the problem.
However, the previous theorem applied to the Lagrangian
L(x, λ) = F(x) + ∑_{i∈E} λi γ̃i (x)

would require an optimal solution to satisfy ∇F(x) = 0, because ∇γ̃i (x) = 2γi (x)∇γi (x) =
0 for any feasible solution. Minimizers of constrained problems do not necessarily
satisfy ∇F(x) = 0, however. This is no contradiction with the theorem since ∇γ̃i (x) = 0
for all i shows that no feasible point satisfies the MF-CQ.

To take a more specific example, still with equality constraints, let d = 3, C = {1, 2}
with F(x, y, z) = x/2+y and γ1 (x, y, z) = x2 −y 2 , γ2 (x, y, z) = y −z2 . Note that γ1 = γ2 = 0
implies that y = |x|, so that, for a feasible point, F(x, y, z) = |x| + x/2 ≥ 0 and vanishes
only when x = y = 0, in which case z = 0 also. So (0, 0, 0) is a global minimizer.
We have dF(0) = (1/2, 1, 0), dγ1 (0) = (0, 0, 0) and dγ2 (0) = (0, 1, 0) so that 0 does not
satisfy the MF-CQ. The equation

dF(0) + λ1 dγ1 (0) + λ2 dγ2 (0) = 0

has no solution (λ1 , λ2 ), so that the conclusion of the theorem does not hold.

3.4.2 Convex constraints

We now consider the case in which Ω is a closed convex set. To specify the optimality
conditions in this case, we need the following definition.
72 CHAPTER 3. INTRODUCTION TO OPTIMIZATION

Definition 3.34 Let Ω ⊂ Rd be convex and let x ∈ Ω. The normal cone to Ω at x is the
set
NΩ (x) = {h ∈ Rd : hT (y − x) ≤ 0 for all y ∈ Ω} (3.31)

The normal cone is an example of a convex cone. (A convex subset Γ of Rd is called a convex cone if it is such that λx ∈ Γ for all x ∈ Γ and λ ≥ 0, a property obviously satisfied by NΩ (x).) It should also be clear from the definition that non-zero vectors in NΩ (x) always point outside Ω, i.e., x + h ∉ Ω if h ∈ NΩ (x), h ≠ 0. Here are some examples.

• If x is in the interior of Ω, then NΩ (x) = {0}.


• Assume that Ω is a half space, i.e., Ω = {x : bT x + β ≤ 0} with |b| = 1, and take
x ∈ ∂Ω, i.e., bT x + β = 0. Then

NΩ (x) = {h = µb : µ ≥ 0} .

Indeed, any element of Rd can be written as y = x + λb + q with qT b = 0, and y ∈ Ω if and only if λ ≤ 0. Fix such a y and take h ∈ Rd , decomposed as h = µb + r, with r T b = 0. We have hT (y − x) = λµ + r T q. Clearly, if µ < 0, or if r ≠ 0, one can find λ ≤ 0 and q ⊥ b such that hT (y − x) > 0. On the other hand, if µ ≥ 0 and r = 0, we have hT (y − x) ≤ 0 for all y ∈ Ω, which proves the above statement.
• With a similar argument, if Ω = {x : bT x + β = 0} is a hyperplane, one finds that

NΩ (x) = {h = λb : λ ∈ R} .

One can build normal cones to domains associated with multiple inequalities or
equalities based on the following theorem.
Theorem 3.35 Let Ω1 and Ω2 be two convex sets with relint(Ω1 ) ∩ relint(Ω2 ) ≠ ∅. Then, if x ∈ Ω1 ∩ Ω2 ,
NΩ1 ∩Ω2 (x) = NΩ1 (x) + NΩ2 (x)
Here, the addition is the standard sum between sets in a vector space:

A + B = {x + y : x ∈ A, y ∈ B}.

Finally, we note that, if x ∈ relint(Ω), then

NΩ (x) = {h ∈ Rd : hT (y − x) = 0, y ∈ Ω}. (3.32)

Indeed, if y ∈ Ω, then x + ε(y − x) ∈ Ω for small enough ε (positive or negative). For h ∈ NΩ (x), the condition εhT (y − x) ≤ 0 for small enough ε of both signs requires that hT (y − x) = 0.

With this definition in hand, we have the following theorem.

Theorem 3.36 Let F be a C 1 function and Ω a closed convex set. If x∗ ∈ argminΩ F, then

−∇F(x∗ ) ∈ NΩ (x∗ ). (3.33)

If F is convex and (3.33) holds, then x∗ ∈ argminΩ F.


Proof Assume that x∗ ∈ argminΩ F. If y ∈ Ω, then x∗ + t(y − x∗ ) ∈ Ω for all t ∈ [0, 1] and the function f (t) = F(x∗ + t(y − x∗ )) is C 1 on [0, 1], with a minimum at t = 0. This requires that ∂t f (0) = ∇F(x∗ )T (y − x∗ ) ≥ 0, because, if ∂t f (0) < 0, a Taylor expansion would show that f (t) < f (0) for small enough t > 0.

If F is convex and (3.33) holds, we have F(y) ≥ F(x∗ )+∇F(x∗ )T (y −x∗ ) by convexity,
so that
F(x∗ ) ≤ F(y) + (−∇F(x∗ ))T (y − x∗ ) ≤ F(y). 

3.4.3 Applications

Lagrange multipliers revisited. Consider Ω defined by (3.28), with the additional assumptions that γi (x) = biT x + βi for i ∈ E and γi is convex for i ∈ I , which ensure that Ω is convex. Define

Nγ0 (x) = { ∑_{i∈A(x)} λi ∇γi (x) : λi ≥ 0 for i ∈ A(x) ∩ I } .

Then, the KKT conditions in (3.30) can be rewritten as

−∇F(x∗ ) ∈ Nγ0 (x∗ ).

Note that one always has Nγ0 (x) ⊂ NΩ (x) since, for g = ∑_{i∈A(x)} λi ∇γi (x) ∈ Nγ0 (x), one has, for y ∈ Ω,

g T (y − x) = ∑_{i∈A(x)} λi ∇γi (x)T (y − x)
 = ∑_{i∈E} λi (biT y − biT x) + ∑_{i∈A(x)∩I} λi (γi (x) + ∇γi (x)T (y − x))
 = ∑_{i∈A(x)∩I} λi (γi (x) + ∇γi (x)T (y − x))
 ≤ ∑_{i∈A(x)∩I} λi γi (y) ≤ 0,

in which we have used the facts that biT x = biT y = −βi for x, y ∈ Ω, i ∈ E, γi (x) = 0 for i ∈ A(x), and the convexity of γi . Constraint qualifications such as those considered above are sufficient conditions that ensure the identity between the two sets.

Consider now the situation of theorem 3.32, and assume that all constraints are affine inequalities, γi (x) = biT x + βi ≤ 0, i ∈ I . Then, the statement NΩ (x) ⊂ Nγ0 (x) can be reexpressed as follows. Every h ∈ Rd such that

hT (y − x) ≤ 0

as soon as biT (y − x) ≤ 0 for all i ∈ A(x) must take the form

h = ∑_{i∈A(x)} λi bi

with λi ≥ 0. This property is called Farkas's lemma (see, e.g., [168]). Note that affine equalities biT x + βi = 0 can be included as two inequalities biT x + βi ≤ 0, −biT x − βi ≤ 0, which removes the sign constraint on the corresponding λi and therefore yields theorem 3.32.

Positive semi-definite matrices. We now take an example in which theorem 3.32 does not apply directly. Let Ω = Sn+ be the space of positive semidefinite n × n matrices, considered as a subset of the space Mn of n × n matrices, itself identified with Rn² . With this identification, the Euclidean inner product between two matrices can be expressed as (A, B) ↦ trace(AT B).

We have A ∈ Sn+ if and only if, for all u ∈ Rn , u T Au ≥ 0, which provides an infinity of linear inequality constraints on A. Elements of NSn+ (A) are matrices H ∈ Mn such that

trace(H T (B − A)) ≤ 0

for all B ∈ Sn+ , and we want to make this normal cone explicit. We first note that every square matrix H can be decomposed as the sum of a symmetric matrix, Hs , and of a skew-symmetric one, Ha (namely, Hs = (H + H T )/2 and Ha = (H − H T )/2). We have moreover trace(HaT (B − A)) = 0, so the condition only bears on the symmetric part of H.

For any u ∈ Rn , one can take B = A + uu T , which belongs to Sn+ , with trace(HsT (B − A)) = u T Hs u. This shows that, for H to belong to NSn+ (A), one needs −Hs ∈ Sn+ .

Now, take an eigenvector u of A with eigenvalue ρ > 0. Then B = A − αuu T is also in Sn+ as soon as 0 ≤ α ≤ ρ, and trace(HsT (B − A)) = −αu T Hs u. So, if H ∈ NSn+ (A), we have u T Hs u ≥ 0, and since −Hs ∈ Sn+ , this gives u T Hs u = 0. Still because Hs is negative semi-definite, this implies Hs u = 0. (This can be shown, for example, using Schwarz's inequality applied to −Hs , which says that (u T Hs v)² ≤ (u T Hs u)(v T Hs v) for all v ∈ Rn .) Decomposing A with respect to its non-zero eigenvectors, i.e., writing

A = ∑_{k=1}^{p} ρk uk ukT

where p = rank(A), we get AHs = Hs A = 0. We have therefore obtained the following proposition.

Proposition 3.37 Let A ∈ Sn+ . Then H ∈ Mn belongs to NSn+ (A) if and only if −Hs ∈ Sn+ and Hs A = 0, where Hs = (H + H T )/2.

Now, if one wants to minimize a function F over positive semidefinite matrices, and A∗ is a minimizer, we get the necessary condition that A∗ (∇F(A∗ ))s = 0 with (∇F(A∗ ))s positive semidefinite. These conditions are sufficient if F is convex.

For example, take

F(A) = (1/2) trace(A²) − trace(BA) (3.34)

with B ∈ Sn . Then (∇F(A))s = A − B and the condition is A(A − B) = 0 with A ⪰ B. If B is diagonalized in the form B = U T DU , with U orthogonal and D diagonal, then the solution is A∗ = U T D + U where D + is deduced from D by replacing its negative entries by zeros.
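For illustration, this minimizer is easy to compute numerically (the sketch below is ours; note that numpy's eigh returns the diagonalization in the form B = U diag(d) U^T , so the formula is written accordingly):

```python
import numpy as np

def proj_psd(B):
    """Minimizer of (3.34): diagonalize the symmetric matrix B and
    zero out its negative eigenvalues."""
    d, U = np.linalg.eigh(B)                 # B = U diag(d) U^T, U orthogonal
    return (U * np.maximum(d, 0.0)) @ U.T    # U diag(d_+) U^T

B = np.array([[1.0, 2.0], [2.0, -3.0]])
A_star = proj_psd(B)   # A*(A* - B) = 0 and A* - B is positive semidefinite
```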

Projection. Let Ω be closed convex, x0 ∈ Rd and F(x) = (1/2)|x − x0 |². We have

minΩ F = minΩ∩B̄(0,R) F

for large enough R (e.g., R ≥ |x0 | + |y − x0 | for some fixed y ∈ Ω), and since the latter minimization is over a compact set, argminΩ F is not empty. The function F being strongly convex, its minimizer over Ω is unique and called the projection of x0 on Ω, denoted projΩ (x0 ).

Since ∇F(x) = x − x0 , theorem 3.36 implies that projΩ (x0 ) is characterized by projΩ (x0 ) ∈ Ω and

x0 − projΩ (x0 ) ∈ NΩ (projΩ (x0 )) (3.35)

or

(x0 − projΩ (x0 ))T (y − projΩ (x0 )) ≤ 0 for all y ∈ Ω. (3.36)

If x0 ∉ Ω, then projΩ (x0 ) ∈ ∂Ω, since otherwise we would have NΩ (projΩ (x0 )) = {0} and x0 = projΩ (x0 ), a contradiction. Of course, if x0 ∈ Ω, then projΩ (x0 ) = x0 .

Here are some important examples.

1. Let Ω = z0 + V , where z0 ∈ Rd and V is a linear space (i.e., Ω is an affine subset of Rd ). Then NΩ (x) = V ⊥ for all x ∈ Ω, where V ⊥ is the vector space of vectors orthogonal to V , and projΩ (x0 ) is characterized by projΩ (x0 ) ∈ Ω and

(x0 − projΩ (x0 )) ∈ V ⊥

which is the usual characterization of the orthogonal projection on an affine space (compare to section 6.4).

2. If Ω = B̄(0, 1), the closed unit ball, then NΩ (x) = R+ x for x ∈ ∂Ω (i.e., |x| = 1). One can indeed note that, if h ≠ 0 is normal to Ω at x, then h/|h| ∈ Ω so that

hT ( h/|h| − x ) ≤ 0

which yields |h| ≤ hT x. The Cauchy-Schwarz inequality implying that hT x ≤ |h| |x| = |h|, we must have equality, hT x = |h| |x|, which is only possible when x and h are collinear.

Given x0 ∈ Rd with |x0 | ≥ 1, we see that projΩ (x0 ) must satisfy the conditions |projΩ (x0 )| = 1 (to be in ∂Ω) and x0 − projΩ (x0 ) = λ projΩ (x0 ) for some λ ≥ 0, which gives projΩ (x0 ) = x0 /|x0 |.
3. If Ω = Sn+ and B (taking the role of x0 ) is a symmetric matrix, then projΩ (B) was found in the previous section, and is given by A∗ = U T D + U where B = U T DU is a diagonalization of B.

The projection has the important property of being 1-Lipschitz.

Proposition 3.38 Let Ω be a closed convex subset of Rd . Then, for all x, y ∈ Rd

|projΩ (x) − projΩ (y)| ≤ |x − y|. (3.37)

Proof This proposition is a special case of proposition 3.55 below. 

3.4.4 Projected gradient descent

The projected gradient descent algorithm minimizes F over Ω by iterating

xt+1 = projΩ (xt − αt ∇F(xt )), (3.38)

which provides a feasible method when projΩ is easy to compute. An equivalent formulation is

xt+1 = argminΩ { F(xt ) + ∇F(xt )T (x − xt ) + (1/(2αt ))|x − xt |² }. (3.39)

To justify this last statement, it suffices to notice that the function in the r.h.s. can be written as

(1/(2αt ))|x − xt + αt ∇F(xt )|² − (αt /2)|∇F(xt )|² + F(xt )

and apply the definition of the projection.

The convergence properties of this algorithm will be discussed in section 3.5.5, in a more general context.
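A minimal sketch of iteration (3.38) (ours, for illustration), using the projection on the closed unit ball computed in the examples above:

```python
import numpy as np

def proj_ball(x):
    """Projection on the closed unit ball (example 2 above)."""
    n = np.linalg.norm(x)
    return x / n if n > 1.0 else x

def projected_gradient(grad_F, proj, x0, alpha, T=1000):
    """Iteration (3.38): gradient step, then projection on Omega."""
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        x = proj(x - alpha * grad_F(x))
    return x

# Minimize F(x) = |x - y|^2 / 2 over the unit ball: the solution is proj(y).
y = np.array([2.0, 2.0])
x_star = projected_gradient(lambda x: x - y, proj_ball, np.zeros(2), alpha=0.5)
```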

3.5 General convex problems

3.5.1 Epigraphs

Definition 3.39 Let F be a convex function. The epigraph of F is the set

epi(F) = { (x, a) ∈ Rd × R : F(x) ≤ a } . (3.40)

One says that F is closed if epi(F) is a closed subset of Rd × R, that is: if x = limn xn and a = limn an with F(xn ) ≤ an , then F(x) ≤ a.

Clearly, if (x, a) ∈ epi(F), then x ∈ dom(F). It should also be clear that epi(F) is always
convex when F is convex: If (x, a), (y, b) ∈ epi(F), then
F((1 − t)x + ty) ≤ (1 − t)F(x) + tF(y) ≤ (1 − t)a + tb
so that (1 − t)(x, a) + t(y, b) ∈ epi(F).

To illustrate the definition, consider a simple example. Let F be the function defined on R by F(x) = |x| if |x| < 1 and F(x) = +∞ otherwise. It is convex, but not closed, as can be seen by taking the sequence (1 − 1/n, 1) ∈ epi(F), with, at the limit, F(1) = +∞ > 1. In contrast, the function defined by F̃(x) = |x| if |x| ≤ 1 and F̃(x) = +∞ otherwise is convex and closed.

We have the following proposition.

Proposition 3.40 A convex function F is closed if and only if all its sub-level sets

Λa (F) = { x ∈ Rd : F(x) ≤ a }

are closed subsets of Rd .


Proof If F is closed, then Λa (F) × {a} is the intersection of the set Rd × {a}, which is obviously closed, and of epi(F). It is therefore a closed set, and so is Λa (F).

Conversely, assume that all Λa (F) are closed and take a sequence (xn , an ) in epi(F) that converges to (x, a). Then, fixing ε > 0, xn ∈ Λa+ε (F) for large enough n, and since this set is closed, F(x) ≤ a + ε. Since this is true for all ε > 0, we have F(x) ≤ a and (x, a) ∈ epi(F). □

Note that, if F is continuous, then it is closed, so that closedness generalizes continuity for convex functions, but it also applies to the non-smooth case.

If Ω is a convex subset of Rd , its indicator function σΩ (such that σΩ (x) = 0 for x ∈ Ω and σΩ (x) = +∞ otherwise) is closed if and only if Ω is a closed subset of Rd . This is obvious since Λa (σΩ ) = Ω if a ≥ 0 and ∅ otherwise.

3.5.2 Subgradients

Several machine learning problems involve convex functions that are not C 1 , requir-
ing a generalization of the notion of derivative provided by the following definition.
Definition 3.41 If F is a convex function and x ∈ dom(F), a vector g ∈ Rd such that

F(x) + g T (y − x) ≤ F(y) (3.41)

for all y ∈ Rd is called a subgradient of F at x.

The set of subgradients of F at x is denoted ∂F(x) and called the subdifferential of F at x.

If x ∈ int(dom(F)) and F is differentiable at x, (3.5) implies that ∇F(x) ∈ ∂F(x). In this case, it is the only element of ∂F(x).
Proposition 3.42 If F is differentiable at x ∈ int(dom(F)), then ∂F(x) = {∇F(x)}.

Proof We need to prove that there is no other subgradient. Assume that ∇F(x) exists and take y = x + εu in (3.41) (u ∈ Rd , ε ∈ R). One gets, for g ∈ ∂F(x),

εg T u ≤ F(x + εu) − F(x) = ε∇F(x)T u + o(ε).

Dividing by |ε| and letting ε → 0 gives (depending on the sign of ε)

g T u ≤ ∇F(x)T u and −g T u ≤ −∇F(x)T u .

This is only possible if g T u = ∇F(x)T u for all u ∈ Rd , which itself implies g = ∇F(x). □

The next theorem, which is an obvious consequence of definition 3.41, characterizes minimizers of convex functions in the general case.

Theorem 3.43 Let F : Rd → R be convex. Then x is a (global) minimizer of F if and only if 0 ∈ ∂F(x).

The following result shows that subgradients exist under generic conditions. Denote by −−→aff (dom(F)) the vector space parallel to aff(dom(F)). We note that g ∈ ∂F(x) if and only if its orthogonal projection on −−→aff (dom(F)) belongs to ∂F(x), because (3.41) is trivial if F(y) = +∞. So ∂F(x) cannot be bounded unless aff(dom(F)) = Rd ; it is the part of ∂F(x) that is included in −−→aff (dom(F)) that is of interest.

Proposition 3.44 For all x ∈ Rd , ∂F(x) is a closed convex set (possibly empty, in particular for x ∉ dom(F)). If x ∈ ridom(F), then ∂F(x) ≠ ∅ and ∂F(x) ∩ −−→aff (dom(F)) is compact.

Proof The convexity and closedness of ∂F(x) is clear from the definition. If x ∈ ridom(F), there exists ε > 0 such that x + εh ∈ ridom(F) for all h ∈ −−→aff (dom(F)) with |h| = 1. For all g ∈ ∂F(x) ∩ −−→aff (dom(F)), one has

|g| = max{ g T h : h ∈ −−→aff (dom(F)), |h| = 1 }
 ≤ max{ (F(x + εh) − F(x))/ε : h ∈ −−→aff (dom(F)), |h| = 1 }

and the upper bound is finite because it is the maximum of a continuous function over a compact set. This shows that ∂F(x) is bounded. We defer the proof that ∂F(x) ≠ ∅ to section 3.7. □

Subdifferentials are, under mild conditions, additive. More precisely, we have the following proposition.

Theorem 3.45 Let F1 and F2 be convex functions such that

ridom(F1 ) ∩ ridom(F2 ) ≠ ∅ .

Then, for all x ∈ Rd , ∂(F1 + F2 )(x) = ∂F1 (x) + ∂F2 (x).

Note that the inclusion

∂F1 (x) + ∂F2 (x) ⊂ ∂(F1 + F2 )(x)

always holds, as can be immediately checked by summing the inequalities satisfied by subgradients. The reverse inclusion requires the use of separation theorems for convex sets (see section 3.7).

Another important point is how the chain rule works with compositions with
affine functions.
Theorem 3.46 Let F be a convex function on Rd , A a d × m matrix and b ∈ Rd . Let G(x) = F(Ax + b), x ∈ Rm . Assume that there exists x0 ∈ Rm such that Ax0 + b ∈ ridom(F). Then, for all x ∈ Rm ,
∂G(x) = AT ∂F(Ax + b).

One direction is straightforward and does not require the condition on ridom(F). If g ∈ ∂F(Ax + b), then

F(z) − F(Ax + b) ≥ g T (z − Ax − b), z ∈ Rd ,

and applying this inequality to z = Ay + b for y ∈ Rm yields

G(y) − G(x) ≥ g T A(y − x)

so that AT g ∈ ∂G(x) and AT ∂F(Ax + b) ⊂ ∂G(x). The reverse inclusion is proved in section 3.7.

Subdifferentials can be seen as generalizations of normal cones.

Proposition 3.47 Assume that Ω is a closed convex subset of Rd . Then the indicator function σΩ has a non-empty subdifferential at every point of Ω, with

∂σΩ (x) = NΩ (x), x ∈ Ω.

Proof For x ∈ Ω, (3.41) reads

g T (y − x) ≤ σΩ (y)

for y ∈ Rd , but since σΩ (y) = +∞ outside of Ω, g ∈ ∂σΩ (x) is equivalent to

g T (y − x) ≤ 0

for y ∈ Ω, which is exactly the definition of the normal cone. □

Given this proposition, it is also clear (after noting that σΩ1 + σΩ2 = σΩ1 ∩Ω2 ) that
theorem 3.45 is a generalization of theorem 3.35.

3.5.3 Directional derivatives

From proposition 3.5, applied with y = x + h, we see that

t ↦ (1/t)(F(x + th) − F(x))

is increasing as a function of t. This property allows us to define directional derivatives of F at x.

Definition 3.48 Let F be convex and x ∈ dom(F). The directional derivative of F at x in the direction h ∈ Rd is defined by

dF(x, h) = lim_{t↓0} (1/t)(F(x + th) − F(x)), (3.42)

and belongs to [−∞, +∞].

Note that, still from proposition 3.5, one has, for all x ∈ dom(F) and y ∈ Rd :

F(y) ≥ F(x) + dF(x, y − x) (3.43)

We have the proposition:

Proposition 3.49 If F is convex, then x∗ ∈ argmin(F) if and only if dF(x∗ , h) ≥ 0 for all
h ∈ Rd .

Proof If dF(x∗ , h) ≥ 0, then F(x∗ +th)−F(x∗ ) ≥ 0 for all t > 0 and this being true for all
h implies that x∗ is a minimizer. Conversely, if x∗ is a minimizer, dF(x∗ , h) is a limit
of non-negative numbers and is therefore non-negative. 

Proposition 3.50 If F is convex and x ∈ dom(F), then dF(x, h) is positively homogeneous and subadditive (hence convex) as a function of h, namely

dF(x, λh) = λ dF(x, h), λ > 0,

and

dF(x, h1 + h2 ) ≤ dF(x, h1 ) + dF(x, h2 ).

Proof Positive homogeneity is straightforward and left to the reader. For the second property, we can write

F(x + th1 + th2 ) ≤ (1/2)(F(x + 2th1 ) + F(x + 2th2 ))

by convexity, so that

(1/t)(F(x + th1 + th2 ) − F(x)) ≤ (1/2)( (1/t)(F(x + 2th1 ) − F(x)) + (1/t)(F(x + 2th2 ) − F(x)) ).

Taking t ↓ 0,

dF(x, h1 + h2 ) ≤ (1/2)(dF(x, 2h1 ) + dF(x, 2h2 )) = dF(x, h1 ) + dF(x, h2 ). □

Proposition 3.51 If F is convex and x ∈ dom(F), then

dF(x, h) ≥ sup{g T h, g ∈ ∂F(x)}.

If x ∈ ridom(F), then
dF(x, h) = max{g T h, g ∈ ∂F(x)}.

Proof If g ∈ ∂F(x), then for all t > 0

F(x + th) − F(x) ≥ tg T h.

Dividing by t and passing to the limit yields dF(x, h) ≥ g T h.

We prove that the maximum is attained at some g ∈ ∂F(x) when x ∈ ridom(F). In this case, the domain of the convex function G : h̃ ↦ dF(x, h̃) is the vector space parallel to aff(dom(F)), namely

dom(G) = {h : x + h ∈ aff(dom(F))}.
82 CHAPTER 3. INTRODUCTION TO OPTIMIZATION

Indeed, for any h in this set, there exists  > 0 such that x + th ∈ dom(F) for 0 < t < 
and dF(x, h) ≤ (F(x + th) − F(x))/t < ∞. Conversely, if h ∈ dom(G), then F(x + th) − F(x)
must be finite for small enough t, so that x + th ∈ dom(F) and x + h ∈ aff(dom(F)).

As a consequence, for any h ∈ aff(dom(F)), there exists ĝ ∈ ∂G(h), which therefore


satisfies
dF(x, h̃) ≥ dF(x, h) + ĝ T (h̃ − h)
for any h̃ ∈ Rd (the upper bound is infinite if h̃ < dom(G)). Letting h̃ → 0, we get
dF(x, h) ≤ ĝ T h.

Also, by positive homogeneity, we have

tdF(x, h̃) ≥ dF(x, h) + ĝ T (t h̃ − h)

for all t > 0, which requires dF(x, h̃) ≥ ĝ T h̃ for all h̃, and in particular dF(x, h) = ĝ T h.

Since
F(x + h̃) − F(x) ≥ dF(x, h̃) ≥ ĝ T h̃
we see that ĝ ∈ ∂F(x), with ĝ T h = dF(x, h), which concludes the proof. 

The next proposition gives a criterion for a vector g to belong to ∂F(x) based on
directional derivatives.

Proposition 3.52 Assume that x ∈ dom(F) where F is convex. If g ∈ Rd is such that

dF(x, h) ≥ g T h

for all h ∈ Rd , then g ∈ ∂F(x).

Proof Just use the fact that dF(x, h) ≤ F(x + h) − F(x). 

3.5.4 Subgradient descent

When F is a non-differentiable convex function, the directions −g with g ∈ ∂F(x) do not always provide directions of descent. Indeed, g ∈ ∂F(x) implies

F(x − αg) ≥ F(x) − α|g|²,

but the inequality goes in the “wrong direction.” However, we know that, for any
h ∈ Rd , there exists gh ∈ ∂F(x) such that

dF(x, −h) = −ghT h ≥ −g T h



for all g ∈ ∂F(x). As a consequence, any non-vanishing solution of the equation h = gh will provide a direction of descent (namely −h). This suggests looking for h ∈ ∂F(x) such that h ≠ 0 and |h|² ≤ g T h for all g ∈ ∂F(x). Since g T h ≤ |g| |h|, this requires that |h| ≤ |g| for all g ∈ ∂F(x), i.e.,

h = argmin_{∂F(x)} (g ↦ |g|). (3.44)

Conversely, if h is the minimal-norm element of ∂F(x) (which is necessarily unique since the squared norm is strictly convex and ∂F(x) is convex and compact), then |h|² ≤ |h + t(g − h)|² for all g ∈ ∂F(x) and t ∈ [0, 1], and taking the difference yields

2thT (g − h) + t²|g − h|² ≥ 0.

The fact that this holds for all small t > 0 requires that hT (g − h) ≥ 0, as required. We have therefore proved that −h, with h defined by (3.44), is a descent direction for F at x (it is actually the steepest descent direction: see [207] for a proof), justifying the algorithm

xt+1 = xt − αt argmin_{∂F(xt )} (g ↦ |g|)

as subgradient descent iterations.

Example. Consider the minimization of

F(x) = ψ(x) + λ ∑_{i=1}^{d} |x(i) |

where ψ is a C 1 convex function on Rd . Let A(x) = {i : x(i) = 0}. Then

∂F(x) = { ∇ψ(x) + λ ∑_{i∉A(x)} sign(x(i) ) ei + λ ∑_{i∈A(x)} ρi ei : |ρi | ≤ 1, i ∈ A(x) }

where ei is the ith vector of the canonical basis of Rd .

For g = ∇ψ(x) + λ ∑_{i∉A(x)} sign(x(i) ) ei + λ ∑_{i∈A(x)} ρi ei , we have

|g|² = ∑_{i∉A(x)} (∂i ψ(x) + λ sign(x(i) ))² + ∑_{i∈A(x)} (∂i ψ(x) + λρi )² .

Define

s(t) = sign(t) min(|t|, 1).

Then the h satisfying (3.44) is given coordinate-wise by

h(i) = ∂i ψ(x) + λ sign(x(i) ) if i ∉ A(x),
h(i) = ∂i ψ(x) − λ s(∂i ψ(x)/λ) if i ∈ A(x).
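This coordinate-wise formula translates directly into code; the sketch below (ours, for illustration) returns the minimal-norm subgradient h, so that −h is a direction of descent whenever h ≠ 0:

```python
import numpy as np

def min_norm_subgradient(grad_psi, x, lam):
    """Minimal-norm element h of the subdifferential of
    F(x) = psi(x) + lam * sum_i |x_i|, following the formula above."""
    g = np.asarray(grad_psi(x), dtype=float)
    s = np.clip(g / lam, -1.0, 1.0)          # s(g_i / lam), coordinatewise
    h = np.where(x != 0, g + lam * np.sign(x), g - lam * s)
    return h   # -h is a direction of descent whenever h != 0
```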

In more complex situations, the extra minimization step at each iteration of the algorithm can be challenging computationally. The following subgradient method uses an averaging approach to minimize F without requiring finding subgradients with minimal norms. It simply defines

xt+1 = xt − αt gt , gt ∈ ∂F(xt )

and computes

x̄t = ( ∑_{j=1}^{t} αj xj ) / ( ∑_{j=1}^{t} αj ).

We refer to [207] for a proof of convergence of this method.

3.5.5 Proximal Methods

Proximal operator. We start with a few simple facts. Let F be a closed convex function and ψ be convex and differentiable, with dom(ψ) = Rd . Let G = F + ψ. Then G is a closed convex function. Indeed, consider the sub-level set Λa (G) = {x : G(x) ≤ a} and assume that xn → x with xn ∈ Λa (G). Then ψ(xn ) → ψ(x) by continuity, and for all ε > 0, we have, for large enough n, F(xn ) ≤ a − ψ(x) + ε. This inequality remains true at the limit because F is closed, yielding G(x) ≤ a + ε for all ε > 0, so that x ∈ Λa (G).

We have ridom(F) ∩ ridom(ψ) ≠ ∅, so that (by theorem 3.45 and proposition 3.42) ∂G(x) = ∇ψ(x) + ∂F(x). In particular, x∗ is a minimizer of G if and only if −∇ψ(x∗ ) ∈ ∂F(x∗ ).

If one assumes that ψ is strongly convex, so that there exist m and L such that

(m/2)|y − x|² ≤ ψ(y) − ψ(x) − ∇ψ(x)T (y − x) ≤ (L/2)|y − x|²

for all x, y ∈ Rd , then a minimizer of G exists and is unique. To see this, fix x0 ∈ ridom(F) and consider the closed convex set

Ω0 = ΛG(x0 ) (G) = {x : G(x) ≤ G(x0 )}.

Any minimizer of G must clearly belong to Ω0 . If x ∈ Ω0 , we have

F(x) + ψ(x0 ) + ∇ψ(x0 )T (x − x0 ) + (m/2)|x − x0 |² ≤ G(x) ≤ G(x0 ) .

Moreover, there exists (from proposition 3.44) an element g ∈ ∂F(x0 ) so that F(x) ≥ F(x0 ) + g T (x − x0 ) for all x ∈ Rd . We therefore get

F(x0 ) + ψ(x0 ) + (g + ∇ψ(x0 ))T (x − x0 ) + (m/2)|x − x0 |² ≤ G(x0 )

for all x ∈ Ω0 , which shows that Ω0 must be bounded and therefore compact. There
exists a minimizer x∗ of G on Ω0 , and therefore on all Rd . This minimizer is unique,
since the sum of a convex function and a strictly convex function is strictly convex.

In particular, for any closed convex F, we can apply the previous remarks to

G : v ↦ F(v) + (1/2)|x − v|²

where x ∈ Rd is fixed. The function ψ : v ↦ |v − x|²/2 is strongly convex (with L = m = 1) and G therefore has a unique minimizer v ∗ . This is summarized in the following definition.

Definition 3.53 Let F be a closed convex function. The proximal operator associated to F is the mapping proxF : Rd → dom(F) defined by

proxF (x) = argmin_{v∈Rd} ( F(v) + (1/2)|x − v|² ). (3.45)

From the previous discussion, we also deduce:

Proposition 3.54 Let F be a closed convex function and α > 0. We have x′ = proxαF (x) if and only if x ∈ x′ + α∂F(x′ ). In particular, x∗ is a minimizer of F if and only if x∗ = proxαF (x∗ ).

Let us take a few examples.

• Let F(x) = λ|x|, x ∈ Rd , for some λ > 0. Then F is differentiable everywhere except at x = 0 and dom(F) = Rd . We have ∂F(x) = {λx/|x|} for x ≠ 0. A vector g belongs to ∂F(0) if and only if

g T x ≤ λ|x|

for all x ∈ Rd , which is equivalent to |g| ≤ λ, so that ∂F(0) = B̄(0, λ).

We have x′ = proxF (x) if and only if either x′ ≠ 0 and x = x′ + λx′ /|x′ |, or x′ = 0 and |x| ≤ λ. For |x| > λ, the equation x = x′ + λx′ /|x′ | is solved by

x′ = ((|x| − λ)/|x|) x,

yielding

proxF (x) = ((|x| − λ)/|x|) x if |x| ≥ λ, and proxF (x) = 0 otherwise. (3.46)
• Let Ω be a closed convex set. Then proxσΩ = projΩ , the projection operator on Ω,
as directly deduced from the definition.
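Both examples are easy to implement; the following sketch (ours, for illustration) evaluates (3.46) and records the identity proxσΩ = projΩ :

```python
import numpy as np

def prox_norm(x, lam):
    """Proximal operator (3.46) of F(x) = lam * |x| (Euclidean norm):
    shrink x towards 0 by lam, or map it to 0 if |x| <= lam."""
    n = np.linalg.norm(x)
    return (1.0 - lam / n) * x if n > lam else np.zeros_like(x)

def prox_indicator(x, proj):
    """The prox of the indicator of a closed convex set is the projection."""
    return proj(x)
```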

The following proposition can then be compared to proposition 3.38.

Proposition 3.55 Let F be a closed convex function. Then proxF is 1-Lipschitz: for all x, y ∈ Rd ,

| proxF (x) − proxF (y)| ≤ |x − y|. (3.47)

Proof Let x′ = proxF (x) and y ′ = proxF (y). Then, there exist g ∈ ∂F(x′ ) and h ∈ ∂F(y ′ ) such that x = x′ + g and y = y ′ + h. Moreover, we have

F(y ′ ) − F(x′ ) ≥ g T (y ′ − x′ )
F(x′ ) − F(y ′ ) ≥ hT (x′ − y ′ )

from which we deduce g T (y ′ − x′ ) ≤ hT (y ′ − x′ ), or (h − g)T (y ′ − x′ ) ≥ 0. Expressing g, h in terms of x, x′ , y, y ′ , we get (y − x − y ′ + x′ )T (y ′ − x′ ) ≥ 0, or

|y ′ − x′ |² ≤ (y − x)T (y ′ − x′ ) ≤ |y − x| |y ′ − x′ |

which is only possible if |y ′ − x′ | ≤ |y − x|. □

If F is differentiable, then x′ = proxαF (x) satisfies

x′ = x − α∇F(x′ )

so that x ↦ proxαF (x) can be interpreted as an implicit version of the standard gradient step x ↦ x − α∇F(x). The iterations xt+1 = proxαt F (xt ) provide an algorithm that converges to a minimizer of F (this will be justified below). This algorithm is rarely practical, however, since the minimization required at each step is not necessarily much easier to perform than minimizing F itself. The proximal operator is especially useful when combined with splitting methods.

Proximal gradient descent. Assume that the objective function F takes the form

F(x) = G(x) + H(x) (3.48)

where G is C 1 on Rd and H is a closed convex function. We first note that

dF(x, h) = lim_{t↓0} (F(x + th) − F(x))/t

is well defined (even if G is not convex, because it is smooth), with

dF(x, h) = ∇G(x)T h + dH(x, h).



In particular, if x∗ is a minimizer of F, then dF(x∗ , h) ≥ 0 for all h, so that dH(x∗ , h) ≥ −∇G(x∗ )T h for all h. Using proposition 3.52, this shows that −∇G(x∗ ) ∈ ∂H(x∗ ), which is a necessary condition for optimality for F (and sufficient if G is convex).

Proximal gradient descent implements the algorithm

xt+1 = proxαt H (xt − αt ∇G(xt )). (3.49)

We note that a stationary point of this algorithm, i.e., a point x such that x = proxαt H (x − αt ∇G(x)), must be such that x − αt ∇G(x) ∈ x + αt ∂H(x), so that −∇G(x) ∈ ∂H(x). This shows that the property of being stationary does not depend on αt > 0, and is equivalent to the necessary optimality condition that was just discussed.
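As an illustration of (3.49) (our own sketch, not part of the original development), consider G(x) = (1/2)|Ax − b|² and H(x) = λ ∑i |x(i) |. The proximal operator of the ℓ1 penalty is the coordinate-wise soft-thresholding map, obtained by applying (3.46) coordinate by coordinate; the resulting algorithm is commonly known as ISTA.

```python
import numpy as np

def soft_threshold(x, tau):
    """Coordinatewise prox of tau * sum_i |x_i| ((3.46) applied in 1D)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def proximal_gradient(grad_G, prox_H, x0, alpha, T=500):
    """Iteration (3.49): gradient step on G, proximal step on alpha * H."""
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        x = prox_H(x - alpha * grad_G(x), alpha)
    return x

# Lasso-type example: G(x) = |Ax - b|^2 / 2, H(x) = lam * sum_i |x_i|.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5)); b = rng.normal(size=20); lam = 0.5
L = np.linalg.norm(A, 2) ** 2                 # grad G is L-Lipschitz
x_hat = proximal_gradient(lambda x: A.T @ (A @ x - b),
                          lambda z, a: soft_threshold(z, a * lam),
                          np.zeros(5), alpha=1.0 / L)
```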

We first study this algorithm under the assumption that G is L-C 1 , which implies that, for all x, y ∈ Rd ,

G(y) ≤ G(x) + ∇G(x)T (y − x) + (L/2)|x − y|² .
At iteration t, we have

xt − αt ∇G(xt ) ∈ xt+1 + αt ∂H(xt+1 )

which implies, in particular

αt H(xt )−αt H(xt+1 ) ≥ (xt −xt+1 )T (xt −xt+1 −αt ∇G(xt )) = |xt −xt+1 |2 +αt ∇G(xt )T (xt+1 −xt )

Dividing by αt and adding G(xt ) − G(xt+1 ), we get

F(xt ) − F(xt+1 ) ≥ (1/αt )|xt − xt+1 |² + G(xt ) + ∇G(xt )T (xt+1 − xt ) − G(xt+1 )
 ≥ (1/αt − L/2) |xt − xt+1 |² (3.50)

so that proximal gradient descent iterations reduce the objective function as soon as αt ≤ 2/L.

Assuming that αt < 2/L, (3.50) can be rewritten as

|xt+1 − xt |²/αt² ≤ (2/(αt (2 − αt L))) (F(xt ) − F(xt+1 )).

This inequality should be compared to (3.13) in the unconstrained case. It yields, in particular, the inequality

min{ |xt+1 − xt |²/αt² : t < T } ≤ 2(F(x0 ) − min F) / (T min{αt (2 − αt L) : t < T }). (3.51)

As a consequence, if one runs proximal gradient descent until |xt+1 − xt |/αt is small
enough, the algorithm will terminate in finite time as soon as αt is bounded from
below (and, in particular, if αt is constant).

If we assume that G is convex, in addition to being L-C 1 , then we have a stronger result. Let x∗ be a minimizer of F. Then, using again xt − αt ∇G(xt ) ∈ xt+1 + αt ∂H(xt+1 ), we have

αt H(x∗ ) − αt H(xt+1 ) ≥ (x∗ − xt+1 )T (xt − xt+1 − αt ∇G(xt ))

and

αt F(x∗ ) − αt F(xt+1 ) ≥ (x∗ − xt+1 )T (xt − xt+1 ) − αt (x∗ − xt+1 )T ∇G(xt ) + αt G(x∗ ) − αt G(xt+1 )
 ≥ (x∗ − xt+1 )T (xt − xt+1 ) − αt (x∗ − xt )T ∇G(xt ) + αt G(x∗ ) + αt (xt+1 − xt )T ∇G(xt ) − αt G(xt+1 )
 ≥ (x∗ − xt+1 )T (xt − xt+1 ) − (αt L/2)|xt − xt+1 |² .
Assuming that αt L ≤ 1, then

αt F(x∗ ) − αt F(xt+1 ) ≥ (x∗ − xt+1 )T (xt − xt+1 ) − (1/2)|xt − xt+1 |² = (1/2)(|xt+1 − x∗ |² − |xt − x∗ |²),

which we rewrite as

αt (F(xt+1 ) − F(x∗ )) ≤ (1/2)(|xt − x∗ |² − |xt+1 − x∗ |²).

Note that, from (3.50), we also have

F(xt+1 ) ≤ F(xt ) − (1/(2αt ))|xt − xt+1 |²
when αt L ≤ 1, which shows, in particular, that F(xt ) is non-increasing. Fixing a time T , we have, from these two observations,

αt (F(xT ) − F(x∗ )) ≤ (1/2)(|xt − x∗ |² − |xt+1 − x∗ |²)

for all t ≤ T − 1, and summing over t,

(F(xT ) − F(x∗ )) ∑_{t=0}^{T−1} αt ≤ (1/2)(|x0 − x∗ |² − |xT − x∗ |²)

yielding

F(xT ) − F(x∗ ) ≤ |x0 − x∗ |² / (2 ∑_{t=0}^{T−1} αt ). (3.52)

We summarize this in the following theorem, specializing to the case of a constant step αt .
step αt .

Theorem 3.56 Let G be a convex L-C 1 function defined on Rd and H be closed convex. Assume that F = G + H has a minimizer x∗ . Then the algorithm (3.49) run with αt = α ≤ 1/L for all t is such that, for all T > 0,

F(xT ) − F(x∗ ) ≤ |x0 − x∗ |² / (2αT ). (3.53)

Also, when G = 0 and F = H, we retrieve the proximal iteration algorithm

xt+1 = proxαF (xt ), (3.54)

and we have just proved that it converges for any α > 0 as soon as F is a closed convex function.

One gets a stronger result under the assumption that G is C 2 and such that the eigenvalues of ∇²G(x) are included in a fixed interval [m, L] for all x ∈ Rd , with m > 0. Such a G is strongly convex, which implies that F has a unique minimizer. We have

|xt+1 − x∗ | = | proxαt H (xt − αt ∇G(xt )) − proxαt H (x∗ − αt ∇G(x∗ ))|
 ≤ |xt − x∗ − αt (∇G(xt ) − ∇G(x∗ ))| .

Write

|xt − x∗ − αt (∇G(xt ) − ∇G(x∗ ))| = | ∫_0^1 (IdRd − αt ∇²G(x∗ + s(xt − x∗ )))(xt − x∗ ) ds |
 ≤ ∫_0^1 |(IdRd − αt ∇²G(x∗ + s(xt − x∗ )))(xt − x∗ )| ds
 ≤ max(|1 − αt m|, |1 − αt L|) |xt − x∗ |

where we have used the fact that the eigenvalues of IdRd − αt ∇²G(x) are included in [1 − αt L, 1 − αt m] for all x ∈ Rd . If one assumes that αt ≤ 1/L, so that max(|1 − αt m|, |1 − αt L|) ≤ 1 − αt m, one gets

|xt+1 − x∗ | ≤ (1 − αt m)|xt − x∗ | .

Iterating this inequality, we get the theorem, which we state for constant αt .
Iterating this inequality, we get the theorem that we state for constant αt .
Theorem 3.57 Let F = G + H where G is a C 2 convex function and H is a closed convex function. Assume that the eigenvalues of ∇²G are uniformly included in [m, L] with m > 0. Let x∗ ∈ argmin F.

Let (xt ) satisfy (3.49) with constant αt = α ≤ 1/L. Then

|xt − x∗ | ≤ (1 − αm)^t |x0 − x∗ |.

Note that these results also apply to projected gradient descent (section 3.4.4), which is a special case (taking H = σΩ, whose proximal operator is the projection on Ω).
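
To make the iteration concrete, here is a minimal sketch (our illustration, not from the text) of the proximal gradient scheme (3.49) applied to the lasso objective F(x) = ½|Ax − b|² + µ|x|₁, for which G(x) = ½|Ax − b|² is L-C¹ with L equal to the squared spectral norm of A, and the proximal operator of tµ|·|₁ is soft-thresholding. The data and parameter values below are illustrative.

```python
import numpy as np

def prox_l1(v, t):
    # Proximal operator of t*|.|_1 (soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient_lasso(A, b, mu, T=500):
    # Minimize F(x) = 0.5*|Ax - b|^2 + mu*|x|_1 via the iteration (3.49):
    # x_{t+1} = prox_{alpha*H}(x_t - alpha*grad G(x_t)).
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad G
    alpha = 1.0 / L                      # constant step, as in theorem 3.56
    x = np.zeros(A.shape[1])
    for _ in range(T):
        grad = A.T @ (A @ x - b)
        x = prox_l1(x - alpha * grad, alpha * mu)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [2.0, -1.0, 0.5]
b = A @ x_true + 0.1 * rng.standard_normal(50)
print(proximal_gradient_lasso(A, b, mu=1.0)[:5])
```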
3.6 Duality

3.6.1 Generalized KKT conditions

A constrained convex minimization problem consists in the minimization of a closed convex function F over a closed convex set Ω ⊂ ridom(F). We have seen in theorem 3.36 that, for smooth F, any solution x∗ of this problem has to satisfy −∇F(x∗) ∈ NΩ(x∗), where
\[
N_\Omega(x) = \{h : h^T(y - x) \le 0 \text{ for all } y \in \Omega\}.
\]
The next theorem generalizes this property to the non-smooth convex case, for which the necessary optimality condition is also sufficient.

Theorem 3.58 Let F be a closed convex function and Ω ⊂ ridom(F) a nonempty closed convex set. Then x∗ ∈ argmin_Ω F if and only if
\[
0 \in \partial F(x^*) + N_\Omega(x^*).
\]

Proof Introduce the indicator function σΩ. Then minimizing F over Ω is the same as minimizing G = F + σΩ over R^d. The assumptions imply that ridom(σΩ) = relint(Ω) ⊂ ridom(F) and therefore
\[
\partial G(x) = \partial F(x) + \partial \sigma_\Omega(x)
\]
for all x ∈ Ω. Since ∂σΩ(x) = NΩ(x), the result follows from the characterization of minimizers of convex functions. □
In the following, we will restrict to the situation in which F is finite (i.e., dom(F) = R^d) and Ω is defined through a finite number of equalities and inequalities, taking the form
\[
\Omega = \{x \in \mathbb{R}^d : \gamma_i(x) = 0,\ i \in E \text{ and } \gamma_i(x) \le 0,\ i \in I\}
\]
for functions (γi, i ∈ C = E ∪ I) such that γi : x ↦ b_i^T x + β_i is affine for all i ∈ E and γi is closed convex for all i ∈ I. This is similar to the situation considered in section 3.4.1, with additional convexity assumptions, but without assuming smoothness. We recall the definition of active constraints from section 3.4.1, namely, for x ∈ Ω,
\[
A(x) = \{i \in C : \gamma_i(x) = 0\}.
\]
Following the discussion in the smooth case, define the set N⁰γ(x) ⊂ R^d by
\[
N^0_\gamma(x) = \Big\{\sum_{i \in A(x)} \lambda_i s_i : s_i \in \partial\gamma_i(x),\ i \in A(x),\ \lambda_i \ge 0 \text{ for } i \in A(x) \cap I\Big\}.
\]
The property 0 ∈ ∂F(x∗ ) + Nγ0 (x∗ ) is the expression of the KKT conditions in the non-
smooth case. It holds for x∗ ∈ argminΩ F as soon as NΩ (x∗ ) = Nγ0 (x∗ ), which is true
under appropriate constraint qualifications. We here replace the MF-CQ in defini-
tion 3.30 by the following conditions that do not involve gradients.
Definition 3.59 Let (γi, i ∈ C = E ∪ I) be a set of equality and inequality constraints, with γi : x ↦ b_i^T x + β_i, i ∈ E, and γi closed convex, i ∈ I. One says that these constraints satisfy the Slater constraint qualifications (Sl-CQ) if and only if:

(Sl 1) The vectors (b_i, i ∈ E) are linearly independent.

(Sl 2) There exists x ∈ R^d such that γi(x) = 0 for i ∈ E and γi(x) < 0 for i ∈ I.
The first condition is very mild. When it is not satisfied, some of the b_i's are linear combinations of the others, and the equality constraints for the latter imply the equality constraints for the former. These redundancies can therefore be removed without changing the problem.
Note that (Sl 2) can be replaced by the apparently weaker condition that, for all i ∈ I, there exists x_i ∈ R^d satisfying all the constraints and γi(x_i) < 0. Indeed, if this is true, then the average, x̄, of (x_i, i ∈ I) also satisfies the equality constraints by linearity, and if i ∈ I,
\[
\gamma_i(\bar x) \le \frac{1}{|I|}\sum_{j \in I}\gamma_i(x_j) \le \frac{1}{|I|}\gamma_i(x_i) < 0.
\]
The following proposition makes a connection between the Slater conditions and the MF-CQ in definition 3.30.

Proposition 3.60 Assume that γi, i ∈ I, are convex C¹ functions. Then, if there exists a feasible point x∗ that satisfies the MF-CQ, there exists another point x satisfying the Sl-CQ. Conversely, if there exists x satisfying the Sl-CQ, then every feasible point x∗ satisfies the MF-CQ.

Proof The linear independence conditions on equality constraints are the same in MF-CQ and Sl-CQ, so we only need to consider inequality constraints.

Let x∗ satisfy MF-CQ, and take h ≠ 0 such that b_i^T h = 0 for all i ∈ E, and ∇γi(x∗)^T h < 0 for i ∈ A(x∗) ∩ I. Then x∗ + th satisfies the equality constraints for all t ∈ R. If i ∈ I is not active, then γi(x∗) < 0, and this remains true at x∗ + th for small t by continuity. If i ∈ A(x∗) ∩ I, then a first-order expansion gives γi(x∗ + th) = t∇γi(x∗)^T h + o(t), which is also negative for small enough t > 0. So, x∗ + th satisfies the Sl-CQ for small enough t > 0.
Conversely, let x satisfy the Sl-CQ and take a feasible point x∗. If x∗ = x, then there is no active inequality constraint and x∗ satisfies MF-CQ. Assume x∗ ≠ x and let h = x − x∗. Then b_i^T h = 0 for all i ∈ E, and if i ∈ I ∩ A(x∗),
\[
0 > \gamma_i(x) = \gamma_i(x^* + h) \ge \gamma_i(x^*) + \nabla\gamma_i(x^*)^T h = \nabla\gamma_i(x^*)^T h,
\]
so that x∗ satisfies MF-CQ. □
The following theorem, which we give without proof, states that the Slater conditions imply that the KKT conditions are satisfied at a minimizer.

Theorem 3.61 Assume that all the constraints are affine, or that they satisfy the Sl-CQ in definition 3.59. Let x∗ ∈ argmin_Ω F. Then NΩ(x∗) = N⁰γ(x∗), so that there exist s₀ ∈ ∂F(x∗), s_i ∈ ∂γi(x∗), i ∈ A(x∗), and (λi, i ∈ A(x∗)) with λi ≥ 0 if i ∈ I ∩ A(x∗), such that
\[
s_0 + \sum_{i \in A(x^*)} \lambda_i s_i = 0. \tag{3.55}
\]
3.6.2 Dual problem

Consider the Lagrangian
\[
L(x, \lambda) = F(x) + \sum_{i \in C} \lambda_i \gamma_i(x)
\]
defined in (3.29) and let D = {λ : λi ≥ 0, i ∈ I}. Because the functions γi are non-positive on Ω, we have
\[
L(x, \lambda) \le F(x)
\]
for all x ∈ Ω and λ ∈ D, which implies that
\[
L^*(\lambda) = \inf\{L(x, \lambda) : x \in \mathbb{R}^d\}
\]
is such that L∗(λ) ≤ F(x) for all λ ∈ D and x ∈ Ω. Define
\[
\hat d = \sup\{L^*(\lambda) : \lambda \in D\}
\]
and
\[
\hat p = \inf\{F(x) : x \in \Omega\},
\]
whose computations respectively represent the dual and primal problems. Then, we have d̂ ≤ p̂.
We did not need much of our assumptions (not even the convexity of F) to reach this conclusion. When the converse inequality is true (so that the duality gap p̂ − d̂ vanishes), the dual problem provides important insights into the primal problem, as well as alternative ways to solve it. This is true under the Slater conditions.
Theorem 3.62 The duality gap vanishes when the constraints are all affine, or when they
satisfy the Sl-CQ in definition 3.59. In this case, any solution λ∗ of the dual problem
provides Lagrange multipliers in theorem 3.61 and conversely.
We justify this statement as a consequence of theorem 3.61 and the following analysis. The Lagrangian L(x, λ) is linear in λ and, when λ ∈ D, is a convex function of x. Moreover, one can use subdifferential calculus (theorem 3.45) to conclude that, for any λ ∈ D, (3.55) expresses the fact that 0 ∈ ∂_x L(x∗, λ), i.e., that x∗ ∈ argmin_{R^d} L(·, λ).

Fixing x ∈ R^d, one can also consider the maximization of L in λ ∈ D. Clearly, if x ∉ Ω, so that γi(x) ≠ 0 for some i ∈ E or γi(x) > 0 for some i ∈ I, then sup_D L(x, λ) = +∞. If x ∈ Ω, then the slackness conditions, which require λiγi(x) = 0 for i ∈ I, ensure that λ ∈ argmax_D L(x, ·).
As a consequence, any pair x∗ ∈ Ω, λ∗ ∈ D satisfying the KKT conditions is such that
\[
L(x^*, \lambda) \le L(x^*, \lambda^*) \le L(x, \lambda^*) \tag{3.56}
\]
for all x ∈ R^d and λ ∈ D. Such a pair (x∗, λ∗) is called a saddle point of the function L. Conversely, any saddle point of L, i.e., any (x∗, λ∗) ∈ R^d × D satisfying (3.56), must be such that x∗ ∈ Ω (to ensure that L(x∗, ·) is bounded), and satisfies the KKT conditions.

We therefore obtain the equivalence of the two properties, for (x∗, λ∗) ∈ R^d × D:

(i) x∗ ∈ Ω and (x∗, λ∗) satisfies the KKT conditions.

(ii) Equation (3.56) holds for all (x, λ) ∈ R^d × D.
Consider now the additional condition that

(iii) x∗ ∈ argmin_Ω F and λ∗ ∈ argmax_D L∗.

We already know that, if (x∗, λ∗) satisfies the KKT conditions, then x∗ ∈ argmin_Ω F (because N⁰γ(x∗) ⊂ NΩ(x∗)). Moreover, if (3.56) holds, then the inequality L(x∗, λ) ≤ L(x∗, λ∗) implies that L∗(λ) ≤ L(x∗, λ∗) for all λ ∈ D, and the inequality L(x∗, λ∗) ≤ L(x, λ∗) for all x implies that L(x∗, λ∗) ≤ L∗(λ∗). We therefore obtain the fact that λ∗ ∈ argmax_D L∗. To summarize, we have

(i) ⇔ (ii) ⇒ (iii).
To obtain the final equivalence, we need to assume constraint qualifications, such as Slater's conditions, to ensure that N⁰γ(x∗) = NΩ(x∗). If this holds, then (iii) implies (via theorem 3.61) that there exists λ̃ such that (i) and (ii) are satisfied for (x∗, λ̃), with L(x∗, λ̃) = L∗(λ̃) and λ̃ ∈ argmax_D L∗. This shows that L∗(λ̃) = L∗(λ∗). Moreover, from (3.56), we have
\[
L(x^*, \lambda^*) \le L(x^*, \tilde\lambda) = L^*(\tilde\lambda),
\]
and, by definition of L∗, L(x∗, λ∗) ≥ L∗(λ∗). This shows that L(x∗, λ∗) = L(x∗, λ̃). As a consequence, for all (x, λ) ∈ R^d × D,
\[
L(x^*, \lambda) \le L(x^*, \tilde\lambda) = L(x^*, \lambda^*) = L^*(\lambda^*) = \inf_{\mathbb{R}^d} L(\cdot, \lambda^*) \le L(x, \lambda^*),
\]
so that (x∗, λ∗) satisfies (ii).
3.6.3 Example: Quadratic programming

Quadratic programming problems minimize F(x) = ½x^TAx − b^Tx, where A is a positive semidefinite matrix and b ∈ R^d, subject to affine constraints c_i^Tx − d_i = 0, i ∈ E, and c_i^Tx − d_i ≤ 0, i ∈ I.

We here consider the following objective function. Introduce variables x ∈ R^d, x₀ ∈ R and ξ ∈ R^N and minimize, for a fixed parameter γ,
\[
F(x, x_0, \xi) = \frac12 |x|^2 + \gamma \sum_{k=1}^N \xi^{(k)}
\]
subject to the constraints, for k = 1, …, N: ξ^{(k)} ≥ 0 and
\[
b_k(x_0 + x^T a_k) + \xi^{(k)} \ge 1,
\]
where b_k ∈ {−1, 1} and a_k ∈ R^d respectively denote the kth output and input training samples. This problem minimizes a quadratic function of the variables (x, x₀, ξ) subject to linear constraints, and is an instance of a quadratic programming problem (it is actually the support vector machine problem for classification, which will be described in section 8.4.1).
Introduce Lagrange multipliers ηk for the constraints ξ^{(k)} ≥ 0 and αk for b_k(x₀ + x^Ta_k) + ξ^{(k)} ≥ 1. The Lagrangian then takes the form
\begin{align*}
L(x, x_0, \xi, \alpha, \eta) &= \frac12|x|^2 + \gamma\sum_{k=1}^N \xi^{(k)} - \sum_{k=1}^N \eta_k \xi^{(k)} - \sum_{k=1}^N \alpha_k\big(b_k(x_0 + x^T a_k) + \xi^{(k)} - 1\big) \\
&= \frac12|x|^2 + \sum_{k=1}^N (\gamma - \eta_k - \alpha_k)\xi^{(k)} - x_0 \sum_{k=1}^N \alpha_k b_k - x^T \sum_{k=1}^N \alpha_k b_k a_k + \sum_{k=1}^N \alpha_k.
\end{align*}
We compute the dual Lagrangian L∗ by minimizing with respect to the primal variables. We note that L∗(α, η) = −∞ when Σ_{k=1}^N αk bk ≠ 0, so that Σ_{k=1}^N αk bk = 0 is a constraint for the dual problem. The minimization in ξ^{(k)} also gives −∞ unless γ − ηk − αk = 0, which is therefore another constraint. Finally, the optimal value of x is
\[
x = \sum_{k=1}^N \alpha_k b_k a_k
\]
and we obtain the expression of the dual problem, which maximizes
\[
-\frac12 \sum_{k,l=1}^N \alpha_k \alpha_l b_k b_l\, a_k^T a_l + \sum_{k=1}^N \alpha_k
\]
subject to ηk, αk ≥ 0, γ − ηk − αk = 0 and Σ_{k=1}^N αk bk = 0. The conditions on ηk and αk can be rewritten as 0 ≤ αk ≤ γ, ηk = γ − αk, and since the rest of the problem does not depend on η, the dual problem can be reduced to maximizing
\[
L^*(\alpha) = -\frac12 \sum_{k,l=1}^N \alpha_k \alpha_l b_k b_l\, a_k^T a_l + \sum_{k=1}^N \alpha_k
\]
subject to 0 ≤ αk ≤ γ and Σ_{k=1}^N αk bk = 0.
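
The following sketch (our illustration, not part of the text) solves this dual numerically with an off-the-shelf solver; the choice of scipy's SLSQP method and the toy data are assumptions of the example, not prescriptions of the book.

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(a, b, gamma):
    # Maximize L*(alpha) = sum(alpha) - 0.5 * sum_kl alpha_k alpha_l b_k b_l a_k^T a_l
    # subject to 0 <= alpha_k <= gamma and sum_k alpha_k b_k = 0,
    # by minimizing -L* with a generic constrained solver.
    N = len(b)
    Q = (b[:, None] * b[None, :]) * (a @ a.T)     # Q_kl = b_k b_l a_k^T a_l
    fun = lambda al: 0.5 * al @ Q @ al - al.sum()
    jac = lambda al: Q @ al - np.ones(N)
    cons = [{"type": "eq", "fun": lambda al: al @ b, "jac": lambda al: b}]
    res = minimize(fun, np.zeros(N), jac=jac, bounds=[(0, gamma)] * N,
                   constraints=cons, method="SLSQP")
    alpha = res.x
    x = (alpha * b) @ a                           # optimal x = sum_k alpha_k b_k a_k
    return alpha, x

rng = np.random.default_rng(0)
a = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(1, 1, (20, 2))])
b = np.array([-1.0] * 20 + [1.0] * 20)
alpha, x = svm_dual(a, b, gamma=1.0)
print(x)   # normal vector of the separating hyperplane
```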
3.6.4 Proximal iterations and augmented Lagrangian

The concave function L∗ can be maximized by minimizing −L∗ using proximal iterations (3.54):
\[
\lambda(t+1) = \mathrm{prox}_{-\alpha_t L^*}(\lambda(t)) = \operatorname{argmax}_{\lambda \in D}\Big(L^*(\lambda) - \frac{1}{2\alpha_t}|\lambda - \lambda(t)|^2\Big).
\]
Introduce the function
\[
\varphi(x, \lambda) = F(x) + \sum_{i \in C} \lambda_i \gamma_i(x) - \frac{1}{2\alpha_t}|\lambda - \lambda(t)|^2,
\]
so that
\[
\lambda(t+1) = \operatorname{argmax}_{\mu \in D} \inf_{x \in \mathbb{R}^d} \varphi(x, \mu).
\]
The function φ is convex in x and strongly concave in µ. Results in "minimax theory" [27] imply that one has the equality
\[
\max_{\mu \in D} \inf_{x \in \mathbb{R}^d} \varphi(x, \mu) = \inf_{x \in \mathbb{R}^d} \sup_{\mu \in D} \varphi(x, \mu). \tag{3.57}
\]
(Note that the left-hand side of this equation is never larger than the right-hand side, but their equality requires additional hypotheses—which are satisfied in our context—in order to hold.)
Importantly, the maximization in µ in the right-hand side has a closed-form solution. It requires maximizing
\[
\sum_{i \in C}\Big(\mu_i \gamma_i(x) - \frac{1}{2\alpha_t}(\mu_i - \lambda_i(t))^2\Big)
\]
subject to µi ≥ 0 for i ∈ I, and each µi can be computed separately. For i ∈ E, there is no constraint on µi, and one finds
\[
\mu_i = \lambda_i(t) + \alpha_t \gamma_i(x),
\]
with
\[
\mu_i \gamma_i(x) - \frac{1}{2\alpha_t}(\mu_i - \lambda_i(t))^2 = \lambda_i(t)\gamma_i(x) + \frac{\alpha_t}{2}\gamma_i(x)^2 = \frac{1}{2\alpha_t}\big(\lambda_i(t) + \alpha_t\gamma_i(x)\big)^2 - \frac{\lambda_i(t)^2}{2\alpha_t}.
\]
For i ∈ I, the solution is
\[
\mu_i = \max(0, \lambda_i(t) + \alpha_t\gamma_i(x))
\]
and one can check that, in this case,
\[
\mu_i \gamma_i(x) - \frac{1}{2\alpha_t}(\mu_i - \lambda_i(t))^2 = \frac{1}{2\alpha_t}\max\big(0, \lambda_i(t) + \alpha_t\gamma_i(x)\big)^2 - \frac{\lambda_i(t)^2}{2\alpha_t}.
\]
As a consequence, the right-hand side of (3.57) requires minimizing
\[
G(x) = F(x) + \frac{1}{2\alpha_t}\sum_{i \in E}\big(\lambda_i(t) + \alpha_t\gamma_i(x)\big)^2 + \frac{1}{2\alpha_t}\sum_{i \in I}\max\big(0, \lambda_i(t) + \alpha_t\gamma_i(x)\big)^2 - \frac{1}{2\alpha_t}\sum_{i \in C}\lambda_i(t)^2.
\]
If we assume that the sub-level sets {x ∈ Ω : F(x) ≤ ρ} are bounded (or empty) for any ρ ∈ R, then so are the sets {x ∈ R^d : G(x) ≤ ρ}, and this is a sufficient condition for the existence of a saddle point for φ, i.e., a pair (x∗, λ∗) such that, for all (x, λ) ∈ R^d × D,
\[
\varphi(x^*, \lambda) \le \varphi(x^*, \lambda^*) \le \varphi(x, \lambda^*).
\]
One can then check that this implies that x∗ ∈ argmin_{R^d} G while λ∗ = λ(t + 1), so that the latter can be computed as follows:
\[
\begin{cases}
\displaystyle x(t) = \operatorname{argmin}_{x \in \mathbb{R}^d}\Big\{F(x) + \frac{1}{2\alpha_t}\sum_{i \in E}\big(\lambda_i(t) + \alpha_t\gamma_i(x)\big)^2 + \frac{1}{2\alpha_t}\sum_{i \in I}\max\big(0, \lambda_i(t) + \alpha_t\gamma_i(x)\big)^2\Big\} \\[2mm]
\lambda_i(t+1) = \lambda_i(t) + \alpha_t\gamma_i(x(t)), \quad i \in E \\[1mm]
\lambda_i(t+1) = \max\big(0, \lambda_i(t) + \alpha_t\gamma_i(x(t))\big), \quad i \in I
\end{cases} \tag{3.58}
\]
These iterations define the augmented Lagrangian algorithm. Starting this algorithm with some λ(0) ∈ R^{|C|} and a constant step α, λ(t) converges to a solution λ̂ of the dual problem. The stabilization of the last two updates implies that γi(x(t)) converges to 0 for i ∈ E, and also for i ∈ I such that λ̂i > 0, and that lim sup γi(x(t)) ≤ 0 otherwise. This shows that, if x(t) converges to a limit x̃, then G(x̃) = F(x̃). However, for any x ∈ Ω, we have
\[
G(x(t)) \le G(x) \le F(x)
\]
(the proof being left to the reader), showing that x̃ ∈ argmin_Ω F.
Note that the augmented Lagrangian method can also be used in non-convex
optimization problems [147], requiring in that case that α is small enough.
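
Here is a minimal sketch of the iterations (3.58) on a toy problem of our choosing (the objective, constraints and parameter values below are illustrative, not from the text); the inner minimization is delegated to a generic solver.

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem: minimize F(x) = |x - x0|^2 / 2 subject to one equality
# constraint x[0] + x[1] = 1 (index 0, in E) and one inequality constraint
# x[0] - x[1] <= 0 (index 1, in I). Expected solution: (0.5, 0.5).
x0 = np.array([2.0, -1.0])
gammas = [lambda x: x[0] + x[1] - 1.0,   # gamma_0(x) = 0  (i in E)
          lambda x: x[0] - x[1]]         # gamma_1(x) <= 0 (i in I)

def G(x, lam, alpha):
    # F(x) plus the quadratic penalties appearing in (3.58).
    val = 0.5 * np.sum((x - x0) ** 2)
    val += (lam[0] + alpha * gammas[0](x)) ** 2 / (2 * alpha)          # i in E
    val += max(0.0, lam[1] + alpha * gammas[1](x)) ** 2 / (2 * alpha)  # i in I
    return val

lam, alpha, x = np.zeros(2), 1.0, np.zeros(2)
for t in range(50):
    x = minimize(lambda z: G(z, lam, alpha), x).x       # inner minimization
    lam[0] = lam[0] + alpha * gammas[0](x)              # multiplier updates
    lam[1] = max(0.0, lam[1] + alpha * gammas[1](x))
print(x, lam)
```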
3.6.5 Alternating direction method of multipliers

We return to a situation considered in section 3.5.5, where the function to minimize takes the form F(x) = G(x) + H(x). Here, we do not assume that G or H is smooth, but we will need their respective proximal operators to be easy to compute.

The problem can be reformulated as a minimization with equality constraints, namely that of minimizing F̃(x, z) = G(x) + H(z) subject to x = z. We will actually consider a more general situation, namely the problem of minimizing a function F̃(x, z) subject to constraints Ax + Bz = c, where A and B are respectively d × n and d × m matrices, x ∈ R^n, z ∈ R^m, and c ∈ R^d. The augmented Lagrangian algorithm applied to this problem leads to the following iterations (with only equality constraints):
\[
\begin{cases}
\displaystyle (x_t, z_t) = \operatorname{argmin}_{x \in \mathbb{R}^n,\, z \in \mathbb{R}^m}\Big\{G(x) + H(z) + \frac{1}{2\alpha_t}|\lambda_t + \alpha_t(Ax + Bz - c)|^2\Big\} \\[2mm]
\lambda_{t+1} = \lambda_t + \alpha_t(Ax_t + Bz_t - c)
\end{cases}
\]
with λt ∈ R^d.
One can now consider splitting the first step in two and iterating:
\[
\begin{cases}
\displaystyle x_t = \operatorname{argmin}_{x \in \mathbb{R}^n}\Big\{G(x) + H(z_{t-1}) + \frac{1}{2\alpha_t}|\lambda_t + \alpha_t(Ax + Bz_{t-1} - c)|^2\Big\} \\[2mm]
\displaystyle z_t = \operatorname{argmin}_{z \in \mathbb{R}^m}\Big\{G(x_t) + H(z) + \frac{1}{2\alpha_t}|\lambda_t + \alpha_t(Ax_t + Bz - c)|^2\Big\} \\[2mm]
\lambda_{t+1} = \lambda_t + \alpha_t(Ax_t + Bz_t - c)
\end{cases} \tag{3.59}
\]
(Obviously, H(z_{t−1}) and G(x_t) are constant in the first and second minimization problems, respectively, and can be removed from the formulation.) These iterations constitute the "alternating direction method of multipliers," or ADMM (the method is also sometimes called Douglas-Rachford splitting). It is not equivalent to the augmented Lagrangian algorithm (for that, one would need to iterate a large number of times over the first two steps before applying the third one), but it still satisfies good convergence properties. The reader can refer to Boyd et al. [39] for a relatively elementary proof showing that this algorithm converges, with constant α, as soon as, in addition to the hypotheses that were already made, the Lagrangian
\[
L(x, z, \lambda) = G(x) + H(z) + \lambda^T(Ax + Bz - c)
\]
has a saddle point: there exist x∗, z∗, λ∗ such that
\[
\max_{\lambda} L(x^*, z^*, \lambda) = L(x^*, z^*, \lambda^*) = \min_{x, z} L(x, z, \lambda^*).
\]
Remark 3.63 If αt = α does not depend on time, (3.59) can be slightly simplified by letting u_t = λ_t/α, with the iterations
\[
\begin{cases}
\displaystyle x_t = \operatorname{argmin}_{x \in \mathbb{R}^n}\Big\{G(x) + \frac{\alpha}{2}|u_t + Ax + Bz_{t-1} - c|^2\Big\} \\[2mm]
\displaystyle z_t = \operatorname{argmin}_{z \in \mathbb{R}^m}\Big\{H(z) + \frac{\alpha}{2}|u_t + Ax_t + Bz - c|^2\Big\} \\[2mm]
u_{t+1} = u_t + Ax_t + Bz_t - c,
\end{cases} \tag{3.60}
\]
in which we have removed the constant additive terms. □
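As an illustration (not from the text), the scaled iterations (3.60) can be applied, for instance, to the lasso problem, taking G(x) = ½|Mx − b|², H(z) = µ|z|₁, A = Id, B = −Id and c = 0; a minimal sketch follows. The x-update is then a ridge-type linear solve and the z-update is soft-thresholding (the proximal operator of H).

```python
import numpy as np

def admm_lasso(M, b, mu, alpha=1.0, T=200):
    # Sketch of (3.60) for the lasso: F(x) = 0.5*|Mx - b|^2 + mu*|x|_1,
    # split as G(x) + H(z) with constraint x - z = 0.
    n = M.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    Q = M.T @ M + alpha * np.eye(n)      # matrix of the x-update, factored once
    Mb = M.T @ b
    for _ in range(T):
        x = np.linalg.solve(Q, Mb + alpha * (z - u))              # x-step
        v = u + x
        z = np.sign(v) * np.maximum(np.abs(v) - mu / alpha, 0.0)  # z-step (prox of H)
        u = u + x - z                                             # multiplier step
    return z

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [2.0, -1.0, 0.5]
b = M @ x_true + 0.1 * rng.standard_normal(50)
print(admm_lasso(M, b, mu=1.0)[:5])
```
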
3.7 Convex separation theorems and additional proofs

We conclude this chapter by completing some of the proofs left aside when discussing convex functions. These proofs use convex separation theorems, stated below (without proof).

Theorem 3.64 (c.f., Rockafellar [168]) Let Ω1 and Ω2 be two nonempty convex sets with relint(Ω1) ∩ relint(Ω2) = ∅. Then there exist b ∈ R^d and β ∈ R such that b ≠ 0, b^Tx ≤ β for all x ∈ Ω1 and b^Tx ≥ β for all x ∈ Ω2, with a strict inequality for at least one x ∈ Ω1 ∪ Ω2.

Theorem 3.65 Let Ω1 and Ω2 be two nonempty closed convex sets with Ω1 ∩ Ω2 = ∅ and Ω1 compact. Then there exist b ∈ R^d, β ∈ R and ε > 0 such that b^Tx ≤ β − ε for all x ∈ Ω1 and b^Tx ≥ β + ε for all x ∈ Ω2.
3.7.1 Proof of proposition 3.44

We start with a few general remarks. If x ∈ R^d, the set {x} is convex and relint({x}) = {x}. If Ω is any convex set such that x ∉ relint(Ω), then theorem 3.64 implies that there exist b ∈ R^d and β ∈ R such that b^Ty ≥ β ≥ b^Tx for all y ∈ Ω (with b^Ty > b^Tx for at least one y). If x is in Ω \ relint(Ω) (so that x is a point on the relative boundary of Ω), then, necessarily, b^Tx = β and we can write
\[
b^T y \ge b^T x
\]
for all y ∈ Ω, with a strict inequality for some y ∈ Ω. One says that b and β provide a supporting hyperplane for Ω at x.

Now, if F is a convex function, with
\[
\mathrm{epi}(F) = \{(y, a) \in \mathbb{R}^d \times \mathbb{R} : F(y) \le a\},
\]
then
\[
\mathrm{relint}(\mathrm{epi}(F)) = \{(y, a) \in \mathrm{ridom}(F) \times \mathbb{R} : F(y) < a\}
\]
(this simple fact is proved in lemma 3.66 below). In particular, if x ∈ dom(F), then (x, F(x)) must be in the relative boundary of epi(F). This implies that there exists (b, b₀) ≠ (0, 0) in R^d × R such that, for all (y, a) ∈ epi(F):
\[
b^T y + b_0 a \ge b^T x + b_0 F(x).
\]

If one assumes that x ∈ ridom(F), then, necessarily, b₀ ≠ 0. To show this, assume otherwise, so that b^Ty ≥ b^Tx for all y ∈ dom(F), with b ≠ 0. We get a contradiction using the fact that, for some ε > 0, the segment [y, x − ε(y − x)] belongs to dom(F), because b^T(y − x) cannot have a constant sign on this segment.

So b₀ ≠ 0, and necessarily b₀ > 0 to ensure that b^Ty + b₀a is bounded from below for all a ≥ F(y). Without loss of generality, we can assume b₀ = 1 and we get, for all y ∈ dom(F),
\[
F(y) + b^T y \ge F(x) + b^T x,
\]
which shows that −b ∈ ∂F(x), justifying the fact that ∂F(x) ≠ ∅ for x ∈ ridom(F).
We now state and prove the result announced above on the relative interior of the epigraph of a convex function.

Lemma 3.66 Let F be a convex function with epigraph
\[
\mathrm{epi}(F) = \{(y, a) : y \in \mathrm{dom}(F),\ F(y) \le a\}.
\]
Then
\[
\mathrm{relint}(\mathrm{epi}(F)) = \{(y, a) : y \in \mathrm{ridom}(F),\ F(y) < a\}.
\]

Proof Let Γ = {(y, a) : y ∈ ridom(F), F(y) < a}. Assume that (y, a) ∈ relint(epi(F)). Then (y, b) ∈ epi(F) for all b > a, and there exists ε > 0 such that (y, a) − ε((y, b) − (y, a)) ∈ epi(F), which requires that F(y) ≤ a − ε(b − a) < a. Now, take x ∈ dom(F). Then (x, F(x)) ∈ epi(F) and (y, a) − ε((x, F(x)) − (y, a)) ∈ epi(F) for small enough ε, showing that F(y − ε(x − y)) ≤ (1 + ε)a − εF(x) and y − ε(x − y) ∈ dom(F). This proves that y ∈ ridom(F), and hence that relint(epi(F)) ⊂ Γ.

Conversely, take (y, a) ∈ Γ and (x, b) ∈ epi(F). We need to show that (y − ε(x − y), a − ε(b − a)) ∈ epi(F) for small enough ε, i.e., that
\[
F(y - \varepsilon(x - y)) \le a - \varepsilon(b - a)
\]
for small enough ε. But this is an immediate consequence of the facts that F is continuous at y ∈ ridom(F) and F(y) < a. □
3.7.2 Proof of theorem 3.45

Assume that there exists x̄ ∈ ridom(F1) ∩ ridom(F2). Take x ∈ dom(F1) ∩ dom(F2) and g ∈ ∂(F1 + F2)(x). We want to show that g = g1 + g2 with g1 ∈ ∂F1(x) and g2 ∈ ∂F2(x).

By definition, we have
\[
F_1(y) + F_2(y) \ge F_1(x) + F_2(x) + g^T(y - x)
\]
for all y. We want to decompose g as g = g1 + g2 with g1 ∈ ∂F1(x) and g2 ∈ ∂F2(x). Equivalently, we want to find g2 ∈ R^d such that, for all y ∈ R^d,
\begin{align*}
F_1(y) &\ge F_1(x) + (g - g_2)^T(y - x) \\
F_2(y) &\ge F_2(x) + g_2^T(y - x).
\end{align*}

First note that we can replace F1 by y ↦ F1(y) − F1(x) − g^T(y − x) and F2 by y ↦ F2(y) − F2(x), and assume without loss of generality that F1(x) = F2(x) = 0 and g = 0. Making this assumption, we need to find g2 ∈ R^d such that
\begin{align*}
F_1(y) &\ge -g_2^T(y - x) \\
F_2(y) &\ge g_2^T(y - x)
\end{align*}
for all y ∈ R^d, under the assumption that F1(y) + F2(y) ≥ 0 for all y. Introduce the two convex sets in R^d × R:
\begin{align*}
\Omega_1 &= \mathrm{epi}(F_1) = \{(y, a) \in \mathbb{R}^d \times \mathbb{R} : F_1(y) \le a\} \\
\Omega_2 &= \{(y, a) \in \mathbb{R}^d \times \mathbb{R} : F_2(y) \le -a\}.
\end{align*}
The set Ω2 is the image of epi(F2) by the transformation (y, a) ↦ (y, −a). We have
\begin{align*}
\mathrm{relint}(\Omega_1) &= \mathrm{relint}(\mathrm{epi}(F_1)) = \{(y, a) \in \mathrm{ridom}(F_1) \times \mathbb{R} : F_1(y) < a\} \\
\mathrm{relint}(\Omega_2) &= \{(y, a) \in \mathrm{ridom}(F_2) \times \mathbb{R} : F_2(y) < -a\}.
\end{align*}
Since F1 + F2 ≥ 0, Ω1 and Ω2 have non-intersecting relative interiors. We can apply the first separation theorem, providing b̄ = (b, b₀) ∈ R^d × R and β ∈ R such that b̄ ≠ (0, 0), b^Ty + b₀a − β ≤ 0 for (y, a) ∈ Ω1, and b^Ty + b₀a − β ≥ 0 for (y, a) ∈ Ω2, with a strict inequality for at least one point in Ω1 ∪ Ω2. We therefore obtain the fact that, for all y and a,
\begin{align*}
F_1(y) \le a &\Rightarrow b^T y + b_0 a - \beta \le 0 \\
F_2(y) \le -a &\Rightarrow b^T y + b_0 a - \beta \ge 0.
\end{align*}

We claim that b₀ ≠ 0. Indeed, if b₀ = 0, the statement for F1 would imply that b^Ty − β ≤ 0 for all y ∈ dom(F1), and the one on F2 that b^Ty − β ≥ 0 for y ∈ dom(F2). The point x̄ ∈ ridom(F1) ∩ ridom(F2) should then satisfy b^Tx̄ − β = 0. We know that there exists a point (y, a) ∈ Ω1 ∪ Ω2 such that b^Ty ≠ β. Assume that (y, a) ∈ Ω1, so that b^Ty − β < 0, and take ε > 0 such that ỹ = x̄ − ε(y − x̄) ∈ dom(F1). Then
\[
b^T \tilde y - \beta = -\varepsilon(b^T y - \beta) > 0,
\]
which is a contradiction. A similar contradiction is obtained when (y, a) belongs to Ω2, yielding the fact that b₀ cannot vanish.

Moreover, we clearly need b₀ < 0 to ensure that b^Ty + b₀a − β ≤ 0 for all large enough a when y ∈ dom(F1). There is then no loss of generality in assuming b₀ = −1, and we get
\begin{align*}
F_1(y) \le a &\Rightarrow b^T y - \beta \le a \\
F_2(y) \le -a &\Rightarrow b^T y - \beta \ge a,
\end{align*}
which is equivalent to
\[
-F_2(y) \le b^T y - \beta \le F_1(y).
\]
Taking y = x gives β = b^Tx, and we get the desired inequalities with g2 = −b.
3.7.3 Proof of theorem 3.46

Let x̄ ∈ R^m be such that Ax̄ ∈ ridom(F). We need to prove that ∂G(x) ⊂ A^T∂F(Ax + b) when G(x) = F(Ax + b). We assume in the following that b = 0, since the theorem with G(x) = F(x + b) is obvious. If g ∈ ∂G(x), we have
\[
F(Ay) \ge F(Ax) + g^T(y - x)
\]
for all y ∈ R^m. We want to show that there exists h ∈ R^d such that g = A^Th and, for all z ∈ R^d,
\[
F(z) \ge F(Ax) + h^T(z - Ax) = F(Ax) + h^T z - g^T x.
\]

Let Ω1 = epi(F) = {(z, a) : z ∈ R^d, F(z) ≤ a} and
\[
\Omega_2 = \{(Ay, a) : y \in \mathbb{R}^m,\ a = g^T(y - x) + G(x)\} \subset \mathbb{R}^d \times \mathbb{R}.
\]
Note that Ω2 is an affine space, with relint(Ω2) = Ω2. If (z, a) ∈ relint(Ω1) ∩ Ω2, then z = Ay for some y ∈ R^m and g^T(y − x) + G(x) > F(z) = G(y). This contradicts the fact that g ∈ ∂G(x) and shows that relint(Ω1) ∩ Ω2 = ∅. As a consequence, there exist (b, b₀) ≠ (0, 0) and β such that
\begin{align*}
F(z) \le a &\Rightarrow b^T z + b_0 a \le \beta \\
z = Ay,\ a = g^T(y - x) + G(x) &\Rightarrow b^T z + b_0 a \ge \beta.
\end{align*}

Assume, to get a contradiction, that b₀ = 0 (so that b ≠ 0). Then b^TAy ≥ β for all y, which is only possible if b is perpendicular to the range of A and β ≤ 0. On the other hand, F(Ax̄) < ∞ implies that 0 = b^TAx̄ + b₀F(Ax̄) ≤ β, so that β = 0. Furthermore, we know that one of the inequalities above has to be strict for at least one element of Ω1 ∪ Ω2, but this cannot be true on Ω2, so there exists z ∈ dom(F) such that b^Tz < 0. Since b^TAx̄ = 0 and Ax̄ ∈ ridom(F), we have Ax̄ − ε(z − Ax̄) ∈ dom(F) for small enough ε > 0, so that b^T(Ax̄ − ε(z − Ax̄)) ≤ 0; but b^T(Ax̄ − ε(z − Ax̄)) = −εb^Tz > 0, yielding a contradiction.

So, we need b₀ ≠ 0, and the first pair of inequalities clearly requires b₀ < 0, so that we can take b₀ = −1. This shows that
\[
b^T z - \beta \le F(z)
\]
for all z, and
\[
b^T Ay - \beta \ge g^T(y - x) + F(Ax)
\]
for all y. Taking y = x and z = Ax, we find that β = b^TAx − F(Ax), yielding
\[
F(z) - F(Ax) \ge b^T(z - Ax)
\]
for all z, and b^TA(y − x) ≥ g^T(y − x) for all y. This last inequality implies that g = A^Tb, and the first one that b ∈ ∂F(Ax), therefore concluding the proof. □
Chapter 4

Introduction: Bias, Variance and Density Estimation

In this chapter, we illustrate the bias-variance dilemma in the context of density estimation, in which the problems are similar to those encountered in classical parametric or non-parametric statistics [160, 60, 155].
For density estimation, one assumes that a random variable X is given with unknown p.d.f. f, and we want to build an estimator, i.e., a mapping (x, T) ↦ fˆ(x; T) that provides an estimate of f(x) based on a training set T = (x₁, …, x_N) containing N i.i.d. realizations of X (i.e., T is a realization of 𝒯 = (X₁, …, X_N), a collection of N independent copies of X). Alternatively, we will say that the mapping T ↦ fˆ(·; T) is an estimator of the full density f. Note that, to further illustrate our notation, fˆ(x; T) is a number while fˆ(x; 𝒯) is a random variable.
4.1 Parameter estimation and sieves

Parameter estimation is the most common density estimation method, in which one restricts fˆ to belong to a finite-dimensional parametric class, denoted (f_θ, θ ∈ Θ), with Θ ⊂ R^p. For example, f_θ can be a family of Gaussian distributions on R^d. With our notation, a parametric model provides estimators taking the form
\[
\hat f(x; T) = f_{\hat\theta(T)}(x)
\]
and the problem becomes that of computing the estimator θ̂.

There are several well-known methods for parameter estimation and, since this is not the focus of the book, we only consider the most common one, maximum likelihood, which consists in computing the θ̂ that maximizes the log-likelihood
\[
C(\theta) = \frac{1}{N}\sum_{k=1}^N \log f_\theta(x_k). \tag{4.1}
\]
The resulting θ̂ (when it exists) is called the maximum likelihood estimator of θ, or m.l.e.
If the true f belongs to the parametric class, so that f = f_{θ∗} for some θ∗ ∈ Θ, standard results in mathematical statistics [29, 119] provide sufficient conditions for θ̂ to converge to θ∗ when N tends to infinity. However, the fact that the true p.d.f. belongs to the finite-dimensional class (f_θ) is an optimistic assumption that is generally false. In this regard, the standard theorems in parametric statistics may be regarded as analyzing a "best case scenario," or as performing a "sanity check," in which one asks whether, in the ideal situation in which f actually belongs to the parametric class, the designed estimator has a proper behavior. In non-parametric statistics, a parametric model can still be a plausible approach in order to approximate the true f, but the relevant question should then be whether fˆ provides (asymptotically) the best approximation to f among all f_θ, θ ∈ Θ. The maximum likelihood estimator can be analyzed from this viewpoint if one measures the difference between two density functions by the Kullback-Leibler divergence (also called relative entropy):
\[
\mathrm{KL}(f \,\|\, f_\theta) = \int_{\mathbb{R}^d} \log\frac{f(x)}{f_\theta(x)}\, f(x)\,dx, \tag{4.2}
\]
which is positive unless f = f_θ (and may be equal to +∞).

This expression of the divergence is a simplification of its general measure-theo-


retic definition, that we now provide for completeness—and future use. Let µ and
ν be two probability measures on a set Ω. e Recall from section 1.4.5 that one says
that µ is absolutely continuous with respect to ν, with notation µ  ν, if, for every
(measurable) subset A ⊂ Ω,e ν(A) = 0 implies µ(A) = 0. The Radon-Nikodym theorem
then states that µ  ν is and only if there exists a non-negative function g = dµ/dν
(the Radon-Nikodym derivative of µ with respect to ν) defined on Ω e such that
Z
µ(A) = g(x)dν(x).
A

In terms of random variables, this says that, if X : Ω → Ωe and Y : Ω → Ω


e are two
random variables with respective distributions µ and ν, and ϕ : Ωe → R is measur-
able, then E(ϕ(X)) = E(g(Y )ϕ(Y )). The general definition of the Kullback-Leibler
divergence between µ and ν is then:
Z !
 dµ dµ
log dν if µ  ν



KL(µkν) = 

 Ω̃ dν dν (4.3)

+∞

otherwise
4.1. PARAMETER ESTIMATION AND SIEVES 105

In the case when µ = f dx and ν = f̃ dx are both probability measures on R^d with respective p.d.f.'s f and f̃, µ ≪ ν means that f/f̃ is well defined everywhere except on a set of ν-probability zero, and it is then equal to dµ/dν. If µ ≪ ν, we can therefore write
\[
\mathrm{KL}(\mu\,\|\,\nu) = \int_{\mathbb{R}^d} \log\Big(\frac{f(x)}{\tilde f(x)}\Big)\, f(x)\,dx = \int_{\mathbb{R}^d} \log\Big(\frac{f(x)}{\tilde f(x)}\Big)\,\frac{f(x)}{\tilde f(x)}\, \tilde f(x)\,dx,
\]
and we will make the abuse of notation of writing KL(f‖f̃) for KL(f dx‖f̃ dx), which gives the expression provided in (4.2).
The general definition also gives a simple expression when Ω̃ is a finite set, with
\[
\mathrm{KL}(\mu\,\|\,\nu) = \sum_{x \in \tilde\Omega} \log\Big(\frac{\mu(x)}{\nu(x)}\Big)\,\mu(x),
\]
which we will use later in these notes (if there exists x such that µ(x) > 0 and ν(x) = 0, then KL(µ‖ν) = ∞). The most important property for us is that the Kullback-Leibler divergence can be used as a measure of discrepancy between two probability distributions, based on the following proposition.
Proposition 4.1 Let µ and ν be two probability measures on Ω̃. Then KL(µ‖ν) ≥ 0, and it vanishes if and only if µ = ν.

Proof Assume that µ ≪ ν, since the statement is obvious otherwise, and let g = dµ/dν. We have ∫_Ω̃ g dν = 1 (since, by definition, this integral is equal to µ(Ω̃)), so that
\[
\mathrm{KL}(\mu\,\|\,\nu) = \int_{\tilde\Omega} (g \log g + 1 - g)\,d\nu.
\]
We have t log t + 1 − t ≥ 0, with equality if and only if t = 1 (the proof being left to the reader), so that KL(µ‖ν) = 0 if and only if g = 1 with ν-probability one, i.e., if and only if µ = ν. □
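A quick numerical check of this proposition in the finite case (our illustration, not from the text):

```python
import numpy as np

def kl_discrete(mu, nu):
    # KL(mu || nu) for distributions on a finite set; +inf when mu is not
    # absolutely continuous with respect to nu.
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    if np.any((nu == 0) & (mu > 0)):
        return np.inf
    mask = mu > 0                         # terms with mu(x) = 0 contribute 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / nu[mask])))

mu = [0.5, 0.3, 0.2]
print(kl_discrete(mu, mu))                 # 0.0, as proposition 4.1 predicts
print(kl_discrete(mu, [0.2, 0.3, 0.5]))    # strictly positive
print(kl_discrete(mu, [0.5, 0.5, 0.0]))    # inf: mu is not a.c. w.r.t. nu
```
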
Minimizing KL(f‖f_θ) with respect to θ is equivalent to maximizing
\[
E_f(\log f_\theta) = \int_{\mathbb{R}^d} \log f_\theta(x)\, f(x)\,dx,
\]
and an empirical evaluation of this expectation is (1/N) Σ_{k=1}^N log f_θ(x_k), which provides the maximum likelihood method. Seen in this context, consistency of the maximum likelihood estimator states that this estimator almost surely converges to a best approximator of the true f in the class (f_θ, θ ∈ Θ). More precisely, if one assumes that
the function θ ↦ log f_θ(x) is continuous in θ for almost all x (upper semi-continuity is in fact sufficient) and that, for all θ ∈ Θ, there exists a small enough δ > 0 such that
\[
\int_{\mathbb{R}^d} \sup_{|\theta' - \theta| < \delta} \big|\log f_{\theta'}(x)\big|\, f(x)\,dx < \infty,
\]
then, letting Θ∗ denote the set of maximizers of E_f(log f_θ), and assuming that it is not empty, the maximum likelihood estimator θ̂_N is such that, for all ε > 0 and all compact subsets K ⊂ Θ,
\[
\lim_{N \to \infty} P\big(d(\hat\theta_N, \Theta^*) > \varepsilon \text{ and } \hat\theta_N \in K\big) = 0,
\]
where d(θ̂_N, Θ∗) is the Euclidean distance between θ̂_N and the set Θ∗. The interested reader can refer to Van der Vaart [196], Theorem 5.14, for a proof of this statement. Note that this assertion does not exclude the situation in which θ̂_N goes to infinity (i.e., eventually leaves every compact subset K of Θ), and the boundedness of the m.l.e. must either be asserted from additional properties of the likelihood, or obtained by simply restricting Θ to be a compact set.
If Θ∗ = {θ∗} and the m.l.e. almost surely converges to θ∗, the speed of convergence can also be quantified by a central limit theorem (see Van der Vaart [196], Theorem 5.23) ensuring that, in standard cases, √N(θ̂_N − θ∗) converges to a normal distribution.
Even though these results relate our present subject to classical parametric statistics, they are not sufficient for our purpose because, when f ≠ f_{θ∗}, the convergence of the m.l.e. to the best approximator in Θ still leaves a gap in the estimation of f. This gap is often called the bias of the class (f_θ, θ ∈ Θ). One can reduce it by considering larger classes (e.g., with more dimensions), but the larger the class, the less accurate the estimation of the best approximator becomes for a fixed sample size (the estimator has a larger variance). This issue is known as the "bias vs. variance dilemma," and to address it, it is necessary to adjust the class Θ to the sample size in order to optimally balance the two types of error (and all non-parametric estimation methods have at least one mechanism that allows for this). When the "tuning parameter" is the dimension of Θ, the overall approach is often referred to as the method of sieves [83, 80], in which the dimension of Θ is increased as a function of N in a suitable way.
Gaussian mixture models provide one of the most popular choices with the method of sieves. Modeling in this setting typically follows some variation of the following construction. Fix a sequence (m_N, N ≥ 1) and let
\[
\Theta_N = \Big\{f : f(x) = \sum_{j=1}^{m_N} \alpha_j\, \frac{e^{-|x - \mu_j|^2/2\sigma^2}}{(2\pi\sigma^2)^{d/2}},\ \mu_1, \ldots, \mu_{m_N} \in \mathbb{R}^d,\ \alpha_1 + \cdots + \alpha_{m_N} = 1,\ \alpha_1, \ldots, \alpha_{m_N} \in [0, +\infty),\ \sigma > 0\Big\}. \tag{4.4}
\]
There are therefore (d + 1)m_N free parameters in Θ_N. The integer m_N allows one to adjust the dimension of Θ_N and therefore controls the bias-variance trade-off. If m_N tends to infinity "slowly enough," the m.l.e. will converge (almost surely) to the true p.d.f. f [80]. However, determining optimal sequences N ↦ m_N remains a challenging and largely unsolved problem.
In practice, the computation of the m.l.e. in this context uses an algorithm called EM, for expectation-maximization. This algorithm will be described later, in chapter 17.
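As an illustration of the sieve idea (not from the text), the following sketch fits mixtures of increasing size m_N with scikit-learn and compares held-out log-likelihoods; note that the "spherical" covariance model is only a close relative of (4.4), since it allows one variance per component rather than a single shared σ.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: a two-component mixture on the real line.
X = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)]).reshape(-1, 1)
rng.shuffle(X)
train, heldout = X[:800], X[800:]

# Fit mixtures of increasing size and score them on held-out data; the
# held-out log-likelihood typically saturates once m_N is large enough.
for m in [1, 2, 4, 8, 16]:
    gmm = GaussianMixture(n_components=m, covariance_type="spherical",
                          random_state=0).fit(train)
    print(m, gmm.score(heldout))   # average held-out log-likelihood
```
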
4.2 Kernel density estimation

Kernel density estimators [151, 178, 179] provide alternatives to the method of sieves. They also lend themselves to some analytical developments that provide elementary illustrations of the bias-variance dilemma.

Define a kernel function as a function K : R^d → [0, +∞) such that
\[
\int_{\mathbb{R}^d} K(x)\,dx = 1, \quad \int_{\mathbb{R}^d} |x|\, K(x)\,dx < \infty, \quad \int_{\mathbb{R}^d} x\, K(x)\,dx = 0. \tag{4.5}
\]
Note that the third equation is satisfied, in particular, when K is an even function, i.e., K(−x) = K(x).

Given K and a scalar σ > 0, the rescaled kernel is defined by
\[
K_\sigma(x) = \frac{1}{\sigma^d}\, K\Big(\frac{x}{\sigma}\Big).
\]
Using the change of variables y = x/σ (so that dy = dx/σ^d), one sees that K_σ satisfies (4.5) as soon as K does.
Based on a training set T = (x₁, …, x_N), the kernel density estimator defines the family of densities
\[
\hat f_\sigma(x; T) = \frac{1}{N}\sum_{k=1}^N K_\sigma(x - x_k).
\]
One has
\[
\int_{\mathbb{R}^d} K_\sigma(x - x_k)\,dx = 1,
\]
so that it is clear that fˆ_σ is a p.d.f. In addition,
\[
\int_{\mathbb{R}^d} x\, K_\sigma(x - x_k)\,dx = \int_{\mathbb{R}^d} (y + x_k)\, K_\sigma(y)\,dy = x_k,
\]
so that
\[
\int_{\mathbb{R}^d} x\, \hat f_\sigma(x; T)\,dx = \bar x,
\]
where x̄ = (x₁ + · · · + x_N)/N.
A typical choice for K is the Gaussian kernel, K(y) = e^{−|y|²/2}/(2π)^{d/2}. In this case, the estimated density is a sum of bumps centered at the data points x₁, …, x_N. The width of the bumps is controlled by the parameter σ. A small σ implies less rigidity in the model, which will therefore be more affected by changes in the data: the estimated density will have a larger variance. The converse is true for large σ, at the cost of being less able to adapt to variations in the true density: the model has a larger bias (see fig. 4.1 and fig. 4.2).
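The following minimal sketch (ours, not from the text) reproduces the kind of experiment behind figure 4.1: a Gaussian-kernel estimator evaluated for several values of σ on data drawn from a standard Gaussian.

```python
import numpy as np

def kde(x_eval, data, sigma):
    # Gaussian kernel density estimator f_hat_sigma(x; T) in one dimension:
    # average of bumps K_sigma(x - x_k) centered at the training points.
    diffs = (x_eval[:, None] - data[None, :]) / sigma
    bumps = np.exp(-diffs**2 / 2) / np.sqrt(2 * np.pi)
    return bumps.mean(axis=1) / sigma

rng = np.random.default_rng(0)
data = rng.standard_normal(200)           # true density: standard Gaussian
xs = np.linspace(-4, 4, 9)
for sigma in [0.1, 0.25, 0.5, 1.0]:       # the values used in figure 4.1
    est = kde(xs, data, sigma)
    true = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)
    print(sigma, np.round(np.abs(est - true).max(), 3))
```
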
As we now show, in order to get a consistent estimator, one needs to let σ = σ_N depend on the size of the training set. We have, taking expectations with respect to the training data,
\begin{align*}
E(\hat f_\sigma(x; \mathcal{T})) &= \frac{1}{N\sigma^d}\sum_{k=1}^N E\big(K((x - X_k)/\sigma)\big) \\
&= \frac{1}{\sigma^d}\int_{\mathbb{R}^d} K((x - y)/\sigma)\, f(y)\,dy \\
&= \int_{\mathbb{R}^d} K(z)\, f(x - \sigma z)\,dz.
\end{align*}

The bias of the estimator, i.e., the average difference between fˆσ (x; T ) and f (x) is
therefore given by
Z
ˆ
E(fσ (x; T )) − f (x) = K(z)(f (x − σ z) − f (x))dz.
Rd

Interestingly, this bias does not depend on N , but only on σ , and it is clear that,
under mild continuity assumptions on f , it will go to zero with σ .

The variance of fˆ_σ(x; 𝒯) is given by
\[
\mathrm{var}(\hat f_\sigma(x; \mathcal{T})) = \frac{1}{N\sigma^{2d}}\,\mathrm{var}\big(K((x - X)/\sigma)\big)
\]
with
\begin{align*}
\frac{1}{N\sigma^{2d}}\,\mathrm{var}\big(K((x - X)/\sigma)\big) &= \frac{1}{N\sigma^{2d}}\int_{\mathbb{R}^d} K((x - y)/\sigma)^2 f(y)\,dy - \frac{1}{N\sigma^{2d}}\Big(\int_{\mathbb{R}^d} K((x - y)/\sigma) f(y)\,dy\Big)^2 \\
&= \frac{1}{N\sigma^d}\int_{\mathbb{R}^d} K(z)^2 f(x - \sigma z)\,dz - \frac{1}{N}\Big(\int_{\mathbb{R}^d} K(z) f(x - \sigma z)\,dz\Big)^2.
\end{align*}
The total mean-square error of the estimator is
\[
E\big((\hat f_\sigma(x) - f(x))^2\big) = \mathrm{var}(\hat f_\sigma(x)) + \big(E(\hat f_\sigma(x)) - f(x)\big)^2.
\]
Clearly, this error cannot go to zero unless we allow σ = σ_N to depend on N. For the bias term to go to zero, we know that we need σ_N → 0, in which case we can expect the second term in the variance to decrease like 1/N, while, for the first term to go to zero, we need Nσ_N^d to go to infinity. This illustrates the bias-variance dilemma: σ_N must go to zero in order to cancel the bias, but not too fast, in order to also cancel the variance. There is, for each N, an optimal value of σ that minimizes the error, and we now proceed to a more detailed analysis to make this statement a little more precise.

[Figure 4.1: Kernel density estimators using a Gaussian kernel and various values of σ (0.1, 0.25, 0.5, 1.0) when the true distribution of the data is a standard Gaussian. Orange: true density; blue: estimated density; red dots: training data.]

[Figure 4.2: Kernel density estimators using a Gaussian kernel and various values of σ (0.1, 0.25, 0.5, 1.0) when the true distribution of the data is a Gamma distribution with parameter 2. Orange: true density; blue: estimated density; red dots: training data.]
Let us make a Taylor expansion of both the bias and the variance, assuming that f has at least three bounded derivatives and that ∫_{R^d} |x|³K(x) dx < ∞. We can write
\[
f(x - \sigma z) = f(x) - \sigma z^T \nabla f(x) + \frac{\sigma^2}{2}\, z^T \nabla^2 f(x)\, z + O(\sigma^3 |z|^3),
\]
where ∇²f(x) denotes the matrix of second derivatives of f at x. Since ∫ zK(z) dz = 0, this gives
\[
E(\hat f_\sigma(x; \mathcal{T})) - f(x) = \frac{\sigma^2}{2} M_f(x) + o(\sigma^2)
\]
with M_f(x) = ∫ K(z) z^T∇²f(x)z dz. Similarly, letting S = ∫ K²(z) dz,
\[
\mathrm{var}(\hat f_\sigma(x)) = \frac{1}{N\sigma^d}\big(S f(x) + o(\sigma^d + \sigma^2)\big).
\]
Assuming that f(x) > 0, we can obtain an asymptotically optimal value for σ by minimizing the leading terms of the mean square error, namely
\[
\frac{\sigma^4}{4} M_f(x)^2 + \frac{S}{N\sigma^d}\, f(x),
\]
which yields σ_N = O(N^{−1/(d+4)}) and
\[
E\big((\hat f_{\sigma_N}(x; \mathcal{T}) - f(x))^2\big) = O(N^{-4/(d+4)}).
\]
If f has r + 1 derivatives and K has r − 1 vanishing moments (this excludes the Gaussian kernel), one can reduce this error to N^{−2r/(2r+d)}. These rates can be shown to be "optimal," in the "min-max" sense, which roughly expresses the fact that, for any other estimator, there exists a function f for which the convergence speed is at least as "bad" as the one obtained for kernel density estimation.
This result says that, in order to obtain a given accuracy ε in the worst-case scenario, N should be chosen of order (1/ε)^{1+d/(2r)}, which grows exponentially fast with the dimension. This is the curse of dimensionality, which essentially states that the issue of density estimation may be intractable in large dimensions. The same statement is also true for most other types of machine learning problems. Since machine learning essentially deals with high-dimensional data, this issue can be problematic.
Obviously, because the min-max theory is a worst-case analysis, not all situations
will be intractable for a given estimator, and some cases that are challenging for one
of them may be quite simple for others: even though all estimators are “cursed,” the
way each of them is cursed differs. Moreover, while many estimators are optimal
in the min-max sense, this theory does not give any information on “how often” an
estimator performs better than its worst case, or how it will perform on a given class
of problems. (For kernel density estimation, however, what we found was almost
universal with respect to the unknown density f , which indicates that this estimator
is not a good choice in large dimensions.)

Another important point with this curse of dimensionality is that data may very
often appear to be high dimensional while it has a simple, low-dimensional struc-
ture, maybe because many dimensions are irrelevant to the problem (they contain,
for example, just random noise), or because the data is supported by a non-linear
low-dimensional space, such as a curve or a surface. This information is, of course,
not available to the analysis, but can sometimes be inferred using some of the dimen-
sion reduction methods that will be discussed later in chapter 21. Sometimes, and
this is also important, information on the data structure can be provided by domain
knowledge, that is, by elements, provided by experts, that specify how the data has
been generated (such as underlying equations) and reasonable hypotheses that are
made in the field. This source of information should never be ignored in practice.
Chapter 5

Prediction: Basic Concepts

5.1 General Setting

The goal of prediction is to learn, based on training data, an input-output relation-


ship between two random variables X and Y , in the sense of finding, for a specified
criterion, the best function of the input X that predicts the output Y . (In statistics, Y
is often called the dependent variable, and X the independent variable.) We will, as
always, assume that all the variables mentioned in this chapter are defined on a fixed
probability space (Ω, P). We assume that X : Ω → RX , where RX is the input space,
and Y : Ω → RY , where RY is the output space. The input-output relationship is
therefore captured by an unknown function f : RX → RY , the predictor.

The following two subclasses of prediction problems are important enough to have earned their own names and specific literature.

• Quantitative output: R_Y = R^q (often with q = 1). One then speaks of a regression problem.

• Categorical output: R_Y = {g₁, …, g_q} is a finite set. One then speaks of a classification problem.
In most cases, the input space is Euclidean, i.e., R_X = R^d. Note also that, in classification, instead of a function f : R_X → R_Y, one sometimes estimates a function f : R_X → Π(R_Y), where Π(R_Y) is the space of probability distributions on R_Y. We will return to this in remark 5.3.
The quality of a prediction is assessed through the definition of a risk function. Such a function, denoted r, is defined on R_Y × R_Y, takes values in [0, +∞), and should be understood as
\[
r(\text{True output}, \text{Predicted output}), \tag{5.1}
\]
so that r(y, y′) assigns a cost to the situation in which a true y is predicted by y′. Note that this definition is asymmetric: there is no requirement that r(y, y′) = r(y′, y). It is important to remember our convention that the first variable is the true observation and the second one is a place-holder for a prediction. Risk functions are also called loss functions, or simply cost functions, and we will use these terms as synonyms.
The goal in prediction is to minimize the expected risk, also called the generalization error:
\[
R(f) = E(r(Y, f(X))).
\]
We will prove that an optimal f can be easily described based on the joint distribution of X and Y (which is, unfortunately, never available). We will need for this to use conditional expectations and conditional probabilities, as defined in sections 1.4.2 and 1.4.8.
5.2 Bayes predictor

Definition 5.1 A Bayes predictor is a measurable function f : R_X → R_Y such that, for all x ∈ R_X (up to a P_X-negligible set),
\[
E\big(r(Y, f(x)) \mid X = x\big) = \min\big\{E\big(r(Y, y') \mid X = x\big) : y' \in R_Y\big\}.
\]
There can be multiple Bayes predictors if the minimum in this definition is not uniquely attained. Note that, if f∗ is a Bayes predictor and fˆ any other predictor, we have, by definition,
\[
E\big(r(Y, f^*(X)) \mid X\big) \le E\big(r(Y, \hat f(X)) \mid X\big).
\]
Passing to expectations, this implies R(f∗) ≤ R(fˆ). We therefore have the following result:

Theorem 5.2 Any Bayes predictor f∗ is optimal, in the sense that it minimizes the generalization error R.
Example 1. Regression with mean-square error. When R_X = R^d and R_Y = R^q, the most common risk function is the squared norm of the difference, r(y, y′) = |y − y′|². The resulting generalization error is called the MSE (mean square error) and is given by R(f) = E(|Y − f(X)|²). The Bayes predictor is such that f∗(x) minimizes
\[
t \mapsto E(|Y - t|^2 \mid X = x).
\]
Let f∗(x) = E(Y | X = x) and write
\begin{align*}
E(|Y - t|^2 \mid X = x) &= E(|Y - f^*(x)|^2 \mid X = x) + 2E\big((Y - f^*(x))^T(f^*(x) - t) \mid X = x\big) + |f^*(x) - t|^2 \\
&= E(|Y - f^*(x)|^2 \mid X = x) + 2E\big((Y - f^*(x))^T \mid X = x\big)(f^*(x) - t) + |f^*(x) - t|^2 \\
&= E(|Y - f^*(x)|^2 \mid X = x) + |f^*(x) - t|^2.
\end{align*}
This proves that E(Y | X = x) is the unique Bayes predictor (up to a modification on a set of probability 0).
Example 2. Classification with zero-one loss. Let R_X = R^d and let R_Y be a finite set. The zero-one loss function is defined by r(y, y′) = 1 if y ≠ y′ and 0 otherwise. From this, it results that the generalization error is the probability of misclassification, R(f) = P(Y ≠ f(X)) (also called the misclassification error).

The Bayes predictor is such that f∗(x) minimizes
\[
g \mapsto P(Y \ne g \mid X = x) = 1 - P(Y = g \mid X = x).
\]
It is therefore given by the so-called posterior mode:
\[
f^*(x) = \operatorname{argmax}_g P(Y = g \mid X = x).
\]
Remark 5.3 As mentioned at the beginning of the chapter, one sometimes replaces a pointwise prediction of the output by a probabilistic one, so that f(x) is a probability distribution on R_Y. If A is a (measurable) subset of R_Y, we will write f(x, A) rather than f(x)(A).

In such a case, the loss function r is defined on R_Y × Π(R_Y), and the expected risk is still defined by E(r(Y, f(X))).

It is quite natural to require that π ↦ r(y, π) is minimized for π = δ_y. For classification problems, where R_Y is finite, one can choose
\[
r(y, \pi) = -\log \pi(y), \tag{5.2}
\]
which satisfies this property. The Bayes estimator is then a minimizer of π ↦ −E(log π(Y) | X = x). The solution is (unsurprisingly) f(x, y) = P(Y = y | X = x), since we always have
\[
-E(\log \pi(Y) \mid X = x) = -\sum_{y \in R_Y} \log \pi(y)\, f(x, y) \ge -\sum_{y \in R_Y} \log f(x, y)\, f(x, y).
\]
The difference between these terms is indeed
\[
\sum_{y \in R_Y} \log\Big(\frac{f(x, y)}{\pi(y)}\Big)\, f(x, y) = \mathrm{KL}(f(x, \cdot)\,\|\,\pi) \ge 0.
\]
For regression problems, with R_Y = R^q, one can choose
\[
r(y, \pi) = \int_{\mathbb{R}^q} |z - y|^2\, \pi(dz),
\]
which is indeed minimal when π is concentrated on y. Here, the Bayes estimator minimizes (with respect to π)
\[
\int_{\mathbb{R}^q}\int_{\mathbb{R}^q} |z - y|^2\, \pi(dz)\, P_Y(dy \mid X = x) = \int_{\mathbb{R}^q}\Big(\int_{\mathbb{R}^q} |z - y|^2\, P_Y(dy \mid X = x)\Big)\, \pi(dz),
\]
where P_Y(· | X = x) is the conditional distribution of Y given X = x. For any z, one has
\[
\int_{\mathbb{R}^q} |y - z|^2\, P_Y(dy \mid X = x) \ge \int_{\mathbb{R}^q} |y - E(Y \mid X = x)|^2\, P_Y(dy \mid X = x),
\]
which shows that the Bayes estimator is, in this case, the Dirac measure concentrated at E(Y | X = x). □
5.3 Examples: model-based approach

Bayes predictors are never available in practice, because the true distribution of (X, Y), or that of Y given X, is unknown. These distributions can only be inferred from observations, i.e., from a training set T = (x₁, y₁, …, x_N, y_N).

This is the approach followed by model-based, or generative, methods, namely using training data to approximate the joint distribution of X and Y before using the Bayes estimator derived from this model for prediction. We now illustrate this approach with a few examples.
5.3.1 Gaussian models and naive Bayes

Consider a regression problem with R_Y = R, and model the joint distribution of (X, Y) as a (d + 1)-dimensional Gaussian distribution with mean µ and covariance matrix Σ, which must be estimated from data. Write
\[
\mu = \begin{pmatrix} m \\ \mu_0 \end{pmatrix},
\]
with µ₀ ∈ R and m ∈ R^d, and write Σ in the form, for some symmetric matrix S and d-dimensional vector u,
\[
\Sigma = \begin{pmatrix} S & u \\ u^T & \sigma_{00}^2 \end{pmatrix}.
\]
Then, letting ∆ = σ₀₀² − u^TS^{−1}u,
\[
\Sigma^{-1} = \frac{1}{\Delta}\begin{pmatrix} \Delta S^{-1} + S^{-1}uu^TS^{-1} & -S^{-1}u \\ -u^TS^{-1} & 1 \end{pmatrix}.
\]
This shows that the joint p.d.f. of (X, Y) is proportional to
\[
\exp\Big(-\frac{1}{2\Delta}\big((y - \mu_0)^2 - 2u^TS^{-1}(x - m)(y - \mu_0)\big) + (\text{terms not depending on } y)\Big).
\]
In particular,
\[
E(Y \mid X = x) = \mu_0 + u^TS^{-1}(x - m),
\]
which provides the least-square linear regression predictor. (In this expression, u is the covariance between X and Y and S is the covariance matrix of X.)

If one restricts the model to having a diagonal covariance matrix S, then
\[
E(Y \mid X = x) = \mu_0 + \sum_{j=1}^d \frac{u^{(j)}}{s_{jj}}\,(x^{(j)} - m^{(j)}).
\]
This predictor is often called the naive Bayes predictor for regression.
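A plug-in version of this predictor, with the population moments replaced by sample estimates, can be sketched as follows (our illustration, not from the text; the function names and data are ours):

```python
import numpy as np

def gaussian_regression_predictor(X, y, naive=False):
    # Plug-in version of E(Y | X = x) = mu0 + u^T S^{-1} (x - m), with the
    # moments (m, mu0, S, u) replaced by their sample estimates.
    m, mu0 = X.mean(axis=0), y.mean()
    S = np.cov(X, rowvar=False)                        # covariance matrix of X
    u = ((X - m) * (y - mu0)[:, None]).mean(axis=0)    # covariance of X and Y
    if naive:
        S = np.diag(np.diag(S))                        # keep only the s_jj
    w = np.linalg.solve(S, u)
    return lambda x: mu0 + w @ (x - m)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
y = 1.0 + X @ np.array([0.5, -2.0, 1.0]) + 0.1 * rng.standard_normal(500)
f_full = gaussian_regression_predictor(X, y)
f_naive = gaussian_regression_predictor(X, y, naive=True)
print(f_full(np.ones(3)), f_naive(np.ones(3)))
```
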
5.3.2 Kernel regression

Let R_X = R^d and R_Y = R. Let K₁ : R^d → R and K₂ : R → R be two kernels, therefore satisfying
\[
\int_{\mathbb{R}^d} K_1(x)\,dx = \int_{\mathbb{R}} K_2(y)\,dy = 1; \quad \int_{\mathbb{R}^d} x\, K_1(x)\,dx = 0, \quad \int_{\mathbb{R}} y\, K_2(y)\,dy = 0.
\]
Let K(x, y) = K₁(x)K₂(y), so that
\[
\int_{\mathbb{R}^{d+1}} K(x, y)\,dy\,dx = 1, \quad \int_{\mathbb{R}^{d+1}} y\, K(x, y)\,dy\,dx = 0, \quad \int_{\mathbb{R}^{d+1}} x\, K(x, y)\,dy\,dx = 0.
\]
The kernel estimator of the joint p.d.f. φ of (X, Y) at scale σ is, in this case,
\[
\hat\varphi(x, y) = \frac{1}{N}\sum_{k=1}^N \frac{1}{\sigma^{d+1}}\, K_1\Big(\frac{x - x_k}{\sigma}\Big)\, K_2\Big(\frac{y - y_k}{\sigma}\Big).
\]
Based on φ̂, the conditional expectation of Y given X = x is
\[
\hat f(x) = \frac{\frac{1}{N}\sum_{k=1}^N \frac{1}{\sigma^{d+1}} \int_{\mathbb{R}} y\, K_1\big(\frac{x - x_k}{\sigma}\big)\, K_2\big(\frac{y - y_k}{\sigma}\big)\,dy}{\frac{1}{N}\sum_{k=1}^N \frac{1}{\sigma^{d+1}} \int_{\mathbb{R}} K_1\big(\frac{x - x_k}{\sigma}\big)\, K_2\big(\frac{y - y_k}{\sigma}\big)\,dy}.
\]
Using the fact that σ^{−1} ∫_R y K₂((y − y_k)/σ) dy = y_k, we can simplify this expression to obtain
\[
\hat f(x) = \frac{\sum_{k=1}^N y_k\, K_1\big(\frac{x - x_k}{\sigma}\big)}{\sum_{k=1}^N K_1\big(\frac{x - x_k}{\sigma}\big)}.
\]
This is the kernel-density regression estimator [140, 205].
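This estimator (often known as the Nadaraya-Watson estimator) is straightforward to implement; here is a minimal one-dimensional sketch with a Gaussian kernel K₁ (our illustration, not from the text):

```python
import numpy as np

def kernel_regression(x_eval, x_train, y_train, sigma):
    # f_hat(x) = sum_k y_k K1((x - x_k)/sigma) / sum_k K1((x - x_k)/sigma),
    # with a Gaussian kernel K1 and one-dimensional inputs.
    w = np.exp(-((x_eval[:, None] - x_train[None, :]) / sigma) ** 2 / 2)
    return (w * y_train).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, 300)
y_train = np.sin(x_train) + 0.2 * rng.standard_normal(300)
xs = np.linspace(-2, 2, 5)
print(kernel_regression(xs, x_train, y_train, sigma=0.3))
print(np.sin(xs))   # for comparison with the true regression function
```
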
5.3.3 A classification example

Let R_Y = {0, 1} and assume R_X = N = {0, 1, 2, …}. Let p = P(Y = 1), and assume that, conditionally to Y = g, X follows a Poisson distribution with mean µ_g. Assume that µ₀ < µ₁.

The posterior distribution of Y given X = x is
\[
P(Y = g \mid X = x) \propto \begin{cases} (1 - p)\,\mu_0^x\, e^{-\mu_0} & \text{if } g = 0 \\ p\,\mu_1^x\, e^{-\mu_1} & \text{if } g = 1 \end{cases}
\]
(where ∝ is the notation for "proportional to"). A Bayes classifier is then provided by taking f(x) = 1 if
\[
\log p + x \log \mu_1 - \mu_1 \ge \log(1 - p) + x \log \mu_0 - \mu_0,
\]
that is,
\[
x \log\frac{\mu_1}{\mu_0} \ge \log\frac{1 - p}{p} + \mu_1 - \mu_0.
\]
Since we are assuming that µ₁ > µ₀, we find that f(x) = 1 if
\[
x \ge \left\lceil \frac{\log((1 - p)/p) + \mu_1 - \mu_0}{\log(\mu_1/\mu_0)} \right\rceil
\]
and 0 otherwise (⌈x⌉ denotes the ceiling of x, the smallest integer larger than or equal to x).
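The threshold can be computed directly; here is a minimal sketch (ours, not from the text, with illustrative parameter values):

```python
import math

def poisson_bayes_threshold(p, mu0, mu1):
    # Smallest x at which the Bayes classifier predicts 1 (assuming mu1 > mu0).
    return math.ceil((math.log((1 - p) / p) + mu1 - mu0) / math.log(mu1 / mu0))

def bayes_classifier(x, p, mu0, mu1):
    return 1 if x >= poisson_bayes_threshold(p, mu0, mu1) else 0

print(poisson_bayes_threshold(p=0.5, mu0=2.0, mu1=5.0))         # 4
print([bayes_classifier(x, 0.5, 2.0, 5.0) for x in range(8)])   # 0 below, 1 above
```
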
5.4 Empirical risk minimization

5.4.1 General principles

Model-based approaches for prediction are based on the estimation of the joint distribution of the input and output variables, which is arguably a harder problem than prediction [198]. Since the goal is to find f minimizing the expected risk R(f) = E(r(Y, f(X))), one may prefer a direct approach and consider the minimization of an empirical estimate of this risk, based on training data T = (x₁, y₁, …, x_N, y_N), namely
\[
\hat R(f) = \frac{1}{N}\sum_{k=1}^N r(y_k, f(x_k)).
\]
This strategy is called empirical risk minimization.
Importantly, R̂ must be minimized over a restricted class, F, of predictors to avoid overfitting. For example, with R_Y = R and R_X = R^d, one can take
\[
\mathcal{F} = \Big\{f : f(x) = \beta_0 + \sum_{i=1}^d b^{(i)} x^{(i)},\ \beta_0, b^{(1)}, \ldots, b^{(d)} \in \mathbb{R}\Big\}.
\]
Minimizing the empirical mean-square error
\[
\hat R(f) = \frac{1}{N}\sum_{k=1}^N (y_k - f(x_k))^2
\]
over f ∈ F leads to the standard least-square regression estimator.
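A minimal sketch of this estimator (ours, not from the text), solving the empirical MSE minimization through the normal equations:

```python
import numpy as np

def least_squares_erm(X, y):
    # Minimize the empirical mean-square error over the affine class F by
    # least squares on the design matrix [1, X].
    Z = np.hstack([np.ones((X.shape[0], 1)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return lambda x: beta[0] + beta[1:] @ x

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = 0.5 + X @ np.array([1.0, -2.0]) + 0.1 * rng.standard_normal(200)
f_hat = least_squares_erm(X, y)
print(f_hat(np.array([1.0, 1.0])))   # close to 0.5 + 1 - 2 = -0.5
```
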
As another example, one can choose
\[
\mathcal{F} = \Big\{f : f(x) = \sum_{j=1}^p w_j\, \psi\Big(\beta_{j0} + \sum_{i=1}^d \beta_{ji}\, x^{(i)}\Big),\ w_j, \beta_{ji} \in \mathbb{R}\Big\},
\]
with a fixed function ψ. This corresponds to a two-layer perceptron model.
As a last example for now (we will see many others in the rest of this book), taking d = 1, the set
\[
\mathcal{F} = \Big\{f : \int_{\mathbb{R}} f''(x)^2\,dx < \mu\Big\}
\]
(with µ > 0) provides an infinite-dimensional space of predictors, which leads to spline regression.
5.4.2 Bias and variance

We give a further illustration of the bias-variance dilemma in the regression case, using the mean-square error and taking q = 1 to simplify. Denote the Bayes predictor by f∗(x) = E(Y | X = x).

Fix a function space F, and let fˆ∗ be the optimal predictor in F, in the sense that it minimizes E(|Y − f(X)|²) over f ∈ F. Then, letting fˆ_N ∈ F denote an estimated predictor,
\begin{align*}
R(\hat f_N) &= E(|Y - \hat f_N(X)|^2) \\
&= E(|Y - \hat f^*(X)|^2) + E(|\hat f_N(X) - \hat f^*(X)|^2) + 2E\big((Y - \hat f^*(X))(\hat f^*(X) - \hat f_N(X))\big).
\end{align*}
Let us make the assumption that there exists ε > 0 such that f_λ = fˆ∗ + λ(fˆ_N − fˆ∗) belongs to F for λ ∈ [−ε, ε]. This happens when F is a linear space, or more generally when F is convex and fˆ∗ is in its relative interior (see chapter 3). Let ψ : λ ↦ E(|Y − f_λ(X)|²), which is minimal at λ = 0. We have
\begin{align*}
\psi(\lambda) &= E(|Y - \hat f^*(X) - \lambda(\hat f_N(X) - \hat f^*(X))|^2) \\
&= E(|Y - \hat f^*(X)|^2) - 2\lambda E\big((Y - \hat f^*(X))(\hat f_N(X) - \hat f^*(X))\big) + \lambda^2 E(|\hat f_N(X) - \hat f^*(X)|^2)
\end{align*}
and
\[
0 = \psi'(0) = 2E\big((Y - \hat f^*(X))(\hat f^*(X) - \hat f_N(X))\big).
\]
We therefore get the identity
\[
R(\hat f_N) = E(|Y - \hat f^*(X)|^2) + E(|\hat f_N(X) - \hat f^*(X)|^2) = \text{"Bias"} + \text{"Variance"}.
\]
The bias can be further decomposed as
\[
E(|Y - \hat f^*(X)|^2) = E(|Y - f^*(X)|^2) + E(|f^*(X) - \hat f^*(X)|^2)
\]
because f∗ is the conditional expectation. As a result, we obtain an expression of the generalization error with three contributions, namely
\[
R(\hat f_N) = E(|Y - f^*(X)|^2) + E(|f^*(X) - \hat f^*(X)|^2) + E(|\hat f_N(X) - \hat f^*(X)|^2).
\]
The first term is the Bayes error. It is fixed by the joint distribution of X and Y and measures how well Y can be approximated by a function of X. The second term compares f∗ to its best approximation in F, and is therefore reduced by taking larger model spaces. The last term is the error caused by using the data to estimate fˆ∗; it increases with the size of F. This is illustrated in figure 5.1.

Remark 5.4 If the assumption made on fˆ∗ is not valid, one can write
\[
R(\hat f_N) = E(|Y - \hat f_N(X)|^2) \le 2\big(E(|Y - \hat f^*(X)|^2) + E(|\hat f_N(X) - \hat f^*(X)|^2)\big)
\]
and still obtain a control (as an inequality) of the generalization error by a bias-plus-variance sum. □
5.5 Evaluating the error

5.5.1 Generalization error

Given input and output variables X : Ω → R_X and Y : Ω → R_Y and a risk function r : R_Y × R_Y → [0, +∞), we have defined the generalization (or prediction) error as
\[
R(f) = E(r(Y, f(X))).
\]
[Figure 5.1: Sources of errors in statistical learning. When P∗ is the distribution of the data, the optimal predictor f∗ minimizes the expected loss function. Based on data Z₁, …, Z_N, the sample-based distribution is P̂ = (δ_{Z₁} + · · · + δ_{Z_N})/N, and the empirical loss is minimized over a subset S of the space of all possible estimators. The expected discrepancy between the resulting estimator and the one minimizing the true expected loss on the subspace is the "variance" of the method, and the expected discrepancy between this subspace-constrained estimator and the optimal one is the "bias."]
Recall that a training set T = ((x₁, y₁), …, (x_N, y_N)) is a realization T = 𝒯(ω) of the random variable 𝒯 = ((X₁, Y₁), …, (X_N, Y_N)), an i.i.d. sample of the joint distribution of (X, Y). A learning algorithm is a function T ↦ fˆ_T defined on the set of training sets, namely ⋃_{N=1}^∞ (R_X × R_Y)^N, and taking values in F.

For a given T and a specific algorithm, one is primarily interested in evaluating R(fˆ_T), the generalization error of the predictor estimated from observed data. To emphasize the fact that the training set is fixed in this expression, one often writes
\[
R(\hat f_T) = E\big(r(Y, \hat f_T(X)) \mid \mathcal{T} = T\big).
\]
If we also take the expectation with respect to 𝒯 (for fixed N), we obtain the averaged generalization risk,
\[
E(R(\hat f_{\mathcal{T}})) = E\big(r(Y, \hat f_{\mathcal{T}}(X))\big),
\]
which provides an evaluation of the average quality of the algorithm when evaluated on random training sets of size N. If A : T ↦ fˆ_T denotes the learning algorithm, we will write R_N(A) = E(R(fˆ_𝒯)).
Since their computation requires the knowledge of the joint distribution of X and Y, these errors are not available in practice. Given a training set T and a predictor f, one can compute the empirical error
\[
\hat R_T(f) = \frac{1}{N}\sum_{k=1}^N r(y_k, f(x_k)).
\]
Under the usual moment conditions, the law of large numbers implies that R̂_𝒯(f) → R(f) with probability one for any given predictor f. However, the law of large numbers cannot be applied to assess whether the in-sample error,
\[
E_T = \hat R_T(\hat f_T) = \frac{1}{N}\sum_{k=1}^N r(y_k, \hat f_T(x_k)),
\]
is a good approximation of the generalization error R(fˆ_T). This is because each term in the sum depends on the full data set, so that E_𝒯 is not a sum of independent terms. The in-sample error typically under-estimates the generalization error, sometimes with a large discrepancy.

When one has enough data, however, it is possible to set some of it aside to form a test set. Formally, a test set is a collection T′ = (x′₁, y′₁, …, x′_{N′}, y′_{N′}) considered as a realization of an i.i.d. sample of (X, Y), 𝒯′ = (X′₁, Y′₁, …, X′_{N′}, Y′_{N′}), independent of 𝒯.
5.5. EVALUATING THE ERROR 123

The test set error is then given by

E_{T,T′} = R̂_{T′}(f̂_T) = (1/N′) Σ_{k=1}^{N′} r(y′_k, f̂_T(x′_k)).

The law of large numbers (applied conditionally to T = T) implies that E_{T,T′} converges to R(f̂_T) with probability one when N′ → ∞.

However, in many applications, data acquisition is difficult or expensive (e.g., in the medical field) and sparing a part of it in order to form a test set is not a reasonable option. In such situations, cross-validation is generally a preferred alternative.

5.5.2 Cross validation

Cross-validation error

The n-fold cross-validation method (see, e.g., Stone [185]) separates the training set
into n non-overlapping sets of equal sizes, and estimates n predictors by leaving out
one of these subsets as a temporary test set. A generalization error is estimated from
each test set and averaged over the n results.

Let us formalize this computation after introducing some notation. We represent training data in the form T = (z₁, ..., z_N), a sample of a random variable Z. With this notation, we can include supervised problems, such as prediction (taking Z = (X, Y)), and unsupervised ones, such as density estimation (taking Z = X). One tries to estimate a function f within a given class (e.g., a predictor, or a density) and one has a measure of “loss,” denoted ℓ(f, z) ≥ 0, measuring how badly f performs on the data z. For prediction, one takes ℓ(f, z) = r(y, f(x)) with z = (x, y), and for density estimation, e.g., ℓ(f, z) = − log f(z), the negative log-likelihood. One then lets R(f) = E(ℓ(f, Z)). For an algorithm A : T ↦ f̂_T, the loss R̄(A) is the quantity of interest.

Given another training set T′ = (z′₁, ..., z′_{N′}), the empirical loss on T′ is

R̂_{T′}(f) = (1/N′) Σ_{k=1}^{N′} ℓ(f, z′_k)

and, using T as a training set and T′ as a test set, we let, as above,

E_{T,T′} = R̂_{T′}(f̂_T).

To define an n-fold cross-validation estimator of the error, one assumes that the training set T is partitioned into n subsets of equal sizes (up to one element if N is not a multiple of n), T₁, ..., T_n, so that T_i and T_j are non-intersecting if i ≠ j, and T = ∪_{i=1}^n T_i. For each i, let T^(i) = T \ T_i, which provides the training data with the elements of T_i removed. Then, the n-fold cross-validation error is defined by

E_CV(T) = (1/n) Σ_{i=1}^n E_{T^(i),T_i} .

Assuming, to simplify, that N is a multiple of n, the expectation of the cross-validation error is E(R(f̂_{T_{N′}})), where the average is made over training sets T_{N′} of size N′ = N − N/n. Note that the cross-validation error is an estimate of the average error of the algorithm over random training sets (of fixed size, N′), not necessarily that of the current estimator f̂_T. It returns an evaluation of the algorithm A : T ↦ f̂_T. When needed, one can emphasize this and write R̄_{CV,T}(A). Since N′ ≤ N and accuracy generally improves with the size of the training set, cross-validation typically over-estimates (on average) the error for the number of available training samples.

The limit case when n = N is called leave-one-out (LOO) cross validation. In this case E_CV is an almost unbiased estimator of E(R(f̂_T)), but, because it is an average of functions of the training set that are quite similar (and that will therefore be positively correlated), its variance (as a function of T) may be quite large. Conversely, smaller values of n will have smaller variances, but larger biases. In practice, it is difficult to assess which choice of n is optimal, although 5- or 10-fold cross-validation is quite popular. LOO cross-validation is also often used, especially when N is small.
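As a concrete illustration, the following is a minimal NumPy sketch of the computation of E_CV(T); it is not from the text, and the arguments fit (the training algorithm A) and loss (the map (f̂, z) ↦ ℓ(f̂, z), vectorized over samples) are placeholders for whatever estimator one uses.

```python
import numpy as np

def cv_error(z, fit, loss, n_folds=5, seed=0):
    """n-fold cross-validation error: the average of E_{T^(i), T_i}.

    z    : array of shape (N, ...) holding the samples z_1, ..., z_N
    fit  : training samples -> estimated predictor f-hat
    loss : (f_hat, test samples) -> vector of losses l(f_hat, z_k)
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(z))          # shuffle before splitting
    folds = np.array_split(idx, n_folds)   # T_1, ..., T_n (sizes differ by at most 1)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)    # T^(i) = T \ T_i
        f_hat = fit(z[train])
        errs.append(loss(f_hat, z[fold]).mean())   # E_{T^(i), T_i}
    return float(np.mean(errs))            # E_CV(T)
```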

Model selection using cross validation

Because it evaluates the quality of an algorithm, cross-validation is often used to perform model selection. Indeed, many learning algorithms depend on a parameter, which we will denote λ. In kernel density estimation, for example, λ = σ is the kernel width. For Gaussian mixtures, λ = m is the number of Gaussian terms in the mixture. Formally, this means that one has, for every λ, an algorithm A_λ : T ↦ f̂_{T,λ}.

Fixing a training set T, one can compute, for each λ, the cross-validation error e_T(λ) = R̄_{CV,T}(A_λ). Model selection is then performed by finding

λ*(T) = argmin_λ e_T(λ).

Once this λ* is obtained, the final estimator is f̂_{T,λ*(T)}, obtained by rerunning the algorithm one more time on the full training set.

This defines a new training algorithm, A* : T ↦ f̂_{T,λ*(T)}. It is a common mistake to consider that the cross-validation error associated with this algorithm is still given by e_T(λ*(T)). This is false, because the computation of λ* uses the full training set. To compute the cross-validation error of A*, one needs to encapsulate this model selection procedure in another cross-validation loop. So, one needs to compute, using the previous notation,

E_CV(T) = (1/n) Σ_{i=1}^n R̂_{T_i}(f̂_{T^(i),λ*(T^(i))})

where each f̂_{T^(i),λ*(T^(i))} is computed by running a cross-validated model selection procedure restricted to T^(i). This is often called a double-loop cross-validation procedure (the number of folds in the inner and outer loops do not have to coincide). Note that each λ*(T^(i)) does not necessarily coincide with the optimal λ*(T) obtained with the full training set.
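The double loop is easy to get wrong in practice; the following sketch (building on the cv_error function above, with the same caveats) makes the structure explicit: the inner loop selects λ*(T^(i)) using only T^(i), and the outer loop evaluates the resulting predictor on T_i.

```python
import numpy as np

def nested_cv_error(z, fit_lambda, loss, lambdas, n_outer=5, n_inner=5, seed=0):
    """Double-loop CV estimate of the error of A* : T -> f-hat_{T, lambda*(T)}."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(z))
    errs = []
    for fold in np.array_split(idx, n_outer):
        train = np.setdiff1d(idx, fold)               # T^(i)
        # inner loop: model selection restricted to T^(i)
        lam_star = min(lambdas, key=lambda lam: cv_error(
            z[train], lambda zt: fit_lambda(zt, lam), loss, n_inner))
        f_hat = fit_lambda(z[train], lam_star)        # refit on all of T^(i)
        errs.append(loss(f_hat, z[fold]).mean())      # evaluate on T_i
    return float(np.mean(errs))
```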
Chapter 6

Inner Products and Reproducing Kernels

6.1 Introduction

We will discuss later in this book various methods that specify the prediction as a linear function of the input. These methods are often applied after taking transformations of the original variables, in the form x ↦ h(x) (i.e., the prediction algorithm is applied to h(x) instead of x). We will refer to h as a “feature function,” which typically maps the initial data x ∈ R to a vector space, sometimes of infinite dimensions, that we will denote H (the “feature space”).

The present chapter provides a formal description of this framework, focusing, in particular, on situations in which H has an inner product, as this inner product is often instrumental in the design of linear methods on H. Many machine learning methods can indeed be expressed either as functions of the coordinates of the input data in some space, or as functions of the inner products between the input samples. Such methods can bypass the difficulty of using high-dimensional features with the help of the theory of “reproducing kernels” [12, 204], which ensures that the inner product between special classes of feature functions h(x) and h(x′) can be explicitly computed as a function of x and x′.

6.2 Basic Definitions

6.2.1 Inner-product spaces

We recall that a real vector space¹ is a set, H, on which an addition and a scalar product are defined, namely (h, h′) ∈ H × H ↦ h + h′ ∈ H and (λ, h) ∈ R × H ↦ λh ∈ H, and we assume that the reader is familiar with the theory of finite-dimensional spaces.

¹ All vector spaces in these notes will be real, and will therefore only be referred to as vector spaces.

An inner product on a vector space H is a symmetric bilinear function, typically denoted (ξ, η) ↦ ⟨ξ, η⟩, such that ⟨ξ, ξ⟩ ≥ 0, with ⟨ξ, ξ⟩ = 0 if and only if ξ = 0. A vector space equipped with an inner product is called an inner-product space. We will often denote the inner product with a subscript referring to the space (e.g., ⟨·, ·⟩_H). Given such a product, the function

ξ ↦ ‖ξ‖_H = √⟨ξ, ξ⟩_H

is a norm, so that H is also a normed space (but not all normed spaces are inner-product spaces)².

When a normed space is complete with respect to the topology induced by its norm, it is called a Banach space, or a Hilbert space when the norm is associated with an inner product. Completeness means that Cauchy sequences in this space always have a limit, i.e., if the sequence (ξ_n) is such that, for any ε > 0, there exists n₀ > 0 such that ‖ξ_n − ξ_m‖_H < ε for all n, m ≥ n₀, then there exists ξ such that ‖ξ_n − ξ‖_H → 0. Completeness is a very natural property. It allows, for example, for the definition of integrals such as ∫ h(t)dt as limits of Riemann sums for suitable functions h : R → H, leading (with more general notions of integrals) to proper definitions of expectations of H-valued random variables. Using a standard (abstract) construction, one can prove that any normed space (resp. inner-product space) can be extended to a Banach (resp. Hilbert) space within which it is dense.

Note that finite-dimensional normed spaces are always complete.

6.2.2 Feature spaces and kernels

Now, consider an input set, say R, and a mapping h from R to H, where H is an inner-product space. For us, R is the set over which the original input data is observed, typically R^d, and H is the feature space. One can define the function K_h : R × R → R by

K_h(x, y) = ⟨h(x), h(y)⟩_H .

The function K_h satisfies the following two properties.

[K1] K_h is symmetric, namely K_h(x, y) = K_h(y, x) for all x and y in R.

[K2] For any n > 0, for any choice of scalars λ₁, ..., λ_n ∈ R and any x₁, ..., x_n ∈ R, one has

Σ_{i,j=1}^n λ_i λ_j K_h(x_i, x_j) ≥ 0.   (6.1)

² Note that we are using double bars for the norm in H, which, in most applications, is infinite dimensional.

The first property is obvious, and the second one results from the fact that one can write

Σ_{i,j=1}^n λ_i λ_j K_h(x_i, x_j) = Σ_{i,j=1}^n λ_i λ_j ⟨h(x_i), h(x_j)⟩_H = ‖Σ_{i=1}^n λ_i h(x_i)‖²_H ≥ 0.   (6.2)

This leads us to the following definition.

Definition 6.1 A function K : R × R → R satisfying properties [K1] and [K2] is called a positive kernel.

One says that the kernel is positive definite if the sum in (6.1) cannot vanish unless (i) λ₁ = · · · = λ_n = 0 or (ii) x_i = x_j for some i ≠ j.

An equivalent definition of positive kernels can be given using kernel matrices,


for which we introduce a notation.

Definition 6.2 If K : R × R → R is given, we define, for every x₁, ..., x_n ∈ R, the kernel matrix K_K(x₁, ..., x_n) with entries K(x_i, x_j), for i, j = 1, ..., n. (If K is understood from the context, we will simply write K(x₁, ..., x_n) instead of K_K(x₁, ..., x_n).)

Given this notation, it is clear that K is a positive kernel if and only if, for all x₁, ..., x_n ∈ R, the matrix K_K(x₁, ..., x_n) is symmetric positive semidefinite. It is a positive definite kernel if K_K(x₁, ..., x_n) is positive definite as soon as all the x_j are distinct. This latter condition is obviously needed since, if x_i = x_j, the ith and jth columns of the kernel matrix coincide and this matrix cannot be full-rank.
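For a numerical sanity check of this characterization (a sketch of our own, not part of the text), one can build the kernel matrix at random points and verify that its eigenvalues are non-negative up to round-off; the Gaussian kernel used below is introduced later in this chapter.

```python
import numpy as np

def kernel_matrix(K, xs):
    """Kernel matrix K_K(x_1, ..., x_n), with entries K(x_i, x_j)."""
    return np.array([[K(xi, xj) for xj in xs] for xi in xs])

K = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2.0)   # a positive kernel on R^d
xs = np.random.default_rng(0).normal(size=(8, 3))      # 8 points in R^3
eigs = np.linalg.eigvalsh(kernel_matrix(K, xs))        # symmetric: real spectrum
assert eigs.min() > -1e-10   # positive semi-definite, up to numerical error
```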

Remark 6.3 It is important to point out that K being a positive kernel does not require that K(x, y) ≥ 0 for all x, y ∈ R (see examples in the next section). However, it does imply that K(x, x) ≥ 0 for all x ∈ R, since diagonal elements of positive semi-definite matrices are non-negative.

The function K_h defined above is therefore always a positive kernel, but not always positive definite, as seen below. We will also see later that the converse statement is true: any positive kernel K : R × R → R can be expressed as K_h for some feature function h between R and some feature space H.

Given a feature function h : R → H, we will denote by V_h = span(h(x), x ∈ R) the vector space generated by the features, which, by definition, is the space of all linear combinations

ξ = Σ_{i=1}^n λ_i h(x_i)

with λ₁, ..., λ_n ∈ R, x₁, ..., x_n ∈ R and n ≥ 0 (by convention, ξ = 0 if n = 0). Then K_h is positive definite if and only if any family (h(x₁), ..., h(x_n)) with distinct x_i’s is linearly independent. This is a direct consequence of (6.2):

Σ_{i,j=1}^n λ_i λ_j K_h(x_i, x_j) = ‖Σ_{i=1}^n λ_i h(x_i)‖²_H .

This implies in particular that positive-definite kernels over infinite input spaces R can only be associated with infinite-dimensional spaces H, since V_h ⊂ H.

6.3 First examples

6.3.1 Inner product

Clearly, if R is an inner-product space, it has an associated reproducing kernel, defined by

K(x, y) = ⟨x, y⟩_R .

This kernel is equal to K_h with H = R and h = id (the identity mapping). In particular, K(x, y) = xᵀy is a positive kernel if R = R^d. This kernel can obviously take positive and negative values.

Notice that this kernel is not positive definite, because the rank of K(x1 , . . . , xn ) is
equal to the dimension of span(x1 , . . . , xn ), which can be less than n even when the
xi ’s are distinct.

6.3.2 Polynomial Kernels

Consider R = R^d and define

h(x) = (x^(i₁) · · · x^(i_k), 1 ≤ i₁, ..., i_k ≤ d),

which contains all products of degree k formed from the variables x^(1), ..., x^(d), i.e., all monomials of degree k in x. This function takes its values in the space H = R^{N_k}, where N_k = d^k. Using, in H, the inner product ⟨ξ, η⟩_H = ξᵀη, we have

K_h(x, y) = Σ_{1≤i₁,...,i_k≤d} (x^(i₁)y^(i₁)) · · · (x^(i_k)y^(i_k)) = (xᵀy)^k .

This provides the homogeneous polynomial kernel of order k.



If one now takes all monomials of order less than or equal to k, i.e.,

h(x) = (x^(i₁) · · · x^(i_l), 1 ≤ i₁, ..., i_l ≤ d, 0 ≤ l ≤ k),

which now takes values in a space of dimension 1 + d + · · · + d^k, the corresponding kernel is

K_h(x, y) = 1 + (xᵀy) + · · · + (xᵀy)^k = ((xᵀy)^{k+1} − 1)/(xᵀy − 1)

(the last expression being valid when xᵀy ≠ 1).
This provides a polynomial kernel of order k. It is important to notice here that, even
though the dimension of the feature space increases exponentially in k, so that the
computation of the feature function rapidly becomes intractable, the computation
of the kernel itself remains a relatively mild operation.

One can make variations on this construction. For example, choosing any family c₀, c₁, ..., c_k of positive numbers, one can take

h(x) = (c_l x^(i₁) · · · x^(i_l), 1 ≤ i₁, ..., i_l ≤ d, 0 ≤ l ≤ k)

yielding

K_h(x, y) = c₀² + c₁²(xᵀy) + · · · + c_k²(xᵀy)^k .

Taking c_l = (k choose l)^{1/2} α^l for some α > 0, we get another form of polynomial kernel, namely,

K_h(x, y) = (1 + α² xᵀy)^k .
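The identity K_h(x, y) = (xᵀy)^k can be checked numerically by enumerating the d^k monomials explicitly for small d and k; the sketch below (our own illustration, with hypothetical function names) makes the contrast between the exponentially large feature space and the cheap kernel evaluation concrete.

```python
import itertools
import numpy as np

def poly_features(x, k):
    """All degree-k monomials x^(i1) ... x^(ik), 1 <= i_1, ..., i_k <= d."""
    return np.array([np.prod(x[list(idx)])
                     for idx in itertools.product(range(len(x)), repeat=k)])

rng = np.random.default_rng(1)
x, y, k = rng.normal(size=3), rng.normal(size=3), 4
# The feature-space inner product agrees with the kernel (x^T y)^k,
# even though h(x) lives in R^(d^k) = R^81 here.
assert np.isclose(poly_features(x, k) @ poly_features(y, k), (x @ y) ** k)
```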

6.3.3 Functional Features

We now consider an example in which H is infinite dimensional. Let R = R^d. We assume that a function s : R^d → R is chosen, such that s is both (absolutely) integrable and square integrable. We also fix a scaling parameter ρ > 0. Associate to x ∈ R^d the function

ξ_x : y ↦ s((y − x)/ρ),

which is also square integrable (as a function of y). We define the feature function h : x ↦ ξ_x from R^d to H = L²(R^d), the space of square integrable functions on R^d with inner product

⟨ξ, η⟩_H = ∫_{R^d} ξ(z)η(z)dz.

The resulting kernel is

K_h(x, y) = ∫_{R^d} s((z − x)/ρ) s((z − y)/ρ) dz = ρ^d ∫_{R^d} s(z) s(z − (y − x)/ρ) dz.

Note that K_h(x, y) is “translation-invariant,” which means that it only depends on x − y. It takes the form K_h(x, y) = ρ^d Γ((y − x)/ρ) where

Γ(u) = ∫_{R^d} s(z) s(z − u) dz

is the convolution³ of s with s̃ : z ↦ s(−z).

Let σ be the Fourier transform of s, i.e.,

σ(ω) = ∫_{R^d} e^{−2iπωᵀu} s(u) du.

Because s is real-valued, we have σ(−ω) = σ̄(ω), the complex conjugate of σ. Moreover, σ̄ is also the Fourier transform of s̃. Using the fact that the Fourier transform of the convolution of two functions is the product of their Fourier transforms, we see that the Fourier transform of Γ = s ∗ s̃ is equal to |σ|². Applying the inverse transform, we find

Γ(u) = ∫_{R^d} e^{2iπωᵀu} |σ(ω)|² dω = ∫_{R^d} e^{−2iπωᵀu} |σ(−ω)|² dω .

This form is (almost) characteristic of translation-invariant kernels.

Let us consider a few examples of kernels that can be obtained in this way.

(1) Take d = 1 and let s be the indicator function of the interval [−1/2, 1/2]. Then, one finds

Γ(t) = max(1 − |t|, 0) .

In this case, the space V_h is the space of all functions expressed as finite sums

z ↦ Σ_{j=1}^n λ_j 1_{[x_j−ρ/2, x_j+ρ/2]}(z) ,

and therefore is a space of compactly-supported piecewise constant functions. Such a function computed with distinct x_j’s cannot vanish everywhere unless all λ_j’s vanish, so that K_h is positive definite. Indeed, let

f(z) = Σ_{j=1}^n λ_j 1_{[x_j−ρ/2, x_j+ρ/2]}(z),

assume without loss of generality that x₁ < x₂ < · · · < x_n, and let x_{n+1} = ∞. Let i₀ be the smallest index j such that λ_j ≠ 0, assuming that such an index exists. Then f(z) = λ_{i₀} ≠ 0 for all z ∈ [x_{i₀} − ρ/2, x_{i₀+1} − ρ/2), which is a non-empty interval. So, if f vanishes almost everywhere, we must have λ_j = 0 for all j = 1, ..., n.
³ The convolution between two absolutely integrable functions f and g is defined by f ∗ g(u) = ∫_{R^d} f(z) g(u − z) dz.

(2) Still with d = 1, let s(z) = e^{−|z|}. Then, for t > 0,

Γ(t) = ∫_{−∞}^{∞} e^{−|z|} e^{−|z−t|} dz
     = ∫_{−∞}^{0} e^{z} e^{z−t} dz + ∫_{0}^{t} e^{−z} e^{z−t} dz + ∫_{t}^{∞} e^{−z} e^{−z+t} dz
     = e^{−t}/2 + t e^{−t} + e^{−t}/2
     = (1 + t) e^{−t} .

Using the fact that Γ(−t) = Γ(t) (make the change of variable z → −z in the integral), we get

Γ(t) = (1 + |t|) e^{−|t|}

for all t. This shows that

K(x, y) = (1 + |x − y|) e^{−|x−y|}

is a positive kernel on R^d.
(3) Take s(z) = e^{−|z|²/2}, z ∈ R^d. Then

Γ(u) = ∫_{R^d} e^{−(|z|² + |u−z|²)/2} dz = e^{−|u|²/4} ∫_{R^d} e^{−|z−u/2|²} dz = π^{d/2} e^{−|u|²/4} .

This provides a special case of Gaussian kernel.

6.3.4 General construction theorems

Translation invariance

As introduced above, a kernel K is translation invariant if it takes the form K(x, y) = Γ(x − y) for some continuous function Γ defined on R^d. Bochner’s theorem [32] states that such a K is a positive kernel if and only if Γ is the Fourier transform of a positive measure, namely,

Γ(x) = ∫_{R^d} e^{−2iπ⟨x,ω⟩} dµ(ω)

where µ is a positive and symmetric (invariant by sign change) measure on R^d. For example, one can take dµ(ω) = ν(ω)dω, where ν is an integrable, positive and even function.

This theorem provides an at least numerical, and sometimes analytical, method


for constructing kernels. The previous section exhibited a special case of translation-
invariant kernel for which ν = |σ |2 .
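Bochner’s representation also underlies a Monte Carlo construction sometimes called random Fourier features (not discussed in the text): sampling frequencies ω from µ, together with random phases, yields an explicit finite-dimensional feature map whose inner products approximate Γ(x − y). The sketch below is our own illustration for the Gaussian kernel Γ(u) = e^{−|u|²/2}, whose spectral measure is itself Gaussian.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_feat = 3, 20000
omega = rng.normal(size=(n_feat, d))    # frequencies sampled from the spectral measure
b = rng.uniform(0, 2 * np.pi, n_feat)   # random phases
phi = lambda x: np.sqrt(2.0 / n_feat) * np.cos(omega @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
approx = phi(x) @ phi(y)                         # Monte Carlo estimate of Gamma(x - y)
exact = np.exp(-np.sum((x - y) ** 2) / 2.0)      # Gaussian kernel
print(abs(approx - exact))   # O(n_feat^{-1/2}) Monte Carlo error
```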

Radial kernels

A radial kernel takes the form K(x, y) = γ(|x − y|²), for some continuous function γ defined on [0, +∞). Schoenberg’s theorem [174] states that, if this function γ is universally valid, i.e., K is a kernel for all dimensions d, then it must take the form

γ(t) = ∫_0^∞ e^{−λt} dµ(λ)

for some positive finite measure µ on [0, +∞).

For example, when µ is a Dirac measure, i.e., µ = δ_{(2a)^{−1}} for some a > 0, then K(x, y) = exp(−|x − y|²/(2a)), which is the Gaussian kernel. Taking dµ = e^{−aλ}dλ yields γ(t) = 1/(t + a), and dµ = λe^{−aλ}dλ yields γ(t) = 1/(a + t)².

There is also, in Schoenberg [174], a characterization of radial kernels for a fixed dimension d. Such kernels must take the form

γ(t) = ∫_0^{+∞} Ω_d(tλ) dµ(λ)

with Ω_d(t) = Γ(d/2)(2/t)^{(d−2)/2} J_{(d−2)/2}(t), where J_{(d−2)/2} is Bessel’s function of the first kind (and where Γ here denotes the Gamma function).

6.3.5 Operations on kernels

Kernels can be combined in several ways, as described in the next proposition.

Proposition 6.4 Let K₁ : R × R → R and K₂ : R × R → R be positive kernels. Then the following assertions hold.

(i) If λ₁, λ₂ > 0, λ₁K₁ + λ₂K₂ is a positive kernel. It is positive definite as soon as either K₁ or K₂ is positive definite.

(ii) For any function f : R′ → R, K′₁(x′, y′) = K₁(f(x′), f(y′)) is a positive kernel. It is positive definite as soon as K₁ is positive definite and f is one-to-one.

(iii) K(x, y) = K₁(x, y)K₂(x, y) is a positive kernel. It is positive definite as soon as K₁ and K₂ are positive definite.

(iv) Let K₁ and K₂ be translation-invariant with R = R^d, taking the form K_i(x, y) = Γ_i(x − y), where Γ_i is continuous (i = 1, 2). Assume that one of the two functions Γ₁, Γ₂ is integrable on R^d. Then

K(x, y) = ∫_{R^d} K₁(x, z)K₂(z, y) dz

is also a positive kernel.

Proof Point (i) is obvious. Point (ii) is almost as simple, because, for any λ₁, ..., λ_n ∈ R and x′₁, ..., x′_n ∈ R′,

Σ_{i,j=1}^n λ_i λ_j K′₁(x′_i, x′_j) = Σ_{i,j=1}^n λ_i λ_j K₁(f(x′_i), f(x′_j)) ≥ 0.

If K₁ is positive definite, then the latter sum can only vanish if all λ_i are zero, or some of the points in (f(x′₁), ..., f(x′_n)) coincide. If, in addition, f is one-to-one, then this is equivalent to: all λ_i are zero, or some of the points in (x′₁, ..., x′_n) coincide, so that K′₁ is positive definite.

To prove point (iii), take x₁, ..., x_N ∈ R^d and form the matrices K_i = K_i(x₁, ..., x_N), i = 1, 2, which are, by assumption, positive semi-definite. The matrix K = K(x₁, ..., x_N) is the element-wise (or Hadamard) product of K₁ and K₂, and the conclusion follows from the linear algebra result stating that the Hadamard product of two positive semi-definite (resp. positive definite) matrices A = (a(i, j), 1 ≤ i, j ≤ N) and B = (b(i, j), 1 ≤ i, j ≤ N) is positive semi-definite (resp. positive definite). This is proved by diagonalizing, say, A in an orthonormal basis u₁, ..., u_N, with eigenvalues λ₁, ..., λ_N, and writing

Σ_{i,j=1}^N α^(i) a(i, j) b(i, j) α^(j) = Σ_{i,j,k=1}^N α^(i) u_k^(i) u_k^(j) λ_k b(i, j) α^(j)
   = Σ_{k=1}^N λ_k Σ_{i,j=1}^N (α^(i) u_k^(i))(α^(j) u_k^(j)) b(i, j) ≥ 0.

If B is positive definite, then the sum above can be zero only if, for each k, either λ_k = 0 or α^(i) u_k^(i) = 0 for all i. If A is also positive definite, then the only possibility is α^(i) u_k^(i) = 0 for all i and k, which implies α^(i) = 0 for all i, since the u_k form an orthonormal basis.

To prove point (iv)⁴, we first note that a translation-invariant kernel K′(x, y) = Γ′(x − y) is always bounded. Indeed, the kernel matrix K′(x, 0) is positive semi-definite, with determinant Γ′(0)² − Γ′(x)² ≥ 0, showing that |Γ′(x)| ≤ Γ′(0). This shows that the integral defining K(x, y) converges as soon as one of the two functions Γ₁ or Γ₂ is integrable. Moreover, we have K(x, y) = Γ(x − y) with

Γ(w) = ∫_{R^d} Γ₁(w − u)Γ₂(u) du .

Using the fact that both Γ₁ and Γ₂ are even, and making the change of variable u ↦ −u, one easily shows that Γ(w) = Γ(−w), which implies that K is symmetric.
4 This part of the proof uses some measure theory.

We proceed with the assumption that Γ₂ is integrable and use Bochner’s theorem to write

Γ₁(y) = ∫_{R^d} e^{−2iπξᵀy} dµ₁(ξ)

for some positive finite measure µ₁. Then

Γ(x) = ∫_{R^d} (∫_{R^d} e^{−2iπξᵀ(x−z)} dµ₁(ξ)) Γ₂(z) dz = ∫_{R^d} e^{−2iπξᵀx} (∫_{R^d} e^{2iπξᵀz} Γ₂(z) dz) dµ₁(ξ).

The exchange in the order of the variables ξ and z uses Fubini’s theorem. The function

ψ(ξ) = ∫_{R^d} e^{2iπξᵀz} Γ₂(z) dz

is the inverse Fourier transform of Γ₂. Because Γ₂ is bounded and integrable, it is also square integrable, which implies that its inverse Fourier transform is also a square integrable function. Since Bochner’s theorem implies that Γ₂ is the Fourier transform of a positive measure µ₂, we find, using the injectivity of the Fourier transform, that ψ is non-negative. So Γ is the Fourier transform of the finite positive measure ψdµ₁, which implies that K is a positive kernel.

Point (iv) can be related to the following discrete statement on symmetric matrices: if A and B are positive semi-definite and they commute, so that AB = BA, then AB is positive semi-definite. In the case of kernels, one may consider the symmetric linear operators K_i : f ↦ ∫_{R^d} K_i(·, y)f(y)dy, which map the space of square integrable functions into itself. Then K₁ and K₂ commute and K = K₁K₂.

6.3.6 Canonical Feature Spaces

Let K be a positive kernel on a set R. The following construction, which is fun-


damental, shows that K can always be associated with a feature function h taking
values in a suitably chosen inner-product space H.

Associate to each x ∈ R the function ξ_x : y ↦ K(y, x) (we will also write ξ_x = K(·, x)), and let H_K = span(ξ_x, x ∈ R), a subspace of the vector space of all functions from R to R. Define the feature function h : x ↦ ξ_x from R to H_K. There is a unique inner product on H_K such that K = K_h. Indeed, by definition, this requires

⟨K(·, x), K(·, y)⟩_{H_K} = K(x, y) .   (6.3)


Moreover, by linearity, for any ξ = Σ_{i=1}^n λ_i K(·, x_i) and η = Σ_{j=1}^m µ_j K(·, y_j), one needs

⟨ξ, η⟩_{H_K} = Σ_{i=1}^n Σ_{j=1}^m λ_i µ_j K(x_i, y_j) ,

so that the inner product is uniquely specified on H_K. To make sure that this inner product is well defined, we must check that there is no ambiguity, in the sense that, if ξ has an alternative decomposition ξ = Σ_{i=1}^{n′} λ′_i K(·, x′_i), then the value of ⟨ξ, η⟩_{H_K} remains unchanged. But this is clear, because one can also write

⟨ξ, η⟩_{H_K} = Σ_{j=1}^m µ_j ξ(y_j) ,

which only depends on ξ and not on its decomposition. The linearity of the product
with respect to ξ is also clear from this expression, and the bilinearity by symmetry.

The Cauchy-Schwarz inequality implies that

|⟨ξ, η⟩_{H_K}| ≤ ‖ξ‖_{H_K} ‖η‖_{H_K} ,

from which we deduce that ‖ξ‖_{H_K} = 0 implies ⟨ξ, η⟩_{H_K} = 0 for all η ∈ H_K. Since ⟨ξ, K(·, y)⟩_{H_K} = ξ(y) for all y, this also implies that ξ = 0, completing the proof that H_K is an inner-product space.

Equation (6.3) is the “reproducing property” of the kernel for the inner-product
on HK . In functional analysis, the completion, ĤK , of HK for the topology associated
to its norm is then a Hilbert space, and is referred to as a “reproducing kernel Hilbert
space,” or RKHS.

More generally, an inner-product space H of functions h : R → R is a reproducing kernel Hilbert space if H is a complete space (which makes it Hilbert) and there exists a positive kernel K such that:

[RKHS1] For all x ∈ R, K(·, x) belongs to H.

[RKHS2] For all h ∈ H and x ∈ R,

⟨h, K(·, x)⟩_H = h(x) .

Returning to the example of functional features in section 6.3.3, we have two different representations of the kernel in feature space, namely in H = L²(R^d), or in H_K, with a different inner product. This is not a contradiction; it simply shows that the representation of a positive kernel in terms of a feature function is not unique.

Remark 6.5 RKHS’s are defined as function spaces. While feature space representations, provided by functions h : R → H from R to a Hilbert space H, are apparently more general, a simple transformation allows for an identification of (a subspace of) H with an RKHS. We will assume that the subset h(R) (containing all h(x), x ∈ R) is a dense subset of H. If not, one can simply replace H by the closure of span(h(R)), which is a Hilbert subspace of H.

One can always interpret elements u ∈ H as functions by letting ϕ_u(x) = ⟨u, h(x)⟩_H. The representation u ↦ ϕ_u is one-to-one under our assumption that h(R) is a dense subset of H: if ϕ_u = ϕ_v, then ⟨u − v, h(x)⟩_H = 0 for all x ∈ R, and since h(R) is dense in H, this implies u = v. Letting H̃ = {ϕ_u, u ∈ H} and defining the inner product ⟨ϕ_u, ϕ_{u′}⟩_{H̃} = ⟨u, u′⟩_H, one obtains a Hilbert space H̃ isometric to H with a new feature function h̃(x) = ϕ_{h(x)}. By definition, h̃(x)(y) = ⟨h(x), h(y)⟩_H = K_h(x, y), so that K_h(x, ·) belongs to H̃ and ⟨h̃(x), h̃(y)⟩_{H̃} = ⟨h(x), h(y)⟩_H = K_h(x, y). This shows that H̃ is an RKHS with kernel K_h.

6.4 Projection on a finite-dimensional subspace

If H is an inner-product space and V is a subspace of H, one defines the orthogonal projection of an element ξ ∈ H on V as its closest point in V, that is, the element η* of V minimizing the function F : η ↦ ‖η − ξ‖²_H over all η ∈ V. This closest point does not always exist, but it does in the special case in which V is finite dimensional (or, more generally, when V is a closed subspace of H; see Yosida [208]). We state, without proof, some of the properties of this operation.

Assuming that V is closed, this minimizer is unique and will be denoted η* = π_V(ξ). Moreover, π_V is a linear transformation from H to V, and η* is characterized by the properties

η* ∈ V and ξ − η* ⊥ V ,

the last condition meaning that ⟨ξ − η*, η⟩_H = 0 for all η ∈ V.

Because ‖ξ‖²_H = ‖π_V(ξ)‖²_H + ‖ξ − π_V(ξ)‖²_H, one always has ‖π_V(ξ)‖_H ≤ ‖ξ‖_H, with equality if and only if π_V(ξ) = ξ, i.e., if and only if ξ ∈ V.

If V is finite-dimensional and η₁, ..., η_n is a basis of V, then π_V(ξ) is given by

π_V(ξ) = Σ_{i=1}^n α^(i) η_i

with α (considered as a column vector in R^n) given by

α = Gram(η₁, ..., η_n)^{−1} λ ,

where λ ∈ R^n is the vector with coordinates λ^(i) = ⟨ξ, η_i⟩_H, i = 1, ..., n. The Gram matrix of η₁, ..., η_n, denoted Gram(η₁, ..., η_n), is the n by n matrix with entries ⟨η_i, η_j⟩_H for i, j = 1, ..., n.

If A is a subset of H, the set A^⊥ consists of all vectors perpendicular to A, namely

A^⊥ = {h ∈ H : ⟨h, h̃⟩_H = 0 for all h̃ ∈ A} .

If V is a finite-dimensional (or, more generally, closed) subspace of H, then any point h in H is decomposed as h = π_V(h) + (h − π_V(h)) with h − π_V(h) ∈ V^⊥. This shows that π_{V^⊥} is well defined and equal to id_H − π_V.

Orthogonal projections can be applied to function interpolation in an RKHS. Assume that H is an RKHS, as described at the end of the previous section, with a positive-definite kernel. Given distinct points x₁, ..., x_N ∈ R and values α₁, ..., α_N ∈ R, the interpolation problem consists in finding h ∈ H with minimal norm satisfying h(x_k) = α_k, k = 1, ..., N. Consider the finite-dimensional space

V = span{K(·, x_k), k = 1, ..., N} .

Then there exists an element h₀ ∈ V that satisfies the constraints. Indeed, looking for h₀ in the form

h₀(x) = Σ_{l=1}^N K(x, x_l) λ_l

one has

h₀(x_k) = Σ_{l=1}^N K(x_k, x_l) λ_l

so that

(λ₁, ..., λ_N)ᵀ = K(x₁, ..., x_N)^{−1} (α₁, ..., α_N)ᵀ .

Any other function h satisfying the constraints satisfies h(xk ) − h0 (xk ) = 0, which,
using RKHS2, is equivalent to hh − h0 , K(·, xk )iH = 0, i.e., to h − h0 ∈ V ⊥ . This shows
that h0 = πV (h), so that khkH ≥ kh0 kH and h0 provides the optimal interpolation. We
summarize this in the proposition:

Proposition 6.6 Let H be an RKHS with a positive-definite kernel. Let x₁, ..., x_N ∈ R be distinct points and α₁, ..., α_N ∈ R. Then the function h ∈ H with minimal norm satisfying h(x_k) = α_k, k = 1, ..., N, takes the form

h(x) = Σ_{l=1}^N K(x, x_l) λ_l   (6.4a)

with

(λ₁, ..., λ_N)ᵀ = K(x₁, ..., x_N)^{−1} (α₁, ..., α_N)ᵀ .   (6.4b)
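A direct NumPy transcription of (6.4a)–(6.4b) (a sketch under the proposition’s assumptions, with a Gaussian kernel chosen purely for illustration) reads:

```python
import numpy as np

def kernel_interpolate(K, xs, alphas):
    """Minimal-norm RKHS interpolant of the data (x_k, alpha_k), per (6.4a)-(6.4b)."""
    Kmat = np.array([[K(xi, xj) for xj in xs] for xi in xs])
    lam = np.linalg.solve(Kmat, alphas)                            # (6.4b)
    return lambda x: sum(l * K(x, xk) for l, xk in zip(lam, xs))   # (6.4a)

K = lambda x, y: np.exp(-((x - y) ** 2) / 2.0)   # positive-definite kernel on R
h = kernel_interpolate(K, np.array([0.0, 1.0, 2.5]), np.array([1.0, -1.0, 0.5]))
print(h(0.0), h(1.0), h(2.5))   # reproduces the prescribed values alpha_k
```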

A variation of this problem replaces the constraint by a penalty added to the squared norm minimized in the orthogonal projection, namely, minimizing (in h ∈ H)

‖h‖²_H + σ² Σ_{k=1}^N |h(x_k) − α_k|² .

Letting h₀ = π_V(h), so that h₀(x_k) = h(x_k) for all k, this expression can be rewritten as

‖h₀‖²_H + ‖h − h₀‖²_H + σ² Σ_{k=1}^N |h₀(x_k) − α_k|² .

This shows that the optimal h must coincide with its projection on V, and therefore belong to that subspace. Looking for h in the form

h(·) = Σ_{l=1}^N K(·, x_l) λ_l ,

the objective function is rewritten as

Σ_{k,l=1}^N K(x_k, x_l) λ_k λ_l + σ² Σ_{k=1}^N (Σ_{l=1}^N K(x_k, x_l) λ_l − α_k)² ,

which, in vector notation, writing λ = (λ₁, ..., λ_N)ᵀ and α = (α₁, ..., α_N)ᵀ, gives

λᵀ K(x₁, ..., x_N) λ + σ² (K(x₁, ..., x_N)λ − α)ᵀ (K(x₁, ..., x_N)λ − α) .

The differential of this expression in λ is

2K(x₁, ..., x_N)λ + 2σ² K(x₁, ..., x_N)(K(x₁, ..., x_N)λ − α).

Assuming that x₁, ..., x_N are distinct, this vanishes if and only if

λ = (K(x₁, ..., x_N) + (1/σ²)Id_{R^N})^{−1} α.

We have just proved the proposition:

Proposition 6.7 Let H be an RKHS with a positive-definite kernel. Let x₁, ..., x_N ∈ R be distinct points and α₁, ..., α_N ∈ R. Then the unique minimizer of

h ↦ ‖h‖²_H + σ² Σ_{k=1}^N |h(x_k) − α_k|²

on H is given by

h(x) = Σ_{l=1}^N K(x, x_l) λ_l   (6.5a)

with

(λ₁, ..., λ_N)ᵀ = (K(x₁, ..., x_N) + (1/σ²)Id_{R^N})^{−1} (α₁, ..., α_N)ᵀ .   (6.5b)
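The penalized version differs from exact interpolation only through the regularization of the kernel matrix; a minimal sketch of (6.5a)–(6.5b), with the same caveats as the previous sketch:

```python
import numpy as np

def kernel_smooth(K, xs, alphas, sigma2):
    """Minimizer of ||h||^2 + sigma^2 * sum_k |h(x_k) - alpha_k|^2, per (6.5a)-(6.5b)."""
    Kmat = np.array([[K(xi, xj) for xj in xs] for xi in xs])
    lam = np.linalg.solve(Kmat + np.eye(len(xs)) / sigma2, alphas)   # (6.5b)
    return lambda x: sum(l * K(x, xk) for l, xk in zip(lam, xs))     # (6.5a)
```

As σ² → ∞, the added diagonal term vanishes and one recovers the exact interpolant of proposition 6.6.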
Chapter 7

Linear Models for Regression

In regression, linear models refer to situations in which one tries to predict the dependent variable Y ∈ RY = R^q by a function f̂(X) of the independent variable X ∈ RX, where f̂ is optimized over a linear space F. The most common situation is the “standard linear model,” for which RX = R^d and

F = {f(x) = a₀ + bᵀx : a₀ ∈ R^q, b ∈ M_{d,q}(R)}.   (7.1)

More generally, with q = 1, given a mapping h : R → H, where H is an inner-product space, one can take:

F = {f(x) = a₀ + ⟨b, h(x)⟩_H : a₀ ∈ R, b ∈ H}.   (7.2)

Note that h can be nonlinear, and F can be infinite dimensional. Such sets correspond to linear models using feature functions, and will be addressed using kernel methods in this chapter.

Note also that, even if the model is linear, the associated training algorithms
can be nonlinear, and we will review in fact several situations in which solving the
estimation problem requires nonlinear optimization methods.

7.1 Least-Square Regression

7.1.1 Notation and Basic Estimator

We denote by Y and X the dependent and independent variables of the regression


problem. We will assume that Y takes values in Rq and that X takes values in a set
RX , which will, by default, be equal to Rd , except when discussing kernel methods,
for which this set can be arbitrary (provided that there is a mapping h from RX to
an inner product space H with an easily computable kernel).


Least-square regression uses the risk function r(y, y′) = |y − y′|². The prediction error is then R(f) = E(|Y − f(X)|²) for any predictor f and the Bayes predictor is the conditional expectation x ↦ E(Y | X = x) (see Example 1 in section 5.2). We also start with the standard setting where RX = R^d and F given by (7.1).

We will use the following notation, which sometimes simplifies the computation. If x ∈ R^d, we let x̃ = (1, xᵀ)ᵀ, which belongs to R^{d+1}. The linear predictor f(x) = a₀ + bᵀx with a₀ ∈ R^q, b ∈ M_{d,q}(R) can then be written as f(x) = βᵀx̃, where β ∈ M_{d+1,q}(R) is the matrix whose first row is a₀ᵀ and whose remaining rows are those of b.

In a model-based approach, the linear model is a Bayes predictor under the generative assumption that Y = a₀ + bᵀX + ε where ε is a residual noise satisfying E(ε | X) = 0, which is true, for example, when ε is centered and independent of X. If one further specifies the model so that ε is Gaussian, centered and independent of X, and one assumes that the distribution of X does not depend on a₀ and b, then the maximum likelihood estimator of these parameters based on a training set T = ((x₁, y₁), ..., (x_N, y_N)) must minimize the “residual sum of squares:”

RSS(β) ≜ N R̂(f) = Σ_{k=1}^N |y_k − f(x_k)|² = Σ_{k=1}^N |y_k − βᵀx̃_k|² .

In other terms, the model-based approach is identical, under these (standard) as-
sumptions, to empirical risk minimization (section 5.4), on which we now focus.
(Recall that, even when using a model-based approach, one does not make assump-
tions on the true distribution of X and Y ; one rather treats the model as an approxi-
mation of these distributions, estimated by maximum likelihood, and uses the Bayes
predictor for the estimated model.)

The computation of the optimal regression parameters is made easier by the introduction of the following matrices. Introduce the N × (d + 1) matrix X with rows x̃₁ᵀ, ..., x̃_Nᵀ and the N × q matrix Y with rows y₁ᵀ, ..., y_Nᵀ, that is:

X = ( 1  x₁^(1)  · · ·  x₁^(d) ; ... ; 1  x_N^(1)  · · ·  x_N^(d) ),   Y = ( y₁^(1)  · · ·  y₁^(q) ; ... ; y_N^(1)  · · ·  y_N^(q) ).

With this notation, we have

RSS(β) = |Y − Xβ|²₂ ,

with |A|²₂ = trace(AᵀA) for a rectangular matrix A. The solution of the problem is then provided by the following theorem.

Theorem 7.1 Assume that the matrix X has rank d + 1. Then the RSS is minimized for

β̂ = (XᵀX)^{−1}XᵀY .

Proof We provide two possible proofs of this elementary result. The first one is an optimization argument noting that F(β) = RSS(β) is a convex function defined on M_{d+1,q}(R) and with values in R. Since F is quadratic, we have, for any matrix h ∈ M_{d+1,q}(R),

dF(β)h = ∂_ε F(β + εh)|_{ε=0} = −2 trace(hᵀXᵀ(Y − Xβ))

and

dF(β) = 0 ⇔ Xᵀ(Y − Xβ) = 0 ⇔ β = β̂.

One can alternatively proceed with a direct computation. We have

RSS(β) = |Y|²₂ − 2 trace(βᵀXᵀY) + trace(βᵀXᵀXβ)
       = |Y|²₂ − 2 trace(βᵀXᵀXβ̂) + trace(βᵀXᵀXβ).

Replacing β by β̂ and simplifying yields

RSS(β̂) = |Y|²₂ − trace(β̂ᵀXᵀXβ̂).

It follows that

RSS(β) = RSS(β̂) + trace(β̂ᵀXᵀXβ̂) − 2 trace(βᵀXᵀXβ̂) + trace(βᵀXᵀXβ)
       = RSS(β̂) + |X(β̂ − β)|²₂ ,

so that the left-hand side is minimized at β = β̂.
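Numerically, β̂ is best obtained from a linear solve (or a QR/SVD-based least-squares routine) rather than an explicit inverse; the following sketch, with simulated data of our own choosing, illustrates theorem 7.1.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, q = 200, 4, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])   # rows are x-tilde^T
beta_true = rng.normal(size=(d + 1, q))
Y = X @ beta_true + 0.1 * rng.normal(size=(N, q))

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # (X^T X)^{-1} X^T Y via a linear solve
print(np.abs(beta_hat - beta_true).max())      # small, since the model is well specified
```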

Remark 7.2 If X does not have rank d + 1, then optimal solutions exist, but they are not unique. By convexity, the solutions are exactly the vectors β at which the gradient vanishes, i.e., those that satisfy XᵀXβ = XᵀY. The set of solutions can be obtained by introducing the SVD of X in the form X = UDVᵀ and letting γ = Vᵀβ and Z = UᵀY. Then

XᵀXβ = XᵀY ⇔ DᵀDγ = DᵀZ.

Letting d^(1), ..., d^(m) denote the nonzero diagonal entries of D (so that m ≤ d + 1), we find γ^(i) = z^(i)/d^(i) for i ≤ m (the other equalities being 0 = 0). So, the d + 1 − m last entries of γ can be chosen arbitrarily (and β = Vγ).

An alternate representation of the solution uses a two-step computation that estimates b first, then a₀. Indeed, for fixed b̂, the minimum of

Σ_{k=1}^N |y_k − a₀ − x_kᵀb̂|²

is attained at â₀ = ȳ − x̄ᵀb̂ with the usual definitions

ȳ = (1/N) Σ_{k=1}^N y_k and x̄ = (1/N) Σ_{k=1}^N x_k .

This shows that b̂ itself must be a minimizer of

Σ_{k=1}^N |y_k − ȳ − (x_k − x̄)ᵀb|² .

Denote by Y_c and X_c the centered matrices

X_c = ( x₁^(1) − x̄^(1)  · · ·  x₁^(d) − x̄^(d) ; ... ; x_N^(1) − x̄^(1)  · · ·  x_N^(d) − x̄^(d) ),
Y_c = ( y₁^(1) − ȳ^(1)  · · ·  y₁^(q) − ȳ^(q) ; ... ; y_N^(1) − ȳ^(1)  · · ·  y_N^(q) − ȳ^(q) ).

Then b̂ must minimize |Y_c − X_c b|²₂, yielding

b̂ = (X_cᵀX_c)^{−1} X_cᵀY_c ,  â₀ = ȳ − x̄ᵀb̂.

The reader may want to double-check that this solution coincides with the one pro-
vided in theorem 7.1.
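One way to perform this check, at least numerically (a sketch with arbitrary simulated data of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])
Y = rng.normal(size=(N, 1))

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)        # theorem 7.1
Xc = X[:, 1:] - X[:, 1:].mean(axis=0)               # centered X_c
Yc = Y - Y.mean(axis=0)                             # centered Y_c
b_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ Yc)
a0_hat = Y.mean(axis=0) - X[:, 1:].mean(axis=0) @ b_hat
assert np.allclose(np.vstack([a0_hat, b_hat]), beta_hat)   # the two solutions agree
```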

7.1.2 Limit behavior

The matrix

Σ̂_XX = (1/N) X_cᵀX_c = (1/N) Σ_{k=1}^N (x_k − x̄)(x_k − x̄)ᵀ

is a sample estimate of the covariance matrix of X, which we will denote Σ_XX. Similarly, Σ̂_XY = X_cᵀY_c/N is a sample estimate of Σ_XY, the covariance between X and Y. With this notation, we have

b̂ = Σ̂_XX^{−1} Σ̂_XY ,

which, by the law of large numbers, converges to b* = Σ_XX^{−1} Σ_XY.

Let a₀* = m_Y − m_Xᵀb*, where m_X and m_Y denote the means of X and Y. Then f*(x) = a₀* + (b*)ᵀx is the least-square optimal approximation of Y by a linear function of X, and the linear predictor f̂(x) = â₀ + b̂ᵀx converges a.s. to f*(x). Of course, f* generally differs from f : x ↦ E(Y | X = x), which is the least-square optimal approximation of Y by any (square-integrable) function of X, so that the linear estimator will have a residual bias.

7.1.3 Gauss-Markov theorem

If one makes the (unlikely) assumption that the linear model is exact, i.e., f(x) = f*(x), one has:

E(β̂) = E(E(β̂ | X)) = E((XᵀX)^{−1}XᵀE(Y | X)) = E((XᵀX)^{−1}XᵀXβ) = β

and the estimator is “unbiased.” Under this parametric assumption, many other properties of linear estimators can be proved, among which the well-known Gauss-Markov theorem on the optimality of least-square estimation, which we now state and prove. For this theorem, for which we take (for simplicity) q = 1, we also assume that var(Y | X = x), the variance of Y for its conditional distribution given X, does not depend on x, and denote it by σ². This typically corresponds to the standard regression model in which one assumes that Y = f(X) + ε where ε is independent of X with variance σ².

Recall that a symmetric matrix A is said to be larger than or equal to another symmetric matrix B, written A ⪰ B, if and only if A − B is positive semi-definite.
Theorem 7.3 (Gauss-Markov) Assume that an estimator β̃ takes the form β̃ = A(X)Y (it is linear) and is unbiased conditionally to X: E_β(β̃ | X) = β (for all β). Then (under the assumptions above) the covariance matrix of β̃ cannot be smaller than that of the least-square estimate, β̂.

Proof We write A = A(X) for short. The condition that E(AY | X) = β for all β yields AXβ = β for all β, or AX = Id_{R^{d+1}} (A is a (d + 1) × N matrix). Since β̃ is unbiased, its covariance matrix is

E(AYYᵀAᵀ) − ββᵀ

and

E(AYYᵀAᵀ) = E(E(AYYᵀAᵀ | X)) = ββᵀ + σ²E(AAᵀ),

so that the covariance matrix of β̃ is σ²E(AAᵀ). For β̃ = β̂, for which A = (XᵀX)^{−1}Xᵀ, we get the covariance matrix σ²E((XᵀX)^{−1}). We therefore need to show that E(AAᵀ) ⪰ E((XᵀX)^{−1}), i.e., that for any u ∈ R^{d+1},

uᵀE(AAᵀ)u ≥ uᵀE((XᵀX)^{−1})u

as soon as AX = Id_{R^{d+1}}. We in fact have the stronger result (without expectations):

AX = Id_{R^{d+1}} ⇒ AAᵀ ⪰ (XᵀX)^{−1}.

To see this, fix u and consider the problem of minimizing F_u : A ↦ uᵀAAᵀu subject to the linear constraint AX = Id_{R^{d+1}}. The Lagrange multipliers for this affine constraint can be organized in a matrix C, and the Lagrangian is

uᵀAAᵀu + trace(Cᵀ(AX − Id_{R^{d+1}})).



Taking the derivative in A, we find that optimal solutions must satisfy

2u T AH T u + trace(C T HX ) = 0

for all H, which yields trace(H T (2uu T A + CX T )) = 0 for all H. This is only possible
when 2uu T A + CX T = 0, which in turn implies that 2uu T AX = −CX T X . Using the
constraint, we get
C = −2uu T (X T X )−1
so that uu T A = uu T (X T X )−1 X T . This implies that A = (X T X )−1 X T (the least-square
estimator) is a minimizer of Fu (A) for all u.

Any other solution also satisfies uuᵀA = uuᵀ(XᵀX)^{−1}Xᵀ for all u. Taking u = e_i and summing over i (with Σ_{i=1}^{d+1} e_i e_iᵀ = Id_{R^{d+1}}) yields A = (XᵀX)^{−1}Xᵀ.

7.1.4 Kernel Version

We now assume that X takes its values in an arbitrary set RX , with a representation
h : RX → H into an inner-product space. This representation does not need to be
explicit or computable, but the associated kernel K(x, y) = hh(x) , h(y)iH is assumed
to be known and easy to compute. (Recall that, from chapter 6, a positive kernel is
always associated with an inner-product space.) In particular, any algorithm in this
context should only rely on the kernel, and the function h only has a conceptual role.

Assume that q = 1 to lighten the notation, so that the dependent variable is scalar-
valued. We here let the space of predictors be

F = {f (x) = a0 + hb , h(x)iH : a0 ∈ R, b ∈ H}.

The residual sum of squares associated with this function space is


N
X
RSS(a0 , b) = (yk − a0 − hb , h(xk )i)2 .
k=1

The following result (or results similar to it) is a key step in almost all kernel
methods in machine learning.

Proposition 7.4 Let V = span(h(x₁), ..., h(x_N)) be the finite-dimensional subspace of H generated by the feature functions evaluated on training input data. Then

RSS(a₀, b) = RSS(a₀, π_V(b)),

where π_V is the orthogonal projection on V.



Proof The justification is immediate: since h(x_k) ∈ V, we have

⟨b, h(x_k)⟩_H = ⟨π_V(b), h(x_k)⟩_H

for all b ∈ H.

This shows that there is no loss of generality in restricting the minimization of the residual sum of squares to b ∈ V. Such a b takes the form

b = Σ_{k=1}^N α_k h(x_k)   (7.3)

and the regression problem can be reformulated as a function of the coefficients α₁, ..., α_N ∈ R, with

f(x) = a₀ + Σ_{k=1}^N α_k ⟨h(x), h(x_k)⟩_H = a₀ + Σ_{k=1}^N α_k K(x, x_k),

which only depends on the kernel. (This reduction is often referred to as the “kernel trick.”)

However, the solution of the problem is, in this context, not very interesting. Indeed, assume that K is positive definite and that all observations in the training set are distinct. Then the matrix K(x₁, ..., x_N) formed by the kernel evaluations K(x_i, x_j) is invertible, and one can solve exactly the equations

y_k = Σ_{j=1}^N α_j K(x_k, x_j), k = 1, ..., N

to get a zero RSS with a₀ = 0. Unless there is no noise, such a solution will certainly overfit the data. If K is not positive definite, and the dimension of V is less than N (since this would place us in the previous situation otherwise), then it is more efficient to work directly in a basis of V rather than using the over-parametrized kernel representation. We will see however, starting with the next section, that kernel methods become highly relevant as soon as the regression is estimated with some control on the size of the regression coefficients, b.

7.2 Ridge regression and Lasso

7.2.1 Ridge Regression

Method. When the set F of possible predictors is too large, some additional complexity control is needed to reduce the estimation variance. One simple approach is to limit the number of parameters to be estimated, which, for regression, corresponds to limiting the number of possible predictors. This is related to the method of sieves mentioned in section 4.1. In contrast, ridge regression and lasso control the size of the parameters, as captured by their norm.

In both cases, one assigns a measure of complexity, denoted f 7→ γ(f ) ≥ 0, to each


element f ∈ F . Given γ, one can either optimize this predictor (using, for example,
the RSS) with the constraint that γ(f ) ≤ C for some constant C, or add a penalty
λγ(f ) to the objective function for some λ > 0. In general, the two approaches (con-
straint or penalty) are equivalent.

In linear spaces, complexity measures are often associated with a norm, and ridge regression uses the sum of squares of coefficients of the prediction matrix b, minimizing

Σ_{k=1}^N |y_k − a₀ − bᵀx_k|² + λ trace(bᵀb) ,   (7.4)

which can be written in vector form as

|Y − Xβ|²₂ + λ trace(βᵀΔβ),

where Δ = diag(0, 1, ..., 1). In the following, we will work with an unspecified (d + 1) × (d + 1) symmetric positive semi-definite matrix Δ. Various choices are indeed possible, for example, Δ = diag(0, σ̂²(1), ..., σ̂²(d)), where σ̂²(i) is the empirical variance of the ith coordinate of X in the training set. This last choice is quite natural, because it ensures that, whenever one of the variables X^(i) is rescaled by a factor c, the corresponding optimal ith row of bᵀ is rescaled by 1/c, leaving the predictor unchanged.

Under this assumption, the optimal parameter is

β̂^λ = (XᵀX + λΔ)^{−1}XᵀY ,

with a proof similar to that made for least-square regression. We obviously retrieve the original formula for regression when λ = 0.
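In code, the ridge estimator is a one-line modification of least squares (a sketch of our own; the choice of Δ below excludes the intercept from the penalty, as discussed next):

```python
import numpy as np

def ridge(X, Y, lam, Delta):
    """beta-hat^lambda = (X^T X + lam * Delta)^{-1} X^T Y, with a 1s column in X."""
    return np.linalg.solve(X.T @ X + lam * Delta, X.T @ Y)

rng = np.random.default_rng(5)
N, d = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, d))])
Y = rng.normal(size=N)
Delta = np.diag([0.0] + [1.0] * d)   # no penalty on the intercept
print(ridge(X, Y, lam=1.0, Delta=Delta))   # lam = 0 recovers least squares
```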
Alternatively, assuming that Δ = (0 0; 0 Δ′) in block form, so that no penalty is imposed on the intercept, we have

b̂^λ = (X_cᵀX_c + λΔ′)^{−1}X_cᵀY_c   (7.5)

and â₀^λ = ȳ − (b̂^λ)ᵀx̄. The proof of these statements is left to the reader.

Analysis in a special case. To illustrate the impact of the penalty term on balancing bias and variance, we now make a computation in the special case when Y = X̃ᵀβ + ε, where var(ε) = σ² and ε is independent of X. In the following computation, we assume that the training set is fixed (or rather, compute probabilities and expectations conditionally to it). Also, to simplify notation, we denote

S_λ = XᵀX + λΔ = Σ_{k=1}^N x̃_k x̃_kᵀ + λΔ

and Σ = E(X̃X̃ᵀ) for a single realization of X. Finally, we assume that q = 1, also to simplify the discussion.

The mean-square prediction error is

R(λ) = E((Y − X̃ᵀβ̂^λ)²) = E((X̃ᵀ(β − β̂^λ) + ε)²) = (β̂^λ − β)ᵀΣ(β̂^λ − β) + σ² .

Denote by ε_k the (true) residual ε_k = y_k − x̃_kᵀβ on training data and by ε the vector stacking these residuals. We have, writing S₀ = S_λ − λΔ,

β̂^λ = S_λ^{−1}XᵀY = S_λ^{−1}S₀β + S_λ^{−1}Xᵀε = β − λS_λ^{−1}Δβ + S_λ^{−1}Xᵀε .

So we can rewrite

R(λ) = λ²βᵀΔS_λ^{−1}ΣS_λ^{−1}Δβ − 2λεᵀXS_λ^{−1}ΣS_λ^{−1}Δβ + εᵀXS_λ^{−1}ΣS_λ^{−1}Xᵀε + σ² .

Let us analyze the quantities that depend on the training set in this expression. The first one is S_λ = S₀ + λΔ. From the law of large numbers, S₀/N → Σ when N tends to infinity, so that, assuming in addition that λ = λ_N = O(N), we have S_λ^{−1} = O(1/N). The second one is

εᵀX = Σ_{k=1}^N ε_k x̃_kᵀ

which, according to the central limit theorem, is such that

N^{−1/2} εᵀX ∼ N(0, σ² Var(X̃))

when N → ∞. So, we can expect the coefficient of λ² in R(λ) to have order N^{−2}, the coefficient of λ to have order N^{−3/2} and the constant coefficient to have order N^{−1}. This suggests taking λ = µ√N so that all coefficients have roughly the same order when expanding in powers of µ.

This gives S_λ = N(S₀/N + µΔ/√N) ≈ NΣ and we make the approximation, letting ξ = N^{−1/2}Σ^{−1/2}Xᵀε and γ = Σ^{−1/2}Δβ, that

N(R(λ) − σ²) ≈ µ²|γ|² − 2µξᵀγ + ξᵀξ.

With this approximation, the optimal µ should be

µ = ξᵀγ / |γ|² .

Of course, this µ cannot be computed from data, but we can see that, since ξ converges to a centered Gaussian random variable, its value cannot be too large. It is therefore natural to choose µ to be constant and use ridge regression in the form

Σ_{k=1}^N (y_k − x̃_kᵀβ)² + √N µ βᵀΔβ.

In all cases, the mere fact that we find that the optimal µ is not 0 shows that, under
the simplifying (and optimistic) assumptions that we made for this computation,
allowing for a penalty term always reduces the prediction error. In other terms,
introducing some estimation bias in order to reduce the variance is beneficial.

Kernel Ridge Regression. We now return to the feature-space situation and take h : RX → H with associated kernel K. We still take q = 1 for simplicity. One formulates the ridge regression problem in this context as the minimization of

Σ_{k=1}^N (y_k − a₀ − ⟨b, h(x_k)⟩_H)² + λ‖b‖²_H

with respect to β = (a₀, b). Introducing the space V generated by the feature functions evaluated on the training set, we know from proposition 7.4 that replacing b by π_V(b) leaves the residual sum of squares invariant. Moreover, one has ‖π_V(b)‖²_H ≤ ‖b‖²_H with equality if and only if b ∈ V. This shows that the solution b must belong to V and therefore take the form (7.3).

Using this expression, one finds that the problem is reduced to finding the minimum of

Σ_{k=1}^N (y_k − a₀ − Σ_{l=1}^N K(x_l, x_k)α_l)² + λ Σ_{k,l=1}^N α_k α_l K(x_k, x_l)

with respect to a₀, α₁, ..., α_N. Recall that we have denoted by K = K(x₁, ..., x_N) the kernel matrix with entries K(x_i, x_j), i, j = 1, ..., N. We will assume in the following that K is invertible.
7.2. RIDGE REGRESSION AND LASSO 153

Introduce the vector 1N ∈ RN with all coordinates equal to one. Let


!
 
0 0 0
K̃ = 1N K and K = .
0 K

!
a0
Let α ∈ RN be the vector with coefficients α1 , . . . , αN and α̃ = . With this
α
notation, the function to minimize is

F(α) = |Y − K̃α̃|2 + λα̃ T K0 α̃.

This takes the same form as standard ridge regression, replacing β by α̃, X by K̃ and
∆ by K0 . The solution therefore is

α̃ λ = (K̃T K̃ + λK0 )−1 K̃T Y .

Note that K being invertible implies that K̃T K̃ + λK0 is invertible. 1

To write the equivalent of (7.5), we need to use the equivalent of the matrix X_c, that is, the matrix K with the average of the jth column subtracted from each (i, j) entry, given by:

K_c = K − (1/N) 1_N 1_NᵀK.

Introduce the matrix P = Id − 1_N1_Nᵀ/N. It is easily checked that P² = P (P is a projection matrix). Since K_c = PK, we have K_cᵀK_c = KPK. One deduces from this the expression of the optimal vector α^λ, namely,

α^λ = (KPK + λK)^{−1}KPY_c = (PK + λId_{R^N})^{−1}Y_c

where we have, in addition, used the fact that PY_c = Y_c. Finally, the intercept is given by

a₀ = ȳ − (α^λ)ᵀK1_N/N .
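Putting the pieces together, a minimal sketch of centered kernel ridge regression (our own transcription of the formulas above, with an illustrative Gaussian kernel):

```python
import numpy as np

def kernel_ridge(K, xs, y, lam):
    """Returns the predictor x -> a_0 + sum_k alpha_k K(x, x_k)."""
    N = len(xs)
    Kmat = np.array([[K(xi, xj) for xj in xs] for xi in xs])
    P = np.eye(N) - np.ones((N, N)) / N                     # centering projection
    alpha = np.linalg.solve(P @ Kmat + lam * np.eye(N), y - y.mean())
    a0 = y.mean() - alpha @ Kmat @ np.ones(N) / N           # intercept
    return lambda x: a0 + sum(a * K(x, xk) for a, xk in zip(alpha, xs))

K = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2.0)
rng = np.random.default_rng(6)
xs = rng.normal(size=(30, 2))
y = np.sin(xs[:, 0]) + 0.1 * rng.normal(size=30)
f = kernel_ridge(K, xs, y, lam=0.5)
print(f(xs[0]), y[0])   # fitted value vs observed value at a training point
```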

7.2.2 Equivalence of constrained and penalized formulations

Case of ridge regression. Returning to the basic case (without feature space), we now introduce an alternate formulation of ridge regression. Let ridge(λ) denote the ridge regression problem that we have considered so far, for some parameter λ. Consider now the following problem, which will be called ridge′(C): minimize Σ_{k=1}^N |y_k − x̃_kᵀβ|² subject to the constraint βᵀΔβ ≤ C. We claim that this problem is equivalent to the ridge regression problem, in the following sense: for any C, there exists a λ such that the solution of ridge′(C) coincides with the solution of ridge(λ), and vice-versa.

¹ Indeed, let u = (w₀, wᵀ)ᵀ with w₀ ∈ R and w ∈ R^N be such that uᵀ(K̃ᵀK̃ + λK⁰)u = 0. This requires K̃u = 0 and uᵀK⁰u = 0. The latter quantity is wᵀKw, which shows that w = 0 since K has rank N. Then K̃u = 1_N w₀, so that w₀ = 0 also.

Indeed, fix C > 0. Consider an optimal β for ridge′(C). Assuming as above that Δ is symmetric positive semi-definite, we let V be its null space and P_V the orthogonal projection on V. Write β = β₁ + β₂ with β₁ = P_Vβ. Let d₁ and d₂ be the respective dimensions of V and V^⊥, so that d₁ + d₂ = d + 1. Identifying R^{d+1} with the product space V × V^⊥ (i.e., making a linear change of coordinates), the problem can be rewritten as the minimization of

|Y − X₁β₁ − X₂β₂|²

subject to β₂ᵀΔβ₂ ≤ C, where X₁ (resp. X₂) is N × d₁ (resp. N × d₂).

The gradient of the constraint γ(β₂) = β₂ᵀΔβ₂ − C is ∇γ(β₂) = 2Δβ₂. Assume first that Δβ₂ ≠ 0. Then the solution must satisfy the KKT conditions, which require that there exists µ ≥ 0 such that β is a stationary point of the Lagrangian

|Y − X₁β₁ − X₂β₂|² + µβ₂ᵀΔβ₂ ,

with µ > 0 only possible if βᵀΔβ = C. This requires that

X₁ᵀX₁β₁ + X₁ᵀX₂β₂ = X₁ᵀY ,
X₂ᵀX₁β₁ + X₂ᵀX₂β₂ + µΔβ₂ = X₂ᵀY .

Since ∆β1 = 0, and using X = (X1 , X2 ), we have

β = (X T X + µ∆)−1 X T Y ,

which is the only solution of ridge(µ).

If ∆β2 = 0, then, necessarily, β2 = 0. Since C > 0, β must then be the solution of


the unconstrained problem, which is ridge(0).

Conversely, any solution β of ridge(λ) satisfies the first-order optimality conditions for ridge′(C) with C = βᵀΔβ (or any C ≥ βᵀΔβ if λ = 0). This shows the equivalence of the two problems.

General case. We now consider this equivalence in a more general setting. Consider a penalized optimization problem, denoted var(λ), which consists in minimizing in β some objective function of the form U(β) + λϕ(β), λ ≥ 0. Consider also the family of problems var′(C), with C > inf(ϕ), which minimize U(β) subject to ϕ(β) ≤ C.

We make the following assumptions.

(i) U and ϕ are continuous functions from R^n to R.

(ii) ϕ(β) → ∞ when |β| → ∞.

(iii) For any λ ≥ 0, there is a unique solution of var(λ), denoted β_λ.

(iv) For any C, there is a unique solution of var′(C), denoted β′_C.

Assumptions (iii) and (iv) are true, in particular, when U is strictly convex, ϕ is convex and U has compact level sets. We show that, with these assumptions, the two families of problems are equivalent.

We first discuss the penalized problems and prove the following proposition,
which has its own interest.
Proposition 7.5 The function λ 7→ U (βλ ) is nondecreasing, and λ 7→ ϕ(βλ ) is non-
increasing, with
lim ϕ(βλ ) = inf(ϕ).
λ→∞
Moreover, βλ varies continuously as a function of λ.
Proof Consider two parameters λ and λ′. We have

U(β_λ) + λϕ(β_λ) ≤ U(β_{λ′}) + λϕ(β_{λ′}) and U(β_{λ′}) + λ′ϕ(β_{λ′}) ≤ U(β_λ) + λ′ϕ(β_λ)

since β_λ and β_{λ′} are the respective minimizers. This implies

λ(ϕ(β_λ) − ϕ(β_{λ′})) ≤ U(β_{λ′}) − U(β_λ) ≤ λ′(ϕ(β_λ) − ϕ(β_{λ′})).   (7.6)

In particular, (λ′ − λ)(ϕ(β_λ) − ϕ(β_{λ′})) ≥ 0. Assume that λ < λ′. Then this last inequality implies ϕ(β_λ) ≥ ϕ(β_{λ′}), and (7.6) then implies that U(β_λ) ≤ U(β_{λ′}), which proves the first part of the proposition.

Now assume that there exists ε > 0 such that ϕ(β_λ) > inf ϕ + ε for all λ ≥ 0. Take β̃ such that ϕ(β̃) ≤ inf ϕ + ε/2. For any λ > 0, we have

U(β_λ) + λϕ(β_λ) ≤ U(β̃) + λϕ(β̃),

so that U(β_λ) < U(β̃) − λε/2. Since U(β_λ) ≥ U(β₀), we get U(β₀) = −∞, which is a contradiction. This shows that ϕ(β_λ) tends to inf(ϕ) when λ tends to infinity.

We now prove that λ ↦ β_λ is continuous. Define G(β, λ) = U(β) + λϕ(β). Since we assume that ϕ(β) → ∞ when |β| → ∞, and we have just proved that ϕ(β_λ) ≤ ϕ(β₀) for any λ, we obtain the fact that the set (|β_λ|, λ ≥ 0) is bounded, say by a constant B ≥ 0.

Consider a sequence λ_n that converges to λ. We want to prove that β_{λ_n} → β_λ, for which (because β_{λ_n} is bounded) it suffices to show that if any subsequence of (β_{λ_n}) converges to some β̃, then β̃ = β_λ.

So, consider such a converging subsequence, which we will still denote by β_{λ_n} for convenience. Since G is continuous, one has G(β_{λ_n}, λ_n) → G(β̃, λ) when n tends to infinity. Let us prove that G(β_λ, λ) is continuous in λ. For any pair λ, λ′, we have

G(β_{λ′}, λ′) ≤ G(β_λ, λ′) = G(β_λ, λ) + (λ′ − λ)ϕ(β_λ) ≤ G(β_λ, λ) + |λ′ − λ|ϕ(β₀) .

This yields, by symmetry, |G(β_{λ′}, λ′) − G(β_λ, λ)| ≤ ϕ(β₀)|λ − λ′|, proving the continuity in λ.

So we must have G(β̃, λ) = G(β_λ, λ). This implies that both β̃ and β_λ are solutions of var(λ), so that β_λ = β̃ because we assume that the solution is unique.

We now prove that the classes of problems var(λ) and var0(C) are equivalent. First, βλ is a minimizer of U(β) subject to the constraint ϕ(β) ≤ C, with C = ϕ(βλ). Indeed, if U(β) < U(βλ) for some β with ϕ(β) ≤ ϕ(βλ), then U(β) + λϕ(β) < U(βλ) + λϕ(βλ), which is a contradiction. So βλ = β^0_{ϕ(βλ)}. Using the continuity of βλ and ϕ, this proves the equivalence of the problems when C is in the interval (a, ϕ(β0)), where a = lim_{λ→∞} ϕ(βλ) = inf(ϕ).

So, it remains to consider the case C > ϕ(β0). For such a C, the solution of var0(C) must be β0, since β0 solves the unconstrained problem and satisfies the constraint.
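The monotonicity statements of proposition 7.5 are easy to verify numerically. The following sketch (our own illustration, on synthetic data) does so for the ridge case, taking U(β) = |Y − Xβ|² and ϕ(β) = |β|², for which βλ has a closed form.

```python
import numpy as np

# Minimal numerical check of proposition 7.5 for var(lambda) with
# U(beta) = |Y - X beta|^2 and phi(beta) = |beta|^2 (Delta = Id).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
Y = rng.standard_normal(50)

def beta_lambda(lam):
    # Closed-form solution of ridge(lambda)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

lams = np.linspace(0.0, 100.0, 200)
U = [np.sum((Y - X @ beta_lambda(l)) ** 2) for l in lams]
phi = [np.sum(beta_lambda(l) ** 2) for l in lams]

# U(beta_lambda) is nondecreasing, phi(beta_lambda) nonincreasing
assert all(u2 >= u1 - 1e-8 for u1, u2 in zip(U, U[1:]))
assert all(p2 <= p1 + 1e-8 for p1, p2 in zip(phi, phi[1:]))
```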

7.2.3 Lasso regression

Problem statement Assume that the output variable is scalar, i.e., q = 1. Let σ̂²(i) be the empirical variance of the ith variable X^{(i)}. Then, the lasso estimator is defined as a minimizer of

    Σ_{k=1}^N (yk − x̃k^T β)²

subject to the constraint Σ_{i=1}^d σ̂(i)|β^{(i)}| ≤ C. Compared to ridge regression, the sum of squares for β is simply replaced by a weighted sum of absolute values, but we will see that this change may significantly affect the nature of the solutions.

As we have just seen, the penalized formulation, minimizing

    Σ_{k=1}^N (yk − x̃k^T β)² + λ Σ_{i=1}^d σ̂(i)|β^{(i)}|,

provides an equivalent family of problems, on which we will focus (because it is easier to analyze). Since one uses a non-Euclidean norm in the penalty, there is no kernel version of the lasso and we only discuss the method in the original input space RX = R^d.

For a vector a ∈ R^k, we let |a|₁ = |a^{(1)}| + · · · + |a^{(k)}|, the ℓ¹ norm of a. Using the previous notation for Y and X, the quantity to minimize can be rewritten as

    |Y − Xβ|² + λ|Dβ|₁,

where D is the d × (d + 1) matrix with D(i, i + 1) = σ̂(i) for i = 1, . . . , d and all other coefficients equal to 0. This is a convex optimization problem which, unlike ridge regression, does not have a closed form solution.

ADMM. The alternating direction method of multipliers (ADMM) that was described in section 3.6, (3.59), is one of the state-of-the-art algorithms to solve the lasso problem, especially in large dimensions. Other iterative methods include subgradient descent (see the example in section 3.5.4) and proximal gradient descent. Since x has a different meaning here, we change the notation in (3.59), replacing x, z, u by β, γ, τ, and rewrite the lasso problem as the minimization of
    |Y − Xβ|² + λ|γ|₁

subject to Dβ − γ = 0. Applying (3.60) with A = D, B = −Id and c = 0, the ADMM iterations are

    β(n + 1) = argmin_β ( |Y − Xβ|² + (α/2)|Dβ − γ(n) + τ(n)|² ),

    γ^{(i)}(n + 1) = argmin_t ( λ|t| + (α/2)(t − (Dβ(n + 1))^{(i)} − τ^{(i)}(n))² ),  i = 1, . . . , d,

    τ(n + 1) = τ(n) + Dβ(n + 1) − γ(n + 1).

The solutions of both minimization problems are explicit, yielding the following
algorithm.

Algorithm 7.1 (ADMM for lasso)

Let α > 0 be chosen. Starting with initial values β(0), γ(0) and τ(0), the ADMM algorithm for lasso iterates:

    β(n + 1) = (X^T X + (α/2) D^T D)^{−1} (X^T Y + (α/2) D^T (γ(n) − τ(n))),

    γ^{(i)}(n + 1) = S_{λ/α}((Dβ(n + 1))^{(i)} + τ^{(i)}(n)),  i = 1, . . . , d,

    τ(n + 1) = τ(n) + Dβ(n + 1) − γ(n + 1),

until the difference between the variables at steps n and n + 1 is below a small tolerance level. Here, S_µ is the so-called shrinkage operator

    S_µ(v) = sign(v) max(|v| − µ, 0) =
        v − µ  if v ≥ µ,
        0      if |v| ≤ µ,
        v + µ  if v ≤ −µ.


Note that the ADMM algorithm makes an iterative approximation of the constraints,
so that they are only satisfied at some precision level when the algorithm is stopped.
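For concreteness, a direct Python transcription of algorithm 7.1 is sketched below; all function and variable names are ours, and α plays the role of the ADMM parameter above.

```python
import numpy as np

def shrink(v, mu):
    # Shrinkage (soft-thresholding) operator S_mu, applied coordinatewise
    return np.sign(v) * np.maximum(np.abs(v) - mu, 0.0)

def admm_lasso(X, Y, D, lam, alpha=1.0, tol=1e-8, max_iter=10_000):
    # Minimize |Y - X beta|^2 + lam * |D beta|_1 by the iterations of
    # algorithm 7.1; alpha > 0 is the ADMM parameter.
    beta = np.zeros(X.shape[1])
    gamma = np.zeros(D.shape[0])
    tau = np.zeros(D.shape[0])
    # The matrix in the beta-update is fixed across iterations
    A = np.linalg.inv(X.T @ X + 0.5 * alpha * D.T @ D)
    for _ in range(max_iter):
        beta_new = A @ (X.T @ Y + 0.5 * alpha * D.T @ (gamma - tau))
        gamma_new = shrink(D @ beta_new + tau, lam / alpha)
        tau += D @ beta_new - gamma_new
        done = (np.max(np.abs(beta_new - beta)) < tol
                and np.max(np.abs(gamma_new - gamma)) < tol)
        beta, gamma = beta_new, gamma_new
        if done:
            break
    return beta
```

Note that the β-update reuses a single matrix inverse, which is part of what makes ADMM attractive for this problem.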

Exact computation. We now provide a more detailed characterization of the solution of the lasso problem and analyze, in particular, how this solution changes when λ (or C) varies. To simplify the exposition, and without loss of generality, we will assume that the variables have been normalized so that σ̂(i) = 1 and the penalty simply is the sum of absolute values. Let

    Gλ(β) = Σ_{k=1}^N (yk − a0 − xk^T b)² + λ Σ_{i=1}^d |b^{(i)}|.

The following proposition, in which we let

    r_b = (1/N) Σ_{k=1}^N (yk − a0 − xk^T b) xk,

characterizes the solution of the lasso.

Proposition 7.6 The pair (a0, b) is the optimal solution of the lasso problem with parameter λ if and only if a0 = ȳ − x̄^T b and, for all i = 1, . . . , d,

    |r_b^{(i)}| ≤ λ/(2N),    (7.7)

with

    r_b^{(i)} = (λ/(2N)) sign(b^{(i)}) if b^{(i)} ≠ 0.    (7.8)

In particular, |r_b^{(i)}| < λ/(2N) implies b^{(i)} = 0.

Proof Using the subdifferential calculus in theorem 3.45, one can compute the subgradients of Gλ by adding the subdifferentials of the terms that compose it. All these terms are differentiable except |b^{(i)}| when b^{(i)} = 0, and the subdifferential of t ↦ |t| at t = 0 is the interval [−1, 1].

This shows that g ∈ ∂Gλ(β) if and only if

    g = −2N r_b + λz

with z^{(i)} = sign(b^{(i)}) if b^{(i)} ≠ 0 and |z^{(i)}| ≤ 1 otherwise. Proposition 7.6 immediately follows by taking g = 0. □
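The conditions (7.7) and (7.8) are convenient for testing any candidate lasso solver. The sketch below uses a simple coordinate-descent scheme (one possible method among others; not the exact path algorithm described below) and then checks the conditions of proposition 7.6 on synthetic data of our own choosing.

```python
import numpy as np

def soft(t, mu):
    return np.sign(t) * np.maximum(np.abs(t) - mu, 0.0)

def lasso_cd(x, y, lam, n_sweeps=5000):
    # Coordinate descent for G_lambda, with the intercept handled by
    # centering (its optimal value is ybar - xbar^T b).
    N, d = x.shape
    xc = x - x.mean(axis=0)
    yc = y - y.mean()
    b = np.zeros(d)
    col_sq = (xc ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for i in range(d):
            r = yc - xc @ b + xc[:, i] * b[i]   # partial residual
            b[i] = soft(xc[:, i] @ r, lam / 2) / col_sq[i]
    a0 = y.mean() - x.mean(axis=0) @ b
    return a0, b

rng = np.random.default_rng(1)
x = rng.standard_normal((100, 8))
y = x[:, 0] - 2 * x[:, 1] + 0.1 * rng.standard_normal(100)
lam = 5.0
a0, b = lasso_cd(x, y, lam)
N = x.shape[0]
r = (y - a0 - x @ b) @ x / N          # the vector r_b
assert np.all(np.abs(r) <= lam / (2 * N) + 1e-6)        # condition (7.7)
nz = np.abs(b) > 1e-10
assert np.allclose(r[nz], np.sign(b[nz]) * lam / (2 * N), atol=1e-6)  # (7.8)
```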
Let ζ = sign(b) be the vector formed by the signs of the coordinates of b, with sign(0) = 0. Then proposition 7.6 uniquely specifies a0 and b once λ and ζ are known. Indeed, let J = Jζ denote the ordered subset of indices j ∈ {1, . . . , d} such that ζ^{(j)} ≠ 0, and let b(J), xk(J), ζ(J), etc., denote the restrictions of vectors to these indices. Equation (7.8) can be rewritten as (after replacing a0 by its optimal value)

    Xc(J)^T Xc(J) b(J) = Xc(J)^T Yc − (λ/2) ζ(J),

where

    Xc(J) = [ (x1(J) − x̄(J))^T ; . . . ; (xN(J) − x̄(J))^T ].

This yields

    b(J) = (Xc(J)^T Xc(J))^{−1} ( Xc(J)^T Yc − (λ/2) ζ(J) ),    (7.9)

which fully determines b, since b^{(j)} = 0 if j ∉ J, by definition.

For given λ, only one sign configuration ζ will provide the correct solution, with correct signs for the nonzero values of b above, and correct inequalities on r_b. Calling this configuration ζλ, one can note that if ζλ is known for a given value of λ, it remains valid if we increase or decrease λ until one of the optimality conditions changes, i.e., either one of the coordinates b^{(i)}, i ∈ Jζλ, vanishes, or one of the inequalities for i ∉ Jζλ becomes an equality. Moreover, proposition 7.6 shows that between these events both b and therefore r_b depend linearly on λ, which makes it easy to determine maximal intervals around a given λ over which ζ remains unchanged.

Note that solutions are known for λ = 0 (standard least squares) and for λ large enough (for which b = 0). Indeed, for b = 0 to be a solution, it suffices that

    λ > λ0 := 2 max_i | Σ_{k=1}^N (yk − ȳ)(xk^{(i)} − x̄^{(i)}) |.

These remarks set the stage for an algorithm computing the optimal solution of the lasso problem for all values of λ, starting either from λ = 0 or from λ > λ0. We will describe this algorithm starting from λ > λ0, which has the merit of avoiding complications due to underconstrained least squares when d is large. For this purpose, we need a little more notation. For a given ζ, let

    bζ = (Xc(Jζ)^T Xc(Jζ))^{−1} Xc(Jζ)^T Yc

and

    uζ = (1/2) (Xc(Jζ)^T Xc(Jζ))^{−1} ζ(Jζ),

so that b(Jζ) = bζ − λuζ. The residuals then take the form

    r_b^{(i)} = (1/N) Σ_{k=1}^N (yk − a0 − bζ^T xk) xk^{(i)} + (λ/N) Σ_{k=1}^N (xk^T uζ)(xk^{(i)} − x̄^{(i)})
             = ρζ^{(i)} + λ dζ^{(i)},

where the last equation defines ρζ and dζ.

Assume that one wants to minimize Gλ∗ for some λ∗ > 0. We need to describe
the sequence of changes to the minimizers of Gλ when λ decreases from some value
larger than λ0 to the value λ∗ .

If λ∗ ≥ λ0, then the optimal solution is b = 0, so we can assume that λ∗ < λ0. When λ is slightly smaller than λ0, one needs to introduce some non-zero values in ζ. Those values are at the indexes i such that

    λ0 = 2 | Σ_{k=1}^N (yk − ȳ)(xk^{(i)} − x̄^{(i)}) |.

The sign of ζ^{(i)} is also determined, since sign(b^{(i)}) = sign(r_b^{(i)}) when b^{(i)} ≠ 0.

The algorithm will then continue by progressively adding non-zero entries to ζ when the covariance between some unused variable and the residual becomes too large, or by removing non-zero values when the optimal b crosses zero. We now describe it in detail.

Algorithm 7.2 (Exact minimization for lasso)

1. Initialization: let λ(0) = 1 + λ0, ζ(0) = 0 and the corresponding values a0(0) = ȳ and b(0) = 0.
2. Assume that the algorithm has reached step n with current variables λ(n), ζ(n), a0(n) and b(n).
3. Determine the first λ′ < λ(n) for which either
(i) For some i, ζ^{(i)}(n) ≠ 0 and b^{(i)}_{ζ(n)} − λ′ u^{(i)}_{ζ(n)} = 0.
(ii) For some i, ζ^{(i)}(n) = 0 and (1 − 2N d^{(i)}_{ζ(n)})λ′ − 2N ρ^{(i)}_{ζ(n)} = 0.
(iii) For some i, ζ^{(i)}(n) = 0 and (1 + 2N d^{(i)}_{ζ(n)})λ′ + 2N ρ^{(i)}_{ζ(n)} = 0.
4. Then, there are two cases:
(a) If λ′ ≥ λ∗, set λ(n + 1) = λ′. Let ζ^{(i)}(n + 1) = ζ^{(i)}(n) if i does not satisfy (i), (ii) or (iii). If i is in case (i), set ζ^{(i)}(n + 1) = 0. For i in case (ii) (resp. (iii)), set ζ^{(i)}(n + 1) = 1 (resp. −1).
(b) If λ′ < λ∗, terminate the algorithm without updating ζ and set

    b^{(i)} = b^{(i)}_{ζ(n)} − λ∗ u^{(i)}_{ζ(n)},  ζ^{(i)}(n) ≠ 0,

and a0 = ȳ − b^T x̄ to obtain the final solution.

7.3 Other Sparsity Estimators

7.3.1 LARS estimator

Algorithm. The LARS algorithm can be seen as a simplification of the previous


lasso algorithm in which one always adds active variables at each step. We assume
as above that input variables are normalized such that σ̂ (i) = 1.

Given a current set J of selected variables, the algorithm will decide either to stop or to add a new variable to J according to a criterion that depends on a parameter λ > 0. Let b(J) ∈ R^{|J|} be the least-squares estimator based on the variables in J:

    b(J) = (Xc(J)^T Xc(J))^{−1} Xc(J)^T Yc.

Let bJ ∈ R^d be such that b_J^{(i)} = b(J)^{(i)} for i ∈ J and 0 otherwise. The covariances between the remaining variables and the residuals are given by

    r_J^{(i)} = (1/N) Σ_{k=1}^N (yk − ȳ − (xk − x̄)^T bJ)(xk^{(i)} − x̄^{(i)}),  i ∉ J.

If, for all i ∉ J, |r_J^{(i)}| ≤ √(λ/N), the procedure is stopped. Otherwise, one adds to J the variable i for which |r_J^{(i)}| is largest and continues.

Justification. Recall the notation |b|₀ for the number of non-zero entries of b. Consider the objective function

    L(b) = |Yc − Xc b|² + λ|b|₀.

Let J be the set currently selected by the algorithm, and bJ defined as above. We consider the problem of adding one non-zero entry to b. Fix i ∉ J, and let b̃ ∈ R^d have all coordinates equal to those of bJ except the ith one, which is therefore allowed to be non-zero. Then

    L(b̃) = Σ_{k=1}^N ( yk − ȳ − Σ_{j∈J} (xk^{(j)} − x̄^{(j)}) b^{(j)} − (xk^{(i)} − x̄^{(i)}) b̃^{(i)} )² + λ|J| + λ,

so that (using σ̂(i) = 1)

    L(b̃) = L(bJ) − 2N r_J^{(i)} b̃^{(i)} + N (b̃^{(i)})² + λ.

Now, L(b̃) is an upper bound for L(b_{J∪{i}}), and so is its minimum with respect to b̃^{(i)}. This yields:

    L(b_{J∪{i}}) ≤ L(bJ) − N (r_J^{(i)})² + λ.

The LARS algorithm therefore finds the value of i that minimizes this upper bound, provided that the resulting minimum is less than L(bJ).

Variant. The same argument can be made with |b|₀ replaced by |b|₁, and one gets

    L(b̃) = L(bJ) − 2N r_J^{(i)} b̃^{(i)} + N (b̃^{(i)})² + λ|b̃^{(i)}|.

Minimizing this expression with respect to b̃^{(i)} yields the upper bound:

    L(b_{J∪{i}}) ≤ L(bJ) − N (|r_J^{(i)}| − λ/(2N))²  if |r_J^{(i)}| ≥ λ/(2N),
    L(b_{J∪{i}}) ≤ L(bJ)                              if |r_J^{(i)}| ≤ λ/(2N).

This leads to the following alternate form of LARS. Given a current set J of selected variables, compute

    r_J^{(i)} = (1/N) Σ_{k=1}^N (yk − ȳ − (xk − x̄)^T bJ)(xk^{(i)} − x̄^{(i)}),  i ∉ J.

If, for all i ∉ J, |r_J^{(i)}| ≤ λ/(2N), stop the procedure. Otherwise, add to J the variable i for which |r_J^{(i)}| is largest and continue. This form tends to add more variables, since the stopping threshold decreases in 1/N instead of 1/√N.

Why “least angle”? Let µ_{J,k} = yk − ȳ − (xk − x̄)^T bJ denote the residual after regression. The empirical correlation between µJ and x^{(i)} is equal to the cosine of the angle, say θ_J^{(i)}, between µJ ∈ R^N and x^{(i)} − x̄^{(i)}1, both considered as vectors in R^N. This cosine is also equal to

    cos θ_J^{(i)} = µJ^T (x^{(i)} − x̄^{(i)}1) / (|x^{(i)} − x̄^{(i)}1| |µJ|) = √N r_J^{(i)} / |µJ|,

where we have used the fact that |x^{(i)} − x̄^{(i)}1|/√N = σ̂(i) = 1. Since |µJ| does not depend on i, looking for the largest value of |r_J^{(i)}| is equivalent to looking for the smallest value of |θ_J^{(i)}|, so that we are looking for the unselected variable for which the angle with the current residual is minimal.
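The alternate form of LARS just described is straightforward to implement. Here is a minimal Python sketch (all names are ours); it assumes, as above, that the variables have unit empirical variance.

```python
import numpy as np

def lars_select(x, y, lam):
    # Greedy selection with the variant stopping rule |r_J| <= lam/(2N).
    N, d = x.shape
    xc = x - x.mean(axis=0)
    yc = y - y.mean()
    J = []
    while True:
        if J:
            bJ = np.linalg.solve(xc[:, J].T @ xc[:, J], xc[:, J].T @ yc)
            resid = yc - xc[:, J] @ bJ
        else:
            resid = yc
        r = xc.T @ resid / N      # covariances with the current residual
        r[J] = 0.0                # only unselected variables compete
        i = int(np.argmax(np.abs(r)))
        if np.abs(r[i]) <= lam / (2 * N) or len(J) == d:
            return J
        J.append(i)
```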

7.3.2 The Dantzig selector

Noise-free case. Assume that one wants to solve the equation X β = Y when the
dimension, N , of Y is small compared to number of columns, d, in X . Since the
system is under-determined, one needs additional constraints on β and a natural one
is to look for sparse solutions, i.e., find solutions with a maximum number of zero
coefficients. However, this is numerically challenging, and it is easier to minimize
the ` 1 norm of β instead (as seen when discussing the lasso, using this norm often
provides sparse solutions). In the following, we assume that the empirical variance
of each variable is normalized, so that, denoting X (i) the ith column of X , we have
|X (i)| = 1.

The Dantzig selector [46] minimizes

    Σ_{i=1}^d |β^{(i)}|

subject to the constraint Xβ = Y. This results in a linear program (therefore easy to implement). More precisely, introducing slack variables, it is indeed equivalent to minimize

    Σ_{i=1}^d ξ(i) + Σ_{i=1}^d ξ∗(i)

subject to the constraints ξ(i) ≥ β^{(i)}, ξ∗(i) ≥ −β^{(i)}, ξ(i) ≥ 0, ξ∗(i) ≥ 0 and Xβ = Y.
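This slack-variable program can be passed directly to a generic LP solver. The following sketch uses scipy.optimize.linprog (an assumption on the available tooling; any LP solver would do), with the variable ordering z = (β, ξ, ξ∗).

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_noise_free(X, Y):
    # Minimize sum(xi + xi*) subject to xi >= beta, xi* >= -beta,
    # xi, xi* >= 0 and X beta = Y, with z = (beta, xi, xi*).
    N, d = X.shape
    c = np.concatenate([np.zeros(d), np.ones(d), np.ones(d)])
    I = np.eye(d)
    Z = np.zeros((d, d))
    A_ub = np.block([[I, -I, Z],      # beta - xi <= 0
                     [-I, Z, -I]])    # -beta - xi* <= 0
    b_ub = np.zeros(2 * d)
    A_eq = np.hstack([X, np.zeros((N, 2 * d))])
    bounds = [(None, None)] * d + [(0, None)] * (2 * d)  # beta is free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=Y, bounds=bounds)
    return res.x[:d]
```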

Sparsity recovery Under some assumptions, this method does recover sparse solutions when they exist. More precisely, let β̂ be the solution of the linear programming problem above. Assume that there is a set J∗ ⊂ {1, . . . , d} such that Xβ = Y for some β ∈ R^d with β^{(i)} = 0 if i ∉ J∗. Conditions under which β̂ is equal to β are provided in Candes and Tao [46]; they involve the correlations between pairs of columns of X and the size of J∗.

That the size of J∗ must be a factor is clear since, for the statement to make sense, there cannot exist two β’s satisfying Xβ = Y and β^{(i)} = 0 for i ∉ J∗. Uniqueness obviously fails if |J∗| > N because, even if one knew J∗, the condition would be under-constrained for β. Since the set J∗ is not known, we also want to avoid any other solution associated with a set of the same size: there cannot exist β and β̃, respectively vanishing outside of J∗ and J̃∗, where J∗ and J̃∗ have the same cardinality, such that Xβ = Y = Xβ̃. The equation X(β − β̃) = 0 would be under-constrained as soon as the number of non-zero coefficients of β − β̃ is larger than N, and since this number can be as large as |J∗| + |J̃∗| = 2|J∗|, we see that one should impose at least |J∗| ≤ N/2.

Given this restriction, another obvious remark is that, if the set J on which β does not vanish is known, with |J| small enough, then Xβ = Y is over-constrained and any solution is (typically) unique. So the issue really is whether the set Jβ listing the non-zero indexes of a solution β is equal to J∗.

As often, precious insight on the solution of this minimization problem is obtained by considering the dual problem. Introducing Lagrange multipliers λ(i) ≥ 0, i = 1, . . . , d for the constraints ξ(i) − β^{(i)} ≥ 0, λ∗(i) ≥ 0, i = 1, . . . , d for ξ∗(i) + β^{(i)} ≥ 0, γ(i), γ∗(i) ≥ 0 for ξ(i) ≥ 0 and ξ∗(i) ≥ 0, and α ∈ R^N for Xβ = Y, the Lagrangian is

    L(β, ξ, ξ∗, λ, λ∗, γ, γ∗, α) = (1_d − λ − γ)^T ξ + (1_d − λ∗ − γ∗)^T ξ∗ + (λ − λ∗ + X^T α)^T β − α^T Y.

The KKT conditions require γ = 1_d − λ, γ∗ = 1_d − λ∗ and X^T α = λ∗ − λ, and the complementary slackness conditions give (1 − λ(i))ξ(i) = (1 − λ∗(i))ξ∗(i) = 0, λ(i)(β^{(i)} − ξ(i)) = λ∗(i)(β^{(i)} + ξ∗(i)) = 0.

The dual problem then maximizes α^T Y subject to the constraints X^T α = λ∗ − λ and 0 ≤ λ(i), λ∗(i) ≤ 1. Assume that (α, λ, λ∗) is a solution of this dual problem. One has the following cases.

(1) If λ(i) ∈ (0, 1), then ξ(i) = β^{(i)} − ξ(i) = 0, which implies ξ(i) = β^{(i)} = 0, and, as a consequence, (1 − λ∗(i))ξ∗(i) = λ∗(i)ξ∗(i) = 0, so that also ξ∗(i) = 0.
(2) Similarly, λ∗(i) ∈ (0, 1) implies ξ(i) = ξ∗(i) = β^{(i)} = 0.
(3) If λ(i) = λ∗(i) = 1, then β^{(i)} − ξ(i) = β^{(i)} + ξ∗(i) = 0 with ξ(i), ξ∗(i) ≥ 0, so that also ξ(i) = ξ∗(i) = β^{(i)} = 0.
(4) If λ(i) = λ∗(i) = 0, then ξ(i) = ξ∗(i) = 0, and since β^{(i)} ≤ ξ(i) and β^{(i)} ≥ −ξ∗(i), we get β^{(i)} = 0.
(5) The only remaining situation, in which β^{(i)} can be non-zero, is when λ(i) = 1 − λ∗(i) ∈ {0, 1}, or, equivalently, when |λ(i) − λ∗(i)| = 1.

This discussion allows one to reconstruct the set Jβ associated with the primal problem given the solution of the dual problem. Note that |λ(i) − λ∗(i)| = |α^T X^{(i)}|, so that the set of indexes with |λ(i) − λ∗(i)| = 1 is also

    Iα = { i : |α^T X^{(i)}| = 1 }.

One has

    α^T Y = α^T Xβ = Σ_{i=1}^d β^{(i)} α^T X^{(i)} ≤ Σ_{i∈Jβ} |β^{(i)}| |α^T X^{(i)}| ≤ Σ_{i∈Jβ} |β^{(i)}|.

The upper bound is achieved when α^T X^{(i)} = sign(β^{(i)}) for i ∈ Jβ. So, if a vector α can be found such that

(i) α^T X^{(j)} = sign(β^{(j)}) for j ∈ J∗,

(ii) |α^T X^{(j)}| < 1 for j ∉ J∗,

then it is a solution of the dual problem with Iα = J∗.

Let s_{J∗} = (s^{(j)}, j ∈ J∗) be defined by s^{(j)} = sign(β^{(j)}). One can always decompose α ∈ R^N in the form

    α = X_{J∗} ρ + w,

where ρ ∈ R^{|J∗|} and w ∈ R^N is perpendicular to the columns of X_{J∗}. From X_{J∗}^T α = s_{J∗}, we get

    ρ = (X_{J∗}^T X_{J∗})^{−1} s_{J∗}.

Letting α_{J∗} be the solution with w = 0, the question is therefore whether one can find w such that

    w^T X^{(j)} = 0,  j ∈ J∗,
    |α_{J∗}^T X^{(k)} + w^T X^{(k)}| < 1,  k ∉ J∗.

Denote for short Σ_{JJ′} = X_J^T X_{J′}. One can show that such a solution exists when the matrices Σ_{JJ} are close to the identity, as soon as |J∗| is small enough [46]. More precisely, denote, for q ≤ d,

    δ(q) = max_{|J|≤q} max(‖Σ_{JJ}‖, ‖Σ_{JJ}^{−1}‖) − 1,

in which one uses the operator norm on matrices, and

    θ(q, q′) = max { z^T Σ_{JJ′} z′ : |J| ≤ q, |J′| ≤ q′, J ∩ J′ = ∅, |z| = |z′| = 1 }.

Then, the following proposition is true.

Proposition 7.7 (Candes–Tao) Let q = |J∗| and s = (s^{(j)}, j ∈ J∗) ∈ R^q. Assume that δ(2q) + θ(q, 2q) < 1. Then there exists α ∈ R^N such that α^T X^{(j)} = s^{(j)} for j ∈ J∗ and

    |α^T X^{(j)}| ≤ θ(q, q) / (1 − δ(2q) − θ(q, 2q))  if j ∉ J∗.

So α has the desired property as soon as δ(2q) + θ(q, q) + θ(q, 2q) ≤ 1. Note that one only needs to control subsets of variables of size at most 3q to obtain the conclusion, which is important, of course, when q is small compared to d.

Noisy case Consider now the noisy case. We here again introduce quantities that were pivotal for the lasso and LARS estimators, namely, the covariances between the variables and the residual error. So, we define, for a given β,

    r_β^{(i)} = X^{(i)T} (Y − Xβ),

which depends linearly on β. Then, the Dantzig selector is defined by the linear program: minimize

    Σ_{j=1}^d |β^{(j)}|

subject to the constraint

    max_{j=1,...,d} |r_β^{(j)}| ≤ C.
The explicit expression of this problem as a linear program is obtained as before by introducing slack variables ξ(j), ξ∗(j), j = 1, . . . , d, and minimizing

    Σ_{j=1}^d ξ(j) + Σ_{j=1}^d ξ∗(j)

with constraints ξ(j), ξ∗(j) ≥ 0, ξ ≥ β, ξ∗ ≥ −β and max_{j=1,...,d} |r_β^{(j)}| ≤ C.

Similar to the noise-free case, the Dantzig selector can identify sparse solutions
(up to a small error) if the columns of X are nearly orthogonal, with the same type of
conditions [47]. Interestingly enough, the accuracy of this algorithm can be proved
to be comparable to that of the lasso in the presence of a sparse solution [30].

7.4 Support Vector Machines for regression

7.4.1 Linear SVM

Problem formulation We start by discussing support vector machines (SVM) [198, 199] with RX = R^d equipped with the standard inner product (generally referred to as linear SVM) and will extend the theory to kernel methods in the next section. SVMs solve a linear regression problem, but replace the least-squares loss function by (y, y′) ↦ V(y − y′) with

    V(t) = 0 if |t| < ε,  V(t) = |t| − ε if |t| ≥ ε.    (7.10)

[Figure 7.1: The function V defining the SVM risk function.]

A plot of the function V is provided in fig. 7.1. This function is an example of what is often called a robust loss function. The quadratic error used in linear regression had the advantage of providing closed-form expressions for the solution, but is quite sensitive to outliers. For robustness, it is preferable to use loss functions that, like V, increase at most linearly at infinity. One sometimes chooses them as smooth convex functions, for example V(t) = ε(1 − cos(γt/ε))/(1 − cos γ) for |t| < ε and V(t) = |t| for |t| ≥ ε, where γ is chosen so that γ sin γ/(1 − cos γ) = 1. In such a case, minimizing

    F(β) = Σ_{k=1}^N V(yk − a0 − xk^T b)

can be done using gradient descent methods. Using V in (7.10) will require a little more work, as we see now.

The SVM regression problem is generally formulated as the minimization of

    Σ_{k=1}^N V(yk − a0 − xk^T b) + λ|b|²,

and we will study a slightly more general problem, minimizing

    F(a0, b) = Σ_{k=1}^N V(yk − a0 − xk^T b) + λ b^T ∆b,

where ∆ is a symmetric positive-definite matrix. This objective function exhibits the following features:

• A penalty on the coefficients of b, similar to ridge regression.

• A linear penalty (instead of quadratic) for large errors in the prediction.

• An ε-tolerance for small errors, often referred to as the margin of the regression SVM.

We now describe the various steps in the analysis and reduction of the problem.
They will lead to simple minimization algorithms, and possible extensions to non-
linear problems.

Reduction to a quadratic programming problem Introduce slack variables ξk, ξk∗, k = 1, . . . , N. The original problem is equivalent to the minimization, with respect to (a0, b, ξ, ξ∗), of

    Σ_{k=1}^N (ξk + ξk∗) + λ b^T ∆b

under the constraints:

    ξk, ξk∗ ≥ 0,
    ξk − yk + a0 + xk^T b + ε ≥ 0,    (7.11)
    ξk∗ + yk − a0 − xk^T b + ε ≥ 0.

The simple proof of this equivalence, which results in a quadratic programming


problem, is left to the reader. As often, one gains additional insight by studying the
dual problem.
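For illustration, the quadratic program above can be handed to a generic convex solver. The sketch below assumes the external package cvxpy is available; all variable names are ours.

```python
import numpy as np
import cvxpy as cp

def svm_regression_primal(x, y, eps, lam, Delta):
    # Quadratic program (7.11): minimize sum_k (xi_k + xi*_k) + lam b^T Delta b
    # under the epsilon-insensitive constraints.
    N, d = x.shape
    a0 = cp.Variable()
    b = cp.Variable(d)
    xi = cp.Variable(N, nonneg=True)
    xis = cp.Variable(N, nonneg=True)
    pred = a0 + x @ b
    objective = cp.Minimize(cp.sum(xi) + cp.sum(xis)
                            + lam * cp.quad_form(b, Delta))
    constraints = [xi - y + pred + eps >= 0,
                   xis + y - pred + eps >= 0]
    cp.Problem(objective, constraints).solve()
    return a0.value, b.value
```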

Dual problem Introduce 4N non-negative Lagrange multipliers for the 4N constraints in the problem, namely, ηk, ηk∗ ≥ 0 for the positivity constraints, and αk, αk∗ ≥ 0 for the last two in (7.11). The resulting Lagrangian is

    L(a0, b, ξ, ξ∗, α, α∗, η, η∗) = Σ_{k=1}^N (ξk + ξk∗) + λ b^T ∆b − Σ_{k=1}^N (ηk ξk + ηk∗ ξk∗)
        − Σ_{k=1}^N αk (ξk − yk + a0 + xk^T b + ε) − Σ_{k=1}^N αk∗ (ξk∗ + yk − a0 − xk^T b + ε).

In this formulation, (a0, b, ξ, ξ∗) are the primal variables, and α, α∗, η, η∗ the dual variables.

The KKT conditions are provided by the system:

    Σ_{k=1}^N (αk − αk∗) = 0,
    2λ∆b − Σ_{k=1}^N (αk − αk∗) xk = 0,
    1 − ηk − αk = 0,
    1 − ηk∗ − αk∗ = 0,    (7.12)
    αk (ε + ξk − yk + a0 + xk^T b) = 0,
    αk∗ (ε + ξk∗ + yk − a0 − xk^T b) = 0,
    ηk ξk = ηk∗ ξk∗ = 0.

The first four equations are the derivatives of the Lagrangian with respect to a0, b, ξk, ξk∗ in this order, and the last three are the complementary slackness conditions.

The dual problem maximizes the function

    L∗(α, α∗, η, η∗) = inf_{a0, b, ξ, ξ∗} L

under the previous positivity constraints. Since the Lagrangian is linear in a0, ξk and ξk∗, its minimum is −∞ unless the corresponding coefficients vanish. The linear terms must therefore vanish for L∗ to be finite. With these conditions plus the fact that ∂b L = 0, we retrieve the first four equations of system (7.12). Using ηk = 1 − αk, ηk∗ = 1 − αk∗ and

    b = (1/(2λ)) Σ_{k=1}^N (αk − αk∗) ∆^{−1} xk,    (7.13)

one can express L∗ uniquely as a function of α, α∗, yielding

    L∗(α, α∗) = −(1/(4λ)) Σ_{k,l=1}^N (αk − αk∗)(αl − αl∗) xk^T ∆^{−1} xl − ε Σ_{k=1}^N (αk + αk∗) + Σ_{k=1}^N (αk − αk∗) yk.

This quantity must be maximized subject to the constraints 0 ≤ αk, αk∗ ≤ 1 and Σ_{k=1}^N (αk − αk∗) = 0. This still is a quadratic programming problem, but it now has nice additional features and interpretations.

Analysis of the dual problem The dual problem only depends on the xk’s through the matrix with coefficients xk^T ∆^{−1} xl, which is the Gram matrix of x1, . . . , xN for the inner product associated with ∆^{−1}. This property will lead to the kernel version of SVMs discussed in the next section. The obtained predictor can also be expressed as a function of these products, since

    y = a0 + x^T b = a0 + (1/(2λ)) Σ_{k=1}^N (αk − αk∗)(xk^T ∆^{−1} x).

Moreover, the dimension of the dual problem is 2N, which allows the method to be used in large (possibly infinite) dimensions with a bounded cost.

We now analyze the solutions α, α∗ of the dual problem. The complementary slackness conditions reduce to:

    αk (ε + ξk − yk + a0 + xk^T b) = 0,
    αk∗ (ε + ξk∗ + yk − a0 − xk^T b) = 0,    (7.14)
    (1 − αk) ξk = (1 − αk∗) ξk∗ = 0.

These conditions have the following consequences, based on the prediction error made for each training sample.

(i) First consider indexes k such that the error is strictly within the tolerance margin: |yk − a0 − xk^T b| < ε. Then the terms between parentheses in the first two equations of (7.14) are strictly positive, which implies that αk = αk∗ = 0. The last two equations in (7.14) then imply ξk = ξk∗ = 0.

(ii) Consider now the case when the prediction is strictly less accurate than the tolerance margin. Assume that yk − a0 − xk^T b > ε. The second and third equations in (7.14) imply that αk∗ = ξk∗ = 0. The assumption also implies that

    ξk = yk − a0 − xk^T b − ε > 0

and αk = 1. The case yk − a0 − xk^T b < −ε is symmetric and provides αk = ξk = 0, ξk∗ > 0 and αk∗ = 1.

(iii) Finally, consider samples for which the prediction error is exactly at the tolerance margin. If yk − a0 − xk^T b = ε, we have αk∗ = ξk = ξk∗ = 0. The fact that αk∗ = ξk∗ = 0 is clear. To prove that ξk = 0, we note that we would otherwise have ξk − yk + a0 + xk^T b + ε > 0, which would imply that αk = 0, and we reach a contradiction with (1 − αk)ξk = 0. Similarly, yk − a0 − xk^T b = −ε implies that αk = ξk = ξk∗ = 0.

The points for which |yk − a0 − xk^T b| = ε are called support vectors.

One important piece of information deriving from this discussion is that the variables (αk, αk∗) have prescribed values as long as the error yk − a0 − xk^T b is not exactly ε in absolute value: (1, 0) if the error is larger than ε, (0, 0) if it is strictly between −ε and ε, and (0, 1) if it is less than −ε. Also, in all cases, at least one of αk and αk∗ must vanish. Only in the case of support vectors does the previous discussion fail to provide a value for one of these variables.

Now, we want to reverse the discussion and assume that the dual problem is solved, to see how the variables a0 and b of the primal problem can be retrieved. For b, this is easy, thanks to (7.13). For a0, a direct computation can be made if a support vector is identified, either because 0 < αk < 1, which implies that a0 = yk − xk^T b − ε, or because 0 < αk∗ < 1, which yields a0 = yk − xk^T b + ε.

If no support vector can be identified, a0 is not uniquely determined (note that the objective function is not strictly convex in a0). However, the coefficients αk, αk∗ provide some information on this intercept, in the form of inequalities. More precisely, let J+ = {k : αk = 1}, J− = {k : αk∗ = 1} and J0 = {k : αk = αk∗ = 0}. Then k ∈ J+ implies that yk − a0 − b^T xk ≥ ε, so that a0 ≤ yk − b^T xk − ε. Similarly, k ∈ J− implies that a0 ≥ yk − b^T xk + ε. Finally, k ∈ J0 implies that a0 ≥ yk − b^T xk − ε and a0 ≤ yk − b^T xk + ε. As a consequence, one can take a0 to be any point in the interval [a0^−, a0^+], where

    a0^− = max( max_{k∈J−} (yk − xk^T b + ε), max_{k∈J0} (yk − xk^T b − ε) ),
    a0^+ = min( min_{k∈J+} (yk − xk^T b − ε), min_{k∈J0} (yk − xk^T b + ε) ).
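The interval [a0^−, a0^+] is easy to compute once the dual solution is known. A possible Python sketch (our own naming and tolerance conventions):

```python
import numpy as np

def intercept_interval(alpha, alpha_star, x, y, b, eps, tol=1e-8):
    # Recover the admissible interval [a0_minus, a0_plus] for the intercept
    # from a solution (alpha, alpha_star) of the dual problem.
    s = y - x @ b                      # y_k - x_k^T b
    Jp = alpha > 1 - tol               # alpha_k = 1
    Jm = alpha_star > 1 - tol          # alpha_k^* = 1
    J0 = (alpha < tol) & (alpha_star < tol)
    lower, upper = -np.inf, np.inf
    if Jm.any():
        lower = max(lower, np.max(s[Jm] + eps))
    if J0.any():
        lower = max(lower, np.max(s[J0] - eps))
        upper = min(upper, np.min(s[J0] + eps))
    if Jp.any():
        upper = min(upper, np.min(s[Jp] - eps))
    return lower, upper
```

When a support vector is present, the interval reduces (numerically) to a single point.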

7.4.2 The kernel trick and SVMs

Returning to our feature space notation, let X take values in RX and h : RX → H be a feature function with values in an inner-product space H with associated kernel K. SVMs in feature space must minimize, with a0 ∈ R and b ∈ H,

    F(a0, b) = Σ_{k=1}^N V(yk − a0 − ⟨h(xk), b⟩_H) + λ‖b‖²_H.

Letting as before V = span(h(x1), . . . , h(xN)), the same argument as that made for ridge regression works, namely that the first term in F is unchanged if b is replaced by πV(b) and the second one is strictly reduced unless b ∈ V, leading to a finite-dimensional formulation in which

    b = Σ_{k=1}^N ck h(xk)

and one minimizes

    F(a0, c) = Σ_{k=1}^N V( yk − a0 − Σ_{l=1}^N K(xk, xl) cl ) + λ Σ_{k,l=1}^N K(xk, xl) ck cl.
This function has the same form as the one studied in the linear case, with b replaced by c ∈ R^N, xk replaced by the vector with coefficients K(xk, xl), l = 1, . . . , N, which we will denote K(k), and ∆ = K = K(x1, . . . , xN). Note that K(k) is the kth column of K, so that

    (K(k))^T K^{−1} K(l) = K(xk, xl).

Using this, we find that the dual problem requires to maximize

    L∗(α, α∗) = −(1/(4λ)) Σ_{k,l=1}^N (αk − αk∗)(αl − αl∗) K(xk, xl) − ε Σ_{k=1}^N (αk + αk∗) + Σ_{k=1}^N (αk − αk∗) yk

with

    0 ≤ αk ≤ 1,  0 ≤ αk∗ ≤ 1,  Σ_{k=1}^N (αk − αk∗) = 0.

The associated vector c satisfies

    2λc = Σ_{k=1}^N (αk − αk∗) K^{−1} K(k) = α − α∗,

and the regression function is

    f(x) = a0 + ⟨b, h(x)⟩_H = a0 + (1/(2λ)) Σ_{k=1}^N (αk − αk∗) K(x, xk).

Finally, the discussions on the values of α, α∗ and on the computation of a0 remain unchanged.
Chapter 8

Models for linear classification

In this chapter, Y is categorical and takes values in the finite set RY = {g1, . . . , gq}. The goal is to predict this class variable from the input X, taking values in a set RX.
Using the same progression as in the regression case, we will first discuss basic linear
methods, for which RX = Rd before extending them, whenever possible, to kernel
methods, for which RX can be arbitrary as soon as a feature space representation is
available.

Classifiers will be based on a training set T = ((x1, y1), . . . , (xN, yN)) with xk ∈ RX and yk ∈ RY for k = 1, . . . , N. For g ∈ RY, we will also let Ng denote the number of samples in the training set such that yk = g, i.e.,

    Ng = card{k : yk = g} = Σ_{k=1}^N 1_{yk=g}.

8.1 Logistic regression

8.1.1 General Framework

Logistic regression uses the fact that, in order to apply Bayes’s rule, only the conditional distribution of the class variable Y given X is needed, and trains a parametric model of this distribution. More precisely, if one denotes by p(g|x) the probability that Y = g conditional to X = x, logistic regression assumes that, for some parameters (a0(g), b(g), g ∈ RY) with a0(g) ∈ R and b(g) ∈ R^d, one has p = p_{a0,b} with

    log p_{a0,b}(g | x) = a0(g) + x^T b(g) − log C(a0, b, x),

where C(a0, b, x) = Σ_{g∈RY} exp(a0(g) + x^T b(g)).

Introduce the functions, defined over mappings µ : RY → R (which can be identified with vectors in R^q),

    Fg(µ) = µ(g) − log Σ_{g′∈RY} e^{µ(g′)}.    (8.1)

With this notation, letting β(g) = (a0(g), b(g)) ∈ R^{d+1} and x̃ = (1, x), one has log pβ(g|x) = Fg(β^T x̃), where β^T x̃ is the function (g′ ↦ β(g′)^T x̃).

For any constant function (g ↦ µ0 ∈ R), one has

    Fg(µ + µ0) = µ(g) + µ0 − log Σ_{g′∈RY} e^{µ(g′)+µ0} = µ(g) + µ0 − µ0 − log Σ_{g′∈RY} e^{µ(g′)} = Fg(µ).

As a consequence, if one replaces, for all g, β(g) by β̃(g) = β(g) + γ, with γ ∈ R^{d+1}, then β̃^T x̃ = β^T x̃ + γ^T x̃ and

    log p_{β̃}(g | x) = log pβ(g | x).

This shows that the model is over-parametrized. One therefore needs a (d + 1)-dimensional constraint to ensure uniqueness, and we will enforce a linear constraint in the form

    Σ_{g∈RY} ρg β(g) = c

with Σ_g ρg ≠ 0.

8.1.2 Conditional log-likelihood

The conditional log-likelihood computed from the training set is:

    ℓ(β) = Σ_{k=1}^N log pβ(yk | xk).

Logistic regression computes a maximizer β̂ of this log-likelihood. The classification rule given a new input x then chooses the class g for which p_{β̂}(g | x) is largest, or, equivalently, the class g that maximizes x̃^T β̂(g).
P
Proposition 8.1 Let m̃g = Σ_{k:yk=g} x̃k / Ng. The conditional log-likelihood ℓ is concave, with first derivative

    ∂_{β(g)} ℓ = Ng m̃g^T − Σ_{k=1}^N x̃k^T pβ(g|xk)    (8.2)

and negative semi-definite second derivative

    ∂_{β(g)} ∂_{β(g′)} ℓ = −1_{[g=g′]} Σ_{k=1}^N x̃k x̃k^T pβ(g|xk) + Σ_{k=1}^N x̃k x̃k^T pβ(g|xk) pβ(g′|xk).    (8.3)

Remark 8.2 In this discussion, we consider ℓ as a function defined over collections (β(g), g ∈ RY), or, if one prefers, on the q(d + 1)-dimensional linear space, F, of functions β : RY → R^{d+1}. With this in mind, the differential dℓ(β) is a linear form from F to R, therefore associating to any family u = (u(g), g ∈ RY) the expression

    dℓ(β) u = Σ_{g∈RY} ∂_{β(g)} ℓ u(g).

Similarly, the second derivative is the bilinear form

    d²ℓ(β)(u, u′) = Σ_{g,g′∈RY} u(g)^T ∂_{β(g)} ∂_{β(g′)} ℓ u(g′).

The last statement in the proposition expresses the fact that d²ℓ(β)(u, u) ≤ 0 for all u ∈ F. □

Proof First consider the function Fg in (8.1), so that

    ℓ(β) = Σ_{k=1}^N F_{yk}(β^T x̃k).

We have, for ζ : RY → R,

    dFg(µ)ζ = ζ(g) − ( Σ_{g′∈RY} e^{µ(g′)} ζ(g′) ) / ( Σ_{g′∈RY} e^{µ(g′)} ),

as can easily be computed by evaluating the derivative of ε ↦ Fg(µ + εζ) at ε = 0. Introducing the notation

    qµ(g) = e^{µ(g)} / Σ_{g′∈RY} e^{µ(g′)}

and

    ⟨ζ⟩µ = Σ_{g∈RY} ζ(g) qµ(g),

we have dFg(µ)ζ = ζ(g) − ⟨ζ⟩µ. Evaluating the derivative of ε ↦ dFg(µ + εζ′)(ζ) at ε = 0, one gets (the computation being left to the reader)

    d²Fg(µ)(ζ, ζ′) = −⟨ζζ′⟩µ + ⟨ζ⟩µ ⟨ζ′⟩µ.    (8.4)


176 CHAPTER 8. MODELS FOR LINEAR CLASSIFICATION

Note that −d 2 Fg (µ)(ζ, ζ) is the variance of ζ for the probability mass function qµ and
is therefore non-negative (so that Fµ is concave). This immediately shows that ` is
concave as a sum of concave functions.

Using the chain rule, we have, for u : RY → R^{d+1},

    dℓ(β)u = Σ_{k=1}^N dF_{yk}(β^T x̃k)(x̃k^T u(·)) = Σ_{k=1}^N x̃k^T u(yk) − Σ_{k=1}^N ⟨x̃k^T u(·)⟩_{β^T x̃k}.

Reordering the first sum in the right-hand side according to the values of yk gives

    Σ_{k=1}^N u(yk)^T x̃k = Σ_{g∈RY} Ng u(g)^T m̃g.

Noting that q_{β^T x̃} = pβ(·|x), we find

    dℓ(β)(u) = Σ_{g∈RY} Ng m̃g^T u(g) − Σ_{g∈RY} Σ_{k=1}^N x̃k^T u(g) pβ(g|xk),

yielding (8.2). Applying the chain rule again, we have

    d²ℓ(β)(u, u′) = Σ_{k=1}^N d²F_{yk}(β^T x̃k)(x̃k^T u(·), x̃k^T u′(·))    (8.5)

with

    d²F_{yk}(β^T x̃k)(x̃k^T u(·), x̃k^T u′(·)) = −⟨u(·)^T x̃k x̃k^T u′(·)⟩_{β^T x̃k} + ⟨x̃k^T u(·)⟩_{β^T x̃k} ⟨x̃k^T u′(·)⟩_{β^T x̃k}
        = −Σ_{g∈RY} u(g)^T x̃k x̃k^T u′(g) pβ(g|xk) + Σ_{g,g′∈RY} u(g)^T x̃k x̃k^T u′(g′) pβ(g|xk) pβ(g′|xk),

from which (8.3) follows. □

Remark 8.3 From Fg(µ + µ0) = Fg(µ) when µ0 is constant on RY, one deduces (taking the derivative at µ0 = 0) that dFg(µ)1 = 0 for all µ, where 1 denotes the constant function equal to 1 on RY. For h ∈ R^{d+1}, let ch denote the constant function ch(g) = h, g ∈ RY. We have

    dℓ(β) ch = Σ_{k=1}^N dF_{yk}(β^T x̃k)(x̃k^T ch) = Σ_{k=1}^N dF_{yk}(β^T x̃k)((x̃k^T h)1) = Σ_{k=1}^N (x̃k^T h) dF_{yk}(β^T x̃k)1 = 0.

Taking one extra derivative, we see that

    d²ℓ(β)(ch, u) = 0

for all functions u : RY → R^{d+1}. □
We now discuss whether there are other elements in the null space of the second derivative of ℓ. We will use notation introduced in the proof of proposition 8.1. From (8.4), we have d²Fg(µ)(ζ, ζ) = 0 if and only if the variance of ζ for qµ vanishes, which, since qµ > 0, is equivalent to ζ being constant. So, the null space of d²Fg(µ) is one-dimensional, and composed of scalar multiples of 1. Using (8.5), we see that d²ℓ(u, u) = 0 if and only if, for all k = 1, . . . , N, (g ↦ x̃k^T u(g)) is a constant function.

Assume that this is true. Then, letting ū = (1/q) Σ_{g∈RY} u(g), one has, for all g ∈ RY and k = 1, . . . , N,

    x̃k^T u(g) = x̃k^T ū,

so that u(g) − ū is in the null space of the matrix X. This leads to the following proposition.
Proposition 8.4 Assume that X has rank d + 1. Then the null space of d²ℓ(β) is the set of all vectors u = ch for h ∈ R^{d+1}. In particular, for any c ∈ R^{d+1}, the function ℓ restricted to the space

    M = { β : Σ_{g∈RY} ρg β(g) = c }

is strictly concave as soon as the scalar coefficients (ρg, g ∈ RY) are such that Σ_{g∈RY} ρg ≠ 0.
Proof From the discussion before the proposition, u ∈ Null(d²ℓ) implies that X(u(g) − ū) = 0 for all g, and since we assume that X has rank d + 1, this requires that u(g) = ū for all g, i.e., u = c_ū. This proves the first point.

If one restricts ℓ to M, then we must restrict d²ℓ(β) to those u’s such that Σ_{g∈RY} ρg u(g) = 0. But if d²ℓ(β)(u, u) = 0 for such a u, then u = c_ū and

    Σ_{g∈RY} ρg u(g) = ( Σ_{g∈RY} ρg ) ū.

Since we assume that Σ_{g∈RY} ρg ≠ 0, this requires ū = 0, and therefore u = 0.

This shows that the second derivative of the restriction of ℓ to M is negative definite, so this restriction is strictly concave. □
8.1.3 Training algorithm

Given that we have expressed the first and second derivatives of ℓ in closed form¹, we can use Newton–Raphson gradient ascent to maximize ℓ over the affine space

    M = { β : Σ_{g∈RY} ρg β(g) = c }

¹Their computation is feasible unless N is very large, and the matrix inversion in Newton’s iteration also requires d to be not too large.
with Σ_{g∈RY} ρg ≠ 0. We assume in the following that the matrix X has rank d + 1, so that proposition 8.4 applies. Since the constraint is affine, it is easy to express one of the parameters β(g) as a function of the others and solve the strictly concave problem as a function of the remaining variables. It is not much harder, and arguably more elegant, to solve the problem without breaking its symmetry with respect to the class indexes, as described below.

Let

    M0 = { β : Σ_{g∈RY} ρg β(g) = 0 }.
We still have the second order expansion

    ℓ(β + u) = ℓ(β) + dℓ(β)u + (1/2) d²ℓ(β)(u, u) + o(|u|²),

and we consider the maximization of the first three terms, simply restricting to vectors u ∈ M0. To allow for matrix computation, we use our ordering RY = (g1, . . . , gq) and identify u with the column vector

    (u(g1), . . . , u(gq)) ∈ R^{q(d+1)}.

Similarly, we let

    ∇ℓ(β) = (∂_{β(g1)} ℓ, . . . , ∂_{β(gq)} ℓ)^T

and let ∇²(ℓ)(β) be the block matrix with i, j block given by ∂_{β(gi)} ∂_{β(gj)} ℓ(β). We let ρ̂ be the (d + 1) × q(d + 1) row block matrix

    ρ̂ = ( ρ_{g1} Id_{R^{d+1}}  · · ·  ρ_{gq} Id_{R^{d+1}} ),

so that u ∈ M0 reads ρ̂u = 0 in vector notation. Given this, we have

    ℓ(β + u) = ℓ(β) + ∇ℓ(β)^T u + (1/2) u^T ∇²(ℓ)(β) u + o(|u|²).

The maximum of ℓ(β) + u^T ∇ℓ(β) + (1/2) u^T ∇²(ℓ)(β) u subject to ρ̂u = 0 is a stationary point of the Lagrangian

    L = ℓ(β) + u^T ∇ℓ(β) + (1/2) u^T ∇²(ℓ)(β) u + λ^T ρ̂u

for some λ ∈ R^{d+1}, and is characterized by

    ∇²(ℓ)(β) u + ∇ℓ(β) + ρ̂^T λ = 0,
    ρ̂u = 0.

This shows that the Newton–Raphson iterations can be implemented as

    β_{n+1} = βn − ε_{n+1} u_{n+1}    (8.6)

with

    [ u_{n+1} ; λ ] = [ ∇²(ℓ)(βn)  ρ̂^T ; ρ̂  0 ]^{−1} [ ∇ℓ(βn) ; 0 ].    (8.7)

We summarize this discussion in the following algorithm.

Algorithm 8.1 (Logistic regression with Newton’s gradient ascent)

(1) Input: (i) training data (x1, y1, . . . , xN, yN) with xi ∈ R^d and yi ∈ RY; (ii) coefficients ρg, g ∈ RY, with non-zero sum, and target value c ∈ R^{d+1}; (iii) algorithm step ε small enough.
(2) Initialize the algorithm with β0 such that Σ_g ρg β0(g) = c.
(3) At iteration n, compute ∇ℓ(βn) and ∇²(ℓ)(βn) as provided by proposition 8.1.
(4) Update βn using (8.6) and (8.7), with ε_{n+1} = ε. Alternatively, optimize ε_{n+1} using a line search.
(5) Stop the procedure if the change in the parameter is below a small tolerance level. Otherwise, return to step (3).
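For concreteness, here is a Python sketch of algorithm 8.1, implementing the derivatives of proposition 8.1 and the constrained Newton step (8.6)–(8.7); all function and variable names are ours, and no effort is made at efficiency.

```python
import numpy as np

def fit_logistic_newton(x, y, rho, c, step=1.0, n_iter=200, tol=1e-9):
    # x: (N, d) inputs; y: (N,) labels in {0, ..., q-1};
    # rho: (q,) with nonzero sum; c: (d+1,) constraint target.
    # Assumes the design matrix has full rank d+1 (proposition 8.4).
    N, d = x.shape
    q = int(y.max()) + 1
    p1 = d + 1
    xt = np.hstack([np.ones((N, 1)), x])             # tilde-x
    onehot = np.eye(q)[y]
    beta = np.tile(c / rho.sum(), (q, 1))            # satisfies the constraint
    rhat = np.hstack([r * np.eye(p1) for r in rho])  # the matrix rho-hat
    for _ in range(n_iter):
        logits = xt @ beta.T
        logits -= logits.max(axis=1, keepdims=True)  # numerical stabilization
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)            # p_beta(g | x_k)
        grad = ((onehot - p).T @ xt).reshape(-1)     # gradient (8.2), stacked
        H = np.zeros((q * p1, q * p1))               # Hessian (8.3), by blocks
        for g in range(q):
            for gp in range(q):
                w = p[:, g] * ((g == gp) - p[:, gp])
                H[g*p1:(g+1)*p1, gp*p1:(gp+1)*p1] = -(xt * w[:, None]).T @ xt
        # Constrained Newton step (8.6)-(8.7)
        kkt = np.block([[H, rhat.T], [rhat, np.zeros((p1, p1))]])
        rhs = np.concatenate([grad, np.zeros(p1)])
        u = np.linalg.solve(kkt, rhs)[:q * p1].reshape(q, p1)
        beta -= step * u
        if np.max(np.abs(u)) < tol:
            break
    return beta
```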

8.1.4 Penalized Logistic Regression

Logistic regression can be combined with a penalty term, e.g., maximizing

    ℓ2(β) = ℓ(β) − λ Σ_{i=1}^d |b^{(i)}|²    (8.8)

or

    ℓ1(β) = ℓ(β) − λ Σ_{i=1}^d |b^{(i)}|,    (8.9)

where b^{(i)} is the q-dimensional vector formed with the ith coefficients of b(g) for g ∈ RY. Similarly to penalized regression, one generally normalizes the x variables to have unit standard deviation before applying the method.
Maximization with the ℓ2 norm The problem in (8.8) relates to ridge regression and can be solved using a Newton–Raphson method (algorithm 8.1) with minor changes. More precisely, letting

    ∆ = [ 0 0 ; 0 Id_{R^d} ],

we have, considering β as a (d + 1) × q matrix,

    ℓ2(β) = ℓ(β) − λ trace(β^T ∆β)

and

    dℓ2(β)u = dℓ(β)u − 2λ trace(β^T ∆u),
    d²ℓ2(β)(u, u′) = d²ℓ(β)(u, u′) − 2λ trace(u^T ∆u′).

In addition, when λ > 0, the problem is over-parametrized only up to the addition of a constant to (g ↦ a0(g)), so that one only needs a single constraint Σ_g ρg a0(g) = c, and the Lagrange multiplier in (8.7) is one-dimensional.

Maximization with the ℓ1 norm The maximization in (8.9) can be run using proximal gradient ascent (section 3.5.5). Let C denote the affine subset of the parameter space containing all β = (a0, b) such that Σ_g ρg a0(g) = c, and σC the convex indicator function with σC(β) = 0 if β ∈ C and +∞ otherwise.

We want to maximize the objective function

    ℓ1(β) = ℓ(β) − λγ(β)

with

    γ(a0, b) = Σ_{i=1}^d √( Σ_{g∈RY} b^{(i)}(g)² ) + σC(β).

Here, ℓ is concave and γ is convex, and the proximal gradient iterations are

    β_{n+1} = prox_{ελγ}(βn + ε∇ℓ(βn)).    (8.10)

We now compute the proximal operator of λγ; since γ is the sum of functions depending on a0, b^{(1)}, . . . , b^{(d)}, it suffices to compute separately the proximal operator of each of these functions.

Let u^{(0)}, . . . , u^{(d)} be functions from RY to R. Starting with a0, we know that h^{(0)} = prox_{λσC}(u^{(0)}) is the projection of u^{(0)} on C, and is therefore characterized by h^{(0)} ∈ C and (u^{(0)} − h^{(0)}) ⊥ C, the latter implying that h^{(0)} = u^{(0)} + tρ for some t ∈ R and the former allowing one to identify t as t = (c − (u^{(0)})^T ρ)/|ρ|², so that

    prox_{λσC}(u^{(0)}) = u^{(0)} + ((c − (u^{(0)})^T ρ)/|ρ|²) ρ.

Now consider i ∈ {1, . . . , d} and compute

    argmin_{h^{(i)}} ( √( Σ_{g∈RY} h^{(i)}(g)² ) + (1/(2λ)) Σ_{g∈RY} (u^{(i)}(g) − h^{(i)}(g))² ).

The Euclidean norm being differentiable everywhere except at 0, we first search for a solution for which h^{(i)} does not vanish. If such a solution exists, it must satisfy, for all g ∈ RY,

    h^{(i)}(g) / √( Σ_{g′∈RY} h^{(i)}(g′)² ) + (1/λ)(h^{(i)}(g) − u^{(i)}(g)) = 0.

Letting |h^{(i)}| = √( Σ_{g∈RY} h^{(i)}(g)² ), we get

    h^{(i)}(·)(|h^{(i)}(·)| + λ) = u^{(i)}(·) |h^{(i)}(·)|.

Taking the norm on both sides and dividing by |h^{(i)}(·)| (which is assumed not to vanish) yields

    |h^{(i)}(·)| + λ = |u^{(i)}(·)|,

which has a positive solution only if |u^{(i)}(·)| > λ, and gives in that case

    h^{(i)}(·) = ((|u^{(i)}(·)| − λ)/|u^{(i)}(·)|) u^{(i)}(·).
If |u^{(i)}(·)| ≤ λ, then we must take h^{(i)}(·) = 0. We have therefore obtained:

    prox_{λγ}(u) = h

with

    h^{(0)}(·) = u^{(0)} + ((c − (u^{(0)})^T ρ)/|ρ|²) ρ    (8.11a)

and, for i ≥ 1,

    h^{(i)}(·) = max( (|u^{(i)}(·)| − λ)/|u^{(i)}(·)| , 0 ) u^{(i)}(·).    (8.11b)

We summarize this discussion in the next algorithm, which should be run with ε > 0 small enough.
Algorithm 8.2 (Logistic lasso)

(1) Input: (i) training data (x1, y1, . . . , xN, yN) with xk ∈ R^d and yk ∈ RY; (ii) coefficients ρg, g ∈ RY, with non-zero sum, and target value c ∈ R; (iii) algorithm step ε; (iv) penalty coefficient λ.
(2) Initialize the algorithm with β0 = (a_{0,0}, b0).
(3) At iteration n, compute u = βn + ε∇ℓ(βn), with βn = (a_{0,n}, bn).
(4) Let a_{0,n+1}(·) = h^{(0)}(·) and, for i ≥ 1, b^{(i)}_{n+1}(·) = h^{(i)}(·), where h^{(0)}, . . . , h^{(d)} are given by (8.11a) and (8.11b).
(5) Stop the procedure if the change in the parameter is below a small tolerance level. Otherwise, return to step (3).
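The proximal operator (8.11a)–(8.11b) is only a few lines of code. A minimal Python sketch (names are ours):

```python
import numpy as np

def prox_gamma(u0, U, rho, c, lam):
    # u0: (q,) component for a_0; U: (d, q) with rows u^(1), ..., u^(d).
    # (8.11a): projection of u^(0) onto the affine constraint set C
    h0 = u0 + (c - u0 @ rho) / (rho @ rho) * rho
    # (8.11b): group soft-thresholding of each row u^(i)
    norms = np.sqrt((U ** 2).sum(axis=1))
    scale = np.maximum(norms - lam, 0.0) / np.where(norms > 0, norms, 1.0)
    return h0, U * scale[:, None]
```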

8.1.5 Kernel logistic regression

Let h : RX → H be a feature function with values in a Hilbert space H, with K(x, x′) = ⟨h(x), h(x′)⟩_H. The kernel version of logistic regression uses the model:

    log p_{a0,b}(g | x) = a0(g) + ⟨h(x), b(g)⟩_H − log Σ_{g̃∈RY} exp(a0(g̃) + ⟨h(x), b(g̃)⟩_H)

with b(g) ∈ H for g ∈ RY.

Using the usual kernel argument, one sees that, when maximizing the log-likelihood, there is no loss of generality in assuming that each b(g) belongs to V = span(h(x1), . . . , h(xN)). Taking

    b(g) = Σ_{k=1}^N αk(g) h(xk),

we have

    log pα(g | x) = a0(g) + Σ_{k=1}^N αk(g) K(x, xk) − log Σ_{g̃∈RY} exp( a0(g̃) + Σ_{k=1}^N αk(g̃) K(x, xk) ).

To avoid overfitting, one must include a penalty term in the likelihood, and (in order to take advantage of the kernel) one can take this term proportional to Σ_g ‖b(g)‖²_H. The complete learning procedure then requires to maximize the concave penalized likelihood

    ℓ(α) = Σ_{k=1}^N log pα(yk | xk) − λ Σ_{g∈RY} Σ_{k,l=1}^N αk(g) αl(g) K(xk, xl).

The computation of the first and second derivatives of this function is similar to that for the original version, and we skip the details.

8.2 Linear Discriminant analysis

8.2.1 Generative model in classification and LDA

Generative model In classification, the class variable Y generally plays a causal role: X is produced given the class. Prediction can therefore be seen as an inverse problem, where the cause is deduced from the result. In terms of generative modeling, one should therefore model the distribution of Y, followed by the conditional distribution of X given Y.

Taking RX = R^d, denote by fg the conditional p.d.f. of X given Y = g and let πg = P(Y = g). The Bayes estimator for the 0–1 loss maximizes the posterior probability

    P(Y = g | X = x) = πg fg(x) / Σ_{g′∈RY} πg′ fg′(x).

Since the denominator does not depend on g, the Bayes estimator equivalently maximizes (taking logarithms)

    log fg(x) + log πg.

One generally speaks of a linear classification method when the prediction is based on the maximization in g of a function U(g, x), where U is affine in x. In this sense, logistic regression is linear, and kernel logistic regression is linear in feature space. For the generative approach, this occurs when one uses the following model, which provides the generative form of linear discriminant analysis (LDA). Assume that the distributions fg are all Gaussian with mean mg and common covariance matrix S, so that

    fg(x) = (1/√((2π)^d det S)) e^{−(1/2)(x−mg)^T S^{−1}(x−mg)}.    (8.12)
In this case, the optimal predictor must maximize (in g)
1
− (x − mg )T S −1 (x − mg ) + log πg .
2
P
Introduce m = E(X) = g∈RY πg mg . Then the optimal classifier must maximize

1 1
− (x − m)T S −1 (x − m) + (x − m)T S −1 (mg − m) − (mg − m)T S −1 (mg − m) + log πg .
2 2
Since the first term does not depend on g, it is equivalent to maximize
1
(x − m)T S −1 (mg − m) − (mg − m)T S −1 (mg − m) + log πg (8.13)
2
with respect to the class g, which provides an affine function of x.
184 CHAPTER 8. MODELS FOR LINEAR CLASSIFICATION

Training Training for LDA simply consists in estimating the class means and common covariance in (8.12) from data. We introduce some notation for this purpose (this notation will be reused through the rest of this chapter).

Recall that Ng, g ∈ RY, denotes the number of samples with class g in the training set T = (x1, y1, . . . , xN, yN). We let cg = Ng/N and C be the diagonal matrix with diagonal coefficients c_{g1}, . . . , c_{gq}. We also let ζ ∈ R^q denote the vector with these coordinates. For g ∈ RY, µg denotes the class average

    µg = (1/Ng) Σ_{k=1}^N xk 1_{yk=g}

and µ the global average

    µ = (1/N) Σ_{k=1}^N xk = Σ_{g∈RY} cg µg.

Let Σg denote the sample covariance matrix in class g, defined by

    Σg = (1/Ng) Σ_{k=1}^N (xk − µg)(xk − µg)^T 1_{yk=g},

and Σw the pooled class covariance (also called within-class covariance) defined by

    Σw = (1/N) Σ_{k=1}^N (xk − µ_{yk})(xk − µ_{yk})^T = Σ_{g∈RY} cg Σg.

Let, in addition, Σb denote the “between-class” covariance matrix, given by

    Σb = Σ_{g∈RY} cg (µg − µ)(µg − µ)^T.

The global covariance matrix, given by

    ΣXX = (1/N) Σ_{k=1}^N (xk − µ)(xk − µ)^T,

satisfies ΣXX = Σw + Σb. This identity is proved by noting that, for any g ∈ RY,

    (1/Ng) Σ_{k=1}^N (xk − µ)(xk − µ)^T 1_{yk=g} = Σg + (µg − µ)(µg − µ)^T.

We will finally denote by M the matrix

    M = [ (µ_{g1} − µ)^T ; . . . ; (µ_{gq} − µ)^T ].

Note that Σb = M^T C M.

Given this notation, one can in particular take m̂g = µg, m̂ = µ and Ŝ = Σw in (8.13). The class probabilities πg can be deduced from the normalized frequencies of y1, . . . , yN. However, in many applications, one prefers to simply fix πg = 1/q, in order to balance the importance of each class.
Remark 8.5 If one relaxes the assumption of common class covariances, one needs to use Σg in place of Σw for class g. The decision boundaries are not linear in this case, but provided by quadratic equations (and the resulting method is often called quadratic discriminant analysis, or QDA). QDA requires the estimation of qd(d + 3)/2 coefficients, which may be overly ambitious when the sample size is not large compared to the dimension, in which case QDA is prone to overfitting. (Even LDA, which involves qd + d(d + 1)/2 parameters, may be unrealistic in some cases.) We also note a variant of QDA that uses class covariance matrices given by

    Σ̃g = αΣw + (1 − α)Σg.

8.2.2 Dimension reduction

One of the interests of LDA is that it can be combined with a rank reduction procedure. LDA with q classes can always be seen as a (q − 1)-dimensional problem after suitable projection on a data-dependent affine space. Recall that the classification rule after training requires to maximize w.r.t. g ∈ RY the function

    (x − µ)^T Σw^{−1}(µg − µ) − (1/2)(µg − µ)^T Σw^{−1}(µg − µ) + log πg.

Define the “spherized” data² by x̃k = Σw^{−1/2}(xk − µ), where Σw^{1/2} is the positive symmetric square root of Σw. Also let µ̃g = Σw^{−1/2}(µg − µ).

With this notation, the predictor chooses the class g that maximizes

    x̃^T µ̃g − (1/2)|µ̃g|² + log πg

with x̃ = Σw^{−1/2}(x − µ).

²In this section only, the notation x̃ does not refer to (1, x^T)^T.
Now, let V = span{µ̃g, g ∈ RY}. Since Σ_g cg µ̃g = 0, this space is at most (q − 1)-dimensional. Let PV denote the orthogonal projection on V. We have x̃^T z = (PV x̃)^T z for any z ∈ V and x̃ ∈ R^d.

The classification rule can then be replaced by maximizing

    (PV x̃)^T µ̃g − (1/2)|µ̃g|² + log πg

with x̃ = Σw^{−1/2}(x − µ).
Recall that M = [ (µ_{g1} − µ)^T ; . . . ; (µ_{gq} − µ)^T ] and let M̃ = [ µ̃_{g1}^T ; . . . ; µ̃_{gq}^T ]. The dimension, denoted r, of V is equal to the rank of M̃. Let (ẽ1, . . . , ẽr) be an orthonormal basis of V. One has

    PV x̃ = Σ_{j=1}^r (x̃^T ẽj) ẽj.

Given an input x, one must therefore compute the “scores” γj(x) = x̃^T ẽj and maximize

    Σ_{j=1}^r γj(x) γj(µg) − (1/2) Σ_{j=1}^r γj(µg)² + log πg.

The following proposition is key to the practical implementation of LDA with dimension reduction.

Proposition 8.6 An orthonormal basis of V = span(µ̃g, g ∈ RY) is provided by the first r eigenvectors of M̃^T C M̃ associated with eigenvalues λ1 ≥ · · · ≥ λr > 0 (all other eigenvalues being zero).

Proof Indeed, if x̃ is perpendicular to V, we have

    M̃^T C M̃ x̃ = Σ_{g∈RY} cg (µ̃g^T x̃) µ̃g = 0,

so that V⊥ ⊂ Null(M̃^T C M̃), and both spaces coincide because they have the same dimension (d − r). This shows that V = Null(M̃^T C M̃)⊥ = Range(M̃^T C M̃). Since M̃^T C M̃ is symmetric, Null(M̃^T C M̃)⊥ is generated by eigenvectors with non-zero eigenvalues. □

Returning to the original variables, we have M̃ = M Σw^{−1/2} and M^T C M = Σb, the between-class covariance matrix. This implies that M̃^T C M̃ = Σw^{−1/2} Σb Σw^{−1/2}, and each eigenvector ẽj therefore satisfies

    Σb Σw^{−1/2} ẽj = λj Σw^{1/2} ẽj = λj Σw (Σw^{−1/2} ẽj).

[Figure 8.1: Left: original (training) data with three classes. Right: LDA scores, where the x axis provides γ1 and the y axis γ2.]

Therefore, letting ej = Σw^{−1/2} ẽj, (e1, . . . , er) are the solutions of the generalized eigenvalue problem Σb e = λΣw e that are associated with non-zero eigenvalues (they are, however, normalized so that ej^T Σw ej = 1). Moreover, the scores are given by

    γj(x) = x̃^T ẽj = (x − µ)^T Σw^{−1/2} ẽj = (x − µ)^T ej

and can therefore be computed directly from the original data and the vectors e1, . . . , er. An example of training data and its representation in the LDA space (associated with the scores) is provided in fig. 8.1.

We can now describe the LDA learning algorithm with dimension reduction.

Algorithm 8.3 (LDA with dimension reduction)

1. Compute µg, g ∈ RY, Σw and Σb from training data.
2. Estimate (if needed) πg, g ∈ RY.
3. Solve the generalized eigenvalue problem Σb e = λΣw e. Let e1, . . . , er be the eigenvectors associated with non-zero eigenvalues, normalized so that ej^T Σw ej = 1.
4. Choose a reduced dimension r0 ≤ r.
5. Precompute the mean scores γj(µg) = (µg − µ)^T ej, g ∈ RY, j = 1, . . . , r0.
6. To classify a new example x, compute γj(x) = (x − µ)^T ej and choose the class that maximizes

    Σ_{j=1}^{r0} γj(x) γj(µg) − (1/2) Σ_{j=1}^{r0} γj(µg)² + log πg.
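A compact Python sketch of algorithm 8.3 follows; it relies on scipy.linalg.eigh to solve the generalized eigenvalue problem Σb e = λΣw e (an assumption on the available tooling), which returns eigenvectors normalized so that e^T Σw e = 1. All names are ours, and each class is assumed to appear at least twice in the training set.

```python
import numpy as np
from scipy.linalg import eigh

def fit_lda(x, y, q):
    # x: (N, d) inputs; y: (N,) labels in {0, ..., q-1}; Sw assumed invertible
    mu = x.mean(axis=0)
    mus = np.array([x[y == g].mean(axis=0) for g in range(q)])
    cg = np.array([(y == g).mean() for g in range(q)])
    Sw = sum(cg[g] * np.cov(x[y == g].T, bias=True) for g in range(q))
    Sb = (cg[:, None] * (mus - mu)).T @ (mus - mu)
    vals, vecs = eigh(Sb, Sw)            # solves Sb e = lambda Sw e
    order = np.argsort(vals)[::-1]       # eigenvalues in decreasing order
    r = min(q - 1, x.shape[1])
    E = vecs[:, order[:r]]               # columns e_1, ..., e_r
    return mu, mus, E

def lda_scores(x_new, mu, mus, E, priors):
    # Step 6 of algorithm 8.3: one score per class, largest wins
    g_scores = (mus - mu) @ E            # gamma_j(mu_g)
    gx = (x_new - mu) @ E                # gamma_j(x)
    return gx @ g_scores.T - 0.5 * (g_scores ** 2).sum(axis=1) + np.log(priors)
```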

8.2.3 Fisher’s LDA

This characterization leads to the discriminative interpretation of LDA, also called Fisher’s LDA. Indeed, the generalized eigenvalue problem Σb e = λΣw e is directly related to the maximization of e^T Σb e subject to e^T Σw e = 1, which provides directions that have a large between-class variance for a within-class variance equal to 1. More precisely, e1 is the direction that achieves the maximum; e2 is the second best direction, constrained to being perpendicular to e1; and so on until er, which is optimal subject to being perpendicular to (e1, . . . , er−1). We are therefore looking for directions that have the largest ratio of between-class variance to within-class variance.

8.2.4 Kernel LDA

Mean and covariance in feature space We assume the usual construction where h : RX → H is a feature function, H a Hilbert space with kernel K(x, x′) = ⟨h(x), h(x′)⟩_H. (The assumption that H is a complete space is here required for the expectations below to be meaningful.)

We now discuss the kernel version of LDA by plugging the feature space representation directly in the classification rule. So, consider h : RX → H. Let X : Ω → RX be a random variable such that E(‖h(X)‖²_H) < ∞. Then, its mean feature m = E(h(X)) is well defined as an element of H, and so are the class averages mg = E(h(X) | Y = g).

In this possibly infinite-dimensional setting, the covariance “matrix” is defined as a linear operator S : H → H such that, for all ξ, η ∈ H:
    ⟨ξ, Sη⟩_H = E( ⟨h(X) − m, ξ⟩_H ⟨h(X) − m, η⟩_H ),    (8.14)

which is equivalent to defining

    Sη = E( ⟨h(X) − m, η⟩_H (h(X) − m) )

for η ∈ H. This definition generalizes the identity for a random variable U : Ω → R^d:

    SU w = E( (U − E(U))(U − E(U))^T ) w = E( ((U − E(U))^T w)(U − E(U)) ).

One can similarly define the covariance matrix in class g, Sg, by conditioning the right-hand side in (8.14) by Y = g and replacing m by mg.

LDA in feature space

Following the LDA model, we assume that the operators Sg are all equal to a fixed
operator, the within-class covariance operator denoted S.

Assuming that S is invertible, one can generalize the LDA classification rule to data represented in feature space by classifying a new input x in class g when

    ⟨h(x) − m, S^{−1}(mg − m)⟩_H − (1/2)⟨mg − m, S^{−1}(mg − m)⟩_H + log πg    (8.15)

is maximal over all classes. Notice that this is a transcription of the finite-dimensional Bayes rule, but cannot be derived from a generative model, because the assumption that h(X) is Gaussian is not valid in general. (It would require that h takes values in a d-dimensional linear space, which would eliminate all interesting kernel representations.)

Let, as before, T = (x1, y1, . . . , xN, yN) be the training set, Ng denote the number of examples in class g, and cg = Ng/N. When h is known (which, we recall, is not a practical assumption, but we will fix this later), one can estimate the class averages from training data by

    µg = (1/Ng) Σ_{k=1}^N h(xk) 1_{yk=g}

and the within-class covariance operator by

    ⟨ξ, Σw η⟩_H = (1/N) Σ_{k=1}^N ⟨h(xk) − µ_{yk}, ξ⟩_H ⟨h(xk) − µ_{yk}, η⟩_H.

Unfortunately, the resulting covariance estimator cannot be directly used in (8.15), because it is not invertible if dim(H) > N. Indeed, one has Σw η = 0 as soon as η is perpendicular to V = span(h(x1), . . . , h(xN)).

One way to address the degeneracy of the estimated covariance operator is to add to Σw a small multiple of the identity, say ρId_H,³ and let the classification rule maximize in g:

    ⟨h(x) − µ, (Σw + ρId_H)^{−1}(µg − µ)⟩_H − (1/2)⟨µg − µ, (Σw + ρId_H)^{−1}(µg − µ)⟩_H + log πg,    (8.16)

where µ is the average of h(x1), . . . , h(xN). Taking this option, we still need to make this expression computable and remove the dependency in the feature function h.

³The operator A + ρId_H is invertible as soon as A is symmetric positive semi-definite.

Reduction We have µg ∈ V for all g ∈ RY and, since

Σw η = (1/N) Σ_{k=1}^N ⟨h(xk) − µ_{yk} , η⟩_H (h(xk) − µ_{yk}),
3 The operator A + ρIdH is invertible as soon as A is symmetric positive semi-definite.

this operator maps H to V, which implies that Σw + ρId_H maps V into itself. Moreover,
this mapping is onto: if v ∈ V and u = (Σw + ρId_H)⁻¹ v, then u ∈ V. Indeed,
for any z ⊥ V, we have ⟨z , Σw u + ρu⟩_H = ⟨z , v⟩_H. We have ⟨z , Σw u⟩_H = 0 (because
Σw maps H to V) and ⟨z , v⟩_H = 0 (because v ∈ V), so that ρ⟨z , u⟩_H = 0, i.e.,
⟨z , u⟩_H = 0. Since this is true for all z ⊥ V, this requires that u ∈ V.⁴

We now express the classification rule in (8.16) as a function of the kernel associated
with the feature-space representation. Denote, for any vector u ∈ RN,

ξ(u) = Σ_{k=1}^N u(k) h(xk),

thereby defining a mapping ξ from RN onto V. Letting as usual K = K(x1, . . . , xN)
be the matrix formed by pairwise evaluations of K on training inputs, we have the
identity

⟨ξ(u) , ξ(u′)⟩_H = u^T K u′

for all u, u′ ∈ RN. For simplicity, we will assume in the rest of the discussion that K
is invertible.

We have µg = ξ(1g /Ng ), where 1g ∈ RN is the vector with kth coordinate equal
to 1 if yk = g and 0 otherwise. Also µ = ξ(1/N ) (recall that 1 is the vector with all
coordinates equal to 1).

For u ∈ RN, we want to characterize v ∈ RN such that Σw ξ(u) = ξ(v). Let δk
denote the vector with 1 at the kth entry and 0 elsewhere. We have

Σw ξ(u) = (1/N) Σ_{k=1}^N ⟨ξ(u) , h(xk) − µ_{yk}⟩_H (h(xk) − µ_{yk})
        = (1/N) Σ_{k=1}^N ⟨ξ(u) , ξ(δk − 1_{yk}/N_{yk})⟩_H ξ(δk − 1_{yk}/N_{yk})
        = (1/N) Σ_{k=1}^N ((δk − 1_{yk}/N_{yk})^T K u) ξ(δk − 1_{yk}/N_{yk})
        = ξ( (1/N) Σ_{k=1}^N ((δk − 1_{yk}/N_{yk})^T K u) (δk − 1_{yk}/N_{yk}) ),

so that Σw ξ(u) = ξ(P K u) with

P = (1/N) Σ_{k=1}^N (δk − 1_{yk}/N_{yk})(δk − 1_{yk}/N_{yk})^T.
4 One has (V ⊥ )⊥ = V for finite-dimensional—or more generally closed—subspaces of H

Note that one has

• Σ_{k=1}^N δk δk^T = Id_{R^N},

• Σ_{k=1}^N (1_{yk}/N_{yk}) δk^T = Σ_{g∈RY} Σ_{k: yk=g} (1g/Ng) δk^T = Σ_{g∈RY} 1g 1g^T / Ng = ( Σ_{k=1}^N δk (1_{yk}/N_{yk})^T )^T,

• Σ_{k=1}^N (1_{yk}/N_{yk})(1_{yk}/N_{yk})^T = Σ_{g∈RY} 1g 1g^T / Ng.

This shows that P can be expressed as

P = (1/N) ( Id_{R^N} − Σ_{g∈RY} 1g 1g^T / Ng ).

We have therefore proved that

• (Σw + ρId_H) ξ(u) = ξ( (P K + ρId_{R^N}) u ),

• (Σw + ρId_H)⁻¹ ξ(ũ) = ξ( (P K + ρId_{R^N})⁻¹ ũ ).

Recall that the feature-space LDA classification rule maximizes

⟨h(x) − µ , (Σw + ρId_H)⁻¹(µg − µ)⟩_H − (1/2)⟨µg − µ , (Σw + ρId_H)⁻¹(µg − µ)⟩_H + log πg.
All terms belong to V, except h(x), but this term can be replaced by its orthogonal
projection on V without changing the result. This projection can be made explicit
in terms of the representation ξ as follows. For x ∈ RX, let ξ(ψ(x)) denote the orthogonal
projection of h(x) on V (this defines the function ψ). If v(x) denotes the vector
with coordinates K(x, xk), k = 1, . . . , N, then ψ(x) = K⁻¹ v(x), as can be obtained by
identifying the inner products ⟨h(x) , h(xk)⟩_H and ⟨ξ(ψ(x)) , h(xk)⟩_H.

We are now ready to rewrite the kernel LDA classification rule in terms of quantities
that only involve K. We have

⟨h(x) − µ , (Σw + ρId_H)⁻¹(µg − µ)⟩_H
  = ⟨ξ(ψ(x) − 1/N) , ξ( (P K + ρId_{R^N})⁻¹ (1g/Ng − 1/N) )⟩_H
  = (ψ(x) − 1/N)^T K (P K + ρId_{R^N})⁻¹ (1g/Ng − 1/N).

Given this, the classification rule must maximize

(ψ(x) − 1/N)^T K (P K + ρId_{R^N})⁻¹ (1g/Ng − 1/N)
  − (1/2) (1g/Ng − 1/N)^T K (P K + ρId_{R^N})⁻¹ (1g/Ng − 1/N) + log πg.    (8.17)

Dimension reduction Note that K(P K + ρId_{R^N})⁻¹ = K(K P K + ρK)⁻¹ K is a symmetric
matrix. So, the expression in (8.17) can be written as

(v(x) − η̄)^T R⁻¹ (ηg − η̄) − (1/2)(ηg − η̄)^T R⁻¹ (ηg − η̄) + log πg,

with R = K P K + ρK, ηg = K 1g/Ng and η̄ = K 1/N. Clearly, if v1, . . . , vN are the column
vectors of K, we have

ηg = (1/Ng) Σ_{k=1}^N vk 1_{yk=g},    η̄ = (1/N) Σ_{k=1}^N vk.

We therefore retrieve an expression similar to finite-dimensional LDA, provided
that one replaces x by v(x), xk by vk and Σw by R. Letting

Q = (1/N) Σ_{g∈RY} Ng (ηg − η̄)(ηg − η̄)^T

be the between-class covariance matrix, the discriminant directions are therefore
solutions of the generalized eigenvalue problem

Q fj = λj R fj

with fj^T R fj = 1 and R = K P K + ρK. Note that

K P K = (1/N) Σ_{k=1}^N (vk − η_{yk})(vk − η_{yk})^T

is the within-class covariance matrix for the training data (v1, y1, . . . , vN, yN).

The following summarizes the kernel LDA classification algorithm.

Algorithm 8.4 (Kernel LDA)

(1) Select a positive kernel K and a coefficient ρ > 0.
(2) Given T = (x1, y1, . . . , xN, yN), compute the kernel matrix K = K(x1, . . . , xN) and
the matrix R = K P K + ρK. Let v1, . . . , vN be the column vectors of K.
(3) Compute, for g ∈ RY,

ηg = (1/Ng) Σ_{k=1}^N vk 1_{yk=g},    η̄ = (1/N) Σ_{k=1}^N vk,

and let Q = (1/N) Σ_{g∈RY} Ng (ηg − η̄)(ηg − η̄)^T.

(4) Fix r0 ≤ q − 1 and compute the eigenvectors f1, . . . , f_{r0} associated with the r0 largest
eigenvalues of the generalized eigenvalue problem Q f = λR f, normalized so that
fj^T R fj = 1.
(5) Compute the scores γ_{jg} = (ηg − η̄)^T fj.
(6) Given a new observation x, let v(x) be the vector with coordinates K(x, xk), k =
1, . . . , N. Compute the scores γj(x) = (v(x) − η̄)^T fj, j = 1, . . . , r0. Classify x in the class
g maximizing

Σ_{i=1}^{r0} γi(x) γ_{ig} − (1/2) Σ_{i=1}^{r0} γ_{ig}² + log πg.    (8.18)

8.3 Optimal Scoring

It is possible to apply linear regression (chapter 7) to solve a classification problem
by mapping the set RY to a collection of r-dimensional row vectors, or “scores.”
These scores (which have a different meaning from the LDA scores) will be represented
by a function θ : RY → Rr. As an example, one can take r = q and

θ(g1) = (1, 0, . . . , 0)^T,  θ(g2) = (0, 1, 0, . . . , 0)^T,  . . . ,  θ(gq) = (0, . . . , 0, 1)^T.
Given a training set T = (x1, y1, . . . , xN, yN) and a score function θ, a linear model can
then be estimated from data by minimizing

Σ_{k=1}^N |θ_{yk} − a0 − b^T xk|²,

where b is a d × q matrix and a0 ∈ Rq. Letting as before β be the matrix obtained by
adding a0^T as a first row to b, and X the N × (d + 1) matrix with rows x̃k^T = (1, xk^T),
k = 1, . . . , N (so that the first column of X contains only ones), one gets the least
squares estimator β̂ = (X^T X)⁻¹ X^T Y, where Y is the N × q matrix with rows θ_{yk}^T.

Given an input vector x, the row vector x̃T β will generally not coincide with
one of the score vectors. Assignment to a class can then be made by minimizing
|a0 + bT x − θg | over all g in RY .

Since the scores θ are free parameters, one may also try to optimize them, resulting
in the optimal scoring algorithm. To describe it, we will need the notation already
introduced for LDA, plus the following. We will write, for short, θj = θ(gj), and
introduce the q × r matrix Θ with rows θ1^T, . . . , θq^T. We also denote by ρ1, . . . , ρr the
column vectors of Θ, so that Θ = [ρ1, . . . , ρr]. Let u_{gi}, for i = 1, . . . , q, denote the
q-dimensional vector with ith coordinate equal to 1 and all others equal to 0. As before,
Ng denotes the class sizes, cg = Ng/N, C is the diagonal matrix with coefficients
c_{g1}, . . . , c_{gq}, and ζ = (c_{g1}, . . . , c_{gq})^T.

The goal of optimal scoring is to minimize, now with respect to θ, a0 and b, the
function

F(θ, a0, b) = Σ_{k=1}^N |θ(yk) − a0 − b^T xk|².

Some normalizing conditions are clearly needed, because this problem is under-constrained.
(In the form above, the optimal choice is to take all free parameters
equal to 0.) We now discuss the various indeterminacies and redundancies in the
model.

(a) If R is an r × r orthogonal matrix, then F(Rθ, Ra0, bR^T) = F(θ, a0, b), yielding an
infinity of equivalent solutions (that all lead to the same classification rule).
This implies that there is no loss of generality in assuming that Θ^T CΘ is diagonal
(introducing C here will turn out to be convenient). Indeed, given any (θ, a0, b), one
can just take R such that RΘ^T CΘR^T is diagonal and replace Θ by ΘR^T, a0 by Ra0 and
b by bR^T to get an equivalent solution satisfying the constraint.
(b) Let D be an r × r diagonal matrix with positive entries, and replace θ, a0 and b
respectively by Dθ, Da0 and bD. The resulting objective function is

F(Dθ, Da0, bD) = Σ_{k=1}^N |Dθ(yk) − Da0 − Db^T xk|²
             = Σ_{j=1}^r d_{jj}² Σ_{k=1}^N ( θ(yk, j) − a0(j) − Σ_{i=1}^d b(i, j) xk(i) )².

If the coefficient d_{jj} is free to choose, then the objective function can always be
reduced by letting d_{jj} → 0, which removes one of the dimensions in θ. In order to
avoid this, one needs to fix the diagonal values of Θ^T CΘ, and, by symmetry, it is
natural to require Θ^T CΘ = Id_{R^r}.
(c) Given any δ ∈ Rr, one has F(θ − δ, a0 − δ, b) = F(θ, a0, b), with identical classification
rule. One can therefore, without loss of generality, introduce r linear constraints,
and a convenient choice is

Θ^T ζ = Σ_{g∈RY} cg θg = 0.

Given this reduction, we can now describe the optimal scoring problem as the
minimization of

Σ_{k=1}^N |θ_{yk} − a0 − b^T xk|²

subject to Θ^T CΘ = Id_{R^r} and Θ^T ζ = 0.

The optimal a0 is given by

â0 = (1/N) Σ_{k=1}^N θ_{yk} − b^T µ = −b^T µ

(the average of the θ_{yk} vanishes because Θ^T ζ = 0), so that the problem is reduced
to minimizing

Σ_{k=1}^N |θ_{yk} − b^T (xk − µ)|²

subject to the same constraints. Using the facts that θ_{yk} = Θ^T u_{yk}, that

Σ_{k=1}^N u_{yk} u_{yk}^T = Σ_{g∈RY} Ng ug ug^T = N C

and that

Σ_{k=1}^N u_{yk} (xk − µ)^T = Σ_{g∈RY} Σ_{k: yk=g} ug (xk − µ)^T = Σ_{g∈RY} ug Ng (µg − µ)^T = N C M,

one can write

Σ_{k=1}^N |θ_{yk} − b^T (xk − µ)|² = Σ_{k=1}^N |Θ^T u_{yk} − b^T (xk − µ)|²
  = Σ_{k=1}^N u_{yk}^T Θ Θ^T u_{yk} − 2 Σ_{k=1}^N (xk − µ)^T b Θ^T u_{yk} + Σ_{k=1}^N (xk − µ)^T b b^T (xk − µ)
  = Σ_{k=1}^N trace(Θ^T u_{yk} u_{yk}^T Θ) − 2 Σ_{k=1}^N trace(Θ^T u_{yk} (xk − µ)^T b) + Σ_{k=1}^N trace(b^T (xk − µ)(xk − µ)^T b)
  = N trace(Θ^T C Θ) − 2N trace(Θ^T C M b) + N trace(b^T Σ_{XX} b).

Note that, since Θ^T CΘ = Id_{R^r}, we have trace(Θ^T CΘ) = r. We therefore obtain a
concise form of the optimal scoring problem: minimize

−2 trace(Θ^T C M b) + trace(b^T Σ_{XX} b)

subject to Θ^T CΘ = Id_{R^r} and Θ^T ζ = 0.

Given Θ, the optimal b is Σ_{XX}⁻¹ M^T CΘ, and replacing it in the objective function,
one finds that Θ must minimize

−2 trace(Θ^T C M Σ_{XX}⁻¹ M^T C Θ) + trace(Θ^T C M Σ_{XX}⁻¹ M^T C Θ),

i.e., maximize

trace(Θ^T C M Σ_{XX}⁻¹ M^T C Θ)
subject to Θ^T CΘ = Id_{R^r} and Θ^T ζ = 0. We now recall the following linear algebra
result (see chapter 2).

Proposition 8.7 Let A and B be, respectively, positive definite and positive semi-definite
symmetric q × q matrices. Then the maximum, over all q × r matrices S such
that S^T A S = Id_{R^r}, of trace(S^T B S) is attained at S = [σ1, . . . , σr], where the column
vectors σ1, . . . , σr are the solutions of the generalized eigenvalue problem

B σ = λ A σ

associated with the r largest eigenvalues, normalized so that σi^T A σi = 1 for i = 1, . . . , r.

Given this proposition, let ρ1, . . . , ρr be the r first eigenvectors for the problem

C M Σ_{XX}⁻¹ M^T C ρ = λ C ρ.    (8.19)

Assume that r is small enough so that the associated eigenvalues are not zero, and let
Θ = [ρ1, . . . , ρr]. We now prove that Θ is indeed a solution of the optimal scoring
problem; the only point left to show is that this Θ satisfies the constraint Θ^T ζ = 0.
But we have

M^T C 1q = Σ_g cg (µg − µ̄) = 0,

which implies that 1q is a solution of the generalized eigenvalue problem associated
with λ = 0. This in turn implies that 1q^T C ρi = ζ^T ρi = 0, which is exactly Θ^T ζ = 0.

To summarize, we have found that the solution (θ, b) minimizing

−2 trace(Θ^T C M b) + trace(b^T Σ_{XX} b)

subject to Θ^T CΘ = Id_{R^r} and Θ^T ζ = 0 is given by:


(i) Θ = [ρ1, . . . , ρr], where ρ1, . . . , ρr are the eigenvectors for the problem

C M Σ_{XX}⁻¹ M^T C ρ = λ C ρ

associated with the r largest eigenvalues, normalized so that ρ^T C ρ = 1;

(ii) b = Σ_{XX}⁻¹ M^T C Θ.

The computation can, however, be further simplified. Let λ1, . . . , λr be the eigenvalues
associated with ρ1, . . . , ρr, and let D be the associated diagonal matrix, so that one
can write

C M Σ_{XX}⁻¹ M^T C Θ = C Θ D.

This yields

Θ = M Σ_{XX}⁻¹ M^T C Θ D⁻¹ = M b D⁻¹,
from which we deduce that θg = Θ^T ug = D⁻¹ b^T (µg − µ̄). So, given a new input vector
x, the decision rule is to assign it to the class g for which

|θg − b^T(x − µ̄)|² = |Θ^T ug − b^T(x − µ̄)|² = |D⁻¹ b^T(µg − µ̄) − b^T(x − µ̄)|²

is minimal. Letting b1, . . . , br denote the r columns of b, this is equivalent to minimizing,
in g,

Σ_{j=1}^r (bj^T(µg − µ̄))²/λj² − 2 Σ_{j=1}^r (bj^T(x − µ̄))(bj^T(µg − µ̄))/λj.    (8.20)

From b = Σ_{XX}⁻¹ M^T CΘ and Θ = M b D⁻¹, we see that

b D = Σ_{XX}⁻¹ M^T C M b,

so that Σb b = Σ_{XX} b D. This shows that the columns of b are solutions of the generalized
eigenvalue problem Σb u = λ Σ_{XX} u. Moreover, from Θ^T CΘ = Id_{R^r}, we get
b^T Σb b = D². Since b^T Σb b = b^T Σ_{XX} b D, we get that b must be normalized so that
b^T Σ_{XX} b = D.

This shows that the solution of the optimal scoring problem can be reformulated
uniquely in terms of b: if b1, . . . , br are the r principal solutions of the generalized
eigenvalue problem Σb u = λ Σ_{XX} u, normalized so that u^T Σ_{XX} u = λ, a new input x
is classified into the class g minimizing

Σ_{j=1}^r γj(µg)²/λj² − 2 Σ_{j=1}^r γj(x) γj(µg)/λj,

with γj(x) = bj^T (x − µ̄).
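As an illustration, the following numpy/scipy sketch implements this b-based formulation; the function names are ours, and the small floor on the eigenvalues is a numerical safeguard, not part of the derivation.

import numpy as np
from scipy.linalg import eigh

def optimal_scoring_fit(X, y, r):
    # Solve Sigma_b u = lambda Sigma_XX u, then rescale so u^T Sigma_XX u = lambda
    X, y = np.asarray(X), np.asarray(y)
    N = X.shape[0]
    classes, counts = np.unique(y, return_counts=True)
    mu = X.mean(axis=0)
    mug = {g: X[y == g].mean(axis=0) for g in classes}
    Sigma_XX = (X - mu).T @ (X - mu) / N
    Sigma_b = sum((Ng / N) * np.outer(mug[g] - mu, mug[g] - mu)
                  for g, Ng in zip(classes, counts))
    lam, U = eigh(Sigma_b, Sigma_XX)        # eigh normalizes u^T Sigma_XX u = 1
    order = np.argsort(lam)[::-1][:r]
    lam = np.maximum(lam[order], 1e-12)     # guard against vanishing eigenvalues
    b = U[:, order] * np.sqrt(lam)
    return b, lam, mu, mug, classes

def optimal_scoring_predict(x, b, lam, mu, mug, classes):
    gx = b.T @ (np.asarray(x) - mu)         # gamma_j(x)
    def cost(g):
        gg = b.T @ (mug[g] - mu)            # gamma_j(mu_g)
        return np.sum(gg**2 / lam**2) - 2 * np.sum(gx * gg / lam)
    return min(classes, key=cost)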



Remark 8.8 The following computation shows that optimal scoring is closely related
to LDA. Recall the identity Σ_{XX} = Σw + Σb. It implies that a solution of Σb u =
λ Σ_{XX} u is also a solution of Σb u = λ̃ Σw u with λ̃ = λ/(1 − λ). If u^T Σ_{XX} u = λ, then

u^T Σw u = λ − u^T Σb u = λ − λ² = λ̃/(1 + λ̃)²,

which shows that

ũ = ((1 + λ̃)/√λ̃) u

satisfies ũ^T Σw ũ = 1. So the vectors

ej = ((1 + λ̃j)/√λ̃j) bj

coincide with the LDA directions. Letting γ̃j(x) = ej^T(x − µ̄) = ((1 + λ̃j)/√λ̃j) γj(x),
we have

Σ_{j=1}^r γj(µg)²/λj² − 2 Σ_{j=1}^r γj(x)γj(µg)/λj = Σ_{j=1}^r γ̃j(µg)²/λ̃j − 2 Σ_{j=1}^r γ̃j(x)γ̃j(µg)/(1 + λ̃j),

which relates the classification rules for the two methods.

Remark 8.9 Optimal scoring can be modified by adding a penalty of the form

γ Σ_{i=1}^r bi^T Ω bi = γ trace(b^T Ω b),    (8.21)

where Ω is a weight matrix. This only modifies the previous discussion by adding
γΩ/N to both Σ_{XX} and Σw.

8.3.1 Kernel optimal scoring

Let h : RX → H be the feature function and K the associated kernel, as usual. Optimal
scoring in feature space requires minimizing

Σ_{k=1}^N |θ_{yk} − a0 − b(h(xk))|² + γ‖b‖²_H,
k=1

where we have introduced a penalty on b. Here, b is a linear operator from H to Rr,
therefore taking the form

b(h) = ( ⟨b1 , h⟩_H, . . . , ⟨br , h⟩_H )^T

with b1, . . . , br ∈ H, and we take

‖b‖²_H = Σ_{i=1}^r ‖bi‖²_H.

It is once again clear (and the argument is left to the reader) that the problem
can be reduced to the finite-dimensional space V = span(h(x1), . . . , h(xN)), and that
the optimal b1, . . . , br must take the form

bj = Σ_{l=1}^N α_{lj} h(xl).

Introduce the kernel matrix K = K(x1, . . . , xN) with kth column denoted K(k). Let α
be the N × r matrix with entries α_{kj}, k = 1, . . . , N, j = 1, . . . , r. Then b(h(xk)), which is
the vector with coordinates

⟨bj , h(xk)⟩_H = Σ_{l=1}^N α_{lj} K(xk, xl),  j = 1, . . . , r,

is equal to α^T K(k). Moreover,


‖b‖²_H = Σ_{j=1}^r Σ_{k,l=1}^N α_{kj} K(xk, xl) α_{lj} = trace(α^T K α).

We therefore need to minimize

Σ_{k=1}^N |θ_{yk} − a0 − α^T K(k)|² + γ trace(α^T K α),

so that the problem is reduced to penalized optimal scoring, with xk replaced by K(k),
b replaced by α, and the matrix Ω in (8.21) replaced by K. Introducing the matrix P =
Id_{R^N} − 11^T/N and Kc = P K, the covariance matrix Σ_{XX} becomes Kc^T Kc/N = K P K/N.

The class averages µg are equal to K 1(g)/Ng, while µ = K 1/N, so that M is the
q × N matrix whose gth row is

(1(g)/Ng − 1/N)^T K,

which gives Σb = M^T C M = K Q K, where

Q = Σ_{g∈RY} (Ng/N) (1(g)/Ng − 1/N)(1(g)/Ng − 1/N)^T.

So, the columns of α are the r principal eigenvectors ρ1, . . . , ρr of the generalized
eigenvalue problem

K Q K ρ = (λ/N)(K P K + γK) ρ.

Given α, one then has, for any x ∈ Rd,

⟨bi , h(x)⟩_H = Σ_{k=1}^N α_{ki} K(x, xk)

and, since â0 = −b^T µ,

a0(i) = −(1/N) Σ_{k,l=1}^N α_{ki} K(xl, xk).

8.4 Separating hyperplanes and SVMs

8.4.1 One-layer perceptron and margin

In this whole section, we restrict to two-class problems and let RY = {−1, 1}. Given
a0 ∈ R and b ∈ Rd with b ≠ 0, the equation a0 + b^T x = 0 defines a hyperplane in Rd. The
function f(x) = sign(a0 + x^T b) defines a classifier that attributes a class ±1 to x according
to which side of the hyperplane it belongs to (we ignore the ambiguity when
x is on the hyperplane). With this notation, a pair (x, y), where y is the true class, is
correctly classified if and only if y(a0 + x^T b) > 0.

Let T = (x1, y1, . . . , xN, yN) denote, as usual, the training data. A hyperplane, represented
by the parameters (a0, b), is separating for T if it correctly classifies all its
samples, i.e., if yk(a0 + xk^T b) > 0 for k = 1, . . . , N. If such a hyperplane exists, one says
that T is linearly separable.

Let δ be a (small) positive number. The perceptron algorithm computes a0 and b
by minimizing

L(a0, b) = Σ_{k=1}^N [δ − yk(a0 + xk^T b)]_+.

The problem can be recast as a linear program: minimize

Σ_{k=1}^N ξk

subject to ξk ≥ 0 and ξk + yk(a0 + xk^T b) − δ ≥ 0 for k = 1, . . . , N.
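For illustration, this linear program can be handed to an off-the-shelf solver. A minimal sketch with scipy.optimize.linprog, stacking the variables as z = (a0, b, ξ) (an arbitrary layout choice of ours):

import numpy as np
from scipy.optimize import linprog

def perceptron_lp(X, y, delta=0.1):
    # Minimize sum(xi) subject to xi >= 0 and xi_k + y_k (a0 + x_k^T b) >= delta
    N, d = X.shape
    c = np.concatenate([np.zeros(1 + d), np.ones(N)])
    # Rewrite the margin constraints as -y_k a0 - y_k x_k^T b - xi_k <= -delta
    A_ub = np.hstack([-y[:, None], -y[:, None] * X, -np.eye(N)])
    b_ub = -delta * np.ones(N)
    bounds = [(None, None)] * (1 + d) + [(0, None)] * N
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[0], res.x[1:1 + d]      # a0, b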

However, when T is linearly separable, separating hyperplanes are not uniquely
defined, and there is in general (depending on the choice made for δ) an infinity
of solutions to the perceptron problem. Intuitively, one should prefer a solution
that classifies the training data with some large margin, rather than one for which
training points may be very close to the separating boundary (see fig. 8.2).

Figure 8.2: The green line is preferable to the purple one in order to separate the data.

This leads to the maximum margin separating hyperplane classifier, also called
linear SVM, introduced by Vapnik and Chervonenkis [198, 199].

8.4.2 Maximizing the margin

We will use the following result.

Proposition 8.10 The distance of a point x ∈ Rd to the hyperplane M : a0 + b^T x = 0 is
given by |a0 + x^T b|/|b|.

Proof By definition, distance(x, M) = |x − πM(x)|, where πM is the orthogonal projection
on M. Since b is normal to M, letting h = πM(x), we have x = λb + h, so that
distance(x, M) = |λb| = |λ| |b|. Using a0 + b^T h = 0, the equation x = λb + h implies
a0 + b^T x = λ|b|², so that |λ| |b| = |a0 + x^T b|/|b|. □

Assume that T is linearly separable and let M : a0 + b^T x = 0 be a separating
hyperplane. The classification margin is defined as the minimal distance of the input
vectors x1, . . . , xN to this hyperplane, i.e.,

m(a0, b) = min{ |a0 + xk^T b|/|b| : k = 1, . . . , N }.

Because the hyperplane is separating, we have yk(a0 + xk^T b) = |a0 + xk^T b| for all k, so
that we also have

m(a0, b) = min{ yk(a0 + xk^T b)/|b| : k = 1, . . . , N }.

We want to maximize this margin among all separating hyperplanes. This can be
expressed as maximizing, with respect to (a0, b), the quantity

min{ yk(a0 + xk^T b)/|b| : k = 1, . . . , N }

subject to the constraint that the hyperplane is separating, namely yk(a0 + xk^T b) ≥ 0,
k = 1, . . . , N. Introducing a new variable C representing the margin, the previous
problem is equivalent to maximizing C subject to

yk(a0 + xk^T b) ≥ C|b|, k = 1, . . . , N.

The problem is now overparametrized, and there is no loss of generality in enforcing
the additional constraint C|b| = 1. Noting that maximizing C = 1/|b| is the same as
minimizing |b|², we can now reformulate the maximum margin hyperplane problem
as minimizing |b|²/2 subject to

yk(a0 + xk^T b) ≥ 1, k = 1, . . . , N,

with the margin given by C = 1/|b|. This results in a quadratic programming
problem.

If the data is not separable, there is no feasible point for this problem. To also
account for this situation (which is common), we can replace the constraint by a
penalty and minimize, with respect to a0 and b,

|b|²/2 + γ Σ_{k=1}^N (1 − yk(a0 + xk^T b))_+

for some γ > 0. (Recall that x_+ = max(x, 0).) This is equivalent to minimizing the
perceptron objective function, with δ = 1, with an additional penalty term equal
to |b|²/(2γ). This minimization problem is equivalent to a quadratic programming
problem obtained by introducing slack variables ξk, k = 1, . . . , N, and minimizing

|b|²/2 + γ Σ_{k=1}^N ξk,

subject to the constraints ξk ≥ 0 and yk(a0 + xk^T b) + ξk ≥ 1 for k = 1, . . . , N.
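In practice this quadratic program is passed to a QP solver; the penalized form, however, also lends itself to a simple subgradient method, sketched below. The step-size schedule is an arbitrary choice of ours, not prescribed by the analysis.

import numpy as np

def linear_svm_subgradient(X, y, gamma=1.0, n_iter=1000, lr0=0.1):
    # Minimize |b|^2/2 + gamma * sum_k (1 - y_k (a0 + x_k^T b))_+ by subgradient descent
    N, d = X.shape
    a0, b = 0.0, np.zeros(d)
    for t in range(1, n_iter + 1):
        margins = y * (a0 + X @ b)
        active = margins < 1              # inside the margin or misclassified
        grad_b = b - gamma * (y[active, None] * X[active]).sum(axis=0)
        grad_a0 = -gamma * y[active].sum()
        lr = lr0 / np.sqrt(t)
        b -= lr * grad_b
        a0 -= lr * grad_a0
    return a0, b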



8.4.3 KKT conditions and dual problem

Introduce Lagrange multipliers ηk ≥ 0 for ξk ≥ 0 and αk ≥ 0 for yk(a0 + xk^T b) + ξk ≥ 1.
The Lagrangian is then given by

L = |b|²/2 + γ Σ_{k=1}^N ξk − Σ_{k=1}^N ηk ξk − Σ_{k=1}^N αk ( yk(a0 + xk^T b) + ξk − 1 ).

The KKT conditions are

b − Σ_{k=1}^N αk yk xk = 0,

Σ_{k=1}^N αk yk = 0,

γ − ηk − αk = 0,  k = 1, . . . , N,    (8.22)

ξk ηk = 0,  k = 1, . . . , N,

αk ( yk(a0 + xk^T b) + ξk − 1 ) = 0,  k = 1, . . . , N.

Minimizing L with respect to a0, b and ξ1, . . . , ξN and ensuring that the minimum
is finite provides the first three KKT conditions. The resulting dual formulation
therefore requires maximizing

Σ_{k=1}^N αk − (1/2) Σ_{k,l=1}^N αk αl yk yl xk^T xl

subject to the constraints 0 ≤ αk ≤ γ and Σ_{k=1}^N αk yk = 0.

We now discuss the consequences of the complementary slackness conditions
based on the position of a training sample relative to the separating hyperplane.

(i) First consider indices k such that (xk, yk) is correctly classified beyond the margin,
i.e., yk(a0 + xk^T b) > 1. The last KKT condition and the constraint ξk ≥ 0 require αk = 0;
the third condition then gives ηk = γ > 0, and the fourth gives ξk = 0.
(ii) For samples that are misclassified or correctly classified below the margin,⁵ i.e.,
yk(a0 + xk^T b) < 1, the constraint yk(a0 + xk^T b) + ξk ≥ 1 implies ξk > 0, so that ηk = 0,
αk = γ and yk(a0 + xk^T b) + ξk = 1.
5 Note that, even if the training data is linearly separable, there are generally samples that are on
the right side of the hyperplane, but at a distance to the hyperplane strictly lower than the “nominal
margin” C = 1/|b|. This is due to our relaxation of the original problem of finding a separating
hyperplane with maximal margin.

(iii) If (xk, yk) is correctly classified exactly at the margin, then ξk = 0 and there is
no constraint on αk besides belonging to [0, γ]. Training samples that lie exactly at the
margin are called support vectors.

Given a solution α1 , . . . , αN of the dual problem, one immediately recovers b via


the first equation in (8.22). For a0 , one must, similarly to the regression case, rely on
support vectors, which can be identified when 0 < αk < γ. In this case, one can take
a0 = yk − xkT b.

If no such support vector is found, then a0 is not uniquely determined, and can be any
value such that yk(a0 + b^T xk) ≥ 1 if αk = 0 and yk(a0 + b^T xk) ≤ 1 if αk = γ. This shows
that a0 can be any point in the interval [a0⁻, a0⁺] with

a0⁻ = max{ yk − xk^T b : (yk = 1 and αk = 0) or (yk = −1 and αk = γ) },

a0⁺ = min{ yk − xk^T b : (yk = −1 and αk = 0) or (yk = 1 and αk = γ) }.

8.4.4 Kernel version

We make the usual assumptions: h : RX → H is a feature map with values in an
inner-product space with K(x, y) = ⟨h(x) , h(y)⟩_H. The predictors take the form f(x) =
sign(a0 + ⟨b , h(x)⟩_H), with a0 ∈ R and b ∈ H, and the goal is to minimize

(1/2)‖b‖²_H + γ Σ_{k=1}^N ξk,

subject to ξk ≥ 0 and yk(a0 + ⟨h(xk) , b⟩_H) + ξk ≥ 1 for k = 1, . . . , N.

Let V = span(h(x1), . . . , h(xN)). The usual projection argument implies that the
optimal b must belong to V and therefore take the form

b = Σ_{k=1}^N uk h(xk).

We therefore need to minimize

(1/2) Σ_{k,l=1}^N uk ul K(xk, xl) + γ Σ_{k=1}^N ξk,

subject to

yk ( a0 + Σ_{l=1}^N K(xk, xl) ul ) + ξk ≥ 1
l=1

for k = 1, . . . , N. Introducing the same Lagrange multipliers as before, the Lagrangian
is

L = (1/2) Σ_{k,l=1}^N uk ul K(xk, xl) + γ Σ_{k=1}^N ξk − Σ_{k=1}^N ηk ξk
    − Σ_{k=1}^N αk ( yk ( a0 + Σ_{l=1}^N K(xk, xl) ul ) + ξk − 1 ).

Using vector notation, and writing α ⊙ y for the vector with coordinates αk yk, we
have

L = (1/2) u^T K u + ξ^T (γ1 − η − α) − a0 α^T y − (α ⊙ y)^T K u + α^T 1.

The infimum of L over (a0, u, ξ) is −∞ unless γ1 − η − α = 0 and α^T y = 0. If these
identities hold, the optimal u is u = α ⊙ y and the minimum of L is

−(1/2)(α ⊙ y)^T K (α ⊙ y) + α^T 1.

The dual problem therefore requires minimizing

(1/2)(α ⊙ y)^T K (α ⊙ y) − α^T 1 = (1/2) α^T (K ⊙ y y^T) α − α^T 1

subject to γ1 − η − α = 0 (with η ≥ 0, i.e., 0 ≤ αk ≤ γ) and α^T y = 0.

This is exactly the same problem as the one we obtained in the linear case, up
to the replacement of the Euclidean inner products xk^T xl by the kernel evaluations
K(xk, xl). Given the solution of the dual problem, the optimal b is

b = Σ_{k=1}^N uk h(xk) = Σ_{k=1}^N αk yk h(xk).

It is not computable, but the classification rule is explicit and given by

f(x) = sign( a0 + Σ_{k=1}^N αk yk K(xk, x) ).
Similarly to the linear case, the coefficient a0 can be identified using a support
vector, or is otherwise not uniquely determined. More precisely, if one of the αk’s is
strictly between 0 and γ, then a0 is given by a0 = yk − Σ_{l=1}^N αl yl K(xk, xl). Otherwise,
a0 is any number between

a0⁻ = max{ yk − Σ_l αl yl K(xk, xl) : (yk = 1 and αk = 0) or (yk = −1 and αk = γ) }

and

a0⁺ = min{ yk − Σ_l αl yl K(xk, xl) : (yk = −1 and αk = 0) or (yk = 1 and αk = γ) }.
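Given dual coefficients α1, . . . , αN returned by any QP solver, the classification rule and the recovery of a0 from a support vector are immediate to implement. A minimal sketch (the tolerances used to detect strict support vectors are our own choice):

import numpy as np

def kernel_svm_predict(x, X_train, y_train, alpha, a0, kernel):
    # f(x) = sign(a0 + sum_k alpha_k y_k K(x_k, x))
    k = np.array([kernel(xk, x) for xk in X_train])
    return np.sign(a0 + np.sum(alpha * y_train * k))

def recover_a0(X_train, y_train, alpha, gamma, kernel, tol=1e-8):
    # Use any index with 0 < alpha_k < gamma; otherwise a0 is not unique
    sv = np.where((alpha > tol) & (alpha < gamma - tol))[0]
    if len(sv) == 0:
        raise ValueError("no strict support vector: a0 only lies in [a0-, a0+]")
    k = sv[0]
    Kk = np.array([kernel(xl, X_train[k]) for xl in X_train])
    return y_train[k] - np.sum(alpha * y_train * Kk)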
Chapter 9

Nearest-Neighbor Methods

Unlike linear models, nearest-neighbor methods are completely non-parametric and
assume no regularity on the decision rule or the regression function. In their simplest
version, they require no training and rely on the proximity of a new observation
to those that belong to the training set. We will discuss in this chapter how
these methods are used for regression and classification, and study some of their
theoretical properties.

9.1 Nearest neighbors for regression

9.1.1 Consistency

We let RX denote the input space, and RY = Rq the output space. We assume that
a distance, denoted dist, is defined on RX. This means that dist : RX × RX → [0, +∞]
(we allow for infinite values) is a symmetric function such that dist(x, x′) = 0 if and
only if x = x′ and, for all x, x′, x″ ∈ RX,

dist(x, x′) ≤ dist(x, x″) + dist(x″, x′),

which is the triangle inequality.

Let T = (x1 , y1 , . . . , xN , yN ) be the training set. For x ∈ RX , let

DT (x) = (dist(x, xk ), k = 1, . . . , N )

be the collection of all distances between x and the inputs in the training set. We
consider regression estimators taking the form
f̂(x) = Σ_{k=1}^N Wk(x) yk    (9.1)


where W1(x), . . . , WN(x) is a family of coefficients, or weights, that only depends on
DT(x).

We will, more precisely, use the following construction [184]. Assume that a
family of numbers w1 ≥ w2 ≥ · · · ≥ wN ≥ 0 is chosen, with Σ_{j=1}^N wj = 1. Given x ∈ RX
and k ∈ {1, . . . , N}, we let rk⁺(x) denote the number of indexes k′ such that dist(x, x_{k′}) ≤
dist(x, xk), and rk⁻(x) the number of indexes k′ such that dist(x, x_{k′}) < dist(x, xk). The
coefficients defining f̂ in (9.1) are then chosen as

Wk(x) = ( Σ_{j=rk⁻(x)+1}^{rk⁺(x)} wj ) / ( rk⁺(x) − rk⁻(x) ).    (9.2)

To emphasize the role of (w1, . . . , wN) in this definition, we will denote the resulting
estimator as f̂w. If there is no tie in the sequence of distances between x and elements
of the training set, then rk⁺(x) = rk⁻(x) + 1 is the rank of xk when the training data are
ordered according to their proximity to x, and Wk(x) = w_{rk⁺(x)}. In this case, defining
l1, . . . , lN such that dist(x, x_{l1}) < · · · < dist(x, x_{lN}), we have

f̂w(x) = Σ_{j=1}^N wj y_{lj}.

In the general case, the weights wj associated with tied observations are averaged.

If p is an integer, the p-nearest neighbor (p-NN) estimator (which we will denote
f̂p) is associated with the weights wj = 1/p for j = 1, . . . , p and wj = 0 otherwise. If there
is no tie for the definition of the pth nearest neighbor of x, then Wk(x) = 1/p if k is among
the p nearest neighbors and Wk(x) = 0 otherwise, so that f̂p is the average of the output
values over these p nearest neighbors. If the pth nearest neighbors are tied, their
output values are averaged before being used in the sum. For example, assume that
N = 5 and p = 2, and let the distances between x and xk for k = 1, . . . , 5 be respectively
9, 3, 2, 4, 6. Then f̂2(x) = (y2 + y3)/2. If the distances were 9, 3, 2, 3, 6, then we would
have f̂2(x) = (y2 + y4)/4 + y3/2.
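The weighting rule (9.2), including its tie-handling convention, takes only a few lines to implement. The sketch below reproduces the two numerical examples just given (the output values y1, . . . , y5 are stand-ins):

import numpy as np

def weighted_nn_estimate(dists, y, w):
    # Estimator f_w with the tie-handling convention of (9.2)
    dists, y, w = map(np.asarray, (dists, y, w))
    f = 0.0
    for k in range(len(dists)):
        r_plus = np.sum(dists <= dists[k])
        r_minus = np.sum(dists < dists[k])
        Wk = w[r_minus:r_plus].sum() / (r_plus - r_minus)
        f += Wk * y[k]
    return f

# 2-NN example from the text: w = (1/2, 1/2, 0, 0, 0)
w = np.array([0.5, 0.5, 0, 0, 0])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])              # stand-in outputs y_1..y_5
print(weighted_nn_estimate([9, 3, 2, 4, 6], y, w))   # (y2 + y3)/2 = 2.5
print(weighted_nn_estimate([9, 3, 2, 3, 6], y, w))   # (y2 + y4)/4 + y3/2 = 3.0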

When RX = Rd and dist(x, x′) = |x − x′|, the following result is true.

Theorem 9.1 ([184]) Assume that E(Y²) < ∞, and that, for each N, a sequence
w^(N) = (w1^(N) ≥ · · · ≥ wN^(N) ≥ 0) is chosen with Σ_{j=1}^N wj^(N) = 1. Assume, in addition, that

(i) lim_{N→∞} w1^(N) = 0;

(ii) lim_{N→∞} Σ_{j≥αN} wj^(N) = 0 for some α ∈ (0, 1).

Then the corresponding estimator f̂_{w^(N)} converges in the L² norm to E(Y | X):

E( |f̂_{w^(N)}(X) − E(Y | X)|² ) → 0.

For nearest-neighbor regression, (i) and (ii) mean that the number of nearest neighbors
pN must be chosen such that pN → ∞ and pN/N → 0.

Proof We give a proof under the assumption that f : x ↦ E(Y | X = x) is uniformly
continuous and bounded (one can, in fact, prove that it is always possible to reduce
to this case).

To lighten the notation, we will not make explicit the dependency on N of
quantities such as W or w. One has

f̂w(X) − E(Y | X) = Σ_{k=1}^N Wk(X)(f(Xk) − f(X)) + Σ_{k=1}^N Wk(X)(Yk − f(Xk))    (9.3)

and the two sums can be addressed separately.

We start with the first sum and write, using the Cauchy–Schwarz inequality and
the fact that Σ_k Wk(X) = 1:

( Σ_k Wk(X)(f(Xk) − f(X)) )² ≤ Σ_k Wk(X)(f(Xk) − f(X))².

It therefore suffices to study the limit of E( Σ_k Wk(X)(f(Xk) − f(X))² ). Fix ε > 0. By
assumption, there exist M, a > 0 such that |f(x)| ≤ M for all x and |x − y| ≤ a ⇒
|f(x) − f(y)|² ≤ ε. Then

E( Σ_k Wk(X)(f(Xk) − f(X))² ) = E( Σ_k Wk(X)(f(Xk) − f(X))² 1_{|Xk−X|≤a} )
                               + E( Σ_k Wk(X)(f(Xk) − f(X))² 1_{|Xk−X|>a} )
                              ≤ ε + 4M² E( Σ_k Wk(X) 1_{|Xk−X|>a} ).

Since ε can be made arbitrarily small, we need to show that, for any positive a, the
second term in the upper bound tends to 0 when N → ∞. We will use the following
fact, which requires a minor measure-theoretic argument to prove rigorously. Define

S = { x : ∀δ > 0, P(|X − x| < δ) > 0 }.

This set is called the support of X. Then one can show that P(X ∈ S) = 1. This
means that, if X̃ is independent from X with the same distribution, then, for any
δ > 0, P(|X − X̃| < δ | X) > 0 with probability one.¹

Let Na(x) = |{k : |Xk − x| ≤ a}|. We have, for all x ∈ S and a > 0, using the law
of large numbers,

Na(x)/N = (1/N) Σ_{k=1}^N 1_{|Xk−x|≤a} → P(|X − x| ≤ a) > 0.

If |X − Xk| > a, then rk⁻(X) ≥ Na(X), so that

Σ_k Wk(X) 1_{|Xk−X|>a} ≤ Σ_{j≥Na(X)} wj,

and we have, taking 0 < α < P(|X − x| ≤ a),

E( Σ_{j≥Na(X)} wj ) ≤ Σ_{j≥αN} wj + P(Na(X) < αN),

and both terms in the upper bound converge to 0. This shows that the first sum in
(9.3) tends to 0.

We now consider the second sum in (9.3). Let Zk = Yk − E(Y | Xk). We have
E(Zk | Xk) = 0 and E(Zk²) < ∞. We can write

E( ( Σ_{k=1}^N Wk(X) Zk )² ) = E( E( ( Σ_{k=1}^N Wk(X) Zk )² | X, X1, . . . , XN ) )
  = E( Σ_{k=1}^N Wk(X)² E(Zk² | Xk) ) + Σ_{k≠l} E( Wk(X) Wl(X) E(Zk Zl | Xk, Xl) ).
1 This statement is proved as follows (with the assumption that X is Borel measurable). Let S^c
denote the complement of S. Then S^c is open: indeed, if x ∉ S, there exists δx > 0 such that, letting
B(x, δx) denote the open ball with radius δx, P(X ∈ B(x, δx)) = 0. Then P(X ∈ B(x′, δx/3)) = 0 as soon as
|x − x′| < δx/3, so that B(x, δx/3) ⊂ S^c. If K ⊂ S^c is compact, then K ⊂ ∪_{x∈K} B(x, δx) and one can find
a finite subset M ⊂ K such that K ⊂ ∪_{x∈M} B(x, δx), which proves that P(X ∈ K) = 0. Since P(X ∈ S^c) =
max_K P(X ∈ K), where the maximum is over all compact subsets of S^c, we find P(X ∈ S^c) = 0 as required.

The cross products in the last term vanish because E(Zk | Xk) = 0 and the samples
are independent. So it only remains to consider

E( Σ_{k=1}^N Wk(X)² E(Zk² | Xk) ).

The random variable E(Zk² | Xk) = E(Yk² | Xk) − E(Yk | Xk)² is a fixed non-negative
function of Xk, which we will denote h(Xk). We have

E( Σ_{k=1}^N Wk(X)² h(Xk) ) ≤ w1 E( Σ_{k=1}^N Wk(X) h(Xk) ),

with w1 → 0, and the proof is concluded by showing that E( Σ_{k=1}^N Wk(X) h(Xk) ) is
bounded.

Recall that the weights Wk are functions of X and of the whole training set;
we will need to make this dependency explicit and write Wk(X, TX), where TX =
(X1, . . . , XN). Similarly, the ranks in (9.2) will be written rk⁺(X, TX) and rk⁻(X, TX).

Because X, X1, . . . , XN are i.i.d., we can switch the roles of X and Xk in the kth term
of the sum, yielding

E( Σ_{k=1}^N Wk(X, TX) h(Xk) ) = E( Σ_{k=1}^N Wk(Xk, TX^(k)) h(X) ),

with TX^(k) = (X1, . . . , X_{k−1}, X, X_{k+1}, . . . , XN). We now show that Σ_{k=1}^N Wk(Xk, TX^(k)) is
bounded independently of X, X1, . . . , XN.

For this purpose, we group X1, . . . , XN according to approximate alignment with
X. For u ∈ Rd with |u| = 1 and for δ ∈ (0, π/6), denote by Γ(u, δ) the cone formed by
all vectors v in Rd such that ⟨v , u⟩ > |v| cos δ (i.e., the angle between v and u is less
than δ). Notice that if v, v′ ∈ Γ(u, δ), then ⟨v , v′⟩ ≥ cos(2δ)|v| |v′| and, if |v′| ≤ |v|, then

|v|² − |v − v′|² = 2⟨v , v′⟩ − |v′|² ≥ |v′|(2|v| cos(2δ) − |v′|) > 0    (9.4)

because cos(2δ) > 1/2.

Fixing δ, let Cd(δ) be the minimal number of such cones needed to cover Rd.
Choosing such a covering Γ(u1, δ), . . . , Γ(uM, δ), where M = Cd(δ), we define the
following subsets of {1, . . . , N}:

I0 = {k : Xk = X},
Iq = { k ∉ I0 : Xk − X ∈ Γ(uq, δ) },  q = 1, . . . , M

(these sets may overlap). We have

Σ_{k=1}^N Wk(Xk, TX^(k)) ≤ Σ_{q=0}^M Σ_{k∈Iq} Wk(Xk, TX^(k)).

If k ∈ I0, then rk⁻(Xk, TX^(k)) = 0 and rk⁺(Xk, TX^(k)) = c with c = |I0|. This implies that, for
k ∈ I0, we have Wk(Xk, TX^(k)) = (1/c) Σ_{j=1}^c wj and

Σ_{k∈I0} Wk(Xk, TX^(k)) = Σ_{j=1}^c wj.

We now consider Iq with q ≥ 1. Write Iq = {i1, . . . , ir}, ordered so that |X_{ij} − X| is
non-decreasing in j. If j′ < j, we have (using (9.4)) |X_{ij} − X_{ij′}| < |X − X_{ij}|. This implies that
r⁻_{ij}(X_{ij}, TX^{(ij)}) ≥ j − 1 and r⁺_{ij}(X_{ij}, TX^{(ij)}) − r⁻_{ij}(X_{ij}, TX^{(ij)}) ≥ c + 1. Therefore,

W_{ij}(X_{ij}, TX^{(ij)}) ≤ (1/(c + 1)) Σ_{j′=j}^{c+j} w_{j′}

and

Σ_{k∈Iq} Wk(Xk, TX^{(k)}) ≤ Σ_{j=1}^N (1/(c + 1)) Σ_{j′=j}^{c+j} w_{j′} = (1/(c + 1)) ( Σ_{j′=1}^c j′ w_{j′} + (c + 1) Σ_{j′=c+1}^N w_{j′} ).

This yields

Σ_{k=1}^N Wk(Xk, TX^{(k)}) ≤ Σ_{j=1}^c wj + Cd(δ) ( (1/(c + 1)) Σ_{j′=1}^c j′ w_{j′} + Σ_{j′=c+1}^N w_{j′} ) ≤ Cd(δ) + 1.

We therefore have

E( Σ_{k=1}^N Wk(X)² E(Zk² | Xk) ) ≤ w1 (Cd(δ) + 1) E(h(X)) → 0,

which concludes the proof. □

Theorem 9.1 is proved in Stone [184] with weaker hypotheses allowing for more
flexibility in the computation of distances, in which, for example, differences X − Xi
can be normalized by dividing them by a factor σi that may depend on the training
set. These relaxed assumptions slightly complicate the proof, and we refer the reader
to Stone [184] for a complete exposition.

9.1.2 Optimality

The NN method can be shown to be optimal over some classes of functions. Opti-
mality is in the min-max sense, and works as follows. We assume that the regression
function f (x) = E(Y | X = x) belongs to some set F of real-valued functions on Rd .
Most of the time, the estimation methods must be adapted to a given choice of F ,
and various choices have arisen in the literature: classes of functions with r bounded
derivatives, Sobolev or related spaces, functions whose Fourier transforms has given
properties, etc.

Consider now an estimator of f, denoted f̂N, based on a training set of size N.
We can measure the error by, say,

‖f̂N − f‖2 = ( ∫_{Rd} (f̂N(x) − f(x))² dx )^{1/2}.

Since f̂N is computed from a random sample, this error is a random variable. One
can study, for a sequence bN → 0, the probability

Pf( ‖f̂N − f‖2² ≥ c bN )

for some constant c and, for example, for the model Y = f(X) + noise. Here, the
notation Pf refers to the model assumption indicating the unobserved function f.

The min-max method considers the worst case and computes

MN(c) = sup_{f∈F} Pf( ‖f̂N − f‖2² ≥ c bN ).

This quantity now only depends on the estimation algorithm. One defines the notion
of “lower convergence rate” as a sequence bN such that, for any choice of the
estimation algorithm, MN(c) can be found arbitrarily close to 1 (i.e., ‖f̂N − f‖2² ≥ c bN
with arbitrarily high probability for all f ∈ F), for arbitrarily large N (and for some
choice of c). The mathematical statement is

∃c > 0 : lim inf_{N→∞} MN(c) = 1.

So, if bN is a lower convergence rate, then, for every estimator, there exists a constant
c such that the accuracy cbN cannot be achieved.

On the other hand, one says that bN is an achievable rate of convergence if there
exists an estimator such that, for some c′,

lim sup_{N→∞} MN(c′) = 0.

This says that, for large N and for some c′, the accuracy is better than c′bN for the
given estimator. Notice the difference: a lower rate holds for all estimators, and an
achievable rate for at least one estimator.

The final definition of a min-max optimal rate is that it is both a lower rate and
an achievable rate (obviously for different constants c and c′). And an estimator is
optimal in the min-max sense if it achieves an optimal rate.

One can show that the p-NN estimator is optimal (under some assumptions on
the ratio pN /N ) when F is the class of Lipschitz functions on Rd , i.e., the class of
functions such that there exists a constant K with

|f (x) − f (y)| ≤ K|x − y|

for all x, y ∈ Rd . In this case, the optimal rate is bN = N −1/(2+d) (notice again the
“curse of dimensionality”: to achieve a given accuracy in the worst case, the number
of data points must grow exponentially with the dimension).

If the function class consists of smoother functions (for example, several deriva-
tives), the p-NN method is not optimal. This is because the local averaging method
is too crude when one knows already that the function is smooth. But it can be
modified (for example by fitting, using least squares, a polynomial of some degree
instead of computing an average) in order to obtain an optimal rate.

9.2 p-NN classification

Let (x1, y1, . . . , xN, yN) be the training set, with xk ∈ Rd and yk ∈ RY, where RY is a
finite set of classes. Using the same notation as in the previous section, define

π̂w(y | x) = Σ_{k=1}^N Wk(x) 1_{yk=y}

and let the corresponding classifier be

f̂w(x) = argmax_{y∈RY} π̂w(y | x).

Theorem 9.1 may be applied, for y ∈ RY, to the function fy(x) = π(y | x) = E(1_{Y=y} |
X = x), which allows one to interpret the estimator π̂w(y | x) as a nearest-neighbor
predictor of the random variable 1_{Y=y} as a function of X. We therefore obtain the
consistency of the estimated posteriors when N → ∞ under the same assumptions as
those of theorem 9.1. This implies that, for large N, the classification will be close to
Bayes’s rule.

An asymptotic comparison with Bayes’s rule can already be made with p = 1. Let
ŷN(x) be the 1-NN estimator of Y given x and a training set of size N, and let ŷ(x) be
the Bayes estimator. We can compute the Bayes error by

P(ŷ(X) ≠ Y) = 1 − P(ŷ(X) = Y) = 1 − E(P(ŷ(X) = Y | X)) = 1 − E( max_{y∈RY} π(y | X) ).

For the 1-NN rule, we have

P(ŷN(X) ≠ Y) = 1 − P(ŷN(X) = Y) = 1 − E(P(ŷN(X) = Y | X)).

Let us make the assumption that nearest neighbors are not tied (with probability
one). Let k*(x, T) denote the index of the nearest neighbor to x in the training set T.
We have

P(ŷN(X) = Y | X) = E(P(ŷN(X) = Y | X, T))
  = Σ_{k=1}^N E( P(Y = Yk | X, T) 1_{k*(X,T)=k} )
  = Σ_{k=1}^N E( P(Y = Yk | X, Xk) 1_{k*(X,T)=k} )
  = Σ_{k=1}^N Σ_{g∈RY} E( P(Y = g, Yk = g | X, Xk) 1_{k*(X,T)=k} )
  = Σ_{k=1}^N Σ_{g∈RY} E( π(g | X) π(g | Xk) 1_{k*(X,T)=k} )
  = Σ_{g∈RY} E( π(g | X) π(g | X_{k*(X,T)}) ).

Now, assume the continuity of x ↦ π(g | x) (although the result can be proved
without this simplifying assumption). We know that X_{k*(X,T)} → X when N → ∞ (see
the proof of theorem 9.1), which implies that π(g | X_{k*(X,T)}) → π(g | X) and, at the
limit,

P(ŷN(X) = Y | X) → Σ_{g∈RY} π(g | X)².

This implies that the asymptotic 1-NN misclassification error is always smaller
than twice the Bayes error, that is,

1 − E( Σ_{g∈RY} π(g | X)² ) ≤ 2( 1 − E(max_g π(g | X)) ).

Indeed, the left-hand term is smaller than 1 − E(max_g π(g | X)²), and the result comes
from the fact that 1 − t² ≤ 2 − 2t for any t ∈ R.

Remark 9.2 Nearest-neighbor methods may require large computation time since,
for a given x, the number of comparisons needed is the size of the training
set. However, efficient (tree-based) search algorithms can be used in many cases to
reduce it to a logarithm of the size of the database, which is acceptable. A reduction
of the size of the training set by clustering is also a possibility for improving
efficiency.

The computation time is also generally proportional to the dimension d of the
input x. When d is large, a reduction of dimension is often a good idea. Principal
components (see chapter 21) or LDA directions (see chapter 8) can be used for this
purpose.

9.3 Designing the distance

LDA-based distance The most important factor in the design of an NN procedure
probably is the choice of the distance, something we have not discussed so far. Intuitively,
the distance should increase fast in the directions “perpendicular” to the
regions of constancy of the class variables, and slowly (ideally not at all) within these
regions. The following construction uses discriminant analysis [87].

For g ∈ RY, let Σg be the covariance matrix in class g, and Σw = Σ_{g∈RY} πg Σg the
within-class covariance, where πg is the frequency of class g. Let Σb denote the
between-class covariance matrix (see section 8.2).

For x ∈ Rd, define the spherized vector x* = Σw^{−1/2} x. The between-class covariance
computed for spherized data is Σb* = Σw^{−1/2} Σb Σw^{−1/2}. A direction is discriminant if it is
close to the principal eigenvectors of Σb*. This suggests the introduction of the norm

|x|*² = (x*)^T Σb* x* = x^T Σw^{−1/2} (Σw^{−1/2} Σb Σw^{−1/2}) Σw^{−1/2} x = x^T Σw^{−1} Σb Σw^{−1} x.

This replaces the standard Euclidean norm. (The method can be made more robust
by adding Id_{Rd} to Σb*.)
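A small numpy sketch of this construction; the eps parameter generalizes the robustness fix just mentioned (eps = 1 corresponds to adding Id_{Rd} to Σb* in the spherized domain, and eps = 0 to the plain norm):

import numpy as np

def lda_metric(Sigma_w, Sigma_b, eps=1.0):
    # |x|_*^2 = x^T M x; the eps term adds eps * Id to Sigma_b^* in the
    # spherized domain, which becomes eps * Sigma_w^{-1} in original coordinates
    Sw_inv = np.linalg.inv(Sigma_w)
    return Sw_inv @ Sigma_b @ Sw_inv + eps * Sw_inv

def lda_nn_distance(x, xp, M):
    diff = x - xp
    return float(np.sqrt(diff @ M @ diff))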

Tangent distance Designing the distance, however, can sometimes be based on a
priori knowledge of invariance properties associated with the classes. A successful
example comes from character recognition, where it is known that transforming
images by slightly rotating, scaling, or translating a character should not
change its class. This corresponds to the following general framework.

For each input x ∈ Rd, assume that one can make small transformations without
changing the class of x. We model these transformations as parametrized functions
x ↦ x_θ = ϕ(x, θ) ∈ Rd, such that ϕ(x, 0) = x and ϕ is smooth in θ, which is a q-dimensional
parameter. The assumption is that ϕ(x, θ) and x should be from the
same class, at least for small θ. This will be used to improve on the Euclidean distance
on Rd.

Take x, x′ ∈ Rd. Ideally, one would like to use the distance D(x, x′) = inf_{θ,θ′} dist(x_θ, x′_{θ′}),
where θ and θ′ are restricted to a small neighborhood of 0. A more tractable expression
can be based on the first-order approximations

x_θ ≃ x + ∇θ ϕ(x, 0) u = x + Σ_{i=1}^q u_i ∂_{θi} ϕ(x, 0)

and

x′_{θ′} ≃ x′ + ∇θ ϕ(x′, 0) u′ = x′ + Σ_{i=1}^q u′_i ∂_{θi} ϕ(x′, 0),

yielding the approximation (also called the tangent distance)

D(x, x′)² ≃ inf_{u,u′∈Rq} | x − x′ + ∇θ ϕ(x, 0) u − ∇θ ϕ(x′, 0) u′ |².

The computation now is a simple least-squares problem, whose solution is given by
the linear system

( ∇θϕ(x,0)^T ∇θϕ(x,0)    −∇θϕ(x,0)^T ∇θϕ(x′,0) ) ( u  )   ( ∇θϕ(x,0)^T (x′ − x) )
( −∇θϕ(x′,0)^T ∇θϕ(x,0)   ∇θϕ(x′,0)^T ∇θϕ(x′,0) ) ( u′ ) = ( ∇θϕ(x′,0)^T (x − x′) ).

A slight modification, ensuring that the norms of u and u′ are not too large, is to
add a penalty λ(|u|² + |u′|²), which results in adding λId_{Rq} to the diagonal blocks of
the above matrix.
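This regularized least-squares problem is a direct linear solve. A minimal sketch, assuming the Jacobians Tx = ∇θϕ(x, 0) and Txp = ∇θϕ(x′, 0) are available as d × q arrays (the argument names are ours):

import numpy as np

def tangent_distance(x, xp, Tx, Txp, lam=1e-3):
    # Solve the block system above, with lam penalizing |u|^2 + |u'|^2
    q = Tx.shape[1]
    A = np.block([[Tx.T @ Tx, -Tx.T @ Txp],
                  [-Txp.T @ Tx, Txp.T @ Txp]]) + lam * np.eye(2 * q)
    rhs = np.concatenate([Tx.T @ (xp - x), Txp.T @ (x - xp)])
    sol = np.linalg.solve(A, rhs)
    u, up = sol[:q], sol[q:]
    return float(np.linalg.norm(x - xp + Tx @ u - Txp @ up))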
Chapter 10

Tree-based Algorithms, Randomization and Boosting

10.1 Recursive Partitioning

Recursive partitioning methods implement a “divide and conquer” strategy to address
the prediction problem. They separate the input space RX into small regions
on which prediction is “easy,” i.e., such that the observed values of the output variable
are (almost) constant for input values in these regions. The regions are estimated
by recursive divisions until they become either too small or homogeneous.
These divisions are conveniently represented in the form of binary trees.

10.1.1 Binary prediction trees

Define a binary node to be a structure ν that contains the following information (note
that the definition is recursive):

• A label L(ν) that uniquely identifies the node.


• A set of children, C(ν), which is either empty or a pair of nodes (l(ν), r(ν)).
• A binary feature, i.e., a function γν : RX → {0, 1}, which is “None” (i.e., irrelevant)
if the node has no children.
• A predictor, fν : RX → RY , which is “None” if the node has children.

A node without children is called a terminal node, or a leaf.

A binary prediction tree T is a finite set of nodes, with the following properties:

(i) Only one node has no parent (the root, denoted ρ or ρT );


(ii) Each other node has exactly one parent;


(iii) No node is a descendent of itself.

10.1.2 Training algorithm

Assume that a family Γ of binary features γ : RX → {0, 1} is chosen, together with a
family F of predictors f : RX → RY. Assume also the existence of two “algorithms”
as follows:

• Feature selection: Given the feature set Γ and a training set T, return an optimized
binary feature γ̂_{T,Γ} ∈ Γ.
• Predictor optimization: Given the predictor set F and a training set T, return an
optimized predictor f̂_{T,F} ∈ F.

Finally, assume that a stopping rule is defined, as a function of training sets, σ : T ↦
σ(T) ∈ {0, 1}, where 0 means “continue” and 1 means “stop.”

Given a training set T0 , the algorithm builds a binary tree T using a recursive
construction. Each node ν ∈ T will be associated to a subset of T0 , denoted Tν . We
define below a recursive operation, denoted Node(T , j) that adds a node ν to a tree
T given a subset T of T0 and a label j. Starting with T = ∅, calling Node(T0 , 0) will
then create the desired tree.

Algorithm 10.1 (Node insertion: Node(T, j))

(a) Given T and j, create a node ν with Tν = T and L(ν) = j.
(b) If σ(T) = 1, let C(ν) = ∅, γν = “None” and fν = f̂_{T,F}.
(c) If σ(T) = 0, let fν = “None”, γν = γ̂_{T,Γ} and C(ν) = (l(ν), r(ν)) with

l(ν) = Node(Tl, 2j + 1),  r(ν) = Node(Tr, 2j + 2),

where

Tl = {(x, y) ∈ T : γν(x) = 0},  Tr = {(x, y) ∈ T : γν(x) = 1}.

(d) Add ν to T and return.

Remark 10.1 Note that, even though the learning algorithm for prediction trees can
be very conveniently described in recursive form as above, efficient computer
implementations should avoid recursive calls, which may be inefficient and memory
demanding. Moreover, for large trees, it is likely that recursive implementations
will reach the maximal number of recursive calls imposed by compilers.

10.1.3 Resulting predictor

Once the tree is built, the predictor x 7→ fˆT (x) is recursively defined as follows.

(a) Initialize the computation with ν = ρ.


(b) At a given step of the algorithm, let ν be the current node.
• If ν has no children: then let fˆT (x) = fν (x).
• Otherwise: replace ν by l(ν) if γν (x) = 0 and by r(ν) if γν (x) = 1 and go back to
(b).
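A compact Python sketch of Algorithm 10.1 together with this prediction rule; following remark 10.1, it uses an explicit stack rather than recursive calls. The dictionary-based tree representation and the omission of safeguards (e.g., against empty splits) are simplifications of our own.

def build_tree(T0, sigma, select_feature, fit_predictor):
    # sigma, select_feature and fit_predictor stand for the stopping rule,
    # feature selection and predictor optimization "algorithms" of the text
    tree = {}
    stack = [(T0, 0)]                      # (training subset, label j)
    while stack:
        T, j = stack.pop()
        if sigma(T):                       # terminal node: fit a leaf predictor
            tree[j] = {"children": None, "gamma": None,
                       "predictor": fit_predictor(T)}
        else:                              # internal node: split on a feature
            gamma = select_feature(T)
            Tl = [(x, y) for (x, y) in T if gamma(x) == 0]
            Tr = [(x, y) for (x, y) in T if gamma(x) == 1]
            tree[j] = {"children": (2 * j + 1, 2 * j + 2),
                       "gamma": gamma, "predictor": None}
            stack.append((Tl, 2 * j + 1))
            stack.append((Tr, 2 * j + 2))
    return tree

def tree_predict(tree, x):
    # Descend from the root following the binary features
    j = 0
    while tree[j]["children"] is not None:
        l, r = tree[j]["children"]
        j = l if tree[j]["gamma"](x) == 0 else r
    return tree[j]["predictor"](x)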

10.1.4 Stopping rule

The function σ, which decides whether a node is terminal or not, is generally defined
based on very simple rules. Typically, σ(T) = 1 when one of the following conditions is
satisfied:

• The number of training examples in T is small (e.g., less than 5).
• The values yk in T have a small variance (regression) or are constant (classification).

10.1.5 Leaf predictors

When one reaches a terminal node ν (so that σ (Tν ) = 1), a predictor fν must be
determined. This function can be optimized within any set F of predictors, using
any learning algorithm, but in practice, one usually makes this fairly simple and
defines F to be the family of constant functions taking values in RY . The function
fˆT ,F is then defined as:

• the average of the values of yk , for (xk , yk ) ∈ T (regression);


• the mode of the distribution of yk , for (xk , yk ) ∈ T (classification).

10.1.6 Binary features

The space Γ of possible binary features must be specified in order to partition non-terminal
nodes. A standard choice, used in the CART model [42] with RX = Rd, is

Γ = { γ(x) = 1_{x(i)≥θ} : i = 1, . . . , d, θ ∈ R },    (10.1)

where x(i) is the ith coordinate of x. This corresponds to splitting the space using a
hyperplane parallel to one of the coordinate axes.

The binary feature γ̂_{T,Γ} can be optimized over Γ using a greedy evaluation of
the risk, assuming that the prediction is based on the two nodes resulting from the
split. For γ ∈ Γ and f0, f1 ∈ F, define

F_{γ,f0,f1}(x) = f0(x) if γ(x) = 0, and f1(x) if γ(x) = 1.

Given a risk function r, one then evaluates

ET(γ) = min_{f0,f1∈F} Σ_{(x,y)∈T} r(y, F_{γ,f0,f1}(x))

and chooses γ̂_{T,Γ} = argmin_{γ∈Γ} ET(γ).

Example 10.2 (Regression) Consider the regression case, taking squared differences
as risk and letting F contain only constant functions. Then

ET(γ) = min_{m0,m1} Σ_{(x,y)∈T} ( (y − m0)² 1_{γ(x)=0} + (y − m1)² 1_{γ(x)=1} ).

Obviously, the optimal m0 and m1 are the averages of the output values y in each
of the two subdomains defined by γ. For CART (see (10.1)), this cost must be minimized
over all choices (i, θ), with i = 1, . . . , d and θ ∈ R, where γ_{i,θ}(x) = 1 if x(i) ≥ θ and 0
otherwise. A brute-force sketch of this search is given below.
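The sketch scans every coordinate and every observed threshold, which is quadratic in N; practical implementations sort each coordinate once, but the brute-force version shows the criterion most plainly (the function name is ours).

import numpy as np

def best_cart_split(X, y):
    # Greedy CART split for regression: minimize the summed within-leaf
    # squared error over all (i, theta), theta ranging over observed values
    N, d = X.shape
    best_i, best_theta, best_err = None, None, np.inf
    for i in range(d):
        for theta in np.unique(X[:, i])[1:]:   # skip the minimum: both leaves nonempty
            left = X[:, i] < theta             # gamma_{i,theta}(x) = 0
            right = ~left
            err = (((y[left] - y[left].mean()) ** 2).sum()
                   + ((y[right] - y[right].mean()) ** 2).sum())
            if err < best_err:
                best_i, best_theta, best_err = i, theta, err
    return best_i, best_theta, best_err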

Example 10.3 (Classification) For classification, one can apply the same method
with the 0–1 loss, letting

ET(γ) = min_{g0,g1} Σ_{(x,y)∈T} ( 1_{y≠g0} 1_{γ(x)=0} + 1_{y≠g1} 1_{γ(x)=1} ).

The optimal g0 and g1 are the majority classes in T ∩ {γ = 0} and T ∩ {γ = 1}.

Example 10.4 (Entropy selection for classification) For classification trees, other
splitting criteria may be used, based on the empirical probability pT on the set T,
defined as

pT(A) = (1/N) |{k : (xk, yk) ∈ A}|

for A ⊂ RX × RY. The previous criterion, ET(γ), is proportional to

pT(γ = 0)(1 − max_g pT(g | γ = 0)) + pT(γ = 1)(1 − max_g pT(g | γ = 1)).

One can define alternative objectives of the form

pT(γ = 0) H(pT(· | γ = 0)) + pT(γ = 1) H(pT(· | γ = 1)),



where π → H(π) associates to a probability distribution π a “complexity measure”


that is minimal when π is concentrated on a single class (which is the case for π 7→
1 − maxg π(g)).

Many such measures exists, and many of them are defined as various forms of
entropy designed in information theory. The most celebrated is Shannon’s entropy
[177], defined by X
H(p) = − p(g) log p(g) .
g∈RY

It is always positive, and minimal when the distribution is concentrated on a single


class. Other entropy measures include:

1 P q
• The Tsallis entropy: H(p) = 1−q g∈RY (p(g) −1), for q , 1. (Tsallis entropy for q = 2
is sometimes called the Gini impurity index.)
1
log g∈RY p(g)q , for q ≥ 0, q , 1.
P
• The Renyi entropy: H(p) = 1−q 
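These measures are one-liners; a small sketch (natural logarithms throughout, a convention choice of ours):

import numpy as np

def shannon(p):
    # H(p) = -sum p(g) log p(g); terms with p(g) = 0 drop out
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def tsallis(p, q):
    # (1/(1-q)) (sum p(g)^q - 1); q = 2 gives 1 - sum p^2, the Gini impurity
    p = np.asarray(p, dtype=float)
    return (np.sum(p ** q) - 1.0) / (1.0 - q)

def renyi(p, q):
    # (1/(1-q)) log sum p(g)^q
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** q)) / (1.0 - q)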

10.1.7 Pruning

Growing a decision tree to its maximal depth (given the amount of available data)
generally leads to predictors that overfit the data. The training algorithm is usually
followed by a pruning step that removes some nodes based on a complexity
penalty.

Letting τ(T) denote the set of terminal nodes in the tree T and f̂T the associated
predictor, pruning is represented as an optimization problem, where one minimizes,
given the training set T,

Uλ(T, T) = R̂T(f̂T) + λ|τ(T)|,

where R̂T is, as usual, the in-sample error measured on the training set T.

To prune a tree, one selects one or more internal nodes and removes all their
descendants (so that these nodes become terminal). Associate to each node ν in T its
local in-sample error E_{Tν}, equal to the error made by the optimal predictor estimated
from the training data associated with ν. Then

Uλ(T, T) = Σ_{ν∈τ(T)} (|Tν|/|T|) E_{Tν} + λ|τ(T)|.

If ν is a node in T (internal or terminal), let Tν be the subtree of T containing
ν as a root together with all its descendants, and let T(ν) be the tree T with all
descendants of ν removed (keeping ν). Then

Uλ(T, T) = Uλ(T(ν), T) − (|Tν|/|T|)(E_{Tν} − U0(Tν, Tν)) + λ(|τ(Tν)| − 1).

Note also that, if ν is internal with children ν′ and ν″, then

U0(Tν, Tν) = (|T_{ν′}|/|Tν|) U0(T_{ν′}, T_{ν′}) + (|T_{ν″}|/|Tν|) U0(T_{ν″}, T_{ν″}).

This formula can be used to compute U0(Tν, Tν) recursively for all nodes, starting
with the leaves, for which U0(Tν, Tν) = E_{Tν}. (We also have |τ(Tν)| = |τ(T_{ν′})| + |τ(T_{ν″})|.)
The following algorithm converges to a global minimizer of Uλ.

Algorithm 10.2 (Pruning)

(1) Start with a complete tree T(0) built without penalty.
(2) Compute, for all nodes ν, U0(Tν, Tν) and |τ(Tν)|, and let

ψν = λ(|τ(Tν)| − 1) − (|Tν|/|T|)(E_{Tν} − U0(Tν, Tν))

(the decrease of Uλ obtained by pruning at ν).
(3) Iterate the following steps.

• If ψν < 0 for all internal nodes ν, exit the program and return the current T(n).
• Otherwise, choose an internal node ν such that ψν is largest.
• Let T(n + 1) = T(ν)(n) and subtract ψν from ψ_{ν′} for every ancestor ν′ of ν.

10.2 Random Forests

10.2.1 Bagging

A random forest [7, 41] is a special case of composite predictors (we will see other
examples later in this chapter when describing boosting methods) that train multiple
individual predictors under various conditions and combine them through
averaging or majority voting. With random forests, one generates individual trees
by randomizing the parameters of the learning process. One way to achieve this is
to randomly sample from the training set before running the training algorithm.

Letting as before T0 = (x1, y1, . . . , xN, yN) denote the original training set, with size N, one
can create “new” training data by sampling with replacement from T0. More precisely,
consider a family of independent random variables ξ = (ξ¹, . . . , ξᴺ), with
each ξʲ following a uniform distribution over {1, . . . , N}. One can then form the random
training set

T0(ξ) = (x_{ξ¹}, y_{ξ¹}, . . . , x_{ξᴺ}, y_{ξᴺ}).

Running the training algorithm using T0(ξ) then provides a random tree, denoted
T(ξ). Now, by sampling K realizations of ξ, say ξ^(1), . . . , ξ^(K), one obtains a collection
of K random trees (a random forest) T* = (T1, . . . , TK), with Tj = T(ξ^(j)), that can
be combined to provide a final predictor. The simplest way to combine them is to
average the predictors returned by each tree (assuming, for classification, that this
predictor is a probability distribution on classes), so that

f_{T*}(x) = (1/K) Σ_{j=1}^K f_{Tj}(x).    (10.2)

For classification, one can alternatively let each individual tree “vote” for their most
likely class.

Obviously, randomizing training data and averaging the predictors is a general


approach that can be applied to any prediction algorithm, not only to decision trees.
In the literature, the approach described above has been called bagging [40], which is
an acronym for “bootstrap aggregating” (bootstrap itself being a general resampling
method in statistics that samples training data with replacement to determine some
properties of estimators). Another way to randomize predictors (especially when
d, the input dimension is large), is to randomize input data by randomly removing
some of the coordinates, leading to a similar construction.

With decision trees, one can in addition randomize the binary features used to split nodes, as described next. While bagging may provide some enhancement to predictors, feature randomization for decision trees often significantly improves the performance, and is the typical randomization method used for random forests.

10.2.2 Feature randomization

When one decides to split a node during the construction of a prediction tree, one
can optimize the binary feature γ over a random subset of Γ rather than exploring
the whole set. For CART, for example, one can select a small number of dimensions
$i_1, \ldots, i_q \in \{1, \ldots, d\}$ with $q \ll d$, and optimize $\gamma$ by thresholding one of the coordi-
nates x(ij ) for j ∈ {1, . . . , q}. This results in a randomized version of the node insertion
function.

Algorithm 10.3 (Randomized node insertion: RNode(T , j))


(a) Given T and j, let Tν = T and L(ν) = j.
(b) If σ (T ) = 1, let C(ν) = ∅, γν = “None” and fν = fˆT ,CF .

(c) If $\sigma(\mathcal{T}) = 0$, sample (e.g., uniformly without replacement) a subset $\Gamma_\nu$ of $\Gamma$ and let


fν = “None”, γν = γ̂T ,Γν and C(ν) = (l(ν), r(ν)) with

$$l(\nu) = \mathrm{RNode}(\mathcal{T}_l, 2j+1), \qquad r(\nu) = \mathrm{RNode}(\mathcal{T}_r, 2j+2)$$

where
Tl = {(x, y) ∈ T : γν (x) = 0}
Tr = {(x, y) ∈ T : γν (x) = 1}

(d) Add ν to T and return.

Now, each time the function $\mathrm{RNode}(\mathcal{T}_0, 0)$ is run, it returns a different, random, tree. If it is called $K$ times, this results in a random forest $T^* = (T_1, \ldots, T_K)$, with a predictor $f_{T^*}$ given by (10.2). Note that trees in random forests are generally not pruned, since this operation has been observed to bring no improvement in the context of randomized trees.
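The split-selection step can be sketched as follows; the impurity criterion and helper names are illustrative assumptions (binary labels in {0, 1}), not the book's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_random_split(X, y, q=3):
    """CART-style split search restricted to a random subset of q << d
    coordinates, as in randomized node insertion."""
    d = X.shape[1]
    dims = rng.choice(d, size=min(q, d), replace=False)   # i_1, ..., i_q
    best = (None, None, np.inf)
    for i in dims:
        for t in np.unique(X[:, i]):
            left, right = y[X[:, i] <= t], y[X[:, i] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            # weighted 0-1 impurity of the two children
            err = sum(min(p.mean(), 1 - p.mean()) * len(p)
                      for p in (left, right)) / len(y)
            if err < best[2]:
                best = (i, t, err)
    return best     # (split dimension, threshold, impurity)
```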

10.3 Top-Scoring Pairs

Top-Scoring Pair (TSP) classifiers were introduced in Geman et al. [78] and can be
seen as forests formed with depth-one classification trees in which splitting rules are
based on the comparison of pairs of variables. More precisely, define

$$\gamma_{ij}(x) = \mathbf{1}_{x^{(i)} > x^{(j)}}.$$

A decision tree based on these rules only relies on the order between the features,
and is therefore well adapted to situations in which the observations are subject to
increasing transformations, i.e., when the observed variable X is such that X (j) =
ϕ(Z (j) ), where ϕ : R → R is random and increasing and Z is a latent (unobserved)
variable. Obviously, in such a case, order-based splitting rules do not depend on ϕ.
Such an assumption is relevant, for example, when experimental conditions (such
as temperature) may affect the actual data collection, without changing their order,
which is the case when measuring high-throughput biological data, such as microar-
rays, for which the approach was introduced.

Assuming two classes, a depth-one tree in this context is simply the classifier
fij = γij . Given a training set, the associated empirical error is

$$E_{ij} = \frac{1}{N}\sum_{k=1}^N \mathbf{1}_{\gamma_{ij}(x_k)\neq y_k} = \frac{1}{N}\sum_{k=1}^N |y_k - \gamma_{ij}(x_k)|$$
10.4. ADABOOST 227

and the balanced error (better adapted to situations in which one class is observed
more often than the other) is

$$E^b_{ij} = \sum_{k=1}^N w_k\,|y_k - \gamma_{ij}(x_k)|$$

with $w_k = 1/(2N_{y_k})$, where $N_0$, $N_1$ are the numbers of observations with $y_k = 0$ and $y_k = 1$, respectively.


Pairs $(i, j)$ with small errors are those for which the order between the two features switches with high probability when passing from class 0 to class 1.

In its simplest form, the TSP classifier defines the set

$$P = \operatorname*{argmin}_{ij}\ E^b_{ij}$$

of global minimizers of the empirical error (which may just be a singleton) and pre-
dicts the class based on a majority vote among the family of predictors (fij , (i, j) ∈ P ).
Equivalently, selected variables maximize the score $\Delta_{ij} = 1 - E^b_{ij}$, leading to the method's name.

Such classifiers, which are remarkably simple, have been found to be competitive with a wide range of "advanced" classification algorithms for large-dimensional
problems in computational biology. The method has been refined in Tan et al. [191],
leading to the k-TSP classifier, which addresses the following remarks. First, when
j, j 0 are highly correlated, and (i, j) is a high-scoring pair, then (i, j 0 ) is likely to be
one too, and their associated decision rules will be redundant. Such cases should
preferably be pruned from the classification rules, especially if one wants to select
a small number of pairs. Second, among pairs of features that switch with the same
probability, it is natural to prefer those for which the magnitude of the switch is
largest, e.g., when the pair of variables switches from a regime in which one of them
is very low and the other very high to the opposite. In Tan et al. [191], a rank-based
tie-breaker is introduced, defined as

$$\rho_{ij} = \sum_{k=1}^N w_k\,(R_k(i) - R_k(j))(2y_k - 1),$$
where $R_k(i)$ denotes the rank of $x_k^{(i)}$ in $x_k^{(1)}, \ldots, x_k^{(d)}$. One can now order pairs $(i, j)$
and $(i', j')$ by stating that the former scores higher if (i) $\Delta_{ij} > \Delta_{i'j'}$, or (ii) $\Delta_{ij} = \Delta_{i'j'}$ and $\rho_{ij} > \rho_{i'j'}$. The $k$-TSP classifier is formed by selecting pairs, starting from the highest-scoring one, and using as the $l$th pair (for $l \le k$) the highest-scoring one among all those that do not overlap with the previously selected ones. In [191], the value of $k$ is optimized using cross-validation.
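A minimal sketch of the pair-scoring computation follows; it assumes binary labels in {0, 1} and computes the balanced scores only, not the refined k-TSP selection.

```python
import numpy as np

def tsp_scores(X, y):
    """Balanced TSP scores Delta_ij = 1 - E^b_ij for all pairs (i, j).
    X is N x d, y in {0, 1}."""
    N0, N1 = np.sum(y == 0), np.sum(y == 1)
    w = np.where(y == 0, 1 / (2 * N0), 1 / (2 * N1))   # w_k = 1/(2 N_{y_k})
    d = X.shape[1]
    E = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            gamma = (X[:, i] > X[:, j]).astype(float)  # gamma_ij(x_k)
            E[i, j] = np.sum(w * np.abs(y - gamma))    # balanced error
    return 1 - E

# the top-scoring pair is np.unravel_index(np.argmax(tsp_scores(X, y)), (d, d))
```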

10.4 Adaboost

Boosting methods refer to algorithms in which classifiers are enhanced by recur-


sively making them focus on harder data. We first address the issue of classification,
and describe one of the earliest algorithms (Adaboost). We will then interpret it
as a greedy gradient descent algorithm, as this interpretation will lead to further
extensions.

10.4.1 General set-up

We first consider binary classification problems, with RY = {−1, 1}. We want to de-
sign a function x 7→ F(x) ∈ {−1, 1} on the basis of a training set T = (x1 , y1 , . . . , xN , yN ).
With the 0-1 loss, minimizing the empirical error is equivalent to maximizing
$$E_{\mathcal{T}}(F) = \frac{1}{N}\sum_{k=1}^N y_k F(x_k).$$

Boosting algorithms build the function $F$ as a linear combination of "base classifiers" $f_1, \ldots, f_M$, taking
$$F(x) = \operatorname{sign}\left(\sum_{j=1}^M \alpha_j f_j(x)\right).$$

We assume that each base classifier, fj , takes values in [−1, 1] (the interval).

The sequence of base classifiers is learned by progressively focusing on the hard-


est examples. We will therefore assume that the training algorithm for base clas-
sifiers takes as input the training set $\mathcal{T}$ as well as a family of positive weights $W =$
(w1 , . . . , wN ). More precisely, letting
$$p_W(k) = \frac{w_k}{\sum_{l=1}^N w_l},$$

the weighted algorithm should implement (explicitly or implicitly) the equivalent of an unweighted algorithm on a simulated training set obtained by sampling with replacement $K \gg N$ elements of $\mathcal{T}$ according to $p_W$ (ideally letting $K \to \infty$). Let us take a few examples.

• Weighted LDA: one can use LDA as described in section 8.2 with
$$c_g = \sum_{k: y_k = g} p_W(k), \qquad \mu_g = \frac{1}{c_g}\sum_{k: y_k = g} p_W(k)\,x_k, \qquad \bar\mu = \sum_{g\in R_Y} c_g\,\mu_g$$

and the covariance matrices:
$$\Sigma_w = \sum_{k=1}^N p_W(k)(x_k - \mu_{y_k})(x_k - \mu_{y_k})^T, \qquad \Sigma_b = \sum_{g\in R_Y} c_g(\mu_g - \bar\mu)(\mu_g - \bar\mu)^T.$$

• Weighted logistic regression: just maximize
$$\sum_{k=1}^N p_W(k)\log \pi_\theta(y_k\,|\,x_k)$$

where πθ is given by the logistic model.


• Empirical risk minimization algorithms can be modified in order to minimize
$$\hat R_{\mathcal{T},W}(f) = \sum_{k=1}^N w_k\, r(y_k, f(x_k)).$$

• Of course, any algorithm can be run on a training set resampled using pW .

10.4.2 The Adaboost algorithm

Boosting algorithms keep track of a family of weights and modify it after the jth
classifier fj is computed, increasing the importance of misclassified examples, before
computing the next classifier. The following algorithm, called Adaboost [173, 73],
describes one such approach.

Algorithm 10.4 (Adaboost)


• Start with uniform weights, letting W (1) = (w1 (1), . . . , wN (1)) with wk (1) = 1/N ,
k = 1, . . . , N . Fix a number ρ ∈ (0, 1] and an integer M > 0.
• Iterate, for j = 1, . . . , M:
(1) Fit a base classifier $f_j$ using the weights $W(j) = (w_1(j), \ldots, w_N(j))$. Let
$$S_W^+(j) = \sum_{k=1}^N w_k(j)\left(2 - |y_k - f_j(x_k)|\right) \qquad (10.3a)$$
$$S_W^-(j) = \sum_{k=1}^N w_k(j)\,|y_k - f_j(x_k)| \qquad (10.3b)$$
and define $\alpha_j = \rho\,\log\left(S_W^+(j)/S_W^-(j)\right)$.
(2) Update the weights by
$$w_k(j+1) = w_k(j)\exp\left(\alpha_j\,|y_k - f_j(x_k)|/2\right).$$

• Return the classifier:
$$F(x) = \operatorname{sign}\left(\sum_{j=1}^M \alpha_j f_j(x)\right).$$

If $f_j$ is binary, i.e., $f_j(x)\in\{-1,1\}$, then $|y_k - f_j(x_k)| = 2\cdot\mathbf{1}_{y_k\neq f_j(x_k)}$, so that $S_W^+(j)/2$ is the weighted number of correct classifications and $S_W^-(j)/2$ is the weighted number of incorrect ones.

For αj to be positive, the jth classifier must do better than pure chance on the
weighted training set. If not, taking αj ≤ 0 reflects the fact that, in that case, −fj has
better performance on training data.
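The loop structure of Algorithm 10.4 is compact enough to sketch directly; `fit_weak` below is a hypothetical weak learner taking sampling probabilities and returning a classifier with values in {-1, +1}.

```python
import numpy as np

def adaboost(X, y, fit_weak, M=50, rho=0.5):
    """Adaboost loop for labels y in {-1, +1} (a sketch of Algorithm 10.4)."""
    N = len(y)
    w = np.full(N, 1.0 / N)                      # uniform initial weights
    fs, alphas = [], []
    for _ in range(M):
        f = fit_weak(X, y, w / w.sum())
        miss = np.abs(y - f(X))                  # = 2 on errors, 0 otherwise
        s_plus = np.sum(w * (2 - miss))          # (10.3a)
        s_minus = np.sum(w * miss)               # (10.3b)
        alpha = rho * np.log(s_plus / s_minus)
        w = w * np.exp(alpha * miss / 2)         # weight update
        fs.append(f)
        alphas.append(alpha)
    return lambda x: np.sign(sum(a * f(x) for a, f in zip(alphas, fs)))
```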

Algorithms that do slightly better than chance with high probability are called
“weak learners” [173]. The following proposition [73] shows that, if the base clas-
sifiers reliably perform strictly better than chance (by a fixed, but not necessarily
large, margin), then the boosting algorithm can make the training-set error arbitrar-
ily close to 0.

Proposition 10.5 Let ET be the training set error of the estimator F returned by Algo-
rithm 10.4, i.e.,
$$E_{\mathcal{T}} = \frac{1}{N}\sum_{k=1}^N \mathbf{1}_{y_k\neq F(x_k)}.$$

Then
$$E_{\mathcal{T}} \le \prod_{j=1}^M\left(\epsilon_j^{\rho}(1-\epsilon_j)^{1-\rho} + \epsilon_j^{1-\rho}(1-\epsilon_j)^{\rho}\right)$$

where
$$\epsilon_j = \frac{S_W^-(j)}{S_W^+(j) + S_W^-(j)}.$$

Proof We note that example k is misclassified by the final classifier if and only if

$$\sum_{j=1}^M \alpha_j\,y_k f_j(x_k) \le 0$$
or
$$\prod_{j=1}^M e^{-\alpha_j y_k f_j(x_k)/2} \ge 1.$$

Noting that $|y_k - f_j(x_k)| = 1 - y_k f_j(x_k)$, we see that example $k$ is misclassified when
$$\prod_{j=1}^M e^{\alpha_j|y_k - f_j(x_k)|/2} \ge \prod_{j=1}^M e^{\alpha_j/2}.$$

This shows that
$$E_{\mathcal{T}} = \frac{1}{N}\sum_{k=1}^N \mathbf{1}_{y_k\neq F(x_k)} = \frac{1}{N}\sum_{k=1}^N \mathbf{1}_{\prod_{j}e^{\alpha_j|y_k-f_j(x_k)|/2}\,\ge\,\prod_{j}e^{\alpha_j/2}} \le \frac{1}{N}\sum_{k=1}^N\prod_{j=1}^M e^{\alpha_j|y_k-f_j(x_k)|/2}\prod_{j=1}^M e^{-\alpha_j/2}.$$

Let, for $q \le M$,
$$U_q = \frac{1}{N}\sum_{k=1}^N\prod_{j=1}^q e^{\alpha_j|y_k-f_j(x_k)|/2}.$$
Since
$$w_k(q) = \frac{1}{N}\prod_{j=1}^{q-1}e^{\alpha_j|y_k-f_j(x_k)|/2},$$
we also have $U_q = \sum_{k=1}^N w_k(q+1) = (S_W^+(q+1) + S_W^-(q+1))/2$.

We will use the inequality$^1$
$$e^{\alpha t} \le 1 - (1 - e^{\alpha})t,$$

$^1$ This inequality is clear for $\alpha = 0$. Assuming $\alpha \neq 0$, the difference between the upper and the lower bound is $q(t) = 1 - e^{\alpha t} - (1 - e^{\alpha})t$. The function $q$ is concave (its second derivative is $-\alpha^2 e^{\alpha t}$) with $q(0) = q(1) = 0$, and is therefore non-negative over $[0, 1]$.

which is true for all $\alpha\in\mathbb{R}$ and $t\in[0,1]$, to write
$$U_q \le \frac{1}{N}\sum_{k=1}^N\prod_{j=1}^{q-1}e^{\alpha_j|y_k-f_j(x_k)|/2}\left(1 - (1-e^{\alpha_q})|y_k-f_q(x_k)|/2\right)$$
$$= \sum_{k=1}^N w_k(q)\left(1 - (1-e^{\alpha_q})|y_k-f_q(x_k)|/2\right)$$
$$= \sum_{k=1}^N w_k(q) - (1-e^{\alpha_q})\sum_{k=1}^N w_k(q)|y_k-f_q(x_k)|/2$$
$$= U_{q-1}\left(1 - (1-e^{\alpha_q})\epsilon_q\right).$$

This gives (using $U_0 = 1$)
$$U_M \le \prod_{j=1}^M\left(1 - (1-e^{\alpha_j})\epsilon_j\right)$$
and
$$E_{\mathcal{T}} \le \prod_{j=1}^M\left(1 - (1-e^{\alpha_j})\epsilon_j\right)e^{-\alpha_j/2}.$$
It now suffices to replace $e^{\alpha_j}$ by $(1-\epsilon_j)^{\rho}\epsilon_j^{-\rho}$ and note that
$$\left(1 - (1 - (1-\epsilon_j)^{\rho}\epsilon_j^{-\rho})\epsilon_j\right)(1-\epsilon_j)^{-\rho/2}\epsilon_j^{\rho/2} = \epsilon_j^{\rho}(1-\epsilon_j)^{1-\rho} + \epsilon_j^{1-\rho}(1-\epsilon_j)^{\rho}$$
to conclude the proof. □

For $\epsilon\in[0,1]$, one has
$$\epsilon^{\rho}(1-\epsilon)^{1-\rho} + \epsilon^{1-\rho}(1-\epsilon)^{\rho} = 1 - \left(\epsilon^{\rho} - (1-\epsilon)^{\rho}\right)\left(\epsilon^{1-\rho} - (1-\epsilon)^{1-\rho}\right) \le 1$$

with equality if and only if $\epsilon = 1/2$, so that each term in the upper bound reduces the error unless the corresponding base classifier does not perform better than pure chance. The parameter $\rho$ determines the level at which one increases the importance of misclassified examples for the next step. Let $\tilde S_W^+(j)$ and $\tilde S_W^-(j)$ denote the expressions in (10.3a) and (10.3b) with $w_k(j)$ replaced by $w_k(j+1)$. Then, in the case when the base classifiers are binary, ensuring that $|y_k - f_j(x_k)|/2 = \mathbf{1}_{y_k\neq f_j(x_k)}$, one can easily check that $\tilde S_W^+(j)/\tilde S_W^-(j) = (S_W^+(j)/S_W^-(j))^{1-\rho}$. So, the ratio is (of course) unchanged if $\rho = 0$, and pushed to a pure-chance level if $\rho = 1$. We provide below an interpretation of boosting as a greedy optimization procedure that will lead to the value $\rho = 1/2$.

10.4.3 Adaboost and greedy gradient descent

We here restrict to the case of binary base classifiers and denote their linear combination by
$$h(x) = \sum_{j=1}^M \alpha_j f_j(x).$$
Whether an observation $x$ is correctly classified in the true class $y$ is determined by the sign of the product $y\,h(x)$, but the value of this product also has an important interpretation since, when it is positive, it can be thought of as a margin with which $x$ is correctly classified.

Assume that the function $F$ is evaluated not only on the basis of its classification error, but also based on this margin, using a loss function of the kind
$$\Psi(h) = \sum_{k=1}^N \psi(y_k h(x_k)), \qquad (10.4)$$
where $\psi$ is decreasing. The boosting algorithm can then be interpreted as a classifier that incrementally improves this objective function.

Let, for $j < M$,
$$h^{(j)} = \sum_{q=1}^j \alpha_q f_q.$$

The next combination $h^{(j+1)}$ is equal to $h^{(j)} + \alpha_{j+1}f_{j+1}$, and we now consider the problem of minimizing, with respect to $f_{j+1}$ and $\alpha_{j+1}$, the function $\Psi(h^{(j+1)})$, without modifying the previous classifiers (i.e., performing a greedy optimization). So, we want to minimize, with respect to the base classifier $\tilde f$ and to $\alpha \ge 0$, the function
$$U(\alpha, \tilde f) = \sum_{k=1}^N \psi\left(y_k h^{(j)}(x_k) + \alpha\,y_k\tilde f(x_k)\right).$$

Using the fact that $\tilde f$ is a binary classifier, this can be written
$$U(\alpha,\tilde f) = \sum_{k=1}^N \psi(y_k h^{(j)}(x_k) + \alpha)\mathbf{1}_{y_k = \tilde f(x_k)} + \sum_{k=1}^N \psi(y_k h^{(j)}(x_k) - \alpha)\mathbf{1}_{y_k\neq\tilde f(x_k)} \qquad (10.5)$$
$$= \sum_{k=1}^N\left(\psi(y_k h^{(j)}(x_k) - \alpha) - \psi(y_k h^{(j)}(x_k) + \alpha)\right)\mathbf{1}_{y_k\neq\tilde f(x_k)} + \sum_{k=1}^N \psi(y_k h^{(j)}(x_k) + \alpha).$$

This shows that $\alpha$ and $\tilde f$ have inter-dependent optimality conditions. For a given $\alpha$, the best classifier $\tilde f$ must minimize a weighted empirical error with non-negative weights (since $\psi$ is decreasing)
$$w_k = \psi(y_k h^{(j)}(x_k) - \alpha) - \psi(y_k h^{(j)}(x_k) + \alpha).$$

Given $\tilde f$, $\alpha$ must minimize the expression in (10.5). One can use an alternating minimization procedure to optimize both $\tilde f$ (as a weighted base classifier) and $\alpha$. However, for the special choice $\psi(t) = e^{-t}$, this optimization turns out to only require one step.

In this case, we have
$$U(\alpha,\tilde f) = (e^{\alpha} - e^{-\alpha})\sum_{k=1}^N e^{-y_k h^{(j)}(x_k)}\mathbf{1}_{y_k\neq\tilde f(x_k)} + e^{-\alpha}\sum_{k=1}^N e^{-y_k h^{(j)}(x_k)}$$
$$= e^{-\alpha^{(j)}}(e^{\alpha} - e^{-\alpha})\sum_{k=1}^N w_k(j+1)\mathbf{1}_{y_k\neq\tilde f(x_k)} + e^{-\alpha^{(j)}}e^{-\alpha}\sum_{k=1}^N w_k(j+1)$$
with $w_k(j+1) = e^{\alpha^{(j)}}e^{-y_k h^{(j)}(x_k)}$ and $\alpha^{(j)} = \alpha_1 + \cdots + \alpha_j$. This shows that $\tilde f$ should minimize

$$\sum_{k=1}^N w_k(j+1)\,\mathbf{1}_{y_k\neq\tilde f(x_k)}.$$

We note that
$$w_k(j+1) = w_k(j)\,e^{\alpha_j(1 - y_k f_j(x_k))} = w_k(j)\,e^{\alpha_j|y_k - f_j(x_k)|},$$
which is identical to the weight updates in Algorithm 10.4 (this is the reason why the term $\alpha^{(j)}$ was introduced in the computation). The new value of $\alpha$ must minimize (using the notation of Algorithm 10.4)
$$e^{-\alpha}S_W^+(j) + e^{\alpha}S_W^-(j),$$
which yields $\alpha = \frac{1}{2}\log\left(S_W^+(j)/S_W^-(j)\right)$. This is the value $\alpha_{j+1}$ in Algorithm 10.4 with $\rho = 1/2$.

10.5 Gradient boosting and regression

10.5.1 Notation

The boosting idea, and in particular its interpretation as a greedy gradient procedure, can be extended to non-linear regression problems [75]. Let us denote by $\mathcal{F}_0$ the set of base predictors, therefore functions from $R_X = \mathbb{R}^d$ to $R_Y = \mathbb{R}^q$, since we are considering regression problems. The final predictor is a linear combination
$$F(x) = \sum_{j=1}^M \alpha_j f_j(x)$$

with $\alpha_1, \ldots, \alpha_M\in\mathbb{R}$ and $f_1, \ldots, f_M\in\mathcal{F}_0$. Note that the coefficients $\alpha_j$ are redundant when the class $\mathcal{F}_0$ is invariant under multiplication by a scalar. Replacing if needed $\mathcal{F}_0$ by $\{f = \alpha g,\ \alpha\in\mathbb{R},\ g\in\mathcal{F}_0\}$, we will assume that this property holds and therefore remove the coefficients $\alpha_j$ from the problem.

In accordance with the principle of performing greedy searches, we let
$$F^{(j)}(x) = \sum_{q=1}^j f_q(x),$$
and consider the problem of minimizing, over $f\in\mathcal{F}_0$,
$$U(f) = \sum_{k=1}^N r(y_k,\,F^{(j)}(x_k) + f(x_k)),$$

where T = (x1 , y1 , . . . , xN , yN ) is the training data and r is the loss function.

10.5.2 Translation-invariant loss

In the case, which is frequent in regression, when $r(y, y')$ only depends on $y - y'$, the problem is equivalent to minimizing
$$U(f) = \sum_{k=1}^N r(y_k - F^{(j)}(x_k),\,f(x_k)),$$
i.e., to letting $f_{j+1}$ be the optimal predictor (in $\mathcal{F}_0$ and for the loss $r$) of the residuals $y_k^{(j)} = y_k - F^{(j)}(x_k)$. In this case, this provides a conceptually very simple algorithm.

Algorithm 10.5 (Gradient boosting for regression with translation-invariant loss)


• Let $\mathcal{T} = (x_1, y_1, \ldots, x_N, y_N)$ be a training set and $r$ a loss function such that $r(y, y')$ only depends on $y - y'$.
• Let $\mathcal{F}_0$ be a function class such that $f\in\mathcal{F}_0 \Rightarrow \alpha f\in\mathcal{F}_0$ for all $\alpha\in\mathbb{R}$.
• Select an integer $M > 0$ and let $F^{(0)} = 0$, $y_k^{(0)} = y_k$, $k = 1, \ldots, N$.
• For $j = 1, \ldots, M$:
(1) Find the optimal predictor $f_j\in\mathcal{F}_0$ for the training set $(x_1, y_1^{(j-1)}, \ldots, x_N, y_N^{(j-1)})$.
(2) Let $y_k^{(j)} = y_k^{(j-1)} - f_j(x_k)$.
• Return $F = \sum_{j=1}^M f_j$.

Remark 10.6 Obviously, the class $\mathcal{F}_0$ should not be a linear class for the boosting algorithm to have any effect. Indeed, if $f, f'\in\mathcal{F}_0$ implies $f + f'\in\mathcal{F}_0$, no improvement could be made to the predictor after the first step. □
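The residual-fitting loop is short enough to write out; this is a minimal sketch for the squared loss, with `fit_base` a hypothetical base learner (e.g., a shallow regression tree) returning a callable predictor.

```python
import numpy as np

def boost_regression(X, y, fit_base, M=100):
    """Gradient boosting with a translation-invariant loss (Algorithm 10.5):
    each base predictor is fit to the current residuals."""
    residual = y.astype(float).copy()     # y_k^{(0)} = y_k
    base = []
    for _ in range(M):
        f = fit_base(X, residual)         # step (1): fit the residuals
        residual -= f(X)                  # step (2): y^{(j)} = y^{(j-1)} - f_j
        base.append(f)
    return lambda x: sum(f(x) for f in base)
```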

A successful example of this algorithm uses regression trees as base predictors.


Recall that the functions output by such trees take the form
X
f (x) = wA 1x∈A
A∈C

where C is a finite partition of Rd . Each set in the partition is specified by the value
taken by a finite number of binary features (denoted by γ in our discussion of pre-
diction trees) and the maximal number of such features is the depth of the tree. We
assume that the set Γ of binary features is shared by all regression trees in F0 , and
that the depth of these trees is bounded by a fixed constant. These restrictions pre-
vent F0 from forming a linear class.2 Note that the maximal depth of tree learnable
from a finite training set is always bounded, since such trees cannot have more nodes
than the size of the training set (but one may want to restrict the maximal depth of
base predictors to be way less than N ).

10.5.3 General loss functions

We now consider situations in which the loss function is not necessarily a function
of the difference between true and predicted output. We are still interested in the
problem of minimizing U (f ), but we now approximate this problem using the first-
order expansion

$$U(f) = \sum_{k=1}^N r(y_k, F^{(j)}(x_k)) + \sum_{k=1}^N \partial_2 r(y_k, F^{(j)}(x_k))^T f(x_k) + o(f),$$

where ∂2 r denotes the derivative of r with respect to its second variable. This sug-
gests (similarly to gradient descent) to choose f such that f (xk ) = −α∂2 r(yk , F (j) (xk ))
$^2$ If $f$ and $g$ are representable as trees, $f + g$ can be represented as a tree whose depth is the sum of those of the original trees, simply by inserting copies of $g$ below each leaf of $f$.

for some $\alpha > 0$ and all $k = 1, \ldots, N$. However, such an $f$ may not exist in the class $\mathcal{F}_0$, and the next best choice is to pick $f = \alpha\tilde f$ with $\tilde f$ minimizing
$$\sum_{k=1}^N\left|\tilde f(x_k) + \partial_2 r(y_k, F^{(j)}(x_k))\right|^2$$
over all $\tilde f\in\mathcal{F}_0$. This is similar to projected gradient descent in optimization, and $\alpha$ such that $f = \alpha\tilde f$ should minimize
$$\sum_{k=1}^N r(y_k,\,F^{(j)}(x_k) + \alpha\tilde f(x_k)).$$
This provides a generic “gradient boosting” algorithm [75], summarized below.

Algorithm 10.6 (Gradient boosting)


• Let T = (x1 , y1 , . . . , xN , yN ) be a training set and r a differentiable loss function.
• Let F0 be a function class such that f ∈ F0 ⇒ αf ∈ F0 for all α ∈ R.
• Select an integer M > 0 and let F (0) = 0.
• For j = 1, . . . , M:
(1) Find $\tilde f_j\in\mathcal{F}_0$ minimizing
$$\sum_{k=1}^N\left|\tilde f(x_k) + \partial_2 r(y_k, F^{(j-1)}(x_k))\right|^2$$
over all $\tilde f\in\mathcal{F}_0$.
(2) Let $f_j = \alpha_j\tilde f_j$ where $\alpha_j$ minimizes
$$\sum_{k=1}^N r(y_k,\,F^{(j-1)}(x_k) + \alpha\tilde f_j(x_k)).$$

(3) Let F (j) = F (j−1) + fj .


• Return F = F (M) .

Remark 10.7 Importantly, the fact that $\mathcal{F}_0$ is stable under scalar multiplication implies that the function $\tilde f_j$ satisfies
$$\sum_{k=1}^N \tilde f_j(x_k)^T\partial_2 r(y_k, F^{(j-1)}(x_k)) \le 0,$$
that is, except in the unlikely case in which the above sum is zero, it is a direction of descent for the function $U$ (because one could otherwise replace $\tilde f_j$ by $-\tilde f_j$ and improve the approximation of the gradient). □
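A generic version of Algorithm 10.6 can be sketched as follows; `fit_base`, `grad_r` and `loss_r` are hypothetical callables for the base learner, the derivative $\partial_2 r$, and the loss itself.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gradient_boost(X, y, fit_base, grad_r, loss_r, M=50):
    """Greedy gradient boosting: least-squares projection of the negative
    gradient on the base class, followed by a line search on alpha."""
    F = np.zeros(len(y))
    steps = []
    for _ in range(M):
        g = grad_r(y, F)                        # partial_2 r(y_k, F(x_k))
        f_tilde = fit_base(X, -g)               # step (1): fit -gradient
        pred = f_tilde(X)
        alpha = minimize_scalar(                # step (2): line search
            lambda a: loss_r(y, F + a * pred).sum()).x
        F = F + alpha * pred
        steps.append((alpha, f_tilde))
    return lambda x: sum(a * f(x) for a, f in steps)
```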

10.5.4 Return to classification

A slight modification of this algorithm may also be applied to classification, pro-


vided that the classifier f is obtained by learning the conditional distribution, de-
noted g 7→ p(g|x), of the output variable (assumed to take values in a finite set RY )
given the input (assumed to take values in RX = Rd ).

Our goal is to estimate an unknown target conditional distribution, µ, therefore


taking the form µ(g|x) for g ∈ RY and x ∈ Rd . We assume that a family µk , k =
1, . . . , N of distributions on the set RY is observed, where each µk is assumed to be
an approximation of the unknown µ(·|xk ) (typically, µk (g) = 1g=yk , i.e., µk = δyk ). The
risk function must take the form $r(\mu, \mu')$ where $\mu, \mu'\in\mathcal{S}(R_Y)$, the set of probability distributions on $R_Y$. We will work with
$$r(\mu, \mu') = -\sum_{g\in R_Y}\mu(g)\log\mu'(g).$$

One can note that
$$r(\mu,\mu') = \mathrm{KL}(\mu\,\|\,\mu') + r(\mu,\mu),$$
which is therefore minimal when $\mu' = \mu$. Moreover, in the special case $\mu_k = \delta_{y_k}$, the empirical risk is
$$\hat R(p) = \sum_{k=1}^N r(\mu_k, p(\cdot|x_k)) = -\sum_{k=1}^N \log p(y_k|x_k),$$
so that minimizing it is equivalent to maximizing the conditional likelihood that was
used for logistic regression.

Before applying the previous algorithm, one must address the issue that probability distributions do not form a vector space, and cannot be added to form new probability distributions. In Friedman [75], Hastie et al. [87], it is suggested to use the representation, which can be associated with any function $F : (g, x)\mapsto F(g|x)\in\mathbb{R}$,
$$p_F(g|x) = \frac{e^{F(g|x)}}{\sum_{h\in R_Y} e^{F(h|x)}}.$$

Because the representation is not unique ($p_F = p_{F'}$ if $F - F'$ only depends on $x$), we will require in addition that
$$\sum_{h\in R_Y} F(h|x) = 0$$

for all $x\in\mathbb{R}^d$. The space formed by such functions $F$ is now linear, and we can consider the empirical risk
$$\hat R(F) = -\sum_{k=1}^N\sum_{g\in R_Y}\mu_k(g)\log p_F(g|x_k) = -\sum_{k=1}^N\sum_{g\in R_Y}\mu_k(g)F(g|x_k) + \sum_{k=1}^N\log\left(\sum_{g\in R_Y} e^{F(g|x_k)}\right).$$

One can evaluate the derivative of this risk with respect to a change in $F(g|x_k)$, and a short computation gives
$$\frac{\partial \hat R}{\partial F(g|x_k)} = -(\mu_k(g) - p_F(g|x_k)).$$

Now assume that a basic space $\mathcal{F}_0$ of functions $f : (g, x)\mapsto f(g|x)$ is chosen, such that all functions in $\mathcal{F}_0$ satisfy
$$\sum_{g\in R_Y} f(g|x) = 0$$

for all $x\in\mathbb{R}^d$. The gradient boosting algorithm then requires minimizing (in Step (1)):
$$\sum_{k=1}^N\sum_{g\in R_Y}\left(\mu_k(g) - p_{F^{(j-1)}}(g|x_k) - \tilde f(g|x_k)\right)^2$$
with respect to all functions $\tilde f\in\mathcal{F}_0$. Given the optimal $\tilde f_j$, the next step requires minimizing, with respect to $\alpha\in\mathbb{R}$:
$$-\alpha\sum_{k=1}^N\sum_{g\in R_Y}\mu_k(g)\tilde f_j(g|x_k) + \sum_{k=1}^N\log\left(\sum_{g\in R_Y} e^{F^{(j-1)}(g|x_k) + \alpha\tilde f_j(g|x_k)}\right).$$

This is a scalar convex problem that can be solved, e.g., using gradient descent.

10.5.5 Gradient tree boosting

We now specialize to the situation in which the set F0 contains regression trees. In
this situation, the general algorithm can be improved by taking advantage of the fact
that the predictors returned by such trees are piecewise constant functions, where
the regions of constancy are associated with partitions C of Rd defined by the leaves
of the trees. In particular, $\tilde f_j$ in Step (1) takes the form
$$\tilde f_j(g|x) = \sum_{A\in\mathcal{C}}\tilde f_{j,A}(g)\,\mathbf{1}_{x\in A}.$$
The final $f$ at Step (2) should therefore take the form
$$\sum_{A\in\mathcal{C}}\alpha\tilde f_{j,A}(g)\,\mathbf{1}_{x\in A}$$

but not much additional complexity is introduced by freely optimizing the values of $f_j$ on $A$, that is, by looking for $f$ in the form
$$\sum_{A\in\mathcal{C}} f_{j,A}(g)\,\mathbf{1}_{x\in A},$$
where the values $f_{j,A}(g)$ optimize the empirical risk. This risk becomes
$$-\sum_{k=1}^N\sum_{A\in\mathcal{C}}\sum_{g\in R_Y}\mu_k(g)f_{j,A}(g)\mathbf{1}_{x_k\in A} + \sum_{k=1}^N\sum_{A\in\mathcal{C}}\log\left(\sum_{g\in R_Y} e^{F^{(j-1)}(g|x_k) + f_{j,A}(g)}\right)\mathbf{1}_{x_k\in A}.$$

The values $f_{j,A}(g)$, $g\in R_Y$, can therefore be optimized separately, minimizing
$$-\sum_{k: x_k\in A}\sum_{g\in R_Y}\mu_k(g)f_{j,A}(g) + \sum_{k: x_k\in A}\log\left(\sum_{g\in R_Y} e^{F^{(j-1)}(g|x_k) + f_{j,A}(g)}\right).$$

This is still a convex program, which has to be run at every leaf of the optimized
tree. If computing time is limited (or for large-scale problems), the determination of
fj,A (g) may be restricted to one step of gradient descent starting at fj,A = 0. A simple
computation indeed shows that the first derivative of the function above with respect
to $f_{j,A}(g)$ is
$$a_A(g) = -\sum_{k: x_k\in A}(\mu_k(g) - p_F(g|x_k)).$$

The derivative of this expression with respect to $f_{j,A}(g)$ (for the same $g$) is
$$b_A(g) = \sum_{k: x_k\in A} p_F(g|x_k)(1 - p_F(g|x_k)).$$

The off-diagonal terms in the second derivative are, for $g\neq h$,
$$-\sum_{k: x_k\in A} p_F(g|x_k)\,p_F(h|x_k).$$

In Friedman et al. [74], it is suggested to use an approximate Newton step, where the off-diagonal terms in the second derivative are neglected. This corresponds to minimizing
$$\sum_{g\in R_Y} a_A(g)f_{j,A}(g) + \frac{1}{2}\sum_{g\in R_Y} b_A(g)f_{j,A}(g)^2.$$
The solution is (introducing a Lagrange multiplier for the constraint $\sum_g f_{j,A}(g) = 0$)
$$f_{j,A}(g) = -\frac{a_A(g) - \lambda}{b_A(g)}$$
with
$$\lambda = \frac{\sum_{g\in R_Y} a_A(g)/b_A(g)}{\sum_{g\in R_Y} 1/b_A(g)}.$$
A small value $\epsilon$ can be added to $b_A$ to avoid divisions by zero. We refer the reader to Friedman et al. [74], Friedman [75], Hastie et al. [87] for several variations on this basic idea. Note that an approximate but highly efficient implementation of boosted trees, called XGBoost, has been developed in Chen and Guestrin [52].
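The per-leaf update can be written in a few lines; the array layout below is an assumption made for the illustration.

```python
import numpy as np

def leaf_values(mu, p, eps=1e-8):
    """Diagonal approximate Newton step at one leaf A: mu and p have shape
    (n_A, q), holding mu_k(g) and p_F(g|x_k) for the samples reaching A.
    Returns f_{j,A}(g) under the zero-sum constraint."""
    a = -np.sum(mu - p, axis=0)               # a_A(g)
    b = np.sum(p * (1 - p), axis=0) + eps     # b_A(g); eps avoids division by 0
    lam = np.sum(a / b) / np.sum(1 / b)       # Lagrange multiplier
    return -(a - lam) / b
```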
Chapter 11

Iterated Compositions of Functions and Neural Nets

11.1 First definitions

We now discuss a class of methods in which the predictor $f$ is built using iterated compositions, with a main application to neural nets. We will structure these models using directed acyclic graphs (DAGs). These graphs are composed of a set of vertices (or nodes) $V = \{0, \ldots, m+1\}$ and a collection $L$ of directed edges $i\to j$ between some vertices. If an edge exists between $i$ and $j$, one says that $i$ is a parent of $j$ and $j$ a child of $i$, and we will use the notation $\mathrm{pa}(i)$ (resp. $\mathrm{ch}(i)$) to denote the set of parents (resp. children) of $i$. The graphs we consider must satisfy the following conditions:

(i) No index is a descendant of itself, i.e., the graph is acyclic.

(ii) The only index without a parent is $i = 0$ and the only one without children is $i = m+1$.

To each node $i$ in the graph, one associates a dimension $d_i$ and a variable $z_i\in\mathbb{R}^{d_i}$. The root node variable, $z_0 = x$, is the input and $z_{m+1}$ is the output. One also associates to each node $i\neq 0$ a function $\psi_i$ defined on the product space $\prod_{j\in\mathrm{pa}(i)}\mathbb{R}^{d_j}$ and taking values in $\mathbb{R}^{d_i}$. The input-output relation is then defined by the family of equations
$$z_i = \psi_i(z_{\mathrm{pa}(i)}),$$
where $z_{\mathrm{pa}(i)} = (z_j,\ j\in\mathrm{pa}(i))$. Since there is only one root and one terminal node, these iterations implement a relationship $y = z_{m+1} = f(x)$, with $z_0 = x$. We will refer to $z_1, \ldots, z_m$ as the latent variables of the network.

Each function $\psi_i$ is furthermore parametrized by an $s_i$-dimensional vector $w_i\in\mathbb{R}^{s_i}$, so that we will write
$$z_i = \psi_i(z_{\mathrm{pa}(i)};\,w_i).$$

We let W denote the vector containing all parameters w1 , . . . , wm+1 , which therefore
has dimension s = s1 + · · · + sm+1 . The network function f is then parametrized by W
and we will write y = f (x; W ).

11.2 Neural nets

11.2.1 Transitions

Most neural networks iterate functions taking the form
$$\psi_i(z; w) = \rho(bz + \beta_0), \qquad z\in\mathbb{R}^{\sum_{j\in\mathrm{pa}(i)}d_j},$$
where $b$ is a $d_i\times\sum_{j\in\mathrm{pa}(i)}d_j$ matrix and $\beta_0\in\mathbb{R}^{d_i}$ (so that $w = (b, \beta_0)$ is $s_i = d_i(1 + \sum_{j\in\mathrm{pa}(i)}d_j)$-dimensional); $\rho$ is defined on $\mathbb{R}$ and takes values in $\mathbb{R}$, and we make the abuse of notation, for any $d$ and $u\in\mathbb{R}^d$,
$$\rho(u) = \begin{pmatrix}\rho(u^{(1)})\\ \vdots\\ \rho(u^{(d)})\end{pmatrix}.$$

The most popular choice for $\rho$ is the positive part, or ReLU (for rectified linear unit), given by $\rho(t) = \max(t, 0)$. Other common choices are $\rho(t) = 1/(1 + e^{-t})$ (sigmoid function), or $\rho(t) = \tanh(t)$.
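These transitions are one-liners in practice; the following minimal sketch fixes notation for what a single dense transition computes.

```python
import numpy as np

def relu(u):
    """Coordinatewise rectified linear unit rho(t) = max(t, 0)."""
    return np.maximum(u, 0.0)

def dense_transition(z, b, beta0, rho=relu):
    """One transition z_i = rho(b z_pa + beta0), with b a (d_i x d_pa)
    matrix and beta0 a vector in R^{d_i}."""
    return rho(b @ z + beta0)

# e.g., mapping a 3-dimensional parent variable to 2 units:
# dense_transition(np.ones(3), np.random.randn(2, 3), np.zeros(2))
```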

Residual neural networks (or ResNets [89]) are discussed in section 11.6. They iterate transitions between inputs and outputs of the same dimension, taking
$$z_{i+1} = z_i + \psi(z_i; w). \qquad (11.1)$$

11.2.2 Output

The last node of the graph provides the prediction, y. Its expression depends on the
type of predictor that is learned

• For regression, $y$ can be chosen as an affine function of its parents: $z_{m+1} = b\,z_{\mathrm{pa}(m+1)} + a_0$.

• For classification, one can also use a linear model $z_{m+1} = b\,z_{\mathrm{pa}(m+1)} + a_0$, where $z_{m+1}$ is $q$-dimensional, and let the classification be $\operatorname{argmax}(z_{m+1}^{(i)},\ i = 1, \ldots, q)$. Alternatively, one may use a "softmax" transformation, taking
$$z_{m+1}^{(i)} = \frac{e^{\zeta_{m+1}^{(i)}}}{\sum_{j=1}^q e^{\zeta_{m+1}^{(j)}}}$$
with $\zeta_{m+1} = b\,z_{\mathrm{pa}(m+1)} + a_0$.

11.2.3 Image data

Neural networks have achieved top performance when working with organized struc-
tures such as images. A typical problem in this setting is to categorize the content of
the image, i.e., return a categorical variable naming its principal element(s). Other
applications include facial recognition or identification. In this case, the transition
function can take advantage of the 2D structure, with some special terminology.

Instead of speaking of the total dimension, say, $d$, of the considered variables, writing $z = (z^{(1)}, \ldots, z^{(d)})$, images are better represented with three indices $z(u, v, \lambda)$, where $u = 1, \ldots, U$ and $U$ is the width of the image, $v = 1, \ldots, V$ and $V$ is the height of the image, and $\lambda = 1, \ldots, \Lambda$ and $\Lambda$ is the depth of the image. (With this notation, $d = UV\Lambda$.) Typical images have width and height of several hundred pixels, and depth $\Lambda = 3$ for the three color channels. This three-dimensional structure is preserved for latent variables as well, with different dimensions. Deep neural networks often combine compression in width and height with expansion in depth while transitioning from input to output.

The linear transformation $b$ mapping one layer with dimensions $U_i, V_i, \Lambda_i$ to another with dimensions $U_{i+1}, V_{i+1}, \Lambda_{i+1}$ is then preferably seen as a collection of numbers $b(u', v', \lambda', u, v, \lambda)$, so that the transition from $z_i$ to $z_{i+1}$ is given by
$$z_{i+1}(u', v', \lambda') = \rho\left(\beta_0(u', v', \lambda') + \sum_{u=1}^{U_i}\sum_{v=1}^{V_i}\sum_{\lambda=1}^{\Lambda_i} b(u', v', \lambda', u, v, \lambda)\,z_i(u, v, \lambda)\right).$$

For images, it is often preferable to use convolutional transitions, providing convolutional neural networks [117, 116], or CNNs. If $U_i = U_{i+1}$ and $V_i = V_{i+1}$, such a transition requires that $b(u', v', \lambda', u, v, \lambda)$ only depends on $\lambda$, $\lambda'$ and on the differences $u' - u$ and $v' - v$. In general, one also requires that $b(u', v', \lambda', u, v, \lambda)$ is non-zero only if $|u' - u|$ and $|v' - v|$ are both less than a constant, typically a small number. Also, there is generally little computation across depths: each output at depth $\lambda'$ only uses values from a single input depth. These restrictions obviously reduce dramatically the number of free parameters involved in the transition.
the number of free parameters involved in the transition.

Figure 11.1: Linear net with increasing layer depths and decreasing layer width.

Figure 11.2: A sketch of the U-net architecture designed for image segmentation [169].

After one or a few convolutions, the dimension is often reduced by a “pooling”


operation, dividing the image into small non-overlapping windows and replacing
each such window by a single value, either the max (max-pooling) or the average.

11.3 Geometry

In addition to the transitions between latent variables and resulting changes of di-
mension, the structure of the DAG defining the network is an important element in
the design of a neural net. The simplest choice is a purely linear structure (as shown
in Figure 11.1), as was, for example, used for image categorization in [111].

More complex architectures have been introduced in recent years. Their design
is in a large part heuristic and based on an analysis of the kind of computation
that should be done in the network to perform a particular task. For example, an architecture used for image segmentation is summarized in fig. 11.2.

Remark 11.1 An important feature of neural nets is their modularity, since "simple" architectures can be combined (e.g., by placing the output of a network as input of another one) to form a more complex network that still follows the basic structure defined above. One example of such a building block is the "attention module."
Such a module takes as input three sequences of vectors of equal size, say $z^{(q)} = (z_k^{(q)})$, $z^{(c)} = (z_k^{(c)})$, $z^{(v)} = (z_k^{(v)})$, for $k = 1, \ldots, n$, which are typically outputs of previous modules. All three may be identical (self-attention modules), or distinct (encoder-decoder modules in [200] have $z^{(c)} = z^{(v)}\neq z^{(q)}$). The input vectors are separately linearly transformed into "query," "key," and "value" vectors, $q_k = W_q z_k^{(q)}$, $c_k = W_c z_k^{(c)}$ and $v_k = W_v z_k^{(v)}$ (where $W_q$, $W_c$, $W_v$ are learned, and $W_q$ and $W_c$ have the same number of rows, say, $d$), and the output of the module is also a sequence of $n$ vectors given by
$$z_k^{(o)} = \frac{\sum_l e^{\tau a(q_k, c_l)}\,v_l}{\sum_l e^{\tau a(q_k, c_l)}},$$
where $a(q, c)$ measures the affinity between $q$ and $c$ (e.g., $a(q, c) = q^Tc$) and $\tau$ is a fixed constant (e.g., $1/\sqrt{d}$). These attention modules are fundamental components of "transformer networks" [200], which are used, among other tasks, in natural language processing and large language models. □
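A minimal sketch with the dot-product affinity follows; the matrix shapes are assumptions made for the illustration (rows of Zq, Zc, Zv are the n input vectors).

```python
import numpy as np

def attention(Zq, Zc, Zv, Wq, Wc, Wv, tau=None):
    """Attention module with a(q, c) = q^T c and tau = 1/sqrt(d) by default;
    returns the n output vectors z_k^(o) as rows."""
    Q, C, V = Zq @ Wq.T, Zc @ Wc.T, Zv @ Wv.T
    if tau is None:
        tau = 1.0 / np.sqrt(Q.shape[1])
    scores = tau * (Q @ C.T)                       # tau a(q_k, c_l), n x n
    scores -= scores.max(axis=1, keepdims=True)    # numerical stabilization
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over l
    return weights @ V
```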

11.4 Objective function

11.4.1 Definitions

We now return to the general form of the problem, with variables z0 , . . . , zm+1 satis-
fying
zi = ψi (zpa(i) ; wi ).
Let T = (x1 , y1 , . . . , xN , yN ) denote the training data.

For regression problems, the objective function minimized by the algorithm is


typically the empirical risk, the simplest choice being the mean square error, which
gives
$$F(W) = \frac{1}{N}\sum_{k=1}^N |y_k - z_{k,m+1}(W)|^2,$$
with $z_{k,m+1}(W) = f(x_k; W)$.

For classification, with the dimension of the output variable equal to the number of classes and the decision based on the largest coordinate, one can take (letting $z_{k,m+1}(i; W)$ denote the $i$th coordinate of $z_{k,m+1}(W)$):
$$F(W) = \frac{1}{N}\sum_{k=1}^N\left(-z_{k,m+1}(y_k; W) + \log\sum_{i=1}^q \exp(z_{k,m+1}(i; W))\right).$$

This objective function is similar to that minimized in logistic regression.



11.4.2 Differential

General computation. The computation of the differential of F with respect to W


may look daunting, but it has actually a simple structure captured by the back-
propagation algorithm. Even if programming this algorithm can often be avoided
by using an automatic differentiation software, it is important to understand how it
works, and why the implementation of gradient-descent algorithms remains feasi-
ble.

Consider the general situation of minimizing a function G(W , z) over W ∈ Rs


and z ∈ Rr , subject to a constraint γ(W , z) = 0 where γ is defined on Rs × Rr and
takes values in Rr (here, it is important that the number of constraints is equal to
the dimension of z). We will denote below by ∂W and ∂z the derivatives of these
functions with respect to the multi-dimensional variables W and z. We make the
assumptions that ∂z γ, which is an r × r matrix, is invertible, and that the constraints
can be solved to express z as a function of W , that we will denote Z (W ).

This allows us to define the function F(W ) = G(W , Z (W )) and we want to com-
pute the gradient of F. (The function F in the previous section satisfies these as-
sumptions, with z = (zkj , k = 1, . . . , N , j = 1, . . . , m + 1) ). Taking h ∈ Rs , we have

dF(W )h = ∂W G(W , Z (W ))h + ∂z G(W , Z (W )) dZ (W )h.

Moreover, since γ(W , Z (W )) = 0 by definition of Z , we have

∂W γ(W , Z (W ))h + ∂z γ(W , Z (W )) dZ (W )h = 0,

so that

dF(W )h = ∂W G(W , Z (W ))h − ∂z G(W , Z (W )) ∂z γ(W , Z (W ))−1 ∂W γ(W , Z (W ))h.

Let p ∈ Rr be the solution of the linear system

∂z γ(W , Z (W ))T p = ∂z G(W , Z (W ))T . (11.2)

Then,
dF(W )h = (∂W G(W , Z (W )) − pT ∂W γ(W , Z (W )))h
or
∇F = ∂W G(W , Z (W ))T − ∂W γ(W , Z (W ))T p. (11.3)

Note that, introducing the “Hamiltonian”

H (p, z, W ) = pT γ(W , z) − G(W , z),



one can summarize the previous computation with the system
$$\begin{cases}\partial_p H = 0\\ \partial_z H = 0\\ \nabla F = -\partial_W H^T.\end{cases}$$

Application: back-propagation. In our case, we are minimizing a function of the


form
$$G(W, z_1, \ldots, z_N) = \frac{1}{N}\sum_{k=1}^N r(y_k, z_{k,m+1})$$

subject to constraints zk,i+1 = ψi (zk,pa(i) ; wi ), i = 0, . . . , m, zk,0 = xk . We focus on one


of the terms in the sum, therefore fixing k, that we will temporarily drop from the
notation.

So, we evaluate the gradient of $G(W, z) = r(y, z_{m+1})$ with $z_{i+1} = \psi_i(z_{\mathrm{pa}(i)}; w_i)$, $i = 0, \ldots, m$, $z_0 = x$. With the notation of the previous paragraph, we take $\gamma = (\gamma_1, \ldots, \gamma_{m+1})$ with
$$\gamma_i(W, z) = \psi_i(z_{\mathrm{pa}(i)}; w_i) - z_i.$$
These constraints uniquely define $z$ as a function of $W$, which was one of our assumptions. For the derivative, we have, for $u = (u_1, \ldots, u_{m+1})\in\mathbb{R}^r$ (with $r = d_1 + \cdots + d_{m+1}$, $u_i\in\mathbb{R}^{d_i}$), and for $i = 1, \ldots, m+1$,
$$\partial_z\gamma_i(W, z)u = \sum_{j\in\mathrm{pa}(i)}\partial_{z_j}\psi_i(z_{\mathrm{pa}(i)}; w_i)u_j - u_i.$$

Taking $p = (p_1, \ldots, p_{m+1})\in\mathbb{R}^r$, we get
$$p^T\partial_z\gamma(W, z)u = \sum_{i=1}^{m+1}\sum_{j\in\mathrm{pa}(i)} p_i^T\partial_{z_j}\psi_i(z_{\mathrm{pa}(i)}; w_i)u_j - \sum_{i=1}^{m+1} p_i^Tu_i$$
$$= \sum_{j=1}^{m+1}\sum_{i\in\mathrm{ch}(j)} p_i^T\partial_{z_j}\psi_i(z_{\mathrm{pa}(i)}; w_i)u_j - \sum_{j=1}^{m+1} p_j^Tu_j.$$
This allows us to identify $\partial_z\gamma(W, z)^Tp$ as the vector $g = (g_1, \ldots, g_{m+1})$ with
$$g_j = \sum_{i\in\mathrm{ch}(j)}\partial_{z_j}\psi_i(z_{\mathrm{pa}(i)}; w_i)^Tp_i - p_j.$$

For $j = m+1$ (which has no children), we get $g_{m+1} = -p_{m+1}$, so that the equation $\partial_z\gamma^Tp = g$ can be solved recursively by taking $p_{m+1} = -g_{m+1}$ and propagating backward, with
$$p_j = -g_j + \sum_{i\in\mathrm{ch}(j)}\partial_{z_j}\psi_i(z_{\mathrm{pa}(i)}; w_i)^Tp_i$$
for $j = m, \ldots, 1$.

From (11.2), we need to apply propagation with $g = \partial_z G$. Since $G$ only depends on $z_{m+1}$, we have $g_{m+1} = \partial_{z_{m+1}}r(y, z_{m+1})$ and $g_j = 0$ for $j = 1, \ldots, m$. The final computation of the gradient is given by (11.3), in which $\partial_W G = 0$ since $G$ does not depend on $W$. For the second term in the r.h.s. of (11.3), we have
$$\partial_W\gamma_i = \partial_{w_i}\psi_i(z_{\mathrm{pa}(i)}, w_i),$$
yielding $\partial_W\gamma^Tp = (\zeta_1, \ldots, \zeta_{m+1})$ with
$$\zeta_j = \partial_{w_j}\psi_j(z_{\mathrm{pa}(j)}, w_j)^Tp_j.$$

We can now formulate an algorithm that computes the gradient of F with respect
to W , reintroducing training data indexes in the notation.

Algorithm 11.1 (Back-propagation)


Let $(x_1, y_1, \ldots, x_N, y_N)$ be the training set and $R_k(z) = r(y_k, z)$, so that
$$F(W) = \frac{1}{N}\sum_{k=1}^N R_k(z_{k,m+1}(W))$$

with zk,m+1 (W ) = f (xk , W ). Let W be a family of weights. The following steps com-
pute ∇F(W ).

1. For all k = 1, . . . , N and all i = 1, . . . , m + 1, compute zk,i (W ) (forward computa-


tion through the network).
2. Initialize variables pk,m+1 = −∇Rk (zk,m+1 (W )), k = 1, . . . , N .
3. For all $k = 1, \ldots, N$ and all $j = m, \ldots, 1$, compute $p_{k,j}$ using the backward iterations
$$p_{k,j} = \sum_{i\in\mathrm{ch}(j)}\partial_{z_j}\psi_i(z_{k,\mathrm{pa}(i)}, w_i)^Tp_{k,i}.$$

4. Let
$$\nabla F(W) = -\frac{1}{N}\sum_{k=1}^N\sum_{i=1}^{m+1} D_i^T\,\partial_{w_i}\psi_i(z_{k,\mathrm{pa}(i)}, w_i)^Tp_{k,i},$$
where $D_i$ is the $s_i\times s$ matrix such that $D_ih = h_i$.
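For a purely linear (chain) graph with dense transitions, the algorithm reduces to the familiar backward pass; the following sketch, for the squared loss, is a minimal illustration (in the book's notation, the costate is $p_i = -\delta_i$ below).

```python
import numpy as np

def backprop_chain(x, y, weights, rho, drho):
    """Back-propagation through z_i = rho(b_i z_{i-1} + beta_i) with loss
    r(y, z) = |y - z|^2; `weights` is a list of (b, beta) pairs, and rho,
    drho act coordinatewise. Returns gradients w.r.t. each (b_i, beta_i)."""
    zs, pre = [x], []
    for b, beta in weights:                    # forward pass (step 1)
        pre.append(b @ zs[-1] + beta)
        zs.append(rho(pre[-1]))
    delta = 2 * (zs[-1] - y)                   # gradient of r at the output
    grads = []
    for (b, beta), u, z_in in zip(weights[::-1], pre[::-1], zs[-2::-1]):
        s = drho(u) * delta                    # diag(rho'(u)) delta
        grads.append((np.outer(s, z_in), s))   # d r/d b_i and d r/d beta_i
        delta = b.T @ s                        # backward propagation (step 3)
    return grads[::-1]

# e.g., rho = lambda u: np.maximum(u, 0); drho = lambda u: (u > 0).astype(float)
```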

11.4.3 Complementary computations

The back-propagation algorithm requires the computation of the gradient of the


costs Rk and of the derivatives of the functions ψi , and this can generally be done in
closed form, with relatively simple expressions.

• If $R_k(z) = |y_k - z|^2$ (which is the typical choice for regression models), then $\nabla R_k(z) = 2(z - y_k)$.
• In classification, with $R_k(z) = -z^{(y_k)} + \log\left(\sum_{i=1}^q \exp(z^{(i)})\right)$, one has
$$\nabla R_k(z) = -u_{y_k} + \frac{\exp(z)}{\sum_{i=1}^q \exp(z^{(i)})},$$
where $u_{y_k}\in\mathbb{R}^q$ is the vector with 1 at position $y_k$ and zero elsewhere, and $\exp(z)$ is the vector with coordinates $\exp(z^{(i)})$, $i = 1, \ldots, q$.
• For dense transition functions of the form $\psi(z; w) = \rho(bz + \beta_0)$ with $w = (\beta_0, b)$, one has $\partial_z\psi(z, w) = \mathrm{diag}(\rho'(\beta_0 + bz))\,b$, so that
$$\partial_z\psi(z, w)^Tp = b^T\mathrm{diag}(\rho'(\beta_0 + bz))\,p.$$
• Similarly,
$$\partial_w\psi(z, w)^Tp = \left[\mathrm{diag}(\rho'(\beta_0 + bz))\,p,\ \mathrm{diag}(\rho'(\beta_0 + bz))\,p\,z^T\right].$$

Note that neural network packages implement these functions (and more) automat-
ically.

11.5 Stochastic Gradient Descent

11.5.1 Mini-batches

Fix $\ell \ll N$. Consider the set $\mathcal{B}_\ell$ of binary sequences $\xi = (\xi^1, \ldots, \xi^N)$ such that $\xi^k\in\{0, 1\}$ and $\sum_{k=1}^N \xi^k = \ell$. Define
$$H(W, \xi) = \nabla_W\left(\frac{1}{\ell}\sum_{k=1}^N \xi^k\,r(y_k, f(x_k, W))\right) = \frac{1}{\ell}\sum_{k=1}^N \xi^k\,\nabla_W r(y_k, f(x_k, W)),$$

where $\xi$ follows the uniform distribution on $\mathcal{B}_\ell$. Consider the stochastic approximation algorithm:
$$W_{n+1} = W_n - \gamma_{n+1}H(W_n, \xi_{n+1}). \qquad (11.4)$$

Because $E(\xi^k) = \ell/N$, we have $E(H(W, \xi)) = \nabla_W E_{\mathcal{T}}(f(\cdot, W))$, and (11.4) provides a stochastic gradient descent algorithm to which the discussion in section 3.3 applies. Such an approach is often referred to as "mini-batch" selection in the deep-learning literature, since it corresponds to sampling $\ell$ examples from the training set without replacement and only computing the gradient of the empirical loss restricted to these examples.
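A bare-bones version of the update (11.4) can be written as follows; `grad_sample` is a hypothetical routine returning the per-example gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(W, grad_sample, N, n_steps=1000, batch=32):
    """Mini-batch SGD: sample `batch` examples without replacement and
    average their gradients, which realizes H(W, xi) in (11.4)."""
    for n in range(n_steps):
        idx = rng.choice(N, size=batch, replace=False)  # xi with l ones
        H = sum(grad_sample(W, k) for k in idx) / batch
        W = W - (0.1 / (1 + n)) * H                     # step sizes gamma_n
    return W
```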

11.5.2 Dropout

Introduced for deep learning in Srivastava et al. [182], “dropout” is a learning para-
digm that brings additional robustness (and, maybe, reduces overfitting risks) to
massively parametrized predictors.

Assume that a random perturbation mechanism of the model parameters has


been designed. We will represent it using a random variable η (interpreted as noise)
and a transformation $W' = \varphi(W, \eta)$ describing how $\eta$ affects a given weight configuration $W$ to form a perturbed one, $W'$. In order to shorten notation, we will write $\varphi(W, \eta) = \eta\cdot W$, borrowing the notation for a group action from group theory. As a typical example, $\eta$ can be chosen as a vector of Bernoulli random variables (therefore taking values in $\{0, 1\}$), with the same dimension as $W$, and one can simply let $\eta\cdot W = \eta\odot W$ be the pointwise multiplication of the two vectors. This corresponds to replacing some of the parameters by zero ("dropping them out") while keeping the others unchanged. One generally preserves the parameters of the final layer ($g_m$), so that the corresponding $\eta$'s are equal to one, and lets the other ones be independent, with some probability $p$ of being one, say, $p = 1/2$.

Returning to the general case, in which η is simply assumed to be a random


variable with known probability distribution, the dropout method replaces the ob-
jective function F(W ) = ET (f (·, W )) by its expectation over perturbed predictors
G(W ) = E(ET (f (·, η · W ))) where the expectation is taken with respect to the random
variable η. While this expectation cannot be computed explicitly, its minimization
can be performed using stochastic gradient descent, with
$$W_{n+1} = W_n - \gamma_{n+1}L(W_n, \eta_{n+1}),$$
where $\eta_1, \eta_2, \ldots$ is a sequence of independent realizations of $\eta$ and
$$L(W, \eta) = \nabla_W\left(E_{\mathcal{T}}(f(\cdot, \eta\cdot W))\right).$$
Then, averaging in $\eta$,
$$\bar L(W) = E\left(\nabla_W F(\eta\cdot W)\right) = \nabla G(W).$$

In the special case where $\eta\cdot W$ is just pointwise multiplication,
$$L(W, \eta) = \eta\odot\nabla F(\eta\odot W).$$

As a consequence, this quantity can be evaluated by using back-propagation to compute $\nabla F(\eta\odot W)$ and multiplying the result by $\eta$ pointwise. Obviously, random weight perturbation can be combined with mini-batch selection in a hybrid stochastic gradient descent algorithm, the specification of which is left to the reader. We also note that stochastic gradient descent in neural networks is often implemented using the ADAM algorithm (section 3.3.3).
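In the pointwise case, one dropout gradient evaluation is just a masked back-propagation; `grad_F` below is a hypothetical callable standing for whatever computes $\nabla F$.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_gradient(W, grad_F, p=0.5):
    """One stochastic gradient sample L(W, eta) = eta * grad F(eta * W):
    draw a Bernoulli mask, evaluate the gradient at the masked weights
    (e.g., by back-propagation), and mask the result."""
    eta = rng.binomial(1, p, size=W.shape).astype(float)
    return eta * grad_F(eta * W)
```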

11.6 Continuous time limit and dynamical systems

11.6.1 Neural ODEs

Equation (11.1) expresses the difference between the input and output of a neural transition as a non-linear function $\psi(z; w)$ of the input. This strongly suggests passing to continuous time and replacing the difference by a derivative, i.e., replacing the neural network by a high-dimensional parametrized dynamical system. The continuous model then takes the form [51]
$$\partial_t z(t) = \psi(z(t); w(t)), \qquad (11.5)$$
where $t$ varies in a fixed interval, say, $[0, T]$. The whole process is parametrized by $W = (w(t),\ t\in[0, T])$. We need to assume existence and uniqueness of solutions of (11.5), which usually restricts the domain of admissibility of the parameters $W$.

Typical neural transition functions are Lipschitz functions whose constant depends on the weight magnitude, i.e., are such that
$$|\psi(z, w) - \psi(z', w)| \le C(w)|z - z'|, \qquad (11.6)$$
where $C$ is a continuous function of $w$. For example, for $\psi(z, w) = \rho(bz + \beta_0)$, $w = (b, \beta_0)$, one can take $C(w) = C_\rho|b|_{\mathrm{op}}$. The Caratheodory theorem [17] implies that solutions are well defined as soon as
$$\int_0^T C(w(t))\,dt < \infty. \qquad (11.7)$$
This is a relatively mild requirement, to which we will return later. Assuming this, we can consider $z(T)$ as a function of the initial value, $z(0) = x$, and of the parameters, writing $z(T) = f(x, W)$.

Given a training set, we consider the problem of minimizing
$$F(W) = \frac{1}{N}\sum_{k=1}^N r(y_k, f(x_k, W)). \qquad (11.8)$$

The discussion in section 11.4.2 applies (formally, at least) to this continuous case, and we can consider the equivalent problem of minimizing
$$G(W, z_1, \ldots, z_N) = \frac{1}{N}\sum_{k=1}^N r(y_k, z_k(T))$$
with $\partial_t z_k(t) = \psi(z_k(t); w(t))$, $z_k(0) = x_k$. Once again, we consider each $k$ separately, which boils down to considering $N = 1$; we drop the index $k$ from the notation, letting $F(W) = r(y, f(x, W))$ and $G(W, z) = r(y, z(T))$.

We define $\gamma(W, z)$ to return the function
$$t\mapsto \gamma(W, z)(t) = \psi(z(t); w(t)) - \partial_t z(t).$$
Let $p : [0, T]\to\mathbb{R}^d$. We want to determine the expression of $u = \partial_z\gamma^Tp$, which satisfies
$$\int_0^T u(t)^T\delta z(t)\,dt = \int_0^T p(t)^T\left(\partial_z\psi(z(t), w(t))\delta z(t) - \partial_t\delta z(t)\right)dt.$$
After an integration by parts, the r.h.s. becomes
$$-p(T)^T\delta z(T) + \int_0^T \partial_t p(t)^T\delta z(t)\,dt + \int_0^T p(t)^T\partial_z\psi(z(t), w(t))\delta z(t)\,dt,$$
which gives
$$u(t) = -p(T)\delta_T + \partial_t p(t) + \partial_z\psi(z(t), w(t))^Tp(t).$$
The equation $\partial_z\gamma^Tp = \partial_zG^T$ therefore gives
$$-p(T)\delta_T + \partial_t p(t) + \partial_z\psi(z(t), w(t))^Tp(t) = \partial_2 r(y, z(T))\delta_T,$$
so that $p$ satisfies $p(T) = -\partial_2 r(y, z(T))$ and
$$\partial_t p(t) = -\partial_z\psi(z(t), w(t))^Tp(t). \qquad (11.9)$$

We have $\partial_W G = 0$, and $v = \partial_W\gamma^Tp$ satisfies
$$\int_0^T v(t)^T\delta w(t)\,dt = \int_0^T p(t)^T\partial_w\psi(z(t), w(t))\delta w(t)\,dt,$$
so that
$$\nabla F(W) = \left(t\mapsto -\partial_w\psi(z(t), w(t))^Tp(t)\right).$$

This informal derivation (more work is needed to justify the existence of various
differentials in appropriate function spaces) provides the continuous-time version
of the back-propagation algorithm, which is also known as the adjoint method in
the optimal control literature [91, 124]. In that context, z represents the state of
the control system, w is the control and p is called the costate, or covector. We
summarize the gradient computation algorithm, reintroducing N training samples.

Algorithm 11.2 (Adjoint method for neural ODE)


Let $(x_1, y_1, \ldots, x_N, y_N)$ be the training set and $R_k(z) = r(y_k, z)$, so that
$$F(W) = \frac{1}{N}\sum_{k=1}^N R_k(z_k(T, W))$$
with $\partial_t z_k = \psi(z_k, w(t))$, $z_k(0) = x_k$. Let $W$ be a family of weights. The following steps compute $\nabla F(W)$.

1. For all k = 1, . . . , N and all t ∈ [0, T ], compute zk (t, W ) (forward computation


through the dynamical system).
2. Initialize variables pk (T ) = −∇Rk (zk (T , W ))/N , k = 1, . . . , N .
3. For all $k = 1, \ldots, N$, compute $p_k(t)$, $t\in[0, T]$, by solving (backwards in time)
$$\partial_t p_k(t) = -\partial_z\psi(z_k(t), w(t))^Tp_k(t).$$

4. Let, for $t\in[0, T]$,
$$\nabla F(W)(t) = -\sum_{k=1}^N \partial_w\psi(z_k(t), w(t))^Tp_k(t).$$

Of course, in numerical applications, the forward and backward dynamical systems need to be discretized in time, resulting in a finite number of computation steps. This can be done explicitly (for example, using basic Euler schemes), or using the ODE solvers [51] available in every numerical software package.
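An explicit Euler discretization of the algorithm, for one training sample and the squared loss, might look as follows; `psi`, `dpsi_dz` and `dpsi_dw` are hypothetical callables returning the transition and its Jacobians.

```python
import numpy as np

def neural_ode_gradient(x, y, psi, dpsi_dz, dpsi_dw, w, T=1.0, steps=100):
    """Euler version of Algorithm 11.2 for one sample: forward integration
    of dz/dt = psi(z, w(t)), backward integration of the costate p, and the
    gradient signal -d_w psi^T p at each time step."""
    dt = T / steps
    zs = [x]
    for n in range(steps):                       # step 1: forward Euler
        zs.append(zs[-1] + dt * psi(zs[-1], w[n]))
    p = -2 * (zs[-1] - y)                        # step 2: p(T) for |y - z|^2
    grads = [None] * steps
    for n in reversed(range(steps)):             # steps 3-4, backwards in time
        grads[n] = -dpsi_dw(zs[n], w[n]).T @ p
        p = p + dt * dpsi_dz(zs[n], w[n]).T @ p  # d_t p = -d_z psi^T p
    return grads
```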

11.6.2 Adding a running cost

Optimal control problems are usually formulated with a “running cost” that penal-
izes the magnitude of the control, which in our case is provided by the function
W : t 7→ w(t). Penalties on network weights are rarely imposed with discrete neural
networks, but, as discussed above, in the continuous setting, some assumptions on
the function W , such as (11.7), are needed to ensure that the problem is well defined.

It is therefore natural to modify the objective function in (11.8) by adding a


penalty term ensuring the finiteness of the integral in (11.7), taking, for example,
for some $\lambda > 0$,
$$F(W) = \lambda\int_0^T C(w(t))^2\,dt + \sum_{k=1}^N r(y_k, f(x_k, W)). \qquad (11.10)$$

The finiteness of the integral of $C(w)^2$ implies, by the Cauchy-Schwarz inequality, the integrability of $C(w)$ itself, and usually leads to simpler computations.

If $C(w)$ is known explicitly and is differentiable, the previous discussion and the back-propagation algorithm can be adapted with minor modifications for the minimization of (11.10). The only difference appears in Step 4 of Algorithm 11.2, with
$$\nabla F(W)(t) = 2\lambda\nabla C(w(t)) - \frac{1}{N}\sum_{k=1}^N \partial_w\psi(z_k(t), w(t))^Tp_k(t).$$

Computationally, one should still ensure that C and its gradient are not too costly to
compute. If ψ(z, w) = ρ(bz + β0 ), w = (b, β0 ), the choice C(w) = Cρ |b|op is valid, but not
computationally friendly. The simpler choice C(w) = Cρ |b|2 is also valid, but cruder
as an upper-bound of the Lipschitz constant. It leads however to straightforward
computations.

The addition of a running cost to the objective is important to ensure that any
potential solution of the problem leads to a solvable ODE. It does not guarantee that
an optimal solution exists, which is a trickier issue in the continuous setting than
in the discrete setting. This is an important theoretical issue, since it is needed, for
example, to ensure that various numerical discretization schemes lead to consistent
approximations of a limit continuous problem. The existence of minimizers is not
known in general for ODE networks. It does hold, however, in the following non-
parametric (i.e., weight-free) context that we now describe.

The function ψ in the r.h.s. of (11.5), is, for any fixed w, a function that maps
z ∈ Rd to a vector ψ(z, w) ∈ Rd . Such functions are called vector fields on Rd , and the
collection ψ(·, w), w ∈ Rs is a parametrized family of vector fields.

The non-parametric approach replaces this family of functions by a general vec-


tor field, v so that the time-indexed parametrized family of vector fields (t 7→ ψ(·, w(t)))
becomes an unconstrained family (t 7→ f (t, ·)). Following the general non-parametric
framework in statistics, one needs to define a suitable function space for the vector
fields, and use a penalty in the objective function.

We will assume that, at each time, f (t, ·) belongs to a reproducing kernel Hilbert
space (RKHS), as introduced in chapter 6. However, because we are considering a
space of vector fields rather than scalar-valued functions, we need to work with matrix-valued kernels [5], for which we give a definition that generalizes definition 6.1 (which corresponds to $q = 1$ below).
Definition 11.2 A function K : Rd × Rd 7→ Mq (R) satisfying

[K1-vec] K is symmetric, namely K(x, y) = K(y, x)T for all x and y in Rd .



[K2-vec] For any $n > 0$, for any choice of vectors $\lambda_1, \ldots, \lambda_n\in\mathbb{R}^q$ and any $x_1, \ldots, x_n\in\mathbb{R}^d$, one has
$$\sum_{i,j=1}^n \lambda_i^TK(x_i, x_j)\lambda_j \ge 0. \qquad (11.11)$$

is called a positive (matrix-valued) kernel.

One says that the kernel is positive definite if the sum in (11.11) cannot vanish unless (i) $\lambda_1 = \cdots = \lambda_n = 0$ or (ii) $x_i = x_j$ for some $i\neq j$.

If κ is a “scalar kernel” (satisfying definition 6.1), then K(x, y) = κ(x, y)IdRq is a


matrix-valued kernel.

A reproducing kernel Hilbert space of vector-valued functions is a Hilbert space


H of functions from Rd to Rq such that there exists a reproducing kernel K : Rd ×
Rd 7→ Mq (R) with the following properties

[RKHS1] For all x ∈ Rd and λ ∈ Rq , K(·, x)λ belongs to H,


[RKHS2] For all h ∈ H, x ∈ Rd and λ ∈ Rq ,

hh , K(·, x)λiH = λT h(x) .

Proposition 6.6 remains valid for vector-valued RKHSs, with the following modifications: $\lambda_1, \ldots, \lambda_N$ and $\alpha_1, \ldots, \alpha_N$ are $q$-dimensional vectors and the matrix $K(x_1, \ldots, x_N)$ is now an $Nq\times Nq$ block matrix, with $q\times q$ blocks given by $K(x_k, x_l)$, $k, l = 1, \ldots, N$.

Returning to the specification of the nonparametric control problem, we will as-


sume that a vector-valued RKHS, H, has been chosen, with q = d in definition 11.2.
We further assume that elements of H are Lipschitz continuous, with

$$|v(z) - v(\tilde z)| \le C\|v\|_H\,|z - \tilde z| \qquad (11.12)$$

for some constant $C$ and all $v\in H$. We note that, for every $\lambda\in\mathbb{R}^d$,
$$|\lambda^T(v(z) - v(\tilde z))|^2 = |\langle v,\ K(\cdot, z)\lambda - K(\cdot, \tilde z)\lambda\rangle_H|^2 \le \|v\|_H^2\,\|K(\cdot, z)\lambda - K(\cdot, \tilde z)\lambda\|_H^2$$
$$= \|v\|_H^2\left(\lambda^TK(z, z)\lambda - 2\lambda^TK(z, \tilde z)\lambda + \lambda^TK(\tilde z, \tilde z)\lambda\right) \le |\lambda|^2\,\|v\|_H^2\,|K(z, z) - 2K(z, \tilde z) + K(\tilde z, \tilde z)|.$$

This shows that (11.12) can be derived from regularity properties of the kernel,
namely, that
|K(z, z) − 2K(z, z̃) + K(z̃, z̃)| ≤ C|z − z̃|2

for some constant C and all z, z̃ ∈ Rd . This property is satisfied by most of the kernels
that are used in practice.

Let $\eta : t\mapsto\eta(t)$ be a function from $[0, T]$ to $H$. This means that, for each $t$, $\eta(t)$ is a vector field $x\mapsto\eta(t)(x)$ on $\mathbb{R}^d$, and we will write indifferently $\eta(t)$ and $\eta(t, \cdot)$, with a preference for $\eta(t, x)$ rather than $\eta(t)(x)$. We consider the objective function
$$\bar F(\eta) = \lambda\int_0^T \|\eta(t)\|_H^2\,dt + \frac{1}{N}\sum_{k=1}^N r(y_k, z_k(T)), \qquad (11.13)$$

with ∂t zk (t) = η(t, zk (t)), zk (0) = xk . To compare with (11.10), the finite-dimensional
w ∈ Rs is now replaced with an infinite-dimensional parameter, η, and the transition
ψ(z, w) becomes η(z).

Using the vector version of proposition 6.6 (or the kernel trick used several times in chapters 7 and 8), one sees that there is no loss of generality in replacing $\eta(t)$ by its projection onto the vector space
$$V(t) = \left\{\sum_{l=1}^N K(\cdot, z_l(t))w_l\ :\ w_1, \ldots, w_N\in\mathbb{R}^d\right\}.$$

Note that, if $\eta(t)$ takes the form
$$\eta(t) = \sum_{l=1}^N K(\cdot, z_l(t))w_l(t),$$
then
$$\|\eta(t)\|_H^2 = \sum_{k,l=1}^N w_k(t)^TK(z_k(t), z_l(t))w_l(t).$$
This allows us to replace the infinite-dimensional parameter $\eta$ by a family $W = (w(t),\ t\in[0, T])$ with $w(t) = (w_k(t),\ k = 1, \ldots, N)$. The minimization of $\bar F$ in (11.13) can be replaced by that of
$$F(W) = \lambda\int_0^T\sum_{k,l=1}^N w_k(t)^TK(z_k(t), z_l(t))w_l(t)\,dt + \frac{1}{N}\sum_{k=1}^N r(y_k, z_k(T)), \qquad (11.14)$$

with
$$\partial_t z_k(t) = \sum_{l=1}^N K(z_k(t), z_l(t))w_l(t).$$

This optimal control problem has a form similar to that considered in (11.10), where the running cost $C(w)^2$ is replaced by a cost that depends on the control (still denoted $w$) and the state $z$. The discussion in section 11.6.1 can be applied with some modifications. Let $K(z)$ be the $dN\times dN$ matrix formed with the $d\times d$ blocks $K(z_k(t), z_l(t))$, and $w(t)$ the $dN$-dimensional vector formed by stacking $w_1, \ldots, w_N$. Let
$$G(W, z) = \lambda\int_0^T w(t)^TK(z(t))w(t)\,dt + \frac{1}{N}\sum_{k=1}^N r(y_k, z_k(T))$$
and
$$\gamma(W, z)(t) = K(z(t))w(t) - \partial_t z(t).$$
The backward ODE in Step 3 of Algorithm 11.2 now becomes
$$\partial_t p_k(t) = -\partial_{z_k}\left(w(t)^TK(z(t))p(t)\right) + \lambda\,\partial_{z_k}\left(w(t)^TK(z(t))w(t)\right)$$
for $k = 1, \ldots, N$. Step 4 becomes, for $t\in[0, T]$,
$$\nabla F(W)(t) = K(z(t))(2\lambda w(t) - p(t)).$$

The resulting algorithm was introduced in [212]. It has the interesting property (shared with neural ODE models with smooth controlled transitions) of determining an implicit diffeomorphic transformation of the space: the function $x\mapsto f(x; W, z) = \tilde z(T)$, which returns the solution at time $T$ of the ODE
$$\partial_t\tilde z(t) = \sum_{l=1}^N K(\tilde z(t), z_l(t))w_l(t)$$
(or $\partial_t\tilde z(t) = \psi(\tilde z(t); w(t))$ for neural ODEs), is smooth and invertible, with a smooth inverse.
Chapter 12

Comparing probability distributions

When discussing, in the next chapters, generative machine learning methods to


learn probability distributions, we will need to evaluate the difference between two
such distributions, for example, to optimize an algorithm to return a learned dis-
tribution close to an observed one. We regroup in this chapter some approaches
that are used in machine learning for this purpose. In the following, R is taken, as
always, as a metric space equipped with its Borel σ -algebra.

12.1 Total variation distance

Definition 12.1 Let $P$ and $Q$ be two probability distributions on $R$. Their total variation distance is defined by
$$D_{\mathrm{var}}(P, Q) = \sup_A\,(P(A) - Q(A)), \qquad (12.1)$$
where the supremum is taken over all measurable sets $A$.

We have the following lemma.

Lemma 12.2 There exists a measurable set $A_0$ such that, for all $B$, $P(B\cap A_0) \ge Q(B\cap A_0)$ and $P(B\cap A_0^c) \le Q(B\cap A_0^c)$. Moreover, the supremum in the r.h.s. of (12.1) is achieved at $A = A_0$.

Proof If $R$ is a finite set, it suffices to let $A_0 = \{x\in R : P(x)\ge Q(x)\}$. If both $P$ and $Q$ have p.d.f.'s $\psi_1$ and $\psi_2$ with respect to Lebesgue's measure (with $R = \mathbb{R}^d$), then one can take $A_0 = \{x\in R : \psi_1(x)\ge\psi_2(x)\}$. In the general case, take $\mu = P + Q$, so that $P, Q\ll\mu$, and let $\psi_1 = dP/d\mu$, $\psi_2 = dQ/d\mu$ and $A_0 = \{x\in R : \psi_1(x)\ge\psi_2(x)\}$. (This fact is also a special case of the Hahn-Jordan decomposition of signed measures [66].)


Now, it is clear that, for any measurable A ⊂ R,

P(A) − Q(A) = P(A ∩ A_0) − Q(A ∩ A_0) + P(A ∩ A_0^c) − Q(A ∩ A_0^c)
  ≤ P(A ∩ A_0) − Q(A ∩ A_0)
  ≤ P(A ∩ A_0) − Q(A ∩ A_0) + P(A^c ∩ A_0) − Q(A^c ∩ A_0)
  = P(A_0) − Q(A_0),

showing that

D_var(P, Q) = P(A_0) − Q(A_0). □

The following proposition lists additional properties.

Proposition 12.3 (i) If P, Q have densities ψ_1, ψ_2 with respect to some positive measure
μ (such as P + Q), then

D_var(P, Q) = (1/2) ∫_R |ψ_1(x) − ψ_2(x)| μ(dx).

In particular, if R is finite,

D_var(P, Q) = (1/2) ∑_{x∈R} |P(x) − Q(x)|.

(ii) For general R,

D_var(P, Q) = sup_f ( ∫_R f(x) P(dx) − ∫_R f(x) Q(dx) ),     (12.2)

where the supremum is taken over all measurable functions f taking values in [0, 1].

(iii) If f : R → R is bounded, define the maximal oscillation of f by

osc(f) = sup{f(x) − f(y) : x, y ∈ R}.

Then

D_var(P, Q) = sup{ ∫_R f(x) P(dx) − ∫_R f(x) Q(dx) : osc(f) ≤ 1 }.

(iv) Conversely, for any bounded measurable f : R → R,

osc(f) = sup{ ∫_R f(x) P(dx) − ∫_R f(x) Q(dx) : D_var(P, Q) ≤ 1 },

where the supremum is now taken over pairs of probability distributions P, Q.

Proof If one takes A_0 = {x ∈ R : ψ_1(x) ≥ ψ_2(x)}, then

D_var(P, Q) = ∫_{A_0} (ψ_1(x) − ψ_2(x)) μ(dx) = ∫_{A_0} |ψ_1(x) − ψ_2(x)| μ(dx).

But, because both P and Q are probability measures,

∫_R (ψ_1(x) − ψ_2(x)) μ(dx) = 0,

so that

∫_{A_0^c} (ψ_1(x) − ψ_2(x)) μ(dx) = − ∫_{A_0} (ψ_1(x) − ψ_2(x)) μ(dx).

However, the l.h.s. is also equal to

− ∫_{A_0^c} |ψ_1(x) − ψ_2(x)| μ(dx),

so that

∫_R |ψ_1(x) − ψ_2(x)| μ(dx) = 2 ∫_{A_0} (ψ_1(x) − ψ_2(x)) μ(dx) = 2 D_var(P, Q),

which proves (i).

To prove (ii), first notice that, for all A,

P(A) − Q(A) = ∫_R f(x) P(dx) − ∫_R f(x) Q(dx)

for f = 1_A, so that

D_var(P, Q) ≤ sup_f ( ∫_R f(x) P(dx) − ∫_R f(x) Q(dx) ).

Conversely, using A_0 as above, and taking f with values in [0, 1],

∫_R f(x) P(dx) − ∫_R f(x) Q(dx) = ∫_{A_0} f(x) (P − Q)(dx) + ∫_{A_0^c} f(x) (P − Q)(dx)
  ≤ ∫_{A_0} f(x) (P − Q)(dx)
  ≤ ∫_{A_0} (P − Q)(dx) = D_var(P, Q).

This shows (ii). For (iii), one can note that, if f takes values in [0, 1], then osc(f) ≤ 1,
so that, using (ii),

D_var(P, Q) ≤ sup{ ∫_R f(x) P(dx) − ∫_R f(x) Q(dx) : osc(f) ≤ 1 },

since the maximization on the r.h.s. is done on a larger set than in (ii).

Conversely, take f such that osc(f) ≤ 1, ε > 0 and y such that f(y) ≤ inf f + ε. Let
f_ε(x) = (f(x) − f(y) + ε)/(1 + ε), which takes values in [0, 1]. Then

D_var(P, Q) ≥ ∫_R f_ε(x) P(dx) − ∫_R f_ε(x) Q(dx)
  = (1/(1 + ε)) ( ∫_R f(x) P(dx) − ∫_R f(x) Q(dx) ),

and since this is true for all ε > 0, we get

∫_R f(x) P(dx) − ∫_R f(x) Q(dx) ≤ D_var(P, Q),

which completes the proof of (iii).

Using (iii), we find, for any P, Q and any bounded f,

∫_R f(x) P(dx) − ∫_R f(x) Q(dx) ≤ D_var(P, Q) osc(f),

which shows that

sup{ ∫_R f(x) P(dx) − ∫_R f(x) Q(dx) : D_var(P, Q) ≤ 1 } ≤ osc(f).

However, taking P = δ_x and Q = δ_y, so that D_var(P, Q) = 0 if x = y and 1 otherwise,
we get

f(x) − f(y) = ∫_R f(x′) P(dx′) − ∫_R f(x′) Q(dx′)
  ≤ sup{ ∫_R f(x′) P(dx′) − ∫_R f(x′) Q(dx′) : D_var(P, Q) ≤ 1 },

which yields (iv) after taking the supremum with respect to x and y. □

Remark 12.4 Statements (ii)–(iv) in proposition 12.3 still hold when the suprema
are taken over continuous functions. □
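As a small illustration of statement (i) (a minimal sketch, assuming two discrete distributions given as probability vectors over the same finite set), the half-L¹ formula can be computed directly in Python:

import numpy as np

def total_variation(p, q):
    # statement (i) with mu the counting measure: (1/2) sum_x |P(x) - Q(x)|
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

print(total_variation([0.5, 0.3, 0.2], [0.25, 0.25, 0.5]))  # prints 0.3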

12.2 Divergences

We encountered a first definition of divergence in chapter 4 with the Kullback-Leibler
(KL) divergence introduced in equation (4.3). It was shown, in proposition 4.1,
to provide a non-negative quantification of the discrepancy between two
probability measures P and Q, vanishing only when P = Q. Equation (4.3) can be
extended to the notion of ϕ-divergence as follows.

Definition 12.5 Let ϕ be a non-negative convex function on (0, +∞) such that ϕ(1) = 0
and ϕ(t) > 0 for t ≠ 1. Let P and Q be two probability distributions on some space R, and
μ a measure on R such that P ≪ μ and Q ≪ μ. Letting f = dP/dμ and g = dQ/dμ, the
ϕ-divergence between P and Q is defined by

D_ϕ(P ‖ Q) = ∫_R g ϕ(f/g) dμ,     (12.3)

with the convention

ϕ(0) = lim_{t→0} ϕ(t) and 0 ϕ(f/0) = f lim_{t→0} ϕ*(t),

where

ϕ*(t) = t ϕ(1/t).

Note that the limits above may be infinite.

This definition does not depend on the choice of μ such that P, Q ≪ μ. Indeed, if P, Q ≪ μ, then
P, Q ≪ P + Q ≪ μ and, letting f̃ = dP/d(P + Q), g̃ = dQ/d(P + Q) and h = f + g, one
checks that f = f̃h, g = g̃h and

∫_R g̃ ϕ(f̃/g̃) d(P + Q) = ∫_R g ϕ(f/g) dμ,

with the l.h.s. independent of μ. It is also clear that D_ϕ(P ‖ Q) ≥ 0 and vanishes if
and only if P = Q. Note that the KL divergence is D_ϕ for ϕ(t) = t log t + 1 − t.

This divergence is, in general, not symmetric, nor does it satisfy the triangle
inequality. There are, however, sufficient conditions that ensure that these properties
hold [56, 101, 194]. Symmetry is captured by the "conjugate" function ϕ* in
definition 12.5. It is, like ϕ, non-negative and convex on (0, +∞) and vanishes only
at t = 1. The only part of this statement that is not obvious is that ϕ* is convex, but
we have, for λ ∈ (0, 1), s, t > 0,

ϕ*((1−λ)s + λt) = ((1−λ)s + λt) ϕ( 1/((1−λ)s + λt) )
  = ((1−λ)s + λt) ϕ( ((1−λ)s/((1−λ)s+λt)) (1/s) + (λt/((1−λ)s+λt)) (1/t) )
  ≤ ((1−λ)s + λt) [ ((1−λ)s/((1−λ)s+λt)) ϕ(1/s) + (λt/((1−λ)s+λt)) ϕ(1/t) ]
  = (1−λ) ϕ*(s) + λ ϕ*(t).

Clearly D_{ϕ*}(P ‖ Q) = D_ϕ(Q ‖ P), so that a simple sufficient condition for symmetry is
that ϕ* = ϕ. This can always be ensured by replacing ϕ by ϕ̃ = (ϕ + ϕ*)/2, yielding

D_ϕ̃(P ‖ Q) = (1/2)(D_ϕ(P ‖ Q) + D_ϕ(Q ‖ P)).

(The symmetrized KL divergence is called the Jeffreys divergence.)

The validity of the triangle inequality is more challenging to ensure. In Kafka
et al. [101], it is proved that, if ϕ* = ϕ and the function

h(t) = |t^α − 1|^{1/α} / ϕ(t)

is well defined on (0, +∞), non-increasing on (0, 1) and continuous at t = 1, then
(P, Q) ↦ D_ϕ(P, Q)^α is symmetric and satisfies the triangle inequality.

One can, for example, take ϕ(t) = |t^α − 1|^{1/α}, which yields, with the notation of
definition 12.5,

D_ϕ(P, Q) = ∫_R |f^α − g^α|^{1/α} dμ.

The case α = 1 gives twice the total variation distance. The case α = 1/2 provides
the Hellinger distance between P and Q:

D_Hellinger(P, Q) = ( ∫_R |√f − √g|² dμ )^{1/2}.

In [194], it is proved that this condition is satisfied for the family

ϕ_β(t) = (sign(β)/(1 − β)) ( (t^{1/β} + 1)^β − 2^{β−1}(t + 1) ),

with α = 1/2 for β < 2 and α = 1/β for β ≥ 2. The limit cases

ϕ_0(t) = (1/2)|t − 1|,

which provides the total variation distance, and

ϕ_1(t) = t log t − (t + 1) log( (t + 1)/2 ),

are also included. Also, for β = 2, one retrieves the Hellinger distance.

One can also check that, for β = 1, one has

D_{ϕ_1}(P, Q) = KL(P ‖ (P + Q)/2) + KL(Q ‖ (P + Q)/2).

The r.h.s. is called the Jensen-Shannon divergence between P and Q.
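To make these quantities concrete, here is a minimal Python sketch (assuming discrete distributions given as probability vectors, a hypothetical setting chosen purely for illustration) computing the KL divergence, the Jensen-Shannon divergence, and the Hellinger distance:

import numpy as np

def kl(p, q):
    # KL(p || q) = sum_x p(x) log(p(x)/q(x)), with the convention 0 log 0 = 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jensen_shannon(p, q):
    # KL(p || (p+q)/2) + KL(q || (p+q)/2), i.e., the phi_1 divergence above
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return kl(p, m) + kl(q, m)

def hellinger(p, q):
    # (sum_x (sqrt(p(x)) - sqrt(q(x)))^2)^(1/2)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

p, q = [0.5, 0.3, 0.2], [0.25, 0.25, 0.5]
print(kl(p, q), jensen_shannon(p, q), hellinger(p, q))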



12.3 Monge-Kantorovich distance

The Monge-Kantorovich distance, also called the Wasserstein distance, and sometimes
the "earth-mover's distance," associates a transportation cost, say ρ(x, y), with moving a unit of mass from
x to y, and evaluates the minimum total cost needed to transform the distribution P
into Q. Its mathematical definition is

D_MK(P, Q) = inf_{π ∈ M(P,Q)} ∫_{R×R} ρ(x, y) π(dx, dy),     (12.4)

where M(P, Q) is the set of all joint distributions on R × R whose first marginal is P
and second marginal is Q. For example,

D_MK(δ_{x_1}, δ_{x_2}) = ρ(x_1, x_2).

The interpretation is that π(dx, dy) is the infinitesimal mass moved between the
infinitesimal neighborhoods x + dx and y + dy. The constraint π ∈ M(P, Q) indicates
that π displaces the mass distribution P to the mass distribution Q.

If we assume that, for some α ≥ 1, σ = ρ^{1/α} is a distance on R, then D_MK^{1/α} is
a distance on the space of probability measures on R (equipped with the Borel σ-algebra
associated with σ). For this fact, and the results that follow, the reader can refer
to Villani et al. [203], Dudley [66].

When α = 1 (so that ρ is a distance), one has, furthermore, the following
theorem. Call a function f : R → R ρ-contractive if, for all x, y ∈ R, one has

|f(x) − f(y)| ≤ ρ(x, y).

Define

D_ρ^*(P, Q) = max{ ∫_R f dP − ∫_R f dQ : f ρ-contractive }.

Then one has:

Theorem 12.6 (Kantorovich-Rubinstein) One has D_MK(P, Q) = D_ρ^*(P, Q).

See the reference above for a proof. Further generalizations of this theorem, in particular
for the case in which ρ is not a distance, can be found in [203],
chapter 5.
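When P and Q are supported on finitely many points, (12.4) is a finite linear program over the transport plan π, which can be solved directly. The sketch below (a minimal illustration with hypothetical weighted point masses, using scipy's linear-programming routine) makes this explicit:

import numpy as np
from scipy.optimize import linprog

def monge_kantorovich(p, q, cost):
    # p: weights of P (length m), q: weights of Q (length n),
    # cost[i, j] = rho(x_i, y_j); solves (12.4) as a linear program over pi.
    m, n = cost.shape
    c = cost.ravel()
    # marginal constraints: sum_j pi_ij = p_i and sum_i pi_ij = q_j
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_eq[m + j, j::n] = 1.0
    b_eq = np.concatenate([p, q])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun

p = np.array([0.5, 0.5])
q = np.array([0.25, 0.75])
cost = np.array([[0.0, 1.0], [1.0, 0.0]])  # rho = |x - y| on the two-point set {0, 1}
print(monge_kantorovich(p, q, cost))  # 0.25: move 0.25 of mass from 0 to 1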
Chapter 13

Monte-Carlo Sampling

The goal of this chapter is to describe how, from a basic random number generator
that provides samples from a uniform distribution on [0, 1], one can generate samples
that follow, or approximately follow, complex probability distributions on finite
or general spaces. This, combined with the law of large numbers, makes it possible to
approximate probabilities or expectations by empirical averages over a large collection of
generated samples.

We assume that as many independent samples of the uniform distribution as needed
are available, which is only an approximation of the truth. In practice,
computer programs are only able to generate pseudo-random numbers, which are
highly chaotic recursive sequences, but still deterministic. Also, these numbers are
generated as integers, which only provide, after normalization, a distribution on a
finite discretization of the unit interval. We will neglect these facts, however, and
work as if the output of the function random (or any similar name) in a computer
program were a true realization of the uniform distribution.

13.1 General sampling procedures

Real-valued variables. We will use the following notation for the left limit of a
function F at a given point z:

F(z−) = lim_{y→z, y<z} F(y),

assuming, of course, that this limit exists (which is always true, for example, when F
is non-decreasing). Recall that F is left continuous if and only if F = F(·−). Moreover,
it is easy to see that F(·−) is left-continuous1. Note also that, if F is non-decreasing,
1 For every z and every ε > 0, there exists z′ < z such that for all z″ ∈ [z′, z), |F(z″) − F(z−)| < ε.
Moreover, taking any y ∈ (z′, z), there exists y′ < y such that for all y″ ∈ [y′, y), |F(y″) − F(y−)| < ε.
Without loss of generality, we can assume that y′ ≥ z′, yielding |F(z−) − F(y−)| ≤ 2ε, showing the left
continuity of F(·−) since this is true for all y ∈ (z′, z).


one always has F(z) ≤ F(y−) whenever z < y. The following proposition provides a
basic mechanism for Monte-Carlo sampling.

Proposition 13.1 Let Z be a real-valued random variable with c.d.f. F_Z. For u ∈ [0, 1],
define

F_Z^−(u) = max{z : F_Z(z−) ≤ u}.

Let U be uniformly distributed over [0, 1]. Then F_Z^−(U) has the same distribution as Z.

Proof Let A_z = {u ∈ [0, 1] : F_Z^−(u) ≤ z}. We show that

[0, F_Z(z)) ⊂ A_z ⊂ [0, F_Z(z)].     (13.1)

Showing this will prove the proposition, since one has P(U < F_Z(z)) = P(U ≤ F_Z(z)) =
F_Z(z), showing that

P(F_Z^−(U) ≤ z) = P(U ∈ A_z) = F_Z(z).

To prove (13.1), first assume that u < F_Z(z). Take any z′ such that F_Z(z′−) ≤ u.
Then, necessarily, z′ ≤ z, since z′ > z would imply that F_Z(z) ≤ F_Z(z′−) ≤ u, contradicting
u < F_Z(z). This shows that max{z′ : F_Z(z′−) ≤ u} ≤ z, i.e., u ∈ A_z.

Now, take u > F_Z(z). Because c.d.f.'s are right continuous, there exists y > z such
that u > F_Z(y), which implies that F_Z^−(u) ≥ y and u ∉ A_z. □

This proposition shows that one can generate random samples of a real-valued
random variable Z as soon as one can compute FZ− and generate uniformly dis-
tributed variables. Note that, if FZ is strictly increasing, then FZ− = FZ−1 , the usual
function inverse.

The proposition also shows how to sample from random variables taking values
in finite sets. Indeed, if Z takes values in Ω̃_Z = {z_1, …, z_n} with p_i = P(Z = z_i), sampling
from Z is equivalent to sampling from the integer-valued random variable Z̃
with P(Z̃ = i) = p_i. For this variable, F^−(u) is the largest i such that p_1 + ⋯ + p_{i−1} ≤ u
(this sum being zero if i = 1), which provides the standard sampling scheme for
discrete probability distributions.
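As an illustration, here is a minimal Python sketch of both mechanisms: the continuous case for the exponential distribution (a hypothetical example where F_Z^− is available in closed form) and the discrete case via cumulative sums:

import numpy as np

rng = np.random.default_rng(0)

# Continuous case: exponential with rate lam, F(z) = 1 - exp(-lam z),
# so the inverse is F^-(u) = -log(1 - u)/lam.
def sample_exponential(lam, size):
    u = rng.uniform(size=size)
    return -np.log(1.0 - u) / lam

# Discrete case: return the largest i with p_1 + ... + p_{i-1} <= u
# (0-based indices here), via a binary search in the cumulative sums.
def sample_discrete(p, size):
    cdf = np.cumsum(p)
    u = rng.uniform(size=size)
    return np.searchsorted(cdf, u, side='right')

print(sample_exponential(2.0, 3))
print(sample_discrete([0.2, 0.5, 0.3], 10))  # indices in {0, 1, 2}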

13.2 Rejection sampling

While the previous approach can be generalized to multivariate distributions, it
quickly becomes unfeasible when the dimension gets large, except in simple cases
in which the variables are independent, or known functions of independent variables

(which includes the Gaussian case). Rejection sampling is a simple algorithm that
allows, in some cases, for the generation of samples from a complicated distribution
based on repeated sampling of a simpler one.

Let us assume that we want to sample from a variable Z taking values in R_Z,
and that there exists a measure μ on R_Z with respect to which the distribution of Z
is absolutely continuous, i.e., so that this distribution has a density f_Z with respect
to μ. For example, R_Z = R^d, and f_Z is the p.d.f. of Z with respect to Lebesgue's
measure. Assume that g is another density function (with respect to μ) from which
it is "easy" to sample. Consider the following algorithm, which includes a function
a : z ↦ a(z) ∈ [0, 1] that will be specified later.

Algorithm 13.1 (Rejection sampling with acceptance function a and base p.d.f. g)
(1) Sample a realization z of a random variable with p.d.f. g.
(2) Generate b ∈ {0, 1} with P(b = 1) = a(z).
(3) If b = 1, return Z = z and exit.
(4) Otherwise, return to step 1.

The probability of exiting at step 3 is ρ = ∫_{R^d} g(z) a(z) μ(dz). So, the algorithm
simulates a random variable with p.d.f.

f̃(z) = g(z) a(z) (1 + (1 − ρ) + (1 − ρ)² + ⋯) = g(z) a(z)/ρ.

As a consequence, in order to simulate f_Z, one must choose a so that f_Z(z) is proportional
to g(z) a(z), which (assuming that g(z) > 0 whenever f_Z(z) > 0) requires
that a(z) be proportional to f_Z(z)/g(z). Since a(z) must take values in [0, 1], but should
otherwise be chosen as large as possible to ensure that fewer iterations are needed,
one should take

a(z) = f_Z(z)/(c g(z)),

where c = max{f_Z(z)/g(z) : z ∈ R^d}, which must therefore be finite. This fully specifies
a rejection sampling algorithm for f_Z. Note that g is free to choose (with the restriction
that f_Z(z)/g(z) must be bounded), and should be selected so that sampling from
it is easy, and the coefficient c above is not too large.
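As an illustration, a minimal Python sketch of Algorithm 13.1 (for a hypothetical target, the Beta(2, 2) density f_Z(z) = 6z(1 − z) on [0, 1], with g the uniform density, so that c = max f_Z/g = 1.5) could be:

import numpy as np

rng = np.random.default_rng(0)

def rejection_sample(f, g_sample, g_pdf, c):
    # Algorithm 13.1 with acceptance function a(z) = f(z) / (c g(z))
    while True:
        z = g_sample()                     # step 1: sample from g
        if rng.uniform() < f(z) / (c * g_pdf(z)):
            return z                       # step 3: accept and exit
        # otherwise loop back to step 1

f = lambda z: 6.0 * z * (1.0 - z)          # Beta(2, 2) density
samples = [rejection_sample(f, rng.uniform, lambda z: 1.0, c=1.5)
           for _ in range(5)]
print(samples)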

13.3 Markov chain sampling

When dealing with high-dimensional distributions, the constant c in the previous
procedure is typically extremely large, and the rejection-sampling algorithm becomes
unfeasible, because it keeps rejecting samples for very long times. In such
cases, one can use alternative simulation methods that iteratively update the variable
Z by making small changes at each step, resulting in a procedure that asymptotically
converges to a sample of the target distribution. Such sampling schemes are
usually described as Markov chains, leading to the name Markov chain Monte Carlo
(or MCMC) sampling. We therefore start our discussion with some basic results on
the theory of Markov chains.

13.3.1 Definitions

We adopt a measure-theoretic formalism in this discussion and refer to section 1.4.6
for a short introduction to the terms used here. Assume that we want to sample from
a random variable that takes values in some (measurable) set B = R_X.2

Markov chains generate sequences X_0, X_1, …, and can be seen as a stochastic
analogue of a recursive sequence X_{n+1} = Φ(X_n). Such sequences are fully defined
by the function Φ : B → B and the initial value X_0 ∈ B. For Markov chains, we
will need the probability distribution of X_0, that we will generally denote P^0, with
P^0(x) = P(X_0 = x), and stochastic transition rules between X_n and X_{n+1}. These transitions
are given by the conditional probabilities

P^{n,n+1}(x, A) = P(X_{n+1} ∈ A | X_n = x),

where A ⊂ B is measurable. The left-hand side of this equation, P^{n,n+1}, is called a
transition probability, according to the following definition.
Definition 13.2 Let F1 and F2 be two sets equipped with σ -algebras A1 and A2 . A
transition probability from F1 to F2 is a function p : F1 × A2 → [0, 1] such that, for all
x ∈ F1 , the function A 7→ p(x, A) is a probability on F2 and for all A ∈ A2 , the function
x 7→ p(x, A), x ∈ F1 , is measurable.

When F2 is discrete, the probabilities are fully specified by their values on singleton
sets, and we will write p(x, y) for p(x, {y}).

When P^{n,n+1}(x, ·) does not depend on n, the Markov chain is said to be homogeneous.
To simplify notation, we will restrict ourselves to homogeneous chains (and therefore
only write P(x, A)), although some of the chains used in MCMC sampling may be
inhomogeneous. This is not a very strong loss of generality, however, because inhomogeneous
Markov chains can be considered as homogeneous by extending the
space B on which they are defined to B × N, and defining the transition probability

p̃((x, n), A × {r}) = 1_{r=n+1} P^{n,n+1}(x, A).
2 We will assume in this chapter that B is a complete metric space with a dense countable subset,
with the associated Borel σ -algebra.

An important special case is when B is countable, in which case one only needs
to specify transition probabilities for singletons A = {y}, and we will write

p(x, y) = P (x, {y}) = P(Xn+1 = y | Xn = x)

for the p.m.f. associated with P (x, ·).

Another simple situation is when B = Rd and each P (x, ·) has a p.d.f. that we will
also denote as p(x, ·). In this latter case, assuming that P 0 also has a p.d.f. that we
will denote by µ0 , the joint p.d.f. of (X0 , . . . , Xn ) on (Rd )n+1 is given by

f (x0 , x1 , . . . , xn ) = µ0 (x0 )p(x0 , x1 ) · · · p(xn−1 , xn ). (13.2)

The same expression holds for the joint p.m.f. in the discrete case.

In the general case (invoking measure theory), the joint distribution is also deter-
mined by the transition probabilities, and we leave the derivation of the expression
to the reader. An important point is that, in both special cases considered above,
and under some very mild assumptions in the general case, these transition proba-
bilities also uniquely define the joint distribution of the infinite process (X0 , X1 , . . .)
on B ∞ , which gives theoretical support to the consideration of asymptotic properties
of Markov chains.

In this discussion, we are interested in conditions ensuring that the chain asymp-
totically samples from a target probability distribution Q, i.e., that P(Xn ∈ A) con-
verges to Q(A) (one says that Xn converges in distribution to Q). In practice, Q is
given or modeled, and the goal is to determine the transition probabilities. Note
that the marginal distribution of Xn is computed by integrating (or summing) (13.2)
with respect to x0 , . . . , xn−1 . This is generally computationally challenging.

Given a transition probability P on B, we will use the notation, for a measurable
function f : B → R,

P f(x) = ∫_B f(y) P(x, dy).

If Q is a probability distribution on B, it will also be convenient to write

Qf = ∫_B f(y) Q(dy).

13.3.2 Convergence

We will denote by P_x(·) the conditional distribution P(· | X_0 = x) and P^n(x, A) = P_x(X_n ∈
A), which is a probability distribution on B. The goal of Markov chain Monte Carlo
sampling is to design the transition probabilities such that P^n(x, A) converges to Q(A)
when n tends to infinity. One furthermore wants to complete this convergence with
a law of large numbers, ensuring that

(1/n) ∑_{k=1}^n f(X_k) → ∫_B f(x) Q(dx)

when n → ∞, where X_n is the generated Markov chain and f is Q-integrable.

Let Dvar be the total variation distance between probability measures introduced
in section 12.1. We will say that the Markov chain with transition P asymptotically
samples from Q if
lim Dvar (P n (x, ·), Q) = 0 (13.3)
n→∞
for Q-almost all x ∈ B. As we will see, the chain must satisfy specific conditions for
this to be guaranteed.

13.3.3 Invariance and reversibility

If a Markov chain converges to Q, then Q must be an "invariant distribution," in the
sense that, if X_n ∼ Q for some n, then so does X_{n+1} and, as a consequence, all X_m for
m ≥ n. This can be seen by writing

P^{n+1}(x, A) = P_x(X_{n+1} ∈ A) = E_x(P(X_n, A)) = E_{P^n(x,·)}(P(·, A)).

If P^n(x, ·) (and therefore also P^{n+1}(x, ·)) converges to Q, then passing to the limit
above yields

Q(A) = E_Q(P(·, A)),

and this states that, if X_n ∼ Q, then so does X_{n+1}. If Q has a p.d.f. (resp. p.m.f.), say,
q, this gives

q(y) = ∫_{R^d} p(x, y) q(x) dx   (resp. q(y) = ∑_{x∈B} p(x, y) q(x)).

So, if one designs a Markov chain with a target asymptotic distribution Q, the first
thing to ensure is that Q is invariant. However, while invariance leads to an integral
equation for q, a stronger condition, called reversibility, is easier to assess.

Assume that Q is invariant by P. Make the assumption that P(x, ·) has a density
p* with respect to Q (this is, essentially, no loss of generality; see the argument below),
so that

P(x, A) = ∫_A p*(x, y) Q(dy).

Taking A = B above, we have

∫_B p*(x, y) Q(dy) = P(x, B) = 1,

but we also have, because Q is invariant, that

∫_B p*(x, y) Q(dx) = 1

for Q-almost all y. One says that the density is "doubly stochastic" with respect to Q.

Conversely, if a transition probability P has a doubly stochastic density p* with
respect to some probability Q on B, then Q is invariant by P, since

∫_B P(x, A) Q(dx) = ∫_B ∫_A p*(x, y) Q(dy) Q(dx)
  = ∫_A ∫_B p*(x, y) Q(dx) Q(dy) = ∫_A Q(dy) = Q(A).

The property of being doubly stochastic can be reinterpreted in terms of time


reversal for Markov chains. Let Q0 be an initial distribution for a Markov chain
with transition P (not necessarily invariant) so that, for any n ≥ 0, the distribution
of Xn is Qn = Q0 P n . Fixing any m > 0, we are interested in the reversed process
X̃_k = X_{m−k}. We first notice that the conditional distribution of X_n given its future
X_{n+1}, …, X_m (with n < m) only depends on X_{n+1}, so that the reversed process is also
Markov. Indeed, for any positive functions f : B → R, g : B^{m−n} → R, one has, using
the fundamental properties of conditional expectations and the fact that (X_n) is a
Markov chain,
E(f (Xn )g(Xn+1 , . . . , Xm )) = E (E(f (Xn )g(Xn+1 , . . . , Xm ) | Xn , Xn+1 ))
= E (f (Xn )E(g(Xn+1 , . . . , Xm ) | Xn , Xn+1 ))
= E (f (Xn )E(g(Xn+1 , . . . , Xm ) | Xn+1 ))
= E (E(f (Xn ) | Xn+1 )E(g(Xn+1 , . . . , Xm ) | Xn+1 ))
= E (E(f (Xn ) | Xn+1 )g(Xn+1 , . . . , Xm )) .
This shows that
E(f (Xn ) | Xn+1 , . . . , Xm ) = E(f (Xn ) | Xn+1 ),
which is what we wanted. To identify the conditional distribution of Xn given Xn+1 ,
we note that for any x ∈ B, the transition probability P (x, ·) is absolutely continuous
with respect to Qn+1 , since
Z
Qn+1 (A) = P (x, A)Qn (dx)
B
and the r.h.s. is zero only if P(x, A) = 0 Q_n-almost everywhere 3. This shows that
there exists a function r_{n+1} : B × B → R such that, for all A,

P(x, A) = ∫_A r_{n+1}(x, y) Q_{n+1}(dy).
3 The “almost everywhere” statement a priori depends on A, but can be made independent of it
under the mild assumption (that we will always make) that B has a countable basis of open sets.

Given this point, one can write

E(f(X_n) g(X_{n+1})) = ∫_{B²} f(x_n) g(x_{n+1}) P(x_n, dx_{n+1}) Q_n(dx_n)
  = ∫_{B²} f(x_n) g(x_{n+1}) r_{n+1}(x_n, x_{n+1}) Q_{n+1}(dx_{n+1}) Q_n(dx_n)
  = ∫_B ( ∫_B f(x_n) r_{n+1}(x_n, x_{n+1}) Q_n(dx_n) ) g(x_{n+1}) Q_{n+1}(dx_{n+1}),

which shows that the conditional distribution of X_n given X_{n+1} = x_{n+1} has density
x_n ↦ r_{n+1}(x_n, x_{n+1}) relatively to Q_n.

Note that, for discrete probabilities, one has

r_{n+1}(x, y) = P(x, y)/Q_{n+1}(y)

and

P(X_n = x | X_{n+1} = y) = Q_n(x) P(x, y)/Q_{n+1}(y).     (13.4)

The formula is identical if both Q_0 and P(x, ·) have p.d.f.'s with respect to a fixed
reference measure μ on B (for example, Lebesgue's measure when B = R^d), denoting
these p.d.f.'s by q_0 and p(x, ·). Then, the p.d.f. of the distribution of X_n given X_{n+1} = y
is

p̃_n(y, x) = q_n(x) p(x, y)/q_{n+1}(y),     (13.5)

where q_n is the p.d.f. of Q_n.
where qn is the p.d.f. of Qn . Note that the transition probabilities of the reversed
Markov chain depend on n, i.e., the reversed chain is non-homogeneous in general.

However, if one assumes that Q0 = Q is invariant by P , then Qn = Q for all n and


therefore rn (x, y) = p∗ (x, y), using the previous notation. In this case, the reversed
chain has transitions independent of n and its transition probability has density

p̃∗ (x, y) = p∗ (y, x)

with respect to Q. In the discrete case, letting p(x, y) = P(X_{n+1} = y | X_n = x), we have
p*(x, y) = p(x, y)/Q(y), so that the reversed transition (call it p̃) is such that

p̃(x, y)/Q(y) = p(y, x)/Q(x),

i.e.,

Q(y) p(y, x) = Q(x) p̃(x, y).     (13.6)

One easily retrieves the fact that, if p is such that there exist Q and p̃ such that (13.6)
is satisfied, then (summing the equation over y) Q is an invariant probability for p.

Let Q be a probability on B. One says that the Markov chain (or the transition
probability P) is Q-reversible if and only if P(x, ·) has a density p*(x, ·) with respect
to Q such that p*(x, y) = p*(y, x) for all x, y ∈ B. Since such a density is necessarily
doubly stochastic, Q is then invariant by P. Reversibility is equivalent to the property
that, whenever X_n ∼ Q, the joint distribution of (X_n, X_{n+1}) coincides with that of
(X_{n+1}, X_n). Alternatively, Q-reversibility requires that, for all measurable A, B ⊂ B,

∫_A P(z, B) Q(dz) = ∫_B P(z, A) Q(dz).     (13.7)

In the discrete case, (13.7) is equivalent to the “detailed balance” condition:

Q(y)p(y, x) = Q(x)p(x, y). (13.8)

While Q can be an invariant distribution for a Markov chain without that chain
being Q-reversible, the latter property is easier to ensure when designing transition
probabilities, and most sampling algorithms are indeed reversible with respect to
their target distribution.

Remark 13.3 A simple example of a non-reversible Markov chain with invariant probability
Q is often obtained in practice by alternating two or more Q-reversible transition
probabilities. Assume, to simplify, that B is discrete and that p_1 and p_2 are transition
probabilities that satisfy (13.8). Consider a composite Markov chain for which
the transition from X_n to X_{n+1} consists in generating first Y_n according to p_1(X_n, ·)
and then X_{n+1} according to p_2(Y_n, ·). The resulting composite transition probability
is

p(x, y) = ∑_{z∈B} p_1(x, z) p_2(z, y).

Trivially, Q is invariant by p, since it is invariant by p_1 and p_2, but p is not Q-reversible.
Indeed, p satisfies (13.6) with

p̃(x, y) = ∑_{z∈B} p_2(x, z) p_1(z, y). □

13.3.4 Irreducibility and recurrence

While necessary, invariance is not sufficient for a Markov chain to converge to Q
in distribution. However, it simplifies the general ergodicity conditions compared
to the general theory of Markov chains [148, 161], as summarized below, following
[193] (see also [13]). We therefore assume that the transition probability P is such
that Q is P-invariant.

One says that the Markov chain is Q-irreducible (or, simply, irreducible in what
follows) if and only if, for all z ∈ B and all (measurable) B ⊂ B such that Q(B) > 0,
there exists n > 0 with Pz (Xn ∈ B) > 0. (Irreducibility implies that Q is the only
invariant probability of the Markov chain.)

A Markov chain is called periodic if there exists m > 1 such that B can be covered
by disjoint subsets B_0, …, B_{m−1} that satisfy P(x, B_j) = 1 for all x ∈ B_{j−1} if j ≥ 1, and
for all x ∈ B_{m−1} if j = 0. In other terms, the chain loops between the sets B_0, …, B_{m−1}. If
such a decomposition does not exist, the chain is called aperiodic.

A periodic chain cannot satisfy (13.3). Indeed, periodicity implies that P_x(X_n ∈
B_i) = 0 for all x ∈ B_i unless n = 0 (mod m). Since the sets B_i cover B, (13.3) is only possible
with Q = 0. Irreducibility and aperiodicity are therefore necessary conditions
for ergodicity. Combined with the fact that Q is an invariant probability distribution,
these conditions are also sufficient, in the sense that (13.3) is true for Q-almost
all x. (See [193] for a proof.)

Without the knowledge that the chain has an invariant probability, showing convergence
usually requires showing that the chain is recurrent, which means that, for
any set B such that Q(B) > 0, the probability that, starting from x, X_n ∈ B for an infinite
number of n, written P_x(X_n ∈ B i.o.) (for "infinitely often"), is positive for all
x ∈ B and equal to 1 Q-almost surely. The fact that irreducibility and aperiodicity
combined with Q-invariance imply recurrence (or, more precisely, Q-positive recurrence
[148]) is an important remark that significantly simplifies the theory for MCMC
simulation. Note that, by restricting B to a suitable set of Q-probability 1, one can
assume that P_x(X_n ∈ B i.o.) = 1 for all x ∈ B, which is called Harris recurrence. If the
chain is Harris recurrent, then (13.3) holds with μ_0 = δ_x for all x ∈ B. 4

One says that C ⊂ B is a "small" set if Q(C) > 0 and there exists a triple (m_0, ε, ν),
with ε > 0 and ν a probability distribution on B, such that

P^{m_0}(x, ·) ≥ ε ν(·)

for all x ∈ C. A slightly different result, proved in [13], replaces irreducibility by the
property that there exists a small set C ⊂ B such that

Px (∃n : Xn ∈ C) > 0
4 Harris recurrence is also associated with the uniqueness of right eigenvectors of P, that is, functions
h : B → R such that

P h(x) = ∫_B P(x, dy) h(y) = h(x).

Such functions are also called harmonic for P. Because P is a transition probability, constant functions
are always harmonic. Harris recurrence, in the current context, is equivalent to the fact that every
bounded harmonic function is constant.

for Q-almost all x ∈ B. One then replaces aperiodicity by the similar condition that
the greatest common divisor of the set of integers m, such that there exists ε_m > 0 with
P^m(x, ·) ≥ ε_m ν(·) for all x ∈ C, is equal to 1. These two conditions combined with
Q-invariance also imply that (13.3) holds for Q-almost all x ∈ B.

13.3.5 Speed of convergence

It is also important to quantify the speed of convergence in (13.3). Efficient algorithms
typically have a geometric convergence speed, namely

D_var(P^n(x, ·), Q) ≤ M(x) r^n     (13.9)

for some 0 ≤ r < 1 and some function M(x), or a uniformly geometric convergence
speed, for which the function M is bounded (or, equivalently, constant).

A sufficient condition for geometric ergodicity is provided in Nummelin [148,
Proposition 5.21]. Assume that the chain is Harris recurrent and that there exist
r > 1, a small set C and a "drift function" h with

sup_{x∉C} ( r E(h(X_{n+1}) | X_n = x) − h(x) ) < 0     (13.10a)

and

sup_{x∈C} E( h(X_{n+1}) 1_{X_{n+1}∉C} | X_n = x ) < ∞.     (13.10b)

Then the Markov chain is geometrically ergodic. Note that E(h(X_{n+1}) | X_n = x) =
P h(x). Equations (13.10a) and (13.10b) can be summarized in a single equation
[137], namely

P h(x) ≤ β h(x) + M 1_C(x)     (13.11)

with β = 1/r < 1 and M ≥ 0.

13.3.6 Models on finite state spaces

Uniform geometric ergodicity is implied by the simple condition that the whole set
B is small, requiring a uniform lower bound, for some ε > 0 and some probability distribution ν,

P^{m_0}(x, ·) ≥ ε ν(·)     (13.12)

for all x ∈ B. Such uniform conditions usually require strong restrictions on the
space B, such as compactness or finiteness.

To illustrate this, consider the case in which the set B is finite. Assume, to simplify,
that Q(x) > 0 for all x ∈ B (one can restrict the Markov chain to such x's otherwise).
Arbitrarily labeling elements of B as B = {x_1, …, x_N}, we can consider p(x, y)
as the coefficients of a matrix P = (p(x_k, x_l), k, l = 1, …, N). Such a matrix, which has
non-negative entries and row sums equal to 1, is called a stochastic matrix.

We will denote the nth power of P as P^n = (p^{(n)}(x_k, x_l), k, l = 1, …, N). One immediately
sees that irreducibility is equivalent to the fact that, for all x, y ∈ B, there
exists m (that may depend on x and y) such that p^{(m)}(x, y) > 0. One can furthermore
show that the chain is irreducible and aperiodic if one can choose m independent of
x and y above, that is, if there exists m such that P^m has positive coefficients. This
condition clearly implies uniformly geometric ergodicity, which is therefore valid
for all irreducible and aperiodic Markov chains on finite sets.

This result can also be deduced from properties of matrices with non-negative
or positive coefficients. The Perron-Frobenius theorem [93] states that the eigenvalue
1 (associated with the eigenvector 1) is the largest, in modulus, eigenvalue of
a stochastic matrix P̃ with positive entries, that it has multiplicity one, and that all
other eigenvalues have a modulus strictly smaller than one. If P^m has positive entries,
this implies that all the eigenvalues of (P^m − 1Q) (where Q is considered as a
row vector) have modulus strictly less than one. This fact can then be used to prove
uniformly geometric ergodicity.
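As a quick numerical illustration (a minimal sketch with a hypothetical 3-state stochastic matrix), one can check that some power P^m has positive entries and watch the rows of P^n converge to the invariant distribution Q:

import numpy as np

# a hypothetical 3-state stochastic matrix (non-negative rows summing to one)
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.0, 0.6]])

# P^2 has positive entries: the chain is irreducible and aperiodic
assert np.all(np.linalg.matrix_power(P, 2) > 0)

# the rows of P^n all converge geometrically to the invariant distribution Q
Pn = np.linalg.matrix_power(P, 50)
print(Pn[0])             # approximately Q
print(Pn[0] @ P - Pn[0]) # approximately 0, checking invariance Q P = Q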

13.3.7 Examples on Rd

To take a geometrically ergodic example that is not uniform, consider the simple
random walk provided by the iterations

X_{n+1} = ρ X_n + τ ε_n

where ε_n ∼ N(0, Id_{R^d}), τ² > 0 and 0 < ρ < 1. One shows easily by induction that
the conditional distribution of X_n given X_0 = x is Gaussian with mean m_n = ρ^n x and
covariance matrix σ_n² Id_{R^d} with

σ_n² = ((1 − ρ^{2n})/(1 − ρ²)) τ².

In particular, the distribution Q = N(0, σ_∞² Id_{R^d}), with σ_∞² = τ²/(1 − ρ²), is invariant.
Estimates on the variational distances between Gaussian distributions, such as those
provided in Devroye et al. [61], can then be used to show that

D_var(P^n(x, ·), Q) ≤ M(x) ρ^n,

where M grows linearly in |x| but is not bounded.

Situations in which one can, as above, compute the probability distribution of
X_n are rare, however, and proving geometric convergence is significantly more difficult
than for finite-state chains. For chains on R^d (or, more generally, locally compact
metric spaces), the drift function criterion (13.11) can be used. Assume that P h(·),
given by

P h(x) = E(h(X_{n+1}) | X_n = x) = ∫_{R^d} h(y) P(x, dy),

is continuous as soon as the function h : R^d → R is continuous (one says that the
chain is weak Feller). This is true, for example, if P(x, ·) has a p.d.f. with respect
to Lebesgue's measure which is continuous in x. In such a situation, one can see
that compact sets are small sets, and (13.11) can be restated as the existence, for a
positive function h with compact sub-level sets and such that h(x) ≥ 1, of a compact
set C ⊂ R^d and of positive constants β < 1 and b such that, for all x ∈ R^d,

P h(x) ≤ β h(x) + b 1_C(x).     (13.13)

As an example, consider the Markov chain defined by

X_{n+1} = X_n − δ∇H(X_n) + τ ε_{n+1}

where ε_1, ε_2, … are i.i.d. standard d-dimensional Gaussian variables and H : R^d → R
is C². This chain is clearly irreducible (with respect to Lebesgue's measure). One has

P h(x) = (2πτ²)^{−d/2} ∫_{R^d} h(y) e^{−|y−x+δ∇H(x)|²/(2τ²)} dy
  = (2π)^{−d/2} ∫_{R^d} h(x − δ∇H(x) + τu) e^{−|u|²/2} du.

Let us make the assumption that H is L-C¹ for some L > 0 (cf. definition 3.15)
and furthermore assume that |∇H(x)| tends to infinity when x tends to infinity, ensuring
that the sets {x : |∇H(x)| ≤ c} are compact for c > 0. We want to show
that, if δ is small enough, (13.13) holds for h(x) = exp(mH(x)) with m small enough.

We first compute an upper bound of

g(x, u) = mH(x − δ∇H(x) + τu) − |u|²/2.

Using the L-C¹ property, we have

g(x, u) ≤ mH(x) + m(−δ∇H(x) + τu)^T ∇H(x) + (mL/2)|δ∇H(x) − τu|² − |u|²/2
  = mH(x) − mδ(1 − δL/2)|∇H(x)|² + mτ(1 − δL)∇H(x)^T u − ((1 − mLτ²)/2)|u|²
  = mH(x) − ((1 − mLτ²)/2) | u − (mτ(1 − δL)/(1 − mLτ²)) ∇H(x) |²
    − m ( δ(1 − δL/2) − mτ²(1 − δL)²/(2(1 − mLτ²)) ) |∇H(x)|².

Assume that mLτ² < 1. It follows that

P h(x) = (2π)^{−d/2} ∫_{R^d} e^{g(x,u)} du
  ≤ ( h(x)/(1 − mLτ²)^{d/2} ) exp( −m ( δ(1 − δL/2) − mτ²(1 − δL)²/(2(1 − mLτ²)) ) |∇H(x)|² ).

Using this upper bound, we see that (13.13) will hold if one first chooses δ such that
δL < 2, then m such that mLτ² < 1 and

mτ²(1 − δL)²/(2(1 − mLτ²)) < δ(1 − δL/2),

and finally chooses C = {x : |∇H(x)| ≤ c} where c is large enough so that

(1/(1 − mLτ²)^{d/2}) exp( −m ( δ(1 − δL/2) − mτ²(1 − δL)²/(2(1 − mLτ²)) ) c² ) < 1.

Note that this Markov chain is not in detailed balance. Since P(x, ·) has a p.d.f.,
being in detailed balance would require the ratio p(x, y)/p(y, x) to simplify as a ratio q(y)/q(x)
for some function q, which does not hold here. However, we can identify the invariant
distribution approximately for small δ and τ, which we will assume to satisfy τ = a√δ
for a fixed a > 0, with δ a small number.

We can write

p(x, y) = (2πτ²)^{−d/2} exp( −|y − x + δ∇H(x)|²/(2τ²) )
  = (2πτ²)^{−d/2} exp( −|y − x|²/(2τ²) − (δ/τ²)(y − x)^T ∇H(x) − (δ²/(2τ²))|∇H(x)|² ).

If q is a density, we have

qP(y) = ∫_{R^d} q(x) p(x, y) dx
  = (2π)^{−d/2} ∫_{R^d} q(y + a√δ u) exp( −|u|²/2 + (√δ/a) u^T ∇H(y + a√δ u) − (δ/(2a²))|∇H(y + a√δ u)|² ) du.

Make the expansions:

q(y + a√δ u) = q(y) + a√δ ∇q(y)^T u + (a²δ/2) u^T ∇²q(y) u + o(δ|u|²)

and

exp( (√δ/a) u^T ∇H(y + a√δ u) − (δ/(2a²))|∇H(y + a√δ u)|² )
  = 1 + (√δ/a) u^T ∇H(y) − (δ/(2a²))|∇H(y)|² + δ u^T ∇²H(y) u + (δ/(2a²))(u^T ∇H(y))² + o(δ|u|²).
Taking the product and using the facts that (2π)^{−d/2} ∫_{R^d} u exp(−|u|²/2) du = 0 and
that (2π)^{−d/2} ∫_{R^d} u^T A u exp(−|u|²/2) du = trace(A) for any symmetric matrix A, we can
write

qP(y) = q(y) + δ ( (a²/2) Δq(y) + ∇H(y)^T ∇q(y) + q(y) ΔH(y) ) + o(δ).

This indicates that, if q is invariant by P, it should satisfy

(a²/2) Δq(y) + ∇H(y)^T ∇q(y) + q(y) ΔH(y) = o(1).

The partial differential equation

(a²/2) Δq(y) + ∇H(y)^T ∇q(y) + q(y) ΔH(y) = 0     (13.14)

is satisfied by the function y ↦ e^{−2H(y)/a²}. Assuming that this function is integrable, this
computation suggests that, for small δ, the Markov chain approximately samples
from the probability distribution

q_0(x) = (1/Z) e^{−2H(x)/a²}.

This is further discussed in the next remark, which involves stochastic differential
equations. We will also present a correction of this Markov chain that samples from
q_0 for all δ in section 13.5.2.

Remark 13.4 (Langevin equation) This chain is indeed the Euler discretization [107]
of the stochastic differential equation

dx_t = −∇H(x_t) dt + a dw_t     (13.15)

where w_t is a Brownian motion. Under general hypotheses, this stochastic diffusion
equation, called a Langevin equation, indeed converges in distribution to q_0. 5
5 Providing a rigorous account of the theory of stochastic differential equations is beyond our
scope, and we refer the reader to the many textbooks on the subject, such as McKean [132], Ikeda
and Watanabe [96], Ethier and Kurtz [69] (see also Berglund [26] for a short introduction).

Such diffusions are continuous-time Markov processes (X_t, t ≥ 0), which means
that the probability distribution of X_{t+s} given all events before and including time s
only depends on X_s and is provided by a transition probability P_t, with

P(X_{t+s} ∈ A | X_s = x) = P_t(x, A).


Similarly to deterministic ordinary differential equations, one shows that, under sufficient
regularity conditions (e.g., ∇H is C¹), equations such as (13.15) have solutions
up to some positive (random) explosion time, and that this explosion time is infinite
under additional conditions that ensure that |∇H(x)| does not grow too fast when x
tends to infinity.

If ϕ is a smooth enough function (say, C², with compact support), the function
(t, x) ↦ P_t ϕ(x) satisfies the partial differential equation, called Kolmogorov's backward
equation,

∂_t P_t ϕ(x) = −∇H(x)^T ∇P_t ϕ(x) + (a²/2) Δ P_t ϕ(x),

with initial condition P_0 ϕ(x) = ϕ(x). If P_t(x, ·) has at all times t a p.d.f. p_t(x, ·), then
this p.d.f. must satisfy the forward Kolmogorov equation:

∂_t p_t(x, y) = ∇_2 · (∇H(y) p_t(x, y)) + (a²/2) Δ_2 p_t(x, y),

where ∇_2 · and Δ_2 indicate differentiation with respect to the second variable (y). (Recall
that Δf denotes the Laplacian of f.) Moreover, if Q is an invariant distribution
with p.d.f. q, it satisfies the equation

∇ · (q∇H) + (a²/2) Δq = 0.

Noting that ∇ · (q∇H) = ∇q^T ∇H + q ΔH, we retrieve (13.14).
Noting that ∇ · (q∇H) = ∇qT ∇H + q∆H, we retrieve (13.14). Convergence proper-
ties (and, in particular, geometric convergence) of the Langevin equation to its limit
distribution are studied in Roberts and Tweedie [167], using methods introduced in
Meyn and Tweedie [135, 136, 137] 

13.4 Gibbs sampling

13.4.1 Definition

The Gibbs sampling algorithm [79] was introduced to sample from distributions
on large sets for which direct sampling is intractable and rejection sampling is inefficient.
It generates a Markov chain that converges (under some hypotheses) in distribution
to the target probability. A general version of this algorithm is described
below.

Let Q be a probability distribution on B. Consider a finite family U_1, …, U_K of
random variables defined on B with values in measurable spaces B′_1, …, B′_K. Let
Q_i = Q_{U_i} denote the image of Q by U_i, defined by Q_i(B_i) = Q(U_i ∈ B_i) for B_i ⊂ B′_i.
Also, assume that there exists, for all i, a regular family of conditional probabilities
for Q given U_i, defined as a collection of transition probabilities (u_i, A) ↦ Q_i(u_i, A)
for u_i ∈ B′_i and A ⊂ B, that satisfy

∫_A g(U_i(x)) Q(dx) = ∫_{B′_i} Q_i(u_i, A) g(u_i) Q_i(du_i)

for all nonnegative measurable functions g : B′_i → R. In simpler terms, the Q_i(u_i, A)
determine a consistent set of conditional probabilities Q(· | U_i = u_i). For discrete
random variables (resp. variables with p.d.f.'s on R^d), they are just elementary conditional
probabilities.

We then consider the following algorithm.

Algorithm 13.2 (Gibbs sampling)


Initialize the algorithm with some z(0) = z0 ∈ B and iterate the following two update
steps given a current z(n) ∈ B:

(1) Select j ∈ {1, . . . , K} according to some pre-defined scheme, i.e., at random ac-
cording to a probability distribution π(n) on the set {1, . . . , K}.
(2) Sample a new value z(n+1) according to the probability distribution Qj (Uj (z(n)), ·).

One typically chooses the probability distribution in step 1 equal to the uniform
distribution on {1, …, K} (in which case it is independent of n), or to π^{(n)} = δ_{j_n} where
j_n = 1 + (n mod K) (periodic scheme). Strictly speaking, Gibbs sampling is a Markov
chain if π^{(n)} does not depend on n, and we will make this simplifying assumption in
the rest of our discussion (therefore replacing π^{(n)} by π). One obvious requirement
for the feasibility of the method is that step 2 can be performed efficiently, since it
must be repeated a very large number of times.

One can see that the Markov chain generated by this algorithm is Q-reversible.
Indeed, assume that X_n = Z with Z ∼ Q. For any (measurable) subsets A and B in B, one has,
using the definition of conditional expectations,

P(X_n ∈ A, X_{n+1} ∈ B) = ∑_{i=1}^K E( 1_{Z∈A} Q_i(U_i(Z), B) ) π(i).     (13.16)

Now, for any i,

E( 1_{Z∈A} Q_i(U_i(Z), B) ) = ∫_A Q_i(U_i(z), B) Q(dz)
  = ∫_{B′_i} Q_i(u_i, A) Q_i(u_i, B) Q_i(du_i),

which is symmetric in A and B.

Note that, in the discrete case,

P(z, z̃) = ∑_{i=1}^K π(i) ( Q(z̃) 1_{U_i(z̃)=U_i(z)} / ∑_{z′ : U_i(z′)=U_i(z)} Q(z′) ),     (13.17)

and the relation Q(z) P(z, z̃) = Q(z̃) P(z̃, z) is obvious.

The conditioning variables U_1, …, U_K should ensure, at least, that the associated
Markov chain is irreducible and aperiodic. For irreducibility, this requires that Z
can visit Q-almost all elements of B by a sequence of steps, each of which leaves one of the U_i's
invariant.

Remark 13.5 In the standard version of Gibbs sampling, B is a product space B_1 ×
⋯ × B_K, and

B′_j = B_1 × ⋯ × B_{j−1} × B_{j+1} × ⋯ × B_K.

One then takes U_j(z^{(1)}, …, z^{(K)}) = (z^{(1)}, …, z^{(j−1)}, z^{(j+1)}, …, z^{(K)}). In other terms, step 2
in the algorithm replaces the current value of z^{(j)}(n) by a new one sampled from the
conditional distribution of Z^{(j)} given the current values of z^{(i)}(n), i ≠ j. □

Remark 13.6 We have considered a fixed number of conditioning variables, U_1, …, U_K,
for simplicity, but the same analysis can be carried out if one replaces U_j by a function
U : (x, θ) ↦ U_θ(x) defined on a product space B × Θ, taking values in some
space B̃, where Θ is a probability space equipped with a probability distribution
π and U is measurable. The previous discussion corresponds to Θ = {1, …, K} and
B̃ = ⋃_{i=1}^K {i} × B′_i (so that U_i(x) is replaced by (i, U_i(x))).

One may then define Q_θ as the image of Q by U_θ and let Q_θ(u, A) provide a
version of Q(A | U_θ = u). The only change in the previous discussion (besides using
θ as an index) is that (13.16) becomes

P(X_n ∈ A, X_{n+1} ∈ B) = ∫_Θ E( 1_{Z∈A} Q_θ(U_θ(Z), B) ) π(dθ). □

Remark 13.7 Using notation from the previous remark, and allowing π = π^{(n)} to
depend on n, it is possible to let π^{(n)} depend on the current state z(n) using the
following construction.

For every step n, assume that there exists a subset Θ_n of Θ such that π^{(n)}(z, Θ_n) = 1
and that, for all θ ∈ Θ_n, π^{(n)} can be expressed in the form

π^{(n)}(z, ·) = ψ_θ^{(n)}(U_θ(z), ·)

for some transition probability ψ_θ^{(n)} from B̃ to Θ_n. The resulting chain remains
Q-reversible, since

P(X_n ∈ A, X_{n+1} ∈ B) = ∫_{Θ_n} ∫_B 1_{z∈A} Q_θ(U_θ(z), B) π^{(n)}(z, dθ) Q(dz)
  = ∫_{Θ_n} ∫_B 1_{z∈A} Q_θ(U_θ(z), B) ψ_θ^{(n)}(U_θ(z), dθ) Q(dz)
  = ∫_{Θ_n} ∫_{B̃} Q_θ(u, A) Q_θ(u, B) ψ_θ^{(n)}(u, dθ) Q_θ(du). □
Θn B̃

13.4.2 Example: Ising model

We will see several examples of applications of Gibbs sampling in the next few chapters.
Here, we consider a special instance of Markov random field (see chapter 14)
called the Ising model. For this example, B = {0, 1}^L, and

q(z) = (1/C) exp( α ∑_{j=1}^L z^{(j)} + ∑_{1≤i<j≤L} β_{ij} z^{(i)} z^{(j)} ).

Note that, although B is a finite set, its cardinality, 2L , is too large for the enumerative
procedure described in section 13.1 to be applicable as soon as L is, say, larger than
30. In practical applications of this model, L is orders of magnitude larger, typically
in the thousands or tens of thousands.

We here apply standard Gibbs sampling, as described in remark 13.5, defining
B_j = {0, 1} and

U_j(z^{(1)}, …, z^{(L)}) = (z^{(1)}, …, z^{(j−1)}, z^{(j+1)}, …, z^{(L)}).

The conditional distribution of Z^{(j)} given U_j(z) is a Bernoulli distribution with parameter

q_{Z^{(j)}}(1 | U_j(z)) = exp( α + ∑_{j′≠j} β_{jj′} z^{(j′)} ) / ( 1 + exp( α + ∑_{j′≠j} β_{jj′} z^{(j′)} ) )

(taking β_{jj′} = β_{j′j} for j > j′). Gibbs sampling for this model will generate a sequence
of variables Z(0), Z(1), … by fixing Z(0) arbitrarily and, given Z(n) = z, applying the
two steps:

(1) Select j ∈ {1, …, L} at random according to a probability distribution π^{(n)} on
the set {1, …, L}.

(2) Sample a new value ζ ∈ {0, 1} according to the Bernoulli distribution with
parameter q_{Z^{(j)}}(1 | U_j(z)), and set Z^{(j)}(n + 1) = ζ and Z^{(j′)}(n + 1) = Z^{(j′)}(n) for j′ ≠ j.
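A minimal Python sketch of this sampler (for a hypothetical small model, with the couplings stored in a symmetric matrix beta with zero diagonal) could be:

import numpy as np

rng = np.random.default_rng(0)

def gibbs_ising(alpha, beta, n_iter):
    # beta: symmetric L x L coupling matrix with zero diagonal
    L = beta.shape[0]
    z = rng.integers(0, 2, size=L)
    for _ in range(n_iter):
        j = rng.integers(L)             # step 1: pick a site at random
        s = alpha + beta[j] @ z         # sum over j' != j (diagonal is zero)
        p = 1.0 / (1.0 + np.exp(-s))    # Bernoulli parameter q_{Z^(j)}(1 | U_j(z))
        z[j] = int(rng.uniform() < p)   # step 2: resample coordinate j
    return z

L = 10
beta = 0.2 * np.ones((L, L))
np.fill_diagonal(beta, 0.0)
print(gibbs_ising(alpha=-1.0, beta=beta, n_iter=5000))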

Let us now consider the Ising model with fixed total activation, namely the previous
distribution conditioned on S(z) = z^{(1)} + ⋯ + z^{(L)} = h, where 0 < h < L. The
distribution one wants to sample from is now

q_h(z) = (1/C_h) exp( α ∑_{j=1}^L z^{(j)} + ∑_{1≤i<j≤L} β_{ij} z^{(i)} z^{(j)} ) 1_{S(z)=h}.

In that case, the previous choice for the one-step transitions does not work, because
fixing all but one coordinate of z also fixes the last one (so that the chain would not
move from its initial value and would certainly not be irreducible). One can however
fix all but two coordinates, therefore defining

U_{ij}(z^{(1)}, …, z^{(L)}) = (z^{(1)}, …, z^{(i−1)}, z^{(i+1)}, …, z^{(j−1)}, z^{(j+1)}, …, z^{(L)})

and B′_{ij} = {0, 1}². If U_{ij}(z) is fixed, the only acceptable configurations are z itself and
the configuration z′ deduced from z by switching the values of z^{(i)} and z^{(j)}. Thus,
no change is possible if z^{(i)} = z^{(j)}. If z^{(i)} ≠ z^{(j)}, then the probability of flipping
the values of z^{(i)} and z^{(j)} is q_h(z′)/(q_h(z) + q_h(z′)).

13.5 Metropolis-Hastings

13.5.1 Definition

Gibbs sampling is a special case of a generic MCMC algorithm called Metropolis-Hastings,
which is defined as follows [134, 88]. Assume that the distribution Q has a
density q with respect to a measure μ on B. Specify a transition probability on B,
represented by a family of density functions with respect to μ, (g(z, ·), z ∈ B), and
a family of acceptance functions (z, z′) ↦ a(z, z′) ∈ [0, 1]. Two basic examples are
(i) when B is finite, μ is the counting measure, and q and g are probability mass
functions, and (ii) when B = R^d, μ is Lebesgue's measure and q and g are probability
density functions.

The sampling algorithm is then defined as follows. It invokes a function a that
will be specified below.

Algorithm 13.3 (Metropolis-Hastings)


Initialize the algorithm with Z(0) = z(0) ∈ B. At step n, the current value Z(n) = z is
then updated as follows.

• “Propose” a new configuration z0 drawn according to g(z, ·).


• “Accept” z0 (i.e., set Z(n + 1) = z0 ) with probability a(z, z0 ). If the new value is
rejected, keep the current one, i.e., let Z(n + 1) = z.

The transition probabilities for this process are p(x, y) = g(x, y) a(x, y) if x ≠ y and
p(x, x) = 1 − ∑_{y≠x} p(x, y). The chain is Q-reversible if the detailed balance equation

q(z) g(z, z′) a(z, z′) = q(z′) g(z′, z) a(z′, z)     (13.18)

is satisfied. The functions g and a are part of the design of the algorithm, but (13.18)
suggests that g should satisfy the "weak symmetry" condition:

∀ z, z′ ∈ B : g(z, z′) = 0 ⇔ g(z′, z) = 0.     (13.19)

Note that this condition is necessary to ensure (13.18) if q(z) > 0 for all z. If q(z) > 0,
the fact that acceptance probabilities are less than 1 requires that

a(z, z′) ≤ min( 1, q(z′) g(z′, z) / (q(z) g(z, z′)) ).

If one takes a(z, z′) equal to the r.h.s. above, so that

a(z, z′) = min( 1, q(z′) g(z′, z) / (q(z) g(z, z′)) ),     (13.20)

then (13.18) is satisfied as soon as q(z) > 0. If q(z) = 0, then this definition ensures
that a(z′, z) = 0 and (13.18) is also satisfied. Note also that the case g(z, z′) = 0 is not
relevant, since z′ is not attainable from z in one step in this case. This shows that
(13.20) provides a Q-reversible chain. Obviously, if g already satisfies q(z) g(z, z′) =
q(z′) g(z′, z), which is the case for Gibbs sampling, then one should take a(z, z′) = 1 for
all z and z′.
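As an illustration, here is a minimal Python sketch of Algorithm 13.3 with acceptance (13.20), for a hypothetical one-dimensional target density q known only up to its normalizing constant, and a Gaussian random-walk proposal (which is symmetric, so the g-ratio cancels):

import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(log_q, z0, n_iter, step=0.5):
    # random-walk proposal: g(z, .) is the p.d.f. of N(z, step^2), symmetric in (z, z')
    z = z0
    chain = []
    for _ in range(n_iter):
        z_prop = z + step * rng.normal()
        # acceptance (13.20); the g-ratio is 1 for a symmetric proposal
        if np.log(rng.uniform()) < log_q(z_prop) - log_q(z):
            z = z_prop
        chain.append(z)
    return np.array(chain)

# hypothetical target: q(z) proportional to exp(-z^4/4) (unnormalized)
chain = metropolis_hastings(lambda z: -z**4 / 4.0, z0=0.0, n_iter=10000)
print(chain.mean(), chain.var())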

13.5.2 Sampling methods for continuous variables

Metropolis adjusted Langevin algorithm. While the Gibbs sampling and Metropolis-Hastings
methods were formulated for general variables and probability distributions,
proving that the related chains are ergodic, and checking conditions for geometric
convergence speed, is much harder when dealing with general state spaces
than with finite or compact spaces (see, e.g., [165, 133, 6, 166]). On the other
hand, interesting choices of proposal transitions for Metropolis-Hastings are available
when B = R^d and μ is Lebesgue's measure, taking advantage, in particular, of
differential calculus. More precisely, assume that q takes the form

q(z) = (1/C) exp(−H(z))

for some smooth function H (at least C¹), such that exp(−H) is integrable. We saw
in section 13.3.7 that, under suitable assumptions, the Markov chain

X_{n+1} = X_n − (δ/2) ∇H(X_n) + √δ ε_{n+1}     (13.21)

with ε_{n+1} ∼ N(0, Id_{R^d}) has q as invariant distribution in the limit δ → 0. The transition
probability such that g(z, ·) is the p.d.f. of N(z − (δ/2)∇H(z), δ Id_{R^d}) is therefore a natural
choice for a proposal distribution in the Metropolis-Hastings algorithm. In addition
to converging to the exact target distribution, this "Metropolis adjusted Langevin
algorithm" (or MALA) can also be proved to satisfy geometric convergence under
less restrictive hypotheses than (13.21) [167].
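A minimal Python sketch of MALA (for a hypothetical target specified through H and its gradient; here the standard Gaussian, H(z) = |z|²/2) might read:

import numpy as np

rng = np.random.default_rng(0)

def mala(H, grad_H, z0, n_iter, delta=0.1):
    def log_g(z_from, z_to):
        # log density (up to a constant) of N(z_from - (delta/2) grad H(z_from), delta I)
        diff = z_to - z_from + 0.5 * delta * grad_H(z_from)
        return -np.sum(diff ** 2) / (2.0 * delta)
    z = np.asarray(z0, float)
    chain = []
    for _ in range(n_iter):
        z_prop = z - 0.5 * delta * grad_H(z) + np.sqrt(delta) * rng.normal(size=z.shape)
        # acceptance (13.20) with q proportional to exp(-H)
        log_a = (H(z) - H(z_prop)) + log_g(z_prop, z) - log_g(z, z_prop)
        if np.log(rng.uniform()) < log_a:
            z = z_prop
        chain.append(z.copy())
    return np.array(chain)

chain = mala(lambda z: 0.5 * np.sum(z**2), lambda z: z, np.zeros(2), 5000)
print(chain.mean(axis=0), chain.var(axis=0))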

Hamiltonian Monte-Carlo. Another approach, similar to MALA, is the Hamiltonian
Monte-Carlo method (or hybrid Monte-Carlo) [65, 143]. Inspired by physics,
the method introduces a new variable, m ∈ R^d, called "momentum," and defines the
"Hamiltonian:"

H(z, m) = H(z) + (1/2)|m|².

Fix a time θ > 0. The proposal transition g(z, ·) is then defined as the value ζ(θ) that
is obtained by solving the Hamiltonian dynamical system

∂_t ζ(t) = ∂_m H(ζ(t), μ(t)) = μ(t)
∂_t μ(t) = −∂_z H(ζ(t), μ(t)) = −∇H(ζ(t))     (13.22)

with ζ(0) = z and μ(0) ∼ N(0, Id_{R^d}). One can easily see that ∂_t H(ζ(t), μ(t)) = 0, which
implies that

H(ζ(t)) + (1/2)|μ(t)|² = H(z) + (1/2)|μ(0)|²

at all times t, or, denoting by ϕ_N the p.d.f. of the d-dimensional standard Gaussian,

q(ζ(t)) ϕ_N(μ(t)) = q(ζ(0)) ϕ_N(μ(0)).

Moreover, if one denotes by Φ_t(z, m) = (z_t(z, m), m_t(z, m)) the solution (ζ(t), μ(t)) of
the system started with ζ(0) = z and μ(0) = m, one can also see that det(dΦ_t(z, m)) = 1
at all times. Indeed, applying (1.5) and the chain rule, we have

∂_t log det(dΦ_t(z, m)) = trace( dΦ_t(z, m)^{−1} ∂_t dΦ_t(z, m) ).
13.5. METROPOLIS-HASTINGS 291

From (
∂t zt (z, m) = mt (z, p)
∂t pt (z, m) = −∇H(zt (z, m))
we get
!
∂z mt (z, m) ∂m mt (z, m)
∂t dΦt (z, m) = 2 2
−∇ H(zt (z, m))∂z zt (z, m) −∇ H(zt (z, m))∂m zt (z, m)
!
0 IdRd
= dΦt (z, m).
−∇2 H(zt (z, m)) 0

We therefore get
!
0 IdRd
∂t log det(dΦt (z, m)) = trace =0
−∇2 H(zt (z, m)) 0

showing that the determinant is constant. Since Φ0 (z, m) = (z, m) by definition, we


get det(dΦt (z, m)) = 1 at all times.

Let q̄_t denote the p.d.f. of Φ_t(z, m) and assume that q̄_0(z, m) = q(z) ϕ_N(m). We
have, using the change of variables formula,

q̄_t(Φ_t(z, m)) |det dΦ_t(z, m)| = q(z) ϕ_N(m).

But the r.h.s. is, from the remarks above, also equal to

q(z_t(z, m)) ϕ_N(m_t(z, m)) |det dΦ_t(z, m)|,

yielding the identification

q̄_t(z′, m′) = q(z′) ϕ_N(m′).

This shows that Q (with p.d.f. q) is left invariant by this Markov chain.

Reversibility. One can actually show that the chain is in detailed balance for the joint
density q̄(z, m) = q(z) ϕ_N(m). This is due to the fact that the system (13.22) is reversible,
in the sense that

Φ_t(z_t(z, m), −m_t(z, m)) = (z, −m),

i.e., the system solved from its end point after changing the sign of the momentum
returns to its initial state after changing the sign of the momentum a second time.
In other terms, letting J(z, m) = (z, −m), we have Φ_t^{−1} = J ∘ Φ_t ∘ J. So, consider a function
f : (R^d × R^d)² → R. Denoting the Markov chain by (Z_n, M_n), we assume that the next
pair (Z_{n+1}, M_{n+1}) is computed by (i) sampling M′_n ∼ N(0, Id_{R^d}); (ii) solving (13.22)
with initial conditions ζ(0) = Z_n and μ(0) = M′_n; (iii) taking Z_{n+1} = ζ(θ) and sampling
M_{n+1} ∼ N(0, Id_{R^d}).

We have, writing (z_θ(z, m), m_θ(z, m)) = Φ_θ(z, m),

E(f(Z_n, M_n, Z_{n+1}, M_{n+1})) = ∫ f(z, m̃, z_θ(z, m), m̄) ϕ_N(m) ϕ_N(m̄) ϕ_N(m̃) q(z) dm dm̄ dm̃ dz.

Make the change of variables z′ = z_θ(z, m), m′ = m_θ(z, m), which has Jacobian determinant
1, and is such that z = z_θ(z′, −m′), m = −m_θ(z′, −m′). We get

E(f(Z_n, M_n, Z_{n+1}, M_{n+1}))
  = ∫ f(z_θ(z′, −m′), m̃, z′, m̄) ϕ_N(−m_θ(z′, −m′)) ϕ_N(m̄) ϕ_N(m̃) q(z_θ(z′, −m′)) dm′ dm̄ dm̃ dz′
  = ∫ f(z_θ(z′, −m′), m̃, z′, m̄) ϕ_N(m_θ(z′, −m′)) ϕ_N(m̄) ϕ_N(m̃) q(z_θ(z′, −m′)) dm′ dm̄ dm̃ dz′
  = ∫ f(z_θ(z′, −m′), m̃, z′, m̄) ϕ_N(−m′) ϕ_N(m̄) ϕ_N(m̃) q(z′) dm′ dm̄ dm̃ dz′,

using the conservation of H. Making the change of variables m′ → −m′, we get

E(f(Z_n, M_n, Z_{n+1}, M_{n+1}))
  = ∫ f(z_θ(z′, m′), m̃, z′, m̄) ϕ_N(m′) ϕ_N(m̄) ϕ_N(m̃) q(z′) dm′ dm̄ dm̃ dz′,

which is equal to E(f(Z_{n+1}, M_{n+1}, Z_n, M_n)), showing the reversibility of the chain.

Time discretization. This simulation scheme can potentially make large moves in
the current configuration z while maintaining detailed balance (therefore not requiring
an accept/reject step). However, practical implementations require discretizing
(13.22), which breaks the conservation properties that were used in the argument
above, therefore requiring a Metropolis-Hastings correction. For example, a second-order
Runge-Kutta (RK2) scheme with time step α gives

Z_{n+1} = Z_n + α M_n − (α²/2) ∇H(Z_n)
M_{n+1} = M_n − (α/2)( ∇H(Z_n) + ∇H(Z_n + α M_n) ).

Only the update for Z_n matters, however, since M_{n+1} is discarded and resampled at
each step. Importantly, if we let δ = α², the first equation in the system becomes

Z_{n+1} = Z_n − (δ/2) ∇H(Z_n) + √δ M_n

with M_n ∼ N(0, Id_{R^d}), which is exactly (13.21). Note that one can, in principle, solve
(13.22) for more than one discretization step (the continuous equation can be solved
for an arbitrary time), but one must then face the challenge of computing the Metropolis
correction since the Hamiltonian is not conserved at each step.

One can however use schemes that are more adapted to solving Hamiltonian
systems [120], such as the Störmer-Verlet scheme, which is
Mn+1/2 = Mn − (α/2) ∇H(Zn )
Zn+1 = Zn + αMn+1/2
Mn+1 = Mn+1/2 − (α/2) ∇H(Zn+1 )

This scheme computes ψ1 ◦ ψ2 ◦ ψ1 (z, m) with ψ1 (z, m) = (z, m − (α/2)∇H(z)) and


ψ2 (z, m) = (z + αm, m). Because both ψ1 and ψ2 have a Jacobian determinant equal to
1, so does their composition. This scheme is also reversible, since we have
−Mn+1/2 = −Mn+1 − (α/2) ∇H(Zn+1 )
Zn = Zn+1 − αMn+1/2
−Mn = −Mn+1/2 − (α/2) ∇H(Zn )

These properties are conserved if one applies the Störmer-Verlet scheme more than once at each iteration, that is, fixing some N > 0 and letting Φ(z, m) = (ψ1 ◦ ψ2 ◦ ψ1 )◦N , so that Φ−1 = J ◦ Φ ◦ J with J(z, m) = (z, −m), and det dΦ = 1. Consider again the augmented chain which, starting from (Zn , Mn ), samples M̃ ∼ N (0, IdRd ), then computes (Z′, M̃′) = Φ(Zn , M̃) and finally samples M′ ∼ N (0, IdRd ), as a Metropolis-Hastings proposal to sample from (z, m) 7→ q(z)ϕN (m). Then, assuming that (Z, M) follows this target distribution and letting (Z′, M′) be the result of the proposal distribution, we have, as computed above,

E(f (Z, M, Z′, M′))
 = ∫ f (z, m̃, z(z, m), m̄) ϕN (m) ϕN (m̄) ϕN (m̃) q(z) dm dm̄ dm̃ dz
 = ∫ f (z(z′, m′), m̃, z′, m̄) ϕN (m(z′, m′)) ϕN (m̄) ϕN (m̃) q(z(z′, m′)) dm′ dm̄ dm̃ dz′.

This shows that the acceptance probability in the Metropolis step is

a(z, m, z′, m′) = min(1, ϕN (m(z′, m′)) q(z(z′, m′)) / (ϕN (m) q(z)))
 = exp(−max(H(z(z′, m′), m(z′, m′)) − H(z, m), 0)).

While the Hamiltonian is not kept invariant by the Störmer-Verlet scheme, so that an
accept-reject step is needed, it is usually quite stable over extended periods of time
so that the acceptance probability is generally close to one.
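As an illustration, here is a minimal Python sketch of the resulting sampler (Hamiltonian Monte Carlo: Störmer-Verlet integration followed by the Metropolis step above). The target potential H (a standard Gaussian one), the step size α and the number of integration steps are arbitrary choices made for the example, not prescriptions from the text.

```python
import numpy as np

def hmc_sample(grad_H, H, z0, alpha=0.1, n_leapfrog=20, n_iter=1000, rng=None):
    """Hamiltonian Monte Carlo: Stormer-Verlet integration of (13.22)
    followed by the Metropolis accept/reject step described above."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(z0, dtype=float)
    samples = []
    for _ in range(n_iter):
        m = rng.standard_normal(z.shape)          # fresh momentum ~ N(0, Id)
        z_new, m_new = z.copy(), m.copy()
        # N Stormer-Verlet (leapfrog) steps: psi1 o psi2 o psi1
        for _ in range(n_leapfrog):
            m_new -= 0.5 * alpha * grad_H(z_new)  # half momentum step (psi1)
            z_new += alpha * m_new                # full position step (psi2)
            m_new -= 0.5 * alpha * grad_H(z_new)  # half momentum step (psi1)
        # total energy H(z, m) = H(z) + |m|^2 / 2
        dE = (H(z_new) + 0.5 * m_new @ m_new) - (H(z) + 0.5 * m @ m)
        if np.log(rng.uniform()) < -max(dE, 0.0): # accept w.p. exp(-max(dE, 0))
            z = z_new
        samples.append(z.copy())
    return np.array(samples)

# Example: sample from q(z) proportional to exp(-H(z)) with H(z) = |z|^2 / 2.
if __name__ == "__main__":
    H = lambda z: 0.5 * z @ z
    grad_H = lambda z: z
    out = hmc_sample(grad_H, H, z0=np.zeros(2), rng=np.random.default_rng(0))
    print(out.mean(axis=0), out.var(axis=0))      # close to 0 and 1
```

When the integrator is stable, dE stays small and almost every proposal is accepted, which is the practical appeal of this scheme.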

13.6 Perfect sampling methods

We assume, in this section, that B is a finite set. The Markov chain simulation meth-
ods described in the previous sections do not provide exact samples from the dis-
tribution q, but only increasingly accurate approximations. Perfect sampling algo-
rithms [157, 158, 71] use Markov chains “backwards” to generate exact samples.
To describe them, it is convenient to represent a Markov chain as a stochastic recursive equation of the form
Xn+1 = f (Xn , Un+1 ) (13.23)
where Un+1 is independent of Xn , Xn−1 , . . ., and the Uk ’s are identically distributed. In
the discrete case (assumed in this section), and given a stochastic matrix P , one can
take Un to be the uniformly distributed variable used to sample from (p(Xn , x), x ∈ B).
Conversely, the transition probability associated to (13.23) is p(x, y) = P (f (x, U ) = y).

It will be convenient to consider negative times also. For n > 0, recursively define F−n (x, u−n+1 , . . . , u0 ) by

F−n−1 (x, u−n , . . . , u0 ) = F−n (f (x, u−n ), u−n+1 , . . . , u0 )

and F−1 (x, u0 ) = f (x, u0 ). Denote, for short, U^0_{−n} = (U−n , . . . , U0 ). The function F−n (x, u^0_{−n+1}) provides the value of X0 when X−n = x and U^0_{−n+1} = u^0_{−n+1}.
For an infinite sequence in the past, u^0_{−∞}, let ν(u^0_{−∞}) denote the first integer n such that F−n (x, u^0_{−n+1}) does not depend on x (the function “coalesces”). Then, the following theorem is true:
Theorem 13.8 Assume that the chain defined by (13.23) is ergodic, with invariant distribution Q. Then ν = ν(U^0_{−∞}) is finite with probability 1, and

X∗ := F−ν (x, U^0_{−ν+1})    (13.24)

(which is independent of x) has distribution Q.


Proof Because the chain is ergodic, we know that there exists an integer N such that one can pass from any state to any other in N steps with positive probability. So the chain can, starting from anywhere, coalesce with positive probability in N steps; ν being infinite would imply that this event never occurs in an infinite number of trials, which has probability 0.

For any k > 0 and any x ∈ B, we have

X∗ = F−ν (F−k (x, U^{−ν}_{−ν−k+1}), U^0_{−ν+1}) = F−ν−k (x, U^0_{−ν−k+1}),    (13.25)

where F−k (x, U^{−ν}_{−ν−k+1}) denotes, with a slight abuse of notation, the state reached at time −ν when starting from x at time −ν − k.

But, because the chain is ergodic, we have, for any x ∈ B,

lim_{k→∞} P(F−k (x, U^0_{−k+1}) = y) = Q(y).

We can write

P(F−k (x, U^0_{−k+1}) = y) = P(F−k (x, U^0_{−k+1}) = y, ν ≤ k) + P(F−k (x, U^0_{−k+1}) = y, ν > k)
 = P(X∗ = y, ν ≤ k) + P(F−k (x, U^0_{−k+1}) = y, ν > k).

The right-hand side tends to P(X∗ = y) when k tends to infinity (because P(ν > k) tends to 0), and the left-hand side tends to Q(y), which gives the second part of the theorem. □

From (13.25), which is the key step in proving that X∗ follows the invariant distribution, one can see why it is important to consider sampling that expands backward in time rather than forward. More specifically, consider the coalescence time for the forward chain, letting ν̃ = ν̃(u1 , u2 , . . .) be the first index for which

X̃∗ := Fν̃ (x, u1 , . . . , uν̃ )

is independent of the starting point, x. For any k ≥ 0, one still has the fact that Fν̃+k (x, u1 , . . . , uν̃+k ) does not depend on x, but its value depends on k and will not be equal to X̃∗ anymore, which prevents the rest of the proof of theorem 13.8 from carrying over.

An equivalent algorithm is described in the next proposition (the proof is easy and left to the reader).
Proposition 13.9 Using the same notation as above, the following algorithm generates a perfect sample, ξ∗ , of the invariant distribution of an ergodic Markov chain.

Assume that an infinite sample u^0_{−∞} of U is available. Given this sequence, the algorithm, starting with t0 = 2, is:

1. For all x ∈ B, define ξ^x_{−t}, t = t0 , . . . , 0, by ξ^x_{−t0} = x and ξ^x_{−t+1} = f (ξ^x_{−t}, u−t+1 ).

2. If ξ^x_0 is constant (independent of x), let ξ∗ be equal to this constant value and stop. Otherwise, return to step 1, replacing t0 with 2t0 .

In practice, the u−k ’s are only generated when they are needed. But it is important to consider the sequence as fixed: once u−k is generated, it must be stored (or identically regenerated, using the same seed) for further use. It is important to stress the fact that this algorithm works backward in time, in the sense that the first states of the sequence are not identical at each iteration, because they are generated using random numbers with indices further in the past.
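For a chain on a small finite state space, the doubling algorithm of proposition 13.9 is short to implement. The sketch below is a minimal illustration under assumed inputs: states are integers, f implements (13.23) by inverse-c.d.f. sampling of the rows of a stochastic matrix, and the u−k ’s are generated on demand and then stored for reuse, as required above.

```python
import numpy as np

def cftp(f, states, rng=None):
    """Coupling from the past (proposition 13.9): returns an exact sample
    from the invariant distribution of the chain X_{n+1} = f(X_n, U_{n+1})."""
    rng = np.random.default_rng() if rng is None else rng
    u = []                        # u[k] stores u_{-k}; generated once, then reused
    t0 = 2
    while True:
        while len(u) < t0:        # extend the *past* of the driving sequence
            u.append(rng.uniform())
        xi = {x: x for x in states}           # chains started at time -t0
        for t in range(t0, 0, -1):            # run from time -t0 up to time 0
            xi = {x: f(xi[x], u[t - 1]) for x in states}
        values = set(xi.values())
        if len(values) == 1:                  # coalescence: xi_0^x constant in x
            return values.pop()
        t0 *= 2                               # otherwise restart further in the past

# Example: a 3-state chain given by inverse-c.d.f. sampling of the rows of P.
if __name__ == "__main__":
    P = np.array([[0.5, 0.5, 0.0],
                  [0.25, 0.5, 0.25],
                  [0.0, 0.5, 0.5]])
    cum = P.cumsum(axis=1)
    f = lambda x, v: int(np.searchsorted(cum[x], v))
    rng = np.random.default_rng(1)
    draws = [cftp(f, range(3), rng) for _ in range(2000)]
    print(np.bincount(draws) / 2000)          # approximates (0.25, 0.5, 0.25)
```

Note how the list u only grows at its far end: when t0 is doubled, the values already used for the more recent past are kept unchanged.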

Such an algorithm is not feasible when |B| is too large, since one would have to consider an intractable number of simulated sequences (one for each x ∈ B). However, there are cases in which the constancy of ξ^x_0 over all of B can be decided from its constancy over a small subset of B.

One situation in which this is true is when the Markov chain is monotone, according to the following definition. Assume that B can be partially ordered, and that f in (13.23) is increasing in x, i.e.,

x ≤ x′ ⇒ ∀u, f (x, u) ≤ f (x′, u).    (13.26)

Let Bmin and Bmax be the sets of minimal and maximal elements in B. Then the
sequence coalesces for the algorithm above if and only if it coalesces over Bmin ∪
Bmax . Indeed, any x ∈ B is smaller than some maximal element, and larger than
some minimal element in B. By (13.26), these inequalities remain true at each step
of the sampling process, which implies that when chains initialized with extremal
elements coalesce, so do the other ones. Therefore, it suffices to run the algorithm
with extremal configurations only.

One can rewrite (13.26) in terms of transition probabilities p(x, y), assuming that U follows a uniform distribution on [0, 1] and that, for all x ∈ B, there exists a partition (Ixy , y ∈ B) of [0, 1] such that

f (x, u) = y ⇔ u ∈ Ixy

and Ixy is an interval with length p(x, y). Condition (13.26) is then equivalent to

x ≤ x′ ⇒ ∀y ∈ B, Ixy ⊂ ∪_{y′≥y} Ix′y′ .

This requires in particular that Σ_{y≥y0} p(x, y) ≤ Σ_{y≥y0} p(x′, y) whenever x ≤ x′ (one says that p(x, ·) is stochastically smaller than p(x′, ·)).

One example in which this reduction works is the ferromagnetic Ising model, for which B = {−1, 1}^L and

q(x) = (1/C) exp( Σ_{s<t} βst x(s) x(t) ),

where the sum is over pairs 1 ≤ s < t ≤ L,

with βst ≥ 0 for all {s, t}. Then, the Gibbs sampling algorithm iterates the following steps: take a random s ∈ {1, . . . , L} and update x(s) according to the conditional distribution

gs (y(s) | x(s^c) ) = e^{y(s) vs (x)} / (e^{−vs (x)} + e^{vs (x)})

with vs (x) = Σ_{t≠s} βst x(t) . One can order B so that x ≤ x̃ if and only if x(s) ≤ x̃(s) for all s = 1, . . . , L. The minimal and maximal elements are unique in this case, with xmin(s) ≡ −1 and xmax(s) ≡ 1. Moreover, because all βst are non-negative, vs is an increasing function of x so that, if x ≤ x̃, then gs (1 | x(s^c) ) ≤ gs (1 | x̃(s^c) ).

To define the stochastic iterations, first introduce

fs (x, u) = 1(s) ∧ x(s^c)      if u ≤ gs (1 | x(s^c) )
fs (x, u) = (−1)(s) ∧ x(s^c)   if u > gs (1 | x(s^c) ),
which satisfies (13.26). The whole updating scheme can then be implemented with the function

f (x, (u, ũ)) = Σ_{s=1}^L δIs (ũ) fs (x, u)

where (Is , s = 1, . . . , L) is any partition of [0, 1] in intervals of length 1/L. This is still monotonic. The algorithm described in proposition 13.9 can therefore be applied to sample exactly, in finite time, from the ferromagnetic Ising model.
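Here is a minimal sketch of the resulting monotone coupling-from-the-past sampler, on a small ring with constant interaction βst = β > 0 between neighbors (an arbitrary choice for the example); only the two chains started from the extremal configurations −1 and +1 are simulated. Instead of the pair (u, ũ) used in the text, each stored move directly records the uniform variable and the updated site, an equivalent representation.

```python
import numpy as np

def ising_cftp(L=10, beta=0.3, rng=None):
    """Monotone CFTP for a ferromagnetic Ising model on a ring of L sites."""
    rng = np.random.default_rng() if rng is None else rng

    def update(x, u, s):
        v = beta * (x[(s - 1) % L] + x[(s + 1) % L])   # v_s(x), ring neighbors
        g1 = np.exp(v) / (np.exp(-v) + np.exp(v))       # g_s(1 | x), increasing in x
        x = x.copy()
        x[s] = 1 if u <= g1 else -1                     # preserves the partial order
        return x

    moves = []                                          # moves[k] = (u, site) at time -k
    t0 = 2
    while True:
        while len(moves) < t0:
            moves.append((rng.uniform(), int(rng.integers(L))))
        lo, hi = -np.ones(L, dtype=int), np.ones(L, dtype=int)
        for t in range(t0, 0, -1):                      # from time -t0 to time 0
            u, s = moves[t - 1]
            lo, hi = update(lo, u, s), update(hi, u, s)
        if np.array_equal(lo, hi):                      # extremal chains coalesced
            return lo                                   # exact sample from q
        t0 *= 2

if __name__ == "__main__":
    print(ising_cftp(rng=np.random.default_rng(2)))
```

Since every other configuration is sandwiched between the two extremal chains, their coalescence forces coalescence of all chains, as argued above.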

13.7 Application: Stochastic approximation with Markovian transitions

Using the material developed in this chapter, we now discuss the convergence of stochastic approximation methods (such as stochastic gradient descent) when the random variable in the update term follows Markovian transitions. In section 3.3, we considered algorithms in the form

ξt+1 ∼ πXt
Xt+1 = Xt + αt+1 H(Xt , ξt+1 )
where ξt : Ω → Rξ is a random variable. We now want to address situations in which the random variable ξt+1 is obtained through a transition probability, therefore considering the algorithm

ξt+1 ∼ PXt (ξt , · )
Xt+1 = Xt + αt+1 H(Xt , ξt+1 )    (13.27)
Here Px is, for all x, a transition probability from Rξ to Rξ . We will assume that,
for all x ∈ Rd , the Markov chain with transition Px is geometrically ergodic, and we
denote by πx its invariant distribution. We let, as in section 3.3, H̄(x) = Eπx (H(x, ·)).
We will use the following notation: for a function f : Rd × Rξ → R,

Px f : (x′, ξ) ∈ Rd × Rξ 7→ Px f (x′, ξ) = ∫ f (x′, ξ′) Px (ξ, dξ′)

and

πx f : x′ ∈ Rd 7→ πx f (x′) = ∫ f (x′, ξ) πx (dξ).

In particular, H̄(x) = πx H(x). We also define h(x, ξ) = H(x, ξ) − H̄(x) and h̃(x, ξ) =
Px h(x, ξ). We make the following assumptions.

(H1) There exist constants C0 , C1 , C2 such that, for all x, y ∈ Rd ,

sup_{ξ∈Rξ} |H(x, ξ)| ≤ C0 ,    (13.28a)
sup_{ξ∈Rξ} |h̃(x, ξ)| ≤ C1 ,    (13.28b)
sup_{ξ∈Rξ} |h̃(x, ξ) − h̃(y, ξ)| ≤ C1 |x − y|,    (13.28c)
Dvar (πx , πy ) ≤ C2 |x − y|.    (13.28d)

(H2) There exist x∗ ∈ Rd and µ > 0 such that, for all x ∈ Rd ,

(x − x∗ )T H̄(x) ≤ −µ|x − x∗ |².    (13.29)

(H3) There exist a constant M and a non-decreasing function ρ : [0, +∞) → [0, 1) such that, for all probability distributions Q and Q′ on Rξ ,

Dvar (QPx^n , Q′Px^n ) ≤ Mρ(|x|)^n Dvar (Q, Q′).    (13.30)

(H4) The sequence α1 , α2 , . . . is non-increasing, with

Σ_{t=1}^∞ αt = +∞ and Σ_{t=1}^∞ αt² < +∞.    (13.31a)

Let σt = Σ_{s=1}^t αs . If C1 > 0, we also require that

lim_{t→∞} αt σt (1 − ρ(σt ))−1 = 0    (13.31b)

and

Σ_{s=2}^∞ αs² σs (1 − ρ(σs ))−2 < ∞.    (13.31c)

Given this, the following theorem holds.

Theorem 13.10 Assuming (H1) to (H4), the sequence defined by (13.27) is such that

lim_{t→∞} E(|Xt − x∗ |²) = 0.

Remark 13.11 Condition (H1) assumes that H is bounded and uniformly Lipschitz
in x, which is more restrictive than what was assumed in section 3.3.2, but applies,
for example, to situations considered in Younes [209] and later in this book in sec-
tion 18.2.2.

Condition (H3) implies that the Markov chain with transition Px is uniformly geometrically ergodic, but the ergodicity rate may depend on x and it may, in particular, converge to 1 when |x| tends to infinity, which is the situation targeted in this theorem.

The reader may refer to [211] for a general discussion of this problem with re-
laxed hypotheses and almost sure convergence, at the expense of significantly longer
proofs. 
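Before turning to the proof, here is a minimal sketch of a recursion of type (13.27). All ingredients are assumptions made for the example: H(x, ξ) = −x + ξ (so that H̄(x) = −x and x∗ = 0, giving (H2) with µ = 1), and a transition kernel that does not depend on x (an AR(1) step whose invariant distribution is N (0, 1)). The boundedness requirements in (H1) are not literally satisfied by this simple choice; the sketch only illustrates the algorithm, not the theorem's hypotheses.

```python
import numpy as np

def markovian_sa(n_iter=20000, rho=0.8, rng=None):
    """Stochastic approximation (13.27): xi evolves by one Markov transition
    P_x per iteration (here an AR(1) kernel, invariant distribution N(0,1)),
    instead of being drawn independently from pi_x at each step."""
    rng = np.random.default_rng() if rng is None else rng
    x, xi = 5.0, 0.0
    H = lambda x, xi: -x + xi              # so that Hbar(x) = -x and x* = 0
    for t in range(1, n_iter + 1):
        alpha = 1.0 / t                    # step sizes satisfying (13.31a)
        # one step of the transition kernel P_x(xi, .): AR(1), N(0,1)-invariant
        xi = rho * xi + np.sqrt(1 - rho**2) * rng.standard_normal()
        x = x + alpha * H(x, xi)
    return x

if __name__ == "__main__":
    print(markovian_sa(rng=np.random.default_rng(3)))   # close to x* = 0
```

The correlation between consecutive ξ's is exactly what breaks the martingale argument of section 3.3 and motivates the Poisson-equation technique used below.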

Proof We note that, from (13.28a), one has

|Xt − x∗ | ≤ C0 σt + |X0 − x∗ |.    (13.32)

Similarly to section 3.3.2, we let At = |Xt − x∗ |² and at = E(At ). One can then write

At+1 = At + 2αt+1 (Xt − x∗ )T H̄(Xt ) + 2αt+1 (Xt − x∗ )T (H(Xt , ξt+1 ) − H̄(Xt )) + α²t+1 |H(Xt , ξt+1 )|²,

but we do not have

E((Xt − x∗ )T (H(Xt , ξt+1 ) − H̄(Xt )) | Ut ) = 0

anymore, where Ut is the σ-algebra of all past events up to time t (all events depending on Xs , ξs , s ≤ t). Indeed, the Markovian assumption implies that

E((Xt − x∗ )T (H(Xt , ξt+1 ) − H̄(Xt )) | Ut ) = (Xt − x∗ )T ( ∫ H(Xt , ξ) PXt (ξt , dξ) − H̄(Xt ) )
 = (Xt − x∗ )T ((PXt H(Xt , ·))(ξt ) − H̄(Xt )),

which does not vanish in general. Following Benveniste et al. [25], this can be ad-
dressed by introducing the solution g(x, · ) of the “Poisson equation”

g(x, · ) − Px g(x, · ) = h(x, · ). (13.33)

(Recall that h(x, ξ) = H(x, ξ) − H̄(x).) One can then write

(Xt − x∗ )T h(Xt , ξt+1 ) = (Xt − x∗ )T (g(Xt , ξt+1 ) − PXt g(Xt , ξt+1 ))

and

At+1 ≤ (1 − 2αt+1 µ)At + 2αt+1 (Xt − x∗ )T (g(Xt , ξt+1 ) − PXt g(Xt , ξt ))
 + 2αt+1 (Xt − x∗ )T PXt g(Xt , ξt ) − 2αt+1 (Xt − x∗ )T PXt g(Xt , ξt+1 ) + α²t+1 |H(Xt , ξt+1 )|².

Introduce the notation

ηs,t = E((Xs − x∗ )T PXs g(Xs , ξ t )).



Using the fact that

E((Xt − x∗ )T (g(Xt , ξt+1 ) − PXt g(Xt , ξt )) | Ut ) = 0

and noting that |H(Xt , ξt+1 )|² ≤ C0², one finds, after taking expectations,

at+1 ≤ (1 − 2αt+1 µ)at + 2αt+1 ηt,t − 2αt+1 ηt,t+1 + α²t+1 C0².
Applying lemma 3.25, and letting vs,t = Π_{j=s+1}^t (1 − 2αj+1 µ), one gets

at ≤ a0 v0,t + 2 Σ_{s=1}^t vs,t αs+1 (ηs,s − ηs,s+1 ) + C0² Σ_{s=1}^t vs,t α²s+1 .

We now want to ensure that each term in the upper bound converges to 0. Similarly to section 3.3.2, (13.31a) implies that this holds for the first and last terms, and we therefore focus on the middle one, writing

Σ_{s=1}^t vs,t αs+1 (ηs,s − ηs,s+1 ) = v1,t α2 η1,1 − αt+1 ηt,t+1 + Σ_{s=2}^t (vs,t αs+1 − vs−1,t αs )ηs,s + Σ_{s=2}^t vs−1,t αs (ηs,s − ηs−1,s ).    (13.34)

We will need the following estimates on the function g in (13.33), which is defined by

g(x, ξ) = Σ_{n=0}^∞ Px^n h(x, ξ) = h(x, ξ) + Σ_{n=0}^∞ Px^n h̃(x, ξ).

Lemma 13.12 We have

|g(x, ·)| ≤ C0 + 2C1 M(1 − ρ(|x|))−1 ,    (13.35a)
|Px g(x, ·)| ≤ 2C1 M(1 − ρ(|x|))−1 ,    (13.35b)

and, for all x, y ∈ Rd and ξ ∈ Rξ ,

|Px g(x, ξ) − Py g(y, ξ)| ≤ (M²C1 C2 (1 − ρ̄)−2 + MC1 (1 + C2 )(1 − ρ̄)−1 )|x − y|,    (13.36)

with ρ̄ = max(ρ(|x|), ρ(|y|)).

Using lemma 13.12 (which is proved at the end of the section), we can control the terms appearing in (13.34). Note that the first term, v1,t α2 η1,1 , converges to 0 since (13.31a) implies that v1,t converges to 0.

We have

αt+1 |E((Xt − x∗ )T PXt g(Xt , ξt+1 ))| ≤ 2MC1 αt+1 σt (1 − ρ(σt ))−1 ,

so that (13.31b) implies that αt+1 ηt,t+1 → 0.

Since αs+1 ≤ αs , we have

Σ_{s=2}^t (vs,t αs+1 − vs−1,t αs )ηs,s ≤ Σ_{s=2}^t |vs−1,t αs − vs,t αs+1 | |ηs,s |
 ≤ MC1 Σ_{s=2}^t |vs−1,t αs − vs,t αs+1 | αs+1 σs (1 − ρ(σs ))−1
 ≤ C Σ_{s=2}^t |vs−1,t αs − vs,t αs+1 |

for some constant C, since αs+1 σs (1 − ρ(σs ))−1 is bounded. Writing

vs,t αs+1 − vs−1,t αs = vs,t (αs+1 − αs + 2µαs αs+1 ),

we get (using αs+1 ≤ αs )


Σ_{s=2}^t |vs−1,t αs − vs,t αs+1 | ≤ Σ_{s=2}^t vs,t (αs − αs+1 ) + 2µ Σ_{s=2}^t vs,t αs².

Since both Σs (αs − αs+1 ) and Σs αs² converge (the former sum telescopes), lemma 3.26 implies that

Σ_{s=2}^t (vs,t αs+1 − vs−1,t αs )ηs,s

tends to zero. The last term to consider is


Σ_{s=2}^t vs−1,t αs (ηs,s − ηs−1,s ) = Σ_{s=2}^t vs−1,t αs E((Xs − Xs−1 )T PXs g(Xs , ξs ))
 + Σ_{s=2}^t vs−1,t αs E((Xs−1 − x∗ )T (PXs g(Xs , ξs ) − PXs−1 g(Xs−1 , ξs ))).

We have

Σ_{s=2}^t vs−1,t αs E((Xs − Xs−1 )T PXs g(Xs , ξs )) ≤ 2C0 C1 M Σ_{s=2}^t vs−1,t αs² (1 − ρ(σs ))−1

and

Σ_{s=2}^t vs−1,t αs E((Xs−1 − x∗ )T (PXs g(Xs , ξs ) − PXs−1 g(Xs−1 , ξs )))
 ≤ 2M²C0 C1 (1 + C2 )|X0 − x∗ | Σ_{s=2}^t vs−1,t αs² σs (1 − ρ(σs ))−2

and lemma 3.26 implies that both terms vanish at infinity. This concludes the proof
of theorem 13.10. 

Proof (Proof of lemma 13.12) Condition (H3) and proposition 12.3 imply that (since πx h̃ = 0)

|Px^n h̃(x, ξ)| ≤ Dvar (Px^n (ξ, ·), πx ) osc(h̃(x, ·)) ≤ 2C1 Mρ(|x|)^n ,

so that g is well defined, with

|g(x, ·)| ≤ C0 + 2C1 M(1 − ρ(|x|))−1 ,
|Px g(x, ·)| ≤ 2C1 M(1 − ρ(|x|))−1 .

We will also need to control differences of the kind

Px g(x, ξ) − Py g(y, ξ).

We consider the nth term in the series, writing

Px^n h̃(x, ξ) − Py^n h̃(y, ξ) = Σ_{k=0}^{n−1} (Px^{n−k} Py^k h̃(y, ξ) − Px^{n−k−1} Py^{k+1} h̃(y, ξ)) + Px^n h̃(x, ξ) − Px^n h̃(y, ξ).

This gives

Px^n h̃(x, ξ) − Py^n h̃(y, ξ) = Σ_{k=0}^{n−1} Px^{n−k−1} (Px Py^k h̃(y, ξ) − Py^{k+1} h̃(y, ξ) − πx Py^k h̃(y) + πx Py^{k+1} h̃(y))
 + Σ_{k=0}^{n−1} (πx Py^k h̃(y) − πx Py^{k+1} h̃(y)) + Px^n h̃(x, ξ) − Px^n h̃(y, ξ)
 = Σ_{k=0}^{n−1} Px^{n−k−1} (Px Py^k h̃(y, ξ) − Py^{k+1} h̃(y, ξ) − πx Py^k h̃(y) + πx Py^{k+1} h̃(y))
 + πx h̃(y) − πx Py^n h̃(y) + Px^n h̃(x, ξ) − Px^n h̃(y, ξ).

Finally,

Px^n h(x, ξ) − Py^n h(y, ξ) = Σ_{k=0}^{n−1} Px^{n−k−1} (Px Py^k h̃(y, ξ) − Py^{k+1} h̃(y, ξ) − πx Py^k h̃(y) + πx Py^{k+1} h̃(y))
 + Px^n (h̃(x, ξ) − h̃(y, ξ) + πx h̃(y)) − (πx − πy )Py^n h̃(y).

Using proposition 12.3, we can write, letting ρ̄ = max(ρ(|x|), ρ(|y|)),

|Px^{n−k−1} (Px Py^k h̃(y, ξ) − Py^{k+1} h̃(y, ξ) − πx Py^k h̃(y) + πx Py^{k+1} h̃(y))|
 ≤ M ρ̄^{n−k−1} osc(Px Py^k h̃(y, ·) − Py^{k+1} h̃(y, ·))
 ≤ C2 M ρ̄^{n−k−1} |x − y| osc(Py^k h̃(y, ·))
 ≤ C2 C1 M² ρ̄^{n−1} |x − y|.

We also have

|Px^n (h̃(x, ξ) − h̃(y, ξ) + πx h̃(y))| ≤ MC1 ρ̄^n |x − y|

and

|(πx − πy )Py^n h̃(y)| ≤ MC2 C1 ρ̄^n |x − y|,

so that

|Px^n h(x, ξ) − Py^n h(y, ξ)| ≤ MC1 ρ̄^{n−1} (nMC2 + (1 + C2 )ρ̄)|x − y|.
From this, it follows that

|Px g(x, ξ) − Py g(y, ξ)| ≤ Σ_{n=1}^∞ MC1 ρ̄^{n−1} (nMC2 + (1 + C2 )ρ̄)|x − y|
 ≤ (M²C1 C2 (1 − ρ̄)−2 + MC1 (1 + C2 )(1 − ρ̄)−1 )|x − y|. □
Chapter 14

Markov Random Fields

With this chapter, we start a discussion of large-scale statistical models in data sci-
ence, starting with graphical models (Markov random fields and Bayesian networks)
before discussing more recent approaches using, notably, deep learning. Impor-
tant textbook references for the present chapter include Pearl [152], Ancona et al.
[8], Winkler [206], Lauritzen [115], Cowell et al. [55], Koller and Friedman [109].

14.1 Independence and conditional independence

14.1.1 Definitions

We consider random variables X, Y , Z . . ., and denote by RX , RY , RZ . . . the sets in


which they take their values. We discuss in this section concepts of independence
and conditional independence between random variables. To simplify the exposi-
tion, we will work (unless mentioned otherwise) with discrete random variables (X
is discrete if RX is finite or countable)1 . We start with a basic definition.
Definition 14.1 Two discrete random variables X : Ω → RX and Y : Ω → RY are inde-
pendent if and only if

∀x ∈ RX , ∀y ∈ RY : P(X = x, Y = y) = P(X = x)P(Y = y).

The general definition for arbitrary r.v.’s is that

E(f (X)g(Y )) = E(f (X)) E(g(Y ))


for any pair of (measurable) non-negative functions f : RX → [0, +∞) and g : RY →
[0, +∞).
1 In the general case, RX , RY , . . . are metric spaces with a countable dense subset, endowed with σ-algebras SX , SY , . . .


One can easily check that X and Y are independent if and only if, for any non-
negative function g : RY → R, one has

E(g(Y ) | X) = E(g(Y )).

Notation 14.2 Independence is a property that involves two variables X and Y and
an underlying probability distribution P. Independence of X and Y relative to P will
be denoted (XyY )P . However we will only write XyY when there is no ambiguity
on P. 

More than independence, the concept of conditional independence will be fun-


damental in this chapter. It requires three variables, say X, Y , Z. Returning to the
discrete case, one says that X and Y are conditionally independent given Z is, for
any x ∈ RX , y ∈ RY and z ∈ RZ such that P(Z = z) > 0,

P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z). (14.1)

An equivalent statement is that, for any z such that P(Z = z) ≠ 0, X and Y are inde-
pendent when P is replaced by the conditional distribution P(· | Z = z).

In the general case conditional independence means that, for any pair of non-
negative measurable functions f and g,

E(f (X)g(Y ) | Z) = E(f (X) | Z) E(g(Y ) | Z). (14.2)

From now on, we restrict our discussion to discrete random variables.

Multiplying both sides of (14.1) by P(Z = z)² , we get the equivalent statement:
X and Y are conditionally independent given Z if and only if,

∀x, y, z : P(X = x, Y = y, Z = z)P(Z = z) = P(X = x, Z = z) P(Y = y, Z = z). (14.3)

Note that the identity is meaningful, and always true, for P(Z = z) = 0, so that this
case does not need to be excluded anymore.
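On finite state spaces, (14.3) can be checked directly from the joint probability table. A minimal numpy sketch (axes 0, 1, 2 of the array index the values of X, Y and Z; all names are choices made for the example):

```python
import numpy as np

def cond_indep(P, tol=1e-12):
    """Check (14.3): X and Y conditionally independent given Z, where
    P[x, y, z] = P(X=x, Y=y, Z=z)."""
    Pz = P.sum(axis=(0, 1))          # P(Z=z)
    Pxz = P.sum(axis=1)              # P(X=x, Z=z)
    Pyz = P.sum(axis=0)              # P(Y=y, Z=z)
    lhs = P * Pz[None, None, :]      # P(x, y, z) P(z)
    rhs = Pxz[:, None, :] * Pyz[None, :, :]
    return np.max(np.abs(lhs - rhs)) < tol

if __name__ == "__main__":
    # Build a joint law where X and Y are independent given Z, by construction.
    rng = np.random.default_rng(0)
    pz = np.array([0.3, 0.7])
    px_z = rng.uniform(size=(2, 2)); px_z /= px_z.sum(axis=0)   # P(x | z)
    py_z = rng.uniform(size=(2, 2)); py_z /= py_z.sum(axis=0)   # P(y | z)
    P = px_z[:, None, :] * py_z[None, :, :] * pz[None, None, :]
    print(cond_indep(P))             # True
```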

Conditional independence can be interpreted by the statement that X brings no more information on Y than what is already provided by Z: one has

P(Y = y | X = x, Z = z) = P(Y = y, X = x, Z = z)/P(X = x, Z = z) = P(Y = y, Z = z)/P(Z = z) = P(Y = y | Z = z),

as directly deduced from (14.3). (This computation is valid as soon as P(X = x, Z = z) > 0.)
Notation 14.3 To indicate that X and Y are conditionally independent given Z for
the distribution P, we will write (XyY | Z)P or simply (XyY | Z). 

So we have the equivalence:

(XyY | Z)P ⇔ (∀z : P(Z = z) > 0 ⇒ (XyY )P(·|Z=z) ).

Absolute independence is like “independence conditional to no variable”, and we


will use the notation ∅ for the “empty” random variable that contains no information
(for example, a set-valued random variable that always returns the empty set, or any
constant variable). So we have the tautology

XyY ⇔ (XyY | ∅).

Note that, dealing with discrete variables, all previous definitions automatically
extend to groups of variables: for example, if Z1 , Z2 are two discrete variables, so
is Z = (Z1 , Z2 ) and we immediately obtain a definition for the conditional indepen-
dence of X and Y given Z1 and Z2 , denoted (XyY | Z1 , Z2 ).

14.1.2 Fundamental properties

Proposition 14.5 below lists important properties of conditional independence that


will be used repeatedly in this chapter. Before stating this proposition, we need the
following definition.
Definition 14.4 One says that the joint distribution of the random variables (X1 , . . . , XN ) is positive if there exist subsets R̃k ⊂ RXk , k = 1, . . . , N , such that P(Xk ∈ R̃k ) = 1 and

P(X1 = x1 , . . . , XN = xN ) > 0

if xk ∈ R̃k , k = 1, . . . , N .

Note that the condition implies P(Xk = xk ) > 0 for all xk ∈ R̃k , so that R̃k = {xk ∈
RXk : P(Xk = xk ) > 0}, i.e., R̃k is the support of PXk . One can interpret the definition
as expressing the fact that any conjunction of events for different Xk ’s has positive
probability, as soon as each of them has positive probability (if all events may occur,
then they may occur together).

Note that the sets R̃k depend on X1 , . . . , XN . However, if this family of variables is fixed, there is no loss of generality in restricting the space RXk to R̃k and therefore assuming that P(X1 = x1 , . . . , XN = xN ) > 0 everywhere.
Proposition 14.5 Let X, Y , Z and W be random variables. The following properties are
true.

(CI1) Symmetry: (XyY | Z) ⇒ (Y yX | Z).



(CI2) Decomposition: (Xy(Y , W ) | Z) ⇒ (XyY | Z).


(CI3) Weak union: (Xy(Y , W ) | Z) ⇒ (XyY | (Z, W )).
(CI4) Contraction: (XyY | Z) and (XyW | (Z, Y )) ⇒ (Xy(Y , W ) | Z).
(CI5) Intersection: assume that the joint distribution of W , Y and Z is positive. Then

(XyW | (Z, Y )) and (XyY | (Z, W )) ⇒ (Xy(Y , W ) | Z).

Proof Properties (CI1) and (CI2) are easily deduced from (14.3) and left to the
reader. To prove the last three, we will use the notation P (x), P (x, y) etc. instead
of P(X = x), P(X = x, Y = y), etc. to save space. Identities are assumed to hold for all
x, y, z, w unless stated otherwise.

For (CI3), we must prove, according to (14.3), that

P (x, y, z, w)P (z, w) = P (x, z, w)P (y, z, w) (14.4)

whenever P (x, y, z, w)P (z) = P (x, z)P (y, z, w). Summing this last equation over y (or
applying (CI2)) yields P (x, z, w)P (z) = P (x, z)P (z, w). We can note that all terms in
(14.4) vanish when P (z) = 0, so that the identity is true in this case. When P (z) ≠ 0,
the right-hand side of (14.4) becomes

(P (x, z)P (z, w)/P (z))P (y, z, w) = (P (x, z)P (y, z, w)/P (z))P (z, w) = P (x, y, z, w)P (z, w),

using once again the hypothesis. This proves (CI3).

For (CI4), the hypotheses are

P (x, y, z)P (z) = P (x, z)P (y, z)
P (x, y, z, w)P (y, z) = P (x, y, z)P (y, z, w)

and the conclusion must be

P (x, y, z, w)P (z) = P (x, z)P (y, z, w). (14.5)

Since (14.5) is true when P (y, z) = 0, we assume that this probability does not vanish
and write

P (x, y, z, w)P (z) = P (x, y, z)P (z)P (y, z, w)/P (y, z)


= P (x, z)P (y, z)P (y, z, w)/P (y, z)
= P (x, z)P (y, z, w)

yielding (14.5).

For (CI5), assuming

P (x, y, z, w)P (y, z) = P (x, y, z)P (y, z, w)
P (x, y, z, w)P (z, w) = P (x, z, w)P (y, z, w),    (14.6)

we want to show that


P (x, y, z, w)P (z) = P (x, z)P (y, z, w).
Since this identity is true when any of the events W = w, Y = y or Z = z has zero
probability, we can assume that their probabilities are positive, which, by assump-
tion, also implies that all joint probabilities are positive. From the two identities, we
get
P (x, y, z, w)/P (y, z, w) = P (x, y, z)/P (y, z) = P (x, z, w)/P (z, w)
This implies
P (x, y, z) = P (y, z)P (x, z, w)/P (z, w)
that we can sum over y to obtain
P (x, z) = P (z)P (x, z, w)/P (z, w)
We therefore get
P (x, y, z, w)/P (y, z, w) = P (x, z, w)/P (z, w) = P (x, z)/P (z),
which is what we wanted. 

A counterexample to (CI5) when the positivity assumption is not satisfied can be built as follows: let X be a Bernoulli random variable, and let Y = W = X. Let Z be any Bernoulli random variable, independent of X. Given Z and W , X and Y are constant and therefore independent. Similarly, given Z and Y , X and W are constant and therefore independent. However, given Z, X and (Y , W ) are not independent (they are equal and non-constant).

14.1.3 Mutual independence

Another concept of interest is the mutual (conditional) independence of more than


two random variables. The random variables (X1 , . . . , Xn ) are mutually conditionally
independent given Z if and only if
E(f1 (X1 ) · · · fn (Xn ) | Z) = E(f1 (X1 ) | Z) · · · E(fn (Xn ) | Z)
for any non-negative measurable functions f1 , . . . , fn . In terms of discrete probabili-
ties, this can be written as

P (X1 = x1 , . . . , Xn = xn , Z = z)P (Z = z)n−1 =


P (X1 = x1 , Z = z) · · · P (Xn = xn , Z = z).

This will be summarized with the notation

(X1 y · · · yXn | Z).

We have the proposition

Proposition 14.6 For variables X1 , . . . , Xn and Z, the following properties are equivalent.

(i) (X1 y · · · yXn | Z);


(ii) For all S, T ⊂ {1, . . . , n} with S ∩ T = ∅, we have: ((Xi , i ∈ S)y(Xj , j ∈ T ) | Z);
(iii) For all s ∈ {1, . . . , n}, we have: (Xs y(Xt , t ≠ s) | Z);
(iv) For all s ∈ {2, . . . , n}, we have: (Xs y(X1 , . . . , Xs−1 ) | Z).

Proof It is clear that (i) ⇒ · · · ⇒ (iv), so it suffices to prove that (iv) ⇒ (i). For this, simply write (applying (iv) repeatedly to s = n − 1, n − 2, . . .)

E(f1 (X1 ) · · · fn (Xn ) | Z) = E(f1 (X1 ) · · · fn−1 (Xn−1 ) | Z) E(fn (Xn ) | Z)
 = E(f1 (X1 ) · · · fn−2 (Xn−2 ) | Z) E(fn−1 (Xn−1 ) | Z) E(fn (Xn ) | Z)
 = · · ·
 = E(f1 (X1 ) | Z) · · · E(fn (Xn ) | Z). □

14.1.4 Relation with Information Theory

Several concepts in information theory are directly related to independence between random variables. Recall that the (Shannon) entropy of a discrete probability distribution over a finite set R is defined by

H(P ) = − Σ_{ω∈R} P (ω) log P (ω).    (14.7)

Similarly, the entropy of a random variable X : Ω → RX is defined by

H(X) = H(PX ) = − Σ_{x∈RX} P(X = x) log P(X = x).    (14.8)

The entropy is always non-negative, and provides a measure of the uncertainty as-
sociated to P . For a given finite set R, it is maximal when P is uniform over R, and
minimal (and vanishes) when P is supported by a single ω ∈ R (i.e. P (ω) = 1).

One defines the entropy of two or more random variables as the entropy of their joint distribution, so that, for example,

H(X, Y ) = − Σ_{(x,y)∈RX ×RY} P(X = x, Y = y) log P(X = x, Y = y).

We have the proposition:


Proposition 14.7 For random variables X1 , . . . , Xn , one has
H(X1 , . . . , Xn ) ≤ H(X1 ) + · · · + H(Xn )
with equality if and only if (X1 , . . . , Xn ) are mutually independent.
Proof The proof of this proposition uses properties of the Kullback-Leibler divergence (c.f. (4.3)), given by, for two probability distributions π and π′ on a finite set B,

KL(π‖π′) = Σ_{ω∈B} π(ω) log (π(ω)/π′(ω)),

with the convention π log(π/π′) = 0 if π = 0 and = ∞ if π > 0 and π′ = 0. Returning to proposition 14.7, a straightforward computation (which is left to the reader) shows that

H(X1 ) + · · · + H(Xn ) − H(X1 , . . . , Xn ) = KL(π‖π′)

with π(x1 , . . . , xn ) = P(X1 = x1 , . . . , Xn = xn ) and π′(x1 , . . . , xn ) = Π_{k=1}^n P(Xk = xk ). This makes proposition 14.7 a direct consequence of proposition 4.1. □

The mutual information between two random variables X and Y is defined by

I (X, Y ) = H(X) + H(Y ) − H(X, Y ).    (14.9)

From proposition 14.7, I (X, Y ) is non-negative and vanishes if and only if X and Y are independent. Also from the proof of proposition 14.7, I (X, Y ) is equal to KL(P(X,Y )‖PX ⊗ PY ), where the first probability is the joint distribution of X and Y and the second one the product of the marginals of X and Y , which coincides with P(X,Y ) if and only if X and Y are independent.

If X and Y are two random variables, and y ∈ RY with P(Y = y) > 0, the entropy of the conditional probability x 7→ P(X = x | Y = y) is denoted H(X | Y = y), and is a function of y. The conditional entropy of X given Y , denoted H(X | Y ), is the expectation of H(X | Y = y) for the distribution of Y , i.e.,

H(X | Y ) = Σ_{y∈RY} H(X | Y = y)P(Y = y)
 = − Σ_{x∈RX} Σ_{y∈RY} P(X = x, Y = y) log P(X = x | Y = y).

So, we have (with a straightforward proof)



Proposition 14.8 Given two random variables X and Y , we have

H(X | Y ) = −EX,Y (log P(X = · | Y = ·))    (14.10)
 = H(X, Y ) − H(Y ).
This proposition also immediately yields:

I (X, Y ) = H(X) − H(X | Y ) = H(Y ) − H(Y | X). (14.11)
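These definitions and identities are easy to check numerically on a finite joint distribution; a minimal sketch (the distribution is randomly generated for the example):

```python
import numpy as np

def H(p):
    """Shannon entropy (14.7) of a probability array, in nats."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Pxy = rng.uniform(size=(3, 4)); Pxy /= Pxy.sum()   # joint law of (X, Y)
    Px, Py = Pxy.sum(axis=1), Pxy.sum(axis=0)          # marginals
    H_cond = H(Pxy) - H(Py)                 # H(X | Y), proposition 14.8
    I = H(Px) + H(Py) - H(Pxy)              # mutual information (14.9)
    print(H_cond, I, H(Px) - H_cond)        # the last two agree, as in (14.11)
```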

The identity H(X, Y ) = H(X | Y ) + H(Y ) that is deduced from proposition 14.8 can be
generalized to more than two random variables (the proof being left to the reader),
yielding, if X1 , . . . , Xn are random variables:
H(X1 , . . . , Xn ) = Σ_{k=1}^n H(Xk | X1 , . . . , Xk−1 ).    (14.12)

If Z is an additional random variable, the following identity is obtained by ap-


plying the previous one to conditional distributions given Z = z and taking averages
over z:
n
X
H(X1 , . . . , Xn | Z) = H(Xk | X1 , . . . , Xk−1 , Z). (14.13)
k=1

The following proposition characterizes conditional independence in terms of


entropy.

Proposition 14.9 Let X, Y and Z be three random variables. The following statements
are equivalent.

(i) X and Y are conditionally independent given Z.


(ii) H(X, Y | Z) = H(X | Z) + H(Y | Z)
(iii) H(X | Y , Z) = H(X | Z)

Moreover, when (i) to (iii) are satisfied, we have:

(iv) I (X, Y ) ≤ min(I (X, Z), I (Y , Z)).

Proof From proposition 14.7, we have, for any three random variables X, Y , Z, and
any z such that P (Z = z) > 0,

H(X, Y | Z = z) ≤ H(X | Z = z) + H(Y | Z = z).



Taking expectations on both sides implies the important inequality

H(X, Y | Z) ≤ H(X | Z) + H(Y | Z) (14.14)

and equality occurs if and only if P(X = x, Y = y | Z = z) = P(X = x | Z = z)P(Y =


y | Z = z) whenever P(Z = z) > 0, that is, if and only if X and Y are conditionally
independent given Z. This proves that (i) and (ii) are equivalent. The fact that
(ii) and (iii) are equivalent comes from (14.13), which gives, for any three random
variables
H(X, Y | Z) = H(X | Y , Z) + H(Y | Z). (14.15)

To prove that (i)-(iii) implies (iv), we note that (14.14) and (14.15) imply that, for
any three random variables:

H(X | Y , Z) ≤ H(X | Y ).

If X and Y are conditionally independent given Z, then the left-hand side is equal to H(X | Z), and this yields

I (X, Y ) = H(X) − H(X | Y ) ≤ H(X) − H(X | Z) = I (X, Z).

By symmetry, we must also have I (X, Y ) ≤ I (Y , Z) so that (iv) is true. 

Statement (iv) is often called the data-processing inequality, and has been used to
infer conditional independence within gene networks [126].

14.2 Models on undirected graphs

14.2.1 Graphical representation of conditional independence

An undirected graph is a collection of vertexes and edges, in which edges link pairs
of vertexes without order. Edges can therefore be identified to subsets of cardinality
two of the set of vertexes, V . This yields the definition:

Definition 14.10 An undirected graph G is a pair G = (V , E) where V is a finite set of


vertexes and elements e ∈ E are subsets e = {s, t} ⊂ V .

Note that edges in undirected graphs are defined as sets, i.e., unordered pairs, which
are delimited with braces in these notes. Later on, we will use parentheses to repre-
sent ordered pairs, (s, t) ≠ (t, s). We will write s ∼G t, or simply s ∼ t to indicate that s
and t are connected by an edge in G (we also say that s and t are neighbors in G).

Definition 14.11 A path in an undirected graph G = (V , E) is a finite sequence (s0 , . . . , sN ) of vertexes such that {sk−1 , sk } ∈ E for k = 1, . . . , N . (A sequence, (s0 ), of length 1 is also a path by extension.)

We say that s and t are connected by a path if either s = t or there exists a path (s0 , . . . , sN ) such that s0 = s and sN = t.

A subset S ⊂ V is connected if any pair of elements in S can be connected by a path.

A subset T ⊂ V separates two other subsets S and S′ if all paths between S and S′ must pass in T . We will write (SyS′ | T ) in such a case.
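Separation can be tested algorithmically: (SyS′ | T ) holds exactly when S ∩ S′ ⊂ T and no vertex of S′ can be reached from S by a breadth-first search in the graph from which the vertexes of T have been removed. A minimal sketch (the encoding of V and E is a choice made for the example):

```python
from collections import deque

def separates(V, E, S, Sp, T):
    """Check (S y S' | T): every path between S and S' meets T
    (definition 14.11). V: vertexes, E: set of frozensets {s, t}."""
    S, Sp, T = set(S), set(Sp), set(T)
    if not (S & Sp) <= T:            # otherwise a length-1 path (s) avoids T
        return False
    adj = {v: set() for v in V}
    for e in E:
        s, t = tuple(e)
        adj[s].add(t); adj[t].add(s)
    # breadth-first search from S \ T in the graph with T removed
    seen, queue = set(S - T), deque(S - T)
    while queue:
        v = queue.popleft()
        if v in Sp:                  # reached S' without meeting T
            return False
        for w in adj[v] - T:
            if w not in seen:
                seen.add(w); queue.append(w)
    return True

if __name__ == "__main__":
    V = range(1, 6)                  # the chain 1 - 2 - 3 - 4 - 5
    E = {frozenset((k, k + 1)) for k in range(1, 5)}
    print(separates(V, E, {1}, {5}, {3}))      # True
    print(separates(V, E, {1}, {5}, set()))    # False
```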

One of the goals of this chapter is to relate the notion of conditional indepen-
dence within a set of variables to separation in a suitably chosen undirected graph
with vertexes in one-to-one correspondence with the variables. This will also justify the similarity of notation used for separation and conditional independence.

We have the following simple fact:


Lemma 14.12 Let G = (V , E) be an undirected graph, and S, S 0 , T ⊂ V . Then

(SyS 0 | T ) ⇒ S ∩ S 0 ⊂ T .

Indeed, if (SyS 0 | T ) and s0 ∈ S ∩ S 0 , the path (s0 ) links S and S 0 and therefore must
pass in T .

Proposition 14.5 translates into similar properties for separation:


Proposition 14.13 Let (V , E) be an undirected graph and S, T , U , R be subsets of V . The
following properties hold

(i) (SyT |U ) ⇔ (T yS|U ).


(ii) (SyT ∪ R|U ) ⇒ (SyT |U ).
(iii) (SyT ∪ R|U ) ⇒ (SyT |U ∪ R).
(iv) (SyT | U ) and (SyR | U ∪ T ) ⇔ (SyT ∪ R | U ).
(v) U ∩ R = ∅, (SyR | U ∪ T ) and (SyU | T ∪ R) ⇒ (SyU ∪ R | T ).
Proof (i) is obvious, and for (ii) (and (iii)), if any path between S and T ∪ R must
pass by U , the same is obviously true for a path between S and T .

For the ⇒ part of (iv), if a path links S and T ∪ R, then it either links S and T
and must pass through U by the first assumption, or link S and R and therefore pass
through U or T by the second assumption. But if the path passes through T , it must

also pass through U before by the first assumption. In all cases, the path passes
through U . The ⇐ part of (iv) is obvious.

Finally, consider (v) and take a path between two distinct elements in S and U ∪R.
Consider the first time the path hits U or R, and assume that it hits U (the other
case being treated similarly by symmetry). Notice that the path cannot hit both U
and R at the same point since U ∩ R = ∅. From the assumptions, the path must hit
T ∪ R before passing by U , and the intersection cannot be in R, so it is in T , which is
the conclusion we wanted. 

To make a connection between separation in graphs and conditional indepen-


dence between random variables, we consider a graph G = (V , E) and a family of
random variables (X (s) , s ∈ V ) indexed by V . Each variable is assumed to take values
in a set Fs = RX (s) . The collection of values taken by the random variables will be
called configurations, and the sets Fs , s ∈ V are called the state spaces.

Letting F denote the collection (Fs , s ∈ V ), we will denote the set of such configu-
rations as F (V , F ). When F is clear from the context, we will just write F (V ). If S ⊂ V
and x ∈ F (V , F ), the restriction of x to S is denoted x(S) = (x(s) , s ∈ S). The set formed
by those restrictions will be denoted F (S, F ) (or just F (S)).

Remark 14.14 Some care needs to be given to the definition of the space of con-
figurations, to avoid ambiguities when two sets Fs coincide. The configuration x =
(x(s) , s ∈ V ) should be understood, in an emphatic way, as the collection x̂ = ((s, x(s) ), s ∈
V ), which makes explicit the fact that x(s) is the value observed at vertex s. Similarly
the emphatic notation for x(S) ∈ F (S, F ) is x̂(S) = ((s, x(s) ), s ∈ S).

In the following, we will not use the emphatic notation to avoid overly heavy
expressions, but its relevance should be clear with the following simple example.
Take V = {1, 2, 3} and F1 = F2 = F3 = {0, 1}. Let x(1) = 0, x(2) = 0 and x(3) = 1. Then
the sub-configurations x({1,3}) and x({2,3}) both correspond to values (0, 1), but we consider them as distinct. In the same spirit, x(1) = x(2) , but x({1}) ≠ x({2}) . 

If S, T ⊂ V with S ∩ T = ∅, x(S) ∈ F (S, F ), y (T ) ∈ F (T , F ), we will denote their


concatenation by x(S) ∧ y (T ) , which is the configuration z = (zs , s ∈ S ∪ T ) ∈ F (T ∪ S, F )
such that z(s) = x(s) if s ∈ S and z(s) = y (s) if s ∈ T .

We define a random field over V as a random configuration X : Ω → F (V , F ), that


we will denote for short X = (X (s) , s ∈ V ). If S ⊂ V , the restriction X (S) will also be
denoted (X (s) , s ∈ S).

We can now write the definition:



Definition 14.15 Let G = (V , E) be an undirected graph and X = (X (s) , s ∈ V ) a random


field over V . We say that X is Markov (or has the Markov property) relative to G (or is
G-Markov, or is a Markov random field on G) if and only if, for all S, T , U ⊂ V :

(SyT | U ) ⇒ (X (S) yX (T ) | X (U ) ). (14.16)

Letting the observation over an empty set S be empty, i.e., X∅ = ∅, this definition in-
cludes the statement that, if S and T are disconnected (i.e., there is no path between
them: they are separated by the empty set), then (X (S) yX (T ) | ∅): X (S) and X (T ) are
independent.

We will say that a probability distribution π on F (V ) is G-Markov if its associated


canonical random field X = (X (s) , s ∈ V ) defined on Ω̃ = F (V ) by X (s) (x) = x(s) is G-
Markov.

14.2.2 Reduction of the Markov property

We now proceed, in a series of steps, to a simplification of definition 14.15 in order


to obtain a minimal number of conditional independence statements. Note that, in
its current form, definition 14.15 requires checking (14.16) for any three subsets of
V , which provides a huge number of conditions. Fortunately, as we will see, these
conditions are not independent, and checking a much smaller number of them will
ensure that all of them are true.

The first step for our reduction is provided by the following lemma.
Lemma 14.16 Let G = (V , E) be an undirected graph and X = (Xs , s ∈ V ) a set of random
variables indexed by V . Then X is G-Markov if and only if, for S, T , U ⊂ V ,

S ∩ U = T ∩ U = ∅ and (SyT | U ) ⇒ (X (S) yX (T ) | X (U ) ). (14.17)


Proof Assume that (14.17) is true, and take any S, T , U with (SyT | U ). Let A =
S ∩ U , B = T ∩ U and C = A ∪ B. Partition S in S = S1 ∪ A, T in T1 ∪ B and U in
U1 ∪ C. From (SyT | U ), we get (S1 yT1 | U ). Since S1 ∩ U = T1 ∩ U = ∅, this implies
(X (S1 ) yX (T1 ) | X (U ) ). But this implies ((X (S1 ) , X (A) )y(X (T1 ) , X (B) ) | X (U ) ). Indeed, this
property requires

PX (x(S1 ) ∧ x(A) ∧ x(T1 ) ∧ x(B) ∧ x(U1 ) ∧ y (C) )PX (x(U1 ) ∧ y (C) )
 = PX (x(S1 ) ∧ x(A) ∧ x(U1 ) ∧ y (C) )PX (x(T1 ) ∧ x(B) ∧ x(U1 ) ∧ y (C) ).

If the configurations x(A) , x(B) , y (C) are not consistent (i.e., x(t) ≠ y (t) for some t ∈ C),
then both sides vanish. So we can assume x(C) = y (C) and remove x(A) and x(B) from
the expression, since they are redundant. The resulting identity is true since it ex-
actly states that (X (S1 ) yX (T1 ) | X (U ) ). 

Define the set of neighbors of s ∈ V (relative to the graph G) as the set of t ≠ s such that {s, t} ∈ E and denote this set by Vs . For S ⊂ V define also

VS = S c ∩ ∪_{s∈S} Vs ,

which is the set of neighbors of all vertexes in S that do not belong to S. (Here S c
denotes the complementary set of S, S c = V \ S.) Finally, let WS denote the vertexes
that are “remote” from S, WS = (S ∪ VS )c .

We have the following important reduction of the condition in definition 14.15.


Proposition 14.17 X is Markov relative to G if and only if, for any S ⊂ V ,

(X (S) yX (WS ) | X (VS ) ). (14.18)

This says that

P(X (S) = x(S) | X (S^c) = x(S^c) )

only depends (when defined) on variables x(t) for t ∈ S ∪ VS .
Proof First note that (SyWS | VS ) is always true, since any path reaching S from WS must pass through VS . This immediately proves the “only if” part of the proposition.

Consider now the “if” part. Take S, T , U such that (SyT | U ). We want to prove
that (XS yXT | XU ). According to lemma 14.16, we can assume, without loss of gen-
erality, that S ∩ U = T ∩ U = ∅.

Define R as the set of vertexes v in V such that there exists a path between S and
v that does not pass in U . Then:

1. S ⊂ R: the path (s) for s ∈ S does not pass in U since S ∩ U = ∅.


2. U ∩ R = ∅ by definition.
3. VR ⊂ U : assume that there exists a point r in VR which is not in U . Then r has a neighbor, say r′, in R. By definition of R, there exists a path from S to r′ that does not hit U , and this path can obviously be extended by adding r at the end to obtain a path that still does not hit U . But this implies that r ∈ R, which contradicts the fact that VR ∩ R = ∅.

4. T ∩ (R ∪ VR ) = ∅: if t ∈ T , then t ∉ R from (SyT | U ) and t ∉ VR from T ∩ U = ∅.

We can then write (each decomposition being a partition, implicitly defining the
sets A, B and C, see Fig. 14.1) R = S ∪ A, U = VR ∪ C, (R ∪ VR )c = T ∪ C ∪ B, and from
(X (R) yX (WR ) | X (VR ) ), we get

((X (S) , X (A) )y(X (T ) , X (C) , X (B) ) | X (VR ) )



Figure 14.1: See proof of proposition 14.17 for details

which implies
((X (S) , X (A) )y(X (T ) , X (B) ) | X (U ) )
by (CI3), which finally implies (X (S) yX (T ) | X (U ) ) by (CI2). 

For positive probabilities, it suffices to consider singletons in proposition 14.17.


Proposition 14.18 If the joint distribution of (X (s) , s ∈ V ) is positive and, for any s ∈ V ,

(X (s) yX (Ws ) | X (Vs ) ), (14.19)

then X is Markov relative to G. The converse statement is true without the positivity
assumption.
Proof It suffices to prove that, if (14.18) is true for S and T ⊂ V , with T ∩ S = ∅, it is
also true for S ∪ T . The result will then follow by induction.

So, let U = VS∪T and R = WS∪T = V \ (S ∪ T ∪ U ). Then, we have

(X (S) yX (WS ) | X (VS ) ) ⇒ (X (S) yX (R) | (X (U ) , X (T ) ))

because R ⊂ WS (if s ∈ VS , then it is either in U or in T and therefore cannot be in


R). Similarly, (X (T ) yX (R) | (X (U ) , X (S) )), and (CI5) (for which we need P positive) now
implies ((X (T ) , X (S) )yX (R) | X (U ) ). 

To see that the positivity assumption is needed, consider the following example with
six variables X (1) , . . . , X (6) , and a graph linking consecutive integers and closing with

an edge between 1 and 6. Assume that X (1) = X (2) = X (4) = X (5) , and that X (1) , X (3)
and X (6) are independent. Then (14.19) is true, since, for k = 1, 2, 4, 5, X (k) is constant
given its neighbors, and X (3) (resp. X (6) ) is independent of the rest of the variables.
But (X (1) , X (2) ) is not independent of (X (4) , X (5) ) given the neighbors X (3) , X (6) .

Finally, another statement equivalent to proposition 14.18 is the following:


Proposition 14.19 If the joint distribution of (X (s) , s ∈ V ) is positive and, for any s, t ∈ V ,

s ≁G t ⇒ (X (s) yX (t) | X (V \{s,t}) ),

then X is Markov relative to G. The converse statement is true without the positivity assumption.

Proof Fix s ∈ V and assume that (X (s) yX (R) | X (V \(R∪{s})) ) for any R ⊂ Ws with cardinality at most k (the statement is true for k = 1 by assumption). Consider a set R̃ ⊂ Ws of cardinality k + 1, that we decompose into R ∪ {t} for some t ∈ R̃. We have (X (s) yX (t) | X (V \(R̃∪{s})) , X (R) ) from the initial hypothesis and (X (s) yX (R) | X (V \(R̃∪{s})) , X (t) ) from the induction hypothesis. Using property (CI5), this yields (X (s) yX (R̃) | X (V \(R̃∪{s})) ). This proves the proposition by induction. □

Remark 14.20 It is obvious from the definition of a G-Markov process that, if X is Markov for a graph G = (V , E), it is automatically Markov for any richer graph, i.e., any graph G̃ = (V , Ẽ) with E ⊂ Ẽ. This is because separation in G̃ implies separation in G. Moreover, any X is Markov for the complete graph on V , for which s ∼ t for all s ≠ t ∈ V . This is because no pair of sets can be separated in a complete graph.

Any graph with respect to which X is Markov must be richer than the graph GX = (V , EX ) defined by s ≁GX t if and only if (X (s) yX (t) | X ({s,t}^c) ). This is true because, for any graph G for which X is Markov, we have

s ≁G t ⇒ (X (s) yX (t) | X ({s,t}^c) ) ⇒ s ≁GX t.

Interestingly, proposition 14.19 states that X is GX -Markov as soon as its joint distribution is positive. This implies that GX is the minimal graph over which X is Markov in this case. □

14.2.3 Restricted graph and partial evidence

Assume that some variables X (T ) = (X (t) , t ∈ T ) (with T ⊂ V ) have been observed,


with observed values x(T ) = (x(t) , t ∈ T ). One would like to use this partial evidence
to get additional information on the remaining variables, X (S) where S = V \T . From
the probabilistic point of view, this means computing the conditional distribution of
X (S) given X (T ) = x(T ) .

One important property of G-Markov models is that the Markov property is es-
sentially conserved when passing to conditional distributions. We introduce for this
the following definitions.

Definition 14.21 If G = (V , E) is an undirected graph, a subgraph of G is a graph G′ = (V ′, E′) with V ′ ⊂ V and E′ ⊂ E.

If S ⊂ V , the restricted graph, GS , of G to S is defined by

GS = (S, ES ) with ES = {e = {s, t} : s, t ∈ S and e ∈ E}.    (14.20)

We have the following proposition.

Proposition 14.22 Let G = (V , E) be an undirected graph and X be G-Markov. Let S ⊂ V


and T = S c . Given a partial evidence x(T ) such that P (X (T ) = x(T ) ) > 0, X (S) , conditionally
to X (T ) = x(T ) , is GS -Markov.

Proof The proof is straightforward once it is noticed that

(AyB | C)GS ⇒ (AyB | C ∪ T )G

so that

(AyB | C)GS ⇒ (X (A) yX (B) | X (C) , X (T ) )P


⇒ (X (A) yX (B) | X (C) )P (·|X (T ) =x(T ) ) 

14.2.4 Marginal distributions

The effect of taking marginal distributions for a G-Markov model is, unfortunately, not as mild an operation as computing conditional distributions, in the sense that the conditional independence structure of the marginal distribution may be much more complex than the original one.

Let G = (V , E) be an undirected graph, and let S be a subset of V . Define the graph G^S = (S, E^S ) by {s, t} ∈ E^S if and only if {s, t} ∈ E or there exist u, u′ ∈ S c such that {s, u} ∈ E, {t, u′} ∈ E and u and u′ are connected by a path in S c . In other terms, E^S links all s, t ∈ S that can be connected by a path all but the extremities of which are included in S c . With this notation, the following proposition holds.

Proposition 14.23 Let G = (V , E) be an undirected graph, and S ⊂ V . Assume that X = (X (s) , s ∈ V ) is a family of random variables which is G-Markov. Then X (S) = (X (s) , s ∈ S) is G^S -Markov.

Proof It suffices to prove that, for A, B, C ⊂ S,

(AyB | C)G^S ⇒ (AyB | C)G .

So, assume that A and B are separated by C in G^S . If a path connects A and B in G, we can, by definition of E^S , remove from this path any portion that passes in S c and obtain a valid path in G^S . By assumption, this path must pass in C, and therefore so does the original path. □

The graph G^S can be much more complex than the restricted graph GS introduced in the previous section (note that, by definition, G^S is richer than GS ). Take, for example, the graph that corresponds to “hidden Markov models,” for which (cf. fig. 14.2)

V = {1, . . . , N } × {0, 1}

and edges {s, t} ∈ E have either s = (k, 0) and t = (l, 0) with |k − l| = 1, or s = (k, 0) and t = (k, 1). Let S = {1, . . . , N } × {1}. Then, GS is totally disconnected (ES = ∅), since no edge in G links two elements of S. In contrast, any pair of elements in S is connected by a path in S c , so that G^S is a complete graph.

Figure 14.2: In this graph, variables in the lower row are conditionally independent given
the first row, while their marginal distribution requires a completely connected graph.

14.3 The Hammersley-Clifford theorem

The Hammersley-Clifford theorem, which will be proved in this section, gives a com-
plete description of positive Markov processes relative to a given graph, G. It states
that positive G-Markov models are associated to families of positive local interac-
tions indexed by cliques in the graph. We now introduce each of these concepts.

14.3.1 Families of local interactions

Definition 14.24 Let V be a set of vertexes and (Fs , s ∈ V ) a collection of state spaces.
A family of local interactions is a collection of non-negative functions Φ = (ϕC , C ∈ C)
indexed over some subset C of P (V ), such that each ϕC only depends on configurations

restricted to C (i.e., it is defined on F (C)), with values in [0, +∞). (Recall that P (V ) is
the set of all subsets of V .)

Such a family has order p if no C ∈ C has cardinality larger than p. A family of local
interactions of order 2 is also called a family of pair interactions.

Such a family is said to be consistent if there exists an x ∈ F (V ) such that

Π_{C∈C} ϕC (x(C) ) ≠ 0.

To a consistent family of local interactions, one associates the probability distribution π^Φ on F (V ) defined by

π^Φ (x) = (1/Z^Φ) Π_{C∈C} ϕC (x(C) )    (14.21)

for all x ∈ F (V ), where Z^Φ is a normalizing constant.
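For a small vertex set, Z^Φ and the distribution (14.21) can be computed by brute force. A minimal sketch with binary state spaces (the pair interaction used in the example is an arbitrary choice):

```python
import itertools
import numpy as np

def gibbs_from_interactions(V, phis, states=(0, 1)):
    """Brute-force computation of pi^Phi in (14.21). `phis` maps a subset
    C (a tuple of vertexes) to a non-negative function of x restricted to C."""
    V = list(V)
    idx = {s: i for i, s in enumerate(V)}
    probs = {}
    for x in itertools.product(states, repeat=len(V)):
        w = 1.0
        for C, phi in phis.items():
            w *= phi({s: x[idx[s]] for s in C})   # phi_C depends on x^(C) only
        probs[x] = w
    Z = sum(probs.values())                       # consistency requires Z > 0
    return {x: w / Z for x, w in probs.items()}

if __name__ == "__main__":
    # Pair interactions on the chain 0 - 1 - 2 (a family of order 2).
    phi_pair = lambda xC: np.exp(1.0 if len(set(xC.values())) == 1 else -1.0)
    pi = gibbs_from_interactions([0, 1, 2], {(0, 1): phi_pair, (1, 2): phi_pair})
    print(max(pi, key=pi.get))                    # a constant configuration
```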

Given C ⊂ P (V ), define the graph GC = (V , EC ) by letting {s, t} ∈ EC if and only if


there exists C ∈ C such that {s, t} ∈ C. We then have the following proposition.
Proposition 14.25 Let Φ = (ϕC , C ∈ C) be a consistent family of local interactions, associated to some C ⊂ P (V ). Then the associated distribution π^Φ is GC -Markov.
Proof Let X be a random field associated with π = π^Φ . According to proposition 14.17, we must show that, for any S ⊂ V , one has

(X (S) yX (WS ) | X (VS ) )

where VS is the set of neighbors of S in GC and WS = V \ (VS ∪ S). Define the set US by

US = ∪_{C∈C, S∩C≠∅} C

so that VS = US \ S and WS = V \ US . To prove conditional independence, we need to prove that, for any x ∈ F (V ):

π(x)πVS (x(VS ) ) = πUS (x(US ) )πV \S (x(V \S) )    (14.22)

(where we denote by πA the marginal distribution of π on F (A)).
(where we denote πA the marginal distribution of π on F (A).)

From the definition of π, we have


1Y
π(x) = ϕC (x(C) )
Z
C∈C
1 Y Y
= ϕC (x(C) ) ϕC (x(C) ).
Z
C:C∩S,∅ C:C∩S=∅

The first term in the last product only depends on x(US ) , and the second one only on x(V \S) . Introduce the notation

µ1 (x(VS ) ) = Σ_{y(US ) : y(VS )=x(VS )} Π_{C:C∩S≠∅} ϕC (y (C) )
µ2 (x(VS ) ) = Σ_{y(V \S) : y(VS )=x(VS )} Π_{C:C∩S=∅} ϕC (y (C) ).
With this notation, we have:

πUS (x(US ) ) = (µ2 (x(VS ) )/Z) Π_{C:C∩S≠∅} ϕC (x(C) )
πV \S (x(V \S) ) = (µ1 (x(VS ) )/Z) Π_{C:C∩S=∅} ϕC (x(C) )
πVS (x(VS ) ) = µ1 (x(VS ) )µ2 (x(VS ) )/Z,

from which (14.22) can be easily obtained. □

We now discuss conditional distributions and marginals for processes associated


with local interactions. If T ⊂ V , we let πT = π^Φ_T denote the marginal distribution of
π on T .

We start with a discussion of conditionals. Let π be associated with Φ, and let


S ⊂ V and T = V \S. Assume that a configuration y (T ) is given, such that πT (y (T ) ) > 0,
and consider the conditional distribution
πS|T (x(S) | y (T ) ) = π(x(S) ∧ y (T ) )/πT (y (T ) ). (14.23)
We have the following proposition.
Proposition 14.26 With the notation above, πS|T (· | y (T ) ) is associated to the family of
local interactions Φ|yT = (ϕC̃|y (T ) , C̃ ∈ CS ) with
n o
CS = C̃ : C̃ ⊂ S, ∃C ∈ C : C̃ = C ∩ S
and Y
ϕC̃|y (T ) (x(C̃) ) = ϕC (x(C̃) ∧ y (C∩T ) ).
C∈C:C∩S=C̃
Proof From (14.23) and the definition of π, it is easy to see that

πS|T (x(S) | y (T ) ) = (1/Z(y (T ) )) Π_{C:C∩S≠∅} ϕC (x(C∩S) ∧ y (C∩T ) ),

where Z(y (T ) ) is a constant that only depends on y (T ) . The fact that πS|T (· | y (T ) ) is associated to Φ|y (T ) is then obtained by reorganizing the product over distinct S ∩ C’s. □

This result, combined with proposition 14.25, is consistent with proposition 14.22, in the sense that the restriction of GC to S coincides with the graph GCS . The easy proof is left to the reader.

We now consider marginals, and more specifically marginals when only one node
is removed, which provides the basis for “node elimination.”
Proposition 14.27 Let π be associated to Φ = (ϕC , C ∈ C) as above. Let t ∈ V and S = V \ {t}. Define Ct ⊂ P (V ) as the set

Ct = {C ∈ C : t ∉ C} ∪ {C̃t }

with

C̃t = ∪_{C∈C:t∈C} (C \ {t}).

Define a family of local interactions Φt = (ϕ̃C̃ , C̃ ∈ Ct ) by ϕ̃C̃ = ϕC̃ if C̃ ≠ C̃t and:

• If C̃t ∉ C:
ϕ̃C̃t (x(C̃t ) ) = Σ_{y(t)∈Ft} Π_{C∈C,t∈C} ϕC (x(C\{t}) ∧ y (t) ).

• If C̃t ∈ C:
ϕ̃C̃t (x(C̃t ) ) = ϕC̃t (x(C̃t ) ) Σ_{y(t)∈Ft} Π_{C∈C,t∈C} ϕC (x(C\{t}) ∧ y (t) ).

Then the marginal, πS , of π over S is the distribution associated to Φt .

The proof is almost straightforward, by summing over possible values of y (t) in the expression of π, and is left to the reader.
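Proposition 14.27 is easy to check numerically on a small example: eliminating a node and renormalizing reproduces the marginal of π on V \ {t}. A minimal self-contained sketch (binary states and the particular interactions are choices made for the example):

```python
import itertools
import numpy as np

def joint(phis, V, states=(0, 1)):
    """Normalized distribution (14.21) computed by brute force."""
    w = {}
    for x in itertools.product(states, repeat=len(V)):
        cfg = dict(zip(V, x))
        w[x] = np.prod([phi({s: cfg[s] for s in C}) for C, phi in phis.items()])
    Z = sum(w.values())
    return {x: v / Z for x, v in w.items()}

def eliminate(phis, t, states=(0, 1)):
    """Node elimination (proposition 14.27): interactions containing t are
    replaced by a single interaction on the union of their C \\ {t}."""
    keep = {C: phi for C, phi in phis.items() if t not in C}
    touch = {C: phi for C, phi in phis.items() if t in C}
    Ct = tuple(sorted({s for C in touch for s in C if s != t}))
    def phi_t(xC):
        # sum over the values y at node t of the product of the
        # interactions that contained t
        return sum(
            np.prod([phi({s: (y if s == t else xC[s]) for s in C})
                     for C, phi in touch.items()])
            for y in states)
    if Ct in keep:                       # merge with an existing interaction
        old = keep.pop(Ct)
        keep[Ct] = lambda xC: old(xC) * phi_t(xC)
    else:
        keep[Ct] = phi_t
    return keep

if __name__ == "__main__":
    phi = lambda xC: np.exp(0.5 if len(set(xC.values())) == 1 else -0.5)
    phis = {(0, 1): phi, (1, 2): phi}
    pi = joint(phis, [0, 1, 2])
    pi_S = joint(eliminate(phis, 1), [0, 2])
    marg = {(a, c): pi[(a, 0, c)] + pi[(a, 1, c)] for a in (0, 1) for c in (0, 1)}
    print(all(abs(pi_S[k] - marg[k]) < 1e-12 for k in pi_S))  # True
```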

14.3.2 Characterization of positive G-Markov processes

Using families of local interactions is a typical way to build graphical models in


applications. The previous section describes a graph with respect to which the ob-
tained process is Markov. Conversely, given a graph G, the Hammersley-Clifford
theorem states that families of local interactions over the cliques of G are the only
ways to build positive graphical models, which reinforces the importance of this
construction. We now pass to the statement and proof of this theorem, starting with
the following definition.
Definition 14.28 Let G = (V , E) be an undirected graph. A clique in G is a nonempty subset C ⊂ V such that s ∼G t whenever s, t ∈ C, s ≠ t. (In particular, subsets of cardinality one are always cliques.) Cliques therefore form complete subgraphs of G.

The set of cliques of a graph G will be denoted CG .

A clique that cannot be strictly included in any other clique is called a maximal clique, and the set of maximal cliques is denoted C̄G .

(Note that some authors call cliques what we refer to as maximal cliques.)

Given G = (V , E), consider a random field X = (X (s) , s ∈ V ). We assume that X (s)


takes values in a finite set Fs with P(X (s) = a) > 0 for any a ∈ Fs (this is no loss of
generality since one can always restrict Fs to such a’s). If S ⊂ V , we denote as before
F (S) the set of restrictions of configurations to S. With this notation, X is positive,
according to definition 14.4, if and only if P (X = x) > 0 for all x ∈ F (V ). We will let
π = PX be the probability distribution of X, so that π(x) = P(X = x) and use as above
the notation: for S, T ⊂ V ,

πS (x(S) ) = P(X (S) = x(S) )
πS|T (x(S) | x(T ) ) = P(X (S) = x(S) | X (T ) = x(T ) ).    (14.24)

(For the first notation, we will simply write π if S = V .)

We will also need to fix a reference, or “zero,” configuration in F (V ) that we will


denote 0 = (0(s) , s ∈ V ), with 0(s) ∈ Fs for all s. We can choose it arbitrarily. Given this,
we have the theorem:

Theorem 14.29 (Hammersley-Clifford) With the previous notation, X is a positive G-Markov process if and only if its distribution, π, is associated to a family of local interactions Φ = (ϕC , C ∈ CG ) such that ϕC (x(C) ) > 0 for all x(C) ∈ F (C).

Moreover, Φ is uniquely characterized by the additional constraint: ϕC (x(C) ) = 1 as


soon as there exists s ∈ C such that x(s) = 0(s) .

Letting λ_C = − log ϕ_C, we get an equivalent formulation of the theorem in terms of potentials, where a potential is defined as a family of functions

Λ = (λ_C, C ∈ C)

indexed by a subset C of P(V), such that λ_C only depends on x^{(C)}. The distribution associated to Λ is

π(x) = \frac{1}{Z} \exp\Big(− \sum_{C ∈ C} λ_C(x^{(C)})\Big).    (14.25)

With this terminology, we trivially have an equivalent formulation:

Theorem 14.30 X is a positive G-Markov process if and only if its distribution, π, is associated to a potential Λ = (λ_C, C ∈ C_G).

Moreover, Λ is uniquely characterized by the additional constraint: λ_C(x^{(C)}) = 0 as soon as there exists s ∈ C such that x^{(s)} = 0^{(s)}.

We now prove this theorem.

Proof Let us start with the “if” part. If π is associated to a potential over C_G, we have already proved that π is G_{C_G}-Markov, so that it suffices to prove that G_{C_G} = G, which is almost obvious: if s ∼_G t, then {s, t} ∈ C_G and s ∼_{G_{C_G}} t by definition of G_{C_G}. Conversely, if s ∼_{G_{C_G}} t, there exists C ∈ C_G such that {s, t} ⊂ C, which implies that s ∼_G t, by definition of a clique.

We now prove the “only if” part, which relies on a combinatorial lemma, one of Möbius’s inversion formulas.

Lemma 14.31 Let A be a finite set and f : P(A) → R, B ↦ f_B. Then, there is a unique function λ : P(A) → R such that

∀B ⊂ A, f_B = \sum_{C ⊂ B} λ_C,    (14.26)

and λ is given by

λ_C = \sum_{B ⊂ C} (−1)^{|C|−|B|} f_B.    (14.27)

To prove the lemma, first notice that the space F of functions f : P(A) → R is a vector space of dimension 2^{|A|}, and that the transformation ϕ : λ ↦ f with f_B = \sum_{C ⊂ B} λ_C is linear. It therefore suffices to prove that, given any f, the function λ given in (14.27) satisfies ϕ(λ) = f, since this proves that ϕ is onto from F to F and therefore necessarily one to one.

So consider f and λ given by (14.27). Then

ϕ(λ)(B) = \sum_{C ⊂ B} λ_C = \sum_{C ⊂ B} \sum_{B̃ ⊂ C} (−1)^{|C|−|B̃|} f_{B̃} = \sum_{B̃ ⊂ B} \Big(\sum_{B̃ ⊂ C ⊂ B} (−1)^{|C|−|B̃|}\Big) f_{B̃} = f_B.

The last identity comes from the fact that, for any finite set B̃ ⊂ B, B̃ ≠ B, we have

\sum_{B̃ ⊂ C ⊂ B} (−1)^{|C|−|B̃|} = 0

(for B̃ = B, the sum is obviously equal to 1). Indeed, if s ∈ B, s ∉ B̃, we have

\sum_{B̃ ⊂ C ⊂ B} (−1)^{|C|−|B̃|} = \sum_{B̃ ⊂ C ⊂ B, s ∈ C} (−1)^{|C|−|B̃|} + \sum_{B̃ ⊂ C ⊂ B, s ∉ C} (−1)^{|C|−|B̃|}
= \sum_{B̃ ⊂ C ⊂ B, s ∉ C} \big((−1)^{|C∪{s}|−|B̃|} + (−1)^{|C|−|B̃|}\big)
= 0.

So the lemma is proved.
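The inversion formula is easy to check numerically. The sketch below (ours; for illustration only) builds a random f on the subsets of a three-element set, computes λ via (14.27), and verifies (14.26).

```python
import random
from itertools import chain, combinations

def subsets(s):
    """All subsets of s, as frozensets."""
    s = tuple(s)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

A = ('u', 'v', 'w')
random.seed(1)
f = {B: random.random() for B in subsets(A)}

# (14.27): lambda_C = sum_{B subset of C} (-1)^{|C|-|B|} f_B
lam = {C: sum((-1) ** (len(C) - len(B)) * f[B] for B in subsets(C))
       for C in subsets(A)}

# (14.26): f_B = sum_{C subset of B} lambda_C
assert all(abs(sum(lam[C] for C in subsets(B)) - f[B]) < 1e-10
           for B in subsets(A))
print("Möbius inversion verified on all", 2 ** len(A), "subsets")
```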
We now proceed to proving the existence and uniqueness statements in theorem 14.30. Assume that X is G-Markov and positive. Fix x ∈ F(V) and consider the function, defined on P(V), by

f_B(x^{(B)}) = − \log \frac{π(x^{(B)} ∧ 0^{(B^c)})}{π(0)}.
Then, letting

λ_C(x^{(C)}) = \sum_{B ⊂ C} (−1)^{|C|−|B|} f_B(x^{(B)}),

we have f_B(x^{(B)}) = \sum_{C ⊂ B} λ_C(x^{(C)}). In particular, for B = V, this gives
π(x) = \frac{1}{Z} \exp\Big(− \sum_{C ⊂ V} λ_C(x^{(C)})\Big)

with Z = 1/π(0). We now prove that λ_C(x^{(C)}) = 0 if x^{(s)} = 0^{(s)} for some s ∈ V or if C ∉ C_G. This will prove (14.25) and the existence statement in theorem 14.30.

So, assume x^{(s)} = 0^{(s)}. Then, for any B such that s ∉ B, we have f_B(x^{(B)}) = f_{{s}∪B}(x^{({s}∪B)}). Now take C with s ∈ C. We have

λ_C(x^{(C)}) = \sum_{B ⊂ C, s ∈ B} (−1)^{|C|−|B|} f_B(x^{(B)}) + \sum_{B ⊂ C, s ∉ B} (−1)^{|C|−|B|} f_B(x^{(B)})
= \sum_{B ⊂ C, s ∉ B} (−1)^{|C|−|B∪{s}|} f_{B∪{s}}(x^{(B∪{s})}) + \sum_{B ⊂ C, s ∉ B} (−1)^{|C|−|B|} f_B(x^{(B)})
= \sum_{B ⊂ C, s ∉ B} \big((−1)^{|C|−|B∪{s}|} + (−1)^{|C|−|B|}\big) f_B(x^{(B)})
= 0.

Now assume that C is not a clique, and let s ≠ t ∈ C be such that s ≁_G t. We can write, using decompositions similar to the above,

λ_C(x^{(C)}) = \sum_{B ⊂ C∖{s,t}} (−1)^{|C|−|B|} \big(f_{B∪{s,t}}(x^{(B∪{s,t})}) − f_{B∪{s}}(x^{(B∪{s})}) − f_{B∪{t}}(x^{(B∪{t})}) + f_B(x^{(B)})\big).

But, for B ⊂ C ∖ {s, t}, we have

f_{B∪{s,t}}(x^{(B∪{s,t})}) − f_{B∪{s}}(x^{(B∪{s})}) = − \log \frac{π(x^{(B∪{s,t})} ∧ 0^{(B^c∖{s,t})})}{π(x^{(B∪{s})} ∧ 0^{(B^c∖{s})})} = − \log \frac{π_t(x^{(t)} | x^{(B∪{s})} ∧ 0^{(B^c∖{s,t})})}{π_t(0^{(t)} | x^{(B∪{s})} ∧ 0^{(B^c∖{s,t})})}

and

f_{B∪{t}}(x^{(B∪{t})}) − f_B(x^{(B)}) = − \log \frac{π(x^{(B∪{t})} ∧ 0^{(B^c∖{t})})}{π(x^{(B)} ∧ 0^{(B^c)})} = − \log \frac{π_t(x^{(t)} | x^{(B)} ∧ 0^{(B^c∖{t})})}{π_t(0^{(t)} | x^{(B)} ∧ 0^{(B^c∖{t})})}.

So, we can write

λ_C(x^{(C)}) = \sum_{B ⊂ C∖{s,t}} (−1)^{|C|−|B|} \log \frac{π_t(x^{(t)} | x^{(B)} ∧ 0^{(B^c∖{t})}) \, π_t(0^{(t)} | x^{(B∪{s})} ∧ 0^{(B^c∖{s,t})})}{π_t(0^{(t)} | x^{(B)} ∧ 0^{(B^c∖{t})}) \, π_t(x^{(t)} | x^{(B∪{s})} ∧ 0^{(B^c∖{s,t})})},

which vanishes, because

π_t(x^{(t)} | x^{(B∪{s})} ∧ 0^{(B^c∖{s,t})}) = π_t(x^{(t)} | x^{(B)} ∧ 0^{(B^c∖{t})})

when s ≁_G t (the two conditioning configurations differ only at s, which is not a neighbor of t).

To prove uniqueness, note that, for any zero-normalized Λ satisfying (14.25), we must have π(0) = 1/Z and therefore, for any x,

− \log \frac{π(x^{(B)} ∧ 0^{(B^c)})}{π(0)} = \sum_{C ⊂ B} λ_C(x^{(C)})

(extending Λ so that λ_C = 0 for C ∉ C_G). But, from lemma 14.31, this uniquely defines Λ. □

The exponential form of the distribution in the Hammersley-Clifford theorem is related to what is called a Gibbs distribution in statistical mechanics. More precisely:

Definition 14.32 Let F be a finite set and W : F → R be a scalar function. The Gibbs distribution with energy W at temperature T > 0 is defined by

π(x) = \frac{1}{Z_T} e^{−W(x)/T}, x ∈ F.

The normalizing constant Z_T = \sum_{y ∈ F} \exp(−W(y)/T) is called the partition function.

If Λ = (λ_C, C ⊂ V) is a potential, then its associated energy is

W(x) = \sum_{C ⊂ V} λ_C(x^{(C)}).
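Definition 14.32 is straightforward to compute for small state spaces. The following sketch (ours; the energy values are arbitrary) tabulates the Gibbs distribution and its partition function at several temperatures, illustrating that low T concentrates mass on energy minimizers while high T flattens the distribution.

```python
import math

W = {'a': 0.0, 'b': 1.0, 'c': 1.0, 'd': 3.0}   # arbitrary energy on a finite F

def gibbs(W, T):
    """Gibbs distribution with energy W at temperature T (definition 14.32)."""
    weights = {x: math.exp(-w / T) for x, w in W.items()}
    Z_T = sum(weights.values())                # partition function
    return {x: w / Z_T for x, w in weights.items()}

for T in (0.2, 1.0, 5.0):
    print(T, {x: round(p, 3) for x, p in gibbs(W, T).items()})
```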

So the Hammersley-Clifford theorem implies that any positive G-Markov model is associated to a unique zero-normalized potential defined over the cliques of G. This representation can also be used to provide an alternate proof of proposition 14.19, which is left to the reader. Finally, one can restate proposition 14.26 in terms of potentials, yielding:

Proposition 14.33 Let P be a Gibbs distribution associated with a zero-normalized potential λ = (λ_C, C ⊂ V). Let S ⊂ V and T = S^c. Then the conditional distribution of X^{(S)} given X^{(T)} = x^{(T)} is the Gibbs distribution associated with the zero-normalized potential λ̃ = (λ̃_C, C ⊂ S), where

λ̃_C(y^{(C)}) = \sum_{C′ ⊂ V, C′∩S = C} λ_{C′}(y^{(C)} ∧ x^{(T∩C′)}).

14.4 Models on acyclic graphs

14.4.1 Finite Markov chains

We now review a few important examples of Markov processes X associated to specific graphs G = (V, E). We will always denote by F_s the space in which X^{(s)} takes its values, for s ∈ V.

The simplest example of a G-Markov process (for any graph G) is the case when X = (X^{(s)}, s ∈ V) is a collection of independent random variables. In this case, we can take G_X = (V, ∅), the totally disconnected graph on V. Another simple fact is that, as already remarked, any X is Markov for the complete graph (V, P_2(V)), where P_2(V) contains all subsets of V with cardinality 2.

Beyond these trivial (but nonetheless important) cases, the simplest graph-Markov
processes are those associated with linear graphs, providing finite Markov chains.
For this, we let V be a finite ordered set, say,

V = {0, . . . , N } .

We say that X is a finite Markov chain if, for any k = 1, . . . , N,

(X^{(k)} ⊥⊥ (X^{(0)}, . . . , X^{(k−2)}) | X^{(k−1)}).

So we have the identity

P(X^{(0)} = x^{(0)}, . . . , X^{(k)} = x^{(k)}) P(X^{(k−1)} = x^{(k−1)})
= P(X^{(0)} = x^{(0)}, . . . , X^{(k−1)} = x^{(k−1)}) P(X^{(k−1)} = x^{(k−1)}, X^{(k)} = x^{(k)}).

The distribution of a Markov chain is therefore fully specified by P(X^{(0)} = x^{(0)}), x^{(0)} ∈ F_0 (the initial distribution) and the conditional probabilities

p_k(x^{(k−1)}, x^{(k)}) = P(X^{(k)} = x^{(k)} | X^{(k−1)} = x^{(k−1)})    (14.28)

(with an arbitrary choice when P(X^{(k−1)} = x^{(k−1)}) = 0). Indeed, assume that P(X^{(0)} = x^{(0)}, . . . , X^{(k−1)} = x^{(k−1)}) is known (for all x^{(0)}, . . . , x^{(k−1)}). Then, either:

(i) P(X^{(0)} = x^{(0)}, . . . , X^{(k−1)} = x^{(k−1)}) = 0, in which case

P(X^{(0)} = x^{(0)}, . . . , X^{(k)} = x^{(k)}) = 0

for any x^{(k)}, or:

(ii) P(X^{(0)} = x^{(0)}, . . . , X^{(k−1)} = x^{(k−1)}) > 0, in which case, necessarily, P(X^{(k−1)} = x^{(k−1)}) > 0, and

P(X^{(0)} = x^{(0)}, . . . , X^{(k)} = x^{(k)}) = p_k(x^{(k−1)}, x^{(k)}) P(X^{(0)} = x^{(0)}, . . . , X^{(k−1)} = x^{(k−1)}).

Note that p_k in (14.28) is a transition probability (according to definition 13.2) between F_{k−1} and F_k.
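The recursive construction above translates directly into code. The sketch below (ours; the distributions are randomly generated for illustration) builds the joint probability of a short chain from the initial distribution and the transition probabilities p_k, and checks that it sums to 1.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N = 3
sizes = [2, 3, 2, 2]                               # |F_0|, ..., |F_N|
p0 = rng.dirichlet(np.ones(sizes[0]))              # initial distribution
# p_k(x^(k-1), x^(k)): each row is a probability distribution on F_k.
trans = [rng.dirichlet(np.ones(sizes[k]), size=sizes[k - 1])
         for k in range(1, N + 1)]

def joint(x):
    """P(X^(0) = x[0], ..., X^(N) = x[N]) via the recursion in the text."""
    p = p0[x[0]]
    for k in range(1, N + 1):
        p *= trans[k - 1][x[k - 1], x[k]]
    return p

total = sum(joint(x) for x in product(*[range(n) for n in sizes]))
print(abs(total - 1.0) < 1e-12)                    # True
```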

We have the following identification of a finite Markov chain with a graph-Markov process:

Proposition 14.34 Let X = (X^{(0)}, . . . , X^{(N)}) be a finite Markov chain, such that X is positive. Then X is G-Markov for the linear graph G = (V, E) with

V = {0, . . . , N},    E = {{0, 1}, . . . , {N − 1, N}}.

The converse is true without the positivity assumption: a G-Markov process for the graph above is always a finite Markov chain.
Proof We prove the direct statement (the converse one being obvious). Let s and t be nonconsecutive distinct integers, with, say, s < t. From the Markov chain assumption, we have

(X^{(t)} ⊥⊥ (X^{(s)}, X^{({0,...,t−2}∖{s})}) | X^{(t−1)}),

which, using (CI3), yields (X^{(t)} ⊥⊥ X^{(s)} | X^{({0,...,t−1}∖{s})}). Define Y^{(u)} = X^{({0,...,u}∖{s,t})}: what we have proved is (X^{(t)} ⊥⊥ X^{(s)} | Y^{(t)}).

We now proceed by induction and assume that (X^{(t)} ⊥⊥ X^{(s)} | Y^{(u)}) for some u ≥ t. Then, we have (X^{(u+1)} ⊥⊥ (X^{(s)}, X^{(t)}, Y^{(u−1)}) | X^{(u)}), which implies (from (CI3)) (X^{(u+1)} ⊥⊥ X^{(t)} | X^{(s)}, Y^{(u)}). Applying (CI4) to (X^{(t)} ⊥⊥ X^{(s)} | Y^{(u)}) and (X^{(t)} ⊥⊥ X^{(u+1)} | X^{(s)}, Y^{(u)}), we obtain (X^{(t)} ⊥⊥ (X^{(s)}, X^{(u+1)}) | Y^{(u)}) and finally (X^{(t)} ⊥⊥ X^{(s)} | Y^{(u+1)}). By induction, this gives (X^{(t)} ⊥⊥ X^{(s)} | Y^{(N)}), and proposition 14.19 now implies that X is G-Markov.

(The proposition can also be proved as a consequence of the decomposition

P(X^{(0)} = x^{(0)}, . . . , X^{(N)} = x^{(N)}) = P(X^{(0)} = x^{(0)}) p_1(x^{(0)}, x^{(1)}) · · · p_N(x^{(N−1)}, x^{(N)}).) □

14.4.2 Undirected acyclic graph models and trees

The situation with acyclic graphs is only slightly more complex than with linear
graphs, but will require a few new definitions, including those of directed graphs
and trees.

The difference between directed and undirected graphs is that the edges of the
former are ordered pairs, namely:

Definition 14.35 A (finite) directed graph G is a pair G = (V, E) where V is a finite set of vertexes and E is a subset of

V × V ∖ {(s, s), s ∈ V},

which satisfies, in addition,

(s, t) ∈ E ⇒ (t, s) ∉ E.

So, for directed graphs, edges (s, t) and (t, s) have different meanings, and we allow
at most one of them in E. We say that the edge e = (s, t) stems from s and points to t.
The parents of a vertex s are the vertexes t such that (t, s) ∈ E, and its children are the
vertexes t such that (s, t) ∈ E. We will also use the notation s →G t to indicate that
(s, t) ∈ E (compare to s ∼G t for undirected graphs).

Definition 14.36 A path in a directed graph G = (V, E) is a sequence (s_0, . . . , s_N) such that, for all k = 0, . . . , N − 1, s_k →_G s_{k+1} (this includes the “trivial,” one-vertex paths (s_0)). (The definition is the same for undirected graphs, replacing s_k →_G s_{k+1} by s_k ∼_G s_{k+1}.) For both directed and undirected cases, one says that a path is closed if s_0 = s_N.

In an undirected graph, a path is folded if it can be written as (s0 , . . . , sN −1 , sN , sN −1 , . . . , s0 ).

If G = (V , E) is directed, one says that t ∈ V is a descendant of s ∈ V (or that s is an


ancestor of t) if there exists a path starting at s and ending at t. In particular, every vertex
is both a descendant and an ancestor of itself.

We finally define acyclic graphs.

Definition 14.37 A loop in a directed (resp. an undirected) graph G is a path (s0 , s1 , . . . , sN ),


with N ≥ 3, such that sN = s0 , which passes only once through s0 , . . . , sN −1 (no self-
intersection except at the end).

A (directed or undirected) graph G is acyclic if it contains no loop.

The following property will be useful.



Proposition 14.38 In a directed graph, any non-trivial closed path contains a loop (i.e.,
one can delete vertexes from it to finally obtain a loop.)

In an undirected graph, any non-trivial closed path which is not a union of folded
paths contains a loop.
Proof Take γ = (s_0, s_1, . . . , s_N) with s_N = s_0. The path being non-trivial means N > 1.

First take the case of a directed graph. Clearly, N ≥ 3, since a two-vertex path cannot be closed in a directed graph. Consider the first occurrence of a repetition, i.e., the first index j for which

s_j ∈ {s_0, . . . , s_{j−1}}.

Then there is a unique j′ ∈ {0, . . . , j − 1} such that s_{j′} = s_j, and the path (s_{j′}, . . . , s_j) must be a loop (any repetition in the sequence would contradict the fact that j was the first occurrence). This proves the result in the directed case.

Consider now an undirected graph. We can recursively remove all folded subpaths, keeping everything but their initial point, since each such operation still provides a path at the end. Assume that this is done, still denoting the remaining path (s_0, s_1, . . . , s_N), which therefore has no folded subpath. We must have N ≥ 3, since N = 1 implies that the original path was a union of folded paths, and N = 2 provides a folded path. Let 0 ≤ j′ < j be as in the directed case. Note that one must have j′ < j − 2, since j′ = j − 1 would imply an edge between s_j and itself, and j′ = j − 2 induces a folded subpath. But this implies that (s_{j′}, . . . , s_j) is a loop. □

Directed acyclic graphs (DAGs) will be important for us, because they are associated with the Bayesian networks that we will discuss later. For now, we are interested in undirected acyclic graphs and their relation to trees, which form a subclass of directed acyclic graphs, defined as follows.
Definition 14.39 A forest is a directed acyclic graph with the additional requirement
that each of its vertexes has at most one parent.

A root in a forest is a vertex that has no parent. A forest with a single root is called a
tree.

It is clear that a forest has at least one root, since one could otherwise describe a nontrivial loop by starting from any vertex and passing to its parent until the sequence self-intersects (which must happen since V is finite). We will use the following definition.
Definition 14.40 If G = (V, E) is a directed graph, its flattened graph, denoted G^♭ = (V, E^♭), is the undirected graph obtained by forgetting the edge ordering, namely

{s, t} ∈ E^♭ ⇔ (s, t) ∈ E or (t, s) ∈ E.

The following proposition relates forests and undirected acyclic graphs.

Proposition 14.41 If G is a forest, then G^♭ is an undirected acyclic graph. Conversely, if G is an undirected acyclic graph, there exists a forest G̃ such that G̃^♭ = G.
Proof Let G = (V, E) be a forest and, in order to reach a contradiction, assume that G^♭ has a loop, s_0, . . . , s_{N−1}, s_N = s_0. Assume that (s_0, s_1) ∈ E; then, also (s_1, s_2) ∈ E (otherwise s_1 would have two parents), and this propagates to all (s_k, s_{k+1}) for k = 0, . . . , N − 1. But, since s_N = s_0, this provides a loop in G, which is not possible. This proves that G^♭ has no loop, since the case (s_1, s_0) ∈ E is treated similarly.

Now, let G be an undirected acyclic graph. Fix a vertex s ∈ V and consider the
following procedure, in which we recursively define sets Sk of processed vertexes,
and Ẽk of oriented edges, k ≥ 0, initialized with S0 = {s} and Ẽ0 = ∅.

– At step k of the procedure, assume that vertexes in S_k have been processed and edges in Ẽ_k have been oriented so that (S_k, Ẽ_k) is a forest, and that Ẽ_k^♭ is the set of edges {s, t} ∈ E such that s, t ∈ S_k (so oriented edges at step k can only involve processed vertexes).
– If Sk = V : stop, the proposition is proved.
– Otherwise, apply the following construction. Let Fk be the set of edges in E that
contain exactly one element of Sk .
(1) If Fk = ∅, take any s ∈ V \ Sk as a new root and let Sk+1 = Sk ∪ {s}, Ẽk+1 = Ẽk .
(2) Otherwise, add to Ẽk the oriented edges (s, t) such that s ∈ Sk and {s, t} ∈ Fk ,
yielding Ẽk+1 , and add to Sk the corresponding children (t’s) yielding Sk+1 .

We need to justify the fact that G̃_{k+1} = (S_{k+1}, Ẽ_{k+1}) above is still a forest. This is obvious after Case (1), so consider Case (2). First, G̃_{k+1} is acyclic, since any oriented loop is a fortiori an unoriented loop and G is acyclic. So we need to prove that no vertex in S_{k+1} has two parents. Since we did not add any parent to the vertexes in S_k and, by assumption, (S_k, Ẽ_k) is a forest, the only possibility for a vertex to have two parents in S_{k+1} is the existence of t such that there exist s, s′ ∈ S_k with {s, t} and {s′, t} in E. But, since s and s′ have unaccounted edges containing them, they cannot have been introduced in S_k before the most recently introduced root was added, so they are both connected to this root: but the two connections to t would create a loop in G, which is impossible.

So the procedure carries on, and must end with S_k = V at some point, since we keep adding points to S_k at each step. □

Note that the previous proof shows that the orientation of a connected undirected acyclic graph into a tree is not unique, although it is uniquely specified once a root is chosen. The proof is constructive, and provides an algorithm building a forest from an undirected acyclic graph.
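The construction in the proof amounts to a breadth-first traversal. A minimal sketch (ours; names are illustrative) that orients an undirected acyclic graph into a forest, returning the parent of each vertex:

```python
from collections import deque

def orient(V, edges):
    """Build a forest orienting an undirected acyclic graph (prop. 14.41).
    Returns a parent map; roots have parent None."""
    nbrs = {s: set() for s in V}
    for s, t in edges:
        nbrs[s].add(t)
        nbrs[t].add(s)
    parent, seen = {}, set()
    for root in V:                  # a new root for each connected component
        if root in seen:
            continue
        parent[root], queue = None, deque([root])
        seen.add(root)
        while queue:                # orient edges away from the root
            s = queue.popleft()
            for t in nbrs[s]:
                if t not in seen:
                    parent[t] = s
                    seen.add(t)
                    queue.append(t)
    return parent

print(orient(['a', 'b', 'c', 'd', 'e'],
             [('a', 'b'), ('a', 'c'), ('c', 'd')]))  # 'e' becomes a second root
```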

We now define graphical models supported by trees, which constitute our first
Markov models associated with directed graphs. Define the depth of a vertex in a
tree G = (V , E) to be the number of edges in the unique path that links it to the
root. We will denote by Gd the set of vertexes in G that are at depth d, so that G0
contains only the root, G1 the children of the root and so on. Using this, we have the
definition:
Definition 14.42 Let G = (V, E) be a tree. A process X = (X^{(s)}, s ∈ V) is G-Markov if and only if, for each d ≥ 1 and each s ∈ G_d, we have

(X^{(s)} ⊥⊥ (X^{(G_d∖{s})}, X^{(G_q∖{pa(s)})}, q < d) | X^{(pa(s))})    (14.29)

where pa(s) is the parent of s.

So, conditionally on its parent, X^{(s)} is independent from all other variables at depth smaller than or equal to the depth of s.

Note that, from (CI3), definition 14.42 implies that, for all s ∈ G_d,

(X^{(s)} ⊥⊥ X^{(G_d∖{s})} | X^{(G_q)}, q < d),

which, using proposition 14.6, implies that the variables (X^{(s)}, s ∈ G_d) are mutually independent given X^{(G_q)}, q < d. This implies that, for d = 1 (letting s_0 denote the root in G):

P(X^{(G_1)} = x^{(G_1)}, X^{(s_0)} = x^{(s_0)}) = P(X^{(s_0)} = x^{(s_0)}) \prod_{s ∈ G_1} P(X^{(s)} = x^{(s)} | X^{(s_0)} = x^{(s_0)}).

(If P(X^{(s_0)} = x^{(s_0)}) = 0, the choice for the conditional probabilities can be made arbitrarily without changing the left-hand side, which vanishes.) More generally, we have, letting G_{<d} = G_0 ∪ · · · ∪ G_{d−1},

P(X^{(G_{≤d})} = x^{(G_{≤d})}) = \prod_{s ∈ G_d} P(X^{(s)} = x^{(s)} | X^{(pa(s))} = x^{(pa(s))}) \, P(X^{(G_{<d})} = x^{(G_{<d})})

(with again an arbitrary choice for conditional probabilities that are not defined), so that we obtain, by induction, for x ∈ F(V),

P(X = x) = P(X^{(s_0)} = x^{(s_0)}) \prod_{s ≠ s_0} p_s(x^{(pa(s))}, x^{(s)})    (14.30)

where p_s(x^{(pa(s))}, x^{(s)}) = P(X^{(s)} = x^{(s)} | X^{(pa(s))} = x^{(pa(s))}) are the tree transition probabilities between a parent and a child. So we have the following proposition.
ities between a parent and a child. So we have the following proposition.

Proposition 14.43 A process X is Markov relative to a tree G = (V, E) if and only if there exists a probability distribution p_0 on F_{s_0} and a family (p_{st}, (s, t) ∈ E) such that p_{st} is a transition probability from F_s to F_t and

P_X(x) = p_0(x^{(s_0)}) \prod_{(s,t) ∈ E} p_{st}(x^{(s)}, x^{(t)}), x ∈ F(V).    (14.31)

We have only proved the “only if” part, but the “if” part is obvious from (14.31). Another property that becomes obvious with this expression is the first part of the following proposition.
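The factorization (14.31) makes sampling from a tree model immediate: draw the root from p_0, then each vertex from the transition probability given its sampled parent (“ancestral sampling”). A minimal sketch (ours), with made-up distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
parent = {0: None, 1: 0, 2: 0, 3: 1}     # small tree, root 0; ids grow with depth
n = 3                                    # all state spaces of size 3
p0 = rng.dirichlet(np.ones(n))
# p_{pa(s),s}: one row (a distribution on F_s) per value of the parent.
p_trans = {s: rng.dirichlet(np.ones(n), size=n)
           for s in parent if parent[s] is not None}

def sample():
    x = {0: rng.choice(n, p=p0)}
    for s in sorted(parent):             # parents are visited before children
        if parent[s] is not None:
            x[s] = rng.choice(n, p=p_trans[s][x[parent[s]]])
    return x

print(sample())
```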

Proposition 14.44 If a process X is Markov relative to a tree G = (V, E), then it is G^♭-Markov. Conversely, if G = (V, E) is an undirected acyclic graph and X is G-Markov, then X is Markov relative to any tree G̃ such that G̃^♭ = G.

Proof To prove the converse part, assume that G = (V, E) is undirected acyclic and that X is G-Markov. Take G̃ such that G̃^♭ = G. For s ∈ V and its parent pa(s) in G̃, the sets {s} and G̃_{≤d} ∖ {s, pa(s)} are separated by pa(s) in G. To see this, assume that there exists t ∈ G̃_{≤d} ∖ {s, pa(s)} with a path from t to s that does not pass through pa(s). Then we can complete this path with the path from t to the first common ancestor (in G̃) of t and s and back to s, to create a path from s to s that passes only once through {pa(s), s} and therefore contains a loop by proposition 14.38.

The G-Markov property now implies

(X^{(s)} ⊥⊥ (X^{(G̃_d∖{s})}, X^{(G̃_q∖{pa(s)})}, q < d) | X^{(pa(s))}),

which proves that X is G̃-Markov. □

Remark 14.45 We see that there is no real gain in generality with passing from undi-
rected to directed graphs when working with trees. This is an important remark,
because directionality in graphs is often interpreted as causality. For example, there
is a natural causal order in the statements

(it rains) → (car windshields get wet) → (car wipers are on)

in the sense that each event can be seen as a logical precursor to the next one. However, because one can pass from this directed chain to an equivalent undirected chain and then back to an equivalent directed tree by choosing any of the three variables as root, there is no way to infer, from the observation of the joint distribution of the three events (it rains, car windshields get wet, wipers are on), any causal relationship between them: the joint distribution cannot resolve whether wipers are on because

it rains, or whether turning wipers on automatically wets windshields, which in turn triggers a shower!

To infer causal relationships, one needs a different kind of observation, one that would modify the distribution of the system. Such an operation (called an intervention) can be done, for example, by preventing the windshields from being wet (making, for example, the observation in a parking garage), or forcing them to be wet (using a hose). Then, one can compare observations made under these new conditions with those made on the original system, and check, for example, whether they modified the probability that rain occurs outside. The answer (likely to be negative!) would refute any causal relationship from “windshields are wet” to “it rains.” On the other hand, the intervention might modify how wipers are used, which would indicate a possible causal relationship from “windshields are wet” to “wipers are on.” □

14.5 Examples of general “loopy” Markov random fields

We will see that acyclic models have very nice computational properties that make
them attractive in designing distributions. However, the absence of loops is a very
restrictive constraint, which is not realistic in many practical situations. Feedback
effects are often needed, for example. Most models in statistical physics are sup-
ported by a lattice, in which natural translation/rotation invariance relations forbid
using any non-trivial acyclic model. As an example, we now consider the 2D Ising
model on a finite grid, which is a model for (anti)-ferromagnetic interaction in a spin
system.

Let G = (V, E). A (positive) G-Markov model is said to have only pair interactions if and only if it can be written in the form

π(x) = \frac{1}{Z} \exp\Big(− \sum_{s ∈ V} h_s(x^{(s)}) − \sum_{{s,t} ∈ E} h_{{s,t}}(x^{({s,t})})\Big).

Relating to theorem 14.30, this says that π is associated to a potential involving cliques of order 2 at most (note that this does not mean that the cliques of the associated graph have order 2 at most; there can be higher-order cliques, which would then have a zero potential). The functions in the potential are indexed by sets, as they should be from the general definition. However, models with pair interactions are often written in the form

π(x) = \frac{1}{Z} \exp\Big(− \sum_{s ∈ V} h_s(x^{(s)}) − \sum_{{s,t} ∈ E} h̃_{st}(x^{(s)}, x^{(t)})\Big)

with h̃_{st}(λ, µ) = h̃_{ts}(µ, λ) (which is equivalent, taking h̃ = h/2).


14.5. EXAMPLES OF GENERAL “LOOPY” MARKOV RANDOM FIELDS 337

The Ising model is a special case of models with pair interactions, for which the state space, F_s, is equal to {−1, 1} for all s, and

h_s(x^{(s)}) = α_s x^{(s)},    h_{{s,t}}(x^{(s)}, x^{(t)}) = β_{st} x^{(s)} x^{(t)}.

In fact, for binary variables, this is the most general pair interaction model.

Figure 14.3: Graph forming a two-dimensional regular grid.

The Ising model is moreover usually defined on a regular lattice, which, in two
dimensions, implies that V is a finite rectangle in Z2 , for example V = {−N , . . . , N }2 .
The simplest choice of a translation- and 90-degree rotation-invariant graph is the nearest-neighbor graph, for which {(i, j), (i′, j′)} ∈ E if and only if |i − i′| + |j − j′| = 1 (see fig. 14.3). With this graph, one can furthermore simplify the model to obtain the isotropic Ising model given by

π(x) = \frac{1}{Z} \exp\Big(− α \sum_{s ∈ V} x^{(s)} − β \sum_{s ∼ t} x^{(s)} x^{(t)}\Big).

When β < 0, the model is ferromagnetic: each pair of neighbors with identical signs brings a negative contribution to the energy, making the configuration more likely (since lower energy implies higher probability).
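A quick way to experiment with the isotropic Ising model is to compute energies of configurations directly; the sketch below (ours) evaluates the energy and the corresponding unnormalized probability on a small grid (the partition function Z itself would require 2^{25} terms here).

```python
import numpy as np

def ising_energy(x, alpha, beta):
    """alpha * sum_s x^(s) + beta * sum_{s~t} x^(s) x^(t) on a grid of +/-1
    spins, with nearest-neighbor edges as in fig. 14.3."""
    pair = np.sum(x[:, :-1] * x[:, 1:]) + np.sum(x[:-1, :] * x[1:, :])
    return alpha * np.sum(x) + beta * pair

rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=(5, 5))
aligned = np.ones((5, 5), dtype=int)
# With beta < 0 (ferromagnetic), the aligned configuration has lower energy,
# hence a larger unnormalized probability exp(-energy).
print(ising_energy(x, 0.0, -0.5), ising_energy(aligned, 0.0, -0.5))
```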

The Potts model generalizes the Ising model to finite, but not necessarily binary, state spaces, say, F_s = F = {1, . . . , n}. Define the function δ(λ, µ) = 1 if λ = µ and −1 otherwise. Then the Potts model is given by

π(x) = \frac{1}{Z} \exp\Big(− α \sum_{s ∈ V} h(x^{(s)}) − β \sum_{s ∼ t} δ(x^{(s)}, x^{(t)})\Big)    (14.32)

for some function h defined on F.

14.6 General state spaces

Our discussion of Markov random fields on graphs was done under the assumption of finite state spaces, which notably simplifies many of the arguments and avoids relying too much on measure theory. While this situation does cover a large range of applications, there are cases in which one wants to consider variables taking values in continuous spaces, or in countable (infinite) spaces.

The results obtained for discrete variables can most of the time be extended to variables whose distribution has a p.d.f. with respect to a product of measures on the sets in which they take their values. For example, let X, Y, Z take values in R_X, R_Y, R_Z, equipped with σ-algebras S_X, S_Y, S_Z and measures µ_X, µ_Y, µ_Z. Assume that P_{X,Y,Z} is absolutely continuous with respect to µ_X ⊗ µ_Y ⊗ µ_Z, with density ϕ_{XYZ}. In such a situation, (14.3) remains valid, in that X is conditionally independent of Y given Z if and only if

ϕ_{XYZ}(x, y, z) ϕ_Z(z) = ϕ_{XZ}(x, z) ϕ_{YZ}(y, z)    (14.33)

almost everywhere (relative to µ_X ⊗ µ_Y ⊗ µ_Z). Here, ϕ_{XZ}, ϕ_{YZ}, ϕ_Z are the marginal densities of the indexed random variables. The only difficulty in the argument, provided below for the interested reader, is dealing properly with sets of measure zero.
Proof (Proof of (14.33)) Introduce the conditional densities

ϕ_{XY|Z}(x, y | z) = \frac{ϕ_{XYZ}(x, y, z)}{ϕ_Z(z)}

and similarly ϕ_{X|Z} and ϕ_{Y|Z}, which are defined when z ∉ M_Z = {z ∈ R_Z : ϕ_Z(z) = 0}. By definition of conditional independence, we have, for all A ∈ S_X, B ∈ S_Y,

\int_{A×B} ϕ_{XY|Z}(x, y | z) µ_X(dx) µ_Y(dy) = \int_{A×B} ϕ_{X|Z}(x | z) ϕ_{Y|Z}(y | z) µ_X(dx) µ_Y(dy)

for all z ∉ M_Z, which implies that, for all z ∉ M_Z, there exists a set N_z ⊂ R_X × R_Y such that µ_X ⊗ µ_Y(N_z) = 0 and

ϕ_{XY|Z}(x, y | z) = ϕ_{X|Z}(x | z) ϕ_{Y|Z}(y | z)

for all z ∉ M_Z and (x, y) ∉ N_z. This immediately implies (14.33) for those (x, y, z).

If z ∈ M_Z, then

0 = ϕ_Z(z) = \int_{R_X} ϕ_{XZ}(x, z) µ_X(dx) = \int_{R_Y} ϕ_{YZ}(y, z) µ_Y(dy),

implying that ϕ_{XZ}(x, z) = ϕ_{YZ}(y, z) = 0 except on some set N_z such that µ_X ⊗ µ_Y(N_z) = 0, and (14.33) is therefore also true outside of this set. Now, letting N = {(x, y, z) : (x, y) ∈ N_z}, we find that (14.33) is true for all (x, y, z) ∉ N and

µ_X ⊗ µ_Y ⊗ µ_Z(N) = \int_{R_X×R_Y×R_Z} 1_{(x,y) ∈ N_z} µ_X(dx) µ_Y(dy) µ_Z(dz) = \int_{R_Z} µ_X ⊗ µ_Y(N_z) µ_Z(dz) = 0.

(This argument involves Fubini’s theorem [172].) □

With this definition, the proof of proposition 14.5 can be carried out without change, with the positivity condition expressing the fact that there exist R̃_X ⊂ R_X, R̃_Y ⊂ R_Y and R̃_Z ⊂ R_Z such that ϕ_{XYZ}(x, y, z) > 0 for all (x, y, z) ∈ R̃_X × R̃_Y × R̃_Z. (This proposition is actually valid in full generality, with a proper definition of positivity.)

When considering random fields with general state spaces, we will restrict to the similar situation in which each state space F_s is equipped with a σ-algebra S_s and a measure µ_s, and the joint distribution P_X of the random field X = (X^{(s)}, s ∈ V) is absolutely continuous with respect to µ ≜ \bigotimes_{s ∈ V} µ_s, denoting by π the corresponding p.d.f. We will say that π is positive if there exists F̃ = (F̃_s, s ∈ V), with measurable F̃_s ⊂ F_s, such that π(x) > 0 for all x ∈ F(V, F̃). Without loss of generality, unless one considers multiple random fields with different supports, we will assume that F̃_s = F_s for all s.

The definition of consistent families of local interactions (definition 14.24) must be modified by adding the condition that

\int_{F(V)} \prod_{C ∈ C} ϕ_C(x^{(C)}) µ(dx) < ∞.    (14.34)

This requirement is obviously needed to ensure that the normalizing constant in (14.21) is finite. Proposition 14.25 is then true (with sums replaced by integrals in the proof) and so are propositions 14.26 and 14.27. Finally, the Hammersley-Clifford theorem (theorem 14.29) extends to this context.

Even though it is a natural requirement, condition (14.34) may be hard to assess with general families of local interactions. In the case of Gaussian distributions, however, one can provide relatively simple conditions. Assume that F_s = R for all s ∈ V, and consider a potential Λ = (λ_C, C ∈ C) with only univariate and bivariate interactions, such that, for some vector a ∈ R^d (with d = |V|) and symmetric matrix b ∈ S_d,

λ_{{s}}(x^{({s})}) = −a^{(s)} x^{(s)} + \frac{1}{2} b_{ss} (x^{(s)})^2,
λ_{{s,t}}(x^{({s,t})}) = b_{st} x^{(s)} x^{(t)}.
Then, considering x ∈ F(V) as a d-dimensional vector, we have

π(x) = \frac{1}{Z} \exp\Big(a^T x − \frac{1}{2} x^T b x\Big),

with the integrability requirement that b ≻ 0 (positive definite). The random field then follows a Gaussian distribution with mean m = b^{−1} a and covariance matrix Σ = b^{−1}. The normalizing constant, Z, is given by

Z = \frac{e^{\frac{1}{2} a^T b^{−1} a} (2π)^{d/2}}{\sqrt{\det b}}.

This Markov random field parametrization of Gaussian distributions emphasizes the conditional structure of the variables rather than their covariances. It is useful when the associated graph, represented by the matrix b, is sparse. In particular, the conditional distribution of X^{(s)} given the other variables is Gaussian, with mean (a^{(s)} − \sum_{t ≠ s} b_{st} x^{(t)})/b_{ss} and variance 1/b_{ss}.
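These conditional formulas are easy to check against a dense Gaussian computation. The sketch below (ours; the precision matrix is made up) compares the mean (a^{(s)} − ∑_{t≠s} b_{st}x^{(t)})/b_{ss} with the classical conditional-mean formula for the Gaussian N(b^{−1}a, b^{−1}).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
b = 2.0 * np.eye(d)          # sparse symmetric positive-definite precision
b[0, 1] = b[1, 0] = 0.5
b[2, 3] = b[3, 2] = -0.3
a = rng.standard_normal(d)
x = rng.standard_normal(d)
s = 0

# Conditional law of X^(s) given the other coordinates, from the text:
cond_mean = (a[s] - sum(b[s, t] * x[t] for t in range(d) if t != s)) / b[s, s]
cond_var = 1.0 / b[s, s]

# Cross-check with the joint Gaussian, mean m = b^{-1} a, covariance b^{-1}.
m, Sigma = np.linalg.solve(b, a), np.linalg.inv(b)
rest = [t for t in range(d) if t != s]
mu = m[s] + Sigma[s, rest] @ np.linalg.solve(Sigma[np.ix_(rest, rest)],
                                             x[rest] - m[rest])
print(cond_mean, float(mu), cond_var)   # the two means agree
```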
Chapter 15

Probabilistic Inference for Random Fields

Once the joint distribution of a family of variables has been modeled as a random
field, this model can be used to estimate the probabilities of specific events, or the
expectations of random variables of interest. For example, if the modeled variables
relate to a medical condition, in which variables such as diagnosis, age, gender, clin-
ical evidence can interact, one may want to compute, say, the probability of someone
having a disease given other observable factors. Note that, being able to compute
expectations of the modeled variables for G-Markov processes also ensures that one
can compute conditional expectations of some modeled variables given others, since,
by proposition 14.22, conditional G-Markov distributions are Markov over restricted
graphs.

We assume that X is G-Markov for a graph G = (V, E) and restrict (unless specified otherwise) to finite state spaces. We consider the basic problem of computing P(X^{(S)} = x^{(S)}) for S ⊂ V, starting with one-vertex marginals, P(X^{(s)} = x^{(s)}).

The Hammersley-Clifford theorem provides a generic form for general positive G-Markov processes, in the form

P(X = x) = π(x) = \frac{1}{Z} \exp\Big(− \sum_{C ∈ C_G} h_C(x^{(C)})\Big).    (15.1)

So, formally, marginal distributions are given by the ratio

P(X^{(S)} = x^{(S)}) = \frac{\sum_{y ∈ F(V), y^{(S)} = x^{(S)}} \exp\big(− \sum_{C ∈ C_G} h_C(y^{(C)})\big)}{\sum_{y ∈ F(V)} \exp\big(− \sum_{C ∈ C_G} h_C(y^{(C)})\big)}.

The problem is that the sums involved in this ratio involve a number of terms that
grows exponentially with the size of V . Unless V is very small, a direct computation
of these sums is intractable. An exception to this is the case of acyclic graphs, as


we will see in section 15.2. But for general, loopy, graphs, the sums can only be
approximated, using, for example, Monte-Carlo sampling, as described in the next
section.

15.1 Monte Carlo sampling

Markov chain Monte Carlo methods are well adapted to sampling from Markov random fields, because the conditional distributions used in Gibbs sampling, or, more generally, the ratios of probabilities used in the Metropolis-Hastings algorithm, do not require the computation of the normalizing constant Z in (15.1). The simplest use of Gibbs sampling generalizes the Ising model example of section 13.4.2. Using the notation of Algorithm 13.2, one lets B′_s = F(s^c) (with the notation s^c = V ∖ {s}) and U_s(x) = x^{(s^c)}. The conditional distribution given U_s is

Q_s(U_s(x), y) = P(X^{(s)} = y^{(s)} | X^{(s^c)} = x^{(s^c)}) 1_{y^{(s^c)} = x^{(s^c)}}.

The conditional probability in the r.h.s. of this equation takes the form

π_s(y^{(s)} | x^{(s^c)}) ≜ P(X^{(s)} = y^{(s)} | X^{(s^c)} = x^{(s^c)}) = \frac{1}{Z_s(x^{(s^c)})} \exp\Big(− \sum_{C ∈ C, s ∈ C} h_C(y^{(s)} ∧ x^{(C∩s^c)})\Big)

with

Z_s(x^{(s^c)}) = \sum_{z^{(s)} ∈ F_s} \exp\Big(− \sum_{C ∈ C, s ∈ C} h_C(z^{(s)} ∧ x^{(C∩s^c)})\Big).

The Gibbs sampling algorithm samples from Q_s by visiting all s ∈ V infinitely often, as described in Algorithm 13.2. Metropolis-Hastings schemes are implemented similarly, the most common choice using a local update scheme in Algorithm 13.3 such that g(x, ·) only changes one coordinate, chosen at random, so that

g(x, y) = \frac{1}{|V|} \sum_{s ∈ V} 1_{y^{(s^c)} = x^{(s^c)}} g_s(y^{(s)})

where g_s is some probability distribution on F_s. The acceptance probability a(x, y) is equal to 1 when y = x. If y ≠ x and g(x, y) > 0, there is a unique s for which y^{(s^c)} = x^{(s^c)}, and

a(x, y) = \min\Big(1, \frac{π(y) g(y, x)}{π(x) g(x, y)}\Big)

with

\frac{π(y) g(y, x)}{π(x) g(x, y)} = \frac{π_s(y^{(s)} | x^{(s^c)}) g_s(x^{(s)})}{π_s(x^{(s)} | x^{(s^c)}) g_s(y^{(s)})}.

Note that the latter equation avoids the computation of the local normalizing constant Z_s(x^{(s^c)}), which cancels in the ratio.
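As a concrete instance, here is a sketch (ours) of the Gibbs sampler for the isotropic Ising model of section 14.5: each site is redrawn from π_s(· | x^{(s^c)}), which only involves the four neighboring spins.

```python
import numpy as np

def gibbs_sweep(x, alpha, beta, rng):
    """One full Gibbs sweep over the grid for the isotropic Ising model."""
    n, m = x.shape
    for i in range(n):
        for j in range(m):
            nb = sum(x[i2, j2] for i2, j2 in
                     ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                     if 0 <= i2 < n and 0 <= j2 < m)
            # Local energies of the two candidate values of the spin.
            w = {s: np.exp(-(alpha * s + beta * s * nb)) for s in (-1, 1)}
            x[i, j] = 1 if rng.random() < w[1] / (w[1] + w[-1]) else -1
    return x

rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=(16, 16))
for _ in range(200):
    gibbs_sweep(x, alpha=0.0, beta=-0.8, rng=rng)
print(x.mean())     # ferromagnetic coupling: large same-sign clusters emerge
```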

Both algorithms have a transition probability P that satisfies P^m(x, y) > 0 for all x, y ∈ F(V), with m = |V| (for Metropolis-Hastings, one must assume that g_s(y^{(s)}) > 0 for all y^{(s)} ∈ F_s). This ensures that the chain is uniformly geometrically ergodic, i.e., (13.9) is satisfied with a constant M and some ρ < 1. However, in many practical cases (especially for strongly structured distributions and large sets V), the convergence rate ρ can be very close to 1, resulting in slow convergence.

Acceleration strategies have been designed to address this issue, which is often
due to the existence of multiple configurations that are local modes of the probabil-
ity π. Such configurations are isolated from other high-probability configurations
because local updating schemes need to make multiple low-probability changes to
access them from the local mode. The following two approaches provide examples
designed to address this issue.

a. Cluster sampling. To facilitate escaping from such local modes, it is sometimes possible to augment the state space by introducing a new configuration space, with variable denoted ξ, and designing a joint distribution π̂(ξ, x) such that the marginal distribution on F(V) (summing over ξ) is the targeted π. The additional variable can create high-probability bridges between local modes of π, and accelerate convergence.
To take an example, assume that all sets F_s are identical (letting F = F_s, s ∈ V) and that the auxiliary variable ξ takes values in the set of functions from E to {0, 1}, which we will denote B(E), i.e., that it takes the form (ξ^{(st)}, {s, t} ∈ E), with ξ^{(st)} ∈ {0, 1}. For x ∈ F(V), introduce the set B_x containing all ξ ∈ B(E) such that, for all {s, t} ∈ E,

x^{(s)} ≠ x^{(t)} ⇒ ξ^{(st)} = 1.

Assume that the conditional distribution of ξ given x is supported by B_x, such that, for ξ ∈ B_x,

P(ξ = ξ | X = x) = π̂(ξ | x) = \frac{1}{ζ(x)} \exp\Big(− \sum_{{s,t} ∈ E} µ_{st} ξ^{(st)}\Big).

The coefficients µ_{st} are free to choose (and one possible choice is to take µ_{st} = 0 for all {s, t} ∈ E). For this distribution, all ξ^{(st)} are independent conditionally on X = x, with ξ^{(st)} = 1 with probability 1 if x^{(s)} ≠ x^{(t)}, and

P(ξ^{(st)} = 1 | X = x) = \frac{e^{−µ_{st}}}{1 + e^{−µ_{st}}}    (15.2)

if x^{(s)} = x^{(t)}. This conditional distribution is, as a consequence, very easy to sample from.

Moreover, the normalizing constant ζ(x) has a closed form and is given by

ζ(x) = \prod_{{s,t} ∈ E} (1_{x^{(s)} = x^{(t)}} + e^{−µ_{st}}) = \exp\Big(\sum_{{s,t} ∈ E} \log(1 + e^{−µ_{st}}) − \sum_{{s,t} ∈ E} \log(1 + e^{µ_{st}}) 1_{x^{(s)} ≠ x^{(t)}}\Big).

Now consider the conditional probability that X = x given ξ = ξ. For this distribution, one has, with probability 1, X^{(s)} = X^{(t)} when ξ^{(st)} = 0. This implies that X is constant on the connected components of the subgraph (V, E_ξ) of (V, E), where {s, t} ∈ E_ξ if and only if ξ^{(st)} = 0. Let V_1, . . . , V_m denote these connected components (these components and their number depend on ξ). The conditional distribution of X given ξ is therefore supported by the configurations such that there exist c_1, . . . , c_m ∈ F with x^{(s)} = c_j if and only if s ∈ V_j, which we will denote, with some abuse of notation, c_1^{(V_1)} ∧ · · · ∧ c_m^{(V_m)}.

Given this remark, the conditional distribution of X given ξ = ξ is equivalent to a distribution on F^m, which may be feasible to sample from directly if |F| and m are not too large. To sample from π, one now needs to alternate between sampling ξ given X and the converse, yielding the following first version of cluster-based sampling.

Algorithm 15.1 (Cluster-based sampling: Version 1)

This algorithm samples from (15.1).
(1) Initialize the algorithm with some configuration x ∈ F(V).
(2) Loop over the following steps:
a. Generate a configuration ξ ∈ B_x such that ξ^{(st)} = 1 with probability given by (15.2) when x^{(s)} = x^{(t)}.
b. Determine the connected components, V_1, . . . , V_m, of the graph G_ξ = (V, E_ξ), with edges given by the pairs {s, t} such that ξ^{(st)} = 0.
c. Sample values c_1, . . . , c_m ∈ F according to the distribution

q(c_1, . . . , c_m) ∝ \frac{π(c_1^{(V_1)} ∧ · · · ∧ c_m^{(V_m)})}{ζ(c_1^{(V_1)} ∧ · · · ∧ c_m^{(V_m)})}.

d. Set x = c_1^{(V_1)} ∧ · · · ∧ c_m^{(V_m)}.

Step (2.c) takes a simple form in the special case when π is a non-homogeneous Potts model ((14.32)) with positive interactions, which we will write as

π(x) = \exp\Big(− \sum_{s ∈ V} α_s x^{(s)} − \sum_{{s,t} ∈ E} β_{st} 1_{x^{(s)} ≠ x^{(t)}}\Big)

with β_{st} ≥ 0. Then

\frac{π(x)}{ζ(x)} ∝ \exp\Big(− \sum_{s ∈ V} α_s x^{(s)} − \sum_{{s,t} ∈ E} (β_{st} − β′_{st}) 1_{x^{(s)} ≠ x^{(t)}}\Big)

with β′_{st} = \log(1 + e^{µ_{st}}). If one chooses µ_{st} such that β′_{st} = β_{st} (which is possible since β_{st} ≥ 0), then the interaction term disappears and the probability q in (2.c) is proportional to

\prod_{j=1}^m \exp\Big(− c_j \sum_{s ∈ V_j} α_s\Big)

so that c_1, . . . , c_m can be generated independently. The resulting algorithm is the Swendsen-Wang sampling algorithm for the Potts model [187]. The presentation given here adapts the one introduced in Barbu and Zhu [16].

For more general models, step (2.c) can be computationally costly, especially if the number of connected components is large. In this case, this step can be replaced by a Gibbs sampling step for one of the c_j's conditional on the others (and ξ), which we summarize in the following variation of Algorithm 15.1.

Algorithm 15.2 (Cluster-based sampling: Version 2)


This algorithm samples from (15.1).
(1) Initialize the algorithm some configuration x ∈ F (V ).
(2) Loop over the following steps:
a. Generate a configuration ξ ∈ Bx such that ξ (st) = 1 with probability given by
(15.2) when x(s) = x(t) .
b. Determine the connected components, V_1, . . . , V_m, of the graph G_ξ = (V, E_ξ), with edges given by the pairs {s, t} such that ξ^{(st)} = 0. Note that x is constant on each of these connected components, i.e., there exist c_1, . . . , c_m ∈ F such that x = c_1^{(V_1)} ∧ · · · ∧ c_m^{(V_m)}.
c. Select at random one of the components, say, j_0 ∈ {1, . . . , m}.
d. Sample the value c̃_{j_0} ∈ F according to the distribution

q(c̃_{j_0}) ∝ \frac{π(c̃_1^{(V_1)} ∧ · · · ∧ c̃_m^{(V_m)})}{ζ(c̃_1^{(V_1)} ∧ · · · ∧ c̃_m^{(V_m)})},

with c̃_j = c_j if j ≠ j_0.
e. Set x^{(s)} = c̃_{j_0} for s ∈ V_{j_0}.

Unlike single-variable updating schemes, these algorithms can update large chunks of the configuration at each step, and may result in significantly faster convergence of the sampling procedure. Note that step (2.d) in Algorithm 15.2 can be replaced by a Metropolis-Hastings update with a proper choice of proposal probability [16].
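For the Potts case, one sweep of Algorithm 15.1 reduces to opening bonds between equal neighbors and relabeling clusters. The sketch below (ours; it uses a small union-find to track components, and assumes the homogeneous case β_{st} = β with µ chosen so that β′ = β, which gives bond probability 1 − e^{−β} for equal neighbors) illustrates the Swendsen-Wang sweep on a grid.

```python
import numpy as np

def sw_sweep(x, q, beta, rng):
    """One Swendsen-Wang sweep for the homogeneous Potts model on a grid."""
    n, m = x.shape
    parent = list(range(n * m))

    def find(i):                         # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    p_open = 1.0 - np.exp(-beta)         # bond probability for equal neighbors
    for i in range(n):
        for j in range(m):
            for i2, j2 in ((i + 1, j), (i, j + 1)):
                if i2 < n and j2 < m and x[i, j] == x[i2, j2] \
                        and rng.random() < p_open:
                    parent[find(i * m + j)] = find(i2 * m + j2)
    labels = {}                          # fresh uniform label for each cluster
    for i in range(n):
        for j in range(m):
            r = find(i * m + j)
            if r not in labels:
                labels[r] = rng.integers(q)
            x[i, j] = labels[r]
    return x

rng = np.random.default_rng(0)
x = rng.integers(3, size=(16, 16))
for _ in range(50):
    sw_sweep(x, q=3, beta=1.0, rng=rng)
print(np.bincount(x.ravel(), minlength=3))
```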
b. Parallel tempering. We now consider a different kind of extension, in which we allow π to depend continuously on a parameter β > 0, writing π_β; the goal is to sample from π_1. For example, one can extend (15.1) by the family of probability distributions

π_β(x) = \frac{1}{Z_β} \exp\Big(−β \sum_{C ∈ C_G} h_C(x^{(C)})\Big)

for β ≥ 0. For small β, π_β gets close to the uniform distribution on F(V) (achieved for β = 0), so that it becomes easier to move from local mode to local mode. This implies that sampling with small β is more efficient and the associated Markov chain moves more rapidly in the configuration space.
Assume given, for all β, two ergodic transition probabilities on F(V), q_β and q̃_β, such that (13.6) is satisfied with π_β as invariant probability, namely

π_β(y) q_β(y, x) = π_β(x) q̃_β(x, y)    (15.3)

for all x, y ∈ F(V) (as seen in (13.6), q̃_β is the transition probability for the reversed chain). The basic idea is that q_β provides a Markov chain that converges rapidly for small β and slowly when β is closer to 1. Parallel tempering (this algorithm was introduced in Neal [141], based on ideas developed in Marinari and Parisi [127]) leverages this fact (and the continuity of π_β in β) to accelerate the simulation of π_1 by introducing intermediate steps sampling at low β values.

The algorithm specifies a sequence of parameters 0 ≤ β_1 ≤ · · · ≤ β_m = 1. One simulation step goes down, then up this scale, as described in the following algorithm.

Algorithm 15.3 (Parallel Tempering)

Start with an initial configuration x_0 ∈ F(V). This configuration is then updated at each step, using the following sequence of operations.
(1) For j = 1, . . . , m, generate a configuration x_j according to q̃_{β_j}(x_{j−1}, ·).
(2) Generate a configuration z_{m−1} according to q_{β_m}(x_m, ·).
(3) For j = m − 1, . . . , 1, generate a configuration z_{j−1} according to q_{β_j}(z_j, ·).
(4) Set x_0 = z_0 with probability

\min\Big(1, \frac{π_{β_0}(z_0)}{π_{β_0}(x_0)} \prod_{j=1}^{m−1} \frac{π_{β_j}(x_{j−1})}{π_{β_j}(x_j)} \cdot \frac{π_{β_m}(x_{m−1})}{π_{β_m}(z_{m−1})} \prod_{j=1}^{m−1} \frac{π_{β_j}(z_j)}{π_{β_j}(z_{j−1})}\Big).

(Otherwise, keep x_0 unchanged.)



Importantly, the acceptance probability at step (4) only involves ratios of π_β's and therefore no normalizing constants. We now show that this algorithm is π_{β_0}-reversible. Let p(·, ·) denote the transition probability of the chain. If z_0 ≠ x_0, p(x_0, z_0) corresponds to steps (1) to (3), with acceptance at step (4), and is therefore given by the sum, over all x_1, . . . , x_m and z_1, . . . , z_{m−1}, of the products

q̃_{β_1}(x_0, x_1) · · · q̃_{β_m}(x_{m−1}, x_m) q_{β_m}(x_m, z_{m−1}) · · · q_{β_1}(z_1, z_0)
× \min\Big(1, \frac{π_{β_0}(z_0)}{π_{β_0}(x_0)} \prod_{j=1}^{m−1} \frac{π_{β_j}(x_{j−1})}{π_{β_j}(x_j)} \cdot \frac{π_{β_m}(x_{m−1})}{π_{β_m}(z_{m−1})} \prod_{j=1}^{m−1} \frac{π_{β_j}(z_j)}{π_{β_j}(z_{j−1})}\Big).

Applying (15.3), this is equal to

\min\Big(q̃_{β_1}(x_0, x_1) · · · q̃_{β_m}(x_{m−1}, x_m) q_{β_m}(x_m, z_{m−1}) · · · q_{β_1}(z_1, z_0),
\frac{π_{β_0}(z_0)}{π_{β_0}(x_0)} q_{β_1}(x_1, x_0) · · · q_{β_m}(x_m, x_{m−1}) q̃_{β_m}(z_{m−1}, x_m) · · · q̃_{β_1}(z_0, z_1)\Big).

So,

π_{β_0}(x_0) p(x_0, z_0) = \sum \min\Big(π_{β_0}(x_0) q̃_{β_1}(x_0, x_1) · · · q̃_{β_m}(x_{m−1}, x_m) q_{β_m}(x_m, z_{m−1}) · · · q_{β_1}(z_1, z_0),
π_{β_0}(z_0) q_{β_1}(x_1, x_0) · · · q_{β_m}(x_m, x_{m−1}) q̃_{β_m}(z_{m−1}, x_m) · · · q̃_{β_1}(z_0, z_1)\Big),

where the sum is over all x_1, . . . , x_m, z_1, . . . , z_{m−1} ∈ F(V). The sum is, of course, unchanged if one renames x_1, . . . , x_m, z_1, . . . , z_{m−1} to z_1, . . . , z_m, x_1, . . . , x_{m−1}, but doing so provides the expression of π_{β_0}(z_0) p(z_0, x_0), proving the reversibility of the chain with respect to π_{β_0}.

15.2 Inference with acyclic graphs

We now switch to deterministic methods to compute, or approximate, marginal probabilities of Markov random fields. In this section, we consider an undirected acyclic graph G = (V, E). As we have seen, Markov processes for acyclic graphs are also Markov for any tree structure associated with the graph. Introducing such a tree, G̃ = (V, Ẽ) with G̃^♭ = G, we know that a Markov process on G can be written in the form (letting s_0 denote the root in G̃):

π(x) = p_{s_0}(x^{(s_0)}) \prod_{(s,t) ∈ Ẽ} p_{st}(x^{(s)}, x^{(t)})    (15.4)

where p_{s_0} is a probability and p_{st} a transition probability.

We now show how to compute the marginal probabilities of configurations x^{(S)}, denoted π_S(x^{(S)}), for a set S ⊂ V, starting with singletons S = {s}. The computation can be done by propagating down the tree as follows. For s = s_0, the probability is known, with π_{s_0} = p_{s_0}. Now take an arbitrary s ≠ s_0 and let pa(s) be its parent. Then

π_s(x^{(s)}) = P(X^{(s)} = x^{(s)}) = \sum_{y^{(pa(s))} ∈ F_{pa(s)}} P(X^{(s)} = x^{(s)} | X^{(pa(s))} = y^{(pa(s))}) P(X^{(pa(s))} = y^{(pa(s))})
= \sum_{y^{(pa(s))} ∈ F_{pa(s)}} π_{pa(s)}(y^{(pa(s))}) p_{pa(s)s}(y^{(pa(s))}, x^{(s)}),

so that the marginal probability at any s ≠ s_0 can be computed given the marginal probability of its parent. We can propagate the computation down the tree, with a total cost for computing π_s proportional to \sum_{k=1}^n |F_{t_{k−1}}| |F_{t_k}|, where t_0 = s_0, t_1, . . . , t_n = s is the unique path between s_0 and s. This is linear in the depth of the tree, and quadratic (not exponential) in the sizes of the state spaces. The computation of all singleton marginals requires an order of \sum_{(s,t) ∈ E} |F_s| |F_t| operations.
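The downward recursion is a sequence of vector-matrix products. A minimal sketch (ours; distributions are randomly generated), reusing the parent-map convention from before:

```python
import numpy as np

rng = np.random.default_rng(1)
parent = {0: None, 1: 0, 2: 0, 3: 1}       # tree with root 0, ids grow with depth
n = 3
pi = {0: rng.dirichlet(np.ones(n))}        # pi_{s_0} = p_{s_0}
p_trans = {s: rng.dirichlet(np.ones(n), size=n)   # p_{pa(s)s}, rows sum to 1
           for s in parent if parent[s] is not None}

# pi_s(x) = sum_y pi_{pa(s)}(y) p_{pa(s)s}(y, x): one product per edge.
for s in sorted(parent):
    if parent[s] is not None:
        pi[s] = pi[parent[s]] @ p_trans[s]

print({s: np.round(p, 3) for s, p in pi.items()})
```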

Now, assume that the probabilities of singletons have been computed and consider an arbitrary set S ⊂ V. Let s ∈ V be an ancestor of every vertex in S, maximal in the sense that none of its children also satisfies this property. Consider the subtrees of G̃ starting from each of the children of s, denoted G̃_1, . . . , G̃_n with G̃_k = (V_k, Ẽ_k). Let S_k = S ∩ V_k. From the conditional independence,

π_S(x^{(S)}) = \sum_{y^{(s)} ∈ F_s} P(X^{(S∖{s})} = x^{(S∖{s})} | X^{(s)} = y^{(s)}) π_s(y^{(s)})
= \sum_{y^{(s)} ∈ F_s} \prod_{k=1, S_k ≠ ∅}^n P(X^{(S_k)} = x^{(S_k)} | X^{(s)} = y^{(s)}) π_s(y^{(s)}).

Now, for all k = 1, . . . , n, we have |S_k| < |S|: this is obvious if S is not completely included in one of the V_k's. But if S ⊂ V_k, then the root, s_k, of V_k is an ancestor of all the elements in S and is a child of s, which contradicts the assumption that s is maximal. So we have reduced the computation of π_S(x^{(S)}) to the computation of n probabilities of smaller sets, namely P(X^{(S_k)} = x^{(S_k)} | X^{(s)} = y^{(s)}) for S_k ≠ ∅. Because the distribution of X^{(V_k)} conditioned at s is a G̃_k-Markov model, we can reiterate the procedure until only sets of cardinality one remain, for which we know how to explicitly compute probabilities.

This provides a feasible algorithm to compute marginal probabilities with trees, at least when the distribution is given in tree form, as in (15.4). We now address

the situation in which one starts with a probability distribution associated with pair interactions (cf. definition 14.24) over the acyclic graph G,

π(x) = \frac{1}{Z} \prod_{s ∈ V} ϕ_s(x^{(s)}) \prod_{{s,t} ∈ E} ϕ_{st}(x^{(s)}, x^{(t)}).    (15.5)

We assume these local interactions to be consistent, still allowing for some vanishing ϕ_{st}(x^{(s)}, x^{(t)}).

Putting π in the form (15.4) is equivalent to computing all joint probability distributions π_{st}(x^{(s)}, x^{(t)}) for {s, t} ∈ E, and we now describe this computation. Denote

U(x) = \prod_{s ∈ V} ϕ_s(x^{(s)}) \prod_{{s,t} ∈ E} ϕ_{st}(x^{(s)}, x^{(t)}),

so that Z = \sum_{y ∈ F(V)} U(y). For the tree G̃ = (V, Ẽ) and t ∈ V, we let G̃_t = (V_t, Ẽ_t) be the subtree of G̃ rooted at t (containing t and all its descendants). For S ⊂ V, define

U_S(x^{(S)}) = \prod_{s ∈ S} ϕ_s(x^{(s)}) \prod_{{s,s′} ∈ E, s,s′ ∈ S} ϕ_{ss′}(x^{(s)}, x^{(s′)})

and

Z_t(x^{(t)}) = \sum_{y^{(V_t^*)} ∈ F(V_t^*)} U_{V_t}(x^{(t)} ∧ y^{(V_t^*)}),

with V_t^* = V_t ∖ {t}.

Lemma 15.1 Let G = (V, E) be an undirected acyclic graph and π = P_X be the G-Markov distribution given by (15.5). With the notation above, we have

π_{s_0}(x^{(s_0)}) = \frac{Z_{s_0}(x^{(s_0)})}{\sum_{y^{(s_0)} ∈ F_{s_0}} Z_{s_0}(y^{(s_0)})}    (15.6)

and, for (s, t) ∈ Ẽ,

p_{st}(x^{(s)}, x^{(t)}) = P(X^{(t)} = x^{(t)} | X^{(s)} = x^{(s)}) = \frac{ϕ_{st}(x^{(s)}, x^{(t)}) Z_t(x^{(t)})}{\sum_{y^{(t)} ∈ F_t} ϕ_{st}(x^{(s)}, y^{(t)}) Z_t(y^{(t)})}.    (15.7)

Proof Let W_t = V ∖ V_t. Clearly, Z = \sum_{x^{(s_0)} ∈ F_{s_0}} Z_{s_0}(x^{(s_0)}) and π_{s_0}(x^{(s_0)}) = Z_{s_0}(x^{(s_0)})/Z, which gives (15.6). Moreover, if s ∈ V, we have

P(X^{(V_s^*)} = x^{(V_s^*)} | X^{(s)} = x^{(s)}) = \frac{\sum_{y^{(W_s)}} U(x^{(V_s)} ∧ y^{(W_s)})}{\sum_{y^{(V_s^*)}, y^{(W_s)}} U(x^{(s)} ∧ y^{(V_s^*)} ∧ y^{(W_s)})}.

We can write

U(x^{(s)} ∧ y^{(V_s^*)} ∧ y^{(W_s)}) = U_{V_s}(x^{(s)} ∧ y^{(V_s^*)}) \, U_{{s}∪W_s}(x^{(s)} ∧ y^{(W_s)}) \, ϕ_s(x^{(s)})^{−1},

yielding the simplified expression

P(X^{(V_s^*)} = x^{(V_s^*)} | X^{(s)} = x^{(s)}) = \frac{U_{V_s}(x^{(V_s)}) ϕ_s(x^{(s)})^{−1} \sum_{y^{(W_s)}} U_{{s}∪W_s}(x^{(s)} ∧ y^{(W_s)})}{ϕ_s(x^{(s)})^{−1} \big(\sum_{y^{(V_s^*)}} U_{V_s}(x^{(s)} ∧ y^{(V_s^*)})\big) \big(\sum_{y^{(W_s)}} U_{{s}∪W_s}(x^{(s)} ∧ y^{(W_s)})\big)} = \frac{U_{V_s}(x^{(V_s)})}{Z_s(x^{(s)})}.
Now, if t_1, . . . , t_n are the children of s, we have

U_{V_s}(x^{(V_s)}) = ϕ_s(x^{(s)}) \prod_{k=1}^n ϕ_{st_k}(x^{(s)}, x^{(t_k)}) \prod_{k=1}^n U_{V_{t_k}}(x^{(V_{t_k})}),

so that

P(X^{(t_k)} = x^{(t_k)}, k = 1, . . . , n | X^{(s)} = x^{(s)})
= \frac{1}{Z_s(x^{(s)})} \sum_{y^{(V_{t_k}^*)}, k=1,...,n} ϕ_s(x^{(s)}) \prod_{k=1}^n ϕ_{st_k}(x^{(s)}, x^{(t_k)}) \prod_{k=1}^n U_{V_{t_k}}(x^{(t_k)} ∧ y^{(V_{t_k}^*)})
= \frac{ϕ_s(x^{(s)}) \prod_{k=1}^n ϕ_{st_k}(x^{(s)}, x^{(t_k)}) \prod_{k=1}^n Z_{t_k}(x^{(t_k)})}{Z_s(x^{(s)})}.

This implies that the transition probability needed for the tree model, p_{st_1}(x^{(s)}, x^{(t_1)}), must be proportional to ϕ_{st_1}(x^{(s)}, x^{(t_1)}) Z_{t_1}(x^{(t_1)}), which proves the lemma. □

This lemma reduces the computation of the transition probabilities to the computation of Z_s(x^{(s)}), for s ∈ V. This can be done efficiently, going upward in the tree (from terminal vertexes to the root). Indeed, if s is terminal, then V_s = {s} and Z_s(x^{(s)}) = ϕ_s(x^{(s)}). Now, if s is non-terminal and t_1, . . . , t_n are its children, then it is easy to see that

Z_s(x^{(s)}) = ϕ_s(x^{(s)}) \sum_{x^{(t_1)} ∈ F_{t_1}, ..., x^{(t_n)} ∈ F_{t_n}} \prod_{k=1}^n ϕ_{st_k}(x^{(s)}, x^{(t_k)}) Z_{t_k}(x^{(t_k)})
= ϕ_s(x^{(s)}) \prod_{k=1}^n \Big(\sum_{x^{(t_k)} ∈ F_{t_k}} ϕ_{st_k}(x^{(s)}, x^{(t_k)}) Z_{t_k}(x^{(t_k)})\Big).    (15.8)

So, Z_s(x^{(s)}) can be easily computed once the Z_t(x^{(t)})'s are known for the children of s.

Equations (15.6) to (15.8) therefore provide the necessary relations to compute the singleton and edge marginal probabilities on the tree. It is important to note that these relations are valid for any tree structure consistent with the acyclic graph we started with. We now rephrase them with notation that only depends on this graph and not on the selected orientation.

Let {s, t} be an edge in E, and consider the connected components of the graph G ∖ {s}, from which the vertex s has been removed. Let V_{st} be the component that contains t, and V_{st}^* = V_{st} ∖ {t}. Define

Z_{st}(x^{(t)}) = \sum_{y^{(V_{st}^*)} ∈ F(V_{st}^*)} U_{V_{st}}(x^{(t)} ∧ y^{(V_{st}^*)}).

This Z_{st} coincides with the previously introduced Z_t, computed with any tree in which the edge {s, t} is oriented from s to t. Equation (15.8) can be rewritten with this new notation in the form

Z_{st}(x^{(t)}) = ϕ_t(x^{(t)}) \prod_{t′ ∈ V_t∖{s}} \Big(\sum_{x^{(t′)} ∈ F_{t′}} ϕ_{tt′}(x^{(t)}, x^{(t′)}) Z_{tt′}(x^{(t′)})\Big).    (15.9)

(Here, V_t denotes the set of neighbors of t in G.)

This equation is usually written in terms of “messages” defined by

m_{ts}(x^{(s)}) = \sum_{x^{(t)} ∈ F_t} ϕ_{st}(x^{(s)}, x^{(t)}) Z_{st}(x^{(t)}),

which yields

Z_{st}(x^{(t)}) = ϕ_t(x^{(t)}) \prod_{t′ ∈ V_t∖{s}} m_{t′t}(x^{(t)})

and the message consistency relation

m_{ts}(x^{(s)}) = \sum_{x^{(t)} ∈ F_t} ϕ_{st}(x^{(s)}, x^{(t)}) ϕ_t(x^{(t)}) \prod_{t′ ∈ V_t∖{s}} m_{t′t}(x^{(t)}).    (15.10)

Also, because one can start building a tree from G using any vertex as a root, (15.6) is valid for any s ∈ V, in the form (applying (15.8) to the root)

π_s(x^{(s)}) = \frac{1}{ζ_s} ϕ_s(x^{(s)}) \prod_{t ∈ V_s} m_{ts}(x^{(s)}),    (15.11)

where ζ_s is chosen to ensure that the sum of probabilities is 1. (In fact, looking at lemma 15.1, we have ζ_s = Z, independent of s.)

Similarly, (15.7) can be written

p_{st}(x^{(s)}, x^{(t)}) = m_{ts}(x^{(s)})^{−1} ϕ_{st}(x^{(s)}, x^{(t)}) ϕ_t(x^{(t)}) \prod_{t′ ∈ V_t∖{s}} m_{t′t}(x^{(t)}),    (15.12)

which provides the edge transition probabilities. Combining this with (15.11), we get the edge marginal probabilities:

π_{st}(x^{(s)}, x^{(t)}) = \frac{1}{ζ} ϕ_{st}(x^{(s)}, x^{(t)}) ϕ_s(x^{(s)}) ϕ_t(x^{(t)}) \prod_{t′ ∈ V_t∖{s}} m_{t′t}(x^{(t)}) \prod_{s′ ∈ V_s∖{t}} m_{s′s}(x^{(s)}).    (15.13)

Remark 15.2 We can modify (15.10) by multiplying the right-hand side by an arbitrary constant q_{ts} without changing the resulting estimation of probabilities: this only multiplies the messages by a constant, which cancels after normalization. This remark can be useful in particular to avoid numerical overflow; one can, for example, define q_{ts} = 1/\sum_{x^{(s)} ∈ F_s} m_{ts}(x^{(s)}), so that the messages always sum to 1. This is also useful when applying belief propagation (see next section) to loopy networks, for which (15.10) may diverge while the normalized version converges. □

The following summarizes this message passing algorithm.

Algorithm 15.4 (Belief propagation on acyclic graphs)

Given a family of interactions ϕ_s : F_s → [0, +∞), ϕ_{st} : F_s × F_t → [0, +∞):

(1) Initialize functions (messages) m_{ts} : F_s → R, e.g., taking m_{ts}(x^{(s)}) = 1/|F_s|.
(2) Compute unnormalized messages

m̃_{ts}(·) = \sum_{x^{(t)} ∈ F_t} ϕ_{st}(·, x^{(t)}) ϕ_t(x^{(t)}) \prod_{t′ ∈ V_t∖{s}} m_{t′t}(x^{(t)})

and let m_{ts}(·) = q_{ts} m̃_{ts}(·), for some choice of constant q_{ts}, which must be a fixed function of m̃_{ts}(·), such as

q_{ts} = \Big(\sum_{x^{(s)} ∈ F_s} m̃_{ts}(x^{(s)})\Big)^{−1}.

(3) Stop the algorithm when the messages stabilize (which happens after a finite number of updates). Compute the edge marginal distributions using (15.13).

It should be clear, from the previous analysis, that messages stabilize in finite time, starting from the outskirts of the acyclic graph. Indeed, messages starting from a terminal t (a vertex with only one neighbor) are automatically set to their correct value in (15.10),

m_{ts}(x^{(s)}) = \sum_{x^{(t)} ∈ F_t} ϕ_{st}(x^{(s)}, x^{(t)}) ϕ_t(x^{(t)}),

at the first update. These values then propagate to provide messages that satisfy (15.10) starting from the next-to-terminal vertexes (those that have only one neighbor left when the terminals are removed), and so on.
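Algorithm 15.4 is compact in code. The sketch below (ours; the graph and interactions are made up) runs normalized message passing on a small chain and recovers the edge marginals from (15.13); on an acyclic graph the messages stabilize after a few sweeps.

```python
import numpy as np

def belief_propagation(nbrs, phi_s, phi_st, n_iter=20):
    """Normalized message passing (Algorithm 15.4). phi_st[(s, t)] is an
    |F_s| x |F_t| table, with phi_st[(t, s)] equal to its transpose."""
    msg = {(t, s): np.ones(len(phi_s[s])) / len(phi_s[s])
           for s in nbrs for t in nbrs[s]}
    for _ in range(n_iter):
        new = {}
        for (t, s) in msg:
            # m_ts(x_s) = sum_{x_t} phi_st(x_s, x_t) phi_t(x_t)
            #             * prod_{t' in V_t \ {s}} m_{t't}(x_t)
            prod = phi_s[t].copy()
            for t2 in nbrs[t]:
                if t2 != s:
                    prod *= msg[(t2, t)]
            m = phi_st[(s, t)] @ prod
            new[(t, s)] = m / m.sum()          # normalization (remark 15.2)
        msg = new

    def edge_marginal(s, t):                   # equation (15.13)
        ps = phi_s[s] * np.prod([msg[(u, s)] for u in nbrs[s] if u != t], axis=0)
        pt = phi_s[t] * np.prod([msg[(u, t)] for u in nbrs[t] if u != s], axis=0)
        m = phi_st[(s, t)] * np.outer(ps, pt)
        return m / m.sum()

    return msg, edge_marginal

rng = np.random.default_rng(0)
nbrs = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}       # chain a - b - c
phi_s = {s: rng.random(2) for s in nbrs}
phi_st = {}
for s, t in [('a', 'b'), ('b', 'c')]:
    tab = rng.random((2, 2))
    phi_st[(s, t)], phi_st[(t, s)] = tab, tab.T
msg, edge_marginal = belief_propagation(nbrs, phi_s, phi_st)
print(edge_marginal('a', 'b'))
```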

15.3 Belief propagation and free energy approximation

15.3.1 BP stationarity

It is possible to run Algorithm 15.4 on graphs that are not acyclic, since nothing in
its formulation requires this property. However, while the method stabilizes in finite
time for acyclic graphs, this property, or even the convergence of the messages is not
guaranteed for general, loopy, graphs. Convergence, however, has been observed in
a large number of applications, sometimes with very good approximations of the
true marginal distributions.

We will refer to stable solutions of Algorithm 15.4 as BP-stationary points, as formally stated in the next definition, which allows for a possible normalization of the messages; this is particularly important with loopy networks.

Definition 15.3 Let G = (V, E) be an undirected graph and Φ = (ϕ_{st}, {s, t} ∈ E, ϕ_s, s ∈ V) a consistent family of pair interactions. We say that a family of joint probability distributions (π′_{st}, {s, t} ∈ E) is BP-stationary for (G, Φ) if there exist messages x^{(t)} ∈ F_t ↦ m_{st}(x^{(t)}), constants ζ_{st} for t ∼ s and α_s for s ∈ V, satisfying

m_{ts}(x^{(s)}) = \frac{α_s}{ζ_{ts}} \sum_{x^{(t)} ∈ F_t} ϕ_{st}(x^{(s)}, x^{(t)}) ϕ_t(x^{(t)}) \prod_{t′ ∈ V_t∖{s}} m_{t′t}(x^{(t)})    (15.14)

such that

π′_{st}(x^{(s)}, x^{(t)}) = \frac{1}{ζ_{st}} ϕ_{st}(x^{(s)}, x^{(t)}) ϕ_s(x^{(s)}) ϕ_t(x^{(t)}) \prod_{t′ ∈ V_t∖{s}} m_{t′t}(x^{(t)}) \prod_{s′ ∈ V_s∖{t}} m_{s′s}(x^{(s)}).    (15.15)

There is no loss of generality in the specific form chosen for the normalizing constants in (15.14) and (15.15), in the sense that, if the messages satisfy (15.15) and

m_{ts}(x^{(s)}) = q_{ts} \sum_{x^{(t)} ∈ F_t} ϕ_{st}(x^{(s)}, x^{(t)}) ϕ_t(x^{(t)}) \prod_{t′ ∈ V_t∖{s}} m_{t′t}(x^{(t)})

for some constants q_{ts}, then

ζ_{st} = \sum_{x^{(s)} ∈ F_s, x^{(t)} ∈ F_t} ϕ_{st}(x^{(s)}, x^{(t)}) ϕ_s(x^{(s)}) ϕ_t(x^{(t)}) \prod_{t′ ∈ V_t∖{s}} m_{t′t}(x^{(t)}) \prod_{s′ ∈ V_s∖{t}} m_{s′s}(x^{(s)})
= \frac{1}{q_{ts}} \sum_{x^{(s)} ∈ F_s} ϕ_s(x^{(s)}) \prod_{s′ ∈ V_s} m_{s′s}(x^{(s)}),

so that ζ_{st} q_{ts} (which has been denoted α_s) does not depend on t. Of course, the relevant questions regarding BP-stationarity are whether the collection of pairwise probabilities π′_{st} exists, how to compute them, and whether π′_{st}(x^{(s)}, x^{(t)}) provides a good

approximation of the marginals of the probability distribution π that is associated


to Φ, namely
1Y Y
π(x) = ϕs (x(s) ) ϕst (x(s) , x(t) ).
Z
s∈V {s,t}∈E

A reassuring statement for BP-stationarity is that it is not affected when the func-
tions in Φ are multiplied by constants, which does not affect the underlying proba-
bility π. This is stated in the next proposition.
Proposition 15.4 Let Φ be, as above, a family of edge and vertex interactions. Let cst, {s, t} ∈ E, and cs, s ∈ V, be families of positive constants, and define Φ̃ = (ϕ̃st, ϕ̃s) by ϕ̃st = cst ϕst and ϕ̃s = cs ϕs. Then,

π′ is BP-stationary for (G, Φ) ⇔ π′ is BP-stationary for (G, Φ̃).

Proof Indeed, if (15.14) and (15.15) are true for (G, Φ), it suffices to replace αs by αs cs and ζst by ζst cst cs ct to obtain (15.14) and (15.15) for (G, Φ̃). □

It is also important to notice that, if G is acyclic, definition 15.3 is no more general


than the message-passing rule we had considered earlier. More precisely, we have
(see remark 15.2),
Proposition 15.5 Let G = (V , E) be undirected acyclic and Φ = (ϕst , {s, t} ∈ E, ϕs , s ∈ V )
a consistent family of pair interactions. Then, the only BP-stationary distributions are the
marginals of the distribution π associated to Φ.

15.3.2 Free-energy approximations

A partial justification of the good behavior of BP with general graphs has been pro-
vided in terms of a quantity introduced in statistical mechanics, called the Bethe free
energy. We let G = (V , E) be an undirected graph and assume that a consistent family
of pair interactions is given (denoted Φ = (ϕs , s ∈ V , ϕst , {s, t} ∈ E)) and consider the
associated distribution, π, on F (V ), given by
$$\pi(x) = \frac{1}{Z} \prod_{s\in V} \varphi_s(x^{(s)}) \prod_{\{s,t\}\in E} \varphi_{st}(x^{(s)}, x^{(t)}). \tag{15.16}$$

It will also be convenient to use the function

$$\psi_{st}(x^{(s)}, x^{(t)}) = \varphi_s(x^{(s)})\,\varphi_t(x^{(t)})\,\varphi_{st}(x^{(s)}, x^{(t)}),$$

such that

$$\pi(x) = \frac{1}{Z} \prod_{s\in V} \varphi_s(x^{(s)})^{1-|V_s|} \prod_{\{s,t\}\in E} \psi_{st}(x^{(s)}, x^{(t)}). \tag{15.17}$$

We will consider approximations π′ of π that minimize the Kullback–Leibler divergence KL(π′‖π) (see (4.3)), subject to some constraints. We can write

$$\mathrm{KL}(\pi'\|\pi) = -E_{\pi'}(\log \pi) - H(\pi') = -\log Z - \sum_{s\in V}(1-|V_s|)\,E_{\pi'}(\log\varphi_s) - \sum_{\{s,t\}\in E} E_{\pi'}(\log\psi_{st}) - H(\pi')$$

(where H(π′) is the entropy of π′). Introduce the one- and two-dimensional marginals of π′, denoted π′s and π′st. Then

$$\mathrm{KL}(\pi'\|\pi) = -\log Z - \sum_{s\in V}(1-|V_s|)\,E_{\pi'}\Bigl(\log\frac{\varphi_s}{\pi'_s}\Bigr) - \sum_{\{s,t\}\in E} E_{\pi'}\Bigl(\log\frac{\psi_{st}}{\pi'_{st}}\Bigr) + \sum_{s\in V}(1-|V_s|)\,H(\pi'_s) + \sum_{\{s,t\}\in E} H(\pi'_{st}) - H(\pi').$$

The Bethe free energy is the function $F_\beta$ defined by

$$F_\beta(\pi') = -\sum_{s\in V}(1-|V_s|)\,E_{\pi'}\Bigl(\log\frac{\varphi_s}{\pi'_s}\Bigr) - \sum_{\{s,t\}\in E} E_{\pi'}\Bigl(\log\frac{\psi_{st}}{\pi'_{st}}\Bigr), \tag{15.18}$$

so that

$$\mathrm{KL}(\pi'\|\pi) = F_\beta(\pi') - \log Z + \Delta_G(\pi')$$

with

$$\Delta_G(\pi') = \sum_{s\in V}(1-|V_s|)\,H(\pi'_s) + \sum_{\{s,t\}\in E} H(\pi'_{st}) - H(\pi').$$

Using this computation, one can consider the approximation problem: find π̂′ that minimizes KL(π′‖π) over a class of distributions π′ for which the computation of the first- and second-order marginals is easy. This problem has an explicit solution when the distribution π′ is such that all variables are independent, leading to what is called the mean-field approximation of π. Indeed, in this case, we have

$$\Delta_G(\pi') = \sum_{\{s,t\}\in E}\bigl(H(\pi'_s) + H(\pi'_t)\bigr) + \sum_{s\in V}(1-|V_s|)\,H(\pi'_s) - \sum_{s\in V} H(\pi'_s) = 0$$

and

$$F_\beta(\pi') = -\sum_{s\in V}(1-|V_s|)\,E_{\pi'}\Bigl(\log\frac{\varphi_s}{\pi'_s}\Bigr) - \sum_{\{s,t\}\in E} E_{\pi'}\Bigl(\log\frac{\psi_{st}}{\pi'_s\pi'_t}\Bigr).$$

$F_\beta$ must then be minimized with respect to the variables π′s(x^(s)), s ∈ V, x^(s) ∈ Fs, subject to the constraints $\sum_{x^{(s)}\in F_s}\pi'_s(x^{(s)}) = 1$. The corresponding necessary optimality conditions provide the mean-field consistency equations, described in the following proposition.

Proposition 15.6 A local minimum of $F_\beta(\pi')$ over all probability distributions π′ of the form

$$\pi'(x) = \prod_{s\in V} \pi'_s(x^{(s)})$$

must satisfy the mean-field consistency equations:

$$\pi'_s(x^{(s)}) = \frac{1}{Z_s}\,\varphi_s(x^{(s)})^{1-|V_s|} \prod_{t\sim s} \exp\Bigl(E_{\pi'_t}\bigl(\log\psi_{st}(x^{(s)}, \cdot)\bigr)\Bigr). \tag{15.19}$$

Proof Since all constraints are affine, we can use Lagrange multipliers, denoted (λs, s ∈ V), one for each constraint, to obtain necessary conditions for a minimizer, yielding

$$\frac{\partial F_\beta}{\partial \pi'_s(x^{(s)})} - \lambda_s = 0, \qquad s\in V,\ x^{(s)}\in F_s.$$

This gives:

$$-(1-|V_s|)\Bigl(\log\frac{\varphi_s(x^{(s)})}{\pi'_s(x^{(s)})} - 1\Bigr) - \sum_{t\sim s}\sum_{x^{(t)}\in F_t}\Bigl(\log\frac{\psi_{st}(x^{(s)}, x^{(t)})}{\pi'_s(x^{(s)})\,\pi'_t(x^{(t)})} - 1\Bigr)\pi'_t(x^{(t)}) = \lambda_s.$$

Solving this with respect to π′s(x^(s)), and regrouping all terms independent of x^(s) in the normalizing constant Zs, yields (15.19). □

The mean-field consistency equations can be solved using a root-finding algorithm, or by directly solving the minimization problem. We will return to this method, in more detail, in our discussion of variational approximations in chapter 17.
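As an illustration, the mean-field consistency equations (15.19) can be solved by repeated substitution, as in the sketch below. The data layout is hypothetical, all interactions are assumed strictly positive, and convergence of this fixed-point iteration is not guaranteed in general.

```python
import math

def mean_field(V, nbrs, F, phi_s, phi_st, n_iter=200):
    # pi[s][x] approximates the marginal at s under the product-form constraint
    pi = {s: {x: 1.0 / len(F[s]) for x in F[s]} for s in V}
    for _ in range(n_iter):
        for s in V:
            logp = {}
            for x in F[s]:
                # sum over t ~ s of E_{pi_t}[log psi_st(x, .)], with
                # psi_st(x, y) = phi_s(x) * phi_t(y) * phi_st(x, y)
                e = sum(pi[t][y] * math.log(phi_s[s][x] * phi_s[t][y]
                                            * phi_st[(s, t)][(x, y)])
                        for t in nbrs[s] for y in F[t])
                logp[x] = (1 - len(nbrs[s])) * math.log(phi_s[s][x]) + e
            mx = max(logp.values())          # stabilize before exponentiating
            w = {x: math.exp(v - mx) for x, v in logp.items()}
            z = sum(w.values())              # plays the role of Z_s, rescaled
            pi[s] = {x: v / z for x, v in w.items()}
    return pi
```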

In the particular case in which G is acyclic and the approximation is made by G-Markov processes, the Kullback–Leibler divergence is minimized by π′ = π (since π belongs to the approximating class). A slightly non-trivial remark is that π is also optimal for the minimization of the Bethe free energy $F_\beta$, because this energy coincides, up to the constant term log Z, with the Kullback–Leibler divergence, as proved by the following proposition.

Proposition 15.7 If G is acyclic and π′ is G-Markov, then ∆G(π′) = 0.

This proposition is a consequence of the following lemma that has its own interest:

Lemma 15.8 If G is acyclic and π is a G-Markov distribution, then

$$\pi(x) = \prod_{s\in V} \pi_s(x^{(s)})^{1-|V_s|} \prod_{\{s,t\}\in E} \pi_{st}(x^{(s)}, x^{(t)}). \tag{15.20}$$

Proof (of lemma 15.8) We know that, if G̃ = (V, Ẽ) is a tree such that G̃♭ = G, we have, letting s0 be the root in G̃,

$$\pi(x) = \pi_{s_0}(x^{(s_0)}) \prod_{(s,t)\in\tilde E} p_{st}(x^{(s)}, x^{(t)}) = \pi_{s_0}(x^{(s_0)}) \prod_{(s,t)\in\tilde E} \pi_{st}(x^{(s)}, x^{(t)})\,\pi_s(x^{(s)})^{-1}.$$

Each vertex s in V has |Vs| − 1 children in G̃, except s0, which has |Vs0| children. Using this, we get

$$\pi(x) = \pi_{s_0}(x^{(s_0)})\,\pi_{s_0}(x^{(s_0)})^{-|V_{s_0}|} \prod_{s\in V\setminus\{s_0\}} \pi_s(x^{(s)})^{1-|V_s|} \prod_{(s,t)\in\tilde E} \pi_{st}(x^{(s)}, x^{(t)}) = \prod_{s\in V} \pi_s(x^{(s)})^{1-|V_s|} \prod_{\{s,t\}\in E} \pi_{st}(x^{(s)}, x^{(t)}). \;\square$$

Proof (of proposition 15.7) If π′ is given by (15.20), then

$$H(\pi') = -E_{\pi'}\log\pi' = -\sum_{s\in V}(1-|V_s|)\,E_{\pi'}\log\pi'_s - \sum_{\{s,t\}\in E} E_{\pi'}\log\pi'_{st} = \sum_{s\in V}(1-|V_s|)\,H(\pi'_s) + \sum_{\{s,t\}\in E} H(\pi'_{st}),$$

which proves that ∆G(π′) = 0. □

In view of this, it is tempting to “generalize” the mean-field optimization procedure and minimize $F_\beta(\pi')$ over all possible consistent singleton and pair marginals (π′s and π′st), then use the optimal ones as an approximation of πs and πst. What we have just proved is that this procedure provides the exact expression of the marginals when G is acyclic. For loopy graphs, however, it is not justified, and is at best an approximation. A very interesting fact is that this procedure provides the same consistency equations as belief propagation. To see this, we first start with the characterization of the minimizers of $F_\beta$.

Proposition 15.9 Let G = (V, E) be an undirected graph and π be given by (15.16). Consider the problem of minimizing the Bethe free energy $F_\beta$ in (15.18) with respect to all possible choices of probability distributions (π′st, {s, t} ∈ E), (π′s, s ∈ V) with the constraints

$$\pi'_s(x^{(s)}) = \sum_{x^{(t)}\in F_t} \pi'_{st}(x^{(s)}, x^{(t)}), \qquad \forall x^{(s)}\in F_s \text{ and } t\sim s.$$

Then a local minimum of this problem must take the form

$$\pi'_{st}(x^{(s)}, x^{(t)}) = \frac{1}{Z_{st}}\,\psi_{st}(x^{(s)}, x^{(t)})\,\mu_{st}(x^{(t)})\,\mu_{ts}(x^{(s)}), \tag{15.21}$$

where the functions μst : Ft → [0, +∞) are defined for all (s, t) such that {s, t} ∈ E and satisfy the consistency conditions

$$\mu_{ts}(x^{(s)})^{-(|V_s|-1)} \prod_{s'\sim s} \mu_{s's}(x^{(s)}) = \left(\frac{e}{Z_{st}} \sum_{x^{(t)}\in F_t} \varphi_{st}(x^{(s)}, x^{(t)})\,\varphi_t(x^{(t)})\,\mu_{st}(x^{(t)})\right)^{|V_s|-1}. \tag{15.22}$$

(Note that, since ψst = ϕs ϕt ϕst, the inner sum can equivalently be written as (1/ϕs(x^(s))) Σ_{x^(t)} ψst(x^(s), x^(t)) μst(x^(t)).)

Proof We introduce Lagrange multipliers: λts(x^(s)) for the constraint

$$\pi'_s(x^{(s)}) = \sum_{x^{(t)}\in F_t} \pi'_{st}(x^{(s)}, x^{(t)})$$

and γst for

$$\sum_{x^{(s)}, x^{(t)}} \pi'_{st}(x^{(s)}, x^{(t)}) = 1,$$

which covers all constraints associated with the minimization problem. The associated Lagrangian is

$$F_\beta(\pi') - \sum_{s\in V}\sum_{x^{(s)}\in F_s}\sum_{t\sim s} \lambda_{ts}(x^{(s)})\left(\sum_{x^{(t)}\in F_t} \pi'_{st}(x^{(s)}, x^{(t)}) - \pi'_s(x^{(s)})\right) - \sum_{\{s,t\}\in E} \gamma_{st}\left(\sum_{x^{(s)}\in F_s,\, x^{(t)}\in F_t} \pi'_{st}(x^{(s)}, x^{(t)}) - 1\right).$$

The derivative with respect to π′st(x^(s), x^(t)) yields the condition

$$\log \pi'_{st}(x^{(s)}, x^{(t)}) - \log \psi_{st}(x^{(s)}, x^{(t)}) + 1 - \lambda_{ts}(x^{(s)}) - \lambda_{st}(x^{(t)}) - \gamma_{st} = 0,$$

which implies

$$\pi'_{st}(x^{(s)}, x^{(t)}) = \psi_{st}(x^{(s)}, x^{(t)})\,\exp(\gamma_{st}-1)\,\exp\bigl(\lambda_{ts}(x^{(s)}) + \lambda_{st}(x^{(t)})\bigr).$$

We let Zst = exp(1 − γst), with γst chosen so that π′st is a probability distribution. The derivative with respect to π′s(x^(s)) gives

$$(1-|V_s|)\bigl(\log \pi'_s(x^{(s)}) - \log \varphi_s(x^{(s)}) + 1\bigr) + \sum_{t\sim s} \lambda_{ts}(x^{(s)}) = 0.$$

Combining this with the expression just obtained for π′st, we get, for t ∼ s,

$$(1-|V_s|)\log \sum_{x^{(t)}\in F_t} \psi_{st}(x^{(s)}, x^{(t)})\, e^{\lambda_{st}(x^{(t)})} + (1-|V_s|)\,\lambda_{ts}(x^{(s)}) + (1-|V_s|)\bigl(1 - \log Z_{st} - \log \varphi_s(x^{(s)})\bigr) + \sum_{s'\sim s} \lambda_{s's}(x^{(s)}) = 0,$$

which gives (15.22) with μst = exp(λst). □

A family π′st satisfying conditions (15.21) and (15.22) of proposition 15.9 will be called Bethe-consistent. Remarkably, Bethe-consistency turns out to be equivalent to BP-stationarity, as stated below.

Proposition 15.10 Let G = (V, E) be an undirected graph and Φ = (ϕst, {s, t} ∈ E, ϕs, s ∈ V) a consistent family of pair interactions. Then a family π′ of joint probability distributions is BP-stationary if and only if it is Bethe-consistent.
Proof First assume that π′ is BP-stationary with messages mst, so that (15.14) and (15.15) are satisfied. Take

$$\mu_{st}(x^{(t)}) = a_t \prod_{t'\in V_t,\, t'\neq s} m_{t't}(x^{(t)})$$

for some constant at that will be determined later. Then the left-hand side of (15.22) is

$$\mu_{ts}(x^{(s)})^{-(|V_s|-1)} \prod_{s'\in V_s} \mu_{s's}(x^{(s)}) = a_s \left(\prod_{s'\in V_s,\, s'\neq t} m_{s's}(x^{(s)})\right)^{-(|V_s|-1)} \prod_{s'\in V_s}\ \prod_{s''\in V_s,\, s''\neq s'} m_{s''s}(x^{(s)}) = a_s\, m_{ts}(x^{(s)})^{|V_s|-1}.$$

The right-hand side is equal to (using (15.14))

$$\left(\frac{e\, a_t\, \zeta_{st}}{Z_{st}\, \alpha_s}\, m_{ts}(x^{(s)})\right)^{|V_s|-1},$$

so that we need to have

$$a_s = \left(\frac{e\, a_t\, \zeta_{st}}{Z_{st}\, \alpha_s}\right)^{|V_s|-1}.$$

We also need

$$Z_{st} = \sum_{x^{(s)}, x^{(t)}} \psi_{st}(x^{(s)}, x^{(t)})\,\mu_{st}(x^{(t)})\,\mu_{ts}(x^{(s)}) = a_s\, a_t\, \zeta_{st}.$$

Solving these equations, we find that (15.21) and (15.22) are satisfied with

$$a_s = (e/\alpha_s)^{(|V_s|-1)/|V_s|}, \qquad Z_{st} = \zeta_{st}\, a_s\, a_t,$$

which proves that π′ is Bethe-consistent.

Conversely, take a Bethe-consistent π′, together with μst and Zst satisfying (15.21) and (15.22). For s such that |Vs| > 1, define, for t ∈ Vs,

$$m_{ts}(x^{(s)}) = \mu_{ts}(x^{(s)})^{-1} \prod_{s'\sim s} \mu_{s's}(x^{(s)})^{1/(|V_s|-1)}. \tag{15.23}$$

Define also, for |Vs| > 1,

$$\rho_{ts}(x^{(s)}) = \prod_{s'\in V_s,\, s'\neq t} m_{s's}(x^{(s)}).$$

(If |Vs| = 1, take ρts ≡ 1.) Using (15.23), we find ρts = μts when |Vs| > 1, and this identity is still valid when |Vs| = 1, since in this case (15.22) implies that μts(x^(s)) = 1.

We need to find constants αt and ζst such that (15.14) and (15.15) are satisfied. But (15.15) implies

$$\zeta_{ts} = \sum_{x^{(s)}, x^{(t)}} \psi_{st}(x^{(s)}, x^{(t)})\,\rho_{st}(x^{(t)})\,\rho_{ts}(x^{(s)}),$$

and (15.21) then implies ζts = Zts.

We now consider (15.14), which requires

$$m_{ts}(x^{(s)}) = \frac{\alpha_s}{\zeta_{st}} \sum_{x^{(t)}} \varphi_{st}(x^{(s)}, x^{(t)})\,\varphi_t(x^{(t)})\,\rho_{st}(x^{(t)}).$$

It is now easy to see that this identity, raised to the power |Vs| − 1, coincides with (15.22) as soon as one takes αs = e. □

15.4 Computing the most likely configuration

We now address the problem of finding a configuration that maximizes π(x) (mode determination). This problem turns out to be very similar to the computation of marginals considered so far, and we will obtain similar algorithms.

Assume that G is undirected and acyclic and that π can be written as

$$\pi(x) = \frac{1}{Z} \prod_{\{s,t\}\in E} \varphi_{st}(x^{(s)}, x^{(t)}) \prod_{s\in V} \varphi_s(x^{(s)}).$$

Maximizing π(x) is equivalent to maximizing

$$U(x) = \prod_{\{s,t\}\in E} \varphi_{st}(x^{(s)}, x^{(t)}) \prod_{s\in V} \varphi_s(x^{(s)}). \tag{15.24}$$

Assume that a root has been chosen in G, with the resulting edge orientation yielding a tree G̃ = (V, Ẽ) such that G̃♭ = G. We partially order the vertexes according to G̃, writing s ≤ t if there exists a path from s to t in G̃ (s is an ancestor of t). Let V+s contain all t ∈ V with t ≥ s, and define

$$U_s(x^{(V_s^+)}) = \prod_{\{t,u\}\in E_{V_s^+}} \varphi_{tu}(x^{(t)}, x^{(u)}) \prod_{t>s} \varphi_t(x^{(t)})$$

and

$$U_s^*(x^{(s)}) = \max\bigl\{U_s(y^{(V_s^+)}):\ y^{(s)} = x^{(s)}\bigr\}. \tag{15.25}$$

Since we can write

$$U_s(x^{(V_s^+)}) = \prod_{t\in s^+} \varphi_{st}(x^{(s)}, x^{(t)})\,\varphi_t(x^{(t)})\, U_t(x^{(V_t^+)}), \tag{15.26}$$

we have

$$U_s^*(x^{(s)}) = \max_{x^{(t)},\, t\in s^+}\ \prod_{t\in s^+} \varphi_t(x^{(t)})\,\varphi_{st}(x^{(s)}, x^{(t)})\, U_t^*(x^{(t)}) = \prod_{t\in s^+}\ \max_{x^{(t)}\in F_t} \bigl(\varphi_t(x^{(t)})\,\varphi_{st}(x^{(s)}, x^{(t)})\, U_t^*(x^{(t)})\bigr). \tag{15.27}$$

This provides a method to compute U∗s(x^(s)) for all s, starting with the leaves and progressively updating the parents. (When s is a leaf, U∗s(x^(s)) = 1, by definition.)

Once all U∗s(x^(s)) have been computed, it is possible to obtain a configuration x∗ that maximizes π. This is because an optimal configuration must satisfy $U_s^*(x_*^{(s)}) = U_s(x_*^{(V_s^+)})$ for all s ∈ V, i.e., $x_*^{(V_s^+\setminus\{s\})}$ must solve the maximization problem in (15.25). But, because of (15.26), we can separate this problem over the children of s and obtain the fact that, if t ∈ s+,

$$x_*^{(t)} = \mathop{\mathrm{argmax}}_{x^{(t)}} \bigl(\varphi_t(x^{(t)})\,\varphi_{st}(x_*^{(s)}, x^{(t)})\, U_t^*(x^{(t)})\bigr).$$

This procedure can be rewritten in a slightly different form, using messages similar to those of the belief propagation algorithm. If s ∈ t+, define

$$\mu_{st}(x^{(t)}) = \max_{x^{(s)}\in F_s}\bigl(\varphi_{ts}(x^{(t)}, x^{(s)})\,\varphi_s(x^{(s)})\, U_s^*(x^{(s)})\bigr)$$

and

$$\xi_{st}(x^{(t)}) = \mathop{\mathrm{argmax}}_{x^{(s)}\in F_s}\bigl(\varphi_{ts}(x^{(t)}, x^{(s)})\,\varphi_s(x^{(s)})\, U_s^*(x^{(s)})\bigr).$$
x(s) ∈Fs

Using (15.27), we get

$$\mu_{st}(x^{(t)}) = \max_{x^{(s)}\in F_s}\Bigl(\varphi_{ts}(x^{(t)}, x^{(s)})\,\varphi_s(x^{(s)}) \prod_{u\in s^+} \mu_{us}(x^{(s)})\Bigr),$$

$$\xi_{st}(x^{(t)}) = \mathop{\mathrm{argmax}}_{x^{(s)}\in F_s}\Bigl(\varphi_{ts}(x^{(t)}, x^{(s)})\,\varphi_s(x^{(s)}) \prod_{u\in s^+} \mu_{us}(x^{(s)})\Bigr).$$

An optimal configuration can now be computed using $x_*^{(t)} = \xi_{ts}(x_*^{(s)})$, with s = pa(t).

The resulting algorithm therefore first operates upward in the tree (from leaves to root) to compute the μst's and ξst's, then downward to compute x∗. This is summarized in the following algorithm.

Algorithm 15.5
A most likely configuration for

$$\pi(x) = \frac{1}{Z} \prod_{\{s,t\}\in E} \varphi_{st}(x^{(s)}, x^{(t)}) \prod_{s\in V} \varphi_s(x^{(s)})$$

can be computed by iterating the following updates, based on any acyclic orientation of G:

(1) Compute, from leaves to root:

$$\mu_{st}(x^{(t)}) = \max_{x^{(s)}\in F_s}\Bigl(\varphi_{ts}(x^{(t)}, x^{(s)})\,\varphi_s(x^{(s)}) \prod_{u\in s^+} \mu_{us}(x^{(s)})\Bigr) \quad\text{and}\quad \xi_{st}(x^{(t)}) = \mathop{\mathrm{argmax}}_{x^{(s)}\in F_s}\Bigl(\varphi_{ts}(x^{(t)}, x^{(s)})\,\varphi_s(x^{(s)}) \prod_{u\in s^+} \mu_{us}(x^{(s)})\Bigr).$$

(2) Compute, from root to leaves: $x_*^{(t)} = \xi_{ts}(x_*^{(s)})$, with s = pa(t).
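Here is a minimal Python sketch of Algorithm 15.5 on a rooted tree. The data layout is hypothetical: children maps each vertex to its (possibly empty) list of children, and phi_st[(t, s)] is indexed parent-first.

```python
from math import prod

def max_prod_tree(root, children, F, phi_s, phi_st):
    mu, xi = {}, {}

    def up(s, t):  # (1) upward pass: message from child s to its parent t
        for c in children[s]:
            up(c, s)
        mu[(s, t)], xi[(s, t)] = {}, {}
        for xt in F[t]:
            best, arg = -1.0, None
            for xs in F[s]:
                v = (phi_st[(t, s)][(xt, xs)] * phi_s[s][xs]
                     * prod(mu[(c, s)][xs] for c in children[s]))
                if v > best:
                    best, arg = v, xs
            mu[(s, t)][xt], xi[(s, t)][xt] = best, arg

    for c in children[root]:
        up(c, root)
    # optimal root state maximizes phi_s times the incoming mu messages
    x = {root: max(F[root], key=lambda v: phi_s[root][v]
                   * prod(mu[(c, root)][v] for c in children[root]))}
    stack = [root]                      # (2) downward pass
    while stack:
        t = stack.pop()
        for s in children[t]:
            x[s] = xi[(s, t)][x[t]]
            stack.append(s)
    return x
```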

Similar to the computation of marginals, this algorithm can be rewritten in an orientation-independent form. The main remark is that the value of μst(x^(t)) does not depend on the tree orientation, as long as it is chosen such that s ∈ t+, i.e., the edge {s, t} is oriented from t to s. This is because such a choice uniquely prescribes the orientation of the edges of the descendants of s for any such tree, and μst only depends on this structure. Since the same remark holds for ξst, this provides a definition of these two quantities for any pair s, t such that {s, t} ∈ E. The updating rules now become

$$\mu_{st}(x^{(t)}) = \max_{x^{(s)}\in F_s}\Bigl(\varphi_{ts}(x^{(t)}, x^{(s)})\,\varphi_s(x^{(s)}) \prod_{u\in V_s\setminus\{t\}} \mu_{us}(x^{(s)})\Bigr), \tag{15.28}$$

$$\xi_{st}(x^{(t)}) = \mathop{\mathrm{argmax}}_{x^{(s)}\in F_s}\Bigl(\varphi_{ts}(x^{(t)}, x^{(s)})\,\varphi_s(x^{(s)}) \prod_{u\in V_s\setminus\{t\}} \mu_{us}(x^{(s)})\Bigr), \tag{15.29}$$

with $x_*^{(t)} = \xi_{ts}(x_*^{(s)})$ for any pair s ∼ t. As with the mts in the previous section, looping over updates of all μts in any order will eventually make them stabilize at their correct values, although, if an orientation is given, going from the leaves to the root is obviously more efficient.

The previous analysis is not valid for loopy graphs, but (15.28) and (15.29) provide well-defined iterations when G is an arbitrary undirected graph, and can therefore be used as such, without any guaranteed behavior.

15.5 General sum-prod and max-prod algorithms

15.5.1 Factor graphs

The expressions we obtained for message updating with belief propagation and with mode determination respectively took the form

$$m_{ts}(x^{(s)}) \leftarrow \sum_{x^{(t)}\in F_t} \varphi_{st}(x^{(s)}, x^{(t)})\,\varphi_t(x^{(t)}) \prod_{t'\in V_t\setminus\{s\}} m_{t't}(x^{(t)})$$

and

$$\mu_{ts}(x^{(s)}) \leftarrow \max_{x^{(t)}\in F_t}\Bigl(\varphi_{st}(x^{(s)}, x^{(t)})\,\varphi_t(x^{(t)}) \prod_{t'\in V_t\setminus\{s\}} \mu_{t't}(x^{(t)})\Bigr).$$

The first one is often referred to as the “sum-prod” update rule, and the second as the “max-prod” rule. In our construction, the sum-prod algorithm provided us with a method for computing

$$\sigma_s(x^{(s)}) = \sum_{y^{(V\setminus\{s\})}} U(x^{(s)} \wedge y^{(V\setminus\{s\})})$$

with

$$U(x) = \prod_{s\in V} \varphi_s(x^{(s)}) \prod_{\{s,t\}\in E} \varphi_{st}(x^{(s)}, x^{(t)}).$$

Indeed, we have, according to (15.11),

$$\sigma_s(x^{(s)}) = \varphi_s(x^{(s)}) \prod_{t\in V_s} m_{ts}(x^{(s)}).$$

Similarly, the max-prod algorithm computes

$$\rho_s(x^{(s)}) = \max_{y^{(V\setminus\{s\})}} U(x^{(s)} \wedge y^{(V\setminus\{s\})})$$

via the relation

$$\rho_s(x^{(s)}) = \varphi_s(x^{(s)}) \prod_{t\in V_s} \mu_{ts}(x^{(s)}).$$

We now discuss generalizations of these algorithms to situations in which the function U does not decompose as a product of bivariate functions. More precisely, let S be a subset of P(V), and assume the decomposition

$$U(x) = \prod_{C\in S} \varphi_C(x^{(C)}).$$

The previous algorithms can be generalized using the concept of factor graphs associated with the decomposition. The vertexes of this graph are either indexes s ∈ V or sets C ∈ S, and the only edges link indexes to the sets that contain them. The formal definition is as follows.

Definition 15.11 Let V be a finite set of indexes and S a subset of P (V ). The factor
graph associated to V and S is the graph G = (V ∪ S, E), E being constituted of all pairs
{s, C} with C ∈ S and s ∈ C.

We assign the variable x(s) to a vertex s ∈ V of the factor graph, and the function ϕC
to C ∈ S. With this in mind, the sum-prod and max-prod algorithms are extended
to factor graphs as follows.

Definition 15.12 Let G = (V ∪ S, E) be a factor graph, with associated functions ϕC(x^(C)). The sum-prod algorithm on G updates messages msC(x^(s)) and mCs(x^(s)) according to the rules

$$\begin{cases}\; m_{sC}(x^{(s)}) \leftarrow \displaystyle\prod_{\tilde C:\, s\in\tilde C,\ \tilde C\neq C} m_{\tilde C s}(x^{(s)})\\[2ex] \; m_{Cs}(x^{(s)}) \leftarrow \displaystyle\sum_{y^{(C)}:\, y^{(s)}=x^{(s)}} \varphi_C(y^{(C)}) \prod_{t\in C\setminus\{s\}} m_{tC}(y^{(t)}) \end{cases} \tag{15.30}$$

Similarly, the max-prod algorithm iterates

$$\begin{cases}\; \mu_{sC}(x^{(s)}) \leftarrow \displaystyle\prod_{\tilde C:\, s\in\tilde C,\ \tilde C\neq C} \mu_{\tilde C s}(x^{(s)})\\[2ex] \; \mu_{Cs}(x^{(s)}) \leftarrow \displaystyle\max_{y^{(C)}:\, y^{(s)}=x^{(s)}} \varphi_C(y^{(C)}) \prod_{t\in C\setminus\{s\}} \mu_{tC}(y^{(t)}) \end{cases} \tag{15.31}$$

These algorithms reduce to the original ones when only single-vertex and pair interactions exist. Let us check this with sum-prod. In this case, the set S contains all singletons C = {s}, with associated function ϕs, and all edges {s, t} ∈ E, with associated function ϕst. We have links between s and {s}, and between s and {s, t} for {s, t} ∈ E. For singletons, we have

$$m_{s\{s\}}(x^{(s)}) \leftarrow \prod_{t\sim s} m_{\{s,t\}s}(x^{(s)}) \quad\text{and}\quad m_{\{s\}s}(x^{(s)}) \leftarrow \varphi_s(x^{(s)}).$$

For pairs,

$$m_{s\{s,t\}}(x^{(s)}) \leftarrow \varphi_s(x^{(s)}) \prod_{\tilde t\in V_s\setminus\{t\}} m_{\{s,\tilde t\}s}(x^{(s)})$$

and

$$m_{\{s,t\}s}(x^{(s)}) \leftarrow \sum_{y^{(t)}} \varphi_{st}(x^{(s)}, y^{(t)})\, m_{t\{s,t\}}(y^{(t)}),$$

and, combining the last two assignments, it becomes clear that we retrieve the initial algorithm, with m{s,t}s taking the role of what we previously denoted mts.
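To make the update rules concrete, here is a minimal Python sketch of (15.30) with a “flooding” schedule that updates all messages at every sweep. The data layout (factors as tuples of variable indices, potentials as functions of configuration tuples) is an assumption made for the example.

```python
import itertools
from math import prod

def factor_graph_sum_prod(S, F, phi, n_sweeps):
    # S: list of factors, each a tuple of variable indices; phi[C](cfg) >= 0
    m_vf = {(s, C): {x: 1.0 for x in F[s]} for C in S for s in C}
    m_fv = {(C, s): {x: 1.0 for x in F[s]} for C in S for s in C}
    for _ in range(n_sweeps):
        # variable-to-factor messages: product of the other factor messages
        for C in S:
            for s in C:
                m_vf[(s, C)] = {x: prod(m_fv[(D, s)][x]
                                        for D in S if s in D and D != C)
                                for x in F[s]}
        # factor-to-variable messages: sum over the other variables of C
        for C in S:
            for s in C:
                others = [t for t in C if t != s]
                msg = {x: 0.0 for x in F[s]}
                for ys in itertools.product(*(F[t] for t in others)):
                    y = dict(zip(others, ys))
                    for x in F[s]:
                        y[s] = x
                        cfg = tuple(y[t] for t in C)
                        msg[x] += phi[C](cfg) * prod(m_vf[(t, C)][y[t]]
                                                     for t in others)
                m_fv[(C, s)] = msg
    # after stabilization, sigma_s(x) = prod over C containing s of m_fv[(C, s)][x]
    return m_fv
```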

The important question, obviously, is whether these algorithms converge. The following result shows that this is the case when the factor graph is acyclic.

Proposition 15.13 Let G = (V ∪ S, E) be a factor graph with associated functions ϕC. Assume that G is acyclic. Then the sum-prod and max-prod algorithms converge in finite time. After convergence, we have

$$\sigma_s(x^{(s)}) = \prod_{C:\, s\in C} m_{Cs}(x^{(s)}) \quad\text{and}\quad \rho_s(x^{(s)}) = \prod_{C:\, s\in C} \mu_{Cs}(x^{(s)}).$$

Proof Let us assume that G is connected, which is without loss of generality, since
the following argument can be applied to each component of G separately. Since G
is acyclic, we can arbitrarily select one of its vertexes as a root to form a tree. This
being done, we can see that the messages going upward in the tree (from children
to parent) progressively stabilize, starting with leaves. Leaves in the factor graph
indeed are either singletons, C = {s}, or vertexes s ∈ V that belong to only one set
C ∈ S. In the first case, the algorithm imposes (taking, for example, the sum-prod
case) m{s}s (x(s) ) = ϕs (x(s) ), and in the second case msC (x(s) ) = 1. So the messages sent
upward by the leaves are set at the first step. Since the messages going from a child
to its parents only depend on the messages that it received from its other neighbors

in the acyclic graph, which are its children in the tree, it is clear that all upward
messages progressively stabilize until the root is reached. Once this is done, mes-
sages propagate downward from each parent to its children. This stabilizes as soon
as all incoming messages to the parent are stabilized, since outgoing messages only
depend on those. At the end of the upward phase, this is true for the root, which
can then send its stable message to its children. These children now have all their
incoming messages and can now send their messages to their own children and so
on down to the leaves.

We now consider the second statement, proceeding by induction, assuming that the result is true for any graph smaller than the one considered. Let s0 be the selected root, and consider all vertexes s ≠ s0 such that there exists Cs ∈ S with s0 and s both belonging to Cs. Given s, there cannot be more than one such Cs, since this would create a loop in the graph. For each such s, consider the part Gs of G containing all descendants of s. Let Vs be the set of vertexes among the descendants of s and 𝒞s the set of C's below s. Define

$$U_s(x^{(V_s)}) = \prod_{C\in\mathcal{C}_s} \varphi_C(x^{(C)}).$$

Since the upward phase of the algorithm does not depend on the ancestors of s, the messages incoming to s for the sum-prod algorithm restricted to Gs are the same as with the general algorithm, so that, using the induction hypothesis,

$$\sum_{y^{(V_s)}:\, y^{(s)} = x^{(s)}} U_s(y^{(V_s)}) = \prod_{C\in\mathcal{C}_s:\, s\in C} m_{Cs}(x^{(s)}) = m_{sC_s}(x^{(s)}).$$

Now let C1, . . . , Cn list all the sets in S that contain s0; these must be non-intersecting (except at s0), again in order not to create loops. Write

$$C_1 \cup \cdots \cup C_n = \{s_0, s_1, \ldots, s_q\}.$$

Then, we have

$$U(x) = \prod_{j=1}^n \varphi_{C_j}(x^{(C_j)}) \prod_{i=1}^q U_{s_i}(x^{(V_{s_i})})$$
and, letting $S' = \bigcup_{j=1}^n C_j \setminus \{s_0\}$,

$$\begin{aligned}
\sigma_{s_0}(x^{(s_0)}) &= \sum_{y^{(V)}:\, y^{(s_0)} = x^{(s_0)}}\ \prod_{j=1}^n \varphi_{C_j}(y^{(C_j)}) \prod_{i=1}^q U_{s_i}(y^{(V_{s_i})}) \\
&= \sum_{y^{(S')}:\, y^{(s_0)} = x^{(s_0)}}\ \prod_{j=1}^n \varphi_{C_j}(y^{(C_j)}) \prod_{i=1}^q m_{s_iC_{s_i}}(y^{(s_i)}) \\
&= \prod_{j=1}^n\ \sum_{y^{(C_j)}:\, y^{(s_0)} = x^{(s_0)}} \varphi_{C_j}(y^{(C_j)}) \prod_{s\in C_j\setminus\{s_0\}} m_{sC_s}(y^{(s)}) \\
&= \prod_{j=1}^n m_{C_js_0}(x^{(s_0)}),
\end{aligned}$$

which proves the required result (note that, when factorizing the sum, we used the fact that the sets Cj \ {s0} are non-intersecting). An almost identical argument holds for the max-prod algorithm. □

Remark 15.14 Note that these algorithms are not always feasible. For example, it is always possible to represent a function U on F(V) with the trivial factor graph in which S = {V} and E contains all {s, V}, s ∈ V (using ϕV = U), but computing mVs is identical to directly computing σs with a sum over all configurations on V \ {s}, whose number grows exponentially. In fact, the complexity of the sum-prod and max-prod algorithms is exponential in the size of the largest C in S, which should therefore remain small. □

Remark 15.15 It is not always possible to decompose a function so that the resulting
factor graph is acyclic with small degree (maximum number of edges per vertex).
Sum-prod and max-prod can still be used with loopy networks, sometimes with
excellent results, but without theoretical support. 

Remark 15.16 One can sometimes transform a given factor graph into an acyclic one by grouping vertexes. Assume that the set S ⊂ P(V) is given. We will say that a partition ∆ = (D1, . . . , Dk) of V is S-admissible if, for any C ∈ S and any j ∈ {1, . . . , k}, one has either Dj ∩ C = ∅ or Dj ⊂ C.

If ∆ is S-admissible, one can define a new factor graph G̃ as follows. We first let Ṽ = {1, . . . , k}. To define S̃ ⊂ P(Ṽ), assign to each C ∈ S the set JC of indexes j such that Dj ⊂ C. From the admissibility assumption,

$$C = \bigcup_{j\in J_C} D_j, \tag{15.32}$$

so that C 7→ JC is one-to-one. Let S̃ = {JC , C ∈ S}. Group variables using x̃(k) = x(Dk ) ,
so that F̃k = F (Dk ). Define Φ̃ = (ϕ̃C̃ , C̃ ∈ S̃) by ϕ̃C̃ = ϕC where C is given by (15.32).

In other terms, one groups variables (x(s) , s ∈ V ) into clusters, to create a simpler
factor graph, which may be acyclic even if the original one was not. For example, if
V = {a, b, c, d}, S = {A, B} with A = {a, b, c} and B = {b, c, d}, then (A, c, B, b) is a cycle in
the associated factor graph. If, however, one takes D1 = {a}, D2 = {b, c} and D3 = {d},
then (D1 , D2 , D3 ) is S-admissible and the associated factor graph is acyclic. In fact, in
such a case, the resulting factor graph, considered as a graph with vertexes given by
subsets of V , is a special case of a junction tree, which is defined in the next section.

15.5.2 Junction trees

Definition 15.17 Let V be a finite set. A junction tree on V is an undirected acyclic


graph G = (S, E) where S ⊂ P (V ) is a family of subsets of V that satisfy the following
property, called the running intersection constraint: if C, C 0 ∈ S and s ∈ C ∩ C 0 , then all
sets C 00 in the (unique) path connecting C and C 0 in G must also contain s.
Remark 15.18 Let us check that the clustered factor graph G̃ defined in remark 15.16
is equivalent to a junction tree when acyclic. Using the same notation, let Ŝ =
{D1 , . . . , Dk } ∪ S, removing if needed sets C ∈ S that coincide with one of the Dj ’s.
Place an edge between Dj and C if and only if Dj ⊂ C.

Let (C1, Di1, . . . , Din−1, Cn) be a path in that graph. Assume that s ∈ C1 ∩ Cn, and let Din be the unique Dj that contains s. From the admissibility assumption, Din ⊂ C1 and Din ⊂ Cn, which implies that (C1, Di1, . . . , Cn, Din, C1) is a closed path in G̃. Since G̃ is acyclic, this path must be a union of folded paths. But it is easy to see that any folded path satisfies the running intersection constraint. (Note that there was no loss of generality in assuming that the path started and ended with a “C”, since any “D” must be contained in the C that follows or precedes it.) □

We now consider a probability distribution written in the form

$$\pi(x) = \frac{1}{Z} \prod_{C\in S} \varphi_C(x^{(C)})$$

and we make the assumption that S can be organized as a junction tree.

Belief propagation can be extended to junction trees. Fixing a root C0 ∈ S, we first choose an orientation on G, which induces as usual a partial order on S. For C ∈ S, define S+C as the set of all B ∈ S such that B > C. Define also

$$V_C^+ = \bigcup_{B\in S_C^+} B.$$

We want to compute the sums

$$\sigma_C(x^{(C)}) = \sum_{y^{(V\setminus C)}} U(x^{(C)} \wedge y^{(V\setminus C)}),$$

where $U(x) = \prod_{C\in S} \varphi_C(x^{(C)})$. We have

$$\sigma_C(x^{(C)}) = \varphi_C(x^{(C)}) \sum_{y^{(V\setminus C)}}\ \prod_{B\in S\setminus\{C\}} \varphi_B(x^{(B\cap C)} \wedge y^{(B\setminus C)}).$$

Define

$$\sigma_C^+(x^{(C)}) = \sum_{y^{(V_C^+\setminus C)}}\ \prod_{B>C} \varphi_B(x^{(B\cap C)} \wedge y^{(B\setminus C)}).$$

Note that we have σC0 = ϕC0 σ+C0 at the root. We have the recursion formula

$$\begin{aligned}
\sigma_C^+(x^{(C)}) &= \sum_{y^{(V_C^+\setminus C)}}\ \prod_{C\to B} \varphi_B(x^{(B\cap C)} \wedge y^{(B\setminus C)}) \prod_{B'>B} \varphi_{B'}(x^{(B'\cap C)} \wedge y^{(B'\setminus C)}) \\
&= \prod_{C\to B}\ \sum_{y^{(B\cup V_B^+\setminus C)}} \varphi_B(x^{(B\cap C)} \wedge y^{(B\setminus C)}) \prod_{B'>B} \varphi_{B'}(x^{(B'\cap C)} \wedge y^{(B'\setminus C)}) \\
&= \prod_{C\to B}\ \sum_{y^{(B\setminus C)}} \varphi_B(x^{(B\cap C)} \wedge y^{(B\setminus C)})\,\sigma_B^+(x^{(B\cap C)} \wedge y^{(B\setminus C)}).
\end{aligned}$$

The inversion of the sum and product in the second equality above was possible because the sets B ∪ V+B \ C, C → B, are disjoint. Indeed, if there existed B, B′ such that C → B and C → B′, and descendants C′ of B and C″ of B′ with a non-empty intersection, then this intersection would have to be included in every set in the (non-oriented) path connecting C′ and C″ in G. Since this path contains C, the intersection must also be included in C, so that the sets B ∪ V+B \ C, with C → B, are disjoint.

Introduce messages

$$m_B^+(x^{(C)}) = \sum_{y^{(B\setminus C)}} \varphi_B(x^{(B\cap C)} \wedge y^{(B\setminus C)})\,\sigma_B^+(x^{(B\cap C)} \wedge y^{(B\setminus C)}),$$

where C is the parent of B. Then

$$m_B^+(x^{(C)}) = \sum_{y^{(B\setminus C)}} \varphi_B(x^{(B\cap C)} \wedge y^{(B\setminus C)}) \prod_{B\to B'} m_{B'}^+(x^{(B\cap C)} \wedge y^{(B\setminus C)})$$

with

$$\sigma_C^+(x^{(C)}) = \prod_{C\to B} m_B^+(x^{(C)}),$$

which provides σC at the root. Reinterpreting this discussion in terms of the undirected graph, we are led to introducing messages mBC(x^(C)) for B ∼ C in G, with the message-passing rule

$$m_{BC}(x^{(C)}) = \sum_{y^{(B\setminus C)}} \varphi_B(x^{(B\cap C)} \wedge y^{(B\setminus C)}) \prod_{B'\sim B,\, B'\neq C} m_{B'B}(x^{(B\cap C)} \wedge y^{(B\setminus C)}). \tag{15.33}$$

Messages progressively stabilize when this rule is applied in G, and, at convergence, we have

$$\sigma_C(x^{(C)}) = \varphi_C(x^{(C)}) \prod_{B\sim C} m_{BC}(x^{(C)}). \tag{15.34}$$

Note that the complexity of the junction tree algorithm is exponential in the car-
dinality of the largest C ∈ S. This algorithm will therefore be unfeasible if S contains
sets that are too large.
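As an illustration of the rule (15.33), here is a minimal Python sketch of message passing on a junction tree. The data layout (frozenset cliques, a dictionary of potential functions) is hypothetical; a message m[(B, C)] is stored as a table over configurations of B ∩ C, which is all it depends on, and (15.34) then gives σC as ϕC times the product of the incoming message tables.

```python
import itertools

def jt_propagate(edges, F, phi, n_sweeps=10):
    # edges: pairs (B, C) of adjacent cliques (frozensets of variable indices)
    # F[s]: states of variable s; phi[B]: maps a dict {s: x_s, s in B} to a weight
    nbrs = {}
    for B, C in edges:
        nbrs.setdefault(B, []).append(C)
        nbrs.setdefault(C, []).append(B)

    def configs(S):
        S = sorted(S)
        for vals in itertools.product(*(F[s] for s in S)):
            yield dict(zip(S, vals))

    def key(z, S):  # restriction of configuration z to the variables in S
        return tuple(sorted((s, z[s]) for s in S))

    m = {(B, C): {key(y, B & C): 1.0 for y in configs(B & C)}
         for B in nbrs for C in nbrs[B]}
    for _ in range(n_sweeps):
        for (B, C) in list(m):
            msg = {}
            for xc in configs(B & C):
                total = 0.0
                for yb in configs(B - C):   # sum over y^(B \ C)
                    z = {**xc, **yb}        # x^(B ∩ C) ∧ y^(B \ C)
                    w = phi[B](z)
                    for Bp in nbrs[B]:
                        if Bp != C:
                            w *= m[(Bp, B)][key(z, Bp & B)]
                    total += w
                msg[key(xc, B & C)] = total
            m[(B, C)] = msg
    return m
```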

15.6 Building junction trees

There is more than one family of set interactions with respect to which a given prob-
ability π can be decomposed (notice that, unlike in the Hammersley-Clifford The-
orem, we do not assume that the interactions are normalized), and not all of them
can be organized as a junction tree. One can however extend any given family into a
new one on which one can build a junction tree.
Definition 15.19 Let V be a set of vertexes, and S0 ⊂ P (V ). We say that a set S ⊂ P (V )
is an extension of S0 if, for any C0 ∈ S0 , there exists a C ∈ S such that C0 ⊂ C.

A tree G = (S, E) is a junction-tree extension of S0 if S is an extension of S0 and G is


a junction tree.

If Φ′ = (ϕ′C0, C0 ∈ S0) is a consistent family of set interactions, and S is an extension of S0, one can build a new family, Φ = (ϕC, C ∈ S), of set interactions that yields the same probability distribution, i.e., such that, for all x ∈ F(V),

$$\prod_{C\in S} \varphi_C(x^{(C)}) \propto \prod_{C_0\in S_0} \varphi'_{C_0}(x^{(C_0)}).$$

For this, it suffices to build a mapping T : S0 → S such that C0 ⊂ T(C0) for all C0 ∈ S0, which is always possible since S is an extension of S0 (for example, arbitrarily order the elements of S and let T(C0) be the first element of S, according to this order, that contains C0). One can then define

$$\varphi_C(x^{(C)}) = \prod_{C_0:\, T(C_0)=C} \varphi'_{C_0}(x^{(C_0)}).$$

Given Φ′, our goal is to design a junction-tree extension that is as feasible as possible. In particular, we are not interested in the trivial extension G = ({V}, ∅), since the resulting junction-tree algorithm is unfeasible as soon as V is large. Theorem 15.24 in the next section will be the first step in the design of an algorithm that computes junction trees on a given graph.

15.6.1 Triangulated graphs

Definition 15.20 Let G = (V, E) be an undirected graph. Let (s1, s2, . . . , sn) be a path in G. One says that this path has a chord at sj, with j ∈ {2, . . . , n − 1}, if sj−1 ∼ sj+1, and we will refer to (sj−1, sj, sj+1) as a chordal triangle. A path in G is achordal if it has no chord.

One says that G is triangulated (or chordal) if it has no achordal loop.

Definition 15.21 The graph G is decomposable if it satisfies the following recursive condition: it is either complete, or there exist disjoint subsets (A, B, C) of V such that

• V = A ∪ B ∪ C,
• A and B are not empty,
• C is a clique in G, and C separates A and B,
• the restricted graphs GA∪C and GB∪C are decomposable.

These definitions are in fact equivalent, as stated in the following proposition.


Proposition 15.22 An undirected graph is triangulated if and only if it is decomposable.

Proof To prove the “if” part, we proceed by induction on n = |V|. Note that every graph with n ≤ 3 is both decomposable and triangulated (we leave the verification to the reader). Assume that the statement “decomposable ⇒ triangulated” holds for graphs with fewer than n vertexes, and take G with n vertexes. Assume that G is decomposable. If it is complete, it is obviously triangulated. Otherwise, there exist
A, B, C such that V = A ∪ B ∪ C, with A and B non-empty such that GA∪C and GB∪C
are decomposable, hence triangulated from the induction hypothesis, and such that
C is a clique which separates A and B. Assume that γ is an achordal loop in G. Since
it cannot be included in A ∪ C or B ∪ C, γ must go from A to B and back, which
implies that it passes at least twice in C. Since C is complete, the original loop can
be shortcut to form subloops in A ∪ C and B ∪ C. If one of (or both) these loops has
cardinality 3, this would provide γ with a chord, which contradicts the assumption.
Otherwise, the following lemma also provides a contradiction, since one of the two
chords that it implies must also be a chord in the original γ.
Lemma 15.23 Let (s1, . . . , sn, sn+1 = s1) be a loop in a triangulated graph, with n ≥ 4. Then the path has a chord at at least two non-contiguous vertexes.

To prove the lemma, assume the contrary, and let (s1, . . . , sn, sn+1 = s1) be a loop that does not satisfy the condition, with n as small as possible. If n > 4, the loop must have a chord, say at sj, and one can remove sj from the loop to obtain a smaller loop, which must satisfy the condition in the lemma, since n was as small as possible. One of its two chords must be at a vertex other than the two neighbors of sj, and thus provides a second chord in the original loop, which is a contradiction. Thus n = 4; but G being triangulated implies that this 4-point loop has a diagonal, so that the condition in the lemma also holds, which provides a contradiction.

For the “only if” part of proposition 15.22, assume that G is triangulated. We prove that the graph is decomposable by induction on |G|. The induction will work if we can show that, if G is triangulated, it is either complete or there exists a clique C in G such that V \ C is disconnected, i.e., there exist two elements a, b ∈ V \ C that are connected by no path in V \ C. Indeed, we will then be able to decompose V = A ∪ B ∪ C, where A and B are unions of (distinct) connected components of V \ C. Take, for example, A to be the set of vertexes connected to a in G \ C, and B = V \ (A ∪ C), which is not empty since it contains b. Note that restricted graphs of triangulated graphs are triangulated too.

So, assume that G is triangulated and not complete. Let C be a subset of V with the property that V \ C is disconnected, and take C minimal, so that V \ C′ is connected for any C′ ⊂ C, C′ ≠ C. We want to show that C is a clique, so take s and t in C and assume that they are not neighbors, in order to reach a contradiction.

Let A and B be two connected components of V \ C. For any a ∈ A, b ∈ B, and s, t ∈ C, we know that there exists a path between a and b in (V \ C) ∪ {s} and another one in (V \ C) ∪ {t}, the first one passing through s (because it would otherwise connect a and b in V \ C) and the second one passing through t. Any point before s (or t) in these paths must belong to A, and any point after them must belong to B. Concatenating these two paths, and removing multiple points if needed, we obtain a loop passing in A, then through s, then in B, then through t. We can recursively remove all points at which these paths have a chord. We can also notice that we cannot remove s nor t in this process, since this would imply an edge between A and B, and that we must leave at least one element in A and one in B, because removing the last one would require s ∼ t. So, at the end, we obtain an achordal loop with at least four points, which contradicts the fact that G is triangulated. □

We can now characterize graphs that admit junction trees over the set of their
maximal cliques.

Theorem 15.24 Let G = (V, E) be an undirected graph, and let CG be the set of all maximal cliques in G. The following two properties are equivalent.

(i) There exists a junction tree over CG.

(ii) G is triangulated/decomposable.

Proof The proof works by induction on the number of maximal cliques, |CG |. If G
has only one maximal clique, then G is complete, because any point not included
in this clique will have to be included in another maximal clique, which leads to a
contradiction. So G is decomposable, and, since any single node obviously provides
a junction tree, (i) is true also.

Now, fix G and assume that the theorem is true for any graph with fewer maximal cliques. First assume that CG has a junction tree, T. Let C1 be a leaf in T, connected, say, to C2, and let T2 be T restricted to 𝒞2 = CG \ {C1}. Let V2 be the union of the maximal cliques from the nodes in T2. A maximal clique C in GV2 is a clique in G and is therefore included in some maximal clique C′ ∈ CG. If C′ ∈ 𝒞2, then C′ is also a clique in GV2, and for C to be maximal, we need C = C′. If C′ = C1, we note that we must also have

$$C = \bigcup_{\tilde C\in\mathcal{C}_2} C\cap\tilde C,$$

and whenever C ∩ C̃ is not empty, this set must be included in any node in the path in T that links C̃ to C1. Since this path contains C2, we have C ∩ C̃ ⊂ C2, so that C ⊂ C2; but, since C is maximal, this would imply that C = C2 = C1, which is impossible.

This shows that the set of maximal cliques of GV2 is 𝒞2, and that T2 is a junction tree over 𝒞2. So, by the induction hypothesis, GV2 is decomposable. If s ∈ V2 ∩ C1, then s also belongs to some clique C′ ∈ 𝒞2, and therefore belongs to any clique in the path between C′ and C1, which includes C2. So s ∈ C1 ∩ C2, and C1 ∩ V2 = C1 ∩ C2. So, letting A = C1 \ (C1 ∩ C2), B = V2 \ (C1 ∩ C2), S = C1 ∩ C2, we know that GA∪S and GB∪S are decomposable (the first one being complete), and that S is a clique. To show that G is decomposable, it remains to show that S separates A from B.

If a path connected A to B in G without meeting S, it would contain an edge {s, t} with s ∈ A and t ∈ B, and this edge would be included in some maximal clique of G. If this clique were C1, we would have t ∈ C1 ∩ V2 = S, a contradiction; and if it were a clique in 𝒞2, we would have s ∈ C1 ∩ V2 = S, a contradiction again. So S separates A and B in G.

Let us now prove the converse statement, and assume that G is decomposable. If G is complete, it has only one maximal clique and we are done. Otherwise, there exists a partition V = A ∪ B ∪ S such that GA∪S and GB∪S are decomposable, with A and B separated by S, which is complete. Let C∗A be the set of maximal cliques in GA∪S and C∗B that of GB∪S. By the induction hypothesis, there exist junction trees TA and TB over C∗A and C∗B.

Let C be a maximal clique in GA∪S, and assume that C intersects A. Then C can be extended to a maximal clique, C′, in G, but C′ cannot intersect B (since this would imply a direct edge between A and B) and is therefore included in A ∪ S, so that C = C′. Similarly, all maximal cliques in GB∪S that intersect B are also maximal cliques in G.

The clique S is included in some maximal clique S∗A ∈ C∗A. From the previous discussion, we have either S∗A = S or S∗A ∈ CG. Similarly, S can be extended to a maximal clique S∗B ∈ C∗B, with S∗B = S or S∗B ∈ CG. Notice also that at least one of S∗A or S∗B must be a maximal clique in G: indeed, assume that both sets are equal to S, which, as a clique, can be extended to a maximal clique S∗ in G; S∗ must be included either in A ∪ S or in B ∪ S, and therefore be a maximal clique in the corresponding graph, which yields S∗ = S. Reversing the notation if needed, we will assume that S∗A ∈ CG.

All elements of CG must belong either to C∗A or to C∗B, since any maximal clique, say C, in G must be included in either A ∪ S or B ∪ S, and therefore also provides a maximal clique in the related graph. So the nodes in TA and TB enumerate all maximal cliques in G, and we can build a tree T over CG by identifying S∗A and S∗B with S∗ and merging the two trees at this node. To conclude our proof, it only remains to show that the running intersection property is satisfied. So consider two nodes C, C′ in T and take s ∈ C ∩ C′. If the path between these nodes remains in C∗A, or in C∗B, then s belongs to every set along that path, since the running intersection property is true on TA and TB. Otherwise, we must have s ∈ S, and the path must contain S∗ to switch trees, and s must still belong to any clique in the path (applying the running intersection property between the beginning of the path and S∗, and between S∗ and the end of the path). □

This theorem delineates a strategy in order to build a junction tree that is adapted
to a given family of local interactions Φ = (ϕC , C ∈ C). Letting G be the graph in-
duced by these interactions, i.e., s ∼G t if and only if there exists C ∈ C such that
{s, t} ⊂ C, the method proceeds as follows.

(JT1) Extend G by adding edges to obtain a triangulated graph G∗ .

(JT2) Compute the set C ∗ of maximal cliques in G∗ , which therefore extend C.

(JT3) Build a junction tree over C ∗ .

(JT4) Assign interaction ϕC to a clique C ∗ ∈ C ∗ such that C ⊂ C ∗ .

(JT5) Run the junction-tree belief propagation algorithm to compute the marginal
of π (associated to Φ) over each set C ∗ ∈ C ∗ .

Steps (JT4) and (JT5) have already been discussed, and we now explain how the first
three steps can be implemented.

15.6.2 Building triangulated graphs

First consider step (JT1). To triangulate a graph G = (V , E), it suffices to order its
vertexes so that V = {s1 , . . . , sn }, and then run the following algorithm.

Algorithm 15.6 (Graph triangulation)


Initialize the algorithm with k = n and Ek = E. Given Ek , determine Ek−1 as follows:

• Add an edge between any two neighbors of sk (unless, of course, they are already linked).
• Let Ek−1 be the new set of edges.

Then the graph G∗ = (V, E0) is triangulated. Indeed, taking any achordal loop and selecting the vertex with highest index in the loop, say sk, brings a contradiction, since the neighbors of sk have been linked when building Ek−1.

However, the quality of the triangulation, which can be measured by the number
of added edges, or by the size of the maximal cliques, highly depends on the way ver-
texes have been numbered. Take the simple example of the linear graph with three
vertexes A ∼ B ∼ C. If the point of highest index is B, then the previous algorithm
will return the three-point loop A ∼ B ∼ C ∼ A. Any other ordering will leave the
linear graph, which is already triangulated, invariant.

So, one must be careful about the order in which nodes are processed. Finding an optimal ordering for a given global cost is an NP-complete problem. However, a very simple modification of the previous algorithm, which starts with sn having the minimal number of neighbors, and at each step defines sk to be the vertex with fewest neighbors among those that have not been visited yet, provides an efficient way of building triangulations. (It has the merit of leaving G invariant if it is a tree, for example.) A criterion other than the number of neighbors may also be preferred (for example, the number of new edges that would be created if s were eliminated).
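Here is a minimal Python sketch of Algorithm 15.6 combined with the minimum-degree heuristic just described; the set-based graph representation is an assumption of the example.

```python
def triangulate(V, E):
    # E: iterable of 2-element edges; returns the filled edge set and the
    # ordering (s_1, ..., s_n) under which no further edges would be added
    E = {frozenset(e) for e in E}
    nbrs = {s: {t for t in V if frozenset((s, t)) in E} for s in V}
    remaining, order = set(V), []
    while remaining:
        # eliminate next the unvisited vertex with fewest unvisited neighbors
        s = min(remaining, key=lambda v: len(nbrs[v] & remaining))
        order.append(s)
        active = nbrs[s] & (remaining - {s})
        for t in active:            # link all pairs of remaining neighbors of s
            for u in active:
                if t != u:
                    E.add(frozenset((t, u)))
                    nbrs[t].add(u)
                    nbrs[u].add(t)
        remaining.remove(s)
    order.reverse()                 # the elimination visited s_n first
    return E, order
```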

If G is triangulated, there exists an ordering of V such that the algorithm above


leaves G invariant. We now proceed to a proof of this statement and also show that
such an ordering can be computed using an algorithm called maximum cardinality
search, which, in addition, allows one to decide whether a graph is triangulated. We
start with a definition that formalizes the sequence of operations in the triangulation
algorithm.
Definition 15.25 Let G = (V , E) be an undirected graph. A node elimination consists in
selecting a vertex s ∈ V and building the graph G(s) = (V (s) , E (s) ) with V (s) = V \ {s}, and
E (s) containing all pairs {t, t 0 } ⊂ V (s) such that either {t, t 0 } ∈ E or {t, t 0 } ⊂ Vs .

G(s) is called the s-elimination graph of G. The set of added edges, namely E (s) \(E∩E (s) )
is called the deficiency set of s and denoted D(s) (or DG (s)).

So, the triangulation algorithm implements a sequence of node eliminations, suc-


cessively applied to sn , sn−1 , etc. One says that such an elimination process is perfect
if, for all k = 1, . . . , n, the deficiency set of sk in the graph obtained after elimination
of sn , . . . , sk+1 is empty (so that no edge is added during the process). We will also say
that (s1 , . . . , sn ) provides a perfect ordering for G.

Theorem 15.26 An undirected graph G = (V , E) admits a perfect ordering if and only if


it is triangulated.

Proof The “only if” part is obvious: the triangulation algorithm, run along a perfect ordering, does not add any edge to G, which must therefore have been triangulated to start with.

We now proceed to the “if” part. For this, it suffices to prove that, for any triangulated graph, there exists a vertex s such that DG(s) = ∅. One can then easily prove the result by induction, since, after removing this s, the remaining graph G(s) is still triangulated and admits (by induction) a perfect ordering that completes this first step.

To prove that such an s exists, take a decomposition V = A ∪ S ∪ B, in which S is complete and separates A and B, such that |A ∪ S| is minimal (or |B| maximal). We claim that A ∪ S must be complete. Otherwise, since GA∪S is still triangulated, there exists a similar decomposition A ∪ S = A′ ∪ S′ ∪ B′. One cannot have S ∩ A′ and S ∩ B′ non-empty simultaneously, since this would imply a direct edge from A′ to B′ (S is complete). Say that S ∩ A′ = ∅, so that A′ ⊂ A. Then the decomposition V = A′ ∪ S′ ∪ (B′ ∪ B) is such that S′ separates A′ from B ∪ B′. Indeed, a path from A′ to b ∈ B ∪ B′ must pass through S′ if b ∈ B′, and, if b ∈ B, it must pass through S (since it links A and B); but S ⊂ S′ ∪ B′, so that the path must intersect S′. We therefore obtain a decomposition that enlarges B, which is a contradiction and shows that A ∪ S is complete. Given this, any element s ∈ A can only have neighbors in A ∪ S, and is therefore such that DG(s) = ∅, which concludes the proof. □

If a graph is triangulated, there is generally more than one perfect ordering of its vertexes. One such ordering is provided by the maximum cardinality search algorithm, which also allows one to decide whether the graph is triangulated. We start with a definition/notation.

Definition 15.27 If G = (V, E) is an undirected graph, with |V| = n, any ordering V = (s1, . . . , sn) can be identified with the bijection α : V → {1, . . . , n} defined by α(sk) = k. In other terms, α(s) is the rank of s in the ordering. We will refer to α as an ordering, too.

Given an ordering α, we define incremental neighborhoods $V_s^{\alpha,k}$, for s ∈ V and k = 1, . . . , n, to be the intersections of Vs with the sets α−1({1, . . . , k}), i.e.,

$$V_s^{\alpha,k} = \{t\in V : t\sim s,\ \alpha(t)\le k\}.$$

One says that α satisfies the maximum cardinality property if, for all k ∈ {2, . . . , n},

$$|V_{s_k}^{\alpha,k-1}| = \max_{\alpha(s)\ge k} |V_s^{\alpha,k-1}|, \tag{15.35}$$

where sk = α−1(k).

Given this, we have the proposition:


Proposition 15.28 If G = (V , E) is triangulated, then any ordering that satisfies the max-
imum cardinality property is perfect.

Equation (15.35) immediately provides an algorithm that constructs an ordering satisfying the maximum cardinality property given a graph G. From proposition 15.28, we see that, if, for some k, the set $V_{s_k}^{\alpha,k-1}$ is not a clique, then G is not triangulated. We now proceed to the proof of this proposition.
Proof Let G be triangulated, and assume that α is an ordering that satisfies (15.35). Assume, in order to reach a contradiction, that α is not perfect.

Let k be the first index for which $V_{s_k}^{\alpha,k-1}$ is not a clique, so that s = sk has two neighbors, say t and u, such that α(t) < k, α(u) < k and t ≁ u. Assume that α(t) > α(u). Then t must have a neighbor that is not a neighbor of s, say t′, with α(t′) < α(t) (otherwise, s would have more neighbors than t at ranks less than α(t), which contradicts the maximum cardinality property). The sequence t′, t, s, u forms a path along which α increases from t′ to s, then decreases from s to u, and which contains no chord. Moreover, t′ and u cannot be neighbors, since this would yield an achordal loop and a contradiction. The proof of proposition 15.28 consists in showing that this construction can be iterated until a contradiction is reached.

More precisely, assume that an achordal path s1, . . . , sk has been obtained, such that α is first increasing, then decreasing along the path, and such that, at the extremities, one either has α(s1) < α(sk) < α(s2) or α(sk) < α(s1) < α(sk−1). In fact, one can switch between these last two cases by reading the path backwards. Both paths (u, s, t) and (u, s, t, t′) in the discussion above satisfy this property.

• Assume, without loss of generality, that α(s1) < α(sk) < α(s2), and note that, in the considered path, s1 and sk cannot be neighbors (for, if j is the last index smaller than k − 1 such that sj and sk are neighbors, then j must also be smaller than k − 2, and the loop sj, . . . , sk−1, sk would be achordal).

• Since α(s2) > α(sk), and s1 and s2 are neighbors, sk must have a neighbor, say s′k, such that s′k is not a neighbor of s2 and α(s′k) < α(sk).

• Select the first index j > 2 such that sj ∼ s′k, and consider the path (s1, . . . , sj, s′k). This path is achordal, by construction, and one cannot have s1 ∼ s′k, since this would create an achordal loop. Let us show that α first increases and then decreases along this path. Since s2 is in the path, α must first increase, and it suffices to show that α(s′k) < α(sj). If α increases from s1 to sj, then α(sj) > α(s2) > α(sk) > α(s′k). If α started decreasing at some point before sj, then α(sj) > α(sk) > α(s′k).

• Finally, we need to show that the α-value at one extremity is between the first two α-values at the other end of the path. If α(s′k) < α(s1), then, since we have just seen that α(sj) > α(sk) > α(s1), we do get α(s′k) < α(s1) < α(sj). If α(s′k) > α(s1), then, since by construction α(s2) > α(sk) > α(s′k), we have α(s2) > α(s′k) > α(s1).

• So, we have obtained a new path that satisfies the same property as the one we started with, but with a smaller maximum α-value at its end points, i.e.,

$$\max(\alpha(s_1), \alpha(s'_k)) < \max(\alpha(s_1), \alpha(s_k)).$$

Since α takes a finite number of values, this process cannot be iterated indefinitely, which yields our contradiction. □
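The greedy construction implied by (15.35), together with the triangulation test that follows from proposition 15.28, can be sketched as below; the neighborhood map nbrs (a dictionary of sets) is a hypothetical representation.

```python
def maximum_cardinality_search(V, nbrs):
    # number vertexes greedily, always picking one with the most numbered neighbors
    order, unnumbered = [], set(V)
    while unnumbered:
        s = max(unnumbered, key=lambda v: len(nbrs[v] - unnumbered))
        order.append(s)
        unnumbered.remove(s)
    return order  # alpha(s) = 1 + position of s in this list

def is_triangulated(V, nbrs):
    order = maximum_cardinality_search(V, nbrs)
    pos = {s: i for i, s in enumerate(order)}
    for s in order:
        earlier = [t for t in nbrs[s] if pos[t] < pos[s]]
        # by proposition 15.28, the ordering is perfect iff each such
        # incremental neighborhood is a clique
        for i, t in enumerate(earlier):
            for u in earlier[i + 1:]:
                if u not in nbrs[t]:
                    return False
    return True
```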

15.6.3 Computing maximal cliques

At this point, we know that a graph must be triangulated for its maximal cliques to admit a junction tree, and we have an algorithm that decides whether a graph is triangulated and extends it into a triangulated one if needed. This provides the first step, (JT1), of our description of the junction tree algorithm. The next step, (JT2), requires computing the list of maximal cliques. Computing maximal cliques in a general graph is an NP-complete problem, for which a large number of algorithms have been developed (see, for example, [150] for a review). For graphs with a perfect ordering, however, this problem can always be solved in polynomial time.

Indeed, assume that a perfect ordering is given for G = (V, E), so that V = {s1, . . . , sn} is such that, for all k, V′sk := Vsk ∩ {s1, . . . , sk−1} is a clique. Let Gk be G restricted to {s1, . . . , sk} and C∗k be the set of maximal cliques in Gk. Then the set Ck := {sk} ∪ V′sk is the only maximal clique in Gk that contains sk: it is a clique because the ordering is perfect, and any clique that contains sk must be included in it (because its elements are either sk or neighbors of sk among s1, . . . , sk−1). It follows that the set C∗k can be deduced from C∗k−1 by

$$\begin{cases}\; C_k^* = C_{k-1}^* \cup \{C_k\} & \text{if } V'_{s_k}\notin C_{k-1}^*,\\ \; C_k^* = (C_{k-1}^* \cup \{C_k\}) \setminus \{V'_{s_k}\} & \text{if } V'_{s_k}\in C_{k-1}^*. \end{cases}$$

This allows one to enumerate all elements of CG = C∗n, starting with C∗1 = {{s1}}.
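This recursion translates directly into code. Below is a sketch, assuming a perfect ordering (for instance, one produced by maximum cardinality search) and a set-valued neighborhood map.

```python
def maximal_cliques(order, nbrs):
    # order: a perfect ordering (s_1, ..., s_n); nbrs[s]: set of neighbors of s
    pos = {s: i for i, s in enumerate(order)}
    cliques = []                       # holds C*_k as frozensets
    for k, s in enumerate(order):
        Vk = frozenset(t for t in nbrs[s] if pos[t] < k)   # V'_{s_k}
        Ck = Vk | {s}                  # the only maximal clique of G_k with s_k
        if Vk in cliques:
            cliques.remove(Vk)         # V'_{s_k} is absorbed by C_k
        cliques.append(Ck)
    return cliques
```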

15.6.4 Characterization of junction trees

We now discuss the last remaining point, (JT3). For this, we need to form the clique graph of G, which is the undirected graph 𝒢 = (CG, ℰ) defined by {C, C′} ∈ ℰ if and only if C ∩ C′ ≠ ∅. We then have the following fact:

Proposition 15.29 The clique graph 𝒢 of a connected, triangulated, undirected graph G is connected.

Proof We proceed by induction, assuming that the result is true when |V| = n − 1 (the proposition obviously holds if |V| = 1). Assume that a perfect order on G has been chosen, say V = {s1, . . . , sn}. Let G′ be G restricted to {s1, . . . , sn−1}, and 𝒢′ the associated clique graph. Because {sn} ∪ Vsn is a clique, any path in G provides a valid path in G′ after removing all occurrences of sn (because any two neighbors of sn are linked), so that G′ is connected; the induction hypothesis then implies that 𝒢′ is connected. Since G is connected, Vsn is not empty. Moreover, C := {sn} ∪ Vsn must be a maximal clique in G (since we assume that the order is perfect), and it is the only maximal clique in G that contains sn (all other maximal cliques in G therefore are maximal cliques in G′ also). To prove that 𝒢 is connected, it suffices to prove that C is connected to any other maximal clique, C′, in G by a path in 𝒢. If t ∈ C, t ≠ sn, there exists a maximal clique, say C″, in G′ that contains t, and, since 𝒢′ is connected, there exists a path (C1 = C′, . . . , Cq = C″) connecting C′ to C″ in 𝒢′. Let j be the first integer such that Cj = Vsn (take j = q + 1 if this never happens). Then (C1, . . . , Cj−1, C) is a path linking C′ and C in 𝒢. □

We hereafter assume that G, and hence 𝒢, is connected. This is no real loss of generality, because the connected components of an undirected graph yield independent processes that can be handled separately. We assign weights to the edges of the clique graph of G by defining w(C, C′) = |C ∩ C′|. A subgraph T̃ of a given graph G̃ is called a spanning tree if T̃ is a tree whose vertex set equals that of G̃. If T = (CG, E′) is a spanning tree of 𝒢, we define its total weight as

$$w(T) = \sum_{\{C,C'\}\in E'} w(C, C').$$

We then have the proposition:

Proposition 15.30 [99] If G is a connected triangulated graph, the set of junction trees over CG coincides with the set of maximizers of w(T) over all spanning trees of 𝒢.

(Notice that 𝒢 being connected implies that spanning trees over 𝒢 exist.)
380 CHAPTER 15. PROBABILISTIC INFERENCE FOR MRF

Before proving this proposition, we discuss some properties of maximal (or maximum-weight) spanning trees of an undirected graph. For this discussion, we let G = (V, E) be any undirected graph with weights (w(e), e ∈ E). We will then apply these results to a clique graph when we switch back to the general notation of this section. Maximal spanning trees can be computed using the so-called Prim's algorithm [98, 156, 63].

Algorithm 15.7 (Prim's algorithm)

Initialize the algorithm with a single-node tree T1 = ({s1}, ∅), for some arbitrary s1 ∈ V. Let Tk−1 = (Vk−1, Ek−1) be the tree obtained at step k − 1 of the algorithm. If k ≤ n, the next tree is built as follows.

(1) Let Vk = {sk} ∪ Vk−1 for some sk ∉ Vk−1.

(2) Let Ek = {ek} ∪ Ek−1, where ek = {sk, s} for some s ∈ Vk−1 satisfying

$$w(e_k) = \max\bigl\{w(\{t, t'\}) : \{t, t'\}\in E,\ t\notin V_{k-1},\ t'\in V_{k-1}\bigr\}. \tag{15.36}$$
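Specialized to a clique graph with weights w(C, C′) = |C ∩ C′|, Prim's algorithm can be sketched as follows; by proposition 15.30 (proved below), the result is a junction tree when the cliques come from a connected triangulated graph. The list-of-frozensets input is an assumption of the example.

```python
def max_spanning_junction_tree(cliques):
    cliques = [frozenset(C) for C in cliques]
    in_tree = {cliques[0]}                     # T_1: an arbitrary single node
    edges = []
    while len(in_tree) < len(cliques):
        # rule (15.36): heaviest edge leaving the current tree
        C, D = max(((C, D) for C in in_tree for D in cliques
                    if D not in in_tree),
                   key=lambda e: len(e[0] & e[1]))
        if not C & D:
            raise ValueError("clique graph is disconnected")
        edges.append((C, D))
        in_tree.add(D)
    return edges
```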

The ability of this algorithm to always build a maximal spanning tree is summa-
rized in the following proposition [81, 130].
Proposition 15.31 If G = (V , E, w) is a weighted, connected undirected graph, Prim’s
algorithm, as described above, provides a sequence Tk = (Vk , Ek ), for k = 1, . . . , n of subtrees
of G such that Vn = V and, for all k, Tk is a maximal spanning tree for the restriction GVk
of G to Vk .

Moreover, any maximal spanning tree of G can be realized as Tn, where (T1, . . . , Tn) is a sequence provided by Prim's algorithm.
Proof We first prove that, for all k, Tk is a maximal spanning tree of the graph GVk. We will in fact prove a slightly stronger statement, namely that, for all k, Tk can be extended to form a maximal spanning tree of G. This is stronger because, if Tk = (Vk, Ek) can be extended to a maximal spanning tree T = (V, E), and if T′k = (Vk, E′k) is a spanning tree for GVk such that w(Tk) < w(T′k), then the graph T′ = (V, E′) with

E′ = (E \ Ek) ∪ E′k

would be a spanning tree for G with w(T) < w(T′), which is impossible. (To see that T′ is a tree, notice that paths in T′ are in one-to-one correspondence with paths in T, by replacing any subpath within T′k by the unique subpath in Tk that has the same extremities.)

Clearly, T_1, which only has one vertex, can be extended to a maximal spanning
tree. Let k ≥ 1 be the largest integer for which this property is true for all j = 1, ..., k.
If k = n, we are done. Otherwise, take a maximal spanning tree, T, that extends
T_k. This tree cannot contain the new edge added when building T_{k+1}, namely e_{k+1} =
{s_{k+1}, s} as defined in Prim's algorithm, since it would otherwise also extend T_{k+1}.
Consider the path γ in T that links s to s_{k+1}. This path must have an edge e = {t, t'}
such that t ∈ V_k and t' ∉ V_k, and by definition of e_{k+1}, we must have w(e) ≤ w(e_{k+1}).
Notice that e is uniquely defined, because a path leaving V_k cannot return to this set:
one would otherwise be able to close it into a loop by inserting the only path
in T_k that connects its extremities.

Replace e by e_{k+1} in T. The resulting graph, say T', is still a spanning tree for
G. From any path in T, one can create a path in T' with the same extremities by
replacing any occurrence of the edge e by the concatenation of the unique path in
T going from t to s, followed by (s, s_{k+1}), followed by the unique path in T going
from s_{k+1} to t'. This implies that T' is connected. It is also acyclic, since any loop
in T' would have to contain e_{k+1} (because T is acyclic), but there is no path other than
(s, s_{k+1}) in T' that links s and s_{k+1}: such a path would have to lie in T, and we
have removed the only possible one from T by deleting the edge e.

In conclusion, T' is an extension of T_{k+1} and a spanning tree with total weight
larger than or equal to that of T; it must therefore be optimal too. But this contradicts
the assumption that T_{k+1} cannot be extended to a maximal tree, so that k = n and the
sequence of trees provided by Prim's algorithm is optimal.

To prove the second statement, let T be an optimal spanning tree. Let k be the
largest integer such that there exists a sequence (T_1, ..., T_k) generated by Prim's
algorithm such that, for all j = 1, ..., k, T_j is a subtree of T. One necessarily has k ≥ 1,
since T extends any one-vertex tree. If k = n, we are done. Assuming otherwise,
let T_k = (V_k, E_k) and make one more step of Prim's algorithm, selecting an edge
e_{k+1} = {s_{k+1}, s} satisfying (15.36). By assumption, e_{k+1} is not in T. Take as before
the unique path linking s and s_{k+1} in T and let e be the unique edge at which this
path leaves V_k. Replacing e by e_{k+1} in T provides a new spanning tree, T'. One
must have w(e) ≥ w(e_{k+1}) because T is optimal, and w(e_{k+1}) ≥ w(e) by (15.36). So
w(e) = w(e_{k+1}), and one can use e instead of e_{k+1} for the (k + 1)th step of Prim's
algorithm. But this contradicts the fact that k was the largest integer for which a sequence of
subtrees of T can be generated by Prim's algorithm, and one therefore has k = n. □

The proof of proposition 15.30, which we now provide, uses very similar "edge-
switching" arguments.

Proof (Proof of proposition 15.30) Let us start with a maximum weight spanning
tree for G, say T , and show that it is a junction tree. Since T has maximum weight, we

know that it can be obtained via Prim’s algorithm, and that there exists a sequence
T1 , . . . , Tn = T of trees constructed by this algorithm. Let Tk = (Ck , Ek ).

We proceed by contradiction. Let k be the largest index such that T_k can be extended
to a junction tree for C_G, and let T' be a junction tree extension of T_k. Assume
that k < n, and let e_{k+1} = {C_{k+1}, C'} be the edge that has been added when building
T_{k+1}, whose vertex set is C_{k+1} = {C_{k+1}} ∪ C_k. This edge is not in T', so that there exists a unique
edge e = {B, B'} in the path between C_{k+1} and C' in T' such that B ∈ C_k and B' ∉ C_k. We
must have w(e) = |B ∩ B'| ≤ w(e_{k+1}) = |C_{k+1} ∩ C'|. But, since the running intersection
property holds for T', both B and B' must contain C_{k+1} ∩ C', so that B ∩ B' = C_{k+1} ∩ C'.
This implies that, if one modifies T' by replacing edge e by edge e_{k+1}, yielding a new
spanning tree T'', the running intersection property is still satisfied in T''. Indeed, if
a vertex s ∈ V belongs to both extremities of a path through B and B' in T'', then
it must belong to B ∩ B', and hence to C_{k+1} ∩ C', and therefore to every set in the path
in T' that linked C_{k+1} and C'. So we have found a junction tree extension of T_{k+1}, which
contradicts our assumption that k was the largest such index. We must therefore have k = n and
T is a junction tree.

Let us now consider the converse statement and assume that T is a junction tree.
Let k be the largest integer such that there exists a sequence of subgraphs of T that
is provided by Prim's algorithm. Denote such a sequence by (T_1, ..., T_k), with T_j =
(C_j, E_j). Assume (to get a contradiction) that k < n, and consider a new step of
Prim's algorithm, adding a new edge e_{k+1} = {C_{k+1}, C'} to T_k. Take as before the path
in T linking C' to C_{k+1}, and select the edge e at which this path leaves C_k. If e =
{B, B'}, we must have w(e) = |B ∩ B'| ≤ w(e_{k+1}) = |C_{k+1} ∩ C'|, and the running intersection
property in T implies that C_{k+1} ∩ C' ⊂ B ∩ B', which implies that w(e) = w(e_{k+1}).
This implies that adding e instead of e_{k+1} at step k + 1 is a valid choice for Prim's
algorithm, which contradicts the fact that k was the largest number of such steps that
could provide a subtree of T. So k = n and T is maximal. □
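
Computationally, proposition 15.30 turns junction tree construction into a maximum-weight spanning tree problem. The following hedged Python sketch builds the clique-graph weights w(C, C') = |C ∩ C'| and reuses the prim_maximal_spanning_tree function sketched after algorithm 15.7 (it assumes the cliques come from a connected triangulated graph, and that two cliques are linked in the clique graph when they intersect):

def junction_tree(cliques):
    # cliques: list of frozensets, the maximal cliques in C_G
    weights = {}
    for i, C in enumerate(cliques):
        for Cp in cliques[i + 1:]:
            if C & Cp:                 # clique-graph edge with weight |C ∩ C'|
                weights[frozenset({C, Cp})] = len(C & Cp)
    return prim_maximal_spanning_tree(set(cliques), weights)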
Chapter 16

Bayesian Networks

16.1 Definitions

Bayesian networks are graphical models supported by directed acyclic graphs (DAGs),
which provide them with an ordered organization (directed graphs were introduced
in definition 14.35).

We first introduce some notation. Let G = (V, E) be a directed acyclic graph. The
parents of s ∈ V are the vertexes t such that (t, s) ∈ E, and its children are the vertexes t
such that (s, t) ∈ E. The set of parents of s is denoted pa(s), the set of its children ch(s),
and we write V_s = ch(s) ∪ pa(s).

Similarly to trees, the vertexes of G can be partially ordered by s ≤G t if and only


if there exists a path going from s to t. Unlike trees, however, there can be more
than one minimal element in V , and we still call roots vertexes that have no parent,
denoting
V0 = {s ∈ V : pa(s) = ∅} .
We also call leaves, or terminal nodes, vertexes that have no children. Unless other-
wise specified, we assume that all graphs are connected.

Bayesian networks over G are defined as follows. We use the same notation as
with Markov random fields to represent the set of configurations F (V ) that contains
collections x = (xs , s ∈ V ) with xs ∈ Fs .
Definition 16.1 A random variable X with values in F (V ) is a Bayesian network over a
DAG G = (V , E) if and only if its distribution can be written in the form
P_X(x) = ∏_{s ∈ V_0} p_s(x^{(s)}) ∏_{s ∈ V \ V_0} p_s(x^{(pa(s))}, x^{(s)})     (16.1)

where ps is, for all s ∈ V , a probability distribution with respect to x(s) .


Using the convention that conditional distributions given the empty set are just ab-
solute distributions, we can rewrite (16.1) as
P_X(x) = ∏_{s ∈ V} p_s(x^{(pa(s))}, x^{(s)}).     (16.2)

One can verify that ∑_x P_X(x) = 1. Indeed, when summing over x, we can start by
summing over all x^{(s)} with ch(s) = ∅ (the leaves). Such x^{(s)}'s only appear in the corresponding
p_s's, which disappear since they sum to 1. What remains is the sum of the
product over V minus the leaves, and the argument can be iterated until the remaining
sum is 1 (alternatively, work by induction on |V|). This fact is also a consequence
of proposition 16.5 below, applied with A = ∅.
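
The factorization (16.2) also shows how to simulate X: sample each vertex after its parents ("ancestral sampling"). A minimal Python sketch follows; the containers parents (mapping s to an ordered tuple of parents) and cond (mapping s to a function from parent values to the dictionary p_s( · )) are representations assumed for the example:

import random

def topological_order(vertices, parents):
    order, placed = [], set()
    while len(order) < len(vertices):  # terminates because G is a DAG
        for s in vertices:
            if s not in placed and all(t in placed for t in parents[s]):
                order.append(s)
                placed.add(s)
    return order

def sample_bayes_net(vertices, parents, cond):
    x = {}
    for s in topological_order(vertices, parents):
        table = cond[s](tuple(x[t] for t in parents[s]))  # p_s(x^(pa(s)), .)
        values, probs = zip(*table.items())
        x[s] = random.choices(values, weights=probs)[0]
    return x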

16.2 Conditional independence graph

16.2.1 Moral graph

Bayesian networks have a conditional independence structure which is not exactly
given by G, but can be deduced from it. Indeed, fixing S ⊂ V, we can see, when
computing the probability of X^{(S)} = x^{(S)} given X^{(S^c)} = x^{(S^c)}, which is

P(X^{(S)} = x^{(S)} | X^{(S^c)} = x^{(S^c)}) = (1/Z(x^{(S^c)})) ∏_{s ∈ V} p_s(x^{(pa(s))}, x^{(s)}),

that the only variables x^{(t)}, t ∉ S, that can be factored into the normalizing constant
are those that are neither parents nor children of vertexes in S, and do not share a
child with a vertex in S (i.e., they intervene in no p_s(x^{(pa(s))}, x^{(s)}) that involves elements
of S). This suggests the following definition.

Definition 16.2 Let G be a directed acyclic graph. We denote G] = (V , E ] ) the undirected


graph on V such that {s, t} ∈ E ] if one of the following conditions is satisfied

• Either (s, t) ∈ E or (t, s) ∈ E.


• There exists u ∈ V such that (s, u) ∈ E and (t, u) ∈ E.

G] is sometimes called the moral graph of G (because it forces parents to marry!). A
path in G] can be visualized as a path in G[ (the undirected graph associated with
G) which is allowed to jump between parents of the same vertex even if they were
not connected originally.
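
Definition 16.2 translates directly into code. A minimal sketch (with the directed edges stored as a set of ordered pairs, an assumed representation):

def moral_graph(vertices, edges):
    # Start from the unoriented version of G...
    undirected = {frozenset({s, t}) for (s, t) in edges}
    parents = {v: {s for (s, t) in edges if t == v} for v in vertices}
    for v in vertices:                 # ...then "marry" the parents of v
        pa = sorted(parents[v], key=str)
        for i, s in enumerate(pa):
            for t in pa[i + 1:]:
                undirected.add(frozenset({s, t}))
    return undirected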

The previous discussion implies:



Proposition 16.3 Let X be a Bayesian network on G. We have

(S ⊥ T | U)_{G]} ⇒ (X^{(S)} ⊥ X^{(T)} | X^{(U)}),

i.e., X is G]-Markov.

This proposition can be refined by noticing that the joint distribution of X^{(S)},
X^{(T)} and X^{(U)} can be deduced from a Bayesian network on a graph restricted to the
ancestors of S ∪ T ∪ U. Definition 14.21 for restricted graphs extends without change
to directed graphs, and we repeat it below for convenience.
Definition 16.4 Let G = (V , E) be a graph (directed or undirected), and A ⊂ V . The
restricted graph GA = (A, EA ) is such that the elements of EA are the edges (s, t) (or {s, t})
in E such that both s and t belong to A.

Moreover, for a directed acyclic graph G and s ∈ V , we define the set of ancestors of
s by
As = {t ∈ V , t ≤G s} (16.3)
for the partial order on V induced by G.
If S ⊂ V, we denote A_S = ∪_{s ∈ S} A_s. Note that, by definition, S ⊂ A_S. The following
proposition holds.
Proposition 16.5 Let X be a Bayesian network on G = (V , E) with distribution given by
(16.2). Let S ⊂ V and A = AS . Then the distribution of X (A) is a Bayesian network over
G_A given by

P(X^{(A)} = x^{(A)}) = ∏_{s ∈ A} p_s(x^{(pa(s))}, x^{(s)}).     (16.4)

There is no ambiguity in the notation pa(s), since the parents of s ∈ A are the same in
GA as in G.
Proof One needs to show that

∏_{s ∈ A} p_s(x^{(pa(s))}, x^{(s)}) = ∑_{x^{(A^c)}} ∏_{s ∈ V} p_s(x^{(pa(s))}, x^{(s)}).

This can be done by induction on the cardinality of V . Assume that the result is true
for graphs of size n, and let |V | = n + 1 (the result is obvious for graphs of size 1).

If A = V , there is nothing to prove, so assume that Ac is not empty. Then Ac must


contain a leaf in G, since otherwise, A would contain all leaves and their ancestors
which would imply that A = V .

If s ∈ A^c is a leaf in G, one can remove the variable x^{(s)} from the sum, since it
only appears in p_s and transition probabilities sum to one. One can then apply
the induction assumption to the restriction of G to V \ {s}. □

Given proposition 16.5, proposition 16.3 can therefore be refined as follows.

Proposition 16.6 Let X be a Bayesian network on G. We have

(S ⊥ T | U)_{(G_{A_{S∪T∪U}})]} ⇒ (X^{(S)} ⊥ X^{(T)} | X^{(U)}).

Proposition 16.5 is also used in the proof of the following proposition.

Proposition 16.7 Let G = (V, E) be a directed acyclic graph, and X be a Bayesian network
over G. Then, for all s ∈ V,

P(X^{(s)} = x^{(s)} | X^{(A_s \ {s})} = x^{(A_s \ {s})}) = P(X^{(s)} = x^{(s)} | X^{(pa(s))} = x^{(pa(s))}) = p_s(x^{(pa(s))}, x^{(s)}).

Proof By proposition 16.5, we can assume without loss of generality that V = A_s.
Then

P(X^{(s)} = x^{(s)} | X^{(A_s \ {s})} = x^{(A_s \ {s})}) ∝ P(X^{(A_s)} = x^{(A_s)}) = p_s(x^{(pa(s))}, x^{(s)}) Z(x^{(A_s \ {s})})

where

Z(x^{(A_s \ {s})}) = ∏_{t ∈ A_s \ {s}} p_t(x^{(pa(t))}, x^{(t)})

disappears when the conditional probability is normalized. □

16.2.2 Reduction to d-separation

We now want to reformulate proposition 16.6 in terms of the unoriented graph G[


and specific features in G called v-junctions, that we now define.

Definition 16.8 Let G = (V , E) be a directed graph. A v-junction is a triple of distinct


vertexes, (s, t, u) ∈ V × V × V such that {s, u} ⊂ pa(t) (i.e., s and u are parents of t).

We will say that a path (s1 , . . . , sN ) in G[ passes at s = sk with a v-junction if (sk−1 , sk , sk+1 )
is a v-junction in G.

We have the lemma:

Lemma 16.9 Two vertexes s and t in G are separated by a set U in (GA{s,t}∪U )] if and only
if any path between s and t in G[ must either

(1) Pass at a vertex in U without a v-junction.


(2) Pass in V \ A{s,t}∪U at a v-junction.

Proof

Step 1. We first note that the v-junction clause in (2) is redundant: it can be removed
without affecting the condition. Indeed, if a path in G[ passes in V \ A_{{s,t}∪U}, one can
follow this path downward (i.e., following the orientation in G) until a v-junction is
met. This has to happen before reaching the extremities of the path, since the visited
vertex would otherwise be an ancestor of s or t. We can therefore work with the weaker condition
(which we will denote (2)') in the rest of the proof.

Step 2. Assume that U separates s and t in (GA{s,t}∪U )] . Take a path γ between s and
t in G[ . We need to show that the path satisfies (1) or (2)’. So assume that (2)’ is
false (otherwise we are done) so that γ is included in A{s,t}∪U . We can modify γ by
removing all the central nodes in v-junctions and still keep a valid path in (GA{s,t}∪U )]
(since parents are connected in the moral graph). The remaining path must intersect
U by assumption, and this cannot be at a v-junction in γ since we have removed
them. So (1) is true.

Step 3. Conversely, assume that (1) or (2) is true for any path in G[ . Consider a path
γ in (GA{s,t}∪U )] between s and t. Any edge in γ that is not in G[ must involve parents
of a common child in A{s,t}∪U . Insert this child between the parents every time this
occurs, resulting in a v-junction added to γ. Since the added vertexes are still in
A{s,t}∪U , the new path still has no intersection with V \ A{s,t}∪U and must therefore
satisfy (1). So there must be an intersection with U without a v-junction, and since
the new additions are all at v-junctions, the intersection must have been originally
in γ, which therefore passes in U . This shows that U separates s and t in (GA{s,t}∪U )] .

Condition (2) can be further restricted to provide the notion of d-separation.


Definition 16.10 One says that two vertexes s and t in G are d-separated by a set U if
and only if any path between s and t in G[ must either

(D1) Pass at a vertex in U without a v-junction.


(D2) Pass in V \ AU with a v-junction.

Then we have:
Theorem 16.11 Two vertexes s and t in G are separated by a set U in (GA{s,t}∪U )] if and
only if they are d-separated by U .
Proof It suffices to show that if condition ((D1) or (D2)) holds for any path between
s and t in G[ , then so does ((1) or (2)). So take a path between s and t: if (D1) is true

for this path, the conclusion is obvious, since (D1) and (1) are the same. So assume
that (D1) (and therefore (1)) is false and that (D2) is true. Let u be a vertex in V \ AU
at which γ passes with a v-junction.

Assume that (2) is false. Then u must be an ancestor of either s or t. Say it is an


ancestor of s: there is a path in G going from u to s without passing through U (otherwise
u would be an ancestor of U); one can replace the portion of the old path between
s and u by this new one, which no longer passes at u with a v-junction. So
the new path still does not satisfy (D1) and must satisfy (D2). Keep on removing
all intersections with ancestors of s and t that have v-junctions, to finally obtain a
path that satisfies neither (D1) nor (D2), contradicting the fact that s and t are
d-separated by U. □
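
Theorem 16.11 yields a simple algorithmic test for d-separation: restrict G to the ancestors of {s, t} ∪ U, moralize, and check whether U separates s and t. A hedged Python sketch, reusing the moral_graph function sketched in section 16.2.1 (edges are again ordered pairs):

def ancestors(nodes, edges):
    anc, changed = set(nodes), True
    while changed:                     # closure under "parent of"
        changed = False
        for (u, v) in edges:
            if v in anc and u not in anc:
                anc.add(u)
                changed = True
    return anc

def d_separated(s, t, U, edges):
    A = ancestors({s, t} | set(U), edges)
    sub = {(u, v) for (u, v) in edges if u in A and v in A}
    moral = moral_graph(A, sub)        # moral graph of the ancestral restriction
    reached, stack = {s}, [s]          # graph search that never enters U
    while stack:
        u = stack.pop()
        for e in moral:
            if u in e:
                v = next(iter(e - {u}))
                if v not in reached and v not in U:
                    reached.add(v)
                    stack.append(v)
    return t not in reached            # blocked iff t is unreachable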

16.2.3 Chain-graph representation

The d-separability property involves both unoriented and oriented edges. It is in


fact a property of the hybrid graph in which the orientation is removed from the
edges that are not involved in a v-junction, and retained otherwise. Such graphs are
particular instances of chain graphs.

Definition 16.12 A chain graph G = (V, E, Ẽ) is composed of a finite set V of vertexes,
a set E ⊂ P_2(V) of unoriented edges, and a set Ẽ ⊂ V × V \ {(t, t) : t ∈ V} of oriented edges,
with the property that E ∩ Ẽ[ = ∅, i.e., two vertexes cannot be linked by both an oriented
and an unoriented edge.

A path in a chain graph is a sequence of vertexes s_0, ..., s_N such that, for all k ≥ 1,
s_{k-1} and s_k form an edge, meaning that either {s_{k-1}, s_k} ∈ E or (s_{k-1}, s_k) ∈ Ẽ.

A chain graph is acyclic if it contains no loop. It is semi-acyclic if it contains no loop


containing oriented edges.

We start with the following equivalence relation within vertexes in a semi-acyclic


chain graph.

Proposition 16.13 Let G = (V , E, Ẽ) be a semi-acyclic chain graph. Define the relation
s R t if and only if there exists a path in the unoriented subgraph (V , E) that links s and t.
Then R is an equivalence relation.

The proposition is obvious. This relation partitions V into equivalence classes, whose
set is denoted V_R. If S ∈ V_R, then any pair s, t in S is related by an
unoriented path, and if S ≠ S' in V_R, no elements s ∈ S and t ∈ S' can be related by
such a path.

Moreover, no path in G between two elements of S ∈ V_R can contain a directed
edge: these elements are also related by an undirected path, so this
would create a loop in G containing an oriented edge, contradicting semi-acyclicity.
So the restriction of G to S is an undirected graph.

One can define a directed graph over equivalence classes as follows. Let GR =
(VR , ER ) be such that (S, S 0 ) ∈ ER if and only if there exists s ∈ S and t ∈ S 0 such
that (s, t) ∈ Ẽ. The graph GR is acyclic: any loop in GR would induce a loop in G
containing at least one oriented edge.

We now can formally define a probability distribution on a semi-acyclic chain


graph.

Definition 16.14 Let G = (V, E, Ẽ) be a semi-acyclic chain graph. One says that a random
variable X decomposes on G if and only if (X^{(S)}, S ∈ V_R) is a Bayesian network on
G_R, and the conditional distribution of X^{(S)} given X^{(S')}, S' ∈ pa(S), is G_S-Markov, in the sense
that, for s ∈ S, P(X^{(s)} = x^{(s)} | X^{(t)}, t ∈ S, X^{(S')}, S' ∈ pa(S)) only depends on the x^{(t)} with {s, t} ∈ E
or (t, s) ∈ Ẽ.

Returning to our discussion of Bayesian networks, we have the following. Associate
to a DAG G = (V, E) the chain graph G† = (V, E†, Ẽ†) defined by: {s, t} ∈ E† if and
only if (s, t) or (t, s) is in E and is not involved in a v-junction, and (s, t) ∈ Ẽ† if (s, t) ∈ E
and is involved in a v-junction. This graph is acyclic; indeed, take any loop in G†:
when its edges are given their original orientations in E, the sequence cannot contain
a v-junction, since the orientations of v-junctions are kept in G†; the path therefore
constitutes a loop in G, which is a contradiction.

All vertexes in an equivalence class S ∈ V_R, except at most one, have all their
parents in S. Indeed, assume that two vertexes, s and t, in S have parents outside of
S. There exists an unoriented path, s_0 = s, s_1, ..., s_N = t, in G† connecting them, since
they belong to the same equivalence class. The edge at s must be oriented from s to
s_1 in G, since otherwise s_1 would be a second parent of s in G, creating a v-junction,
and the edge would have remained oriented in G†. Similarly, the last edge in the
path must be oriented from t to s_{N-1} in G. But this implies that there exists a v-junction
in the original orientation along the path, which cannot be constituted with
only unoriented edges in G†. So we get a contradiction.

Thus, random variables that decompose on G† are "Bayesian networks" over acyclic
undirected graphs, or trees, since we know these are equivalent. The root of each tree may have
multiple (vertex) parents in its parent classes in G_R. The following theorem states that
all Bayesian networks are equivalent to such a process.

Theorem 16.15 Let G = (V , E) be a DAG. The random variable X is a Bayesian network


on G if and only if it decomposes over G† .

Proof Assume that X is a Bayesian network on G. We can obviously rewrite the
probability distribution of X in the form

π(x) = ∏_{S ∈ V_R} ∏_{s ∈ S} p_s(x^{(pa(s))}, x^{(s)}).

Since every vertex in S has its parents in S or in ∪_{T ∈ pa(S)} T, this a fortiori takes the
form

π(x) = ∏_{S ∈ V_R} p_S((x^{(T)}, T ∈ S^-), x^{(S)})

(with S^- denoting pa(S) in G_R). So X^{(S)}, S ∈ V_R, is a Bayesian network. Moreover,

p_S((x^{(T)}, T ∈ S^-), x^{(S)}) = ∏_{s ∈ S} p_s(x^{(pa(s))}, x^{(s)})

is a tree distribution with the required form of the individual conditional distributions.

Now assume that X decomposes on G† . Then the conditional distribution of


X (S) given X (T ) , T ∈ pa(S) is Markov for the acyclic undirected graph GS , and can
therefore be expressed as a tree distribution consistent with the orientation of G. 

16.2.4 Markov equivalence

While the previous discussion provides a rather simple description of Bayesian net-
works in terms of chain graphs, it does not go all the way in reducing the number
of oriented edges in the definition of a Bayesian network. The issue is, in some way,
addressed by the notion of Markov equivalence, which is defined as follows.
Definition 16.16 Two directed acyclic graphs on the same set of vertexes G = (V , E) and
G̃ = (V , Ẽ) are Markov-equivalent if any family of random variables that decomposes as a
(positive) Bayesian network over one of them also decomposes as a Bayesian network over
the other.

The notion of Markov equivalence is exactly described by d-separation. This
is stated in the following theorem, due to Geiger and Pearl [77, 76], which we state
without proof.

Theorem 16.17 G and G̃ are Markov equivalent if and only if, whenever two vertexes
are d-separated by a set in one of them, the same separation holds in the other.

This property can be expressed in a strikingly simple condition. One says that a
v-junction (s, t, u) in a DAG is unlinked if s and u are not neighbors.

Theorem 16.18 G and G̃ are Markov equivalent if and only if G[ = G̃[ and G and G̃ have
the same unlinked v-junctions.
Proof Step 1. We first show that a given pair of vertexes in a DAG is unlinked if
and only if it can be d-separated by some set in the graph. Clearly, if they are linked,
they cannot be d-separated (which is the “if” part), so what really needs to be proved
is that unlinked vertexes can be d-separated. Let s and t be these vertexes and let
U = A{s,t} \ {s, t}. Then U d-separates s and t since any path between s and t in
(GA{s,t}∪U )] = (GA{s,t} )] must obviously pass in U .
Step 2. We now prove the only-if part of theorem 16.18 and therefore assume that
G and G̃ are Markov equivalent, or, as stated in theorem 16.17, that d-separation
coincides in G and G̃. We want to prove that G[ = G̃[ and unlinked v-junctions are
the same.

Step 2.1. The first statement is obvious from Step 1: d-separation determines the
existence of a link, so if d-separation coincides in the two graphs, then the same
holds for links and G[ = G̃[ .
Step 2.2. So let us proceed to the second statement and let (s, t, u) be an unlinked v-
junction in G. We want to show that it is also a v-junction in G̃ (obviously unlinked
since links coincide).
We will denote by ÃS the ancestors of some set S ⊂ V in G̃ (while AS still denotes
its ancestors in G). Let U = A{s,u} \ {s, u}. Then, as we have shown in Step 1, U
d-separates s and u in G, so that, by assumption it also d-separates them in G̃.
We know that t ∉ U, because it cannot be both a child and an ancestor of {s, u} in G
(this would induce a loop). The path (s, t, u) links s and u and does not pass in U,
which is only possible (since U d-separates s and u in G̃) if it passes in V \ Ã_U at a
v-junction: so (s, t, u) is a v-junction in G̃, which is what we wanted to prove.

Step 3. We now consider the converse statement and assume that G[ = G̃[ and un-
linked v-junctions coincide. We want to show that d-separation is the same in G and
G̃. So, we assume that U d-separates s and t in G, and we want to show that the same
is true in G̃. Thus, what we need to prove is:
Claim 1. Consider a path γ between s and t in G̃[ = G[ . Then γ either (D1) passes in
U without a v-junction in G̃, or (D2) in V \ ÃU with a v-junction in G̃.
We will prove Claim 1 using a series of lemmas. We say that γ has a three-point loop
at u if (v, u, w) are three consecutive points in γ such that v and w are linked. So
(v, u, w, v) forms a loop in the undirected graph.
Lemma 16.19 If γ is a path between s and t that does not satisfy (D2) for G and passes
in U without three-point loops, then γ satisfies (D1) for G̃.

The proof is easy: since γ does not satisfy (D2) in G, it satisfies (D1) and passes in
U without a v-junction in G. But this intersection cannot be a v-junction in G̃, since
it would otherwise have to be linked and constitute a three-point loop in γ, which
proves that (D1) is true for γ in G̃.
The next step is to remove the three-point loop condition in lemma 16.19. This will
be done using the next two results.
Lemma 16.20 Let γ be a path with a three-point loop at u ∈ U for G. Assume that γ \ u
(which is a valid path in G[ ) satisfies (D1) or (D2) in G̃. Then γ satisfies (D1) or (D2) in
G̃.

To prove the lemma, let v and w be the predecessor and successor of u in γ. First
assume that γ \ u satisfies (D1) in G̃. If this does not happen at v or at w, then it
also applies to γ and we are done, so let us assume that v ∈ U and that (v', v, w) is
not a v-junction in G̃, where v' is the predecessor of v. If (v', v, u) is not a v-junction
in G̃, then (D1) is true for γ in G̃. If it is a v-junction, then (v, u, w) is not, and (D1) is
true too.

Assume now that (D2) is true for γ \ u in G̃. Again, there is no problem if (D2) occurs
at some point other than v or w, so let us consider the case in which it happens at
v. This means that v ∉ Ã_U and (v', v, w) is a v-junction. But, since u ∈ U, the link
between u and v must be oriented from u to v in G̃, so that there is no v-junction at u and (D1)
is true in G̃. This proves lemma 16.20.
Lemma 16.21 Let γ be a path with a three-point loop at u ∈ U for G. Assume that γ
does not satisfy (D2) in G. Then γ \ u does not satisfy this property either.

Let us assume that γ \ u satisfies (D2) and reach a contradiction. Letting (v, u, w) be
the three-point loop, (D2) can only happen in γ \ u at v or w; let us assume that
it happens at v, so that, v' being the predecessor of v, (v', v, w) is a v-junction in G
with v ∉ A_U. Since v ∉ A_U, the link between u and v in G must be oriented from u to v, but
this implies that (v', v, u) is a v-junction in G with v ∉ A_U, which is a contradiction:
this proves lemma 16.21.
The previous three lemmas directly imply the next one.
Lemma 16.22 If γ is a path between s and t that does not satisfy (D2) for G, then γ
satisfies (D1) or (D2) for G̃.

Indeed, if we start with γ that does not satisfy (D2) for G, lemma 16.21 allows us
to progressively remove three-point loops from γ until none remains with a final
path that satisfies the assumptions of lemma 16.19 and therefore satisfies (D1) in G̃,
and lemma 16.20 allows us to add the points that we have removed in reverse order
while always satisfying (D1) or (D2) in G̃.
We now partially relax the hypothesis that (D2) is not satisfied with the next lemma.
Lemma 16.23 If γ is a path between s and t that does not pass in V \ AU at a linked
v-junction for G, then γ satisfies (D1) or (D2) for G̃.

Assume that γ does not satisfy (D2) for G̃ (otherwise the result is proved). By
lemma 16.22, γ must satisfy (D2) for G. So, take an intersection of γ with V \ A_U that
occurs at a v-junction in G, which we will denote (v, u, w). This is still a v-junction in
G̃, since we assume it to be unlinked. Since (D2) is false in G̃, we must have u ∈ Ã_U,
and there is an oriented path, τ, from u to U in G̃.

We can assume that τ has no v-junction in G. Indeed, if a v-junction exists in τ, it
must be linked (otherwise it would also be a v-junction in G̃ and contradict
the fact that τ is consistently oriented in G̃), and this link must be oriented
from u to U in G̃ to avoid creating a loop in this graph. This implies that we can
bypass the v-junction while keeping a consistently oriented path in G̃, and iterate
this until τ has no v-junction in G. But this implies that τ is consistently oriented in
G, necessarily from U to u since u ∉ A_U.

Denote τ = (u_0 = u, u_1, ..., u_n), with u_n ∈ U. We now prove by induction that each (v, u_k, w) is
an unlinked v-junction. This is true when k = 0; assume it is true for
k - 1. Then (u_k, u_{k-1}, v) is a v-junction in G but not in G̃: so it must be linked, and
there exists an edge between v and u_k. In G̃, this edge must be oriented from v to u_k,
since (v, u_{k-1}, u_k, v) would otherwise form a loop. For the same reason, there must be
an edge in G̃ from w to u_k, so that (v, u_k, w) is an unlinked v-junction.

Since this is true for k = n, we can replace u by u_n in γ and still obtain a valid path.
This can be done for all intersections of γ with V \ A_U that occur at v-junctions. This
finally yields a path (denote it γ̄) which does not satisfy (D2) in G anymore, and
which therefore satisfies (D1) or (D2) in G̃: so γ̄ must either pass in U without a v-junction,
or in V \ Ã_U at a v-junction. None of the nodes that were modified can satisfy either of
these conditions, since they are all in U with a v-junction, so the result is true
for the original γ also. This proves lemma 16.23.
So the only unsolved case is when γ is allowed to pass in V \ A_U at linked v-junctions.
We define an algorithm that removes them as follows. Let γ_0 = γ and let γ_k be the
path after step k of the algorithm. One passes from γ_k to γ_{k+1} as follows.

• If γ_k has no linked v-junction in V \ A_U for G, stop.

• Otherwise, pick such a v-junction and let (v, u, w) be the three nodes involved in it.

(i) If v ∈ U, v' ∉ U and (v', v, u) is a v-junction in G̃, remove v from γ_k to define γ_{k+1}.

(ii) Otherwise, if w ∈ U, w' ∉ U and (u, w, w') is a v-junction in G̃, remove w from γ_k to define γ_{k+1}.

(iii) Otherwise, remove u from γ_k to define γ_{k+1}.

None of the considered cases can disconnect the path. This is clear for case (iii), since
v and w are linked. For case (i), note that, in G, (v', v, u) cannot be a v-junction since
(v, u, w) is one. This implies that the v-junction in G̃ must be linked and that v' and
u are connected.

The algorithm will stop at some point with some γ_n that has no linked v-junction
in V \ A_U anymore, which implies that (D1) or (D2) is true in G̃ for γ_n. To
prove that this statement holds for γ, it suffices to show that, at each step of the
algorithm, if (D1) or (D2) is true in G̃ for γ_{k+1}, then it was true for γ_k. So let us
assume that γ_{k+1} satisfies (D1) or (D2) in G̃.

First assume that we passed from γ_k to γ_{k+1} via case (iii). Assume that (D2) is true
for γ_{k+1}, with, as usual, the only interesting case being when this occurs at v or w.
Assume it occurs at v, so that (v', v, w) is a v-junction and v ∉ Ã_U. If (v', v, u) is a
v-junction, then (D2) is true with γ_k. Otherwise, there is an edge from v to u in G̃,
which also implies an edge from w to u, since (v, u, w, v) would be a loop otherwise.
So (v, u, w) is a v-junction in G̃, and u cannot be in Ã_U, since its parent v would then be in
that set also. So (D2) is true in G̃. Now, assume that (D1) is true at v, so that (v', v, w)
is not a v-junction and v ∈ U. If (v', v, u) is not a v-junction either, we are done, so
assume the contrary. If v' ∈ U, then we cannot have a v-junction at v' and (D1) is
true. But v' ∉ U is not possible, since this leads to case (i).

Now assume that we passed from γ_k to γ_{k+1} via case (i). Assume that (D1) is true for
γ_{k+1}: this cannot be at v' since v' ∉ U, nor at u since u ∉ A_U, so it was also true
for γ_k. The same statement holds with (D2), since (v', v, u) is a v-junction in G̃ with
v ∈ U, which implies that both v' and u are in Ã_U. Case (ii) is obviously addressed
similarly.

With this, the proof of theorem 16.18 is complete. 

16.2.5 Probabilistic inference: Sum-prod algorithm

We now discuss the issue of using the sum-prod algorithm to compute marginal
probabilities, P(X (s) = x(s) ) for s ∈ V when X is a Bayesian network on G = (V , E). By
definition, P(X = x) can be written in the form
P(X = x) = ∏_{C ∈ C} ϕ_C(x^{(C)})

where C contains all subsets Cs := {s} ∪ pa(s), s ∈ V . Marginal probabilities can there-
fore be computed easily when the factor graph associated to C is acyclic, according
to proposition 15.13. However, because of the specific form of the ϕC ’s (they are
conditional probabilities), the sum-prod algorithm can be analyzed in more detail,
and provide correct results even when the factor graph is not acyclic.

The general rules for the sum-prod algorithm are

m_{sC}(x^{(s)}) ← ∏_{C̃ : s ∈ C̃, C̃ ≠ C} m_{C̃s}(x^{(s)}),

m_{Cs}(x^{(s)}) ← ∑_{y^{(C)} : y^{(s)} = x^{(s)}} ϕ_C(y^{(C)}) ∏_{t ∈ C\{s}} m_{tC}(y^{(t)}).

They take a particular form for Bayesian networks, using the fact that a vertex s
belongs to C_s, and to all C_t for t ∈ ch(s):

m_{sC_s}(x^{(s)}) ← ∏_{t ∈ ch(s)} m_{C_t s}(x^{(s)}),

m_{sC_t}(x^{(s)}) ← m_{C_s s}(x^{(s)}) ∏_{u ∈ ch(s), u ≠ t} m_{C_u s}(x^{(s)}),   for t ∈ ch(s),

m_{C_s s}(x^{(s)}) ← ∑_{y^{(C_s)} : y^{(s)} = x^{(s)}} p_s(y^{(pa(s))}, x^{(s)}) ∏_{t ∈ pa(s)} m_{tC_s}(y^{(t)}),

m_{C_t s}(x^{(s)}) ← ∑_{y^{(C_t)} : y^{(s)} = x^{(s)}} p_t(x^{(s)} ∧ y^{(pa(t)\{s})}, y^{(t)}) m_{tC_t}(y^{(t)}) ∏_{u ∈ pa(t), u ≠ s} m_{uC_t}(y^{(u)}),
   for t ∈ ch(s).

These relations imply that, if pa(s) = ∅ (s is a root), then mCs s = ps (x(s) ). Also, if
ch(s) = ∅ (s is a leaf) then msCs = 1. The following proposition shows that many of the
messages become constant over time.

Proposition 16.24 All upward messages, msCs and mCt s with t ∈ ch(s) become constant
(independent from x(s) ) in finite time.

Proof This can be shown recursively as follows. Assume that, for a given s, m_{tC_t} is
constant for all t ∈ ch(s) (this is true if s is a leaf). Then,

m_{C_t s}(x^{(s)}) ← ∑_{y^{(C_t)} : y^{(s)} = x^{(s)}} p_t(x^{(s)} ∧ y^{(pa(t)\{s})}, y^{(t)}) m_{tC_t}(y^{(t)}) ∏_{u ∈ pa(t), u ≠ s} m_{uC_t}(y^{(u)})
 = m_{tC_t} ∑_{y^{(C_t)} : y^{(s)} = x^{(s)}} p_t(x^{(s)} ∧ y^{(pa(t)\{s})}, y^{(t)}) ∏_{u ∈ pa(t), u ≠ s} m_{uC_t}(y^{(u)})
 = m_{tC_t} ∑_{y^{(C_t \ {t})} : y^{(s)} = x^{(s)}} ∏_{u ∈ pa(t), u ≠ s} m_{uC_t}(y^{(u)})
 = m_{tC_t} ∏_{u ∈ pa(t), u ≠ s} ∑_{y^{(u)}} m_{uC_t}(y^{(u)}),

which is constant. Now,

m_{sC_s}(x^{(s)}) ← ∏_{t ∈ ch(s)} m_{C_t s}(x^{(s)})

is also constant. This proves that all m_{sC_s} progressively become constant, and, as we
have just seen, this implies the same property for the m_{C_t s}, t ∈ ch(s). □

This proposition implies that, if initialized with constant messages (or after a
finite time), the sum-prod algorithm iterates

m_{sC_s} ← ∏_{t ∈ ch(s)} m_{C_t s},

m_{C_s s}(x^{(s)}) ← ∑_{y^{(C_s)} : y^{(s)} = x^{(s)}} p_s(y^{(pa(s))}, x^{(s)}) ∏_{t ∈ pa(s)} m_{tC_s}(y^{(t)}),

m_{sC_t}(x^{(s)}) ← m_{C_s s}(x^{(s)}) ∏_{u ∈ ch(s), u ≠ t} m_{C_u s},   t ∈ ch(s),

m_{C_t s} ← m_{tC_t} ∏_{u ∈ pa(t), u ≠ s} ∑_{y^{(u)}} m_{uC_t}(y^{(u)}),   t ∈ ch(s).

From this expression, we can conclude

Proposition 16.25 If the previous algorithm is first initialized with all upward messages
m_{sC_s} and m_{C_t s} equal to 1, and if downward messages are computed top down from the
roots to the leaves, the obtained configuration of messages is invariant for the sum-prod
algorithm.

Proof If all upward messages are equal to 1, then clearly, the downward messages
sum to 1 once they are updated from roots to leaves, and this implies that the upward
messages will remain equal to 1 for the next round. The obtained configuration is
invariant since the downward messages are recursively uniquely defined by their
value at the roots. 

The downward messages, under the previous assumptions, satisfy m_{sC_t}(x^{(s)}) = m_{C_s s}(x^{(s)})
for all t ∈ ch(s), and therefore

m_{C_s s}(x^{(s)}) = ∑_{y^{(C_s)} : y^{(s)} = x^{(s)}} p_s(y^{(pa(s))}, x^{(s)}) ∏_{t ∈ pa(s)} m_{C_t t}(y^{(t)}).     (16.5)

Note that the associated "marginals" inferred by the sum-prod algorithm are

σ_s(x^{(s)}) = ∏_{C : s ∈ C} m_{Cs}(x^{(s)}) = m_{C_s s}(x^{(s)})

since m_{C_t s}(x^{(s)}) = 1 when t ∈ ch(s).

Although the sum-prod algorithm initialized with unit messages converges to a


stable configuration if run top-down, the obtained σs ’s do not necessarily provide
the correct single site marginals. There is a situation for which this is true, however,
which is when the initial directed graph is singly connected, as we will see below.

Before this, let us analyze the complexity resulting from an iterative computation of
the marginal probabilities, similar to what we have done with trees.

We define the depth of a vertex in G as follows.


Definition 16.26 Let G = (V, E) be a DAG. The depth of a vertex s in V is defined recursively
by

- depth(s) = 0 if s has no parent;

- depth(s) = 1 + max { depth(t) : t ∈ pa(s) } otherwise.
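
This definition can be evaluated by a short memoized recursion, sketched below (parents maps each vertex to its set of parents; the helper name is an assumption):

from functools import lru_cache

def make_depth(parents):
    @lru_cache(maxsize=None)           # memoization keeps this linear in |E|
    def depth(s):
        pa = parents[s]
        return 0 if not pa else 1 + max(depth(t) for t in pa)
    return depth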

The recursive computation of marginal distributions is made possible (although


not always feasible) with the following remark.
Lemma 16.27 Let X be a Bayesian network on the DAG G = (V, E), and let S ⊂ V be such
that all elements in S have the same depth. Let pa(S) be the set of parents of elements
in S, and T = depth^-(S) the set of vertexes in V with depth strictly smaller than the
depth of S. Then (X^{(S)} ⊥ X^{(T \ pa(S))} | X^{(pa(S))}), and the variables X^{(s)}, s ∈ S, are conditionally
independent given X^{(pa(S))}.
Proof It suffices to show that vertexes in S are separated from T \ pa(S) and from
other elements of S by pa(S) for the graph (GS∪T )] . Any path starting at s ∈ S must
either pass by a parent of s (which is what we want), or by one of its children, or
by another vertex that shares a child with s in GS∪T . But s cannot have any child
in GS∪T , since this child cannot have a smaller depth than s, and it cannot be in S
either since all elements in S have the same depth. 

This lemma allows us to work recursively as follows. Assume that we can compute
marginal distributions over sets S with maximal depth no larger than d. Take a set
S of maximal depth d + 1, and let S_0 be the set of elements of depth d + 1 in S. Then,
letting T = depth^-(S) = depth^-(S_0) and S_1 = S \ S_0,

P(X^{(S)} = x^{(S)}) = ∑_{y^{(T\S_1)}} P(X^{(S_0)} = x^{(S_0)} | X^{(T)} = y^{(T\S_1)} ∧ x^{(S_1)}) P(X^{(T)} = y^{(T\S_1)} ∧ x^{(S_1)})
 = ∑_{y^{(pa(S_0)\S_1)}} ∏_{s ∈ S_0} p_s((y ∧ x)^{(pa(s))}, x^{(s)}) P(X^{(pa(S_0)∪S_1)} = y^{(pa(S_0)\S_1)} ∧ x^{(S_1)}).     (16.6)

Since pa(S_0) ∪ S_1 has maximal depth strictly smaller than the maximal depth of S, this
indeed provides a recursive formula for the computation of marginals over subsets
of V with increasing maximal depths. However, because one needs to add parents
to the considered set when reducing the depth, one may end up having to compute
marginals over very large sets, which becomes intractable without further assumptions.

A way to reduce the complexity is to assume that the graph G is singly connected,
as defined below.

Definition 16.28 A DAG G is singly connected if there exists at most one path in G that
connects any two vertexes.

Such a property is true for a tree, but also holds for some networks with multiple
parents. We have the following nice property in this case.

Proposition 16.29 Let G be a singly connected DAG and X a Bayesian network on G. If


s is a vertex in G, the variables (X (t) , t ∈ pa(s)) are mutually independent.

Proof We have, using proposition 16.5,

P(X^{(pa(s))} = x^{(pa(s))}) = ∑_{y^{(A_{pa(s)})} : y^{(pa(s))} = x^{(pa(s))}} ∏_{u ∈ A_{pa(s)}} p_u(y^{(pa(u))}, y^{(u)}).

Because the graph is singly connected, two parents of s cannot have a common ancestor
(since there would then be two paths from this ancestor to s). So A_{pa(s)} is the
disjoint union of the A_t's for t ∈ pa(s), and we can write

P(X^{(pa(s))} = x^{(pa(s))}) = ∑_{y^{(A_{pa(s)})} : y^{(pa(s))} = x^{(pa(s))}} ∏_{t ∈ pa(s)} ∏_{u ∈ A_t} p_u(y^{(pa(u))}, y^{(u)})
 = ∏_{t ∈ pa(s)} ∑_{y^{(A_t)} : y^{(t)} = x^{(t)}} ∏_{u ∈ A_t} p_u(y^{(pa(u))}, y^{(u)})
 = ∏_{t ∈ pa(s)} P(X^{(t)} = x^{(t)}).

This proves the proposition. □

The discussion in section 16.2.5 simplifies under the assumption of a singly connected
graph, at least for the computation of single-vertex marginals; if s ∈ V
and G is singly connected,

P(X^{(s)} = x^{(s)}) = ∑_{y^{(pa(s))}} p_s(y^{(pa(s))}, x^{(s)}) ∏_{t ∈ pa(s)} P(X^{(t)} = y^{(t)}).     (16.7)

This is now recursive in single vertex marginal probabilities. It moreover coincides


with the recursive equation that defines the messages mCs s in (16.5), which shows
that the sum-prod algorithm provides the correct answer in this case.
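
A direct implementation of the recursion (16.7) for finite state spaces is sketched below (cond[s] maps an ordered tuple of parent values to the dictionary p_s( · ), values[s] lists F_s, and parents[s] is an ordered tuple; all assumed representations):

from itertools import product

def marginal(s, parents, cond, values, cache=None):
    cache = {} if cache is None else cache
    if s not in cache:
        pa = parents[s]
        marg = {v: 0.0 for v in values[s]}
        for ys in product(*(values[t] for t in pa)):
            w = 1.0                    # product of parent marginals, which are
            for t, y in zip(pa, ys):   # independent by proposition 16.29
                w *= marginal(t, parents, cond, values, cache)[y]
            for v, p in cond[s](ys).items():
                marg[v] += p * w
        cache[s] = marg
    return cache[s]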

16.2.6 Conditional probabilities and interventions

One of the main interests of graphical models is the ability to infer the behavior
of hidden variables of interest given other, observed, variables. When dealing
with oriented graphs, however, the way this should be analyzed is ambiguous.

Let us consider an example, provided by the graph in fig. 16.1, in which the two
nodes "Bad weather" and "Broken HVAC" both point to the node "No school".

Figure 16.1: Example of causal graph.

The Bayesian network interpretation of this graph is that both events (which may be true or false),
"Bad weather" and "Broken HVAC", happen first, and that they are independent.
Then, given their observation, the "No school" event may occur, more likely so
if the weather is bad or the HVAC is broken, and even more likely
if both happened at the same time.

Now consider the following passive observation: you wake up, you have not yet checked
the weather or the news, and someone tells you that there is no school today.
Then you may infer that there is a greater chance than usual of bad weather, or of a
broken HVAC at school. Conditionally on this information, these two events become
correlated, even if they were initially independent. So, even if the "No school" event
is considered as a probabilistic consequence of its parents, observing it influences
our knowledge of them.

Now, here is an intervention, or manipulation: the school superintendent has
decided that enough snow days have been given for the year, and has declared that there
will be school today whatever happens. So you know that the "no-school" event
will not happen. Does this change the risk of bad weather or broken HVAC? Obviously
not: an intervention on a node does not affect the distribution of its parents.

Manipulation and passive observation are two very different ways of affecting
unobserved variables in Bayesian networks. Both of them may be relevant in applications.
Of the two, the simpler to analyze is intervention, since it merely consists
in clamping one of the variables while leaving the rest of the network dynamics unchanged.
This leads to the following formal definition of manipulation.

Definition 16.30 Let G = (V, E) be a directed acyclic graph and X a Bayesian network on
G. Let S be a subset of V and x^{(S)} ∈ F_S a given configuration on S. Then the manipulated
distribution of X with fixed values x^{(S)} on S is the Bayesian network on the restricted
graph G_{S^c}, with the same conditional probabilities, using the value x^{(s)} every time a vertex
s ∈ S is a parent of t ∈ V \ S in G.

So, if the distribution of X is given by (16.2), then its distribution after manipulation
on S is

π̃(y^{(V\S)}) = ∏_{t ∈ V\S} p_t(y^{(pa(t))}, y^{(t)})

where pa(t) is the set of parents of t in G, and y^{(s)} = x^{(s)} whenever s ∈ pa(t) ∩ S.
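
Operationally, sampling from the manipulated distribution is just ancestral sampling with the clamped values substituted wherever a parent lies in S, as in this hedged sketch (reusing the helper names of the ancestral-sampling sketch of section 16.1):

def sample_manipulated(vertices, parents, cond, clamped):
    # clamped: {s: x^(s)} for s in S; these vertexes are never re-sampled
    x = dict(clamped)
    for s in topological_order(vertices, parents):
        if s in clamped:
            continue
        table = cond[s](tuple(x[t] for t in parents[s]))
        values, probs = zip(*table.items())
        x[s] = random.choices(values, weights=probs)[0]
    return x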

The distribution of a Bayesian network X after passive observation X (S) = x(S) is


not so easily described. It is obviously the conditional distribution P (X (V \S) = y (V \S) |
X (S) = x(S) ) and therefore requires using the conditional dependency structure, in-
volving the moral graph and/or d-separation.

Let us discuss this first in the simpler case of trees, for which the moral graph is
the undirected acyclic graph underlying the tree, and d-separation is simple separation
on this acyclic graph. We can then use proposition 14.22 to understand the new
structure after conditioning: it is a G[_{V\S}-Markov random field, and, for t ∈ V \ S, the
conditional distribution of X^{(t)} = y^{(t)} given its neighbors is the same as before, using
the value x^{(s)} when s ∈ S. But note that, when doing this (passing to G[), we broke the
causality relation between the variables. We can however always go back to a tree (or a
forest, since connectedness may have been broken) with the edges oriented as
they initially were, but this requires reconstituting the edge joint probabilities from
the new acyclic graph, and therefore using (acyclic) belief propagation.

With general Bayesian networks, we know that the moral graph can be loopy and
therefore a source of difficulties. The following proposition states that the damage
is circumscribed to the ancestors of S.

Proposition 16.31 Let G = (V, E) be a directed acyclic graph, X a Bayesian network
on G, S ⊂ V and x^{(A_S)} ∈ F(A_S). Then the conditional distribution of X^{(A_S^c)} given
X^{(A_S)} = x^{(A_S)} coincides with the manipulated distribution of definition 16.30.

Proof The conditional distribution is proportional to

∏_{s ∈ V} p_s(y^{(pa(s))}, y^{(s)})

with y^{(t)} = x^{(t)} if t ∈ A_S. Since s ∈ A_S implies pa(s) ⊂ A_S, all factors with s ∈ A_S are
constant and can be factored out by the normalization. So the conditional
distribution is proportional to

∏_{s ∈ A_S^c} p_s(y^{(pa(s))}, y^{(s)})

with y^{(t)} = x^{(t)} if t ∈ A_S. But we know that such products sum to 1, so that the
conditional distribution is equal to this expression and therefore provides a Bayesian
network on G_{A_S^c}. □

16.3 Structural equation models

Structural equation models (SEMs) provide an alternative (and essentially equivalent)
formulation of Bayesian networks, which may be more convenient to use, especially
when dealing with variables taking values in general state spaces.

Let G = (V, E) be a directed acyclic graph. SEMs are associated with families of
functions Φ^{(s)} : F(pa(s)) × B_s → F_s and random variables ξ^{(s)} : Ω → B_s (where B_s is
some measurable set), for s ∈ V. The random field X : Ω → F(V) associated with the
SEM satisfies the equations

X^{(s)} = Φ^{(s)}(X^{(s-)}, ξ^{(s)}),     (16.8)

where X^{(s-)} denotes X^{(pa(s))}. Because of the DAG structure, these equations uniquely define X once ξ is specified.
As a consequence, there exists a function Ψ such that X = Ψ(ξ).

The model is therefore fully specified by the functions Φ (s) and the probability
distributions of the variables ξ (s) . We will assume that they have a density, denoted
g (s) , s ∈ V , with respect to some measure µs on Bs . They are typically chosen as
uniform distributions on Bs (continuous and compact, or discrete) or as standard
Gaussian when Bs = Rds for some ds . One also generally assumes that the variables
(ξ (s) , s ∈ V ) are jointly independent, and we make this assumption below.

Let V_k, k ≥ 0, be the set of vertexes in V with depth k (cf. definition 16.26) and
V_{<k} = V_0 ∪ ... ∪ V_{k-1}. Then, using the independence of (ξ^{(s)}, s ∈ V), for s ∈ V_k, the conditional
distribution of X^{(s)} given X^{(V_{<k})} = x^{(V_{<k})} is the distribution of Φ^{(s)}(x^{(s-)}, ξ^{(s)}).
Formally this is given by

Φ^{(s)}(x^{(s-)}, ·)_♯ (g^{(s)} µ_s),

the pushforward of the distribution of ξ^{(s)} by Φ^{(s)}(x^{(s-)}, ·).

More concretely, assume that ξ^{(s)} follows a uniform distribution on B_s = [0, 1]^h for
some h, and assume that F_s is finite for all s. Then,

P(X^{(s)} = x^{(s)} | X^{(V_{<k})} = x^{(V_{<k})}) = Volume(U_s(x^{(pa(s))}, x^{(s)})) = p_s(x^{(pa(s))}, x^{(s)})

where

U_s(x^{(pa(s))}, x^{(s)}) = { ξ ∈ [0, 1]^h : Φ^{(s)}(x^{(s-)}, ξ) = x^{(s)} }.

Since the variables X^{(s)}, s ∈ V_k, are conditionally independent given X^{(V_{<k})}, we find that
X decomposes as a Bayesian network over G,

P(X = x) = ∏_{s ∈ V} p_s(x^{(pa(s))}, x^{(s)}).

Similarly, if F_s = B_s = R^{d_s}, ξ^{(s)} ∼ N(0, Id_{R^{d_s}}), and ξ^{(s)} ↦ Φ_θ^{(s)}(x^{(pa(s))}, ξ^{(s)}) is invertible,
with C^1 inverse x^{(s)} ↦ Ψ_θ^{(s)}(x^{(pa(s))}, x^{(s)}), then X is a Bayesian network with continuous
variables and, using the change of variables formula, the conditional distribution
of X^{(s)} given X^{(pa(s))} = x^{(pa(s))} has p.d.f.

p_s(x^{(pa(s))}, x^{(s)}) = (1/(2π)^{d_s/2}) exp( -(1/2) |Ψ_θ^{(s)}(x^{(pa(s))}, x^{(s)})|^2 ) |det(∂_{x^{(s)}} Ψ_θ^{(s)}(x^{(pa(s))}, x^{(s)}))|.

A simple and commonly used special case of this construction are linear SEMs, with

X^{(s)} = a_s + b_s^T X^{(s-)} + σ_s ξ^{(s)}.

In this case, the inverse mapping is immediate and the Jacobian determinant in the
change of variables is 1/σ_s^{d_s}.
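
For concreteness, sampling from a linear SEM in topological order can be sketched as follows (the scalar case, with coefficient containers a, b, sigma that are assumptions for the example; for a root, parents[s] and b[s] are empty):

import numpy as np

def sample_linear_sem(order, parents, a, b, sigma, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x = {}
    for s in order:                    # `order` is a topological order of G
        pa = np.array([x[t] for t in parents[s]])
        # X^(s) = a_s + b_s^T X^(s-) + sigma_s xi^(s), with xi^(s) ~ N(0, 1)
        x[s] = a[s] + b[s] @ pa + sigma[s] * rng.standard_normal()
    return x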
Chapter 17

Latent Variables and Variational Methods

17.1 Introduction

We will describe, in the next chapters, methods that fit a parametric model to
observations while introducing unobserved, or "latent," components in their models.
Such latent components typically attach interpretable information or structure to
the data. We have seen one such example in the form of the Gaussian mixture model
of chapter 4, which we will revisit in chapter 20. We now present the
variational Bayes paradigm, which provides a general strategy to address latent
variable problems [144, 97, 14, 100].

The general framework is as follows. Variables in the model are divided into two
groups: the observable part, which we denote X, and the latent part, denoted Z. In
many models, Z represents some unobservable structure such that, conditionally on
Z, X has some relatively simple distribution (in a Bayesian estimation context, Z often
contains model parameters). The quantity of interest, however, is the conditional
distribution of Z given X (also called the “posterior distribution”), which allows one
to infer the latent structure from the observations, and will also have an important
role in maximum likelihood parametric estimation, as we will see below. This condi-
tional distribution is not always easy to compute or simulate, and variational Bayes
provides a framework under which it can be approximated.

17.2 Variational principle

We consider a pair of random variables X and Z, where X is considered as “ob-


served” and Z is hidden, or “latent”. We will use U = (X, Z) to denote the two
variables taken together. We denote as usual by PU the probability law of U , defined
on RU = RX × RZ by PU (A) = P(U ∈ A). We will also assume that there exists a
measure µ on RU that decomposes as a product measure µ = µX × µZ (where µX and


µ_Z are measures on R_X and R_Z), such that P_U ≪ µ (P_U is absolutely continuous with


respect to µ). This implies that PU has a density with respect to µ that we will denote
fU . If both RX and RZ are discrete, µ is typically the counting measure, and if they
are both Euclidean spaces, µ can be the Lebesgue measure on the product.1

The variables X and Z then have probability density functions with respect to µ_X
and µ_Z, given by

f_X(x) = ∫_{R_Z} f_U(x, z) µ_Z(dz)   and   f_Z(z) = ∫_{R_X} f_U(x, z) µ_X(dx).

The conditional distribution of X given Z = z, denoted PX ( · | Z = z), has density


fX (x | z) = fU (x, z)/fZ (z) with respect to µX and that of Z given X = x, denoted PZ ( · |
X = x), has density fZ (z | x) = fU (x, z)/fX (x) with respect to µZ . We will be mainly
interested by approximations of PZ ( · | X = x), assuming that PZ and PX ( · | Z = z) (and
hence PU ) are easy to compute or simulate.

We will use the Kullback-Leibler divergence to quantify the accuracy of the approximation.
As stated in proposition 4.1, we have

P_Z( · | X = x) = argmin_{ν ∈ M_1(R_Z)} KL(ν ∥ P_Z( · | X = x))

where M1 (RZ ) denotes the set of all probability distributions on RZ . Note that all
distributions ν for which KL(ν k PZ (· | X = x)) is finite must be absolutely continuous
with respect to µZ and therefore take the form ν = gµZ . One has
KL(g µ_Z ∥ P_Z( · | X = x)) = ∫_{R_Z} log( g(z) / f_Z(z|x) ) g(z) µ_Z(dz)
 = ∫_{R_Z} log( g(z) / f_U(x, z) ) g(z) µ_Z(dz) + log f_X(x).     (17.1)

We will denote by P(µ_Z), or just P when there is no ambiguity, the set of all p.d.f.'s g
with respect to µ_Z, i.e., the set of all non-negative measurable functions g on R_Z with
∫_{R_Z} g(z) µ_Z(dz) = 1.
Z

The basic principle of variational Bayes methods is to replace P by a subset P̂
and to define the approximation

P̂_Z( · | X = x) = argmin_{g ∈ P̂} KL(g µ_Z ∥ P_Z( · | X = x)).

1 The reader unfamiliar with measure theory may want to read this discussion by replacing dµ_X
by dx, dµ_Z by dz and dµ_U by dx dz, i.e., in the context of continuous probability distributions having
p.d.f.'s with respect to Lebesgue measure.

For the approximation to be useful, the set P̂ must obviously be chosen so that
the computation of P̂_Z( · | X = x) is feasible. We now review a few
examples, before passing to the EM algorithm and its approximations.

17.3 Examples

17.3.1 Mode approximation

Assume that R_Z is discrete and µ_Z is the counting measure, so that

KL(g µ_Z ∥ P_Z( · | X = x)) − log f_X(x) = ∑_{z ∈ R_Z} log( g(z) / f_U(x, z) ) g(z),

the sum being infinite if there exists z such that g(z) > 0 and f_U(x, z) = 0. Take

P̂ = {1_z : z ∈ R_Z},

the family of all Dirac functions on R_Z. Then,

KL(1_z ∥ P_Z( · | X = x)) − log f_X(x) = − log f_U(x, z).

The variational approximation of P_Z( · | X = x) over P̂ is therefore the Dirac measure
at the point(s) z ∈ R_Z at which f_U(x, z) is largest, i.e., the mode(s) of the posterior
distribution. This approximation is often called the MAP approximation (for maximum a
posteriori).

If R_Z is, say, R^q and µ_Z = dz is Lebesgue measure, then the previous construction
does not work, because 1_z is not a p.d.f. with respect to µ_Z. Instead of Dirac functions,
one can however use constant functions on small balls. Let B(z, ε) denote the
open ball with center z and radius ε, and let |B(z, ε)| denote its volume. Let u_{z,ε} = 1_{B(z,ε)} / |B(z, ε)|.
Fixing ε, we can consider the set

P̂ = { u_{z,ε} : z ∈ R^q }.

Now, one has (leaving the computation to the reader)

KL(u_{z,ε} dz ∥ P_Z( · | X = x)) − log f_X(x) = − log( (1/|B(z, ε)|) ∫_{B(z,ε)} f_U(x, z') dz' ).

The limit for small ε (assuming that f_U(x, ·) is continuous at z, or defining the limit
up to sets of measure zero) is − log f_U(x, z), justifying again choosing the mode of the
posterior distribution of Z for the approximation.

The mode approximation has some limitations. First, it is in general a very crude
approximation of the posterior distribution. Second, even when
f_U has a closed form, this p.d.f. is often difficult to maximize (for example, for models
defined over large discrete sets). In such cases, the mode approximation has
limited practical use.

17.3.2 Gaussian approximation

Let us still assume that R_Z = R^q and that µ_Z = dz. Let P̂ be the family of all Gaussian
distributions N(m, Σ) on R^q. Then, denoting by ϕ( · ; m, Σ) the density of N(m, Σ),

KL(ϕ( · ; m, Σ) ∥ P_Z( · | X = x)) − log f_X(x) = −(q/2) log 2π − q/2 − (1/2) log det(Σ)
 − ∫_{R^q} log f_U(x, z) ϕ(z; m, Σ) dz.

In order to provide the best approximation, m and Σ must therefore maximize

∫_{R^q} log f_U(x, z) ϕ(z; m, Σ) dz + (1/2) log det(Σ).     (17.2)
The resulting optimization problem does not have a closed-form solution in general
(see section 19.2.2 for an example in which stochastic gradient methods are used to
solve this problem). Another approach that is commonly used in practice is to push
the approximation further by replacing log f_U(x, z) by its second-order expansion
around its maximum as a function of z. Let m(x) be the posterior mode, i.e., the
value of z at which z ↦ log f_U(x, z) is maximal, which we will assume to be unique.
Let H(x) denote the q × q Hessian matrix formed by the second partial derivatives
of − log f_U(x, z) (with respect to z) at z = m(x). This matrix is positive semidefinite
by the choice made for m(x), and we will assume that it is positive definite.
Since the first derivatives of log f_U(x, z) at m(x) must vanish, we have the expansion:

log f_U(x, z) = log f_U(x, m(x)) − (1/2) (z − m(x))^T H(x) (z − m(x)) + · · ·
Plugging the expansion into the integral in (17.2) yields, up to an additive constant,

−(1/2) trace(H(x)Σ) − (1/2) (m − m(x))^T H(x) (m − m(x)) + (1/2) log det Σ.

To maximize this expression, one must clearly take m = m(x). Moreover,

∂_Σ (−trace(H(x)Σ) + log det Σ) = −H(x)^T + (Σ^T)^{-1} = −H(x) + Σ^{-1},

and we see that one must take Σ = H(x)^{-1}. This provides the Laplace approximation
[62] of the posterior, N(m(x), H(x)^{-1}), which is practical when the mode and
corresponding second derivatives are feasible to compute.
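
A purely numerical sketch of this construction (finite differences throughout so that it stays self-contained; the step size and the crude gradient ascent used to locate the mode are illustrative choices, not from the text):

import numpy as np

def laplace_approximation(log_fU, z0, step=1e-4, n_iter=2000, lr=1e-2):
    z = np.asarray(z0, dtype=float)
    q = z.size
    def grad(z):                       # finite-difference gradient of log f_U
        g = np.zeros(q)
        for i in range(q):
            e = np.zeros(q); e[i] = step
            g[i] = (log_fU(z + e) - log_fU(z - e)) / (2 * step)
        return g
    for _ in range(n_iter):            # gradient ascent towards the mode m(x)
        z = z + lr * grad(z)
    H = np.zeros((q, q))               # Hessian of -log f_U at the mode
    for i in range(q):
        e = np.zeros(q); e[i] = step
        H[:, i] = -(grad(z + e) - grad(z - e)) / (2 * step)
    H = (H + H.T) / 2                  # symmetrize numerical noise
    return z, np.linalg.inv(H)         # mean m(x), covariance H(x)^{-1}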

17.3.3 Mean-field approximation

This section generalizes the approach discussed in proposition 15.6 for Markov random
fields. Assume that R_Z can be decomposed into several components R_Z^{[1]}, ..., R_Z^{[K]},
writing z = (z^{[1]}, ..., z^{[K]}) (for example, taking K = q and z^{[i]} = z^{(i)}, the ith coordinate
of z, if R_Z = R^q). Also assume that µ_Z splits into a product measure µ_Z^{[1]} ⊗ · · · ⊗ µ_Z^{[K]}.
The mean-field approximation consists in assuming that the probabilities ν in P̂ split into
independent components, i.e., that their densities g take the form

g(z) = g^{[1]}(z^{[1]}) · · · g^{[K]}(z^{[K]}).


Then,

KL(ν ∥ P_Z( · | X = x)) − log f_X(x) = ∑_{j=1}^K ∫_{R_Z^{[j]}} log g^{[j]}(z^{[j]}) g^{[j]}(z^{[j]}) µ_Z^{[j]}(dz^{[j]})
 − ∫_{R_Z} log f_U(x, z) ∏_{j=1}^K g^{[j]}(z^{[j]}) µ_Z(dz).     (17.3)

The mean-field approximation may be feasible when log fU (x, z) can be written as a
sum of products of functions of each z[j] . Indeed, assume that
K
XY
log fU (x, z) = ψα,j (z[j] , x) (17.4)
α∈A j=1

where A is a finite set. To shorten notation, let us denote by hψi the expectation of a
function ψ with respect to the product p.d.f. g. Then, (17.3) can be written as
K
X K
XY
KL(ν k PZ ( · | X = x)) − log fX (x) = hlog g (j) (z[j] )i − hψα,j (z[j] , x)i.
j=1 α∈A j=1

The following lemma will allow us to identify the form taken by the optimal
p.d.f. g [j] .
Lemma 17.1 Let Q be a set equipped with a positive measure µ. Let ψ : Q → R be a
measurable function such that
Z

Cψ = exp(ψ(q))µ(dq) < ∞.
Q

Let
1
gψ (q) = exp(ψ(q)).

Let g be any p.d.f. with respect to µ, and define
Z
F(g) = (log g(q) − ψ(q))g(q)µ(dq).
Q

Then F(gψ ) ≤ F(g).


408 CHAPTER 17. LATENT VARIABLES AND VARIATIONAL METHODS

Proof We note that gψ > 0, and that


KL(g k gψ ) = F(g) + log Cψ = F(g) − F(gψ ),
which proves the result, since KL divergences are always non-negative. 

Applying this lemma separately to each function g [j] implies that any optimal g
must be such that  
X 
g [j] (z[j] ) ∝ exp  Mα,j ψα,j (z[j] , x)
α∈A
with
K
Y 0
Mα,j = hψα,j 0 (z[j ] , x)i.
j 0 =1,j 0 ,j

We therefore have
R P  [j]
[j]
RZ
ψα,j (z[j] , x) exp α0 ∈A Mα 0 ,j ψα0 ,j (z[j] , x) µZ (dz[j] )
hψα,j (z[j] , x)i = R P  [j] (17.5)
[j] exp
[j] [j]
0 ∈A Mα 0 ,j ψα 0 ,j (z , x) µZ (dz )
R α
Z

This specifies a relationship expressing hψα,j (z[j] , x)i as a function of the other
0
expectations hψα 0 ,j 0 (z(j ) , x)i for j , j 0 . These equations put together are called the
mean-field consistency equations. When these equations can be written explicitly, i.e.,
when the integrals in (17.5) can be evaluated analytically (which is generally the case
when the p.d.f.’s g [j] can be associated with standard distributions), one obtains an
algorithm that iterates (17.5) over all α and j until stabilization (each step reducing
the objective function in (17.3)).

Let us retrieve the result obtained in proposition 15.6 using the current formal-
ism. Assume that RX finite and RZ = {0, 1}L , where L can be a large number, with
 
L L
1  X X 
αj (x)z(j) + βij (x)z(i) z(j)  .

fU (x, z) = exp 

C  
j=1 i,j=1,i<j

Take K = L, z[j] = z(j) . Applying the previous discussion, we see that g [j] must take
the form  
exp αj (x)z(j) + i,j βij (x)hz(i) iz(j)
P
g [j] (z(j) ) =  
1 + exp αj (x) + i,j βij (x)hz(i) i
P

In particular  
exp αj (x) + i,j βij (x)hz(i) i
P
hz(j) i =  
1 + exp αj (x) + i,j βij (x)hz(i) i
P
17.4. MAXIMUM LIKELIHOOD ESTIMATION 409

providing the mean-field consistency equations.

In this special case, it is also possible to express the objective function as a simple
function of the expectations hz(j) i’s. We indeed have, letting ρj = hz(j) i,

X L
Y L
X L
X
[j] (j)
log fU (x, z) g (z ) = − log C + αj (x)ρj + βij (x)ρi ρj .
z∈RZ j=1 j=1 i,j=1,i<j

The values of ρ1 , . . . , ρL are then obtained by maximizing


L
X L
X L 
X 
αj (x)ρj + βij (x)ρi ρj − ρj log ρj + (1 − ρj ) log(1 − ρj ) .
j=1 i,j=1,i<j j=1

The consistency equations express the fact that the derivatives of this expression
with respect to each ρj vanish.

17.4 Maximum likelihood estimation

17.4.1 The EM algorithm

We now consider maximum likelihood estimation with latent variables and use the
notation of section 17.2. The main tool is the following obvious consequence of
(17.1).

Proposition 17.2 One has


Z !
fU (x, z)
log fX (x) = max log g(z)dµZ (z)
g∈P (µZ ) RZ g(z)

and the maximum is achieved for g(z) = fZ (z | x), the conditional p.d.f. of Z given X = x.

Proof Equation (17.1) implies that


Z !
fU (x, z)
log g(z)dµZ (z) = log fX (x) − KL(g µZ k PZ ( · |X = x))
RZ g(z)

and the r.h.s. is indeed maximum when the Kullback-Liebler divergence vanishes,
that is, when g is the p.d.f. of PZ ( · | X = x). 

We will use this proposition for the derivation of the expectation-maximization


(or EM) algorithm for maximum likelihood with latent variables. We now assume
that PU , and therefore fU , is parametrized by θ ∈ Θ, and that a training set T =
410 CHAPTER 17. LATENT VARIABLES AND VARIATIONAL METHODS

(x1 , . . . , xN ) of realizations of X is observed. To indicate the dependence in θ, we will


write fU (x, z ; θ), or fZ (z | x ; θ). The maximum likelihood estimator (m.l.e.) then
maximizes X
`(θ) = log fX (x ; θ) .
x∈T

The EM algorithm is useful when the computation of the m.l.e. for complete obser-
vations, i.e., the maximization of

log fU (x, z ; θ)

when both x and z are given, is easy, whereas the same problem with the marginal
distribution is hard.

From proposition 17.2, we have:


Z !
X X fU (x, z ; θ)
log fX (x ; θ) = max log gx (z)µZ (dz)
gx ∈P (µZ ) RZ gx (z)
x∈T x∈T

Therefore the maximum likelihood requires to compute


XZ !
fU (x, z ; θ)
max log gx (z)µZ (dz). (17.6)
θ,gx ,x∈T RZ gx (z)
x∈T

The maximization can therefore be done by iterating the following two steps.

1. Given θn , compute
XZ !
fU (x, z ; θ)
argmax log gx (z)µZ (dz).
gx ,x∈T RZ gx (z)
x∈T

2. Given gx , x ∈ T , compute

XZ !
fU (x, z ; θ)
argmax log gx (z)µZ (dz)
θ RZ gx (z)
x∈T
XZ
= argmax log (fU (x, z ; θ)) gx (z)µZ (dz).
θ x∈T RZ

Step 1. is explicit and its solution is gx (z) = fZ (z | x ; θ). Using this, both steps can
be grouped together, yielding the EM algorithm.
17.4. MAXIMUM LIKELIHOOD ESTIMATION 411

Algorithm 17.1 (EM algorithm)


Let a statistical model with density fU (x, z ; θ) modeling an observable variable X and
a latent variable Z be given, and a training set T = (x1 , . . . , xN ) be observed. Starting
with an initial guess of the parameter, θ(0), the EM algorithm iterates the following
equation until numerical stabilization.
XZ
θn+1 = argmax log (fU (x, z ; θ 0 )) fZ (z | x ; θn )µZ (dz). (17.7)
θ0 x∈T RZ

Equation (17.7) maximizes (in θ 0 ) a function defined as an expectation (for θn ), jus-


tifying the name ”Expectation-Maximization.”

17.4.2 Application: Mixtures of Gaussian

A mixture of Gaussian (MoG) model was introduced in chapter 4 (equation (4.4)).


We now reinterpret it (in a slightly generalized version) as a model with partial
observations and show how the EM algorithm can be applied. Let ϕ(x ; m, Σ) denote
the p.d.f. of the d-dimensional multivariate Gaussian distribution with mean m and
covariance matrix Σ. We model fX (x ; θ) as
p
X
fX (x ; θ) = αj ϕ(x, ; cj , Σj ).
j=1

Here, θ contains all sequences α1 , . . . , αp (non-negative numbers that sum to one),


c1 , . . . , cp ∈ Rd and Σ1 , . . . , Σp (d × d positive definite matrices).

Using the previous notation, we therefore have RX = Rd , and µX the Lebesgue


measure on that space. The variable Z will take values in RZ = {1, . . . , p}, with µZ
being the counting measure. We model the joint density function for (X, Z) as
fU (x, z ; θ) = αz ϕ(x; cz , Σz ). (17.8)

Clearly fX is the marginal p.d.f. of fU . One can therefore consider Z as a latent


variable, and estimate θ using the EM algorithm.

We now make (17.7) explicit for mixtures of Gaussian. For given θ and θ 0 and
x ∈ R, let
Z
0 d
Ux (θ, θ ) = log 2π + log (fU (x, z ; θ 0 )) fZ (z | x ; θ)dµZ (z)
2 RZ
p
X 1 1

= log αz0 − log det Σ0z − (x − cz0 )T Σ0z −1 (x − cz0 ) fZ (z | x ; θ)
2 2
z=1
412 CHAPTER 17. LATENT VARIABLES AND VARIATIONAL METHODS

with 1 1 T Σ−1 (x−c )


(det Σz )− 2 αz e− 2 (x−cz ) z z
fZ (z | x ; θ) = P .
p − 21 − 12 (x−cj )T Σ−1
j (x−cj )
j=1 (det Σj ) αj e

PNIf θn is the current parameter in the EM, the next one, θn+1 must maximize
0 ). This can be solved in closed form. To compute α 0 , . . . , α 0 , one must
U (θ
x∈T x n , θ 1 p
maximize
XX p
(log αz0 )fZ (z | x ; θ)
x∈T z=1
subject to the constraint that
P 0
z αz = 1. This yields
X p X
.X
αz0 = fZ (z | x ; θ) fZ (j | x ; θ) = ζz / N
x∈T j=1 x∈T
P
with ζz = x∈T fZ (z | x ; θ).

The centers c10 , . . . , cp0 must minimize x∈T (x − cz0 )T Σ0z −1 (x − cz0 )fZ (z|x ; θ), which
P
yields
0 1X
cz = xfZ (z | x ; θ).
ζz
x∈T
Finally, Σ0z must minimize
ζz 1X
0
log det Σz + (x − cz0 )T Σ0z −1 (x − cz0 )fZ (z | x ; θ),
2 2
x∈T

which yields
1X
Σ0z = (x − cz0 )(x − cz0 )T fZ (z | x ; θ).
ζz
x∈T
We can now summarize the algorithm.

Algorithm 17.2 (EM for Mixture of Gaussian distributions)


1. Initialize the parameter θ(0) = (α(0), c(0), Σ(0)). Choose a small constant  and
a maximal number of iterations M.
2. At step n of the algorithm, let θ = θ(n) be the current parameter, writing for
short θ = (α, c, Σ).
3. Compute, for x ∈ T and i = 1, . . . , p
1 1 T Σ−1 (x−c )
(det Σi )− 2 αi e− 2 (x−ci ) i i
fZ (i | x ; θ) = P
p − 12 − 12 (x−cj )T Σ−1
j (x−cj )
j=1 (det Σj ) αj e
P
and let ζi = x∈T fZ (i | x ; θ), i = 1, . . . , p.
17.4. MAXIMUM LIKELIHOOD ESTIMATION 413

4. Let αi0 = ζi /N .
5. For i = 1, . . . , p, let
1X
ci0 = xfZ (i | x ; θ).
ζi
x∈T

6. For i = 1, . . . , p, let
1X
Σ0i = (x − ci0 )(x − ci0 )T fZ (i | x ; θ).
ζi
x∈T

7. Let θ 0 = (µ0 , c0 , Σ0 ). If |θ 0 − θ| <  or n + 1 = M: return θ 0 and exit the algorithm.


8. Otherwise, set θ(n + 1) = θ 0 and return to step 2.

Remark 17.3 Algorithm 17.2 can be simplified by making restrictions on the model.
Here are some examples.

(i) One may restrict to Σi = σi2 IdRd to reduce the number of free parameters. Then,
step 7 of the algorithm needs to be replaced by:
1 X
(σi0 )2 = |x − ci0 |2 fZ (i | x ; θ).
dζi
x∈T

(ii) Alternatively, the model may be simplified by assuming that all covariance ma-
trices coincide: Σi = Σ for i = 1, . . . , p. Then, step 7 becomes
p
1 XX
Σ0i = (x − ci0 )(x − ci0 )T fZ (i | x ; θ).
N 
i=1 x∈T

(iii) Finally, one may assume that Σ is known and fixed in the algorithm (usually in
the form Σ = σ 2 IdRd for some σ > 0) so that step 7 of the algorithm can be removed.
(iv) One may also assume also that the (prior) class probabilities are known, typi-
cally set to αi = 1/p for all i, so that step 4 can be skipped.

17.4.3 Stochastic approximation EM

The stochastic approximation EM (or SAEM) algorithm has been proposed by De-
lyon et al. [58] (see this reference for convergence results) to address the situation in
which the expectations for the posterior distribution cannot be computed in closed
form, but can be estimated using Monte-Carlo simulations. SAEM uses a special
414 CHAPTER 17. LATENT VARIABLES AND VARIATIONAL METHODS

form of stochastic approximation, different from the SGD algorithm described in


section 3.3. It updates, at each step n, an approximate objective function that we
will denote λn and a current parameter θ(n). It implements the following iterations:

 (x)


 ξn+1 ∼ PZ ( · | X = x ; θn ), x ∈ T

1 1 X
    
 (x)
λn+1 (θ 0 ) = 1 − λn (θ 0 ) + log fU (x, ξn+1 ; θ 0 ) − λn (θ 0 ) , θ 0 ∈ Θ (17.9)


 n+1 n+1
x∈T



0

θn+1 = argmax λn+1 (θ )




θ0

The second step means that

 
N  X n
X 1 (x)

λn (θ 0 ) = log fU (x, ξj ; θ 0 ) .

 

n 
x∈T j=1

(x)
Given that ξn+1 ∼ PZ ( · | X = x ; θn ), one expects this expression to approximate

XZ
log (fU (x, z ; θ 0 )) fZ (z | x ; θ)dµZ (z)
x∈T RZ

so that the third step of (17.9) can be seen as an approximation of (17.7). Suffi-
cient conditions under which this actually happens (and θ(n) converges to a local
maximizer of the likelihood) are provided in Delyon et al. [58] (see also Kuhn and
Lavielle [113] for a convergence result under more general hypotheses on how ξ is
simulated).

To be able to run this algorithm efficiently, one needs the simulation of the pos-
terior distribution to be feasible. Importantly, one also needs to be able to update
efficiently the function λn . This can be achieved when the considered model belongs
to an exponential family, which corresponds to assuming that the p.d.f. of U takes
the form
1  
fU (x, z ; θ) = exp ψ(θ)T H(x, z)
C(θ)

for some functions ψ and H. For example, the MoG model of equation (4.4) takes
17.4. MAXIMUM LIKELIHOOD ESTIMATION 415

this form, with

1 1 1 T −1 1

T
ψ(θ) = log α1 − mT1 Σ−1 1 m1 − log det Σ1 , . . . , log αp − mp Σp mp − log det Σp ,
2 2 2 2
−1 −1
Σ1 m 1 , . . . , Σp m p ,

−1 −1
Σ1 , . . . , Σp ,

H(x, z)T = 1z=1 , . . . , 1z=p ,
x1z=1 , . . . , x1z=p ,
1 T 1 T

− xx 1z=1 , . . . , − xx 1z=p
2 2

and C(θ) = (2π)pd/2 .

For such a model, we can replace the algorithm in (17.9) by the more manageable
one: 
(x)



 ξn+1 ∼ PZ ( · | X = x ; θn ), x ∈ T

1 1
  
 (x) (x) (x) (x)
η = 1 − ηn + (H(x, ξn+1 ) − ηn )




 n+1 n + 1 n+1

(17.10)
 X (x) 
0 0 T 0
λ (θ ) = ψ(θ ) η − log C(θ )




 n+1 n+1

 x∈T


θn+1 = argmax λn+1 (θ 0 )




θ0

We leave as an exercise the computation leading to the implementation of this algo-


rithm for mixtures of Gaussian.

17.4.4 Variational approximation

Returning to proposition 17.2 and equation (17.6), we see that one can make a vari-
ational approximation of the maximum likelihood by computing
XZ !
fU (x, z ; θ)
max log gx (z)µZ (dz), (17.11)
θ∈Θ,gx ∈P
b,x∈T RZ gx (z)
x∈T

where P b ⊂ P is a class of p.d.f. with respect to µZ . The resulting algorithm is then


implemented by iterating the computation of gx , x ∈ T , using approximations sim-
ilar to those provided in section 17.3, and maximization in θ for given gx , x ∈ T .
This variational approximation of the maximum likelihood estimator is therefore
provided by the following algorithm.
416 CHAPTER 17. LATENT VARIABLES AND VARIATIONAL METHODS

Algorithm 17.3 (Variational Bayes approximation of the m.l.e.)


Let a statistical model with density fU (x, z ; θ) modeling an observable variable X
and a latent variable Z be given, and a training set T = (x1 , . . . , xN ) be observed. Let
b be a set of p.d.f.’s on RZ and define
P
Z !
g(z)
g ( · ; x, θ) = argmin
b log g(z)µZ (dz)
g∈P
b RZ fU (x, z ; θ)

(assuming that this minimizer is uniquely defined).

Starting with an initial guess of the parameter, θ0 , iterate the following equation
until numerical stabilization:
XZ
θ(n + 1) = argmax log (fU (x, z ; θ 0 ))b
g (z ; x, θ(n))µZ (dz). (17.12)
θ0 x∈T RZ

Assume that the distributions in P


b are also parametrized, denoting their param-
eter by η, belonging to some Euclidean domain V . Let g( · ; η) denote the p.d.f. in
b with parameter η. Letting η = (ηx , x ∈ T ) denote an element of V T (parameters
P
in V indexed by elements of the training set), (17.11) can then be written as the
maximization of
XZ !
fU (x, z ; θ)
F(θ, η) = log g(z; ηx )µZ (dz). (17.13)
RZ g(z; ηx )
x∈T

This expression is amenable to a stochastic gradient ascent implementation. We


have
Z ! Z
fU (x, z ; θ)
∂θ log g(z; ηx )µZ (dz) = ∂θ log fU (x, z ; θ)g(z; ηx )µZ (dz)
RZ g(z; ηx ) RZ

and
Z !
fU (x, z ; θ)
∂ηx log g(z; ηx )µZ (dz)
RZ g(z; ηx )
Z ! !
fU (x, z ; θ)
= −∂η log g(z; ηx )g(z; ηx ) + log ∂η g(z; ηx ) µZ (dz)
RZ g(z; ηx )
Z !
fU (x, z ; θ)
= log ∂η log g(z; ηx )g(z; ηx )µZ (dz)
RZ g(z; ηx )
Here, we have used the fact that, for all η,
Z Z
∂η log g(z; η)g(z; η) µZ (dz) = ∂η g(z; η)µZ (dz) = 0
RZ RZ
17.5. REMARKS 417
R
since RZ
g(x, η)µZ (dz) = 1.

Denote by πη the probability distribution of the random variable Z taking val-


|T |
ues in RZ obtained by sampling Z = (Zx , x ∈ T ) such that the components Zx are
independent and with p.d.f. g( · ; ηx ) with respect to µZ . Define
X
Φ1 (θ, z) = ∂θ log fU (x, zx ; θ)
x∈T

and !
X fU (x, z ; θ)
Φ2 (θ, η, z) = log ∂η log g(zx ; ηx ).
g(z; ηx )
x∈T
Then, following section 3.3, one can maximize (17.13) using the algorithm
(
θn+1 = θn + γn+1 Φ1 (θn , Z n+1 )
(17.14)
η n+1 = η n + γn+1 Φ2 (θn , η n , Z n+1 )
where Z n+1 ∼ πη n .

Alternatively (for example when T is large), one can also sample from x ∈ T at
each update. This would require defining πη as the distribution on T ×RZ with p.d.f.
ϕη (x, z) = g(z; ηx )/N , where N = |T |. One can now use

Φ1 (θ, x, z) = ∂θ log fU (x, z ; θ)

and !
fU (x, z ; θ)
Φ2 (θ, η, z) = log ∂η log g(z; η),
g(z; η)
one can use



 θn+1 = θn + γn+1 ∂θ log fU (Xn+1 , Zn+1 ; θn )
 !
fU (Xn+1 , Zn+1 ; θn ) (17.15)

ηn+1,Xn+1 = ηn,Xn+1 + γn+1 log g(Zn+1 ; ηn,X ) ∂η log g(Zn+1 ; ηn,Xn+1 )




n+1

with (Xn+1 , Zn+1 ) ∼ πη n . Sampling from a single training sample at each step can be
replaced by sampling from a minibatch with obvious modifications.

17.5 Remarks

17.5.1 Variations on the EM

Based on the formulation of the EM as the solution of (17.6), it should be clear that
solving (17.7) at each step can be replaced by any update of the parameter that in-
creases (17.6). For example, (17.7) can be replaced by a partial run of a gradient
418 CHAPTER 17. LATENT VARIABLES AND VARIATIONAL METHODS

ascent algorithm, stopped before convergence. One can also use a coordinate as-
cent strategy. Assume that θ can be split into several components, say two, so that
θ = (θ (1) , θ (2) ). Then, (17.7) may then be split into
(1)
XZ 
(2)

θn+1 = argmax log fU (x, z ; θ (1) , θn ) fZ (z | x ; θ(n))µZ (dz)
θ (1) x∈T RZ

(2)
XZ 
(1)

(2)
θn+1 = argmax log fU (x, z ; θn+1 , θ ) fZ (z | x ; θ(n))µZ (dz).
θ (2) x∈T RZ

Doing so is, in particular, useful when both these steps are explicit, but not (17.7).

17.5.2 Direct minimization

While the EM algorithm is widely used in the context of partial observations, it is


also possible to make explicit the derivative of
Z
log fX (x ; θ) = log fU (x, z ; θ)µZ (dz)
RZ

with respect to the parameter θ. Indeed, differentiating the integral and writing
∂θ fU = fU ∂θ log fU , we have
Z
f (x, z ; θ)
∂θ log fX (x ; θ) = ∂θ log fU (x, z ; θ) U µ (dz)
RZ fX (x ; θ) Z
Z
= ∂θ log fU (x, z ; θ)fZ (z | x, θ)µZ (dz).
RZ

In other terms, the derivative of the log-likelihood of the observed data is the con-
ditional expectation of the derivative of the log-likelihood of the full data given
the observed data. When computable, this expression can be used with standard
gradient-based optimization methods, such as those described in chapter 3. This
expression is also amenable to a stochastic gradient ascent algorithm, namely
X
θn+1 = θn + γn+1 ∂θ fU (x, Zn+1,x , θn ) (17.16)
x∈T

where Zn+1,x follows the distribution with density fZ (· | x, θn ) with respect to µZ . An


alternative SGA implementation can use the discussion in section 17.4.4, with the
density gηx replaced by fZ ( · | x, ηx ), which leads to
 X


 θn+1 = θ n + γ n+1 ∂θ log fU (x, Zn+1,x , θn )


 x∈T

n+1,x = ηn,x − γn+1 ∂ηx log fZ (Zn+1,x | x, ηx ), x ∈ T
η

where Zn+1,x follows the distribution with density fZ (· | x, ηn,x ).


17.5. REMARKS 419

17.5.3 Product measure assumption

We have worked, in this chapter, under the assumption that PU was absolutely con-
tinuous with respect to a product measure µU = µX ⊗ µZ . This is not a mild as-
sumption, as it fails to include some important cases, for example when X and Z
have some deterministic relationship, the simplest instance being when X = F(Z)
for some function F. In many cases, however, one can make simple transformations
on the model that will make it satisfy this working assumption. For example, if
X = F(Z), one can generally split Z into Z = (Z (1) , Z (2) ) so that the equation X = F(Z)
is equivalent to Z (2) = G(X, Z (1) ) for some function G. One can then apply the dis-
cussion above to U = (X, Z (1) ) instead of U = (X, Z).

Using more advanced measure theory, however, one can see that this product
decomposition assumption was in fact unnecessary. Indeed, one can assume that
the measure µU can “disintegrate” in the following sense: there exists a measure µX
on RX and, for all x ∈ RX , a measure µZ ( · | x) on RZ such that, for all functions ψ
defined on RU ,
Z Z Z
ψ(x, z)µU (dx, dz) = ψ(x, z)µZ (dz | x)µX (dx).
RU RX RZ

This is now a mild assumption, which is true [33] as soon as one assumes that µU (R)
is finite (which is not a real loss of generality as one can reduce to this case by re-
placing if needed µU by an equivalent probability distribution).

With this assumption, the marginal distribution of X had a p.d.f. with respect to
µX given by Z
fX (x) = fU (x, z)µZ (dz | x)
RZ

and the conditional distributions PZ ( · | x) have a p.d.f. relative to µZ ( · | x) given by

fU (x, z)
fZ (z | x) = .
fX (x)

The computations and approximations made earlier in this chapter can then be ap-
plied with essentially no modification.
420 CHAPTER 17. LATENT VARIABLES AND VARIATIONAL METHODS
Chapter 18

Learning Graphical Models

We discuss, in this chapter, several methods designed to learn parameters of graph-


ical models, starting with the somewhat simpler case of Bayesian networks, than
passing to Markov random fields on loopy graphs.

18.1 Learning Bayesian networks

18.1.1 Learning a Single Probability

Since Bayesian networks are specified by probabilities and conditional probabilities


of configurations of variables, we start with a discussion of the basic problem of
estimating discrete probability distributions.

The obvious way to estimate the probability of an event A based on a series of N


independent experiments is by using relative frequencies

#{A occurs}
fA = .
N
This estimation is unbiased (E(fA ) = P(A)) and its variance is P(A)(1 − P(A))/N . This
implies that the relative error δA = fA / P(A) − 1 has zero mean and variance

1 − P(A)
σ2 = .
N P(A)

This number can clearly become very large when P(A) ' 0. In particular, when
P(A) is small compared to 1/N , the relative frequency will often be fA = 0, leading
to the false conclusion that A is not just rare, but impossible. If there are reasons to
expect beforehand that A is indeed possible, it is important to inject this prior belief
in the procedure, which suggest using Bayesian estimation methods.

421
422 CHAPTER 18. LEARNING GRAPHICAL MODELS

The main assumption for these methods is to consider the unknown probability,
p = P(A), as a random variable, yielding a generative process in which a random
probability is first obtained, and then N instances of A or not-A are generated using
this probability.

Assume that the “prior distribution” of p (which determines a prior belief) has
a p.d.f. q (with respect to Lebesgue’s measure) on the unit interval. Given on N in-
dependent observations of occurrences of A, each following a Bernoulli distribution
b(p), the joint likelihood of all involved variables is given by
!
N k
p (1 − p)N −k q(p),
k
where k is the number of times the event A has been observed.

The conditional density of p given the observation (k occurrences of A) is called


the posterior distribution. Here, it is given by
q(p) k
q(p | k) = p (1 − p)N −k
Ck
where Ck is a normalizing constant. If there was no specific prior knowledge on p
(so that q(p) = 1), the resulting distribution is a beta distribution with parameters
k + 1 and N − k + 1, the beta distribution being defined as follows.
Definition 18.1 The beta distribution with parameters a and b (abbreviated β(a, b)) has
density with respect to Lebesgue’s measure
Γ (a + b) a−1
ρ(t) = t (1 − t)b−1 if t ∈ [0, 1]
Γ (a)Γ (b)
and ρ(t) = 0 otherwise, with Z ∞
Γ (x) = t x−1 e−t dt.
0

From the definition of a beta distribution, it is clear also that, if we choose the
prior to be β(a + 1, ν − a + 1) then the posterior is β(k + a + 1, N + ν − (k + a) + 1).
The posterior therefore belongs to the same family of distributions as the prior, and
one says that the beta distribution is a conjugate prior for the binomial distribution.
The mode of the posterior distribution (which is the maximum a posteriori (MAP)
estimator) is given by
k+a
p̂ = .
N +ν

This estimator now provides a positive value even if k = 0. By selecting a and ν,


one therefore includes the prior belief that p is positive.
18.1. LEARNING BAYESIAN NETWORKS 423

18.1.2 Learning a Finite Probability Distribution

Now assume that F is a finite space and that we want to estimate a probability distri-
bution p = (p(x), x ∈ F) using a Bayesian approach as above. We cannot use the previ-
ous approach to estimate each p(x) separately, since these probabilities are linked by
the fact that they sum to 1. We can however come up with a good (conjugate) prior,
identified, as done above, by computing the posterior associated to a uniform prior
distribution.

Letting Nx be the number of times x ∈ F is observed among N independent sam-


ples of a random variable X with distribution PX (·) = p(·), the joint distribution of
(Nx , x ∈ F) is multinomial, given by

N! Y
P(Nx , x ∈ F | p(·)) = Q p(x)Nx .
N
x∈F x !
x∈F

The posteriorQdistribution of p(·) given the observations with a uniform prior is pro-
portional to x∈F p(x)Nx . It belongs to the family of Dirichlet distributions, described
in the following definition.

Definition 18.2 Let F be a finite set and SF be the simplex defined by


 

 X 

SF = (p(x), x ∈ F) : p(x) ≥ 0, x ∈ F and p(x) = 1 .
 
 

 
x∈F

The Dirichlet distribution with parameters a = (a(x), x ∈ F) (abbreviated Dir(a)) has den-
sity
Γ (ν) Y
ρ(p(·)) = Q p(x)a(x)−1 , if x ∈ SF
x∈F Γ (a(x))
x∈F
P
and 0 otherwise, with ν = x∈F a(x).

Note that, if F has cardinality 2, the Dirichlet distribution coincides with the beta
distribution. Similarly to the beta for the binomial, and almost by construction, the
Dirichlet distribution is a conjugate prior for the multinomial. More precisely, if the
prior distribution for p(·) is Dir(1+a(x), x ∈ F), then the posterior after N observations
of X is Dir(1 + Nx + a(x), x ∈ F), and the MAP estimator is given by

Nx + a(x)
p̂(x) =
N +ν
P
with ν = x∈F a(x).
424 CHAPTER 18. LEARNING GRAPHICAL MODELS

18.1.3 Conjugate Prior for Bayesian Networks

We now consider a Bayesian network on the set F (V ) containing configurations x =


(x(s) , s ∈ V ) with x(s) ∈ Fs . We want to estimate the conditional probabilities in the
representation Y
P(X = x) = ps (x(pa(s)) , x(s) ).
s∈V
Assume that N independent observations of X have been made. Define the counts
Ns (x(s) , x(pa(s)) ) to be the number of times the observation x({s}∪pa(s)) has been made.
Then, it is straightforward to see that, assuming a uniform prior for the ps , their
posterior distribution is proportional to
(s) (s− )
Y Y Y
ps (x(pa(s)) , x(s) )Ns (x ,x ) .
s∈V x(pa(s)) ∈Fpa(s) x(s) ∈Fs

This implies that, for the posterior distribution, the conditional probabilities
ps (x(pa(s)) , ·) are independent and follow a Dirichlet distribution with parameters
1 + Ns (x(s) , x(pa(s)) ), x(s) ∈ Fs .

So, independent Dirichlet distributions indexed by configurations of parents of


nodes provide a conjugate prior for the general Bayesian network model. This prior
is specified by a family of positive numbers
 
as (x(s) , x(pa(s)) ), s ∈ V , x(s) ∈ Fs , x(pa(s)) ∈ F (pa(s)) , (18.1)
yielding a prior probability proportional to
(s) (s− )
Y Y Y
ps (x(pa(s)) , x(s) )as (x ,x )−1 .
s∈V x(pa(s)) ∈Fpa(s) x(s) ∈Fs

and a MAP estimator



N (x(s) , x(pa(s)) ) + as (x(s) , x(s ) )
p̂s (x ,x ) = s
(pa(s)) (s)
− − (18.2)
Ns (x(s ) ) + νs (x(s ) )
where Ns (x(pa(s)) ) = x(s) ∈Fs Ns (x(s) , x(pa(s)) ) and νs (x(pa(s)) ) = x(s) ∈Fs as (x(s) , x(pa(s)) ).
P P

One can restrict the huge class of coefficients described by (18.1) to a smaller
class by imposing the following condition.
Definition 18.3 One says that the family of coefficients
a = (as (x(s) , x(pa(s)) ), s ∈ V , x(s) ∈ Fs , x(pa(s)) ∈ F (pa(s))),
is consistent if there exists a positive scalar ν and a probability distribution P 0 on F (V )
such that
as (x(s) , x(pa(s)) ) = νP{s}∪pa(s)
0
(x({s}∪pa(s)) ).
18.1. LEARNING BAYESIAN NETWORKS 425

The class of products of Dirichlet distributions with consistent families of coeffi-


cients still provides a conjugate prior for Bayesian networks (the proof being left to
the reader). Within this class, the simplest choice (and most natural in the absence
of additional information) is to assume that P 0 is uniform, so that
ν0
as (x(s) , x(pa(s)) ) = . (18.3)
|F ({s} ∪ pa(s))|
With this choice, ν 0 is the only parameter that needs to be specified. It is often called
the equivalent sample size for the prior distribution.

We can see from (18.2) that using a prior distribution is quite important for
Bayesian networks, since, when the number of parents increases, some configura-
tions on F (pa(s)) may not be observed, resulting in an undetermined value for the
ratio

Ns (x(s) , x(pa(s)) )/Ns (x(s ) ),
even though, for the estimated model, the probability of observing x(pa(s)) may not
be zero.

18.1.4 Structure Scoring

Given a prior defined as a family of Dirichlet distributions associated to a = (as (x(s) , x(pa(s)) )
for s ∈ V , x(s) ∈ Fs , x(pa(s)) ∈ F (pa(s)), the joint density of the observations and param-
eters is given by
Y Y (s) (pa(s)) )+a (x(s) ,x(pa(s)) )−1
P (x, θ) = D(as (·, x(pa(s)) )) p(x(pa(s)) , x(s) )Ns (x ,x s

s,x(pa(s)) s,x(s) ,x(pa(s))

with
Γ (ν)
D(a(λ), λ ∈ F) = Q
λ Γ (a(λ))
P
and ν = λ a(λ). Here, θ represents the parameters of the model, i.e., the conditional
distributions that specify the Bayesian network. Note that P (x, θ) is a density over
the product space F (V ) × Θ where Θ is the space of all these conditional distribu-
tions. The marginal of this likelihood over all possible parameters, i.e.,
Z
P (x) = P (x, θ)dθ

provides the expected likelihood of the sample relative to the distribution of the pa-
rameters, and only depends on the structure of the network. In our case, integrating
with respect to θ yields
X D(as (·, x(pa(s)) ))
log P (x) = log .
s,xpa(s) D(as (·, x(pa(s)) ) + Ns (·, x(pa(s)) ))
426 CHAPTER 18. LEARNING GRAPHICAL MODELS

Letting
X D(as (·, x(pa(s)) ))
γ(s, pa(s)) = log ,
(pa(s))
D(as (·, x(pa(s)) ) + Ns (·, x(pa(s)) ))
x
the decomposition X
log P (x) = γ(s, pa(s))
s∈V
expresses this likelihood as a sum of “scores” (associated to each node and its par-
ents), which depends on the observed sample. The scores that are computed above
are often called Bayesian scores because they derive from a Bayesian construction.
One can also consider simpler scores, such as penalized likelihood:
X
γ(s, pa(s)) = − Ĥ(X (s) | X (pa(s)) ) |F (pa(s))| − ρ|pa(s)|,
x(pa(s))

where Ĥ is the conditional entropy for the empirical distribution based on observed
samples. Structure learning algorithms [145, 109] are designed to optimize such
scores.

18.1.5 Reducing the Parametric Dimension

In the previous section, we estimated all conditional probabilities intervening in the


network. This is obviously a lot of parameters and, even with a regularizing prior,
the estimated values are likely to be be inaccurate for small sample sizes. It then
becomes desirable to simplify the parametric complexity of the model.

When the sets Fs are not too large, which is common in practice, the paramet-
ric explosion is due to the multiplicity of parents, since the number of conditional
probabilities ps (x(pa(s)) , ·) grows exponentially with |pa(s)|. One way to simplify this
is to assume that the conditional probability at s only depends on x(pa(s)) via some
“global-effect” statistic gs (x(pa(s)) ). The idea, of course, is that the number of values
taken by gs should remain small, even if the number of parents is large.

Examples of some functions gs can be max(x(t) , t ∈ pa(s)), or the min, or some


simple (quantized) function of the sum. With binary variables (Fs = {0, 1}), logical
operators are also available (“and”, “or”, “xor”), as well as combinations of them.
The choice made for the functions gs is part of building the model, and would rely on
the specific context and prior information on the process, which is always important
to account for, in any statistical problem.

Once the gs ’s are fixed, learning the network distribution, which is now given by
Y
π(x) = ps (gs (x(pa(s)) ), x(s) )
s∈V
18.2. LEARNING LOOPY MARKOV RANDOM FIELDS 427

can be done exactly as before, the parameters being all ps (w, λ), λ ∈ Fs , w ∈ Ws , where
Ws is the range of gs , and Dirichlet priors can be associated to each ps (w, ·) for s ∈ V
and w ∈ Ws . The counts provided in (18.3) now can be chosen as
ν0
as (xs , w) = . (18.4)
|F| |gs−1 (w)|

18.2 Learning Loopy Markov Random Fields

Like everything else, parameter estimation for loopy networks is much harder than
with trees or Bayesian networks. There is usually no closed form expression for
the estimators, and their computation relies on more or less tractable numerical
procedures.

18.2.1 Maximum Likelihood with Exponential Models

In this section, we consider a parametrized model for a Gibbs distribution


1
πθ (x) = exp(−θ T U (x)) (18.5)

where θ is a d-dimensional parameter and U is a function from F (V ) to Rd . For
example, if π is an Ising model with
1  X X 
π(x) = exp α x(s) + β x(s) x(t) ,
Z s∼t
s∈V

then θ = (α, β) and U (x) = −( s x(s) , s∼t x(s) x(t) ). Most of the Markov random fields
P P
models that are used in practice can be put in this form. The constant Zθ in (18.5) is
X
Zθ = exp(−θ T U (x))
x∈F (V )

and is usually not computable.

Now, assume that an N -sample, x1 , . . . , xN , is observed for this distribution. The


maximum likelihood estimator maximizes
N
1X
`(θ) = log πθ (xk ) = −θ T ŪN − log Zθ
N
k=1

with ŪN = (U (x1 ) + · · · + U (xN ))/N .

We have the following proposition, which is a well-known property of exponen-


tial families of probabilities.
428 CHAPTER 18. LEARNING GRAPHICAL MODELS

Proposition 18.4 The log-likelihood, `, is a concave function of θ, with

∇`(θ) = Eθ (U ) − ŪN (18.6)

and
∇2 `(θ) = −Varθ (U ) (18.7)

where Eθ denotes the expectation with respect to πθ and Varθ the covariance matrix under
the same distribution.

We skip the proof, which is just computation. This proposition implies that a
local maximum of θ 7→ `(θ) must also be global. Any such maximum must be a
solution of
Eθ (U ) = ŪN (x0 )

and conversely. There are some situations in which the maximum does not exist, or
is not unique. Let us first discuss the second case.

If several solutions exist, the log-likelihood cannot be strictly concave: there must
exist at least one θ for which Varθ (U ) is not definite. This implies that there exists a
nonzero vector u such that varθ (u T U ) = u T Varθ (U )u = 0. This is only possible when
u T U (x) = cst for all x ∈ FV . Conversely, if this is true, Varθ (U ) is degenerate for all θ.

So, the non-uniqueness of the solutions is only possible when a deterministic


affine relation exists between the components of U , i.e., when the model is over-
dimensioned. Such situations are usually easily dealt with by removing some pa-
rameters. In all other cases, there exists at most one maximum.

For a concave function like ` to have no maximum, there must exist what is called
a direction of recession [168], which is a direction α ∈ Rd such that, for all θ, the
function t 7→ `(θ + tα) is increasing. In this case the maximum is attained “at infin-
ity”. Denoting Uα (x) = α T U (x), the derivative in t of `(θ + tα) is

Eθ+tα (Uα ) − Ūα

where Ūα = α T ŪN . This derivative is positive for all t if and only if

Ūα = Uα∗ := min{Uα (x), x ∈ F (V )} (18.8)

and Uα is not constant. To prove this, assume that the derivative is positive. Then
Uα is not constant (otherwise, the derivative would be zero). Let Fα∗ ⊂ F (V ) be the
18.2. LEARNING LOOPY MARKOV RANDOM FIELDS 429

set of configurations x for which Uα (x) = Uα∗ . Then

Eθ+tα (Uα )
P T
x∈F (V ) Uα (x) exp(−θ U (x) − tUα (x))
= P T
x∈F (V ) exp(−θ U (x) − tUα (x))
P T ∗
x∈F (V ) Uα (x) exp(−θ U (x) − t(Uα (x) − Uα ))
= P T ∗
x∈F (V ) exp(−θ U (x) − t(Uα (x) − Uα ))
Uα∗
P T U (x)) + P T ∗
x∈Fα∗ exp(−θ x<Fα∗ Uα (x) exp(−θ U (x) − t(Uα (x) − Uα ))
= P T
P T ∗ .
x∈Fα∗ exp(−θ U (x)) + x<Fα∗ exp(−θ U (x) − t(Uα (x) − Uα ))

When t tends to +∞, the sums over x < Fα∗ tend to 0, which implies that Eθ+tα (Uα )
tends to Uα∗ . So, if Eθ+tα (Uα ) − Ūα > 0 for all t, then Ūα = Uα∗ and Uα is not constant.
The converse statement is obvious.

As a conclusion, the function ` has a finite maximum if and only if there is no


direction α ∈ Rd such that α T (U (x)− ŪN ) ≤ 0 for all x ∈ F (V ). Equivalently, ŪN must
belong to the interior of the convex hull of the finite set

{U (x), x ∈ F (V )} ⊂ Rd .

In such a case, that we hereafter assume, computing the maximum likelihood


estimator boils down to solving the equation

Eθ (U ) = ŪN .

Because the maximization problem is concave, we know that numerical algorithms


such as gradient ascent,

θ(t + 1) = θ(t) + (Eθ(t) (U ) − ŪN ), (18.9)

converge to the optimal parameter. Unfortunately, the computation of the expec-


tations and covariance matrices can only be made explicitly for acyclic models, for
which parameter estimation is not a problem anyway. For general loopy graphical
models, the expectation can be estimated iteratively using Monte-Carlo methods. It
turns out that this estimation can be synchronized with gradient descent to obtain a
consistent algorithm, which is described in the next section.

18.2.2 Maximum likelihood with stochastic gradient ascent

As remarked above, for fixed θ, we have designed, in chapter 15, Markov chain
Monte Carlo algorithms that asymptotically sample form πθ . Select one of these
algorithms, and let pθ be the corresponding transition probabilities for a given θ, so
430 CHAPTER 18. LEARNING GRAPHICAL MODELS

that pθ (x, y) = P(Xn+1 = y | Xn = x) for the sampling chain. Then, define the iterative
algorithm, initialized with arbitrary θ0 and x0 ∈ F (V ), that loops over the following
two steps.

(SG1) Sample from the distribution pθt (xt , ·) to obtain a new configuration xt+1 .
(SG2) Update the parameter using
θt+1 = θt + γt+1 (U (xt+1 ) − ŪN ). (18.10)

This algorithm differs from the situation considered in section 3.3 in that the
distribution of the sampled variable xt+1 depends on both the current parameter θt
and on the current variable xt . Convergence requires additional constraints on the
size of the gains γ(t) and we have the following theorem [209].
Theorem 18.5 If pθ corresponds to the Gibbs sampler or Metropolis algorithm, and γt+1 =
/(t +1) for small enough , the algorithm that iterates (SG1) and (SG2) converges almost
surely to the maximum likelihood estimator.

The speed of convergence of such algorithms depends both on the speed of con-
vergence of the Monte-Carlo sampling and of the original gradient ascent. The latter
can be improved somewhat with variants similar to those discussed in section 3.3,
for example by choosing data-adaptive gains as in the ADAM algorithm.

18.2.3 Relation with Maximum Entropy

The maximum likelihood estimator is closely related to what is called the maximum
entropy extension of a set of constraints. Let the function U from F (V ) to Rd be
given. An element u ∈ Rd is said to be a consistent assignment for U if there exists
a probability distribution π on F (V ) such that Eπ (U ) = u. An example of consistent
assignment is any empirical average Ū based on a sample (x(1) , . . . , x(N ) ), since Ū =
Eπ (U ) for
N
1X
π= δx(k) .
N
k=1

Given U and a consistent assignment, u, the associated maximum entropy exten-


sion is defined as a probability distribution π maximizing the entropy, H(π), subject
to the constraint Eπ (U ) = u. This is a convex optimization problem, with constraints
 X


 π(x) = 1

x∈F (V )




 X


 Uj (x)π(x) = uj , j = 1, . . . , d (18.11)




 x∈F (V )


π(x) ≥ 0, x ∈ F (V )
18.2. LEARNING LOOPY MARKOV RANDOM FIELDS 431

Because the entropy is strictly convex, there is a unique solution to this problem.
We first discuss non-positive solutions, i.e., solutions for which π(x) = 0 for some x.
An important fact is that, if, for a given x, there exists π1 such that Eπ1 (U ) = u and
π1 (x) > 0, then the optimal π must also satisfy π(x) > 0. This is because, if π(x) = 0,
then, letting π = (1 − )π + π1 , we have Eπ (U ) = u since this constraint is linear,
π (x) > 0 and
X
H(π ) − H(π) = − (π (y) log π (y) − π(y) log π(y))
y,π(y)>0
X
− π1 (y)(log() + log π1 (y))
y,π(y)=0
X
= − log  π1 (y) + O()
y,π(y)=0

which is positive for small enough , contradicting the fact that π is a maximizer.

Introduce the set Nu containing all configurations x ∈ F (V ) such that π(x) = 0


for all π such that Eπ (U ) = u. Then we know that the maximum entropy extension
satisfies π(x) > 0 if x < Nu . Introduce Lagrange multipliers θ0 , θ1 , . . . , θd for the d + 1
equality constraints in (18.11), and the Lagrangian
X
L = H(π) + (θ0 + θ T U (x))π(x)
x∈F (V )\Nu

in which we have set θ = (θ1 , . . . , θd ), we find that the optimal π must satisfy

log π(x) = −θ0 − 1 − θ T U (x)






 X

π(x) = 1



x




Eπ (U ) = ū

In other terms, the maximum entropy extension is characterized by


1
π(x) = exp(−θ T U (x))1Nuc (x)

and Eπ (U ) = u.

In particular, if Nu = ∅, then the maximum entropy extension is positive. If,


in addition, u = Ū for some observed sample, then it coincides with the maximum
likelihood estimator for (18.5). Notice that, in this case, the condition Nu , ∅ coin-
cide with the condition that there exists α such that α T U (x) ≥ α T u for all x, with
α T U (x) not constant. Indeed, assume that the latter condition is true. Then, if
Eπ (U ) = u, then Eπ (α T U ) = α T u, which is only possible if π(x) = 0 for all x such
432 CHAPTER 18. LEARNING GRAPHICAL MODELS

that α T U (x) < α T u. Such x’s exist by assumption, and therefore Nu , ∅. Conversely,
assume Nu , ∅. If condition (18.8) is not satisfied, then we have shown when dis-
cussing maximum likelihood that an optimal parameter for the exponential model
would exist, leading to a positive distribution for which Eπ (U ) = u, which is a con-
tradiction.

18.2.4 Iterative Scaling

Iterative scaling is a method that is well-adapted to learning distributions given by


(18.5), when U can be interpreted as a random histogram, or a collection of them.
More precisely, assume that for all x ∈ F (V ), one has

U (x) = (U1 (x), . . . , Uq (x))

with
q
X
Uj (x) = 1 and Uj (x) ≥ 0.
j=1

Let the parameter be given by θ = (θ1 , . . . , θq ). Assume that x1 , . . . , xN have been


observed, and let u ∈ Rd be a consistent assignment for U , with uj > 0 for j = 1, . . . , d
and such that Nu = ∅. Iterative scaling computes the maximum entropy extension of
Eπ (U ) = u, that we will denote π∗ . It is supported by the following lemma.
Lemma 18.6 Let π be a probability on F (V ) with π > 0 and define
d !Uj (x)
0 π(x) Y uj
π (x) =
ζ Eπ (Uj )
j=1

where ζ is chosen so that π0 is a probability. Then π0 > 0 and

KL(π∗ kπ0 ) − KL(π∗ kπ) ≤ −KL(ukEπ (U )) ≤ 0 (18.12)

Proof Note that, since π > 0, Eπ (Uj ) must also be positive for all j, since Eπ (Uj ) = 0
would otherwise imply Uj = 0 and uj = 0 for u to be consistent. So, π0 is well defined
and obviously positive.

We have
d
∗ 0 ∗
X

X uj
KL(π kπ ) − KL(π kπ) = log ζ − π (x) Uj (x) log
Eπ (Uj )
x∈F (V ) j=1
d
X uj
= log ζ − uj log
Eπ (Uj )
j=1
= log ζ − KL(ukEπ (U )).
18.2. LEARNING LOOPY MARKOV RANDOM FIELDS 433

(We have used the identity Eπ∗ (U ) = u.) So it suffices to prove that ζ ≤ 1. We have
d !Uj (x)
X Y uj
ζ = π(x)
Eπ (Uj )
x∈F (V ) j=1
d
X X uj
≤ π(x)Uj (x)
Eπ (Uj )
j=1 x∈F (V )
d
X uj
= Eπ (Uj ) = 1,
Eπ (Uj )
j=1

which proves the lemma


P Q wi(wePhave used the fact that, for xi , wi positive numbers with
i wi = 1, one has i xi ≤ i wi xi , which is a consequence of the concavity of the
logarithm). 

Consider the iterative algorithm


d !Uj (x)
πn (x) Y uj
πn+1 (x) =
ζn Eπn (Uj )
j=1

initialized with a uniform distribution. Equivalently, using the exponential formu-


lation, define, for j = 1, . . . , d,
Eθn (Uj )
θn+1,j = θn,j + log + KL(ukEθn (U )), (18.13)
uj
with πθ given by (18.5), initialized with θ0 = 0. Note that adding a term that is
independent of j to θ does not change the value of πθ , because the Uj ’s sum to 1.
The model is in fact overparametrized, and the addition of the KL divergence in
(18.13) ensures that di=1 ui θi = 0 at all steps.
P

This algorithm always reduces the Kullback-Leibler distance to the maximum en-
tropy extension. This distance being always positive, it therefore converges to a limit,
which, still according to lemma 18.6, is only possible if KL(ukEπn (U )) also tends to
0, that is Eπn (U ) → u. Since the space of probability distributions is compact, the
Heine-Borel theorem implies that the sequence πθn has at least one accumulation
point, that we now identify. If π is such a point, one must have Eπ (U ) = u. More-
over, we have π > 0, since otherwise KL(π∗ kπ) = +∞. To prove that π = π∗ (and
therefore the limit of the sequence), it remains to show that it can be put in the form
(18.5). For this, define the vector space V of functions v : F (V ) → R which can be
written in the form
Xg
v(x) = α0 + αj Uj (x).
j=1
434 CHAPTER 18. LEARNING GRAPHICAL MODELS

Since log πθn ∈ V for all n, so is its limit, and this proves that log π belongs to V . We
have obtained the following proposition.
Proposition 18.7 Assume that for all x ∈ F (V ), one has U (x) = (U1 (x), . . . , Ud (x)) with
d
X
Uj (x) = 1 and Uj (x) ≥ 0.
j=1

Let u be a consistent assignment for the expectation of U such thatNu = ∅. Then, the
algorithm described in (18.13) converges to the maximum entropy extension of u.

This is the iterative scaling algorithm. This method can be extended in a straight-
forward way to handle the maximum entropy extension for a family of functions
U (1) , . . . , U (K) , such that, for all x and for all k, U (k) (x) is a dk -dimensional vector such
that
dk
X (k)
Uj (x) = 1.
j=1

The maximum entropy extension takes the form


K
1
 X 
πθ (x) = exp − (θ (k) )T U (k) (x) ,

k=1

where θ (k) is dk -dimensional, and iterative scaling can then be implemented by up-
dating only one of these vectors at a time, using (18.13) with U = U (k) .

The restriction to U (x) providing a discrete probability distribution for all x is,
in fact, no loss of generality. This is because adding a constant to U does not change
the resulting exponential model in (18.5), and multiplying U by a constant can be
also compensated by dividing θ by the same constant in the same model. So, if u− is
a lower bound for minj,x Uj (x), one can replace
P U by (U − u− ), and therefore assume
that U ≥ 0, and if u+ isPan upper bound for j Uj (x), we can replace U by U /u+ and
therefore assume that j Uj (x) ≤ 1. Define

d
X
Ud+1 (x) = 1 − Uj (x) ≥ 0.
j=1

Then, the maximum entropy extension for (U1 , . . . , Ud ) with assignment (u1 , . . . , ud ) is
obviously also the extension for (U1 , . . . , Ud+1 ), with assignment (u1 , . . . , ud+1 ), where
d
X
ud+1 = 1 − uj ,
j=1
18.2. LEARNING LOOPY MARKOV RANDOM FIELDS 435

and the latter is in the form required in proposition 18.7. Note that iterative scaling
requires to compute the expectation of U1 , . . . , Ud before each update. These are not
necessarily available in closed form and may have to be estimated using Monte-Carlo
sampling.

18.2.5 Pseudo likelihood

Maximum likelihood estimation is a special case of minimal contrast estimators. These


estimators are based on the definition of a measure of dissimilarity, say C(πkπ̃), be-
tween two probability distributions π and π̃. The usual assumptions on C are that
C(πkπ̃) ≥ 0, with equality if and only if π = π̃, and that C is — at least — continuous
in π and π̃. Minimal contrast estimators approximate the problem of minimizing
θ 7→ C(πtrue kπθ ) over a parameter θ ∈ Θ, (which is not feasible, since πtrue , the true
distribution of the data, is unknown) by the minimization of θ 7→ C(π̂kπθ ) where
π̂ is the empirical distribution computed from observed data. Under mild condi-
tions on C, these estimators are generally consistent when N tends to infinity, which
means that the estimated parameter asymptotically (in the sample size N ) provides
the best (according to C) approximation of πtrue by the family πθ , θ ∈ Θ.

The contrast that is associated with maximum likelihood is the Kullback-Leibler


divergence. Indeed, given a sample x1 , . . . , xN , we have

KL(π̂kπθ ) = Eπ̂ log π̂ − Eπ̂ log πθ


XN
= Eπ̂ log π̂ − log πθ (xk ).
k=1

Since Eπ̂Plog π̂ does not depend on θ, minimizing KL(π̂kπθ ) is equivalent to maxi-


mizing N k=1 log πθ (xk ) which is the log-likelihood.

Maximum pseudo-likelihood estimators form another class of minimal contrast


estimators for graphical models. Given a distribution π on F (V ), define the local
specifications πs (x(s) | x(t) , t , s) to be conditional distributions at one vertex given
the others, and the contrast
X π
C(πkπ̃) = Eπ (log s ).
π̃s
s∈V

Because we can write, using standard properties of conditional expectations,


! X
X πs
C(πkπ̃) = Eπ Eπs (log ) = E(KL(πs (· | X (t) , t , s)kπ̃s (· | X (t) , t , s)),
π̃s
s∈V s∈V

we see that C(π, π̃) is always positive, and vanishes (under the assumption of positive
π) only if all the local specifications for π and π̃ coincide, and this can be shown
436 CHAPTER 18. LEARNING GRAPHICAL MODELS

to imply that π = π̃. Indeed, for any x, y ∈ F (V ), and choosing some order V =
{s1 , . . . , sn } on V , one can write
n
π(x) Y π(x(sk ) |x(s1 ) , . . . , x(sk−1 ) , y (sk+1 ) , . . . , y (sn ) )
=
π(y) π(x(sk ) |x(s1 ) , . . . , x(sk−1 ) , y (sk+1 ) , . . . , y (sn ) )
k=1
P
and the ratios π(x)/π(y), for x ∈ F (V ), combined with the constraint that x π(x) = 1
uniquely define π.

So C is a valid contrast and


X N
XX (s) (t)
C(π̂kπθ ) = Eπ̂ log π̂s − log πθ,s (xk |xk , t , s).
s∈V s∈V k=1

This yields the maximum pseudo-likelihood estimator (or pseudo maximum likeli-
hood) defined as a maximizer of the function (called log-pseudo-likelihood)
N
XX (s) (s)
θ 7→ log πθ,s (xk |xk , t , s).
s∈V k=1

Although maximum likelihood is known to provide the most accurate approxi-


mations in many cases, maximum of pseudo likelihood has the important advantage
to be, most of the time, computationally feasible. This is because, for a model like
(18.5), local specifications are given by
exp(−θ T U (x))
πθ,s (x(s) | x(t) , t , s) = P (s) (V \s) ))
.
T
y (s) ∈Fs exp(−θ U (y ∧ x

and therefore include no intractable normalizing constant. Maximum of pseudo-


likelihood estimators can be computed using standard maximization algorithms.
For exponential models such as (18.5), the log-pseudo-likelihood is, like the log-
likelihood, a concave function.

18.2.6 Continuous variables and score matching

The methods that were presented so far for discrete variables formally generalize to
more general state spaces, even though consistency or convergence issues in non-
compact cases can be significantly harder to address. Score matching is a parameter
estimation method that was introduced in [95] and was designed, in its original
version, to estimate parameters for statistical models taking the form
1
πθ (x) = exp (−F(x, θ))
C(θ)
18.2. LEARNING LOOPY MARKOV RANDOM FIELDS 437

with x ∈ Rd . We assume below suitable integrability and differentiability conditions,


in order to justify differentiation under integrals whenever they are needed. The
“score function” is defined as

s(x, θ) = −∇x log πθ (x) = ∇x F(x, θ)

where ∇x denotes the gradient with respect to the x variable. Letting πtrue denote
the p.d.f. of the true data distribution (not necessarily part of the statistical model),
score matching minimizes
Z
f (θ) = |s(x, θ) − strue (x)|2 πtrue (x)dx
Rd

where strue = −∇ log πtrue . This integral can be restricted to the support of πtrue , if
we don’t want to assume that πtrue is non-vanishing. Note, however that f (θ) = 0
implies that log πθ (·, θ) = log πtrue πtrue -almost everywhere, so that πθ (x) = cπtrue (x)
for some constant c and x in the support of πtrue . Only if πtrue (x) > 0 for all x ∈ Rd ,
can we conclude that this requires πθ = πtrue .

Expanding the squared norm and applying the divergence theorem yield
Z Z
2
f (θ) = |∇x log πθ (x)| πtrue (x)dx − 2 ∇x log πθ (x)T ∇πtrue (x)dx
d Rd
ZR
+ |strue (x)|2 πtrue (x)dx
d
ZR Z Z
2
= |∇x log πθ (x)| πtrue (x)dx + 2 T
∆ log πθ (x) πtrue (x)dx + |strue (x)|2 dx
Rd Rd Rd

To justify the use of the divergence theorem, one needs to assume two derivatives in
the log-likelihoods with sufficient decay at infinity (see Hyvärinen and Dayan [95]
for details). This shows that minimizing f is equivalent to minimizing
Z Z
2
g(θ) = |∇x log πθ (x)| πtrue (x)dx + 2 ∆ log πθ (x)T πtrue (x)dx
Rd Rd
2
= E(|∇x log πθ (X)| + 2∆ log πθ (X)).

In this form, the objective function can be approximated by a sample average, so


that, given observed data x1 , . . . , xN , one can define the score-matching estimator as
a minimizer of
XN  
|∇x log πθ (xk )|2 + 2∆ log πθ (xk ) . (18.14)
k=1

Remark 18.8 The method can be adapted to deal with discrete variables replacing
derivatives with differences. Let X take values in a finite set, RX , on which a graph
438 CHAPTER 18. LEARNING GRAPHICAL MODELS

structure can be defined, writing x ∼ y if x and y are connected by an edge. For


example, if X is itself a Markov random field on a graph G = (V , E), so that RX =
F (V ), one can define x ∼ y if and only if x(s) = y (s) for all but one s ∈ V . One can then
define the score function
π (y)
sθ (x, y) = 1 − θ
πθ (x)
defined over all x, y ∈ RX such that x ∼ y. Now the score matching functional is
 
X X 
f (θ) = |sθ (x, y) − s∗ (x, y)|2  π∗ (x),
 

 
x∈RX y∼x

whose minimization is, after reordering terms, equivalent to that of


XX πθ (y) 2 X X π (x) π (y) !
g(θ) = 1− π∗ (x) + 2 θ
− θ π∗ (x).
πθ (x) π θ (y) πθ (x)
x∈RX y∼x y∼x
x∈RX

Based on training data, a discrete score matching estimator is a minimizer of


N X N X
πθ (y) 2
!
X X πθ (xk ) πθ (y)
1− +2 − . (18.15)
πθ (xk ) πθ (y) πθ (xk )
k=1 y∼xk y∼x
k=1 k 

18.3 Incomplete observations for graphical models

18.3.1 The EM Algorithm

Missing variable sin the context of graphical models may correspond to real pro-
cesses that cannot be measured, which is common, for example, with biological data.
They may be more conceptual objects that are interpretable but are not parts of the
data acquisition process, like phonemes in speech recognition, or edges and labels
in image processing and object recognition. They may also be variables that have
been added to the model to increase its parametric dimension without increasing
the complexity of the graph. However, as we will see, dealing with incomplete or
imperfect observations brings the parameter estimation problem to a new level of
difficulty.

Since it is the most common approach to address incomplete or noisy observa-


tions, we start with a description of how the EM algorithm (Algorithm 17.1) applies
to graphical models, and of its limitations. We assume a graphical model on an
undirected graph G = (V , E), in which we assume that V is separated in two non-
intersecting subsets, V = S ∪ H. Letting X be a G-Markov random field, the part X (S)
is assumed to be observable, and X (H) is hidden.
18.3. INCOMPLETE OBSERVATIONS FOR GRAPHICAL MODELS 439

We assume that X takes values in F (V ), where we still denote by Fs the sets in


which Xs takes values for s ∈ V . We let the model distribution belong to an expo-
nential family, with
1  
πθ (x) = exp − θ T U (x) , x ∈ F (V ). (18.16)
Z(θ)
(S) (S)
Assume that an N -sample x1 , . . . , xN is observed over S. Since

log πθ (x) = − log Z(θ) − θ T U (x),

the transition from θn to θn+1 in Algorithm 17.1 is done by maximizing

− log Z(θ) − θ T Ūn (18.17)

where
N
1X (S)
Ūn = Eθn (U (X) | X (S) = xk ). (18.18)
N
k=1
So, the M-step of the EM, which maximizes (18.17), coincides with the complete-
data maximum-likelihood problem for which the empirical average of U is replaced
by the average of its conditional expectations given the observations, as given in
(18.18), which constitutes the E-step. As a consequence, a strict application of the
EM algorithm for graphical models is unfeasible, since each step requires running
an algorithm of similar complexity maximum likelihood for complete data, that we
already identified as a challenging, computationally costly problem. The same re-
mark holds for the SAEM algorithm of section 17.4.3, which also requires solving a
maximum likelihood problem at each iteration.

18.3.2 Stochastic gradient ascent

The stochastic gradient ascent described in section 18.2.2 can be extended to partial
observations [210], even though it loses the global convergence guarantee that re-
sulted from the concavity of the log-likelihood for complete observations. Indeed,
applying the computation of section 17.5.2, to a model given by (18.16), we get using
proposition 18.4,

∂θ log ψθ = Eθ (Eθ (U ) − U | X (S) = x(S) ) = Eθ (U ) − Eθ (U | X (S) = x(S) )

where we ψθ (x(S) ) denotes the marginal distribution of πθ on S.

Let πθ (x(H) | x(S) ) denotes the conditional probability P (X (H) = x(H) | X (S) = s(S) )
for the distribution πθ , therefore taking the form
1  
πθ (x(H) | x(S) ) = exp −θ T
U (x (S)
∧ x (H)
) .
Z̃(θ, x(S) )
440 CHAPTER 18. LEARNING GRAPHICAL MODELS

Assume given an ergodic transition probability p_θ on F(V), and a family of ergodic
transition probabilities p_θ^{x^(S)}, x^(S) ∈ F(S), such that the invariant distribution
of p_θ is π_θ, and that of p_θ^{x^(S)} is π_θ(· | x^(S)). Then the following SGA algorithm
can be used to estimate θ.

Algorithm 18.1
Start the algorithm with an initial parameter θ(0) and initial configurations x(0) and
x_k^(H)(0), k = 1, . . . , N. Then, at step n:

(SGH1) Sample from the distribution p_{θ(n)}( x(n), · ) to obtain a new configuration
x(n+1) ∈ F(V).

(SGH2) For k = 1, . . . , N, sample from the distribution p_{θ(n)}^{x_k^(S)}( x_k^(H)(n), · )
to obtain a new configuration x_k^(H)(n+1) over the hidden vertices.

(SGH3) Update the parameter using

θ(n+1) = θ(n) + γ(n+1) ( U( x(n+1) ) − (1/N) Σ_{k=1}^N U( x_k^(S) ∧ x_k^(H)(n+1) ) ).   (18.19)
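
A minimal sketch of this procedure is given below; the kernels, the sufficient statistics U, and the helper combine are assumptions that must be supplied for a given model:

import numpy as np

def sga_hidden(theta0, x0, xH0, obs_S, U, combine,
               sample_joint, sample_hidden_given_obs,
               n_iter=1000, gamma0=0.1):
    """Sketch of Algorithm 18.1 (SGA with hidden vertices).

    Assumed inputs, hypothetical and model-dependent:
      U(x)                     -- sufficient statistics of (18.16),
      combine(x_S, x_H)        -- the full configuration x_S ^ x_H on V,
      sample_joint             -- one step of the ergodic kernel p_theta,
      sample_hidden_given_obs  -- one step of p_theta^{x_S}.
    """
    theta = np.array(theta0, dtype=float)
    x, xH = x0, list(xH0)
    N = len(obs_S)
    for n in range(1, n_iter + 1):
        x = sample_joint(theta, x)                              # (SGH1)
        xH = [sample_hidden_given_obs(theta, obs_S[k], xH[k])
              for k in range(N)]                                # (SGH2)
        cond_avg = np.mean([U(combine(obs_S[k], xH[k]))
                            for k in range(N)], axis=0)
        theta = theta + (gamma0 / np.sqrt(n)) * (U(x) - cond_avg)  # (18.19)
    return theta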

18.3.3 Pseudo-EM Algorithm

The EM update

θ_{n+1} = argmax_θ Σ_{k=1}^N E_{θ_n}( log π_θ(X) | X^(S) = x_k^(S) )

being challenging for Markov random fields, it is tempting to replace the log-likelihood
in the expectation by another contrast, such as the log-pseudo-likelihood. An approach
similar to the one described here was introduced in Chalmond [50], for situations in
which the conditional distribution of X^(S) given X^(H) is “simple enough” (for example,
if the variables X_s, s ∈ S, are conditionally independent given X^(H)) and the
cardinality of the sets F_s, s ∈ H, is small (binary, or ternary, variables).

The algorithm has the following variational interpretation. Fix x^(S) ∈ F(S) and
s ∈ H. Also denote µ_s = 1/|F(H \ {s})|. If q is a transition probability from F(H \ {s})
to F_s, let

Δ_θ^(s)(q, x^(S)) = Σ_{y∈F(H)} log( π_{θ,s}( y^(s) ∧ x^(S) | y^(H\{s}) ) / ( q( y^(H\{s}), y^(s) ) µ_s ) ) q( y^(H\{s}), y^(s) ) µ_s.   (18.20)

This function is concave in q, since its first partial derivative with respect to
q( y^(H\{s}), y^(s) ) (for each y ∈ F(H)) is given by

µ_s log π_{θ,s}( y^(s) ∧ x^(S) | y^(H\{s}) ) − µ_s log( q( y^(H\{s}), y^(s) ) µ_s ) − µ_s,

so that its Hessian is the diagonal matrix with negative entries −µ_s / q( y^(H\{s}), y^(s) ).
Using Lagrange multipliers to express the constraints Σ_{y^(s)∈F_s} q( y^(H\{s}), y^(s) ) = 1
for all y^(H\{s}), we find that Δ_θ^(s)(q, x^(S)) is maximized when q( y^(H\{s}), y^(s) ) is
proportional to π_{θ,s}( y^(s) ∧ x^(S) | y^(H\{s}) ), yielding

q( y^(H\{s}), y^(s) ) = π_{θ,s}( y^(s) | x^(S) ∧ y^(H\{s}) ).

Now, consider the problem of maximizing

Σ_{k=1}^N Σ_{s∈H} Δ_θ^(s)( q_k^(s), x_k^(S) )   (18.21)

with respect to θ and q_k^(s), k = 1, . . . , N, s ∈ H. Consider an iterative maximization
scheme in which, from a current parameter θ_n, one first maximizes (18.21) with
respect to the transition probabilities q_k^(s), then with respect to θ to obtain θ_{n+1}.
This scheme provides the iteration

θ_{n+1} = argmax_θ Σ_{k=1}^N Σ_{s∈H} Σ_{y∈F(H)} log π_{θ,s}( y^(s) ∧ x_k^(S) | y^(H\{s}) ) π_{θ_n,s}( y^(s) | x_k^(S) ∧ y^(H\{s}) ) µ_s.

18.3.4 Partially-observed Bayesian networks on trees

We now consider the situation in which the joint distribution of X = X (S) ∧ X (H) is a
Bayesian network over a directed acyclic graph G = (V , E).
Assume that x_1^(S), . . . , x_N^(S) are observed. The parameter θ is the collection of all
p_s( x^(pa(s)), x^(s) ) for s ∈ V. Define the random variables I_{s,x}(y) equal to one if
y^({s}∪pa(s)) = x^({s}∪pa(s)) and zero otherwise. We can write

log π(y) = Σ_{s∈V} log p_s( y^(pa(s)), y^(s) ) = Σ_{s∈V} Σ_{x^({s}∪pa(s)) ∈ F({s}∪pa(s))} log p_s( x^(pa(s)), x^(s) ) I_{s,x}(y).

This implies that

Σ_{k=1}^N E_{θ_n}( log π( x_k^(S), X^(H) ) | X^(S) = x_k^(S) )
   = Σ_{s∈V} Σ_{x^({s}∪pa(s)) ∈ F({s}∪pa(s))} log p_s( x^(pa(s)), x^(s) ) Σ_{k=1}^N E_{θ_n}( I_{s,x}(X) | X^(S) = x_k^(S) )
   = Σ_{s∈V} Σ_{x^({s}∪pa(s)) ∈ F({s}∪pa(s))} log p_s( x^(pa(s)), x^(s) ) Σ_{k=1}^N π_{θ_n}( x^({s}∪pa(s)) | X^(S) = x_k^(S) ).

The EM iteration at step n then is

p_s^(n+1)( x^(pa(s)), x^(s) ) = (1/Z_s( x^(pa(s)) )) Σ_{k=1}^N π_{θ_n}( x^({s}∪pa(s)) | X^(S) = x_k^(S) )

with

π_{θ_n}(x) = Π_{s∈V} p_s^(n)( x^(pa(s)), x^(s) ),

Z_s being a normalization constant.

If the estimation is solved with a Dirichlet prior Dir( 1 + a_s( x^(s), x^(pa(s)) ) ), the
update formula becomes

p_s^(n+1)( x^(pa(s)), x^(s) ) = (1/Z_s( x^(pa(s)) )) ( a_s( x^(s), x^(pa(s)) ) + Σ_{k=1}^N π_{θ_n}( x^({s}∪pa(s)) | X^(S) = x_k^(S) ) ).   (18.22)

This algorithm is very simple when the conditional distributions π_{θ_n}( x^({s}∪pa(s)) |
X^(S) = x_k^(S) ) can be computed easily, which is not always the case for a general
Bayesian network, since conditional distributions do not always retain a Bayesian-network
structure. The computation is simple enough for trees, however, since conditional tree
distributions are still trees (or forests). More precisely, the conditional distribution
given the observed variables can be written in the form

π( y^(H) | x^(S) ) = (1/Z(x^(S))) Π_{s∈H} φ_{s,x}( y^(s) ) Π_{t∼s, {s,t}⊂H} φ_{st}( y^(s), y^(t) )

with φ_{s,pa(s)}( y^(s), y^(pa(s)) ) = p_s( y^(pa(s)), y^(s) ) and, letting φ_s( y^(s) ) = p_s( y^(s) )
if pa(s) = ∅ and 1 otherwise,

φ_{s,x}( y^(s) ) = φ_s( y^(s) ) Π_{t∼s, t∈S} φ_{st}( y^(s), x^(t) ).

So, the marginal joint distribution of a vertex and its parents is directly given by
belief propagation, using the interactions just defined. This training algorithm is
summarized below.

Algorithm 18.2 (Learning tree distributions with hidden variables)

Start with some initial guess of the conditional probabilities (for example, those
given by the prior). Then iterate the following two steps, which provide the transition
from θ_n to θ_{n+1}.

(1) For k = 1, . . . , N, use belief propagation (or sum-product) to compute all
conditional probabilities π_{θ_n}( x^({s}∪pa(s)) | X^(S) = x_k^(S) ). Note that these
probabilities can be 0 or 1 when s ∈ S and/or pa(s) ⊂ S.

(2) Use (18.22) to compute the next set of parameters.
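
As an illustration, step (2) reduces to adding the Dirichlet counts to the conditional marginals summed over the sample and normalizing; a minimal sketch, with array shapes that are our own assumptions, is:

import numpy as np

def em_m_step(cond_marginals, prior_counts):
    """Update (18.22).

    cond_marginals[s]: array of shape (N, n_pa, n_s) holding
        pi_{theta_n}(x^(pa(s)), x^(s) | X^(S) = x_k^(S)) for k = 1..N
        (computed by belief propagation in step (1));
    prior_counts[s]:   array of shape (n_pa, n_s) holding the counts a_s.
    Returns the updated tables p_s^(n+1)(x^(s) | x^(pa(s))).
    """
    new_tables = {}
    for s, marg in cond_marginals.items():
        counts = prior_counts[s] + marg.sum(axis=0)   # a_s + sum over samples
        # normalize over x^(s) for each parent configuration x^(pa(s))
        new_tables[s] = counts / counts.sum(axis=1, keepdims=True)
    return new_tables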

The tree case includes the important example of hidden Markov models, which
are defined as follows. S and H are ordered, with the same cardinality, say S = {s1 , . . . , sq }
and H = {h1 , . . . , hq }. Edges are (h1 , h2 ), . . . , (hq−1 , hq ) and (h1 , s1 ), . . . , (hq , sq ). The in-
terpretation generally is that the hidden variables, hs , are the variables of interest,
and behave like a Markov chain, and that the observations, xs , are either noisy or
transformed versions of them. A major application is in speech recognition, where
the hs ’s are labels that represent specific phonemes (little pieces of spoken words)
and the xs ’s are measured signals. The transitions between hidden variables then
describe how phonemes are likely to appear in sequence for a given language, and
those between hidden and observed variables describe how each phoneme is likely
to be pronounced and heard.

18.3.5 General Bayesian networks

The algorithm in the general case can move from tractable to intractable depending
on the situation. This must generally be handled on a case-by-case basis, by analyzing
the conditional structure of a given model knowing the observations.

In practice, it is always possible to use loopy belief propagation to obtain some
approximation of the conditional probabilities, even though the algorithm is not
guaranteed to converge to the correct marginals. When feasible, junction trees can be
used, too. Monte-Carlo sampling is also an option, although computationally demanding.
Chapter 19

Deep Generative Methods

19.1 Normalizing flows

19.1.1 General concepts

We develop, in this chapter, methods that model stochastic processes using a feed-forward
approach, generating complex random variables through non-linear transformations of
simpler ones. Many of these methods can be seen as instances of structural equation
models (SEMs), described in section 16.3, with, for deep-learning implementations,
high-dimensional parametrizations of (16.8).

We start with the formally simple case where the modeled variable takes values
in Rd and is modeled as
X = g(Z)
where Z also takes values in Rd , with a known distribution, and g is C 1 , invertible,
with a C 1 inverse on Rd , i.e., is a diffeomorphism of Rd . Let us denote by h the inverse
of g.

If Z has a p.d.f. fZ with respect to Lebesgue’s measure, then, using the change of
variable formula, the p.d.f. of X is
fX (x) = fZ (h(x)) | det ∂x h(x)|.

Now, given a training set T = (x1 , . . . , xN ), the log-likelihood, considered as a func-


tion of h, is given by
ℓ(h) = Σ_{k=1}^N log f_Z( h(x_k) ) + Σ_{k=1}^N log | det ∂_x h(x_k) |.   (19.1)
This expression should then be maximized with respect to h, subject to some restric-
tions or constraints to avoid overfitting.


19.1.2 A greedy computation

One can define a rich class of diffeomorphisms through iterative compositions of


simple transformations. This framework was introduced in [188], where a greedy
approach was suggested to build such compositions. The method was termed “normalizing
flows,” since it creates a discrete flow of diffeomorphisms that transforms the
data into a sample from a normal distribution.

We quickly describe the basic principles of the algorithm. One starts with a
parametrized family, say (ψα , α ∈ A) of diffeomorphisms of R. Such families are
relatively easy to design, one example proposed in [188] being a smoothed version
of the piecewise linear function

u 7→ v0 + (1 − σ )u + γ|(1 − σ )u − u0 |

which is increasing as soon as 0 ≤ max(σ, γ) < 1. The smoothed version has an
additional parameter, ε, and takes the form

u ↦ v_0 + (1 − σ)u + γ √( ε² + ((1 − σ)u − u_0)² ).

This transformation is parametrized by α = (v_0, σ, γ, u_0, ε). Other families of
parametrized transformations can be designed.

A multivariate transformation φ_{α,U} : R^d → R^d can then be associated with families
α = (α_1, . . . , α_d) and orthogonal matrices U by taking

φ_{α,U}(x) = ( ψ_{α_1}( y^(1) ), . . . , ψ_{α_d}( y^(d) ) )^T

with y = U x.
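
For concreteness, here is a minimal sketch of the one-dimensional map ψ_α and its multivariate extension φ_{α,U} (the function names are ours):

import numpy as np

def psi(u, alpha):
    """Smoothed increasing piecewise-linear map, alpha = (v0, sigma, gamma, u0, eps)."""
    v0, sigma, gamma, u0, eps = alpha
    return v0 + (1 - sigma) * u + gamma * np.sqrt(eps**2 + ((1 - sigma) * u - u0)**2)

def phi(x, alphas, U):
    """Apply psi coordinate-wise in the rotated frame y = U x."""
    y = U @ x
    return np.array([psi(y[i], a) for i, a in enumerate(alphas)])

# Example with a random rotation obtained from a QR decomposition.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((3, 3)))
alphas = [(0.0, 0.2, 0.5, 0.0, 0.1)] * 3
print(phi(np.ones(3), alphas, U))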

The algorithm in [188] is initialized with h0 = idRd and updates the transforma-
tion at step n according to
hn = ϕαn ,Un ◦ hn−1 .

In this update, Un is generated as a random rotation matrix, and αn is determined


as a gradient ascent update (starting from α = 0) for the maximization of

α 7→ `(ϕα,Un ◦ hn−1 ).

(Here, the current value hn−1 is not revisited, therefore providing a “greedy” opti-
mization method.)

Letting z_{n,k} = h_n(x_k), the chain rule implies that

ℓ( φ_{α,U_n} ◦ h_{n−1} ) = Σ_{k=1}^N log f_Z( φ_{α,U_n}( z_{n−1,k} ) ) + Σ_{k=1}^N log | det ∂_x φ_{α,U_n}( z_{n−1,k} ) |
   + Σ_{k=1}^N log | det ∂_x h_{n−1}(x_k) |.

Since the last term does not depend on α, we see that it suffices to keep track of the
“particle” locations z_{n−1,k} to be able to compute α_n. Note also that these locations
are easily updated with z_{n,k} = φ_{α_n,U_n}( z_{n−1,k} ).

19.1.3 Neural implementation

This iterated composition of diffeomorphisms obviously provides a neural architec-


ture similar to those discussed in chapter 11. Fixing the number of iterations to be,
say, m, one can consider families of diffeomorphisms (φ_w) indexed by a parameter w
(we had w = (α, U) in the previous discussion), and optimize (19.1) over all functions
h taking the form h = φ_{w_m} ◦ · · · ◦ φ_{w_1}. Letting z_{j,k} = φ_{w_j} ◦ · · · ◦ φ_{w_1}(x_k) for j ≤ m (with
z0,k = xk ), we can write

ℓ(h) = Σ_{k=1}^N log f_Z( z_{m,k} ) + Σ_{k=1}^N Σ_{j=1}^m log | det ∂_x φ_{w_j}( z_{j−1,k} ) |.

Normalizing flows in this form are described in [162, 108, 149]. The gradient of `
with respect to the parameters w1 , . . . , wm can be computed by backpropagation. We
note however that, unlike typical neural implementations, the parameters may come
with specific constraints, such as U ∈ Od (R) when w = (α, U ), so that the gradient
and associated displacement may have to be adapted compared to standard gradient
ascent implementations (see section 21.6.3 for a discussion of first-order implemen-
tations of gradient methods for functions of orthogonal matrices, and [1] for more
general methods on optimization over matrix groups).

19.1.4 Time-continuous version

In section 11.6, we described how diffeomorphisms could be generated as flows of


differential equations, and this remark can be used to provide a time-continuous
version of normalizing flows. Using (11.5), one generates trajectories z(·) by solving
over, say, [0, T ]
∂t z(t) = ψw(t) (z(t))

with z(0) = x for some function w : t 7→ w(t). Letting z(t) = hw (t, x) (which defines
hw ), we know that, under suitable assumptions on ψ, the mapping x 7→ hw (t, x) is a
diffeomorphism of Rd . One can then maximize

ℓ( h_w(T, ·) ) = Σ_{k=1}^N log f_Z( h_w(T, x_k) ) + Σ_{k=1}^N log | det ∂_x h_w(T, x_k) |

with respect to the function w. Let zk (t) = hw (t, xk ) and Jk (t) = log | det ∂x hw (t, xk )|. We
have, by definition
∂t zk (t) = ψw(t) (zk (t))
with zk (0) = xk . One can also show that

∂t Jk (t) = ∇ · ψw(t) (zk (t))

with Jk (0) = 0, where the r.h.s. is the divergence of ψw(t) evaluated at zk (t). We
provide a quick (and formal) justification of this fact. First note that differentiating
∂t hw (t, x) = ψw(t) (hw (t, x)) with respect to x yields

∂t ∂x hw (t, x) = ∂x ψw(t) (hw (t, x))∂x hw (t, x).

The mapping J : A 7→ log | det(A)| is differentiable on the set of invertible matrices


and is such that dJ (A)H = trace(A−1 H). Applying the chain rule, we find

∂t log | det ∂x hw (t, x)| = trace(∂x hw (t, x)−1 ∂x ψw(t) (hw (t, x))∂x hw (t, x))
= trace(∂x ψw(t) (hw (t, x))) = ∇ · ψw(t) (hw (t, x)).

From this, it follows that the time-continuous normalizing flow problem can be
reformulated as maximizing

Σ_{k=1}^N log f_Z( z_k(T) ) + Σ_{k=1}^N J_k(T)

subject to ∂t zk (t) = ψw(t) (zk (t)), ∂t Jk (t) = ∇ · ψw(t) (zk (t)), zk (0) = xk , Jk (0) = 0. This is
an optimal control problem, whose analysis can be done similarly to that made in
section 11.6.1, provided that ∇ · ψw(t) can be expressed in closed form.
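
A crude Euler discretization of this augmented system, assuming that a field ψ and its divergence are available in closed form, could look as follows (a practical implementation would use a higher-order ODE solver):

import numpy as np

def integrate_flow(x0, psi, div_psi, T=1.0, n_steps=100):
    """Euler integration of dz/dt = psi(z, t), dJ/dt = div psi(z, t), J(0) = 0."""
    dt = T / n_steps
    z, J = np.array(x0, dtype=float), 0.0
    for i in range(n_steps):
        t = i * dt
        J += dt * div_psi(z, t)
        z = z + dt * psi(z, t)
    return z, J    # z(T) and J(T) = log |det dz(T)/dx|

# Toy check with a linear field psi(z) = A z, whose divergence is trace(A),
# so that J(T) should be close to T * trace(A).
A = np.array([[0.0, 1.0], [-1.0, -0.1]])
z, J = integrate_flow([1.0, 0.0], lambda z, t: A @ z, lambda z, t: np.trace(A))
print(z, J)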

Note that the inverse of hw (T , ·), which provides the generative model going from
Z to X can also be obtained as the solution of an ODE. Namely, if one solves the
differential equation
∂t x(t) = −ψw(T −t) (x(t))
with initial condition x(0) = z, then x(T) solves the equation h_w(T, x(T)) = z.

19.2 Non-diffeomorphic models and variational autoencoders

19.2.1 General framework

The previous discussion addressed the situation X = g(Z) when g is a diffeomor-


phism, which required, in particular, that X and Z are real vectors with identical di-
mensions. This may not always be desirable, as one may prefer a small-dimensional
variable Z (in the spirit of the factor analysis methods discussed in chapter 21), or
a high-dimensional Z to increase, for example, the modeling power. In addition, the
observation variables may be discrete, which precludes the use of the change of vari-
ables formula. In such cases, Z has to be treated as a hidden variable using one of
the methods discussed in chapter 17.

It will be convenient to model the generative process in the form of a conditional
distribution of X given Z rather than a deterministic function. We place ourselves
in the framework of chapter 17 (with slightly modified notation) and let R_X and R_Z
denote the measured spaces on which X and Z take their values, with measures
denote the measured spaces over where X and Z take their values, with measures
µX and µZ , and assume that the conditional distribution of X given Z = z has den-
sity fX (x | z, θ) with respect to µX , for some parameter θ. We also assume that Z
has a distribution with density f_Z with respect to µ_Z, which we assume given and
unparametrized. One can then directly apply the algorithms provided in chapter 17,
and in particular the variational methods described in section 17.4.4 with an appro-
priate definition of the approximation of the conditional density of Z given X. An
important example in this context is provided by variational autoencoders (VAEs)
that we now present.

19.2.2 Generative model for VAEs

VAEs [104, 105] model X ∈ R^d as X = g(Z, θ) + ε, where ε is a centered Gaussian noise
with covariance matrix Q. The function g is typically nonlinear, and VAEs have been
introduced with this function modeled as a deep neural network (see chapter 11).
Letting ϕN ( · ; 0, Q) denote the p.d.f. of the Gaussian distribution N (0, Q), the con-
ditional distribution of X given Z = z has density

f_X(x | z, θ) = φ_N( x − g(z, θ) ; 0, Q )

with respect to Lebesgue’s measure on Rd .

Following the procedure in section 17.4.4, we define an approximation of the


conditional distribution of Z given X. Assuming that Z ∈ Rp , we let this distri-
bution be N (µ(x, w), Σ(x, w)) for some functions µ and Σ, w being a parameter. To
ensure that Σ ⪰ 0, we will represent it in the form Σ(x, w) = S(x, w)², where S is a sym-
metric matrix. In [104], both functions µ and S are represented as neural networks

parametrized by w. The joint density of X and Z is such that


log f_{X,Z}(x, z ; θ, Q) = log φ_N( x − g(z, θ) ; 0, Q ) + log f_Z(z)
   = −(1/2)( x − g(z, θ) )^T Q^{−1} ( x − g(z, θ) ) − (1/2) log det Q − (d/2) log 2π + log f_Z(z).
We also have
log φ_N( z ; µ(x, w), S(x, w)² ) = −(1/2)( z − µ(x, w) )^T S(x, w)^{−2} ( z − µ(x, w) ) − log det S(x, w) − (p/2) log 2π.

We can then rewrite the algorithm in (17.15) as

θ_{n+1} = θ_n + γ_{n+1} ∂_θ log f_{X,Z}( X_{n+1}, Z_{n+1} ; θ_n, Q_n )
Q_{n+1} = Q_n + γ_{n+1} ∂_Q log f_{X,Z}( X_{n+1}, Z_{n+1} ; θ_n, Q_n )           (19.2)
w_{n+1} = w_n + γ_{n+1} log( f_{X,Z}( X_{n+1}, Z_{n+1} ; θ_n, Q_n ) / φ_N( Z_{n+1} ; µ(X_{n+1}, w_n), S(X_{n+1}, w_n)² ) )
                × ∂_w log φ_N( Z_{n+1} ; µ(X_{n+1}, w_n), S(X_{n+1}, w_n)² )

where X_{n+1} is drawn uniformly from the training data and

Z_{n+1} ∼ N( µ(X_{n+1}, w_n), S(X_{n+1}, w_n)² ).
The derivatives in this system can be computed from those of g, µ and S (typically
involving back-propagation) and the expression of the derivatives of the determinant
and inverse of a matrix provided in (1.4) and (1.6).

The computations can be simplified if one assumes that fZ is the p.d.f. of a stan-
dard Gaussian, i.e., fZ = ϕN (·; 0, IdRp ). Indeed, in that case, the integral in (17.11),
which is, using the current notation,
∫_{R^p} log( φ_N( x − g(z, θ) ; 0, Q ) φ_N( z ; 0, Id_{R^p} ) / φ_N( z ; µ(x, w), S(x, w)² ) ) φ_N( z ; µ(x, w), S(x, w)² ) dz,   (19.3)
can be partially computed. For any two p-dimensional Gaussian p.d.f.’s, one has
∫_{R^p} log φ_N( z ; µ_1, Σ_1 ) φ_N( z ; µ_2, Σ_2 ) dz = −(1/2) trace( Σ_1^{−1} Σ_2 ) − (1/2)( µ_2 − µ_1 )^T Σ_1^{−1} ( µ_2 − µ_1 )
   − (1/2) log det(Σ_1) − (p/2) log(2π).   (19.4)
2 2
As a consequence, (19.3) becomes

−(1/2) E_w( ( X − g(Z, θ) )^T Q^{−1} ( X − g(Z, θ) ) ) − (1/2) log det Q − (d/2) log 2π
   − (1/2) E_w( trace( S(X, w)² ) + |µ(X, w)|² ) + E_w( log det( S(X, w) ) ) + p/2,   (19.5)

where Ew denotes the expectation for the random variable (X, Z) where X follows a
uniform distribution over training data and the conditional distribution of Z given
X = x is N (µ(x, w) , S(x, w)2 ).

The algorithm proposed in Kingma and Welling [104] introduces a change of


variable Z = µ(X, w) + S(X, w)U where U ∼ N (0, IdRp ), rewriting (19.5) as

−(1/2) E( ( X − g( µ(X, w) + S(X, w)U, θ ) )^T Q^{−1} ( X − g( µ(X, w) + S(X, w)U, θ ) ) )
   − (1/2) E_w( trace( S(X, w)² ) + |µ(X, w)|² ) + E_w( log det( S(X, w) ) )
   − (1/2) log det Q − (d/2) log 2π + p/2,   (19.6)

with a modified version of (19.2). Letting

F(θ, Q, w, x, u) = −(1/2)( x − g( µ(x, w) + S(x, w)u, θ ) )^T Q^{−1} ( x − g( µ(x, w) + S(x, w)u, θ ) )
   − (1/2) log det Q − (1/2) trace( S(x, w)² ) − (1/2) |µ(x, w)|² + log det( S(x, w) ),
the resulting algorithm is

θ_{n+1} = θ_n + γ_{n+1} ∂_θ F( θ_n, Q_n, w_n, X_{n+1}, U_{n+1} )
Q_{n+1} = Q_n + γ_{n+1} ∂_Q F( θ_n, Q_n, w_n, X_{n+1}, U_{n+1} )           (19.7)
w_{n+1} = w_n + γ_{n+1} ∂_w F( θ_n, Q_n, w_n, X_{n+1}, U_{n+1} )

where X_{n+1} is drawn uniformly from the training data and U_{n+1} ∼ N(0, Id_{R^p}).
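
A schematic version of these updates is sketched below, with the three partial gradients of F assumed to be returned by a single oracle (in practice, automatic differentiation through the networks g, µ and S):

import numpy as np

def vae_updates(theta, Q, w, data, p, grad_F,
                n_iter=10_000, gamma0=1e-3, seed=0):
    """Sketch of the stochastic updates (19.7); grad_F(theta, Q, w, x, u) is
    assumed to return the triple of partial gradients of F."""
    rng = np.random.default_rng(seed)
    for n in range(1, n_iter + 1):
        x = data[rng.integers(len(data))]   # X_{n+1}: uniform over training data
        u = rng.standard_normal(p)          # U_{n+1} ~ N(0, Id)
        g_theta, g_Q, g_w = grad_F(theta, Q, w, x, u)
        gamma = gamma0 / np.sqrt(n)
        theta, Q, w = theta + gamma * g_theta, Q + gamma * g_Q, w + gamma * g_w
    return theta, Q, w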

19.2.3 Discrete data

This framework can be adapted to situations in which the observations are discrete.
Assume, as an example, that X takes values in {0, 1}^V, where V is a set of vertices,
i.e., X is a binary Markov random field on V . Assume, as a generative model, that
conditionally to the latent variable Z ∈ Rp , the variables X (s) , s ∈ V are independent
and X (s) follows a Bernoulli distribution with parameter g (s) (z, θ), where g : Rp →
[0, 1]V . Assume also that Z ∼ N (0, IdRp ), and define, as above, an approximation of
the conditional distribution of Z given X = x as a Gaussian with mean µ(x, w) and
covariance matrix S(x, w)2 . Then, the joint density of X and Z (with respect to the
product of the counting measure on {0, 1}V and Lebesgue’s measure on Rp ) is
log f_{X,Z}(x, z ; θ) = Σ_{s∈V} ( x^(s) log g^(s)(z, θ) + (1 − x^(s)) log( 1 − g^(s)(z, θ) ) ) + log φ_N( z ; 0, Id_{R^p} )

and (19.2) becomes

θ_{n+1} = θ_n + γ_{n+1} ∂_θ log f_{X,Z}( X_{n+1}, Z_{n+1} ; θ_n )
w_{n+1} = w_n + γ_{n+1} log( f_{X,Z}( X_{n+1}, Z_{n+1} ; θ_n ) / φ_N( Z_{n+1} ; µ(X_{n+1}, w_n), S(X_{n+1}, w_n)² ) )           (19.8)
                × ∂_w log φ_N( Z_{n+1} ; µ(X_{n+1}, w_n), S(X_{n+1}, w_n)² )

19.3 Generative Adversarial Networks (GAN)

19.3.1 Basic principles

Similarly to the methods discussed so far, GANs [82] use a one-step nonlinear generator
X = g(Z, θ), with θ ∈ R^K, to model observed data (we here switch back to a
deterministic relation), where Z has a known distribution, with p.d.f. f_Z, for example
Z ∼ N(0, Id_{R^p}). However, unlike the exact or approximate likelihood maximization
methods discussed in sections 19.1 and 19.2, GANs use a different criterion for
estimating the parameter θ, minimizing metrics that can be approximated by optimizing
a classifier. The classifier is a function x ↦ f(x, w), parametrized by w ∈ R^M, whose
goal is to separate simulated samples from real ones: it takes values in [0, 1] and
estimates the (posterior) probability that its input x is real. The adversarial
paradigm in GANs consists in estimating θ and w together so that generated data,
using θ, are indistinguishable from real ones using the optimal w. Their basic
structure is summarized in Figure 19.1.


[Figure 19.1: Basic structure of GANs: W is optimized to improve the prediction problem
“real data” vs. “simulation”; given W, θ is optimized to worsen the prediction.]

19.3.2 Objective function

Let Pθ denote the distribution of g(Z, θ), and Ptrue the target distribution of real data,
represented by the variable X. One can formalize the “real data” vs. “simulation”

problem with a pair of random variables (X_θ, Y), where Y follows a Bernoulli distribution
with parameter 1/2, and P( X_θ ∈ A | Y = y ) is P_true(A) when y = 1 and P_θ(A)
when y = 0. Given a loss function r : {0, 1} × [0, 1] → [0, +∞), one can define

U(θ, w) = E( r( Y, f(X_θ, w) ) )

and

U*(θ) = min_{w∈R^M} U(θ, w).

We want to maximize U* or, equivalently, solve the optimization problem

θ* = argmax_{θ∈R^K} min_{w∈R^M} U(θ, w).

Note that

2U(θ, w) = E( r(1, f(X, w)) ) + E( r(0, f(X_θ, w)) ),

so that choosing the cost requires specifying the two functions t ↦ r(1, t) and
t ↦ r(0, t). In Goodfellow et al. [82], they are

r(1, t) = −log t,   r(0, t) = −log(1 − t).   (19.9)
19.3.3 Algorithm

Using the costs in (19.9), one must compute

θ* = argmin_{θ∈R^K} max_{w∈R^M} ( E( log f(X, w) ) + E( log(1 − f(X_θ, w)) ) )
   = argmin_{θ∈R^K} max_{w∈R^M} ( E( log f(X, w) ) + E( log(1 − f(g(Z, θ), w)) ) ).

Such min-max, or saddle-point, problems are numerically challenging. The following
algorithm was proposed in Goodfellow et al. [82], and also includes a stochastic
approximation component. Indeed, in practice, the expectation under P_true is only
known through the observation of training data, say x_1, . . . , x_N. Moreover, the
expectation under P_θ is only accessible through Monte-Carlo simulation, so that both
expectations can only be approximated through finite-sample averaging.

Algorithm 19.1 (GAN training algorithm)


1. Extract a batch of m examples from training data, simulate m samples accord-
ing to Pθ and run a few (stochastic) gradient ascent steps with fixed θ to update w,
replacing expectations by averages.
2. Generate m new samples of Z and update θ with fixed w by iterating a few
steps of (stochastic) gradient descent.
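
The alternation can be sketched as follows; the classifier f, the generator g, and the two gradient oracles are assumptions standing in for neural networks trained by back-propagation on batch averages:

import numpy as np

def gan_step(theta, w, data, p, g, grad_w_obj, grad_theta_obj,
             m=64, k_disc=5, lr=1e-3, rng=None):
    """One round of Algorithm 19.1 (assumed oracles; schematic only)."""
    rng = rng or np.random.default_rng()
    # 1. classifier updates: ascent on the batch average of
    #    log f(real, w) + log(1 - f(fake, w)), with theta fixed
    for _ in range(k_disc):
        real = data[rng.integers(len(data), size=m)]
        fake = g(rng.standard_normal((m, p)), theta)
        w = w + lr * grad_w_obj(w, real, fake)
    # 2. generator update: descent on the batch average of
    #    log(1 - f(g(z, theta), w)), with w fixed
    z = rng.standard_normal((m, p))
    theta = theta - lr * grad_theta_obj(theta, w, z)
    return theta, w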

19.3.4 Associated probability metric and Wasserstein GANs

Let F be the family of all measurable functions f : R^d → [0, 1]. Given two random
variables X_1, X_2 : Ω → R^d, with respective distributions P_1, P_2 (so that P(X_i ∈ A) =
P_i(A)), consider the function

D(P_1, P_2) = 2 log 2 + max_{f∈F} ( E( log f(X_1) ) + E( log(1 − f(X_2)) ) ).

Assume that X_1 (resp. X_2) has a p.d.f. g_1 (resp. g_2) with respect to some measure µ.
Then

E( log f(X_1) ) + E( log(1 − f(X_2)) ) = ∫_{R^d} ( g_1 log f + g_2 log(1 − f) ) dµ,

which is maximal at f* = g_1/(g_1 + g_2). For this f*,

2 log 2 + E( log f*(X_1) ) + E( log(1 − f*(X_2)) ) = ∫_{R^d} g_1 log( 2g_1/(g_1 + g_2) ) dµ + ∫_{R^d} g_2 log( 2g_2/(g_1 + g_2) ) dµ
   = KL( g_1 ‖ (g_1 + g_2)/2 ) + KL( g_2 ‖ (g_1 + g_2)/2 ).

This last expression is the Jensen-Shannon divergence between g1 and g2 (cf. sec-
tion 12.2). One can then define
 
D̂(P_1, P_2) = max_{w∈R^M} ( E( log f(X_1, w) ) + E( log(1 − f(X_2, w)) ) )

as an approximation of D in which the set of all possible functions with values in


[0, 1] is replaced by those arising from the GAN classification network, parametrized
by w. This approximation is useful when g1 , g2 are only observable through random
sampling or simulation. With this interpretation, GANs minimize D̂(P_true, P_θ).

This discussion suggests that new types of GANs may be designed using other
discrepancy functions between probability distributions, provided that they can be
expressed in terms of the maximization of some quantity over some space of functions.
Consider, for example, the total variation norm, defined (for discrete distributions) by

D_var(P_1, P_2) = (1/2) Σ_x | P_1(x) − P_2(x) |,

or, in the general case, D_var(P_1, P_2) = max_A ( P_1(A) − P_2(A) ).

If F is the space of continuous functions f : R^d → [0, 1], then we also have (cf.
proposition 12.3)

D_var(P_1, P_2) = max_{f∈F} ( E( f(X_1) ) − E( f(X_2) ) ).

Since neural nets typically generate continuous functions with values in [0, 1], one
could train GANs by maximizing

D̂_var(P_1, P_2) = max_{w∈R^M} ( E( f(X_1, w) ) − E( f(X_2, w) ) ).

However, the total variation distance can be too crude a way to compare probability
distributions, especially when these distributions have atoms (points x such that
P_i({x}) > 0). For example, the total variation distance between two Dirac distributions
at, say, x_1 and x_2 in R^d is always 1, unless x_1 = x_2. As a consequence, if x_n
converges to x, with x_n ≠ x, then D_var( δ_{x_n}, δ_x ) does not tend to 0.

The Monge-Kantorovich distance (introduced in section 12.3) is better adapted.


Using (12.4) with ρ(x, y) = |x − y| and theorem 12.6, we have
D_MK(P_1, P_2) = max_{f∈F} ( E( f(X_1) ) − E( f(X_2) ) )   (19.10)

where F is now the space of contractive (or 1-Lipschitz) functions. Using the fact
that a neural network with all weights bounded by a constant K generates a function
whose Lipschitz constant is controlled solely by K, one can then approximate (up to
a multiplicative constant) the Wasserstein distance by
 
D̂_MK(P_1, P_2) = max_{w∈W} ( E( f(X_1, w) ) − E( f(X_2, w) ) )
where W is the set of all weights bounded by a fixed constant. Wasserstein GANs
(WGANs [11]) must then solve the saddle-point problem

min_{θ∈R^K} max_{w∈W} ( E( f(X, w) ) − E( f(X_θ, w) ) ),

with an algorithm similar to that described earlier.
Remark 19.1 As a final reference, we note the “improved WGAN” algorithm intro-
duced in Gulrajani et al. [84] in which the boundedness constraint in the weights
is replaced by an explicit control of the derivative in x of the function f . Given in-
dependent X1 and X2 with respective distributions P1 , P2 , define a random variable
Z = (1 − U )X1 + U X2 where U is uniformly distributed over [0, 1] and (U , X1 , X2 )
are independent. Then, Gulrajani et al. [84] use the following approximation of the
Wasserstein distance between P_true and P_θ:

D̂_MK(P_true, P_θ) = max_{w∈W} ( E_true( f(X, w) ) − E_θ( f(X, w) ) − Ẽ_θ( ( |∂_z f(Z, w)| − 1 )² ) ).
This approximation is justified by the fact that optimal solutions of (19.10) satisfy
(almost surely for the optimal coupling)
f (y) − f (x) = |x − y|
(see Villani et al. [203], theorem 5.10 and Gulrajani et al. [84]). 

19.4 Reversed Markov chain models

19.4.1 General principles

The discussions in sections 19.2 and 19.3 can be applied to sequences of structural
equations (describing finite Markov chains) of the form

Z_0 = ξ_0
Z_{k+1} = g( Z_k, ξ_k ; θ_k ),   k = 0, . . . , m − 1
X = Z_m

where ξ_0, . . . , ξ_{m−1} are random variables with fixed distributions. Indeed, letting
Z̃ = (ξ_0, . . . , ξ_{m−1}) and θ̃ = (θ_0, . . . , θ_{m−1}), the whole system can be considered as a
function X = G(Z̃, θ̃), as considered in these sections. This representation, however,
includes a large number of hidden variables, and it is unclear whether much improvement
can be gained over the case m = 1 to justify the additional computational load.

While direct Markov chain modeling may have a limited appeal, reversed Markov
chains use a different generative approach in that they first model a forward Markov
chain Zn , n ≥ 0 which is ergodic with known (and easy to sample from) limit distri-
bution Q∞ , and initial distribution Qtrue , the true distribution of the data. If one
fixes a large enough number of steps, say, τ, then it is reasonable to assume that
Zτ approximately follows the limit distribution, Q∞ . One can then (approximately)
sample from Qtrue by sampling Z̃0 according to Q∞ and then applying τ steps of the
time-reversed Markov chain.

Reversed chains were discussed in section 13.3.3. Assuming that Qtrue and P (z, · )
have a density with respect to a fixed measure µ on RZ , we found that Z̃k = Zτ−k is a
non-homogeneous Markov chain whose transition probability P̃_k(x, A) = P( Z̃_{k+1} ∈ A |
Z̃_k = x ) has density

p̃_k(x, y) = p(y, x) q_{τ−k−1}(y) / q_{τ−k}(x)

with respect to µ, where q_n is the p.d.f. of Q_n = Q_true P^n, the distribution of Z_n.

The distributions Q_n, n ≥ 0, are unknown, since they depend on the data distribution
Q_true, and the transition probabilities above must be estimated from data
to provide a sampling algorithm from the reversed Markov chain. While, at first
glance, this does not seem like a simplification of the problem, because one now has
to sample from a potentially large number (τ) of distributions instead of one, this
leads, with proper modeling and some intensive learning, to powerful generative
models.

To make this approach more efficient, the forward chain should be making small

changes to the current configuration at each step (e.g., adding a small amount of
noise). This ensures that the reversed transition probabilities p̃k (x, ·) are close to
Dirac distributions and are therefore likely to be well approximated by simple uni-
modal distributions such as Gaussians. Importantly, the estimation problem does
not have hidden data: given an observed sample, one can simulate τ steps of the
forward chain to obtain, after reversing the order, a full observation of the reversed
chain. Moreover, in some cases, analytical considerations can lead to partial compu-
tations that facilitate the modeling of the reversed transitions.

19.4.2 Binary model

We now take some examples, starting with a discrete case. Let Q_true be the distribution
of a binary random field with state space {0, 1} over a set of vertices V, i.e., with
the notation of section 14.2, R_X = F(V) with F = {0, 1}. Fix a small ε > 0 and define
the transition probability p(x, y), for x, y ∈ F(V), by

p(x, y) = Π_{s∈V} ( (1 − ε) 1_{y^(s) = x^(s)} + ε 1_{y^(s) = 1 − x^(s)} ).

Since p(x, y) > 0 for all x and y, the chain converges (uniformly geometrically) to its
invariant probability Q_∞, and one easily checks that, under this probability, all
variables are independent Bernoulli random variables with success probability 1/2.
Assuming that τ is large enough so that Q_τ ≃ Q_∞, the sampling algorithm initializes
the reversed chain with independent Bernoulli(1/2) variables and runs τ steps using
the transitions p̃_k, which must be learned from data.

For this model, we have

q_k(x) = Σ_{y∈F(V)} q_{k−1}(y) p(y, x).

For this transition, the probability of flipping two or more coordinates of y is

1 − (1 − ε)^N − Nε(1 − ε)^{N−1} = (N(N − 1)/2) ε² + o(ε²)
2
with N = |V|. We will write x ∼_s y if y^(s) = 1 − x^(s) and y^(t) = x^(t) for t ≠ s, and we
will write x ∼ y if x ∼_s y for some s. With this notation, we have

q_k(x) = (1 − Nε) q_{k−1}(x) + ε Σ_{y : y∼x} q_{k−1}(y) + O(ε²).

Since this implies that q_k(x) = q_{k−1}(x) + O(ε), the expression can be reversed as

q_{k−1}(y) = (1 + Nε) q_k(y) − ε Σ_{x : x∼y} q_k(x) + O(ε²).

Similarly, we have

p(y, x) = (1 − Nε) 1_{x=y} + ε 1_{x∼y} + O(ε²).

This gives

p(y, x) q_{k−1}(y) = q_k(x) 1_{x=y} − ε 1_{x=y} Σ_{x′ : x′∼y} q_k(x′) + ε q_k(y) 1_{x∼y} + O(ε²),

and we finally get

p̃_k(x, y) = ( 1 − ε Σ_{x′ : x′∼y} q_{τ−k}(x′) / q_{τ−k}(x) ) 1_{x=y} + ε ( q_{τ−k}(y) / q_{τ−k}(x) ) 1_{x∼y} + O(ε²).

If one lets σ_k^(s)(x) = ε q_{τ−k}(y) / q_{τ−k}(x) with y ∼_s x, and defines

p̂_k(x, y) = Π_{s∈V} ( (1 − σ_k^(s)(x)) 1_{y^(s) = x^(s)} + σ_k^(s)(x) 1_{y^(s) = 1 − x^(s)} ),

one checks easily that p̂_k(x, y) = p̃_k(x, y) + O(ε²). This suggests modeling the reversed
chain using transitions p̂_k, for which the mapping x ↦ ( σ_k^(s)(x), s ∈ V ) needs to be
learned from data (for example using a deep neural network). Note that 1 − σ_k(x) is
precisely the score function introduced for discrete distributions in remark 18.8.
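
A reverse sampling step for this model is then immediate to sketch; sigma stands for the learned map x ↦ (σ_k^(s)(x), s ∈ V), and the constant value used in the toy call below is an arbitrary placeholder:

import numpy as np

def reverse_step(x, k, sigma, rng):
    """One step of p_hat_k: flip each site s independently with
    probability sigma_k^(s)(x)."""
    flip_prob = sigma(x, k)
    flips = rng.random(x.shape) < flip_prob
    return np.where(flips, 1 - x, x)

# Toy run: initialize from iid Bernoulli(1/2), approximating Q_infinity,
# and apply tau reverse steps with a constant (untrained) sigma.
rng = np.random.default_rng(0)
tau, V = 50, 16
x = rng.integers(0, 2, size=V)
for k in range(tau):
    x = reverse_step(x, k, lambda x, k: np.full(x.shape, 0.01), rng)
print(x)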

19.4.3 Model with continuous variables

We now switch to an example with vector-valued variables, RX = Rd , and assume


that the forward Markov chain is such that, conditionally to Xn = x,

Xn+1 ∼ N (x + hf (x), hIdRd ),

where f is C 1 . We saw in section 13.3.7 that, when f = −∇H/2 for a C 2 function


H such that exp(−H) is integrable, this chain converges (approximately for small h)
to a limit distribution with p.d.f. (with respect to Lebesgue’s measure) proportional
to exp(−H). In the linear case, in which f (x) = −Ax/2 for some positive definite
symmetric matrix A, so that H(x) = (1/2) x^T A x, the limit distribution can be identified
exactly as N(0, Σ_h), where Σ_h satisfies the equation

A Σ_h + Σ_h A − (h/2) A² Σ_h − 2 Id_{R^d} = 0,

whose solution is Σ_h = (A − hA²/4)^{−1} (details being left to the reader). This implies
that this limit distribution can easily be sampled from for any choice of A.

We now return to general f's and make, as in the discrete case, a first-order
identification of the reversed chain. We note that, for any smooth function γ,

E( γ(X_{n+1}) | X_n = x ) = E( γ( x + hf(x) + √h U ) )

where U ∼ N(0, Id_{R^d}). Making the second-order expansion

γ( x + hf(x) + √h U ) = γ(x) + √h ∇γ(x)^T U + h ∇γ(x)^T f(x) + (h/2) U^T ∇²γ(x) U + o(h)

and taking the expectation gives

E( γ(X_{n+1}) | X_n = x ) = γ(x) + h ∇γ(x)^T f(x) + (h/2) Δγ(x) + o(h).   (19.11)
2

Considering the reversed chain, and letting q_k denote the p.d.f. of X_k for the
forward chain, we have

E( γ(X_{k−1}) | X_k = x ) = ∫_{R^d} γ(y) p̃_k(x, y) dy
   = ∫_{R^d} γ(y) p(y, x) ( q_{k−1}(y) / q_k(x) ) dy
   = (1/(2πh)^{d/2}) ∫_{R^d} γ(y) ( q_{k−1}(y) / q_k(x) ) exp( −|x − y − hf(y)|² / (2h) ) dy
   = (1/(2π)^{d/2}) ∫_{R^d} γ( x − √h u ) ( q_{k−1}( x − √h u ) / q_k(x) ) exp( −| u − √h f( x − √h u ) |² / 2 ) du

with the change of variable u = (x − y)/√h. We make a first-order expansion of the
terms in this integral, with

γ( x − √h u ) q_{k−1}( x − √h u ) = γ(x) q_{k−1}(x) − √h ∇(γ q_{k−1})(x)^T u + (h/2) u^T ∇²(γ q_{k−1})(x) u + o(h)

and

exp( −| u − √h f( x − √h u ) |² / 2 ) = exp( −|u|²/2 ) exp( √h u^T f(x) − h u^T df(x) u − (h/2)|f(x)|² + o(h) )
   = exp( −|u|²/2 ) ( 1 + √h u^T f(x) − h u^T df(x) u − (h/2)|f(x)|² + (h/2)|u^T f(x)|² + o(h) ).
Taking products,

γ( x − √h u ) q_{k−1}( x − √h u ) exp( −| u − √h f( x − √h u ) |² / 2 )
   = exp( −|u|²/2 ) γ(x) q_{k−1}(x) ( 1 + √h u^T f(x) − h u^T df(x) u − (h/2)|f(x)|² + (h/2)|u^T f(x)|² )
   + exp( −|u|²/2 ) ( −√h ∇(γ q_{k−1})(x)^T u − h ( ∇(γ q_{k−1})(x)^T u )( f(x)^T u ) + (h/2) u^T ∇²(γ q_{k−1})(x) u ) + o(h).

We now take the integral with respect to u (recall that E(U^T A U) = trace(A) if A is
any square matrix and U is standard Gaussian), so that

(1/(2π)^{d/2}) ∫_{R^d} γ( x − √h u ) q_{k−1}( x − √h u ) exp( −| u − √h f( x − √h u ) |² / 2 ) du
   = γ(x) q_{k−1}(x) + h ( −γ(x) q_{k−1}(x) ∇·f(x) − ∇(γ q_{k−1})(x)^T f(x) + (1/2) Δ(γ q_{k−1})(x) ) + o(h)
   = q_{k−1}(x) ( γ(x) + h ( −γ(x) ∇·f(x) − ( ∇(γ q_{k−1})(x) / q_{k−1}(x) )^T f(x) + (1/2) Δ(γ q_{k−1})(x) / q_{k−1}(x) ) ) + o(h),

where ∇·f is the divergence of f.

To compute an expansion of q_k(x), it suffices to take γ = 1 above, so that

q_k(x) = q_{k−1}(x) ( 1 + h ( −∇·f(x) − ( ∇q_{k−1}(x) / q_{k−1}(x) )^T f(x) + (1/2) Δq_{k−1}(x) / q_{k−1}(x) ) ) + o(h).

We now take the first-order expansion of the ratio, removing terms that cancel, and get

E( γ(X_{k−1}) | X_k = x ) = γ(x) − h ∇γ(x)^T f(x) + h ∇γ(x)^T ( ∇q_{k−1}(x) / q_{k−1}(x) ) + (h/2) Δγ(x) + o(h).

Comparing with (19.11), we find that X̃_k = X_{τ−k} behaves, for small h, like the
non-homogeneous Markov chain such that the conditional distribution of X̃_{k+1} given
X̃_k = x is N( x − hf(x) − h s_{τ−k−1}(x), h Id_{R^d} ), with s_{τ−k−1}(x) = −∇ log q_{τ−k−1}(x),
the score function introduced in section 18.2.6. One can therefore use score-matching
methods from that section to estimate this distribution from observations of the forward
chain initialized with training data.
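
Schematically, the forward noising pass and the reverse sampler then take the following form, where score stands for a learned approximation of s_k obtained by score matching (all names are ours):

import numpy as np

def forward_chain(x0, f, h, tau, rng):
    """Simulate X_0, ..., X_tau with X_{n+1} ~ N(x + h f(x), h Id)."""
    xs = [np.array(x0, dtype=float)]
    for _ in range(tau):
        x = xs[-1]
        xs.append(x + h * f(x) + np.sqrt(h) * rng.standard_normal(x.shape))
    return xs            # trajectories provide training data for the score

def reverse_sample(x_tau, f, score, h, tau, rng):
    """Run the approximate reversed chain starting from x_tau ~ Q_infinity."""
    x = np.array(x_tau, dtype=float)
    for k in range(tau, 0, -1):
        mean = x - h * f(x) - h * score(x, k - 1)   # N(x - h f - h s_{k-1}, h Id)
        x = mean + np.sqrt(h) * rng.standard_normal(x.shape)
    return x             # approximate sample from Q_true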

19.4.4 Continuous-time limit

The forward schemes described in the previous examples can be interpreted as
continuous-time processes over discrete or continuous variables. In the latter case, the
example X_{k+1} ∼ N( x + hf(x), h Id_{R^d} ) conditionally to X_k = x is a discretization of the
stochastic differential equation

dxt = f (xt )dt + dwt

(see remark 13.4), where w_t is a Brownian motion and the diffusion is initialized
with Q_true. We found that going backward meant (at first order and conditionally to
X_k = x)

X_{k−1} ∼ N( x − hf(x) − h s_{k−1}(x), h Id )

which we can rewrite as

x_τ − X_{k−1} ∼ N( x_τ − x + hf(x) + h s_{k−1}(x), h Id ).

Following the definition in Anderson [9], this corresponds to a first-order discretiza-


tion of the reverse diffusion

dxt = (f (xt ) + st (xt ))dt + d w̃t , t ≤ τ

where w̃t is also a Brownian motion. This reverse diffusion with Xτ ∼ Q∞ will there-
fore approximately sample from Qtrue . (With this terminology, forward and reverse
diffusions have similar differential notation, but mean different things.) Note that,
in the continuous-time limit, the reverse Markov process follows the distribution of
the reversed diffusion exactly.

19.4.5 Differential of neural functions

As we have seen in the previous two examples, estimating the reversed Markov chain
requires computing the score functions of the forward probabilities. In the case
of continuous variables, this score function is typically parametrized as a neural
network, so that the function s_k(x) = −∇ log q_k(x) is computed as s_k(x) = F(x; W_k),
with the usual definition F(x, W_k) = z_{m+1}, where z_{j+1} = φ_j( z_j, w_j^k ), z_0 = x and
W_k = ( w_0^k, . . . , w_m^k ).

Assume that a training set T is observed. Running the forward Markov chain
initialized with elements of T generates a new training set at each time step, which we
will denote T_k at step k. We have seen in section 18.2.6 that the score function s_k
could be estimated by minimizing, with respect to W,

Σ_{x∈T_k} ( |F(x, W)|² − 2 ∇·F(x, W) ).

This term involves the differential of F, which is defined recursively by (simply tak-
ing the derivative at each step)

dF(x, W ) = ζm+1 , ζj+1 = dϕj (zj , wj )ζj ,

with ζ0 = IdRd . From this recursive definition, back-propagation can be applied, in


principle, to compute the derivative of dF(x, W ) with respect to W . The feasibility of
this computation, however, is limited when d is large (d could be tens of thousands
if one models images), since computing the d × d matrix dF(x, W ) is then intractable.

One can note that, for any h ∈ R^d, the vector dF(x, W)h also satisfies the recursion

dF(x, W)h = ζ_{m+1} h,   ζ_{j+1} h = dφ_j( z_j, w_j ) ζ_j h,

with ζ_0 h = h, and

∇·F(x, W) = Σ_{i=1}^d e_i^T dF(x, W) e_i

where e_1, . . . , e_d is the canonical basis of R^d. Putting the divergence of F in this
form does not reduce the computation cost (which is roughly d²m, assuming that
all z_j's have the same dimension), but expresses the divergence term in a form that
is amenable to stochastic gradient descent (which is typically already used to
approximate the sum over x). Indeed, if ξ follows any distribution with zero mean and
covariance matrix equal to the identity (such as a standard Gaussian, or the uniform
distribution on the sphere of radius √d), then

∇·F(x, W) = E( ξ^T dF(x, W) ξ )

so that ξ can be generated in minibatches in SGD implementations (see [181], where


this approach is called “sliced score matching”).
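
As a small numerical check of this identity, one can form the Jacobian explicitly in low dimension (which the sliced approach precisely avoids when d is large) and compare the Monte-Carlo estimate with the exact trace:

import numpy as np

def divergence_estimate(J, n_probes, rng):
    """Monte-Carlo estimate of trace(J) = E[xi^T J xi] with E[xi xi^T] = Id."""
    d = J.shape[0]
    total = 0.0
    for _ in range(n_probes):
        xi = rng.standard_normal(d)        # standard Gaussian probe
        total += xi @ J @ xi
    return total / n_probes

rng = np.random.default_rng(0)
J = rng.standard_normal((5, 5))            # stands for dF(x, W)
print(divergence_estimate(J, 5000, rng), np.trace(J))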
Chapter 20

Clustering

20.1 Introduction

We now describe a collection of methods designed to divide a training set into ho-
mogeneous subsets, or clusters. This grouping operation is a key problem in many
applications for which it is important to categorize the data in order to obtain im-
proved understanding of the sampled phenomenon, and sometimes to be able to
apply a different approach to subsequent processing or analysis adapted to each
cluster.

We will assume that the variables of interest belong to a set R = R_X, where R is
equipped with a discrepancy function α : R × R → [0, +∞). Often, α is derived from
a distance ρ on R, but this is not always the case. We will assume that the data is
given as a training set T = (x_1, . . . , x_N). However, it may happen that only the
discrepancy matrix A = (α(x, y), x, y ∈ T) is observed, while a coordinate representation
of the elements of T is not available.

Let us consider a few examples.

(i) The simplest case is when R = Rd with the standard Euclidean metric. Slightly
more generally, a metric may be defined by ρ2 (x, y) = kh(x) − h(y)k2H , where H is an
inner-product space and the feature function h : R 7→ H may be unknown, while its
associated “kernel”, K(x, y) = hh(x) , h(y)iH is known (this is a metric if h is one-to-
one). In this case
ρ2 (x, y) = K(x, x) − 2K(x, y) + K(y, y).

Typically, one then takes α = ρ or α = ρ2 .


(ii) Very often, however, the data is not Euclidean, and the distance does not cor-
respond to a feature space representation. This is the case, for example, for data be-
longing to “curved spaces” (manifolds), for which one may use the intrinsic distance


provided by the length of shortest paths linking two points (assuming of course that
this notion can be given a rigorous meaning). The simplest example is data on the
unit sphere, where the distance ρ(x, y) between two points x and y is the length of
the shortest great-circle arc that connects them, satisfying

|x − y|2 = 2 − 2 cos ρ(x, y).

Once again, α = ρ or ρ2 is a typical choice.


(iii) A more complex example is provided by R being the space of symmetric
positive-definite matrices on R^d, for which one defines the length of a differentiable
curve (S(t), t ∈ [a, b]) in this space by

∫_a^b √( trace( (S(t)^{−1} ∂_t S)(S(t)^{−1} ∂_t S)^T ) ) dt

and for which

ρ²(S_1, S_2) = Σ_{i=1}^d (log λ_i)²

where λ_1, . . . , λ_d are the eigenvalues of S_1^{−1/2} S_2 S_1^{−1/2} or, equivalently,
solutions of the generalized eigenvalue problem S_2 u = λ S_1 u (see, for example, [72]).
(iv) Another common assumption is that the elements of R are vertices of a weighted
graph of which T is a subgraph; ρ may then be, e.g., the geodesic distance on the
graph.

20.2 Hierarchical clustering and dendrograms

20.2.1 Partition trees

This method builds clusters by organizing them in a binary hierarchy in which the
data is divided into subsets, starting with the full training set, and iteratively split-
ting each subset into two parts until reaching singletons. This results in a binary
tree structure, called a dendrogram, or partition tree, which is defined as follows.
Definition 20.1 A partition tree of a finite set A is a finite collection of nodes T with the
following properties.

(i) Each node has either zero or exactly two children. (We will use the notation v → v′
to indicate that v′ is a child of v.)
(ii) All nodes but one have exactly one parent. The node without parent is the root of
the tree.
(iii) To each node v ∈ T is associated a subset Av ⊂ A.

[Figure 20.1: A partition tree of the set {a, b, c, d, e, f}. The root, 1: {a, b, c, d, e, f},
has children 2: {a, c, f} and 3: {b, d, e}; node 2 has children 4: {a, f} and 5: {c};
node 3 has children 6: {d} and 7: {b, e}; nodes 4 and 7 split into the singletons
8: {a}, 9: {f} and 10: {b}, 11: {e}.]

(iv) If v′ and v″ are the children of v, then (A_{v′}, A_{v″}) forms a partition of A_v.

Nodes without children are called leaves, or terminal nodes. We will say that the hierarchy
is complete if A_v = A when v is the root, and |A_v| = 1 for all terminal nodes.

An example of partition tree is provided in fig. 20.1.

The construction of the tree can follow two directions, the first one being bottom-
up, or agglomerative, in which the algorithm starts with the collection of all single-
tons and merges subsets one pair at a time until everything is merged into the full
dataset. The second approach is top-down, or divisive, and initializes the algorithm
with the full training set which is recursively split until singletons are reached. The
first approach, on which we now focus, is more common, and computationally sim-
pler.

We let T denote the training set and assume that a matrix of dissimilarities
(α(x, y), x, y ∈ T )
is given. We will make the abuse of notation of considering that T is a set even
though some of its elements may be repeated. This is no loss of generality, since
T = (x1 , . . . , xN ) can always be replaced by the subset {(k, xk ), k = 1, . . . , N } of N × R.

20.2.2 Bottom-up construction

We will extend α to a dissimilarity measure between subsets A, A′ ⊂ T, which we will
denote (A, A′) ↦ φ(A, A′). Once φ is defined, agglomeration works along the following
algorithm.

Algorithm 20.1
1. Start with the collection T1 , . . . , TN of all single-node trees associated to each
element of T . Let n = 0 and m = N .

2. Assume that, at step n of the algorithm, one has a collection of partition trees
T1 , . . . , Tm with root nodes r1 , . . . , rm associated with subsets Ar1 , . . . , Arm of T . Let the
total collection of nodes be indexed as Vn = {v1 , . . . , vN +n }, so that {r1 , . . . , rm } ⊂ Vn .
3. If m = 1, stop the algorithm.
4. Select indices i, j ∈ {1, . . . , m} such that ϕ(Ari , Arj ) is minimal, and merge the
corresponding trees by creating a new node vn+1+N with the root nodes of Ti and Tj
as children (so that vn+1+N is associated with Ari ∪ Arj ). Add vn+1+N to the collection
of root nodes, and remove ri and rj .
5. Set n → n + 1 and m → m − 1 and return to step 2.

Clearly, the specification of the extended dissimilarity measure φ is a key element
of the method. Some of the most commonly used extensions are listed below (a small
illustration in code follows the list).

• Minimum gap: φ_min(A, A′) = min( α(x, x′) : x ∈ A, x′ ∈ A′ ).

• Maximum dissimilarity: φ_max(A, A′) = max( α(x, x′) : x ∈ A, x′ ∈ A′ ).

• Sum of dissimilarities: φ_sum(A, A′) = Σ_{x∈A} Σ_{x′∈A′} α(x, x′).

• Average dissimilarity: φ_avg(A, A′) = (1/(|A||A′|)) Σ_{x∈A} Σ_{x′∈A′} α(x, x′).
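
For illustration, here is a quadratic-time sketch of Algorithm 20.1 using any of these rules (the helper names are ours; an efficient implementation would update the dissimilarities incrementally):

import numpy as np

def agglomerate(alpha, linkage):
    """Greedy merging of Algorithm 20.1 given a dissimilarity matrix alpha."""
    clusters = [[i] for i in range(alpha.shape[0])]   # start from singletons
    merges = []
    while len(clusters) > 1:
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: linkage(alpha, clusters[ab[0]], clusters[ab[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                    + [clusters[i] + clusters[j]])
    return merges

phi_min = lambda a, A, B: a[np.ix_(A, B)].min()    # minimum gap
phi_max = lambda a, A, B: a[np.ix_(A, B)].max()    # maximum dissimilarity
phi_avg = lambda a, A, B: a[np.ix_(A, B)].mean()   # average dissimilarity

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 2))
alpha = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
print(agglomerate(alpha, phi_min))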

As shown in the next two propositions, the maximum distance favors clusters
with small diameters, while using minimum gaps tends to favor connected clusters.

Proposition 20.2 Let diam(A) = max( α(x, y) : x, y ∈ A ). The agglomerative algorithm
using φ_max is identical to the one using φ(A, A′) = diam(A ∪ A′).

Proof Call Algorithm 1 the agglomerative algorithm using ϕmax , and Algorithm 2
the one using φ. At initialization, we have (because all sets are singletons)

φ_max(A_k, A_l) = diam(A_k ∪ A_l) for all 1 ≤ k ≠ l ≤ m.   (20.1)

We show that this property remains true at all steps of the algorithms. Proceeding
by induction, assume that, up to step n, Algorithms 1 and 2 have been identical and
have resulted in sets (A_1, . . . , A_m) satisfying (20.1). Then the next steps of the two
algorithms coincide; assume, without loss of generality, that this next step merges
A_{m−1} with A_m. Let A′_{m−1} = A_{m−1} ∪ A_m, so that diam(A′_{m−1}) ≤ diam(A_i ∪ A_j)
for all 1 ≤ i ≠ j ≤ m.

We need to show that the new partition satisfies (20.1), which requires that

φ_max(A′_{m−1}, A_k) = diam(A′_{m−1} ∪ A_k)

for k = 1, . . . , m − 2. We have

diam(A′_{m−1} ∪ A_k) = max( diam(A′_{m−1}), diam(A_k), φ_max(A′_{m−1}, A_k) ),

so that we must show that

max( diam(A′_{m−1}), diam(A_k) ) ≤ φ_max(A′_{m−1}, A_k).

Write

φ_max(A′_{m−1}, A_k) = max( φ_max(A_m, A_k), φ_max(A_{m−1}, A_k) )
   = max( diam(A_m ∪ A_k), diam(A_{m−1} ∪ A_k) ),

where the last identity results from the induction hypothesis.

The fact that

diam(A_k) ≤ max( diam(A_m ∪ A_k), diam(A_{m−1} ∪ A_k) )

is obvious, and the inequality

diam(A′_{m−1}) ≤ max( diam(A_m ∪ A_k), diam(A_{m−1} ∪ A_k) )

results from the fact that (A_{m−1}, A_m) was an optimal pair. This shows that the
induction hypothesis remains true at the next step and concludes the proof of the
proposition. □

We now analyze φ_min and, more specifically, the equivalence between the resulting
algorithm and the one using the following measure of connectedness. For a given
set A and x, y ∈ A, let

α̃_A(x, y) = inf{ ε : ∃n > 0, ∃( x = x_0, x_1, . . . , x_{n−1}, x_n = y ) ∈ A^{n+1} : α(x_i, x_{i−1}) ≤ ε for 1 ≤ i ≤ n }.

So α̃_A(x, y) is the smallest ε such that there exists a sequence of steps of size at most
ε in A going from x to y. The function

conn(A) = max{ α̃_A(x, y) : x, y ∈ A }

measures how well the set A is connected relative to the dissimilarity measure α,
and we have:

Proposition 20.3 The agglomerative algorithm using φ_min is identical to the one using
φ(A, A′) = conn(A ∪ A′).
Proof The proof is similar to that of proposition 20.2. Indeed, one can note that

conn(A ∪ A′) = max( conn(A), conn(A′), φ_min(A, A′) ).

Given this, we can proceed by induction and prove that, if the current decomposition
A_1, . . . , A_m is such that conn(A_k ∪ A_l) = φ_min(A_k, A_l) for all 1 ≤ k ≠ l ≤ m, then this
property is still true after merging using φ_min and φ.

Assuming again that A_{m−1} and A_m are merged, and letting A′_{m−1} = A_m ∪ A_{m−1}, we
need to show that conn(A_k ∪ A′_{m−1}) = φ_min(A_k, A′_{m−1}) for all k = 1, . . . , m − 2, which
is the same as showing that

max( conn(A_k), conn(A′_{m−1}) ) ≤ φ_min(A_k, A′_{m−1}) = min( φ_min(A_k, A_{m−1}), φ_min(A_k, A_m) ).

From the induction hypothesis, we have

min( φ_min(A_k, A_{m−1}), φ_min(A_k, A_m) ) = min( conn(A_k ∪ A_{m−1}), conn(A_k ∪ A_m) )

and both terms on the right-hand side are larger than conn(A_k), and also larger than
conn(A′_{m−1}), which was a minimizer. □
20.2.3 Top-down construction

The agglomerative method is the most common way to build dendrograms, mostly
because of the simplicity of the construction algorithm. The divisive approach is
more complex, because the division step, which requires, given a set A, optimizing
a splitting criterion over all two-set partitions of A, may be significantly more expensive
than the merging steps in the agglomerative algorithm. The top-down construction
therefore requires the specification of a “splitting algorithm” σ : A ↦ (A′, A″) such
that (A′, A″) is a partition of A. We assume that, if |A| > 1, then the partition (A′, A″)
is not trivial, i.e., neither set is empty.

Given σ , the top-down construction is as follows.

Algorithm 20.2
1. Start with the one-node partition tree T0 = (T ).
2. Assume that at a given step of the algorithm, the current partition is T .
3. If T is complete, stop the algorithm.
4. For each terminal node v in T such that |Av | > 1, compute (A0v , A00v ) = σ (Av ) and
add two children v 0 and v 00 to v with Av 0 = A0v and Av 00 = A00v .
5. Return to step 2.

The division of a set into two parts is itself a clustering algorithm, and one may apply
any of those described in the rest of this chapter.

20.2.4 Thresholding

Once a complete hierarchy is built, it provides a complete binary partition tree T.
This tree provides in turn a collection of partitions of the training set, each of them
obtained through pruning. We now formalize this operation.

Let VT denote the set of terminal nodes in T and V0 = V \ VT contain the interior
nodes. Define a pruning set to be a subset D ⊂ V_0 that contains no pair of nodes v, v′
such that v′ is a descendant of v. To any pruning set D, one can associate the pruned
subtree T(D) of T, consisting of T from which all the vertices that are descendants
of elements of D are removed. From any such pruned subtree, one obtains a partition
S(D) of the training set, formed by the collection of sets A_v for v in the terminal
nodes of T(D). Between the extreme cases S({v_0}) (where v_0 is the root of T), which
consists of a single class, and S(∅), which is the partition into singletons, there
exists a huge number of possible partitions obtained in this way.

It is often convenient to organize these partitions according to the level sets of


a well-chosen score function v 7→ h(v) defined over V0 . For D ⊂ V , we denote by
max(D) the set of its deepest elements, i.e., the set formed by those v ∈ D that have
no descendant in D. Then, for any λ ∈ R, one can define Dλ+ = max {v : h(v) ≥ λ} (resp.
Dλ− = max {v : h(v) ≤ λ}) and the associated partition S(Dλ+ ) (resp. S(Dλ− )). The score
function h can be linked to the construction algorithm. For example, if one uses a
bottom-up construction with an extended dissimilarity φ, one can associate to each
node v ∈ V_0 the value of φ(A_{v′}, A_{v″}), where v′ and v″ are the children of v.

Another way to define such score functions is by assigning weights to edges in
T. Indeed, given a collection w of positive numbers w(v, v′) for v → v′ in T, one can
define a score h_w recursively by letting h_w(v_0) = 0 and h_w(v′) = h_w(v) + w(v, v′) if v′ is
a child of v. The choice w(v, v′) = 1 for all v, v′ provides the usual notion of depth in
the tree.

Scores can also be built bottom-up, letting h_w(v) = 0 for terminal nodes and, for
v ∈ V_0,

h_w(v) = max( h_w(v′) + w(v, v′), h_w(v″) + w(v, v″) )

where v′, v″ are the children of v. Here, taking w = 1 provides the height of each node.

20.3 K-medoids and K-means

20.3.1 K-medoids

One of the limitations of hierarchical clustering is that it is a greedy approach that


does not optimize a global quality measure associated to the partition. Such qual-

ity measures can indeed be defined based on the heuristic that clusters should be
homogeneous (for some criterion) and far apart from each other.

In centroid-based methods, the homogeneity criterion is the minimum, over all


possible points in R, of the sum of dissimilarities between elements of the cluster
and that point. More precisely, for any A ⊂ R, and any dissimilarity measure α,
define the central dispersion index

V_α(A) = inf{ Σ_{x∈A} α(x, c) : c ∈ R }.   (20.2)

If c achieves the minimum in the definition of Vα , it is called a centroid of A for the


dissimilarity α.

The most common choice is α = ρ2 , where ρ is a metric on R, and in this case,


we will just use V in place of Vρ2 . Note also that it is always possible to limit R to
the training set T , in which case the optimization in (20.2) is over a finite number of
centers. This makes centroid-based methods also applicable to the situation when
the matrix of dissimilarities is the only input provided to the algorithm, or when the
set R and the function α are too complex for the optimization in (20.2) to be feasible.

A centroid c in (20.2) may not always exist, and when it exists it may not always
be unique. For α = ρ², a point c such that

V(A) = Σ_{x∈A} ρ²(x, c)

is called a Fréchet mean of the set A. Returning to the examples provided at the
beginning of this chapter, two antipodal points on the sphere (whose distance is π)
have an infinity of Fréchet means (or midpoints, in this case), provided by every point
on the equator between them. In contrast, the space of symmetric positive-definite
matrices in the example above is a so-called Hadamard space [43], and the Fréchet
mean in that case is unique. Of course, for Euclidean metrics, the Fréchet mean is
just the usual mean.

Returning to our general discussion, the K-medoids method optimizes the sum
of central dispersions with a fixed number of clusters. Note that the letter K in K-
medoids originally refers to this number of clusters, but this notation conflicts with
other notation in this book (e.g., reproducing kernels) and we shall denote by p this

target number1 . So the K-medoids method minimizes


W_α(A_1, . . . , A_p) = Σ_{i=1}^p V_α(A_i)

over all partitions A_1, . . . , A_p of the training set T. Equivalently, it minimizes

W_α(A_1, . . . , A_p, c_1, . . . , c_p) = Σ_{i=1}^p Σ_{x∈A_i} α(x, c_i)   (20.3)

over all partitions of T and c_1, . . . , c_p ∈ R. Finally, taking first the minimum with
respect to the A_i, which corresponds to associating each x with the subset having the
closest center, an equivalent formulation minimizes

W̃_α(c_1, . . . , c_p) = Σ_{x∈T} min{ α(x, c_i), i = 1, . . . , p }.

The standard implementation of K-medoids solves this problem using an alternate minimization, as defined in the following algorithm.

Algorithm 20.3 (K-medoids)

Let T ⊂ R be the training set. Start with an initial choice of c1, . . . , cp ∈ R and iterate over the following two steps until stabilization:

(1) For i = 1, . . . , p, let Ai contain the points x ∈ T such that α(x, ci) = min{α(x, cj), j = 1, . . . , p}. In case of a tie in this minimum, assign x to only one of the tied sets (e.g., at random) to ensure that A1, . . . , Ap is a partition.

(2) For i = 1, . . . , p, let ci be a minimizer over c ∈ R of Σ_{x∈Ai} α(x, c) if Ai is not empty, or let ci be a random point in T otherwise.
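To make the alternation concrete, here is a minimal Python/numpy sketch of Algorithm 20.3, restricting the centers to the training set so that only the dissimilarity matrix is needed; the function name, defaults, and re-seeding rule for empty clusters are our own illustrative choices.

```python
import numpy as np

def k_medoids(D, p, n_iter=100, rng=None):
    """Alternate minimization for K-medoids on a dissimilarity matrix D,
    with D[k, l] = alpha(x_k, x_l) and centers restricted to T."""
    rng = np.random.default_rng(rng)
    N = D.shape[0]
    centers = rng.choice(N, size=p, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, centers], axis=1)   # step (1): nearest center
        new_centers = centers.copy()
        for i in range(p):
            members = np.where(labels == i)[0]
            if members.size == 0:
                new_centers[i] = rng.integers(N)    # empty cluster: random point
            else:
                # step (2): best center in T for cluster i
                sums = D[members, :].sum(axis=0)
                new_centers[i] = int(np.argmin(sums))
        if np.array_equal(new_centers, centers):
            break                                    # cost has stabilized
        centers = new_centers
    return centers, np.argmin(D[:, centers], axis=1)
```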

It should be clear that each step reduces the total cost Wα and that this cost must stabilize at some point (which provides the stopping criterion) because there is only a finite number of possible partitions of T. However, there can be many possible limit points that are stable under the previous iterations, and some may correspond to poor “local minima” of the objective function. Since the end point of the algorithm depends on the initialization, this step requires extra care. One may design ad hoc heuristics in order to start the algorithm with a good initial point that is likely to provide a good solution at the end. These heuristics may depend on the

problem at hand, or use a generic strategy. As a common example of the latter, one may ensure that the initial centers are sufficiently far apart by picking c1 at random, c2 as far as possible from c1, c3 maximizing the sum of distances to c1 and c2, etc. One also typically runs the algorithm several times with random initial conditions and selects the best solution over these multiple runs.

The second step of Algorithm 20.3 can be computationally challenging depending on the set R and the dissimilarity measure α. When R = R^d and α = ρ² is the squared Euclidean distance, the solution is explicit and ci is simply the average of all points in Ai. The resulting algorithm is the original incarnation of K-medoids and is called K-means [183, 122, 125]. K-means is probably the most popular clustering method and is often a step in more advanced approaches, as we will discuss later. The two steps of Algorithm 20.3 are then simplified as follows.

Algorithm 20.4 (K-means)

Let T ⊂ R^d be the training set. Start with an initial choice of c1, . . . , cp ∈ R^d and iterate over the following two steps until stabilization:

(1) For i = 1, . . . , p, let Ai contain the points x ∈ T such that |x − ci|² = min{|x − cj|², j = 1, . . . , p}. In case of a tie in this minimum, assign x to only one of the tied sets (e.g., at random) to ensure that A1, . . . , Ap is a partition.

(2) For i = 1, . . . , p, let

ci = (1/|Ai|) Σ_{x∈Ai} x

if Ai is not empty, or let ci be a random point in T otherwise.
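The corresponding sketch for Algorithm 20.4 is given below; as before, the function signature and the handling of empty clusters are illustrative choices rather than part of the algorithm.

```python
import numpy as np

def k_means(X, p, n_iter=100, rng=None):
    """Alternate minimization of Algorithm 20.4 on an (N, d) data array."""
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    centers = X[rng.choice(N, size=p, replace=False)].copy()
    labels = np.zeros(N, dtype=int)
    for _ in range(n_iter):
        # step (1): assign each point to the nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(d2, axis=1)
        # step (2): recompute each center as the average of its cluster
        new_centers = centers.copy()
        for i in range(p):
            members = X[labels == i]
            new_centers[i] = members.mean(axis=0) if len(members) > 0 else X[rng.integers(N)]
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```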

20.3.2 Mixtures of Gaussians and deterministic annealing

Mixtures of Gaussians (MoG) were discussed in chapter 17 and in Algorithm 17.2. Recall that they model the observed data X together with a latent class variable Z ∈ {1, . . . , p} with joint distribution

f(x, z; θ) = (2π)^{−d/2} (det Σz)^{−1/2} αz e^{−(x−cz)^T Σz^{−1} (x−cz)/2}

where θ contains the weights α1, . . . , αp, the means c1, . . . , cp and the covariance matrices Σ1, . . . , Σp (we create, hopefully without risk of confusion, a short-lived conflict of notation between the weights and the dissimilarity function). The posterior class probabilities

fZ(i|x; θ) = (det Σi)^{−1/2} αi e^{−(x−ci)^T Σi^{−1}(x−ci)/2} / Σ_{j=1}^p (det Σj)^{−1/2} αj e^{−(x−cj)^T Σj^{−1}(x−cj)/2},    i = 1, . . . , p,

which are computed in step 3 of Algorithm 17.2, can be interpreted as the likelihood that observation x belongs to group i. As a consequence, the mixture of Gaussians algorithm can also be seen as a clustering method, in which one assigns each x ∈ T to cluster i when i = argmax{fZ(j|x, θ) : j = 1, . . . , p}, making an arbitrary decision in case of a tie.

In the special case in which all variances are fixed and equal to σ²Id_{R^d}, and all prior class probabilities are equal to 1/p (see remark 17.3), the EM algorithm for mixtures of Gaussians is also called “soft K-means,” because it replaces the “hard” cluster assignments in K-means by “soft” ones represented by the update of the posterior distribution. We repeat its definition here for completeness (where θ = (c1, . . . , cp)).

Algorithm 20.5 (Soft K-means)

1. Choose a number σ² > 0, a small constant ε and a maximal number of iterations M. Initialize the centers c = (c1, . . . , cp) and set n = 1.

2. At step n of the algorithm, let c be the current centers.

3. Compute, for x ∈ T and i = 1, . . . , p,

fZ(i|x, θ) = e^{−|x−ci|²/2σ²} / Σ_{j=1}^p e^{−|x−cj|²/2σ²}

and let ζi = Σ_{x∈T} fZ(i|x, θ), i = 1, . . . , p.

4. For i = 1, . . . , p, let

ci′ = (1/ζi) Σ_{x∈T} x fZ(i|x, θ).

5. If |c′ − c| < ε or n = M: stop the algorithm.

6. Replace c by c′ and n by n + 1 and return to step 2.
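A possible numpy transcription of these steps follows; the log-sum-exp stabilization of step 3 is an implementation detail we add for numerical safety, not part of the algorithm.

```python
import numpy as np

def soft_k_means(X, p, sigma2, eps=1e-6, max_iter=200, rng=None):
    """EM iterations of Algorithm 20.5 (fixed variance, equal class weights)."""
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(X.shape[0], size=p, replace=False)].copy()
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # step 3: posterior class probabilities f_Z(i | x)
        logits = -d2 / (2 * sigma2)
        logits -= logits.max(axis=1, keepdims=True)   # numerical stabilization
        f = np.exp(logits)
        f /= f.sum(axis=1, keepdims=True)
        zeta = f.sum(axis=0)                          # zeta_i
        # step 4: new centers as weighted averages
        new_centers = (f.T @ X) / zeta[:, None]
        if np.linalg.norm(new_centers - centers) < eps:   # step 5
            break
        centers = new_centers                              # step 6
    return centers, f
```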

When σ² → 0, fZ( · |xk, θ) converges to the uniform probability on the indexes j such that cj is closest to xk, which is a Dirac measure unless there are ties. Class allocation and center updating then become asymptotically identical to those of the K-means algorithm. A variant of soft K-means, called deterministic annealing [170], applies Algorithm 20.5 while letting σ slowly tend to 0. This new algorithm is experimentally more robust than K-means, in that it is less likely to be trapped in bad local minima.
Remark 20.4 The soft K-means algorithm can also be defined directly as an alternate minimization method for the objective function

F(c, fZ) = (1/2) Σ_{x∈T} Σ_{j=1}^p fZ(j|x) |x − cj|² + σ² Σ_{x∈T} Σ_{j=1}^p fZ(j|x) log fZ(j|x),

with the constraints fZ(j|x) ≥ 0 for all j and x and Σ_{j=1}^p fZ(j|x) = 1. One can check (we leave this as an exercise) that Step 3 in Algorithm 20.5 provides the optimal fZ for F when c is fixed, and that Step 4 gives the optimal c when fZ is fixed. □

Remark 20.5 We note that, if a K-means, soft K-means or MoG algorithm has been trained on a training set T, it is then easy to assign a new sample x̃ to one of the clusters. Indeed, for K-means, it suffices to determine the center closest to x̃, and for the other methods to maximize fZ(j|x̃, θ), which is computable given the model parameters. In contrast, there is no direct way to do so using hierarchical clustering. □

20.3.3 Kernel (soft) K-means

We now consider the soft K-means algorithm in feature space, and introduce features hk = h(xk) in an inner product space H such that ⟨hk, hl⟩_H = K(xk, xl) for some positive definite kernel K. As usual, the underlying assumption is that the computation of h(x) does not need to be feasible, while evaluations of K(x, y) are easy. Let us consider the minimization of

(1/2) Σ_{x∈T} Σ_{j=1}^p fZ(j|x) ‖h(x) − cj‖²_H + σ² Σ_{x∈T} Σ_{j=1}^p fZ(j|x) log fZ(j|x)

for some σ² > 0 (kernel K-means corresponds to taking the limit σ² → 0). Given fZ, the optimal centers are

cj = (1/ζj) Σ_{x∈T} fZ(j|x) h(x)

with ζj = Σ_{x∈T} fZ(j|x). They belong to the feature space, H, and are therefore not computable in general. However, the distance between them and a point h(y) ∈ H is explicit and given by

‖h(y) − cj‖²_H = K(y, y) − (2/ζj) Σ_{x∈T} fZ(j|x) K(y, x) + (1/ζj²) Σ_{x,x′∈T} fZ(j|x) fZ(j|x′) K(x, x′).

The class probabilities at each iteration can therefore be updated using

fZ(j|x) = e^{−‖h(x)−cj‖²_H / 2σ²} / Σ_{j′=1}^p e^{−‖h(x)−c_{j′}‖²_H / 2σ²}.

This yields the soft kernel K-means algorithm, which we summarize below.

Algorithm 20.6 (Kernel soft K-means)

Let T ⊂ R^d be the training set. Initialize the algorithm with some choice of fZ(j|x), j = 1, . . . , p, x ∈ T (for example, fZ(j|x) = 1/p for all j and x).

(1) For j = 1, . . . , p and x ∈ T compute

‖h(x) − cj‖²_H = K(x, x) − (2/ζj) Σ_{x′∈T} fZ(j|x′) K(x, x′) + (1/ζj²) Σ_{x′,x″∈T} fZ(j|x′) fZ(j|x″) K(x′, x″)

with ζj = Σ_{x′∈T} fZ(j|x′).

(2) Compute, for x ∈ T and j = 1, . . . , p,

fZ(j|x) = e^{−‖h(x)−cj‖²_H / 2σ²} / Σ_{j′=1}^p e^{−‖h(x)−c_{j′}‖²_H / 2σ²}.

(3) If the variation of fZ compared to the previous iteration is small, or if a maximum number of iterations has been reached, exit the algorithm.

(4) Return to step 1.

After convergence, the clusters are computed by assigning x to Ai when i = argmax{fZ(j|x) : j = 1, . . . , p}, making an arbitrary decision in case of a tie.

For “hard” kernel K-means (with σ² → 0), step 2 simply updates fZ(j|x) as the uniform probability on the set of indexes j at which ‖h(x) − cj‖²_H is minimal.
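The following sketch implements Algorithm 20.6 from the Gram matrix alone; the Dirichlet initialization of fZ is one arbitrary choice among many.

```python
import numpy as np

def kernel_soft_k_means(K, p, sigma2, max_iter=100, rng=None):
    """Algorithm 20.6, given the Gram matrix K[k, l] = K(x_k, x_l)."""
    rng = np.random.default_rng(rng)
    N = K.shape[0]
    f = rng.dirichlet(np.ones(p), size=N)        # f[k, j] = f_Z(j | x_k)
    diag = np.diag(K)
    for _ in range(max_iter):
        zeta = f.sum(axis=0)
        cross = K @ f                            # sum_l f[l, j] K(x_k, x_l)
        quad = np.einsum('kj,kl,lj->j', f, K, f) # sum_{k,l} f[k,j] f[l,j] K(x_k, x_l)
        # step (1): squared distances to the implicit centers in feature space
        d2 = diag[:, None] - 2 * cross / zeta[None, :] + quad[None, :] / zeta[None, :] ** 2
        # step (2): update the soft assignments
        logits = -d2 / (2 * sigma2)
        logits -= logits.max(axis=1, keepdims=True)
        f_new = np.exp(logits)
        f_new /= f_new.sum(axis=1, keepdims=True)
        if np.abs(f_new - f).max() < 1e-6:       # step (3): stabilization test
            return f_new
        f = f_new                                # step (4)
    return f
```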

20.3.4 Convex relaxation

We return to the initial formulation of K-means for Euclidean data, as a minimization, over all partitions A = {A1, . . . , Ap} of {1, . . . , N}, of

W(A) = Σ_{j=1}^p Σ_{k∈Aj} |xk − cj|²

where cj is the average of the points xk such that k ∈ Aj. We start with a simple transformation expressing this function in terms of the matrix Sα of squared distances

α(xk, xl) = |xk − xl|². Indeed, we have

Σ_{k∈Aj} |xk − cj|² = Σ_{k∈Aj} |xk|² − (1/|Aj|) |Σ_{k∈Aj} xk|²
= Σ_{k∈Aj} |xk|² − (1/|Aj|) Σ_{k,l∈Aj} xk^T xl
= (1/2|Aj|) Σ_{k,l∈Aj} (|xk|² + |xl|² − 2 xk^T xl)
= (1/2|Aj|) Σ_{k,l∈Aj} |xk − xl|².

Introduce the vector uj ∈ R^N with coordinates uj^{(k)} = 1/√|Aj| for k ∈ Aj and 0 otherwise. Then

(1/2|Aj|) Σ_{k,l∈Aj} |xk − xl|² = (1/2) uj^T Sα uj = (1/2) trace(Sα uj uj^T).    (20.4)

Let

Z(A) = Σ_{j=1}^p uj uj^T,

so that Z(A) has entries Z^{(k,l)}(A) = 1/|Aj| for k, l ∈ Aj, j = 1, . . . , p, and 0 for all other k, l. Summing (20.4) over j, we get

W(A) = (1/2) trace(Sα Z(A)).

The matrix Z(A) is symmetric and has non-negative entries. It moreover satisfies Z(A)1N = 1N and Z(A)² = Z(A). Interestingly, these properties characterize the matrices Z associated with partitions, as stated in the next proposition [154, 153].

Proposition 20.6 Let Z ∈ M_N(R) be a symmetric matrix with non-negative entries satisfying Z1N = 1N and Z² = Z. Then there exists a partition A of {1, . . . , N} such that Z = Z(A).
Proof Note that Z being symmetric and satisfying Z² = Z implies that it is an orthogonal projection, with eigenvalues 0 and 1. In particular, Z is positive semidefinite. This implies that, for all i, j ∈ {1, . . . , N}, one has

Z(i, j)² ≤ Z(i, i) Z(j, j).

This inequality, combined with Σ_{j=1}^N Z(k, j) = 1 (expressing Z1N = 1N), shows that all diagonal entries of Z are positive.

Define on {1, . . . , N} the relation k ∼ j if and only if Z(j, k) > 0. The relation is symmetric and we just checked that k ∼ k for all k. It is also transitive, thanks to the relation (deriving from Z² = Z)

Z(k, j) = Σ_{i=1}^N Z(k, i) Z(i, j),

which shows (since all terms in the sum are non-negative) that k ∼ i and j ∼ i imply k ∼ j.

Let A = {A1, . . . , Aq} be the partition of {1, . . . , N} formed by the equivalence classes of this relation. We now show that Z = Z(A).

We have, for all k, j ∈ {1, . . . , N},

Σ_{i=1}^N Z(k, i)(Z(k, j) − Z(i, j)) = Z(k, j) Σ_{i=1}^N Z(k, i) − Σ_{i=1}^N Z(k, i) Z(i, j)
= Z(k, j) − Σ_{i=1}^N Z(k, i) Z(i, j) = 0.

Now, if k, j ∈ As for some s, the identity reduces to

Σ_{i∈As} Z(k, i)(Z(k, j) − Z(i, j)) = 0,    (20.5)

since Z(k, i) = 0 when i ∉ As.

Choose k such that Z(k, k) = max{Z(i, i) : i ∈ As}. Then, for all i, j ∈ As, Z(i, j) ≤ √(Z(i, i) Z(j, j)) ≤ Z(k, k), and (20.5) with j = k yields

Σ_{i∈As} Z(k, i)(Z(k, k) − Z(k, i)) = 0,

which is only possible (since all Z(k, i), i ∈ As, are positive) if Z(k, i) = Z(k, k) for all i ∈ As. From Z(k, i) ≤ √(Z(i, i) Z(k, k)), we get Z(i, i) = Z(k, k) for all i ∈ As, and therefore (reapplying what we just found to i instead of k) Z(i, j) = Z(i, i) = Z(k, k) for all i, j ∈ As.
Finally, we have

1 = Σ_{i∈As} Z(k, i) = |As| Z(k, k),

showing that Z(k, k) = 1/|As| and completing the proof that Z = Z(A). □

Note that the number of clusters, |A|, is equal to the trace of Z(A). This shows that minimizing W(A) over partitions with p clusters is equivalent to the constrained optimization problem of minimizing

G(Z) = trace(Sα Z)    (20.6)

over all matrices Z such that Z ≥ 0, Z^T = Z, Z1N = 1N, trace(Z) = p and Z² = Z. This is still a difficult problem, since it is equivalent to K-means, which is NP-hard. Seeing the problem in this form, however, makes it more amenable to approximations and, in particular, to convex relaxations.

In [153], it is proposed to use a semidefinite program (SDP) as a relaxation. The conditions Z = Z^T and Z² = Z require that all eigenvalues of Z are either 0 or 1, and a direct relaxation is to replace these constraints by Z^T = Z and 0 ⪯ Z ⪯ Id_{R^N} (recall that Z ⪰ 0 means that Z is positive semidefinite, while Z ≥ 0 indicates that all its entries are non-negative). The upper bound Z ⪯ Id_{R^N} is however redundant if we add the conditions Z ≥ 0 and Z1 = 1. This is a consequence of the Perron–Frobenius theorem, which states that a matrix Z̃ with positive entries has a largest (in modulus) real eigenvalue, which has multiplicity one and is associated with an eigenvector with positive coordinates, the latter eigenvector being (up to multiplication by a constant) the unique eigenvector of Z̃ with positive coordinates. So, if a matrix Z̃ is symmetric, satisfies Z̃ > 0 and Z̃1N = 1N, then Z̃ ⪯ Id_{R^N}. Applying this result to Z̃ = (1 − ε)Z + (ε/N) 1N 1N^T and letting ε tend to 0 shows that any symmetric matrix Z with non-negative entries satisfying Z1N = 1N also satisfies Z ⪯ Id_{R^N}.

This provides the following SDP relaxation of K-means [153]: minimize

G(Z) = trace(Sα Z)    (20.7)

subject to Z^T = Z, Z1N = 1N, trace(Z) = p, Z ≥ 0, Z ⪰ 0.

Clusters can be immediately inferred from the columns of the matrix Z(A), since they are identical for two indices in the same cluster, and orthogonal to each other for two indices in different clusters. Let z1(A), . . . , zN(A) denote the columns of Z(A) and z̄k(A) = zk(A)/|zk(A)|. One has |z̄k(A) − z̄l(A)| = 0 if k and l belong to the same cluster and √2 otherwise.

These properties will not necessarily be satisfied by a solution, say Z*, of the SDP relaxation, but, assuming that the approximation is good enough, one may still consider the normalized columns of Z* and expect them to be similar for indices in the same cluster, and away from each other otherwise. Denoting by z̄1*, . . . , z̄N* these normalized columns, one can then run on them the standard K-means algorithm, or a spectral clustering method such as those described in the next sections, to infer clusters.
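As an illustration, and assuming the cvxpy package is available, the relaxation (20.7) followed by the column-normalization rounding just described can be sketched as follows (a small-N illustration rather than a scalable solver; k_means refers to the sketch of section 20.3.1).

```python
import numpy as np
import cvxpy as cp

def sdp_kmeans(S, p):
    """Relaxation (20.7) for a squared-distance matrix S, with rounding."""
    N = S.shape[0]
    Z = cp.Variable((N, N), symmetric=True)
    constraints = [Z >> 0,                   # Z positive semidefinite
                   Z >= 0,                   # entrywise non-negative
                   cp.sum(Z, axis=1) == 1,   # Z 1_N = 1_N
                   cp.trace(Z) == p]
    cp.Problem(cp.Minimize(cp.trace(S @ Z)), constraints).solve()
    Zs = Z.value
    cols = Zs / np.linalg.norm(Zs, axis=0, keepdims=True)  # normalized columns
    return k_means(cols.T, p)                # round with standard K-means
```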
Remark 20.7 Clearly, one can use any symmetric matrix S in the definition of G in (20.6) and (20.7). The method is equivalent to, or a relaxation of, K-means only when S is formed with squared norms in inner-product spaces, which does include kernel K-means, for which

α(xk, xl) = K(xk, xk) − 2K(xk, xl) + K(xl, xl).

If α is an arbitrary discrepancy measure, the minimization of G(Z) still makes sense, since it is equivalent to minimizing

G(Z(A)) = Σ_{j=1}^p Dα(Aj),

where

Dα(A) = (1/|A|) Σ_{x,y∈A} α(x, y)    (20.8)

is a (normalized) measure of size that we will call the α-dispersion of the finite set A.
Remark 20.8 Instead of using dissimilarities, some algorithms are more naturally defined in terms of similarities. Given such a similarity measure, say β, one must maximize rather than minimize the corresponding index ∆β(A1, . . . , Ap) = Σ_{k=1}^p Dβ(Ak) (which becomes, rather than a measure of dispersion, a measure of concentration).

One passes from a dissimilarity α to a similarity β by applying a decreasing function to the former, a common choice being

β(x, x′) = exp(−α(x, x′)/τ)

for some τ > 0.

Alternatively, one can fix an element x0 ∈ R and let

β(x, y) = α(x, x0) + α(y, x0) − α(x, y) − α(x0, x0)

(note that the last term, α(x0, x0), is generally equal to 0). For example, if α(x, y) = |x − y|², then β(x, y) = 2(x − x0)^T(y − x0) (for which it is natural to take x0 = 0). If α is a distance (not squared!), then β ≥ 0 by the triangle inequality. In this case, writing ∆α(A1, . . . , Ap) = Σ_{k=1}^p Dα(Ak), we have

∆β(A1, . . . , Ap) = Σ_{k=1}^p Dβ(Ak)
= Σ_{k=1}^p (1/|Ak|) Σ_{x,y∈Ak} α(x, x0) + Σ_{k=1}^p (1/|Ak|) Σ_{x,y∈Ak} α(y, x0)
− Σ_{k=1}^p (1/|Ak|) Σ_{x,y∈Ak} α(x0, x0) − Σ_{k=1}^p (1/|Ak|) Σ_{x,y∈Ak} α(x, y)
= 2 Σ_{k=1}^p Σ_{x∈Ak} α(x, x0) − Σ_{k=1}^p |Ak| α(x0, x0) − ∆α(A1, . . . , Ap)
= 2 Σ_{x∈T} α(x, x0) − |T| α(x0, x0) − ∆α(A1, . . . , Ap),

so that minimizing ∆α is equivalent to maximizing ∆β. □

20.4 Spectral clustering

20.4.1 Spectral approximation of minimum discrepancy

One refers as spectral methods to algorithms that rely on computing eigenvectors and eigenvalues (the spectrum) of data-dependent matrices. In the case of minimizing discrepancies, such methods can be obtained by further simplifying (20.7), essentially by removing constraints.

One indeed gets a simpler problem if the non-negativity constraint, Z ≥ 0, is removed. Doing so, one cannot guarantee anymore that Z ⪯ Id_{R^N}, so we need to reinstate this constraint. We will first make the further simplification of removing the constraint Z1N = 1N, the problem becoming that of minimizing trace(Sα Z) over all Z ∈ S_N⁺(R) such that 0 ⪯ Z ⪯ Id_{R^N} and trace(Z) = p. Decomposing Z in an eigenbasis, i.e., looking for it in the form

Z = Σ_{j=1}^N ξj ej ej^T,

this is equivalent to minimizing

Σ_{j=1}^N ξj ej^T Sα ej    (20.9)

subject to 0 ≤ ξj ≤ 1, Σ_{j=1}^N ξj = p and e1, . . . , eN forming an orthonormal basis of R^N. First consider the minimization with respect to the basis, fixing ξ. There is obviously no loss of generality in requiring that ξ1 ≥ ξ2 ≥ · · · ≥ ξN, and using corollary 2.4 (adapted to minimizing (20.9) rather than maximizing it) we know that an optimal basis is given by the eigenvectors of Sα, ordered with non-decreasing eigenvalues. Letting λ1 ≤ · · · ≤ λN denote these eigenvalues, we find that ξ1, . . . , ξN must be a non-increasing sequence minimizing

Σ_{j=1}^N λj ξj

subject to 0 ≤ ξj ≤ 1 and Σ_{j=1}^N ξj = p. The optimal solution is obtained by taking

ξ1 = · · · = ξp = 1, since, for any other solution,

Σ_{j=1}^N λj ξj − Σ_{j=1}^p λj ≥ λ_{p+1} Σ_{j=p+1}^N ξj + Σ_{j=1}^p λj (ξj − 1)
= λ_{p+1} Σ_{j=1}^p (1 − ξj) + Σ_{j=1}^p λj (ξj − 1)
= Σ_{j=1}^p (λ_{p+1} − λj)(1 − ξj)
≥ 0.

The following algorithm (similar to that discussed in [64]) summarizes this discussion.

Algorithm 20.7 (Spectral clustering: version 1)

Let Sα be an N × N discrepancy matrix. Let p denote the number of clusters.

(1) Compute the eigenvectors of Sα associated with the p smallest eigenvalues.

(2) Denoting these eigenvectors by e1, . . . , ep, define y1, . . . , yN ∈ R^p by yk^{(j)} = ej^{(k)}.

(3) Run K-means on (y1, . . . , yN) to determine a partition.
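In code, Algorithm 20.7 reduces to an eigendecomposition followed by K-means; the sketch below reuses the k_means function of section 20.3.1.

```python
import numpy as np

def spectral_clustering_v1(S, p):
    """Algorithm 20.7 for an N x N discrepancy matrix S."""
    vals, vecs = np.linalg.eigh(S)   # eigenvalues in non-decreasing order
    Y = vecs[:, :p]                  # row k of Y is y_k = (e_1(k), ..., e_p(k))
    return k_means(Y, p)
```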

This algorithm needs to be slightly modified if one also wants Z to satisfy Z1 = 1. In that case, 1 is one of the eigenvectors (with eigenvalue 1), and the others are orthogonal to it. As a consequence, one now looks for Z in the form

Z = Σ_{j=1}^{N−1} ξj ej ej^T + (1/N) 1 1^T,

leading to the minimization of

Σ_{j=1}^{N−1} ξj ej^T Sα ej + (1/N) 1^T Sα 1

over all ξ1, . . . , ξ_{N−1} such that 0 ≤ ξj ≤ 1 and Σ_{j=1}^{N−1} ξj = p − 1, and over all e1, . . . , e_{N−1} such that e1, . . . , e_{N−1}, 1/√N form an orthonormal basis. The main difference with the previous problem is that we now need to ensure that all ej are perpendicular to 1.

To achieve this, introduce the projection matrix P = Id_{R^N} − 11^T/N and let S̃α = P Sα P. Then, since u^T 1 = 0 implies u^T S̃α u = u^T Sα u, it is equivalent to minimize

Σ_{j=1}^{N−1} ξj ej^T S̃α ej

over all ξ1, . . . , ξ_{N−1} such that 0 ≤ ξj ≤ 1 and Σ_{j=1}^{N−1} ξj = p − 1, and over all e1, . . . , e_{N−1} such that e1, . . . , e_{N−1}, 1/√N form an orthonormal basis. Because S̃α1 = 0, we know that S̃α can be diagonalized in an orthonormal basis (e1, . . . , e_{N−1}, 1/√N), and we obtain an optimal solution by selecting the p − 1 eigenvectors associated with the smallest eigenvalues, with associated ξj = 1. We therefore get a modified version of the spectral clustering algorithm.

Algorithm 20.8 (Spectral clustering: version 2)

Let Sα be an N × N discrepancy matrix. Let p denote the number of clusters. Let P = Id_{R^N} − 1N 1N^T/N.

(1) Compute S̃α = P Sα P.

(2) Compute the eigenvectors of S̃α associated with the p − 1 smallest eigenvalues.

(3) Denoting these eigenvectors by e1, . . . , e_{p−1}, define y1, . . . , yN ∈ R^{p−1} by yk^{(j)} = ej^{(k)}.

(4) Run K-means on (y1, . . . , yN) to determine a partition.
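The modification only changes the matrix being diagonalized and the number of eigenvectors kept; a sketch, under the same conventions:

```python
import numpy as np

def spectral_clustering_v2(S, p):
    """Algorithm 20.8: project out the constant vector, keep p - 1 eigenvectors."""
    N = S.shape[0]
    P = np.eye(N) - np.ones((N, N)) / N      # projection on the orthogonal of 1
    vals, vecs = np.linalg.eigh(P @ S @ P)   # S~ = P S P; 1/sqrt(N) has eigenvalue 0
    Y = vecs[:, :p - 1]                      # p - 1 smallest eigenvalues
    return k_means(Y, p)
```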

20.5 Graph partitioning

Similarity measures are often associated with graph structures, with the goal of finding a partition of their set of vertices. So, let T denote the set of these vertices and assume that to each pair x, y ∈ T one attributes a weight β(x, y), where β is assumed to be non-negative. We define β for all x, y ∈ T, but we interpret β(x, y) = 0 as marking the absence of an edge between x and y.

Let V denote the vector space of all functions f : T → R (we have dim(V) = |T|). This space can be equipped with the standard Euclidean norm, which we will call in this section the L² norm (by analogy with general spaces of square integrable functions), letting

|f|₂² = Σ_{x∈T} f(x)².

One can also associate a measure of smoothness with a function f ∈ V by computing the discrete “H¹” semi-norm,

|f|²_{H¹} = Σ_{x,y∈T} β(x, y)(f(x) − f(y))².

With this definition, “smooth functions” tend to have similar values at points x, y in T such that β(x, y) is large, while there is less constraint when β(x, y) is small. In particular, |f|_{H¹} = 0 if and only if f is constant on the connected components of the graph. (Two nodes x and y are connected in the graph if there is a sequence z0, . . . , zn in T such that z0 = x, zn = y and β(zi, zi−1) > 0 for i = 1, . . . , n. This defines an equivalence relation whose equivalence classes are called connected components.)

The notion of connected components, combined with thresholding, can be used to build a hierarchical family of partitions of the graph. Define, for all t > 0, the thresholded weights β^{(t)}(x, y) = max(β(x, y) − t, 0). The set of connected components associated with the pair (V, β^{(t)}) forms a partition, say A^{(t)}, of T. The resulting set of partitions is nested in the sense that, if s < t, the sets forming the partition A^{(s)} are unions of sets forming A^{(t)}. This thresholding procedure is not always satisfactory, however, because there does not always exist a fixed value of t that produces a good quality cluster decomposition.

If there exist p connected components, then the subspace of all functions f ∈ V such that |f|_{H¹} = 0 has dimension p. If C1, . . . , Cp are the connected components, this space is generated by the functions δ_{Ck}, k = 1, . . . , p, with δ_{Ck}(x) = 1 if x ∈ Ck and 0 otherwise. These functions form, in addition, an orthogonal system for the Euclidean inner product: ⟨δ_{Ck}, δ_{Cl}⟩₂ = 0 if k ≠ l.

One can write (1/2)|f|²_{H¹} = f^T L f, where L, called the Laplacian operator associated with the considered graph, is defined by

Lf(x) = Σ_{y∈T} L(x, y) f(y)

with

L(x, y) = (Σ_{z∈T} β(x, z)) 1_{x=y} − β(x, y).    (20.10)
The vectors δ_{Ck}, k = 1, . . . , p, then form an orthogonal basis of the null space of L. Conversely, let (e1, . . . , ep) be any basis of this null space. Then there exists an invertible matrix A = (aij, i, j = 1, . . . , p) such that

ei(x) = Σ_{j=1}^p aij δ_{Cj}(x).

Associate to each x ∈ T the vector e(x) = (e1(x), . . . , ep(x))^T ∈ R^p. Then, for any x, y ∈ T, we have e(x) = e(y) if and only if δ_{Cj}(x) = δ_{Cj}(y) for all j = 1, . . . , p (because A is invertible),

that is, if and only if x and y belong to the same connected component. So, given any basis of the null space of L, the function x ↦ e(x) determines these connected components. Thus a (not very efficient) way of determining the connected components of the graph can be to diagonalize the operator L (written as an N by N matrix, where N = |T|), extract the p eigenvectors e1, . . . , ep associated with the eigenvalue zero and deduce from the function e(x) above the set of connected components.

Now, in practice, the graph associated with T and β will not separate nicely into connected components that cluster the training set. Most of the time, because of noise or some weak connections, there will be only one such component, or in any case many fewer than what one would expect when clustering the data. The previous discussion suggests, however, that in the presence of moderate noise in the connection weights, one may expect the eigenvectors associated with the p smallest eigenvalues of L to provide vectors e(x), x ∈ T, such that e(x) and e(y) have similar values if x and y belong to the same cluster (see fig. 20.2). In such cases, these clusters should be easy to determine using, say, K-means on the transformed dataset T̃ = (e(x), x ∈ T). This is summarized in the following algorithm.

Algorithm 20.9 (Spectral Graph Partitioning)

Let T ⊂ R be the training set and (x, y) ↦ β(x, y) a similarity measure defined on T × T. Let p be the desired number of clusters.

(1) Form the Laplacian operator described in (20.10) and let e1, . . . , ep be its eigenvectors associated with the p lowest eigenvalues. For x ∈ T, let e(x) ∈ R^p be given by

e(x) = (e1(x), . . . , ep(x))^T ∈ R^p.

(2) Apply the K-means algorithm (or one of its variants) with p clusters to T̃ = (e(x), x ∈ T).

20.6 Deciding the number of clusters

20.6.1 Detecting elbows

The number, p, of subsets into which the population should be partitioned is rarely known a priori, and several methods have been introduced in the literature in order to assess the ideal number of clusters. We now review some of these methods, denoting for this purpose by L*(p) the minimized cost function obtained with p clusters, e.g., using (20.3),

L*(p) = min{Wα(A1, . . . , Ap, c1, . . . , cp) : A1, . . . , Ap partition of T, c1, . . . , cp ∈ R}

in the case of K-medoids (this definition is algorithm dependent).

Figure 20.2: Example of data transformed using the eigenvectors of the graph Laplacian. Left: original data. Center: result of a K-means algorithm with three clusters applied to the transformed data (2D projection). Right: visualization of the cluster labels on the original data.

It is clear that L* is a decreasing function of p. It is also natural to expect that L* should decrease significantly when p is smaller than the correct number of clusters, while the variation should be more marginal when p is overestimated, because the cost of putting together two sets of points that are far apart (which happens when p is too small) is typically larger than the gain in splitting a homogeneous region in two.

The simplest approach in this context is to visualize L*(p) as a function of p and try to locate at which value the resulting curve makes an “elbow,” i.e., switches from a sharply decreasing slope to a milder one. Figure 20.3 provides an illustration of this visualization when the true number of clusters is three (the data in each cluster following a normal distribution). When the clusters are well separated, an elbow clearly appears on the graph of L*, but this situation is harder to observe when clusters overlap with each other.

One can measure the “curvature” at the elbow using the distance between each point in the graph of (p, L*(p)) and the line between its predecessor and successor. The result gives the criterion

C(p) = (L*(p + 1) + L*(p − 1) − 2L*(p)) / √((L*(p + 1) − L*(p − 1))² + 4),

specifying the elbow point as the value of p at which C attains its maximum. For both examples in fig. 20.3, this method returns the correct number of clusters (3).
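The criterion is immediate to evaluate once the sequence of optimal costs has been computed; in the sketch below, L is assumed to contain L*(1), . . . , L*(pmax) in order.

```python
import numpy as np

def elbow_point(L):
    """Return the value of p (between 2 and pmax - 1) maximizing C(p)."""
    L = np.asarray(L, dtype=float)
    p = np.arange(2, len(L))   # candidate values of p; L[p - 1] is L*(p)
    C = (L[p] + L[p - 2] - 2 * L[p - 1]) / np.sqrt((L[p] - L[p - 2]) ** 2 + 4)
    return int(p[np.argmax(C)])
```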

20.6.2 The Caliński and Harabasz index

Several other criteria have been introduced in the literature. Caliński and Harabasz [45] propose to maximize the ratio of the normalized between-group and within-group sums of squares associated with K-means. For a given p, let c1, . . . , cp denote the optimal centers, and A1, . . . , Ap the optimal partition, with Nk = |Ak|.

Figure 20.3: Elbow graphs for K-means clustering for two populations generated as mixtures of Gaussians.

The normalized between-group sum of squares is

hα(p) = (1/(p − 1)) Σ_{k=1}^p Nk |ck − x̄|²,

where x̄ is the global mean of the data, and the normalized within-group sum of squares is

wα(p) = (1/(N − p)) Σ_{k=1}^p Σ_{x∈Ak} |x − ck|².

Caliński and Harabasz [45] suggest to maximize γ_CH(p) = hα(p)/wα(p).

This criterion can be extended to other types of cluster analysis. We have seen in section 20.3.4 that, when α(x, y) = |x − y|²,

(1/2) Σ_{k=1}^p (1/Nk) Σ_{x,y∈Ak} α(x, y) = Σ_{k=1}^p Σ_{x∈Ak} |x − ck|².

We also have

Σ_{x∈T} |x − x̄|² = Σ_{k=1}^p Σ_{x∈Ak} |x − ck|² + Σ_{k=1}^p Nk |ck − x̄|²,

and the left-hand side is also equal to

(1/(2N)) Σ_{x,y∈T} α(x, y).

It follows that, when α(x, y) = |x − y|²,

hα(p) = (1/(2(p − 1))) [ (1/N) Σ_{x,y∈T} α(x, y) − Σ_{k=1}^p (1/Nk) Σ_{x,y∈Ak} α(x, y) ]

and

wα(p) = (1/(2(N − p))) Σ_{k=1}^p (1/Nk) Σ_{x,y∈Ak} α(x, y).

These expressions can obviously be applied to any dissimilarity measure, extending γ_CH to general clustering problems.
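For instance, γ_CH can be computed from the matrix of pairwise dissimilarities and a candidate labeling as follows (a sketch assuming non-empty clusters and 1 < p < N):

```python
import numpy as np

def calinski_harabasz(D, labels, p):
    """gamma_CH(p) from a dissimilarity matrix D[k, l] = alpha(x_k, x_l)."""
    N = D.shape[0]
    total = D.sum() / (2 * N)                # (1/2N) sum over all pairs
    within = 0.0
    for k in range(p):
        idx = np.where(labels == k)[0]
        within += D[np.ix_(idx, idx)].sum() / (2 * idx.size)
    h = (total - within) / (p - 1)           # normalized between-group term
    w = within / (N - p)                     # normalized within-group term
    return h / w
```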

20.6.3 The “silhouette” index

For x ∈ T, let A(x) denote the cluster Ak to which x belongs and let

dα(x, Ak) = (1/Nk) Σ_{y∈Ak} α(x, y).

Let aα(x, p) = dα(x, A(x)) and bα(x, p) = min{dα(x, Ak) : Ak ≠ A(x)}. Define the silhouette index of x in the segmentation [171] by

sα(x, p) = (bα(x, p) − aα(x, p)) / max(bα(x, p), aα(x, p)) ∈ [−1, 1].

This index measures how well x is classified in the partitioning. It is large when the mean distance between x and the other objects in its class is small compared to the minimum mean distance between x and any other class. In order to estimate the best number of clusters with this criterion, one can then maximize the average index:

γR(p) = (1/N) Σ_{x∈T} sα(x, p).
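A sketch of the average silhouette index, computed from a dissimilarity matrix and following the definition above (which includes x itself in dα(x, A(x))):

```python
import numpy as np

def mean_silhouette(D, labels, p):
    """gamma_R(p) for a dissimilarity matrix D and labels in {0, ..., p-1}."""
    N = D.shape[0]
    # d[k, j] = mean dissimilarity from x_k to cluster A_j (clusters non-empty)
    d = np.stack([D[:, labels == j].mean(axis=1) for j in range(p)], axis=1)
    a = d[np.arange(N), labels]          # own-cluster mean
    d_other = d.copy()
    d_other[np.arange(N), labels] = np.inf
    b = d_other.min(axis=1)              # best other cluster
    return ((b - a) / np.maximum(a, b)).mean()
```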
Remark 20.9 One can rewrite the Caliński and Harabasz index using the notation introduced for the silhouette index. Indeed, we have

hα(p) = (1/(2(p − 1))) Σ_{x∈T} Σ_{k=1}^p (Nk/N) (dα(x, Ak) − dα(x, A(x)))

and

wα(p) = (1/(2(N − p))) Σ_{k=1}^p Σ_{x∈Ak} dα(x, Ak). □

Figure 20.4: Division of the unit square into clusters for uniformly distributed data.

20.6.4 Comparing to homogeneous data

Several selection methods choose p based on a comparison of the data to a “null hypothesis” of no cluster. For example, assume that K-means is applied to a training set T whose samples are drawn according to the uniform distribution on [0, 1]^d. Given centers c1, . . . , cp, let Āk be the set of points in [0, 1]^d that are closer to ck than to any other center. Then the segmentation of T is formed by the sets Ak = {x ∈ T : x ∈ Āk} and, for large enough N, we can approximate |Ak|/N (by the law of large numbers) by the volume of the set Āk, which we will denote by vol(Āk).

Let us assume that c1, . . . , cp are uniformly spaced, so that the sets Āk have similar volumes (close to 1/p) and roughly spherical shapes (see fig. 20.4). This implies that

∫_{Āk} |x − ck|² dx ≃ vol(Āk) (d/(d + 2)) rp²,

where rp is the radius of a sphere of volume 1/p, i.e., p rp^d ≃ d/Γ_{d−1}, where Γ_{d−1} is the surface area of the unit sphere in R^d. So we should have, for some constant C(d) that only depends on d,

Σ_{x∈Ak} |x − ck|² ≃ N ∫_{Āk} |x − ck|² dx ≃ C(d) N p^{−2/d−1},

so that L*(p) ≃ C(d) N p^{−2/d}.

This suggests that, for fixed N and d, p^{2/d} L*(p) should vary slowly when p overestimates the number of clusters (assuming that this operation divides a homogeneous cluster). Based on this analysis, Krzanowski and Lai [112] introduced the difference-ratio criterion, namely

γ_KL(p) = ((p − 1)^{2/d} L*(p − 1) − p^{2/d} L*(p)) / (p^{2/d} L*(p) − (p + 1)^{2/d} L*(p + 1)),

and estimate the number of clusters by taking the value of p that maximizes γ_KL.

Another similar approach, introduced by Sugar and James [186], is based on an analysis of mixtures of Gaussians, namely assuming an underlying model with p0 groups, where data in group k follow a Gaussian distribution N(µk, Id) (possibly after standardizing the covariance matrix). In that work, the authors show that, if d (the dimension) tends to infinity, with the minimal distance between centers growing proportionally to √d, then L*(p)/d tends to infinity when p < p0. They also show that, under similar assumptions, L*(p)/d behaves like p^{−2/d} for p ≥ p0, still for large dimensions. Based on this, they suggest using the criterion

γ_SJ(p) = (L*(p)/d)^{−ν} − (L*(p − 1)/d)^{−ν}

(with the convention that (L*(0)/d)^{−ν} = 0) for some positive number ν, and selecting the value of p that maximizes γ_SJ. Indeed, in the case of Gaussian mixtures, the choice ν = d/2 ensures that, in large dimensions, γ_SJ(p) is small for p < p0, close to 1 for p > p0 and close to p0 for p = p0.

A more computational approach, based on Monte-Carlo simulations, has been introduced in Tibshirani et al. [192], defining the gap index

γ_TWH(p) = E(L*(p, T♯)) − L*(p, T),

where L*(p, T) denotes the optimal value of the optimized cost with p clusters for a training set T. The notation T♯ represents a random training set, with the same size and dimension as T, generated using an unclustered probability distribution used as a reference. In Tibshirani et al. [192], this distribution is taken as uniform (over the smallest hypercube containing the observed data), or uniform on the coefficients of a principal component decomposition of the data (see chapter 21). The expectation E(L*(p, T♯)) is computed by Monte-Carlo simulation, by sampling many realizations of T♯, running the clustering algorithm for each of them and averaging the optimal costs.

One can expect L*(p, T) (for observed data) to decrease much faster (when adding a cluster) than its expectation for homogeneous data when p < p0, and the decrease of both terms to be comparable when p ≥ p0. So the number of clusters can in principle be estimated by detecting an elbow in the graph of γ_TWH(p) as a function of p. The procedure suggested in Tibshirani et al. [192] in order to detect this elbow is to look for the first index p such that

γ_TWH(p + 1) ≤ γ_TWH(p) + σ(p + 1),

where σ(p + 1) is the standard deviation of L*(p + 1, T♯) for homogeneous data, also estimated via Monte-Carlo simulation.

Figures 20.5 to 20.7 provide a comparative illustration of some of these indexes.
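A Monte-Carlo sketch of the gap procedure follows; kmeans_cost(X, p) is assumed to be supplied by the user and to return L*(p, X) (for example the best cost over a few restarts of the K-means sketch of section 20.3.1), and the uniform reference distribution is the first option mentioned above.

```python
import numpy as np

def gap_number_of_clusters(X, kmeans_cost, pmax, n_ref=20, rng=None):
    """Estimate p with the gap index, using a uniform reference distribution."""
    rng = np.random.default_rng(rng)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, sds = [], []
    for p in range(1, pmax + 1):
        ref_costs = [kmeans_cost(rng.uniform(lo, hi, size=X.shape), p)
                     for _ in range(n_ref)]
        gaps.append(np.mean(ref_costs) - kmeans_cost(X, p))
        sds.append(np.std(ref_costs))
    for p in range(1, pmax):   # first p with gamma(p+1) <= gamma(p) + sigma(p+1)
        if gaps[p] <= gaps[p - 1] + sds[p]:
            return p
    return pmax
```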

20.7 Bayesian Clustering

20.7.1 Introduction

We have seen an example of model-based clustering with mixtures of Gaussian distributions. The main parameters in this model were the number of classes, p, the probabilities αj associated with each cluster, and the parameters of the conditional distribution (e.g., N(cj, σ²Id_{R^d})) of X given that it belongs to the jth cluster. In the approach we described, these parameters were estimated from data using maximum likelihood (through the EM algorithm) and the probabilities fZ(j|x) were then estimated in order to compute the most likely clustering. We interpreted fZ(j|x) as the conditional probability P(Z = j|X = x), where Z ∈ {1, . . . , p} represents the group variable. The natural generative order is Z → X: first decide to which group the observation belongs, then sample the value of X conditionally to this group. Clustering is in this case reversing the order, i.e., computing the posterior distribution of Z given X.

In a Bayesian approach, the parameters p, α, c and σ² are also considered as random variables, so that (letting θ denote the vector formed by these parameters) the generative random sequence becomes θ → Z → X. Importantly, θ is assumed to be generated once and for all, even if several samples of X are observed, yielding the generative sequence for an N-sample:

θ → (Z1, . . . , ZN) → (X1, . . . , XN).

We use below underlined letters to denote configurations of points, Z = (Z1, . . . , ZN), X = (X1, . . . , XN), etc. We also use capital letters, or boldface letters for Greek symbols, to differentiate random variables from their realizations.

Figure 20.5: Comparison of cluster indices for Gaussian clusters. First row: original data and ground truth. Second row: plots of four indices as functions of p (elbow; Caliński and Harabasz; silhouette; Sugar and James).

Figure 20.6: Comparison of cluster indices for Gaussian clusters. First row: original data and ground truth. Second row: plots of four indices as functions of p (elbow; Caliński and Harabasz; silhouette; Sugar and James).

Figure 20.7: Comparison of cluster indices for Gaussian clusters. First row: original data and ground truth. Second row: plots of four indices as functions of p (elbow; Caliński and Harabasz; silhouette; Sugar and James).

Clusters are still evaluated based on the conditional distribution of Z given X, but this distribution must now be evaluated by averaging the conditional distribution of Z and θ given X with respect to θ; formally (the symbol ∝ meaning “equal up to a multiplicative constant”),

P(z|x) = ∫ P(z, θ|x) dθ ∝ ∫ Π_{k=1}^N P(xk|zk, θ) P(zk|θ) P(θ) dθ.

In this expression, P(θ)dθ denotes an integration with respect to the prior distribution of the parameters. This distribution is part of the design of the method, but one usually chooses it so that it leads to simple computations, using so-called conjugate priors, which are such that posterior distributions belong to the same parametric family as the prior. For example, the conjugate prior for the mean of a Gaussian distribution (such as ci in our model) is also a Gaussian distribution. The conjugate prior for a scalar variance is the inverse gamma distribution, with density

(v^u / Γ(u)) s^{−u−1} exp(−v/s)

for some parameters u, v. A conjugate prior for the class probabilities α = (α1, . . . , αp) is the Dirichlet distribution, with density

D(α1, . . . , αp) = (Γ(a1 + · · · + ap) / (Γ(a1) · · · Γ(ap))) Π_{j=1}^p αj^{aj−1}

on the simplex

Sp = {(α1, . . . , αp) ∈ R^p : αi ≥ 0, α1 + · · · + αp = 1}.

Note that these conjugate priors have the same form (up to normalization) as the parametric model densities when considered as functions of the parameters.

20.7.2 Model with a bounded number of clusters

We first discuss the Bayesian approach assuming that the number of clusters is smaller than a fixed number, p. In this example, we assume that c1, . . . , cp are modeled as independent Gaussian variables N(0, τ²Id_{R^d}), that σ² follows an inverse gamma distribution with parameters u and v, and that (α1, . . . , αp) follows a Dirichlet distribution with parameters (a, . . . , a).

Analytical example. The joint probability density of (X, Z) and θ is proportional to

(σ²)^{−u−1} e^{−v/σ²} e^{−Σ_{j=1}^p |cj|²/2τ²} Π_{j=1}^p αj^{a−1} Π_{k=1}^N α_{zk} (σ²)^{−d/2} e^{−|xk−c_{zk}|²/2σ²}

= (σ²)^{−u−dN/2−1} exp(−(v + (1/2) Σ_{k=1}^N |xk − c_{zk}|²)/σ²) e^{−Σ_{j=1}^p |cj|²/2τ²} Π_{j=1}^p αj^{a+Nj−1}.

One can explicitly integrate this last expression with respect to σ² and α, using the expressions of the normalizing constants in the inverse gamma and Dirichlet distributions, yielding (after integration and ignoring constant terms)

(Γ(a + N1) · · · Γ(a + Np) / (v + (1/2) Σ_{k=1}^N |xk − c_{zk}|²)^{u+dN/2}) exp(−Σ_{j=1}^p |cj|²/2τ²)

= (Γ(a + N1) · · · Γ(a + Np) / (v + (1/2) Sw + (1/2) Σ_{j=1}^p Nj |cj − x̄j|²)^{u+dN/2}) exp(−Σ_{j=1}^p |cj|²/2τ²),

where x̄j denotes the average of the xk such that zk = j and Sw = Σ_{k=1}^N |xk − x̄_{zk}|² is the within-group sum of squares. Note that this sum of squares depends on x and z, and that the group sizes (N1, . . . , Np) depend on z.

Let us assume a “non-informative prior” on the centers, which corresponds to letting τ tend to infinity and neglecting the last exponential. The remaining expression can now be integrated with respect to c1, . . . , cp by making the change of variables µj = √(Nj/(2v + Sw)) (cj − x̄j) and using the fact that

∫_{(R^d)^p} dc1 · · · dcp / (v + (1/2) Sw + (1/2) Σ_{j=1}^p Nj |cj − x̄j|²)^{u+dN/2}
= (2v + Sw)^{(p−N)d/2−u} Π_{j=1}^p Nj^{−d/2} ∫_{(R^d)^p} dµ1 · · · dµp / ((1/2) + (1/2) Σ_{j=1}^p |µj|²)^{u+dN/2}

and the final integral does not depend on x or z. It follows from this that the conditional distribution of Z given x takes the form

P(z|x) = C(x) Π_{j=1}^p Γ(a + Nj) / ((2v + Sw)^{(N−p)d/2+u} Π_{j=1}^p Nj^{d/2})

where C(x) is a normalization constant ensuring that the right-hand side is a probability distribution over configurations z = (z1, . . . , zN) ∈ {1, . . . , p}^N. In order to obtain

the most likely configuration for this posterior distribution, one should therefore minimize in z the function

((N − p)d/2 + u) log(2v + Sw) + (d/2) Σ_{j=1}^p log Nj − Σ_{j=1}^p log Γ(a + Nj).

This final optimization problem cannot be solved in closed form, but it can be carried out numerically. One can simplify it a little by keeping only the leading-order terms in the last two sums (using Stirling's formula for the gamma function) and minimizing

((N − p)d/2 + u) log(2v + Sw) − Σ_{j=1}^p (a + Nj) log(a + Nj).

This expression has a nice interpretation: minimizing the first term amounts to minimizing the within-group sum of squares, the same objective function as in K-means, while the second term is an entropy term that favors clusters with similar sizes.

Monte-Carlo simulation. An alternative to this analytical approach is to use Monte-Carlo simulations to estimate properties of the posterior distribution numerically. While they are often computationally demanding, Monte-Carlo methods are more flexible and can be used in situations where analytic computations are intractable. In order to sample from the distribution of Z given x, it is actually easier to sample from the joint distribution of (Z, θ) given x, because this distribution has a simpler form. Of course, if the pair (Z, θ) is sampled from the conditional distribution given x, the first component, Z, will follow the posterior distribution we are interested in.

In the context of the discussed example, this reduces to sampling from a distribution proportional to

(σ²)^{−u−1} e^{−v/σ²} e^{−Σ_{j=1}^p |cj|²/2τ²} Π_{j=1}^p αj^{a−1} Π_{k=1}^N α_{zk} (σ²)^{−d/2} e^{−|xk−c_{zk}|²/2σ²}.    (20.11)

Sampling from all these variables at once is not tractable, but it is easy to sample from them in sub-groups, conditionally on the rest of the variables. We can, for example, deduce from the expression above the following conditional distributions.

(i) Given (α, c, z), σ² follows an inverse gamma distribution with parameters u + dN/2 and v + (1/2) Σ_{k=1}^N |xk − c_{zk}|².

(ii) Given (c, z, σ²), α follows a Dirichlet distribution with parameters a + N1, . . . , a + Np.

(iii) Given (z, σ², α), c1, . . . , cp are independent and follow Gaussian distributions, respectively with mean (1 + σ²/(Nj τ²))⁻¹ x̄j and variance (Nj/σ² + 1/τ²)⁻¹.

(iv) Given (σ², α, c), z1, . . . , zN are independent and

P(zk = j|σ², α, c, x) ∝ αj e^{−|xk−cj|²/2σ²}.

Algorithm 20.10 (Gibbs sampling for mixtures of Gaussians (Bayesian case))

(1) Initialize the variables α, c, σ² and z, for example generated according to the prior distribution.

(2) Loop a large number of times over the following steps:

(i) Simulate a new value of σ² according to an inverse gamma distribution with parameters u + dN/2 and v + (1/2) Σ_{k=1}^N |xk − c_{zk}|².

(ii) Simulate new values for α1, . . . , αp according to a Dirichlet distribution with parameters a + N1, . . . , a + Np.

(iii) Simulate new values for c1, . . . , cp independently, sampling cj according to a Gaussian distribution with mean (1 + σ²/(Nj τ²))⁻¹ x̄j and variance (Nj/σ² + 1/τ²)⁻¹.

(iv) Simulate new values of z1, . . . , zN independently such that

P(zk = j|σ², α, c, x) ∝ αj e^{−|xk−cj|²/2σ²}.

Note that this algorithm only asymptotically provides a sample of the posterior distribution (it has to be stopped at some point, of course). Note also that, at each step, the labels z1, . . . , zN provide a random partition of the set {1, . . . , N}, and this partition changes at every step.

To estimate one single partition out of this simulation, several strategies are possible. Using the simulation, one can estimate the probability wkl that xk and xl belong to the same cluster. This can be done by averaging the number of times that zk = zl was observed along the Gibbs sampling iterations (from which one usually excludes a few early “burn-in” iterations). These weights wkl can then be used as similarity measures in a clustering algorithm.

Alternatively, one can average, for each k, the values of the class center c_{zk} associated with k, still along the Gibbs sampling iterations. These average values can then be used as input of, say, a K-means algorithm to estimate final clusters.
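A compact numpy implementation of the sampler could look as follows; the hyperparameter defaults are arbitrary illustrative values.

```python
import numpy as np

def gibbs_mog(X, p, u=2.0, v=1.0, tau2=100.0, a=1.0, n_sweeps=1000, rng=None):
    """Gibbs sweeps of Algorithm 20.10; returns the label samples."""
    rng = np.random.default_rng(rng)
    N, d = X.shape
    z = rng.integers(p, size=N)
    c = X[rng.choice(N, size=p, replace=False)].copy()
    samples = []
    for _ in range(n_sweeps):
        # (i) sigma^2 ~ inverse gamma(u + dN/2, v + (1/2) sum |x_k - c_{z_k}|^2)
        resid = ((X - c[z]) ** 2).sum()
        sigma2 = 1.0 / rng.gamma(u + d * N / 2, 1.0 / (v + resid / 2))
        # (ii) alpha ~ Dirichlet(a + N_1, ..., a + N_p)
        counts = np.bincount(z, minlength=p)
        alpha = rng.dirichlet(a + counts)
        # (iii) c_j ~ Gaussian with the conditional mean and variance above
        for j in range(p):
            var = 1.0 / (counts[j] / sigma2 + 1.0 / tau2)
            mean = (var / sigma2) * X[z == j].sum(axis=0)
            c[j] = mean + np.sqrt(var) * rng.standard_normal(d)
        # (iv) z_k with P(z_k = j) proportional to alpha_j exp(-|x_k - c_j|^2 / 2 sigma^2)
        d2 = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        logp = np.log(alpha)[None, :] - d2 / (2 * sigma2)
        logp -= logp.max(axis=1, keepdims=True)
        probs = np.exp(logp)
        probs /= probs.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(p, p=probs[k]) for k in range(N)])
        samples.append(z.copy())
    return samples
```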

Mean-field approximation. We conclude this section with a variational Bayes approximation of the posterior distribution. We will make a mean-field approximation, in which all parameters and latent variables are independent, therefore approximating the distribution in (20.11) by a product distribution taking the form

g(σ², α, c, z) = g^{(σ²)}(σ²) g^{(α)}(α) Π_{j=1}^p g_j^{(c)}(cj) Π_{k=1}^N g_k^{(z)}(zk).

Here c = (c1, . . . , cp), z = (z1, . . . , zN) and α = (α1, . . . , αp). We have σ² ∈ (0, +∞), c ∈ (R^d)^p, α ∈ Sp, the set of all non-negative α1, . . . , αp that sum to one, and z ∈ {1, . . . , p}^N (so that g_k^{(z)} is a p.m.f. on {1, . . . , p}). We will use the discussion in section 17.3.3 and lemma 17.1, and use the notation introduced in that section, denoting by ⟨ϕ⟩ the expectation of a function ϕ of the variables above with respect to the p.d.f. g.

The log-likelihood for a mixture of Gaussians takes the form (ignoring constant terms)

ℓ(σ², α, c, z) = −(u + 1) log σ² − vσ⁻² − (1/2τ²) Σ_{j=1}^p |cj|² + (a − 1) Σ_{j=1}^p log αj
− (Nd/2) log σ² − (σ⁻²/2) Σ_{k=1}^N |xk − c_{zk}|² + Σ_{k=1}^N log α_{zk}

= −(u + 1) log σ² − vσ⁻² − (1/2τ²) Σ_{j=1}^p |cj|² + (a − 1) Σ_{j=1}^p log αj
− (Nd/2) log σ² − (σ⁻²/2) Σ_{k=1}^N Σ_{j=1}^p |xk − cj|² 1_{zk=j} + Σ_{k=1}^N Σ_{j=1}^p (log αj) 1_{zk=j},

and can therefore be decomposed as a sum of products of functions of single variables, as assumed in section 17.3.3. Using lemma 17.1, we can identify each of the distributions composing g, namely:

• g^{(σ²)} is the p.d.f. of an inverse gamma distribution with parameters ũ = u + Nd/2 and

ṽ = v + (1/2) Σ_{k=1}^N Σ_{j=1}^p ⟨|xk − Cj|²⟩ ⟨1_{Zk=j}⟩.

• g_j^{(c)} is the p.d.f. of a Gaussian N(m̃j, σ̃j² Id_{R^d}) with, letting

ζ̃(j) = Σ_{k=1}^N ⟨1_{Zk=j}⟩ = Σ_{k=1}^N g_k^{(z)}(j),

σ̃j² = (1/τ² + ⟨σ⁻²⟩ ζ̃(j))⁻¹ and m̃j = ⟨σ⁻²⟩ σ̃j² Σ_{k=1}^N g_k^{(z)}(j) xk.

• g^{(α)} is the p.d.f. of a Dirichlet distribution with parameters ã1, . . . , ãp, where ãj = a + ζ̃(j).

• Finally, g_k^{(z)} is a p.m.f. on {1, . . . , p} with

g_k^{(z)}(j) ∝ exp(−(1/2) ⟨σ⁻²⟩ ⟨|xk − Cj|²⟩ + ⟨log αj⟩).

To complete the consistency equations, it now suffices to evaluate the expectations in the formulas above as functions of the other parameters. We leave to the reader the verification of the following statements.

• If σ² follows an inverse gamma distribution with parameters ũ and ṽ, then ⟨σ⁻²⟩ = ũ/ṽ.

• If Cj ∼ N(m̃j, σ̃j² Id_{R^d}), then ⟨|xk − Cj|²⟩ = |xk − m̃j|² + d σ̃j².

• If α follows a Dirichlet distribution with parameters ã1, . . . , ãp, then ⟨log αj⟩ = ψ(ãj) − ψ(ã1 + · · · + ãp), where ψ is the digamma function (the derivative of the logarithm of the gamma function).

Combining these facts with the expression of the mean-field parameters, we can now formulate a mean-field estimation algorithm for mixtures of Gaussians that iteratively applies the consistency equations.

Algorithm 20.11 (Mean-field algorithm for mixtures of Gaussians)

(1) Input: training set (x1, . . . , xN), number of clusters p, prior parameters u, v, τ² and a.

(2) Initialize the variables σ̃1², . . . , σ̃p², m̃1, . . . , m̃p, ã1, . . . , ãp, g̃k(j), k = 1, . . . , N, j = 1, . . . , p.

(3) Let ζ̃(j) = Σ_{k=1}^N g̃k(j), j = 1, . . . , p.

(4) Let

ρ̃² = (1/(u + Nd/2)) (v + (1/2) Σ_{k=1}^N Σ_{j=1}^p g̃k(j) |xk − m̃j|² + (d/2) Σ_{j=1}^p σ̃j² ζ̃(j)).

(5) For j = 1, . . . , p, let σ̃j² = (1/τ² + ζ̃(j)/ρ̃²)⁻¹ and m̃j = (σ̃j²/ρ̃²) Σ_{k=1}^N g̃k(j) xk.

(6) Let ãj = a + ζ̃(j), j = 1, . . . , p.

(7) For k = 1, . . . , N, j = 1, . . . , p, let

g̃k(j) ∝ exp(−(1/(2ρ̃²)) (|xk − m̃j|² + d σ̃j²) + ψ(ãj)).

(8) Compare the updated variables with their previous values and stop if the difference is below a tolerance level. Otherwise, return to (3).

After convergence, g_k^{(z)} provides the mean-field approximation of the posterior probability of classes for observation k and can be used to determine clusters.
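The fixed-point iterations translate directly into numpy (the digamma function ψ is available in scipy.special); the initializations and tolerance below are, again, our own choices.

```python
import numpy as np
from scipy.special import digamma

def mean_field_mog(X, p, u=2.0, v=1.0, tau2=100.0, a=1.0, n_iter=200, rng=None):
    """Fixed-point iterations of Algorithm 20.11."""
    rng = np.random.default_rng(rng)
    N, d = X.shape
    g = rng.dirichlet(np.ones(p), size=N)          # g[k, j], step (2)
    m = X[rng.choice(N, size=p, replace=False)].copy()
    s2 = np.ones(p)                                # sigma~_j^2
    for _ in range(n_iter):
        zeta = g.sum(axis=0)                                         # step (3)
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
        rho2 = (v + 0.5 * (g * d2).sum()
                + 0.5 * d * (s2 * zeta).sum()) / (u + N * d / 2)     # step (4)
        s2 = 1.0 / (1.0 / tau2 + zeta / rho2)                        # step (5)
        m = (s2 / rho2)[:, None] * (g.T @ X)
        atil = a + zeta                                              # step (6)
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)      # updated centers
        logits = -(d2 + d * s2[None, :]) / (2 * rho2) + digamma(atil)[None, :]  # step (7)
        logits -= logits.max(axis=1, keepdims=True)
        g_new = np.exp(logits)
        g_new /= g_new.sum(axis=1, keepdims=True)
        if np.abs(g_new - g).max() < 1e-8:                           # step (8)
            return g_new, m
        g = g_new
    return g, m
```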

20.7.3 Non-parametric priors

The Polya urn. In the previous model with p clusters or fewer, the joint distribution of Z1, . . . , ZN is given by

π(z1, . . . , zN) = (Γ(pa)/Γ(a)^p) ∫_{Sp} Π_{j=1}^p αj^{a+Nj−1} dα = (Γ(pa)/Γ(pa + N)) Π_{j=1}^p Γ(a + Nj)/Γ(a).

Conditionally on z1, . . . , zN, the data model was completed by sampling p sets of parameters, say θ1, . . . , θp, each belonging to a parameter space Θ and following a prior probability distribution with density, say, ψ, and variables X1, . . . , XN, where Xk ∈ R was drawn according to a law depending on its cluster, which we will denote ϕ( · | θ_{zk}). The complete likelihood of the data is then

L(z, θ, x) = (Γ(pa)/Γ(pa + N)) Π_{j=1}^p (Γ(a + Nj)/Γ(a)) Π_{j=1}^p ψ(θj) Π_{k=1}^N ϕ(xk|θ_{zk}).

Note that the right-hand side does not change if one relabels the values of z1, . . . , zN, i.e., if one replaces each zk by s(zk), where s is a permutation of {1, . . . , p}, creating a new configuration denoted s · z. Let [z] denote the equivalence class of z, containing all z′ = s · z, s ∈ Sp: all the labelings in [z] provide the same partition of {1, . . . , N} and can therefore be identified. One defines a probability distribution π̄ over these equivalence classes by letting

π̄([z]) = |[z]| (Γ(pa)/Γ(pa + N)) Π_{j=1}^p Γ(a + Nj)/Γ(a).

The first term on the right-hand side is the number of elements in the equivalence class [z]. To compute it, let p0 = p0(z) denote the number of different values taken by z1, . . . , zN, i.e., the “true” number of clusters (ignoring the empty ones), which now is a function of z. Let A1, . . . , A_{p0} denote the partition associated with z. New labelings equivalent to z can be obtained by assigning any index i1 ∈ {1, . . . , p} to elements of A1, then any index i2 ≠ i1 to elements of A2, etc., so that there are |[z]| = p!/(p − p0)! choices. We therefore find:

π̄([z]) = (p!/(p − p0)!) (Γ(pa)/Γ(pa + N)) Π_{j=1}^p Γ(a + Nj)/Γ(a).
j=1

Letting λ = pa and using the formula Γ(x + 1) = xΓ(x), this can be rewritten as

π̄([z]) = (p(p − 1) · · · (p − p0 + 1) / (λ(λ + 1) · · · (λ + N − 1))) Π_{j=1}^p Π_{i=0}^{Nj−1} (λ/p + i).

Now, the class [z] contains exactly one element ẑ with the following properties:

• ẑ1 = 1,
• ẑk ≤ max(ẑj, j < k) + 1 for all k > 1.

This means that the kth label is either one of those already appearing in (ẑ1, . . . , ẑ_{k−1}) or the next integer in the enumeration. We will call such a ẑ admissible. If we assume that z is admissible in the expression of π̄, we can write

π̄([z]) = (Π_{j=1}^{p0} λ(1 − (j − 1)/p) Π_{i=1}^{Nj−1} (λ/p + i)) / (λ(λ + 1) · · · (λ + N − 1)).
If one takes the limit p → ∞ in this expression, one still gets a probability distribution on admissible labelings, namely

π̄([z]) = (λ^{p0} Π_{j=1}^{p0} (Nj − 1)!) / (λ(λ + 1) · · · (λ + N − 1)).    (20.12)

Recall that, in this equation, p0 is a function of z, equal, for admissible labelings, to the largest j such that Nj > 0.

The probability π̄ is generated by the following sampling scheme, called the Polya urn process, which simulates admissible labelings.

Algorithm 20.12 (Polya Urn)

1. Initialize k = 1, z1 = 1, j = 1. Let N1 = 1.

2. At step k, assume that z1, . . . , zk have been generated, with associated number of clusters equal to j and N1, . . . , Nj elements per cluster. Generate zk+1 such that

zk+1 = i with probability Ni/(λ + k), for i = 1, . . . , j, and zk+1 = j + 1 with probability λ/(λ + k).    (20.13)

3. If zk+1 = i ≤ j, then replace Ni by Ni + 1 and k by k + 1.

4. If zk+1 = j + 1, let N_{j+1} = 1, replace j by j + 1 and k by k + 1.

5. If k < N, return to step 2; otherwise, stop.
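The urn is straightforward to simulate; a sketch:

```python
import numpy as np

def polya_urn(N, lam, rng=None):
    """Sample an admissible labeling z_1, ..., z_N from Algorithm 20.12."""
    rng = np.random.default_rng(rng)
    z = [1]
    counts = [1]                    # counts[i - 1] = N_i
    for k in range(1, N):
        probs = np.array(counts + [lam]) / (lam + k)   # (20.13)
        i = rng.choice(len(probs), p=probs)
        if i == len(counts):
            counts.append(1)        # open a new cluster
        else:
            counts[i] += 1
        z.append(i + 1)
    return z
```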

Using this prior, the complete model for the distribution of the observed data is

L(z, θ, x) = (λ^{p0} Π_{j=1}^{p0} (Nj − 1)! / (λ(λ + 1) · · · (λ + N − 1))) Π_{j=1}^{p0} ψ(θj) Π_{k=1}^N ϕ(xk|θ_{zk}).

Recall that, in this expression, z is restricted to the set of admissible labelings. We also note that admissible labelings are in one-to-one correspondence with the partitions of {1, . . . , N}, so that the latent variable z in this expression can also be interpreted as representing a random partition of this set.

Dirichlet processes. As we will see later, the expression of the global likelihood and the Polya urn model will suffice for us to develop non-parametric clustering methods for a set of observations x1, . . . , xN. However, this model is also associated with an important class of random probability distributions (i.e., random variables taking values in some set of probability distributions) called Dirichlet processes, for which we provide a brief description.

The distribution in (20.12) was obtained by passing to the limit from a model that first generates p numbers α1, . . . , αp, then generates the labels z1, . . . , zN ∈ {1, . . . , p}, identified modulo relabeling. This distribution can also be defined directly, by first defining an infinite sequence of positive numbers (αj, j ≥ 1) such that Σ_{i=1}^∞ αi = 1, followed by the generation of random labels Z1, . . . , ZN such that P(Zk = j) = αj, followed once again by an identification up to relabeling.

The distribution of α that leads to the Polya urn is called the stick-breaking process. This process is such that

αj = Uj Π_{i=1}^{j−1} (1 − Ui),

where U1, U2, . . . is a sequence of i.i.d. variables following a Beta(1, λ) distribution, i.e., with p.d.f. λ(1 − u)^{λ−1} for u ∈ [0, 1]. The stick-breaking interpretation comes from the way α1, α2, . . . can be simulated: let α1 ∼ Beta(1, λ); given α1, . . . , α_{j−1}, let αj = (1 − α1 − · · · − α_{j−1})Uj, where Uj ∼ Beta(1, λ) is independent from the past. Each step can be thought of as breaking the remaining length, 1 − α1 − · · · − α_{j−1}, of an original stick of length 1 using a beta-distributed variable Uj. This process leads to the distribution (20.12) over admissible labelings, i.e., if α is generated according to the stick-breaking process, and Z1, . . . , ZN are independent, each such that P(Zk = j) = αj, then the probability that (Z1, . . . , ZN) is identical, after relabeling, to the admissible configuration z is given by (20.12). (We skip the proof of this result, which is not straightforward.)
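A truncated simulation of the stick-breaking weights (keeping the first n_atoms weights, the remaining mass being left on the tail of the sequence) can be written as:

```python
import numpy as np

def stick_breaking(lam, n_atoms, rng=None):
    """First n_atoms weights alpha_1, alpha_2, ... of a stick-breaking process."""
    rng = np.random.default_rng(rng)
    U = rng.beta(1.0, lam, size=n_atoms)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - U)[:-1]))
    return U * remaining            # alpha_j = U_j prod_{i<j} (1 - U_i)
```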

Now, take a realization α = (α1, α2, . . .) of the stick-breaking process, and independent realizations η = (η1, η2, . . .) drawn according to the p.d.f. ψ. Define

ρ = Σ_{j=1}^∞ αj δ_{ηj}.    (20.14)

For any realization of α and η, ρ is a probability distribution on the parameter space Θ (in which one chooses ηj with probability αj). Since α and η are both random variables, this defines a random variable ρ with values in the space of probability measures on Θ.

This process has the following characteristic property: for any family V1, . . . , Vk ⊂ Θ forming a partition of that set, the random variable (ρ(V1), . . . , ρ(Vk)) follows a Dirichlet distribution with parameters

(λ ∫_{V1} ψ dη, . . . , λ ∫_{Vk} ψ dη).

This is the definition of a Dirichlet process with parameters (λ, ψ) or, simply, with parameter λψ. Conversely, one can also show that any Dirichlet process can be decomposed as in (20.14), where α is a stick-breaking process and η consists of independent realizations of ψ.

Monte-Carlo simulation. The joint distribution of labels, parameters and observed variables can also be deduced from (20.12), with a joint p.d.f. given by

(λ^{p0−1} Π_{j=1}^{p0} (Nj − 1)! / ((λ + 1) · · · (λ + N − 1))) Π_{j=1}^{p0} ψ(ηj) Π_{k=1}^N ϕ(xk|η_{zk}).    (20.15)

The forward simulation of this distribution is a straightforward extension of Algorithm 20.12, namely:

Algorithm 20.13
1 Initialize k = 1, z1 = 1, j = 1. Let N1 = 1.
2 Sample η1 ∼ ψ and x1 ∼ ϕ(·|η1 ).
3 At step k, assume that z1 , . . . , zk has been generated, with associated number of
clusters equal to j and N1 , . . . , Nj elements per cluster. Generate zk+1 such that
$$z_{k+1} = \begin{cases} i & \text{with probability } \dfrac{N_i}{\lambda + k},\ i = 1, \ldots, j\\[1.5ex] j+1 & \text{with probability } \dfrac{\lambda}{\lambda + k} \end{cases}$$

4 If zk+1 = i ≤ j, sample xk+1 ∼ ϕ( · |ηi ). Replace Ni by Ni + 1, k by k + 1.


5 If zk+1 = j + 1, let Nj+1 = 1, sample ηj+1 ∼ ψ and xk+1 ∼ ϕ( · |ηj+1 ). Replace j by
j + 1 and k by k + 1.
6 If k < N, return to step 3; otherwise, stop.
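
The following sketch implements the forward simulation of Algorithm 20.13. The function names and the one-dimensional Gaussian choices for ψ and ϕ in the illustration are ours; any pair of samplers for ψ and ϕ(·|η) can be substituted.

```python
import numpy as np

def sample_crp_mixture(N, lam, sample_eta, sample_x, seed=None):
    """Forward simulation of Algorithm 20.13. sample_eta(rng) draws from psi;
    sample_x(eta, rng) draws from phi(.|eta). Returns 0-based labels and data."""
    rng = np.random.default_rng(seed)
    etas = [sample_eta(rng)]
    z, counts = [0], [1]
    xs = [sample_x(etas[0], rng)]
    for k in range(1, N):
        # existing cluster i with prob N_i/(lam + k), new cluster with prob lam/(lam + k)
        probs = np.array(counts + [lam]) / (lam + k)
        j = int(rng.choice(len(probs), p=probs))
        if j == len(counts):          # open a new cluster with a fresh parameter
            counts.append(0)
            etas.append(sample_eta(rng))
        counts[j] += 1
        z.append(j)
        xs.append(sample_x(etas[j], rng))
    return np.array(z), np.array(xs)

# illustration: psi = N(0, tau^2), phi(.|eta) = N(eta, sigma^2) in one dimension
tau, sigma = 3.0, 0.5
z, x = sample_crp_mixture(200, lam=1.0,
                          sample_eta=lambda r: r.normal(0.0, tau),
                          sample_x=lambda eta, r: r.normal(eta, sigma))
```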

This algorithm cannot be used, of course, to sample from the conditional distri-
bution of Z and η given X = x, and Markov-chain Monte-Carlo must be used for this
purpose. In order to describe how Gibbs sampling may be applied to this problem,
we use the fact that, as previously remarked, using admissible labelings z is equiv-
alent to using partitions A = (A1 , . . . , Ap0 ) of {1, . . . , N }, and we will use the latter
formalism to describe the algorithm. We will also use the notation ηA to denote the
parameter associated to A ∈ A so our new notation for the variables is (A, η) where
A is a partition of {1, . . . , N } and η is a collection (ηA , A ∈ A) with ηA ∈ Θ. Given this,
we want to sample from a conditional p.d.f.

λ|A|−1 A∈A (|A| − 1)! Y


Q Y
Φ(A, η|x) ∝ ψ(ηA ) ϕ(xk |ηA ). (20.16)
(λ + 1) · · · (λ + N − 1)
A∈A k∈A

As an additional notation, given a partition A and an index k ∈ {1, . . . , N}, we let Ak
denote the set A in A that contains k.

The following points are relevant for the design of the sampling algorithm.

(1) The conditional distribution of η given A and the training data is proportional to
$$\prod_{A \in \mathcal{A}} \Big( \psi(\eta_A) \prod_{k \in A} \varphi(x_k \mid \eta_A) \Big).$$
This shows that the parameters ηA, A ∈ A, are independent of each other, with ηA
following a distribution proportional to
$$\eta \mapsto \psi(\eta) \prod_{k \in A} \varphi(x_k \mid \eta).$$

Sampling from this distribution generally offers no special difficulty, especially if
the prior ψ is conjugate to ϕ. Importantly, one does not need to sample exactly from
the conditional distribution of ηA, and it is often more convenient to separate ηA into
several components (such as mean and variance for mixtures of Gaussians) and to
sample from them alternately, creating another level of Gibbs sampling.
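
As an illustration of this step in the conjugate Gaussian case used later in this section (ψ = N(0, τ²Id), ϕ(·|η) = N(η, σ²Id)), the full conditional of a cluster center is Gaussian and can be sampled directly. A minimal sketch, with our own function name:

```python
import numpy as np

def sample_cluster_center(x_A, sigma2, tau2, rng):
    """One draw of eta_A given the cluster's data x_A (shape (|A|, d)),
    for the conjugate Gaussian model psi = N(0, tau2 Id), phi(.|eta) = N(eta, sigma2 Id)."""
    n, d = x_A.shape
    prec = n / sigma2 + 1.0 / tau2              # posterior precision (per coordinate)
    mean = (x_A.sum(axis=0) / sigma2) / prec    # posterior mean
    return mean + rng.normal(size=d) / np.sqrt(prec)
```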
(2) We now consider the issue of updating A. We will use for this purpose the
formalism of Algorithm 13.2. In particular, for each k ∈ {1, . . . , N }, we associate to
the variable (A, η) the pair (A^{(k)}, η^{(k)}), where A^{(k)} is the partition of {1, ..., N} \ {k}
formed by the sets A \ {k}, A ∈ A, with η^{(k)} keeping the corresponding parameters ηA,
unless A = {k}, in which case the set and the corresponding ηA are dropped.
We can write Φ(A, η|x) in the form
$$\Phi(\mathcal A, \eta \mid x) \propto q(A_k, \eta_{A_k})\,\varphi(x_k \mid \eta_{A_k})\, \frac{\lambda^{|\mathcal A^{(k)}|-1} \prod_{B \in \mathcal A^{(k)}} (|B|-1)!}{(\lambda+1)\cdots(\lambda+N-1)}\, \prod_{B \in \mathcal A^{(k)}} \psi(\eta_B) \prod_{l \in B} \varphi(x_l \mid \eta_B) \tag{20.17}$$
with
$$q(A, \theta) = \sum_{B \in \mathcal A^{(k)}} |B|\, \mathbf 1_{A = B \cup \{k\}} + \lambda\, \psi(\theta)\, \mathbf 1_{A = \{k\}}.$$

Partitions A′ that are consistent with A^{(k)} allocate k to one of the clusters in A^{(k)} or
create a new cluster with a new parameter. If one replaces (A, η) by (A′, η′), only
the first two terms in (20.17) are affected, so that the conditional probability of
A′ given A^{(k)} is proportional to q(A′_k, η′_{A′_k}) ϕ(x_k | η′_{A′_k}) and given by
$$\begin{cases} \dfrac{|B|\,\varphi(x_k \mid \eta_B)}{C_1 + \lambda C_2} & \text{if } A'_k = B \cup \{k\},\ \eta'_B = \eta_B,\ B \in \mathcal A^{(k)}\\[2ex] \dfrac{\lambda\, \varphi(x_k \mid \eta'_{\{k\}})\, \psi(\eta'_{\{k\}})}{C_1 + \lambda C_2} & \text{if } A'_k = \{k\}, \end{cases}$$
where
$$C_1 = \sum_{B \in \mathcal A^{(k)}} |B|\, \varphi(x_k \mid \eta_B) \quad\text{and}\quad C_2 = \int_\Theta \varphi(x_k \mid \theta)\, \psi(\theta)\, d\theta.$$

Concretely, this means that one first decides to allocate k to a set B in A^{(k)} with
probability |B|ϕ(x_k|η_B)/(C1 + λC2), and to create a new set with probability λC2/(C1 +
λC2). If a new set is created, then the associated parameter η′_{\{k\}} is sampled according
to the p.d.f. ϕ(x_k|θ)ψ(θ)/C2.
(3) However, sampling using this conditional probability requires the computa-
tion of the integral C2 , which can represent a significant computational burden,
since this has to be done many times in a Gibbs sampling algorithm. A modification
of this algorithm, introduced in Neal [142], avoids this computation by adding new
auxiliary variables at each step of the computation. These variables are m parameters
η1*, ..., ηm* ∈ Θ, where m is a fixed integer. To define the joint distribution of (A, η, η*),
one lets the marginal distribution of (A, η) be given by (20.16) and, conditionally to
(A, η), lets η1*, ..., ηm* be:
(i) independent with density ψ if |Ak| > 1;
(ii) if Ak = {k}, such that ηj* = η_{Ak} for an index j chosen uniformly at random in
{1, ..., m}, the other m − 1 starred parameters being independent with distribution ψ.

With this definition, the joint conditional distribution of (A, η, η*) takes the form
$$\widehat\Phi(\mathcal A, \eta, \eta^* \mid x) \propto \hat q(A_k, \eta_{A_k}, \eta^*)\, \varphi(x_k \mid \eta_{A_k})\, \frac{\lambda^{|\mathcal A^{(k)}|-1} \prod_{B \in \mathcal A^{(k)}} (|B|-1)!}{(\lambda+1)\cdots(\lambda+N-1)}\, \prod_{B \in \mathcal A^{(k)}} \psi(\eta_B) \prod_{l \in B} \varphi(x_l \mid \eta_B) \tag{20.18}$$
with
$$\hat q(A, \theta, \eta_1^*, \ldots, \eta_m^*) = \sum_{B \in \mathcal A^{(k)}} |B|\, \mathbf 1_{\theta = \eta_B,\, A = B \cup \{k\}} \prod_{j=1}^m \psi(\eta_j^*) + \frac{\lambda}{m} \sum_{j=1}^m \mathbf 1_{\theta = \eta_j^*,\, A = \{k\}}\, \psi(\theta) \prod_{i=1,\, i \neq j}^m \psi(\eta_i^*).$$

Note that Φ̂ depends on k, so that the definition of the auxiliary variables changes
at each step of Gibbs sampling. The conditional distribution, for Φ̂, of (A′, η′) given
A^{(k)}, η^{(k)}, η* is such that

• A′_k = B ∪ {k} and η′_{A′_k} = η_B with probability |B|ϕ(x_k|η_B)/C, for B ∈ A^{(k)};
• A′_k = {k} and η′_{A′_k} = η_j* with probability (λ/m)ϕ(x_k|η_j*)/C, j = 1, ..., m.

The constant C is given by
$$C = \sum_{B \in \mathcal A^{(k)}} |B|\, \varphi(x_k \mid \eta_B) + \frac{\lambda}{m} \sum_{j=1}^m \varphi(x_k \mid \eta_j^*)$$

and is therefore easy to compute.

We can now summarize this discussion with Neal’s version of the Gibbs sampling
algorithm.

Algorithm 20.14 (Neal)


Initialize the algorithm with some arbitrary partition and parameters (A, η) (for ex-
ample, generated using the Dirichlet prior). Use the same notation to denote these
variables at the end of the previous iteration of the algorithm. The next iteration is
then run as follows.

(1) For k = 1, . . . , N , reallocate k to a cluster as follows.


(i) Form the new family of sets A(k) and labels η (k) by removing k from the parti-
tion A.
(ii) If |Ak| > 1, generate m variables η1*, ..., ηm* according to ψ. If Ak = {k}, generate
only m − 1 such variables and let the last one be equal to η_{Ak}.
(iii) Allocate k to a new cluster A′ with parameter η′_{A′} according to probabilities
proportional to
$$\begin{cases} |B|\, \varphi(x_k \mid \eta_B) & \text{if } A' = B \cup \{k\} \text{ and } \eta'_{A'} = \eta_B,\ B \in \mathcal A^{(k)}\\[1.5ex] \dfrac{\lambda}{m}\, \varphi(x_k \mid \eta_j^*) & \text{if } A' = \{k\} \text{ and } \eta'_{A'} = \eta_j^*,\ j = 1, \ldots, m. \end{cases}$$
(2) For A ∈ A, update ηA according to the distribution proportional to
$$\psi(\eta) \prod_{k \in A} \varphi(x_k \mid \eta),$$

either directly, or via one step of Gibbs sampling visiting each of the variables that
constitute ηA .
(3) Loop a sufficient number of times over the previous two steps.

After running this algorithm, the set of clusters should be finalized by using
statistics computed along the simulation, as discussed after Algorithm 20.10.

Full example: Mixture of Gaussians. To conclude this section, we summarize the
Monte-Carlo sampling algorithm for mixtures of Gaussians using a non-parametric
Bayesian prior. Here, η ∈ Θ is the center c ∈ R^d, with prior distribution ψ = N(0, τ²Id_{R^d}).
The previous algorithm must be modified because an additional parameter σ² is
shared by all classes, with prior given by an inverse gamma distribution with
parameters u and v. The conditional distribution of the data given (c, σ²) is N(c, σ²Id_{R^d}).

Algorithm 20.15 (Gibbs sampling for non-parametric mixture of Gaussian)


(1) Initialize the algorithm with some arbitrary partition and parameters (A, η).
(2) For k = 1, . . . , N , reallocate k to a cluster as follows.
(i) Form the new family of sets A(k) and labels η (k) by removing k from the parti-
tion A.
(ii) If |Ak| > 1, generate m variables c_i*, i = 1, ..., m, independently with c_i* ∼ N(0, τ²Id_{R^d}).
If Ak = {k}, generate only m − 1 such variables and let the last one be equal
to c_{Ak}.
(iii) Allocate k to a new cluster A′ with parameter c′_{A′} according to probabilities
proportional to
$$\begin{cases} |B| \exp\Big(-\dfrac{|x_k - c_B|^2}{2\sigma^2}\Big) & \text{if } A' = B \cup \{k\} \text{ and } c'_{A'} = c_B,\ B \in \mathcal A^{(k)}\\[2ex] \dfrac{\lambda}{m} \exp\Big(-\dfrac{|x_k - c_j^*|^2}{2\sigma^2}\Big) & \text{if } A' = \{k\} \text{ and } c'_{A'} = c_j^*,\ j = 1, \ldots, m. \end{cases}$$

(3) Simulate a new value of σ² according to an inverse gamma distribution with
parameters u + dN/2 and $v + \frac12 \sum_{k=1}^N |x_k - c_{A_k}|^2$.
(4) Simulate new values for c_A, A ∈ A, independently, sampling c_A according to a
Gaussian distribution with mean (1 + σ²/(|A|τ²))^{-1} x̄_A and variance (|A|/σ² + 1/τ²)^{-1},
where
$$\bar x_A = \frac{1}{|A|} \sum_{k \in A} x_k.$$
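
A compact sketch of one sweep of Algorithm 20.15 is given below. It is meant as an illustration only: the function and variable names are ours, the recycled center is placed deterministically among the auxiliary variables (which is harmless by exchangeability), and no attention is paid to efficiency.

```python
import numpy as np

def gibbs_sweep(x, z, centers, sigma2, lam, tau2, m, u, v, rng):
    """One sweep of Algorithm 20.15 (sketch). x: (N, d) array; z: integer numpy
    array of labels indexing the list `centers`; returns (z, centers, sigma2)."""
    N, d = x.shape
    for k in range(N):
        old = z[k]
        z[k] = -1                                     # remove k from the partition
        counts = np.bincount(z[z >= 0], minlength=len(centers))
        if counts[old] == 0:                          # k was alone: recycle its center
            aux = [centers[old]] + [rng.normal(0., np.sqrt(tau2), d) for _ in range(m - 1)]
            centers.pop(old)
            z[z > old] -= 1
            counts = np.delete(counts, old)
        else:
            aux = [rng.normal(0., np.sqrt(tau2), d) for _ in range(m)]
        cand = centers + aux
        sq = np.array([np.sum((x[k] - c) ** 2) for c in cand])
        w = np.concatenate([counts, np.full(m, lam / m)])
        w = w * np.exp(-(sq - sq.min()) / (2 * sigma2))   # stabilized Gaussian weights
        j = int(rng.choice(len(cand), p=w / w.sum()))
        if j >= len(centers):                         # a new cluster is created
            centers.append(cand[j])
            j = len(centers) - 1
        z[k] = j
    # step (3): inverse gamma full conditional for sigma2
    resid = sum(np.sum((x[k] - centers[z[k]]) ** 2) for k in range(N))
    sigma2 = 1.0 / rng.gamma(u + d * N / 2.0, 1.0 / (v + 0.5 * resid))
    # step (4): Gaussian full conditionals for the centers
    for j in range(len(centers)):
        xa = x[z == j]
        prec = len(xa) / sigma2 + 1.0 / tau2
        centers[j] = (xa.sum(axis=0) / sigma2) / prec + rng.normal(size=d) / np.sqrt(prec)
    return z, centers, sigma2
```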
Chapter 21

Dimension Reduction and Factor Analysis

21.1 Principal component analysis

21.1.1 General Framework

Factor analysis aims at representing potentially high-dimensional data as functions


of a (generally) small number of “factors,” with a representation taking the general
form
X = Φ(Y , θ) + residual, (21.1)
where X is the observation, Y provides the factors, and Φ is a function parametrized
by θ. A factor analysis model must therefore specify Φ (often, a linear function of Y ),
add hypotheses on Y (such as its dimension, or properties of its distribution) and on
the residuals. The transformation Φ is estimated from training data, but, ideally, the
method should also provide an algorithm that infers Y from a new observation of X.
Most of the time, Y is low-dimensional, so that the model also implies a reduction
of dimension.

We start our discussion with principal component analysis (or PCA). This method
can be characterized in multiple ways, and we introduce it through the angle of
data approximation. In the following, the random variable X takes values in a finite-
or infinite-dimensional inner-product space H. We will denote, as usual, by ⟨· , ·⟩_H
the inner product in this space.

Assume that N independent realizations of X, denoted x1, ..., xN, are observed,
forming our training set T. Our goal is to obtain a small-dimensional representation
of these data, while losing a minimal amount of relevant information. PCA is the
simplest and most commonly used approach developed for this purpose.

If V is a finite-dimensional subspace of H, we denote by PV (y) the orthogonal


projection of y ∈ H on V , i.e., the element ξ ∈ V such that ky − ξk2H is minimal

(see section 6.4). Recall that this orthogonal projection is characterized by the two
properties: (i) P_V(y) ∈ V and (ii) (y − P_V(y)) ⊥ V.

Given a target dimension p, PCA determines a p-dimensional subspace of H, say,


V and a point c ∈ H, such that, letting

Rk = xk − c − PV (xk − c)

for k = 1, . . . , N , the residual sum of squares

$$S = \sum_{k=1}^N \|R_k\|_H^2 \tag{21.2}$$

is as small as possible.
An optimal choice for c is $c = \bar x = \frac1N \sum_{k=1}^N x_k$. Indeed, using the linearity of the
orthogonal projection, we have
$$\begin{aligned} S &= \sum_{k=1}^N \|x_k - P_V(x_k) - (c - P_V(c))\|_H^2\\ &= \sum_{k=1}^N \|x_k - P_V(x_k) - (\bar x - P_V(\bar x))\|_H^2 + N\, \|\bar x - P_V(\bar x) - (c - P_V(c))\|_H^2. \end{aligned}$$

Given this, there would be no loss of generality in assuming that all xk ’s have been
replaced by xk − x and taking c = 0. While this is often done in the literature, there
are some advantages (especially when discussing kernel methods) in keeping the
average explicit in the notation, as we will continue to do.

Introducing an orthonormal basis (e1, ..., ep) of V, one has
$$P_V(x_k - \bar x) = \sum_{i=1}^p \rho_k(i)\, e_i$$
with ρ_k(i) = ⟨x_k − x̄, e_i⟩_H. One can then reformulate the problem in terms of
(e1, ..., ep), which must minimize
$$S = \sum_{k=1}^N \Big\| x_k - \bar x - \sum_{i=1}^p \langle x_k - \bar x, e_i \rangle_H\, e_i \Big\|_H^2 = \sum_{k=1}^N \|x_k - \bar x\|_H^2 - \sum_{i=1}^p \sum_{k=1}^N \langle x_k - \bar x, e_i \rangle_H^2.$$

For u, v ∈ H, define
$$\langle u, v \rangle_T = \frac1N \sum_{k=1}^N \langle x_k - \bar x, u \rangle_H\, \langle x_k - \bar x, v \rangle_H$$
and ‖u‖_T = ⟨u, u⟩_T^{1/2} (the index T refers to the fact that this norm is associated with
the training set). This provides a new quadratic form on H. The formula above
shows that minimizing S is equivalent to maximizing
$$\sum_{i=1}^p \|e_i\|_T^2$$
subject to the constraint that (e1, ..., ep) is orthonormal in H.

Let us consider a slightly more general problem. If H is a separable Hilbert
space¹ and µ is a square-integrable probability measure on H, i.e., such that
$$\int_H \|x\|_H^2\, d\mu(x) < \infty,$$
one can define m = ∫_H x dµ(x) and σ_µ² = ∫_H ‖x − m‖_H² dµ. One can then define the
covariance bilinear form
$$\Gamma_\mu(u, v) = \int_H \langle u, x - m \rangle_H\, \langle v, x - m \rangle_H\, d\mu(x),$$
which satisfies Γ_µ(u, v) ≤ σ_µ² ‖u‖_H ‖v‖_H.

With this notation, we have
$$\langle u, v \rangle_T = \Gamma_{\hat\mu_T}(u, v),$$
where $\hat\mu_T = \frac1N \sum_{k=1}^N \delta_{x_k}$ is the empirical measure (in which case m = x̄). We can
therefore generalize the PCA problem by considering the maximization of
$$\sum_{k=1}^p \Gamma_\mu(e_k, e_k) \tag{21.3}$$
over all orthonormal families (e1, ..., ep) in H.

When µ is square integrable, the associated operator A_µ, defined by
$$\langle u, A_\mu v \rangle_H = \Gamma_\mu(u, v) \tag{21.4}$$
for all u, v ∈ H, is a Hilbert-Schmidt operator [208]. Such an operator can, in particular,
be diagonalized in an orthonormal basis of H, i.e., there exists an orthonormal
basis (f1, f2, ...) of H such that A_µ f_i = λ_i² f_i for a non-increasing sequence of eigenvalues
(with λ1 ≥ λ2 ≥ · · · ≥ 0) such that
$$\sigma_\mu^2 = \sum_{i=1}^\infty \lambda_i^2.$$
¹A Hilbert space is an inner-product space which is complete for its norm. A separable Hilbert
space has a countable dense subset, which, in particular, implies that it has orthonormal bases.

The main statement of the following result is, in finite dimensions, a simple
application of corollary 2.4. We give here a direct proof that also works in infinite
dimensions.

Theorem 21.1 Let (f1 , f2 , . . .) be an orthonormal basis of eigenvectors of Aµ with associ-


ated eigenvalues λ21 ≥ λ22 ≥ · · · ≥ 0. Then an orthonormal family (e1 , . . . , ep ) in H maxi-
mizes (21.3) if and only if,

span(fj : λ2j > λ2p ) ⊂ span(e1 , . . . , ep ) ⊂ span(fj : λ2j ≥ λ2p ). (21.5)

In particular f1 , . . . , fp always provide a solution and span(e1 , . . . , ep ) = span(f1 , . . . , fp ) for


any other solution as soon as λ2p > λ2p+1 .

Definition 21.2 When µ = µ̂T , the vectors (f1 , . . . , fp ) are called (with some abuse when
eigenvalues coincide) the first p principal components of the training set (x1 , . . . , xN ).

Proof If (e1, ..., ep) is an orthonormal family in H, let
$$F(e_1, \ldots, e_p) = \sum_{k=1}^p \Gamma_\mu(e_k, e_k).$$
Note that F(f1, ..., fp) = λ1² + ··· + λp². Write $e_k = \sum_{j=1}^\infty \alpha_k^{(j)} f_j$ (so that α_k^{(j)} = ⟨f_j, e_k⟩_H).
These coefficients satisfy $\sum_{j=1}^\infty \alpha_k^{(j)} \alpha_l^{(j)} = 1$ if k = l and 0 otherwise. Then
$$\Gamma_\mu(e_k, e_k) = \sum_{j=1}^\infty \lambda_j^2 (\alpha_k^{(j)})^2.$$

We have
$$\begin{aligned} F(e_1, \ldots, e_p) &= \sum_{k=1}^p \sum_{j=1}^\infty \lambda_j^2 (\alpha_k^{(j)})^2\\ &= \sum_{k=1}^p \sum_{j=1}^p \lambda_j^2 (\alpha_k^{(j)})^2 + \sum_{k=1}^p \sum_{j=p+1}^\infty \lambda_j^2 (\alpha_k^{(j)})^2\\ &\le \sum_{k=1}^p \sum_{j=1}^p \lambda_j^2 (\alpha_k^{(j)})^2 + \lambda_{p+1}^2 \sum_{k=1}^p \sum_{j=p+1}^\infty (\alpha_k^{(j)})^2\\ &= \sum_{j=1}^p (\lambda_j^2 - \lambda_{p+1}^2) \sum_{k=1}^p (\alpha_k^{(j)})^2 + p\, \lambda_{p+1}^2. \end{aligned}$$

Let P denote the orthogonal projection operator from H onto span(e1, ..., ep). We have,
for any h ∈ H, ‖Ph‖_H² ≤ ‖h‖_H², with equality if and only if h ∈ span(e1, ..., ep). Applying
this to h = f_j, with $P(f_j) = \sum_{k=1}^p \alpha_k^{(j)} e_k$, we get $\sum_{k=1}^p (\alpha_k^{(j)})^2 \le 1$ with equality if and only
if f_j ∈ span(e1, ..., ep).

As a consequence, the previous upper bound on F(e1, ..., ep) implies
$$F(e_1, \ldots, e_p) \le \sum_{j=1}^p \lambda_j^2.$$

This upper bound is attained at (e1, ..., ep) = (f1, ..., fp), which is therefore a maximizer.
Also, inspecting the argument above, we see that F(e1, ..., ep) < λ1² + ··· + λp²
unless
(a) for all k ≤ p and j ≥ p + 1: α_k^{(j)} = 0 if λ_j² < λ_{p+1}², and
(b) for all j ≤ p: $\sum_{k=1}^p (\alpha_k^{(j)})^2 = 1$ unless λ_j² = λ_{p+1}².

Condition (a) implies that span(e1, ..., ep) ⊂ span(f_j : λ_j² ≥ λ_{p+1}²). If λ_p² = λ_{p+1}², the
inclusion span(e1, ..., ep) ⊂ span(f_j : λ_j² ≥ λ_p²) therefore holds. If λ_p² > λ_{p+1}², condition
(b) requires $\sum_{k=1}^p (\alpha_k^{(j)})^2 = 1$ for all j ≤ p, which implies f_j ∈ span(e1, ..., ep) for j ≤ p,
so that span(e1, ..., ep) = span(f1, ..., fp) and the inclusion also holds.

Condition (b) always requires $\sum_{k=1}^p (\alpha_k^{(j)})^2 = 1$, hence f_j ∈ span(e1, ..., ep), when
λ_j² > λ_p², showing that span(f_j : λ_j² > λ_p²) ⊂ span(e1, ..., ep). Equation (21.5) therefore
always holds for (e1, ..., ep) such that F(e1, ..., ep) = λ1² + ··· + λp². Furthermore, conditions
(a) and (b) hold for any orthonormal family that satisfies (21.5), showing
that any such family is optimal. □

Notice that the optimal S in (21.2) is such that
$$S = N \sum_{i > p} \lambda_i^2.$$

Remark 21.3 The interest of discussing PCA associated with a covariance operator
for a square integrable measure (in which case it is often called a Karhunen-Loève
(KL) expansion) is that this setting is often important when discussing infinite-
dimensional random processes (such as Gaussian random fields). Moreover, these
operators quite naturally provide asymptotic versions of sample-based PCA. In-
teresting issues, that are part of functional data analysis [159], address the design
of proper estimation procedures to obtain converging estimators of KL expansions
based on finite samples for stochastic processes in infinite-dimensional spaces. 

21.1.2 Computation of the principal components

Small dimension. Assume that H has finite dimension, d, i.e., H = Rd , and repre-
sent x1 , . . . , xN ∈ Rd as column vectors. Let the inner product on H be associated to a
positive-definite symmetric matrix Q:

hu , viH = u T Qv.

Introduce the covariance matrix of the data,
$$\Sigma_T = \frac1N \sum_{k=1}^N (x_k - \bar x)(x_k - \bar x)^T,$$
and write A_T = A_{µ̂_T}, for short, in (21.4). We have
$$\langle u, A_T v \rangle_H = \frac1N \sum_{k=1}^N (u^T Q (x_k - \bar x))(v^T Q (x_k - \bar x)) = \frac1N \sum_{k=1}^N u^T Q (x_k - \bar x)(x_k - \bar x)^T Q v = \langle u, \Sigma_T Q v \rangle_H,$$
so that A_T = Σ_T Q.

The eigenvectors, f , of AT are such that Q1/2 f are eigenvectors of the symmetric
matrix Q1/2 ΣT Q1/2 , which shows that they form an orthogonal system in H, which
will be orthonormal if the eigenvectors are normalized so that f T Qf = 1. Equiva-
lently, they solve the generalized eigenvalue problem QΣT Qf = λ2 Qf , which may
be preferred numerically to diagonalizing the non-symmetric matrix ΣT Q.

Remark 21.4 Sometimes, the metric is specified by giving Q−1 instead of Q (or Q−1
is easy to compute). Then, one can directly solve the generalized eigenvalue problem
ΣT f˜ = λ2 Q−1 f˜ and set f = Q−1 f˜. The normalization f T Qf = 1 is then obtained by
normalizing f˜ so that f˜T Q−1 f˜ = 1. 

Remark 21.5 The “standard” version of PCA applies this computation using the Eu-
clidean inner product, with Q = IdRd , and the principal components are the eigen-
vectors of the covariance matrix of T associated with the largest eigenvalues. 
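
A minimal implementation of this standard Euclidean case, using numpy (the function name and return values are our own choices):

```python
import numpy as np

def pca(x, p):
    """First p principal components of the rows of x (shape (N, d)), Q = Id."""
    xbar = x.mean(axis=0)
    sigma = (x - xbar).T @ (x - xbar) / len(x)   # empirical covariance Sigma_T
    evals, evecs = np.linalg.eigh(sigma)         # ascending eigenvalues
    order = np.argsort(evals)[::-1][:p]
    f = evecs[:, order]                          # columns: principal components
    scores = (x - xbar) @ f                      # coordinates of the centered data
    return f, evals[order], scores
```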

Large dimension. It often happens that the dimension of H is much larger than the
number of observations, N . In such a case, the previous approach is quite inefficient
(especially when the dimension of H is infinite!) and one should proceed as follows.

Returning to the original problem, one can remark that there is no loss of generality
in assuming that V is a subspace of W := span{x1 − x̄, ..., xN − x̄}. Indeed, letting
V′ = P_W(V) (the projection of V on W), we have, for ξ ∈ W,
$$\begin{aligned} \|\xi - P_V \xi\|_H^2 &= \|\xi\|_H^2 - 2\langle \xi, P_V \xi \rangle_H + \|P_V \xi\|_H^2\\ &= \|\xi\|_H^2 - 2\langle P_W \xi, P_V \xi \rangle_H + \|P_V \xi\|_H^2\\ &= \|\xi\|_H^2 - 2\langle \xi, P_W P_V \xi \rangle_H + \|P_V \xi\|_H^2\\ &\ge \|\xi\|_H^2 - 2\langle \xi, P_W P_V \xi \rangle_H + \|P_W P_V \xi\|_H^2\\ &= \|\xi - P_W P_V \xi\|_H^2\\ &\ge \|\xi - P_{V'} \xi\|_H^2. \end{aligned}$$

In this computation, we have used the facts that P_W ξ = ξ (since ξ ∈ W), that
‖P_W P_V ξ‖_H ≤ ‖P_V ξ‖_H, that P_W P_V ξ ∈ V′, and that P_{V′}(ξ) is the best approximation
of ξ by an element of V′. This shows that (since x_k − x̄ ∈ W for all k)
$$\sum_{k=1}^N \|x_k - \bar x - P_V(x_k - \bar x)\|_H^2 \ge \sum_{k=1}^N \|x_k - \bar x - P_{V'}(x_k - \bar x)\|_H^2$$
with V′ a subspace of W of dimension at most p, proving the result. This computation
also shows that no improvement in PCA can be obtained by looking for spaces
of dimension p ≥ dim(W) (with dim(W) ≤ N − 1 because the data is centered).

It therefore suffices to look for f1, ..., fp in the form
$$f_i = \sum_{k=1}^N \alpha_k^{(i)} (x_k - \bar x)$$
for some coefficients α_k^{(i)}, 1 ≤ k ≤ N, 1 ≤ i ≤ p.

With this notation, we have $\langle f_i, f_j \rangle_H = \sum_{k,l=1}^N \alpha_k^{(i)} \alpha_l^{(j)} \langle x_k - \bar x, x_l - \bar x \rangle_H$ and
$$\langle f_i, f_j \rangle_T = \frac1N \sum_{l=1}^N \langle f_i, x_l - \bar x \rangle_H \langle f_j, x_l - \bar x \rangle_H = \frac1N \sum_{k,k'=1}^N \alpha_k^{(i)} \alpha_{k'}^{(j)} \sum_{l=1}^N \langle x_k - \bar x, x_l - \bar x \rangle_H \langle x_{k'} - \bar x, x_l - \bar x \rangle_H.$$

Let S be the Gram matrix of the centered data, formed by the inner products
⟨x_k − x̄, x_l − x̄⟩_H for k, l = 1, ..., N. Let α^{(i)} be the column vector with coordinates
α_k^{(i)}, k = 1, ..., N. We have ⟨f_i, f_j⟩_H = (α^{(i)})^T S α^{(j)} and ⟨f_i, f_j⟩_T = (α^{(i)})^T S² α^{(j)}/N, which
implies that, in this representation, the operator A_T is given by S/N. Thus, the previous
simultaneous orthogonalization problem can be solved in terms of the α's by
diagonalizing S and taking the first p eigenvectors, normalized so that (α^{(i)})^T S α^{(i)} = 1.
Let λ_j², j = 1, ..., N, be the eigenvalues of S/N (of which only the first min(d, N − 1)
may be non-zero). In this representation, the projection of x_k − x̄ on the PCA basis is
given by
$$P_V(x_k - \bar x) = \sum_{j=1}^p \beta_k^{(j)} f_j$$
with
$$\beta_k^{(j)} = \langle x_k - \bar x, f_j \rangle_H = \sum_{l=1}^N \alpha_l^{(j)} \langle x_l - \bar x, x_k - \bar x \rangle_H = N \lambda_j^2 \alpha_k^{(j)}.$$
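
The following sketch implements this Gram-matrix computation (names are ours; eigenvalues equal to zero should be discarded before the normalization):

```python
import numpy as np

def pca_gram(x, p):
    """PCA via the N x N Gram matrix, useful when d >> N. Returns the coefficients
    alpha^(i) (so that f_i = sum_k alpha_k^(i) (x_k - xbar)), the lambda_j^2, and beta."""
    N = len(x)
    xc = x - x.mean(axis=0)
    S = xc @ xc.T                                    # Gram matrix of centered data
    evals, evecs = np.linalg.eigh(S)                 # S alpha = N lambda^2 alpha
    order = np.argsort(evals)[::-1][:p]              # keep only strictly positive evals
    lam2 = evals[order] / N
    alpha = evecs[:, order] / np.sqrt(evals[order])  # normalized: alpha^T S alpha = 1
    beta = N * lam2 * alpha                          # beta_k^(j) = N lambda_j^2 alpha_k^(j)
    return alpha, lam2, beta
```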

21.2 Kernel PCA

Since the previous computation only depended on the inner products hxk − x , xl − xiH ,
PCA can be performed in reproducing kernel Hilbert spaces, and the resulting method
is called kernel PCA. In this framework, X may take values in any set R with a rep-
resentation h : R → H. The associated kernel, K(x, x0 ) = hh(x) , h(x0 )iH , provides a
closed form expression of the inner products in terms of the original variables. The
feature function itself is most of the time unnecessary.

The kernel version of PCA consists in replacing x_k − x̄ with h(x_k) − h̄, where h̄ is
the average feature vector. This leads to defining a "centered kernel":
$$\begin{aligned} K_c(x, x') &= \langle h(x) - \bar h, h(x') - \bar h \rangle_H\\ &= \langle h(x), h(x') \rangle_H - \langle h(x) + h(x'), \bar h \rangle_H + \|\bar h\|_H^2\\ &= K(x, x') - \frac1N \sum_{k=1}^N \big(K(x, x_k) + K(x', x_k)\big) + \frac1{N^2} \sum_{k,l=1}^N K(x_k, x_l). \end{aligned}$$

Then the Gram matrix in feature space is S with s_{kl} = K_c(x_k, x_l), and the computation
described in the previous section can be applied. Note that, if one denotes, as usual,
by K = K(x1, ..., xN) the matrix formed by the kernel evaluations K(x_k, x_l), and if one
lets P = Id_{R^N} − 1_N 1_N^T/N, then we have the simple matrix expression S = PKP.

Letting α^{(1)}, ..., α^{(p)} ∈ R^N be the first p eigenvectors of S, normalized so that
(α^{(i)})^T S α^{(i)} = 1, the principal directions are vectors in feature space given by (using
the notation of the previous section, in which the kth coordinate of α^{(i)} is α_k^{(i)})
$$f_i = \sum_{k=1}^N \alpha_k^{(i)} (h(x_k) - \bar h),$$
and they are not computable when the features are not known explicitly. However, a
few geometric quantities associated with these directions can be characterized using
the kernel only.
Consider the line in feature space D_i = {h̄ + λ f_i : λ ∈ R}. Let Ω_i denote the set of points
x ∈ R such that h(x) ∈ D_i. Then x ∈ Ω_i if and only if h(x) coincides with its orthogonal
projection on D_i, which is equivalent to
$$\langle h(x) - \bar h, f_i \rangle_H^2 = \|h(x) - \bar h\|_H^2,$$
which can be expressed with the kernel as
$$K_c(x, x) - \Big( \sum_{k=1}^N \alpha_k^{(i)} K_c(x, x_k) \Big)^2 = 0. \tag{21.6}$$

This provides a nonlinear equation in x. In particular, Ω_i is generally nonlinear,
possibly with several connected components. Note that, by definition, the difference
in (21.6) is always non-negative, so that a way to visualize Ω_i is to compute its
sub-level sets, i.e., the set of all x such that
$$K_c(x, x) - \Big( \sum_{k=1}^N \alpha_k^{(i)} K_c(x, x_k) \Big)^2 \le \varepsilon$$
for small ε.

Similarly, the feature vector h(x) − h̄ belongs to the space generated by the first p
components if and only if
$$\sum_{i=1}^p \langle h(x) - \bar h, f_i \rangle_H^2 = \|h(x) - \bar h\|_H^2,$$
i.e.,
$$\sum_{i=1}^p \Big( \sum_{k=1}^N \alpha_k^{(i)} K_c(x, x_k) \Big)^2 = K_c(x, x).$$

One can also compute the finite-dimensional coordinates of h(x) in the PCA basis,
and this computation is easier. The representation is
$$x \mapsto (u_1(x), \ldots, u_p(x))$$
with
$$u_i(x) = \langle h(x) - \bar h, f_i \rangle_H = \sum_{k=1}^N \alpha_k^{(i)} K_c(x, x_k).$$
This provides an explicit nonlinear transformation that maps each data point x into
a p-dimensional point. This representation allows one to easily exploit the reduction
of dimension.
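
A minimal sketch of these computations from a kernel matrix (function names are ours; projecting a new point requires evaluating K_c at that point using the centering formula above):

```python
import numpy as np

def kernel_pca(K, p):
    """Kernel PCA from the N x N kernel matrix K[k, l] = K(x_k, x_l).
    Returns alpha (for projections) and the training coordinates u_i(x_k)."""
    N = len(K)
    P = np.eye(N) - np.ones((N, N)) / N
    S = P @ K @ P                                    # centered Gram matrix
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1][:p]              # keep strictly positive eigenvalues
    alpha = evecs[:, order] / np.sqrt(evals[order])  # (alpha^(i))^T S alpha^(i) = 1
    return alpha, S @ alpha                          # u_i(x_k) = sum_l alpha_l^(i) Kc(x_k, x_l)

def project(alpha, Kc_new):
    """Coordinates u_i(x) for new points, with Kc_new[a, k] = Kc(x_a, x_k)."""
    return Kc_new @ alpha
```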

21.3 Statistical interpretation and probabilistic PCA

There is a simple probabilistic interpretation of linear PCA. Assume that H = R^d
with the standard inner product and that X is a centered random vector with covariance
matrix Σ. Consider the problem that consists in finding a factor decomposition
$$X = \sum_{i=1}^p Y^{(i)} e_i + R$$
where Y = (Y^{(1)}, ..., Y^{(p)})^T forms a p-dimensional centered vector, e1, ..., ep is an
orthonormal system, and R is a random vector, independent of Y and as small as
possible, in the sense that E(|R|²) is minimal.

One can see that, in an optimal decomposition, one needs R^T e_i = 0 for all i,
because one can always write
$$\sum_{i=1}^p Y^{(i)} e_i + R = \sum_{i=1}^p (Y^{(i)} + R^T e_i)\, e_i + R - \sum_{i=1}^p (R^T e_i)\, e_i.$$
If R is centered, then so is $R - \sum_{i=1}^p (R^T e_i) e_i$, and the latter provides a better solution
since $|R - \sum_{i=1}^p (R^T e_i) e_i| \le |R|$. Also, there is no loss of generality in requiring that
(Y^{(1)}, ..., Y^{(p)}) are uncorrelated, as this can always be obtained after a change of basis
in span(e1, ..., ep).

Assuming this, we can write
$$E(|X|^2) = \sum_{i=1}^p E((Y^{(i)})^2) + E(|R|^2)$$
with Y^{(i)} = e_i^T X. So, to minimize E(|R|²), one needs to maximize
$$\sum_{i=1}^p E((e_i^T X)^2),$$
which is equal to
$$\sum_{i=1}^p e_i^T \Sigma e_i.$$

The solution of this problem is given by the first p eigenvectors of Σ. PCA (with a
Euclidean metric) exactly applies this procedure, with Σ replaced by the empirical
covariance.

"Probabilistic PCA" is based on a slightly different statistical model, in which it is
assumed that X can be decomposed as
$$X = \sum_{i=1}^p \lambda_i Y^{(i)} e_i + \sigma R,$$

where R is a d dimensional standard Gaussian vector and Y = (Y (1) , . . . , Y (p) )T a p-


dimensional standard Gaussian vector, independent of R. The main difference with
standard PCA is that the total variance of the residual, here dσ 2 , is a model param-
eter and not a quantity to minimize.

In addition to σ², the model is parametrized by the coordinates of e1, ..., ep and
the values of λ1, ..., λp. Introduce the d × p matrix
$$W = [\lambda_1 e_1, \ldots, \lambda_p e_p].$$
We can rewrite this model in the form
$$X = W Y + \sigma R$$
where the parameters are W and σ², with the constraint that W^T W is a diagonal
matrix. As a linear combination of independent Gaussian random variables, X is

Gaussian with covariance matrix WW^T + σ²Id. The log-likelihood of the observations
x1, ..., xN therefore is
$$L(W, \sigma) = -\frac N2 \Big( d \log 2\pi + \log\det(WW^T + \sigma^2 \mathrm{Id}) + \mathrm{trace}\big((WW^T + \sigma^2 \mathrm{Id})^{-1} \Sigma_T\big) \Big) \tag{21.7}$$
where Σ_T is the empirical covariance matrix of x1, ..., xN. This function can be
maximized explicitly in W and σ, as stated in the following proposition.

Proposition 21.6 Assume that the matrix Σ_T is invertible. The log-likelihood in (21.7)
is maximized by taking

(i) W = [λ1 e1, ..., λp ep], where e1, ..., ep are the eigenvectors of Σ_T associated with the p
largest eigenvalues and $\lambda_i = \sqrt{\delta_i^2 - \sigma^2}$, where δ_i² is the eigenvalue of Σ_T associated with e_i;
(ii) and
$$\sigma^2 = \frac{1}{d - p} \sum_{i=p+1}^d \delta_i^2.$$

Proof We make the following change of variables: let ρ² = 1/σ² and
$$\mu_i^2 = \frac{1}{\sigma^2} - \frac{1}{\lambda_i^2 + \sigma^2}.$$
Let Q = [µ1 e1, ..., µp ep]. We have
$$(WW^T + \sigma^2 \mathrm{Id})^{-1} = \rho^2 \mathrm{Id} - QQ^T.$$
To see this, complete (e1, ..., ep) into an orthonormal basis of R^d, letting e_{p+1}, ..., e_d
denote the added vectors. Then
$$WW^T + \sigma^2 \mathrm{Id} = \sum_{i=1}^p (\lambda_i^2 + \sigma^2)\, e_i e_i^T + \sigma^2 \sum_{i=p+1}^d e_i e_i^T$$
so that
$$(WW^T + \sigma^2 \mathrm{Id})^{-1} = \sum_{i=1}^p (\lambda_i^2 + \sigma^2)^{-1} e_i e_i^T + \sigma^{-2} \sum_{i=p+1}^d e_i e_i^T = \rho^2 \mathrm{Id} - QQ^T.$$

Using these variables, we can reformulate the problem as the minimization of
$$-\sum_{i=1}^p \log(\rho^2 - \mu_i^2) - (d - p) \log \rho^2 + \rho^2\, \mathrm{trace}(\Sigma_T) - \sum_{j=1}^p \mu_j^2\, e_j^T \Sigma_T e_j.$$

From theorem 2.3, we have
$$\sum_{j=1}^p \mu_j^2\, e_j^T \Sigma_T e_j \le \sum_{j=1}^p \mu_j^2 \delta_j^2,$$
and this upper bound is attained by letting e1, ..., ep be the first p eigenvectors of Σ_T.
Using this, we see that σ², µ1², ..., µp² must minimize
$$-\sum_{i=1}^p \log(\rho^2 - \mu_i^2) - (d - p) \log \rho^2 + \rho^2 \sum_{j=1}^d \delta_j^2 - \sum_{j=1}^p \mu_j^2 \delta_j^2.$$
Computing the solution is elementary and left to the reader; it yields, when expressed
as functions of σ², λ1², ..., λp², the expressions given in the statement of the
proposition. □
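
The closed-form solution of proposition 21.6 is straightforward to implement; a minimal sketch (with our own function name):

```python
import numpy as np

def ppca_mle(x, p):
    """Maximum likelihood estimates (W, sigma^2) for probabilistic PCA."""
    xc = x - x.mean(axis=0)
    sigma_T = xc.T @ xc / len(x)
    evals, evecs = np.linalg.eigh(sigma_T)
    order = np.argsort(evals)[::-1]
    delta2, e = evals[order], evecs[:, order]
    sigma2 = delta2[p:].mean()                   # average of the trailing eigenvalues
    W = e[:, :p] * np.sqrt(delta2[:p] - sigma2)  # columns lambda_i e_i
    return W, sigma2
```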

21.4 Generalized PCA

We now discuss a dimension reduction method called generalized PCA (GPCA) [202]
that, instead of looking for the best linear approximation of the training set by one
specific subspace, provides an approximation by a finite union of such spaces.

As a motivation, consider the situation in fig. 21.1 in which part of the data
is aligned along one direction in space, and another part along another direction.
Then, the only information that PCA can retrieve (provided that the two directions
intersect) is the plane generated by the two directions, which will be captured by
the two principal components. PCA will not be able to determine the individual
directions. GPCA addresses this type of situation as follows.

Figure 21.1: PCA cannot distinguish between the situations depicted in the two datasets.
For simplicity, assume that we are trying to decompose the data along unions of
hyperplanes in Rd . Such hyperplanes have equations of the form u T x̃ = 0 where x̃ is
our notation for the vector (1, xT )T . If we have two hyperplanes, specified by u1 and
u2 and all the training samples approximately belong to one of them, then one has,
for all k = 1, . . . , N :
$$(u_1^T \tilde x_k)(u_2^T \tilde x_k) = \tilde x_k^T u_1 u_2^T \tilde x_k \simeq 0.$$
Similarly, for n hyperplanes, the identity is, for k = 1, ..., N:
$$\prod_{j=1}^n (u_j^T \tilde x_k) \simeq 0.$$

Write
$$\prod_{j=1}^n (u_j^T x) = \sum_{1 \le i_1, \ldots, i_n \le d} u_1(i_1) \cdots u_n(i_n)\, x(i_1) \cdots x(i_n)$$
in the form (by regrouping the terms associated with the same powers of x)
$$F(x) = \sum_{p_1 + \cdots + p_d = n} q_{p_1 \ldots p_d}\, (x^{(1)})^{p_1} \cdots (x^{(d)})^{p_d}. \tag{21.8}$$

The collection of $\binom{n+d-1}{n}$ numbers Q = (q_{p_1...p_d}, p1 + ··· + pd = n) takes a specific form
(that we will not need to make explicit) as a function of the unknown u1, ..., un, but
the first step of GPCA ignores this constraint and estimates Q by minimizing
$$\sum_{k=1}^N \Big( \sum_{p_1 + \cdots + p_d = n} q_{p_1 \ldots p_d}\, (x_k^{(1)})^{p_1} \cdots (x_k^{(d)})^{p_d} \Big)^2$$

under the constraint $\sum q_{p_1 \ldots p_d}^2 = 1$ (to avoid trivial solutions). Choosing an ordering
on the set of indices (p1, ..., pd) such that p1 + ··· + pd = n, one can stack the coefficients
in Q and the monomials $(x_k^{(1)})^{p_1} \cdots (x_k^{(d)})^{p_d}$ to form two vectors, denoted Q (with some
abuse of notation) and V(x_k). One can then rewrite the problem of determining Q
as minimizing Q^T Σ Q subject to |Q|² = 1, where
$$\Sigma = \sum_{k=1}^N V(x_k) V(x_k)^T.$$

The solution is given by the eigenvector associated with the smallest eigenvalue of Σ.
If the model is exact, this eigenvalue should be zero, and if only one decomposition
of the data in a set of distinct hyperplanes exists (i.e., if n is not chosen too large),
then Q is the unique solution up to a multiplicative constant.

Once Q is found, it remains to identify the vectors u1, ..., un. This identification
can be obtained by inspecting the gradient of F on the union of hyperplanes. Indeed,
one has, for x ∈ R^d,
$$\nabla F(x) = \sum_{j=1}^n \Big( \prod_{j' \neq j} u_{j'}^T x \Big)\, u_j.$$

However, if x belongs to one and only one of the hyperplanes, say x^T u_j = 0, then all
terms in the sum vanish but one, and ∇F(x) is proportional to u_j. So, if the model is
exact, one has, for each k = 1, ..., N, either ∇F(x_k) = 0 (if x_k belongs to the intersection
of two hyperplanes) or ∇F(x_k)/|∇F(x_k)| = ±u_j for some j, and the sign ambiguity can
be removed by ensuring, for example, that the first non-vanishing coordinate of u_j is
positive. (The gradient of F can be computed from Q using (21.8).) The computation
of ∇F on training data therefore allows for an exact computation of the hyperplanes.

In practice, when noise is present, one cannot expect this computation to be
exact. The vectors u1, ..., un can then be estimated by clustering the collection of
non-vanishing gradients ∇F(x_k), k = 1, ..., N. For example, one can compute a
dissimilarity matrix such as d_{kl} = 1 − cos²(θ_{kl}), where θ_{kl} is the angle between ∇F(x_k)
and ∇F(x_l), and apply one of the methods discussed in section 20.4.1.

This analysis provides a decomposition of the training set into n (or fewer) hyper-
planes. The computation can then be recursively refined in order to obtain smaller
dimensional subspaces by applying the same method separately to each hyperplane.
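
The first step of GPCA (estimating Q and computing the gradients to be clustered) can be sketched as follows. The monomial ordering, function names, and the thresholding of vanishing gradients are our own choices, and we work in homogeneous coordinates x̃ = (1, x^T)^T:

```python
import numpy as np
from itertools import combinations_with_replacement

def gpca_directions(x, n):
    """Fit the degree-n polynomial F (eigenvector of Sigma with smallest eigenvalue)
    and return the normalized gradients of F at the data, to be clustered into n groups."""
    N, d = x.shape
    xt = np.hstack([np.ones((N, 1)), x])             # homogeneous coordinates (dim d + 1)
    expos = []                                       # exponent vectors of total degree n
    for comb in combinations_with_replacement(range(d + 1), n):
        e = np.zeros(d + 1, dtype=int)
        for i in comb:
            e[i] += 1
        expos.append(e)
    expos = np.array(expos)
    V = np.prod(xt[:, None, :] ** expos[None, :, :], axis=2)   # Veronese map (N, M)
    evals, evecs = np.linalg.eigh(V.T @ V)           # Sigma = sum_k V(x_k) V(x_k)^T
    q = evecs[:, 0]                                  # smallest-eigenvalue eigenvector
    # gradient of F with respect to the x-coordinates (not the constant entry)
    grads = np.zeros((N, d))
    for i in range(1, d + 1):
        e = expos.copy()
        coef = q * e[:, i]                           # chain rule: multiply by exponent
        e[:, i] = np.maximum(e[:, i] - 1, 0)         # and lower it by one
        grads[:, i - 1] = np.prod(xt[:, None, :] ** e[None, :, :], axis=2) @ coef
    nz = np.linalg.norm(grads, axis=1) > 1e-8        # drop (near-)vanishing gradients
    return grads[nz] / np.linalg.norm(grads[nz], axis=1, keepdims=True)
```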

21.5 Nuclear norm minimization and robust PCA

21.5.1 Low-rank approximation

One can also interpret PCA in terms of low-rank matrix approximations. Let X_c be
the N by d matrix with rows (x_k − x̄)^T, which, in generic situations, has rank min(d, N − 1).
Then PCA with p components is equivalent to minimizing, over all N by d matrices
Z of rank p, the norm of the difference
$$|X_c - Z|^2 = \mathrm{trace}((X_c - Z)^T (X_c - Z)). \tag{21.9}$$
The quantity |A|² = trace(A^T A) is the sum of squares of the entries of A, and is often
referred to as the (squared) Frobenius norm. We have
$$|A|^2 = \sum_{k=1}^d \sigma_k^2$$
where σ1, ..., σd are the singular values of A, i.e., the square roots of the eigenvalues
of A^T A.

We first note the following characterization of rank-p matrices.

Proposition 21.7 A matrix Z has rank p if and only if it can be written in the form
Z = AW T where A is N ×p, and W is d ×p with W T W = IdRp , i.e., W = [e1 , . . . , ep ] where
the columns form an orthonormal family of Rd .

Proof The "if" part is obvious and we prove the "only if" part. Assume that Z has
rank p. Take W = [e1, ..., ep], where (e1, ..., ep) is an orthonormal basis of Null(Z)^⊥.
Letting e_{p+1}, ..., e_d denote an orthonormal basis of Null(Z), we have $\sum_{i=1}^d e_i e_i^T = \mathrm{Id}_{\mathbb R^d}$
and
$$Z = Z \sum_{i=1}^d e_i e_i^T = Z \sum_{i=1}^p e_i e_i^T = Z W W^T,$$
so that one can take A = ZW. □

Using this representation and letting z_k^T be the kth row vector of Z, we have
$$|X_c - Z|^2 = \sum_{k=1}^N |x_k - \bar x - z_k|^2 = \sum_{k=1}^N \Big| x_k - \bar x - \sum_{j=1}^p a_k^{(j)} e_j \Big|^2.$$
With fixed e1, ..., ep, the optimal matrix A has coefficients a_k^{(j)} = (x_k − x̄)^T e_j. In matrix
form, this is:
$$Z = X_c \Big( \sum_{j=1}^p e_j e_j^T \Big).$$

We therefore retrieve the PCA formulation that we gave in section 21.1, in the
special case of H = R^d with the standard Euclidean product. The lowest value
achieved by the PCA solution is
$$|X_c - Z|^2 = N \sum_{k=p+1}^d \lambda_k^2$$
where λ1², ..., λd² are the eigenvalues of the covariance matrix computed from x1, ..., xN,
which are also the squared singular values of the matrix X_c divided by N.

In this section, we will explore variations on PCA in which the minimization
of |X_c − Z|² is complemented with a penalty that depends on the singular values of the
matrix Z. As a first example, one can modify PCA by adding a penalty on the rank
(i.e., on the number of non-zero singular values), minimizing
$$\gamma |X_c - Z|^2 + \mathrm{rank}(Z)$$

for some parameter γ > 0. However, the solution to this problem is a small variation
of that of standard PCA. It is indeed given by standard PCA with p components,
where p minimizes
$$N\gamma \sum_{k=p+1}^d \lambda_k^2 + p = N\gamma \sum_{k=p+1}^d \big(\lambda_k^2 - (N\gamma)^{-1}\big) + d,$$
i.e., p is the index of the last eigenvalue that is larger than (Nγ)^{-1}.

21.5.2 The nuclear norm

Based on the fact that rank(Z) is the number of non-zero singular values of Z, one
can use the same heuristic as in the development of the lasso, and replace the count
of non-zero singular values by the sum of their absolute values, which is just the
sum of the singular values since they are non-negative. This provides the nuclear
norm of A, defined in section 2.4 by
$$|A|_* = \sum_{k=1}^d \sigma_k$$
where σ1, ..., σd are the singular values of A. We will consider below the problem of
minimizing
$$\gamma |X_c - Z|^2 + |Z|_* \tag{21.10}$$
and show that its solution is once again similar to PCA.

We recall the characterization of the nuclear norm from proposition 2.6. If A is an N
by d matrix,
$$|A|_* = \max\big\{ \mathrm{trace}(U A V^T) : U \text{ is } N \times N \text{ with } U^T U = \mathrm{Id},\ V \text{ is } d \times d \text{ with } V^T V = \mathrm{Id} \big\}.$$

In Cai et al. [44], the authors consider the minimization of (21.10) and prove
the following result. Recall that we have defined the shrinkage function Sτ : t 7→
sign(t) max(|t| − τ, 0) (with τ ≥ 0), using the same notation Sτ (X) when applying Sτ to
every entry of a vector or matrix X. Following Cai et al. [44], we define the singular
value thresholding operator A 7→ Sτ (A), where A is any rectangular matrix, by

Sτ (A) = U Sτ (∆)V T

when A = U ∆V T is a singular value decomposition of A.


Proposition 21.8 Let us assume, without loss of generality, that N ≥ d. The function
Z ↦ γ|X_c − Z|² + |Z|_* is minimized by Z = S_{1/(2γ)}(X_c).

Proof Representing Z by its singular value decomposition, we obtain the equivalent
formulation of minimizing
$$F(U, V, D) = \gamma |X_c - U D V^T|^2 + |D|_* = \gamma |X_c|^2 - 2\gamma\, \mathrm{trace}(X_c^T U D V^T) + \gamma |D|^2 + |D|_*$$
over all orthogonal matrices U and V and all diagonal matrices D with non-negative
coefficients. From theorem 2.1, we know that trace(X_c^T U D V^T) is less than the
sum of the products of the non-increasingly ordered singular values of X_c and D,
and this upper bound is attained by taking U = Ū and V = V̄, where Ū and V̄ are
the matrices providing the SVD of X_c, i.e., such that X_c = Ū∆V̄^T where ∆ is diagonal
with non-increasing coefficients along the diagonal. So, letting λ1 ≥ ··· ≥ λd ≥ 0 and
µ1 ≥ ··· ≥ µd ≥ 0 be the singular values of X_c and Z, we have just proved that, for any
D,
$$F(U, V, D) \ge F(\bar U, \bar V, D) = \gamma |X_c|^2 - 2\gamma \sum_{i=1}^d \mu_i \lambda_i + \gamma \sum_{i=1}^d \mu_i^2 + \sum_{i=1}^d \mu_i.$$
The lower bound is minimized when µ_i = max(λ_i − 1/(2γ), 0). This proves the proposition. □
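
Singular value thresholding is a one-line operation on the SVD; a minimal sketch (function name ours):

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding S_tau(A) = U S_tau(Delta) V^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

# per proposition 21.8, svt(Xc, 1/(2*gamma)) minimizes gamma*|Xc - Z|^2 + |Z|_*
```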

21.5.3 Robust PCA

As a consequence, the nuclear norm penalty provides the same principal directions
(after replacing γ by 2γ) as the rank penalty, but applies a shrinking operation rather
than a thresholding on the singular values. The difference is, however, more fundamental
if, in addition to using the nuclear norm as a penalty, one replaces the squared
Frobenius norm on the approximation error by the ℓ¹ norm, where, for an n by m
matrix A with coefficients (a(i, j)),
$$|A|_{\ell^1} = \sum_{i,j} |a(i, j)|.$$

This is the formulation of robust PCA [48], which minimizes
$$\gamma |X_c - Z|_{\ell^1} + |Z|_* \tag{21.11}$$
with respect to Z.

Robust PCA (which was initially named Principal Component Pursuit by the au-
thors in Candès et al. [48]) is designed for situations in which Xc can be decomposed
as the sum of a low-rank matrix Z and of a sparse residual S. Some theoretical justi-
fication was provided in the original paper, stating that if Xc = Z+S, with Z = U DV T
(its singular value decomposition) such that U and V are sufficiently “diffuse” and
rank(Z) is small enough, with the residual’s sparsity pattern taken uniformly at ran-
dom over the subsets of entries of S with a sufficiently small cardinality, then robust
PCA is able to reconstruct the decomposition exactly with high probability (relative
to the random selection of the sparsity pattern of S). We refer to Candès et al. [48]
for the long proof that justifies this statement.

Robust PCA can be solved using the ADMM algorithm (section 3.5.5) after refor-
mulating the problem as the minimization of

$$\gamma |R|_{\ell^1} + |Z|_*$$

subject to R + Z = Xc . The algorithm therefore iterates over the following steps.


$$\begin{cases} Z^{(k+1)} = \operatorname{argmin}_Z \Big( |Z|_* + \dfrac{1}{2\alpha} |Z + R^{(k)} - X_c + U^{(k)}|^2 \Big)\\[1.5ex] R^{(k+1)} = \operatorname{argmin}_R \Big( \gamma |R|_{\ell^1} + \dfrac{1}{2\alpha} |Z^{(k+1)} + R - X_c + U^{(k)}|^2 \Big)\\[1.5ex] U^{(k+1)} = U^{(k)} + Z^{(k+1)} + R^{(k+1)} - X_c \end{cases} \tag{21.12}$$
The first minimization is covered by proposition 21.8 (with γ = 1/(2α)) and yields
$$Z^{(k+1)} = S_\alpha(X_c - R^{(k)} - U^{(k)}).$$
The second minimization is solved by a standard shrinking operation, i.e.,
$$R^{(k+1)} = S_{\gamma\alpha}(X_c - Z^{(k+1)} - U^{(k)}).$$

Using this, we can rewrite the robust PCA algorithm as the sequence of fairly simple
iterations.

Algorithm 21.1
(1) Choose a small enough constant α and a small tolerance level ε.
(2) Initialize the algorithm with N by d matrices R^{(0)} and U^{(0)} (e.g., equal to zero).
(3) At step k, apply the iteration:
$$\begin{cases} Z^{(k+1)} = S_\alpha(X_c - R^{(k)} - U^{(k)})\\ R^{(k+1)} = S_{\gamma\alpha}(X_c - Z^{(k+1)} - U^{(k)})\\ U^{(k+1)} = U^{(k)} + Z^{(k+1)} + R^{(k+1)} - X_c \end{cases} \tag{21.13}$$
(4) Stop the algorithm if the variation compared to the variables at the previous step is
below the tolerance level; otherwise, apply step k + 1.
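
A direct transcription of Algorithm 21.1 (the names and the specific stopping rule are our choices):

```python
import numpy as np

def shrink(A, tau):
    """Entrywise soft thresholding."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def svt(A, tau):
    """Singular value thresholding."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def robust_pca(Xc, gamma, alpha, tol=1e-6, max_iter=500):
    """Iterations (21.13): split Xc into a low-rank Z and a sparse R."""
    R = np.zeros_like(Xc)
    U = np.zeros_like(Xc)
    for _ in range(max_iter):
        Z = svt(Xc - R - U, alpha)
        R_new = shrink(Xc - Z - U, gamma * alpha)
        U = U + Z + R_new - Xc
        if np.linalg.norm(R_new - R) <= tol * max(1.0, np.linalg.norm(R)):
            R = R_new
            break
        R = R_new
    return Z, R
```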

21.6 Independent component analysis

Independent component analysis (ICA) is a factor analysis method that represents a


d-dimensional random variable X in the form X = AY where A is a fixed d ×d invert-
ible matrix and Y is a d-dimensional random vector with independent components.
There are two main approaches in this setting. The first one optimizes the matrix
W = A−1 so that the components of W X are “as independent as possible” according
to a suitable criterion. The second one is model-based, where a statistical model is
assumed for Y , and its parameters, together with the entries of the matrix A, are es-
timated via maximum likelihood. Before describing each of these methods, we first
discuss the extent to which the coefficients of A are identifiable.

21.6.1 Identifiability

A statistical model is identifiable if its parameters (which could be finite- or infinite-


dimensional) are uniquely defined by the distribution of the observable variables. In
the case of ICA, this question boils down to deciding whether AY ∼ A0 Y 0 (i.e., they
have the same probability distribution) implies that A = A0 (where Y and Y 0 are two
random vectors with independent components).

It should be clear that the answer to this question is negative, because there are
trivial transformations of the matrix A that do not break the ICA model. One can,
for example, take any invertible diagonal matrix, D, and let A0 = AD −1 and Y 0 = DY .
The same statement can be made if D is replaced by a permutation matrix, P , which
reorders the components of Y . So we know that AY ∼ A0 Y 0 is possible already when
A0 = ADP where D is diagonal and invertible and P is a permutation matrix. Note
that iterating such matrices (i.e., letting A0 = ADP D 0 P 0 ) does not extend the class
of transformations because one has DP = P P −1 DP and one can easily check that
P −1 DP is diagonal, so that one can rewrite any product of permutations and diagonal
matrices as a single diagonal matrix multiplied by a single permutation.

It is interesting, and fundamental for the well-posedness of ICA, that, under one
important additional assumption, the indeterminacy in the identification of A stops
at these transformations. The additional assumption is that at most one of the com-
ponents of Y follows a Gaussian distribution. That such a restriction is needed is
clear from the fact that one can transform any Gaussian vector Y with independent
components into another one, BY, as soon as BB^T is diagonal. If two or more com-
ponents of Y are Gaussian, one can restrict these matrices B to only affect those
components. If only one of them is Gaussian, such an operation has no effect.

The following theorem is formally stated in Comon [53], and is a rephrasing


of the Darmois-Skitovitch theorem [57, 180]. The proof of this theorem relies on
complex analysis arguments on characteristic functions and is beyond the scope of
these notes (see Kagan et al. [102] for more details).

Theorem 21.9 Assume that Y is a random vector with independent components, such
that at most one of its components is Gaussian. Let C be an invertible linear transformation
and Ỹ = CY. Then the following statements are equivalent.

(i) For all i ≠ j, the components Ỹ^{(i)}, Ỹ^{(j)} are independent.
(ii) Ỹ^{(1)}, ..., Ỹ^{(d)} are mutually independent.
(iii) C = DP is the product of a diagonal matrix and of a permutation matrix.

The equivalence of (ii) and (iii) implies that the ICA model is identifiable up
to multiplication on the right by a permutation and a diagonal matrix. Indeed, if
X = AY = A0 Y 0 are two decompositions, then it suffices to apply the theorem to
C = (A0 )−1 A to conclude. The equivalence of (i) and (ii) is striking, and has the
important consequence that, if the data satisfies the ICA model, then, in order to
identify A (up to the listed indeterminacy), it suffices to look for Y = A−1 X with
pairwise independent components, which is a much lesser constraint than full mu-
tual independence.

As a final remark on the Gaussian indeterminacy, we point out that, if the mean
(m) and covariance matrix (Σ) of X are known (or estimated from data), the ICA
problem can be reduced to looking for orthogonal transformations A. Indeed, as-
suming X = AY and letting X̃ = Σ−1/2 (X − m) and Ỹ = D −1/2 (Y − A−1 m), where D is
the (diagonal) covariance matrix of Y , we have

X̃ = Σ−1/2 (AY − m) = Σ−1/2 AD 1/2 Ỹ .

Letting à = Σ−1/2 AD 1/2 , we have IdRd = E(X̃ X̃ T ) = ÃÃT so that à is orthogonal.


This shows that the ICA problem for X̃ in the form X̃ = ÃỸ with the restriction
that à is orthogonal has a solution, and also provides a solution of the original ICA
problem by letting A = Σ^{1/2}Ã and Y = Ỹ + Ã^{-1}Σ^{-1/2}m. Therefore, the indeterminacy
associated with Gaussian vectors is as general as possible up to a normalization of
first and second moments.

21.6.2 Measuring independence and non-Gaussianity

Independence between d variables is a very strong property and its complete
characterization is computationally challenging. The fact that the joint p.d.f. of the d
variables (we restrict, to simplify our discussion, to variables that are absolutely
continuous) factorizes into the product of the marginal p.d.f.'s of each variable can
be measured by computing the mutual information between the variables, defined
by (letting ϕ_Z denote the p.d.f. of a variable Z)
$$I(Y) = \int \log\left( \frac{\varphi_Y(y)}{\prod_{i=1}^d \varphi_{Y^{(i)}}(y^{(i)})} \right) \varphi_Y(y)\, dy.$$

The mutual information is always non-negative and vanishes only if the components
of Y are mutually independent. Therefore, one can represent ICA as an optimization
problem minimizing I(WX) with respect to all invertible matrices W (so that
W = A^{-1}). Letting
$$h(Y) = -\int \log(\varphi_Y(y))\, \varphi_Y(y)\, dy$$
denote the "differential entropy" of Y, we can write
$$I(Y) = \sum_{i=1}^d h(Y^{(i)}) - h(Y).$$
If Z = WX, then ϕ_Z(z) = ϕ_X(W^{-1}z)|det(W)|^{-1}. Using this expression in h(Z) and
making a change of variables yields h(WX) = h(X) + log|det W| and
$$I(WX) = \sum_{i=1}^d h(Z^{(i)}) - \log|\det(W)| - h(X).$$

This shows that the optimal W can be obtained by minimizing
$$F(W) = \sum_{i=1}^d h(W^{(i)} X) - \log|\det(W)|$$
where W^{(i)} is the ith row of W. This brings a notable simplification, since this
expression only involves differential entropies of scalar variables, but the problem
still remains challenging.

In Comon [53], it is proposed to use cumulant expansions of the entropy around
that of a Gaussian with identical mean and variance to approximate the differential
entropy. If ξ ∼ N(m, σ²), then
$$h(\xi) = \frac12 + \frac12 \log(2\pi\sigma^2).$$
Define, for a general random variable U with standard deviation σ_U, the non-Gaussian
entropy, or negentropy,
$$\nu(U) = \frac12 + \frac12 \log(2\pi\sigma_U^2) - h(U).$$

One can show that ν(U) ≥ 0, with equality if and only if U is Gaussian. One can
rewrite F(W) as
$$F(W) = \frac d2 + \frac d2 \log(2\pi) + \frac12 \sum_{i=1}^d \log(\sigma^2_{W^{(i)}X}) - \sum_{i=1}^d \nu(W^{(i)} X) - \log|\det(W)|.$$

As we remarked earlier, if we replace X by Σ^{-1/2}(X − m) (after estimating the
covariance matrix of X), there is no loss of generality in requiring that W is an
orthogonal matrix, in which case both σ²_{W^{(i)}X} and |det W| are equal to 1. Assuming
such a reduction is done, we see that the problem reduces to maximizing
$$\sum_{i=1}^d \nu(W^{(i)} X) \tag{21.14}$$

among all orthogonal matrices W. Still in Comon [53], an approximation of the
negentropy ν(U) is provided as a function of the third and fourth cumulants of the
distribution of U. These are given by
$$\kappa_3 = E((U - E(U))^3)$$
and
$$\kappa_4 = E((U - E(U))^4) - 3\sigma_U^4.$$
In particular, when U is normalized, i.e., E(U) = 0 and σ_U² = 1, we have κ3 = E(U³)
and κ4 = E(U⁴) − 3. Under the same assumption, it is proposed in Comon [53] to use
the approximation
$$\nu(U) \sim \frac{\kappa_3^2}{12} + \frac{\kappa_4^2}{48} + \frac{7\kappa_3^4}{48} - \frac{\kappa_3^2 \kappa_4}{8}.$$
12 48 48 8
This approximation was derived from an Edgeworth expansion of the p.d.f. of U ,
which can be seen as a Taylor expansion around a Gaussian distribution. Plugging
this expression into (21.14) provides an expression that can be maximized in W
where the cumulants are replaced by their sample estimates. However, the maxi-
mized function involves high-degree polynomials in the unknown coefficients of W ,
and this simplified problem still presents numerical challenges.

An alternative approximation of the negentropy has been proposed in Hyvärinen
[94], relying on the maximum entropy principle described in the following theorem.
Associate with any random variable Y taking values in a measurable space G the
differential entropy
$$h_\mu(Y) = -\int_G \log(\varphi_Y(x))\, \varphi_Y(x)\, d\mu(x)$$

if the distribution of Y has a density, denoted ϕ_Y, with respect to µ, and h_µ(Y) = −∞
otherwise. We also use the same notation
$$h_\mu(\varphi) = -\int_G \log(\varphi(x))\, \varphi(x)\, d\mu(x)$$
for a p.d.f. ϕ with respect to µ (i.e., such that ϕ is non-negative and has integral 1).
Then, the following is true.
Theorem 21.10 Let g = (g^{(1)}, ..., g^{(p)})^T be a function defined on a measurable space G,
taking values in R^p, and let µ be a measure on G. Let Γ_µ be the set of all λ = (λ^{(1)}, ..., λ^{(p)}) ∈
R^p such that
$$\int_G \exp(\lambda^T g(y))\, d\mu(y) < \infty. \tag{21.15}$$
Then
$$h_\mu(Y) \le \inf\left\{ -\lambda^T E(g(Y)) + \log \int_G \exp(\lambda^T g(y))\, d\mu(y) : \lambda \in \Gamma_\mu \right\}. \tag{21.16}$$
Define, for λ ∈ Γ_µ,
$$\psi_\lambda(x) = \frac{\exp(\lambda^T g(x))}{\int_G \exp(\lambda^T g(y))\, d\mu(y)}. \tag{21.17}$$
Assume that the infimum in (21.16) is attained at an interior point λ* of Γ_µ. Then
$$h_\mu(\psi_{\lambda^*}) = \max\big\{ h_\mu(\tilde Y) : E_{\tilde Y}(g^{(i)}) = E_Y(g^{(i)}),\ i = 1, \ldots, p \big\}. \tag{21.18}$$

Proof Let Y be a random variable with p.d.f. ϕ_Y with respect to µ (otherwise the
left-hand side of (21.16) is −∞ and there is nothing to prove). Then
$$h_\mu(Y) + \lambda^T E(g(Y)) - \log \int_G \exp(\lambda^T g(y))\, d\mu(y) = -\int_G \varphi_Y(x) \log \frac{\varphi_Y(x)}{\psi_\lambda(x)}\, d\mu(x) \le 0,$$
since $\int_G \varphi_Y(x) \log(\varphi_Y(x)/\psi_\lambda(x))\, d\mu(x)$ is a KL divergence and is therefore non-negative.
λ

Assume that λ is in Γ̊_µ. Then, there exists ε > 0 such that, for any u ∈ R^p with |u| = 1,
λ + εu ∈ Γ_µ. Using the fact that e^β ≥ e^α + (β − α)e^α, we can write
$$\varepsilon u^T g\, e^{\lambda^T g} \le e^{(\lambda + \varepsilon u)^T g} - e^{\lambda^T g}, \qquad -\varepsilon u^T g\, e^{\lambda^T g} \le e^{(\lambda - \varepsilon u)^T g} - e^{\lambda^T g},$$
yielding
$$\varepsilon |u^T g|\, e^{\lambda^T g} \le \max\big(e^{(\lambda + \varepsilon u)^T g}, e^{(\lambda - \varepsilon u)^T g}\big) - e^{\lambda^T g} \le e^{(\lambda + \varepsilon u)^T g} + e^{(\lambda - \varepsilon u)^T g} - e^{\lambda^T g}.$$

Since the upper bound is integrable with respect to µ, so is the lower bound, showing
that (taking u in the canonical basis of R^p)
$$\int_G |g^{(i)}(y)|\, e^{\lambda^T g(y)}\, d\mu(y) < \infty$$
for all i, or
$$\int_G |g(y)|\, e^{\lambda^T g(y)}\, d\mu(y) < \infty.$$

Let c = E(g(Y)) and define
$$\Psi_c(\lambda) = -c^T \lambda + \log \int_G \exp(\lambda^T g(y))\, d\mu(y). \tag{21.19}$$
Then
$$\partial_\lambda \Psi_c = -c^T + \frac{\int_G g(x)^T \exp(\lambda^T g(x))\, d\mu(x)}{\int_G \exp(\lambda^T g(y))\, d\mu(y)} = -c^T + \int_G g(x)^T \psi_\lambda(x)\, d\mu(x).$$

Since λ* is a minimizer, we find that, if Ỹ is a random variable with p.d.f. ψ_{λ*}, then
E(g(Ỹ)) = c = E(g(Y)). In that case, the upper bound in (21.16) is h_µ(Ỹ), proving (21.18). □

Remark 21.11 The previous theorem is typically applied with µ equal to Lebesgue’s
measure on G = Rd or to a counting measure with G finite. To rewrite the statement
of theorem 21.10 in those cases, it suffices to replace dµ(x) by dx for the former, and
integrals by sums over G for the latter. In the rest of the discussion, we restrict to the
case when µ is Lebesgue’s measure, using h(Y ) instead of hµ (Y ). 

Remark 21.12 This principle justifies, in particular, that the negentropy is always
non-negative since it implies that a distribution that maximizes the entropy given
its first and second moments must be Gaussian. 

The right-hand side of (21.16) provides a variational approximation of the
entropy. If one uses this approximation when minimizing h(W^{(1)}X) + ··· + h(W^{(d)}X),
the resulting problem can be expressed as a minimization, with respect to W and
λ1, ..., λd ∈ R^p, of
$$-\sum_{j=1}^d \lambda_j^T E(g(W^{(j)} X)) + \sum_{j=1}^d \log \int \exp(\lambda_j^T g(y))\, dy.$$

While it would be possible to solve this optimization problem directly, a further


approximation of the upper bound can be developed leading to a simpler procedure.

We have seen in the previous proof that, defining Ψ_c by (21.19) and denoting by E_λ
the expectation with respect to ψ_λ, one has
$$\nabla \Psi_c(\lambda) = -c + E_\lambda(g).$$
Taking the second derivative, one finds
$$\nabla^2 \Psi_c(\lambda) = E_\lambda\big((g - E_\lambda(g))(g - E_\lambda(g))^T\big).$$

Now choose c0 such that a minimizer of Ψ_{c0}(λ), say λ_{c0}, is known. If c is close to c0,
a first-order expansion indicates that, for λ_c minimizing Ψ_c, one should have
$$\lambda_c \simeq \lambda_{c_0} + \nabla^2 \Psi_c(\lambda_{c_0})^{-1} (c - c_0)$$
with
$$\Psi_c(\lambda_c) \simeq \Psi_c(\lambda_{c_0}) - (c - c_0)^T \nabla^2 \Psi_c(\lambda_{c_0})^{-1} (c - c_0).$$
One can then use the right-hand side as an approximation of the optimal entropy.

This leads to simple computations under the following assumptions. First, assume
that the first two functions g^{(1)} and g^{(2)} are u and u²/√3. Let ϕ0 be the p.d.f.
of a standard Gaussian. Assume that the functions g^{(j)} are chosen so that
$$\int_{\mathbb R} g^{(i)}(u)\, g^{(j)}(u)\, \varphi_0(u)\, du = \delta_{ij}$$
for i, j = 1, ..., p, and such that ∫ g^{(i)}(u)ϕ0(u)du = 0 for i ≠ 2. Take
$$c_0 = \int g(u)\, \varphi_0(u)\, du,$$
so that c0^{(1)} = 0, c0^{(2)} = 1/√3 and c0^{(i)} = 0 for i ≥ 3.

Then λ_{c0} provides, by construction, the distribution ϕ0, and, for any c, ∇²Ψ_c(λ_{c0}) =
Id_{R^p}. With these assumptions, the approximation is
$$\Psi_c(\lambda_c) = h(\varphi_0) - |c - c_0|^2 = \frac12 (1 + \log 2\pi) - \sum_{j \ge 3} (c^{(j)})^2$$
(assuming that the data is centered and normalized, so that c^{(1)} = 0 and c^{(2)} = 1/√3).
The ICA problem can then be solved by maximizing
$$\sum_{j=1}^d \sum_{i=1}^p E\big(g^{(i)}(W^{(j)} X)\big)^2 \tag{21.20}$$
over orthogonal matrices W.



Remark 21.13 Without the assumption made on the functions g^{(j)}, one needs to
compute S = Cov(g(U))^{-1}, where U ∼ N(0, 1), and maximize
$$\sum_{j=1}^d \big(E(g(W^{(j)} X)) - E(g(U))\big)^T S \big(E(g(W^{(j)} X)) - E(g(U))\big).$$
Clearly, this expression can be reduced to (21.20) by replacing g by S^{-1/2}(g − E(g(U))).
Note also that we retrieve here an idea similar to the negentropy, maximizing a
deviation from a Gaussian.

21.6.3 Maximization over orthogonal matrices

In the previous discussion, we reached several formulations of ICA that required
optimizing a function W ↦ F(W) over all orthogonal matrices. We now discuss
how such a problem may be implemented.

In all the examples that were considered, there would have been no loss of gen-
erality in requiring that W is a rotation, i.e., det(W ) = 1. This is because one can
change the sign of this determinant by simply changing the sign of one of the in-
dependent components, which is always possible. (In fact, the indeterminacy in W
is by right multiplication by the product of a permutation matrix and a diagonal
matrix with ±1 entries.)

Let us assume that F(W ) is actually defined and differentiable over all invertible
matrices, which form an open subset of the linear space Md (R) of d by d matrices.
Our optimization problem can therefore be considered as the minimization of F with
the constraint that W W T = IdRd .

Gradient descent derives from the observation that a direction of descent should be
a matrix H such that F(W + εH) < F(W) for small enough ε > 0, and from the remark
that H = −∇F(W) provides such a direction. This analysis does not apply to the
constrained optimization setting because, unless the constraints are linear, W + εH will
generally stop satisfying the constraint when ε > 0, requiring the use of more complex
procedures. In our case, however, one can take advantage of the fact that orthogonal
matrices form a group, and replace the perturbation W ↦ W + εH by W ↦ W e^{εH}
(using the matrix exponential), where H is moreover required to be skew-symmetric
(H + H^T = 0), which guarantees that e^{εH} is an orthogonal matrix with determinant
1. Now, using the fact that e^{εH} = Id + εH + o(ε), we can write
$$F(W e^{\varepsilon H}) = F(W) + \varepsilon\, \mathrm{trace}(\nabla F(W)^T W H) + o(\varepsilon).$$
Let ∇^s F(W) be the skew-symmetric part of W^T ∇F(W), i.e.,
$$\nabla^s F(W) = \frac12 \big(W^T \nabla F(W) - \nabla F(W)^T W\big).$$
536 CHAPTER 21. DIMENSION REDUCTION AND FACTOR ANALYSIS

Then, if $H$ is skew symmetric,
$$\begin{aligned}
\mathrm{trace}(\nabla^s F(W)^T H) &= \frac12\,\mathrm{trace}(\nabla F(W)^T W H) - \frac12\,\mathrm{trace}(W^T\nabla F(W) H)\\
&= \frac12\,\mathrm{trace}(\nabla F(W)^T W H) + \frac12\,\mathrm{trace}(W^T\nabla F(W) H^T)\\
&= \mathrm{trace}(\nabla F(W)^T W H),
\end{aligned}$$
so that
$$F(W e^{\epsilon H}) = F(W) + \epsilon\,\mathrm{trace}(\nabla^s F(W)^T H) + o(\epsilon).$$
This shows that $H = -\nabla^s F(W)$ provides a direction of descent in the orthogonal group, in the sense that, if $\nabla^s F(W) \neq 0$,
$$F\big(W e^{-\epsilon\nabla^s F(W)}\big) < F(W)$$
for small enough $\epsilon > 0$. As a consequence, the algorithm
$$W_{n+1} = W_n\, e^{-\epsilon_n \nabla^s F(W_n)},$$
combined with a line search for $\epsilon_n$, implements gradient descent in the group of orthogonal matrices, and therefore converges to a local minimizer of $F$.
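As a minimal illustration, here is a sketch of one such step in Python (our choice of language and of NumPy/SciPy, not the text's; `grad_F` stands for a user-supplied function computing the Euclidean gradient $\nabla F(W)$, and the line search over $\epsilon$ is omitted):

```python
import numpy as np
from scipy.linalg import expm

def orthogonal_descent_step(W, grad_F, eps):
    """One step of W -> W exp(-eps * grad^s F(W)) on the orthogonal group.

    W      : current orthogonal d x d matrix
    grad_F : function returning the (Euclidean) gradient of F at W
    eps    : step size, to be chosen by line search
    """
    G = grad_F(W)
    H = 0.5 * (W.T @ G - G.T @ W)   # skew-symmetric part of W^T grad F(W)
    return W @ expm(-eps * H)       # stays orthogonal with determinant 1
```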

If one linearizes the right-hand side as a function of $\epsilon$, one gets
$$\begin{aligned}
W_n\, e^{-\epsilon_n\nabla^s F(W_n)} &= W_n + \frac{\epsilon_n}2 W_n\big(\nabla F(W_n)^T W_n - W_n^T\nabla F(W_n)\big) + o(\epsilon_n)\\
&= W_n + \frac{\epsilon_n}2\big(W_n\nabla F(W_n)^T W_n - \nabla F(W_n)\big) + o(\epsilon_n).
\end{aligned}$$
As already argued, this linearized version cannot be used directly when optimizing over the orthogonal group. However, if one denotes by $\omega(A)$ the unitary part of the polar decomposition of $A$, i.e., $\omega(A) = (AA^T)^{-1/2}A$, then the algorithm
$$W_{n+1} = \omega\Big(W_n + \frac{\epsilon_n}2\big(W_n\nabla F(W_n)^T W_n - \nabla F(W_n)\big)\Big)$$
also provides a valid gradient descent algorithm.

21.6.4 Parametric ICA

We now describe a parametric version of ICA in which a model is chosen for the independent components of $Y$. The simplest version is to assume that all $Y^{(j)}$ are i.i.d. with some prescribed p.d.f., say $\psi$. A typical example for $\psi$ is a logistic distribution with
$$\psi(t) = \frac{2}{(e^t + e^{-t})^2}.$$
If $y$ is a vector in $\mathbb R^d$, we will use, as usual, the notation $\psi(y) = (\psi(y^{(1)}), \ldots, \psi(y^{(d)}))^T$ for $\psi$ applied to each component of $y$.

The model parameter is then the matrix $A$, or preferably $W = A^{-1}$, and it may be estimated using maximum likelihood. Indeed, the p.d.f. of $X$ is
$$f_X(x) = |\det W|\, \prod_{j=1}^d \psi(W^{(j)} x),$$
where $W^{(j)}$ is the $j$th row of $W$, so that $W$ can be estimated by maximizing
$$\ell(W) = N\log|\det W| + \sum_{k=1}^N\sum_{j=1}^d \log\psi(W^{(j)} x_k).$$

If we denote by $\Gamma(W)$ the matrix with coefficients
$$\gamma_{ij}(W) = \sum_{k=1}^N x_k^{(i)}\, \frac{\psi'(W^{(j)} x_k)}{\psi(W^{(j)} x_k)}$$
and use the fact that the gradient of $W \mapsto \log|\det W|$ is $W^{-T}$ (the inverse transpose of $W$), we can write
$$\nabla\ell(W) = N W^{-T} + \Gamma(W).$$

We need, however, the maximization to operate on sets of invertible matrices, and it is more natural to move in this set through multiplication than through addition, because the product of two invertible matrices is always invertible, but not necessarily their sum. So, similarly to the previous section, we will look for small variations of the form $W \mapsto W e^{\epsilon H}$, or simply, in this case, $W \mapsto W(\mathrm{Id}_{\mathbb R^d} + \epsilon H)$. In both cases, the first-order expansion of the log-likelihood gives
$$\ell(W) + \epsilon\,\mathrm{trace}\big((N W^{-T} + \Gamma(W))^T W H\big),$$
which suggests taking
$$H = W^T(N W^{-T} + \Gamma(W)) = N\,\mathrm{Id} + W^T\Gamma(W).$$
Dividing $H$ by $N$, we obtain the following variant of gradient ascent for maximum likelihood:
$$W_{n+1} = (1+\epsilon_n) W_n + \frac{\epsilon_n}{N}\, W_n W_n^T \Gamma(W_n).$$
This algorithm numerically performs much better than standard gradient ascent. It moreover presents the advantage of avoiding the computation of the inverse of $W$ at each step.
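As a sketch (in Python with NumPy; the function names are ours, and the index conventions may differ slightly from the text's $\Gamma$), one iteration of this multiplicative ascent for the logistic model above, for which $\psi'/\psi = -2\tanh$, could read:

```python
import numpy as np

def ica_ascent_step(W, X, eps):
    """One step of W <- (1+eps) W + (eps/N) W W^T Gamma(W) for parametric ICA.

    W : d x d unmixing matrix; X : N x d data matrix; eps : step size.
    """
    N = X.shape[0]
    T = X @ W.T                      # T[k, j] = W^(j) x_k
    # Gradient-shaped Gamma: entry (j, i) = sum_k psi'(W^(j)x_k)/psi(W^(j)x_k) x_k^(i)
    Gamma = (-2.0 * np.tanh(T)).T @ X
    return (1 + eps) * W + (eps / N) * (W @ W.T @ Gamma)
```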

21.6.5 Probabilistic ICA

Note that the algorithms that we discussed concerning ICA were all formulated in
terms of the matrix W = A−1 , which “filters” the data into independent components.
As a result, ICA requires as many independent components as the dimension of X.
Moreover, because the components are typically normalized to have equal variance,
there is no obvious way to perform dimension reduction using this method. Indeed,
ICA is typically run after the data is preprocessed using PCA, this preprocessing
step providing the reduction of dimension.

It is however possible to define a model similar to probabilistic PCA, assuming a limited number of components to which a Gaussian noise is added, in the form
$$X = \sum_{j=1}^p a_j Y^{(j)} + \sigma R$$
with $p < d$, $a_1, \ldots, a_p \in \mathbb R^d$, $Y^{(1)}, \ldots, Y^{(p)}$ independent variables as before, and $R \sim \mathcal N(0, \mathrm{Id}_{\mathbb R^d})$. This model is identifiable (up to permutation and scalar multiplication of the components) as soon as none of the variables $Y^{(j)}$ is Gaussian.

Let us assume a parametric setting similar to that of the previous section, so that
Y (1) , . . . , Y (p)
are explicitly modeled as independent variables with p.d.f. ψ. Introduce
the matrix A = [a1 , . . . , ap ], so that the model can also be written X = AY + σ R, where
A and σ 2 are unknown model parameters.

The p.d.f. of $X$ is now given by
$$f_X(x; A, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{d/2}}\int_{\mathbb R^p} e^{-\frac{|x-Ay|^2}{2\sigma^2}}\Big(\prod_{i=1}^p \psi(y^{(i)})\Big)\, dy^{(1)}\ldots dy^{(p)},$$

which is definitely not available in closed form. Since we are in a situation in which the pair of
random variables is imperfectly observed through X, using the EM algorithm (chap-
ter 17) is an option, but it may, as we shall see below, lead to heavy computation. The
basic step of the EM is, given current parameters A0 , σ0 , to maximize the conditional
expectation (knowing X, for the current parameters) of the joint log-likelihood of
(X, Y ) with respect to the new parameters. In this context, the joint distribution of
(X, Y ) has density

$$f_{X,Y}(x, y; A, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{d/2}}\, e^{-\frac{|x-Ay|^2}{2\sigma^2}}\, \prod_{i=1}^p \psi(y^{(i)}),$$

so that the conditional expectation of the joint log-likelihood over the training set is
$$-\frac{Nd}2\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{k=1}^N E_{A_0,\sigma_0}\big(|x_k - AY|^2\,\big|\,X = x_k\big) + \sum_{k=1}^N\sum_{j=1}^p E_{A_0,\sigma_0}\big(\log\psi(Y^{(j)})\,\big|\,X = x_k\big).$$

Notice that the last term does not depend on $A, \sigma^2$, and that, given $A$, the optimal value of $\sigma^2$ is given by
$$\sigma^2 = \frac{1}{Nd}\sum_{k=1}^N E_{A_0,\sigma_0}\big(|x_k - AY|^2\,\big|\,X = x_k\big).$$

The minimization of
$$\sum_{k=1}^N E_{A_0,\sigma_0}\big(|x_k - AY|^2\,\big|\,X = x_k\big)$$
with respect to $A$ is a least squares problem. Let $b_k^{(j)} = E_{A_0,\sigma_0}(Y^{(j)} \mid X = x_k)$ and $s_k(i,j) = E_{A_0,\sigma_0}(Y^{(i)} Y^{(j)} \mid X = x_k)$: the gradient of the previous term is
$$-2\sum_{k=1}^N E_{A_0,\sigma_0}\big((x_k - AY)Y^T\,\big|\,X = x_k\big) = -2\sum_{k=1}^N \big(x_k b_k^T - A S_k\big),$$
$b_k$ being the column vector with coefficients $b_k^{(j)}$ and $S_k$ the matrix with coefficients $s_k(i,j)$. The result therefore is
$$A = \Big(\sum_{k=1}^N x_k b_k^T\Big)\Big(\sum_{k=1}^N S_k\Big)^{-1}.$$

Unfortunately, the computation of the moments of the conditional distribution of $Y$ given $x_k$ (needed in $b_k$ and $S_k$) is a difficult task. The conditional density of $Y$ given $X = x_k$ is
$$g(y \mid x_k) = \psi(y)\, e^{-\frac{|A_0 y - x_k|^2}{2\sigma_0^2}}\big/ Z(A_0, \sigma_0),$$
from which moments cannot be computed analytically in general. Monte Carlo sampling algorithms can be used to approximate these moments, but they are computationally demanding, and they must be run at every step of the EM.

In place of the exact EM, one may use a mode approximation (section 17.3.1), which replaces the conditional likelihood of $Y$ given $X = x_k$ by a Dirac distribution at the mode:
$$\hat y_{A_0,\sigma_0}(x_k) = \mathrm{argmax}_y\Big(\psi(y)\, e^{-\frac{|A_0 y - x_k|^2}{2\sigma_0^2}}\Big).$$
The maximization step then reduces to maximizing in $A, \sigma^2$
$$-\frac{Nd}2\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{k=1}^N \big|x_k - A\hat y_{A_0,\sigma_0}(x_k)\big|^2. \qquad (21.21)$$

This therefore provides a two-step procedure.

Algorithm 21.2 (Probabilistic ICA: mode approximation)

(1) Initialize the algorithm with $A_0, \sigma_0$.
(2) At step $n$:
(i) For $k = 1, \ldots, N$, maximize $\prod_{i=1}^p \psi(y^{(i)})\, e^{-\frac{|A_n y - x_k|^2}{2\sigma_n^2}}$ to obtain $\hat y_{A_n,\sigma_n}(x_k)$. This requires a numerical optimization procedure, such as gradient ascent. The problem is concave when $\log\psi$ is concave.
(ii) Minimize (21.21) with respect to $A, \sigma^2$, yielding
$$A_{n+1} = \Big(\sum_{k=1}^N x_k b_k^T\Big)\Big(\sum_{k=1}^N S_k\Big)^{-1}$$
with $b_k = \hat y_{A_n,\sigma_n}(x_k)$, $S_k = b_k b_k^T$, and
$$\sigma_{n+1}^2 = \frac{1}{Nd}\sum_{k=1}^N \big|x_k - A_{n+1}\hat y_{A_n,\sigma_n}(x_k)\big|^2.$$
(3) Stop if the variation of the parameter is below a tolerance level. Otherwise, iterate to the next step.
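A compact sketch of this two-step procedure in Python, using SciPy's general-purpose optimizer for step (2)(i) (a convenient assumption of ours, not a prescription of the text, and specialized to the logistic $\psi$ above):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_post(y, x, A, sigma2):
    # -log of prod_i psi(y^(i)) exp(-|Ay - x|^2 / (2 sigma2)), up to constants,
    # for the logistic psi(t) = 2/(e^t + e^-t)^2.
    return 2 * np.sum(np.logaddexp(y, -y)) + np.sum((A @ y - x) ** 2) / (2 * sigma2)

def mode_em(X, p, n_iter=50):
    """Mode-approximation EM for probabilistic ICA (Algorithm 21.2, sketched)."""
    N, d = X.shape
    A, sigma2 = np.random.randn(d, p), 1.0
    for _ in range(n_iter):
        # Step (i): MAP estimate of y for each observation (concave problem).
        Y = np.array([minimize(neg_log_post, np.zeros(p),
                               args=(x, A, sigma2)).x for x in X])
        # Step (ii): least squares update of A, then residual variance.
        A = (X.T @ Y) @ np.linalg.inv(Y.T @ Y)
        sigma2 = np.mean((X - Y @ A.T) ** 2)
    return A, sigma2
```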

Once $A$ and $\sigma^2$ have been estimated, the $y$ components associated with a new observation $x$ can be estimated by $\hat y_{A,\sigma}(x)$, therefore minimizing
$$\frac{1}{2\sigma^2}|x - Ay|^2 - \sum_{j=1}^p \log\psi(y^{(j)}),$$
yielding the MAP estimate, the same convex optimization problem as in step (2)(i) above. Now we can see how the method borrows from both PCA and ICA: the columns of $A$, $a_1, \ldots, a_p$, can be considered as $p$ principal directions, and are fixed after learning; they are not orthonormal, and do not satisfy the nesting property of PCA (that the directions obtained for $p$ contain those obtained for $p-1$). The coordinates of $x$ with respect to this basis are not obtained by a projection, as would be the case with PCA, but as the result of a penalized estimation problem. The penalty associated with the logistic case is
$$-\log\psi(y^{(j)}) = -\log 2 + 2\log\big(e^{y^{(j)}} + e^{-y^{(j)}}\big).$$
This distribution with "exponential tails" has the advantage of allowing large values of $y^{(j)}$, which generally entails sparse decompositions, in which $y$ has a few large coefficients and many zeros.

As an alternative to the mode approximation of the EM, which may lead to biased estimators, one may use the SAEM algorithm (section 17.4.3), as proposed in Allassonniere and Younes [3]. Recall that the EM algorithm replaces the parameters $A_0, \sigma_0^2$ by minimizers of
$$\frac{Nd}2\log(\sigma^2) + \frac{1}{2\sigma^2}\sum_{k=1}^N E_{A_0,\sigma_0}\big(|x_k - AY|^2\,\big|\,X = x_k\big)
= \frac{Nd}2\log(\sigma^2) + \frac{1}{2\sigma^2}\sum_{k=1}^N |x_k|^2 - \frac{1}{\sigma^2}\sum_{k=1}^N x_k^T A b_k + \frac{1}{2\sigma^2}\sum_{k=1}^N \mathrm{trace}(A^T A S_k),$$

where the computation of $b_k^{(j)} = E_{A_0,\sigma_0}(Y^{(j)} \mid X = x_k)$ and $s_k(i,j) = E_{A_0,\sigma_0}(Y^{(i)} Y^{(j)} \mid X = x_k)$ was the challenging issue. In the SAEM algorithm, the statistics $b_k$ and $S_k$ are part of a stochastic approximation scheme, and are estimated in parallel with EM updates as follows.

Algorithm 21.3 (SAEM for probabilistic ICA)

Initialize the algorithm with parameters $A, \sigma^2$. Define a decreasing sequence of steps, $\gamma_t$. Let, for $k = 1, \ldots, N$, $b_k = 0$ and $S_k = \mathrm{Id}$. Iterate the following steps.

(1) For $k = 1, \ldots, N$, sample $y_k$ according to the conditional distribution of $Y$ given $X = x_k$, using the current parameters $A$ and $\sigma^2$.
(2) Update $b_k$ and $S_k$, letting (assuming step $t$ of the algorithm)
$$\begin{cases} b_k \leftarrow b_k + \gamma_t(y_k - b_k)\\ S_k \leftarrow S_k + \gamma_t(y_k y_k^T - S_k) \end{cases}$$

(3) Replace $A$ and $\sigma^2$ by
$$A = \Big(\sum_{k=1}^N x_k b_k^T\Big)\Big(\sum_{k=1}^N S_k\Big)^{-1}$$
and
$$\sigma^2 = \frac{1}{Nd}\sum_{k=1}^N \big(|x_k|^2 - 2 x_k^T A b_k + \mathrm{trace}(A^T A S_k)\big).$$

The parameter $\gamma_t$ should be decreasing with $t$, typically chosen so that $\sum_t \gamma_t = +\infty$ and $\sum_t \gamma_t^2 < \infty$ (e.g., $\gamma_t \propto 1/t$). One way to sample $Y_k$ is to use a rejection scheme, iterating the procedure that samples $y$ according to the prior and accepts the result with probability $M\exp(-|x_k - Ay|^2/2\sigma^2)$ until acceptance. Here $M$ must be chosen so that $M\max_y \exp(-|x_k - Ay|^2/2\sigma^2) \leq 1$ (e.g., $M = 1$).

This method will work for small $p$, but for large $p$ the probability of acceptance may be very small. In such cases, $Y_k$ can be sampled one component at a time using a Metropolis-Hastings scheme. If component $j$ is updated, this scheme samples a new value of $y$ (call it $y'$) by changing only $y^{(j)}$ according to the prior distribution $\psi$, and accepts the change with probability
$$\min\left(1,\ \frac{\exp(-|x_k - Ay'|^2/2\sigma^2)}{\exp(-|x_k - Ay|^2/2\sigma^2)}\right).$$
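A sketch of this componentwise sampler (Python/NumPy, our own function names; `sample_logistic` inverts the c.d.f. $1/(1+e^{-2t})$ of the logistic density used above):

```python
import numpy as np

def sample_logistic(rng):
    """One draw from psi(t) = 2/(e^t + e^-t)^2 by inverse transform."""
    u = rng.uniform()
    return 0.5 * (np.log(u) - np.log1p(-u))

def mh_sweep(y, x, A, sigma2, rng):
    """One Metropolis-Hastings sweep over the components of y, targeting the
    conditional density proportional to psi(y) exp(-|x - Ay|^2/(2 sigma2))."""
    for j in range(len(y)):
        y_new = y.copy()
        y_new[j] = sample_logistic(rng)      # propose from the prior
        log_ratio = (np.sum((x - A @ y) ** 2)
                     - np.sum((x - A @ y_new) ** 2)) / (2 * sigma2)
        if np.log(rng.uniform()) < log_ratio:
            y = y_new
    return y
```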

21.7 Non-negative matrix factorization

In this section, we consider factor analysis methods that approximate a random variable $X$ in the form $X = \sum_{j=1}^p a^{(j)} Y^{(j)}$, with the constraint that the scalars $a^{(1)}, \ldots, a^{(p)} \in \mathbb R$ are non-negative and that the vectors $Y^{(1)}, \ldots, Y^{(p)} \in \mathbb R^d$ have non-negative entries. This model makes sense, for example, when $X$ represents the total multivariate production (e.g., in terms of numbers of molecules of various types) resulting from several chemical reactions that operate together. Another application is when $X$ is a list of preference scores associated with a person for, say, books or movies, and each person is modeled as a positive linear combination of $p$ "typical scorers," represented by the vectors $Y^{(j)}$ for $j = 1, \ldots, p$.

When training data $(x_1, \ldots, x_N)$ is observed and stacked in an $N$ by $d$ matrix $\mathcal X$, the decomposition can be summarized for all observations together in the matrix form
$$\mathcal X = A Y^T,$$
where $A$ is $N$ by $p$ and provides the coefficients $a_k^{(j)}$ associated with each observation, and $Y = [y^{(1)}, \ldots, y^{(p)}]$ is $d$ by $p$ and provides the $p$ typical profiles. The matrices $A$ and $Y$ are unknown, and their estimation subject to the constraint of having non-negative components represents the non-negative matrix factorization (NMF) problem.

NMF is often implemented by solving the constrained optimization problem of minimizing $|\mathcal X - AY^T|^2$ subject to $A$ and $Y$ having non-negative entries. This problem is non-convex in general, but the sub-problems of optimizing either $A$ or $Y$ while the other matrix is fixed are simple quadratic programs.

This suggests using an alternating minimization method, iterating steps in which $A$ is updated with $Y$ fixed, followed by an update of $Y$ with $A$ fixed. However, solving a full quadratic program at each step would be computationally prohibitive with large datasets, and simpler update rules have been suggested, updating each matrix in turn with a guarantee of reducing the objective function.

If $Y$ is considered as fixed and $A$ is the free variable, we have
$$|\mathcal X - AY^T|^2 = |\mathcal X|^2 - 2\,\mathrm{trace}(\mathcal X^T A Y^T) + \mathrm{trace}(A Y^T Y A^T) = \mathrm{trace}(A^T A Y^T Y) - 2\,\mathrm{trace}(A^T(\mathcal X Y)) + |\mathcal X|^2.$$
The next lemma will provide update steps for A.
Lemma 21.14 Let $M$ be an $n$ by $n$ symmetric matrix and $b \in \mathbb R^n$, both assumed to have non-negative entries. Let $u \in \mathbb R^n$, also with non-negative coefficients, and let
$$v^{(i)} = u^{(i)}\left(\frac{b^{(i)}}{\sum_{j=1}^n m(i,j)\, u^{(j)}}\right).$$
Then
$$v^T M v - 2 b^T v \leq u^T M u - 2 b^T u.$$
Moreover, $v = u$ if and only if $u$ minimizes $u^T M u - 2 b^T u$ subject to $u^{(i)} \geq 0$, $i = 1, \ldots, n$.
Proof Let $F(u) = u^T M u - 2 b^T u$. We look for $v^{(i)} = \beta^{(i)} u^{(i)}$ with $\beta^{(i)} \geq 0$ such that $F(v) \leq F(u)$. We have
$$\begin{aligned}
F(v) &= \sum_{i,j=1}^n \beta^{(i)}\beta^{(j)} u^{(i)} u^{(j)} m(i,j) - 2\sum_{i=1}^n b^{(i)}\beta^{(i)} u^{(i)}\\
&\leq \frac12\sum_{i,j=1}^n \big((\beta^{(i)})^2 + (\beta^{(j)})^2\big) u^{(i)} u^{(j)} m(i,j) - 2\sum_{i=1}^n b^{(i)}\beta^{(i)} u^{(i)}\\
&= \sum_{i,j=1}^n (\beta^{(i)})^2 u^{(i)} u^{(j)} m(i,j) - 2\sum_{i=1}^n b^{(i)}\beta^{(i)} u^{(i)}.
\end{aligned}$$

When $\beta = \mathbf 1_n$, this upper bound is equal to $F(u)$. So, if we choose $\beta$ minimizing the upper bound, we will indeed find $v$ such that $F(v) \leq F(u)$. Rewriting the upper bound as
$$\sum_{i=1}^n u^{(i)}\left((\beta^{(i)})^2\Big(\sum_{j=1}^n m(i,j)\, u^{(j)}\Big) - 2 b^{(i)}\beta^{(i)}\right),$$
we see that $\beta^{(i)} = b^{(i)}\big/\sum_{j=1}^n m(i,j)\, u^{(j)}$ provides such a minimizer, which proves the first statement of the lemma. For the second statement, we have $v^{(i)} = u^{(i)}$ if and only if $u^{(i)} = 0$ or $\sum_{j=1}^n m(i,j)\, u^{(j)} = b^{(i)}$, and one directly checks that these are exactly the KKT conditions for a minimizer of $F$ over vectors with non-negative entries. □

To apply the lemma to the minimization in $A$, let $M : A \mapsto A Y^T Y$ and $b = \mathcal X Y$ (we are working in the linear space of $N$ by $p$ matrices). Then the update
$$a_k^{(i)} \mapsto a_k^{(i)}\,\frac{(\mathcal X Y)(i,k)}{(A Y^T Y)(i,k)}$$
decreases the objective function.

Similarly, applying the lemma with the operator $Y \mapsto Y A^T A$ and $b = \mathcal X^T A$ gives the update for $Y$, namely
$$y_j^{(i)} \mapsto y_j^{(i)}\,\frac{(\mathcal X^T A)(i,j)}{(Y A^T A)(i,j)}.$$
We have therefore obtained the following algorithm.

Algorithm 21.4 (NMF, quadratic cost)

1. Fix $p > 0$ and let $\mathcal X$ be the $N$ by $d$ matrix containing the observed data. Initialize the procedure with matrices $A$ and $Y$, respectively of size $N$ by $p$ and $d$ by $p$, with positive coefficients.
2. At a given stage of the algorithm, let $A$ and $Y$ be the current matrices providing an approximate decomposition of $\mathcal X$.
3. For the next step, let $\tilde A$ be the matrix with coefficients
$$\tilde a_k^{(i)} = a_k^{(i)}\,\frac{(\mathcal X Y)(i,k)}{(A Y^T Y)(i,k)}$$
and $\tilde Y$ the matrix with coefficients
$$\tilde y_j^{(i)} = y_j^{(i)}\,\frac{(\mathcal X^T \tilde A)(i,j)}{(Y \tilde A^T \tilde A)(i,j)}.$$

4. Replace A by à and Y by Ỹ , iterating until numerical convergence.

An alternative version of the method has been proposed, where the objective function is $\Phi(AY^T)$, where, for an $N$ by $d$ matrix $Z = [z_1, \ldots, z_N]^T$,
$$\Phi(Z) = \sum_{k=1}^N\sum_{i=1}^d \big(z_k^{(i)} - x_k^{(i)}\log z_k^{(i)}\big),$$
which is indeed minimal for $Z = \mathcal X$. We state and prove a second lemma that will allow us to address this problem.
Lemma 21.15 Let $M$ be an $n$ by $q$ matrix and $x \in \mathbb R^n$, $b \in \mathbb R^q$, all assumed to have positive entries. For $u \in (0,+\infty)^q$, define
$$F(u) = \sum_{j=1}^q b^{(j)} u^{(j)} - \sum_{i=1}^n x^{(i)}\log\sum_{j=1}^q m(i,j)\, u^{(j)}.$$
Define $v \in (0,+\infty)^q$ by
$$v^{(j)} = u^{(j)}\left(\frac{\sum_{i=1}^n m(i,j)\, x^{(i)}/\alpha^{(i)}}{b^{(j)}}\right)$$
with $\alpha^{(i)} = \sum_{k=1}^q m(i,k)\, u^{(k)}$. Then $F(v) \leq F(u)$. Moreover, $v = u$ if and only if $u$ minimizes $F$ subject to $u^{(j)} \geq 0$, $j = 1, \ldots, q$.
Proof Introduce a variable $\beta^{(j)} > 0$ for $j = 1, \ldots, q$ and let $w^{(j)} = u^{(j)}\beta^{(j)}$. Then
$$\begin{aligned}
F(w) &= \sum_{j=1}^q b^{(j)} u^{(j)}\beta^{(j)} - \sum_{i=1}^n x^{(i)}\log\sum_{j=1}^q m(i,j)\, u^{(j)}\beta^{(j)}\\
&= \sum_{j=1}^q b^{(j)} u^{(j)}\beta^{(j)} - \sum_{i=1}^n x^{(i)}\log\frac{\sum_{j=1}^q m(i,j)\, u^{(j)}\beta^{(j)}}{\sum_{j=1}^q m(i,j)\, u^{(j)}} - \sum_{i=1}^n x^{(i)}\log\sum_{j=1}^q m(i,j)\, u^{(j)}.
\end{aligned}$$
Let $\rho(i,j) = m(i,j)\, u^{(j)}/\alpha^{(i)}$. Since the logarithm is concave, we have
$$\log\sum_{j=1}^q \rho(i,j)\beta^{(j)} \geq \sum_{j=1}^q \rho(i,j)\log\beta^{(j)},$$
so that
$$F(w) \leq \sum_{j=1}^q b^{(j)} u^{(j)}\beta^{(j)} - \sum_{i=1}^n\sum_{j=1}^q x^{(i)}\rho(i,j)\log\beta^{(j)} - \sum_{i=1}^n x^{(i)}\log\sum_{j=1}^q m(i,j)\, u^{(j)}.$$

The upper bound with $\beta^{(j)} \equiv 1$ is equal to $F(u)$, so minimizing this expression in $\beta$ will give $F(w) \leq F(u)$. This minimization is straightforward and gives
$$\beta^{(j)} = \frac{\sum_{i=1}^n x^{(i)}\rho(i,j)}{b^{(j)} u^{(j)}} = \frac{\sum_{i=1}^n m(i,j)\, x^{(i)}/\alpha^{(i)}}{b^{(j)}},$$
and the optimal $w$ is the vector $v$ provided in the lemma. Finally, one checks that $v = u$ if and only if $u$ satisfies the KKT conditions for the considered problem. □

We can now apply this lemma to derive update rules for $Y$ and $A$, where the objective is
$$\sum_{k=1}^N\sum_{i=1}^d \sum_{j=1}^p y_j^{(i)} a_k^{(j)} - \sum_{k=1}^N\sum_{i=1}^d x_k^{(i)}\log\sum_{j=1}^p y_j^{(i)} a_k^{(j)}.$$
Starting with the minimization in $A$, we apply the lemma to each index $k$ separately, taking $n = d$ and $q = p$, with $b^{(j)} = \sum_{i=1}^d y_j^{(i)}$ and $m(i,j) = y_j^{(i)}$. Then the update is
$$a_k^{(j)} \mapsto a_k^{(j)}\,\frac{\sum_{i=1}^d x_k^{(i)} y_j^{(i)}/\alpha_k^{(i)}}{\sum_{i=1}^d y_j^{(i)}}$$
with $\alpha_k^{(i)} = \sum_{j=1}^p y_j^{(i)} a_k^{(j)}$.

For $Y$, we can work with fixed $i$ and apply the lemma with $n = N$, $q = p$, $b^{(j)} = \sum_{k=1}^N a_k^{(j)}$ and $m(k,j) = a_k^{(j)}$. This gives the update
$$y_j^{(i)} \mapsto y_j^{(i)}\,\frac{\sum_{k=1}^N x_k^{(i)} a_k^{(j)}/\alpha_k^{(i)}}{\sum_{k=1}^N a_k^{(j)}},$$
still with $\alpha_k^{(i)} = \sum_{j=1}^p y_j^{(i)} a_k^{(j)}$.

We summarize this in our second algorithm for NMF.

Algorithm 21.5 (NMF, logarithmic cost)

1. Fix $p > 0$ and let $\mathcal X$ be the $N$ by $d$ matrix containing the observed data.
2. Initialize the procedure with matrices $A$ and $Y$, respectively of size $N$ by $p$ and $d$ by $p$, with positive coefficients.
3. At a given stage of the algorithm, let $A$ and $Y$ be the current matrices decomposing $\mathcal X$.
4. Let $\tilde A$ be the matrix with coefficients
$$\tilde a_k^{(j)} = a_k^{(j)}\,\frac{\sum_{i=1}^d x_k^{(i)} y_j^{(i)}/\alpha_k^{(i)}}{\sum_{i=1}^d y_j^{(i)}}$$
with $\alpha_k^{(i)} = \sum_{j=1}^p y_j^{(i)} a_k^{(j)}$.
5. Let $\tilde Y$ be the matrix with coefficients
$$\tilde y_j^{(i)} = y_j^{(i)}\,\frac{\sum_{k=1}^N x_k^{(i)}\tilde a_k^{(j)}/\tilde\alpha_k^{(i)}}{\sum_{k=1}^N \tilde a_k^{(j)}}$$
with $\tilde\alpha_k^{(i)} = \sum_{j=1}^p y_j^{(i)}\tilde a_k^{(j)}$.
6. Replace $A$ by $\tilde A$ and $Y$ by $\tilde Y$, iterating until numerical convergence.

21.8 Variational Autoencoders

Variational autoencoders, which were described in section 19.2.2, can be interpreted as a non-linear factor model in which $X = g(\theta, Y) + \epsilon$, where $\epsilon$ is a centered Gaussian noise with covariance matrix $Q$ and $Y \in \mathbb R^p$ has a known probability distribution, such as $Y \sim \mathcal N(0, \mathrm{Id}_{\mathbb R^p})$. In this framework, the conditional distribution of $Y$ given $X = x$ was approximated by a Gaussian distribution with mean $\mu(x, w)$ and covariance matrix $S(x, w)^2$. The implementations in Kingma and Welling [104, 105] use neural networks for the three functions $g$, $\mu$ and $S$.

21.9 Bayesian factor analysis and Poisson point processes

21.9.1 A feature selection model

The expectation in many factor models is that individual observations are obtained by mixing pure categories, or topics, and represented as a weighted sum or linear combination of a small number of uncorrelated or independent variables. Denote by $p$ the number of possible categories, which, in this section, can be assumed to be quite large.

We will assume that each observation randomly selects a small number among
these categories before combining them. Let us consider (as an example) the follow-
ing model.

• The observations $X_1, \ldots, X_N$ take the form of a probabilistic ICA model
$$X_k = \sum_{j=1}^p a_k(j)\, b_k(j)\, Y^{(j)} + \sigma R_k,$$
where:
• $R_k$ follows a standard Gaussian distribution,
• $a_k(1), \ldots, a_k(p)$ are independent with $a_k(j) \sim \mathcal N(m_j, \tau_j^2)$,
• $b_k(1), \ldots, b_k(p)$ are independent, with $b_k(j)$ following a Bernoulli distribution with parameter $\pi_j$,
• $Y^{(1)}, \ldots, Y^{(p)}$ are independent standard Gaussian random variables,
• $\sigma^2$ follows an inverse gamma distribution with parameters $\alpha_0, \beta_0$,
• $\tau_1^2, \ldots, \tau_p^2$ follow independent inverse gamma distributions with parameters $\alpha_1, \beta_1$,
• each $m_j$ follows a Gaussian $\mathcal N(0, \rho^2)$, and
• each $\pi_j$ follows a beta distribution with parameters $(u, v)$.

The priors are, as usual, chosen so that the computation of posterior distributions is easy, i.e., they are conjugate priors. The observed data is therefore obtained by selecting components $Y^{(j)}$ with probability $\pi_j$, weighting them with Gaussian random coefficients, and adding them before introducing noise.

Let $n_j = \sum_{k=1}^N b_k(j)$. Ignoring constant factors, the joint likelihood of all variables together is proportional to
$$\begin{aligned}
L \propto\ & \sigma^{-Nd}\exp\Big(-\frac{1}{2\sigma^2}\sum_{k=1}^N \Big|X_k - \sum_{j=1}^p a_k(j)b_k(j)Y^{(j)}\Big|^2\Big)\\
&\times \prod_{j=1}^p \tau_j^{-N}\exp\Big(-\frac{1}{2\tau_j^2}\sum_{k=1}^N (a_k(j)-m_j)^2\Big)\ \exp\Big(-\frac{1}{2\rho^2}\sum_{j=1}^p m_j^2\Big)\\
&\times \prod_{j=1}^p \pi_j^{n_j}(1-\pi_j)^{N-n_j}\ \prod_{j=1}^p (\tau_j^2)^{-\alpha_1-1}\exp(-\beta_1/\tau_j^2)\\
&\times (\sigma^2)^{-\alpha_0-1}\exp(-\beta_0/\sigma^2)\ \prod_{j=1}^p \pi_j^{u-1}(1-\pi_j)^{v-1}\ \exp\Big(-\frac12\sum_{i=1}^p |Y^{(i)}|^2\Big).
\end{aligned}$$

In spite of the complexity of this expression, it is relatively straightforward (by


considering each variable in isolation) to see that

• The conditional distribution of $\sigma^2, \tau_1^2, \ldots, \tau_p^2$ given all other variables remains a product of inverse gamma distributions.
• The conditional distribution of $Y^{(1)}, \ldots, Y^{(p)}$ given the other variables is Gaussian.
• The conditional distribution of $\pi_1, \ldots, \pi_p$ given the other variables is a product of beta distributions.
• The conditional distribution of $m_1, \ldots, m_p$ given the other variables remains that of independent Gaussian variables.
• The posterior distribution of $a_1, \ldots, a_N$ (considered as $p$-dimensional vectors) given the other variables is a product of independent Gaussians (but the components $a_k(j)$, $j = 1, \ldots, p$, are correlated).
• For the posterior distribution given the other variables, $b_1, \ldots, b_N$ (considered as $p$-dimensional vectors) are independent. The components of each $b_k$ are not independent, but each $b_k(j)$, being a binary variable, follows a Bernoulli distribution given the other ones.

These remarks provide the basis of a Gibbs sampling algorithm for the simulation of the posterior distribution of all unobserved variables (the computation of the parameters of each of the conditional distributions above requires some work, of course, and these details are left to the reader). This simulation does not explicitly provide a matrix factorization of the data (in the sense of a single matrix $A$ such that $X = AY$, as considered in the previous section), but a probability distribution on such matrices, expressed as $A(k,j) = a_k(j)\, b_k(j)$. One can however use the average of the matrices obtained through the simulation for this purpose. Additional information can be obtained through this simulation. For example, the expectation of $b_k(j)$ provides a measure of proximity of observation $k$ to category $j$.

21.9.2 Non-negative and count variables

Poisson factor analysis. Many variations can be made on the previous construction. When the observations are non-negative, for example, an additive Gaussian noise may not be well adapted. Alternative models should describe the conditional distribution of $X$ given $a$, $b$ and $Y$ as a distribution over non-negative numbers with mean $\sum_{j=1}^p a(j)\, b(j)\, Y^{(j)}$ (for example a gamma distribution with appropriate parameters). The posterior sampling is generally more challenging in this case, because simple conjugate priors are not always available.

An important special case is when $X$ is a count variable taking values in the set of non-negative integers. In this case (starting with a model without feature selection), modeling $X$ as a Poisson variable with mean $a(1)Y^{(1)} + \cdots + a(p)Y^{(p)}$ leads to tractable computations, once it is noticed that $X$ can be seen as a sum of random variables $Z^{[1]}, \ldots, Z^{[p]}$, where $Z^{[i]}$ follows a Poisson distribution with parameter $a(i)Y^{(i)}$. This suggests introducing new latent variables $(Z^{[1]}, \ldots, Z^{[p]})$, which are not observed but follow, conditionally to their sum, which is $X$ and is observed, a multinomial distribution with parameters $X, q_1, \ldots, q_p$, with $q_i = a(i)Y^{(i)}\big/\sum_{j=1}^p a(j)Y^{(j)}$.

This provides what is referred to as Poisson factor analysis (PFA). As an example, consider a Bayesian approach where, for the prior distribution, $a(1), \ldots, a(p)$ are independent and follow a gamma distribution with parameters $\alpha_0$ and $\beta_0$, and $Y^{(1)}, \ldots, Y^{(p)}$ are independent, exponentially distributed with parameter 1. The joint likelihood of all data then is (up to constant factors)
$$L \propto \exp\Big(-\sum_{k=1}^N \big(a_k(1)Y^{(1)} + \cdots + a_k(p)Y^{(p)}\big)\Big)\ \prod_{k=1}^N\prod_{i=1}^p \frac{\big(a_k(i)Y^{(i)}\big)^{z_k^{[i]}}}{z_k^{[i]}!}\ \Big(\prod_{k=1}^N\prod_{i=1}^p a_k(i)^{\alpha_0-1}\Big)\exp\Big(-\beta_0\sum_{k=1}^N\sum_{i=1}^p a_k(i)\Big)\exp\Big(-\sum_{i=1}^p Y^{(i)}\Big).$$
This is the GaP (for Gamma-Poisson) model introduced in Canny [49]. The conditional distributions of the variables $(a_k(i))$ given $(Z_k^{[i]})$ and $(Y^{(i)})$ are independent and gamma-distributed, and so are the $(Y^{(i)})$ given the other variables. Finally, for each $k$, the family $(Z_k^{[1]}, \ldots, Z_k^{[p]})$ follows a multinomial distribution conditionally to its sum, $X_k$, and the rest of the variables, and these variables are conditionally independent across $k$.

GaP with feature selection. One can include a feature selection step in this model by introducing binary variables $b_k(1), \ldots, b_k(p)$, with selection probabilities $\pi_1, \ldots, \pi_p$ and a $\mathrm{Beta}(u,v)$ prior distribution on each $\pi_i$. Doing so, the likelihood of the extended model is
$$\begin{aligned}
L \propto\ & \exp\Big(-\sum_{k=1}^N \big(a_k(1)b_k(1)Y^{(1)} + \cdots + a_k(p)b_k(p)Y^{(p)}\big)\Big)\ \prod_{k=1}^N\prod_{i=1}^p \frac{\big(a_k(i)b_k(i)Y^{(i)}\big)^{z_k^{[i]}}}{z_k^{[i]}!}\\
&\times \Big(\prod_{k=1}^N\prod_{i=1}^p a_k(i)^{\alpha_0-1}\Big)\exp\Big(-\beta_0\sum_{k=1}^N\sum_{i=1}^p a_k(i)\Big)\exp\Big(-\sum_{i=1}^p Y^{(i)}\Big)\\
&\times \prod_{j=1}^p \pi_j^{n_j}(1-\pi_j)^{N-n_j}\ \prod_{j=1}^p \pi_j^{u-1}(1-\pi_j)^{v-1},
\end{aligned}$$
where, as before, $n_j = \sum_{k=1}^N b_k(j)$. The conditional distribution of $\pi_1, \ldots, \pi_p$ given the other variables is therefore still that of a family of independent beta-distributed variables. The binary variables $b_k(1), \ldots, b_k(p)$ are also conditionally independent given the other variables, with $b_k(i) = 1$ with probability one if $z_k^{[i]} > 0$, and with probability $\pi_i e^{-a_k(i)Y^{(i)}}\big/\big(\pi_i e^{-a_k(i)Y^{(i)}} + 1 - \pi_i\big)$ if $z_k^{[i]} = 0$.

21.9.3 Feature assignment model

The previous models assumed that p features were available, modeled as p random
variables with some prior distribution, and that each observation picks a subset of
them, drawing feature j with probability πj . We denoted by bk (j) the binary variable
indicating whether feature j was selected for observation k, and nj was the number
of times that feature was selected. Finally, we modeled πj as a beta variable with
parameters u and v.

One can compute, using this model, the probability distribution of the feature selection variables $b = (b_k(j),\ j = 1, \ldots, p,\ k = 1, \ldots, N)$. From the model definition, the probability of observing such a configuration is given by
$$\begin{aligned}
Q(b) &= \frac{\Gamma(u+v)^p}{\Gamma(u)^p\Gamma(v)^p}\int \prod_{j=1}^p \pi_j^{n_j+u-1}(1-\pi_j)^{N-n_j+v-1}\, d\pi_1\ldots d\pi_p\\
&= \prod_{j=1}^p \frac{\Gamma(u+v)\Gamma(u+n_j)\Gamma(v+N-n_j)}{\Gamma(u)\Gamma(v)\Gamma(u+v+N)}\\
&= \prod_{j=1}^p \frac{u(u+1)\cdots(u+n_j-1)\ v(v+1)\cdots(v+N-n_j-1)}{(u+v)(u+v+1)\cdots(u+v+N-1)}.
\end{aligned}$$

Denote by $n_{jk} = \sum_{l=1}^{k-1} b_l(j)$ the number of observations with index less than $k$ that pick feature $j$. Using this notation, and the fact that
$$u(u+1)\cdots(u+n_j-1) = \prod_{k=1}^N (u+n_{jk})^{b_k(j)}$$
together with a similar identity for $v(v+1)\cdots(v+N-n_j-1)$, we can write
$$Q(b) = \prod_{j=1}^p\prod_{k=1}^N \frac{(u+n_{jk})^{b_k(j)}\,(v+k-1-n_{jk})^{1-b_k(j)}}{u+v+k-1}
= \prod_{k=1}^N\prod_{j=1}^p \left(\frac{u+n_{jk}}{u+v+k-1}\right)^{b_k(j)}\left(\frac{v+k-1-n_{jk}}{u+v+k-1}\right)^{1-b_k(j)}.$$

Using this last equation, we can interpret the probability $Q$ as resulting from a progressive feature assignment process. The first observation, $k = 1$, for which $n_{j1} = 0$ for all $j$, chooses each feature with probability $u/(u+v)$. When reaching observation $k$, feature $j$ is chosen with probability $(u+n_{jk})/(u+v+k-1)$. At all steps, features are chosen independently from each other.

Let $F_k$ be the set of features assigned to observation $k$, i.e., $F_k = \{j : b_k(j) = 1\}$, and let
$$G_k = F_k \setminus \bigcup_{l=1}^{k-1} F_l$$
be the set of features used in observation $k$ but in no previous observation. Let $C_k = F_k \setminus G_k$ and $U_k = G_1 \cup \cdots \cup G_{k-1}$. Instead of considering configurations $b = (b_k(j),\ j = 1, \ldots, p,\ k = 1, \ldots, N)$, we may alternatively consider the family of sets $S = (G_k, C_k, 1 \leq k \leq N)$. Such a family must satisfy the property that the sets $G_k$ and $C_k$ are non-intersecting, $C_k \subset U_k$ and $G_l \cap G_k = \emptyset$ for $l < k$. It provides a unique configuration $b$ by letting $b_k(j) = 1$ if and only if $j \in G_k \cup C_k$. We will let, in the following, $q_k = |G_k|$ and $p_k = |U_k|$. The probability $Q(b)$ can be re-expressed in terms of $S$, letting (with some abuse of notation)
$$Q(S) = \prod_{k=1}^N \left(\frac{u}{u+v+k-1}\right)^{q_k}\left(\frac{v+k-1}{u+v+k-1}\right)^{p-p_{k+1}}\prod_{j\in U_k}\left(\frac{u+n_{jk}}{u+v+k-1}\right)^{1_{j\in C_k}}\left(\frac{v+k-1-n_{jk}}{u+v+k-1}\right)^{1_{j\notin C_k}}.$$

Let $S^k = (G_l, C_l, l \leq k)$. Then the expression of $Q$ shows that, conditionally to $S^{k-1}$, $G_k$ and $C_k$ are independent. Elements of $C_k$ are chosen independently for each feature $j \in U_k$ with probability $(u+n_{jk})/(u+v+k-1)$. Moreover, the conditional distribution of $q_k$ given $S^{k-1}$ is proportional to
$$\left(\frac{u}{u+v+k-1}\right)^{q_k}\left(\frac{v+k-1}{u+v+k-1}\right)^{p-p_k-q_k},$$
i.e., it is a binomial distribution with parameters $p - p_k$ and $u/(u+v+k-1)$. Finally, given $S^{k-1}$ and $q_k$, the distribution of $G_k$ is uniform among all $\binom{p-p_k}{q_k}$ subsets of
$$\{1, \ldots, p\} \setminus (G_1 \cup \cdots \cup G_{k-1})$$
with cardinality $q_k$.

If there is no special meaning in the feature label, which is the case in our discus-
sion of prior models in which all features are sampled independently with the same
distribution, we may identify configurations that can be deduced from each other by
relabeling (note that relabeling features does not change the value of Q).

Call a configuration normal if Gk = {pk + 1, . . . , pk+1 }. Given S, it is always pos-


sible to relabel the features with a permutation σ so that, for each k, σ (Gk ) = {pk +
1, . . . , pk+1 }. There are, in fact, q1 ! . . . qN ! such permutations. We can complete the pro-
cess generating S by adding at the end a transformation into a normal configuration

(picking uniformly at random one of the possible ones). The probability of a normal configuration $S$ obtained through this process is (using a simple counting argument)
$$Q(S) = \prod_{k=1}^N \binom{p-p_k}{q_k}\left(\frac{u}{u+v+k-1}\right)^{q_k}\left(\frac{v+k-1}{u+v+k-1}\right)^{p-p_{k+1}}\prod_{j\in U_k}\left(\frac{u+n_{jk}}{u+v+k-1}\right)^{1_{j\in C_k}}\left(\frac{v+k-1-n_{jk}}{u+v+k-1}\right)^{1_{j\notin C_k}}.$$

This provides a new incremental procedure that directly samples normalized as-
signments. First let q1 follow a binomial distribution bin(p, u/(u + v)) and assign the
first observation to features 1 to q1 . Assume that pk labels have been created before
step k. Then select for observation k some of the already labeled features, label j
being selected with probability (u + njk )/(u + v + k − 1) as above. Finally, add qk new
features where qk follows a binomial distribution bin(p − pk , u/(u + v + k − 1)).

This discussion is clearly reminiscent of the one that was made in section 20.7.3 leading to the Polya urn process, and we want here also to let $p$ tend to infinity (with fixed $N$) with proper choices of $u$ and $v$ as functions of $p$ in the expression above. Choose two positive numbers $c$ and $\gamma$ and let $u = c\gamma/p$ and $v = c - u$. Note that, with the incremental simulation process just described, the conditional expectation of the next number of labels, $p_{k+1}$, given the current one, $p_k$, is
$$E(p_{k+1} \mid p_k) = \frac{(p-p_k)u}{u+v+k-1} + p_k = \frac{c\gamma}{c+k-1} + \Big(1 - \frac{c\gamma}{p(c+k-1)}\Big)p_k \leq \frac{c\gamma}{c+k-1} + p_k.$$

Taking expectations on both sides, we get
$$E(p_{k+1}) \leq \sum_{l=1}^k \frac{c\gamma}{c+l-1} \leq \sum_{l=1}^N \frac{c\gamma}{c+l-1},$$
so that this expectation is bounded independently of $k$. This shows in particular that $p_k/p$ tends to 0 in probability (just applying Markov's inequality) and that the binomial distribution $\mathrm{bin}(p-p_k, u/(u+v+k-1))$ can be approximated by a Poisson distribution with parameter $c\gamma/(c+k-1)$.

So, when $p \to \infty$, we obtain the following incremental simulation process for the feature labels, which we combine with the actual simulation of the features, assumed to follow a prior distribution with p.d.f. $\psi$. This process is called the Indian buffet process in the literature, the analogy being that a buffet offers an infinite variety of dishes, and each observation is a customer who tastes a finite number of them.

Algorithm 21.6 (Indian buffet process)

1. Initialization:
(i) Sample an integer $q_1$ according to a Poisson distribution with parameter $\gamma$.
(ii) Sample features $y^{(1)}, \ldots, y^{(q_1)}$ according to $\psi$.
(iii) Assign these features to observation 1, let $p_2 = q_1$, and let $n_{2,j} = 1$ for $j = 1, \ldots, q_1$.

2. Assume that observations 1 to $k-1$ have been obtained, with $p_k$ features $y^{(1)}, \ldots, y^{(p_k)}$, such that the $j$th feature has been chosen $n_{k,j}$ times.
(i) For $j = 1, \ldots, p_k$, assign feature $j$ to observation $k$ with probability $\frac{n_{k,j}}{c+k-1}$. If $j$ is selected, let $n_{k+1,j} = n_{k,j} + 1$, otherwise let $n_{k+1,j} = n_{k,j}$.
(ii) Sample an integer $q_k$ according to a Poisson distribution with parameter $\frac{c\gamma}{c+k-1}$ and let $p_{k+1} = p_k + q_k$.
(iii) Sample features $y^{(p_k+1)}, \ldots, y^{(p_{k+1})}$ according to $\psi$.
(iv) Assign these features to observation $k$, and let $n_{k+1,j} = 1$ for $j = p_k+1, \ldots, p_{k+1}$.
3. If $k = N$, stop; otherwise replace $k$ by $k+1$ and return to Step 2.
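The process is straightforward to simulate; the sketch below (Python, our own function names; `sample_feature` draws one feature from $\psi$) returns the sampled features and the binary assignment matrix:

```python
import numpy as np

def indian_buffet(N, c, gamma, sample_feature, rng):
    """Samples N observations from the Indian buffet process (Algorithm 21.6)."""
    features, counts, rows = [], [], []
    for k in range(1, N + 1):
        row = []
        for j in range(len(features)):            # revisit existing "dishes"
            take = rng.uniform() < counts[j] / (c + k - 1)
            counts[j] += int(take)
            row.append(int(take))
        q = rng.poisson(c * gamma / (c + k - 1))  # number of new features
        for _ in range(q):
            features.append(sample_feature(rng))
            counts.append(1)
            row.append(1)
        rows.append(row)
    P = len(features)
    B = np.array([r + [0] * (P - len(r)) for r in rows])
    return features, B
```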

21.10 Point processes and random measures

This section assumes that the reader is familiar with measure theory. It can however safely
be skipped as it is not reused in the rest of the book.

21.10.1 Poisson processes

If $Z$ is a set, we will denote by $\mathcal P_c(Z)$ the set composed of all finite or countable subsets of $Z$. A point process over $Z$ is a random variable $S : \Omega \to \mathcal P_c(Z)$, i.e., a variable that provides a countable random subset of $Z$. If $B \subset Z$, one can then define the counting function $\nu_S(B) = |S \cap B| \in \mathbb N \cup \{+\infty\}$.

A proper definition of such point processes requires some measure theory. Equip $Z$ with a $\sigma$-algebra $\mathcal A$ and consider the set $\mathcal N_0$ of integer-valued measures $\mu$ on $(Z, \mathcal A)$ such that $\mu(Z) < \infty$. Let $\mathcal N$ be the set formed with all countable sums of measures in $\mathcal N_0$. Then a general point process is a mapping $\nu : \Omega \to \mathcal N$ such that, for all $k \in \mathbb N \cup \{+\infty\}$ and all $B \in \mathcal A$, the event $\{\nu(B) = k\}$ is measurable. Recall that, for each $B \in \mathcal A$, $\nu(B)$ is itself a random variable, which we may denote $\omega \mapsto \nu_\omega(B)$. One then defines the intensity of the process as the function $\mu : B \mapsto E(\nu(B))$.

The following proposition provides an important identity satisfied by such mod-


els.

Theorem 21.16 (Campbell identity) Let $\nu$ be a point process with intensity $\mu$. For $\omega \in \Omega$, let $X_\omega : \Omega' \to Z$ be a random variable with distribution $\nu_\omega$ (defined, if needed, on a different probability space $(\Omega', P')$). Then, for any $\mu$-integrable function $f$:
$$E(f(X)) = \int_Z f(z)\, d\mu(z). \qquad (21.22)$$
Here, the expectation of $f(X)$ is over both spaces $\Omega$ and $\Omega'$ and corresponds to the average of $f$. The identity is an immediate consequence of Fubini's theorem.

We will be mainly interested in the family of Poisson point processes. These processes are themselves parametrized by a measure, say $\mu$, on $Z$ such that $\mu$ is $\sigma$-finite and $\mu(B) = 0$ if $B$ is a singleton. A Poisson process with intensity measure $\mu$ is a point process $\nu$ such that:

(i) If $B_1, \ldots, B_n$ are pairwise non-intersecting, then $\nu(B_1), \ldots, \nu(B_n)$ are mutually independent.
(ii) For all $B$, $\nu(B) \sim \mathrm{Poisson}(\mu(B))$.

We take the convention that $\nu(B) = 0$ (resp. $= \infty$) almost surely if $\mu(B) = 0$ (resp. $= \infty$). Note that property (i) also implies that if $g_1, \ldots, g_n$ are measurable functions from $Z$ to $(0,+\infty)$ such that $g_i g_j = 0$ for $i \neq j$, then the variables $\nu(g_i) = \int_Z g_i(z)\, d\nu(z)$ are independent.

If $\mu(Z) < \infty$ (i.e., $\mu$ is finite), one can represent the distribution of a Poisson point process as follows:
$$\nu = \sum_{k=1}^{\nu(Z)} \delta_{X_k}$$
with $\nu(Z) \sim \mathrm{Poisson}(\mu(Z))$ and, conditionally to $\nu(Z) = N$, $X_1, \ldots, X_N$ i.i.d. following the probability distribution $\bar\mu = \mu/\mu(Z)$. This measure can also be identified with the random set $S = \{X_1, \ldots, X_{\nu(Z)}\}$. The assumption that $\mu(\{z\}) = 0$ for any singleton implies that $\nu(\{z\}) = 0$ almost surely. It also ensures that the points $X_1, \ldots, X_N$ are distinct with probability one.

If $\mu$ is $\sigma$-finite, then (by definition) it is a countable sum of finite measures $\mu_1, \mu_2, \ldots$. Then $\nu$ can be generated as the sum of independent processes $\nu_1, \nu_2, \ldots$, where $\nu_i$ is a Poisson process with intensity $\mu_i$. It can moreover be identified with the countable random set $S = \bigcup_{i=1}^\infty S_i$, where $S_i$ is the random set associated with $\nu_i$. Note that, in this construction, one can always assume that the measures $\mu_1, \mu_2, \ldots$ are mutually singular (i.e., $\mu_i(B) > 0$ for some $i$ implies that $\mu_j(B) = 0$ for $j \neq i$).

If we consider a Poisson process on $(0,+\infty)\times Z$, we can define weighted random measures. Indeed, such a point process takes values in the collection of all sets of the form $\{(w_k, z_k), k \in I\}$, where $I$ is finite or countable. These subsets can be represented as sums of weighted Dirac masses,
$$\xi = \sum_{k\in I} w_k \delta_{z_k}.$$
To ensure that the points $(z_k, k \in I)$ generated by this process are all different, we need to assume that the intensity $\mu$ of this point process is such that $\mu((0,+\infty)\times\{z\}) = 0$ for all $z \in Z$. We will refer to $\xi$ as a weighted Poisson process.

In the following, we will consider this class of random measures, with the small addition of allowing for an extra term including a measure supported by a fixed set. More precisely, given a (deterministic) countable subset $I \subset Z$, a family of independent random variables $(\rho_z, z \in I)$ and a $\sigma$-finite measure $\mu_o$ such that $\mu_o((0,+\infty)\times\{z\}) = 0$ for all $z \in Z$, we can define the random measure
$$\xi = \xi_f + \xi_o,$$
where $\xi_o$ is a weighted Poisson process with intensity $\mu_o$, assumed independent of $(\rho_z, z \in I)$, and
$$\xi_f = \sum_{z\in I} \rho_z \delta_z.$$
The subscripts $o$ and $f$ come from the terminology introduced in Kingman [106], which studies "completely random measures," which are random measures that satisfy point (i) in the definition of a Poisson process. Under mild assumptions, such measures can be decomposed as the sum of a weighted Poisson process (here, $\xi_o$, the ordinary part), of a process with fixed support (here, $\xi_f$, the fixed part), and of a deterministic measure (which is here taken to be 0).

Let us rapidly check that $\xi$ satisfies property (i). Let $B_1, \ldots, B_n$ be non-overlapping elements of $\mathcal A$. Set $g_i(w,z) = w\, 1_{B_i}(z)$. Then
$$\xi(B_i) = \xi_f(B_i) + \nu_o(g_i),$$
where $\nu_o$ is a Poisson process with intensity $\mu_o$. Since the sets do not overlap, the variables $(\xi_f(B_i), i = 1, \ldots, n)$ are independent, and so are $(\nu_o(g_i), i = 1, \ldots, n)$, since $g_i g_j = 0$ for $i \neq j$. Since $\xi_f$ and $\nu_o$ are, in addition, independent, we see that $(\xi(B_i), i = 1, \ldots, n)$ are independent.

The intensity measure of such a process is still defined by
$$\eta(B) = E(\xi(B)) = \sum_{z\in I\cap B} E(\rho_z) + \int_{(0,+\infty)\times B} w\, d\mu_o(w,z),$$
where the last term is an application of Campbell's identity to the Poisson process $\nu_o$ and the function $g(w,z) = w\, 1_B(z)$.

21.10.2 The gamma process

The main example of such processes in factor analysis is the beta process, which will be discussed in the next section. We start, however, with a first example that is closely related to the Dirichlet process, called the gamma process.

In this process, one fixes a finite measure $\pi_0$ on $Z$ and defines $\mu$ on $(0,+\infty)\times Z$ by
$$\mu(dw, dz) = c\, w^{-1} e^{-cw}\, \pi_0(dz)\, dw.$$
Because $\mu$ is $\sigma$-finite but not finite (the integral over $w$ diverges at $w = 0$), every realization of $\xi$ is an infinite sum
$$\xi = \sum_{k=1}^\infty w_k \delta_{z_k}.$$
The intensity measure of $\xi$ is
$$\eta(B) = c\,\pi_0(B)\int_0^{+\infty} e^{-cw}\, dw = \pi_0(B).$$
In particular,
$$E\Big(\sum_{k=1}^\infty w_k\Big) = \eta(Z) = \pi_0(Z) < \infty,$$
so that $\sum_{k=1}^\infty w_k < \infty$ almost surely.

For fixed $B$, the variable $\xi(B)$ follows a gamma distribution. This can be proved by computing the Laplace transform of $\xi(B)$, $E(e^{-t\xi(B)})$, and identifying it with that of a gamma distribution. To make this computation, consider the point process $\nu_J$ restricted to weights in an interval $J \subset (0,+\infty)$ with $\min(J) > 0$, and $\xi_J$ the corresponding weighted process. Let $m_J(t) = \int_J w^{-1} c\, e^{-(c+t)w}\, dw$. Then a realization of $\nu_J$ can be obtained by first sampling $N$ from a Poisson distribution with parameter $\mu(J\times Z) = m_J(0)\pi_0(Z)$ and then sampling $N$ points $(w_i, z_i)$ independently from the distribution $\mu/(m_J(0)\pi_0(Z))$ restricted to $J\times Z$. This implies that
$$\begin{aligned}
E\big(e^{-t\xi_J(B)}\big) &= e^{-m_J(0)\pi_0(Z)}\sum_{n=0}^\infty \frac{(m_J(0)\pi_0(Z))^n}{n!}\left(\frac{\int_{J\times Z} e^{-tw 1_B(z)}\, w^{-1} c\, e^{-cw}\, dw\, d\pi_0(z)}{m_J(0)\pi_0(Z)}\right)^n\\
&= \sum_{n=0}^\infty \frac{e^{-m_J(0)\pi_0(Z)}}{n!}\big(\pi_0(B)\, m_J(t) + (\pi_0(Z)-\pi_0(B))\, m_J(0)\big)^n\\
&= e^{\pi_0(B)(m_J(t)-m_J(0))}.
\end{aligned}$$

Now,
$$m_J(t) - m_J(0) = c\int_J e^{-cw}\,\frac{e^{-tw}-1}{w}\, dw$$
is finite even when $J = (0,+\infty)$. With a little more work justifying the passage to the limit, one finds that, for $J = (0,+\infty)$,
$$E\big(e^{-t\xi_J(B)}\big) = \exp\left(\pi_0(B)\, c\int_0^{+\infty} e^{-cw}\,\frac{e^{-tw}-1}{w}\, dw\right).$$
Finally, write
$$c\int_0^{+\infty} e^{-cw}\,\frac{e^{-tw}-1}{w}\, dw = -c\int_0^{+\infty} e^{-cw}\int_0^t e^{-sw}\, ds\, dw = -c\int_0^t\int_0^{+\infty} e^{-(s+c)w}\, dw\, ds = -\int_0^t \frac{c}{s+c}\, ds = -c\log\Big(1+\frac tc\Big).$$
This shows that
$$E\big(e^{-t\xi_J(B)}\big) = \Big(1+\frac tc\Big)^{-c\pi_0(B)},$$
which is the Laplace transform of a gamma distribution with parameters $c\pi_0(B)$ and $c$, i.e., with density proportional to $w^{c\pi_0(B)-1} e^{-cw}$.

As a consequence, the normalized process $\delta = \xi/\xi(Z)$ is a Dirichlet process with intensity $c\pi_0$. Indeed, if $B_1, \ldots, B_n$ is a partition of $Z$, the family $(\delta(B_1), \ldots, \delta(B_n))$ is the ratio of $n$ independent gamma variables to their sum, which yields a Dirichlet distribution, and this property characterizes Dirichlet processes.

21.10.3 The beta process

The definition of the beta process parallels that of the gamma process, with weights this time taking values in $(0,1)$. Fix again a finite measure $\pi_0$ on $Z$ and let $\mu_o$ on $(0,+\infty)\times Z$ be defined by
$$\mu_o(dw, dz) = c\, w^{-1}(1-w)^{c-1}\, \pi_0(dz)\, dw.$$
The associated weighted Poisson process can therefore be represented as a sum
$$\xi_o = \sum_{k=1}^\infty w_k \delta_{z_k},$$
and its intensity measure is
$$\eta_o(B) = c\,\pi_0(B)\int_0^1 (1-w)^{c-1}\, dw = \pi_0(B).$$

In particular, since $\pi_0$ is finite, we have $\sum_{k=1}^\infty w_k < \infty$ almost surely. A beta process is the sum of the process $\xi_o$ and of a fixed-set process
$$\xi_f = \sum_{z\in I} w_z \delta_z,$$
where $I$ is a fixed finite set and $(w_z, z \in I)$ are independent and follow beta distributions with parameters $(a(z), b(z))$.

If $Z$ is a space of features, such a process provides a prior distribution on feature selections. It indeed provides, in addition to the deterministic set $I$, a random countable set $J \subset Z$, with a set of random weights $w_z$, $z \in F := I \cup J$. Given this, one defines the feature process as the selection of a subset $A \subset F$, where each feature $z$ is selected independently with probability $w_z$. Because $E(|A|) = \sum_{z\in F} w_z$ is finite, $A$ is finite with probability 1.

In the same way the Polya urn could be used to sample from a realization of a
Dirichlet process without actually sampling the whole process, there exists an algo-
rithm that samples a sequence of feature sets (A1 , . . . , An ) from this feature selection
process without needing the infinite collection of weights and features associated
with a beta process. We assume in the following that the prior process has an empty
fixed set. (Non-empty fixed sets will appear in the posterior.)

The first set of features, $A_1$, is obtained as follows, according to a Poisson process with intensity $\pi_0$: choose the number $N$ of features in $A_1$ according to a Poisson distribution with parameter $\pi_0(Z)$, then sample $N$ features independently according to the distribution $\pi_0/\pi_0(Z)$.

Now assume that $n$ sets of features $A_1, \ldots, A_n$ have been obtained and that we want to sample a new set $A_{n+1}$ conditionally to their observation. Let $J_n$ be the union of all random features obtained up to this point and $n(z)$, for $z \in J_n$, the number of times this feature was observed in $A_1, \ldots, A_n$. Then the conditional distribution of the beta process $\xi$ given this observation is still a beta process, with fixed set given by $I = J_n$, $(a(z), b(z)) = (n(z), c+n-n(z))$ for $z \in J_n$, and base measure $\pi_n = c\pi_0/(c+n)$. This implies that the next set $A_{n+1}$ can be obtained by sampling from the associated feature process. To do this, one first selects features $z \in J_n$ with probability $n(z)/(c+n)$, then selects additional features $z_1, \ldots, z_N$ independently with distribution $\pi_0/\pi_0(Z)$, where $N$ follows a Poisson distribution with parameter $c\pi_0(Z)/(c+n)$. This is the Indian buffet process described in Algorithm 21.6 (taking $\pi_0 = \gamma\psi$).

21.10.4 Beta process and feature selection

The beta process can be used as a prior for feature selection within a factor analysis model, as described in the previous paragraph. It is however easier to approximate it with a model with almost surely finite support. Indeed, letting, for $\epsilon > 0$,
$$\mu_\epsilon(dw, dz) = \frac{\Gamma(c+1)}{\Gamma(\epsilon+1)\Gamma(c-\epsilon)}\, w^{\epsilon-1}(1-w)^{c-\epsilon-1}\, \pi_0(dz)\, dw,$$
one obtains a finite measure, since
$$\int_0^{+\infty}\!\!\int_Z \mu_\epsilon(dw, dz) = \frac{c\gamma}{\epsilon},$$
where $\gamma = \pi_0(Z)$. Note that $\mu_\epsilon$ is normalized so that $E(\xi(B)) = \pi_0(B)$ for $B \subset Z$.

In this case, the prior generates features by first sampling their number, $p$, randomly according to a Poisson distribution with mean $c\gamma/\epsilon$, then selecting $p$ probabilities $w_1, \ldots, w_p$ independently using a beta distribution with parameters $\epsilon$ and $c-\epsilon$, and finally attaching to each $i$ a feature $z_i$ with distribution $\pi_0/\gamma$. The features associated with a given sample are then obtained by selecting each $z_i$ with probability $w_i$.

We note also that the model described in section 21.9.3 provides an approximation of this prior using a finite number of features. With our notation here, this corresponds to taking $p \gg 1$ and $\epsilon = c\gamma/p$.
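Sampling from this finite approximation is elementary; a hedged sketch in Python (function names are ours; `sample_feature` draws one feature from $\pi_0/\gamma$):

```python
import numpy as np

def finite_beta_process(c, gamma, eps, sample_feature, rng):
    """Finite approximation of the beta process prior: p ~ Poisson(c*gamma/eps),
    w_i ~ Beta(eps, c - eps), z_i ~ pi_0/gamma (drawn by sample_feature)."""
    p = rng.poisson(c * gamma / eps)
    w = rng.beta(eps, c - eps, size=p)
    z = [sample_feature(rng) for _ in range(p)]
    return w, z

def draw_feature_set(w, z, rng):
    """Feature set of one observation: keep z_i independently with prob. w_i."""
    keep = rng.uniform(size=len(w)) < w
    return [zi for zi, kept in zip(z, keep) if kept]
```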
Chapter 22

Data Visualization and Manifold Learning

22.1 Multidimensional scaling

The methods described in this chapter aim at representing a dataset in low dimension, allowing for its visual exploration by summarizing its structure in a user-accessible interface. Unlike factor analysis methods, they do not necessarily attempt to provide a causal model expressing the data as a function of a small number of sources, and generally do not provide a direct mechanism for adding new data to the representation. In addition, all these methods take as input similarity or dissimilarity matrices between data points and do not require, say, Euclidean coordinates.

Assuming that a dissimilarity matrix $D = (d_{kl}, k, l = 1, \ldots, N)$ is given, the goal of multidimensional scaling (or MDS) is to determine a low-dimensional Euclidean representation, say $y_1, \ldots, y_N \in \mathbb R^p$, such that $|y_k - y_l|^2 \simeq d_{kl}^2$. We review below two versions of this algorithm, referred to as "similarity" and "dissimilarity" matching.

22.1.1 Similarity matching (Euclidean case)

We start with the standard hypotheses of MDS, assuming that the distances $d_{kl}$ derive from a representation in feature space, so that $d_{kl}^2 = \|h_k - h_l\|_H^2$ for some inner-product space $H$ and (possibly unknown) features $h_1, \ldots, h_N$. Note that, since the Euclidean distance is invariant by translation, there is no loss of generality in assuming $h_1 + \cdots + h_N = 0$, which will be done in the following.

We look for a $p$-dimensional representation in the form $y_k = \Phi h_k$, where $\Phi$ is a linear transformation (and we want $y_k$ to be computable directly from dissimilarities, since we do not assume that $h_k$ is known). Since we are only interested in a transformation of $h_1, \ldots, h_N$, it suffices to define $\Phi$ on the vector space generated by them, so that we let
$$\Phi : \mathrm{span}(h_1, \ldots, h_N) \to \mathbb R^p,$$
and we want $\Phi$ to (approximately) conserve the norm, i.e., to be close to an isometry.

Because isometries are one-to-one and onto, the existence of an exact isometry would require $V = \mathrm{span}(h_1, \ldots, h_N)$ to be $p$-dimensional. The mapping $\Phi$ could then be defined as $\Phi(h) = (\langle h, e_1\rangle_H, \ldots, \langle h, e_p\rangle_H)$, where $e_1, \ldots, e_p$ is any orthonormal basis of $V$. In the general case, however, where $V$ is not $p$-dimensional, one can replace it by a best $p$-dimensional approximation of the training data, leading to a problem similar to PCA in feature space.

Indeed, as we have seen in section 21.1.2, this best approximation can be obtained by diagonalizing the Gram matrix $S$ of $h_1, \ldots, h_N$, which is such that $s_{kl} = \langle h_k, h_l\rangle_H$. (Recall that we assume that $\bar h = 0$, so we do not center the data here.) Using the notation of section 21.1.2, let $\alpha^{(1)}, \ldots, \alpha^{(p)}$ denote the eigenvectors associated with the $p$ largest eigenvalues, normalized so that $(\alpha^{(i)})^T S\alpha^{(i)} = 1$ for $i = 1, \ldots, p$. One can then take
$$e_i = \sum_{l=1}^N \alpha_l^{(i)} h_l$$
and, for $k = 1, \ldots, N$, $i = 1, \ldots, p$:
$$y_k^{(i)} = \lambda_i^2\, \alpha_k^{(i)}, \qquad (22.1)$$
where $\lambda_i^2$ is the eigenvalue associated with $\alpha^{(i)}$.

This does not entirely address the original problem, since the inner products $s_{kl}$ are not given, but only the distances $d_{kl}$, which satisfy
$$d_{kl}^2 = -2 s_{kl} + s_{kk} + s_{ll}. \qquad (22.2)$$
This provides a linear system of equations in the unknowns $s_{kl}$. This system is under-determined, because $D$ is invariant by any transformation $h_k \mapsto h_k + h_0$ (for a fixed $h_0$), while $S$ is not. However, the assumption $h_1 + \cdots + h_N = 0$ provides the additional constraint needed to obtain a unique solution.

Summing (22.2) over $l$, we then get
$$\sum_{l=1}^N d_{kl}^2 = N s_{kk} + \sum_{l=1}^N s_{ll}. \qquad (22.3)$$
Summing this equation over $k$, we find
$$\sum_{k,l=1}^N d_{kl}^2 = 2N\sum_{l=1}^N s_{ll}.$$

Using this in (22.3), we get
$$s_{kk} = \frac1N\sum_{l=1}^N d_{kl}^2 - \frac{1}{2N^2}\sum_{k,l=1}^N d_{kl}^2,$$
and, from (22.2),
$$s_{kl} = -\frac12\left(d_{kl}^2 - \frac1N\sum_{k'=1}^N d_{k'l}^2 - \frac1N\sum_{l'=1}^N d_{kl'}^2 + \frac{1}{N^2}\sum_{k',l'=1}^N d_{k'l'}^2\right).$$
If we denote by $D^2$ the matrix formed with the squared distances $d_{kl}^2$, this identity can be rewritten in the simpler form
$$S = -\frac12 P D^2 P \qquad (22.4)$$
with $P = \mathrm{Id}_{\mathbb R^N} - \mathbf 1_N\mathbf 1_N^T/N$.

We now show that this PCA approach to MDS is equivalent to the problem of minimizing
$$F(y) = \sum_{k,l=1}^N (y_k^T y_l - s_{kl})^2 \qquad (22.5)$$
over all $y_1, \ldots, y_N \in \mathbb R^p$ such that $y_1 + \cdots + y_N = 0$, which can be interpreted as matching "similarities" $s_{kl}$ rather than distances. Indeed, letting $Y$ denote the $N$ by $p$ matrix with rows $y_1^T, \ldots, y_N^T$, we have
$$F(y) = \mathrm{trace}\big((Y Y^T - S)^2\big).$$
Finding $Y$ is equivalent to finding a symmetric matrix $M$ of rank at most $p$ minimizing $\mathrm{trace}((M-S)^2)$. Using the trace inequality (theorem 2.1), and letting $\lambda_1^2 \geq \cdots \geq \lambda_N^2$ (resp. $\mu_1^2 \geq \cdots \geq \mu_p^2$) denote the eigenvalues of $S$ (resp. $M$), we have
$$\begin{aligned}
\mathrm{trace}((M-S)^2) &= \mathrm{trace}(M^2) - 2\,\mathrm{trace}(MS) + \mathrm{trace}(S^2)\\
&= \sum_{k=1}^p \mu_k^4 - 2\,\mathrm{trace}(MS) + \sum_{k=1}^N \lambda_k^4\\
&\geq \sum_{k=1}^p \mu_k^4 - 2\sum_{k=1}^p \lambda_k^2\mu_k^2 + \sum_{k=1}^N \lambda_k^4\\
&= \sum_{k=1}^p (\lambda_k^2 - \mu_k^2)^2 + \sum_{k=p+1}^N \lambda_k^4\\
&\geq \sum_{k=p+1}^N \lambda_k^4.
\end{aligned}$$

This lower bound is attained when $M$ and $S$ can be diagonalized in the same orthonormal basis with $\lambda_k^2 = \mu_k^2$ for $k = 1, \ldots, p$. So, letting $S = U D U^T$, where $U$ is orthogonal and $D$ is diagonal with decreasing entries on the diagonal, an optimal $M$ is given by $M = U_p D_p U_p^T$, where $U_p$ is formed with the first $p$ columns of $U$ and $D_p$ is the first $p\times p$ block of $D$. This shows that the matrix $Y = U_p D_p^{1/2}$ provides a minimizer of $F$. The matrix $U = [u^{(1)}, \ldots, u^{(N)}]$ differs from the matrix $A = [\alpha^{(1)}, \ldots, \alpha^{(N)}]$ above through the normalization of its column vectors: we have $S\alpha^{(i)} = \lambda_i^2\alpha^{(i)}$ with $(\alpha^{(i)})^T S\alpha^{(i)} = 1$, while $S u^{(i)} = \lambda_i^2 u^{(i)}$ with $(u^{(i)})^T S u^{(i)} = \lambda_i^2$, showing that $\alpha^{(i)} = \lambda_i^{-1} u^{(i)}$. This shows that $A_p = U_p D_p^{-1/2}$, so that $Y$ can also be rewritten as $Y = A_p D_p$, i.e., $y_k^{(i)} = \lambda_i^2\alpha_k^{(i)}$, the same expression that was obtained before.

The minimization of $F$ is called similarity matching. Clearly, this method can be applied when one starts directly with a matrix of similarities $S$, provided it satisfies $\sum_{l=1}^N s_{kl} = 0$ for all $k$. If this is not the case, then, interpreting $s_{kl}$ as an inner product $h_k^T h_l$, it is natural to replace $s_{kl}$ by what would give $(h_k - \bar h)^T(h_l - \bar h)$, namely by
$$s_{kl}' = s_{kl} - \frac1N\sum_{l'=1}^N s_{kl'} - \frac1N\sum_{k'=1}^N s_{k'l} + \frac{1}{N^2}\sum_{k',l'=1}^N s_{k'l'}.$$
Interestingly, this discussion provides us with yet another interpretation of PCA.
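The whole computation, from squared dissimilarities to the embedding (22.1), fits in a few lines; here is a NumPy sketch (our own function name):

```python
import numpy as np

def classical_mds(D, p):
    """Similarity-matching (classical) MDS from a dissimilarity matrix D."""
    N = D.shape[0]
    P = np.eye(N) - np.ones((N, N)) / N
    S = -0.5 * P @ (D ** 2) @ P         # double centering, eq. (22.4)
    vals, vecs = np.linalg.eigh(S)      # eigenvalues in increasing order
    idx = np.argsort(vals)[::-1][:p]    # keep the p largest eigenvalues
    lam = np.sqrt(np.maximum(vals[idx], 0.0))
    return vecs[:, idx] * lam           # Y = U_p D_p^{1/2}
```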

22.1.2 Dissimilarity matching

While the minimization of (22.5) did not provide us with a new way of analyzing the data (since it was equivalent to PCA), the direct comparison of dissimilarities, that is, the minimization of
$$G(y) = \sum_{k,l=1}^N (|y_k - y_l| - d_{kl})^2$$
over $y_1, \ldots, y_N \in \mathbb R^p$, provides a different approach. Since this may be useful in practice and does not bring in much additional difficulty, we will allow for the possibility of weighting the differences in $G$ and consider the minimization of
$$G(y) = \sum_{k,l=1}^N w_{kl}(|y_k - y_l| - d_{kl})^2,$$
where $W = (w_{kl})$ is a symmetric matrix of non-negative weights. The only additional complexity resulting from adding weights is that the indeterminacy on $y_1, \ldots, y_N$ is now that $G(y) = G(y')$ as soon as $y - y'$ is constant on every connected component of the graph associated with the weight matrix $W$, so that the constraint on $y$ should be replaced by
$$\sum_{k\in\Gamma} y_k = 0$$

for any connected component $\Gamma$ of this graph. (If all weights are positive, then the only non-empty connected component is $\{1, \ldots, N\}$ and we retrieve our previous constraint $\sum_{k=1}^N y_k = 0$.)

Standard nonlinear optimization methods, such as projected gradient descent, may be used to minimize $G$, but the preferred algorithm for MDS uses a stepwise procedure resulting from the introduction of an auxiliary variable. Rewrite
$$G(y) = \sum_{k,l=1}^N w_{kl}|y_k - y_l|^2 - 2\sum_{k,l=1}^N w_{kl} d_{kl}|y_k - y_l| + \sum_{k,l=1}^N w_{kl} d_{kl}^2.$$
We have, for $u \in \mathbb R^p$:
$$|u| = \max\{z^T u : z \in \mathbb R^p,\ |z| = 1_{u\neq 0}\}.$$
Using this identity, we can introduce auxiliary variables $z_{kl}$, $k, l = 1, \ldots, N$, in $\mathbb R^p$, with $|z_{kl}| = 1$ if $y_k \neq y_l$, and define
$$\hat G(y, z) = \sum_{k,l=1}^N w_{kl}|y_k - y_l|^2 - 2\sum_{k,l=1}^N w_{kl} d_{kl}(y_k - y_l)^T z_{kl} + \sum_{k,l=1}^N w_{kl} d_{kl}^2.$$
We then have
$$G(y) = \min_{z\,:\,|z_{kl}| = 1 \text{ if } y_k\neq y_l} \hat G(y, z).$$

As a consequence, minimizing $G$ in $y$ can be achieved by minimizing $\hat G$ in $y$ and $z$ and discarding $z$ when this is done. One can minimize $\hat G$ iteratively, alternating minimization in $y$ given $z$ and in $z$ given $y$, both steps being elementary. In order to describe these steps, we introduce some matrix notation.

Let $L$ denote the Laplacian matrix of the weighted graph on $\{1, \ldots, N\}$ associated with the weight matrix $W$, namely $L = (\ell_{kl}, k, l = 1, \ldots, N)$ with $\ell_{kk} = \sum_{l=1}^N w_{kl} - w_{kk}$ and $\ell_{kl} = -w_{kl}$ when $k \neq l$. Then
$$\sum_{k,l=1}^N w_{kl}|y_k - y_l|^2 = 2\,\mathrm{trace}(Y^T L Y).$$
Defining $u_k \in \mathbb R^p$ by
$$u_k = \sum_{l=1}^N w_{kl} d_{kl}(z_{kl} - z_{lk})$$
and letting $U$ be the $N$ by $p$ matrix with rows $u_1^T, \ldots, u_N^T$, we have
$$\sum_{k,l=1}^N w_{kl} d_{kl}(y_k - y_l)^T z_{kl} = \mathrm{trace}(U^T Y).$$
566 CHAPTER 22. DATA VISUALIZATION AND MANIFOLD LEARNING

With this notation, the optimal matrix $Y$ must minimize
$$2\,\mathrm{trace}(Y^T L Y) - 2\,\mathrm{trace}(U^T Y).$$

Let $m$ be the number of connected components of the weighted graph. Recall that the matrix $L$ is positive semi-definite and that an orthonormal basis of its null space is provided by vectors, say $e_1, \ldots, e_m$, that are constant on each of the $m$ connected components of the graph, so that the constraint on $Y$ can be written as $e_j^T Y = 0$ for $j = 1, \ldots, m$. Introduce the matrix
$$\hat L = L + \sum_{j=1}^m e_j e_j^T,$$
which is positive definite. Our minimization problem is then equivalent to minimizing
$$2\,\mathrm{trace}(Y^T\hat L Y) - 2\,\mathrm{trace}(U^T Y)$$
subject to $e_j^T Y = 0$ for $j = 1, \ldots, m$. The derivative of this function is
$$4\hat L Y - 2U,$$

so that an optimal Y must satisfy


m
X
4L̂Y − 2U + ej µTj = 0
j=1

for Lagrange multipliers $\mu_1,\dots,\mu_m \in \mathbb{R}^p$. This shows that
$$Y = \frac12 \hat L^{-1}\Big(U - \frac12\sum_{j=1}^m e_j\mu_j^T\Big) = \frac12\hat L^{-1}U - \frac14\sum_{j=1}^m e_j\mu_j^T$$

where we have used the fact that $\hat L^{-1}e_j = e_j$. We can now identify $\mu_j$ since
$$0 = e_j^T Y = \frac12 e_j^T \hat L^{-1} U - \frac14\sum_{j'=1}^m e_j^T e_{j'}\,\mu_{j'}^T = \frac12 e_j^T U - \frac14 \mu_j^T$$

so that $\mu_j^T = 2e_j^T U$ and the optimal Y is
$$Y = \frac12\hat L^{-1} U - \frac12 \sum_{j=1}^m e_j e_j^T U.$$

Note that this expression can be rewritten as
$$Y = \frac12 P_L \hat L^{-1} U$$
where $P_L = \mathrm{Id}_{\mathbb{R}^N} - \sum_{j=1}^m e_j e_j^T$ is the projection onto the space perpendicular to the null
space of L (i.e., the range of L). In the case where the graph has a single connected
component, one has m = 1 and $e_1 = \mathbf{1}_N/\sqrt{N}$, yielding
$$P_L = \mathrm{Id}_{\mathbb{R}^N} - \frac1N \mathbf{1}_N\mathbf{1}_N^T.$$

The minimization in z given y is straightforward: if $y_k \neq y_l$, then $z_{kl} = (y_k - y_l)/|y_k - y_l|$. If $y_k = y_l$, then one can take any value for $z_{kl}$, and the simplest is of course
$z_{kl} = 0$. Using the previous computation, we can summarize a training algorithm for
multidimensional scaling, called SMACOF for “Scaling by MAjorizing a COmplicated
Function” (see, e.g., Borg and Groenen [35] for more details and references).

Algorithm 22.1 (SMACOF)

Assume that a symmetric matrix of dissimilarities $(d_{kl},\ k,l = 1,\dots,N)$ is given, together with a matrix of weights $(w_{kl},\ k,l = 1,\dots,N)$. Fix a target dimension, p. Fix a
tolerance constant $\epsilon$.

1. Compute the Laplacian matrix L of the graph associated with the weights, the
projection matrix $P_L$ onto the range of L, and the matrix $M = (L + \mathrm{Id}_{\mathbb{R}^N} - P_L)^{-1}$.
2. Initialize the algorithm with some family $y_1,\dots,y_N \in \mathbb{R}^p$ and let $Y = \begin{pmatrix}y_1^T\\ \vdots\\ y_N^T\end{pmatrix}$.
3. At a given step of the algorithm, let Y be the current solution and compute, for
$k = 1,\dots,N$:
$$u_k = 2\sum_{l=1}^N w_{kl}\,d_{kl}\,\frac{y_k - y_l}{|y_k - y_l|}\,\mathbf{1}_{y_k\neq y_l}$$
to form the matrix $U = \begin{pmatrix}u_1^T\\ \vdots\\ u_N^T\end{pmatrix}$.
4. Compute $Y' = \frac12 P_L M U$.
5. If $|Y - Y'| \le \epsilon$, exit and return $Y'$.
6. Return to step 3.
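To make the alternating scheme concrete, here is a minimal NumPy sketch of Algorithm 22.1 for a weight graph with a single connected component; the function and variable names are ours, not part of any standard library.

```python
import numpy as np

def smacof(D, W, p, tol=1e-6, max_iter=500, seed=0):
    """Sketch of Algorithm 22.1 (single connected component assumed).

    D: (N, N) symmetric dissimilarities; W: (N, N) symmetric non-negative
    weights; p: target dimension."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    L = np.diag(W.sum(axis=1)) - W            # graph Laplacian
    P_L = np.eye(N) - np.ones((N, N)) / N     # projection onto range(L)
    M = np.linalg.inv(L + np.eye(N) - P_L)    # (L + 1_N 1_N^T / N)^{-1}
    Y = rng.standard_normal((N, p))
    for _ in range(max_iter):
        diff = Y[:, None, :] - Y[None, :, :]  # diff[k, l] = y_k - y_l
        dist = np.linalg.norm(diff, axis=2)
        with np.errstate(divide="ignore", invalid="ignore"):
            Z = np.where(dist[..., None] > 0, diff / dist[..., None], 0.0)
        U = 2 * ((W * D)[..., None] * Z).sum(axis=1)   # rows are the u_k
        Y_new = 0.5 * P_L @ (M @ U)
        if np.linalg.norm(Y - Y_new) <= tol:
            break
        Y = Y_new
    return Y_new
```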

Figure 22.1: Left: Multidimensional scaling applied to a 3D curve embedded in a 10-dimensional space retrieves the Euclidean structure. Right: Isomap, in contrast, identifies
the one-dimensional nature of the data.

22.2 Manifold learning

The goal of MDS is to map a full matrix of distances into a low-dimensional Euclidean space. Such a representation, however, cannot address the possibility that
the data is supported by a low-dimensional, albeit nonlinear, space. For example,
people living on Earth are, for all practical purposes, on a two-dimensional structure (a
sphere), but any faithful Euclidean representation of the world population needs
to use the three spatial dimensions. One may also argue that the relevant distance
between points on Earth is not the Euclidean one either (because one would never
travel through Earth to go from one place to another), but the distance associated with
the shortest path on the sphere, which is measured along great circles.

To take another example, the left panel in fig. 22.1 provides the result of applying
MDS to a ten-dimensional dataset obtained by applying a random ten-dimensional
rotation to a curve supported by a three-dimensional torus. MDS indeed retrieves
the correct curve structure in space, which is three dimensional. However, for a
person “living” on the curve, the data is one-dimensional, a fact that is captured by
the Isomap method that we now describe.

22.2.1 Isomap

Let us return to the example of people living on the spherical Earth. One can de-
fine the distance between two points on Earth either as the shortest length a person
would have to travel (say, by plane) to go from one point to the other (that we can
call the intrinsic distance), or simply the chordal distance in 3D space between the two
points. The first one is obviously the most relevant to the spherical structure of the
Earth, but the second one is easier to compute given the locations of the points in

space.

For typical datasets, the geometric structure of the data (e.g., that it is supported
by a sphere) is unknown, and the only information that is available is their chordal
distance in an ambient space (which can be very large). An important remark, however, is that, when the points are close to each other, the two distances can be
expected to be similar, if we assume that the geometry of the set supporting the data is
locally linear (e.g., that it is, like the sphere, a “submanifold” of the ambient space,
with small neighborhoods of any data point well approximated, at first order, by
points on a tangent space). Isomap uses this property, only trusting small distances
in the matrix D, and infers large distances by adding the costs resulting from travel-
ing from data points to nearby data points.

Fix an integer c. Given D, the c-nearest-neighbor graph on $V = \{1,\dots,N\}$ places an
edge between k and l if and only if $d_{kl}$ is among the c smallest values in $\{d_{kl'},\ l'\neq k\}$
($x_l$ among the c nearest neighbors of $x_k$) or among the c smallest values in $\{d_{k'l},\ k'\neq l\}$
($x_k$ among the c nearest neighbors of $x_l$). We will write $k \sim_c l$ to
indicate that there exists an edge between k and l in this graph. One then defines
the geodesic distance on the graph as
$$d_{kl}^{(*)} = \min\Bigg\{\sum_{j=1}^m d_{k_{j-1}k_j} \,:\, k_0,\dots,k_m\in\{1,\dots,N\},\ k_0 = k\sim_c k_1\sim_c\cdots\sim_c k_{m-1}\sim_c k_m = l,\ m\ge 0\Bigg\}.$$

This geodesic distance can be computed incrementally as follows. First define
$d^{(1)}_{kl} = |x_k - x_l|$ if $k\sim_c l$ and $d^{(1)}_{kl} = +\infty$ otherwise (and also let $d^{(1)}_{kk} = 0$). Then, given
$d^{(n-1)}$, define
$$d^{(n)}_{kl} = \min\Big\{d^{(n-1)}_{kl'} + d^{(1)}_{l'l} \,:\, l' = 1,\dots,N\Big\}$$
until the entries stabilize, i.e., $d^{(n+1)} = d^{(n)}$, in which case one has $d^{(*)} = d^{(n)}$. The
validity of the statement can be easily proved by checking that
$$d^{(n)}_{kl} = \min\Bigg\{\sum_{j=1}^n d^{(1)}_{k_{j-1}k_j} \,:\, k_0,\dots,k_n\in\{1,\dots,N\},\ k_0 = k,\ k_n = l\Bigg\},$$
which can be done by induction, the details being left to the reader. It should also
be clear that the procedure will stabilize after no more than N steps.
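As an illustration, the incremental computation above can be written in a few lines of NumPy (a sketch; the names are ours, and for large N one would instead use a sparse shortest-path routine):

```python
import numpy as np

def graph_geodesics(D1):
    """Iterate d^{(n)}_{kl} = min_{l'} (d^{(n-1)}_{kl'} + d^{(1)}_{l'l}) until stable.

    D1 is the one-step matrix: finite entries on graph edges, zero diagonal,
    and +inf for non-neighbors. Uses O(N^3) memory per step (sketch only)."""
    d = D1.copy()
    while True:
        # d_new[k, l] = min over l' of d[k, l'] + D1[l', l]
        d_new = np.min(d[:, :, None] + D1[None, :, :], axis=1)
        if np.array_equal(d_new, d):   # entries have stabilized: d = d^{(*)}
            return d
        d = d_new
```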

Once the distance is computed, Isomap then applies standard MDS, resulting
in a straightened representation of the data like in fig. 22.1. Another example is
provided in fig. 22.2, where, this time, the input curve is closed and cannot therefore
be represented as a one-dimensional structure. One can note, however, that, even in
this case, Isomap still provides some simplification of the initial shape of the data.

Figure 22.2: Left: Multidimensional scaling applied to a closed 3D curve embedded in a
10-dimensional space. Right: the corresponding Isomap output; the closed curve cannot be
reduced to a one-dimensional structure, but its shape is simplified.

22.2.2 Local Linear Embedding

Local linear embedding (LLE) exploits in a different way the fact that manifolds are
locally well approximated by linear spaces. Like Isomap, it also starts with building
a c-nearest-neighbor graph on $\{1,\dots,N\}$. Assume, for the sake of the discussion,
that the distance matrix is computed for possibly unobserved data $T = (x_1,\dots,x_N)$.
Letting $N_k$ denote the indices of the nearest neighbors of k (excluding k itself), the
basic assumption is that $x_k$ should approximately lie in the affine space generated by
$x_l$, $l \in N_k$. Expressed in barycentric coordinates, this space is defined by
$$T_k = \Bigg\{\sum_{l\in N_k}\rho^{(l)}\, x_l \,:\, \rho\in\mathbb{R}^{N_k},\ \sum_{l\in N_k}\rho^{(l)} = 1\Bigg\},$$
and $T_k$ can be interpreted as an approximation of the tangent space at $x_k$ to the data
manifold. Optimal coefficients $(\rho_k^{(l)},\ k = 1,\dots,N,\ l\in N_k)$ providing the representation of $x_k$ in that space can be estimated by minimizing, for all k,
$$\Bigg|x_k - \sum_{l\in N_k}\rho_k^{(l)}\, x_l\Bigg|^2$$

subject to $\sum_{l\in N_k}\rho_k^{(l)} = 1$. This is a simple least-squares program. Let $c_k = |N_k|$ ($c_k = c$
in the absence of ties). Order the elements of $N_k$ to represent $\rho_k^{(l)}$, $l \in N_k$, as a vector
denoted $\rho_k\in\mathbb{R}^{c_k}$. Similarly, let $S_k$ be the Gram matrix associated with $x_l$, $l \in N_k$,
formed with all inner products $x_{l'}^Tx_l$, $l, l' \in N_k$, and let $r_k$ be the vector composed
of the products $x_k^Tx_l$, $l \in N_k$. Assume that $S_k$ is invertible, which is generally true if

c < d, unless the neighbors are exactly linearly aligned. Then, the optimal $\rho_k$ and the
Lagrange multiplier $\lambda$ for the constraint are given by
$$\begin{pmatrix}\rho_k\\ \lambda\end{pmatrix} = \begin{pmatrix}S_k & \mathbf{1}_{c_k}\\ \mathbf{1}_{c_k}^T & 0\end{pmatrix}^{-1}\begin{pmatrix}r_k\\ 1\end{pmatrix}. \tag{22.6}$$
If $S_k$ is not invertible, the problem is under-constrained and one of its solutions can
be obtained by replacing the inverse above by a pseudo-inverse.
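A minimal sketch of this computation, using a least-squares solver so that the pseudo-inverse case is handled automatically (names are ours):

```python
import numpy as np

def lle_weights(S_k, r_k):
    """Solve (22.6): barycentric weights of x_k on its neighbors.

    S_k: (c_k, c_k) Gram matrix of the neighbors; r_k: (c_k,) inner
    products x_k^T x_l. Returns rho_k, whose entries sum to one."""
    c = len(r_k)
    A = np.zeros((c + 1, c + 1))
    A[:c, :c] = S_k
    A[:c, c] = 1.0
    A[c, :c] = 1.0
    rhs = np.append(r_k, 1.0)
    # lstsq returns the minimum-norm solution, i.e. uses a pseudo-inverse
    # when the system matrix is singular.
    sol = np.linalg.lstsq(A, rhs, rcond=None)[0]
    return sol[:c]
```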

The low-dimensional representation of the data, still denoted $(y_1,\dots,y_N)$ with
$y_k \in \mathbb{R}^p$, is then estimated so that the relative position of $y_k$ to its neighbors is the
same as that of $x_k$, i.e., so that
$$y_k \simeq \sum_{l\in N_k}\rho_k^{(l)}\, y_l.$$
These vectors are estimated by minimizing
$$F(y) = \sum_{k=1}^N \Bigg|y_k - \sum_{l\in N_k}\rho_k^{(l)}\, y_l\Bigg|^2.$$

Obviously, some additional constraints are needed to avoid the trivial solution $y_k = 0$
for all k. Also, replacing all $y_k$'s by $y'_k = Ry_k + b$, where R is an orthogonal transformation of $\mathbb{R}^p$ and b a translation, does not change the value of F, so there is no loss of
generality in assuming that $\sum_{k=1}^N y_k = 0$ and that $\sum_{k=1}^N y_ky_k^T = D_0$, a diagonal matrix.
However, if one lets $y'_k = Dy_k$ where D is diagonal, then
$$F(y') = \sum_{i=1}^p D_{ii}^2 \sum_{k=1}^N\Bigg(y_k^{(i)} - \sum_{l\in N_k}\rho_k^{(l)}\,y_l^{(i)}\Bigg)^2.$$
This shows that one should not allow the diagonal coefficients of $D_0$ to be chosen
freely, since otherwise the optimal solution would require taking these coefficients to 0.
So $D_0$ should be a fixed matrix, and by symmetry, it is natural to take $D_0 = \mathrm{Id}_{\mathbb{R}^p}$. (Any
other solution, for a different $D_0$, can then be obtained by rescaling independently
the coordinates of $y_1,\dots,y_N$.)
Extend $\rho_k$ to an N-dimensional vector by taking $\rho_k^{(k)} = -1$ and $\rho_k^{(l)} = 0$ if $l \neq k$ and
$l \notin N_k$. We can write
$$F(y) = \sum_{k=1}^N \Bigg|\sum_{l=1}^N \rho_k^{(l)}\, y_l\Bigg|^2.$$
Expanding the square, this is
$$F(y) = \sum_{l,l'=1}^N w_{ll'}\, y_l^T y_{l'}$$

with $w_{ll'} = \sum_{k=1}^N \rho_k^{(l)}\rho_k^{(l')}$. Introducing the matrix W with entries $w_{ll'}$ and the $N\times p$
matrix $Y = \begin{pmatrix}y_1^T\\ \vdots\\ y_N^T\end{pmatrix}$, we have the simple expression
$$F(y) = \mathrm{trace}(Y^T W Y)\,.$$

Note that the constraints are $Y^TY = \mathrm{Id}_{\mathbb{R}^p}$ and $Y^T\mathbf{1}_N = 0$. Without this last constraint,
we know that an optimal solution is provided by $Y = [e_1,\dots,e_p]$ where $e_1,\dots,e_p$ form
an orthonormal family of eigenvectors associated with the p smallest eigenvalues
of W (this is a consequence of corollary 2.4). To handle the additional constraint, it
suffices to note that $W\mathbf{1}_N = 0$, so that $\mathbf{1}_N$ is a zero eigenvector. Given this, it suffices
to compute p + 1 eigenvectors associated with the smallest eigenvalues of W, $e_1,\dots,e_{p+1}$,
with the condition that $e_1 = \pm\mathbf{1}_N/\sqrt{N}$ (which is automatically satisfied unless 0 is a
multiple eigenvalue of W), and let
$$Y = [e_2,\dots,e_{p+1}].$$
Note that $e_2,\dots,e_{p+1}$ are also eigenvectors associated with the p smallest eigenvalues of $W + \lambda\mathbf{1}\mathbf{1}^T$ for any large
enough $\lambda$, e.g., $\lambda > \mathrm{trace}(W)/N$.

LLE is summarized in the following algorithm.

Algorithm 22.2 (Local linear embedding)

The input of the algorithm is

(i) Either a training set $T = (x_1,\dots,x_N)$, or its Gram matrix S containing all
inner products $x_k^Tx_l$ (or, more generally, inner products in feature space), or a dissimilarity matrix $D = (d_{kl})$.
(ii) An integer c for the graph construction.
(iii) An integer p for the target dimension.

(1) If not provided in input, compute the Gram matrix S and distance matrix D
(using (22.2) and (22.4)).
(2) Build the c-nearest-neighbor graph associated with the distances. Let $N_k$ be the
set of neighbors of k, with $c_k = |N_k|$.
(3) For $k = 1,\dots,N$, let $S_k$ be the sub-matrix of S associated with $x_l$, $l \in N_k$,
and compute coefficients $\rho_k^{(l)}$, $l \in N_k$, stacked in a vector $\rho_k \in \mathbb{R}^{c_k}$, by solving (22.6).
(4) Form the matrix W with entries $w_{ll'} = \sum_{k=1}^N \rho_k^{(l)}\rho_k^{(l')}$, with $\rho_k$ extended so that $\rho_k^{(k)} = -1$ and $\rho_k^{(l)} = 0$ if $l \neq k$ and $l \notin N_k$.

(5) Compute the first p + 1 eigenvectors, $e_1,\dots,e_{p+1}$, of W (associated with the smallest
eigenvalues), arranging for $e_1$ to be proportional to $\mathbf{1}_N$.
(6) Set $y_k^{(i)} = e_{i+1}^{(k)}$ for $i = 1,\dots,p$ and $k = 1,\dots,N$.

Figure 22.3: Local linear embedding with target dimension 3 applied to the data in fig. 22.1
and fig. 22.2.

The results of LLE applied to the datasets described in fig. 22.1 and fig. 22.2 are
provided in fig. 22.3.
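Steps (4)-(6) of the algorithm can be sketched as follows, assuming the weights from (22.6) have already been computed; the helper names are ours:

```python
import numpy as np

def lle_embedding(rho_list, neighbors, N, p):
    """Sketch of steps (4)-(6) of Algorithm 22.2.

    rho_list[k]: barycentric weights of x_k on its neighbors;
    neighbors[k]: the corresponding neighbor indices (same order)."""
    R = np.zeros((N, N))              # row k holds the extended vector rho_k
    for k in range(N):
        R[k, k] = -1.0
        R[k, neighbors[k]] = rho_list[k]
    W = R.T @ R                       # w_{ll'} = sum_k rho_k^{(l)} rho_k^{(l')}
    vals, vecs = np.linalg.eigh(W)    # eigenvalues in increasing order
    # e_1 is proportional to 1_N (zero eigenvalue); the representation
    # uses e_2, ..., e_{p+1}: row k of the result is y_k.
    return vecs[:, 1:p + 1]
```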

Remark 22.1 We note that, for both Isomap and LLE, the c-nearest-neighbor graph
can be replaced by the graph formed with edges between all pairs of points that are
at distance less than $\epsilon$ from each other, for a chosen $\epsilon > 0$, with no change in the
algorithms.

These parameters (c or $\epsilon$) must be chosen carefully and may have an important
impact on the output of the algorithm. Choosing them too small would not allow
for a correct estimation of distances in Isomap (with possibly some of them being
infinite if the graph has more than one connected component), or of the linear approximations in LLE. However, choosing them too large may break the hypothesis
that the data is locally Euclidean or linear, which forms the basic principle of these
algorithms.

22.2.3 Graph Embedding

Both Isomap and LLE are based on the construction of a nearest-neighbor graph
based on dissimilarity data and the conservation of some of its geometric features
when deriving a small-dimensional representation. For LLE, a weight matrix W was
first estimated based on optimal linear approximations of xk by its neighbors, and

the representation was computed by estimating the eigenvectors associated with the
smallest eigenvalues of W (excluding the eigenvector proportional to 1). However,
both methods were motivated by the intuition that the dataset was supported by
a continuous small-dimensional manifold. We now discuss methods that are solely
motivated by the discrete geometry of a graph, for which we use tools that are similar
to our discussion of graph clustering in section 20.5.

Adapting the notation in that section to the present one, we start with a graph
with N vertices and weights $\beta_{kl}$ between these vertices (such that $\beta_{ll} = 0$) and we
form the Laplacian operator defined by, for any vector $u \in \mathbb{R}^N$:
$$\|u\|_{H^1}^2 = \frac12\sum_{l,l'=1}^N \beta_{ll'}\big(u^{(l)} - u^{(l')}\big)^2 = u^T L u,$$

so that L is identified as the matrix with coefficients $\ell_{ll'} = -\beta_{ll'}$ for $l \neq l'$ and $\ell_{ll} = \sum_{l'=1}^N \beta_{ll'}$. The matrix W that was obtained for LLE coincides with this graph Laplacian if one lets $\beta_{ll'} = -w_{ll'}$ for $l \neq l'$, since we have $\sum_{l'=1}^N w_{ll'} = 0$. The usual requirement that weights be non-negative entails no real loss of generality, because in LLE (and
in the graph embedding method described in this section), one is only interested in eigenvectors of W
(or L below) that are perpendicular to 1, and those remain the same if one replaces
W by
$$W - a\mathbf{1}_N\mathbf{1}_N^T + Na\,\mathrm{Id}_{\mathbb{R}^N},$$
which has negative off-diagonal coefficients $\tilde w_{ll'} = w_{ll'} - a$ for large enough a.

In graph (or Laplacian) embedding, the starting point is a weighted graph on
$\{1,\dots,N\}$ with edge weights $\beta_{ll'}$ interpreted as similarities between vertices. These
weights may or may not be deduced from measures of dissimilarity $(d_{ll'},\ l,l' = 1,\dots,N)$,
which themselves may or may not be computed as distances between training data
$x_1,\dots,x_N$. If one starts with dissimilarities, it is typical to use simple transformations
to compute edge weights, and one of the most commonly used is
$$\beta_{ll'} = \exp(-d_{ll'}^2/2\tau^2)$$
for some constant $\tau$. These weights are usually truncated, replacing small values
by zeros (or the computation is restricted to nearest neighbors), to ensure that the
resulting graph is sparse, which speeds up the computation of eigenvectors for large
datasets.

Given a target dimension p, the graph is then represented as a collection of points
$y_1,\dots,y_N \in \mathbb{R}^p$, where $y_k$ is associated with vertex k. For this purpose, one needs to compute the first p + 1 eigenvectors, $e_1,\dots,e_{p+1}$, of the graph Laplacian, with the requirement that $e_1 = \pm\mathbf{1}_N/\sqrt{N}$. (This is always possible and can be achieved numerically
by computing eigenvectors of $L + c\mathbf{1}\mathbf{1}^T$ for large enough c.)

The graph representation is then given by $y_k^{(i)} = e_{i+1}^{(k)}$ for $i = 1,\dots,p$ and $k = 1,\dots,N$. Note that these are exactly
the same operations as the final steps of the LLE algorithm.
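A minimal sketch of this embedding, using the shift trick mentioned above to isolate the constant eigenvector (names are ours):

```python
import numpy as np

def laplacian_embedding(beta, p):
    """Graph embedding sketch: row k of the output is y_k.

    beta: (N, N) symmetric non-negative similarities with zero diagonal."""
    N = beta.shape[0]
    L = np.diag(beta.sum(axis=1)) - beta
    # Adding c 1 1^T with c large enough moves the constant eigenvector to
    # the top of the spectrum, so the bottom p eigenvectors of the shifted
    # matrix are e_2, ..., e_{p+1}.  c = trace(L) + 1 suffices since the
    # eigenvalues of L are bounded by its trace (L is PSD).
    c = np.trace(L) + 1.0
    vals, vecs = np.linalg.eigh(L + c * np.ones((N, N)))
    return vecs[:, :p]
```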

One way to interpret this construction is that $e_2,\dots,e_{p+1}$ (the coordinate functions
for the representation $y_1,\dots,y_N$) minimize
$$\sum_{j=1}^{p}\|e_{j+1}\|^2_{H^1}$$
subject to $e_2,\dots,e_{p+1}$ being perpendicular to each other and perpendicular to the
constant functions (these constraints being justified for the same reasons as those
discussed for LLE). Small $H^1$ semi-norms being associated with smoothness on the
graph, we see that we are looking for the smoothest zero-mean representation of the
data.

Based on our discussion of LLE, we can make an alternative interpretation by
introducing a symmetric square root R of the Laplacian matrix L, or any matrix such
that $RR^T = L$. Writing $R = [\rho_1,\dots,\rho_N]$, one has
$$L = \sum_{k=1}^N \rho_k\rho_k^T$$
and $\sum_{k=1}^N \rho_k = 0$. With this notation, we can interpret Laplacian embedding as the
minimization of
$$\sum_{k=1}^N \Bigg|\sum_{l=1}^N \rho_k^{(l)}\, y_l\Bigg|^2$$
(subject to the previous orthogonality constraints). In other terms, $y_1,\dots,y_N$ are determined so that the linear relationships
$$\rho_k^{(k)}\, y_k = -\sum_{l\neq k}\rho_k^{(l)}\, y_l$$
are satisfied as closely as possible, which is similar to the LLE condition, without the requirement that
$\rho_k^{(k)} = -1$.
An alternative requirement that could have been made for LLE is that $\sum_{l=1}^N (\rho_k^{(l)})^2 = 1$ for all k. Instead of having to solve a linear system in step (3) of Algorithm 22.2, one
would then compute an eigenvector with smallest eigenvalue of $S_k$. For graph embedding, this constraint can be enforced by modifying the Laplacian matrix, since
$\sum_{l=1}^N (\rho_k^{(l)})^2$ is just the (k,k) coefficient of $RR^T$. Given this, let D be the diagonal matrix
formed by the diagonal elements of L, and define the so-called “symmetric Laplacian” $\tilde L = D^{-1/2}LD^{-1/2}$. One obtains an alternative, and popular, graph embedding
method by replacing $e_1,\dots,e_{p+1}$ above by the first p + 1 eigenvectors of $\tilde L$.

Another interpretation of this representation can be based on the random walk
associated with the graph structure. Consider the random process $t \mapsto q(t)$ defined
as follows. The initial position, q(0), is selected according to some arbitrary distribution, say $\pi_0$. Conditionally on $q(t) = k$, the next position is determined by setting random waiting times $\tau_{kl}$, each distributed as an exponential distribution with
rate $\beta_{kl}$ (or expectation $1/\beta_{kl}$), and the process moves to the position l for which
$\tau_{kl}$ is smallest after waiting for that time. Let P(t) be the matrix with coefficients
$P(t,k,l) = P(q(t+s) = l \,|\, q(s) = k)$. Then, one has
$$P(t) = e^{-tL}$$

where the right-hand side is the matrix exponential. If $\lambda_1 = 0 \le \lambda_2 \le \cdots \le \lambda_N$ are the
eigenvalues of L with corresponding eigenvectors $e_1,\dots,e_N$, then
$$P(t) = \sum_{i=1}^N e^{-t\lambda_i}\, e_i e_i^T.$$

In particular, restricting to the first eigenvectors of L provides an approximation of this
stochastic process, i.e.,
$$P(t) \simeq \frac{\mathbf{1}\mathbf{1}^T}{N} + \sum_{i=1}^p e^{-t\lambda_{i+1}}\, y(i)\, y(i)^T,$$
where $y(i) = e_{i+1}$ is the vector of i-th coordinates of the representation.

We could also have considered the discrete-time version of the walk, for which,
considering integer times $t \in \mathbb{N}$,
$$P(q(t+1) = l\,|\,q(t) = k) = \begin{cases}\dfrac{\beta_{kl}}{\sum_{l'\neq k}\beta_{kl'}} & \text{if } l\neq k\\[2mm] 0 & \text{if } l = k.\end{cases}$$

Introducing the matrix B of similarities $\beta_{kl}$ (with zero on the diagonal) and the diagonal matrix D with coefficients $d_{kk} = \sum_{l\neq k}\beta_{kl}$, the right-hand side of the previous equation
is the (k,l) entry of the matrix $\tilde P = D^{-1}B$. Then, for any integer s, $P(q(t+s) = l \,|\, q(t) = k)$
is the (k,l) entry of $\tilde P^s = D^{-1/2}(D^{-1/2}BD^{-1/2})^s D^{1/2}$.

The Laplacian matrix L is given by $L = D - B$. The normalized Laplacian is
$$\bar L = D^{-1/2}LD^{-1/2} = \mathrm{Id}_{\mathbb{R}^N} - D^{-1/2}BD^{-1/2},$$
so that
$$\tilde P^s = D^{-1/2}(\mathrm{Id}_{\mathbb{R}^N} - \bar L)^s D^{1/2}.$$

If one introduces the eigenvectors $\bar e_1,\dots,\bar e_N$ of the normalized Laplacian, still associated with non-decreasing eigenvalues $\bar\lambda_1 = 0 \le \bar\lambda_2 \le \cdots \le \bar\lambda_N$, and arranges without loss of
generality that $\bar e_1 \propto D^{1/2}\mathbf{1}_N$, then
$$\tilde P^s = D^{-1/2}\Bigg(\sum_{i=1}^N (1-\bar\lambda_i)^s\,\bar e_i\bar e_i^T\Bigg) D^{1/2}.$$
This shows that, for s large enough, the transitions of this Markov chain are well approximated by the first terms of this sum, suggesting the alternative representation based
on the normalized Laplacian:
$$\bar y_k^{(i)} = \bar e_{i+1}^{(k)}.$$
Both representations (using normalized or un-normalized Laplacians) are commonly
used in practice.

22.2.4 Stochastic neighbor embedding

General algorithm

Stochastic neighbor embedding (SNE, Hinton and Roweis [90]) and its variant t-SNE (Maaten and Hinton [123]) have become popular tools for the visualization
of high-dimensional data based on dissimilarity matrices. One of the key contributions of these algorithms is the introduction of a local data rescaling step that allows for the
visualization of more homogeneous point clouds.

Assume that dissimilarities $D = (d_{kl},\ k,l = 1,\dots,N)$ are observed. The basic principle in SNE is to deduce from the dissimilarities a family of N probability distributions on $\{1,\dots,N\}$, which we will denote $\pi_k$, $k = 1,\dots,N$, with the property that $\pi_k(k) = 0$. The computation of these probabilities includes the local normalization step, to which
we will return later. Given the $\pi_k$'s, one then estimates low-dimensional representations $y = (y_1,\dots,y_N)$ such that $\pi_k \simeq \psi_k$, where $\psi_k$ is given by
$$\psi_k(l; y) = \frac{\exp\big(-\beta(|y_k-y_l|^2)\big)}{\sum_{l'\neq k}\exp\big(-\beta(|y_k-y_{l'}|^2)\big)}\,\mathbf{1}_{l\neq k}\,.$$
Here, β : [0, +∞) → [0, +∞) is an increasing differentiable function that tends to +∞
at infinity. The derivative is denoted ∂β. The original version of SNE [90] uses
β(t) = t and t-SNE [123] takes β(t) = log(1 + t).

The determination of the representation can then be performed by minimizing a
measure of discrepancy between the probabilities $\pi_k$ and $\psi_k$. In Hinton and Roweis
[90], it is suggested to minimize the sum of Kullback–Leibler divergences, namely
$$\sum_{k=1}^N \mathrm{KL}\big(\pi_k\,\|\,\psi_k(\cdot; y)\big)$$
k=1

or, equivalently, to maximize
$$F(y) = \sum_{k,l=1}^N \pi_k(l)\log\psi_k(l;y) = -\sum_{k,l=1}^N\beta(|y_k-y_l|^2)\,\pi_k(l) - \sum_{k=1}^N\log\Bigg(\sum_{l\neq k}\exp\big(-\beta(|y_k-y_l|^2)\big)\Bigg).$$

The gradient of this function can be computed by evaluating the derivative at $\epsilon = 0$
of $f : \epsilon \mapsto F(y + \epsilon h)$. This computation gives
$$f'(0) = -2\sum_{k,l=1}^N \partial\beta(|y_k-y_l|^2)(y_k-y_l)^T(h_k-h_l)\,\pi_k(l) + 2\sum_{k=1}^N\sum_{l=1}^N \partial\beta(|y_k-y_l|^2)(y_k-y_l)^T(h_k-h_l)\,\psi_k(l;y)$$
$$= -2\sum_{k=1}^N h_k^T\sum_{l=1}^N \partial\beta(|y_k-y_l|^2)(y_k-y_l)\big(\pi_k(l)+\pi_l(k)-\psi_k(l;y)-\psi_l(k;y)\big).$$

This shows that
$$\partial_{y_k}F(y) = -2\sum_{l=1}^N \partial\beta(|y_k-y_l|^2)(y_k-y_l)\big(\pi_k(l)+\pi_l(k)-\psi_k(l;y)-\psi_l(k;y)\big).$$

This is a rather simple expression that can be used with any first-order opti-
mization algorithm to maximize F. The algorithm in Hinton and Roweis [90] uses
gradient ascent with momentum, namely iterating

y (n+1) = y (n) + γ∇F(y (n) ) + α (n) (y (n) − y (n−1) )

Choosing α (n) = 0 provides standard gradient ascent with fixed gain γ (of course,
other optimization methods may be used). The momentum can be interpreted, in a
loose sense, as a “friction term”.
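A sketch of one such iteration follows, with $\beta(t) = t$ as in the original SNE (for the t-SNE choice one would pass beta=np.log1p and dbeta=lambda t: 1/(1+t)); function and argument names are ours:

```python
import numpy as np

def sne_step(Y, Pi, Y_prev, gamma=0.1, alpha=0.5,
             beta=lambda t: t, dbeta=lambda t: 1.0):
    """One gradient-ascent-with-momentum update for F (a sketch).

    Pi[k, l] = pi_k(l); Y, Y_prev: (N, p) current and previous iterates."""
    diff = Y[:, None, :] - Y[None, :, :]         # diff[k, l] = y_k - y_l
    sq = (diff ** 2).sum(axis=2)                 # |y_k - y_l|^2
    E = np.exp(-beta(sq))
    np.fill_diagonal(E, 0.0)
    Psi = E / E.sum(axis=1, keepdims=True)       # psi_k(l; y)
    C = Pi + Pi.T - Psi - Psi.T                  # pi_k(l)+pi_l(k)-psi_k(l)-psi_l(k)
    grad = -2 * ((dbeta(sq) * C)[..., None] * diff).sum(axis=1)
    return Y + gamma * grad + alpha * (Y - Y_prev)
```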

A variant of the algorithm replaces the node-dependent probabilities $\pi_k$ by a single, symmetric, joint distribution $\bar\pi$ on $\{1,\dots,N\}^2$, $(k,l)\mapsto\bar\pi(k,l)$, satisfying $\bar\pi(k,k) = 0$
and $\bar\pi(k,l) = \bar\pi(l,k)$. The target distribution $\bar\psi$ then becomes
$$\bar\psi(k,l;y) = \frac{\exp\big(-\beta(|y_k-y_l|^2)\big)}{\sum_{k',l'=1}^N \exp\big(-\beta(|y_{k'}-y_{l'}|^2)\big)}\,.$$

With such a choice, the objective function has a simpler form: one minimizes
$\mathrm{KL}(\bar\pi\,\|\,\bar\psi(\cdot;y))$ or, equivalently, maximizes the expected log-likelihood
$$\bar F(y) = \sum_{k,l=1}^N \bar\pi(k,l)\log\bar\psi(k,l;y) = -\sum_{k,l=1}^N \beta(|y_k-y_l|^2)\,\bar\pi(k,l) - \log\Bigg(\sum_{k,l=1}^N\exp\big(-\beta(|y_k-y_l|^2)\big)\Bigg).$$

The gradient of this symmetric version of F can be computed similarly to the previous one and is given by
$$\partial_{y_k}\bar F(y) = -4\sum_{l=1}^N \partial\beta(|y_k-y_l|^2)(y_k-y_l)\big(\bar\pi(k,l) - \bar\psi(k,l;y)\big).$$

Setting initial probabilities

The probabilities $\pi_k(l)$ or $\bar\pi(k,l)$ are deduced from the dissimilarities as
$$\pi_k(l) = \frac{e^{-d_{kl}^2/2\sigma_k^2}}{\sum_{l'\neq k} e^{-d_{kl'}^2/2\sigma_k^2}}$$
for $l \neq k$, and
$$\bar\pi(k,l) = \frac{\pi_k(l)+\pi_l(k)}{2N}\,.$$

The coefficients $\sigma_k^2$, $k = 1,\dots,N$ operate the local normalization, justifying, in
particular, the parameter-free expression chosen for $\psi$ and $\bar\psi$. These coefficients are
estimated so as to adjust the entropies of all $\pi_k$ to a fixed value, which is a parameter
of the algorithm. Note that, letting $t = 1/2\sigma_k^2$ and $H(\pi_k) = -\sum_{l=1}^N \pi_k(l)\log\pi_k(l)$,
$$\partial_t H(\pi_k) = -\sum_{l=1}^N \partial_t\pi_k(l)\log\pi_k(l) - \sum_{l=1}^N \partial_t\pi_k(l) = -\sum_{l=1}^N \partial_t\pi_k(l)\log\pi_k(l)$$
(the second sum vanishes because the $\pi_k(l)$ sum to one).

Now
$$\partial_t\log\pi_k(l) = -d_{kl}^2 + \bar d_k^2$$
with $\bar d_k^2 = \sum_{l'=1}^N d_{kl'}^2\,\pi_k(l')$. Writing $\partial_t\pi_k(l) = \pi_k(l)\,\partial_t\log\pi_k(l)$, we have
$$\partial_t H(\pi_k) = \sum_{l=1}^N \big(d_{kl}^2\log\pi_k(l)\big)\pi_k(l) - \bar d_k^2\sum_{l=1}^N \pi_k(l)\log\pi_k(l).$$

Using the Cauchy–Schwarz inequality, we see that $\partial_t H(\pi_k) \le 0$, so that $H(\pi_k)$ is decreasing as
a function of t, i.e., increasing as a function of $\sigma_k^2$. When $\sigma_k^2 \to 0$, $\pi_k$ converges to
the uniform distribution on the set of nearest neighbors of k (the indices $l \neq k$ such
that $d_{kl}^2$ is minimal) and, letting $\nu_k$ denote their number, which is typically equal to
1, $H(\pi_k)$ converges to $\log\nu_k$. When $\sigma_k^2$ tends to infinity, $\pi_k$ converges to the uniform
distribution over indices $l \neq k$, whose entropy is $\log(N-1)$. This shows that $e^{H(\pi_k)}$,
which is called the perplexity of $\pi_k$, can take any value between $\nu_k$ and $N-1$. The
common target value of the perplexity can therefore be taken anywhere between
$\max_k \nu_k$ and $N-1$. In Maaten and Hinton [123], it is recommended to choose a value
between 5 and 50.
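Since the perplexity is monotone in $\sigma_k$, each $\sigma_k$ can be found by bisection on $t = 1/2\sigma_k^2$; a minimal sketch (names are ours):

```python
import numpy as np

def calibrate_sigma(d2_k, target_perplexity, iters=50):
    """Bisection on t = 1/(2 sigma_k^2) so that exp(H(pi_k)) hits the target.

    d2_k: squared dissimilarities from point k to the others (k excluded)."""
    lo, hi = 0.0, 1e6
    for _ in range(iters):
        t = (lo + hi) / 2
        w = np.exp(-t * (d2_k - d2_k.min()))   # stabilized exponentials
        pi = w / w.sum()
        H = -(pi * np.log(np.maximum(pi, 1e-300))).sum()
        if np.exp(H) > target_perplexity:
            lo = t     # entropy decreases with t: increase t to decrease it
        else:
            hi = t
    return np.sqrt(1.0 / (2 * t))
```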
Remark 22.2 The complexity of the computation of the gradient of the objective
function (either F or $\bar F$) scales like the square of the size of the training set, which
may be prohibitive when N is large. In Van Der Maaten [195], an accelerated procedure that involves an approximation of the gradient is proposed. (This procedure is,
however, limited to representations in dimensions 2 or 3.)

22.2.5 Uniform manifold approximation and projection (UMAP)

UMAP is similar in spirit to t-SNE, with a few important differences that result in
a simpler optimization problem and faster algorithms. Like Isomap, the approach
is based on matching distances between the high-dimensional data and the low-
dimensional representation. But while Isomap estimates a unique distance on the
whole training set (the geodesic distance on the nearest-neighbor graph), UMAP
estimates as many “local distances” as observations before “patching” them to form
the final representation.

The goal of transporting possibly non-homogeneous, locally defined objects on
the initial data to a homogeneous low-dimensional visualization is what makes UMAP
similar to t-SNE. The difference is that t-SNE transports local probability distributions, while UMAP transports metric spaces. More precisely, given distances
$(d_{kl},\ k,l = 1,\dots,N)$ and an integer m provided as input, the algorithm builds, for
each $k = 1,\dots,N$, a (pseudo-)metric $\delta^{(k)}$ on the associated data graph by letting
$$\delta^{(k)}(k,l) = \delta^{(k)}(l,k) = \frac{1}{\sigma_k}\Big(d_{kl} - \min_{l'\neq k} d_{kl'}\Big)$$

if l is among the m nearest neighbors of k, with all other distances being infinite. The normalization parameter $\sigma_k$ has a role
similar to that of the corresponding parameter in t-SNE, in that it tends to make the representation homogeneous. Here, it is computed such that
$$\sum_{l} \exp\big(-\delta^{(k)}(k,l)\big) = \log_2 m\,.$$

Each such metric provides a weighted graph structure on $\{1,\dots,N\}$ by defining
weights $w^{(k)}_{ll'} = \exp(-\delta^{(k)}(l,l'))$. In UMAP, these weights are interpreted in the framework of fuzzy sets, where a fuzzy set is defined by a pair $(A,\mu)$, where A is a set and
$\mu$ a function $\mu : A \to [0,1]$ [213]. The function $\mu$ is called the membership function, and $\mu(x)$, for $x \in A$, is the membership strength of x in A. Letting $V = \{1,\dots,N\}$
and $E = V\times V$, one then interprets the weights as defining the membership strength
of edges in the graph, i.e., one defines the “fuzzy graph” $G^{(k)} = (V,E,\mu^{(k)})$ where
$\mu^{(k)}(l,l') = w^{(k)}_{ll'}$ is the membership strength of edge (l,l') in $G^{(k)}$.

This is, of course, just a reinterpretation of weighted graphs in terms of fuzzy
sets, but it allows one to combine the collection $(G^{(k)},\ k = 1,\dots,N)$ using simple fuzzy
set operations, namely, defining the combined (fuzzy) graph $G = (V,E,\mu)$ with
$$(E,\mu) = \bigcup_{k=1}^N (E,\mu^{(k)})$$
being the fuzzy union of the edge sets. There are, in fuzzy logic, multiple ways
to define set unions [85], and the one selected for UMAP defines $(A,\mu)\cup(A',\mu') =
(A\cup A',\nu)$ with $\nu(x) = \mu(x)+\mu'(x)-\mu(x)\mu'(x)$ ($\mu(x)$ and $\mu'(x)$ being defined as 0 if $x\notin A$
or $x\notin A'$, respectively). In UMAP, $\mu^{(k)}(l,l')$ is non-zero only if $k = l$ or $k = l'$, so
that
$$\mu(l,l') = w^{(l)}_{ll'} + w^{(l')}_{ll'} - w^{(l)}_{ll'}w^{(l')}_{ll'}\,.$$

This defines an input fuzzy graph structure on $\{1,\dots,N\}$ that serves as a target for
an optimized similar structure associated with the representation $y = (y_1,\dots,y_N)$.
This representation, since it is designed as a homogeneous representation of the
data, provides a unique fuzzy graph $H(y) = (V,E,\nu(\cdot;y))$ whose edge membership
function is defined by $\nu(l,l';y) = \varphi_{a,b}(y_l,y_{l'})$ with
$$\varphi_{a,b}(y,y') = \frac{1}{1 + a|y-y'|^{b}}\,.$$
The parameters a and b are adjusted so that ϕa,b provides a differentiable approxi-
mation of the function
ψρ0 (y, y 0 ) = exp(− max(0, |y − y 0 | − ρ0 ))
where ρ0 is an input parameter of the algorithm. This function ψρ0 takes the same
form as the membership function defined for local graphs G (k) , and its replacement
by ϕa,b makes possible the use of gradient-based methods for the determination of
the optimal y (ψρ0 is not differentiable everywhere).

The representation y is optimized by minimizing the “fuzzy set cross-entropy”
$$C\big(\mu\,\|\,\nu(\cdot;y)\big) = \sum_{(k,l)\in E}\Bigg(\mu(k,l)\log\frac{\mu(k,l)}{\nu(k,l;y)} + (1-\mu(k,l))\log\frac{1-\mu(k,l)}{1-\nu(k,l;y)}\Bigg)$$

or, equivalently, maximizing (writing, for short, $\varphi = \varphi_{a,b}$)
$$F(y) = \sum_{(k,l)\in E}\big(\mu(k,l)\log\nu(k,l;y) + (1-\mu(k,l))\log(1-\nu(k,l;y))\big) = \sum_{(k,l)\in E}\big(\mu(k,l)\log\varphi(y_k,y_l) + (1-\mu(k,l))\log(1-\varphi(y_k,y_l))\big).$$

Note the important simplification compared to the similar function F in t-SNE,
in that the logarithm of a potentially large sum is avoided. We have
$$\partial_{y_k}F(y) = 2\sum_{l=1}^N \mu(k,l)\,\partial_{y_k}\log\varphi(y_k,y_l) + 2\sum_{l=1}^N(1-\mu(k,l))\,\partial_{y_k}\log(1-\varphi(y_k,y_l))$$
$$= 2\sum_{l=1}^N \mu(k,l)\,\partial_{y_k}\log\frac{\varphi(y_k,y_l)}{1-\varphi(y_k,y_l)} + 2\sum_{l=1}^N \partial_{y_k}\log(1-\varphi(y_k,y_l)).$$

The optimization can be implemented using stochastic gradient ascent. Introduce
random variables $\xi_{kl}$ and $\xi'_{kl}$, both taking values in $\{0,1\}$, all independent of each
other and such that $P(\xi_{kl} = 1) = \mu(k,l)$ and $P(\xi'_{kl} = 1) = \epsilon$. Define
$$H_k(y,\xi,\xi') = 2\sum_{l=1}^N \xi_{kl}\,\partial_{y_k}\log\frac{\varphi(y_k,y_l)}{1-\varphi(y_k,y_l)} + 2c_k\sum_{l=1}^N\sum_{l'=1}^N \xi_{kl}\,\xi'_{kl'}\,\partial_{y_k}\log(1-\varphi(y_k,y_{l'})).$$
Then, if one takes $c_k = 1/(\epsilon\sum_l \mu(k,l))$, one has
$$E\big(H_k(y,\xi,\xi')\big) = \partial_{y_k}F(y).$$

This corresponds to SGA iterations in which:

(1) Each edge (k,l) is selected with probability $\mu(k,l)$ (which is zero unless k
and l are neighbors);
(2) If (k,l) is selected, one selects additional edges (k,l'), each with probability $\epsilon$.

Letting $l_1,\dots,l_m$ denote the additional edges selected, $y_k$ is updated according to
$$y_k \leftarrow y_k + 2\gamma\Bigg(\partial_{y_k}\log\frac{\varphi(y_k,y_l)}{1-\varphi(y_k,y_l)} + c_k\sum_{j=1}^m \partial_{y_k}\log\big(1-\varphi(y_k,y_{l_j})\big)\Bigg).$$
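A direct, unoptimized sketch of these iterations, with the convention $c_k = 1/(\epsilon\sum_l\mu(k,l))$ used above; the names and the O(N²) loop structure are ours:

```python
import numpy as np

def umap_sgd_epoch(Y, mu, eps, gamma, ck, a=1.0, b=1.0, seed=0):
    """One stochastic pass over the fuzzy graph (a sketch of the SGA step).

    mu: (N, N) edge memberships; eps: negative-sampling probability;
    ck[k] = 1 / (eps * sum_l mu[k, l]); phi(y, y') = 1/(1 + a|y - y'|^b)."""
    rng = np.random.default_rng(seed)
    N = Y.shape[0]
    for k in range(N):
        for l in range(N):
            if rng.random() >= mu[k, l]:
                continue                      # edge (k, l) not selected
            d = Y[k] - Y[l]
            r2 = max(d @ d, 1e-12)
            # grad of log(phi/(1-phi)) = -(b / r^2) (y_k - y_l)
            attract = -(b / r2) * d
            repel = np.zeros_like(d)
            # negative samples: each (k, l') kept with probability eps
            for lp in np.flatnonzero(rng.random(N) < eps):
                dp = Y[k] - Y[lp]
                rp2 = max(dp @ dp, 1e-12)
                phi = 1.0 / (1.0 + a * rp2 ** (b / 2))
                # grad of log(1 - phi) = (b phi / r^2) (y_k - y_l')
                repel += (b * phi / rp2) * dp
            Y[k] = Y[k] + 2 * gamma * (attract + ck[k] * repel)
    return Y
```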

Remark 22.3 If one prefers using probability rather than fuzzy set theory, the graphs
$G^{(k)}$ may also be interpreted as random graphs in which edges are added independently of each other and each edge (l,l') is drawn with probability $\mu^{(k)}(l,l')$. The
combined graph G is then the random graph in which (l,l') is present if and only
if it is in at least one of the $G^{(k)}$, and the objective function C coincides with the KL
divergence between this random graph and the random graph similarly defined for
y.

However, this fuzzy/random graph formulation of UMAP (which corresponds
to current practical implementations) is only a special case of the theoretical construction made in McInnes et al. [131], which builds on the theory of (fuzzy) simplicial sets and their representation of metric spaces. We refer the interested reader to
this reference, which requires a mathematical background beyond the scope of these
notes.
Chapter 23

Generalization Bounds

We provide, in this chapter, an introduction to some theoretical aspects of statistical


(or machine) learning, mostly focusing on the derivation of “generalization bounds”
that provide high-probability guarantees on the generalization error of predictors
using training data. While these bounds are not always of practical use, because
making them small in realistic situations would require an enormous amount of
training data, their derivations and the form they take for specific model classes
bring important insight on the structure of the learning problem, and help under-
stand why some methods may perform well while others do not.

23.1 Notation

We here recall some notation introduced in chapter 5. We consider a pair of random
variables (X,Y), with $X : \Omega \to R_X$ and $Y : \Omega \to R_Y$. Regression problems correspond
to $R_Y = \mathbb{R}$ (or $\mathbb{R}^q$ if multivariate) and classification to $R_Y$ being a finite set. A predictor is a function $f : R_X \to R_Y$. The general prediction problem is to find such
a predictor, within a class of functions denoted $\mathcal F$, minimizing the prediction (or
generalization) error
$$R(f) = E\big(r(Y, f(X))\big)$$
where $r : R_Y\times R_Y \to [0,+\infty)$ is a risk function.

A training set is a family T = ((x1 , y1 ), . . . , (xN , yN )) ∈ (RX × RY )N , the set T of all


possible training sets therefore being the set of all finite sequences in RX × RY . A
training algorithm can then be seen as a function A : T → F which associates to
each training set T a function A(T ) = fˆT .


Given $T \in \mathcal T$, the training set error associated with a function $f \in \mathcal F$ is
$$\hat R_T(f) = \frac{1}{|T|}\sum_{(x,y)\in T} r(y, f(x))$$
and the in-sample error associated with a learning algorithm is the function $T \mapsto E_T =
\hat R_T(\hat f_T)$. Fixing the size (N) of T, one also considers the random variable T with
values in $\mathcal T$, distributed as an N-sample of the distribution of (X,Y).

A good learning algorithm should be such that the generalization error $R(\hat f_T)$ is
small, at least on average (i.e., $E(R(\hat f_T))$ is small). Our main goal in this chapter is to
describe generalization bounds providing upper bounds for $R(\hat f_T)$ based on $E_T$
and properties of the function class $\mathcal F$. These bounds will reflect the bias-variance
trade-off in that, even though large function classes provide smaller in-sample errors, they also induce a large additive term in the upper bound, accounting for
the “variance” associated with the class.

Remark 23.1 Both variables X and Y are assumed to be random in the previous
setting, but there are often situations when one of them is “more random” than the
other. Randomness in Y is associated with measurement errors, or ambiguity in the
decision. Randomness in X more generally relates to the issue of sampling a dataset
in a large-dimensional space. In some cases, Y is not random at all: for example,
in object recognition, the question of assigning categories to images such as those
depicted in fig. 23.1 has a quasi-deterministic answer. Sometimes, it is X that is
not random, for example when observing noisy signals where X is a deterministic
discretization of a time interval and Y is some function of X perturbed by noise.

23.2 Penalty-based Methods and Minimum Description Length

23.2.1 Akaike’s information criterion

We make a computation under the following assumptions. We assume a regression
model $Y = f_\theta(X) + \epsilon$, where $\epsilon \sim \mathcal N(0,\sigma^2)$ and $f_\theta$ is some function parametrized by
$\theta \in \mathbb{R}^m$. We also assume that the true distribution is actually covered by this model
and represented by a parameter $\theta_0$. Let $\hat\theta_T$ denote the parameter estimated by least
squares using a training set T, and denote for short $\hat f_T = f_{\hat\theta_T}$.

The in-sample error is
$$E_T = \frac{1}{N}\sum_{k=1}^N \big(y_k - \hat f_T(x_k)\big)^2.$$

Figure 23.1: Images extracted from the PASCAL challenge 2007 dataset [70], in which cate-
gories must be associated with images. There is little ambiguity on correct answers based on
observing the image, i.e., little randomness in the variable Y .

We want to compare the training-set-averaged prediction error and the average in-sample error, namely compute the error bias
$$\Delta_N = E\big(R(\hat f_T)\big) - E(E_T)\,.$$
Write
$$\Delta_N = E\big(R(\hat f_T)\big) - R(f_{\theta_0}) + R(f_{\theta_0}) - E(E_T).$$

We make a heuristic argument to evaluate $\Delta_N$. We can use the fact that $\hat\theta_T$ minimizes the empirical error and write
$$\frac{1}{N}\sum_{k=1}^N \big(Y_k - f_{\theta_0}(X_k)\big)^2 = E_T + \sigma^2(\hat\theta_T-\theta_0)^T J_T(\hat\theta_T-\theta_0) + o(|\hat\theta_T-\theta_0|^2)$$
with
$$J_T = \frac{1}{2\sigma^2 N}\sum_{k=1}^N \partial_\theta^2\big((y_k - f_\theta(x_k))^2\big)\Big|_{\theta=\hat\theta_T},$$
which is an m by m symmetric matrix.

Now, using the fact that $\theta_0$ minimizes the mean square error (since $f_{\theta_0}(x) =
E(Y\,|\,X = x)$), we can write, for any T:
$$R(\hat f_T) = R(f_{\theta_0}) + \sigma^2(\hat\theta_T-\theta_0)^T I(\hat\theta_T-\theta_0) + o(|\hat\theta_T-\theta_0|^2)$$
with
$$I = \frac{1}{2\sigma^2}\,E\Big(\partial_\theta^2\big(Y - f_\theta(X)\big)^2\Big|_{\theta=\theta_0}\Big).$$

As a consequence, we can write (taking expectations in both Taylor expansions)
$$\Delta_N = \sigma^2 E\big((\hat\theta_T-\theta_0)^T J_T(\hat\theta_T-\theta_0)\big) + \sigma^2 E\big((\hat\theta_T-\theta_0)^T I(\hat\theta_T-\theta_0)\big) + o\big(E(|\hat\theta_T-\theta_0|^2)\big).$$
(We skip the hypotheses and justification needed for the analysis of the residual term.)

We now note that, because we are assuming Gaussian noise and that the true
data distribution belongs to the parametrized family, the least-squares estimator is
also a maximum likelihood estimator. Indeed, the likelihood of the data is
$$\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\Bigg(-\frac{1}{2\sigma^2}\sum_{k=1}^N (Y_k - f_\theta(X_k))^2\Bigg)\prod_{k=1}^N \varphi_X(X_k)$$
where $\varphi_X$ is the p.d.f. of X and does not depend on the unknown parameter.

We can therefore apply classical results from mathematical statistics [196]. Under some mild smoothness assumptions on the mapping $\theta\mapsto f_\theta$, $\hat\theta_T$ converges to
$\theta_0$ in probability when N tends to infinity, the matrix $J_T$ converges to I, which
is the model's Fisher information matrix, and $\sqrt{N}(\hat\theta_T-\theta_0)$ converges in distribution to a Gaussian $\mathcal N(0, I^{-1})$. This implies that both $N(\hat\theta_T-\theta_0)^T J_T(\hat\theta_T-\theta_0)$ and
$N(\hat\theta_T-\theta_0)^T I(\hat\theta_T-\theta_0)$ converge to a chi-square distribution with m degrees of freedom, whose expectation is m, which indicates that $\Delta_N$ has order $2\sigma^2 m/N$.

This analysis can be used to develop model selection rules, in which one chooses
between models of dimensions $k_1 < k_2 < \cdots < k_q = m$ (e.g., by truncating the last
coordinates of X). The rule suggested by the previous computation is to select j
minimizing
$$E_T^{(j)}(\hat f_T) + \frac{2\sigma^2 k_j}{N},$$
where $E^{(j)}$ is the in-sample error computed using the $k_j$-dimensional model. This
is an example of a penalty-based method, using the so-called Akaike information
criterion (AIC) [2].
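For linear sub-models with known noise variance, the resulting selection rule can be sketched as follows (a toy implementation; the names and the nested-columns convention are ours):

```python
import numpy as np

def aic_select(models, X, y, sigma2):
    """Pick the model minimizing E_T^{(j)} + 2 sigma^2 k_j / N (a sketch).

    models: list of column-index subsets defining nested linear models,
    e.g. [[0], [0, 1], [0, 1, 2]]; sigma2: assumed-known noise variance."""
    N = len(y)
    scores = []
    for cols in models:
        Xj = X[:, cols]
        coef, *_ = np.linalg.lstsq(Xj, y, rcond=None)
        in_sample = np.mean((y - Xj @ coef) ** 2)     # E_T^{(j)}
        scores.append(in_sample + 2 * sigma2 * len(cols) / N)
    return int(np.argmin(scores)), scores
```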

23.2.2 Bayesian information criterion and minimum description length

Other penalty-based methods are more size-averse and replace the constant 2 in
AIC by a function of $N = |T|$, for example $\log N$. Such a change can be justified by a
Bayesian analysis, yielding the Bayesian information criterion (BIC) [175]. The approach in this case is not based on an evaluation of the error, but on an asymptotic
estimation of the posterior distribution resulting from a Bayesian model selection
principle. As in the previous section, we content ourselves with a heuristic discussion.

Let us consider a statistical model parametrized by θ ∈ Θ, where Θ is an open


convex subset of Rm with p.d.f. given by

f (z; θ) = exp(θ T U (z) − C(θ)) ,

with $U : \mathbb{R}^d \to \mathbb{R}^m$ and $z = (x,y)$. We are given a family of sub-models represented
by $M_1,\dots,M_q$ where, for each j, $M_j$ is the intersection of $\Theta$ with a $k_j$-dimensional
affine subspace of $\mathbb{R}^m$. We are also given a prior distribution for $\theta$ in which a sub-model is first chosen, with probabilities $\alpha_1,\dots,\alpha_q$, and, given that, say, $M_j$ is selected, $\theta \in M_j$ is chosen with a probability distribution with density $\varphi_j$ with respect
to Lebesgue's measure on $M_j$ (denoted $dm_j$). Given training data $T = (z_1,\dots,z_N)$,
Bayesian model selection consists in choosing the model $M_j$ for which j maximizes the
posterior log-likelihood
$$\mu(M_j\,|\,T) = \log\int_{\mathbb{R}^m}\alpha_j\, e^{N(\theta^T\bar U_T - C(\theta))}\,\varphi_j\,dm_j(\theta)$$
where $\bar U_T = (U(z_1)+\cdots+U(z_N))/N$.

Consider the maximum likelihood estimator $\hat\theta_j$ within $M_j$, maximizing $\ell(\theta,\bar U_T) =
\theta^T\bar U_T - C(\theta)$ over $M_j$. Then one has
$$\ell(\theta,\bar U_T) = \ell(\hat\theta_j,\bar U_T) + \frac12(\theta-\hat\theta_j)^T\partial_\theta^2\ell(\hat\theta_j,\bar U_T)(\theta-\hat\theta_j) + R_j(\theta,\hat\theta_j)\,|\theta-\hat\theta_j|^3.$$

Note that the first derivative of $\ell$ is $\partial_\theta\ell = \bar U - E_\theta(U)$, where $E_\theta$ is the expectation for
$f(\cdot;\theta)$. The second derivative is $-\mathrm{var}_\theta(U)$ (showing that $\ell$ is concave) and the third
derivative involves third-order moments of U under $E_\theta$ and (like the second derivative)
does not depend on $\bar U_T$. In particular, we can assume that, for any M > 0, there exists
a constant $C_M$ such that, whenever $\max(|\theta|,|\hat\theta_j|) \le M$, we have $R_j(\theta,\hat\theta_j) \le C_M$.

The law of large numbers implies that $\bar U_T$ converges to a limit when N tends to
infinity, and our assumptions imply that $\hat\theta_j$ converges to the parameter providing
the best approximation of the distribution of Z for the Kullback–Leibler divergence.
In particular, with probability 1, there exists an N such that $\hat\theta_j$ belongs to any large
enough, but fixed, compact set. Moreover, the second derivative $\partial_\theta^2\ell(\hat\theta_j,\bar U_T)$ will also converge to a limit, $-\Sigma_j$.

For any $\epsilon > 0$, write
$$\int_{\mathbb{R}^m}\alpha_j\, e^{N(\theta^T\bar U_T - C(\theta))}\varphi_j\,dm_j(\theta) = \alpha_j\int_{|\theta-\hat\theta_j|\le\epsilon} e^{N(\theta^T\bar U_T-C(\theta))}\varphi_j\,dm_j(\theta) + \alpha_j\int_{|\theta-\hat\theta_j|>\epsilon} e^{N(\theta^T\bar U_T-C(\theta))}\varphi_j\,dm_j(\theta)\,.$$
The second integral converges to 0 exponentially fast when N tends to $\infty$. The first
one behaves essentially like
$$e^{N\ell(\hat\theta_j,\bar U_T)}\int_{M_j} e^{-\frac{N}{2}(\theta-\hat\theta_j)^T\Sigma_j(\theta-\hat\theta_j) + \log\varphi_j(\theta)}\,dm_j(\theta)\,.$$

Neglecting $\log\varphi_j(\theta)$, this integral behaves like $(2\pi/N)^{k_j/2}\det(\Sigma_j)^{-1/2}$, whose logarithm
is $-k_j(\log N)/2$ plus constant terms. As a consequence, we find that
$$\mu(M_j\,|\,T) = N\max_{\theta\in M_j}\ell(\theta,\bar U_T) - \frac{k_j}{2}\log N + \text{bounded terms}.$$

Consider, as an example, linear regression with $Y = \beta_0 + b^TX + \sigma\nu$, where $\nu$ is a
standard Gaussian random variable. Assume that the distribution of X is known or,
preferably, make the previous discussion conditional on $X_1,\dots,X_N$. Let sub-model
$M_j$ correspond to the assumption that all but the first $k_j - 1$ coefficients of b vanish. Then, up to bounded terms, the Bayesian estimator must minimize (over such
parameters b)
$$\frac{1}{2\sigma^2}\sum_{k=1}^N (y_k - \beta_0 - b^Tx_k)^2 + \frac{k_j}{2}\log N,$$
or
$$E_T^{(j)} + \frac{k_j\sigma^2}{N}\log N\,.$$

We now turn to another interesting point of view, which provides the same penalty,
based on the minimum description length principle (MDL; Rissanen [163]), measuring
the coding efficiency of a model.

Let us fix some notation. We assume that one has q competing models for pre-
dicting Y from X, for example, linear regression models based on different subsets
of the explanatory variables. Denote these models M1 , . . . , Mq . Each model will be
seen, not as an assumption on the true joint distribution of X and Y , but rather as
a tool to efficiently encode the training set ((x1 , y1 ), . . . , (xN , yN )). To describe MDL,

which selects the model that provides the most efficient code, we need to reintroduce
a few basic concepts of information theory.

The entropy of a discrete probability P over a set $\Omega$ is
$$H_2(P) = -\sum_{x\in\Omega} p_x\log_2 p_x.$$

(The logarithm in base 2 is used because of the tradition of coding with bits in infor-
mation theory.)

For a discrete random variable X, the entropy $H_2(X)$ is $H_2(P_X)$ where $P_X$ is the
probability distribution of X. The relation between entropy and coding theory
is as follows: a code is a function which associates with any element $\omega \in \Omega$ a string of
bits $c(\omega)$. The associated code length is denoted $l_c(\omega)$, which is simply the number
of bits in $c(\omega)$. When P is a probability on $\Omega$, the efficiency of a code is measured by
the average code length:
$$E_P(l_c) = \sum_{\omega\in\Omega} l_c(\omega)P(\omega).$$

Shannon’s theorem [176, 54] states that, under some conditions on the code (en-
suring that any sequence of words can be recognized as soon as it is observed:
one says that it is instantaneously decodable) the average code length can never
be larger than the entropy of P . Moreover, it states that there exists codes that
achieve this lower bound with no more than one bit loss, such that for all ω, lc (ω) ≤
− log2 (P (ω))+1. These optimal codes, such as the Huffman code [54], can completely
be determined from the knowledge of P . This allows one to interpret a probability P
on Ω as a tool for designing codes with code-lengths essentially equal to (− log2 P ).
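A toy numerical illustration of this correspondence between probabilities and code lengths (not from the text; the dyadic distribution makes the Shannon code exactly optimal):

```python
import numpy as np

# Code lengths ceil(-log2 P(w)) of a Shannon-style code, versus the entropy
# H_2(P): the average length lies within one bit of the lower bound.
P = np.array([0.5, 0.25, 0.125, 0.125])
lengths = np.ceil(-np.log2(P))        # here: 1, 2, 3, 3 bits
H2 = -(P * np.log2(P)).sum()
avg_len = (P * lengths).sum()
print(H2, avg_len)                    # both equal 1.75 for this dyadic P
assert H2 <= avg_len <= H2 + 1
```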

This statement can be generalized to continuous random variables (replacing the


discrete probability P by a probability density function, say ϕ) if one introduces a
coding precision level, denoted δ0 , meaning that the decoded values may differ by
no more than δ0 from the encoded ones. The result is that the optimal code-length
at precision δ0 can be estimated (up to one extra bit) by − log2 ϕ − log2 δ0 .

In our context, each model of the conditional distribution of Y given X, with
conditional density $\varphi(y|x)$, provides a way to encode the training set with a total
code length, for $(y_1,\dots,y_N)$, of
$$-\sum_{k=1}^N \log_2\varphi(y_k\,|\,x_k) - N\log_2\delta_0$$
(working, as before, conditionally on $x_1,\dots,x_N$). We assume that the precision at
which the data is encoded is fixed, which implies that the last term does not affect the

model choice. Now, assume a sequence of m parametrized model classes, $M_1,\dots,M_m$,
and let $\varphi(y\,|\,x;\theta,M_j)$ denote the conditional density with parameter $\theta$ in the
class $M_j$. Within model $M_j$, the optimal code length corresponds to the maximum
likelihood:
$$-\sum_{k=1}^N \log_2\varphi(y_k\,|\,x_k;\hat\theta_j,M_j) = -\max_\theta\Bigg(\sum_{k=1}^N \log_2\varphi(y_k\,|\,x_k;\theta,M_j)\Bigg).$$

If the models are nested, which is often the case, the most efficient will always be
the largest model, since the maximization is on a larger set. However, the minimum
description length (MDL) principle uses the fact that, in order to decode the com-
pressed data, the model, including its optimal parameters, has to be known, so that
the complete code needs to include a model description. The decoding algorithm
will then be: decode the model, then use it to decode the data.

So assume that a model (one of the $M_j$'s) has a $k_j$-dimensional parameter $\theta$. Also
assume that a probability distribution, $\pi(\theta\,|\,M_j)$, is used to encode $\theta$, and choose a
precision level, $\delta_{ij}$, for each coordinate of $\theta$, $i = 1,\dots,k_j$. (Previously, we could consider the precision $\delta_0$ of the $y_k$ as fixed, but now the precision level for the parameters
is a variable that will be optimized.) The total description length using this model
now becomes
$$-\sum_{k=1}^N\log_2\varphi(y_k\,|\,x_k;\theta,M_j) - \log_2\pi(\theta\,|\,M_j) - \sum_{i=1}^{k_j}\log_2(\delta_{ij}).$$

Let $\hat\theta^{(j)}$ be the parameter that maximizes
$$L(\theta\,|\,M_j) = \sum_{k=1}^N\log_2\varphi(y_k\,|\,x_k;\theta,M_j) + \log_2\pi(\theta\,|\,M_j)\,.$$

If $\pi$ is interpreted as a prior distribution on the parameters, $\hat\theta^{(j)}$ is the maximum
a posteriori Bayes estimator. We now take the correction caused by $(\delta_{ij},\ i = 1,\dots,k_j)$
into account, by assuming that the ith coordinate of $\hat\theta^{(j)}$ is truncated to $-\log_2\delta_{ij}$ bits.
Let $\bar\theta^{(j)}$ denote this approximation. A second-order expansion of $L(\theta|M_j)$ around
$\hat\theta^{(j)}$ yields (assuming sufficient differentiability)
$$L(\bar\theta^{(j)}\,|\,M_j) = L(\hat\theta^{(j)}\,|\,M_j) + \frac12(\bar\theta^{(j)}-\hat\theta^{(j)})^T S_{\hat\theta^{(j)}}(\bar\theta^{(j)}-\hat\theta^{(j)}) + o(|\bar\theta^{(j)}-\hat\theta^{(j)}|^2)$$
where $S_\theta$ is the matrix of second derivatives of $L(\cdot\,|\,M_j)$ at $\theta$. Approximating $\bar\theta^{(j)}-\hat\theta^{(j)}$
by $\delta^{(j)}$ (the $k_j$-dimensional vector with coordinates $\delta_{ij}$, $i = 1,\dots,k_j$), we see that the

precision should maximize
$$\frac12(\delta^{(j)})^T S_{\hat\theta^{(j)}}\,\delta^{(j)} + \sum_{i=1}^{k_j}\log_2\delta_{ij}\,.$$

Note that $S_{\hat\theta^{(j)}}$ must be negative semi-definite, since $\hat\theta^{(j)}$ is a local maximum. Assuming
it is non-singular, the previous expression can be maximized, yielding
$$S_{\hat\theta^{(j)}}\,\delta^{(j)} = -\frac{1}{\log 2}\,\frac{1}{\delta^{(j)}} \tag{23.1}$$
where $1/\delta^{(j)}$ is the vector with coordinates $(1/\delta_{ij})$.

Let us now make an asymptotic evaluation. Because $L(\theta\,|\,M_j)$ includes a sum
over N independent terms, it is reasonable to assume that $S_{\hat\theta^{(j)}}$ has order N and,
more precisely, that $S_{\hat\theta^{(j)}}/N$ has a limit. Rewrite (23.1) as
$$\frac{S_{\hat\theta^{(j)}}}{N}\,\sqrt{N}\,\delta^{(j)} = -\frac{1}{\log 2}\,\frac{1}{\sqrt{N}\,\delta^{(j)}}\,.$$

This implies that $\sqrt{N}\,\delta^{(j)}$ is the solution of an equation which stabilizes with N, and
it is therefore reasonable to assume that the optimal $\delta_{ij}$ takes the form $\delta_{ij} = c_i(N\,|\,M_j)/\sqrt{N}$, with $c_i(N\,|\,M_j)$ converging to some limit when N tends to infinity. The
total cost can therefore be estimated by
$$-L(\hat\theta^{(j)}\,|\,M_j) + \frac{k_j}{2}\log_2 N - \frac{k_j}{2} - \sum_{i=1}^{k_j}\log_2 c_i(N\,|\,M_j)\,.$$

The last two terms are O(1) and can be neglected, at least when N is large compared
to $k_j$. The final criterion becomes the penalized likelihood
$$l_d(\theta\,|\,M_j) = L(\theta\,|\,M_j) - \frac{k_j}{2}\log_2 N,$$
in which we see that the dimension of the model appears with a factor $\log_2 N$, as
announced (one needs to normalize both terms by N to compare with the previous
paragraph).

23.3 Concentration inequalities

The discussion of the AIC was a first attempt at evaluating a prediction error. It was,
however, done under very specific parametric assumptions, including the fact that
the true distribution of the data was within the considered model class. It was, in
addition, a bias evaluation, i.e., we estimated by how much, on average, the in-sample
error was less than the generalization error. We would like to obtain upper bounds on
the generalization error that hold with high probability and rely as little as possible
on assumptions on the true data distribution.

One of the main tools used in this context is concentration inequalities, which
provide upper bounds on the probabilities of events involving a large number of random variables. The current section provides a review of some of these
inequalities.

23.3.1 Cramér’s theorem

If $X_1, X_2,\dots$ are independent, integrable random variables with identical distributions (that of a random variable X), the law of large numbers tells us that the
empirical mean $\bar X_N = (X_1+\cdots+X_N)/N$ converges with probability one to $m = E(X)$.
When the variables are square integrable, Chebyshev's inequality provides an easy
proof of the weak law of large numbers. Indeed,
$$P\big(|\bar X_N - m| > \epsilon\big) \le \frac{1}{\epsilon^2}E\big((\bar X_N - m)^2\mathbf{1}_{|\bar X_N-m|>\epsilon}\big) \le \frac{\mathrm{var}(\bar X_N)}{\epsilon^2} = \frac{\mathrm{var}(X)}{N\epsilon^2}\,.$$

A stronger assumption on the moments of X yields a stronger inequality. One
says that X has exponential moments if there exists $\lambda_0 > 0$ such that $E(e^{\lambda_0|X|}) < \infty$. In
this case, the cumulant-generating function, defined, for $\lambda \in \mathbb{R}$, by
$$M_X(\lambda) = \log E(e^{\lambda X}) \in (-\infty, +\infty], \tag{23.2}$$
is finite for $\lambda \in [-\lambda_0, \lambda_0]$.

Here are a few straightforward properties of the cumulant-generating function.

(i) One has MX (0) = 0.


(ii) For any a ∈ R, one has MaX (λ) = MX (aλ).
(iii) If X1 and X2 are independent variables, one also has
MX1 +X2 (λ) = MX1 (λ) + MX2 (λ).
In particular, MX+a (λ) = MX (λ) + λa, so that MX−E(X) (λ) = MX (λ) − λE(X).
(iv) Finally, Markov’s inequality (which states that, for any non-negative variable Y ,
P (Y > t) ≤ E(Y )/t) applied to Y = eλX for λ > 0 yields

P(X > t) = P(eλX > eλt ) ≤ eMX (λ)−λt . (23.3)


(Note that this inequality is trivially true for λ = 0.)

From these properties, one can easily derive a concentration inequality for the
mean of independent random variables. We have $M_{\bar X_N}(\lambda) = NM_X(\lambda/N)$ and, applying (23.3), we get, for any $\lambda \ge 0$ and $t > 0$,
$$P(\bar X_N - m > t) \le e^{-\lambda(m+t)+M_{\bar X_N}(\lambda)} = e^{-N\big(\frac{\lambda}{N}(m+t) - M_X(\frac{\lambda}{N})\big)}$$
where the right-hand side may be infinite. Because this inequality is true for any $\lambda$,
we have
$$P(\bar X_N - m > t) \le e^{-NM_{X,+}(m+t)}$$
where $M_{X,+}(u) = \sup_{\lambda\ge0}(\lambda u - M_X(\lambda))$, which is non-negative since the maximized
quantity vanishes for $\lambda = 0$. A symmetric computation yields
$$P(\bar X_N - m < -t) \le e^{-NM_{X,-}(m-t)}$$
where $M_{X,-}(u) = \sup_{\lambda\le0}(\lambda u - M_X(\lambda))$, which is also non-negative.

Let
$$M_X^*(t) = \sup_{\lambda\in\mathbb{R}}\big(\lambda t - M_X(\lambda)\big) \ge 0 \tag{23.4}$$
(this is the Fenchel–Legendre transform of the cumulant-generating function, sometimes called the Cramér transform of X). One has $M_X^*(m+t) = M_{X,+}(m+t)$ for $t > 0$.
Indeed, because $x\mapsto e^{\lambda x}$ is convex, Jensen's inequality implies that
$$E(e^{\lambda X}) \ge e^{\lambda m}$$
so that $\lambda(m+t) - M_X(\lambda) \le \lambda t < 0$ if $\lambda < 0$. Similarly, $M_X^*(m-t) = M_{X,-}(m-t)$ for $t > 0$.
We therefore have the following result.

Theorem 23.2 Let $X_1,\dots,X_N$ be independent and identically distributed random variables. Assume that these variables are integrable and let $m = E(X_1)$. Then, for all $t > 0$,
$$P(\bar X_N - m > t) \le e^{-NM_X^*(m+t)}$$
and
$$P(|\bar X_N - m| > t) \le 2e^{-N\min(M_X^*(m+t),\,M_X^*(m-t))}\,.$$

The last inequality derives from
$$P(|\bar X_N - m| > t) = P(\bar X_N - m > t) + P(\bar X_N - m < -t)\,.$$
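As an illustration (not from the text), the Cramér transform of a Bernoulli variable can be evaluated numerically and compared with its known closed form, a Kullback–Leibler divergence between Bernoulli parameters:

```python
import numpy as np

# Cramer transform of a Bernoulli(q) variable, evaluated numerically:
# M_X*(t) = sup_lambda (lambda t - log(1 - q + q e^lambda)) equals
# t log(t/q) + (1-t) log((1-t)/(1-q)) for t in (0, 1).
q, t = 0.3, 0.5
lam = np.linspace(-20, 20, 200001)
M = np.log(1 - q + q * np.exp(lam))          # cumulant-generating function
M_star = np.max(lam * t - M)                 # numerical supremum on a grid
kl = t * np.log(t / q) + (1 - t) * np.log((1 - t) / (1 - q))
assert abs(M_star - kl) < 1e-4
# Theorem 23.2: P(empirical mean of N samples > t) <= exp(-N M_X*(t)).
print(np.exp(-50 * M_star))                  # bound for N = 50
```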

This is our first example of a concentration inequality. It shows that, when
$$\min\big(M_X^*(m+t),\, M_X^*(m-t)\big) > 0,$$
the probability of a deviation of $\bar X_N$ from its mean by at least t decays exponentially fast. The derivation of the inequality above was quite easy: apply Markov's
inequality in a parametrized form and optimize over the parameter. It is therefore
surprising that this inequality is sharp, in the sense that a similar lower bound also
holds. Even though we are not going to use it in the rest of this chapter, it is worth
sketching the argument leading to this lower bound, which involves an interesting
step making a change of measure.

Assume (without loss of generality) that $m = 0$ and consider $P(\bar X_N > t)$. Assume,
to simplify the discussion, that the supremum of $\lambda \mapsto \lambda t - M_X(\lambda)$ is attained at some
$\lambda_t$. We have
$$\partial_\lambda M_X(\lambda) = \frac{E(Xe^{\lambda X})}{E(e^{\lambda X})}\,.$$
Let $q_\lambda(x) = \frac{e^{\lambda x}}{E(e^{\lambda X})}$ and $P_\lambda$ (with expectation $E_\lambda$) the probability distribution on $\Omega$
with density $q_\lambda(X)$ with respect to P, so that $\partial_\lambda M_X(\lambda) = E_\lambda(X)$. We have, since $\lambda_t$ is
a maximizer, $E_{\lambda_t}(X) = t$. Moreover, fixing $\delta > 0$ and taking $\lambda \ge 0$,
$$P(\bar X_N > t) = E(\mathbf{1}_{\bar X_N > t}) \ge E(\mathbf{1}_{|\bar X_N-t-\delta|<\delta}) \ge E\big(\mathbf{1}_{|\bar X_N-t-\delta|<\delta}\, e^{N\lambda\bar X_N - N\lambda(t+2\delta)}\big) = e^{-N\lambda(t+2\delta)+NM_X(\lambda)}\, P_\lambda\big(|\bar X_N - t - \delta| < \delta\big)\,.$$
If one takes $\lambda = \lambda_{t+\delta}$, this implies that
$$P(\bar X_N > t) \ge e^{-NM_X^*(t+\delta)}\, e^{-N\lambda_{t+\delta}\delta}\, P_{\lambda_{t+\delta}}\big(|\bar X_N - t - \delta| < \delta\big)\,.$$
By the law of large numbers (applied under $P_{\lambda_{t+\delta}}$), $P_{\lambda_{t+\delta}}(|\bar X_N - t - \delta| < \delta)$ tends to 1 when
N tends to infinity. This implies that $P(\bar X_N > t)$ decays no faster than $e^{-N(M_X^*(t+\delta)+\lambda_{t+\delta}\delta)}$, for any $\delta > 0$, to be compared with the upper bound $e^{-NM_X^*(t)}$. In large deviation theory, the upper and lower
bounds are often combined by considering the limit of $-\log P(\bar X_N > t)/N$, which, in
this case, is $M_X^*(t)$ (and this result is called Cramér's theorem).

While Cramér’s upper bound is sharp, its computation requires an exact knowl-
edge of the distribution of X, which is not a common situation. The following sec-
tions optimize the upper bound in situations where only partial information on the
variable is known, such as its moments or its range. As a first example, we consider
concentration of the mean for sub-Gaussian variables.

23.3.2 Sub-Gaussian variables

If X has exponential moments, then (applying again Markov's inequality)
$$P(|X| > x) \le Ce^{-\lambda x}$$

for some positive constants C and $\lambda$. Reducing if needed the value of $\lambda$, one can
assume that C takes some predetermined (larger than 1) value, say C = 2, the simple
argument being left to the reader. A random variable such that, for some $\lambda > 0$,
$$P(|X| > x) \le 2e^{-\lambda x}$$
is called sub-exponential (and this property is equivalent to X having exponential
moments). Similarly, one says that X is sub-Gaussian if, for some $\sigma > 0$,
$$P(|X| > x) \le 2e^{-\frac{x^2}{2\sigma^2}}. \tag{23.5}$$

Sub-Gaussian random variables are such that $M_X(\lambda) < \infty$ for all $\lambda \in \mathbb{R}$. Indeed, for
$\lambda > 0$,
$$E(e^{\lambda|X|}) = \int_0^\infty P(e^{\lambda|X|} > z)\,dz = 1 + \int_1^\infty P(|X| > \lambda^{-1}\log z)\,dz \le 1 + 2\int_1^\infty e^{-\frac{(\log z)^2}{2\sigma^2\lambda^2}}\,dz \le 1 + 2\int_0^\infty e^{x - \frac{x^2}{2\lambda^2\sigma^2}}\,dx \le 1 + 2\sqrt{2\pi}\,\lambda\sigma\, e^{\frac{\lambda^2\sigma^2}{2}}\,.$$

Proposition 23.3 Assume that X is sub-Gaussian, so that (23.5) holds for some $\sigma^2 > 0$.
Then, for any $t > 0$, we have
$$P\big(\bar X_N - E(X) > t\big) \le \Big(1 + \frac{4t^2}{\sigma^2}\Big)^N e^{-\frac{Nt^2}{2\sigma^2}}\,.$$

Proof Let us assume, without loss of generality, that E(X) = 0. For $\lambda > 0$, we then
have
$$E(e^{\lambda X}) = 1 + E(e^{\lambda X} - \lambda X - 1)\,.$$
Let $\varphi(t) = e^t - t - 1$. We have $\varphi(t) \ge 0$ for all t, $\varphi(0) = 0$ and, for $z > 0$, the equation
$z = \varphi(t)$ has two solutions, one positive and one negative, which we will denote $g_+(z) >
0 > g_-(z)$. We have
$$E(\varphi(\lambda X)) = \int_0^\infty P(\varphi(\lambda X) > z)\,dz = \int_0^\infty P(\lambda X > g_+(z))\,dz + \int_0^\infty P(\lambda X < g_-(z))\,dz\,.$$
598 CHAPTER 23. GENERALIZATION BOUNDS

The change of variable u = g+ (z) in the first integral is equivalent to u > 0, ϕ(u) = z
with dz = (eu −1)du. Similarly, u = −g− (z) in the second integral gives u > 0, ϕ(−u) = z
and dz = (1 − e−u )du so that
Z∞ Z∞
u
E(ϕ(λX)) = P(λX > u)(e − 1)du + P(λX < −u)(1 − e−u )du
Z0∞ 0

≤ P(λ|X| > u)(eu − e−u )du.


0

(Using the fact that max(P(λX > u), P(λX < −u)) ≤ P(λ|X| > u).) We have
Z∞ Z +∞ 2
u − u
−u
P(λ|X| > u)(e − e )du ≤ 2 (eu − e−u )e 2λ2 σ 2 du
0 0
Z +∞
v2
= 2λσ (eλσ v − e−λσ v )e− 2 dv
0
λ2 σ 2 √
= 2λσ e 2 2π(Φ(−σ λ) − Φ(σ λ))
λ2 σ 2
≤ 4λ2 σ 2 e 2

where Φ is the cumulative distribution


√ function of the standard Gaussian and we
have used Φ(−t) − Φ(t) ≤ 2t/ 2π. We therefore have

λ2 σ 2 λ2 σ 2
 
MX (λ) ≤ log 1 + 4λ2 σ 2 e 2 ≤ + log(1 + 4λ2 σ 2 ) .
2
This implies

t2 2 t2 4t 2
MX∗ (t) = sup(λt − MX (λ)) ≥ − MX (t/σ ) ≥ − log(1 + )
λ>0 σ2 2σ 2 σ2

so that !N
4t 2 − N t2
2
P(X̄n > t) ≤ 1 + 2 e 2σ .
σ 

The following result allows one to control the expectation of a non-negative sub-
Gaussian random variable.

Proposition 23.4 Let X be a non-negative random variable such that


2 /2σ 2
P(X > t) ≤ Ce−t

for some constants C and σ 2 . Then,


p
E(X) ≤ 3σ log C.
23.3. CONCENTRATION INEQUALITIES 599

Proof For any α ∈ (1, C], one has


t 2 log α
−t 2 /2σ 2 −
min(1, Ce ) ≤ αe 2σ 2 log C ,
which implies that
+∞
α √
Z p
E(X) = P(X > t)dt ≤ 2πσ log C
0 2 log α

Taking α = e gives
√ p p
E(X) ≤ πeσ log C ≤ 3σ log C. 

23.3.3 Bennett’s inequality

The following proposition (see [24]) provides an upper bound for MX (λ) as a func-
tion of E(X) and var(X) under the additional assumption that X is bounded from
above.
Proposition 23.5 Let m = E(X) and assume that for some constant b, one has X ≤ b with
probability one. Then, for any σ 2 > 0 such that var(X) ≤ σ 2 , one has
(b − m)2 σ2
!
λσ 2
λX λm − (b−m) λ(b−m)
E(e ) ≤ e e + e (23.6)
(b − m)2 + σ 2 (b − m)2 + σ 2
for any λ ≥ 0.
Proof There is no loss of generality in assuming that m = 0 and λ = 1, in which case
one must show that
b2 − σb
2 σ2
E(eX ) ≤ 2 e + eb (23.7)
b + σ2 b2 + σ 2
if X < b and E(X 2 ) ≤ σ 2 . Indeed, if this inequality is true for m = 0 and λ = 1, (23.6)
in the general case will result from letting X = Y /λ + m and applying the special case
to Y .

The right-hand side of (23.7) is exactly E(eX ) when X follows the discrete distri-
bution P0 supported by two points x0 and b, and such that E(X) = 0 and E(X 2 ) = σ 2 ,
which requires x0 = −σ 2 /b and P (X = x0 ) = b2 /(σ 2 + b2 ).

Now consider the quadratic function v(x) = αx2 + βx + γ which intersects x 7→ ex


at x = x0 and x = b, and is tangent to it at x = x0 , i.e., v(b) = eb and v(x0 ) = v 0 (x0 ) = ex0
(this uniquely defines v). Then ex ≤ v(x) for x < b, yielding
E(eX ) ≤ ασ 2 + γ.
However, since v(X) = eX almost surely when X ∼ P0 , this upper bound is attained
and equal to that provided in (23.7). 
600 CHAPTER 23. GENERALIZATION BOUNDS

If F(λ) denotes the right-hand side of (23.6), we have, for m ≤ u < b,

MX∗ (t) ≥ sup(λu − log F(λ))


λ≥0

and we now estimate this lower bound. Maximizing λy − log F(λ) is equivalent to
minimizing
λ(σ 2 +(u−m))
(b − m)2 e− b−m + σ 2 eλ(b−u)
λ 7→ .
(b − m)2 + σ 2
Introduce the notation ρ = σ 2 /(b − m)2 , µ = λ(b − m) and x = (u − m)/(b − m), so that
the function to minimize is
e−µ(ρ+x) + ρeµ(1−x)
µ 7→ .
1+ρ
Computing the derivative in µ and equating it to 0 gives
1 ρ+x
µ= log ,
1+ρ ρ(1 − x)
which is non-negative since ρ + x − ρ(1 − x) = (1 + ρ)x. For this value of µ, we have

e−µ(ρ+x) + ρeµ(1−x) 1 + ρeµ(1+ρ)


= e−µ(ρ+x)
ρ+1 ρ+1
ρ+x
1 + ρ ρ(1−x)
= e−µ(ρ+x)
ρ+1
e−µ(ρ+x)
=
1−x
and
e−µ(ρ+x) + ρeµ(1−x)
− log = µ(ρ + x) + log(1 − x)
ρ+1
ρ+x ρ+x
= log + log(1 − x)
1+ρ ρ(1 − x)
ρ+x ρ+x 1−x
= log + log(1 − x) .
1+ρ ρ 1+ρ
This provides a lower bound for MX∗ (m + (b − m)x), and yields the following corollary.
Corollary 23.6 Assume that X satisfy the conditions of proposition 23.5. Then
!!
ρ+x ρ+x 1−x
P(X̄N > m + t) ≤ exp −N log + log(1 − x) (23.8)
1+ρ ρ 1+ρ

with x = t/(b − m) and ρ = σ 2 /(b − m)2 .


23.3. CONCENTRATION INEQUALITIES 601

Bennett’s inequality is sometimes stated in a slightly weaker, but simpler form


[128]. Returning to the proof of proposition 23.5 and using the fact that log u ≤ u −1,
equation (23.7) implies

X b2 − σb
2 σ2
log E(e ) ≤ 2 2
e + 2 2
eb − 1
b +σ b +σ
b2 σ2 σ2 σ2
= 2 (e − b
+ − 1) + (eb − b − 1).
b + σ2 b b2 + σ 2
We will use the following lemma.

Lemma 23.7 The function ϕ : u 7→ (eu − u − 1)/u 2 is non-decreasing.

Proof We have ϕ 0 (u) = ψ(u)/u 3 where ψ(u) = ueu − 2eu + u + 2, yielding ψ 0 (u) =
ueu − eu + 1, ψ 00 (u) = ueu . Therefore, ψ 0 is has its minimum at u = 0 with ψ 0 (0) = 0 so
that ψ is increasing. Since ψ(0) = 0, we have ψ(u)/u 3 ≥ 0. 

We therefore have
b2 − σb
2 σ2 σ2
log E(eX ) ≤ (e + − 1) + (eb − b − 1)
b2 + σ 2 b b2 + σ 2
b2 σ 4 2 σ2
= 2 ϕ(−σ /b) + b2 ϕ(b)
b + σ 2 b2 b 2 + σ2

σ4 σ 2 b2
!
≤ 2 + ϕ(b)
b + σ 2 b2 + σ 2
σ2 b
= (e − b − 1)
b2
This shows that
σ 2 λb
log E(eλX ) ≤ (e − λb − 1)
b2
and
σ2 2 2 σ2
MX∗ (t) ≥ max (λb t/σ − e λb
+ λb + 1) = h(bt/σ 2 )
b2 λ b2
where h(u) = (1 + u) log(1 + u) − u.

We summarize this in the following corollary.

Corollary 23.8 Assume that X satisfy the conditions of proposition 23.5. Then, for t > 0,

Nσ2
!
(b − m)t

P(X̄N > m + t) ≤ exp − h (23.9)
(b − m)2 σ2

where h(u) = (1 + u) log(1 + u) − u.


602 CHAPTER 23. GENERALIZATION BOUNDS

This estimate can be further simplified as follows. Let g be such that g 00 (u) =
(1 + u/3)−3 and g(0) = g 0 (0) = 0, which gives g(u) = u 2 /(2 + 2u/3). Noting that h00 (u) =
(1 + u)−1 and that (1 + u)−1 ≥ (1 + u/3)−3 , for u ≥ 0 we find, integrating twice, that
h(u) ≥ g(u) for u ≥ 0. This shows that the following upper-bound is also true:

N t2
!
P(X̄N > m + t) ≤ exp − 2 . (23.10)
2σ + 2t(b − m)/3

This upper bound is known as Bernstein’s inequality.


Remark 23.9 It should be clear that, in the previous discussion, one may relax the
assumption that X1 , . . . , XN are identically distributed as long as there is a common
function M such that MXk (λ) ≤ mk + M(λ) for all k, with mk = E(Xk ). We have in this
case
P(X̄N > m̄N + t) ≤ exp(−N M ∗ (t))
with m̄N = (m1 + · · · + mN )/N and M ∗ (t) = supλ (λt − M(λ)). This remark can be,
in particular, applied to the situation in which X1 , . . . , XN satisfy the conditions of
proposition 23.5 with the same constants b and σ 2 , yielding the same upper bound
as in equation (23.8). 

23.3.4 Hoeffding’s inequality

We now consider the case in which the random variables X1 , . . . , XN are bounded
from above and from below, and start with the following consequence of proposi-
tion 23.5.
Proposition 23.10 Let X be a random variable taking values in the interval [a, b]. Let
m = E(X). Then
b − m λa m − a λb λ2 (b−a)2
E(eλX ) ≤ e + e ≤ eλm e 8 (23.11)
b−a b−a
for all λ ∈ R.
Proof We first note that, if X takes values in [a, b], then var(X) ≤ (b −m)(m−a) (using
σ 2 = (b − m)(m − a) in (23.6)). To prove the upper bound on the variance, introduce
the function g(x) = (x − a)(x − b) so that g(x) ≤ 0 on [a, b]. Noting that one can write
g(x) = (x − m)2 + (2m − a − b)(x − m) + (a − m)(b − m), we have

E(g(X)) = var(X) − (b − m)(m − a) ≤ 0,

which proves the inequality.

This shows that, if λ ≥ 0, we can apply proposition 23.5 with σ 2 = (b − m)(m − a),
which provides the first inequality in (23.11). To handle the case λ ≤ 0, it suffices to
apply this inequality with λ̃ = −λ, X̃ = −X, ã = −b, b̃ = −a and m̃ = −m.
23.3. CONCENTRATION INEQUALITIES 603

The second inequality, namely


!
b − m λa m − a λb λ2 (b−a)2
e + e ≤ eλm e 8
b−a b−a

requires a little additional work. Letting u = (m − a)/(b − a), α = λ(b − a) and taking
logarithms, we need to prove that

α2
log(1 − u + ueα ) − uα ≤
8
Let f (α) denote the difference between the right-hand side and left-hand side. Then
f (0) = 0,
α ueα
f 0 (α) = − + u,
4 1 − u + ueα
(so that f 0 (0) = 0) and
1 u(1 − u)eα
f 00 (α) = − .
4 (1 − u + ueα )2
For positive numbers x = 1 − u and y = ueα , one has (x + y)2 ≥ 4xy, which shows that
f 00 (α) ≥ 0. This proves that f 0 is non-decreasing with f 0 (0) = 0, proving that f is
minimized at α = 0, so that f (α) ≥ 0 as needed. 

We can then deduce the following theorem [92].

Corollary 23.11 (Hoeffding Inequality) If X1 , . . . , XN are independent, taking values,


respectively, in intervals of length, c1 , . . . , cN and Y = X1 + · · · + XN , then

2t 2
!
P(Y > E(Y ) + t) ≤ exp − 2 (23.12)
|c|

and
2t 2
!
P(Y < E(Y ) − t) ≤ exp − 2 (23.13)
|c|
PN 2
where |c|2 = k=1 ck .

Proof We have, by proposition 23.10, for any λ > 0


 
− λt− N λ2 2
P
k=1 MXk (λ) 8 |c| )
P(Y > E(Y ) + t) ≤ e ≤ e−(λt−

The upper bound is minimized for λ = 4t/|c|2 , yielding (23.12). Equation (23.13) is
obtained by applying (23.12) to −X. 
604 CHAPTER 23. GENERALIZATION BOUNDS

An important special case of this inequality is when X1 , . . . , XN are i.i.d. taking


values in an interval of length δ. Then

2N t 2
!
P(X̄N > E(X) + t) ≤ exp − 2 . (23.14)
δ

This inequality is obtained after applying Hoeffding’s inequality to X1 /N , . . . , XN /N ,


therefore taking c1 = · · · = cN = δ/N and |c|2 = δ2 /N .

23.3.5 McDiarmid’s inequality

One can relax the assumption that the random variables X1 , . . . , XN are independent
and only assume that these variables behave like “martingale increments,” as stated
in the following proposition [59].

Proposition 23.12 Let X1 , . . . , XN , Z1 , . . . , ZN be two sequences of N random variables


such that
E(Zk | X1 , Z1 , . . . , Xk−1 , Zk−1 ) = mk
is constant and |Zk − mk | ≤ ck for some constants c1 , . . . , cN . Then
2 /|c|2
P(Y > E(Y ) + t) ≤ e−2t
PN 2
with Y = Z1 + · · · + ZN and |c|2 = k=1 ck .

Proof Proposition 23.10 applied to the conditional distribution implies that, for
λ ≥ 0:

λ2 ck2
log E(eλ(Zk −mk ) | X1 , Z1 , . . . , Xk−1 , Zk−1 ) ≤ log E(eλ|Zk −mk | | X1 , Z1 . . . , Xk−1 , Zk−1 ) ≤ .
8
Pk
Let Sk = j=1 (Zj − mj ). Then

λ2 c2
λSk λSk−1 λ(Zk −mk ) k
E(e ) = E(e E(e | X1 , Z1 , . . . , Xk−1 , Zk−1 )) ≤ e 8 E(eλSk−1 )

so that
λ2 PN 2
E(eλSN ) ≤ e 8 k=1 ck

and the result follows from Markov’s inequality optimized over λ. 

We will use this proposition to prove the “bounded difference,” or McDiarmid’s


inequality.
23.3. CONCENTRATION INEQUALITIES 605

Theorem 23.13 (McDiarmid’s inequality) Let X1 , . . . , XN be independent random vari-


ables and g : RN → R a function such that there exists c1 , . . . , cN such that

|g(x1 , . . . , xk−1 , xk , xk+1 , . . . , xN ) − g(x1 , . . . , xk−1 , x̃k , xk+1 , . . . , xN )| ≤ ck (23.15)

for all k = 1, . . . , N and x1 , . . . , xk−1 , xk , x̃k , xk+1 , . . . , xN . Then


2 /|c|2
P (g(X1 , . . . , XN ) > E(g(X1 , . . . , XN )) + t) ≤ e−2t

with |c|2 = c12 + · · · + cN


2
.

Proof Let m = E(g(X1 , . . . , XN )). Let Z0 = 0,

Yk = E(g(X1 , . . . , XN ) | X1 , . . . , Xk ) − m

and Zk = Yk − Yk−1 . Note that Zk is a function of X1 , . . . , Xk and can therefore be


omitted from the conditional expectation given (X1 , Z1 , . . . , Xk−1 , Zk−1 ).

We have E(Yk ) = 0 and E(Yk | X1 , . . . , Xk−1 ) = Yk−1 so that E(Zk | X1 , . . . , Xk−1 ) = 0.


Because the variables are independent, we have, letting X̃1 , . . . , X̃N be independent
copies of X1 , . . . , XN ,

Zk = E(g(X1 , . . . , Xk−1 , Xk , X̃k+1 , . . . , X̃N ) | X1 , . . . , Xk )


− E(g(X1 , . . . , Xk−2 , X̃k−1 , X̃k , . . . , X̃N ) | X1 , . . . , Xk−1 ) .

For fixed X1 , . . . , Xk−1 , (23.15) implies that Zk varies in an interval of length ck at most
(whose bounds depend on X1 , . . . , Xk−1 ) so that |Zk − E(Zk )| ≤ ck . Proposition 23.12
implies that
2 2
P(Z1 + · · · + ZN ≥ t) ≤ e−2t /|c| ,
which concludes the proof since

Z1 + · · · + ZN = g(X1 , . . . , XN ) − E(g(X1 , . . . , XN )). 

23.3.6 Boucheron-Lugosi-Massart inequality

The following result [37], that we state without proof, extends on the same idea.

Theorem 23.14 Let X1 , . . . , XN be independent random variables. Let

Z = g(X1 , . . . , XN )

with g : RN → [0, +∞) and for k = 1, . . . , N ,

Zk = gk (X1 , . . . , Xk−1 , Xk+1 , . . . , XN )


606 CHAPTER 23. GENERALIZATION BOUNDS

with gk : RN −1 → R. Assume that, for all k = 1, . . . , N , one has 0 ≤ Z − Zk ≤ 1 and that


N
X
(Z − Zk ) ≤ Z.
k=1

Then
t2
!
P(Z − E(Z) > t) ≤ exp(−E(Z)h(t/ E(Z))) ≤ exp −
2E(Z) + 2t/3
where h(u) = (1 + u) log(1 + u) − u. Moreover, for t < E(Z),

P(Z − E(Z) < −t) ≤ exp(−E(Z)h(−t/ E(Z))) ≤ exp(−t 2 /2E(Z)) .

Finally, for all λ ∈ R

log E(eλ(Z−E(Z)) ) ≤ E(Z)(eλ − λ − 1). (23.16)

23.4 Bounding the empirical error with the VC-dimension

23.4.1 Introduction

Section 23.3 provides some of the most important inequalities used to evaluate the
deviation of various combinations of independent random variables (e.g., their em-
pirical mean) from their expectations (the reader may refer to Ledoux and Talagrand
[118], Devroye et al. [60], Talagrand [189], Dembo and Zeitouni [59], Vershynin [201]
and other textbooks on the subject for further developments).

We now return to the problem of estimating the generalization error based on


training data. For a given predictor f , concentration bounds allow us to control the
probability
P(R(f ) − R̂T (f ) > t)
where
R(f ) = E(r(Y , f (X))
and
N
1X
R̂T (f ) = r(yk , f (xk ))
N
k=1

for a training set T = (x1 , y1 , . . . , xN , yN ).

If this probability is small, then R(f ) ≤ R̂T (f ) + t with high probability, providing
a likely upper bound to the generalization error of f . For example, if r is the 0–1
23.4. BOUNDING THE EMPIRICAL ERROR WITH THE VC-DIMENSION 607

loss in a classification problem, Hoeffding’s inequality implies, for training sets of


size N ,
2
P(R(f ) − R̂T (f ) > t) ≤ e−2N t .

Now corollary 23.11 does not hold if we replace f by fˆT , i.e., if f is estimated
from the training set T , which is, unfortunately, the situation we are interested in.
Before addressing this problem, we point out that this inequality does apply to the
case in which f = fˆT0 where T0 is another training set, independent from T , so that
2
P(R(fˆT0 ) − R̂T (fˆT0 ) > t) ≤ e−2N t ,

which is proved by writing

P(R(fˆT0 ) − R̂T (fˆT0 ) > t) = E(P(R(fˆT ) − R̂T (fˆT ) |> t|T0 = T )) .

In this situation, the empirical risk is computed on a test or validation set (T ) inde-
pendent of the set used to estimate f (T0 ).

If one does not have a test set, and fˆT is optimized over a set F of possible pre-
dictors, one can rarely do much better than starting from a variation of the trivial
upper bound  
ˆ
P(R(fT ) − ET > t) ≤ P sup(R(f ) − ET (f )) > t
f ∈F

(with ET = R̂T (fˆT )) and the concentration inequalities discussed in section 23.3 need
to be extended to provide upper bounds to the right-hand side.

Remark 23.15 Computing supremums of functions over non countable sets may
bring some issues regarding measurability. To avoid complications, we will always
assume, when computing supremums over infinite sets, that such supremums can
be reduced to maximizations over finite sets, i.e., when considering supf ∈F Φ(f ) for
some function Φ, we will assume that there exists a nested sequence of finite subsets
Fn ⊂ F such that

sup{Φ(f ) : f ∈ F } = lim sup{Φ(f ) : f ∈ Fn } . (23.17)


n→∞

This is true, for example, when F has a topology that admits a countable dense
subset, with respect to which Φ is continuous. 

When F is a finite set, one can use a ”union bound” with


X
P(sup(R(f ) − ET (f )) > t) ≤ P(R(f ) − ET (f ) > t) ≤ |F | max P(R(f ) − ET (f ) > t).
f ∈F f ∈F
f ∈F
608 CHAPTER 23. GENERALIZATION BOUNDS

Such bounds cannot be applied to the typical case in which F is infinite, and is
likely to provide very poor estimates even when F is finite, but |F | is large. How-
ever, all proofs of concentration inequalities applied to such supremums require us-
ing a union bound at some point, often after considerable preparatory work. Union
bounds will in particular appear in conjunction with the Vapnik-Chervonenkis di-
mension that we now discuss.

23.4.2 Vapnik’s theorem

We consider a classification problem with two classes, 0 and 1, and therefore let F
be a set of binary functions, i.e., taking values in {0, 1}. We also assume that the risk
function r takes values in the interval [0, 1] (using, for example, the 0–1 loss). Let
 
U (t) = P sup(R(f ) − ET (f )) > t . (23.18)
f ∈F

A fundamental theorem of Vapnik provides an estimate of U (t) based on the


number of possible ways to split a training set of 2N points into two classes using
functions in F . The rest of this section is devoted to a discussion of this result and
related notions.

If A is a finite subset of R, we let F (A) denote the set {f|A : f ∈ F } of restrictions


of elements of F to the set A. As a convention, we let F (∅) = {f∅ }, containing the so-
called empty function. Since F only contains binary functions, we have |F (A)| ≤ 2|A| .
If x1 , . . . , xM ∈ R, we let, with a slight abuse of notation,

F (x1 , . . . , xM ) = F (A)

where A = {xi , i = 1, . . . , M}. This provides the number of possible splits of a training
set T = (x1 , . . . , xM ) using classifiers in F . Fixing in this section a random variable X,
we let
SF (M) = E(|F (X1 , . . . , XM )|)
where the expectation is taken over all M i.i.d. realizations from X. We also let

SF∗ (M) = max{|F (A)| : A ⊂ R, |A| ≤ M}.

The following theorem controls U in (23.18) in terms of SF .



Theorem 23.16 (Vapnik) With the notation above, one has, for t ≥ 2/N :
 
2
P sup(R(f ) − ET (f )) > t ≤ 2SF (2N )e−N t /8 , (23.19)
f ∈F
23.4. BOUNDING THE EMPIRICAL ERROR WITH THE VC-DIMENSION 609

which implies that, with probability at least 1 − δ, we have


r 
8 2

∀f ∈ F : R(f ) ≤ ET (f )) + log SF (N ) + log (23.20)
N δ

(The requirement that t ó 2/N does not really reduce the range of applicability
of (23.19), since, for t ≤ 2/N , the upper bound in that equation is typically much
larger than 1.)
Proof We first show that the problem can be symmetrized with the inequality, valid
if N t 2 ≥ 2,
t
   
P sup(R(f ) − ET (f )) ≥ t ≤ 2P sup(ET 0 (f ) − ET (f )) ≥ (23.21)
f ∈F f ∈F 2
in which T 0 is a second training set (independent of T ) with N samples also. In view
of assumption (23.17), there is no loss of generality in assuming that F is finite.
Associate to any training set T , a classifier fT ∈ F maximizing R(fT ) − E(fT ). One
then has
 
t t
   
P sup(ET 0 (f ) − ET (f )) ≥  ≥ P (ET 0 (fT ) − ET (fT )) ≥
f ∈F 2 2
t
 
≥ P (R(fT ) − ET 0 (fT ) ≤ and R(fT ) − ET (fT )) ≥ t
 2
t
 
= E 1R(fT )−ET (fT ))≥t P R(fT ) − ET 0 (fT ) ≤ T
2
Conditional to T , ET 0 (fT ) is the average of M i.i.d. Bernoulli random variables, with
variance bounded from above by 1/4 and
t 1/4 1
 
P R(fT ) − ET 0 (fT ) ≤ T ≥ 1− 2 ≥ .
2 N t /4 2
It follows that
t
 
P sup(ET 0 (f ) − ET (f )) >
f ∈F 2
1 1
   
≥ P R(fT ) − ET (fT )) ≥ t = P sup(R(f ) − ET (f )) ≥ t .
2 2 f ∈F
This justifies (23.21).

Now consider a family of independent Rademacher random variables ξ 1 , . . . , ξ N ,


also independent of T and T 0 , taking values −1 and +1 with equal probability. By
symmetry,
N
X
sup(ET 0 (f ) − ET (f )) = sup (r(Yk , f (Xk )) − r(Yk0 , f (Xk0 )))/N
f ∈F f ∈F k=1
610 CHAPTER 23. GENERALIZATION BOUNDS

has the same distribution as


N
X
sup ξ k (r(Yk , f (Xk )) − r(Yk0 , f (Xk0 )))/N .
f ∈F k=1

Now, there are at most |F (X1 , . . . , XN , X10 , . . . , XN


0
)| different sets of coefficients in front
of ξ1 , . . . , ξN in the above sum when f varies in F , so that, conditioning on T , T 0 and
taking a union bound , we have

 N
X 
P sup ξ k (r(Yk , f (Xk )) − r(Yk0 , f (Xk0 )))/N ≥ t/2 T , T 0
f ∈F k=1

≤ |F (X1 , . . . , XN , X10 , . . . , XN
0
)|
X N 
0 0 0
sup P ξ k (r(Yk , f (Xk )) − r(Yk , f (Xk )))/N ≥ t/2 T , T
f ∈F k=1

The variables ξ k (r(Yk , f (Xk )) − r(Yk0 , f (Xk0 )) are centered and belong to the interval
[−1, 1], which has length 2, so that Hoeffding’s inequality implies

N
X 
2 2
P ξ k (r(Yk , f (Xk )) − r(Yk0 , f (Xk0 )))/N ≥ t/2 T , T 0 ≤ e−2N (t/2) /4 = e−N t /8
k=1

and taking expectation over T and T 0 yields


 
2
P sup(R(f ) − ET (f )) ≥ t  = 2SF (2N )e−N t /8 .
 
f ∈F

−N t /8 so that t = 2
q Equation (23.20) is then obtained from letting δ = 2SF (2N )e
8 2SF (2N )
N log δ with R(f ) ≤ ET (f ) + t for all f with probability 1 − δ or more. 

23.4.3 VC dimension

To obtain a practical bound, the quantity SF (2N ), or its upper-bound SF∗ (2N ), needs
to be estimated. We prove below an important property of SF∗ , namely that, either
SF∗ (M) = 2M for all M, or there exists an M0 for which SF∗ (M0 ) < 2M0 , and taking
M0 to be the largest one for which an equality occurs, SF∗ (M) has order M M0 for all
M ≥ M0 . This motivates the following definition of the VC-dimension of the model
class.
23.4. BOUNDING THE EMPIRICAL ERROR WITH THE VC-DIMENSION 611

Definition 23.17 The Vapnik-Chervonenkis dimension (or VC dimension) of the model


class F is
VC-dim(F ) = max{M : SF∗ (M) = 2M } .
(where the infimum of an empty set is +∞).

Remark 23.18 If, for a finite set A ⊂ R, one has |F (A)| = 2|A| , one says that A is
shattered by F . So VC-dim(F ) is the largest integer M such that there exists a set of
cardinality M in R that is shattered by F . 

We now evaluate the growth of SF∗ (M) in terms of the VC-dimension, starting
with the following lemma, which states that, if A is a finite subset of R, there are at
least |F (A)| subsets of A that are shattered by F .

Lemma 23.19 (Pajor) Let A be a finite subset of R. Then

|F (A)| ≤ |{B ⊂ A : |F (B)| = 2B }| .

Proof The statement holds for A = ∅, for which |F∅ | = 1 = 20 . For |A| = 1, the upper-
bound is either 1 if |F (A)| = 1, or 2 if |F (A)| = 2, and the collection of sets B ⊂ A such
that |F (B)| = 2B is {∅} in the first case and {∅, A} in the second one. So, the statement
is true for |A| = 0 or 1.

Proceeding by induction, assume that the result is true if |A| ≤ N , and consider a
set A0 with |A0 | = N +1. Assume that |F (A0 )| ≥ 2 (otherwise there is nothing to prove),
which implies that there exists x ∈ A0 such that |F (x)| = 2. Take such an x and write
A0 = A ∪ {x} with x < A. Let

F0 = {f ∈ F : f (x) = 0} and F1 = {f ∈ F : f (x) = 1} .

Since F0 ∩ F1 = ∅, we have

|F (A0 )| = |F0 (A0 )| + |F1 (A0 )|.

Since f (x) is constant on F0 (resp. F1 ), we have |F0 (A0 )| = |F0 (A)| (resp. |F1 (A0 )| =
|F1 (A)|), and the induction hypothesis implies

|F (A0 )| ≤ |{B ⊂ A : |F0 (B)| = 2B }| + |{B ⊂ A : |F1 (B)| = 2B }|


= |{B ⊂ A : |F0 (B)| = 2B or |F0 (B)| = 2B }|
+ |{B ⊂ A : |F0 (B)| = |F1 (B)| = 2B }|.

If B ⊂ A is shattered by F0 or F1 , it is obviously shattered by F . Moreover, if B is


shattered by both, then B ∪ {x} is shattered by F . The upper bound in the equation
above is therefore less than the total number of sets shattered by F , which proves
the lemma. 
612 CHAPTER 23. GENERALIZATION BOUNDS

From this lemma, it results that if VC-dim(F ) = D < ∞, then SF∗ (M) is bounded
by the total number of subsets of cardinality D or less in a set of cardinality M. This
provides the following result, which implies that the term in front of the exponential
in (23.18) grows polynomially in N if F have finite VC-dimension.

Proposition 23.20 (Sauer-Shelah’s lemma) If D is the VC-dimension of F , then, for


N ≥ D,
eN D
 

SF (N ) ≤ .
D
Proof Pajor’s lemma implies that
D !
X N
SF∗ (N ) ≤
k
k=0

and the statement of the proposition derives from the standard upper bound
D
eN D
! 
N
X 

k D
k=0

that we now justify for completeness. We have

N k N D Dk
!
N N!
= ≤ ≤ D
k (N − k)!k! k! D k!

if k ≤ D ≤ N . This yields
D D
N D X D k N D eD
!
X N
≤ D ≤
k D k! DD
k=0 k=0

as required. 

We can therefore state a corollary to theorem 23.16 for model classes with finite
VC-dimension.

Corollary 23.21 Assume that VC-dim(F ) = D < ∞. Then, for t ≥ 2/N and N ≥ D,

2eN D −N t2 /8
   
P sup(R(f ) − ET (f )) > t ≤ 2 e . (23.22)
f ∈F D

and r r
8 eN 2
 
P sup(R(f ) − ET (f )) ≤ D log + log ≥ 1 − δ. (23.23)
f ∈F N D δ
23.4. BOUNDING THE EMPIRICAL ERROR WITH THE VC-DIMENSION 613

23.4.4 Examples

The following result provides the VC-dimension of the collection of linear classifiers.
n o
Proposition 23.22 Let R = Rd and F = x 7→ sign(a0 + bT x) : β0 ∈ R, b ∈ Rd . Then

VC-dim(F ) = d + 1.

Proof Let us show that no set of d +2 points can be shattered by F . Use the notation
x̃ = (1, xT )T and β = (a0 , bT )T , and consider d + 2 points x1 , . . . , xd+2 . Then x̃1 , . . . , x̃d+2
are linearly dependent and one of them, say, x̃d+2 can be expressed as a linear com-
bination of the others. Write
d+1
X
x̃d+2 = αk x̃k .
k=1

Then there is no function f ∈ F (taking the form x̃ 7→ sign(β x̃)) that maps (x1 , . . . , xd+2 )
to (sign(α1 ), . . . , sign(αd+1 ), −1) (where the definition of sign(0) = ±1 is indifferent),
since any such function satisfies

d+1
X
T
β x̃d+2 = αk β T x̃k > 0 .
k=1

This proves VC-dim(F ) < d + 2. To prove that VC-dim(F ) = d + 1, it suffices to exhibit


a set of d + 1 vectors in Rd that can be shattered by F . P Choose x1 , . . . , xd+1 such
that x̃1 , . . . , x̃d+1 are linearly independent (for example xi = i−1
k=1 ei , where (e1 , . . . , ed )
d
is the canonical basis of R ). This linear independence implies that, for any vector
α = (α1 , . . . , αd+1 )T ∈ Rd+1 , there exists a vector β ∈ Rd+1 such that x̃iT β = αi for all
i = 1, . . . , d + 1. This shows that any combination of signs for x̃iT β can be achieved, so
that (x1 , . . . , xd+1 ) is shattered. 

Upper-bounds on VC dimensions of more complex models have also been pro-


posed in the literature. As an example, the following theorem, that we provide with-
out proof, considers feed-forward neural networks with piecewise linear units (such
as ReLU, see chapter 11). This theorem is a special case of Theorem 7 in Bartlett et al.
[21], in which the more general case of networks with piecewise polynomial units is
provided. Given integers L, U1 , . . . , UL and W1 , . . . , WL , define the function class

F (L, (Ui ), (Wi ), p)

that consists of feed-forward neural networks with L layers, Ui piecewise linear com-
putational units with less than p pieces in the ith layer, and such that the total num-
ber of parameters involved in layers 1, 2, . . . , j is less than Wj .
614 CHAPTER 23. GENERALIZATION BOUNDS

Theorem 23.23
VC-dim(F (L, (Ui ), (Wi ), p)) = O(L̄WL log(pU )).
where U = U1 + · · · + UL and
L
1 X
L̄ = Wj .
WL
j=1

Note that p = 2 for ReLU networks. Theorem 7 in Bartlett et al. [21] also provides a
more explicit upper bound, namely
 L
 L 
 X X 
VC-dim(F (L, (Ui ), (W ), p)) ≤ L + L̄WL log2 4ep iUi log2  (2epiUi ) .
i=1 i=1

23.4.5 Data-based estimates

Approximations of the shattering numbers can be computed using training data.


One can, in particular, prove a concentration inequality [37] on log SF (X1 , . . . , XN ),
which may in turn be used to estimate log(SF (2N )). In the following, we let HVC (N , F )
denote the expectation of log SF (X1 , . . . , XN ). It is often referred to as the VC entropy
of F .
Theorem 23.24 One has, letting HVC = HVC (N , F ):
t2
!
P(log SF (X1 , . . . , XN ) ≥ HVC + t) ≤ exp −
2HVC + 2t/3
and
t2
!
P(log SF (X1 , . . . , XN ) ≤ HVC − t) ≤ exp −
2HVC
Proof We show that the random variable Z = log2 SF (X1 , . . . , XN ) satisfies the as-
sumptions of theorem 23.14, with
Zk = log2 SF (X1 , . . . , Xk−1 , Xk+1 , . . . , XN ).
Clearly, 0 ≤ Z, 0 ≤ Z − Zk ≤ 1, because one can do no more than double SF by adding
one point. We need to show that
N
X
(Z − Zk ) ≤ Z. (23.24)
k=1

Note that Z is the base-two entropy of the uniform distribution, π, on the set
F (X1 , . . . , XN ) ⊂ {−1, 1}N .

We will use the following lemma.


23.4. BOUNDING THE EMPIRICAL ERROR WITH THE VC-DIMENSION 615

Lemma 23.25 Let A be a finite set and ψ a probability distribution on AN . Let ψk be its
marginal when the kth variable is removed. Then:
N
X
H2 (ψk ) − (N − 1)H2 (ψ) ≥ 0. (23.25)
k=1

This lemma is a special case of a collection of results on non-negative entropy mea-


sures developed in Han [86], and we provide a direct proof below for completeness.

Given the lemma, let πk denote the marginal distribution of π when the kth
variable is removed, i.e.,

πk (1 , . . . k−1 , k+1 , . . . , N )


= π(1 , . . . k−1 , −1, k+1 , . . . , N ) + π(1 , . . . k−1 , 1, k+1 , . . . , N ).

We have:
N
X
(H2 (π) − H2 (πk )) ≤ H(π)
k=1
from which (23.24) derives since Z = H2 (π) and Zk ≥ H2 (πk ). The result then follows
from theorem 23.14.

We now prove lemma 23.25 by induction (this proof requires some basic notions
of information theory). For convenience, introduce random variables (ξ1 , . . . , ξN )
such that ξk ∈ A, with joint probability distribution given by ψ. Let Y = (ξ1 , . . . , ξN ),
Y (k) the (N −1)-tuple formed from Y by removing ξk , Y (k,l) the (N −2)-tuple obtained
by removing ξk and ξl , etc. Inequality (23.25) can then be rewritten
N
X
H2 (Y (k) ) − (N − 1)H2 (Y ) ≥ 0.
k=1

This inequality is obviously true for N = 1, and it is true also for N = 2 since it gives
in this case the well-known inequality H2 (Y1 , Y2 ) ≤ H2 (Y1 ) + H2 (Y2 ). Fix M > 2 and
assume that the lemma is true for any N < M. To prove the statement for N = M,
we will use the following inequality, which holds for any three random variables
U1 , U 2 , U 3 :
H2 (U1 , U3 ) + H2 (U2 , U3 ) ≥ H2 (U1 , U2 , U3 ) + H2 (U3 ) .
This inequality is equivalent to the statement on conditional entropies that H2 (U1 , U2 |
U3 ) ≤ H2 (U1 | U3 ) + H2 (U2 | U3 ). We apply it, for given k , l, to U1 = Yl , U2 = Yk ,
U3 = Y (k,l) , yielding

H2 (Y (k) ) + H2 (Y (l) ) ≥ H2 (Y ) + H2 (Y (k,l) ).


616 CHAPTER 23. GENERALIZATION BOUNDS

We now sum over all pairs k , l, yielding


N
X X
2(N − 1) H2 (Y (k) ) ≥ N (N − 1)H2 (Y ) + H2 (Y (k,l) ).
k=1 k,l

We finally use the induction hypothesis to write that, for all k


X
H2 (Y (k,l) ) ≥ (N − 2)H2 (Y (k) )
l,k

and obtain
N
X N
X
2(N − 1) H2 (Y (k) ) ≥ N (N − 1)H2 (Y ) + (N − 2) H2 (Y (k) ),
k=1 k=1

which provides the desired result after rearranging the terms. 

Note that theorem 23.16 involves SF (2N ), with:

log2 (SF (2N )) = log2 E(SF (X1 , . . . , X2N )) ≥ HVC (2N , F )

from Jensen’s inequality. This implies that the high-probability upper bound on
HVC (2N , F ) that results from the previous theorem is not necessarily an upper bound
on log(SF (2N )). It is however proved in Boucheron et al. [37] that
1
log2 E(SF (X1 , . . . , X2N )) ≤ H (2N , F )
log 2 VC
also holds (as a consequence of (23.16)). A little more work (see Boucheron et al. [37])
combining theorem 23.16 and theorem 23.24 implies the following bound, which
holds with probability 1 − δ at least:
r r
6 log SF (X1 , . . . , XN ) log(2/δ)
∀f ∈ F : R(f ) ≤ E(f ) + +4 .
N N

23.5 Covering numbers and chaining

The upper bounds using the VC dimension relied on the number of different values
taken by a set of functions when evaluated on a finite set, this number being used to
apply a union bound. A different point of view may be applied when one relies on
some notion of continuity of the family of functions on which a uniform concentra-
tion bound is needed, with respect to a given metric. This viewpoint is furthermore
applicable when the sets F (X1 , . . . , XN ) are infinite. To develop these tools, we will
need some new concepts measuring the size of sets in a metric space.
23.5. COVERING NUMBERS AND CHAINING 617

23.5.1 Covering, packing and entropy numbers

Definition 23.26 Let (G, ρ) be a metric space and let  > 0. The -covering number of
(G, ρ). denoted N (G, ρ, ), is the smallest integer n such that there exists a subset G ⊂ G
such that |G| = n and maxg∈G ρ(g, G) ≤ .

Let γ > 0. The γ-packing number M(G, ρ, γ), is the largest number n such that there
exists a subset A ⊂ G with cardinality n such that any two distinct elements of A are at
distance strictly larger than γ (such sets are called γ-nets).

When G and ρ are well understood from the context, we will write simply N () and
M(γ).
Proposition 23.27 One has, for any γ > 0:
M(G, ρ, 2γ) ≤ N (G, ρ, γ) ≤ M(G, ρ, γ).
Proof Let A be a maximal γ-net. Then, for all x ∈ G, there exists y ∈ A such that
ρ(x, y) ≤ γ: otherwise A ∪ {x} would also be a γ − net. This shows that max(ρ(x, A), x ∈
G) ≤ γ and N (G, ρ, γ) ≤ |A|.

Conversely, let A be a 2γ-net. Let G be an optimal γ-covering. Associate to each


y ∈ A a point x ∈ G at distance less than γ: at least one exists because G is a covering.
This defines a function f : A → G, which is necessarily one-to-one, because if two
points in A map to the same point in G, the distance between these two points would
be less than or equal to 2γ. This shows that M(G, ρ, 2γ) ≤ N (G, ρ, γ). 

The entropy numbers of (G, ρ), denoted, for an integer N , e(G, ρ, N ) (or just e(N ))
represent the best accuracy that can be achieved by subsets of G of size N , namely
e(G, ρ, N ) = min max{ρ(g, G) : g ∈ G}. (23.26)
G⊂G,|G|=N

We have:
e(G, ρ, N ) = inf{ : N (G, ρ, ) ≤ N } (23.27a)
and
N (G, ρ, ) = min{N : e(G, ρ, N ) ≤ }. (23.27b)

23.5.2 A first union bound

Let Z be a random variable Z : Ω → Z. We will consider a space G of functions


g : Z → R, such that (to simplify the discussion) E(g(Z)) = 0 for all g ∈ G. In this
section, we assume that functions in G are bounded and let
ρ∞ (g, g 0 ) = sup |g(z) − g 0 (z)| .
z∈Z
618 CHAPTER 23. GENERALIZATION BOUNDS

Assume that N (G, ρ∞ , ) < ∞, for all  > 0 (which requires the set G to be pre-
compact for the ρ∞ metric). Take t > 0, 0 <  < t and choose a set G ⊂ G such that
|G| = N (G, ρ∞ , ). Then, using a union bound,
P(sup g(Z) ≥ t) ≤ P(sup g(Z) ≥ t − ) (23.28)
g∈G g∈G
≤ N (G, ρ∞ , ) sup P(g(Z) ≥ t − ).
g∈G

Now, if each function in G satisfies a concentration inequality, say,


u 2
− 2µ(g)
P(g(Z) ≥ u) ≤ e

for some µ(g) > 0, then, assuming that µ(G) = maxg∈G µ(g) is finite, we find that, for
0 <  < t,
(t−)2
− 2µ(G)
P(sup g(Z) ≥ t) ≤ N (G, ρ∞ , ) e .
g∈G

We now apply this inequality to the case of binary classification, where a binary
variable Y is predicted by an input variable X, with a model class of classifiers F
and the 0–1 loss function. If A is a finite family of elements of R, we define, for
f ,f 0 ∈ F
1 X
ρA (f , f 0 ) = 1f (x),f 0 (x) .
|A|
x∈A
Let
N¯ (F , , N ) = E N (F , ρ{X1 ,...,XN } , )
 

where X1 , . . . , XN is an i.i.d. sample of X. We then have the following proposition.


Proposition 23.28 For all  > 0, one has
N (t/2−)2
 
P sup(R(f ) − ET (f )) ≥ t ≤ 2N¯ (F , /2, N )e− 4 . (23.29)
f ∈F

Proof A key step in the proof of theorem 23.16, was to show that
   N
X 
P sup(R(f )−ET (f )) ≥ t ≤ 2P sup ξ k (r(Yk0 , f (Xk0 ))−r(Yk , f (Xk ))) ≥ N t/2 . (23.30)
f ∈F f ∈F k=1

where ξ 1 , . . . , ξ N are Rademacher random variables and T , T 0 are two independent


training sets of size N . We start from this inequality and bound the conditional
expectation
 N
X 
P sup ξ k (r(Yk0 , f (Xk0 )) − r(Yk , f (Xk ))) ≥ N t/2 T , T 0
(23.31)
f ∈F k=1
23.5. COVERING NUMBERS AND CHAINING 619

and therefore consider r(Yk0 , f (Xk0 )) − r(Yk , f (Xk )) as constants that we will denote
ck (f ). Since we are using a 0–1 loss, we have ck (f ) ∈ {−1, 0, 1} and, for f , f 0 ∈ F ,

|ck (f ) − ck (f 0 )| ≤ 1f (Xk ),f 0 (Xk ) + 1f (Xk0 ),f 0 (Xk0 ) . (23.32)

Consider the random variable Z = (ξ 1 , . . . , ξ N ), and let


n o
G = gf , f ∈ F

with
N
1X
gf (ξ1 , . . . , ξN ) = ck (f )ξk .
N
k=1
We have
N
1X
ρ∞ (gf , gf 0 ) = |ck (f ) − ck (f 0 )| .
N
k=1
Applying Hoeffding’s inequality, we have, for u > 0 and using the fact that ck ∈ [−1, 1]
2N u 2 N u2
P(gf (Z) > u | T , T 0 ) ≤ e− 4 = e− 2

and the discussion preceding the theorem yields the fact that, for any  > 0:
N (t/2−)2
P(sup gf (Z) > t/2 | T , T 0 ) ≤ N (G, , ρ∞ )e− 2 . (23.33)
f ∈F

Let A = (X1 , . . . , XN , X10 , . . . , XN


0
) so that
N
0 1 X 
ρA (f , f ) = 1f (Xk ),f 0 (Xk ) + 1f (Xk0 ),f 0 (Xk0 ) .
2N
k=1

Using (23.32), we have ρ∞ (gf , gf 0 ) ≤ 2ρA (f , f 0 ), which implies

N (G, , ρ∞ ) ≤ N (F , /2, ρA ) .

Using this in (23.33) and taking the expectation in (23.31), we get


N (t/2−)2
 
P sup(R(f ) − ET (f )) ≥ t ≤ 2N¯ (F , /2, N )e− 2 (23.34)
f ∈F

which is valid for all  > 0. 

One can retrieve the bound obtained in theorem 23.16 using the obvious fact that

N (F , , ρA ) ≤ |F (A)|,
620 CHAPTER 23. GENERALIZATION BOUNDS

for any A ⊂ R, so that


N (t/2−)2
 
P sup(R(f ) − ET (f )) ≥ t ≤ 2S(F , 2N )e− 2
f ∈F

for any  > 0, and letting  go to zero,


N t2
 
P sup(R(f ) − ET (f )) ≥ t ≤ 2S(F , 2N )e− 8 .
f ∈F

So (23.29) provides a family of equations that depend on a parameter  which, in


the limit  → 0, includes theorem 23.16 as a particular case. For a given N , optimiz-
ing (23.29) over  may give a better upper bound, provided one has a good way to
estimate N¯ (F , /2, N ) (which is, of course, far from obvious).

23.5.3 Evaluating covering numbers

Covering numbers can be evaluated in some simple situations. The following propo-
sition provides an example in finite dimensions.

Proposition 23.29 Assume that G is a parametric family of functions, so that G = {gθ , θ ∈ Θ}


where Θ ⊂ Rm . Assume also that, for some constant C, ρ∞ (gθ , gθ0 ) ≤ C|θ − θ 0 | for all
θ, θ 0 ∈ Θ. Let G (M) = {gθ : θ ∈ Θ, |θ| ≤ M}. Then

2CM m
 
N (G, ρ∞ , ) ≤ 1 +

Proof Letting ρ denote the Euclidean distance in Rm , our hypotheses imply that
N (G (M) , ρ∞ , ) is bounded by N (BM , ρ, /C) where BM is the ball with radius M in
Rm . Now, if θ1 , . . . , θn is an α-covering of BM , then θ1 /M, . . . , θn /M is an (α/M)-
covering of B1 , which shows (together with a symmetric argument) that N (BM , ρ, α) =
N (B1 , ρ, α/M) and we get

N (G (M) , ρ∞ , ) ≤ N (B1 , ρ, /MC)

and we only need to evaluate N (B1 , ρ, α) for α > 0. Using proposition 23.27, one can
instead evaluate M(B1 , ρ, α). So let A be an α-net in B1 . Then
[
Bρ (x, α/2) ⊂ Bρ (0, 1 + α/2)
x∈A

and, since the sets in the union are disjoint,


X
volume(Bρ (x, α/2)) = |A|volume(Bρ (0, α/2)) ≤ volume(Bρ (0, 1 + α/2)) .
x∈A
23.5. COVERING NUMBERS AND CHAINING 621

Letting Cm denote the volume of the unit ball in Rm , this shows


 m
α α m
 
|A|Cm ≤ Cm 1 +
2 2
and
2 m
 
|A| ≤ 1 + ,
α
which concludes the proof. 

One can also obtain entropy number estimates in infinite dimensions. Here, we
quote a result applicable to spaces of smooth functions, referring to Van der Vaart
and Wellner [197] for a proof.

Theorem 23.30 Let Z be a bounded convex subset of Rd with non-empty interior. For
p ≥ 1 and f ∈ C p (Z), let
n o
kf kp,∞ = max |D k (f (x)| : k = 0, . . . , p, x ∈ Z .

Let G be the unit ball for this norm,


n o
G = f ∈ C p (Z) : kf kp,∞ ≤ 1 .

Let Z (1) be the set of all x ∈ Rd at distance less than 1 from R.

Then there exists a constant K depending only on p and d such that


 d/p
(1) 1
log N (, G, ρ∞ ) ≤ Kvolume(Z )


23.5.4 Chaining

The distance ρ∞ may not always be the best one to analyze the set of functions, G. For
example, if G is a class of functions with values in {−1, 1}, then ρ∞ (g, g 0 ) = 2 unless
g = g 0 . In such contexts, it is often preferable to use distances that compute average
discrepancies, such as
ρp (g, g 0 ) = E(|g(Z) − g 0 (Z)|p )1/p , (23.35)
for some random variable Z. Such distances, by definition, do not provide uniform
bounds on differences between functions (that we used to write (23.28)), but can
rather be used in upper-bounds on the probabilities of deviations from zero, which
have to be handled somewhat differently. We here summarize a general approach
called “chaining,” following for this purpose the presentation made in Talagrand
[190] (see also Audibert and Bousquet [15]). From now on, we assume that (G, ρ) is a
622 CHAPTER 23. GENERALIZATION BOUNDS

(pseudo-)metric space of functions g : Z → R and Z a random variable taking values


in Z. We will make the basic assumption that, for all g, g 0 ∈ G and t > 0,
t2
0 −
P(|g(Z) − g (Z)| > t) ≤ 2e 2ρ(g,g 0 )2 .

Note that this assumption includes cases in which


t 2
0 − 2ρ(g,g 0 )α
P(|g(Z) − g (Z)| > t) ≤ 2e .

for some α ∈ (0, 2], because, if ρ is a distance, then so is ρα/2 if α ≤ 2. We will also
assume that E(g(Z)) = 0 in order to avoid centering the variables at every step.

We are interested in upper bounds for P(supg∈G g(Z) > t). To build a chaining
argument, consider a family (G0 , G1 , . . .) of subsets of G. Assume that |Gk | ≤ Nk with
Nk chosen, for future simplicity, so that Nk−1 Nk ≤ Nk+1 . For g ∈ G, let πk (g) denote a
closest point to g in Gk . Also assume that G0 = {g0 } is a singleton, so that π0 (g) = g0
for all g ∈ G. (One can generally assume without harm that 0 ∈ G, in which case one
should choose g0 = 0 in the following discussion.) For g ∈ Gn , we therefore have
n
X
g − g0 = (πk (g) − πk−1 (g)) .
k=1

Let (t1 , t2 , . . .) be a sequence of numbers that will be determined later. Let


n
X
Sn = max tk ρ(πk (g), πk−1 (g)). (23.36)
g∈Gn
k=1

Then, for any t,

P(sup g(Z) − g0 (Z) > tSn )


g∈Gn
≤ P(∃g ∈ Gn , ∃k ≤ n : πk (g)(Z) − πk−1 (g)(Z) > ttk ρ(πk (g), πk−1 (g)))
≤ P(∃k ≤ n, ∃g ∈ Gk , g 0 ∈ Gk−1 : g(Z) − g 0 (Z) > ttk ρ(g, g 0 ))
Xn
≤ Nk Nk−1 sup P(g(Z) − g 0 (Z) > ttk ρ(g, g 0 ))
k=1 g∈Gk ,g 0 ∈Gk−1
X n t2 t2
− k
≤2 Nk+1 e 2

k=1
k k k−1
If one takes Nk = 22 , which satisfies Nk Nk−1 = 22 +2 ≤ Nk+1 , and tk = 2k/2 , one
finds that
n
X k+1 k−1 2
P(sup g(Z) − g0 (Z) > tSn ) ≤ 2 22 e−2 t .
g∈Gn k=1
23.5. COVERING NUMBERS AND CHAINING 623
p
The upper bound converges (as a function of n) as soon as t > 2 log 2. Moreover,
one has
n n
2 X

2 X
− t2 − t2
X
2k+1 −2k−1 t 2 −2k−2 (t 2 −8 log 2) k−2
2 2 e = 2e e ≤ 2e e−2
k=1 k=1 k=1
p
when t > 1 + 8 log 2. This provides a concentration bound for P(supg∈Gn g(Z) −
g0 (Z) > tSn ), that we may rewrite as

t2

P(sup g(Z) − g0 (Z) > t) ≤ Ce 2Sn2 (23.37)
g∈Gn

for t > 2Sn log 2, C = 2 ∞ −2k−2 and S given by (23.36), with t = 2k/2 . Moreover,
p P
k=1 e n k
we have
n
X
Sn = max 2k/2 ρ(πk (g), πk−1 (g))
g∈Gn
k=1
n
X
≤ max 2k/2 (ρ(g, Gk ) + ρ(g, Gk−1 ))
g∈Gn
k=1
X n
≤ 2 max 2k/2 ρ(g, Gk )
g∈Gn
k=0

and this simpler upper bound can be used in (23.37).

We haven’t made many assumptions so far on the sequence G0 , G1 , . . ., beyond


bounding their cardinality, but it is natural to require that they are built in order to
behave like a dense subset of G, so that

lim max ρ(x, Gn ) = 0. (23.38)


n→∞ g∈G

Note that this requires that the set G is precompact for the distance ρ. We will also
assume that
lim sup g(x) = sup g(x). (23.39)
n→∞ g∈G g∈G
n

Then, we have proved the following result [189].

Theorem 23.31 Let G0 , G1 , . . . be a family of subsets of G satisfying (23.38) and (23.39)


n
and such that G0 = {g0 } and |Gn | ≤ 22 for n ≥ 0. Let

X
S = 2 sup 2n/2 ρ(g, Gn ) (23.40)
g∈G n=0
624 CHAPTER 23. GENERALIZATION BOUNDS
p
Then, for t > S 1 + 8 log 2,
t2

P(sup g(Z) − g0 (Z) > t) ≤ Ce 2S 2 (23.41)
g∈G

with C = 2
P∞ −2k−2 .
k=1 e

The exponential rate of convergence in the right-hand side of (23.41) is the quan-
tity S, and the upper bound will be improved when building the sequence (G0 , G1 , . . .)
so that S is as small as possible. Such an optimization for a given family of functions
is however a formidable problem. It is however interesting to see (still following
[189]) that theorem 23.31 implies a classical inequality in terms of what is called the
metric entropy of the metric space (G, ρ).

23.5.5 Metric entropy

If S is given by (23.40), we have



X  ∞
X
n/2
S = 2 sup 2 ρ(g, Gn ) : g ∈ G ≤ 2 2n/2 sup{ρ(g, Gn ) : g ∈ G}
n=0 n=0
n
Take Gn achieving the minimum in the entropy number e(G, ρ, 22 ). Then, (23.41)
holds with S replaced by

X n
Ŝ = 2 2n/2 e(G, ρ, 22 ) .
n=0

Consider the function


Z ∞q
h(G, ρ) = log N (G, ρ, )d, (23.42)
0

which is known as Dudley’s metric entropy of the space (G, ρ). We have
n
Z e(2) q ∞ Z
X e(22 ) q
h(G, ρ) = log N ()d + log N ()d.
n−1
0 n=1 e(22 )

n−1 n n
If  ∈ [e(22 ), e(22 )), we have N () > 22 so that

X
p n n−1
h(G, ρ) ≥ e(2) log 3 + 2n/2 (e(22 ) − e(22 ))
n=1
√ ∞
 2X n
≥ 1− 2n/2 e(22 ).
2
n=1
23.5. COVERING NUMBERS AND CHAINING 625

Therefore,
4
√ h(G, ρ) ≤ 7h(G, ρ)
Ŝ ≤
2− 2
and this upper bound can also be used to obtain a simpler (but weaker) form of
theorem 23.31.
Remark 23.32 The covering numbers of a class G of binary functions g with values
in {−1, 1} can be controlled by the VC dimension of the class. Here, we consider
ρ(g, g 0 ) = P(g , g 0 ) = ρ1 (g, g 0 )/2. Then, the following theorem holds.
Theorem 23.33 Let G be a class of binary functions such that D = VC-dim(G) < ∞.
Then, there is a universal constant K such that, for any  ∈ (0, 1),
 D−1
1
N (G, ρ, ) ≤ KD(4e)D

with ρ(g, g 0 ) = P(g , g 0 ).

We refer to Van der Vaart and Wellner [197], Theorem 2.6.4 for a proof, which is
rather long and technical. 

23.5.6 Application

We quickly show how this discussion can be turned into results applicable to the
classification problem. If F is a function class of binary classifiers and r is the risk
function, one can consider the class

G = {(x, y) 7→ r(y, f (x)) : f ∈ F } .

If r is the 0–1 loss, we have VC-dim(G) ≤ VC-dim(F ). Indeed, if one considers N


points in R × {−1, 1}, say (x1 , y1 , . . . , xN , yN ), then

G(x1 , y1 , . . . , xN , yN )
= {r(1, f (xk )) : k = 1, . . . , N , yk = 1} ∪ {r(−1, f (xk )) : k = 1, . . . , N , yk = −1}.

If the two sets in the right-hand side are not empty, i.e., the numbers N(1) and N(−1)
of k’s such that yk = 1 or yk = −1 are not zero, then

|G(x1 , y1 , . . . , xN , yN )| ≤ 2N(1) + 2N(−1) ,

which is less that 2N as soon as N > 2. So, taking N > 2, for (x1 , y1 , . . . , xN , yN ) to be
shattered by G, we need N(1) = N or N(−1) = N and in this case, the inequality:

|G(x1 , y1 , . . . , xN , yN )| ≤ |F (x1 , . . . , xN )|
626 CHAPTER 23. GENERALIZATION BOUNDS

is obvious. The same inequality will be true for some x1 , . . . , xN with N = 2, except in
the uninteresting case where f (x) = 1 (or −1) for every x ∈ R.

A similar inequality holds for entropy numbers with the ρ1 distance (cf. (23.35))
because
E(|r(Y , f (X)) − r(Y , f 0 (X))|) ≤ P(f (X) , f 0 (X))
whenever r takes values in [0, 1], which implies that

N (G, ρ1 , ) ≤ N (F , ρ1 , )

for all  > 0. Note however that evaluating this upper bound may still be challenging
and would rely on strong assumptions on the distribution of X allowing to control
P(f (X) , f 0 (X)).

We now assume that functions in F define “posterior probabilities” on G. More


precisely, given λ ∈ R we can define the probability πλ on {−1, 1} by

eλy
πλ (y) = .
e−λ + eλ
Now, if F is a class of real-valued functions, we can define the risk function
1
r(y, f (x)) = log .
πf (x) (y)

Since |∂λ log πλ (y)| = |y − tanh λ| ≤ 2 for y ∈ {−1, 1}, we have

|r(y, f (x)) − r(y, f 0 (x))| ≤ 2|f (x) − f 0 (x)|

so that entropy numbers in G can be estimated from entropy numbers in F . As an


example, let F be a space of affine functions x 7→ a0 + bT x, x ∈ Rd . Assume that the
random variable X is bounded, so that one can take R to be an open ball centered at
0 with radius, say, U . For M > 0, let

FM = {f : x 7→ a0 + bT x : |b| ≤ M, |a0 | ≤ U M} .

The restriction |b| ≤ M is equivalent to using a penalty method, such as, for example,
ridge logistic regression. Moreover, if |b| ≤ M, it is natural to assume that |a0 | ≤ U M
because otherwise f would have a constant sign on R. In this case, we get

ρ∞ (r(y, f (x)), r(y, f 0 (x))) ≤ |a0 − a00 | + U |b − b0 |

and a small modification of the proof of proposition 23.29 shows that

4CU d+1
 
N (F , ρ∞ , ) ≤ 1 +

23.6. OTHER COMPLEXITY MEASURES 627

23.6 Other complexity measures

23.6.1 Fat-shattering and margins

VC-dimension and metric entropy are measures that control the complexity of a
model class, and can therefore be evaluated a priori without observing any data.
These bounds can be improved, in general, by using information derived from the
training set, and, particular the classification margin that has been obtained [18].

For this discussion, we need to return to the definition of covering numbers. If F


is a function class, ρ∞ the supremum metric on F ,  > 0 and N is an integer, we let
N (F , ρ∞ , , N ) = max{N (F (A), ρ∞ , ) : A ⊂ R, |A| = N }
that we will abbreviate in N∞ (, N ) when F is known from the context. We will
assume that functions in F take values values in [−1, 1], and we define for γ ≥ 0,
y ∈ {0, 1}, u ∈ R: 


0 if u < −γ and y = 0

rγ (y, u) = 0 if u > γ and y = 1



1 otherwise

So, rγ (y, f (x)) is equal to 0 if f (x) correctly predicts y with margin γ and to 1 other-
wise. We then define the classification error with margin γ as
Rγ (f ) = E(rγ (Y , f (X)))
and, given a training set T of size N
N
1X
Eγ,T = rγ (yk , f (xk )).
N
k=1

We then have the following theorem [10].



Theorem 23.34 If t ≥ 2/N
2 /8
P(sup(R0 (f ) − Eγ,T (f )) > t) ≤ 2N∞ (γ/2, 2N )e−N t , (23.43)
f ∈F

or, equivalently, with probability larger than 1 − δ, one has, for all f ∈ F ,
r 
8 2

R0 (f ) − Eγ,T (f )) ≤ log N∞ (γ/2, 2N ) + log . (23.44)
N δ
Proof We first note that, for N t 2 > 2,

   
P sup(R0 (f ) − Eγ,T (f )) > t ≤ 2P sup(ET 0 (f ) − Eγ,T (f )) > ,
f ∈F f ∈F 2
628 CHAPTER 23. GENERALIZATION BOUNDS

which is proved exactly the same way as (23.21) in theorem 23.16, and we skip the
argument.

We have
N
1X
ET 0 (f ) − Eγ,T (f ) = (r0 (Yk0 , f (Xk0 )) − rγ (Yk , f (Xk )))
N
k=1

and because (Xk , Yk ) and (Xk0 , Yk0 ) have the same distribution, supf ∈F (ET 0 (f ) − Eγ,T (f ))
has the same distribution as

∆T ,T 0 (ξ1 , . . . , ξN ) =
N
1 X 
sup (r0 (Yk0 , f (Xk0 )) − rγ (Yk , f (Xk )))ξ k + (r0 (Yk , f (Xk )) − rγ (Yk0 , f (Xk0 )))(1 − ξ k )
f ∈F N k=1

where ξ 1 , . . . , ξ N is a sequence of Bernoulli random variables with parameter 1/2.

We now estimate P(∆T ,T 0 (ξ 1 , . . . , ξ N ) > t/2 | T , T 0 ) and we therefore consider T and


T0as fixed. Let F be a subset of F , with cardinality N∞ (γ/2, 2N ), such that for all f ∈
F there exists an f 0 ∈ F such that |f (x)−f 0 (x)| ≤ γ/2 for all x ∈ {X1 , . . . , XN , X10 , . . . , XN
0
}.
Then we claim that
∆T ,T 0 (ξ1 , . . . , ξN ) ≤ ∆0T ,T 0 (ξ1 , . . . , ξN )
where
N
1 X  
∆0T ,T 0 (ξ1 , . . . , ξN ) = max (2ξk − 1) r γ (Yk0 , f (Xk0 )) − r γ (Yk , f (Xk )) .
f ∈F N 2 2
k=1

This is because, for any (x, y) ∈ R × {0, 1}, and f , f 0 such that |f (x) − f 0 (x)| < γ/2, we
have r0 (y, f (x)) ≤ rγ/2 (y, f 0 (x)) and rγ/2 (y, f 0 (x)) ≤ rγ (y, f (x)): if an example is misclas-
sified by f (resp. f 0 ) at a given margin, it must be misclassified by f 0 (resp. f ) at this
margin plus γ/2.

Now,
t
P(∆0T ,T 0 (ξ 1 , . . . , ξ N ) > )
2
N
1X t
 
≤ |F| max P (2ξ k − 1)(r γ (Yk0 , f (Xk0 )) − r γ (Yk , f (Xk ))) >
f ∈F N 2 2 2
k=1

to which we can apply Hoeffding’s inequality, yielding


t
 
2
0
P ∆T ,T 0 (ξ 1 , . . . , ξ N ) > ≤ |F|e−N t /8 ,
2 

which concludes the proof, since, by proposition 23.27, |F| ≤ N∞ (γ/2, 2N ).


23.6. OTHER COMPLEXITY MEASURES 629

In order to evaluate the covering numbers N∞ (, N ) using quantities similar


to VC-dimensions, a different type of set decomposition and shattering has been
proposed. Following Alon et al. [4], we introduce the following notions. Recall
that a family of functions F : R → {0, 1} shatters a finite set A ⊂ R if and only if
|F (A)| = 2|A| . The following definitions are adapted to functions taking values in a
continuous set.

Definition 23.35 Let F be a family of functions f : R → [−1, 1] and A a finite subset of


R.

(i) One says that F P -shatters A if there exists a function gA : R → R such that, for each
B ⊂ A, there exists a function f ∈ F such that f (x) ≥ gA (x) if x ∈ B and f (x) < gA (x) if
x ∈ A \ B.
(ii) Let γ be a positive number. One says that F Pγ -shatters A if there exists a function
gA : R → R such that, for each B ⊂ A, there exists a function f ∈ F such that f (x) ≥
gA (x) + γ if x ∈ B and f (x) ≤ gA (x) − γ if x ∈ A \ B.

Note that only the restriction of gA to A matters in this definition. This function
acts as a threshold for binary classification. More precisely, given a function g : A →
R, one can associate to every f ∈ F the binary function fg with fg (x) equal to 1 if
f (x) ≥ g(x) and to 0 otherwise. Letting Fg = {fg : f ∈ F } we see that F P-shatters A
if there exists a function gA such that FgA shatters A. The definition of Pγ -shattering
introduces a margin in the definition of fg (with fg (x) equal to 1 if f (x) ≥ g(x) + γ, to
0 if f (x) ≤ g(x) − γ and is ambiguous otherwise), and A is Pγ -shattered by F if, for
some gA , the corresponding FgA shatters A without ambiguities.

Definition 23.36 One then defines the P -dimension of F by

P-dim(F ) = max{|A| : A ⊂ R, F P -shatters A},

and similarly the Pγ -dimension of F is

Pγ -dim(F ) = max{|A| : A ⊂ R, F Pγ -shatters A}.

The Pγ -dimension of F will replace the VC-dimension in order to control the


covering numbers. More precisely, we have the following theorem [4].

Theorem 23.37 Let γ > 0 and assume that F has Pγ/4 -dimension D < ∞. Then,

!D log(4eN /(Dγ))
16N
N∞ (γ, N ) ≤ 2 .
γ2
630 CHAPTER 23. GENERALIZATION BOUNDS

Proof The proof is quite technical and relies on a combinatorial argument in which
F is first assumed to take integer values before addressing the continuous case.

Step 1. We first assume that functions in F take values in the finite set {1, . . . , r}
where r is an integer. For the time of this proof, we introduce yet another notion of
shattering called S-shattering (for strong shattering) which is essentially the same
as P1 -shattering, except that functions g are restricted to take values in {1, . . . , r}. Let
A be a finite subset of R. Given a function g : R → {1, . . . , r}, we say that (F , g) S-
shatters A if, for any B ⊂ A, there exist f ∈ F satisfying f (x) ≥ g(x) + 1 for x ∈ B and
f (x) ≤ g(x) − 1 if x ∈ A \ B. We say that F S-shatters A if (F , g) S-shatters A for some
g. The S-dimension of F is the cardinality of the largest subset of R that can be
S-shattered and will be denoted S-dim(F ). The first, and most difficult, part of the
proof is to show that, if S-dim(F ) = D, then
M(F (A), ρ∞ , 2) ≤ 2(|A|r 2 )dlog2 ye
with
D !
X |A| k
y= r
k
k=1
and die denotes the smallest integer larger than u ∈ R. Here, M is the packing
number defined in section 23.5.1.

To prove this, we can assume that r ≥ 3, since, for r ≤ 2, M(F (A), ρ∞ , 2) = 1 (the
diameter of F for the ρ∞ distance is 0 or 1). Let G(A) = {1, . . . , r}A be the set of all
functions f : A → {1, . . . , r} and let
UA = F ⊂ G(A) : ∀f , f 0 ∈ F, ∃x ∈ A with |f (x) − f 0 (x)| ≥ 2 .


For F ∈ UA , let
SA (F) = {(B, g) : B ⊂ A, B , ∅, g : B → {1, . . . , r}, (F, g) S-shatters B}.
Let tA (h) = min{|SA (F)| : F ∈ UA , |F| = h} (where the minimum of the empty set is +∞).
Since we are considering in UA all possible functions from A to {1, . . . , r}, it is clear
that tA (h) only depends on |A|, and we will also denote it by t(h, |A|).

Note that, by definition, if (B, g) ∈ SA (F), and F ⊂ F , then |B| ≤ D. So, the number
of elements in SA (F) for such an F is less or equal than the number of possible such
|A| k
pairs (B, g), which is strictly less than y = D
P
k=1 k r . So, if t(h, |A|) ≥ y, then there
cannot be any F ⊂ F in the set UA and M(F (A), ρ∞ , 2) < h. The rest of the proof
consists in showing that t(h, |A|) ≥ y.

For any n ≥ 1, we have t(2, n) = 1: fix x ∈ A, and F = {f1 , f2 } ∈ G such that f1 (x) = 1,
f2 (x) = 3 and f1 (y) = f2 (y) if y , x. Then only ({x}, g) is S-shattered by F, with g such
that g(x) = 2.
23.6. OTHER COMPLEXITY MEASURES 631

Now, assume that, for some integer m, t(2mnr 2 , n) < ∞, so that there exists F ∈ UA
such that |F| = 2mnr 2 . Arrange the elements of F into mnr 2 pairs {fi , fi0 }. For each
such pair, there exists xi ∈ A such that |fi (xi ) − fi0 (xi )| > 1. Since there are at most n
selected xi , one of them must be appearing at least mr 2 times. Call it x and keep (and
reindex) the corresponding mr 2 pairs, still denoted {fi , fi0 }. Now, there are at most
r(r − 1)/2 possible distinct values for the unordered pairs {fi (x), fi0 (x)}, so that one of
them must be appearing at least 2mr 2 /r(r − 1) > 2m times. Select these functions,
reindex them and exchange the role of fi and fi0 if needed to obtain 2m pairs {fi , fi0 }
such that fi (x) = k and fi0 (x) = l for all i and fixed k, l ∈ {1, . . . , r} such that k + 1 < l.
Let F1 = {f1 , . . . , f2m } and F10 = {f10 , . . . , f2m
0
}. Let A0 = A \ {x}. Then both F1 and F10
belong to UA0 , which implies that both SA0 (F1 ) and SA0 (F10 ) have cardinality at least
t(2m, n−1). Moreover, both sets are included in SA (F), and if (B, g) ∈ SA0 (F1 )∩SA0 (F10 ),
then (B ∪ {x}, g 0 ) ∈ SA (F), with g 0 (y) = g(y) for y ∈ B and g 0 (x) = k + 1. This provides
2t(2m, n−1) elements in SA (F) and shows the key inequality (which is obviously true
when the left-hand side is infinite)

t(2mnr 2 , n) ≥ 2t(2m, n − 1) .

This inequality can now be used to prove by induction that, for all 0 ≤ k < n, one has
$$t(2(nr^2)^k, n) \ge 2^k,$$
since
$$t(2((n+1)r^2)^{k+1}, n+1) \ge 2\,t(2((n+1)r^2)^k, n) \ge 2\,t(2(nr^2)^k, n) \ge 2^{k+1}.$$
For k ≥ n, one has 2(nr²)^k > rⁿ, where rⁿ is the number of functions in G(A), so that t(2(nr²)^k, n) = +∞. So t(2(nr²)^k, n) ≥ 2^k is valid for all k, and it suffices to take k = ⌈log₂ y⌉ to obtain the desired result.

Step 2. The next step uses a discretization scheme to extend the previous result to functions with values in [−1, 1]. More precisely, given f : R → [−1, 1] and η > 0, let
$$f^\eta(x) = \max\{k \in \mathbb N : 2k\eta - 1 \le f(x)\},$$
which takes values in {0, . . . , r} for r = ⌊η⁻¹⌋. If F is a class of functions with values in [−1, 1], define F^η = {f^η : f ∈ F}. With this notation, the following holds.

(a) For all γ ≤ η: S-dim(F^η) ≤ Pγ-dim(F).

(b) For all ε ≥ 4η and A ⊂ R: M(F(A), ρ∞, ε) ≤ M(F^η(A), ρ∞, 2).

To prove (a), assume that F^η S-shatters A, so that there exists g such that, for all B ⊂ A, there exists f ∈ F such that f^η(x) ≥ g(x) + 1 for x ∈ B and f^η(x) ≤ g(x) − 1 for x ∈ A \ B. Using the fact that 2ηf^η(x) − 1 ≤ f(x) < 2ηf^η(x) + 2η − 1, we get f(x) ≥ 2ηg(x) + 2η − 1 for x ∈ B and f(x) ≤ 2ηg(x) − 1 for x ∈ A \ B. So, taking g̃(x) = 2ηg(x) + η − 1 as threshold function (which does not depend on B), we see that F Pγ-shatters A if γ ≤ η.

For (b), we deduce from the definition of f^η that |f^η(x) − f̃^η(x)| > (2η)⁻¹|f(x) − f̃(x)| − 1, so that, if ε ≥ 4η, |f(x) − f̃(x)| ≥ ε implies |f^η(x) − f̃^η(x)| > 1, or, equivalently, |f^η(x) − f̃^η(x)| ≥ 2.
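Since 2kη − 1 ≤ f(x) is equivalent to k ≤ (f(x) + 1)/(2η), the discretization has the closed form f^η(x) = ⌊(f(x) + 1)/(2η)⌋. A quick sketch (NumPy assumed; illustrative only):

```python
# f_eta(x) = max{k in N : 2*k*eta - 1 <= f(x)} = floor((f(x) + 1) / (2*eta));
# maps [-1, 1]-valued outputs to the levels {0, ..., r}, r = floor(1/eta).
import numpy as np

def discretize(f_values: np.ndarray, eta: float) -> np.ndarray:
    return np.floor((f_values + 1.0) / (2.0 * eta)).astype(int)

vals = np.array([-1.0, -0.3, 0.0, 0.5, 1.0])
print(discretize(vals, eta=0.25))  # [0 1 2 3 4], levels in {0, ..., 4}
```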

Step 3. We can now conclude. Taking γ > 0, we have, if |A| = N,
$$N(\mathcal F(A), \rho_\infty, \gamma) \le M(\mathcal F(A), \rho_\infty, \gamma) \le M(\mathcal F^{\gamma/4}(A), \rho_\infty, 2) \le 2\left(\frac{16N}{\gamma^2}\right)^{\lceil\log_2 y\rceil}$$
with
$$y = \sum_{k=1}^{D}\binom{N}{k}(\gamma/4)^{-k} \le (\gamma/4)^{-D}\sum_{k=1}^{D}\binom{N}{k} \le \left(\frac{4Ne}{D\gamma}\right)^D.$$
Since the maximum of N(F(A), ρ∞, γ) over A with cardinality N is N∞(γ, N), the proof is complete. □

One can use this result to evaluate margin bounds on linear classifiers with bounded data. Let R be the ball with radius Λ in ℝ^d and consider the model class containing all functions f(x) = a₀ + bᵀx with a₀ ∈ [−Λ, Λ] and b ∈ ℝ^d, |b| ≤ 1. Let A = {x₁, . . . , x_N} be a finite subset of R. Then F Pγ-shatters A if and only if there exist g₁, . . . , g_N ∈ ℝ such that, for any sequence ξ = (ξ₁, . . . , ξ_N) ∈ {−1, 1}^N, there exist a₀^ξ ∈ [−Λ, Λ] and b^ξ ∈ ℝ^d, |b^ξ| ≤ 1, with ξ_k(a₀^ξ + (b^ξ)ᵀx_k − g_k) ≥ γ for k = 1, . . . , N. Summing over k, we find that
$$N\gamma + \sum_{k=1}^N g_k\xi_k \le a_0^\xi \sum_{k=1}^N \xi_k + (b^\xi)^T\sum_{k=1}^N \xi_k x_k.$$

This shows that, for any sequence ξ₁, . . . , ξ_N,
$$N\gamma + \sum_{k=1}^N g_k\xi_k \le \Lambda\left|\sum_{k=1}^N \xi_k\right| + \left|\sum_{k=1}^N \xi_k x_k\right|.$$

Applying the same inequality after changing the signs of ξ₁, . . . , ξ_N yields
$$N\gamma \le N\gamma + \left|\sum_{k=1}^N g_k\xi_k\right| \le \Lambda\left|\sum_{k=1}^N \xi_k\right| + \left|\sum_{k=1}^N \xi_k x_k\right|.$$

This shows, in particular, that (letting ξ₁, . . . , ξ_N be independent Rademacher random variables)
$$P\left(\Lambda\left|\sum_{k=1}^N \xi_k\right| + \left|\sum_{k=1}^N \xi_k x_k\right| \ge N\gamma\right) = 1.$$

However, using the inequality (A + B)² ≤ 2A² + 2B², we have
$$P\left(\Lambda\left|\sum_{k=1}^N\xi_k\right| + \left|\sum_{k=1}^N\xi_k x_k\right| \ge N\gamma\right) \le P\left(2\Lambda^2\left(\sum_{k=1}^N\xi_k\right)^2 + 2\left|\sum_{k=1}^N\xi_k x_k\right|^2 \ge N^2\gamma^2\right).$$

Since
$$E\left(2\Lambda^2\left(\sum_{k=1}^N\xi_k\right)^2 + 2\left|\sum_{k=1}^N\xi_k x_k\right|^2\right) = 2N\Lambda^2 + 2\sum_{k=1}^N|x_k|^2 \le 4N\Lambda^2,$$
Markov’s inequality implies
$$P\left(2\Lambda^2\left(\sum_{k=1}^N\xi_k\right)^2 + 2\left|\sum_{k=1}^N\xi_k x_k\right|^2 \ge N^2\gamma^2\right) \le \frac{4\Lambda^2}{N\gamma^2}.$$

We get a contradiction unless N ≤ 4Λ²/γ², which shows that Pγ-dim(F) ≤ 4Λ²/γ². Theorem 23.37 then implies that
$$N_\infty(\gamma, N) \le 2\left(\frac{16N}{\gamma^2}\right)^{\frac{63\Lambda^2}{\gamma^2}\log_2\left(\frac{16eN\gamma}{\Lambda}\right)}$$
and this upper bound can then be plugged into equations (23.43) or (23.44) to estimate the generalization error.

Beyond the explicit expression of the upper bound, the important point in the previous argument is that the Pγ-dimension is bounded independently of the dimension d of X (and the bound therefore also applies in the infinite-dimensional case). This should be compared to what we found for the VC-dimension of separating hyperplanes, which was d + 1 (cf. proposition 23.22).

Remark 23.38 Note that the upper bound obtained in theorem 23.34 depends on a parameter (γ) and the result is true for any choice of this parameter. It is tempting at this point to optimize the bound with respect to γ, but this would be a mistake, since a family of events each being likely does not imply that their intersection is likely too. However, with a little work, one can ensure that an intersection of slightly weaker inequalities holds. Indeed, assume that an estimate similar to (23.43) holds, in the form
$$P\left(R_0(\hat f_T) > U_T(\gamma) + t\right) \le C(\gamma)\,e^{-mt^2/2},$$
or, equivalently,
$$P\left(R_0(\hat f_T) > U_T(\gamma) + \sqrt{t^2 + \tfrac{2}{m}\log C(\gamma)}\,\right) \le e^{-mt^2/2},$$
where U_T(γ) depends on the data and is increasing (as a function of γ), and C(γ) is a decreasing function of γ. Consider a decreasing sequence (γ_k) that converges to 0 (for example γ_k = L2^{−k}), with γ₀ = L. Choose also an increasing function ε(γ). Then
$$P\left(R_0(\hat f_T) > \min\left\{U_T(\gamma) + \sqrt{t^2 + \tfrac{2}{m}\log C(\gamma) + \epsilon^2(\gamma)} : 0 < \gamma \le L\right\}\right)$$
$$\le P\left(R_0(\hat f_T) > \min\left\{U_T(\gamma_k) + \sqrt{t^2 + \tfrac{2}{m}\log C(\gamma_{k-1}) + \epsilon^2(\gamma_k)} : k \ge 1\right\}\right).$$
Moreover,
$$P\left(R_0(\hat f_T) > \min\left\{U_T(\gamma_k) + \sqrt{t^2 + \tfrac{2}{m}\log C(\gamma_{k-1}) + \epsilon^2(\gamma_k)} : k \ge 1\right\}\right)$$
$$\le \sum_{k=1}^{\infty} P\left(R_0(\hat f_T) > U_T(\gamma_k) + \sqrt{t^2 + \tfrac{2}{m}\log C(\gamma_{k-1}) + \epsilon^2(\gamma_k)}\,\right) \le \sum_{k=1}^{\infty}\frac{C(\gamma_k)}{C(\gamma_{k-1})}\,e^{-m\epsilon^2(\gamma_k)/2 - mt^2/2}.$$
So, it suffices to choose ε(γ) so that
$$C_0 = \sum_{k=1}^{\infty}\frac{C(\gamma_k)}{C(\gamma_{k-1})}\,e^{-m\epsilon^2(\gamma_k)/2} < \infty$$
to ensure that
$$P\left(R_0(\hat f_T) > \min\left\{U_T(\gamma) + \sqrt{t^2 + \tfrac{2}{m}\log C(\gamma) + \epsilon^2(\gamma)} : 0 < \gamma \le L\right\}\right) \le C_0\,e^{-mt^2/2}.$$
For example, if γ_k = L2^{−k}, one can take
$$\epsilon(\gamma) = \sqrt{\frac{2}{m}\left(\log\frac{C(\gamma)}{C(2\gamma)} + \log\gamma^{-1}\right)},$$
which yields C₀ ≤ L. □

23.6.2 Maximum discrepancy

Let T be a training set and let T₁ and T₂ form a fixed partition of the training set into two equal parts. Assume, for simplicity, that N is even and that the method for selecting the two parts is deterministic, e.g., place the first half of T in T₁ and the second one in T₂. Following Bartlett et al. [20], one can then define the maximum discrepancy on T by
$$C_T = \sup_{f\in\mathcal F}\left(E_{T_1}(f) - E_{T_2}(f)\right).$$

This discrepancy measures the extent to which estimators may differ when trained
on two independent half-sized training sets. For a binary classification problem, the
estimation of CT can be made with the same algorithm as the initial classifier, since
ET1 (f ) − ET2 (f ) is, up to a constant, exactly the classification error for the training set
in which the class labels are flipped for the data in T2 .
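As an illustration, here is a hedged sketch of this reduction (scikit-learn and the function names are our assumptions, not the book's code). Flipping the labels of one half, say T₁, turns the maximization of E_{T₁}(f) − E_{T₂}(f) into an ordinary empirical risk minimization, since the empirical error on the flipped set equals ½ − ½(E_{T₁}(f) − E_{T₂}(f)), so that C_T = 1 − 2 min_f (flipped error). The learner below minimizes a surrogate loss rather than the 0–1 error, so the result only approximates C_T:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def max_discrepancy(X: np.ndarray, y: np.ndarray) -> float:
    n = len(y) // 2                     # N even; first half = T1, second = T2
    y_flipped = y.copy()
    y_flipped[:n] = -y_flipped[:n]      # flip the labels of T1
    clf = LogisticRegression().fit(X, y_flipped)
    err = np.mean(clf.predict(X) != y_flipped)  # empirical error, flipped set
    return 1.0 - 2.0 * err              # C_T = 1 - 2 * min_f E_flipped(f)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=200) > 0, 1, -1)
print(max_discrepancy(X, y))
```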

Following [20], we now discuss concentration bounds that rely on C_T, and start with the following lemma.

Lemma 23.39 Introduce the function
$$\Phi(T) = \sup_{f\in\mathcal F}\left(R(f) - E_T(f)\right) - \sup_{f\in\mathcal F}\left(E_{T_1}(f) - E_{T_2}(f)\right).$$
Then E(Φ(T)) ≤ 0.
Proof Note that, if T′ is a training set, independent of T with identical distribution, then, for any f₀ ∈ F,
$$R(f_0) - E_T(f_0) = E\left(E_{T'}(f_0) - E_T(f_0)\mid T\right) \le E\left(\sup_{f\in\mathcal F}\left(E_{T'}(f) - E_T(f)\right)\,\Big|\, T\right),$$
so that
$$E\left(\sup_{f\in\mathcal F}(R(f) - E_T(f))\right) \le E\left(\sup_{f\in\mathcal F}(E_{T'}(f) - E_T(f))\right).$$

Now, for a given f, we have E_T(f) = ½(E_{T₁}(f) + E_{T₂}(f)) and, splitting T′ the same way, E_{T′}(f) = ½(E_{T₁′}(f) + E_{T₂′}(f)).

We can therefore write
$$E\left(\sup_{f\in\mathcal F}(R(f) - E_T(f))\right) \le E\left(\sup_{f\in\mathcal F}\frac12\left((E_{T_1'}(f) - E_{T_1}(f)) + (E_{T_2'}(f) - E_{T_2}(f))\right)\right)$$
$$\le \frac12\left(E\left(\sup_{f\in\mathcal F}(E_{T_1'}(f) - E_{T_1}(f))\right) + E\left(\sup_{f\in\mathcal F}(E_{T_2'}(f) - E_{T_2}(f))\right)\right) = E\left(\sup_{f\in\mathcal F}(E_{T_1}(f) - E_{T_2}(f))\right),$$
where we have used the fact that both (T₁′, T₁) and (T₂′, T₂) form random training sets with identical distribution to (T₁, T₂).

This proves that E(Φ(T)) ≤ 0. □

Using the lemma, one can write
$$P\left(\sup_{f\in\mathcal F}(R(f) - E_T(f)) \ge C_T + \epsilon\right) = P(\Phi(T) \ge \epsilon) \le P\left(\Phi(T) - E(\Phi(T)) \ge \epsilon\right).$$

One can then use McDiarmid’s inequality (theorem 23.13) after noticing that, letting z_k = (x_k, y_k) for k = 1, . . . , N,
$$\max_{z_1,\dots,z_N,z_k'}\left|\Phi(z_1,\dots,z_N) - \Phi(z_1,\dots,z_{k-1},z_k',z_{k+1},\dots,z_N)\right| \le \frac{3}{N},$$
yielding
$$P\left(\sup_{f\in\mathcal F}(R(f) - E_T(f)) \ge C_T + \epsilon\right) \le e^{-2N\epsilon^2/9}.$$

23.6.3 Rademacher complexity

We now extend the previous definition by computing discrepancies over random two-set partitions of the training set, which have equal size on average. This leads to the empirical Rademacher complexity of the function class. Let ξ₁, . . . , ξ_N be a sequence of Rademacher random variables (equal to −1 and +1 with equal probability 1/2). Then, the (empirical) Rademacher complexity of the training set T for the model class F is
$$\mathrm{rad}(T) = E\left(\sup_{f\in\mathcal F}\frac1N\sum_{k=1}^N \xi_k\, r(Y_k, f(X_k))\ \Big|\ T\right).$$
The mean Rademacher complexity is then the expectation of this quantity over the training set distribution. The Rademacher complexity can be computed with a—costly—Monte-Carlo simulation, in which the best estimator is computed with randomly flipped labels corresponding to the values of k such that ξ_k = −1 (see the sketch below).
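The following sketch shows what such a Monte-Carlo estimate can look like (all names are illustrative; in particular, the supremum over F is approximated by a maximum over a finite random sample of unit-norm linear classifiers with the 0–1 loss, so the value is a lower bound on the true supremum):

```python
import numpy as np

def empirical_rademacher(X, y, n_funcs=500, n_mc=200, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W = rng.normal(size=(n_funcs, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)      # unit-norm directions
    losses = (np.sign(X @ W.T) != y[:, None]).astype(float)  # (N, n_funcs)
    total = 0.0
    for _ in range(n_mc):
        xi = rng.choice([-1.0, 1.0], size=N)           # Rademacher signs
        total += np.max(xi @ losses) / N               # sup_f (1/N) sum_k ...
    return total / n_mc

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = np.where(X[:, 0] > 0, 1, -1)
print(empirical_rademacher(X, y))
```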

This measure of complexity was introduced to the machine learning framework in Koltchinskii and Panchenko [110] and Bartlett and Mendelson [19], and Rademacher sums have been extensively studied in relation to empirical processes (cf. Ledoux and Talagrand [118], chapter 4).

One can bound the Rademacher complexity in terms of the VC dimension.

Proposition 23.40 Let F be a function class such that D = VC-dim(F) < ∞. Then
$$\mathrm{rad}(T) \le \frac{3}{\sqrt N}\sqrt{2D\log(eN/D)}.$$
Proof Using Hoeffding’s inequality, one has
$$P\left(\sup_{f\in\mathcal F}\frac1N\sum_{k=1}^N\xi_k\, r(y_k, f(x_k)) > t\right) \le |\mathcal F(T)|\sup_{f\in\mathcal F}P\left(\frac1N\sum_{k=1}^N\xi_k\, r(y_k, f(x_k)) > t\right) \le |\mathcal F(T)|\,e^{-Nt^2/2}.$$
This implies that
$$P\left(\sup_{f\in\mathcal F}\left|\frac1N\sum_{k=1}^N\xi_k\, r(y_k, f(x_k))\right| > t\right) \le 2|\mathcal F(T)|\,e^{-Nt^2/2},$$
and proposition 23.4 implies
$$\mathrm{rad}(T) \le \frac{3}{\sqrt N}\sqrt{2\log(2|\mathcal F(T)|)}.$$
Therefore, if D = VC-dim(F) < ∞, proposition 23.20 implies
$$\mathrm{rad}(T) \le \frac{3}{\sqrt N}\sqrt{2D\log(eN/D)}.\qquad\square$$
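As a quick numeric sanity check (our code, not the book's), the bound decays like √(log N / N) for fixed D:

```python
import math

def rad_vc_bound(N: int, D: int) -> float:
    # 3 * sqrt(2 * D * log(e * N / D)) / sqrt(N), as in proposition 23.40
    return 3.0 * math.sqrt(2.0 * D * math.log(math.e * N / D)) / math.sqrt(N)

print(rad_vc_bound(N=10_000, D=10))  # approximately 0.38
```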

We now discuss generalization bounds using the Rademacher complexity. While we still consider binary classification problems (with R_Y = {−1, 1}), we will assume that F contains functions that can take arbitrary scalar values, and the 0–1 loss function becomes r(y, y′) = 1_{yy′ ≤ 0} with y ∈ {−1, 1} and y′ ∈ ℝ. We will also consider functions that dominate this loss, i.e., functions ρ : R_Y × ℝ → [0, 1] such that
$$r(y, y') \le \rho(y, y')$$
for all y ∈ R_Y, y′ ∈ ℝ. Some examples are the margin loss ρ*_h(y, y′) = 1_{yy′ ≤ h} for h ≥ 0, or the piecewise linear function
$$\rho_h(y, y') = \begin{cases} 1 & \text{if } yy' \le 0,\\ 1 - yy'/h & \text{if } 0 \le yy' \le h,\\ 0 & \text{if } yy' \ge h.\end{cases}$$

If G is a class of functions g : Z → ℝ, we will denote
$$\mathrm{Rad}_{\mathcal G}(z_1,\dots,z_N) = E\left(\sup_{g\in\mathcal G}\frac1N\sum_{k=1}^N\xi_k\, g(z_k)\right)$$
and
$$\mathrm{Rad}_{\mathcal G}(N) = E\left(\mathrm{Rad}_{\mathcal G}(Z_1,\dots,Z_N)\right).$$
Our previous notation can then be rewritten as rad(T) = Rad_G(z₁, . . . , z_N), where z_i = (x_i, y_i) and G is the class of functions g : (x, y) ↦ r(y, f(x)) for f ∈ F. The following theorem is proved in Koltchinskii and Panchenko [110] and Bartlett and Mendelson [19].

Theorem 23.41 Let ρ be a function dominating the risk function r(y, y′) = 1_{yy′ ≤ 0}. Let
$$\mathcal G^\rho = \{(x, y)\mapsto \rho(y, f(x)) - 1 : f\in\mathcal F\}$$
and
$$E^\rho_T(f) = \frac1N\sum_{k=1}^N\rho(y_k, f(x_k)).$$
Then
$$P\left(\sup_{f\in\mathcal F}\left(R(f) - E^\rho_T(f)\right) \ge 2\,\mathrm{Rad}_{\mathcal G^\rho}(N) + t\right) \le e^{-Nt^2/2}.$$
Proof For f ∈ F, we have
$$R(f) - E^\rho_T(f) \le E(\rho(Y, f(X))) - E^\rho_T(f) \le \Phi(Z_1,\dots,Z_N),$$
where
$$\Phi(Z_1,\dots,Z_N) = \sup_{g\in\mathcal G^\rho}\left(E(g(Z)) - \frac1N\sum_{k=1}^N g(Z_k)\right).$$
Since changing one variable among Z₁, . . . , Z_N changes Φ by at most 2/N, McDiarmid’s inequality implies that
$$P\left(\Phi(Z_1,\dots,Z_N) - E(\Phi(Z_1,\dots,Z_N)) \ge t\right) \le e^{-Nt^2/2}.$$
Now we have
$$E(\Phi(Z_1,\dots,Z_N)) = E\left(\sup_{g\in\mathcal G^\rho}\left(\frac1N\sum_{k=1}^N E\left(g(Z_k')\mid Z_1,\dots,Z_N\right) - \frac1N\sum_{k=1}^N g(Z_k)\right)\right)$$
$$\le E\left(\sup_{g\in\mathcal G^\rho}\frac1N\sum_{k=1}^N\left(g(Z_k') - g(Z_k)\right)\right)$$
$$\le E\left(E\left(\sup_{g\in\mathcal G^\rho}\frac1N\sum_{k=1}^N\xi_k\left(g(Z_k') - g(Z_k)\right)\ \Big|\ Z, Z'\right)\right)$$
$$\le 2\,E\left(E\left(\sup_{g\in\mathcal G^\rho}\frac1N\sum_{k=1}^N\xi_k\, g(Z_k)\ \Big|\ Z\right)\right) \le 2\,\mathrm{Rad}_{\mathcal G^\rho}(N),$$
of which the statement of the theorem is a direct consequence. □



23.6.4 Algorithmic Stability

Another result using McDiarmid’s inequality is proved in Bousquet and Elisseeff


[38], and is based on the stability of a classifier when one removes a single example
from the training set. As before, we consider training sets T of size N , where T is a
random variable.

For k ∈ {1, . . . , N} and a training set T = (x₁, y₁, . . . , x_N, y_N), we let T^(k) be the training set with sample (x_k, y_k) removed. One says that the predictor (T ↦ f̂_T) has uniform stability β_N for the loss function r if, for all T of size N, all k ∈ {1, . . . , N}, and all x, y,
$$|r(\hat f_T(x), y) - r(\hat f_{T^{(k)}}(x), y)| \le \beta_N. \tag{23.45}$$

With this definition, the following theorem holds.

Theorem 23.42 (Bousquet and Elisseeff [38]) Assume that f̂_T has uniform stability β_N for training sets of size N and that the loss function r(Y, f(X)) is almost surely bounded by M > 0. Then, for all ε > 2β_N, one has
$$P\left(R(\hat f_T) \ge E_T(\hat f_T) + \epsilon\right) \le e^{-\frac{2N(\epsilon - 2\beta_N)^2}{(4N\beta_N + M)^2}}.$$
Of course, this theorem is interesting only when β_N is small as a function of N, i.e., when Nβ_N is bounded.
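Before the proof, here is a hedged empirical illustration of definition (23.45) (scikit-learn assumed; ridge regression is a standard example of a uniformly stable algorithm). Retraining with one example removed and recording the largest loss change on probe points gives a lower bound on β_N, since the suprema over all training sets and all (x, y) are replaced by finite samples:

```python
import numpy as np
from sklearn.linear_model import Ridge

def stability_probe(X, y, X_probe, y_probe):
    def sq_loss(model, Xs, ys):
        return (model.predict(Xs) - ys) ** 2
    full = Ridge(alpha=1.0).fit(X, y)
    beta = 0.0
    for k in range(len(y)):
        mask = np.arange(len(y)) != k               # T^(k): example k removed
        loo = Ridge(alpha=1.0).fit(X[mask], y[mask])
        gap = np.abs(sq_loss(full, X_probe, y_probe)
                     - sq_loss(loo, X_probe, y_probe))
        beta = max(beta, float(gap.max()))
    return beta  # empirical lower bound on the uniform stability beta_N

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = X @ np.ones(4) + 0.1 * rng.normal(size=50)
X_probe = rng.normal(size=(20, 4))
y_probe = X_probe @ np.ones(4)
print(stability_probe(X, y, X_probe, y_probe))
```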

Proof Let Z_i = (X_i, Y_i) and F(Z₁, . . . , Z_N) = R(f̂_T) − E_T(f̂_T). We want to apply McDiarmid’s inequality (theorem 23.13) to F, and therefore estimate
$$\delta_k(F) = \max_{z_1,\dots,z_N,z_k'}\left|F(z_1,\dots,z_N) - F(z_1,\dots,z_{k-1},z_k',z_{k+1},\dots,z_N)\right|.$$
Introduce a training set T̃_k in which the variable z_k is replaced by z_k′ = (x_k′, y_k′). Because T̃_k^(k) = T^(k), we have
$$|R(\hat f_T) - R(\hat f_{\tilde T_k})| \le E\left(|r(Y, \hat f_T(X)) - r(Y, \hat f_{\tilde T_k}(X))|\right)$$
$$\le E\left(|r(Y, \hat f_T(X)) - r(Y, \hat f_{T^{(k)}}(X))|\right) + E\left(|r(Y, \hat f_{\tilde T_k}(X)) - r(Y, \hat f_{\tilde T_k^{(k)}}(X))|\right) \le 2\beta_N.$$

Similarly, we have
$$|E_T(\hat f_T) - E_{\tilde T_k}(\hat f_{\tilde T_k})| \le \frac1N\sum_{l\ne k}|r(y_l, \hat f_T(x_l)) - r(y_l, \hat f_{\tilde T_k}(x_l))| + \frac1N|r(y_k, \hat f_T(x_k)) - r(y_k', \hat f_{\tilde T_k}(x_k'))|$$
$$\le \frac1N\sum_{l\ne k}|r(y_l, \hat f_T(x_l)) - r(y_l, \hat f_{T^{(k)}}(x_l))| + \frac1N\sum_{l\ne k}|r(y_l, \hat f_{\tilde T_k}(x_l)) - r(y_l, \hat f_{\tilde T_k^{(k)}}(x_l))| + \frac{M}{N}$$
$$\le 2\beta_N + \frac{M}{N}.$$
Collecting these results, we find that δ_k(F) ≤ 4β_N + M/N, so that, by theorem 23.13,
$$P\left(R(\hat f_T) \ge E_T(\hat f_T) + E\left(R(\hat f_T) - E_T(\hat f_T)\right) + \epsilon\right) \le \exp\left(-\frac{2N\epsilon^2}{(4N\beta_N + M)^2}\right).$$

It remains to evaluate the expectation in this formula. Introducing as above variables Z₁′, . . . , Z_N′ and using the same notation for T̃_k, we have
$$E(R(\hat f_T)) = E\left(r(Y_k', \hat f_T(X_k'))\right) = E\left(r(Y_k, \hat f_{\tilde T_k}(X_k))\right).$$
Using this, we have
$$E\left(R(\hat f_T) - E_T(\hat f_T)\right) = \frac1N\sum_{k=1}^N E\left(r(Y_k, \hat f_{\tilde T_k}(X_k)) - r(Y_k, \hat f_T(X_k))\right)$$
$$= \frac1N\sum_{k=1}^N E\left(r(Y_k, \hat f_{\tilde T_k}(X_k)) - r(Y_k, \hat f_{\tilde T_k^{(k)}}(X_k))\right) + \frac1N\sum_{k=1}^N E\left(r(Y_k, \hat f_{T^{(k)}}(X_k)) - r(Y_k, \hat f_T(X_k))\right),$$
from which one deduces that
$$|E(R(\hat f_T) - E_T(\hat f_T))| \le 2\beta_N.$$

We therefore obtain
$$P\left(R(\hat f_T) \ge E_T(\hat f_T) + \epsilon + 2\beta_N\right) \le \exp\left(-\frac{2N\epsilon^2}{(4N\beta_N + M)^2}\right),$$
as required. □

23.6.5 PAC-Bayesian bounds

Our final discussion of concentration bounds for the empirical error uses a slightly different paradigm from that discussed so far. The main difference is that, instead of computing one predictor f̂_T from a training set T, the method returns a random variable with values in F or, equivalently, a probability distribution on F (therefore assuming that this space is measurable), which we will denote µ̂_T. The training-set error is now defined by
$$\bar E_T(\mu) = \int_{\mathcal F} E_T(f)\,d\mu(f),$$
for any probability distribution µ on F, while the generalization error is
$$\bar R(\mu) = \int_{\mathcal F} R(f)\,d\mu(f).$$
Our goal is to obtain upper bounds on R̄(µ̂_T) − Ē_T(µ̂_T) that hold with high probability. In this framework, we have the following result, in which Q denotes the space of probability distributions on F.

Assume that the loss function r takes its values in [0, 1]. Recall that KL(µ‖π) is the Kullback–Leibler divergence from µ to π, defined by
$$KL(\mu\|\pi) = \int_{\mathcal F}\log(\varphi(f))\,\varphi(f)\,d\pi(f)$$
if µ has a density ϕ with respect to π, and +∞ otherwise. Then, the following theorem holds.
Theorem 23.43 (McAllester [129]) With the notation above, for any fixed probability distribution π ∈ Q,
$$P\left(\sup_{\mu\in\mathcal Q}\left(\bar R(\mu) - \bar E_T(\mu) - \sqrt{t + \frac{KL(\mu\|\pi)}{2N}}\,\right) > 0\right) \le 2Ne^{-2Nt}. \tag{23.46}$$
Taking t = log(2N/δ)/(2N), the theorem is equivalent to the statement that, with probability 1 − δ, one has, for all µ ∈ Q,
$$\bar R(\mu) - \bar E_T(\mu) \le \sqrt{\frac{\log(2N/\delta) + KL(\mu\|\pi)}{2N}}. \tag{23.47}$$
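For a finite class, the right-hand side of (23.47) is directly computable; a sketch (illustrative names; µ and π are probability vectors over a finite F₀):

```python
import numpy as np

def pac_bayes_bound(train_errors, mu, pi, N, delta):
    mu = np.asarray(mu, dtype=float)
    pi = np.asarray(pi, dtype=float)
    mask = mu > 0
    kl = float(np.sum(mu[mask] * np.log(mu[mask] / pi[mask])))  # KL(mu || pi)
    emp = float(np.dot(mu, train_errors))                       # bar{E}_T(mu)
    return emp + np.sqrt((np.log(2 * N / delta) + kl) / (2 * N))

# a posterior mu concentrated on low-error predictors, a fixed prior pi
print(pac_bayes_bound(train_errors=[0.10, 0.12, 0.30],
                      mu=[0.8, 0.2, 0.0], pi=[0.5, 0.25, 0.25],
                      N=1000, delta=0.05))
```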
Proof We first show that, for any probability distributions π, µ on F and any function H on F,
$$\int_{\mathcal F} H(f)\,d\mu - \log\int_{\mathcal F} e^{H(f)}\,d\pi \le KL(\mu\|\pi).$$

Indeed, assume that µ has a density ϕ with respect to π (otherwise the upper bound is infinite) and let
$$\varphi_H = \frac{e^H}{\int_{\mathcal F} e^{H(f)}\,d\pi}.$$
Then
$$KL(\mu\|\pi) - \int_{\mathcal F} H(f)\,d\mu + \log\int_{\mathcal F} e^{H(f)}\,d\pi = \int_{\mathcal F}\varphi\log\varphi\,d\pi - \int_{\mathcal F}\varphi\log\varphi_H\,d\pi = \int_{\mathcal F}\log\left(\frac{\varphi}{\varphi_H}\right)\frac{\varphi}{\varphi_H}\,\varphi_H\,d\pi = KL(\mu\|\varphi_H\pi) \ge 0,$$
which proves the result (and also shows that equality can only hold when ϕ = ϕ_H, π-almost surely).

Let χ(u) = max(u, 0)². We can use this inequality to show that, for any probability distribution µ ∈ Q and λ > 0,
$$\lambda\chi\left(\bar R(\mu) - \bar E_T(\mu)\right) \le \lambda\int_{\mathcal F}\chi(R(f) - E_T(f))\,d\mu(f) \le KL(\mu\|\pi) + \log\int_{\mathcal F} e^{\lambda\chi(R(f) - E_T(f))}\,d\pi,$$
where we have applied Jensen’s inequality to the convex function χ. This yields
$$e^{\lambda\chi(\bar R(\mu) - \bar E_T(\mu))} \le e^{KL(\mu\|\pi)}\int_{\mathcal F} e^{\lambda\chi(R(f) - E_T(f))}\,d\pi.$$

Hoeffding’s inequality implies that, for all f ∈ F and t ≥ 0,
$$P\left(\chi(R(f) - E_T(f)) > t\right) = P\left(R(f) - E_T(f) > \sqrt t\,\right) \le e^{-2Nt},$$
so that
$$E\left(e^{\lambda\chi(R(f)-E_T(f))}\right) = \int_0^\infty P\left(\lambda\chi(R(f)-E_T(f)) > \log t\right)dt \le 1 + \int_1^{e^\lambda} e^{-\frac{2N\log t}{\lambda}}\,dt = 1 + \int_0^{\lambda} e^{-\frac{2Nu}{\lambda}+u}\,du = 1 + \lambda\,\frac{e^{\lambda-2N}-1}{\lambda-2N}.$$

From this and Markov’s inequality, we get, for any λ > 0,
$$P\left(\sup_{\mu\in\mathcal Q}\left(\chi(\bar R(\mu) - \bar E_T(\mu)) - \frac{KL(\mu\|\pi)}{\lambda}\right) > t\right) \le e^{-\lambda t}\left(1 + \lambda\,\frac{e^{\lambda-2N}-1}{\lambda-2N}\right).$$

Taking λ = 2N yields
$$P\left(\sup_{\mu\in\mathcal Q}\left(\chi(\bar R(\mu) - \bar E_T(\mu)) - \frac{KL(\mu\|\pi)}{2N}\right) > t\right) \le 2Ne^{-2Nt},$$
which implies
$$P\left(\sup_{\mu\in\mathcal Q}\left(\bar R(\mu) - \bar E_T(\mu) - \sqrt{t + \frac{KL(\mu\|\pi)}{2N}}\,\right) > 0\right) \le 2Ne^{-2Nt},$$
concluding the proof. □

Remark 23.44 Note that the proof, which follows that given in Audibert and Bousquet [15], provides a family of inequalities obtained by taking λ = 2N/c in the final step, with c > 1. In this case,
$$1 + \lambda\,\frac{e^{\lambda-2N}-1}{\lambda-2N} \le 1 + \frac{\lambda}{2N-\lambda} = \frac{c}{c-1},$$
and one gets
$$P\left(\sup_{\mu\in\mathcal Q}\left(\bar R(\mu) - \bar E_T(\mu) - \sqrt{ct + \frac{c\,KL(\mu\|\pi)}{2N}}\,\right) > 0\right) \le \frac{c}{c-1}\,e^{-2Nt}. \qquad\square$$

Remark 23.45 One special case of theorem 23.43 is when π is a discrete probability measure supported by a subset F₀ of F and µ corresponds to a deterministic predictor optimized over F₀, and is therefore a Dirac measure δ_f on some element f ∈ F₀. Because δ_f has density ϕ(g) = 1/π(f) if g = f and 0 otherwise with respect to π, we have KL(δ_f‖π) = −log π(f), and theorem 23.43 implies that, with probability larger than 1 − δ,
$$R(f) - E_T(f) \le \sqrt{\frac{\log(2N/\delta) - \log\pi(f)}{2N}}.$$
The term log 2N is however superfluous in this simple context, because one can write, for any t > 0,
$$P\left(\sup_{f\in\mathcal F_0}\left(R(f) - E_T(f) - \sqrt{t - \frac{\log\pi(f)}{2N}}\,\right) \ge 0\right) \le \sum_{f\in\mathcal F_0} e^{-2N\left(t - \frac{\log\pi(f)}{2N}\right)} = e^{-2Nt},$$
so that, with probability 1 − δ (letting t = log(1/δ)/(2N)), for all f ∈ F₀,
$$R(f) - E_T(f) \le \sqrt{\frac{-\log\delta - \log\pi(f)}{2N}}.\qquad\square$$

23.7 Application to model selection

We now describe how the previous results can, in principle, be applied to model selection [20]. We assume that we have a countable family of nested model classes (F^(j), j ∈ J). Denote, as usual, by E_T(f) the empirical prediction error in the training set for a given function f. We will denote by f̂_T^(j) a minimizer of the in-sample error over F^(j), so that
$$E_T(\hat f_T^{(j)}) = \min_{f\in\mathcal F^{(j)}} E_T(f).$$
In the model selection problem, one would like to determine the best model class, j = j(T), such that the prediction error R(f̂_T^(j)) is minimal or, more realistically, determine j* such that R(f̂_T^(j*)) is not too far from the optimal one.

We will consider penalty-based methods in which one minimizes Ẽ_T(f) = E_T(f) + C_T(j) over f ∈ F^(j) and j ∈ J to determine j(T). The penalty, C_T, may also be data-dependent, and will therefore be a random variable. The previous concentration inequalities provided highly probable upper bounds for R(f̂_T^(j)), each exhibiting a random variable Γ_T^(j) that is larger than R(f̂_T^(j)) with probability close to one. More precisely, we obtained inequalities taking the form (when applied to F^(j))
$$P\left(R(\hat f_T^{(j)}) \ge \Gamma_T^{(j)} + t\right) \le c_j\, e^{-mt^2} \tag{23.48}$$
for some known constants c_j and m. For example, the VC-dimension bounds have Γ_T^(j) = E_T(f̂_T^(j)), c_j = 2S_{F^(j)}(2N) and m = N/8.

Given such inequalities, one can develop a model selection strategy that relies on a priori weights, provided by a sequence π_j of positive numbers such that Σ_{j∈J} π_j = 1. Define
$$\tilde\pi_j = \frac{\pi_j/c_j}{\sum_{j'\in J}\pi_{j'}/c_{j'}},$$
and let
$$C_T(j) = \Gamma_T^{(j)} - E_T(\hat f_T^{(j)}) + \sqrt{-\frac{\log\tilde\pi_j}{m}},$$
yielding a penalty-based method that requires the minimization of
$$\tilde E_T(f) = \left(E_T(f) - E_T(\hat f_T^{(j)})\right) + \Gamma_T^{(j)} + \sqrt{-\frac{\log\tilde\pi_j}{m}},\qquad f\in\mathcal F^{(j)}.$$
The selected model class is then F^(j*), where j* minimizes
$$\Gamma_T^{(j)} + \sqrt{-\frac{\log\tilde\pi_j}{m}}.$$
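Schematically, the selection rule looks as follows (a sketch with all inputs assumed given; the values Γ_T^(j), c_j and m come from whichever concentration inequality of the form (23.48) is being used):

```python
import numpy as np

def select_model(gammas, pis, cs, m):
    gammas = np.asarray(gammas, dtype=float)   # Gamma_T^(j) for each class j
    w = np.asarray(pis, dtype=float) / np.asarray(cs, dtype=float)
    pi_tilde = w / w.sum()                     # reweighted priors pi_tilde_j
    scores = gammas + np.sqrt(-np.log(pi_tilde) / m)
    return int(np.argmin(scores))              # selected index j*

print(select_model(gammas=[0.31, 0.27, 0.29], pis=[0.5, 0.25, 0.25],
                   cs=[10.0, 100.0, 1000.0], m=125.0))
```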

The same proof as that provided at the end of section 23.6.5 justifies this procedure. Indeed, for t > 0,
$$P\left(R(\hat f_T^{(j_*)}) - \tilde E_T(\hat f_T^{(j_*)}) \ge t\right) \le P\left(\max_j\left(R(\hat f_T^{(j)}) - \tilde E_T(\hat f_T^{(j)})\right) \ge t\right)$$
$$\le \sum_j P\left(R(\hat f_T^{(j)}) \ge \Gamma_T^{(j)} + t + \sqrt{-\frac{\log\tilde\pi_j}{m}}\,\right) \le \tilde c\sum_j \pi_j\, e^{-mt^2} = \tilde c\, e^{-mt^2},$$
with c̃ = (Σ_{j∈J} π_j/c_j)^{−1}, where the last line uses (23.48) together with (a + b)² ≥ a² + b² and the definition of π̃_j.
Bibliography

[1] Pierre-Antoine Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization


algorithms on matrix manifolds. Princeton University Press, 2008.

[2] Hirotugu Akaike. Information theory and an extension of the maximum like-
lihood principle. In 2nd International Symposium on Information Theory, 1973.
Akadémiai Kiadó, 1973.

[3] Stéphanie Allassonniere and Laurent Younes. A stochastic algorithm for prob-
abilistic independent component analysis. The Annals of Applied Statistics, 6
(1):125–160, 2012.

[4] Noga Alon, Shai Ben-David, Nicolo Cesa-Bianchi, and David Haussler. Scale-
sensitive dimensions, uniform convergence, and learnability. Journal of the
ACM (JACM), 44(4):615–631, 1997.

[5] Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for
vector-valued functions: A review. Foundations and Trends in Machine Learn-
ing, 4(3):195–266, 2012. ISSN 1935-8237. doi: 10.1561/2200000036.

[6] Yali Amit. Convergence properties of the gibbs sampler for perturbations of
gaussians. The Annals of Statistics, 24(1):122–140, 1996.

[7] Yali Amit and Donald Geman. Shape quantization and recognition with ran-
domized trees. Neural computation, 9(7):1545–1588, 1997.

[8] Alano Ancona, Donald Geman, Nobuyuki Ikeda, and D Geman. Random
fields and inverse problems in imaging. In Ecole d’ete de Probabilites de Saint-
Flour XVIII-1988, pages 115–193. Springer, 1990.

[9] Brian D.O. Anderson. Reverse-time diffusion equation models. Stochastic Pro-
cesses and their Applications, 12(3):313–326, 1982.

[10] Martin Anthony and Peter L. Bartlett. Neural network learning: Theoretical
foundations. cambridge university press, 2009.

647

[11] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein Genera-
tive Adversarial Networks. In Proceedings of the 34th International Conference
on Machine Learning - Volume 70, ICML’17, pages 214–223. JMLR.org, 2017.
[12] Nachman Aronszajn. Theory of Reproducing Kernels. Trans. Am. Math. Soc.,
68:337–404, 1950.
[13] Krishna B. Athreya, Hani Doss, and Jayaram Sethuraman. On the convergence
of the markov chain simulation method. The Annals of Statistics, 24(1):69–100,
1996.
[14] Hagai Attias. A Variational Baysian Framework for Graphical Models. In
NIPS, volume 12. Citeseer, 1999.
[15] Jean-Yves Audibert and Olivier Bousquet. Combining pac-bayesian and
generic chaining bounds. Journal of Machine Learning Research, 8(Apr):863–
889, 2007.
[16] Adrian Barbu and Song-Chun Zhu. Generalizing swendsen-wang to sampling
arbitrary posterior probabilities. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 27(8):1239–1253, 2005.
[17] Viorel Barbu. Differential equations. Springer, 2016.
[18] Peter Bartlett and John Shawe-Taylor. Generalization performance of support
vector machines and other pattern classifiers. Advances in Kernel methods—
support vector learning, pages 43–54, 1999.
[19] Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexi-
ties: Risk bounds and structural results. Journal of Machine Learning Research,
3(Nov):463–482, 2002.
[20] Peter L. Bartlett, Stéphane Boucheron, and Gábor Lugosi. Model selection and
error estimation. Machine Learning, 48:85–113, 2002.
[21] Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian.
Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear
neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019.
[22] Amir Beck. Introduction to nonlinear optimization: Theory, algorithms, and ap-
plications with MATLAB. SIAM, 2014.
[23] Michel Benaı̈m. Dynamics of stochastic approximation algorithms. In Semi-
naire de probabilites XXXIII, pages 1–68. Springer, 1999.
[24] George Bennett. Probability inequalities for the sum of independent random
variables. Journal of the American Statistical Association, 57(297):33–45, 1962.

[25] Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive algorithms
and stochastic approximations, volume 22. Springer Science & Business Media,
2012.

[26] Nils Berglund. Long-time dynamics of stochastic differential equations. arXiv


preprint arXiv:2106.12998, 2021.

[27] Dimitri Bertsekas. Convex optimization theory, volume 1. Athena Scientific,


2009.

[28] Rajendra Bhatia. Matrix analysis, volume 169. Springer Science & Business
Media, 2013.

[29] Peter J. Bickel and Kjell A. Doksum. Mathematical statistics: basic ideas and
selected topics, volume I, volume 117. CRC Press, 2015.

[30] Peter J. Bickel, Ya’acov Ritov, and Alexandre B. Tsybakov. Simultaneous anal-
ysis of lasso and dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

[31] Patrick Billingsley. Convergence of probability measures. John Wiley & Sons,
2013.

[32] Salomon Bochner. Vorlesungen über fouriersche integrale. Bull Amer Math
Soc, 39:184, 1933.

[33] Vladimir I. Bogachev. Measure Theory. Springer, 2007.

[34] Joseph-Frédéric Bonnans, Jean Charles Gilbert, Claude Lemaréchal, and Clau-
dia A. Sagastizábal. Numerical optimization: theoretical and practical aspects.
Springer Science & Business Media, 2006.

[35] Ingwer Borg and Patrick J.F. Groenen. Modern multidimensional scaling: Theory
and applications. Springer Science & Business Media, 2005.

[36] Jonathan Borwein and Adrian S. Lewis. Convex analysis and nonlinear opti-
mization: theory and examples. Springer Science & Business Media, 2010.

[37] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. A sharp concentra-
tion inequality with applications. Random Structures & Algorithms, 16(3):277–
292, 2000.

[38] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of
machine learning research, 2(Mar):499–526, 2002.

[39] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein.
Distributed optimization and statistical learning via the alternating direction
method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–
122, 2011.

[40] Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.

[41] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

[42] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A. Olshen. Clas-
sification and regression trees. CRC press, 1984.

[43] Dmitri Burago, Iu D. Burago, Yuri Burago, Sergei A. Ivanov, and Sergei Ivanov.
A course in metric geometry, volume 33. American Mathematical Soc., 2001.

[44] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value
thresholding algorithm for matrix completion. SIAM Journal on optimization,
20(4):1956–1982, 2010.

[45] Tadeusz Caliński and Jerzy Harabasz. A dendrite method for cluster analysis.
Communications in Statistics-theory and Methods, 3(1):1–27, 1974.

[46] Emmanuel J. Candes and Terence Tao. Decoding by linear programming. IEEE
Trans. information theory, 51(12):4203–4215, 2005.

[47] Emmanuel J. Candes and Terence Tao. The dantzig selector: statistical esti-
mation when p is much larger than n. Annals of statistics, 35, 2007.

[48] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal
component analysis? Journal of the ACM (JACM), 58(3):11, 2011.

[49] John Canny. Gap: a factor model for discrete data. In Proceedings of the 27th
annual international ACM SIGIR conference on Research and development in in-
formation retrieval, pages 122–129, 2004.

[50] B. Chalmond. An iterative Gibbsian technique for reconstruction of m-ary


images. Pattern recognition, 22(6):747–761, 1989. ISSN 0031-3203.

[51] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duve-
naud. Neural ordinary differential equations. In S. Bengio, H. Wallach,
H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances
in Neural Information Processing Systems 31, pages 6571–6583. Curran Asso-
ciates, Inc., 2018.

[52] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system.
In Proceedings of the 22nd acm sigkdd international conference on knowledge dis-
covery and data mining, pages 785–794, 2016.

[53] Pierre Comon. Independent component analysis, a new concept? Signal pro-
cessing, 36(3):287–314, 1994.

[54] Thomas M. Cover and Joy A. Thomas. Elements of information theory. John
Wiley & Sons, 2012.

[55] Robert G. Cowell, A. Philip Dawid, Steffen L. Lauritzen, and David J. Spiegel-
halter. Probabilistic networks and expert systems. Springer, 2007.

[56] Imre Csiszár. On topological properties of f-divergence. Studia Sci. Math.


Hungar., 2:330–339, 1967.

[57] George Darmois. Analyse générale des liaisons stochastiques: etude partic-
ulière de l’analyse factorielle linéaire. Revue de l’Institut international de statis-
tique, pages 2–8, 1953.

[58] Bernard Delyon, Marc Lavielle, and Eric Moulines. Convergence of a stochas-
tic approximation version of the em algorithm. Annals of statistics, pages 94–
128, 1999.

[59] Amir Dembo and Ofer Zeitouni. Large deviations techniques and applica-
tions. 1998. Applications of Mathematics, 38, 2011.

[60] Luc Devroye, Lázló Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern
Recognition. Springer, 1996.

[61] Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation dis-
tance between high-dimensional gaussians with the same mean. arXiv preprint
arXiv:1810.08693, 2018.

[62] Jean Dieudonné. Infinitesimal Calculus. Houghton Mifflin, 1971.

[63] Edsger W. Dijkstra. A note on two problems in connexion with graphs. Nu-
merische mathematik, 1(1):269–271, 1959. ISSN 0029-599X.

[64] Petros Drineas, Alan Frieze, Ravi Kannan, Santosh Vempala, and Vish-
wanathan Vinay. Clustering large graphs via the singular value decomposi-
tion. Machine learning, 56:9–33, 2004.

[65] Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth.
Hybrid monte carlo. Physics letters B, 195(2):216–222, 1987.

[66] Richard M. Dudley. Real analysis and probability. Chapman and Hall/CRC,
2018.

[67] Marie Duflo. Random iterative models, volume 34. Springer Science & Business
Media, 2013.

[68] HA Eiselt, Carl-Louis Sandblom, et al. Nonlinear optimization: Methods and


applications. Springer, 2019.

[69] Stewart N. Ethier and Thomas G. Kurtz. Markov processes: Characterization
and convergence. Wiley, 1986.

[70] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and
Andrew Zisserman. The pascal visual object classes (voc) challenge. Interna-
tional journal of computer vision, 88(2):303–338, 2010.

[71] James A. Fill. An interruptible algorithm for perfect sampling via Markov
chains. The Annals of Applied Probability, 8(1):131–162, 1998.

[72] P. Thomas Fletcher and Sarang Joshi. Principal geodesic analysis on symmetric
spaces: Statistics of diffusion tensors. In Computer vision and mathematical
methods in medical and biomedical image analysis, pages 87–98. Springer, 2004.

[73] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of
on-line learning and an application to boosting. Journal of computer and system
sciences, 55(1):119–139, 1997.

[74] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic re-
gression: a statistical view of boosting (with discussion and a rejoinder by the
authors). The annals of statistics, 28(2):337–407, 2000.

[75] Jerome H. Friedman. Greedy function approximation: a gradient boosting
machine. Annals of statistics, pages 1189–1232, 2001.

[76] Dan Geiger and Judea Pearl. On the logic of causal models. In Machine intel-
ligence and pattern recognition, volume 9, pages 3–14. Elsevier, 1990.

[77] Dan Geiger, Thomas Verma, and Judea Pearl. Identifying independence in
bayesian networks. Networks, 20(5):507–534, August 1990. ISSN 00283045.
doi: 10.1002/net.3230200504.

[78] Donald Geman, Christian d’Avignon, Daniel Q. Naiman, and Raimond L


Winslow. Classifying gene expression profiles from pairwise mrna compar-
isons. Statistical applications in genetics and molecular biology, 3(1):1–19, 2004.

[79] Stuart Geman and Donald Geman. Stochastic relaxation, gibbs distributions,
and the bayesian restoration of images. IEEE Transactions on pattern analysis
and machine intelligence, (6):721–741, 1984.

[80] Stuart Geman and Chii-Ruey Hwang. Nonparametric maximum likelihood


estimation by the method of sieves. The Annals of Statistics, pages 401–414,
1982.

[81] M. Gondran and M. Minoux. Graphs and algorithms. John Wiley & Sons, 1983.

[82] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-
Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adver-
sarial nets. In Advances in neural information processing systems, pages 2672–
2680, 2014.

[83] Ulf Grenander. Abstract Inference. Wiley, 1981.

[84] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and
Aaron C. Courville. Improved training of Wasserstein GANs. In Advances
in neural information processing systems, pages 5767–5777, 2017.

[85] Madan M. Gupta and J. Qi. Theory of T-norms and fuzzy inference methods.
Fuzzy Sets and Systems, 40(3):431–450, April 1991. ISSN 0165-0114.

[86] Te Sun Han. Nonnegative entropy measures of multivariate symmetric corre-


lations. Information and Control, 36:133–156, 1978.

[87] Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The elements of
statistical learning. Springer, 2003.

[88] W. Keith Hastings. Monte carlo sampling methods using markov chains and
their applications. Biometrika, 57(1):97–109, 1970.

[89] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pages 770–778. IEEE, 2016.

[90] Geoffrey E. Hinton and Sam Roweis. Stochastic neighbor embedding. Ad-
vances in neural information processing systems, 15:857–864, 2002.

[91] Leslie M. Hocking. Optimal Control: An Introduction to the Theory with Appli-
cations. Oxford University Press, 1991.

[92] Wassily Hoeffding. Probability inequalities for sums of bounded random vari-
ables. In The Collected Works of Wassily Hoeffding, pages 409–426. Springer,
1994.

[93] Roger A. Horn and Charles R. Johnson. Matrix analysis. Cambridge university
press, 2012.

[94] Aapo Hyvärinen. New approximations of differential entropy for independent


component analysis and projection pursuit. In Advances in neural information
processing systems, pages 273–279, 1998.

[95] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical


models by score matching. Journal of Machine Learning Research, 6(4), 2005.

[96] Nobuyuki Ikeda and Shinzo Watanabe. Stochastic differential equations and
diffusion processes. Elsevier, 1981.

[97] Tommi Sakari Jaakkola. Variational methods for inference and estimation in
graphical models. PhD Thesis, Massachusetts Institute of Technology, 1997.

[98] Vojtech Jarnik. O jistem problemu minimalnim (about a certain minimal prob-
lem). Prace Moravske Prirodovedecke Spolecnosti, 6:57–63, 1930.

[99] Finn Jensen and Frank Jensen. Optimal junction trees. In Proceedings of the
Tenth Conference on Uncertainty in Artificial Intelligence, pages 360–366, 1994.

[100] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K.


Saul. An introduction to variational methods for graphical models. Machine
learning, 37(2):183–233, 1999.

[101] Peter Kafka, F Österreicher, and István Vincze. On powers of f-divergences


defining a distance. Studia Sci. Math. Hungar, 26(4):415–422, 1991.

[102] Abram M. Kagan, Calyampudi Radhakrishna Rao, and Yurij Vladimirovich
Linnik. Characterization problems in mathematical statistics. Wiley, 1973.

[103] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimiza-
tion. arXiv preprint arXiv:1412.6980, 2014.

[104] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. 2014.

[105] Diederik P. Kingma and Max Welling. An Introduction to Variational Autoen-
coders. Foundations and Trends® in Machine Learning, 12(4):307–392, 2019.

[106] John Kingman. Completely random measures. Pacific Journal of Mathematics,


21(1):59–78, 1967.

[107] Peter E. Kloeden and Eckhard Platen. Numerical solutions of stochastic differen-
tial equations. Springer, 1992.

[108] Ivan Kobyzev, Simon J.D. Prince, and Marcus A. Brubaker. Normalizing flows:
An introduction and review of current methods. IEEE transactions on pattern
analysis and machine intelligence, 43(11):3964–3979, 2020.

[109] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and
techniques. The MIT Press, 2009.

[110] Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions


and bounding the generalization error of combined classifiers. The Annals of
Statistics, 30(1):1–50, 2002.

[111] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classifica-
tion with deep convolutional neural networks. Communications of the ACM,
60(6):84–90, 2017.

[112] Wojtek J. Krzanowski and Y.T. Lai. A criterion for determining the number of
groups in a data set using sum-of-squares clustering. Biometrics, pages 23–34,
1988.
[113] Estelle Kuhn and Marc Lavielle. Coupling a stochastic approximation version
of EM with an MCMC procedure. ESAIM: Probability and Statistics, 8:115–131,
2004.
[114] Harold Kushner and G. George Yin. Stochastic approximation and recursive
algorithms and applications, volume 35. Springer Science & Business Media,
2003.
[115] Steffen L Lauritzen. Graphical models, volume 17. Clarendon Press, 1996.
[116] Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech,
and time series. The handbook of brain theory and neural networks, 3361(10):
1995, 1995.
[117] Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E.
Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied
to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
[118] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperime-
try and processes. Springer Science & Business Media, 1991.
[119] Erich L. Lehmann and George Casella. Theory of point estimation. Springer
Science & Business Media, 2006.
[120] Benedict Leimkuhler and Sebastian Reich. Simulating Hamiltonian Dynamics.
Cambridge Monographs on Applied and Computational Mathematics. Cam-
bridge University Press, 2005.
[121] Lennart Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions
on Automatic Control, 22(4):551–575, August 1977. ISSN 1558-2523.
[122] Stuart Lloyd. Least squares quantization in pcm. IEEE transactions on infor-
mation theory, 28(2):129–137, 1982.
[123] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE.
Journal of machine learning research, 9(Nov):2579–2605, 2008.
[124] Jack Macki and Aaron Strauss. Introduction to Optimal Control Theory.
Springer Science & Business Media, 2012.
[125] James MacQueen. Some methods for classification and analysis of multivari-
ate observations. In Proceedings of the fifth Berkeley symposium on mathematical
statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.

[126] Adam a Margolin, Ilya Nemenman, Katia Basso, Chris Wiggins, Gustavo
Stolovitzky, Riccardo Dalla Favera, and Andrea Califano. ARACNE: an al-
gorithm for the reconstruction of gene regulatory networks in a mammalian
cellular context. BMC bioinformatics, 7 Suppl 1:S7, January 2006. ISSN 1471-
2105. doi: 10.1186/1471-2105-7-S1-S7.

[127] Enzo Marinari and Giorgio Parisi. Simulated tempering: a new monte carlo
scheme. Europhysics letters, 19(6):451, 1992.

[128] Pascal Massart. Concentration inequalities and model selection, volume 6.


Springer, 2007.

[129] David A. McAllester. Pac-bayesian model averaging. In COLT, volume 99,


pages 164–170. Citeseer, 1999.

[130] James A. McHugh. Algorithmic graph theory. New Jersey: Prentice-Hall Inc,
1990.

[131] Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold
Approximation and Projection for Dimension Reduction. arXiv:1802.03426
[cs, stat], September 2020. arXiv: 1802.03426.

[132] Henry P. McKean. Stochastic integrals, volume 353. American Mathematical


Society, 1969.

[133] Kerrie L. Mengersen and Richard L. Tweedie. Rates of convergence of the


hastings and metropolis algorithms. The annals of Statistics, 24(1):101–121,
1996.

[134] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Au-


gusta H. Teller, and Edward Teller. Equation of state calculations by fast com-
puting machines. The journal of chemical physics, 21(6):1087–1092, 1953.

[135] Sean P. Meyn and Richard L. Tweedie. Stability of markovian processes ii:
Continuous-time processes and sampled chains. Advances in Applied Probabil-
ity, 25(3):487–517, 1993.

[136] Sean P. Meyn and Richard L. Tweedie. Stability of markovian processes iii:
Foster–lyapunov criteria for continuous-time processes. Advances in Applied
Probability, 25(3):518–548, 1993.

[137] Sean P. Meyn and Richard L. Tweedie. Markov chains and stochastic stability.
Springer Science & Business Media, 2012.

[138] Leon Mirsky. A trace inequality of john von neumann. Monatshefte für mathe-
matik, 79(4):303–306, 1975.

[139] Michel Métivier and Pierre Priouret. Théorèmes de convergence presque sûre
pour une classe d’algorithmes stochastiques à pas décroissant. Probability The-
ory and Related Fields, 74(3):403–428, 1987.

[140] Elizbar A. Nadaraya. On estimating regression. Theory of Probability & Its


Applications, 9(1):141–142, 1964.

[141] Radford M. Neal. Sampling from multimodal distributions using tempered


transitions. Statistics and computing, 6:353–366, 1996.

[142] Radford M. Neal. Markov chain sampling methods for dirichlet process mix-
ture models. Journal of computational and graphical statistics, 9(2):249–265,
2000.

[143] Radford M. Neal. Mcmc using hamiltonian dynamics. arXiv preprint


arXiv:1206.1901, 2012.

[144] Radford M. Neal and Geoffrey E. Hinton. A view of the EM algorithm that jus-
tifies incremental, sparse, and other variants. In Learning in graphical models,
pages 355–368. Springer, 1998.

[145] R. E. Neapolitan. Learning Bayesian networks. Prentice Hall, 2004.

[146] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro.
Robust Stochastic Approximation Approach to Stochastic Programming.
SIAM Journal on Optimization, 19(4):1574–1609, January 2009. ISSN 1052-6234.

[147] Jorge Nocedal and Stephen J. Wright. Nonlinear Equations. Springer, 2006.

[148] Esa Nummelin. General irreducible Markov chains and non-negative operators.
Number 83. Cambridge University Press, 2004.

[149] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mo-
hamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic
modeling and inference. Journal of Machine Learning Research, 22(57):1–64,
2021.

[150] Panos M. Pardalos and Jue Xue. The maximum clique problem. Journal of
Global Optimization, 4(3):301–328, 1994. ISSN 0925-5001.

[151] Emanuel Parzen. On estimation of a probability density function and mode.


The annals of mathematical statistics, 33(3):1065–1076, 1962.

[152] Judea Pearl. Probabilistic reasoning in intelligent systems. Morgan Kaufmann,


1988, 2012.

[153] Jiming Peng and Yu Wei. Approximating k-means-type clustering via semidef-
inite programming. SIAM journal on optimization, 18(1):186–205, 2007.

[154] Jiming Peng and Yu Xia. A new theoretical framework for k-means-type clus-
tering. Foundations and advances in data mining, pages 79–96, 2005.

[155] Odile Pons. Functional estimation for density, regression models and processes.
World scientific, 2011.

[156] Robert C. Prim. Shortest connection networks and some generalizations. Bell
system technical journal, 36(6):1389–1401, 1957.

[157] James G. Propp and David B. Wilson. Exact sampling with coupled Markov
chains and applications to statistical mechanics. Random Structures and Algo-
rithms, 9(1&2):223–252, 1996.

[158] James G. Propp and David B. Wilson. How to get a perfectly random sam-
ple from a generic Markov chain and generate a random spanning tree of a
directed graph. Journal of Algorithms, 27:170–217, 1998.

[159] Jim O. Ramsay and Bernard W. Silverman. Functional Data Analysis. Springer-
Verlag, 1997.

[160] BLS Prakasa Rao. Nonparametric functional estimation. Academic press, 1983.

[161] Daniel Revuz. Markov chains. Elsevier, 2008.

[162] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing
flows. In International conference on machine learning, pages 1530–1538. PMLR,
2015.

[163] Jorma Rissanen. Stochastic complexity in statistical inquiry. World Scientific,


1989.

[164] Herbert Robbins and Sutton Monro. A stochastic approximation method. In


Herbert Robbins Selected Papers, pages 102–109. Springer, 1985.

[165] Gareth O. Roberts and Nicholas G. Polson. On the geometric convergence of


the gibbs sampler. Journal of the Royal Statistical Society Series B: Statistical
Methodology, 56(2):377–384, 1994.

[166] Gareth O. Roberts and Jeffrey S. Rosenthal. General state space markov chains
and mcmc algorithms. Probability Surveys, 1:20–71, 2004.

[167] Gareth O. Roberts and Richard L. Tweedie. Exponential convergence of


langevin distributions and their discrete approximations. Bernoulli, 2(4):341–
363, 1996.

[168] R. Tyrrell Rockafellar. Convex analysis, volume 18. Princeton university press,
1970.
[169] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional
Networks for Biomedical Image Segmentation. In Nassir Navab, Joachim
Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Im-
age Computing and Computer-Assisted Intervention – MICCAI 2015, Lecture
Notes in Computer Science, pages 234–241, Cham, 2015. Springer Interna-
tional Publishing. ISBN 978-3-319-24574-4.
[170] Kenneth Rose, Eitan Gurewitz, and Geoffrey Fox. A deterministic annealing
approach to clustering. Pattern Recognition Letters, 11(9):589–594, 1990.
[171] Peter J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and val-
idation of cluster analysis. Journal of computational and applied mathematics,
20:53–65, 1987.
[172] Walter Rudin. Real and Complex Analysis. Tata McGraw Hill, 1966.
[173] Robert E. Schapire. The strength of weak learnability. Machine learning, 5(2):
197–227, 1990.
[174] Isaac J. Schoenberg. Metric spaces and completely monotone functions. An-
nals of Mathematics, pages 811–841, 1938.
[175] Gideon Schwarz. Estimating the dimension of a model. The annals of statistics,
6(2):461–464, 1978.
[176] Claude E. Shannon. A mathematical theory of communication. The Bell system
technical journal, 27(3):379–423, 1948.
[177] Claude E. Shannon. Communication in the presence of noise. Proc. Institute
of Radio Engineers, 37(1):10–21, 1949.
[178] Simon J. Sheather and Michael C. Jones. A reliable data-based bandwidth
selection method for kernel density estimation. Journal of the Royal Statistical
Society: Series B (Methodological), 53(3):683–690, 1991.
[179] Bernard W. Silverman. Density estimation for statistics and data analysis. Chap-
man et Hall, 1998.
[180] Viktor Pavlovich Skitovich. Linear forms of independent random variables
and the normal distribution law. Izvestiya Rossiiskoi Akademii Nauk. Seriya
Matematicheskaya, 18(2):185–200, 1954.
[181] Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score match-
ing: A scalable approach to density and score estimation. In Uncertainty in
Artificial Intelligence, pages 574–584. PMLR, 2020.

[182] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Rus-
lan Salakhutdinov. Dropout: a simple way to prevent neural networks from
overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[183] Hugo Steinhaus. Sur la division des corp materiels en parties. Bull. Acad.
Polon. Sci, 1(804):801, 1956.
[184] Charles J. Stone. Consistent nonparametric regression. The annals of statistics,
pages 595–620, 1977.
[185] Mervyn Stone. Cross-validatory choice and assessment of statistical predic-
tions. Journal of the royal statistical society: Series B (Methodological), 36(2):
111–133, 1974.
[186] Catherine A. Sugar and Gareth M. James. Finding the number of clusters in a
dataset: An information-theoretic approach. Journal of the American Statistical
Association, 98(463):750–763, 2003.
[187] Robert H. Swendsen and Jian-Sheng Wang. Nonuniversal critical dynamics in
monte carlo simulations. Physical review letters, 58(2):86, 1987.
[188] Esteban G. Tabak and Eric Vanden-Eijnden. Density estimation by dual ascent
of the log-likelihood. Communications in Mathematical Sciences, 8(1):217–233, 2010.
[189] Michel Talagrand. The generic chaining: upper and lower bounds of stochastic
processes. Springer Science & Business Media, 2006.
[190] Michel Talagrand. Upper and lower bounds for stochastic processes: modern meth-
ods and classical problems, volume 60. Springer Science & Business Media,
2014.
[191] Aik Choon Tan, Daniel Q. Naiman, Lei Xu, Raimond L. Winslow, and Don-
ald Geman. Simple decision rules for classifying human cancers from gene
expression profiles. Bioinformatics, 21(20):3896–3904, 2005.
[192] Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the num-
ber of clusters in a data set via the gap statistic. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 63(2):411–423, 2001.
[193] Luke Tierney. Markov Chains for Exploring Posterior Distributions. Annals
of Statistics, 22(4):1701–1728, December 1994. ISSN 0090-5364, 2168-8966.
[194] Igor Vajda. On metric divergences of probability measures. Kybernetika, 45(6):
885–900, 2009.
[195] Laurens Van Der Maaten. Accelerating t-sne using tree-based algorithms. The
Journal of Machine Learning Research, 15(1):3221–3245, 2014.

[196] Aad W. Van der Vaart. Asymptotic statistics, volume 3. Cambridge university
press, 2000.

[197] Aad W. Van der Vaart and John A. Wellner. Weak convergence and empirical
processes with applications to statistics. Springer, 1996.

[198] Vladimir Vapnik. Statistical learning theory. Wiley, New York, 1998.

[199] Vladimir Vapnik. The nature of statistical learning theory. Springer science &
business media, 2013.

[200] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you
need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-
wanathan, and R. Garnett, editors, Advances in Neural Information Processing
Systems, volume 30. Curran Associates, Inc., 2017.

[201] Roman Vershynin. High-dimensional probability: An introduction with applica-


tions in data science, volume 47. Cambridge University Press, 2018.

[202] Rene Vidal, Yi Ma, and Shankar Sastry. Generalized principal component
analysis (gpca). IEEE transactions on pattern analysis and machine intelligence,
27(12):1945–1959, 2005.

[203] Cédric Villani et al. Optimal transport: old and new, volume 338. Springer,
2009.

[204] Grace Wahba. Spline Models for Observational Data. SIAM, 1990.

[205] Geoffrey S. Watson. Smooth regression analysis. Sankhyā: The Indian Journal
of Statistics, Series A, pages 359–372, 1964.

[206] Gerhard Winkler. Image analysis, random fields and Markov chain Monte Carlo
methods. Springer, 1995, 2003.

[207] Stephen J. Wright and Benjamin Recht. Optimization for data analysis. Cam-
bridge University Press, 2022.

[208] Kôsaku Yosida. Functional Analysis. Springer, 1970.

[209] Laurent Younes. Estimation and annealing for gibbsian fields. Ann. de l’Inst.
Henri Poincaré, 2, 1988.

[210] Laurent Younes. Parametric inference for imperfectly observed gibbsian fields.
Prob. Thry. Rel. Fields, 82:625–645, 1989.

[211] Laurent Younes. On the convergence of markovian stochastic algorithms with


rapidly decreasing ergodicity rates. Stochastics: An International Journal of
Probability and Stochastic Processes, 65(3-4):177–228, 1999.

[212] Laurent Younes. Diffeomorphic learning. Journal of Machine Learning Research,


21:1 – 28, 2020.

[213] Lotfi A. Zadeh. Fuzzy sets. In Fuzzy sets, fuzzy logic, and fuzzy systems: selected
papers by Lotfi A Zadeh, pages 394–432. World Scientific, 1996.
