Machine Learning
Laurent Younes
Contents

Preface
3 Introduction to Optimization
3.1 Basic Terminology
3.2 Unconstrained Optimization Problems
3.2.1 Conditions for optimality (general case)
3.2.2 Convex sets and functions
3.2.3 Relative interior
3.2.4 Derivatives of convex functions and optimality conditions
3.2.5 Direction of descent and steepest descent
3.2.6 Convergence
3.2.7 Line search
3.3 Stochastic gradient descent
3.3.1 Stochastic approximation methods
3.3.2 Deterministic approximation and convergence study
3.3.3 The ADAM algorithm
3.4 Constrained optimization problems
3.4.1 Lagrange multipliers
3.4.2 Convex constraints
3.4.3 Applications
3.4.4 Projected gradient descent
3.5 General convex problems
3.5.1 Epigraphs
3.5.2 Subgradients
3.5.3 Directional derivatives
3.5.4 Subgradient descent
3.5.5 Proximal methods
3.6 Duality
3.6.1 Generalized KKT conditions
3.6.2 Dual problem
3.6.3 Example: Quadratic programming
3.6.4 Proximal iterations and augmented Lagrangian
3.6.5 Alternating direction method of multipliers
3.7 Convex separation theorems and additional proofs
3.7.1 Proof of proposition 3.44
3.7.2 Proof of theorem 3.45
3.7.3 Proof of theorem 3.46
20 Clustering
20.1 Introduction
20.2 Hierarchical clustering and dendrograms
20.2.1 Partition trees
20.2.2 Bottom-up construction
20.2.3 Top-down construction
20.2.4 Thresholding
20.3 K-medoids and K-means
20.3.1 K-medoids
20.3.2 Mixtures of Gaussians and deterministic annealing
20.3.3 Kernel (soft) K-means
20.3.4 Convex relaxation
Machine learning addresses the problem of analyzing, reproducing and predicting various mechanisms and processes that are observable through experiments and data acquisition. Driven by large technological companies needing to leverage the information contained in the gigantic datasets that they produce or obtain through user data, by the development of new data acquisition techniques in biology, physics and astronomy, and by improvements in storage capacity and high-performance computing, this field has experienced explosive growth over the past decades, in terms of both scientific production and technological impact.
This book, which originates from lecture notes for a series of graduate courses taught in the Department of Applied Mathematics and Statistics at Johns Hopkins University, adopts a viewpoint (or bias) mainly focused on the mathematical and statistical aspects of the subject. Its goal is to introduce the mathematical foundations and techniques that lead to the development and analysis of many of the algorithms that are used today. It is written with the hope of providing the reader with a deeper understanding of these methods.
Unsurprisingly, the book will be more accessible to a reader with some background in mathematics and statistics. It assumes familiarity with basic concepts in linear algebra and matrix analysis, in multivariate calculus, and in probability and statistics. We tried to limit the use of measure-theoretic tools, which are avoided up to a few exceptions; these are localized and accompanied by alternative interpretations allowing for a reading at a more elementary level.
The book starts with an introductory chapter that describes the notation used throughout the book and serves as a reminder of basic concepts in calculus, linear algebra and probability. It also introduces some measure-theoretic terminology, and can be used as a reading guide for the sections that use these tools. This chapter is followed by two chapters offering background material on matrix analysis and optimization. The latter chapter, which is relatively long, provides the necessary references for many algorithms that are used in the book, including stochastic gradient descent, proximal methods, etc.
Chapter 13, which presents sampling methods and an introduction to the theory of Markov chains, starts a series of chapters on generative models and associated learning algorithms. Graphical models are described in chapters 14 to 16. Chapter 17 introduces variational methods for models with latent variables, with applications to graphical models in chapter 18. Generative techniques using deep learning are presented in chapter 19.
If A is a set, the set of all subsets of A is denoted P(A). If A and B are two sets, the notation $B^A$ refers to the set of all functions f : A → B. In particular, $\mathbb{R}^A$ is the space of real-valued functions on A, and forms a vector space. When A is finite, this space is finite dimensional and can be identified with $\mathbb{R}^{|A|}$, where |A| denotes the cardinality (number of elements) of A.
1.1.2 Vectors
Elements of the d-dimensional Euclidean space Rd will be denoted with letters such
as x, y, z, and their coordinates will be indexed as parenthesized exponents, so that
$$x = \begin{pmatrix} x^{(1)} \\ \vdots \\ x^{(d)} \end{pmatrix}$$
(we will always identify elements of $\mathbb{R}^d$ with column vectors). We will not distinguish in the notation between “points” in $\mathbb{R}^d$, seen as an affine space, and “vectors” in $\mathbb{R}^d$, seen as a vector space. The vectors $0_d$ and $1_d$ will denote the d-dimensional vectors with all coordinates equal to 0 and 1, respectively. The identity matrix in $\mathbb{R}^d$ will be denoted $\mathrm{Id}_{\mathbb{R}^d}$. The canonical basis of $\mathbb{R}^d$, provided by the columns of $\mathrm{Id}_{\mathbb{R}^d}$, will be denoted $e_1, \dots, e_d$.
The Euclidean norm of $x \in \mathbb{R}^d$ is
$$|x| = \left((x^{(1)})^2 + \cdots + (x^{(d)})^2\right)^{1/2},$$
a special case of the $\ell^p$ norm
$$|x|_p = \left(|x^{(1)}|^p + \cdots + |x^{(d)}|^p\right)^{1/p} \qquad (1.1)$$
for p ≥ 1. One can also define $|x|_p$ for 0 < p < 1, using (1.1), but in this case one
does not get a norm because the triangle inequality $|x+y|_p \le |x|_p + |y|_p$ is not true in general. The family is interesting, however, because it approximates, in the limit p → 0, the number of non-zero components of x, denoted $|x|_0$, which is a measure of sparsity. Note that we also use the notation |A| to denote the cardinality (number of elements) of a set A, hopefully without risk of confusion.
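As a quick numerical illustration (a minimal sketch, not from the text; the helper `lp_cost` is hypothetical), the following Python snippet shows how $\sum_i |x^{(i)}|^p$ approaches the number of non-zero components as p decreases to 0:

```python
import numpy as np

def lp_cost(x, p):
    """Return sum_i |x_i|^p; for 0 < p < 1 this is not a norm,
    but it approaches the sparsity measure |x|_0 as p -> 0."""
    return np.sum(np.abs(x) ** p)

x = np.array([0.0, 2.0, 0.0, -0.5, 3.0])  # |x|_0 = 3 nonzero entries
for p in [1.0, 0.5, 0.1, 0.01]:
    print(f"p = {p:4.2f}: sum |x_i|^p = {lp_cost(x, p):.4f}")
# The printed values approach 3, the number of nonzero components of x.
```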
1.1.3 Matrices
The set of m × d matrices with real entries is denoted $M_{m,d}(\mathbb{R})$, or simply $M_{m,d}$ ($M_{d,d}$ will also be denoted $M_d$). The set of invertible d × d matrices will be denoted $GL_d(\mathbb{R})$.
Entry (i, j) in a matrix $A \in M_{m,d}(\mathbb{R})$ will either be denoted A(i, j) or $A^{(i)}_j$. The rows of A will be denoted $A^{(1)}, \dots, A^{(m)}$ and the columns $A_1, \dots, A_d$.
The space of d × d real symmetric matrices is denoted $S_d$, and its subsets containing positive semi-definite (resp. positive definite) matrices are denoted $S_d^+$ (resp. $S_d^{++}$). If m ≤ d, $O_{m,d}$ denotes the set of m × d matrices A such that $AA^T = \mathrm{Id}_{\mathbb{R}^m}$, and one writes $O_d$ for $O_{d,d}$, the space of d-dimensional orthogonal matrices. Finally, $SO_d$ is the subset of $O_d$ containing orthogonal matrices with determinant 1, i.e., rotation matrices.
1.2 Topology
The closure of A is the smallest closed set that contains A and will be denoted either Ā or cl(A). A point x belongs to Ā if and only if $B(x, r) \cap A \neq \emptyset$ for all r > 0. Alternatively, x belongs to Ā if and only if there exists a sequence $(x_k)$ that converges to x with $x_k \in A$ for all k.
A compact set in Rd is a set Γ such that any sequence of points in Γ contains a subse-
quence that converges to some point in Γ . An alternate definition is that, whenever
Γ is covered by a collection of open sets, there exists a finite subcollection that still
covers Γ .
One can show that compact subsets of Rd are exactly its bounded and closed
subsets.
In more general (e.g., infinite-dimensional) spaces, compact subsets are defined in the same way, but they are not necessarily characterized as bounded and closed sets.
1.3 Calculus
1.3.1 Differentials
If x, y ∈ Rd , we will denote by [x, y] the closed segment delimited by x and y, i.e., the
set of all points (1 − t)x + ty for 0 ≤ t ≤ 1. One denotes by [x, y), (x, y] and (x, y) the
semi-open or open segments, with appropriate strict inequality for t. (Similarly to
the notation for open intervals, whether (x, y) denotes an open segment or a pair of
points will always be clear from the context.)
The partial derivatives of a function $f : U \to \mathbb{R}^m$ (with $U \subset \mathbb{R}^d$ open) at $x \in U$, when they exist, are
$$\partial_j f(x) = \lim_{t \to 0} \frac{f(x + t e_j) - f(x)}{t}, \qquad j = 1, \dots, d,$$
where $e_1, \dots, e_d$ form the canonical basis of $\mathbb{R}^d$. If the notation for the variables on which f depends is well understood from the context, we will alternatively use $\partial_{x_j} f$. (For example, if $f : (\alpha, \beta) \mapsto f(\alpha, \beta)$, we will prefer $\partial_\alpha f$ to $\partial_1 f$.) The differential of f at x is the linear mapping from $\mathbb{R}^d$ to $\mathbb{R}^m$ represented by the matrix
$$df(x) = \left(\partial_j f^{(i)}(x)\right)_{i=1,\dots,m;\ j=1,\dots,d}.$$
Differentials obey the product rule and the chain rule. If $f, g : U \to \mathbb{R}$, then
$$d(fg)(x) = f(x)\,dg(x) + g(x)\,df(x).$$
If $f : U \to \mathbb{R}^m$ and $g : \tilde U \subset \mathbb{R}^k \to U$, then
$$d(f \circ g)(x) = df(g(x))\,dg(x).$$
If d = m (so that df (x) is a square matrix), we let ∇ · f (x) = trace(df (x)), the
divergence of f .
We here compute, as an illustration and because they will be useful later, the differentials of the determinant and of matrix inversion.
Expanding the determinant along row i gives
$$\det(A) = \sum_{j=1}^d (-1)^{i+j} A(i,j)\,\det A^{(ij)},$$
where $A^{(ij)}$ is the matrix A with row i and column j removed. We therefore find that the differential of $A \mapsto \det(A)$ is the mapping
$$H \mapsto \mathrm{trace}(\mathrm{cof}(A)^T H) \qquad (1.4)$$
where cof(A) is the matrix composed of the co-factors $(-1)^{i+j}\det A^{(ij)}$. As a consequence, if A is invertible, then the differential of $\log|\det(A)|$ is the mapping
$$H \mapsto \mathrm{trace}(\det(A)^{-1}\mathrm{cof}(A)^T H) = \mathrm{trace}(A^{-1}H). \qquad (1.5)$$
Consider now the inversion map $I : A \mapsto A^{-1}$ defined on $GL_d(\mathbb{R})$, which is an open subset of $M_d(\mathbb{R})$. Using $A\,I(A) = \mathrm{Id}_{\mathbb{R}^d}$ and the product rule, we get
$$A\,(dI(A)H) + H\,I(A) = 0,$$
or
$$dI(A)H = -A^{-1}HA^{-1}. \qquad (1.6)$$
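Formulas (1.4) through (1.6) are easy to check numerically with finite differences; the following Python sketch (illustrative only, using numpy, and relying on the identity $\mathrm{cof}(A)^T = \det(A)A^{-1}$ for invertible A) is one way to do so:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))
H = rng.standard_normal((d, d))
eps = 1e-6

# (1.4)-(1.5): d(det)(A)H = trace(cof(A)^T H) = det(A) trace(A^{-1} H).
fd_det = (np.linalg.det(A + eps * H) - np.linalg.det(A)) / eps
exact_det = np.linalg.det(A) * np.trace(np.linalg.inv(A) @ H)
print(abs(fd_det - exact_det))  # small, O(eps)

# (1.6): dI(A)H = -A^{-1} H A^{-1} for the inversion map I(A) = A^{-1}.
fd_inv = (np.linalg.inv(A + eps * H) - np.linalg.inv(A)) / eps
exact_inv = -np.linalg.inv(A) @ H @ np.linalg.inv(A)
print(np.max(np.abs(fd_inv - exact_inv)))  # small, O(eps)
```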
Taylor's theorem, in its integral form, generalizes the fundamental theorem of calculus to higher derivatives. It expresses the fact that, if f is $C^k$ on U and $x, y \in U$ are such that the closed segment [x, y] is included in U, then, letting h = y − x:
$$f(x+h) = f(x) + df(x)h + \frac{1}{2}d^2f(x)(h,h) + \cdots + \frac{1}{(k-1)!}d^{k-1}f(x)(h,\dots,h) + \frac{1}{(k-1)!}\int_0^1 (1-t)^{k-1}\,d^kf(x+th)(h,\dots,h)\,dt. \qquad (1.7)$$
If f takes scalar values, then $d^kf(x+th)(h,\dots,h)$ is real and the intermediate value theorem implies that there exists some z in [x, y] such that
$$f(x+h) = f(x) + df(x)h + \frac{1}{2}d^2f(x)(h,h) + \cdots + \frac{1}{(k-1)!}d^{k-1}f(x)(h,\dots,h) + \frac{1}{k!}d^kf(z)(h,\dots,h). \qquad (1.8)$$
This is not true if f takes vector values. However, for any M such that $|d^kf(z)| \le M$ for $z \in [x, y]$ (such M's always exist because f is $C^k$), one has
$$\left|\frac{1}{(k-1)!}\int_0^1 (1-t)^{k-1}\,d^kf(x+th)(h,\dots,h)\,dt\right| \le \frac{M}{k!}|h|^k.$$
One can also rewrite (1.7) as
$$f(x+h) = f(x) + df(x)h + \frac{1}{2}d^2f(x)(h,h) + \cdots + \frac{1}{k!}d^kf(x)(h,\dots,h) + \frac{1}{(k-1)!}\int_0^1 (1-t)^{k-1}\left(d^kf(x+th)(h,\dots,h) - d^kf(x)(h,\dots,h)\right)dt. \qquad (1.9)$$
Let
$$\epsilon_x(r) = \max\left\{|d^kf(x+h) - d^kf(x)| : |h| \le r\right\}.$$
Then
$$\int_0^1 (1-t)^{k-1}\,\big|d^kf(x+th)(h,\dots,h) - d^kf(x)(h,\dots,h)\big|\,dt \le \frac{|h|^k}{k}\,\epsilon_x(|h|),$$
so that
$$f(x+h) = f(x) + df(x)h + \frac{1}{2}d^2f(x)(h,h) + \cdots + \frac{1}{k!}d^kf(x)(h,\dots,h) + \frac{|h|^k}{k!}\epsilon_x(|h|) \qquad (1.10)$$
$$= f(x) + df(x)h + \frac{1}{2}d^2f(x)(h,h) + \cdots + \frac{1}{k!}d^kf(x)(h,\dots,h) + o(|h|^k). \qquad (1.11)$$
We assume that the reader is familiar with concepts related to discrete random
variables (which take values in a discrete or countable space) and their probability
mass function (p.m.f.) or continuous variables (with values in Rd for some d) and
their probability density functions (p.d.f.) when they exist. In particular, X : Ω →
Rd is a random variable with p.d.f. f if and only if the expectation of ϕ(X) is given
by
$$E(\varphi(X)) = \int_{\mathbb{R}^d} \varphi(x)f(x)\,dx$$
for all bounded and continuous functions ϕ : Rd → [0, +∞). Not all random vari-
ables of interest can be categorized as discrete or continuous with a p.d.f., however,
and the others are more conveniently handled using measure-theoretic notation as
introduced below.
With a few exceptions, we will use capital letters for random variables and small
letters for scalars and vectors that represent realizations of these variables. One of
these exceptions will be our notation for training data, defined as an independent
and identically distributed (i.i.d.) sample of a given random variable. A realization
of such a sample will always be denoted T = (x1 , . . . , xN ), which is therefore a series
of observations. We will use the notation T = (X1 , . . . , XN ) for the collection of i.i.d.
random variables that generate the training set, so that T = (X1 (ω), . . . , XN (ω)) = T (ω)
for some ω ∈ Ω. Another exception will apply to variables denoted using Greek
letters, for which we will use boldface fonts (such as α, β, . . .).
If X and Y are discrete random variables, the conditional expectation of Y given X is defined by
$$E(Y \mid X)(\omega) = \sum_{y \in R_Y} y\,P(Y = y \mid X = X(\omega))$$
for all ω such that P(X = X(ω)) > 0. Note that E(Y | X) is a random variable, defined over Ω. It however only depends on the values of X, in the sense that E(Y | X)(ω) = E(Y | X)(ω′) if X(ω) = X(ω′). We will use the notation
$$E(Y \mid X = x) = \sum_{y \in R_Y} y\,P(Y = y \mid X = x),$$
so that E(Y | X) = E(Y | X = X(·)).
If X and Y are scalar- or vector-valued and their joint distribution has a p.d.f. $\varphi_{X,Y}$, one defines similarly the conditional p.d.f. of Y given X by
$$\varphi_Y(y \mid X = x) = \frac{\varphi_{X,Y}(y, x)}{\int_{R_Y} \varphi_{X,Y}(y', x)\,dy'}$$
provided that the denominator does not vanish. We will also use the notation ϕY (y |
X)(ω) = ϕY (y | X = X(ω)) for ω ∈ Ω. One then defines
$$E(Y \mid X)(\omega) = \int_{R_Y} y\,\varphi_Y(y \mid X = X(\omega))\,dy.$$
In both cases considered above, it is easily checked that the conditional expectation satisfies the properties
(CE1) E(Y | X) = g(X) for some function g defined on $R_X$;
(CE2) E(f(X)E(Y | X)) = E(f(X)Y) for all bounded functions $f : R_X \to \mathbb{R}$.
The proof that our definition of E(Y | X) for discrete random variables is the only one satisfying these properties is left to the reader. For continuous random variables, assume that a function $g : R_X \to R_Y$ satisfies
$$E(f(X)g(X)) = E(f(X)Y)$$
for all bounded continuous functions f.
If we assume that $\varphi_{X,Y}$ is continuous, then this identity being true for all f implies that
$$g(x)\varphi_X(x) = \int_{R_Y} y\,\varphi_{X,Y}(x, y)\,dy,$$
so that g is the conditional expectation. If ϕX,Y is not continuous, then the identity
holds everywhere except on an exceptional “negligible” set (see the measure theo-
retic introduction below). Properties (CE1) and (CE2) provide the definition of the
conditional expectation for general random variables.
The integral of a function $f : S \to \mathbb{R}^d$ with respect to a measure µ is denoted $\int_S f\,d\mu$ or $\int_S f(x)\,\mu(dx)$. This integral is defined, using a limit argument, as a function which is linear in f and such that, for all $A \in \mathcal{S}$,
$$\int_A \mu(dx) = \int_S 1_A(x)\,\mu(dx) = \mu(A).$$
More precisely, this uniquely defines the integral of linear combinations of indicator functions of sets with finite measure (called “simple functions”), and one then defines $\int_S f\,d\mu$ for $f : S \to [0, +\infty)$ as the supremum of the integrals of all simple functions that are no larger than f. After showing that the result is well defined and linear in f, one defines the integral of $f : S \to \mathbb{R}$ as the difference between those of $\max(f, 0)$ and $\max(-f, 0)$, which is well defined as soon as $\int_S |f|\,d\mu < \infty$, in which case one says that f is µ-integrable.
If $\mu_1$ and $\mu_2$ are measures on $(S_1, \mathcal{S}_1)$ and $(S_2, \mathcal{S}_2)$, integrals with respect to the product measure $\mu_1 \otimes \mu_2$ will be written $\int_{S_1 \times S_2} f(x_1, x_2)\,\mu_1(dx_1)\,\mu_2(dx_2)$ (rather than
$$\int_{S_1 \times S_2} f(x_1, x_2)\,\mu_1 \otimes \mu_2(dx_1, dx_2)).$$
(And one has a symmetric statement by integrating first in the first variable.)
The tensor product of more than two measures is defined similarly, with notation
$$\mu_1 \otimes \cdots \otimes \mu_n = \bigotimes_{k=1}^n \mu_k.$$
If µ and ν are measures on $(S, \mathcal{S})$, one says that ν is absolutely continuous with respect to µ, and writes ν ≪ µ, if ν(A) = 0 for every $A \in \mathcal{S}$ such that µ(A) = 0.
The Radon-Nikodym theorem states that ν ≪ µ with ν(S) < ∞ if and only if ν has a density with respect to µ, i.e., there exists a µ-integrable function $\varphi : S \to [0, +\infty)$ such that
$$\int_S f(x)\,\nu(dx) = \int_S f(x)\varphi(x)\,\mu(dx)$$
for all ν-integrable f.
When using measure-theoretic probability, we will therefore assume that the pair
(Ω, P) is completed to a triple (Ω, A, P) where A is a σ -algebra and P a probability
measure, that is a (positive) measure on (Ω, A) such that P(Ω) = 1. This triple is
called a probability space. For probability spaces, measurable sets are also called
“events” and events that happen with probability one are said to happen “almost
surely.”
A random variable X must then also take values in a measurable space, say $(R_X, \mathcal{S}_X)$, and must be such that, for all $C \in \mathcal{S}_X$, the set [X ∈ C] belongs to $\mathcal{A}$. This justifies the computation of P(X ∈ C), which is also denoted $P_X(C)$.
We use (CE1) and (CE2) as a definition of conditional expectation in the general case.
We assume that (RX , S X ) and (RY , S Y ) are measurable spaces.
Definition 1.2 Assume that $R_Y = \mathbb{R}^d$. Let $X : \Omega \to R_X$ and $Y : \Omega \to R_Y$ be two random variables with E(|Y|) < ∞. The conditional expectation of Y given X is a random variable $Z : \Omega \to R_Y$ such that:
(i) Z = h(X) for some measurable function $h : R_X \to R_Y$;
(ii) E(f(X)Z) = E(f(X)Y) for all bounded measurable functions $f : R_X \to \mathbb{R}$.
The variable Z is then denoted E(Y | X) and the function h in (i) is denoted E(Y | X = ·).
Importantly, random variables Z satisfying conditions (i) and (ii) always exist and are almost surely unique, in the sense that if another random variable Z′ satisfies these conditions, then Z = Z′ almost surely.
By Jensen's inequality, $\gamma(E(Y \mid X)) \le E(\gamma(Y) \mid X)$ for any convex function γ such that γ(Y) is integrable. We will discuss convex functions in chapter 3, but two important examples for this section are γ(y) = |y| and γ(y) = |y|². The first one implies that $|E(Y \mid X)| \le E(|Y| \mid X)$ and, taking expectations on both sides, $E(|E(Y \mid X)|) \le E(|Y|)$, the upper bound being finite by assumption. For the square norm, we find that, if Y is square integrable, then so is E(Y | X) and
$$E(|E(Y \mid X)|^2) \le E(|Y|^2).$$
One can also define the conditional probability of Y ∈ A given X (resp. given X = x) as the conditional expectation of $1_{Y \in A}$, denoted P(Y ∈ A | X) (resp. P(Y ∈ A | X = x)), or $P_Y(A \mid X)$ (resp. $P_Y(A \mid X = x)$). Note that, for each A, these conditional probabilities are defined up to modifications on sets of probability zero, and it is not obvious that they can be defined for all A together (up to a modification on a common set of probability zero), since there is generally a non-countable number of sets A. This can be done, however, with some mild assumptions on the set $R_Y$ and its σ-algebra (always satisfied in our discussions, see remark 1.1), ensuring that, for all ω ∈ Ω, $A \mapsto P_Y(A \mid X)(\omega)$ is a probability distribution on $R_Y$ such that, for any measurable function $h : R_Y \to \mathbb{R}$ such that h ∘ Y is integrable,
$$E(h(Y) \mid X) = \int_{R_Y} h(y)\,P_Y(dy \mid X).$$
Assume now that the sets $R_X$ and $R_Y$ are equipped with measures, say $\mu_X$ and $\mu_Y$, such that the joint distribution of (X, Y) is absolutely continuous with respect to $\mu_X \otimes \mu_Y$, so that there exists a function $\varphi : R_X \times R_Y \to \mathbb{R}$ (the p.d.f. of (X, Y) with respect to $\mu_X \otimes \mu_Y$) such that
$$P(X \in A, Y \in B) = \int_{A \times B} \varphi(x, y)\,\mu_X(dx)\,\mu_Y(dy).$$
The conditional p.d.f. of Y given X = x is then defined as
$$\varphi(y \mid X = x) = \frac{\varphi(x, y)}{\int_{R_Y} \varphi(x, y')\,\mu_Y(dy')}.$$
Note that
$$P\left(\left\{\omega : \int_{R_Y} \varphi(X(\omega), y')\,\mu_Y(dy') = 0\right\}\right) = 0,$$
so that the conditional density can be defined arbitrarily when the denominator vanishes¹.
When $\mu_Y$ is the counting measure on a discrete set $R_Y$, this takes the form
$$\varphi(y \mid X = X(\omega)) = \frac{\varphi(X(\omega), y)}{\sum_{y' \in R_Y} \varphi(X(\omega), y')}.$$
¹ Letting $\varphi_X(x) = \int_{R_Y} \varphi(x, y')\,\mu_Y(dy')$, which is the marginal p.d.f. of X with respect to $\mu_X$, we have
$$P(\varphi_X(X) = 0) = \int_{R_X} 1_{\varphi_X(x) = 0}\,\varphi_X(x)\,\mu_X(dx) = 0.$$
Chapter 2

A Few Results in Matrix Analysis
This chapter collects a few results in linear algebra that will be useful in the rest of
this book.
We denote by $M_{n,d}(\mathbb{R})$ the space of all n × d matrices with real coefficients. For a matrix $A \in M_{n,d}(\mathbb{R})$ and integers k ≤ n and l ≤ d, we let $A_{\lceil kl \rceil} \in M_{k,l}(\mathbb{R})$ denote the matrix A restricted to its first k rows and first l columns. The (i, j) entry of A will be denoted A(i, j) or $A^{(i)}_j$.
We assume that the reader is familiar with elementary matrix analysis, including, in particular, the fact that symmetric matrices are diagonalizable in an orthonormal basis: if $A \in M_{d,d}(\mathbb{R})$ is a symmetric matrix (whose space is denoted $S_d$), there exists an orthogonal matrix $U \in O_d$ (i.e., satisfying $U^TU = UU^T = \mathrm{Id}_{\mathbb{R}^d}$) and a diagonal matrix $D \in M_{d,d}(\mathbb{R})$ such that
$$A = UDU^T.$$
The identity AU = UD then implies that the columns of U form an orthonormal basis of eigenvectors of A.
Any matrix $A \in M_{n,d}(\mathbb{R})$ can be written as
$$A = UDV^T$$
where $U \in O_n(\mathbb{R})$ and $V \in O_d(\mathbb{R})$ are orthogonal matrices and $D \in M_{n,d}(\mathbb{R})$ is diagonal (i.e., such that D(i, j) = 0 whenever i ≠ j) with non-negative diagonal coefficients.
These coefficients are called the singular values of A, and the procedure is called a singular value decomposition (SVD) of A. An equivalent formulation is that there exist orthonormal bases $u_1, \dots, u_n$ of $\mathbb{R}^n$ and $v_1, \dots, v_d$ of $\mathbb{R}^d$ (forming the columns of U and V) such that
$$Av_i = \lambda_iu_i$$
for $i \le \min(n, d)$, where $\lambda_1, \dots, \lambda_{\min(n,d)}$ are the singular values. Of course, if A is square and symmetric positive semi-definite, an eigenvalue decomposition of A is also a singular value decomposition (and the singular values coincide with the eigenvalues). More generally, if $A = UDV^T$, then $AA^T = UDD^TU^T$ and $A^TA = VD^TDV^T$ are eigenvalue decompositions of $AA^T$ and $A^TA$. Singular values are uniquely defined, up to reordering. The matrices U and V, however, are in general not unique, even up to column reordering.
If m = min(n, d), one can also write
$$A = \tilde U\tilde D\tilde V^T$$
with $\tilde U$, $\tilde D$ and $\tilde V$ having respective sizes n × m, m × m and d × m, $\tilde U^T\tilde U = \tilde V^T\tilde V = \mathrm{Id}_{\mathbb{R}^m}$, and $\tilde D$ diagonal with non-negative coefficients. This representation provides a reduced SVD of A, and one can create a full SVD from a reduced one by completing the missing columns of $\tilde U$ and $\tilde V$ to form orthogonal matrices, and by adding the required number of zeros to $\tilde D$.
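For concreteness, here is a short numpy illustration (a sketch, not from the text) of the full and reduced SVDs and of their relation to the eigenvalues of $A^TA$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
A = rng.standard_normal((n, d))
m = min(n, d)

# Full SVD: U is n x n, V is d x d, singular values fill an n x d diagonal D.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
D = np.zeros((n, d))
D[:m, :m] = np.diag(s)
print(np.allclose(A, U @ D @ Vt))  # True

# Reduced SVD: U_tilde is n x m, V_tilde is d x m.
U_t, s_t, Vt_t = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U_t @ np.diag(s_t) @ Vt_t))  # True

# The singular values are the square roots of the eigenvalues of A^T A.
evals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(s, np.sqrt(np.maximum(evals, 0.0))))  # True
```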
We now describe Von Neumann's trace theorem. Its justification follows the proof given in Mirsky [138].
Theorem 2.1 (Von Neumann) Let $A, B \in M_{n,d}(\mathbb{R})$ have singular values $(\lambda_1, \dots, \lambda_m)$ and $(\mu_1, \dots, \mu_m)$, respectively, where m = min(n, d), listed in non-increasing order. Then
$$\mathrm{trace}(A^TB) \le \sum_{i=1}^m \lambda_i\mu_i. \qquad (2.1)$$
Moreover, if $\mathrm{trace}(A^TB) = \sum_{i=1}^m \lambda_i\mu_i$, then there exist n × n and d × d orthogonal matrices U and V such that $U^TAV$ and $U^TBV$ are both diagonal, i.e., one can find SVDs of A and B that share the same orthogonal matrices.
Let us consider the first sum in the upper bound. Let $\xi_d = \lambda_d$ (resp. $\eta_d = \mu_d$) and $\xi_i = \lambda_i - \lambda_{i+1}$ (resp. $\eta_i = \mu_i - \mu_{i+1}$) for i = 1, . . . , d − 1. Since singular values are non-increasing, we have $\xi_i, \eta_i \ge 0$ and
$$\lambda_i = \sum_{j=i}^d \xi_j, \qquad \mu_i = \sum_{j=i}^d \eta_j$$
for i = 1, . . . , d. We have
$$\sum_{i,j=1}^d \lambda_i\mu_j\,u(i,j)^2 = \sum_{i,j=1}^d \sum_{i'=i}^d\sum_{j'=j}^d \xi_{i'}\eta_{j'}\,u(i,j)^2 = \sum_{i',j'=1}^d \xi_{i'}\eta_{j'}\sum_{i=1}^{i'}\sum_{j=1}^{j'} u(i,j)^2 \le \sum_{i',j'=1}^d \xi_{i'}\eta_{j'}\min(i', j') \qquad (2.3)$$
where we used the fact that U is orthogonal, which implies that $\sum_{j=1}^{j'} u(i,j)^2$ and $\sum_{i=1}^{i'} u(i,j)^2$ are both less than 1. Notice also that, when $u(i,j) = \delta_{ij}$ (i.e., u(i, j) = 1 if i = j and zero otherwise), then
$$\sum_{i=1}^{i'}\sum_{j=1}^{j'} u(i,j)^2 = \min(i', j'),$$
so that the last inequality is an identity, and the chain of equalities leading to (2.3) implies
$$\sum_{i',j'=1}^d \xi_{i'}\eta_{j'}\min(i', j') = \sum_{i=1}^d \lambda_i\mu_i.$$
The same identity obviously holds with v in place of u, and combining the two yields
(2.1).
We now consider conditions for equality. Clearly, if one can find SVD decompo-
sitions of A and B with U1 = U2 and V1 = V2 , then U = IdRn , V = IdRd and (2.1) is an
identity. We want to prove the converse statement.
For (2.1) to be an equality, we first need (2.2) to be an identity, which requires that u(i, j) = v(i, j) as soon as $\lambda_i\mu_j > 0$. We also need an equality in (2.3), which requires
$$\sum_{i=1}^{i'}\sum_{j=1}^{j'} u(i,j)^2 = \min(i', j')$$
as soon as $\lambda_{i'} > \lambda_{i'+1}$ and $\mu_{j'} > \mu_{j'+1}$. The same identity must be true with v(i, j) replacing u(i, j).
In view of this, denote by $i_1 < \cdots < i_p$ (resp. $j_1 < \cdots < j_q$) the indexes at which the singular values of A (resp. B) differ from their successors, with the convention $\lambda_{d+1} = \mu_{d+1} = 0$. Let, for k = 1, . . . , p and l = 1, . . . , q,
$$C(k, l) = \sum_{i=1}^{i_k}\sum_{j=1}^{j_l} u(i,j)^2.$$
Then, we must have $C(k, l) = \min(i_k, j_l)$ for all k, l, and u(i, j) = v(i, j) for $i = 1, \dots, i_p$ and $j = 1, \dots, j_q$.
If, for all i, j ≤ d, we let $U_{\lceil ij \rceil}$ be the matrix formed by the first i rows and j columns of U, the condition $C(k, l) = \min(i_k, j_l)$ requires that $U_{\lceil i_kj_l \rceil}U_{\lceil i_kj_l \rceil}^T = \mathrm{Id}_{\mathbb{R}^{i_k}}$ if $i_k \le j_l$ and $U_{\lceil i_kj_l \rceil}^TU_{\lceil i_kj_l \rceil} = \mathrm{Id}_{\mathbb{R}^{j_l}}$ if $j_l \le i_k$. This shows that, if $i_k \le j_l$, the rows of $U_{\lceil i_kj_l \rceil}$ form an orthonormal family, and necessarily, all elements u(i, j) for $i \le i_k$ and $j > j_l$ vanish. The symmetric situation holds if $j_l \le i_k$.
Pursuing in this way (and skipping the formal induction argument, which is a bit tedious), we can progressively introduce identity blocks into U and V and transform them into new matrices (that we still denote by U and V) taking the form (letting $k = \min(i_p, j_q)$)
$$U = \begin{pmatrix} \mathrm{Id}_{\mathbb{R}^k} & 0 \\ 0 & \bar U \end{pmatrix} \quad\text{and}\quad V = \begin{pmatrix} \mathrm{Id}_{\mathbb{R}^k} & 0 \\ 0 & \bar V \end{pmatrix}.$$
Remark 2.2 Note that, since the singular values of −A and of A coincide, theorem 2.1 implies
$$|\mathrm{trace}(A^TB)| \le \sum_{i=1}^m \lambda_i\mu_i \qquad (2.4)$$
for all matrices A and B, with equality if either A and B, or −A and B, have SVDs using the same bases.
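The inequality (2.4) is easy to test numerically; the following sketch (illustrative only, using numpy) also constructs a matrix sharing the singular bases of A, for which equality holds:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 4, 6
m = min(n, d)
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, d))

lam = np.linalg.svd(A, compute_uv=False)  # non-increasing singular values
mu = np.linalg.svd(B, compute_uv=False)
print(abs(np.trace(A.T @ B)) <= np.sum(lam * mu) + 1e-12)  # True, by (2.4)

# Equality when the two matrices share the same singular bases:
U, _, Vt = np.linalg.svd(A, full_matrices=False)
B_shared = U @ np.diag(mu) @ Vt
print(np.isclose(np.trace(A.T @ B_shared), np.sum(lam * mu)))  # True
```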
2.3 Applications
We first note that the singular values of $UBU^T$, which is d × d, are the same as the eigenvalues of B completed with zeros. Letting $\lambda_1 \ge \cdots \ge \lambda_d$ be the eigenvalues of A and $\mu_1 \ge \cdots \ge \mu_p$ those of B, we therefore have, from theorem 2.1,
$$F(U) \le \sum_{i=1}^p \lambda_i\mu_i.$$
which shows that U is optimal. We summarize this discussion in the next theorem.
Theorem 2.3 Let A ∈ Sd (R) and B ∈ Sp (R) be symmetric matrices, with p ≤ d. Let
eigenvalue decompositions of A and B be given by A = V ΛV T and B = W MW T , where
the diagonal elements of Λ (resp. M) are λ1 ≥ · · · ≥ λd (resp. µ1 ≥ · · · ≥ µp ).
where the min and max are taken over linear subspaces of Rd .
where the last identity follows by considering the eigenvalues of A restricted to Wk,d .
So, the maximum of the right-hand side is indeed less than λk , and it is attained for
V = W1,k . This proves the first identity, and the second one can be obtained by
applying the first one to −A.
It is equal to the square root of the largest eigenvalue of AT A, i.e., to the largest
singular value of A.
so that
$$|A|_F = \left(\sum_{k=1}^m \sigma_k^2\right)^{1/2}.$$
The nuclear norm of A is defined as
$$|A|_* = \sum_{k=1}^d \sigma_k.$$
One can prove that this is a norm using an equivalent definition, provided by the
following proposition.
Proof The fact that $\mathrm{trace}(UAV^T) \le |A|_*$ for any U and V is a consequence of the trace inequality applied with B = [Id, 0] or its transpose, depending on whether n ≤ d or not. The upper bound being attained when U and V are the matrices forming the singular value decomposition of A, the proof is complete.
The fact that $|A|_*$ is a norm, for which the only non-trivial point is the triangle inequality, is now an easy consequence of this proposition, because the maximum of the sum of two functions is always less than the sum of their maximums. More precisely, we have
$$|A + B|_* = \max_{U,V}\mathrm{trace}(U(A+B)V^T) \le \max_{U,V}\mathrm{trace}(UAV^T) + \max_{U,V}\mathrm{trace}(UBV^T) = |A|_* + |B|_*.$$
The nuclear norm is also called the Ky Fan norm of order d. The Ky Fan norm of order k (for 1 ≤ k ≤ d) associates to a matrix A the quantity
$$|A|_{(k)} = \lambda_1 + \cdots + \lambda_k,$$
i.e., the sum of its k largest singular values. One has the following proposition.
Proof We prove this following the argument suggested in Bhatia [28]. For $A \in M_{d,d}$ and k = 1, . . . , d, let $\mathrm{trace}_{(k)}(A)$ be the sum of the k largest diagonal elements of A. Let, for a symmetric matrix A, $|A|'_{(k)}$ denote the sum of the k largest eigenvalues of A (it is equal to $|A|_{(k)}$ if A is positive definite, but can also include negative values otherwise).
Write $B = W^TDW$ with W orthogonal and D diagonal, so that
$$B(j, j) = \sum_{i=1}^d W(i,j)^2D(i,i).$$
Given indexes $j_1 < \cdots < j_k$, we have
$$\sum_{l=1}^k B(j_l, j_l) = \sum_{i=1}^d D(i,i)\sum_{l=1}^k W(i,j_l)^2$$
$$= \sum_{i=1}^k D(i,i) + \sum_{i=1}^k D(i,i)\left(\sum_{l=1}^k W(i,j_l)^2 - 1\right) + \sum_{i=k+1}^d D(i,i)\sum_{l=1}^k W(i,j_l)^2$$
$$= \sum_{i=1}^k D(i,i) + \sum_{i=1}^k (D(i,i) - D(k,k))\left(\sum_{l=1}^k W(i,j_l)^2 - 1\right) + \sum_{i=k+1}^d (D(i,i) - D(k,k))\sum_{l=1}^k W(i,j_l)^2 + D(k,k)\left(\sum_{i=1}^d\sum_{l=1}^k W(i,j_l)^2 - k\right).$$
Because W is orthogonal, we have $\sum_{l=1}^k W(i,j_l)^2 \le 1$ and
$$\sum_{i=1}^d\sum_{l=1}^k W(i,j_l)^2 = k.$$
This shows that the terms after $\sum_{i=1}^k D(i,i)$ in the upper bound are negative or zero (the diagonal entries of D being sorted in non-increasing order), so that
$$\sum_{l=1}^k B(j_l, j_l) \le \sum_{i=1}^k D(i,i).$$
The maximum of the left-hand side is trace(k) (B). Noting that we get an equality
when choosing U = V , the proof of (2.5) is complete.
Using the same argument as that made above for the nuclear norm, one deduces from this that
$$|A + B|'_{(k)} \le |A|'_{(k)} + |B|'_{(k)}$$
for all $A, B \in S_d$ and all k = 1, . . . , d.
We refer to [28] for more examples of matrix norms, including, in particular, those provided by taking pth powers in Ky Fan's norms, defining
$$|A|_{(k,p)} = \left(\lambda_1^p + \cdots + \lambda_k^p\right)^{1/p}.$$
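These norms are all straightforward to compute from the singular values; the sketch below (illustrative only; the helper `ky_fan` is hypothetical, not from the text) also checks the triangle inequality for each order k numerically:

```python
import numpy as np

def ky_fan(A, k, p=1.0):
    """l^p combination of the k largest singular values of A:
    p = 1 gives |A|_(k); k = min(A.shape) and p = 1 give the nuclear
    norm; general p gives |A|_(k,p) = (l1^p + ... + lk^p)^(1/p)."""
    s = np.linalg.svd(A, compute_uv=False)  # non-increasing order
    return np.sum(s[:k] ** p) ** (1.0 / p)

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 4))
B = rng.standard_normal((5, 4))

# Numerical check of the triangle inequality for each order k:
for k in range(1, 5):
    lhs = ky_fan(A + B, k)
    rhs = ky_fan(A, k) + ky_fan(B, k)
    print(k, lhs <= rhs + 1e-12)  # True for every k
```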
Chapter 3
Introduction to Optimization
In such cases, one also writes $F(x) = \min_\Omega F$ or $F(x) = \max_\Omega F$. In particular, the notation $u = \min_\Omega F$ indicates that $u = \inf_\Omega F$ and that there exists an x in Ω such that F(x) = u (i.e., that the infimum of F over Ω is attained at some x ∈ Ω). Note that the infimum of a function always exists, but not necessarily its minimum. Also note that minimizers, when they exist, are not necessarily unique. We will denote by $\mathrm{argmin}_\Omega F$ (resp. $\mathrm{argmax}_\Omega F$) the (possibly empty) set of minimizers (resp. maximizers) of F.
3. One says that x is a local minimizer (resp. maximizer) of F on Ω if there exists an
open ball B ⊂ Rd such that x ∈ B and F(x) = minΩ∩B F (resp. F(x) = maxΩ∩B F).
4. An optimization problem consists in finding a minimizer or maximizer of an “objective function” F. Focusing, from now on, on minimization problems (statements for maximization problems are symmetric), we will always implicitly assume that a minimizer exists. The following provides some general assumptions on F and Ω that ensure this fact.
The sublevel sets of F in Ω are denoted [F ≤ u]Ω (or simply [F ≤ u] when Ω = Rd )
for u ∈ [−∞, +∞] with
[F ≤ u]Ω = {x ∈ Ω : F(x) ≤ u} .
Note that
$$\mathrm{argmin}_\Omega F = \bigcap_{u > \inf_\Omega F} [F \le u]_\Omega.$$
A typical requirement for F is that its sublevel sets are closed in Rd , which means
that, if a sequence (xn ) in Ω satisfies, for some u ∈ R, F(xn ) ≤ u for all n and converges
to a limit x, then x ∈ Ω and F(x) ≤ u. If this is true, one says that F is lower semi-
continuous, or l.s.c, on Ω. If, in addition to being closed, the sublevel sets of F are
bounded (at least for u small enough—larger than inf F), then argminΩ F is an inter-
section of nested compact sets, and is therefore not empty (so that the optimization
problem has at least one solution).
5. Different assumptions on F and Ω lead to different types of minimization prob-
lems, with specific underlying theory and algorithms.
1. If F is C 1 or smoother and Ω = Rd , one speaks of an unconstrained smooth
optimization problem.
2. For constrained problems, Ω is often specified by a finite number of inequali-
ties, i.e.,
Ω = {x ∈ Rd : γi (x) ≤ 0, i = 1, . . . , q}.
If F and all functions γ1 , . . . , γq are C 1 one speaks of smooth constrained problems.
3. If Ω is a convex set (i.e., $x, y \in \Omega \Rightarrow [x, y] \subset \Omega$, where [x, y] is the closed line segment connecting x and y) and F is a convex function (i.e., $F((1-t)x + ty) \le (1-t)F(x) + tF(y)$ for all $t \in [0, 1]$ and all $x, y \in \Omega$), one speaks of a convex optimization problem.
4. Non-smooth problems are often considered in data science, and lead to inter-
esting algorithms and solutions.
5. When both F and γ1 , . . . , γq are affine functions, one speaks of a linear program-
ming problem (or a linear program). (An affine function is a mapping x 7→ bT x + β,
b ∈ Rd , β ∈ R.)
If F is quadratic ($F(x) = \frac{1}{2}x^TAx - b^Tx$), and all $\gamma_i$'s are affine, one speaks of a quadratic programming problem.
6. Finally, some machine learning problems are specified over discrete or finite
sets Ω (for example Zd , or {0, 1}d ), leading to combinatorial optimization problems.
In this section, we consider the unconstrained problem
$$x^* \in \mathrm{argmin}_\Omega F. \qquad (3.1)$$
Theorem 3.1 Necessary conditions. Assume that F is differentiable over Ω, and that
x∗ is a local minimum of F. Then ∇F(x∗ ) = 0.
If F is C 2 , then, in addition, ∇2 F(x∗ ) must be positive semidefinite.
Sufficient conditions. Assume that F ∈ C 2 (Ω). If x∗ ∈ Ω is such that ∇F(x∗ ) = 0 and
∇2 F(x∗ ) is positive definite, then x∗ is a local minimum of F.
If $dF(x^*)h \neq 0$ for some h, then, for small enough ε > 0, $dF(x^* + t\varepsilon h)h$ cannot change sign for $t \in [0, 1]$, and therefore
$$F(x^* + \varepsilon h) - F(x^*) = \varepsilon\int_0^1 dF(x^* + t\varepsilon h)h\,dt$$
has the same sign as $dF(x^*)h$, which must therefore be positive since $x^*$ is a local minimizer. But the same argument can be made with h replaced by −h, implying that $dF(x^*)(-h) = -dF(x^*)h$ is also positive, and this gives a contradiction. We therefore have $dF(x^*)h = 0$ for all h, i.e., $\nabla F(x^*) = 0$.
The same argument as above shows that, if $d^2F(x^*)(h, h) \neq 0$, then it must be positive. This shows that $d^2F(x^*)(h, h) \ge 0$ for all h, and $d^2F(x^*)$ (or its associated matrix $\nabla^2F(x^*)$) is positive semidefinite.
Now, assume that F is $C^2$ with $\nabla F(x^*) = 0$ and $\nabla^2F(x^*)$ positive definite. One still has
$$F(x^* + h) - F(x^*) = \int_0^1 (1-t)\,d^2F(x^* + th)(h, h)\,dt.$$
If $\nabla^2F(x^*) \succ 0$, then $\nabla^2F(x^* + th) \succ 0$ for |h| small enough, showing that the r.h.s. of the identity is positive for h ≠ 0, and that $F(x^* + h) > F(x^*)$.
Definition 3.2 One says that a set $\Omega \subset \mathbb{R}^d$ is convex if and only if, for all $x, y \in \Omega$, the closed segment [x, y] is also included in Ω.
A function $F : \mathbb{R}^d \to (-\infty, +\infty]$ is convex if, for all $\lambda \in [0, 1]$ and all $x, y \in \mathbb{R}^d$, one has
$$F((1-\lambda)x + \lambda y) \le (1-\lambda)F(x) + \lambda F(y). \qquad (3.2)$$
If, whenever the upper bound is finite, this inequality is strict for $\lambda \in (0, 1)$ and x ≠ y, one says that F is strictly convex.
Note that, with our definition, convex functions can take the value +∞ but not
−∞. In order for the upper-bound to make sense when F takes infinite values, one
makes the following convention: a + (+∞) = +∞ for any a ∈ (−∞, +∞]; λ · (+∞) = +∞
for any λ > 0; 0 · (+∞) is not defined but 0 · (+∞) + (+∞) = +∞.
Definition 3.3 The domain of F, denoted dom(F), is the set of $x \in \mathbb{R}^d$ such that F(x) < ∞. One says that F is proper if dom(F) ≠ ∅.
We will only consider proper convex functions in our discussions, and they will simply be referred to as convex functions for brevity.
Proof The first statement is a direct consequence of (3.2), which implies that F is finite on [x, y] as soon as it is finite at x and at y. For the second statement, (3.2) for F̂ is true for x, y ∈ Ω, since it is true for F, and the upper bound is +∞ otherwise.
This proposition shows that there was no real loss of generality in requiring convex
functions to be defined on the full space Rd . Note also that the upper bound in (3.2)
is infinite unless both x and y belong to dom(F), so that the inequality only needs to
be checked in that case.
One says that a function F is concave if and only if −F is convex. All definitions and properties stated for convex functions then easily transcribe into similar statements for concave functions. We say that a function $f : I \to (-\infty, +\infty]$ (where I is an interval) is non-decreasing if, for all $x, y \in I$, x < y implies f(x) ≤ f(y). We say that f is increasing if, for all $x, y \in I$ with x < y, one has f(x) < f(y) when f(x) < ∞, and f(y) = ∞ otherwise.
Conversely, let $\Omega \subset \mathbb{R}^d$ be a convex set and $F : \Omega \to (-\infty, +\infty)$ be a function such that the expression in (3.3) is non-decreasing (resp. increasing) for all x ∈ dom(F) and $y \in \mathbb{R}^d$. Then, the extension F̂ of F defined in proposition 3.4 is convex (resp. strictly convex).
Proof Let $f(\lambda) = (F((1-\lambda)x + \lambda y) - F(x))/\lambda$. Let µ ≤ λ, and denote $z_\lambda = (1-\lambda)x + \lambda y$, $z_\mu = (1-\mu)x + \mu y$. One has $z_\mu = (1-\nu)x + \nu z_\lambda$ with ν = µ/λ, so that
$$F(z_\mu) \le (1-\nu)F(x) + \nu F(z_\lambda).$$
Subtracting F(x) from both sides (which is allowed since F(x) < ∞) and dividing by µ yields
$$f(\mu) \le f(\lambda).$$
If F is strictly convex, then either $F(z_\mu) = \infty$, in which case $f(\mu) = f(\lambda) = \infty$, or the inequality above is strict and $f(\mu) < f(\lambda)$.
If Ω is convex, then Ω̊ and Ω̄ (its topological interior and closure) are convex too (the
easy proof is left to the reader). However, topological interiors of interesting convex
sets are often empty, and a more adapted notion of relative interior is preferable.
Define the affine hull of a set Ω, denoted aff(Ω), as the smallest affine subset of $\mathbb{R}^d$ that contains Ω. The vector space parallel to aff(Ω) (generated by all differences x − y, $x, y \in \Omega$) will be denoted $\overrightarrow{\mathrm{aff}}(\Omega)$. Their common dimension k is the largest integer such that there exist $x_0, x_1, \dots, x_k \in \Omega$ such that $x_1 - x_0, \dots, x_k - x_0$ are linearly independent. Moreover, given these points, elements of the affine hull are described through barycentric coordinates, yielding
$$\mathrm{aff}(\Omega) = \left\{\lambda^{(0)}x_0 + \cdots + \lambda^{(k)}x_k : \lambda^{(0)} + \cdots + \lambda^{(k)} = 1\right\}.$$
The coordinates $(\lambda^{(0)}, \dots, \lambda^{(k)})$ are uniquely associated with $x \in \mathrm{aff}(\Omega)$ and depend continuously on x. They are indeed obtained by solving the linear system
$$x - x_0 = \sum_{j=1}^k \lambda^{(j)}(x_j - x_0), \qquad \lambda^{(0)} = 1 - \lambda^{(1)} - \cdots - \lambda^{(k)},$$
which has a unique solution for $x \in \mathrm{aff}(\Omega)$ by linear independence. To see continuity, one can introduce the k × k matrix G with entries $G^{(ij)}$ given by the inner products $(x_i - x_0)^T(x_j - x_0)$ and the vector $h(x) \in \mathbb{R}^k$ with entries $h^{(j)}(x) = (x - x_0)^T(x_j - x_0)$. Continuity is then clear since $\lambda = G^{-1}h(x)$.
Definition 3.7 If Ω is a convex set, then its relative interior, denoted relint(Ω), is the set of all x ∈ Ω such that there exists ε > 0 such that $\mathrm{aff}(\Omega) \cap B(x, \varepsilon) \subset \Omega$.
Proof Take ε > 0 such that $B(x, \varepsilon) \cap \mathrm{aff}(\Omega) \subset \Omega$. Take any $z \in B(x_\lambda, (1-\lambda)\varepsilon) \cap \mathrm{aff}(\Omega)$. Define z̃ such that $z = (1-\lambda)\tilde z + \lambda y$, i.e.,
$$\tilde z = \frac{z - \lambda y}{1 - \lambda}.$$
Then $\tilde z \in \mathrm{aff}(\Omega)$ and
$$|\tilde z - x| = \frac{|z - x_\lambda|}{1 - \lambda} < \varepsilon,$$
so that z̃, and therefore z, belongs to Ω. This proves that $B(x_\lambda, (1-\lambda)\varepsilon) \cap \mathrm{aff}(\Omega) \subset \Omega$, so that $x_\lambda \in \mathrm{relint}(\Omega)$.
If both x and y belong to relint(Ω), then xλ ∈ relint(Ω) for λ ∈ [0, 1], showing that
this set is convex.
We now show that relint(Ω) ≠ ∅. Let k be the dimension of aff(Ω), so that there exist $x_0, x_1, \dots, x_k \in \Omega$ such that $x_1 - x_0, \dots, x_k - x_0$ are linearly independent. Consider the “simplex”
$$S = \left\{\lambda^{(0)}x_0 + \cdots + \lambda^{(k)}x_k : \lambda^{(0)}, \dots, \lambda^{(k)} > 0,\ \lambda^{(0)} + \cdots + \lambda^{(k)} = 1\right\},$$
which is included in Ω by convexity. If some x ∈ S were not in relint(Ω), there would exist a sequence $(y_n)$ in aff(Ω) \ Ω converging to x and, for each n, at least one index j such that the jth barycentric coordinate of $y_n$ satisfies $\lambda^{(j)}(n) < 0$. The set of such n is infinite for at least one j and provides a subsequence of $(y_n)$ that also converges to x. But this would imply that the jth barycentric coordinate of x, which depends continuously on x, is non-positive, which is a contradiction.
So x belongs to the relative interior of Ω if, for all y ∈ Ω, the segment [x, y] can be extended on the x side while remaining included in Ω.
Proof Let A denote the set in the r.h.s. of (3.4). The proof that relint(Ω) ⊂ A is straightforward and left to the reader. We consider the reverse inclusion.
Let x ∈ A, and let y ∈ relint(Ω), which is not empty. Then, for some ε > 0, we have $z = x - \varepsilon(y - x) \in \Omega$. Since
$$x = \frac{1}{1 + \varepsilon}(z + \varepsilon y),$$
proposition 3.8 implies that x ∈ relint(Ω).
Writing $x = \lambda(x - ah) + (1-\lambda)(x + th)$ with $\lambda = t/(t + a)$, we also have
$$F(x) \le \frac{t}{t+a}F(x - ah) + \frac{a}{t+a}F(x + th),$$
which can be rewritten as
$$F(x) - F(x + th) \le \frac{t}{a}\left(F(x - ah) - F(x)\right).$$
These two inequalities show that F is continuous at x along any direction in $\overrightarrow{\mathrm{aff}}(\mathrm{dom}(F))$, which implies that F is continuous at x. Given this, the differences F(x + ah) − F(x) are bounded over the compact set C by some constant M, and the previous inequalities show that
$$|F(y) - F(x)| \le \frac{M}{a}|x - y|$$
if $y \in \mathrm{relint}(\mathrm{dom}(F))$, $|y - x| \le a$.
The following theorem provides a stronger version of optimality conditions for the
minimization of differentiable convex functions. Note that we have only defined
differentiability of functions defined over open sets.
Theorem 3.11 Let F be a convex function, with int(dom(F)) ≠ ∅. Assume that x ∈ int(dom(F)) and that F is differentiable at x. Then, for all $y \in \mathbb{R}^d$:
$$\nabla F(x)^T(y - x) \le F(y) - F(x). \qquad (3.5)$$
If F is strictly convex, the inequality is strict for y ≠ x. In particular, ∇F(x) = 0 implies that x is a global minimizer of F. It is the unique minimizer if F is strictly convex.
Conversely, if F is $C^1$ on an open convex set Ω and satisfies (3.5) for all x, y ∈ Ω, then F is convex.
Proof Equation (3.3) implies
$$\frac{1}{\lambda}\left(F((1-\lambda)x + \lambda y) - F(x)\right) \le F(y) - F(x), \qquad 0 < \lambda \le 1.$$
Taking the limit of the left-hand side as λ ↓ 0 yields (3.5). If F is strictly convex, the inequality is strict for λ < 1 and, since the l.h.s. is non-decreasing in λ, it remains strict when λ ↓ 0.
If F is $C^2$ and $\nabla^2F$ is positive definite everywhere, then F is strictly convex: indeed, (1.8) implies that, for some z ∈ [x, y],
$$F(y) = F(x) + \nabla F(x)^T(y - x) + \frac{1}{2}(y - x)^T\nabla^2F(z)(y - x) > F(x) + \nabla F(x)^T(y - x)$$
for y ≠ x.
1. int(dom(F)) ≠ ∅.
2. There exists m > 0 such that
$$F(y) - F(x) - \nabla F(x)^T(y - x) \ge \frac{m}{2}|y - x|^2 \qquad (3.6)$$
holds for all x ∈ int(dom(F)) and $y \in \mathbb{R}^d$.
$$\mathrm{argmin}\,F = \mathrm{argmin}_{\bar B(0,r)}\,F.$$
The set in the r.h.s. involves the minimization of a continuous function on a compact
set, and is therefore not empty.
Recall that, if f is $C^k$ with an L-Lipschitz kth differential (f is "L-$C^k$"), then
$$\left|f(x+h) - f(x) - df(x)h - \frac{1}{2}d^2f(x)(h,h) - \cdots - \frac{1}{k!}d^kf(x)(h,\dots,h)\right| \le \frac{L|h|^{k+1}}{(k+1)!}, \qquad (3.7)$$
for which we used the fact that
$$\int_0^1 t(1-t)^{k-1}\,dt = \int_0^1 (1-t)^{k-1}\,dt - \int_0^1 (1-t)^k\,dt = \frac{1}{k} - \frac{1}{k+1} = \frac{1}{k(k+1)}.$$
If F is strongly convex and is, in addition, L-$C^1$ for some L, then using (3.7), one gets the double inequality, for all x, y ∈ int(dom(F)):
$$\frac{m}{2}|y - x|^2 \le F(y) - F(x) - \nabla F(x)^T(y - x) \le \frac{L}{2}|y - x|^2. \qquad (3.8)$$
Proposition 3.16 Assume that F is strongly convex, satisfying (3.6), and that argmin F =
{x∗ } with x∗ ∈ int(dom(F)). Then, for all x ∈ int(dom(F)):
$$\frac{m}{2}|x - x^*|^2 \le F(x) - F(x^*) \le \frac{1}{2m}|\nabla F(x)|^2. \qquad (3.9)$$
Proof Since ∇F(x∗ ) = 0, the first inequality is a consequence of (3.6) applied to x =
x∗ . Switching the role of x and x∗ , we have
$$F(x^*) - F(x) - \nabla F(x)^T(x^* - x) \ge \frac{m}{2}|x - x^*|^2$$
so that
$$0 \le F(x) - F(x^*) \le -\nabla F(x)^T(x^* - x) - \frac{m}{2}|x - x^*|^2 \le |\nabla F(x)|\,|x - x^*| - \frac{m}{2}|x - x^*|^2. \qquad (3.10)$$
The maximum of the r.h.s. with respect to |x − x∗ | is attained at |∇F(x)|/m, showing
that
$$F(x) - F(x^*) \le \frac{1}{2m}|\nabla F(x)|^2,$$
which is the second inequality.
Proof We have the first-order expansion $F(x + \varepsilon h) - F(x) = \varepsilon h^T\nabla F(x) + o(\varepsilon)$. If $h^T\nabla F(x) < 0$, the r.h.s. is negative for small enough ε and h is a direction of descent. Similarly, if $h^T\nabla F(x) > 0$, the r.h.s. is positive for small enough ε and h cannot be a direction of descent.
Consider, for example, the quadratic function
$$F(x) = \frac{1}{2}x^TAx - b^Tx,$$
where $A \in S_n^{++}$ is a positive definite symmetric matrix. Then ∇F(x) = Ax − b, but one may argue that $\nabla_AF(x) = A^{-1}\nabla F(x)$ (defined in (1.3)) is a better choice, because it allows the algorithm to reach the minimizer of F in one step, since $x - \nabla_AF(x) = A^{-1}b$ (this statement disregards the cost associated with solving the system Ax = b, which can be an important factor in large dimension). Importantly, if F is any $C^1$ function and $A \in S_n^{++}$, the minimizer of $h \mapsto \partial_\alpha F(x + \alpha h)|_{\alpha=0}$ over all h such that $h^TAh = 1$ is given by $-\nabla_AF(x)$, i.e., $-\nabla_AF(x)$ is the steepest descent direction for the norm associated with A. This yields a general version of steepest descent methods, iterating
$$x_{t+1} = x_t - \alpha_t\nabla_{A_t}F(x_t).$$
One may instead choose h by minimizing the quadratic model
$$F(x) + \nabla F(x)^Th + \frac{1}{2}h^TAh.$$
When $\nabla^2F(x)$ is positive definite, it is then natural to choose it as the matrix A, therefore taking $h = -\nabla^2F(x)^{-1}\nabla F(x)$. This provides Newton's method for optimization. However, Newton's method requires computing second derivatives of F, which can be computationally costly. It is, moreover, not a gradient-based method, which is the focus of this discussion.
3.2.6 Convergence
We consider a general descent algorithm, iterating
$$x_{t+1} = x_t + \alpha_th_t \qquad (3.11)$$
where $h_t$ is a direction of descent and $\alpha_t > 0$ a step size.
Regarding the direction of descent, which must satisfy $h_t^T\nabla F(x_t) \le 0$, we will assume a uniform control away from orthogonality to the gradient, with the condition
$$-h_t^T\nabla F(x_t) \ge \varepsilon|h_t|\,|\nabla F(x_t)| \qquad (3.12a)$$
for some fixed ε > 0. Without loss of generality (given that a multiplicative step $\alpha_t$ must also be chosen), we assume that $h_t$ is commensurate with the gradient, namely, that
$$\gamma_1|\nabla F(x_t)| \le |h_t| \le \gamma_2|\nabla F(x_t)| \qquad (3.12b)$$
for fixed $0 < \gamma_1 \le \gamma_2$. If $h_t = -\nabla_{A_t}F(x_t)$, these assumptions are satisfied as soon as the smallest and largest eigenvalues of $A_t$ are controlled along the trajectory.
If F is bounded from below, and one takes $\alpha_t = \bar\alpha$ for all t, one deduces that
$$\min\left\{|\nabla F(x_t)|^2 : t = 1, \dots, T\right\} \le \frac{F(x_1) - \inf F}{CT\bar\alpha}.$$
We can deduce from this, for example, that there exists a sequence t1 < · · · < tn < · · ·
such that ∇F(xtk ) → 0 when k → ∞. In particular, if one runs (3.11) until |∇F(xt )| is
smaller than a given tolerance level (which is standard), the procedure is guaranteed
to terminate in a finite number of steps.
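As an illustration of (3.11) with a constant step and this stopping rule, here is a minimal Python sketch (the objective and all parameter values are hypothetical examples, not from the text):

```python
import numpy as np

def gradient_descent(grad_F, x0, alpha, tol=1e-6, max_iter=100000):
    """Iterate (3.11) with h_t = -grad F(x_t) and a constant step alpha,
    stopping when |grad F(x_t)| falls below the tolerance tol."""
    x = np.asarray(x0, dtype=float)
    for t in range(max_iter):
        g = grad_F(x)
        if np.linalg.norm(g) < tol:
            return x, t
        x = x - alpha * g
    return x, max_iter

# Hypothetical example: F(x) = 0.5 x^T A x - b^T x, a strongly convex
# quadratic whose unique minimizer solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star, n_iter = gradient_descent(lambda x: A @ x - b, np.zeros(2), alpha=0.1)
print(n_iter, np.allclose(x_star, np.linalg.solve(A, b), atol=1e-5))
```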
Such an inequality can be deduced from (3.13) under the additional assumption that
αt is bounded from below and we will discuss later line search strategies that ensure
its validity. The second assumption is that F is convex.
Theorem 3.20 Assume that F is convex and finite and that its sub-level set [F ≤ F(x0 )]
is bounded. Assume that argmin F is not empty and let x∗ be a minimizer of F. If (3.14)
is true, then
$$F(x_t) - F(x^*) \le \frac{R^2}{C(t+1)}$$
with $R = \max\{|x - x^*| : F(x) \le F(x_0)\}$.
Proof Note that the algorithm never leaves $[F \le F(x_0)]$. We have
$$F(x_{t+1}) - F(x^*) \le F(x_t) - F(x^*) - \frac{C}{R^2}\left(F(x_t) - F(x^*)\right)^2.$$
Letting $\delta_t = C(F(x_t) - F(x^*))/R^2$, this yields
$$\delta_{t+1} \le \delta_t(1 - \delta_t).$$
Proposition 3.21 If F is finite and satisfies (3.6), and if the descent algorithm satisfies
(3.14), then
F(xt ) − F(x∗ ) ≤ (1 − 2Cm)t (F(x0 ) − F(x∗ )).
Proposition 3.19 states that, to ensure that (3.14) holds, it suffices to take a small
enough step parameter α. However, the values of α that are acceptable depend on
properties of the objective function that are rarely known in practice. Moreover,
even if a valid choice is determined (this can sometimes be done in practice by trial
and error), setting a fixed value of α for the whole algorithm is often too conserva-
tive, as the best α when starting the algorithm may be different from the best one
close to convergence.
For this reason, most gradient descent procedures select a parameter αt at each
step using a line search. Given a current position and direction of descent h, a line
search explores the values of F(x + αh), α ∈ (0, αmax ] in order to discover some α ∗
that satisfies some desirable properties. We will assume in the following that x and
h satisfy (3.12a) and (3.12b) for fixed , γ1 , γ2 .
An exact line search would minimize $f_h : \alpha \mapsto F(x + \alpha h)$ over $(0, \alpha_{\max}]$ for a given upper bound $\alpha_{\max}$. This can be implemented using, e.g., binary or ternary search algorithms, but such algorithms would typically require a large number of evaluations of the function F, and would be too costly to be run at each iteration of a gradient descent procedure.
Based on the previous convergence study, we should be happy with a line search
procedure that ensures that (3.14) is satisfied for some fixed value of the constant C.
One such condition is the so-called Armijo rule, which requires (with a fixed, typically small, value of $c_1 > 0$):
$$f_h(\alpha) \le f_h(0) + c_1\alpha h^T\nabla F(x). \qquad (3.15)$$
We know that, under the assumptions of proposition 3.19, this condition can always be satisfied with a small enough value of α. Such a value can be determined using a “backtracking procedure,” which, given $\alpha_{\max}$ and $\rho \in (0, 1)$, takes $\alpha = \rho^k\alpha_{\max}$ where k is the smallest integer such that (3.15) is satisfied. This value of k is then determined iteratively, trying $\alpha_{\max}, \rho\alpha_{\max}, \rho^2\alpha_{\max}, \dots$ until (3.15) is true (this provides the “backtracking method”).
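A minimal Python sketch of this backtracking procedure (the helper name and the default values of c1, rho and alpha_max are hypothetical) could read:

```python
import numpy as np

def backtracking(F, grad_F, x, h, alpha_max=1.0, rho=0.5, c1=1e-4):
    """Return the first alpha = rho^k * alpha_max satisfying the Armijo
    rule (3.15): F(x + alpha h) <= F(x) + c1 * alpha * h^T grad F(x).
    The direction h is assumed to be a direction of descent."""
    slope = h @ grad_F(x)
    alpha = alpha_max
    while F(x + alpha * h) > F(x) + c1 * alpha * slope:
        alpha *= rho
    return alpha

# Example on F(x) = 0.5 |x|^2 with h = -grad F(x):
F = lambda x: 0.5 * np.dot(x, x)
grad_F = lambda x: x
x = np.array([2.0, -1.0])
print(backtracking(F, grad_F, x, -grad_F(x)))  # a step satisfying (3.15)
```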
A stronger requirement in the line search is to ensure that $\partial_\alpha f_h(\alpha)$ is not “too negative,” since one would otherwise be able to further reduce $f_h$ by taking a larger value of α. This leads to the weak Wolfe conditions, which combine the Armijo rule in (3.15) and
$$\partial_\alpha f_h(\alpha) = h^T\nabla F(x + \alpha h) \ge c_2h^T\nabla F(x) \qquad (3.16a)$$
for some constant $c_2 \in (c_1, 1)$. The strong Wolfe conditions require (3.15) and
$$|h^T\nabla F(x + \alpha h)| \le c_2|h^T\nabla F(x)|. \qquad (3.16b)$$
(Since h is a direction of descent, (3.16b) implies (3.16a) and adds the requirement that $h^T\nabla F(x + \alpha h)$ does not take too large positive values.) If F is L-$C^1$, these conditions, with (3.12a) and (3.12b), imply (3.14). Indeed, (3.16a) and the L-$C^1$ condition imply
$$c_2h^T\nabla F(x) \le h^T\nabla F(x + \alpha h) \le h^T\nabla F(x) + L\alpha|h|^2,$$
so that $\alpha \ge (1 - c_2)(-h^T\nabla F(x))/(L|h|^2)$, and combining this with (3.15), (3.12a) and (3.12b) gives
$$F(x + \alpha h) \le F(x) - \frac{c_1(1 - c_2)\varepsilon^2\gamma_1^2}{L\gamma_2^2}|\nabla F(x)|^2.$$
We have just proved the following proposition.
Proposition 3.22 Assume that F is L-$C^1$ and that (3.12a), (3.12b), (3.15) and (3.16a) are satisfied. Then there exists C > 0, depending only on L, ε, $\gamma_1$, $\gamma_2$, $c_1$ and $c_2$, such that
$$F(x + \alpha h) \le F(x) - C|\nabla F(x)|^2.$$
Proposition 3.23 Let $f : \alpha \mapsto f(\alpha)$ be a $C^1$ function defined on $[0, +\infty)$ such that f is bounded from below and $\partial_\alpha f(0) < 0$. Let $0 < c_1 < c_2 < 1$.
Let $\alpha_{0,0} = \alpha_{0,1} = 0$ and $\alpha_0 > 0$. Define recursively sequences $\alpha_{n,0}$, $\alpha_{n,1}$ and $\alpha_n$ as follows.
(i) If $f(\alpha_n) \le f(0) + c_1\alpha_n\partial_\alpha f(0)$ and $\partial_\alpha f(\alpha_n) \ge c_2\partial_\alpha f(0)$: stop the construction.
(ii) If $f(\alpha_n) > f(0) + c_1\alpha_n\partial_\alpha f(0)$: let $\alpha_{n+1} = (\alpha_n + \alpha_{n,0})/2$, $\alpha_{n+1,1} = \alpha_n$ and $\alpha_{n+1,0} = \alpha_{n,0}$.
(iii) If $f(\alpha_n) \le f(0) + c_1\alpha_n\partial_\alpha f(0)$ and $\partial_\alpha f(\alpha_n) < c_2\partial_\alpha f(0)$:
(a) If $\alpha_{n,1} = 0$: let $\alpha_{n+1} = 2\alpha_n$, $\alpha_{n+1,0} = \alpha_n$ and $\alpha_{n+1,1} = \alpha_{n,1}$.
(b) If $\alpha_{n,1} > 0$: let $\alpha_{n+1} = (\alpha_n + \alpha_{n,1})/2$, $\alpha_{n+1,0} = \alpha_n$ and $\alpha_{n+1,1} = \alpha_{n,1}$.
Then the sequences are always finite, i.e., the algorithm terminates in a finite number of steps.
Proof Assume, to get a contradiction, that the algorithm runs indefinitely, so that case (i) never occurs. If case (ii) never occurs, then one runs step (iii-a) indefinitely, so that $\alpha_n \to \infty$ with $f(\alpha_n) \le f(0) + c_1\alpha_n\partial_\alpha f(0)$, and f cannot be bounded from below, yielding a contradiction. As soon as case (ii) occurs, we have, at every subsequent step, $\alpha_{n,0} \ge \alpha_{n-1,0}$, $\alpha_{n,1} \le \alpha_{n-1,1}$, $\alpha_n \in [\alpha_{n,0}, \alpha_{n,1}]$, $f(\alpha_{n,1}) > f(0) + c_1\alpha_{n,1}\partial_\alpha f(0)$, $f(\alpha_{n,0}) \le f(0) + c_1\alpha_{n,0}\partial_\alpha f(0)$ and $\partial_\alpha f(\alpha_{n,0}) < c_2\partial_\alpha f(0)$. This implies that
$$f(\alpha_{n,1}) - f(\alpha_{n,0}) > c_1(\alpha_{n,1} - \alpha_{n,0})\,\partial_\alpha f(0).$$
Moreover, the updates imply that $\alpha_{n+1,1} - \alpha_{n+1,0} = (\alpha_{n,1} - \alpha_{n,0})/2$. This requires that the three sequences $\alpha_n$, $\alpha_{n,0}$ and $\alpha_{n,1}$ converge to the same limit, α. We have
$$\partial_\alpha f(\alpha) = \lim_{n\to\infty}\frac{f(\alpha_{n,1}) - f(\alpha_{n,0})}{\alpha_{n,1} - \alpha_{n,0}} \ge c_1\partial_\alpha f(0)$$
and
$$\partial_\alpha f(\alpha) = \lim_{n\to\infty}\partial_\alpha f(\alpha_{n,0}) \le c_2\partial_\alpha f(0),$$
yielding $c_1\partial_\alpha f(0) \le c_2\partial_\alpha f(0)$, which is impossible since $c_2 > c_1$ and $\partial_\alpha f(0) < 0$.
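The construction of proposition 3.23 translates directly into a line-search routine. The following sketch (an illustration under the stated assumptions, with hypothetical default values for c1, c2 and the initial trial step) mirrors cases (i) through (iii):

```python
def weak_wolfe(f, df, c1=1e-4, c2=0.9, alpha0=1.0):
    """Bracketing/bisection search of proposition 3.23: find alpha with
    f(alpha) <= f(0) + c1*alpha*df(0)   (Armijo, (3.15))    and
    df(alpha) >= c2*df(0)               (curvature, (3.16a)),
    where f(alpha) = F(x + alpha h) and df(alpha) = h^T grad F(x + alpha h)."""
    f0, df0 = f(0.0), df(0.0)
    assert df0 < 0.0, "h must be a direction of descent"
    lo, hi = 0.0, 0.0      # current bracket [alpha_{n,0}, alpha_{n,1}]
    alpha = alpha0
    while True:
        if f(alpha) > f0 + c1 * alpha * df0:       # case (ii): Armijo fails
            hi = alpha
            alpha = 0.5 * (lo + hi)
        elif df(alpha) < c2 * df0:                 # case (iii): curvature fails
            lo = alpha
            alpha = 2.0 * alpha if hi == 0.0 else 0.5 * (lo + hi)
        else:                                      # case (i): both hold, stop
            return alpha

# Example with f(alpha) = (alpha - 1)^2 - 1, so that df(0) = -2 < 0:
f = lambda a: (a - 1.0) ** 2 - 1.0
df = lambda a: 2.0 * (a - 1.0)
print(weak_wolfe(f, df, alpha0=0.25))
```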
Proposition 3.24 Let $f : \alpha \mapsto f(\alpha)$ be a $C^1$ function defined on $[0, +\infty)$ such that f is bounded from below and $\partial_\alpha f(0) < 0$. Let $0 < c_1 < c_2 < 1$.
Let $\alpha_{0,0} = \alpha_{0,1} = 0$ and $\alpha_0 > 0$. Define recursively sequences $\alpha_{n,0}$, $\alpha_{n,1}$ and $\alpha_n$ as follows.
(i) If $f(\alpha_n) \le f(0) + c_1\alpha_n\partial_\alpha f(0)$ and $|\partial_\alpha f(\alpha_n)| \le c_2|\partial_\alpha f(0)|$: stop the construction.
(ii) If $f(\alpha_n) > f(0) + c_1\alpha_n\partial_\alpha f(0)$: let $\alpha_{n+1} = (\alpha_n + \alpha_{n,0})/2$, $\alpha_{n+1,1} = \alpha_n$ and $\alpha_{n+1,0} = \alpha_{n,0}$.
(iii) If $f(\alpha_n) \le f(0) + c_1\alpha_n\partial_\alpha f(0)$ and $|\partial_\alpha f(\alpha_n)| > c_2|\partial_\alpha f(0)|$:
(a) If $\alpha_{n,1} = 0$ and $\partial_\alpha f(\alpha_n) > -c_2\partial_\alpha f(0)$: let $\alpha_{n+1} = 2\alpha_n$, $\alpha_{n+1,0} = \alpha_{n,0}$ and $\alpha_{n+1,1} = \alpha_{n,1}$.
(b) If $\alpha_{n,1} = 0$ and $\partial_\alpha f(\alpha_n) < c_2\partial_\alpha f(0)$: let $\alpha_{n+1} = 2\alpha_n$, $\alpha_{n+1,0} = \alpha_n$ and $\alpha_{n+1,1} = \alpha_{n,1}$.
(c) If $\alpha_{n,1} > 0$ and $\partial_\alpha f(\alpha_n) > -c_2\partial_\alpha f(0)$: let $\alpha_{n+1} = (\alpha_n + \alpha_{n,0})/2$, $\alpha_{n+1,1} = \alpha_n$ and $\alpha_{n+1,0} = \alpha_{n,0}$.
(d) If $\alpha_{n,1} > 0$ and $\partial_\alpha f(\alpha_n) < c_2\partial_\alpha f(0)$: let $\alpha_{n+1} = (\alpha_n + \alpha_{n,1})/2$, $\alpha_{n+1,0} = \alpha_n$ and $\alpha_{n+1,1} = \alpha_{n,1}$.
Then the sequences are always finite, i.e., the algorithm terminates in a finite number of steps.
Proof Assume that the algorithm runs indefinitely, in order to get a contradiction. If the algorithm never enters case (ii), then $\alpha_{n,1} = 0$ for all n, $\alpha_n$ tends to infinity and $f(\alpha_n) \le f(0) + c_1\alpha_n\partial_\alpha f(0)$, which contradicts the fact that f is bounded from below. As soon as the algorithm enters case (ii), we have, for all subsequent iterations: $\alpha_{n,0} \le \alpha_n \le \alpha_{n,1}$, $\alpha_{n+1,0} \ge \alpha_{n,0}$, $\alpha_{n+1,1} \le \alpha_{n,1}$ and $\alpha_{n+1,1} - \alpha_{n+1,0} = (\alpha_{n,1} - \alpha_{n,0})/2$. This implies that both $\alpha_{n,0}$ and $\alpha_{n,1}$ converge to the same limit α.
In some situations, the computation of ∇F can be too costly, if not intractable, for running gradient descent updates, while a low-cost stochastic approximation is available. For example, if F is an average of a large number of terms, the approximation may simply be based on averaging over a randomly selected subset of the terms. This leads to a stochastic approximation algorithm [164, 114, 25, 67] called stochastic gradient descent (SGD).
SGD iterates
$$X_{t+1} = X_t + \alpha_{t+1}H(X_t, \xi_{t+1}), \qquad \xi_{t+1} \sim \pi_{X_t}, \qquad (3.17)$$
where $\xi_{t+1}$ is a random variable and the notation $\xi_{t+1} \sim \pi_{X_t}$ should be interpreted as the more precise statement that the conditional distribution of $\xi_{t+1}$ given all past random variables $U_t = (\xi_1, X_1, \dots, \xi_t, X_t)$ only depends on $X_t$ and is given by $\pi_{X_t}$.
More complex situations can also be considered, in which $\xi_{t+1}$ is not conditionally independent of the past variables given $X_t$. For example, the conditional distribution of $\xi_{t+1}$ given the past may also depend on $\xi_t$, which allows for the combination of stochastic gradient methods with Markov chain Monte-Carlo methods. This situation is studied, for example, in Métivier and Priouret [139], Benveniste et al. [25], and we will discuss an example in section 18.2.2.
Let
$$\bar H(x) = E(H(x, \xi)), \qquad \xi \sim \pi_x,$$
and write
$$X_{t+1} = X_t + \alpha_{t+1}\bar H(X_t) + \alpha_{t+1}\eta_{t+1}$$
with $\eta_{t+1} = H(X_t, \xi_{t+1}) - \bar H(X_t)$, in order to represent the evolution of $X_t$ in (3.17) as a perturbation of the deterministic algorithm
$$x_{t+1} = x_t + \alpha_{t+1}\bar H(x_t) \qquad (3.18)$$
by the “noise term” αt+1 η t+1 . In many cases, the deterministic algorithm provides
the limit behavior of the stochastic sequence, and one should ensure that this limit
is as desired. By definition, the conditional expectation of η t+1 given Ut (the past) is
zero and one says that αt+1 η t+1 is a “martingale increment.” Then,
$$M_T = \sum_{t=0}^T \alpha_{t+1}\eta_{t+1} \qquad (3.19)$$
is called a “martingale.” The theory of martingales offers numerous tools for con-
trolling the size of MT and is often a key element in proving the convergence of the
method.
Many convergence results have been provided in the literature and can be found
in textbooks or lecture notes such as Benaı̈m [23], Kushner and Yin [114], Benveniste
et al. [25]. These results rely on some smoothness and growth assumptions made on
the function H, and on the dynamics of the deterministic equation (3.18). Depend-
ing on these assumptions, proofs may become quite technical. We will here restrict
to a reasonably simple context and assume, among other conditions, that there exist $x^*$ and µ > 0 such that, for all x,
$$(x - x^*)^T\bar H(x) \le -\mu|x - x^*|^2.$$
Using (3.20), we can apply this lemma with $\epsilon_t = \tilde C\alpha_t^2$ and $\delta_t = 2\alpha_t\mu - C\alpha_t^2$, making the additional assumption that, for all t, $\alpha_t < \min\left(\frac{1}{2\mu}, \frac{2\mu}{C}\right)$, which ensures that $0 < \delta_t < 1$.
Starting with a simple case, assume that the steps $\gamma_t$ are constant, equal to some value γ (yielding also constant δ and ε). Then, (3.22) gives
$$a_t \le a_0(1-\delta)^t + \epsilon\sum_{k=1}^t (1-\delta)^{t-k} \le a_0(1-\delta)^t + \frac{\epsilon}{\delta}. \qquad (3.23)$$
Returning to the general case in which the steps depend on t, we will use the following simple result, which we state as a lemma for future reference.
Lemma 3.26 Assume that the doubly indexed sequence $w_{st}$, s ≤ t, of non-negative numbers is bounded and such that, for all s, $\lim_{t\to\infty} w_{st} = 0$. Let $\beta_1, \beta_2, \dots$ be such that
$$\sum_{t=1}^\infty |\beta_t| < \infty.$$
Then
$$\lim_{t\to\infty}\sum_{s=1}^t \beta_sw_{st} = 0.$$
Proof Fix $t_0 \ge 1$. For $t \ge t_0$, split the sum at $t_0$; each of the first $t_0$ terms tends to zero as $t \to \infty$, so that
$$\limsup_{t\to\infty}\left|\sum_{s=1}^t \beta_sw_{st}\right| \le \max_{s,t}|w_{st}|\sum_{s=t_0+1}^\infty |\beta_s|,$$
and since this upper bound can be made arbitrarily small, the result follows.
Assume that
(H3) $\sum_{k=1}^\infty \alpha_k = \infty$ and $\sum_{k=1}^\infty \alpha_k^2 < \infty$.
Then $\lim_{t\to\infty} v_{st} = 0$ for all s and lemma 3.26 implies that $a_t$ tends to zero. So, we have just proved that, if (H1), (H2) and (H3) are true, the sequence $X_t$ converges in the $L^2$ sense to $x^*$. Actually, under these conditions, one can show that $X_t$ converges to $x^*$ almost surely, and we refer to Benveniste et al. [25], Chapter 5, for a proof (the argument above for $L^2$ convergence follows the one given in Nemirovski et al. [146]).
Under (H3), one can say much more on the asymptotic behavior of the algorithm by comparing it with an ordinary differential equation. The “ODE method,” introduced in Ljung [121], is indeed a fundamental tool for the analysis of stochastic approximation algorithms. The correspondence between discrete and continuous time is made through the time scale $\tau_t = \alpha_1 + \cdots + \alpha_t$ and the ODE
$$\dot{\bar x} = \bar H(\bar x). \qquad (3.24)$$
Assume that (3.24) has unique solutions for given initial conditions on any finite interval, and denote by ϕ(ρ, ω) its solution at time ρ initialized with x̄(0) = ω. Let $\alpha^c(\rho)$ and $\eta^c(\rho)$ be piecewise constant interpolations of $(\alpha_t)$ and $(\eta_t)$, defined by $\alpha^c(\rho) = \alpha_{t+1}$ and $\eta^c(\rho) = \eta_{t+1}$ on the interval $[\tau_t, \tau_{t+1})$. Finally, let
$$\Delta(\rho, T) = \max_{s\in[\rho, \rho+T]}\left|\int_\rho^s \eta^c(u)\,du\right|.$$
The following proposition (see [23]) compares the tails of the interpolated process $X^\ell$ (with $X^\ell(\tau_t) = X_t$), i.e., the functions $X^\ell(\rho + s)$, $s \ge 0$, with the solutions of the ODE over finite intervals.
Proposition 3.27 (Benaïm) Assume that $\bar H$ is Lipschitz and bounded. Then, for some constant C(T) that only depends on T and $\bar H$, one has, for all ρ ≥ 0,
$$\sup_{h\in[0,T]}\left|X^\ell(\rho + h) - \varphi(h, X^\ell(\rho))\right| \le C(T)\left(\Delta(\rho - 1, T + 1) + \max_{s\in[\rho,\rho+T]}\alpha^c(s)\right). \qquad (3.25)$$
Recall that H̄ being Lipschitz means that there exists a constant C such that
for all w, w0 ∈ Rp .
where M is defined in (3.19), because, if m(ρ) is the largest integer t such that τt ≤ ρ,
then
In the case we are considering, one can use martingale inequalities (called Doob’s
inequalities) to control ∆0 . One has, for example,
E(|Mt+N − Mt |2 )
P max |Mt+k − Mt | > λ ≤ . (3.26)
0≤k≤N λ2
t+N
X
2 2
E(|Mt+N − Mt | ) = αk+1 E(|η t+1 |2 ).
k=t
and inequality (3.26) can then be used in (3.25) to control the probability of devia-
tion of the stochastic approximation from the solution of the ODE over finite inter-
vals (a little more work is required under weaker assumptions on H, such as (H1)).
Proposition 3.27 cannot be used with T = ∞ because the constant C(T ) typically
grows exponentially with T . In order to draw conclusions on the limit of the process
W , one needs additional assumptions on the stability of the ODE. We refer to [23]
for a collection of results on the relationship between invariant sets and attractors of
the ODE and limit trajectories of the stochastic approximation. We here quote one
of these results which is especially relevant for SGD.
Proposition 3.28 Assume that H̄ = −∇E is the gradient of a function E and that ∇E only
vanishes at a finite number of points. Assume also that Xt is bounded. Then Xt converges
to a point x∗ such that ∇E(x∗ ) = 0.
Theorem 3.29 In addition to the hypotheses previously made, assume that there exists a
C 2 function U with bounded second derivatives and K0 > 0 such that, for allx such that
|x| ≥ K0 ,
∇U (x)T H̄(x) ≤ 0,
U (x) ≥ γ|x|2 , γ > 0.
The ADAM algorithm provides such a construction (without the theoretical guar-
antees) in which Dt is computed using past iterations of the algorithm. It requires
several parameters, namely: α: the algorithm gain, taken as constant (e.g., α =
0.001); Two parameters β1 and β2 for moment estimates (e.g. β1 = 0.9 and β2 =
0.999); A small number (e.g., = 10−8 ) to avoid divisions by 0. In addition, ADAM
defines two vectors: a mean m and a second moment v, respectively initialized at
0 and 1. The ADAM iterations are given below, in which g ⊗2 denotes the vector
obtained by squaring each coefficient of a vector g.
6. Set
m̂t+1
Xt+1 = Xt − α √
v̂t+1 +
and
t
β2 X
v̂t = (1 − β2 )t−k gk 2 .
1 − β2t
k=0
In this section, which follows the discussion given in Wright and Recht [207], we
review conditions for optimality for constrained minimization of smooth functions,
in two cases. The first one, discussed in this section, is when Ω is defined by a finite
number of smooth constraints, leading, under some assumptions, to the Karush-
Kuhn-Tucker (or KKT) conditions. The second one, in the next section, specializes
to closed convex Ω.
KKT conditions
x∗ ∈ argmin F (3.27)
Ω
where
Ω = {x ∈ Rd : γi (x) = 0, i ∈ E and γi (x) ≤ 0, i ∈ I }. (3.28)
70 CHAPTER 3. INTRODUCTION TO OPTIMIZATION
The set Ω of all x that satisfy the constraints is called the feasible set for the consid-
ered problem. We will always assume that it is non-empty. If x ∈ Ω, one defines the
set A(x) of active constraints at x to be
A(x) = {i ∈ C : γi (x) = 0} .
One obviously has E ⊂ A(x) for x ∈ Ω.
A sufficient (and easier to check) condition for x to satisfy these constraints is when
the vectors (∇γi (x), i ∈ A(x)) are linearly independent [36]. Indeed, if the latter “LI-
CQ” condition is true, then any set of values can be assigned to hT ∇γi (x) with the
existence of a vector h that achieves them.
where the real numbers λi , i ∈ C are called Lagrange multipliers. The following the-
orem (stated without proof, see, e.g., [147, 34]) provides necessary conditions satis-
fied by solutions of the constrained minimization problem that satisfy the constraint
qualifications.
Theorem 3.31 Assume x∗ ∈ Ω is a solution of (3.27), and that x∗ satisfies the MF-CQ
conditions. Then there exist Lagrange multipliers λi , i ∈ C, such that
∂x L(x∗ , λ) = 0
(
(3.30)
λi ≥ 0 if i ∈ I , with λi = 0 when i < A(x∗ )
Conditions (3.30) are the KKT conditions for the constrained optimization problem.
The second set of conditions is often called the complementary slackness conditions
and state that λi = 0 for an inequality constraint unless this constraint is satisfied
with an equality. The next section provides examples in which the MF-CQ condi-
tions are not satisfied and Theorem 3.31 does not hold. However, these conditions
are not needed in the special case when the constraints are affine.
3.4. CONSTRAINED OPTIMIZATION PROBLEMS 71
Theorem 3.32 Assume that for all i ∈ A(x∗ ), the functions γi are affine, i.e., γi (x) =
biT x + βi for some b ∈ Rd and β ∈ R. Then (3.30) holds at any solution of (3.27).
Remark 3.33 We have taken the convention to express the inequality constraints as
γi (x) ≤ 0, i ∈ I . With the reverse convention, i.e., γi (x) ≥ 0, i ∈ I , one generally
defines the Lagrangian as
X
L(x, λ) = F(x) − λi γi (x)
i∈C
Examples. Constraint qualifications are important to ensure the validity of the the-
orem. Consider a problem with equality constraints only, and replace it by
x∗ ∈ argmin F
Ω
subject to γ̃i (x) = 0, i ∈ E, with γ̃i = γi2 . We clearly did not change the problem.
However, the previous theorem applied to the Lagrangian
X
L(x, λ) = F(x) + λi γ̃i (x)
i∈C
would require an optimal solution to satisfy ∇F(x) = 0, because ∇γ̃i (x) = 2γi (x)∇γi (x) =
0 for any feasible solution. Minimizers of constrained problems do not necessarily
satisfy ∇F(x) = 0, however. This is no contradiction with the theorem since ∇γ̃i (x) = 0
for all i shows that no feasible point satisfies the MF-CQ.
To take a more specific example, still with equality constraints, let d = 3, C = {1, 2}
with F(x, y, z) = x/2+y and γ1 (x, y, z) = x2 −y 2 , γ2 (x, y, z) = y −z2 . Note that γ1 = γ2 = 0
implies that y = |x|, so that, for a feasible point, F(x, y, z) = |x| + x/2 ≥ 0 and vanishes
only when x = y = 0, in which case z = 0 also. So (0, 0, 0) is a global minimizer.
We have dF(0) = (1/2, 1, 0), dγ1 (0) = (0, 0, 0) and dγ2 (0) = (0, 1, 0) so that 0 does not
satisfy the MF-CQ. The equation
has no solution (λ1 , λ2 ), so that the conclusion of the theorem does not hold.
We now consider the case in which Ω is a closed convex set. To specify the optimality
conditions in this case, we need the following definition.
72 CHAPTER 3. INTRODUCTION TO OPTIMIZATION
Definition 3.34 Let Ω ⊂ Rd be convex and let x ∈ Ω. The normal cone to Ω at x is the
set
NΩ (x) = {h ∈ Rd : hT (y − x) ≤ 0 for all y ∈ Ω} (3.31)
NΩ (x) = {h = µb : µ ≥ 0} .
NΩ (x) = {h = λb : λ ∈ R} .
One can build normal cones to domains associated with multiple inequalities or
equalities based on the following theorem.
Theorem 3.35 Let Ω1 and Ω2 be two convex sets with relint(Ω1 )∩relint(Ω2 ) , ∅. Then,
if x ∈ Ω1 ∩ Ω2
NΩ1 ∩Ω2 (x) = NΩ1 (x) + NΩ2 (x)
Here, the addition is the standard sum between sets in a vector space:
A + B = {x + y : x ∈ A, y ∈ B}.
Theorem 3.36 Let F be a C 1 function and Ω a closed convex set. If x∗ ∈ argminΩ F, then
If F is convex and (3.33) holds, we have F(y) ≥ F(x∗ )+∇F(x∗ )T (y −x∗ ) by convexity,
so that
F(x∗ ) ≤ F(y) + (−∇F(x∗ ))T (y − x∗ ) ≤ F(y).
3.4.3 Applications
Note that one always have Nγ0 (x) ⊂ NΩ (x) since, for g = 0
P
i∈A(x) λi ∇γi (x) ∈ Nγ (x), one
has, for y ∈ Ω,
X
T
g (y − x) = λi ∇γi (x)T (y − x)
i∈A(x)
X X
= λi (aTi y − aTi x) + λi (γi (x) + ∇γi (x)T (y − x))
i∈E i∈A(x)∩I
X
= λi (γi (x) + ∇γi (x)T (y − x))
i∈A(x)∩I
≤ λi γi (y) ≤ 0,
in which the have used the facts that aTi x = aTi y = −βi for x, y ∈ Ω, i ∈ E, γi (x) = 0 for
i ∈ A(x) and the convexity of γi . Constraint qualifications such as those considered
above are sufficient conditions that ensure the identity between the two sets.
74 CHAPTER 3. INTRODUCTION TO OPTIMIZATION
Consider now the situation of theorem 3.32, and assume that all constraints are
affine inequalities, γi (x) = biT x + β ≤ 0, i ∈ I . Then, the statement NΩ (x) ⊂ Nγ0 (x) can
be reexpressed as follows. All h ∈ Rd such that
hT (y − x) ≤ 0
as soon as biT (y − x) ≤ 0 for all i ∈ A(x) must take the form
X
h= λi bi
i∈A(x)
with λ(i) ≥ 0. This property is called Farkas’s lemma (see, e.g. [168]). Note that affine
equalities biT x + β = 0 can be included as two inequalities biT x + β ≤ 0, −biT x − β ≤ 0,
which removes the sign constraint on the corresponding λ(i) and therefore yields
theorem 3.32.
We have A ∈ Sn+ if and only if, for all u ∈ Rd , u T Au ≥ 0, which provides an infinity
of linear inequality constraints on A. Elements of NS + (A) are matrices H ∈ Mn such
that
trace(H T (B − A)) ≤ 0
for all B ∈ Sn+ , and we want to make this normal cone explicit. We first note that,
every square matrix H can be decomposed as the sum of a symmetric matrix, Hs and
of a skew symmetric one, Ha (namely, Hs = (H + H T )/2 and Ha = (H − H T )/2). We
have moreover trace(HaT (B − A)) = 0, so the condition is only on the symmetric part
of H.
For any u ∈ Rd , one can take B = A+uu T , which belongs to Sn+ , with trace(HsT (B−
A)) = u T Hs u. This shows that, for H to belong to NSn+ (A), one needs Hs 0.
min F = min F
Ω Ω∩B̄(0,R)
for large enough R (e.g., larger than F(x) for any fixed point in Ω), and since the
latter minimization is over a compact set, argminΩ F is not empty. The function F
being strongly convex, its minimizer over Ω is unique and called the projection of
x0 on Ω, denoted projΩ (x0 ).
2. If Ω = B̄(0, 1), the closed unit sphere, then NΩ (x) = R+ x for x ∈ ∂Ω (i.e., |x| = 1).
One can indeed note that, if h , 0 in normal to Ω at x, then h/|h| ∈ Ω so that
!
T h
h −x ≤ 0
|h|
which yields |h| ≤ hT x. The Cauchy-Schwartz inequality implying that hT x ≤ |h| |x| =
|h|, we must have equality, hT x = |h| |x|, which is only possible when x and h are
collinear.
Given x0 ∈ Rd with x0 ≥ 1, we see that projΩ (x0 ) must satisfy the conditions
|projΩ (x0 )| = 1 (to be in ∂Ω) and x0 − projΩ (x0 ) = λx0 for some λ ≥ 0, which gives
projΩ (x0 ) = x0 /|x0 |.
3. If Ω = Sn+ and B (taking the role of x0 ) is a symmetric matrix, then projΩ (B) was
found in the previous section, and is given by A = U T D + U where U T DU provides a
diagonalization of B.
To justify this last statement it suffices to notice that the function in the r.h.s. can be
written as
1 α
|x − xt + αt ∇F(xt )|2 − t |∇F(xt )|2 + F(xt )
2αt 2
and apply the definition of the projection.
3.5.1 Epigraphs
One says that F is closed if epi(F) is a closed subset of Rd × R, that is: if x = limn xn and
a = limn an with F(xn ) ≤ an , then F(x) ≤ a.
Clearly, if (x, a) ∈ epi(F), then x ∈ dom(F). It should also be clear that epi(F) is always
convex when F is convex: If (x, a), (y, b) ∈ epi(F), then
F((1 − t)x + ty) ≤ (1 − t)F(x) + tF(y) ≤ (1 − t)a + tb
so that (1 − t)(x, a) + t(y, b) ∈ epi(F).
Conversely, assume that all Λa (F) are closed and take a sequence (xn , an ) in epi(F)
that converges to (x, a). Then, fixing > 0, xn ∈ Λa+ for large enough n, and since
this set is closed, F(x) ≤ a + . Since this is true for all > 0, we have F(x) ≤ a and
(x, a) ∈ epi(F).
3.5.2 Subgradients
Several machine learning problems involve convex functions that are not C 1 , requir-
ing a generalization of the notion of derivative provided by the following definition.
Definition 3.41 If F is a convex function and x ∈ dom(F), a vector g ∈ Rd such that
Proof We need to prove that there is no other subgradient. Assume that ∇F(x) exists
and take y = x + u in (3.41) (u ∈ Rd ). One gets, for g ∈ ∂F(x),
This is only possible if g T u = ∇F(x)T u for all u ∈ Rd , which itself implies g = ∇F.
The following result shows that subgradients exist under generic conditions. We
note that g ∈ ∂F(x) if and only if proj −aff
−→
(dom(F))
(g) ∈ ∂F, because (3.41) is trivial if
F(y) = +∞. So ∂F cannot be bounded unless aff(dom(D)) = Rd . However, it is the
−−→
part of this set that is included in the aff (dom(F)) that is of interest.
Proposition 3.44 For all x ∈ Rd , ∂F(x) is a closed convex set (possibly empty, in par-
−−→
ticular for x < dom(F)). If x ∈ ridom(F), then ∂F(x) , ∅ and ∂F(x) ∩ aff (dom(F)) is
compact.
3.5. GENERAL CONVEX PROBLEMS 79
Proof The convexity and closedness of ∂F(x) is clear from the definition. If x ∈
−−→
ridom(F), there exists > 0 such that x + h ∈ ridom(F) for all h ∈ aff (dom(F)) with
−−→
|h| = 1. For all g ∈ ∂F(x) ∩ aff (dom(F)), one has
−−→
|g| = max{g T h : h ∈ aff (dom(F)), |h| = 1}
−−→
≤ max((F(x + h) − F(x))/ : h ∈ aff (dom(F)), |h| = 1)
and the upper bound is finite because it is the maximum of a continuous function
over a bounded set. This shows that ∂F(x) is bounded. We defer the proof that
∂F(x) , ∅ to section 3.7.
Another important point is how the chain rule works with compositions with
affine functions.
Theorem 3.46 Let F be a convex function on Rd , A a d × m matrix and b ∈ Rd . Let
G(x) = F(Ax + b), x ∈ Rm . Assume that there exists x0 ∈ Rm such that Ax0 ∈ ridom(F).
Then, for all x ∈ Rm ,
∂G(x) = AT ∂F(Ax + b).
One direction is straightforward and does not require the condition on ridom(F). If
g ∈ ∂F(Ax + b), then
F(z) − F(Ax + b) ≥ g T (z − Ax − b), z ∈ Rd
and applying this inequality to z = Ay + b for y ∈ Rm yields
G(y) − G(x) ≥ g T A(y − x)
so that AT g ∈ ∂G and AT ∂F ⊂ ∂G. The reverse inclusion is proved in section 3.7.
Proposition 3.47 Assume that Ω is a closed convex subset of Rd . Then σΩ (the indicator
function of Ω) has a subdifferential everywhere on Ω with
g T (y − x) ≤ 0
Given this proposition, it is also clear (after noting that σΩ1 + σΩ2 = σΩ1 ∩Ω2 ) that
theorem 3.45 is a generalization of theorem 3.35.
1
t 7→ (F(x + th) − F(x))
t
is increasing as a function of t. This property allows us to define directional deriva-
tives of F at x.
1
dF(x, h) = lim (F(x + th) − F(x)), (3.42)
t↓0 t
Note that, still from proposition 3.5, one has, for all x ∈ dom(F) and y ∈ Rd :
Proposition 3.49 If F is convex, then x∗ ∈ argmin(F) if and only if dF(x∗ , h) ≥ 0 for all
h ∈ Rd .
3.5. GENERAL CONVEX PROBLEMS 81
Proof If dF(x∗ , h) ≥ 0, then F(x∗ +th)−F(x∗ ) ≥ 0 for all t > 0 and this being true for all
h implies that x∗ is a minimizer. Conversely, if x∗ is a minimizer, dF(x∗ , h) is a limit
of non-negative numbers and is therefore non-negative.
and
dF(x, h1 + h2 ) ≤ dF(x, h1 ) + dF(x, h2 ).
Proof Positive homogeneity is straightforward and left to the reader. For the second
one, we can write
1
F(x + th1 + th2 ) ≤ (F(x + th1 /2) + F(x + th2 /2))
2
by convexity so that
1 1 1 1
(F(x + th1 + th2 ) − F(x)) ≤ (F(x + th1 /2) − F(x)) + (F(x + th2 /2) − F(x)) .
t 2 t t
Taking t ↓ 0,
1
dF(x; h1 + h2 ) ≤ (dF(x; h1 /2) + dF(x, h2 /2)) = dF(x, h1 ) + dF(x, h2 ).
2
If x ∈ ridom(F), then
dF(x, h) = max{g T h, g ∈ ∂F(x)}.
dom(G) = {h : x + h ∈ aff(dom(F))}.
82 CHAPTER 3. INTRODUCTION TO OPTIMIZATION
Indeed, for any h in this set, there exists > 0 such that x + th ∈ dom(F) for 0 < t <
and dF(x, h) ≤ (F(x + th) − F(x))/t < ∞. Conversely, if h ∈ dom(G), then F(x + th) − F(x)
must be finite for small enough t, so that x + th ∈ dom(F) and x + h ∈ aff(dom(F)).
for all t > 0, which requires dF(x, h̃) ≥ ĝ T h̃ for all h̃, and in particular dF(x, h) = ĝ T h.
Since
F(x + h̃) − F(x) ≥ dF(x, h̃) ≥ ĝ T h̃
we see that ĝ ∈ ∂F(x), with ĝ T h = dF(x, h), which concludes the proof.
The next proposition gives a criterion for a vector g to belong to ∂F(x) based on
directional derivatives.
dF(x, h) ≥ g T h
but the inequality goes in the “wrong direction.” However, we know that, for any
h ∈ Rd , there exists gh ∈ ∂F(x) such that
2thT (g − h) + t 2 |g − h|2 ≥ 0.
The fact that this holds for all t ≥ 0 requires that hT (g − h) ≥ 0 as required. We have
therefore proved that h defined by (3.44) is a descent direction for F at x (it is actually
the steepest descent direction: see [207] for a proof), justifying the algorithm
X X
|g|2 = (∂i F(x) + λ sign(x(i) ))2 + (∂i ψ(x) − λρi )2 .
i<A(x) i∈A(x)
Define
s(t) = sign(t) min(|t|, 1).
Then h satisfying (3.44) is given by
In more complex situations, the extra minimization step at each iteration of the
algorithm can be challenging computationally. The following subgradient method
uses an averaging approach to minimize F without requiring finding subgradients
with minimal norms. It simply defines
xt+1 = xt − αt gt , gt ∈ ∂F(xt )
and computes Pt
j=1 αj xj
x̄t = Pt .
j=1 αj
Proximal operator. We start with a few simple facts. Let F be a closed convex
function and ψ be convex and differentiable, with dom(ψ) = Rd . Let G = F + ψ.
Then G is a closed convex function. Indeed, consider the sub-level set Λa (G) = {x :
G(x) ≤ a} and assume that xn → x with xn ∈ Λa (g). Then ψ(xn ) → ψ(x) by continuity,
and for all > 0, we have, for large enough n, F(xn ) ≤ a − ψ(x) + . This inequality
remains true at the limit because F is closed, yielding G(x) ≤ a + for all > 0, so
that x ∈ Λa (G).
We have ridom(F) ∩ ridom(ψ) , ∅ so that (by theorem 3.45 and proposition 3.42)
∂G(x) = ∇ψ(x) + ∂F(x). In particular, x∗ is a minimizer of G if and only if −∇ψ(x∗ ) ∈
∂F(x∗ ).
It one assumes that ψ is strongly convex, so that there exists m and L such that
m L
|y − x|2 ≤ ψ(y) − ψ(x) − ∇ψ(x)T (y − x) ≤ |y − x|2
2 2
for all x, y ∈ Rd , then a minimizer of G exists and is unique. To see this, fix x0 ∈
ridom(F) and consider the closed convex set
for all x ∈ Ω0 , which shows that Ω0 must be bounded and therefore compact. There
exists a minimizer x∗ of G on Ω0 , and therefore on all Rd . This minimizer is unique,
since the sum of a convex function and a strictly convex function is strictly convex.
In particular, for any closed convex F, we can apply the previous remarks to
1
G : v 7→ F(v) + |x − v|2
2
where x ∈ Rd is fixed. The function ψ : v 7→ |v −x|2 /2 is strongly convex (with L = m =
1) and G therefore has a unique minimizer v ∗ . This is summarized in the following
definition.
Definition 3.53 Let F be a closed convex function. The proximal operator associated to
F is the mapping proxF : Rd → dom(F) defined by
1
proxF (x) = argmin(v 7→ F(v) + |x − v|2 ). (3.45)
Rd 2
• Let F(x) = λ|x|, x ∈ Rd , for some λ > 0. Then F is differentiable everywhere except
at x = 0 and dom(F) = Rd . We have ∂F(x) = λx/|x| for x , 0. A vector g belongs to
∂F(0) if and only if
g T x ≤ λ|x|
for all x ∈ Rd , which is equivalent to |g| ≤ λ so that ∂F(0) = B̄(0, λ).
We have x0 = proxF (x) if and only if x0 , 0 and x = x0 + λx0 /|x0 | or x0 = 0 and |x| ≤ λ.
For |x| > λ, the equation x = x0 + λx0 /|x0 | is solved by
|x| − λ
x0 = x
|x|
yielding
|x| − λ
|x| x if |x| ≥ λ
proxF (x) =
(3.46)
0 otherwise
• Let Ω be a closed convex set. Then proxσΩ = projΩ , the projection operator on Ω,
as directly deduced from the definition.
86 CHAPTER 3. INTRODUCTION TO OPTIMIZATION
Proposition 3.55 Let F be a closed convex function. Then proxF is 1-Lipschitz: for all
x, y ∈ Rd ,
| proxF (x) − proxF (y)| ≤ |x − y|. (3.47)
Proof Let x0 = proxF (x) and y 0 = proxF (y). Then, there exists g ∈ ∂F(x0 ) and h ∈
∂F (y 0 ) such that x = x0 + g and y = y 0 + h. Moreover, we have
F(y 0 ) − F(x0 ) ≥ g T (y 0 − x0 )
F(x0 ) − F(y 0 ) ≥ hT (x0 − y 0 )
|y 0 − x0 |2 ≤ (y − x)T (y 0 − x0 ) ≤ |y − x| |y 0 − x0 |
x0 = x − α∇F(x0 )
so that x 7→ proxαF (x) can be interpreted as an implicit version of the standard gra-
dient step x 7→ x−α∇F(x). The iterations x(t+1) = proxαt F (x(t)) provide an algorithm
that converges to a minimizer of F (this will be justified below). This algorithm is
rarely practical, however, since the minimization required at each step is not nec-
essarily much easier to perform than minimizing F itself. The proximal operator,
however, is especially useful when combined with splitting methods.
Proximal gradient descent. Assume that the objective function F takes the form
We note that a stationary point of this algorithm, i.e. a point x such that x = proxαt H (x−
αt ∇G(x)) must be such that x − αt ∇G(x) ∈ x + αt ∂H(x), so that −∇G(x) ∈ ∂H(x). This
shows that the property of being stationary does not depend on αt > 0, and is equiv-
alent to the necessary optimality condition that was just discussed.
We first study this algorithm under the assumption that G is L-C 1 , which implies
that, for all x, y ∈ Rd .
L
G(y) ≤ G(x) + ∇G(x)T (y − x) + |x − y|2 .
2
At iteration t, we have
αt H(xt )−αt H(xt+1 ) ≥ (xt −xt+1 )T (xt −xt+1 −αt ∇G(xt )) = |xt −xt+1 |2 +αt ∇G(xt )T (xt+1 −xt )
so that proximal gradient descent iterations reduce the objective function as soon as
αt ≤ 2/L.
As a consequence, if one runs proximal gradient descent until |xt+1 − xt |/αt is small
enough, the algorithm will terminate in finite time as soon as αt is bounded from
below (and, in particular, if αt is constant).
αt F(x∗ ) − αt F(xt+1 ) ≥ (x∗ − xt+1 )T (xt − xt+1 ) − αt (x∗ − xt+1 )T ∇G(xt )) + αt G(x∗ ) − αt G(xt+1 )
≥ (x∗ − xt+1 )T (xt − xt+1 ) − αt (x∗ − xt )T ∇G(xt )) + αt G(x∗ )
+ αt (xt+1 − xt )T ∇G(xt ) − αt G(xt+1 )
αL
≥ (x∗ − xt+1 )T (xt − xt+1 ) − t |xt − xt+1 |2
2
Assuming that αt L ≤ 1, then
1 1
αt F(x∗ ) − αt F(xt+1 ) ≥ (x∗ − xt+1 )T (xt − xt+1 ) − |xt − xt+1 |2 = (|xt+1 − x∗ |2 − |xt − x∗ |2 ),
2 2
which we rewrite as
1
αt (F(xt+1 ) − F(x∗ )) ≤ (|xt − x∗ |2 − |xt+1 − x∗ |2 )
2
Note that, from (3.50), we also have
1
F(xt+1 ) ≤ F(xt ) − |x − xt+1 |2
2αt t
when αt L ≤ 1, which shows, in particular that F(xt ) is decreasing. Fixing a time T ,
we have, from these two observations
1
αt (F(xT ) − F(x∗ )) ≤ (|xt − x∗ |2 − |xt+1 − x∗ |2 )
2
for all t ≤ T − 1, and summing over T ,
T −1
X 1
(F(xT ) − F(x∗ )) αt ≤ (|x0 − x∗ |2 − |xT − x∗ |2 )
2
t=0
yielding
|x0 − x∗ |2
F(xT ) − F(x∗ ) ≤ P . (3.52)
2 Tt=0−1
αt
We summarize this in the following theorem, specializing to the case of constant
step αt .
3.6. DUALITY 89
Theorem 3.56 Let G be am L-C 1 function defined on Rd and H be closed convex. As-
sume that F = G+H has a minimizer x∗ . Then the algorithm (3.49) run with αt = α ≤ 1/L
for all t is such that, for all T > 0,
|x0 − x∗ |2
F(xT ) − F(x∗ ) ≤ . (3.53)
2αT
One gets a stronger result under the assumption that G is C 2 , and is such that the
eigenvalues of ∇2 G(x) are included in a fixed interval [m, L] for all x ∈ Rd with m > 0.
Such a G is strongly convex, which implies that F has a unique minimizer. We have
|xt+1 − x∗ | = proxαt H (xt − αt ∇G(xt )) − proxαt H (x∗ − αt ∇G(x∗ )
≤ |xt − x∗ − αt (∇G(xt )) − ∇G(x∗ ))| .
Write
Z 1
∗ ∗
|xt − x − αt (∇G(xt )) − ∇G(x )| = (IdRn − αt ∇2 G(x∗ + t(xt − x∗ )))(xt − x∗ )dt
0
Z 1
≤ (IdRn − αt ∇2 G(x∗ + t(xt − x∗ )))(xt − x∗ ) dt
0
≤ max(|1 − αt m|, |1 − αt L|)|xt − x∗ |
where we have use the fact that the eigenvalues of IdRn − αt ∇2 G(x) are included in
[1−αt L, 1−αt m] for all x ∈ Rd . If one assumes that αt ≤ 1/L, so that max(|1−αt m|, |1−
αt L|) ≤ 1 − αt m, one gets
|xt+1 − x∗ | ≤ (1 − αt m)|xt − x∗ | .
Iterating this inequality, we get the theorem that we state for constant αt .
Theorem 3.57 Let F = G + H where G is a C 2 convex function and H is a closed convex
function. Assume that the eigenvalues of ∇2 G are uniformly included in [m, L] with m > 0.
Let x∗ argmin F.
Note that these results also apply to projected gradient descent (section 3.4.4),
which is a special case (taking G = σΩ ).
90 CHAPTER 3. INTRODUCTION TO OPTIMIZATION
3.6 Duality
The next theorem generalizes this property to the non-smooth convex case, for
which the necessary optimality condition is also sufficient.
Theorem 3.58 Let F be a closed convex function, Ω ⊂ ridom(F) a nonempty closed con-
vex set. Then x∗ ∈ argminΩ F if and only if
0 ∈ ∂F(x∗ ) + NΩ (x∗ )
Proof Introduce the indicator function σΩ . Then minimizing F over Ω is the same
as minimizing G = F+σΩ over Rd . The assumptions imply that ridom(σΩ ) = relint(Ω) ⊂
ridom(F) and therefore
∂G(x) = ∂F(x) + ∂σΩ (x)
for all x ∈ Ω. Since
∂σΩ (x) = NΩ (x)
the result follows for the characterization of minimum of convex functions.
In the following, we will restrict to the situation in which F is finite (i.e., dom(F) =
Rd ) and Ω is defined through a finite number of equalities and inequalities, taking
the form n o
Ω = x ∈ Rd : γi (x) = 0, i ∈ E and γi (x) ≤ 0, i ∈ I
for functions (γi , i ∈ C = E ∪I ) such that γi : x 7→ biT x+βt is affine for all i ∈ E and γi is
closed convex for all i ∈ I . This is similar to the situation considered in section 3.4.1,
with additional convexity assumptions, but without assuming smoothness. We re-
call the definition of active constraints from section 3.4.1, namely, for x ∈ Ω,
Following the discussion in the smooth case, define the set Nγ0 (x) ⊂ Rd by
X
0
Nγ (x) = λ s : s ∈ ∂γ (x), i ∈ A(x), λ ≥ 0, i ∈ A(x) ∩ I .
i i i i i
i∈A(x)
3.6. DUALITY 91
The property 0 ∈ ∂F(x∗ ) + Nγ0 (x∗ ) is the expression of the KKT conditions in the non-
smooth case. It holds for x∗ ∈ argminΩ F as soon as NΩ (x∗ ) = Nγ0 (x∗ ), which is true
under appropriate constraint qualifications. We here replace the MF-CQ in defini-
tion 3.30 by the following conditions that do not involve gradients.
The first constraint is a very mild condition. When it is not satisfied, this means that
some bi ’s are linear combinations of others, and equality constraints for the latter
implies equality constraints for the former. These redundancies can therefore be
removed without changing the problem.
Note that (Sl2) can be replaced by the apparently weaker condition that, for all
i ∈ I , there exists xi ∈ Rd satisfying all the constraints and γi (xi ) < 0. Indeed, if
this is true, then the average, x̄, of (xi , i ∈ I ) also satisfies the equality constraints by
linearity, and if i ∈ I ,
1 X 1
γi (x̄) ≤ γi (x(j) ) ≤ γ (x(i) ) < 0.
|I | |I | i
j∈I
The following proposition makes a connection between the Slater conditions and
the MF-CQ in definition 3.30.
Proposition 3.60 Assume that γi , i ∈ I are convex C 1 functions. Then, if there exists a
feasible point x∗ that satisfies the MF-CQ, there exists another point x satisfying the Sl-
CQ. Conversely, if there exists x satisfying the Sl-CQ, then every feasible point x∗ satisfies
the MF-CQ.
Proof The linear independence conditions on equality constraints are the same in
MF-CQ and Sl-CQ, so that we only need to consider inequality constraints.
Let x∗ satisfy MF-CQ, and take h , 0 such that biT h = 0 for all i ∈ E, and ∇γi (x∗ )T h <
0, i ∈ A(x) ∩ I . Then x∗ + th satisfies the equality constraints for all t ∈ R. If i ∈ I is
not active, then γi (x∗ ) < 0 and this will remain true at x∗ + th for small t by continu-
ity. If i ∈ A(x) ∩ I , then a first order expansion gives γi (x∗ + th) = t∇γi (x∗ )T h + o(|h|),
which is also negative for small enough t > 0. So, x∗ + th satisfies the Sl-CQ for small
enough t > 0.
92 CHAPTER 3. INTRODUCTION TO OPTIMIZATION
Conversely, let x satisfy the Sl-CQ. Take a feasible point x∗ . If x∗ = x, then there
is no active inequality constraint and x∗ satisfies MF-CQ. Assume x∗ , x and let
h = x − x∗ . Then biT h = 0 for all i ∈ E, and if i ∈ I ∩ A(x∗ ),
The following theorem, that we give without proof, states that the Slater condi-
tions implies that the KKT conditions are satisfied for a minimizer.
Theorem 3.61 Assume that all the constraints are affine, or that they satisfy the Sl-CQ
in definition 3.59. Let x∗ ∈ argminΩ F. Then NΩ (x∗ ) = Nγ0 (x∗ ), so that there exist s0 ∈
∂F(x∗ ), si ∈ ∂γi (x∗ ), i ∈ A(x∗ ), (λi , i ∈ A(x∗ )) with λi ≥ 0 if i ∈ I ∩ A(x∗ ), such that
X
s0 + λi s i = 0 (3.55)
i∈A(x)
L∗ (λ) = inf{L(x, λ) : x ∈ Rd }
dˆ = sup{L∗ (λ) : λ ∈ D}
and
p̂ = inf{F(x) : x ∈ Ω},
whose computations respectively represent the dual and primal problems. Then, we
have dˆ ≤ p̂.
We did not need much of our assumptions (not even F to be convex) to reach
this conclusion. When the converse inequality is true (so that the duality gap p̂ − dˆ
vanishes), the dual problem provides important insights on the primal problem, as
well as alternative ways to solve it. This is true under the Slater conditions.
3.6. DUALITY 93
Theorem 3.62 The duality gap vanishes when the constraints are all affine, or when they
satisfy the Sl-CQ in definition 3.59. In this case, any solution λ∗ of the dual problem
provides Lagrange multipliers in theorem 3.61 and conversely.
We already know that, if (x∗ , λ∗ ) satisfy the KKT conditions, then x∗ ∈ argminΩ F
(because Nγ0 (x∗ ) ⊂ NΩ (x∗ )). Moreover, if (3.56) holds, then the inequality L(x∗ , λ) ≤
L(x∗ , λ∗ ) implies that L∗ (λ) ≤ L(x∗ , λ∗ ) for all λ ∈ D. The inequality L(x∗ , λ∗ ) ≤ L(x, λ∗ )
for all x implies that L(x∗ , λ∗ ) ≤ L∗ (λ∗ ). We therefore obtain the fact that λ∗ ∈ argmax L∗ (λ).
To summarize, we have
(i) ⇔ (ii) ⇒ (iii).
(x∗ , λ̃), with L(x∗ , λ̃) = L∗ (λ̃) and λ̃ ∈ argminD L∗ . This shows that L∗ (λ̃) = L∗ (λ∗ ). More-
over, from (3.56), we have
and, by definition of L∗ , L(x∗ , λ∗ ) ≥ L∗ (λ∗ ). This shows that L(x∗ , λ∗ ) = L(x∗ , λ̃). As a
consequence, for all (x, λ) ∈ Rd × D:
bk (x0 + xT ak ) + ξ (k) ≥ 1
where bk ∈ {−1, 1} and ak ∈ Rn respectively denote the kth output and input train-
ing sample. This algorithm minimizes a quadratic function of the input variables
(x, x0 , ξ) subject to linear constraints, and is an instance of a quadratic program-
ming problem (this is actually the support vector machine problem for classification,
which will be described in section 8.4.1).
Introduce Lagrange multipliers ηk for the constraint ξ (k) ≥ 0 and αk for bk (x0 +
xT ak ) + ξ (k) ≥ 1. The Lagrangian then takes the form
N N N
1 X X X
L(x, x0 , ξ, α, η) = |x|2 + γ ξ (k) − ηk ξ (k) − αk (bk (x0 + xT ak ) + ξ (k) − 1)
2
k=1 i=1 k=1
N N N N
1 X X X X
= |x|2 + (γ − ηk − αk )ξ (k)
− x0 αk bk − x T
αk bk ak + αk .
2
k=1 k=1 k=1 k=1
3.6. DUALITY 95
We compute the dual Lagrangian L∗ by P minimizing with respect to the primal vari-
ables. We note that L∗ (α, η) = −∞ when N k=1
alphak bk , 0, so that N
P
α
k=1 k kb = 0 is a constraint for the dual problem. The min-
(k)
imization in ξ also gives −∞ unless γ − ηk − αk = 0, which is therefore another
constraint. Finally, the optimal values of x is
N
X
x= αk bk ak
k=1
subject to ηk , αk ≥ 0, γ − ηk − αk = 0 and N
P
k=1 αk bk = 0. The conditions on ηk and αk
can be rewritten as 0 ≤ αk ≤ γ, ηk = γ − αk , and since the rest of the problem does
not depends on η, the dual problem can be reduced to maximizing
N N
∗ 1X T
X
L (α) = − αk αl ak al + αk
2
k,l=1 k=1
PN
subject to 0 ≤ αk ≤ γ and k=1 αk bk = 0.
The concave function L∗ can be maximized by minimizing −L∗ using proximal itera-
tions ((3.54)):
1
λ(t + 1) = prox−αt L∗ (λ(t)) = argmax(λ 7→ L∗ (λ) − |λ − λ(t)|2 ).
D 2αt
so that
λ(t + 1) = argmax infn ϕ(x, µ).
µ∈D x∈R
(Note that the left-hand side of this equation is never larger than the right-hand
side, but their equality requires additional hypotheses—which are satisfied in our
context—in order to hold.)
Importantly, the maximization in µ in the right-hand side has a closed form so-
lution. It requires to maximize
X 1
µi γi (x) − (µ − λi (t))2
2αt i
i∈C
µi = λi (t) + αt γi (x),
and
1 α 1 λ (t)2
µi γi (x) − (µi − λi (t))2 = λi (t)γi + t γi (x)2 = (λi (t) + αt γi (x))2 − i .
2αt 2 2αt 2αt
1 1 λ (t)2
µi γi (x) − (µi − λi (t))2 = max(0, λi (t) + αt γi (x))2 − i
2αt 2αt 2αt
1 X 1 X
G(x) = F(x) + (λi (t) + αt γi (x))2 + max(0, λi (t) + αt γi (x)))2
2αt 2αt
i∈E i∈I
1 X
− λi (t)2 .
2αt
i∈C
If we assume that the sub-level sets {x ∈ Ω : F(x) ≤ ρ} are bounded (or empty) for any
ρ ∈ R, then so are the sets {x ∈ Rn : G(x) ≤ ρ}, and this is a sufficient condition for the
existence of a saddle point for ϕ, which is a pair (x∗ , λ∗ ) such that, for all (x, λ) ∈ Rn ×D,
One can then check that this implies that x∗ ∈ argminRn G while λ∗ = λ(t + 1), so that
3.6. DUALITY 97
These iterations define the augmented Lagrangian algorithm. Starting this algorithm
with some λ(0) ∈ R|C| , and constant α, λ(t) will converge to a solution λ̂ of the dual
problem. The last two iterations stabilizing imply that γi (x(t)) converges to 0 for
i ∈ E, and also for i ∈ I such that λ̂i > 0, and that lim sup γi (x(t)) = 0 otherwise. This
shows that, if x(t) converges to a limit x̃, then G(x̃) = F(x̃). However, for any x ∈ Ω,
we have
G(x(t)) ≤ G(x) ≤ F(x)
Note that the augmented Lagrangian method can also be used in non-convex
optimization problems [147], requiring in that case that α is small enough.
1
xt , zt = argmin
{G(x) + H(z) + |λt + αt (Ax + Bz − c)|2 }
2α
n
x∈R ,z∈R m t
λ = λ + α (Ax + Bz − c)
t+1 t t t t
with λt ∈ Rd .
98 CHAPTER 3. INTRODUCTION TO OPTIMIZATION
One can now consider splitting the first step in two and iterate:
1
xt = argmin{G(x) + H(zt−1 ) + |λt + αt (Ax + Bzt−1 − c)|2 }
2α
x∈Rn t
1
(3.59)
zt = argmin{G(xt ) + H(z) + |λt + αt (Axt + Bz − c)|2 }
2αt
z∈Rm
λ = λ + α (Ax + Bz − c)
t+1 t t t t
(Obviously, H(zt−1 ) and G(xt ) are constant in the first and second minimization prob-
lems and can be removed from the formulation.) These iterations constitute the
“alternative direction method of multipliers,” or ADMM (the method is also some-
times called Douglas-Rachford splitting). It is not equivalent to the augmented La-
grangian algorithm (one would need to iterate a large number of times over the first
two steps before applying the third one for this), but still satisfies good convergence
properties. The reader can refer to Boyd et al. [39] for a relatively elementary proof
that shows that this algorithm converges, with constant α, as soon as, in addition to
the hypotheses that were already made, the Lagrangian
L(x, z, λ) = G(x) + H(z) + λT (Ax + Bz − c)
has a saddle point: there exists x∗ , z∗ , y ∗ such that
max L(x∗ , z∗ , λ) = L(x∗ , z∗ , λ∗ ) = min L(x, z, λ∗ ).
y x,z
Remark 3.63 If αt = α does not depend on time, (3.59) can be slightly simplified by
letting ut = λt /α, with the iterations
α
xt = argmin{G(x) + |ut + Ax + Bzt−1 − c|2 }
2
x∈Rn
α (3.60)
zt = argmin{H(z) + |ut + Axt + Bz − c|2 }
2
z∈R m
ut+1 = ut + Axt + Bzt − c,
We conclude this chapter by completing some of the proofs left aside when discus-
sion convex functions. These proofs use convex separation theorems, stated below
(without proof).
Theorem 3.64 (c.f., Rockafellar [168]) Let Ω1 and Ω2 be two nonempty convex sets
with relint(Ω1 ) ∩ relint(Ω2 ) = ∅. Then there exists b ∈ Rd and β ∈ R such that b , 0,
bT x ≤ β for all x ∈ Ω1 and bT x ≥ β for all x ∈ Ω2 , with a strict inequality for at least one
x ∈ Ω1 ∪ Ω2 .
Theorem 3.65 Let Ω1 and Ω2 be two nonempty convex sets with Ω1 ∩ Ω2 = ∅ and Ω1
compact. Then there exists b ∈ Rn , β ∈ R and < 0 such that bT x ≤ β − for all x ∈ Ω1
and bT x ≥ β + for all x ∈ Ω2 .
3.7. CONVEX SEPARATION THEOREMS AND ADDITIONAL PROOFS 99
We start with a few general remarks. If x ∈ Rd , the set {x} is convex and relint({x}) =
{x}. If Ω is any convex set such that x < relint(Ω), then theorem 3.64 implies that
there exist b ∈ Rd and β ∈ R such that bT y ≥ β ≥ bT x for all y ∈ Ω (with bT y > bT x for
at least one y). If x is in Ω \ (relint(Ω)) (so that x is a point on the relative boundary
of Ω), then, necessarily bT x = β and we can write
bT y ≥ bT x
for all y ∈ Ω with a strict inequality for some y ∈ Ω. One says that b and β provide a
supporting hyperplane for Ω at x.
then
relint(epi(F)) = {(y, a) ∈ ridom(F) × R : F(y) < a}
(this simple fact is proved in lemma 3.66 below). In particular, if x ∈ dom(F), then
(x, F(x)) must be in the relative boundary of epi(F). This implies that there exists
(b, b0 ) , (0, 0) ∈ Rd × R such that, for all (y, a) ∈ epi(F):
bT y + b0 a ≥ bT x + b0 F(x) .
We now state and prove the result announced above on the relative interior of
the epigraph of a convex function.
Lemma 3.66 Let F be a convex function with epigraph
Then
relint(epi(G)) = {(y, a) : y ∈ ridom(F), F(y) < a}.
100 CHAPTER 3. INTRODUCTION TO OPTIMIZATION
Proof Let Γ = {(y, a) : y ∈ ridom(F), F(y) < a}. Assume that (y, a) ∈ relint(epi(F)).
Then (y, b) ∈ epi(F) for all b > a and there exists > 0 such that (y, a) − ((y, b) −
(y, a)) ∈ epi(F) which requires that F(y) ≤ a − (b − 1) < a. Now, take x ∈ dom(F).
Then, (x, F(x)) ∈ epi(dom(F)) and (y, a) − ((x, F(x)) − (y, a)) ∈ epi(F) for small enough
, showing that F(y − (x − y)) ≤ (1 + )a − F(x) and y − (x − y) ∈ dom(F). This proves
that y ∈ ridom(F) and the fact that relint(epi(F)) ⊂ Γ .
Take (y, a) ∈ Γ , and (x, b) ∈ epi(F). We need to show that (y − (x − y), a − (b − a)) ∈
epi(F) for small enough , i.e., that
for small enough . But this is an immediate consequence of the facts that F is
continuous at y ∈ ridom(G) and F(y) < a.
Assume that there exists x̄ ∈ ridom(F1 ) ∩ ridom(F2 ). Take x ∈ dom(F1 ) ∩ dom(F2 ) and
g ∈ ∂(F1 + F2 )(x). We want to show that g = g1 + g2 with g1 ∈ ∂F1 (x) and g2 ∈ ∂F2 (x).
By definition, we have
for all y. We want to decompose g as g = g1 + g2 with g1 ∈ ∂F1 (x) and g2 ∈ ∂F2 (x).
Equivalently, we want to find g2 ∈ Rd such that, for all y ∈ Rd ,
F1 (y) ≥ F1 (x) + (g − g2 )T (y − x)
F2 (y) ≥ F2 (x) + g2T (y − x)
F1 (y) ≥ −g2T (y − x)
F2 (y) ≥ g2T (y − x)
for all y ∈ Rd and some g2 ∈ Rd , under the assumption that F1 (y) + F2 (y) ≥ 0 for all
y. Introduce the two convex sets in Rd × R
The set Ω2 is the image of epi(F2 ) by the transformation (y, a) 7→ (y, −a). We have
F1 (y) ≤ a ⇒ bT y + b0 a − β ≤ 0
F2 (y) ≤ −a ⇒ bT y + b0 a − β ≥ 0.
bT ỹ − β = −(bT y − β) < 0,
F1 (y) ≤ a ⇒ bT y − β ≤ a
F2 (y) ≤ −a ⇒ bT y − β ≥ a,
which is equivalent to
−F2 (y) ≤ bT y − β ≤ F1 (y)
Taking y = x gives β = bT x and we get the desired inequality with g2 = −b.
Let x̄ ∈ Rm such that Ax̄ ∈ ridom(F). We need to prove that ∂G(x) ⊂ AT ∂F(Ax + b)
when G(x) = F(Ax + b). We assume in the following that b = 0, since the theorem
with G(x) = F(x + b) is obvious. If g ∈ ∂G(x), we have
F(Ay) ≥ F(Ax) + g T (y − x)
102 CHAPTER 3. INTRODUCTION TO OPTIMIZATION
for all y ∈ Rm . We want to show that there exists h ∈ Rd such that g = AT h and, for
all z ∈ Rd ,
F(z) ≥ F(Ax) + hT (z − Ax) = F(Ax) + hT z − g T x.
Ω2 = {(Ay, a) : y ∈ Rm , a = g T (y − x) + G(x)} ⊂ Rd × R.
F(z) ≤ a ⇒ bT z + b0 a ≤ β
z = Ay, a = g T (y − x) + G(x) ⇒ bT z + b0 a ≥ β
Assume, to get a contradiction, that b0 = 0 (so that b , 0). Then bT Ay ≥ β for all y,
which is only possible if b is perpendicular to the range of A and β ≤ 0. On the other
hand, F(Ax̄) < ∞ implies that 0 = bT Ax̄ + b0 F(Ax̄) ≤ β, so that β = 0. Furthermore,
we know that one of the inequalities above has to be strict for at least one element
of Ω1 ∪ Ω2 , but this cannot be true on Ω2 , so there exists z ∈ dom(F) such that
bT z < 0. Since bT Ax̄ = 0 and Ax̄ ∈ ridom(F), we have Ax̄ − (z − Ax̄) ∈ dom(F), so that
bT (−z) ≤ 0, yielding a contradiction.
So, we need b0 , 0, and the first pair of inequalities clearly requires b0 < 0, so that
we can take b0 = −1. This shows that
bT z − β ≤ F(z)
F(z) − F(Ax) ≥ bT (z − x)
for all z and bT A(y − x) ≥ g T (y − x) for all y. This last inequality implies that g = AT b
and the first one that b ∈ ∂F(Ax), therefore concluding the proof.
Chapter 4
In this chapter, we illustrate the bias variance dilemma in the context of density es-
timation, in which problems are similar to those encountered in classical parametric
or non-parametric statistics [160, 60, 155].
For density estimation, one assumes that a random variable X is given with un-
known p.d.f. f and we want to build an estimator, i.e., a mapping (x, T ) 7→ fˆ(x; T )
that provides an estimation of f (x) based on a training set T = (x1 , . . . , xN ) containing
N i.i.d. realizations of X (i.e., T is a realization of T = (X1 , . . . , XN ), N independent
copies of X). Alternatively, we will say that the mapping T 7→ fˆ( · ; T ) is an estimator
of the full density f . Note that, to further illustrate our notation, fˆ(x; T ) is a number
while fˆ(x; T ) is a random variable.
Parameter estimation is the most common density estimation method, in which one
restrict fˆ to belong to a finite-dimensional parametric class, denoted (fθ , θ ∈ Θ), with
Θ ⊂ Rp . For example, fθ can be a family of Gaussian distributions on Rd . With our
notation, a parametric model provides estimators taking the form
There are several, well-known methods for parameter estimation, and, since this
is not the focus of the book, we only consider the most common one, maximum
103
104 CHAPTER 4. INTRODUCTION: BIAS AND VARIANCE
If the true f belongs to the parametric class, so that f = fθ∗ for some θ∗ ∈ Θ, stan-
dard results in mathematical statistics [29, 119] provide sufficient conditions for θ̂
to converge to θ∗ when N tends to infinity. However, the fact that the true p.d.f. be-
longs to the finite dimensional class (fθ ) is an optimistic assumption that is generally
false. In this regard, the standard theorems in parametric statistics may be regarded
as analyzing a “best case scenario,” or as performing a “sanity check,” in which one
asks whether, in the ideal situation in which f actually belongs to the parametric
class, the designed estimator has a proper behavior. In non-parametric statistics, a
parametric model can still be a plausible approach in order to approximate the true
f , but the relevant question should then be whether fˆ provides (asymptotically), the
best approximation to f among all fθ , θ ∈ Θ. The maximum likelihood estimator can
be analyzed from this viewpoint, if one measures the difference between two density
functions by the Kullback-Leibler divergence (also called differential entropy):
Z
f (x)
KL(f kfθ ) = log f (x)dx (4.2)
Rd fθ (x)
which is positive unless f = fθ (and may be equal to +∞).
X µ(x)
KL(µkν) = log µ(x),
ν(x)
x∈Ω
e
that we will use later in these notes (if there exists x such that µ(x) > 0 and ν(x) =
0, then KL(µkν) = ∞). The most important property for us is that the Kullback-
Leibler divergence can be used as a measure of discrepancy between two probability
distribution, based on the following proposition.
We have t log t + 1 − t ≥ 0 with equality if and only t = 1 (the proof being left to the
reader) so that KL(µkν) = 0 if and only if g = 1 with ν-probability one, i.e., if and
only if µ = ν.
the function θ 7→ log fθ (x) is continuous1 in θ for almost all x and that, for all θ ∈ Θ,
there exists a small enough δ > 0 such that
Z
sup log fθ 0 (x) f (x) dx < ∞
Rd |θ 0 −θ|<δ
then, letting Θ∗ denote the set of maximizers of Ef (log fθ ), and assuming that it is
not empty, the maximum likelihood estimator θ̂N is such that, for all > 0 and all
compact subsets K ⊂ Θ,
lim P d(θ̂N , Θ∗ ) > and θ̂N ∈ K → 0
N →∞
where d(θ̂N , Θ∗ ) is the Euclidean distance between θ̂N and the set Θ∗ . The interested
reader can refer to Van der Vaart [196], Theorem 5.14, for a proof of this statement.
Note that this assertion does not exclude the situation in which θ̂N goes to infinity
(i.e., steps out of ever compact subset K in Θ), and the boundedness of the m.l.e. is
either asserted from additional properties of the likelihood, or by simply restricting
Θ to be a compact set.
If Θ∗ = {θ∗ } and the m.l.e. almost surely converges to θ∗ , the speed of conver-
gence can also be quantified by a central limit √theorem (see Van der Vaart [196],
Theorem 5.23) ensuring that, in standard cases N (θ̂N − θ∗ ) converges to a normal
distribution.
Even though these results relate our present subject to classical parametric statis-
tics, they are not sufficient for our purpose, because, when f , fθ∗ , the convergence
of the m.l.e. to the best approximator in Θ still leaves a gap in the estimation of f .
This gap is often called the bias of the class (fθ , θ ∈ Θ). One can reduce it by con-
sidering larger classes (e.g., with more dimensions), but the larger the class, the less
accurate the estimation of the best approximator becomes for a fixed sample size
(the estimator has a larger variance). This issue is known as the “bias vs. variance
dilemma,” and to address it, it is necessary to adjust the class Θ to the sample size
in order to optimally balance the two types of error (and all non-parametric estima-
tion methods have at least one mechanism that allows for this). When the “tuning
parameter” is the dimension of Θ, the overall approach is often referred to as the
method of sieves [83, 80], in which the dimension of Θ is increased as a function of N
in a suitable way.
Gaussian mixture models provide one of the most popular choices with the me-
thod of sieves. Modeling in this setting typically follows some variation of the fol-
mN 2 2
e−|x−µj | /2σ
X
ΘN = f : f (x) = αj 2 )d/2
,
j=1
(2πσ
µ1 , . . . , µmN ∈ Rd , α1 + · · · + αmN = 1, α1 , . . . , αmN ∈ [0, +∞), σ > 0 . (4.4)
There are therefore (d + 1)mN free parameters in ΘN . The integer mN allows one
to adjust the dimension of ΘN and therefore controls the bias-variance trade-off. If
mN tends to infinity “slowly enough,” the m.l.e. will converges (almost surely) to
the true p.d.f. f [80]. However, determining optimal sequences N → mN remains a
challenging and largely unsolved problem.
In practice the computation of the m.l.e. in this context uses an algorithm called
EM, for expectation-maximization. This algorithm will be described later in chap-
ter 17.
Kernel density estimators [151, 178, 179] provide alternatives to the method of
sieves. They also lend themselves to some analytical developments that provide
elementary illustrations of the bias-variance dilemma.
Note that the third equation is satisfied, in particular, when K is an even function,
i.e., K(−x) = K(x).
1 x
Kσ (x) = d K .
σ σ
Using the change of variable y = x/σ (so that dy = dx/σ d ) one sees that Kσ satisfies
(4.5) as soon as K does.
Based on a training set T = (x1 , . . . , xN ), the kernel density estimator defines the
family of densities
N
ˆ 1X
fσ (x; T ) = Kσ (x − xk )
N
k=1
108 CHAPTER 4. INTRODUCTION: BIAS AND VARIANCE
One has Z
Kσ (x − xk ) dx = 1
Rd
so that it is clear that fˆσ is a p.d.f. In addition,
Z Z
xKσ (x − xk ) dx = (y + xk )Kσ (y) dy = xk
Rd Rd
so that Z
xfˆσ (x; T ) dx = x̄
Rd
where x̄ = (x1 + · · · + xN )/N .
2
A typical choice for K is a Gaussian kernel, K(y) = e−|y| /2 /(2π)d/2 . In this case, the
estimated density is a sum of bumps centered at the data points x1 , . . . , xN . The width
of the bumps is controlled by the parameter σ . A small σ implies less rigidity in the
model, which will therefore be more affected by changes in the data: the estimated
density will have a larger variance. The converse is true for large σ , at the cost of
being less able to adapt to variations in the true density: the model has a larger bias
(see fig. 4.1 and fig. 4.2).
The bias of the estimator, i.e., the average difference between fˆσ (x; T ) and f (x) is
therefore given by
Z
ˆ
E(fσ (x; T )) − f (x) = K(z)(f (x − σ z) − f (x))dz.
Rd
Interestingly, this bias does not depend on N , but only on σ , and it is clear that,
under mild continuity assumptions on f , it will go to zero with σ .
σ = 0.1 σ = 0.25
σ = 0.5 σ = 1.0
Figure 4.1: Kernel density estimators using a Gaussian kernel and various values of σ when
the true distribution of the data is a standard Gaussian (Orange: true density; Blue: esti-
mated density, Red dots: training data).
110 CHAPTER 4. INTRODUCTION: BIAS AND VARIANCE
σ = 0.1 σ = 0.25
σ = 0.5 σ = 1.0
Figure 4.2: Kernel density estimators using a Gaussian kernel and various values of σ when
the true distribution of the data is a Gamma distribution with parameter 2 (Orange: true
density; Blue: estimated density, Red dots: training data).
4.2. KERNEL DENSITY ESTIMATION 111
with
Z d
1 1
var(K((x − X)/σ )) = K((x − y)/σ )2 f (y)dy
N σ 2d N σ 2d R
Z !2
1
− K((x − y)/σ )f (y)dy
N σ 2d Rd
Z Z !2
1 2 1
= K(z) f (x − σ z)dz − K(z)f (x − σ z)dz
N σ d Rd N Rd
The total mean-square error of the estimator is
E((fˆσ (x) − f (x))2 ) = var(fˆσ (x)) + (E(fˆσ (x)) − f (x))2 .
Clearly, this error cannot go to zero unless we allow σ = σN to depend on N . For the
bias term to go to zero, we know that we need σN → 0, in which case we can expect
the second term in the variance to decrease like 1/N , while, for the first term to go to
zero, we need N σNd to go to infinity. This illustrates the bias-variance dilemma: σN
must go to zero in order to cancel the bias, but not too fast in order to also cancel the
variance. There is, for each N , an optimal value of σ that minimizes the error, and
we now proceed to a more detailed analysis and make this statement a little more
precise.
σ2 T 2
f (x − σ z) = f (x) − σ zT ∇f (x) + z ∇ f (x)z + O(σ 3 |z|3 ),
2
R
where ∇2 f (x) denotes the matrix of second derivatives of f at x. Since zK(z)dz = 0,
this gives
σ2
E(fˆσ (x; T )) − f (x) = Mf (x) + o(σ 2 )
2
R R
with Mf = K(z) z ∇ f (x)z dz. Similarly, letting S = K 2 (z) dz,
T 2
1
var(fˆσ (x)) = Sf (x) + o(σ d
+ σ 2
) .
Nσd
Assuming that f (x) > 0, we can obtain an asymptotically optimal value for σ by
minimizing the leading terms of the mean square error, namely
σ4 2 S
Mf + f (x)
4 Nσd
which yields σN = O(N −1/(d+4) ) and
This result says that, in order to obtain a given accuracy in the worst case sce-
nario, N should be chosen of order (1/)1+(d/2r) which grows exponentially fast with
the dimension. This is the curse of dimensionality which essentially states that the
issue of density estimation may be intractable in large dimensions. The same state-
ment is true also for most other types of machine learning problems. Since machine
learning essentially deals with high-dimensional data, this issue can be problematic.
Obviously, because the min-max theory is a worst-case analysis, not all situations
will be intractable for a given estimator, and some cases that are challenging for one
of them may be quite simple for others: even though all estimators are “cursed,” the
way each of them is cursed differs. Moreover, while many estimators are optimal
in the min-max sense, this theory does not give any information on “how often” an
estimator performs better than its worst case, or how it will perform on a given class
of problems. (For kernel density estimation, however, what we found was almost
universal with respect to the unknown density f , which indicates that this estimator
is not a good choice in large dimensions.)
Another important point with this curse of dimensionality is that data may very
often appear to be high dimensional while it has a simple, low-dimensional struc-
ture, maybe because many dimensions are irrelevant to the problem (they contain,
for example, just random noise), or because the data is supported by a non-linear
low-dimensional space, such as a curve or a surface. This information is, of course,
not available to the analysis, but can sometimes be inferred using some of the dimen-
sion reduction methods that will be discussed later in chapter 21. Sometimes, and
this is also important, information on the data structure can be provided by domain
knowledge, that is, by elements, provided by experts, that specify how the data has
been generated (such as underlying equations) and reasonable hypotheses that are
made in the field. This source of information should never be ignored in practice.
Chapter 5
In most cases, the input space is Euclidean, i.e., RX = Rd . Note also that, in clas-
sification, instead of a function f : R → RY , one sometimes estimates a function
f : RX → Π(RY ), where Π(RY ) is the space of probability distributions on RY . We
will return to this in remark 5.3.
113
114 CHAPTER 5. PREDICTION: BASIC CONCEPTS
The goal in prediction is to minimize the expected risk, also called the generaliza-
tion error:
R(f ) = E(r(Y , f (X))).
We will prove that an optimal f can be easily described based on the joint dis-
tribution of X and Y (which is, unfortunately, never available). We will need for
this to use conditional expectations and conditional probabilities, as defined in sec-
tions 1.4.2 and 1.4.8.
There can be multiple Bayes predictors if the minimum in the proposition is not
uniquely attained. Note that, if f ∗ is a Bayes predictor and fˆ any other predictor, we
have, by definition
E r(Y , f ∗ (X)) | X ≤ E r(Y , fˆ(X)) | X .
Passing to expectations, this implies R(f ∗ ) ≤ R(fˆ). We therefore have the following
result:
Theorem 5.2 Any Bayes predictor f ∗ is optimal, in the sense that it minimizes the gen-
eralization error R.
In such a case, the loss function, r, is defined on RY × Π(RY ), and the expected
risk is still defined by E(r(Y , f (X))).
Bayes predictors are never available in practice, because the true distribution of
(X, Y ), or that of Y given X, are unknown. These distributions can only be inferred
from observations, i.e., from a training set: T = (x1 , y1 , . . . , xN , yN ).
This predictor is often called the naive Bayes predictor for regression.
The kernel estimator of the joint p.d.f., ϕ, of (X, Y ) at scale σ is, in this case:
N
1X 1 x − xk y − yk
ϕ̂(x, y) = K1 K2 .
N σ d+1 σ σ
k=1
Based on ϕ̂, the conditional expectation of Y given X is
x−x y−y
1 PN
R
1
N k=1 σ d+1 R
yK1 σ
k
K2 σ k dy
ˆ
f (x) = x−x y−y .
1 PN
R
1 k k
N k=1 σ d+1 R 1K σ K2 σ dy
R y−y
Using the fact that σ −1 R
yK2 σ
k
dy = yk , we can simplify this expression to
obtain PN x−x
k
k=1 y k K1 σ
fˆ(x) = P x−x .
N k
k=1 K1 σ
This the kernel-density regression estimator [140, 205].
118 CHAPTER 5. PREDICTION: BASIC CONCEPTS
Let RY = {0, 1} and assume RX = N = {0, 1, 2, . . .}. Let p = P(Y = 1) and assume that
conditionally to Y = g, X follows a Poisson distribution with mean µg . Assume that
µ0 < µ1 .
(1 − p)µx0 e−µ0 if g = 0
(
P(Y = g | X = x) ∝
pµx1 e−µ1 if g = 1
that is:
µ1 1−p
x log ≥ log + µ1 − µ0 .
µ0 p
Since we are assuming that µ1 > µ0 , we find that f (x) = 1 if 2
& '
log((1 − p)/p) + µ1 − µ0
x≥
log(µ1 /µ0 )
and 0 otherwise.
Model-based approaches for prediction are based on the estimation of the joint dis-
tribution of the input and output variables, which is arguably a harder problem
than prediction [198]. Since the goal is to find f minimizing the expected risk
R(f ) = E(r(Y , f (X)), one may prefer a direct approach and consider the minimization
of an empirical estimate of this risk, based on training data T = (x1 , y1 , . . . , xN , yN ),
namely
N
1X
R̂(f ) = r(yk , f (xk )).
N
k=1
As a last example for now (we will see many others in the rest of this book),
taking d = 1, the set ( Z )
00 2
F = f : f (x) dx < µ
R
(with µ > 0) provides an infinite dimensional space of predictors, which leads to
spline regression.
Fix a function space F , and let fˆ∗ be the optimal predictor in F , in the sense that
it minimizes E(|Y − f (X)|2 ) over f ∈ F . Then, letting fˆN ∈ F denote an estimated
predictor,
Let us make the assumption that there exists > 0 such that fλ = fˆ∗ + λ(fˆN − fˆ∗ )
belongs to F for λ ∈ [−, ]. This happens when F is a linear space, or more generally
120 CHAPTER 5. PREDICTION: BASIC CONCEPTS
when F is convex and fˆ∗ is in its relative interior (see chapter 3). Let ψ : λ 7→
E(|Y − fλ (X)|2 ), which is minimal at λ = 0. We have
and
0 = ψ 0 (0) = 2E((Y − fˆ∗ (X))(fˆ∗ (X) − fˆN (X))).
We therefore get the identity
R(fˆN ) = E(|Y − fˆ∗ (X)|2 ) + E(|fˆN (X) − fˆ∗ (X)|2 ) = “Bias” + “Variance”.
R(fˆN ) ≤ E(|Y − f ∗ (X)|2 ) + E(|f ∗ − fˆ∗ (X)|2 ) + E(|fˆN (X) − fˆ∗ (X)|2 ).
The first term is the Bayes error. It is fixed by the joint distribution of X and Y
and measures how well Y can be approximated by a function of X. The second term
compares f ∗ to its best approximation in F , and is therefore reduced by taking larger
model spaces. The last term is the error caused by using the data to estimate fˆ∗ . It
increases with the size of F . This is illustrated in Figure 5.1.
Remark 5.4 If the assumption made on fˆ∗ is not valid, one can write
R(fˆN ) = E(|Y − fˆN (X)|2 ) ≤ 2 E(|Y − fˆ∗ (X)|2 ) + E(|fˆN (X) − fˆ∗ (X)|2 )
and still obtain a control (as an inequality) of the generalization error by a bias-plus-
variance sum.
fˆ
Pˆ F
fˆ∗
f∗
P∗
If we also take the expectation with respect to T (for fixed N ), we obtain the
averaged generalization risk as
which provides an evaluation of the average quality of the algorithm when evaluated
on random training sets of size N . If A : T 7→ fˆT denotes the learning algorithm, we
will denote RN (A) = E(R(fˆT )).
Since their computation requires the knowledge of the joint distribution of X and
Y , these errors are not available in practice. Given a training set T and a predictor
f , one can compute the empirical error
N
1X
R̂T (f ) = r(yk , f (xk )) .
N
k=1
Under the usual moment conditions, the law of large numbers implies that R̂ T (f ) →
R(f ) with probability one for any given predictor f . However, the law of large num-
bers cannot be applied to assess whether the in-sample error,
N
ˆ 1X
r(yk , fˆT (xk )),
∆
ET = R̂T (fT ) =
N
k=1
is a good approximation of the generalization error R(fˆT ). This is because each term
in the sum depends on the full data set, so that ET is not a sum of independent terms.
The in-sample error typically under-estimates the generalization error, sometimes
with a large discrepancy.
When one has enough data, however, it is possible to set some of it aside to form
a test set. Formally, a test set is a collection T 0 = (x10 , y10 , . . . , xN 0 0
0 , yN 0 ) considered as a
realization of an i.i.d. sample of (X, Y ), T 0 = (X10 , Y10 , . . . , XN
0 0
0 , YN 0 ), independent of T .
Cross-validation error
The n-fold cross-validation method (see, e.g., Stone [185]) separates the training set
into n non-overlapping sets of equal sizes, and estimates n predictors by leaving out
one of these subsets as a temporary test set. A generalization error is estimated from
each test set and averaged over the n results.
To define an n-fold cross-validation estimator of the error, one assumes that the training set T is partitioned into n subsets of equal sizes (up to one element if N is not a multiple of n).
The limit case when n = N is called leave-one-out (LOO) cross validation. In this
case ECV is an almost unbiased estimator of E(R(fˆT )), but, because it is an average of
functions of the training set that are quite similar (and that will therefore be posi-
tively correlated), its variance (as a function of T ) may be quite large. Conversely,
smaller values of n will have smaller variances, but larger biases. In practice, it is dif-
ficult to assess which choice of n is optimal, although 5- or 10-fold cross-validation
is quite popular. LOO cross-validation is also often used, especially when N is small.
Fixing a training set T , one can compute, for each λ, the cross-validation error
eT (λ) = R̄CV,T (Aλ ). Model selection is then performed by finding
$$\lambda^*(T) = \mathop{\mathrm{argmin}}_{\lambda}\, e_T(\lambda).$$
Once this λ∗ is obtained, the final estimator is fˆT ,λ∗ (T ) , obtained by rerunning the
algorithm one more time on the full training set.
It is tempting to estimate the generalization error of the resulting procedure by e_T(λ∗(T)). This is incorrect, because the computation of λ∗ uses the full training set.
To compute the cross-validation error of A∗, one needs to encapsulate this model selection procedure in another cross-validation loop. So, one needs to compute, using the previous notation,
$$E^*_{CV}(T) = \frac{1}{n}\sum_{i=1}^n \hat{R}_{T_i}\big(\hat f_{T^{(i)},\,\lambda^*(T^{(i)})}\big)$$
where each fˆT (i) ,λ∗ (T (i) ) is computed by running a cross-validated model selection pro-
cedure restricted to T (i) . This is often called a double-loop cross-validation proce-
dure (the numbers of folds in the inner and outer loops do not have to coincide). Note that each λ∗(T⁽ⁱ⁾) does not necessarily coincide with the optimal λ∗(T) obtained with the full training set.
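As a rough numerical illustration of this double-loop procedure, here is a minimal numpy sketch (not from the text); the learning algorithm `fit`, the loss `loss`, and the grid `lambdas` are placeholder names for whatever method is being tuned.

```python
# Minimal sketch of double-loop (nested) cross-validation; illustrative only.
import numpy as np

def cv_error(x, y, fit, loss, lam, n_folds=5):
    """Average validation error of the algorithm with parameter lam."""
    folds = np.array_split(np.random.permutation(len(y)), n_folds)
    errs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(y)), fold)
        f = fit(x[train], y[train], lam)          # train on the complement T^(i)
        errs.append(np.mean([loss(y[k], f(x[k])) for k in fold]))
    return np.mean(errs)

def double_loop_cv(x, y, fit, loss, lambdas, n_outer=5, n_inner=5):
    """Estimate the error of the combined 'model selection + training' procedure."""
    outer = np.array_split(np.random.permutation(len(y)), n_outer)
    errs = []
    for fold in outer:
        train = np.setdiff1d(np.arange(len(y)), fold)
        # inner loop: select lambda*(T^(i)) using the outer-training data only
        lam_star = min(lambdas,
                       key=lambda l: cv_error(x[train], y[train], fit, loss, l, n_inner))
        f = fit(x[train], y[train], lam_star)
        errs.append(np.mean([loss(y[k], f(x[k])) for k in fold]))
    return np.mean(errs)
```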
Chapter 6

Inner Products and Reproducing Kernels
6.1 Introduction
We will discuss later in this book various methods that specify the prediction as a linear function of the input. These methods are often applied after taking transformations of the original variables, in the form x ↦ h(x) (i.e., the prediction algorithm is applied to h(x) instead of x). We will refer to h as a “feature function,” which typically maps the initial data x ∈ R to a vector space, sometimes of infinite dimension,
that we will denote H (the “feature space”).
We recall that a real vector space¹ is a set, H, on which an addition and a scalar multiplication are defined, namely (h, h′) ∈ H × H ↦ h + h′ ∈ H and (λ, h) ∈ R × H ↦ λh ∈ H, and we assume that the reader is familiar with the theory of finite-dimensional spaces.

¹ All vector spaces in these notes will be real, and will therefore only be referred to as vector spaces.
When a normed space is complete with respect to the topology induced by its
norm, it is called a Banach space, or a Hilbert space when the norm is associated with
an inner product. Completeness means that Cauchy sequences in this space always
have a limit, i.e., if the sequence (ξ_n) is such that, for any ε > 0, there exists n₀ > 0 such that ‖ξ_n − ξ_m‖_H < ε for all n, m ≥ n₀, then there exists ξ such that ‖ξ_n − ξ‖_H → 0. Completeness is a very natural property. It allows, for example, for the definition of integrals such as ∫ h(t) dt as limits of Riemann sums for suitable functions h : R → H, leading (with more general notions of integrals) to proper definitions of expectations of H-valued random variables. Using a standard (abstract) construction, one can prove that any normed space (resp. inner-product space) can be extended to a Banach (resp. Hilbert) space within which it is dense.
Now, consider an input set, say R, and a mapping h from R to H, where H is an inner
product space. For us, R is the set over which the original input data is observed,
typically Rd , and H is the feature space. One can define the function Kh : R × R → R
by
$$K_h(x, y) = \langle h(x)\,,\, h(y)\rangle_H.$$
The function K_h satisfies the following two properties: it is symmetric (K_h(x, y) = K_h(y, x) for all x, y ∈ R), and, for all n ≥ 1, x₁, ..., xₙ ∈ R and λ₁, ..., λₙ ∈ R,
$$\sum_{i,j=1}^n \lambda_i\lambda_j K_h(x_i, x_j) \ge 0. \tag{6.1}$$
The first property is obvious, and the second one results from the fact that one can write
$$\sum_{i,j=1}^n \lambda_i\lambda_j K_h(x_i, x_j) = \sum_{i,j=1}^n \lambda_i\lambda_j \langle h(x_i)\,,\, h(x_j)\rangle_H = \Big\|\sum_{i=1}^n \lambda_i h(x_i)\Big\|_H^2 \ge 0. \tag{6.2}$$
One says that the kernel is positive definite if the sum in (6.1) cannot vanish unless (i) λ₁ = ··· = λₙ = 0 or (ii) xᵢ = xⱼ for some i ≠ j.
Given this notation (writing K(x₁, ..., xₙ) for the n × n matrix with entries K(xᵢ, xⱼ)), it is clear that K is a positive kernel if and only if, for all x₁, ..., xₙ ∈ R, the matrix K(x₁, ..., xₙ) is symmetric positive semidefinite. It is a positive definite kernel if K(x₁, ..., xₙ) is positive definite as soon as all the xⱼ’s are distinct. This latter condition is obviously needed since, if xᵢ = xⱼ, the ith and jth columns of K(x₁, ..., xₙ) coincide and this matrix cannot be full-rank.
Remark 6.3 It is important to point out that K being a positive kernel does not require
that K(x, y) ≥ 0 for all x, y ∈ R (see examples in the next section). However, it does
imply that K(x, x) ≥ 0 for all x ∈ R, since diagonal elements of positive semi-definite
matrices are non-negative.
The function K_h defined above is therefore always a positive kernel, but not always positive definite, as seen below. We will also see later that the converse statement is true: any positive kernel K : R × R → R can be expressed as K_h for some feature function h between R and some feature space H.
Indeed, letting V_h denote the linear span of {h(x), x ∈ R} in H, we have
$$\sum_{i,j=1}^n \lambda_i\lambda_j K_h(x_i, x_j) = \Big\|\sum_{i=1}^n \lambda_i h(x_i)\Big\|_H^2,$$
so that K_h is positive definite if and only if every finite family (h(x₁), ..., h(xₙ)) with distinct xᵢ’s is linearly independent. This implies in particular that positive-definite kernels over infinite input spaces R can only be associated with infinite-dimensional spaces H, since V_h ⊂ H.
For example, the linear kernel K(x, y) = xᵀy on R = R^d is a positive kernel, but it is not positive definite, because the rank of K(x₁, ..., xₙ) is equal to the dimension of span(x₁, ..., xₙ), which can be less than n even when the xᵢ’s are distinct.
Take now h(x) = (x^{(i₁)} ⋯ x^{(i_k)}, 1 ≤ i₁, ..., i_k ≤ d), which contains all products of degree k formed from the variables x^{(1)}, ..., x^{(d)}, i.e., all monomials of degree k in x. This function takes its values in the space H = R^{N_k}, where N_k = d^k. Using, in H, the inner product ⟨ξ, η⟩_H = ξᵀη, we have
$$K_h(x, y) = \sum_{1\le i_1,\dots,i_k\le d} (x^{(i_1)}y^{(i_1)})\cdots(x^{(i_k)}y^{(i_k)}) = (x^T y)^k.$$
One can also take all monomials of order less than or equal to k, and make variations on this construction. For example, choosing any family c₀, c₁, ..., c_k of positive numbers, one can rescale the monomials of each degree l by c_l, yielding
$$K_h(x, y) = c_0^2 + c_1^2\,(x^T y) + \cdots + c_k^2\,(x^T y)^k.$$
Taking $c_l = \binom{k}{l}^{1/2}\alpha^l$ for some α > 0, we get another form of polynomial kernel, namely,
$$K_h(x, y) = (1 + \alpha^2 x^T y)^k.$$
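As a quick numerical check of the positivity of these kernels, the following sketch (illustrative names, numpy assumed) builds the Gram matrix of the kernel (1 + α²xᵀy)^k on random points and verifies that its smallest eigenvalue is non-negative, up to round-off.

```python
# Gram matrix of the polynomial kernel (1 + alpha^2 x'y)^k and a PSD check.
import numpy as np

def poly_gram(X, alpha=1.0, k=3):
    """Gram matrix of the kernel (1 + alpha^2 x^T y)^k on the rows of X."""
    return (1.0 + alpha**2 * X @ X.T) ** k

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
K = poly_gram(X)
# eigenvalues of a positive kernel's Gram matrix are >= 0 (up to round-off)
print(np.linalg.eigvalsh(K).min() >= -1e-8)
```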
Let us consider a few examples of kernels that can be obtained in this way.
(1) Take d = 1 and let s be the indicator function of the interval [−1/2, 1/2]. Then, one finds
$$\Gamma(t) = \max(1 - |t|,\, 0).$$
In this case, the space V_h is the space of all functions expressed as finite sums
$$z \mapsto f(z) = \sum_{j=1}^n \lambda_j\, 1_{[x_j-\rho/2,\, x_j+\rho/2]}(z),$$
and assume, without loss of generality, that x₁ < x₂ < ··· < xₙ, and let x_{n+1} = ∞. Let i₀ be the smallest index j such that λⱼ ≠ 0, assuming that such an index exists. Then f(z) = λ_{i₀} ≠ 0 for all z ∈ [x_{i₀} − ρ/2, x_{i₀+1} − ρ/2), which is a non-empty interval. So, if f vanishes almost everywhere, we must have λⱼ = 0 for all j = 1, ..., n.
³ The convolution between two absolutely integrable functions f and g is defined by f ∗ g(u) = ∫_{R^d} f(z) g(u − z) dz.
Translation invariance
Radial kernels
A radial kernel takes the form K(x, y) = γ(|x − y|²), for some continuous function γ defined on [0, +∞). Schoenberg’s theorem [174] states that, if this function γ is universally valid, i.e., K is a kernel for all dimensions d, then it must take the form
$$\gamma(t) = \int_0^\infty e^{-\lambda t}\, d\mu(\lambda)$$
for some non-negative finite measure µ on [0, +∞). For example, when µ is a Dirac measure, i.e., µ = δ_{(2a)^{-1}} for some a > 0, then K(x, y) = exp(−|x − y|²/(2a)), which is the Gaussian kernel. Taking dµ = e^{−aλ} dλ yields γ(t) = 1/(t + a), and dµ = λe^{−aλ} dλ yields γ(t) = 1/(a + t)².
(For a fixed dimension d, the corresponding characterization involves instead the function Ω_d(t) = Γ(d/2)(2/t)^{(d−2)/2} J_{(d−2)/2}(t), where J_{(d−2)/2} is the Bessel function of the first kind.)
Proof Point (i) is obvious. Point (ii) is almost as simple, because, for any λ1 , . . . , λn ∈
R and x′₁, ..., x′ₙ ∈ R′,
$$\sum_{i,j=1}^n \lambda_i\lambda_j K_1'(x_i', x_j') = \sum_{i,j=1}^n \lambda_i\lambda_j K_1(f(x_i'), f(x_j')) \ge 0.$$
If K₁ is positive definite, then the latter sum can only vanish if all the λᵢ’s are zero or some of the points in (f(x′₁), ..., f(x′ₙ)) coincide. If, in addition, f is one-to-one, then this is equivalent to: all the λᵢ’s are zero or some of the points in (x′₁, ..., x′ₙ) coincide, so that K₁′ is positive definite.
If B is positive definite, then the sum above can be zero only if, for each k, either λ_k = 0 or α^{(i)}u_i^{(k)} = 0 for all i. If A is also positive definite, then the only possibility is α^{(i)}u_i^{(k)} = 0 for all i and k, which implies α^{(i)} = 0 for all i since u_i ≠ 0.
To prove point (iv)⁴, we first note that a translation-invariant kernel K′(x, y) = Γ′(x − y) is always bounded. Indeed, the matrix K′(x, 0) is positive semi-definite, with determinant Γ′(0)² − Γ′(x)² ≥ 0, showing that |Γ′(x)| ≤ Γ′(0). This shows that the integral defining K(x, y) converges as soon as one of the two functions Γ₁ or Γ₂ is integrable. Moreover, we have K(x, y) = Γ(x − y) with
$$\Gamma(x) = \int_{\mathbb{R}^d} \Gamma_1(x - z)\,\Gamma_2(z)\, dz,$$
since $\int_{\mathbb{R}^d}\Gamma_1(x - u)\,\Gamma_2(u - y)\, du = \Gamma(x - y)$ by the change of variable z = u − y. Using the fact that both Γ₁ and Γ₂ are even, and making the change of variable z ↦ −z, one easily shows that Γ(x) = Γ(−x), which implies that K is symmetric.
4 This part of the proof uses some measure theory.
We proceed with the assumption that Γ₂ is integrable and use Bochner’s theorem to write
$$\Gamma_1(y) = \int_{\mathbb{R}^d} e^{-2i\pi\xi^T y}\, d\mu_1(\xi)$$
for some positive finite measure µ₁. Then
$$\Gamma(x) = \int_{\mathbb{R}^d}\left(\int_{\mathbb{R}^d} e^{-2i\pi\xi^T(x-z)}\, d\mu_1(\xi)\right)\Gamma_2(z)\, dz = \int_{\mathbb{R}^d} e^{-2i\pi\xi^T x}\left(\int_{\mathbb{R}^d} e^{2i\pi\xi^T z}\,\Gamma_2(z)\, dz\right) d\mu_1(\xi).$$
The shift in the order of the variables ξ and z uses Fubini’s theorem. The function
$$\psi(\xi) = \int_{\mathbb{R}^d} e^{2i\pi\xi^T z}\,\Gamma_2(z)\, dz$$
is the (inverse) Fourier transform of Γ₂, and is non-negative by Bochner’s theorem, so that Γ is the Fourier transform of the positive finite measure ψ dµ₁ and K is a positive kernel.
Point (iv) can be related to the following discrete statement on symmetric matri-
ces: assume that A and B are positive semi-definite and that they commute, so that
AB = BA; then AB is positive semi-definite. In the case of kernels, one may consider the symmetric linear operators K_i : f ↦ ∫_{R^d} K_i(·, y) f(y) dy, which map the space of square-integrable functions into itself. Then K₁ and K₂ commute and K = K₁K₂.
so that the inner product is uniquely specified on H_K. To make sure that this inner product is well defined, we must check that there is no ambiguity, in the sense that, if ξ has an alternative decomposition ξ = Σ_{i=1}^{n′} λ′ᵢ K(·, x′ᵢ), then the value of ⟨ξ, η⟩_{H_K} remains unchanged. But this is clear, because one can also write
$$\langle\xi\,,\,\eta\rangle_{H_K} = \sum_{j=1}^m \mu_j\,\xi(y_j),$$
which only depends on ξ and not on its decomposition. The linearity of the product
with respect to ξ is also clear from this expression, and the bilinearity by symmetry.
By the Cauchy-Schwarz inequality, ‖ξ‖_{H_K} = 0 implies that ⟨ξ, η⟩_{H_K} = 0 for all η ∈ H_K. Since ⟨ξ, K(·, y)⟩_{H_K} = ξ(y) for all y, this also implies that ξ = 0, completing the proof that H_K is an inner-product space.
Equation (6.3) is the “reproducing property” of the kernel for the inner-product
on HK . In functional analysis, the completion, ĤK , of HK for the topology associated
to its norm is then a Hilbert space, and is referred to as a “reproducing kernel Hilbert
space,” or RKHS.
Returning to the example of functional features in section 6.3.3, we have two different representations of the kernel in feature space, namely in H = L²(R^d), or in H_K with a different inner product. This is not a contradiction; it simply shows that the representation of a positive kernel in terms of a feature function is not unique.
Remark 6.5 RKHS’s are defined as function spaces. While feature space representa-
tions, provided by functions h : R → H from R to a Hilbert space H are apparently
more general, a simple transformation allows for an identification of (a subspace of)
H with an RKHS. We will assume that the subset h(R) (containing all h(x), x ∈ R) is
a dense subset of H. If not, one can simply replace H by the closure of h(R) which is
a Hilbert subspace of H.
Because ‖ξ‖²_H = ‖π_V(ξ)‖²_H + ‖ξ − π_V(ξ)‖²_H, one always has ‖π_V(ξ)‖_H ≤ ‖ξ‖_H, with equality if and only if π_V(ξ) = ξ, i.e., if and only if ξ ∈ V.
V = span{K(·, x_k), k = 1, ..., N}.
Then there exists an element h₀ ∈ V that satisfies the constraints. Indeed, looking for h₀ in the form
$$h_0(x) = \sum_{l=1}^N K(x, x_l)\lambda_l,$$
one has
$$h_0(x_k) = \sum_{l=1}^N K(x_k, x_l)\lambda_l,$$
so that
$$\begin{pmatrix}\lambda_1\\ \vdots\\ \lambda_N\end{pmatrix} = K(x_1, \dots, x_N)^{-1}\begin{pmatrix}\alpha_1\\ \vdots\\ \alpha_N\end{pmatrix}.$$
Any other function h satisfying the constraints satisfies h(x_k) − h₀(x_k) = 0, which, using [RKHS2], is equivalent to ⟨h − h₀, K(·, x_k)⟩_H = 0, i.e., to h − h₀ ∈ V⊥. This shows that h₀ = π_V(h), so that ‖h‖_H ≥ ‖h₀‖_H and h₀ provides the optimal interpolation. We summarize this in the proposition:
$$h(x_k) = \sum_{l=1}^N K(x_k, x_l)\lambda_l \tag{6.4a}$$
with
$$\begin{pmatrix}\lambda_1\\ \vdots\\ \lambda_N\end{pmatrix} = K(x_1, \dots, x_N)^{-1}\begin{pmatrix}\alpha_1\\ \vdots\\ \alpha_N\end{pmatrix}. \tag{6.4b}$$
Letting h₀ = π_V(h), so that h₀(x_k) = h(x_k) for all k, this expression can be rewritten as
$$\|h_0\|_H^2 + \|h - h_0\|_H^2 + \sigma^2\sum_{k=1}^N |h_0(x_k) - \alpha_k|^2.$$
This shows that the optimal h must coincide with its projection on V, and therefore belong to that subspace. Looking for h in the form
$$h(\cdot) = \sum_{l=1}^N K(\cdot, x_l)\lambda_l,$$
the function to minimize becomes
$$\sum_{k,l=1}^N K(x_k, x_l)\lambda_k\lambda_l + \sigma^2\sum_{k=1}^N\Big|\sum_{l=1}^N K(x_k, x_l)\lambda_l - \alpha_k\Big|^2,$$
which, in vector notation, writing λ = (λ₁, ..., λ_N)ᵀ and α = (α₁, ..., α_N)ᵀ, is a quadratic function of λ. Its minimizer over H is given by
$$h(x_k) = \sum_{l=1}^N K(x_k, x_l)\lambda_l \tag{6.5a}$$
with
$$\begin{pmatrix}\lambda_1\\ \vdots\\ \lambda_N\end{pmatrix} = \big(K(x_1, \dots, x_N) + (1/\sigma^2)\,\mathrm{Id}_{\mathbb{R}^N}\big)^{-1}\begin{pmatrix}\alpha_1\\ \vdots\\ \alpha_N\end{pmatrix}. \tag{6.5b}$$
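The following numpy sketch illustrates formulas (6.4b) and (6.5b) side by side; the Gaussian kernel and all names are illustrative choices, not prescribed by the text.

```python
# Kernel interpolation (6.4b) and its smoothed version (6.5b), illustrative.
import numpy as np

def gauss_K(X, Y, a=1.0):
    """Gaussian kernel matrix K(x, y) = exp(-|x - y|^2 / (2a))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * a))

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(30, 1))
alpha = np.sin(3 * X[:, 0])                  # target values alpha_k at the x_k
K = gauss_K(X, X)

lam_interp = np.linalg.solve(K, alpha)                        # (6.4b)
lam_smooth = np.linalg.solve(K + np.eye(30) / 0.5**2, alpha)  # (6.5b), sigma = 0.5

Xnew = np.linspace(-1, 1, 7)[:, None]
h_interp = gauss_K(Xnew, X) @ lam_interp     # h(x) = sum_l K(x, x_l) lambda_l
h_smooth = gauss_K(Xnew, X) @ lam_smooth
```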
Chapter 7

Linear Regression
In regression, linear models refer to situations in which one tries to predict the dependent variable Y ∈ R_Y = R^q by a function f̂(X) of the input variable X ∈ R_X, where f̂ is optimized over a linear space F. The most common situation is the “standard linear model,” for which R_X = R^d and F consists of the affine functions x ↦ a₀ + bᵀx. More generally, one can let F contain functions of the form x ↦ a₀ + ⟨b, h(x)⟩_H for a feature function h taking values in an inner-product space H. Note that h can be nonlinear, and F can be infinite dimensional. Such sets correspond to linear models using feature functions, and will be addressed using kernel methods in this chapter.
Note also that, even if the model is linear, the associated training algorithms can be nonlinear, and we will in fact review several situations in which solving the estimation problem requires nonlinear optimization methods.
$$\mathrm{RSS}(\beta) \overset{\Delta}{=} N\hat{R}(f) = \sum_{k=1}^N |y_k - f(x_k)|^2 = \sum_{k=1}^N |y_k - \beta^T\tilde x_k|^2.$$
In other terms, the model-based approach is identical, under these (standard) as-
sumptions, to empirical risk minimization (section 5.4), on which we now focus.
(Recall that, even when using a model-based approach, one does not make assump-
tions on the true distribution of X and Y ; one rather treats the model as an approxi-
mation of these distributions, estimated by maximum likelihood, and uses the Bayes
predictor for the estimated model.)
The computation of the optimal regression parameters is made easier by the introduction of the following matrices: the N × (d + 1) matrix X with rows x̃₁ᵀ, ..., x̃_Nᵀ and the N × q matrix Y with rows y₁ᵀ, ..., y_Nᵀ.
Theorem 7.1 Assume that the matrix X has rank d + 1. Then the RSS is minimized for
β̂ = (X T X )−1 X T Y
Proof We provide two possible proofs of this elementary result. The first one is an optimization argument, noting that F(β) ≜ RSS(β) is a convex quadratic function defined on M_{d+1,q}(R) with values in R. Expanding F(β + h) for a matrix h ∈ M_{d+1,q}(R) shows that its gradient vanishes exactly when X ᵀX β = X ᵀY, whose unique solution is β̂ when X ᵀX is invertible, i.e., when X has rank d + 1.
Remark 7.2 If X does not have rank d + 1, then optimal solutions exist, but they
are not unique. By convexity, the solutions are exactly the vectors β at which the
gradient vanishes, i.e., those that satisfy X T X β = X T Y . The set of solutions can be
obtained by introducing the SVD of X in the form X = U DV T and letting γ = V T β
and Z = U T Y . Then
X T X β = X T Y ⇔ D T Dγ = D T Z.
Letting d (1) , . . . , d (m) denote the nonzero diagonal entries of D (so that m ≤ d + 1), we
find γ (i) = z(i) /d (i) for i ≤ m (the other equalities being 0 = 0). So, the d + 1 − m last
entries of γ can be chosen arbitrarily (and β = V γ).
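The computation in theorem 7.1 and remark 7.2 can be checked numerically; in the sketch below (illustrative, numpy assumed), the normal-equation solution is compared with the SVD-based solution returned by lstsq which, in the rank-deficient case, selects the minimum-norm element of the solution set described in the remark.

```python
# Least squares via the normal equations and via the SVD; illustrative check.
import numpy as np

rng = np.random.default_rng(2)
N, d, q = 50, 4, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # rows x-tilde
beta_true = rng.normal(size=(d + 1, q))
Y = X @ beta_true + 0.1 * rng.normal(size=(N, q))

beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)   # (X'X)^{-1} X'Y (full rank)
beta_svd, *_ = np.linalg.lstsq(X, Y, rcond=None)  # SVD-based, same result here
print(np.allclose(beta_normal, beta_svd))
```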
$$\bar y = \frac{1}{N}\sum_{k=1}^N y_k \quad\text{and}\quad \bar x = \frac{1}{N}\sum_{k=1}^N x_k.$$
The reader may want to double-check that this solution coincides with the one pro-
vided in theorem 7.1.
The matrix
$$\hat\Sigma_{XX} = \frac{1}{N}\mathcal{X}_c^T\mathcal{X}_c = \frac{1}{N}\sum_{k=1}^N (x_k - \bar x)(x_k - \bar x)^T$$
is a sample estimate of the covariance matrix of X, that we will denote ΣXX . Simi-
larly, Σ̂XY = XcT Yc /N is a sample estimate of ΣXY , the covariance between X and Y .
With this notation, we have
$$\hat b = \hat\Sigma_{XX}^{-1}\hat\Sigma_{XY},$$
which converges a.s. to b∗ = Σ_{XX}^{-1}Σ_{XY}.
Let a∗0 = mY −mTX b∗ . Then f ∗ (x) = a∗0 +(b∗ )T x is the least-square optimal approxima-
tion of Y by a linear function of X, and the linear predictor fˆ(x) = aˆ0 + b̂T x converges
a.s. to f ∗ (x). Of course, f ∗ generally differs from f : x 7→ E(Y | X = x), which is the
least-square optimal approximation of Y by any (square-integrable) function of X,
so that the linear estimator will have a residual bias.
If one makes the (unlikely) assumption that the linear model is exact, i.e., f(x) = f∗(x), one has E(β̂ | x₁, ..., x_N) = β∗, and the estimator is “unbiased.”
properties of linear estimators can be proved, among which the well-known Gauss-
Markov theorem on the optimality of least-square estimation that we now state and
prove. For this theorem, for which we take (for simplicity) q = 1, we also assume that var(Y | X = x), the variance of Y for its conditional distribution given X, does not depend on x, and denote it by σ². This typically corresponds to the standard regression model in which one assumes that Y = f(X) + ε, where ε is independent of X with variance σ².
To see this, fix u and consider the problem of minimizing F_u(A) = uᵀAAᵀu subject to the linear constraint AX = Id_{R^{d+1}}. The Lagrange multipliers for this affine constraint can be organized in a matrix C, and the Lagrangian is
$$L(A, C) = u^T A A^T u + \mathrm{trace}\big(C^T(A\mathcal{X} - \mathrm{Id}_{\mathbb{R}^{d+1}})\big).$$
Setting its derivative with respect to A, in the direction of a matrix H, to zero gives
$$2u^T A H^T u + \mathrm{trace}(C^T H\mathcal{X}) = 0$$
for all H, which yields trace(H T (2uu T A + CX T )) = 0 for all H. This is only possible
when 2uu T A + CX T = 0, which in turn implies that 2uu T AX = −CX T X . Using the
constraint, we get
C = −2uu T (X T X )−1
so that uu T A = uu T (X T X )−1 X T . This implies that A = (X T X )−1 X T (the least-square
estimator) is a minimizer of Fu (A) for all u.
We now assume that X takes its values in an arbitrary set RX , with a representation
h : RX → H into an inner-product space. This representation does not need to be
explicit or computable, but the associated kernel K(x, y) = ⟨h(x), h(y)⟩_H is assumed to be known and easy to compute. (Recall that, from chapter 6, a positive kernel is
always associated with an inner-product space.) In particular, any algorithm in this
context should only rely on the kernel, and the function h only has a conceptual role.
Assume that q = 1 to lighten the notation, so that the dependent variable is scalar-valued. We here let the space of predictors contain the functions f(x) = a₀ + ⟨b, h(x)⟩_H, with a₀ ∈ R and b ∈ H.
The following result (or results similar to it) is a key step in almost all kernel methods in machine learning: the residual sum of squares is unchanged if b is replaced by its orthogonal projection on V = span(h(x₁), ..., h(x_N)), so that the minimization can be restricted to predictors with b = Σ_k α_k h(x_k), a reduction which only depends on the kernel. (This reduction is often referred to as the “kernel trick.”)
However, the solution of the problem is, in this context, not very interesting.
Indeed, assume that K is positive definite and that all observations in the training set
are distinct. Then the matrix K(x1 , . . . , xN ) formed by the kernel evaluations K(xi , xj )
is invertible, and one can solve exactly the equations
$$y_k = \sum_{j=1}^N \alpha_j K(x_k, x_j), \quad k = 1, \dots, N,$$
to get a zero RSS with a0 = 0. Unless there is no noise, such a solution will certainly
overfit the data. If K is not positive definite, and the dimension of V is less than
N (since this would place us in the previous situation otherwise), then it is more
efficient to work directly in a basis of V rather than using the over-parametrized ker-
nel representation. We will see however, starting with the next section, that kernel
methods become highly relevant as soon as the regression is estimated with some
control on the size of the regression coefficients, b.
Method. When the set F of possible predictors is too large, some additional complexity control is needed to reduce the estimation variance. One simple approach is to penalize predictors of large “size.”
In linear spaces, complexity measures are often associated with a norm, and ridge
regression uses the sum of squares of coefficients of the prediction matrix b, mini-
mizing
$$\sum_{k=1}^N |y_k - a_0 - b^T x_k|^2 + \lambda\,\mathrm{trace}(b^T b). \tag{7.4}$$
More generally, penalizing with a quadratic form βᵀΔβ (for a symmetric positive semi-definite matrix Δ), the solution is
$$\hat\beta_\lambda = (\mathcal{X}^T\mathcal{X} + \lambda\Delta)^{-1}\mathcal{X}^T\mathcal{Y},$$
with a proof similar to that made for least-square regression. We obviously retrieve the original formula for regression when λ = 0.
Alternatively, assuming that
$$\Delta = \begin{pmatrix} 0 & 0\\ 0 & \Delta_0\end{pmatrix},$$
so that no penalty is imposed on the intercept, we have
$$\hat b_\lambda = (\mathcal{X}_c^T\mathcal{X}_c + \lambda\Delta_0)^{-1}\mathcal{X}_c^T\mathcal{Y}_c \tag{7.5}$$
and â₀,λ = ȳ − (b̂_λ)ᵀx̄. The proof of these statements is left to the reader.
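A minimal sketch of (7.5), assuming Δ₀ = Id (an illustrative choice) so that the penalty is the squared Euclidean norm of b; centering leaves the intercept unpenalized.

```python
# Ridge regression via (7.5) with Delta_0 = Id; illustrative sketch.
import numpy as np

def ridge_fit(X, Y, lam):
    """Return (a0, b) minimizing sum_k |y_k - a0 - b^T x_k|^2 + lam * |b|^2."""
    xbar, ybar = X.mean(0), Y.mean(0)
    Xc, Yc = X - xbar, Y - ybar
    b = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ Yc)  # (7.5)
    a0 = ybar - b.T @ xbar
    return a0, b
```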
Analysis in a special case To illustrate the impact of the penalty term on balancing
bias and variance, we now make a computation in the special case when Y = X̃β + ,
where var() = σ 2 and is independent of X. In the following computation, we as-
sume that the training set is fixed (or rather, compute probabilities and expectations
conditionally to it). Also, to simplify notation, we denote
$$S_\lambda = \mathcal{X}^T\mathcal{X} + \lambda\Delta = \sum_{k=1}^N \tilde x_k\tilde x_k^T + \lambda\Delta$$
and Σ = E(X̃X̃ᵀ) for a single realization of X. Finally, we assume that q = 1, also to simplify the discussion.
Denote by ε_k the (true) residual ε_k = y_k − x̃_kᵀβ on training data and by ε the vector stacking these residuals. We have, writing S₀ = S_λ − λΔ,
$$\hat\beta_\lambda = S_\lambda^{-1}\mathcal{X}^T\mathcal{Y} = S_\lambda^{-1}S_0\beta + S_\lambda^{-1}\mathcal{X}^T\epsilon = \beta - \lambda S_\lambda^{-1}\Delta\beta + S_\lambda^{-1}\mathcal{X}^T\epsilon,$$
so that the prediction error R(λ) can be expanded accordingly in powers of λ.
Let us analyze the quantities that depend on the training set in this expression. The
first one is Sλ = S0 + λ∆. From the law of large numbers, S0 /N → Σ when N tends
to infinity, so that, assuming in addition that λ = λN = O(N ), we have Sλ−1 = O(1/N ).
The second one is
$$\mathcal{X}^T\epsilon = \sum_{k=1}^N \epsilon_k\tilde x_k,$$
which, according to the central limit theorem, is of order √N (that is, X ᵀε/√N converges to a centered Gaussian vector) when N → ∞. So, we can expect the coefficient of λ² in R(λ) to have order N⁻², the coefficient of λ to have order N^{−3/2}, and the constant coefficient to have order N⁻¹. This suggests taking λ = µ√N so that all coefficients have roughly the same order when expanding in powers of µ.
This gives S_λ = N(S₀/N + µΔ/√N) ≃ NΣ and we make the approximation, letting ξ = N^{−1/2}σ^{−1/2}X ᵀε and γ = Σ^{−1/2}Δβ, that the prediction error is minimized at
$$\mu = \frac{\xi^T\gamma}{|\gamma|^2}.$$
Of course, this µ cannot be computed from data, but we can see that, since ξ con-
verges to a centered Gaussian random variable, its value cannot be too large. It is
therefore natural to choose µ to be constant and use ridge regression in the form
$$\sum_{k=1}^N (y_k - \tilde x_k^T\beta)^2 + \sqrt{N}\,\mu\,\beta^T\Delta\beta.$$
In all cases, the mere fact that we find that the optimal µ is not 0 shows that, under
the simplifying (and optimistic) assumptions that we made for this computation,
allowing for a penalty term always reduces the prediction error. In other terms,
introducing some estimation bias in order to reduce the variance is beneficial.
Kernel Ridge Regression We now return to the feature-space situation and take h :
RX → H with associated kernel K. We still take q = 1 for simplicity. One formulates
the ridge regression problem in this context as the minimization of
$$\sum_{k=1}^N (y_k - a_0 - \langle b\,,\, h(x_k)\rangle_H)^2 + \lambda\|b\|_H^2$$
with respect to β = (a0 , b). Introducing the space V generated by the feature function
evaluated on the training set, we know from proposition 7.4 that replacing b by
πV (b) leaves the residual sum of squares invariant. Moreover, one has kπV (b)k2H ≤
kbk2H with equality if and only if b ∈ V . This shows that the solution b must belong
to V and therefore take the form (7.3).
Using this expression, one finds that the problem is reduced to finding the minimum of
$$\sum_{k=1}^N\Big(y_k - a_0 - \sum_{l=1}^N K(x_l, x_k)\alpha_l\Big)^2 + \lambda\sum_{k,l=1}^N \alpha_k\alpha_l K(x_k, x_l).$$
Let α ∈ R^N be the vector with coefficients α₁, ..., α_N and α̃ = (a₀, αᵀ)ᵀ. With this notation, the function to minimize is
This takes the same form as standard ridge regression, replacing β by α̃, X by K̃ and Δ by K₀. The solution therefore is
$$\tilde\alpha^\lambda = (\tilde{\mathcal{K}}^T\tilde{\mathcal{K}} + \lambda\mathcal{K}_0)^{-1}\tilde{\mathcal{K}}^T\mathcal{Y}.$$
To write the equivalent of (7.5), we need to use the equivalent of the matrix X_c, that is, the matrix K with the average of the jth column subtracted from each (i, j) entry, given by
$$\mathcal{K}_c = \mathcal{K} - \frac{1}{N}1_N 1_N^T\mathcal{K}.$$
Introduce the matrix P = Id − 1_N1_Nᵀ/N. It is easily checked that P² = P (P is a projection matrix). Since K_c = PK, we have K_cᵀK_c = KPK. One deduces from this the expression of the optimal vector α^λ, namely,
$$\alpha^\lambda = (\mathcal{K}P\mathcal{K} + \lambda\mathcal{K})^{-1}\mathcal{K}\mathcal{Y}_c,$$
where we have, in addition, used the fact that P Yc = Yc . Finally, the intercept is
given by
$$a_0 = \bar y - \frac{1}{N}(\alpha^\lambda)^T\mathcal{K}1_N.$$
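A sketch of these formulas in numpy, assuming (as in the footnote below) that K(x₁, ..., x_N) has rank N; all names are illustrative.

```python
# Kernel ridge regression: alpha solves (K P K + lam K) alpha = K Yc.
import numpy as np

def kernel_ridge_fit(K, y, lam):
    """K: N x N Gram matrix (assumed full rank), y: targets, lam > 0."""
    N = len(y)
    P = np.eye(N) - np.ones((N, N)) / N      # centering projection, P @ P = P
    yc = y - y.mean()
    alpha = np.linalg.solve(K @ P @ K + lam * K, K @ yc)
    a0 = y.mean() - alpha @ K @ np.ones(N) / N
    return a0, alpha    # predictor: f(x) = a0 + sum_k alpha_k K(x, x_k)
```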
Case of ridge regression. Returning to the basic case (without feature space), we now introduce an alternate formulation of ridge regression. Let ridge(λ) denote the ridge regression problem that we have considered so far, for some parameter λ.
¹ Indeed, let u = (w₀, wᵀ)ᵀ with w₀ ∈ R and w ∈ R^N be such that uᵀ(K̃ᵀK̃ + λK₀)u = 0. This requires K̃u = 0 and uᵀK₀u = 0. The latter quantity is wᵀKw, which shows that w = 0 since K has rank N. Then K̃u = w₀1_N, so that w₀ = 0 also.
154 CHAPTER 7. LINEAR REGRESSION
Consider now the following problem, which will be called ridge₀(C): minimize Σ_{k=1}^N |y_k − x̃_kᵀβ|² subject to the constraint βᵀΔβ ≤ C. We claim that this problem is equivalent to the ridge regression problem, in the following sense: for any C, there exists a λ such that the solution of ridge₀(C) coincides with the solution of ridge(λ), and vice versa.
Indeed, fix a C > 0. Consider an optimal β for ridge0 (C). Assuming as above that ∆
is symmetric positive semi-definite, we let V be its null space and PV the orthogonal
projection on V . Write β = β1 + β2 with β1 = PV β. Let d1 and d2 be the respective
dimensions of V and V ⊥ so that d1 + d2 = d. Identifying Rd with the product space
V × V ⊥ (i.e., making a linear change of coordinates), the problem can be rewritten as
the minimization of
|Y − X1 β1 − X2 β2 |2
subject to β2T ∆β2 ≤ C, where X1 (resp. X2 ) is N × d1 (resp. N × d2 ).
The gradient of the constraint γ(β₂) = β₂ᵀΔβ₂ − C is ∇γ(β₂) = 2Δβ₂. Assume first that Δβ₂ ≠ 0. Then the solution must satisfy the KKT conditions, which require that there exists µ ≥ 0 such that β is a stationary point of the Lagrangian
$$|\mathcal{Y} - \mathcal{X}_1\beta_1 - \mathcal{X}_2\beta_2|^2 + \mu\,\beta_2^T\Delta\beta_2,$$
that is,
$$\mathcal{X}_1^T\mathcal{X}_1\beta_1 + \mathcal{X}_1^T\mathcal{X}_2\beta_2 = \mathcal{X}_1^T\mathcal{Y}, \qquad \mathcal{X}_2^T\mathcal{X}_1\beta_1 + \mathcal{X}_2^T\mathcal{X}_2\beta_2 + \mu\Delta\beta_2 = \mathcal{X}_2^T\mathcal{Y}.$$
Returning to the original coordinates, this means
$$\beta = (\mathcal{X}^T\mathcal{X} + \mu\Delta)^{-1}\mathcal{X}^T\mathcal{Y},$$
so that β is also the solution of ridge(µ).
General case. We now consider this equivalence in a more general setting. Con-
sider a penalized optimization problem, denoted var(λ) which consists in minimiz-
ing in β some objective function of the form U (β) + λϕ(β), λ ≥ 0. Consider also
the family of problems var0 (C), with C > inf(ϕ), which minimize U (β) subject to
ϕ(β) ≤ C.
Assumptions (ii) and (iv) are true, in particular, when U is strictly convex, ϕ is
convex and U has compact level sets. We show that, with these assumptions, the
two families of problems are equivalent.
We first discuss the penalized problems and prove the following proposition,
which has its own interest.
Proposition 7.5 The function λ ↦ U(β_λ) is nondecreasing, and λ ↦ ϕ(β_λ) is nonincreasing, with
$$\lim_{\lambda\to\infty}\varphi(\beta_\lambda) = \inf(\varphi).$$
Moreover, β_λ varies continuously as a function of λ.
Proof Consider two parameters λ and λ′. By definition of β_λ and β_{λ′},
$$U(\beta_\lambda) + \lambda\varphi(\beta_\lambda) \le U(\beta_{\lambda'}) + \lambda\varphi(\beta_{\lambda'}) \quad\text{and}\quad U(\beta_{\lambda'}) + \lambda'\varphi(\beta_{\lambda'}) \le U(\beta_\lambda) + \lambda'\varphi(\beta_\lambda). \tag{7.6}$$
Summing these inequalities gives (λ′ − λ)(ϕ(β_λ) − ϕ(β_{λ′})) ≥ 0. Assume that λ < λ′. Then this last inequality implies ϕ(β_λ) ≥ ϕ(β_{λ′}), and (7.6) then implies that U(β_λ) ≤ U(β_{λ′}), which proves the first part of the proposition.
Now assume that there exists ε > 0 such that ϕ(β_λ) > inf ϕ + ε for all λ ≥ 0. Take β̃ such that ϕ(β̃) ≤ inf ϕ + ε/2. For any λ > 0, we have
$$U(\beta_\lambda) + \lambda\varphi(\beta_\lambda) \le U(\tilde\beta) + \lambda\varphi(\tilde\beta) \le U(\tilde\beta) + \lambda(\varphi(\beta_\lambda) - \epsilon/2),$$
so that U(β_λ) ≤ U(β̃) − λε/2. Since U(β_λ) ≥ U(a₀), we get U(a₀) = −∞, which is a contradiction. This shows that ϕ(β_λ) tends to inf(ϕ) when λ tends to infinity.
So, consider such a converging subsequence, that we will still denote by βλn for
convenience. Since G is continuous, one has G(βλn , λn ) → G(β̃, λ) when n tends to
infinity. Let us prove that G(β_λ, λ) is continuous in λ. For any pair λ, λ′, we have
$$G(\beta_{\lambda'}, \lambda') \le G(\beta_\lambda, \lambda') = G(\beta_\lambda, \lambda) + (\lambda' - \lambda)\varphi(\beta_\lambda) \le G(\beta_\lambda, \lambda) + |\lambda' - \lambda|\,\varphi(a_0).$$
This yields, by symmetry, |G(β_{λ′}, λ′) − G(β_λ, λ)| ≤ ϕ(a₀)|λ − λ′|, proving the continuity in λ.
So we must have G(β̃, λ) = G(βλ , λ). This implies that both β̃ and βλ are solutions
of var(λ), so that βλ = β̃ because we assume that the solution is unique.
We now prove that the classes of problems var(λ) and var₀(C) are equivalent. First, β_λ is a minimizer of U(β) subject to the constraint ϕ(β) ≤ C, with C = ϕ(β_λ). Indeed, if U(β) < U(β_λ) for some β with ϕ(β) ≤ ϕ(β_λ), then U(β) + λϕ(β) < U(β_λ) + λϕ(β_λ), which is a contradiction. So β_λ = β′_{ϕ(β_λ)}, where β′_C denotes the solution of var₀(C). Using the continuity of β_λ and ϕ, this proves the equivalence of the problems when C is in the interval (a, ϕ(a₀)), where a = lim_{λ→∞} ϕ(β_λ) = inf(ϕ).
So, it remains to consider the case C > ϕ(a0 ). For such a C, the solution of var0 (C)
must be a0 since it is a solution of the unconstrained problem, and satisfies the con-
straint.
Problem statement Assume that the output variable is scalar, i.e., q = 1. Let σ̂²(i) be the empirical variance of the ith variable X^{(i)}. Then, the lasso estimator is defined as a minimizer of Σ_{k=1}^N (y_k − x̃_kᵀβ)² subject to the constraint Σ_{i=1}^d σ̂(i)|β^{(i)}| ≤ C. Compared to ridge regression, the sum of squares for β is simply replaced by a weighted sum of absolute values, but we will see that this change may significantly affect the nature of the solutions.
There is no simple kernel version of the lasso, and we only discuss the method in the original input space R = R^d.
For a vector a ∈ R^k, we let |a|₁ = |a^{(1)}| + ··· + |a^{(k)}| denote its ℓ¹ norm. Using the previous notation for Y and X, the quantity to minimize can be rewritten as
$$|\mathcal{Y} - \mathcal{X}\beta|^2 + \lambda|D\beta|_1,$$
where D is the d × (d + 1) matrix with D(i, i + 1) = σ̂(i) for i = 1, ..., d and all other coefficients equal to 0. This is a convex optimization problem which, unlike ridge regression, does not have a closed-form solution.
ADMM. The alternating direction method of multipliers (ADMM) that was described in section 3.6, (3.59), is one of the state-of-the-art algorithms to solve the lasso problem, especially in large dimensions. Other iterative methods include subgradient descent (see the example in section 3.5.4) and proximal gradient descent. Since x has a different meaning here, we change the notation in (3.59) by replacing x, z, u by β, γ, τ, and rewrite the lasso problem as the minimization of
$$|\mathcal{Y} - \mathcal{X}\beta|^2 + \lambda|\gamma|_1$$
subject to Dβ − γ = 0. Applying (3.60) with A = D, B = −Id and c = 0, the ADMM iterations are
$$\beta(n+1) = \mathop{\mathrm{argmin}}_\beta\Big(|\mathcal{Y} - \mathcal{X}\beta|^2 + \frac{\alpha}{2}|D\beta - \gamma(n) + \tau(n)|^2\Big)$$
$$\gamma^{(i)}(n+1) = \mathop{\mathrm{argmin}}_t\Big(\lambda|t| + \frac{\alpha}{2}\big(t - (D\beta(n+1))^{(i)} - \tau^{(i)}(n)\big)^2\Big), \quad i = 1, \dots, d$$
$$\tau(n+1) = \tau(n) + D\beta(n+1) - \gamma(n+1).$$
The solutions of both minimization problems are explicit, yielding the following
algorithm.
Note that the ADMM algorithm makes an iterative approximation of the constraints,
so that they are only satisfied at some precision level when the algorithm is stopped.
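Since the stated algorithm is not reproduced here, the following sketch implements the three iterations above directly in numpy; the penalty parameter α and the fixed iteration count are illustrative simplifications.

```python
# ADMM sketch for the lasso objective |Y - X beta|^2 + lam |D beta|_1.
import numpy as np

def soft(v, t):
    """Explicit prox of t|.|_1: componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_admm(X, Y, D, lam, alpha=1.0, n_iter=500):
    d1 = D.shape[0]
    gamma, tau = np.zeros(d1), np.zeros(d1)
    A = 2 * X.T @ X + alpha * D.T @ D        # fixed matrix in the beta-step
    for _ in range(n_iter):
        beta = np.linalg.solve(A, 2 * X.T @ Y + alpha * D.T @ (gamma - tau))
        gamma = soft(D @ beta + tau, lam / alpha)   # gamma-step, coordinatewise
        tau = tau + D @ beta - gamma                # multiplier update
    return beta
```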
Proposition 7.6 The pair (a₀, b) is the optimal solution of the lasso problem with parameter λ if and only if a₀ = ȳ − x̄ᵀb and, for all i = 1, ..., d,
$$|r_b^{(i)}| \le \frac{\lambda}{2N} \tag{7.7}$$
with
$$r_b^{(i)} = \frac{\lambda}{2N}\,\mathrm{sign}(b^{(i)}) \quad\text{if } b^{(i)} \ne 0. \tag{7.8}$$
In particular, |r_b^{(i)}| < λ/(2N) implies b^{(i)} = 0.
Proof Using the subdifferential calculus in theorem 3.45, one can compute the subgradients of G by adding the subdifferentials of the terms that compose it. All these terms are differentiable except |b^{(i)}| when b^{(i)} = 0, and the subdifferential of t ↦ |t| at t = 0 is the interval [−1, 1]. Subgradients of G therefore take the form
$$g = -2N r_b + \lambda z$$
with z^{(i)} = sign(b^{(i)}) if b^{(i)} ≠ 0 and |z^{(i)}| ≤ 1 otherwise. Proposition 7.6 immediately follows by taking g = 0.
Let ζ = sign(b) denote the vector formed by the signs of the coordinates of b, with sign(0) = 0. Then proposition 7.6 uniquely specifies a₀ and b once λ and ζ are known. Indeed, let J = J_ζ denote the ordered subset of indices j ∈ {1, ..., d} such that ζ^{(j)} ≠ 0, and let b(J), x_k(J), ζ(J), etc., denote the restrictions of vectors to these indices. Equation (7.8) can be rewritten as (after replacing a₀ by its optimal value)
$$\mathcal{X}_c(J)^T\mathcal{X}_c(J)\,b(J) = \mathcal{X}_c(J)^T\mathcal{Y}_c - \frac{\lambda}{2}\zeta(J)$$
where
$$\mathcal{X}_c(J) = \begin{pmatrix}(x_1(J) - \bar x(J))^T\\ \vdots\\ (x_N(J) - \bar x(J))^T\end{pmatrix}.$$
This yields
$$b(J) = (\mathcal{X}_c(J)^T\mathcal{X}_c(J))^{-1}\Big(\mathcal{X}_c(J)^T\mathcal{Y}_c - \frac{\lambda}{2}\zeta(J)\Big), \tag{7.9}$$
which fully determines b since b^{(j)} = 0 if j ∉ J, by definition.
For given λ, only one sign configuration ζ will provide the correct solution, with
correct signs for nonzero values of b above, and correct inequalities on rb . Call-
ing this configuration ζλ , one can note that if ζλ is known for a given value of λ,
it remains valid if we increase or decrease λ until one of the optimality conditions
changes, i.e., either one of the coordinates b(i) , i ∈ Jζλ , vanishes, or one of the inequal-
ities for i ∉ J_{ζ_λ} becomes an equality. Moreover, proposition 7.6 shows that between these events both b and therefore r_b depend linearly on λ, which makes it easy to determine maximal intervals around a given λ over which ζ remains unchanged.
Note that solutions are known for λ = 0 (standard least squares) and for λ large enough (for which b = 0). Indeed, for b = 0 to be a solution, it suffices that
$$\lambda > \lambda_0 \overset{\Delta}{=} 2\max_i\Big|\sum_{k=1}^N (y_k - \bar y)(x_k^{(i)} - \bar x^{(i)})\Big|.$$
These remarks set the stage for an algorithm computing the optimal solution of the lasso problem for all values of λ, starting either from λ = 0 or λ > λ₀. We will describe this algorithm starting from λ > λ₀, which has the merit of avoiding complications due to underconstrained least squares when d is large. For this purpose, we need a little more notation. For a given ζ, let
$$b_\zeta = (\mathcal{X}_c(J_\zeta)^T\mathcal{X}_c(J_\zeta))^{-1}\mathcal{X}_c(J_\zeta)^T\mathcal{Y}_c$$
and
$$u_\zeta = \frac{1}{2}(\mathcal{X}_c(J_\zeta)^T\mathcal{X}_c(J_\zeta))^{-1}\zeta(J_\zeta),$$
so that, by (7.9), b(J_ζ) = b_ζ − λu_ζ.
Assume that one wants to minimize Gλ∗ for some λ∗ > 0. We need to describe
the sequence of changes to the minimizers of Gλ when λ decreases from some value
larger than λ0 to the value λ∗ .
The sign of ζ^{(i)} is also determined, since sign(b^{(i)}) = sign(r_b^{(i)}) when b^{(i)} ≠ 0.
Given a current set J of selected variables, the algorithm will decide either to stop or to add a new variable to J according to a criterion that depends on a parameter λ > 0. Let b_J ∈ R^{|J|} be the least-square estimator based on the variables in J.
Justification. Recall the notation |b|0 for the number of non-zero entries of b. Con-
sider the objective function
L(b) = |Yc − Xc b|2 + λ|b|0 .
Let J be the set currently selected by the algorithm, and b_J defined as above. We consider the problem of adding one non-zero entry to b. Fix i ∉ J, and let b̃ ∈ R^d have all coordinates equal to those of b_J except the ith one, which is therefore allowed to be non-zero. Then
$$L(\tilde b) = \sum_{k=1}^N\Big(y_k - \bar y - \sum_{j\in J}(x_k^{(j)} - \bar x^{(j)})\,b^{(j)} - (x_k^{(i)} - \bar x^{(i)})\,\tilde b^{(i)}\Big)^2 + \lambda|J| + \lambda,$$
Variant. The same argument can be made with |b|₀ replaced by |b|₁, and one gets
$$L(\tilde b) = L(b_J) - 2N r_J^{(i)}\tilde b^{(i)} + N(\tilde b^{(i)})^2 + \lambda|\tilde b^{(i)}|.$$
Minimizing this expression with respect to b̃^{(i)} yields the upper bound
$$L(b_{J\cup\{i\}}) \le \begin{cases} L(b_J) - N\Big(|r_J^{(i)}| - \dfrac{\lambda}{2N}\Big)^2 & \text{if } |r_J^{(i)}| \ge \dfrac{\lambda}{2N},\\[2mm] L(b_J) & \text{if } |r_J^{(i)}| \le \dfrac{\lambda}{2N}.\end{cases}$$
This leads to the following alternate form of LARS. Given a current set J of selected variables, compute
$$r_J^{(i)} = \frac{1}{N}\sum_{k=1}^N\big(y_k - \bar y - (x_k - \bar x)^T b_J\big)\big(x_k^{(i)} - \bar x^{(i)}\big), \quad i \notin J.$$
If, for all i ∉ J, |r_J^{(i)}| ≤ λ/(2N), stop the procedure. Otherwise, add to J the variable i such that |r_J^{(i)}| is largest and continue. This form tends to add more variables, since the stopping criterion decreases in 1/N instead of 1/√N.
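A sketch of this greedy variant in numpy (illustrative names; variables are assumed centered and standardized so that σ̂(i) = 1):

```python
# Greedy forward selection with the stopping rule |r_J^(i)| <= lam / (2N).
import numpy as np

def greedy_select(X, y, lam):
    N, d = X.shape
    Xc, yc = X - X.mean(0), y - y.mean()
    J, b = [], np.zeros(0)
    while True:
        resid = yc - (Xc[:, J] @ b if J else 0.0)   # current residual
        r = Xc.T @ resid / N                        # r_J^(i) for every variable
        r[J] = 0.0                                  # only consider i not in J
        i = int(np.argmax(np.abs(r)))
        if np.abs(r[i]) <= lam / (2 * N):
            return J, b                             # stopping criterion met
        J.append(i)                                 # add best variable, refit
        b, *_ = np.linalg.lstsq(Xc[:, J], yc, rcond=None)
```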
Why “least angle”? Let µ_{J,k} = y_k − ȳ − (x_k − x̄)ᵀb_J denote the residual after regression. The empirical correlation between µ_J and x^{(i)} is equal to the cosine of the angle, say θ_J^{(i)}, between µ_J ∈ R^N and x^{(i)} − x̄^{(i)}, both considered as vectors in R^N (where x^{(i)} collects the values x_k^{(i)}, k = 1, ..., N). This cosine is also equal to
$$\cos\theta_J^{(i)} = \frac{\mu_J^T(x^{(i)} - \bar x^{(i)})}{|x^{(i)} - \bar x^{(i)}|\,|\mu_J|} = \sqrt{N}\,\frac{r_J^{(i)}}{|\mu_J|},$$
where we have used the fact that |x^{(i)} − x̄^{(i)}|/√N = σ̂(i) = 1. Since |µ_J| does not depend on i, looking for the largest value of |r_J^{(i)}| is equivalent to looking for the smallest value of |θ_J^{(i)}|, so that we are looking for the unselected variable for which the angle with the current residual is minimal.
Noise-free case. Assume that one wants to solve the equation Xβ = Y when the dimension, N, of Y is small compared to the number of columns, d, in X. Since the system is under-determined, one needs additional constraints on β, and a natural one is to look for sparse solutions, i.e., solutions with a maximum number of zero coefficients. However, this is numerically challenging, and it is easier to minimize the ℓ¹ norm of β instead (as seen when discussing the lasso, using this norm often provides sparse solutions). In the following, we assume that the empirical variance of each variable is normalized, so that, denoting by X^{(i)} the ith column of X, we have |X^{(i)}| = 1.
Sparsity recovery Under some assumptions, this method does recover sparse solu-
tions when they exist. More precisely, let β̂ be the solution of the linear programming
problem above. Assume that there is a set J ∗ ⊂ {1, . . . , d} such that X β = Y for some
β ∈ R^d with β^{(i)} = 0 if i ∉ J∗. Conditions under which β̂ is equal to β are provided
in Candes and Tao [46] and involve the correlations between pairs of columns of X ,
and the size of J∗.
That the size of J∗ must be a factor is clear since, for the statement to make sense, there cannot exist two β’s satisfying Xβ = Y and β^{(i)} = 0 for i ∉ J∗. Uniqueness obviously fails if |J∗| > N because, even if one knew J∗, the system would be under-constrained for β. Since the set J∗ is not known, we also want to avoid any other solution associated with a set of the same size: there cannot exist β and β̃, respectively vanishing outside of J∗ and J̃∗, where J∗ and J̃∗ have the same cardinality, such that Xβ = Y = Xβ̃. The equation X(β − β̃) = 0 would be under-constrained as soon as the number of non-zero coefficients of β − β̃ is larger than N, and since this number can be as large as |J∗| + |J̃∗| = 2|J∗|, we see that one should impose at least |J∗| ≤ N/2.
Given this restriction, another obvious remark is that, if the set J on which β does not vanish is known, with |J| small enough, then Xβ = Y is over-constrained and any solution is (typically) unique. So the issue really is whether the set J_β listing the non-zero indices of a solution β is equal to J∗.
(1) If λ(i) ∈ (0, 1), then ξ(i) = β^{(i)} − ξ(i) = 0, which implies ξ(i) = β^{(i)} = 0, and, as a consequence, (1 − λ∗(i))ξ∗(i) = λ∗(i)ξ∗(i) = 0, so that also ξ∗(i) = 0.
(2) Similarly, λ∗(i) ∈ (0, 1) implies ξ(i) = ξ∗(i) = β^{(i)} = 0.
(3) If λ(i) = λ∗(i) = 1, then β^{(i)} − ξ(i) = β^{(i)} + ξ∗(i) = 0 with ξ(i), ξ∗(i) ≥ 0, so that also ξ(i) = ξ∗(i) = β^{(i)} = 0.
(4) If λ(i) = λ∗(i) = 0, then ξ(i) = ξ∗(i) = 0 and, since β^{(i)} ≤ ξ(i) and β^{(i)} ≥ −ξ∗(i), we get β^{(i)} = 0.
(5) The only remaining situation, in which β^{(i)} can be non-zero, is when λ(i) = 1 − λ∗(i) ∈ {0, 1}, or, equivalently, when |λ(i) − λ∗(i)| = 1.
This discussion allows one to reconstruct the set Jβ associated with the primal prob-
lem given the solution of the dual problem. Note that |λ(i) − λ∗ (i)| = |α T X (i)|, so that
the set of indices with |λ(i) − λ∗(i)| = 1 is also
$$I_\alpha \overset{\Delta}{=} \big\{i : |\alpha^T X^{(i)}| = 1\big\}.$$
One has
$$\alpha^T\mathcal{Y} = \alpha^T\mathcal{X}\beta = \sum_{i=1}^d \beta^{(i)}\,\alpha^T X^{(i)} \le \sum_{i\in J_\beta}|\beta^{(i)}|\,|\alpha^T X^{(i)}| \le \sum_{i\in J_\beta}|\beta^{(i)}|.$$
The upper bound is achieved when αᵀX^{(i)} = sign(β^{(i)}) for i ∈ J_β. So, the question becomes whether a vector α can be found such that αᵀX^{(j)} = sign(β^{(j)}) for j ∈ J∗ and |αᵀX^{(k)}| < 1 for k ∉ J∗.
Let s_{J∗} = (s^{(j)}, j ∈ J∗) be defined by s^{(j)} = sign(β^{(j)}). One can always decompose α ∈ R^N in the form
$$\alpha = \mathcal{X}_{J^*}\rho + w,$$
where ρ ∈ R^{|J∗|} and w ∈ R^N is perpendicular to the columns of X_{J∗}. From X_{J∗}ᵀα = s_{J∗}, we get
$$\rho = (\mathcal{X}_{J^*}^T\mathcal{X}_{J^*})^{-1}s_{J^*}.$$
Letting α_{J∗} be the solution with w = 0, the question is therefore whether one can find w such that
$$\begin{cases} w^T X^{(j)} = 0, & j \in J^*,\\ |\alpha_{J^*}^T X^{(k)} + w^T X^{(k)}| < 1, & k \notin J^*.\end{cases}$$
Denote for short Σ_{JJ′} = X_JᵀX_{J′}. One can show that such a solution exists when the matrices Σ_{JJ} are close to the identity as soon as |J| is small enough [46]. More precisely, introducing constants δ(q) and θ(q, q′) that quantify, for index sets of sizes at most q and q′, how far the matrices Σ_{JJ} can be from the identity and how correlated two disjoint groups of columns can be (the restricted isometry and restricted orthogonality constants of [46]), one can show that, when |J∗| ≤ q,
$$|\alpha^T X^{(j)}| \le \frac{\theta(q, q)}{1 - \delta(2q) - \theta(q, 2q)} \quad\text{if } j \notin J^*.$$
So α has the desired property as soon as δ(2q) + θ(q, q) + θ(q, 2q) < 1. Note that one needs to control subsets of variables of size less than 3q to obtain the conclusion, which is important, of course, when q is small compared to d.
Noisy case Consider now the noisy case. We here again introduce quantities that were pivotal for the lasso and LARS estimators, namely, the covariances between the variables and the residual error. So, we define, for a given β,
$$r_\beta^{(i)} = X^{(i)T}(\mathcal{Y} - \mathcal{X}\beta),$$
which depends linearly on β. Then, the Dantzig selector is defined by the linear program: minimize
$$\sum_{j=1}^d |\beta^{(j)}|$$
subject to the constraint
$$\max_{j=1,\dots,d}|r_\beta^{(j)}| \le C.$$
The explicit expression of this problem as a linear program is obtained, as before, by introducing slack variables ξ(j), ξ∗(j), j = 1, ..., d, and minimizing
$$\sum_{j=1}^d \xi(j) + \sum_{j=1}^d \xi^*(j)$$
with constraints ξ(j), ξ∗(j) ≥ 0, ξ ≥ β, ξ∗ ≥ −β, and max_{j=1,...,d} |r_β^{(j)}| ≤ C.
Similar to the noise-free case, the Dantzig selector can identify sparse solutions
(up to a small error) if the columns of X are nearly orthogonal, with the same type of
conditions [47]. Interestingly enough, the accuracy of this algorithm can be proved
to be comparable to that of the lasso in the presence of a sparse solution [30].
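The linear program above can be passed directly to an LP solver; the sketch below uses scipy's linprog on the stacked variables (β, ξ), with an equivalent single slack vector ξ replacing the pair (ξ, ξ∗), and the sup-norm constraint written as the pair of inequalities ±Xᵀ(Y − Xβ) ≤ C.

```python
# Dantzig selector as a linear program; illustrative sketch.
import numpy as np
from scipy.optimize import linprog

def dantzig(X, Y, C):
    """Minimize |beta|_1 subject to |X'(Y - X beta)|_inf <= C."""
    N, d = X.shape
    G = X.T @ X
    c = np.concatenate([np.zeros(d), np.ones(d)])   # objective: sum of slacks
    I = np.eye(d)
    A = np.block([[ I, -I],                         #  beta - xi <= 0
                  [-I, -I],                         # -beta - xi <= 0
                  [ G, np.zeros((d, d))],           #  X'X beta <= C + X'Y
                  [-G, np.zeros((d, d))]])          # -X'X beta <= C - X'Y
    b = np.concatenate([np.zeros(2 * d), C + X.T @ Y, C - X.T @ Y])
    bounds = [(None, None)] * d + [(0, None)] * d   # beta free, xi >= 0
    res = linprog(c, A_ub=A, b_ub=b, bounds=bounds, method="highs")
    return res.x[:d]
```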
Least-square regression had the advantage of providing closed-form expressions for the solution, but it is quite sensitive to outliers. For robustness, it is preferable to use loss functions that, like V, increase at most linearly at infinity. One sometimes chooses them as smooth convex functions, for example V(t) = (1 − cos(γt))/(1 − cos(γε)) for |t| < ε and V(t) = |t| for |t| ≥ ε, where γ is chosen so that γ sin(γε)/(1 − cos(γε)) = 1. In such a case, minimizing
$$F(\beta) = \sum_{k=1}^N V(y_k - a_0 - x_k^T b)$$
can be done using gradient descent methods. Using V in (7.10) will require a little more work, as we see now.
• An ε-tolerance for small errors, often referred to as the margin of the regression SVM.
We now describe the various steps in the analysis and reduction of the problem.
They will lead to simple minimization algorithms, and possible extensions to non-
linear problems.
N
X N
X
L(a0 , b, ξ, ξ ∗ , α, α ∗ , η, η ∗ ) = (ξk + ξk∗ ) + λbT ∆b − (ηk ξk + ηk∗ ξk∗ )
k=1 k=1
N
X N
X
− αk (ξk − yk + a0 + xkT b + ) − αk∗ (ξk∗ + yk − a0 − xkT b + ).
k=1 k=1
In this formulation, (a0 , b, ξ, ξ ∗ ) are the primal variables, and α, α ∗ , η, η ∗ the dual vari-
ables.
The first four equations are the derivatives of the Lagrangian with respect to a0 , b, ξk , ξk∗
in this order and the last three are the complementary slackness conditions.
$$L^*(\alpha, \alpha^*, \eta, \eta^*) = \inf_{\beta,\xi,\xi^*} L.$$
One finds
$$L^*(\alpha, \alpha^*) = -\frac{1}{4\lambda}\sum_{k,l=1}^N(\alpha_k - \alpha_k^*)(\alpha_l - \alpha_l^*)\,x_k^T\Delta^{-1}x_l - \epsilon\sum_{k=1}^N(\alpha_k + \alpha_k^*) + \sum_{k=1}^N(\alpha_k - \alpha_k^*)\,y_k.$$
Step 3: Analysis of the dual problem The dual problem only depends on the x_k’s through the matrix with coefficients x_kᵀΔ⁻¹x_l, which is the Gram matrix of x₁, ..., x_N for the inner product associated with Δ⁻¹. This property will lead to the kernel version of SVMs discussed in the next section. The obtained predictor can also be expressed as a function of these products, since
$$y = a_0 + x^T b = a_0 + \frac{1}{2\lambda}\sum_{k=1}^N(\alpha_k - \alpha_k^*)(x_k^T\Delta^{-1}x).$$
Moreover, the dimension of the dual problem is 2N , which allows the method to be
used in large (possibly infinite) dimensions with a bounded cost.
These conditions have the following consequences, based on the prediction error made for each training sample.
(i) First consider indices k such that the error is strictly within the tolerance margin: |y_k − a₀ − x_kᵀb| < ε. Then the terms between parentheses in the first two equations of (7.14) are strictly positive, which implies that α_k = α_k∗ = 0. The last two equations in (7.14) then imply ξ_k = ξ_k∗ = 0.
(ii) Consider now the case when the prediction is strictly less accurate than the tolerance margin. Assume that y_k − a₀ − x_kᵀb > ε. The second and third equations in (7.14) imply that α_k∗ = ξ_k∗ = 0. The assumption also implies that
$$\xi_k = y_k - a_0 - x_k^T b - \epsilon > 0$$
and α_k = 1. The case y_k − a₀ − x_kᵀb < −ε is symmetric and provides α_k = ξ_k = 0, ξ_k∗ > 0 and α_k∗ = 1.
(iii) Finally, consider samples for which the prediction error is exactly at the tolerance margin. If y_k − a₀ − x_kᵀb = ε, we have α_k∗ = ξ_k = ξ_k∗ = 0. The fact that α_k∗ = ξ_k∗ = 0 is clear. To prove that ξ_k = 0, we note that we would otherwise have ξ_k − y_k + a₀ + x_kᵀb + ε > 0, which would imply that α_k = 0, and we would reach a contradiction with (1 − α_k)ξ_k = 0. Similarly, y_k − a₀ − x_kᵀb = −ε implies that α_k = ξ_k = ξ_k∗ = 0.
The points for which |y_k − a₀ − x_kᵀb| = ε are called support vectors.
One important piece of information derived from this discussion is that the variables (α_k, α_k∗) have prescribed values as long as the error y_k − a₀ − x_kᵀb is not exactly ±ε: (1, 0) if the error is larger than ε, (0, 0) if it is strictly between −ε and ε, and (0, 1) if it is less than −ε. Also, in all cases, at least one of α_k and α_k∗ must vanish.
Only in the case of support vectors does the previous discussion fail to provide a value for one of these variables.
Now, we want to reverse the discussion and assume that the dual problem is solved, to see how the variables a₀ and b of the primal problem can be retrieved. For b, this is easy, thanks to (7.13). For a₀, a direct computation can be made if a support vector is identified, either because 0 < α_k < 1, which implies that a₀ = y_k − x_kᵀb − ε, or because 0 < α_k∗ < 1, which yields a₀ = y_k − x_kᵀb + ε.
Letting as before V = span(h(x₁), ..., h(x_N)), the same argument as that made for ridge regression works, namely, the first term in F is unchanged if b is replaced by π_V(b), and the second one is strictly reduced unless b ∈ V, leading to a finite-dimensional formulation in which
$$b = \sum_{k=1}^N c_k h(x_k).$$
This function has the same form as the one studied in the linear case, with b replaced by c ∈ R^N, x_k replaced by the vector with coefficients K(x_k, x_l), l = 1, ..., N, which we will denote K^{(k)}, and Δ = K = K(x₁, ..., x_N). Note that K^{(k)} is the kth column of K, so that
$$(K^{(k)})^T\mathcal{K}^{-1}K^{(l)} = K(x_k, x_l).$$
Using this, we find that the dual problem requires maximizing
$$L^*(\alpha, \alpha^*) = -\frac{1}{4\lambda}\sum_{k,l=1}^N(\alpha_k - \alpha_k^*)(\alpha_l - \alpha_l^*)K(x_k, x_l) - \epsilon\sum_{k=1}^N(\alpha_k + \alpha_k^*) + \sum_{k=1}^N(\alpha_k - \alpha_k^*)\,y_k$$
subject to
$$0 \le \alpha_k \le 1, \quad 0 \le \alpha_k^* \le 1, \quad \sum_{k=1}^N(\alpha_k - \alpha_k^*) = 0.$$
The associated vector c satisfies
$$2\lambda c = \sum_{k=1}^N(\alpha_k - \alpha_k^*)\mathcal{K}^{-1}K^{(k)} = \alpha - \alpha^*.$$
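A sketch of solving this box- and equality-constrained dual numerically with scipy's SLSQP routine (an illustrative solver choice, not the method used in the text); lam and eps stand for the λ and ε above.

```python
# Kernel SVR dual: maximize L*(alpha, alpha*) under box and sum constraints.
import numpy as np
from scipy.optimize import minimize

def svr_dual(K, y, lam=1.0, eps=0.1):
    N = len(y)
    def neg_dual(z):                     # minimize -L*(alpha, alpha*)
        a, a_s = z[:N], z[N:]
        d = a - a_s
        return d @ K @ d / (4 * lam) + eps * (a + a_s).sum() - d @ y
    cons = {"type": "eq", "fun": lambda z: z[:N].sum() - z[N:].sum()}
    res = minimize(neg_dual, np.full(2 * N, 0.5),
                   bounds=[(0.0, 1.0)] * (2 * N), constraints=cons,
                   method="SLSQP")
    a, a_s = res.x[:N], res.x[N:]
    return (a - a_s) / (2 * lam)         # c: coefficients of b on h(x_1..x_N)
```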
Chapter 8

Models for Linear Classification

In this chapter, Y is categorical and takes values in the finite set R_Y = {g₁, ..., g_q}. The goal is to predict this class variable from the input X, taking values in a set R_X. Using the same progression as in the regression case, we will first discuss basic linear methods, for which R_X = R^d, before extending them, whenever possible, to kernel methods, for which R_X can be arbitrary as soon as a feature space representation is available.
For g ∈ R_Y, we denote by N_g the number of training samples in class g:
$$N_g = \#\{k : y_k = g\} = \sum_{k=1}^N 1_{y_k = g}.$$
Logistic regression uses the fact that, in order to apply Bayes’s rule, only the conditional distribution of the class variable Y given X is needed, and trains a parametric model of this distribution. More precisely, if one denotes by p(g|x) the probability that Y = g conditionally on X = x, logistic regression assumes that, for some parameters (a₀(g), b(g), g ∈ R_Y) with a₀(g) ∈ R and b(g) ∈ R^d, one has p = p_{a₀,b} with
$$p_{a_0,b}(g\,|\,x) = \frac{\exp(a_0(g) + b(g)^T x)}{\sum_{g'\in R_Y}\exp(a_0(g') + b(g')^T x)}.$$
As a consequence, if one replaces, for all g, β(g) by β̃(g) = β(g) + γ, with γ ∈ R^{d+1}, then β̃(g)ᵀx̃ = β(g)ᵀx̃ + γᵀx̃ and p_{β̃}(g|x) = p_β(g|x) for all g and x.
This shows that the model is over-parametrized. One therefore needs a (d + 1)-dimensional constraint to ensure uniqueness, and we will enforce a linear constraint in the form
$$\sum_{g\in R_Y}\rho_g\beta(g) = c$$
with Σ_g ρ_g ≠ 0.
The last statement in the proposition expresses the fact that d 2 `(β)(u, u) ≤ 0 for all
u∈F.
Writing, for µ : R_Y → R, F_g(µ) = µ(g) − log Σ_{g′∈R_Y} e^{µ(g′)}, the log-likelihood takes the form
$$\ell(\beta) = \sum_{k=1}^N F_{y_k}(\beta^T\tilde x_k).$$
We have, for ζ : R_Y → R,
$$dF_g(\mu)\zeta = \zeta(g) - \frac{\sum_{g'\in R_Y} e^{\mu(g')}\zeta(g')}{\sum_{g'\in R_Y} e^{\mu(g')}}.$$
Letting q_µ(g) = e^{µ(g)}/Σ_{g′∈R_Y} e^{µ(g′)} and
$$\langle\zeta\rangle_\mu = \sum_{g\in R_Y}\zeta(g)\,q_\mu(g),$$
we have dF_g(µ)ζ = ζ(g) − ⟨ζ⟩_µ. Evaluating the derivative of dF_g(µ + εu′)(ζ) at ε = 0, one gets (the computation being left to the reader)
$$d^2F_g(\mu)(\zeta, \zeta') = -\langle\zeta\zeta'\rangle_\mu + \langle\zeta\rangle_\mu\langle\zeta'\rangle_\mu. \tag{8.4}$$
Note that −d²F_g(µ)(ζ, ζ) is the variance of ζ for the probability mass function q_µ and is therefore non-negative (so that F_g is concave). This immediately shows that ℓ is concave as a sum of concave functions.
Reordering the first sum in the right-hand side according to the values of y_k gives
$$\sum_{k=1}^N u(y_k)^T\tilde x_k = \sum_{g\in R_Y} N_g\, u(g)^T\tilde m_g,$$
where m̃_g is the average of the x̃_k’s over {k : y_k = g}. Moreover,
$$d^2F_{y_k}(\beta^T\tilde x_k)\big(\tilde x_k^T u(\cdot),\, \tilde x_k^T u'(\cdot)\big) = -\langle u(\cdot)^T\tilde x_k\,\tilde x_k^T u'(\cdot)\rangle_{\beta^T\tilde x_k} + \langle\tilde x_k^T u(\cdot)\rangle_{\beta^T\tilde x_k}\,\langle\tilde x_k^T u'(\cdot)\rangle_{\beta^T\tilde x_k}$$
$$= -\sum_{g\in R_Y} u(g)^T\tilde x_k\,\tilde x_k^T u'(g)\,p_\beta(g|x_k) + \sum_{g,g'\in R_Y} u(g)^T\tilde x_k\,\tilde x_k^T u'(g')\,p_\beta(g|x_k)\,p_\beta(g'|x_k).$$
We now discuss whether there are other elements in the null space of the second derivative of ℓ. We will use notation introduced in the proof of proposition 8.1. From (8.4), we have d²F_g(µ)(ζ, ζ) = 0 if and only if the variance of ζ for q_µ vanishes, which, since q_µ > 0, is equivalent to ζ being constant. So, the null space of d²F_g(µ) is one-dimensional, and composed of scalar multiples of 1. Using (8.5), we see that d²ℓ(u, u) = 0 if and only if, for all k = 1, ..., N, (g ↦ x̃_kᵀu(g)) is a constant function. If one restricts ℓ to M, then we must restrict d²ℓ(β) to those u’s such that Σ_{g∈R_Y} ρ_g u(g) = 0.
Given that we have expressed the first and second derivatives of ℓ in closed form¹, we can use a Newton-Raphson ascent to maximize ℓ over the affine space
$$M = \Big\{\beta : \sum_{g\in R_Y}\rho_g\beta(g) = c\Big\}$$
1 Their computation is feasible unless N is very large, and the matrix inversion in Newton’s itera-
tion also requires d to be not too large.
with Σ_{g∈R_Y} ρ_g ≠ 0. We assume in the following that the matrix X has rank d + 1, so
that proposition 8.4 applies. Since the constraint is affine, it is easy to express one of
the parameters β(g) as a function of the others and solve the strictly concave problem
as a function of the remaining variables. It is not much harder, and arguably more
elegant to solve the problem without breaking its symmetry with respect to the class
indexes, as described below.
Let
$$M_0 = \Big\{\beta : \sum_{g\in R_Y}\rho_g\beta(g) = 0\Big\}.$$
We have the expansion
$$\ell(\beta + u) = \ell(\beta) + d\ell(\beta)u + \frac12 d^2\ell(\beta)(u, u) + o(|u|^2),$$
and we consider the maximization of the first three terms, simply restricting to vectors u ∈ M₀. To allow for matrix computation, we use our ordering R_Y = (g₁, ..., g_q) and identify u with the column vector
$$\begin{pmatrix}u(g_1)\\ \vdots\\ u(g_q)\end{pmatrix} \in \mathbb{R}^{q(d+1)}.$$
Similarly, we let
$$\nabla\ell(\beta) = \begin{pmatrix}\partial_{\beta(g_1)}\ell\\ \vdots\\ \partial_{\beta(g_q)}\ell\end{pmatrix}$$
and let ∇²(ℓ)(β) be the block matrix with (i, j) block given by ∂_{β(g_i)}∂_{β(g_j)}ℓ(β). We let ρ̂ be the (d + 1) × q(d + 1) row block matrix
$$\hat\rho = \begin{pmatrix}\rho(g_1)\mathrm{Id}_{\mathbb{R}^{d+1}} & \cdots & \rho(g_q)\mathrm{Id}_{\mathbb{R}^{d+1}}\end{pmatrix},$$
so that
$$\ell(\beta + u) = \ell(\beta) + \nabla\ell(\beta)^T u + \frac12 u^T\nabla^2(\ell)(\beta)u + o(|u|^2).$$
Maximizing this expansion over u ∈ M₀ leads to the Lagrangian
$$L = \ell(\beta) + u^T\nabla\ell(\beta) + \frac12 u^T\nabla^2(\ell)(\beta)u + \lambda^T\hat\rho u,$$
whose stationarity conditions give the Newton iteration
$$\begin{pmatrix}u_{n+1}\\ \lambda\end{pmatrix} = -\begin{pmatrix}\nabla^2(\ell)(\beta_n) & \hat\rho^T\\ \hat\rho & 0\end{pmatrix}^{-1}\begin{pmatrix}\nabla\ell(\beta_n)\\ 0\end{pmatrix}. \tag{8.7}$$
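A minimal numpy sketch of iteration (8.7), assuming classes coded 0, ..., q − 1, ρ_g ≡ 1 and c = 0 (illustrative choices); the KKT system is invertible when X has rank d + 1, as assumed above.

```python
# Constrained Newton ascent for multinomial logistic regression, per (8.7).
import numpy as np

def softmax(Z):                 # rows: samples, columns: classes
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def newton_logistic(X, y, q, n_iter=20):
    """X: (N, d) inputs; y: (N,) labels in {0, ..., q-1}."""
    N, d = X.shape
    Xt = np.hstack([np.ones((N, 1)), X])          # rows x-tilde
    p1 = d + 1
    beta = np.zeros((q, p1))
    rho = np.ones(q)                              # constraint sum_g beta(g) = 0
    Rhat = np.kron(rho[None, :], np.eye(p1))      # (p1, q*p1) row block matrix
    for _ in range(n_iter):
        P = softmax(Xt @ beta.T)                  # (N, q) class probabilities
        G = np.zeros(q * p1)
        H = np.zeros((q * p1, q * p1))
        for g in range(q):                        # gradient and Hessian blocks
            G[g*p1:(g+1)*p1] = Xt.T @ ((y == g).astype(float) - P[:, g])
            for gp in range(q):
                W = P[:, g] * ((g == gp) - P[:, gp])
                H[g*p1:(g+1)*p1, gp*p1:(gp+1)*p1] = -(Xt * W[:, None]).T @ Xt
        KKT = np.block([[H, Rhat.T], [Rhat, np.zeros((p1, p1))]])
        sol = np.linalg.solve(KKT, np.concatenate([-G, np.zeros(p1)]))
        beta += sol[:q*p1].reshape(q, p1)         # Newton step u_{n+1}
    return beta
```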
or
$$\ell_1(\beta) = \ell(\beta) - \lambda\sum_{i=1}^d |b^{(i)}|, \tag{8.9}$$
where b(i) is the q-dimensional vector formed with the ith coefficients of b(g) for
g ∈ RY . Similarly to penalized regression, one generally normalizes the x variables
to have unit standard deviation before applying the method.
Maximization with the ℓ² norm The problem in (8.8) relates to ridge regression and can be solved using a Newton-Raphson method (Algorithm 8.1) with minor changes. More precisely, one lets
$$\Delta = \begin{pmatrix}0 & 0\\ 0 & \mathrm{Id}_{\mathbb{R}^d}\end{pmatrix}$$
and
$$d\ell_2(\beta)u = d\ell(\beta)u - 2\lambda\,\mathrm{trace}(\beta^T\Delta u),$$
and modifies the gradient and Hessian in the Newton iteration accordingly.
Maximization with the ℓ¹ norm The maximization in (8.9) can be run using proximal gradient ascent (section 3.5.5). Let C denote the affine set containing all β = (a₀, b) such that Σ_g ρ_g a₀(g) = c, and let σ_C be the convex indicator function with σ_C(β) = 0 if β ∈ C and +∞ otherwise.
The problem is then to maximize ℓ(β) − λγ(a₀, b) with
$$\gamma(a_0, b) = \sum_{i=1}^d\sqrt{\sum_{g\in R_Y} b^{(i)}(g)^2} + \sigma_C(\beta).$$
Here, ℓ is concave and γ is convex, and the proximal gradient iterations take the form
$$\beta_{n+1} = \mathrm{prox}_{\epsilon\lambda\gamma}\big(\beta_n + \epsilon\nabla\ell(\beta_n)\big)$$
for a step size ε > 0. We now compute the proximal operator of γ and, since γ is the sum of functions depending on a₀, b^{(1)}, ..., b^{(d)}, it suffices to compute separately the proximal operator of each of these functions.
Let u^{(0)}, ..., u^{(d)} be functions from R_Y to R. Starting with a₀, we know that h^{(0)} = prox_{λσ_C}(u^{(0)}) is the projection of u^{(0)} on C, therefore characterized by h^{(0)} ∈ C and (u^{(0)} − h^{(0)}) ⊥ C, the latter implying that h^{(0)} = u^{(0)} + tρ for some t ∈ R and the former allowing one to identify t as t = (c − (u^{(0)})ᵀρ)/|ρ|², so that
$$\mathrm{prox}_{\lambda\sigma_C}(u^{(0)}) = u^{(0)} + \frac{c - (u^{(0)})^T\rho}{|\rho|^2}\,\rho.$$
For i ≥ 1, the proximal operator of λ times the Euclidean norm requires solving h^{(i)}(·) + λh^{(i)}(·)/|h^{(i)}(·)| = u^{(i)}(·). Taking the norm on both sides (assuming that |h^{(i)}(·)| does not vanish) yields
$$|h^{(i)}(\cdot)| + \lambda = |u^{(i)}(\cdot)|,$$
which has a positive solution only if |u^{(i)}(·)| > λ. If |u^{(i)}(·)| ≤ λ, then we must take h^{(i)}(·) = 0. We have therefore obtained
$$\mathrm{prox}_{\lambda\gamma}(u) = h$$
with
$$h^{(0)}(\cdot) = u^{(0)} + \frac{c - (u^{(0)})^T\rho}{|\rho|^2}\,\rho \tag{8.11a}$$
and
$$h^{(i)}(\cdot) = \max\Big(\frac{|u^{(i)}(\cdot)| - \lambda}{|u^{(i)}(\cdot)|},\, 0\Big)\,u^{(i)}(\cdot) \tag{8.11b}$$
for i ≥ 1. We summarize this discussion in the next algorithm, which should be run with a step size ε > 0 small enough.
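The proximal operator (8.11a)-(8.11b) is straightforward to implement; in the sketch below (illustrative layout), u is stored as a (d + 1) × q array whose first row contains the intercepts u^{(0)} and whose row i ≥ 1 contains the group u^{(i)}(·).

```python
# Proximal operator of gamma, combining the affine projection (8.11a)
# and group soft-thresholding (8.11b); illustrative sketch.
import numpy as np

def prox_gamma(u, lam, rho, c):
    h = u.copy()
    # (8.11a): project the intercept row on the affine set rho' a0 = c
    h[0] = u[0] + (c - u[0] @ rho) / (rho @ rho) * rho
    # (8.11b): group soft-thresholding of each row i >= 1
    norms = np.linalg.norm(u[1:], axis=1)
    scale = np.maximum(norms - lam, 0.0) / np.where(norms > 0, norms, 1.0)
    h[1:] = u[1:] * scale[:, None]
    return h
```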
Using the usual kernel argument, one sees that, when maximizing the log-likelihood, there is no loss of generality in assuming that each b(g) belongs to V = span(h(x₁), ..., h(x_N)). Taking
$$b(g) = \sum_{k=1}^N \alpha_k(g)h(x_k),$$
we have
$$\log p_\alpha(g\,|\,x) = a_0(g) + \sum_{k=1}^N\alpha_k(g)K(x, x_k) - \log\Big(\sum_{\tilde g\in R_Y}\exp\Big(a_0(\tilde g) + \sum_{k=1}^N\alpha_k(\tilde g)K(x, x_k)\Big)\Big).$$
To avoid overfitting, one must include a penalty term in the likelihood, and (in order to take advantage of the kernel) one can take this term proportional to Σ_g ‖b(g)‖²_H. The complete learning procedure then requires maximizing the concave penalized likelihood
$$\ell(\alpha) = \sum_{k=1}^N\log p_\alpha(y_k\,|\,x_k) - \lambda\sum_{g\in R_Y}\sum_{k,l=1}^N\alpha_k(g)\alpha_l(g)K(x_k, x_l).$$
The computation of the first and second derivatives of this function is similar to that
for the original version, and we skip the details.
Generative model In classification, the class variable Y generally has a causal role in the generation of the variable X. Prediction can therefore be seen as an inverse problem, where the cause is deduced from the result. In terms of generative modeling, one should therefore model the distribution of Y, followed by the conditional distribution of X given Y.
Since the denominator does not depend on g, the Bayes estimator equivalently maximizes (taking logarithms)
$$\log f_g(x) + \log\pi_g.$$
For the Gaussian model with common covariance S, this quantity expands (up to terms that do not depend on the class) as
$$-\frac12(x - m)^T S^{-1}(x - m) + (x - m)^T S^{-1}(m_g - m) - \frac12(m_g - m)^T S^{-1}(m_g - m) + \log\pi_g.$$
Since the first term does not depend on g, it is equivalent to maximize
$$(x - m)^T S^{-1}(m_g - m) - \frac12(m_g - m)^T S^{-1}(m_g - m) + \log\pi_g \tag{8.13}$$
with respect to the class g, which provides an affine function of x.
Training Training for LDA simply consists in estimating the class means and com-
mon variance in (8.12) from data. We introduce some notation for this purpose (this
notation will be reused through the rest of this chapter).
Recall that Ng , g ∈ RY denotes the number of samples with class g in the train-
ing set T = (x1 , y1 , . . . , xN , yN ). We let cg = Ng /N and C be the diagonal matrix with
diagonal coefficients cg1 , . . . , cgq . We also let ζ ∈ Rq denote the vector with the same
coordinates. For g ∈ R_Y, µ_g denotes the class average
$$\mu_g = \frac{1}{N_g}\sum_{k=1}^N x_k 1_{y_k = g},$$
Σ_g the class covariance
$$\Sigma_g = \frac{1}{N_g}\sum_{k=1}^N (x_k - \mu_g)(x_k - \mu_g)^T 1_{y_k = g},$$
and Σw the pooled class covariance (also called within-class covariance) defined by
$$\Sigma_w = \frac{1}{N}\sum_{k=1}^N (x_k - \mu_{y_k})(x_k - \mu_{y_k})^T = \sum_{g\in R_Y} c_g\Sigma_g.$$
We also consider the global covariance
$$\Sigma_{XX} = \frac{1}{N}\sum_{k=1}^N (x_k - \mu)(x_k - \mu)^T,$$
where µ is the global average of the x_k’s, and note the decomposition, for each class g,
$$\frac{1}{N_g}\sum_{k=1}^N (x_k - \mu)(x_k - \mu)^T 1_{y_k = g} = \Sigma_g + (\mu_g - \mu)(\mu_g - \mu)^T.$$
Finally, let
$$M = \begin{pmatrix}(\mu_{g_1} - \mu)^T\\ \vdots\\ (\mu_{g_q} - \mu)^T\end{pmatrix}.$$
One of the interests of LDA is that it can be combined with a rank reduction procedure. LDA with $q$ classes can always be seen as a $(q-1)$-dimensional problem after suitable projection on a data-dependent affine space. Recall that the classification rule after training requires to maximize w.r.t. $g \in R_Y$ the function
$$(x-\mu)^T\Sigma_w^{-1}(\mu_g-\mu) - \frac12(\mu_g-\mu)^T\Sigma_w^{-1}(\mu_g-\mu) + \log\pi_g.$$
Define the "spherized" data² by $\tilde x_k = \Sigma_w^{-1/2}(x_k-\mu)$, where $\Sigma_w^{1/2}$ is the positive symmetric square root of $\Sigma_w$. Also let $\tilde\mu_g = \Sigma_w^{-1/2}(\mu_g-\mu)$. With this notation, the predictor chooses the class $g$ that maximizes
$$\tilde x^T\tilde\mu_g - \frac12|\tilde\mu_g|^2 + \log\pi_g$$
with $\tilde x = \Sigma_w^{-1/2}(x-\mu)$.

² In this section only, the notation $\tilde x$ does not refer to $(1, x^T)^T$.
Now, let $V = \mathrm{span}\{\tilde\mu_g,\ g\in R_Y\}$. Since $\sum_g c_g\tilde\mu_g = 0$, this space is at most $(q-1)$-dimensional. Let $P_V$ denote the orthogonal projection on $V$. We have $\tilde x^Tz = (P_V\tilde x)^Tz$ for any $z \in V$ and $\tilde x \in \mathbb{R}^d$. Let $\tilde e_1,\ldots,\tilde e_r$ denote an orthonormal basis of $V$. Given an input $x$, one must therefore compute the "scores" $\gamma_j(x) = \tilde x^T\tilde e_j$ and maximize
$$\sum_{j=1}^r \gamma_j(x)\gamma_j(\mu_g) - \frac12\sum_{j=1}^r\gamma_j(\mu_g)^2 + \log\pi_g.$$
so that $V^\perp \subset \mathrm{Null}(\tilde M^TC\tilde M)$, and both spaces coincide because they have the same dimension $(d-r)$. This shows that $V = \mathrm{Null}(\tilde M^TC\tilde M)^\perp = \mathrm{Range}(\tilde M^TC\tilde M)$. Since $\tilde M^TC\tilde M$ is symmetric, $\mathrm{Null}(\tilde M^TC\tilde M)^\perp$ is generated by eigenvectors with non-zero eigenvalues.

Returning to the original variables, we have $\tilde M = M\Sigma_w^{-1/2}$ and $M^TCM = \Sigma_b$, the between-class covariance matrix. This implies that $\tilde M^TC\tilde M = \Sigma_w^{-1/2}\Sigma_b\Sigma_w^{-1/2}$.
Figure 8.1: Left: Original (training) data with three classes. Right: LDA scores, where the x
axis provides γ1 and the y axis γ2 .
We can now describe the LDA learning algorithm with dimension reduction.
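The following Python sketch summarizes the steps described above (spherization, rank-$r$ projection of the spherized class means, score-based classification). Class labels are assumed to be integers, and all function names and conventions are ours.

```python
import numpy as np

def lda_reduced_rank(X, y, r):
    """Sketch of LDA with dimension reduction: spherize with Sigma_w^{-1/2},
    project the spherized class means on their span, classify with scores."""
    classes, Ng = np.unique(y, return_counts=True)
    cg = Ng / len(y)
    mu = X.mean(axis=0)
    mus = np.stack([X[y == g].mean(axis=0) for g in classes])
    # pooled within-class covariance Sigma_w = sum_g c_g Sigma_g
    Sw = sum(cg[i] * np.cov(X[y == g].T, bias=True) for i, g in enumerate(classes))
    # inverse symmetric square root of Sigma_w (assumed invertible)
    vals, vecs = np.linalg.eigh(Sw)
    Sw_isqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    mut = (mus - mu) @ Sw_isqrt             # spherized class means (rows)
    # leading eigenvectors of sum_g c_g mu~_g mu~_g^T, via an SVD
    _, _, Vt = np.linalg.svd(np.sqrt(cg)[:, None] * mut, full_matrices=False)
    E = Vt[:r]                              # orthonormal basis e_1, ..., e_r of V
    gm = mut @ E.T                          # gamma_j(mu_g)
    def predict(x):
        gx = E @ (Sw_isqrt @ (x - mu))      # gamma_j(x)
        return classes[np.argmax(gm @ gx - 0.5 * (gm**2).sum(axis=1) + np.log(cg))]
    return predict
```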
Mean and covariance in feature space We assume the usual construction where $h : R_X \to H$ is a feature function, $H$ a Hilbert space with kernel $K(x,x') = \langle h(x), h(x')\rangle_H$. (The assumption that $H$ is a complete space is here required for the expectations below to be meaningful.)

We now discuss the kernel version of LDA by plugging the feature space representation directly in the classification rule. So, consider $h : R_X \to H$. Let $X : \Omega \to R_X$ be a random variable such that $E(\|h(X)\|_H^2) < \infty$. Then, its mean feature $m = E(h(X))$ is well defined as an element of $H$, and so are the class averages, $m_g = E(h(X) \mid Y = g)$. Following the LDA model, we assume that the operators $S_g$ are all equal to a fixed operator, the within-class covariance operator denoted $S$.
Assuming that $S$ is invertible, one can generalize the LDA classification rule to data represented in feature space by classifying a new input $x$ in class $g$ when
$$\langle h(x)-m,\ S^{-1}(m_g-m)\rangle_H - \frac12\langle m_g-m,\ S^{-1}(m_g-m)\rangle_H + \log\pi_g \tag{8.15}$$
is maximal over all classes. Notice that this is a transcription of the finite-dimensional Bayes rule, but cannot be derived from a generative model, because the assumption that $h(X)$ is Gaussian is not valid in general. (It would require that $h$ takes values in a $d$-dimensional linear space, which would eliminate all interesting kernel representations.) In practice, one works with a regularized empirical version of this rule, classifying $x$ in the class $g$ maximizing
$$\langle h(x)-\mu,\ (\Sigma_w+\rho\,\mathrm{Id}_H)^{-1}(\mu_g-\mu)\rangle_H - \frac12\langle \mu_g-\mu,\ (\Sigma_w+\rho\,\mathrm{Id}_H)^{-1}(\mu_g-\mu)\rangle_H + \log\pi_g, \tag{8.16}$$
where $\mu$ is the average of $h(x_1),\ldots,h(x_N)$. Taking this option, we still need to make this expression computable and remove the dependency on the feature function $h$. Note that $\Sigma_w$ maps $H$ to $V$, which implies that $\Sigma_w + \rho\,\mathrm{Id}_H$ maps $V$ into itself. Moreover, this mapping is onto: if $v \in V$ and $u = (\Sigma_w + \rho\,\mathrm{Id}_H)^{-1}v$, then $u \in V$. Indeed, for any $z \perp V$, we have $\langle z, \Sigma_wu + \rho u\rangle_H = \langle z, v\rangle_H$. We have $\langle z, \Sigma_wu\rangle_H = 0$ (because $\Sigma_w$ maps $H$ to $V$) and $\langle z, v\rangle_H = 0$ (because $v \in V$), so that we can conclude that $\langle z, u\rangle_H = 0$. Since this is true for all $z \perp V$, this requires that $u \in V$.
We now express the classification rule in (8.16) as a function of the kernel associated with the feature-space representation. Denote, for any vector $u \in \mathbb{R}^N$,
$$\xi(u) = \sum_{k=1}^N u^{(k)}h(x_k).$$
We have $\mu_g = \xi(1_g/N_g)$, where $1_g \in \mathbb{R}^N$ is the vector with $k$th coordinate equal to 1 if $y_k = g$ and 0 otherwise. Also $\mu = \xi(\mathbf 1/N)$ (recall that $\mathbf 1$ is the vector with all coordinates equal to 1).
We are now ready to rewrite the kernel LDA classification rule in terms of quantities that only involve $K$.
Dimension reduction Note that $K(PK+\rho\,\mathrm{Id}_{\mathbb{R}^N})^{-1} = K(KPK+\rho K)^{-1}K$ is a symmetric matrix. So, the expression in (8.17) can be written as
$$(v(x)-\bar\eta)^TR^{-1}(\eta_g-\bar\eta) - \frac12(\eta_g-\bar\eta)^TR^{-1}(\eta_g-\bar\eta) + \log\pi_g,$$
with $R = KPK + \rho K$, $\eta_g = K1_g/N_g$ and $\bar\eta = K\mathbf 1/N$. Clearly, if $v_1,\ldots,v_N$ are the column vectors of $K$, we have
$$\eta_g = \frac{1}{N_g}\sum_{k=1}^N v_k1_{y_k=g},\qquad \bar\eta = \frac1N\sum_{k=1}^N v_k.$$
Dimension reduction is then based on the matrix
$$Q = \frac1N\sum_{g\in R_Y}N_g(\eta_g-\bar\eta)(\eta_g-\bar\eta)^T$$
and on the generalized eigenvalue problem
$$Qf_j = \lambda_jRf_j,$$
noting also the identity
$$KPK = \frac1N\sum_{k=1}^N(v_k-\eta_{y_k})(v_k-\eta_{y_k})^T.$$
The first steps of the learning algorithm compute the kernel matrix $K$, the matrices $P$ and $R$, the vectors $\eta_g$, $g \in R_Y$, and $\bar\eta$, and let $Q = \frac1N\sum_{g\in R_Y}N_g(\eta_g-\bar\eta)(\eta_g-\bar\eta)^T$.
(4) Fix $r_0 \le q-1$ and compute the eigenvectors $f_1,\ldots,f_{r_0}$ associated with the $r_0$ largest eigenvalues for the generalized eigenvalue problem $Qf = \lambda Rf$, normalized such that $f_j^TRf_j = 1$.
(5) Compute the scores $\gamma_{jg} = (\eta_g-\bar\eta)^Tf_j$.
(6) Given a new observation $x$, let $v(x)$ be the vector with coordinates $K(x, x_k)$, $k = 1,\ldots,N$. Compute the scores $\gamma_j(x) = (v(x)-\bar\eta)^Tf_j$, $j = 1,\ldots,r_0$. Classify $x$ in the class $g$ maximizing
$$\sum_{i=1}^{r_0}\gamma_i(x)\gamma_{ig} - \frac12\sum_{i=1}^{r_0}\gamma_{ig}^2 + \log\pi_g. \tag{8.18}$$
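A compact Python sketch of steps (4)–(6), using scipy's generalized symmetric eigensolver (which returns eigenvectors normalized so that $f^TRf = 1$). The kernel matrix $K$ is assumed positive definite so that $R$ is invertible; all names are ours.

```python
import numpy as np
from scipy.linalg import eigh

def kernel_lda_fit(K, y, rho, r):
    """Sketch of kernel LDA with dimension reduction: K is the N x N
    training kernel matrix; solves Q f = lambda R f with R = KPK + rho K."""
    N = K.shape[0]
    classes, Ng = np.unique(y, return_counts=True)
    P = np.eye(N) - np.ones((N, N)) / N
    R = K @ P @ K + rho * K
    eta_bar = K.mean(axis=1)                               # eta_bar = K 1 / N
    etas = np.stack([K[:, y == g].mean(axis=1) for g in classes])  # K 1_g / N_g
    Q = sum(n * np.outer(e - eta_bar, e - eta_bar) for n, e in zip(Ng, etas)) / N
    vals, F = eigh(Q, R)                   # generalized problem, f^T R f = 1
    F = F[:, np.argsort(vals)[::-1][:r]]   # keep the r leading eigenvectors
    gamma_means = (etas - eta_bar) @ F     # scores gamma_{jg}
    log_pi = np.log(Ng / N)
    def predict(v):                        # v = (K(x, x_k))_k for a new input x
        g = (v - eta_bar) @ F
        return classes[np.argmax(gamma_means @ g
                                 - 0.5 * (gamma_means**2).sum(1) + log_pi)]
    return predict
```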
where $b$ is a $d\times q$ matrix and $a_0 \in \mathbb{R}^q$. Letting as before $\beta$ be the matrix with $a_0^T$ added as first row to $b$, and $\mathcal X$ the $N\times(d+1)$ matrix with rows $(1, x_k^T)$, $k = 1,\ldots,N$, one gets the least-squares estimator $\hat\beta = (\mathcal X^T\mathcal X)^{-1}\mathcal X^T\mathcal Y$, where $\mathcal Y$ is the $N\times q$ matrix of stacked $\theta_{y_k}^T$ row vectors.

Given an input vector $x$, the row vector $\tilde x^T\beta$ will generally not coincide with one of the score vectors. Assignment to a class can then be made by minimizing $|a_0 + b^Tx - \theta_g|$ over all $g$ in $R_Y$.
Since the scores θ are free to choose, one may also try to optimize them, resulting
in the optimal scoring algorithm. To describe it, we will need the notation already
introduced for LDA, plus the following. We will write, for short, $\theta_j = \theta(g_j)$ and introduce the $q\times r$ matrix
$$\Theta = \begin{pmatrix}\theta_1^T\\ \vdots\\ \theta_q^T\end{pmatrix}.$$
We also denote by $\rho_1,\ldots,\rho_r$ the column vectors of $\Theta$, so that $\Theta = [\rho_1,\ldots,\rho_r]$. Let $u_{g_i}$, for $i = 1,\ldots,q$, denote the $q$-dimensional vector with $i$th coordinate equal to 1 and all others equal to 0. As before, $N_g$ denotes the class sizes, $c_g = N_g/N$, $C$ is the diagonal matrix with coefficients $c_{g_1},\ldots,c_{g_q}$ and $\zeta = (c_{g_1},\ldots,c_{g_q})^T$.
The goal of optimal scoring is to minimize, now with respect to $\theta$, $a_0$ and $b$, the function
$$F(\theta, a_0, b) = \sum_{k=1}^N|\theta(y_k) - a_0 - b^Tx_k|^2.$$
Some normalizing conditions are clearly needed, because this problem is under-
constrained. (In the form above, the optimal choice is to take all free parameters
equal to 0.) We now discuss the various indeterminacies and redundancies in the
model:
(a) If $R$ is an $r\times r$ orthogonal matrix, then $F(R\theta, Ra_0, bR^T) = F(\theta, a_0, b)$, yielding an infinity of possible equivalent solutions (that all lead to the same classification rule). This implies that there is no loss of generality in assuming that $\Theta^TC\Theta$ is diagonal (introducing $C$ here will turn out to be convenient). Indeed, given any $(\theta, a_0, b)$, one can just take $R$ such that $R\Theta^TC\Theta R^T$ is diagonal and replace $\Theta$ by $R\Theta$, $a_0$ by $Ra_0$ and $b$ by $bR^T$ to get an equivalent solution satisfying the constraint.
(b) Let $D$ be an $r$ by $r$ diagonal matrix with positive entries. Replace $\theta$, $a_0$ and $b$ respectively by $D\theta$, $Da_0$ and $bD$. The resulting objective function is
$$F(D\theta, Da_0, bD) = \sum_{k=1}^N|D\theta(y_k) - Da_0 - Db^Tx_k|^2 = \sum_{j=1}^r d_{jj}^2\sum_{k=1}^N\Big(\theta(y_k, j) - a_0(j) - \sum_{i=1}^d b(i,j)x_k(i)\Big)^2.$$
If the coefficient $d_{jj}$ is free to choose, then the objective function can always be reduced by letting $d_{jj}\to 0$, which removes one of the dimensions in $\theta$. In order to avoid this, one needs to fix the diagonal values of $\Theta^TC\Theta$, and, by symmetry, it is natural to require $\Theta^TC\Theta = \mathrm{Id}_{\mathbb{R}^r}$.
(c) Given any $\delta \in \mathbb{R}^r$, one has $F(\theta, a_0, b) = F(\theta-\delta, a_0+\delta, b)$, with identical classification rule. One can therefore without loss of generality introduce $r$ linear constraints, which will be taken below in the form $\Theta^T\zeta = 0$.

Given this reduction, we can now describe the optimal scoring problem as the minimization of
$$\sum_{k=1}^N|\theta_{y_k} - a_0 - b^Tx_k|^2$$
subject to $\Theta^TC\Theta = \mathrm{Id}_{\mathbb{R}^r}$ and $\Theta^T\zeta = 0$.
Minimizing first with respect to $a_0$ and $b$ yields a problem in $\Theta$ alone, subject to the same constraints. Using the facts that $\theta_{y_k} = \Theta^Tu_{y_k}$, that
$$\sum_{k=1}^N u_{y_k}u_{y_k}^T = \sum_{g\in R_Y}N_gu_gu_g^T = NC$$
and that
$$\sum_{k=1}^N u_{y_k}(x_k-\mu)^T = \sum_{g\in R_Y}\sum_{k:y_k=g}u_g(x_k-\mu)^T = \sum_{g\in R_Y}u_gN_g(\mu_g-\mu)^T = NCM,$$
one finds that the objective is, up to terms that do not depend on $\Theta$, equal to
$$-2\,\mathrm{trace}(\Theta^TCM\Sigma_{XX}^{-1}M^TC\Theta) + \mathrm{trace}(\Theta^TCM\Sigma_{XX}^{-1}M^TC\Theta),$$
i.e., the problem is to maximize
$$\mathrm{trace}(\Theta^TCM\Sigma_{XX}^{-1}M^TC\Theta)$$
subject to $\Theta^TC\Theta = \mathrm{Id}_{\mathbb{R}^r}$ and $\Theta^T\zeta = 0$. We now recall the following linear algebra result (see chapter 2).
Proposition 8.7 Let $A$ and $B$ be respectively positive definite and non-negative semi-definite symmetric $q$ by $q$ matrices. Then, the maximum, over all $q$ by $r$ matrices $S$ such that $S^TAS = \mathrm{Id}_{\mathbb{R}^r}$, of $\mathrm{trace}(S^TBS)$ is attained at $S = [\sigma_1,\ldots,\sigma_r]$, where the column vectors $\sigma_1,\ldots,\sigma_r$ are the solutions of the generalized eigenvalue problem
$$B\sigma = \lambda A\sigma$$
associated with the largest eigenvalues, normalized so that $\sigma_i^TA\sigma_i = 1$ for $i = 1,\ldots,r$.
Given this proposition, let $\rho_1,\ldots,\rho_r$ be the $r$ first eigenvectors for the problem
$$CM\Sigma_{XX}^{-1}M^TC\rho = \lambda C\rho. \tag{8.19}$$
Assume that $r$ is small enough so that the associated eigenvalues are not zero. Let $\Theta = [\rho_1,\ldots,\rho_r]$. We now prove that $\Theta$ is indeed a solution of the optimal scoring problem; the only point left to check is that this $\Theta$ satisfies the constraint $\Theta^T\zeta = 0$. But we have
$$M^TC\mathbf 1_q = \sum_g c_g(\mu_g-\mu) = 0,$$
so that, for any eigenvector $\rho$ with eigenvalue $\lambda \ne 0$, $\zeta^T\rho = \frac1\lambda \mathbf 1_q^TCM\Sigma_{XX}^{-1}M^TC\rho = 0$.
From $b = \Sigma_{XX}^{-1}M^TC\Theta$ and $\Theta = MbD^{-1}$ we see that
$$bD = \Sigma_{XX}^{-1}M^TCMb,$$
so that $\Sigma_bb = \Sigma_{XX}bD$. This shows that the columns of $b$ are solutions of the eigenvalue problem $\Sigma_bu = \lambda\Sigma_{XX}u$. Moreover, from $\Theta^TC\Theta = \mathrm{Id}_{\mathbb{R}^r}$, we get $b^T\Sigma_bb = D^2$. Since $b^T\Sigma_bb = b^T\Sigma_{XX}bD$, we get that $b$ must be normalized so that $b^T\Sigma_{XX}b = D$.
This shows that the solution of the optimal scoring problem can be reformulated uniquely in terms of $b$: if $b_1,\ldots,b_r$ are the $r$ principal solutions of the eigenvalue problem $\Sigma_bu = \lambda\Sigma_{XX}u$, normalized so that $u^T\Sigma_{XX}u = \lambda$, a new input is classified into the class $g$ minimizing
$$\sum_{j=1}^r\gamma_j(\mu_g)^2/\lambda_j^2 - 2\sum_{j=1}^r\gamma_j(x)\gamma_j(\mu_g)/\lambda_j.$$
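A sketch of this reformulation, assuming (consistently with the LDA discussion above) that the scores of an input are $\gamma_j(x) = b_j^T(x-\mu)$; this convention, and all names below, are ours.

```python
import numpy as np
from scipy.linalg import eigh

def optimal_scoring_fit(X, y, r):
    """Sketch of optimal scoring reformulated in terms of b: solve
    Sigma_b u = lambda Sigma_XX u, normalize u^T Sigma_XX u = lambda,
    and classify by the quadratic score rule. r is assumed small enough
    for the r leading eigenvalues to be nonzero."""
    N = len(y)
    classes, Ng = np.unique(y, return_counts=True)
    mu = X.mean(axis=0)
    mus = np.stack([X[y == g].mean(axis=0) for g in classes])
    Sxx = (X - mu).T @ (X - mu) / N
    Sb = sum(n * np.outer(m - mu, m - mu) for n, m in zip(Ng, mus)) / N
    vals, U = eigh(Sb, Sxx)                 # eigh normalizes u^T Sxx u = 1
    idx = np.argsort(vals)[::-1][:r]
    lam = vals[idx]
    B = U[:, idx] * np.sqrt(lam)            # rescale so u^T Sxx u = lambda
    gm = (mus - mu) @ B                     # gamma_j(mu_g)
    def predict(x):
        gx = (x - mu) @ B                   # gamma_j(x)
        return classes[np.argmin((gm**2 / lam**2).sum(1)
                                 - 2 * (gx * gm / lam).sum(1))]
    return predict
```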
Remark 8.8 The following computation shows that optimal scoring is closely related to LDA. Recall the identity $\Sigma_{XX} = \Sigma_w + \Sigma_b$. It implies that a solution of $\Sigma_bu = \lambda\Sigma_{XX}u$ is also a solution of $\Sigma_bu = \tilde\lambda\Sigma_wu$ with $\tilde\lambda = \lambda/(1-\lambda)$. If $u^T\Sigma_{XX}u = \lambda$, then
$$u^T\Sigma_wu = \lambda - u^T\Sigma_bu = \lambda - \lambda^2 = \frac{\tilde\lambda}{(1+\tilde\lambda)^2},$$
and
$$\sum_{j=1}^r\gamma_j(\mu_g)^2/\lambda_j^2 - 2\sum_{j=1}^r\gamma_j(x)\gamma_j(\mu_g)/\lambda_j = \sum_{j=1}^r\tilde\gamma_j(\mu_g)^2/\tilde\lambda_j - 2\sum_{j=1}^r\gamma_j(x)\gamma_j(\mu_g)/(1+\tilde\lambda_j).$$
Remark 8.9 Optimal scoring can be modified by adding a penalty in the form
$$\gamma\sum_{i=1}^r b_i^T\Omega b_i = \gamma\,\mathrm{trace}(b^T\Omega b) \tag{8.21}$$
where $\Omega$ is a weight matrix. This only modifies the previous discussion by adding $\gamma\Omega/N$ to both $\Sigma_{XX}$ and $\Sigma_w$.
Let $h : R_X \to H$ be the feature function and $K$ the associated kernel, as usual. Optimal scoring in feature space requires to minimize
$$\sum_{k=1}^N|\theta_{y_k} - a_0 - b(h(x_k))|^2 + \gamma\|b\|_H^2,$$
It is once again clear (and the argument is left to the reader) that the problem can be reduced to the finite dimensional space $V = \mathrm{span}(h(x_1),\ldots,h(x_N))$, and that the optimal $b_1,\ldots,b_r$ must take the form
$$b_j = \sum_{l=1}^N \alpha_{lj}h(x_l).$$
Introduce the kernel matrix $K = K(x_1,\ldots,x_N)$ with $k$th column denoted $K^{(k)}$. Let $\alpha$ be the $N$ by $r$ matrix with entries $\alpha_{kj}$, $k = 1,\ldots,N$, $j = 1,\ldots,r$. Then $b(h(x_k))$ is the vector with coordinates
$$\langle b_j, h(x_k)\rangle = \sum_{l=1}^N\alpha_{lj}K(x_k, x_l),$$
so that the problem is reduced to penalized optimal scoring, with $x_k$ replaced by $K^{(k)}$, $b$ replaced by $\alpha$ and the matrix $\Omega$ in (8.21) replaced by $K$. Introducing the matrix $P = \mathrm{Id}_{\mathbb{R}^N} - \mathbf 1\mathbf 1^T/N$ and $K_c = PK$, the covariance matrix $\Sigma_{XX}$ becomes $K_c^TK_c/N = KPK/N$. The class averages $\mu_g$ are equal to $K1(g)/N_g$ while $\mu = K\mathbf 1/N$, so that the matrix $M$ is equal to
$$M = \begin{pmatrix}1(g_1)^T/N_{g_1} - \mathbf 1^T/N\\ \vdots\\ 1(g_q)^T/N_{g_q} - \mathbf 1^T/N\end{pmatrix}K.$$
In this setting, the between-class matrix is
$$Q = \sum_{g\in R_Y}\frac{N_g}{N}\left(\frac{1(g)}{N_g}-\frac{\mathbf 1}{N}\right)\left(\frac{1(g)}{N_g}-\frac{\mathbf 1}{N}\right)^T,$$
and the generalized eigenvalue problem becomes
$$KQK\rho = \frac1N(KPK+\gamma K)\rho.$$
In terms of the kernel, the scores of a new input $x$ are computed through
$$\langle b_i, h(x)\rangle_H = \sum_{k=1}^N\alpha_{ki}K(x, x_k)$$
and
$$a_0(i) = \frac1N\sum_{k,l=1}^N\alpha_{ki}K(x_l, x_k).$$
In this whole section, we restrict to two-class problems, and let $R_Y = \{-1, 1\}$. Given $a_0 \in \mathbb{R}$ and $b \ne 0 \in \mathbb{R}^d$, the equation $a_0 + b^Tx = 0$ defines a hyperplane in $\mathbb{R}^d$. The function $f(x) = \mathrm{sign}(a_0 + x^Tb)$ defines a classifier that attributes a class $\pm 1$ to $x$ according to which side of the hyperplane it belongs to (we ignore the ambiguity when $x$ is on the hyperplane). With this notation, a pair $(x, y)$, where $y$ is the true class, is correctly classified if and only if $y(a_0 + x^Tb) > 0$.
Figure 8.2: The green line is preferable to the purple one in order to separate the data.
This leads to the maximum margin separating hyperplane classifier, also called
linear SVM, introduced by Vapnik and Chervonenkis [198, 199].
If the data is not separable, there is no feasible point for this problem. To also account for this situation (which is common), we can replace the constraint by a penalty and minimize, with respect to $a_0$ and $b$:
$$\frac{|b|^2}{2} + \gamma\sum_{k=1}^N\big(1 - y_k(a_0 + x_k^Tb)\big)_+$$
for some $\gamma > 0$. (Recall that $x_+ = \max(x, 0)$.) This is equivalent to minimizing the perceptron objective function, with $\delta = 1$, and with an additional penalty term equal to $|b|^2/(2\gamma)$. This minimization problem is equivalent to a quadratic programming problem obtained by introducing slack variables $\xi_k$, $k = 1,\ldots,N$ and minimizing
$$\frac12|b|^2 + \gamma\sum_{k=1}^N\xi_k,$$
(i) First consider indices $k$ such that $(x_k, y_k)$ is correctly classified beyond the margin, i.e., $y_k(a_0 + x_k^Tb) > 1$. The last KKT condition and the constraint $\xi_k \ge 0$ require $\alpha_k = 0$, and the third one then gives $\xi_k = 0$.
(ii) For samples that are misclassified or correctly classified below the margin⁵, i.e., $y_k(a_0 + x_k^Tb) < 1$, the constraint $y_k(a_0 + x_k^Tb) + \xi_k \ge 1$ implies $\xi_k > 0$, so that $\alpha_k = \gamma$ and $y_k(a_0 + x_k^Tb) + \xi_k = 1$.

⁵ Note that, even if the training data is linearly separable, there are generally samples that are on the right side of the hyperplane, but at a distance to the hyperplane strictly lower than the "nominal margin" $C = 1/|b|$. This is due to our relaxation of the original problem of finding a separating hyperplane with maximal margin.
(iii) If $(x_k, y_k)$ is correctly classified exactly at the margin, then $\xi_k = 0$ and there is no constraint on $\alpha_k$ beside belonging to $[0, \gamma]$. Training samples that lie exactly at the margin are called support vectors.

If no support vector is found, then $a_0$ is not uniquely determined, and can be any value such that $y_k(a_0 + b^Tx_k) \ge 1$ if $\alpha_k = 0$ and $y_k(a_0 + b^Tx_k) \le 1$ if $\alpha_k = \gamma$. This shows that $a_0$ can be any point in an interval $[\beta_0^-, \beta_0^+]$.

In the kernel setting, one minimizes
$$\frac12\|b\|_H^2 + \gamma\sum_{k=1}^N\xi_k,$$
Let $V = \mathrm{span}(h(x_1),\ldots,h(x_N))$. The usual projection argument implies that the optimal $b$ must belong to $V$ and therefore take the form
$$b = \sum_{k=1}^N u_kh(x_k),$$
so that the problem is to minimize the objective above subject to
$$y_k\Big(a_0 + \sum_{l=1}^N K(x_k, x_l)u_l\Big) + \xi_k \ge 1$$
(and $\xi_k \ge 0$) for $k = 1,\ldots,N$.
The Lagrangian of this problem is
$$L = \frac12\sum_{k,l=1}^N u_ku_lK(x_k,x_l) + \gamma\sum_{k=1}^N\xi_k - \sum_{k=1}^N\eta_k\xi_k - \sum_{k=1}^N\alpha_k\Big(y_k\Big(a_0 + \sum_{l=1}^N K(x_k,x_l)u_l\Big) + \xi_k - 1\Big),$$
which can be rewritten as
$$L = \frac12u^TKu + \xi^T(\gamma\mathbf 1 - \eta - \alpha) - a_0\alpha^Ty - (\alpha\odot y)^TKu + \alpha^T\mathbf 1,$$
where $\alpha\odot y$ is the vector with coordinates $\alpha_ky_k$. The infimum of $L$ is $-\infty$ unless $\gamma\mathbf 1 - \eta - \alpha = 0$ and $\alpha^Ty = 0$. If these identities are true, then the optimal $u$ is $u = \alpha\odot y$ and the minimum of $L$ is
$$-\frac12(\alpha\odot y)^TK(\alpha\odot y) + \alpha^T\mathbf 1.$$
The dual problem is therefore to minimize
$$\frac12(\alpha\odot y)^TK(\alpha\odot y) - \alpha^T\mathbf 1 = \frac12\alpha^T(K\odot yy^T)\alpha - \alpha^T\mathbf 1$$
subject to $\gamma\mathbf 1 - \eta - \alpha = 0$ and $\alpha^Ty = 0$.
This is exactly the same problem as the one we obtained in the linear case, up to the replacement of the Euclidean inner products $x_k^Tx_l$ by the kernel evaluations $K(x_k, x_l)$. Given the solution of the dual problem, the optimal $b$ is
$$b = \sum_{k=1}^N u_kh(x_k) = \sum_{k=1}^N\alpha_ky_kh(x_k).$$
Similarly to the linear case, the coefficient $a_0$ can be identified using a support vector, or is otherwise not uniquely determined. More precisely, if one of the $\alpha_k$'s is strictly between 0 and $\gamma$, then $a_0$ is given by $a_0 = y_k - \sum_l\alpha_ly_lK(x_k, x_l)$. Otherwise, $a_0$ is any number between
$$a_0^- = \max\Big\{y_k - \sum_l\alpha_ly_lK(x_k,x_l) : (y_k = 1\text{ and }\alpha_k = 0)\text{ or }(y_k = -1\text{ and }\alpha_k = \gamma)\Big\}$$
and
$$a_0^+ = \min\Big\{y_k - \sum_l\alpha_ly_lK(x_k,x_l) : (y_k = -1\text{ and }\alpha_k = 0)\text{ or }(y_k = 1\text{ and }\alpha_k = \gamma)\Big\}.$$
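Given a solution $\alpha$ of the dual problem, the decision rule and the recovery of $a_0$ from a support vector can be sketched as follows; the tolerance `1e-8` is an arbitrary choice of ours, and the dual solver itself is assumed given.

```python
import numpy as np

def svm_decision(alpha, y, a0, K_eval):
    """Sketch of the kernel SVM rule f(x) = sign(a0 + sum_k alpha_k y_k K(x, x_k)).
    K_eval: matrix of K(x, x_k) values for the test inputs (one row per input)."""
    return np.sign(a0 + K_eval @ (alpha * y))

def intercept_from_support_vector(alpha, y, K, gamma):
    """Recover a0 from any index with 0 < alpha_k < gamma, as in the text."""
    k = np.flatnonzero((alpha > 1e-8) & (alpha < gamma - 1e-8))[0]
    return y[k] - K[k] @ (alpha * y)
```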
Chapter 9
Nearest-Neighbor Methods
9.1.1 Consistency
We let $R_X$ denote the input space, and $R_Y = \mathbb{R}^q$ be the output space. We assume that a distance, denoted dist, is defined on $R_X$. This means that $\mathrm{dist} : R_X\times R_X \to [0, +\infty]$ (we allow for infinite values) is a symmetric function such that $\mathrm{dist}(x, x') = 0$ if and only if $x = x'$ and, for all $x, x', x'' \in R_X$, the triangle inequality $\mathrm{dist}(x, x'') \le \mathrm{dist}(x, x') + \mathrm{dist}(x', x'')$ holds. Given $x \in R_X$, let
$$D_T(x) = (\mathrm{dist}(x, x_k),\ k = 1,\ldots,N)$$
be the collection of all distances between $x$ and the inputs in the training set. We consider regression estimators taking the form
$$\hat f(x) = \sum_{k=1}^N W_k(x)y_k \tag{9.1}$$
We will, more precisely, use the following construction [184]. Assume that a family of numbers $w_1 \ge w_2 \ge \cdots \ge w_N \ge 0$ is chosen, with $\sum_{j=1}^N w_j = 1$. Given $x \in \mathbb{R}^d$ and $k \in \{1,\ldots,N\}$, we let $r_k^+(x)$ denote the number of indexes $k'$ such that $\mathrm{dist}(x, x_{k'}) \le \mathrm{dist}(x, x_k)$ and $r_k^-(x)$ the number of such indexes such that $\mathrm{dist}(x, x_{k'}) < \mathrm{dist}(x, x_k)$. The coefficients defining $\hat f$ in (9.1) are then chosen as:
$$W_k(x) = \frac{\sum_{k'=r_k^-(x)+1}^{r_k^+(x)} w_{k'}}{r_k^+(x) - r_k^-(x)}. \tag{9.2}$$
To emphasize the role of $(w_1,\ldots,w_N)$ in this definition, we will denote the resulting estimator as $\hat f_w$. If there is no tie in the sequence of distances between $x$ and elements of the training set, then $r_k^+(x) = r_k^-(x)+1$ is the rank of $x_k$ when training data is ordered according to their proximity to $x$, and $W_k(x) = w_{r_k^+(x)}$. In this case, defining $l_1,\ldots,l_N$ such that $\mathrm{dist}(x, x_{l_1}) < \cdots < \mathrm{dist}(x, x_{l_N})$, we have
$$\hat f_w(x) = \sum_{j=1}^N w_jy_{l_j}.$$
In the general case, the weights wj associated with tied observations are averaged.
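A direct Python transcription of (9.2), handling ties by averaging as just described; the function name is ours.

```python
import numpy as np

def stone_weights(dists, w):
    """Sketch of the weights W_k(x) in (9.2): w is a nonincreasing probability
    vector (w_1 >= ... >= w_N >= 0, sum = 1), dists contains dist(x, x_k)."""
    N = len(dists)
    W = np.empty(N)
    cw = np.concatenate([[0.0], np.cumsum(w)])
    for k in range(N):
        r_minus = np.sum(dists < dists[k])          # r_k^-(x)
        r_plus = np.sum(dists <= dists[k])          # r_k^+(x)
        W[k] = (cw[r_plus] - cw[r_minus]) / (r_plus - r_minus)
    return W
```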
Theorem 9.1 ([184]) Assume that $E(Y^2) < \infty$. Assume that, for each $N$, a sequence $w^{(N)} = (w_1^{(N)} \ge \cdots \ge w_N^{(N)} \ge 0)$ is chosen with $\sum_{j=1}^N w_j^{(N)} = 1$. Assume, in addition, that
(i) $\lim_{N\to\infty} w_1^{(N)} = 0$;
(ii) $\lim_{N\to\infty}\sum_{j\ge\alpha N} w_j^{(N)} = 0$, for some $\alpha \in (0, 1)$.
Then the corresponding estimator $\hat f_{w^{(N)}}$ converges in the $L^2$ norm to $E(Y\mid X)$:
$$E\big(|\hat f_{w^{(N)}}(X) - E(Y\mid X)|^2\big) \to 0.$$
For nearest-neighbor regression, (i) and (ii) mean that the number of nearest neighbors $p_N$ must be chosen such that $p_N \to \infty$ and $p_N/N \to 0$.
The proof starts from the decomposition
$$\hat f_w(X) - E(Y\mid X) = \sum_{k=1}^N W_k(X)(f(X_k)-f(X)) + \sum_{k=1}^N W_k(X)(Y_k - f(X_k)). \tag{9.3}$$
It therefore suffices to study the limit of $E\big(\sum_k W_k(X)(f(X_k)-f(X))^2\big)$ (which, by Jensen's inequality, dominates the expected square of the first sum). Fix $\epsilon > 0$. By assumption, there exist $M, a > 0$ such that $|f(x)| \le M$ for all $x$ and $|x-y| \le a \Rightarrow |f(x)-f(y)|^2 \le \epsilon$. Then
$$E\Big(\sum_kW_k(X)(f(X_k)-f(X))^2\Big) = E\Big(\sum_kW_k(X)(f(X_k)-f(X))^21_{|X_k-X|\le a}\Big) + E\Big(\sum_kW_k(X)(f(X_k)-f(X))^21_{|X_k-X|>a}\Big) \le 2\epsilon + 4M^2E\Big(\sum_kW_k(X)1_{|X_k-X|>a}\Big).$$
Since $\epsilon$ can be made arbitrarily small, we need to show that, for any positive $a$, the second term in the upper bound tends to 0 when $N\to\infty$. We will use the following fact, which requires some minor measure theory argument to prove rigorously. Define
$$S = \{x : \forall\delta > 0,\ P(|X-x| < \delta) > 0\}.$$
This set is called the support of $X$. Then, one can show that $P(X\in S) = 1$. This means that, if $\tilde X$ is independent from $X$ with the same distribution, then, for any $\delta > 0$, $P(|X-\tilde X| < \delta\mid X) > 0$ with probability one.
Let $N_a(x) = |\{k : |X_k - x| \le a\}|$. We have, for all $x\in S$ and $a > 0$, using the law of large numbers,
$$\frac{N_a(x)}{N} = \frac1N\sum_{k=1}^N 1_{|X_k-x|\le a} \to P(|X-x|\le a) > 0,$$
and both terms in the upper bound converge to 0. This shows that the first sum in (9.3) tends to 0.
The cross products in the last term vanish because $E(Z_k\mid X_k) = 0$ (where $Z_k = Y_k - f(X_k)$) and the samples are independent. So it only remains to consider
$$E\Big(\sum_{k=1}^N W_k(X)^2E(Z_k^2\mid X_k)\Big).$$
Recall that the weights $W_k$ are functions of $X$ and of the whole training set, and we will need to make this dependency explicit and write $W_k(X, T_X)$ where $T_X = (X_1,\ldots,X_N)$. Similarly, the ranks in (9.2) will be written $r_j^+(X, T_X)$ and $r_j^-(X, T_X)$. Because $X, X_1,\ldots,X_N$ are i.i.d., we can switch the roles of $X$ and $X_k$ in the $k$th term of the sum, yielding
$$E\Big(\sum_{k=1}^N W_k(X, T_X)h(X_k)\Big) = E\Big(\sum_{k=1}^N W_k(X_k, T_X^{(k)})h(X)\Big)$$
with $T_X^{(k)} = (X_1,\ldots,X_{k-1},X,X_{k+1},\ldots,X_N)$ (here, $h(x) = E(Z_k^2\mid X_k = x)$). We now show that $\sum_{k=1}^N W_k(X_k, T_X^{(k)})$ is bounded independently of $X, X_1,\ldots,X_N$.
Fixing $\delta$, let $C_d(\delta)$ be the minimal number of such cones needed to cover $\mathbb{R}^d$. Choosing such a covering $\Gamma(u_1,\delta),\ldots,\Gamma(u_M,\delta)$ where $M = C_d(\delta)$, we define the following subsets of $\{1,\ldots,N\}$:
$$I_0 = \{k : X_k = X\},\qquad I_q = \{k\notin I_0 : X_k - X\in\Gamma(u_q,\delta)\},\quad q = 1,\ldots,M.$$
If $k\in I_0$, then $r_k^-(X_k, T_X^{(k)}) = 0$ and $r_k^+(X_k, T_X^{(k)}) = c$ with $c = |I_0|$. This implies that, for $k\in I_0$, we have $W_k(X_k, T_X^{(k)}) = \sum_{j=1}^c w_j/c$ and
$$\sum_{k\in I_0}W_k(X_k, T_X^{(k)}) = \sum_{j=1}^c w_j.$$
For $q \ge 1$ and the indices $i_j \in I_q$, ordered by increasing distance to $X$, one has
$$W_{i_j}(X_{i_j}, T_X^{(i_j)}) \le \frac1{c+1}\sum_{j'=j}^{c+j}w_{j'}$$
and
$$\sum_{k\in I_q}W_k(X_k, T_X^{(k)}) \le \sum_{j=1}^N\frac1{c+1}\sum_{j'=j}^{c+j}w_{j'} = \frac1{c+1}\Big(\sum_{j'=1}^c j'w_{j'} + (c+1)\sum_{j'=c+1}^N w_{j'}\Big).$$
This yields
$$\sum_{k=1}^N W_k(X_k, T_X^{(k)}) \le \sum_{j=1}^c w_j + C_d(\delta)\frac1{c+1}\Big(\sum_{j'=1}^c j'w_{j'} + (c+1)\sum_{j'=c+1}^N w_{j'}\Big) \le C_d(\delta) + 1.$$
We therefore have
$$E\Big(\sum_{k=1}^N W_k(X)^2E(Z_k^2\mid X_k)\Big) \le w_1(C_d(\delta)+1)E(h(X)) \to 0,$$
since $w_1 = w_1^{(N)}\to 0$ by assumption (i).
Theorem 9.1 is proved in Stone [184] with weaker hypotheses allowing for more
flexibility in the computation of distances, in which, for example, differences X − Xi
can be normalized by dividing them by a factor σi that may depend on the training
set. These relaxed assumptions slightly complicate the proof, and we refer the reader
to Stone [184] for a complete exposition.
9.1.2 Optimality
The NN method can be shown to be optimal over some classes of functions. Opti-
mality is in the min-max sense, and works as follows. We assume that the regression
function f (x) = E(Y | X = x) belongs to some set F of real-valued functions on Rd .
Most of the time, the estimation methods must be adapted to a given choice of F ,
and various choices have arisen in the literature: classes of functions with r bounded
derivatives, Sobolev or related spaces, functions whose Fourier transforms has given
properties, etc.
Since $\hat f_N$ is computed from a random sample, this error is a random variable. One can study, when $b_N\to 0$, the probability
$$P_f\big(\|\hat f_N - f\|_2^2 \ge c\,b_N\big)$$
for some constant $c$ and, for example, for the model $Y = f(X) + \text{noise}$. Here, the notation $P_f$ refers to the model assumption indicating the unobserved function $f$. This quantity now only depends on the estimation algorithm. One defines the notion of "lower convergence rate" as a sequence $b_N$ such that, for any choice of the estimation algorithm, $M_N(c)$ can be found arbitrarily close to 1 (i.e., $\|\hat f_N - f\|_2^2 \ge c\,b_N$ with arbitrarily high probability for all $f\in\mathcal F$), for arbitrarily large $N$ (and for some choice of $c$). So, if $b_N$ is a lower convergence rate, then, for every estimator, there exists a constant $c$ such that the accuracy $c\,b_N$ cannot be achieved.
On the other hand, one says that $b_N$ is an achievable rate of convergence if there exists an estimator such that, for some $c_0$, the probability $P_f\big(\|\hat f_N - f\|_2^2 \ge c_0b_N\big)$ tends to 0 uniformly over $f\in\mathcal F$. This says that for large $N$, and for some $c_0$, the accuracy is higher than $c_0b_N$ for the given estimator. Notice the difference: a lower rate holds for all estimators, and an achievable rate for at least one estimator.
The final definition of a min-max optimal rate is that it is both a lower rate and
an achievable rate (obviously for different constants c and c0 ). And an estimator is
optimal in the min-max sense if it achieves an optimal rate.
One can show that the $p$-NN estimator is optimal (under some assumptions on the ratio $p_N/N$) when $\mathcal F$ is the class of Lipschitz functions on $\mathbb{R}^d$, i.e., the class of functions such that there exists a constant $K$ with
$$|f(x) - f(y)| \le K|x - y|$$
for all $x, y\in\mathbb{R}^d$. In this case, the optimal rate is $b_N = N^{-1/(2+d)}$ (notice again the "curse of dimensionality": to achieve a given accuracy in the worst case, the number of data points must grow exponentially with the dimension).
If the function class consists of smoother functions (for example, several deriva-
tives), the p-NN method is not optimal. This is because the local averaging method
is too crude when one knows already that the function is smooth. But it can be
modified (for example by fitting, using least squares, a polynomial of some degree
instead of computing an average) in order to obtain an optimal rate.
Theorem 9.1 may be applied, for $y\in R_Y$, to the function $f_y(x) = \pi(y\mid x) = E(1_{Y=y}\mid X = x)$, which allows one to interpret the estimator $\hat\pi(y\mid x)$ as a nearest-neighbor predictor of the random variable $1_{Y=y}$ as a function of $X$. We therefore obtain the consistency of the estimated posteriors when $N\to\infty$ under the same assumptions as those of theorem 9.1. This implies that, for large $N$, the classification will be close to Bayes's rule.
An asymptotic comparison with Bayes's rule can already be made with $p = 1$. Let $\hat y_N(x)$ be the 1-NN estimator of $Y$ given $x$ and a training set of size $N$, and let $\hat y(x)$ be the Bayes estimator. We can compute the Bayes error by
$$P(\hat y(X)\ne Y) = 1 - P(\hat y(X) = Y) = 1 - E\big(P(\hat y(X) = Y\mid X)\big) = 1 - E\big(\max_{y\in R_Y}\pi(y\mid X)\big).$$
Let us make the assumption that nearest neighbors are not tied (with probability one). Let $k^*(x, T)$ denote the index of the nearest neighbor to $x$ in the training set $T$. Now, assume the continuity of $x\mapsto\pi(g\mid x)$ (although the result can be proved without this simplifying assumption). We know that $X_{k^*(X,T)}\to X$ when $N\to\infty$ (see the proof of theorem 9.1), which implies that $\pi(g\mid X_{k^*(X,T)})\to\pi(g\mid X)$ and, at the limit,
$$P(\hat y_N(X) = Y\mid X)\to\sum_{g\in R_Y}\pi(g\mid X)^2.$$
This implies that the asymptotic 1-NN misclassification error is always smaller than 2 times the Bayes error, that is,
$$1 - E\Big(\sum_{g\in R_Y}\pi(g\mid X)^2\Big) \le 2\Big(1 - E\big(\max_g\pi(g\mid X)\big)\Big).$$
Indeed, the left-hand term is smaller than $1 - E(\max_g\pi(g\mid X)^2)$ and the result comes from the fact that, for any $t\in\mathbb{R}$, $1 - t^2 \le 2 - 2t$.
Remark 9.2 Nearest-neighbor methods may require large computation time, since, for a given $x$, the number of comparisons which are needed is the size of the training set. However, efficient (tree-based) search algorithms can be used in many cases to reduce it to a logarithm of the size of the database, which is acceptable. A reduction of the size of the training set by clustering is also a possibility for improving the efficiency.
This replaces the standard Euclidean norm (the method can be made more robust by adding $\epsilon\,\mathrm{Id}_{\mathbb{R}^d}$ to $\Sigma_b^*$).
For each input $x\in\mathbb{R}^d$, assume that one can make small transformations without changing the class of $x$. We model these transformations as parametrized functions $x\mapsto x_\theta = \varphi(x,\theta)\in\mathbb{R}^d$, such that $\varphi(x, 0) = x$ and $\varphi$ is smooth in $\theta$, which is a $q$-dimensional parameter. The assumption is that $\varphi(x,\theta)$ and $x$ should be from the same class, at least for small $\theta$. This will be used to improve on the Euclidean distance on $\mathbb{R}^d$.

Take $x, x'\in\mathbb{R}^d$. Ideally, one would like to use the distance $D(x, x') = \inf_{\theta,\theta'}\mathrm{dist}(x_\theta, x'_{\theta'})$, where $\theta$ and $\theta'$ are restricted to a small neighborhood of 0. A more tractable expression can be based on the first-order approximations
$$x_\theta \simeq x + \nabla_\theta\varphi(x,0)u = x + \sum_{i=1}^q u_i\partial_{\theta_i}\varphi(x,0)$$
and
$$x'_{\theta'} \simeq x' + \nabla_\theta\varphi(x',0)u' = x' + \sum_{i=1}^q u'_i\partial_{\theta_i}\varphi(x',0).$$
The computation now is a simple least-squares problem, for which the solution is given by the system
$$\begin{pmatrix}\nabla_\theta\varphi(x,0)^T\nabla_\theta\varphi(x,0) & -\nabla_\theta\varphi(x,0)^T\nabla_\theta\varphi(x',0)\\ -\nabla_\theta\varphi(x',0)^T\nabla_\theta\varphi(x,0) & \nabla_\theta\varphi(x',0)^T\nabla_\theta\varphi(x',0)\end{pmatrix}\begin{pmatrix}u\\u'\end{pmatrix} = \begin{pmatrix}\nabla_\theta\varphi(x,0)^T(x'-x)\\ \nabla_\theta\varphi(x',0)^T(x-x')\end{pmatrix}.$$
A slight modification, to ensure that the norms of $u$ and $u'$ are not too large, is to add a penalty $\lambda(|u|^2 + |u'|^2)$, which results in adding $\lambda\,\mathrm{Id}_{\mathbb{R}^q}$ to the diagonal blocks of the above matrix.
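A sketch of this computation, where `J` and `Jp` stand for the $d\times q$ Jacobians $\nabla_\theta\varphi(x, 0)$ and $\nabla_\theta\varphi(x', 0)$; names are ours.

```python
import numpy as np

def tangent_distance(x, xp, J, Jp, lam=0.0):
    """Sketch of the first-order tangent distance: solve the (regularized)
    2q x 2q least-squares system for (u, u') and return the residual norm."""
    q = J.shape[1]
    A = np.block([[J.T @ J + lam * np.eye(q), -J.T @ Jp],
                  [-Jp.T @ J, Jp.T @ Jp + lam * np.eye(q)]])
    rhs = np.concatenate([J.T @ (xp - x), Jp.T @ (x - xp)])
    u, up = np.split(np.linalg.solve(A, rhs), 2)
    return np.linalg.norm((x + J @ u) - (xp + Jp @ up))
```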
Chapter 10

Tree-Based Algorithms
Define a binary node to be a structure ν that contains the following information (note
that the definition is recursive):
A binary prediction tree T is a finite set of nodes, with the following properties:
• Feature selection: Given the feature set $\Gamma$ and a training set $T$, return an optimized binary feature $\hat\gamma_{T,\Gamma}\in\Gamma$.
• Predictor optimization: Given the predictor set F and a training set T , return an
optimized predictor fˆT ,F ∈ F .
Given a training set T0 , the algorithm builds a binary tree T using a recursive
construction. Each node ν ∈ T will be associated to a subset of T0 , denoted Tν . We
define below a recursive operation, denoted Node(T , j) that adds a node ν to a tree
T given a subset T of T0 and a label j. Starting with T = ∅, calling Node(T0 , 0) will
then create the desired tree.
The children are then built as $l(\nu) = \mathrm{Node}(T_l, 2j+1)$ and $r(\nu) = \mathrm{Node}(T_r, 2j+2)$, where
$$T_l = \{(x,y)\in T : \gamma_\nu(x) = 0\},\qquad T_r = \{(x,y)\in T : \gamma_\nu(x) = 1\}.$$
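A minimal recursive sketch of Node($T$, $j$), with $\sigma$, the feature selector and the predictor optimizer passed as functions; the dictionary-based node representation is our choice.

```python
def node(T, j, sigma, select_feature, fit_predictor, tree):
    """Sketch of the recursive Node(T, j) operation: sigma decides whether
    the node is terminal, select_feature returns an optimized binary
    feature, and fit_predictor an optimized predictor on the local set T."""
    nu = {"label": j, "data": T}
    tree.append(nu)
    if sigma(T):
        nu["predictor"] = fit_predictor(T)               # terminal node
    else:
        gamma = select_feature(T)
        nu["feature"] = gamma
        Tl = [(x, y) for (x, y) in T if gamma(x) == 0]
        Tr = [(x, y) for (x, y) in T if gamma(x) == 1]
        nu["left"] = node(Tl, 2 * j + 1, sigma, select_feature, fit_predictor, tree)
        nu["right"] = node(Tr, 2 * j + 2, sigma, select_feature, fit_predictor, tree)
    return nu
```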
Remark 10.1 Note that, even though the learning algorithm for prediction trees can
be very conveniently described in recursive form as above, efficient computer im-
plementations should avoid recursive calls, which may be inefficient and memory
demanding. Moreover, for large trees, it is likely that recursive implementations
will reach the maximal number of recursive calls imposed by compilers.
Once the tree is built, the predictor x 7→ fˆT (x) is recursively defined as follows.
The function $\sigma$, which decides whether a node is terminal or not, is generally defined based on very simple rules. Typically, $\sigma(T) = 1$ when one of the following conditions is satisfied:
When one reaches a terminal node ν (so that σ (Tν ) = 1), a predictor fν must be
determined. This function can be optimized within any set F of predictors, using
any learning algorithm, but in practice, one usually makes this fairly simple and
defines F to be the family of constant functions taking values in RY . The function
fˆT ,F is then defined as:
The space $\Gamma$ of possible binary features must be specified in order to partition non-terminal nodes. A standard choice, used in the CART model [42] with $R_X = \mathbb{R}^d$, is
$$\Gamma = \big\{\gamma(x) = 1_{[x(i)\ge\theta]},\ i = 1,\ldots,d,\ \theta\in\mathbb{R}\big\} \tag{10.1}$$
where $x(i)$ is the $i$th coordinate of $x$. This corresponds to splitting the space using a hyperplane parallel to one of the coordinate axes.
Example 10.2 (Regression) Consider the regression case, taking squared differences as risk and letting $F$ contain only constant functions. Then
$$E_T(\gamma) = \min_{m_0,m_1}\sum_{(x,y)\in T}\big((y-m_0)^21_{\gamma(x)=0} + (y-m_1)^21_{\gamma(x)=1}\big).$$
Obviously, the optimal $m_0$ and $m_1$ are the averages of the output values, $y$, in each of the subdomains defined by $\gamma$. For CART (see (10.1)), this cost must be minimized over all choices $(i,\theta)$ with $i = 1,\ldots,d$ and $\theta\in\mathbb{R}$, where $\gamma_{i,\theta}(x) = 1$ if $x(i)\ge\theta$ and 0 otherwise.
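A brute-force sketch of this minimization for CART regression splits; taking candidate thresholds among observed coordinate values is an implementation choice of ours.

```python
import numpy as np

def best_split(X, y):
    """Sketch of the CART split search for regression: minimize over (i, theta)
    the summed squared deviations around the per-side means."""
    best = (None, None, np.inf)                     # (i, theta, cost)
    for i in range(X.shape[1]):
        for theta in np.unique(X[:, i]):
            left, right = y[X[:, i] < theta], y[X[:, i] >= theta]
            if len(left) == 0 or len(right) == 0:
                continue
            cost = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
            if cost < best[2]:
                best = (i, theta, cost)
    return best
```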
Example 10.3 (Classification) For classification, one can apply the same method, with the 0/1 loss, letting
$$E_T(\gamma) = \min_{g_0,g_1}\sum_{(x,y)\in T}\big(1_{y\ne g_0}1_{\gamma(x)=0} + 1_{y\ne g_1}1_{\gamma(x)=1}\big).$$
Example 10.4 (Entropy selection for classification) For classification trees, other splitting criteria may be used based on the empirical probability $p_T$ on the set $T$, defined as
$$p_T(A) = \frac1N|\{k : (x_k, y_k)\in A\}|$$
for $A\subset R_X\times R_Y$. The previous criterion, $E_T(\gamma)$, is proportional to a measure of the heterogeneity of the class distribution within each of the two subdomains defined by $\gamma$. Many such measures exist, and many of them are defined as various forms of entropy designed in information theory. The most celebrated is Shannon's entropy [177], defined by
$$H(p) = -\sum_{g\in R_Y}p(g)\log p(g).$$
Other examples include:
• The Tsallis entropy: $H(p) = \frac1{1-q}\Big(\sum_{g\in R_Y}p(g)^q - 1\Big)$, for $q\ne 1$. (Tsallis entropy for $q = 2$ is sometimes called the Gini impurity index.)
• The Renyi entropy: $H(p) = \frac1{1-q}\log\sum_{g\in R_Y}p(g)^q$, for $q\ge 0$, $q\ne 1$.
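These impurity measures are straightforward to compute from a vector of class probabilities; a small sketch:

```python
import numpy as np

def shannon(p):
    p = p[p > 0]                            # convention: 0 log 0 = 0
    return -np.sum(p * np.log(p))

def tsallis(p, q):
    return (np.sum(p ** q) - 1) / (1 - q)   # q = 2 gives the Gini impurity 1 - sum p^2

def renyi(p, q):
    return np.log(np.sum(p ** q)) / (1 - q)
```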
10.1.7 Pruning
Growing a decision tree to its maximal depth (given the amount of available data)
generally leads to predictors that overfit the data. The training algorithm is usually
followed by a pruning step that removes some some nodes based on a complexity
penalty.
Letting τ(T) denote the set of terminal nodes in the tree T and fˆT the associated
predictor, pruning is represented as an optimization problem, where one minimizes,
given the training set T ,
where R̂T is as usual the in-sample error measured on the training set T .
To prune a tree, one selects one or more internal nodes and removes all their descendants (so that these nodes become terminal). Associate to each node $\nu$ in $\mathbb T$ its local in-sample error $E_{T_\nu}$, equal to the error made by the optimal classifier estimated from the training data associated with $\nu$. Then,
$$U_\lambda(\mathbb T, T) = \sum_{\nu\in\tau(\mathbb T)}\frac{|T_\nu|}{|T|}E_{T_\nu} + \lambda|\tau(\mathbb T)|,$$
and, denoting by $\mathbb T^{(\nu)}$ the tree pruned at $\nu$ and by $\mathbb T_\nu$ the subtree rooted at $\nu$,
$$U_\lambda(\mathbb T, T) = U_0(\mathbb T^{(\nu)}, T) - \frac{|T_\nu|}{|T|}\big(E_{T_\nu} - U_0(\mathbb T_\nu, T_\nu)\big) + \lambda(|\tau(\mathbb T_\nu)| - 1).$$
This leads to defining, for each node,
$$\psi_\nu = \frac{|T_\nu|}{|T|}\big(E_{T_\nu} - U_0(\mathbb T_\nu)\big) - \lambda(|\tau(\mathbb T_\nu)| - 1).$$
10.2.1 Bagging
A random forest [7, 41] is a special case of composite predictors (we will see other
examples later in this chapter when describing boosting methods) that train mul-
tiple individual predictors under various conditions and combine them, through
averaging, or majority voting. With random forests, one generates individual trees
by randomizing the parameters of the learning process. One way to achieve this is
to randomly sample from the training set before running the training algorithm.
Letting as before $T_0 = (x_1, y_1,\ldots,x_N, y_N)$ denote the original set, with size $N$, one can create "new" training data by sampling with replacement from $T_0$. More precisely, consider the family of independent random variables $\xi = (\xi^1,\ldots,\xi^N)$, with each $\xi^j$ following a uniform distribution over $\{1,\ldots,N\}$. One can then form the random training set
$$T_0(\xi) = (x_{\xi^1}, y_{\xi^1},\ldots,x_{\xi^N}, y_{\xi^N}).$$
Running the training algorithm using $T_0(\xi)$ then provides a random tree, denoted $\mathbb T(\xi)$. Now, by sampling $K$ realizations of $\xi$, say $\xi^{(1)},\ldots,\xi^{(K)}$, one obtains a collection of $K$ random trees (a random forest) $\mathbb T^* = (\mathbb T_1,\ldots,\mathbb T_K)$, with $\mathbb T_j = \mathbb T(\xi^{(j)})$, that can be combined to provide a final predictor. The simplest way to combine them is to average the predictors returned by each tree (assuming, for classification, that this predictor is a probability distribution on classes), so that
$$f_{\mathbb T^*}(x) = \frac1K\sum_{j=1}^K f_{\mathbb T_j}(x). \tag{10.2}$$
For classification, one can alternatively let each individual tree "vote" for its most likely class. With decision trees, one can in addition randomize the binary features used to split nodes, as described next. While bagging may provide some enhancement to predictors, feature randomization for decision trees often significantly improves the performance, and is the typical randomization method used for random forests.
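A compact sketch of bagging as just described: resample with replacement, train, and average as in (10.2); `train_tree` is a stand-in for any tree-learning routine.

```python
import numpy as np

def bagged_predictor(T0, train_tree, K, seed=0):
    """Sketch of bagging: draw K bootstrap replicates of the training set,
    train one tree per replicate, and average the tree predictors."""
    rng = np.random.default_rng(seed)
    N = len(T0)
    trees = [train_tree([T0[j] for j in rng.integers(0, N, size=N)])
             for _ in range(K)]
    return lambda x: sum(t(x) for t in trees) / K   # average as in (10.2)
```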
When one decides to split a node during the construction of a prediction tree, one can optimize the binary feature $\gamma$ over a random subset of $\Gamma$ rather than exploring the whole set. For CART, for example, one can select a small number of dimensions $i_1,\ldots,i_q\in\{1,\ldots,d\}$ with $q\ll d$, and optimize $\gamma$ by thresholding one of the coordinates $x(i_j)$ for $j\in\{1,\ldots,q\}$. This results in a randomized version of the node insertion function, in which the children of a non-terminal node are still given by
$$l(\nu) = \mathrm{Node}(T_l, 2j+1),\qquad r(\nu) = \mathrm{Node}(T_r, 2j+2)$$
with
$$T_l = \{(x,y)\in T : \gamma_\nu(x) = 0\},\qquad T_r = \{(x,y)\in T : \gamma_\nu(x) = 1\},$$
but where $\gamma_\nu$ is optimized over the random subset of features.
Now, each time the function RNode($T_0$, 0) is run, it returns a different, random, tree. If it is called $K$ times, this results in a random forest $\mathbb T^* = (\mathbb T_1,\ldots,\mathbb T_K)$, with a predictor $F_{\mathbb T^*}$ given by (10.2). Note that trees in random forests are generally not pruned, since this operation has been observed to bring no improvement in the context of randomized trees.
Top-Scoring Pair (TSP) classifiers were introduced in Geman et al. [78] and can be seen as forests formed with depth-one classification trees in which splitting rules are based on the comparison of pairs of variables. More precisely, define, for $i\ne j$, the binary feature $\gamma_{ij}(x) = 1_{x(i) < x(j)}$. A decision tree based on these rules only relies on the order between the features, and is therefore well adapted to situations in which the observations are subject to increasing transformations, i.e., when the observed variable $X$ is such that $X^{(j)} = \varphi(Z^{(j)})$, where $\varphi : \mathbb{R}\to\mathbb{R}$ is random and increasing and $Z$ is a latent (unobserved) variable. Obviously, in such a case, order-based splitting rules do not depend on $\varphi$.
Such an assumption is relevant, for example, when experimental conditions (such
as temperature) may affect the actual data collection, without changing their order,
which is the case when measuring high-throughput biological data, such as microar-
rays, for which the approach was introduced.
Assuming two classes, a depth-one tree in this context is simply the classifier $f_{ij} = \gamma_{ij}$. Given a training set, the associated empirical error is
$$E_{ij} = \frac1N\sum_{k=1}^N 1_{\gamma_{ij}(x_k)\ne y_k} = \frac1N\sum_{k=1}^N|y_k - \gamma_{ij}(x_k)|$$
and the balanced error (better adapted to situations in which one class is observed more often than the other) is
$$E_{ij}^b = \sum_{k=1}^N w_k|y_k - \gamma_{ij}(x_k)|,$$
where $w_k = 1/(2N_{y_k})$. The TSP classifier determines the set
$$P = \mathrm{argmin}_{ij}\,E_{ij}^b$$
of global minimizers of the empirical error (which may just be a singleton) and predicts the class based on a majority vote among the family of predictors $(f_{ij},\ (i,j)\in P)$. Equivalently, selected variables maximize the score $\Delta_{ij} = 1 - E_{ij}^b$, leading to the method's name.
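A sketch of the pair-selection step, under the conventions made above for $\gamma_{ij}$ and the balanced weights (labels in $\{0, 1\}$; all names are ours).

```python
import numpy as np

def tsp_select(X, y):
    """Sketch of Top-Scoring Pair selection with gamma_ij(x) = 1[x(i) < x(j)]
    and balanced weights w_k = 1/(2 N_{y_k}); returns the best pair and its
    score Delta_ij = 1 - E_ij^b."""
    N, d = X.shape
    w = 1.0 / (2 * np.bincount(y)[y])
    best, pair = -1.0, None
    for i in range(d):
        for j in range(d):
            if i == j:
                continue
            gamma = (X[:, i] < X[:, j]).astype(int)
            score = 1.0 - np.sum(w * np.abs(y - gamma))
            if score > best:
                best, pair = score, (i, j)
    return pair, best
```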
Such classifiers, which are remarkably simple, have been found to be competitive among a wide range of "advanced" classification algorithms for large-dimensional problems in computational biology. The method has been refined in Tan et al. [191], leading to the k-TSP classifier, which addresses the following remarks. First, when $j, j'$ are highly correlated, and $(i, j)$ is a high-scoring pair, then $(i, j')$ is likely to be one too, and their associated decision rules will be redundant. Such cases should preferably be pruned from the classification rules, especially if one wants to select a small number of pairs. Second, among pairs of features that switch with the same probability, it is natural to prefer those for which the magnitude of the switch is largest, e.g., when the pair of variables switches from a regime in which one of them is very low and the other very high to the opposite. In Tan et al. [191], a rank-based tie-breaker is introduced, defined as
$$\rho_{ij} = \sum_{k=1}^N w_k(R_k(i) - R_k(j))(2y_k - 1),$$
where $R_k(i)$ denotes the rank of $x_k(i)$ among the coordinates of $x_k$.
10.4 Adaboost
We first consider binary classification problems, with $R_Y = \{-1, 1\}$. We want to design a function $x\mapsto F(x)\in\{-1,1\}$ on the basis of a training set $T = (x_1, y_1,\ldots,x_N, y_N)$. With the 0-1 loss, minimizing the empirical error is equivalent to maximizing
$$E_T(F) = \frac1N\sum_{k=1}^N y_kF(x_k).$$
We assume that each base classifier, $f_j$, takes values in $[-1, 1]$ (the interval).
• Weighted LDA: one can use LDA as described in section 8.2 with
$$c_g = \sum_{k:y_k=g}p_W(k),\qquad \mu_g = \frac1{c_g}\sum_{k:y_k=g}p_W(k)x_k,\qquad \mu = \sum_{g\in R_Y}c_g\mu_g.$$
Boosting algorithms keep track of a family of weights and modify it after the $j$th classifier $f_j$ is computed, increasing the importance of misclassified examples, before computing the next classifier. The following algorithm, called Adaboost [173, 73], describes one such approach.

If $f_j$ is binary, i.e., $f_j(x)\in\{-1,1\}$, then $|y_k - f_j(x_k)| = 2\cdot 1_{y_k\ne f_j(x_k)}$, so that $S_W^+/2$ is the weighted number of correct classifications and $S_W^-/2$ is the weighted number of incorrect ones.

For $\alpha_j$ to be positive, the $j$th classifier must do better than pure chance on the weighted training set. If not, taking $\alpha_j\le 0$ reflects the fact that, in that case, $-f_j$ has better performance on training data.
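A sketch of the resulting procedure for binary base classifiers and $\rho = 1/2$; normalizing the weights at each step, which does not affect the classifier, is a choice of ours, and $\epsilon_j$ is assumed to lie in $(0, 1)$.

```python
import numpy as np

def adaboost(X, y, weak_learner, M):
    """Sketch of Adaboost with labels in {-1, 1} and rho = 1/2, so that
    alpha_j = (1/2) log(S_W^+(j)/S_W^-(j)) = (1/2) log((1 - eps_j)/eps_j).
    weak_learner(X, y, w) must return a function f with f(X) in {-1, 1}."""
    w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(M):
        f = weak_learner(X, y, w)
        pred = f(X)
        eps = np.sum(w * (pred != y))                # weighted training error
        alpha = 0.5 * np.log((1 - eps) / eps)
        ensemble.append((alpha, f))
        w = w * np.exp(alpha * np.abs(y - pred))     # e^{alpha_j |y_k - f_j(x_k)|}
        w = w / w.sum()                              # normalization (optional)
    return lambda x: np.sign(sum(a * f(x) for a, f in ensemble))
```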
Algorithms that do slightly better than chance with high probability are called "weak learners" [173]. The following proposition [73] shows that, if the base classifiers reliably perform strictly better than chance (by a fixed, but not necessarily large, margin), then the boosting algorithm can make the training-set error arbitrarily close to 0.

Proposition 10.5 Let $E_T$ be the training set error of the estimator $F$ returned by Algorithm 10.4, i.e.,
$$E_T = \frac1N\sum_{k=1}^N 1_{y_k\ne F(x_k)}.$$
Then
$$E_T \le \prod_{j=1}^M\big(\epsilon_j^\rho(1-\epsilon_j)^{1-\rho} + \epsilon_j^{1-\rho}(1-\epsilon_j)^\rho\big)$$
where
$$\epsilon_j = \frac{S_W^-(j)}{S_W^+(j) + S_W^-(j)}.$$
Proof We note that example $k$ is misclassified by the final classifier if and only if
$$\sum_{j=1}^M\alpha_jy_kf_j(x_k) \le 0,$$
or
$$\prod_{j=1}^M e^{-\alpha_jy_kf_j(x_k)/2} \ge 1.$$
Noting that $|y_k - f_j(x_k)| = 1 - y_kf_j(x_k)$, we see that example $k$ is misclassified when
$$\prod_{j=1}^M e^{\alpha_j|y_k-f_j(x_k)|/2} \ge \prod_{j=1}^M e^{\alpha_j/2}.$$
Therefore,
$$E_T = \frac1N\sum_{k=1}^N 1_{y_k\ne F(x_k)} = \frac1N\sum_{k=1}^N 1_{\prod_{j}e^{\alpha_j|y_k-f_j(x_k)|/2} \ge \prod_{j}e^{\alpha_j/2}} \le \frac1N\sum_{k=1}^N\prod_{j=1}^M e^{\alpha_j|y_k-f_j(x_k)|/2}\prod_{j=1}^M e^{-\alpha_j/2}.$$
Let, for $q\le M$,
$$U_q = \frac1N\sum_{k=1}^N\prod_{j=1}^q e^{\alpha_j|y_k-f_j(x_k)|/2}.$$
Since
$$w_k(q) = \frac1N\prod_{j=1}^{q-1}e^{\alpha_j|y_k-f_j(x_k)|/2},$$
we also have $U_q = \sum_{k=1}^N w_k(q+1) = (S_W^+(q+1) + S_W^-(q+1))/2$.
We now use the inequality, valid for $t\in[0,1]$,¹
$$e^{\alpha t} \le 1 - (1 - e^\alpha)t.$$

¹ This inequality is clear for $\alpha = 0$. Assuming $\alpha\ne 0$, the difference between the upper and lower bound is $q(t) = 1 - (1-e^\alpha)t - e^{\alpha t}$. The function $q$ is concave (its second derivative is $-\alpha^2e^{\alpha t}$) with $q(0) = q(1) = 0$, and is therefore non-negative over $[0, 1]$.
Applying it with $t = |y_k - f_q(x_k)|/2$ yields
$$U_q \le \frac1N\sum_{k=1}^N\prod_{j=1}^{q-1}e^{\alpha_j|y_k-f_j(x_k)|/2}\big(1 - (1-e^{\alpha_q})|y_k-f_q(x_k)|/2\big) = \sum_{k=1}^N w_k(q)\big(1 - (1-e^{\alpha_q})|y_k-f_q(x_k)|/2\big)$$
$$= \sum_{k=1}^N w_k(q) - (1-e^{\alpha_q})\sum_{k=1}^N w_k(q)|y_k-f_q(x_k)|/2 = U_{q-1}\big(1 - (1-e^{\alpha_q})\epsilon_q\big),$$
and
$$E_T \le \prod_{j=1}^M\big(1 - (1-e^{\alpha_j})\epsilon_j\big)e^{-\alpha_j/2}.$$
It now suffices to replace $e^{\alpha_j}$ by $(1-\epsilon_j)^\rho\epsilon_j^{-\rho}$ and note that
$$\big(1 - (1 - (1-\epsilon_j)^\rho\epsilon_j^{-\rho})\epsilon_j\big)\,\epsilon_j^{\rho/2}(1-\epsilon_j)^{-\rho/2} = \epsilon_j^\rho(1-\epsilon_j)^{1-\rho} + \epsilon_j^{1-\rho}(1-\epsilon_j)^\rho.$$
Each factor in the product is less than 1, with equality if and only if $\epsilon_j = 1/2$, so that each term in the upper bound reduces the error unless the corresponding base classifier does not perform better than pure chance. The parameter $\rho$ determines the level at which one increases the importance of misclassified examples for the next step. Let $\tilde S_W^+(j)$ and $\tilde S_W^-(j)$ denote the expressions in (10.3a) and (10.3b) with $w_k(j)$ replaced by $w_k(j+1)$. Then, in the case when the base classifiers are binary, ensuring that $|y_k - f_j(x_k)|/2 = 1_{y_k\ne f_j(x_k)}$, one can easily check that $\tilde S_W^+(j)/\tilde S_W^-(j) = (S_W^+(j)/S_W^-(j))^{1-\rho}$. So, the ratio is (of course) unchanged if $\rho = 0$, and pushed to a pure chance level if $\rho = 1$. We provide below an interpretation of boosting as a greedy optimization procedure that will lead to the value $\rho = 1/2$.
We here restrict to the case of binary base classifiers and denote their linear combination by
$$h(x) = \sum_{j=1}^M\alpha_jf_j(x).$$
Whether an observation $x$ is correctly classified in the true class $y$ is associated to the sign of the product $yh(x)$, but the value of this product also has an important interpretation, since, when it is positive, it can be thought of as a margin with which $x$ is correctly classified.

Assume that the function $F$ is evaluated, not only on the basis of its classification error, but also based on this margin, using a loss function of the kind
$$\Psi(h) = \sum_{k=1}^N\psi(y_kh(x_k)) \tag{10.4}$$
where $\psi$ is decreasing. The boosting algorithm can then be interpreted as a procedure that incrementally improves this objective function.
The next combination $h^{(j+1)}$ is equal to $h^{(j)} + \alpha_{j+1}f_{j+1}$, and we now consider the problem of minimizing, with respect to $f_{j+1}$ and $\alpha_{j+1}$, the function $\Psi(h^{(j+1)})$, without modifying the previous classifiers (i.e., performing a greedy optimization). So, we want to minimize, with respect to the base classifier $\tilde f$ and to $\alpha\ge 0$, the function
$$U(\alpha,\tilde f) = \sum_{k=1}^N\psi\big(y_kh^{(j)}(x_k) + \alpha y_k\tilde f(x_k)\big).$$
This shows that $\alpha$ and $\tilde f$ have inter-dependent optimality conditions. For a given $\alpha$, the best classifier $\tilde f$ must minimize a weighted empirical error with non-negative weights (since $\psi$ is decreasing). Given $\tilde f$, $\alpha$ must minimize the expression in (10.5). One can use an alternating minimization procedure to optimize both $\tilde f$ (as a weighted basic classifier) and $\alpha$. However, for the special choice $\psi(t) = e^{-t}$, this optimization turns out to only require one step.
For this choice, one can write
$$U(\alpha,\tilde f) = \sum_{k=1}^N e^{-y_kh^{(j)}(x_k) - \alpha y_k\tilde f(x_k)} \propto \sum_{k=1}^N w_k(j+1)e^{-\alpha y_k\tilde f(x_k)}$$
with $w_k(j+1) = e^{\alpha^{(j)} - y_kh^{(j)}(x_k)}$ and $\alpha^{(j)} = \alpha_1 + \cdots + \alpha_j$. This shows that $\tilde f$ should minimize
$$\sum_{k=1}^N w_k(j+1)1_{y_k\ne\tilde f(x_k)}.$$
We note that
$$w_k(j+1) = w_k(j)e^{\alpha_j(1-y_kf_j(x_k))} = w_k(j)e^{\alpha_j|y_k-f_j(x_k)|},$$
which is identical to the weight updates in Algorithm 10.4 (this is the reason why the term $\alpha^{(j)}$ was introduced in the computation). The new value of $\alpha$ must minimize (using the notation of Algorithm 10.4)
$$e^{-\alpha}S_W^+(j) + e^{\alpha}S_W^-(j),$$
which yields $\alpha = \frac12\log\big(S_W^+(j)/S_W^-(j)\big)$. This is the value $\alpha_{j+1}$ in Algorithm 10.4 with $\rho = 1/2$.
10.5.1 Notation
The boosting idea, and in particular its interpretation as a greedy gradient procedure, can be extended to non-linear regression problems [75]. Let us denote by $\mathcal F_0$ the class of base predictors.
In the case, which is frequent in regression, when $r(y, y')$ only depends on $y - y'$, the problem is equivalent to minimizing
$$U(f) = \sum_{k=1}^N r\big(y_k - F^{(j)}(x_k), f(x_k)\big),$$
i.e., to let $f_{j+1}$ be the optimal predictor (in $\mathcal F_0$ and for the loss $r$) of the residuals $y_k^{(j)} = y_k - F^{(j)}(x_k)$. In this case, this provides a conceptually very simple algorithm. For $j = 1,\ldots,M$:
(1) Find the optimal predictor $f_j\in\mathcal F_0$ for the training set $(x_1, y_1^{(j-1)},\ldots,x_N, y_N^{(j-1)})$.
(2) Let $y_k^{(j)} = y_k^{(j-1)} - f_j(x_k)$.
Then return $F = \sum_{j=1}^M f_j$.
Remark 10.6 Obviously, the class $\mathcal F_0$ should not be a linear class for the boosting algorithm to have any effect. Indeed, if $f, f'\in\mathcal F_0$ implies $f + f'\in\mathcal F_0$, no improvement could be made to the predictor after the first step.
where $C$ is a finite partition of $\mathbb{R}^d$. Each set in the partition is specified by the values taken by a finite number of binary features (denoted by $\gamma$ in our discussion of prediction trees) and the maximal number of such features is the depth of the tree. We assume that the set $\Gamma$ of binary features is shared by all regression trees in $\mathcal F_0$, and that the depth of these trees is bounded by a fixed constant. These restrictions prevent $\mathcal F_0$ from forming a linear class.² Note that the maximal depth of a tree learnable from a finite training set is always bounded, since such trees cannot have more nodes than the size of the training set (but one may want to restrict the maximal depth of base predictors to be much less than $N$).
We now consider situations in which the loss function is not necessarily a function of the difference between true and predicted output. We are still interested in the problem of minimizing $U(f)$, but we now approximate this problem using the first-order expansion
$$U(f) = \sum_{k=1}^N r\big(y_k, F^{(j)}(x_k)\big) + \sum_{k=1}^N\partial_2r\big(y_k, F^{(j)}(x_k)\big)^Tf(x_k) + o(f),$$
where $\partial_2r$ denotes the derivative of $r$ with respect to its second variable. This suggests (similarly to gradient descent) to choose $f$ such that $f(x_k) = -\alpha\partial_2r(y_k, F^{(j)}(x_k))$ for some $\alpha > 0$ and all $k = 1,\ldots,N$. However, such an $f$ may not exist in the class $\mathcal F_0$, and the next best choice is to pick $f = \alpha\tilde f$ with $\tilde f$ minimizing
$$\sum_{k=1}^N\big|\tilde f(x_k) + \partial_2r\big(y_k, F^{(j)}(x_k)\big)\big|^2$$
over all $\tilde f\in\mathcal F_0$.

² If $f$ and $g$ are representable as trees, $f + g$ can be represented as a tree whose depth is the sum of those of the original trees, simply by inserting copies of $g$ below each leaf of $f$.
(2) Let $f_j = \alpha_j\tilde f_j$ where $\alpha_j$ minimizes
$$\sum_{k=1}^N r\big(y_k, F^{(j-1)}(x_k) + \alpha\tilde f_j(x_k)\big).$$
Remark 10.7 Importantly, the fact that $\mathcal F_0$ is stable by scalar multiplication implies that the function $\tilde f_j$ satisfies
$$\sum_{k=1}^N\tilde f_j(x_k)^T\partial_2r\big(y_k, F^{(j-1)}(x_k)\big) \le 0,$$
that is, except in the unlikely case in which the above sum is zero, it is a direction of descent for the function $U$ (because one could otherwise replace $\tilde f_j$ by $-\tilde f_j$ and improve the approximation of the gradient).
Before applying the previous algorithm, one must address the issue that probability distributions do not form a vector space, and cannot be added to form new probability distributions. In Friedman [75], Hastie et al. [87], it is suggested to use the representation, which can be associated with any function $F : (g, x)\mapsto F(g|x)\in\mathbb{R}$,
$$p_F(g|x) = \frac{e^{F(g|x)}}{\sum_{h\in R_Y}e^{F(h|x)}}$$
for all $x\in\mathbb{R}^d$. The space formed by such functions $F$ is now linear, and we can consider the empirical risk
$$\hat R(F) = -\sum_{k=1}^N\sum_{g\in R_Y}\mu_k(g)\log p_F(g|x_k) = -\sum_{k=1}^N\sum_{g\in R_Y}\mu_k(g)F(g|x_k) + \sum_{k=1}^N\log\sum_{g\in R_Y}e^{F(g|x_k)}.$$
One can evaluate the derivative of this risk with respect to a change in $F(g|x_k)$, and a short computation gives
$$\frac{\partial\hat R}{\partial F(g|x_k)} = -\big(\mu_k(g) - p_F(g|x_k)\big).$$
Now assume that a basic space $\mathcal F_0$ of functions $f : (g, x)\mapsto f(g|x)$ is chosen, such that all functions in $\mathcal F_0$ satisfy
$$\sum_{g\in R_Y}f(g|x) = 0$$
for all $x\in\mathbb{R}^d$. The gradient boosting algorithm then requires to minimize (in Step (1)):
$$\sum_{k=1}^N\sum_{g\in R_Y}\big(\mu_k(g) - p_{F^{(j-1)}}(g|x_k) - \tilde f(g|x_k)\big)^2$$
with respect to all functions $\tilde f\in\mathcal F_0$. Given the optimal $\tilde f_j$, the next step requires to minimize, with respect to $\alpha\in\mathbb{R}$:
$$-\alpha\sum_{k=1}^N\sum_{g\in R_Y}\mu_k(g)\tilde f_j(g|x_k) + \sum_{k=1}^N\log\sum_{g\in R_Y}e^{F^{(j-1)}(g|x_k)+\alpha\tilde f_j(g|x_k)}.$$
This is a scalar convex problem that can be solved, e.g., using gradient descent.
We now specialize to the situation in which the set $\mathcal F_0$ contains regression trees. In this situation, the general algorithm can be improved by taking advantage of the fact that the predictors returned by such trees are piecewise constant functions, where the regions of constancy are associated with partitions $C$ of $\mathbb{R}^d$ defined by the leaves of the trees. In particular, $\tilde f_j$ in Step (1) takes the form
$$\tilde f_j(g|x) = \sum_{A\in C}\tilde f_{j,A}(g)1_{x\in A},$$
but not much additional complexity is introduced by freely optimizing the values of $f_j$ on $A$, that is, by looking for $f_j$ in the form
$$\sum_{A\in C}f_{j,A}(g)1_{x\in A}$$
where the values $f_{j,A}(g)$ optimize the empirical risk. This risk becomes
$$-\sum_{k=1}^N\sum_{A\in C}\sum_{g\in R_Y}\mu_k(g)f_{j,A}(g)1_{x_k\in A} + \sum_{k=1}^N\sum_{A\in C}1_{x_k\in A}\log\sum_{g\in R_Y}e^{F^{(j-1)}(g|x_k)+f_{j,A}(g)}.$$
This is still a convex program, which has to be run at every leaf of the optimized tree. If computing time is limited (or for large-scale problems), the determination of $f_{j,A}(g)$ may be restricted to one step of gradient descent starting at $f_{j,A} = 0$. A simple computation indeed shows that the first derivative of the function above with respect to $f_{j,A}(g)$ is
$$a_A(g) = -\sum_{k:x_k\in A}\big(\mu_k(g) - p_F(g|x_k)\big).$$
The derivative of this expression with respect to $f_{j,A}(g)$ (for the same $g$) is
$$b_A(g) = \sum_{k:x_k\in A}p_F(g|x_k)\big(1 - p_F(g|x_k)\big).$$
A Newton-type step, corrected to preserve the constraint $\sum_g f_{j,A}(g) = 0$, then takes
$$f_{j,A}(g) = -\frac{a_A(g)-\lambda}{b_A(g)}$$
with
$$\lambda = \frac{\sum_{g\in R_Y}a_A(g)/b_A(g)}{\sum_{g\in R_Y}1/b_A(g)}.$$
A small value can be added to $b_A$ to avoid divisions by zero. We refer the reader to Friedman et al. [74], Friedman [75], Hastie et al. [87] for several variations on this basic idea. Note that an approximate but highly efficient implementation of boosted trees, called XGBoost, has been developed in Chen and Guestrin [52].
Chapter 11

Neural Nets
We now discuss a class of methods in which the predictor $f$ is built using iterated compositions, with a main application to neural nets. We will structure these models using directed acyclic graphs (DAG). These graphs are composed of a set of vertices (or nodes) $V = \{0,\ldots,m+1\}$ and a collection $L$ of directed edges $i\to j$ between some vertices. If an edge exists between $i$ and $j$, one says that $i$ is a parent of $j$ and $j$ a child of $i$, and we will use the notation $\mathrm{pa}(i)$ (resp. $\mathrm{ch}(i)$) to denote the set of parents (resp. children) of $i$. The graphs we consider must satisfy the following conditions: 0 is the only root (node without parents) and $m+1$ is the only terminal node (node without children).

To each node $i$ in the graph, one associates a dimension $d_i$ and a variable $z_i\in\mathbb{R}^{d_i}$. The root node variable, $z_0 = x$, is the input and $z_{m+1}$ is the output. One also associates to each node $i\ne 0$ a function $\psi_i$ defined on the product space $\prod_{j\in\mathrm{pa}(i)}\mathbb{R}^{d_j}$ and taking values in $\mathbb{R}^{d_i}$. The input-output relation is then defined by the family of equations:
$$z_i = \psi_i(z_{\mathrm{pa}(i)})$$
where $z_{\mathrm{pa}(i)} = (z_j,\ j\in\mathrm{pa}(i))$. Since there is only one root and one terminal node, these iterations implement a relationship $y = z_{m+1} = f(x)$, with $z_0 = x$. We will refer to $z_1,\ldots,z_m$ as the latent variables of the network.
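The input-output relation can be sketched as a loop over nodes, assuming they are numbered in an order compatible with the DAG (an assumption of ours, as is the data layout).

```python
def forward(x, parents, psi, m):
    """Sketch of z_i = psi_i(z_pa(i)): parents[i] lists pa(i), psi[i] maps
    the tuple of parent variables to z_i; nodes 1..m+1 are assumed ordered
    so that every parent is computed before its children."""
    z = {0: x}
    for i in range(1, m + 2):
        z[i] = psi[i](*[z[j] for j in parents[i]])
    return z[m + 1]            # y = z_{m+1} = f(x)
```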
We let W denote the vector containing all parameters w1 , . . . , wm+1 , which therefore
has dimension s = s1 + · · · + sm+1 . The network function f is then parametrized by W
and we will write y = f (x; W ).
11.2.1 Transitions

A standard choice for the transition functions is
$$\psi_i(z_{\mathrm{pa}(i)}; w_i) = \rho\big(b\,z_{\mathrm{pa}(i)} + \beta_0\big),$$
where $b$ is a $\big(d_i\times\sum_{j\in\mathrm{pa}(i)}d_j\big)$ matrix and $\beta_0\in\mathbb{R}^{d_i}$ (so that $w = (b,\beta_0)$ is $s_i = d_i\big(1 + \sum_{j\in\mathrm{pa}(i)}d_j\big)$-dimensional); $\rho$ is defined on $\mathbb{R}$ and takes values in $\mathbb{R}$, and we make the abuse of notation, for any $d$ and $u\in\mathbb{R}^d$, of writing $\rho(u)$ for the vector with coordinates $\rho(u^{(i)})$, $i = 1,\ldots,d$.

The most popular choice for $\rho$ is the positive part, or ReLU (for rectified linear unit), given by $\rho(t) = \max(t, 0)$. Other common choices are $\rho(t) = 1/(1+e^{-t})$ (sigmoid function), or $\rho(t) = \tanh(t)$.

Residual neural networks (or ResNets [89]) are discussed in section 11.6. They iterate transitions between inputs and outputs of same dimension, taking
$$\psi_i(z; w_i) = z + f(z; w_i). \tag{11.1}$$
11.2.2 Output

The last node of the graph provides the prediction, $y$. Its expression depends on the type of predictor that is learned.

• For classification, one can also use a linear model $z_{m+1} = b\,z_{\mathrm{pa}(m+1)} + a_0$, where $z_{m+1}$ is $q$-dimensional, and let the classification be $\mathrm{argmax}(z_{m+1}^{(i)},\ i = 1,\ldots,q)$. Alternatively, one may use a "softmax" transformation, taking
$$z_{m+1}^{(i)} = \frac{e^{\zeta_{m+1}^{(i)}}}{\sum_{j=1}^q e^{\zeta_{m+1}^{(j)}}}$$
with $\zeta_{m+1} = b\,z_{\mathrm{pa}(m+1)} + a_0$.
Neural networks have achieved top performance when working with organized struc-
tures such as images. A typical problem in this setting is to categorize the content of
the image, i.e., return a categorical variable naming its principal element(s). Other
applications include facial recognition or identification. In this case, the transition
function can take advantage of the 2D structure, with some special terminology.
Figure 11.1: Linear net with increasing layer depths and decreasing layer width.
Figure 11.2: A sketch of the U-net architecture designed for image segmentation [169].
11.3 Geometry
In addition to the transitions between latent variables and resulting changes of di-
mension, the structure of the DAG defining the network is an important element in
the design of a neural net. The simplest choice is a purely linear structure (as shown
in Figure 11.1), as was, for example, used for image categorization in [111].
More complex architectures have been introduced in recent years. Their design is in a large part heuristic and based on an analysis of the kind of computation that should be done in the network to perform a particular task. For example, an architecture used for image segmentation is summarized in fig. 11.2.
Remark 11.1 An important feature of neural nets is their modularity, since "simple" architectures can be combined (e.g., by placing the output of a network as input of another one) and form a more complex network that still follows the basic structure defined above. One example of such a building block is the "attention module." Such a module takes as input three sequences of vectors of equal size, say $z^{(q)} = (z_k^{(q)})$, $z^{(c)} = (z_k^{(c)})$, $z^{(v)} = (z_k^{(v)})$, for $k = 1,\ldots,n$, which are typically outputs of previous modules. All three may be identical (self-attention modules), or distinct (encoder-decoder modules in [200] have $z^{(c)} = z^{(v)} \ne z^{(q)}$). The input vectors are separately linearly transformed into "query," "key," and "value" vectors, $q_k = W_qz_k^{(q)}$, $c_k = W_cz_k^{(c)}$ and $v_k = W_vz_k^{(v)}$ (where $W_q$, $W_c$, $W_v$ are learned, and $W_q$ and $W_c$ have the same number of rows, say, $d$) and the output of the module is also a sequence of $n$ vectors given by
$$z_k^{(o)} = \frac{\sum_l e^{\tau a(q_k, c_l)}v_l}{\sum_l e^{\tau a(q_k, c_l)}}$$
where $a(q, c)$ measures the affinity between $q$ and $c$ (e.g., $a(q, c) = q^Tc$) and $\tau$ is a fixed constant (e.g., $1/\sqrt d$). These attention modules are fundamental components of "transformer networks" [200], which are used, among other tasks, in natural language processing and large language models.
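A vectorized sketch of this module, with rows of the arrays holding the sequence vectors and $a(q, c) = q^Tc$; names are ours.

```python
import numpy as np

def attention(zq, zc, zv, Wq, Wc, Wv, tau):
    """Sketch of the attention module: rows of zq, zc, zv are the input
    sequences z^(q), z^(c), z^(v); output row k is the softmax-weighted
    average of the value vectors, with weights exp(tau * q_k^T c_l)."""
    Q, C, V = zq @ Wq.T, zc @ Wc.T, zv @ Wv.T       # query, key, value vectors
    A = np.exp(tau * (Q @ C.T))                     # e^{tau a(q_k, c_l)}
    return (A / A.sum(axis=1, keepdims=True)) @ V   # z_k^(o)
```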
11.4.1 Definitions

We now return to the general form of the problem, with variables $z_0,\ldots,z_{m+1}$ satisfying
$$z_i = \psi_i(z_{\mathrm{pa}(i)}; w_i).$$
Let $T = (x_1, y_1,\ldots,x_N, y_N)$ denote the training data. For classification, with the dimension of the output variable equal to the number of classes and the decision based on the largest coordinate, one can take (letting $z_{k,m+1}(i; W)$ denote the $i$th coordinate of $z_{k,m+1}(W)$):
$$F(W) = \frac1N\sum_{k=1}^N\Big(-z_{k,m+1}(y_k; W) + \log\sum_{i=1}^q\exp\big(z_{k,m+1}(i; W)\big)\Big).$$
11.4.2 Differential

This allows us to define the function $F(W) = G(W, Z(W))$, and we want to compute the gradient of $F$. (The function $F$ in the previous section satisfies these assumptions, with $z = (z_{kj},\ k = 1,\ldots,N,\ j = 1,\ldots,m+1)$.) Taking $h\in\mathbb{R}^s$, one obtains
$$dF(W)h = \big(\partial_WG(W, Z(W)) - p^T\partial_W\gamma(W, Z(W))\big)h,$$
or
$$\nabla F = \partial_WG(W, Z(W))^T - \partial_W\gamma(W, Z(W))^Tp. \tag{11.3}$$
So, we evaluate the gradient of $G(W, z) = r(y, z_{m+1})$ with $z_i = \psi_i(z_{pa(i)}; w_i)$, $i = 1, \ldots, m+1$, $z_0 = x$. With the notation of the previous paragraph, we take $\gamma = (\gamma_1, \ldots, \gamma_{m+1})$ with
$$\gamma_i(W, z) = \psi_i(z_{pa(i)}; w_i) - z_i.$$
These constraints uniquely define z as a function of W , which was one of our as-
sumptions. For the derivative, we have, for $u = (u_1, \ldots, u_{m+1}) \in \mathbb{R}^r$ (with $r = d_1 + \cdots + d_{m+1}$, $u_i \in \mathbb{R}^{d_i}$), and for $i = 1, \ldots, m+1$,
$$\partial_z\gamma_i(W, z)u = \sum_{j\in pa(i)} \partial_{z_j}\psi_i(z_{pa(i)}; w_i)u_j - u_i,$$
so that
$$p^T\partial_z\gamma(W, z)u = \sum_{i=1}^{m+1}\sum_{j\in pa(i)} p_i^T\partial_{z_j}\psi_i(z_{pa(i)}; w_i)u_j - \sum_{i=1}^{m+1} p_i^Tu_i = \sum_{j=1}^{m+1}\sum_{i\in ch(j)} p_i^T\partial_{z_j}\psi_i(z_{pa(i)}; w_i)u_j - \sum_{j=1}^{m+1} p_j^Tu_j.$$
This allows us to identify $\partial_z\gamma(W, z)^Tp$ as the vector $g = (g_1, \ldots, g_{m+1})$ with
$$g_j = \sum_{i\in ch(j)} \partial_{z_j}\psi_i(z_{pa(i)}; w_i)^Tp_i - p_j.$$
For $j = m+1$ (which has no children), we get $g_{m+1} = -p_{m+1}$, so that the equation $\partial_z\gamma^Tp = g$ can be solved recursively by taking $p_{m+1} = -g_{m+1}$ and propagating backward, with
$$p_j = -g_j + \sum_{i\in ch(j)}\partial_{z_j}\psi_i(z_{pa(i)}; w_i)^Tp_i$$
for $j = m, \ldots, 1$.
We can now formulate an algorithm that computes the gradient of F with respect
to W , reintroducing training data indexes in the notation.
with zk,m+1 (W ) = f (xk , W ). Let W be a family of weights. The following steps com-
pute ∇F(W ).
4. Let
$$\nabla F(W) = -\frac{1}{N}\sum_{k=1}^N\sum_{i=1}^{m+1} D_i^T\,\partial_{w_i}\psi_i(z_{k,pa(i)}, w_i)^Tp_{k,i},$$
where $D_i$ is the $s_i\times s$ matrix such that $D_ih = h_i$.
• If Rk (z) = |yk −z|2 (which is the typical choice for regression models) then ∇Rk (z) =
2(z − yk ).
• In classification, with $R_k(z) = -z^{(y_k)} + \log\left(\sum_{i=1}^q\exp(z^{(i)})\right)$, one has
$$\nabla R_k(z) = -u_{y_k} + \frac{\exp(z)}{\sum_{i=1}^q\exp(z^{(i)})},$$
where $u_{y_k}\in\mathbb{R}^q$ is the vector with 1 at position $y_k$ and zero elsewhere, and $\exp(z)$ is the vector with coordinates $\exp(z^{(i)})$, $i = 1, \ldots, q$.
• For dense transition functions of the form $\psi(z; w) = \rho(bz + \beta_0)$ with $w = (\beta_0, b)$, one has $\partial_z\psi(z, w) = \mathrm{diag}(\rho'(\beta_0 + bz))\,b$, so that
$$\partial_z\psi(z, w)^Tp = b^T\mathrm{diag}(\rho'(\beta_0 + bz))\,p.$$
• Similarly,
$$\partial_w\psi(z, w)^Tp = \left[\mathrm{diag}(\rho'(\beta_0 + bz))\,p,\ \mathrm{diag}(\rho'(\beta_0 + bz))\,p\,z^T\right].$$
Note that neural network packages implement these functions (and more) automat-
ically.
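As an illustration of how these formulas combine, here is a minimal numpy sketch of the backward recursion for a purely linear chain ($pa(i) = \{i-1\}$) with $\rho = \tanh$; the helper names and shapes are assumptions made for the example:

```python
import numpy as np

# Minimal sketch of back-propagation for a chain of dense transitions
# z_i = tanh(b_i z_{i-1} + beta0_i); hypothetical helper names.
def forward(x, weights):
    """Return all activations z_0, ..., z_{m+1}."""
    zs = [x]
    for b, beta0 in weights:
        zs.append(np.tanh(b @ zs[-1] + beta0))
    return zs

def backward(zs, weights, grad_out):
    """Propagate p backward; grad_out is the loss gradient at the output."""
    p, grads = grad_out, []
    for (b, beta0), z_in, z_out in zip(
            reversed(weights), reversed(zs[:-1]), reversed(zs[1:])):
        d = (1.0 - z_out**2) * p              # diag(rho'(b z + beta0)) p for rho = tanh
        grads.append((np.outer(d, z_in), d))  # (gradient wrt b, gradient wrt beta0)
        p = b.T @ d                           # transfer to the previous layer
    return list(reversed(grads))
```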
11.5.1 Mini-batches
Because $E(\xi_k) = \ell/N$, we have $E(H(W, \xi)) = \nabla_W E_T(f(\cdot, W))$ and (11.4) provides a stochastic gradient descent algorithm to which the discussion in section 3.3 applies. Such an approach is often referred to as "mini-batch" selection in the deep-learning literature, since it corresponds to sampling $\ell$ examples from the training set without replacement and only computing the gradient of the empirical loss restricted to these examples.
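A generic numpy sketch of this procedure (with `grad_f`, a hypothetical per-example gradient function, assumed given) could look as follows:

```python
import numpy as np

def sgd_minibatch(grad_f, W, data, lr=0.01, batch_size=32, epochs=10, rng=None):
    """Mini-batch SGD: average the per-example gradient over a batch
    sampled without replacement from the training set."""
    rng = rng or np.random.default_rng(0)
    N = len(data)
    for _ in range(epochs):
        perm = rng.permutation(N)
        for start in range(0, N, batch_size):
            batch = [data[i] for i in perm[start:start + batch_size]]
            g = sum(grad_f(W, xy) for xy in batch) / len(batch)
            W = W - lr * g
    return W
```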
11.5.2 Dropout
Introduced for deep learning in Srivastava et al. [182], “dropout” is a learning para-
digm that brings additional robustness (and, maybe, reduces overfitting risks) to
massively parametrized predictors.
11.6 Continuous time limit and dynamical systems
Equation (11.1) expresses the difference of the input and output of a neural transition as a non-linear function $f(z; w)$ of the input. This strongly suggests passing to continuous time and replacing the difference by a derivative, i.e., replacing the neural network by a high-dimensional parametrized dynamical system. The continuous model then takes the form [51]
$$\partial_t z(t) = \psi(z(t), w(t)), \tag{11.5}$$
where $t$ varies in a fixed interval, say, $[0, T]$. The whole process is parametrized by $W = (w(t), t\in[0,T])$. We need to assume existence and uniqueness of solutions of (11.5), which usually restricts the domain of admissibility of parameters $W$.
Typical neural transition functions are Lipschitz functions whose constants depend on the weight magnitude, i.e., are such that
$$|\psi(z, w) - \psi(\tilde z, w)| \le C(w)\,|z - \tilde z|.$$
This is a relatively mild requirement, to which we will return later. Assuming this, we can consider $z(T)$ as a function of the initial value, $z(0) = x$, and of the parameters, writing $z(T) = f(x, W)$.
The objective function then takes the form
$$F(W) = \frac 1N\sum_{k=1}^N r(y_k, f(x_k, W)). \tag{11.8}$$
This informal derivation (more work is needed to justify the existence of various
differentials in appropriate function spaces) provides the continuous-time version
of the back-propagation algorithm, which is also known as the adjoint method in
the optimal control literature [91, 124]. In that context, z represents the state of
the control system, w is the control and p is called the costate, or covector. We
summarize the gradient computation algorithm, reintroducing N training samples.
$$F(W) = \frac 1N\sum_{k=1}^N R_k(z_k(T, W)),$$
$$\nabla F(W)(t) = -\sum_{k=1}^N \partial_w\psi(z_k(t), w(t))^Tp_k(t).$$
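As an informal illustration, the following sketch computes this gradient for a single sample using an Euler discretization; `psi`, `dpsi_dz`, `dpsi_dw` and `grad_r` are hypothetical callables for the transition, its Jacobians, and the end-point cost gradient, with the sign conventions of the backward recursion above:

```python
import numpy as np

def adjoint_gradient(x, ws, dt, psi, dpsi_dz, dpsi_dw, grad_r):
    # forward pass: store the trajectory z(0), z(dt), ...
    zs = [x]
    for w in ws:
        zs.append(zs[-1] + dt * psi(zs[-1], w))
    # backward pass: p(T) = -grad_r(z(T)), dp/dt = -dpsi_dz^T p
    p = -grad_r(zs[-1])
    grads = []
    for w, z in zip(reversed(ws), reversed(zs[:-1])):
        grads.append(-dpsi_dw(z, w).T @ p)   # contribution to nabla F at time t
        p = p + dt * dpsi_dz(z, w).T @ p     # integrate the costate backward
    return list(reversed(grads))
```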
Optimal control problems are usually formulated with a “running cost” that penal-
izes the magnitude of the control, which in our case is provided by the function
W : t 7→ w(t). Penalties on network weights are rarely imposed with discrete neural
networks, but, as discussed above, in the continuous setting, some assumptions on
the function W , such as (11.7), are needed to ensure that the problem is well defined.
The finiteness of the integral of $C(w)^2$ implies, by the Cauchy-Schwarz inequality, the integrability of $C(w)$ itself, and usually leads to simpler computations.
If C(w) is known explicitly and is differentiable, the previous discussion and the
back-propagation algorithm can be adapted with minor modifications for the mini-
mization of (11.10). The only difference appears in Step 4 of Algorithm 11.2, with
$$\nabla F(W)(t) = 2\lambda\nabla C(w(t)) - \frac 1N\sum_{k=1}^N \partial_w\psi(z_k(t), w(t))^Tp_k(t).$$
Computationally, one should still ensure that C and its gradient are not too costly to
compute. If $\psi(z, w) = \rho(bz+\beta_0)$, $w = (b, \beta_0)$, the choice $C(w) = C_\rho|b|_{op}$ (with $|\cdot|_{op}$ the operator norm) is valid, but not computationally friendly. The simpler choice $C(w) = C_\rho|b|_2$ is also valid, but cruder as an upper bound of the Lipschitz constant. However, it leads to straightforward computations.
The addition of a running cost to the objective is important to ensure that any
potential solution of the problem leads to a solvable ODE. It does not guarantee that
an optimal solution exists, which is a trickier issue in the continuous setting than
in the discrete setting. This is an important theoretical issue, since it is needed, for
example, to ensure that various numerical discretization schemes lead to consistent
approximations of a limit continuous problem. The existence of minimizers is not
known in general for ODE networks. It does hold, however, in the following non-
parametric (i.e., weight-free) context that we now describe.
The function $\psi$ in the r.h.s. of (11.5) is, for any fixed $w$, a function that maps $z\in\mathbb R^d$ to a vector $\psi(z, w)\in\mathbb R^d$. Such functions are called vector fields on $\mathbb R^d$, and the collection $\psi(\cdot, w)$, $w\in\mathbb R^s$, is a parametrized family of vector fields.
We will assume that, at each time, $f(t, \cdot)$ belongs to a reproducing kernel Hilbert space (RKHS), as introduced in chapter 6. However, because we are considering a space of vector fields rather than scalar-valued functions, we need to work with matrix-valued kernels [5], for which we give a definition that generalizes definition 6.1 (which corresponds to $q = 1$ below).
Definition 11.2 A function $K : \mathbb R^d\times\mathbb R^d \to M_q(\mathbb R)$ satisfying
[K2-vec] For any $n > 0$, for any choice of vectors $\lambda_1, \ldots, \lambda_n\in\mathbb R^q$ and any $x_1, \ldots, x_n\in\mathbb R^d$, one has
$$\sum_{i,j=1}^n \lambda_i^TK(x_i, x_j)\lambda_j \ge 0. \tag{11.11}$$
One says that the kernel is positive definite if the sum in (11.11) cannot vanish, unless (i) $\lambda_1 = \cdots = \lambda_n = 0$ or (ii) $x_i = x_j$ for some $i\neq j$.
Proposition 6.6 remains valid for vector-valued RKHS, with the following modifications: $\lambda_1, \ldots, \lambda_N$ and $\alpha_1, \ldots, \alpha_N$ are $q$-dimensional vectors and the matrix $K(x_1, \ldots, x_N)$ is now an $Nq\times Nq$ block matrix, with $q\times q$ blocks given by $K(x_k, x_l)$, $k, l = 1, \ldots, N$.
This shows that (11.12) can be derived from regularity properties of the kernel,
namely, that
|K(z, z) − 2K(z, z̃) + K(z̃, z̃)| ≤ C|z − z̃|2
for some constant C and all z, z̃ ∈ Rd . This property is satisfied by most of the kernels
that are used in practice.
Let $\eta : t\mapsto \eta(t)$ be a function from $[0, T]$ to $H$. This means that, for each $t$, $\eta(t)$ is a vector field $x\mapsto\eta(t)(x)$ on $\mathbb R^d$, and we will write $\eta(t)$ and $\eta(t, \cdot)$ interchangeably, with a preference for $\eta(t, x)$ over $\eta(t)(x)$. We consider the objective function
$$\bar F(\eta) = \lambda\int_0^T \|\eta(t)\|_H^2\,dt + \frac 1N\sum_{k=1}^N r(y_k, z_k(1)), \tag{11.13}$$
with ∂t zk (t) = η(t, zk (t)), zk (0) = xk . To compare with (11.10), the finite-dimensional
w ∈ Rs is now replaced with an infinite-dimensional parameter, η, and the transition
ψ(z, w) becomes η(z).
Using the vector version of proposition 6.6 (or the kernel trick used several times
in chapters 7 and 8), one sees that there is no loss of generality in replacing η(t) by
its projection onto the vector space
$$V(t) = \left\{\sum_{l=1}^N K(\cdot, z_l(t))\,w_l : w_1, \ldots, w_N\in\mathbb R^d\right\}.$$
Writing $\eta(t) = \sum_{l=1}^N K(\cdot, z_l(t))\,w_l(t)$, we then have
$$\|\eta(t)\|_H^2 = \sum_{k,l=1}^N w_k(t)^TK(z_k(t), z_l(t))\,w_l(t).$$
This allows us to replace the infinite-dimensional parameter $\eta$ by a family $W = (w(t), t\in[0,T])$ with $w(t) = (w_k(t), k = 1, \ldots, N)$. The minimization of $\bar F$ in (11.13) can be replaced by that of
$$F(W) = \lambda\int_0^T\sum_{k,l=1}^N w_k(t)^TK(z_k(t), z_l(t))\,w_l(t)\,dt + \frac 1N\sum_{k=1}^N r(y_k, z_k(1)), \tag{11.14}$$
with
$$\partial_t z_k(t) = \sum_{l=1}^N K(z_k(t), z_l(t))\,w_l(t).$$
This optimal control problem has a similar form to that considered in (11.10),
where the running cost C(w)2 is replaced by a cost that depends on the control (still
denoted $w$) and the state $z$. The discussion in section 11.6.1 can be applied with some modifications. Let $\mathbf K(z)$ be the $dN\times dN$ matrix formed with $d\times d$ blocks $K(z_k, z_l)$ and $w(t)$ the $dN$-dimensional vector formed by stacking $w_1(t), \ldots, w_N(t)$. Let
$$G(W, z) = \lambda\int_0^T w(t)^T\mathbf K(z(t))\,w(t)\,dt + \frac 1N\sum_{k=1}^N r(y_k, z_k(1))$$
and
$$\gamma(W, z)(t) = \mathbf K(z(t))\,w(t) - \partial_t z(t).$$
The backward ODE in step 3 of Algorithm 11.2 is modified accordingly. The resulting algorithm was introduced in [212]. It has the interesting property (shared with neural ODE models with smooth controlled transitions) of determining an implicit diffeomorphic transformation of the space: the function $x\mapsto f(x; W, z) = \tilde z(T)$, which returns the solution at time $T$ of the ODE
$$\partial_t\tilde z(t) = \sum_{l=1}^N K(\tilde z(t), z_l(t))\,w_l(t)$$
(or $\partial_t\tilde z(t) = \psi(\tilde z(t); w(t))$ for neural ODEs), is smooth, invertible, with a smooth inverse.
Chapter 12
Comparing Probability Distributions
Definition 12.1 Let $P$ and $Q$ be two probability distributions on $R$. Their total variation distance is defined by
$$D_{var}(P, Q) = \sup_A\,(P(A) - Q(A)), \tag{12.1}$$
where the supremum is taken over all measurable sets $A$.
Lemma 12.2 There exists a measurable set $A_0$ such that, for all $B$, $P(B\cap A_0)\ge Q(B\cap A_0)$ and $P(B\cap A_0^c)\le Q(B\cap A_0^c)$.
showing that
$$D_{var}(P, Q) = P(A_0) - Q(A_0).$$
Proposition 12.3 (i) If $P$, $Q$ have densities $\psi_1$, $\psi_2$ with respect to some positive measure $\mu$ (such as $P + Q$), then
$$D_{var}(P, Q) = \frac 12\int_R|\psi_1(x) - \psi_2(x)|\,\mu(dx).$$
In particular, if $R$ is finite,
$$D_{var}(P, Q) = \frac 12\sum_{x\in R}|P(x) - Q(x)|.$$
(ii) One has
$$D_{var}(P, Q) = \sup\left\{\int_R f\,dP - \int_R f\,dQ\right\},$$
where the supremum is taken over all measurable functions $f$ taking values in $[0, 1]$.
(iii) If $f : R\to\mathbb R$ is bounded, define the maximal oscillation of $f$ by $\mathrm{osc}(f) = \sup f - \inf f$. Then
$$D_{var}(P, Q) = \sup\left\{\int_R f(x)P(dx) - \int_R f(x)Q(dx) : \mathrm{osc}(f)\le 1\right\},$$
so that
$$\int_R|\psi_1(x) - \psi_2(x)|\,\mu(dx) = 2\int_{A_0}(\psi_1(x) - \psi_2(x))\,\mu(dx) = 2D_{var}(P, Q),$$
which proves (i).
This shows (ii). For (iii), one can note that, if $f$ takes values in $[0, 1]$, then $\mathrm{osc}(f)\le 1$, so that, using (ii),
$$D_{var}(P, Q)\le\sup\left\{\int_R f(x)P(dx) - \int_R f(x)Q(dx) : \mathrm{osc}(f)\le 1\right\}$$
since the maximization on the r.h.s. is done on a larger set than in (ii).
Conversely, take $f$ such that $\mathrm{osc}(f)\le 1$, $\epsilon > 0$, and $y$ such that $f(y)\le\inf f + \epsilon$. Let $\tilde f(x) = (f(x) - f(y) + \epsilon)/(1+\epsilon)$, which takes values in $[0, 1]$. Then
$$D_{var}(P, Q)\ge\int_R\tilde f(x)P(dx) - \int_R\tilde f(x)Q(dx) = \frac 1{1+\epsilon}\left(\int_R f(x)P(dx) - \int_R f(x)Q(dx)\right)$$
which yields (iv) after taking the supremum with respect to x and y.
Remark 12.4 Statements (ii)–(iv) in proposition 12.3 still hold when the suprema are taken over continuous functions.
12.2 Divergences
Definition 12.5 Let $\varphi$ be a non-negative convex function on $(0, +\infty)$ such that $\varphi(1) = 0$ and $\varphi(t) > 0$ for $t\neq 1$. Let $P$ and $Q$ be two probability distributions on some space $R$, and $\mu$ a measure on $R$ such that $P\ll\mu$ and $Q\ll\mu$. Letting $f = dP/d\mu$ and $g = dQ/d\mu$, the $\varphi$-divergence between $P$ and $Q$ is defined by
$$D_\varphi(P\,\|\,Q) = \int_R g\,\varphi\!\left(\frac fg\right)d\mu \tag{12.3}$$
with the conventions
$$\varphi(0) = \lim_{t\to 0}\varphi(t)\quad\text{and}\quad 0\,\varphi(f/0) = f\,\lim_{t\to 0}\varphi^*(t),$$
where
$$\varphi^*(t) = t\,\varphi(1/t).$$
Note that the limits above may be infinite.
This divergence is, in general, not symmetric, nor does it satisfy the triangle inequality. There are, however, sufficient conditions that ensure that these properties hold [56, 101, 194]. Symmetry is captured by the "conjugate" function $\varphi^*$ in definition 12.5. It is, like $\varphi$, non-negative and convex on $(0, +\infty)$ and vanishes only at $t = 1$. The only part of this statement that is not obvious is that $\varphi^*$ is convex, but
we have, for $\lambda\in(0,1)$, $s, t > 0$,
$$\varphi^*((1-\lambda)s + \lambda t) = ((1-\lambda)s+\lambda t)\,\varphi\!\left(\frac 1{(1-\lambda)s+\lambda t}\right) = ((1-\lambda)s+\lambda t)\,\varphi\!\left(\frac{(1-\lambda)s}{(1-\lambda)s+\lambda t}\cdot\frac 1s + \frac{\lambda t}{(1-\lambda)s+\lambda t}\cdot\frac 1t\right)$$
$$\le ((1-\lambda)s+\lambda t)\left(\frac{(1-\lambda)s}{(1-\lambda)s+\lambda t}\,\varphi\!\left(\frac 1s\right) + \frac{\lambda t}{(1-\lambda)s+\lambda t}\,\varphi\!\left(\frac 1t\right)\right) = (1-\lambda)\varphi^*(s) + \lambda\varphi^*(t).$$
Clearly Dϕ∗ (P k Q) = Dϕ (Q k P ) so that a simple sufficient condition for symmetry is
that ϕ ∗ = ϕ. This can always be ensured by replacing ϕ by ϕ̃ = (ϕ + ϕ ∗ )/2, yielding
$$D_{\tilde\varphi}(P\,\|\,Q) = \frac 12\left(D_\varphi(P\,\|\,Q) + D_\varphi(Q\,\|\,P)\right).$$
$$h(t) = \frac{|t^\alpha - 1|^{1/\alpha}}{\varphi(t)}$$
One can, for example, take $\varphi(t) = |t^\alpha - 1|^{1/\alpha}$, which yields, with the notation of definition 12.5,
$$D_\varphi(P, Q) = \int_R|f^\alpha - g^\alpha|^{1/\alpha}\,d\mu.$$
The case $\alpha = 1$ gives twice the total variation distance. The case $\alpha = 1/2$ provides the Hellinger distance between $P$ and $Q$:
$$D_{\mathrm{Hellinger}}(P, Q) = \left(\int_R|\sqrt f - \sqrt g|^2\,d\mu\right)^{1/2}.$$
$$\varphi_\beta(t) = \frac{\operatorname{sign}(\beta)}{\beta - 1}\left((t^\beta + 1)^{1/\beta} - 2^{1/\beta - 1}(t + 1)\right),$$
with $\alpha = 1/2$ for $\beta < 2$ and $\alpha = 1/\beta$ for $\beta\ge 2$. The limit cases
$$\varphi_0(t) = \frac 12|t - 1|,$$
which provides the total variation distance, and
$$\varphi_1(t) = t\log t - (t + 1)\log\frac{t+1}2$$
are also included. Also, for $\beta = 2$, one retrieves the Hellinger distance.
The Monge-Kantorovich distance, also called the Wasserstein distance, and sometimes the "earth-mover's distance," associates a transportation cost, say $\rho(x, y)$, with moving a unit of mass from $x$ to $y$, and evaluates the minimum total cost needed to transform the distribution $P$ into $Q$. Its mathematical definition is
$$D_{MK}(P, Q) = \inf_{\pi\in\mathcal M(P,Q)}\int_{R\times R}\rho(x, y)\,\pi(dx, dy) \tag{12.4}$$
where $\mathcal M(P, Q)$ is the set of all joint distributions on $R\times R$ whose first marginal is $P$ and second marginal is $Q$.
The interpretation is that π(dx, dy) is the infinitesimal mass moved between the
infinitesimal neighborhoods x + dx and y + dy. The constraint π ∈ M(P , Q) indicates
that π displaces the mass distribution P to the mass distribution Q.
If we assume that, for some $\alpha\ge 1$, $\sigma = \rho^{1/\alpha}$ is a distance on $R$, then $D_{MK}^{1/\alpha}$ is a distance on the space of probability measures on $R$ (equipped with the Borel $\sigma$-algebra associated with $\sigma$). For this fact, and the results that follow, the reader can refer to Villani et al. [203], Dudley [66].
Define
$$D_\rho^*(P, Q) = \max\left\{\int_R f\,dP - \int_R f\,dQ : f\ \rho\text{-contractive}\right\}.$$
See the reference above for a proof. Further generalizations of this theorem, in par-
ticular for the case in which ρ is not equal to a distance, can be found in [203],
chapter 5.
Chapter 13
Monte-Carlo Sampling
The goal of this chapter is to describe how, from a basic random number generator that provides samples from a uniform distribution on $[0, 1]$, one can generate samples that follow, or approximately follow, complex probability distributions on finite or general spaces. This, combined with the law of large numbers, makes it possible to approximate probabilities or expectations by empirical averages over a large collection of generated samples.
Real-valued variables. We will use the following notation for the left limit of a function $F$ at a given point $z$:
$$F(z-) = \lim_{y\to z,\ y<z} F(y),$$
assuming, of course, that this limit exists (which is always true, for example, when $F$ is non-decreasing). Recall that $F$ is left continuous if and only if $F = F(\cdot-)$. Moreover, it is easy to see that $F(\cdot-)$ is left-continuous¹. Note also that, if $F$ is non-decreasing,
¹ For every $z$ and every $\epsilon > 0$, there exists $z' < z$ such that for all $z''\in[z', z)$, $|F(z'') - F(z-)| < \epsilon$. Moreover, taking any $y\in(z', z)$, there exists $y' < y$ such that for all $y''\in[y', y)$, $|F(y'') - F(y-)| < \epsilon$, so that $|F(y-) - F(z-)|\le 2\epsilon$, proving left continuity of $F(\cdot-)$.
one always has F(z) ≤ F(y−) whenever z < y. The following proposition provides a
basic mechanism for Monte-Carlo sampling.
Proposition 13.1 Let $Z$ be a real-valued random variable with c.d.f. $F_Z$. For $u\in[0,1]$, define
$$F_Z^-(u) = \max\{z : F_Z(z-)\le u\}.$$
Let $U$ be uniformly distributed over $[0, 1]$. Then $F_Z^-(U)$ has the same distribution as $Z$.
Proof Let $A_z = \{u\in[0,1] : F_Z^-(u)\le z\}$. We will show that
$$\{u : u < F_Z(z)\}\subset A_z\subset\{u : u\le F_Z(z)\}. \tag{13.1}$$
Showing this will prove the proposition, since one has $\mathbb P(U < F_Z(z)) = \mathbb P(U\le F_Z(z)) = F_Z(z)$, showing that
$$\mathbb P(F_Z^-(U)\le z) = \mathbb P(U\in A_z) = F_Z(z).$$
To prove (13.1), first assume that $u < F_Z(z)$. Take any $z'$ such that $F_Z(z'-)\le u$. Then, necessarily, $z'\le z$, since $z' > z$ would imply that $F_Z(z)\le F_Z(z'-)$. This shows that $\max\{z' : F_Z(z'-)\le u\}\le z$, i.e., $u\in A_z$.
Now, take $u > F_Z(z)$. Because c.d.f.'s are right continuous, there exists $y > z$ such that $u > F_Z(y)$, which implies that $F_Z^-(u)\ge y$ and $u\notin A_z$.
This proposition shows that one can generate random samples of a real-valued
random variable Z as soon as one can compute FZ− and generate uniformly dis-
tributed variables. Note that, if FZ is strictly increasing, then FZ− = FZ−1 , the usual
function inverse.
The proposition also shows how to sample from random variables taking values in finite sets. Indeed, if $Z$ takes values in $\tilde\Omega_Z = \{z_1, \ldots, z_n\}$ with $p_i = \mathbb P(Z = z_i)$, sampling from $Z$ is equivalent to sampling from the integer-valued random variable $\tilde Z$ with $\mathbb P(\tilde Z = i) = p_i$. For this variable, $F_{\tilde Z}^-(u)$ is the largest $i$ such that $p_1 + \cdots + p_{i-1}\le u$ (this sum being zero if $i = 1$), which provides the standard sampling scheme for discrete probability distributions.
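In code, this amounts to inverting the cumulative sums of the $p_i$; a minimal numpy sketch (with a hypothetical helper name) is:

```python
import numpy as np

def sample_discrete(p, n_samples, rng=None):
    """Inverse-c.d.f. sampling: return indices i with probability p[i]."""
    rng = rng or np.random.default_rng(0)
    cdf = np.cumsum(p)                            # F(i) = p_1 + ... + p_i
    u = rng.uniform(size=n_samples)
    return np.searchsorted(cdf, u, side="right")  # indices in 0, ..., n-1
```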
(which includes the Gaussian case). Rejection sampling is a simple algorithm that
allows, in some cases, for the generation of samples from a complicated distribution
based on repeated sampling of a simpler one.
Algorithm 13.1 (Rejection sampling with acceptance function a and base p.d.f. g)
(1) Sample a realization z of a random variable with p.d.f. g.
(2) Generate b ∈ {0, 1} with P(b = 1) = a(z).
(3) If b = 1, return Z = z and exit.
(4) Otherwise, return to step 1.
The probability of exiting at step (3) is $\rho = \int_{\mathbb R^d}g(z)a(z)\,\mu(dz)$. So, the algorithm simulates a random variable with p.d.f.
$$\tilde f(z) = g(z)a(z)\left(1 + (1-\rho) + (1-\rho)^2 + \cdots\right) = \frac{g(z)a(z)}{\rho}.$$
As a consequence, in order to simulate $f_Z$, one must choose $a$ so that $f_Z(z)$ is proportional to $g(z)a(z)$, which (assuming that $g(z) > 0$ whenever $f_Z(z) > 0$) requires that $a(z)$ be proportional to $f_Z(z)/g(z)$. Since $a(z)$ must take values in $[0, 1]$, but should otherwise be chosen as large as possible to ensure that fewer iterations are needed, one should take
$$a(z) = \frac{f_Z(z)}{c\,g(z)}$$
where c = max{fZ (z)/g(z) : z ∈ Rd }, which must therefore be finite. This fully specifies
a rejection sampling algorithm for fZ . Note that g is free to choose (with the restric-
tion that fZ (z)/g(z) must be bounded), and should be selected so that sampling from
it is easy, and the coefficient c above is not too large.
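As a toy illustration (an example of our own choosing, not from the text), take $f_Z$ to be the Beta(2, 2) density on $[0, 1]$ and $g$ the uniform density, so that $c = \max f_Z/g = 1.5$:

```python
import numpy as np

def rejection_sample(n_samples, rng=None):
    """Rejection sampling of Beta(2, 2) from a uniform base density."""
    rng = rng or np.random.default_rng(0)
    c = 1.5
    f = lambda z: 6.0 * z * (1.0 - z)   # Beta(2, 2) p.d.f., maximum 1.5 at z = 0.5
    out = []
    while len(out) < n_samples:
        z = rng.uniform()               # step (1): sample from g
        if rng.uniform() < f(z) / c:    # steps (2)-(3): accept with prob. f(z)/(c g(z))
            out.append(z)
    return np.array(out)
```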
In such cases, one can use alternative simulation methods that iteratively update the variable $Z$ by making small changes at each step, resulting in a procedure that asymptotically converges to a sample of the target distribution. Such sampling schemes are usually described as Markov chains, leading to the name Markov-chain Monte Carlo (or MCMC) sampling. We therefore start our discussion with some basic results on the theory of Markov chains.
13.3.1 Definitions
When $B$ is discrete², the probabilities are fully specified by their values on singleton sets, and we will write $p(x, y)$ for $p(x, \{y\})$.
When P n,n+1 (x, ·) does not depend on n, the Markov chain is said to be homoge-
neous. To simplify notation, we will restrict to homogeneous chains (and therefore
only write P (x, A)), although some of the chains used in MCMC sampling may be
inhomogeneous. This is not a very strong loss of generality, however, because in-
homogeneous Markov chains can be considered as homogeneous by extending the
space Ω on which they are defined to Ω × N, and defining the transition probability
$$\tilde p\big((x, n), A\times\{r\}\big) = \mathbf 1_{r=n+1}\,p_{n,n+1}(x, A).$$
2 We will assume in this chapter that B is a complete metric space with a dense countable subset,
with the associated Borel σ -algebra.
An important special case is when $B$ is countable, in which case one only needs to specify transition probabilities for singletons $A = \{y\}$, and we will write $p(x, y) = P(x, \{y\})$. Another simple situation is when $B = \mathbb R^d$ and each $P(x, \cdot)$ has a p.d.f. that we will also denote as $p(x, \cdot)$. In this latter case, assuming that $P_0$ also has a p.d.f. that we will denote by $\mu_0$, the joint p.d.f. of $(X_0, \ldots, X_n)$ on $(\mathbb R^d)^{n+1}$ is given by
$$\mu_0(x_0)\,p(x_0, x_1)\cdots p(x_{n-1}, x_n). \tag{13.2}$$
The same expression holds for the joint p.m.f. in the discrete case.
In the general case (invoking measure theory), the joint distribution is also deter-
mined by the transition probabilities, and we leave the derivation of the expression
to the reader. An important point is that, in both special cases considered above,
and under some very mild assumptions in the general case, these transition proba-
bilities also uniquely define the joint distribution of the infinite process (X0 , X1 , . . .)
on B ∞ , which gives theoretical support to the consideration of asymptotic properties
of Markov chains.
In this discussion, we are interested in conditions ensuring that the chain asymp-
totically samples from a target probability distribution Q, i.e., that P(Xn ∈ A) con-
verges to Q(A) (one says that Xn converges in distribution to Q). In practice, Q is
given or modeled, and the goal is to determine the transition probabilities. Note
that the marginal distribution of Xn is computed by integrating (or summing) (13.2)
with respect to x0 , . . . , xn−1 . This is generally computationally challenging.
13.3.2 Convergence
We will denote by $P_x(\cdot)$ the conditional distribution $\mathbb P(\cdot\mid X_0 = x)$ and $P^n(x, A) = P_x(X_n\in A)$, which is a probability distribution on $B$. The goal of Markov chain Monte Carlo sampling is to design the transition probabilities such that $P^n(x, A)$ converges to $Q(A)$
when $n$ tends to infinity. One furthermore wants to complete this convergence with a law of large numbers, ensuring that
$$\frac 1n\sum_{k=1}^n f(X_k)\to\int_B f(x)\,Q(dx)$$
Let Dvar be the total variation distance between probability measures introduced
in section 12.1. We will say that the Markov chain with transition P asymptotically
samples from Q if
$$\lim_{n\to\infty}D_{var}\big(P^n(x, \cdot), Q\big) = 0 \tag{13.3}$$
for $Q$-almost all $x\in B$. As we will see, the chain must satisfy specific conditions for this to be guaranteed.
So, if one designs a Markov chain with a target asymptotic distribution $Q$, the first thing to ensure is that $Q$ is invariant. However, while invariance leads to an integral equation for $q$, a stronger condition, called reversibility, is easier to assess.
Assume that $Q$ is invariant by $P$. Make the assumption that $P(x, \cdot)$ has a density $p_*$ with respect to $Q$ (this is, essentially, no loss of generality; see the argument below), so that
$$P(x, A) = \int_A p_*(x, y)\,Q(dy).$$
Taking $A = B$ above, we have
$$\int_B p_*(x, y)\,Q(dy) = P(x, B) = 1,$$
which shows that the conditional distribution of $X_n$ given $X_{n+1} = x_{n+1}$ has density $x_n\mapsto r_{n+1}(x_n, x_{n+1})$ relative to $Q_n$.
with respect to $Q$. In the discrete case, letting $p(x, y) = \mathbb P(X_{n+1} = y\mid X_n = x)$, we have $p_*(x, y) = p(x, y)/Q(y)$, so that the reversed transition (call it $\tilde p$) is such that
$$\frac{\tilde p(x, y)}{Q(y)} = \frac{p(y, x)}{Q(x)},$$
i.e.,
$$Q(y)\,p(y, x) = Q(x)\,\tilde p(x, y). \tag{13.6}$$
One easily retrieves the fact that if $p$ is such that there exist $Q$ and $\tilde p$ such that (13.6) is satisfied, then (summing the equation over $y$) $Q$ is an invariant probability for $p$.
Let Q be a probability on B. One says that the Markov chain (or the transition
probability p) is Q-reversible if and only if p(x, ·) has a density p∗ (x, ·) with respect
to Q such that p∗ (x, y) = p∗ (y, x) for all x, y ∈ B. Since such a density is necessarily
doubly stochastic, Q is then invariant by p. Reversibility is equivalent to the prop-
erty that, whenever Xn ∼ Q, the joint distribution of (Xn , Xn+1 ) coincides with that of
$(X_{n+1}, X_n)$. Alternatively, $Q$-reversibility requires that for all $A, B\subset B$,
$$\int_A P(z, B)\,Q(dz) = \int_B P(z, A)\,Q(dz). \tag{13.7}$$
While Q can be an invariant distribution for a Markov chain without that chain
being Q-reversible, the latter property is easier to ensure when designing transition
probabilities, and most sampling algorithms are indeed reversible with respect to
their target distribution.
Remark 13.3 A simple example of a non-reversible Markov chain with invariant probability $Q$ is often obtained in practice by alternating two or more $Q$-reversible transition probabilities. Assume, to simplify, that $B$ is discrete and that $p_1$ and $p_2$ are transition probabilities that satisfy (13.8). Consider a composite Markov chain for which the transition from $X_n$ to $X_{n+1}$ consists of first generating $Y_n$ according to $p_1(X_n, \cdot)$ and then $X_{n+1}$ according to $p_2(Y_n, \cdot)$. The resulting composite transition probability is
$$p(x, y) = \sum_{z\in B}p_1(x, z)\,p_2(z, y).$$
Trivially, $Q$ is invariant by $p$, since it is invariant by $p_1$ and $p_2$, but $p$ is not $Q$-reversible. Indeed, $p$ satisfies (13.6) with
$$\tilde p(x, y) = \sum_{z\in B}p_2(x, z)\,p_1(z, y).$$
One says that the Markov chain is Q-irreducible (or, simply, irreducible in what
follows) if and only if, for all z ∈ B and all (measurable) B ⊂ B such that Q(B) > 0,
there exists n > 0 with Pz (Xn ∈ B) > 0. (Irreducibility implies that Q is the only
invariant probability of the Markov chain.)
A Markov chain is called periodic if there exists $m > 1$ such that $B$ can be covered by disjoint subsets $B_0, \ldots, B_{m-1}$ that satisfy $P(x, B_j) = 1$ for all $x\in B_{j-1}$ if $j\ge 1$ and all $x\in B_{m-1}$ if $j = 0$. In other terms, the chain loops between the sets $B_0, \ldots, B_{m-1}$. If such a decomposition does not exist, the chain is called aperiodic.
A periodic chain cannot satisfy (13.3). Indeed, periodicity implies that $P_x(X_n\in B_i) = 0$ for all $x\in B_i$ unless $n = 0\ (\mathrm{mod}\ m)$. Since the sets $B_i$ cover $B$, (13.3) is only possible with $Q = 0$. Irreducibility and aperiodicity are therefore necessary conditions for ergodicity. Combined with the fact that $Q$ is an invariant probability distribution, these conditions are also sufficient, in the sense that (13.3) is true for $Q$-almost all $x$. (See [193] for a proof.)
Without the knowledge that the chain has an invariant probability, showing convergence usually requires showing that the chain is recurrent, which means that, for any set $B$ such that $Q(B) > 0$, the probability that, starting from $x$, $X_n\in B$ for an infinite number of $n$, written as $P_x(X_n\in B\ \mathrm{i.o.})$ (for "infinitely often"), is positive for all $x\in B$ and equal to 1 $Q$-almost surely. The fact that irreducibility and aperiodicity combined with $Q$-invariance imply recurrence (or, more precisely, $Q$-positive recurrence [148]) is an important remark that significantly simplifies the theory for MCMC simulation. Note that, by restricting $B$ to a suitable set of $Q$-probability 1, one can assume that $P_x(X_n\in B\ \mathrm{i.o.}) = 1$ for all $x\in B$, which is called Harris recurrence. If the chain is Harris recurrent, then (13.3) holds with $\mu_0 = \delta_x$ for all $x\in B$.⁴
One says that $C\subset B$ is a "small" set if $Q(C) > 0$ and there exists a triple $(m_0, \epsilon, \nu)$, with $\epsilon > 0$ and $\nu$ a probability distribution on $B$, such that
$$P^{m_0}(x, \cdot)\ge\epsilon\,\nu(\cdot)$$
for all x ∈ C. A slightly different result, proved in [13], replaces irreducibility by the
property that there exists a small set C ⊂ B such that
Px (∃n : Xn ∈ C) > 0
⁴ Harris recurrence is also associated with the uniqueness of right eigenvectors of $P$, that is, functions $h : B\to\mathbb R$ such that
$$Ph(x)\stackrel{\Delta}{=}\int_B P(x, dy)\,h(y) = h(x).$$
Such functions are also called harmonic for $P$. Because $P$ is a transition probability, constant functions are always harmonic. Harris recurrence, in the current context, is equivalent to the fact that every bounded harmonic function is constant.
for $Q$-almost all $x\in B$. One then replaces aperiodicity by the similar condition that the greatest common divisor of the set of integers $m$ such that there exists $\epsilon_m$ with $P^m(x, \cdot)\ge\epsilon_m\,\nu(\cdot)$ for all $x\in C$ is equal to 1. These two conditions combined with $Q$-invariance also imply that (13.3) holds for $Q$-almost all $x\in B$.
for some 0 ≤ r < 1 and some function M(x), or uniformly geometric convergence
speed, for which the function M is bounded (or, equivalently, constant).
and
$$\sup_{x\in C}E\big(h(X_{n+1})\mathbf 1_{X_{n+1}\notin C}\mid X_n = x\big) < \infty. \tag{13.10b}$$
Then the Markov chain is geometrically ergodic. Note that $E(h(X_{n+1})\mid X_n = x) = Ph(x)$. Equations (13.10a) and (13.10b) can be summarized in a single equation [137], namely
$$Ph(x)\le\beta h(x) + M\mathbf 1_C(x) \tag{13.11}$$
with $\beta = 1/r < 1$ and $M\ge 0$.
Uniform geometric ergodicity is implied by the simple condition that the whole set $B$ is small, requiring a uniform lower bound
$$P^{m_0}(x, \cdot)\ge\epsilon\,\nu(\cdot)$$
for some probability distribution $\nu$ and all $x\in B$. Such uniform conditions usually require strong restrictions on the space $B$, such as compactness or finiteness.
To illustrate this, consider the case in which the set $B$ is finite. Assume, to simplify, that $Q(x) > 0$ for all $x\in B$ (one can restrict the Markov chain to such $x$'s otherwise). Arbitrarily labeling elements of $B$ as $B = \{x_1, \ldots, x_N\}$, we can consider $p(x, y)$
This result can also be deduced from properties of matrices with non-negative or positive coefficients. The Perron-Frobenius theorem [93] states that the eigenvalue 1 (associated with the eigenvector $\mathbf 1$) is the largest, in modulus, eigenvalue of a stochastic matrix $\tilde P$ with positive entries, that it has multiplicity one, and that all other eigenvalues have a modulus strictly smaller than one. If $P^m$ has positive entries, this implies that all the eigenvalues of $(P^m - \mathbf 1Q)$ (where $Q$ is considered as a row vector) have modulus strictly less than one. This fact can then be used to prove uniform geometric ergodicity.
13.3.7 Examples on Rd
To take a geometrically ergodic example that is not uniform, consider the simple
random walk provided by the iterations
$$X_{n+1} = \rho X_n + \tau\,\epsilon_{n+1}$$
where $\epsilon_{n+1}\sim\mathcal N(0, \mathrm{Id}_{\mathbb R^d})$, $\tau > 0$ and $0 < \rho < 1$. One shows easily by induction that the conditional distribution of $X_n$ given $X_0 = x$ is Gaussian with mean $m_n = \rho^nx$ and covariance matrix $\sigma_n^2\,\mathrm{Id}_{\mathbb R^d}$ with
$$\sigma_n^2 = \frac{1 - \rho^{2n}}{1 - \rho^2}\,\tau^2.$$
In particular, the distribution $Q = \mathcal N(0, \sigma_\infty^2\,\mathrm{Id}_{\mathbb R^d})$, with $\sigma_\infty^2 = \tau^2/(1 - \rho^2)$, is invariant.
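This chain is easy to simulate; the following sketch (in dimension $d = 1$) checks the invariant variance numerically:

```python
import numpy as np

# Simulation of X_{n+1} = rho X_n + tau eps_{n+1}, checking that the empirical
# variance approaches tau^2 / (1 - rho^2).
rng = np.random.default_rng(0)
rho, tau, n_steps = 0.9, 1.0, 100_000
x = 10.0                        # start far from the invariant mean
xs = np.empty(n_steps)
for n in range(n_steps):
    x = rho * x + tau * rng.standard_normal()
    xs[n] = x
print(np.var(xs[1000:]))        # ~ tau^2 / (1 - rho^2) = 5.26...
```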
Estimates on the variational distances between Gaussian distributions, such as those
provided in Devroye et al. [61], can then be used to show that
metric spaces), the drift function criterion (13.11) can be used. Assume that $Ph(\cdot)$, given by
$$Ph(x) = E(h(X_{n+1})\mid X_n = x) = \int_{\mathbb R^d}h(y)\,P(x, dy),$$
is continuous as soon as the function $h : \mathbb R^d\to\mathbb R$ is continuous (one says that the chain is weak Feller). This is true, for example, if $P(x, \cdot)$ has a p.d.f. with respect to Lebesgue's measure which is continuous in $x$. In such a situation, one can see that compact sets are small sets, and (13.11) can be restated as the existence of a positive function $h$ with compact sub-level sets and such that $h(x)\ge 1$, of a compact set $C\subset\mathbb R^d$, and of positive constants $\beta < 1$ and $b$ such that, for all $x\in\mathbb R^d$,
$$Ph(x)\le\beta h(x) + b\,\mathbf 1_C(x). \tag{13.13}$$
Let us make the assumption that $H$ is $L$-$C^1$ for some $L > 0$ (c.f. definition 3.15) and furthermore assume that $|\nabla H(x)|$ tends to infinity when $x$ tends to infinity, ensuring that the sets $\{x : |\nabla H(x)|\le c\}$ are compact for $c > 0$. We want to show that, if $\delta$ is small enough, (13.13) holds for $h(x) = \exp(mH(x))$ and $m$ small enough. Define
$$g(x, u) = mH(x - \delta\nabla H(x) + \tau u) - \frac{|u|^2}2.$$
Using the $L$-$C^1$ property, we have
$$g(x, u)\le mH(x) + m(-\delta\nabla H(x) + \tau u)^T\nabla H(x) + \frac{mL}2|\delta\nabla H(x) - \tau u|^2 - \frac{|u|^2}2$$
$$= mH(x) - m\delta(1 - \delta L/2)|\nabla H(x)|^2 + m\tau(1 - \delta L)\nabla H(x)^Tu - \frac{1 - mL\tau^2}2|u|^2$$
$$= mH(x) - \frac{1 - mL\tau^2}2\left|u - \frac{m\tau(1-\delta L)}{1 - mL\tau^2}\nabla H(x)\right|^2 - m\left(\delta(1 - \delta L/2) - \frac{m\tau^2(1-\delta L)^2}{2(1 - mL\tau^2)}\right)|\nabla H(x)|^2.$$
Using this upper bound, we see that (13.13) will hold if one first chooses $\delta$ such that $\delta L < 2$, then $m$ such that $mL\tau^2 < 1$ and
$$\frac{m\tau^2(1-\delta L)^2}{2(1 - mL\tau^2)} < \delta(1 - \delta L/2),$$
and finally $c$ large enough that
$$\frac 1{(1 - mL\tau^2)^{d/2}}\exp\left(-m\left(\delta(1-\delta L/2) - \frac{m\tau^2(1-\delta L)^2}{2(1 - mL\tau^2)}\right)c^2\right) < 1.$$
Note that this Markov chain is not in detailed balance. Since $P(x, \cdot)$ has a p.d.f., being in detailed balance requires the ratio $p(x, y)/p(y, x)$ to simplify as a ratio $q(y)/q(x)$ for some function $q$, which does not hold. However, we can identify the invariant distribution approximately for small $\delta$ and $\tau$, which we will assume to satisfy $\tau = a\sqrt\delta$ for a fixed $a > 0$, with $\delta$ a small number.
We can write
$$p(x, y) = \frac 1{(2\pi\tau^2)^{d/2}}\exp\left(-\frac 1{2\tau^2}|y - x + \delta\nabla H(x)|^2\right) = \frac 1{(2\pi\tau^2)^{d/2}}\exp\left(-\frac 1{2\tau^2}|y - x|^2 - \frac\delta{\tau^2}(y - x)^T\nabla H(x) - \frac{\delta^2}{2\tau^2}|\nabla H(x)|^2\right).$$
If $q$ is a density, we have
$$qP(y) = \int_{\mathbb R^d}q(x)\,p(x, y)\,dx = \frac 1{(2\pi)^{d/2}}\int_{\mathbb R^d}q(y + a\sqrt\delta\,u)\exp\left(-\frac{|u|^2}2 + \frac{\sqrt\delta}a u^T\nabla H(y + a\sqrt\delta\,u) - \frac\delta{2a^2}|\nabla H(y + a\sqrt\delta\,u)|^2\right)du.$$
We have
$$q(y + a\sqrt\delta\,u) = q(y) + a\sqrt\delta\,\nabla q(y)^Tu + \frac{a^2\delta}2 u^T\nabla^2q(y)u + o(\delta|u|^2)$$
and
$$\exp\left(\frac{\sqrt\delta}a u^T\nabla H(y + a\sqrt\delta\,u) - \frac\delta{2a^2}|\nabla H(y + a\sqrt\delta\,u)|^2\right) = 1 + \frac{\sqrt\delta}a u^T\nabla H(y) - \frac\delta{2a^2}|\nabla H(y)|^2 + \delta\,u^T\nabla^2H(y)u + \frac\delta{2a^2}(u^T\nabla H(y))^2 + o(\delta|u|^2).$$
Taking the product and using the facts that $(2\pi)^{-d/2}\int_{\mathbb R^d}u\exp(-|u|^2/2)\,du = 0$ and that $(2\pi)^{-d/2}\int_{\mathbb R^d}u^TAu\exp(-|u|^2/2)\,du = \mathrm{trace}(A)$ for any symmetric matrix $A$, we can write
$$qP(y) = q(y) + \delta\left(\frac{a^2}2\Delta q(y) + \nabla H(y)^T\nabla q(y) + q(y)\Delta H(y)\right) + o(\delta).$$
If $q$ is an invariant density, then $qP(y) = q(y)$, so that
$$\frac{a^2}2\Delta q(y) + \nabla H(y)^T\nabla q(y) + q(y)\Delta H(y) = o(1).$$
2
The partial differential equation
$$\frac{a^2}2\Delta q(y) + \nabla H(y)^T\nabla q(y) + q(y)\Delta H(y) = 0 \tag{13.14}$$
is satisfied by the function $y\mapsto e^{-2H(y)/a^2}$. Assuming that this function is integrable, this computation suggests that, for small $\delta$, the Markov chain approximately samples from the probability distribution
$$q_0 = \frac 1Z\,e^{-2H(x)/a^2}.$$
This is further discussed in the next remark that involves stochastic differential
equations. We will also present a correction of this Markov chain that samples from
q0 for all δ in section 13.5.2.
Remark 13.4 (Langevin equation) This chain is indeed the Euler discretization [107] of the stochastic differential equation
$$dX_t = -\nabla H(X_t)\,dt + a\,dB_t,$$
where $B_t$ is a standard $d$-dimensional Brownian motion.
Such diffusions are continuous-time Markov processes (Xt , t ≥ 0), which means
that the probability distribution of Xt+s given all events before and including time s
only depends on $X_s$ and is provided by a transition probability $P_t$, whose density $p_t(x, \cdot)$ satisfies
$$\partial_tp_t(x, y) = \nabla_2\cdot\big(\nabla H(y)\,p_t(x, y)\big) + \frac{a^2}2\Delta_2\,p_t(x, y),$$
where $\nabla_2$ and $\Delta_2$ indicate differentiation with respect to the second variable ($y$). (Recall that $\Delta f$ denotes the Laplacian of $f$.) Moreover, if $Q$ is an invariant distribution with p.d.f. $q$, it satisfies the equation
$$\nabla\cdot(q\nabla H) + \frac{a^2}2\Delta q = 0.$$
Noting that ∇ · (q∇H) = ∇qT ∇H + q∆H, we retrieve (13.14). Convergence proper-
ties (and, in particular, geometric convergence) of the Langevin equation to its limit
distribution are studied in Roberts and Tweedie [167], using methods introduced in
Meyn and Tweedie [135, 136, 137].
13.4 Gibbs sampling
13.4.1 Definition
The Gibbs sampling algorithm [79] was introduced to sample from a distribution on large sets for which direct sampling is intractable and rejection sampling is inefficient. It generates a Markov chain that converges (under some hypotheses) in distribution to this target probability. A general version of this algorithm is described below.
(1) Select j ∈ {1, . . . , K} according to some pre-defined scheme, i.e., at random ac-
cording to a probability distribution π(n) on the set {1, . . . , K}.
(2) Sample a new value z(n+1) according to the probability distribution Qj (Uj (z(n)), ·).
One typically chooses the probability distribution in step 1 equal to the uniform distribution on $\{1, \ldots, K\}$ (in which case it is independent of $n$), or to $\pi^{(n)} = \delta_{j_n}$ where $j_n = 1 + (n\ \mathrm{mod}\ K)$ (periodic scheme). Strictly speaking, Gibbs sampling is a Markov chain if $\pi^{(n)}$ does not depend on $n$, and we will make this simplifying assumption in the rest of our discussion (therefore replacing $\pi^{(n)}$ by $\pi$). One obvious requirement for the feasibility of the method is that step 2 can be performed efficiently, since it must be repeated a very large number of times.
One can see that the Markov chain generated by this algorithm is Q-reversible.
Indeed, assume that Xn ∼ Q. For any (measurable) subsets A and B in B, one has,
using the definition of conditional expectations,
$$\mathbb P(X_n\in A, X_{n+1}\in B) = \sum_{i=1}^K E\big(\mathbf 1_{Z\in A}\,Q_i(U_i(Z), B)\big)\,\pi(i). \tag{13.16}$$
One then takes $U_j(z^{(1)}, \ldots, z^{(K)}) = (z^{(1)}, \ldots, z^{(j-1)}, z^{(j+1)}, \ldots, z^{(K)})$. In other terms, step 2 in the algorithm replaces the current value of $z^{(j)}(n)$ by a new one sampled from the conditional distribution of $Z^{(j)}$ given the current values of $z^{(i)}(n)$, $i\neq j$.
One may then define $Q_\theta$ as the image of $Q$ by $U_\theta$ and let $Q_\theta(u, A)$ provide a version of $Q(A\mid U_\theta = u)$. The only change in the previous discussion (besides using $\theta$ as an index) is that (13.16) becomes
$$\mathbb P(X_n\in A, X_{n+1}\in B) = \int_\Theta E\big(\mathbf 1_{Z\in A}\,Q_\theta(U_\theta(Z), B)\big)\,\pi(d\theta).$$
Remark 13.7 Using notation from the previous remark, and allowing π = π(n) to
depend on n, it is possible to allow π(n) to depend on the current state z(n) using the
following construction.
For every step $n$, assume that there exists a subset $\Theta_n$ of $\Theta$ such that $\pi^{(n)}(z, \Theta_n) = 1$ and that, for all $\theta\in\Theta_n$, $\pi^{(n)}$ can be expressed in the form
$$\pi^{(n)}(z, \cdot) = \psi_\theta^{(n)}(U_\theta(z), \cdot)$$
for some transition probability $\psi_\theta^{(n)}$ from $B_\theta$ to $\Theta_n$. The resulting chain remains $Q$-reversible, since
$$\mathbb P(X_n\in A, X_{n+1}\in B) = \int_B\int_{\Theta_n}\mathbf 1_{z\in A}\,Q_\theta(U_\theta(z), B)\,\pi^{(n)}(z, d\theta)\,Q(dz) = \int_B\int_{\Theta_n}\mathbf 1_{z\in A}\,Q_\theta(U_\theta(z), B)\,\psi_\theta^{(n)}(U_\theta(z), d\theta)\,Q(dz) = \int_{\tilde B}\int_{\Theta_n}Q_\theta(u, A)\,Q_\theta(u, B)\,\psi_\theta^{(n)}(u, d\theta)\,Q_\theta(du).$$
We will see several examples of applications of Gibbs sampling in the next few chap-
ters. Here, we consider a special instance of Markov random field (see chapter 14)
called the Ising model. For this example, B = {0, 1}L , and
$$q(z) = \frac 1C\exp\left(\sum_{j=1}^L\alpha z^{(j)} + \sum_{i,j=1,\,i<j}^L\beta_{ij}z^{(i)}z^{(j)}\right).$$
Note that, although B is a finite set, its cardinality, 2L , is too large for the enumerative
procedure described in section 13.1 to be applicable as soon as L is, say, larger than
30. In practical applications of this model, L is orders of magnitude larger, typically
in the thousands or tens of thousands.
(taking $\beta_{jj'} = \beta_{j'j}$ for $j > j'$). Gibbs sampling for this model will generate a sequence of variables $Z(0), Z(1), \ldots$ by fixing $Z(0)$ arbitrarily and, given $Z(n) = z$, applying the two steps of the algorithm above; a sketch is given below.
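The following numpy sketch (a hypothetical helper, with the couplings stored in a symmetric matrix `beta` with zero diagonal) implements the resulting single-site updates, using the fact that the conditional log-odds of $z^{(j)} = 1$ given the other coordinates is $\alpha + \sum_{i\neq j}\beta_{ij}z^{(i)}$:

```python
import numpy as np

def gibbs_ising(alpha, beta, n_steps, rng=None):
    """Random-scan Gibbs sampling for q(z) prop. to
    exp(alpha * sum_j z_j + sum_{i<j} beta_ij z_i z_j), z in {0, 1}^L."""
    rng = rng or np.random.default_rng(0)
    L = beta.shape[0]
    z = rng.integers(0, 2, size=L)      # arbitrary initial configuration
    for _ in range(n_steps):
        j = rng.integers(L)             # step 1: pick a coordinate at random
        logit = alpha + beta[j] @ z     # conditional log-odds (zero diagonal assumed)
        z[j] = rng.uniform() < 1.0 / (1.0 + np.exp(-logit))  # step 2
    return z
```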
Let us now consider the Ising model with fixed total activation, namely the previous distribution conditioned on $S(z)\stackrel\Delta= z^{(1)} + \cdots + z^{(L)} = h$, where $0 < h < L$. The distribution one wants to sample from now is
$$q_h(z) = \frac 1{C_h}\exp\left(\sum_{j=1}^L\alpha z^{(j)} + \sum_{i,j=1,\,i<j}^L\beta_{ij}z^{(i)}z^{(j)}\right)\mathbf 1_{S(z)=h}.$$
In that case, the previous choice for the one-step transitions does not work, because fixing all but one coordinate of $z$ also fixes the last one (so that the chain would not move from its initial value and would certainly not be irreducible). One can however fix all but two coordinates, therefore defining
$$U_{ij}(z^{(1)}, \ldots, z^{(L)}) = (z^{(1)}, \ldots, z^{(i-1)}, z^{(i+1)}, \ldots, z^{(j-1)}, z^{(j+1)}, \ldots, z^{(L)})$$
and $B_{ij} = \{0, 1\}^2$. If $U_{ij}(z)$ is fixed, the only acceptable configurations are $z$ itself and the configuration $z'$ deduced from $z$ by switching the values of $z^{(i)}$ and $z^{(j)}$. Thus, there is no possible change if $z^{(i)} = z^{(j)}$. If $z^{(i)}\neq z^{(j)}$, then the probability of flipping the values of $z^{(i)}$ and $z^{(j)}$ is $q_h(z')/(q_h(z) + q_h(z'))$.
13.5 Metropolis-Hastings
13.5.1 Definition
The transition probabilities for this process are $p(x, y) = g(x, y)a(x, y)$ if $x\neq y$ and $p(x, x) = 1 - \sum_{y\neq x}p(x, y)$. The chain is $Q$-reversible if the detailed balance equation
$$q(z)\,g(z, z')\,a(z, z') = q(z')\,g(z', z)\,a(z', z) \tag{13.18}$$
is satisfied. The functions $g$ and $a$ are part of the design of the algorithm, but (13.18) suggests that $g$ should satisfy the "weak symmetry" condition that $g(z, z') > 0$ if and only if $g(z', z) > 0$.
Note that this condition is necessary to ensure (13.18) if $q(z) > 0$ for all $z$. If $q(z) > 0$, the fact that acceptance probabilities are less than 1 requires that
$$a(z, z')\le\min\left(1, \frac{q(z')\,g(z', z)}{q(z)\,g(z, z')}\right).$$
If one takes
$$a(z, z') = \min\left(1, \frac{q(z')\,g(z', z)}{q(z)\,g(z, z')}\right), \tag{13.20}$$
then (13.18) is satisfied as soon as $q(z) > 0$. If $q(z) = 0$, then this definition ensures that $a(z', z) = 0$ and (13.18) is also satisfied. Note also that the case $g(z, z') = 0$ is not relevant, since $z'$ is not attainable from $z$ in one step in this case. This shows that (13.20) provides a $Q$-reversible chain. Obviously, if $g$ already satisfies $q(z)\,g(z, z') = q(z')\,g(z', z)$, which is the case for Gibbs sampling, then one should take $a(z, z') = 1$ for all $z$ and $z'$.
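As an illustration, here is a minimal sketch of the resulting algorithm on $\mathbb R^d$ with a symmetric Gaussian proposal, for a target $q(z)\propto\exp(-H(z))$ (so that $g(z, z')/g(z', z) = 1$ and (13.20) reduces to $\min(1, e^{H(z) - H(z')})$):

```python
import numpy as np

def metropolis(H, z0, n_steps, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings for q(z) prop. to exp(-H(z))."""
    rng = rng or np.random.default_rng(0)
    z = np.asarray(z0, dtype=float)
    samples = []
    for _ in range(n_steps):
        prop = z + step * rng.standard_normal(z.shape)  # proposal from g(z, .)
        if np.log(rng.uniform()) < H(z) - H(prop):      # accept with prob. (13.20)
            z = prop
        samples.append(z.copy())
    return np.array(samples)

# usage: samples from a standard Gaussian via H(z) = |z|^2 / 2
# samples = metropolis(lambda z: 0.5 * np.sum(z**2), np.zeros(2), 10_000)
```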
Metropolis adjusted Langevin algorithm. While the Gibbs sampling and Metropolis-
Hastings methods were formulated for general variables and probability distribu-
tions, proving that the related chains are ergodic, and checking conditions for geo-
metric convergence speed is much harder when dealing with general state spaces
than with finite or compact spaces (see, e.g., [165, 133, 6, 166]). On the other
hand, interesting choices of proposal transitions for Metropolis-Hastings are avail-
able when B = Rd and µ is Lebesgue’s measure, taking advantage, in particular, of
differential calculus. More precisely, assume that q takes the form
$$q(z) = \frac 1C\exp(-H(z))$$
for some smooth function $H$ (at least $C^1$), such that $\exp(-H)$ is integrable. We saw in section 13.3.7 that, under suitable assumptions, the Markov chain
$$X_{n+1} = X_n - \frac\delta 2\nabla H(X_n) + \sqrt\delta\,\epsilon_{n+1} \tag{13.21}$$
with $\epsilon_{n+1}\sim\mathcal N(0, \mathrm{Id}_{\mathbb R^d})$ has $q$ as invariant distribution in the limit $\delta\to 0$. Its transition probability, such that $g(z, \cdot)$ is the p.d.f. of $\mathcal N(z - \frac\delta 2\nabla H(z), \delta\,\mathrm{Id}_{\mathbb R^d})$, is therefore a natural choice for a proposal distribution in the Metropolis-Hastings algorithm. In addition to sampling from the exact target distribution, this "Metropolis adjusted Langevin algorithm" (or MALA) can also be proved to satisfy geometric convergence under less restrictive hypotheses than (13.21) [167].
Hamiltonian MCMC. Consider the Hamiltonian system
$$\begin{cases}\partial_t\zeta(t) = \mu(t)\\ \partial_t\mu(t) = -\nabla H(\zeta(t))\end{cases} \tag{13.22}$$
with $\zeta(0) = z$ and $\mu(0)\sim\mathcal N(0, \mathrm{Id}_{\mathbb R^d})$. One can easily see that $\partial_t\big(H(\zeta(t)) + |\mu(t)|^2/2\big) = 0$, which implies that
$$H(\zeta(t)) + \frac 12|\mu(t)|^2 = H(z) + \frac 12|\mu(0)|^2$$
at all times $t$, or, denoting by $\varphi_N$ the p.d.f. of the $d$-dimensional standard Gaussian,
$$q(\zeta(t))\,\varphi_N(\mu(t)) = q(z)\,\varphi_N(\mu(0)).$$
Moreover, if one denotes by $\Phi_t(z, m) = (z_t(z, m), m_t(z, m))$ the solution $(\zeta(t), \mu(t))$ of the system started with $\zeta(0) = z$ and $\mu(0) = m$, one can also see that $\det(d\Phi_t(z, m)) = 1$ at all times. Indeed, applying (1.5) and the chain rule, we have
$$\partial_t\log\det(d\Phi_t(z, m)) = \mathrm{trace}\big(d\Phi_t(z, m)^{-1}\,\partial_td\Phi_t(z, m)\big).$$
From
$$\begin{cases}\partial_tz_t(z, m) = m_t(z, m)\\ \partial_tm_t(z, m) = -\nabla H(z_t(z, m))\end{cases}$$
we get
$$\partial_td\Phi_t(z, m) = \begin{pmatrix}\partial_zm_t(z, m) & \partial_mm_t(z, m)\\ -\nabla^2H(z_t(z, m))\,\partial_zz_t(z, m) & -\nabla^2H(z_t(z, m))\,\partial_mz_t(z, m)\end{pmatrix} = \begin{pmatrix}0 & \mathrm{Id}_{\mathbb R^d}\\ -\nabla^2H(z_t(z, m)) & 0\end{pmatrix}d\Phi_t(z, m).$$
We therefore get
$$\partial_t\log\det(d\Phi_t(z, m)) = \mathrm{trace}\begin{pmatrix}0 & \mathrm{Id}_{\mathbb R^d}\\ -\nabla^2H(z_t(z, m)) & 0\end{pmatrix} = 0,$$
so that $\det(d\Phi_t(z, m)) = 1$ at all times.
Let $\bar q_t$ denote the p.d.f. of $\Phi_t(z, m)$ when $(z, m)\sim\bar q_0$, and assume that $\bar q_0(z, m) = q(z)\,\varphi_N(m)$. We have, using the change of variables formula,
$$\bar q_t(\Phi_t(z, m))\,|\det d\Phi_t(z, m)| = q(z)\,\varphi_N(m).$$
But the r.h.s. is, from the remarks above, also equal to
$$q(z_t(z, m))\,\varphi_N(m_t(z, m))\,|\det d\Phi_t(z, m)|,$$
so that $\bar q_t(z, m) = q(z)\,\varphi_N(m)$: this distribution is invariant under the Hamiltonian flow.
Reversibility. One can actually show that the chain is in detailed balance for the joint density $\bar q(z, m) = q(z)\,\varphi_N(m)$. This is due to the fact that the system (13.22) is reversible, in the sense that the system solved from its end point after changing the sign of the momentum returns to its initial state after changing the sign of the momentum a second time. In other terms, letting $J(z, m) = (z, -m)$, we have $\Phi_t^{-1} = J\circ\Phi_t\circ J$. So, consider a function $f : (\mathbb R^d\times\mathbb R^d)^2\to\mathbb R$. Denoting the Markov chain by $(Z_n, M_n)$, we assume that the next pair $(Z_{n+1}, M_{n+1})$ is computed by (i) sampling $M_n'\sim\mathcal N(0, \mathrm{Id}_{\mathbb R^d})$; (ii) solving (13.22) with initial conditions $\zeta(0) = Z_n$ and $\mu(0) = M_n'$; (iii) taking $Z_{n+1} = \zeta(\theta)$ and sampling $M_{n+1}\sim\mathcal N(0, \mathrm{Id}_{\mathbb R^d})$.
We have
$$E(f(Z_n, M_n, Z_{n+1}, M_{n+1})) = \int f(z, \tilde m, z(z, m), \bar m)\,\varphi_N(m)\,\varphi_N(\bar m)\,\varphi_N(\tilde m)\,q(z)\,dm\,d\bar m\,d\tilde m\,dz.$$
Make the change of variables $z' = z(z, m)$, $m' = m(z, m)$, which has Jacobian determinant 1 and is such that $z = z(z', -m')$, $m = -m(z', -m')$. We get an expression which is equal to $E(f(Z_{n+1}, M_{n+1}, Z_n, M_n))$, showing the reversibility of the chain.
Time discretization. This simulation scheme can potentially make large moves in
the current configuration z while maintaining detailed balance (therefore not requir-
ing an accept/reject step). However, practical implementations require discretizing
(13.22), which breaks the conservation properties that were used in the argument
above, therefore requiring a Metropolis-Hastings correction. For example, a second-
order Runge-Kutta (RK2) scheme with time step $\alpha$ gives
$$\begin{cases}Z_{n+1} = Z_n + \alpha M_n - \dfrac{\alpha^2}2\nabla H(Z_n)\\[2mm] M_{n+1} = M_n - \dfrac\alpha 2\big(\nabla H(Z_n) + \nabla H(Z_n + \alpha M_n)\big).\end{cases}$$
Only the update for $Z_n$ matters, however, since $M_{n+1}$ is discarded and resampled at each step. Importantly, if we let $\sqrt\delta = \alpha$, the first equation in the system becomes
$$Z_{n+1} = Z_n - \frac\delta 2\nabla H(Z_n) + \sqrt\delta\,M_n$$
with $M_n\sim\mathcal N(0, \mathrm{Id})$, which is exactly (13.21). Note that one can, in principle, solve (13.22) for more than one discretization step (the continuous equation can be solved for an arbitrary time), but one must then face the challenge of computing the Metropolis correction, since the Hamiltonian is not conserved at each step.
One can however use schemes that are more adapted to solving Hamiltonian systems [120], such as the Störmer-Verlet scheme, which is
$$\begin{cases}M_{n+1/2} = M_n - \dfrac\alpha 2\nabla H(Z_n)\\[1mm] Z_{n+1} = Z_n + \alpha M_{n+1/2}\\[1mm] M_{n+1} = M_{n+1/2} - \dfrac\alpha 2\nabla H(Z_{n+1}).\end{cases}$$
These properties are conserved if one applies the Störmer-Verlet scheme more than once at each iteration, that is, fixing some $N > 0$ and letting $\Phi(z, m) = (\psi_1\circ\psi_2\circ\psi_1)^{\circ N}(z, m)$: then $\Phi^{-1} = J\Phi\circ J$, with $J(z, m) = (z, -m)$, and $\det d\Phi = 1$. Consider again the augmented chain which, starting from $(Z_n, M_n)$, samples $\tilde M\sim\mathcal N(0, \mathrm{Id}_{\mathbb R^d})$, then computes $(Z', \tilde M') = \Phi(Z_n, \tilde M)$ and finally samples $M'\sim\mathcal N(0, \mathrm{Id}_{\mathbb R^d})$, as a Metropolis-Hastings proposal to sample from $(z, m)\mapsto q(z)\,\varphi_N(m)$. Assuming that $(Z, M)$ follows this target distribution and letting $(Z', M')$ be the result of the proposal distribution, we have, as computed above,
$$E(f(Z, M, Z', M')) = \int f(z, \tilde m, z(z, m), \bar m)\,\varphi_N(m)\,\varphi_N(\bar m)\,\varphi_N(\tilde m)\,q(z)\,dm\,d\bar m\,d\tilde m\,dz = \int f(z(z', m'), \tilde m, z', \bar m)\,\varphi_N(m(z', m'))\,\varphi_N(\bar m)\,\varphi_N(\tilde m)\,q(z(z', m'))\,dm'\,d\bar m\,d\tilde m\,dz',$$
leading to the acceptance probability
$$a(z, m, z', m') = \min\left(1, \frac{\varphi_N(m(z', m'))\,q(z(z', m'))}{\varphi_N(m)\,q(z)}\right) = \exp\big(-\max\big(H(z(z', m'), m(z', m')) - H(z, m),\, 0\big)\big),$$
where $H(z, m) = H(z) + |m|^2/2$ denotes the total Hamiltonian.
While the Hamiltonian is not kept invariant by the Störmer-Verlet scheme, so that an
accept-reject step is needed, it is usually quite stable over extended periods of time
so that the acceptance probability is generally close to one.
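Putting the pieces together, here is a minimal sketch of the resulting sampler (with `grad_H` a hypothetical callable returning $\nabla H$):

```python
import numpy as np

def hmc(H, grad_H, z0, n_iter, alpha=0.1, N=20, rng=None):
    """Hamiltonian Monte Carlo with N Stormer-Verlet (leapfrog) steps of
    size alpha per iteration, for a target q(z) prop. to exp(-H(z))."""
    rng = rng or np.random.default_rng(0)
    z = np.asarray(z0, dtype=float)
    out = []
    for _ in range(n_iter):
        m = rng.standard_normal(z.shape)         # fresh momentum
        z_new, m_new = z.copy(), m.copy()
        for _ in range(N):                       # leapfrog integration
            m_new -= 0.5 * alpha * grad_H(z_new)
            z_new += alpha * m_new
            m_new -= 0.5 * alpha * grad_H(z_new)
        # Metropolis correction for the discretization error
        dH = H(z_new) + 0.5 * m_new @ m_new - H(z) - 0.5 * m @ m
        if np.log(rng.uniform()) < -dH:
            z = z_new
        out.append(z.copy())
    return np.array(out)
```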
13.6 Perfect sampling methods
We assume, in this section, that $B$ is a finite set. The Markov chain simulation meth-
ods provided in the previous sections do not provide exact samples from the dis-
tribution q, but only increasingly accurate approximations. Perfect sampling algo-
rithms [157, 158, 71] use Markov chains “backwards” to generate exact samples.
To describe them, it is convenient to represent a Markov chain through a stochastic recursive equation of the form
$$X_{n+1} = f(X_n, U_{n+1}) \tag{13.23}$$
where Un+1 is independent of Xn , Xn−1 , . . ., and the Uk ’s are identically distributed. In
the discrete case (assumed in this section), and given a stochastic matrix P , one can
take Un to be the uniformly distributed variable used to sample from (p(Xn , x), x ∈ B).
Conversely, the transition probability associated to (13.23) is p(x, y) = P (f (x, U ) = y).
It will be convenient to consider negative times also. For $n > 0$, recursively define $F_{-n}(x, u_{-n+1}, \ldots, u_0)$ by
$$F_{-n}(x, u_{-n+1}, \ldots, u_0) = F_{-n+1}\big(f(x, u_{-n+1}), u_{-n+2}, \ldots, u_0\big),$$
with $F_0(x) = x$.
We can write (where $U_{-k+1}^0 = (U_{-k+1}, \ldots, U_0)$)
$$\mathbb P(F_{-k}(x, U_{-k+1}^0) = y) = \mathbb P(F_{-k}(x, U_{-k+1}^0) = y, \nu\le k) + \mathbb P(F_{-k}(x, U_{-k+1}^0) = y, \nu > k) = \mathbb P(X_* = y, \nu\le k) + \mathbb P(F_{-k}(x, U_{-k+1}^0) = y, \nu > k).$$
The right-hand side tends to P(X∗ = y) when k tends to infinity (because P(ν > k)
tends to 0), and the left-hand side tends to Q(y), which gives the second part of the
theorem.
From (13.25), which is the key step in proving that X ∗ follows the invariant distri-
bution, one can see why it is important to consider sampling that expands backward
in time rather than forward. More specifically, consider the coalescence time for the
forward chain, letting ν̃(u0∞ ) be the first index for which
1. For all $x\in B$, define $\xi_s^x$, $s = -t_0, \ldots, 0$, by $\xi_{-t_0}^x = x$ and $\xi_{s+1}^x = f(\xi_s^x, u_{s+1})$.
2. If $\xi_0^x$ is constant (independent of $x$), let $\xi_*$ be equal to this constant value and stop. Otherwise, return to step 1, replacing $t_0$ with $2t_0$.
In practice, the $u_{-k}$'s are only generated when they are needed. But it is important to consider the sequence as fixed: once $u_{-k}$ is generated, it must be stored (or identically regenerated, using the same seed) for further use. It is important to stress that this algorithm works backward in time, in the sense that the first states of the sequence are not identical at each iteration, because they are generated using random numbers with indexes further in the past.
Such an algorithm is not feasible when $|B|$ is too large, since one would have to consider an intractable number of simulated sequences (one for each $x\in B$). However, there are cases in which the constancy of $\xi_0^x$ over all $B$ can be decided from its constancy over a small subset of $B$.
One situation in which this is true is when the Markov chain is monotone, according to the following definition. Assume that $B$ can be partially ordered, and that $f$ in (13.23) is increasing in $x$, i.e.,
$$x\le x' \Rightarrow f(x, u)\le f(x', u)\ \text{for all}\ u. \tag{13.26}$$
Let Bmin and Bmax be the set of minimal and maximal elements in B. Then the
sequence coalesces for the algorithm above if and only if it coalesces over Bmin ∪
Bmax . Indeed, any x ∈ B is smaller than some maximal element, and larger than
some minimal element in B. By (13.26), these inequalities remain true at each step
of the sampling process, which implies that when chains initialized with extremal
elements coalesce, so do the other ones. Therefore, it suffices to run the algorithm
with extremal configurations only.
One can rewrite (13.26) in terms of transition probabilities $p(x, y)$, assuming that $U$ follows a uniform distribution on $[0, 1]$ and that, for all $x\in B$, there exists a partition $(I_{xy}, y\in B)$ of $[0, 1]$ such that
$$f(x, u) = y \Leftrightarrow u\in I_{xy}$$
and $I_{xy}$ is an interval with length $p(x, y)$. Condition (13.26) is then equivalent to
$$x\le x' \Rightarrow \forall y\in B,\ I_{xy}\subset\bigcup_{y'\ge y}I_{x'y'}.$$
This requires in particular that $\sum_{y\ge y_0}p(x, y)\le\sum_{y\ge y_0}p(x', y)$ whenever $x\le x'$ (one says that $p(x, \cdot)$ is stochastically smaller than $p(x', \cdot)$).
One example in which this reduction works is the ferromagnetic Ising model, for which $B = \{-1, 1\}^L$ and
$$q(x) = \frac 1C\exp\left(\sum_{s,t=1,\,s<t}^L\beta_{st}x^{(s)}x^{(t)}\right)$$
with $\beta_{st}\ge 0$ for all $\{s, t\}$. Then, the Gibbs sampling algorithm iterates the following steps: take a random $s\in\{1, \ldots, L\}$ and update $x^{(s)}$ according to the conditional distribution
$$g_s(y\mid x^{(s^c)}) = \frac{e^{y\,v_s(x)}}{e^{-v_s(x)} + e^{v_s(x)}}$$
with $v_s(x) = \sum_{t\neq s}\beta_{st}x^{(t)}$. One can order $B$ so that $x\le\tilde x$ if and only if $x^{(s)}\le\tilde x^{(s)}$ for all $s = 1, \ldots, L$. The minimal and maximal elements are unique in this case, with $x_{\min}^{(s)}\equiv -1$ and $x_{\max}^{(s)}\equiv 1$. Moreover, because all $\beta_{st}$ are non-negative, $v_s$ is an increasing function of $x$, so that, if $x\le\tilde x$, then $g_s(1\mid x^{(s^c)})\le g_s(1\mid\tilde x^{(s^c)})$.
which satisfies (13.26). The whole updating scheme can then be implemented with the function
$$f(x, (u, \tilde u)) = \sum_{s=1}^L\mathbf 1_{I_s}(\tilde u)\,f_s(x, u)$$
where $(I_s, s = 1, \ldots, L)$ is any partition of $[0, 1]$ into intervals of length $1/L$. This is still monotonic. The algorithm described in proposition 13.9 can therefore be applied to sample exactly, in finite time, from the ferromagnetic Ising model; a generic sketch is given below.
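A generic sketch of this monotone coupling-from-the-past procedure (with the monotone update `f` and the extremal states passed as assumed inputs, and states represented as numpy arrays) is:

```python
import numpy as np

def cftp(f, x_min, x_max, rng=None):
    """Exact sampling, assuming f(x, u) is monotone in its first argument."""
    rng = rng or np.random.default_rng(0)
    us = []            # us[k] holds the randomness for time -k (fixed once drawn)
    t0 = 1
    while True:
        while len(us) < t0:
            us.append(rng.uniform(size=2))   # packs the pair (u, u~) of the text
        lo, hi = x_min, x_max
        for u in reversed(us[:t0]):          # run from time -t0 up to time 0
            lo, hi = f(lo, u), f(hi, u)
        if np.all(lo == hi):                 # coalescence: lo = hi is an exact sample
            return lo
        t0 *= 2                              # otherwise restart further in the past
```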
13.7 Markovian stochastic approximation
Using the material developed in this chapter, we now discuss the convergence of stochastic approximation methods (such as stochastic gradient descent) when the random variable in the update term follows Markovian transitions. In section 3.3, we considered algorithms in the form
$$\begin{cases}\xi_{t+1}\sim\pi_{X_t}\\ X_{t+1} = X_t + \alpha_{t+1}H(X_t, \xi_{t+1})\end{cases}$$
where $\xi_t : \Omega\to R_\xi$ is a random variable. We now want to address situations in which the random variable $\xi_{t+1}$ is obtained through a transition probability, therefore considering the algorithm
$$\begin{cases}\xi_{t+1}\sim P_{X_t}(\xi_t, \cdot)\\ X_{t+1} = X_t + \alpha_{t+1}H(X_t, \xi_{t+1}).\end{cases} \tag{13.27}$$
Here Px is, for all x, a transition probability from Rξ to Rξ . We will assume that,
for all x ∈ Rd , the Markov chain with transition Px is geometrically ergodic, and we
denote by πx its invariant distribution. We let, as in section 3.3, H̄(x) = Eπx (H(x, ·)).
We will use, for a function $f : \mathbb R^d\times R_\xi\to\mathbb R$, the notation
$$P_xf : (x', \xi)\in\mathbb R^d\times R_\xi\mapsto P_xf(x', \xi) = \int_{R_\xi}f(x', \xi')\,P_x(\xi, d\xi')$$
and
$$\pi_xf : x'\in\mathbb R^d\mapsto\pi_xf(x') = \int_{R_\xi}f(x', \xi)\,\pi_x(d\xi).$$
In particular, $\bar H(x) = \pi_xH(x)$. We also define $h(x, \xi) = H(x, \xi) - \bar H(x)$ and $\tilde h(x, \xi) = P_xh(x, \xi)$. We make the following assumptions.
and
$$\sum_{s=2}^t\alpha_s^2\,\sigma_s(1 - \rho(\sigma_s))^{-2} < \infty. \tag{13.31c}$$
Theorem 13.10 Assuming (H1) to (H4), the sequence defined by (13.27) is such that
$$\lim_{t\to\infty}E(|X_t - x_*|^2) = 0.$$
Remark 13.11 Condition (H1) assumes that H is bounded and uniformly Lipschitz
in x, which is more restrictive than what was assumed in section 3.3.2, but applies,
for example, to situations considered in Younes [209] and later in this book in sec-
tion 18.2.2.
Condition (H3) implies that the Markov chain with transition Px is uniformly geo-
metrically ergodic, but the ergodicity rate may depend on x and it may, in particular,
converge to 1 when x tends to ∞, which is the situation targeted in this theorem.
The reader may refer to [211] for a general discussion of this problem with re-
laxed hypotheses and almost sure convergence, at the expense of significantly longer
proofs.
Similarly to section 3.3.2, we let $A_t = |X_t - x_*|^2$ and $a_t = E(A_t)$. One can then write
$$A_{t+1} = A_t + 2\alpha_{t+1}(X_t - x_*)^T\bar H(X_t) + 2\alpha_{t+1}(X_t - x_*)^T\big(H(X_t, \xi_{t+1}) - \bar H(X_t)\big) + \alpha_{t+1}^2|H(X_t, \xi_{t+1})|^2.$$
However, one cannot assume that $E\big(H(X_t, \xi_{t+1}) - \bar H(X_t)\mid U_t\big) = 0$ anymore, where $U_t$ is the $\sigma$-algebra of all past events up to time $t$ (all events depending on $X_s, \xi_s$, $s\le t$). Indeed, the Markovian assumption implies that
$$E\big((X_t - x_*)^T(H(X_t, \xi_{t+1}) - \bar H(X_t))\mid U_t\big) = (X_t - x_*)^T\left(\int_{R_\xi}H(X_t, \xi)\,P_{X_t}(\xi_t, d\xi) - \bar H(X_t)\right) = (X_t - x_*)^T\big((P_{X_t}H(X_t, \cdot))(\xi_t) - \bar H(X_t)\big),$$
which does not vanish in general. Following Benveniste et al. [25], this can be addressed by introducing the solution $g(x, \cdot)$ of the "Poisson equation"
$$g(x, \xi) - P_xg(x, \xi) = h(x, \xi). \tag{13.33}$$
We then have
$$A_{t+1}\le(1 - 2\alpha_{t+1}\mu)A_t + 2\alpha_{t+1}(X_t - x_*)^T\big(g(X_t, \xi_{t+1}) - P_{X_t}g(X_t, \xi_t)\big) + 2\alpha_{t+1}(X_t - x_*)^TP_{X_t}g(X_t, \xi_t) - 2\alpha_{t+1}(X_t - x_*)^TP_{X_t}g(X_t, \xi_{t+1}) + \alpha_{t+1}^2|H(X_t, \xi_{t+1})|^2.$$
Noting that $|H(X_t, \xi_{t+1})|^2\le C_0^2$ and setting $\eta_{s,t} = E\big((X_s - x_*)^TP_{X_s}g(X_s, \xi_t)\big)$, one finds, after taking expectations,
$$a_{t+1}\le(1 - 2\alpha_{t+1}\mu)a_t + 2\alpha_{t+1}\eta_{t,t} - 2\alpha_{t+1}\eta_{t,t+1} + \alpha_{t+1}^2C_0^2.$$
Applying lemma 3.25, and letting $v_{s,t} = \prod_{j=s+1}^t(1 - 2\alpha_{j+1}\mu)$, one gets
$$a_t\le a_0v_{0,t} + 2\sum_{s=1}^tv_{s,t}\alpha_{s+1}(\eta_{s,s} - \eta_{s,s+1}) + C_0^2\sum_{s=1}^tv_{s,t}\alpha_{s+1}^2.$$
We now want to ensure that each term in the upper bound converges to 0. Similarly to section 3.3.2, (13.31a) implies that this holds for the first and last terms, and we therefore focus on the middle one, writing
$$\sum_{s=1}^tv_{s,t}\alpha_{s+1}(\eta_{s,s} - \eta_{s,s+1}) = v_{1,t}\alpha_2\eta_{1,1} - \alpha_{t+1}\eta_{t,t+1} + \sum_{s=2}^t(v_{s,t}\alpha_{s+1} - v_{s-1,t}\alpha_s)\eta_{s,s} + \sum_{s=2}^tv_{s-1,t}\alpha_s(\eta_{s,s} - \eta_{s-1,s}). \tag{13.34}$$
We will need the following estimates on the function $g$ in (13.33), which is defined by
$$g(x, \xi) = \sum_{n=0}^\infty P_x^nh(x, \xi) = h(x, \xi) + \sum_{n=0}^\infty P_x^n\tilde h(x, \xi).$$
Using lemma 13.12 (which is proved at the end of the section), we can control the terms intervening in (13.34). Note that the first term, $v_{1,t}\alpha_2\eta_{1,1}$, converges to 0, since (13.31a) implies that $v_{1,t}$ converges to 0.
We have
$$\alpha_{t+1}\big|E\big((X_t - x_*)^TP_{X_t}g(X_t, \xi_{t+1})\big)\big|\le 2MC_1\alpha_{t+1}\sigma_t(1 - \rho(\sigma_t))^{-1}.$$
Since both $\sum_s(\alpha_s - \alpha_{s+1})$ and $\sum_{s=2}^t\alpha_s^2$ converge (the former sum is just $\alpha_1$), lemma 3.26 implies that
$$\sum_{s=2}^t(v_{s,t}\alpha_{s+1} - v_{s-1,t}\alpha_s)\eta_{s,s}$$
converges to 0.
We have
$$\left|\sum_{s=2}^tv_{s-1,t}\alpha_s\,E\big((X_s - X_{s-1})^TP_{X_s}g(X_s, \xi_s)\big)\right|\le 2C_0C_1M\sum_{s=2}^tv_{s-1,t}\alpha_s^2(1 - \rho(\sigma_s))^{-1}$$
and
$$\left|\sum_{s=2}^tv_{s-1,t}\alpha_s\,E\big((X_{s-1} - x_*)^T(P_{X_s}g(X_s, \xi_s) - P_{X_{s-1}}g(X_{s-1}, \xi_s))\big)\right|\le 2M^2C_0C_1(1 + C_2)|X_0 - x_*|\sum_{s=2}^tv_{s-1,t}\alpha_s^2\,\sigma_s(1 - \rho(\sigma_s))^{-2},$$
and lemma 3.26 implies that both terms vanish at infinity. This concludes the proof of theorem 13.10.
Proof (Proof of lemma 13.12) Condition (H3) and proposition 12.3 imply that (since $\pi_x\tilde h = 0$)
$$|P_x^n\tilde h(x, \xi)|\le D_{var}(P_x^n(\xi, \cdot), \pi_x)\,\mathrm{osc}(\tilde h(x, \cdot))\le 2C_1M\rho(x)^n.$$
This gives
$$P_x^n \tilde h(x,\xi) - P_y^n \tilde h(y,\xi) = \sum_{k=0}^{n-1} P_x^{n-k-1}\big(P_x P_y^k \tilde h(y,\xi) - P_y^{k+1}\tilde h(y,\xi) - \pi_x P_y^k \tilde h(y) + \pi_x P_y^{k+1}\tilde h(y)\big) + \sum_{k=0}^{n-1}\big(\pi_x P_y^k \tilde h(y) - \pi_x P_y^{k+1}\tilde h(y)\big) + P_x^n \tilde h(x,\xi) - P_x^n \tilde h(y,\xi)$$
$$= \sum_{k=0}^{n-1} P_x^{n-k-1}\big(P_x P_y^k \tilde h(y,\xi) - P_y^{k+1}\tilde h(y,\xi) - \pi_x P_y^k \tilde h(y) + \pi_x P_y^{k+1}\tilde h(y)\big) + \pi_x \tilde h(y) - \pi_x P_y^n \tilde h(y) + P_x^n \tilde h(x,\xi) - P_x^n \tilde h(y,\xi).$$
Finally,
$$P_x^n h(x,\xi) - P_y^n h(y,\xi) = \sum_{k=0}^{n-1} P_x^{n-k-1}\big(P_x P_y^k \tilde h(y,\xi) - P_y^{k+1}\tilde h(y,\xi) - \pi_x P_y^k \tilde h(y) + \pi_x P_y^{k+1}\tilde h(y)\big) + P_x^n\big(\tilde h(x,\xi) - \tilde h(y,\xi) + \pi_x \tilde h(y)\big) - (\pi_x - \pi_y)P_y^n \tilde h(y).$$
Each term in the sum can be bounded as
$$\big|P_x^{n-k-1}\big(P_x P_y^k \tilde h(y,\xi) - P_y^{k+1}\tilde h(y,\xi) - \pi_x P_y^k \tilde h(y) + \pi_x P_y^{k+1}\tilde h(y)\big)\big| \le M\bar\rho^{\,n-k-1}\,\mathrm{osc}\big(P_x P_y^k \tilde h(y,\cdot) - P_y^{k+1}\tilde h(y,\cdot)\big) \le C_2 M\bar\rho^{\,n-k-1}|x-y|\,\mathrm{osc}(P_y^k \tilde h(y,\cdot)) \le C_2 C_1 M^2 \bar\rho^{\,n-1}|x-y|.$$
We also have
$$|P_x^n(\tilde h(x,\xi) - \tilde h(y,\xi) + \pi_x \tilde h(y))| \le MC_1\bar\rho^{\,n}|x-y|$$
and
$$|(\pi_x - \pi_y)P_y^n \tilde h(y)| \le MC_2 C_1\bar\rho^{\,n}|x-y|,$$
so that
$$|P_x^n h(x,\xi) - P_y^n h(y,\xi)| \le MC_1\bar\rho^{\,n-1}\big(nMC_2 + (1+C_2)\bar\rho\big)|x-y|.$$
From this, it follows that
$$|P_x g(x,\xi) - P_y g(y,\xi)| \le \sum_{n=1}^{\infty} MC_1\bar\rho^{\,n-1}\big(nMC_2 + (1+C_2)\bar\rho\big)|x-y| \le \big(M^2 C_1 C_2 (1-\bar\rho)^{-2} + MC_1(1+C_2)(1-\bar\rho)^{-1}\big)|x-y|.$$
Chapter 14

Markov Random Fields
With this chapter, we start a discussion of large-scale statistical models in data science, starting with graphical models (Markov random fields and Bayesian networks) before discussing more recent approaches using, notably, deep learning. Important textbook references for the present chapter include Pearl [152], Ancona et al. [8], Winkler [206], Lauritzen [115], Cowell et al. [55], Koller and Friedman [109].
14.1 Independence and conditional independence

14.1.1 Definitions
One can easily check that X and Y are independent if and only if, for any non-negative function g : R_Y → R, one has
Notation 14.2 Independence is a property that involves two variables X and Y and an underlying probability distribution P. Independence of X and Y relative to P will be denoted (X ⊥ Y)_P. However, we will only write X ⊥ Y when there is no ambiguity on P.
An equivalent statement is that, for any z such that P(Z = z) ≠ 0, X and Y are independent when P is replaced by the conditional distribution P(· | Z = z).
In the general case, conditional independence means that, for any pair of non-negative measurable functions f and g,
Multiplying both terms in (14.1) by P(Z = z)², we get the equivalent statement: X and Y are conditionally independent given Z if and only if, for all x, y, z,
$$P(X = x, Y = y, Z = z)\,P(Z = z) = P(X = x, Z = z)\,P(Y = y, Z = z).$$
Note that the identity is meaningful, and always true, for P(Z = z) = 0, so that this case does not need to be excluded anymore.
Note that, dealing with discrete variables, all previous definitions automatically extend to groups of variables: for example, if Z₁, Z₂ are two discrete variables, so is Z = (Z₁, Z₂), and we immediately obtain a definition for the conditional independence of X and Y given Z₁ and Z₂, denoted (X ⊥ Y | Z₁, Z₂).
$$P(X_1 = x_1, \ldots, X_N = x_N) > 0 \quad \text{if } x_k \in \tilde R_k,\ k = 1,\ldots,N.$$
Note that the condition implies P(Xk = xk ) > 0 for all xk ∈ R̃k , so that R̃k = {xk ∈
RXk : P(Xk = xk ) > 0}, i.e., R̃k is the support of PXk . One can interpret the definition
as expressing the fact that any conjunction of events for different Xk ’s has positive
probability, as soon as each of them has positive probability (if all events may occur,
then they may occur together).
Note that the sets $\tilde R_k$ depend on $X_1, \ldots, X_N$. However, if this family of variables is fixed, there is no loss of generality in restricting the space $R_{X_k}$ to $\tilde R_k$ and therefore assuming that $P(X_1 = x_1, \ldots, X_N = x_N) > 0$ everywhere.
Proposition 14.5 Let X, Y, Z and W be random variables. The following properties (the standard "graphoid" properties of conditional independence; the last one uses the positivity assumption) are true:
(CI1) (X ⊥ Y | Z) ⇒ (Y ⊥ X | Z);
(CI2) (X ⊥ (Y, W) | Z) ⇒ (X ⊥ Y | Z);
(CI3) (X ⊥ (Y, W) | Z) ⇒ (X ⊥ Y | (Z, W));
(CI4) (X ⊥ Y | Z) and (X ⊥ W | (Y, Z)) ⇒ (X ⊥ (Y, W) | Z);
(CI5) (X ⊥ Y | (Z, W)) and (X ⊥ Z | (Y, W)) ⇒ (X ⊥ (Y, Z) | W).
Proof Properties (CI1) and (CI2) are easily deduced from (14.3) and left to the
reader. To prove the last three, we will use the notation P (x), P (x, y) etc. instead
of P(X = x), P(X = x, Y = y), etc. to save space. Identities are assumed to hold for all
x, y, z, w unless stated otherwise.
whenever P(x, y, z, w)P(z) = P(x, z)P(y, z, w). Summing this last equation over y (or applying (CI2)) yields P(x, z, w)P(z) = P(x, z)P(z, w). We can note that all terms in (14.4) vanish when P(z) = 0, so that the identity is true in this case. When P(z) ≠ 0, the right-hand side of (14.4) becomes
$$(P(x,z)P(z,w)/P(z))\,P(y,z,w) = (P(x,z)P(y,z,w)/P(z))\,P(z,w) = P(x,y,z,w)\,P(z,w).$$
Since (14.5) is true when P(y, z) = 0, we assume that this probability does not vanish and write
yielding (14.5).
Proposition 14.6 For variables X1 , . . . , Xn and Z, the following properties are equivalent.
Proof It is clear that (i) ⇒ · · · ⇒ (iv) so it suffices to prove that (iv) ⇒ (i). For this,
simply write (applying (iv) repeatedly to s = n − 1, n − 2, . . .)
The entropy is always non-negative, and provides a measure of the uncertainty associated to P. For a given finite set R, it is maximal when P is uniform over R, and minimal (and vanishing) when P is supported by a single ω ∈ R (i.e., P(ω) = 1).
One defines the entropy of two or more random variables as the entropy of their joint distribution, so that, for example,
$$H(X, Y) = -\sum_{(x,y)\in R_X\times R_Y} \log P(X = x, Y = y)\, P(X = x, Y = y).$$
If X and Y are two random variables, and y ∈ RY with P(Y = y) > 0, the entropy
of the conditional probability x 7→ P(X = x | Y = y) is denoted H(X | Y = y), and
is a function of y. The conditional entropy of X given Y , denoted H(X | Y ) is the
expectation of H(X | Y = y) for the distribution of Y , i.e.,
$$H(X \mid Y) = \sum_{y\in R_Y} H(X \mid Y = y)\,P(Y = y) = -\sum_{x\in R_X}\sum_{y\in R_Y} \log P(X = x \mid Y = y)\,P(X = x, Y = y).$$
The identity H(X, Y) = H(X | Y) + H(Y) that is deduced from proposition 14.8 can be generalized to more than two random variables (the proof being left to the reader), yielding, if X₁, ..., Xₙ are random variables:
$$H(X_1, \ldots, X_n) = \sum_{k=1}^{n} H(X_k \mid X_1, \ldots, X_{k-1}). \tag{14.12}$$
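As a quick numerical check of the chain rule in the two-variable case, H(X, Y) = H(X | Y) + H(Y), the following sketch computes both sides from an arbitrary made-up joint table (the numbers are placeholders):

```python
import numpy as np

# joint distribution of (X, Y) on a 2x3 table (rows: x, columns: y)
P = np.array([[0.10, 0.20, 0.05],
              [0.25, 0.10, 0.30]])

def entropy(p):
    p = p[p > 0]                      # 0 log 0 = 0 by convention
    return -np.sum(p * np.log(p))

H_XY = entropy(P)                     # joint entropy H(X, Y)
P_Y = P.sum(axis=0)                   # marginal of Y
H_Y = entropy(P_Y)
# conditional entropy H(X | Y) = sum_y P(Y = y) H(X | Y = y)
H_X_given_Y = sum(P_Y[j] * entropy(P[:, j] / P_Y[j]) for j in range(P.shape[1]))
assert abs(H_XY - (H_X_given_Y + H_Y)) < 1e-12
```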
Proposition 14.9 Let X, Y and Z be three random variables. The following statements
are equivalent.
Proof From proposition 14.7, we have, for any three random variables X, Y , Z, and
any z such that P (Z = z) > 0,
To prove that (i)-(iii) implies (iv), we note that (14.14) and (14.15) imply that, for
any three random variables:
H(X | Y , Z) ≤ H(X | Y ).
If X and Y are conditionally independent given Z, then the right-hand side is equal
to H(X | Z) and this yields
Statement (iv) is often called the data-processing inequality, and has been used to
infer conditional independence within gene networks [126].
14.2 Models on undirected graphs

An undirected graph is a collection of vertexes and edges, in which edges link pairs of vertexes without order. Edges can therefore be identified with subsets of cardinality two of the set of vertexes, V. This yields the definition:
Note that edges in undirected graphs are defined as sets, i.e., unordered pairs, which are delimited with braces in these notes. Later on, we will use parentheses to represent ordered pairs, (s, t) ≠ (t, s). We will write s ∼_G t, or simply s ∼ t, to indicate that s and t are connected by an edge in G (we also say that s and t are neighbors in G).
We say that s and t are connected by a path if either s = t or there exists a path
(s0 , . . . , sN ) such that s0 = s and sN = t.
A subset T ⊂ V separates two other subsets S and S′ if all paths between S and S′ must pass in T. We will write (S ⊥ S′ | T) in such a case.
One of the goals of this chapter is to relate the notion of conditional independence within a set of variables to separation in a suitably chosen undirected graph with vertexes in one-to-one correspondence with the variables. This also justifies the similarity of notation used for separation and conditional independence.
$$(S \perp S' \mid T) \Rightarrow S \cap S' \subset T.$$
Indeed, if (S ⊥ S′ | T) and s₀ ∈ S ∩ S′, the path (s₀) links S and S′ and therefore must pass in T.
For the ⇒ part of (iv), if a path links S and T ∪ R, then it either links S and T and must pass through U by the first assumption, or links S and R and therefore passes through U or T by the second assumption. But if the path passes through T, it must also pass through U before, by the first assumption. In all cases, the path passes through U. The ⇐ part of (iv) is obvious.
Finally, consider (v) and take a path between two distinct elements in S and U ∪ R. Consider the first time the path hits U or R, and assume that it hits U (the other case being treated similarly by symmetry). Notice that the path cannot hit both U and R at the same point since U ∩ R = ∅. From the assumptions, the path must hit T ∪ R before passing by U, and the intersection cannot be in R, so it is in T, which is the conclusion we wanted.
Letting F denote the collection (F_s, s ∈ V), we will denote the set of such configurations as F(V, F). When F is clear from the context, we will just write F(V). If S ⊂ V and x ∈ F(V, F), the restriction of x to S is denoted x^{(S)} = (x^{(s)}, s ∈ S). The set formed by those restrictions will be denoted F(S, F) (or just F(S)).
Remark 14.14 Some care needs to be given to the definition of the space of con-
figurations, to avoid ambiguities when two sets Fs coincide. The configuration x =
(x(s) , s ∈ V ) should be understood, in an emphatic way, as the collection x̂ = ((s, x(s) ), s ∈
V ), which makes explicit the fact that x(s) is the value observed at vertex s. Similarly
the emphatic notation for x(S) ∈ F (V , F ) is x̂(S) = ((s, x(s) ), s ∈ S).
In the following, we will not use the emphatic notation to avoid overly heavy
expressions, but its relevance should be clear with the following simple example.
Take V = {1, 2, 3} and F₁ = F₂ = F₃ = {0, 1}. Let x^{(1)} = 0, x^{(2)} = 0 and x^{(3)} = 1. Then the sub-configurations x^{({1,3})} and x^{({2,3})} both correspond to the values (0, 1), but we consider them as distinct. In the same spirit, x^{(1)} = x^{(2)}, but x^{({1})} ≠ x^{({2})}.
Letting the observation over an empty set S be empty, i.e., X^{(∅)} = ∅, this definition includes the statement that, if S and T are disconnected (i.e., there is no path between them: they are separated by the empty set), then (X^{(S)} ⊥ X^{(T)} | ∅): X^{(S)} and X^{(T)} are independent.
The first step for our reduction is provided by the following lemma.
Lemma 14.16 Let G = (V , E) be an undirected graph and X = (Xs , s ∈ V ) a set of random
variables indexed by V . Then X is G-Markov if and only if, for S, T , U ⊂ V ,
If the configurations x^{(A)}, x^{(B)}, y^{(C)} are not consistent (i.e., x^{(t)} ≠ y^{(t)} for some t ∈ C), then both sides vanish. So we can assume x^{(C)} = y^{(C)} and remove x^{(A)} and x^{(B)} from the expression, since they are redundant. The resulting identity is true since it exactly states that (X^{(S₁)} ⊥ X^{(T₁)} | X^{(U)}).
which is the set of neighbors of all vertexes in S that do not belong to S. (Here S^c denotes the complementary set of S, S^c = V \ S.) Finally, let W_S denote the vertexes that are "remote" from S, W_S = (S ∪ V_S)^c.
Consider now the "if" part. Take S, T, U such that (S ⊥ T | U). We want to prove that (X^{(S)} ⊥ X^{(T)} | X^{(U)}). According to lemma 14.16, we can assume, without loss of generality, that S ∩ U = T ∩ U = ∅.
Define R as the set of vertexes v in V such that there exists a path between S and
v that does not pass in U . Then:
We can then write (each decomposition being a partition, implicitly defining the sets A, B and C, see Fig. 14.1) R = S ∪ A, U = V_R ∪ C, (R ∪ V_R)^c = T ∪ C ∪ B, and from (X^{(R)} ⊥ X^{(W_R)} | X^{(V_R)}), we get
$$((X^{(S)}, X^{(A)}) \perp (X^{(T)}, X^{(B)}) \mid X^{(U)})$$
by (CI3), which finally implies (X^{(S)} ⊥ X^{(T)} | X^{(U)}) by (CI2).
then X is Markov relative to G. The converse statement is true without the positivity
assumption.
Proof It suffices to prove that, if (14.18) is true for S and T ⊂ V , with T ∩ S = ∅, it is
also true for S ∪ T . The result will then follow by induction.
To see that the positivity assumption is needed, consider the following example with six variables X^{(1)}, ..., X^{(6)}, and a graph linking consecutive integers and closing with an edge between 1 and 6. Assume that X^{(1)} = X^{(2)} = X^{(4)} = X^{(5)}, and that X^{(1)}, X^{(3)} and X^{(6)} are independent. Then (14.19) is true, since, for k = 1, 2, 4, 5, X^{(k)} is constant given its neighbors, and X^{(3)} (resp. X^{(6)}) is independent of the rest of the variables. But (X^{(1)}, X^{(2)}) is not independent of (X^{(4)}, X^{(5)}) given the neighbors X^{(3)}, X^{(6)}.
then X is Markov relative to G. The converse statement is true without the positivity
assumption.
Proof Fix s ∈ V and assume that (X^{(s)} ⊥ X^{(R)} | X^{(V\R)}) for any R ⊂ W_s with cardinality at most k (the statement is true for k = 1 by assumption). Consider a set R̃ ⊂ W_s of cardinality k + 1, which we decompose into R ∪ {t} for some t ∈ R̃. We have (X^{(s)} ⊥ X^{(t)} | X^{(V\R̃)}, X^{(R)}) from the initial hypothesis and (X^{(s)} ⊥ X^{(R)} | X^{(V\R̃)}, X^{(t)}) from the induction hypothesis. Using property (CI5), this yields (X^{(s)} ⊥ X^{(R̃)} | X^{(V\R̃)}). This proves the proposition by induction.
Any graph with respect to which X is Markov must be richer than the graph G_X = (V, E_X) defined by s ∼_{G_X} t if and only if (X^{(s)} ⊥ X^{(t)} | X^{({s,t}^c)}) does not hold. This is true because, for any graph G for which X is Markov, we have
$$s \not\sim_G t \Rightarrow (X^{(s)} \perp X^{(t)} \mid X^{(\{s,t\}^c)}) \Rightarrow s \not\sim_{G_X} t.$$
Interestingly, proposition 14.19 states that X is GX -Markov as soon as its joint dis-
tribution is positive. This implies that GX is the minimal graph over which X is
Markov in this case.
One important property of G-Markov models is that the Markov property is es-
sentially conserved when passing to conditional distributions. We introduce for this
the following definitions.
so that
The effect of taking marginal distributions for a G-Markov model is, unfortunately, not as mild an operation as computing conditional distributions, in the sense that the conditional independence structure of the marginal distribution may be much more complex than the original one.
The graph G_{X^{(S)}} associated with the marginal process can be much more complex than the restricted graph G_S introduced in the previous section (note that, by definition, G_{X^{(S)}} is richer than G_S). Take, for example, the graph that corresponds to "hidden Markov models," for which (cf. fig. 14.2)
$$V = \{1,\ldots,N\}\times\{0,1\}$$
and edges {s, t} ∈ E have either s = (k, 0) and t = (l, 0) with |k − l| = 1, or s = (k, 0) and t = (k, 1). Let S = {1, ..., N} × {1}. Then G_S is totally disconnected (E_S = ∅), since no edge in G links two elements of S. In contrast, any pair of elements in S is connected by a path in S^c, so that G_{X^{(S)}} is a complete graph.
Figure 14.2: In this graph, variables in the lower row are conditionally independent given
the first row, while their marginal distribution requires a completely connected graph.
14.3 The Hammersley-Clifford theorem

The Hammersley-Clifford theorem, which will be proved in this section, gives a complete description of positive Markov processes relative to a given graph, G. It states that positive G-Markov models are associated to families of positive local interactions indexed by cliques in the graph. We now introduce each of these concepts.
Definition 14.24 Let V be a set of vertexes and (F_s, s ∈ V) a collection of state spaces. A family of local interactions is a collection of non-negative functions Φ = (ϕ_C, C ∈ C) indexed over some subset C of P(V), such that each ϕ_C only depends on configurations restricted to C (i.e., it is defined on F(C)), with values in [0, +∞). (Recall that P(V) is the set of all subsets of V.)
Such a family has order p if no C ∈ C has cardinality larger than p. A family of local interactions of order 2 is also called a family of pair interactions.
The first term in the last product only depends on x^{(U_S)}, and the second one only on x^{(V\S)}. Introduce the notation
$$\mu_1(x^{(V_S)}) = \sum_{y^{(U_S)}:\, y^{(V_S)} = x^{(V_S)}}\ \prod_{C:\, C\cap S\neq\emptyset} \varphi_C(y^{(C)})$$
$$\mu_2(x^{(V_S)}) = \sum_{y^{(V\setminus S)}:\, y^{(V_S)} = x^{(V_S)}}\ \prod_{C:\, C\cap S=\emptyset} \varphi_C(y^{(C)}).$$
where Z(y (T ) ) is a constant that only depends on y (T ) . The fact that πS|T (· | y (T ) ) is
associated to Φ|y (T ) is then obtained by reorganizing the product over distinct S ∩
C’s.
This result, combined with proposition 14.25, is consistent with proposition 14.22, in the sense that the restriction of G_C to S coincides with the graph G_{C_S}. The easy proof is left to the reader.
We now consider marginals, and more specifically marginals when only one node
is removed, which provides the basis for “node elimination.”
Proposition 14.27 Let π be associated to Φ = (ϕ_C, C ∈ C) as above. Let t ∈ V and S = V \ {t}. Define C_t ⊂ P(V) as the set
$$\mathcal C_t = \{C \in \mathcal C : t \notin C\} \cup \{\tilde C_t\}$$
with
$$\tilde C_t = \bigcup_{C\in\mathcal C:\, t\in C} C \setminus \{t\}.$$
• If $\tilde C_t \notin \mathcal C$:
$$\tilde\varphi_{\tilde C_t}(x^{(\tilde C_t)}) = \sum_{y^{(t)}\in F_t}\ \prod_{C\in\mathcal C,\, t\in C} \varphi_C(x^{(\tilde C_t)} \wedge y^{(t)}).$$
• If $\tilde C_t \in \mathcal C$:
$$\tilde\varphi_{\tilde C_t}(x^{(\tilde C_t)}) = \varphi_{\tilde C_t}(x^{(\tilde C_t)}) \sum_{y^{(t)}\in F_t}\ \prod_{C\in\mathcal C,\, t\in C} \varphi_C(x^{(\tilde C_t)} \wedge y^{(t)}).$$
A clique that cannot be strictly included in any other clique is called a maximal clique, and the set of maximal cliques is denoted $\mathcal C_G^*$. (Note that some authors call cliques what we refer to as maximal cliques.)
Λ = (λC , C ∈ C)
Proof Let us start with the "if" part. If π is associated to a potential over $\mathcal C_G$, we have already proved that π is $G_{\mathcal C_G}$-Markov, so that it suffices to prove that $G_{\mathcal C_G} = G$, which is almost obvious: if s ∼_G t, then {s, t} ∈ $\mathcal C_G$ and s ∼_{G_{\mathcal C_G}} t by definition of $G_{\mathcal C_G}$. Conversely, if s ∼_{G_{\mathcal C_G}} t, there exists C ∈ $\mathcal C_G$ such that {s, t} ⊂ C, which implies that s ∼_G t, by definition of a clique.
We now prove the "only if" part, which relies on a combinatorial lemma, one of Möbius's inversion formulas.
Lemma 14.31 Let A be a finite set and f : P(A) → R, B ↦ f_B. Then, there is a unique function λ : P(A) → R such that
$$\forall B \subset A, \quad f_B = \sum_{C\subset B} \lambda_C, \tag{14.26}$$
and λ is given by
$$\lambda_C = \sum_{B\subset C} (-1)^{|C|-|B|} f_B. \tag{14.27}$$
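Before proving the lemma, here is a quick numerical verification of the inversion, with subsets of a hypothetical three-element set A encoded as bitmasks (the random λ is arbitrary):

```python
import random

n = 3                                   # A = {0, 1, 2}, subsets encoded as bitmasks
random.seed(1)
lam = {C: random.uniform(-1, 1) for C in range(1 << n)}
# (14.26): f_B = sum over C subset of B of lambda_C
f = {B: sum(lam[C] for C in range(1 << n) if C & B == C) for B in range(1 << n)}

def popcount(m):
    return bin(m).count("1")

# (14.27): lambda_C = sum over B subset of C of (-1)^(|C| - |B|) f_B
lam_rec = {C: sum((-1) ** (popcount(C) - popcount(B)) * f[B]
                  for B in range(1 << n) if B & C == B)
           for C in range(1 << n)}
assert all(abs(lam[C] - lam_rec[C]) < 1e-12 for C in range(1 << n))
```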
To prove the lemma, first notice that the space F of functions f : P(A) → R is a vector space of dimension $2^{|A|}$ and that the transformation $\phi : \lambda \mapsto f$ with $f_B = \sum_{C\subset B}\lambda_C$ is linear. It therefore suffices to prove that, given any f, the function λ given in (14.27) satisfies $\phi(\lambda) = f$, since this proves that $\phi$ is onto from F to F and therefore necessarily one-to-one.
The last identity comes from the fact that, for any finite set $\tilde B \subset B$, $\tilde B \neq B$, we have
$$\sum_{C\supset\tilde B,\, C\subset B} (-1)^{|C|-|\tilde B|} = 0$$
(for $\tilde B = B$, the sum is obviously equal to 1). Indeed, if $s \in B$, $s \notin \tilde B$, we have
$$\sum_{C\supset\tilde B,\, C\subset B} (-1)^{|C|-|\tilde B|} = \sum_{C\supset\tilde B,\, C\subset B,\, s\in C} (-1)^{|C|-|\tilde B|} + \sum_{C\supset\tilde B,\, C\subset B,\, s\notin C} (-1)^{|C|-|\tilde B|} = \sum_{C\supset\tilde B,\, C\subset B,\, s\notin C} \big((-1)^{|C\cup\{s\}|-|\tilde B|} + (-1)^{|C|-|\tilde B|}\big) = 0.$$
So the lemma is proved. We now proceed to proving the existence and uniqueness statements in theorem 14.30. Assume that X is G-Markov and positive. Fix x ∈ F(V) and consider the function defined on P(V) by
$$f_B(x^{(B)}) = -\log\frac{\pi(x^{(B)} \wedge 0^{(B^c)})}{\pi(0)}.$$
Then, letting
$$\lambda_C(x^{(C)}) = \sum_{B\subset C} (-1)^{|C|-|B|} f_B(x^{(B)}),$$
the inversion formula applied to B = V gives
$$\pi(x) = Z\,\exp\Big(-\sum_{C\subset V}\lambda_C(x^{(C)})\Big)$$
with Z = P(0). We now prove that $\lambda_C(x^{(C)}) = 0$ if $x^{(s)} = 0^{(s)}$ for some s ∈ V or if $C \notin \mathcal C_G$. This will prove (14.25) and the existence statement in theorem 14.30.
So, assume $x^{(s)} = 0^{(s)}$. Then, for any B such that $s \notin B$, we have $f_B(x^{(B)}) = f_{\{s\}\cup B}(x^{(\{s\}\cup B)})$. Now take C with s ∈ C. We have
$$\lambda_C(x^{(C)}) = \sum_{B\subset C,\, s\in B} (-1)^{|C|-|B|} f_B(x^{(B)}) + \sum_{B\subset C,\, s\notin B} (-1)^{|C|-|B|} f_B(x^{(B)}) = \sum_{B\subset C,\, s\notin B} (-1)^{|C|-|B\cup\{s\}|} f_{B\cup\{s\}}(x^{(B\cup\{s\})}) + \sum_{B\subset C,\, s\notin B} (-1)^{|C|-|B|} f_B(x^{(B)}) = \sum_{B\subset C,\, s\notin B} \big((-1)^{|C|-|B\cup\{s\}|} + (-1)^{|C|-|B|}\big) f_B(x^{(B)}) = 0.$$
Now assume that C is not a clique, and let s ≠ t ∈ C be such that s ≁ t. We can write, using decompositions similar to the above,
$$\lambda_C(x^{(C)}) = \sum_{B\subset C\setminus\{s,t\}} (-1)^{|C|-|B|}\big(f_{B\cup\{s,t\}}(x^{(B\cup\{s,t\})}) - f_{B\cup\{s\}}(x^{(B\cup\{s\})}) - f_{B\cup\{t\}}(x^{(B\cup\{t\})}) + f_B(x^{(B)})\big).$$
(extending Λ so that $\lambda_C = 0$ for $C \notin \mathcal C_G$). But, from lemma 14.31, this uniquely defines Λ.
The simplest example of a G-Markov process (for any graph G) is the case when X = (X^{(s)}, s ∈ V) is a collection of independent random variables. In this case, we can take G_X = (V, ∅), the totally disconnected graph on V. Another simple fact is that, as already remarked, any X is Markov for the complete graph (V, P₂(V)), where P₂(V) contains all subsets of V with cardinality 2.
Beyond these trivial (but nonetheless important) cases, the simplest graph-Markov
processes are those associated with linear graphs, providing finite Markov chains.
For this, we let V be a finite ordered set, say,
V = {0, . . . , N } .
The distribution of a Markov chain is therefore fully specified by P(X^{(0)} = x^{(0)}), x^{(0)} ∈ F₀ (the initial distribution) and the conditional probabilities
$$p_k(x^{(k-1)}, x^{(k)}) = P(X^{(k)} = x^{(k)} \mid X^{(k-1)} = x^{(k-1)}) \tag{14.28}$$
(with an arbitrary choice when P(X^{(k-1)} = x^{(k-1)}) = 0). Indeed, assume that P(X^{(0)} = x^{(0)}, ..., X^{(k-1)} = x^{(k-1)}) is known (for all x^{(0)}, ..., x^{(k-1)}). Then, either this probability vanishes, in which case P(X^{(0)} = x^{(0)}, ..., X^{(k)} = x^{(k)}) = 0 as well, or it is positive and this joint probability is obtained by multiplying it by p_k(x^{(k-1)}, x^{(k)}).
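In code, (14.28) translates directly into the computation of joint probabilities and into forward (ancestral) sampling; the sketch below uses a toy binary chain with a transition table that does not depend on k (all numbers are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
init = np.array([0.6, 0.4])          # P(X^(0) = x) on F_0 = {0, 1}
p = np.array([[0.9, 0.1],            # p_k(x^(k-1), x^(k)), constant in k
              [0.3, 0.7]])

def joint_prob(path):
    """P(X^(0)=path[0], ..., X^(N)=path[N]) as a product of transitions."""
    prob = init[path[0]]
    for a, b in zip(path[:-1], path[1:]):
        prob *= p[a, b]
    return prob

def sample(N):
    """Forward sampling of the chain up to time N."""
    x = [rng.choice(2, p=init)]
    for _ in range(N):
        x.append(rng.choice(2, p=p[x[-1]]))
    return x

print(joint_prob([0, 0, 1]), sample(5))
```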
We now proceed by induction and assume that (X^{(t)} ⊥ X^{(s)} | Y^{(u)}) for some u ≥ t. Then, we have (X^{(u+1)} ⊥ (X^{(s)}, X^{(t)}, Y^{(u−1)}) | X^{(u)}), which implies (from (CI3)) (X^{(u+1)} ⊥ X^{(t)} | X^{(s)}, Y^{(u)}). Applying (CI4) to (X^{(t)} ⊥ X^{(s)} | Y^{(u)}) and (X^{(t)} ⊥ X^{(u+1)} | X^{(s)}, Y^{(u)}), we obtain (X^{(t)} ⊥ (X^{(s)}, X^{(u+1)}) | Y^{(u)}) and finally (X^{(t)} ⊥ X^{(s)} | Y^{(u+1)}). By induction, this gives (X^{(t)} ⊥ X^{(s)} | Y^{(N)}), and proposition 14.19 now implies that X is G-Markov.
14.4 Models on acyclic graphs

The situation with acyclic graphs is only slightly more complex than with linear graphs, but will require a few new definitions, including those of directed graphs and trees.
The difference between directed and undirected graphs is that the edges of the former are ordered pairs, namely
$$E \subset V\times V \setminus \{(s,s),\, s\in V\}.$$
So, for directed graphs, edges (s, t) and (t, s) have different meanings, and we allow at most one of them in E. We say that the edge e = (s, t) stems from s and points to t.
The parents of a vertex s are the vertexes t such that (t, s) ∈ E, and its children are the
vertexes t such that (s, t) ∈ E. We will also use the notation s →G t to indicate that
(s, t) ∈ E (compare to s ∼G t for undirected graphs).
Proposition 14.38 In a directed graph, any non-trivial closed path contains a loop (i.e.,
one can delete vertexes from it to finally obtain a loop.)
In an undirected graph, any non-trivial closed path which is not a union of folded
paths contains a loop.
Proof Take γ = (s₀, s₁, ..., s_N) with s_N = s₀. The path being non-trivial means N > 1.
First take the case of a directed graph. Clearly, N ≥ 3, since a two-vertex path cannot be closed in a directed graph. Consider the first occurrence of a repetition, i.e., the first index j for which
$$s_j \in \{s_0, \ldots, s_{j-1}\}.$$
Then there is a unique j′ ∈ {0, ..., j − 1} such that s_{j′} = s_j, and the path (s_{j′}, ..., s_{j−1}) must be a loop (any repetition in the sequence would contradict the fact that j was the first occurrence). This proves the result in the directed case.
Consider now an undirected graph. We can recursively remove all folded subpaths, by keeping everything but their initial point, since each such operation still provides a path at the end. Assume that this is done, still denoting the remaining path (s₀, s₁, ..., s_N), which therefore has no folded subpath. We must have N ≥ 3, since N = 1 implies that the original path was a union of folded paths, and N = 2 provides a folded path. Let 0 ≤ j′ < j be as in the directed case. Note that one must have j′ < j − 2, since j′ = j − 1 would imply an edge between a vertex and itself, and j′ = j − 2 induces a folded subpath. But this implies that (s_{j′}, ..., s_{j−1}) is a loop.
Directed acyclic graphs (DAGs) will be important for us, because they are associated with the Bayesian networks that we will discuss later. For now, we are interested in undirected acyclic graphs and their relation to trees, which form a subclass of directed acyclic graphs, defined as follows.
Definition 14.39 A forest is a directed acyclic graph with the additional requirement
that each of its vertexes has at most one parent.
A root in a forest is a vertex that has no parent. A forest with a single root is called a
tree.
It is clear that a forest has at least one root, since one could otherwise describe a nontrivial loop by starting from any vertex and passing to its parent until the sequence self-intersects (which must happen since V is finite). We will use the following definition.
Definition 14.40 If G = (V, E) is a directed graph, its flattened graph, denoted G♭ = (V, E♭), is the undirected graph obtained by forgetting the edge ordering, namely
$$\{s,t\} \in E^\flat \Leftrightarrow (s,t)\in E \ \text{or}\ (t,s)\in E.$$
Conversely, if G is an undirected acyclic graph, there exists a forest G̃ such that G̃♭ = G.
Proof Let G = (V, E) be a forest and, in order to reach a contradiction, assume that G♭ has a loop, s₀, ..., s_{N−1}, s_N = s₀. Assume that (s₀, s₁) ∈ E; then also (s₁, s₂) ∈ E (otherwise s₁ would have two parents), and this propagates to all (s_k, s_{k+1}) for k = 0, ..., N − 1. But, since s_N = s₀, this provides a loop in G, which is not possible. This proves that G♭ has no loop, since the case (s₁, s₀) ∈ E is treated similarly.
Now, let G be an undirected acyclic graph. Fix a vertex s ∈ V and consider the
following procedure, in which we recursively define sets Sk of processed vertexes,
and Ẽk of oriented edges, k ≥ 0, initialized with S0 = {s} and Ẽ0 = ∅.
– At step k of the procedure, assume that vertexes in Sk have been processed and
edges in Ẽk have been oriented so that (Sk , Ẽk ) is a forest, and that Ẽk[ is the set of edges
{s, t} ∈ E such that s, t ∈ Sk (so, oriented edges at step k can only involve processed
vertexes).
– If Sk = V : stop, the proposition is proved.
– Otherwise, apply the following construction. Let Fk be the set of edges in E that
contain exactly one element of Sk .
(1) If Fk = ∅, take any s ∈ V \ Sk as a new root and let Sk+1 = Sk ∪ {s}, Ẽk+1 = Ẽk .
(2) Otherwise, add to Ẽk the oriented edges (s, t) such that s ∈ Sk and {s, t} ∈ Fk ,
yielding Ẽk+1 , and add to Sk the corresponding children (t’s) yielding Sk+1 .
We need to justify the fact that G̃k+1 = (Sk+1 , Ẽk+1 ) above is still a forest. This is
obvious after Case (1), so consider Case (2). First G̃k+1 is acyclic, since any oriented
loop is a fortiori an unoriented loop and G is acyclic. So we need to prove that no
vertex in Sk+1 has two parents. Since we did not add any parent to the vertexes in Sk
and, by assumption, (Sk , Ẽk ) is a forest, the only possibility for a vertex to have two
parents in Sk+1 is the existence of t such that there exists s, s0 ∈ Sk with {s, t} and {s0 , t}
in E. But, since s and s0 have unaccounted edges containing them, they cannot have
been introduced in Sk before the previously introduced root has been added, so they
are both connected to this root: but the two connections to t would create a loop in
G which is impossible.
So the procedure carries on, and must end with Sk = V at some point since we
keep adding points to Sk at each step.
Note that the previous proof shows that the orientation of a connected undirected acyclic graph into a tree is not unique, although it is uniquely specified once a root is chosen. The proof is constructive, and provides an algorithm building a forest from an undirected acyclic graph.
We now define graphical models supported by trees, which constitute our first
Markov models associated with directed graphs. Define the depth of a vertex in a
tree G = (V , E) to be the number of edges in the unique path that links it to the
root. We will denote by Gd the set of vertexes in G that are at depth d, so that G0
contains only the root, G1 the children of the root and so on. Using this, we have the
definition:
Definition 14.42 Let G = (V, E) be a tree. A process X = (X^{(s)}, s ∈ V) is G-Markov if and only if, for each d ≥ 1, and for each s ∈ G_d, we have
$$(X^{(s)} \perp X^{(G_{\le d}\setminus\{s,\,\mathrm{pa}(s)\})} \mid X^{(\mathrm{pa}(s))}).$$
So, conditionally to its parent, X^{(s)} is independent from all other variables at depth smaller than or equal to the depth of s.
Note that, from (CI3), definition 14.42 implies that, for all s ∈ G_d, X^{(s)} is conditionally independent of the other variables at depth d given X^{(G_q)}, q < d, which, using proposition 14.6, implies that the variables (X^{(s)}, s ∈ G_d) are mutually independent given X^{(G_q)}, q < d. This implies that, for d = 1 (letting s₀ denote the root in G):
$$P(X^{(G_1)} = x^{(G_1)}, X^{(s_0)} = x^{(s_0)}) = P(X^{(s_0)} = x^{(s_0)}) \prod_{s\in G_1} P(X^{(s)} = x^{(s)} \mid X^{(s_0)} = x^{(s_0)}).$$
(If P(X^{(s₀)} = x^{(s₀)}) = 0, the choice for the conditional probabilities can be made arbitrarily without changing the left-hand side, which vanishes.) More generally, we have, letting G_{<d} = G₀ ∪ · · · ∪ G_{d−1},
$$P(X^{(G_{\le d})} = x^{(G_{\le d})}) = \prod_{s\in G_d} P(X^{(s)} = x^{(s)} \mid X^{(\mathrm{pa}(s))} = x^{(\mathrm{pa}(s))})\ P(X^{(G_{<d})} = x^{(G_{<d})})$$
(with again an arbitrary choice for conditional probabilities that are not defined), so that we obtain, by induction, for x ∈ F(V),
$$P(X = x) = P(X^{(s_0)} = x^{(s_0)}) \prod_{s\neq s_0} p_s(x^{(\mathrm{pa}(s))}, x^{(s)}) \tag{14.30}$$
where $p_s(x^{(\mathrm{pa}(s))}, x^{(s)}) \stackrel{\Delta}{=} P(X^{(s)} = x^{(s)} \mid X^{(\mathrm{pa}(s))} = x^{(\mathrm{pa}(s))})$ are the tree transition probabilities between a parent and a child. So we have the following proposition.
We only have proved the “only if” part, but the “if” part is obvious from (14.31).
Another property that becomes obvious with this expression is the first part of the
following proposition.
Proof To prove the converse part, assume that G = (V, E) is undirected acyclic and that X is G-Markov. Take G̃ such that G̃♭ = G. For s ∈ V and its parent pa(s) in G̃, the sets {s} and G̃_{≤d} \ {s, pa(s)} are separated by pa(s) in G. To see this, assume that there exists t ∈ G̃_{≤d} \ {s, pa(s)} with a path from t to s that does not pass through pa(s). Then we can complete this path with the path from t to the first common ancestor (in G̃) of t and s and back to s, to create a path from s to s that passes only once through {pa(s), s} and therefore contains a loop by proposition 14.38.
Remark 14.45 We see that there is no real gain in generality in passing from undirected to directed graphs when working with trees. This is an important remark, because directionality in graphs is often interpreted as causality. For example, there is a natural causal order in the statements
in the sense that each event can be seen as a logical precursor to the next one. However, because one can pass from this directed chain to an equivalent undirected chain and then back to an equivalent directed tree by choosing any of the three variables as the root, there is no way to infer, from the observation of the joint distribution of the three events (it rains, car windshields get wet, wipers are on), any causal relationship between them: the joint distribution cannot resolve whether the wipers are on because it rains, or the other way around.
We will see that acyclic models have very nice computational properties that make
them attractive in designing distributions. However, the absence of loops is a very
restrictive constraint, which is not realistic in many practical situations. Feedback
effects are often needed, for example. Most models in statistical physics are sup-
ported by a lattice, in which natural translation/rotation invariance relations forbid
using any non-trivial acyclic model. As an example, we now consider the 2D Ising
model on a finite grid, which is a model for (anti)-ferromagnetic interaction in a spin
system.
Let G = (V, E). A (positive) G-Markov model is said to have only pair interactions if and only if it can be written in the form
$$\pi(x) = \frac{1}{Z}\exp\Big(-\sum_{s\in V} h_s(x^{(s)}) - \sum_{\{s,t\}\in E} h_{\{s,t\}}(x^{(s)}, x^{(t)})\Big).$$
The Ising model is a special case of models with pair interactions, for which the
state space, Fs , is equal to {−1, 1} for all s and
In fact, for binary variables, this is the most general pair interaction model.
The Ising model is moreover usually defined on a regular lattice, which, in two
dimensions, implies that V is a finite rectangle in Z2 , for example V = {−N , . . . , N }2 .
The simplest choice of a translation- and 90-degree rotation-invariant graph is the
nearest-neighbor graph for which {(i, j), (i 0 , j 0 )} ∈ E if and only if |i − i 0 | + |j − j 0 | = 1
(see fig. 14.3). With this graph, one can furthermore simplify the model to obtain
the isotropic Ising model given by
$$\pi(x) = \frac{1}{Z}\exp\Big(-\alpha\sum_{s\in V} x^{(s)} - \beta\sum_{s\sim t} x^{(s)} x^{(t)}\Big).$$
When β < 0, the model is ferromagnetic: each pair of neighbors with identical signs brings a negative contribution to the energy, making the configuration more likely (since lower energy implies higher probability).
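For concreteness, the unnormalized isotropic Ising probability can be evaluated directly (computing Z itself is the intractable part); the grid size and the parameter values in this sketch are placeholders:

```python
import numpy as np

def ising_log_unnormalized(x, alpha, beta):
    """log of Z * pi(x) for the isotropic Ising model on a 2D grid;
    x is an array with entries in {-1, +1}."""
    # nearest-neighbor products over right and down edges of the grid
    pair = np.sum(x[:, :-1] * x[:, 1:]) + np.sum(x[:-1, :] * x[1:, :])
    return -alpha * np.sum(x) - beta * pair

rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=(8, 8))
print(ising_log_unnormalized(x, alpha=0.0, beta=-0.5))
```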
The Potts model generalizes the Ising model to finite, but not necessarily binary, state spaces, say, F_s = F = {1, ..., n}. Define the function δ(λ, µ) = 1 if λ = µ and −1 otherwise. Then the Potts model is given by
$$\pi(x) = \frac{1}{Z}\exp\Big(-\alpha\sum_{s\in V} h(x^{(s)}) - \beta\sum_{s\sim t}\delta(x^{(s)}, x^{(t)})\Big) \tag{14.32}$$
Our discussion of Markov random fields on graphs was done under the assumption of finite state spaces, which notably simplifies many of the arguments and avoids relying too much on measure theory. While this situation does cover a large range of applications, there are cases in which one wants to consider variables taking values in continuous spaces, or in countable (infinite) spaces.
The results obtained for discrete variables can most of the time be extended to variables whose distribution has a p.d.f. with respect to a product of measures on the sets in which they take their values. For example, let X, Y, Z take values in R_X, R_Y, R_Z, equipped with σ-algebras S_X, S_Y, S_Z and measures µ_X, µ_Y, µ_Z. Assume that P_{X,Y,Z} is absolutely continuous with respect to µ_X ⊗ µ_Y ⊗ µ_Z, with density ϕ_{XYZ}. In such a situation, (14.3) remains valid, in that X is conditionally independent of Y given Z if and only if
for all z ∉ M_Z, which implies that, for all z ∉ M_Z, there exists a set N_z ⊂ R_X × R_Y with µ_X ⊗ µ_Y(N_z) = 0 such that the identity holds for all (x, y) ∉ N_z. This immediately implies (14.33) for those (x, y, z).
If z ∈ M_Z, then
$$0 = \varphi_Z(z) = \int_{R_X}\varphi_{XZ}(x,z)\,\mu_X(dx) = \int_{R_Y}\varphi_{YZ}(y,z)\,\mu_Y(dy),$$
implying that ϕ_{XZ}(x, z) = ϕ_{YZ}(y, z) = 0 except on some set N_z such that µ_X ⊗ µ_Y(N_z) = 0, and (14.33) is therefore also true outside of this set. Now, letting N = {(x, y, z) : (x, y) ∈ N_z}, we find that (14.33) is true for all (x, y, z) ∉ N and
$$\mu_X\otimes\mu_Y\otimes\mu_Z(N) = \int_{R_X\times R_Y\times R_Z} 1_{(x,y)\in N_z}\,\mu_X(dx)\,\mu_Y(dy)\,\mu_Z(dz) = \int_{R_Z}\mu_X\otimes\mu_Y(N_z)\,\mu_Z(dz) = 0.$$
With this definition, the proof of proposition 14.5 can be carried out without change, with the positivity condition expressing the fact that there exist R̃_X ⊂ R_X, R̃_Y ⊂ R_Y and R̃_Z ⊂ R_Z such that ϕ_{XYZ}(x, y, z) > 0 for all (x, y, z) ∈ R̃_X × R̃_Y × R̃_Z. (This proposition is actually valid in full generality, with a proper definition of positivity.)
When considering random fields with general state spaces, we will restrict to the similar situation in which each state space F_s is equipped with a σ-algebra S_s and a measure µ_s, and the joint distribution P_X of the random field X = (X_s, s ∈ V) is absolutely continuous with respect to $\mu \stackrel{\Delta}{=} \bigotimes_{s\in V}\mu_s$, denoting by π the corresponding p.d.f. We will say that π is positive if there exists F̃ = (F̃_s, s ∈ V) with measurable F̃_s ⊂ F_s such that π(x) > 0 for all x ∈ F(V, F̃). Without loss of generality, unless one considers multiple random fields with different supports, we will assume that F̃_s = F_s for all s.
$$\pi(x) = \frac{1}{Z}\exp\Big(a^T x - \frac{1}{2}x^T b x\Big),$$
with the integrability requirement that b ≻ 0 (positive definite). The random field then follows a Gaussian distribution with mean m = b⁻¹a and covariance matrix Σ = b⁻¹. The normalizing constant, Z, is given by
$$Z = \frac{e^{\frac12 a^T b^{-1} a}\,(2\pi)^{d/2}}{\sqrt{\det b}}.$$
Chapter 15

Probabilistic Inference for MRF

Once the joint distribution of a family of variables has been modeled as a random
field, this model can be used to estimate the probabilities of specific events, or the
expectations of random variables of interest. For example, if the modeled variables
relate to a medical condition, in which variables such as diagnosis, age, gender, clin-
ical evidence can interact, one may want to compute, say, the probability of someone
having a disease given other observable factors. Note that, being able to compute
expectations of the modeled variables for G-Markov processes also ensures that one
can compute conditional expectations of some modeled variables given others, since,
by proposition 14.22, conditional G-Markov distributions are Markov over restricted
graphs.
The problem is that the sums involved in this ratio involve a number of terms that grows exponentially with the size of V. Unless V is very small, a direct computation of these sums is intractable. An exception to this is the case of acyclic graphs, as we will see in section 15.2. But for general, loopy, graphs, the sums can only be approximated, using, for example, Monte-Carlo sampling, as described in the next section.
15.1 Monte Carlo sampling

Markov chain Monte Carlo methods are well adapted to sampling from Markov random fields, because the conditional distributions used in Gibbs sampling or, more generally, the ratios of probabilities used in the Metropolis-Hastings algorithm do not require the computation of the normalizing constant Z in (15.1). The simplest use of Gibbs sampling generalizes the Ising model example of section 13.4.2. Using the notation of Algorithm 13.2, one lets $B_s^0 = F(s^c)$ (with the notation $s^c = V \setminus \{s\}$) and $U_s(x) = x^{(s^c)}$. The conditional distribution given U_s is
$$Q_s(U_s(x), y) = P(X^{(s)} = y^{(s)} \mid X^{(s^c)} = x^{(s^c)})\,1_{y^{(s^c)} = x^{(s^c)}}.$$
The conditional probability in the r.h.s. of this equation takes the form
$$\pi_s(y^{(s)} \mid x^{(s^c)}) \stackrel{\Delta}{=} P(X^{(s)} = y^{(s)} \mid X^{(s^c)} = x^{(s^c)}) = \frac{1}{Z_s(x^{(s^c)})}\exp\Big(-\sum_{C\in\mathcal C,\, s\in C} h_C(y^{(s)} \wedge x^{(C\cap s^c)})\Big)$$
with
$$Z_s(x^{(s^c)}) = \sum_{z^{(s)}\in F_s}\exp\Big(-\sum_{C\in\mathcal C,\, s\in C} h_C(z^{(s)} \wedge x^{(C\cap s^c)})\Big).$$
The Gibbs sampling algorithm samples from Q_s by visiting all s ∈ V infinitely often, as described in Algorithm 13.2. Metropolis-Hastings schemes are implemented similarly, the most common choice using a local update scheme in Algorithm 13.3 such that g(x, ·) only changes one coordinate, chosen at random, so that
$$g(x, y) = \frac{1}{|V|}\sum_{s\in V} 1_{y^{(s^c)} = x^{(s^c)}}\, g_s(y^{(s)}).$$
Note that the latter equation avoids the computation of the local normalizing constant $Z_s(x^{(s^c)})$, which simplifies in the ratio.
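A Gibbs sweep for the Ising model then only requires the local conditional at each site, which depends on the sum of the neighboring spins alone; a minimal sketch with a systematic scan (parameters illustrative, as above):

```python
import numpy as np

def gibbs_sweep(x, alpha, beta, rng):
    """One systematic-scan Gibbs sweep for the isotropic Ising model.
    The local conditional only involves the neighbor sum, so no global Z."""
    n, m = x.shape
    for i in range(n):
        for j in range(m):
            nb = sum(x[i2, j2]
                     for i2, j2 in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                     if 0 <= i2 < n and 0 <= j2 < m)
            # local energy of value v is (alpha + beta*nb) * v,
            # so P(x^(s) = +1 | neighbors) = 1 / (1 + exp(2*(alpha + beta*nb)))
            h = alpha + beta * nb
            p_plus = 1.0 / (1.0 + np.exp(2 * h))
            x[i, j] = 1 if rng.random() < p_plus else -1
    return x

rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=(16, 16))
for _ in range(100):
    x = gibbs_sweep(x, alpha=0.0, beta=-0.7, rng=rng)
print("mean spin after 100 sweeps:", x.mean())
```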
Both algorithms have a transition probability P that satisfies P^m(x, y) > 0 for all x, y ∈ F(V), with m = |V| (for Metropolis-Hastings, one must assume that g_s(y^{(s)}) > 0 for all y^{(s)} ∈ F_s). This ensures that the chain is uniformly geometrically ergodic, i.e., (13.9) is satisfied with a constant M and some ρ < 1. However, in many practical cases (especially for strongly structured distributions and large sets V), the convergence rate ρ can be very close to 1, resulting in slow convergence.
Acceleration strategies have been designed to address this issue, which is often
due to the existence of multiple configurations that are local modes of the probabil-
ity π. Such configurations are isolated from other high-probability configurations
because local updating schemes need to make multiple low-probability changes to
access them from the local mode. The following two approaches provide examples
designed to address this issue.
The coefficients µ_{st} can be chosen freely (one possible choice is to take µ_{st} = 0 for all {s, t} ∈ E). For this distribution, all ξ^{(st)} are independent conditionally to X = x, with ξ^{(st)} = 1 with probability 1 if x^{(s)} ≠ x^{(t)}, and
$$P(\xi^{(st)} = 1 \mid X = x) = \frac{e^{-\mu_{st}}}{1 + e^{-\mu_{st}}} \tag{15.2}$$
if x^{(s)} = x^{(t)}. This conditional distribution is, as a consequence, very easy to sample from. Moreover, the normalizing constant ζ(x) has a closed form and is given by
$$\zeta(x) = \prod_{\{s,t\}\in E}\big(1_{x^{(s)}=x^{(t)}} + e^{-\mu_{st}}\big) = \exp\Big(\sum_{\{s,t\}\in E}\log(1+e^{-\mu_{st}}) - \sum_{\{s,t\}\in E}\log(1+e^{\mu_{st}})\,1_{x^{(s)}\neq x^{(t)}}\Big).$$
Now consider the conditional probability that X = x given ξ. For this distribution, one has, with probability 1, X^{(s)} = X^{(t)} when ξ^{(st)} = 0. This implies that X is constant on the connected components of the subgraph (V, E_ξ) of (V, E), where {s, t} ∈ E_ξ if and only if ξ^{(st)} = 0. Let V₁, ..., V_m denote these connected components (these components and their number depend on ξ). The conditional distribution of X given ξ is therefore supported by the configurations such that there exist c₁, ..., c_m ∈ F with x^{(s)} = c_j if and only if s ∈ V_j, which we will denote, with some abuse of notation,
$$c_1^{(V_1)} \wedge\cdots\wedge c_m^{(V_m)}.$$
Given this remark, the conditional distribution of X given ξ = ξ is equivalent to a
distribution on F m , which may be feasible to sample from directly if |F| and m are not
too large. To sample from π, one now needs to alternate between sampling ξ given
X and the converse, yielding the following first version of cluster-based sampling.
Step (2.c) takes a simple form in the special case when π is a non-homogeneous Potts model ((14.32)) with positive interactions, which we will write as
$$\pi(x) = \exp\Big(-\sum_{s\in V}\alpha_s x^{(s)} - \sum_{\{s,t\}\in E}\beta_{st}\,1_{x^{(s)}\neq x^{(t)}}\Big)$$
with $\beta'_{st} = \log(1 + e^{\mu_{st}})$. If one chooses µ_{st} such that $\beta'_{st} = \beta_{st}$ (which is possible since β_{st} ≥ 0), then the interaction term disappears and the probability q in (2.c) is proportional to
$$\prod_{j=1}^{m}\exp\Big(-c_j\sum_{s\in V_j}\alpha_s\Big).$$
Unlike single-variable updating schemes, these algorithms can update large chunks of the configuration at each step, and may result in significantly faster convergence of the sampling procedure. Note that step (2.d) in Algorithm 15.2 can be replaced by a Metropolis-Hastings update with a proper choice of proposal probability [16].
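For the homogeneous case with α_s = 0 (so the cluster colors in step (2.c) are uniform), the choice β′_{st} = β_{st} above gives the familiar bond probability 1 − e^{−β} between equal neighbors, and the cluster update takes the classical Swendsen-Wang form. A compact sketch follows; the union-find helper is a standard programming device, not part of the text:

```python
import numpy as np

def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]   # path compression
        i = parent[i]
    return i

def cluster_step(x, beta, q, rng):
    """One cluster update for a homogeneous Potts model
    pi(x) ∝ exp(-beta * sum_{s~t} 1_{x(s) != x(t)}), beta >= 0, colors {0..q-1}."""
    n, m = x.shape
    parent = list(range(n * m))
    p_bond = 1.0 - np.exp(-beta)        # bond (xi = 0) only between equal neighbors
    for i in range(n):
        for j in range(m):
            for i2, j2 in ((i + 1, j), (i, j + 1)):
                if i2 < n and j2 < m and x[i, j] == x[i2, j2] \
                        and rng.random() < p_bond:
                    parent[find(parent, i * m + j)] = find(parent, i2 * m + j2)
    # recolor each connected component uniformly (alpha_s = 0 case of step (2.c))
    colors = {}
    for k in range(n * m):
        r = find(parent, k)
        if r not in colors:
            colors[r] = rng.integers(q)
        x[k // m, k % m] = colors[r]
    return x

rng = np.random.default_rng(0)
x = rng.integers(2, size=(16, 16))
for _ in range(50):
    x = cluster_step(x, beta=1.0, q=2, rng=rng)
print(x.mean())
```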
b. Parallel tempering. We now consider a different kind of extension, in which we allow π to depend continuously on a parameter β > 0, writing π_β; the goal is to sample from π₁. For example, one can extend (15.1) by the family of probability distributions
$$\pi_\beta(x) = \frac{1}{Z_\beta}\exp\Big(-\beta\sum_{C\in\mathcal C_G} h_C(x^{(C)})\Big)$$
for β ≥ 0. For small β, π_β gets close to the uniform distribution on F(V) (achieved for β = 0), so that it becomes easier to move from local mode to local mode. This implies that sampling with small β is more efficient and the associated Markov chain moves more rapidly in the configuration space.
Assume given, for all β, two ergodic transition probabilities on F(V), q_β and q̃_β, such that (13.6) is satisfied with π_β as invariant probability, namely
$$\pi_\beta(x)\,q_\beta(x, y) = \pi_\beta(y)\,\tilde q_\beta(y, x)$$
for all x, y ∈ F(V) (as seen in (13.6), q̃_β is the transition probability for the reversed chain). The basic idea is that q_β provides a Markov chain that converges rapidly for small β and slowly when β is closer to 1. Parallel tempering (this algorithm was introduced in Neal [141] based on ideas developed in Marinari and Parisi [127]) leverages this fact (and the continuity of π_β in β) to accelerate the simulation of π₁ by introducing intermediate steps sampling at low β values.
The algorithm specifies a sequence of parameters 0 ≤ β₁ ≤ · · · ≤ β_m = 1. One simulation step goes down, then up, this scale, as described in the following algorithm.
Importantly, the acceptance probability at step (4) only involves ratios of π_β's and therefore no normalizing constant. We now show that this algorithm is π_{β₀}-reversible. Let p(·, ·) denote the transition probability of the chain. If z₀ ≠ x₀, p(x₀, z₀) corresponds to steps (1) to (3), with acceptance at step (4), and is therefore given by the sum, over all x₁, ..., x_m and z₁, ..., z_{m−1}, of products
So,
$$\pi_{\beta_0}(x_0)\,p(x_0, z_0) = \sum \min\Big(\pi_{\beta_0}(x_0)\,\tilde q_{\beta_1}(x_0, x_1)\cdots\tilde q_{\beta_m}(x_{m-1}, x_m)\,q_{\beta_m}(x_m, z_{m-1})\cdots q_{\beta_1}(z_1, z_0),\ \pi_{\beta_0}(z_0)\,q_{\beta_1}(x_1, x_0)\cdots q_{\beta_m}(x_m, x_{m-1})\,\tilde q_{\beta_m}(z_{m-1}, x_m)\cdots\tilde q_{\beta_1}(z_0, z_1)\Big),$$
where the sum is over all x₁, ..., x_m, z₁, ..., z_{m−1} ∈ F(V). The sum is, of course, unchanged if one renames x₁, ..., x_m, z₁, ..., z_{m−1} to z₁, ..., z_m, x₁, ..., x_{m−1}, but doing so provides the expression of π_{β₀}(z₀)p(z₀, x₀), proving the reversibility of the chain with respect to π_{β₀}.
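The down-then-up sweep can be sketched as follows in the convenient special case where each q_β is itself π_β-reversible (a Metropolis kernel, say, so that q̃_β = q_β); the log-acceptance then accumulates ratios of π_β's along the path, in the tempered-transitions form of Neal [141]. The one-dimensional bimodal target below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi(x, beta):
    # bimodal toy target: pi_1 has modes near -3 and +3; pi_beta ∝ pi_1^beta
    return beta * np.logaddexp(-0.5 * (x - 3.0) ** 2, -0.5 * (x + 3.0) ** 2)

def metropolis(x, beta, n_steps=5, step=1.0):
    # pi_beta-reversible kernel, so its time reversal equals itself
    for _ in range(n_steps):
        y = x + step * rng.standard_normal()
        if np.log(rng.random()) < log_pi(y, beta) - log_pi(x, beta):
            x = y
    return x

def tempered_sweep(x, betas):
    """One down-then-up sweep over betas = [beta_1, ..., beta_m], beta_m = 1.
    The log acceptance ratio only involves ratios of pi_beta's (step (4))."""
    x0, log_r = x, 0.0
    for k in range(len(betas) - 1, 0, -1):        # heating: beta_m -> beta_1
        log_r += log_pi(x, betas[k - 1]) - log_pi(x, betas[k])
        x = metropolis(x, betas[k - 1])
    for k in range(len(betas) - 1):               # cooling: beta_1 -> beta_m
        log_r += log_pi(x, betas[k + 1]) - log_pi(x, betas[k])
        x = metropolis(x, betas[k + 1])
    return x if np.log(rng.random()) < log_r else x0

betas = list(np.linspace(0.1, 1.0, 6))
x, hits = 3.0, 0
for _ in range(2000):
    x = tempered_sweep(x, betas)
    hits += x < 0
print("fraction of samples near the -3 mode:", hits / 2000)
```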
so that the marginal probability at any s ≠ s₀ can be computed given the marginal probability of its parent. We can propagate the computation down the tree, with a total cost for computing π_s proportional to $\sum_{k=1}^{n} |F_{t_{k-1}}|\,|F_{t_k}|$, where t₀ = s₀, t₁, ..., t_n = s is the unique path between s₀ and s. This is linear in the depth of the tree, and quadratic (not exponential) in the sizes of the state spaces. The computation of all singleton marginals requires an order of $\sum_{\{s,t\}\in E} |F_s|\,|F_t|$ operations.
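In matrix form, propagating singleton marginals down a path is just a sequence of vector-matrix products; a small sketch with made-up transition tables:

```python
import numpy as np

# marginal at the root and transitions p_s(parent value, child value) along a path
pi_root = np.array([0.5, 0.3, 0.2])
step = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.2, 0.2, 0.6]])
transitions = [step] * 4             # t_0 -> t_1 -> ... -> t_4

pi = pi_root
for p in transitions:
    pi = pi @ p      # pi_{t_k}(x) = sum_y pi_{t_{k-1}}(y) p(y, x); cost |F| * |F'|
print(pi, pi.sum())
```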
Now, assume that probabilities of singletons have been computed and consider an arbitrary set S ⊂ V. Let s ∈ V be an ancestor of every vertex in S, maximal in the sense that none of its children also satisfies this property. Consider the subtrees of G̃ starting from each of the children of s, denoted G̃₁, ..., G̃_n with G̃_k = (V_k, Ẽ_k). Let S_k = S ∩ V_k. From the conditional independence,
$$\pi_S(x^{(S)}) = \sum_{y^{(s)}\in F_s} P(X^{(S\setminus\{s\})} = x^{(S\setminus\{s\})} \mid X^{(s)} = y^{(s)})\,\pi_s(y^{(s)}) = \sum_{y^{(s)}\in F_s}\ \prod_{k=1,\,S_k\neq\emptyset}^{n} P(X^{(S_k)} = x^{(S_k)} \mid X^{(s)} = y^{(s)})\,\pi_s(y^{(s)}).$$
Now, for all k = 1, ..., n, we have |S_k| < |S|: this is obvious if S is not completely included in one of the V_k's. But if S ⊂ V_k, then the root, s_k, of V_k is an ancestor of all the elements in S and is a child of s, which contradicts the assumption that s is maximal. So we have reduced the computation of π_S(x^{(S)}) to the computation of n probabilities of smaller sets, namely P(X^{(S_k)} = x^{(S_k)} | X^{(s)} = y^{(s)}) for S_k ≠ ∅. Because the distribution of X^{(V_k)} conditioned at s is a G̃_k-Markov model, we can reiterate the procedure until only sets of cardinality one remain, for which we know how to explicitly compute probabilities.
We now consider the situation in which one starts with a probability distribution associated with pair interactions (cf. definition 14.24) over the acyclic graph G,
$$\pi(x) = \frac{1}{Z}\prod_{s\in V}\varphi_s(x^{(s)})\prod_{\{s,t\}\in E}\varphi_{st}(x^{(s)}, x^{(t)}). \tag{15.5}$$
We assume these local interactions to be consistent, still allowing for some vanishing $\varphi_{st}(x^{(s)}, x^{(t)})$.
Putting π in the form (15.4) is equivalent to computing all joint probability distributions π_{st}(x^{(s)}, x^{(t)}) for {s, t} ∈ E, and we now describe this computation. Denote
$$U(x) = \prod_{s\in V}\varphi_s(x^{(s)})\prod_{\{s,t\}\in E}\varphi_{st}(x^{(s)}, x^{(t)})$$
so that $Z = \sum_{y\in F(V)} U(y)$. For the tree G̃ = (V, Ẽ) and t ∈ V, we let G̃_t = (V_t, Ẽ_t) be the subtree of G̃ rooted at t (containing t and all its descendants). For S ⊂ V, define
$$U_S(x^{(S)}) = \prod_{s\in S}\varphi_s(x^{(s)})\prod_{\{s,s'\}\in E,\, s,s'\in S}\varphi_{ss'}(x^{(s)}, x^{(s')})$$
and
$$Z_t(x^{(t)}) = \sum_{y^{(V_t^*)}\in F(V_t^*)} U_{V_t}(x^{(t)} \wedge y^{(V_t^*)}).$$
$$\pi_{s_0}(x^{(s_0)}) = \frac{Z_{s_0}(x^{(s_0)})}{\sum_{y^{(s_0)}\in F_{s_0}} Z_{s_0}(y^{(s_0)})} \tag{15.6}$$
$$p_{st}(x^{(s)}, x^{(t)}) = P(X^{(t)} = x^{(t)} \mid X^{(s)} = x^{(s)}) = \frac{\varphi_{st}(x^{(s)}, x^{(t)})\,Z_t(x^{(t)})}{\sum_{y^{(t)}\in F_t}\varphi_{st}(x^{(s)}, y^{(t)})\,Z_t(y^{(t)})} \tag{15.7}$$
Proof Let $W_t = V \setminus V_t$. Clearly, $Z = \sum_{x^{(s_0)}\in F_{s_0}} Z_{s_0}(x^{(s_0)})$ and $\pi_{s_0}(x^{(s_0)}) = Z_{s_0}(x^{(s_0)})/Z$, which gives (15.6). Moreover, if s ∈ V, we have
$$P(X^{(V_s^*)} = x^{(V_s^*)} \mid X^{(s)} = x^{(s)}) = \frac{\sum_{y^{(W_s)}} U(x^{(V_s)} \wedge y^{(W_s)})}{\sum_{y^{(V_s^*)},\,y^{(W_s)}} U(x^{(s)} \wedge y^{(V_s^*)} \wedge y^{(W_s)})}.$$
We can write
$$U(x^{(s)} \wedge y^{(V_s^*)} \wedge y^{(W_s)}) = U_{V_s}(x^{(s)} \wedge y^{(V_s^*)})\,U_{\{s\}\cup W_s}(x^{(s)} \wedge y^{(W_s)})\,\varphi_s(x^{(s)})^{-1},$$
and, after summing over $y^{(W_s)}$, the previous ratio simplifies to
$$P(X^{(V_s^*)} = x^{(V_s^*)} \mid X^{(s)} = x^{(s)}) = \frac{U_{V_s}(x^{(V_s)})}{Z_s(x^{(s)})}.$$
Now, if t₁, ..., t_n are the children of s, we have
$$U_{V_s}(x^{(V_s)}) = \varphi_s(x^{(s)})\prod_{k=1}^{n}\varphi_{st_k}(x^{(s)}, x^{(t_k)})\prod_{k=1}^{n} U_{V_{t_k}}(x^{(V_{t_k})}),$$
so that the conditional probability factorizes over the subtrees. This implies that the transition probability needed for the tree model, $p_{st_1}(x^{(s)}, x^{(t_1)})$, must be proportional to $\varphi_{st_1}(x^{(s)}, x^{(t_1)})\,Z_{t_1}(x^{(t_1)})$, which proves the lemma.
This lemma reduces the computation of the transition probabilities to the computation of $Z_s(x^{(s)})$, for s ∈ V. This can be done efficiently, going upward in the tree (from terminal vertexes to the root). Indeed, if s is terminal, then $V_s = \{s\}$ and $Z_s(x^{(s)}) = \varphi_s(x^{(s)})$. Now, if s is non-terminal and t₁, ..., t_n are its children, then it is easy to see that
$$Z_s(x^{(s)}) = \sum_{x^{(t_1)}\in F_{t_1},\ldots,x^{(t_n)}\in F_{t_n}}\varphi_s(x^{(s)})\prod_{k=1}^{n}\varphi_{st_k}(x^{(s)}, x^{(t_k)})\,Z_{t_k}(x^{(t_k)}) = \varphi_s(x^{(s)})\prod_{k=1}^{n}\sum_{x^{(t_k)}\in F_{t_k}}\varphi_{st_k}(x^{(s)}, x^{(t_k)})\,Z_{t_k}(x^{(t_k)}). \tag{15.8}$$
So, Zs (x(s) ) can be easily computed once the Zt (x(t) )’s are known for the children of s.
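The upward recursion (15.8) is a post-order traversal of the tree; a minimal sketch with hypothetical ϕ tables, which also returns Z and the root marginal (15.6):

```python
import numpy as np

# tree: 0 is the root; phi_s vectors and phi_{st} matrices (row index: parent s)
children = {0: [1, 2], 1: [3], 2: [], 3: []}
phi = {s: np.ones(2) for s in children}
phi[0] = np.array([1.0, 2.0])
phi_edge = {(0, 1): np.array([[2.0, 1.0], [1.0, 2.0]]),
            (0, 2): np.array([[1.0, 3.0], [3.0, 1.0]]),
            (1, 3): np.array([[1.0, 1.0], [1.0, 4.0]])}

def Z_up(s):
    """Z_s(x^(s)) computed by (15.8): product over children of summed terms."""
    z = phi[s].copy()
    for t in children[s]:
        z *= phi_edge[(s, t)] @ Z_up(t)   # sum over x^(t) of phi_st * Z_t
    return z

Z_root = Z_up(0)
Z = Z_root.sum()                          # total normalizing constant
print(Z, Z_root / Z)                      # Z and the root marginal (15.6)
```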
Let {s, t} be an edge in E. Then removing s splits the graph G \ {s} into connected components. Let V_{st} be the component that contains t, and $V_{st}^* = V_{st} \setminus \{t\}$. Define
$$Z_{st}(x^{(t)}) = \sum_{y^{(V_{st}^*)}\in F(V_{st}^*)} U_{V_{st}}(x^{(t)} \wedge y^{(V_{st}^*)}).$$
This Z_{st} coincides with the previously introduced Z_t, computed with any tree in which the edge {s, t} is oriented from s to t. Equation (15.8) can be rewritten with this new notation in the form:
$$Z_{st}(x^{(t)}) = \varphi_t(x^{(t)})\prod_{t'\in V_t\setminus\{s\}}\ \sum_{x^{(t')}\in F_{t'}}\varphi_{tt'}(x^{(t)}, x^{(t')})\,Z_{tt'}(x^{(t')}). \tag{15.9}$$
Defining the messages
$$m_{ts}(x^{(s)}) = \sum_{x^{(t)}\in F_t}\varphi_{st}(x^{(s)}, x^{(t)})\,\varphi_t(x^{(t)})\prod_{t'\in V_t\setminus\{s\}} m_{t't}(x^{(t)}), \tag{15.10}$$
this yields
$$Z_{st}(x^{(t)}) = \varphi_t(x^{(t)})\prod_{t'\in V_t\setminus\{s\}} m_{t't}(x^{(t)}).$$
Also, because one can start building a tree from G using any vertex as a root, (15.6) is valid for any s ∈ V, in the form (applying (15.8) to the root)
$$\pi_s(x^{(s)}) = \frac{1}{\zeta_s}\varphi_s(x^{(s)})\prod_{t\in V_s} m_{ts}(x^{(s)}) \tag{15.11}$$
where ζ_s is chosen to ensure that the sum of probabilities is 1. (In fact, looking at lemma 15.1, we have ζ_s = Z, independent of s.)
which provides the edge transition probabilities. Combining this with (15.11), we get the edge marginal probabilities:
$$\pi_{st}(x^{(s)}, x^{(t)}) = \frac{1}{\zeta}\varphi_{st}(x^{(s)}, x^{(t)})\,\varphi_s(x^{(s)})\,\varphi_t(x^{(t)})\prod_{t'\in V_t\setminus\{s\}} m_{t't}(x^{(t)})\prod_{s'\in V_s\setminus\{t\}} m_{s's}(x^{(s)}). \tag{15.13}$$
Remark 15.2 We can modify (15.10) by multiplying the right-hand side by an arbitrary constant q_{ts} without changing the resulting estimation of probabilities: this only multiplies the messages by a constant, which cancels after normalization. This remark can be useful in particular to avoid numerical overflow; one can, for example, define $q_{ts} = 1/\sum_{x^{(s)}\in F_s} m_{ts}(x^{(s)})$ so that the messages always sum to 1. This is also useful when applying belief propagation (see next section) to loopy networks, for which (15.10) may diverge while the normalized version converges.
(1) Initialize functions (messages) m_{ts} : F_s → R, e.g., taking $m_{ts}(x^{(s)}) = 1/|F_s|$.
(2) Compute unnormalized messages
$$\tilde m_{ts}(\cdot) = \sum_{x^{(t)}\in F_t}\varphi_{st}(\cdot, x^{(t)})\,\varphi_t(x^{(t)})\prod_{t'\in V_t\setminus\{s\}} m_{t't}(x^{(t)})$$
and let $m_{ts}(\cdot) = q_{ts}\tilde m_{ts}(\cdot)$, for some choice of constant q_{ts}, which must be a fixed function of $\tilde m_{ts}(\cdot)$, such as
$$q_{ts} = \Big(\sum_{x^{(s)}\in F_s}\tilde m_{ts}(x^{(s)})\Big)^{-1}.$$
(3) Stop the algorithm when the messages stabilize (which happens after a finite
number of updates). Compute the edge marginal distributions using (15.13).
It should be clear, from the previous analysis, that messages stabilize in finite time, starting from the outskirts of the acyclic graph. Indeed, messages starting from a terminal t (a vertex with only one neighbor) are automatically set to their correct value in (15.10),
$$m_{ts}(x^{(s)}) = \sum_{x^{(t)}\in F_t}\varphi_{st}(x^{(s)}, x^{(t)})\,\varphi_t(x^{(t)}),$$
at the first update. These values then propagate to provide messages that satisfy (15.10) starting from the next-to-terminal vertexes (those that have only one neighbor left when the terminals are removed), and so on.
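Algorithm 15.4 can be coded generically for any pairwise model; on an acyclic graph the normalized messages below stabilize after a number of sweeps bounded by the graph diameter, and the same loop can be run, without guarantees, on loopy graphs. All tables in this sketch are randomly generated placeholders:

```python
import numpy as np

nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (1, 3)]
rng = np.random.default_rng(0)
phi = {s: rng.uniform(0.5, 1.5, size=2) for s in nodes}
psi = {}
for (s, t) in edges:
    m = rng.uniform(0.5, 1.5, size=(2, 2))   # psi_{st}, rows indexed by x^(s)
    psi[(s, t)], psi[(t, s)] = m, m.T
nbrs = {s: [] for s in nodes}
for (a, b) in edges:
    nbrs[a].append(b)
    nbrs[b].append(a)

# messages m[t, s] from t to s, initialized uniform and kept normalized
msg = {(t, s): np.full(2, 0.5) for s in nodes for t in nbrs[s]}
for _ in range(20):
    new = {}
    for (t, s) in msg:
        prod = phi[t].copy()
        for u in nbrs[t]:
            if u != s:
                prod *= msg[(u, t)]
        m_ts = psi[(t, s)].T @ prod          # sum over x^(t) of psi_st * phi_t * prod
        new[(t, s)] = m_ts / m_ts.sum()      # normalization as in remark 15.2
    msg = new

marg = {s: phi[s] * np.prod([msg[(t, s)] for t in nbrs[s]], axis=0) for s in nodes}
marg = {s: v / v.sum() for s, v in marg.items()}   # singleton marginals (15.11)
print(marg)
```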
15.3 Belief propagation and free energy approximation

15.3.1 BP stationarity
It is possible to run Algorithm 15.4 on graphs that are not acyclic, since nothing in its formulation requires this property. However, while the method stabilizes in finite time for acyclic graphs, this property, or even the convergence of the messages, is not guaranteed for general, loopy, graphs. Convergence, however, has been observed in a large number of applications, sometimes with very good approximations of the true marginal distributions.
such that
$$\pi'_{st}(x^{(s)}, x^{(t)}) = \frac{1}{\zeta_{st}}\varphi_{st}(x^{(s)}, x^{(t)})\,\varphi_s(x^{(s)})\,\varphi_t(x^{(t)})\prod_{t'\in V_t\setminus\{s\}} m_{t't}(x^{(t)})\prod_{s'\in V_s\setminus\{t\}} m_{s's}(x^{(s)}). \tag{15.15}$$
There is no loss of generality in the specific form chosen for the normalizing constants in (15.14) and (15.15), in the sense that, if the messages satisfy (15.15) and
$$m_{ts}(x^{(s)}) = q_{ts}\sum_{x^{(t)}\in F_t}\varphi_{st}(x^{(s)}, x^{(t)})\,\varphi_t(x^{(t)})\prod_{t'\in V_t\setminus\{s\}} m_{t't}(x^{(t)}),$$
then ζ_{st} q_{ts} (which has been denoted α_s) does not depend on t. Of course, the relevant questions regarding BP-stationarity are whether the collection of pairwise probabilities π′_{st} exists, how to compute them, and whether π′_{st}(x^{(s)}, x^{(t)}) provides a good approximation of the true marginal distribution.
A reassuring statement for BP-stationarity is that it is not affected when the functions in Φ are multiplied by constants, which does not affect the underlying probability π. This is stated in the next proposition.
Proposition 15.4 Let Φ be, as above, a family of edge and vertex interactions. Let c_{st}, {s, t} ∈ E, and c_s, s ∈ V, be families of positive constants, and define Φ̃ = (ϕ̃_{st}, ϕ̃_s) by ϕ̃_{st} = c_{st}ϕ_{st} and ϕ̃_s = c_sϕ_s. Then a family of distributions is BP-stationary for (G, Φ) if and only if it is BP-stationary for (G, Φ̃).
Proof Indeed, if (15.14) and (15.15) are true for (G, Φ), it suffices to replace αs by
αs cs and ζst by ζst cst ct to obtain (15.14) and (15.15) for (G, Φ̃).
A partial justification of the good behavior of BP with general graphs has been provided in terms of a quantity introduced in statistical mechanics, called the Bethe free energy. We let G = (V, E) be an undirected graph, assume that a consistent family of pair interactions is given (denoted Φ = (ϕ_s, s ∈ V, ϕ_{st}, {s, t} ∈ E)), and consider the associated distribution, π, on F(V), given by
$$\pi(x) = \frac{1}{Z}\prod_{s\in V}\varphi_s(x^{(s)})\prod_{\{s,t\}\in E}\varphi_{st}(x^{(s)}, x^{(t)}). \tag{15.16}$$
such that
$$\pi(x) = \frac{1}{Z}\prod_{s\in V}\varphi_s(x^{(s)})^{1-|V_s|}\prod_{\{s,t\}\in E}\psi_{st}(x^{(s)}, x^{(t)}). \tag{15.17}$$
(where H(π′) is the entropy of π′). Introduce the one- and two-dimensional marginals of π′, denoted π′_s and π′_{st}. Then
$$KL(\pi'\|\pi) = -\log Z - \sum_{s\in V}(1-|V_s|)\,E_{\pi'}\Big(\log\frac{\varphi_s}{\pi'_s}\Big) - \sum_{\{s,t\}\in E} E_{\pi'}\Big(\log\frac{\psi_{st}}{\pi'_{st}}\Big) + \sum_{s\in V}(1-|V_s|)H(\pi'_s) + \sum_{\{s,t\}\in E} H(\pi'_{st}) - H(\pi').$$
so that
$$KL(\pi'\|\pi) = F_\beta(\pi') - \log Z + \Delta_G(\pi')$$
with
$$\Delta_G(\pi') = \sum_{s\in V}(1-|V_s|)H(\pi'_s) + \sum_{\{s,t\}\in E} H(\pi'_{st}) - H(\pi').$$
Using this computation, one can consider the approximation problem: find π̂′ that minimizes KL(π′‖π) over a class of distributions π′ for which the computation of the first and second order marginals is easy. This problem has an explicit solution when the distribution π′ is such that all variables are independent, leading to what is called the mean-field approximation of π. Indeed, in this case, we have
$$\Delta_G(\pi') = \sum_{\{s,t\}\in E}\big(H(\pi'_s) + H(\pi'_t)\big) + \sum_{s\in V}(1-|V_s|)H(\pi'_s) - \sum_{s\in V} H(\pi'_s) = 0$$
and
$$F_\beta(\pi') = -\sum_{s\in V}(1-|V_s|)\,E_{\pi'}\Big(\log\frac{\varphi_s}{\pi'_s}\Big) - \sum_{\{s,t\}\in E} E_{\pi'}\Big(\log\frac{\psi_{st}}{\pi'_s\,\pi'_t}\Big).$$
Proposition 15.6 A local minimum of F_β(π′) over all probability distributions π′ of the form
$$\pi'(x) = \prod_{s\in V}\pi'_s(x^{(s)})$$
must satisfy, for all s ∈ V,
$$\pi_s(x^{(s)}) = \frac{1}{Z_s}\varphi_s(x^{(s)})^{1-|V_s|}\exp\Big(\sum_{t\sim s} E_{\pi_t}\big(\log\psi_{st}(x^{(s)}, \cdot)\big)\Big). \tag{15.19}$$
Proof Since all constraints are affine, we can use Lagrange multipliers, denoted (λ_s, s ∈ V), for each of the constraints, to obtain necessary conditions for a minimizer, yielding
$$\frac{\partial F_\beta}{\partial\pi_s(x_s)} - \lambda_s = 0, \quad s\in V,\ x_s\in F_s.$$
This gives:
$$-(1-|V_s|)\Big(\log\frac{\varphi_s(x_s)}{\pi_s(x_s)} - 1\Big) - \sum_{t\sim s}\sum_{x_t\in F_t}\Big(\log\frac{\psi_{st}(x_s, x_t)}{\pi_s(x_s)\,\pi_t(x_t)} - 1\Big)\pi_t(x_t) = \lambda_s.$$
Solving this with respect to π_s(x_s) and regrouping all constant terms (independent of x_s) in the normalizing constant Z_s yields (15.19).
The mean field consistency equations can be solved using a root-finding algorithm or by directly solving the minimization problem. We will return to this method, with more details, in our discussion of variational approximations in chapter 17.
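The consistency equations (15.19) can also be iterated directly as a fixed-point scheme, each update replacing π_s by the right-hand side of (15.19); a sketch on a small chain with placeholder tables (convergence is not guaranteed in general, but is typical on small examples):

```python
import numpy as np

rng = np.random.default_rng(1)
nodes, edges = [0, 1, 2], [(0, 1), (1, 2)]
phi = {s: rng.uniform(0.5, 1.5, size=2) for s in nodes}
psi = {}
for (s, t) in edges:
    m = rng.uniform(0.5, 1.5, size=(2, 2))
    psi[(s, t)], psi[(t, s)] = m, m.T
nbrs = {0: [1], 1: [0, 2], 2: [1]}

q = {s: np.full(2, 0.5) for s in nodes}   # independent marginals pi'_s
for _ in range(100):
    for s in nodes:
        # (15.19): log pi_s ∝ (1 - |V_s|) log phi_s + sum_t E_{pi_t} log psi_st(x, .)
        log_q = (1 - len(nbrs[s])) * np.log(phi[s])
        for t in nbrs[s]:
            log_q += np.log(psi[(s, t)]) @ q[t]
        q[s] = np.exp(log_q - log_q.max())
        q[s] /= q[s].sum()
print(q)
```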
This proposition is a consequence of the following lemma, which is of interest in its own right:
Proof (of lemma 15.8) We know that, if G̃ = (V, Ẽ) is a tree such that G̃♭ = G, we have, letting s₀ be the root in G̃,
$$\pi(x) = \pi_{s_0}(x^{(s_0)})\prod_{(s,t)\in\tilde E} p_{st}(x^{(s)}, x^{(t)}) = \pi_{s_0}(x^{(s_0)})\prod_{(s,t)\in\tilde E}\big(\pi_{st}(x^{(s)}, x^{(t)})\,\pi_s(x^{(s)})^{-1}\big).$$
Each vertex s in V has |V_s| − 1 children in G̃, except s₀, which has |V_{s₀}| children. Using this, we get
$$\pi(x) = \pi_{s_0}(x^{(s_0)})\,\pi_{s_0}(x^{(s_0)})^{-|V_{s_0}|}\prod_{s\in V\setminus\{s_0\}}\pi_s(x^{(s)})^{1-|V_s|}\prod_{(s,t)\in\tilde E}\pi_{st}(x^{(s)}, x^{(t)}) = \prod_{s\in V}\pi_s(x^{(s)})^{1-|V_s|}\prod_{\{s,t\}\in E}\pi_{st}(x^{(s)}, x^{(t)}).$$
$$\pi'_{st}(x^{(s)}, x^{(t)}) = \frac{1}{Z_{st}}\psi_{st}(x^{(s)}, x^{(t)})\,\mu_{st}(x^{(t)})\,\mu_{ts}(x^{(s)}) \tag{15.21}$$
where the functions µ_{st} : F_t → [0, +∞) are defined for all (s, t) such that {s, t} ∈ E and satisfy the consistency conditions:
$$\mu_{ts}(x^{(s)})^{-(|V_s|-1)}\prod_{s'\sim s}\mu_{s's}(x^{(s)}) = \Big(\frac{e}{Z_{st}}\sum_{x^{(t)}\in F_t}\psi_{st}(x^{(s)}, x^{(t)})\,\varphi_t(x^{(t)})\,\mu_{st}(x^{(t)})\Big)^{|V_s|-1}. \tag{15.22}$$
x ∈Ft
which covers all constraints associated to the minimization problem. The associated
Lagrangian is
X X X X
(s) (s) (t) (s)
F β (π0 ) − λts (x ) 0 0
πst (x , x ) − πs (x )
t∼s
(t)
s∈V x(s) ∈Fs x ∈Ft
X X
0 (s) (t)
γst πst (x , x ) − 1 .
−
(s) (t)
{s,t}∈E x ∈Fs ,x ∈Ft
The derivative with respect to $\pi'_{st}(x^{(s)}, x^{(t)})$ yields the condition
$$\log\pi'_{st}(x^{(s)}, x^{(t)}) - \log\psi_{st}(x^{(s)}, x^{(t)}) + 1 - \lambda_{ts}(x^{(s)}) - \lambda_{st}(x^{(t)}) - \gamma_{st} = 0,$$
which implies
$$\pi'_{st}(x^{(s)}, x^{(t)}) = \psi_{st}(x^{(s)}, x^{(t)})\,\exp(\gamma_{st} - 1)\,\exp\big(\lambda_{ts}(x^{(s)}) + \lambda_{st}(x^{(t)})\big).$$
We let $Z_{st} = \exp(1 - \gamma_{st})$, with γ_{st} chosen so that π′_{st} is a probability. The derivative with respect to $\pi'_s(x^{(s)})$ gives
$$(1-|V_s|)\big(\log\pi'_s(x^{(s)}) - \log\varphi_s(x^{(s)}) + 1\big) + \sum_{t\sim s}\lambda_{ts}(x^{(s)}) = 0.$$
Combining this with the expression just obtained for $\pi_{st}^0$, we get, for $t\sim s$,
$$(1-|V_s|)\log\sum_{x^{(t)}\in F_t}\psi_{st}(x^{(s)},x^{(t)})\,e^{\lambda_{st}(x^{(t)})} + (1-|V_s|)\lambda_{ts}(x^{(s)}) + (1-|V_s|)\big(1 - \log Z_{st} - \log\varphi_s(x^{(s)})\big) + \sum_{s'\sim s}\lambda_{s's}(x^{(s)}) = 0.$$
A family $\pi_{st}^0$ satisfying conditions (15.21) and (15.22) of proposition 15.9 will be called Bethe-consistent. Remarkably, Bethe-consistency turns out to be equivalent to BP-stationarity, as stated below.
Proposition 15.10 Let $G = (V,E)$ be an undirected graph and $\Phi = (\varphi_{st}, \{s,t\}\in E,\ \varphi_s, s\in V)$ a consistent family of pair interactions. Then a family $\pi_0$ of joint probability distributions is BP-stationary if and only if it is Bethe-consistent.
Proof First assume that $\pi_0$ is BP-stationary with messages $m_{st}$, so that (15.14) and (15.15) are satisfied. Take
$$\mu_{st} = a_t\prod_{t'\in V_t,\,t'\neq s}m_{t't}(x^{(t)})$$
for some constant $a_t$ that will be determined later. Then the left-hand side of (15.22) is
$$\mu_{ts}(x^{(s)})^{-(|V_s|-1)}\prod_{s'\in V_s}\mu_{s's}(x^{(s)}) = a_s\Big(\prod_{s'\in V_s,\,s'\neq t}m_{s's}(x^{(s)})\Big)^{-(|V_s|-1)}\prod_{s'\in V_s}\ \prod_{s''\in V_s,\,s''\neq s'}m_{s''s}(x^{(s)}).$$
Conversely, take a Bethe-consistent $\pi_0$, and $\mu_{st}, Z_{st}$ satisfying (15.21) and (15.22). For $s$ such that $|V_s|>1$, define, for $t\in V_s$,
$$m_{ts}(x^{(s)}) = \mu_{ts}(x^{(s)})^{-1}\prod_{s'\sim s}\mu_{s's}(x^{(s)})^{1/(|V_s|-1)}. \tag{15.23}$$
(If $|V_s| = 1$, take $\rho_{ts}\equiv 1$.) Using (15.23), we find $\rho_{ts} = \mu_{ts}$ when $|V_s|>1$, and this identity is still valid when $|V_s| = 1$, since in this case (15.22) implies that $\mu_{ts}(x^{(s)}) = 1$. We need to find constants $\alpha_t$ and $\zeta_{st}$ such that (15.14) and (15.15) are satisfied. But (15.15) implies
$$\zeta_{ts} = \sum_{x_s,x_t}\psi_{st}(x^{(s)},x^{(t)})\,\rho_{st}(x^{(t)})\,\rho_{ts}(x^{(s)}).$$
It is now easy to see that this identity to the power $|V_s|-1$ coincides with (15.22) as soon as one takes $\alpha_s = e$.
We now address the problem of finding a configuration that maximizes $\pi(x)$ (mode determination). This problem turns out to be very similar to the computation of marginals that we have considered so far, and we will obtain similar algorithms. Assume that a root has been chosen in $G$, with the resulting edge orientation yielding a tree $\tilde G = (V,\tilde E)$ such that $\tilde G^\flat = G$. We partially order the vertexes according to $\tilde G$, writing $s\leq t$ if there exists a path from $s$ to $t$ in $\tilde G$ ($s$ is an ancestor of $t$). Let $V_s^+$ contain all $t\in V$ with $t\geq s$, and define
$$U_s(x^{(V_s^+)}) = \prod_{\{t,u\}\in E_{V_s^+}}\varphi_{tu}(x^{(t)},x^{(u)})\prod_{t>s}\varphi_t(x^{(t)})$$
and
$$U_s^*(x^{(s)}) = \max\big\{U_s(y^{(V_s^+)}):\ y^{(s)} = x^{(s)}\big\}. \tag{15.25}$$
Since we can write
$$U_s(x^{(V_s^+)}) = \prod_{t\in s^+}\varphi_{st}(x^{(s)},x^{(t)})\,\varphi_t(x^{(t)})\,U_t(x^{(V_t^+)}), \tag{15.26}$$
we have
$$U_s^*(x^{(s)}) = \max_{x^{(t)},\,t\in s^+}\prod_{t\in s^+}\varphi_t(x^{(t)})\,\varphi_{st}(x^{(s)},x^{(t)})\,U_t^*(x^{(t)}) = \prod_{t\in s^+}\max_{x_t\in F_t}\big(\varphi_t(x^{(t)})\,\varphi_{st}(x^{(s)},x^{(t)})\,U_t^*(x^{(t)})\big). \tag{15.27}$$
This provides a method to compute $U_s^*(x^{(s)})$ for all $s$, starting with the leaves and progressively updating the parents. (When $s$ is a leaf, $U_s^*(x^{(s)}) = 1$, by definition.) Once all $U_s^*(x^{(s)})$ have been computed, it is possible to obtain a configuration $x_*$ that maximizes $\pi$. This is because an optimal configuration must satisfy $U_s^*(x_*^{(s)}) = U_s(x_*^{(V_s^+)})$ for all $s\in V$, i.e., $x_*^{(V_s^+\setminus\{s\})}$ must solve the maximization problem in (15.25). But because of (15.26), we can separate this problem over the children of $s$ and obtain the fact that, if $t\in s^+$,
$$x_*^{(t)} = \operatorname*{argmax}_{x^{(t)}}\big(\varphi_t(x^{(t)})\,\varphi_{st}(x_*^{(s)},x^{(t)})\,U_t^*(x^{(t)})\big).$$
This procedure can be rewritten in a slightly different form using messages similar to those of the belief propagation algorithm. If $s\in t^+$, define
$$\mu_{st}(x^{(t)}) = \max_{x^{(s)}\in F_s}\big(\varphi_t(x^{(t)})\,\varphi_{ts}(x^{(t)},x^{(s)})\,U_s^*(x^{(s)})\big)$$
and
$$\xi_{st}(x^{(t)}) = \operatorname*{argmax}_{x^{(s)}\in F_s}\big(\varphi_t(x^{(t)})\,\varphi_{ts}(x^{(t)},x^{(s)})\,U_s^*(x^{(s)})\big).$$
An optimal configuration can now be computed using $x_*^{(t)} = \xi_{ts}(x_*^{(s)})$, with $s\in\mathrm{pa}(t)$. The resulting algorithm therefore first operates upward on the tree (from leaves to root) to compute the $\mu_{st}$'s and $\xi_{st}$'s, then downward to compute $x_*$. This is summarized in the following algorithm.
Algorithm 15.5
A most likely configuration for
$$\pi(x) = \frac{1}{Z}\prod_{\{s,t\}\in E}\varphi_{st}(x_s,x_t)\prod_{s\in V}\varphi_s(x_s)$$
can be computed by iterating the following updates, based on any acyclic orientation of $G$:
$$\mu_{st}(x^{(t)}) = \max_{x^{(s)}\in F_s}\Big(\varphi_{ts}(x^{(t)},x^{(s)})\,\varphi_s(x^{(s)})\prod_{u\in V_s\setminus\{t\}}\mu_{us}(x^{(s)})\Big), \tag{15.28}$$
$$\xi_{st}(x^{(t)}) = \operatorname*{argmax}_{x^{(s)}\in F_s}\Big(\varphi_{ts}(x^{(t)},x^{(s)})\,\varphi_s(x^{(s)})\prod_{u\in V_s\setminus\{t\}}\mu_{us}(x^{(s)})\Big), \tag{15.29}$$
with $x_*^{(t)} = \xi_{ts}(x_*^{(s)})$ for any pair $s\sim t$. As with the $m_{ts}$ of the previous section, repeatedly updating all the $\mu_{ts}$ in any order will eventually stabilize them at their correct values, although, when an orientation is given, sweeping from leaves to root is obviously more efficient. A small numerical sketch is given below.
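The following sketch of algorithm 15.5 on a chain is illustrative only: the graph, the random positive potentials and the brute-force check are hypothetical. The upward pass computes the max-messages and argmax tables of (15.28) and (15.29), and the downward pass decodes a mode.

```python
import itertools
import numpy as np

# Max-product sketch on a chain s_0 - s_1 - ... - s_{n-1}, where
# pi(x) is proportional to prod_i phi_i(x_i) * prod_i phi_pair(x_i, x_{i+1}).
# Node 0 plays the role of the root.
rng = np.random.default_rng(1)
n, K = 5, 3
phi = rng.uniform(0.1, 1.0, size=(n, K))               # phi_i(x_i)
phi_pair = rng.uniform(0.1, 1.0, size=(n - 1, K, K))   # phi_{i,i+1}(x_i, x_{i+1})

# Upward pass (leaf to root): U*_i as in (15.27), plus argmax tables (15.29).
U = np.ones((n, K))
xi = np.zeros((n - 1, K), dtype=int)
for i in range(n - 2, -1, -1):
    # scores[a, b] = phi_pair(a, b) * phi_{i+1}(b) * U*_{i+1}(b)
    scores = phi_pair[i] * (phi[i + 1] * U[i + 1])[None, :]
    U[i] = scores.max(axis=1)
    xi[i] = scores.argmax(axis=1)

# Downward pass: decode the maximizing configuration.
x_star = np.zeros(n, dtype=int)
x_star[0] = np.argmax(phi[0] * U[0])
for i in range(n - 1):
    x_star[i + 1] = xi[i, x_star[i]]

# Check against brute-force enumeration of all K**n configurations.
def score(x):
    p = np.prod([phi[i, x[i]] for i in range(n)])
    return p * np.prod([phi_pair[i, x[i], x[i + 1]] for i in range(n - 1)])

best = max(itertools.product(range(K), repeat=n), key=score)
assert tuple(x_star) == best
print("mode:", x_star)
```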
The previous analysis is not valid for loopy graphs, but the updates (15.28) and (15.29) provide well-defined iterations when $G$ is an arbitrary undirected graph, and can therefore be used as such, without any guaranteed behavior.
The expressions we obtained for message updating with belief propagation and with mode determination respectively took the form
$$m_{ts}(x^{(s)}) \leftarrow \sum_{x^{(t)}\in F_t}\varphi_{st}(x^{(s)},x^{(t)})\,\varphi_t(x^{(t)})\prod_{t'\in V_t\setminus\{s\}}m_{t't}(x^{(t)})$$
and
$$\mu_{ts}(x^{(s)}) \leftarrow \max_{x^{(t)}\in F_t}\Big(\varphi_{st}(x^{(s)},x^{(t)})\,\varphi_t(x^{(t)})\prod_{t'\in V_t\setminus\{s\}}\mu_{t't}(x^{(t)})\Big).$$
The first one is often referred to as the "sum-prod" update rule, and the second as the "max-prod". In our construction, the sum-prod algorithm provided us with a method for computing
$$\sigma_s(x^{(s)}) = \sum_{y^{(V\setminus\{s\})}}U(x^{(s)}\wedge y^{(V\setminus\{s\})})$$
with
$$U(x) = \prod_{s\in V}\varphi_s(x^{(s)})\prod_{\{s,t\}\in E}\varphi_{st}(x^{(s)},x^{(t)}).$$
The previous algorithms can be generalized using the concept of factor graphs asso-
ciated with the decomposition. The vertexes of this graph are either indexes s ∈ V or
sets C ∈ S, and the only edges link indexes and sets that contain them. The formal
definition is as follows.
Definition 15.11 Let $V$ be a finite set of indexes and $S$ a subset of $\mathcal P(V)$. The factor graph associated with $V$ and $S$ is the graph $G = (V\cup S, E)$, where $E$ consists of all pairs $\{s, C\}$ with $C\in S$ and $s\in C$.
We assign the variable x(s) to a vertex s ∈ V of the factor graph, and the function ϕC
to C ∈ S. With this in mind, the sum-prod and max-prod algorithms are extended
to factor graphs as follows.
Definition 15.12 Let $G = (V\cup S, E)$ be a factor graph, with associated functions $\varphi_C(x_C)$. The sum-prod algorithm on $G$ updates messages $m_{sC}(x_s)$ and $m_{Cs}(x_s)$ according to the rules
$$m_{sC}(x^{(s)}) \leftarrow \prod_{\tilde C:\,s\in\tilde C,\ \tilde C\neq C}m_{\tilde Cs}(x^{(s)}),\qquad m_{Cs}(x^{(s)}) \leftarrow \sum_{y^{(C)}:\,y^{(s)}=x^{(s)}}\varphi_C(y^{(C)})\prod_{t\in C\setminus\{s\}}m_{tC}(y^{(t)}). \tag{15.30}$$
These algorithms reduce to the original ones when only single-vertex and pair interactions exist. Let us check this with sum-prod. In this case, the set $S$ contains all singletons $C = \{s\}$, with associated function $\varphi_s$, and all edges $\{s,t\}$, with associated function $\varphi_{st}$. We have links between $s$ and $\{s\}$, and between $s$ and $\{s,t\}\in E$. For singletons, we have
$$m_{s\{s\}}(x^{(s)}) \leftarrow \prod_{t\sim s}m_{\{s,t\}s}(x^{(s)})\quad\text{and}\quad m_{\{s\}s}(x^{(s)}) \leftarrow \varphi_s(x^{(s)}).$$
For pairs,
$$m_{s\{s,t\}}(x^{(s)}) \leftarrow \varphi_s(x^{(s)})\prod_{\tilde t\in V_s\setminus\{t\}}m_{\{s,\tilde t\}s}(x^{(s)})$$
and
$$m_{\{s,t\}s}(x^{(s)}) \leftarrow \sum_{y^{(t)}}\varphi_{st}(x^{(s)},y^{(t)})\,m_{t\{s,t\}}(y^{(t)}),$$
and, combining the last two assignments, it becomes clear that we retrieve the initial algorithm, with $m_{\{s,t\}s}$ taking the role of what we previously denoted $m_{ts}$. This correspondence is illustrated in the sketch below.
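The following sketch of the updates (15.30) on a small acyclic factor graph is illustrative (the factor tables are hypothetical); after a few sweeps, the beliefs $\prod_{C\ni s}m_{Cs}$ agree with marginals computed by brute force, as proposition 15.13 below guarantees.

```python
import itertools
import numpy as np

# Sum-prod sketch on a small acyclic factor graph (definition 15.12).
# Variables 0, 1, 2 each take K values; factors are positive tables
# over their scopes, standing in for the phi_C.
rng = np.random.default_rng(2)
K = 3
factors = {                     # scope -> table indexed by the scope's variables
    (0,): rng.uniform(0.2, 1.0, size=(K,)),
    (0, 1): rng.uniform(0.2, 1.0, size=(K, K)),
    (1, 2): rng.uniform(0.2, 1.0, size=(K, K)),
}
variables = [0, 1, 2]

m_var2fac = {(s, C): np.ones(K) for C in factors for s in C}
m_fac2var = {(C, s): np.ones(K) for C in factors for s in C}

for sweep in range(5):          # enough sweeps for this tiny acyclic graph
    for C in factors:
        for s in C:             # variable-to-factor update in (15.30)
            msg = np.ones(K)
            for D in factors:
                if s in D and D != C:
                    msg *= m_fac2var[(D, s)]
            m_var2fac[(s, C)] = msg
    for C, table in factors.items():
        for i, s in enumerate(C):   # factor-to-variable update in (15.30)
            msg = np.zeros(K)
            for y in itertools.product(range(K), repeat=len(C)):
                w = table[y]
                for j, t in enumerate(C):
                    if t != s:
                        w *= m_var2fac[(t, C)][y[j]]
                msg[y[i]] += w
            m_fac2var[(C, s)] = msg

# Beliefs sigma_s compared with brute-force marginals.
for s in variables:
    belief = np.ones(K)
    for C in factors:
        if s in C:
            belief *= m_fac2var[(C, s)]
    brute = np.zeros(K)
    for x in itertools.product(range(K), repeat=len(variables)):
        w = np.prod([factors[C][tuple(x[t] for t in C)] for C in factors])
        brute[x[s]] += w
    assert np.allclose(belief / belief.sum(), brute / brute.sum())
print("sum-prod beliefs match brute-force marginals")
```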
The important question, obviously, is whether the algorithms converge. The following result shows that this is true when the factor graph is acyclic.
Proposition 15.13 Let $G = (V\cup S, E)$ be a factor graph with associated functions $\varphi_C$. Assume that $G$ is acyclic. Then the sum-prod and max-prod algorithms converge in finite time. After convergence, we have $\sigma_s(x^{(s)}) = \prod_{C:\,s\in C}m_{Cs}(x^{(s)})$ and $\rho_s(x^{(s)}) = \prod_{C:\,s\in C}\mu_{Cs}(x^{(s)})$.
Proof Let us assume that $G$ is connected, which is without loss of generality, since the following argument can be applied to each component of $G$ separately. Since $G$ is acyclic, we can arbitrarily select one of its vertexes as a root to form a tree. This being done, we can see that the messages going upward in the tree (from children to parent) progressively stabilize, starting with the leaves. Leaves in the factor graph are indeed either singletons, $C = \{s\}$, or vertexes $s\in V$ that belong to only one set $C\in S$. In the first case, the algorithm imposes (taking, for example, the sum-prod case) $m_{\{s\}s}(x^{(s)}) = \varphi_s(x^{(s)})$, and in the second case $m_{sC}(x^{(s)}) = 1$. So the messages sent upward by the leaves are set at the first step. Since the message going from a child to its parent only depends on the messages that it received from its other neighbors in the acyclic graph, which are its children in the tree, it is clear that all upward messages progressively stabilize until the root is reached. Once this is done, messages propagate downward from each parent to its children. This stabilizes as soon as all incoming messages to the parent are stabilized, since outgoing messages only depend on those. At the end of the upward phase, this is true for the root, which can then send its stable messages to its children. These children then have all their incoming messages and can send their messages to their own children, and so on down to the leaves.
Since the upward phase of the algorithm does not depend on the ancestors of $s$, the messages incoming to $s$ for the sum-prod algorithm restricted to $G_s$ are the same as with the general algorithm, so that, using the induction hypothesis,
$$\sum_{y^{(V_s)}:\,y^{(s)}=x^{(s)}}U_s(y^{(V_s)}) = \prod_{C\in\mathcal C_s,\,s\in C}m_{Cs}(x^{(s)}) = m_{sC_s}(x^{(s)}).$$
Now let $C_1,\dots,C_n$ list all the sets in $\mathcal C$ that contain $s_0$; these must be pairwise non-intersecting (except at $\{s_0\}$), again in order not to create loops. Write
$$C_1\cup\cdots\cup C_n = \{s_0, s_1,\dots,s_q\}.$$
Then we have
$$U(x) = \prod_{j=1}^n\varphi_{C_j}(x^{(C_j)})\prod_{i=1}^q U_{s_i}(x^{(V_{s_i})})$$
and, letting $S = \bigcup_{j=1}^n C_j\setminus\{s_0\}$,
$$\begin{aligned}
\sigma_{s_0}(x^{(s_0)}) &= \sum_{y^{(V)}:\,y^{(s_0)}=x^{(s_0)}}\prod_{j=1}^n\varphi_{C_j}(y^{(C_j)})\prod_{i=1}^q U_{s_i}(y^{(V_{s_i})})\\
&= \sum_{y^{(S)}:\,y^{(s_0)}=x^{(s_0)}}\prod_{j=1}^n\varphi_{C_j}(y^{(C_j)})\prod_{i=1}^q m_{s_iC_{s_i}}(y^{(s_i)})\\
&= \prod_{j=1}^n\sum_{y^{(C_j)}:\,y^{(s_0)}=x^{(s_0)}}\varphi_{C_j}(y^{(C_j)})\prod_{s\in C_j\setminus\{s_0\}}m_{sC_s}(y^{(s)})\\
&= \prod_{j=1}^n m_{C_js_0}(x^{(s_0)}),
\end{aligned}$$
which proves the required result (note that, when factorizing the sum, we have used the fact that the sets $C_j\setminus\{s_0\}$ are non-intersecting). An almost identical argument holds for the max-prod algorithm.
Remark 15.14 Note that these algorithms are not always feasible. For example, it is always possible to represent a function $U$ on $F(V)$ with the trivial factor graph in which $S = \{V\}$ and $E$ contains all $\{s, V\}$, $s\in V$ (using $\varphi_V = U$), but computing $m_{Vs}$ then amounts to directly computing $\sigma_s$ with a sum over all configurations on $V\setminus\{s\}$, whose number grows exponentially. In fact, the complexity of the sum-prod and max-prod algorithms is exponential in the size of the largest $C$ in $S$, which should therefore remain small.
Remark 15.15 It is not always possible to decompose a function so that the resulting factor graph is acyclic with small degree (maximum number of edges per vertex). Sum-prod and max-prod can still be used on loopy networks, sometimes with excellent results, but without theoretical support.
Remark 15.16 One can sometimes transform a given factor graph into an acyclic one by grouping vertexes. Assume that the set $S\subset\mathcal P(V)$ is given. We will say that a partition $\Delta = (D_1,\dots,D_k)$ of $V$ is $S$-admissible if, for any $C\in S$ and any $j\in\{1,\dots,k\}$, one has either $D_j\cap C = \emptyset$ or $D_j\subset C$.
If $\Delta$ is $S$-admissible, one can define a new factor graph $\tilde G$ as follows. We first let $\tilde V = \{1,\dots,k\}$. To define $\tilde S\subset\mathcal P(\tilde V)$, assign to each $C\in S$ the set $J_C$ of indexes $j$ such that $D_j\subset C$. From the admissibility assumption,
$$C = \bigcup_{j\in J_C}D_j, \tag{15.32}$$
so that $C\mapsto J_C$ is one-to-one. Let $\tilde S = \{J_C, C\in S\}$. Group variables using $\tilde x^{(k)} = x^{(D_k)}$, so that $\tilde F_k = F(D_k)$. Define $\tilde\Phi = (\tilde\varphi_{\tilde C}, \tilde C\in\tilde S)$ by $\tilde\varphi_{\tilde C} = \varphi_C$, where $C$ is given by (15.32). In other terms, one groups the variables $(x^{(s)}, s\in V)$ into clusters to create a simpler factor graph, which may be acyclic even if the original one was not. For example, if $V = \{a,b,c,d\}$ and $S = \{A, B\}$ with $A = \{a,b,c\}$ and $B = \{b,c,d\}$, then $(A, c, B, b)$ is a cycle in the associated factor graph. If, however, one takes $D_1 = \{a\}$, $D_2 = \{b,c\}$ and $D_3 = \{d\}$, then $(D_1, D_2, D_3)$ is $S$-admissible and the associated factor graph is acyclic. In fact, in such a case, the resulting factor graph, considered as a graph with vertexes given by subsets of $V$, is a special case of a junction tree, which is defined in the next section.
Let $(C_1, D_{i_1},\dots,D_{i_{n-1}}, C_n)$ be a path in that graph. Assume that $s\in C_1\cap C_n$. Let $D_{i_n}$ be the unique $D_j$ that contains $s$. From the admissibility assumption, $D_{i_n}\subset C_1$ and $D_{i_n}\subset C_n$, which implies that $(C_1, D_{i_1},\dots,C_n, D_{i_n}, C_1)$ is a path in $\tilde G$. Since $\tilde G$ is acyclic, this path must be a union of folded paths. But it is easy to see that any folded path satisfies the running intersection constraint. (Note that there was no loss of generality in assuming that the path started and ended with a "$C$", since any "$D$" must be contained in the $C$ that follows or precedes it.)
where $U(x) = \prod_{C\in S}\varphi_C(x^{(C)})$. We have
$$\sigma_C(x^{(C)}) = \varphi_C(x^{(C)})\sum_{y^{(V\setminus C)}}\prod_{B\in S\setminus\{C\}}\varphi_B(x^{(B\cap C)}\wedge y^{(B\setminus C)}).$$
Define
$$\sigma_C^+(x^{(C)}) = \sum_{y^{(V_C^+\setminus C)}}\prod_{B>C}\varphi_B(x^{(B\cap C)}\wedge y^{(B\setminus C)}).$$
Note that we have $\sigma_{C_0} = \varphi_{C_0}\sigma_{C_0}^+$ at the root. We have the recursion formula
$$\begin{aligned}
\sigma_C^+(x^{(C)}) &= \sum_{y^{(V_C^+\setminus C)}}\prod_{C\to B}\varphi_B(x^{(B\cap C)}\wedge y^{(B\setminus C)})\prod_{B'>B}\varphi_{B'}(x^{(B'\cap C)}\wedge y^{(B'\setminus C)})\\
&= \prod_{C\to B}\sum_{y^{(B\cup V_B^+\setminus C)}}\varphi_B(x^{(B\cap C)}\wedge y^{(B\setminus C)})\prod_{B'>B}\varphi_{B'}(x^{(B'\cap C)}\wedge y^{(B'\setminus C)})\\
&= \prod_{C\to B}\sum_{y^{(B\setminus C)}}\varphi_B(x^{(B\cap C)}\wedge y^{(B\setminus C)})\,\sigma_B^+(x^{(B\cap C)}\wedge y^{(B\setminus C)}).
\end{aligned}$$
The inversion between the sum and the product in the second equality above was possible because the sets $B\cup V_B^+\setminus C$, $C\to B$, are disjoint. Indeed, if there existed $B, B'$ such that $C\to B$ and $C\to B'$, and descendants $C'$ of $B$ and $C''$ of $B'$ with a non-empty intersection, then this intersection would have to be included in every set in the (non-oriented) path connecting $C'$ and $C''$ in $G$. Since this path contains $C$, the intersection must also be included in $C$, so that the sets $B\cup V_B^+\setminus C$, with $C\to B$, are disjoint.
Introduce the messages
$$m_B^+(x^{(C)}) = \sum_{y^{(B\setminus C)}}\varphi_B(x^{(B\cap C)}\wedge y^{(B\setminus C)})\,\sigma_B^+(x^{(B\cap C)}\wedge y^{(B\setminus C)})$$
with
$$\sigma_C^+(x^{(C)}) = \prod_{C\to B}m_B^+(x^{(C)}),$$
which provides $\sigma_C$ at the root. Reinterpreting this discussion in terms of the undirected graph, we are led to introducing messages $m_{BC}(x^{(C)})$ for $B\sim C$ in $G$, with the message-passing rule
$$m_{BC}(x^{(C)}) = \sum_{y^{(B\setminus C)}}\varphi_B(x^{(B\cap C)}\wedge y^{(B\setminus C)})\prod_{B'\sim B,\,B'\neq C}m_{B'B}(x^{(B\cap C)}\wedge y^{(B\setminus C)}). \tag{15.33}$$
Note that the complexity of the junction tree algorithm is exponential in the car-
dinality of the largest C ∈ S. This algorithm will therefore be unfeasible if S contains
sets that are too large.
There is more than one family of set interactions with respect to which a given prob-
ability π can be decomposed (notice that, unlike in the Hammersley-Clifford The-
orem, we do not assume that the interactions are normalized), and not all of them
can be organized as a junction tree. One can however extend any given family into a
new one on which one can build a junction tree.
Definition 15.19 Let V be a set of vertexes, and S0 ⊂ P (V ). We say that a set S ⊂ P (V )
is an extension of S0 if, for any C0 ∈ S0 , there exists a C ∈ S such that C0 ⊂ C.
For this, it suffices to build a mapping, say $T: S_0\to S$, such that $C_0\subset T(C_0)$ for all $C_0\in S_0$, which is always possible since $S$ is an extension of $S_0$ (for example, arbitrarily order the elements of $S$ and let $T(C_0)$ be the first element of $S$, according to this order, that contains $C_0$). One can then define
$$\varphi_C(x^{(C)}) = \prod_{C_0:\,T(C_0)=C}\varphi^0_{C_0}(x^{(C_0)}).$$
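A minimal sketch of this construction follows; the families `S0` and `S` and the tables `phi0` are hypothetical illustrations.

```python
from itertools import product

# Extension construction: S0 is the original family of scopes, S an
# extension of it (every C0 in S0 is contained in some C in S).
# T maps each C0 to the first C (in a fixed arbitrary order) containing it,
# and the new interaction phi_C is the product of the phi0_{C0} with T(C0) = C.
S0 = [frozenset({0, 1}), frozenset({1, 2}), frozenset({2})]
S = [frozenset({0, 1, 2})]            # a (trivial) extension of S0

T = {C0: next(C for C in S if C0 <= C) for C0 in S0}

K = 2                                 # each variable takes K values
phi0 = {C0: {x: 1.0 + hash((C0, x)) % 3
             for x in product(range(K), repeat=len(C0))}
        for C0 in S0}                 # arbitrary positive tables

def phi(C, assignment):
    """phi_C(x^(C)): product over C0 with T(C0) = C of phi0_{C0}(x^(C0)).
    `assignment` maps each vertex of C to its value."""
    value = 1.0
    for C0, target in T.items():
        if target == C:
            key = tuple(assignment[s] for s in sorted(C0))
            value *= phi0[C0][key]
    return value

print(phi(S[0], {0: 0, 1: 1, 2: 0}))
```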
Definition 15.21 The graph $G$ is decomposable if it satisfies the following recursive condition: it is either complete, or there exist disjoint subsets $(A, B, C)$ of $V$ such that
• $V = A\cup B\cup C$,
• $A$ and $B$ are not empty,
• $C$ is a clique in $G$ and $C$ separates $A$ and $B$,
• the restricted graphs $G_{A\cup C}$ and $G_{B\cup C}$ are decomposable.
Proof To prove the "if" part, we proceed by induction on $n = |V|$. Note that every graph with $n\leq 3$ is both decomposable and triangulated (we leave the verification to the reader). Assume that the statement "decomposable $\Rightarrow$ triangulated" holds for graphs with fewer than $n$ vertexes, and take $G$ with $n$ vertexes. Assume that $G$ is decomposable. If it is complete, it is obviously triangulated. Otherwise, there exist $A, B, C$ such that $V = A\cup B\cup C$, with $A$ and $B$ non-empty, such that $G_{A\cup C}$ and $G_{B\cup C}$ are decomposable, hence triangulated by the induction hypothesis, and such that $C$ is a clique which separates $A$ and $B$. Assume that $\gamma$ is an achordal loop in $G$. Since it cannot be included in $A\cup C$ or $B\cup C$, $\gamma$ must go from $A$ to $B$ and back, which implies that it passes at least twice in $C$. Since $C$ is complete, the original loop can be shortcut to form subloops in $A\cup C$ and $B\cup C$. If one (or both) of these subloops has cardinality 3, this provides $\gamma$ with a chord, which contradicts the assumption. Otherwise, the following lemma also provides a contradiction, since one of the two chords that it implies must also be a chord in the original $\gamma$.
Lemma 15.23 Let $(s_1,\dots,s_n,s_{n+1}=s_1)$ be a loop in a triangulated graph, with $n\geq 4$. Then the loop has chords at at least two non-contiguous vertexes.
To prove the lemma, assume the contrary and let $(s_1,\dots,s_n,s_{n+1}=s_1)$ be a loop that does not satisfy the condition, with $n$ as small as possible. If $n > 4$, the loop must have a chord, say at $s_j$, and one can remove $s_j$ from the loop to obtain a smaller loop, which must satisfy the condition in the lemma, since $n$ was as small as possible. One of the two chords must be at a vertex other than the two neighbors of $s_j$, and thus provides a second chord in the original loop, which is a contradiction. Thus $n = 4$, but $G$ being triangulated implies that this four-point loop has a diagonal, so that the condition in the lemma also holds, which provides a contradiction.
For the "only if" part of proposition 15.22, assume that $G$ is triangulated. We prove that the graph is decomposable by induction on $|G|$. The induction will work if we can show that, if $G$ is triangulated, it is either complete or there exists a clique $C$ in $G$ such that $V\setminus C$ is disconnected, i.e., there exist two elements $a, b\in V\setminus C$ which are connected by no path in $V\setminus C$. Indeed, we will then be able to decompose $V = A\cup B\cup C$, where $A$ and $B$ are unions of (distinct) connected components of $V\setminus C$. Take, for example, $A$ to be the set of vertexes connected to $a$ in $G\setminus C$, and $B = V\setminus(A\cup C)$, which is not empty since it contains $b$. Note that restricted graphs of triangulated graphs are triangulated too.
So, assume that $G$ is triangulated, and not complete. Let $C$ be a subset of $V$ that satisfies the property that $V\setminus C$ is disconnected, and take $C$ minimal, so that $V\setminus C'$ is connected for any $C'\subset C$, $C'\neq C$. We want to show that $C$ is a clique, so take $s$ and $t$ in $C$ and assume that they are not neighbors, to reach a contradiction.
We can now characterize graphs that admit junction trees over the set of their maximal cliques.
Theorem 15.24 Let $G = (V, E)$ be an undirected graph, and $\mathcal C_G^*$ the set of all maximal cliques in $G$. The following two properties are equivalent.
(i) There exists a junction tree over $\mathcal C_G^*$.
(ii) $G$ is triangulated/decomposable.
Proof The proof works by induction on the number of maximal cliques, $|\mathcal C_G^*|$. If $G$ has only one maximal clique, then $G$ is complete, because any point not included in this clique would have to be included in another maximal clique, which leads to a contradiction. So $G$ is decomposable, and, since any single node obviously provides a junction tree, (i) is true also.
Now, fix $G$ and assume that the theorem is true for any graph with fewer maximal cliques. First assume that $\mathcal C_G^*$ has a junction tree, $T$. Let $C_1$ be a leaf in $T$, connected, say, to $C_2$, and let $T_2$ be $T$ restricted to $\mathcal C_2 = \mathcal C_G^*\setminus\{C_1\}$. Let $V_2$ be the union of the maximal cliques from nodes in $T_2$. A maximal clique $C$ in $G_{V_2}$ is a clique in $G$ and is therefore included in some maximal clique $C'\in\mathcal C_G^*$. If $C'\in\mathcal C_2$, then $C'$ is also a clique in $G_{V_2}$, and for $C$ to be maximal, we need $C = C'$. If $C' = C_1$, we note that we must also have
$$C = \bigcup_{\tilde C\in\mathcal C_2}C\cap\tilde C,$$
and whenever $C\cap\tilde C$ is not empty, this set must be included in any node in the path in $T$ that links $\tilde C$ to $C_1$. Since this path contains $C_2$, we have $C\cap\tilde C\subset C_2$, so that $C\subset C_2$; but, since $C$ is maximal, this would imply that $C = C_2 = C_1$, which is impossible.
This shows that $\mathcal C_{G_{V_2}}^* = \mathcal C_2$. This also shows that $T_2$ is a junction tree over $\mathcal C_2$. So, by the induction hypothesis, $G_{V_2}$ is decomposable. If $s\in V_2\cap C_1$, then $s$ also belongs to some clique $C'\in\mathcal C_2$, and therefore belongs to any clique in the path between $C'$ and $C_1$, which includes $C_2$. So $s\in C_1\cap C_2$ and $C_1\cap V_2 = C_1\cap C_2$. So, letting $A = C_1\setminus(C_1\cap C_2)$, $B = V_2\setminus(C_1\cap C_2)$, $S = C_1\cap C_2$, we know that $G_{A\cup S}$ and $G_{B\cup S}$ are decomposable (the first one being complete), and that $S$ is a clique. To show that $G$ is decomposable, it remains to show that $S$ separates $A$ from $B$.
If a path connects $A$ to $B$ in $G$, it must contain an edge, say $\{s,t\}$, with $s\in V\setminus S$ and $t\in S$; $\{s,t\}$ must be included in a maximal clique of $G$. If this clique is $C_1$, we have $s\in C_1\cap V_2 = S$. The same argument shows that this is the only possibility, because, if $\{s,t\}$ is included in some maximal clique in $\mathcal C_2$, then we would find $t\in C_1\cap C_2$. So $S$ separates $A$ and $B$ in $G$.
Let us now prove the converse statement, and assume that $G$ is decomposable. If $G$ is complete, it has only one maximal clique and we are done. Otherwise, there exists a partition $V = A\cup B\cup S$ such that $G_{A\cup S}$ and $G_{B\cup S}$ are decomposable, with $A$ and $B$ separated by $S$, which is complete. Let $\mathcal C_A^*$ be the maximal cliques in $G_{A\cup S}$ and $\mathcal C_B^*$ the maximal cliques in $G_{B\cup S}$. By hypothesis, there exist junction trees $T_A$ and $T_B$ over $\mathcal C_A^*$ and $\mathcal C_B^*$.
The clique $S$ is included in some maximal clique $S_A^*\in\mathcal C_A^*$. From the previous discussion, we have either $S_A^* = S$ or $S_A^*\in\mathcal C_G^*$. Similarly, $S$ can be extended to a maximal clique $S_B^*\in\mathcal C_B^*$, with $S_B^* = S$ or $S_B^*\in\mathcal C_G^*$. Notice also that at least one of $S_A^*$ or $S_B^*$ must be a maximal clique in $G$: indeed, assume that both sets are equal to $S$, which, as a clique, can be extended to a maximal clique $S^*$ in $G$; $S^*$ must be included either in $A\cup S$ or in $B\cup S$, and therefore be a maximal clique in the corresponding graph, which yields $S^* = S$. Reversing the notation if needed, we will assume that $S_A^*\in\mathcal C_G^*$.
All elements of $\mathcal C_G^*$ must belong either to $\mathcal C_A^*$ or to $\mathcal C_B^*$, since any maximal clique, say $C$, in $G$ must be included in either $A\cup S$ or $B\cup S$, and therefore also provides a maximal clique in the related graph. So the nodes in $T_A$ and $T_B$ enumerate all maximal cliques in $G$, and we can build a tree $T$ over $\mathcal C_G^*$ by identifying $S_A^*$ and $S_B^*$ with $S^*$ and merging the two trees at this node. To conclude our proof, it only remains to show that the running intersection property is satisfied. So consider two nodes $C, C'$ in $T$ and take $s\in C\cap C'$. If the path between these nodes remains in $\mathcal C_A^*$, or in $\mathcal C_B^*$, then $s$ will belong to any set along that path, since the running intersection property is true on $T_A$ and $T_B$. Otherwise, we must have $s\in S$, and the path must contain $S^*$ to switch trees, and $s$ must still belong to any clique in the path (applying the running intersection property between the beginning of the path and $S^*$, and between $S^*$ and the end of the path).
This theorem delineates a strategy for building a junction tree adapted to a given family of local interactions $\Phi = (\varphi_C, C\in\mathcal C)$. Letting $G$ be the graph induced by these interactions, i.e., $s\sim_G t$ if and only if there exists $C\in\mathcal C$ such that $\{s,t\}\subset C$, the method proceeds as follows.
(JT1) Triangulate $G$ if needed.
(JT2) Compute the set $\mathcal C^*$ of maximal cliques of the triangulated graph.
(JT3) Build a junction tree over $\mathcal C^*$.
(JT4) Extend $\Phi$ to a family of interactions indexed by $\mathcal C^*$, as described above.
(JT5) Run the junction-tree belief propagation algorithm to compute the marginal of $\pi$ (associated with $\Phi$) over each set $C^*\in\mathcal C^*$.
Steps (JT4) and (JT5) have already been discussed, and we now explain how the first three steps can be implemented.
First consider step (JT1). To triangulate a graph $G = (V, E)$, it suffices to order its vertexes so that $V = \{s_1,\dots,s_n\}$, and then run the following algorithm, starting with $k = n$ and $E_n = E$, and iterating down to $k = 2$:
• Add an edge between any pair of neighbors of $s_k$ in $(\{s_1,\dots,s_k\}, E_k)$ (unless, of course, they are already linked).
• Let $E_{k-1}$ be the new set of edges.
However, the quality of the triangulation, which can be measured by the number of added edges, or by the size of the maximal cliques, highly depends on the way the vertexes have been numbered. Take the simple example of the linear graph with three vertexes $A\sim B\sim C$. If the vertex of highest index is $B$, then the previous algorithm will return the three-point loop $A\sim B\sim C\sim A$. Any other ordering will leave the linear graph, which is already triangulated, invariant.
So, one must be careful about the order in which nodes are processed. Finding an optimal ordering for a given global cost is an NP-complete problem. However, a very simple modification of the previous algorithm, which starts with $s_n$ having the minimal number of neighbors, and at each step defines $s_k$ to be the vertex with the fewest neighbors among those that have not been visited yet, provides an efficient way of building triangulations. (It has the merit of leaving $G$ invariant if it is a tree, for example.) Another criterion may be preferred to the number of neighbors (for example, the number of new edges that would be needed if $s$ were selected). A sketch of this heuristic follows.
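The sketch below implements the elimination-based triangulation with the minimal-neighbor (min-degree) choice just described; the function name and the example graph are hypothetical.

```python
def triangulate_min_degree(adj):
    """Triangulate a graph by eliminating, at each step, a vertex of minimal
    degree among the remaining ones, connecting all its remaining neighbors.

    `adj` maps each vertex to the set of its neighbors; the returned graph
    contains the original edges plus the fill-in edges. This is a sketch of
    the heuristic discussed in the text, not an optimal triangulation."""
    work = {s: set(nb) for s, nb in adj.items()}    # working copy, shrinks
    filled = {s: set(nb) for s, nb in adj.items()}  # result, only grows
    remaining = set(adj)
    order = []
    while remaining:
        s = min(remaining, key=lambda v: len(work[v]))  # min-degree choice
        order.append(s)
        nbrs = list(work[s])
        for i in range(len(nbrs)):                  # marry the neighbors of s
            for j in range(i + 1, len(nbrs)):
                t, u = nbrs[i], nbrs[j]
                if u not in work[t]:
                    work[t].add(u); work[u].add(t)
                    filled[t].add(u); filled[u].add(t)
        for t in nbrs:                              # eliminate s
            work[t].discard(s)
        del work[s]
        remaining.discard(s)
    # Reverse so that the first-eliminated vertex receives the highest index.
    return filled, order[::-1]

# Example: a 4-cycle a-b-c-d, which is not triangulated; one chord is added.
adj = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
tri, order = triangulate_min_degree(adj)
print(sorted((s, sorted(nb)) for s, nb in tri.items()), order)
```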
$G^{(s)}$ is called the $s$-elimination graph of $G$. The set of added edges, namely $E^{(s)}\setminus(E\cap E^{(s)})$, is called the deficiency set of $s$ and is denoted $D(s)$ (or $D_G(s)$).
Proof The "only if" part is obvious, since the triangulation algorithm following a perfect ordering does not add any edge to $G$, which must therefore have been triangulated to start with.
We now proceed to the "if" part. For this, it suffices to prove that, for any triangulated graph, there exists a vertex $s$ such that $D_G(s) = \emptyset$. One can then easily prove the result by induction, since, after removing this $s$, the remaining graph $G^{(s)}$ is still triangulated and admits (by induction) a perfect ordering that completes this first step.
If a graph is triangulated, there is more than one perfect ordering of its vertexes. One of these orderings is provided by the maximum cardinality search algorithm, which also allows one to decide whether the graph is triangulated. We start with a definition/notation.
One says that $\alpha$ satisfies the maximum cardinality property if, for all $k\in\{2,\dots,n\}$,
$$|V_{s_k}^{\alpha,k-1}| = \max_{\alpha(s)\geq k}|V_s^{\alpha,k-1}|, \tag{15.35}$$
where $s_k = \alpha^{-1}(k)$.
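The following sketch of maximum cardinality search is illustrative (the adjacency-set representation and the names are hypothetical): each step numbers the vertex with the most already-numbered neighbors, which enforces (15.35); the final loop checks the perfect-ordering property, and hence triangulation.

```python
def maximum_cardinality_search(adj, start=None):
    """Return a list (s_1, ..., s_n) such that each s_k maximizes, among the
    unnumbered vertexes, the number of neighbors in {s_1, ..., s_{k-1}},
    property (15.35). `adj` maps vertexes to neighbor sets."""
    unnumbered = set(adj)
    score = {s: 0 for s in adj}        # number of already-numbered neighbors
    order = []
    current = start if start is not None else next(iter(adj))
    while unnumbered:
        order.append(current)
        unnumbered.discard(current)
        for t in adj[current]:
            if t in unnumbered:
                score[t] += 1
        if unnumbered:
            current = max(unnumbered, key=score.__getitem__)
    return order

# On a triangulated graph, the resulting ordering is perfect; checking that
# each set of earlier neighbors is a clique then decides triangulation.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
order = maximum_cardinality_search(adj, start="a")
pos = {s: k for k, s in enumerate(order)}
for s in order:                        # triangulation test
    earlier = [t for t in adj[s] if pos[t] < pos[s]]
    assert all(u in adj[t] for t in earlier for u in earlier if u != t)
print("perfect ordering:", order)
```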
More precisely, assume that an achordal path $s_1,\dots,s_k$ has been obtained, such that $\alpha(s)$ is first increasing, then decreasing along the path, and such that, at the extremities, one either has $\alpha(s_1) < \alpha(s_k) < \alpha(s_2)$ or $\alpha(s_k) < \alpha(s_1) < \alpha(s_{k-1})$. In fact, one can switch between these last two cases by traversing the path backwards. Both paths $(u, s, t)$ and $(u, s, t, t')$ in the discussion above satisfy this property.
• Assume, without loss of generality, that $\alpha(s_1) < \alpha(s_k) < \alpha(s_2)$ and note that, in the considered path, $s_1$ and $s_k$ cannot be neighbors (for, if $j$ is the last index smaller than $k-1$ such that $s_j$ and $s_k$ are neighbors, then $j$ must also be smaller than $k-2$ and the loop $s_j,\dots,s_{k-1},s_k$ would be achordal).
• Since $\alpha(s_2) > \alpha(s_k)$, and $s_1$ and $s_2$ are neighbors, $s_k$ must have a neighbor, say $s_k'$, such that $s_k'$ is not a neighbor of $s_2$ and $\alpha(s_k') < \alpha(s_k)$.
• Select the first index $j > 2$ such that $s_j\sim s_k'$, and consider the path $(s_1,\dots,s_j,s_k')$. This path is achordal, by construction, and one cannot have $s_1\sim s_k'$, since this would create an achordal loop. Let us show that $\alpha$ first increases and then decreases along this path. Since $s_2$ is in the path, $\alpha$ must first increase, and it suffices to show that $\alpha(s_k') < \alpha(s_j)$. If $\alpha$ increases from $s_1$ to $s_j$, then $\alpha(s_j) > \alpha(s_2) > \alpha(s_k) > \alpha(s_k')$. If $\alpha$ started decreasing at some point before $s_j$, then $\alpha(s_j) > \alpha(s_k) > \alpha(s_k')$.
• Finally, we need to show that the $\alpha$-value at one extremity is between the first two $\alpha$-values at the other end of the path. If $\alpha(s_k') < \alpha(s_1)$, then, since we have just seen that $\alpha(s_j) > \alpha(s_k) > \alpha(s_1)$, we do get $\alpha(s_k') < \alpha(s_1) < \alpha(s_j)$. If $\alpha(s_k') > \alpha(s_1)$, then, since by construction $\alpha(s_2) > \alpha(s_k) > \alpha(s_k')$, we have $\alpha(s_2) > \alpha(s_k') > \alpha(s_1)$.
• So, we have obtained a new path that satisfies the same property as the one we started with, but with a maximum value at the end points smaller than the initial one, i.e.,
$$\max(\alpha(s_1), \alpha(s_k')) < \max(\alpha(s_1), \alpha(s_k)).$$
Since $\alpha$ takes a finite number of values, this process cannot be iterated indefinitely, which yields our contradiction.
At this point, we know that a graph must be triangulated for its maximal cliques to admit a junction tree, and we have an algorithm to decide whether a graph is triangulated, and to extend it into a triangulated one if needed. This provides the first step, (JT1), of our description of the junction tree algorithm. The next step, (JT2), requires computing a list of maximal cliques. Computing maximal cliques in a general graph is an NP-complete problem, for which a large number of algorithms have been developed (see, for example, [150] for a review). For graphs with a perfect ordering, however, this problem can always be solved in polynomial time.
Indeed, assume that a perfect ordering is given for $G = (V, E)$, so that $V = \{s_1,\dots,s_n\}$ is such that, for all $k$, $V'_{s_k} := V_{s_k}\cap\{s_1,\dots,s_{k-1}\}$ is a clique. Let $G_k$ be $G$ restricted to $\{s_1,\dots,s_k\}$ and $\mathcal C_k^*$ be the set of maximal cliques in $G_k$. Then the set $C_k := \{s_k\}\cup V'_{s_k}$ is the only maximal clique in $G_k$ that contains $s_k$: it is a clique because the ordering is perfect, and any clique that contains $s_k$ must be included in it (because its elements are either $s_k$ or neighbors of $s_k$). It follows from this that the set $\mathcal C_k^*$ can be deduced from $\mathcal C_{k-1}^*$ by
$$\mathcal C_k^* = \mathcal C_{k-1}^*\cup\{C_k\}\ \text{ if }\ V'_{s_k}\notin\mathcal C_{k-1}^*,\qquad \mathcal C_k^* = \big(\mathcal C_{k-1}^*\cup\{C_k\}\big)\setminus\{V'_{s_k}\}\ \text{ if }\ V'_{s_k}\in\mathcal C_{k-1}^*.$$
This allows one to enumerate all elements in $\mathcal C_G^* = \mathcal C_n^*$, starting with $\mathcal C_1^* = \{\{s_1\}\}$.
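A minimal sketch of this recursion follows, using the same hypothetical adjacency-set representation as in the earlier sketches.

```python
def maximal_cliques_from_perfect_ordering(adj, order):
    """Enumerate the maximal cliques of a graph given a perfect ordering
    (s_1, ..., s_n), following the recursion on C*_k described above."""
    pos = {s: k for k, s in enumerate(order)}
    cliques = [frozenset({order[0]})]           # C*_1 = {{s_1}}
    for k in range(1, len(order)):
        s = order[k]
        Vp = frozenset(t for t in adj[s] if pos[t] < k)  # V'_{s_k}
        if Vp in cliques:                       # V'_{s_k} was maximal ...
            cliques.remove(Vp)                  # ... but no longer is in G_k
        cliques.append(Vp | {s})                # C_k = {s_k} | V'_{s_k}
    return cliques

adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(maximal_cliques_from_perfect_ordering(adj, ["a", "b", "c", "d"]))
# e.g. [frozenset({'a', 'b', 'c'}), frozenset({'c', 'd'})]
```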
We now discuss the last remaining point, (JT3). For this, we need to form the clique graph of $G$, which is the undirected graph $\mathcal G = (\mathcal C_G^*, \mathcal E)$ defined by $(C, C')\in\mathcal E$ if and only if $C\cap C'\neq\emptyset$. We then have the following fact:
We hereafter assume that $G$, and hence $\mathcal G$, is connected. This is no real loss of generality, because connected components in undirected graphs yield independent processes that can be handled separately. We assign weights to the edges of the clique graph of $G$ by defining $w(C, C') = |C\cap C'|$. A subgraph $\tilde T$ of any given graph $\tilde G$ is called a spanning tree if $\tilde T$ is a tree whose set of vertexes is equal to the set of vertexes of $\tilde G$. If $T = (\mathcal C_G^*, \mathcal E_0)$ is a spanning tree of $\mathcal G$, we define its total weight as
$$w(T) = \sum_{\{C,C'\}\in\mathcal E_0}w(C, C').$$
Proposition 15.30 [99] If $G$ is a connected triangulated graph, the set of junction trees over $\mathcal C_G^*$ coincides with the set of maximizers of $w(T)$ over all spanning trees of $\mathcal G$.
(Notice that $\mathcal G$ being connected implies that spanning trees over $\mathcal G$ exist.)
(1) Let $V_k = \{s_k\}\cup V_{k-1}$, with $s_k\notin V_{k-1}$.
(2) Let $E_k = \{e_k\}\cup E_{k-1}$, where $e_k = \{s_k, s\}$ for some $s\in V_{k-1}$ satisfying
$$w(e_k) = \max\big\{w(\{t,t'\}):\ \{t,t'\}\in E,\ t\notin V_{k-1},\ t'\in V_{k-1}\big\}. \tag{15.36}$$
A sketch of these two steps, applied to the clique graph, is given below.
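The sketch below implements this maximal-weight version of Prim's algorithm on the clique graph, with $w(C, C') = |C\cap C'|$; the example cliques are hypothetical.

```python
def prim_max_spanning_tree(cliques):
    """Build a maximum-weight spanning tree of the clique graph, with edge
    weight w(C, C') = |C & C'|; by proposition 15.30 the result is a junction
    tree when the underlying graph is triangulated and connected."""
    cliques = [frozenset(C) for C in cliques]
    in_tree = [cliques[0]]                 # V_1: an arbitrary first node
    edges = []                             # E_k, grown one edge at a time
    while len(in_tree) < len(cliques):
        best = max(((C, D) for C in cliques if C not in in_tree
                    for D in in_tree),
                   key=lambda e: len(e[0] & e[1]))   # rule (15.36)
        edges.append(best)
        in_tree.append(best[0])
    return edges

# Maximal cliques of a small triangulated graph.
cliques = [{"a", "b", "c"}, {"b", "c", "d"}, {"c", "d", "e"}]
for C, D in prim_max_spanning_tree(cliques):
    print(sorted(C), "--", sorted(D), "| separator:", sorted(C & D))
```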
The ability of this algorithm to always build a maximal spanning tree is summarized in the following proposition [81, 130].
Proposition 15.31 If $G = (V, E, w)$ is a weighted, connected, undirected graph, Prim's algorithm, as described above, provides a sequence $T_k = (V_k, E_k)$, $k = 1,\dots,n$, of subtrees of $G$ such that $V_n = V$ and, for all $k$, $T_k$ is a maximal spanning tree for the restriction $G_{V_k}$ of $G$ to $V_k$.
We will prove a slightly stronger statement, namely that, for all $k$, $T_k$ can be extended to form a maximal spanning tree of $G$. This is stronger because, if $T_k = (V_k, E_k)$ can be extended to a maximal spanning tree $T = (V, E)$, and if $T_k' = (V_k, E_k')$ is a spanning tree for $G_{V_k}$ such that $w(T_k) < w(T_k')$, then the graph $T' = (V, E')$ with
$$E' = (E\setminus E_k)\cup E_k'$$
would be a spanning tree for $G$ with $w(T) < w(T')$, which is impossible. (To see that $T'$ is a tree, notice that paths in $T'$ are in one-to-one correspondence with paths in $T$, by replacing any subpath within $T_k'$ by the unique subpath in $T_k$ that has the same extremities.)
Clearly, $T_1$, which only has one vertex, can be extended to a maximal spanning tree. Let $k\geq 1$ be the last integer for which this property is true for all $j = 1,\dots,k$. If $k = n$, we are done. Otherwise, take a maximal spanning tree, $T$, that extends $T_k$. This tree cannot contain the new edge added when building $T_{k+1}$, namely $e_{k+1} = \{s_{k+1}, s\}$ as defined in Prim's algorithm, since it would otherwise also extend $T_{k+1}$.
Consider the path $\gamma$ in $T$ that links $s$ to $s_{k+1}$. This path must have an edge $e = \{t, t'\}$ such that $t\in V_k$ and $t'\notin V_k$, and, by definition of $e_{k+1}$, we must have $w(e)\leq w(e_{k+1})$. Notice that $e$ is uniquely defined, because a path leaving $V_k$ cannot return to this set, since one would otherwise be able to close it into a loop by inserting the only path in $T_k$ that connects its extremities.
Replace $e$ by $e_{k+1}$ in $T$. The resulting graph, say $T'$, is still a spanning tree for $G$. From any path in $T$, one can create a path in $T'$ with the same extremities by replacing any occurrence of the edge $e$ by the concatenation of the unique path in $T$ going from $t$ to $s$, followed by $(s, s_{k+1})$, followed by the unique path in $T$ going from $s_{k+1}$ to $t'$. This implies that $T'$ is connected. It is also acyclic, since any loop in $T'$ would have to contain $e_{k+1}$ (because $T$ is acyclic), but there is no path other than $(s, s_{k+1})$ in $T'$ that links $s$ and $s_{k+1}$, because such a path would have to be in $T$, and we have removed the only possible one from $T$ by deleting the edge $e$. Since $w(e)\leq w(e_{k+1})$, $T'$ is still maximal, and it extends $T_{k+1}$, contradicting the choice of $k$.
To prove the second statement, let $T$ be an optimal spanning tree. Let $k$ be the largest integer such that there exists a sequence $(T_1,\dots,T_k)$ generated by Prim's algorithm such that, for all $j = 1,\dots,k$, $T_j$ is a subtree of $T$. One necessarily has $k\geq 1$, since $T$ extends any one-vertex tree. If $k = n$, we are done. Assuming otherwise, let $T_k = (V_k, E_k)$ and make one more step of Prim's algorithm, selecting an edge $e_{k+1} = (s_{k+1}, s)$ satisfying (15.36). By assumption, $e_{k+1}$ is not in $T$. Take as before the unique path linking $s$ and $s_{k+1}$ in $T$ and let $e$ be the unique edge at which this path leaves $V_k$. Replacing $e$ by $e_{k+1}$ in $T$ provides a new spanning tree, $T'$. One must have $w(e)\geq w(e_{k+1})$ because $T$ is optimal, and $w(e_{k+1})\geq w(e)$ by (15.36). So $w(e) = w(e_{k+1})$, and one can use $e$ instead of $e_{k+1}$ for the $(k+1)$th step of Prim's algorithm. But this contradicts the fact that $k$ was the largest integer in a sequence of subtrees of $T$ generated by Prim's algorithm, and one therefore has $k = n$.
The proof of proposition 15.30, which we now provide, uses very similar "edge-switching" arguments.
Proof (Proof of proposition 15.30) Let us start with a maximum-weight spanning tree for $\mathcal G$, say $T$, and show that it is a junction tree. Since $T$ has maximum weight, we know that it can be obtained via Prim's algorithm, and that there exists a sequence $T_1,\dots,T_n = T$ of trees constructed by this algorithm. Let $T_k = (\mathcal C_k, E_k)$.
We proceed by contradiction. Let $k$ be the largest index such that $T_k$ can be extended to a junction tree for $\mathcal C_G^*$, and let $T'$ be a junction tree extension of $T_k$. Assume that $k < n$, and let $e_{k+1} = (C_{k+1}, C')$ be the edge that has been added when building $T_{k+1}$, with $\mathcal C_{k+1} = \{C_{k+1}\}\cup\mathcal C_k$. This edge is not in $T'$, so that there exists a unique edge $e = (B, B')$ in the path between $C_{k+1}$ and $C'$ in $T'$ such that $B\in\mathcal C_k$ and $B'\notin\mathcal C_k$. We must have $w(e) = |B\cap B'|\leq w(e_{k+1}) = |C_{k+1}\cap C'|$. But, since the running intersection property is true for $T'$, both $B$ and $B'$ must contain $C_{k+1}\cap C'$, so that $B\cap B' = C_{k+1}\cap C'$. This implies that, if one modifies $T'$ by replacing the edge $e$ by the edge $e_{k+1}$, yielding a new spanning tree $T''$, the running intersection property is still satisfied in $T''$. Indeed, if a vertex $s\in V$ belongs to both extremities of a path containing $B$ and $B'$ in $T'$, then it must belong to $B\cap B'$, and hence to $C_{k+1}\cap C'$, and therefore to any set in the path in $T'$ that linked $C_{k+1}$ and $C'$. So we have found a junction tree extension of $T_{k+1}$, which contradicts our assumption that $k$ was the largest such index. We must therefore have $k = n$, and $T$ is a junction tree.
Let us now consider the converse statement and assume that $T$ is a junction tree. Let $k$ be the largest integer such that there exists a sequence of subgraphs of $T$ provided by Prim's algorithm. Denote such a sequence by $(T_1,\dots,T_k)$, with $T_j = (\mathcal C_j, E_j)$. Assume (to get a contradiction) that $k < n$, and consider a new step of Prim's algorithm, adding a new edge $e_{k+1} = \{C_{k+1}, C'\}$ to $T_k$. Take as before the path in $T$ linking $C'$ to $C_{k+1}$, and select the edge $e$ at which this path leaves $\mathcal C_k$. If $e = (B, B')$, we must have $w(e) = |B\cap B'|\leq w(e_{k+1}) = |C_{k+1}\cap C'|$, and the running intersection property in $T$ implies that $C_{k+1}\cap C'\subset B\cap B'$, which implies that $w(e) = w(e_{k+1})$. This implies that adding $e$ instead of $e_{k+1}$ at step $k+1$ is a valid choice for Prim's algorithm, and contradicts the fact that $k$ was the largest number of such steps that could provide a subtree of $T$. So $k = n$ and $T$ is maximal.
Chapter 16
Bayesian Networks
16.1 Definitions
Bayesian networks are graphical models supported by directed acyclic graphs (DAGs), which provide them with an ordered organization (directed graphs were introduced in definition 14.35).
Bayesian networks over G are defined as follows. We use the same notation as
with Markov random fields to represent the set of configurations F (V ) that contains
collections x = (xs , s ∈ V ) with xs ∈ Fs .
Definition 16.1 A random variable $X$ with values in $F(V)$ is a Bayesian network over a DAG $G = (V, E)$ if and only if its distribution can be written in the form
$$P_X(x) = \prod_{s\in V_0}p_s(x^{(s)})\prod_{s\in V\setminus V_0}p_s(x^{(\mathrm{pa}(s))}, x^{(s)}). \tag{16.1}$$
Using the convention that conditional distributions given the empty set are just absolute distributions, we can rewrite (16.1) as
$$P_X(x) = \prod_{s\in V}p_s(x^{(\mathrm{pa}(s))}, x^{(s)}). \tag{16.2}$$
One can verify that $\sum_{x\in\Omega}P_X(x) = 1$. Indeed, when summing over $x$, we can start by summing over all $x^{(s)}$ with $\mathrm{ch}(s) = \emptyset$ (the leaves). Such $x^{(s)}$'s only appear in the corresponding $p_s$'s, which disappear since they sum to 1. What remains is the sum of the product over $V$ minus the leaves, and the argument can be iterated until the remaining sum is 1 (alternatively, work by induction on $|V|$). This fact is also a consequence of proposition 16.5 below, applied with $A = \emptyset$. The sketch below illustrates this numerically.
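As a small numerical illustration (a hypothetical three-node network $0\to 1\leftarrow 2$ with binary variables; all tables are randomly generated), the following sketch checks that (16.2) sums to one and draws a sample by following a topological order.

```python
import itertools
import numpy as np

# A hypothetical Bayesian network on the DAG 0 -> 1 <- 2 with binary nodes:
# P(x) = p0(x0) p2(x2) p1(x0, x2, x1), as in (16.2) with pa(1) = {0, 2}.
rng = np.random.default_rng(3)
p0 = rng.dirichlet(np.ones(2))                 # p_0(x_0)
p2 = rng.dirichlet(np.ones(2))                 # p_2(x_2)
p1 = rng.dirichlet(np.ones(2), size=(2, 2))    # p_1(x_0, x_2, x_1)

def P(x0, x1, x2):
    return p0[x0] * p2[x2] * p1[x0, x2, x1]

# Total mass is 1: summing first over the leaf x1 removes p1, then the two
# remaining (root) factors each sum to 1, as argued in the text.
total = sum(P(*x) for x in itertools.product(range(2), repeat=3))
assert np.isclose(total, 1.0)

# Ancestral sampling: draw the roots first, then each node given its parents.
def sample():
    x0 = rng.choice(2, p=p0)
    x2 = rng.choice(2, p=p2)
    x1 = rng.choice(2, p=p1[x0, x2])
    return x0, x1, x2

print(sample(), "total mass:", total)
```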
$$P(X^{(S)} = x^{(S)}\mid X^{(S^c)} = x^{(S^c)}) = \frac{1}{Z(x^{(S^c)})}\prod_{s\in V}p_s(x^{(\mathrm{pa}(s))}, x^{(s)}),$$
and note that the only variables $x^{(t)}$, $t\notin S$, that can be factored into the normalizing constant are those that are neither parents nor children of vertexes in $S$, and that do not share a child with a vertex in $S$ (i.e., they intervene in no $p_s(x^{(\mathrm{pa}(s))}, x^{(s)})$ that involves elements of $S$). This suggests the following definition.
$G^\sharp$ is sometimes called the moral graph of $G$ (because it forces parents to marry!). A path in $G^\sharp$ can be visualized as a path in $G^\flat$ (the undirected graph associated with $G$) which is allowed to jump between parents of the same vertex, even if they were not connected originally.
This proposition can be refined by noticing that the joint distribution of $X^{(S)}$, $X^{(T)}$ and $X^{(U)}$ can be deduced from a Bayesian network on a graph restricted to the ancestors of $S\cup T\cup U$. Definition 14.21 for restricted graphs extends without change to directed graphs, and we repeat it below for convenience.
Definition 16.4 Let $G = (V, E)$ be a graph (directed or undirected), and $A\subset V$. The restricted graph $G_A = (A, E_A)$ is such that the elements of $E_A$ are the edges $(s,t)$ (or $\{s,t\}$) in $E$ such that both $s$ and $t$ belong to $A$.
Moreover, for a directed acyclic graph $G$ and $s\in V$, we define the set of ancestors of $s$ by
$$A_s = \{t\in V:\ t\leq_G s\} \tag{16.3}$$
for the partial order on $V$ induced by $G$. If $S\subset V$, we denote $A_S = \bigcup_{s\in S}A_s$. Note that, by definition, $S\subset A_S$. The following proposition holds.
Proposition 16.5 Let $X$ be a Bayesian network on $G = (V, E)$ with distribution given by (16.2). Let $S\subset V$ and $A = A_S$. Then the distribution of $X^{(A)}$ is a Bayesian network over $G_A$ given by
$$P(X^{(A)} = x^{(A)}) = \prod_{s\in A}p_s(x^{(\mathrm{pa}(s))}, x^{(s)}). \tag{16.4}$$
There is no ambiguity in the notation $\mathrm{pa}(s)$, since the parents of $s\in A$ are the same in $G_A$ as in $G$.
Proof One needs to show that
$$\prod_{s\in A}p_s(x^{(\mathrm{pa}(s))}, x^{(s)}) = \sum_{x^{(A^c)}}\prod_{s\in V}p_s(x^{(\mathrm{pa}(s))}, x^{(s)}).$$
This can be done by induction on the cardinality of $V$. Assume that the result is true for graphs of size $n$, and let $|V| = n+1$ (the result is obvious for graphs of size 1). If $s\in A^c$ is a leaf in $G$ (such an $s$ exists when $A^c\neq\emptyset$, since $A^c$ contains the descendants of all its elements), one can remove the variable $x^{(s)}$ from the sum, since it only appears in $p_s$ and transition probabilities sum to one. One can then apply the induction assumption to the restriction of $G$ to $V\setminus\{s\}$.
$$P(X^{(s)} = x^{(s)}\mid X^{(A_s\setminus\{s\})} = x^{(A_s\setminus\{s\})}) \propto P(X^{(A_s)} = x^{(A_s)}) = p_s(x^{(\mathrm{pa}(s))}, x^{(s)})\,Z(x^{(A_s\setminus\{s\})}),$$
where
$$Z(x^{(A_s\setminus\{s\})}) = \prod_{t\in A_s\setminus\{s\}}p_t(x^{(\mathrm{pa}(t))}, x^{(t)}).$$
We will say that a path $(s_1,\dots,s_N)$ in $G^\flat$ passes at $s = s_k$ with a v-junction if $(s_{k-1}, s_k, s_{k+1})$ is a v-junction in $G$.
Lemma 16.9 Two vertexes $s$ and $t$ in $G$ are separated by a set $U$ in $(G_{A_{\{s,t\}\cup U}})^\sharp$ if and only if any path between $s$ and $t$ in $G^\flat$ must either
(1) pass in $U$ at a vertex which is not a v-junction, or
(2) pass in $V\setminus A_{\{s,t\}\cup U}$ at a v-junction.
Proof
Step 1. We first note that the v-junction clause is redundant in (2): it can be removed without affecting the condition. Indeed, if a path in $G^\flat$ passes in $V\setminus A_{\{s,t\}\cup U}$ at some vertex $u$, one can follow this path downward (i.e., following the orientation in $G$) until a v-junction is met. This has to happen before reaching the extremities of the path, since $u$ would otherwise be an ancestor of $s$ or $t$. We can therefore work with the weaker condition (which we will denote (2)') in the rest of the proof.
Step 2. Assume that $U$ separates $s$ and $t$ in $(G_{A_{\{s,t\}\cup U}})^\sharp$. Take a path $\gamma$ between $s$ and $t$ in $G^\flat$. We need to show that the path satisfies (1) or (2)'. So assume that (2)' is false (otherwise we are done), so that $\gamma$ is included in $A_{\{s,t\}\cup U}$. We can modify $\gamma$ by removing all the central nodes in v-junctions and still keep a valid path in $(G_{A_{\{s,t\}\cup U}})^\sharp$ (since parents are connected in the moral graph). The remaining path must intersect $U$ by assumption, and this cannot be at a v-junction of $\gamma$, since we have removed them. So (1) is true.
Step 3. Conversely, assume that (1) or (2) is true for any path in $G^\flat$. Consider a path $\gamma$ in $(G_{A_{\{s,t\}\cup U}})^\sharp$ between $s$ and $t$. Any edge in $\gamma$ that is not in $G^\flat$ must involve the parents of a common child in $A_{\{s,t\}\cup U}$. Insert this child between the parents every time this occurs, resulting in a v-junction added to $\gamma$. Since the added vertexes are still in $A_{\{s,t\}\cup U}$, the new path still has no intersection with $V\setminus A_{\{s,t\}\cup U}$ and must therefore satisfy (1). So there must be an intersection with $U$ without a v-junction, and since the new additions are all at v-junctions, the intersection must have been originally in $\gamma$, which therefore passes in $U$. This shows that $U$ separates $s$ and $t$ in $(G_{A_{\{s,t\}\cup U}})^\sharp$.
Then we have:
Theorem 16.11 Two vertexes $s$ and $t$ in $G$ are separated by a set $U$ in $(G_{A_{\{s,t\}\cup U}})^\sharp$ if and only if they are d-separated by $U$.
Proof It suffices to show that, if condition ((D1) or (D2)) holds for any path between $s$ and $t$ in $G^\flat$, then so does ((1) or (2)). So take a path between $s$ and $t$: if (D1) is true for this path, the conclusion is obvious, since (D1) and (1) are the same. So assume that (D1) (and therefore (1)) is false and that (D2) is true. Let $u$ be a vertex in $V\setminus A_U$ at which $\gamma$ passes with a v-junction.
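Theorem 16.11 translates directly into an algorithm: restrict the DAG to the ancestors of $\{s,t\}\cup U$, moralize, and test ordinary graph separation. A minimal sketch follows; the representation of the DAG as a dictionary of parent lists, and the function name, are hypothetical.

```python
from collections import deque

def d_separated(parents, s, t, U):
    """Test whether U d-separates s and t, using theorem 16.11: s and t are
    d-separated by U iff U separates them in the moral graph of the DAG
    restricted to the ancestors of {s, t} | U.
    `parents` maps each vertex to the list of its parents."""
    # Ancestors (including the vertexes themselves) of {s, t} | U.
    A, stack = set(), [s, t, *U]
    while stack:
        v = stack.pop()
        if v not in A:
            A.add(v)
            stack.extend(parents[v])
    # Moral graph of the restriction: undirected parent-child edges,
    # plus edges marrying the parents of a common child.
    adj = {v: set() for v in A}
    for v in A:
        ps = [p for p in parents[v] if p in A]
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    # Ordinary graph separation: search from s while avoiding U.
    seen, queue = {s}, deque([s])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w == t:
                return False
            if w not in seen and w not in U:
                seen.add(w); queue.append(w)
    return True

# Example: the v-junction a -> c <- b. The empty set separates a and b;
# conditioning on the collider c connects them.
parents = {"a": [], "b": [], "c": ["a", "b"]}
assert d_separated(parents, "a", "b", set())
assert not d_separated(parents, "a", "b", {"c"})
```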
Definition 16.12 A chain graph $G = (V, E, \tilde E)$ is composed of a finite set $V$ of vertexes, a set $E\subset\mathcal P_2(V)$ of unoriented edges and a set $\tilde E\subset V\times V\setminus\{(t,t), t\in V\}$ of oriented edges, with the property that $E\cap\tilde E^\flat = \emptyset$, i.e., two vertexes cannot be linked by both an oriented and an unoriented edge.
Proposition 16.13 Let $G = (V, E, \tilde E)$ be a semi-acyclic chain graph. Define the relation $s\,\mathcal R\,t$ if and only if there exists a path in the unoriented subgraph $(V, E)$ that links $s$ and $t$. Then $\mathcal R$ is an equivalence relation.
One can define a directed graph over the equivalence classes as follows. Let $G_{\mathcal R} = (V_{\mathcal R}, E_{\mathcal R})$ be such that $(S, S')\in E_{\mathcal R}$ if and only if there exist $s\in S$ and $t\in S'$ such that $(s,t)\in\tilde E$. The graph $G_{\mathcal R}$ is acyclic: any loop in $G_{\mathcal R}$ would induce a loop in $G$ containing at least one oriented edge.
Definition 16.14 Let $G = (V, E, \tilde E)$ be a semi-acyclic chain graph. One says that a random variable $X$ decomposes on $G$ if and only if $(X^{(S)}, S\in V_{\mathcal R})$ is a Bayesian network on $G_{\mathcal R}$, and the conditional distribution of $X^{(S)}$ given $X^{(S')}$, $S'\in\mathrm{pa}(S)$, is $G_S$-Markov, such that, for $s\in S$, $P(X^{(s)} = x^{(s)}\mid X^{(t)}, t\in S,\ X^{(S')}, S'\in\mathrm{pa}(S))$ only depends on the $x^{(t)}$ with $\{s,t\}\in E$ or $(t,s)\in\tilde E$.
is a tree distribution with the required form of the individual conditional distributions.
While the previous discussion provides a rather simple description of Bayesian net-
works in terms of chain graphs, it does not go all the way in reducing the number
of oriented edges in the definition of a Bayesian network. The issue is, in some way,
addressed by the notion of Markov equivalence, which is defined as follows.
Definition 16.16 Two directed acyclic graphs on the same set of vertexes G = (V , E) and
G̃ = (V , Ẽ) are Markov-equivalent if any family of random variables that decomposes as a
(positive) Bayesian network over one of them also decomposes as a Bayesian network over
the other.
This property can be expressed as a strikingly simple condition. One says that a v-junction $(s, t, u)$ in a DAG is unlinked if $s$ and $u$ are not neighbors.
Theorem 16.18 $G$ and $\tilde G$ are Markov equivalent if and only if $G^\flat = \tilde G^\flat$ and $G$ and $\tilde G$ have the same unlinked v-junctions.
Proof Step 1. We first show that a given pair of vertexes in a DAG is unlinked if and only if it can be d-separated by some set in the graph. Clearly, if they are linked, they cannot be d-separated (which is the "if" part), so what really needs to be proved is that unlinked vertexes can be d-separated. Let $s$ and $t$ be these vertexes and let $U = A_{\{s,t\}}\setminus\{s,t\}$. Then $U$ d-separates $s$ and $t$, since any path between $s$ and $t$ in $(G_{A_{\{s,t\}\cup U}})^\sharp = (G_{A_{\{s,t\}}})^\sharp$ must obviously pass in $U$.
Step 2. We now prove the only-if part of theorem 16.18 and therefore assume that $G$ and $\tilde G$ are Markov equivalent, or, as stated in theorem 16.17, that d-separation coincides in $G$ and $\tilde G$. We want to prove that $G^\flat = \tilde G^\flat$ and that the unlinked v-junctions are the same.
Step 2.1. The first statement is obvious from Step 1: d-separation determines the existence of a link, so if d-separation coincides in the two graphs, then the same holds for links, and $G^\flat = \tilde G^\flat$.
Step 2.2. So let us proceed to the second statement, and let $(s, t, u)$ be an unlinked v-junction in $G$. We want to show that it is also a v-junction in $\tilde G$ (obviously unlinked, since links coincide).
We will denote by $\tilde A_S$ the ancestors of a set $S\subset V$ in $\tilde G$ (while $A_S$ still denotes its ancestors in $G$). Let $U = A_{\{s,u\}}\setminus\{s,u\}$. Then, as we have shown in Step 1, $U$ d-separates $s$ and $u$ in $G$, so that, by assumption, it also d-separates them in $\tilde G$. We know that $t\notin U$, because it cannot be both a child and an ancestor of $\{s,u\}$ in $G$ (this would induce a loop). The path $(s, t, u)$ links $s$ and $u$ and does not pass in $U$, which is only possible (since $U$ d-separates $s$ and $u$ in $\tilde G$) if it passes in $V\setminus\tilde A_U$ at a v-junction: so $(s, t, u)$ is a v-junction in $\tilde G$, which is what we wanted to prove.
Step 3. We now consider the converse statement and assume that $G^\flat = \tilde G^\flat$ and that the unlinked v-junctions coincide. We want to show that d-separation is the same in $G$ and $\tilde G$. So, we assume that $U$ d-separates $s$ and $t$ in $G$, and we want to show that the same is true in $\tilde G$. Thus, what we need to prove is:
Claim 1. Consider a path $\gamma$ between $s$ and $t$ in $\tilde G^\flat = G^\flat$. Then $\gamma$ either (D1) passes in $U$ without a v-junction in $\tilde G$, or (D2) passes in $V\setminus\tilde A_U$ with a v-junction in $\tilde G$.
We will prove Claim 1 using a series of lemmas. We say that $\gamma$ has a three-point loop at $u$ if $(v, u, w)$ are three consecutive points in $\gamma$ such that $v$ and $w$ are linked. So $(v, u, w, v)$ forms a loop in the undirected graph.
Lemma 16.19 If $\gamma$ is a path between $s$ and $t$ that does not satisfy (D2) for $G$ and passes in $U$ without three-point loops, then $\gamma$ satisfies (D1) for $\tilde G$.
The proof is easy: since $\gamma$ does not satisfy (D2) in $G$, it satisfies (D1) and passes in $U$ without a v-junction in $G$. But this intersection cannot be a v-junction in $\tilde G$ since, $\gamma$ having no three-point loop there, such a v-junction would be unlinked and would therefore also be a v-junction in $G$, a contradiction.
Lemma 16.20 Let $\gamma$ be a path with a three-point loop at $u\in U$. If $\gamma\setminus u$ satisfies (D1) or (D2) for $\tilde G$, then so does $\gamma$.
To prove the lemma, let $v$ and $w$ be the predecessor and successor of $u$ in $\gamma$. First assume that $\gamma\setminus u$ satisfies (D1) in $\tilde G$. If this does not happen at $v$ or at $w$, then it also applies to $\gamma$ and we are done, so let us assume that $v\in U$ and that $(v', v, w)$ is not a v-junction in $\tilde G$, where $v'$ is the predecessor of $v$. If $(v', v, u)$ is not a v-junction in $\tilde G$, then (D1) is true for $\gamma$ in $\tilde G$. If it is a v-junction, then $(v, u, w)$ is not, and (D1) is true too.
Assume now that (D2) is true for $\gamma\setminus u$ in $\tilde G$. Again, there is no problem if (D2) occurs at some point other than $v$ or $w$, so let us consider the case in which it happens at $v$. This means that $v\notin\tilde A_U$ and $(v', v, w)$ is a v-junction. But, since $u\in U$, the link between $u$ and $v$ must be from $u$ to $v$ in $\tilde G$, so that there is no v-junction at $u$ and (D1) is true in $\tilde G$. This proves lemma 16.20.
Lemma 16.21 Let $\gamma$ be a path with a three-point loop at $u\in U$ for $G$. Assume that $\gamma$ does not satisfy (D2) in $G$. Then $\gamma\setminus u$ does not satisfy this property either.
Let us assume that $\gamma\setminus u$ satisfies (D2) and reach a contradiction. Letting $(v, u, w)$ be the three-point loop, (D2) can only happen in $\gamma\setminus u$ at $v$ or $w$, and let us assume that it happens at $v$, so that, $v'$ being the predecessor of $v$, $(v', v, w)$ is a v-junction in $G$ with $v\notin A_U$. Since $v\notin A_U$, the link between $u$ and $v$ in $G$ must be from $u$ to $v$, but this implies that $(v', v, u)$ is a v-junction in $G$ with $v\notin A_U$, which is a contradiction: this proves lemma 16.21.
The previous three lemmas directly imply the next one.
Lemma 16.22 If $\gamma$ is a path between $s$ and $t$ that does not satisfy (D2) for $G$, then $\gamma$ satisfies (D1) or (D2) for $\tilde G$.
Indeed, if we start with a $\gamma$ that does not satisfy (D2) for $G$, lemma 16.21 allows us to progressively remove three-point loops from $\gamma$ until none remains, with a final path that satisfies the assumptions of lemma 16.19 and therefore satisfies (D1) in $\tilde G$; lemma 16.20 then allows us to add back the points that we have removed, in reverse order, while always satisfying (D1) or (D2) in $\tilde G$.
We now partially relax the hypothesis that (D2) is not satisfied with the next lemma.
Lemma 16.23 If $\gamma$ is a path between $s$ and $t$ that does not pass in $V\setminus A_U$ at a linked v-junction for $G$, then $\gamma$ satisfies (D1) or (D2) for $\tilde G$.
Assume that $\gamma$ does not satisfy (D2) for $\tilde G$ (otherwise the result is proved). By lemma 16.22, $\gamma$ must satisfy (D2) for $G$. So, take an intersection of $\gamma$ with $V\setminus A_U$ that occurs at a v-junction in $G$, which we will denote $(v, u, w)$. This is still a v-junction in $\tilde G$, since we assume it to be unlinked. Since (D2) is false in $\tilde G$, we must have $u\in\tilde A_U$, and there is an oriented path, $\tau$, from $u$ to $U$ in $\tilde G$.
We can assume that $\tau$ has no v-junction in $G$. Indeed, if a v-junction exists in $\tau$, then this v-junction must be linked (otherwise it would also be a v-junction in $\tilde G$ and contradict the fact that $\tau$ is consistently oriented in $\tilde G$), and this link must be oriented from $u$ to $U$ in $\tilde G$ to avoid creating a loop in this graph. This implies that we can bypass the v-junction while keeping a consistently oriented path in $\tilde G$, and iterate this until $\tau$ has no v-junction in $G$. But this implies that $\tau$ is consistently oriented in $G$, necessarily from $U$ to $u$, since $u\notin A_U$.
Denote $\tau = (u_0 = u, u_1,\dots,u_n\in U)$. We now prove by induction that each $(v, u_k, w)$ is an unlinked v-junction. This is true when $k = 0$; let us assume that it is true for $k-1$. Then $(u_k, u_{k-1}, v)$ is a v-junction in $G$ but not in $\tilde G$: so it must be linked, and there exists an edge between $v$ and $u_k$. In $\tilde G$, this edge must be oriented from $v$ to $u_k$, since $(v, u_{k-1}, u_k, v)$ would form a loop otherwise. For the same reason, there must be an edge in $\tilde G$ from $w$ to $u_k$, so that $(v, u_k, w)$ is an unlinked v-junction.
Since this is true for $k = n$, we can replace $u$ by $u_n$ in $\gamma$ and still obtain a valid path. This can be done for all intersections of $\gamma$ with $V\setminus A_U$ that occur at v-junctions. This finally yields a path (denote it $\bar\gamma$) which does not satisfy (D2) in $G$ anymore, and which therefore satisfies (D1) or (D2) in $\tilde G$: so $\bar\gamma$ must either pass in $U$ without a v-junction or in $V\setminus\tilde A_U$ at a v-junction. None of the nodes that were modified can satisfy any of these conditions, since they are all in $U$ with a v-junction, so that the result is true for the original $\gamma$ also. This proves lemma 16.23.
So the only unsolved case is when $\gamma$ is allowed to pass in $V\setminus A_U$ at linked v-junctions. We define an algorithm that removes them as follows. Let $\gamma_0 = \gamma$ and let $\gamma_k$ be the path after step $k$ of the algorithm. One passes from $\gamma_k$ to $\gamma_{k+1}$ as follows.
• If $\gamma_k$ has no linked v-junctions in $V\setminus A_U$ for $G$, stop.
• Otherwise, pick such a v-junction and let $(v, u, w)$ be the three nodes involved in it.
(i) If $v\in U$, $v'\notin U$ and $(v', v, u)$ is a v-junction in $\tilde G$, remove $v$ from $\gamma_k$ to define $\gamma_{k+1}$.
(ii) Otherwise, if $w\in U$, $w'\notin U$ and $(u, w, w')$ is a v-junction in $\tilde G$, remove $w$ from $\gamma_k$ to define $\gamma_{k+1}$.
(iii) Otherwise, remove $u$ from $\gamma_k$ to define $\gamma_{k+1}$.
None of the considered cases can disconnect the path. This is clear for case (iii), since $v$ and $w$ are linked. For case (i), note that, in $G$, $(v', v, u)$ cannot be a v-junction, since $(v, u, w)$ is one. This implies that the v-junction in $\tilde G$ must be linked and that $v'$ and $u$ are connected.
The algorithm will stop at some point with some $\gamma_n$ that does not have any linked v-junction in $V\setminus A_U$ anymore, which implies that (D1) or (D2) is true in $\tilde G$ for $\gamma_n$. To prove that this statement holds for $\gamma$, it suffices to show that, if (D1) or (D2) is true in $\tilde G$ with $\gamma_{k+1}$, it must have been true with $\gamma_k$, at each step of the algorithm. So let us assume that $\gamma_{k+1}$ satisfies (D1) or (D2) in $\tilde G$.
First assume that we passed from $\gamma_k$ to $\gamma_{k+1}$ via case (iii). Assume that (D2) is true for $\gamma_{k+1}$, with, as usual, the only interesting case being when this occurs at $v$ or $w$. Assume it occurs at $v$, so that $(v', v, w)$ is a v-junction and $v\notin\tilde A_U$. If $(v', v, u)$ is a v-junction, then (D2) is true with $\gamma_k$. Otherwise, there is an edge from $v$ to $u$ in $\tilde G$, which also implies an edge from $w$ to $u$, since $(v, u, w, v)$ would be a loop otherwise. So $(v, u, w)$ is a v-junction in $\tilde G$, and $u$ cannot be in $\tilde A_U$, since its parent $v$ would be in that set also. So (D2) is true in $\tilde G$. Now, assume that (D1) is true at $v$, so that $(v', v, w)$ is not a v-junction and $v\in U$. If $(v', v, u)$ is not a v-junction either, we are done, so assume the contrary. If $v'\in U$, then we cannot have a v-junction at $v'$ and (D1) is true. But $v'\notin U$ is not possible, since this leads to case (i).
Now assume that we passed from $\gamma_k$ to $\gamma_{k+1}$ via case (i). Assume that (D1) is true for $\gamma_{k+1}$: this cannot be at $v'$, since $v'\notin U$, nor at $u$, since $u\notin A_U$, so it is also true for $\gamma_k$. The same statement holds with (D2), since $(v', v, u)$ is a v-junction in $\tilde G$ with $v\in U$, which implies that both $v'$ and $u$ are in $\tilde A_U$. Case (ii) is obviously handled similarly.
We now discuss the issue of using the sum-prod algorithm to compute marginal probabilities, $P(X^{(s)} = x^{(s)})$, $s\in V$, when $X$ is a Bayesian network on $G = (V, E)$. By definition, $P(X = x)$ can be written in the form
$$P(X = x) = \prod_{C\in\mathcal C}\varphi_C(x^{(C)}),$$
where $\mathcal C$ contains all subsets $C_s := \{s\}\cup\mathrm{pa}(s)$, $s\in V$. Marginal probabilities can therefore be computed easily when the factor graph associated with $\mathcal C$ is acyclic, according to proposition 15.13. However, because of the specific form of the $\varphi_C$'s (they are conditional probabilities), the sum-prod algorithm can be analyzed in more detail, and provides correct results even when the factor graph is not acyclic.
They take a particular form for Bayesian networks, using the fact that a vertex $s$ belongs to $C_s$, and to all $C_t$ for $t\in\mathrm{ch}(s)$:
$$\begin{aligned}
m_{sC_s}(x^{(s)}) &\leftarrow \prod_{t\in\mathrm{ch}(s)}m_{C_ts}(x^{(s)}),\\
m_{sC_t}(x^{(s)}) &\leftarrow m_{C_ss}(x^{(s)})\prod_{u\in\mathrm{ch}(s),\,u\neq t}m_{C_us}(x^{(s)}),\quad t\in\mathrm{ch}(s),\\
m_{C_ss}(x^{(s)}) &\leftarrow \sum_{y^{(C_s)}:\,y^{(s)}=x^{(s)}}p_s(y^{(\mathrm{pa}(s))}, x^{(s)})\prod_{t\in\mathrm{pa}(s)}m_{tC_s}(y^{(t)}),\\
m_{C_ts}(x^{(s)}) &\leftarrow \sum_{y^{(C_t)}:\,y^{(s)}=x^{(s)}}p_t(x^{(s)}\wedge y^{(\mathrm{pa}(t)\setminus\{s\})}, y^{(t)})\,m_{tC_t}(y^{(t)})\prod_{u\in\mathrm{pa}(t),\,u\neq s}m_{uC_t}(y^{(u)}),\quad t\in\mathrm{ch}(s).
\end{aligned}$$
These relations imply that, if pa(s) = ∅ (s is a root), then m_{C_s→s}(x^{(s)}) = p_s(x^{(s)}). Also, if ch(s) = ∅ (s is a leaf), then m_{s→C_s} = 1. The following proposition shows that many of the messages become constant over time.
Proposition 16.24 All upward messages, m_{s→C_s} and m_{C_t→s} with t ∈ ch(s), become constant (independent of x^{(s)}) in finite time.
Proof This can be shown recursively as follows. Assume that, for a given s, m_{t→C_t} is constant for all t ∈ ch(s) (this is true if s is a leaf). Then,
\[
\begin{aligned}
m_{C_t\to s}(x^{(s)}) &\leftarrow \sum_{y^{(C_t)},\, y^{(s)}=x^{(s)}} p_t\big(x^{(s)}\wedge y^{(pa(t)\setminus\{s\})}, y^{(t)}\big)\, m_{t\to C_t}(y^{(t)}) \prod_{u\in pa(t),\,u\neq s} m_{u\to C_t}(y^{(u)})\\
&= m_{t\to C_t} \sum_{y^{(C_t)},\, y^{(s)}=x^{(s)}} p_t\big(x^{(s)}\wedge y^{(pa(t)\setminus\{s\})}, y^{(t)}\big) \prod_{u\in pa(t),\,u\neq s} m_{u\to C_t}(y^{(u)})\\
&= m_{t\to C_t} \sum_{y^{(C_t\setminus\{t\})},\, y^{(s)}=x^{(s)}}\ \prod_{u\in pa(t),\,u\neq s} m_{u\to C_t}(y^{(u)})\\
&= m_{t\to C_t} \prod_{u\in pa(t),\,u\neq s}\ \sum_{y^{(u)}} m_{u\to C_t}(y^{(u)})
\end{aligned}
\]
is also constant (the third equality uses the fact that the conditional probabilities p_t sum to 1 over y^{(t)}). This proves that all m_{s→C_s} progressively become constant and, as we have just seen, this implies the same property for the m_{C_t→s}, t ∈ ch(s).
This proposition implies that, if initialized with constant messages (or after a finite time), the sum-prod algorithm iterates
\[
\begin{aligned}
m_{s\to C_s} &\leftarrow \prod_{t\in ch(s)} m_{C_t\to s},\\
m_{C_s\to s}(x^{(s)}) &\leftarrow \sum_{y^{(C_s)},\, y^{(s)}=x^{(s)}} p_s\big(y^{(pa(s))}, x^{(s)}\big) \prod_{t\in pa(s)} m_{t\to C_s}(y^{(t)}),\\
m_{s\to C_t}(x^{(s)}) &\leftarrow m_{C_s\to s}(x^{(s)}) \prod_{u\in ch(s),\,u\neq t} m_{C_u\to s}, \quad t\in ch(s),\\
m_{C_t\to s} &\leftarrow m_{t\to C_t} \prod_{u\in pa(t),\,u\neq s}\ \sum_{y^{(u)}} m_{u\to C_t}(y^{(u)}), \quad t\in ch(s).
\end{aligned}
\]
Proposition 16.25 If the previous algorithm is initialized with upward messages, m_{s→C_s} and m_{C_t→s}, all equal to 1, and if downward messages are computed top-down from the roots to the leaves, the resulting configuration of messages is invariant for the sum-prod algorithm.
Proof If all upward messages are equal to 1, then, clearly, the downward messages sum to 1 once they are updated from roots to leaves, and this implies that the upward messages will remain equal to 1 for the next round. The obtained configuration is invariant since the downward messages are recursively and uniquely defined by their values at the roots.
The downward messages, under the previous assumptions, satisfy m_{s→C_t}(x^{(s)}) = m_{C_s→s}(x^{(s)}) for all t ∈ ch(s), and therefore
\[
m_{C_s\to s}(x^{(s)}) = \sum_{y^{(C_s)},\, y^{(s)}=x^{(s)}} p_s\big(y^{(pa(s))}, x^{(s)}\big) \prod_{t\in pa(s)} m_{C_t\to t}(y^{(t)}). \tag{16.5}
\]
Note that the associated “marginals” inferred by the sum-prod algorithm are
\[
\sigma_s(x^{(s)}) = \prod_{C,\, s\in C} m_{C\to s}(x^{(s)}) = m_{C_s\to s}(x^{(s)}).
\]
Before this, let us analyze the complexity resulting from an iterative computation of
the marginal probabilities, similar to what we have done with trees.
This lemma allows us to work recursively as follows. Assume that we can compute marginal distributions over sets S with maximal depth no larger than d. Take a set S of maximal depth d + 1, and let S_0 be the set of elements of depth d + 1 in S. Then, letting T = depth^-(S) = depth^-(S_0) and S_1 = S \ S_0,
\[
\begin{aligned}
P(X^{(S)} = x^{(S)}) &= \sum_{y^{(T\setminus S_1)}} P\big(X^{(S_0)} = x^{(S_0)} \,\big|\, X^{(T)} = y^{(T\setminus S_1)}\wedge x^{(S_1)}\big)\, P\big(X^{(T\cup S_1)} = y^{(T\setminus S_1)}\wedge x^{(S_1)}\big)\\
&= \sum_{y^{(pa(S_0)\setminus S_1)}}\ \prod_{s\in S_0} p_s\big((y\wedge x)^{(pa(s))}, x^{(s)}\big)\, P\big(X^{(pa(S_0)\cup S_1)} = y^{(pa(S_0)\setminus S_1)}\wedge x^{(S_1)}\big). \qquad (16.6)
\end{aligned}
\]
Since pa(S_0) ∪ S_1 has maximal depth strictly smaller than the maximal depth of S, this indeed provides a recursive formula for the computation of marginals over subsets of V with increasing maximal depths. However, because one needs to add parents to the considered set when reducing the depth, one may end up having to compute marginals over very large sets, which becomes intractable without further assumptions.
A way to reduce the complexity is to assume that the graph G is singly connected,
as defined below.
Definition 16.28 A DAG G is singly connected if there exists at most one path in G that
connects any two vertexes.
Such a property is true for a tree, but also holds for some networks with multiple
parents. We have the following nice property in this case.
Because the graph is singly connected, two parents of s cannot have a common ancestor (since there would then be two paths from this ancestor to s). So A_{pa(s)} is the disjoint union of the A_t's for t ∈ pa(s), and we can write
\[
\begin{aligned}
P\big(X^{(pa(s))} = x^{(pa(s))}\big) &= \sum_{y^{(A_{pa(s)})},\, y^{(pa(s))}=x^{(pa(s))}}\ \prod_{t\in pa(s)} \prod_{u\in A_t} p_u\big(y^{(pa(u))}, y^{(u)}\big)\\
&= \prod_{t\in pa(s)}\ \sum_{y^{(A_t)},\, y^{(t)}=x^{(t)}}\ \prod_{u\in A_t} p_u\big(y^{(pa(u))}, y^{(u)}\big)\\
&= \prod_{t\in pa(s)} P\big(X^{(t)} = x^{(t)}\big).
\end{aligned}
\]
One of the main interests of graphical models is to provide an ability to infer the behavior of hidden variables of interest given other, observed, variables. When dealing with oriented graphs, the way this should be analyzed is, however, ambiguous.
Let us consider an example, provided by the graph in fig. 16.1. The Bayesian network interpretation of this graph is that both events (which may be true or false) “Bad weather” and “Broken HVAC” happen first, and that they are independent. Then, given their observation, the “No school” event may occur, more likely so if the weather is bad or the HVAC is broken, and even more likely if both happened at the same time.
Now consider the following passive observation: you wake up, you have not checked the weather or the news yet, and someone tells you that there is no school today. Then you may infer that bad weather, or a broken HVAC at school, is more likely than usual. Conditionally on this information, these two events become correlated, even if they were initially independent. So, even if the “No school” event is considered as a probabilistic consequence of its parents, observing it influences our knowledge of them.
Manipulation and passive observation are two very different ways of affecting unobserved variables in Bayesian networks. Both of them may be relevant in applications. Of the two, the simpler to analyze is intervention, since it merely consists in clamping one of the variables while leaving the rest of the network dynamics unchanged. This leads to the following formal definition of manipulation.
So, if the distribution of X is given by (16.2), then its distribution after manipulation on S is
\[
\tilde\pi\big(y^{(V\setminus S)}\big) = \prod_{t\in V\setminus S} p_t\big(y^{(pa(t))}, y^{(t)}\big)
\]
where pa(t) is the set of parents of t in G, and y^{(s)} = x^{(s)} whenever s ∈ pa(t) ∩ S.
Let us discuss this first in the simpler case of trees, for which the moral graph is the undirected acyclic graph underlying the tree, and d-separation is simple separation on this acyclic graph. We can then use proposition 14.22 to understand the new structure after conditioning: it is a G♭_{V\S}-Markov random field, and, for t ∈ V \ S, the conditional distribution of X^{(t)} = y^{(t)} given its neighbors is the same as before, using the value x^{(s)} when s ∈ S. Note, however, that when doing this (passing to G♭), we broke the causality relations between the variables. We can always go back to a tree (or a forest, since connectedness may have been broken) with the same edge orientations as initially, but this requires reconstituting the edge joint probabilities from the new acyclic graph, and therefore using (acyclic) belief propagation.
With general Bayesian networks, we know that the moral graph can be loopy and therefore a source of difficulties. The following proposition states that the damage is circumscribed to the ancestors of S.
with y^{(t)} = x^{(t)} if t ∈ A_S. Since s ∈ A_S implies pa(s) ⊂ A_S, all terms with s ∈ A_S are constant in the sum and can be factored out after normalization. So the conditional distribution is proportional to
\[
\prod_{s\in A_S^c} p_s\big(y^{(pa(s))}, y^{(s)}\big)
\]
with y^{(t)} = x^{(t)} if t ∈ A_S. But we know that such products sum to 1, so that the conditional distribution is equal to this expression, and therefore provides a Bayesian network on G_{A_S^c}.
The model is therefore fully specified by the functions Φ^{(s)} and the probability distributions of the variables ξ^{(s)}. We will assume that these have a density, denoted g^{(s)}, s ∈ V, with respect to some measure µ_s on B_s. They are typically chosen as uniform distributions on B_s (continuous and compact, or discrete) or as standard Gaussian when B_s = R^{d_s} for some d_s. One also generally assumes that the variables (ξ^{(s)}, s ∈ V) are jointly independent, and we make this assumption below.
Let V_k, k ≥ 0, be the set of vertexes in V with depth k (cf. definition 16.26) and V_{<k} = V_0 ∪ ··· ∪ V_{k−1}. Then (using the independence of (ξ^{(s)}, s ∈ V)), for s ∈ V_k, the conditional distribution of X^{(s)} given X^{(V_{<k})} = x^{(V_{<k})} is the distribution of Φ^{(s)}(x^{(s−)}, ξ^{(s)}). Formally, this is given by
\[
\Phi^{(s)}\big(x^{(s-)}, \cdot\big)_{\#}\, \big(g^{(s)}\mu_s\big),
\]
the pushforward of the distribution of ξ^{(s)} by Φ^{(s)}(x^{(s−)}, ·).
More concretely, assume that ξ^{(s)} follows a uniform distribution on B_s = [0,1]^h for some h, and assume that F_s is finite for all s. Then,
\[
P\big(X^{(s)} = x^{(s)} \,\big|\, X^{(V_{<k})} = x^{(V_{<k})}\big) = \mathrm{Volume}\big(U_s(x^{(pa(s))}, x^{(s)})\big) \stackrel{\Delta}{=} p_s\big(x^{(pa(s))}, x^{(s)}\big)
\]
where
\[
U_s\big(x^{(pa(s))}, x^{(s)}\big) = \Big\{ \xi\in[0,1]^h : \Phi^{(s)}\big(x^{(s-)}, \xi\big) = x^{(s)} \Big\}.
\]
Since the variables X^{(s)}, s ∈ V_k, are conditionally independent given X^{(V_{<k})}, we find that X decomposes as a Bayesian network over G,
\[
P(X = x) = \prod_{s\in V} p_s\big(x^{(pa(s))}, x^{(s)}\big).
\]
Similarly, if F_s = B_s = R^{d_s}, ξ^{(s)} ∼ N(0, Id_{R^{d_s}}), and ξ^{(s)} ↦ Φ_θ^{(s)}(x^{(pa(s))}, ξ^{(s)}) is invertible, with C¹ inverse x^{(s)} ↦ Ψ_θ^{(s)}(x^{(pa(s))}, x^{(s)}), then X is a Bayesian network, with continuous variables, and, using the change of variables formula, the conditional distribution of X^{(s)} given X^{(pa(s))} = x^{(pa(s))} has p.d.f.
\[
p_s\big(x^{(pa(s))}, x^{(s)}\big) = \frac{1}{(2\pi)^{d_s/2}} \exp\Big(-\frac{1}{2}\big|\Psi_\theta^{(s)}\big(x^{(pa(s))}, x^{(s)}\big)\big|^2\Big)\, \Big|\det\Big(\partial_{x^{(s)}}\Psi_\theta^{(s)}\big(x^{(pa(s))}, x^{(s)}\big)\Big)\Big|.
\]
A simple and commonly used special case of this example is that of linear SEMs. In this case, the inverse mapping is immediate and the Jacobian determinant in the change of variables is 1/σ_s^{d_s}.
Chapter 17
Latent Variables and Variational Methods
17.1 Introduction
We will describe, in the next chapters, methods that fit a parametric model to the observations while introducing unobserved, or “latent,” components in their models. Such latent components typically attach interpretable information or structure to the data. We have seen one such example in the form of the mixture of Gaussians in chapter 4, which we will revisit in chapter 20. We now present the variational Bayes paradigm, which provides a general strategy for addressing latent variable problems [144, 97, 14, 100].
The general framework is as follows. Variables in the model are divided into two groups: the observable part, which we denote X, and the latent part, denoted Z. In many models, Z represents some unobservable structure such that X, conditionally on Z, has a relatively simple distribution (in a Bayesian estimation context, Z often contains model parameters). The quantity of interest, however, is the conditional distribution of Z given X (also called the “posterior distribution”), which allows one to infer the latent structure from the observations, and which will also play an important role in maximum likelihood parameter estimation, as we will see below. This conditional distribution is not always easy to compute or simulate, and variational Bayes provides a framework under which it can be approximated.
The variables X and Z then have probability density functions with respect to µ_X and µ_Z, given by
\[
f_X(x) = \int_{R_Z} f_U(x,z)\,\mu_Z(dz) \quad\text{and}\quad f_Z(z) = \int_{R_X} f_U(x,z)\,\mu_X(dx).
\]
We will use the Kullback-Leibler divergence to quantify the accuracy of the approximation. As stated in proposition 4.1, we have
where M_1(R_Z) denotes the set of all probability distributions on R_Z. Note that all distributions ν for which KL(ν ‖ P_Z(· | X = x)) is finite must be absolutely continuous with respect to µ_Z, and therefore take the form ν = gµ_Z. One has
\[
\begin{aligned}
KL\big(g\mu_Z \,\big\|\, P_Z(\cdot\,|\,X=x)\big) &= \int_{R_Z} \log\frac{g(z)}{f_Z(z\,|\,x)}\, g(z)\,\mu_Z(dz)\\
&= \int_{R_Z} \log\frac{g(z)}{f_U(x,z)}\, g(z)\,\mu_Z(dz) + \log f_X(x). \qquad (17.1)
\end{aligned}
\]
We will denote by P(µ_Z), or just P when there is no ambiguity, the set of all p.d.f.'s g with respect to µ_Z, i.e., the set of all non-negative measurable functions on R_Z with ∫_{R_Z} g(z)µ_Z(dz) = 1.
Z
1 The reader unfamiliar with measure theory may want to read this discussion by replacing dµX
by dx, dµZ by dz and dµU by dx dz, i.e., in the context of continuous probability distributions having
p.d.f.’s with respect to the Lebesgue’s measure.
For the approximation to be practical, the set P̂ must obviously be chosen so that the computation of P̂_Z(· | X = x) is computationally feasible. We now review a few examples, before passing to the EM algorithm and its approximations.
17.3 Examples
the sum being infinite if there exists z such that ν(z) > 0 and f_U(x, z) = 0. Take
\[
\hat{\mathcal{P}} = \{ \mathbb{1}_z : z\in R_Z \},
\]
the family of all Dirac functions on R_Z. Then,
\[
KL\big(\mathbb{1}_z \,\big\|\, P_Z(\cdot\,|\,X=x)\big) - \log f_X(x) = -\log f_U(x,z).
\]
The variational approximation of P_Z(· | X = x) over P̂ is therefore the Dirac measure at the point(s) z ∈ R_Z at which f_U(x, z) is largest, i.e., the mode(s) of the posterior distribution. This approximation is often called the MAP approximation (for maximum a posteriori).
The mode approximation has some limitations. First, it is in general a very crude
approximation of the posterior distribution. Second, even with the assumption that
fU has closed form, this p.d.f. is often difficult to maximize (for example when defin-
ing models over large discrete sets). In such cases, the mode approximation has
limited practical use.
Let us still assume that R_Z = R^q and that µ_Z = dz. Let P̂ be the family of all Gaussian distributions N(m, Σ) on R^q. Then, denoting by ϕ(· ; m, Σ) the density of N(m, Σ),
\[
KL\big(\varphi(\cdot\,;m,\Sigma) \,\big\|\, P_Z(\cdot\,|\,X=x)\big) - \log f_X(x) = -\frac{q}{2}\log 2\pi - \frac{q}{2} - \frac{1}{2}\log\det(\Sigma) - \int_{R^q}\log f_U(x,z)\,\varphi(z;m,\Sigma)\,dz.
\]
This section generalizes the approach discussed in proposition 15.6 for Markov random fields. Assume that R_Z can be decomposed into several components R_Z^{[1]}, …, R_Z^{[K]}, writing z = (z^{[1]}, …, z^{[K]}) (for example, taking K = q and z^{[i]} = z^{(i)}, the ith coordinate of z, if R_Z = R^q). Also assume that µ_Z splits into a product measure µ_Z^{[1]} ⊗ ··· ⊗ µ_Z^{[K]}. The mean-field approximation consists in assuming that probabilities ν in P̂ split into independent components, i.e., that their densities g take the form:
The mean-field approximation may be feasible when log f_U(x, z) can be written as a sum of products of functions of each z^{[j]}. Indeed, assume that
\[
\log f_U(x,z) = \sum_{\alpha\in A}\ \prod_{j=1}^K \psi_{\alpha,j}\big(z^{[j]}, x\big) \tag{17.4}
\]
where A is a finite set. To shorten notation, let us denote by ⟨ψ⟩ the expectation of a function ψ with respect to the product p.d.f. g. Then, (17.3) can be written as
\[
KL\big(\nu \,\big\|\, P_Z(\cdot\,|\,X=x)\big) - \log f_X(x) = \sum_{j=1}^K \big\langle \log g^{[j]}(z^{[j]})\big\rangle - \sum_{\alpha\in A}\ \prod_{j=1}^K \big\langle \psi_{\alpha,j}(z^{[j]}, x)\big\rangle.
\]
The following lemma will allow us to identify the form taken by the optimal p.d.f.'s g^{[j]}.
Lemma 17.1 Let Q be a set equipped with a positive measure µ. Let ψ : Q → R be a measurable function such that
\[
C_\psi \stackrel{\Delta}{=} \int_Q \exp(\psi(q))\,\mu(dq) < \infty.
\]
Let
\[
g_\psi(q) = \frac{1}{C_\psi}\exp(\psi(q)).
\]
Let g be any p.d.f. with respect to µ, and define
\[
F(g) = \int_Q \big(\log g(q) - \psi(q)\big)\, g(q)\,\mu(dq).
\]
Applying this lemma separately to each function g^{[j]} implies that any optimal g must be such that
\[
g^{[j]}\big(z^{[j]}\big) \propto \exp\Big(\sum_{\alpha\in A} M_{\alpha,j}\,\psi_{\alpha,j}\big(z^{[j]}, x\big)\Big)
\]
with
\[
M_{\alpha,j} = \prod_{j'=1,\, j'\neq j}^{K} \big\langle \psi_{\alpha,j'}\big(z^{[j']}, x\big)\big\rangle.
\]
We therefore have
\[
\big\langle \psi_{\alpha,j}(z^{[j]}, x)\big\rangle = \frac{\displaystyle\int_{R_Z^{[j]}} \psi_{\alpha,j}(z^{[j]}, x)\, \exp\Big(\sum_{\alpha'\in A} M_{\alpha',j}\,\psi_{\alpha',j}(z^{[j]}, x)\Big)\,\mu_Z^{[j]}(dz^{[j]})}{\displaystyle\int_{R_Z^{[j]}} \exp\Big(\sum_{\alpha'\in A} M_{\alpha',j}\,\psi_{\alpha',j}(z^{[j]}, x)\Big)\,\mu_Z^{[j]}(dz^{[j]})}. \tag{17.5}
\]
This specifies a relationship expressing ⟨ψ_{α,j}(z^{[j]}, x)⟩ as a function of the other expectations ⟨ψ_{α',j'}(z^{[j']}, x)⟩ for j' ≠ j. Put together, these equations are called the mean-field consistency equations. When these equations can be written explicitly, i.e., when the integrals in (17.5) can be evaluated analytically (which is generally the case when the p.d.f.'s g^{[j]} can be associated with standard distributions), one obtains an algorithm that iterates (17.5) over all α and j until stabilization (each step reducing the objective function in (17.3)).
Let us retrieve the result obtained in proposition 15.6 using the current formalism. Assume that R_X is finite and R_Z = {0,1}^L, where L can be a large number, with
\[
f_U(x,z) = \frac{1}{C}\exp\Big(\sum_{j=1}^L \alpha_j(x)\, z^{(j)} + \sum_{i,j=1,\, i<j}^{L} \beta_{ij}(x)\, z^{(i)} z^{(j)}\Big).
\]
Take K = L, z^{[j]} = z^{(j)}. Applying the previous discussion, we see that g^{[j]} must take the form
\[
g^{[j]}\big(z^{(j)}\big) = \frac{\exp\big(\alpha_j(x)\, z^{(j)} + \sum_{i\neq j}\beta_{ij}(x)\langle z^{(i)}\rangle\, z^{(j)}\big)}{1 + \exp\big(\alpha_j(x) + \sum_{i\neq j}\beta_{ij}(x)\langle z^{(i)}\rangle\big)}.
\]
In particular,
\[
\big\langle z^{(j)}\big\rangle = \frac{\exp\big(\alpha_j(x) + \sum_{i\neq j}\beta_{ij}(x)\langle z^{(i)}\rangle\big)}{1 + \exp\big(\alpha_j(x) + \sum_{i\neq j}\beta_{ij}(x)\langle z^{(i)}\rangle\big)}.
\]
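To make the iteration concrete, here is a minimal numerical sketch of the resulting fixed-point scheme for this binary model; the function name, and the convention that beta is a symmetric matrix with zero diagonal, are ours:

```python
import numpy as np

def mean_field_binary(alpha, beta, n_iter=500, tol=1e-8):
    """Fixed-point iteration of rho_j <- sigmoid(alpha_j + sum_{i!=j} beta_ij rho_i).

    alpha: length-L array of alpha_j(x); beta: symmetric L x L array of beta_ij(x)
    with zero diagonal. Returns the expectations rho_j = <z^(j)>.
    """
    rho = np.full(len(alpha), 0.5)              # neutral initialization
    for _ in range(n_iter):
        rho_old = rho.copy()
        for j in range(len(alpha)):             # coordinate-wise sweep over j
            field = alpha[j] + beta[j] @ rho    # beta[j, j] = 0, so only i != j count
            rho[j] = 1.0 / (1.0 + np.exp(-field))
        if np.max(np.abs(rho - rho_old)) < tol:
            break                               # consistency equations satisfied
    return rho
```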
In this special case, it is also possible to express the objective function as a simple function of the expectations ⟨z^{(j)}⟩. We indeed have, letting ρ_j = ⟨z^{(j)}⟩,
\[
\sum_{z\in R_Z} \log f_U(x,z)\ \prod_{j=1}^L g^{[j]}\big(z^{(j)}\big) = -\log C + \sum_{j=1}^L \alpha_j(x)\,\rho_j + \sum_{i,j=1,\, i<j}^{L} \beta_{ij}(x)\,\rho_i\rho_j.
\]
The consistency equations express the fact that the derivatives of this expression with respect to each ρ_j vanish.
We now consider maximum likelihood estimation with latent variables, using the notation of section 17.2. The main tool is the following straightforward consequence of (17.1).
and the maximum is achieved for g(z) = f_Z(z | x), the conditional p.d.f. of Z given X = x.
and the r.h.s. is indeed maximal when the Kullback-Leibler divergence vanishes, that is, when g is the p.d.f. of P_Z(· | X = x).
The EM algorithm is useful when the computation of the m.l.e. for complete observations, i.e., the maximization of
\[
\log f_U(x,z\,;\theta)
\]
when both x and z are given, is easy, whereas the same problem for the marginal distribution is hard.
The maximization can therefore be done by iterating the following two steps.
1. Given θ_n, compute
\[
\operatorname*{argmax}_{g_x,\, x\in T}\ \sum_{x\in T}\int_{R_Z} \log\Big(\frac{f_U(x,z;\theta_n)}{g_x(z)}\Big)\, g_x(z)\,\mu_Z(dz).
\]
2. Given g_x, x ∈ T, compute
\[
\operatorname*{argmax}_{\theta}\ \sum_{x\in T}\int_{R_Z} \log\Big(\frac{f_U(x,z;\theta)}{g_x(z)}\Big)\, g_x(z)\,\mu_Z(dz)
= \operatorname*{argmax}_{\theta}\ \sum_{x\in T}\int_{R_Z} \log\big(f_U(x,z;\theta)\big)\, g_x(z)\,\mu_Z(dz).
\]
Step 1 is explicit, and its solution is g_x(z) = f_Z(z | x; θ_n). Using this, both steps can be grouped together, yielding the EM algorithm.
We now make (17.7) explicit for mixtures of Gaussians. For given θ and θ′ and x ∈ R^d, let
\[
\begin{aligned}
U_x(\theta,\theta') &= \frac{d}{2}\log 2\pi + \int_{R_Z} \log\big(f_U(x,z;\theta')\big)\, f_Z(z\,|\,x;\theta)\,d\mu_Z(z)\\
&= \sum_{z=1}^p \Big( \log\alpha'_z - \frac12\log\det\Sigma'_z - \frac12 (x-c'_z)^T {\Sigma'_z}^{-1}(x-c'_z) \Big)\, f_Z(z\,|\,x;\theta).
\end{aligned}
\]
If θ_n is the current parameter in the EM, the next one, θ_{n+1}, must maximize ∑_{x∈T} U_x(θ_n, θ′). This can be solved in closed form. To compute α′_1, …, α′_p, one must maximize
\[
\sum_{x\in T}\sum_{z=1}^p (\log\alpha'_z)\, f_Z(z\,|\,x;\theta)
\]
subject to the constraint that ∑_z α′_z = 1. This yields
\[
\alpha'_z = \sum_{x\in T} f_Z(z\,|\,x;\theta)\Big/\sum_{x\in T}\sum_{j=1}^p f_Z(j\,|\,x;\theta) = \zeta_z / N
\]
with ζ_z = ∑_{x∈T} f_Z(z | x; θ).
The centers c′_1, …, c′_p must minimize ∑_{x∈T}(x − c′_z)^T {Σ′_z}^{-1} (x − c′_z) f_Z(z|x; θ), which yields
\[
c'_z = \frac{1}{\zeta_z}\sum_{x\in T} x\, f_Z(z\,|\,x;\theta).
\]
Finally, Σ′_z must minimize
\[
\frac{\zeta_z}{2}\log\det\Sigma'_z + \frac12\sum_{x\in T}(x-c'_z)^T{\Sigma'_z}^{-1}(x-c'_z)\, f_Z(z\,|\,x;\theta),
\]
which yields
\[
\Sigma'_z = \frac{1}{\zeta_z}\sum_{x\in T}(x-c'_z)(x-c'_z)^T f_Z(z\,|\,x;\theta).
\]
We can now summarize the algorithm.
4. For i = 1, …, p, let α′_i = ζ_i/N.
5. For i = 1, …, p, let
\[
c'_i = \frac{1}{\zeta_i}\sum_{x\in T} x\, f_Z(i\,|\,x;\theta).
\]
6. For i = 1, …, p, let
\[
\Sigma'_i = \frac{1}{\zeta_i}\sum_{x\in T}(x-c'_i)(x-c'_i)^T f_Z(i\,|\,x;\theta).
\]
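As an illustration, the following numpy sketch performs one full EM iteration: the E-step (the computation of the weights f_Z(i | x; θ), corresponding to the steps of the algorithm not reproduced above) followed by the updates of steps 4 to 6; the function name and signature are ours:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, alpha, c, Sigma):
    """One EM iteration for a mixture of p Gaussians; X has shape (N, d)."""
    p = len(alpha)
    # E-step: posterior weights W[k, i] = f_Z(i | x_k; theta)
    W = np.stack([alpha[i] * multivariate_normal.pdf(X, c[i], Sigma[i])
                  for i in range(p)], axis=1)
    W /= W.sum(axis=1, keepdims=True)
    zeta = W.sum(axis=0)                        # zeta_i
    # M-step (steps 4, 5 and 6 above)
    alpha = zeta / len(X)
    c = (W.T @ X) / zeta[:, None]
    Sigma = np.stack([((W[:, i, None] * (X - c[i])).T @ (X - c[i])) / zeta[i]
                      for i in range(p)])
    return alpha, c, Sigma
```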
Remark 17.3 Algorithm 17.2 can be simplified by making restrictions on the model. Here are some examples.
(i) One may restrict to Σ_i = σ_i² Id_{R^d} to reduce the number of free parameters. Then, step 7 of the algorithm needs to be replaced by:
\[
(\sigma'_i)^2 = \frac{1}{d\,\zeta_i}\sum_{x\in T} |x-c'_i|^2\, f_Z(i\,|\,x;\theta).
\]
(ii) Alternatively, the model may be simplified by assuming that all covariance matrices coincide: Σ_i = Σ for i = 1, …, p. Then, step 7 becomes
\[
\Sigma' = \frac{1}{N}\sum_{i=1}^p\sum_{x\in T}(x-c'_i)(x-c'_i)^T f_Z(i\,|\,x;\theta).
\]
(iii) Finally, one may assume that Σ is known and fixed in the algorithm (usually in the form Σ = σ² Id_{R^d} for some σ > 0), so that step 7 of the algorithm can be removed.
(iv) One may also assume that the (prior) class probabilities are known, typically set to α_i = 1/p for all i, so that step 4 can be skipped.
The stochastic approximation EM (or SAEM) algorithm was proposed by Delyon et al. [58] (see this reference for convergence results) to address situations in which the expectations under the posterior distribution cannot be computed in closed form, but can be estimated using Monte-Carlo simulations. SAEM uses a special stochastic approximation scheme, iterating
\[
\begin{aligned}
&\xi^{(x)}_{n+1} \sim P_Z(\cdot\,|\,X=x\,;\theta_n), \quad x\in T,\\
&\lambda_{n+1}(\theta') = \Big(1-\frac{1}{n+1}\Big)\lambda_n(\theta') + \frac{1}{n+1}\sum_{x\in T}\log f_U\big(x,\xi^{(x)}_{n+1};\theta'\big), \quad \theta'\in\Theta, \qquad (17.9)\\
&\theta_{n+1} = \operatorname*{argmax}_{\theta'}\ \lambda_{n+1}(\theta').
\end{aligned}
\]
With these weights, λ_n(θ′) coincides with the running average
\[
\lambda_n(\theta') = \sum_{x\in T}\frac{1}{n}\sum_{j=1}^n \log f_U\big(x,\xi^{(x)}_j;\theta'\big).
\]
Given that ξ^{(x)}_{n+1} ∼ P_Z(· | X = x; θ_n), one expects this expression to approximate
\[
\sum_{x\in T}\int_{R_Z}\log\big(f_U(x,z;\theta')\big)\,f_Z(z\,|\,x;\theta_n)\,d\mu_Z(z),
\]
so that the third step of (17.9) can be seen as an approximation of (17.7). Sufficient conditions under which this actually happens (and θ_n converges to a local maximizer of the likelihood) are provided in Delyon et al. [58] (see also Kuhn and Lavielle [113] for a convergence result under more general hypotheses on how ξ is simulated).
To be able to run this algorithm efficiently, one needs the simulation of the posterior distribution to be feasible. Importantly, one also needs to be able to update the function λ_n efficiently. This can be achieved when the considered model belongs to an exponential family, which corresponds to assuming that the p.d.f. of U takes the form
\[
f_U(x,z;\theta) = \frac{1}{C(\theta)}\exp\big(\psi(\theta)^T H(x,z)\big)
\]
for some functions ψ and H. For example, the MoG model of equation (4.4) takes
\[
\psi(\theta)^T = \Big( \log\alpha_1 - \tfrac12 m_1^T\Sigma_1^{-1}m_1 - \tfrac12\log\det\Sigma_1,\ \ldots,\ \log\alpha_p - \tfrac12 m_p^T\Sigma_p^{-1}m_p - \tfrac12\log\det\Sigma_p,\ \Sigma_1^{-1}m_1,\ \ldots,\ \Sigma_p^{-1}m_p,\ \Sigma_1^{-1},\ \ldots,\ \Sigma_p^{-1} \Big),
\]
\[
H(x,z)^T = \Big( \mathbb{1}_{z=1},\ \ldots,\ \mathbb{1}_{z=p},\ x\mathbb{1}_{z=1},\ \ldots,\ x\mathbb{1}_{z=p},\ -\tfrac12 xx^T\mathbb{1}_{z=1},\ \ldots,\ -\tfrac12 xx^T\mathbb{1}_{z=p} \Big).
\]
For such a model, we can replace the algorithm in (17.9) by the more manageable one:
\[
\begin{aligned}
&\xi^{(x)}_{n+1} \sim P_Z(\cdot\,|\,X=x\,;\theta_n), \quad x\in T,\\
&\eta^{(x)}_{n+1} = \Big(1-\frac{1}{n+1}\Big)\eta^{(x)}_n + \frac{1}{n+1}\, H\big(x,\xi^{(x)}_{n+1}\big),\\
&\lambda_{n+1}(\theta') = \psi(\theta')^T \sum_{x\in T}\eta^{(x)}_{n+1} - N\log C(\theta'), \qquad (17.10)\\
&\theta_{n+1} = \operatorname*{argmax}_{\theta'}\ \lambda_{n+1}(\theta').
\end{aligned}
\]
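A generic skeleton of these updates might look as follows; sample_z, H and maximize are placeholders for the model-specific posterior sampler, sufficient statistic and M-step, so this is a sketch of the structure of (17.10) rather than a complete implementation:

```python
import numpy as np

def saem(theta0, data, sample_z, H, maximize, n_steps=200):
    """Sketch of the exponential-family SAEM iteration (17.10)."""
    theta = theta0
    # running averages eta^(x)_n of the sufficient statistics, one per sample
    eta = [H(x, sample_z(x, theta)) for x in data]
    for n in range(1, n_steps):
        gamma = 1.0 / (n + 1)
        for k, x in enumerate(data):
            xi = sample_z(x, theta)                  # xi ~ P_Z(. | X = x; theta_n)
            eta[k] = (1 - gamma) * eta[k] + gamma * H(x, xi)
        # M-step: argmax of psi(theta)^T sum_x eta^(x) - N log C(theta)
        theta = maximize(np.sum(eta, axis=0))
    return theta
```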
Returning to proposition 17.2 and equation (17.6), we see that one can make a variational approximation of the maximum likelihood by computing
\[
\max_{\theta\in\Theta,\ g_x\in\hat{\mathcal{P}},\ x\in T}\ \sum_{x\in T}\int_{R_Z}\log\Big(\frac{f_U(x,z;\theta)}{g_x(z)}\Big)\, g_x(z)\,\mu_Z(dz), \tag{17.11}
\]
Starting with an initial guess of the parameter, θ_0, iterate the following equation until numerical stabilization:
\[
\theta_{n+1} = \operatorname*{argmax}_{\theta'}\ \sum_{x\in T}\int_{R_Z}\log\big(f_U(x,z;\theta')\big)\,\hat g(z;\, x,\theta_n)\,\mu_Z(dz). \tag{17.12}
\]
and
\[
\begin{aligned}
\partial_{\eta_x}\int_{R_Z}\log\Big(\frac{f_U(x,z;\theta)}{g(z;\eta_x)}\Big)\, g(z;\eta_x)\,\mu_Z(dz)
&= \int_{R_Z}\Big(-\partial_\eta\big(\log g(z;\eta_x)\big)\, g(z;\eta_x) + \log\Big(\frac{f_U(x,z;\theta)}{g(z;\eta_x)}\Big)\,\partial_\eta g(z;\eta_x)\Big)\,\mu_Z(dz)\\
&= \int_{R_Z}\log\Big(\frac{f_U(x,z;\theta)}{g(z;\eta_x)}\Big)\,\partial_\eta\log g(z;\eta_x)\, g(z;\eta_x)\,\mu_Z(dz).
\end{aligned}
\]
Here, we have used the fact that, for all η,
\[
\int_{R_Z}\partial_\eta\log g(z;\eta)\, g(z;\eta)\,\mu_Z(dz) = \int_{R_Z}\partial_\eta g(z;\eta)\,\mu_Z(dz) = 0,
\]
since ∫_{R_Z} g(z;η)µ_Z(dz) = 1.
and
\[
\Phi_2(\theta,\eta,z) = \sum_{x\in T}\log\Big(\frac{f_U(x,z_x;\theta)}{g(z_x;\eta_x)}\Big)\,\partial_\eta\log g(z_x;\eta_x).
\]
Then, following section 3.3, one can maximize (17.13) using the algorithm
\[
\begin{cases}
\theta_{n+1} = \theta_n + \gamma_{n+1}\,\Phi_1(\theta_n, Z_{n+1})\\
\eta_{n+1} = \eta_n + \gamma_{n+1}\,\Phi_2(\theta_n, \eta_n, Z_{n+1})
\end{cases} \tag{17.14}
\]
where Z_{n+1} ∼ π_{η_n}.
Alternatively (for example when T is large), one can also sample from x ∈ T at each update. This would require defining π_η as the distribution on T × R_Z with p.d.f. ϕ_η(x,z) = g(z;η_x)/N, where N = |T|. One can now use
and
\[
\Phi_2(\theta,\eta,z) = \log\Big(\frac{f_U(x,z;\theta)}{g(z;\eta)}\Big)\,\partial_\eta\log g(z;\eta),
\]
one can use
\[
\begin{aligned}
\theta_{n+1} &= \theta_n + \gamma_{n+1}\,\partial_\theta\log f_U(X_{n+1}, Z_{n+1};\theta_n)\\
\eta_{n+1, X_{n+1}} &= \eta_{n, X_{n+1}} + \gamma_{n+1}\,\log\Big(\frac{f_U(X_{n+1},Z_{n+1};\theta_n)}{g(Z_{n+1};\eta_{n,X_{n+1}})}\Big)\,\partial_\eta\log g(Z_{n+1};\eta_{n,X_{n+1}}) \qquad (17.15)
\end{aligned}
\]
with (X_{n+1}, Z_{n+1}) ∼ π_{η_n}. Sampling from a single training sample at each step can be replaced by sampling from a minibatch, with obvious modifications.
17.5 Remarks
Based on the formulation of the EM as the solution of (17.6), it should be clear that solving (17.7) at each step can be replaced by any update of the parameter that increases (17.6). For example, (17.7) can be replaced by a partial run of a gradient ascent algorithm, stopped before convergence. One can also use a coordinate ascent strategy. Assume that θ can be split into several components, say two, so that θ = (θ^{(1)}, θ^{(2)}). Then, (17.7) may be split into
\[
\begin{aligned}
\theta^{(1)}_{n+1} &= \operatorname*{argmax}_{\theta^{(1)}}\ \sum_{x\in T}\int_{R_Z}\log\big(f_U(x,z;\theta^{(1)},\theta^{(2)}_n)\big)\, f_Z(z\,|\,x;\theta_n)\,\mu_Z(dz)\\
\theta^{(2)}_{n+1} &= \operatorname*{argmax}_{\theta^{(2)}}\ \sum_{x\in T}\int_{R_Z}\log\big(f_U(x,z;\theta^{(1)}_{n+1},\theta^{(2)})\big)\, f_Z(z\,|\,x;\theta_n)\,\mu_Z(dz).
\end{aligned}
\]
Doing so is, in particular, useful when both these steps are explicit, but (17.7) is not.
with respect to the parameter θ. Indeed, differentiating the integral and writing ∂_θ f_U = f_U ∂_θ log f_U, we have
\[
\begin{aligned}
\partial_\theta \log f_X(x;\theta) &= \int_{R_Z} \partial_\theta\log f_U(x,z;\theta)\,\frac{f_U(x,z;\theta)}{f_X(x;\theta)}\,\mu_Z(dz)\\
&= \int_{R_Z}\partial_\theta\log f_U(x,z;\theta)\, f_Z(z\,|\,x,\theta)\,\mu_Z(dz).
\end{aligned}
\]
In other terms, the derivative of the log-likelihood of the observed data is the conditional expectation of the derivative of the log-likelihood of the full data given the observed data. When computable, this expression can be used with standard gradient-based optimization methods, such as those described in chapter 3. This expression is also amenable to a stochastic gradient ascent algorithm, namely
\[
\theta_{n+1} = \theta_n + \gamma_{n+1}\sum_{x\in T}\partial_\theta\log f_U(x, Z_{n+1,x};\theta_n). \tag{17.16}
\]
We have worked, in this chapter, under the assumption that PU was absolutely con-
tinuous with respect to a product measure µU = µX ⊗ µZ . This is not a mild as-
sumption, as it fails to include some important cases, for example when X and Z
have some deterministic relationship, the simplest instance being when X = F(Z)
for some function F. In many cases, however, one can make simple transformations
on the model that will make it satisfy this working assumption. For example, if
X = F(Z), one can generally split Z into Z = (Z (1) , Z (2) ) so that the equation X = F(Z)
is equivalent to Z (2) = G(X, Z (1) ) for some function G. One can then apply the dis-
cussion above to U = (X, Z (1) ) instead of U = (X, Z).
Using more advanced measure theory, however, one can see that this product decomposition assumption was in fact unnecessary. Indeed, one can assume that the measure µ_U “disintegrates” in the following sense: there exists a measure µ_X on R_X and, for all x ∈ R_X, a measure µ_Z(· | x) on R_Z such that, for all functions ψ defined on R_U,
\[
\int_{R_U}\psi(x,z)\,\mu_U(dx,dz) = \int_{R_X}\int_{R_Z}\psi(x,z)\,\mu_Z(dz\,|\,x)\,\mu_X(dx).
\]
This is now a mild assumption, which is true [33] as soon as one assumes that µ_U(R_U) is finite (which is not a real loss of generality, as one can reduce to this case by replacing, if needed, µ_U by an equivalent probability distribution).
With this assumption, the marginal distribution of X has a p.d.f. with respect to µ_X given by
\[
f_X(x) = \int_{R_Z} f_U(x,z)\,\mu_Z(dz\,|\,x)
\]
and the conditional p.d.f. of Z given X = x is
\[
f_Z(z\,|\,x) = \frac{f_U(x,z)}{f_X(x)}.
\]
The computations and approximations made earlier in this chapter can then be applied with essentially no modification.
Chapter 18
Learning Graphical Models
\[
f_A = \frac{\#\{A \text{ occurs}\}}{N}.
\]
This estimator is unbiased (E(f_A) = P(A)) and its variance is P(A)(1 − P(A))/N. This implies that the relative error δ_A = f_A/P(A) − 1 has zero mean and variance
\[
\sigma^2 = \frac{1 - P(A)}{N\,P(A)}.
\]
This number can clearly become very large when P(A) ≃ 0. In particular, when P(A) is small compared to 1/N, the relative frequency will often be f_A = 0, leading to the false conclusion that A is not just rare, but impossible. If there are reasons to expect beforehand that A is indeed possible, it is important to inject this prior belief into the procedure, which suggests using Bayesian estimation methods.
The main assumption for these methods is to consider the unknown probability, p = P(A), as a random variable, yielding a generative process in which a random probability is first obtained, and then N instances of A or not-A are generated using this probability.
Assume that the “prior distribution” of p (which quantifies a prior belief) has a p.d.f. q (with respect to Lebesgue's measure) on the unit interval. Given N independent observations of occurrences of A, each following a Bernoulli distribution b(p), the joint likelihood of all involved variables is given by
\[
\binom{N}{k}\, p^k (1-p)^{N-k}\, q(p),
\]
where k is the number of times the event A has been observed.
From the definition of a beta distribution, it is also clear that, if we choose the prior to be β(a+1, ν−a+1), then the posterior is β(k+a+1, N+ν−(k+a)+1). The posterior therefore belongs to the same family of distributions as the prior, and one says that the beta distribution is a conjugate prior for the binomial distribution. The mode of the posterior distribution (which is the maximum a posteriori (MAP) estimator) is given by
\[
\hat p = \frac{k+a}{N+\nu}.
\]
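As a small numerical illustration (the numbers are ours), suppose that a rare event is never observed in N = 1000 trials; a prior centered near 1% keeps the MAP estimate away from zero:

```python
# MAP estimate under a Beta(a+1, nu-a+1) prior: p_hat = (k + a) / (N + nu)
N, k = 1000, 0      # the event A was never observed
a, nu = 1, 100      # prior belief: P(A) around a / nu = 1%
p_hat = (k + a) / (N + nu)
print(p_hat)        # 1/1100, about 9.1e-4, instead of the empirical frequency 0
```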
Now assume that F is a finite space and that we want to estimate a probability distribution p = (p(x), x ∈ F) using a Bayesian approach as above. We cannot use the previous approach to estimate each p(x) separately, since these probabilities are linked by the fact that they sum to 1. We can, however, come up with a good (conjugate) prior, identified, as done above, by computing the posterior associated with a uniform prior distribution.
\[
P\big(N_x,\, x\in F \,\big|\, p(\cdot)\big) = \frac{N!}{\prod_{x\in F} N_x!}\ \prod_{x\in F} p(x)^{N_x}.
\]
The posterior distribution of p(·) given the observations with a uniform prior is proportional to ∏_{x∈F} p(x)^{N_x}. It belongs to the family of Dirichlet distributions, described in the following definition.
The Dirichlet distribution with parameters a = (a(x), x ∈ F) (abbreviated Dir(a)) has density
\[
\rho(p(\cdot)) = \frac{\Gamma(\nu)}{\prod_{x\in F}\Gamma(a(x))}\ \prod_{x\in F} p(x)^{a(x)-1}, \quad\text{if } p \in S_F,
\]
and 0 otherwise, with ν = ∑_{x∈F} a(x).
Note that, if F has cardinality 2, the Dirichlet distribution coincides with the beta distribution. Similarly to the beta for the binomial, and almost by construction, the Dirichlet distribution is a conjugate prior for the multinomial. More precisely, if the prior distribution for p(·) is Dir(1 + a(x), x ∈ F), then the posterior after N observations of X is Dir(1 + N_x + a(x), x ∈ F), and the MAP estimator is given by
\[
\hat p(x) = \frac{N_x + a(x)}{N + \nu}
\]
with ν = ∑_{x∈F} a(x).
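In code, the MAP estimator is a one-liner (counts and a are arrays indexed by the elements of F; the function name is ours):

```python
import numpy as np

def dirichlet_map(counts, a):
    # MAP of p(.) under a Dir(1 + a) prior: (N_x + a(x)) / (N + nu)
    return (counts + a) / (counts.sum() + a.sum())
```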
This implies that, for the posterior distribution, the conditional probabilities p_s(x^{(pa(s))}, ·) are independent and follow a Dirichlet distribution with parameters 1 + N_s(x^{(s)}, x^{(pa(s))}), x^{(s)} ∈ F_s.
One can restrict the huge class of coefficients described by (18.1) to a smaller class by imposing the following condition.
Definition 18.3 One says that the family of coefficients
\[
a = \big(a_s(x^{(s)}, x^{(pa(s))}),\ s\in V,\ x^{(s)}\in F_s,\ x^{(pa(s))}\in F(pa(s))\big)
\]
is consistent if there exists a positive scalar ν and a probability distribution P⁰ on F(V) such that
\[
a_s\big(x^{(s)}, x^{(pa(s))}\big) = \nu\, P^0_{\{s\}\cup pa(s)}\big(x^{(\{s\}\cup pa(s))}\big).
\]
We can see from (18.2) that using a prior distribution is quite important for Bayesian networks since, when the number of parents increases, some configurations in F(pa(s)) may not be observed, resulting in an undetermined value for the ratio
\[
N_s\big(x^{(s)}, x^{(pa(s))}\big)\,/\,N_s\big(x^{(s^-)}\big),
\]
even though, for the estimated model, the probability of observing x^{(pa(s))} may not be zero.
Given a prior defined as a family of Dirichlet distributions associated with a = (a_s(x^{(s)}, x^{(pa(s))})) for s ∈ V, x^{(s)} ∈ F_s, x^{(pa(s))} ∈ F(pa(s)), the joint density of the observations and parameters is given by
\[
P(x,\theta) = \prod_{s\in V}\prod_{x^{(pa(s))}} D\big(a_s(\cdot, x^{(pa(s))})\big)\ \prod_{x^{(s)}\in F_s} p_s\big(x^{(pa(s))}, x^{(s)}\big)^{N_s(x^{(s)}, x^{(pa(s))}) + a_s(x^{(s)}, x^{(pa(s))}) - 1}
\]
with
\[
D\big(a(\lambda), \lambda\in F\big) = \frac{\Gamma(\nu)}{\prod_\lambda \Gamma(a(\lambda))}
\]
and ν = ∑_λ a(λ). Here, θ represents the parameters of the model, i.e., the conditional distributions that specify the Bayesian network. Note that P(x,θ) is a density over the product space F(V) × Θ, where Θ is the space of all these conditional distributions. The marginal of this likelihood over all possible parameters, i.e.,
\[
P(x) = \int P(x,\theta)\,d\theta,
\]
provides the expected likelihood of the sample relative to the distribution of the parameters, and only depends on the structure of the network. In our case, integrating with respect to θ yields
\[
\log P(x) = \sum_{s,\, x^{(pa(s))}} \log\frac{D\big(a_s(\cdot, x^{(pa(s))})\big)}{D\big(a_s(\cdot, x^{(pa(s))}) + N_s(\cdot, x^{(pa(s))})\big)}.
\]
Letting
\[
\gamma(s, pa(s)) = \sum_{x^{(pa(s))}}\log\frac{D\big(a_s(\cdot, x^{(pa(s))})\big)}{D\big(a_s(\cdot, x^{(pa(s))}) + N_s(\cdot, x^{(pa(s))})\big)},
\]
the decomposition
\[
\log P(x) = \sum_{s\in V}\gamma(s, pa(s))
\]
expresses this likelihood as a sum of “scores” (associated with each node and its parents), which depend on the observed sample. The scores computed above are often called Bayesian scores because they derive from a Bayesian construction.
One can also consider simpler scores, such as the penalized likelihood
\[
\gamma(s, pa(s)) = -\sum_{x^{(pa(s))}} \hat H\big(X^{(s)}\,\big|\,X^{(pa(s))}\big)\,|F(pa(s))| - \rho\,|pa(s)|,
\]
where Ĥ is the conditional entropy for the empirical distribution based on the observed samples. Structure-learning algorithms [145, 109] are designed to optimize such scores.
When the sets F_s are not too large, which is common in practice, the parametric explosion is due to the multiplicity of parents, since the number of conditional probabilities p_s(x^{(pa(s))}, ·) grows exponentially with |pa(s)|. One way to simplify this is to assume that the conditional probability at s depends on x^{(pa(s))} only via some “global-effect” statistic g_s(x^{(pa(s))}). The idea, of course, is that the number of values taken by g_s should remain small, even if the number of parents is large.
Once the g_s's are fixed, learning the network distribution, which is now given by
\[
\pi(x) = \prod_{s\in V} p_s\big(g_s(x^{(pa(s))}), x^{(s)}\big),
\]
can be done exactly as before, the parameters being all p_s(w, λ), λ ∈ F_s, w ∈ W_s, where W_s is the range of g_s, and Dirichlet priors can be associated with each p_s(w, ·) for s ∈ V and w ∈ W_s. The counts provided in (18.3) can now be chosen as
\[
a_s(x^{(s)}, w) = \frac{\nu_0}{|F|\,|g_s^{-1}(w)|}. \tag{18.4}
\]
Like everything else, parameter estimation for loopy networks is much harder than with trees or Bayesian networks. There is usually no closed-form expression for the estimators, and their computation relies on more or less tractable numerical procedures.
then θ = (α, β) and U(x) = −(∑_s x^{(s)}, ∑_{s∼t} x^{(s)}x^{(t)}). Most of the Markov random field models that are used in practice can be put in this form. The constant Z_θ in (18.5) is
\[
Z_\theta = \sum_{x\in F(V)}\exp\big(-\theta^T U(x)\big)
\]
and
\[
\nabla^2\ell(\theta) = -\mathrm{Var}_\theta(U) \tag{18.7}
\]
where E_θ denotes the expectation with respect to π_θ and Var_θ the covariance matrix under the same distribution.
We skip the proof, which is a simple computation. This proposition implies that a local maximum of θ ↦ ℓ(θ) must also be global. Any such maximum must be a solution of
\[
E_\theta(U) = \bar U_N(x_0),
\]
and conversely. There are some situations in which the maximum does not exist, or is not unique. Let us first discuss the second case.
If several solutions exist, the log-likelihood cannot be strictly concave: there must exist at least one θ for which Var_θ(U) is not definite. This implies that there exists a nonzero vector u such that var_θ(u^T U) = u^T Var_θ(U)u = 0. This is only possible when u^T U(x) is constant over x ∈ F(V). Conversely, if this is true, Var_θ(U) is degenerate for all θ.
For a concave function like ℓ to have no maximum, there must exist what is called a direction of recession [168], that is, a direction α ∈ R^d such that, for all θ, the function t ↦ ℓ(θ + tα) is increasing. In this case the maximum is attained “at infinity”. Denoting U_α(x) = α^T U(x), the derivative in t of ℓ(θ + tα) is
where Ū_α = α^T Ū_N. This derivative is positive for all t if and only if
and U_α is not constant. To prove this, assume that the derivative is positive. Then U_α is not constant (otherwise, the derivative would be zero). Let F*_α ⊂ F(V) be the set of minimizers of U_α and U*_α its minimal value. We have
\[
\begin{aligned}
E_{\theta+t\alpha}(U_\alpha) &= \frac{\sum_{x\in F(V)} U_\alpha(x)\,\exp(-\theta^T U(x) - tU_\alpha(x))}{\sum_{x\in F(V)}\exp(-\theta^T U(x) - tU_\alpha(x))}\\
&= \frac{\sum_{x\in F(V)} U_\alpha(x)\,\exp\big(-\theta^T U(x) - t(U_\alpha(x)-U_\alpha^*)\big)}{\sum_{x\in F(V)}\exp\big(-\theta^T U(x) - t(U_\alpha(x)-U_\alpha^*)\big)}\\
&= \frac{U_\alpha^*\sum_{x\in F_\alpha^*}\exp(-\theta^T U(x)) + \sum_{x\notin F_\alpha^*} U_\alpha(x)\,\exp\big(-\theta^T U(x) - t(U_\alpha(x)-U_\alpha^*)\big)}{\sum_{x\in F_\alpha^*}\exp(-\theta^T U(x)) + \sum_{x\notin F_\alpha^*}\exp\big(-\theta^T U(x) - t(U_\alpha(x)-U_\alpha^*)\big)}.
\end{aligned}
\]
When t tends to +∞, the sums over x ∉ F*_α tend to 0, which implies that E_{θ+tα}(U_α) tends to U*_α. So, if E_{θ+tα}(U_α) − Ū_α > 0 for all t, then Ū_α = U*_α and U_α is not constant. The converse statement is obvious.
\[
\{U(x),\ x\in F(V)\}\subset R^d.
\]
\[
E_\theta(U) = \bar U_N.
\]
As remarked above, for fixed θ, we have designed, in chapter 15, Markov chain Monte Carlo algorithms that asymptotically sample from π_θ. Select one of these algorithms, and let p_θ be the corresponding transition probabilities for a given θ, so that p_θ(x,y) = P(X_{n+1} = y | X_n = x) for the sampling chain. Then, define the iterative algorithm, initialized with arbitrary θ_0 and x_0 ∈ F(V), that loops over the following two steps.
(SG1) Sample from the distribution p_{θ_t}(x_t, ·) to obtain a new configuration x_{t+1}.
(SG2) Update the parameter using
\[
\theta_{t+1} = \theta_t + \gamma_{t+1}\big(U(x_{t+1}) - \bar U_N\big). \tag{18.10}
\]
This algorithm differs from the situation considered in section 3.3 in that the distribution of the sampled variable x_{t+1} depends both on the current parameter θ_t and on the current variable x_t. Convergence requires additional constraints on the size of the gains γ_t, and we have the following theorem [209].
Theorem 18.5 If p_θ corresponds to the Gibbs sampler or the Metropolis algorithm, and γ_{t+1} = ε/(t+1) for small enough ε, the algorithm that iterates (SG1) and (SG2) converges almost surely to the maximum likelihood estimator.
The speed of convergence of such algorithms depends both on the speed of convergence of the Monte-Carlo sampling and on that of the original gradient ascent. The latter can be improved somewhat with variants similar to those discussed in section 3.3, for example by choosing data-adaptive gains as in the ADAM algorithm.
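Here is a minimal sketch of the (SG1)-(SG2) iteration for the binary pairwise model used as an example above (all names are ours; edges is a list of pairs and nbrs[s] lists the neighbors of s):

```python
import numpy as np

rng = np.random.default_rng(0)

def stats(x, edges):
    # U(x) = -(sum_s x_s, sum_{s~t} x_s x_t) for a binary configuration x
    return -np.array([x.sum(), sum(x[s] * x[t] for s, t in edges)])

def gibbs_sweep(x, theta, nbrs):
    # one systematic sweep of the Gibbs sampler for pi_theta ∝ exp(-theta^T U)
    for s in range(len(x)):
        field = theta[0] + theta[1] * sum(x[t] for t in nbrs[s])
        x[s] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-field)) else 0
    return x

def stochastic_ml(samples, edges, nbrs, n_steps=5000, eps=0.1):
    U_bar = np.mean([stats(x, edges) for x in samples], axis=0)  # empirical average
    theta, x = np.zeros(2), samples[0].copy()
    for t in range(n_steps):
        x = gibbs_sweep(x, theta, nbrs)                        # (SG1)
        theta += eps / (t + 1) * (stats(x, edges) - U_bar)     # (SG2), gamma_t = eps/t
    return theta
```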
The maximum likelihood estimator is closely related to what is called the maximum entropy extension of a set of constraints. Let a function U from F(V) to R^d be given. An element u ∈ R^d is said to be a consistent assignment for U if there exists a probability distribution π on F(V) such that E_π(U) = u. An example of a consistent assignment is any empirical average Ū based on a sample (x^{(1)}, …, x^{(N)}), since Ū = E_π(U) for
\[
\pi = \frac{1}{N}\sum_{k=1}^N \delta_{x^{(k)}}.
\]
Because the entropy is strictly concave, there is a unique solution to this problem. We first discuss non-positive solutions, i.e., solutions for which π(x) = 0 for some x. An important fact is that if, for a given x, there exists π_1 such that E_{π_1}(U) = u and π_1(x) > 0, then the optimal π must also satisfy π(x) > 0. This is because, if π(x) = 0, then, letting π_ε = (1−ε)π + επ_1, we have E_{π_ε}(U) = u since this constraint is linear, π_ε(x) > 0, and
\[
\begin{aligned}
H(\pi_\varepsilon) - H(\pi) &= -\sum_{y,\,\pi(y)>0}\big(\pi_\varepsilon(y)\log\pi_\varepsilon(y) - \pi(y)\log\pi(y)\big) - \sum_{y,\,\pi(y)=0}\varepsilon\,\pi_1(y)\big(\log(\varepsilon) + \log\pi_1(y)\big)\\
&= -\varepsilon\log(\varepsilon)\sum_{y,\,\pi(y)=0}\pi_1(y) + O(\varepsilon),
\end{aligned}
\]
which is positive for small enough ε, contradicting the fact that π is a maximizer.
in which we have set θ = (θ_1, …, θ_d), we find that the optimal π must satisfy
that α^T U(x) < α^T u. Such x's exist by assumption, and therefore N_u ≠ ∅. Conversely, assume N_u ≠ ∅. If condition (18.8) is not satisfied, then we have shown, when discussing maximum likelihood, that an optimal parameter for the exponential model would exist, leading to a positive distribution for which E_π(U) = u, which is a contradiction.
with
\[
\sum_{j=1}^q U_j(x) = 1 \quad\text{and}\quad U_j(x)\ge 0.
\]
Proof Note that, since π > 0, E_π(U_j) must also be positive for all j, since E_π(U_j) = 0 would otherwise imply U_j = 0 and u_j = 0 for u to be consistent. So π′ is well defined and obviously positive.
We have
\[
\begin{aligned}
KL(\pi^*\|\pi') - KL(\pi^*\|\pi) &= \log\zeta - \sum_{x\in F(V)}\pi^*(x)\sum_{j=1}^d U_j(x)\,\log\frac{u_j}{E_\pi(U_j)}\\
&= \log\zeta - \sum_{j=1}^d u_j\,\log\frac{u_j}{E_\pi(U_j)}\\
&= \log\zeta - KL\big(u\,\big\|\,E_\pi(U)\big).
\end{aligned}
\]
(We have used the identity E_{π*}(U) = u.) So it suffices to prove that ζ ≤ 1. We have
\[
\begin{aligned}
\zeta &= \sum_{x\in F(V)}\pi(x)\prod_{j=1}^d\Big(\frac{u_j}{E_\pi(U_j)}\Big)^{U_j(x)}\\
&\le \sum_{j=1}^d\sum_{x\in F(V)}\pi(x)\,U_j(x)\,\frac{u_j}{E_\pi(U_j)}\\
&= \sum_{j=1}^d \frac{u_j}{E_\pi(U_j)}\,E_\pi(U_j) = \sum_{j=1}^d u_j = 1,
\end{aligned}
\]
the inequality using the concavity of the logarithm, since the U_j(x) are non-negative and sum to 1.
This algorithm always reduces the Kullback-Leibler distance to the maximum entropy extension. This distance being always non-negative, it therefore converges to a limit, which, still according to lemma 18.6, is only possible if KL(u‖E_{π_n}(U)) also tends to 0, that is, E_{π_n}(U) → u. Since the space of probability distributions is compact, the Heine-Borel theorem implies that the sequence π_{θ_n} has at least one accumulation point, which we now identify. If π is such a point, one must have E_π(U) = u. Moreover, we have π > 0, since otherwise KL(π*‖π) = +∞. To prove that π = π* (and is therefore the limit of the sequence), it remains to show that it can be put in the form (18.5). For this, define the vector space V of functions v : F(V) → R which can be written in the form
\[
v(x) = \alpha_0 + \sum_{j=1}^d \alpha_j U_j(x).
\]
Since log π_{θ_n} ∈ V for all n, so is its limit, and this proves that log π belongs to V. We have obtained the following proposition.
Proposition 18.7 Assume that, for all x ∈ F(V), one has U(x) = (U_1(x), …, U_d(x)) with
\[
\sum_{j=1}^d U_j(x) = 1 \quad\text{and}\quad U_j(x)\ge 0.
\]
Let u be a consistent assignment for the expectation of U such that N_u = ∅. Then, the algorithm described in (18.13) converges to the maximum entropy extension of u.
This is the iterative scaling algorithm. This method can be extended in a straightforward way to handle the maximum entropy extension for a family of functions U^{(1)}, …, U^{(K)}, such that, for all x and for all k, U^{(k)}(x) is a d_k-dimensional vector such that
\[
\sum_{j=1}^{d_k} U^{(k)}_j(x) = 1,
\]
where θ^{(k)} is d_k-dimensional, and iterative scaling can then be implemented by updating only one of these vectors at a time, using (18.13) with U = U^{(k)}.
The restriction to U(x) providing a discrete probability distribution for all x is, in fact, no loss of generality. This is because adding a constant to U does not change the resulting exponential model in (18.5), and multiplying U by a constant can also be compensated by dividing θ by the same constant in the same model. So, if u_− is a lower bound for min_{j,x} U_j(x), one can replace U by U − u_−, and therefore assume that U ≥ 0; and if u_+ is an upper bound for ∑_j U_j(x), one can replace U by U/u_+ and therefore assume that ∑_j U_j(x) ≤ 1. Define
\[
U_{d+1}(x) = 1 - \sum_{j=1}^d U_j(x) \ge 0.
\]
Then, the maximum entropy extension for (U_1, …, U_d) with assignment (u_1, …, u_d) is obviously also the extension for (U_1, …, U_{d+1}), with assignment (u_1, …, u_{d+1}), where
\[
u_{d+1} = 1 - \sum_{j=1}^d u_j,
\]
and the latter is in the form required in proposition 18.7. Note that iterative scaling requires computing the expectations of U_1, …, U_d before each update. These are not necessarily available in closed form and may have to be estimated using Monte-Carlo sampling.
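For a state space small enough that the expectations E_π(U_j) can be computed exactly, the scaling update of (18.13), as reconstructed from lemma 18.6, π_{n+1}(x) ∝ π_n(x) ∏_j (u_j / E_{π_n}(U_j))^{U_j(x)}, can be sketched as follows (function name ours):

```python
import numpy as np

def iterative_scaling(U, u, n_iter=1000, tol=1e-10):
    """U: (n_states, d) array with non-negative rows summing to 1; u: consistent target."""
    pi = np.full(U.shape[0], 1.0 / U.shape[0])   # start from the uniform distribution
    for _ in range(n_iter):
        m = pi @ U                               # current expectations E_pi(U_j)
        if np.max(np.abs(m - u)) < tol:
            break
        pi = pi * np.prod((u / m) ** U, axis=1)  # scaling update
        pi /= pi.sum()                           # renormalize (the constant zeta)
    return pi
```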
we see that C(π, π̃) is always non-negative, and vanishes (under the assumption of a positive π) only if all the local specifications of π and π̃ coincide, and this can be shown to imply that π = π̃. Indeed, for any x, y ∈ F(V), and choosing some order V = {s_1, …, s_n} on V, one can write
\[
\frac{\pi(x)}{\pi(y)} = \prod_{k=1}^n \frac{\pi\big(x^{(s_k)}\,\big|\,x^{(s_1)},\ldots,x^{(s_{k-1})}, y^{(s_{k+1})},\ldots,y^{(s_n)}\big)}{\pi\big(y^{(s_k)}\,\big|\,x^{(s_1)},\ldots,x^{(s_{k-1})}, y^{(s_{k+1})},\ldots,y^{(s_n)}\big)},
\]
and the ratios π(x)/π(y), for x ∈ F(V), combined with the constraint that ∑_x π(x) = 1, uniquely define π.
This yields the maximum pseudo-likelihood estimator (or pseudo-maximum-likelihood estimator), defined as a maximizer of the function (called the log-pseudo-likelihood)
\[
\theta\mapsto \sum_{s\in V}\sum_{k=1}^N \log \pi_{\theta,s}\big(x_k^{(s)}\,\big|\,x_k^{(t)},\ t\neq s\big).
\]
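For the binary pairwise model used earlier in this chapter, the log-pseudo-likelihood has a simple closed form, since each local conditional is a logistic function of the neighboring values; a sketch (names ours) that can be fed to any numerical optimizer:

```python
import numpy as np

def log_pseudo_likelihood(theta, samples, nbrs):
    """theta = (a, b) for pi ∝ exp(a sum_s x_s + b sum_{s~t} x_s x_t), binary x."""
    a, b = theta
    lpl = 0.0
    for x in samples:
        for s in range(len(x)):
            field = a + b * sum(x[t] for t in nbrs[s])
            # log P(x_s | x_t, t != s): logistic model with the local field
            lpl += x[s] * field - np.log1p(np.exp(field))
    return lpl
```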
The methods presented so far for discrete variables formally generalize to more general state spaces, even though consistency or convergence issues in non-compact cases can be significantly harder to address. Score matching is a parameter estimation method that was introduced in [95] and was designed, in its original version, to estimate parameters for statistical models taking the form
\[
\pi_\theta(x) = \frac{1}{C(\theta)}\exp\big(-F(x,\theta)\big)
\]
where ∇_x denotes the gradient with respect to the x variable. Letting π_true denote the p.d.f. of the true data distribution (not necessarily part of the statistical model), score matching minimizes
\[
f(\theta) = \int_{R^d} \big|s(x,\theta) - s_{\mathrm{true}}(x)\big|^2\,\pi_{\mathrm{true}}(x)\,dx
\]
where s_true = −∇ log π_true. This integral can be restricted to the support of π_true, if we do not want to assume that π_true is non-vanishing. Note, however, that f(θ) = 0 implies that ∇ log π_θ = ∇ log π_true, π_true-almost everywhere, so that π_θ(x) = c·π_true(x) for some constant c and x in the support of π_true. Only if π_true(x) > 0 for all x ∈ R^d can we conclude that this requires π_θ = π_true.
Expanding the squared norm and applying the divergence theorem yields
\[
\begin{aligned}
f(\theta) &= \int_{R^d}|\nabla_x\log\pi_\theta(x)|^2\,\pi_{\mathrm{true}}(x)\,dx - 2\int_{R^d}\nabla_x\log\pi_\theta(x)^T\,\nabla\pi_{\mathrm{true}}(x)\,dx + \int_{R^d}|s_{\mathrm{true}}(x)|^2\,\pi_{\mathrm{true}}(x)\,dx\\
&= \int_{R^d}|\nabla_x\log\pi_\theta(x)|^2\,\pi_{\mathrm{true}}(x)\,dx + 2\int_{R^d}\Delta\log\pi_\theta(x)\,\pi_{\mathrm{true}}(x)\,dx + \int_{R^d}|s_{\mathrm{true}}(x)|^2\,\pi_{\mathrm{true}}(x)\,dx.
\end{aligned}
\]
To justify the use of the divergence theorem, one needs to assume two derivatives in the log-likelihoods with sufficient decay at infinity (see Hyvärinen and Dayan [95] for details). This shows that minimizing f is equivalent to minimizing
\[
g(\theta) = \int_{R^d}|\nabla_x\log\pi_\theta(x)|^2\,\pi_{\mathrm{true}}(x)\,dx + 2\int_{R^d}\Delta\log\pi_\theta(x)\,\pi_{\mathrm{true}}(x)\,dx
= E\big(|\nabla_x\log\pi_\theta(X)|^2 + 2\Delta\log\pi_\theta(X)\big).
\]
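As a sanity check of this criterion, the following sketch fits an isotropic Gaussian model N(µ, s² Id) by minimizing the empirical version of g(θ); the model choice and the names are ours:

```python
import numpy as np
from scipy.optimize import minimize

def g_empirical(theta, X):
    # empirical E(|grad_x log pi_theta(X)|^2 + 2 Lap log pi_theta(X))
    d = X.shape[1]
    mu, log_s = theta[:d], theta[d]
    s2 = np.exp(2 * log_s)
    grad = -(X - mu) / s2        # grad_x log pi_theta(x) = -(x - mu) / s^2
    lap = -d / s2                # Lap log pi_theta(x), constant in x here
    return np.mean(np.sum(grad ** 2, axis=1) + 2 * lap)

X = np.random.default_rng(1).normal(3.0, 2.0, size=(1000, 2))
theta_hat = minimize(g_empirical, np.zeros(3), args=(X,)).x
# theta_hat[:2] recovers the sample mean and exp(theta_hat[2]) the sample s.d.
```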
Remark 18.8 The method can be adapted to deal with discrete variables by replacing derivatives with differences. Let X take values in a finite set, R_X, on which a graph
Missing variables in the context of graphical models may correspond to real processes that cannot be measured, which is common, for example, with biological data. They may be more conceptual objects that are interpretable but are not part of the data acquisition process, like phonemes in speech recognition, or edges and labels in image processing and object recognition. They may also be variables that have been added to the model to increase its parametric dimension without increasing the complexity of the graph. However, as we will see, dealing with incomplete or imperfect observations brings the parameter estimation problem to a new level of difficulty.
where
\[
\bar U_n = \frac{1}{N}\sum_{k=1}^N E_{\theta_n}\big(U(X)\,\big|\,X^{(S)} = x_k^{(S)}\big). \tag{18.18}
\]
So, the M-step of the EM, which maximizes (18.17), coincides with the complete-data maximum-likelihood problem in which the empirical average of U is replaced by the average of its conditional expectations given the observations, as given in (18.18); computing the latter constitutes the E-step. As a consequence, a strict application of the EM algorithm for graphical models is unfeasible, since each step requires running an algorithm of complexity similar to maximum likelihood for complete data, which we already identified as a challenging, computationally costly problem. The same remark holds for the SAEM algorithm of section 17.4.3, which also requires solving a maximum likelihood problem at each iteration.
The stochastic gradient ascent described in section 18.2.2 can be extended to partial observations [210], even though it loses the global convergence guarantee that resulted from the concavity of the log-likelihood for complete observations. Indeed, applying the computation of section 17.5.2 to a model given by (18.16), we get, using proposition 18.4,
Let π_θ(x^{(H)} | x^{(S)}) denote the conditional probability P(X^{(H)} = x^{(H)} | X^{(S)} = x^{(S)}) for the distribution π_θ, which therefore takes the form
\[
\pi_\theta\big(x^{(H)}\,\big|\,x^{(S)}\big) = \frac{1}{\tilde Z(\theta, x^{(S)})}\exp\Big(-\theta^T U\big(x^{(S)}\wedge x^{(H)}\big)\Big).
\]
Algorithm 18.1
Start the algorithm with an initial parameter θ(0) and initial configurations x(0) and x_k^{(H)}(0), k = 1, …, N. Then, at step n:
(SGH1) Sample from the distribution p_{θ(n)}(x(n), ·) to obtain a new configuration x(n+1) ∈ F(V).
(SGH2) For k = 1, …, N, sample from the distribution p_{θ(n)}^{x_k^{(S)}}(x_k^{(H)}(n), ·) to obtain a new configuration x_k^{(H)}(n+1) over the hidden vertexes.
(SGH3) Update the parameter using
\[
\theta(n+1) = \theta(n) + \gamma(n+1)\Big(U(x(n+1)) - \frac{1}{N}\sum_{k=1}^N U\big(x_k^{(S)}\wedge x_k^{(H)}(n+1)\big)\Big). \tag{18.19}
\]
The EM update
\[
\theta_{n+1} = \operatorname*{argmax}_\theta\ \sum_{k=1}^N E_{\theta_n}\big(\log\pi_\theta(X)\,\big|\,X^{(S)} = x_k^{(S)}\big)
\]
being challenging for Markov random fields, it is tempting to replace the log-likelihood in the expectation by another contrast, such as the log-pseudo-likelihood. A similar approach to the one described here was introduced in Chalmond [50], for situations in which the conditional distribution of X^{(S)} given X^{(H)} is “simple enough” (for example, if the variables X_s, s ∈ S, are conditionally independent given X^{(H)}) and the cardinality of the sets F_s, s ∈ H, is small (binary, or ternary, variables).
The algorithm has the following variational interpretation. Fix x^{(S)} ∈ F(S) and s ∈ H. Also denote µ_s = 1/|F(H \ {s})|. If q is a transition probability from F(H\{s}) to F_s, let
This function is concave in q, since its first partial derivative with respect to q(y^{(H\{s})}, y^{(s)}) (for each y ∈ F(H)) is given by
\[
\mu_s\,\log\pi_{\theta,s}\big(y^{(s)}\wedge x^{(S)}\,\big|\,y^{(H\setminus\{s\})}\big) - \mu_s\,\log\big(q(y^{(H\setminus\{s\})}, y^{(s)})\,\mu_s\big) - \mu_s,
\]
so that its Hessian is the diagonal matrix with negative entries −µ_s/q(y^{(H\{s})}, y^{(s)}). Using Lagrange multipliers to express the constraints ∑_{y^{(s)}∈F_s} q(y^{(H\{s})}, y^{(s)}) = 1 for all y^{(H\{s})}, we find that Δ_θ^{(s)}(q, x^{(S)}) is maximized when q(y^{(H\{s})}, y^{(s)}) is proportional to π_{θ,s}(y^{(s)} ∧ x^{(S)} | y^{(H\{s})}), yielding
One then maximizes
\[
\sum_{k=1}^N\sum_{s\in H}\Delta_\theta\big(q_k^{(s)}, x_k^{(S)}\big) \tag{18.21}
\]
with respect to θ and q_k^{(s)}, k = 1, …, N, s ∈ H. Consider an iterative maximization scheme in which, from a current parameter θ_n, one first maximizes (18.21) with respect to the transition probabilities q_k^{(s)}, then with respect to θ to obtain θ_{n+1}. This scheme provides the iteration
\[
\theta_{n+1} = \operatorname*{argmax}_\theta\ \sum_{k=1}^N\sum_{s\in H}\sum_{y\in F(H)} \log\pi_{\theta,s}\big(y^{(s)}\wedge x_k^{(S)}\,\big|\,y^{(H\setminus\{s\})}\big)\ \pi_{\theta_n,s}\big(y^{(s)}\,\big|\,x_k^{(S)}\wedge y^{(H\setminus\{s\})}\big)\,\mu_s.
\]
We now consider the situation in which the joint distribution of X = X^{(S)} ∧ X^{(H)} is a Bayesian network over a directed acyclic graph G = (V, E).
Assume that x_1^{(S)}, …, x_N^{(S)} are observed. The parameter θ is the collection of all p_s(x^{(pa(s))}, x^{(s)}) for s ∈ V. Define the random variables I_{s,x}(y), equal to one if y^{({s}∪pa(s))} = x^{({s}∪pa(s))} and zero otherwise. We can write
\[
\log\pi(y) = \sum_{s\in V}\log p_s\big(y^{(pa(s))}, y^{(s)}\big) = \sum_{s\in V}\ \sum_{x^{(\{s\}\cup pa(s))}\in F(\{s\}\cup pa(s))}\log p_s\big(x^{(pa(s))}, x^{(s)}\big)\, I_{s,x}(y)
\]
\[
= \sum_{x^{(\{s\}\cup pa(s))}\in F(\{s\}\cup pa(s))}\log p_s\big(x^{(pa(s))}, x^{(s)}\big)\sum_{k=1}^N \pi_{\theta_n}\big(x^{(\{s\}\cup pa(s))}\,\big|\,X^{(S)} = x_k^{(S)}\big),
\]
with
\[
\pi_{\theta_n}(x) = \prod_{s\in V} p_s^{(n)}\big(x^{(pa(s))}, x^{(s)}\big),
\]
Z_s being a normalization constant.
If the estimation is solved with a Dirichlet prior Dir(1 + a_s(x^{(s)}, x^{(pa(s))})), the update formula becomes
\[
p_s^{(n+1)}\big(x^{(pa(s))}, x^{(s)}\big) = \frac{1}{Z_s(x^{(s^-)})}\Big(a_s\big(x^{(s)}, x^{(pa(s))}\big) + \sum_{k=1}^N \pi_{\theta_n}\big(x^{(\{s\}\cup pa(s))}\,\big|\,X^{(S)} = x_k^{(S)}\big)\Big). \tag{18.22}
\]
This algorithm is very simple when the conditional distributions π_{θ_n}(x^{({s}∪pa(s))} | X^{(S)} = x_k^{(S)}) can easily be computed, which is not always the case for a general Bayesian network, since conditional distributions do not always have the structure of a Bayesian network. The computation is simple enough for trees, however, since conditional tree distributions are still trees (or forests). More precisely, the conditional distribution given the observed variables can be written in the form
\[
\pi\big(y^{(H)}\,\big|\,x^{(S)}\big) = \frac{1}{Z(x^{(S)})}\prod_{s\in H}\varphi_{s,x}\big(y^{(s)}\big)\prod_{t\sim s,\,\{s,t\}\subset H}\varphi_{st}\big(y^{(s)}, y^{(t)}\big)
\]
with ϕ_{s,pa(s)}(y^{(s)}, y^{(pa(s))}) = p_s(y^{(pa(s))}, y^{(s)}) and, letting ϕ_s(y^{(s)}) = p_s(y^{(s)}) if pa(s) = ∅ and 1 otherwise,
\[
\varphi_{s,x}\big(y^{(s)}\big) = \varphi_s\big(y^{(s)}\big)\prod_{t\sim s,\, t\in S}\varphi_{st}\big(y^{(s)}, x^{(t)}\big).
\]
So, the marginal joint distributions of a vertex and its parents are directly given by belief propagation, using the interactions just defined. This training algorithm is summarized below.
(1) For k = 1, …, N, use belief propagation (or sum-prod) to compute all π_{θ_n}(x^{({s}∪pa(s))} | X^{(S)} = x_k^{(S)}). Note that these probabilities can be 0 or 1 when s ∈ S and/or pa(s) ⊂ S.
(2) Use (18.22) to compute the next set of parameters.
The tree case includes the important example of hidden Markov models, which are defined as follows. S and H are ordered, with the same cardinality, say S = {s_1, …, s_q} and H = {h_1, …, h_q}. Edges are (h_1, h_2), …, (h_{q−1}, h_q) and (h_1, s_1), …, (h_q, s_q). The interpretation generally is that the hidden variables are the variables of interest and behave like a Markov chain, and that the observations are either noisy or transformed versions of them. A major application is speech recognition, where the hidden labels represent specific phonemes (little pieces of spoken words) and the observations are measured signals. The transitions between hidden variables then describe how phonemes are likely to appear in sequence in a given language, and those between hidden and observed variables describe how each phoneme is likely to be pronounced and heard.
The algorithm in the general case can move from tractable to intractable depending on the situation. This must generally be handled on a case-by-case basis, by analyzing the conditional structure, for a given model, knowing the observations.
Chapter 19
Deep Generative Methods
We develop, in this chapter, methods that model stochastic processes using a feed-forward approach, generating complex random variables as non-linear transformations of simpler ones. Many of these methods can be seen as instances of structural equation models (SEMs), described in section 16.3, with, for deep-learning implementations, high-dimensional parametrizations of (16.8).
We start with the formally simple case in which the modeled variable takes values in R^d and is modeled as
\[
X = g(Z)
\]
where Z also takes values in R^d, with a known distribution, and g is C¹ and invertible, with a C¹ inverse on R^d, i.e., is a diffeomorphism of R^d. Let us denote by h the inverse of g.
If Z has a p.d.f. f_Z with respect to Lebesgue's measure, then, using the change of variables formula, the p.d.f. of X is
\[
f_X(x) = f_Z(h(x))\,|\det \partial_x h(x)|.
\]
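As a quick numerical check of this formula, with an affine g chosen by us for illustration (for g(z) = Az + b and Z standard Gaussian, X is N(b, AAᵀ), which gives an independent reference value):

```python
import numpy as np
from scipy.stats import multivariate_normal

A = np.array([[2.0, 0.3], [0.0, 0.5]])
b = np.array([1.0, -1.0])
x = np.array([0.7, 0.2])
# f_X(x) = f_Z(h(x)) |det dh(x)| with h(x) = A^{-1}(x - b), |det dh| = 1/|det A|
fX = multivariate_normal.pdf(np.linalg.solve(A, x - b), np.zeros(2), np.eye(2)) \
     / abs(np.linalg.det(A))
assert np.isclose(fX, multivariate_normal.pdf(x, b, A @ A.T))
```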
We quickly describe the basic principles of the algorithm. One starts with a parametrized family, say (ψ_α, α ∈ A), of diffeomorphisms of R. Such families are relatively easy to design; one example proposed in [188] is a smoothed version of the piecewise linear function
\[
u\mapsto v_0 + (1-\sigma)u + \gamma\,|(1-\sigma)u - u_0|
\]
with y = Ux.
The algorithm in [188] is initialized with h_0 = id_{R^d} and updates the transformation at step n according to
\[
h_n = \varphi_{\alpha_n, U_n}\circ h_{n-1},
\]
where α_n maximizes
\[
\alpha\mapsto \ell(\varphi_{\alpha, U_n}\circ h_{n-1}).
\]
(Here, the current value h_{n−1} is not revisited, therefore providing a “greedy” optimization method.)
\[
\ell(\varphi_{\alpha,U_n}\circ h_{n-1}) = \sum_{k=1}^N \log f_Z\big(\varphi_{\alpha,U_n}(z_{n-1,k})\big) + \sum_{k=1}^N\log\big|\det \partial_x\varphi_{\alpha,U_n}(z_{n-1,k})\big| + \sum_{k=1}^N\log\big|\det\partial_x h_{n-1}(x_k)\big|.
\]
Since the last term does not depend on α, we see that it suffices to keep track of the “particle” locations, z_{n−1,k}, to be able to compute α_n. Note also that these locations are easily updated with z_{n,k} = ϕ_{α_n,U_n}(z_{n−1,k}).
\[
\ell(h) = \sum_{k=1}^N\log f_Z(z_{m,k}) + \sum_{k=1}^N\sum_{j=1}^m \log\big|\det\partial_x\varphi_{w_j}(z_{j-1,k})\big|.
\]
Normalizing flows in this form are described in [162, 108, 149]. The gradient of ℓ with respect to the parameters w_1, …, w_m can be computed by backpropagation. We note, however, that, unlike in typical neural implementations, the parameters may come with specific constraints, such as U ∈ O_d(R) when w = (α, U), so that the gradient and the associated displacement may have to be adapted compared to standard gradient ascent implementations (see section 21.6.3 for a discussion of first-order implementations of gradient methods for functions of orthogonal matrices, and [1] for more general methods on optimization over matrix groups).
with z(0) = x for some function w : t 7→ w(t). Letting z(t) = hw (t, x) (which defines
hw ), we know that, under suitable assumptions on ψ, the mapping x 7→ hw (t, x) is a
diffeomorphism of Rd . One can then maximize
ℓ(h_w(T, ·)) = Σ_{k=1}^N log f_Z(h_w(T, x_k)) + Σ_{k=1}^N log |det ∂_x h_w(T, x_k)|
with respect to the function w. Let z_k(t) = h_w(t, x_k) and J_k(t) = log |det ∂_x h_w(t, x_k)|. We have, by definition,
∂_t z_k(t) = ψ_{w(t)}(z_k(t))
with z_k(0) = x_k. One can also show that
∂_t J_k(t) = ∇ · ψ_{w(t)}(z_k(t))
with J_k(0) = 0, where the r.h.s. is the divergence of ψ_{w(t)} evaluated at z_k(t). We
provide a quick (and formal) justification of this fact. First note that differentiating ∂_t h_w(t, x) = ψ_{w(t)}(h_w(t, x)) with respect to x yields
∂_t ∂_x h_w(t, x) = ∂_x ψ_{w(t)}(h_w(t, x)) ∂_x h_w(t, x),
and combining this with the identity ∂_t log |det M(t)| = trace(M(t)^{−1} ∂_t M(t)) gives
∂_t log |det ∂_x h_w(t, x)| = trace(∂_x h_w(t, x)^{−1} ∂_x ψ_{w(t)}(h_w(t, x)) ∂_x h_w(t, x)) = trace(∂_x ψ_{w(t)}(h_w(t, x))) = ∇ · ψ_{w(t)}(h_w(t, x)).
From this, it follows that the time-continuous normalizing flow problem can be reformulated as maximizing
Σ_{k=1}^N log f_Z(z_k(T)) + Σ_{k=1}^N J_k(T)
subject to ∂_t z_k(t) = ψ_{w(t)}(z_k(t)), ∂_t J_k(t) = ∇ · ψ_{w(t)}(z_k(t)), z_k(0) = x_k, J_k(0) = 0. This is
an optimal control problem, whose analysis can be done similarly to that made in
section 11.6.1, provided that ∇ · ψw(t) can be expressed in closed form.
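A minimal Euler discretization of the augmented system (z_k, J_k) can be sketched as follows; the linear field ψ_w(z) = −wz, whose divergence is available in closed form, is an assumption made purely for the example.

```python
import numpy as np

d, T, n_steps = 2, 1.0, 100
dt = T / n_steps
rng = np.random.default_rng(2)
x = rng.standard_normal((64, d)) + 2.0        # training points x_k

def psi(z, w):
    return -w * z                              # psi_{w(t)}(z) = -w z

def div_psi(z, w):
    return -d * w * np.ones(z.shape[0])        # closed-form divergence: -d*w

z = x.copy()
J = np.zeros(x.shape[0])
for t in range(n_steps):
    w = 0.5                                    # w(t); constant for the sketch
    J += dt * div_psi(z, w)                    # dJ/dt = div psi_{w(t)}(z(t))
    z += dt * psi(z, w)                        # dz/dt = psi_{w(t)}(z(t))

# objective: sum_k log f_Z(z_k(T)) + sum_k J_k(T), with f_Z standard Gaussian
loglik = np.sum(-0.5 * np.sum(z ** 2, 1) - (d / 2) * np.log(2 * np.pi) + J)
```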
Note that the inverse of hw (T , ·), which provides the generative model going from
Z to X can also be obtained as the solution of an ODE. Namely, if one solves the
differential equation
∂t x(t) = −ψw(T −t) (x(t))
with initial condition x(0) = z, then x(T) solves the equation h_w(T, x(T)) = z, i.e., x(T) = h_w(T, ·)^{−1}(z).
VAEs [104, 105] model X ∈ R^d as X = g(Z, θ) + ε, where ε is a centered Gaussian noise with covariance matrix Q. The function g is typically nonlinear, and VAEs have been introduced with this function modeled as a deep neural network (see chapter 11). Letting ϕ_N(· ; 0, Q) denote the p.d.f. of the Gaussian distribution N(0, Q), the conditional distribution of X given Z = z has density
f_X(x | z, θ) = ϕ_N(x − g(z, θ) ; 0, Q).
The computations can be simplified if one assumes that f_Z is the p.d.f. of a standard Gaussian, i.e., f_Z = ϕ_N(· ; 0, Id_{R^p}). Indeed, in that case, the integral in (17.11), which is, using the current notation,
∫_{R^p} log ( ϕ_N(x − g(z, θ) ; 0, Q) ϕ_N(z ; 0, Id_{R^p}) / ϕ_N(z ; µ(x, w), S(x, w)²) ) ϕ_N(z ; µ(x, w), S(x, w)²) dz,   (19.3)
can be partially computed. For any two p-dimensional Gaussian p.d.f.’s, one has
∫_{R^p} log ϕ_N(z ; µ1, Σ1) ϕ_N(z ; µ2, Σ2) dz = −(1/2) trace(Σ1^{−1} Σ2) − (1/2)(µ2 − µ1)^T Σ1^{−1}(µ2 − µ1) − (1/2) log det(Σ1) − (p/2) log(2π).   (19.4)
As a consequence, (19.3) becomes
−(1/2) E_w((X − g(Z, θ))^T Q^{−1} (X − g(Z, θ))) − (1/2) log det Q − (d/2) log 2π
− (1/2) E_w(trace(S(X, w)²) + |µ(X, w)|²) + E_w(log det(S(X, w))) + p/2,   (19.5)
where Ew denotes the expectation for the random variable (X, Z) where X follows a
uniform distribution over training data and the conditional distribution of Z given
X = x is N (µ(x, w) , S(x, w)2 ).
Writing Z = µ(X, w) + S(X, w)U with U ∼ N(0, Id_{R^p}) (the reparametrization of the latent variable), (19.5) becomes
−(1/2) E((X − g(µ(X, w) + S(X, w)U, θ))^T Q^{−1} (X − g(µ(X, w) + S(X, w)U, θ)))
− (1/2) E_w(trace(S(X, w)²) + |µ(X, w)|²) + E_w(log det(S(X, w)))
− (1/2) log det Q − (d/2) log 2π + p/2.   (19.6)
Defining, for a data point x and u ∈ R^p,
F(θ, Q, w, x, u) = −(1/2)(x − g(µ(x, w) + S(x, w)u, θ))^T Q^{−1}(x − g(µ(x, w) + S(x, w)u, θ))
− (1/2) log det Q − (1/2) trace(S(x, w)²) − (1/2)|µ(x, w)|² + log det(S(x, w)),
the resulting stochastic gradient ascent algorithm is
θ_{n+1} = θ_n + γ_{n+1} ∂_θ F(θ_n, Q_n, w_n, X_{n+1}, U_{n+1})
Q_{n+1} = Q_n + γ_{n+1} ∂_Q F(θ_n, Q_n, w_n, X_{n+1}, U_{n+1})   (19.7)
w_{n+1} = w_n + γ_{n+1} ∂_w F(θ_n, Q_n, w_n, X_{n+1}, U_{n+1})
where Xn+1 is drawn uniformly from the training data and Un+1 ∼ N (0, IdRp ).
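A hedged PyTorch sketch of one update of (19.7) could look as follows, with diagonal S(x, w), Q frozen at the identity for readability (the Q-update is omitted), and small illustrative networks g, µ and log S; all module names and sizes are assumptions.

```python
import torch

d, p = 4, 2
g = torch.nn.Sequential(torch.nn.Linear(p, 32), torch.nn.Tanh(), torch.nn.Linear(32, d))
mu = torch.nn.Linear(d, p)        # mu(x, w)
logs = torch.nn.Linear(d, p)      # log of the diagonal of S(x, w)
params = list(g.parameters()) + list(mu.parameters()) + list(logs.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(16, d)            # minibatch standing in for X_{n+1}
u = torch.randn(16, p)            # U_{n+1} ~ N(0, Id)
s = torch.exp(logs(x))
z = mu(x) + s * u                 # reparametrized latent sample
F = (-0.5 * ((x - g(z)) ** 2).sum(1)        # reconstruction term (Q = Id)
     - 0.5 * (s ** 2 + mu(x) ** 2).sum(1)   # -(1/2)(trace S^2 + |mu|^2)
     + torch.log(s).sum(1)).mean()          # + log det S
opt.zero_grad()
(-F).backward()                   # ascend F by descending -F
opt.step()
```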
This framework can be adapted to situations in which the observations are discrete.
Assume, as an example, that X takes values in {0, 1}V , where V is a set of vertexes,
i.e., X is a binary Markov random field on V . Assume, as a generative model, that
conditionally to the latent variable Z ∈ Rp , the variables X (s) , s ∈ V are independent
and X (s) follows a Bernoulli distribution with parameter g (s) (z, θ), where g : Rp →
[0, 1]V . Assume also that Z ∼ N (0, IdRp ), and define, as above, an approximation of
the conditional distribution of Z given X = x as a Gaussian with mean µ(x, w) and
covariance matrix S(x, w)². Then, the joint density of X and Z (with respect to the product of the counting measure on {0,1}^V and Lebesgue's measure on R^p) satisfies
log f_{X,Z}(x, z ; θ) = Σ_{s∈V} (x^{(s)} log g^{(s)}(z, θ) + (1 − x^{(s)}) log(1 − g^{(s)}(z, θ))) + log ϕ_N(z ; 0, Id_{R^p}).
Similarly to the methods discussed so far, GANs [82] use a one-step nonlinear generator X = g(Z, θ), with θ ∈ R^K, to model observed data (we here switch back to a deterministic relation), where Z has a known distribution, with p.d.f. f_Z, for example Z ∼ N(0, Id_{R^p}). However, unlike the exact or approximate likelihood maximizations that were discussed in sections 19.1 and 19.2, GANs use a different criterion for estimating the parameter θ, minimizing metrics that can be approximated by optimizing a classifier. The classifier is a function x ↦ f(x, w), parametrized by w ∈ R^M, whose goal is to separate simulated samples from real ones: it takes values in [0, 1] and estimates the (posterior) probability that its input x is real. The adversarial paradigm in GANs consists in estimating θ and w together so that generated data, using θ, are indistinguishable from real ones using the optimal w. Their basic structure is summarized in figure 19.1.
Figure 19.1: Basic structure of GANs: W is optimized to improve the prediction problem "real data" vs. "simulation"; given W, θ is optimized to worsen the prediction.
Let Pθ denote the distribution of g(Z, θ), and Ptrue the target distribution of real data,
represented by the variable X. One can formalize the “real data” vs. “simulation”
problem with a pair of random variables (Xθ , Y ) where Y follows a Bernoulli distri-
bution with parameter 1/2, and P (Xθ ∈ A | Y = y) is Ptrue (A) when y = 1 and Pθ (A)
when y = 0. Given a loss function r : {0, 1} × [0, 1] → [0, +∞), one can define
U(θ, w) = E(r(Y, f(X_θ, w)))
and
U*(θ) = min_{w∈R^M} U(θ, w).
We want to maximize U* or, equivalently, solve the optimization problem
θ* = argmax_{θ∈R^K} min_{w∈R^M} U(θ, w).
Note that
2U(θ, w) = E(r(1, f(X, w))) + E(r(0, f(X_θ, w))),
so that choosing the cost requires specifying the two functions t ↦ r(1, t) and t ↦ r(0, t). In Goodfellow et al. [82], they are:
r(1, t) = − log t,   r(0, t) = − log(1 − t).   (19.9)
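In code, the two losses of (19.9) take the following form (a PyTorch sketch; f_real and f_fake are assumed to stand for classifier outputs f(X, w) and f(g(Z, θ), w), both in (0, 1)).

```python
import torch

def discriminator_loss(f_real, f_fake):
    # 2U(theta, w) = E(r(1, f(X, w))) + E(r(0, f(X_theta, w))), with (19.9);
    # w descends this quantity, i.e., ascends the classification criterion
    return -(torch.log(f_real).mean() + torch.log(1.0 - f_fake).mean())

def generator_loss(f_fake):
    # theta is chosen to worsen the classification: minimizing log(1 - f_fake)
    # pushes f(g(Z, theta), w) toward 1 ("real")
    return torch.log(1.0 - f_fake).mean()
```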
19.3.3 Algorithm
Let F be the family of all measurable functions f : R^d → [0, 1]. Given two random variables X1, X2 : Ω → R^d, with respective distributions P1, P2 (so that P(X_i ∈ A) = P_i(A)), consider the function
D(P1, P2) = 2 log 2 + max_{f∈F} (E(log f(X1)) + E(log(1 − f(X2)))).
Assume that X1 (resp. X2) has a p.d.f. g1 (resp. g2) with respect to some measure µ. Then
E(log f(X1)) + E(log(1 − f(X2))) = ∫_{R^d} (g1 log f + g2 log(1 − f)) dµ,
which is maximal at f* = g1/(g1 + g2). For this f*,
2 log 2 + E(log f*(X1)) + E(log(1 − f*(X2))) = ∫_{R^d} g1 log(2g1/(g1 + g2)) dµ + ∫_{R^d} g2 log(2g2/(g1 + g2)) dµ
= KL(g1 ‖ (g1 + g2)/2) + KL(g2 ‖ (g1 + g2)/2).
This last expression is the Jensen-Shannon divergence between g1 and g2 (cf. section 12.2). One can then define
D̂(P1, P2) = max_{w∈R^M} (E(log f(X1, w)) + E(log(1 − f(X2, w)))).
This discussion suggests that new types of GAN’s may be designed using other
discrepancy functions between probability distributions, provided that they can be
expressed in terms of the maximization of some quantity over some space of func-
tions. Consider, for example, the norm in total variation, defined by (for discrete
distributions)
Dvar(P1, P2) = (1/2) Σ_x |P1(x) − P2(x)|,
or, in the general case, Dvar(P1, P2) = max_A (P1(A) − P2(A)).
If F is the space of continuous functions f : R^d → [0, 1], then we also have (cf. proposition 12.3)
Dvar(P1, P2) = max_{f∈F} (E(f(X1)) − E(f(X2))).
Since neural nets typically generate continuous functions with values in [0, 1], one could train GANs by maximizing
D̂var(P1, P2) = max_{w∈R^M} (E(f(X1, w)) − E(f(X2, w))).
However, the total variation distance can be too crude as a way to compare probability distributions, especially when these distributions have atoms (points x such that P_i({x}) > 0). For example, the total variation distance between two Dirac distributions at, say, x1 and x2 in R^d is always 1, unless x1 = x2. As a consequence, if x_n converges to x, with x_n ≠ x, then Dvar(δ_{x_n}, δ_x) ↛ 0.
A better-behaved alternative is the Wasserstein (or Monge-Kantorovich) distance which, by Kantorovich duality, can be written as
D_MK(P1, P2) = max_{f∈F} (E(f(X1)) − E(f(X2))),   (19.10)
where F is now the space of contractive (or 1-Lipschitz) functions. Using the fact that a neural network with all weights bounded by a constant K generates a function whose Lipschitz constant is controlled solely by K, one can then approximate (up to a multiplicative constant) the Wasserstein distance by
D̂_MK(P1, P2) = max_{w∈W} (E(f(X1, w)) − E(f(X2, w))),
where W is the set of all weights bounded by a fixed constant. Wasserstein GANs (WGANs [11]) must then solve the saddle-point problem
min_{θ∈R^K} max_{w∈W} U(θ, w), with U(θ, w) = E(f(X, w)) − E(f(X_θ, w)),
with an algorithm similar to that described earlier.
Remark 19.1 As a final reference, we note the “improved WGAN” algorithm intro-
duced in Gulrajani et al. [84] in which the boundedness constraint in the weights
is replaced by an explicit control of the derivative in x of the function f . Given in-
dependent X1 and X2 with respective distributions P1 , P2 , define a random variable
Z = (1 − U )X1 + U X2 where U is uniformly distributed over [0, 1] and (U , X1 , X2 )
are independent. Then, Gulrajani et al. [84] use the following approximation of the
Wasserstein distance between P1 and Pθ :
D̂_MK(Ptrue, Pθ) = max_{w∈W} ( E_true(f(X, w)) − E_θ(f(X, w)) − Ẽ_θ((|∂_z f(Z, w)| − 1)²) ).
This approximation is justified by the fact that optimal solutions of (19.10) satisfy
(almost surely for the optimal coupling)
f (y) − f (x) = |x − y|
(see Villani et al. [203], theorem 5.10 and Gulrajani et al. [84]).
The discussions in sections 19.2 and 19.3 can be applied to sequences of structural equations (describing finite Markov chains) in the form
Z_0 = ξ^0
Z_{k+1} = g(Z_k, ξ^k ; θ_k),  k = 0, . . . , m − 1
X = Z_m
where ξ^0, . . . , ξ^{m−1} are random variables with fixed distribution. Indeed, letting Z̃ = (ξ^0, . . . , ξ^{m−1}) and θ̃ = (θ_0, . . . , θ_{m−1}), the whole system reduces to a function X = G(Z̃, θ̃) of the type considered in these sections. This representation, however, includes a large number of hidden variables, and it is unclear whether much improvement can be added to the case m = 1 to justify the additional computational load.
While direct Markov chain modeling may have a limited appeal, reversed Markov
chains use a different generative approach in that they first model a forward Markov
chain Zn , n ≥ 0 which is ergodic with known (and easy to sample from) limit distri-
bution Q∞ , and initial distribution Qtrue , the true distribution of the data. If one
fixes a large enough number of steps, say, τ, then it is reasonable to assume that
Zτ approximately follows the limit distribution, Q∞ . One can then (approximately)
sample from Qtrue by sampling Z̃0 according to Q∞ and then applying τ steps of the
time-reversed Markov chain.
Reversed chains were discussed in section 13.3.3. Assuming that Qtrue and P(z, ·) have a density with respect to a fixed measure µ on R_Z, we found that Z̃_k = Z_{τ−k} is a non-homogeneous Markov chain whose transition probability P̃_k(x, A) = P(Z̃_{k+1} ∈ A | Z̃_k = x) has density
p̃_k(x, y) = p(y, x) q_{τ−k−1}(y) / q_{τ−k}(x)
with respect to µ, where q_n is the p.d.f. of Q_n = Qtrue P^n, the distribution of Z_n.
The distributions Q_n, n ≥ 0, are unknown, since they depend on the data distribution Qtrue, and the transition probabilities above must be estimated from data to provide a sampling algorithm for the reversed Markov chain. While, at first
glance, this does not seem like a simplification of the problem, because one now has
to sample from a potentially large number (τ) of distributions instead of one, this
leads, with proper modeling and some intensive learning, to powerful generative
models.
To make this approach more efficient, the forward chain should be making small
changes to the current configuration at each step (e.g., adding a small amount of
noise). This ensures that the reversed transition probabilities p̃k (x, ·) are close to
Dirac distributions and are therefore likely to be well approximated by simple uni-
modal distributions such as Gaussians. Importantly, the estimation problem does
not have hidden data: given an observed sample, one can simulate τ steps of the
forward chain to obtain, after reversing the order, a full observation of the reversed
chain. Moreover, in some cases, analytical considerations can lead to partial compu-
tations that facilitate the modeling of the reversed transitions.
We now take some examples, starting with a discrete case. Let Qtrue be the distribution of a binary random field with state space {0,1} over a set of vertexes V, i.e., with the notation of section 14.2, R_X = F(V) with F = {0,1}. Fix a small ε > 0 and define the transition probability p(x, y) for x, y ∈ F(V) by
p(x, y) = Π_{s∈V} ((1 − ε) 1_{y(s)=x(s)} + ε 1_{y(s)=1−x(s)}).
Since p(x, y) > 0 for all x and y, the chain converges (uniformly geometrically) to its
invariant probability Q∞ and one easily checks that this probability is such that all
variables are independent Bernoulli random variables with success probability 1/2.
Assuming that τ is large enough so that Q_τ ≈ Q_∞, the sampling algorithm initializes
the reversed chain as independent Bernoulli(1/2) variables and runs τ steps using
the transitions p̃k , which must be learned from data.
To first order in ε, this implies that q_k(x) = q_{k−1}(x) + O(ε), and the relation between consecutive marginals can be reversed as
q_{k−1}(y) = (1 + Nε) q_k(y) − ε Σ_{x : x∼y} q_k(x) + O(ε²),
where N = |V| and x ∼ y means that x and y differ at exactly one vertex.
Similarly, we have
p(y, x) = (1 − Nε) 1_{x=y} + ε 1_{x∼y} + O(ε²).
This gives
p(y, x) q_{k−1}(y) = q_k(x) 1_{x=y} − ε 1_{x=y} Σ_{x′ : x′∼y} q_k(x′) + ε q_k(y) 1_{x∼y} + O(ε²),
so that one checks easily that p̂_k(x, y) = p̃_k(x, y) + O(ε²). This suggests modeling the reversed chain using transitions p̂_k, for which the mapping x ↦ (σ_k^{(s)}(x), s ∈ V) needs to be learned from data (for example using a deep neural network). Note that 1 − σ_k(x) is precisely the score function introduced for discrete distributions in remark 18.8.
A Σ_h + Σ_h A − (h/2) A² Σ_h − 2 Id_{R^d} = 0,
whose solution is Σ_h = (A − hA²/4)^{−1} (details being left to the reader). This implies that this limit distribution can be easily sampled from for any choice of A.
We now return to general f's and make, as in the discrete case, a first-order identification of the reversed chain. We note that, for any smooth function γ,
E(γ(X_{n+1}) | X_n = x) = E(γ(x + h f(x) + √h U)),
where U ∼ N(0, Id_{R^d}). Making the second-order expansion
γ(x + h f(x) + √h U) = γ(x) + √h ∇γ(x)^T U + h ∇γ(x)^T f(x) + (h/2) U^T ∇²γ(x) U + o(h)
and taking the expectation gives
E(γ(X_{n+1}) | X_n = x) = γ(x) + h ∇γ(x)^T f(x) + (h/2) Δγ(x) + o(h).   (19.11)
Considering the reversed chain, and letting q_k denote the p.d.f. of X_k for the forward chain, we have
E(γ(X_{k−1}) | X_k = x) = ∫_{R^d} γ(y) p̃_k(x, y) dy
= ∫_{R^d} γ(y) p(y, x) q_{k−1}(y)/q_k(x) dy
= (2πh)^{−d/2} ∫_{R^d} γ(y) (q_{k−1}(y)/q_k(x)) e^{−|x−y−hf(y)|²/2h} dy
= (2π)^{−d/2} ∫_{R^d} γ(x − √h u) (q_{k−1}(x − √h u)/q_k(x)) e^{−½|u−√h f(x−√h u)|²} du,
with the change of variable u = (x − y)/√h. We make a first-order expansion of the terms in this integral, with
terms in this integral, with
√ √ √ h
γ(x − hu)qk−1 (x − hu) = γ(x)qk−1 (x) − h∇(γqk−1 )(x)T u + u T ∇2 (γqk−1 )(x)u + o(h)
2
and
1
√ √ 1
√
hf (x− hu)|2 2 hu T f (x)−hu T df (x)u− 12 |f (x)|2 +o(h)
e− 2 |u− = e− 2 |u| e
√
!
− 12 |u|2 h h
=e 1 + hu T f (x) − hu T df (x)u − |f (x)|2 + |u T f (x)|2 + o(h) .
2 2
Taking products,
γ(x − √h u) q_{k−1}(x − √h u) e^{−½|u−√h f(x−√h u)|²}
= e^{−½|u|²} γ(x) q_{k−1}(x) (1 + √h u^T f(x) − h u^T df(x) u − (h/2)|f(x)|² + (h/2)|u^T f(x)|²)
+ e^{−½|u|²} (−√h ∇(γq_{k−1})(x)^T u − h (∇(γq_{k−1})(x)^T u)(f(x)^T u) + (h/2) u^T ∇²(γq_{k−1})(x) u) + o(h).
We now take the integral with respect to u (recall that E(U^T A U) = trace(A) if A is any square matrix and U is standard Gaussian), so that
(2π)^{−d/2} ∫_{R^d} γ(x − √h u) q_{k−1}(x − √h u) e^{−½|u−√h f(x−√h u)|²} du
= γ(x) q_{k−1}(x) + h (−γ(x) q_{k−1}(x) ∇·f(x) − ∇(γq_{k−1})(x)^T f(x) + (1/2) Δ(γq_{k−1})(x)) + o(h)
= q_{k−1}(x) ( γ(x) + h ( −γ(x) ∇·f(x) − ∇(γq_{k−1})(x)^T f(x)/q_{k−1}(x) + (1/2) Δ(γq_{k−1})(x)/q_{k−1}(x) ) ) + o(h).
We now take the first-order expansion of the ratio, removing terms that cancel, and get
E(γ(X_{k−1}) | X_k = x) = γ(x) − h ∇γ(x)^T f(x) + h ∇γ(x)^T ∇q_{k−1}(x)/q_{k−1}(x) + (h/2) Δγ(x) + o(h).
Comparing with (19.11), we find that X̃_k = X_{τ−k} behaves, for small h, like the non-homogeneous Markov chain such that the conditional distribution of X̃_{k+1} given X̃_k = x is N(x − h f(x) − h s_{τ−k−1}(x), h Id_{R^d}), with s_{τ−k−1}(x) = −∇ log q_{τ−k−1}(x), the score function introduced in section 18.2.6. One can therefore use score-matching methods from that section to estimate this distribution from observations of the forward chain initialized with training data.
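The resulting sampler is summarized in the following numpy sketch; the Gaussian score s_t(x) = x used here is a toy stand-in for the learned score network, and f = 0 for simplicity.

```python
import numpy as np

d, h, tau = 2, 0.01, 500
rng = np.random.default_rng(3)

def score(x, t):
    # s_t(x) = -grad log q_t(x); here q_t = N(0, Id) for all t, a toy stand-in
    # for the score-matching estimate of the forward marginals
    return x

x = rng.standard_normal((1000, d))             # initialize from Q_infty
for k in range(tau):
    noise = np.sqrt(h) * rng.standard_normal(x.shape)
    x = x - h * score(x, tau - k - 1) + noise  # reversed step, with f = 0
```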
The forward schemes described in the previous examples can be interpreted as continuous-time processes over discrete or continuous variables. In the latter case, the example X_{k+1} ∼ N(x + h f(x), h Id_{R^d}) conditionally to X_k = x is a discretization of the stochastic differential equation
dX_t = f(X_t) dt + dw_t
(see remark 13.4), where w_t is a Brownian motion and the diffusion is initialized with Qtrue.
with Qtrue . We found that going backward meant (at first order and conditionally to
Xk = x) √
Xk−1 ∼ N (x − hf (x) − hsk−1 (x), hId)
19.4. REVERSED MARKOV CHAIN MODELS 461
where w̃t is also a Brownian motion. This reverse diffusion with Xτ ∼ Q∞ will there-
fore approximately sample from Qtrue . (With this terminology, forward and reverse
diffusions have similar differential notation, but mean different things.) Note that,
in the continuous-time limit, the reverse Markov process follows the distribution of
the reversed diffusion exactly.
As we have seen in the previous two examples, estimating the reversed Markov chain requires computing the score functions of the forward probabilities. In the case of continuous variables, this score function is typically parametrized as a neural network, so that the function s_k(x) = −∇ log q_k(x) is computed as s_k(x) = F(x; W_k), with the usual definition F(x, W_k) = z_{m+1}, where z_{j+1} = ϕ_j(z_j, w_j^k), z_0 = x and W_k = (w_0^k, . . . , w_m^k).
Assume that a training set T is observed. Running the forward Markov chain
initialized with elements of T generates a new training set at each time step, that we
will denote Tk at step k. We have seen in section 18.2.6 that the score function sk
could be estimated by minimizing, with respect to W
Σ_{x∈T_k} (|F(x, W)|² − 2 ∇·F(x, W)).
This term involves the differential of F, which is defined recursively by (simply taking the derivative at each step)
ζ_{j+1} = ∂_z ϕ_j(z_j, w_j^k) ζ_j.
One can note that, for any h ∈ R^d, the vector dF(x, W)h also satisfies this recursion, with ζ_0 h = h, and
∇·F(x, W) = Σ_{i=1}^d e_i^T dF(x, W) e_i.
Clustering
20.1 Introduction
We now describe a collection of methods designed to divide a training set into ho-
mogeneous subsets, or clusters. This grouping operation is a key problem in many
applications for which it is important to categorize the data in order to obtain im-
proved understanding of the sampled phenomenon, and sometimes to be able to
apply a different approach to subsequent processing or analysis adapted to each
cluster.
(i) The simplest case is when R = R^d with the standard Euclidean metric. Slightly more generally, a metric may be defined by ρ²(x, y) = ‖h(x) − h(y)‖²_H, where H is an inner-product space and the feature function h : R → H may be unknown, while its associated "kernel", K(x, y) = ⟨h(x), h(y)⟩_H, is known (this is a metric if h is one-to-one). In this case
ρ²(x, y) = K(x, x) − 2K(x, y) + K(y, y).
(ii) When the data belong to a curved space, a natural metric is provided by the length of shortest paths linking two points (assuming of course that this notion can be given a rigorous meaning). The simplest example is data on the unit sphere, where the distance ρ(x, y) between two points x and y is the length of the shortest great-circle arc that connects them, satisfying cos ρ(x, y) = x^T y.
(iii) On the space of symmetric positive definite matrices, one may use ρ²(S1, S2) = Σ_{i=1}^d (log λ_i)², where λ1, . . . , λd are the eigenvalues of S1^{−1/2} S2 S1^{−1/2} or, equivalently, solutions of the generalized eigenvalue problem S2 u = λ S1 u (see, for example, [72]).
(iv) Another common assumption is that the elements of R are vertices of a weighted
graph of which T is a subgraph; ρ may then be, e.g., the geodesic distance on the
graph.
20.2 Hierarchical clustering and dendograms

20.2.1 Partition trees

This method builds clusters by organizing them in a binary hierarchy in which the data is divided into subsets, starting with the full training set, and iteratively splitting each subset into two parts until reaching singletons. This results in a binary tree structure, called a dendogram, or partition tree, which is defined as follows.
Definition 20.1 A partition tree of a finite set A is a finite collection of nodes T with the following properties.
(i) Each node has either zero or exactly two children. (We will use the notation v → v′ to indicate that v′ is a child of v.)
(ii) All nodes but one have exactly one parent. The node without a parent is the root of the tree.
(iii) To each node v ∈ T is associated a subset A_v ⊂ A.
[Example: a partition tree with root 1: {a, b, c, d, e, f} and children 2: {a, c, f} and 3: {b, d, e}.]
Nodes without children are called leaves, or terminal nodes. We will say that the hierarchy is complete if A_v = A when v is the root, and |A_v| = 1 for all terminal nodes.
The construction of the tree can follow two directions, the first one being bottom-
up, or agglomerative, in which the algorithm starts with the collection of all single-
tons and merges subsets one pair at a time until everything is merged into the full
dataset. The second approach is top-down, or divisive, and initializes the algorithm
with the full training set which is recursively split until singletons are reached. The
first approach, on which we now focus, is more common, and computationally sim-
pler.
20.2.2 Bottom-up construction

We let T denote the training set and assume that a matrix of dissimilarities (α(x, y), x, y ∈ T) is given. We will make the abuse of notation of considering that T is a set even though some of its elements may be repeated. This is no loss of generality, since T = (x_1, . . . , x_N) can always be replaced by the subset {(k, x_k), k = 1, . . . , N} of N × R.
Algorithm 20.1
1. Start with the collection T1 , . . . , TN of all single-node trees associated to each
element of T . Let n = 0 and m = N .
2. Assume that, at step n of the algorithm, one has a collection of partition trees
T1 , . . . , Tm with root nodes r1 , . . . , rm associated with subsets Ar1 , . . . , Arm of T . Let the
total collection of nodes be indexed as Vn = {v1 , . . . , vN +n }, so that {r1 , . . . , rm } ⊂ Vn .
3. If m = 1, stop the algorithm.
4. Select indices i, j ∈ {1, . . . , m} such that ϕ(Ari , Arj ) is minimal, and merge the
corresponding trees by creating a new node vn+1+N with the root nodes of Ti and Tj
as children (so that vn+1+N is associated with Ari ∪ Arj ). Add vn+1+N to the collection
of root nodes, and remove ri and rj .
5. Set n → n + 1 and m → m − 1 and return to step 2.
Clearly, the specification of the extended dissimilarity measure ϕ is a key element of the method. Some of the most commonly used extensions are listed below (a code sketch follows the list).
• Average dissimilarity:
ϕ_avg(A, A′) = (1/(|A| |A′|)) Σ_{x∈A} Σ_{x′∈A′} α(x, x′).
• Maximum dissimilarity (complete linkage):
ϕ_max(A, A′) = max{α(x, x′) : x ∈ A, x′ ∈ A′}.
• Minimum dissimilarity (single linkage):
ϕ_min(A, A′) = min{α(x, x′) : x ∈ A, x′ ∈ A′}.
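A compact numpy sketch of algorithm 20.1 with single linkage (ϕ_min); replacing min by max or a mean gives ϕ_max or ϕ_avg. The function names and data are illustrative assumptions.

```python
import numpy as np

def agglomerate(alpha, until=1):
    clusters = [[i] for i in range(alpha.shape[0])]   # start with singletons
    merges = []
    while len(clusters) > until:
        best, bi, bj = np.inf, 0, 1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):     # pair with minimal phi_min
                v = min(alpha[k, l] for k in clusters[i] for l in clusters[j])
                if v < best:
                    best, bi, bj = v, i, j
        merges.append((clusters[bi], clusters[bj], best))
        clusters[bi] = clusters[bi] + clusters[bj]    # merge and remove
        del clusters[bj]
    return clusters, merges

rng = np.random.default_rng(4)
pts = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(3, 0.3, (5, 2))])
alpha = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(agglomerate(alpha, until=2)[0])
```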
As shown in the next two propositions, the maximum distance favors clusters
with small diameters, while using minimum gaps tends to favor connected clusters.
Proposition 20.2 Let diam(A) = max(α(x, y), x, y ∈ A). The agglomerative algorithm
using ϕmax is identical to that using ϕ(A, A0 ) = diam(A ∪ A0 ).
Proof Call Algorithm 1 the agglomerative algorithm using ϕ_max, and Algorithm 2 the one using ϕ. At initialization, we have (because all sets are singletons)
ϕ_max(A_i, A_j) = diam(A_i ∪ A_j) for all i ≠ j.   (20.1)
We show that this property remains true at all steps of the algorithms. Proceeding by induction, assume that, up to step n, Algorithms 1 and 2 have been identical and result in sets (A_1, . . . , A_m) satisfying (20.1). Then the next steps of the two algorithms coincide; assume, without loss of generality, that this next step merges A_{m−1} with A_m. Let A′_{m−1} = A_{m−1} ∪ A_m, so that diam(A′_{m−1}) ≤ diam(A_i ∪ A_j) for all 1 ≤ i ≠ j ≤ m.
We need to show that the new partition satisfies (20.1), which requires that
ϕ_max(A′_{m−1}, A_k) = diam(A′_{m−1} ∪ A_k)
for k = 1, . . . , m − 2. We have
diam(A′_{m−1} ∪ A_k) = max(diam(A′_{m−1}), diam(A_k), ϕ_max(A′_{m−1}, A_k)),
so that we must show that
max(diam(A′_{m−1}), diam(A_k)) ≤ ϕ_max(A′_{m−1}, A_k).
Write
ϕ_max(A′_{m−1}, A_k) = max(ϕ_max(A_m, A_k), ϕ_max(A_{m−1}, A_k)) = max(diam(A_m ∪ A_k), diam(A_{m−1} ∪ A_k)),
where the last identity results from the induction hypothesis. Since diam(A_k) ≤ diam(A_m ∪ A_k), and since, by minimality of the merged pair, diam(A′_{m−1}) ≤ diam(A_m ∪ A_k), the required inequality follows, which completes the induction.
We now analyze ϕmin and, more specifically, the equivalence between the result-
ing algorithm and the one using the following measure of connectedness. For a given
set A and x, y ∈ A, let
α̃_A(x, y) = inf{ε : ∃n > 0, ∃(x = x_0, x_1, . . . , x_{n−1}, x_n = y) ∈ A^{n+1} : α(x_i, x_{i−1}) ≤ ε for 1 ≤ i ≤ n}.
So α̃_A(x, y) is the smallest ε such that there exists a sequence of steps of size less than ε in A going from x to y. The function
conn(A) = max{α̃_A(x, y) : x, y ∈ A}
measures how well the set A is connected relative to the dissimilarity measure α,
and we have:
Proposition 20.3 The agglomerative algorithm using ϕ_min is identical to that using ϕ(A, A′) = conn(A ∪ A′).
Proof The proof is similar to that of proposition 20.2. Indeed one can note that
conn(A ∪ A′) = max(conn(A), conn(A′), ϕ_min(A, A′)).
Given this, we can proceed by induction and prove that, if the current decomposition is A_1, . . . , A_m, with conn(A_k ∪ A_l) = ϕ_min(A_k, A_l) for all 1 ≤ k ≠ l ≤ m, then this property is still true after merging using ϕ_min and ϕ.
Assuming again that A_{m−1} and A_m are merged, and letting A′_{m−1} = A_m ∪ A_{m−1}, we need to show that conn(A_k ∪ A′_{m−1}) = ϕ_min(A_k, A′_{m−1}) for all k = 1, . . . , m − 2, which is the same as showing that
max(conn(A_k), conn(A′_{m−1})) ≤ ϕ_min(A_k, A′_{m−1}) = min(ϕ_min(A_k, A_{m−1}), ϕ_min(A_k, A_m)).
From the induction hypothesis, we have
min(ϕ_min(A_k, A_{m−1}), ϕ_min(A_k, A_m)) = min(conn(A_k ∪ A_{m−1}), conn(A_k ∪ A_m)),
and both terms in the right-hand side are larger than conn(A_k), and also larger than conn(A′_{m−1}), which was obtained as a minimizer.
20.2.3 Top-down construction
The agglomerative method is the most common way to build dendograms, mostly because of the simplicity of the construction algorithm. The divisive approach is more complex, because the division step, which requires, given a set A, optimizing a splitting criterion over all two-set partitions of A, may be significantly more expensive than the merging steps of the agglomerative algorithm. The top-down construction therefore requires the specification of a "splitting algorithm" σ : A ↦ (A′, A″) such that (A′, A″) is a partition of A. We assume that, if |A| > 1, then the partition (A′, A″) is not trivial, i.e., neither set is empty.
Algorithm 20.2
1. Start with the one-node partition tree T_0 = (T).
2. Assume that at a given step of the algorithm, the current partition tree is T.
3. If T is complete, stop the algorithm.
4. For each terminal node v in T such that |A_v| > 1, compute (A′_v, A″_v) = σ(A_v) and add two children v′ and v″ to v, with A_{v′} = A′_v and A_{v″} = A″_v.
5. Return to step 2.
The division of a set into two parts is itself a clustering algorithm, and one may apply
any of those described in the rest of this chapter.
20.2.4 Thresholding
Let V_T denote the set of terminal nodes in T and V_0 = V \ V_T contain the interior nodes. Define a pruning set to be a subset D ⊂ V_0 that contains no pair of nodes v, v′ such that v′ is a descendant of v. To any pruning set D, one can associate the pruned subtree T(D) of T, consisting of T from which all the vertices that are descendants of elements of D are removed. From any such pruned subtree, one obtains a partition S(D) of the dataset, formed by the collection of sets A_v for v in the terminal nodes of T(D). Between the extreme cases S({v_0}) (where v_0 is the root of T), which is a single cluster, and S(∅), which is the collection of all leaf sets, there is a huge number of possible partitions obtained in this way.
Scores can also be built bottom-up, letting h_w(v) = 0 for terminal nodes and, for v ∈ V_0,
h_w(v) = max(h_w(v′) + w(v, v′), h_w(v″) + w(v, v″)),
where v′, v″ are the children of v. Here, taking w = 1 provides the height of each node.
20.3 K-medoids and K-mean

20.3.1 K-medoids

Suitable dissimilarity measures can indeed be defined based on the heuristic that clusters should be homogeneous (for some criterion) and far apart from each other.
Given a dissimilarity α, define the central dispersion of a set A by
V_α(A) = inf{ Σ_{x∈A} α(x, c) : c ∈ R }.   (20.2)
A centroid, c, in (20.2) may not always exist, and when it exists it may not always be unique. For α = ρ², a point c achieving the infimum, i.e., such that
V_α(A) = Σ_{x∈A} ρ²(x, c),
is called a Fréchet mean of the set A. Returning to the examples provided at the beginning of this chapter, two antipodal points on the sphere (whose distance is π) have an infinity of Fréchet means (or midpoints in this case) provided by every point on the equator between them. In contrast, the example provided with symmetric matrices is a so-called Hadamard space [43], and the Fréchet mean in that case is unique. Of course, for Euclidean metrics, the Fréchet mean is just the usual one.
Returning to our general discussion, the K-medoids method optimizes the sum of central dispersions with a fixed number of clusters. Note that the letter K in K-medoids originally refers to this number of clusters, but this notation conflicts with other notation in this book (e.g., reproducing kernels) and we shall denote by p this number.¹ The method minimizes
W_α(A_1, . . . , A_p, c_1, . . . , c_p) = Σ_{i=1}^p Σ_{x∈A_i} α(x, c_i)   (20.3)
over all partitions A_1, . . . , A_p of T and c_1, . . . , c_p ∈ R. Taking first the minimum with respect to the partition, which corresponds to associating each x to the subset with the closest center, yields the equivalent formulation of minimizing
W̃_α(c_1, . . . , c_p) = Σ_{x∈T} min{α(x, c_i), i = 1, . . . , p}.
It should be clear that each step reduces the total cost Wα and that this cost
should stabilize at some point (which provides the stopping criterion) because there
is only a finite number of possible partitions of T . However, there can be many
possible limit points that are stable under the previous iterations, and some may
correspond to poor “local minima” of the objective function. Since the end-point of
the algorithm depends on the initialization, this step requires extra care. One may
design ad-hoc heuristics in order to start the algorithm with a good initial point that
is likely to provide a good solution at the end. These heuristics may depend on the
1 We still call the method K-medoids rather than p-medoids, to keep the name universally used in
the literature.
problem at hand, or use a generic strategy. As a common example of the latter, one may ensure that the initial centers are sufficiently far apart by picking c_1 at random, c_2 as far as possible from c_1, c_3 maximizing the sum of distances to c_1 and c_2, etc. One also typically runs the algorithm several times with random initial conditions and selects the best solution over these multiple runs.
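The following numpy sketch combines this far-apart initialization with the alternating K-medoids updates; it is an illustration (centers are constrained to be data points, so only the dissimilarity matrix is needed), not a reference implementation.

```python
import numpy as np

def k_medoids(alpha, p, n_iter=50):
    centers = [0]
    for _ in range(p - 1):                    # far-apart initialization
        centers.append(int(np.argmax(alpha[:, centers].sum(axis=1))))
    for _ in range(n_iter):
        labels = np.argmin(alpha[:, centers], axis=1)   # assign to closest center
        new_centers = []
        for i in range(p):                              # medoid update step
            members = np.where(labels == i)[0]
            if members.size == 0:                       # keep center if cluster empty
                new_centers.append(centers[i])
                continue
            sub = alpha[np.ix_(members, members)]
            new_centers.append(int(members[np.argmin(sub.sum(axis=1))]))
        if new_centers == centers:                      # cost has stabilized
            break
        centers = new_centers
    return centers, labels
```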
where θ contains the weights, α_1, . . . , α_p, the means, c_1, . . . , c_p, and the covariance matrices Σ_1, . . . , Σ_p (we create, hopefully without risk of confusion, a short-lived conflict of notation between the weights and the dissimilarity function). The posterior class probabilities are
f_Z(i|x ; θ) = α_i (det Σ_i)^{−1/2} e^{−(x−c_i)^T Σ_i^{−1} (x−c_i)/2} / Σ_{j=1}^p α_j (det Σ_j)^{−1/2} e^{−(x−c_j)^T Σ_j^{−1} (x−c_j)/2},  i = 1, . . . , p.
In the special case in which all variances are fixed and equal to σ 2 IdRd , and all
prior class probabilities are equal to 1/p (see remark 17.3), the EM algorithm for mix-
tures of Gaussian is also called “soft K-means”, because it replaces the “hard” cluster
assignments in K-means by “soft” ones represented by the update of the posterior
distribution. We repeat its definition here for completeness (where θ = (c1 , . . . , cp )).
Remark 20.5 We note that, if a K-means, soft K-means or MoG algorithm has been
trained on a training set T , it is then easy to assign a new sample x̃ to one of the
clusters. Indeed, for K-means, it suffices to determine the center closest to x̃, and
for the other methods to maximize fZ (j|x̃, θ), which is computable given the model
parameters. In contrast, there was no direct way to do so using hierarchical cluster-
ing.
We now consider the soft K-means algorithm in feature space, and introduce fea-
tures hk = h(xk ) in an inner product space H such that hhk , hl iH = K(xk , xl ) for some
positive definite kernel. As usual, the underlying assumption is that the computa-
tion of h(x) does not need to be feasible, while evaluations of K(x, y) are easy. Let us
consider the minimization of
(1/2) Σ_{x∈T} Σ_{j=1}^p f_Z(j|x) ‖h(x) − c_j‖²_H + σ² Σ_{x∈T} Σ_{j=1}^p f_Z(j|x) log f_Z(j|x)
for some σ 2 > 0 (kernel K-means corresponds to taking the limit σ 2 → 0). Given fZ ,
the optimal centers are
1 X
cj = fZ (j|x)h(x)
ζj
x∈T
P
with ζ = x∈T fZ (j|x). They belong to the feature space, H, and are therefore not
computable in general. However, the distance between them and a point h(y) ∈ H is
explicit and given by
‖h(y) − c_j‖²_H = K(y, y) − (2/ζ_j) Σ_{x∈T} f_Z(j|x) K(y, x) + (1/ζ_j²) Σ_{x,x′∈T} f_Z(j|x) f_Z(j|x′) K(x, x′).
This yields the soft kernel K-means algorithm, that we repeat below.
‖h(x) − c_j‖²_H = K(x, x) − (2/ζ_j) Σ_{x′∈T} f_Z(j|x′) K(x, x′) + (1/ζ_j²) Σ_{x′,x″∈T} f_Z(j|x′) f_Z(j|x″) K(x′, x″),
with ζ_j = Σ_{x′∈T} f_Z(j|x′), and
f_Z(j|x) = e^{−‖h(x)−c_j‖²_H/2σ²} / Σ_{j′=1}^p e^{−‖h(x)−c_{j′}‖²_H/2σ²}.
After convergence, the clusters are computed by assigning x to A_i with i = argmax{f_Z(j|x) : j = 1, . . . , p}, making an arbitrary decision in case of a tie. For "hard" K-means (with σ² → 0), step 2 simply updates f_Z(j|x) as the uniform probability on the set of indexes j at which ‖h(x) − c_j‖²_H is minimal.
Applied to data x_1, . . . , x_N in R^d, the K-means algorithm minimizes
W(A) = Σ_{j=1}^p Σ_{k∈A_j} |x_k − c_j|²,
where c_j is the average of the points x_k such that k ∈ A_j. We start with a simple transformation expressing this function in terms of the matrix S_α of square distances.
Σ_{k∈A_j} |x_k − c_j|² = Σ_{k∈A_j} |x_k|² − (1/|A_j|) |Σ_{k∈A_j} x_k|²
= Σ_{k∈A_j} |x_k|² − (1/|A_j|) Σ_{k,l∈A_j} x_k^T x_l
= (1/(2|A_j|)) Σ_{k,l∈A_j} (|x_k|² + |x_l|² − 2 x_k^T x_l)
= (1/(2|A_j|)) Σ_{k,l∈A_j} |x_k − x_l|².
Introduce the vector u_j ∈ R^N with coordinates u_j^{(k)} = 1/√|A_j| for k ∈ A_j and 0 otherwise. Then
(1/(2|A_j|)) Σ_{k,l∈A_j} |x_k − x_l|² = (1/2) u_j^T S_α u_j = (1/2) trace(S_α u_j u_j^T).   (20.4)
Let
Z(A) = Σ_{j=1}^p u_j u_j^T,
so that Z(A) has entries Z^{(k,l)}(A) = 1/|A_j| for k, l ∈ A_j, j = 1, . . . , p, and 0 for all other k, l. Summing (20.4) over j, we get
W(A) = (1/2) trace(S_α Z(A)).
Define on {1, . . . , N} the relation k ∼ j if and only if Z(j, k) > 0. The relation is symmetric, and we just checked that k ∼ k for all k. It is also transitive, from the relation (deriving from Z² = Z)
Z(k, j) = Σ_{i=1}^N Z(k, i) Z(i, j),
which shows (since all terms in the sum are non-negative) that k ∼ i and j ∼ i imply k ∼ j.
Choose k such that Z(k, k) = max{Z(i, i) : i ∈ A_s}. Then, for all i, j ∈ A_s, Z(i, j) ≤ √(Z(i, i) Z(j, j)) ≤ Z(k, k), and (20.5) for j = k yields
Σ_{i∈A_s} Z(k, i)(Z(k, k) − Z(k, i)) = 0,
Note that the number of clusters, |A|, is equal to the trace of Z(A). This shows that minimizing W(A) over partitions with p clusters is equivalent to the constrained optimization problem of minimizing
G(Z) = trace(S_α Z)   (20.6)
Clusters can be immediately inferred from the columns of the matrix Z(A), since they are identical for two indices in the same cluster, and orthogonal to each other for two indices in different clusters. Let z_1(A), . . . , z_N(A) denote the columns of Z(A) and z̄_k(A) = z_k(A)/|z_k(A)|. One has |z̄_k(A) − z̄_l(A)| = 0 if k and l belong to the same cluster, and √2 otherwise.
where
D_α(A) = (1/|A|) Σ_{x,y∈A} α(x, y)   (20.8)
is a (normalized) measure of size, that we will call the α-dispersion of a finite set A.
Remark 20.8 Instead of using dissimilarities, some algorithms are more naturally
defined in terms of similarities. Given such a similarity measure, say, β, one must
maximize rather than minimize the index ∆β (which becomes, rather than a measure
of dispersion, a measure of concentration).
20.4 Spectral clustering

One refers to as spectral methods algorithms that rely on computing eigenvectors and eigenvalues (the spectrum) of data-dependent matrices. In the case of minimizing discrepancies, they can be obtained by further simplifying (20.7), essentially by removing constraints.
Writing a symmetric matrix Z in the form
Z = Σ_{j=1}^N ξ_j e_j e_j^T,
where (e_1, . . . , e_N) is an orthonormal basis of R^N, one minimizes
Σ_{j=1}^N ξ_j e_j^T S_α e_j   (20.9)
subject to 0 ≤ ξ_j ≤ 1 and Σ_{j=1}^N ξ_j = p. First consider the minimization with respect to the basis, fixing ξ. There is obviously no loss of generality in requiring that ξ_1 ≤ ξ_2 ≤ · · · ≤ ξ_N, and using corollary 2.4 (adapted to minimizing (20.9) rather than maximizing it) we know that an optimal basis is given by the eigenvectors of S_α, ordered with non-decreasing eigenvalues. Letting λ_1 ≤ · · · ≤ λ_N denote these eigenvalues, we find that ξ_1, . . . , ξ_N must be a non-decreasing sequence minimizing
Σ_{j=1}^N λ_j ξ_j
subject to 0 ≤ ξ_j ≤ 1 and Σ_{j=1}^N ξ_j = p. The optimal solution is obtained by taking ξ_j = 1 for j ≤ p and ξ_j = 0 for j > p: indeed, for any admissible ξ,
Σ_{j=1}^N λ_j ξ_j − Σ_{j=1}^p λ_j ≥ λ_{p+1} Σ_{j=p+1}^N ξ_j + Σ_{j=1}^p λ_j (ξ_j − 1)
= λ_{p+1} Σ_{j=1}^p (1 − ξ_j) + Σ_{j=1}^p λ_j (ξ_j − 1)
= Σ_{j=1}^p (λ_{p+1} − λ_j)(1 − ξ_j)
≥ 0.
The following algorithm (similar to that discussed in [64]) summarizes this dis-
cussion.
If one also accounts for the constraint Z 1 = 1 (which is satisfied by Z(A)), Z decomposes as
Z = Σ_{j=1}^{N−1} ξ_j e_j e_j^T + (1/N) 1 1^T,
where the e_j are orthogonal to 1, and the function to minimize becomes
Σ_{j=1}^{N−1} ξ_j e_j^T S_α e_j + (1/N) 1^T S_α 1.
To achieve this, introduce the projection matrix P = Id_{R^N} − 1 1^T/N and let S̃_α = P S_α P. Then, since u^T 1 = 0 implies u^T S̃_α u = u^T S_α u, it is equivalent to minimize
Σ_{j=1}^{N−1} ξ_j e_j^T S̃_α e_j.
20.5 Graph partitioning

Similarity measures are often associated with graph structures, with the goal of finding a partition of their set of vertices. So, let T denote the set of these vertices and assume that to each pair x, y ∈ T one attributes a weight β(x, y), where β is assumed to be non-negative. We define β for all x, y ∈ T, but we interpret β(x, y) = 0 as marking the absence of an edge between x and y.
Let V denote the vector space of all functions f : T → R (we have dim(V) = |T|). This space can be equipped with the standard Euclidean norm, that we will call in this section the L² norm (by analogy with general spaces of square integrable functions), letting
|f|²₂ = Σ_{x∈T} f(x)².
One can also associate a measure of smoothness for a function f ∈ V by computing the discrete "H¹" semi-norm,
|f|²_{H¹} = Σ_{x,y∈T} β(x, y)(f(x) − f(y))².
With this definition, “smooth functions” tend to have similar values at points x, y
in T such that β(x, y) is large while there is less constraint when β(x, y) is small. In
particular, |f |H 1 = 0 if and only if f is constant on connected components of the
graph.3
One can write (1/2)|f|²_{H¹} = f^T L f, where L, called the Laplacian operator associated with the considered graph, is defined by
Lf(x) = Σ_{y∈T} L(x, y) f(y)
with
L(x, y) = (Σ_{z∈T} β(x, z)) 1_{x=y} − β(x, y).   (20.10)
Assume that the graph has p connected components, C_1, . . . , C_p. The vectors δ_{C_k}, k = 1, . . . , p, are then an orthogonal basis of the null space of L. Conversely, let (e_1, . . . , e_p) be any basis of this null space. Then, there exists an invertible matrix A = (a_{ij}, i, j = 1, . . . , p) such that
e_i(x) = Σ_{j=1}^p a_{ij} δ_{C_j}(x).
Associate to each x ∈ T the vector e(x) = (e_1(x), . . . , e_p(x))^T ∈ R^p. Then, for any x, y ∈ T, we have e(x) = e(y) if and only if δ_{C_j}(x) = δ_{C_j}(y) for all j = 1, . . . , p (because A is invertible),
3 Two nodes x and y are connected in the graph if there is a sequence z_0, . . . , z_n in T such that z_0 = x, z_n = y and β(z_i, z_{i−1}) > 0 for i = 1, . . . , n. This provides an equivalence relation, and equivalence classes are called connected components.
that is, if and only if x and y belong to the same connected component. So, given any basis of the null space of L, the function x ↦ e(x) determines these connected components. A (not very efficient) way of determining the connected components of the graph can therefore be to diagonalize the operator L (written as an N by N matrix, where N = |T|), extract the p eigenvectors e_1, . . . , e_p associated with the eigenvalue zero, and deduce from the function e(x) above the set of connected components.
Now, in practice, the graph associated with T and β will not separate nicely into connected components matching the clusters of the training set. Most of the time, because of noise or some weak connections, there will be only one such component, or in any case far fewer than the number one would expect when clustering the data. The previous discussion suggests, however, that in the presence of moderate noise in the connection weights, one may expect the eigenvectors associated with the p smallest eigenvalues of L to provide vectors e(x), x ∈ T, such that e(x) and e(y) have similar values if x and y belong to the same cluster (see fig. 20.2). In such cases, these clusters should be easy to determine using, say, K-means on the transformed dataset T̃ = (e(x), x ∈ T). This is summarized in the following algorithm.
(1) Form the Laplacian operator described in (20.10) and let e1 , . . . , ep be its eigen-
vectors associated to the p lowest eigenvalues. For x ∈ T , let e(x) ∈ Rp be given
by
e(x) = (e1 (x), . . . , ep (x))T ∈ Rp .
(2) Apply the K-means algorithm (or one of its variants) with p clusters to T̃ =
(e(x), x ∈ T ).
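A numpy sketch of this two-step procedure follows, with a small hand-rolled K-means standing in for a full implementation; all names are illustrative.

```python
import numpy as np

def spectral_embed(beta, p):
    # L = diag(beta 1) - beta, as in (20.10); rows of the output are the e(x)
    L = np.diag(beta.sum(axis=1)) - beta
    vals, vecs = np.linalg.eigh(L)       # eigenvalues in increasing order
    return vecs[:, :p]                   # p eigenvectors with lowest eigenvalues

def kmeans(E, p, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = E[rng.choice(len(E), p, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((E[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([E[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(p)])
    return labels

# usage: labels = kmeans(spectral_embed(beta, p=3), p=3)
```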
20.6 Deciding the number of clusters

The number, p, of subsets into which the population should be partitioned is rarely known a priori, and several methods have been introduced in the literature in order to assess the ideal number of clusters. We now review some of these methods, and denote, for this purpose, by L*(p) the minimized cost function obtained with p clusters, e.g., using (20.3).
Figure 20.2: Example of data transformed using the eigenvectors of the graph Laplacian. Left: Original data. Center: Result of a K-means algorithm with three clusters applied to the transformed data (2D projection). Right: Visualization of the cluster labels on the original data.
A classical heuristic plots L*(p) as a function of p and selects the "elbow" of the graph, i.e., the point after which the decrease of L* slows down. One can measure the "curvature" at the elbow using the distance between each point in the graph of (p, L*(p)) and the line joining its predecessor and successor. The result gives the criterion
C(p) = (L*(p + 1) + L*(p − 1) − 2L*(p)) / √((L*(p + 1) − L*(p − 1))² + 4),
specifying the elbow point as the value of p at which C attains its maximum. For both examples in fig. 20.3, this method returns the correct number of clusters (3).
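In code, the criterion reads as follows (a numpy sketch; L is assumed to hold the values L*(1), L*(2), . . . , with a dummy entry at index 0).

```python
import numpy as np

def elbow(L):
    # C(p) for p = 2, ..., len(L)-2; returns the p maximizing the curvature
    ps = np.arange(2, len(L) - 1)
    C = [(L[p + 1] + L[p - 1] - 2 * L[p]) /
         np.sqrt((L[p + 1] - L[p - 1]) ** 2 + 4) for p in ps]
    return int(ps[np.argmax(C)])
```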
Several other criteria have been introduced in the literature. Caliński and Harabasz [45] propose to use the ratio of normalized between-group and within-group sums of squares associated with K-means. For a given p, let c_1, . . . , c_p denote the optimal centers, and A_1, . . . , A_p the optimal partition, with N_k = |A_k|. The normalized between-group and within-group sums of squares are then
h_α(p) = (1/(p − 1)) Σ_{k=1}^p N_k |c_k − x̄|²  and  w_α(p) = (1/(N − p)) Σ_{k=1}^p Σ_{x∈A_k} |x − c_k|²,
where x̄ denotes the mean of the full dataset.
Figure 20.3: Elbow graphs for K-means clustering for two populations generated as mixtures
of Gaussian.
Caliński and Harabasz [45] suggest maximizing γ_CH(p) = h_α(p)/w_α(p).
This criterion can be extended to other types of cluster analysis. We have seen in section 20.4 that, when α(x, y) = |x − y|²,
(1/2) Σ_{k=1}^p Σ_{x,y∈A_k} α(x, y)/N_k = Σ_{k=1}^p Σ_{x∈A_k} |x − c_k|².
We also have
Σ_{x∈T} |x − x̄|² = Σ_{k=1}^p Σ_{x∈A_k} |x − c_k|² + Σ_{k=1}^p N_k |c_k − x̄|²
and one can define, for a general dissimilarity,
w_α(p) = (1/(2(N − p))) Σ_{k=1}^p Σ_{x,y∈A_k} α(x, y)/N_k.
These expressions can obviously be applied to any dissimilarity measure, extending
γCH to general clustering problems.
For x ∈ T, let
d_α(x, A_k) = (1/N_k) Σ_{y∈A_k} α(x, y).
Let A(x) denote the cluster to which x belongs, a_α(x, p) = d_α(x, A(x)) and b_α(x, p) = min{d_α(x, A_k) : A_k ≠ A(x)}. Define the silhouette index of x in the segmentation [171] by
s_α(x, p) = (b_α(x, p) − a_α(x, p)) / max(b_α(x, p), a_α(x, p)) ∈ [−1, 1].
This index measures how well x is classified in the partitioning. It is large when
the mean distance between x and other objects in its class is small compared to the
minimum mean distance between x and any other class. In order to estimate the best
number of clusters with this criterion, one can then maximize the average index
γ_R(p) = (1/N) Σ_{x∈T} s_α(x, p).
Remark 20.9 One can rewrite the Caliński and Harabasz index using the notation introduced for the silhouette index. Indeed,
h_α(p) = (1/(2(p − 1))) Σ_{x∈T} Σ_{k=1}^p (N_k/N) (d_α(x, A_k) − d_α(x, A(x)))
and
w_α(p) = (1/(2(N − p))) Σ_{k=1}^p Σ_{x∈A_k} d_α(x, A_k).
Figure 20.4: Division of the unit square into clusters for uniformly distributed data.
Several selection methods choose p based on the comparison of the data to a "null hypothesis" of no cluster. For example, assume that K-means is applied to a training set T whose samples are drawn according to the uniform distribution on [0, 1]^d. Given centers c_1, . . . , c_p, let Ā_k be the set of points in [0, 1]^d that are closer to c_k than to any other center. Then the segmentation of T is formed by the sets A_k = {x ∈ T : x ∈ Ā_k} and, for large enough N, we can approximate |A_k|/N (by the law of large numbers) by the volume of the set Ā_k, that we will denote by vol(Ā_k). Let us assume that c_1, . . . , c_p are uniformly spaced, so that the sets Ā_k have similar volumes (close to 1/p) and roughly spherical shapes (see fig. 20.4). This implies that
∫_{Ā_k} |x − c_k|² dx ≈ vol(Ā_k) (d/(d + 2)) r_p²,
where r_p is the radius of a sphere of volume 1/p, i.e., p r_p^d ≈ d/Γ_{d−1}, where Γ_{d−1} is the surface area of the unit sphere in R^d. So, we should have, for some constant C(d) that only depends on d,
Σ_{x∈A_k} |x − c_k|² ≈ (N_k/vol(Ā_k)) ∫_{Ā_k} |x − c_k|² dx ≈ C(d) N p^{−2/d−1},
so that, summing over the p clusters, L*(p) ≈ C(d) N p^{−2/d}.
This suggests that, for fixed N and d, p^{2/d} L*(p) should vary slowly when p overestimates the number of clusters (assuming that increasing p then divides a homogeneous cluster). Based on this analysis, Krzanowski and Lai [112] introduced the difference-ratio criterion, namely,
γ_KL(p) = ((p − 1)^{2/d} L*(p − 1) − p^{2/d} L*(p)) / (p^{2/d} L*(p) − (p + 1)^{2/d} L*(p + 1)),
and estimate the number of clusters by taking p maximizing γ_KL.
(with the convention that L*(0) = 0) for some positive number ν, and select the value of p that maximizes γ_SJ. Indeed, in the case of Gaussian mixtures, the choice ν = d/2 ensures that, in large dimensions, γ_SJ(p) is small for p < p_0, close to 1 for p > p_0, and close to p_0 for p = p_0.
where L*(p, T) denotes the optimal value of the optimized cost with p clusters for a training set T. The notation T♯ represents a random training set, with the same size and dimension as T, generated using an unclustered probability distribution used as a reference. In Tibshirani et al. [192], this distribution is taken as uniform (over the smallest hypercube containing the observed data), or uniform on the coefficients of a principal component decomposition of the data (see chapter 21). The expectation E(L*(p, T♯)) is computed by Monte-Carlo simulation, by sampling many realizations of T♯, running the clustering algorithm for each of them and averaging the optimal costs.
One can expect L*(p, T) (for observed data) to decrease much faster (when adding a cluster) than its expectation for homogeneous data when p < p_0, and to decrease at a comparable rate afterwards. Letting σ(p) denote the standard deviation of the simulated values used to estimate E(L*(p, T♯)), one then selects the smallest p such that
γ_TWH(p + 1) ≤ γ_TWH(p) + σ(p + 1).
Figures 20.5 to 20.7 provide a comparative illustration of some of these indexes.
20.7 Bayesian clustering

20.7.1 Introduction

Bayesian clustering methods place a prior distribution on the model parameters, resulting in the generative hierarchy
θ → (Z_1, . . . , Z_N) → (X_1, . . . , X_N).
Figure 20.5: Comparison of cluster indices for Gaussian clusters. First row: original data
and ground truth. Second panel: plots of four indices as functions of p (Elbow; Caliński and
Harabasz; silhouette; Sugar and James)
Figure 20.6: Comparison of cluster indices for Gaussian clusters. First row: original data
and ground truth. Second panel: plots of four indices as functions of p (Elbow; Caliński and
Harabasz; silhouette; Sugar and James).
Figure 20.7: Comparison of cluster indices for Gaussian clusters. First row: original data
and ground truth. Second panel: plots of four indices as functions of p (Elbow; Caliński and
Harabasz; silhouette; Sugar and James).
In this expression, P(θ)dθ implies an integration with respect to the prior distribution of the parameters. This distribution is part of the design of the method, but one usually chooses it so that it leads to simple computations, using so-called conjugate priors, which are such that posterior distributions belong to the same parametric family as the prior. For example, the conjugate prior for the mean of a Gaussian distribution (such as c_i in our model) is also a Gaussian distribution. The conjugate prior for a scalar variance is the inverse gamma distribution, with density
(v^u/Γ(u)) s^{−u−1} exp(−v/s)
for some parameters u, v.
for some parameters u, v. A conjugate prior for the class probabilities α = (α1 , . . . , αp )
is the Dirichlet distribution, with density
p
Γ (a1 + · · · + ap ) Y aj −1
D(α1 , . . . , αp ) = αj
Γ (a1 ) · · · Γ (ap )
j=1
on the simplex
Sp = {(α1 , . . . , αp ) ∈ Rp : αi ≥ 0, α1 + · · · + αp = 1}.
Note that these conjugate priors have the same form (up to normalization) as the
parametric model densities when considered as functions of the parameters.
We first discuss the Bayesian approach assuming that the number of clusters is smaller than a fixed number, p. In this example, we assume that c_1, . . . , c_p are modeled as independent Gaussian variables N(0, τ² Id_{R^d}), σ² with an inverse gamma distribution with parameters u and v, and (α_1, . . . , α_p) with a Dirichlet distribution with parameters (a, . . . , a).
4 The symbol ∝ means “equal up to a multiplicative constant”.
One can explicitly integrate this last expression with respect to σ² and α, using the expressions of the normalizing constants in the inverse gamma and Dirichlet distributions, yielding (after integration and ignoring constant terms)
Γ(a + N_1) · · · Γ(a + N_p) (v + (1/2) Σ_{k=1}^N |x_k − c_{z_k}|²)^{−u−dN/2} exp(−Σ_{j=1}^p |c_j|²/2τ²)
= Γ(a + N_1) · · · Γ(a + N_p) (v + (1/2) S_w + (1/2) Σ_{j=1}^p N_j |c_j − x̄_j|²)^{−u−dN/2} exp(−Σ_{j=1}^p |c_j|²/2τ²),
where S_w = Σ_{k=1}^N |x_k − x̄_{z_k}|² is the within-group sum of squares, x̄_j denoting the mean of the x_k such that z_k = j. Note that this sum of squares depends on x and z, and that (N_1, . . . , N_p), the group sizes, depend on z.
Moreover,
∫_{(R^d)^p} dc_1 · · · dc_p / (v + (1/2) S_w + (1/2) Σ_{j=1}^p N_j |c_j − x̄_j|²)^{u+dN/2}
= (2v + S_w)^{(p−N)d/2−u} Π_{j=1}^p N_j^{−d/2} ∫_{(R^d)^p} dµ_1 · · · dµ_p / ((1/2) + (1/2) Σ_{j=1}^p |µ_j|²)^{u+dN/2},
and the final integral does not depend on x or z. It follows from this that the condi-
tional distribution of Z given x takes the form
P(z|x) = C(x) Π_{j=1}^p Γ(a + N_j) / ((2v + S_w)^{(N−p)d/2+u} Π_{j=1}^p N_j^{d/2}),
where C(x) is a normalization constant ensuring that the right-hand side is a proba-
bility distribution over configurations z = (z1 , . . . , zN ) ∈ {1, . . . , p}N . In order to obtain
the most likely configuration for this posterior distribution, one should therefore minimize in z the function
((N − p)d/2 + u) log(2v + S_w) + (d/2) Σ_{j=1}^p log N_j − Σ_{j=1}^p log Γ(a + N_j).
This final optimization problem cannot be solved in closed form, but it can be addressed numerically. One can simplify it a little by only keeping the leading terms in the last two sums (using the Stirling formula for the gamma function) and minimize
((N − p)d/2 + u) log(2v + S_w) − Σ_{j=1}^p (a + N_j) log(a + N_j).
This expression has a nice interpretation: the first term penalizes the within-group sum of squares, the same objective function as in K-means, and the second one is an entropy term that favors clusters with similar sizes.
In the context of the discussed example, this reduces to sampling from a distribution proportional to
(σ²)^{−u−1} e^{−v/σ²} e^{−Σ_{j=1}^p |c_j|²/2τ²} Π_{j=1}^p α_j^{a−1} Π_{k=1}^N ((σ²)^{−d/2} e^{−|x_k − c_{z_k}|²/2σ²}) Π_{k=1}^N α_{z_k}.   (20.11)
Sampling from all these variables at once is not tractable, but it is easy to sample from them in sub-groups, conditionally to the rest of the variables. We can, for example, deduce from the expression above the following conditional distributions.
(i) Given (α, c, z), σ² follows an inverse gamma distribution with parameters u + dN/2 and v + (1/2) Σ_{k=1}^N |x_k − c_{z_k}|².
(ii) Given (z, σ², c), α follows a Dirichlet distribution with parameters (a + N_1, . . . , a + N_p).
(iii) Given (z, σ², α), c_1, . . . , c_p are independent and follow a Gaussian distribution, respectively with mean (1 + σ²/(N_j τ²))^{−1} x̄_j and variance (N_j/σ² + 1/τ²)^{−1}.
(iv) Given (σ², α, c), the labels z_1, . . . , z_N are independent, with
P(z_k = j | σ², α, c, x) ∝ α_j e^{−|x_k − c_j|²/2σ²}.
Note that this algorithm only provides samples of the posterior distribution asymptotically (it has to be stopped at some point, of course). Note also that, at each step, the labels z_1, . . . , z_N provide a random partition of the set {1, . . . , N}, and this partition changes at every step.
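One Gibbs sweep through the conditionals (i)-(iv) can be sketched as follows (numpy; hyperparameter names follow the text, everything else is an illustrative assumption).

```python
import numpy as np

def gibbs_sweep(x, z, c, sigma2, alpha, u, v, tau2, a, rng):
    N, d = x.shape
    p = c.shape[0]
    # (i) sigma^2 | rest: inverse gamma(u + dN/2, v + 0.5 sum |x_k - c_{z_k}|^2)
    resid = 0.5 * np.sum((x - c[z]) ** 2)
    sigma2 = 1.0 / rng.gamma(u + d * N / 2, 1.0 / (v + resid))
    # (ii) alpha | rest: Dirichlet(a + N_1, ..., a + N_p)
    counts = np.bincount(z, minlength=p)
    alpha = rng.dirichlet(a + counts)
    # (iii) c_j | rest: Gaussian with the stated mean and variance
    for j in range(p):
        var = 1.0 / (counts[j] / sigma2 + 1.0 / tau2)
        mean = var * x[z == j].sum(axis=0) / sigma2
        c[j] = mean + np.sqrt(var) * rng.standard_normal(d)
    # (iv) z_k | rest, proportional to alpha_j exp(-|x_k - c_j|^2 / 2 sigma^2)
    logits = np.log(alpha)[None, :] - ((x[:, None] - c[None]) ** 2).sum(-1) / (2 * sigma2)
    probs = np.exp(logits - logits.max(1, keepdims=True))
    probs /= probs.sum(1, keepdims=True)
    z = np.array([rng.choice(p, p=pr) for pr in probs])
    return z, c, sigma2, alpha
```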
To estimate a single partition out of this simulation, several strategies are possible. Using the simulation, one can estimate the probability w_kl that x_k and x_l belong to the same cluster. This can be done by averaging the number of times that z_k = z_l was observed along the Gibbs sampling iterations (from which one usually excludes a few early "burn-in" iterations). These weights, w_kl, can then be used as similarity measures in a clustering algorithm. Alternatively, one can average, for each k, the values of the class center c_{z_k} associated with k, still along the Gibbs sampling iterations. These average values can then be used as input of, say, a K-means algorithm to estimate final clusters.
The log-likelihood for a mixture of Gaussian takes the form (ignoring constant terms)
ℓ(σ², α, c, z) = −(u + 1) log σ² − v σ^{−2} − (1/2τ²) Σ_{j=1}^p |c_j|² + (a − 1) Σ_{j=1}^p log α_j − (Nd/2) log σ² − (σ^{−2}/2) Σ_{k=1}^N |x_k − c_{z_k}|² + Σ_{k=1}^N log α_{z_k}
= −(u + 1) log σ² − v σ^{−2} − (1/2τ²) Σ_{j=1}^p |c_j|² + (a − 1) Σ_{j=1}^p log α_j − (Nd/2) log σ² − (σ^{−2}/2) Σ_{k=1}^N Σ_{j=1}^p |x_k − c_j|² 1_{z_k=j} + Σ_{k=1}^N Σ_{j=1}^p (log α_j) 1_{z_k=j}.
• g_j^{(c)} is the p.d.f. of a Gaussian N(m̃_j, σ̃_j² Id_{R^d}), with, letting
ζ̃(j) = Σ_{k=1}^N ⟨1_{Z_k=j}⟩ = Σ_{k=1}^N g_k^{(z)}(j),
σ̃_j² = (1/τ² + ⟨σ^{−2}⟩ ζ̃(j))^{−1} and m̃_j = ⟨σ^{−2}⟩ σ̃_j² Σ_{k=1}^N g_k^{(z)}(j) x_k (here and below, ⟨·⟩ denotes an expectation computed under the current mean-field approximation).
• g^{(α)} is the p.d.f. of a Dirichlet distribution, with parameters ã_1, . . . , ã_p, with ã_j = a + ζ̃(j).
• Finally, g_k^{(z)} is a p.m.f. on {1, . . . , p} with
g_k^{(z)}(j) ∝ exp(−(1/2) ⟨σ^{−2}⟩ ⟨|x_k − C_j|²⟩ + ⟨log α_j⟩).
Combining these facts with the expression of the mean-field parameters, we can
now formulate a mean-field estimation algorithm for mixtures of Gaussian that iter-
atively applies the consistency equations.
(5) For j = 1, . . . , p, let σ̃_j² = (1/τ² + ζ̃(j)/ρ̃²)^{−1} and m̃_j = (σ̃_j²/ρ̃²) Σ_{k=1}^N g̃_k(j) x_k.
(8) Compare the updated variables with their previous values and stop if the dif-
ference is below a tolerance level. Otherwise, return to (3).
After convergence, g_k^{(z)} provides the mean-field approximation of the posterior probability of the classes for observation k, and can be used to determine clusters.
The Polya urn In the previous model with p clusters or less, the joint distribution of Z_1, . . . , Z_N is given by
π(z_1, . . . , z_N) = (Γ(pa)/Γ(a)^p) ∫_{S_p} Π_{j=1}^p α_j^{a+N_j−1} dα = (Γ(pa)/Γ(pa + N)) Π_{j=1}^p (Γ(a + N_j)/Γ(a)).
Note that the right-hand side does not change if one relabels the values of z_1, . . . , z_N, i.e., if one replaces each z_k by s(z_k) where s is a permutation of {1, . . . , p}, creating a new configuration denoted s · z. Let [z] denote the equivalence class of z, containing all z′ = s · z over such permutations s: all the labelings in [z] provide the same partition of {1, . . . , N} and can therefore be identified. One defines a probability distribution π̄ over these equivalence classes by letting
π̄([z]) = |[z]| (Γ(pa)/Γ(pa + N)) Π_{j=1}^p (Γ(a + N_j)/Γ(a)).
The first term on the right-hand side is the number of elements in the equivalence class of z. To compute it, let p_0 = p_0(z) denote the number of different values taken by z_1, . . . , z_N, i.e., the "true" number of clusters (ignoring the empty ones), which now is a function of z. Let A_1, . . . , A_{p_0} denote the partition associated with z. New labelings equivalent to z can be obtained by assigning any index i_1 ∈ {1, . . . , p}
Now, the class [z] contains exactly one element ẑ with the following properties
• ẑ1 = 1,
• ẑk ≤ max(zj , j < k) + 1 for all k > 1.
This means that the kth label is either one of those already appearing in (ẑ1 , . . . , ẑk−1 )
or the next integer in the enumeration. We will call such a ẑ admissible. If we assume
that z is admissible in the expression of π̄, we can write
Q p0 QNj −1
j=1 λ(1 − j/p) i=1 (λ/p + i)
π̄([z]) = .
λ(λ + 1) . . . (λ + N − 1)
If one takes the limit p → ∞ in this expression, one still gets a probability distribu-
tion on admissible labelings, namely
Q p0
λp0 j=1 (Nj − 1)!
π̄([z]) = . (20.12)
λ(λ + 1) . . . (λ + N − 1)
Recall that, in this equation, p0 is a function of z, equal, for admissible labelings, to
the largest j such that Nj > 0.
Using this prior, the complete model for the distribution of the observed data is
Q p0 p0
λp0 j=1 (Nj − 1)! Y N
Y
L(z, θ, x) = ψ(θj ) ϕ(xk |θzk )
λ(λ + 1) . . . (λ + N − 1)
j=1 k=1
Dirichlet processes. As we will see later, the expression of the global likelihood
and the Polya urn model will suffice for us to develop non-parametric clustering
methods for a set of observations x1 , . . . , xN . However, this model is also associated
to an important class of random probability distributions (i.e., random variables
taking values in some set of probability distributions) called Dirichlet processes for
which we provide a brief description.
The distribution in (20.12) was obtained by passing to the limit from a model
that first generates p numbers α1 , . . . , αp , then generates the labels z1 , . . . , zN ∈ {1, . . . , p}
identified modulo relabeling. This distribution can also be defined P∞ directly, by first
defining an infinity of positive numbers (αj , j ≥ 1) such that i=1 αi = 1, followed by
the generation of random labels Z1 , . . . , ZN such that P (Zk = j) = αj , followed once
again with an identification up to relabeling.
The distribution of α that leads to the Polya urn is called the stick breaking process.
This process is such that
j−1
Y
αj = Uj (1 − Ui )
i=1
where U1 , U2 , . . . is a sequence of i.i.d. variables following a Beta(1, λ) distribution,
i.e., with p.d.f. λ(1 − u)λ−1 for u ∈ [0, 1]. The stick breaking interpretation comes
from the way α1 , α2 , . . . can be simulated: let α1 ∼ Beta(1, λ); given α1 , . . . , αj−1 , let
αj = (1 − α1 − · · · − αj−1 )Uj where Uj ∼ Beta(1, λ) and is independent from the past.
Each step can be thought of as breaking the remaining length, (1 − α1 − · · · − αj−1 ),
of an original stick of length 1 using a beta-distributed variable, Uj . This process
leads to the distribution (20.12) over admissible distributions, i.e., if α is generated
according to the stick breaking process, and Z1 , . . . , ZN are independent, each such
20.7. BAYESIAN CLUSTERING 503
that P (Zk = j) = αj , then the probability that (Z1 , . . . , ZN ) is identical, after relabeling,
to the admissible configuration z is given by (20.12). (We skip the proof of this result,
which is not straightforward.)
This process has the following characteristic property. For any family V1 , . . . , Vk ⊂
Θ forming a partition of that set, the random variable (ρ(U1 ), . . . , ρ(Uk )) follows a
Dirichlet distribution with parameters
Z Z !
λ ψ dη, . . . , λ ψ dη .
U1 U1
This is the definition of a Dirichlet process with parameters (λ, ψ), or, simply, with
parameter λψ. Conversely, one can also show that any Dirichlet process can be de-
composed as in (20.14) where α is a stick-breaking process and η independent real-
izations of ψ.
Algorithm 20.13
1 Initialize k = 1, z1 = 1, j = 1. Let N1 = 1.
2 Sample η1 ∼ ψ and x1 ∼ ϕ(·|η1 ).
3 At step k, assume that z1 , . . . , zk has been generated, with associated number of
clusters equal to j and N1 , . . . , Nj elements per cluster. Generate zk+1 such that
Ni
i with probability , for i = 1, . . . , j
+
zk+1 =
λ k
λ
j + 1 with probability
λ+k
504 CHAPTER 20. CLUSTERING
This algorithm cannot be used, of course, to sample from the conditional distri-
bution of Z and η given X = x, and Markov-chain Monte-Carlo must be used for this
purpose. In order to describe how Gibbs sampling may be applied to this problem,
we use the fact that, as previously remarked, using admissible labelings z is equiv-
alent to using partitions A = (A1 , . . . , Ap0 ) of {1, . . . , N }, and we will use the latter
formalism to describe the algorithm. We will also use the notation ηA to denote the
parameter associated to A ∈ A so our new notation for the variables is (A, η) where
A is a partition of {1, . . . , N } and η is a collection (ηA , A ∈ A) with ηA ∈ Θ. Given this,
we want to sample from a conditional p.d.f.
The following points are relevant for the design of the sampling algorithm.
(1) The conditional distribution of η given A and the training data is proportional
to
Y Y
ψ(η ) ϕ(x |η )
A k A
A∈A k∈A
This shows that the parameters ηA , A ∈ A are independent of each other, with ηA
following a distribution proportional to
Y
η 7→ ψ(η) ϕ(xk |η).
k∈Aj
the variable (A, η) the pair (A(k) , η (k) ), where A(k) is the partition of {1, . . . , N } \ {k}
(k)
formed by the sets A(k) = A \ {k} and ηA = ηA , unless A = {k}, in which case the set
and the corresponding ηA are dropped.
We can write Φ(A, η|x) in the form
(k)
λ|A |−1 B∈A(k) (|B| − 1)! Y
Q Y
Φ(A, η|x) ∝ q(Ak , ηAk )ϕ(xk |ηAk ) ψ(ηB ) ϕ(xl |ηB )
(λ + 1) · · · (λ + N − 1) (k) B∈A l∈B
(20.17)
with X
q(A, θ) = |B|1A=B∪{k} + λψ(θ)1A={k}
B∈A(k)
Partitions A0 that are consistent with A(k) allocate k to one of the clusters in A(k) or
create a new cluster with a new parameter ηk0 . If one replaces (A, η) by (A0 , η 0 ), only
the first two terms in (20.17) will be affected, so that the conditional probability of
A0 given A(k) is proportional to q(A0k , ηA0k )ϕ(xk |ηA0k ) and given by
|B|ϕ(xk |ηB )
if A0k = B ∪ {k}, ηB0 = ηB , B ∈ A(k)
1
C + λC 2
0 0
λϕ(x |η )ψ(η k)
k k
if A0k = {k},
C1 + λC2
where X Z
C1 = |B|ϕ(xk |ηB ) and C2 = ϕ(xk |θ)ψ(θ)dθ.
Θ
B∈Ak
Concretely, this means that one first decides to allocate k to a set B in A(k) with
probability |B|ϕ(xk |ηB )/(C1 +λC2 ) and to create a new set with probability λC2 /(C1 +
0
λC2 ). If a new set is created, then the associated parameter η{k} is sampled according
to the p.d.f. ϕ(xk |θ)ψ(θ/C2 .
(3) However, sampling using this conditional probability requires the computa-
tion of the integral C2 , which can represent a significant computational burden,
since this has to be done many times in a Gibbs sampling algorithm. A modification
of this algorithm, introduced in Neal [142], avoids this computation by adding new
auxiliary variables at each step of the computation. These variables are m parameters
η1∗ , . . . , ηm
∗ ∈ Θ where m is a fixed integer. To define the joint distribution of A, η, η ∗ ,
one lets the marginal distribution of (A, η) be given by (20.16) and conditionally to
A, η, let η ∗1 , . . . , η ∗m be:
(i) independent with density ψ if |Ak | > 1;
(ii) such that ηj∗ = ηAk and the other m − 1 starred parameters are independent
with distribution ψ, where j is randomly chosen in {1, . . . , m} if Ak = {k}.
506 CHAPTER 20. CLUSTERING
With this definition, the joint conditional distribution of (A, η, η ∗ ) takes the form
Φ(A,
b η, η ∗ |x) ∝ q̂(Ak , ηAk , η ∗ )ϕ(xk |ηAk )
(k)
λ|A |−1 B∈A(k) (|B| − 1)! Y
Q Y
ψ(ηB ) ϕ(xl |ηB ) (20.18)
(λ + 1) · · · (λ + N − 1) (k) B∈A l∈B
with
m m m
X Y λX Y
q̂(A, θ, η1∗ , . . . , ηm
∗
)= |B|1θ=ηB ,A=B∪{k} ψ(ηj∗ ) + 1θ=ηj∗ ,A={k} ψ(θ) ψ(ηi∗ ).
m
B∈A(k) j=1 j=1 i=1,i,j
Note that Φ
b depends on k, so that the definition of the auxiliary variables will change
at each step of Gibbs sampling. The conditional distribution, for Φ, b of A0 , η 0 given
A(k) , η (k) , η ∗ is such that
We can now summarize this discussion with Neal’s version of the Gibbs sampling
algorithm.
either directly, or via one step of Gibbs sampling visiting each of the variables that
constitute ηA .
(3) Loop a sufficient number of times over the previous two steps.
After running this algorithm, the set of clusters should be finalized by using
statistics computed along the simulation, as discussed after Algorithm 20.10.
|xk − c(k) |
(k)
B
if A0 = B ∪ {k} and cA 0
|B| exp − 2σ 2 0 = cB
|x − c∗ |
λ
exp − k B if A = {k} and cA
0 ∗
0 = cj , j = 1, . . . , m.
m 2σ 2
508 CHAPTER 20. CLUSTERING
(3) Simulate a new value of σP2 according to an inverse gamma distribution with
1 N
parameters u + dN /2 and v + 2 k=1 |xk − cAk |2 .
(4) Simulate new values for cA , A ∈ A independently, sampling cA according to a
Gaussian distribution with mean (1 + σ 2 /(Nj τ 2 ))−1 x̄A and variance (|A|/σ 2 + 1/τ 2 )−1 ,
where
1 X
x̄A = xk .
|A|
k∈A
Chapter 21
We start our discussion with principal component analysis (or PCA). This meth-
ods can be characterized in multiple ways, and we introducing through the angle of
data approximation. In the following, the random variable X takes values in a finite-
or infinite-dimensional inner-product space H. We will denote, as usual, by h. , .iH
the product in this space.
509
510 CHAPTER 21. DIMENSION REDUCTION AND FACTOR ANALYSIS
(see section 6.4). Recall that this orthogonal projection if characterized by the two
properties: (i) PV (y) ∈ V and (ii) (y − PV (y)) ⊥ V .
Rk = xk − c − PV (xk − c)
N
X
S= kRk k2H (21.2)
k=1
is as small as possible.
PN
An optimal choice for c is c = x = k=1 xk /N . Indeed, using the linearity of the
orthogonal projection, we have
N
X
S= kxk − PV (xk ) − (c − PV (c))k2H
k=1
N
X
= kxk − PV (xk ) − (x − PV (x))k2H + N kx − PV (x) − (c − PV (c))k2H .
k=1
Given this, there would be no loss of generality in assuming that all xk ’s have been
replaced by xk − x and taking c = 0. While this is often done in the literature, there
are some advantages (especially when discussing kernel methods) in keeping the
average explicit in the notation, as we will continue to do.
p
X
PV (xk − x) = ρk (i)ei
i=1
with ρki = hxk − x , ei iH . One can then reformulate the problem in terms of (e1 , . . . , ep ),
which must minimize
N
X p
X
S = kxk − x − hxk − x , ei iei k2H
k=1 i=1
N
X p X
X N
= kxk − xk2H − hxk − x , ei i2H .
k=1 i=1 k=1
21.1. PRINCIPAL COMPONENT ANALYSIS 511
For u, v ∈ H, define
N
1X
hu , viT = hxk − x , uiH hxk − x , viH
N
k=1
∞
X
σµ2 = λ2i .
k=1
The main statement of the following result is in finite dimensions, a simple ap-
plication of corollary 2.4. We here give a direct proof that also works in infinite
dimensions.
Definition 21.2 When µ = µ̂T , the vectors (f1 , . . . , fp ) are called (with some abuse when
eigenvalues coincide) the first p principal components of the training set (x1 , . . . , xN ).
p
X
F(e1 , . . . , ep ) = Γµ (ek , ek ) .
k=1
(j) (j)
Note that F(f1 , . . . , fp ) = λ21 + · · · + λ2p . Write ek = ∞
P
j=1 αk fj (so that αk = hfj , ek iH ).
(j) (j)
These coefficients satisfy ∞
P
j=1 αk αl = 1 if k = l and 0 otherwise. Then
∞
(j)
X
Γµ (ek , ek ) = λ2j (αk )2 .
j=1
21.1. PRINCIPAL COMPONENT ANALYSIS 513
We have
p X
∞
(j)
X
F(e1 , . . . , ep ) = λ2j (αk )2
k=1 j=1
p X p p X
∞
(j) (j)
X X
= λ2j (αk )2 + λ2j (αk )2
k=1 j=1 k=1 j=p+1
p X p p X ∞
(j) (j)
X X
≤ λ2j (αk )2 + λ2p+1 (αk )2
k=1 j=1 k=1 j=p+1
Xp p
X (j)
= (λ2j − λ2p+1 ) (αk )2 + pλ2p+1 .
j=1 k=1
Condition (a) implies that span(e1 , . . . , ep ) ⊂ span(fj : λ2j ≤ λ2p+1 ). If λ2p = λ2p+1 , the
inclusion span(e1 , . . . , ep ) ⊂ span(fj : λ2j ≤ λ2p ) therefore holds. If λ2p < λ2p+1 , condition
Pp (j)
(b) requires k=1 (αk )2 = 1 for all j ≤ p, which implies fj ∈ span(e1 , . . . , ep ) for j ≤ p,
so that span(e1 , . . . , ep ) = span(f1 , . . . , fp ) and the inclusion also hold.
Pp (j)
Condition (b) always requires k=1 (αk )2 = 1, hence fj ∈ span(f1 , . . . , fp ), when
λj < λp , showing that span(fj : λ2j < λ2p ) ⊂ span(e1 , . . . , ep ). Equation (21.5) therefore
always holds for (e1 , . . . , ep ) such that F(e1 , . . . , ep ) = λ21 + · · · + λ2p . Furthermore, condi-
tions (a) and (b) always hold for any orthonormal family that satisfy (21.5), showing
that any such solution is optimal.
514 CHAPTER 21. DIMENSION REDUCTION AND FACTOR ANALYSIS
Remark 21.3 The interest of discussing PCA associated with a covariance operator
for a square integrable measure (in which case it is often called a Karhunen-Loeve
(KL) expansion) is that this setting is often important when discussing infinite-
dimensional random processes (such as Gaussian random fields). Moreover, these
operators quite naturally provide asymptotic versions of sample-based PCA. In-
teresting issues, that are part of functional data analysis [159], address the design
of proper estimation procedures to obtain converging estimators of KL expansions
based on finite samples for stochastic processes in infinite-dimensional spaces.
Small dimension. Assume that H has finite dimension, d, i.e., H = Rd , and repre-
sent x1 , . . . , xN ∈ Rd as column vectors. Let the inner product on H be associated to a
positive-definite symmetric matrix Q:
hu , viH = u T Qv.
N
1X T
hu , AT viH = (u Q(xk − x))(v T Q(xk − x))
N
k=1
N
1 X
= u T Q(xk − x)(xk − x)T Qv
N
k=1
= hu , ΣT QviH ,
so that AT = ΣT Q.
The eigenvectors, f , of AT are such that Q1/2 f are eigenvectors of the symmetric
matrix Q1/2 ΣT Q1/2 , which shows that they form an orthogonal system in H, which
will be orthonormal if the eigenvectors are normalized so that f T Qf = 1. Equiva-
lently, they solve the generalized eigenvalue problem QΣT Qf = λ2 Qf , which may
be preferred numerically to diagonalizing the non-symmetric matrix ΣT Q.
21.1. PRINCIPAL COMPONENT ANALYSIS 515
Remark 21.4 Sometimes, the metric is specified by giving Q−1 instead of Q (or Q−1
is easy to compute). Then, one can directly solve the generalized eigenvalue problem
ΣT f˜ = λ2 Q−1 f˜ and set f = Q−1 f˜. The normalization f T Qf = 1 is then obtained by
normalizing f˜ so that f˜T Q−1 f˜ = 1.
Remark 21.5 The “standard” version of PCA applies this computation using the Eu-
clidean inner product, with Q = IdRd , and the principal components are the eigen-
vectors of the covariance matrix of T associated with the largest eigenvalues.
Large dimension. It often happens that the dimension of H is much larger than the
number of observations, N . In such a case, the previous approach is quite inefficient
(especially when the dimension of H is infinite!) and one should proceed as follows.
Returning to the original problem, one can remark that there is no loss of gener-
ality in assuming that V is a subspace of W := span{x1 − x, . . . , xN − x}. Indeed, letting
V 0 = PW (V ) (the projection of V on W ), we have, for ξ ∈ W ,
In this computation, we have used the facts that PW ξ = ξ (since ξ ∈ W ), that kPW PV ξkH ≤
kPV ξkH , that PW PV ξ ∈ V 0 and that PV 0 (ξ) is the best approximation of ξ by an ele-
ment of V 0 . This shows that (since xk − x ∈ W for all k)
N
X N
X
kxk − x − PV (xk − x)k2H ≥ kxk − x − PV 0 (xk − x)k2H
k=1 k=1
with V 0 a subspace of W of dimension less than p, proving the result. This computa-
tion also shows that no improvement in PCA can be obtained by looking for spaces
of dimension p ≥ dim(W ) (with dim(W ) ≤ N − 1 because the data is centered).
(i)
for some αk , 1 ≤ k ≤ N , 1 ≤ i ≤ p.
516 CHAPTER 21. DIMENSION REDUCTION AND FACTOR ANALYSIS
PN (i) (j)
With this notation, we have hfi , fj iH = k,l=1 αk αl hxk − x , xl − xiH and
N
1X
hfi , fj iT = hfi , xl − xiH hfj , xl − xiH
N
l=1
N N
1 X (i) (j) X
= αk αk 0 hxk − x , xl − xiH hxk 0 − x , xl − xiH .
N 0
k,k =1 l=1
Let S be the Gram matrix of the centered data, formed by the inner products hxk − x , xl − xiH ,
(i)
for k, l = 1, . . . , N . Let α (i) be the column vector with coordinates αk , k = 1, . . . , N . We
have hfi , fj iH = (α (i) )T Sα (j) and hfi , fj iT = (α (i) )T S 2 α (j) /N , which implies that, in this
representation, the operator AT is given by S/N . Thus, the previous simultaneous
orthogonalization problem can be solved in terms of the α’s by diagonalizing S and
taking the first eigenvectors, normalized so that (α (i) )T Sα (i) = 1. Let λ2j , j = 1, . . . , N
be the eigenvalues of S/N (of which only the first min(d, N − 1) may be non-zero).
In this representation, the decomposition of the projection of xk on the PCA basis is
given by
p
(j)
X
xk = βk fj
j=1
with
N
(j) (j) (j)
X
βk = hxk − x , fj iH = αl hxl − x , xk − xiH = N λ2j αk .
l=1
Since the previous computation only depended on the inner products hxk − x , xl − xiH ,
PCA can be performed in reproducing kernel Hilbert spaces, and the resulting method
is called kernel PCA. In this framework, X may take values in any set R with a rep-
resentation h : R → H. The associated kernel, K(x, x0 ) = hh(x) , h(x0 )iH , provides a
closed form expression of the inner products in terms of the original variables. The
feature function itself is most of the time unnecessary.
Then the Gram matrix in feature space is S with skl = Kc (xk , xl ) and the computation
described in the previous section can be applied. Note that, if one denotes, as usual
K = K(x1 , . . . , xN ) the matrix formed by kernel evaluations K(xk , xl ), and if one lets
P = IdRN − 1N 1N /N , then we have the simple matrix expression S = P KP .
and they are not computable when the features not known explicitly. However, a
few geometric features associated with these directions can be characterized using
the kernel only.
n o
Consider the line in feature space Di = h̄ + λfi , λ ∈ R . Let Ωi denote the points
x ∈ R such that h(x) ∈ Di . Then x ∈ Ωi if and only if h(x) coincides with its orthogonal
projection on Di , which is equivalent to
2 2
hh(x) − h̄ , fi iH = h(x) − h̄ H
,
for small .
Similarly, the feature vector h(x) − h̄ belongs to the space generated by the first p
components if and only if
p
2 2
X
hh(x) − h̄ , fi iH = h(x) − h̄ H
i=1
518 CHAPTER 21. DIMENSION REDUCTION AND FACTOR ANALYSIS
i.e.,
p X
N 2
X (i)
αk Kc (x, xk ) = Kc (x, x).
i=1 k=1
One can also compute the finite-dimensional coordinates of h(x) in the PCA basis,
and this computation is easier. The representation is
with
N
X (i)
ui = hh(x) − h̄ , fi iH = αk Kc (x, xk ) .
k=1
This provides an explicit nonlinear transformation that maps each data point x into
a p-dimensional point. This representation allows one to easily exploit the reduction
of dimension.
One can see that, in an optimal decomposition, one needs RT ei = 0 for all i,
because one can always write
p
X p
X p
X
(i) (i) T
Y ei + R = (Y + R ei )ei + R − RT ei ei .
i=1 i=1 i=1
Pp
If R is centered, then so is R − i=1 RT ei ei and the latter provides a better solution
Pp
since |R − i=1 RT ei ei | ≤ |R|. Also, there is no loss of generality in requiring that
(Y (1) , . . . , Y (p) ) are uncorrelated, as this can always be obtained after a change of basis
in span(e1 , . . . , ep ).
21.3. STATISTICAL INTERPRETATION AND PROBABILISTIC PCA 519
The solution of this problem is given by the first p eigenvectors of Σ. PCA (with a
Euclidean metric) exactly applies this procedure, with Σ replaced by the empirical
covariance.
W = [λ1 e1 , . . . , λp ep ].
X = W Y + σ 2R
where the parameters are W and σ 2 , with the constraint that W T W is a diagonal
matrix. As a linear combination of independent Gaussian random variables, X is
520 CHAPTER 21. DIMENSION REDUCTION AND FACTOR ANALYSIS
Proposition 21.6 Assume that the matrix ΣT is invertible. The log-likelihood in (21.7)
is maximized by taking
1 1
µ2i = 2
− 2 .
σ λi + σ 2
(W W T + σ 2 Id)−1 = ρ2 Id − QQT .
so that
p
X d
X
2
T
(W W + σ Id) −1
= (λ2i + σ 2 )−1 ei eiT + σ −2 ei eiT = ρ2 Id − QQT .
i=1 i=p+1
p
X d
X p
X
2
− log(ρ − µ2i ) − (d 2
− p) log ρ + ρ 2
δj2 − µ2j δj2 .
i=1 j=1 j=1
Computing the solution is elementary and left to the reader, and yields, when ex-
pressed as functions of σ 2 , λ21 , . . . , λ2p , the expressions given in the statement of the
theorem.
We now discuss a dimension reduction method called generalized PCA (GPCA) [202]
that, instead of looking for the best linear approximation of the training set by one
specific subspace, provides an approximation by a finite union of such spaces.
As a motivation, consider the situation in fig. 21.1 in which part of the data
is aligned along one direction in space, and another part along another direction.
Then, the only information that PCA can retrieve (provided that the two directions
intersect) is the plane generated by the two directions, which will be captured by
the two principal components. PCA will not be able to determine the individual
directions. GPCA addresses this type of situation as follows.
Figure 21.1: PCA cannot distinguish between the situations depicted in the two datasets.
522 CHAPTER 21. DIMENSION REDUCTION AND FACTOR ANALYSIS
For simplicity, assume that we are trying to decompose the data along unions of
hyperplanes in Rd . Such hyperplanes have equations of the form u T x̃ = 0 where x̃ is
our notation for the vector (1, xT )T . If we have two hyperplanes, specified by u1 and
u2 and all the training samples approximately belong to one of them, then one has,
for all k = 1, . . . , N :
(u1T x̃k )(u2T x̃k ) = x̃kT u1 u2T x̃k ' 0.
Similarly, for n hyperplanes, the identity is, for k = 1, . . . , N :
n
Y
(ujT x̃k ) ' 0.
j=1
Write
n
Y X
(ujT x) = u1 (i1 ) · · · un (in )x(i1 ) · · · x(in )
j=1 1≤i1 ,...,in ≤d
in the form (by regrouping the terms associated with the same powers of x)
X
F(x) = qp1 ...pd (x(1) )p1 . . . (x(d) )pd . (21.8)
p1 +...+pd =n
under the constraint qp21 ...pn = 1 (to avoid trivial solutions). Choosing an ordering
P
on the set of indices (p1 , . . . , pd ) such that p1 + · · · pd = n, one can stack the coefficients
(1) (d)
in Q and the monomials (xk )p1 . . . (xk )pd to form two vectors denoted Q (with some
abuse of notation) and V (xk ). One can then rewrite the problem of determining Q
as minimizing QT ΣQ subject to |Q|2 = 1, where
N
X
Σ= V (xk )V (xk )T .
k=1
The solution is given by the eigenvector associated with the smallest eigenvalue of Σ.
If the model is exact, this eigenvalue should be zero, and if only one decomposition
of the data in a set of distinct hyperplanes exists (i.e., if n is not chosen too large),
then Q is the unique solution up to a multiplicative constant.
21.5. NUCLEAR NORM MINIMIZATION AND ROBUST PCA 523
However, if x belong in one and only one of the hyperplanes, say xT uj = 0, then all
terms in the sum vanish but one and ∇F(x) is proportional to uj . So, if the model is
exact, one has, for each k = 1, . . . , N , either ∇F(xk ) = 0 (if xk belongs to the intersection
of two hyperplanes) or ∇F(xk )/|∇F(xk )| = ±uj for some j, and the sign ambiguity can
be removed by ensuring, for example, that the first non-vanishing coordinate of uj is
positive. (The gradient of F can be computed from Q using (21.8).) The computation
of ∇F on training data therefore allows for an exact computation of the hyperplanes.
This analysis provides a decomposition of the training set into n (or fewer) hyper-
planes. The computation can then be recursively refined in order to obtain smaller
dimensional subspaces by applying the same method separately to each hyperplane.
One can also interpret PCA in terms of low-rank matrix approximations. Let Xc be
the N by d matrix (x1 − x, . . . , xN − x)T , which, in generic situations, has rank d − 1.
Then PCA with p components is equivalent to minimizing, over all N by d matrices
Z of rank p, the norm of the difference
The quantity |A|2 = trace(AT A) is the sum of square of the entries of A, which is often
referred to as the (squared) Frobenius norm. We have
d
X
2
|A| = σk2
k=1
where σ1 , . . . , σd are the singular values of A, i.e., the square roots of the eigenvalues
of AT A.
524 CHAPTER 21. DIMENSION REDUCTION AND FACTOR ANALYSIS
Proposition 21.7 A matrix Z has rank p if and only if it can be written in the form
Z = AW T where A is N ×p, and W is d ×p with W T W = IdRp , i.e., W = [e1 , . . . , ep ] where
the columns form an orthonormal family of Rd .
Proof The “if” part is obvious and we prove the “only if” part. Assume that Z has
rank p. Take W = [e1 , . . . , ep ], where (e1 , . . . , ep ) is an orthonormal family in Null(Z)⊥ .
Letting ep+1 , . . . , ed denote an orthonormal basis of Null(Z), we have di=1 ei eiT = IdRd
P
and
d
X X p
T
Z=Z ei ei = Z ei eiT = ZW W T
i=1 i=1
so that one can take A = ZW .
Using this representation and letting zkT be the kth row vector of Z, we have
N N p
X X X (j) 2
2 2
|Xx − Z| = |xk − x − zk | = xk − x − ak ej .
k=1 k=1 j=1
(j)
With fixed e1 , . . . , ep , the optimal matrix A has coefficients ak = (xk − x)T ej . In matrix
form, this is:
Xp
Z = Xc ej ejT .
j=1
We therefore retrieve the PCA formulation that we gave in section 21.1, in the
special case of H = Rd with the standard Euclidean product. The lowest value
achieved by the PCA solution is
d
X
2
|Xc − Z| = N λ2k
k=p+1
where λ21 , . . . , λ2d are the eigenvalues of the covariance matrix computed from x1 , . . . , xN ,
who are also the squared singular values of the matrix Xc divided by N .
for some parameter γ > 0. However, the solution to this problem is a small variation
of that of standard PCA. It is indeed given by standard PCA with p components
where p minimizes
d
X d
X
Nγ λ2k + p = N γ (λ2k − (N γ)−1 ) + d,
k=p+1 k=p+1
i.e., p is the index of the last eigenvalue that is larger than (N γ)−1 .
Based on the fact that rank(Z) is the number of non-zero singular values of Z, one
can use the same heuristic as in the development of the lasso, and replace counting
the non-zero values by the sum of the absolute values of the singular values, which
is just the sum of singular values since they are non-negative. This provides the
nuclear norm of A, defined in section 2.4 by
d
X
|A|∗ = σk
k=1
where σ1 , . . . , σd are the singular values of A. We will consider below the problem of
minimizing
γ|Xc − Z|2 + |Z|∗ (21.10)
and show that its solution is once again similar to PCA.
In Cai et al. [44], the authors consider the minimization of (21.10) and prove
the following result. Recall that we have defined the shrinkage function Sτ : t 7→
sign(t) max(|t| − τ, 0) (with τ ≥ 0), using the same notation Sτ (X) when applying Sτ to
every entry of a vector or matrix X. Following Cai et al. [44], we define the singular
value thresholding operator A 7→ Sτ (A), where A is any rectangular matrix, by
Sτ (A) = U Sτ (∆)V T
over all orthonormal matrices U and V and diagonal matrices with non-negative
coefficients D. From theorem 2.1, we know that trace(XcT U DV T ) is less than the
sum of the products of the non-increasingly ordered singular values of Xc and D
and this upper bound is attained by taking U = Ū and V = V̄ where Ū and V̄ are
the matrices providing the SVD of Xc , i.e., such that Xc = Ū ∆V̄ T where ∆ is diagonal
with non-decreasing coefficients along the diagonal. So, letting λ1 ≥ · · · ≥ λd ≥ 0 and
µ1 ≥ · · · ≥ µd ≥ 0 be the singular values of Xc and Z, we have just proved that, for any
D,
d
X d
X d
X
2
F(U , V , D) ≥ F(Ū , V̄ , D) = −2γ µi λi + γ µi + µi .
i=1 i=1 i=1
The lower bound is minimized when µi = max(λi − 1/2γ, 0). This proves the propo-
sition.
As a consequence, the nuclear norm penalty provides the same principal directions
(after replacing γ by 2γ) as the rank penalty, but applies a shrinking operation rather
than thresholding on the singular values. The difference is however more fundamen-
tal if, in addition to using the nuclear norm as a penalty, on replaces the squared
Frobenius norm on the approximation error by the ` 1 norm, where, for an n by m
matrix A with coefficients (a(i, j)),
X
|A|`1 = |a(i, j)| .
i,j
with respect to Z.
Robust PCA (which was initially named Principal Component Pursuit by the au-
thors in Candès et al. [48]) is designed for situations in which Xc can be decomposed
as the sum of a low-rank matrix Z and of a sparse residual S. Some theoretical justi-
fication was provided in the original paper, stating that if Xc = Z+S, with Z = U DV T
(its singular value decomposition) such that U and V are sufficiently “diffuse” and
21.6. INDEPENDENT COMPONENT ANALYSIS 527
rank(Z) is small enough, with the residual’s sparsity pattern taken uniformly at ran-
dom over the subsets of entries of S with a sufficiently small cardinality, then robust
PCA is able to reconstruct the decomposition exactly with high probability (relative
to the random selection of the sparsity pattern of S). We refer to Candès et al. [48]
for the long proof that justifies this statement.
Robust PCA can be solved using the ADMM algorithm (section 3.5.5) after refor-
mulating the problem as the minimization of
γ|R|`1 + |Z|∗
Using this, we can rewrite the robust PCA algorithm as the sequence of fairly simple
iterations.
Algorithm 21.1
(1) Choose a small enough constant α and a very small tolerance level .
(2) Initialize the algorithm with N by d matrices R(0) and U (0) (e.g., equal to zero).
(3) At step n, apply the iteration:
Z (k+1) = Sα (Xc − R(k) − U (k) )
(k+1)
= Sγα (Xc − Z (k+1) − U (k) )
R (21.13)
U (k+1) = U (k) + Z (k+1) + R(k+1) − Xc
(4) Stop the algorithm is the variation compared to variables at the previous step is
below the tolerance level. Otherwise, apply step n + 1.
528 CHAPTER 21. DIMENSION REDUCTION AND FACTOR ANALYSIS
21.6.1 Identifiability
It should be clear that the answer to this question is negative, because there are
trivial transformations of the matrix A that do not break the ICA model. One can,
for example, take any invertible diagonal matrix, D, and let A0 = AD −1 and Y 0 = DY .
The same statement can be made if D is replaced by a permutation matrix, P , which
reorders the components of Y . So we know that AY ∼ A0 Y 0 is possible already when
A0 = ADP where D is diagonal and invertible and P is a permutation matrix. Note
that iterating such matrices (i.e., letting A0 = ADP D 0 P 0 ) does not extend the class
of transformations because one has DP = P P −1 DP and one can easily check that
P −1 DP is diagonal, so that one can rewrite any product of permutations and diagonal
matrices as a single diagonal matrix multiplied by a single permutation.
It is interesting, and fundamental for the well-posedness of ICA, that, under one
important additional assumption, the indeterminacy in the identification of A stops
at these transformations. The additional assumption is that at most one of the com-
ponents of Y follows a Gaussian distribution. That such a restriction is needed is
clear from the fact that one can transform any Gaussian vector Y with independent
components into another, BY , one as soon as BBT is diagonal. If two or more com-
ponents of Y are Gaussian, one can restrict these matrices B to only affect those
components. If only one of them is Gaussian, such an operation has no effect.
Theorem 21.9 Assume that Y is a random vector with independent components, such
that at most one of its components is Gaussian. Let A be an invertible linear transforma-
tion and Ỹ = CY . Then the following statements are independent.
The equivalence of (ii) and (iii) implies that the ICA model is identifiable up
to multiplication on the right by a permutation and a diagonal matrix. Indeed, if
X = AY = A0 Y 0 are two decompositions, then it suffices to apply the theorem to
C = (A0 )−1 A to conclude. The equivalence of (i) and (ii) is striking, and has the
important consequence that, if the data satisfies the ICA model, then, in order to
identify A (up to the listed indeterminacy), it suffices to look for Y = A−1 X with
pairwise independent components, which is a much lesser constraint than full mu-
tual independence.
As a final remark on the Gaussian indeterminacy, we point out that, if the mean
(m) and covariance matrix (Σ) of X are known (or estimated from data), the ICA
problem can be reduced to looking for orthogonal transformations A. Indeed, as-
suming X = AY and letting X̃ = Σ−1/2 (X − m) and Ỹ = D −1/2 (Y − A−1 m), where D is
the (diagonal) covariance matrix of Y , we have
Independence between d variables is a very strong property and its complete char-
acterization is computationally challenging. The fact that the joint p.d.f.of the d
variables (we will restrict, to simplify our discussion, to variables that are absolutely
continuous) factorizes into the product of the marginal p.d.f.’s of each variable can
be measured by computing the mutual information between the variables, defined
530 CHAPTER 21. DIMENSION REDUCTION AND FACTOR ANALYSIS
The mutual information is always non-negative and vanishes only if the components
of Y are mutually independent. Therefore, one can represent ICA as an optimization
problem minimizing I(W X) with respect to all invertible matrices W (so that W =
A−1 ). Letting Z
h(Y ) = − log ϕY (y) ϕY (y)dy
If Z = W X, then ϕZ (z) = ϕX (W −1 x)| det(W )|−1 . Using this expression in h(Z) and
making a change of variables yields h(W Z) = h(X) + log | det W | and
d
X
I(W X) = h(Z (i) ) − log | det(W )| − h(X).
i=1
where W (i) is the ith row of W . This brings a notable simplification, since this ex-
pression only involves differential entropies of scalar variables, but still remains a
challenging problem.
One can shows that ν(U ) ≥ 0 and is equal to 0 if and only if U is Gaussian. One can
rewrite F(W ) as
d d
d d X
2
X
F(W ) = + log(2π) + log(σW (i) X ) − ν(W (i) X) − log | det(W )|
2 2
i=1 i=1
d
X
ν(W (i) X) (21.14)
i=1
and
κ4 = E((U − E(U ))4 ) − 3σU4 .
In particular, when U is normalized, i.e., E(U ) = 0 and σU2 = 1, we have κ3 = E(U 3 )
and κ4 = E(U 4 ) − 3. Under the same assumption, it is proposed in Comon [53] to use
the approximation
κ32 κ42 7κ34 κ32 κ4
ν(U ) ∼ + + − .
12 48 48 8
This approximation was derived from an Edgeworth expansion of the p.d.f. of U ,
which can be seen as a Taylor expansion around a Gaussian distribution. Plugging
this expression into (21.14) provides an expression that can be maximized in W
where the cumulants are replaced by their sample estimates. However, the maxi-
mized function involves high-degree polynomials in the unknown coefficients of W ,
and this simplified problem still presents numerical challenges.
for a p.d.f. ϕ with respect to µ (i.e., such that ϕ is non-negative and has integral 1).
Then, the following is true.
Theorem 21.10 Let g = (g (1) , . . . , g (p) )T be a function defined on a measurable space G,
taking values in Rp , and let µ be a measure on G. Let Γµ be the set of all λ = (λ(1) , . . . , λ(p) ) ∈
Rp such that Z
exp λT g(y) dµ(y) < ∞. (21.15)
G
Then
( Z )
T T
hµ (Y ) ≤ inf −λ E(g(Y )) + log exp λ g(y) dµ(y) : λ ∈ Γµ . (21.16)
G
Define, for λ ∈ Γµ ,
exp λT g(x) dµ(x)
ψλ (x) = R . (21.17)
T
G
exp λ g(y) dµ(y)
Proof Let Y be a random variable with p.d.f. ϕY with respect to µ (otherwise the
lower bound in (21.16) is −∞). Then
Z Z
ϕ (x)
hµ (Y ) + λE(g(Y )) − log exp (λg(y)) dµ(y) = − ϕY (x) log Y dµ(x) ≤ 0
G ψλ (x)
ϕ (x)
R
since G ϕY (x) log ψY (x) dµ(x) is a KL divergence and is always non-negative.
λ
Assume that λ is in Γ̊µ . Then, there exists > 0 such that, for any u ∈ Rp , |u| = 1,
λ + u ∈ Γµ . Using the fact that eβ ≥ eα + (β − α)eα , we can write
Tg Tg Tg
uT geλ ≤ e(λ+u) − eλ
Tg Tg Tg
−uT geλ ≤ e(λ−u) − eλ
yielding
Tg T T Tg Tg Tg T
|uT g|eλ ≤ max(e(λ+u) g , e(λ−u) g ) − eλ ≤ e(λ+u) + e(λ−u) − eλ g .
21.6. INDEPENDENT COMPONENT ANALYSIS 533
Since the upper-bound is integrable with respect to µ, so is the lower bound, showing
that (taking u in the canonical basis of Rp )
Z
T
|g (i) (y)|eλ g(y) dµ(y) < ∞
G
for all i, or Z
T g(y)
|g(y)|eλ dµ(y) < ∞.
G
Then R
g(x)T exp(λT g(x))dx
Z
T
∂λ Ψc = −c + R = −cT + g(x)T ψλ (x)dx .
exp(λT g(y))dy G
Remark 21.11 The previous theorem is typically applied with µ equal to Lebesgue’s
measure on G = Rd or to a counting measure with G finite. To rewrite the statement
of theorem 21.10 in those cases, it suffices to replace dµ(x) by dx for the former, and
integrals by sums over G for the latter. In the rest of the discussion, we restrict to the
case when µ is Lebesgue’s measure, using h(Y ) instead of hµ (Y ).
Remark 21.12 This principle justifies, in particular, that the negentropy is always
non-negative since it implies that a distribution that maximizes the entropy given
its first and second moments must be Gaussian.
d
X d
X Z
T (j)
− λ E(g(W X)) + log exp λT g(y) dy .
j=1 j=1
We have seen in the previous proof that, defining Ψc by (21.19) and denoting by Eλ
the expectation with respect to ϕλ , one has
Now choose c0 such that a maximizer of Ψc0 (λ), say, λc0 , is known. If c is close to c0 ,
a first order expansion indicates that, for λc maximizing Ψc , one should have
with
Ψc (λc ) ' Ψc (λc0 ) − (c − c0 )T ∇2 Ψc (λc0 )−1 (c − c0 ).
One can then use the right-hand side as an approximation of the optimal entropy.
This leads to simple computations under the following √ assumptions. First, as-
(1) (2) 2
sume that the first two functions g and g are u and u / 3. Let ϕ0 be the p.d.f.
of a standard Gaussian. Assume that the functions g (j) are chosen so that
Z
g (i) (u)g (j) (u)ϕ0 (y)dy = δij
R
for i, j = 1, . . . , p and such that g (i) (u)ϕ0 (y)dy = 0 for i , 2. Take
Z
c0 = gϕ0 (u)du
Then λc0 provides, by construction, the distribution ϕ0 and for any c, ∇2 Ψc (λc0 ) =
IdRp . With these assumptions, the approximation is
Ψc (λ) = h(ϕ0 ) − |c − c0 |2
1 X
= (1 + log 2π) − (c(j) )2
2
j≥3
√
(assuming that the data is centered and normalized so that c(1) = 0 and c(2) = 1/ 3).
The ICA problem can then be solved by maximizing
p
d X
X
E(g (i) (W (j) X))2 (21.20)
j=1 i=1
Remark 21.13 Without the assumption made on the functions g (j) , one needs to
compute S = Cov(g(U ))−1 where U ∼ N (0, 1) and maximize
d
X
(E(g(W (j) X)) − E(g(U )))T S(E(g(W (j) X)) − E(g(U ))).
j=1
Clearly, this expression can be reduced to (21.20) by replacing g by S −1/2 (g−E(g(U ))).
Note also that we retrieve here a similar idea to the negentropy, maximizing a devi-
ation to a Gaussian.
In the previous discussion, we reached a few times a formulation of ICA which re-
quired optimizing a function W 7→ F(W ) over all orthogonal matrices. We now dis-
cuss how such a problem may be implemented.
In all the examples that were considered, there would have been no loss of gen-
erality in requiring that W is a rotation, i.e., det(W ) = 1. This is because one can
change the sign of this determinant by simply changing the sign of one of the in-
dependent components, which is always possible. (In fact, the indeterminacy in W
is by right multiplication by the product of a permutation matrix and a diagonal
matrix with ±1 entries.)
Let us assume that F(W ) is actually defined and differentiable over all invertible
matrices, which form an open subset of the linear space Md (R) of d by d matrices.
Our optimization problem can therefore be considered as the minimization of F with
the constraint that W W T = IdRd .
Gradient descent derives from the analysis that a direction of descent should be
a matrix H such that F(W + H) < F(W ) for small enough > 0 and on the remark
that H = −∇F(W ) provides such a direction. This analysis does not apply to the con-
strained optimization setting because, unless the constraints are linear, W + H will
generally stop to satisfy the constraint when > 0, requiring the use of more complex
procedures. In our case, however, one can take advantage of the fact that orthogo-
nal matrices form a group to replace the perturbation W 7→ W + H by W 7→ W eH
(using the matrix exponential) where H is moreover required to be skew symmetric
(H + H T = 0), which guarantees that eH is an orthogonal matrix with determinant
1. Now, using the fact that eH = Id + H + o(), we can write
F(W eH ) = F(W ) + trace(∇F(W )T W H) + o() .
Let ∇s F(W ) be the skew symmetric part of W T ∇F(W ), i.e.,
1
∇s F(W ) = (W T ∇F(W ) − ∇F(W )T W ).
2
536 CHAPTER 21. DIMENSION REDUCTION AND FACTOR ANALYSIS
so that
F(W eH ) = F(W ) + trace(∇s F(W )T H) + o() .
This show that H = −∇s F(W ) provides a direction of descent in the orthogonal group,
in the sense that, if ∇s F(W ) , 0,
s
F(W e−∇ F(W ) ) < F(W )
combined with a line search for n implements gradient descent in the group of
orthogonal matrices, and therefore converges to a local minimizer of F.
We now describe a parametric version of ICA in which a model is chosen for the in-
dependent components of Y . The simplest version of to assume that all Y (j) are i.i.d.
with some prescribed p.d.f., say, ψ. A typical example for ψ is a logistic distribution
with
2
ψ(t) = t −t 2 .
(e + e )
21.6. INDEPENDENT COMPONENT ANALYSIS 537
If y is a vector in Rd , we will use, as usual, the notation ψ(y) = (ψ(y (1) ), . . . , ψ(y (d) ))T
for ψ applied to each component of y.
The model parameter is then the matrix A, or preferably W = A−1 , and it may be
estimated using maximum likelihood. Indeed, the p.d.f. of X is
d
Y
fX (x) = | det W | ψ(W (j) x)
j=1
and use the fact that the gradient of W 7→ log | det W | is W −T (the inverse transpose
of W ), we can write
∇`(W ) = N W −T + Γ (W ).
H = W T (N W −T + Γ (W )) = N Id + W T Γ (W ).
Note that the algorithms that we discussed concerning ICA were all formulated in
terms of the matrix W = A−1 , which “filters” the data into independent components.
As a result, ICA requires as many independent components as the dimension of X.
Moreover, because the components are typically normalized to have equal variance,
there is no obvious way to perform dimension reduction using this method. Indeed,
ICA is typically run after the data is preprocessed using PCA, this preprocessing
step providing the reduction of dimension.
p
X
X= aj Y (j) + σ R
j=1
Let us assume a parametric setting similar to that of the previous section, so that
Y (1) , . . . , Y (p)
are explicitly modeled as independent variables with p.d.f. ψ. Introduce
the matrix A = [a1 , . . . , ap ], so that the model can also be written X = AY + σ R, where
A and σ 2 are unknown model parameters.
which is definitely not a closed form. Since we are in a situation in which the pair of
random variables is imperfectly observed through X, using the EM algorithm (chap-
ter 17) is an option, but it may, as we shall see below, lead to heavy computation. The
basic step of the EM is, given current parameters A0 , σ0 , to maximize the conditional
expectation (knowing X, for the current parameters) of the joint log-likelihood of
(X, Y ) with respect to the new parameters. In this context, the joint distribution of
(X, Y ) has density
p
1 |x−Ay |2 Y
2 −
fX,Y (x, y; A, σ ) = 2 d/2
e 2σ 2
ψ(y (i) )
(2πσ ) i=1
21.6. INDEPENDENT COMPONENT ANALYSIS 539
N N p
Nd 1 X XX
− log(2πσ 2 )− 2 EA0 ,σ0 (|xk − AY |2 |X = xk )− EA0 ,σ 2 (log ψ(Y (j) )|X = xk ).
2 2σ 0
k=1 k=1 j=1
Notice that the last term does not depend on A, σ 2 , and that, given A, the optimal
value of σ 2 is given by
N
1 X
σ2 = EA0 ,σ0 (|xk − AY |2 |X = xk )
Nd
k=1
The minimization of
N
X
EA0 ,σ0 (|xk − AY |2 |X = xk )
k=1
(j)
with respect to A is a least square problem. Let bk = EA0 ,σ0 (Y (j) |X = xk ) and sk (i, j) =
EA0 ,σ0 (Y (i) Y (j) |X = xk ): the gradient of the previous term is
N
X N
X
T
−2 EA0 ,σ 2 ((xk − AY )Y |X = xk ) = −2 (xk bkT − ASk ),
0
k=1 k=1
(j)
bk being the column vector with coefficients bk and Sk the matrix with coefficients
sk (i, j). The result therefore is
N N −1
X X
A = xk bkT Sk .
k=1 k=1
In place of the exact EM, one may use a mode approximation (section 17.3.1),
which replaces the conditional likelihood of Y given X = xk by a Dirac distribution
540 CHAPTER 21. DIMENSION REDUCTION AND FACTOR ANALYSIS
at the mode:
|A y−x |2
− 0 2k
ŷA0 ,σ0 (xk ) = argmaxy ψ(y)e 2σ0 .
with bk = ŷAn ,σn (xk ), Sk = ŷAn ,σn (xk )ŷAn ,σn (xk )T , and
N
2 1 X 2
σn+1 = xk − AŷAn ,σn .
Nd
k=1
(3) Stop if the variation of the parameter is below a tolerance level. Otherwise,
iterate to the next step.
Once A and σ 2 have been estimated, the y components associated to a new obser-
vation x can be estimated by ŷA,σ (x), therefore minimizing
p
1 2
X
2
xk − Ay + log ψ(y (j) ),
2σ
j=1
yielding the map estimate, the same convex optimization problem as in step (1)
above. Now we can see how the method takes from both PCA and ICA: the columns
21.6. INDEPENDENT COMPONENT ANALYSIS 541
This distribution with “exponential tails” has the interest of allowing large values of
y (j) , which generally entails sparse decompositions, in which y has a few large coeffi-
cients, and many zeros.
As an alternative to the mode approximation of the EM, which may lead to bi-
ased estimators, one may use the SAEM algorithm (section 17.4.3), as proposed in
Allassonniere and Younes [3]. Recall that the EM algorithm replaces the parameters
A0 , σ02 by minimizers of
N
Nd 1 X
log(σ 2 ) + 2 EA0 ,σ0 (|xk − AY |2 |X = xk )
2 2σ
k=1
N N N
Nd 1 X 2 1 X T 1 X
= log(σ 2 ) + 2 |xk | − 2 xk Abk + 2 trace(AT ASk ),
2 2σ σ 2σ
k=1 k=1 k=1
(j)
where the computation of bk = EA0 ,σ0 (Y (j) |X = xk ) and sk (i, j) = EA0 ,σ0 (Y (i) Y (j) |X = xk )
was the challenging issue. In the SAEM algorithm, the statistics bk and Sk are part of
a stochastic approximation scheme, and are estimated in parallel with EM updates
as follows.
bk → bk + γt (Yk − bk )
(
Sk → Sk + γt (Yk YkT − Sk )
542 CHAPTER 21. DIMENSION REDUCTION AND FACTOR ANALYSIS
and
N
2 1 X 2
σ = xk − AŷA0 ,σ0 .
Nd
k=1
P
P The2
parameter γt should be decreasing with t, typically so that t γt = +∞ and
t γt < ∞ (e.g., γt ∝ 1/t). One way to sample from Yk is to uses a rejection scheme,
iterating the procedure which samples y according to the prior and accepts the result
with probability M exp(−|xk − Ay|2 /2σ 2 ) until acceptance. Here M must be chosen so
that M maxy exp(−|xk − Ay|2 /2σ 2 ) ≤ 1 (e.g., M = 1).
This method will work for small p, but for large p, the probability of acceptance
may be very small. In such cases, Yk can be sampled changing one component at a
time using a Metropolis-Hastings scheme. If component j is updated, this scheme
samples a new value of y (call it y 0 ) by changing only y (j) according to the prior
distribution ψ and accept the change with probability
exp(−|xk − Ay 0 |2 /2σ 2 )
!
min 1, .
exp(−|xk − Ay|2 /2σ 2 )
(j)
where A is N by p and provides the coefficients ak associated with each observation
and Y = [y (1) , . . . , y (p) ] is d by p and provides the p typical profiles. The matrices A
and Y are unknown and their estimation subject to the constraint of having non-
negative components represent the non-negative matrix factorization (NMF) prob-
lem.
Then
v T Mv − 2bT v ≤ u T Mu − 2bT u .
Moreover, v = u if and only if u minimizes u T Mu − 2bT u subject to u (i) = 0, i = 1, . . . , n.
Proof Let F(u) = u T Mu − 2bT u. We look for v (i) = β (i) u (i) with β (i) ≥ 0 such that
F(v) ≤ F(u). We have
n
X n
X
F(v) = β (i) β (j) u (i) u (j) m(i, j) − 2 b(i) β (i)
i,j=1 i=1
n n
1 X
(i) 2 (j) 2 (i) (j)
X
≤ ((β ) + (β ) )u u m(i, j) − 2 b(i) β (i) u (i)
2
i,j=1 i=1
n
X n
X
= (β (i) )2 u (i) u (j) m(i, j) − 2 b(i) β (i) u (i)
i,j=1 i=1
544 CHAPTER 21. DIMENSION REDUCTION AND FACTOR ANALYSIS
An alternative version of the method has been proposed, where the objective
function is Φ(AY T ), where, for an N by d matrix Z = [z1 , . . . , zN ]T ,
N X
X d
(i) (i) (i)
Φ(Z) = (zk − xk log zk )
k=1 i=1
which is indeed minimal for Z = X . We state and prove a second lemma that will
allow us to address this problem.
Lemma 21.15 Let M be an n by q matrix and x ∈ Rn , b ∈ Rq , all assumed to have positive
entries. For u ∈ (0, +∞)q , define
q
X n
X q
X
(j) (j) (i)
F(u) = b u − x log m(i, j)u (j) .
j=1 i=1 j=1
Let ρ(i, j) = m(i, j)u (j) /α (i) . Since the logarithm is concave, we have
q
X q
X
(j)
log ρ(i, j)β ≥ ρ(i, j) log β (j)
j=1 j=1
so that
q
X X q
n X n
X q
X
(j) (() (j) (i) (j) (i)
F(w) ≤ b u jβ − x ρ(i, j) log β − x log m(i, j)u (j) .
j=1 i=1 j=1 i=1 j=1
546 CHAPTER 21. DIMENSION REDUCTION AND FACTOR ANALYSIS
The upper bound with β (j) ≡ 1 gives F(u), so minimizing this expression in β will
give F(w) ≤ F(u). This minimization is straightforward and gives
Pn (i) Pn (i) /α (i)
(j) i=1 x ρ(i, j) i=1 m(i, j)x
β = =
b(j) u (j) b(j)
and the optimal w is the vector v provided in the lemma. Finally, one checks that
v = u if and only if u satisfies the KKT conditions for the considered problem.
We can now apply this lemma to derive update rules for Y and A, where the
objective is
N X d X p N X
d p
(i) (j) (i) (j)
X X (i)
X
yj ak − xk log yj ak .
k=1 i=1 j=1 k=1 i=1 j=1
Starting with the minimization in A, we apply the lemma to each index k separately,
(i) (j)
taking n = d and q = p, with b(j) = di=1 yj and m(i, j) = yi . Then the update is
P
For Y , we can work with fixed i and apply the lemma with n = N , q = p, b(j) =
PN (j) (j)
k=1 ak and m(k, j) = ak . This gives the update:
The expectation in many factor models is that individual observations are obtained
by mixing pure categories, or topics, and represented as a weighted sum or linear
combination of a small number of uncorrelated or independent variables. Denote p
the number of possible categories, which, in this section, can be assumed to be quite
large.
We will assume that each observation randomly selects a small number among
these categories before combining them. Let us consider (as an example) the follow-
ing model.
548 CHAPTER 21. DIMENSION REDUCTION AND FACTOR ANALYSIS
where:
• Rk follows a standard Gaussian distribution,
• ak (1), . . . , ak (p) are independent with ak (j) ∼ N (mj , τj2 ),
• bk (1), . . . , bk (p) are independent and follow a Bernoulli distribution with param-
eter πj ,
• Y (1) , . . . , Y (p) are independent standard Gaussian random variables.
• σ 2 follows an inverse gamma distribution with parameters α0 , β0 .
• τ12 , . . . , τp2 follow independent inverse gamma distributions with parameters α1 , β1 .
• mj follow a Gaussian N (0, ρ2 ) and,
• πj follow a beta distribution with parameters (u, v).
The priors are, as usual, chosen so that the computation of posterior distributions
is easy, i.e., they are conjugate priors. The observed data is therefore obtained by
selecting components Yj with probability πj and weighted with a Gaussian random
coefficient, then added before introducing noise.
Let nj = N
P
k=1 bk (j). Ignoring constant factors, the joint likelihood of all variables
together is proportional to:
1 X N X p
L ∝σ −N d exp − 2 |Xk − ak (j)bk (j)Y (j) |2
2σ
k=1 j=1
p p
N
Y 1 X 1 X
τ −N exp − 2 2
j (a k (j) − m )
j exp − m j
2τj2
2ρ2
j=1 k=1 j=1
p p
Y nj Y
πj (1 − πj )N −nj (τj2 )−α1 −1 exp(−β1 /τj2 )
j=1 j=1
p p
Y 1 X
(σ 2 )α0 −1 exp(−β0 /σ 2 ) πju−1 (1 − πj )v−1 exp − |Y (i) |2
2
j=1 i=1
• The conditional distribution of σ 2 , τ12 , . . . , τp2 given all other variables remains a
product of inverse gamma distributions.
• The conditional distribution of Y (1) , . . . , Y (p) given the other variables is Gaus-
sian.
• The conditional distribution of π1 , . . . , πp given the other variables is a product
of beta distributions.
• The conditional distribution of m1 , . . . , mp given the other variables remain in-
dependent Gaussian.
• The posterior distribution of a1 , . . . , aN (considered as p-dimensional vectors)
given the other variables is a product of independent Gaussian (but the components
ak (j), j = 1, . . . , p are correlated).
• For the posterior distribution given the other variables, b1 , . . . , bN (considered
as p-dimensional vectors) are independent. The components of each bk are not inde-
pendent but each bk (j) being a binary variable follows a Bernoulli distribution given
the other ones.
These remarks provide the basis of a Gibbs sampling algorithm for the simulation
of the posterior distribution of all unobserved variables (the computation of the pa-
rameters of each of the conditional distribution above requires some work, of course,
and these details are left to the reader). This simulation does not explicitly provide a
matrix factorization of the data (in the sense of a single matrix A such that X = AY ,
as considered in the previous section), but a probability distribution on such matri-
ces, expressed as A(k, j) = ak (j)bk (j). One can however use the average of the matri-
ces obtained through the simulation for this purpose. Additional information can
be obtained through this simulation. For example, the expectation of bk (j) provides
a measure of proximity of observation k to category j.
Poisson factor analysis. Many variations can be made on the previous construc-
tion. When the observations are non-negative, for example, an additive Gaussian
noise may not be well adapted. Alternative models should model the conditional
distribution of X given a, b and Y as a distribution over non-negative numbers with
mean (a b)T Y (for example a gamma distribution with appropriate parameters).
The posterior sampling generally is more challenging in this case because simple
conjugate priors are not always available.
An important special case is when X is a count variable taking values in the set of
non-negative integers. In this case (starting with a model without feature selection),
modeling X as a Poisson variable with mean a(1)Y (1) + · · · + a(p)Y (p) leads to tractable
computations, once it is noticed that X can be seen as a sum of random variables
550 CHAPTER 21. DIMENSION REDUCTION AND FACTOR ANALYSIS
Z [1] , . . . , Z [p] where Z [i] follows a Poisson distribution with parameter a(i)Y (i) . This
suggests introducing new latent variables (Z [1] , . . . , Z [p] ), which are not observed but
follow, conditionally to their sum, which is X and is observed, Pp a multinomial distri-
bution with parameters X, q1 , . . . , qp , with qi = a(i)Y /( j=1 a(j)Y (j) ).
(i)
GaP with feature selection One can include a feature selection step in this model
by introducing binary variables b(1), . . . , b(p), with selection probabilities π1 , . . . , πp ,
with a Beta(u, v) prior distribution on πi . Doing so, the likelihood of the extended
model is:
N N p [i]
X Y Y (ak (i)bk (i)Y (i) )zk
L ∝ exp − (ak (1)bk (1)Y (1) + · · · + ak (p)bk (p)Y (p) ) [i]
k=1
k=1 i=1 zk !
N p N X p
p
Y Y X X
ak (i)α−1 exp −β ak (i) exp − Y (i)
k=1 i=1 k=1 i=1 i=1
p
Y n p
Y
πj (1 − πj )N −nj
j
πju−1 (1 − πj )v−1 .
j=1 j=1
where, as before, nj = N
P
k=1 bk (j). The conditional distribution of π1 , . . . , πp given
the other variables is therefore still that of a family of independent beta-distributed
variables. The binary variables bk (1), . . . , bk (p) are also conditionally independent
[i]
given the other variables, with bk (i) = 1 with probability one if zk > 0 and with
[j]
probability πj exp(−ak (j)Y (j) ) if zk = 0.
21.9. BAYESIAN FACTOR ANALYSIS AND POISSON POINT PROCESSES 551
The previous models assumed that p features were available, modeled as p random
variables with some prior distribution, and that each observation picks a subset of
them, drawing feature j with probability πj . We denoted by bk (j) the binary variable
indicating whether feature j was selected for observation k, and nj was the number
of times that feature was selected. Finally, we modeled πj as a beta variable with
parameters u and v.
One can compute, using this model, the probability distribution of of the feature
selection variables, b = (bk (j), j = 1, . . . , p, k = 1, . . . , N ). From the model definition, the
probability of observing such a configuration is given by
p
Γ (u + v)p
Z Y
nj +u−1
Q(b) = πj (1 − πj )N −nj +v−1 dπ1 . . . dπp
Γ (u)p Γ (v)p
j=1
p
Y Γ (u + v)Γ (u + nj )Γ (v + N − nj )
=
Γ (u)Γ (v)Γ (u + v + N )
j=1
p
Y u(u + 1) · · · (u + nj − 1)v(v + 1) · · · (v + N − nj − 1)
=
(u + v)(u + v + 1) · · · (u + v + N − 1)
j=1
Using this last equation, we can interpret the probability Q as resulting from a pro-
gressive feature assignment process. The first observation, k = 1, for which njk = 0
for all j, chooses each feature with probability u/(u + v). When reaching observation
k, feature j is chosen with probability (u + njk )/(u + v + k − 1). At all steps, features
are chosen independently from each other.
552 CHAPTER 21. DIMENSION REDUCTION AND FACTOR ANALYSIS
N qk !p−pk+1
Y u v + k − 1
Q(S) =
u +v +k−1 u +v +k−1
k=1
1j∈C !1j<C
Y u + njk k v + k − 1 − njk k
.
u+v+k−1 u+v+k−1
j∈Uk
Let S k = (Gl , Cl , l ≤ k). Then the expression of Q shows that, conditionally to S k−1 , Gk
and Ck are independent. Elements in Ck are chosen independently for each feature
j ∈ Uk with probability (u + njk )/(u + v + k − 1). Moreover, the conditional distribution
of qk given S k−1 is proportional to
qk !p−pk −qk
u v +k−1
u+v+k−1 u+v +k−1
with cardinality qk .
If there is no special meaning in the feature label, which is the case in our discus-
sion of prior models in which all features are sampled independently with the same
distribution, we may identify configurations that can be deduced from each other by
relabeling (note that relabeling features does not change the value of Q).
(picking uniformly at random one of the possible ones). The probability of a normal
configuration S obtained through this process is (using a simple counting argument)
N ! qk !p−pk+1
Y p − pk u v + k − 1
Q(S) =
q
k u +v +k−1 u +v +k−1
k=1
1j∈C !1j<C
Y u + njk k v + k − 1 − njk k
,
u+v+k−1 u+v+k−1
j∈Uk
This provides a new incremental procedure that directly samples normalized as-
signments. First let q1 follow a binomial distribution bin(p, u/(u + v)) and assign the
first observation to features 1 to q1 . Assume that pk labels have been created before
step k. Then select for observation k some of the already labeled features, label j
being selected with probability (u + njk )/(u + v + k − 1) as above. Finally, add qk new
features where qk follows a binomial distribution bin(p − pk , u/(u + v + k − 1)).
This discussion is clearly reminiscent of the one that was made in section 20.7.3
leading to the Polya urn process, and we want here also to let p tend to infinity
(with fixed N ) with proper choices of u and v as functions of p in the expression
above. Choose two positive numbers c and γ and let u = cγ/p and v = c − u. Note
that, with the incremental simulation process that we just described, the conditional
expectation of the next number of labels, $p_{k+1}$, given the current one, $p_k$, is
$$E(p_{k+1}\mid p_k) = \frac{(p-p_k)u}{u+v+k-1} + p_k = \frac{c\gamma}{c+k-1} + \Big(1 - \frac{c\gamma}{p(c+k-1)}\Big)p_k \le \frac{c\gamma}{c+k-1} + p_k.$$
So, when p → ∞, we obtain the following incremental simulation process for the
feature labels, that we combine with the actual simulation of the features, assumed
to follow a prior distribution with p.d.f. ψ. This process is called the Indian buffet
process in the literature, the analogy being that a buffet offers an infinite variety of
dishes, and each observation is a customer who tastes a finite number of them.
2. Assume that observations 1 to k−1 have been obtained, with pk features y (1) , . . . , y (pk )
such that the jth feature has been chosen nk,j times.
(i) For $j = 1,\dots,p_k$, assign feature j to sample k with probability $\dfrac{n_{k,j}}{c+k-1}$. If j is selected, let $n_{k+1,j} = n_{k,j}+1$, otherwise let $n_{k+1,j} = n_{k,j}$.
(ii) Sample an integer $q_k$ according to a Poisson distribution with parameter $\dfrac{c\gamma}{c+k-1}$ and let $p_{k+1} = p_k + q_k$.
(iii) Sample features y (pk +1) , . . . , y (pk+1 ) according to ψ.
(iv) Assign these features to observation k, and let $n_{k+1,j} = 1$ for $j = p_k+1,\dots,p_{k+1}$.
3. If k = N , stop, otherwise replace k by k + 1 and return to Step 2.
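As a concrete illustration, here is a minimal Python sketch of this simulation (not code from the book); the feature distribution ψ is taken to be standard normal purely for concreteness, and the helper name indian_buffet is ours.

```python
import numpy as np

def indian_buffet(N, c, gamma, rng=None):
    """Sample feature assignments for N observations from the Indian buffet
    process with parameters c and gamma.  Returns the features (drawn here
    from a standard normal psi, an arbitrary choice) and a binary matrix Z
    with Z[k, j] = 1 when observation k uses feature j."""
    rng = np.random.default_rng() if rng is None else rng
    counts = []          # n_{k,j}: number of past uses of feature j
    features = []        # y^{(j)}, drawn from psi
    rows = []
    for k in range(1, N + 1):
        row = []
        # (i) revisit existing features with probability n_{k,j}/(c+k-1)
        for j in range(len(counts)):
            if rng.random() < counts[j] / (c + k - 1):
                counts[j] += 1
                row.append(j)
        # (ii) create q_k ~ Poisson(c*gamma/(c+k-1)) new features
        q_k = rng.poisson(c * gamma / (c + k - 1))
        for _ in range(q_k):
            features.append(rng.standard_normal())   # (iii) y ~ psi
            counts.append(1)                          # (iv) n_{k+1,j} = 1
            row.append(len(counts) - 1)
        rows.append(row)
    Z = np.zeros((N, len(counts)), dtype=int)
    for k, row in enumerate(rows):
        Z[k, row] = 1
    return np.array(features), Z

feats, Z = indian_buffet(N=10, c=1.0, gamma=3.0)
print(Z.shape, Z.sum(axis=1))   # number of dishes tasted per customer
```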
This section assumes that the reader is familiar with measure theory. It can however safely
be skipped as it is not reused in the rest of the book.
If Z is a set, we will denote by Pc (Z) the set composed with all finite or countable
subsets of Z. A point process over Z is a random variable S : Ω → Pc (Z), i.e., a
variable that provides a countable random subset of Z. If $B \subset Z$ one can then define the counting function $\nu_S(B) = |S \cap B| \in \mathbb{N} \cup \{+\infty\}$.
A proper definition of such point processes requires some measure theory. Equip
Z with a σ -algebra A and consider the set N0 of integer-valued measures µ on (Z, A)
such that µ(Z) < ∞. Let N be the set formed with all countable sums of measures
in N0 . Then a general point process is a mapping ν : Ω → N such that for all
k ∈ N ∪ {+∞} and all B ∈ A, the event {ν(B) = k} is measurable. Recall that, for each
$B \in \mathcal A$, $\nu(B)$ is itself a random variable, which we may denote $\omega\mapsto\nu_\omega(B)$. One then defines the intensity of the process as the function $\mu : B \mapsto E(\nu(B))$.
Theorem 21.16 (Campbell identity) Let ν be a point process with intensity µ. For
ω ∈ Ω, let Xω : Ω0 → Z be a random variable with distribution νω (defined, if needed, on
a different probability space (Ω0 , P 0 )). Then, for any µ-integrable function f :
$$E(f(X)) = \int_Z f(z)\,d\mu(z). \tag{21.22}$$
Here, the expectation of f (X) is over both spaces Ω and Ω0 and corresponds to the
average of f . The identity is an immediate consequence of Fubini’s theorem.
(i) If $B_1,\dots,B_n$ are pairwise disjoint, then $\nu(B_1),\dots,\nu(B_n)$ are mutually independent.
(ii) for all B, ν(B) ∼ Poisson(µ(B)).
We take the convention that ν(B) = 0 (resp. = ∞) almost surely if µ(B) = 0 (resp.
= ∞). Note that property (i) also implies that if $g_1,\dots,g_n$ are measurable functions from Z to $(0,+\infty)$ such that $g_ig_j = 0$ for $i \neq j$, then the variables $\nu(g_i) = \int_Z g_i(z)\,d\nu(z)$ are independent.
If µ(Z) < ∞ (i.e., µ is finite), one can represent the distribution of a Poisson point
process as follows:
$$\nu = \sum_{k=1}^{\nu(Z)}\delta_{X_k}$$
with ν(Z) ∼ Poisson(µ(Z)) and, conditional to ν(Z) = N , X1 , . . . , XN are i.i.d. and fol-
low the probability distribution µ̄ = µ/µ(Z). This measure can also be identified with
the random set S = {X1 , . . . , Xν(Z) }. The assumption that µ({z}) = 0 for any singleton
implies that ν({z}) = 0 almost surely. It also ensures that the points X1 , . . . , XN are
distinct with probability one.
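For finite µ, this representation translates directly into a two-step sampling recipe. The sketch below is a minimal illustration under the assumption that $Z = [0,1]^2$ and µ is λ times the Lebesgue measure (our arbitrary choices, not the book's).

```python
import numpy as np

rng = np.random.default_rng(0)

# Poisson point process on Z = [0,1]^2 with intensity mu = lam * Lebesgue
lam = 50.0
n_points = rng.poisson(lam)                 # nu(Z) ~ Poisson(mu(Z))
points = rng.random((n_points, 2))          # X_k i.i.d. from mu/mu(Z)

# check: the count in B = [0,.5]x[0,.5] is Poisson(lam/4) in distribution
in_B = ((points[:, 0] < 0.5) & (points[:, 1] < 0.5)).sum()
print(n_points, in_B)                       # E nu(B) = 12.5 here
```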
In the following, we will consider this class of random measures, with the small
addition of allowing for an extra term including a measure supported by a fixed set.
More precisely, given a (deterministic) countable subset I ⊂ Z, a family of indepen-
dent random variables (ρz , z ∈ I ) and a σ -finite measure µo such that µo ((0, +∞) ×
{z}) = 0 for all z ∈ Z, we can define the random measure
ξ = ξf + ξo
where ξo is a weighted Poisson process with intensity µo , assumed independent of
$(\rho_z, z\in I)$, and
$$\xi_f = \sum_{z\in I}\rho_z\delta_z.$$
The subscripts o and f come from the terminology introduced in Kingman [106], which studies “completely random measures,” that is, random measures that satisfy point (i) in the definition of a Poisson process. Under mild assumptions, such measures can be decomposed as the sum of a weighted Poisson process (here, $\xi_o$, the ordinary part), of a process with fixed support (here, $\xi_f$, the fixed part), and of a deterministic measure (which is here taken to be 0).
where the last term is an application of Campbell's identity to the Poisson process $\nu_o$ and the function $g(w,x) = w1_B(x)$.
The main example of such processes in factor analysis is the beta process that will be
discussed in the next section. We start, however, with a first example that is closely
related with the Dirichlet process, called the gamma process.
In this process, one fixes a finite measure $\pi_0$ on Z and defines µ on $(0,+\infty)\times Z$ by
$$\mu(dw,dz) = c\,w^{-1}e^{-cw}\,dw\,\pi_0(dz).$$
Because µ is σ-finite but not finite (the integral over w diverges at $w = 0$), every realization of ξ is an infinite sum
$$\xi = \sum_{k=1}^{\infty} w_k\delta_{z_k}.$$
The intensity measure of ξ is
$$\eta(B) = c\pi_0(B)\int_0^{+\infty} e^{-cw}\,dw = \pi_0(B).$$
In particular, $\sum_{k=1}^{\infty} w_k$ has expectation $\eta(Z) = \pi_0(Z) < \infty$, and is therefore finite almost surely.
For fixed B, the variable ξ(B) follows a gamma distribution. This can be proved by computing the Laplace transform of ξ(B), $E(e^{-\lambda\xi(B)})$, and identifying it with that of a gamma distribution. To make this computation, consider the point process $\nu_J$ restricted to an interval $J \subset (0,+\infty)$ with $\min(J) > 0$, and $\xi_J$ the corresponding weighted process. Let $m_J(t) = \int_J w^{-1}ce^{-(c+t)w}\,dw$. Then a realization of $\nu_J$ can be obtained by first sampling N from a Poisson distribution with parameter $\mu(J\times Z) = m_J(0)\pi_0(Z)$ and then sampling N points $(w_i,z_i)$ independently from the distribution $\mu/(m_J(0)\pi_0(Z))$. This implies that
$$
E(e^{-t\xi_J(B)}) = e^{-m_J(0)\pi_0(Z)}\sum_{n=0}^{\infty}\frac{(m_J(0)\pi_0(Z))^n}{n!}\left(\frac{\int_J\int_Z e^{-tw1_B(z)}\,w^{-1}ce^{-cw}\,dw\,d\pi_0(z)}{m_J(0)\pi_0(Z)}\right)^n
$$
$$
= \sum_{n=0}^{\infty}\frac{e^{-m_J(0)\pi_0(Z)}}{n!}\Big(\pi_0(B)m_J(t) + (\pi_0(Z)-\pi_0(B))m_J(0)\Big)^n
= e^{\pi_0(B)(m_J(t)-m_J(0))}.
$$
Now,
$$m_J(t) - m_J(0) = c\int_J e^{-cw}\,\frac{e^{-tw}-1}{w}\,dw$$
is finite even when $J = (0,+\infty)$. With a little more work justifying the passage to the limit, one finds that, for $J = (0,+\infty)$,
$$E(e^{-t\xi_J(B)}) = \exp\left(\pi_0(B)\,c\int_0^{+\infty} e^{-cw}\,\frac{e^{-tw}-1}{w}\,dw\right).$$
Finally, write
$$
c\int_0^{+\infty} e^{-cw}\,\frac{e^{-tw}-1}{w}\,dw = -c\int_0^{+\infty} e^{-cw}\int_0^t e^{-sw}\,ds\,dw = -c\int_0^t\int_0^{+\infty} e^{-(s+c)w}\,dw\,ds = -\int_0^t \frac{c}{s+c}\,ds = -c\log\Big(1+\frac{t}{c}\Big).
$$
This shows that
$$E(e^{-t\xi_J(B)}) = \Big(1+\frac{t}{c}\Big)^{-c\pi_0(B)},$$
which is the Laplace transform of a gamma distribution with parameters $c\pi_0(B)$ and c, i.e., with density proportional to $w^{c\pi_0(B)-1}e^{-cw}$.
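The computation above can be checked numerically on the truncated process $\nu_J$. The following sketch is our illustration (with $\pi_0$ taken uniform on [0,1] and $B = [0, 0.3]$ as arbitrary choices): it samples $\xi_J(B)$ by rejection sampling of the weights and compares the empirical Laplace transform with $\exp(\pi_0(B)(m_J(t)-m_J(0)))$.

```python
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(1)
c, eps, b0, t = 2.0, 0.05, 0.3, 1.5    # pi_0 = Uniform([0,1]), B = [0, b0]

def m_J(s):     # m_J(s) = int_J w^{-1} c e^{-(c+s)w} dw with J = [eps, inf)
    return quad(lambda w: c * np.exp(-(c + s) * w) / w, eps, np.inf)[0]

def sample_weight():
    # rejection sampling from density prop. to w^{-1} e^{-cw} on [eps, inf):
    # propose w = eps + Exp(c) (density prop. to e^{-cw}), accept w.p. eps/w
    while True:
        w = eps + rng.exponential(1.0 / c)
        if rng.random() < eps / w:
            return w

def xi_J_B():
    n = rng.poisson(m_J(0.0))                    # mu(J x Z) = m_J(0) here
    ws = np.array([sample_weight() for _ in range(n)])
    zs = rng.random(n)                           # z_i ~ pi_0
    return ws[zs < b0].sum()                     # sum of weights landing in B

emp = np.mean([np.exp(-t * xi_J_B()) for _ in range(4000)])
print(emp, np.exp(b0 * (m_J(t) - m_J(0.0))))     # the two should be close
```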
The definition of the beta process parallels that of the gamma process, with weights
taking this time values in (0,1). Fix again a finite measure $\pi_0$ on Z and let $\mu_o$ on $(0,+\infty)\times Z$ be defined by
$$\mu_o(dw,dz) = c\,w^{-1}(1-w)^{c-1}\,dw\,\pi_0(dz).$$
A beta process is then a random measure of the form $\xi = \xi_o + \sum_{z\in I}w_z\delta_z$, where I is a fixed finite set and $(w_z, z\in I)$ are independent and follow a beta distribution with parameters $(a(z), b(z))$.
In the same way the Polya urn could be used to sample from a realization of a
Dirichlet process without actually sampling the whole process, there exists an algo-
rithm that samples a sequence of feature sets (A1 , . . . , An ) from this feature selection
process without needing the infinite collection of weights and features associated
with a beta process. We assume in the following that the prior process has an empty
fixed set. (Non-empty fixed sets will appear in the posterior.)
Now assume that n sets of features $A_1,\dots,A_n$ have been obtained and we want to sample a new set $A_{n+1}$ conditionally to their observation. Let $J_n$ be the union of all random features obtained up to this point and $n(z)$, for $z\in J_n$, the number of times this feature was observed in $A_1,\dots,A_n$. Then the conditional distribution of the beta process ξ given this observation is still a beta process, with fixed set given by $I = J_n$, $(a(z), b(z)) = (n(z), c+n-n(z))$ for $z\in J_n$, and base measure $\pi_n = c\pi_0/(c+n)$. This
implies that the next set An+1 can be obtained by sampling from the associated fea-
ture process. To do this, one first selects features z ∈ Jn with probability n(z)/(c + n),
then selects additional features z1 , . . . , zN independently with distribution π0 /π0 (Z)
where N follows a Poisson distribution with parameter cπ0 (Z)/(c + n). This is the
Indian buffet process, described in Algorithm 21.6 (taking π0 = γψ).
The beta process can be used as a prior for feature selection within a factor analysis model, as described in the previous paragraph. It is however easier to approximate it with a model with almost surely finite support. Indeed, letting, for $\epsilon > 0$,
$$\mu(dw,dz) = \frac{\Gamma(c+1)}{\Gamma(\epsilon+1)\Gamma(c-\epsilon)}\,w^{\epsilon-1}(1-w)^{c-\epsilon}\,\pi_0(dz)\,dw,$$
In this case, the prior generates features by first sampling their number, p, randomly according to a Poisson distribution with mean $c\gamma/\epsilon$, then selecting p probabilities $w_1,\dots,w_p$ independently using a beta distribution with parameters $\epsilon$ and $c-\epsilon$, and finally attaching to each i a feature $z_i$ with distribution $\pi_0/\gamma$. The features associated with a given sample are then obtained by selecting each $z_i$ with probability $w_i$.
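A minimal sketch of this finite-support generative process follows (our illustration; the standard normal stands in for $\pi_0/\gamma$ and all parameter values are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(2)
c, gamma, eps, N = 1.0, 4.0, 0.05, 8    # eps is the truncation parameter

p = rng.poisson(c * gamma / eps)        # number of candidate features
w = rng.beta(eps, c - eps, size=p)      # selection probabilities
z = rng.standard_normal(p)              # features z_i, standing in for pi_0/gamma

# each of N samples picks feature i independently with probability w_i
Z = (rng.random((N, p)) < w).astype(int)
print(p, Z.sum(axis=1))                 # most features are used by no sample
```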
We note also that the model described in section 21.9.3 provides an approximation of this prior using a finite number of features. With our notation here, this corresponds to taking $p \gg 1$ and $\epsilon = c\gamma/p$.
Chapter 22

Data Visualization and Manifold Learning
The methods described in this chapter aim at representing a dataset in low dimension, allowing for its visual exploration by summarizing its structure in a user-accessible interface. Unlike factor analysis methods, they do not necessarily attempt to provide a causal model expressing the data as a function of a small number of sources, and generally do not provide a direct mechanism for adding new data to the representation. In addition, all these methods take as input similarity or dissimilarity matrices between data points and do not require, say, Euclidean coordinates.
We start with the standard hypotheses of MDS, assuming that the distances $d_{kl}$ derive from a representation in feature space, so that $d_{kl}^2 = \|h_k - h_l\|_H^2$ for some inner-product space H and (possibly unknown) features $h_1,\dots,h_N$. Note that, since the Euclidean distance is invariant by translation, there is no loss of generality in assuming $h_1+\cdots+h_N = 0$, which will be done in the following.
Because isometries are one-to-one and onto, the existence of an exact isometry would require $V \triangleq \mathrm{span}(h_1,\dots,h_N)$ to be p-dimensional. The mapping Φ could then be defined as $\Phi(h) = (\langle h, e_1\rangle_H,\dots,\langle h, e_p\rangle_H)$ where $e_1,\dots,e_p$ is any orthonormal basis of V. In the general case, however, where the dimension of V exceeds p, one can replace V by a best p-dimensional approximation of the training data, leading to a problem similar to PCA in feature space.
Indeed, as we have seen in section 21.1.2, this best approximation can be ob-
tained by diagonalizing the Gram matrix S of h1 , . . . , hN , which is such that skl =
hhk , hl iH . (Recall that we assume that h̄ = 0, so we do not center the data here.) Us-
ing the notation in section 21.1.2, let α (1) , . . . , α (p) denote the eigenvectors associated
with the p largest eigenvalues, normalized so that (α (i) )T Sα (i) = 1 for i = 1, . . . , p. One
can then take
$$e_i = \sum_{l=1}^N\alpha_l^{(i)}h_l$$
and, for $k = 1,\dots,N$, $i = 1,\dots,p$:
$$y_k^{(i)} = \lambda_i^2\,\alpha_k^{(i)} \tag{22.1}$$
where $\lambda_i^2$ is the eigenvalue associated with $\alpha^{(i)}$.
This does not entirely address the original problem, since the inner products $s_{kl}$ are not given, but only the distances $d_{kl}$, which satisfy
$$d_{kl}^2 = -2s_{kl} + s_{kk} + s_{ll}. \tag{22.2}$$
This provides a linear system of equations in the unknown $s_{kl}$. This system is under-determined, because D is invariant by any transformation $h_k \mapsto h_k + h_0$ (for a fixed $h_0$), and S is not. However, the assumption $h_1+\cdots+h_N = 0$ provides the additional constraint needed to obtain a unique solution.
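Solving (22.2) under this constraint amounts to the classical double-centering formula $S = -\frac12 J D^{(2)} J$ with $J = \mathrm{Id} - 1_N1_N^T/N$ and $D^{(2)}$ the matrix of squared distances, a standard fact that the sketch below uses; the helper classical_mds is our illustration, not the book's code.

```python
import numpy as np

def classical_mds(D, p):
    """Embed a distance matrix D into R^p.  Solves (22.2) for S under the
    centering constraint h_1 + ... + h_N = 0 (double centering), then keeps
    the eigenvectors of the p largest eigenvalues, scaled as in (22.1)."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    S = -0.5 * J @ (D ** 2) @ J            # Gram matrix of centered features
    vals, vecs = np.linalg.eigh(S)
    idx = np.argsort(vals)[::-1][:p]        # p largest eigenvalues
    lam = np.sqrt(np.maximum(vals[idx], 0))
    return vecs[:, idx] * lam               # rows are y_1, ..., y_N

# sanity check: recover a 2D point cloud from its distance matrix
X = np.random.default_rng(3).random((20, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Y = classical_mds(D, 2)
DY = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
print(np.max(np.abs(D - DY)))              # ~0 up to numerical error
```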
We now show that this PCA approach to MDS is equivalent to the problem of minimizing
$$F(y) = \sum_{k,l=1}^N (y_k^Ty_l - s_{kl})^2 \tag{22.5}$$
over all $y_1,\dots,y_N \in \mathbb{R}^p$ such that $y_1+\cdots+y_N = 0$, which can be interpreted as matching “similarities” $s_{kl}$ rather than distances. Indeed, letting Y denote the N by p matrix with rows $y_1^T,\dots,y_N^T$, we have
$$F(y) = \mathrm{trace}((YY^T - S)^2).$$
Finding Y is equivalent to finding a symmetric matrix M of rank p minimizing $\mathrm{trace}((M-S)^2)$. We have, using the trace inequality (theorem 2.1), and letting $\lambda_1^2 \ge \cdots \ge \lambda_N^2$ (resp. $\mu_1^2 \ge \cdots \ge \mu_p^2$) denote the eigenvalues of S (resp. M),
This lower bound is attained when M and S can be diagonalized in the same orthonormal basis with $\lambda_k^2 = \mu_k^2$ for $k = 1,\dots,p$. So, letting $S = UDU^T$, where U is orthogonal and D is diagonal with decreasing numbers on the diagonal, an optimal M is given by $M = U_pD_pU_p^T$, where $U_p$ is formed with the first p columns of U and $D_p$ is the first $p\times p$ block of D. This shows that the matrix $Y = U_pD_p^{1/2}$ provides a minimizer of F. The matrix $U = [u^{(1)},\dots,u^{(N)}]$ differs from the matrix $A = [\alpha^{(1)},\dots,\alpha^{(N)}]$ above through the normalization of its column vectors: we have $S\alpha^{(i)} = \lambda_i^2\alpha^{(i)}$ with $(\alpha^{(i)})^TS\alpha^{(i)} = 1$ while $Su^{(i)} = \lambda_i^2u^{(i)}$ with $(u^{(i)})^TSu^{(i)} = \lambda_i^2$, showing that $\alpha^{(i)} = \lambda_i^{-1}u^{(i)}$.
While the minimization of (22.5) did not provide us with a new way of analyzing the data (since it was equivalent to PCA), the direct comparison of dissimilarities, that is, the minimization of
$$G(y) = \sum_{k,l=1}^N (|y_k-y_l| - d_{kl})^2$$
over $y_1,\dots,y_N \in \mathbb{R}^p$, provides a different approach. Since this may be useful in practice and does not bring in much additional difficulty, we will allow for the possibility of weighting the differences in G and consider the minimization of
$$G(y) = \sum_{k,l=1}^N w_{kl}(|y_k-y_l| - d_{kl})^2$$
where $W = (w_{kl})$ is a symmetric matrix of non-negative weights. The only additional complexity resulting from adding weights is that the indeterminacy on $y_1,\dots,y_N$ becomes: $G(y) = G(y')$ as soon as $y - y'$ is constant on every connected component of the graph associated with the weight matrix W, so that the constraint on y should be replaced by
$$\sum_{k\in\Gamma} y_k = 0$$
for any connected component Γ of this graph. (If all weights are positive, then the only non-empty connected component is $\{1,\dots,N\}$ and we retrieve our previous constraint $\sum_{k=1}^N y_k = 0$.)
We have, for $u\in\mathbb{R}^p$:
$$|u| = \max\{z^Tu : z\in\mathbb{R}^p,\ |z| = 1_{u\neq 0}\}.$$
Using this identity, we can introduce auxiliary variables $z_{kl}$, $k,l = 1,\dots,N$, in $\mathbb{R}^p$, with $|z_{kl}| = 1$ if $y_k\neq y_l$, and define
$$\hat G(y,z) = \sum_{k,l=1}^N w_{kl}|y_k-y_l|^2 - 2\sum_{k,l=1}^N w_{kl}d_{kl}(y_k-y_l)^Tz_{kl} + \sum_{k,l=1}^N w_{kl}d_{kl}^2.$$
We then have
$$G(y) = \min_{z\,:\,|z_{kl}| = 1 \text{ if } y_k \neq y_l} \hat G(y,z).$$
Let L denote the Laplacian matrix of the weighted graph on $\{1,\dots,N\}$ associated with the weight matrix W, namely $L = (\ell_{kl},\ k,l = 1,\dots,N)$ with $\ell_{kk} = \sum_{l=1}^N w_{kl} - w_{kk}$ and $\ell_{kl} = -w_{kl}$ when $k\neq l$. Then,
$$\sum_{k,l=1}^N w_{kl}|y_k-y_l|^2 = 2\,\mathrm{trace}(Y^TLY).$$
Defining $u_k\in\mathbb{R}^p$ by
$$u_k = \sum_{l=1}^N w_{kl}d_{kl}(z_{kl} - z_{lk}),$$
and letting U be the matrix with rows $u_1^T,\dots,u_N^T$, we have
$$\sum_{k,l=1}^N w_{kl}d_{kl}(y_k-y_l)^Tz_{kl} = \mathrm{trace}(U^TY).$$
Minimizing $\hat G$ with respect to y at fixed z is therefore equivalent to minimizing
$$2\,\mathrm{trace}(Y^TLY) - 2\,\mathrm{trace}(U^TY).$$
Let m be the number of connected components of the weighted graph. Recall that
the matrix L is positive semi-definite and that an orthonormal basis of its null space
is provided by vectors, say e1 , . . . , em , that are constant on each of the m connected
components of the graph, so that the constraint on Y can be written as ejT Y = 0 for
$j = 1,\dots,m$. Introduce the matrix
$$\hat L = L + \sum_{k=1}^m e_ke_k^T.$$
Introducing Lagrange multipliers $\mu_1,\dots,\mu_m\in\mathbb{R}^p$ for the constraints $e_j^TY = 0$, the gradient with respect to Y of the resulting Lagrangian is
$$4\hat LY - 2U + \sum_{j=1}^m e_j\mu_j^T,$$
so that the optimal Y satisfies
$$Y = \frac12\hat L^{-1}U - \frac14\sum_{j=1}^m e_j\mu_j^T,$$
where we have used the fact that $\hat L^{-1}e_j = e_j$. We can now identify $\mu_j$ since
$$0 = e_j^TY = \frac12 e_j^T\hat L^{-1}U - \frac14\sum_{j'=1}^m e_j^Te_{j'}\mu_{j'}^T = \frac12 e_j^TU - \frac14\mu_j^T,$$
yielding
$$Y = \frac12\hat L^{-1}U - \frac12\sum_{j=1}^m e_je_j^TU.$$
1. Compute the Laplacian matrix L of the graph associated with the weights, the projection matrix $P_L$ onto the range of L and the matrix $M = (L + \mathrm{Id}_{\mathbb{R}^N} - P_L)^{-1}$.

2. Initialize the algorithm with some family $y_1,\dots,y_N\in\mathbb{R}^p$ and let Y be the matrix with rows $y_1^T,\dots,y_N^T$.

3. At a given step of the algorithm, let Y be the current solution and compute, for $k = 1,\dots,N$:
$$u_k = 2\sum_{l=1}^N w_{kl}d_{kl}\,\frac{y_k-y_l}{|y_k-y_l|}\,1_{y_k\neq y_l}$$
to form the matrix U with rows $u_1^T,\dots,u_N^T$.

4. Compute $Y' = \frac12 P_LMU$.

5. If $|Y - Y'| \le \epsilon$, exit and return $Y'$.

6. Return to step 3.
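A direct implementation of these steps might look as follows (a sketch under the assumption that the graph of W is connected, so that $P_L = \mathrm{Id} - 1_N1_N^T/N$; the function name weighted_mds is ours).

```python
import numpy as np

def weighted_mds(D, W, p, n_iter=200, tol=1e-8, rng=None):
    """Sketch of the iteration above for minimizing the weighted stress G.
    Assumes the graph of W is connected, so P_L = Id - 11^T/N."""
    rng = np.random.default_rng() if rng is None else rng
    N = D.shape[0]
    L = np.diag(W.sum(axis=1)) - W                  # graph Laplacian
    P_L = np.eye(N) - np.ones((N, N)) / N           # projection on range(L)
    M = np.linalg.inv(L + np.eye(N) - P_L)
    Y = rng.standard_normal((N, p))
    Y -= Y.mean(axis=0)
    for _ in range(n_iter):
        nrm = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
        with np.errstate(divide="ignore", invalid="ignore"):
            R = np.where(nrm > 0, W * D / nrm, 0.0)  # w_kl d_kl / |y_k - y_l|
        U = 2 * (R.sum(axis=1)[:, None] * Y - R @ Y)  # u_k as in step 3
        Y_new = 0.5 * P_L @ M @ U                     # step 4
        if np.linalg.norm(Y - Y_new) <= tol:
            return Y_new
        Y = Y_new
    return Y

# usage: all-positive weights, distances from a random 2D configuration
rng = np.random.default_rng(7)
X = rng.random((30, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Y = weighted_mds(D, np.ones((30, 30)) - np.eye(30), p=2, rng=rng)
```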
The goal of MDS is to map a full matrix of distances into a low-dimensional Eu-
clidean space. Such a representation, however, cannot address the possibility that
the data is supported by a low-dimensional, albeit nonlinear, space. For example, people on Earth live, for all practical purposes, on a two-dimensional structure (a sphere), but any faithful Euclidean representation of the world population needs to use the three spatial dimensions.
to use the three spatial dimensions. One may also argue that the relevant distance
between points on Earth is not the Euclidean one either (because one would never
travel through Earth to go from one place to another), but the distance associated to
the shortest path on the sphere, which is measured along great circles.
To take another example, the left panel in fig. 22.1 provides the result of applying
MDS to a ten-dimensional dataset obtained by applying a random ten-dimensional
rotation to a curve supported by a three-dimensional torus. MDS indeed retrieves
the correct curve structure in space, which is three dimensional. However, for a
person “living” on the curve, the data is one-dimensional, a fact that is captured by
the Isomap method that we now describe.
22.2.1 Isomap
Let us return to the example of people living on the spherical Earth. One can de-
fine the distance between two points on Earth either as the shortest length a person
would have to travel (say, by plane) to go from one point to the other (that we can
call the intrinsic distance), or simply the chordal distance in 3D space between the two
points. The first one is obviously the most relevant to the spherical structure of the
Earth, but the second one is easier to compute given the locations of the points in
space.
For typical datasets, the geometric structure of the data (e.g., that it is supported
by a sphere) is unknown, and the only information that is available is their chordal
distance in an ambient space (which can be very large). An important remark, how-
ever is that, when the points are close to each other, the two distances can be ex-
pected to be similar, if we assume that the geometry of the set supporting the data is
locally linear (e.g., that it is, like the sphere, a “submanifold” of the ambient space,
with small neighborhoods of any data point well approximated, at first order, by
points on a tangent space). Isomap uses this property, only trusting small distances
in the matrix D, and infers large distances by adding the costs resulting from travel-
ing from data points to nearby data points.
until the entries stabilize, i.e., d (n+1) = d (n) , in which case one has d (∗) = d (n) . The
validity of the statement can be easily proved by checking that
$$d_{kl}^{(n)} = \min\Big\{\sum_{j=1}^n d^{(1)}_{k_{j-1}k_j} : k_0,\dots,k_n\in\{1,\dots,N\},\ k_0 = k,\ k_n = l\Big\},$$
which can be done by induction, the details being left to the reader. It should also
be clear that the procedure will stabilize after no more than N steps.
Once the distance is computed, Isomap then applies standard MDS, resulting
in a straightened representation of the data like in fig. 22.1. Another example is
provided in fig. 22.2, where, this time, the input curve is closed and cannot therefore
be represented as a one-dimensional structure. One can note, however, that, even in
this case, Isomap still provides some simplification of the initial shape of the data.
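A compact sketch of the full pipeline (neighborhood graph, shortest-path completion, then MDS) is given below; it is our illustration, not the book's code, using scipy's shortest_path for the graph distances and a one-dimensional toy curve.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, c, p):
    """Sketch of Isomap: trust only c-nearest-neighbor distances, complete
    them into geodesic estimates by shortest paths, then apply MDS."""
    N = X.shape[0]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    # keep distances to the c nearest neighbors only (0 means no edge)
    G = np.zeros_like(D)
    for k in range(N):
        nbrs = np.argsort(D[k])[1:c + 1]
        G[k, nbrs] = D[k, nbrs]
    G = np.maximum(G, G.T)                     # symmetrize the graph
    D_geo = shortest_path(G, method="D", directed=False)
    # classical MDS on the geodesic distances
    J = np.eye(N) - np.ones((N, N)) / N
    S = -0.5 * J @ (D_geo ** 2) @ J
    vals, vecs = np.linalg.eigh(S)
    idx = np.argsort(vals)[::-1][:p]
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# a spiral in R^3 is intrinsically one-dimensional
t = np.linspace(0, 4 * np.pi, 200)
X = np.stack([np.cos(t), np.sin(t), t / 4], axis=1)
Y = isomap(X, c=6, p=1)
print(np.corrcoef(Y[:, 0], t)[0, 1])          # close to +1 or -1
```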
Local linear embedding (LLE) exploits in a different way the fact that manifolds are locally well approximated by linear spaces. Like Isomap, it also starts with building a c-nearest-neighbor graph on $\{1,\dots,N\}$. Assume, for the sake of the discussion, that the distance matrix is computed for possibly unobserved data $T = (x_1,\dots,x_N)$. Letting $N_k$ denote the indices of the nearest neighbors of k (excluding k itself), the basic assumption is that $x_k$ should approximately lie in the affine space generated by $x_l$, $l\in N_k$. Expressed in barycentric coordinates, this space is defined by
$$T_k = \Big\{\sum_{l\in N_k}\rho^{(l)}x_l : \rho\in\mathbb{R}^{N_k},\ \sum_{l\in N_k}\rho^{(l)} = 1\Big\},$$
The weights are then obtained by minimizing $\big|x_k - \sum_{l\in N_k}\rho_k^{(l)}x_l\big|^2$ subject to $\sum_{l\in N_k}\rho_k^{(l)} = 1$. This is a simple least-squares program. Let $c_k = |N_k|$ ($c_k = c$ in the absence of ties). Order the elements of $N_k$ to represent $\rho_k^{(l)}$, $l\in N_k$, as a vector denoted $\rho_k\in\mathbb{R}^{c_k}$. Similarly, let $S_k$ be the Gram matrix associated with $x_l$, $l\in N_k$, formed with all inner products $x_{l'}^Tx_l$, $l,l'\in N_k$, and let $r_k$ be the vector composed with products $x_k^Tx_l$, $l\in N_k$. Assume that $S_k$ is invertible, which is generally true if
$c < d$, unless the neighbors are exactly linearly aligned. Then, the optimal $\rho_k$ and the Lagrange multiplier λ for the constraint are given by
$$\begin{pmatrix}\rho_k\\ \lambda\end{pmatrix} = \begin{pmatrix}S_k & 1_{c_k}\\ 1_{c_k}^T & 0\end{pmatrix}^{-1}\begin{pmatrix}r_k\\ 1\end{pmatrix}. \tag{22.6}$$
If Sk is not invertible, the problem is under-constrained and one of its solutions can
be obtained by replacing the inverse above by a pseudo-inverse.
Obviously, some additional constraints are needed to avoid the trivial solution $y_k = 0$ for all k. Also, replacing all $y_k$'s by $y_k' = Ry_k + b$, where R is an orthogonal transformation in $\mathbb{R}^p$ and b is a translation, does not change the value of F, so there is no loss of generality in assuming that $\sum_{k=1}^N y_k = 0$ and that $\sum_{k=1}^N y_ky_k^T = D_0$, a diagonal matrix. However, if one lets $y_k' = Dy_k$ where D is diagonal, then
$$F(y') = \sum_{i=1}^p D_{ii}^2\sum_{k=1}^N\Big(y_k^{(i)} - \sum_{l\in N_k}\rho_k^{(l)}y_l^{(i)}\Big)^2.$$
This shows that one should not allow the diagonal coefficients of $D_0$ to be chosen freely, since otherwise the optimal solution would be to take these coefficients to 0. So $D_0$ should be a fixed matrix, and by symmetry, it is natural to take $D_0 = \mathrm{Id}_{\mathbb{R}^p}$. (Any other solution—for a different $D_0$—can then be obtained by rescaling independently the coordinates of $y_1,\dots,y_N$.)
Extend $\rho_k$ to an N-dimensional vector by taking $\rho_k^{(k)} = -1$ and $\rho_k^{(l)} = 0$ if $l\neq k$ and $l\notin N_k$. We can write
$$F(y) = \sum_{k=1}^N\Big|\sum_{l=1}^N\rho_k^{(l)}y_l\Big|^2.$$
Expanding the square, this is
$$F(y) = \sum_{l,l'=1}^N w_{ll'}\,y_l^Ty_{l'}$$
with $w_{ll'} = \sum_{k=1}^N\rho_k^{(l)}\rho_k^{(l')}$. Introducing the matrix W with entries $w_{kl}$ and the $N\times p$ matrix Y with rows $y_1^T,\dots,y_N^T$, we have the simple expression
$$F(y) = \mathrm{trace}(Y^TWY).$$
Note that the constraints are $Y^TY = \mathrm{Id}_{\mathbb{R}^p}$ and $Y^T1_N = 0$. Without this last constraint, we know that an optimal solution is provided by $Y = [e_1,\dots,e_p]$ where $e_1,\dots,e_p$ form an orthonormal family of eigenvectors associated with the p smallest eigenvalues of W (this is a consequence of corollary 2.4). To handle the additional constraint, it suffices to note that $W1_N = 0$, so that $1_N$ is a zero eigenvector. Given this, it suffices to compute $p+1$ eigenvectors $e_1,\dots,e_{p+1}$ associated with the smallest eigenvalues of W, with the condition that $e_1 = \pm 1_N/\sqrt N$ (which is automatically satisfied unless 0 is a multiple eigenvalue of W), and let
$$Y = [e_2,\dots,e_{p+1}].$$
Note that $e_2,\dots,e_{p+1}$ are also eigenvectors associated with the p smallest eigenvalues of $W + \lambda 11^T$ for any large enough λ, e.g., $\lambda > \mathrm{trace}(W)/N$.
(i) Either a training set $T = (x_1,\dots,x_N)$, or its Gram matrix S containing all inner products $x_k^Tx_l$ (or, more generally, inner products in feature space), or a dissimilarity matrix $D = (d_{kl})$.

(ii) An integer c for the graph construction.

(iii) An integer p for the target dimension.

(1) If not provided in input, compute the Gram matrix S and distance matrix D (using (22.2) and (22.4)).

(2) Build the c-nearest-neighbor graph associated with the distances. Let $N_k$ be the set of neighbors of k, with $c_k = |N_k|$.

(3) For $k = 1,\dots,N$, let $S_k$ be the sub-matrix of S associated with $x_l$, $l\in N_k$, and compute coefficients $\rho_k^{(l)}$, $l\in N_k$, stacked in a vector $\rho_k\in\mathbb{R}^{c_k}$, by solving (22.6).

(4) Form the matrix W with entries $w_{ll'} = \sum_{k=1}^N\rho_k^{(l)}\rho_k^{(l')}$, with $\rho_k$ extended so that $\rho_k^{(k)} = -1$ and $\rho_k^{(l)} = 0$ if $l\neq k$ and $l\notin N_k$.
Figure 22.3: Local linear embedding with target dimension 3 applied to the data in fig. 22.1
and fig. 22.2.
The results of LLE applied to the datasets described in fig. 22.1 and fig. 22.2 are
provided in fig. 22.3.
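The following sketch assembles the steps above (our illustration, not the book's code); a small Tikhonov regularization of $S_k$ stands in for the pseudo-inverse mentioned after (22.6).

```python
import numpy as np

def lle(X, c, p, reg=1e-3):
    """Sketch of local linear embedding: weights from the constrained
    least-squares program (22.6), embedding from the bottom eigenvectors
    of W = R^T R, dropping the eigenvector proportional to 1_N."""
    N = X.shape[0]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    R = np.zeros((N, N))
    for k in range(N):
        nbrs = np.argsort(D[k])[1:c + 1]
        Xl = X[nbrs]
        Sk = Xl @ Xl.T + reg * np.eye(c)      # Gram matrix of the neighbors
        rk = Xl @ X[k]
        A = np.block([[Sk, np.ones((c, 1))],
                      [np.ones((1, c)), np.zeros((1, 1))]])
        rho = np.linalg.solve(A, np.concatenate([rk, [1.0]]))[:c]   # (22.6)
        R[k, nbrs] = rho
        R[k, k] = -1.0                        # extended coefficient rho_k^(k)
    W = R.T @ R
    vals, vecs = np.linalg.eigh(W)            # eigenvalues in increasing order
    return vecs[:, 1:p + 1]                   # skip e_1, proportional to 1_N

t = np.linspace(0, 4 * np.pi, 300)
X = np.stack([np.cos(t), np.sin(t), t / 4], axis=1)
Y = lle(X, c=8, p=1)
print(Y.shape)
```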
Remark 22.1 We note that, for both Isomap and LLE, the c-nearest-neighbors graph can be replaced by the graph formed with edges between all pairs of points that are at distance less than $\epsilon$ from each other, for a chosen $\epsilon > 0$, with no change in the algorithms.
Both Isomap and LLE are based on the construction of a nearest-neighbor graph
based on dissimilarity data and the conservation of some of its geometric features
when deriving a small-dimensional representation. For LLE, a weight matrix W was
first estimated based on optimal linear approximations of xk by its neighbors, and
the representation was computed by estimating the eigenvectors associated with the
smallest eigenvalues of W (excluding the eigenvector proportional to 1). However,
both methods were motivated by the intuition that the dataset was supported by
a continuous small-dimensional manifold. We now discuss methods that are solely
motivated by the discrete geometry of a graph, for which we use tools that are similar
to our discussion of graph clustering in section 20.5.
Adapting the notation in that section to the present one, we start with a graph
with N vertices and weights βkl between these vertices (such that βll = 0) and we
form the Laplacian operator defined by, for any vector u ∈ RN :
$$\|u\|_{H^1}^2 = \frac12\sum_{l,l'=1}^N\beta_{ll'}\big(u^{(l)}-u^{(l')}\big)^2 = u^TLu,$$
so that L is identified as the matrix with coefficients $\ell_{ll'} = -\beta_{ll'}$ for $l\neq l'$ and $\ell_{ll} = \sum_{l'=1}^N\beta_{ll'}$. The matrix W that was obtained for LLE coincides with this graph Laplacian if one lets $\beta_{ll'} = -w_{ll'}$ for $l\neq l'$, since we have $\sum_{l'=1}^N w_{ll'} = 0$. The usual requirement that weights are non-negative is no real loss of generality, because in LLE (and in the graph embedding method above), one is only interested in eigenvectors of W (or L below) that are perpendicular to 1, and those remain the same if one replaces W by
$$W - a1_N1_N^T + Na\,\mathrm{Id}_{\mathbb{R}^N},$$
which has negative off-diagonal coefficients $\tilde w_{ll'} = w_{ll'} - a$ for large enough a.
for some constant τ. These weights are usually truncated, replacing small values
by zeros (or the computation is restricted to nearest neighbors), to ensure that the
resulting graph is sparse, which speeds up the computation of eigenvectors for large
datasets.
is then given by $y_k(i) = e_{i+1}^{(k)}$ for $i = 1,\dots,p$ and $k = 1,\dots,N$. Note that these are exactly the same operations as those described in steps 4 and 5 of the LLE algorithm.
One way to interpret this construction is that $e_2,\dots,e_{p+1}$ (the coordinate functions for the representation $y_1,\dots,y_N$) minimize
$$\sum_{j=1}^p\|e_{j+1}\|_{H^1}^2$$
are satisfied, which is similar to the LLE condition, without the requirement that
ρk (k) = 1.
An alternate requirement that could have been made for LLE is that $\sum_{l=1}^N(\rho_k^{(l)})^2 = 1$ for all k. Instead of having to solve a linear system in step 2 of Algorithm 22.2, one would then compute an eigenvector with smallest eigenvalue of $S_k$. For graph embedding, this constraint can be enforced by modifying the Laplacian matrix, since $\sum_{l=1}^N(\rho_k^{(l)})^2$ is just the $(k,k)$ coefficient of $RR^T$. Given this, let D be the diagonal matrix formed by the diagonal elements of L, and define the so-called “symmetric Laplacian” $\tilde L = D^{-1/2}LD^{-1/2}$. One obtains an alternative, and popular, graph embedding method by replacing $e_1,\dots,e_{p+1}$ above by the first p eigenvectors of $\tilde L$.
$$P(t) = e^{-tL} = \sum_{i=1}^N e^{-t\lambda_i}e_ie_i^T.$$
We could also have considered the discrete-time version of the walk, for which, considering integer times $t\in\mathbb{N}$,
$$P(q(t+1) = l\mid q(t) = k) = \begin{cases}\dfrac{\beta_{kl}}{\sum_{l'=1,l'\neq k}^N\beta_{kl'}} & \text{if } l\neq k,\\[2mm] 0 & \text{if } l = k.\end{cases}$$
Introducing the matrix B of similarities $\beta_{kl}$ (with zero on the diagonal) and the diagonal matrix D with coefficients $d_{kk} = \sum_{l=1,l\neq k}^N\beta_{kl}$, the r.h.s. of the previous equation is the $k,l$ entry of the matrix $\tilde P = D^{-1}B$. Then, for any integer s, $P(q(t+s) = l\mid q(t) = k)$ is the $k,l$ entry of $\tilde P^s = D^{-1/2}(D^{-1/2}BD^{-1/2})^sD^{1/2}$. Letting $\bar L = \mathrm{Id}_{\mathbb{R}^N} - D^{-1/2}BD^{-1/2}$ denote the normalized Laplacian, we get
$$\tilde P^s = D^{-1/2}(\mathrm{Id}_{\mathbb{R}^N} - \bar L)^sD^{1/2}.$$
If one introduces the eigenvectors $\bar e_1,\dots,\bar e_N$ of the normalized Laplacian, still associated with non-decreasing eigenvalues $\bar\lambda_1 = 0,\dots,\bar\lambda_N$, and arranges without loss of generality that $\bar e_1\propto D^{1/2}1_N$, then
$$\tilde P^s = D^{-1/2}\sum_{i=1}^N(1-\bar\lambda_i)^s\bar e_i\bar e_i^TD^{1/2}.$$
This shows that, for s large enough, the transitions of this Markov chain are well approximated by its first terms, suggesting using the alternative representation based on the normalized Laplacian:
$$\bar y_k(i) = \bar e_{i+1}(k).$$
Both representations (using normalized or un-normalized Laplacians) are commonly
used in practice.
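Both variants fit in a few lines. The sketch below (our illustration) computes the representation from the bottom eigenvectors of either Laplacian, with Gaussian similarities $\beta_{kl} = \exp(-d_{kl}^2/\tau)$ as suggested above; dropping the first eigenvector in both cases is a choice we make for symmetry with the un-normalized case.

```python
import numpy as np

def graph_embedding(B, p, normalized=True):
    """Sketch of graph embedding: map the vertices of a weighted graph with
    similarity matrix B (zero diagonal) using the bottom eigenvectors of its
    (normalized or un-normalized) Laplacian."""
    d = B.sum(axis=1)
    L = np.diag(d) - B
    if normalized:
        Dinv = np.diag(1.0 / np.sqrt(d))
        L = Dinv @ L @ Dinv                   # symmetric Laplacian L~
    vals, vecs = np.linalg.eigh(L)            # eigenvalues in increasing order
    return vecs[:, 1:p + 1]                   # skip the constant direction

# similarities beta_kl = exp(-d_kl^2 / tau) on a small point cloud
rng = np.random.default_rng(4)
X = rng.random((50, 3))
D2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
B = np.exp(-D2 / 0.1)
np.fill_diagonal(B, 0.0)
Y = graph_embedding(B, p=2)
print(Y.shape)
```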
General algorithm
Stochastic neighbor embedding (SNE, Hinton and Roweis [90]), and its variant (t-
SNE, Maaten and Hinton [123]) have become a popular tool for the visualization
of high-dimensional data based on dissimilarity matrices. One of the key contributions of this algorithm is to introduce a local data rescaling step that allows for the visualization of more homogeneous point clouds.
This is a rather simple expression that can be used with any first-order opti-
mization algorithm to maximize F. The algorithm in Hinton and Roweis [90] uses
gradient ascent with momentum, namely iterating
Choosing α (n) = 0 provides standard gradient ascent with fixed gain γ (of course,
other optimization methods may be used). The momentum can be interpreted, in a
loose sense, as a “friction term”.
$$\bar\psi(k,l;y) = \frac{\exp(-\beta(|y_k-y_l|^2))}{\sum_{k',l'=1}^N\exp(-\beta(|y_{k'}-y_{l'}|^2))}.$$
With such a choice, the objective function has a simpler form, namely minimizing $\mathrm{KL}(\bar\pi\,\|\,\bar\psi(\cdot;y))$ or maximizing the expected likelihood
$$\bar F(y) = \sum_{k,l=1}^N\bar\pi(k,l)\log\bar\psi(k,l;y) = -\sum_{k,l=1}^N\beta(|y_k-y_l|^2)\bar\pi(k,l) - \log\sum_{k,l=1}^N\exp(-\beta(|y_k-y_l|^2)).$$
The gradient of this symmetric version of F can be computed similarly to the previous one and is given by
$$\partial_{y_k}\bar F(y) = -4\sum_{l=1}^N\beta'(|y_k-y_l|^2)(y_k-y_l)\big(\bar\pi(k,l)-\bar\psi(k,l;y)\big).$$
for $l\neq k$ and
$$\bar\pi(k,l) = \frac{\pi_k(l)+\pi_l(k)}{2N}.$$
$$\partial_t H(\pi_k) = -\sum_{l=1}^N\partial_t\pi_k(l)\log\pi_k(l) - \sum_{l=1}^N\partial_t\pi_k(l) = -\sum_{l=1}^N\partial_t\pi_k(l)\log\pi_k(l)$$
(the second sum vanishes because $\sum_{l=1}^N\pi_k(l) = 1$). Now
$$\partial_t\log\pi_k(l) = -d_{kl}^2 + \bar d_k^2$$
with $\bar d_k^2 = \sum_{l'=1}^N d_{kl'}^2\pi_k(l')$. Writing $\partial_t\pi_k(l) = \pi_k(l)\partial_t\log\pi_k(l)$, we have
$$\partial_t H(\pi_k) = \sum_{l=1}^N\big(d_{kl}^2\log\pi_k(l)\big)\pi_k(l) - \bar d_k^2\sum_{l=1}^N\pi_k(l)\log\pi_k(l).$$
UMAP is similar in spirit to t-SNE, with a few important differences that result in
a simpler optimization problem and faster algorithms. Like Isomap, the approach
is based on matching distances between the high-dimensional data and the low-
dimensional representation. But while Isomap estimates a unique distance on the
whole training set (the geodesic distance on the nearest-neighbor graph), UMAP
estimates as many “local distances” as observations before “patching” them to form
the final representation.
This defines an input fuzzy graph structure on $\{1,\dots,N\}$ that serves as a target for an optimized similar structure associated with the representation $y = (y_1,\dots,y_N)$. This representation, since it is designed as a homogeneous representation of the data, provides a unique fuzzy graph $H(y) = (V, E, \nu(\cdot;y))$, and the edge membership function is defined by $\nu(l,l';y) = \varphi_{a,b}(y_l,y_{l'})$ with
$$\varphi_{a,b}(y,y') = \frac{1}{1+a|y-y'|^{b}}.$$
The parameters a and b are adjusted so that $\varphi_{a,b}$ provides a differentiable approximation of the function
$$\psi_{\rho_0}(y,y') = \exp(-\max(0,\,|y-y'|-\rho_0))$$
where ρ0 is an input parameter of the algorithm. This function ψρ0 takes the same
form as the membership function defined for local graphs G (k) , and its replacement
by ϕa,b makes possible the use of gradient-based methods for the determination of
the optimal y (ψρ0 is not differentiable everywhere).
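This fit of a and b can be done by simple nonlinear least squares; the sketch below (our illustration, with an arbitrary $\rho_0$ and grid) uses scipy.optimize.curve_fit.

```python
import numpy as np
from scipy.optimize import curve_fit

rho0 = 0.5
d = np.linspace(0.01, 5.0, 200)                 # grid of distances
psi = np.exp(-np.maximum(0.0, d - rho0))        # target psi_{rho0}

def phi(d, a, b):                                # smooth family phi_{a,b}
    return 1.0 / (1.0 + a * d ** b)

(a, b), _ = curve_fit(phi, d, psi, p0=(1.0, 1.0))
print(a, b, np.max(np.abs(phi(d, a, b) - psi)))  # fitted a, b and max error
```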
(1) Each edge $(k,l)$ is selected with probability $\mu(k,l)$ (which is zero unless k and l are neighbors);

(2) If $(k,l)$ is selected, one selects additional edges $(k,l')$, each with probability .
Remark 22.3 If one prefers using probability rather than fuzzy set theory, the graphs
G (k) may also be interpreted as random graphs in which edges are added indepen-
dently from each other and each edge (l, l 0 ) is drawn with probability µ(k) (l, l 0 ). The
combined graph G is then the random graph in which (l, l 0 ) is present if and only
if it is in at least one of the G (k) and the objective function C coincides with the KL
divergence between this random graph and the random graph similarly defined for
y.
Generalization Bounds
23.1 Notation
$$\hat R_T(f) = \frac{1}{|T|}\sum_{(x,y)\in T}r(y,f(x))$$
and the in-sample error associated to a learning algorithm is the function $T\mapsto E_T \triangleq \hat R_T(\hat f_T)$. Fixing the size (N) of T, one also considers the random variable T with values in $\mathcal T$ distributed as an N-sample of the distribution of (X,Y).
A good learning algorithm should be such that the generalization error $R(\hat f_T)$ is small, at least on average (i.e., $E(R(\hat f_T))$ is small). Our main goal in this chapter is to describe generalization bounds: upper bounds for $R(\hat f_T)$ based on $E_T$ and properties of the function class $\mathcal F$. These bounds will reflect the bias-variance trade-off, in that, even though large function classes provide smaller in-sample errors, they will also induce a large additive term in the upper bound, accounting for the “variance” associated to the class.
Remark 23.1 Both variables X and Y are assumed to be random in the previous
setting, but there are often situations when one of them is “more random” than the
other. Randomness in Y is associated to measurement errors, or ambiguity in the
decision. Randomness in X more generally relates to the issue of sampling a dataset
in a large dimensional space. In some cases, Y is not random at all: for example,
in object recognition, the question of assigning categories for images such as those
depicted in fig. 23.1 has a quasi-deterministic answer. Sometimes, it is X that is not random, for example when observing noisy signals where X is a deterministic discretization of a time interval and Y is some function of X perturbed by noise.
Figure 23.1: Images extracted from the PASCAL challenge 2007 dataset [70], in which cate-
gories must be associated with images. There is little ambiguity on correct answers based on
observing the image, i.e., little randomness in the variable Y .
We want to compare the training-set-averaged prediction error and the average in-
sample error, namely compute the error bias
∆N = E(R(fT )) − E(ET ) .
Write
∆N = E(R(fT )) − R(fθ0 ) + R(fθ0 ) − E(ET ).
We make a heuristic argument to evaluate ∆N . We can use the fact that θ̂T mini-
mizes the empirical error and write
$$\frac1N\sum_{k=1}^N(Y_k - f_{\theta_0}(X_k))^2 = E_T + \sigma^2(\hat\theta_T-\theta_0)^TJ_T(\hat\theta_T-\theta_0) + o(|\hat\theta_T-\theta_0|^2)$$
with
$$J_T = \frac{1}{2\sigma^2N}\sum_{k=1}^N\partial_\theta^2\big((y_k-f_\theta(x_k))^2\big)_{|\theta=\hat\theta_T},$$
which is an m by m symmetric matrix.
Now, using the fact that $\theta_0$ minimizes the mean square error (since $f_{\theta_0}(x) = E(Y\mid X = x)$), we can write, for any T:
$$R(f_T) = R(f_{\theta_0}) + \sigma^2(\hat\theta_T-\theta_0)^TI(\hat\theta_T-\theta_0) + o(|\hat\theta_T-\theta_0|^2)$$
with
$$I = \frac{1}{2\sigma^2}E\Big(\partial_\theta^2(Y-f_\theta(X))^2_{\,|\theta=\theta_0}\Big).$$
(We skip hypotheses and justification for the analysis of the residual term.)
We now note that, because we are assuming a Gaussian noise, and that the true
data distribution belongs to the parametrized family, the least-square estimator is
also a maximum likelihood estimator. Indeed, the likelihood of the data is
$$\frac{1}{(2\pi\sigma^2)^{N/2}}\exp\Big(-\frac{1}{2\sigma^2}\sum_{k=1}^N(Y_k-f_\theta(X_k))^2\Big)\prod_{k=1}^N\varphi_X(X_k)$$
where ϕX is the p.d.f. of X and does not depend on the unknown parameter.
We can therefore apply classical results from mathematical statistics [196]. Under some mild smoothness assumptions on the mapping $\theta\mapsto f_\theta$, $\hat\theta_T$ converges to $\theta_0$ in probability when N tends to infinity, the matrix $J_T$ converges to I, which is the model's Fisher information matrix, and $\sqrt N(\hat\theta_T-\theta_0)$ converges in distribution to a Gaussian $\mathcal N(0, I^{-1})$. This implies that both $N(\hat\theta_T-\theta_0)^TJ_T(\hat\theta_T-\theta_0)$ and $N(\hat\theta_T-\theta_0)^TI(\hat\theta_T-\theta_0)$ converge to a chi-square distribution with m degrees of freedom, whose expectation is m, which indicates that $\Delta_N$ has order $2\sigma^2m/N$.
This analysis can be used to develop model selection rules, in which one chooses between models of dimensions $k_1 < k_2 < \cdots < k_q = m$ (e.g., by truncating the last coordinates of X). The rule suggested by the previous computation is to select j minimizing
$$E_T^{(j)} + \frac{2\sigma^2k_j}{N},$$
where $E_T^{(j)}$ is the in-sample error computed using the $k_j$-dimensional model. This
is an example of a penalty-based method, using the so-called Akaike’s information
criterion (AIC) [2].
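The rule is easy to illustrate on nested polynomial regression models. In the sketch below (our illustration, with $\sigma^2$ assumed known), the penalized criterion typically selects the true degree while the raw in-sample error keeps decreasing.

```python
import numpy as np

rng = np.random.default_rng(5)
N, sigma2 = 200, 0.25
x = rng.uniform(-1, 1, N)
y = 1.0 + 2.0 * x - 1.5 * x ** 2 + rng.normal(0, np.sqrt(sigma2), N)

# nested polynomial models of dimension k_j = deg + 1
for deg in range(6):
    V = np.vander(x, deg + 1, increasing=True)     # design matrix
    beta, *_ = np.linalg.lstsq(V, y, rcond=None)
    in_sample = np.mean((y - V @ beta) ** 2)       # E_T^{(j)}
    aic = in_sample + 2 * sigma2 * (deg + 1) / N   # penalized criterion
    print(deg, round(in_sample, 4), round(aic, 4))
# the criterion is typically minimized at degree 2, the true model
```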
Other penalty-based methods are more size-averse and replace the constant, 2, in
AIC by a function of N = |T |, for example log N . Such a change can be justified by a
Bayesian analysis, yielding the Bayesian information criterion (BIC) [175]. The ap-
proach in this case is not based on an evaluation of the error, but on an asymptotic
$$\mu(M_j\mid T) = \log\Big(\alpha_j\int_{\mathbb{R}^{m_j}}e^{N(\theta^T\bar U_T - C(\theta))}\varphi_j(\theta)\,dm_j(\theta)\Big)$$
Consider the maximum likelihood estimator $\hat\theta_j$ within $M_j$, maximizing $\ell(\theta,\bar U_T) = \theta^T\bar U_T - C(\theta)$ over $M_j$. Then one has
$$\ell(\theta,\bar U_T) = \ell(\hat\theta_j,\bar U_T) + \frac12(\theta-\hat\theta_j)^T\partial_\theta^2\ell(\hat\theta_j,\bar U_T)(\theta-\hat\theta_j) + R_j(\theta,\hat\theta_j)|\theta-\hat\theta_j|^3$$
The law of large numbers implies that ŪT converges to a limit when N tends to
infinity, and our assumptions imply that θ̂j converges to the parameter providing
the best approximation of the distribution of Z for the Kullback-Leibler divergence.
In particular, with probability 1, there exists an N such that θ̂j belongs to any large
enough, but fixed, compact set. Moreover, the second derivative $\partial_\theta^2\ell(\hat\theta_j,\bar U_T)$ will also converge to a limit, $-\Sigma_j$.
The second integral converges to 0 exponentially fast when N tends to ∞. The first one behaves essentially like
$$\int_{M_j}e^{-\frac12N(\theta-\hat\theta_j)^T\Sigma_j^{-1}(\theta-\hat\theta_j)+\log\varphi_j(\theta)}\,dm_j(\theta).$$
Neglecting $\log\varphi_j(\theta)$, this integral behaves like $(2\pi)^{k_j/2}\det(\Sigma_j/N)^{1/2}$, whose logarithm is $-k_j(\log N)/2$ plus constant terms. As a consequence, we find that
$$\mu(M_j\mid T) = \max_{\theta\in M_j}\ell(\theta) - \frac{k_j}{2}\log N + \text{bounded terms}.$$
We now turn to another interesting point of view, which provides the same penalty, based on the minimum description length principle (MDL; Rissanen [163]), measuring the coding efficiency of a model.
Let us fix some notation. We assume that one has q competing models for pre-
dicting Y from X, for example, linear regression models based on different subsets
of the explanatory variables. Denote these models M1 , . . . , Mq . Each model will be
seen, not as an assumption on the true joint distribution of X and Y , but rather as
a tool to efficiently encode the training set ((x1 , y1 ), . . . , (xN , yN )). To describe MDL,
which selects the model that provides the most efficient code, we need to reintroduce
a few basic concepts of information theory.
(The logarithm in base 2 is used because of the tradition of coding with bits in infor-
mation theory.)
For a discrete random variable X, the entropy H2 (X) is H2 (PX ) where PX is the
probability distribution of X. The relation between the entropy and coding theory
is as follows: a code is a function which associates to any element ω ∈ Ω a string of
bits c(ω). The associated code-length is denoted lc (ω), which is simply the number
of bits in c(ω). When P is a probability on Ω, the efficiency of a code is measured by the average code-length:
$$E_P(l_c) = \sum_{\omega\in\Omega}l_c(\omega)P(\omega).$$
Shannon's theorem [176, 54] states that, under some conditions on the code (ensuring that any sequence of words can be recognized as soon as it is observed: one says that it is instantaneously decodable), the average code length can never be smaller than the entropy of P. Moreover, it states that there exist codes that achieve this lower bound with no more than one bit loss, such that for all ω, $l_c(\omega) \le -\log_2(P(\omega)) + 1$. These optimal codes, such as the Huffman code [54], can completely be determined from the knowledge of P. This allows one to interpret a probability P on Ω as a tool for designing codes with code-lengths essentially equal to $(-\log_2 P)$.
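A small illustration: the sketch below (our code, not the book's) builds Huffman code lengths for a dyadic distribution, for which the average code length equals the entropy exactly.

```python
import heapq
import math

def huffman_lengths(P):
    """Code lengths of a Huffman code for the distribution P (dict mapping
    symbol -> probability).  The average length lies in [H2(P), H2(P)+1)."""
    heap = [(p, i, [s]) for i, (s, p) in enumerate(P.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in P}
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, i2, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1                  # one more bit for merged symbols
        heapq.heappush(heap, (p1 + p2, i2, s1 + s2))
    return lengths

P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_lengths(P)
avg = sum(P[s] * lengths[s] for s in P)
H2 = -sum(p * math.log2(p) for p in P.values())
print(lengths, avg, H2)    # dyadic probabilities: avg == H2 == 1.75
```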
If the models are nested, which is often the case, the most efficient will always be
the largest model, since the maximization is on a larger set. However, the minimum
description length (MDL) principle uses the fact that, in order to decode the com-
pressed data, the model, including its optimal parameters, has to be known, so that
the complete code needs to include a model description. The decoding algorithm
will then be: decode the model, then use it to decode the data.
So assume that a model (one of the Mj ’s) has a kj -dimensional parameter θ. Also
assume that a probability distribution, π(θ | Mj ), is used to encode θ. Also choose a
precision level, δij , for each coordinate in θ, i = 1, . . . , kj . (Previously, we could con-
sider the precision of the yk , δ0 , as fixed, but now, the precision level for parameters
is a variable that will be optimized.) The total description length using this model
now becomes
$$-\sum_{k=1}^N\log_2\varphi(y_k\mid x_k;\theta,M_j) - \log_2\pi(\theta\mid M_j) - \sum_{i=1}^{k_j}\log_2(\delta_{ij}).$$
Note that $S_{\hat\theta}^{(j)}$ must be negative semi-definite, since $\hat\theta$ is a local maximum. Assuming it is non-singular, the previous expression can be maximized and yields
$$\frac{1}{\log 2}\,S_{\hat\theta}^{(j)}\delta^{(j)} = -\frac{1}{\delta^{(j)}} \tag{23.1}$$
or, equivalently,
$$\frac{S_{\hat\theta}^{(j)}}{N\log 2}\,\sqrt N\delta^{(j)} = -\frac{1}{\sqrt N\,\delta^{(j)}}.$$
This implies that $\sqrt N\delta^{(j)}$ is the solution of an equation which stabilizes with N, and it is therefore reasonable to assume that the optimal $\delta_{ij}$ takes the form $\delta_{ij} = c_i(N\mid M_j)/\sqrt N$, with $c_i(N\mid M_j)$ converging to some limit when N tends to infinity. The total cost can therefore be estimated by
$$-L(\hat\theta^{(j)}\mid M_j) + \frac{k_j}{2}\log_2 N - \frac{k_j}{2} - \sum_{i=1}^{k_j}\log_2 c_i(N\mid M_j).$$
The last two terms are O(1), and can be neglected, at least when N is large compared to $k_j$. The final criterion becomes the penalized likelihood
$$l_d(\theta\mid M_j) = L(\theta\mid M_j) - \frac{k_j}{2}\log_2 N,$$
in which we see that the dimension of the model appears with a factor $\log_2 N$ as announced (one needs to normalize both terms by N to compare with the previous paragraph).
The discussion of the AIC was a first attempt at evaluating a prediction error. It was
however done under very specific parametric assumptions, including the fact that
the true distribution of the data was within the considered model class. It was, in
594 CHAPTER 23. GENERALIZATION BOUNDS
addition, a bias evaluation, i.e., we estimated how much, in average, the in-sample
error was less than the generalization error. We would like to obtain upper bounds to
the generalization error that hold with high probability, and rely as little as possible
on assumptions on the true data distribution.
One of the main tools used in this context are concentration inequalities, which
provide upper bounds on the various probabilities of events involving a large num-
ber of random variables. The current section provides a review of some of these
inequalities.
From these properties, one can easily derive a concentration inequality for the mean of independent random variables. We have $M_{\bar X_N}(\lambda) = NM_X(\lambda/N)$ and, applying (23.3), we get, for any $\lambda\ge 0$ and $t > 0$,
$$P(\bar X_N - m > t) \le e^{-\lambda(m+t)+M_{\bar X_N}(\lambda)} = e^{-N\big(\frac{\lambda}{N}(m+t) - M_X(\frac{\lambda}{N})\big)}$$
where the right-hand side may be infinite. Because this inequality is true for any λ, we have
$$P(\bar X_N - m > t) \le e^{-NM_{X,+}^*(m+t)}$$
where $M_{X,+}^*(u) = \sup_{\lambda\ge 0}(\lambda u - M_X(\lambda))$, which is non-negative since the maximized quantity vanishes for $\lambda = 0$. A symmetric computation yields
$$P(\bar X_N - m < -t) \le e^{-NM_{X,-}^*(m-t)}$$
where $M_{X,-}^*(t) = \sup_{\lambda\le 0}(\lambda t - M_X(\lambda))$, which is also non-negative.
Let
$$M_X^*(t) = \sup_{\lambda\in\mathbb{R}}(\lambda t - M_X(\lambda)) \ge 0 \tag{23.4}$$
(this is the Fenchel-Legendre transform of the cumulant generating function, sometimes called the Cramér transform of X). One has $M_X^*(m+t) = M_{X,+}^*(m+t)$ for $t > 0$. Indeed, because $x\mapsto e^{\lambda x}$ is convex, Jensen's inequality implies that
$$E(e^{\lambda X}) \ge e^{\lambda m},$$
so that $\lambda(m+t) - M_X(\lambda) \le \lambda t < 0$ if $\lambda < 0$. Similarly, $M_X^*(m-t) = M_{X,-}^*(m-t)$ for $t > 0$. We therefore have the following result.
and
$$P(|\bar X_N - m| > t) \le 2e^{-N\min(M_X^*(m+t),\,M_X^*(m-t))}.$$
This is our first example of a concentration inequality: it shows that the probability of a deviation of $\bar X_N$ from its mean by at least t decays exponentially fast. The derivation of the inequality above was quite easy: apply Markov's inequality in a parametrized form and optimize over the parameter. It is therefore surprising that this inequality is sharp, in the sense that a similar lower bound also holds. Even though we are not going to use it in the rest of this chapter, it is worth sketching the argument leading to this lower bound, which involves an interesting step making a change of measure.
Assume (without loss of generality) that $m = 0$ and consider $P(\bar X_N > t)$. Assume, to simplify the discussion, that the supremum of $\lambda\mapsto\lambda t - M_X(\lambda)$ is attained at some $\lambda_t$. We have
$$\partial_\lambda M_X(\lambda) = \frac{E(Xe^{\lambda X})}{E(e^{\lambda X})}.$$
Let $q_\lambda(x) = \frac{e^{\lambda x}}{E(e^{\lambda X})}$ and $P_\lambda$ (with expectation $E_\lambda$) the probability distribution on Ω with density $q_\lambda(X)$ with respect to P, so that $\partial_\lambda M_X(\lambda) = E_\lambda(X)$. We have, since $\lambda_t$ is a maximizer, $E_{\lambda_t}(X) = t$. Moreover, fixing $\delta > 0$ and $\lambda > 0$,
$$P(\bar X_N > t) = E(1_{\bar X_N > t}) \ge E(1_{|\bar X_N - t-\delta| < \delta}) \ge E\big(1_{|\bar X_N-t-\delta|<\delta}\,e^{\lambda(N\bar X_N - Nt - 2N\delta)}\big) = e^{-N\lambda(t+2\delta)}e^{NM_X(\lambda)}P_\lambda(|\bar X_N - t - \delta| < \delta).$$
If one takes $\lambda = \lambda_{t+\delta}$, this implies that
$$P(\bar X_N > t) \ge e^{-NM_X^*(t+\delta)}e^{-N\lambda_{t+\delta}\delta}\,P_{\lambda_{t+\delta}}(|\bar X_N - t - \delta| < \delta).$$
By the law of large numbers (applied to $P_{\lambda_{t+\delta}}$), $P_{\lambda_{t+\delta}}(|\bar X_N - t - \delta| < \delta)$ tends to 1 when N tends to infinity. This implies that the logarithmic rate of convergence to 0 of $P(\bar X_N > t)$ is at most $N(M_X^*(t+\delta) + \lambda_{t+\delta}\delta)$, for any $\delta > 0$, to be compared with the rate $NM_X^*(t)$ for the upper bound. In large deviation theory, the upper and lower bounds are often simplified by considering the limit of $-\log P(\bar X_N > t)/N$, which, in this case, is $M_X^*(t)$ (and this result is called Cramér's theorem).
While Cramér’s upper bound is sharp, its computation requires an exact knowl-
edge of the distribution of X, which is not a common situation. The following sec-
tions optimize the upper bound in situations where only partial information on the
variable is known, such as its moments or its range. As a first example, we consider
concentration of the mean for sub-Gaussian variables.
for some positive constants C and λ. Reducing if needed the value of λ, one can
assume that C takes some predetermined (larger than 1) value, say, C = 2, the simple
argument being left to the reader. A random variable X is called sub-Gaussian if, for some $\sigma^2 > 0$,
$$P(|X| > x) \le 2e^{-\frac{x^2}{2\sigma^2}}. \tag{23.5}$$
Sub-Gaussian random variables are such that $M(\lambda) < \infty$ for all $\lambda\in\mathbb{R}$. Indeed, for $\lambda > 0$,
$$
E(e^{\lambda|X|}) = \int_0^\infty P(e^{\lambda|X|} > z)\,dz = 1 + \int_1^\infty P(|X| > \lambda^{-1}\log z)\,dz \le 1 + 2\int_1^\infty e^{-\frac{(\log z)^2}{2\sigma^2\lambda^2}}\,dz \le 1 + 2\int_0^\infty e^{x-\frac{x^2}{2\lambda^2\sigma^2}}\,dx \le 1 + 2\sqrt{2\pi}\,\lambda\sigma\, e^{\frac{\lambda^2\sigma^2}{2}}.
$$
Proposition 23.3 Assume that X is sub-Gaussian, so that (23.5) holds for some $\sigma^2 > 0$. Then, for any $t > 0$, we have
$$P(\bar X_N - E(X) > t) \le \Big(1+\frac{4t^2}{\sigma^2}\Big)^N e^{-\frac{Nt^2}{2\sigma^2}}.$$
Proof Let us assume, without loss of generality, that E(X) = 0. For λ > 0, we then
have
E(eλX ) = 1 + E(eλX − λX − 1) .
Let ϕ(t) = et − t − 1. We have ϕ(t) ≥ 0 for all t, ϕ(0) = 0 and, for z > 0, the equation
z = ϕ(t) has two solutions, one positive and one negative that we will denote g+ (z) >
0 > g− (z). We have
$$E(\varphi(\lambda X)) = \int_0^\infty P(\varphi(\lambda X) > z)\,dz = \int_0^\infty P(\lambda X > g_+(z))\,dz + \int_0^\infty P(\lambda X < g_-(z))\,dz.$$
The change of variable u = g+ (z) in the first integral is equivalent to u > 0, ϕ(u) = z
with dz = (eu −1)du. Similarly, u = −g− (z) in the second integral gives u > 0, ϕ(−u) = z
and dz = (1 − e−u )du so that
$$E(\varphi(\lambda X)) = \int_0^\infty P(\lambda X > u)(e^u-1)\,du + \int_0^\infty P(\lambda X < -u)(1-e^{-u})\,du \le \int_0^\infty P(\lambda|X| > u)(e^u-e^{-u})\,du,$$
using the fact that $\max(P(\lambda X > u),\,P(\lambda X < -u)) \le P(\lambda|X| > u)$.
We have
$$
\int_0^\infty P(\lambda|X| > u)(e^u-e^{-u})\,du \le 2\int_0^{+\infty}(e^u-e^{-u})e^{-\frac{u^2}{2\lambda^2\sigma^2}}\,du = 2\lambda\sigma\int_0^{+\infty}(e^{\lambda\sigma v}-e^{-\lambda\sigma v})e^{-\frac{v^2}{2}}\,dv = 2\lambda\sigma\,e^{\frac{\lambda^2\sigma^2}{2}}\sqrt{2\pi}\,\big(\Phi(\sigma\lambda)-\Phi(-\sigma\lambda)\big) \le 4\lambda^2\sigma^2e^{\frac{\lambda^2\sigma^2}{2}}
$$
so that
$$M_X(\lambda) \le \log\Big(1+4\lambda^2\sigma^2e^{\frac{\lambda^2\sigma^2}{2}}\Big) \le \frac{\lambda^2\sigma^2}{2} + \log(1+4\lambda^2\sigma^2).$$
This implies
$$M_X^*(t) = \sup_{\lambda>0}(\lambda t - M_X(\lambda)) \ge \frac{t^2}{\sigma^2} - M_X(t/\sigma^2) \ge \frac{t^2}{2\sigma^2} - \log\Big(1+\frac{4t^2}{\sigma^2}\Big)$$
so that
$$P(\bar X_N > t) \le \Big(1+\frac{4t^2}{\sigma^2}\Big)^N e^{-\frac{Nt^2}{2\sigma^2}}.$$
The following result allows one to control the expectation of a non-negative sub-
Gaussian random variable.
The following proposition (see [24]) provides an upper bound for MX (λ) as a func-
tion of E(X) and var(X) under the additional assumption that X is bounded from
above.
Proposition 23.5 Let $m = E(X)$ and assume that for some constant b, one has $X\le b$ with probability one. Then, for any $\sigma^2 > 0$ such that $\mathrm{var}(X)\le\sigma^2$, one has
$$E(e^{\lambda X}) \le e^{\lambda m}\left(\frac{(b-m)^2}{(b-m)^2+\sigma^2}\,e^{-\frac{\lambda\sigma^2}{b-m}} + \frac{\sigma^2}{(b-m)^2+\sigma^2}\,e^{\lambda(b-m)}\right) \tag{23.6}$$
for any $\lambda\ge 0$.
Proof There is no loss of generality in assuming that $m = 0$ and $\lambda = 1$, in which case one must show that
$$E(e^X) \le \frac{b^2}{b^2+\sigma^2}\,e^{-\frac{\sigma^2}{b}} + \frac{\sigma^2}{b^2+\sigma^2}\,e^b \tag{23.7}$$
if $X\le b$ and $E(X^2)\le\sigma^2$. Indeed, if this inequality is true for $m = 0$ and $\lambda = 1$, (23.6) in the general case will result from letting $X = Y/\lambda + m$ and applying the special case to Y.
The right-hand side of (23.7) is exactly E(eX ) when X follows the discrete distri-
bution P0 supported by two points x0 and b, and such that E(X) = 0 and E(X 2 ) = σ 2 ,
which requires x0 = −σ 2 /b and P (X = x0 ) = b2 /(σ 2 + b2 ).
and we now estimate this lower bound. Maximizing $\lambda y - \log F(\lambda)$ is equivalent to minimizing
$$\lambda\mapsto\frac{(b-m)^2\,e^{-\frac{\lambda(\sigma^2+(u-m))}{b-m}} + \sigma^2e^{\lambda(b-u)}}{(b-m)^2+\sigma^2}.$$
Introduce the notation $\rho = \sigma^2/(b-m)^2$, $\mu = \lambda(b-m)$ and $x = (u-m)/(b-m)$, so that the function to minimize is
$$\mu\mapsto\frac{e^{-\mu(\rho+x)} + \rho e^{\mu(1-x)}}{1+\rho}.$$
Computing the derivative in µ and equating it to 0 gives
$$\mu = \frac{1}{1+\rho}\log\frac{\rho+x}{\rho(1-x)},$$
which is non-negative since $\rho + x - \rho(1-x) = (1+\rho)x$. For this value of µ, we have
$$\log E(e^X) \le \frac{b^2}{b^2+\sigma^2}\,e^{-\frac{\sigma^2}{b}} + \frac{\sigma^2}{b^2+\sigma^2}\,e^b - 1 = \frac{b^2}{b^2+\sigma^2}\Big(e^{-\frac{\sigma^2}{b}} + \frac{\sigma^2}{b} - 1\Big) + \frac{\sigma^2}{b^2+\sigma^2}\big(e^b - b - 1\big).$$
We will use the following lemma.

Lemma The function $\varphi(u) = (e^u - u - 1)/u^2$ (extended by continuity with $\varphi(0) = 1/2$) is non-decreasing on $\mathbb{R}$.
Proof We have $\varphi'(u) = \psi(u)/u^3$ where $\psi(u) = ue^u - 2e^u + u + 2$, yielding $\psi'(u) = ue^u - e^u + 1$ and $\psi''(u) = ue^u$. Therefore, $\psi'$ has its minimum at $u = 0$ with $\psi'(0) = 0$, so that ψ is increasing. Since $\psi(0) = 0$, we have $\varphi'(u) = \psi(u)/u^3 \ge 0$.
We therefore have
$$
\log E(e^X) \le \frac{b^2}{b^2+\sigma^2}\Big(e^{-\frac{\sigma^2}{b}} + \frac{\sigma^2}{b} - 1\Big) + \frac{\sigma^2}{b^2+\sigma^2}(e^b-b-1)
= \frac{b^2}{b^2+\sigma^2}\,\frac{\sigma^4}{b^2}\,\varphi(-\sigma^2/b) + \frac{\sigma^2}{b^2+\sigma^2}\,b^2\varphi(b)
\le \Big(\frac{\sigma^4}{b^2+\sigma^2} + \frac{\sigma^2b^2}{b^2+\sigma^2}\Big)\varphi(b)
= \frac{\sigma^2}{b^2}(e^b - b - 1).
$$
This shows that
$$\log E(e^{\lambda X}) \le \frac{\sigma^2}{b^2}(e^{\lambda b} - \lambda b - 1)$$
and
$$M_X^*(t) \ge \frac{\sigma^2}{b^2}\max_\lambda\Big(\frac{\lambda b^2t}{\sigma^2} - e^{\lambda b} + \lambda b + 1\Big) = \frac{\sigma^2}{b^2}\,h(bt/\sigma^2)$$
where $h(u) = (1+u)\log(1+u) - u$.
Corollary 23.8 Assume that X satisfies the conditions of proposition 23.5. Then, for $t > 0$,
$$P(\bar X_N > m+t) \le \exp\left(-\frac{N\sigma^2}{(b-m)^2}\,h\Big(\frac{(b-m)t}{\sigma^2}\Big)\right) \tag{23.9}$$
This estimate can be further simplified as follows. Let g be such that $g''(u) = (1+u/3)^{-3}$ and $g(0) = g'(0) = 0$, which gives $g(u) = u^2/(2+2u/3)$. Noting that $h''(u) = (1+u)^{-1}$ and that $(1+u)^{-1} \ge (1+u/3)^{-3}$ for $u\ge 0$, we find, integrating twice, that $h(u)\ge g(u)$ for $u\ge 0$. This shows that the following upper bound is also true:
$$P(\bar X_N > m+t) \le \exp\left(-\frac{Nt^2}{2\sigma^2 + 2t(b-m)/3}\right). \tag{23.10}$$
We now consider the case in which the random variables X1 , . . . , XN are bounded
from above and from below, and start with the following consequence of proposi-
tion 23.5.
Proposition 23.10 Let X be a random variable taking values in the interval [a,b]. Let $m = E(X)$. Then
$$E(e^{\lambda X}) \le \frac{b-m}{b-a}\,e^{\lambda a} + \frac{m-a}{b-a}\,e^{\lambda b} \le e^{\lambda m}e^{\frac{\lambda^2(b-a)^2}{8}} \tag{23.11}$$
for all $\lambda\in\mathbb{R}$.
Proof We first note that, if X takes values in [a,b], then $\mathrm{var}(X) \le (b-m)(m-a)$ (allowing us to use $\sigma^2 = (b-m)(m-a)$ in (23.6)). To prove this upper bound on the variance, introduce the function $g(x) = (x-a)(x-b)$, so that $g(x)\le 0$ on [a,b]. Noting that one can write $g(x) = (x-m)^2 + (2m-a-b)(x-m) + (a-m)(b-m)$, we have
$$0 \ge E(g(X)) = \mathrm{var}(X) + (a-m)(b-m),$$
so that $\mathrm{var}(X) \le (b-m)(m-a)$. This shows that, if $\lambda\ge 0$, we can apply proposition 23.5 with $\sigma^2 = (b-m)(m-a)$, which provides the first inequality in (23.11). To handle the case $\lambda\le 0$, it suffices to apply this inequality with $\tilde\lambda = -\lambda$, $\tilde X = -X$, $\tilde a = -b$, $\tilde b = -a$ and $\tilde m = -m$.
requires a little additional work. Letting $u = (m-a)/(b-a)$, $\alpha = \lambda(b-a)$ and taking logarithms, we need to prove that
$$\log(1-u+ue^\alpha) - u\alpha \le \frac{\alpha^2}{8}.$$
Let $f(\alpha)$ denote the difference between the right-hand side and left-hand side. Then $f(0) = 0$,
$$f'(\alpha) = \frac{\alpha}{4} - \frac{ue^\alpha}{1-u+ue^\alpha} + u$$
(so that $f'(0) = 0$) and
$$f''(\alpha) = \frac14 - \frac{u(1-u)e^\alpha}{(1-u+ue^\alpha)^2}.$$
For positive numbers $x = 1-u$ and $y = ue^\alpha$, one has $(x+y)^2 \ge 4xy$, which shows that $f''(\alpha)\ge 0$. This proves that $f'$ is non-decreasing with $f'(0) = 0$, proving that f is minimized at $\alpha = 0$, so that $f(\alpha)\ge 0$ as needed.
$$P(Y > E(Y)+t) \le \exp\Big(-\frac{2t^2}{|c|^2}\Big) \tag{23.12}$$
and
$$P(Y < E(Y)-t) \le \exp\Big(-\frac{2t^2}{|c|^2}\Big) \tag{23.13}$$
where $|c|^2 = \sum_{k=1}^N c_k^2$.
The upper bound is minimized for λ = 4t/|c|2 , yielding (23.12). Equation (23.13) is
obtained by applying (23.12) to −X.
$$P(\bar X_N > E(X)+t) \le \exp\Big(-\frac{2Nt^2}{\delta^2}\Big). \tag{23.14}$$
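A quick numerical check of (23.14) (our illustration): for uniform variables on [0,1], so that $\delta = 1$, the empirical deviation probability stays below the Hoeffding bound.

```python
import numpy as np

rng = np.random.default_rng(6)
N, t, trials = 100, 0.1, 20000

# X uniform on [0,1]: range delta = 1, E(X) = 1/2
X = rng.random((trials, N))
dev = (X.mean(axis=1) > 0.5 + t).mean()          # empirical P(Xbar > m + t)
bound = np.exp(-2 * N * t ** 2)                  # right-hand side of (23.14)
print(dev, bound)                                # dev <= bound
```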
One can relax the assumption that the random variables X1 , . . . , XN are independent
and only assume that these variables behave like “martingale increments,” as stated
in the following proposition [59].
Proof Proposition 23.10 applied to the conditional distribution implies that, for $\lambda\ge 0$:
$$\log E\big(e^{\lambda(Z_k-m_k)}\mid X_1,Z_1,\dots,X_{k-1},Z_{k-1}\big) \le \frac{\lambda^2c_k^2}{8}.$$
Let $S_k = \sum_{j=1}^k(Z_j-m_j)$. Then
$$E(e^{\lambda S_k}) = E\Big(e^{\lambda S_{k-1}}E\big(e^{\lambda(Z_k-m_k)}\mid X_1,Z_1,\dots,X_{k-1},Z_{k-1}\big)\Big) \le e^{\frac{\lambda^2c_k^2}{8}}E(e^{\lambda S_{k-1}})$$
so that
$$E(e^{\lambda S_N}) \le e^{\frac{\lambda^2}{8}\sum_{k=1}^Nc_k^2}.$$
$$Y_k = E(g(X_1,\dots,X_N)\mid X_1,\dots,X_k) - m$$
For fixed X1 , . . . , Xk−1 , (23.15) implies that Zk varies in an interval of length ck at most
(whose bounds depend on X1 , . . . , Xk−1 ) so that |Zk − E(Zk )| ≤ ck . Proposition 23.12
implies that
$$P(Z_1+\cdots+Z_N \ge t) \le e^{-2t^2/|c|^2},$$
which concludes the proof since
The following result [37], that we state without proof, extends on the same idea.
$$Z = g(X_1,\dots,X_N).$$
Then
$$P(Z - E(Z) > t) \le \exp\big(-E(Z)\,h(t/E(Z))\big) \le \exp\Big(-\frac{t^2}{2E(Z)+2t/3}\Big)$$
where $h(u) = (1+u)\log(1+u) - u$. Moreover, for $t < E(Z)$,
23.4.1 Introduction
Section 23.3 provides some of the most important inequalities used to evaluate the
deviation of various combinations of independent random variables (e.g., their em-
pirical mean) from their expectations (the reader may refer to Ledoux and Talagrand
[118], Devroye et al. [60], Talagrand [189], Dembo and Zeitouni [59], Vershynin [201]
and other textbooks on the subject for further developments).
If this probability is small, then R(f ) ≤ R̂T (f ) + t with high probability, providing
a likely upper bound to the generalization error of f . For example, if r is the 0–1
Now corollary 23.11 does not hold if we replace $f$ by $\hat f_T$, i.e., if $f$ is estimated from the training set $T$, which is, unfortunately, the situation we are interested in. Before addressing this problem, we point out that this inequality does apply to the case in which $f = \hat f_{T_0}$, where $T_0$ is another training set, independent of $T$, so that
$$P\big(R(\hat f_{T_0}) - \hat R_T(\hat f_{T_0}) > t\big) \le e^{-2Nt^2}.$$
In this situation, the empirical risk is computed on a test or validation set ($T$) independent of the set used to estimate $f$ ($T_0$).
If one does not have a test set, and $\hat f_T$ is optimized over a set $\mathcal F$ of possible predictors, one can rarely do much better than starting from a variation of the trivial upper bound
$$P\big(R(\hat f_T) - \mathcal E_T > t\big) \le P\left(\sup_{f\in\mathcal F}\big(R(f) - \mathcal E_T(f)\big) > t\right)$$
(with $\mathcal E_T = \hat R_T(\hat f_T)$), and the concentration inequalities discussed in section 23.3 need to be extended to provide upper bounds on the right-hand side.
Remark 23.15 Computing suprema of functions over uncountable sets may raise measurability issues. To avoid complications, we will always assume, when computing suprema over infinite sets, that they can be reduced to maximizations over finite sets, i.e., when considering $\sup_{f\in\mathcal F}\Phi(f)$ for some function $\Phi$, we will assume that there exists a nested sequence of finite subsets $\mathcal F_n \subset \mathcal F$ such that
$$\sup_{f\in\mathcal F_n}\Phi(f) \to \sup_{f\in\mathcal F}\Phi(f) \quad \text{as } n\to\infty.$$
This is true, for example, when $\mathcal F$ has a topology that admits a countable dense subset with respect to which $\Phi$ is continuous.
Such bounds cannot be applied in the typical case in which $\mathcal F$ is infinite, and they are likely to provide very poor estimates even when $\mathcal F$ is finite but $|\mathcal F|$ is large. However, all proofs of concentration inequalities applied to such suprema require using a union bound at some point, often after considerable preparatory work. Union bounds will in particular appear in conjunction with the Vapnik-Chervonenkis dimension that we now discuss.
We consider a classification problem with two classes, 0 and 1, and therefore let $\mathcal F$ be a set of binary functions, i.e., taking values in $\{0, 1\}$. We also assume that the risk function $r$ takes values in the interval $[0, 1]$ (using, for example, the 0-1 loss). Let
$$U(t) = P\left(\sup_{f\in\mathcal F}\big(R(f) - \mathcal E_T(f)\big) > t\right). \tag{23.18}$$
$$\mathcal F(x_1, \ldots, x_M) = \mathcal F(A),$$
where $A = \{x_i, i = 1, \ldots, M\}$. This provides the number of possible splits of a training set $T = (x_1, \ldots, x_M)$ using classifiers in $\mathcal F$. Fixing in this section a random variable $X$, we let
$$S_{\mathcal F}(M) = E\big(|\mathcal F(X_1, \ldots, X_M)|\big),$$
where the expectation is taken over $M$ i.i.d. realizations of $X$. We also let
$$P\left(\sup_{f\in\mathcal F}\frac1N\sum_{k=1}^N \xi_k\big(r(Y_k, f(X_k)) - r(Y_k', f(X_k'))\big) \ge \frac t2 \,\Big|\, T, T'\right) \\
\le \big|\mathcal F(X_1, \ldots, X_N, X_1', \ldots, X_N')\big|\,\sup_{f\in\mathcal F} P\left(\frac1N\sum_{k=1}^N \xi_k\big(r(Y_k, f(X_k)) - r(Y_k', f(X_k'))\big) \ge \frac t2 \,\Big|\, T, T'\right).$$
The variables $\xi_k\big(r(Y_k, f(X_k)) - r(Y_k', f(X_k'))\big)$ are centered and belong to the interval $[-1, 1]$, which has length 2, so that Hoeffding's inequality implies
$$P\left(\frac1N\sum_{k=1}^N \xi_k\big(r(Y_k, f(X_k)) - r(Y_k', f(X_k'))\big) \ge \frac t2 \,\Big|\, T, T'\right) \le e^{-2N(t/2)^2/4} = e^{-Nt^2/8}.$$
Equation (23.20) is then obtained by letting $\delta = 2S_{\mathcal F}(2N)e^{-Nt^2/8}$, so that $t = \sqrt{\frac8N\log\frac{2S_{\mathcal F}(2N)}{\delta}}$, with $R(f) \le \mathcal E_T(f) + t$ for all $f$ with probability $1-\delta$ or more.
23.4.3 VC dimension
To obtain a practical bound, the quantity $S_{\mathcal F}(2N)$, or its upper bound $S^*_{\mathcal F}(2N)$, needs to be estimated. We prove below an important property of $S^*_{\mathcal F}$, namely that either $S^*_{\mathcal F}(M) = 2^M$ for all $M$, or there exists an $M_0$ for which $S^*_{\mathcal F}(M_0) < 2^{M_0}$; taking $M_0$ to be the largest integer for which equality holds, $S^*_{\mathcal F}(M)$ then has order $M^{M_0}$ for all $M \ge M_0$. This motivates the following definition of the VC-dimension of the model class.
Remark 23.18 If, for a finite set $A \subset \mathcal R$, one has $|\mathcal F(A)| = 2^{|A|}$, one says that $A$ is shattered by $\mathcal F$. So VC-dim($\mathcal F$) is the largest integer $M$ such that there exists a set of cardinality $M$ in $\mathcal R$ that is shattered by $\mathcal F$.
We now evaluate the growth of $S^*_{\mathcal F}(M)$ in terms of the VC-dimension, starting with the following lemma, which states that, if $A$ is a finite subset of $\mathcal R$, there are at least $|\mathcal F(A)|$ subsets of $A$ that are shattered by $\mathcal F$.
Proof The statement holds for $A = \emptyset$, for which $|\mathcal F(\emptyset)| = 1 = 2^0$. For $|A| = 1$, the upper bound is either 1 (if $|\mathcal F(A)| = 1$) or 2 (if $|\mathcal F(A)| = 2$), and the collection of sets $B \subset A$ such that $|\mathcal F(B)| = 2^{|B|}$ is $\{\emptyset\}$ in the first case and $\{\emptyset, A\}$ in the second one. So the statement is true for $|A| = 0$ or 1.

Proceeding by induction, assume that the result is true whenever $|A| \le N$, and consider a set $A'$ with $|A'| = N+1$. Assume that $|\mathcal F(A')| \ge 2$ (otherwise there is nothing to prove), which implies that there exists $x \in A'$ such that $|\mathcal F(\{x\})| = 2$. Take such an $x$ and write $A' = A \cup \{x\}$ with $x \notin A$. Let $\mathcal F_i = \{f \in \mathcal F : f(x) = i\}$ for $i = 0, 1$. Since $\mathcal F_0 \cap \mathcal F_1 = \emptyset$, we have $|\mathcal F(A')| = |\mathcal F_0(A')| + |\mathcal F_1(A')|$. Since $f(x)$ is constant on $\mathcal F_0$ (resp. $\mathcal F_1$), we have $|\mathcal F_0(A')| = |\mathcal F_0(A)|$ (resp. $|\mathcal F_1(A')| = |\mathcal F_1(A)|$), and the induction hypothesis implies
From this lemma, it follows that if VC-dim($\mathcal F$) $= D < \infty$, then $S^*_{\mathcal F}(M)$ is bounded by the total number of subsets of cardinality $D$ or less in a set of cardinality $M$. This provides the following result, which implies that the term in front of the exponential in (23.18) grows polynomially in $N$ if $\mathcal F$ has finite VC-dimension.
and the statement of the proposition derives from the standard upper bound
$$\sum_{k=0}^{D}\binom Nk \le \left(\frac{eN}{D}\right)^D.$$
Indeed,
$$\binom Nk = \frac{N!}{(N-k)!\,k!} \le \frac{N^k}{k!} \le \frac{N^D}{D^D}\,\frac{D^k}{k!}$$
if $k \le D \le N$. This yields
$$\sum_{k=0}^{D}\binom Nk \le \frac{N^D}{D^D}\sum_{k=0}^{D}\frac{D^k}{k!} \le \frac{N^De^D}{D^D},$$
as required.
We can therefore state a corollary to theorem 23.16 for model classes with finite
VC-dimension.
Corollary 23.21 Assume that VC-dim($\mathcal F$) $= D < \infty$. Then, for $t \ge \sqrt{2/N}$ and $N \ge D$,
$$P\left(\sup_{f\in\mathcal F}\big(R(f) - \mathcal E_T(f)\big) > t\right) \le 2\left(\frac{2eN}{D}\right)^De^{-Nt^2/8} \tag{23.22}$$
and
$$P\left(\sup_{f\in\mathcal F}\big(R(f) - \mathcal E_T(f)\big) \le \sqrt{\frac8N}\sqrt{D\log\frac{eN}D + \log\frac2\delta}\right) \ge 1 - \delta. \tag{23.23}$$
23.4.4 Examples
The following result provides the VC-dimension of the collection of linear classifiers.
Proposition 23.22 Let $\mathcal R = \mathbb R^d$ and $\mathcal F = \big\{x \mapsto \mathrm{sign}(a_0 + b^Tx) : a_0 \in \mathbb R, b \in \mathbb R^d\big\}$. Then VC-dim($\mathcal F$) $= d+1$.
Proof Let us show that no set of $d+2$ points can be shattered by $\mathcal F$. Use the notation $\tilde x = (1, x^T)^T$ and $\beta = (a_0, b^T)^T$, and consider $d+2$ points $x_1, \ldots, x_{d+2}$. Then $\tilde x_1, \ldots, \tilde x_{d+2}$ are linearly dependent, and one of them, say $\tilde x_{d+2}$, can be expressed as a linear combination of the others. Write
$$\tilde x_{d+2} = \sum_{k=1}^{d+1}\alpha_k\tilde x_k.$$
Then there is no function $f \in \mathcal F$ (taking the form $\tilde x \mapsto \mathrm{sign}(\beta^T\tilde x)$) that maps $(x_1, \ldots, x_{d+2})$ to $(\mathrm{sign}(\alpha_1), \ldots, \mathrm{sign}(\alpha_{d+1}), -1)$ (where the definition of $\mathrm{sign}(0) = \pm 1$ is indifferent), since any such function satisfies
$$\beta^T\tilde x_{d+2} = \sum_{k=1}^{d+1}\alpha_k\,\beta^T\tilde x_k > 0.$$
We now consider the class $\mathcal F(L, (U_i), (W_i), p)$ that consists of feed-forward neural networks with $L$ layers, $U_i$ piecewise-linear computational units with fewer than $p$ pieces in the $i$th layer, and such that the total number of parameters involved in layers $1, 2, \ldots, j$ is less than $W_j$.
Theorem 23.23
$$\text{VC-dim}\big(\mathcal F(L, (U_i), (W_i), p)\big) = O\big(\bar LW_L\log(pU)\big),$$
where $U = U_1 + \cdots + U_L$ and
$$\bar L = \frac1{W_L}\sum_{j=1}^LW_j.$$
Note that $p = 2$ for ReLU networks. Theorem 7 in Bartlett et al. [21] also provides a more explicit upper bound, namely
$$\text{VC-dim}\big(\mathcal F(L, (U_i), (W_i), p)\big) \le L + \bar LW_L\log_2\left(4ep\sum_{i=1}^LiU_i\,\log_2\Big(2ep\sum_{i=1}^LiU_i\Big)\right).$$
Note that $Z$ is the base-two entropy of the uniform distribution, $\pi$, on the set $\mathcal F(X_1, \ldots, X_N) \subset \{-1, 1\}^N$.
Lemma 23.25 Let $A$ be a finite set and $\psi$ a probability distribution on $A^N$. Let $\psi_k$ be its marginal when the $k$th variable is removed. Then
$$\sum_{k=1}^NH_2(\psi_k) - (N-1)H_2(\psi) \ge 0. \tag{23.25}$$
Given the lemma, let $\pi_k$ denote the marginal distribution of $\pi$ when the $k$th variable is removed. We have
$$\sum_{k=1}^N\big(H_2(\pi) - H_2(\pi_k)\big) \le H_2(\pi),$$
from which (23.24) derives since $Z = H_2(\pi)$ and $Z_k \ge H_2(\pi_k)$. The result then follows from theorem 23.14.
We now prove lemma 23.25 by induction (this proof requires some basic notions of information theory). For convenience, introduce random variables $(\xi_1, \ldots, \xi_N)$ such that $\xi_k \in A$, with joint probability distribution given by $\psi$. Let $Y = (\xi_1, \ldots, \xi_N)$, $Y^{(k)}$ the $(N-1)$-tuple formed from $Y$ by removing $\xi_k$, $Y^{(k,l)}$ the $(N-2)$-tuple obtained by removing $\xi_k$ and $\xi_l$, etc. Inequality (23.25) can then be rewritten as
$$\sum_{k=1}^NH_2(Y^{(k)}) - (N-1)H_2(Y) \ge 0.$$
This inequality is obviously true for $N = 1$, and it is true also for $N = 2$, since it gives in this case the well-known inequality $H_2(\xi_1, \xi_2) \le H_2(\xi_1) + H_2(\xi_2)$. Fix $M > 2$ and assume that the lemma is true for any $N < M$. To prove the statement for $N = M$, we will use the following inequality, which holds for any three random variables $U_1, U_2, U_3$:
$$H_2(U_1, U_3) + H_2(U_2, U_3) \ge H_2(U_1, U_2, U_3) + H_2(U_3).$$
This inequality is equivalent to the statement on conditional entropies that $H_2(U_1, U_2 \mid U_3) \le H_2(U_1 \mid U_3) + H_2(U_2 \mid U_3)$. We apply it, for given $k \ne l$, to $U_1 = \xi_l$, $U_2 = \xi_k$, $U_3 = Y^{(k,l)}$, yielding
$$H_2(Y^{(k)}) + H_2(Y^{(l)}) \ge H_2(Y) + H_2(Y^{(k,l)}).$$
Summing this inequality over all ordered pairs $k \ne l$ and applying the induction hypothesis to each $(N-1)$-tuple $Y^{(k)}$ (which gives $\sum_{l\ne k}H_2(Y^{(k,l)}) \ge (N-2)H_2(Y^{(k)})$), we obtain
$$2(N-1)\sum_{k=1}^NH_2(Y^{(k)}) \ge N(N-1)H_2(Y) + (N-2)\sum_{k=1}^NH_2(Y^{(k)}),$$
so that $N\sum_{k=1}^NH_2(Y^{(k)}) \ge N(N-1)H_2(Y)$, which is the desired inequality.
from Jensen’s inequality. This implies that the high-probability upper bound on
HVC (2N , F ) that results from the previous theorem is not necessarily an upper bound
on log(SF (2N )). It is however proved in Boucheron et al. [37] that
1
log2 E(SF (X1 , . . . , X2N )) ≤ H (2N , F )
log 2 VC
also holds (as a consequence of (23.16)). A little more work (see Boucheron et al. [37])
combining theorem 23.16 and theorem 23.24 implies the following bound, which
holds with probability 1 − δ at least:
r r
6 log SF (X1 , . . . , XN ) log(2/δ)
∀f ∈ F : R(f ) ≤ E(f ) + +4 .
N N
23.5 Covering numbers and chaining

The upper bounds using the VC dimension relied on the number of different values taken by a set of functions when evaluated on a finite set, this number being used to apply a union bound. A different point of view may be taken when one relies on some notion of continuity, with respect to a given metric, of the family of functions for which a uniform concentration bound is needed. This viewpoint is furthermore applicable when the sets $\mathcal F(X_1, \ldots, X_N)$ are infinite. To develop these tools, we will need some new concepts measuring the size of sets in a metric space.
Definition 23.26 Let $(\mathcal G, \rho)$ be a metric space and let $\epsilon > 0$. The $\epsilon$-covering number of $(\mathcal G, \rho)$, denoted $N(\mathcal G, \rho, \epsilon)$, is the smallest integer $n$ such that there exists a subset $G \subset \mathcal G$ with $|G| = n$ and $\max_{g\in\mathcal G}\rho(g, G) \le \epsilon$.

Let $\gamma > 0$. The $\gamma$-packing number $M(\mathcal G, \rho, \gamma)$ is the largest number $n$ such that there exists a subset $A \subset \mathcal G$ with cardinality $n$ such that any two distinct elements of $A$ are at distance strictly larger than $\gamma$ (such sets are called $\gamma$-nets).

When $\mathcal G$ and $\rho$ are clear from the context, we will simply write $N(\epsilon)$ and $M(\gamma)$.
Proposition 23.27 One has, for any $\gamma > 0$,
$$M(\mathcal G, \rho, 2\gamma) \le N(\mathcal G, \rho, \gamma) \le M(\mathcal G, \rho, \gamma).$$
Proof Let $A$ be a maximal $\gamma$-net. Then, for all $x \in \mathcal G$, there exists $y \in A$ such that $\rho(x, y) \le \gamma$: otherwise $A \cup \{x\}$ would also be a $\gamma$-net. This shows that $\max\{\rho(x, A) : x \in \mathcal G\} \le \gamma$ and $N(\mathcal G, \rho, \gamma) \le |A|$.
The entropy numbers of $(\mathcal G, \rho)$, denoted, for an integer $N$, $e(\mathcal G, \rho, N)$ (or just $e(N)$), represent the best accuracy that can be achieved by subsets of $\mathcal G$ of size $N$, namely
$$e(\mathcal G, \rho, N) = \min_{G\subset\mathcal G,\,|G| = N}\max\{\rho(g, G) : g \in \mathcal G\}. \tag{23.26}$$
We have
$$e(\mathcal G, \rho, N) = \inf\{\epsilon : N(\mathcal G, \rho, \epsilon) \le N\} \tag{23.27a}$$
and
$$N(\mathcal G, \rho, \epsilon) = \min\{N : e(\mathcal G, \rho, N) \le \epsilon\}. \tag{23.27b}$$
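Covering and packing numbers can be estimated greedily on point clouds, which also illustrates the sandwich of proposition 23.27: a maximal $\gamma$-net is simultaneously a $\gamma$-packing and a $\gamma$-cover. A minimal sketch (our code, with an arbitrary sample standing in for the metric space):

```python
import numpy as np

def greedy_net(points, gamma):
    """Greedily build a maximal gamma-net: points at pairwise distance > gamma.
    Its size lower-bounds the packing number M(gamma) and, by proposition
    23.27, upper-bounds the covering number N(gamma)."""
    net = []
    for x in points:
        if all(np.linalg.norm(x - y) > gamma for y in net):
            net.append(x)
    return net

rng = np.random.default_rng(3)
cloud = rng.uniform(-1.0, 1.0, size=(2000, 2))   # a sample standing in for G
for gamma in [0.5, 0.25, 0.125]:
    print(f"gamma={gamma}: |net| = {len(greedy_net(cloud, gamma))}")
```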
Assume that $N(\mathcal G, \rho_\infty, \epsilon) < \infty$ for all $\epsilon > 0$ (which requires the set $\mathcal G$ to be precompact for the $\rho_\infty$ metric). Take $t > 0$, $0 < \epsilon < t$, and choose a set $G \subset \mathcal G$ such that $|G| = N(\mathcal G, \rho_\infty, \epsilon)$. Then, using a union bound,
$$P\Big(\sup_{g\in\mathcal G}g(Z) \ge t\Big) \le P\Big(\sup_{g\in G}g(Z) \ge t - \epsilon\Big) \le N(\mathcal G, \rho_\infty, \epsilon)\,\sup_{g\in G}P\big(g(Z) \ge t - \epsilon\big). \tag{23.28}$$
We now apply this inequality to the case of binary classification, where a binary variable $Y$ is predicted by an input variable $X$, with a model class of classifiers $\mathcal F$ and the 0-1 loss function. If $A$ is a finite family of elements of $\mathcal R$, we define, for $f, f' \in \mathcal F$,
$$\rho_A(f, f') = \frac1{|A|}\sum_{x\in A}\mathbf 1_{f(x)\ne f'(x)}.$$
Let
$$\bar N(\mathcal F, \epsilon, N) = E\Big(N\big(\mathcal F, \rho_{\{X_1,\ldots,X_N\}}, \epsilon\big)\Big).$$
Proof A key step in the proof of theorem 23.16 was to show that
$$P\left(\sup_{f\in\mathcal F}\big(R(f) - \mathcal E_T(f)\big) \ge t\right) \le 2P\left(\sup_{f\in\mathcal F}\sum_{k=1}^N\xi_k\big(r(Y_k', f(X_k')) - r(Y_k, f(X_k))\big) \ge \frac{Nt}2\right). \tag{23.30}$$
and therefore consider $r(Y_k', f(X_k')) - r(Y_k, f(X_k))$ as constants that we will denote $c_k(f)$. Since we are using a 0-1 loss, we have $c_k(f) \in \{-1, 0, 1\}$ and, for $f, f' \in \mathcal F$,
with
$$g_f(\xi_1, \ldots, \xi_N) = \frac1N\sum_{k=1}^Nc_k(f)\xi_k.$$
We have
$$\rho_\infty(g_f, g_{f'}) = \frac1N\sum_{k=1}^N|c_k(f) - c_k(f')|.$$
Applying Hoeffding’s inequality, we have, for u > 0 and using the fact that ck ∈ [−1, 1]
2N u 2 N u2
P(gf (Z) > u | T , T 0 ) ≤ e− 4 = e− 2
and the discussion preceding the theorem yields the fact that, for any > 0:
N (t/2−)2
P(sup gf (Z) > t/2 | T , T 0 ) ≤ N (G, , ρ∞ )e− 2 . (23.33)
f ∈F
N (G, , ρ∞ ) ≤ N (F , /2, ρA ) .
One can retrieve the bound obtained in theorem 23.16 using the obvious fact that
N (F , , ρA ) ≤ |F (A)|,
Covering numbers can be evaluated in some simple situations. The following propo-
sition provides an example in finite dimensions.
$$N(\mathcal G, \rho_\infty, \epsilon) \le \left(1 + \frac{2CM}\epsilon\right)^m.$$
Proof Letting $\rho$ denote the Euclidean distance in $\mathbb R^m$, our hypotheses imply that $N(\mathcal G^{(M)}, \rho_\infty, \epsilon)$ is bounded by $N(B_M, \rho, \epsilon/C)$, where $B_M$ is the ball with radius $M$ in $\mathbb R^m$. Now, if $\theta_1, \ldots, \theta_n$ is an $\alpha$-covering of $B_M$, then $\theta_1/M, \ldots, \theta_n/M$ is an $(\alpha/M)$-covering of $B_1$, which shows (together with a symmetric argument) that $N(B_M, \rho, \alpha) = N(B_1, \rho, \alpha/M)$, and we only need to evaluate $N(B_1, \rho, \alpha)$ for $\alpha > 0$. Using proposition 23.27, one can instead evaluate $M(B_1, \rho, \alpha)$. So let $A$ be an $\alpha$-net in $B_1$. Then
$$\bigcup_{x\in A}B_\rho(x, \alpha/2) \subset B_\rho(0, 1 + \alpha/2),$$
and the balls in the left-hand side are disjoint, so that, comparing volumes, $|A|(\alpha/2)^m \le (1 + \alpha/2)^m$, i.e., $|A| \le (1 + 2/\alpha)^m$.
One can also obtain entropy number estimates in infinite dimensions. Here, we quote a result applicable to spaces of smooth functions, referring to Van der Vaart and Wellner [197] for a proof.

Theorem 23.30 Let $Z$ be a bounded convex subset of $\mathbb R^d$ with non-empty interior. For $p \ge 1$ and $f \in C^p(Z)$, let
$$\|f\|_{p,\infty} = \max\big\{|D^kf(x)| : k = 0, \ldots, p,\ x \in Z\big\}.$$
23.5.4 Chaining
The distance $\rho_\infty$ may not always be the best one with which to analyze the set of functions $\mathcal G$. For example, if $\mathcal G$ is a class of functions with values in $\{-1, 1\}$, then $\rho_\infty(g, g') = 2$ unless $g = g'$. In such contexts, it is often preferable to use distances that measure average discrepancies, such as
$$\rho_p(g, g') = E\big(|g(Z) - g'(Z)|^p\big)^{1/p}, \tag{23.35}$$
for some random variable $Z$. Such distances, by definition, do not provide uniform bounds on differences between functions (which we used to write (23.28)), but can rather be used in upper bounds on the probabilities of deviations from zero, which have to be handled somewhat differently. We here summarize a general approach called "chaining," following for this purpose the presentation made in Talagrand [190] (see also Audibert and Bousquet [15]). From now on, we assume that $(\mathcal G, \rho)$ is a

for some $\alpha \in (0, 2]$, because, if $\rho$ is a distance, then so is $\rho^{\alpha/2}$ if $\alpha \le 2$. We will also assume that $E(g(Z)) = 0$ in order to avoid centering the variables at every step.
We are interested in upper bounds for $P(\sup_{g\in\mathcal G}g(Z) > t)$. To build a chaining argument, consider a family $(G_0, G_1, \ldots)$ of subsets of $\mathcal G$. Assume that $|G_k| \le N_k$, with $N_k$ chosen, for future simplicity, so that $N_{k-1}N_k \le N_{k+1}$. For $g \in \mathcal G$, let $\pi_k(g)$ denote a closest point to $g$ in $G_k$. Also assume that $G_0 = \{g_0\}$ is a singleton, so that $\pi_0(g) = g_0$ for all $g \in \mathcal G$. (One can generally assume without harm that $0 \in \mathcal G$, in which case one should choose $g_0 = 0$ in the following discussion.) For $g \in G_n$, we therefore have
$$g - g_0 = \sum_{k=1}^n\big(\pi_k(g) - \pi_{k-1}(g)\big).$$
If one takes $N_k = 2^{2^k}$, which satisfies $N_kN_{k-1} = 2^{2^k+2^{k-1}} \le N_{k+1}$, and $t_k = 2^{k/2}$, one finds that
$$P\Big(\sup_{g\in G_n}\big(g(Z) - g_0(Z)\big) > tS_n\Big) \le 2\sum_{k=1}^n2^{2^{k+1}}e^{-2^{k-1}t^2}.$$
The upper bound converges (as a function of $n$) as soon as $t > 2\sqrt{\log 2}$. Moreover, one has
$$2\sum_{k=1}^n2^{2^{k+1}}e^{-2^{k-1}t^2} \le 2e^{-\frac{t^2}2}\sum_{k=1}^ne^{-2^{k-2}(t^2-8\log 2)} \le 2e^{-\frac{t^2}2}\sum_{k=1}^\infty e^{-2^{k-2}}$$
when $t > \sqrt{1 + 8\log 2}$. This provides a concentration bound for $P(\sup_{g\in G_n}(g(Z) - g_0(Z)) > tS_n)$, which we may rewrite as
$$P\Big(\sup_{g\in G_n}\big(g(Z) - g_0(Z)\big) > t\Big) \le Ce^{-\frac{t^2}{2S_n^2}} \tag{23.37}$$
for $t > 2S_n\sqrt{\log 2}$, with $C = 2\sum_{k=1}^\infty e^{-2^{k-2}}$ and $S_n$ given by (23.36), with $t_k = 2^{k/2}$. Moreover, we have
$$S_n = \max_{g\in G_n}\sum_{k=1}^n2^{k/2}\rho\big(\pi_k(g), \pi_{k-1}(g)\big) \le \max_{g\in G_n}\sum_{k=1}^n2^{k/2}\big(\rho(g, G_k) + \rho(g, G_{k-1})\big) \le 2\max_{g\in G_n}\sum_{k=0}^n2^{k/2}\rho(g, G_k).$$
Note that this requires that the set $\mathcal G$ be precompact for the distance $\rho$. We will also assume that
$$\lim_{n\to\infty}\sup_{g\in G_n}g(x) = \sup_{g\in\mathcal G}g(x). \tag{23.39}$$

with $C = 2\sum_{k=1}^\infty e^{-2^{k-2}}$.
The exponential rate of convergence in the right-hand side of (23.41) is governed by the quantity $S$, and the upper bound is improved by building the sequence $(G_0, G_1, \ldots)$ so that $S$ is as small as possible. Such an optimization for a given family of functions is, however, a formidable problem. It is nonetheless interesting to see (still following [189]) that theorem 23.31 implies a classical inequality in terms of what is called the metric entropy of the metric space $(\mathcal G, \rho)$.
which is known as Dudley’s metric entropy of the space (G, ρ). We have
n
Z e(2) q ∞ Z
X e(22 ) q
h(G, ρ) = log N ()d + log N ()d.
n−1
0 n=1 e(22 )
n−1 n n
If ∈ [e(22 ), e(22 )), we have N () > 22 so that
∞
X
p n n−1
h(G, ρ) ≥ e(2) log 3 + 2n/2 (e(22 ) − e(22 ))
n=1
√ ∞
2X n
≥ 1− 2n/2 e(22 ).
2
n=1
Therefore,
$$\hat S \le \frac4{2-\sqrt2}\,h(\mathcal G, \rho) \le 7h(\mathcal G, \rho),$$
and this upper bound can also be used to obtain a simpler (but weaker) form of theorem 23.31.
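Given a covering-number bound, the entropy integral can be evaluated numerically. The sketch below (grid and truncation are arbitrary choices of ours) integrates $\sqrt{\log N(\epsilon)}$ for the Euclidean unit ball $B_1$ in $\mathbb R^m$, using the bound $N(\epsilon) \le (1 + 2/\epsilon)^m$ derived above.

```python
import numpy as np

# A crude numerical version of the entropy integral for the unit ball B_1 in
# R^m, with the covering bound N(eps) <= (1 + 2/eps)^m from the packing
# argument above. The truncation eps_min is safe since the integrand is
# integrable near zero.
def entropy_integral(m, eps_min=1e-4, n_grid=100_000):
    eps = np.linspace(eps_min, 1.0, n_grid)
    integrand = np.sqrt(m * np.log1p(2.0 / eps))
    return float(np.sum(integrand[:-1] * np.diff(eps)))

for m in [2, 10, 50]:
    print(m, round(entropy_integral(m), 3))
```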
Remark 23.32 The covering numbers of a class $\mathcal G$ of binary functions $g$ with values in $\{-1, 1\}$ can be controlled by the VC dimension of the class. Here, we consider $\rho(g, g') = P(g \ne g') = \rho_1(g, g')/2$. Then, the following theorem holds.

Theorem 23.33 Let $\mathcal G$ be a class of binary functions such that $D = \text{VC-dim}(\mathcal G) < \infty$. Then, there is a universal constant $K$ such that, for any $\epsilon \in (0, 1)$,
$$N(\mathcal G, \rho, \epsilon) \le KD(4e)^D\left(\frac1\epsilon\right)^{D-1}$$
with $\rho(g, g') = P(g \ne g')$.
We refer to Van der Vaart and Wellner [197], Theorem 2.6.4 for a proof, which is
rather long and technical.
23.5.6 Application
We quickly show how this discussion can be turned into results applicable to the classification problem. If $\mathcal F$ is a function class of binary classifiers and $r$ is the risk function, one can consider the class
$$\mathcal G(x_1, y_1, \ldots, x_N, y_N) = \big\{r(1, f(x_k)) : k = 1, \ldots, N,\ y_k = 1\big\} \cup \big\{r(-1, f(x_k)) : k = 1, \ldots, N,\ y_k = -1\big\}.$$
If the two sets in the right-hand side are not empty, i.e., if the numbers $N(1)$ and $N(-1)$ of indices $k$ such that $y_k = 1$ or $y_k = -1$ are both nonzero, then
which is less than $2^N$ as soon as $N > 2$. So, taking $N > 2$, for $(x_1, y_1, \ldots, x_N, y_N)$ to be shattered by $\mathcal G$, we need $N(1) = N$ or $N(-1) = N$, and in this case the inequality
$$|\mathcal G(x_1, y_1, \ldots, x_N, y_N)| \le |\mathcal F(x_1, \ldots, x_N)|$$
is obvious. The same inequality holds for some $x_1, \ldots, x_N$ with $N = 2$, except in the uninteresting case where $f(x) = 1$ (or $-1$) for every $x \in \mathcal R$.
A similar inequality holds for entropy numbers with the $\rho_1$ distance (cf. (23.35)) because
$$E\big(|r(Y, f(X)) - r(Y, f'(X))|\big) \le P\big(f(X) \ne f'(X)\big)$$
whenever $r$ takes values in $[0, 1]$, which implies that
$$N(\mathcal G, \rho_1, \epsilon) \le N(\mathcal F, \rho_1, \epsilon)$$
for all $\epsilon > 0$. Note however that evaluating this upper bound may still be challenging and would rely on strong assumptions on the distribution of $X$ allowing one to control $P(f(X) \ne f'(X))$.
$$\pi_\lambda(y) = \frac{e^{\lambda y}}{e^{-\lambda} + e^{\lambda}}.$$
Now, if $\mathcal F$ is a class of real-valued functions, we can define the risk function
$$r(y, f(x)) = \log\frac1{\pi_{f(x)}(y)}.$$
$$\mathcal F_M = \big\{f : x \mapsto a_0 + b^Tx : |b| \le M,\ |a_0| \le UM\big\}.$$
The restriction $|b| \le M$ is equivalent to using a penalty method such as, for example, ridge logistic regression. Moreover, if $|b| \le M$, it is natural to assume that $|a_0| \le UM$, because otherwise $f$ would have a constant sign on $\mathcal R$. In this case, we get
$$N(\mathcal F, \rho_\infty, \epsilon) \le \left(1 + \frac{4CU}\epsilon\right)^{d+1}.$$
23.6 Other complexity measures
VC-dimension and metric entropy are measures that control the complexity of a model class and can therefore be evaluated a priori, without observing any data. These bounds can, in general, be improved by using information derived from the training set, in particular the classification margin that has been achieved [18].
So, $r_\gamma(y, f(x))$ is equal to 0 if $f(x)$ correctly predicts $y$ with margin $\gamma$, and to 1 otherwise. We then define the classification error with margin $\gamma$ as
$$R_\gamma(f) = E\big(r_\gamma(Y, f(X))\big)$$
and, given a training set $T$ of size $N$,
$$\mathcal E_{\gamma,T}(f) = \frac1N\sum_{k=1}^Nr_\gamma(y_k, f(x_k)).$$
or, equivalently, with probability larger than $1-\delta$, one has, for all $f \in \mathcal F$,
$$R_0(f) - \mathcal E_{\gamma,T}(f) \le \sqrt{\frac8N\Big(\log N_\infty(\gamma/2, 2N) + \log\frac2\delta\Big)}. \tag{23.44}$$
Proof We first note that, for $Nt^2 > 2$,
$$P\left(\sup_{f\in\mathcal F}\big(R_0(f) - \mathcal E_{\gamma,T}(f)\big) > t\right) \le 2P\left(\sup_{f\in\mathcal F}\big(\mathcal E_{T'}(f) - \mathcal E_{\gamma,T}(f)\big) > \frac t2\right),$$
which is proved exactly in the same way as (23.21) in theorem 23.16, so we skip the argument.
We have
$$\mathcal E_{T'}(f) - \mathcal E_{\gamma,T}(f) = \frac1N\sum_{k=1}^N\big(r_0(Y_k', f(X_k')) - r_\gamma(Y_k, f(X_k))\big)$$
and, because $(X_k, Y_k)$ and $(X_k', Y_k')$ have the same distribution, $\sup_{f\in\mathcal F}(\mathcal E_{T'}(f) - \mathcal E_{\gamma,T}(f))$ has the same distribution as
$$\Delta_{T,T'}(\xi_1, \ldots, \xi_N) = \sup_{f\in\mathcal F}\frac1N\sum_{k=1}^N\Big(\big(r_0(Y_k', f(X_k')) - r_\gamma(Y_k, f(X_k))\big)\xi_k + \big(r_0(Y_k, f(X_k)) - r_\gamma(Y_k', f(X_k'))\big)(1 - \xi_k)\Big).$$
This is because, for any $(x, y) \in \mathcal R\times\{0, 1\}$ and $f, f'$ such that $|f(x) - f'(x)| < \gamma/2$, we have $r_0(y, f(x)) \le r_{\gamma/2}(y, f'(x))$ and $r_{\gamma/2}(y, f'(x)) \le r_\gamma(y, f(x))$: if an example is misclassified by $f$ (resp. $f'$) at a given margin, it must be misclassified by $f'$ (resp. $f$) at this margin plus $\gamma/2$.
Now,
$$P\Big(\Delta'_{T,T'}(\xi_1, \ldots, \xi_N) > \frac t2\Big) \le |\mathcal F|\max_{f\in\mathcal F}P\left(\frac1N\sum_{k=1}^N(2\xi_k - 1)\big(r_{\gamma/2}(Y_k', f(X_k')) - r_{\gamma/2}(Y_k, f(X_k))\big) > \frac t2\right).$$
(i) One says that $\mathcal F$ P-shatters $A$ if there exists a function $g_A : \mathcal R\to\mathbb R$ such that, for each $B \subset A$, there exists a function $f \in \mathcal F$ such that $f(x) \ge g_A(x)$ if $x \in B$ and $f(x) < g_A(x)$ if $x \in A\setminus B$.

(ii) Let $\gamma$ be a positive number. One says that $\mathcal F$ $P_\gamma$-shatters $A$ if there exists a function $g_A : \mathcal R\to\mathbb R$ such that, for each $B \subset A$, there exists a function $f \in \mathcal F$ such that $f(x) \ge g_A(x) + \gamma$ if $x \in B$ and $f(x) \le g_A(x) - \gamma$ if $x \in A\setminus B$.
Note that only the restriction of $g_A$ to $A$ matters in this definition. This function acts as a threshold for binary classification. More precisely, given a function $g : A\to\mathbb R$, one can associate to every $f \in \mathcal F$ the binary function $f_g$ with $f_g(x)$ equal to 1 if $f(x) \ge g(x)$ and to 0 otherwise. Letting $\mathcal F_g = \{f_g : f \in \mathcal F\}$, we see that $\mathcal F$ P-shatters $A$ if there exists a function $g_A$ such that $\mathcal F_{g_A}$ shatters $A$. The definition of $P_\gamma$-shattering introduces a margin in the definition of $f_g$ (with $f_g(x)$ equal to 1 if $f(x) \ge g(x) + \gamma$, to 0 if $f(x) \le g(x) - \gamma$, and ambiguous otherwise), and $A$ is $P_\gamma$-shattered by $\mathcal F$ if, for some $g_A$, the corresponding $\mathcal F_{g_A}$ shatters $A$ without ambiguities.
Theorem 23.37 Let $\gamma > 0$ and assume that $\mathcal F$ has $P_{\gamma/4}$-dimension $D < \infty$. Then
$$N_\infty(\gamma, N) \le 2\left(\frac{16N}{\gamma^2}\right)^{D\log(4eN/(D\gamma))}.$$
Proof The proof is quite technical and relies on a combinatorial argument in which $\mathcal F$ is first assumed to take integer values before addressing the continuous case.

Step 1. We first assume that functions in $\mathcal F$ take values in the finite set $\{1, \ldots, r\}$, where $r$ is an integer. For the time of this proof, we introduce yet another notion of shattering, called S-shattering (for strong shattering), which is essentially the same as $P_1$-shattering except that the functions $g$ are restricted to take values in $\{1, \ldots, r\}$. Let $A$ be a finite subset of $\mathcal R$. Given a function $g : \mathcal R\to\{1, \ldots, r\}$, we say that $(\mathcal F, g)$ S-shatters $A$ if, for any $B \subset A$, there exists $f \in \mathcal F$ satisfying $f(x) \ge g(x) + 1$ for $x \in B$ and $f(x) \le g(x) - 1$ if $x \in A\setminus B$. We say that $\mathcal F$ S-shatters $A$ if $(\mathcal F, g)$ S-shatters $A$ for some $g$. The S-dimension of $\mathcal F$ is the cardinality of the largest subset of $\mathcal R$ that can be S-shattered and will be denoted S-dim($\mathcal F$). The first, and most difficult, part of the proof is to show that, if S-dim($\mathcal F$) $= D$, then
$$M\big(\mathcal F(A), \rho_\infty, 2\big) \le 2\big(|A|r^2\big)^{\lceil\log_2y\rceil}$$
with
$$y = \sum_{k=1}^D\binom{|A|}kr^k,$$
where $\lceil u\rceil$ denotes the smallest integer larger than or equal to $u \in \mathbb R$, and $M$ is the packing number defined in section 23.5.1.
To prove this, we can assume that $r \ge 3$ since, for $r \le 2$, $M(\mathcal F(A), \rho_\infty, 2) = 1$ (the diameter of $\mathcal F$ for the $\rho_\infty$ distance is then 0 or 1). Let $G(A) = \{1, \ldots, r\}^A$ be the set of all functions $f : A\to\{1, \ldots, r\}$ and let
$$\mathcal U_A = \big\{F \subset G(A) : \forall f \ne f' \in F,\ \exists x \in A \text{ with } |f(x) - f'(x)| \ge 2\big\}.$$
For $F \in \mathcal U_A$, let
$$S_A(F) = \big\{(B, g) : B \subset A,\ B \ne \emptyset,\ g : B\to\{1, \ldots, r\},\ (F, g) \text{ S-shatters } B\big\}.$$
Let $t_A(h) = \min\{|S_A(F)| : F \in \mathcal U_A, |F| = h\}$ (where the minimum over the empty set is $+\infty$). Since we are considering in $\mathcal U_A$ all possible functions from $A$ to $\{1, \ldots, r\}$, it is clear that $t_A(h)$ only depends on $|A|$, and we will also denote it by $t(h, |A|)$.

Note that, by definition, if $(B, g) \in S_A(F)$ and $F \subset \mathcal F(A)$, then $|B| \le D$. So, the number of elements in $S_A(F)$ for such an $F$ is less than or equal to the number of possible such pairs $(B, g)$, which is strictly less than $y = \sum_{k=1}^D\binom{|A|}kr^k$. So, if $t(h, |A|) \ge y$, then there cannot be any $F \subset \mathcal F(A)$ of cardinality $h$ in the set $\mathcal U_A$, and $M(\mathcal F(A), \rho_\infty, 2) < h$. The rest of the proof consists in showing that $t(h, |A|) \ge y$ for $h = 2(|A|r^2)^{\lceil\log_2y\rceil}$.

For any $n \ge 1$, we have $t(2, n) = 1$: fix $x \in A$, and let $F = \{f_1, f_2\} \subset G(A)$ be such that $f_1(x) = 1$, $f_2(x) = 3$ and $f_1(y) = f_2(y)$ if $y \ne x$. Then the only element of $S_A(F)$ is $(\{x\}, g)$ with $g$ such that $g(x) = 2$.
Now, assume that, for some integer $m$, $t(2mnr^2, n) < \infty$, so that there exists $F \in \mathcal U_A$ with $|F| = 2mnr^2$. Arrange the elements of $F$ into $mnr^2$ pairs $\{f_i, f_i'\}$. For each such pair, there exists $x_i \in A$ such that $|f_i(x_i) - f_i'(x_i)| \ge 2$. Since there are at most $n$ selected $x_i$, one of them must appear at least $mr^2$ times. Call it $x$ and keep (and reindex) the corresponding $mr^2$ pairs, still denoted $\{f_i, f_i'\}$. Now, there are at most $r(r-1)/2$ possible distinct values for the unordered pairs $\{f_i(x), f_i'(x)\}$, so that one of them must appear at least $2mr^2/(r(r-1)) > 2m$ times. Select these functions, reindex them, and exchange the roles of $f_i$ and $f_i'$ if needed to obtain $2m$ pairs $\{f_i, f_i'\}$ such that $f_i(x) = k$ and $f_i'(x) = l$ for all $i$, for fixed $k, l \in \{1, \ldots, r\}$ with $k + 1 < l$. Let $F_1 = \{f_1, \ldots, f_{2m}\}$ and $F_1' = \{f_1', \ldots, f_{2m}'\}$, and let $A' = A\setminus\{x\}$. Then both $F_1$ and $F_1'$ belong to $\mathcal U_{A'}$, which implies that both $S_{A'}(F_1)$ and $S_{A'}(F_1')$ have cardinality at least $t(2m, n-1)$. Moreover, both sets are included in $S_A(F)$, and if $(B, g) \in S_{A'}(F_1)\cap S_{A'}(F_1')$, then $(B\cup\{x\}, g') \in S_A(F)$, with $g'(y) = g(y)$ for $y \in B$ and $g'(x) = k+1$. This provides $2t(2m, n-1)$ elements in $S_A(F)$ and shows the key inequality (which is obviously true when the left-hand side is infinite)
$$t(2mnr^2, n) \ge 2t(2m, n-1).$$
This inequality can now be used to prove by induction that, for all $0 \le k < n$, one has $t(2(nr^2)^k, n) \ge 2^k$. For $k \ge n$, one has $2(nr^2)^k > r^n$, where $r^n$ is the number of functions in $G(A)$, so that $t(2(nr^2)^k, n) = +\infty$. So $t(2(nr^2)^k, n) \ge 2^k$ is valid for all $k$, and it suffices to take $k = \lceil\log_2y\rceil$ to obtain the desired result.
Step 2. The next step uses a discretization scheme to extend the previous result to functions with values in $[-1, 1]$. More precisely, given $f : \mathcal R\to[-1, 1]$ and $\eta > 0$, let

To prove (a), assume that $\mathcal F^\eta$ S-shatters $A$, so that there exists $g$ such that, for all $B \subset A$, there exists $f \in \mathcal F$ such that $f^\eta(x) \ge g(x) + 1$ for $x \in B$ and $f^\eta(x) \le g(x) - 1$ for $x \in A\setminus B$. Using the fact that $2\eta f^\eta(x) - 1 \le f(x) < 2\eta f^\eta(x) + 2\eta - 1$, we get $f(x) \ge 2\eta g(x) + 2\eta - 1$ for $x \in B$ and $f(x) \le 2\eta g(x) - 1$ for $x \in A\setminus B$. So, taking $\tilde g(x) = 2\eta g(x) + \eta - 1$ as threshold function (which does not depend on $B$), we see that $\mathcal F$ $P_\gamma$-shatters $A$ if $\gamma \le \eta$.

For (b), we deduce from the definition of $f^\eta$ that $|f^\eta(x) - \tilde f^\eta(x)| > (2\eta)^{-1}|f(x) - \tilde f(x)| - 1$, so that, if $\epsilon = 4\eta$, $|f(x) - \tilde f(x)| \ge \epsilon$ implies $|f^\eta(x) - \tilde f^\eta(x)| > 1$ or, equivalently, $|f^\eta(x) - \tilde f^\eta(x)| \ge 2$.
One can use this result to evaluate margin bounds for linear classifiers with bounded data. Let $\mathcal R$ be the ball with radius $\Lambda$ in $\mathbb R^d$ and consider the model class containing all functions $f(x) = a_0 + b^Tx$ with $a_0 \in [-\Lambda, \Lambda]$ and $b \in \mathbb R^d$, $|b| \le 1$. Let $A = \{x_1, \ldots, x_N\}$ be a finite subset of $\mathcal R$. Then $\mathcal F$ $P_\gamma$-shatters $A$ if and only if there exist $g_1, \ldots, g_N \in \mathbb R$ such that, for any sequence $\xi = (\xi_1, \ldots, \xi_N) \in \{-1, 1\}^N$, there exist $a_0^\xi \in [-\Lambda, \Lambda]$ and $b^\xi \in \mathbb R^d$, $|b^\xi| \le 1$, with $\xi_k\big(a_0^\xi + (b^\xi)^Tx_k - g_k\big) \ge \gamma$ for $k = 1, \ldots, N$. Summing over $k$, we find that
$$N\gamma + \sum_{k=1}^Ng_k\xi_k \le a_0^\xi\sum_{k=1}^N\xi_k + (b^\xi)^T\sum_{k=1}^N\xi_kx_k.$$
Now take $\xi$ uniformly distributed on $\{-1, 1\}^N$, so that $E(\sum_kg_k\xi_k) = 0$. Since
$$E\left(2\Lambda^2\Big(\sum_{k=1}^N\xi_k\Big)^2 + 2\Big|\sum_{k=1}^N\xi_kx_k\Big|^2\right) = 2N\Lambda^2 + 2\sum_{k=1}^N|x_k|^2 \le 4N\Lambda^2,$$
taking expectations in the previous inequality and applying the Cauchy-Schwarz inequality gives $N\gamma \le 2\Lambda\sqrt N$, i.e., $N \le 4\Lambda^2/\gamma^2$, which bounds the $P_\gamma$-dimension of the class, and this upper bound can then be plugged into equations (23.43) or (23.44) to estimate the generalization error.
Beyond the explicit expression of the upper bound, the important point in the previous argument is that the $P_\gamma$-dimension is bounded independently of the dimension $d$ of $X$ (and the bound therefore also applies in the infinite-dimensional case). This should be compared to what we found for the VC-dimension of separating hyperplanes, which was $d + 1$ (cf. proposition 23.22).
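To make the comparison concrete, the toy computation below (ours, with made-up numbers) evaluates the dimension-free bound $4\Lambda^2/\gamma^2$ obtained from the argument above against the VC-dimension $d + 1$ of affine classifiers.

```python
# Margin-based capacity vs. raw VC dimension: the P_gamma-dimension bound
# 4 * Lambda^2 / gamma^2 derived above does not involve the ambient dimension d.
Lambda, d = 1.0, 10_000
for gamma in [0.5, 0.1, 0.02]:
    print(f"gamma={gamma}: P_gamma-dim <= {4 * Lambda**2 / gamma**2:.0f} "
          f"(VC-dim of hyperplanes in R^{d} is {d + 1})")
```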
Remark 23.38 Note that the upper bound obtained in theorem 23.34 depends on a parameter ($\gamma$) and the result is true for any choice of this parameter. It is tempting at this point to optimize the bound with respect to $\gamma$, but this would be a mistake, since a family of events being likely does not imply that their intersection is likely too. However, with a little work, one can ensure that an intersection of slightly weaker inequalities holds. Indeed, assume that an estimate similar to (23.43) holds, in the form
$$P\big(R_0(\hat f_T) > U_T(\gamma) + t\big) \le C(\gamma)e^{-mt^2/2}$$
or, equivalently,
$$P\left(R_0(\hat f_T) > U_T(\gamma) + \sqrt{t^2 + \frac2m\log C(\gamma)}\right) \le e^{-mt^2/2},$$
where $U_T(\gamma)$ depends on the data and is increasing (as a function of $\gamma$), and $C(\gamma)$ is a decreasing function of $\gamma$. Consider a decreasing sequence $(\gamma_k)$ that converges to 0 (for example, $\gamma_k = L2^{-k}$). Choose also an increasing function $\epsilon(\gamma)$. Then
$$P\left(R_0(\hat f_T) > \min\Big\{U_T(\gamma) + \sqrt{t^2 + \tfrac2m\log C(\gamma)} + 2\epsilon(\gamma) : 0 \le \gamma \le L\Big\}\right) \\
\le P\left(R_0(\hat f_T) > \min\Big\{U_T(\gamma_k) + \sqrt{t^2 + \tfrac2m\log C(\gamma_{k-1})} + \epsilon(\gamma_k) : k \ge 1\Big\}\right).$$
Moreover,
$$P\left(R_0(\hat f_T) > \min\Big\{U_T(\gamma_k) + \sqrt{t^2 + \tfrac2m\log C(\gamma_{k-1})} + \epsilon(\gamma_k) : k \ge 1\Big\}\right) \\
\le \sum_{k=1}^\infty P\left(R_0(\hat f_T) > U_T(\gamma_k) + \sqrt{t^2 + \tfrac2m\log C(\gamma_{k-1})} + \epsilon(\gamma_k)\right) \le \sum_{k=1}^\infty\frac{C(\gamma_k)}{C(\gamma_{k-1})}e^{-m\epsilon(\gamma_k)^2/2 - mt^2/2}.$$
to ensure that
$$P\left(R_0(\hat f_T) > \min\Big\{U_T(\gamma) + \sqrt{t^2 + \tfrac2m\log C(\gamma)} + 2\epsilon(\gamma) : \gamma_0 \le \gamma \le L\Big\}\right) \le C_0e^{-mt^2/2},$$
which yields $C_0 \le L$.
Let $T$ be a training set and let $T_1$ and $T_2$ form a fixed partition of the training set into two equal parts. Assume, for simplicity, that $N$ is even and that the method for selecting the two parts is deterministic, e.g., place the first half of $T$ in $T_1$ and the second one in $T_2$. Following Bartlett et al. [20], one can then define the maximum discrepancy on $T$ by
$$C_T = \sup_{f\in\mathcal F}\big(\mathcal E_{T_1}(f) - \mathcal E_{T_2}(f)\big).$$
This discrepancy measures the extent to which estimators may differ when trained on two independent half-sized training sets. For a binary classification problem, the estimation of $C_T$ can be made with the same algorithm as the initial classifier since, up to an affine transformation, $\mathcal E_{T_1}(f) - \mathcal E_{T_2}(f)$ is exactly the classification error on the training set in which the class labels are flipped for the data in $T_2$.
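The following sketch implements this label-flipping trick for a generic binary classifier (we flip the labels of $T_1$, which directly targets $\sup_f(\mathcal E_{T_1}(f) - \mathcal E_{T_2}(f))$; flipping $T_2$ is equivalent for classes symmetric under label exchange). The data, the helper names, and the use of scikit-learn's `LogisticRegression` are our choices, not the text's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def max_discrepancy(X, y, fit_classifier):
    """Estimate C_T = sup_f (E_T1(f) - E_T2(f)) for labels y in {0, 1}, where
    T1 is the first half of the sample and T2 the second. Training on a copy
    of T with the T1 labels flipped makes the learning algorithm maximize
    E_T1(f) - E_T2(f), up to an affine transformation of the training error."""
    n = len(y) // 2
    y_flip = y.copy()
    y_flip[:n] = 1 - y_flip[:n]                 # flip labels on T1
    f = fit_classifier(X, y_flip)
    pred = f.predict(X)
    err1 = np.mean(pred[:n] != y[:n])           # E_T1(f), original labels
    err2 = np.mean(pred[n:] != y[n:])           # E_T2(f)
    return err1 - err2

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
print(max_discrepancy(X, y, lambda X, y: LogisticRegression().fit(X, y)))
```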
Following [20], we now discuss concentration bounds that rely on $C_T$, and start with the following lemma.

Lemma 23.39 Introduce the function $\Phi(T) = \sup_{f\in\mathcal F}\big(R(f) - \mathcal E_T(f)\big) - C_T$. Then $E(\Phi(T)) \le 0$.
Proof Note that, if $T'$ is a training set independent of $T$ with identical distribution, then, for any $f_0 \in \mathcal F$,
so that
$$E\Big(\sup_{f\in\mathcal F}\big(R(f) - \mathcal E_T(f)\big)\Big) \le E\Big(\sup_{f\in\mathcal F}\big(\mathcal E_{T'}(f) - \mathcal E_T(f)\big)\Big).$$
Now, for a given $f$, we have $\mathcal E_T(f) = \frac12(\mathcal E_{T_1}(f) + \mathcal E_{T_2}(f))$ and, splitting $T'$ the same way, $\mathcal E_{T'}(f) = \frac12(\mathcal E_{T_1'}(f) + \mathcal E_{T_2'}(f))$,
where we have used the fact that both $(T_1', T_1)$ and $(T_2', T_2)$ form random training sets with the same distribution as $(T_1, T_2)$.
One can then use McDiarmid’s inequality (theorem 23.13) after noticing that, letting
zk = (xk , yk ) for k = 1, . . . , N ,
3
max Φ(z1 , . . . , zN ) − Φ(z1 , . . . , zk−1 , zk0 , zk+1 , . . . , zN ) ≤
z1 ,...,zN ,zk0 N
yielding
2N 2
P(sup(R(f ) − ET (f )) ≥ CT + ) ≤ e− 9 .
f ∈F
The mean Rademacher complexity is then the expectation of this quantity over the training-set distribution. The Rademacher complexity can be computed with a (costly) Monte Carlo simulation, in which the best estimator is computed with randomly flipped labels corresponding to the values of $k$ such that $\xi_k = -1$.
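A minimal version of this Monte Carlo estimate, for the simplified case of a finite family of predictors whose losses have been precomputed (our sketch; with an infinite class each draw would require re-fitting on sign-flipped labels, which is what makes the simulation costly):

```python
import numpy as np

def rademacher_mc(losses, n_draws=1_000, seed=0):
    """Monte Carlo estimate of rad(T) = E_xi sup_f (1/N) sum_k xi_k r(y_k, f(x_k))
    for a finite family: losses[i, k] = r(y_k, f_i(x_k))."""
    rng = np.random.default_rng(seed)
    n_functions, N = losses.shape
    xi = rng.choice([-1.0, 1.0], size=(n_draws, N))
    return float(np.mean(np.max(xi @ losses.T / N, axis=1)))

rng = np.random.default_rng(5)
losses = rng.integers(0, 2, size=(50, 200)).astype(float)   # made-up 0-1 losses
print(rademacher_mc(losses))
```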
Proposition 23.40 Let $\mathcal F$ be a function class such that $D = \text{VC-dim}(\mathcal F) < \infty$. Then
$$\mathrm{rad}(T) \le \frac3{\sqrt N}\sqrt{2D\log(eN/D)}.$$
Proof Using Hoeffding's inequality, one has
$$P\left(\sup_{f\in\mathcal F}\frac1N\sum_{k=1}^N\xi_kr(y_k, f(x_k)) > t\right) \le |\mathcal F(T)|\sup_{f\in\mathcal F}P\left(\frac1N\sum_{k=1}^N\xi_kr(y_k, f(x_k)) > t\right) \le |\mathcal F(T)|e^{-Nt^2/2}.$$
$$P\left(\sup_{f\in\mathcal F}\frac1N\sum_{k=1}^N\xi_kr(y_k, f(x_k)) > t\right) \le 2|\mathcal F(T)|e^{-Nt^2/2},$$
and integrating this tail bound yields
$$\mathrm{rad}(T) \le \frac3{\sqrt N}\sqrt{2D\log(eN/D)}.$$
$$r(y, y') \le \rho(y, y')$$
for all $y \in \mathcal R_Y$, $y' \in \mathbb R$. Some examples are the margin loss $\rho_h^*(y, y') = \mathbf 1_{yy'\le h}$ for $h \ge 0$, or the piecewise-linear function
$$\rho_h(y, y') = \begin{cases}1 & \text{if } yy' \le 0,\\ 1 - yy'/h & \text{if } 0 \le yy' \le h,\\ 0 & \text{if } yy' \ge h.\end{cases}$$
and
$$\mathrm{Rad}_{\mathcal G}(N) = E\big(\mathrm{Rad}_{\mathcal G}(Z_1, \ldots, Z_N)\big).$$
Our previous notation can then be rewritten as $\mathrm{rad}(T) = \mathrm{Rad}_{\mathcal G}(z_1, \ldots, z_N)$, where $z_i = (x_i, y_i)$ and $\mathcal G$ is the space of functions $g : (x, y)\mapsto r(y, f(x))$ for $f \in \mathcal F$. The following theorem is proved in Koltchinskii and Panchenko [110] and Bartlett and Mendelson [19].
Theorem 23.41 Let $\rho$ be a function dominating the risk function $r(y, y') = \mathbf 1_{yy'\le 0}$. Let
and
$$\mathcal E_T^\rho(f) = \frac1N\sum_{k=1}^N\rho(y_k, f(x_k)).$$
Then
$$P\left(\sup_{f\in\mathcal F}\big(R(f) - \mathcal E_T^\rho(f)\big) \ge 2\mathrm{Rad}_{\mathcal G^\rho}(N) + t\right) \le e^{-Nt^2/2},$$
where
$$\Phi(Z_1, \ldots, Z_N) = \sup_{g\in\mathcal G^\rho}\Big(E(g(Z)) - \frac1N\sum_{k=1}^Ng(Z_k)\Big).$$
Now we have
$$E\big(\Phi(Z_1, \ldots, Z_N)\big) = E\left(\sup_{g\in\mathcal G^\rho}\frac1N\sum_{k=1}^NE\big(g(Z_k') \mid Z_1, \ldots, Z_N\big) - \frac1N\sum_{k=1}^Ng(Z_k)\right) \\
\le E\left(\sup_{g\in\mathcal G^\rho}\frac1N\sum_{k=1}^Ng(Z_k') - \frac1N\sum_{k=1}^Ng(Z_k)\right) \\
\le E\left(E\left(\sup_{g\in\mathcal G^\rho}\frac1N\sum_{k=1}^N\xi_k\big(g(Z_k') - g(Z_k)\big)\,\Big|\,Z, Z'\right)\right) \\
\le 2E\left(E\left(\sup_{g\in\mathcal G^\rho}\frac1N\sum_{k=1}^N\xi_kg(Z_k)\,\Big|\,Z\right)\right) \le 2\mathrm{Rad}_{\mathcal G^\rho}(N).$$
For $k \in \{1, \ldots, N\}$ and a training set $T = (x_1, y_1, \ldots, x_N, y_N)$, we let $T^{(k)}$ be the training set with the sample $(x_k, y_k)$ removed. One says that the predictor $(T\mapsto\hat f_T)$ has uniform stability $\beta_N$ for the loss function $r$ if, for all $T$ of size $N$, all $k \in \{1, \ldots, N\}$, and all $x, y$,
$$\big|r(\hat f_T(x), y) - r(\hat f_{T^{(k)}}(x), y)\big| \le \beta_N. \tag{23.45}$$
Theorem 23.42 (Bousquet and Elisseeff [38]) Assume that $\hat f_T$ has uniform stability $\beta_N$ for training sets of size $N$ and that the loss function $r(Y, f(X))$ is almost surely bounded by $M > 0$. Then, for all $\epsilon > 2\beta_N$, one has
$$P\Big(R(\hat f_T) \ge \mathcal E_T(\hat f_T) + \epsilon\Big) \le e^{-\frac{2N(\epsilon - 2\beta_N)^2}{(4N\beta_N + M)^2}}.$$
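The theorem is only informative when $\beta_N$ decays faster than $1/\sqrt N$. The helper below (ours) evaluates the right-hand side for the typical rate $\beta_N = \kappa/N$ of strongly regularized methods, with $\kappa$ a made-up constant.

```python
import numpy as np

def stability_tail(N, beta_N, M, eps):
    """Right-hand side of theorem 23.42, valid for eps > 2 * beta_N:
    exp(-2N (eps - 2 beta_N)^2 / (4 N beta_N + M)^2)."""
    assert eps > 2 * beta_N
    return np.exp(-2 * N * (eps - 2 * beta_N) ** 2 / (4 * N * beta_N + M) ** 2)

kappa, M, eps = 1.0, 1.0, 0.05
for N in [10**3, 10**4, 10**5]:
    print(f"N={N:>6}: bound = {stability_tail(N, kappa / N, M, eps):.3e}")
```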
Proof Let $Z_i = (X_i, Y_i)$ and $F(Z_1, \ldots, Z_N) = R(\hat f_T) - \mathcal E_T(\hat f_T)$. We want to apply McDiarmid's inequality (theorem 23.13) to $F$, and therefore estimate
$$\delta_k(F) = \max_{z_1,\ldots,z_N,z_k'}\Big(F(z_1, \ldots, z_N) - F(z_1, \ldots, z_{k-1}, z_k', z_{k+1}, \ldots, z_N)\Big).$$
Introduce a training set $\tilde T_k$ in which the variable $z_k$ is replaced by $z_k' = (x_k', y_k')$. Because $\tilde T_k^{(k)} = T^{(k)}$, we have
Similarly, we have
$$\big|\mathcal E_T(\hat f_T) - \mathcal E_{\tilde T_k}(\hat f_{\tilde T_k})\big| \le \frac1N\sum_{l\ne k}\big|r(y_l, \hat f_T(x_l)) - r(y_l, \hat f_{\tilde T_k}(x_l))\big| + \frac1N\big|r(y_k, \hat f_T(x_k)) - r(y_k', \hat f_{\tilde T_k}(x_k'))\big| \\
\le \frac1N\sum_{l\ne k}\big|r(y_l, \hat f_T(x_l)) - r(y_l, \hat f_{T^{(k)}}(x_l))\big| + \frac1N\sum_{l\ne k}\big|r(y_l, \hat f_{\tilde T_k}(x_l)) - r(y_l, \hat f_{\tilde T_k^{(k)}}(x_l))\big| + \frac MN \\
\le 2\beta_N + \frac MN.$$
Collecting these results, we find that $\delta_k(F) \le 4\beta_N + \frac MN$, so that, by theorem 23.13,
$$P\Big(R(\hat f_T) \ge \mathcal E_T(\hat f_T) + E\big(R(\hat f_T) - \mathcal E_T(\hat f_T)\big) + \epsilon\Big) \le \exp\left(-\frac{2N\epsilon^2}{(4N\beta_N + M)^2}\right).$$
We therefore obtain
$$P\Big(R(\hat f_T) \ge \mathcal E_T(\hat f_T) + \epsilon + 2\beta_N\Big) \le \exp\left(-\frac{2N\epsilon^2}{(4N\beta_N + M)^2}\right),$$
as required.
Our final discussion of concentration bounds for the empirical error uses a slightly different paradigm from that discussed so far. The main difference is that, instead of computing a single predictor $\hat f_T$ from a training set $T$, the learning algorithm returns a random variable with values in $\mathcal F$ or, equivalently, a probability distribution on $\mathcal F$ (therefore assuming that this space is measurable), which we will denote $\hat\mu_T$. The training-set error is now defined by
$$\bar{\mathcal E}_T(\mu) = \int\mathcal E_T(f)\,d\mu(f),$$
and we similarly let $\bar R(\mu) = \int R(f)\,d\mu(f)$. Our goal is to obtain upper bounds on $\bar R(\hat\mu_T) - \bar{\mathcal E}_T(\hat\mu_T)$ that hold with high probability. In this framework, we have the following result, in which $\mathcal Q$ denotes the space of probability distributions on $\mathcal F$.
Assume that the loss function $r$ takes its values in $[0, 1]$. Recall that $KL(\mu\|\pi)$ is the Kullback-Leibler divergence from $\mu$ to $\pi$, defined by
$$KL(\mu\|\pi) = \int_{\mathcal F}\log(\varphi(f))\,\varphi(f)\,d\pi(f)$$
if $\mu$ has a density $\varphi$ with respect to $\pi$, and $+\infty$ otherwise. Then, the following theorem holds.
Theorem 23.43 (McAllester [129]) With the notation above, for any fixed probability distribution $\pi \in \mathcal Q$,
$$P\left(\sup_{\mu\in\mathcal Q}\left(\bar R(\mu) - \bar{\mathcal E}_T(\mu) - \sqrt{t + \frac{KL(\mu\|\pi)}{2N}}\right) > 0\right) \le 2Ne^{-2Nt}. \tag{23.46}$$
Taking $t = \log(2N/\delta)/2N$, the theorem is equivalent to the statement that, with probability $1-\delta$, one has, for all $\mu \in \mathcal Q$,
$$\bar R(\mu) - \bar{\mathcal E}_T(\mu) \le \sqrt{\frac{\log(2N/\delta) + KL(\mu\|\pi)}{2N}}. \tag{23.47}$$
Proof We first show that, for any probability distributions $\pi, \mu$ on $\mathcal F$ and any function $H$ on $\mathcal F$,
$$\int_{\mathcal F}H(f)\,d\mu - \log\int_{\mathcal F}e^{H(f)}\,d\pi \le KL(\mu\|\pi).$$
Indeed, assume that $\mu$ has a density $\varphi$ with respect to $\pi$ (otherwise the upper bound is infinite) and let
$$\varphi_H = \frac{e^H}{\int_{\mathcal F}e^{H(f)}\,d\pi}.$$
Then
$$KL(\mu\|\pi) - \int_{\mathcal F}H(f)\,d\mu + \log\int_{\mathcal F}e^{H(f)}\,d\pi = \int_{\mathcal F}\varphi\log\varphi\,d\pi - \int_{\mathcal F}\varphi\log\varphi_H\,d\pi = \int_{\mathcal F}\log\left(\frac\varphi{\varphi_H}\right)\frac\varphi{\varphi_H}\,\varphi_H\,d\pi = KL(\mu\|\varphi_H\pi) \ge 0,$$
which proves the result (and also shows that equality can only hold when $\varphi = \varphi_H$, $\pi$-almost surely).
Let $\chi(u) = \max(u, 0)^2$. We can use this inequality to show that, for any probability distribution $\mu \in \mathcal Q$ and $\lambda > 0$,
$$\lambda\chi\big(\bar R(\mu) - \bar{\mathcal E}_T(\mu)\big) \le \lambda\int_{\mathcal F}\chi\big(R(f) - \mathcal E_T(f)\big)\,d\mu(f) \le KL(\mu\|\pi) + \log\int_{\mathcal F}e^{\lambda\chi(R(f) - \mathcal E_T(f))}\,d\pi,$$
where we have applied Jensen's inequality to the convex function $\chi$. This yields
$$e^{\lambda\chi(\bar R(\mu) - \bar{\mathcal E}_T(\mu))} \le e^{KL(\mu\|\pi)}\int_{\mathcal F}e^{\lambda\chi(R(f) - \mathcal E_T(f))}\,d\pi.$$
Taking $\lambda = 2N$ yields
which implies
$$P\left(\sup_{\mu\in\mathcal Q}\left(\bar R(\mu) - \bar{\mathcal E}_T(\mu) - \sqrt{t + \frac{KL(\mu\|\pi)}{2N}}\right) > 0\right) \le 2Ne^{-2Nt}.$$
Remark 23.44 Note that the proof, which follows that given in Audibert and Bousquet [15], provides a family of inequalities obtained by taking $\lambda = 2N/c$ in the final step, with $c > 1$. In this case,
$$1 + \lambda\frac{e^{\lambda - 2N} - 1}{\lambda - 2N} \le 1 + \frac\lambda{2N - \lambda} = \frac c{c-1},$$
and one gets
$$P\left(\sup_{\mu\in\mathcal Q}\left(\bar R(\mu) - \bar{\mathcal E}_T(\mu) - \sqrt{t + \frac{c\,KL(\mu\|\pi)}{2N}}\right) > 0\right) \le \frac c{c-1}\,2Ne^{-2Nt}.$$
Remark 23.45 One special case of theorem 23.43 occurs when $\pi$ is a discrete probability measure supported by a subset $\mathcal F_0$ of $\mathcal F$ and $\mu$ corresponds to a deterministic predictor optimized over $\mathcal F_0$, and is therefore a Dirac measure $\delta_f$ for some element $f \in \mathcal F_0$. Because $\delta_f$ has density $\varphi(g) = 1/\pi(g)$ if $g = f$ and 0 otherwise with respect to $\pi$, we have $KL(\delta_f\|\pi) = -\log\pi(f)$, and theorem 23.43 implies that, with probability larger than $1-\delta$,
$$R(f) - \mathcal E_T(f) \le \sqrt{\frac{\log(2N/\delta) - \log\pi(f)}{2N}}.$$
The term $\log 2N$ is however superfluous in this simple context, because one can write, for any $t > 0$,
$$P\left(\sup_{f\in\mathcal F_0}\left(R(f) - \mathcal E_T(f) - \sqrt{t - \frac{\log\pi(f)}{2N}}\right) \ge 0\right) \le \sum_{f\in\mathcal F_0}e^{-2N\left(t - \frac{\log\pi(f)}{2N}\right)} = e^{-2Nt}.$$
23.7 Application to model selection

We now describe how the previous results can, in principle, be applied to model selection [20]. We assume that we have a countable family of nested model classes $(\mathcal F^{(j)}, j \in \mathcal J)$. Denote, as usual, by $\mathcal E_T(f)$ the empirical prediction error on the training set for a given function $f$. We will denote by $\hat f_T^{(j)}$ a minimizer of the in-sample error over $\mathcal F^{(j)}$, so that
$$\mathcal E_T\big(\hat f_T^{(j)}\big) = \min_{f\in\mathcal F^{(j)}}\mathcal E_T(f).$$
In the model selection problem, one would like to determine the best model class, $j = j(T)$, such that the prediction error $R(\hat f_T^{(j)})$ is minimal or, more realistically, determine $j^*$ such that $R(\hat f_T^{(j^*)})$ is not too far from the optimal one.
$$P\Big(R\big(\hat f_T^{(j)}\big) \ge \Gamma_T^{(j)} + t\Big) \le c_je^{-mt^2} \tag{23.48}$$
for some known constants $c_j$ and $m$. For example, the VC-dimension bounds have $\Gamma_T^{(j)} = \mathcal E_T(\hat f_T^{(j)})$, $c_j = 2S_{\mathcal F^{(j)}}(2N)$ and $m = N/8$.
Given such inequalities, one can develop a model selection strategy that relies on a priori weights, provided by a sequence $\pi_j$ of positive numbers such that $\sum_{j\in\mathcal J}\pi_j = 1$. Define
$$\tilde\pi_j = \frac{\pi_j/c_j}{\sum_{j'=1}^\infty\pi_{j'}/c_{j'}}$$
and let
$$C_T^{(j)} = \Gamma_T^{(j)} - \mathcal E_T\big(\hat f_T^{(j)}\big) + \sqrt{-\frac{\log\tilde\pi_j}m},$$
yielding a penalty-based method that requires the minimization of
$$\tilde{\mathcal E}_T(f) = \Big(\mathcal E_T(f) - \mathcal E_T\big(\hat f_T^{(j)}\big)\Big) + \Gamma_T^{(j)} + \sqrt{-\frac{\log\tilde\pi_j}m}.$$
The selected model class is then $\mathcal F^{(j^*)}$, where $j^*$ minimizes $\Gamma_T^{(j)} + \sqrt{-\frac{\log\tilde\pi_j}{2m}}$.
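A direct implementation of this selection rule is straightforward once the constants $(c_j, \pi_j, m)$ and the bounds $\Gamma_T^{(j)}$ are available. The numbers below are made up for illustration (training errors decrease with model size while the complexity constants $c_j$ grow); all names are ours.

```python
import numpy as np

def select_model(gammas, pis, cs, m):
    """Pick j* minimizing Gamma_T^(j) + sqrt(-log(pi_tilde_j) / (2m)), with
    pi_tilde_j = (pi_j / c_j) / sum_j' (pi_j' / c_j') as defined above."""
    w = pis / cs
    pi_tilde = w / w.sum()
    scores = gammas + np.sqrt(-np.log(pi_tilde) / (2 * m))
    return int(np.argmin(scores)), scores

N = 10_000
gammas = np.array([0.30, 0.20, 0.15, 0.14, 0.138])   # Gamma_T^(j), made up
cs = np.array([1e2, 1e4, 1e6, 1e9, 1e12])            # e.g., c_j = 2 S_F(2N)
pis = np.full(5, 0.2)                                # uniform a priori weights
j_star, scores = select_model(gammas, pis, cs, m=N / 8)
print(j_star, np.round(scores, 4))
```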
The same proof as that provided at the end of section 23.6.5 justifies this procedure. Indeed, for $t > 0$,
$$P\Big(R(\hat f_T) - \tilde{\mathcal E}_T(\hat f_T) \ge t\Big) \le P\Big(\max_j\big(R(\hat f_T^{(j)}) - \tilde{\mathcal E}_T(\hat f_T^{(j)})\big) \ge t\Big) \\
\le P\left(\max_j\Big(R\big(\hat f_T^{(j)}\big) - R_j^* - \sqrt{-\frac{\log\tilde\pi_j}m}\Big) \ge t\right) \le \tilde c\sum_j\pi_je^{-mt^2} \le \tilde c\,e^{-mt^2},$$
with $\tilde c = \sum_{j=1}^\infty\pi_j/c_j$.
Bibliography
[2] Hirotugu Akaike. Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory. Akadémiai Kiadó, 1973.
[3] Stéphanie Allassonniere and Laurent Younes. A stochastic algorithm for prob-
abilistic independent component analysis. The Annals of Applied Statistics, 6
(1):125–160, 2012.
[4] Noga Alon, Shai Ben-David, Nicolo Cesa-Bianchi, and David Haussler. Scale-
sensitive dimensions, uniform convergence, and learnability. Journal of the
ACM (JACM), 44(4):615–631, 1997.
[5] Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for
vector-valued functions: A review. Foundations and Trends in Machine Learn-
ing, 4(3):195–266, 2012. ISSN 1935-8237. doi: 10.1561/2200000036.
[6] Yali Amit. Convergence properties of the gibbs sampler for perturbations of
gaussians. The Annals of Statistics, 24(1):122–140, 1996.
[7] Yali Amit and Donald Geman. Shape quantization and recognition with ran-
domized trees. Neural computation, 9(7):1545–1588, 1997.
[8] Alano Ancona, Donald Geman, and Nobuyuki Ikeda. Random fields and inverse problems in imaging. In Ecole d'été de Probabilités de Saint-Flour XVIII-1988, pages 115-193. Springer, 1990.
[9] Brian D.O. Anderson. Reverse-time diffusion equation models. Stochastic Pro-
cesses and their Applications, 12(3):313–326, 1982.
[10] Martin Anthony and Peter L. Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
[11] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein Genera-
tive Adversarial Networks. In Proceedings of the 34th International Conference
on Machine Learning - Volume 70, ICML’17, pages 214–223. JMLR.org, 2017.
event-place: Sydney, NSW, Australia.
[12] Nachman Aronszajn. Theory of Reproducing Kernels. Trans. Am. Math. Soc.,
68:337–404, 1950.
[13] Krishna B. Athreya, Hani Doss, and Jayaram Sethuraman. On the convergence
of the markov chain simulation method. The Annals of Statistics, 24(1):69–100,
1996.
[14] Hagai Attias. A Variational Baysian Framework for Graphical Models. In
NIPS, volume 12. Citeseer, 1999.
[15] Jean-Yves Audibert and Olivier Bousquet. Combining pac-bayesian and
generic chaining bounds. Journal of Machine Learning Research, 8(Apr):863–
889, 2007.
[16] Adrian Barbu and Song-Chun Zhu. Generalizing swendsen-wang to sampling
arbitrary posterior probabilities. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 27(8):1239–1253, 2005.
[17] Viorel Barbu. Differential equations. Springer, 2016.
[18] Peter Bartlett and John Shawe-Taylor. Generalization performance of support
vector machines and other pattern classifiers. Advances in Kernel methods—
support vector learning, pages 43–54, 1999.
[19] Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexi-
ties: Risk bounds and structural results. Journal of Machine Learning Research,
3(Nov):463–482, 2002.
[20] Peter L. Bartlett, Stéphane Boucheron, and Gábor Lugosi. Model selection and
error estimation. Machine Learning, 48:85–113, 2002.
[21] Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian.
Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear
neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019.
[22] Amir Beck. Introduction to nonlinear optimization: Theory, algorithms, and ap-
plications with MATLAB. SIAM, 2014.
[23] Michel Benaı̈m. Dynamics of stochastic approximation algorithms. In Semi-
naire de probabilites XXXIII, pages 1–68. Springer, 1999.
[24] George Bennett. Probability inequalities for the sum of independent random
variables. Journal of the American Statistical Association, 57(297):33–45, 1962.
[25] Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive algorithms
and stochastic approximations, volume 22. Springer Science & Business Media,
2012.
[28] Rajendra Bhatia. Matrix analysis, volume 169. Springer Science & Business
Media, 2013.
[29] Peter J. Bickel and Kjell A. Doksum. Mathematical statistics: basic ideas and
selected topics, volume I, volume 117. CRC Press, 2015.
[30] Peter J. Bickel, Ya'acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37(4):1705-1732, 2009.
[31] Patrick Billingsley. Convergence of probability measures. John Wiley & Sons,
2013.
[32] Salomon Bochner. Vorlesungen über fouriersche integrale. Bull Amer Math
Soc, 39:184, 1933.
[34] Joseph-Frédéric Bonnans, Jean Charles Gilbert, Claude Lemaréchal, and Clau-
dia A. Sagastizábal. Numerical optimization: theoretical and practical aspects.
Springer Science & Business Media, 2006.
[35] Ingwer Borg and Patrick J.F. Groenen. Modern multidimensional scaling: Theory
and applications. Springer Science & Business Media, 2005.
[36] Jonathan Borwein and Adrian S. Lewis. Convex analysis and nonlinear opti-
mization: theory and examples. Springer Science & Business Media, 2010.
[37] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. A sharp concentra-
tion inequality with applications. Random Structures & Algorithms, 16(3):277–
292, 2000.
[38] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of
machine learning research, 2(Mar):499–526, 2002.
[39] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein.
Distributed optimization and statistical learning via the alternating direction
method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–
122, 2011.
[42] Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A. Olshen. Clas-
sification and regression trees. CRC press, 1984.
[43] Dmitri Burago, Iu D. Burago, Yuri Burago, Sergei A. Ivanov, and Sergei Ivanov.
A course in metric geometry, volume 33. American Mathematical Soc., 2001.
[44] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value
thresholding algorithm for matrix completion. SIAM Journal on optimization,
20(4):1956–1982, 2010.
[45] Tadeusz Caliński and Jerzy Harabasz. A dendrite method for cluster analysis.
Communications in Statistics-theory and Methods, 3(1):1–27, 1974.
[46] Emmanuel J. Candes and Terence Tao. Decoding by linear programming. IEEE
Trans. information theory, 51(12):4203–4215, 2005.
[47] Emmanuel J. Candes and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, 35, 2007.
[48] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal
component analysis? Journal of the ACM (JACM), 58(3):11, 2011.
[49] John Canny. Gap: a factor model for discrete data. In Proceedings of the 27th
annual international ACM SIGIR conference on Research and development in in-
formation retrieval, pages 122–129, 2004.
[51] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duve-
naud. Neural ordinary differential equations. In S. Bengio, H. Wallach,
H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances
in Neural Information Processing Systems 31, pages 6571–6583. Curran Asso-
ciates, Inc., 2018.
[52] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system.
In Proceedings of the 22nd acm sigkdd international conference on knowledge dis-
covery and data mining, pages 785–794, 2016.
[53] Pierre Comon. Independent component analysis, a new concept? Signal pro-
cessing, 36(3):287–314, 1994.
[54] Thomas M. Cover and Joy A. Thomas. Elements of information theory. John
Wiley & Sons, 2012.
[55] Robert G. Cowell, A. Philip Dawid, Steffen L. Lauritzen, and David J. Spiegel-
halter. Probabilistic networks and expert systems. Springer, 2007.
[57] George Darmois. Analyse générale des liaisons stochastiques: etude partic-
ulière de l’analyse factorielle linéaire. Revue de l’Institut international de statis-
tique, pages 2–8, 1953.
[58] Bernard Delyon, Marc Lavielle, and Eric Moulines. Convergence of a stochas-
tic approximation version of the em algorithm. Annals of statistics, pages 94–
128, 1999.
[59] Amir Dembo and Ofer Zeitouni. Large deviations techniques and applica-
tions. 1998. Applications of Mathematics, 38, 2011.
[60] Luc Devroye, Lázló Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern
Recognition. Springer, 1996.
[61] Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation dis-
tance between high-dimensional gaussians with the same mean. arXiv preprint
arXiv:1810.08693, 2018.
[63] Edsger W. Dijkstra. A note on two problems in connexion with graphs. Nu-
merische mathematik, 1(1):269–271, 1959. ISSN 0029-599X.
[64] Petros Drineas, Alan Frieze, Ravi Kannan, Santosh Vempala, and Vish-
wanathan Vinay. Clustering large graphs via the singular value decomposi-
tion. Machine learning, 56:9–33, 2004.
[65] Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth.
Hybrid monte carlo. Physics letters B, 195(2):216–222, 1987.
[66] Richard M. Dudley. Real analysis and probability. Chapman and Hall/CRC,
2018.
[67] Marie Duflo. Random iterative models, volume 34. Springer Science & Business
Media, 2013.
[70] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and
Andrew Zisserman. The pascal visual object classes (voc) challenge. Interna-
tional journal of computer vision, 88(2):303–338, 2010.
[71] James A. Fill. An interruptible algorithm for perfect sampling via Markov
chains. The Annals of Applied Probability, 8(1):131–162, 1998.
[72] P. Thomas Fletcher and Sarang Joshi. Principal geodesic analysis on symmetric
spaces: Statistics of diffusion tensors. In Computer vision and mathematical
methods in medical and biomedical image analysis, pages 87–98. Springer, 2004.
[74] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic re-
gression: a statistical view of boosting (with discussion and a rejoinder by the
authors). The annals of statistics, 28(2):337–407, 2000.
[76] Dan Geiger and Judea Pearl. On the logic of causal models. In Machine intel-
ligence and pattern recognition, volume 9, pages 3–14. Elsevier, 1990.
[77] Dan Geiger, Thomas Verma, and Judea Pearl. Identifying independence in
bayesian networks. Networks, 20(5):507–534, August 1990. ISSN 00283045.
doi: 10.1002/net.3230200504.
[79] Stuart Geman and Donald Geman. Stochastic relaxation, gibbs distributions,
and the bayesian restoration of images. IEEE Transactions on pattern analysis
and machine intelligence, (6):721–741, 1984.
[81] M. Gondran and M. Minoux. Graphs and algorithms. John Wiley & Sons, 1983.
[82] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-
Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adver-
sarial nets. In Advances in neural information processing systems, pages 2672–
2680, 2014.
[84] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and
Aaron C. Courville. Improved training of Wasserstein GANs. In Advances
in neural information processing systems, pages 5767–5777, 2017.
[85] Madan M. Gupta and J. Qi. Theory of T-norms and fuzzy inference methods.
Fuzzy Sets and Systems, 40(3):431–450, April 1991. ISSN 0165-0114.
[87] Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The elements of
statistical learning. Springer, 2003.
[88] W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97-109, 1970.
[89] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual
learning for image recognition. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pages 770–778. IEEE, 2016.
[90] Geoffrey E. Hinton and Sam Roweis. Stochastic neighbor embedding. Ad-
vances in neural information processing systems, 15:857–864, 2002.
[91] Leslie M. Hocking. Optimal Control: An Introduction to the Theory with Appli-
cations. Oxford University Press, 1991.
[92] Wassily Hoeffding. Probability inequalities for sums of bounded random vari-
ables. In The Collected Works of Wassily Hoeffding, pages 409–426. Springer,
1994.
[93] Roger A. Horn and Charles R. Johnson. Matrix analysis. Cambridge university
press, 2012.
[96] Nobuyuki Ikeda and Shinzo Watanabe. Stochastic differential equations and
diffusion processes. Elsevier, 1981.
[97] Tommi Sakari Jaakkola. Variational methods for inference and estimation in
graphical models. PhD Thesis, Massachusetts Institute of Technology, 1997.
[98] Vojtech Jarnik. O jistem problemu minimalnim (about a certain minimal prob-
lem). Prace Moravske Prirodovedecke Spolecnosti, 6:57–63, 1930.
[99] Finn Jensen and Frank Jensen. Optimal junction trees. In Proceedings of the
Tenth Conference on Uncertainty in Artificial Intelligence, pages 360–366, 1994.
[103] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimiza-
tion. arXiv preprint arXiv:1412.6980, 2014.
[104] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations, 2014.
[107] Peter E. Kloeden and Eckhard Platen. Numerical solutions of stochastic differen-
tial equations. Springer, 1992.
[108] Ivan Kobyzev, Simon J.D. Prince, and Marcus A. Brubaker. Normalizing flows:
An introduction and review of current methods. IEEE transactions on pattern
analysis and machine intelligence, 43(11):3964–3979, 2020.
[109] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and
techniques. The MIT Press, 2009.
[111] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classifica-
tion with deep convolutional neural networks. Communications of the ACM,
60(6):84–90, 2017.
[112] Wojtek J. Krzanowski and Y.T. Lai. A criterion for determining the number of
groups in a data set using sum-of-squares clustering. Biometrics, pages 23–34,
1988.
[113] Estelle Kuhn and Marc Lavielle. Coupling a stochastic approximation version
of EM with an MCMC procedure. ESAIM: Probability and Statistics, 8:115–131,
2004. Publisher: EDP Sciences.
[114] Harold Kushner and G. George Yin. Stochastic approximation and recursive
algorithms and applications, volume 35. Springer Science & Business Media,
2003.
[115] Steffen L Lauritzen. Graphical models, volume 17. Clarendon Press, 1996.
[116] Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech,
and time series. The handbook of brain theory and neural networks, 3361(10):
1995, 1995.
[117] Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E.
Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied
to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
[118] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperime-
try and processes. Springer Science & Business Media, 1991.
[119] Erich L. Lehmann and George Casella. Theory of point estimation. Springer
Science & Business Media, 2006.
[120] Benedict Leimkuhler and Sebastian Reich. Simulating Hamiltonian Dynamics.
Cambridge Monographs on Applied and Computational Mathematics. Cam-
bridge University Press, 2005.
[121] Lennart Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions
on Automatic Control, 22(4):551–575, August 1977. ISSN 1558-2523. Confer-
ence Name: IEEE Transactions on Automatic Control.
[122] Stuart Lloyd. Least squares quantization in pcm. IEEE transactions on infor-
mation theory, 28(2):129–137, 1982.
[123] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE.
Journal of machine learning research, 9(Nov):2579–2605, 2008.
[124] Jack Macki and Aaron Strauss. Introduction to Optimal Control Theory.
Springer Science & Business Media, 2012.
[125] James MacQueen. Some methods for classification and analysis of multivari-
ate observations. In Proceedings of the fifth Berkeley symposium on mathematical
statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.
[126] Adam a Margolin, Ilya Nemenman, Katia Basso, Chris Wiggins, Gustavo
Stolovitzky, Riccardo Dalla Favera, and Andrea Califano. ARACNE: an al-
gorithm for the reconstruction of gene regulatory networks in a mammalian
cellular context. BMC bioinformatics, 7 Suppl 1:S7, January 2006. ISSN 1471-
2105. doi: 10.1186/1471-2105-7-S1-S7.
[127] Enzo Marinari and Giorgio Parisi. Simulated tempering: a new monte carlo
scheme. Europhysics letters, 19(6):451, 1992.
[130] James A. McHugh. Algorithmic graph theory. New Jersey: Prentice-Hall Inc,
1990.
[131] Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426, 2020.
[135] Sean P. Meyn and Richard L. Tweedie. Stability of Markovian processes II: continuous-time processes and sampled chains. Advances in Applied Probability, 25(3):487–517, 1993.
[136] Sean P. Meyn and Richard L. Tweedie. Stability of Markovian processes III: Foster–Lyapunov criteria for continuous-time processes. Advances in Applied Probability, 25(3):518–548, 1993.
[137] Sean P. Meyn and Richard L. Tweedie. Markov chains and stochastic stability.
Springer Science & Business Media, 2012.
[138] Leon Mirsky. A trace inequality of John von Neumann. Monatshefte für Mathematik, 79(4):303–306, 1975.
[139] Michel Métivier and Pierre Priouret. Théorèmes de convergence presque sûre pour une classe d'algorithmes stochastiques à pas décroissant. Probability Theory and Related Fields, 74(3):403–428, 1987.
[142] Radford M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.
[144] Radford M. Neal and Geoffrey E. Hinton. A view of the EM algorithm that jus-
tifies incremental, sparse, and other variants. In Learning in graphical models,
pages 355–368. Springer, 1998.
[146] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[147] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, 2006.
[148] Esa Nummelin. General irreducible Markov chains and non-negative operators.
Number 83. Cambridge University Press, 2004.
[149] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mo-
hamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic
modeling and inference. Journal of Machine Learning Research, 22(57):1–64,
2021.
[150] Panos M. Pardalos and Jue Xue. The maximum clique problem. Journal of Global Optimization, 4(3):301–328, 1994.
[153] Jiming Peng and Yu Wei. Approximating k-means-type clustering via semidef-
inite programming. SIAM journal on optimization, 18(1):186–205, 2007.
[154] Jiming Peng and Yu Xia. A new theoretical framework for k-means-type clustering. In Foundations and Advances in Data Mining, pages 79–96. Springer, 2005.
[155] Odile Pons. Functional estimation for density, regression models and processes.
World scientific, 2011.
[156] Robert C. Prim. Shortest connection networks and some generalizations. Bell
system technical journal, 36(6):1389–1401, 1957.
[157] James G. Propp and David B. Wilson. Exact sampling with coupled Markov
chains and applications to statistical mechanics. Random Structures and Algo-
rithms, 9(1&2):223–252, 1996.
[158] James G. Propp and David B. Wilson. How to get a perfectly random sam-
ple from a generic Markov chain and generate a random spanning tree of a
directed graph. Journal of Algorithms, 27:170–217, 1998.
[159] Jim O. Ramsay and Bernard W. Silverman. Functional Data Analysis. Springer-
Verlag, 1997.
[160] B.L.S. Prakasa Rao. Nonparametric functional estimation. Academic Press, 1983.
[162] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing
flows. In International conference on machine learning, pages 1530–1538. PMLR,
2015.
[166] Gareth O. Roberts and Jeffrey S. Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71, 2004.
[168] R. Tyrrell Rockafellar. Convex analysis, volume 18. Princeton university press,
1970.
[169] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Lecture Notes in Computer Science, pages 234–241. Springer International Publishing, 2015.
[170] Kenneth Rose, Eitan Gurewitz, and Geoffrey Fox. A deterministic annealing
approach to clustering. Pattern Recognition Letters, 11(9):589–594, 1990.
[171] Peter J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and val-
idation of cluster analysis. Journal of computational and applied mathematics,
20:53–65, 1987.
[172] Walter Rudin. Real and Complex Analysis. McGraw-Hill, 1966.
[173] Robert E. Schapire. The strength of weak learnability. Machine learning, 5(2):
197–227, 1990.
[174] Isaac J. Schoenberg. Metric spaces and completely monotone functions. An-
nals of Mathematics, pages 811–841, 1938.
[175] Gideon Schwarz. Estimating the dimension of a model. The annals of statistics,
6(2):461–464, 1978.
[176] Claude E. Shannon. A mathematical theory of communication. The Bell system
technical journal, 27(3):379–423, 1948.
[177] Claude E. Shannon. Communication in the presence of noise. Proc. Institute
of Radio Engineers, 37(1):10–21, 1949.
[178] Simon J. Sheather and Michael C. Jones. A reliable data-based bandwidth
selection method for kernel density estimation. Journal of the Royal Statistical
Society: Series B (Methodological), 53(3):683–690, 1991.
[179] Bernard W. Silverman. Density estimation for statistics and data analysis. Chapman and Hall, 1998.
[180] Viktor Pavlovich Skitovich. Linear forms of independent random variables
and the normal distribution law. Izvestiya Rossiiskoi Akademii Nauk. Seriya
Matematicheskaya, 18(2):185–200, 1954.
[181] Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score match-
ing: A scalable approach to density and score estimation. In Uncertainty in
Artificial Intelligence, pages 574–584. PMLR, 2020.
[182] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Rus-
lan Salakhutdinov. Dropout: a simple way to prevent neural networks from
overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[183] Hugo Steinhaus. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci., 4:801–804, 1956.
[184] Charles J. Stone. Consistent nonparametric regression. The annals of statistics,
pages 595–620, 1977.
[185] Mervyn Stone. Cross-validatory choice and assessment of statistical predic-
tions. Journal of the royal statistical society: Series B (Methodological), 36(2):
111–133, 1974.
[186] Catherine A. Sugar and Gareth M. James. Finding the number of clusters in a
dataset: An information-theoretic approach. Journal of the American Statistical
Association, 98(463):750–763, 2003.
[187] Robert H. Swendsen and Jian-Sheng Wang. Nonuniversal critical dynamics in
monte carlo simulations. Physical review letters, 58(2):86, 1987.
[188] Esteban G. Tabak and Eric Vanden-Eijnden. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217–233, 2010.
[189] Michel Talagrand. The generic chaining: upper and lower bounds of stochastic
processes. Springer Science & Business Media, 2006.
[190] Michel Talagrand. Upper and lower bounds for stochastic processes: modern meth-
ods and classical problems, volume 60. Springer Science & Business Media,
2014.
[191] Aik Choon Tan, Daniel Q. Naiman, Lei Xu, Raimond L. Winslow, and Don-
ald Geman. Simple decision rules for classifying human cancers from gene
expression profiles. Bioinformatics, 21(20):3896–3904, 2005.
[192] Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the num-
ber of clusters in a data set via the gap statistic. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 63(2):411–423, 2001.
[193] Luke Tierney. Markov chains for exploring posterior distributions. Annals of Statistics, 22(4):1701–1728, 1994.
[194] Igor Vajda. On metric divergences of probability measures. Kybernetika, 45(6):
885–900, 2009.
[195] Laurens van der Maaten. Accelerating t-SNE using tree-based algorithms. The Journal of Machine Learning Research, 15(1):3221–3245, 2014.
[196] Aad W. Van der Vaart. Asymptotic statistics, volume 3. Cambridge university
press, 2000.
[197] Aad W. Van der Vaart and John A. Wellner. Weak convergence and empirical
processes with applications to statistics. Springer, 1996.
[198] Vladimir Vapnik. Statistical learning theory. Wiley, New York, 1998.
[199] Vladimir Vapnik. The nature of statistical learning theory. Springer science &
business media, 2013.
[200] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[202] René Vidal, Yi Ma, and Shankar Sastry. Generalized principal component analysis (GPCA). IEEE transactions on pattern analysis and machine intelligence, 27(12):1945–1959, 2005.
[203] Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.
[204] Grace Wahba. Spline Models for Observational Data. SIAM, 1990.
[205] Geoffrey S. Watson. Smooth regression analysis. Sankhyā: The Indian Journal
of Statistics, Series A, pages 359–372, 1964.
[206] Gerhard Winkler. Image analysis, random fields and Markov chain Monte Carlo methods. Springer, second edition, 2003.
[207] Stephen J. Wright and Benjamin Recht. Optimization for data analysis. Cam-
bridge University Press, 2022.
[209] Laurent Younes. Estimation and annealing for Gibbsian fields. Ann. de l'Inst. Henri Poincaré, 24(2):269–294, 1988.
[210] Laurent Younes. Parametric inference for imperfectly observed Gibbsian fields. Probability Theory and Related Fields, 82:625–645, 1989.
[213] Lotfi A. Zadeh. Fuzzy sets. In Fuzzy sets, fuzzy logic, and fuzzy systems: selected
papers by Lotfi A Zadeh, pages 394–432. World Scientific, 1996.