
Lecture notes: mathematics for artificial intelligence 1

G. Blanchard
September 16, 2024

Contents
0 Some reminders on probability theory
  0.1 Elements of probability
  0.2 A few important properties
  0.3 Conditioning

1 Introduction to statistical learning theory (part 1): Decision theory
  1.1 Mathematical formalization
  1.2 Optimal risk and prediction function
  1.3 Learning from data
  1.4 Consistency of estimators
  1.5 Plug-in classification
  1.6 Negative results: the “no free lunch” theorem

2 Linear Discrimination: A brief overview of classical methods
  2.1 Linear discrimination functions
  2.2 The naive Bayes classifier
  2.3 Gaussian generative distribution: LDA and QDA
  2.4 Classification as regression
  2.5 Linear logistic regression
  2.6 Hinge-loss based methods: Perceptron and Support Vector Machine
  2.7 Regularization

3 Introduction to statistical learning theory (part 2): elementary bounds
  3.1 Controlling the error of a single decision function and the Hold-Out principle
  3.2 A sharp but inconvenient bound in the Bernoulli case: the Clopper-Pearson bound
  3.3 The Chernoff method
  3.4 Sub-Gaussian random variables and Hoeffding’s inequality
  3.5 Uniform bounds over a finite class of prediction functions
  3.6 Uniform bounds over a countable class of prediction functions; regularized ERM

4 The Nearest Neighbors method
  4.1 Basic notation and definitions
  4.2 Analysis for k fixed
  4.3 Consistency of the k-nearest-neighbors method

5 Reproducing kernel methods
  5.1 Motivation
  5.2 Reproducing kernel Hilbert spaces
  5.3 Construction of spsd kernels
  5.4 Kernel-based methods
  5.5 Regularity and approximation properties of functions in an RKHS
  5.6 Translation invariant kernels and random Fourier features

6 Introduction to statistical learning theory (part 3): Rademacher complexities and VC theory
  6.1 Introduction, reminders
  6.2 The Azuma-McDiarmid inequality
  6.3 Rademacher complexity
  6.4 Properties of the Rademacher complexity
  6.5 Application to kernel methods
  6.6 Vapnik-Chervonenkis theory
  6.7 Application to artificial neural networks

Convention used for notation of importance of results (margin stars).

A triple margin star *** indicates a fundamental result whose proof has to be known. You may be asked to give the proof in the exam, without additional hints or reminders. The proof itself is generally important, and solutions of some exercises may be expected to involve variants of the ideas of the proof, so these have to be understood at a deep level.
A double margin star ** indicates a fundamental definition or an important result that has to be known; you may be asked to restate it at the exam. If it is a result with a proof, it is recommended to have an idea of how the proof works, but you won't be asked to redo the proof (at least not without reminders or some other form of help).
A single margin star * indicates an important result. Solutions of some exercises may rely on using this result. The proof does not have to be known inside out.
A diamond ⋄ indicates a result that is given for illustration; it can be seen as an exercise putting into light some interesting applications, argument techniques or results. You will not be required to state or prove the result in the exam, but it is of interest for training to know how the result works.

0 Some reminders on probability theory
0.1 Elements of probability
The theoretical approaches to artificial intelligence involve many different fields of mathe-
matics. A central tenet of most approaches to modern mathematical modeling of artificial
intelligence methods, which in one form or other receive data and “learn” from it, is that
said data should be modeled as random. Thus, while other distinct areas of mathematics
play an important role, that of probability theory should be considered central. In these
notes, we will chiefly concentrate on probabilistic and statistical aspects – what is usually
called “statistical learning theory”.
While we assume the reader to be familiar with the mathematical elements of probability theory, we start by recalling a few fundamentals.

Probability spaces. A probability space (Ω, A, P) consists of a base space Ω, a σ-algebra A of subsets of Ω (called measurable subsets, or events), and a probability distribution P, that is, a mapping A → [0, 1] satisfying the fundamental axioms of probability (P(Ω) = 1, and σ-additivity over disjoint countable unions).
If A₁, A₂, … are events, not necessarily disjoint, then we always have, as a consequence of the fundamental axioms:

** $$P\Big[\bigcup_{i\ge 1} A_i\Big] \;\le\; \sum_{i\ge 1} P(A_i), \tag{0.1}$$

which we will call the union bound (also known as Boole's inequality).
A probability distribution is a measure, and as such we can integrate real-valued mea-
surable functions f : Ω → R with respect to P .

Random variables. A random variable (r.v.) over (Ω, A, P) with values in a measurable space (X, F) is a measurable map from the former to the latter space. It induces an image (“push-forward”) probability measure P_X on the image space (also sometimes denoted X#P), defined via

$$\forall F \in \mathcal F:\quad P_X(F) = P(X^{-1}(F)) = P(X \in F),$$

called the distribution of X. It will be assumed that all random variables considered in a given context are defined on the same underlying probability space (Ω, A, P); the latter is generally left unspecified, since we will generally only be interested in studying some specific random variables.

If X is a random variable over (Ω, A, P) and G is a further measurable map (X, F) → (Z, G), then Z = G(X) is obviously also a random variable over (Ω, A, P) (with values in Z). If Z = R and Z is integrable, we have the formula

** $$\int_{\mathbb R} z\, P_Z(dz) = \int_{\mathcal X} G(x)\, P_X(dx) =: E[Z], \tag{0.2}$$

called the expectation of Z.
The first equality (“change of variable formula”) can be useful since it can spare us from computing the distribution of Z explicitly in order to perform the integral. Expectation can be defined for a vector-valued variable (Z = R^d) in the obvious way (i.e. coordinate-wise).

Densities. If X is a r.v. taking values in X, µ is some reference measure on (X, F), and it holds that

$$\forall F \in \mathcal F:\quad P_X(F) = \int_F f(x)\,\mu(dx) = \int_{\mathcal X} \mathbb 1\{x \in F\}\, f(x)\,\mu(dx)$$

for some measurable function f: X → R₊, then we say that P_X has density f with respect to µ. It is sufficient to check the above equality for events F in a family of sets generating F and stable under finite intersection (a π-system), for it to hold for all F ∈ F.
For instance, on R with the standard Borel σ-algebra it is sufficient to check it on (open or closed) intervals; on R^d, it suffices to consider parallelepipeds.

Marginals. Let (X, F) and (X′, F′) denote two measurable spaces. The product X × X′ can be endowed with the product σ-algebra F ⊗ F′ (generated by products of events in F and F′).
If P is a probability distribution on a product space (X × X′, F ⊗ F′), the first (or X-) marginal of P is the probability distribution of the random variable given by the projection (x, x′) ↦ x, and similarly for the second (or X′-) marginal.
If Z is a random variable over (Ω, A) with values in the product space (X × X′, F ⊗ F′), then Z(ω) = (X(ω), X′(ω)) with X a random variable, and P_X is called the (first) marginal distribution of Z (it is also the first marginal distribution of P_Z in the above sense).

Independence. The random variables (X, X′) on (X × X′, F ⊗ F′) are independent (also denoted X ⊥⊥ X′) iff their joint distribution (as a couple) is the product distribution of their marginals, or equivalently, if

$$\forall F \in \mathcal F,\ F' \in \mathcal F':\quad P_{(X,X')}(F \times F') = P_X(F)\, P_{X'}(F'),$$

or equivalently,

$$\forall F \in \mathcal F,\ F' \in \mathcal F':\quad P(X \in F,\ X' \in F') = P(X \in F)\, P(X' \in F').$$

Yet equivalently, independence holds when the joint distribution is the product measure of the marginals:

$$P_{(X,X')} = P_X \otimes P_{X'}.$$

As previously, it is sufficient to check the above equalities on a generating π-system to establish independence.
By Fubini’s theorem, if X, X ′ are real-valued independent and integrable, then their
product XX ′ is integrable and E[XX ′ ] = E[X]E[X ′ ].
If X, X ′ are independent variables and F, G are measurable mappings into further
measured spaces, then F (X) and G(X ′ ) are independent.
This generalizes to finite families of r.v.’s and even to countable families, though we
won’t really need it, since we will not be interested in a.s. convergence most of the time in
the present notes. When X₁, X₂, …, X_n are independent and have the same marginal P_X, we say they form an independent identically distributed (i.i.d.) family and denote their joint distribution $P_X^{\otimes n}$.

Exercises

Exercise 0.1. Justify the first equality in (0.2) if the space X is countable (using formulas
of discrete probability, i.e. integrals become sums).

Exercise 0.2. If f is a nonnegative, real-valued function on a measured space (X, F, µ) such that ∫ f dµ = 1, then P(F) := ∫_F f dµ is a probability distribution on X (with density f): justify.

Exercise 0.3. If r.v. Z has density f (x, x′ ) with respect to a product measure µ ⊗ µ′ on
X ×X ′ , then the first marginal distribution PX of Z has a density with respect to µ: justify
why and specify it explicitly.

Exercise 0.4. If PX has density f wrt. µ and PX ′ has density f ′ wrt. µ′ , and X, X ′ are
independent, then PX,X ′ has density (x, x′ ) 7→ f (x)f ′ (x′ ) wrt. µ ⊗ µ′ . Conversely: if PX,X ′
has density f (x, x′ ) wrt. µ ⊗ µ′ and it holds that f (x, x′ ) = g(x)h(x′ ) for some functions
g, h (not necessarily densities), then X, X ′ are independent.

0.2 A few important properties


Support. Let (X, F) be a Borel space, which we recall is a topological space endowed with the σ-algebra generated by its open sets (also called the Borel σ-algebra). If µ is a measure on X, its support is defined as

* $$\operatorname{Supp}(\mu) := \{x \in \mathcal X : \text{for any open set } N,\ N \ni x \Rightarrow \mu(N) > 0\}.$$

The support of a measure is a closed set (exercise). Furthermore, if X is a Polish space (metrizable, complete, separable), then

* $$\mu(\operatorname{Supp}(\mu)^c) = 0.$$

As a consequence, if we establish that a certain property holds for all x ∈ Supp(µ), then it holds for µ-almost all x ∈ X.

** Positivity of expectation. If X is a nonnegative real random variable, then E[X] = 0 implies X = 0 a.s.
Markov's inequality. If X is a nonnegative real random variable, then for any t > 0:

$$P[X \ge t] \le \frac{E[X]}{t}.$$

Jensen's inequality. If X is an integrable real random variable and ϕ is a convex function R → R such that ϕ(X) is integrable, then

$$\varphi(E[X]) \le E[\varphi(X)].$$

As a consequence, we have in particular (taking ϕ(x) = x²) that for a square integrable random variable:

$$\operatorname{Var}[X] := E[X^2] - E[X]^2 \ge 0.$$

However, the latter fact can also be established directly from the variance formula

$$\operatorname{Var}[X] := E[X^2] - E[X]^2 = E\big[(X - E[X])^2\big].$$

Bias-Variance formula. If X is a real-valued square integrable random variable, then for any c ∈ R, the random variable (X − c) is integrable and

$$E\big[(X - c)^2\big] = (E[X] - c)^2 + E\big[(X - E[X])^2\big] = (E[X] - c)^2 + \operatorname{Var}[X].$$

Exercise 0.5. Assume a probability distribution P on a Borel space X has a continuous density f with respect to a reference measure ν. Prove that Supp(P) is the closure of {x : f(x) > 0}. Is this true if f is not continuous?

Exercise 0.6. Prove the formula $\operatorname{Var}[X] = \frac12 E[(X - X')^2]$, where X, X′ are two independent square integrable real variables having the same marginal distribution.

0.3 Conditioning
If A is an event with P(A) > 0, the standard definition of the conditional probability of an event B given A is $P(B|A) := \frac{P(A \cap B)}{P(A)}$. An important property is then that P(•|A) := (B ↦ P(B|A)) is itself a probability distribution (it satisfies the axioms), called the conditional probability distribution of P given A. If X is a random variable, we can for instance take A = {X ∈ F}, provided P(X ∈ F) > 0.
In what follows, we will very often consider random variables (X, Y) with a joint distribution P_XY and would like to consider the conditional distributions given X = x or Y = y; but these events unfortunately have null probability in general, so that the above definition does not apply. We need something more general.

Definition 0.1. Let (X, F) and (Y, G) be two measurable spaces. We call a regular transition probability or Markov kernel from X to Y a mapping κ: G × X → [0, 1] such that:

(i) for all x ∈ X, the mapping κ(•, x): G ↦ κ(G, x) is a probability distribution on (Y, G);

(ii) for all G ∈ G, the mapping κ(G, •): x ↦ κ(G, x) is measurable.

Given a probability distribution P on X and a regular transition probability κ from X to Y, we can define a joint probability on X × Y as

$$(\kappa \circ P)(F \times G) := \int_F \int_G \kappa(dy, x)\, P(dx),$$

and more generally, for an integrable real-valued function f on X × Y:

$$\int_{\mathcal X \times \mathcal Y} f(x, y)\,(\kappa \circ P)(dx, dy) := \int_{\mathcal X} \int_{\mathcal Y} f(x, y)\,\kappa(dy, x)\, P(dx). \tag{0.3}$$

A fundamental question is the converse: given a joint probability P_XY on X × Y, with P_X its first marginal distribution, does there exist a transition probability κ from X to Y such that P_XY = κ ∘ P_X? The following theorem is fundamental and guarantees the existence of such an object in a sufficiently broad situation.

** Theorem 0.2 (Disintegration theorem). Assume (Y, G) is a “nice probability space”


(see below). Let P be a probability distribution on (X × Y, F ⊗ G), and PX its first
marginal distribution. Then there exists a transition kernel PY|X from X to Y, called
a regular conditional probability distribution (rcpd) of P such that P = PY|X ◦ PX .
Furthermore, this transition kernel is PX -a.s. unique, in the sense that if two transition
kernels κ, κ′ satisfy the above properties, then κ(•, x) = κ′ (•, x) for PX -almost every x.

In particular any Polish space, that is, a complete separable metrizable space, equipped
with its Borel σ-algebra is “nice”. The emphasis here is on separable, which somehow limits
the “size” of the output space of the kernel.
The disintegration theorem applies to product spaces without reference to random variables, but in probabilistic terms it is more convenient to think of the joint distribution P_XY of the two random variables (X, Y): Ω → X × Y, and to call the transition kernel the regular conditional probability of Y given X = x, denoted P_{Y|X}(·|·). Thus, to reiterate (0.3), the property characterizing an rcpd is that it is a transition probability satisfying:

** $$\text{for any integrable } f:\quad \int_{\mathcal X \times \mathcal Y} f(x, y)\, P_{XY}(dx, dy) = \int_{\mathcal X} \int_{\mathcal Y} f(x, y)\, P_{Y|X}(dy|x)\, P_X(dx), \tag{0.4}$$

and since it is sufficient to check this for indicators of products of events, an rcpd is equivalently characterized by

$$\text{for any events } A \text{ on } \mathcal X \text{ and } B \text{ on } \mathcal Y:\quad P_{XY}(A \times B) = \int_A P_{Y|X}(B|x)\, P_X(dx). \tag{0.5}$$

Remark 0.3. In the definition above, it is important to notice that there is no uniqueness: in fact, given any two regular transition probabilities κ, κ′ from X to Y, and a distribution P_X on X, we check from (0.3) that κ ∘ P_X = κ′ ∘ P_X as soon as κ(•, x) and κ′(•, x) coincide P_X-a.s. Similarly, if P_{Y|X} is an rcpd of P, modifying x ↦ P_{Y|X}(•, x) on a P_X-null set gives another rcpd of P. In this sense a regular conditional distribution is only defined a.s. with respect to the marginal of the conditioning variable. This means that there is no good absolute definition of “the conditional distribution of Y given X = x₀” (if P(X = x₀) = 0), but rather a family of such conditional distributions (indexed by x₀).
This family is, however, unique up to a.s. equivalence; in particular, in the case where X, Y are independent, we will always implicitly consider the “canonical” choice P_{Y|X}(·|x) = P_Y(·) for all x.
If X is a random variable (Ω, A, P) → (X, F), and (Ω, A) is a nice space, then we may apply the previous theorem to the product space Ω × X equipped with the distribution of the random variable Z(ω) = (ω, X(ω)) (this is the distribution δ_X ∘ P, where δ_X is the transition kernel κ(·, ω) = δ_{X(ω)}). We thus obtain a regular conditional probability on Ω given X. This can be used to give (P_X-a.s.) a sense to the expression P(A|X = x), where A is any event of Ω, which we will use often; in particular, the event A ⊂ Ω may be defined in terms of further random variables Y₁, Y₂, …, but we do not have to explicitly apply the disintegration theorem on the complicated product space given by the value spaces of these variables; we may directly use the notation P(·|X = x₀) without further ado.
For the remainder of these notes, we will assume without repeating it that (Ω, A) is a nice space, so that the previous argument applies and all regular conditional probabilities exist.

** Definition 0.4 (Conditional expectation). Let h be a real-valued function on X × Y, integrable wrt. the (joint) probability distribution P. Assume that the conditions of Theorem 0.2 are met, so that an rcpd P_{Y|X} exists. For any x₀ ∈ X, the conditional expectation of h given X = x₀ is

$$E[h(X, Y)|X = x_0] := \int_{\mathcal Y} h(x_0, y)\, P_{Y|X}(dy|x_0).$$

If Y is real-valued, denoting F(x) = E[Y|X = x], the random variable F(X): ω ↦ F(X(ω)) is the conditional expectation of Y given X, denoted E[Y|X].

Remark 0.5. Since the rcpd P_{Y|X}(•, x) is only unique up to P_X-a.s. equivalence, conditional expectations are also only defined uniquely up to P_X-a.s. equivalence over x₀ above. However, we will always implicitly assume that we have chosen a specific representative rcpd P_{Y|X}(•, x) “once and for all” and define all conditional expectations as above with respect to this particular choice. The reason why we spell this out is that we want the conditional expectations of all integrable functions to be defined using the same common representative rcpd, which allows us to forget about the a.s.-equivalence issue and write properties such as the next proposition.

** Proposition 0.6 (Properties of conditional expectation). The considered variables are real-valued and integrable as necessary for the statements below to make sense. A common representative rcpd P_{Y|X} is implicitly chosen to define all conditional expectations below.

(i) E[Y ] = E[E[Y |X]]

(ii) E[h(X, Y )|X = x0 ] = E[h(x0 , Y )|X = x0 ], for PX -almost all x0 .

(iii) E[h(X)Y |X = x0 ] = h(x0 )E[Y |X = x0 ], for PX -almost all x0 .

(iv) E[h(X)Y |X] = h(X)E[Y |X], PX -a.s.

Remark 0.7. In classical probability courses, it is common to define the conditional expectation first, in a different and more general way (conditional expectation with respect to a σ-algebra) that requires less formalism and is sometimes easier to handle. The advantage of considering rcpds P_{Y|X} is that they are probability distributions for each fixed value x of the conditioning variable, and we can apply all theorems of integration without restriction when integrating over the rcpd for any fixed x. Things sometimes get a little more awkward when starting from conditional expectations. In particular, the “obvious” property (ii) above is not granted with the “usual” way of defining conditional expectations; in fact, in that “usual” framework it does not even formally make sense, since “usual” conditional expectations are only defined up to a.s. equivalence, separately for each function h_{x₀}(·) := h(x₀, ·); it does not make sense to say that two functions, each defined only a.s., agree at a particular point. As we are using rcpds here, the a.s. equivalence is “factored into” the choice of the common representative rcpd.

* Proposition 0.8 (Conditioning and densities). Assume (X, Y) is a couple of random variables on X × Y with joint probability distribution P_XY having density f_XY(x, y) with respect to a reference product measure µ ⊗ ν. Then the rcpd P_{Y|X}(·|x) coincides P_X-a.s. with the probability distribution on Y having the following density wrt. ν:

$$f_{Y|X}(y|x) := \begin{cases} \dfrac{f_{XY}(x, y)}{\int_{\mathcal Y} f_{XY}(x, y)\,\nu(dy)} & \text{if } \int_{\mathcal Y} f_{XY}(x, y)\,\nu(dy) = f_X(x) > 0;\\[2ex] f_0(y) & \text{if } f_X(x) = 0, \end{cases}$$

where f₀ is any fixed a priori density with respect to ν.
Observe that the second case deals with points outside of the support of P_X and, by definition, almost surely does not happen; this is why the choice of f₀ does not matter.
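For intuition, here is a minimal discrete sketch (NumPy; the joint table is made up, and both reference measures are taken to be counting measures) of the conditioning formula of Proposition 0.8, including the arbitrary fallback density f₀ on the null set of x-values.

```python
import numpy as np

# Hypothetical joint pmf f_XY on a 3 x 2 grid (mu = nu = counting measure).
f_XY = np.array([[0.10, 0.20],
                 [0.30, 0.40],
                 [0.00, 0.00]])   # rows: values of x, columns: values of y

f_X = f_XY.sum(axis=1)            # marginal density f_X(x): integrate out y

# f_{Y|X}(y|x): normalize each row where f_X(x) > 0; where f_X(x) = 0,
# fall back to an arbitrary fixed density f0 (uniform here), as in the text.
f0 = np.full(f_XY.shape[1], 1.0 / f_XY.shape[1])
safe = np.where(f_X[:, None] > 0, f_X[:, None], 1.0)   # avoid division by zero
f_Y_given_X = np.where(f_X[:, None] > 0, f_XY / safe, f0)
print(f_Y_given_X)                # each row sums to 1
```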

Exercises

Exercise 0.7. Check that P is the first marginal distribution of κ ◦ P , defined in (0.3).

Exercise 0.8. Prove the properties of Proposition 0.6 directly from the definition.

Exercise 0.9. Prove Proposition 0.8, forgetting about a.s. uniqueness: just check that the transition kernel P̃(·, x) having the proposed density satisfies the fundamental properties of an rcpd.

1 Introduction to statistical learning theory (part 1):
Decision theory
1.1 Mathematical formalization
In a nutshell, a learning task is formalized as a prediction of a certain target variable Y
from the observation of an object X. The variability in these quantities is modeled via a
joint probability distribution P of these objects. What needs to be formalized is:
• what is a prediction of Y from X?
• how is the goodness of a prediction assessed?
• what would be a theoretically optimal prediction?
• how is a prediction function “learnt” from data?
In what follows (X , X) is a measurable space (for instance Rd or a subset of Rd ) called
the input space and (Y, Y) another measurable space called label space, most often Y = R,
Y = [a, b], Y = {1, . . . , K}, or Y = Rk . When Y is a real interval the setting is typically
called that of regression, and if Y is a finite set, classification.
It is assumed that the variables (X, Y) have a joint distribution P over the product space, called the generative distribution. In a prediction setting, we assume a random realization (x, y) of P; the value of y (the label) is unknown to us, but we would like to predict it as well as possible from the knowledge of x (the input, predictor or covariate). To allow for some flexibility, in the sequel the prediction may take values in a measurable space Ỹ possibly distinct from Y.

** Definition 1.1 (Prediction function). A prediction function is a measurable function f: X → Ỹ.

Definition 1.2 (Loss function).

• A loss (or cost) function is a (measurable) function ℓ: Ỹ × Y → R₊ (loss functions taking negative values are possible, but we will mostly consider the nonnegative loss case).
• The pointwise loss of a prediction function f on a realization (x, y) is ℓ(f(x), y).
• The risk or generalization error of f is the expected loss under a given generative distribution P:

$$\mathcal E_\ell(f, P) = E_{(X,Y)\sim P}[\ell(f(X), Y)].$$

This is abbreviated as E(f) if both the loss function and the generative distribution are unambiguous. We allow possibly E(f) = ∞.

** Standard examples.

Example 1.3 (Regression with least squares loss). Take Y = Ỹ = R and ℓ(y′, y) = (y − y′)²; so E(f) = E[(f(X) − Y)²].

Example 1.4 (Classification). Take Y = Ỹ = {1, …, K} (K ≥ 2). Then the misclassification loss is ℓ(y′, y) = 1{y ≠ y′}, and E(f) = P[f(X) ≠ Y] is simply the probability of an incorrect prediction of the class.
A generalized case is weighted classification, where ℓ(y, y′) = M_{yy′} and M is the loss matrix, with zero diagonal. This represents situations where one type of error is considered more costly than another.
A widely considered particular case is binary classification (K = 2; by contrast, K > 2 is called multiclass classification). Depending on the context, it may be more convenient to encode the two classes as Y = {0, 1} or Y = {−1, 1}. In the binary classification setting, the prediction function f often takes real values (Ỹ = R). We can consider several losses in this case; for now we mention:

• the 0-1 loss or hard loss: Y = {−1, 1}, Ỹ = R and ℓ(y′, y) = 1{yy′ ≤ 0}: this is equivalent to using the misclassification loss and interpreting the class prediction as sign(y′) (y′ = 0 is interpreted as a classification error no matter what);

• the quadratic loss for classification: Y = {0, 1}, Ỹ = R and ℓ(y′, y) = (y − y′)². If the prediction y′ actually lies in {0, 1}, this is identical to the misclassification loss.

It is also possible to use a quadratic loss for multiclass classification; in this case we take Ỹ = R^K and ℓ(y′, y) = ∥y′ − e_y∥², where e_i is the i-th canonical basis vector.
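As a concrete illustration, here is a minimal sketch (plain NumPy; the function names are our own) of the losses just listed, evaluated on batches of predictions:

```python
import numpy as np

def misclassification_loss(y_pred, y):
    """0-1 loss for class labels in {1, ..., K}."""
    return (y_pred != y).astype(float)

def hard_loss(y_pred, y):
    """Hard loss for labels in {-1, 1} and real-valued predictions;
    counts y_pred = 0 as an error, matching the convention in the text."""
    return (y * y_pred <= 0).astype(float)

def quadratic_loss(y_pred, y):
    """Quadratic loss for labels in {0, 1} and real-valued predictions."""
    return (y - y_pred) ** 2

y = np.array([1, -1, 1, -1])
y_pred = np.array([0.3, 0.5, -2.0, 0.0])
print(hard_loss(y_pred, y))   # [0. 1. 1. 1.]
```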

1.2 Optimal risk and prediction function.

** Definition 1.5. We assume a loss function ℓ on Ỹ × Y has been fixed.
The optimal error (also called Bayes error) for generating distribution P is

$$\mathcal E_\ell^*(P) = \inf_{f \in \mathcal F(\mathcal X, \widetilde{\mathcal Y})} \mathcal E_\ell(f, P),$$

where F(X, Ỹ) denotes the set of (measurable) functions from X to Ỹ. This is often simply abbreviated as E*(P) or simply E*.
If this infimum is a minimum, an optimal prediction function f*_P is one achieving the minimum:

$$\mathcal E(f_P^*) = \mathcal E^*(P), \quad\text{or}\quad f_P^* \in \operatorname*{Arg\,Min}_{f \in \mathcal F(\mathcal X, \widetilde{\mathcal Y})} \mathcal E_\ell(f, P).$$

It is possible to restrict the search for a prediction function to a subset (sometimes called a model or hypothesis class) G ⊂ F(X, Ỹ), in which case one defines correspondingly E*_G and (if it exists) f*_G by restricting the inf or min to G:

$$\mathcal E_{\mathcal G,\ell}^*(P) = \mathcal E_{\mathcal G}^* = \inf_{f \in \mathcal G} \mathcal E_\ell(f, P).$$

The optimal prediction function f* (or possibly f*_G), generally implicitly assumed to exist, will be regarded as the target of the learning procedure.
Since the risk measures the goodness of prediction, it will be of interest to analyze how far a given prediction function f is from the optimal: thus we will often study the excess risk E(f) − E* (or possibly E(f) − E*_G, the excess risk with respect to the prediction function class G).
The following proposition is helpful to determine an (unrestricted) optimal prediction
function by reducing the problem to the case of a single probability distribution on Y and
a constant prediction.

* Proposition 1.6. Assume that we know a mapping F from the set of probability distributions on Y to Ỹ such that for any probability distribution P_Y on Y:

$$F(P_Y) \in \operatorname*{Arg\,Min}_{c \in \widetilde{\mathcal Y}} E_{Y \sim P_Y}[\ell(c, Y)].$$

Then for any joint generating probability P_XY,

$$f_{P_{XY}}^*(x) = F(P_{Y|X}(\cdot|x)) \in \operatorname*{Arg\,Min}_{c \in \widetilde{\mathcal Y}} E[\ell(c, Y)|X = x]$$

is an optimal prediction function (provided it is measurable).
If F(P_Y) only exists for a restricted subset 𝒫_Y of probability distributions on Y, the above remains true for any joint generating probability distribution P_XY such that the conditional distribution P_{Y|X}(·|x) belongs to 𝒫_Y, P_X-a.s.

Proof. By definition of F, for any c ∈ Ỹ and any distribution P on Y it holds that

$$E_{Y \sim P}[\ell(F(P), Y)] \le E_{Y \sim P}[\ell(c, Y)].$$

Define f* as above. For any f ∈ F(X, Ỹ) and any fixed x ∈ X, taking P ← P_{Y|X}(·|x) and c ← f(x) in the above gives

$$E_{Y \sim P_{Y|X}(\cdot|x)}[\ell(f^*(x), Y)] = E_{Y \sim P_{Y|X}(\cdot|x)}\big[\ell(F(P_{Y|X}(\cdot|x)), Y)\big] \le E_{Y \sim P_{Y|X}(\cdot|x)}[\ell(f(x), Y)].$$

Now taking the expectation over x, i.e. integrating with respect to the distribution P_X, we obtain

$$E_{X \sim P_X}\Big[E_{Y \sim P_{Y|X}(\cdot|X)}[\ell(f^*(X), Y)]\Big] = E[E[\ell(f^*(X), Y)|X]] = E[\ell(f^*(X), Y)] = \mathcal E(f^*, P_{XY}) \le E_{X \sim P_X}\Big[E_{Y \sim P_{Y|X}(\cdot|X)}[\ell(f(X), Y)]\Big] = \mathcal E(f, P_{XY}).$$

(Important) examples.

*** Proposition 1.7. Under the setting of regression with quadratic loss, provided Y is square integrable under P_Y, the optimal prediction function is

$$f^*(x) = E[Y|X = x],$$

the corresponding risk is

$$\mathcal E^* = E[\operatorname{Var}[Y|X]] = E\big[(Y - E[Y|X])^2\big],$$

and for the excess risk of an arbitrary prediction function f it holds that

$$\mathcal E(f) - \mathcal E^* = E\big[(f(X) - f^*(X))^2\big] = \|f - f^*\|_{2,P_X}^2.$$

Proof. Using Proposition 1.6, we analyze the simple case of a given probability distribution P_Y on R (for which Y is square integrable) and of a constant prediction c. We have

$$E\big[(Y - c)^2\big] = E\big[(Y - E[Y])^2\big] + (E[Y] - c)^2,$$

so that F(P_Y) = E_{Y∼P_Y}[Y] is an optimal constant prediction.
Now for a joint distribution P_XY such that E[Y²] < ∞, since E[Y²] = E[E[Y²|X]], it must be the case that E[Y²|X] is a.s. finite; in other words, P_X-almost surely P_{Y|X}(·|x) has a second moment. We can therefore apply the previous proposition and obtain the claim. The formula for the risk follows. As for the excess risk, we have

$$\mathcal E(f) = E\big[(f(X) - Y)^2\big] = E\big[(f(X) - E[Y|X] + E[Y|X] - Y)^2\big] = E\big[(f(X) - E[Y|X])^2\big] + E\big[(E[Y|X] - Y)^2\big] = E\big[(f(X) - f^*(X))^2\big] + \mathcal E^*, \tag{1.1}$$

and rearranging gives the last claim of the proposition.
Exercise 1.1. Justify equality (1.1).

*** Proposition 1.8. Consider the classification setting with K classes and the misclassification loss. Then a prediction function f* with

$$f^*(x) \in \operatorname*{Arg\,Max}_{y \in \{1,\dots,K\}} P(Y = y|X = x)$$

is an optimal classification rule. This is known as a Bayes classifier, and the resulting E* as the Bayes error rate.

Proof. Using Proposition 1.6, we analyze the simple case of a given probability distribution P_Y on {1, …, K} and of a constant prediction c. Then

$$E_{Y \sim P_Y}[\mathbb 1\{Y \ne c\}] = 1 - P_Y(\{c\}).$$

Obviously this is minimized for $c \in \operatorname*{Arg\,Max}_{y \in \{1,\dots,K\}} P_Y(\{y\})$. Using Proposition 1.6, this gives the claim.
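To make the Bayes classifier concrete, the following sketch (NumPy; the joint distribution is made up for illustration) computes f* and the Bayes error rate on a finite input space:

```python
import numpy as np

# Hypothetical joint distribution P(X = x, Y = y): 4 inputs, 3 classes.
P_XY = np.array([[0.10, 0.05, 0.05],
                 [0.02, 0.20, 0.03],
                 [0.05, 0.05, 0.15],
                 [0.10, 0.10, 0.10]])

P_X = P_XY.sum(axis=1)            # marginal of X
eta = P_XY / P_X[:, None]         # P(Y = y | X = x), one row per x

f_star = eta.argmax(axis=1)       # Bayes classifier: most probable class per x
bayes_error = np.sum(P_X * (1 - eta.max(axis=1)))
print(f_star, bayes_error)
```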

Exercise 1.2. What is the optimal prediction function for the setting of classification using real-valued predictions and the quadratic loss (in the binary and in the multiclass-classification case)?

Exercise 1.3. In the classification setting, assume the class-conditional distributions P_{X|Y}[•|Y = i], i = 1, …, K, have respective densities f_i with respect to a common measure µ on X. Denote also π_i := P_Y[Y = i]. Prove that a prediction function g* such that

$$g^*(x) \in \operatorname*{Arg\,Max}_{i \in \{1,\dots,K\}} \big(\pi_i f_i(x)\big)$$

is a Bayes classification function.

1.3 Learning from data


We now consider methods for learning a decision function from data; such a method is also called an "estimator" in statistical terms.
In what follows, available observed data will be referred to as a training sample S_n, which is an n-tuple (x₁, y₁), …, (x_n, y_n).

** Definition 1.9. Given an integer n ∈ N_{>0}, an estimator (for the decision function) acting on training samples of size n is a mapping

$$(\mathcal X \times \mathcal Y)^n \to \mathcal F(\mathcal X, \widetilde{\mathcal Y}), \qquad S_n \mapsto \hat f_{S_n},$$

such that (S_n, x) ↦ f̂_{S_n}(x) is (jointly) measurable.

Note: it is usual practice in statistics that quantities with a hat notation are estimators, i.e. are implicitly functions of the data (sample); the explicit dependence is often dropped from the notation.
The generalization error or risk of an estimator is $\mathcal E(\hat f_{S_n}) = E_{X,Y}[\ell(\hat f_{S_n}(X), Y)]$; observe that in this notation the sample S_n is considered as fixed and we take the expectation with respect to a “new” point (X, Y), hence the name “generalization error”. If the sample data is modeled as random, this notation means that we are implicitly considering a conditional expectation given the sample, and that the new point (X, Y) is independent of the sample S_n. In the latter case (random sample), the expected generalization error/risk of the estimator f̂ is $E_{S_n \sim P^{\otimes n}}[\mathcal E(\hat f_{S_n})]$, i.e. a double expectation of the loss over the training sample and the new independent point (X, Y). (We will always assume in these notes that the training sample is i.i.d. with the marginal distribution P.)
By contrast, the training or empirical error of an estimator f̂ is

** $$\widehat{\mathcal E}(\hat f) = \widehat{\mathcal E}(\hat f_{S_n}, S_n) := \frac1n \sum_{i=1}^n \ell(\hat f_{S_n}(X_i), Y_i) = \mathcal E(\hat f_{S_n}, \widehat P_n),$$

where $\widehat P_n := \frac1n \sum_{i=1}^n \delta_{(X_i, Y_i)}$ is the empirical distribution associated with the sample S_n.
Observe the crucial fact that the empirical error depends “doubly” on the sample S_n: first because the sample determines the estimator f̂_{S_n}, and second because the error is evaluated on that same sample.
In particular, in general, $E_{S_n \sim P^{\otimes n}}\big[\widehat{\mathcal E}(\hat f_{S_n})\big] \ne E_{S_n \sim P^{\otimes n}}\big[\mathcal E(\hat f_{S_n})\big]$. This is to be contrasted with the fact that, for a fixed (non-data-dependent) prediction function f, we have $E_{S_n \sim P^{\otimes n}}\big[\widehat{\mathcal E}(f)\big] = \mathcal E(f)$ by simple linearity of the expectation.

(Counter)Example: overfitting. Consider the classification setting and assume X ∼ Unif[0, 1] and P(Y = 1) = P(Y = 0) = 1/2, with Y independent of X. Then it is obvious that E(f) = P[f(X) ≠ Y] = 1/2 for any prediction function f. On the other hand, if (X_i, Y_i)_{i=1,…,n} is i.i.d. from this generating distribution, almost surely the points X_i are distinct, and we can a.s. pick a function f̂ such that f̂(X_i) = Y_i for all i. Then Ê(f̂) = 0, yet E(f̂) = 1/2 almost surely. This is an extreme case of what is known as the overfitting phenomenon. In general, some overfitting will happen: Ê(f̂) will, generally, be smaller in expectation than the expected risk; i.e., it will have a negative bias, in statistical terms. One of the goals of the course is to control the amount of overfitting from a theoretical point of view. The above example suggests that we should limit ourselves in the possible choice of prediction function by considering appropriate models G ⊊ F(X, Ỹ).
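A small simulation (plain NumPy; the hypothetical "memorizing" classifier is our own construction) illustrating this example: the training error is exactly 0 while the test error hovers around 1/2.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X_train = rng.uniform(size=n)
Y_train = rng.integers(0, 2, size=n)   # Y independent of X, P(Y=1) = 1/2

def f_hat(x):
    """Memorizer: returns the training label at a training point, 0 elsewhere."""
    matches = x[:, None] == X_train[None, :]
    hit = matches.any(axis=1)
    idx = matches.argmax(axis=1)
    return np.where(hit, Y_train[idx], 0)

train_error = np.mean(f_hat(X_train) != Y_train)  # exactly 0: perfect memorization
X_test = rng.uniform(size=100_000)                # a.s. disjoint from training points
Y_test = rng.integers(0, 2, size=100_000)
test_error = np.mean(f_hat(X_test) != Y_test)     # close to 1/2
print(train_error, test_error)
```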

Empirical risk minimization (ERM). With the previous caveat in mind, a general approach to construct an estimator f̂_ERM is to minimize the empirical risk over a given model G ⊊ F(X, Ỹ) (the model G is assumed to be given by the context and omitted from the notation):

** $$\hat f_{\mathrm{ERM}} \in \operatorname*{Arg\,Min}_{f \in \mathcal G} \widehat{\mathcal E}(f).$$

An example of such an estimator from classical statistics is the maximum likelihood estimator. In this setting there is no actual “prediction”, but we can formally fit it into the considered framework by taking Y = {0}, Ỹ = R; the possible “prediction” functions are densities f: X → R₊ with respect to some reference measure µ on X, and models are of the form {f_θ, θ ∈ Θ} for some index space Θ (often a subset of R^k: we then speak of a parametric model). The loss function is ℓ(y′) = −log(y′) (allowed in this case to take negative values), and we have

$$\hat f_{ML} \in \operatorname*{Arg\,Min}_{\theta \in \Theta} \sum_{i=1}^n -\log f_\theta(x_i).$$

Another example from classical statistics is that of ordinary least squares linear regression (OLS): in the regression setting (X = R^d) with the squared loss, consider the model G = {f_β(x) = ⟨β, x⟩ | β ∈ R^d}; the ERM over this model is $f_{\hat\beta_{OLS}}$ with

* $$\hat\beta_{OLS} \in \operatorname*{Arg\,Min}_{\beta \in \mathbb R^d} \sum_{i=1}^n (y_i - \langle\beta, x_i\rangle)^2. \tag{1.2}$$

The “population” analogue (i.e. the optimal theoretical predictor over the same model G) is

$$\beta_{OLS}^* \in \operatorname*{Arg\,Min}_{\beta \in \mathbb R^d} E\big[(Y - \langle\beta, X\rangle)^2\big],$$

in other words we have $f_{\mathcal G}^* = f_{\beta_{OLS}^*}$ (see Definition 1.5). Since

$$E\big[(Y - \langle\beta, X\rangle)^2\big] = E[Y^2] + \beta^t \Sigma \beta - 2\beta^t \gamma,$$

with Σ := E[XXᵗ] (the second moment matrix) and γ := E[XY], the solution is $\beta_{OLS}^* := \Sigma^{-1}\gamma$ by the classical formula for the minimum of a quadratic function (assuming Σ invertible). To compute β̂_OLS we replace the generating distribution P_XY by the empirical distribution P̂_n of the observed sample, therefore

* $$\hat\beta_{OLS} = \widehat\Sigma^{-1}\hat\gamma, \qquad\text{where}\quad \widehat\Sigma := \frac1n \sum_{i=1}^n x_i x_i^T \quad\text{and}\quad \hat\gamma := \frac1n \sum_{i=1}^n x_i y_i \tag{1.3}$$

(here we assume Σ̂ invertible). A classical equivalent form is $\hat\beta_{OLS} = (\mathbf X^t \mathbf X)^{-1} \mathbf X^t \mathbf Y$, where X is the (n, d) matrix whose rows are x₁ᵗ, …, x_nᵗ, and Y = (y₁, …, y_n)ᵗ.
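A quick numerical sketch (NumPy, simulated data) computing β̂_OLS both via (1.3) and via the matrix form; the two agree up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Form (1.3): empirical second-moment matrix and cross-moment vector.
Sigma_hat = X.T @ X / n
gamma_hat = X.T @ y / n
beta_hat = np.linalg.solve(Sigma_hat, gamma_hat)

# Equivalent matrix form (X^t X)^{-1} X^t y.
beta_hat2 = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_hat, beta_hat2))  # True
```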

Exercise 1.4. Justify the validity of the last form for β̂_OLS.

1.4 Consistency of estimators


A primary goal of the analysis of statistical learning estimators is to understand their behavior as the number n of data points grows to infinity. A fundamental property to ensure is asymptotic convergence of their risk to the optimal risk E* (possibly the optimal risk E*_G over a model G). This is called consistency and is made formal in the following definition.

** Definition 1.10 (Consistency). Let f̂^{(n)}, n ≥ 1 be a sequence of estimators (where f̂^{(n)} learns from a sample S_n of size n) for a prediction problem with loss function ℓ. Let 𝒫 be a set of joint generating distributions on X × Y, and G ⊆ F(X, Ỹ) a subset of prediction functions.
Then the sequence f̂^{(n)}, n ≥ 1 is consistent in expectation on the distribution model 𝒫 and the prediction model G if

$$\forall P \in \mathcal P:\quad \limsup_{n\to\infty} E_{S_n \sim P^{\otimes n}}\big[\mathcal E(\hat f_{S_n}^{(n)}, P)\big] \le \mathcal E_{\mathcal G}^*(P).$$

It is consistent in probability if

$$\forall P \in \mathcal P,\ \forall \varepsilon > 0:\quad P_{S_n \sim P^{\otimes n}}\big[\mathcal E(\hat f_{S_n}^{(n)}, P) \ge \mathcal E_{\mathcal G}^*(P) + \varepsilon\big] \to 0, \text{ as } n \to \infty.$$

Finally, it is consistent almost surely if for any P ∈ 𝒫, an i.i.d. sequence (X_i, Y_i)_{i≥1} from distribution P and S_n = ((X₁, Y₁), …, (X_n, Y_n)), it holds that

$$\limsup_{n\to\infty} \mathcal E(\hat f_{S_n}^{(n)}, P) \le \mathcal E_{\mathcal G}^*(P), \quad\text{almost surely}.$$

The sequence of estimators is called universally consistent if the above holds for 𝒫 = all joint distributions on X × Y and G = all prediction functions.
In the case where we analyze consistency over a specific class G ⊊ F(X, Ỹ) of prediction functions, it is generally also assumed that the estimator f̂ outputs prediction functions belonging to G, though this does not have to be the case.

Simple example. Consider the regression with quadratic loss setting, where X is reduced to a singleton {0}, so that decision functions are given by a constant θ ∈ R, and the optimal prediction θ* is just the (marginal) expectation E[Y]. Then the ERM is the empirical average $\hat\theta_n = \frac1n \sum_{i=1}^n y_i$, which is consistent by the law of large numbers.

Consistency of ERM over a finite class.

Proposition 1.11. Let G ⊊ F(X, Ỹ) be a finite set of prediction functions and f̂^{(n)}_ERM be the ERM estimator on the set G using a sample of size n. Then (f̂^{(n)}_ERM)_{n≥1} is almost surely consistent over G and for all probability distributions on X × Y.

To establish this property, we first state the following elementary lemma.

Lemma 1.12. Let $(Z_n^{(k)})_{k \in \{1,\dots,K\},\, n \ge 1}$ be a finite family of sequences of real random variables such that for all k ∈ {1, …, K}, $Z_n^{(k)} \to 0$ almost surely as n → ∞. Then $U_n = \sup_{k \in \{1,\dots,K\}} Z_n^{(k)} \to 0$ almost surely as n → ∞.

Proof. Let A_k denote the event $\{\lim_{n\to\infty} Z_n^{(k)} = 0\}$. By assumption, P[A_k] = 1, so P[A_k^c] = 0, and therefore by the union bound

$$1 \ge P\Big[\bigcap_{k=1}^K A_k\Big] = 1 - P\Big[\bigcup_{k=1}^K A_k^c\Big] \ge 1 - \sum_{k=1}^K P[A_k^c] = 1,$$

so $P\big[\bigcap_{k=1}^K A_k\big] = 1$. Furthermore, for any ω in the event $\bigcap_{k=1}^K A_k$, by definition it holds that $Z_n^{(k)}(\omega) \to 0$ for all k ∈ {1, …, K} (in the usual deterministic sense of sequence convergence). By standard arguments on (deterministic) sequence convergence, this implies $\sup_{k \in \{1,\dots,K\}} Z_n^{(k)}(\omega) \to 0$ as n → ∞, which therefore also happens with probability 1.
Note: this lemma is no longer true as soon as one considers a countably infinite family of sequences of random variables. Can you point out where the argument in the proof fails: is it the “probabilistic” part or the “deterministic” (sequence convergence) part?
Proof of Proposition 1.11. We write G = {f₁, …, f_K}. We will only spell out the proof in the case where E(f_k) < ∞ for all k ∈ {1, …, K}. In this situation, for any k ∈ {1, …, K} we have Ê(f_k, S_n) → E(f_k) almost surely, by the law of large numbers, since $\widehat{\mathcal E}(f_k, S_n) = \frac1n \sum_{i=1}^n W_i^{(k)}$, with $W_i^{(k)} = \ell(f_k(X_i), Y_i)$, i ∈ N, which are i.i.d. random variables with expectation E(f_k). By Lemma 1.12, it follows that almost surely,

$$\sup_{k \in \{1,\dots,K\}} \big|\widehat{\mathcal E}(f_k) - \mathcal E(f_k)\big| \to 0 \quad\text{as } n \to \infty.$$

Now, remember that we cannot apply the law of large numbers to $\widehat{\mathcal E}(\hat f_{\mathrm{ERM}}^{(n)}, S_n)$, since it is an average of variables which are not i.i.d., because of the “double” dependence on the sample S_n.
However, by definition of the ERM estimator: (a) $\hat f_{\mathrm{ERM}}^{(n)} \in \mathcal G$ and (b) $\widehat{\mathcal E}(\hat f_{\mathrm{ERM}}^{(n)}) \le \widehat{\mathcal E}(f_k)$ for any k ∈ {1, …, K}. Furthermore, let us denote k* an index such that $\mathcal E(f_{k^*}) = \mathcal E_{\mathcal G}^*$. Using these properties, it holds that

$$0 \le \mathcal E(\hat f_{\mathrm{ERM}}^{(n)}) - \mathcal E_{\mathcal G}^* = \Big(\mathcal E(\hat f_{\mathrm{ERM}}^{(n)}) - \widehat{\mathcal E}(\hat f_{\mathrm{ERM}}^{(n)})\Big) + \underbrace{\Big(\widehat{\mathcal E}(\hat f_{\mathrm{ERM}}^{(n)}) - \widehat{\mathcal E}(f_{k^*})\Big)}_{\le 0} + \Big(\widehat{\mathcal E}(f_{k^*}) - \mathcal E(f_{k^*})\Big) \le 2 \sup_{f \in \mathcal G} \big|\widehat{\mathcal E}(f) - \mathcal E(f)\big|, \tag{1.4}$$

and we have seen that the latter quantity converges to 0 almost surely, hence the conclusion.
In the case where all prediction functions f ∈ G have infinite risk, E*_G = ∞ and there is nothing to prove. In the case where some prediction functions f ∈ G have infinite risk, note that if E(f) = ∞ for a fixed prediction function f, then the law of large numbers implies $\lim_{n\to\infty} \widehat{\mathcal E}(f, S_n) = \infty$ almost surely. The above arguments still work in this case with minor adaptations, which are left to the reader.
Please observe that inequality (1.4) is fundamental and will be used again later for a more detailed analysis of the ERM.

⋄ A more elaborate example of consistency analysis: the regressogram. We will analyze consistency in a fully nonparametric regression model with generating distribution (X, Y) such that

$$Y = f^*(X) + \varepsilon, \qquad E[\varepsilon] = 0, \quad E[\varepsilon^2] = \sigma^2, \quad \varepsilon \perp\!\!\!\perp X, \tag{1.5}$$

where we will consider X = [0, 1]; note that the marginal distribution P_X is left totally arbitrary by the above model. We will make the assumption that f* is an L-Lipschitz function. Recall that in the case of regression with quadratic loss, f*(x) = E[Y|X = x] is the optimal predictor.
The estimator f̂ we will consider, based on the observed sample S_n = (X_i, Y_i)_{1≤i≤n}, is a piecewise constant function on the equal-width intervals $I_j := \big[\frac{j-1}{K}, \frac{j}{K}\big)$, j = 1, …, K−1, and $I_K = \big[\frac{K-1}{K}, 1\big]$, where the choice of the number K of intervals will be determined later. Putting N_j := {1 ≤ i ≤ n : X_i ∈ I_j}, we define the regressogram estimator with K bins as

$$\hat f(x) := \sum_{j=1}^K \mathbb 1\{x \in I_j\}\, \hat a_j; \qquad \hat a_j := \frac{1}{|N_j| + 1} \sum_{i \in N_j} Y_i, \quad j = 1,\dots,K. \tag{1.6}$$
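A minimal sketch of (1.6) in NumPy (the smooth target f* and the noise level are illustrative choices, and the function name regressogram is our own); note the (|N_j| + 1) denominator, which avoids empty-bin divisions exactly as in the definition.

```python
import numpy as np

def regressogram(X_train, Y_train, K):
    """Fit the K-bin regressogram (1.6); returns a function x -> f_hat(x)."""
    bins = np.minimum((X_train * K).astype(int), K - 1)  # bin index of each X_i
    counts = np.bincount(bins, minlength=K)              # |N_j|
    sums = np.bincount(bins, weights=Y_train, minlength=K)
    a_hat = sums / (counts + 1)                          # note the +1 in the denominator
    return lambda x: a_hat[np.minimum((x * K).astype(int), K - 1)]

rng = np.random.default_rng(5)
n = 1000
X = rng.uniform(size=n)
f_star = lambda x: np.sin(2 * np.pi * x)                 # a Lipschitz target, for illustration
Y = f_star(X) + 0.3 * rng.normal(size=n)

f_hat = regressogram(X, Y, K=round(n ** (1 / 3)))        # K ~ n^{1/3}, cf. Proposition 1.13
x_grid = np.linspace(0, 1, 5)
print(f_hat(x_grid), f_star(x_grid))
```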

We will prove the following property of the above estimator.


Proposition 1.13. Under the regression model (1.5) for X = [0, 1] with f ∗ Lipschitz, the
regressogram estimator fbn given by (1.6) with K(n) bins from a sample of size n:

20
h i
• is consistent in average risk, i.e. satisfies ESn E(fbn ) − E ∗ −→ 0, as n −→ 0,
provided K(n) −→ ∞ but K(n) = o(n) as n −→ ∞;
h i 1
• satisfies ESn E(fn ) − E = O(n−2/3 ) if K(n) ∼ n 3 .
b ∗

Proof. To analyze the averaged quadratic risk of f̂ we write

$$E_{S_n}\big[\mathcal E(\hat f)\big] - \mathcal E^* = E_{S_n}\Big[E_X\big[(\hat f(X) - f^*(X))^2\big]\Big] = \sum_{j=1}^K E_{S_n} E_X\big[\underbrace{(\hat a_j - f^*(X))^2\, \mathbb 1\{X \in I_j\}}_{=:A_j}\big].$$

Note: we write $E_{S_n} E_X[\dots]$ to emphasize that the expectation is over a new independent point X and the random sample S_n ∼ P^{⊗n}; it would be more appropriate to write the inner expectation as a conditional expectation given S_n, but because X and S_n are independent, we use the above notation, which can be understood as an iterated integral. Furthermore, according to the generative model we can decompose the expectation over S_n as an expectation over (X_i)_{1≤i≤n} and (ε_i)_{1≤i≤n}; finally, since all these variables are independent, we can perform the expectations in any order we want, by Fubini's theorem.
Concentrating on one term, we have

$$A_j = (\hat a_j - f^*(X))^2\, \mathbb 1\{X \in I_j\} = \Bigg(\frac{1}{|N_j| + 1}\bigg(\sum_{i \in N_j}\big(f^*(X_i) + \varepsilon_i\big)\bigg) - f^*(X)\Bigg)^2 \mathbb 1\{X \in I_j\} = \frac{1}{(|N_j| + 1)^2}\Bigg(\underbrace{\sum_{i \in N_j}\big(f^*(X_i) - f^*(X)\big) - f^*(X)}_{=:W_j} + \underbrace{\sum_{i \in N_j} \varepsilon_i}_{=:Z_j}\Bigg)^2 \mathbb 1\{X \in I_j\}.$$

Taking expectations over (ε_i)_{1≤i≤n} first, we notice $E_{(\varepsilon_i)}\big[(W_j + Z_j)^2\big] = W_j^2 + E[Z_j^2] = W_j^2 + |N_j|\sigma^2$. For X_i, X ∈ I_j we have |f*(X_i) − f*(X)| ≤ L/K by the Lipschitz assumption on f*, and we can also write |f*(X)| ≤ M for some M, since a Lipschitz function must be bounded on a compact set. All in all, we thus have $W_j^2\, \mathbb 1\{X \in I_j\} \le \big(L\frac{|N_j|}{K} + M\big)^2 \mathbb 1\{X \in I_j\}$. We thus finally get, putting p_j := P[X ∈ I_j] and using (a + b)² ≤ 2(a² + b²):

$$E[A_j] \le E\Bigg[\frac{1}{(|N_j| + 1)^2}\bigg(\Big(L\frac{|N_j|}{K} + M\Big)^2 + |N_j|\sigma^2\bigg)\mathbb 1\{X \in I_j\}\Bigg] \le \frac{2L^2}{K^2}\, p_j + (\sigma^2 + 2M^2)\, p_j\, E\bigg[\frac{1}{|N_j| + 1}\bigg].$$

Since $|N_j| = \sum_{i=1}^n \mathbb 1\{X_i \in I_j\}$, it has a Binom(n, p_j) distribution and therefore satisfies $E\big[(|N_j| + 1)^{-1}\big] \le (p_j(n + 1))^{-1}$ (left as an exercise). Summing over j and using $\sum_{j=1}^K p_j = 1$, we finally get

$$E_{S_n}\big[\mathcal E(\hat f)\big] - \mathcal E^* \le 2\frac{L^2}{K^2} + (\sigma^2 + 2M^2)\frac{K}{n + 1}. \tag{1.7}$$

The conclusion of the proposition follows from the above inequality by plugging in K = K(n) according to the assumptions of the proposition, letting n go to infinity, and some standard estimates.
A few remarks are in order:

• The result of Proposition 1.13 holds whatever the marginal distribution of X is. Results in statistical learning theory generally try to avoid strong assumptions on the generative distribution; this is called a “distribution-free” approach.

• The strong assumption that was made is the Lipschitz character of the regression
function f ∗ .

• The estimator considered is very similar to (but not quite equal to) the ERM estimator over the model G_K of piecewise constant predictions on K equally sized intervals (in fact, the conclusion of the proposition would be essentially the same for the ERM estimator on G_K). Thus, this is an example where it is advantageous to select a prediction model G_K of limited “complexity” (even though we know the optimal prediction does not lie in this model), letting the complexity grow appropriately with the sample size n.

• The two terms appearing in the bound (1.7) can be interpreted as an approximation term (decreasing with K and independent of n; it comes from the fact that we can better approximate a Lipschitz function by a piecewise constant function as K grows, regardless of the sample) and an estimation term (growing with K but decreasing with n; it comes from the fact that we can average out the noise variance σ² provided we have enough points per interval, but the average number of points per interval decreases with K). This “balancing” phenomenon is very common in statistical learning theory; see the numerical sketch after this list.
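As a numerical illustration of the balance in (1.7) (the constants L, σ, M below are made up), the following sketch evaluates the bound over K and shows the minimizer scaling like n^{1/3}:

```python
import numpy as np

L, sigma, M = 1.0, 0.3, 1.0  # illustrative constants of the bound (1.7)

def bound(K, n):
    """Right-hand side of (1.7): approximation term + estimation term."""
    return 2 * L**2 / K**2 + (sigma**2 + 2 * M**2) * K / (n + 1)

for n in [100, 1_000, 10_000, 100_000]:
    Ks = np.arange(1, n + 1)
    K_best = Ks[np.argmin(bound(Ks, n))]
    print(n, K_best, K_best / n ** (1 / 3))  # the ratio stabilizes: K_best ~ n^{1/3}
```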

Exercise 1.5. Prove the following implications between the different types of consistency:

a) Almost sure consistency implies consistency in probability.

b) If the estimators f̂^{(n)} take their values in the set of decision functions G, consistency in expectation on G implies consistency in probability on G.

c) If the loss function is bounded, almost sure consistency implies consistency in expectation.

Hint: consider the sequence of random variables $Z_n := \mathcal E(\hat f_{S_n}^{(n)})$, and use fundamental relations and implications between the types of probabilistic convergence from a probability course. For (b), establish and then use the fact that $Z_n - \mathcal E_{\mathcal G}^* \ge 0$. For (c), use Fatou's lemma.

Exercise 1.6. How do the conclusions of Proposition 1.13 change if one assumes instead
that f ∗ is α-Hölder (α ∈ (0, 1])?

Exercise 1.7. Write explicitly an ERM estimator for the model GK defined above. How does
it differ from the estimator defined in (1.6)? (More technical:) how can Proposition 1.13
be adapted for the ERM estimator?

Exercise 1.8. Prove that $\hat f := f_{\hat\beta_{OLS}}$ is almost surely consistent in the model G of linear functions. Hint: recall the formula (1.3) for β̂_OLS. Compare it to the formula for β*_OLS (assume Σ invertible). Use the LLN.

1.5 Plug-in classification


In this section we analyze a simple example of a “plug-in” rule for classification, which “transfers” a decision problem to another one. Here, we transfer a regression estimate into a classification estimate. Let us recall the classification setting:

** Consider the binary classification case, Y = {0, 1}. In this case, if we denote η(x) := P[Y = 1|X = x], the (or more precisely “a”) Bayes classifier takes the form $f^*(x) = \mathbb 1\{\eta(x) \ge \frac12\}$, and $\mathcal E^* = E[\min(\eta(X), 1 - \eta(X))]$.

For various reasons it can be more convenient to estimate the probability function η(x) rather than the classifier f*(x), for instance by using a regression method aiming at a small risk for the quadratic loss (see Exercise 1.2 above). Let η̂ be such an estimate (we assume it is given beforehand and do not specify how it was obtained). A natural way to transform this estimate into an actual classifier is to “plug in” the estimate in place of η in the formula for f*, that is, to define

$$\hat f(x) = \mathbb 1\Big\{\hat\eta(x) \ge \frac12\Big\}.$$

A natural question is whether f̂ is a good estimate of f* whenever η̂ is a good estimate of η. Here, the “goodness” of an estimate will be measured via the excess risk to the optimal, for the classification risk and the squared-loss risk, respectively.

** Proposition 1.14 (Excess risk inequality for plug-in classification). Let ℓ denote the misclassification loss and h the quadratic loss. For any function η̂ estimating η, define the plug-in classifier f̂ as above. Then

$$0 \le \mathcal E_\ell(\hat f) - \mathcal E_\ell^* \le 2E\big[|\hat\eta(X) - \eta(X)|\big] \le 2E\big[(\hat\eta(X) - \eta(X))^2\big]^{\frac12} = 2\big(\mathcal E_h(\hat\eta) - \mathcal E_h^*\big)^{\frac12}.$$

Proof. Note that the first inequality of the claim follows from the definition of E*_ℓ, and the third is Jensen's inequality. As in the proof of Proposition 1.6, it is possible to argue conditionally to X = x for any fixed x, considering only expectations with respect to a (conditional) probability distribution over Y, and then integrate the obtained pointwise inequalities over X at the end. We therefore consider x as fixed, omit the dependence on x of f̂, f*, η̂, η, and treat them as constants.
We first establish the (sometimes useful) equality that for any classifier f, one has

$$\mathcal E_\ell(f) - \mathcal E_\ell^* = 2E\Big[\mathbb 1\{f(X) \ne f^*(X)\}\Big|\eta(X) - \frac12\Big|\Big]. \tag{1.8}$$

Indeed, as suggested above, we condition with respect to X = x and consider x as fixed. On the left-hand side, we have E[1{f ≠ Y} − 1{f* ≠ Y}|X = x]. Assume η ≥ 1/2 (the other case is of course similar); then f* = 1. If f = f* = 1, then obviously

$$E[\mathbb 1\{f \ne Y\} - \mathbb 1\{f^* \ne Y\}] = 0 = 2\,\mathbb 1\{f^* \ne f\}\Big|\eta - \frac12\Big|.$$

If f = 0, then

$$E[\mathbb 1\{f \ne Y\} - \mathbb 1\{f^* \ne Y\}] = \eta(1 - 0) + (1 - \eta)(0 - 1) = 2\eta - 1 = 2\,\mathbb 1\{f^* \ne f\}\Big|\eta - \frac12\Big|.$$

Thus (1.8) is established pointwise conditionally to X = x, thus also in expectation.
Now, consider the specific case $\hat f = \mathbb 1\{\hat\eta \ge \frac12\}$. Still conditioning with respect to X = x, we have pointwise

$$\mathbb 1\{f^* \ne \hat f\}\Big|\eta - \frac12\Big| \le |\hat\eta - \eta|. \tag{1.9}$$

Indeed, consider again the cases f* = f̂ (trivial since the left-hand side is 0) and f* ≠ f̂ (the inequality holds because η̂ and η must then be on opposite sides of 1/2).
Altogether, (1.8) and (1.9) (integrated over X) give the second inequality of the claim. Finally, for the last equality, notice again that, pointwise conditionally with respect to X = x, we have, since η(x) = E[Y|X = x]:

$$E\big[(\hat\eta - Y)^2\big] - E\big[(Y - E[Y])^2\big] = (\hat\eta - E[Y])^2 = (\hat\eta - \eta)^2$$

(where all expectations are to be understood w.r.t. $P_{Y|X}(\cdot|x)$), giving the last equality after integration w.r.t. X ∼ P_X.
Proposition 1.14 shows in particular that if the squared-loss excess risk of the estimate η̂ converges to zero, then so does the classification excess risk of the associated plug-in classifier rule. So convergence of η̂ to the optimal regression function η for the quadratic risk implies convergence of the plug-in estimate f̂ to the Bayes classifier, for the classification risk.
Remark: This implies that universal consistency of η̂ implies universal consistency of the associated plug-in rule f̂. However, for consistency over a specific model G, things are more complicated. The result of Proposition 1.14 does not hold in general if we replace the optimal Bayes risks E*_h, E*_ℓ by the optimal risks E*_{G,h}, E*_{G̃,ℓ} over an arbitrary model G of regression functions and the induced plug-in classification rules G̃, because the proof relied heavily on the fact that we considered excess risk with respect to the (unrestricted) optimal decision, for which we had an explicit formula. For a given model G, we can only expect this “transfer of consistency to the plug-in classifier” property if we limit consistency to the distribution model 𝒫_G containing only those distributions P such that the optimal decision belongs to G. (Restricting one's attention to a model and data distributions such that the optimal decision function belongs to that model is sometimes called the proper learning scenario.)

Exercise 1.9. Develop the reflections in the previous remark into a clean argument. Let G be a subset of F(X, [0, 1]) and $\widetilde{\mathcal G} = \big\{f: x \mapsto \mathbb 1\{g(x) \ge \frac12\},\ g \in \mathcal G\big\} \subset \mathcal F(\mathcal X, \{0, 1\})$ the set of plug-in classification rules induced by G. Justify that, if a sequence η̂^{(n)} of estimators is consistent (either in expectation, in probability or almost surely) for the model G and the set of distributions

$$\mathcal P_{\mathcal G} := \{P \in \mathcal P(\mathcal X \times \{0, 1\}) : \eta_P^* \in \mathcal G\}$$

(where η*_P is the function x ↦ P(Y = 1|X = x)), then the associated sequence of plug-in rules f̂^{(n)} is consistent for the model G̃ and the same set of distributions. (Use directly the result of Proposition 1.14.)
On the other hand, find a counter-example, as simple as possible, showing that the result of Proposition 1.14 does not hold if we replace the unrestricted optimal risks E*_h, E*_ℓ by the optimal risks E*_{G,h}, E*_{G̃,ℓ} over models G, G̃, even if we assume η̂ ∈ G. Hint: it is sufficient to exhibit a class G and a distribution P for which the optimal decision η*_{P,G} for the quadratic risk is such that the associated plug-in rule is not the optimal classifier in G̃. For this it is enough to consider a 2-point space X and a suitable class G with just two elements.

1.6 Negative results: the “no free lunch” theorem


In this section we show that we cannot hope to find a classification algorithm satisfying
a non-trivial “universal” bound on the classification risk when trained on a sample of size
n, for any n. Here “universal” means that the bound would hold for any data generating
distribution.

Theorem 1.15 (“No free lunch”). Let X be a set of infinite cardinality, Y = Ỹ = {0, 1}, and let the loss function be the 0-1 classification loss. Then for any n ∈ N_{>0} and any classification estimator f̂ acting on training samples of size n, it holds that

$$\sup_{P \in \mathcal P_{XY}:\, \mathcal E^*(P) = 0} E_{S_n \sim P^{\otimes n}}\big[\mathcal E(\hat f_{S_n}, P)\big] \ge \frac12,$$

where 𝒫_XY denotes the set of all probability distributions on X × Y.

Before proving the theorem, let us reflect on what it says.

• 1/2 is the risk of a classifier that would guess the class completely at random whatever the situation (such a “randomized” classifier would not, strictly speaking, enter into our framework, which only allows for fixed decision functions, but it is easy to see how to extend the setting to accommodate decision functions that include a random component). Thus, whatever the learning algorithm and the sample size, we can find a data distribution P such that our algorithm cannot be better by any margin ε > 0 than the “stupid” random-guess rule that learns nothing (although the Bayes optimal classifier for P has zero error).
• Should it mean that trying to learn anything is always doomed to failure? No, it just means that aiming at “universal learning” with non-trivial risk for any data generating distribution P and fixed sample size n is too ambitious. In order to get non-trivial risk bounds for fixed n, it will be necessary to restrict the class of possible data generating distributions (for example, by assuming that the optimal decision function lies in a certain restricted class G).
• Should it mean that “universal consistency” is impossible? It may seem surprising at first, but it does not. Namely, in the above theorem n is fixed, so that if we require the risk to be larger than 1/4 (say), the distribution P_n that will make the learning algorithm “fail” by having risk larger than 1/4 depends on n. It does not preclude that there exists a learning algorithm such that for any fixed data generating distribution P, the risk sequence $R_n(P) = \mathcal E(\hat f^{(n)}, P)$ converges to zero. In fact there exist such algorithms (at least for X = [0, 1]^d). What the theorem says, however, is that we can find P such that the n-th term of the sequence R_n(P) is arbitrarily close to 1/2 (which does not prevent the sequence from eventually converging to zero).
• in fact, this theorem seems quite intuitive and almost obvious. If we are allowed
any data generating distribution, we can imagine a distribution drawing uniformly
at random a point in a finite subset of X of size m ≫ n. Since the training sample is
of size n, we can only observe the true classification function on a set of probability at
most n/m ≪ 1, and on the complement of that set the true decision function can
be arbitrary and we have no information about it, so how could we hope to be better
than random guessing on that part? This is actually how the proof works, and seeing
how to capture this idea mathematically is enlightening (it also provides insightful ideas
for proving less obvious and more interesting related results establishing lower bounds
on the risk in a “worst-case” sense).
Proof of Theorem 1.15. Assume that the sample size n and the classification estimator
(learning algorithm) f̂ are fixed. Let m be some integer greater than n, whose value will be

discussed later. Since X is infinite, we can find a subset Xm = {t1 , . . . , tm } of size m in
X . For any r = (r1 , . . . , rm ) ∈ {0, 1}m , let Pr be the probability distribution on X × {0, 1}
such that:
• X is drawn uniformly at random in the set Xm ;
• Pr (Y = 1|X = ti ) = ri for any i = 1, . . . , m.
Note that E ∗ (Pr ) = 0 for any r.
The following step is a key idea for lower risk bounds. Rather than trying to find a
worst-case distribution for the estimator fb, we will consider its risk on average over the
family Pr when r is itself drawn at random. Clearly, the worst-case risk can only be higher
than this average:
\[
\sup_{P \in \mathcal{P}_{XY} : \mathcal{E}^*(P)=0} \mathbb{E}_{S_n \sim P^{\otimes n}}\bigl[\mathcal{E}(\hat f_{S_n}, P)\bigr]
\;\geq\; \sup_{r \in \{0,1\}^m} \mathbb{E}_{S_n \sim P_r^{\otimes n}}\bigl[\mathcal{E}(\hat f_{S_n}, P_r)\bigr]
\;\geq\; \mathbb{E}_r \mathbb{E}_{S_n \sim P_r^{\otimes n}}\bigl[\mathcal{E}(\hat f_{S_n}, P_r)\bigr],
\]
where the outer expectation is over r drawn uniformly at random over $\{0,1\}^m$. We now
rewrite and bound the right-hand side:
\[
\mathbb{E}_r \mathbb{E}_{S_n \sim P_r^{\otimes n}}\bigl[\mathcal{E}(\hat f_{S_n}, P_r)\bigr]
= \mathbb{E}_r \mathbb{E}_{S_n \sim P_r^{\otimes n}}\Bigl[\mathbb{E}_{(X,Y) \sim P_r}\bigl[\mathbf{1}\{\hat f_{S_n}(X) \neq Y\}\bigr]\Bigr]
= \mathbb{P}\bigl[\hat f_{S_n}(X) \neq Y\bigr],
\]
where the probability P is over the draw of everything (r uniform in $\{0,1\}^m$, $(S_n, (X,Y)) \sim P_r^{\otimes(n+1)}$). Next, by the law of total probability, we can write the latter probability as
\[
\mathbb{P}\bigl[\hat f_{S_n}(X) \neq Y\bigr] = \sum_{(x_1,\dots,x_n) \in \mathcal{X}_m^n} \mathbb{P}\bigl[\hat f_{S_n}(X) \neq Y \,\big|\, X_1 = x_1, \dots, X_n = x_n\bigr]\, \mathbb{P}[X_1 = x_1, \dots, X_n = x_n].
\]

We now bound the conditional probability in each of the terms of the previous sum; to
simplify notation, denote A := {X_i = x_i, i = 1, . . . , n}. Note that conditionally to A, the only
remaining randomness in the training sample S_n lies in the training labels, which are given by
the draw of the elements of the vector r corresponding to the elements x_1, . . . , x_n of X_m.
\[
\mathbb{P}\bigl[\hat f_{S_n}(X) \neq Y \,\big|\, A\bigr]
\geq \mathbb{P}\bigl[\hat f_{S_n}(X) \neq Y,\; X \notin \{x_1, \dots, x_n\} \,\big|\, A\bigr]
= \mathbb{P}\bigl[\hat f_{S_n}(X) \neq Y \,\big|\, X \notin \{x_1, \dots, x_n\},\, A\bigr]\, \mathbb{P}\bigl[X \notin \{x_1, \dots, x_n\} \,\big|\, A\bigr].
\]

It is intuitive and it can be checked (left as an exercise) that, conditionally to the event
A and X ∉ {x_1, . . . , x_n}, X is drawn uniformly at random in X_m \ {x_1, . . . , x_n}, and its
corresponding true label Y is drawn at random with probability 1/2 and is independent
of S_n and X, and therefore independent of f̂_{S_n}(X). Thus, the first conditional probability
above is 1/2. As for the second, it is larger than (1 − n/m). Backtracking, we get
\[
\mathbb{P}\bigl[\hat f_{S_n}(X) \neq Y\bigr] \geq \frac{1}{2}\Bigl(1 - \frac{n}{m}\Bigr),
\]
so this is also a lower bound on our initial supremum. Since we can choose m arbitrarily
large, the result follows.

2 Linear Discrimination:
A brief overview of classical methods
In this chapter, we will exclusively focus on the classification problem (also called discrim-
ination, especially in the older literature). The methods presented here are classical and not
particularly recent (the idea of Linear Discriminant Analysis dates back to R.A. Fisher in 1936,
Rosenblatt’s Perceptron is from 1956), but they still form the backbone of most machine
learning toolboxes nowadays and need to be known in order to understand more recent
developments. Note that we will not study the question of convergence or statistical
consistency in this chapter; this is relegated to later chapters.
Throughout we consider Y = {0, 1, . . . , K − 1}, with K ≥ 2, each element of Y being called a
class. There are K classes; the case K = 2 is called binary classification.
We will also always assume that the input space is X ⊂ Rd .

2.1 Linear discrimination functions

** Definition 2.1. The family of affine score functions (sy (.))y∈Y based on vectors w =
(wy )y∈Y ∈ RdK and constants b = (by )y∈Y ∈ RK is given by

sy (x) = ⟨wy , x⟩ + by , y ∈ Y.

An associated classification function (linear discriminant) is given by

\[
f_{w,b}(x) = \mathop{\mathrm{Arg\,Max}}_{y \in \mathcal{Y}} s_y(x). \tag{2.1}
\]

Note: Strictly speaking, in (2.1) we should break ties in some way in case the Arg Max
contains more than one element, i.e. two or more scores are equal. In order to avoid
uninteresting complications, we will always assume that in such cases the ties are broken
in favor of the smallest class, i.e. the Arg Max is always replaced by its smallest element;
note that it is always non-empty since the number of classes is finite.
If we denote x̃ := (x, 1) ∈ R^{d+1}, it holds ⟨w_y, x⟩ + b_y = ⟨w̃_y, x̃⟩, where w̃_y := (w_y, b_y).
Thus, if we consider the “augmented” input space R^{d+1}, the score functions become linear
functions of x̃. For this reason we will occasionally (depending on the context) drop the
constants b, and implicitly assume we have performed this augmentation operation. Also
for this reason, we will with some abuse of language talk about “linear” score functions
although they are strictly speaking affine.
In the binary classification case (K = 2), observe that we have
 
 
fw,b (x) = 1 ⟨w1 − w0 , x⟩ + (b1 − b0 ) ≥ 0 , (2.2)
 | {z } | {z } 
w b

(we assume ties are broken in favor of class 1 to simplify). Therefore, in the case of binary
classification, we will consider only a single score s(x) = ⟨w, x⟩ + b. (A similar
reduction in parameters can be achieved for K > 2, by choosing the class 0 as reference
and defining modified scores s′_y(x) = s_y(x) − s_0(x), y ∈ Y. But this breaks the symmetry
somewhat, so that generally the full parametrization is used.)
The goal of this chapter is to give an overview of classical methods to construct a linear
discrimination function from a training sample Sn = ((xi , yi ))1≤i≤n . The most natural
approach, if the standard classification loss is the target, is the associated ERM:
\[
(\hat w, \hat b) = \mathop{\mathrm{Arg\,Min}}_{(w,b) \in \mathbb{R}^{K(d+1)}} \widehat{\mathcal{E}}(f_{w,b}) = \mathop{\mathrm{Arg\,Min}}_{(w,b) \in \mathbb{R}^{K(d+1)}} \frac{1}{n}\sum_{i=1}^n \mathbf{1}\Bigl\{\mathop{\mathrm{Arg\,Max}}_{y \in \mathcal{Y}}\bigl(\langle w_y, x_i\rangle + b_y\bigr) \neq y_i\Bigr\}.
\]

Unfortunately (even in the binary classification case), the above minimization problem
is considered at best cumbersome and at worst quite intractable, in particular in high
dimension d, with large training data size n, etc. This is due essentially to the fact that
the empirical risk is a discontinuous, piecewise constant function of the parameters, so
that usual numerical optimization approaches such as (stochastic) gradient descent are not
applicable.
It is fair to say that ERM in the above form for classification is almost never used in
practice. Instead, several alternative approaches are used, falling roughly speaking into two
categories:

1. A particular class of input space X and/or generating distribution P_XY is assumed,
   allowing one to write the theoretically optimal linear classifier f∗ = f_{w∗,b∗} in an explicit
   way, and to use this knowledge to estimate the parameters directly from the data.
   This is sometimes called a generative approach.

2. One considers directly the linear score functions (with output in Ỹ = R^K), and
   uses a different loss ℓ on Ỹ × Y which lends itself better to optimization (typically
   because it is convex in the first variable). This approach is called the use of a “proxy
   loss”.

** 2.2 The naive Bayes classifier


The “naive Bayes” method falls into the category of generative approaches. The setting
assumes that the input space is X = {0, 1}d , i.e. the coordinates (also called “features”)
of the predictor x are binary. Furthermore, the main assumption is the following:

Assumption (NB): Under the generative distribution PXY , the coordinates of X


are independent conditionally to Y .

We then have the following result:

Proposition 2.2 (Naive Bayes, binary classification case). Assume X = {0, 1}^d, K = 2,
and assumption (NB) is satisfied. Then the optimal (Bayes) classifier function takes
the form
\[
f^*(x) = \mathbf{1}\Bigl\{\sum_{k=1}^d w^{(k)} x^{(k)} + b \geq 0\Bigr\},
\]
where, using the notation $p_{k,j} := \mathbb{P}\bigl[X^{(k)} = 1 \,\big|\, Y = j\bigr]$ and $\pi_j := \mathbb{P}[Y = j]$ (which are
assumed to belong to (0,1)):
\[
w^{(k)} := \log \frac{p_{k,1}(1 - p_{k,0})}{(1 - p_{k,1})\, p_{k,0}}\,; \qquad b := \sum_{k=1}^d \log \frac{1 - p_{k,1}}{1 - p_{k,0}} + \log \frac{\pi_1}{\pi_0}.
\]

Proof. From the general formula of the Bayes classifier we have


 
\[
f^*(x) = \mathbf{1}\{\mathbb{P}[Y = 1|X = x] \geq \mathbb{P}[Y = 0|X = x]\} = \mathbf{1}\Bigl\{\log \frac{\mathbb{P}[Y = 1|X = x]}{\mathbb{P}[Y = 0|X = x]} \geq 0\Bigr\}. \tag{2.3}
\]
We then notice:
\[
\mathbb{P}[Y = i|X = x] = \mathbb{P}[X = x|Y = i]\, \frac{\mathbb{P}[Y = i]}{\mathbb{P}[X = x]}, \tag{2.4}
\]
so that
\[
\log \frac{\mathbb{P}[Y = 1|X = x]}{\mathbb{P}[Y = 0|X = x]} = \log \frac{\mathbb{P}[X = x|Y = 1]}{\mathbb{P}[X = x|Y = 0]} + \log \frac{\mathbb{P}[Y = 1]}{\mathbb{P}[Y = 0]}.
\]
The second term above is log(π₁/π₀). As to the first term, using assumption (NB) and the
fact that all coordinates belong to {0, 1}, we have for i ∈ {0, 1}:
\[
\log \mathbb{P}[X = x|Y = i] = \sum_{k=1}^d \log \mathbb{P}\bigl[X^{(k)} = x^{(k)} \,\big|\, Y = i\bigr] = \sum_{k=1}^d \Bigl( x^{(k)} \log p_{k,i} + (1 - x^{(k)}) \log(1 - p_{k,i}) \Bigr)
= \sum_{k=1}^d \Bigl( x^{(k)} \log \frac{p_{k,i}}{1 - p_{k,i}} + \log(1 - p_{k,i}) \Bigr).
\]
Taking the difference of these expressions for i = 1 and i = 0 and replacing into (2.4), (2.3),
we obtain the announced result.

Estimation from data: in order to estimate a Naive Bayes classifier in practice, the
plug-in principle is used, that is, the theoretical parameters p_{k,i} and π_i are replaced in the
formula by their frequentist estimators from a sample S_n = ((x_i, y_i))_{1≤i≤n}:
\[
\hat\pi_i := \frac{|\{j : y_j = i\}|}{n}\,; \qquad \hat p_{k,i} := \frac{\bigl|\bigl\{j : (x_j^{(k)}, y_j) = (1, i)\bigr\}\bigr|}{|\{j : y_j = i\}|}\,,
\]
assuming the last denominator is nonzero (i.e. there exists at least one training example
in each class.)
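A minimal sketch of this plug-in estimation (Python; the function names are ours, and the clipping constant `eps` is a practical safeguard we add, since the formulas assume all probabilities lie strictly in (0, 1)):

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Plug-in estimation of the naive Bayes parameters of Proposition 2.2.

    X: (n, d) array with entries in {0, 1}; y: (n,) array with labels in {0, 1}.
    Returns the weight vector w and intercept b of the linear classifier.
    """
    eps = 1e-9  # keeps estimated probabilities inside (0, 1), as assumed above
    pi = np.array([np.mean(y == 0), np.mean(y == 1)])
    # p[j, k] is the frequentist estimate of P(X^(k) = 1 | Y = j)
    p = np.clip(np.array([X[y == j].mean(axis=0) for j in (0, 1)]), eps, 1 - eps)
    w = np.log(p[1] * (1 - p[0]) / ((1 - p[1]) * p[0]))
    b = np.sum(np.log((1 - p[1]) / (1 - p[0]))) + np.log(pi[1] / pi[0])
    return w, b

def predict_naive_bayes(X, w, b):
    # classifier of Proposition 2.2: predict 1 iff the linear score is >= 0
    return (X @ w + b >= 0).astype(int)
```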

Exercise 2.1. Generalize the above result of Proposition 2.2 to the case K > 2.

*** 2.3 Gaussian generative distribution: LDA and QDA


We consider in this section a generative approach where all the class-conditional distribu-
tions are Gaussian, i.e.

Assumption (GD):

PX|Y [•|Y = i] = N (mi , Σi ), i = 0, . . . , K − 1. (2.5)

We will also focus on the particular case of equal covariance matrices:

Assumption (GDEC): Like (GD), but where Σi = Σ for all i.

We will assume furthermore that the Gaussian distributions involved are non-degenerate,
i.e. the covariance matrices Σi have full rank. In this case the distributions have respective
densities
\[
f_i(x) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma_i|}} \exp\Bigl(-\frac{1}{2}(x - m_i)^T \Sigma_i^{-1} (x - m_i)\Bigr), \qquad i = 0, \dots, K-1,
\]

with respect to the Lebesgue measure on R^d, and from Exercise 1.3 we know that f∗(x) =
Arg Max_i (π_i f_i(x)) is a Bayes classifier, where π_i := P[Y = i]. Since the arg max is un-
changed by a monotone increasing transformation of the function to maximize, we can take
the logarithm and obtain:

Under (GD), the Bayes classifier is given by
\[
f^*_{\mathrm{QDA}}(x) = \mathop{\mathrm{Arg\,Max}}_{i} s_i(x), \quad \text{where } s_i(x) := -\frac{1}{2}(x - m_i)^T \Sigma_i^{-1}(x - m_i) - \frac{1}{2}\log|\Sigma_i| + \log \pi_i. \tag{2.6}
\]

Since the above score functions are quadratic polynomials in x, this is called Quadratic
Discriminant Analysis (QDA). The main quadratic term (x − m_i)^T Σ_i^{-1}(x − m_i) is
called the (squared) Mahalanobis distance of x to the center m_i of the Gaussian distribution
N(m_i, Σ_i).

Under (GDEC), we observe that the quadratic part of the above scores is identical for
all scores. We can therefore subtract it from all scores without changing the arg max as
above, and we obtain:

Under (GDEC), the Bayes classifier is given by
\[
f^*_{\mathrm{LDA}}(x) = \mathop{\mathrm{Arg\,Max}}_{i} s_i(x), \quad \text{where } s_i(x) := \bigl\langle x, \underbrace{\Sigma^{-1} m_i}_{w_i}\bigr\rangle + \underbrace{\log \pi_i - \frac{1}{2} m_i^T \Sigma^{-1} m_i}_{b_i}, \tag{2.7}
\]
which is a linear discrimination function called Linear Discriminant Analysis (LDA).


In the two-class case, as argued before it is enough to consider the difference of scores

Under (GDEC), in the binary classification case, the Bayes classifier is given by
\[
f^*_{\mathrm{LDA}} = \mathbf{1}\Bigl\{\bigl\langle x, \underbrace{\Sigma^{-1}(m_1 - m_0)}_{w_{\mathrm{LDA}}}\bigr\rangle + (b_1 - b_0) \geq 0\Bigr\},
\]
where b_0, b_1 are as in (2.7).

Estimation from data: similarly to the previous section, one uses a plug-in approach
wherein the unknown population parameters Σ_i, m_i, π_i are estimated by their frequentist
counterparts from a sample S_n and just “plugged into” formula (2.6),
resp. (2.7) (denoting n_i := |{j : y_j = i}|). Thus, for (GD):

\[
\hat\pi_i := \frac{|\{j : y_j = i\}|}{n}\,;
\]
\[
\hat m_i := \frac{1}{n_i} \sum_{j \,\text{s.t.}\, y_j = i} x_j\,;
\]
\[
\hat\Sigma_i := \frac{1}{n_i - 1} \sum_{j \,\text{s.t.}\, y_j = i} (x_j - \hat m_i)(x_j - \hat m_i)^T. \tag{2.8}
\]

Note that the (n_i − 1) in the denominator of Σ̂_i is there to make the estimator unbiased, assuming
n_i ≥ 2; see a classical statistics course. If n_i is used instead, it will not change much.
On the other hand, for (GDEC), one uses the so-called pooled estimator for the common
covariance matrix:
\[
\hat\Sigma := \frac{1}{n - K} \sum_{i=1}^n (x_i - \hat m_{y_i})(x_i - \hat m_{y_i})^T. \tag{2.9}
\]

Exercise 2.2. Prove that QDA and LDA using the above estimators from data are covariant
under (bijective) linear data transformations. More precisely, let x̃_i := A x_i, where A is an
invertible linear operator of R^d. Denote f̂ the classification function constructed by LDA
or QDA from S_n := ((x_i, y_i))_{1≤i≤n}, and f̃ the one constructed from S̃_n := ((x̃_i, y_i))_{1≤i≤n}.
Then f̂(x) = f̃(Ax) for all x ∈ R^d.

Practical tricks.
In practice, there are often some modifications to the above canvas for LDA and QDA:

1. While the ERM estimator for the 0-1 loss is infeasible as argued in the beginning
   of this chapter, in the binary classification case it is fairly easy to minimize, for a
   fixed w, the empirical classification error as a function of the constant b, b ↦ Ê(f_{w,b}).
   Namely, for fixed w the problem is reduced to a 1-dimensional one, and one only has
   to try for b the intermediate values of the reordered set {⟨x_j, w⟩, j = 1, . . . , n} and
   select the one minimizing the empirical classification error. This is often what is
   done in practice, so that the formula for the linear projection w_LDA = Σ^{-1}(m_1 − m_0)
   is often considered as the most important part of (binary) LDA, while the exact formula
   for the constant b is unimportant, because it is often replaced by the above ERM
   minimizer.

2. The most problematic part of LDA (and a fortiori QDA) in practice is the inversion
   of the estimated covariance matrix. If some of its estimated eigenvalues are close
   to 0, taking the inverse can be a highly unstable operation and lead to significant
   estimation errors and erratic behavior, especially if the dimension d is large. For
   this reason, instead of Σ̂ as defined in (2.8),(2.9), it is often suggested to use a
   “regularized” version of it, such as
\[
\tilde\Sigma := (1 - \lambda)\hat\Sigma + \lambda \hat\sigma^2 I_d,
\]
   or
\[
\tilde\Sigma := (1 - \lambda)\hat\Sigma + \lambda \hat D,
\]
   where $\hat D$ is the diagonal matrix formed with the diagonal entries $\hat\sigma_i^2 := \hat\Sigma_{ii}$ of $\hat\Sigma$
   (estimators of the variances of each coordinate), and $\hat\sigma^2 = d^{-1}\sum_{i=1}^d \hat\sigma_i^2$. Here λ ∈ [0, 1]
   is a so-called “shrinkage parameter” that has to be tuned, for instance by cross-
   validation (see next chapter).
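To make this concrete, here is a minimal sketch (Python; the names are ours) of binary LDA combining the pooled estimator (2.9) with the first shrinkage regularization above:

```python
import numpy as np

def fit_lda_binary(X, y, lam=0.1):
    """Plug-in binary LDA under (GDEC), with shrinkage-regularized covariance."""
    n, d = X.shape
    pi = np.array([np.mean(y == 0), np.mean(y == 1)])
    m = np.array([X[y == j].mean(axis=0) for j in (0, 1)])
    centered = X - m[y]                      # subtract each point's own class mean
    Sigma = centered.T @ centered / (n - 2)  # pooled estimator (2.9) with K = 2
    sigma2 = np.trace(Sigma) / d             # average of the coordinate variances
    Sigma_reg = (1 - lam) * Sigma + lam * sigma2 * np.eye(d)
    w = np.linalg.solve(Sigma_reg, m[1] - m[0])
    # b_1 - b_0 from (2.7), computed with the regularized covariance
    b = np.log(pi[1] / pi[0]) - 0.5 * (m[1] - m[0]) @ np.linalg.solve(Sigma_reg, m[1] + m[0])
    return w, b

def predict_lda(X, w, b):
    return (X @ w + b >= 0).astype(int)
```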

* 2.4 Classification as regression


In this approach we use a proxy loss function, namely the quadratic loss. More precisely, we
follow the principle explained in Example 1.4 (quadratic loss for multiclass classification).
A linear prediction function fw depending on parameters w = (w0 , . . . , wK−1 ) outputs the
scores in RK given by
fw (x) = (⟨x, w0 ⟩, . . . , ⟨x, wK−1 ⟩)

(as explained earlier in this chapter, we disregard the constant parameters by implic-
itly “augmenting” the data x to (x, 1)). Furthermore we consider the quadratic loss
ℓ(fw (x), y) = ∥fw (x) − ey ∥2 , where ek denotes the k-th canonical basis vector in RK .
The ERM is then given by
\[
\hat w = \mathop{\mathrm{Arg\,Min}}_{w \in \mathbb{R}^{dK}} \sum_{j=1}^n \bigl\| f_w(x_j) - e_{y_j} \bigr\|^2
= \mathop{\mathrm{Arg\,Min}}_{w \in \mathbb{R}^{dK}} \sum_{k=0}^{K-1} \sum_{j=1}^n \bigl(\langle w_k, x_j\rangle - \mathbf{1}\{y_j = k\}\bigr)^2.
\]

Since the above function to minimize separates into K sums each involving only w_k, each
sum can be minimized independently, giving rise to
\[
\hat w_k = \mathop{\mathrm{Arg\,Min}}_{w \in \mathbb{R}^d} \sum_{j=1}^n \bigl(\langle x_j, w\rangle - \mathbf{1}\{y_j = k\}\bigr)^2 = (X^T X)^{-1} X^T Y^{(k)},
\]
using the notation introduced below (1.3), and with $Y^{(k)} := (\mathbf{1}\{y_1 = k\}, \dots, \mathbf{1}\{y_n = k\}) \in \mathbb{R}^n$.
Thus in this approach we transform the initial multiclass classification problem into
K distinct “one-versus-all” binary classification problems (classify class k against all the
others), each using a different linear predictor (and we recall that the final class prediction
is the class attaining the maximum of these K linear scores).
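A minimal sketch of this reduction in Python (names are ours; we use a pseudo-inverse as a hedge against a singular X^T X, a detail the derivation above does not address):

```python
import numpy as np

def fit_ls_classifier(X, y, K):
    """One-versus-all least squares: column k of W solves the k-th regression problem."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])  # augmentation x -> (x, 1)
    Y = np.eye(K)[y]                               # row j is the one-hot vector e_{y_j}
    return np.linalg.pinv(Xa.T @ Xa) @ Xa.T @ Y    # (d+1, K) matrix of parameters

def predict_ls(X, W):
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.argmax(Xa @ W, axis=1)               # class attaining the max score
```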
Discussion: The unrestricted optimal prediction function for the loss function ℓ(y′, y) =
∥y′ − e_y∥² is given (again by separation into independent sums) by
\[
f^*(x) = (\mathbb{P}[Y = 0|X = x], \dots, \mathbb{P}[Y = K-1|X = x]) \in [0, 1]^K,
\]
and it can seem a rather questionable idea to try to approximate a function from R^d to
[0, 1]^K by a linear function, which is by definition unbounded. This is the main
reason why the quadratic loss approach is almost never used in practice with
linear prediction. Much more popular is logistic regression.

Exercise 2.3. In this exercise we use explicitly the affine representation with parameters
(w, b) for affine scores. We focus on the binary classification case (K = 2).
Establish that the above approach reduces to a single regression problem, since it
holds (ŵ_1, b̂_1) = (−ŵ_0, 1 − b̂_0).
Prove that the direction of the estimated vector ŵ_1 using the quadratic regression
approach coincides with the direction found by the (binary) LDA approach. (The constants
b found by the two approaches differ, and the above property no longer holds for
K ≥ 3.)

** 2.5 Linear logistic regression
As we have seen, usual quadratic regression models the class probability functions ηk (x) :=
P[Y = k|X = x] as linear functions, which is problematic. The idea of logistic regression
is to use a suitable transform, the “logit” transform, which is a log-ratio of probabilities.
Here we will use the class 0 as a reference and implicitly assume that ηk (x) ̸= 0 for all
k = 0, . . . , K − 1.

Define the ideal logistic score functions as
\[
s_k(x) := \log \frac{\mathbb{P}[Y = k|X = x]}{\mathbb{P}[Y = 0|X = x]}, \qquad k = 0, \dots, K-1 \tag{2.10}
\]
(note that of course s_0(x) is identically 0),

and observe that f ∗ (x) := Arg Maxk=0,...,K−1 sk (x) is an optimal classifier. The score
functions sk can range anywhere in R.

The principle of logistic regression is to model these as linear functions of x:

sk (x) = ⟨wk , x⟩, k = 0, . . . , K − 1, (2.11)

with w0 := 0 and (wi )1≤i≤K−1 arbitrary.

Proposition 2.3. If the logistic scores defined in (2.10) satisfy (2.11), then it holds con-
versely
\[
\eta_k(x) = \mathbb{P}[Y = k|X = x] = \frac{\exp\langle w_k, x\rangle}{\sum_{\ell=0}^{K-1} \exp\langle w_\ell, x\rangle}. \tag{2.12}
\]

Proof: left to the reader.


Observe that (2.12) constitutes a partial statistical model: it specifies the form of the
conditional distribution Pw (Y |X) of Y given X, depending on a finite number of parameters
w = (w1 , . . . , wK−1 ) ∈ Rd(K−1) . However, the distribution of X itself is not modeled and
can be arbitrary. This can be seen as a generative approach for conditional probabilities
only. Following the principle of the maximum likelihood estimation in classical statistics,
in order to estimate the parameters from a sample Sn = ((xi , yi )1≤i≤n ), it is proposed to
use a maximum conditional likelihood principle, i.e. find

\[
\hat w = \mathop{\mathrm{Arg\,Max}}_{w} L(w),
\]

where L(·) is the log-conditional-likelihood based on the sample S_n:
\[
L(w) := \log \mathbb{P}^{\otimes n}_{w}[Y_1 = y_1, \dots, Y_n = y_n \,|\, X_1 = x_1, \dots, X_n = x_n]
= \sum_{j=1}^n \log \mathbb{P}_w[Y_j = y_j | X_j = x_j]
= \sum_{j=1}^n \Bigl( \langle w_{y_j}, x_j \rangle - \log\Bigl(1 + \sum_{\ell=1}^{K-1} \exp\langle w_\ell, x_j\rangle\Bigr) \Bigr). \tag{2.13}
\]

Unlike quadratic regression, the expression (2.13) does not separate into independent sums
for each parameter: the optimization problem has to be solved for all parameters (vectors
wk ) jointly.
In the particular case of binary classification, this reduces to
\[
L(w) = \sum_{j=1}^n \bigl( y_j \langle w, x_j\rangle - \log(1 + \exp\langle w, x_j\rangle) \bigr). \tag{2.14}
\]

Under the above form (multi-class or binary), one can notice that −L(w) can be
interpreted (up to a factor 1/n) as an empirical risk with loss function (written in the binary case for simplicity)

** ℓlogit (f (x), y) := log(1 + exp f (x)) − yf (x), y ∈ {0, 1} (2.15)

which is then applied to linear prediction functions. In this sense, the logistic regression
can also be interpreted as a proxy loss method.
Practical implementation: The maximum of L(w), or equivalently the minimum of
−L(w), does not have a closed-form formula. Still, w 7→ −L(w) is a convex function, so
various methods of convex optimization can be used. If the dimension d is not too large,
a common approach is to use Newton-Raphson iterations
\[
w_{k+1} = w_k - t \underbrace{\Bigl(\frac{d^2 L}{dw\, dw^T}\Big|_{w_k}\Bigr)^{-1}}_{\text{inverse Hessian}} \underbrace{\frac{dL}{dw}\Big|_{w_k}}_{\text{gradient}}.
\]

After some calculations, these iterations can be put under a form where they are interpreted
as a repeated “weighted least squares”, where the weights are updated along the iterations.
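Below is a minimal sketch of such iterations in the binary case (Python; the unit step size and the small ridge term stabilizing the Hessian solve are our own practical safeguards, not part of the derivation above):

```python
import numpy as np

def fit_logistic_binary(X, y, n_iter=25, ridge=1e-8):
    """Newton-Raphson maximization of the conditional log-likelihood (2.14).

    X: (n, d) design matrix (augment with a constant column for an intercept);
    y: (n,) labels in {0, 1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # p_j = P_w[Y = 1 | X = x_j]
        grad = X.T @ (p - y)                    # gradient of -L(w)
        H = X.T @ (X * (p * (1 - p))[:, None])  # Hessian of -L(w)
        w -= np.linalg.solve(H + ridge * np.eye(d), grad)
    return w
```

Each iteration solves a linear system weighted by the factors p_j(1 − p_j), which is exactly the “weighted least squares” interpretation mentioned above.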

Exercise 2.4. For binary classification, it is common to encode the two classes as Y =
{−1, 1}, which is more symmetrical. With this convention, verify that the logit loss (2.15)
can be rewritten as
\[
\ell_{\mathrm{logit}}(f(x), y') = \log(1 + \exp(-y' f(x))), \qquad y' \in \{-1, 1\}.
\]

2.6 Hinge-loss based methods: Perceptron and Support Vector
Machine
We consider the binary classification setting with Y = {−1, 1}, Ye = R and the proxy loss
function called “hinge loss”
ℓε (y ′ , y) = (ε − yy ′ )+ ,
where (t)+ := max(0, t) is the positive part, and ε ≥ 0 a fixed parameter.

Interpretation as “large margin prediction”. In the case of linear binary classifi-


cation, a decision function fw,b of the form (2.2) separates the space into two half-spaces
separated by a (d − 1)-dimensional hyperplane with normal vector w and offset b from the
origin. If we assume ∥w∥ = 1, the corresponding score function s(x) = ⟨w, x⟩ + b can be
geometrically interpreted as the (signed) distance of point x to the separating hyperplane.
Because of this, the quantity s(x)y is often called the “margin” of prediction: a positive
margin indicates a correct classification and negative margin, a classification error; and
the absolute value of the margin indicates the distance to the separating hyperplane. As a
general rule, correct predictions that have margin too close to 0 are penalized, and incorrect
predictions are penalized all the more if the distance to the separating hyperplane is high.
Loss functions of the form ℓ(y, y ′ ) = L(yy ′ ) for various functions L : R → R+ , are often
called “margin-based loss functions”, and decision functions trained by minimizing the
associated risk “maximum margin classifiers”.

Perceptron. The perceptron algorithm (Rosenblatt, 1956) is based on a stochastic


gradient descent for the empirical risk using linear functions. More precisely, using the
above loss, the empirical risk based on the sample S_n can be written
\[
\widehat{\mathcal{E}}(f) = \frac{1}{n} \sum_{j \,\text{s.t.}\, f(x_j) y_j \leq \varepsilon} \bigl(\varepsilon - f(x_j) y_j\bigr),
\]
and for linear functions, the associated gradient (disregarding the non-differentiability at 0 of t ↦ (t)_+) is
\[
\partial_w \widehat{\mathcal{E}}(f_w) = \frac{1}{n} \sum_{j \,\text{s.t.}\, \langle w, x_j\rangle y_j \leq \varepsilon} (-y_j x_j).
\]

The perceptron algorithm is an early use of stochastic gradient descent: to avoid recom-
puting the above gradient at each optimization step, it is proposed to select randomly one
of the terms in the above sum and to make a step in that direction. This gives rise to the
following very simple procedure:

** Perceptron algorithm

1. Initialize w = 0.

2. Choose uniformly at random j ∈ {1, . . . , n}.

3. If ⟨w, xj ⟩yj > ε, return to step 2.

4. Update w ← w + yj xj .

5. Go to step 2.
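A direct transcription of this procedure in Python (the iteration budget replacing the endless loop of steps 2-5 is our own addition):

```python
import numpy as np

def perceptron(X, y, eps=0.0, max_iter=100_000, seed=0):
    """Rosenblatt's perceptron. X: (n, d); y: (n,) labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                 # step 1: initialize w = 0
    for _ in range(max_iter):
        j = rng.integers(len(y))             # step 2: pick a random index
        if X[j] @ w * y[j] <= eps:           # step 3: margin violated?
            w = w + y[j] * X[j]              # step 4: update
    return w
```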

Nowadays, the Perceptron algorithm is not so widely used; more modern approaches
such as the linear Support Vector Machine (see below) are preferred. One advantage that
remains is its simplicity.
A famous early result studying the convergence of the above
algorithm in the simplest case ε = 0, when the training data is linearly separable, is the
following.

** Theorem 2.4 (Novikov, 1962). Let Sn = ((xi , yi )1≤i≤n ) be a fixed training sample for
binary classification. Assume that

• The training sample is linearly separable with margin γ > 0, meaning:
\[
\exists w^* \in \mathbb{R}^d \text{ with } \|w^*\| = 1 \text{ such that } \forall i \in \{1, \dots, n\}: \langle w^*, x_i\rangle y_i \geq \gamma.
\]

• For all i = 1, . . . , n it holds ∥xi ∥ ≤ R.

Then: the Perceptron algorithm (run with ε = 0) finds a vector w separating perfectly
the data (i.e. such that ⟨w, xi ⟩yi > 0 for all i) after at most (R/γ)2 effective update
operations (i.e. passes through step 4).

* Support Vector Machine. The (linear) Support Vector Machine (Boser, Guyon, Vap-
nik 1992) uses the hinge loss with ε = 1. Several efficient algorithms, which are preferred
to the perceptron, have been developed to solve the resulting optimization problem;
we won’t go into detail here. It is nowadays a standard method in machine learning
toolboxes.

Exercise 2.5. Justify that in the binary classification case, the “classification as regression”
approach (Section 2.4) and the logistic regression approach (Section 2.5) both can be seen

as “large margin classification” methods, in the sense that they are based on minimizing
a certain margin-based loss (which can be written explicitly).

Exercise 2.6. Prove Thm. 2.4. Proceed as follows: let η > 0 be a fixed positive number,
and consider the quantity ∆2k := ∥wk − ηw∗ ∥2 , where wk denotes the vector w after the
kth effective update step. Compare ∆2k+1 to ∆2k , and choose an appropriate value of η
to establish that it must hold ∆2k ≤ ∆20 − kR2 , then conclude. Note: this proof is an
elementary example of the use of a well-chosen Lyapunov function decreasing along the
iterations (here ∥wk − ηw∗ ∥2 ). It is a powerful technique to study convergence of iterative
optimization schemes.

* 2.7 Regularization
In most linear methods, standard approaches tend to become unstable when the dimension
d is high. This has, roughly speaking, to do with the overfitting phenomenon: intuitively,
the number of free parameters is d and overfitting becomes more likely when there are
more parameters to fit relative to the amount of data available. A common approach to
“stabilize” ERM methods is to consider a regularized version
 
\[
\hat w_\lambda \in \mathop{\mathrm{Arg\,Min}}_{w \in \mathbb{R}^d} \Bigl( \widehat{\mathcal{E}}_\ell(f_w) + \lambda \Omega(w) \Bigr),
\]

where Ω : Rd → R is a regularization or penalization function which is generally assumed


to be convex, so that the above problem lends itself well to numerical optimization. A very
common choice is Ω(w) = ∥w∥2 , though many others can be found in the literature.
The methods considered in previous sections using ERM with the squared loss (Sec-
tion 2.4), the logit loss (Section 2.5), and the hinge loss (Section 2.6) all admit regularized
versions in this sense, which are generally the ones used in practice.
Let us study the particular case of the “classification as regression” approach in the binary
classification case (Y = {0, 1}), which reduces to usual linear regression (see (1.2)). The corresponding
regularized ERM is
\[
\hat w_\lambda = \mathop{\mathrm{Arg\,Min}}_{w \in \mathbb{R}^d} \Bigl( \sum_{i=1}^n (y_i - \langle w, x_i\rangle)^2 + \lambda \|w\|^2 \Bigr),
\]

and it can be checked that it takes the explicit form (compare to (1.3)):
\[
\hat w_\lambda = \Bigl( \hat\Sigma + \frac{\lambda}{n} I_d \Bigr)^{-1} \hat\gamma, \quad \text{where } \hat\Sigma := \frac{1}{n} \sum_{i=1}^n x_i x_i^T, \text{ and } \hat\gamma := \frac{1}{n} \sum_{i=1}^n x_i y_i; \tag{2.16}
\]
a classical equivalent form is $\hat w_\lambda = (X^t X + \lambda I_d)^{-1} X^t Y$, where X is the (n, d) ma-
trix whose rows are x_1^t, . . . , x_n^t, and Y = (y_1, . . . , y_n)^t. This is also often called ridge
regression.
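The equivalent closed form lends itself to a one-line implementation; a minimal sketch in Python (the names and the synthetic data are ours):

```python
import numpy as np

def ridge(X, y, lam):
    """Regularized least squares, w = (X^T X + lam * I_d)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# quick sanity check on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=50)
print(ridge(X, y, lam=1.0))  # approximately recovers the true coefficients
```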

The effect of regularization is therefore intuitively clear in the above formula: adding
λ > 0 on the diagonal of a matrix before inverting it should stabilize the inverse operation,
at the expense of adding some bias if λ is too large. A similar idea was used for regularized
LDA/QDA (Section 2.3).
In practice, the parameter λ > 0 has to be tuned in order to get a good compromise
between too much and too little regularization; we will see how in the coming chapter.

Exercise 2.7. Justify the formula (2.16) for regularized linear regression.

3 Introduction to statistical learning theory (part 2):
elementary bounds
3.1 Controlling the error of a single decision function and Hold-
Out principle
The main goal in this section is to get a theoretically justified control of the true risk
E(fb) of an estimator by using only observable quantities (i.e. that can be computed
from the available data). In statistical terms, this means we are looking for a confidence
(upper) bound on E(fb) (we are primarily interested in an upper bound, i.e. a guarantee
on the generalization error). A confidence bound is a quantity that can be computed from
the data, and is indeed larger than the (unknown) quantity of interest with a prescribed
probability (called coverage probability).
As we have discussed in Section 1.3, the empirical risk E(
b fb) is not (in general) a reliable
approximation of E(fb) because of the overfitting phenomenon: the same data is used to
learn fb and to evaluate the empirical risk, resulting in a bias (that can possibly be very
large).
On the other hand, we have argued that for a fixed decision function f, it holds $\mathbb{E}\bigl[\widehat{\mathcal{E}}(f)\bigr] = \mathcal{E}(f)$; moreover we have by the law of large numbers $\widehat{\mathcal{E}}(f) = \frac{1}{n}\sum_{i=1}^n Z_i \longrightarrow \mathcal{E}(f)$ in
probability (or a.s.) as n → ∞, where $Z_i := \ell(f(X_i), Y_i)$ are i.i.d.
This idea underlies the so-called “Hold-out” principle: learn f̂ and evaluate the empir-
ical error on different samples. Thus, if S_n and T_m = ((X′_i, Y′_i))_{1≤i≤m} are independent i.i.d.
samples, we consider the quantity
\[
\widehat{\mathcal{E}}^{\mathrm{HO}}(\hat f) = \widehat{\mathcal{E}}(\hat f_{S_n}, T_m) = \frac{1}{m} \sum_{i=1}^m \ell(\hat f_{S_n}(X'_i), Y'_i).
\]
In practice, S_n and T_m can be obtained from an arbitrary separation into two parts of a single
sample. S_n is called the learning sample and T_m the validation sample (or “hold-out” sample, since
it has been held out from the learning phase). Observe that conditionally to S_n, we can
consider f̂_{S_n} as a fixed function, thus $\mathbb{E}\bigl[\widehat{\mathcal{E}}(\hat f_{S_n}, T_m) \,\big|\, S_n\bigr] = \mathcal{E}(\hat f_{S_n})$; furthermore we can apply
the law of large numbers (LLN) conditionally to S_n.
This principle justifies that it is already of interest to get a mathematical control of the
error of a single fixed decision function f. For this, we need more elaborate tools than the
LLN, which gives a limiting value but no quantification of the speed of convergence. A
traditional (asymptotic) tool for this is the Central Limit Theorem (CLT), which we recall
here: if W_1, . . . , W_n are i.i.d. real random variables such that σ² = Var[W_1] exists, then it
holds
\[
\sqrt{n}\; \frac{\frac{1}{n}\sum_{i=1}^n (W_i - \mathbb{E}[W_1])}{\sigma} \longrightarrow \mathcal{N}(0, 1), \quad \text{in distribution, as } n \to \infty. \tag{3.1}
\]
This can be used for deriving asymptotic confidence intervals (in the statistical learning
context, remember the quantity of interest is typically E(f)). An asymptotic confidence

interval or bound has the property that the coverage inequality only converges asymptoti-
cally to a prescribed value, say 1 − α.
However, learning theory generally focuses on obtaining nonasymptotic bounds, that is,
whose validity (coverage probability) is ensured for any n. There are several motivations
for this, but we will in particular stress that the CLT (3.1) is not valid in a uniform
sense in general. Consider a situation where W_i(p) are i.i.d. Bernoulli with parameter p
(for instance, W_i(p) = 1{f(X_i) ≠ Y_i} for a classification problem and a given classifier
function f with generalization error p). If the limit in (3.1) were uniform with respect to
the parameter p, we could choose an arbitrary sequence p_n depending on n and still have
the convergence (3.1). Yet, it is known that for p_n = c/n, we have the Poisson limit
\[
\sum_{i=1}^n W_i\bigl(\tfrac{c}{n}\bigr) \longrightarrow \mathrm{Poisson}(c), \quad \text{in distribution, as } n \to \infty. \tag{3.2}
\]

Therefore, we deduce (by usual continuity arguments for the convergence in distribution,
and recalling that the variance of a Bernoulli variable with parameter p is p(1 − p)) that
\[
\sqrt{n}\; \frac{\frac{1}{n}\sum_{i=1}^n \bigl(W_i\bigl(\tfrac{c}{n}\bigr) - \mathbb{E}\bigl[W_1\bigl(\tfrac{c}{n}\bigr)\bigr]\bigr)}{\sigma} \longrightarrow \frac{\mathrm{Poisson}(c) - c}{\sqrt{c}}, \quad \text{in distribution, as } n \to \infty,
\]
which is obviously different from the standard CLT Gaussian limit. This situation could
in principle happen if we are given a sequence of classifiers f_n whose classification error
converges to zero fast enough as n → ∞.

Exercise 3.1. Prove the convergence (3.2) by considering the Laplace transform FS (λ) :=
E[exp(λS)] on both sides. It is known that the pointwise convergence of the Laplace
transform implies the convergence in distribution.

3.2 A sharp but inconvenient bound in the Bernoulli case: the


Clopper-Pearson bound
We consider in this section an exact confidence bound for the parameter of a Bernoulli
variable, which we recall is relevant for the classification problem in order to get a bound
on the generalization error E(f) = p ∈ [0, 1] of a fixed classifier f. In this situation,
W_i = 1{f(X_i) ≠ Y_i} is a Bernoulli variable with parameter p, and the number of errors on
the sample is $n\widehat{\mathcal{E}}(f) = \sum_{i=1}^n W_i$, which has a Binom(n, p) distribution.

* Proposition 3.1. Let $F(p, n, k) = \mathbb{P}[B_{n,p} \leq k] = \sum_{i \leq k} \binom{n}{i} p^i (1-p)^{n-i}$ be the cumulative
distribution function (cdf) of a Binom(n, p) variable B_{n,p}. Let
\[
B(u, n, \alpha) = \max\{p \in [0, 1] : F(p, n, u) \geq \alpha\}.
\]
Then given B_{n,p}, the quantity B(B_{n,p}, n, α) is an upper confidence bound on p at coverage
level 1 − α, which is to say
\[
\forall p \in [0, 1] : \mathbb{P}[p \leq B(B_{n,p}, n, \alpha)] \geq 1 - \alpha.
\]

To prove this, we’ll use the following classical result:

** Lemma 3.2. Let X be a real-valued variable, and F_X(t) := P[X ≤ t] its cdf. Then
\[
\forall \alpha \in [0, 1] : \mathbb{P}[F_X(X) \leq \alpha] \leq \alpha,
\]
with equality if F_X is a continuous function. We say that F_X(X) is stochastically lower
bounded by a Unif([0, 1]) variable (and is exactly distributed as uniform on [0, 1] if F_X is
continuous).
(Proof of the Lemma). Let s := sup{x : F_X(x) ≤ α}. We consider two cases:
(1) s is a maximum: then F_X(s) = α (since F_X is right-continuous), and F_X(x) ≤ α ⇔
x ≤ s by monotonicity of F_X. Then
\[
\mathbb{P}[F_X(X) \leq \alpha] = \mathbb{P}[X \leq s] = F_X(s) = \alpha.
\]
(2) s is not a maximum: then F_X(x) ≤ α ⇔ x < s, and
\[
\mathbb{P}[F_X(X) \leq \alpha] = \mathbb{P}[X < s] = \lim_{x \nearrow s} F_X(x) \leq \alpha.
\]

Note that case (2) can only happen if FX is not (left)-continuous at point s, hence if FX
is continuous we are always in case (1).
(Proof of the Proposition). Observe that p > B(u, n, α) ⇒ F(p, n, u) < α, by definition of
B(u, n, α). Hence
\[
\mathbb{P}[p > B(B_{n,p}, n, \alpha)] \leq \mathbb{P}[F(p, n, B_{n,p}) < \alpha] \leq \alpha,
\]
where the last inequality is from the Lemma.
The above confidence bound, called the Clopper-Pearson bound, is sharp because it is based
on the exact inversion of the cdf. However, it is neither easy to understand qualitatively nor
to manipulate (although it is straightforward to compute numerically, as sketched below).
We will now consider a method to construct more explicit bounds.
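A minimal numerical sketch (Python with scipy; the names are ours), inverting the binomial cdf with a root finder:

```python
from scipy.stats import binom
from scipy.optimize import brentq

def clopper_pearson_upper(u, n, alpha):
    """B(u, n, alpha) = max{p in [0, 1] : P[Binom(n, p) <= u] >= alpha}."""
    if u >= n:
        return 1.0
    # p -> P[Binom(n, p) <= u] decreases continuously from 1 (p = 0) to 0 (p = 1)
    return brentq(lambda p: binom.cdf(u, n, p) - alpha, 0.0, 1.0)

# e.g. 3 errors of a fixed classifier on 100 points: 95% upper bound on its risk
print(clopper_pearson_upper(3, 100, 0.05))
```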

3.3 The Chernov method


The Chernov method, also called the Laplace transform method or the exponential moment
method, is a generic device allowing one to obtain non-asymptotic confidence bounds.

*** Proposition 3.3. Let X be a real-valued random variable, and define successively for
λ ∈ R:
\[
F_X(\lambda) := \mathbb{E}[\exp(\lambda X)] \in (0, \infty]\,; \quad \text{(Laplace transform)}
\]
\[
\Psi_X(\lambda) := \log F_X(\lambda) \in (-\infty, \infty]\,; \quad \text{(log-Laplace transform)}
\]
\[
\Psi^*_X(t) := \sup_{\lambda \in \mathbb{R}} (\lambda t - \Psi_X(\lambda)) \in [0, \infty]\,. \quad \text{(Legendre dual of } \Psi_X\text{, Cramér transform)}
\]
Then for any t ≥ E[X]:
\[
\mathbb{P}[X \geq t] \leq \exp\bigl(-\Psi^*_X(t)\bigr). \tag{3.3}
\]
Furthermore, if X_1, . . . , X_n are i.i.d. with the same distribution as X and $S_n := \sum_{i=1}^n X_i$,
then for any u ≥ E[X]:
\[
\mathbb{P}\Bigl[\frac{1}{n} S_n \geq u\Bigr] \leq \exp\bigl(-n \Psi^*_X(u)\bigr). \tag{3.4}
\]

Note: the principle of the proof is so simple and useful that it is as important as
the theorem itself.

Proof. Let us begin by assuming that λ ≥ 0, so that x ↦ exp(λx) is a nondecreasing
function. We have for any t ∈ R:
\[
\mathbb{P}[X \geq t] \leq \mathbb{P}[\exp(\lambda(X - t)) \geq 1] \quad \text{(inequality only to account for the case } \lambda = 0\text{)}
\]
\[
\leq \mathbb{E}[\exp(\lambda(X - t))] \quad \text{by Markov's inequality}
\]
\[
= F_X(\lambda) \exp(-\lambda t) = \exp(\Psi_X(\lambda) - \lambda t).
\]
Optimizing over λ ≥ 0, we thus get
\[
\mathbb{P}[X \geq t] \leq \exp\Bigl(-\sup_{\lambda \geq 0}(\lambda t - \Psi_X(\lambda))\Bigr).
\]

Now we justify why we can replace the above $\sup_{\lambda \geq 0}$ by a supremum over all λ ∈ R,
provided t ≥ E[X]. Observe that in this case, for λ < 0, by Jensen's inequality we have
\[
\Psi_X(\lambda) - \lambda t = \log \mathbb{E}[\exp(\lambda X)] - \lambda t \geq \lambda(\mathbb{E}[X] - t) \geq 0.
\]
Hence we have (since a probability is always less than 1!)
\[
\mathbb{P}[X \geq t] \leq \min\Bigl(1, \exp\Bigl(-\sup_{\lambda \geq 0}(\lambda t - \Psi_X(\lambda))\Bigr)\Bigr)
\leq \exp\Bigl(-\max\Bigl(\sup_{\lambda < 0}(\lambda t - \Psi_X(\lambda)),\; \sup_{\lambda \geq 0}(\lambda t - \Psi_X(\lambda))\Bigr)\Bigr)
= \exp\bigl(-\Psi^*_X(t)\bigr).
\]

In the case where we are interested in the deviations of $\frac{1}{n}S_n = \frac{1}{n}\sum_{i=1}^n X_i$ with X_1, . . . , X_n
i.i.d., notice that $F_{S_n}(\lambda) = F_X(\lambda)^n$ by independence – a crucial property of the Laplace trans-
form (like the characteristic function) – hence $\Psi_{S_n}(\lambda) = n\Psi_X(\lambda)$ and $\Psi^*_{S_n}(t) = n\Psi^*_X\bigl(\frac{t}{n}\bigr)$.
Applying Chernov's bound (3.3) to S_n with t = nu yields (3.4).

[Figure 1: The function t ↦ D(t, p) and its lower bound t ↦ 2(p − t)² (left: p = 0.3,
right: p = 0.03); see Exercise 3.2.]

Chernov’s method for a Binomial random variable. An important application of


Chernov’s method is the following theorem bounding the deviations of a binomial random
variable from its mean.

* Proposition 3.4. Let $D(t, p) := t \log\bigl(\frac{t}{p}\bigr) + (1 - t) \log\frac{1-t}{1-p}$, defined for (t, p) ∈ [0, 1] ×
(0, 1) (with the convention 0 log 0 = 0). Let B_p denote a Binom(n, p) random variable.
Then
\[
t \geq p \;\Rightarrow\; \mathbb{P}\Bigl[\frac{1}{n} B_p \geq t\Bigr] \leq \exp(-n D(t, p)); \tag{3.5}
\]
\[
t \leq p \;\Rightarrow\; \mathbb{P}\Bigl[\frac{1}{n} B_p \leq t\Bigr] \leq \exp(-n D(t, p)). \tag{3.6}
\]
This implies, denoting $\hat p = \frac{1}{n} B_p$, for any α ∈ (0, 1]:
\[
\mathbb{P}\Bigl[\hat p \geq p \text{ and } D(\hat p, p) \geq \frac{-\log \alpha}{n}\Bigr] \leq \alpha; \tag{3.7}
\]
\[
\mathbb{P}\Bigl[\hat p \leq p \text{ and } D(\hat p, p) \geq \frac{-\log \alpha}{n}\Bigr] \leq \alpha; \tag{3.8}
\]
\[
\mathbb{P}\Bigl[D(\hat p, p) \geq \frac{-\log \alpha}{n}\Bigr] \leq 2\alpha. \tag{3.9}
\]

The function q ↦ D(q, p) is convex with a minimum of 0 at q = p (see Figure 1), hence it
is decreasing on q ∈ [0, p] and increasing on q ∈ [p, 1].

Exercise 3.2. Prove the quadratic lower bound D(q, p) ≥ 2(p − q)2 (see Fig. 1).

Proof of (3.5)-(3.6). We apply (3.4) and thus only need to compute Ψ∗_X(t) where X is a
Bernoulli variable of parameter p. In this case
\[
F_X(\lambda) = p \exp(\lambda) + (1 - p) \quad \text{and} \quad \Psi_X(\lambda) = \log((1 - p) + p \exp(\lambda)).
\]
We deduce that $\Psi'_X(\lambda) = p \exp(\lambda) / ((1 - p) + p \exp(\lambda))$; the solution of $\Psi'_X(\lambda) = t \in (0, 1)$
is thus $\lambda = \log \frac{(1-p)t}{p(1-t)}$, so that $\Psi^*_X(t) = D(t, p)$ for t ∈ (0, 1) (and $\Psi^*_X(t) = \infty$
otherwise), giving the conclusion.
Let us now prove the implication from (3.5) to (3.7). Recall that the function t 7→
D(t, p) is nonnegative and (strictly) increasing from [p, 1] onto [0, log(1/p)]. Hence we have
the following equality of events, for any t ≥ p:

\[
\{\hat p \geq t\} = \{D(\hat p, p) \geq D(t, p) \text{ and } \hat p \geq p\}.
\]

For any u ∈ [0, log(1/p)], if we choose t as the unique solution in the interval [p, 1] of
D(t, p) = u, rewriting (3.5) in the light of the above event equality yields

\[
\mathbb{P}[\hat p \geq p \text{ and } D(\hat p, p) \geq u] \leq \exp(-nu).
\]

The latter inequality still holds trivially true if u > log(1/p), since in this case the
considered event has probability zero (since D(t, p) ≤ log(1/p) for all t ≥ p). Taking
u = − log(α)/n yields (3.7).
Similarly (3.6) implies (3.8), and (3.7)-(3.8) imply (3.9) by a union bound.
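As a complement (our own illustration, not part of the development above): bound (3.8) yields a computable upper confidence bound on p, namely the largest p ≥ p̂ with D(p̂, p) ≤ log(1/α)/n, which can be found by bisection since p ↦ D(p̂, p) is increasing on [p̂, 1]. A sketch in Python:

```python
import numpy as np

def kl_bernoulli(t, p):
    """D(t, p) = t log(t/p) + (1 - t) log((1 - t)/(1 - p)), with 0 log 0 = 0."""
    p = min(max(p, 1e-12), 1 - 1e-12)
    out = 0.0
    if t > 0:
        out += t * np.log(t / p)
    if t < 1:
        out += (1 - t) * np.log((1 - t) / (1 - p))
    return out

def kl_upper_bound(p_hat, n, alpha, tol=1e-10):
    """Largest p >= p_hat with D(p_hat, p) <= log(1/alpha)/n (level 1 - alpha)."""
    level = np.log(1.0 / alpha) / n
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(p_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

print(kl_upper_bound(0.03, 100, 0.05))  # e.g. empirical error 3% on 100 samples
```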

3.4 Sub-Gaussian random variables and Hoeffding’s inequality


For a binomial variable, the Chernov bound gives an easier-to-manipulate non-asymptotic
bound than the Clopper-Pearson interval. However, the function D(q, p) is somewhat cum-
bersome to handle (even if it can be lower bounded by a quadratic function). Furthermore,
we would like to have a deviation control applying to more general variables.
We will start with the following definition:

*** Definition 3.5 (Sub-Gaussian random variable). A sub-Gaussian real-valued random
variable X with parameter σ² is such that
\[
\forall \lambda \in \mathbb{R} : F_{(X - \mathbb{E}[X])}(\lambda) = \mathbb{E}\bigl[\exp\bigl(\lambda(X - \mathbb{E}[X])\bigr)\bigr] \leq \exp\Bigl(\frac{\sigma^2 \lambda^2}{2}\Bigr).
\]

Using Prop. 3.3, we obtain immediately the following deviation control:

Proposition 3.6. Let X be a sub-Gaussian variable with parameter σ². Then for any
α ∈ (0, 1]:
\[
\mathbb{P}\bigl[X \geq \mathbb{E}[X] + \sigma\sqrt{2\log(\alpha^{-1})}\bigr] \leq \alpha;
\]
\[
\mathbb{P}\bigl[X \leq \mathbb{E}[X] - \sigma\sqrt{2\log(\alpha^{-1})}\bigr] \leq \alpha;
\]
\[
\mathbb{P}\bigl[|X - \mathbb{E}[X]| \geq \sigma\sqrt{2\log(\alpha^{-1})}\bigr] \leq 2\alpha.
\]

Proof. Without loss of generality we may assume E[X] = 0. We apply Prop. 3.3: we
have by assumption $F_X(\lambda) \leq \exp(\lambda^2\sigma^2/2)$, hence it holds $\Psi^*_X(t) \geq t^2/(2\sigma^2)$. Solving
$\exp(-t^2/(2\sigma^2)) = \alpha$ for t, we obtain the first inequality. The second inequality is obtained by
applying the same argument to −X, and the final one is a union bound applied with the
events appearing in the first two inequalities.
It is straightforward to check that a sum of independent sub-Gaussian variables X_i with
parameters σ_i² is itself sub-Gaussian with parameter $\sigma^2 = \sum_i \sigma_i^2$. It follows:

** Corollary 3.7. If X_1, . . . , X_n are independent sub-Gaussian variables with respective
parameters σ_1², . . . , σ_n², then putting $S_n := \sum_{i=1}^n X_i$, for any α ∈ (0, 1):
\[
\mathbb{P}\Biggl[\frac{S_n}{n} \geq \mathbb{E}\Bigl[\frac{S_n}{n}\Bigr] + \sigma\sqrt{\frac{2\log\alpha^{-1}}{n}}\Biggr] \leq \alpha; \qquad
\mathbb{P}\Biggl[\frac{S_n}{n} \leq \mathbb{E}\Bigl[\frac{S_n}{n}\Bigr] - \sigma\sqrt{\frac{2\log\alpha^{-1}}{n}}\Biggr] \leq \alpha, \tag{3.10}
\]
where $\sigma^2 := \frac{1}{n}\sum_{i=1}^n \sigma_i^2$.

The following proposition links boundedness to the sub-Gaussian property and is the
foundation of Hoeffding’s inequality coming next:

*** Proposition 3.8. A random variable with values in [a, b] is sub-Gaussian with pa-
rameter (b − a)2 /4. In particular, a random variable X such that |X| ≤ B a.s. is
sub-Gaussian with parameter σ 2 = B 2 .

Combining this with Corollary 3.7, we directly obtain:

Corollary 3.9 (Hoeffding's inequality). Let X_1, . . . , X_n be independent random vari-
ables taking values in the interval [a, b] with |b − a| ≤ B. Then for any t ≥ 0:
\[
\mathbb{P}\Biggl[\frac{1}{n}\sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \geq t\Biggr] \leq \exp\Bigl(-\frac{2nt^2}{B^2}\Bigr);
\]
equivalently, for any α ∈ (0, 1]:
\[
\mathbb{P}\Biggl[\frac{1}{n}\sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \geq B\sqrt{\frac{-\log\alpha}{2n}}\Biggr] \leq \alpha.
\]

Proof of Proposition 3.8. Put B := |b − a|. Without loss of generality (i.e. possibly re-
placing X by X′ = X − (a+b)/2), we can assume X takes values in the interval [−B/2, B/2].
We start with upper bounding Ψ_X(λ). We notice that
\[
\Psi'_X(\lambda) = \frac{\mathbb{E}[X \exp(\lambda X)]}{\mathbb{E}[\exp(\lambda X)]},
\]
\[
\Psi''_X(\lambda) = \frac{\mathbb{E}[X^2 \exp(\lambda X)]}{\mathbb{E}[\exp(\lambda X)]} - \Bigl(\frac{\mathbb{E}[X \exp(\lambda X)]}{\mathbb{E}[\exp(\lambda X)]}\Bigr)^2,
\]
where the justification of the exchange of expectation (over X) and derivation (over λ) is
left as an exercise. We deduce that Ψ_X(0) = 0, Ψ′_X(0) = E[X], and Ψ″_X(λ) ≤ B²/4 (observe
that Ψ″_X(λ) is the variance of X under the “tilted” distribution with density proportional to
exp(λX), and a variable taking values in an interval of length B has variance at most B²/4).
By Taylor's formula with exact remainder:
\[
\Psi_X(\lambda) = \Psi_X(0) + \lambda \Psi'_X(0) + \frac{\lambda^2}{2} \Psi''_X(c) \leq \lambda \mathbb{E}[X] + \frac{\lambda^2 B^2}{8}.
\]
Finally observe that $\Psi_{(X - \mathbb{E}[X])}(\lambda) = \Psi_X(\lambda) - \lambda\mathbb{E}[X] \leq \frac{\lambda^2}{2}\cdot\frac{B^2}{4}$, which is the sub-Gaussian
property with parameter B²/4.

Notation: To lighten notation in the sequel, we will use the following shortcut
notation:
\[
\varepsilon(\delta, n) := \sqrt{\frac{-\log(\delta)}{2n}}. \tag{3.11}
\]

Hoeffding’s inequality allows us to bound with high probability the risk of a single
prediction function from its empirical risk, provided the loss function is bounded:

*** Corollary 3.10. Consider a prediction setting where the loss function ℓ : Ỹ × Y →
[0, B] is bounded by B > 0. Let f be a fixed prediction function, E(f) its risk and Ê(f)
its empirical risk with respect to an i.i.d. sample S_n of size n. Then for any δ ∈ (0, 1),
it holds with probability at least 1 − δ with respect to the draw of the sample S_n:
\[
\mathcal{E}(f) \leq \widehat{\mathcal{E}}(f) + B\varepsilon(\delta, n). \tag{3.12}
\]
Similarly, it holds with probability at least 1 − δ:
\[
\widehat{\mathcal{E}}(f) \leq \mathcal{E}(f) + B\varepsilon(\delta, n); \tag{3.13}
\]
and it holds with probability at least 1 − δ:
\[
\bigl|\widehat{\mathcal{E}}(f) - \mathcal{E}(f)\bigr| \leq B\varepsilon(\delta/2, n). \tag{3.14}
\]

Proof. The first inequality is an immediate consequence of Hoeffding's inequality, since
$\widehat{\mathcal{E}}(f) = \frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i)$: put $W_i := \ell(f(X_i), Y_i)$; then the W_i are i.i.d., taking values
in [0, B], with $\mathbb{E}[W_i] = \mathcal{E}(f)$. The second inequality is derived similarly, using −W_i instead
of W_i, and the third is obtained from the first two, each applied with δ′ = δ/2, and a
union bound argument.
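Combining (3.12) with the Hold-out principle of Section 3.1 gives a directly computable guarantee; here is a minimal sketch for the 0-1 loss (B = 1), with hypothetical prediction and label arrays:

```python
import numpy as np

def holdout_upper_bound(y_pred, y_val, delta):
    """Hoeffding upper confidence bound (3.12) on the risk of a fixed classifier,
    evaluated on an independent validation sample of size m (0-1 loss, B = 1)."""
    m = len(y_val)
    emp_risk = np.mean(np.asarray(y_pred) != np.asarray(y_val))
    return emp_risk + np.sqrt(np.log(1.0 / delta) / (2 * m))

# e.g. 12 errors on a hold-out sample of size 200, confidence level 95%
y_val = np.zeros(200)
y_pred = np.zeros(200); y_pred[:12] = 1
print(holdout_upper_bound(y_pred, y_val, delta=0.05))  # about 0.06 + 0.087
```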

3.5 Uniform bounds over a finite class of prediction functions


We now turn to the situation where we wish to get bounds on the risks of several prediction
functions using their empirical risks, uniformly over a class F. To be specific, from Corollary 3.10
we know that (when the loss function is bounded by B)
\[
\forall f \in \mathcal{F} : \mathbb{P}\bigl[\mathcal{E}(f) \leq \widehat{\mathcal{E}}(f) + B\varepsilon(\delta, n)\bigr] \geq 1 - \delta; \tag{3.15}
\]
however we wish to have a statement of the form
\[
\mathbb{P}\bigl[\forall f \in \mathcal{F} : \mathcal{E}(f) \leq \widehat{\mathcal{E}}(f) + \mathcal{B}(f, \delta)\bigr] \geq 1 - \delta, \tag{3.16}
\]
for some appropriate bound function $\mathcal{B}$. Notice the difference between (3.15) and (3.16):
the second form guarantees that all bounds, simultaneously, are valid with probability
at least 1 − δ, while there is no such uniformity in the first statement.
We will start with a finite class F.

*** Proposition 3.11. Assume a prediction problem with bounded loss function taking
values in [0, B]. Let F be a finite set of prediction functions of cardinality K. Assume
empirical risks are computed on an i.i.d. sample S_n of size n. Then it holds with
probability at least 1 − δ with respect to the draw of the sample S_n:
\[
\forall f \in \mathcal{F} : \mathcal{E}(f) \leq \widehat{\mathcal{E}}(f) + B\sqrt{\frac{\log K}{2n}} + B\varepsilon(\delta, n). \tag{3.17}
\]
Similarly, it holds with probability at least 1 − δ:
\[
\forall f \in \mathcal{F} : \widehat{\mathcal{E}}(f) \leq \mathcal{E}(f) + B\sqrt{\frac{\log K}{2n}} + B\varepsilon(\delta, n); \tag{3.18}
\]
and it holds with probability at least 1 − δ:
\[
\forall f \in \mathcal{F} : \bigl|\mathcal{E}(f) - \widehat{\mathcal{E}}(f)\bigr| \leq B\sqrt{\frac{\log 2K}{2n}} + B\varepsilon(\delta, n). \tag{3.19}
\]

Proof. This is a direct consequence of the union bound. Denote A_{f,δ} the event where
the bound E(f) ≤ Ê(f) + Bε(δ, n) is satisfied. From Corollary 3.10 and (3.15), it holds
that $\mathbb{P}[A^c_{f,\delta}] \leq \delta$ for all f ∈ F and any δ ∈ (0, 1). Therefore
\[
\mathbb{P}\Bigl[\bigcap_{f \in \mathcal{F}} A_{f,\delta}\Bigr] = 1 - \mathbb{P}\Bigl[\bigcup_{f \in \mathcal{F}} A^c_{f,\delta}\Bigr] \geq 1 - \sum_{f \in \mathcal{F}} \mathbb{P}\bigl[A^c_{f,\delta}\bigr] \geq 1 - K\delta.
\]
Replacing δ by δ/K in the above inequality, we therefore have
\[
\mathbb{P}\Bigl[\bigcap_{f \in \mathcal{F}} A_{f,\delta/K}\Bigr] \geq 1 - \delta.
\]
This gives the first announced inequality, just noticing that
\[
\varepsilon\Bigl(\frac{\delta}{K}, n\Bigr) = \sqrt{\frac{\log K + \log \delta^{-1}}{2n}} \leq \sqrt{\frac{\log K}{2n}} + \varepsilon(\delta, n).
\]
The second and third inequalities are obtained similarly from the corresponding single-
function inequalities of Corollary 3.10 and a union bound.
A first consequence of Proposition 3.11 is that we can derive a bound on the risk of
an estimator f̂ taking values in a finite class F.

*** Corollary 3.12. Consider the same assumptions as in Proposition 3.11. Let f̂ be an
estimator taking its values in the finite class F of cardinality K. Then with probability
at least 1 − δ over the draw of the sample S_n, it holds
\[
\mathcal{E}(\hat f_{S_n}) \leq \widehat{\mathcal{E}}(\hat f_{S_n}, S_n) + B\sqrt{\frac{\log K}{2n}} + B\varepsilon(\delta, n). \tag{3.20}
\]

Proof. We consider the event in (3.17), whose probability is at least 1 − δ. If this event is
satisfied, we can in particular specialize the inequality, which holds for all f ∈ F, to the
particular choice f̂ ∈ F (note that it does not matter that f̂ depends on the data now).
This yields the conclusion.
We now consider more specifically the ERM over the finite class F, and get the following
result as a further consequence of Proposition 3.11.

*** Proposition 3.13. Assume a prediction problem with bounded loss function taking
values in [0, B]. Let F = {f_1, . . . , f_K} be a finite set of prediction functions of cardinality K.
Assume empirical risks are computed on an i.i.d. sample S_n of size n. Consider the ERM
estimator f̂ = f_{k̂}, where
\[
\hat k \in \mathop{\mathrm{Arg\,Min}}_{1 \leq k \leq K} \widehat{\mathcal{E}}(f_k).
\]
Then for any δ ∈ (0, 1) it holds with probability at least 1 − δ:
\[
\mathcal{E}(f_{\hat k}) \leq \mathcal{E}^*_{\mathcal{F}} + 2B\sqrt{\frac{\log K}{2n}} + 2B\varepsilon(\delta/2, n). \tag{3.21}
\]

Observe that inequality (3.21) relates the risk of the ERM to the best risk in the class,
$\mathcal{E}^*_{\mathcal{F}} = \min_{1 \leq k \leq K} \mathcal{E}(f_k)$; hence it is an excess risk bound with respect to the class F. It is
trivial but useful to rewrite it in the following way to bound the excess risk with respect
to all possible decision functions:
\[
\mathcal{E}(f_{\hat k}) - \mathcal{E}^* \leq (\mathcal{E}^*_{\mathcal{F}} - \mathcal{E}^*) + 2B\sqrt{\frac{\log K}{2n}} + 2B\varepsilon(\delta/2, n). \tag{3.22}
\]
Observe in particular the role of n (the sample size) and of K (the size of the class F). As
expected, a larger “complexity” of the class (in this simple scenario the complexity is simply
the cardinality) results in a worse bound for the excess risk in (3.21); however we can
expect that a more complex (i.e. larger) class also has an advantage, in that it will have a
smaller value for $(\mathcal{E}^*_{\mathcal{F}} - \mathcal{E}^*)$ in the bound (3.22).

The good news is that the cardinality K of the class enters only logarithmically in the
bound. For instance, taking K = n^p (for any fixed p, possibly much larger than 1) still
results in a bound on the excess risk behaving in $O(\sqrt{\log(n)/n})$.
Proof. We apply the first inequality of Proposition 3.11 with δ replaced by δ/2, so that the
following event has probability at least 1 − δ/2:
\[
\forall f \in \mathcal{F} : \mathcal{E}(f) \leq \widehat{\mathcal{E}}(f) + B\sqrt{\frac{\log K}{2n}} + B\varepsilon(\delta/2, n). \tag{3.23}
\]
We can in particular apply the bound (3.23)
to any specific f ∈ F, even data-dependent, as is f_{k̂}; this is precisely the purpose of having
a uniform bound. Thus, we deduce that
\[
\mathcal{E}(f_{\hat k}) \leq \widehat{\mathcal{E}}(f_{\hat k}) + B\sqrt{\frac{\log K}{2n}} + B\varepsilon(\delta/2, n). \tag{3.24}
\]
Let $k^* \in \mathop{\mathrm{Arg\,Min}}_{1 \leq k \leq K} \mathcal{E}(f_k)$ be fixed, and apply Hoeffding's inequality ((3.13) of
Corollary 3.10, with δ/2) to the fixed function f_{k^*}, so that with probability at least 1 − δ/2:
\[
\widehat{\mathcal{E}}(f_{k^*}) \leq \mathcal{E}(f_{k^*}) + B\varepsilon(\delta/2, n). \tag{3.25}
\]
By a union bound, both events hold simultaneously with probability at least 1 − δ, which we
assume from now on.
Now using (3.24), (3.25), and the definitions of f_{k̂} (minimizing the empirical risk) and f_{k^*}
(minimizing the risk), respectively, we obtain
\[
\mathcal{E}(f_{\hat k}) \leq \widehat{\mathcal{E}}(f_{\hat k}) + B\sqrt{\frac{\log K}{2n}} + B\varepsilon(\delta/2, n)
\leq \widehat{\mathcal{E}}(f_{k^*}) + B\sqrt{\frac{\log K}{2n}} + B\varepsilon(\delta/2, n)
\leq \mathcal{E}(f_{k^*}) + 2B\sqrt{\frac{\log K}{2n}} + 2B\varepsilon(\delta/2, n)
= \min_{1 \leq k \leq K} \mathcal{E}(f_k) + 2B\sqrt{\frac{\log K}{2n}} + 2B\varepsilon(\delta/2, n).
\]

** Hold-out estimator selection. The situation where we consider a finite set of candi-
date prediction functions is found commonly when we have several estimators fb1 , . . . , fbK ,
amongst which we want to select one. We assume here that these estimators are “black
boxes”, i.e. they may be overfitting the data they are trained on. To select one of these
we again use the idea of Hold-Out presented at the beginning of the chapter:
1. Obtain two independent samples Sn , Tm of i.i.d. data. (Possibly split an existing
sample into two separate sub-samples to do so).

2. Train each of the estimators fb1 , . . . , fbK using the sample Sn , resulting in the decision
functions fb1,Sn , . . . , fbK,Sn .

3. Pick an estimator index
\[
\hat k \in \mathop{\mathrm{Arg\,Min}}_{1 \leq k \leq K} \widehat{\mathcal{E}}(\hat f_{k,S_n}, T_m). \tag{3.26}
\]

Observe that, conditionally to S_n (i.e. considering the sample S_n “fixed”), the decision
functions f̂_{1,S_n}, . . . , f̂_{K,S_n} can be considered as fixed too (since they depend only on S_n,
and not on T_m). Therefore, (3.26) can be seen as an ERM method conditional to S_n, over
the class $\mathcal{F}(S_n) := \{f_k = \hat f_{k,S_n},\ k = 1, \dots, K\}$.
This Hold-Out Selection method is commonly used in the case where we have an esti-
mation method f̂_λ depending on a “tuning parameter” λ (see for instance the regularization
methods introduced in Section 2.7) that we want to choose in order to minimize the risk.
In this case we can restrict the values of λ to a discretized set {λ_1, . . . , λ_K} and use the
previous strategy with f̂_k = f̂_{λ_k}.

* Cross-validation. In practice, it is considered somewhat wasteful to split the data into
a training sample S_n and a “validation sample” T_m, as in the Hold-Out method. We
mention the very common cross-validation approach, which can be seen as an elaboration
of the hold-out. We still assume we have a family of estimators f̂_1, . . . , f̂_K.

• Fix an integer V ≥ 2.

• Given a sample S_n of size n, split it into (approximately) equal-size, disjoint sub-samples
S^{(1)}, . . . , S^{(V)}, each of size ≥ ⌊n/V⌋.

• Denote S^{(−j)} the sample made of the union of $(S^{(i)})_{i \neq j}$.

• Define
\[
\hat f_k^{(-j)} = \hat f_{k, S^{(-j)}}
\]
the k-th estimator trained on all the samples except those of S^{(j)}.

• Let
\[
\hat k \in \mathop{\mathrm{Arg\,Min}}_{1 \leq k \leq K} \sum_{j=1}^V \widehat{\mathcal{E}}\bigl(\hat f_k^{(-j)}, S^{(j)}\bigr). \tag{3.27}
\]

• As a final estimator, pick f̂ := f̂_{k̂,S_n}, which corresponds to estimator k̂ trained on the
entire sample S_n.

Note that (3.27) can be seen as a “multiple” hold-out where the disjoint samples S^{(−j)}, S^{(j)}
take the roles of training and validation samples in the standard Hold-out, for all j =
1, . . . , V in turn.
Each term in the sum (3.27) has expectation $\mathcal{E}(\hat f_{k,n'})$, where $\hat f_{k,n'}$ denotes estimator f̂_k
trained using a sample of size $n' = n\frac{(V-1)}{V}$, and it is possible to apply the previous arguments
based on Hoeffding's inequality to bound it. However, the hope is that the “aggregation”
of hold-out errors in (3.27) results in a yet sharper estimate.
A precise theoretical analysis of cross-validation, and of why it may outperform Hold-out
in practice, is a delicate subject that we won't touch here. (Notice in particular that we
compare the empirical risks of estimators f̂_k trained on samples of size n(V − 1)/V, while
the final estimator is trained on the total sample of size n. Therefore, at a minimum one
must make some kind of assumption that the estimators trained on the full sample
and on a sub-sample are “close” in some way.)
Still, cross-validation is the “default” method in practice to pick the tuning parameters
of a learning method. The particular case V = n is called “leave-one-out” (one removes,
in turn, a single data point of the sample, trains each estimator on the remaining (n − 1)
data, and monitors its error on the point that has been left out).
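A minimal sketch of the V-fold selection rule (3.27) in Python; here `estimators` is a hypothetical list of training procedures, each mapping a sample to a decision function:

```python
import numpy as np

def cross_validation_select(estimators, X, y, V=5, seed=0):
    """Return the index minimizing the V-fold error (3.27) and the final estimator
    retrained on the full sample."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), V)     # sub-samples S^(1..V)
    errors = np.zeros(len(estimators))
    for j in range(V):
        train = np.concatenate([folds[i] for i in range(V) if i != j])  # S^(-j)
        for k, fit in enumerate(estimators):
            f = fit(X[train], y[train])        # estimator k trained on S^(-j)
            errors[k] += np.mean(f(X[folds[j]]) != y[folds[j]])
    k_hat = int(np.argmin(errors))
    return k_hat, estimators[k_hat](X, y)      # hat-k trained on the entire sample
```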

3.6 Uniform bounds over a countable class of prediction functions; regularized ERM
In this section we present an extension of the arguments used previously, allowing us to deal
with countably infinite classes.

** Proposition 3.14. Assume a prediction problem with bounded loss function taking
values in [0, B]. Let F be a finite or countably infinite set of prediction functions.
Assume that π is a set of real weights over F, such that π(f) ∈ [0, 1] and:
\[
\sum_{f \in \mathcal{F}} \pi(f) \leq 1. \tag{3.28}
\]
Assume empirical risks are computed on an i.i.d. sample S_n of size n. Then for any
δ ∈ (0, 1], it holds with probability at least 1 − δ with respect to the draw of the sample
S_n:
\[
\forall f \in \mathcal{F} : \mathcal{E}(f) \leq \widehat{\mathcal{E}}(f) + B\sqrt{\frac{\log(\pi(f)^{-1})}{2n}} + B\varepsilon(\delta, n). \tag{3.29}
\]

Proof. The proof is a minor variation on the proof of Prop. 3.11. Denote $t(\delta, n) = B\sqrt{\log(\delta^{-1})/(2n)}$, and A_{f,δ} the event where the bound E(f) ≤ Ê(f) + t(π(f)δ, n) is
satisfied. From Corollary 3.10 and (3.15), it holds that $\mathbb{P}[A^c_{f,\delta}] \leq \pi(f)\delta$ for all f ∈ F and
any δ ∈ (0, 1). Therefore
\[
\mathbb{P}\Bigl[\bigcap_{f \in \mathcal{F}} A_{f,\delta}\Bigr] = 1 - \mathbb{P}\Bigl[\bigcup_{f \in \mathcal{F}} A^c_{f,\delta}\Bigr] \geq 1 - \sum_{f \in \mathcal{F}} \mathbb{P}\bigl[A^c_{f,\delta}\bigr] \geq 1 - \delta \sum_{f \in \mathcal{F}} \pi(f) \geq 1 - \delta.
\]
Elementary bounding of t(π(f)δ, n) then yields the claim.
Remarks.
• It is an extension of Proposition 3.11, since the latter can be obtained via the uniform
weight choice π(f ) = 1/K for all f ∈ F (finite class).

• It is always better to have the constraint (3.28) satisfied with equality (if not, replace
π(f) by $\pi'(f) = \pi(f)/\bigl(\sum_{g \in \mathcal{F}} \pi(g)\bigr)$, which improves (3.29)). With the equality
constraint, it is possible to interpret π as a discrete probability distribution over F.
However, since there is no probabilistic argument used, we prefer the term “weights”.

• The choice of π is arbitrary, but must be fixed in advance (i.e. it cannot depend
on the data). One can interpret log(π(f)⁻¹) as a “complexity” of the function f, but it
is not an intrinsic notion since we can choose π freely. Rather, when we pick π we
have to decide which functions are considered “less complex” (= are given a higher
weight), and we are limited by the sum-1 constraint: we can consider many functions
as “complex” but not too many as “simple”.

• The fact that we have a bound on an infinite set of functions is somewhat of an
illusion: if F is countably infinite, it means that only a finite number of functions can
have weight larger than any given t. On the other hand, we observe that (3.29) is
trivial for all functions f with π(f) ≤ exp(−2n) (since the risk is always trivially
bounded by B). So for any given n, bound (3.29) gives non-trivial information only
for a finite subset of functions of F. On the other hand, the set of functions with a
non-trivial bound grows with n (and will contain any given function f with π(f) > 0
for n big enough). In a way, this can be seen as a way to have a class F_n that depends
on n; but since it is implicit, it is more elegant.

Regularized ERM. Assume that we work under the same assumptions as Proposi-
tion 3.14. Bound (3.29) cannot be used to analyze the ERM on the (countably infinite)
class F in a meaningful way because, if we try to follow the argument of the proof of
Prop. 3.13 (we recommend it as an exercise to see where it fails), we will have a remaining
term of the form $B\sqrt{-\log \pi(\hat f_{\mathrm{ERM}})/(2n)}$ in the bound; and we have no means to control it,
it can be arbitrarily large.
Instead, we can use bound (3.29) to design an estimator which will consist in minimizing
that bound (for a fixed choice of π). Let us therefore define the regularized ERM (based
on the regularization induced by π):
\[
\hat f^\pi \in \mathop{\mathrm{Arg\,Min}}_{f \in \mathcal{F}} \Biggl( \widehat{\mathcal{E}}(f) + B\sqrt{\frac{-\log(\pi(f))}{2n}} \Biggr). \tag{3.30}
\]

Lemma 3.15. If the loss function ℓ(·, ·) takes its values in [0, B], the estimator f̂^π
always exists, i.e. the Arg Min in (3.30) is never empty.
Proof. The property is obvious if F is finite; so here we assume that F is countably
infinite. Since the weights π(f), f ∈ F are positive and summable, they can be ranked in
nonincreasing order, i.e. there exists an indexation F = {f_i, i ∈ N} such that π(f_i) forms
a nonincreasing sequence. Since ℓ(·, ·) ∈ [0, B], it holds Ê(f) ∈ [0, B] for all f ∈ F.
It follows that
\[
\widehat{\mathcal{E}}(f_0) + B\sqrt{\frac{-\log(\pi(f_0))}{2n}} \leq B\Biggl(1 + \sqrt{\frac{-\log(\pi(f_0))}{2n}}\Biggr),
\]
while for all f_i ∈ F:
\[
\widehat{\mathcal{E}}(f_i) + B\sqrt{\frac{-\log(\pi(f_i))}{2n}} \geq B\sqrt{\frac{-\log(\pi(f_i))}{2n}}.
\]
Let $i^* = \min\bigl\{i \in \mathbb{N} : \sqrt{-\log(\pi(f_i))} > \sqrt{2n} + \sqrt{-\log(\pi(f_0))}\bigr\}$ (which exists since π(f_i) → 0);
then for all i ≥ i^* it holds
\[
\widehat{\mathcal{E}}(f_i) + B\sqrt{\frac{-\log(\pi(f_i))}{2n}} \geq B\sqrt{\frac{-\log(\pi(f_i))}{2n}} > \widehat{\mathcal{E}}(f_0) + B\sqrt{\frac{-\log(\pi(f_0))}{2n}}.
\]
It follows that a minimum for the expression (3.30) exists, and must be attained for some
f_i with i < i^*.

Proposition 3.16. Consider the same setting as in Proposition 3.14 and let f̂^π be the
estimator defined by (3.30). Then for any δ ∈ (0, 1], with probability at least 1 − δ it holds
\[
\mathcal{E}(\hat f^\pi) \leq \min_{f \in \mathcal{F}} \Biggl( \mathcal{E}(f) + 2B\sqrt{\frac{-\log(\pi(f))}{2n}} \Biggr) + 2B\varepsilon(\delta/2, n). \tag{3.31}
\]

Proof. The proof is very similar to that of Prop. 3.13. We apply Prop. 3.14 and obtain that with probability at least $1-\delta/2$, we have
$$\forall f \in \mathcal F : \quad \mathcal E(f) \le \hat{\mathcal E}(f) + B\sqrt{\frac{\log(\pi(f)^{-1})}{2n}} + B\varepsilon(\delta/2, n). \tag{3.32}$$
Let $f_*^\pi$ achieve the minimum on the right-hand side of (3.31). Note that the minimum exists by the same type of argument as in Lemma 3.15 (replacing the role of the empirical risk by the true risk). The decision function $f_*^\pi$ is non-random and we can apply the simple Hoeffding inequality to it, so that with probability at least $1-\delta/2$ it holds
$$\hat{\mathcal E}(f_*^\pi) \le \mathcal E(f_*^\pi) + B\varepsilon(\delta/2, n). \tag{3.33}$$
Now using (3.32), (3.33) (holding simultaneously with probability at least $1-\delta$), and the definition of $\hat f_\pi$, $f_*^\pi$, respectively, we obtain
$$\begin{aligned}
\mathcal E(\hat f_\pi) &\le \hat{\mathcal E}(\hat f_\pi) + B\sqrt{\frac{-\log \pi(\hat f_\pi)}{2n}} + B\varepsilon(\delta/2, n)\\
&\le \hat{\mathcal E}(f_*^\pi) + B\sqrt{\frac{-\log \pi(f_*^\pi)}{2n}} + B\varepsilon(\delta/2, n)\\
&\le \mathcal E(f_*^\pi) + B\sqrt{\frac{-\log \pi(f_*^\pi)}{2n}} + 2B\varepsilon(\delta/2, n)\\
&= \min_{f\in\mathcal F}\left(\mathcal E(f) + B\sqrt{\frac{-\log \pi(f)}{2n}}\right) + 2B\varepsilon(\delta/2, n).
\end{aligned}$$

In particular, we obtain the following consistency property on a countable family of decision functions.

Corollary 3.17. Consider the same setting as in Propositions 3.14 and 3.16, and let π be a set of strictly positive real weights on the countable family F satisfying (3.28). Let $\hat f_n^\pi$ be the estimator defined by (3.30) from an i.i.d. sample $S_n$ of size n. Then the sequence $\hat f_n^\pi$ is consistent in probability on F, that is, $\mathcal E(\hat f_n^\pi)$ converges to $\mathcal E^*_{\mathcal F}$ in probability as $n \to \infty$.

Proof. Let $t > 0$ be fixed. Let $f_t^* \in \mathcal F$ be such that $\mathcal E(f_t^*) \le \inf_{f\in\mathcal F}\mathcal E(f) + \frac t2$. Let us now fix any $\delta \in (0,1)$. Then event (3.31) (holding with probability at least $1-\delta$) implies that
$$\mathcal E^*_{\mathcal F} \le \mathcal E(\hat f_n^\pi) \le \mathcal E(f_t^*) + 2B\sqrt{\frac{-\log(\pi(f_t^*))}{2n}} + 2B\varepsilon(\delta/2, n) \le \mathcal E^*_{\mathcal F} + \frac t2 + 2B\sqrt{\frac{-\log(\pi(f_t^*))}{2n}} + 2B\varepsilon(\delta/2, n).$$
The two last terms in the above bound converge to zero as $n \to \infty$, so for any δ, for any n large enough, it holds that
$$P\big[\mathcal E(\hat f_n^\pi) - \mathcal E^*_{\mathcal F} > t\big] \le \delta;$$
in other words $P\big[\mathcal E(\hat f_n^\pi) - \mathcal E^*_{\mathcal F} > t\big] \to 0$ as $n \to \infty$. This is true for any $t > 0$, hence the conclusion.

* An application to interval-based classification. Consider a binary classification problem ($\mathcal Y = \widetilde{\mathcal Y} = \{0,1\}$, 0–1 loss) for $\mathcal X = [0,1]$ and denote $\mathcal F_K$ the set of piecewise constant classification functions, constant on each of the K intervals of the form $\big[\frac{j-1}{K}, \frac{j}{K}\big)$, $j \in \llbracket K\rrbracket$. Since classification functions can only take two values, we have $|\mathcal F_K| = 2^K$. Furthermore, we define a family of weights on $\mathcal F := \bigcup_{K\ge1}\mathcal F_K$ via
$$\pi(f) = \frac{1}{cK^2}\, 2^{-K}, \quad \text{if } f \in \mathcal F_K,$$
for $c = \pi^2/6$. (It is possible that the same decision function belongs to $\mathcal F_K$ for several values of K, in which case we just take the smallest K with this property.) It is easy to check that these weights satisfy (3.28). Applying Proposition 3.16, the regularized ERM estimator $\hat f_\pi$ using these weights satisfies, with probability at least $1-\delta$:
$$\mathcal E(\hat f_\pi) \le \min_{K\ge1}\left(\min_{f\in\mathcal F_K}\mathcal E(f) + 2B\sqrt{\frac{K\log 2}{2n}} + 2B\sqrt{\frac{\log c + 2\log K}{2n}}\right) + 2B\varepsilon(\delta/2, n).$$

Observe that if we choose a fixed K in advance and consider the ERM on $\mathcal F_K$, applying Prop. 3.13 yields a bound similar to the above (without the third term), albeit for this fixed K only. As we have seen in Proposition 1.13 for piecewise-constant functions in a slightly different context, choosing K of the right order is important in order to obtain fast convergence rates. By contrast, the above inequality for the regularized ERM tells us that we get a bound as good as what we would obtain for the “best” choice of K for the ERM on $\mathcal F_K$; the price to pay for this adaptivity is the additional third term in the bound, which is modest since it is negligible with respect to the second term for large K. The above type of bound is sometimes called an oracle inequality: the risk of the regularized estimator is (almost) as good as if an “oracle” had told us in advance which $\mathcal F_K$ to choose to minimize the corresponding ERM risk.
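To make this concrete, here is a minimal numerical sketch (in Python; the function names are our own, not from the notes) of the regularized ERM (3.30) over the union of the classes $\mathcal F_K$: for a fixed K the penalty $-\log\pi(f) = \log c + 2\log K + K\log 2$ is the same for every $f \in \mathcal F_K$, so the inner minimization reduces to the plain ERM on $\mathcal F_K$, which is a per-interval majority vote.

```python
import numpy as np

def regularized_erm_intervals(X, Y, B=1.0, K_max=50):
    """Sketch of the regularized ERM (3.30) over the union of classes F_K."""
    n = len(X)
    c = np.pi ** 2 / 6
    best = None
    for K in range(1, K_max + 1):
        bins = np.minimum((X * K).astype(int), K - 1)   # interval index of each X_i
        # ERM on F_K: majority vote of the labels inside each of the K intervals
        votes = np.array([Y[bins == j].mean() if np.any(bins == j) else 0.0
                          for j in range(K)])
        f = (votes > 0.5).astype(int)
        emp_risk = np.mean(f[bins] != Y)                # empirical 0-1 risk
        penalty = B * np.sqrt((np.log(c) + 2 * np.log(K) + K * np.log(2)) / (2 * n))
        if best is None or emp_risk + penalty < best[0]:
            best = (emp_risk + penalty, K, f)
    return best  # (penalized score, selected K, classifier as a lookup table)

# toy illustration: eta(x) = 0.9 for x < 0.5, 0.1 otherwise
rng = np.random.default_rng(0)
X = rng.uniform(size=500)
Y = (rng.uniform(size=500) < np.where(X < 0.5, 0.9, 0.1)).astype(int)
print(regularized_erm_intervals(X, Y)[:2])
```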

⋄ An application to decision trees. We will consider an application of the previous


principle to (a certain type of) decision trees. Decision trees are a certain class of decision
functions that are piecewise constant on the input space X and whose pieces are defined
by recursive partitioning of X . More formally, a decision tree decision function is given by:

• A complete binary tree structure T; complete means that each node either has 2 daughter-nodes (it is then called an interior node) or has no descendants (it is then called a leaf). Let $\mathring T$ denote the set of interior nodes, and $\partial T$ the set of leaves.

• For each interior node s ∈ T̊ , a question qs which is a function X → {0, 1}. We will
assume that qs ∈ Q, where Q is a finite library of questions.

• For each leaf t ∈ ∂T , a prediction value yet ∈ Ye .

The set $\mathcal T(\mathcal Q)$ of possible triplets $(T, (q_s)_{s\in\mathring T}, (\widetilde y_t)_{t\in\partial T})$ will be called the set of decision trees (with questions belonging to Q).
Given the above structure and parameters $\mathcal T = (T, (q_s)_{s\in\mathring T}, (\widetilde y_t)_{t\in\partial T})$, the associated decision function $f_{\mathcal T}$ can be defined algorithmically as follows: for a given input point x:

1. Let s be the root node of the tree T.

2. If s is a leaf, return the value $f_{\mathcal T}(x) = \widetilde y_s$.

3. Otherwise, s is an interior node. Then compute $q_s(x) \in \{0, 1\}$.

4. If $q_s(x) = 1$, replace s by its right daughter-node; otherwise replace s by its left daughter-node.

5. Return to step 2.
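The following is a minimal sketch (in Python; the Node structure and helper names are our own) of the tree-evaluation procedure in steps 1–5 above, with threshold functions standing in for a question library Q:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    # interior node: `question` and both daughters are set; leaf: `value` is set
    question: Optional[Callable] = None   # q_s : X -> {0, 1}
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[int] = None           # prediction y_s at a leaf

def tree_predict(root: Node, x) -> int:
    """Evaluate the decision function f_T at x by walking the tree (steps 1-5)."""
    s = root
    while s.value is None:                               # step 2: stop at a leaf
        s = s.right if s.question(x) == 1 else s.left    # step 4
    return s.value

# toy tree on X = R^2 with a single threshold question
leaf0, leaf1 = Node(value=0), Node(value=1)
root = Node(question=lambda x: int(x[0] > 0.5), left=leaf0, right=leaf1)
print(tree_predict(root, (0.7, 0.1)))  # -> 1
```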

There are several classical methods available to build a decision tree function $\hat f = f_{\hat{\mathcal T}}$, corresponding to a certain triplet $\hat{\mathcal T} \in \mathcal T(\mathcal Q)$, from a data sample $S_n$. We will not present them in detail here. If we consider the output $\hat f_{S_n}$ of such a method, we would like to have a confidence bound on its risk without knowing the internal details of the method. Here we will not use a hold-out sample approach but rather use the same sample $S_n$; this is why we resort to finding confidence bounds that are valid for all functions of $\mathcal T(\mathcal Q)$ with high probability (following the same approach as in Proposition 3.16 and Corollary 3.12 for finite classes).

Proposition 3.18. Assume a prediction setting with a loss function bounded by B. Let $\mathcal Q$ be a finite set of questions, and assume $\widetilde{\mathcal Y}$ is a finite set. Let $S_n$ be an i.i.d. training sample of size n which will be used to compute all empirical risks. Then for any $\delta \in (0,1)$, with probability at least $1-\delta$, it holds:
$$\forall\, \mathcal T \in \mathcal T(\mathcal Q): \quad \mathcal E(f_{\mathcal T}) \le \hat{\mathcal E}(f_{\mathcal T}) + B\sqrt{\frac{C\,|\mathcal T|}{2n}} + B\sqrt{\frac{\log \delta^{-1}}{2n}}, \tag{3.34}$$
where $C := \log|\mathcal Q| + \log\big|\widetilde{\mathcal Y}\big| + \log 4$, and for $\mathcal T = (T, (q_s)_{s\in\mathring T}, (\widetilde y_t)_{t\in\partial T})$ we denote $|\mathcal T| := |\partial T|$.
As a consequence, if $\hat f$ is an estimator taking values in $\{f_{\mathcal T},\ \mathcal T \in \mathcal T(\mathcal Q)\}$, for any $\delta \in (0,1)$, with probability at least $1-\delta$, it holds
$$\mathcal E(\hat f_{S_n}) \le \hat{\mathcal E}(\hat f_{S_n}, S_n) + B\sqrt{\frac{C\,|\hat f_{S_n}|}{2n}} + B\sqrt{\frac{\log \delta^{-1}}{2n}},$$
where we denoted $|f_{\mathcal T}| = |\mathcal T|$, $\mathcal T \in \mathcal T(\mathcal Q)$.

Proof. We want to apply Prop. 3.14, and for this we need to choose a weight function π over $\mathcal T(\mathcal Q)$ (we consider directly a weight function over $\mathcal T(\mathcal Q)$, which of course induces a weight function on $\{f_{\mathcal T},\ \mathcal T \in \mathcal T(\mathcal Q)\}$). Although we insisted that there is no probabilistic argument linked to π, it is easier to describe π as a probability distribution. We construct π as a probability distribution in the following way:

• The marginal of the binary tree structure T is taken to be a Galton–Watson process where each node has probability ρ of having 0 descendants, and $(1-\rho)$ of having 2 daughters (with $\rho \in (0, \frac12]$). Thus
$$\pi(T) := \rho^{|\mathring T|}(1-\rho)^{|\partial T|} = \rho^{-1}\big[\rho(1-\rho)\big]^{|\partial T|},$$
since $|\partial T| = |\mathring T| + 1$ for a complete binary tree (check this property!).

• Conditionally on T, we take the questions at the interior nodes of T, $(q_s)_{s\in\mathring T}$, and the predictions at the leaves of T, $(\widetilde y_t)_{t\in\partial T}$, to be independent and uniformly distributed over $\mathcal Q$ and $\widetilde{\mathcal Y}$, respectively.

We plug this choice (taking $\rho = \frac12$) into (3.29), and it holds
$$\begin{aligned}
\log(\pi(\mathcal T)^{-1}) &= -\log\pi(T) - \log\pi\big((q_s)_{s\in\mathring T}, (\widetilde y_t)_{t\in\partial T}\,\big|\,T\big)\\
&= -\log(2) + |\partial T|\log 4 + (|\partial T| - 1)\log|\mathcal Q| + |\partial T|\log\big|\widetilde{\mathcal Y}\big|\\
&\le |\partial T|\big(\log 4 + \log|\mathcal Q| + \log\big|\widetilde{\mathcal Y}\big|\big),
\end{aligned}$$
which leads to the conclusion.


Observe that with the above choice of weights, the size of the tree $|\partial T|$ plays the natural role of a “complexity” of the associated decision function. Moreover, in practice we may want to pick a decision tree that minimizes the bound (3.34) (at least approximately, if exact minimization is not feasible), which is regularized ERM (or approximate regularized ERM) with the regularization penalty $\Omega(\mathcal T) = B\sqrt{C|\mathcal T|/2n}$.

4 The Nearest Neighbors method
In this section we introduce and propose an elementary analysis of nearest-neighbors (NN) methods, which are part of the classical toolbox of prediction methods. In a nutshell, the output of the NN method from a training sample $S_n$ is a decision function f such that the prediction f(x) at a point x is obtained by looking up the neighbors of x in the training set, and taking a decision based on a local average or majority vote among the labels associated with these neighbors.

4.1 Basic notation and definitions


We assume that X is a Polish space (a complete, separable metric space, made into a measurable space by endowing it with its Borel σ-algebra), and we denote d(·, ·) the metric on X . We assume given a training sample $S_n = (X_i, Y_i)_{1\le i\le n}$ of size n (as usual, assumed to be drawn i.i.d. from a generating distribution $P_{XY}$).
Given a point x ∈ X , there exists a permutation σx (depending on Sn , therefore random)
such that:
d(x, Xσx (1) ) ≤ d(x, Xσx (2) ) ≤ . . . ≤ d(x, Xσx (n) ).
(We assume that in case of ties, the tied points are ordered by their indices in the training set. It can be checked that this makes the permutation $\sigma_x$ a measurable function of the training set $S_n$.)
We introduce the notation
$$X^{(k)}(x) := X_{\sigma_x(k)}, \qquad Y^{(k)}(x) := Y_{\sigma_x(k)}, \qquad k = 1, \ldots, n;$$
X (k) (x) is called the k-th nearest neighbor of x (in the sample Sn ), and Y (k) (x) is its label.
In the regression setting ($\mathcal Y = \widetilde{\mathcal Y} = \mathbb R$), for a given integer k the k-nearest-neighbor (k-NN) prediction function based on the sample $S_n$ is defined as
$$\hat f_{k\text{-NN}}(x) := \frac1k \sum_{i=1}^k Y^{(i)}(x).$$

In the classification setting ($\mathcal Y = \widetilde{\mathcal Y} = \{1,\ldots,K\}$), a k-NN classifier function is defined as
$$\hat f_{k\text{-NN}}(x) \in \operatorname*{Arg\,Max}_{c=1,\ldots,K}\left(\sum_{i=1}^k \mathbb 1\big\{Y^{(i)}(x) = c\big\}\right),$$
i.e. $\hat f_{k\text{-NN}}$ predicts the majority class among the neighbors of x in the sample (in case of ties, once again one can decide to predict, among the tied classes, the one with the smallest index).
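As an illustration, here is a minimal sketch (Python; the helper name is our own) of the k-NN decision function with the Euclidean metric, breaking distance ties by training-set index as in the text:

```python
import numpy as np

def knn_predict(X_train, Y_train, x, k, classify=True):
    """k-NN prediction at a single point x (Euclidean metric)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # d(x, X_i)
    order = np.argsort(dists, kind="stable")      # permutation sigma_x; stable = ties by index
    neighbor_labels = Y_train[order[:k]]          # Y^(1)(x), ..., Y^(k)(x)
    if classify:
        # majority vote; argmax of bincount picks the smallest tied class
        return np.argmax(np.bincount(neighbor_labels))
    return neighbor_labels.mean()                 # local average (regression)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
Y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
print(knn_predict(X_train, Y_train, np.array([0.5, 0.5]), k=5))
```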
The definition of the k-NN decision function is simple and natural, and there exists a
vast literature on the subject going back to the 1960s. In this chapter, we will concentrate
on the following questions concerning the asymptotic behavior of the k-NN prediction
(more precisely in the case of binary classification):

• what is the behavior of the risk $\mathcal E(\hat f_{k\text{-NN}})$, as the sample size n grows but k is fixed?

• what is the behavior of the risk $\mathcal E(\hat f_{k\text{-NN}})$, as the sample size n grows and k(n) is allowed to grow with n?
Note that it is natural to let k(n) grow with n, since intuitively local averages are more
accurate if we use more data (but not too many since we want to remain in a neighborhood
of the prediction point).

4.2 Analysis for k fixed


As announced, from now on we will focus on binary classification Y = Ye = {0, 1} with the
0-1 classification loss. We denote, as in previous chapters, η(x) = P[Y = 1|X = x]. We
will make the following assumption:

η is a continuous function. (Cont)

We introduce some additional notation: for $p \in [0,1]$, denote
$$Q_k(p) := P\Big[B_{k,p} < \frac k2\Big], \quad \text{where } B_{k,p} \sim \mathrm{Binom}(k, p).$$

** Theorem 4.1. In the binary classification setting, assuming (Cont) holds, we have:

1. $\forall x \in \mathrm{Supp}(P_X)$: $\mathbb E_{S_n}\big[\mathbb E\big[\ell(\hat f^{(n)}_{k\text{-NN}}(x), Y)\,\big|\,X = x\big]\big] \to \alpha_k(\eta(x))$, as $n \to \infty$.

2. $\mathbb E_{S_n}\big[\mathcal E(\hat f^{(n)}_{k\text{-NN}})\big] \to \mathcal E^*_{k\text{-NN}} := \mathbb E[\alpha_k(\eta(X))]$, as $n \to \infty$,

where the superscript (n) is a reminder of the sample size used to construct $\hat f^{(n)}_{k\text{-NN}}$, and
$$\alpha_k(\eta) := \eta\, Q_k(\eta) + (1-\eta)(1 - Q_k(\eta)). \tag{4.1}$$

Before proving this theorem, we give a general intuition of why it holds (the proof will
consist in making this intuition mathematically rigorous):
• As n → ∞, the number of sample points in a given neighborhood of x will grow
to infinity (provided that x is in the support of PX ). Therefore, the distance of the
kth-NN of x to x should tend to zero.
• Conditionally on the sample points $(X_i)_{1\le i\le n}$, the labels $Y^{(1)}(x), \ldots, Y^{(k)}(x)$ of the k neighbors of x are distributed as independent Bernoulli variables with respective parameters $\eta(X^{(1)}(x)), \ldots, \eta(X^{(k)}(x))$. However, since the neighbors of x tend to x itself and η is continuous, these labels will essentially behave as i.i.d. Ber(η(x)) variables.

• The decision function $\hat f_{k\text{-NN}}(x)$ will predict the majority class among the k neighbors. By the previous point, we expect the number of neighbors of class 1 to behave as a Binom(k, η(x)) variable, so $\hat f_{k\text{-NN}}$ will behave as a randomized decision predicting 0 with probability $Q_k(\eta(x))$ and 1 with probability $1 - Q_k(\eta(x))$, giving rise to the average error $\alpha_k(\eta(x))$ at point x.
We now proceed to the proof. The following lemmas will roughly speaking correspond to
the successive points in the above intuition.

** Lemma 4.2. Let $X_1, \ldots, X_n, \ldots$ be i.i.d. points from the distribution $P_X$ on a Polish space $\mathcal X$. For a given $x \in \mathcal X$ we denote $X_n^{(k)}(x)$ the k-th nearest neighbor of x among the points $X_1, \ldots, X_n$. Then it holds:

1. For any fixed $x \in \mathrm{Supp}(P_X)$, and any fixed integer k, it holds $d(X_n^{(k)}(x), x) \to 0$ as $n \to \infty$, in probability and a.s.

2. Let $X \sim P_X$, independent of the $(X_i)_{i\ge1}$. Then it holds $d(X_n^{(k)}(X), X) \to 0$ as $n \to \infty$, in probability and a.s.

Proof. Assume $x \in \mathrm{Supp}(P_X)$; by definition this means that for any $\varepsilon > 0$, $p(x,\varepsilon) := P[X \in B(x,\varepsilon)] > 0$, where $B(x,\varepsilon)$ is the open ball of center x and radius ε.
Let us fix $\varepsilon > 0$. Denote $N_{\varepsilon,n}(x) := \#\{i \in \{1,\ldots,n\} : X_i \in B(x,\varepsilon)\}$; by the law of large numbers, it holds
$$\frac{N_{\varepsilon,n}(x)}{n} = \frac1n \sum_{i=1}^n \mathbb 1\{X_i \in B(x,\varepsilon)\} \to p(x,\varepsilon) > 0,$$
as $n \to \infty$, in probability. Furthermore, note the following implications:
$$d(X_n^{(k)}(x), x) \ge \varepsilon \;\Rightarrow\; X_n^{(k)}(x) \notin B(x,\varepsilon) \;\Rightarrow\; N_{\varepsilon,n} < k;$$
hence
$$P\big[d(X_n^{(k)}(x), x) \ge \varepsilon\big] \le P\bigg[\frac{N_{\varepsilon,n}}{n} < \frac kn\bigg].$$
Since $N_{\varepsilon,n}/n \to p(x,\varepsilon) > 0$ in probability, while $k/n \to 0$, for n big enough it holds $k/n \le p(x,\varepsilon)/2 < p(x,\varepsilon)$, and $P[N_{\varepsilon,n}/n < k/n] \le P[N_{\varepsilon,n}/n < p(x,\varepsilon) - p(x,\varepsilon)/2] \to 0$ by definition of convergence in probability. We have proved that $d(X_n^{(k)}(x), x) \to 0$ in probability.
However, this also implies convergence a.s. by a monotonicity argument: observe that (for any fixed $\omega \in \Omega$ determining the sequence $(X_i(\omega))_{i\ge1}$), $d(X_n^{(k)}(x), x)(\omega)$ is a nonincreasing sequence in n (adding more points to the set $\{X_1,\ldots,X_n\}$ can only make the k-th neighbor of x closer to x). It therefore has a (nonnegative) limit $L(\omega)$, which can only be 0 (a.s.); indeed for any $t > 0$, $\lim_{n\to\infty}\mathbb 1\big\{d(X_n^{(k)}(x), x) \ge t\big\} = \mathbb 1\{L \ge t\}$, hence by dominated convergence
$$P[L \ge t] = \lim_{n\to\infty} P\big[d(X_n^{(k)}(x), x) \ge t\big] = 0,$$
so $L = 0$ a.s.
For point 2, observe that for any $\varepsilon > 0$
$$P_{X,(X_i)_{1\le i\le n}}\big[d(X_n^{(k)}(X), X) > \varepsilon\big] = \mathbb E_X\Big[P\big[d(X_n^{(k)}(X), X) > \varepsilon \,\big|\, X\big]\Big].$$
In point 1 we have proved that $F_\varepsilon^{(n)}(x) := P\big[d(X_n^{(k)}(X), X) > \varepsilon \mid X = x\big] = P\big[d(X_n^{(k)}(x), x) > \varepsilon\big]$ converges to 0 for $P_X$-almost all x, since in a Polish space it holds $P[X \in \mathrm{Supp}(P_X)] = 1$. Therefore, since it is bounded by 1, its integral (the above expression) converges to 0 by dominated convergence. This means that $d(X_n^{(k)}(X), X) \to 0$ as $n \to \infty$, in probability, and it implies convergence a.s. by the same monotonicity argument as in point 1.
The following lemma, based on a coupling argument, will allow us to formalize the idea that the labels of the close neighbors of a point x behave almost as if they were drawn with Bernoulli parameter η(x).

* Lemma 4.3. Let k be an integer and Ψ a function $\{0,1\}^k \to [0,1]$. Assume $Y_1, \ldots, Y_k$ are independent Bernoulli random variables with parameters $\eta_1, \ldots, \eta_k$, and $Y_1', \ldots, Y_k'$ are i.i.d. Bernoulli random variables with parameter η. Then it holds
$$\big|\mathbb E[\Psi(Y_1,\ldots,Y_k)] - \mathbb E[\Psi(Y_1',\ldots,Y_k')]\big| \le \sum_{i=1}^k |\eta_i - \eta|.$$
(Note: equivalently, the above is a bound on the total variation distance between the distribution of $(Y_i)_{1\le i\le k}$ and that of $(Y_i')_{1\le i\le k}$.)
Proof. Since the expectations of $\Psi(Y_1,\ldots,Y_k)$ and $\Psi(Y_1',\ldots,Y_k')$ only depend on the distributions of the respective k-uples of random variables, it is possible to construct a coupling between these two k-uples. More precisely, we construct two k-uples of random variables $(\widetilde Y_i)_{1\le i\le k}$ and $(\widetilde Y_i')_{1\le i\le k}$ so that $(\widetilde Y_i)_{1\le i\le k}$ has the same distribution as $(Y_i)_{1\le i\le k}$, i.e. $\bigotimes_{i=1}^k \mathrm{Ber}(\eta_i)$, and similarly for $(\widetilde Y_i')_{1\le i\le k}$, but so that $\widetilde Y_i = \widetilde Y_i'$ as “often” as possible.
The construction is as follows: let $U_1, \ldots, U_k$ be i.i.d. Unif[0,1], and define $\widetilde Y_i = \mathbb 1\{U_i \le \eta_i\}$, $\widetilde Y_i' = \mathbb 1\{U_i \le \eta\}$. Then it can be checked easily that:

• $(\widetilde Y_i)_{1\le i\le k} \sim \bigotimes_{i=1}^k \mathrm{Ber}(\eta_i)$;

• $(\widetilde Y_i')_{1\le i\le k} \sim \bigotimes_{i=1}^k \mathrm{Ber}(\eta)$;

• $P\big[\widetilde Y_i \ne \widetilde Y_i'\big] = P\big[U_i \in [\min(\eta,\eta_i), \max(\eta,\eta_i)]\big] = |\eta_i - \eta|$.
Therefore, since Ψ takes values in [0, 1]:
$$\begin{aligned}
\big|\mathbb E[\Psi(Y_1,\ldots,Y_k)] - \mathbb E[\Psi(Y_1',\ldots,Y_k')]\big| &= \Big|\mathbb E\big[\Psi(\widetilde Y_1,\ldots,\widetilde Y_k)\big] - \mathbb E\big[\Psi(\widetilde Y_1',\ldots,\widetilde Y_k')\big]\Big|\\
&\le \mathbb E\Big[\big|\Psi(\widetilde Y_1,\ldots,\widetilde Y_k) - \Psi(\widetilde Y_1',\ldots,\widetilde Y_k')\big|\Big]\\
&\le P\big[\exists i \in \{1,\ldots,k\} : \widetilde Y_i \ne \widetilde Y_i'\big]\\
&\le \sum_{i=1}^k P\big[\widetilde Y_i \ne \widetilde Y_i'\big] = \sum_{i=1}^k |\eta - \eta_i|.
\end{aligned}$$
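The coupling in the proof above is easy to simulate; the following toy Monte-Carlo check (our own code, with arbitrary parameter values) illustrates the bound of Lemma 4.3:

```python
import numpy as np

# The same uniforms U_i drive both k-uples, so coordinate i differs
# only when U_i falls between eta and eta_i.
rng = np.random.default_rng(0)
k, n_sim, eta = 5, 100_000, 0.2
etas = np.array([0.25, 0.18, 0.22, 0.30, 0.20])   # parameters eta_1, ..., eta_k

U = rng.uniform(size=(n_sim, k))
Y_tilde = (U <= etas).astype(int)         # ~ product of Ber(eta_i)
Y_tilde_prime = (U <= eta).astype(int)    # ~ Ber(eta)^{tensor k}, coupled

Psi = lambda Y: (Y.sum(axis=1) > k / 2).astype(float)  # a [0,1]-valued test function
gap = abs(Psi(Y_tilde).mean() - Psi(Y_tilde_prime).mean())
print(gap, "<=", np.abs(etas - eta).sum())  # the bound of Lemma 4.3
```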
We can now assemble the previous results to prove the theorem.

Proof of Theorem 4.1. Write
$$\mathbb E_{S_n}\big[\mathbb E[\ell(\hat f^{(n)}_{k\text{-NN}}(x), Y) \mid X = x]\big] = \mathbb E_{X_1,\ldots,X_n}\Big[\mathbb E\big[\ell(\hat f^{(n)}_{k\text{-NN}}(x), Y) \,\big|\, X_1,\ldots,X_n;\ X = x\big]\Big], \tag{4.2}$$
where the internal conditional expectation is over $(Y_1,\ldots,Y_n,Y)$. Since $\hat f^{(n)}_{k\text{-NN}}$ is constructed using only the values $(y^{(1)}(x),\ldots,y^{(k)}(x))$ of the labels of the k nearest neighbors of x, we can write
$$\ell(\hat f^{(n)}_{k\text{-NN}}(x), y) =: \psi_x\big(y^{(1)}(x), \ldots, y^{(k)}(x), y\big).$$
Introduce the abbreviated notation $X^{(n)} = (X_1,\ldots,X_n,X)$, $\mathbf x = (x_1,\ldots,x_n,x)$ and $Y_x^{(k)} := (Y^{(1)}(x),\ldots,Y^{(k)}(x), Y)$. Conditionally on $X^{(n)} = \mathbf x$, the uple $Y_x^{(k)}$ has the distribution
$$P_{Y^{(k)}\mid X^{(n)}=\mathbf x} := \bigg(\bigotimes_{i=1}^k \mathrm{Ber}\big(\eta(x^{(i)}(x))\big)\bigg) \otimes \mathrm{Ber}(\eta(x)).$$
On the other hand, it can be checked that $\alpha_k(\eta(x))$ defined by (4.1) is the expectation of $\psi_x((Y^{(k)})')$, for $(Y^{(k)})' \sim \bigotimes_{i=1}^{k+1}\mathrm{Ber}(\eta(x))$ (to see why, recall the third point in the general intuition discussion following Theorem 4.1). Applying Lemma 4.3, we obtain for any $\mathbf x$:
$$\Big|\mathbb E\big[\psi_x(Y^{(k)}) \,\big|\, X^{(n)} = \mathbf x\big] - \alpha_k(\eta(x))\Big| \le \sum_{i=1}^k \big|\eta(x^{(i)}(x)) - \eta(x)\big|.$$
Now, assuming $x \in \mathrm{Supp}(P_X)$, by Lemma 4.2 and continuity of η, since $d(X^{(i)}(x), x) \to 0$ in probability for $i = 1,\ldots,k$ as $n \to \infty$, and since η is bounded by 1, the expectation of the above right-hand side with respect to $X_1,\ldots,X_n$ converges to 0, which in view of (4.2) establishes the first part of the theorem. The second part follows immediately by integration over $X \sim P_X$ and dominated convergence.

We now examine more closely the function $\alpha_k$ from (4.1), which determines the asymptotic error, for fixed k, of the k-NN method. We look at the cases k = 1, 3, 5 (note that it is reasonable to choose k odd to avoid ties). In each case, we examine the behavior of $\alpha_k(\eta)$ as η is close to 0, and compare it to the “local” optimal risk, which is $\min(\eta, 1-\eta) = \eta$. By symmetry (since $\alpha_k(\eta) = \alpha_k(1-\eta)$ by definition, for k odd), the same behavior holds as a function of $(1-\eta)$ if η is close to 1.
This behavior is relevant in a situation where the classification problem is well-separable, i.e. η(x) is always close to 0 or 1.
• k = 1: then $\alpha_1(\eta) = 2\eta(1-\eta)$. Remember that the Bayes optimal error in classification is given by $\mathcal E^* = \mathbb E[\min(\eta(X), 1-\eta(X))]$. If we denote $a(x) := \min(\eta(x), 1-\eta(x))$, it therefore holds
$$\mathcal E^*_{1\text{-NN}} = \mathbb E[2a(X)(1-a(X))] \le 2\,\mathbb E[a(X)]\big(1 - \mathbb E[a(X)]\big) = 2\mathcal E^*(1-\mathcal E^*),$$
where the above inequality is Jensen's, since $x \mapsto x(1-x)$ is concave on [0,1]. We see that when $\mathcal E^*$ is close to 0, the risk of 1-NN is bounded by twice the Bayes risk. This factor 2 is unavoidable, since in a well-separable situation, if η is close to 0 we have
$$\alpha_1(\eta) \sim 2\eta,$$
which is twice the (local) optimal risk.

• k = 3: then $Q_3(\eta) = (1-\eta)^3 + 3(1-\eta)^2\eta$, and $\alpha_3(\eta) = (1-\eta)^3\eta + \eta^3(1-\eta) + 6(1-\eta)^2\eta^2$. So
$$\alpha_3(\eta) - \eta \sim 3\eta^2, \quad \text{as } \eta \to 0.$$

• k = 5: then with some additional tedious computations we get
$$\alpha_5(\eta) - \eta \sim 10\eta^3, \quad \text{as } \eta \to 0.$$

We see that in well-separable situations, when for all x, $\eta(x) \in [0, p] \cup [1-p, 1]$ with p close to 0, we have $\mathcal E^*_{1\text{-NN}} \approx 2\mathcal E^*$ while $\mathcal E^*_{3\text{-NN}} \approx \mathcal E^*$ (up to a term of order $p^2$). This shows clearly the advantage of taking more neighbors for improving the asymptotic error. The 5-NN method asymptotically gives an even better approximation of the optimal Bayes error; still, the 3-NN method may be sufficient in such a well-separable situation.
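These asymptotics are easy to check numerically from the definition (4.1); the following short sketch (our own code, using scipy) evaluates $\alpha_k$ for k = 1, 3, 5:

```python
import numpy as np
from scipy.stats import binom

def alpha_k(eta, k):
    """alpha_k(eta) from (4.1), with Q_k(eta) = P[Binom(k, eta) < k/2]."""
    Qk = binom.cdf((k - 1) // 2, k, eta)  # P[B <= (k-1)/2] = P[B < k/2] for k odd
    return eta * Qk + (1 - eta) * (1 - Qk)

eta = 0.01  # well-separable regime: eta close to 0
for k in (1, 3, 5):
    print(k, alpha_k(eta, k))
# expected orders: alpha_1 ~ 2*eta, alpha_3 - eta ~ 3*eta^2, alpha_5 - eta ~ 10*eta^3
```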

4.3 Consistency of the k-nearest-neighbors method


In this section we turn to analyzing the case where the number of neighbors k(n) can grow
with the sample size n. The main result is the following.

** Theorem 4.4. We consider a binary classification problem (with the usual 0–1 loss) under either of the following assumptions:
(A) X is a Polish space, and the function $\eta(x) = P[Y = 1|X = x]$ is continuous.
(B) $\mathcal X = \mathbb R^d$ for some finite dimension d, endowed with the Euclidean distance, and there is no assumption on η.
Then, if k(n) is such that $k(n) \to \infty$ and $k(n)/n \to 0$ as $n \to \infty$, it holds under either setting (A) or (B) that
$$\mathcal E(\hat f^{(n)}_{k(n)\text{-NN}}) \to \mathcal E^*, \quad \text{in probability as } n \to \infty.$$

Notice in particular the remarkable fact that in setting (B) (Rd endowed with the Eu-
clidean distance), the k-NN method with suitably increasing k(n) is universally consis-
tent, i.e. its risk converges asymptotically towards the Bayes risk without any assumption
on the generating distribution PXY .
We start by revisiting Lemma 4.2.

** Lemma 4.5. Let $X_1, \ldots, X_n, \ldots$ be i.i.d. points from the distribution $P_X$ on a Polish space $\mathcal X$. For a given $x \in \mathcal X$ we denote $X_n^{(k)}(x)$ the k-th nearest neighbor of x among the points $X_1, \ldots, X_n$. Then if $k(n)/n \to 0$ as $n \to \infty$, it holds:

1. For any fixed $x \in \mathrm{Supp}(P_X)$, $d(X_n^{(k(n))}(x), x) \to 0$ as $n \to \infty$, in probability.

2. Let $X \sim P_X$, independent of the $(X_i)_{i\ge1}$. Then it holds $d(X_n^{(k(n))}(X), X) \to 0$ as $n \to \infty$, in probability.

Proof. The proof is essentially the same as for Lemma 4.2. To wit, for any $\varepsilon > 0$ and fixed k and n, we have seen that
$$P\big[d(X_n^{(k)}(x), x) \ge \varepsilon\big] \le P\bigg[\frac{N_{\varepsilon,n}}{n} < \frac kn\bigg],$$
and furthermore $N_{\varepsilon,n}/n \to p(x,\varepsilon) > 0$ in probability as $n \to \infty$. So if $k(n)/n \to 0$ as $n \to \infty$, for n large enough it holds $k(n)/n < p(x,\varepsilon)/2$, implying that the right-hand side above converges to zero. The statement of point 2 is obtained by integration over x, since the statement of point 1 is true for $P_X$-almost all x, because $P[X \in \mathrm{Supp}(P_X)] = 1$ in a Polish space.
Proof of Theorem 4.4. Denote
$$\hat\eta(x) := \frac{1}{k(n)} \sum_{i=1}^{k(n)} Y^{(i)}(x).$$
Observe that $\hat f_{k\text{-NN}}$ can be seen as a plug-in classifier $\hat f_{k\text{-NN}} := \mathbb 1\big\{\hat\eta(x) > \frac12\big\}$. From the result on risk comparison for plug-in classifiers (Proposition 1.14), we know
$$\mathcal E_\ell(\hat f_{k\text{-NN}}) - \mathcal E^*_\ell \le 2\,\mathbb E_X[|\hat\eta(X) - \eta(X)|].$$

Now define
$$\tilde\eta(x) := \frac{1}{k(n)} \sum_{i=1}^{k(n)} \eta(X^{(i)}(x)),$$
and bound
$$\mathbb E_X[|\hat\eta(X) - \eta(X)|] \le \underbrace{\mathbb E_X[|\hat\eta(X) - \tilde\eta(X)|]}_{(II)} + \underbrace{\mathbb E_X[|\tilde\eta(X) - \eta(X)|]}_{(I)}.$$

Estimating term (II). We have for any fixed x:
$$\hat\eta(x) - \tilde\eta(x) = \frac{1}{k(n)} \sum_{i=1}^{k(n)} \big(Y^{(i)}(x) - \eta(X^{(i)}(x))\big).$$
Since $\mathbb E\big[Y^{(i)}(x)\big] = \eta(X^{(i)}(x))$ and, conditionally on $X_1,\ldots,X_n$, the labels $Y^{(1)}(x),\ldots,Y^{(k)}(x)$ are independent Bernoulli variables with respective parameters $\eta(X^{(i)}(x))$, it holds
$$\mathbb E_{X,S_n}\big[(\tilde\eta(x) - \hat\eta(x))^2 \,\big|\, X_1,\ldots,X_n;\ X = x\big] = \frac{1}{k(n)^2} \sum_{i=1}^{k(n)} \eta(X^{(i)}(x))\big(1 - \eta(X^{(i)}(x))\big) \le \frac{1}{4k(n)}.$$
Furthermore, by integration
$$\mathbb E_{S_n}[(II)] = \mathbb E_{X,S_n}[|\tilde\eta(X) - \hat\eta(X)|] \le \mathbb E_{X,S_n}\big[(\tilde\eta(X) - \hat\eta(X))^2\big]^{\frac12} \le \frac{1}{2\sqrt{k(n)}} \xrightarrow{n\to\infty} 0.$$
Hence term (II) tends to 0 in expectation over $S_n$, and therefore also in probability, since it is nonnegative.
Estimating term (I), setting (A). We have for any $x \in \mathcal X$:
$$\eta(x) - \tilde\eta(x) = \frac{1}{k(n)} \sum_{i=1}^{k(n)} \big(\eta(x) - \eta(X^{(i)}(x))\big).$$
Since we assumed under setting (A) that η is continuous, for any fixed $\varepsilon > 0$ there exists $\delta > 0$ such that for any x′, $d(x, x') < \delta$ implies $|\eta(x) - \eta(x')| \le \varepsilon$. Therefore, since η and $\tilde\eta$ are bounded by 1:
$$|\eta(x) - \tilde\eta(x)| \le \varepsilon + \mathbb 1\big\{d(X^{(k(n))}(x), x) > \delta\big\};$$
hence
$$\mathbb E_{S_n}[(I)] = \mathbb E_{S_n,X}[|\eta(X) - \tilde\eta(X)|] \le \varepsilon + P\big[d(X^{(k(n))}(X), X) > \delta\big],$$
and from Lemma 4.5 we know that the second term tends to 0. Since this holds for any $\varepsilon > 0$, term (I) converges to 0 in expectation over $S_n$, and therefore also in probability, since it is nonnegative. This concludes the proof for setting (A).
Estimating term (I), setting (B). In this setting $\mathcal X = \mathbb R^d$, but η is not necessarily continuous. However, we know that the set $\mathcal C(\mathbb R^d)$ of continuous functions on $\mathbb R^d$ is dense in $L^1(\mathbb R^d, P_X)$. Let $\varepsilon > 0$ be fixed, and pick $\eta_\varepsilon$ continuous such that $\mathbb E[|\eta(X) - \eta_\varepsilon(X)|] \le \varepsilon$. We define
$$\tilde\eta_\varepsilon(x) := \frac{1}{k(n)} \sum_{i=1}^{k(n)} \eta_\varepsilon(X^{(i)}(x)),$$
and write
$$(I) = \mathbb E_X[|\eta(X) - \tilde\eta(X)|] \le \underbrace{\mathbb E_X[|\eta(X) - \eta_\varepsilon(X)|]}_{\le\, \varepsilon} + \underbrace{\mathbb E_X[|\eta_\varepsilon(X) - \tilde\eta_\varepsilon(X)|]}_{(Ia)} + \underbrace{\mathbb E_X[|\tilde\eta_\varepsilon(X) - \tilde\eta(X)|]}_{(Ib)}.$$
The term (Ia) can be shown to converge in probability to zero, with exactly the same argument as used in setting (A), since $\eta_\varepsilon$ is continuous.
Finally, in order to estimate term (Ib), we will use a clever geometrical lemma, stated precisely below, which yields the following:
$$\mathbb E_{S_n}[(Ib)] = \mathbb E_{S_n,X}[|\tilde\eta_\varepsilon(X) - \tilde\eta(X)|] \le \mathbb E_{S_n,X}\bigg[\frac{1}{k(n)} \sum_{i=1}^{k(n)} \big|\eta_\varepsilon(X^{(i)}(X)) - \eta(X^{(i)}(X))\big|\bigg] \le \gamma_d\, \mathbb E[|\eta(X) - \eta_\varepsilon(X)|] \le \gamma_d\, \varepsilon, \tag{4.3}$$
where $\gamma_d$ is a factor that only depends on d. Overall, since (Ia) converges to 0 in expectation over $S_n$, it also converges in probability; since ε > 0 was arbitrary, term (I) converges to 0 in probability, and the proof is done.
The following lemma is used to prove inequality (4.3):

* Lemma 4.6 (Stone's lemma). Let $\mathcal X = \mathbb R^d$ endowed with the Euclidean distance; let $X, X_1, \ldots, X_n$ be i.i.d. variables with distribution P on $\mathcal X$, and $f \in L^1_+(\mathbb R^d, P)$ a nonnegative integrable function. Then there exists a factor $\gamma_d$, only depending on d, such that for any integer $k > 0$:
$$\mathbb E\bigg[\sum_{i=1}^k f(X^{(i)}(X))\bigg] \le k\gamma_d\, \mathbb E[f(X)]. \tag{4.4}$$

Proof. Assume $k > 0$ is a fixed integer. Let us denote $X_0 = X$, and for $i = 0, \ldots, n$, introduce the notation
$$\mathrm{NN}_k(X_i) := \{j \ne i : X_j \text{ is one of the } k \text{ nearest neighbors of } X_i \text{ among } X_0, \ldots, X_n\}.$$
Then we have
$$\begin{aligned}
\mathbb E\bigg[\sum_{i=1}^k f(X^{(i)}(X_0))\bigg] &= \mathbb E\bigg[\sum_{i=1}^n f(X_i)\mathbb 1\{i \in \mathrm{NN}_k(X_0)\}\bigg]\\
&= \sum_{i=1}^n \mathbb E[f(X_i)\mathbb 1\{i \in \mathrm{NN}_k(X_0)\}]\\
&= \sum_{i=1}^n \mathbb E[f(X_0)\mathbb 1\{0 \in \mathrm{NN}_k(X_i)\}] \qquad (*)\\
&= \mathbb E\bigg[f(X_0)\sum_{i=1}^n \mathbb 1\{0 \in \mathrm{NN}_k(X_i)\}\bigg]\\
&= \mathbb E\big[f(X_0)\,\#\{i : 0 \in \mathrm{NN}_k(X_i)\}\big]. \qquad (**)
\end{aligned}$$
Observe that the clever step (*) is obtained by symmetry: in each separate expectation we can exchange the roles of $X_0$ and $X_i$, while the distribution of the (n+1)-tuple $(X_0, \ldots, X_n)$ remains unchanged. Finally, the next lemma will establish that
$$\#\{i : 0 \in \mathrm{NN}_k(X_i)\} \le k\gamma_d,$$
for a factor $\gamma_d$ only depending on d; this will conclude the proof, since we can plug this upper bound into (**) (observe that it is at this point that we must make use of the fact that f is nonnegative).
Lemma 4.7. Let $(x_0, \ldots, x_n)$ be (n+1) points in $\mathbb R^d$ and
$$\mathrm{NN}_k(x_i) := \{j \ne i : x_j \text{ is one of the } k \text{ nearest neighbors of } x_i \text{ among } x_0, \ldots, x_n\},$$
where ties are broken by taking the smallest index. Then
$$\#\{i : 0 \in \mathrm{NN}_k(x_i)\} \le k\gamma_d,$$
with $\gamma_d$ a factor only depending on d.


Proof. Without loss of generality we assume $x_0 = 0$. Let us consider a fixed open cone $C_0$ with apex at the origin ($x_0$) and of angle $2\theta \le \pi/3$. For any two points y and z in $C_0$, assume that $\|y\| \le \|z\|$; then, since the angle between y and z is strictly less than 2θ:
$$\|y - z\|^2 < \|y\|^2 + \|z\|^2 - 2\|y\|\|z\|\underbrace{\cos(2\theta)}_{\ge 1/2} \le \|z\|^2\bigg(1 + \underbrace{\frac{\|y\|^2}{\|z\|^2} - \frac{\|y\|}{\|z\|}}_{\le\, 0}\bigg) \le \|z\|^2. \tag{4.5}$$
Let $x_{i_1}, \ldots, x_{i_k}$ be the elements of $\{x_1, \ldots, x_n\}$ belonging to $C_0$ and closest to the origin $x_0$ (if there are only $k' < k$ such elements, take only those; if there are more because of ties, take the ones with the k smallest indices). Now, notice that for any other $x_j \in C_0$ with $j \notin \{i_1, \ldots, i_k\}$, we have $\|x_j\| \ge \|x_{i_\ell}\|$ for all $\ell = 1, \ldots, k$; thus by (4.5) we have
$$\|x_j - x_0\|^2 = \|x_j\|^2 > \sup_{\ell=1,\ldots,k} \|x_j - x_{i_\ell}\|^2.$$
Therefore, for any such $x_j$, the point $x_0$ is not among the k nearest neighbors of $x_j$, and we must have $0 \notin \mathrm{NN}_k(x_j)$.
To summarize: in any such open cone $C_0$, there are at most k indices $i_\ell$ such that $0 \in \mathrm{NN}_k(x_{i_\ell})$. Now the space $\mathbb R^d$ can be covered by a finite number $\gamma_d$ of such open cones (note that it is enough to cover the unit ball, by homogeneity, and then use compactness). Overall there are at most $k\gamma_d$ indices i such that $0 \in \mathrm{NN}_k(x_i)$. This implies the conclusion.

Exercise 4.1. Prove the bound $\gamma_d \le \big(1 + \frac{1}{\sin(\pi/12)}\big)^d \le 5^d$. For this, assume that the unit ball is covered by open cones of angle π/3, with principal axes given by the directions of vectors $x_1, \ldots, x_M$, with $\|x_i\| = 1$. Furthermore, assume that this covering is of minimal cardinality. Prove then that it must hold $\|x_i - x_j\| \ge 2\sin\frac{\pi}{12} =: r$ for $i \ne j$. Conclude by a volume argument: the balls $B(x_i, \frac r2)$ must be disjoint and are all contained in $B(0, 1 + \frac r2)$, entailing that the sum of their volumes is less than the volume of this containing ball.

Exercise 4.2. It is possible to also prove a.s. convergence in Lemma 4.5, as in Lemma 4.2, but the argument of Lemma 4.2 has to be modified, since a.s. monotonicity of $d(X_n^{(k(n))}(x), x)$ does not necessarily hold when k(n) depends on n.
Establish the a.s. convergence property in Lemma 4.5 by considering $U_n := \sup_{m\ge n} d(X_m^{(k(m))}(x), x)$ and establishing that $U_n \to 0$ in probability. Then use the monotonicity argument, since $U_n$ is a.s. nonincreasing in n.

5 Reproducing kernel methods
5.1 Motivation
Linear methods after a feature mapping. In the chapter on linear classification
methods, a central role was played by linear (or affine) score functions which were linear
forms fw (x) = ⟨x, w⟩; such linear forms are also the class of predictors considered for linear
regression. Now imagine we would like to consider as prediction (or score) functions the
class of functions of the form
$$\mathcal F := \bigg\{ f_\alpha(x) := \sum_{i=1}^M \alpha_i f_i(x),\quad \alpha = (\alpha_1, \ldots, \alpha_M) \in \mathbb R^M \bigg\},$$

where {f1 , . . . , fM } is a known, fixed finite set of real-valued functions X → R. For


instance, we could consider that f1 , . . . , fM are monomials in the coordinates of x of degree
up to m, so that F is the set of polynomials in x of degree up to m. Or the fi could
be trigonometric functions; or in fact any finite “library” of functions that we consider
relevant for the problem at hand. (Note in particular that X does not have to be a subset
of Rd ; here X could be something like a sequence of characters, a graph. . . .)
In this setting each function fi is called a fixed “feature”. While the above setting might
seem much more general than linear functions, we can subsume it into linear methods by
considering the “feature mapping”
Φ(x) : X → RM , x 7→ (f1 (x), . . . , fM (x)),
if we use the shorthand notation xe := ϕ(x), then we observe that functions in F which are
nonlinear in x are linear functions of x
e:
M
X
fα (x) := αi fi (x) = ⟨α, Φ(x)⟩ = ⟨α, x
e⟩. (5.1)
i=1

Therefore, we can in principle apply any linear learning method (regression, or one of the
linear classification methods seen in Section 2) to the modified input data xe in order to
learn a prediction function in the class F.
An important point to notice right away is that the feature mapping Φ can often be
high-dimensional (as exemplified by polynomial regression: if X = Rd , the vector space of
polynomial functions of degree up to m has dimension (m + 1)d ). In fact it can commonly
be the case that we would like to consider a “feature space” (the image space of Φ) of
dimensionality M larger than the data sample size n. This has two important consequences:
1. It is essential to consider regularized methods (see in particular Section 2.7), otherwise
we are certain to run into overfitting.
2. It can be computationally inconvenient to store the data through its explicit feature mapping $(\tilde x_1, \ldots, \tilde x_n) = (\Phi(x_1), \ldots, \Phi(x_n))$, both in terms of the computation time of this mapping and of memory size.

Scalar products are sufficient. Concerning the second point above, an important remark is that all the linear methods (possibly in regularized form) we have seen share the following properties:

1. The learnt linear function (or functions, in the case of multi-class classification) $f_{\hat w}$ has a parameter $\hat w$ which can be written (regardless of the dimension) as a linear combination of the input data:
$$\hat w = \sum_{i=1}^n \beta_i X_i; \tag{5.2}$$

2. In order to compute the coefficients $\beta_i$ above, it is sufficient to know the scalar products $\langle X_i, X_j\rangle$, $1 \le i, j \le n$, along with the labels $Y_1, \ldots, Y_n$.

3. In order to compute fwb (x) for a new test point x, given the coefficients (β1 , . . . , βn )
of the representation (5.2), it is sufficient to know the scalar products ⟨Xi , x⟩, i =
1, . . . , n.

The last point is obvious since, given (5.2), we have
$$f_{\hat w}(x) = \langle\hat w, x\rangle = \sum_{i=1}^n \beta_i \langle X_i, x\rangle. \tag{5.3}$$

For the first two points, we take first the example of the perceptron. Remember that for the perceptron algorithm (see Section 2.6), the main iteration is $\hat w_k = \hat w_{k-1} + X_{i_k} Y_{i_k}$. By recursion, assume $\hat w_{k-1}$ satisfies points 1–2 above, that is to say, is of the form (5.2), with coefficients $\beta_i^{(k)}$ which can be computed given only the scalar products $\langle X_i, X_j\rangle$. Observe that the determination of the index $i_k$ depends on finding which training examples are correctly classified or not by the classifier $\mathrm{sign}(f_{\hat w_{k-1}})$, for which we only need to know the scalar products $\langle X_i, X_j\rangle$, by (5.3). As a consequence, $\hat w_k$ also satisfies points 1–2 above.
Let us take ridge regression as the next example. Remember from (2.16) and below that this method outputs (for a fixed regularization parameter $\lambda > 0$) the linear predictor with parameter vector
$$\hat w_\lambda = (X^t X + \lambda I_d)^{-1} X^t Y, \tag{5.4}$$
where X is the (n, d) matrix whose rows are $x_1^t, \ldots, x_n^t$, and $Y = (y_1, \ldots, y_n)^t$. We have the following lemma:

* Lemma 5.1. It holds for $\lambda > 0$:
$$(X^t X + \lambda I_d)^{-1} X^t = X^t (X X^t + \lambda I_n)^{-1}.$$
As a consequence, formula (5.4) implies that $\hat w_\lambda$ is of the form (5.2), with
$$(\beta_1, \ldots, \beta_n) = (X X^t + \lambda I_n)^{-1} Y; \tag{5.5}$$
observe that (XX t )ij = ⟨Xi , Xj ⟩, so finally points 1-2 hold in this setting too.
As apparent above, an object of central importance is $XX^t$, the Gram matrix associated with the points $(X_1, \ldots, X_n)$. Note that it is more economical to use the (n, n) Gram matrix than the (d, d) matrix $X^tX$ if $n < d$.
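A quick numerical sanity check (our own code) of Lemma 5.1 and of the dual form (5.5), in a regime $d > n$ where the Gram-matrix form is the more economical one:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 10, 50, 0.3          # more features than samples
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)

# primal ridge weights: (d x d) inversion, formula (5.4)
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
# dual form: (n x n) inversion, coefficients of (5.5)
beta = np.linalg.solve(X @ X.T + lam * np.eye(n), Y)
w_dual = X.T @ beta              # w = sum_i beta_i X_i, i.e. (5.2)

print(np.allclose(w_primal, w_dual))  # True, illustrating Lemma 5.1
```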
The conclusion of these observations is that it is enough to know how to compute scalar products ⟨x, x′⟩ in order to learn and apply prediction functions for classical linear methods. If we combine this observation with the idea of a feature mapping exposed earlier, we see that we do not need to know the explicit feature mapping Φ: we only need to be able to compute ⟨Φ(x), Φ(x′)⟩ for arbitrary x, x′ ∈ X .
These are in essence the principles underlying the construction of kernel methods. To summarize:
1. We can greatly extend the flexibility of linear methods by applying them after a
feature mapping Φ in a possibly high-dimensional Euclidean vector space.
2. For standard methods, we don't need to know the feature mapping Φ explicitly; we only need to be able to compute scalar products ⟨Φ(x), Φ(x′)⟩ for any x, x′ ∈ X .
3. It is important to always consider regularized versions of linear methods in this con-
text, since the output space of the feature mapping Φ is generally high-dimensional.

Exercise 5.1. Prove Lemma 5.1 and justify (5.5).

5.2 Reproducing kernel Hilbert spaces


The previous considerations motivate the interest of being able to compute easily
$$k(x, x') := \langle \Phi(x), \Phi(x')\rangle \tag{5.6}$$
rather than explicitly computing Φ(x), Φ(x′). We will call such a function k a kernel. Since we only need to know the function k in order to apply various algorithms in the feature space, it is natural to ask the converse question: under what conditions is a given function $k : \mathcal X \times \mathcal X \to \mathbb R$ the kernel associated to a feature mapping; i.e., when can we guarantee that there exists some feature space and a feature mapping Φ such that (5.6) holds?
Example. Let $\mathcal X = \mathbb R^d$ and $k(x, x') = (\langle x, x'\rangle + c)^2$, where $c \ge 0$ is a constant. Is it a kernel in the above sense? We have
$$(\langle x, x'\rangle + c)^2 = \bigg(\sum_{i=1}^d x_i x_i'\bigg)^2 + 2c\sum_{i=1}^d x_i x_i' + c^2 = \sum_{i,j=1}^d (x_i x_j)(x_i' x_j') + \sum_{i=1}^d (\sqrt{2c}\,x_i)(\sqrt{2c}\,x_i') + c^2,$$
so (5.6) is satisfied with
$$\Phi(x) := \big[(x_i x_j)_{1\le i,j\le d};\ (\sqrt{2c}\,x);\ c\big].$$
Observe that Φ maps to all monomials of degree up to 2, so the associated function space defined via (5.1) is the vector space of polynomials of degree up to 2 in the coordinates of x. For this reason k is called the polynomial kernel of order 2.
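The identity (5.6) for this explicit feature map is easy to verify numerically; a minimal check (our own code):

```python
import numpy as np

def phi(x, c):
    """Explicit feature map for the polynomial kernel of order 2."""
    return np.concatenate([np.outer(x, x).ravel(),   # all products x_i x_j
                           np.sqrt(2 * c) * x,       # sqrt(2c) x_i
                           [c]])                     # constant feature

rng = np.random.default_rng(0)
x, y, c = rng.normal(size=3), rng.normal(size=3), 1.5
print(np.isclose(phi(x, c) @ phi(y, c), (x @ y + c) ** 2))  # True
```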
The following fundamental theorem gives a set of necessary and sufficient conditions
on k in order for (5.6) to hold.

** Theorem 5.2 (Characterization theorem). Let X be a nonempty set, and k : X ×X →


R a function.
Then there exist a Hilbert space H◦ and a mapping Φ◦ : X → H◦ with

k(x, x′ ) = ⟨Φ◦ (x), Φ◦ (x′ )⟩H◦ (5.7)

if and only if the following conditions are satisfied:

1. k is symmetric: k(x, x′ ) = k(x′ , x) for all x, x′ ∈ X .

2. k has positive type, that is, for any integer n > 0, and any n-uples (x1 , . . . , xn ) ∈
X n , and (α1 , . . . , αn ) ∈ Rn , it holds
n
X
αi αj k(xi , xj ) ≥ 0.
i,j=1

Note: the above properties can be more compactly expressed equivalently as: for any
integer n > 0, and any n-uple (x1 , . . . , xn ) ∈ X n , the matrix K given by Kij = k(xi , xj )
is symmetric positive semi-definite. For this reason, we will call a kernel satisfying the
above conditions a symmetric positive semi-definite (spsd) kernel.

Proof. “Only if” direction: assume (5.7) holds. Then obviously k is symmetric, and
$$\sum_{i,j=1}^n \alpha_i\alpha_j k(x_i, x_j) = \sum_{i,j=1}^n \alpha_i\alpha_j \langle\Phi_\circ(x_i), \Phi_\circ(x_j)\rangle_{\mathcal H_\circ} = \bigg\|\sum_{i=1}^n \alpha_i\Phi_\circ(x_i)\bigg\|^2_{\mathcal H_\circ} \ge 0.$$

“If” direction: assume k is a spsd kernel. We need to construct H and Φ. For any
x ∈ X , denote the real-valued function kx : X → R, y 7→ k(x, y) (we will alternatively use
the notation kx := k(x, ·)). Now define

Hpre := Span{kx , x ∈ X }; (5.8)

we stress that the above set is made of finite linear combinations of functions of the
form k(xi , ·).

We define a bilinear form [·, ·] on $\mathcal H_{\mathrm{pre}}$:
$$\text{for } f = \sum_{i\in I_1}\lambda_i k_{x_i},\quad g = \sum_{j\in I_2}\mu_j k_{x_j},\quad \text{define } [f, g] := \sum_{i\in I_1,\, j\in I_2}\lambda_i\mu_j\, k(x_i, x_j). \tag{5.9}$$
We stress that this is a well-formed definition: indeed, it may be possible that the same function has another representation as a linear expansion, say $f = \sum_{i\in I_1'}\lambda_i' k_{x_i}$. But it holds that $\sum_{i\in I_1,\,j\in I_2}\lambda_i\mu_j k(x_i, x_j) = \sum_{j\in I_2}\mu_j f(x_j)$ by definition, so the definition of [·, ·] does not depend on the particular representation of f. The same argument applies to g.
Now, it is easy to check that property 1 (k symmetric) implies that [·, ·] is symmetric, and that property 2 (k has positive type) implies that for $f \in \mathcal H_{\mathrm{pre}}$ having the representation as in (5.9), it holds $[f, f] = \sum_{i,j\in I_1}\lambda_i\lambda_j k(x_i, x_j) \ge 0$. Hence [·, ·] is a symmetric positive semidefinite form on the vector space $\mathcal H_{\mathrm{pre}}$.
We finally check that it is definite. A symmetric positive semidefinite form satisfies the Cauchy–Schwarz inequality, so it holds (assuming again $f \in \mathcal H_{\mathrm{pre}}$ with representation as in (5.9))
$$f(x) = \sum_{i\in I_1}\lambda_i k(x_i, x) = [f, k_x] \le [f, f]^{\frac12}[k_x, k_x]^{\frac12}, \tag{5.10}$$
hence $[f, f] = 0$ implies that $f(x) = 0$ for all x, i.e. $f = 0$ as a function. Hence [·, ·] is a symmetric positive definite bilinear form on $\mathcal H_{\mathrm{pre}}$.
Finally, define $\Phi_{\mathrm{pre}} : \mathcal X \to \mathcal H_{\mathrm{pre}}$, $x \mapsto k_x$. Then it holds
$$[\Phi_{\mathrm{pre}}(x), \Phi_{\mathrm{pre}}(x')] = [k_x, k_{x'}] = k(x, x'). \tag{5.11}$$
We have just constructed a pre-Hilbert space $\mathcal H_{\mathrm{pre}}$ and a mapping $\Phi_{\mathrm{pre}}$ such that (5.7) holds.
What is missing for a proper Hilbert space is completeness. But this space can be completed: there exists a complete Hilbert space $\mathcal H_\circ$ and an isometry $i : (\mathcal H_{\mathrm{pre}}, [\cdot,\cdot]) \to (\mathcal H_\circ, \langle\cdot,\cdot\rangle_{\mathcal H_\circ})$ such that $i(\mathcal H_{\mathrm{pre}})$ is dense in $\mathcal H_\circ$. (The completion operation is obtained by considering equivalence classes of Cauchy sequences in $\mathcal H_{\mathrm{pre}}$; this is a standard construction that we do not detail here.)
Correspondingly we can define $\Phi_\circ(x) := i \circ \Phi_{\mathrm{pre}}(x)$, which satisfies (5.7) because of (5.11), since i is an isometry. This concludes the proof.
In the previous proof, Hpre was specifically constructed as a pre-Hilbert space of real-
valued functions on X with Φpre (x) = kx . It is an important point that this property in
fact carries over to its completion, and we highlight this in the next result.

** Theorem 5.3 (and definition). If k is a spsd kernel on the set X (as in Theorem 5.2),
the Hilbert space H and the mapping Φ satisfying (5.7) can be constructed so that:

76
1. H is a vector space of real-valued functions on X ;

2. The mapping Φ : X → H is given by Φ : x 7→ kx ;

3. The following reproducing property is satisfied:

∀f ∈ H : f (x) = ⟨kx , f ⟩H . (5.12)

The space H satisfying the above properties is unique and is called reproducing kernel
Hilbert space on X with kernel k.
Furthermore, Hpre given by (5.8) is dense in H.

Proof. We have constructed in the proof of Theorem 5.2 a pre-Hilbert space Hpre satisfying
the announced properties (observe that the reproducing property (5.12) holds in Hpre by
construction/definition of the form [·, ·]). What about its completion H◦ ? We recall that
there exists an isometry i : Hpre → H◦ with i(Hpre ) dense in H◦ . We now construct the
following mapping

ξ : H◦ → F(X , R) : h 7→ ξ(h) := (x ∈ X 7→ ⟨i(kx ), h⟩H◦ ).

Let us prove that ξ is injective. Assume ξ(h) = 0, which is to say, for all x ∈ X it holds
⟨i(kx ), h⟩H◦ = 0. This implies by linearity that for any f ∈ Hpre , ⟨i(f ), h⟩H◦ = 0. But since
i(Hpre ) is dense in H◦ , it implies that for any h′ ∈ H◦ , ⟨h′ , h⟩H◦ = 0, hence h = 0.
Since ξ is linear, it defines a bijection between H◦ and ξ(H◦ ) ⊂ F(X , R). We can
therefore endow H = ξ(H◦ ) with the scalar product ⟨f, f ′ ⟩ := ⟨ξ −1 (f ), ξ −1 (f ′ )⟩, so that H
is a Hilbert space of functions X → R which is isometric to H◦ .
Additionally, we observe that $\mathcal H_{\mathrm{pre}} \subseteq \mathcal H$, since ξ ∘ i coincides with the identity: for any $x \in \mathcal X$, it holds
$$\xi(i(k_x)) = \big(y \mapsto \langle i(k_y), i(k_x)\rangle_{\mathcal H_\circ} = \langle k_y, k_x\rangle_{\mathcal H_{\mathrm{pre}}} = k(x, y)\big) = k_x, \tag{5.13}$$
hence, by linearity of ξ ∘ i, we have $\mathcal H_{\mathrm{pre}} \hookrightarrow \mathcal H$, which is an inclusion of Hilbert spaces since ξ ∘ i is an isometry by composition of isometries.
We can therefore define the feature mapping Φ(x) = kx , which satisfies

⟨Φ(x), Φ(x′ )⟩H = ⟨kx , kx′ ⟩H = ⟨kx , kx′ ⟩Hpre = k(x, x′ ),

by the above isometric inclusion.


Finally, we check the reproducing property: for any f ∈ H = ξ(H◦ ) there exists h ∈ H◦
with f = ξ(h), hence f = (x 7→ ⟨h, kx ⟩H◦ ) and for any x ∈ X :

f (x) = ⟨h, i(kx )⟩H◦ while ⟨f, kx ⟩H = ⟨ξ(h), ξ(i(kx ))⟩H = ⟨h, i(kx )⟩H◦ ,

hence (5.12) is satisfied.
We turn to unicity. Let H′ be another Hilbert space of real functions on X satisfying
the announced properties. By property 2. it holds that kx ∈ H′ for all x ∈ X , and by
consequence Hpre ⊆ H′ . Furthermore property 3. implies that ⟨kx , kx′ ⟩H = k(x, x′ ) =
[kx , kx′ ], where [·, ·] is the bilinear form constructed on Hpre in the proof of Theorem 5.2.
By linearity ⟨·, ·⟩H coincides with [·, ·] on Hpre . Hence the identity mapping Hpre ,→ H′ is
an isometry.
On the other hand we have established that $\mathcal H_{\mathrm{pre}} \subseteq \mathcal H$ via the isometric inclusion ξ ∘ i. Let $\overline{\mathcal H}_{\mathrm{pre}}$ be the closure of $\mathcal H_{\mathrm{pre}}$ in H. It can be checked that $\overline{\mathcal H}_{\mathrm{pre}} = \mathcal H$, where H was constructed above. Indeed, we know that $i(\mathcal H_{\mathrm{pre}})$ is dense in $\mathcal H_\circ$, hence by isometry $\xi \circ i(\mathcal H_{\mathrm{pre}}) = \mathcal H_{\mathrm{pre}}$ is dense in $\xi(\mathcal H_\circ) = \mathcal H$.
Finally, observe that the closure of $\mathcal H_{\mathrm{pre}}$ in H coincides with the closure of $\mathcal H_{\mathrm{pre}}$ in H′. Indeed, any Cauchy sequence of functions $(f_n)_{n\ge1}$ in $\mathcal H_{\mathrm{pre}}$ converges both in H and H′ by completeness of both these spaces, and the limit point f is uniquely determined as a function, since for any $x \in \mathcal X$, $f(x) = \lim_{n\to\infty}[k_x, f_n]$, the right-hand side of the latter equality being a convergent real sequence (by continuity of the scalar product), hence having a unique limit. So any such limit function f belongs to both H and H′, and since $\mathcal H = \overline{\mathcal H}_{\mathrm{pre}}$, we have $\mathcal H \subseteq \mathcal H'$.

Since H is closed in H′ we can write H′ = H ⊕ H1 ; but for any f1 ∈ H1 we have since
H1 ⊥ H that for any h ∈ H, ⟨f1 , h⟩H′ = 0. In particular, for any x ∈ X , kx ∈ H and
⟨f1 , kx ⟩H′ = f1 (x) = 0 (by the assumed reproducing property on H′ ), so f1 = 0. Finally
H′ = H, proving unicity.
For a complete overview we also mention the following characterization of reproducing kernel Hilbert spaces.

** Theorem 5.4. Let H be a Hilbert space of real-valued functions over a set X . Then
the following properties are equivalent:

1. For all x ∈ X , the evaluation function

δx : H → R; f 7→ f (x) (5.14)

is continuous.

2. There exists a (unique) function k : X × X → R such that:

(a) for all x ∈ X : k(x, ·) ∈ H.


(b) for all x ∈ X and f ∈ H: ⟨f, k(x, ·)⟩ = f (x).

Furthermore, the function k in the last point is a spsd kernel, so that H is the repro-
ducing kernel Hilbert space with kernel k.

Proof. (1) ⇒ (2): since δx is continuous, by Riesz’ theorem there exists a unique element
ζx ∈ H such that δx (f ) = ⟨ζx , f ⟩ for all f ∈ H. Since H is a space of real-valued functions,
we define k(x, y) = ζx (y) for all x, y. Then k(x, ·) = ζx , so that the announced properties
(a) and (b) are satisfied.
(2) ⇒ (1): we have for any f ∈ H and x ∈ X , that δx (f ) = f (x) = ⟨f, kx ⟩ which is
continuous by continuity of the scalar product (which can be seen as a consequence of the
Cauchy-Schwarz inequality).
The kernel k appearing in (2) is spsd: it holds $k(x, y) = k_x(y) = \langle k_x, k_y\rangle = k_y(x) = k(y, x)$, so k is symmetric. Furthermore
$$\sum_{i,j}\alpha_i\alpha_j k(x_i, x_j) = \sum_{i,j}\alpha_i\alpha_j\langle k_{x_i}, k_{x_j}\rangle = \bigg\|\sum_i \alpha_i k_{x_i}\bigg\|^2 \ge 0.$$

5.3 Construction of spsd kernels


Theorem 5.2 gave us a characterization of kernels that are scalar products of mappings
of points of X through a feature mapping Φ to an underlying feature space H: these are
exactly the spsd kernels. But it is not obvious to check if a given function is spsd. Instead,
the following result will tell us how to construct many such kernels.

** Theorem 5.5. Let X be a nonempty set.

(i) If f is a real-valued function on X , then k(x, y) := f (x)f (y) is a spsd kernel on


X.

(ii) If X is a Euclidean or Hilbert space with inner product ⟨·, ·⟩, then k(x, y) = ⟨x, y⟩
is a spsd kernel. (“Linear kernel”)

(iii) If k1 is a spsd kernel on X and c ≥ 0 is a real, then k = ck1 is a spsd kernel.

(iv) If k1 , k2 are spsd kernels on X , then k = k1 + k2 is a spsd kernel.

(v) If k1 , k2 are spsd kernels on X , then k = k1 k2 is a spsd kernel.

(vi) If X ′ is another set, k3 a spsd kernel on X ′ and F : X → X ′ a mapping, then


k(x, y) := k3 (F (x), F (y)) is a spsd kernel on X .

(vii) If $(k_i)_{i\ge1}$ is a sequence of spsd kernels on X which converges pointwise (for any $x, y \in \mathcal X$), then the limiting function is a spsd kernel.

The proofs of all points of the above theorem are left as an exercise, with the exception of point (v), for which we provide the following lemma:

Lemma 5.6. Let M, N be two (n, n) spsd real matrices. Then the (n, n) matrix A defined by $A_{ij} := M_{ij}N_{ij}$ is spsd.

Proof. Obviously A is symmetric. Let $(e_k)_{1\le k\le n}$ be an orthonormal basis diagonalizing M, with corresponding nonnegative eigenvalues $(\lambda_k)_{1\le k\le n}$. Thus $M = \sum_{k=1}^n \lambda_k e_k e_k^t$, and $M_{ij} = \sum_{k=1}^n \lambda_k e_k^{(i)} e_k^{(j)}$. Similarly, let $(f_\ell)_{1\le\ell\le n}$ be a diagonalizing basis of N with corresponding nonnegative eigenvalues $(\mu_\ell)_{1\le\ell\le n}$, so that $N_{ij} = \sum_{\ell=1}^n \mu_\ell f_\ell^{(i)} f_\ell^{(j)}$. For any $(\alpha_1, \ldots, \alpha_n) \in \mathbb R^n$:
$$\begin{aligned}
\sum_{i,j=1}^n \alpha_i\alpha_j A_{ij} &= \sum_{i,j=1}^n \alpha_i\alpha_j\bigg(\sum_{k=1}^n \lambda_k e_k^{(i)} e_k^{(j)}\bigg)\bigg(\sum_{\ell=1}^n \mu_\ell f_\ell^{(i)} f_\ell^{(j)}\bigg)\\
&= \sum_{i,j,k,\ell=1}^n \alpha_i\alpha_j\lambda_k\mu_\ell\, e_k^{(i)} e_k^{(j)} f_\ell^{(i)} f_\ell^{(j)}\\
&= \sum_{k,\ell=1}^n \lambda_k\mu_\ell \sum_{i,j=1}^n \big(\alpha_i e_k^{(i)} f_\ell^{(i)}\big)\big(\alpha_j e_k^{(j)} f_\ell^{(j)}\big)\\
&= \sum_{k,\ell=1}^n \lambda_k\mu_\ell\bigg(\sum_{i=1}^n \alpha_i e_k^{(i)} f_\ell^{(i)}\bigg)^2 \ge 0.
\end{aligned}$$
Therefore A is spsd.
We deduce the following corollaries from Theorem 5.5:

Corollary 5.7. Let X be a Euclidean or Hilbert space, and f be a real polynomial with
nonnegative coefficients. Then k(x, y) := f (⟨x, y⟩) is a spsd kernel on X .

*** Corollary 5.8. Let X be a Euclidean or Hilbert space, and $F(t) = \sum_{i\ge0} a_i t^i$ be an analytic function with real, nonnegative coefficients $a_i \ge 0$ and convergence radius $R > 0$. Then $k(x, y) := F(\langle x, y\rangle)$ is a spsd kernel on $B_{\mathcal X}(0, \sqrt R) = \big\{x \in \mathcal X : \|x\| < \sqrt R\big\}$.

Proof. Corollary 5.7 is a direct consequence of points (ii)–(iii)–(iv)–(v) of Theorem 5.5. Corollary 5.8 is a consequence of the previous corollary and of point (vii) of Theorem 5.5, noticing that if $x, y \in B_{\mathcal X}(0, \sqrt R)$ then $|\langle x, y\rangle| \le \|x\|\|y\| < R$, so ⟨x, y⟩ is within the convergence radius of the power series defining F(t); therefore F(⟨x, y⟩) is the limit of the corresponding truncated series, while each truncated series defines a spsd kernel by Corollary 5.7.

* Examples of spsd kernels. In each of the following examples X is a subset of Rd .

• $k(x, y) = (\langle x, y\rangle + c)^m$ for $m \in \mathbb N^*$, $c > 0$: polynomial kernel of order m.

• $k(x, y) = (1 - \langle x, y\rangle)^{-\alpha}$ for $\alpha > 0$, on $\mathcal X = B_{\mathbb R^d}(0, 1)$: negative binomial kernel.

• $k(x, y) = \exp(\lambda\langle x, y\rangle)$, for $\lambda > 0$: exponential kernel.

• $k(x, y) = \exp\Big(-\frac{\|x-y\|^2}{2\sigma^2}\Big)$, for $\sigma > 0$: Gaussian kernel.
Exercise 5.2. Justify that the above kernels are spsd.
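As an empirical complement to the exercise, one can at least check the spsd property numerically on sampled points, e.g. for the Gaussian kernel (our own code):

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
K = gaussian_gram(rng.normal(size=(30, 4)))
# spsd <=> all eigenvalues nonnegative (up to numerical error)
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True
```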

5.4 Kernel-based methods


We now turn back to our initial motivation: how to use a spsd kernel as a “proxy” for a
scalar product, and adapt linear methods to the kernel setting. Remember that:

• using a spsd kernel k instead of the regular scalar product can be seen as (implicitly) mapping the x-data to a Hilbert space H via a feature mapping Φ, and applying the linear method to the transformed data $\tilde x$;

• linear functions of the transformed data $\tilde x$ are in general non-linear functions of the original data x;

• for each algorithm we want to find a suitable representation of the learnt function as an expansion of the form (5.2) in the transformed data, i.e.:
$$\hat w = \sum_{i=1}^n \beta_i \Phi(X_i); \tag{5.15}$$
thus we only need to store the n-vector of coefficients $(\beta_i)_{1\le i\le n}$ and to determine how to compute it from the sole information of the scalar products $(\langle\Phi(X_i), \Phi(X_j)\rangle)_{1\le i,j\le n}$ and the labels $(Y_i)_{1\le i\le n}$.

In this section, we will assume that k is a given spsd kernel on X , with H the associated RKHS, and denote K the kernel Gram matrix of the x-data, i.e. $K_{ij} := k(X_i, X_j)$, $1 \le i, j \le n$.
If we are using the RKHS H associated to k, due to the reproducing property, if $w \in \mathcal H$ we have
$$f_w(x) := \langle w, \Phi(x)\rangle = \langle w, k_x\rangle = w(x); \quad \text{hence } f_w = w;$$
in other words the function $f_w$ associated to w is w itself. Also, the representation (5.15) becomes (here denoting $\hat f$ instead of $\hat w$ to emphasize that it is a function):
$$\hat f = \sum_{i=1}^n \beta_i k_{X_i}. \tag{5.16}$$
This form is sometimes called “kernel expansion” or “dual representation”.
* Kernel perceptron. Recall again the standard perceptron iteration (Y = {−1, 1}):
$$\hat w_0 = 0; \qquad \hat w_{\ell+1} = \hat w_\ell + X_{i_\ell} Y_{i_\ell}, \quad \text{where } i_\ell \text{ is any index s.t. } Y_{i_\ell}\langle\hat w_\ell, X_{i_\ell}\rangle \le 0.$$
In the “kernelized” perceptron, we represent the vectors $\hat w_\ell \in \mathcal H$ by means of their coefficient vectors $\beta^{(\ell)} \in \mathbb R^n$ in the representation (5.15). So the above becomes:
$$\beta^{(0)} = 0; \qquad \beta^{(\ell+1)} = \beta^{(\ell)} + Y_{i_\ell} e_{i_\ell},$$
where $e_i$ is the i-th canonical basis vector of $\mathbb R^n$, and $i_\ell$ is any index such that
$$Y_{i_\ell}\sum_{i=1}^n \beta_i^{(\ell)}\langle\Phi(X_{i_\ell}), \Phi(X_i)\rangle = Y_{i_\ell}\sum_{i=1}^n \beta_i^{(\ell)} k(X_{i_\ell}, X_i) = Y_{i_\ell}\big[K\beta^{(\ell)}\big]_{i_\ell} \le 0.$$
After L iterations, the corresponding prediction function is
$$\mathrm{sign}\big(f_{\hat w_L}(x)\big) = \mathrm{sign}\bigg(\bigg\langle\sum_{i=1}^n \beta_i^{(L)}\Phi(X_i), \Phi(x)\bigg\rangle\bigg) = \mathrm{sign}\bigg(\sum_{i=1}^n \beta_i^{(L)} k(X_i, x)\bigg).$$
In the case of the perceptron, regularization is obtained by early stopping, which is to say, stopping at an iteration before all training points are classified correctly. The stopping iteration is typically determined from a predetermined set of values of L by hold-out or cross-validation.
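A minimal sketch of the kernelized perceptron in dual form (our own implementation; the iteration budget plays the role of the early-stopping regularization discussed above):

```python
import numpy as np

def kernel_perceptron(K, Y, max_iter=1000):
    """Dual perceptron: K is the Gram matrix, Y in {-1, +1}^n."""
    beta = np.zeros(len(Y))
    for _ in range(max_iter):
        margins = Y * (K @ beta)               # Y_i [K beta]_i for all i
        bad = np.flatnonzero(margins <= 0)     # misclassified points
        if len(bad) == 0:
            break                              # all points correctly classified
        i = bad[0]
        beta[i] += Y[i]                        # beta^(l+1) = beta^(l) + Y_il e_il
    return beta

def predict(beta, k_train_x):
    """k_train_x[i] = k(X_i, x); returns sign(sum_i beta_i k(X_i, x))."""
    return np.sign(beta @ k_train_x)
```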

** Regularized kernel ERM. We assume here that the prediction space is $\widetilde{\mathcal Y} = \mathbb R$, and $\ell : \widetilde{\mathcal Y} \times \mathcal Y \to \mathbb R$ is a loss function. We consider ERM over prediction functions of the form $f_w(x) := \langle w, \Phi(x)\rangle$ for $w \in \mathcal H$; as we have seen, it holds $f_w = w$, hence our class of prediction functions is the RKHS H itself. Furthermore, we consider regularization by the squared RKHS norm. For a regularization parameter $\lambda > 0$, we therefore define
$$\hat f_\lambda \in \operatorname*{Arg\,Min}_{w\in\mathcal H}\bigg(\sum_{i=1}^n \ell(Y_i, \langle w, \Phi(X_i)\rangle) + \lambda\|w\|_{\mathcal H}^2\bigg) = \operatorname*{Arg\,Min}_{f\in\mathcal H}\bigg(\sum_{i=1}^n \ell(Y_i, f(X_i)) + \lambda\|f\|_{\mathcal H}^2\bigg). \tag{5.17}$$
We want to prove that, in general, $\hat f_\lambda$ admits the representation (5.16). This will be established as a consequence of the following result.

*** Theorem 5.9 (Representation theorem). Let H be a RKHS on X with kernel k. Let $n > 0$ be an integer and $\Psi : \mathbb R^n \times \mathbb R_+ \to \mathbb R$ be a mapping such that:

for any $\mathbf x \in \mathbb R^n$, the function $t \in \mathbb R_+ \mapsto \Psi(\mathbf x, t)$ is nondecreasing.

For any $\mathbf x = (x_1, \ldots, x_n) \in \mathcal X^n$, denote
$$S_{\mathbf x} := \mathrm{Span}\{k_{x_i}, i = 1, \ldots, n\} = \bigg\{\sum_{i=1}^n \beta_i k_{x_i},\ (\beta_1, \ldots, \beta_n) \in \mathbb R^n\bigg\};$$
then it holds
$$\inf_{f\in\mathcal H}\Psi(f(x_1), \ldots, f(x_n), \|f\|_{\mathcal H}) = \inf_{f\in S_{\mathbf x}}\Psi(f(x_1), \ldots, f(x_n), \|f\|_{\mathcal H}).$$
Furthermore, if the infimum on the left-hand side is a minimum, then it is also a minimum on the right-hand side; in other words, the minimum over H is attained at an element of $S_{\mathbf x}$.

Proof. Let $\mathbf x = (x_1, \ldots, x_n) \in \mathcal X^n$ be fixed. Observe that $S_{\mathbf x}$ is a closed subspace of H since it is finite-dimensional. Hence there exists a well-defined orthogonal projector Π onto $S_{\mathbf x}$.
For $f \in \mathcal H$, denote for short $\Psi(f) := \Psi(f(x_1), \ldots, f(x_n), \|f\|_{\mathcal H})$. Let $f_\varepsilon$ be such that $\Psi(f_\varepsilon) \le \inf_{f\in\mathcal H}\Psi(f) + \varepsilon$ for a fixed constant $\varepsilon > 0$. Consider the decomposition $f_\varepsilon = \tilde f_\varepsilon + f_\varepsilon^\perp$, where $\tilde f_\varepsilon := \Pi f_\varepsilon$; observe that $f_\varepsilon^\perp \perp S_{\mathbf x}$, and therefore, by the reproducing property,
$$\forall i = 1, \ldots, n: \quad f_\varepsilon^\perp(x_i) = \big\langle f_\varepsilon^\perp, k_{x_i}\big\rangle = 0,$$
so that $f_\varepsilon(x_i) = \tilde f_\varepsilon(x_i)$, for $i = 1, \ldots, n$.
On the other hand, since an orthogonal projector is a contraction, it holds $\|\tilde f_\varepsilon\|_{\mathcal H} \le \|f_\varepsilon\|_{\mathcal H}$. Therefore
$$\Psi(\tilde f_\varepsilon) = \Psi\big(\tilde f_\varepsilon(x_1), \ldots, \tilde f_\varepsilon(x_n), \|\tilde f_\varepsilon\|_{\mathcal H}\big) = \Psi\big(f_\varepsilon(x_1), \ldots, f_\varepsilon(x_n), \|\tilde f_\varepsilon\|_{\mathcal H}\big) \le \Psi\big(f_\varepsilon(x_1), \ldots, f_\varepsilon(x_n), \|f_\varepsilon\|_{\mathcal H}\big) = \Psi(f_\varepsilon),$$
where the inequality comes from the monotonicity of Ψ in its last variable. This proves the first claim of the theorem and, in case the infimum over $f \in \mathcal H$ is a minimum, the same argument with ε = 0 proves the second claim.

Examples of kernel ERM methods. We revisit here different (regularized) ERM methods studied in Chapter 2 in “kernelized” form, using the representation (5.16), which we know holds in general due to Theorem 5.9 (we assume in each case that the minimum is attained).
Kernel ridge regression. Kernel ridge regression is regularized least-squares regression using a RKHS as function space and the RKHS norm as regularization:
$$\text{for } \lambda > 0: \quad \hat f_\lambda \in \operatorname*{Arg\,Min}_{f\in\mathcal H}\bigg(\sum_{i=1}^n (Y_i - f(X_i))^2 + \lambda\|f\|_{\mathcal H}^2\bigg). \tag{5.18}$$
From Theorem 5.9 we know that we can assume the representation (5.16), i.e. $\hat f_\lambda = \sum_{i=1}^n \beta_{\lambda,i}\, k(X_i, \cdot)$. Denoting $\beta_\lambda := (\beta_{\lambda,1}, \ldots, \beta_{\lambda,n}) \in \mathbb R^n$ the coefficients of this expansion, we observe that $\|\hat f_\lambda\|_{\mathcal H}^2 = \sum_{i,j=1}^n \beta_{\lambda,i}\beta_{\lambda,j} k(X_i, X_j) = \beta_\lambda^t K\beta_\lambda$; also $(\hat f_\lambda(X_1), \ldots, \hat f_\lambda(X_n)) = K\beta_\lambda$. Restricting the search for a minimum to functions of this form, (5.18) becomes:
$$\text{for } \lambda > 0: \quad \beta_\lambda \in \operatorname*{Arg\,Min}_{\beta\in\mathbb R^n}\big(\|Y - K\beta\|^2 + \lambda\beta^t K\beta\big), \tag{5.19}$$
where we recall $Y = (Y_1, \ldots, Y_n)^t$. By the usual arguments (cancelling the first derivative with respect to β of the function to minimize), we obtain the necessary and sufficient condition
$$K(K + \lambda I_n)\beta_\lambda = KY;$$
hence a solution is
$$\beta_\lambda = (K + \lambda I_n)^{-1} Y; \tag{5.20}$$
observe that we have recovered exactly the formula (5.5) discussed at the beginning of the chapter, but for the data mapped into the Hilbert space.
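A minimal sketch of kernel ridge regression via (5.20) with a Gaussian kernel (helper names and parameter values are our own):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def krr_fit(X, Y, lam=0.1, sigma=1.0):
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(Y)), Y)   # beta of (5.20)

def krr_predict(beta, X_train, X_test, sigma=1.0):
    # f_hat(x) = sum_i beta_i k(X_i, x), the kernel expansion (5.16)
    return gaussian_kernel(X_test, X_train, sigma) @ beta

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
beta = krr_fit(X, Y)
print(krr_predict(beta, X, np.array([[0.0], [1.5]])))  # approx sin(0), sin(1.5)
```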
Kernel logistic regression. Recall that logistic regression (for a binary classification problem with label space Y = {−1, 1}) can be seen as an ERM estimator with loss function
$$\ell_{\mathrm{logit}}(f(x), y) = \log\big(1 + \exp(-y f(x))\big), \quad y \in \{-1, 1\},$$
see Exercise 2.4. Again, for the “kernelized” (and regularized) version given by (5.17), using the representation (5.16) and the fact that $\hat f_\lambda(X_i) = [K\beta_\lambda]_i$, the coefficient vector $\beta_\lambda \in \mathbb R^n$ is defined by
$$\beta_\lambda \in \operatorname*{Arg\,Min}_{\beta\in\mathbb R^n}\bigg(\sum_{i=1}^n \log\big(1 + \exp(-Y_i[K\beta]_i)\big) + \lambda\beta^t K\beta\bigg),$$
which is a convex optimization problem in β and can be solved by standard methods such as gradient descent, stochastic gradient descent, or Newton–Raphson iterations. The latter requires the inversion of an (n, n) Hessian matrix at each step, which can be prohibitive, so the former methods might be preferred even if their convergence rate is not as fast.
In the multiclass case (Y = {0, . . . , K − 1}), a similar argument holds with an appropriate loss function and the fact that we are looking for (K−1) score functions $\hat f_{\lambda,1}, \ldots, \hat f_{\lambda,K-1}$; it suffices to adapt (2.13) to the kernel setting.
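As an illustration, here is a sketch of kernel logistic regression by plain gradient descent on the convex objective above (our own implementation choices: fixed step size and iteration count, not an optimized solver):

```python
import numpy as np

def klogreg_fit(K, Y, lam=0.1, lr=0.01, n_iter=2000):
    """Gradient descent on sum_i log(1+exp(-Y_i [K beta]_i)) + lam beta^T K beta."""
    beta = np.zeros(len(Y))
    for _ in range(n_iter):
        margins = Y * (K @ beta)
        # gradient of the loss term: -K @ (Y * sigmoid(-margins)); K is symmetric
        grad = -K @ (Y / (1 + np.exp(margins))) + 2 * lam * (K @ beta)
        beta -= lr * grad
    return beta  # score function: x -> sum_i beta_i k(X_i, x)
```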
Kernel support vector machine. It is in all points similar to the previous argument for logistic regression (still for binary classification with Y = {−1, 1}), but with the loss function $\ell_{\mathrm{Hinge}}(f(x), y) := (1 - yf(x))_+$. Again, the optimization problem for the kernel expansion coefficients $\beta_\lambda \in \mathbb R^n$ is convex. There exist a number of implementations using further reformulations of the problem and exploiting the particular form of the loss function for efficient computation of an approximate minimum. It is one of the most standard classification methods in machine learning toolboxes.

84
5.5 Regularity and approximation properties of functions in a
RKHS
It is of general interest to understand the properties of the functions belonging to a RKHS
with a given kernel, since the RKHS is the class of functions we use as predictors (or scores
in the case of classification) in different learning settings.
From a practical point of view, we can observe that due to the representation (5.16) as a finite kernel expansion, the considered estimators belong (in general) to Hpre. Therefore, whenever the kernel function k(·, ·) is measurable, resp. bounded, resp. continuous with respect to either of its variables, so are the functions in Hpre, being finite linear combinations.
Still, it is of mathematical interest (for further mathematical analysis, use of Hilbertian
analysis tools, etc.) to understand if this is also the case for the full RKHS obtained as
the completion of Hpre .

* Theorem 5.10 (Measurability). Let X be a measurable space and k a spsd kernel on X with RKHS H. Then

(∀f ∈ H, f is measurable) ⇔ (∀x ∈ X : k(x, ·) is measurable).
Proof. (⇒) trivial since k(x, ·) = kx ∈ H for all x ∈ X.
(⇐) by linearity, any function in Hpre is measurable. Now for f ∈ H, remember that
Hpre is dense in H (Theorem 5.3), so there exists a sequence (fn )n≥1 of elements of Hpre
such that ∥fn − f ∥H → 0 as n → ∞. This implies pointwise convergence since by the
reproducing property and Cauchy-Schwarz’s inequality, for any x ∈ X
|fn(x) − f(x)| = |⟨f − fn, kx⟩| ≤ ∥f − fn∥_H ∥kx∥_H = ∥f − fn∥_H √k(x, x).
Therefore f is measurable as a pointwise limit of measurable functions.

* Theorem 5.11 (Boundedness). Let k be a spsd kernel on a nonempty set X with RKHS H. Then the following are equivalent:
(i) ∀f ∈ H : supx∈X |f (x)| < ∞.
(ii) supx∈X k(x, x) < ∞.
(iii) sup(x,y)∈X 2 |k(x, y)| < ∞.
If any of these properties is satisfied, the topology of the H-norm is stronger than the supremum norm topology; more precisely

∀f ∈ H : ∥f∥∞ ≤ ∥f∥_H sup_{x∈X} √k(x, x).  (5.21)

Proof. (ii) ⇒ (i) because |f(x)| = |⟨f, kx⟩| ≤ ∥f∥_H √k(x, x); this also implies (5.21).
(ii) ⇒ (iii) because |k(x, y)| = |⟨kx, ky⟩| ≤ √k(x, x) √k(y, y).
(iii) ⇒ (ii): trivial.
(i) ⇒ (ii): can be seen as a consequence of the Banach-Steinhaus theorem. Namely, consider the family of linear forms on H given by L := {δx, x ∈ X}, where δx is the evaluation functional at point x given by (5.14); we have

∀f ∈ H : sup_{L∈L} |L(f)| = sup_{x∈X} |f(x)| < ∞

by assumption. The Banach-Steinhaus theorem implies (since H is a Banach space) that the pointwise bounded family of linear forms L is uniformly bounded, therefore sup_{x∈X} ∥δx∥_{H∗} < ∞. But by the reproducing property it holds δx = ⟨·, kx⟩, therefore

∥δx∥_{H∗} = sup_{∥f∥_H=1} ⟨kx, f⟩ = ∥kx∥_H = √k(x, x).

* Theorem 5.12 (Continuity). Let X be a topological space and k a spsd kernel on X with RKHS H. Then

(k is continuous X × X → R) ⇒ (∀f ∈ H, f is continuous).

Proof. We have for any f ∈ H:

|f(x) − f(y)| = |⟨f, kx − ky⟩| ≤ ∥f∥_H (k(x, x) + k(y, y) − 2k(x, y))^{1/2},

and the last factor converges to 0 as x → y, by bivariate continuity of k.

Universal kernels. We end this chapter with important results on the approximation properties of RKHSs. Introduce the following definition:

** Definition 5.13.
Let X be a nonempty, compact topological space. A continuous spsd kernel k on X × X is called universal if the corresponding RKHS H is dense (in the sense of the supremum norm) in the space C(X) of continuous real-valued functions on X.
This definition extends to a non-compact topological space X: then k is said to be universal if its restriction to any compact subset of X is universal.

We begin with the following observation:

Proposition 5.14. Let X be a nonempty, compact topological space. Then a continuous spsd kernel on X is universal iff Hpre is dense in C(X) for the supremum norm.

Proof. (⇐): since Hpre ⊆ H, Hpre dense in C(X) trivially implies that H is also dense in
C(X).
(⇒): we know that Hpre is dense in H in the sense of the H-norm. Since k is continuous on the compact X × X, it is bounded; therefore (5.21) applies and Hpre is a fortiori dense in H for the supremum norm, and hence also dense in C(X) for the supremum norm, since H is.
The following final result of this section is very useful to establish universality of a
number of classical kernels on Rd .

** Theorem 5.15 (Universal Taylor kernels). Let F(t) = Σ_{i≥0} ai tⁱ be a real-valued analytic function with real, strictly positive coefficients ai > 0 and convergence radius R > 0.
Let X = R^d. Then k(x, y) := F(⟨x, y⟩) is a universal spsd kernel on B_X(0, √R) = {x ∈ X : ∥x∥ < √R}.

This result implies in particular that the exponential kernel and the negative-binomial
kernel introduced at the end of Section 5.3 are universal on Rd and BRd (0, 1), respectively.

Lemma 5.16. Let k be a spsd kernel on a nonempty set X , and H◦ be a Hilbert space and
Φ◦ a mapping X → H◦ , so that it holds for all x, y in X : ⟨Φ◦ (x), Φ◦ (y)⟩H◦ = k(x, y).
Then, for any w ∈ H◦ , the function x 7→ ⟨w, Φ◦ (x)⟩H◦ belongs to the RKHS H associ-
ated to k.

Proof. Let us denote ξ : H◦ → F(X, R) the linear mapping given by

ξ(w) = (x ↦ ⟨w, Φ◦(x)⟩_{H◦}).

If w = Φ◦(x) for some x ∈ X, the function x′ ↦ ⟨w, Φ◦(x′)⟩_{H◦} = k(x, x′) coincides with k(x, ·), i.e. ξ(Φ◦(x)) = kx ∈ Hpre ⊆ H, and furthermore ∥ξ(Φ◦(x))∥_H = √k(x, x) = ∥Φ◦(x)∥_{H◦}. By linearity, ξ is an isometry from H1 := Span{Φ◦(x), x ∈ X} into H (in fact into Hpre); in particular ξ(H1) ⊆ H.
For any sequence (wn)n≥1 of elements of H1 converging to some w∗ in H◦, the sequence (ξ(wn))n≥1 is Cauchy in H (by isometry) and therefore converges to a limit f∗. But it holds for any x ∈ X (using the definition of ξ, continuity of scalar products in H◦ and H, the isometry property of ξ, and the reproducing property in H):

ξ(w∗)(x) = ⟨w∗, Φ◦(x)⟩_{H◦} = lim_{n→∞} ⟨wn, Φ◦(x)⟩_{H◦} = lim_{n→∞} ⟨ξ(wn), ξ(Φ◦(x))⟩_H = lim_{n→∞} ⟨ξ(wn), kx⟩_H = f∗(x),

hence ξ(w∗) = f∗ ∈ H. Therefore ξ(H̄1) ⊆ H, where H̄1 denotes the closure of H1 in H◦.



Finally, since H̄1 is closed in H◦, it holds H◦ = H̄1 ⊕ (H̄1)⊥. But for any w ∈ (H̄1)⊥ and any x ∈ X, since w ⊥ Φ◦(x) ∈ H1, it holds

ξ(w)(x) = ⟨w, Φ◦(x)⟩_{H◦} = 0.

Altogether we have established ξ(H◦) ⊆ H, which is the desired claim.


Proof of Theorem 5.15. In this proof we will construct a Hilbert space H◦ and a mapping Φ◦ : X → H◦ with the property ⟨Φ◦(x), Φ◦(y)⟩_{H◦} = k(x, y); H◦ will be different from the RKHS H associated to k, but we will map it into H using the previous lemma.
To simplify the next calculation we introduce the following notation: for j = (j1, . . . , jd) ∈ N^d, put s(j) := j1 + · · · + jd; let c(j) := s(j)!/(j1! · · · jd!) be the multinomial coefficient; and for x ∈ X, let mj(x) := Π_{i=1}^d xi^{ji} be the monomial in the coordinates of x associated to the multi-index j. Observe that since F(t) = Σ_{ℓ≥0} aℓ tℓ, we have (for max(∥x∥, ∥y∥) < √R, ensuring convergence of the series below), by the multinomial formula:

k(x, y) = Σ_{ℓ≥0} aℓ ⟨x, y⟩^ℓ = Σ_{ℓ≥0} aℓ (Σ_{i=1}^d xi yi)^ℓ = Σ_{ℓ≥0} aℓ Σ_{j1+···+jd=ℓ} c(j) Π_{i=1}^d (xi yi)^{ji}
= Σ_{j∈N^d} a_{s(j)} c(j) mj(x) mj(y)
= Σ_{j∈N^d} φj(x) φj(y),

where φj(x) := √(a_{s(j)} c(j)) mj(x).
We therefore consider H◦ := ℓ²(N^d) and Φ◦(x) := (φj(x))_{j∈N^d}. Note that absolute convergence of the power series defining F ensures that Φ◦(x) ∈ H◦, i.e. Σ_{j∈N^d} (φj(x))² < ∞, for any x ∈ R^d such that ∥x∥ < √R.
We can apply the previous lemma and conclude that for any w ∈ H◦, the function x ↦ ⟨w, Φ◦(x)⟩_{H◦} belongs to the RKHS H. Let us choose, for an arbitrary multi-index j ∈ N^d, the vector w ∈ H◦ whose j-th coordinate is 1/√(a_{s(j)} c(j)) and whose other coordinates are 0. Then for any x ∈ X:

⟨w, Φ◦(x)⟩_{H◦} = mj(x).
We conclude that all monomial functions in the coordinates of x belong to H; then also
all polynomials by linearity, and the conclusion is a consequence of the Stone-Weierstraß
theorem.
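To make Theorem 5.15 concrete, here is a quick numerical check in dimension d = 1 (a sketch of our own): for the exponential kernel k(x, y) = exp(xy) we have aj = 1/j!, so φj(x) = xʲ/√(j!), and a truncated feature expansion already reproduces the kernel to high accuracy:

```python
import numpy as np
from math import factorial

def taylor_features(x, degree):
    # phi_j(x) = sqrt(a_j) x^j with a_j = 1/j! for F(t) = exp(t), d = 1
    return np.array([x ** j / np.sqrt(factorial(j)) for j in range(degree + 1)])

x, y = 0.7, -0.4
approx = taylor_features(x, 10) @ taylor_features(y, 10)
print(approx, np.exp(x * y))  # the two values agree to about 1e-12
```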

⋄ 5.6 Translation invariant kernels and random Fourier features


In this section, we address more recent developments of the kernel methodology that are relevant for practice. We have seen that the spsd kernel methodology makes it possible to implicitly represent feature mappings of the x-data into an infinite-dimensional Hilbert space H. However, all kernel-based methods require storing and manipulating the kernel Gram matrix K, which is an (n, n) matrix. In typical modern applications the sample size n can be very large (hundreds of millions), which can make the computation and storage of such a matrix, with its O(n²) complexity, prohibitive, let alone manipulating it numerically.
For this reason, it has been proposed to construct explicit, approximate feature mappings Φ̃ : X → R^p (with p ≪ n) such that ⟨Φ̃(x), Φ̃(y)⟩ ≈ k(x, y). While it seems like
we are now going backwards to the beginning of the chapter — and thus maybe could
do completely without kernels at all — having gained a mathematical understanding of
the properties of the mathematical object we are approximating (a RKHS) is very valu-
able, and we could not properly understand the methods considered below without having
introduced RKHSs.
We start with a few reminders on the Fourier transform: for any f ∈ L¹(R^d, C), the d-dimensional Fourier transform

f̂(ω) := ∫_{R^d} exp(−i⟨x, ω⟩) f(x) dx ; ω ∈ R^d  (5.22)

exists and satisfies:

• f̂ is continuous on R^d;
• lim_{∥ω∥→∞} f̂(ω) = 0;
• ∥f̂∥∞ ≤ ∥f∥_{L¹};
• the inverse Fourier transform formula holds: if f̂ ∈ L¹(R^d, C), then

f(x) = (2π)^{−d} ∫_{R^d} exp(i⟨x, ω⟩) f̂(ω) dω for a.e. x ∈ R^d.  (5.23)

Translation-invariant kernels. A fundamental relation between the Fourier transform and translation-invariant kernels is given in the following theorem.

Theorem 5.17. Let φ : R^d → R be continuous and let k(x, y) = φ(x − y). Assume φ = f̂ for some f ∈ L¹(R^d, R) with f(x) ≥ 0 a.e. Then k is a spsd kernel.
Note: this direction is actually the “easy” one. There exists a converse (Bochner’s
theorem) stating that if k is a spsd, translation-invariant kernel — i.e. it is of the form
k(x, y) = k(x − y, 0) = φ(x − y), where φ = k(0, ·) — then φ is the Fourier-Stieltjes
transform of a finite (nonnegative) measure on Rd .
Proof. It holds for any integer n > 0, (x1, . . . , xn) ∈ (R^d)^n and (α1, . . . , αn) ∈ R^n:

Σ_{i,j=1}^n αi αj k(xi, xj) = Σ_{i,j=1}^n αi αj φ(xi − xj)
= Σ_{i,j=1}^n αi αj ∫_{R^d} exp(−i⟨xi, ω⟩) exp(i⟨xj, ω⟩) f(ω) dω
= ∫_{R^d} |Σ_{i=1}^n αi exp(−i⟨xi, ω⟩)|² f(ω) dω ≥ 0.

Note: we have assumed here as in the rest of the chapter that φ and therefore k are real-
valued (which implies in particular that f must be symmetric around 0 in Theorem 5.17).
This can be generalized to a more general theory of complex-valued spsd (Hermitian)
kernels, mutatis mutandis.
Note: Because of the inverse Fourier formula, provided that φ is integrable we can
identify the function f as the inverse Fourier transform of φ, given by (5.23).
Examples.
• We find another proof that the Gaussian kernel k(x, y) := exp(−∥x − y∥²/(2σ²)) is spsd, with

f(t) = σ^d (2π)^{−d/2} exp(−σ²∥t∥²/2).

• On R (d = 1), the Laplace kernel k(x, y) := (1/2) exp(−γ|x − y|) (with γ > 0) is spsd, with

f(t) = (1/(2π)) γ/(γ² + t²).

Random Fourier features. If k(x, y) = φ(x − y) is a spsd translation-invariant kernel on R^d such that φ = f̂, then

k(x, y) = φ(x − y) = ∫_{R^d} exp(−i⟨x, ω⟩) exp(i⟨y, ω⟩) f(ω) dω,

where f ≥ 0 and f ∈ L¹(R^d), with ∫_{R^d} f(ω) dω = φ(0). Up to rescaling of φ (and k), we can assume φ(0) = 1 and thus interpret f as a probability density on R^d. Let Pf denote the associated probability distribution; then the above can be rewritten as

k(x, y) = E_{ω∼Pf}[exp(−i⟨x, ω⟩) exp(i⟨y, ω⟩)].  (5.24)
The idea of random Fourier features is to approximate the above expectation by a finite average over p randomly drawn frequencies (ω1, . . . , ωp) i.i.d. ∼ Pf. More explicitly, given p such random frequencies, define the mapping

Φ̃ : R^d → R^{2p} : x ↦ (1/√p) (cos(⟨ω1, x⟩), sin(⟨ω1, x⟩), . . . , cos(⟨ωp, x⟩), sin(⟨ωp, x⟩)).  (5.25)

Then it holds

⟨Φ̃(x), Φ̃(y)⟩ = Re( (1/p) Σ_{j=1}^p exp(−i⟨x, ωj⟩) exp(i⟨y, ωj⟩) ),

which converges to (5.24) in probability as p → ∞, by the law of large numbers (and the
fact that the kernel is real-valued). We can even quantify this convergence:

Proposition 5.18. If k can be represented as (5.24) and we draw (ω1, . . . , ωp) i.i.d. ∼ Pf and define Φ̃ by (5.25), then for any x, y in R^d and δ ∈ (0, 1), with probability 1 − δ over the draw of these frequencies, it holds

|k(x, y) − ⟨Φ̃(x), Φ̃(y)⟩| ≤ √(log(2δ^{−1})/(2p)).

Proof. Direct consequence of Hoeffding’s inequality (Corollary 3.9), since |Re(exp(−i⟨x, ωj⟩) exp(i⟨y, ωj⟩))| ≤ 1.
Note: there exist extensions of the above result giving an approximation control that holds uniformly over (x, y) in a compact set.
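As an illustration, the following sketch (our own code) implements random Fourier features for the Gaussian kernel, for which Pf is the N(0, σ^{−2} I_d) distribution; grouping the cosines and sines in two blocks instead of interleaving them as in (5.25) does not change the inner product:

```python
import numpy as np

def rff_map(X, omegas):
    # Phi_tilde(x) = (1/sqrt(p)) (cos<omega_j, x>, sin<omega_j, x>)_j, cf. (5.25)
    p = omegas.shape[0]
    proj = X @ omegas.T                        # entries <x_i, omega_j>
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(p)

rng = np.random.default_rng(0)
d, p, sigma = 5, 2000, 1.5
omegas = rng.normal(0.0, 1.0 / sigma, size=(p, d))  # omega ~ N(0, sigma^{-2} I_d)
x, y = rng.standard_normal(d), rng.standard_normal(d)
phi_x = rff_map(x[None, :], omegas)[0]
phi_y = rff_map(y[None, :], omegas)[0]
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
print(exact, phi_x @ phi_y)  # differ by O(1/sqrt(p)), cf. Proposition 5.18
```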

6 Introduction to statistical learning theory (part 3):
Rademacher complexities and VC theory
6.1 Introduction, reminders
Recall that in Section 3, we studied the behavior of statistical learning methods which output a prediction function f̂ belonging to some class F which was assumed to be finite or countable. The main mathematical tool was a uniform control of the form

∀f ∈ F : |E(f) − Ê(f)| ≤ R(n, F, δ),  (6.1)

holding with probability at least 1 − δ over the draw of a sample of size n. (Please note: we only consider the case of a uniform bound R(n, F, δ) independent of the function f; we do not consider bounds depending on f, as for instance (3.29), in this discussion.)
When F is finite, and the loss function is bounded, this was achieved as a consequence
of Hoeffding’s inequality (see Corollary 3.10), which gives control over a single function f ,
then a union bound (see Proposition 3.11).
Let us also recall briefly how a uniform bound (6.1) leads to a bound on the risk of an ERM over the class F:

*** Proposition 6.1. Let us assume a learning setting (consisting of an observation space X, a label space Y, a prediction space Ỹ, and a loss function ℓ : Ỹ × Y → R₊). Assume that (6.1) holds.
Let η > 0 be fixed and f̂η denote an η-approximate ERM over the class F, that is, f̂η ∈ F and

Ê(f̂η) ≤ inf_{f∈F} Ê(f) + η.

Then it holds (with the same probability with which (6.1) holds) that

E(f̂η) ≤ E∗_F + 2R(n, F, δ) + η.

Proof. This is a repetition (in a more formal setting) of the argument leading to Proposition 3.13. Let ε > 0 and let fε ∈ F be such that E(fε) ≤ inf_{f∈F} E(f) + ε. Since f̂η ∈ F, and putting R = R(n, F, δ) for short, using (6.1) twice:

E(f̂η) ≤ Ê(f̂η) + R ≤ Ê(fε) + R + η ≤ E(fε) + 2R + η ≤ E∗_F + 2R + η + ε,

and this holds for any ε > 0, hence the conclusion.

Observe that the proposition above is purely deterministic once the event (6.1) is satisfied: the only probabilistic point is to establish that this event has probability large enough (while the remainder R(n, F, δ) remains hopefully “reasonable”, in particular converging to 0 as the sample size n grows). Additionally, even if the learning algorithm f̂ is not an ERM, the bound (6.1) allows us to give a confidence bound on the (unknown) risk of f̂ based only on its empirical risk, whatever the algorithm used, since the bound is uniform.
In view of the above considerations, a goal of interest is to extend the uniform control (6.1) to more general classes, in particular (uncountably) infinite ones. Observe that (6.1) is equivalent to a probabilistic upper bound on the random variable

Z^{|·|}_F := sup_{f∈F} |E(f) − Ê(f)|,  (6.2)

holding with probability 1 − δ. This is what we set out to do in the next sections, and we will achieve it in two steps: (a) bound the deviations of Z^{|·|}_F from its expectation, with high probability; (b) bound the expectation of Z^{|·|}_F. We will also be interested in similar bounds for the closely related variables

Z⁺_F := sup_{f∈F} (Ê(f) − E(f)), and Z⁻_F := sup_{f∈F} (E(f) − Ê(f)).  (6.3)

Note. In complete generality, it cannot be ensured that the variable defined in (6.2) is measurable, because a supremum of uncountably many measurable functions is not necessarily measurable. We will ignore that point and assume implicitly that Z^{|·|}_F (and other suprema) are measurable throughout the chapter. It can for example be assumed (this is often the case) that there exists a countable subset F̃ ⊂ F such that the suprema over F and F̃ coincide a.s.; this is usually the case for concrete prediction classes.

Exercise 6.1. In the case of a countably infinite set F, we had derived a bound of the form (6.1) but with a bound R(n, f, F, δ) also depending on f, due to the influence of the “weight function”; see Proposition 3.14. Put R(f) := R(n, f, F, δ) for short.
Let η > 0 be fixed and f̂η denote an η-approximate regularized ERM with regularization function R(f) over the class F, that is, f̂η ∈ F and

Ê(f̂η) + R(f̂η) ≤ inf_{f∈F} (Ê(f) + R(f)) + η.

Prove that if (6.1) holds (but with the function R(f) depending on f), then we have

E(f̂η) ≤ inf_{f∈F} (E(f) + 2R(f)) + η.

6.2 The Azuma-McDiarmid inequality


For the first step of our programme, we will use the following concentration inequality,
which is extremely useful and versatile.

** Theorem 6.2 (Azuma-McDiarmid). Let X be a measurable space, and f : X^n → R a measurable function such that

∀i ∈ {1, . . . , n}, ∀(x1, . . . , xn) ∈ X^n, ∀x′i ∈ X :
|f(x1, . . . , xi, . . . , xn) − f(x1, . . . , x′i, . . . , xn)| ≤ 2ci,  (Stab)

for some positive constants (c1, . . . , cn).
Let (X1, . . . , Xn) be an independent family of random variables taking values in X (not necessarily identically distributed); then f(X1, . . . , Xn) is a sub-Gaussian variable with parameter Σ_{i=1}^n ci², so that in particular

P[f(X1, . . . , Xn) > E[f(X1, . . . , Xn)] + t] ≤ exp(−t² / (2 Σ_{i=1}^n ci²)).  (6.4)

(In particular, if all the constants ci are equal to c, the bound is exp(−t²/(2nc²)).)

To prove the above result we will first establish the following:


Theorem 6.3 (Bounded increment martingale inequality, Azuma). Let (Mk)k≥0 be a real-valued martingale with respect to a filtration (Fk)k≥0 (with M0 = 0). Put ∆k := Mk − Mk−1, k ≥ 1. Let n be a positive integer and assume that ∆i ∈ [ai, bi] holds a.s., with |bi − ai| ≤ 2ci, for i = 1, . . . , n. Then Mn is a sub-Gaussian variable with parameter Σ_{i=1}^n ci².
Proof. Observe that E[∆n|Fn−1] = 0 since (Mk) is a martingale. Furthermore, since ∆n ∈ [an, bn], we can apply Proposition 3.8 (Hoeffding’s inequality in exponential form, for one variable) in conditional expectation to conclude that for any λ ∈ R:

E[exp(λ∆n)|Fn−1] = E[exp(λ(∆n − E[∆n|Fn−1]))|Fn−1] ≤ exp(λ²cn²/2).

Thus

E[exp(λMn)] = E[exp(λ Σ_{i=1}^n ∆i)]
= E[E[exp(λ Σ_{i=1}^n ∆i) | Fn−1]]
= E[exp(λ Σ_{i=1}^{n−1} ∆i) E[exp(λ∆n) | Fn−1]]
≤ exp(λ²cn²/2) E[exp(λ Σ_{i=1}^{n−1} ∆i)],

and we obtain the conclusion by straightforward recursion.

Proof of Theorem 6.2. Define the filtration Fi = σ(X1, . . . , Xi), 0 ≤ i ≤ n, and define for i = 1, . . . , n the martingale Mi := E[f(X1, . . . , Xn)|Fi] − E[f(X1, . . . , Xn)], with increments ∆i := E[f(X1, . . . , Xn)|Fi] − E[f(X1, . . . , Xn)|Fi−1]. Let us prove that ∆i satisfies the boundedness assumption of Theorem 6.3. First, because (X1, . . . , Xn) are independent, the expectation conditional to (X1, . . . , Xi) is the same as the expectation with respect to (Xi+1, . . . , Xn), thus

E[f(X1, . . . , Xn)|Fi] = ∫ f(X1, . . . , Xi, xi+1, . . . , xn) P(dxi+1, . . . , dxn).

Therefore

|∆i| = |E[f(X1, . . . , Xn)|Fi] − E[f(X1, . . . , Xn)|Fi−1]|
≤ ∫ |f(X1, . . . , Xi−1, Xi, xi+1, . . . , xn) − f(X1, . . . , Xi−1, xi, xi+1, . . . , xn)| P(dxi, dxi+1, . . . , dxn)
≤ 2ci,

by assumption (Stab). We conclude that Mn = f(X1, . . . , Xn) − E[f(X1, . . . , Xn)] is a sub-Gaussian variable with parameter Σ_{i=1}^n ci², which is the desired conclusion.
Let us now apply this result to the variables introduced in (6.2), (6.3):

** Proposition 6.4. Consider a learning setting with ℓ a bounded loss function taking values in [0, B], a class F of decision functions, and consider the random variables Z^{|·|}_F, Z⁺_F, Z⁻_F defined by (6.2), (6.3). Denoting Z•_F any one of these variables, it holds that Z•_F is sub-Gaussian with parameter B²/(4n), and thus in particular

P[Z•_F ≥ E[Z•_F] + t] ≤ exp(−2nt²/B²).  (6.5)

Note. If F is a singleton {f}, then E[Z⁺_F] = E[Z⁻_F] = 0, and we recover Hoeffding’s inequality as a particular case.


Proof. We check that the variables Z^{|·|}_F, Z⁺_F, Z⁻_F satisfy (Stab). Note that in general, if (zi)i∈I and (z′i)i∈I are families of real numbers indexed by some set I, it holds

sup_{i∈I}(zi − z′i) ≥ sup_{i∈I} zi + inf_{j∈I}(−z′j) = sup_{i∈I} zi − sup_{j∈I} z′j.  (6.6)

Let us consider Z⁺_F(Sn) as a function of the i.i.d. sample Sn = ((X1, Y1), . . . , (Xn, Yn)), and let S̃n^{(i)} be the sample obtained by replacing (Xi, Yi) with (X′i, Y′i) in Sn. Using (6.6), we obtain

Z⁺_F(Sn) − Z⁺_F(S̃n^{(i)}) = sup_{f∈F}(Ê(f, Sn) − E(f)) − sup_{f∈F}(Ê(f, S̃n^{(i)}) − E(f))
≤ sup_{f∈F}(Ê(f, Sn) − Ê(f, S̃n^{(i)}))
= sup_{f∈F} (1/n)(ℓ(f(Xi), Yi) − ℓ(f(X′i), Y′i))
≤ B/n,

so that (Stab) is satisfied with ci = B/(2n). We conclude by applying Theorem 6.2. The case of the other variables Z^{|·|}_F, Z⁻_F is similar.

6.3 Rademacher complexity


We now go to the second step of our program: bounding the expectation of the variables Z•_F.

*** Theorem 6.5 (Symmetrization principle). We consider a standard learning setting, a class F of decision functions, and the random variables Z^{|·|}_F, Z⁺_F, Z⁻_F defined by (6.2), (6.3), based on an i.i.d. sample Sn of size n.
Let σ1, . . . , σn be i.i.d. variables with values in {−1, 1}, independent of Sn, and such that P[σi = 1] = P[σi = −1] = 1/2 (so-called “Rademacher variables”). Then it holds that

E_{Sn}[Z⁻_F] = E_{Sn}[sup_{f∈F}(E(f) − Ê(f, Sn))] ≤ (2/n) E_{Sn,σ}[sup_{f∈F} Σ_{i=1}^n σi ℓ(f(Xi), Yi)].  (6.7)

The same inequality holds for E_{Sn}[Z⁺_F], while

E_{Sn}[Z^{|·|}_F] = E_{Sn}[sup_{f∈F} |E(f) − Ê(f, Sn)|] ≤ (2/n) E_{Sn,σ}[sup_{f∈F} |Σ_{i=1}^n σi ℓ(f(Xi), Yi)|].  (6.8)

** Definition 6.6. Let W be a measurable space and G a set of measurable functions W → R; let Sn be an i.i.d. sample of variables (Wi)1≤i≤n of distribution P, and σ := (σ1, . . . , σn) a family of i.i.d. Rademacher variables, independent of Sn. Then the quantity

R_{P,n}(G) := E_{Sn,σ}[sup_{g∈G} Σ_{i=1}^n σi g(Wi)]  (6.9)

is called the Rademacher complexity of the class G.


We introduce similarly

R^{|·|}_{P,n}(G) := E_{Sn,σ}[sup_{g∈G} |Σ_{i=1}^n σi g(Wi)|].  (6.10)

⋄ Note: geometrical interpretation. For a fixed sample W = (W1, . . . , Wn) ∈ W^n, denote G(W) = {(g(W1), . . . , g(Wn)), g ∈ G} ⊆ R^n. Then sup_{g∈G} Σ_{i=1}^n σi g(Wi) = sup_{u∈G(W)} ⟨σ, u⟩, and (using ∥σ∥ = √n and the symmetry of the distribution of σ):

(2/√n) E_σ[sup_{g∈G} Σ_{i=1}^n σi g(Wi)] = (2/√n) E_σ[sup_{u∈G(W)} ⟨σ, u⟩]
= (1/√n) (E_σ[sup_{u∈G(W)} ⟨σ, u⟩] + E_σ[sup_{u∈G(W)} ⟨−σ, u⟩])
= E_σ[sup_{u∈G(W)} ⟨σ, u⟩/∥σ∥ − inf_{u∈G(W)} ⟨σ, u⟩/∥σ∥].

This can be interpreted as the averaged maximal “width” of the set G(W) projected in the direction of the random Rademacher vector σ. Hence the above quantities are also known as Rademacher widths, which play an important role in high-dimensional geometry.
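As a hedged illustration of Definition 6.6, the Rademacher complexity of a small finite class can be estimated by Monte Carlo over the random signs (a sketch for building intuition, with names of our own):

```python
import numpy as np

def empirical_rademacher(G_values, n_draws=10000, rng=None):
    """Monte Carlo estimate of E_sigma[ sup_g sum_i sigma_i g(W_i) ] for a fixed
    sample; G_values is a (num_functions, n) array whose rows are
    (g(W_1), ..., g(W_n)) for each g in the class."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = G_values.shape[1]
    sigmas = rng.choice([-1.0, 1.0], size=(n_draws, n))
    return np.mean(np.max(sigmas @ G_values.T, axis=1))

# a finite class of K random {0,1}-valued functions evaluated on n sample points
rng = np.random.default_rng(1)
K, n = 20, 100
G_values = rng.integers(0, 2, size=(K, n)).astype(float)
print(empirical_rademacher(G_values, rng=rng))  # estimate of (6.9)
print(np.sqrt(2 * n * np.log(K)))               # finite-class bound, cf. Lemma 6.17 below
```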
With this definition and notation, we can rewrite (6.7) and (6.8) as:

E_{Sn}[Z^±_F] ≤ (2/n) R_{P,n}(ℓ ◦ F),  (6.11)

E_{Sn}[Z^{|·|}_F] ≤ (2/n) R^{|·|}_{P,n}(ℓ ◦ F),  (6.12)

where
ℓ ◦ F := {g : X × Y → R, (x, y) ↦ ℓ(f(x), y), f ∈ F}.  (6.13)

Proof of Theorem 6.5. The first step of the symmetrization principle is to use the fact that for any fixed decision function f, it holds E_{Sn}[Ê(f, Sn)] = E(f). We now introduce a (virtual) second sample S′n = ((X′i, Y′i))1≤i≤n, independent of Sn and having the same distribution as Sn (sometimes called a “ghost sample” or an “independent copy of Sn”), and replace E(f) = E_{S′n}[Ê(f, S′n)]:

E_{Sn}[sup_{f∈F}(Ê(f, Sn) − E(f))] = E_{Sn}[sup_{f∈F}(Ê(f, Sn) − E_{S′n}[Ê(f, S′n)])]
= E_{Sn}[sup_{f∈F} E_{S′n}[Ê(f, Sn) − Ê(f, S′n)]]
≤ E_{Sn,S′n}[sup_{f∈F}(Ê(f, Sn) − Ê(f, S′n))]
= E_{Sn,S′n}[sup_{f∈F} (1/n) Σ_{i=1}^n (ℓ(f(Xi), Yi) − ℓ(f(X′i), Y′i))]
= E_{Sn,S′n}[sup_{g∈G} (1/n) Σ_{i=1}^n (g(Wi) − g(W′i))],

where for the inequality we have used sup_{t∈T} E[Ut] ≤ E[sup_{t∈T} Ut] for a family of real-valued variables (Ut)t∈T; and we have used the notation G := ℓ ◦ F as defined in (6.13) and Wi := (Xi, Yi), W′i := (X′i, Y′i) for short.
The second step is based on the observation that the distribution of (Sn, S′n) is unchanged if we swap Wi and W′i between the two samples. Hence the above double expectation remains unchanged by this operation, which flips the sign of the i-th term in the sum inside the expectation. Now, given arbitrary fixed signs (σi)1≤i≤n = σ ∈ {−1, 1}^n, consider swapping Wi and W′i if σi = −1 and leaving them alone if σi = 1; again the expectation is unchanged while the sign of the i-th term in the sum inside the expectation is multiplied by σi. Thus

∀σ ∈ {−1, 1}^n : E_{Sn,S′n}[sup_{g∈G} Σ_{i=1}^n (g(Wi) − g(W′i))] = E_{Sn,S′n}[sup_{g∈G} Σ_{i=1}^n σi(g(Wi) − g(W′i))].

Hence, the above quantity also remains the same if we take an expectation over random signs (σ1, . . . , σn) having any joint distribution; it turns out to be most fruitful to take i.i.d. Rademacher variables. Finally, we notice that

E_{Sn,S′n,σ}[sup_{g∈G} Σ_{i=1}^n σi(g(Wi) − g(W′i))] ≤ E_{Sn,S′n,σ}[sup_{g∈G} Σ_{i=1}^n σi g(Wi) + sup_{g∈G} Σ_{i=1}^n (−σi) g(W′i)]
= E_{Sn,σ}[sup_{g∈G} Σ_{i=1}^n σi g(Wi)] + E_{S′n,σ}[sup_{g∈G} Σ_{i=1}^n σi g(W′i)]
= 2 R_{P,n}(G);

collecting the above inequalities yields the conclusion (the argument when introducing absolute values is entirely similar).
Taking stock, combining the results of Proposition 6.4 and (6.11)–(6.12) we obtain the following corollary:

** Corollary 6.7. Consider a learning setting with ℓ a bounded loss function taking values in [0, B], a class F of decision functions, and an i.i.d. sample Sn of size n. For any fixed δ ∈ (0, 1), each of the following inequalities holds with probability at least 1 − δ over the draw of Sn:

sup_{f∈F} (Ê(f) − E(f)) ≤ (2/n) R_{P,n}(ℓ ◦ F) + B √(log δ^{−1}/(2n));  (6.14)

sup_{f∈F} (E(f) − Ê(f)) ≤ (2/n) R_{P,n}(ℓ ◦ F) + B √(log δ^{−1}/(2n));  (6.15)

sup_{f∈F} |E(f) − Ê(f)| ≤ (2/n) R^{|·|}_{P,n}(ℓ ◦ F) + B √(log δ^{−1}/(2n)).  (6.16)

6.4 Properties of the Rademacher complexity


The following proposition gathers some simple but fundamental properties of the Rademacher
complexity.

*** Proposition 6.8. Let W be a measurable space and P a probability distribution on W. In what follows, F and G are sets of measurable functions W → R.

(a) R^{|·|}_{P,n}(F) = R_{P,n}(F ∪ (−F)). In particular, if F is symmetric around 0 then R^{|·|}_{P,n}(F) = R_{P,n}(F).

(b) If h is a fixed measurable function W → R, then R_{P,n}({h}) = 0, while R^{|·|}_{P,n}({h}) ≤ √n ∥h∥_{L²(P)}.

(c) Let a ∈ R be fixed. Then R_{P,n}(aF) = |a| R_{P,n}(F).

(d) R_{P,n}(F + G) = R_{P,n}(F) + R_{P,n}(G).

(e) R_{P,n}(Conv(F)) = R_{P,n}(F), where Conv(F) is the set of finite convex combinations of elements of F.

Proof. Point (a) is straightforward and left as an exercise, as is the first part of point (b).

For the second part of point (b), by Jensen’s inequality (and using that E[σiσj] = 0 for i ≠ j, so the cross terms vanish):

R^{|·|}_{P,n}({h}) = E_{Sn,σ}[|Σ_{i=1}^n σi h(Wi)|]
≤ (E_{Sn,σ}[(Σ_{i=1}^n σi h(Wi))²])^{1/2}
= (E_{Sn}[Σ_{i=1}^n h(Wi)²])^{1/2}
= √n ∥h∥_{L²(P)}.

For point (c), we have by symmetry of the distribution of the vector of random signs σ:

R_{P,n}(aF) = E_{Sn,σ}[sup_{f∈F} Σ_{i=1}^n σi a f(Wi)] = E_{Sn,σ}[|a| sup_{f∈F} Σ_{i=1}^n σi f(Wi)] = |a| R_{P,n}(F).

For point (d):

R_{P,n}(F + G) = E_{Sn,σ}[sup_{f∈F, g∈G} Σ_{i=1}^n σi(f(Wi) + g(Wi))]
= E_{Sn,σ}[sup_{f∈F} Σ_{i=1}^n σi f(Wi) + sup_{g∈G} Σ_{i=1}^n σi g(Wi)]
= R_{P,n}(F) + R_{P,n}(G).

For point (e): let us denote Conv₂(F) := {λf + (1 − λ)g; f, g ∈ F; λ ∈ [0, 1]} the set of 2-point convex combinations of elements of F. It holds

R_{P,n}(Conv₂(F)) = E_{Sn,σ}[sup_{λ∈[0,1]} sup_{f,g∈F} Σ_{i=1}^n σi(λf(Wi) + (1 − λ)g(Wi))]
= E_{Sn,σ}[sup_{λ∈[0,1]} (λ sup_{f∈F} Σ_{i=1}^n σi f(Wi) + (1 − λ) sup_{g∈F} Σ_{i=1}^n σi g(Wi))]
= E_{Sn,σ}[sup_{f∈F} Σ_{i=1}^n σi f(Wi)]
= R_{P,n}(F).

By straightforward recursion it holds R_{P,n}(Conv_{2^k}(F)) = R_{P,n}(F) for any integer k ≥ 0, and finally, since Conv(F) = ∪_{k≥0} Conv_{2^k}(F), we obtain the result by monotone convergence.
The following property is extremely useful in learning theory.

** Proposition 6.9 (Lipschitz comparison principle). Consider a standard learning setting with Ỹ ⊆ R and a loss function ℓ : Ỹ × Y → R₊ such that for any fixed y ∈ Y, the function ℓ(·, y) : ỹ ∈ Ỹ ↦ ℓ(ỹ, y) is L-Lipschitz. Then it holds for any set F of measurable functions from X to R:

R_{P,n}(ℓ ◦ F) ≤ L R_{P,n}(F).  (6.17)
The proof hinges on the following lemma:

Lemma 6.10. Let (At)t∈I, (Bt)t∈I be families of real numbers indexed by a countable set I, and γ : R → R an L-Lipschitz function. Let σ be a single Rademacher variable (random sign). Then

E[sup_{t∈I}(At + σγ(Bt))] ≤ E[sup_{t∈I}(At + LσBt)].  (6.18)

Proof. We have

E[sup_{t∈I}(At + σγ(Bt))] = (1/2)(sup_{t∈I}(At + γ(Bt)) + sup_{t∈I}(At − γ(Bt)))
= (1/2) sup_{t,t′∈I}(At + At′ + γ(Bt) − γ(Bt′))
≤ (1/2) sup_{t,t′∈I}(At + At′ + L|Bt − Bt′|)
= (1/2) sup_{t,t′∈I}(At + At′ + L(Bt − Bt′))
= (1/2)(sup_{t∈I}(At + LBt) + sup_{t∈I}(At − LBt))
= E[sup_{t∈I}(At + LσBt)].

Note that the “magic” happens in the equality just after applying the Lipschitz property: by symmetry between t and t′ we can remove the absolute value! It is worth pausing to think about it.

Proof of Proposition 6.9. Let n be fixed; we will prove by recursion the following property (for m ≤ n):

H(m) : R_{P,n}(ℓ ◦ F) ≤ E[sup_{f∈F} (L Σ_{i=1}^m σi f(Xi) + Σ_{i=m+1}^n σi ℓ(f(Xi), Yi))],

where the expectation is over Sn and σ. Observe that H(0) is obvious (it is the definition of R_{P,n}(ℓ ◦ F)) and H(n) is what we want to prove. Assuming H(m − 1) holds for 1 ≤ m ≤ n, put A_f := L Σ_{i=1}^{m−1} σi f(Xi) + Σ_{i=m+1}^n σi ℓ(f(Xi), Yi) and B_f := f(Xm); then H(m − 1) reads

R_{P,n}(ℓ ◦ F) ≤ E[sup_{f∈F}(A_f + σm ℓ(B_f, Ym))].

Perform the expectation over σm first, conditionally on the other random variables (namely Sn and the signs (σi)i≠m): since A_f and B_f do not depend on σm, they can be considered constants in this conditional expectation, and by independence σm is still a random sign conditionally on the rest. We can therefore apply Lemma 6.10 (with γ(·) = ℓ(·, Ym), considered as fixed since we argue conditionally on Ym), which yields H(m) and, by recursion, the conclusion.

6.5 Application to kernel methods


We start with the following important result for the Rademacher complexity of a ball in a
rkhs.

*** Proposition 6.11. Let k be a spsd kernel on a nonempty set X, H the associated RKHS, and for R ≥ 0 let

B_H(R) := {f ∈ H : ∥f∥_H ≤ R}

be the closed ball of radius R in H centered at the origin. Then

R^{|·|}_{P,n}(B_H(R)) ≤ √n R √(E_{X∼P}[k(X, X)]).  (6.19)

In particular, if sup_{x∈X} k(x, x) ≤ M² < ∞, then for any distribution P:

R^{|·|}_{P,n}(B_H(R)) ≤ √n M R.  (6.20)
Proof. It holds, using the Cauchy-Schwarz and then Jensen’s inequality:

R^{|·|}_{P,n}(B_H(R)) = E[sup_{f∈B_H(R)} |Σ_{i=1}^n σi f(Xi)|]
= E[sup_{f∈B_H(R)} |Σ_{i=1}^n σi ⟨f, k_{Xi}⟩_H|]
= E[sup_{f∈B_H(R)} |⟨f, Σ_{i=1}^n σi k_{Xi}⟩_H|]
≤ E[sup_{f∈B_H(R)} ∥f∥_H ∥Σ_{i=1}^n σi k_{Xi}∥_H]
≤ R (E[∥Σ_{i=1}^n σi k_{Xi}∥²_H])^{1/2}
= R (E_{Sn} E_σ[Σ_{i,j=1}^n σi σj k(Xi, Xj)])^{1/2}
= R (n E_{X∼P}[k(X, X)])^{1/2}.

Notice in particular the following interesting fact: the Rademacher complexity of an RKHS ball depends on its radius, but not on the dimensionality (which might be infinite), provided the kernel is bounded. This has a number of interesting consequences.

* Proposition 6.12. Let k be a spsd kernel on a nonempty set X, with sup_{x∈X} k(x, x) ≤ M²; let H be the associated RKHS and R > 0 be fixed. Assume ℓ is a loss function (with prediction space Ỹ = R) such that

(a) for all y ∈ Y and ỹ ∈ R with |ỹ| ≤ MR it holds: 0 ≤ ℓ(ỹ, y) ≤ B;

(b) for all y ∈ Y and (ỹ, ỹ′) ∈ [−MR, MR]², it holds: |ℓ(ỹ, y) − ℓ(ỹ′, y)| ≤ L|ỹ − ỹ′|.

Let f̂n be an estimator acting on a sample Sn of size n and such that f̂n ∈ B_H(R) a.s. Then for any δ ∈ (0, 1), with probability larger than 1 − δ over the draw of the i.i.d. sample Sn it holds:

E(f̂n) − Ê(f̂n) ≤ sup_{f∈B_H(R)} (E(f) − Ê(f)) ≤ (1/√n) (2LRM + B √(log δ^{−1}/2)).  (6.21)
Proof. Start by noticing that by the usual argument based on the reproducing property, any function f ∈ B_H(R) satisfies |f(x)| = |⟨f, kx⟩_H| ≤ ∥f∥_H ∥kx∥_H ≤ RM. For this reason, we may consider that the prediction space Ỹ is [−MR, MR]. We then have, with probability ≥ 1 − δ:

E(f̂n) − Ê(f̂n) ≤ sup_{f∈B_H(R)} (E(f) − Ê(f))
≤ (2/n) R^{|·|}_{P,n}(ℓ ◦ B_H(R)) + B √(log δ^{−1}/(2n))  (Corollary 6.7, eq. (6.16))
≤ (2L/n) R_{P,n}(B_H(R)) + B √(log δ^{−1}/(2n))  (Proposition 6.9)
≤ 2LRM/√n + B √(log δ^{−1}/(2n))  (Proposition 6.11).

* Corollary 6.13. Consider the same setting as in Proposition 6.12, with the squared loss function ℓ(ỹ, y) = (ỹ − y)² and Y = [−A, A] for some A > 0 (bounded regression: we assume that the label is always bounded by A in absolute value). Then for any δ ∈ (0, 1/2], with probability larger than 1 − δ it holds:

E(f̂n) − Ê(f̂n) ≤ sup_{f∈B_H(R)} (E(f) − Ê(f)) ≤ C √(log δ^{−1}) / √n,  (6.22)

where C := 6(A + RM)².

Proof. We check that the assumptions on the loss function of Proposition 6.12 are satisfied with appropriate constants. For any ỹ, y with |ỹ| ≤ MR, |y| ≤ A it holds ℓ(ỹ, y) = (ỹ − y)² ≤ (A + MR)² =: B (loss boundedness assumption (a)), and additionally for any ỹ′ with |ỹ′| ≤ MR:

|ℓ(ỹ, y) − ℓ(ỹ′, y)| = |(ỹ − y)² − (ỹ′ − y)²| = |ỹ + ỹ′ − 2y| |ỹ − ỹ′| ≤ 2(A + RM)|ỹ − ỹ′|,

so the Lipschitz assumption (b) is satisfied with L := 2√B. We therefore have the high-probability bound (6.21); we can further upper bound 2LRM by 4B, and finally use that 4 ≤ 8√((log δ^{−1})/2), since δ ≤ 1/2 by assumption.
We now consider the analysis of kernel ridge regression (regularized least-squares ERM).

* Proposition 6.14 (Oracle-type inequality for KRR). We consider the same assumptions as in Corollary 6.13: squared loss, bounded regression with labels bounded by A > 0 in absolute value, kernel bounded by M². For λ ∈ (0, M²] define the kernel ridge regression (KRR) estimator, based on the sample Sn of size n, as

f̂λ ∈ Arg Min_{f∈H} (Ê(f) + λ∥f∥²_H).  (6.23)

Then for any δ ∈ (0, 1/2], with probability larger than 1 − δ it holds:

E(f̂λ) + λ∥f̂λ∥²_H ≤ min_{f∈H} (E(f) + λ∥f∥²_H) + c (A²M²/(λ√n)) √(log δ^{−1}),  (6.24)

where c is a numerical constant.
Proof. We start by noticing that the norm of f̂λ must be bounded. Namely, by the definition (6.23) of the estimator, its objective value must be lower than that of the constant zero function (denoted 0), thus

∥f̂λ∥²_H ≤ λ^{−1}(Ê(f̂λ) + λ∥f̂λ∥²_H) ≤ λ^{−1}(Ê(0) + λ∥0∥²_H) = (λn)^{−1} Σ_{i=1}^n (Yi − 0)² ≤ A²/λ.

Therefore f̂λ ∈ B_H(R) with R = A/√λ. Applying Corollary 6.13, we get that (6.22) is satisfied with probability at least 1 − δ; in particular

E(f̂λ) ≤ Ê(f̂λ) + C √(log δ^{−1})/√n,  (6.25)

with C = 6(A + RM)² = 6A²(1 + M/√λ)² ≤ 24A²M²/λ, since M²/λ ≥ 1 by assumption.
Let now f∗λ ∈ Arg Min_{f∈H} (E(f) + λ∥f∥²_H). By an argument similar to the above, it must hold that f∗λ ∈ B_H(R), and, since the symmetric bound sup_{f∈B_H(R)}(Ê(f) − E(f)) ≤ C √(log δ^{−1})/√n holds by the same arguments (using (6.14) instead of (6.15); we absorb the union bound into the constant c):

Ê(f∗λ) ≤ E(f∗λ) + C √(log δ^{−1})/√n.  (6.26)

Now, using (6.25) and (6.26) as well as the definitions of f̂λ and f∗λ, we get (on the above event):

E(f̂λ) + λ∥f̂λ∥²_H ≤ Ê(f̂λ) + λ∥f̂λ∥²_H + C √(log δ^{−1})/√n
≤ Ê(f∗λ) + λ∥f∗λ∥²_H + C √(log δ^{−1})/√n
≤ E(f∗λ) + λ∥f∗λ∥²_H + 2C √(log δ^{−1})/√n
≤ min_{f∈H} (E(f) + λ∥f∥²_H) + 48 (A²M²/λ) √(log δ^{−1})/√n.

We have as a consequence the following universal consistency result:

** Corollary 6.15. Under the same assumptions as in Proposition 6.14, assume additionally that X is a compact topological space and that the kernel k is universal on X. Let (λn)n≥1 be a sequence of regularization parameters such that λn → 0 and λn √(n/log n) → ∞, as n → ∞. Then for any distribution P of the data, if (Xi, Yi)i≥1 is an i.i.d. sequence from P and f̂^{(n)}_{λn} denotes the KRR estimator trained on the sample Sn = ((X1, Y1), . . . , (Xn, Yn)) with regularization parameter λn, it holds that

E(f̂^{(n)}_{λn}) → E∗,

as n → ∞, a.s. and in probability.

Proof. Let δn = 1/n², and let An denote the event that (6.24) holds for the sample Sn and the estimator f̂^{(n)}_{λn}. Since Σ_{n≥1} P[Aᶜn] ≤ Σ_{n≥1} n^{−2} < ∞, by the Borel-Cantelli lemma P(∩_{k≥1} ∪_{n≥k} Aᶜn) = 0, i.e., for almost every ω ∈ Ω there exists a (random) integer n0(ω) such that ω ∈ An for all n ≥ n0(ω) — in other words, the events An are satisfied for all n ≥ n0(ω).
Next, let ε > 0 be fixed; we establish that there exists fε ∈ H such that E(fε) ≤ E∗ + ε. Namely, since Y = [−A, A], f∗(x) = E[Y|X = x] (for regression with quadratic loss) also takes values in [−A, A] and thus belongs to L²(X, P). Furthermore, we know that E(f) − E∗ = E[(f(X) − f∗(X))²] = ∥f − f∗∥²_{L²(X,P)}. Since k is universal, H is dense in C(X), which itself is dense in L²(X, P). Hence we can find such an fε.
We now have, for any such ω, any n ≥ n0(ω) and any ε > 0, using (6.24):

E(f̂^{(n)}_{λn}) ≤ min_{f∈H} (E(f) + λn∥f∥²_H) + c (A²M²/(λn√n)) √(log δn^{−1})
≤ E(fε) + λn∥fε∥²_H + c (A²M²/(λn√n)) √(2 log n)
≤ E∗ + ε + λn∥fε∥²_H + c A²M² √((2 log n)/n)/λn;

by the assumptions on λn we deduce lim sup_n E(f̂^{(n)}_{λn}) ≤ E∗ + ε a.s. for any ε > 0, and get the conclusion.

6.6 Vapnik-Chervonenkis theory


In this section we deal specifically with the classification setting Y = Ỹ = {−1, 1} and the 0-1 loss ℓ(ỹ, y) = 1{ỹ ≠ y}. If we use this loss function combined with a real-valued score function f(x) and predict sign(f(x)), we see that the resulting loss ℓ(f(x), y) = 1{sign(f(x)) ≠ y} is not Lipschitz in f(x), and therefore the Lipschitz comparison principle (Proposition 6.9) cannot be applied to bound the Rademacher complexity. We need other means to upper bound the complexity (observe that Corollary 6.7 still holds, as the loss function is bounded by B = 1).

** Theorem 6.16. Consider a standard binary classification setting (Y = Ỹ = {−1, 1}) with 0-1 loss. Let F be a set of measurable classification functions X → {−1, 1}. For a fixed integer n > 0, introduce the n-point evaluation functional

G : X^n × F → {−1, 1}^n ; ((x1, . . . , xn), f) ↦ (f(x1), . . . , f(xn)).  (6.27)

For fixed s^x_n = (x1, . . . , xn) ∈ X^n, denote G(s^x_n, F) := {G(s^x_n, f), f ∈ F}. Then it holds for any distribution P on X × Y, for Sn an i.i.d. sample from P:

R_{P,n}(ℓ ◦ F) ≤ √(2n) E_{S^x_n∼P_X^{⊗n}}[√(log|G(S^x_n, F)|)].  (6.28)

We start with the following important lemma.

*** Lemma 6.17. Let ξ1, . . . , ξK be K sub-Gaussian variables with common parameter σ², centered (E[ξi] = 0, i = 1, . . . , K), not necessarily independent. Then

E[sup_{i=1,...,K} ξi] ≤ σ √(2 log K).

Proof. Put γ := E[sup_{i=1,...,K} ξi]. Then for any λ > 0:

exp(λγ) ≤ E[exp(λ sup_{i=1,...,K} ξi)]  (Jensen)
= E[sup_{i=1,...,K} exp(λξi)]
≤ Σ_{i=1}^K E[exp(λξi)]
≤ K exp(σ²λ²/2)  (sub-Gaussianity).

We deduce γ ≤ (log K)/λ + λσ²/2, which gives the claim when choosing λ = √(2 log K)/σ.
Proof of Theorem 6.16. For a sample Sn = ((Xi, Yi))1≤i≤n denote

G̃(Sn, f) := (1{f(X1) ≠ Y1}, . . . , 1{f(Xn) ≠ Yn}) ∈ {0, 1}^n,

and define G̃(Sn, F) := {G̃(Sn, f), f ∈ F}. Observe that |G̃(Sn, F)| = |G(S^x_n, F)|. With this notation we have

R_{P,n}(ℓ ◦ F) = E[sup_{f∈F} Σ_{i=1}^n σi 1{f(Xi) ≠ Yi}] = E[sup_{u∈G̃(Sn,F)} Σ_{i=1}^n σi u^{(i)}].

For a fixed u ∈ {0, 1}^n let ξu(σ) := Σ_{i=1}^n σi u^{(i)}. By Hoeffding’s inequality (more precisely Proposition 3.8 and properties of sub-Gaussian variables), ξu(σ) is centered and sub-Gaussian with parameter σ² = ∥u∥₁ ≤ n. Hence by Lemma 6.17, taking the expectation with respect to σ first:

R_{P,n}(ℓ ◦ F) = E_{Sn,σ}[sup_{u∈G̃(Sn,F)} ξu(σ)] ≤ E_{Sn}[√(2n log|G̃(Sn, F)|)] = E_{Sn}[√(2n log|G(S^x_n, F)|)].

While the bound (6.28) is nice, it turns out that in most interesting cases we can upper bound |G(s^x_n, F)| uniformly over the sample s^x_n, as a function of F only, and thus bound the expectation in (6.28) independently of the distribution P!

** Definition 6.18 (Growth function). Let F be a set of functions from X to Y. Define the growth function of F, using the notation for the evaluation functional introduced in (6.27), as

Γ(F, n) := sup_{Sn∈X^n} |G(Sn, F)|.  (6.29)

Observe that Γ(F, n) ≤ 2^n always holds.

** Theorem 6.19 (Vapnik/Sauer). Let F be a set of functions from X to {0, 1}. Define

d := max{n > 0 : Γ(F, n) = 2^n}.  (6.30)

Then it holds for any n:

Γ(F, n) ≤ Σ_{i=0}^{min(d,n)} (n choose i) ≤ (n + 1)^d.  (6.31)

The number d is called the Vapnik-Chervonenkis (VC) dimension of the class of prediction functions F. (Observe that since F is a set of indicator functions, one can equivalently define the VC dimension of a family of subsets of X.)
Proof. For a subset A ⊆ {0, 1}^n and I ⊆ ⟦n⟧ := {1, . . . , n}, let (i1, . . . , i|I|) be the ordered elements of I and denote A_I ⊆ {0, 1}^{|I|} the projection of A onto the coordinates with indices (i1, . . . , i|I|). For coherence, if I = ∅, we define A_∅ = ∅. Let us call a subset of indices I (possibly empty) shattered by A if A_I = {0, 1}^{|I|} (we define {0, 1}⁰ = ∅).
We will establish the following: for any A ⊆ {0, 1}^n,

|A| ≤ |{I ⊆ ⟦n⟧ : A_I = {0, 1}^{|I|}}|;  (6.32)

in words, the cardinality of A is upper bounded by the number of index sets I shattered by A.
We prove this by recursion on n. For n = 1 it is true since I = ∅ is always shattered, and if A = {0, 1} then I = {1} is also shattered.
Assume the property is true for some n ≥ 1, and let A ⊆ {0, 1}^{n+1}. Let Ã ⊆ {0, 1}^n be the set of elements ã ∈ {0, 1}^n such that both (ã, 0) and (ã, 1) belong to A, and let A′ := A_{⟦n⟧} be the projection of A onto the first n coordinates. Then it holds that

|A| = |A′| + |Ã|.  (6.33)

By recursion, both A′ and Ã satisfy (6.32). For I ⊆ ⟦n⟧, it holds A′_I = A_I. Hence

|A′| ≤ |{I ⊆ ⟦n⟧ : A_I = {0, 1}^{|I|}}| = |{I ⊆ ⟦n + 1⟧ s.t. (n + 1) ∉ I, A_I = {0, 1}^{|I|}}|.  (6.34)

On the other hand, if I ⊆ ⟦n⟧ is shattered by Ã, then by construction I ∪ {n + 1} is shattered by A. Hence

|Ã| ≤ |{I ⊆ ⟦n⟧ : Ã_I = {0, 1}^{|I|}}| ≤ |{J ⊆ ⟦n + 1⟧ s.t. (n + 1) ∈ J, A_J = {0, 1}^{|J|}}|.  (6.35)

Noticing that the sets of indices concerned in (6.34) and (6.35) are disjoint, and putting this back into (6.33), we obtain the property (6.32) for (n + 1).
We now apply property (6.32) to the set A = G(Sn, F), where Sn ∈ X^n is arbitrary. By assumption (6.30), the largest possible cardinality of a shattered index set I (which determines a sub-sample of size |I|) is d. Hence

|G(Sn, F)| ≤ |{I ⊆ ⟦n⟧ : |I| ≤ d}| = Σ_{i=0}^{min(d,n)} (n choose i).

Taking a supremum over all possible Sn ∈ X^n yields the first inequality in (6.31). Finally, note that (n choose i) ≤ nⁱ and use the binomial formula (n + 1)^d = Σ_{i=0}^d (d choose i) nⁱ for the second inequality.
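A direct numeric check of the two bounds in (6.31) (a small script of our own):

```python
from math import comb

# check: sum_{i=0}^{min(d,n)} C(n, i) <= (n + 1)^d for a few (n, d) pairs
for n, d in [(10, 3), (100, 5), (1000, 10)]:
    lhs = sum(comb(n, i) for i in range(min(d, n) + 1))
    print(n, d, lhs, (n + 1) ** d, lhs <= (n + 1) ** d)
```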

It is possible to compute exactly, or to upper bound, the VC dimension (6.30) — or sometimes directly the growth function (6.29) — in a number of cases. A first fundamental fact is the following bound for linear discrimination function classes.

** Proposition 6.20. Let X = R^d and F = {x ↦ 1{⟨x, w⟩ > 0}, w ∈ R^d} be the set of linear classifiers without offset (i.e. indicators of half-spaces whose boundary contains the origin). Then the VC dimension of F is equal to d.

Proof. First, we prove that the VC dimension of the class F of half-spaces with boundary going through the origin is at least d. For this, simply consider the d-tuple Sd of points of R^d formed by the canonical basis (xi = ei, i = 1, . . . , d). Let I ⊆ ⟦d⟧ be any index set and denote I^c := ⟦d⟧ \ I. Define the vector wI by w^{(i)}_I = 1 if i ∈ I, and w^{(i)}_I = −1 if i ∉ I. Then obviously, if fw(x) := 1{⟨x, w⟩ > 0}, we have fwI(Sd) = (1{i ∈ I})1≤i≤d. Since this works for any I, we have G(Sd, F) = {0, 1}^d and |G(Sd, F)| = 2^d.
Conversely, we prove that no family Sd+1 = (x1, . . . , xd+1) of (d + 1) vectors in R^d can be shattered by F (i.e. |G(Sd+1, F)| < 2^{d+1}). Since we are in dimension d, there is a nontrivial linear combination of these vectors that vanishes: there exists λ = (λ1, . . . , λd+1) ∈ R^{d+1} such that λ ≠ 0 and Σ_{i=1}^{d+1} λi xi = 0. Let I := {i ∈ ⟦d + 1⟧ : λi > 0}. Without loss of generality we can assume I ≠ ∅ (otherwise replace λ by −λ). Let w be any vector such that fw(xi) = 1, i.e. ⟨w, xi⟩ > 0, for all i ∈ I. Then

0 < Σ_{i∈I} λi ⟨w, xi⟩ = ⟨w, Σ_{i∈I} λi xi⟩ = ⟨w, −Σ_{i∈I^c} λi xi⟩ = Σ_{i∈I^c} (−λi)⟨w, xi⟩.

Therefore there is at least one i ∈ I^c such that ⟨w, xi⟩ > 0, i.e. fw(xi) = 1. This means that the pattern (1{i ∈ I})1≤i≤d+1 ∉ G(Sd+1, F); therefore |G(Sd+1, F)| < 2^{d+1}.
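The first part of the proof (shattering of the canonical basis) can be checked directly by enumeration; a small sketch of our own:

```python
import numpy as np
from itertools import product

# check that the canonical basis of R^d is shattered by {x -> 1(<x, w> > 0)}
d = 4
X = np.eye(d)                    # points x_i = e_i
patterns = set()
for signs in product([-1.0, 1.0], repeat=d):
    w = np.array(signs)          # w_I as in the proof: +1 on I, -1 outside
    patterns.add(tuple((X @ w > 0).astype(int)))
print(len(patterns) == 2 ** d)   # True: all 2^d labelings are realized
```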

** Corollary 6.21. Let G be a linear space of real-valued functions of dimension d on an arbitrary set X. Then the VC dimension of

F := {x ↦ 1{g(x) > 0}, g ∈ G}

is equal to d.

Proof. Let (g1, . . . , gd) be a basis of G. Define the mapping A(x) := (g1(x), . . . , gd(x)) ∈ R^d. Since G = Span{g1, . . . , gd}, it holds that F = {x ↦ 1{⟨w, A(x)⟩ > 0}, w ∈ R^d}. Therefore, if a family of points Sk = (x1, . . . , xk) is shattered by F (i.e. |G(Sk, F)| = 2^k), then the family (A(x1), . . . , A(xk)) of elements of R^d is shattered by linear classifiers without offset; hence from Proposition 6.20 it must be the case that k ≤ d.
On the other hand, notice that Span(A(X)) = R^d. If this were not the case, there would exist u ∈ R^d, u ≠ 0, such that ⟨u, A(x)⟩ = Σ_{i=1}^d ui gi(x) = 0 for all x ∈ X, contradicting the fact that (g1, . . . , gd) are linearly independent. We can therefore find Sd = (x1, . . . , xd) ∈ X^d such that (A(x1), . . . , A(xd)) are linearly independent vectors in R^d. A family of independent vectors in R^d is shattered by linear classifiers (repeat the argument of Proposition 6.20 after a change of basis, i.e. choose w̃I = (Mᵀ)^{−1}(Σ_{i∈I} ei − Σ_{j∈I^c} ej) for any subset I ⊆ ⟦d⟧, where M is the matrix with columns (A(x1), . . . , A(xd))). Hence Sd is shattered by F.
An example of application of the previous corollary: the set of binary classifiers defined from polynomial score functions on an open subset of R^d has VC dimension equal to the dimension of the corresponding polynomial space (e.g. (k + 1)^d for polynomials of degree at most k in each variable).

Exercise 6.2. Let X = R^d, let f be a fixed real-valued function on X and

F = {x ↦ 1{⟨x, w⟩ + f(x) > 0}, w ∈ R^d}.

Prove that the VC dimension of F is equal to d. Hint: recycle the proof of Proposition 6.20 with appropriate changes. Deduce that if G is a linear space of dimension d of real-valued functions on an arbitrary set X, and f a fixed real-valued function on X, then the VC dimension of

F := {x ↦ 1{g(x) + f(x) > 0}, g ∈ G}

is equal to d.

Exercise 6.3. Let X = R^d and

F = {x ↦ 1{⟨x, w⟩ + a > 0}, w ∈ R^d, a ∈ R}

be the set of linear classifiers with offset (i.e. indicators of affine half-spaces). Prove that the VC dimension of F is equal to d + 1.

6.7 Application to artificial neural networks


Artificial feedforward neural networks (ANNs) can be described in the following way. The basic unit or “artificial neuron” (AN) consists of a function fw : R^d → R : x ↦ φ(⟨w, x⟩), for some input dimension d, where w is called the “activation weight vector” and φ is fixed a priori and called an “activation function”. The activation function is nonlinear and acts as a thresholding function; classical choices include φ(t) = tanh(t) (generally called the sigmoid function), φ(t) = (t)₊ (Rectified Linear Unit or “ReLU” function), and φ(t) = sign(t) (in which case the function fw is just a linear classifier).
A layer of the ANN consists of several parallel ANs with different parameters acting on the same input space. The number of ANs in the k-th layer is nk, and the input of the k-th layer is the output of the (k − 1)-th, so that the k-th layer is a (nonlinear) mapping R^{n_{k−1}} → R^{n_k}. The input of the first layer is the x-data belonging to R^d (n0 = d). The output x^{(k)} ∈ R^{n_k} of the k-th layer is thus computed by the formula

x^{(k)}_i = φ(⟨w^{(k)}_i, x^{(k−1)}⟩), i = 1, . . . , nk.

The final layer (say the K-th) consists of a single neuron (nK = 1) and outputs the prediction of the network.
Thus, an ANN is parametrized by Σ_{i=1}^K ni ni−1 real parameters corresponding to the weight vectors of its ANs. It is possible to reduce this dimensionality by specifying that the activation weight vector of a given AN is restricted to a specific support of reduced size k (i.e. the weights outside of the support are 0). This means that the AN can only use the output of a specific (given) subset of size k of the ANs of the previous layer.
In this section, we will not explain how to construct ANNs from data, but give a rough analysis of their statistical complexity using a simplified model. We will consider the sign activation function, and thus assume that the k-th layer acts on {−1, 1}^{n_{k−1}}; we will also assume for simplicity that the input space is binary, i.e. {−1, 1}^d. Finally, we assume that a constant output neuron (equal to 1) is added in each layer (thus providing the means to add an offset to the linear part of each AN), including on the input data.
Proposition 6.22. Given the input space {−1, 1}^d and using the sign activation function, it is possible to find weight vectors implementing the “AND” and “OR” functions on a given subset I of coordinates; that is, there exist (wI, a) and (w′I, a′) in R^{d+1} (the extra parameter is because we add a constant coordinate to the input, which we treat as an offset/threshold) such that

f_{wI,a}(x) = ∧_{i∈I} xi ; f_{w′I,a′}(x) = ∨_{i∈I} xi.

Proof. Take wI = w′I with i-th coordinate equal to 1 if i ∈ I and 0 otherwise, and write the neuron as x ↦ sign(⟨w, x⟩ − a). Since Σ_{i∈I} xi equals |I| iff all the xi, i ∈ I, are equal to 1, and is at least 2 − |I| iff at least one of them is, the thresholds a = |I| − 1/2 (for the AND) and a′ = 3/2 − |I| (for the OR) work.
Proposition 6.23. Any boolean function F : {−1, 1}^d → {−1, 1} can be realized by a 2-layer ANN with a sufficiently large first layer.

Proof. Let ξF := {x ∈ {−1, 1}^d : F(x) = 1}. For each x ∈ ξF we construct an AN in the first layer with weight wx = x and threshold a = d − 1/2. It can be checked that f_{wx,a}(t) = 1 if t = x, and is −1 otherwise (since ⟨x, t⟩ = d iff t = x, and ⟨x, t⟩ ≤ d − 2 otherwise). The neuron in the second layer is just an OR of all the neurons of the first layer.
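The construction of Propositions 6.22 and 6.23 can be checked mechanically; the following sketch (our own encoding, with the threshold treated explicitly rather than through a constant neuron) verifies the AND/OR units over all inputs:

```python
import numpy as np
from itertools import product

def neuron(w, a):
    # sign-activation unit x -> sign(<w, x> - a), inputs in {-1, +1}^d
    return lambda x: 1 if np.dot(w, x) - a > 0 else -1

d, I = 4, [0, 2]                    # AND / OR over coordinates 0 and 2
w = np.array([1.0 if i in I else 0.0 for i in range(d)])
AND = neuron(w, len(I) - 0.5)       # threshold |I| - 1/2
OR = neuron(w, 1.5 - len(I))        # threshold 3/2 - |I|

for x in product([-1, 1], repeat=d):
    x = np.array(x)
    assert AND(x) == (1 if all(x[i] == 1 for i in I) else -1)
    assert OR(x) == (1 if any(x[i] == 1 for i in I) else -1)
print("AND/OR neurons verified on all", 2 ** d, "inputs")
```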
Proposition 6.23 states that a 2-layer ANN can already represent any boolean function; however, the first layer may need to be of size up to 2^d to achieve this, which can be prohibitive. On the other hand, using Proposition 6.22 we see that we can implement a “boolean circuit” function (which consists of applications of the elementary AND, OR and NOT gates along an evaluation tree) with an ANN having as many (nonconstant) neurons as there are gates in the boolean circuit, and with a number of layers equal to the depth of the evaluation tree.
To summarize, we get the following qualitative understanding:
• A “shallow” ANN (with few layers) with sign activation function can realize any boolean function, but its complexity (size) must be very large to do so (there are similar results for the approximation of continuous functions using other activation functions).

• A “deep” network (with many layers) can represent more efficiently (i.e. using fewer ANs) boolean functions that can be written in short form as a composition of elementary boolean gates.

To make the second point more quantitative, we will analyze the complexity (in the sense of VC theory) of an ANN with architectural connection constraints, namely when each AN is restricted to use only a (fixed) subset of the previous layer as input (this corresponds to its activation weight vector w being restricted to a fixed support of reduced size; see the beginning of the section). We can represent such connection constraints abstractly as a directed graph G = (V, E): each vertex in V represents either a single AN or an individual coordinate of the input data in {−1, 1}^d, and the (directed) edges E represent the nonzero activation weights from one individual neuron of a given layer to the next.

Proposition 6.24. Let G = (V, E) be a graph representing the structure and connection constraints of an ANN, and let FG be the set of boolean functions on {−1, 1}^d that can be represented by an ANN respecting these constraints. Then it holds for n ≥ 1:

Γ(FG, n) ≤ (n + 1)^{|E|},  (6.36)

and it follows that the VC dimension of FG is bounded by c|E| log|E|, for c a numerical constant.

Proof. We will use the notion of growth function (6.29) for classes of functions whose output set can be larger than {0, 1} (the definition is unchanged). Let K be the number of layers in the ANN. For k ≤ K, consider the set of functions F_{G,k} ⊆ F({−1, 1}^{n_{k−1}}, {−1, 1}^{n_k}) obtained by considering only the k-th layer of the ANN under the structural constraints given by G.
Note that the input resp. output space of F_{G,k} is {−1, 1}^{n_{k−1}} resp. {−1, 1}^{n_k}, given by the number n_{k−1} resp. n_k of ANs in the (k − 1)-th resp. k-th layer. For AN number i of the k-th layer, 1 ≤ i ≤ nk, denote ℓ_{k,i} the number of incoming edges from the previous layer. To this AN is associated a linear classifier with weight vector w_{k,i} of dimension ℓ_{k,i}. Therefore, from Proposition 6.20 and Theorem 6.19, if F_{k,i} is the set of linear classifiers that can be represented by this AN, we have

Γ(F_{k,i}, n) ≤ (n + 1)^{ℓ_{k,i}}.

Since
F_{G,k} = {x ∈ {−1, 1}^{n_{k−1}} ↦ (f_{k,i}(x))_{1≤i≤n_k}, f_{k,i} ∈ F_{k,i}},
we deduce

Γ(F_{G,k}, n) ≤ Π_{i=1}^{n_k} (n + 1)^{ℓ_{k,i}} = (n + 1)^{Σ_{i=1}^{n_k} ℓ_{k,i}}.

Finally, it is easy to check that in general the “composition rule” holds for F ◦ G := {f ◦ g, f ∈ F, g ∈ G}:

Γ(F ◦ G, n) ≤ Γ(F, n) Γ(G, n);

namely, for any n-tuple Sn in the input space of G, it holds

|G(Sn, F ◦ G)| = |∪_{S′n∈G(Sn,G)} G(S′n, F)| ≤ |G(Sn, G)| max_{S′n∈G(Sn,G)} |G(S′n, F)| ≤ Γ(G, n) Γ(F, n).

Since FG = F_{G,K} ◦ F_{G,K−1} ◦ · · · ◦ F_{G,1}, we get

Γ(FG, n) ≤ Π_{k=1}^K Γ(F_{G,k}, n) ≤ (n + 1)^{Σ_{k=1}^K Σ_{i=1}^{n_k} ℓ_{k,i}} = (n + 1)^{|E|}.