Lecture_Notes_MAI
G. Blanchard
September 16, 2024
Contents

0 Some reminders on probability theory
  0.1 Elements of probability
  0.2 A few important properties
  0.3 Conditioning

2 Linear Discrimination: A brief overview of classical methods
  2.1 Linear discrimination functions
  2.2 The naive Bayes classifier
  2.3 Gaussian generative distribution: LDA and QDA
  2.4 Classification as regression
  2.5 Linear logistic regression
  2.6 Hinge-loss based methods: Perceptron and Support Vector Machine
  2.7 Regularization

  3.5 Uniform bounds over a finite class of prediction functions
  3.6 Uniform bounds over a countable class of prediction functions; regularized ERM
0 Some reminders on probability theory
0.1 Elements of probability
The theoretical approaches to artificial intelligence involve many different fields of mathe-
matics. A central tenet of most approaches to modern mathematical modeling of artificial
intelligence methods, which in one form or other receive data and “learn” from it, is that
said data should be modeled as random. Thus, while other distinct areas of mathematics
play an important role, that of probability theory should be considered central. In these
notes, we will chiefly concentrate on probabilistic and statistical aspects – what is usually
called “statistical learning theory”.
While we assume the reader to be familiar with mathematical elements of probability
theory, we start with recalling a few fundamentals.
" #
[ X
** P Ai ≤ Ai , (0.1)
i≥1 i≥1
which we will call the union bound (also known as Boole’s inequality).
A probability distribution is a measure, and as such we can integrate real-valued mea-
surable functions f : Ω → R with respect to P .
Random variables. A random variable (r.v.) over $(\Omega, \mathcal A, P)$ with values in a measurable space $(\mathcal X, \mathcal F)$ is a measurable map from the former to the latter space. It induces an image ("push-forward") probability measure $P_X$ on the image space (also sometimes denoted $X_\# P$), defined via
$$\forall F \in \mathcal F: \quad P_X(F) = P(X^{-1}(F)) = P(X \in F),$$
called the distribution of X. It will be assumed that all considered random variables in
a given context are defined on the same underlying probability space (Ω, A, P ); the latter
is generally left unspecified, since we will generally only be interested in studying some
specific random variables.
If $X$ is a random variable over $(\Omega, \mathcal A, P)$ and $G$ is a further measurable map $(\mathcal X, \mathcal F) \to (\mathcal Z, \mathcal G)$, then $Z = G(X)$ is obviously also a random variable over $(\Omega, \mathcal A, P)$ (with values in $\mathcal Z$). If $\mathcal Z = \mathbb R$ and $Z$ is integrable, we have the formula
** $$\int_{\mathbb R} z\, P_Z(dz) = \int_{\mathcal X} G(x)\, P_X(dx) =: E[Z], \qquad (0.2)$$
called the expectation of $Z$.
The first equality ("change of variables formula") can be useful since it possibly avoids computing explicitly the distribution of $Z$ to perform the integral. Expectation can be defined for a vector-valued variable ($\mathcal Z = \mathbb R^d$) in the obvious way (i.e. coordinate-wise).
Independence. Two random variables $X, X'$ (with values in $(\mathcal X, \mathcal F)$ and $(\mathcal X', \mathcal F')$ respectively) defined on the same probability space are independent if
$$\forall F \in \mathcal F,\ F' \in \mathcal F': \quad P_{(X,X')}(F\times F') = P_X(F)\,P_{X'}(F'),$$
or equivalently
$$\forall F \in \mathcal F,\ F' \in \mathcal F': \quad P(X\in F,\ X'\in F') = P(X\in F)\,P(X'\in F').$$
Yet equivalently, independence holds when the joint distribution is the product measure
of the marginals:
P(X,X ′ ) = PX ⊗ PX ′ .
As previously, it is sufficient to check the above equalities on a generating π-system to
establish independence.
By Fubini’s theorem, if X, X ′ are real-valued independent and integrable, then their
product XX ′ is integrable and E[XX ′ ] = E[X]E[X ′ ].
If X, X ′ are independent variables and F, G are measurable mappings into further
measured spaces, then F (X) and G(X ′ ) are independent.
This generalizes to finite families of r.v.’s and even to countable families, though we
won’t really need it, since we will not be interested in a.s. convergence most of the time in
the present notes. When $X_1, X_2, \dots, X_n$ are independent and have the same marginal $P_X$, we say they form an independent identically distributed (i.i.d.) family and denote their joint distribution $P_X^{\otimes n}$.
Exercises
Exercise 0.1. Justify the first equality in (0.2) if the space X is countable (using formulas
of discrete probability, i.e. integrals become sums).
Exercise 0.3. If the r.v. $Z = (X, X')$ has density $f(x, x')$ with respect to a product measure $\mu\otimes\mu'$ on $\mathcal X\times\mathcal X'$, then the first marginal distribution $P_X$ of $Z$ has a density with respect to $\mu$: justify why and specify it explicitly.
Exercise 0.4. If PX has density f wrt. µ and PX ′ has density f ′ wrt. µ′ , and X, X ′ are
independent, then PX,X ′ has density (x, x′ ) 7→ f (x)f ′ (x′ ) wrt. µ ⊗ µ′ . Conversely: if PX,X ′
has density f (x, x′ ) wrt. µ ⊗ µ′ and it holds that f (x, x′ ) = g(x)h(x′ ) for some functions
g, h (not necessarily densities), then X, X ′ are independent.
0.2 A few important properties

Recall that the support of a measure $\mu$, denoted $\mathrm{Supp}(\mu)$, is the set of points whose every open neighborhood has positive measure; it is a closed set, and if $\mathcal X$ is a Polish space (metrizable, complete, separable) then
* $$\mu(\mathrm{Supp}(\mu)^c) = 0.$$
Recall also Jensen's inequality: if $\varphi$ is convex and $X$ is integrable, then
$$\varphi(E[X]) \le E[\varphi(X)].$$
(For $\varphi(x) = x^2$ this gives $E[X]^2 \le E[X^2]$.) However, the latter fact can also be established directly by the variance formula $\mathrm{Var}[X] = E[X^2] - E[X]^2 \ge 0$.
Exercise 0.5. Assume the probability distribution $P$ on a Borel space $\mathcal X$ has a continuous density $f$ with respect to a reference measure $\nu$. Prove that $\mathrm{Supp}(P) = \overline{\{x : f(x) > 0\}}$. Is this true if $f$ is not continuous?
Exercise 0.6. Prove the formula $\mathrm{Var}[X] = \frac12 E[(X - X')^2]$, where $X, X'$ are two independent square integrable real variables having the same marginal distribution.
0.3 Conditioning
If $A$ is an event with $P(A) > 0$, the standard definition of the conditional probability of an event $B$ given $A$ is $P(B|A) := \frac{P(A\cap B)}{P(A)}$. An important property is then that
$P(\bullet|A) := (B\mapsto P(B|A))$ is itself a probability distribution (it satisfies the axioms), called the distribution of $P$ conditional to $A$. If $X$ is a random variable, we can for instance take $A = \{X\in F\}$, provided $P(X\in F) > 0$.
In what follows, we’ll very often consider random variables (X, Y ) with a joint distribu-
tion PXY and would like to consider the conditional distributions conditional to X = x or
Y = y, but these events unfortunately have null probability in general, so that the above
definition does not apply. We need something more general.
Definition 0.1. Let $(\mathcal X, \mathcal F)$ and $(\mathcal Y, \mathcal G)$ be two measurable spaces. We call regular transition probability or Markov kernel from $\mathcal X$ to $\mathcal Y$ a mapping $\kappa : \mathcal G\times\mathcal X\to[0,1]$ such that:
(i) for every $x\in\mathcal X$, $\kappa(\cdot, x)$ is a probability distribution on $(\mathcal Y, \mathcal G)$;
(ii) for every $G\in\mathcal G$, the map $x\mapsto\kappa(G, x)$ is measurable.
A fundamental question is the converse: given a joint probability $P_{XY}$ on $\mathcal X\times\mathcal Y$, with first marginal distribution $P_X$, does there exist a transition probability $\kappa$ from $\mathcal X$ to $\mathcal Y$ such that $P_{XY} = \kappa\circ P_X$? The following theorem is fundamental and guarantees the
existence of such an object in a sufficiently broad situation.
In particular any Polish space, that is, a complete separable metrizable space, equipped
with its Borel σ-algebra is “nice”. The emphasis here is on separable, which somehow limits
the “size” of the output space of the kernel.
The disintegration theorem applies to product spaces without requiring random vari-
ables, but in probabilistic terms, it is more convenient to think of the joint distribution
PXY of the two random variables (X, Y ) : Ω → X × Y, and to call the transition kernel
regular conditional probability of Y given X = x, denoted PY |X (.|.). Thus to reiterate (0.3),
the property characterizing an rcpd is that it is a transition probability satisfying:
** For any integrable $f$:
$$\int_{\mathcal X\times\mathcal Y} f(x,y)\,P_{XY}(dx,dy) = \int_{\mathcal X}\int_{\mathcal Y} f(x,y)\,P_{Y|X}(dy|x)\,P_X(dx), \qquad (0.4)$$
For any events $A$ on $\mathcal X$ and $B$ on $\mathcal Y$:
$$P_{XY}(A\times B) = \int_A P_{Y|X}(B|x)\,P_X(dx). \qquad (0.5)$$
Remark 0.3. In the definition above, it is important to notice that there is no uniqueness: in fact, given any two regular transition probabilities $\kappa, \kappa'$ from $\mathcal X$ to $\mathcal Y$, and $P_X$ a distribution on $\mathcal X$, we check from (0.3) that $\kappa\circ P_X = \kappa'\circ P_X$ as soon as $\kappa(\bullet, x)$ and $\kappa'(\bullet, x)$ coincide $P_X$-a.s. Similarly, if $P_{Y|X}$ is an rcpd of $P$, modifying $x\mapsto P_{Y|X}(\bullet, x)$ on a $P_X$-null set gives another rcpd of $P$. In this sense a regular conditional distribution is only defined a.s. with respect to the marginal of the conditioning variable. This means that there is no good, absolute definition of "the conditional distribution of $Y$ given $X = x_0$" (if $P(X = x_0) = 0$) but rather a family of such conditional distributions (indexed by $x_0$).
But this family is unique up to a.s. equivalence; in particular, in the case where X, Y are
independent, we will always implicitly consider the “canonical” choice PY |X (·|x) = PY (·)
for all x.
If $X$ is a random variable $(\Omega, \mathcal A, P) \to (\mathcal X, \mathcal F)$, and $(\Omega, \mathcal A)$ is a nice space, then we may apply the previous theorem to the product space $\Omega\times\mathcal X$ equipped with the distribution of the random variable $Z(\omega) = (\omega, X(\omega))$ (this is the distribution $\delta_X\circ P$, where $\delta_X$ is the transition kernel $\kappa(\cdot, \omega) = \delta_{X(\omega)}$). We thus obtain a regular conditional probability on $\Omega$
given X. This can be used to give (PX -a.s.) a sense to the expression P (A|X = x), where
A is any event of Ω, which we will be using often; in particular the event A ⊂ Ω may be
defined depending on further random variables $Y_1, Y_2, \dots$, but we don't have to explicitly apply the disintegration theorem on a complicated product space given by the value space of these variables; we may directly use the notation $P(\cdot|X = x_0)$ without further ado.
For the remainder of these notes, we will assume without repeating that (Ω, A) is a
nice space, so that the previous argument applies and all regular conditional probabilities
exist.
Remark 0.5. Since the rcpd PY |X (•, x) is only unique up to PX -a.s. equivalence, conditional
expectations are also only defined uniquely up to PX -a.s. equivalence over x0 above.
However, we will always implicitly assume that we have chosen a specific representative
rcpd PY |X (•, x) “once and for all” and define all conditional expectations as above with
respect to this particular choice. The reason why we spell this out is that we want the conditional expectations of all integrable functions to be defined using the same common representative rcpd, which allows us to forget about the a.s. equivalence issue
and write properties such as the next proposition.
Remark 0.7. In classical probability courses, it is common to define the conditional ex-
pectation first, in a different and more general way (conditional expectation with respect
to a σ-algebra) that requires less formalism and is sometimes easier to handle. The ad-
vantage of considering rcpd’s PY |X is that they are probability distributions for each fixed
value of the conditioning variable x, and we can apply without restriction all theorems
of integration when integrating over the rcpd for any fixed x. Things get sometimes a
little bit more awkward when starting with conditional expectations. In particular the
“obvious” property (ii) above is not granted with the “usual” way of defining conditional
expectations – in fact in that “usual” framework it does not even formally make sense
since “usual” conditional expectations are only defined under a.s. equivalence, separately
for each function hx0 (.) := h(x0 , .); it does not make sense to say that two functions both
defined a.s. agree at a particular point. As we are using rcpds here, the a.s. equivalence is
“factored in” the choice of the common representative rcpd.
* Proposition 0.8. If $P_{XY}$ has density $f_{XY}(x, y)$ with respect to a product measure $\mu\otimes\nu$ on $\mathcal X\times\mathcal Y$, then a regular conditional probability distribution of $Y$ given $X$ is given by the density (with respect to $\nu$):
$$f_{Y|X}(y|x) := \begin{cases} \dfrac{f_{XY}(x,y)}{\int_{\mathcal Y} f_{XY}(x,y')\,\nu(dy')} & \text{if } \int_{\mathcal Y} f_{XY}(x,y')\,\nu(dy') = f_X(x) > 0;\\[2mm] f_0(y) & \text{if } f_X(x) = 0, \end{cases}$$
where $f_0$ is an arbitrary fixed probability density with respect to $\nu$.
Observe that the second case deals with points outside of the support of PX , and almost
surely does not happen by definition; this is why the choice of f0 does not matter.
Exercises
Exercise 0.7. Check that P is the first marginal distribution of κ ◦ P , defined in (0.3).
Exercise 0.8. Prove the properties of Proposition 0.6 directly from the definition.
Exercise 0.9. Prove Proposition 0.8, forgetting about a.s. uniqueness: just check that the transition kernel $\tilde P(\cdot\,, x)$ having the proposed density satisfies the fundamental properties of an rcpd.
1 Introduction to statistical learning theory (part 1):
Decision theory
1.1 Mathematical formalization
In a nutshell, a learning task is formalized as a prediction of a certain target variable Y
from the observation of an object X. The variability in these quantities is modeled via a
joint probability distribution P of these objects. What needs to be formalized is:
• what is a prediction of Y from X?
• how is the goodness of a prediction assessed?
• what would be a theoretically optimal prediction?
• how is a prediction function “learnt” from data?
In what follows (X , X) is a measurable space (for instance Rd or a subset of Rd ) called
the input space and (Y, Y) another measurable space called label space, most often Y = R,
Y = [a, b], Y = {1, . . . , K}, or Y = Rk . When Y is a real interval the setting is typically
called that of regression, and if Y is a finite set, classification.
It is assumed that the variables (X, Y ) have a joint distribution P over the product
space, called generative distribution. In a prediction setting, we assume a random realiza-
tion (x, y) of P ; the value of y (the label) is unknown to us but we would like to predict
it as well as possible from the knowledge of $x$ (the input, predictor or covariate). To allow for some flexibility, in the sequel the prediction can take values in a measurable space $\tilde{\mathcal Y}$ possibly distinct from $\mathcal Y$.
** Standard examples.
Example 1.3 (Regression with least squares loss). Take Y = Ye = R, and ℓ(y ′ , y) =
(y − y ′ )2 ; so E(f ) = E[(f (X) − Y )2 ].
• the 0 − 1 loss or hard loss: Y = {−1, 1}, Ye = R and ℓ(y ′ , y) = 1{yy ′ ≤ 0}: this is
equivalent to using the misclassification loss and interpreting the class prediction as
sign(y ′ ) (y ′ = 0 is interpreted as a classification error no matter what).
It is also possible to use a quadratic loss for multi-class classification: in this case we take $\tilde{\mathcal Y} = \mathbb R^K$, and $\ell(y', y) = \|y' - e_y\|^2$, where $e_i$ denotes the $i$-th canonical basis vector of $\mathbb R^K$.
The smallest achievable risk over all prediction functions is
$$E^*_\ell(P) := \inf_{f\in\mathcal F(\mathcal X, \tilde{\mathcal Y})} E_\ell(f, P),$$
where $\mathcal F(\mathcal X, \tilde{\mathcal Y})$ denotes the set of (measurable) functions from $\mathcal X$ to $\tilde{\mathcal Y}$. This is often simply abbreviated as $E^*(P)$ or simply $E^*$.
If this infimum is a minimum, an optimal prediction function $f_P^*$ is one achieving the minimum:
$$E(f_P^*) = E^*(P), \quad\text{i.e.}\quad f_P^* \in \operatorname{Arg\,Min}_{f\in\mathcal F(\mathcal X, \tilde{\mathcal Y})} E_\ell(f, P).$$
It is possible to restrict the search for a prediction function to a subset (sometimes called model or hypothesis class) $\mathcal G \subset \mathcal F(\mathcal X, \tilde{\mathcal Y})$, in which case one defines correspondingly $E^*_{\mathcal G}$ and (if it exists) $f^*_{\mathcal G}$ by restricting the inf or min to $\mathcal G$:
$$E^*_{\mathcal G,\ell}(P) = E^*_{\mathcal G} = \inf_{f\in\mathcal G} E_\ell(f, P).$$
The optimal prediction function $f^*$ (or possibly $f^*_{\mathcal G}$), generally implicitly assumed to exist, will be regarded as the target of the learning procedure.
Since the risk measures the goodness of prediction, it will be of interest to analyze how far a given prediction function $f$ is from the optimal: thus we will often study the excess risk $E(f) - E^*$ (or possibly $E(f) - E^*_{\mathcal G}$, the excess risk with respect to the prediction function class $\mathcal G$).
The following proposition is helpful to determine an (unrestricted) optimal prediction
function by reducing the problem to the case of a single probability distribution on Y and
a constant prediction.
* Proposition 1.6. Assume that we know a mapping $F$ from the set of probability distributions on $\mathcal Y$ to $\tilde{\mathcal Y}$ such that for any probability distribution $P_Y$ on $\mathcal Y$:
$$\forall c \in \tilde{\mathcal Y}: \quad E_{Y\sim P_Y}[\ell(F(P_Y), Y)] \le E_{Y\sim P_Y}[\ell(c, Y)].$$
Then $f^*(x) := F(P_{Y|X}(\cdot|x))$ is an optimal prediction function.

Proof. Define $f^*$ as above. For any $f\in\mathcal F(\mathcal X, \tilde{\mathcal Y})$, and any fixed $x\in\mathcal X$, taking $P \leftarrow P_{Y|X}(\cdot|x)$ and $c \leftarrow f(x)$ in the above gives
$$E_{Y\sim P_{Y|X}(\cdot|x)}[\ell(f^*(x), Y)] = E_{Y\sim P_{Y|X}(\cdot|x)}\bigl[\ell(F(P_{Y|X}(\cdot|x)), Y)\bigr] \le E_{Y\sim P_{Y|X}(\cdot|x)}[\ell(f(x), Y)].$$
Now taking expectation over x, i.e. integrating with the distribution PX , we obtain
$$E_{X\sim P_X}\bigl[E_{Y\sim P_{Y|X}(\cdot|X)}[\ell(f^*(X), Y)]\bigr] = E[E[\ell(f^*(X), Y)|X]] = E[\ell(f^*(X), Y)] = E(f^*, P_{XY})$$
$$\le E_{X\sim P_X}\bigl[E_{Y\sim P_{Y|X}(\cdot|X)}[\ell(f(X), Y)]\bigr] = E(f, P_{XY}).$$
(Important) examples.
*** Proposition 1.7. Under the setting of regression with quadratic loss, provided $Y$ is square integrable under $P_Y$, the optimal prediction function is $f^*(x) = E[Y|X = x]$.

Proof. Using Proposition 1.6, we analyze the simple case of a given probability distribution $P_Y$ on $\mathbb R$ (for which $Y$ is square integrable) and of a constant prediction $c$. We have
$$E\bigl[(Y - c)^2\bigr] = E\bigl[(Y - E[Y])^2\bigr] + (E[Y] - c)^2, \qquad (1.1)$$
which is obviously minimized for $c = E[Y]$.
Exercise 1.1. Justify equality (1.1).
*** Proposition 1.8. Consider the classification setting with $K$ classes and the misclassification loss. Then any prediction function $f^*$ with
$$f^*(x) \in \operatorname{Arg\,Max}_{y\in\{1,\dots,K\}} P[Y = y|X = x]$$
is an optimal classification rule. It is known as a Bayes classifier and the resulting $E^*$ as the Bayes error rate.
Proof. Using Proposition 1.6, we analyze the simple case of a given probability distribution $P_Y$ on $\{1,\dots,K\}$ and of a constant prediction $c$. Then
$$E_{Y\sim P_Y}[\mathbf 1\{Y \ne c\}] = 1 - P_Y(\{c\}).$$
Obviously this is minimized for $c \in \operatorname{Arg\,Max}_{y\in\{1,\dots,K\}} P_Y(\{y\})$. Using Proposition 1.6, this gives the claim.
Exercise 1.2. What is the optimal prediction function for the setting of classification us-
ing real-valued prediction and the quadratic loss (in the binary and in the multiclass-
classification case)?
** Definition 1.9. Given an integer $n\in\mathbb N_{>0}$, an estimator (for the decision function) acting on training samples of size $n$ is a mapping
$$(\mathcal X\times\mathcal Y)^n \to \mathcal F(\mathcal X, \tilde{\mathcal Y}), \qquad S_n \mapsto \hat f_{S_n}.$$
Note: it is usual practice in statistics that quantities with a hat notation are estimators, i.e. are implicitly functions of the data (sample); the explicit dependence is often dropped from the notation.
The generalization error or risk of an estimator is $E(\hat f_{S_n}) = E_{X,Y}\bigl[\ell(\hat f_{S_n}(X), Y)\bigr]$; observe that in this notation the sample $S_n$ is considered as fixed and we take the expectation with respect to a "new" point $(X, Y)$, hence the name "generalization error". If the sample data is modeled as random, this notation means that we are implicitly considering a conditional expectation conditional to the sample, and that the new point $(X, Y)$ is independent of the sample $S_n$. In the latter case (random sample), the expected generalization error/risk of estimator $\hat f$ is $E_{S_n\sim P^{\otimes n}}\bigl[E(\hat f_{S_n})\bigr]$, i.e. a double expectation of the loss over the training sample and the new independent point $(X, Y)$. (We will always assume in these notes that the training sample is i.i.d. with the marginal distribution $P$.)
By contrast, the training or empirical error of an estimator $\hat f$ is
** $$\hat E(\hat f) = \hat E(\hat f_{S_n}, S_n) := \frac1n\sum_{i=1}^n \ell(\hat f_{S_n}(X_i), Y_i) = E(\hat f_{S_n}, \hat P_n),$$
where $\hat P_n := \frac1n\sum_{i=1}^n \delta_{X_i,Y_i}$ is the empirical distribution associated to the sample $S_n$.
Observe the crucial fact that the empirical error depends "doubly" on the sample $S_n$: first because the sample determines the estimator $\hat f_{S_n}$, and second because the error is evaluated on the same sample.
In particular, in general, $E_{S_n\sim P^{\otimes n}}\bigl[\hat E(\hat f_{S_n})\bigr] \ne E_{S_n\sim P^{\otimes n}}\bigl[E(\hat f_{S_n})\bigr]$. This is to be contrasted with the fact that, for a fixed (non data-dependent) prediction function $f$, we have $E_{S_n\sim P^{\otimes n}}\bigl[\hat E(f)\bigr] = E(f)$ by simple linearity of the expectation.
yet $E(\hat f) = \frac12$ almost surely. This is an extreme case of what is known as the overfitting phenomenon. In general, some overfitting will happen: $\hat E(\hat f)$ will generally be smaller in expectation than $E(\hat f)$, i.e. it will have a negative bias, in statistical terms. One of the goals of the course is to control the amount of overfitting from a theoretical point of view. The above example suggests that we should limit ourselves in the possible choice of prediction function by considering appropriate models $\mathcal G \subsetneq \mathcal F(\mathcal X, \tilde{\mathcal Y})$.
Empirical risk minimization (ERM). With the previous caveat in mind, a general approach to construct an estimator $\hat f_{\mathrm{ERM}}$ is to minimize the empirical risk over a given model $\mathcal G \subsetneq \mathcal F(\mathcal X, \tilde{\mathcal Y})$ (the model $\mathcal G$ is assumed to be given in the context and omitted from the notation):
$$\hat f_{\mathrm{ERM}} \in \operatorname{Arg\,Min}_{f\in\mathcal G} \hat E(f, S_n).$$
Another example from classical statistics is that of ordinary least squares linear regression (OLS): in the regression setting ($\mathcal X = \mathbb R^d$) with the squared loss, consider the model $\mathcal G = \{f_\beta(x) = \langle\beta, x\rangle \mid \beta\in\mathbb R^d\}$; the ERM over that model is $f_{\hat\beta_{\mathrm{OLS}}}$ with
* $$\hat\beta_{\mathrm{OLS}} \in \operatorname{Arg\,Min}_{\beta\in\mathbb R^d} \sum_{i=1}^n (y_i - \langle\beta, x_i\rangle)^2. \qquad (1.2)$$
The "population" analogue (i.e. the optimal theoretical predictor over the same model $\mathcal G$) is
$$\beta^*_{\mathrm{OLS}} \in \operatorname{Arg\,Min}_{\beta\in\mathbb R^d} E\bigl[(Y - \langle\beta, X\rangle)^2\bigr];$$
expanding
$$E\bigl[(Y - \langle\beta, X\rangle)^2\bigr] = E[Y^2] + \beta^T\Sigma\beta - 2\beta^T\gamma,$$
with $\Sigma := E[XX^T]$ (the second moment matrix) and $\gamma := E[XY]$, the solution is $\beta^*_{\mathrm{OLS}} := \Sigma^{-1}\gamma$ by classical formulas for the minimum of a quadratic function (assuming $\Sigma$ invertible).
To compute $\hat\beta_{\mathrm{OLS}}$ we replace the generating distribution $P_{XY}$ by the empirical distribution $\hat P_n$ from the observed sample, therefore
* $$\hat\beta_{\mathrm{OLS}} = \hat\Sigma^{-1}\hat\gamma, \quad\text{where } \hat\Sigma := \frac1n\sum_{i=1}^n x_i x_i^T, \quad \hat\gamma := \frac1n\sum_{i=1}^n x_i y_i. \qquad (1.3)$$
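To make the plug-in formula concrete, here is a minimal NumPy sketch of (1.3) (illustrative, not from the original notes; it assumes the rows of the array X are the observations $x_i$):

```python
import numpy as np

def ols_plugin(X, y):
    """OLS estimate via the plug-in formula (1.3):
    beta_hat = Sigma_hat^{-1} gamma_hat, with Sigma_hat = (1/n) sum_i x_i x_i^T
    and gamma_hat = (1/n) sum_i x_i y_i."""
    n = X.shape[0]
    Sigma_hat = X.T @ X / n   # empirical second-moment matrix
    gamma_hat = X.T @ y / n   # empirical cross-moment vector
    # solve the linear system rather than inverting (assumes Sigma_hat invertible)
    return np.linalg.solve(Sigma_hat, gamma_hat)
```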
Exercise 1.4. Justify the validity of the last formula (1.3) for $\hat\beta_{\mathrm{OLS}}$.
** Definition 1.10 (Consistency). Let $(\hat f^{(n)})_{n\ge1}$ be a sequence of estimators (where $\hat f^{(n)}$ learns from a sample $S_n$ of size $n$) for a prediction problem with loss function $\ell$. Let $\mathcal P$ be a set of joint generating distributions on $\mathcal X\times\mathcal Y$, and $\mathcal G \subseteq \mathcal F(\mathcal X, \tilde{\mathcal Y})$ a subset of prediction functions.
Then the sequence $(\hat f^{(n)})_{n\ge1}$ is consistent in expectation on the distribution model $\mathcal P$ and the prediction model $\mathcal G$, if
$$\forall P\in\mathcal P: \quad \limsup_{n\to\infty}\, E_{S_n\sim P^{\otimes n}}\bigl[E(\hat f^{(n)}_{S_n}, P)\bigr] \le E^*_{\mathcal G}(P).$$
It is consistent in probability if
$$\forall P\in\mathcal P,\ \forall\varepsilon > 0: \quad P_{S_n\sim P^{\otimes n}}\bigl[E(\hat f^{(n)}_{S_n}, P) \ge E^*_{\mathcal G}(P) + \varepsilon\bigr] \to 0, \quad\text{as } n\to\infty.$$
Finally it is consistent almost surely if for any $P\in\mathcal P$, an i.i.d. sequence $(X_i, Y_i)_{i\ge1}$ from distribution $P$ and $S_n = ((X_1, Y_1), \dots, (X_n, Y_n))$, it holds
$$\limsup_{n\to\infty}\, E(\hat f^{(n)}_{S_n}, P) \le E^*_{\mathcal G}(P), \quad\text{almost surely.}$$
The sequence of estimators is called universally consistent if the above holds for $\mathcal P$ = all joint distributions on $\mathcal X\times\mathcal Y$ and $\mathcal G$ = all prediction functions.
In the case where we analyze consistency over a specific class $\mathcal G \subsetneq \mathcal F(\mathcal X, \tilde{\mathcal Y})$ of prediction functions, it is generally also assumed that the estimator $\hat f$ outputs predictor functions belonging to $\mathcal G$, though this does not have to be the case.
Simple example. Consider the regression with quadratic loss setting, where $\mathcal X$ is reduced to a singleton $\{0\}$, so that decision functions are given by a constant $\theta\in\mathbb R$, and the optimal prediction $\theta^*$ is just the (marginal) expectation $E[Y]$. Then the ERM is the empirical average $\hat\theta_n = \frac1n\sum_{i=1}^n y_i$, which is consistent by the law of large numbers.
Proposition 1.11. Let $\mathcal G \subsetneq \mathcal F(\mathcal X, \tilde{\mathcal Y})$ be a finite set of prediction functions and $\hat f^{(n)}_{\mathrm{ERM}}$ be the ERM estimator on the set $\mathcal G$ using a sample of size $n$. Then $(\hat f^{(n)}_{\mathrm{ERM}})_{n\ge1}$ is almost surely consistent over $\mathcal G$ and for all probability distributions on $\mathcal X\times\mathcal Y$.

Proof. Write $\mathcal G = \{f_1, \dots, f_K\}$, and assume first that all risks $E(f_k)$ are finite. By the (strong) law of large numbers applied to each of the finitely many fixed functions $f_k$,
$$\sup_{k\in\{1,\dots,K\}} \bigl|\hat E(f_k) - E(f_k)\bigr| \to 0 \quad\text{as } n\to\infty, \quad\text{almost surely.}$$
Now, remember that we cannot apply the law of large numbers to $\hat E(\hat f^{(n)}_{\mathrm{ERM}}, S_n)$, since it is an average of variables which are not i.i.d., because of the "double" dependence on the sample $S_n$.
However, by definition of the ERM estimator: (a) $\hat f^{(n)}_{\mathrm{ERM}} \in \mathcal G$ and (b) $\hat E(\hat f^{(n)}_{\mathrm{ERM}}) \le \hat E(f_k)$ for any $k\in\{1,\dots,K\}$. Furthermore, let us denote $k^*$ an index such that $E(f_{k^*}) = E^*_{\mathcal G}$. Combining these facts,
$$E(\hat f^{(n)}_{\mathrm{ERM}}) - E^*_{\mathcal G} \le 2\sup_{f\in\mathcal G}\bigl|\hat E(f) - E(f)\bigr|, \qquad (1.4)$$
and we have seen that the latter quantity converges to 0 almost surely, hence the conclusion.
In the case where all prediction functions $f\in\mathcal G$ have infinite risk, $E^*_{\mathcal G} = \infty$ and there is nothing to prove. In the case where some prediction functions $f\in\mathcal G$ have infinite risk, note that if $E(f) = \infty$ for a fixed prediction function $f$, then the law of large numbers implies $\lim_{n\to\infty}\hat E(f, S_n) = \infty$ almost surely. The above arguments still work in this case with minor adaptations which are left to the reader.
Please observe that inequality (1.4) is fundamental and will be used again later for
more detailed analysis of the ERM.
We now consider regression with quadratic loss under the following generative model:
$$Y = f^*(X) + \varepsilon, \qquad E[\varepsilon] = 0, \quad E[\varepsilon^2] = \sigma^2, \quad \varepsilon \perp\!\!\!\perp X, \qquad (1.5)$$
where we will consider $\mathcal X = [0, 1]$; note that the marginal distribution $P_X$ is left totally arbitrary by the above model. We will make the assumption that $f^*$ is an $L$-Lipschitz function. Recall that in the case of regression with quadratic loss, $f^*(x) = E[Y|X = x]$ is the optimal predictor.
The estimator $\hat f$ we will consider, based on the observed sample $S_n = (X_i, Y_i)_{1\le i\le n}$, is a piecewise constant function on the equal-width intervals $I_j := \bigl[\frac{j-1}{K}, \frac jK\bigr)$, $j = 1, \dots, K-1$, $I_K = \bigl[\frac{K-1}{K}, 1\bigr]$, where the choice of the number $K$ of intervals will be determined later. Putting $N_j := \{1\le i\le n : X_i\in I_j\}$, we define the regressogram estimator with $K$ bins as
$$\hat f(x) := \sum_{j=1}^K \mathbf 1\{x\in I_j\}\,\hat a_j; \qquad \hat a_j := \frac{1}{|N_j|+1}\sum_{i\in N_j} Y_i, \quad j = 1,\dots,K. \qquad (1.6)$$
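A minimal sketch of the regressogram (1.6), assuming inputs lie in [0, 1] (the function and variable names are ours, not from the notes):

```python
import numpy as np

def regressogram(X_train, Y_train, K):
    """Regressogram (1.6) on [0, 1] with K equal-width bins.
    The (|N_j| + 1) denominator from the text makes empty bins predict 0."""
    # bin index in {0, ..., K-1} of each training point
    bins = np.minimum((X_train * K).astype(int), K - 1)
    a_hat = np.zeros(K)
    for j in range(K):
        in_bin = (bins == j)
        a_hat[j] = Y_train[in_bin].sum() / (in_bin.sum() + 1)
    def f_hat(x):
        return a_hat[min(int(x * K), K - 1)]
    return f_hat
```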
Proposition 1.13. The regressogram estimator (1.6) with $K = K(n)$ bins:
• is consistent in average risk, i.e. satisfies $E_{S_n}\bigl[E(\hat f_n)\bigr] - E^* \to 0$ as $n\to\infty$, provided $K(n)\to\infty$ but $K(n) = o(n)$ as $n\to\infty$;
• satisfies $E_{S_n}\bigl[E(\hat f_n)\bigr] - E^* = O(n^{-2/3})$ if $K(n) \sim n^{1/3}$.
Note: we write ESn EX [. . .] to emphasize that the expectation is over a new independent
point X and the random sample Sn ∼ P ⊗n ; it would be more appropriate to write the
inner expectation as a conditional expectation conditional to Sn , but because X, Sn are
independent, we use the above notation that can be understood as an iterated integral.
Furthermore, according to the generative model we can decompose the expectation over
Sn as an expectation over (Xi )1≤i≤n and (εi )1≤i≤n ; finally since all these variables are
independent we can perform expectations in the order we want by Fubini’s theorem.
Concentrating on one term, we have
$$A_j = (\hat a_j - f^*(X))^2\,\mathbf 1\{X\in I_j\} = \Bigl(\frac{1}{|N_j|+1}\sum_{i\in N_j}\bigl(f^*(X_i)+\varepsilon_i\bigr) - f^*(X)\Bigr)^2 \mathbf 1\{X\in I_j\}$$
$$= \frac{1}{(|N_j|+1)^2}\Bigl(\underbrace{\sum_{i\in N_j}\bigl(f^*(X_i) - f^*(X)\bigr) - f^*(X)}_{=:W_j} + \underbrace{\sum_{i\in N_j}\varepsilon_i}_{=:Z_j}\Bigr)^2 \mathbf 1\{X\in I_j\}.$$
Taking expectations over $(\varepsilon_i)_{1\le i\le n}$ first, we notice $E_{(\varepsilon_i)}[(W_j+Z_j)^2] = W_j^2 + E[Z_j^2] = W_j^2 + |N_j|\sigma^2$. For $X_i, X\in I_j$ we have $|f^*(X_i) - f^*(X)| \le L/K$ by the Lipschitz assumption on $f^*$, and we can also write $|f^*(X)| \le M$ for some $M$, since a Lipschitz function must be bounded on a compact set. All in all we thus have $W_j^2\,\mathbf 1\{X\in I_j\} \le \bigl(L\frac{|N_j|}{K} + M\bigr)^2 \mathbf 1\{X\in I_j\}$.
We thus finally get, putting $p_j := P[X\in I_j]$, and using $(a+b)^2 \le 2(a^2+b^2)$:
$$E[A_j] \le E\biggl[\frac{1}{(|N_j|+1)^2}\Bigl(\bigl(L\tfrac{|N_j|}{K} + M\bigr)^2 + |N_j|\sigma^2\Bigr)\mathbf 1\{X\in I_j\}\biggr] \le \frac{2L^2}{K^2}\,p_j + (\sigma^2 + 2M^2)\,p_j\,E\Bigl[\frac{1}{|N_j|+1}\Bigr].$$
Since $|N_j| = \sum_{i=1}^n \mathbf 1\{X_i\in I_j\}$, it has a $\mathrm{Binom}(n, p_j)$ distribution and therefore satisfies $E[(|N_j|+1)^{-1}] \le (p_j(n+1))^{-1}$ (left as an exercise). Summing over $j$ and using $\sum_{j=1}^K p_j = 1$,
we finally get
$$E_{S_n}\bigl[E(\hat f)\bigr] - E^* \le 2\frac{L^2}{K^2} + (\sigma^2 + 2M^2)\frac{K}{n+1}. \qquad (1.7)$$
The conclusion of the proposition follows from the above inequality, by plugging in $K = K(n)$ according to the assumptions of the proposition, letting $n$ go to infinity and using some standard estimates.
A few remarks are in order:
• The strong assumption that was made is the Lipschitz character of the regression
function f ∗ .
• The estimator considered is very similar to (but not quite equal to) the ERM estimator over the model $\mathcal G_K$ of piecewise constant predictions on $K$ equally sized intervals (in fact, the conclusion of the proposition would be essentially the same for the ERM estimator on $\mathcal G_K$). Thus, this is an example where it is advantageous to select a prediction model $\mathcal G_K$ of limited "complexity" (even though we know the optimal prediction does not lie in this model), and to let the complexity grow appropriately with the sample size $n$.
• The two terms appearing in the bound (1.7) can be interpreted as an approximation
term (decreasing with K and independent of n; it comes from the fact that we can
better approximate a Lipschitz function by a piecewise constant function as K grows,
regardless of the sample) and an estimation term (growing with K but decreasing
with n; it comes from the fact that we can average out the noise variance σ 2 provided
we have enough points per interval, but the average number of points per interval
decreases with K). This “balancing” phenomenon is very common in statistical
learning theory.
Exercise 1.5. Prove the following implications between the different types of consistency:
a) Almost sure consistency on $\mathcal G$ implies consistency in probability on $\mathcal G$.
b) If the estimators $\hat f^{(n)}$ take their values in the set of decision functions $\mathcal G$, consistency in expectation on $\mathcal G$ implies consistency in probability on $\mathcal G$.
c) If the loss function is bounded, almost sure consistency implies consistency in expectation.
Hint: consider the sequence of random variables $Z_n := E(\hat f^{(n)}_{S_n})$, and use fundamental relations and implications of probabilistic convergence from a probability lecture. For (b), establish and then use the fact that $Z_n - E^*_{\mathcal G} \ge 0$. For (c), use Fatou's lemma.
Exercise 1.6. How do the conclusions of Proposition 1.13 change if one assumes instead
that f ∗ is α-Hölder (α ∈ (0, 1])?
Exercise 1.7. Write explicitly an ERM estimator for the model GK defined above. How does
it differ from the estimator defined in (1.6)? (More technical:) how can Proposition 1.13
be adapted for the ERM estimator?
Exercise 1.8. Prove that $\hat f := f_{\hat\beta_{\mathrm{OLS}}}$ is almost surely consistent in the model $\mathcal G$ of linear functions. Hint: recall the formula (1.3) for $\hat\beta_{\mathrm{OLS}}$. Compare to the formula for $\beta^*_{\mathrm{OLS}}$ (assume $\Sigma$ invertible). Use the LLN.
** Consider the binary classification case, $\mathcal Y = \{0, 1\}$. In this case, if we denote $\eta(x) := P[Y = 1|X = x]$, the (or more precisely "a") Bayes classifier takes the form $f^*(x) = \mathbf 1\{\eta(x) \ge \frac12\}$, and $E^* = E[\min(\eta(X), 1 - \eta(X))]$.
For various reasons it can be more convenient to estimate the probability function $\eta(x)$ rather than the classifier $f^*(x)$, for instance by using a regression method aiming at a small risk for the quadratic loss (see Exercise 1.2). Let $\hat\eta$ be such an estimate (we assume it is given beforehand and do not specify how it was obtained). A natural way to transform this estimate into an actual classifier is to "plug in" the estimate in place of $\eta$ in the formula for $f^*$, that is, define
$$\hat f(x) = \mathbf 1\Bigl\{\hat\eta(x) \ge \frac12\Bigr\}.$$
A natural question is whether $\hat f$ is a good estimate of $f^*$ whenever $\hat\eta$ is a good estimate of $\eta$. Here, the "goodness" of an estimate will be measured via the excess risk to the optimal, for the classification risk and the squared loss risk, respectively.
** Proposition 1.14 (Excess risk inequality for plug-in classification). Let $\ell$ denote the misclassification loss and $h$ the quadratic loss. For any function $\hat\eta$ estimating $\eta$, define the plug-in classifier $\hat f$ as above. Then
$$0 \le E_\ell(\hat f) - E^*_\ell \le 2E\bigl[|\hat\eta(X) - \eta(X)|\bigr] \le 2E\bigl[(\hat\eta(X) - \eta(X))^2\bigr]^{\frac12} = 2\bigl(E_h(\hat\eta) - E^*_h\bigr)^{\frac12}.$$
Proof. Note that the first inequality of the claim follows from the definition of $E^*_\ell$, and the third is Jensen's inequality. As in the proof of Proposition 1.6, it is possible to argue conditionally to $X = x$ for any fixed $x$, considering only expectations with respect to a (conditional) probability distribution over $\mathcal Y$, and then integrate the obtained pointwise inequalities over $X$ at the end. We therefore consider $x$ as fixed, omit the dependence on $x$ of $\hat f, f^*, \hat\eta, \eta$, and treat them as constants.
We first establish the (sometimes useful) equality that for any classifier $f$, one has
$$E_\ell(f) - E^*_\ell = 2E\Bigl[\mathbf 1\{f(X)\ne f^*(X)\}\Bigl|\eta(X) - \frac12\Bigr|\Bigr]. \qquad (1.8)$$
Indeed, as suggested above, we condition with respect to $X = x$ and consider $x$ as fixed. On the left-hand side, we have $E[\mathbf 1\{f\ne Y\} - \mathbf 1\{f^*\ne Y\}\,|\,X = x]$. Assume $\eta \ge \frac12$ (the other case is of course similar); then $f^* = 1$. If $f = f^* = 1$, then obviously
$$E[\mathbf 1\{f\ne Y\} - \mathbf 1\{f^*\ne Y\}] = 0 = 2\,\mathbf 1\{f^*\ne f\}\Bigl|\eta - \frac12\Bigr|.$$
If $f = 0$, then
$$E[\mathbf 1\{f\ne Y\} - \mathbf 1\{f^*\ne Y\}] = \eta(1-0) + (1-\eta)(0-1) = 2\eta - 1 = 2\,\mathbf 1\{f^*\ne f\}\Bigl|\eta - \frac12\Bigr|.$$
Thus (1.8) is established pointwise conditionally to $X = x$, thus also in expectation.
Now, consider the specific case $\hat f = \mathbf 1\{\hat\eta \ge \frac12\}$. Still conditioning with respect to $X = x$, we have pointwise
$$\mathbf 1\{f^*\ne \hat f\}\Bigl|\eta - \frac12\Bigr| \le |\hat\eta - \eta|. \qquad (1.9)$$
Indeed, consider again the cases $f^* = \hat f$ (trivial since the left-hand side is 0), and $f^*\ne\hat f$ (the inequality holds because $\hat\eta$ and $\eta$ must then be on opposite sides of $\frac12$).
Altogether (1.8) and (1.9) (integrated over X) give the second inequality of the claim.
Finally, for the last equality, notice again that, pointwise conditionally with respect to $X = x$, we have, since $\eta(x) = E[Y|X = x]$:
$$E[(\hat\eta - Y)^2] - E[(\eta - Y)^2] = (\hat\eta - \eta)^2$$
(where all expectations are to be understood w.r.t. $P_{Y|X}(\cdot|x)$), giving the last equality after integration w.r.t. $X\sim P_X$.
Proposition 1.14 shows in particular that if the squared loss excess risk of the estimate $\hat\eta$ converges to zero, then so does the classification excess risk of the associated plug-in classification rule. So convergence of $\hat\eta$ to the optimal regression function $\eta$ for the quadratic risk implies convergence of the plug-in rule $\hat f$ to the Bayes classifier, for the classification risk.
Remark: This implies that universal consistency of $\hat\eta$ implies universal consistency of the associated plug-in rule $\hat f$. However, for consistency over a specific model $\mathcal G$, things are more complicated. The result of Proposition 1.14 does not hold in general if we replace the optimal Bayes risks $E^*_h, E^*_\ell$ by the optimal risks $E^*_{\mathcal G, h}, E^*_{\tilde{\mathcal G}, \ell}$ over an arbitrary model $\mathcal G$ and the induced model $\tilde{\mathcal G}$.
Exercise 1.9. Develop the reflections in the previous remark into a clean argument. Let $\mathcal G$ be a subset of $\mathcal F(\mathcal X, [0,1])$ and $\tilde{\mathcal G} = \bigl\{f : x\mapsto \mathbf 1\{g(x)\ge\frac12\},\ g\in\mathcal G\bigr\} \subset \mathcal F(\mathcal X, \{0,1\})$ the set of plug-in classification rules induced by $\mathcal G$. Justify that, if a sequence $\hat\eta^{(n)}$ of estimators is consistent (either in expectation, probability or almost surely) for the model $\mathcal G$ and the set of distributions
$$\mathcal P_{\mathcal G} := \{P\in\mathcal P(\mathcal X\times\{0,1\}) : \eta^*_P\in\mathcal G\},$$
(where $\eta^*_P$ is the function $x\mapsto P(Y = 1|X = x)$), then the associated sequence of plug-in rules $\hat f^{(n)}$ is consistent for the model $\tilde{\mathcal G}$ and the same set of distributions. (Use directly the result of Proposition 1.14.)
On the other hand, find a counter-example as simple as possible showing that the result of Proposition 1.14 does not hold if we replace the unrestricted optimal risks $E^*_h, E^*_\ell$ by the optimal risks $E^*_{\mathcal G,h}, E^*_{\tilde{\mathcal G},\ell}$ over models $\mathcal G, \tilde{\mathcal G}$, even if we assume $\hat\eta\in\mathcal G$. Hint: it is sufficient to exhibit a class $\mathcal G$ and a distribution $P$ for which the optimal decision $\eta^*_{P,\mathcal G}$ for the quadratic risk is such that the associated plug-in rule is not the optimal classifier in $\tilde{\mathcal G}$. For this it is enough to consider a 2-point space $\mathcal X$ and a suitable class $\mathcal G$ with just two elements.
where $\mathcal P_{XY}$ denotes the set of all probability distributions on $\mathcal X\times\mathcal Y$.
discussed later. Since X is infinite, we can find a subset Xm = {t1 , . . . , tm } of size m in
X . For any r = (r1 , . . . , rm ) ∈ {0, 1}m , let Pr be the probability distribution on X × {0, 1}
such that:
• X is drawn uniformly at random in the set Xm ;
• Pr (Y = 1|X = ti ) = ri for any i = 1, . . . , m.
Note that E ∗ (Pr ) = 0 for any r.
The following step is a key idea for lower risk bounds. Rather than trying to find a worst-case distribution for the estimator $\hat f$, we will consider its risk on average over the family $P_r$ when $r$ is itself drawn at random. Clearly, the worst-case risk can only be higher than this average:
$$\sup_{P\in\mathcal P_{XY}:\, E^*(P)=0} E_{S_n\sim P^{\otimes n}}\bigl[E(\hat f_{S_n}, P)\bigr] \ge \sup_{r\in\{0,1\}^m} E_{S_n\sim P_r^{\otimes n}}\bigl[E(\hat f_{S_n}, P_r)\bigr] \ge E_r\Bigl[E_{S_n\sim P_r^{\otimes n}}\bigl[E(\hat f_{S_n}, P_r)\bigr]\Bigr],$$
where the outer expectation is over $r$ drawn uniformly at random over $\{0,1\}^m$. We now rewrite and bound the right-hand side:
$$E_r\Bigl[E_{S_n\sim P_r^{\otimes n}}\bigl[E(\hat f_{S_n}, P_r)\bigr]\Bigr] = E_r\Bigl[E_{S_n\sim P_r^{\otimes n}}\bigl[E_{(X,Y)\sim P_r}\bigl[\mathbf 1\{\hat f_{S_n}(X)\ne Y\}\bigr]\bigr]\Bigr] = P\bigl[\hat f_{S_n}(X)\ne Y\bigr],$$
where the probability $P$ is over the draw of everything ($r$ uniform in $\{0,1\}^m$, $(S_n, (X,Y)) \sim P_r^{\otimes(n+1)}$). Next, by the law of total probability, we can write the latter probability as
$$P\bigl[\hat f_{S_n}(X)\ne Y\bigr] = \sum_{(x_1,\dots,x_n)\in\mathcal X_m^n} P\bigl[\hat f_{S_n}(X)\ne Y \,\big|\, X_1 = x_1, \dots, X_n = x_n\bigr]\, P[X_1 = x_1, \dots, X_n = x_n].$$
We now bound the conditional probability in each of the terms of the previous sum; to simplify notation, denote $A := \{X_i = x_i,\ i = 1,\dots,n\}$. Note that, conditionally to $A$, concerning the training sample $S_n$ only the training labels are random, and they are given by the draw of the elements of the vector $r$ corresponding to the elements $x_1,\dots,x_n$ of $\mathcal X_m$.
$$P\bigl[\hat f_{S_n}(X)\ne Y \,\big|\, A\bigr] \ge P\bigl[\hat f_{S_n}(X)\ne Y,\ X\notin\{x_1,\dots,x_n\} \,\big|\, A\bigr]$$
$$= P\bigl[\hat f_{S_n}(X)\ne Y \,\big|\, X\notin\{x_1,\dots,x_n\},\ A\bigr]\, P\bigl[X\notin\{x_1,\dots,x_n\} \,\big|\, A\bigr].$$
It is intuitive and can be checked (left as an exercise) that, conditionally to the event $A$ and $X\notin\{x_1,\dots,x_n\}$, $X$ is drawn uniformly at random in $\mathcal X_m\setminus\{x_1,\dots,x_n\}$, and its corresponding true label $Y$ is drawn at random with probability $\frac12$, independently of $S_n$ and $X$, and therefore independently of $\hat f_{S_n}(X)$. Thus, the first conditional probability above is $\frac12$. As for the second, it is larger than $(1 - n/m)$. Backtracking, we get $P[\hat f_{S_n}(X)\ne Y] \ge \frac12(1 - \frac nm)$, so this is also a lower bound on our initial supremum. Since we can choose $m$ arbitrarily large, the result follows.
2 Linear Discrimination:
A brief overview of classical methods
In this chapter, we will exclusively focus on the classification problem (also called discrimination, especially in the older literature). The methods presented here are classical and not particularly recent (the idea of Linear Discriminant Analysis dates back to R.A. Fisher in 1936, Rosenblatt's Perceptron is from 1956), but they still form the backbone of most machine learning toolboxes nowadays, and they should be known in order to understand more recent developments. Furthermore, we will not study questions of convergence or statistical consistency in this chapter; this is relegated to later chapters.
We consider $\mathcal Y = \{0, 1, \dots, K-1\}$, with $K\ge 2$, each element of $\mathcal Y$ being called a class. There are $K$ classes; the case $K = 2$ is called binary classification.
We will also always assume that the input space is X ⊂ Rd .
** Definition 2.1. The family of affine score functions $(s_y(\cdot))_{y\in\mathcal Y}$ based on vectors $w = (w_y)_{y\in\mathcal Y}\in\mathbb R^{dK}$ and constants $b = (b_y)_{y\in\mathcal Y}\in\mathbb R^K$ is given by
$$s_y(x) = \langle w_y, x\rangle + b_y, \quad y\in\mathcal Y.$$
The associated linear discrimination function is
$$f_{w,b}(x) = \operatorname{Arg\,Max}_{y\in\mathcal Y}\, s_y(x). \qquad (2.1)$$
Note: Strictly speaking, in (2.1) we should break ties in some way in case the Arg Max contains more than one element, i.e. two or more scores are equal. In order to avoid uninteresting complications, we will always assume that in such cases the ties are broken in favor of the smallest class, i.e. the Arg Max is always replaced by its smallest element; note that it is always non-empty since the number of classes is finite.
If we denote $\tilde x := (x, 1)\in\mathbb R^{d+1}$, it holds $\langle w_y, x\rangle + b_y = \langle\tilde w_y, \tilde x\rangle$, wherein $\tilde w_y := (w_y, b_y)$. Thus, if we consider the "augmented" input space $\mathbb R^{d+1}$, the score functions become linear functions of $\tilde x$. For this reason we will occasionally (depending on the context) drop the constants $b$, and implicitly assume we have performed this augmentation operation. Also for this reason, we will with some abuse of language talk about "linear" score functions although they are strictly speaking affine.
In the binary classification case ($K = 2$), observe that we have
$$f_{w,b}(x) = \mathbf 1\bigl\{\langle w_1 - w_0, x\rangle + (b_1 - b_0) \ge 0\bigr\}, \qquad (2.2)$$
with the abbreviations $w := w_1 - w_0$ and $b := b_1 - b_0$
(we assume ties are broken in favor of class 1 to simplify). Therefore, in the case of binary classification, we will consider only one score $s(x) = \langle w, x\rangle + b$. (A similar reduction in parameters can be achieved for $K > 2$, by choosing the class 0 as reference and defining modified scores $s'_y(x) = s_y(x) - s_0(x)$, $y\in\mathcal Y$. But this breaks the symmetry somewhat, so that generally the full parametrization is used.)
The goal of this chapter is to give an overview of classical methods to construct a linear discrimination function from a training sample $S_n = ((x_i, y_i))_{1\le i\le n}$. The most natural approach, if the standard classification loss is the target, is the associated ERM:
$$(\hat w, \hat b) = \operatorname{Arg\,Min}_{(w,b)\in\mathbb R^{K(d+1)}} \hat E(f_{w,b}) = \operatorname{Arg\,Min}_{(w,b)\in\mathbb R^{K(d+1)}} \frac1n\sum_{i=1}^n \mathbf 1\Bigl\{\operatorname{Arg\,Max}_{y\in\mathcal Y}\bigl(\langle w_y, x_i\rangle + b_y\bigr) \ne y_i\Bigr\}.$$
Unfortunately (even in the binary classification case), the above minimization problem is considered at best cumbersome and at worst quite intractable, in particular in high dimension $d$, with large training data size $n$, etc. This is essentially due to the fact that the empirical risk is a discontinuous, piecewise constant function of the parameters, so that usual numerical optimization approaches such as (stochastic) gradient descent are not applicable.
It is fair to say that ERM in the above form for classification is almost never used in
practice. Instead, several alternative approaches are used, falling roughly speaking in two
categories:
1. One models the generative distribution of the data (typically, the class proportions and the class-conditional distributions of $X$), estimates its parameters, and plugs these estimates into the formula for the corresponding Bayes classifier. This "generative" plug-in approach is followed in Sections 2.2 and 2.3.
2. One considers directly the linear score functions (with output in $\tilde{\mathcal Y} = \mathbb R^K$), and uses a different loss $\ell$ on $\tilde{\mathcal Y}\times\mathcal Y$ which lends itself better to optimization (typically because it is convex in the first variable). This approach is called the use of a "proxy loss".
2.2 The naive Bayes classifier

Assumption (NB): conditionally to $Y$, the coordinates $(X^{(1)}, \dots, X^{(d)})$ of $X$ are independent.

Proposition 2.2 (Naive Bayes, binary classification case). Assume $\mathcal X = \{0,1\}^d$, $K = 2$, and assumption (NB) is satisfied. Then the optimal (Bayes) classifier function takes the form
$$f^*(x) = \mathbf 1\Bigl\{\sum_{k=1}^d w^{(k)} x^{(k)} + b \ge 0\Bigr\},$$
where, using the notation $p_{k,j} := P\bigl[X^{(k)} = 1\,\big|\,Y = j\bigr]$ and $\pi_j := P[Y = j]$ (which are assumed to belong to (0, 1)):
$$w^{(k)} := \log\frac{p_{k,1}(1 - p_{k,0})}{(1 - p_{k,1})\,p_{k,0}}; \qquad b := \sum_{k=1}^d \log\frac{1 - p_{k,1}}{1 - p_{k,0}} + \log\frac{\pi_1}{\pi_0}.$$
Estimation from data: in order to estimate a Naive Bayes classifier in practice, the plug-in principle is used, that is, the theoretical parameters $p_{k,i}$ and $\pi_i$ are replaced in the formula by their frequentist estimators from a sample $S_n = ((x_i, y_i))_{1\le i\le n}$:
$$\hat\pi_i := \frac{|\{j : y_j = i\}|}{n}; \qquad \hat p_{k,i} := \frac{\bigl|\{j : (x_j^{(k)}, y_j) = (1, i)\}\bigr|}{|\{j : y_j = i\}|},$$
assuming the last denominator is nonzero (i.e. there exists at least one training example
in each class.)
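As an illustration, a minimal sketch of the plug-in Naive Bayes classifier for binary data and $K = 2$ (assuming both classes are present and all frequencies lie in (0, 1); in practice one would add smoothing; names are ours):

```python
import numpy as np

def naive_bayes_fit(X, y):
    """Plug-in Naive Bayes (Proposition 2.2) for X in {0,1}^d, y in {0,1}:
    returns (w, b); predict class 1 iff w @ x + b >= 0."""
    pi0, pi1 = np.mean(y == 0), np.mean(y == 1)
    # p[k, i]: empirical frequency of X^(k) = 1 within class i
    p = np.stack([X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)], axis=1)
    w = np.log(p[:, 1] * (1 - p[:, 0]) / ((1 - p[:, 1]) * p[:, 0]))
    b = np.sum(np.log((1 - p[:, 1]) / (1 - p[:, 0]))) + np.log(pi1 / pi0)
    return w, b
```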
Exercise 2.1. Generalize the above result of Proposition 2.2 to the case K > 2.
2.3 Gaussian generative distribution: LDA and QDA

Assumption (GD): conditionally to $Y = i$, $X$ follows a Gaussian distribution $\mathcal N(m_i, \Sigma_i)$, for $i = 0, \dots, K-1$.
We will assume furthermore that the Gaussian distributions involved are non-degenerate, i.e. the covariance matrices $\Sigma_i$ have full rank. In this case the distributions have respective densities
$$f_i(x) = \frac{1}{\sqrt{(2\pi)^d\,|\Sigma_i|}}\exp\Bigl(-\frac12(x - m_i)^T\Sigma_i^{-1}(x - m_i)\Bigr), \quad i = 0,\dots,K-1,$$
with respect to the Lebesgue measure on $\mathbb R^d$, and from Exercise 1.3 we know that $f^*(x) = \operatorname{Arg\,Max}_i(\pi_i f_i(x))$ is a Bayes classifier, where $\pi_i := P[Y = i]$. Since the arg max is unchanged by a monotone increasing transformation of the function to maximize, we can take the logarithm and obtain
$$f^*_{\mathrm{QDA}}(x) = \operatorname{Arg\,Max}_i\, s_i(x), \quad\text{where } s_i(x) := \log\pi_i - \frac12\log|\Sigma_i| - \frac12(x - m_i)^T\Sigma_i^{-1}(x - m_i). \qquad (2.6)$$
Since the above score functions are quadratic polynomials in $x$, this is called Quadratic Discriminant Analysis (QDA). The main quadratic term $(x - m_i)^T\Sigma_i^{-1}(x - m_i)$ is called the (squared) Mahalanobis distance of $x$ to the center $m_i$ of the Gaussian distribution $\mathcal N(m_i, \Sigma_i)$.
Under the assumption (GDEC), by which we denote (GD) with equal covariances ($\Sigma_i = \Sigma$ for all $i$), we observe that the quadratic part of the above scores is identical for all scores. We can therefore subtract it from all scores without changing the arg max as above, and we obtain
$$f^*_{\mathrm{LDA}}(x) = \operatorname{Arg\,Max}_i\, s_i(x), \quad\text{where } s_i(x) := \bigl\langle x, \underbrace{\Sigma^{-1}m_i}_{w_i}\bigr\rangle + \underbrace{\log\pi_i - \tfrac12 m_i^T\Sigma^{-1}m_i}_{b_i}. \qquad (2.7)$$
Under (GDEC), in the binary classification case, the Bayes classifier is given by
$$f^*_{\mathrm{LDA}}(x) = \mathbf 1\Bigl\{\bigl\langle x, \underbrace{\Sigma^{-1}(m_1 - m_0)}_{w_{\mathrm{LDA}}}\bigr\rangle + (b_1 - b_0) \ge 0\Bigr\}.$$
Estimation from data: similarly to the previous section, one uses a plug-in approach wherein the unknown population parameters $\Sigma_i, m_i, \pi_i$ are estimated by their frequentist counterparts from a sample $S_n$ and just "plugged into" the formula (2.6) resp. (2.7) (denoting $n_i := |\{j : y_j = i\}|$). Thus, for (GD):
$$\hat\pi_i := \frac{|\{j : y_j = i\}|}{n}; \qquad \hat m_i := \frac{1}{n_i}\sum_{j\,:\,y_j = i} x_j; \qquad \hat\Sigma_i := \frac{1}{n_i - 1}\sum_{j\,:\,y_j = i} (x_j - \hat m_i)(x_j - \hat m_i)^T. \qquad (2.8)$$
Note that the $(n_i - 1)$ in the denominator of $\hat\Sigma_i$ is there to make the estimator unbiased, assuming $n_i\ge 2$; see a classical statistics course. If $n_i$ is used instead, it does not change much.
On the other hand, for (GDEC), one uses the so-called pooled estimator for the common covariance matrix:
$$\hat\Sigma := \frac{1}{n - K}\sum_{i=1}^n (x_i - \hat m_{y_i})(x_i - \hat m_{y_i})^T. \qquad (2.9)$$
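A minimal sketch of plug-in LDA with the pooled estimator (2.9) and the scores (2.7) (names and conventions are ours; y is assumed to take integer values in {0, ..., K-1}):

```python
import numpy as np

def lda_fit(X, y, K):
    """Plug-in LDA: estimates pi_i, m_i and the pooled covariance (2.9),
    then forms the linear scores s_i(x) = <x, w_i> + b_i of (2.7)."""
    n, d = X.shape
    pi = np.array([np.mean(y == i) for i in range(K)])
    m = np.array([X[y == i].mean(axis=0) for i in range(K)])
    R = X - m[y]                       # residuals x_j - m_hat_{y_j}
    Sigma = R.T @ R / (n - K)          # pooled covariance estimator (2.9)
    W = np.linalg.solve(Sigma, m.T).T  # row i is w_i = Sigma^{-1} m_i
    b = np.log(pi) - 0.5 * np.einsum('ij,ij->i', m, W)  # b_i of (2.7)
    return W, b  # predicted class: argmax_i of X_new @ W.T + b
```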
Exercise 2.2. Prove that QDA and LDA using the above estimators from data are covariant under (bijective) linear data transformations. More precisely, let $\tilde x_i := Ax_i$, where $A$ is an invertible linear operator of $\mathbb R^d$. Denote $\hat f$ the classification function constructed by LDA or QDA from $S_n := ((x_i, y_i))_{1\le i\le n}$, and $\tilde f$ the one constructed from $\tilde S_n := ((\tilde x_i, y_i))_{1\le i\le n}$. Then $\hat f(x) = \tilde f(Ax)$ for all $x\in\mathbb R^d$.
Practical tricks. In practice, there are often some modifications to the above canvas for LDA and QDA:
1. While the ERM estimator for the 0-1 loss is infeasible, as argued in the beginning of this chapter, in the binary classification case it is fairly easy to minimize, for a fixed $w$, the empirical classification error as a function of the constant $b$, $b\mapsto \hat E(f_{w,b})$. Namely, for fixed $w$ the problem is reduced to a 1-dimensional one, and one only has to try for $b$ the intermediate values of the sorted set $\{\langle x_j, w\rangle,\ j = 1,\dots,n\}$ and select the one minimizing the empirical classification error. This is often what is done in practice, so that the formula for the linear projection $w_{\mathrm{LDA}} = \Sigma^{-1}(m_1 - m_0)$ is often considered as the most important part of (binary) LDA, while the exact formula for the constant $b$ is unimportant, because it is often replaced by the above ERM minimizer.
2. The most problematic part of LDA (and a fortiori QDA) in practice is the inversion of the estimated covariance matrix. If some of its estimated eigenvalues are close to 0, taking the inverse can be a highly unstable operation and lead to significant estimation errors and erratic behavior, especially if the dimension $d$ is large. For this reason, instead of $\hat\Sigma$ as defined in (2.8), (2.9), it is often suggested to use a "regularized" version of it, such as
$$\tilde\Sigma := (1 - \lambda)\hat\Sigma + \lambda\hat\sigma^2 I_d,$$
or
$$\tilde\Sigma := (1 - \lambda)\hat\Sigma + \lambda D,$$
where $D$ is the diagonal matrix formed with the diagonal entries $\hat\sigma_i^2 := \hat\Sigma_{ii}$ of $\hat\Sigma$ (estimators of the variances of each coordinate), and $\hat\sigma^2 = d^{-1}\sum_{i=1}^d \hat\sigma_i^2$. Here $\lambda\in[0,1]$ is a so-called "shrinkage parameter" that has to be tuned, for instance by cross-validation (see next chapter). A sketch of this regularization appears below.
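A one-line sketch of the first shrinkage variant (hypothetical helper name, for illustration):

```python
import numpy as np

def shrink_covariance(Sigma_hat, lam):
    """(1 - lam) * Sigma_hat + lam * sigma2_hat * I, with sigma2_hat
    the average of the diagonal entries (the estimated mean variance)."""
    d = Sigma_hat.shape[0]
    sigma2_hat = np.trace(Sigma_hat) / d
    return (1 - lam) * Sigma_hat + lam * sigma2_hat * np.eye(d)
```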
2.4 Classification as regression

Another classical approach is to treat the class labels as regression targets: one considers linear score functions $f_w(x) = (\langle w_k, x\rangle)_{k=0,\dots,K-1}$ with values in $\tilde{\mathcal Y} = \mathbb R^K$, parametrized by $w\in\mathbb R^{dK}$ (as explained earlier in this chapter, we disregard the constant parameters by implicitly "augmenting" the data $x$ to $(x, 1)$). Furthermore we consider the quadratic loss $\ell(f_w(x), y) = \|f_w(x) - e_y\|^2$, where $e_k$ denotes the $k$-th canonical basis vector in $\mathbb R^K$. The ERM is then given by
$$\hat w = \operatorname{Arg\,Min}_{w\in\mathbb R^{dK}} \sum_{j=1}^n \bigl\|f_w(x_j) - e_{y_j}\bigr\|^2 = \operatorname{Arg\,Min}_{w\in\mathbb R^{dK}} \sum_{k=0}^{K-1}\sum_{j=1}^n \bigl(\langle w_k, x_j\rangle - \mathbf 1\{y_j = k\}\bigr)^2.$$
Since the above function to minimize separates into $K$ sums, each involving only $w_k$, each sum can be minimized independently, giving rise to
$$\hat w_k = \operatorname{Arg\,Min}_{w\in\mathbb R^d}\sum_{j=1}^n\bigl(\langle x_j, w\rangle - \mathbf 1\{y_j = k\}\bigr)^2 = (\mathbb X^T\mathbb X)^{-1}\mathbb X^T Y^{(k)},$$
using the design-matrix notation introduced below (1.3), and with $Y^{(k)} := (\mathbf 1\{y_1 = k\}, \dots, \mathbf 1\{y_n = k\})\in\mathbb R^n$.
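A sketch of the resulting one-versus-all least squares classifier (our notation; assumes $\mathbb X^T\mathbb X$ invertible):

```python
import numpy as np

def one_vs_all_ls(X, y, K):
    """Classification as regression: K least-squares problems solved jointly,
    column k of W is w_k = (X^T X)^{-1} X^T Y^(k)."""
    Y = np.eye(K)[y]                       # n x K one-hot targets e_{y_j}
    W = np.linalg.solve(X.T @ X, X.T @ Y)  # d x K
    def predict(x):
        return int(np.argmax(x @ W))       # class with maximal linear score
    return predict
```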
Thus in this approach we transform the initial multiclass classification problem into $K$ distinct "one-versus-all" binary classification problems (classify class $k$ against all the others), each using a different linear predictor (and we recall that the final class prediction is the class attaining the maximum of these $K$ linear scores).
Discussion: The unrestricted optimal prediction function for the loss function $\ell(y', y) = \|y' - e_y\|^2$ is given (again by separation into independent sums) by
$$f^*(x) = \bigl(P[Y = k|X = x]\bigr)_{k=0,\dots,K-1},$$
and it can seem a rather questionable idea to try to approximate a function from $\mathbb R^d$ to $[0,1]^K$ by a linear function, which is by definition unbounded. This is the main reason why this quadratic loss approach is almost never used in practice with linear prediction. Much more popular is logistic regression.
Exercise 2.3. In this exercise we use explicitly the affine representation with parameters $(w, b)$ for affine scores. We focus on the binary classification case ($K = 2$).
Establish that the above approach reduces to a single regression problem, since it holds $(\hat w_1, \hat b_1) = (-\hat w_0, 1 - \hat b_0)$.
Prove that the direction of the estimated vector $\hat w_1$ using the quadratic regression approach coincides with the direction found by the (binary) LDA approach. (The constants $b$ found by the two approaches differ, and the above property does not hold any more for $K\ge 3$.)
** 2.5 Linear logistic regression

As we have seen, the quadratic regression approach models the class probability functions $\eta_k(x) := P[Y = k|X = x]$ as linear functions, which is problematic. The idea of logistic regression is to use a suitable transform, the "logit" transform, which is a log-ratio of probabilities. Here we will use the class 0 as a reference and implicitly assume that $\eta_k(x)\ne 0$ for all $k = 0,\dots,K-1$. Define the logistic scores
$$s_k(x) := \log\frac{P[Y = k|X = x]}{P[Y = 0|X = x]}, \quad k = 0,\dots,K-1, \qquad (2.10)$$
and observe that $f^*(x) := \operatorname{Arg\,Max}_{k=0,\dots,K-1} s_k(x)$ is an optimal classifier. The score functions $s_k$ can range anywhere in $\mathbb R$. Linear logistic regression postulates the model
$$s_k(x) = \langle w_k, x\rangle, \quad k = 0,\dots,K-1 \quad (\text{with } w_0 = 0). \qquad (2.11)$$
Proposition 2.3. If the logistic scores defined in (2.10) satisfy (2.11), then conversely it holds
$$\eta_k(x) = P[Y = k|X = x] = \frac{\exp\langle w_k, x\rangle}{\sum_{\ell=0}^{K-1}\exp\langle w_\ell, x\rangle}. \qquad (2.12)$$
The parameters are estimated from the data by maximizing the conditional likelihood:
$$\hat w = \operatorname{Arg\,Max}_w\, L(w),$$
where $L(\cdot)$ is the log-conditional-likelihood based on the sample $S_n$:
$$L(w) = \sum_{j=1}^n \log\eta_{y_j}(x_j) = \sum_{j=1}^n\Bigl(\langle w_{y_j}, x_j\rangle - \log\sum_{\ell=0}^{K-1}\exp\langle w_\ell, x_j\rangle\Bigr). \qquad (2.13)$$
Unlike quadratic regression, the expression (2.13) does not separate into independent sums for each parameter: the optimization problem has to be solved for all parameters (vectors $w_k$) jointly.
In the particular case of binary classification, this reduces to
$$L(w) = \sum_{j=1}^n\bigl(y_j\langle w, x_j\rangle - \log(1 + \exp\langle w, x_j\rangle)\bigr). \qquad (2.14)$$
Under the above form (multi-class or binary), one can notice that $-L(w)$ can be interpreted as an empirical risk with loss function (written in the binary case for simplicity)
$$\ell(y', y) = \log(1 + \exp(y')) - y\,y', \qquad (2.15)$$
which is then applied to linear prediction functions. In this sense, logistic regression can also be interpreted as a proxy loss method.
Practical implementation: The maximum of $L(w)$, or equivalently the minimum of $-L(w)$, does not have a closed-form formula. Still, $w\mapsto -L(w)$ is a convex function, so various methods of convex optimization can be used. If the dimension $d$ is not too large, a common approach is to use Newton-Raphson iterations
$$w^{(t+1)} = w^{(t)} - \underbrace{\Bigl(\frac{d^2L}{dw\,dw^T}\Big|_{w^{(t)}}\Bigr)^{-1}}_{\text{(inverse) Hessian}}\ \underbrace{\frac{dL}{dw}\Big|_{w^{(t)}}}_{\text{gradient}}.$$
After some calculations, these iterations can be put under a form where they are interpreted as repeated "weighted least squares", where the weights are updated along the iterations.
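For concreteness, a minimal sketch of the Newton iterations in the binary case (2.14) (illustrative only; real implementations add safeguards such as step-size control or regularization of the Hessian):

```python
import numpy as np

def logistic_newton(X, y, n_iter=20):
    """Binary logistic regression by Newton-Raphson on L(w) of (2.14), y in {0,1}.
    Each step solves a weighted least-squares system (the IRLS view)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        eta = 1.0 / (1.0 + np.exp(-X @ w))  # current estimate of P[Y=1|x_j]
        grad = X.T @ (y - eta)              # gradient of L at w
        s = eta * (1 - eta)                 # Newton weights
        H = (X * s[:, None]).T @ X          # minus the Hessian of L
        w = w + np.linalg.solve(H, grad)    # Newton ascent step
    return w
```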
Exercise 2.4. For binary classification, it is common to encode the two classes as $\mathcal Y = \{-1, 1\}$, which is more symmetrical. With this convention, verify that the logit loss (2.15) can be rewritten as
$$\ell(y', y) = \log(1 + \exp(-y\,y')).$$
2.6 Hinge-loss based methods: Perceptron and Support Vector Machine

We consider the binary classification setting with $\mathcal Y = \{-1, 1\}$, $\tilde{\mathcal Y} = \mathbb R$ and the proxy loss function called "hinge loss"
$$\ell_\varepsilon(y', y) = (\varepsilon - yy')_+,$$
where $(t)_+ := \max(0, t)$ is the positive part, and $\varepsilon\ge 0$ a fixed parameter. The corresponding empirical risk of a prediction function $f$ is
$$\hat E(f) = \frac1n\sum_{j\,:\,f(x_j)y_j\le\varepsilon}\bigl(\varepsilon - f(x_j)y_j\bigr).$$
The perceptron algorithm is an early use of stochastic gradient descent: to avoid recomputing the gradient of the above empirical risk at each optimization step, it is proposed to select randomly one of the terms in the above sum and to make a step in the direction of its (negative) gradient. This gives rise to the following very simple procedure:
** Perceptron algorithm
1. Initialize w = 0.
2. Look for indices $j$ such that $\langle w, x_j\rangle y_j \le \varepsilon$; if there are none, stop and output $w$.
3. Select one such index $j$ (for instance uniformly at random).
4. Update $w \leftarrow w + y_j x_j$.
5. Go to step 2.
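A direct transcription in NumPy (a sketch under the convention y in {-1, +1}; the stopping cap max_updates is our addition, since the loop need not terminate on non-separable data):

```python
import numpy as np

def perceptron(X, y, eps=0.0, max_updates=10_000):
    """Perceptron: repeatedly pick an index with margin y_j <w, x_j> <= eps
    and update w <- w + y_j x_j (steps 2-5 above)."""
    w = np.zeros(X.shape[1])
    for _ in range(max_updates):
        bad = np.flatnonzero(y * (X @ w) <= eps)
        if bad.size == 0:             # no violated margin: stop (step 2)
            break
        j = np.random.choice(bad)     # random violating index (step 3)
        w = w + y[j] * X[j]           # update (step 4)
    return w
```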
Nowadays, the Perceptron algorithm is not so widely used; more modern approaches such as the linear Support Vector Machine (see below) are preferred. One advantage that remains is its simplicity.
A famous early result studying the convergence of the above algorithm in the simplest case $\varepsilon = 0$, when the training data is linearly separable, is the following.
** Theorem 2.4 (Novikov, 1962). Let $S_n = ((x_i, y_i))_{1\le i\le n}$ be a fixed training sample for binary classification. Assume that $\|x_i\|\le R$ for all $i$, and that there exist $w^*\in\mathbb R^d$ with $\|w^*\| = 1$ and $\gamma > 0$ such that $\langle w^*, x_i\rangle y_i \ge \gamma$ for all $i = 1,\dots,n$ (linear separability with margin $\gamma$).
Then: the Perceptron algorithm (run with $\varepsilon = 0$) finds a vector $w$ separating the data perfectly (i.e. such that $\langle w, x_i\rangle y_i > 0$ for all $i$) after at most $(R/\gamma)^2$ effective update operations (i.e. passes through step 4).
* Support Vector Machine. The (linear) Support Vector Machine (Boser, Guyon, Vapnik 1992) uses the hinge loss with $\varepsilon = 1$. Several efficient algorithms have been developed for the resulting optimization problem, which are preferred to the perceptron; we won't enter into details here. It is nowadays a standard method of machine learning toolboxes.
Exercise 2.5. Justify that in the binary classification case, the "classification as regression" approach (Section 2.4) and the logistic regression approach (Section 2.5) can both be seen as "large margin classification" methods, in the sense that they are based on minimizing a certain margin-based loss (which can be written explicitly).
Exercise 2.6. Prove Thm. 2.4. Proceed as follows: let $\eta > 0$ be a fixed positive number, and consider the quantity $\Delta_k^2 := \|w_k - \eta w^*\|^2$, where $w_k$ denotes the vector $w$ after the $k$-th effective update step. Compare $\Delta_{k+1}^2$ to $\Delta_k^2$, and choose an appropriate value of $\eta$ to establish that it must hold $\Delta_k^2 \le \Delta_0^2 - kR^2$; then conclude. Note: this proof is an elementary example of the use of a well-chosen Lyapunov function decreasing along the iterations (here $\|w_k - \eta w^*\|^2$). It is a powerful technique to study the convergence of iterative optimization schemes.
* 2.7 Regularization

In most linear methods, standard approaches tend to become unstable when the dimension $d$ is high. This has, roughly speaking, to do with the overfitting phenomenon: intuitively, the number of free parameters is $d$, and overfitting becomes more likely when there are more parameters to fit relative to the amount of data available. A common approach to "stabilize" ERM methods is to consider a regularized version
$$\hat w_\lambda \in \operatorname{Arg\,Min}_{w\in\mathbb R^d}\Bigl(\hat E_\ell(f_w) + \lambda\Omega(w)\Bigr),$$
where $\Omega$ is a penalty function, typically a norm or squared norm of $w$.
For linear regression with the squared loss and $\Omega(w) = \|w\|^2$ (so-called ridge regression), it can be checked that the regularized ERM takes the explicit form (compare to (1.3)):
$$\hat w_\lambda = \bigl(\hat\Sigma + \lambda I_d\bigr)^{-1}\hat\gamma, \quad\text{where } \hat\Sigma := \frac1n\sum_{i=1}^n x_i x_i^T, \quad \hat\gamma := \frac1n\sum_{i=1}^n x_i y_i. \qquad (2.16)$$
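A minimal sketch of (2.16) (our helper name):

```python
import numpy as np

def ridge(X, y, lam):
    """Regularized least squares (2.16): (Sigma_hat + lam I)^{-1} gamma_hat."""
    n, d = X.shape
    Sigma_hat = X.T @ X / n
    gamma_hat = X.T @ y / n
    return np.linalg.solve(Sigma_hat + lam * np.eye(d), gamma_hat)
```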
The effect of regularization is therefore intuitively clear in the above formula: adding $\lambda > 0$ on the diagonal of a matrix before inverting it stabilizes the inverse operation, at the expense of adding some bias if $\lambda$ is too large. A similar idea was used for regularized LDA/QDA (Section 2.3).
In practice, the parameter $\lambda > 0$ has to be tuned in order to get a good compromise between too much and too little regularization; we will see how in the coming chapter.
Exercise 2.7. Justify the formula (2.16) for regularized linear regression.
3 Introduction to statistical learning theory (part 2):
elementary bounds
3.1 Controlling the error of a single decision function and the Hold-Out principle
The main goal in this section is to get a theoretically justified control of the true risk $E(\hat f)$ of an estimator by using only observable quantities (i.e. quantities that can be computed from the available data). In statistical terms, this means we are looking for a confidence (upper) bound on $E(\hat f)$ (we are primarily interested in an upper bound, i.e. a guarantee on the generalization error). A confidence bound is a quantity that can be computed from the data, and is indeed larger than the (unknown) quantity of interest with a prescribed probability (called the coverage probability).
As we have discussed in Section 1.3, the empirical risk $\hat E(\hat f)$ is not (in general) a reliable approximation of $E(\hat f)$, because of the overfitting phenomenon: the same data is used to learn $\hat f$ and to evaluate the empirical risk, resulting in a bias (that can possibly be very large).
On the other hand, we have argued that for a fixed decision function $f$, it holds $E[\hat E(f)] = E(f)$; moreover we have by the law of large numbers $\hat E(f) = \frac1n\sum_{i=1}^n Z_i \to E(f)$ in probability (or a.s.) as $n\to\infty$, where $Z_i := \ell(f(X_i), Y_i)$ are i.i.d.
This idea underlies the so-called "Hold-Out" principle: learn $\hat f$ and evaluate the empirical error on different samples. Thus, if $S_n$ and $T_m = (X_i', Y_i')_{1\le i\le m}$ are independent i.i.d. samples, we consider the quantity
$$\hat E^{\mathrm{HO}}(\hat f) = \hat E(\hat f_{S_n}, T_m) = \frac1m\sum_{i=1}^m \ell(\hat f_{S_n}(X_i'), Y_i').$$
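In code, the hold-out estimate is a plain average of losses on the validation sample; a sketch (names are ours):

```python
import numpy as np

def holdout_risk(f_hat, X_val, y_val, loss):
    """Hold-out estimate of the risk of f_hat, computed on a validation
    sample independent of the sample used to learn f_hat."""
    return float(np.mean([loss(f_hat(x), yv) for x, yv in zip(X_val, y_val)]))

# e.g. with the 0-1 loss:
# holdout_risk(f_hat, X_val, y_val, lambda yp, yv: float(yp != yv))
```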
In practice, $S_n$ and $T_m$ can be obtained from an arbitrary separation of a single sample into two parts. $S_n$ is called the learning sample and $T_m$ the validation sample (or "hold-out" sample, since it has been held out from the learning phase). Observe that conditionally to $S_n$, we can consider $\hat f_{S_n}$ as a fixed function, thus $E[\hat E(\hat f_{S_n}, T_m)|S_n] = E(\hat f_{S_n})$; furthermore we can apply the law of large numbers (LLN) conditionally to $S_n$.
This principle justifies that it is already of interest to get a mathematical control of the error of a single fixed decision function $f$. For this, we need more elaborate tools than the LLN, which gives a limiting value but no quantification of the speed of convergence. A traditional (asymptotic) tool for this is the Central Limit Theorem (CLT), which we recall here: if $W_1,\dots,W_n$ are i.i.d. real random variables such that $\sigma^2 = \mathrm{Var}[W_1]$ exists, then it holds
$$\sqrt n\ \frac{\frac1n\sum_{i=1}^n(W_i - E[W_1])}{\sigma} \longrightarrow \mathcal N(0, 1), \quad\text{in distribution, as } n\to\infty. \qquad (3.1)$$
This can be used for deriving asymptotic confidence intervals (in the statistical learning context, remember the quantity of interest is typically $E(f)$). An asymptotic confidence interval or bound has the property that the coverage probability only converges asymptotically to a prescribed value, say $1 - \alpha$.
However, learning theory generally focuses on obtaining nonasymptotic bounds, that is, whose validity (coverage probability) is ensured for any $n$. There are several motivations for this, but we will in particular stress that the CLT (3.1) is not valid in a uniform sense in general. Consider a situation where the $W_i(p)$ are i.i.d. Bernoulli with parameter $p$ (for instance, $W_i(p) = \mathbb{1}\{f(X_i) \neq Y_i\}$ for a classification problem and a given classifier function $f$ with generalization error $p$). If the limit in (3.1) were uniform with respect to the parameter $p$, we could choose an arbitrary sequence $p_n$ depending on $n$ and still have the convergence (3.1). Yet, it is known that for $p_n = c/n$, we have the Poisson limit
$$\sum_{i=1}^n W_i\Big(\frac{c}{n}\Big) \longrightarrow \operatorname{Poisson}(c), \quad \text{in distribution, as } n \longrightarrow \infty. \tag{3.2}$$
Therefore, we deduce (by usual continuity arguments for convergence in distribution, and recalling that the variance of a Bernoulli variable with parameter $p$ is $p(1-p)$) that
$$\sqrt{n}\, \frac{\frac{1}{n}\sum_{i=1}^n \big(W_i(\frac{c}{n}) - \mathbb{E}\big[W_1(\frac{c}{n})\big]\big)}{\sigma} \longrightarrow \frac{\operatorname{Poisson}(c) - c}{\sqrt{c}}, \quad \text{in distribution, as } n \longrightarrow \infty,$$
which is obviously different from the standard CLT Gaussian limit. This situation could in principle happen if we are given a sequence of classifiers $f_n$ whose classification error converges to zero fast enough as $n \to \infty$.
Exercise 3.1. Prove the convergence (3.2) by considering the Laplace transform FS (λ) :=
E[exp(λS)] on both sides. It is known that the pointwise convergence of the Laplace
transform implies the convergence in distribution.
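This non-uniformity is also easy to observe empirically. The following sketch (ours; names and constants are arbitrary) simulates the normalized statistic of (3.1) for $p_n = c/n$ and exhibits the Poisson atom at zero, which a Gaussian limit would forbid:

```python
import numpy as np

rng = np.random.default_rng(1)
c, n, reps = 2.0, 10_000, 5_000
p = c / n

# S = sum of n i.i.d. Bernoulli(c/n): approximately Poisson(c), not Gaussian.
S = rng.binomial(n, p, size=reps)
sigma = np.sqrt(p * (1 - p))
T = np.sqrt(n) * (S / n - p) / sigma   # the statistic of (3.1)

# A Gaussian limit has no atoms; here the Poisson atom at 0 alone
# has probability exp(-c) ~ 0.135, visible in the simulation.
print(np.mean(S == 0), np.exp(-c))
```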
To prove this, we’ll use the following classical result:
** Lemma 3.2. Let $X$ be a real-valued variable, and $F_X(t) := P[X \leq t]$ its cdf. Then
$$\forall \alpha \in [0, 1]: \quad P[F_X(X) \leq \alpha] \leq \alpha,$$
with equality if $F_X$ is a continuous function. We say that $F_X(X)$ is stochastically lower bounded by a $\operatorname{Unif}([0,1])$ variable (and is exactly distributed as uniform on $[0,1]$ if $F_X$ is continuous).
(Proof of the Lemma). Let $s := \sup\{x : F_X(x) \leq \alpha\}$. We consider two cases:
(1) $s$ is a maximum: then $F_X(s) = \alpha$ (since $F_X$ is right-continuous), and $F_X(x) \leq \alpha \Leftrightarrow x \leq s$ by monotonicity of $F_X$. Then
$$P[F_X(X) \leq \alpha] = P[X \leq s] = F_X(s) = \alpha.$$
(2) $s$ is not a maximum: then $F_X(x) \leq \alpha \Leftrightarrow x < s$, and
$$P[F_X(X) \leq \alpha] = P[X < s] = \lim_{x \nearrow s} F_X(x) \leq \alpha.$$
Note that case (2) can only happen if $F_X$ is not (left-)continuous at point $s$, hence if $F_X$ is continuous we are always in case (1).
(Proof of the proposition). Observe that $p > B(u, n, \alpha) \Rightarrow F(p, n, u) < \alpha$, by definition of $B(u, n, \alpha)$. Hence
$$P[p > B(B_{n,p}, n, \alpha)] \leq P[F(p, n, B_{n,p}) < \alpha] \leq \alpha,$$
where the last inequality is from the Lemma.
The above confidence bound, called the Clopper–Pearson bound, is sharp because it is based on the exact inversion of the cdf. However, it is neither easy to understand qualitatively nor to manipulate. We will now consider a method to construct more explicit bounds.
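Numerically, the Clopper–Pearson upper bound is obtained by inverting the binomial cdf; the sketch below uses the classical Beta-quantile expression of this inverse (a standard identity, not stated in the text):

```python
from scipy.stats import beta

def clopper_pearson_upper(k, n, alpha):
    """Upper confidence bound for p given k errors out of n,
    with coverage probability at least 1 - alpha."""
    if k >= n:
        return 1.0
    # Solution in p of P[Binom(n, p) <= k] = alpha.
    return beta.ppf(1.0 - alpha, k + 1, n - k)

print(clopper_pearson_upper(5, 100, 0.05))  # roughly 0.10
```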
*** Proposition 3.3. Let $X$ be a real-valued random variable, and define successively, for $\lambda \in \mathbb{R}$:
$$\Psi_X(\lambda) := \log \mathbb{E}[\exp(\lambda X)]; \qquad \Psi_X^*(t) := \sup_{\lambda \in \mathbb{R}} \big( \lambda t - \Psi_X(\lambda) \big).$$
Then for any $t \geq \mathbb{E}[X]$:
$$P[X \geq t] \leq \exp\big(-\Psi_X^*(t)\big). \tag{3.3}$$
Furthermore, if $X_1, \ldots, X_n$ are i.i.d. with the same distribution as $X$ and $S_n := \sum_{i=1}^n X_i$, then for any $u \geq \mathbb{E}[X]$:
$$P\Big[\frac{1}{n} S_n \geq u\Big] \leq \exp\big(-n \Psi_X^*(u)\big). \tag{3.4}$$
Note: the principle of the proof is so simple and useful that it is as important as the theorem itself.
Now we justify why we can replace the above $\sup_{\lambda \geq 0}$ by a supremum over all $\lambda \in \mathbb{R}$, provided $t \geq \mathbb{E}[X]$. Observe that in this case, for $\lambda < 0$, by Jensen's inequality we have $\Psi_X(\lambda) \geq \lambda\,\mathbb{E}[X]$, hence $\lambda t - \Psi_X(\lambda) \leq \lambda(t - \mathbb{E}[X]) \leq 0$, while the value at $\lambda = 0$ is $0$: negative values of $\lambda$ therefore do not contribute to the supremum.
Figure 1: The function t 7→ D(t, p) and its lower bound t 7→ 2(p − t)2 (left for p = 0.3,
right for p = 0.03); see Exercise 3.2.
* Proposition 3.4. Let $D(t, p) := t \log\big(\frac{t}{p}\big) + (1-t)\log\big(\frac{1-t}{1-p}\big)$, defined for $(t, p) \in [0,1] \times (0,1)$ (with the convention $0 \log 0 = 0$). Let $B_p$ denote a $\operatorname{Binom}(n, p)$ random variable. Then
$$t \geq p \;\Rightarrow\; P\Big[\frac{1}{n} B_p \geq t\Big] \leq \exp(-n D(t, p)); \tag{3.5}$$
$$t \leq p \;\Rightarrow\; P\Big[\frac{1}{n} B_p \leq t\Big] \leq \exp(-n D(t, p)). \tag{3.6}$$
The function $q \mapsto D(q, p)$ is convex with a minimum of $0$ at $q = p$ (see Figure 1), hence it is decreasing on $q \in [0, p]$ and increasing on $q \in [p, 1]$.
Exercise 3.2. Prove the quadratic lower bound D(q, p) ≥ 2(p − q)2 (see Fig. 1).
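One can compare numerically the exact binomial tail, the Chernoff bound (3.5), and the weaker quadratic (Hoeffding-type) bound of Exercise 3.2; a small sketch (ours):

```python
import numpy as np
from scipy.stats import binom

def D(t, p):
    """Binary relative entropy D(t, p) of Proposition 3.4."""
    def xlog(x, y):
        return 0.0 if x == 0 else x * np.log(x / y)
    return xlog(t, p) + xlog(1 - t, 1 - p)

n, p, t = 100, 0.3, 0.4
exact = binom.sf(np.ceil(n * t) - 1, n, p)     # P[B_p >= n t]
chernoff = np.exp(-n * D(t, p))                # bound (3.5)
hoeffding = np.exp(-2 * n * (t - p) ** 2)      # via Exercise 3.2
print(exact, chernoff, hoeffding)              # exact <= chernoff <= hoeffding
```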
Proof of (3.5)–(3.6). We apply (3.4), and thus only need to compute $\Psi_X^*(t)$ where $X$ is a Bernoulli variable of parameter $p$. In this case
$$\Psi_X(\lambda) = \log\big( (1-p) + p \exp(\lambda) \big).$$
We deduce that $\Psi_X'(\lambda) = p\exp(\lambda)/((1-p) + p\exp(\lambda))$; the solution of $\Psi_X'(\lambda) = t \in (0,1)$ is thus $\lambda = \log\frac{(1-p)t}{p(1-t)}$, so that $\Psi_X^*(t) = D(t, p)$ for $t \in (0,1)$ (and $\Psi_X^*(t) = \infty$ otherwise), giving the conclusion.
Let us now prove the implication from (3.5) to (3.7). Recall that the function $t \mapsto D(t, p)$ is nonnegative and (strictly) increasing from $[p, 1]$ onto $[0, \log(1/p)]$. Hence we have the following equality of events, for any $t \geq p$:
$$\{\hat{p} \geq t\} = \{D(\hat{p}, p) \geq D(t, p) \text{ and } \hat{p} \geq p\}.$$
For any $u \in [0, \log(1/p)]$, if we choose $t$ as the unique solution in the interval $[p, 1]$ of $D(t, p) = u$, rewriting (3.5) in the light of the above event equality yields
$$P[\hat{p} \geq p \text{ and } D(\hat{p}, p) \geq u] \leq \exp(-nu).$$
The latter inequality still holds trivially if $u > \log(1/p)$, since in this case the considered event has probability zero (as $D(t, p) \leq \log(1/p)$ for all $t \geq p$). Taking $u = -\log(\alpha)/n$ yields (3.7).
Similarly, (3.6) implies (3.8), and (3.7)–(3.8) imply (3.9) by a union bound.
Proposition 3.6. Let $X$ be a sub-Gaussian variable with parameter $\sigma$. Then for any $\alpha \in (0, 1]$:
$$P\Big[X \geq \mathbb{E}[X] + \sigma\sqrt{2\log(\alpha^{-1})}\Big] \leq \alpha;$$
$$P\Big[X \leq \mathbb{E}[X] - \sigma\sqrt{2\log(\alpha^{-1})}\Big] \leq \alpha;$$
$$P\Big[|X - \mathbb{E}[X]| \geq \sigma\sqrt{2\log(\alpha^{-1})}\Big] \leq 2\alpha.$$
Proof. Without loss of generality we may assume $\mathbb{E}[X] = 0$. We apply Prop. 3.3. We have by assumption $F_X(\lambda) \leq \exp(\lambda^2\sigma^2/2)$, hence it holds $\Psi_X^*(t) \geq t^2/(2\sigma^2)$. Solving $\exp(-\Psi_X^*(t)) = \alpha$ for $t$, we obtain the first inequality. The second inequality is obtained by applying the same argument to $-X$, and the final one is a union bound applied to the events appearing in the first two inequalities.
It is straightforward to check that a sum of independent sub-Gaussian variables $X_i$ with parameters $\sigma_i$ is itself sub-Gaussian, with parameter $\sigma$ such that $\sigma^2 = \sum_i \sigma_i^2$. It follows that
$$P\bigg[\frac{S_n}{n} \geq \mathbb{E}\Big[\frac{S_n}{n}\Big] + \overline{\sigma}\sqrt{\frac{2\log\alpha^{-1}}{n}}\bigg] \leq \alpha; \qquad P\bigg[\frac{S_n}{n} \leq \mathbb{E}\Big[\frac{S_n}{n}\Big] - \overline{\sigma}\sqrt{\frac{2\log\alpha^{-1}}{n}}\bigg] \leq \alpha, \tag{3.10}$$
where $\overline{\sigma}^2 := \frac{1}{n}\sum_{i=1}^n \sigma_i^2$.
The following proposition links boundedness to the sub-Gaussian property and is the
foundation of Hoeffding’s inequality coming next:
*** Proposition 3.8. A random variable with values in $[a, b]$ is sub-Gaussian with parameter $\sigma^2 = (b-a)^2/4$. In particular, a random variable $X$ such that $|X| \leq B$ a.s. is sub-Gaussian with parameter $\sigma^2 = B^2$.
Corollary 3.9 (Hoeffding's inequality). Let $X_1, \ldots, X_n$ be independent random variables taking values in the interval $[a, b]$ with $|b - a| \leq B$. Then for any $t \geq 0$:
$$P\bigg[\frac{1}{n}\sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \geq t\bigg] \leq \exp\Big(-\frac{2nt^2}{B^2}\Big).$$
Proof of Proposition 3.8. Put $B := |b - a|$. Without loss of generality (i.e. possibly replacing $X$ by $X' = X - \frac{a+b}{2}$), we can assume $X$ takes values in the interval $[-B/2, B/2]$. We start by upper bounding $\Psi_X(\lambda)$. We notice that
$$\Psi_X'(\lambda) = \frac{\mathbb{E}[X\exp(\lambda X)]}{\mathbb{E}[\exp(\lambda X)]}, \qquad \Psi_X''(\lambda) = \frac{\mathbb{E}[X^2\exp(\lambda X)]}{\mathbb{E}[\exp(\lambda X)]} - \bigg(\frac{\mathbb{E}[X\exp(\lambda X)]}{\mathbb{E}[\exp(\lambda X)]}\bigg)^2,$$
where the justification of the exchange of expectation (over $X$) and derivation (over $\lambda$) is left as an exercise. We deduce that $\Psi_X(0) = 0$, $\Psi_X'(0) = \mathbb{E}[X]$, and $\Psi_X''(\lambda) \leq B^2/4$ ($\Psi_X''(\lambda)$ is the variance of a variable taking values in an interval of length $B$, hence at most $B^2/4$); and by Taylor's formula with exact remainder:
$$\Psi_X(\lambda) = \Psi_X(0) + \lambda\Psi_X'(0) + \frac{\lambda^2}{2}\Psi_X''(c) \leq \lambda\,\mathbb{E}[X] + \frac{\lambda^2 B^2}{8}.$$
Finally observe that $\Psi_{(X - \mathbb{E}[X])}(\lambda) = \Psi_X(\lambda) - \lambda\,\mathbb{E}[X] \leq \frac{\lambda^2 B^2}{8}$.
Notation: to lighten notation in the sequel, we will use the following shortcut:
$$\varepsilon(\delta, n) := \sqrt{\frac{-\log(\delta)}{2n}}. \tag{3.11}$$
Hoeffding’s inequality allows us to bound with high probability the risk of a single
prediction function from its empirical risk, provided the loss function is bounded:
*** Corollary 3.10. Consider a prediction setting where the loss function $\ell : \widetilde{Y} \times \mathcal{Y} \to [0, B]$ is bounded by $B > 0$. Let $f$ be a fixed prediction function, $\mathcal{E}(f)$ its risk and $\hat{\mathcal{E}}(f)$ its empirical risk with respect to an i.i.d. sample $S_n$ of size $n$. Then for any $\delta \in (0,1)$, each of the following holds with probability at least $1 - \delta$ with respect to the draw of the sample $S_n$:
$$\mathcal{E}(f) \leq \hat{\mathcal{E}}(f) + B\varepsilon(\delta, n); \tag{3.12}$$
$$\hat{\mathcal{E}}(f) \leq \mathcal{E}(f) + B\varepsilon(\delta, n); \tag{3.13}$$
$$\big|\hat{\mathcal{E}}(f) - \mathcal{E}(f)\big| \leq B\varepsilon(\delta/2, n). \tag{3.14}$$
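In code, the bound (3.12) is a one-liner; a sketch for a loss bounded by $B$ (function names are ours):

```python
import numpy as np

def eps(delta, n):
    """The deviation term of (3.11): sqrt(-log(delta) / (2n))."""
    return np.sqrt(-np.log(delta) / (2 * n))

def risk_upper_bound(emp_risk, n, delta, B=1.0):
    """Upper confidence bound (3.12) on the true risk of a *fixed*
    prediction function, valid with probability at least 1 - delta."""
    return emp_risk + B * eps(delta, n)

print(risk_upper_bound(emp_risk=0.12, n=1000, delta=0.05))  # ~0.159
```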
for some appropriate bound function $B$. Notice the difference between (3.15) and (3.16): the second form guarantees that all bounds, simultaneously, are valid with probability at least $1 - \delta$, while there is no such uniformity in the first statement.
We will start with a finite class $\mathcal{F}$.
*** Proposition 3.11. Assume a prediction problem with bounded loss function taking values in $[0, B]$. Let $\mathcal{F}$ be a finite set of prediction functions of cardinality $K$. Assume empirical risks are computed on an i.i.d. sample $S_n$ of size $n$. Then it holds with probability at least $1 - \delta$ with respect to the draw of the sample $S_n$:
$$\forall f \in \mathcal{F}: \quad \mathcal{E}(f) \leq \hat{\mathcal{E}}(f) + B\sqrt{\frac{\log K}{2n}} + B\varepsilon(\delta, n). \tag{3.17}$$
Similarly, it holds with probability at least $1 - \delta$:
$$\forall f \in \mathcal{F}: \quad \hat{\mathcal{E}}(f) \leq \mathcal{E}(f) + B\sqrt{\frac{\log K}{2n}} + B\varepsilon(\delta, n); \tag{3.18}$$
and it holds with probability at least $1 - \delta$:
$$\forall f \in \mathcal{F}: \quad \big|\mathcal{E}(f) - \hat{\mathcal{E}}(f)\big| \leq B\sqrt{\frac{\log 2K}{2n}} + B\varepsilon(\delta, n). \tag{3.19}$$
Proof. This is a direct consequence of the union bound. Denote $A(f, \delta)$ the event where the bound $\mathcal{E}(f) \leq \hat{\mathcal{E}}(f) + B\varepsilon(\delta, n)$ is satisfied. From Corollary 3.10 and (3.15), it holds that $\forall f \in \mathcal{F}: P[A^c(f, \delta)] \leq \delta$, for any $\delta \in (0, 1)$. Therefore
$$P\bigg[\bigcap_{f \in \mathcal{F}} A(f, \delta)\bigg] = 1 - P\bigg[\bigcup_{f \in \mathcal{F}} A^c(f, \delta)\bigg] \geq 1 - \sum_{f \in \mathcal{F}} P\big[A^c(f, \delta)\big] \geq 1 - K\delta.$$
Replacing $\delta$ by $\delta/K$ and using $\varepsilon(\delta/K, n) \leq \sqrt{\log K/(2n)} + \varepsilon(\delta, n)$ yields (3.17). The second and third inequalities are obtained similarly from the corresponding single-function inequalities of Corollary 3.10 and a union bound.
A first consequence of Proposition 3.11 is that we can derive a bound on the risk of an estimator $\hat{f}$ taking values in a finite class $\mathcal{F}$.
*** Corollary 3.12. Consider the same assumptions as in Proposition 3.11. Let $\hat{f}$ be an estimator taking its values in the finite class $\mathcal{F}$ of cardinality $K$. Then with probability at least $1 - \delta$ over the draw of the sample $S_n$, it holds
$$\mathcal{E}(\hat{f}_{S_n}) \leq \hat{\mathcal{E}}(\hat{f}_{S_n}, S_n) + B\sqrt{\frac{\log K}{2n}} + B\varepsilon(\delta, n). \tag{3.20}$$
Proof. We consider the event of (3.17), whose probability is at least $1 - \delta$. If this event is satisfied, we can in particular specialize the inequality, which holds for all $f \in \mathcal{F}$, to the particular choice $\hat{f} \in \mathcal{F}$ (note that it does not matter that $\hat{f}$ depends on the data now). This yields the conclusion.
We now consider more specifically the ERM over the finite class F, and get the following
result as a further consequence of Proposition 3.11.
*** Proposition 3.13. Assume a prediction problem with bounded loss function taking values in $[0, B]$. Let $\mathcal{F} = \{f_1, \ldots, f_K\}$ be a finite set of prediction functions of cardinality $K$. Assume empirical risks are computed on an i.i.d. sample $S_n$ of size $n$. Consider the ERM estimator $\hat{f} = f_{\hat{k}}$, where
$$\hat{k} \in \operatorname*{Arg\,Min}_{1 \leq k \leq K} \hat{\mathcal{E}}(f_k).$$
Then with probability at least $1 - \delta$ it holds
$$\mathcal{E}(f_{\hat{k}}) \leq \min_{1 \leq k \leq K} \mathcal{E}(f_k) + 2B\sqrt{\frac{\log K}{2n}} + 2B\varepsilon(\delta/2, n). \tag{3.21}$$
Observe that inequality (3.21) relates the risk of the ERM to the best risk in the class, $\mathcal{E}_\mathcal{F}^* = \min_{1\leq k\leq K}\mathcal{E}(f_k)$; hence it is an excess risk bound with respect to the class $\mathcal{F}$. It is trivial but useful to rewrite it in the following way, to bound the excess risk with respect to all possible decision functions:
$$\mathcal{E}(f_{\hat{k}}) - \mathcal{E}^* \leq (\mathcal{E}_\mathcal{F}^* - \mathcal{E}^*) + 2B\sqrt{\frac{\log K}{2n}} + 2B\varepsilon(\delta/2, n). \tag{3.22}$$
Observe in particular the role of $n$ (the sample size) and of $K$ (the size of the class $\mathcal{F}$). As expected, a larger “complexity” of the class (in this simple scenario the complexity is simply the cardinality) results in a worse bound for the excess risk in (3.21); however, we can expect that a more complex (i.e. larger) class also has an advantage, in that it will have a smaller value of $(\mathcal{E}_\mathcal{F}^* - \mathcal{E}^*)$ in the bound (3.22).
The good news is that the cardinality $K$ of the class enters only logarithmically in the bound. For instance, taking $K = n^p$ (for any fixed $p$, possibly much larger than 1) still results in a bound on the excess risk behaving as $O(\sqrt{\log(n)/n})$.
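The cost of uniformity over a finite class is only the additive term $B\sqrt{\log K/(2n)}$; a sketch computing the simultaneous bounds (3.17) (names ours):

```python
import numpy as np

def uniform_bound(emp_risks, n, delta, B=1.0):
    """Simultaneous upper bounds (3.17) over a finite class of size K."""
    K = len(emp_risks)
    slack = B * np.sqrt(np.log(K) / (2 * n)) + B * np.sqrt(-np.log(delta) / (2 * n))
    return np.asarray(emp_risks) + slack

# Even K = n**p classes only cost a logarithmic factor:
print(uniform_bound([0.10, 0.12, 0.30], n=1000, delta=0.05))
```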
Proof. We apply (the last inequality of) Proposition 3.11, so that the following event has probability at least $1 - \delta$:
$$\forall f \in \mathcal{F}: \quad \big|\mathcal{E}(f) - \hat{\mathcal{E}}(f)\big| \leq B\sqrt{\frac{\log K}{2n}} + B\varepsilon(\delta/2, n). \tag{3.23}$$
In the remainder of the proof, we assume that the latter event holds, and do not repeat every time “with probability at least $1 - \delta$”. We can in particular apply the bound (3.23) to any specific $f \in \mathcal{F}$, even data-dependent, such as $f_{\hat{k}}$; this is precisely the purpose of having a uniform bound. Thus, we deduce that
$$\mathcal{E}(f_{\hat{k}}) \leq \hat{\mathcal{E}}(f_{\hat{k}}) + B\sqrt{\frac{\log K}{2n}} + B\varepsilon(\delta/2, n). \tag{3.24}$$
Let $k^* \in \operatorname{Arg\,Min}_{1\leq k\leq K} \mathcal{E}(f_k)$ be fixed, and apply also (3.23) to $f_{k^*}$:
$$\hat{\mathcal{E}}(f_{k^*}) \leq \mathcal{E}(f_{k^*}) + B\sqrt{\frac{\log K}{2n}} + B\varepsilon(\delta/2, n). \tag{3.25}$$
Now using (3.24), (3.25), and the definitions of $f_{\hat{k}}$ (minimizing the empirical risk) and $f_{k^*}$ (minimizing the risk), respectively, we obtain
$$\begin{aligned}
\mathcal{E}(f_{\hat{k}}) &\leq \hat{\mathcal{E}}(f_{\hat{k}}) + B\sqrt{\frac{\log K}{2n}} + B\varepsilon(\delta/2, n) \\
&\leq \hat{\mathcal{E}}(f_{k^*}) + B\sqrt{\frac{\log K}{2n}} + B\varepsilon(\delta/2, n) \\
&\leq \mathcal{E}(f_{k^*}) + 2B\sqrt{\frac{\log K}{2n}} + 2B\varepsilon(\delta/2, n) \\
&= \min_{1\leq k\leq K} \mathcal{E}(f_k) + 2B\sqrt{\frac{\log K}{2n}} + 2B\varepsilon(\delta/2, n).
\end{aligned}$$
** Hold-out estimator selection. The situation where we consider a finite set of candi-
date prediction functions is found commonly when we have several estimators fb1 , . . . , fbK ,
amongst which we want to select one. We assume here that these estimators are “black
boxes”, i.e. they may be overfitting the data they are trained on. To select one of these
we again use the idea of Hold-Out presented at the beginning of the chapter:
1. Obtain two independent samples Sn , Tm of i.i.d. data. (Possibly split an existing
sample into two separate sub-samples to do so).
2. Train each of the estimators fb1 , . . . , fbK using the sample Sn , resulting in the decision
functions fb1,Sn , . . . , fbK,Sn .
3. Pick an estimator
$$\hat{k} \in \operatorname*{Arg\,Min}_{1 \leq k \leq K} \hat{\mathcal{E}}(\hat{f}_{k, S_n}, T_m). \tag{3.26}$$
Observe that, conditionally to $S_n$ (i.e. considering the sample $S_n$ as “fixed”), the decision functions $\hat{f}_{1,S_n}, \ldots, \hat{f}_{K,S_n}$ can be considered as fixed too (since they depend only on $S_n$, and not on $T_m$). Therefore, (3.26) can be seen as an ERM method conditional to $S_n$, over the class $\mathcal{F}(S_n) := \{f_k = \hat{f}_{k,S_n},\ k = 1, \ldots, K\}$.
This Hold-Out Selection method is commonly used in the case where we have an estimation method $\hat{f}_\lambda$ depending on a “tuning parameter” $\lambda$ (see for instance the regularization methods introduced in Section 2.7) that we want to choose in order to minimize the risk. In this case we can restrict the values of $\lambda$ to a discretized set $\{\lambda_1, \ldots, \lambda_K\}$ and use the previous strategy with $\hat{f}_k = \hat{f}_{\lambda_k}$.
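A minimal sketch of hold-out selection of a regularization parameter, using the explicit ridge formula (2.16) as the tunable estimator (squared loss; all names ours):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge solution (2.16) for the given regularization parameter."""
    d = X.shape[1]
    Sigma = X.T @ X / len(y)
    gamma = X.T @ y / len(y)
    return np.linalg.solve(Sigma + lam * np.eye(d), gamma)

def holdout_select(X_tr, y_tr, X_val, y_val, lambdas):
    """Pick the lambda minimizing the hold-out error, as in (3.26)."""
    def val_err(lam):
        w = ridge_fit(X_tr, y_tr, lam)
        return np.mean((X_val @ w - y_val) ** 2)
    return min(lambdas, key=val_err)
```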
• Fix an integer $V \geq 2$, and partition the sample $S_n$ into $V$ disjoint sub-samples $S^{(1)}, \ldots, S^{(V)}$ of (approximately) equal size; denote $S^{(-j)}$ the sample $S_n$ deprived of $S^{(j)}$.
• Define
$$\hat{f}_k^{(-j)} = \hat{f}_{k, S^{(-j)}},$$
the $k$-th estimator trained on all the samples except those of $S^{(j)}$.
• Let
$$\hat{k} \in \operatorname*{Arg\,Min}_{1 \leq k \leq K} \sum_{j=1}^V \hat{\mathcal{E}}\big(\hat{f}_k^{(-j)}, S^{(j)}\big). \tag{3.27}$$
Note that (3.27) can be seen as a “multiple” hold-out where the disjoint samples $S^{(-j)}, S^{(j)}$ take the role of training and validation samples in the standard Hold-out, for each $j = 1, \ldots, V$ in turn.
Each term in the sum (3.27) has expectation $\mathcal{E}(\hat{f}_{k,n'})$, where $\hat{f}_{k,n'}$ denotes the estimator $\hat{f}_k$ trained on a sample of size $n' = n\frac{V-1}{V}$, and it is possible to apply the previous arguments based on Hoeffding's inequality to bound it. However, the hope is that the “aggregation” of hold-out errors in (3.27) results in a yet sharper estimate.
A precise theoretical analysis of cross-validation, and of why it may outperform Hold-out in practice, is a delicate subject that we won't touch here. (Notice in particular that we compare the empirical risks of estimators $\hat{f}_k$ trained on samples of size $n(V-1)/V$, while the final estimator is trained on the total sample of size $n$. Therefore, at a minimum, one must make some kind of assumption on the fact that the estimators trained on the full sample and on a sub-sample are “close” in some way.)
Still, cross-validation is the “default” method in practice to pick the tuning parameters
of a learning method. The particular case V = n is called “leave-one-out” (one removes,
in turn, a single data point of the sample, trains each estimator on the remaining (n − 1)
data, and monitors its error on the point that has been left out).
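A minimal sketch of the V-fold selection rule (3.27) (Python; the generic `fit_fns`/`loss` interface is our assumption):

```python
import numpy as np

def cv_select(fit_fns, X, y, V, loss):
    """Pick k minimizing the sum over folds of validation errors, as in (3.27).

    `fit_fns` is a list of training routines; fit_fns[k](X, y) returns a
    prediction function trained on (X, y).
    """
    n = len(y)
    folds = np.array_split(np.arange(n), V)
    scores = np.zeros(len(fit_fns))
    for j, val_idx in enumerate(folds):
        tr_idx = np.setdiff1d(np.arange(n), val_idx)
        for k, fit in enumerate(fit_fns):
            f = fit(X[tr_idx], y[tr_idx])        # trained on S^(-j)
            scores[k] += sum(loss(f(x), yy)
                             for x, yy in zip(X[val_idx], y[val_idx]))
    return int(np.argmin(scores))
```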
** Proposition 3.14. Assume a prediction problem with bounded loss function taking values in $[0, B]$. Let $\mathcal{F}$ be a finite or countably infinite set of prediction functions. Assume that $\pi$ is a set of real weights over $\mathcal{F}$, such that $\pi(f) \in [0, 1]$ and
$$\sum_{f \in \mathcal{F}} \pi(f) \leq 1. \tag{3.28}$$
Assume empirical risks are computed on an i.i.d. sample $S_n$ of size $n$. Then for any $\delta \in (0, 1]$, it holds with probability at least $1 - \delta$ with respect to the draw of the sample $S_n$:
$$\forall f \in \mathcal{F}: \quad \mathcal{E}(f) \leq \hat{\mathcal{E}}(f) + B\sqrt{\frac{\log(\pi(f)^{-1})}{2n}} + B\varepsilon(\delta, n). \tag{3.29}$$
Proof. The proof is a minor variation on the proof of Prop. 3.11. Denote $t(\delta, n) := B\sqrt{\log(\delta^{-1})/(2n)}$, and $A(f, \delta)$ the event where the bound $\mathcal{E}(f) \leq \hat{\mathcal{E}}(f) + t(\pi(f)\delta, n)$ is satisfied. From Corollary 3.10 and (3.15), it holds that $\forall f \in \mathcal{F}: P[A^c(f, \delta)] \leq \pi(f)\delta$, for any $\delta \in (0, 1)$. Therefore
$$P\bigg[\bigcap_{f \in \mathcal{F}} A(f, \delta)\bigg] = 1 - P\bigg[\bigcup_{f \in \mathcal{F}} A^c(f, \delta)\bigg] \geq 1 - \sum_{f \in \mathcal{F}} P\big[A^c(f, \delta)\big] \geq 1 - \delta \sum_{f \in \mathcal{F}} \pi(f) \geq 1 - \delta.$$
Elementary bounding of $t(\pi(f)\delta, n)$ then yields the claim.
Remarks.
• It is an extension of Proposition 3.11, since the latter can be obtained via the uniform weight choice $\pi(f) = 1/K$ for all $f \in \mathcal{F}$ (finite class).
• It is always better to have the constraint (3.28) satisfied with equality (if not, replace $\pi(f)$ by $\pi'(f) = \pi(f)/\big(\sum_{g \in \mathcal{F}} \pi(g)\big)$, which improves (3.29)). With the equality constraint, it is possible to interpret $\pi$ as a discrete probability distribution over $\mathcal{F}$. However, since no probabilistic argument is used, we prefer the term “weights”.
• The choice of $\pi$ is arbitrary, but it must be fixed in advance (i.e. it cannot depend on the data). One can interpret $\log(\pi(f)^{-1})$ as a “complexity” of the function $f$, but it is not an intrinsic notion, since we can choose $\pi$ freely. Rather, when we pick $\pi$ we have to decide which functions are considered “less complex” (= are given a higher weight), and we are limited by the sum-to-one constraint: we can consider many functions as “complex”, but not too many as “simple”.
Regularized ERM. Assume that we work under the same assumptions as Proposition 3.14. Bound (3.29) cannot be used to analyze the ERM on the (countably infinite) class $\mathcal{F}$ in a meaningful way because, if we try to follow the argument of the proof of Prop. 3.13 (we recommend it as an exercise, to see where it fails), we will have a remaining term of the form $\sqrt{-\log \pi(\hat{f}_{ERM})/2n}$ in the bound; and we have no means to control it, as it can be arbitrarily large.
Instead, we can use bound (3.29) to design an estimator which will consist in minimizing that bound (for a fixed choice of $\pi$). Let us therefore define the regularized ERM (based on the regularization induced by $\pi$):
$$\hat{f}^\pi \in \operatorname*{Arg\,Min}_{f \in \mathcal{F}} \bigg( \hat{\mathcal{E}}(f) + B\sqrt{\frac{-\log(\pi(f))}{2n}} \bigg). \tag{3.30}$$
Lemma 3.15. If the loss function $\ell(\cdot, \cdot)$ takes its values in $[0, B]$, the estimator $\hat{f}^\pi$ always exists, i.e. the $\operatorname{Arg\,Min}$ in (3.30) is never empty.
Proof. The property is obvious if $\mathcal{F}$ is finite; so here we assume that $\mathcal{F}$ is countably infinite. Since the weights $\pi(f)$, $f \in \mathcal{F}$, are positive and summable, they can be ranked in nonincreasing order, i.e. there exists an indexation $\mathcal{F} = \{f_i, i \in \mathbb{N}\}$ such that $\pi(f_i)$ forms a nonincreasing sequence. Since $\ell(\cdot, \cdot) \in [0, B]$, it holds $\hat{\mathcal{E}}(f) \in [0, B]$ for all $f \in \mathcal{F}$.
It follows that
$$\hat{\mathcal{E}}(f_0) + B\sqrt{\frac{-\log(\pi(f_0))}{2n}} \leq B\bigg(1 + \sqrt{\frac{-\log(\pi(f_0))}{2n}}\bigg),$$
while $\pi(f_i) \to 0$ as $i \to \infty$, so that the penalty term $B\sqrt{-\log(\pi(f_i))/2n}$ alone exceeds the above value for all $i$ large enough. The infimum of the criterion (3.30) is therefore an infimum over a finite set of candidates, hence a minimum.
Proposition 3.16. Consider the same setting as in Proposition 3.14, and let $\hat{f}^\pi$ be the estimator defined by (3.30). Then for any $\delta \in (0, 1]$, with probability at least $1 - \delta$ it holds
$$\mathcal{E}(\hat{f}^\pi) \leq \min_{f \in \mathcal{F}} \bigg( \mathcal{E}(f) + 2B\sqrt{\frac{-\log(\pi(f))}{2n}} \bigg) + 2B\varepsilon(\delta/2, n). \tag{3.31}$$
Proof. The proof is very similar to that of Prop. 3.13. We apply Prop. 3.14 and obtain that with probability at least $1 - \delta/2$, we have
$$\forall f \in \mathcal{F}: \quad \mathcal{E}(f) \leq \hat{\mathcal{E}}(f) + B\sqrt{\frac{\log(\pi(f)^{-1})}{2n}} + B\varepsilon(\delta/2, n). \tag{3.32}$$
Let $f_*^\pi$ achieve the minimum on the right-hand side of (3.31). Note that the minimum exists by the same type of argument as in Lemma 3.15 (replacing the role of the empirical risk by the true risk). The decision function $f_*^\pi$ is non-random and we can apply the simple Hoeffding inequality to it, so that with probability at least $1 - \delta/2$ it holds
$$\hat{\mathcal{E}}(f_*^\pi) \leq \mathcal{E}(f_*^\pi) + B\varepsilon(\delta/2, n). \tag{3.33}$$
Now using (3.32), (3.33) (holding simultaneously with probability at least $1 - \delta$), and the definitions of $\hat{f}^\pi$ and $f_*^\pi$, respectively, we obtain
$$\begin{aligned}
\mathcal{E}(\hat{f}^\pi) &\leq \hat{\mathcal{E}}(\hat{f}^\pi) + B\sqrt{\frac{-\log\pi(\hat{f}^\pi)}{2n}} + B\varepsilon(\delta/2, n) \\
&\leq \hat{\mathcal{E}}(f_*^\pi) + B\sqrt{\frac{-\log\pi(f_*^\pi)}{2n}} + B\varepsilon(\delta/2, n) \\
&\leq \mathcal{E}(f_*^\pi) + B\sqrt{\frac{-\log\pi(f_*^\pi)}{2n}} + 2B\varepsilon(\delta/2, n) \\
&\leq \min_{f \in \mathcal{F}} \bigg( \mathcal{E}(f) + 2B\sqrt{\frac{-\log\pi(f)}{2n}} \bigg) + 2B\varepsilon(\delta/2, n).
\end{aligned}$$
Corollary 3.17. Consider the same setting as in Propositions 3.14 and 3.16, and let $\pi$ be a set of strictly positive real weights on the countable family $\mathcal{F}$ satisfying (3.28). Let $\hat{f}_n^\pi$ be the estimator defined by (3.30) from an i.i.d. sample $S_n$ of size $n$. Then the sequence $\hat{f}_n^\pi$ is consistent in probability over $\mathcal{F}$, that is, $\mathcal{E}(\hat{f}_n^\pi)$ converges to $\mathcal{E}_\mathcal{F}^*$ in probability as $n \to \infty$.
Proof. Let $t > 0$ be fixed. Let $f_t^* \in \mathcal{F}$ be such that $\mathcal{E}(f_t^*) \leq \mathcal{E}_\mathcal{F}^* + \frac{t}{2}$. Let us now fix any $\delta \in (0, 1)$. Then the event of (3.31) (holding with probability at least $1 - \delta$) implies that
$$\mathcal{E}_\mathcal{F}^* \leq \mathcal{E}(\hat{f}_n^\pi) \leq \mathcal{E}(f_t^*) + 2B\sqrt{\frac{-\log(\pi(f_t^*))}{2n}} + 2B\varepsilon(\delta/2, n) \leq \mathcal{E}_\mathcal{F}^* + \frac{t}{2} + 2B\sqrt{\frac{-\log(\pi(f_t^*))}{2n}} + 2B\varepsilon(\delta/2, n).$$
The two last terms in the above bound converge to zero as $n \to \infty$, so for any $\delta$, and any $n$ large enough, it holds that
$$P\Big[\mathcal{E}(\hat{f}_n^\pi) - \mathcal{E}_\mathcal{F}^* > t\Big] \leq \delta;$$
in other words $P\big[\mathcal{E}(\hat{f}_n^\pi) - \mathcal{E}_\mathcal{F}^* > t\big] \to 0$ as $n \to \infty$. This is true for any $t > 0$, hence the conclusion.
$j \in [\![K]\!]$. Since classification functions can only take two values, we have $|\mathcal{F}_K| = 2^K$. Furthermore, we define a family of weights on $\mathcal{F} := \bigcup_{K \geq 1} \mathcal{F}_K$ via
$$\pi(f) = \frac{1}{cK^2}\, 2^{-K}, \quad \text{if } f \in \mathcal{F}_K,$$
for $c = \pi^2/6$. (It is possible that the same decision function belongs to $\mathcal{F}_K$ for several $K$s, in which case we just take the smallest $K$ having this property.) It is easy to check that these weights satisfy (3.28). Applying Proposition 3.16, the regularized ERM estimator $\hat{f}^\pi$ using these weights satisfies, with probability at least $1 - \delta$:
$$\mathcal{E}(\hat{f}^\pi) \leq \min_{K \geq 1} \bigg( \min_{f \in \mathcal{F}_K} \mathcal{E}(f) + 2B\sqrt{\frac{K\log 2}{2n}} + 2B\sqrt{\frac{\log c + 2\log K}{2n}} \bigg) + 2B\varepsilon(\delta/2, n).$$
Observe that if we choose a fixed $K$ in advance and consider the ERM on $\mathcal{F}_K$, applying Prop. 3.13 yields a bound similar to the above (without the third term), albeit for this fixed $K$ only. As we have seen in Proposition 1.13 for piecewise-constant functions in a slightly different context, choosing $K$ of the right order is important in order to obtain fast convergence rates. By contrast, the above inequality for the regularized ERM tells us that we get a bound as good as what we would obtain for the “best” choice of $K$ for the ERM on $\mathcal{F}_K$ – the price to pay for this adaptivity is the additional third term in the bound, which is modest, since it is negligible with respect to the second term for large $K$. The above type of bound is sometimes called an oracle inequality: the risk of the regularized estimator is (almost) as good as if an “oracle” had told us in advance which $\mathcal{F}_K$ to choose to minimize the corresponding ERM risk.
• A complete binary tree structure $T$; complete means that each node either has 2 daughter-nodes (it is then called an interior node) or has no descendants (it is then called a leaf). Let $\mathring{T}$ denote the set of interior nodes, and $\partial T$ the set of leaves.
• For each interior node $s \in \mathring{T}$, a question $q_s$, which is a function $\mathcal{X} \to \{0, 1\}$. We will assume that $q_s \in \mathcal{Q}$, where $\mathcal{Q}$ is a finite library of questions.
1. Let $s =$ the root node of the tree $T$.
2. If $s$ is a leaf, return the label attached to $s$.
3. Otherwise, evaluate the question $q_s(x) \in \{0, 1\}$.
4. Let $s$ be the daughter node designated by the value of $q_s(x)$.
5. Return to step 2.
There are several classical methods available to build a decision tree function $\hat{f} = f_{\hat{T}}$, corresponding to a certain triplet $\hat{T} \in \mathcal{T}(\mathcal{Q})$, from a data sample $S_n$. We will not present them in detail here. If we consider the output $\hat{f}_{S_n}$ of such a method, we would like to be able to obtain a confidence bound on its risk without knowing the internal details of the method. Here we will not use a hold-out sample approach, but rather use the same sample $S_n$; this is why we resort to finding confidence bounds that are valid for all functions of $\mathcal{T}$ with high probability (following the same approach as in Proposition 3.16 and Corollary 3.12 for finite classes).
Proposition 3.18. Assume a prediction setting with a loss function bounded by $B$. Let $\mathcal{Q}$ be a finite set of questions, and assume $\widetilde{Y}$ is a finite set. Let $S_n$ be an i.i.d. training sample of size $n$, which will be used to compute all empirical risks. Then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, it holds:
$$\forall T \in \mathcal{T}: \quad \mathcal{E}(f_T) \leq \hat{\mathcal{E}}(f_T) + B\sqrt{\frac{C|T|}{2n}} + B\sqrt{\frac{\log \delta^{-1}}{2n}}. \tag{3.34}$$
Proof. We want to apply Prop. 3.14, and for this we need to choose a weight function $\pi$ over $\mathcal{T}$ (we consider directly a weight function over $\mathcal{T}$, which of course induces a weight function on $\{f_T, T \in \mathcal{T}\}$). Although we insisted that there is no probabilistic argument linked to $\pi$, it is easier to describe $\pi$ as a probability distribution. We construct $\pi$ as a probability distribution in the following way:
• The marginal distribution of the binary tree structure $T$ is taken to be a Galton–Watson process where each node has a probability $\rho$ of having 0 descendants, and $(1 - \rho)$ of having 2 daughters (with $\rho \in (0, \frac{1}{2}]$).
4 The Nearest Neighbors method
In this section we introduce and propose an elementary analysis of nearest-neighbors (NN) methods, which are part of the classical toolbox of prediction methods. In a nutshell, the output of the NN method from a training sample $S_n$ is a decision function $f$ such that the prediction $f(x)$ at a point $x$ is obtained by looking up the neighbors of $x$ in the training set, and taking a decision based on a local average or majority vote among the labels associated to these neighbors.
i.e. $\hat{f}_{k\text{-NN}}$ predicts the majority class among the $k$ neighbors of $x$ in the sample (in case of ties, once again one can decide to predict, among the tied classes, the one with the smallest index).
The definition of the k-NN decision function is simple and natural, and there exists a
vast literature on the subject going back to the 1960s. In this chapter, we will concentrate
on the following questions concerning the asymptotic behavior of the k-NN prediction
(more precisely in the case of binary classification):
• what is the behavior of the risk E(fbk−NN ), as the sample size n grows but k is fixed?
• what is the behavior of the risk E(fbk−NN ), as the sample size n grows and k(n) is
allowed to grow with n?
Note that it is natural to let k(n) grow with n, since intuitively local averages are more
accurate if we use more data (but not too many since we want to remain in a neighborhood
of the prediction point).
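For concreteness, a minimal sketch of the k-NN classification rule just described (binary labels in $\{0, 1\}$, Euclidean distance; tie-breaking conventions as in the text):

```python
import numpy as np

def knn_predict(x, X_train, y_train, k):
    """Majority vote among the k nearest neighbors of x (labels in {0, 1}).

    Distance ties are broken by smallest sample index (stable sort), and a
    vote tie predicts class 0, the smallest class index, as in the text."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dists, kind="stable")[:k]
    return int(2 * y_train[nn].sum() > k)
```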
** Theorem 4.1. In the binary classification setting, assuming (Cont) holds, we have:
1. $\forall x \in \operatorname{Supp}(P_X)$: $\mathbb{E}_{S_n}\big[\mathbb{E}\big[\ell(\hat{f}^{(n)}_{k\text{-NN}}(x), Y)\,\big|\,X = x\big]\big] \to \alpha_k(\eta(x))$, as $n \to \infty$.
2. $\mathbb{E}_{S_n}\big[\mathcal{E}(\hat{f}^{(n)}_{k\text{-NN}})\big] \to \mathcal{E}^*_{k\text{-NN}} := \mathbb{E}[\alpha_k(\eta(X))]$, as $n \to \infty$,
where the superscript $(n)$ is a reminder of the sample size used to construct $\hat{f}^{(n)}_{k\text{-NN}}$, and
$$\alpha_k(\eta) := \eta\, Q_k(\eta) + (1 - \eta)(1 - Q_k(\eta)). \tag{4.1}$$
Before proving this theorem, we give a general intuition of why it holds (the proof will
consist in making this intuition mathematically rigorous):
• As n → ∞, the number of sample points in a given neighborhood of x will grow
to infinity (provided that x is in the support of PX ). Therefore, the distance of the
kth-NN of x to x should tend to zero.
• Conditionally to the sample points $(X_i)_{1\leq i\leq n}$, the labels $Y^{(1)}(x), \ldots, Y^{(k)}(x)$ of the $k$ neighbors of $x$ are distributed as independent Bernoulli variables with respective parameters $\eta(X^{(1)}(x)), \ldots, \eta(X^{(k)}(x))$. However, since the neighbors of $x$ tend to $x$ itself and $\eta$ is continuous, these labels will essentially behave as i.i.d. $\operatorname{Ber}(\eta(x))$ variables.
• The decision function $\hat{f}_{k\text{-NN}}(x)$ will predict the majority class among the $k$ neighbors. By the previous point, we expect the number of neighbors of class 1 to behave as a $\operatorname{Binom}(k, \eta(x))$ variable, so $\hat{f}$ will behave as a randomized decision predicting 0 with probability $Q_k(\eta(x))$ and 1 with probability $1 - Q_k(\eta(x))$, giving rise to the average error $\alpha_k(\eta(x))$ at point $x$.
We now proceed to the proof. The following lemmas will roughly speaking correspond to
the successive points in the above intuition.
Proof. Assume $x \in \operatorname{Supp}(P_X)$; by definition this means that for any $\varepsilon > 0$, $p(x, \varepsilon) := P[X \in B(x, \varepsilon)] > 0$, where $B(x, \varepsilon)$ is the open ball of center $x$ and radius $\varepsilon$.
Let us fix $\varepsilon > 0$. Denote $N_{\varepsilon,n}(x) := \#\{i \in \{1, \ldots, n\} : X_i \in B(x, \varepsilon)\}$; by the law of large numbers, it holds
$$\frac{N_{\varepsilon,n}(x)}{n} = \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{X_i \in B(x, \varepsilon)\} \to p(x, \varepsilon) > 0,$$
hence
$$P\big[d(X_n^{(k)}(x), x) \geq \varepsilon\big] \leq P\bigg[\frac{N_{\varepsilon,n}}{n} < \frac{k}{n}\bigg].$$
Since $\frac{N_{\varepsilon,n}}{n} \to p(x, \varepsilon) > 0$ in probability, but $\frac{k}{n} \to 0$, in particular for $n$ big enough $\frac{k}{n} \leq p(x, \varepsilon)/2 < p(x, \varepsilon)$, and $P\big[\frac{N_{\varepsilon,n}}{n} < \frac{k}{n}\big] \leq P\big[\frac{N_{\varepsilon,n}}{n} < p(x, \varepsilon) - p(x, \varepsilon)/2\big] \to 0$ by definition of convergence in probability. We have proved that $d(X_n^{(k)}(x), x) \to 0$ in probability.
However, this also implies convergence a.s., by a monotonicity argument: observe that (for any fixed $\omega \in \Omega$ determining the sequence $(X_i(\omega))_{i\geq 1}$), $d(X_n^{(k)}(x), x)(\omega)$ is a nonincreasing sequence in $n$ (adding more points to the set $\{X_1, \ldots, X_n\}$ can only make the $k$-th neighbor of $x$ possibly closer to $x$). It therefore has a (nonnegative) limit $L(\omega)$, which can only be $0$ (a.s.); indeed, for any $t > 0$, by monotonicity $\lim_{n\to\infty} \mathbb{1}\big\{d(X_n^{(k)}(x), x) \geq t\big\} = \mathbb{1}\{L \geq t\}$, hence by dominated convergence
$$P[L \geq t] = \lim_{n\to\infty} P\big[d(X_n^{(k)}(x), x) \geq t\big] = 0,$$
so $L = 0$ a.s.
For point 2, observe that for any $\varepsilon > 0$:
$$P_{X,(X_i)_{1\leq i\leq n}}\big[d(X_n^{(k)}(X), X) > \varepsilon\big] = \mathbb{E}_X\Big[P\big[d(X_n^{(k)}(X), X) > \varepsilon \,\big|\, X\big]\Big].$$
In point 1 we have proved that $F_\varepsilon^{(n)}(x) := P\big[d(X_n^{(k)}(X), X) > \varepsilon \,\big|\, X = x\big] = P\big[d(X_n^{(k)}(x), x) > \varepsilon\big]$ converges to $0$ for $P_X$-almost all $x$, since in a Polish space it holds $P[X \in \operatorname{Supp}(P_X)] = 1$. Therefore, since it is bounded by $1$, its integral (the above expression) converges to $0$ by dominated convergence. This means that $d(X_n^{(k)}(X), X) \to 0$ as $n \to \infty$, in probability, which implies convergence a.s. by the same monotonicity argument as in point 1.
The following lemma, based on a coupling argument, will allow us to formalize that the labels of the close neighbors of a point $x$ behave almost as if they were drawn with the Bernoulli parameter $\eta(x)$.
* Lemma 4.3. Let $k$ be an integer and $\Psi$ a function $\{0,1\}^k \to [0, 1]$. Assume $Y_1, \ldots, Y_k$ are independent Bernoulli random variables with parameters $\eta_1, \ldots, \eta_k$, and $Y_1', \ldots, Y_k'$ are i.i.d. Bernoulli random variables with parameter $\eta$. Then it holds
$$\big|\mathbb{E}[\Psi(Y_1, \ldots, Y_k)] - \mathbb{E}[\Psi(Y_1', \ldots, Y_k')]\big| \leq \sum_{i=1}^k |\eta_i - \eta|.$$
(Note: equivalently, the above is a bound on the total variation distance between the distribution of $(Y_i)_{1\leq i\leq k}$ and that of $(Y_i')_{1\leq i\leq k}$.)
Proof. Since the expectations of $\Psi(Y_1, \ldots, Y_k)$ and $\Psi(Y_1', \ldots, Y_k')$ only depend on the distributions of the one and the other $k$-uple of random variables, it is possible to construct a coupling between these two $k$-uples. More precisely, we construct two $k$-uples of random variables $(\widetilde{Y}_i)_{1\leq i\leq k}$ and $(\widetilde{Y}_i')_{1\leq i\leq k}$ so that $(\widetilde{Y}_i)_{1\leq i\leq k}$ has the same distribution as $(Y_i)_{1\leq i\leq k}$, i.e. $\bigotimes_{i=1}^k \operatorname{Ber}(\eta_i)$, and similarly for $(\widetilde{Y}_i')_{1\leq i\leq k}$, but so that $\widetilde{Y}_i = \widetilde{Y}_i'$ as “often” as possible.
The construction is as follows: let $U_1, \ldots, U_k$ be i.i.d. $\operatorname{Unif}[0,1]$, and define $\widetilde{Y}_i = \mathbb{1}\{U_i \leq \eta_i\}$, $\widetilde{Y}_i' = \mathbb{1}\{U_i \leq \eta\}$. Then it can be checked easily that the marginal distributions are as announced, and that $P[\widetilde{Y}_i \neq \widetilde{Y}_i'] = |\eta_i - \eta|$.
Therefore, since $\Psi$ takes values in $[0, 1]$:
$$\begin{aligned}
\big|\mathbb{E}[\Psi(Y_1, \ldots, Y_k)] - \mathbb{E}[\Psi(Y_1', \ldots, Y_k')]\big| &= \Big|\mathbb{E}\big[\Psi(\widetilde{Y}_1, \ldots, \widetilde{Y}_k)\big] - \mathbb{E}\big[\Psi(\widetilde{Y}_1', \ldots, \widetilde{Y}_k')\big]\Big| \\
&\leq \mathbb{E}\Big[\big|\Psi(\widetilde{Y}_1, \ldots, \widetilde{Y}_k) - \Psi(\widetilde{Y}_1', \ldots, \widetilde{Y}_k')\big|\Big] \\
&\leq P\big[\exists i \in \{1, \ldots, k\}: \widetilde{Y}_i \neq \widetilde{Y}_i'\big] \\
&\leq \sum_{i=1}^k P\big[\widetilde{Y}_i \neq \widetilde{Y}_i'\big] \\
&\leq \sum_{i=1}^k |\eta - \eta_i|.
\end{aligned}$$
$$\leq \sum_{i=1}^k \big|\eta(X^{(i)}(x)) - \eta(x)\big|.$$
Now, assuming $x \in \operatorname{Supp}(P_X)$, by Lemma 4.2 and continuity of $\eta$: since $d(X^{(i)}(x), x) \to 0$ in probability for $i = 1, \ldots, k$ as $n \to \infty$, and since $\eta$ is bounded by $1$, the expectation of the above right-hand side with respect to $X_1, \ldots, X_n$ converges to $0$, which in view of (4.2) establishes the first part of the theorem. The second part follows immediately by integration over $X \sim P_X$ and dominated convergence.
We now examine more closely the function $\alpha_k$ from (4.1), which determines the asymptotic error, for fixed $k$, of the $k$-NN method. We look at the cases $k = 1, 3, 5$ (note that it is reasonable to choose $k$ odd, to avoid ties). In each case, we examine the behavior of $\alpha_k(\eta)$ when $\eta$ is close to $0$, and compare it to the “local” optimal risk, which is $\min(\eta, 1 - \eta) = \eta$. By symmetry (since $\alpha_k(\eta) = \alpha_k(1 - \eta)$ by definition, for $k$ odd), the same behavior holds as a function of $(1 - \eta)$ if $\eta$ is close to $1$.
This behavior is relevant in a situation where the classification problem is well-separable, i.e. $\eta(x)$ is always close to $0$ or $1$.
• $k = 1$: then $\alpha_1(\eta) = 2\eta(1 - \eta)$. Remember that the Bayes optimal error in classification is given by $\mathcal{E}^* = \mathbb{E}[\min(\eta(X), 1 - \eta(X))]$. If we denote $a(x) := \min(\eta(x), 1 - \eta(x))$, it therefore holds
$$\mathcal{E}^*_{1\text{-NN}} = \mathbb{E}[2a(X)(1 - a(X))] \leq 2\,\mathbb{E}[a(X)]\,\mathbb{E}[1 - a(X)] = 2\mathcal{E}^*(1 - \mathcal{E}^*),$$
where the above inequality is Jensen's, since $x \mapsto x(1 - x)$ is concave on $[0, 1]$. We see that when $\mathcal{E}^*$ is close to $0$, the risk of 1-NN is bounded by twice the Bayes risk. This factor 2 is unavoidable, since in a well-separable situation, if $\eta$ is close to $0$ we have
$$\alpha_1(\eta) \sim 2\eta,$$
which is twice the (local) optimal risk.
$$\alpha_5(\eta) - \eta \sim 10\eta^3, \quad \text{as } \eta \to 0.$$
We see that in well-separable situations, when for all $x$, $\eta(x) \in [0, p] \cup [1 - p, 1]$ with $p$ close to $0$, we have $\mathcal{E}^*_{1\text{-NN}} \approx 2\mathcal{E}^*$ while $\mathcal{E}^*_{3\text{-NN}} \approx \mathcal{E}^*$. We see clearly the advantage of taking more neighbors for improving the asymptotic error. The 5-NN method asymptotically gives an even better approximation of the optimal Bayes error; still, the 3-NN method may be sufficient in such a well-separable situation.
** Theorem 4.4. We consider a binary classification problem (with the usual 0–1 loss) under either of the following assumptions:
(A) $\mathcal{X}$ is a Polish space, and the function $\eta(x) = P[Y = 1|X = x]$ is continuous.
(B) $\mathcal{X} = \mathbb{R}^d$ for some finite dimension $d$, endowed with the Euclidean distance, and there is no assumption on $\eta$.
Then, if $k(n)$ is such that $k(n) \to \infty$ and $k(n)/n \to 0$ as $n \to \infty$, it holds either under setting (A) or (B) that
$$\mathcal{E}\big(\hat{f}^{(n)}_{k(n)\text{-NN}}\big) \to \mathcal{E}^*, \quad \text{in probability as } n \to \infty.$$
Notice in particular the remarkable fact that in setting (B) ($\mathbb{R}^d$ endowed with the Euclidean distance), the $k$-NN method with suitably increasing $k(n)$ is universally consistent, i.e. its risk converges asymptotically towards the Bayes risk without any assumption on the generating distribution $P_{XY}$.
We start by revisiting Lemma 4.2.
Proof. The proof is actually the same as for Lemma 4.2. To wit, for any $\varepsilon > 0$ and fixed $k$ and $n$, we have seen that
$$P\big[d(X_n^{(k)}(x), x) \geq \varepsilon\big] \leq P\bigg[\frac{N_{\varepsilon,n}}{n} < \frac{k}{n}\bigg],$$
Observe that $\hat{f}_{k\text{-NN}}$ can be seen as a plug-in classifier, $\hat{f}_{k\text{-NN}}(x) := \mathbb{1}\big\{\hat{\eta}(x) > \frac{1}{2}\big\}$, where $\hat{\eta}(x) := \frac{1}{k(n)}\sum_{i=1}^{k(n)} Y^{(i)}(x)$ is the average of the neighbor labels. From the classical comparison bound for plug-in classifiers, the excess risk is controlled by $2\,\mathbb{E}_X[|\eta(X) - \hat{\eta}(X)|]$, so it suffices to show that this quantity tends to $0$.
Now define
$$\widetilde{\eta}(x) := \frac{1}{k(n)}\sum_{i=1}^{k(n)} \eta(X^{(i)}(x)),$$
and bound
$$\mathbb{E}_X[|\eta(X) - \hat{\eta}(X)|] \leq \underbrace{\mathbb{E}_X[|\eta(X) - \widetilde{\eta}(X)|]}_{(I)} + \underbrace{\mathbb{E}_X[|\widetilde{\eta}(X) - \hat{\eta}(X)|]}_{(II)}.$$
Furthermore, by integration,
$$\mathbb{E}_{S_n}[(II)] = \mathbb{E}_{X,S_n}[|\widetilde{\eta}(X) - \hat{\eta}(X)|] \leq \mathbb{E}_{X,S_n}\big[(\widetilde{\eta}(X) - \hat{\eta}(X))^2\big]^{\frac{1}{2}} \leq \frac{1}{2\sqrt{k(n)}} \xrightarrow{n\to\infty} 0.$$
Hence term (II) tends to $0$ in expectation over $S_n$, and therefore also in probability, since it is nonnegative.
Estimating Term (I), setting (A). We have for any $x \in \mathcal{X}$:
$$\eta(x) - \widetilde{\eta}(x) = \frac{1}{k(n)}\sum_{i=1}^{k(n)} \big(\eta(x) - \eta(X^{(i)}(x))\big).$$
Since we assumed under setting (A) that $\eta$ is continuous, for any fixed $\varepsilon > 0$ there exists $\delta > 0$ such that for any $x'$, $d(x, x') < \delta$ implies $|\eta(x) - \eta(x')| \leq \varepsilon$. Therefore, since $\eta$ and $\widetilde{\eta}$ are bounded by $1$:
$$|\eta(x) - \widetilde{\eta}(x)| \leq \varepsilon + \mathbb{1}\big\{d(X^{(k(n))}(x), x) > \delta\big\};$$
hence
$$\mathbb{E}_{S_n}[(I)] = \mathbb{E}_{S_n,X}[|\eta(X) - \widetilde{\eta}(X)|] \leq \varepsilon + P\big[d(X^{(k(n))}(X), X) > \delta\big],$$
and from Lemma 4.5 we know that the second term tends to $0$. Since this holds for any $\varepsilon > 0$, the term (I) converges to $0$ in expectation over $S_n$, and therefore also in probability, since it is nonnegative. This concludes the proof for setting (A).
Estimating Term (I), setting (B). In this setting $\mathcal{X} = \mathbb{R}^d$, but $\eta$ is not necessarily continuous. However, we know that the set $\mathcal{C}(\mathbb{R}^d)$ of continuous functions on $\mathbb{R}^d$ is dense in $L^1(\mathbb{R}^d, P_X)$. Let $\varepsilon > 0$ be fixed, and pick $\eta_\varepsilon$ continuous such that $\mathbb{E}[|\eta(X) - \eta_\varepsilon(X)|] \leq \varepsilon$. We define
$$\widetilde{\eta}_\varepsilon(x) := \frac{1}{k(n)}\sum_{i=1}^{k(n)} \eta_\varepsilon(X^{(i)}(x)),$$
and write
$$(I) = \mathbb{E}_X[|\eta(X) - \widetilde{\eta}(X)|] \leq \underbrace{\mathbb{E}_X[|\eta(X) - \eta_\varepsilon(X)|]}_{\leq\, \varepsilon} + \underbrace{\mathbb{E}_X[|\eta_\varepsilon(X) - \widetilde{\eta}_\varepsilon(X)|]}_{(Ia)} + \underbrace{\mathbb{E}_X[|\widetilde{\eta}_\varepsilon(X) - \widetilde{\eta}(X)|]}_{(Ib)}.$$
The term (Ia) can be shown to converge to zero in expectation over $S_n$ (hence in probability) with exactly the same argument as used in setting (A), since $\eta_\varepsilon$ is continuous.
Finally, in order to estimate term (Ib), we will use a clever geometrical lemma, stated precisely below, which gives the following:
$$\mathbb{E}_{S_n}[(Ib)] = \mathbb{E}_{S_n,X}[|\widetilde{\eta}_\varepsilon(X) - \widetilde{\eta}(X)|] \leq \mathbb{E}_{S_n,X}\bigg[\frac{1}{k(n)}\sum_{i=1}^{k(n)} \big|\eta_\varepsilon(X^{(i)}(X)) - \eta(X^{(i)}(X))\big|\bigg] \leq \gamma_d\, \mathbb{E}[|\eta_\varepsilon(X) - \eta(X)|] \leq \gamma_d\, \varepsilon, \tag{4.3}$$
where $\gamma_d$ is a factor that only depends on $d$. Overall, $\limsup_n \mathbb{E}_{S_n}[(I)] \leq (1 + \gamma_d)\varepsilon$ for every $\varepsilon > 0$, hence (I) converges to $0$ in expectation over $S_n$; it therefore also converges in probability, and the proof is done.
The following lemma is used to prove inequality (4.3):
* Lemma 4.6 (Stone's lemma). Let $\mathcal{X} = \mathbb{R}^d$ endowed with the Euclidean distance; let $X, X_1, \ldots, X_n$ be i.i.d. variables with distribution $P$ on $\mathcal{X}$, and $f \in L^1_+(\mathbb{R}^d, P)$ a nonnegative integrable function. Then there exists a factor $\gamma_d$, only depending on $d$, such that for any integer $k > 0$:
$$\mathbb{E}\bigg[\sum_{i=1}^k f(X^{(i)}(X))\bigg] \leq k\gamma_d\, \mathbb{E}[f(X)]. \tag{4.4}$$
Proof. Assume $k > 0$ is a fixed integer. Let us denote $X_0 = X$, and for $i = 0, \ldots, n$, introduce the notation $\operatorname{NN}_k(X_i)$ for the set of indices $j \in \{0, \ldots, n\} \setminus \{i\}$ such that $X_j$ is among the $k$ nearest neighbors of $X_i$ within $(X_j)_{j \neq i}$. Then we have
$$\begin{aligned}
\mathbb{E}\bigg[\sum_{i=1}^k f(X^{(i)}(X_0))\bigg] &= \mathbb{E}\bigg[\sum_{i=1}^n f(X_i)\,\mathbb{1}\{i \in \operatorname{NN}_k(X_0)\}\bigg] \\
&= \sum_{i=1}^n \mathbb{E}\big[f(X_i)\,\mathbb{1}\{i \in \operatorname{NN}_k(X_0)\}\big] \\
&= \sum_{i=1}^n \mathbb{E}\big[f(X_0)\,\mathbb{1}\{0 \in \operatorname{NN}_k(X_i)\}\big] \qquad (*) \\
&= \mathbb{E}\bigg[f(X_0)\sum_{i=1}^n \mathbb{1}\{0 \in \operatorname{NN}_k(X_i)\}\bigg] \\
&= \mathbb{E}\big[f(X_0)\,\#\{i : 0 \in \operatorname{NN}_k(X_i)\}\big]. \qquad (**)
\end{aligned}$$
Observe that the clever step $(*)$ is obtained by symmetry: in each separate expectation we can exchange the roles of $X_0$ and $X_i$, while the distribution of the $(n+1)$-tuple $(X_0, \ldots, X_n)$ remains unchanged. Finally, the next lemma will establish that
$$\#\{i \in \{1, \ldots, n\} : 0 \in \operatorname{NN}_k(x_i)\} \leq k\gamma_d \quad \text{for any points } (x_0, \ldots, x_n),$$
for a factor $\gamma_d$ only depending on $d$; this will conclude the proof, since we can plug this upper bound into $(**)$ (observe that it is at this point that we must make use of the fact that $f$ is nonnegative).
Lemma 4.7. Let $(x_0, \ldots, x_n)$ be $(n+1)$ points in $\mathbb{R}^d$; then $\#\{i : 0 \in \operatorname{NN}_k(x_i)\} \leq k\gamma_d$, where $\gamma_d$ only depends on $d$.
Take $x_0$ as the origin, and let $C_0$ be an open cone of half-angle $\theta \leq \pi/6$ with apex $x_0$. If $y, z \in C_0$ with $\|y\| \leq \|z\|$, the angle between $y$ and $z$ is less than $2\theta$, hence
$$\|y - z\|^2 < \|y\|^2 + \|z\|^2 - 2\|y\|\|z\|\underbrace{\cos(2\theta)}_{\geq 1/2} \leq \|z\|^2\bigg(1 + \underbrace{\frac{\|y\|}{\|z\|}}_{\leq 1}\Big(\frac{\|y\|}{\|z\|} - 1\Big)\bigg) \leq \|z\|^2. \tag{4.5}$$
Let $x_{i_1}, \ldots, x_{i_k}$ be the elements of $\{x_1, \ldots, x_n\}$ belonging to $C_0$ and closest to the origin $x_0$ (if there are only $k' < k$ such elements, take only those; if there are more because of ties, take the ones with the $k$ smallest indices). Now, notice that for any other $x_j \in C_0$ with $j \notin \{i_1, \ldots, i_k\}$, we have $\|x_j\| \geq \|x_{i_\ell}\|$ for all $\ell = 1, \ldots, k$; thus by (4.5) we have
$$\|x_j - x_{i_\ell}\| < \|x_j\| = \|x_j - x_0\|, \quad \ell = 1, \ldots, k.$$
Therefore, for any such $x_j$, the point $x_0$ is not among the $k$ nearest neighbors of $x_j$, and we must have $0 \notin \operatorname{NN}_k(x_j)$.
To summarize: in any such open cone $C_0$, there are at most $k$ indices $i_\ell$ such that $0 \in \operatorname{NN}_k(x_{i_\ell})$. Now the space $\mathbb{R}^d$ can be covered by a finite number $\gamma_d$ of such open cones (note that it is enough to cover the unit ball, by homogeneity, and then use compactness). Overall there are at most $k\gamma_d$ indices $i$ such that $0 \in \operatorname{NN}_k(x_i)$. This implies the conclusion.
Exercise 4.1. Prove the bound $\gamma_d \leq \big(1 + \frac{1}{\sin(\pi/12)}\big)^d \leq 5^d$. For this, assume that the unit ball is covered by open cones of angle $\pi/3$, with principal axes given by the directions of vectors $x_1, \ldots, x_M$, with $\|x_i\| = 1$. Furthermore, assume that this covering is of minimal cardinality. Prove then that it must hold $\|x_i - x_j\| \geq 2\sin\frac{\pi}{12} =: r$ for $i \neq j$. Conclude by a volume argument: the balls $B(x_i, \frac{r}{2})$ must be disjoint and are all contained in $B(0, 1 + \frac{r}{2})$, entailing that the sum of their volumes is less than the volume of the containing ball.
Exercise 4.2. It is possible to also prove a.s. convergence in Lemma 4.5, as in Lemma 4.2, but the argument of Lemma 4.2 has to be modified, since a.s. monotonicity of $d(X_n^{(k(n))}(x), x)$ does not necessarily hold when $k(n)$ depends on $n$. Establish the a.s. convergence property in Lemma 4.5 by considering $U_n := \sup_{m \geq n} d(X_m^{(k(m))}(x), x)$ and establishing that $U_n \to 0$ in probability. Then use the monotonicity argument, since $U_n$ is now a.s. nonincreasing in $n$.
5 Reproducing kernel methods
5.1 Motivation
Linear methods after a feature mapping. In the chapter on linear classification methods, a central role was played by linear (or affine) score functions $f_w(x) = \langle x, w\rangle$; such linear forms are also the class of predictors considered for linear regression. Now imagine we would like to consider as prediction (or score) functions the class of functions of the form
$$\mathcal{F} := \bigg\{ f_\alpha(x) := \sum_{i=1}^M \alpha_i f_i(x), \quad \alpha = (\alpha_1, \ldots, \alpha_M) \in \mathbb{R}^M \bigg\},$$
Therefore, we can in principle apply any linear learning method (regression, or one of the linear classification methods seen in Section 2) to the modified input data $\widetilde{x}$ in order to learn a prediction function in the class $\mathcal{F}$.
An important point to notice right away is that the feature mapping $\Phi$ can often be high-dimensional (as exemplified by polynomial regression: if $\mathcal{X} = \mathbb{R}^d$, the vector space of polynomial functions of degree up to $m$ in each coordinate has dimension $(m+1)^d$). In fact it can commonly be the case that we would like to consider a “feature space” (the image space of $\Phi$) of dimensionality $M$ larger than the data sample size $n$. This has two important consequences:
1. It is essential to consider regularized methods (see in particular Section 2.7), otherwise we are certain to run into overfitting.
2. It can be computationally inconvenient to store the data through its explicit feature mapping $(\widetilde{x}_1, \ldots, \widetilde{x}_n) = (\Phi(x_1), \ldots, \Phi(x_n))$, both in terms of computation time of this mapping, and of memory size.
Scalar products are sufficient. Concerning the second point above, it is important to remark that all the linear methods (possibly in regularized form) we have seen share the following properties:
1. The learnt linear function (or functions, in the case of multi-class classification) $f_{\hat{w}}$ has a parameter which can be written (regardless of the dimension) as a linear combination of the input data:
$$\hat{w} = \sum_{i=1}^n \beta_i X_i; \tag{5.2}$$
3. In order to compute $f_{\hat{w}}(x)$ for a new test point $x$, given the coefficients $(\beta_1, \ldots, \beta_n)$ of the representation (5.2), it is sufficient to know the scalar products $\langle X_i, x\rangle$, $i = 1, \ldots, n$.
For the first two points, we take first the example of the perceptron. Remember that for the perceptron algorithm (see Section 2.6), the main iteration is $\hat{w}_k = \hat{w}_{k-1} + X_{i_k}Y_{i_k}$. By recursion, assume $\hat{w}_{k-1}$ satisfies points 1–2 above, that is to say, is of the form (5.2), with coefficients $\beta_i^{(k-1)}$ which can be computed given only the scalar products $\langle X_i, X_j\rangle$. Observe that the determination of the index $i_k$ depends on finding which training examples are correctly classified or not by the classifier $\operatorname{sign}(f_{\hat{w}_{k-1}})$, for which we only need to know the scalar products $\langle X_i, X_j\rangle$, by (5.3). As a consequence, $\hat{w}_k$ also satisfies points 1–2 above.
Let us take ridge regression as the next example. Remember from (2.16) and below that this method outputs (for a fixed regularization parameter $\lambda > 0$) the linear predictor with parameter vector
$$\hat{w}_\lambda = (\mathbb{X}^t\mathbb{X} + \lambda I_d)^{-1}\mathbb{X}^t Y, \tag{5.4}$$
where $\mathbb{X}$ is the $(n, d)$ matrix whose rows are $x_1^t, \ldots, x_n^t$, and $Y = (y_1, \ldots, y_n)^t$. We have the following lemma:
$$\hat{w}_\lambda = \mathbb{X}^t(\mathbb{X}\mathbb{X}^t + \lambda I_n)^{-1} Y, \tag{5.5}$$
i.e. (5.2) holds with $\beta = (\mathbb{X}\mathbb{X}^t + \lambda I_n)^{-1}Y$; observe that $(\mathbb{X}\mathbb{X}^t)_{ij} = \langle X_i, X_j\rangle$, so finally points 1–2 hold in this setting too.
As is apparent above, an object of central importance is $\mathbb{X}\mathbb{X}^t$, the Gram matrix associated with the points $(X_1, \ldots, X_n)$. Note that it is more economical to use the $(n, n)$ Gram matrix than the $(d, d)$ matrix $\mathbb{X}^t\mathbb{X}$ if $n < d$.
The conclusion of these observations is that it is enough to know how to compute the scalar products $\langle x, x'\rangle$ in order to learn and apply prediction functions for the classical linear methods. If we combine this observation with the idea of a feature mapping exposed earlier, we see that we do not need to know the explicit feature mapping $\Phi$: we only need to be able to compute $\langle\Phi(x), \Phi(x')\rangle$ for arbitrary $x, x' \in \mathcal{X}$.
These are in essence the principles underlying the construction of kernel methods. To summarize:
1. We can greatly extend the flexibility of linear methods by applying them after a feature mapping $\Phi$ into a possibly high-dimensional Euclidean vector space.
2. For standard methods, we don't need to know the feature mapping $\Phi$ explicitly, but only need to be able to compute scalar products $\langle\Phi(x), \Phi(x')\rangle$ for any $x, x' \in \mathcal{X}$.
3. It is important to always consider regularized versions of linear methods in this context, since the output space of the feature mapping $\Phi$ is generally high-dimensional.
Observe that $\Phi$ maps to all monomials of degree up to 2, so the associated function space defined via (5.1) is the vector space of polynomials of degree up to 2 in the coordinates of $x$. For this reason $k$ is called the polynomial kernel of order 2.
The following fundamental theorem gives a set of necessary and sufficient conditions on $k$ in order for (5.6) to hold.
2. $k$ has positive type, that is, for any integer $n > 0$, any $n$-uple $(x_1, \ldots, x_n) \in \mathcal{X}^n$, and any $(\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n$, it holds
$$\sum_{i,j=1}^n \alpha_i\alpha_j k(x_i, x_j) \geq 0.$$
Note: the above properties can be expressed more compactly and equivalently as: for any integer $n > 0$ and any $n$-uple $(x_1, \ldots, x_n) \in \mathcal{X}^n$, the matrix $K$ given by $K_{ij} = k(x_i, x_j)$ is symmetric positive semi-definite. For this reason, we will call a kernel satisfying the above conditions a symmetric positive semi-definite (spsd) kernel.
Proof. “Only if” direction: assume (5.7) holds. Then obviously $k$ is symmetric, and
$$\sum_{i,j=1}^n \alpha_i\alpha_j k(x_i, x_j) = \sum_{i,j=1}^n \alpha_i\alpha_j \langle\Phi(x_i), \Phi(x_j)\rangle_\mathcal{H} = \bigg\|\sum_{i=1}^n \alpha_i\Phi(x_i)\bigg\|^2 \geq 0.$$
“If” direction: assume $k$ is a spsd kernel. We need to construct $\mathcal{H}$ and $\Phi$. For any $x \in \mathcal{X}$, denote by $k_x : \mathcal{X} \to \mathbb{R}$, $y \mapsto k(x, y)$ the associated real-valued function (we will alternatively use the notation $k_x := k(x, \cdot)$). Now define
$$\mathcal{H}_{pre} := \bigg\{ \sum_{i \in I} \lambda_i k_{x_i} : I \text{ finite}, \ \lambda_i \in \mathbb{R}, \ x_i \in \mathcal{X} \bigg\}; \tag{5.8}$$
we stress that the above set is made of finite linear combinations of functions of the form $k(x_i, \cdot)$.
We define a bilinear form $[\cdot, \cdot]$ on $\mathcal{H}_{pre}$:
$$\text{for } f = \sum_{i \in I_1} \lambda_i k_{x_i}, \quad g = \sum_{j \in I_2} \mu_j k_{x_j}, \quad \text{define } [f, g] := \sum_{i \in I_1,\, j \in I_2} \lambda_i\mu_j k(x_i, x_j). \tag{5.9}$$
We need to stress that this is a well-formed definition: indeed, it may be possible that the same function has another representation as a linear expansion, say $f = \sum_{i \in I_1'} \lambda_i' k_{x_i'}$. But it holds that $\sum_{i \in I_1, j \in I_2} \lambda_i\mu_j k(x_i, x_j) = \sum_{j \in I_2} \mu_j f(x_j)$ by definition, so the definition of $[\cdot, \cdot]$ does not depend on the particular representation of $f$. The same argument applies to $g$.
Now, it is easy to check that property 1. ($k$ symmetric) implies that $[\cdot, \cdot]$ is symmetric, and that property 2. ($k$ has positive type) implies that for $f \in \mathcal{H}_{pre}$ having a representation as in (5.9), it holds $[f, f] = \sum_{i,j \in I_1} \lambda_i\lambda_j k(x_i, x_j) \geq 0$. Hence $[\cdot, \cdot]$ is a symmetric positive semidefinite form on the vector space $\mathcal{H}_{pre}$.
We finally check that it is definite. A symmetric positive semidefinite form satisfies the Cauchy–Schwarz inequality, so it holds (assuming again $f \in \mathcal{H}_{pre}$ with a representation as in (5.9))
$$f(x) = \sum_{i \in I_1} \lambda_i k(x_i, x) = [f, k_x] \leq [f, f]^{\frac{1}{2}}[k_x, k_x]^{\frac{1}{2}}, \tag{5.10}$$
hence $[f, f] = 0$ implies that $f(x) = 0$ for all $x$, i.e. $f = 0$ as a function. Hence $[\cdot, \cdot]$ is a symmetric positive definite bilinear form on $\mathcal{H}_{pre}$.
Finally, define $\Phi_{pre} : \mathcal{X} \to \mathcal{H}_{pre}$, $x \mapsto k_x$. Then it holds
$$[\Phi_{pre}(x), \Phi_{pre}(x')] = [k_x, k_{x'}] = k(x, x'). \tag{5.11}$$
We have just constructed a pre-Hilbert space $\mathcal{H}_{pre}$ and a mapping $\Phi_{pre}$ such that (5.7) holds.
What is missing for a proper Hilbert space is completeness. But this space can be completed: there exists a complete Hilbert space $\mathcal{H}^\circ$ and an isometry $i : (\mathcal{H}_{pre}, [\cdot, \cdot]) \to (\mathcal{H}^\circ, \langle\cdot, \cdot\rangle_{\mathcal{H}^\circ})$ such that $i(\mathcal{H}_{pre})$ is dense in $\mathcal{H}^\circ$. (The completion operation is obtained by considering equivalence classes of Cauchy sequences in $\mathcal{H}_{pre}$; this is a standard construction that we don't detail here.)
Correspondingly we can define $\Phi^\circ(x) := i \circ \Phi_{pre}(x)$, which satisfies (5.7) because of (5.11), since $i$ is an isometry. This concludes the proof.
In the previous proof, Hpre was specifically constructed as a pre-Hilbert space of real-
valued functions on X with Φpre (x) = kx . It is an important point that this property in
fact carries over to its completion, and we highlight this in the next result.
** Theorem 5.3 (and definition). If $k$ is a spsd kernel on the set $\mathcal{X}$ (as in Theorem 5.2), the Hilbert space $\mathcal{H}$ and the mapping $\Phi$ satisfying (5.7) can be constructed so that:
1. $\mathcal{H}$ is a vector space of real-valued functions on $\mathcal{X}$;
2. for any $x \in \mathcal{X}$, the function $k_x = k(x, \cdot)$ belongs to $\mathcal{H}$, and $\Phi(x) = k_x$;
3. the reproducing property holds:
$$\forall f \in \mathcal{H}, \ \forall x \in \mathcal{X}: \quad f(x) = \langle f, k_x\rangle_\mathcal{H}. \tag{5.12}$$
The space $\mathcal{H}$ satisfying the above properties is unique and is called the reproducing kernel Hilbert space on $\mathcal{X}$ with kernel $k$.
Furthermore, $\mathcal{H}_{pre}$ given by (5.8) is dense in $\mathcal{H}$.
Proof. We have constructed in the proof of Theorem 5.2 a pre-Hilbert space $\mathcal{H}_{pre}$ satisfying the announced properties (observe that the reproducing property (5.12) holds in $\mathcal{H}_{pre}$ by construction/definition of the form $[\cdot, \cdot]$). What about its completion $\mathcal{H}^\circ$? We recall that there exists an isometry $i : \mathcal{H}_{pre} \to \mathcal{H}^\circ$ with $i(\mathcal{H}_{pre})$ dense in $\mathcal{H}^\circ$. We now construct the following mapping:
$$\xi : \mathcal{H}^\circ \to \mathcal{F}(\mathcal{X}, \mathbb{R}), \quad h \mapsto \big(x \mapsto \langle i(k_x), h\rangle_{\mathcal{H}^\circ}\big).$$
Let us prove that $\xi$ is injective. Assume $\xi(h) = 0$, which is to say, for all $x \in \mathcal{X}$ it holds $\langle i(k_x), h\rangle_{\mathcal{H}^\circ} = 0$. This implies by linearity that for any $f \in \mathcal{H}_{pre}$, $\langle i(f), h\rangle_{\mathcal{H}^\circ} = 0$. But since $i(\mathcal{H}_{pre})$ is dense in $\mathcal{H}^\circ$, it implies that for any $h' \in \mathcal{H}^\circ$, $\langle h', h\rangle_{\mathcal{H}^\circ} = 0$, hence $h = 0$.
Since $\xi$ is linear, it defines a bijection between $\mathcal{H}^\circ$ and $\xi(\mathcal{H}^\circ) \subset \mathcal{F}(\mathcal{X}, \mathbb{R})$. We can therefore endow $\mathcal{H} := \xi(\mathcal{H}^\circ)$ with the scalar product $\langle f, f'\rangle := \langle\xi^{-1}(f), \xi^{-1}(f')\rangle$, so that $\mathcal{H}$ is a Hilbert space of functions $\mathcal{X} \to \mathbb{R}$ which is isometric to $\mathcal{H}^\circ$.
Additionally, we observe that $\mathcal{H}_{pre} \subseteq \mathcal{H}$, since $\xi \circ i$ coincides with the identity on $\mathcal{H}_{pre}$: for any $x, y \in \mathcal{X}$, it holds
$$\xi(i(k_y))(x) = \langle i(k_x), i(k_y)\rangle_{\mathcal{H}^\circ} = [k_x, k_y] = k(x, y) = k_y(x);$$
hence by linearity of $\xi \circ i$, we have the inclusion $\mathcal{H}_{pre} \hookrightarrow \mathcal{H}$, which is an inclusion of Hilbert spaces, since $\xi \circ i$ is an isometry by composition of isometries.
We can therefore define the feature mapping $\Phi(x) = k_x$, which satisfies, for $f = \xi(h) \in \mathcal{H}$:
$$f(x) = \langle h, i(k_x)\rangle_{\mathcal{H}^\circ}, \quad \text{while} \quad \langle f, k_x\rangle_\mathcal{H} = \langle\xi(h), \xi(i(k_x))\rangle_\mathcal{H} = \langle h, i(k_x)\rangle_{\mathcal{H}^\circ},$$
hence (5.12) is satisfied.
We turn to unicity. Let $\mathcal{H}'$ be another Hilbert space of real functions on $\mathcal{X}$ satisfying the announced properties. By property 2. it holds that $k_x \in \mathcal{H}'$ for all $x \in \mathcal{X}$, and by consequence $\mathcal{H}_{pre} \subseteq \mathcal{H}'$. Furthermore, property 3. implies that $\langle k_x, k_{x'}\rangle_{\mathcal{H}'} = k(x, x') = [k_x, k_{x'}]$, where $[\cdot, \cdot]$ is the bilinear form constructed on $\mathcal{H}_{pre}$ in the proof of Theorem 5.2. By linearity, $\langle\cdot, \cdot\rangle_{\mathcal{H}'}$ coincides with $[\cdot, \cdot]$ on $\mathcal{H}_{pre}$. Hence the identity mapping $\mathcal{H}_{pre} \hookrightarrow \mathcal{H}'$ is an isometry.
On the other hand, we have established that $\mathcal{H}_{pre} \subseteq \mathcal{H}$ via the isometric inclusion $\xi \circ i$. Let $\overline{\mathcal{H}}_{pre}$ be the closure of $\mathcal{H}_{pre}$ in $\mathcal{H}$. It can be checked that $\overline{\mathcal{H}}_{pre} = \mathcal{H}$, where $\mathcal{H}$ was constructed above. Indeed, we know that $i(\mathcal{H}_{pre})$ is dense in $\mathcal{H}^\circ$, hence by isometry $\xi \circ i(\mathcal{H}_{pre}) = \mathcal{H}_{pre}$ is dense in $\xi(\mathcal{H}^\circ) = \mathcal{H}$.
Finally, observe that the closure of $\mathcal{H}_{pre}$ in $\mathcal{H}$ coincides with the closure of $\mathcal{H}_{pre}$ in $\mathcal{H}'$. Indeed, any Cauchy sequence of functions $(f_n)_{n\geq 1}$ in $\mathcal{H}_{pre}$ converges both in $\mathcal{H}$ and in $\mathcal{H}'$, by completeness of both these spaces, and the limit point $f$ is uniquely determined as a function, since for any $x \in \mathcal{X}$, $f(x) = \lim_{n\to\infty} [k_x, f_n]$, the right-hand side of the latter equality being a real-valued Cauchy sequence (by continuity), hence having a unique limit. So any such limit function $f$ belongs to both $\mathcal{H}$ and $\mathcal{H}'$, and since $\mathcal{H} = \overline{\mathcal{H}}_{pre}$, we have $\mathcal{H} \subseteq \mathcal{H}'$.
Since $\mathcal{H}$ is closed in $\mathcal{H}'$, we can write $\mathcal{H}' = \mathcal{H} \overset{\perp}{\oplus} \mathcal{H}_1$; but for any $f_1 \in \mathcal{H}_1$, since $\mathcal{H}_1 \perp \mathcal{H}$, we have $\langle f_1, h\rangle_{\mathcal{H}'} = 0$ for any $h \in \mathcal{H}$. In particular, for any $x \in \mathcal{X}$, $k_x \in \mathcal{H}$ and $\langle f_1, k_x\rangle_{\mathcal{H}'} = f_1(x) = 0$ (by the assumed reproducing property on $\mathcal{H}'$), so $f_1 = 0$. Finally $\mathcal{H}' = \mathcal{H}$, proving unicity.
For a complete overview we also mention the following characterization of reproducing kernel Hilbert spaces.
** Theorem 5.4. Let $\mathcal{H}$ be a Hilbert space of real-valued functions over a set $\mathcal{X}$. Then the following properties are equivalent:
(1) for any $x \in \mathcal{X}$, the evaluation functional
$$\delta_x : \mathcal{H} \to \mathbb{R}; \quad f \mapsto f(x) \tag{5.14}$$
is continuous;
(2) there exists a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that for all $x \in \mathcal{X}$: (a) $k(x, \cdot) \in \mathcal{H}$, and (b) for all $f \in \mathcal{H}$, $\langle f, k(x, \cdot)\rangle = f(x)$.
Furthermore, the function $k$ in the last point is a spsd kernel, so that $\mathcal{H}$ is the reproducing kernel Hilbert space with kernel $k$.
Proof. (1) ⇒ (2): since $\delta_x$ is continuous, by Riesz' theorem there exists a unique element $\zeta_x \in \mathcal{H}$ such that $\delta_x(f) = \langle\zeta_x, f\rangle$ for all $f \in \mathcal{H}$. Since $\mathcal{H}$ is a space of real-valued functions, we define $k(x, y) := \zeta_x(y)$ for all $x, y$. Then $k(x, \cdot) = \zeta_x$, so that the announced properties (a) and (b) are satisfied.
(2) ⇒ (1): we have, for any $f \in \mathcal{H}$ and $x \in \mathcal{X}$, that $\delta_x(f) = f(x) = \langle f, k_x\rangle$, which is continuous in $f$ by continuity of the scalar product (which can be seen as a consequence of the Cauchy–Schwarz inequality).
The kernel $k$ appearing in (2) is spsd: it holds $k(x, y) = k_x(y) = \langle k_x, k_y\rangle = k_y(x) = k(y, x)$, so $k$ is symmetric. Furthermore,
$$\sum_{i,j} \alpha_i\alpha_j k(x_i, x_j) = \sum_{i,j} \alpha_i\alpha_j\langle k_{x_i}, k_{x_j}\rangle = \bigg\|\sum_i \alpha_i k_{x_i}\bigg\|^2 \geq 0.$$
(ii) If $\mathcal{X}$ is a Euclidean or Hilbert space with inner product $\langle\cdot, \cdot\rangle$, then $k(x, y) = \langle x, y\rangle$ is a spsd kernel (“linear kernel”).
(vii) If $(k_i)_{i\geq 1}$ is a sequence of spsd kernels on $\mathcal{X}$ which converges pointwise (for any $x, y \in \mathcal{X}$), then the limiting function is a spsd kernel.
The proofs for all points of the above theorem are left as an exercise, with the exception of point (v), for which we provide the following lemma:
Lemma 5.6. Let $M, N$ be two $(n, n)$ spsd real matrices. Then the $(n, n)$ matrix $A$ defined by $A_{ij} := M_{ij}N_{ij}$ is spsd.
Proof. Write the spectral decompositions $M = \sum_{k=1}^n \lambda_k e_k e_k^T$ and $N = \sum_{\ell=1}^n \mu_\ell f_\ell f_\ell^T$, where the eigenvalues $\lambda_k, \mu_\ell$ are nonnegative. Then for any $\alpha \in \mathbb{R}^n$:
$$\begin{aligned}
\sum_{i,j=1}^n \alpha_i\alpha_j A_{ij} &= \sum_{i,j=1}^n \alpha_i\alpha_j\bigg(\sum_{k=1}^n \lambda_k e_k^{(i)}e_k^{(j)}\bigg)\bigg(\sum_{\ell=1}^n \mu_\ell f_\ell^{(i)}f_\ell^{(j)}\bigg) \\
&= \sum_{i,j,k,\ell=1}^n \alpha_i\alpha_j\lambda_k\mu_\ell\, e_k^{(i)}e_k^{(j)}f_\ell^{(i)}f_\ell^{(j)} \\
&= \sum_{k,\ell=1}^n \lambda_k\mu_\ell \sum_{i,j=1}^n \big(\alpha_i e_k^{(i)}f_\ell^{(i)}\big)\big(\alpha_j e_k^{(j)}f_\ell^{(j)}\big) \\
&= \sum_{k,\ell=1}^n \lambda_k\mu_\ell\bigg(\sum_{i=1}^n \alpha_i e_k^{(i)}f_\ell^{(i)}\bigg)^2 \geq 0.
\end{aligned}$$
Therefore $A$ is spsd.
We deduce the following corollaries from Theorem 5.5:
Corollary 5.7. Let $\mathcal{X}$ be a Euclidean or Hilbert space, and $f$ a real polynomial with nonnegative coefficients. Then $k(x, y) := f(\langle x, y\rangle)$ is a spsd kernel on $\mathcal{X}$.
*** Corollary 5.8. Let $\mathcal{X}$ be a Euclidean or Hilbert space, and $F(t) = \sum_{i\geq 0} a_i t^i$ an analytic function with real, nonnegative coefficients $a_i \geq 0$ and convergence radius $R > 0$. Then $k(x, y) := F(\langle x, y\rangle)$ is a spsd kernel on $B_\mathcal{X}(0, \sqrt{R}) = \big\{x \in \mathcal{X} : \|x\| < \sqrt{R}\big\}$.
• $k(x, y) = (\langle x, y\rangle + c)^m$ for $m \in \mathbb{N}^*$, $c > 0$: polynomial kernel of order $m$.
• $k(x, y) = (1 - \langle x, y\rangle)^{-\alpha}$ for $\alpha > 0$, on $\mathcal{X} = B_{\mathbb{R}^d}(0, 1)$: negative binomial kernel.
• Using a spsd kernel $k$ instead of the regular scalar product can be seen as (implicitly) mapping the $x$-data to a Hilbert space $\mathcal{H}$ via a feature mapping $\Phi$, and applying the linear method to the transformed data $\widetilde{x}$.
• For each algorithm, we want to find a suitable representation of the learnt function as an expansion of the form (5.2) in the transformed data, i.e.:
$$\hat{w} = \sum_{i=1}^n \beta_i\Phi(X_i); \tag{5.15}$$
thus we only need to store the $n$-vector of coefficients $(\beta_i)_{1\leq i\leq n}$, and to determine how to compute it from the only information of the scalar products $(\langle X_i, X_j\rangle)_{1\leq i,j\leq n}$ and the labels $(Y_i)_{1\leq i\leq n}$.
In this section, we will assume that $k$ is a given spsd kernel on $\mathcal{X}$, $\mathcal{H}$ the associated RKHS, and denote $K$ the kernel Gram matrix of the $x$-data, i.e. $K_{ij} := k(X_i, X_j)$, $1 \leq i, j \leq n$.
If we are using the RKHS $\mathcal{H}$ associated to $k$, due to the reproducing property, if $w \in \mathcal{H}$ we have
$$f_w(x) := \langle w, \Phi(x)\rangle = \langle w, k_x\rangle = w(x); \quad \text{hence } f_w = w;$$
in other words the function $f_w$ associated to $w$ is $w$ itself, so the representation (5.15) becomes (here denoting $\hat{f}$ instead of $\hat{w}$, to emphasize that it is a function):
$$\hat{f} = \sum_{i=1}^n \beta_i k_{X_i}. \tag{5.16}$$
* Kernel perceptron. Recall again the standard perceptron iteration ($\mathcal{Y} = \{-1, 1\}$):
$$\hat{w}_0 = 0; \qquad \hat{w}_{\ell+1} = \hat{w}_\ell + X_{i_\ell}Y_{i_\ell}, \quad \text{where } i_\ell \text{ is any index s.t. } Y_{i_\ell}\langle\hat{w}_\ell, X_{i_\ell}\rangle \leq 0.$$
In the “kernelized” perceptron, we want to represent the vectors $\hat{w}_\ell \in \mathcal{H}$ by means of their coefficient vectors $\beta^{(\ell)} \in \mathbb{R}^n$ in the representation (5.15). So the above becomes:
$$\beta^{(0)} = 0; \qquad \beta^{(\ell+1)} = \beta^{(\ell)} + Y_{i_\ell}e_{i_\ell},$$
where $e_i$ is the $i$-th canonical basis vector of $\mathbb{R}^n$, and $i_\ell$ is any index such that
$$Y_{i_\ell}\sum_{i=1}^n \beta_i^{(\ell)}\langle\Phi(X_{i_\ell}), \Phi(X_i)\rangle = Y_{i_\ell}\sum_{i=1}^n \beta_i^{(\ell)}k(X_{i_\ell}, X_i) = Y_{i_\ell}\big[K\beta^{(\ell)}\big]_{i_\ell} \leq 0.$$
In the case of the perceptron, regularization is obtained by early stopping, which is to say, stopping at an iteration before all training points are classified correctly. The stopping iteration is typically selected among a predetermined set of candidate values by hold-out or cross-validation.
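A sketch of the kernelized perceptron in coefficient form (labels in $\{-1, 1\}$; the iteration cap plays the role of the early-stopping parameter; names ours):

```python
import numpy as np

def kernel_perceptron(K, y, max_iter):
    """Perceptron in the coefficient representation w = sum_i beta_i Phi(X_i).

    K is the (n, n) kernel Gram matrix, y the labels in {-1, +1}."""
    n = len(y)
    beta = np.zeros(n)
    for _ in range(max_iter):
        margins = y * (K @ beta)
        bad = np.flatnonzero(margins <= 0)
        if bad.size == 0:
            break                   # all points correctly classified
        i = bad[0]
        beta[i] += y[i]             # beta^(l+1) = beta^(l) + Y_i e_i
    return beta                      # prediction: f(x) = sum_i beta_i k(X_i, x)
```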
** Regularized kernel ERM. We assume here that the prediction space is $\widetilde{Y} = \mathbb{R}$, and $\ell : \widetilde{Y} \times \mathcal{Y} \to \mathbb{R}$ is a loss function. We consider ERM over prediction functions of the form $f_w(x) := \langle w, \Phi(x)\rangle$ for $w \in \mathcal{H}$; as we have seen, it holds $f_w = w$, hence our class of prediction functions is the RKHS $\mathcal{H}$ itself. Furthermore, we consider regularization by the squared RKHS norm. For a regularization parameter $\lambda > 0$, we therefore define
$$\hat{f}_\lambda \in \operatorname*{Arg\,Min}_{w \in \mathcal{H}}\bigg(\sum_{i=1}^n \ell(Y_i, \langle w, \Phi(X_i)\rangle) + \lambda\|w\|_\mathcal{H}^2\bigg) = \operatorname*{Arg\,Min}_{f \in \mathcal{H}}\bigg(\sum_{i=1}^n \ell(Y_i, f(X_i)) + \lambda\|f\|_\mathcal{H}^2\bigg). \tag{5.17}$$
We want to prove that, in general, $\hat{f}_\lambda$ admits the representation (5.16). This will be established as a consequence of the following result.
*** Theorem 5.9 (Representation theorem). Let $\mathcal{H}$ be a RKHS on $\mathcal{X}$ with kernel $k$. Let $n > 0$ be an integer and $\Psi : \mathbb{R}^n \times \mathbb{R}^+ \to \mathbb{R}$ be a mapping such that for any $u \in \mathbb{R}^n$, the function $r \in \mathbb{R}^+ \mapsto \Psi(u, r)$ is nondecreasing.
For any $\mathbf{x} = (x_1, \ldots, x_n) \in \mathcal{X}^n$, denote
$$S_\mathbf{x} := \operatorname{Span}\{k_{x_i}, i = 1, \ldots, n\} = \bigg\{\sum_{i=1}^n \beta_i k_{x_i}, \ (\beta_1, \ldots, \beta_n) \in \mathbb{R}^n\bigg\};$$
then it holds
$$\inf_{f \in \mathcal{H}} \Psi\big((f(x_i))_{1\leq i\leq n}, \|f\|_\mathcal{H}^2\big) = \inf_{f \in S_\mathbf{x}} \Psi\big((f(x_i))_{1\leq i\leq n}, \|f\|_\mathcal{H}^2\big).$$
Furthermore, if the above infimum on the left-hand side is a minimum, then it is also a minimum on the right-hand side; in other words, the minimum over $\mathcal{H}$ is attained at an element of $S_\mathbf{x}$.
From Theorem 5.9 we know that we can assume the representation (5.16), i.e. f̂_λ =
Σ_{i=1}^n β_{λ,i} k(Xᵢ, ·). Denoting β_λ := (β_{λ,1}, . . . , β_{λ,n}) ∈ R^n the coefficients of this
expansion, we observe that ∥f̂_λ∥²_H = Σ_{i,j=1}^n β_{λ,i} β_{λ,j} k(Xᵢ, Xⱼ) = β_λᵀ K β_λ; also
(f̂_λ(X₁), . . . , f̂_λ(Xₙ)) = Kβ_λ. Restricting the search for a minimum to functions of this
form, (5.18) becomes:

    min_{β∈R^n} ∥Y − Kβ∥² + λ βᵀKβ,

where we recall Y = (Y₁, . . . , Yₙ)ᵗ. By usual arguments (cancelling the first derivative wrt.
β of the above function to minimize), we obtain the necessary and sufficient condition

    K( (K + λIₙ)β − Y ) = 0,

hence a solution is

    β_λ = (K + λIₙ)^{−1} Y,    (5.20)

observe that we have recovered exactly the formula (5.5) discussed at the beginning of the
chapter, but for the data mapped into the Hilbert space.
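As a sanity check of (5.20), here is a short numpy sketch of kernel ridge regression; the one-dimensional data and the kernel bandwidth are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    # k(a, b) = exp(-(a - b)^2 / (2 sigma^2)) for 1-d inputs (illustrative choice)
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 30)
Y = np.sin(3 * X) + 0.1 * rng.normal(size=30)

lam = 0.1
K = gaussian_kernel(X, X)
beta = np.linalg.solve(K + lam * np.eye(len(X)), Y)   # (5.20): beta = (K + lam I)^(-1) Y

x_test = np.linspace(-1, 1, 5)
f_hat = gaussian_kernel(x_test, X) @ beta             # fhat(x) = sum_i beta_i k(X_i, x)
print(np.round(f_hat, 3))
```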
Kernel logistic regression. Recall that logistic regression (for a binary classification
problem with label space Y = {−1, 1}) can be seen as an ERM estimator with loss function
ℓ(ỹ, y) = log(1 + exp(−yỹ)), see Exercise 2.4. Again, for the “kernelized” (and regularized)
version given by (5.17), using the representation (5.16) and the fact that f̂_λ(Xᵢ) = [Kβ_λ]ᵢ,
the coefficient vector β_λ ∈ R^n is defined by

    β_λ ∈ Arg Min_{β∈R^n} ( Σ_{i=1}^n log(1 + exp(−Yᵢ [Kβ]ᵢ)) + λ βᵀKβ ),

which is a convex optimization problem in β and can be solved by standard methods such
as gradient descent, stochastic gradient descent, or Newton-Raphson iterations. The latter
requires inversion of an (n, n) Hessian matrix at each step, which can be prohibitive, so the
former methods might be preferred even if their convergence rate is not as fast.
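To illustrate the gradient-descent route mentioned above, here is a minimal numpy sketch minimizing the displayed objective; the step size and iteration budget are illustrative assumptions, with no convergence tuning attempted.

```python
import numpy as np

def kernel_logistic(K, Y, lam, step=0.01, n_steps=2000):
    # Gradient descent on J(beta) = sum_i log(1 + exp(-Y_i [K beta]_i)) + lam beta^T K beta.
    beta = np.zeros(K.shape[0])
    for _ in range(n_steps):
        m = Y * (K @ beta)                    # margins Y_i [K beta]_i
        s = 0.5 * (1.0 - np.tanh(m / 2.0))    # sigmoid(-m), computed in a stable way
        grad = -K @ (Y * s) + 2.0 * lam * (K @ beta)
        beta -= step * grad
    return beta
```

Each step costs O(n²) for the product Kβ, against O(n³) for the Hessian solve of a Newton-Raphson step, which is the trade-off mentioned above.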
In the multiclass case (Y = {0, . . . , K − 1}), a similar argument holds with an appropri-
ate loss function, using the fact that we are looking for (K−1) score functions f̂_{λ,1}, . . . , f̂_{λ,K−1};
it suffices to adapt (2.13) to the kernel setting.
Kernel Support Vector Machine. It is entirely similar to the previous argument
for logistic regression (still for binary classification with Y = {−1, 1}), but with the loss
function ℓ_Hinge(f(x), y) := (1 − yf(x))₊. Again, the optimization problem for the kernel
expansion coefficients β_λ ∈ R^n is convex. There exist a number of implementations based
on further reformulations of the problem, exploiting the particular form of the loss function
for efficient computation of an approximate minimum. It is one of the most standard
classification methods in machine learning toolboxes.
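Standard SVM toolboxes solve a dual reformulation; purely as an illustration of the convex primal problem stated above, a subgradient-descent sketch follows (hyperparameters are illustrative assumptions).

```python
import numpy as np

def kernel_svm_subgradient(K, Y, lam, step=0.001, n_steps=5000):
    # Subgradient descent on sum_i (1 - Y_i [K beta]_i)_+ + lam beta^T K beta.
    beta = np.zeros(K.shape[0])
    for _ in range(n_steps):
        active = (Y * (K @ beta) < 1.0)       # margin violators contribute -Y_i k(X_i, .)
        g = -K @ (Y * active) + 2.0 * lam * (K @ beta)
        beta -= step * g
    return beta
```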
5.5 Regularity and approximation properties of functions in a
RKHS
It is of general interest to understand the properties of the functions belonging to a RKHS
with a given kernel, since the RKHS is the class of functions we use as predictors (or scores
in the case of classification) in different learning settings.
From a practical point of view, we can observe that due to the representation (5.16) as
a finite kernel expansion, the considered estimators belong (in general) to H_pre. Therefore,
whenever the kernel function k(·, ·) is measurable, resp. bounded, resp. continuous with
respect to either of its variables, so are the functions in H_pre, by finite linear combination.
Still, it is of mathematical interest (for further mathematical analysis, use of Hilbertian
analysis tools, etc.) to understand whether this is also the case for the full RKHS obtained as
the completion of H_pre.
Proof. (ii) ⇒ (i) because |f(x)| = |⟨f, k_x⟩| ≤ ∥f∥_H √(k(x, x)); this also implies (5.21).
(ii) ⇒ (iii) because |k(x, y)| = |⟨k_x, k_y⟩| ≤ √(k(x, x)) √(k(y, y)).
(iii) ⇒ (ii): trivial.
(i) ⇒ (ii): can be seen as a consequence of the Banach-Steinhaus theorem. Namely,
consider the family of linear forms on H given by L := {δ_x, x ∈ X}, where δ_x is the
evaluation functional at point x given by (5.14); we have
** Definition 5.13.
Let X be a nonempty, compact topological space. Then a continuous spsd kernel k
on X is called universal if the corresponding RKHS H is dense (in the sense of
the supremum norm) in the space C(X) of continuous real-valued functions on X.
This definition extends to a non-compact topological space X: then k is said to be
universal if its restriction to any compact subset of X is universal.
Proposition 5.14. Let X be a nonempty, compact topological space. Then a continuous
spsd kernel on X is universal iff H_pre is dense in C(X) for the supremum norm.
Proof. (⇐): since Hpre ⊆ H, Hpre dense in C(X) trivially implies that H is also dense in
C(X).
(⇒): we know that Hpre is dense in H in the sense of the H-norm. Since k is continuous
on the compact X × X , it is bounded, therefore (5.21) applies and Hpre is a fortiori dense
in H for the supremum norm, and therefore also dense in C(X ) for the supremum norm,
since H is.
The following final result of this section is very useful to establish universality of a
number of classical kernels on R^d.

** Theorem 5.15 (Universal Taylor kernels). Let F(t) = Σ_{i≥0} aᵢ tⁱ be a real-valued
analytical function with real, strictly positive coefficients aᵢ > 0, and convergence radius
R > 0.
Let X = R^d. Then k(x, y) := F(⟨x, y⟩) is a universal spsd kernel on B_X(0, √R) =
{ x ∈ X : ∥x∥ < √R }.
This result implies in particular that the exponential kernel and the negative-binomial
kernel introduced at the end of Section 5.3 are universal on Rd and BRd (0, 1), respectively.
Lemma 5.16. Let k be a spsd kernel on a nonempty set X, and H◦ be a Hilbert space and
Φ◦ a mapping X → H◦, so that it holds for all x, y in X: ⟨Φ◦(x), Φ◦(y)⟩_{H◦} = k(x, y).
Then, for any w ∈ H◦, the function x ↦ ⟨w, Φ◦(x)⟩_{H◦} belongs to the RKHS H associ-
ated to k.

Proof. Denote ξ the map sending w ∈ H◦ to the function ξ(w) : x ↦ ⟨w, Φ◦(x)⟩_{H◦}.
If w = Φ◦(x) for some x ∈ X, the function x′ ↦ ⟨w, Φ◦(x′)⟩_{H◦} = k(x, x′) coincides
with k(x, ·), i.e. ξ(Φ◦(x)) = k_x ∈ H_pre ⊆ H, and furthermore ∥ξ(Φ◦(x))∥_H = √(k(x, x)) =
∥Φ◦(x)∥_{H◦}. By linearity, ξ is an isometry from H₁ := Span{Φ◦(x), x ∈ X} into H (in fact
into H_pre), in particular ξ(H₁) ⊆ H.
For any sequence (wₙ)_{n≥1} of elements in H₁ converging to w* in H◦, the sequence
(ξ(wₙ))_{n≥1} is Cauchy in H (by isometry) and therefore converges to a limit f*. But it
holds for any x ∈ X (using the definition of ξ, continuity of scalar products in H◦ and H,
the isometry property of ξ, and the reproducing property in H):

    f*(x) = ⟨f*, k_x⟩_H = lim_n ⟨ξ(wₙ), k_x⟩_H = lim_n ⟨wₙ, Φ◦(x)⟩_{H◦} = ⟨w*, Φ◦(x)⟩_{H◦},

so that ξ(w*) = f* ∈ H. For a general w ∈ H◦, write w = w∥ + w⊥ with w∥ in the closure
of H₁ and w⊥ orthogonal to H₁ (hence to every Φ◦(x)); then ξ(w) = ξ(w∥) ∈ H, which
proves the lemma.

Proof of Theorem 5.15. Writing s(j) := Σᵢ jᵢ for a multi-index j ∈ N^d, c(j) for the
corresponding multinomial coefficient, and m_j(x) := Πᵢ xᵢ^{jᵢ} for the associated monomial,
we have

    k(x, y) = Σ_{ℓ≥0} a_ℓ ⟨x, y⟩^ℓ = Σ_{ℓ≥0} a_ℓ ( Σ_{i=1}^d xᵢyᵢ )^ℓ
            = Σ_{ℓ≥0} a_ℓ Σ_{j₁+...+j_d=ℓ} c(j₁, . . . , j_d) Π_{i=1}^d (xᵢyᵢ)^{jᵢ}
            = Σ_{j∈N^d} a_{s(j)} c(j) m_j(x) m_j(y)
            = Σ_{j∈N^d} φ_j(x) φ_j(y),

where φ_j(x) := √( a_{s(j)} c(j) ) m_j(x).
We therefore consider H◦ := ℓ²(N^d), and Φ◦(x) := (φ_j(x))_{j∈N^d}. Note that absolute
convergence of the power series defining F ensures that Φ◦(x) ∈ H◦, i.e.
Σ_{j∈N^d} (φ_j(x))² < ∞ for any x ∈ R^d such that ∥x∥ < √R.
We can apply the previous lemma and conclude that for any w ∈ H◦, the function
x ↦ ⟨w, Φ◦(x)⟩_{H◦} belongs to the RKHS H. Let us choose, for an arbitrary multi-index
j ∈ N^d, the vector w ∈ H◦ whose j-coordinate is 1/√( a_{s(j)} c(j) ) and whose other
coordinates are 0. Then for any x ∈ X:

    ⟨w, Φ◦(x)⟩_{H◦} = φ_j(x) / √( a_{s(j)} c(j) ) = m_j(x).
We conclude that all monomial functions in the coordinates of x belong to H; then also
all polynomials by linearity, and the conclusion is a consequence of the Stone-Weierstraß
theorem.
• ∥f̂∥_∞ ≤ ∥f∥_{L¹};
Theorem 5.17. Let φ : R^d → R be continuous, and let k(x, y) = φ(x − y). Assume
φ = f̂ for some f ∈ L¹(R^d, R), with f(x) ≥ 0 a.s. Then k is a spsd kernel.
Note: this direction is actually the “easy” one. There exists a converse (Bochner’s
theorem) stating that if k is a spsd, translation-invariant kernel — i.e. it is of the form
k(x, y) = k(x − y, 0) = φ(x − y), where φ = k(0, ·) — then φ is the Fourier-Stieltjes
transform of a finite (nonnegative) measure on Rd .
Proof. It holds for any integer n > 0, (x₁, . . . , xₙ) ∈ (R^d)^n and (α₁, . . . , αₙ) ∈ R^n:

    Σ_{i,j=1}^n αᵢαⱼ k(xᵢ, xⱼ) = Σ_{i,j=1}^n αᵢαⱼ φ(xᵢ − xⱼ)
        = Σ_{i,j=1}^n αᵢαⱼ ∫_{R^d} exp(−i⟨xᵢ, ω⟩) exp(i⟨xⱼ, ω⟩) f(ω) dω
        = ∫_{R^d} | Σ_{i=1}^n αᵢ exp(−i⟨xᵢ, ω⟩) |² f(ω) dω ≥ 0.
Note: we have assumed here as in the rest of the chapter that φ and therefore k are real-
valued (which implies in particular that f must be symmetric around 0 in Theorem 5.17).
This can be generalized to a more general theory of complex-valued spsd (Hermitian)
kernels, mutatis mutandis.
Note: Because of the inverse Fourier formula, provided that φ is integrable we can
identify the function f as the inverse Fourier transform of φ, given by (5.23).
Examples.
• We find another proof that the Gaussian kernel k(x, y) := exp(−∥x − y∥²/(2σ²)) is
spsd, with

    f(t) = ( σ^d / (2π)^{d/2} ) exp( −σ²∥t∥²/2 ).

• The Laplace kernel k(x, y) := ½ exp(−γ|x − y|) on R (with γ ≥ 0) is spsd, with

    f(t) = (1/(2π)) · γ/(γ² + t²).
The idea of random Fourier features is to approximate the above expectation by a finite
average over p randomly drawn frequencies (ω₁, . . . , ω_p) ~ i.i.d. P_f. More explicitly, given p
such random frequencies, define the mapping

    Φ̃ : R^d → R^{2p} :  x ↦ (1/√p) ( cos(⟨ω₁, x⟩), sin(⟨ω₁, x⟩), . . . , cos(⟨ω_p, x⟩), sin(⟨ω_p, x⟩) ).    (5.25)

Then it holds

    ⟨Φ̃(x), Φ̃(y)⟩ = Re( (1/p) Σ_{j=1}^p exp(−i⟨x, ωⱼ⟩) exp(i⟨y, ωⱼ⟩) ),

which converges to (5.24) in probability as p → ∞, by the law of large numbers (and the
fact that the kernel is real-valued). We can even quantify this convergence:
Proposition 5.18. If k can be represented as (5.24) and we draw (ω₁, . . . , ω_p) ~ i.i.d. P_f,
and define Φ̃ by (5.25), then for any x, y in R^d, δ ∈ [0, 1), with probability 1 − δ over
the draw of these frequencies, it holds

    | k(x, y) − ⟨Φ̃(x), Φ̃(y)⟩ | ≤ √( log(2δ⁻¹) / (2p) ).
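A quick numerical check of Proposition 5.18 for the Gaussian kernel, whose spectral distribution P_f is itself Gaussian under the convention above; the bandwidth, dimensions and failure probability are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, sigma = 3, 2000, 1.0

# For k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), P_f is N(0, sigma^{-2} I_d).
omegas = rng.normal(scale=1.0 / sigma, size=(p, d))

def phi(x):
    # Feature map (5.25): cos/sin features for each frequency, scaled by 1/sqrt(p).
    proj = omegas @ x
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(p)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
approx = phi(x) @ phi(y)
bound = np.sqrt(np.log(2 / 0.05) / (2 * p))   # Proposition 5.18 with delta = 0.05
print(f"exact {exact:.4f}  approx {approx:.4f}  bound {bound:.4f}")
```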
6 Introduction to statistical learning theory (part 3):
Rademacher complexities and VC theory
6.1 Introduction, reminders
Recall that in Section 3, we studied the behavior of statistical learning methods which
output a prediction function f̂ belonging to some class F which was assumed to be finite
or countable. The main mathematical tool was to obtain a uniform control of the form

    ∀f ∈ F :  | E(f) − Ê(f) | ≤ R(n, F, δ),    (6.1)

holding with probability at least 1 − δ over the draw of a sample of size n. (Please note:
we only consider the case of a uniform bound R(n, F, δ) independent of the function f; we
do not consider bounds depending on f as for instance (3.29) in this discussion.)
When F is finite, and the loss function is bounded, this was achieved as a consequence
of Hoeffding’s inequality (see Corollary 3.10), which gives control over a single function f,
then a union bound (see Proposition 3.11).
Let us also recall briefly how a uniform bound (6.1) leads to a bound on the risk of an
ERM over class F:
*** Proposition 6.1. Let us assume a learning setting (consisting of an observation space
X, a label space Y, a prediction space Ỹ, and a loss function ℓ : Ỹ × Y → R⁺). Assume
that (6.1) holds.
Let η > 0 be fixed and f̂_η denote an η-approximate ERM over the class F, that is,
f̂_η ∈ F and

    Ê(f̂_η) ≤ inf_{f∈F} Ê(f) + η.

Then it holds (with the same probability with which (6.1) holds) that

    E(f̂_η) ≤ inf_{f∈F} E(f) + 2R(n, F, δ) + η.
Proof. This is a repetition (in a more formal setting) of the argument leading to Proposi-
tion 3.13. Let ε > 0 and let f_ε ∈ F be such that E(f_ε) ≤ inf_{f∈F} E(f) + ε. Since f̂_η ∈ F,
and putting R = R(n, F, δ) for short, using twice (6.1):

    E(f̂_η) ≤ Ê(f̂_η) + R
           ≤ Ê(f_ε) + R + η
           ≤ E(f_ε) + 2R + η
           ≤ E*_F + 2R + η + ε,

and this holds for any ε > 0, hence the conclusion.
Observe that the proposition above is a purely deterministic one once the event (6.1)
is satisfied: the only probabilistic point is to establish that this event has probability
large enough (while the remainder R(n, F, δ) remains hopefully “reasonable”, in particular
converging to 0 as the sample size n grows). Additionally, even if the learning algorithm
fb is not ERM, the bound (6.1) allows us to give a confidence bound on the (unknown)
risk of fb based only on its empirical risk, whatever the algorithm used, since the bound is
uniform.
In view of the above considerations, a goal of interest is to extend the uniform con-
trol (6.1) to more general classes, in particular (uncountably) infinite. Observe that (6.1)
is equivalent to a probabilistic upper bound of the random variable

    Z_F^{|·|} := sup_{f∈F} | E(f) − Ê(f) |,    (6.2)

holding with probability 1 − δ. This is what we set out to do in the next sections, and we
will achieve it in two steps: (a) bound the deviations of Z_F^{|·|} from its expectation, with high
probability; (b) bound the expectation of Z_F^{|·|}. We will also be interested in similar bounds
for the closely related variables

    Z_F^+ := sup_{f∈F} ( Ê(f) − E(f) ),  and  Z_F^− := sup_{f∈F} ( E(f) − Ê(f) ).    (6.3)
Note. In complete generality, it cannot be ensured that the variable defined in (6.2)
is measurable, because a supremum of uncountably many measurable functions is not nec-
essarily measurable. We will ignore that point and assume implicitly that Z_F^{|·|} (and other
suprema) are measurable throughout the chapter. It can for example be assumed that there
exists a countable subset F̃ ⊂ F such that the suprema over F and F̃ coincide a.s.; this is
the case for most concrete prediction classes.
Exercise 6.1. In the case of a countably infinite set F, we had derived a bound of the
form (6.1) but with a bound R(n, f, F, δ) also depending on f due to the influence of the
“weight function”, see Proposition 3.14. Put R(f) := R(n, f, F, δ) for short.
Let η > 0 be fixed and f̂_η denote an η-approximate regularized ERM with regularization
function R(f) over class F, that is, f̂_η ∈ F and

    Ê(f̂_η) + R(f̂_η) ≤ inf_{f∈F} ( Ê(f) + R(f) ) + η.

Prove that if (6.1) holds (but with the function R(f) depending on f), then we have

    E(f̂_η) ≤ inf_{f∈F} ( E(f) + 2R(f) ) + η.
** Theorem 6.2 (Azuma-McDiarmid). Let X be a measurable space, X₁, . . . , Xₙ be
independent random variables with values in X, and f : X^n → R a measurable function
satisfying the following bounded-difference condition (Stab): there exist constants
c₁, . . . , cₙ ≥ 0 such that for each i, changing only the i-th argument of f changes its value
by at most 2cᵢ. Then for any t ≥ 0:

    P[ f(X₁, . . . , Xₙ) > E[f(X₁, . . . , Xₙ)] + t ] ≤ exp( − t² / ( 2 Σ_{i=1}^n cᵢ² ) ).    (6.4)
Proof of Theorem 6.2. Define the filtration F_i = σ(X₁, . . . , Xᵢ), 0 ≤ i ≤ n, and define
for i = 1, . . . , n the martingale Mᵢ := E[f(X₁, . . . , Xₙ)|F_i] − E[f(X₁, . . . , Xₙ)], and its
increments Δᵢ := E[f(X₁, . . . , Xₙ)|F_i] − E[f(X₁, . . . , Xₙ)|F_{i−1}]. Let us prove that Δᵢ
satisfies the boundedness assumption of Theorem 6.3. First, because (X₁, . . . , Xₙ) are
independent, conditioning on (X₁, . . . , Xᵢ) amounts to integrating with respect to
(X_{i+1}, . . . , Xₙ), thus

    E[f(X₁, . . . , Xₙ)|F_i] = ∫ f(X₁, . . . , Xᵢ, x_{i+1}, . . . , xₙ) P(dx_{i+1}, . . . , dxₙ).

Therefore
** Proposition 6.4. Consider a learning setting with ℓ a bounded loss function taking
values in [0, B], a class F of decision functions, and consider the random variables
Z_F^{|·|}, Z_F^+, Z_F^− defined by (6.2), (6.3). Denoting Z_F^• either of these variables, it holds that
Z_F^• is sub-Gaussian with parameter B²/(4n), and thus in particular

    P[ Z_F^• ≥ E[Z_F^•] + t ] ≤ exp( −2nt²/B² ).    (6.5)
Let us consider Z_F^+(Sₙ) as a function of the i.i.d. sample Sₙ = ((X₁, Y₁), . . . , (Xₙ, Yₙ)).
Consider a sample S̃ₙ^{(i)} obtained by replacing (Xᵢ, Yᵢ) by (Xᵢ′, Yᵢ′) in Sₙ. Using (6.6), we
obtain

    Z_F^+(Sₙ) − Z_F^+(S̃ₙ^{(i)}) = sup_{f∈F} ( Ê(f, Sₙ) − E(f) ) − sup_{f∈F} ( Ê(f, S̃ₙ^{(i)}) − E(f) )
        ≤ sup_{f∈F} ( Ê(f, Sₙ) − Ê(f, S̃ₙ^{(i)}) )
        = sup_{f∈F} (1/n) ( ℓ(f(Xᵢ), Yᵢ) − ℓ(f(Xᵢ′), Yᵢ′) )
        ≤ B/n,

so that (Stab) is satisfied with cᵢ = B/(2n). We conclude by applying Theorem 6.2. The case
of the other variables Z_F^{|·|}, Z_F^− is similar.
    E_{Sₙ}[Z_F^−] = E_{Sₙ}[ sup_{f∈F} ( E(f) − Ê(f, Sₙ) ) ] ≤ (2/n) E_{Sₙ,(σᵢ)_{1≤i≤n}}[ sup_{f∈F} Σ_{i=1}^n σᵢ ℓ(f(Xᵢ), Yᵢ) ].    (6.7)

The same inequality as above holds for E_{Sₙ}[Z_F^+], while

    E_{Sₙ}[Z_F^{|·|}] = E_{Sₙ}[ sup_{f∈F} | E(f) − Ê(f, Sₙ) | ] ≤ (2/n) E_{Sₙ,(σᵢ)_{1≤i≤n}}[ sup_{f∈F} | Σ_{i=1}^n σᵢ ℓ(f(Xᵢ), Yᵢ) | ].    (6.8)
quantity

    R_{P,n}(G) := E_{Sₙ,σ}[ sup_{g∈G} Σ_{i=1}^n σᵢ g(Wᵢ) ].    (6.9)

This can be interpreted as the averaged maximal “width” of the set G(W) projected in
the direction of the random Rademacher vector σ. Hence the above quantities are also
known as Rademacher widths, which play an important role in high-dimensional geometry.
With this definition and notation, we can rewrite (6.7) and (6.8) as:

    E_{Sₙ}[Z_F^±] = E_{Sₙ}[ sup_{f∈F} ±( E(f) − Ê(f, Sₙ) ) ] ≤ (2/n) R_{P,n}(ℓ ◦ F),    (6.11)

    E_{Sₙ}[Z_F^{|·|}] = E_{Sₙ}[ sup_{f∈F} | E(f) − Ê(f, Sₙ) | ] ≤ (2/n) R_{P,n}^{|·|}(ℓ ◦ F),    (6.12)

where

    ℓ ◦ F := { g : X × Y → R, (x, y) ↦ ℓ(f(x), y), f ∈ F }.    (6.13)
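For intuition, R_{P,n}(ℓ ◦ F) can be estimated by Monte Carlo when F is small; the class (threshold classifiers on R with 0-1 loss), the data distribution and the sample sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_mc = 50, 2000
thresholds = np.linspace(-2.0, 2.0, 21)   # F = {x -> 1{x > t}}: a small finite class

total = 0.0
for _ in range(n_mc):
    X = rng.normal(size=n)
    Y = (X + 0.5 * rng.normal(size=n) > 0).astype(int)   # noisy labels
    sigma = rng.choice([-1.0, 1.0], size=n)
    # losses[k, i] = 0-1 loss of threshold t_k at (X_i, Y_i): an element g of l o F
    losses = ((X[None, :] > thresholds[:, None]).astype(int) != Y[None, :]).astype(float)
    total += np.max(losses @ sigma)        # sup_{g in l o F} sum_i sigma_i g(W_i)

print("Monte Carlo estimate of R_P,n(l o F):", total / n_mc)
```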
distribution as Sₙ (sometimes called “ghost sample” or “independent copy” of Sₙ), and
replace E(f) = E_{Sₙ′}[Ê(f, Sₙ′)]:

    E_{Sₙ}[ sup_{f∈F} ( Ê(f, Sₙ) − E(f) ) ] = E_{Sₙ}[ sup_{f∈F} ( Ê(f, Sₙ) − E_{Sₙ′}[Ê(f, Sₙ′)] ) ]
        = E_{Sₙ}[ sup_{f∈F} E_{Sₙ′}[ Ê(f, Sₙ) − Ê(f, Sₙ′) ] ]
        ≤ E_{Sₙ,Sₙ′}[ sup_{f∈F} ( Ê(f, Sₙ) − Ê(f, Sₙ′) ) ]
        = E_{Sₙ,Sₙ′}[ sup_{f∈F} (1/n) Σ_{i=1}^n ( ℓ(f(Xᵢ), Yᵢ) − ℓ(f(Xᵢ′), Yᵢ′) ) ]
        = E_{Sₙ,Sₙ′}[ (1/n) sup_{g∈G} Σ_{i=1}^n ( g(Wᵢ) − g(Wᵢ′) ) ],

where for the inequality we have used sup_{t∈T} E[U_t] ≤ E[sup_{t∈T} U_t] for a family of real-
valued variables (U_t)_{t∈T}; and we have used the notation G := ℓ ◦ F as defined in (6.13) and
Wᵢ := (Xᵢ, Yᵢ), Wᵢ′ := (Xᵢ′, Yᵢ′) for shortening.
The second step is based on the observation that the distribution of (Sₙ, Sₙ′) is unchanged
if we swap Wᵢ and Wᵢ′ between the two samples. Hence the above double expectation
remains unchanged by this operation, which flips the sign of the i-th term in the sum
inside the expectation. Now, given arbitrary fixed signs (σᵢ)_{1≤i≤n} = σ ∈ {−1, 1}^n, consider
swapping Wᵢ and Wᵢ′ if σᵢ = −1 and leaving them unchanged if σᵢ = 1; again the expectation
is unchanged while the sign of the i-th term in the sum inside of the expectation is multiplied
by σᵢ. Thus

    ∀σ ∈ {−1, 1}^n :  E_{Sₙ,Sₙ′}[ sup_{g∈G} Σ_{i=1}^n ( g(Wᵢ) − g(Wᵢ′) ) ] = E_{Sₙ,Sₙ′}[ sup_{g∈G} Σ_{i=1}^n σᵢ ( g(Wᵢ) − g(Wᵢ′) ) ].
Hence, the above quantity also remains the same if we take an expectation over random
signs (σ₁, . . . , σₙ) having any joint distribution; it turns out that it is most fruitful to take
i.i.d. Rademacher variables. Finally, we notice that

    E_{Sₙ,Sₙ′,σ}[ sup_{g∈G} Σ_{i=1}^n σᵢ ( g(Wᵢ) − g(Wᵢ′) ) ]
        ≤ E_{Sₙ,Sₙ′,σ}[ sup_{g∈G} Σ_{i=1}^n σᵢ g(Wᵢ) + sup_{g∈G} Σ_{i=1}^n (−σᵢ) g(Wᵢ′) ]
        = E_{Sₙ,σ}[ sup_{g∈G} Σ_{i=1}^n σᵢ g(Wᵢ) ] + E_{Sₙ′,σ}[ sup_{g∈G} Σ_{i=1}^n σᵢ g(Wᵢ′) ]
        = 2 R_{P,n}(G);

collecting the above inequalities yields the conclusion (the argument when introducing
absolute values is entirely similar).
Taking stock, combining the results of Proposition 6.4 and (6.11)-(6.12) we obtain the
following corollary:
** Corollary 6.7. Consider a learning setting with ℓ a bounded loss function taking
values in [0, B], a class F of decision functions, and an i.i.d. sample Sₙ of size n.
For any fixed δ ∈ (0, 1), each of the following inequalities holds with probability at
least 1 − δ over the draw of Sₙ:

    sup_{f∈F} ( Ê(f) − E(f) ) ≤ (2/n) R_{P,n}(ℓ ◦ F) + B √( log δ⁻¹ / (2n) );    (6.14)

    sup_{f∈F} ( E(f) − Ê(f) ) ≤ (2/n) R_{P,n}(ℓ ◦ F) + B √( log δ⁻¹ / (2n) );    (6.15)

    sup_{f∈F} | Ê(f) − E(f) | ≤ (2/n) R_{P,n}^{|·|}(ℓ ◦ F) + B √( log δ⁻¹ / (2n) ).    (6.16)
(e) RP,n (Conv(F)) = RP,n (F), where Conv(F) is the set of finite convex combina-
tions of elements of F.
Proof. Point (a) is straightforward and left as an exercise, as is the first part of point (b).
For the second part of point (b), by Jensen’s inequality:

    R_{P,n}^{|·|}({h}) = E_{Sₙ,σ}[ | Σ_{i=1}^n σᵢ h(Wᵢ) | ]
        ≤ ( E_{Sₙ,σ}[ ( Σ_{i=1}^n σᵢ h(Wᵢ) )² ] )^{1/2}
        = ( E_{Sₙ,σ}[ Σ_{i=1}^n σᵢ² h(Wᵢ)² ] )^{1/2}    (cross terms vanish since E[σᵢσⱼ] = 0 for i ≠ j)
        = √n ∥h∥_{L²(P)}.
For point (c) we have, by symmetry of the distribution of the vector of random signs σ:

    R_{P,n}(aF) = E_{Sₙ,σ}[ sup_{f∈F} Σ_{i=1}^n σᵢ a f(Wᵢ) ]
        = E_{Sₙ,σ}[ |a| sup_{f∈F} Σ_{i=1}^n σᵢ f(Wᵢ) ]    (replacing σ by sign(a)σ, which has the same distribution)
        = |a| R_{P,n}(F).
For point (e): let us denote Conv₂(F) := {λf + (1 − λ)g; f, g ∈ F; λ ∈ [0, 1]} the set
of 2-point convex combinations of elements of F. It holds

    R_{P,n}(Conv₂(F)) = E_{Sₙ,σ}[ sup_{λ∈[0,1]} sup_{f,g∈F} Σ_{i=1}^n σᵢ ( λf(Wᵢ) + (1 − λ)g(Wᵢ) ) ]
        = E_{Sₙ,σ}[ sup_{λ∈[0,1]} ( λ sup_{f∈F} Σ_{i=1}^n σᵢ f(Wᵢ) + (1 − λ) sup_{g∈F} Σ_{i=1}^n σᵢ g(Wᵢ) ) ]
        = E_{Sₙ,σ}[ sup_{f∈F} Σ_{i=1}^n σᵢ f(Wᵢ) ]
        = R_{P,n}(F).
By straightforward recursion it holds R_{P,n}(Conv_{2^k}(F)) = R_{P,n}(F) for any integer k ≥ 0,
and finally since Conv(F) = ∪_{k≥0} Conv_{2^k}(F) we obtain the result by monotone conver-
gence.
The following property is extremely useful in learning theory.
Lemma 6.10. Let (A_t)_{t∈I}, (B_t)_{t∈I} be families of real numbers indexed by a countable set
I, and γ : R → R an L-Lipschitz function. Let σ be a single Rademacher variable (random
sign). Then

    E[ sup_{t∈I} ( A_t + σ γ(B_t) ) ] ≤ E[ sup_{t∈I} ( A_t + L σ B_t ) ].    (6.18)
Proof. We have

    E[ sup_{t∈I} ( A_t + σ γ(B_t) ) ] = ½ ( sup_{t∈I} ( A_t + γ(B_t) ) + sup_{t∈I} ( A_t − γ(B_t) ) )
        = ½ sup_{t,t′∈I} ( A_t + A_{t′} + γ(B_t) − γ(B_{t′}) )
        ≤ ½ sup_{t,t′∈I} ( A_t + A_{t′} + L|B_t − B_{t′}| )
        = ½ sup_{t,t′∈I} ( A_t + A_{t′} + L(B_t − B_{t′}) )
        = ½ ( sup_{t∈I} ( A_t + L B_t ) + sup_{t∈I} ( A_t − L B_t ) )
        = E[ sup_{t∈I} ( A_t + L σ B_t ) ].
Note that the “magic” happens in the equality just after the Lipschitz inequality. By
symmetry between t, t′ we can remove the absolute value! It is worth pausing to think
about it.
Proof of Proposition 6.9. Let n be fixed: we will prove by recursion the following property
(for m ≤ n):

    H(m) :  R_{P,n}(ℓ ◦ F) ≤ E[ sup_{f∈F} ( L Σ_{i=1}^m σᵢ f(Xᵢ) + Σ_{i=m+1}^n σᵢ ℓ(f(Xᵢ), Yᵢ) ) ],

where the expectation is over Sₙ and σ. Observe that H(0) is obvious (it is the definition of
R_{P,n}(ℓ ◦ F)) and H(n) is what we want to prove. Assuming H(m − 1) holds for 1 ≤ m ≤ n,
put A_f := L Σ_{i=1}^{m−1} σᵢ f(Xᵢ) + Σ_{i=m+1}^n σᵢ ℓ(f(Xᵢ), Yᵢ) and B_f := f(Xₘ); then H(m − 1) reads

    R_{P,n}(ℓ ◦ F) ≤ E[ sup_{f∈F} ( A_f + σₘ ℓ(B_f, Yₘ) ) ].

Performing the expectation over σₘ first, conditionally to the other random variables
(namely Sₙ and the signs (σᵢ)_{i≠m}): since A_f and B_f do not depend on σₘ, they can be
considered constants in this conditional expectation, and by independence from the rest
σₘ is still a random sign conditionally to the rest. We can therefore apply Lemma 6.10
(with γ(·) = ℓ(·, Yₘ), considered as fixed since we argue conditionally to Yₘ), which yields
H(m) and completes the recursion.
*** Proposition 6.11. Let k be a spsd kernel on a nonempty set X, H the associated
RKHS, and for R ≥ 0,

    B_H(R) := { f ∈ H : ∥f∥_H ≤ R }

the closed ball of radius R in H centered at the origin. Then

    R_{P,n}^{|·|}(B_H(R)) ≤ √n R √( E_{X∼P}[k(X, X)] ).    (6.19)
Proof. It holds, using the Cauchy-Schwarz then Jensen’s inequality:

    R_{P,n}^{|·|}(B_H(R)) = E[ sup_{f∈B_H(R)} | Σ_{i=1}^n σᵢ f(Xᵢ) | ]
        = E[ sup_{f∈B_H(R)} | Σ_{i=1}^n σᵢ ⟨f, k_{Xᵢ}⟩_H | ]
        = E[ sup_{f∈B_H(R)} | ⟨ f, Σ_{i=1}^n σᵢ k_{Xᵢ} ⟩_H | ]
        ≤ E[ sup_{f∈B_H(R)} ∥f∥_H ∥ Σ_{i=1}^n σᵢ k_{Xᵢ} ∥_H ]
        ≤ R ( E[ ∥ Σ_{i=1}^n σᵢ k_{Xᵢ} ∥²_H ] )^{1/2}
        = R ( E_{Sₙ} E_σ [ Σ_{i,j=1}^n σᵢσⱼ k(Xᵢ, Xⱼ) ] )^{1/2}
        = R ( n E_{X∼P}[k(X, X)] )^{1/2}.
Notice in particular the following interesting fact: the Rademacher complexity of an RKHS
ball depends on its radius, but not on the dimensionality (which might be infinite), provided
the kernel is bounded. This has a number of interesting consequences.
* Proposition 6.12. Let k be a spsd kernel on a nonempty set X, with sup_{x∈X} k(x, x) ≤ M²;
let H be the associated RKHS and R > 0 be fixed. Assume ℓ is a loss function (with prediction
space Ỹ = R) such that
(a) ℓ takes values in [0, B];
(b) ℓ(·, y) is L-Lipschitz in its first argument, for every y ∈ Y.
Let f̂ₙ be an estimator acting on a sample Sₙ of size n and such that f̂ₙ ∈ B_H(R) a.s.
Then for any δ ∈ (0, 1), with probability larger than 1 − δ over the draw of the i.i.d. sample
Sₙ it holds:

    E(f̂ₙ) − Ê(f̂ₙ) ≤ sup_{f∈B_H(R)} ( E(f) − Ê(f) ) ≤ (1/√n) ( 2LRM + B √( log δ⁻¹ / 2 ) ).    (6.21)
Proof. Start by noticing that, by the usual argument based on the reproducing property,
any function f ∈ B_H(R) satisfies |f(x)| = |⟨f, k_x⟩_H| ≤ ∥f∥_H ∥k_x∥_H ≤ RM. For this reason,
we may consider that the prediction space Ỹ is [−MR, MR]. We then have, with probability
≥ 1 − δ:

    E(f̂ₙ) − Ê(f̂ₙ) ≤ sup_{f∈B_H(R)} ( E(f) − Ê(f) )
        ≤ (2/n) R_{P,n}^{|·|}(ℓ ◦ B_H(R)) + B √( log δ⁻¹ / (2n) )    (Corollary 6.7, eq. (6.16))
        ≤ (2L/n) R_{P,n}^{|·|}(B_H(R)) + B √( log δ⁻¹ / (2n) )    (Proposition 6.9)
        ≤ 2LRM/√n + B √( log δ⁻¹ / (2n) )    (Proposition 6.11).
* Corollary 6.13. Consider the same setting as in Proposition 6.12, with the squared loss
function ℓ(ỹ, y) = (ỹ − y)², and Y = [−A, A] for some A > 0 (bounded regression: we
assume that the label is always bounded by A in absolute value). Then for any δ ∈ (0, ½],
with probability larger than 1 − δ it holds:

    E(f̂ₙ) − Ê(f̂ₙ) ≤ sup_{f∈B_H(R)} ( E(f) − Ê(f) ) ≤ C √( log δ⁻¹ ) / √n,    (6.22)

where C := 6(A + RM)².
Proof. We check that the assumptions on the loss function of Proposition 6.12 are satisfied
with appropriate constants. For any ỹ, y with |ỹ| ≤ MR, |y| ≤ A it holds ℓ(ỹ, y) =
(ỹ − y)² ≤ (A + MR)² =: B (satisfying loss boundedness assumption (a)), and additionally
for any ỹ′ with |ỹ′| ≤ MR:

    |ℓ(ỹ, y) − ℓ(ỹ′, y)| = |(ỹ − y)² − (ỹ′ − y)²| = |ỹ + ỹ′ − 2y| |ỹ − ỹ′| ≤ 2(A + RM) |ỹ − ỹ′|,

so Lipschitz assumption (b) is satisfied with L := 2√B. We therefore have the high-
probability bound (6.21); we can further upper bound 2LRM by 4B, and finally use that
4 ≤ 8 √( (log δ⁻¹)/2 ), since δ ≤ 1/2 by assumption.
We now consider the analysis of kernel ridge regression (regularized least squares ERM).

* Proposition 6.14 (Oracle-type inequality for krr). We consider the same assumptions
as in Corollary 6.13: squared loss, bounded regression with labels bounded by A > 0 in
absolute value, kernel bounded by M². For λ ∈ (0, M²] define the kernel ridge regression
(krr) estimator, based on a sample Sₙ of size n, as

    f̂_λ ∈ Arg Min_{f∈H} ( Ê(f) + λ∥f∥²_H ).    (6.23)

Then for any δ ∈ (0, ½], with probability larger than 1 − δ it holds:

    E(f̂_λ) + λ ∥f̂_λ∥²_H ≤ min_{f∈H} ( E(f) + λ∥f∥²_H ) + c ( A²M² / (λ√n) ) √( log δ⁻¹ ),    (6.24)

where c is a numerical constant.
Proof. We start with noticing that the norm of f̂_λ must be bounded. Namely, by the
definition (6.23) of the estimator, it must correspond to a lower objective function than
the constant 0 function (denoted 0), thus

    ∥f̂_λ∥²_H ≤ λ⁻¹ ( Ê(f̂_λ) + λ∥f̂_λ∥²_H )
            ≤ λ⁻¹ ( Ê(0) + λ∥0∥²_H )
            = (1/(λn)) Σ_{i=1}^n (Yᵢ − 0)²
            ≤ A²/λ.
Therefore, f̂_λ ∈ B_H(R) with R = A/√λ. Applying Corollary 6.13, we get that (6.22) is
satisfied with probability at least 1 − δ, in particular

    E(f̂_λ) ≤ Ê(f̂_λ) + C √( log δ⁻¹ ) / √n,    (6.25)

with C = 6(A + RM)² = 6A²(1 + M/√λ)² ≤ 24A²M²/λ, since M/√λ ≥ 1 by assumption.
Let now f*_λ ∈ Arg Min_{f∈H} ( E(f) + λ∥f∥²_H ). By an argument similar to the above, it must
hold f*_λ ∈ B_H(R), and, provided (6.22) is satisfied:

    Ê(f*_λ) ≤ E(f*_λ) + C √( log δ⁻¹ ) / √n.    (6.26)
Now, using (6.25) and (6.26) as well as the definitions of f̂_λ, f*_λ, we get (with probability
1 − δ of the event (6.22) being satisfied):

    E(f̂_λ) + λ∥f̂_λ∥²_H ≤ Ê(f̂_λ) + λ∥f̂_λ∥²_H + C √( log δ⁻¹ ) / √n
        ≤ Ê(f*_λ) + λ∥f*_λ∥²_H + C √( log δ⁻¹ ) / √n
        ≤ E(f*_λ) + λ∥f*_λ∥²_H + 2C √( log δ⁻¹ ) / √n
        ≤ min_{f∈H} ( E(f) + λ∥f∥²_H ) + 48 ( A²M²/λ ) √( log δ⁻¹ ) / √n.
** Corollary 6.15. Under the same assumptions as for Proposition 6.14, assume addi-
tionally that X is a compact topological space and that the kernel k is universal on
X. Let (λₙ)_{n≥1} be a sequence of regularization parameters such that λₙ → 0 and
λₙ √(n / log n) → ∞ as n → ∞. Then for any distribution P of the data, if (Xᵢ, Yᵢ)_{i≥1}
is an i.i.d. sequence from P, and f̂^{(n)}_{λₙ} denotes the krr estimator trained using sample
Sₙ = ((X₁, Y₁), . . . , (Xₙ, Yₙ)) with regularization parameter λₙ, it holds that

    E(f̂^{(n)}_{λₙ}) → E*  almost surely as n → ∞.
Proof. Let δₙ = n⁻², and let Aₙ denote the event (6.24) for the sample Sₙ and estimator f̂^{(n)}_{λₙ}.
Since Σ_{n≥1} P[Aₙᶜ] ≤ Σ_{n≥1} n⁻² < ∞, by the Borel-Cantelli lemma P( ∩_{k≥1} ∪_{n≥k} Aₙᶜ ) = 0,
i.e., for almost every ω ∈ Ω there exists a (random) integer n₀(ω) such that ω ∈ Aₙ for all
n ≥ n₀(ω) – in other words the events Aₙ are satisfied for all n ≥ n₀(ω).
Next, let ε > 0 be fixed; we establish that there exists f_ε ∈ H such that E(f_ε) ≤ E* + ε.
Namely, since Y = [−A, A], f*(x) = E[Y |X = x] (for regression with quadratic loss)
also takes values in [−A, A] and thus belongs to L²(X, P). Furthermore, we know that
E(f) − E* = E[(f(X) − f*(X))²] = ∥f − f*∥²_{L²(X,P)}. Since k is universal, H is dense in
C(X), which itself is dense in L²(X, P). Hence we can find such an f_ε.
We now have, for any ω ∈ Ω, n ≥ n₀(ω) and any ε > 0, using (6.24):

    E(f̂^{(n)}_{λₙ}) ≤ min_{f∈H} ( E(f) + λₙ∥f∥²_H ) + c ( A²M² / (λₙ√n) ) √( log δₙ⁻¹ )
        ≤ E(f_ε) + λₙ∥f_ε∥²_H + c ( A²M² / (λₙ√n) ) √( log δₙ⁻¹ )
        ≤ E* + ε + λₙ∥f_ε∥²_H + c ( A²M² / λₙ ) √( 2 log n / n );

by the assumptions on λₙ we deduce lim supₙ E(f̂^{(n)}_{λₙ}) ≤ E* + ε a.s. for any ε > 0, and get
the conclusion.
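The consistency statement can be watched numerically; the target function, the bounded noise, and the schedule λₙ = n^{−1/4} (which satisfies the assumptions above) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def gram(A, B, s=0.3):
    # Gaussian kernel: universal on compact subsets of R
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * s ** 2))

def excess_risk(n):
    X = rng.uniform(-1, 1, n)
    Y = np.sin(3 * X) + 0.3 * rng.uniform(-1, 1, n)   # bounded labels, f*(x) = sin(3x)
    lam = n ** -0.25                                  # lam_n -> 0, lam_n sqrt(n/log n) -> inf
    # (6.23) uses the averaged empirical risk, so the ridge term in the system is n * lam
    beta = np.linalg.solve(gram(X, X) + n * lam * np.eye(n), Y)
    Xt = np.linspace(-1, 1, 400)
    f_hat = gram(Xt, X) @ beta
    return np.mean((f_hat - np.sin(3 * Xt)) ** 2)     # approximates E(fhat) - E*

for n in [50, 200, 800]:
    print(n, round(excess_risk(n), 4))   # excess risk should decrease with n
```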
to upper bound the complexity (observe that Corollary 6.7 still holds, as the loss function
is bounded by B = 1).
For fixed sˣₙ = (x₁, . . . , xₙ) ∈ X^n, denote G(sˣₙ, F) := { G(sˣₙ, f), f ∈ F }. Then it holds,
for any distribution P on X and Sₙ an i.i.d. sample from P:

    R_{P,n}(ℓ ◦ F) ≤ √(2n) E_{Sₙˣ∼P_X^{⊗n}}[ √( log |G(Sₙˣ, F)| ) ].    (6.28)
Proof. Put γ := E[ sup_{i=1,...,K} ξᵢ ]. Then for any λ > 0:

    exp(λγ) ≤ E[ exp( λ sup_{i=1,...,K} ξᵢ ) ]    (Jensen)
        = E[ sup_{i=1,...,K} exp(λξᵢ) ]
        ≤ Σ_{i=1}^K E[ exp(λξᵢ) ]
        ≤ K exp( σ²λ²/2 )    (sub-Gaussianity).

We deduce γ ≤ (log K)/λ + λσ²/2, which gives the claim when choosing λ = √(2 log K)/σ.
Proof of Theorem 6.16. For a sample Sₙ = ((Xᵢ, Yᵢ))_{1≤i≤n} denote

    G̃(Sₙ, f) := ( 1{f(X₁) ≠ Y₁}, . . . , 1{f(Xₙ) ≠ Yₙ} ) ∈ {0, 1}^n,

and define G̃(Sₙ, F) := { G̃(Sₙ, f), f ∈ F }. Observe that |G̃(Sₙ, F)| = |G(Sₙˣ, F)|. With
this notation we have

    R_{P,n}(ℓ ◦ F) = E[ sup_{f∈F} Σ_{i=1}^n σᵢ 1{f(Xᵢ) ≠ Yᵢ} ] = E[ sup_{u∈G̃(Sₙ,F)} Σ_{i=1}^n σᵢ u(i) ].
While the bound (6.28) is nice, it turns out that in most interesting cases we can upper
bound |G(sˣₙ, F)| uniformly over all samples sˣₙ, as a function of F only, and thus bound the
expectation in (6.28) independently of the distribution P!
** Theorem 6.19 (Vapnik/Sauer). Let F be a set of functions from X to {0, 1}. Define
Proof. For a subset A ⊆ {0, 1}^n and I ⊆ JnK := {1, . . . , n}, let (i₁, . . . , i_{|I|}) be the ordered
elements of I and denote A_I ⊆ {0, 1}^{|I|} the projection of A onto the coordinates of indices
(i₁, . . . , i_{|I|}). For coherence, if I = ∅, we define A_∅ = ∅. Let us call a subset of indices I
(possibly empty) shattered by A if A_I = {0, 1}^{|I|} (we define {0, 1}⁰ = ∅).
We will establish the following: for any A ⊆ {0, 1}^n,

    |A| ≤ | { I ⊆ JnK : A_I = {0, 1}^{|I|} } |;    (6.32)

in words, the cardinality of A is upper bounded by the number of index sets I shattered
by A.
We prove this by recursion on n. For n = 1 it is true since I = ∅ is always shattered,
and if A = {0, 1} then I = {1} is also shattered.
Assume the property is true for some n ≥ 1, and let A ⊆ {0, 1}^{n+1}. Let Ã ⊆ {0, 1}^n be the
set of elements ã ∈ {0, 1}^n such that both (ã, 0) and (ã, 1) belong to A. Let us also denote
A′ := A_{JnK}. Then it holds that

    |A| = |A′| + |Ã|.    (6.33)

By recursion, both A′ and Ã satisfy (6.32). For I ⊆ JnK, it holds A′_I = A_I. Hence

    |A′| ≤ | { I ⊆ JnK : A_I = {0, 1}^{|I|} } |
         = | { I ⊆ Jn + 1K s.t. (n + 1) ∉ I, A_I = {0, 1}^{|I|} } |.    (6.34)

Moreover, if I ⊆ JnK is shattered by Ã, then I ∪ {n + 1} is shattered by A, so that

    |Ã| ≤ | { I ⊆ Jn + 1K s.t. (n + 1) ∈ I, A_I = {0, 1}^{|I|} } |.    (6.35)

Noticing that the sets of indices concerned in (6.34) and (6.35) are disjoint and putting back
into (6.33), we obtain the property (6.32) for (n + 1).
We now apply property (6.32) to the set A = G(Sₙ, F), where Sₙ ∈ X^n is arbitrary.
By assumption (6.30), the largest possible cardinality of a shattered index set I (which
determines a sub-sample of size |I|) is d. Hence

    |G(Sₙ, F)| ≤ | { I ⊆ JnK : |I| ≤ d } | = Σ_{i=0}^d (n choose i).

Taking a supremum over all possible Sₙ ∈ X^n yields the first inequality in (6.31).
Finally, note that (n choose i) ≤ nⁱ and use the binomial formula for the second inequality.
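A tiny numerical illustration of (6.31): for threshold classifiers on R (VC dimension 1), the number of distinct labelings of n points is exactly n + 1, matching the binomial-sum bound; the class and the points below are illustrative assumptions.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n, d = 10, 1                                   # thresholds {x -> 1{x > t}} have VC dim 1

X = np.sort(rng.normal(size=n))
# one threshold below all points, one between each consecutive pair, one above all
ts = np.concatenate([[X[0] - 1.0], (X[:-1] + X[1:]) / 2.0, [X[-1] + 1.0]])
labelings = {tuple((X > t).astype(int)) for t in ts}

print(len(labelings))                          # n + 1 realized labelings
print(sum(comb(n, i) for i in range(d + 1)))   # binomial-sum bound: n + 1 (tight here)
print((n + 1) ** d)                            # the cruder bound (n + 1)^d
```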
It is possible to get an exact value of, or an upper bound on, the VC dimension (6.30), or
sometimes directly on the growth function (6.29). A first fundamental fact is the
following bound on linear discrimination function classes.
** Proposition 6.20. Let X = R^d and F = { x ↦ 1{⟨x, w⟩ > 0}, w ∈ R^d } be the set of
linear classifiers without offset (i.e. indicators of half-spaces whose boundary contains
the origin). Then the VC dimension of F is equal to d.
Proof. First, we prove that the VC-dimension of the class F of half-spaces with boundary
going through the origin is at least d. For this, simply consider the d-uple S_d of points of
R^d formed by the canonical basis (xᵢ = eᵢ, i = 1, . . . , d). Let I ⊆ JdK be any index set, and
denote Iᶜ := JdK \ I. Define the vector w_I by (w_I)ᵢ = 1 if i ∈ I, and (w_I)ᵢ = −1 if i ∉ I.
Then obviously, if f_w(x) := 1{⟨x, w⟩ > 0}, we have G(S_d, f_{w_I}) = (1{i ∈ I})_{1≤i≤d}. Since this
works for any I, we have G(S_d, F) = {0, 1}^d and |G(S_d, F)| = 2^d.
Conversely, we prove that any family S_{d+1} = (x₁, . . . , x_{d+1}) of (d + 1) vectors in R^d
cannot be “shattered” by F (i.e. |G(S_{d+1}, F)| < 2^{d+1}). Since we are in dimension d, there is
a nontrivial linear combination of these vectors that vanishes: ∃λ = (λ₁, . . . , λ_{d+1}) ∈ R^{d+1}
such that λ ≠ 0 and Σ_{i=1}^{d+1} λᵢxᵢ = 0. Let I := { i ∈ Jd+1K : λᵢ > 0 }. Without loss of generality
we can assume I ≠ ∅ (otherwise replace λ by −λ). Let w be any vector such that ⟨w, xᵢ⟩ > 0
for all i ∈ I. Then

    0 < Σ_{i∈I} λᵢ ⟨w, xᵢ⟩ = ⟨ w, Σ_{i∈I} λᵢxᵢ ⟩ = ⟨ w, − Σ_{i∈Iᶜ} λᵢxᵢ ⟩ = Σ_{i∈Iᶜ} (−λᵢ) ⟨w, xᵢ⟩,

and since −λᵢ ≥ 0 for i ∈ Iᶜ, there is at least one i ∈ Iᶜ such that ⟨w, xᵢ⟩ > 0, i.e. f_w(xᵢ) = 1.
It means that the labeling (1{i ∈ I})_{1≤i≤d+1} ∉ G(S_{d+1}, F), therefore |G(S_{d+1}, F)| < 2^{d+1}.
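The first half of the proof can be checked mechanically: the weight vectors w_I realize every labeling of the canonical basis. A small illustrative script:

```python
import numpy as np
from itertools import product

d = 3
basis = np.eye(d)                       # the points S_d = (e_1, ..., e_d)

realized = set()
for bits in product([0, 1], repeat=d):
    w = np.where(np.array(bits) == 1, 1.0, -1.0)   # w_I: +1 on I, -1 outside
    realized.add(tuple((basis @ w > 0).astype(int)))

print(len(realized) == 2 ** d)          # True: the canonical basis is shattered
```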
** Corollary 6.21. Let G be a linear space of real-valued functions on a set X, of finite
dimension d. Then the VC dimension of F := { x ↦ 1{g(x) > 0}, g ∈ G } is equal to d.

Proof. Let (g₁, . . . , g_d) be a basis of G. Define the mapping A(x) := (g₁(x), . . . , g_d(x)) ∈ R^d.
Since G = Span{g₁, . . . , g_d}, it holds that F = { x ↦ 1{⟨w, A(x)⟩ > 0}, w ∈ R^d }. There-
fore, for any family of points S_k = (x₁, . . . , x_k) that is “shattered” by F (i.e. |G(S_k, F)| =
2^k), the family (A(x₁), . . . , A(x_k)) of elements of R^d is shattered by linear classifiers
without offset; hence from Proposition 6.20 it must be the case that k ≤ d.
On the other hand, notice that Span(A(X)) = R^d. If this was not the case, there would
exist u ∈ R^d, u ≠ 0, such that ⟨u, A(x)⟩ = Σ_{i=1}^d uᵢgᵢ(x) = 0 for all x ∈ X, contradicting
the fact that (g₁, . . . , g_d) are independent. We can therefore find S_d = (x₁, . . . , x_d) ∈ X^d
such that (A(x₁), . . . , A(x_d)) are linearly independent vectors in R^d. A family of independent
vectors in R^d is shattered by linear classifiers (repeat the argument of Proposition 6.20 after
change of basis, i.e. choose w̃_I = (Mᵗ)⁻¹ ( Σ_{i∈I} eᵢ − Σ_{j∈Iᶜ} eⱼ ) for any subset I ⊆ JdK,
where M is the matrix of columns (A(x₁), . . . , A(x_d))). Hence S_d is shattered by F.
As an example of application of the previous corollary, the set of binary classifiers defined
from polynomial score functions of degree up to k in each variable, on an open subset of R^d,
has VC dimension equal to (k + 1)^d (the dimension of the corresponding space of polynomials).
Prove that the VC dimension of F is equal to d. Hint: recycle the proof of Proposition 6.20
with appropriate changes. Deduce that if G is a linear space of dimension d of real-valued
functions on an arbitrary set X, and f a fixed real-valued function on X, then the VC
dimension of

    F := { x ↦ 1{g(x) + f(x) > 0}, g ∈ G }

is equal to d.

Consider now X = R^d and the set of linear classifiers with offset (i.e. indicators of affine
half-spaces). Prove that the VC dimension of F is equal to d + 1.
The final layer (say K) consists of a single neuron (n_K = 1) and outputs the prediction of
the network.
Thus, an ANN is parametrized by Σ_{i=1}^K nᵢ n_{i−1} real parameters corresponding to the
weight vectors of each AN. It is possible to reduce this dimensionality by specifying that
the activation weight vector of a given AN is restricted to a specific support of reduced
size k (i.e. the weights outside of the support are 0). It means that this AN can only use
the output of a specific (given) subset of size k of the ANs of the previous layer.
In this section, we will not explain how to construct ANNs from data, but give a rough
analysis of their statistical complexity using a simplified model. We will consider the sign
activation function, and thus assume that the k-th layer is in fact acting on {−1, 1}^{n_{k−1}};
we will also assume that the input space is binary, i.e. {−1, 1}^d, for simplicity. Finally, we
also assume that a constant output neuron (equal to 1) is added in each layer (thus providing
the means to add an offset to the linear part of each AN), including on the input data.
Proposition 6.22. Given an input space {−1, 1}^d, and using the sign activation function,
it is possible to find a weight vector implementing the “OR” and “AND” functions on a
specific subset I, that is, there exist (w_I, a) and (w_I′, a′) in R^{d+1} (the extra parameter is
because we add a constant coordinate to the input which we treat as an offset) such that

    f_{w_I,a}(x) = ⋀_{i∈I} xᵢ;   f_{w_I′,a′}(x) = ⋁_{i∈I} xᵢ.

Proof. We take w_I = w_I′ with i-th coordinate equal to 1 if i ∈ I and 0 else, and take
a = |I| − 1/2, a′ = 1/2 − |I| (for a neuron outputting sign(⟨w, x⟩ − a)).
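A brute-force check of these thresholds, assuming the neuron computes sign(⟨w, x⟩ − a) (the offset convention is an assumption here, the text leaving it implicit):

```python
import numpy as np
from itertools import product

d, I = 4, [0, 2]                              # illustrative input size and subset
w = np.zeros(d); w[I] = 1.0
a_and, a_or = len(I) - 0.5, 0.5 - len(I)      # thresholds from the proof

for x in product([-1.0, 1.0], repeat=d):
    x = np.array(x)
    assert np.sign(w @ x - a_and) == (1.0 if all(x[i] == 1 for i in I) else -1.0)
    assert np.sign(w @ x - a_or) == (1.0 if any(x[i] == 1 for i in I) else -1.0)

print("AND/OR thresholds verified on all inputs")
```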
Proposition 6.23. Any boolean function F : {−1, 1}^d → {−1, 1} can be realized with a
2-layer ANN with a sufficiently large first layer.
Proof. Let ξ_F := { x ∈ {−1, 1}^d : F(x) = 1 }. For each x ∈ ξ_F we construct an AN in the
first layer with weight w_x = x and offset a = d − ½. It can be checked that f_{x,a}(t) = 1 if
t = x, and is −1 otherwise. The neuron in the second layer is just an OR of all the neurons
of the first layer.
Proposition 6.23 states that a 2-layer ANN can already represent any boolean function;
however, the first layer can be of size up to 2^d to achieve this, which can be prohibitive. On
the other hand, using Proposition 6.22 we see that we can implement a “boolean circuit”
function (which consists of applications of the elementary AND, OR and NOT gates
along an evaluation tree) with an ANN having as many (nonconstant) neurons as gates in
the boolean circuit, and with the number of layers equal to the depth of the evaluation
tree.
To summarize, we get the following qualitative understanding:
• A “shallow” (with few layers) ANN with sign activation function can realize any
boolean function, but its complexity (size) must be very large to do so (there
are similar results for approximation of continuous functions using other activation
functions).

• A “deep” network (with many layers) can approximate more efficiently (i.e. using
fewer ANs) boolean functions that can be represented in short form as a composition
of elementary boolean gates.
To make the second point more quantitative, we will analyze the complexity (in the
sense of VC theory) of an ANN with architectural connection constraints, namely if each
AN is restricted to use only a (fixed) subset of the previous layer as an input (this corre-
sponds to its activation weight vector w being restricted to a fixed support of reduced size,
see beginning of the section). We can represent such connection constraints abstractly as
a graph G = (V, E): each vertex of V represents either a single AN or an
individual coordinate of the input data in {−1, 1}^d, and the (directed) edges E represent
the nonzero activation weights from one individual neuron of a given layer to the next.
Proposition 6.24. Let G = (V, E) be a graph representing the structure and connection
constraints of an ANN, and F_G be the set of boolean functions on {−1, 1}^d that can be
represented by an ANN following these constraints.
Then it holds for n ≥ 1:

    Γ(F_G, n) ≤ (n + 1)^{|E|},    (6.36)

and it follows that the VC dimension of F_G is bounded by c|E| log|E|, for c a numerical
constant.
Proof. We will use the notion of growth function (6.29) when the output set of functions
can be larger than {0, 1} (the definition is unchanged). Let K be the number of layers
in the ANN. For k ≤ K, consider the set of functions F_{G,k} ⊆ F({−1, 1}^{n_{k−1}}, {−1, 1}^{n_k})
obtained by only considering the k-th layer of the ANN under the structural constraints
given by G.
Note that the input resp. output space of F_{G,k} is {−1, 1}^{n_{k−1}} resp. {−1, 1}^{n_k}, given by
the number n_{k−1} resp. n_k of ANs in the (k − 1)-th resp. k-th layer. For AN
number i of the k-th layer, 1 ≤ i ≤ n_k, denote ℓ_{k,i} the number of incoming edges from the
previous layer. To this AN is associated a linear classifier with weight vector w_{k,i} of
dimension ℓ_{k,i}. Therefore, from Proposition 6.20 and Theorem 6.19, if F_{k,i} is the set of
linear classifiers that can be represented by this AN, we have

    Γ(F_{k,i}, n) ≤ (n + 1)^{ℓ_{k,i}}.

Since

    F_k = { x ∈ {−1, 1}^{n_{k−1}} ↦ (f_{k,i}(x))_{1≤i≤n_k}, f_{k,i} ∈ F_{k,i} },

we deduce

    Γ(F_k, n) ≤ Π_{i=1}^{n_k} (n + 1)^{ℓ_{k,i}} = (n + 1)^{Σ_{i=1}^{n_k} ℓ_{k,i}}.

Finally, it is easy to check that we have in general the “composition rule”, for F ◦ G :=
{ f ◦ g, f ∈ F, g ∈ G }:

    Γ(F ◦ G, n) ≤ Γ(F, n) Γ(G, n);
namely, for any n-uple Sn in the input space of G, it holds