Selected Theoretical Aspects of ML and Deep Learning
Contents
1 Generalities on regression, classification and neural networks
1.1 Regression
1.2 Classification
1.3 Neural networks
1.4 Backpropagation for neural networks
Acknowledgements
Sections 2 and 4 are taken from lecture notes from Sébastien Gerchinovitz that are available at https://www.math.univ-toulouse.fr/~fmalgouy/enseignement/mva.html. Parts of Section 3 benefited
from parts of the book [Gir14].
Introduction
By deep learning, we mean neural networks with many hidden layers. Deep learning methods are currently very popular for several tasks, for instance regression (predicting a quantity $y \in \mathbb{R}$), classification and generative modeling.
• For regression: any type of input x ∈ Rd and of corresponding output y ∈ R to predict. For
instance, y can be the delay of a flight and x can gather characteristics of this flight, such as the
day, position of the airport and duration.
• For classification: $x \in \mathbb{R}^d$ can be an image (vector of color levels for each pixel) and $y$ can give the type of image, for instance cat/dog, or the value of a digit.
• For generative modeling: generating images (e.g. faces) or musical pieces.
Goals of the lecture notes. The goal is to study some theoretical aspects of deep learning, and
in some cases of machine learning more broadly. There are many recent contributions and only a few
of them will be covered.
From the previous proposition, $f^\star$, defined by $f^\star(x) = \mathbb{E}[Y \mid X = x]$, minimizes the mean square error among all possible functions, and the closer a function $f$ is to $f^\star$, the smaller its mean square error.
Proof of Proposition 1  Let us use the law of total expectation:
\[
\mathbb{E}\big[(f(X)-Y)^2\big] = \mathbb{E}\Big[\mathbb{E}\big[(f(X)-Y)^2 \,\big|\, X\big]\Big].
\]
Conditionally on $X$, we can use the equation
\[
\mathbb{E}\big[(Z-a(X))^2 \,\big|\, X\big] = \mathrm{Var}(Z \mid X) + \big(\mathbb{E}[Z \mid X] - a(X)\big)^2
\]
for a random variable $Z$ and a function $a(X)$ (bias-variance decomposition). This gives
\begin{align*}
\mathbb{E}\big[(f(X)-Y)^2\big] &= \mathbb{E}\big[(\mathbb{E}[Y\mid X]-f(X))^2\big] + \mathbb{E}\big[\mathrm{Var}(Y\mid X)\big] \\
&= \mathbb{E}\big[(f^\star(X)-f(X))^2\big] + \mathbb{E}\big[\mathrm{Var}(Y\mid X)\big] \\
&= \mathbb{E}\big[(f^\star(X)-f(X))^2\big] + \mathbb{E}\Big[\mathbb{E}\big[(Y-\mathbb{E}[Y\mid X])^2 \,\big|\, X\big]\Big] \\
\text{(law of total expectation:)}\quad &= \mathbb{E}\big[(f^\star(X)-f(X))^2\big] + \mathbb{E}\big[(Y-\mathbb{E}[Y\mid X])^2\big] \\
&= \mathbb{E}\big[(f^\star(X)-f(X))^2\big] + \mathbb{E}\big[(Y-f^\star(X))^2\big].
\end{align*}
We now consider a data set $(X_1, Y_1), \ldots, (X_n, Y_n)$, independent and with law $\mathcal{L}$. We consider a function learned by empirical risk minimization. We let $\mathcal{F}$ be a set of functions from $[0,1]^d$ to $\mathbb{R}$ and consider
\[
\hat f_n \in \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (f(X_i) - Y_i)^2 .
\]
The next proposition makes it possible to bound the mean square error of $\hat f_n$.
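To make empirical risk minimization concrete, here is a minimal sketch (not part of the notes) where $\mathcal{F}$ is the class of linear functions $x \mapsto \langle \theta, x\rangle$, for which the minimizer has a closed form (least squares); the synthetic data set is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data set (X_1, Y_1), ..., (X_n, Y_n) drawn i.i.d. from some law L.
n, d = 200, 3
X = rng.uniform(0.0, 1.0, size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + 0.1 * rng.standard_normal(n)

# Empirical risk minimization over F = {x -> <theta, x> : theta in R^d}:
# argmin_theta (1/n) sum_i (<theta, X_i> - Y_i)^2, solved by least squares.
theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
f_hat = lambda x: x @ theta_hat

empirical_risk = np.mean((f_hat(X) - Y) ** 2)
print("estimated theta:", theta_hat)
print("empirical risk R_n(f_hat):", empirical_risk)
```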
Proposition 2  Let $(X, Y) \sim \mathcal{L}$, independent of $(X_1, Y_1), \ldots, (X_n, Y_n)$. Then we have
\begin{align*}
\mathbb{E}\Big[\big(\hat f_n(X) - Y\big)^2\Big] - \mathbb{E}\big[(f^\star(X) - Y)^2\big]
&\le 2\, \mathbb{E}\Bigg[\sup_{f \in \mathcal{F}} \bigg| \mathbb{E}\big[(f(X) - Y)^2\big] - \frac{1}{n} \sum_{i=1}^n (f(X_i) - Y_i)^2 \bigg|\Bigg] \\
&\quad + \inf_{f \in \mathcal{F}} \mathbb{E}\big[(f(X) - f^\star(X))^2\big].
\end{align*}
Remarks
• In the term $\mathbb{E}\big[(\hat f_n(X) - Y)^2\big]$, the expectation is taken with respect to both $(X_1, Y_1), \ldots, (X_n, Y_n)$ and $(X, Y)$.
• We bound $\mathbb{E}\big[(\hat f_n(X) - Y)^2\big] - \mathbb{E}\big[(f^\star(X) - Y)^2\big]$ by the sum of two terms. The first one, $2\,\mathbb{E}\big[\sup_{f \in \mathcal{F}} |\cdots|\big]$, is called the generalization error. The larger the set $\mathcal{F}$ is, the larger this error is, because the supremum is taken over a larger set. The second one, $\inf_{f \in \mathcal{F}} \mathbb{E}\big[(f(X) - f^\star(X))^2\big]$, is called the approximation error. The smaller $\mathcal{F}$ is, the larger this error is, because the infimum is taken over fewer functions.
• Hence, we see that $\mathcal{F}$ should be neither too small nor too large, which can be interpreted as a bias-variance trade-off.
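A small numerical illustration (not part of the notes) of this trade-off, with $\mathcal{F}$ taken to be polynomials of increasing degree fitted by least squares: low degrees suffer from approximation error, high degrees from generalization error. The target function and noise level below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
f_star = lambda x: np.sin(2 * np.pi * x)          # regression function E[Y | X = x]

def sample(n):
    x = rng.uniform(0, 1, n)
    return x, f_star(x) + 0.3 * rng.standard_normal(n)

x_train, y_train = sample(30)
x_test, y_test = sample(10_000)

for degree in [0, 1, 3, 5, 9]:
    # F = polynomials of the given degree; ERM is again least squares.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Typically, the test error first decreases and then increases with the degree, while the training error keeps decreasing.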
Proof of Proposition 2  We let, for $f \in \mathcal{F}$,
\[
R(f) = \mathbb{E}\big[(Y - f(X))^2\big]
\quad \text{and} \quad
R_n(f) = \frac{1}{n} \sum_{i=1}^n (f(X_i) - Y_i)^2 .
\]
Then we have, by the law of total expectation,
\[
\mathbb{E}\Big[\big(\hat f_n(X) - Y\big)^2\Big]
= \mathbb{E}\Big[\mathbb{E}\Big[\big(\hat f_n(X) - Y\big)^2 \,\Big|\, X_1, Y_1, \ldots, X_n, Y_n\Big]\Big]
= \mathbb{E}\big[R(\hat f_n)\big],
\]
since $(X, Y)$ is independent of $X_1, Y_1, \ldots, X_n, Y_n$ and, in $R(\hat f_n)$, the function $\hat f_n$ is fixed, as the expectation is taken only with respect to $X$ and $Y$. Let $\epsilon > 0$ and let $f_\epsilon \in \mathcal{F}$ be such that $R(f_\epsilon) \le \inf_{f \in \mathcal{F}} R(f) + \epsilon$. Then we have
\begin{align*}
\mathbb{E}\big[R(\hat f_n)\big] - R(f^\star)
&= \mathbb{E}\big[R(\hat f_n) - R_n(\hat f_n)\big] + \mathbb{E}\big[R_n(\hat f_n) - R_n(f_\epsilon)\big] + \mathbb{E}\big[R_n(f_\epsilon) - R(f_\epsilon)\big] \\
&\quad + R(f_\epsilon) - \inf_{f \in \mathcal{F}} R(f) + \inf_{f \in \mathcal{F}} R(f) - R(f^\star) \\
&\le \mathbb{E}\Big[\sup_{f \in \mathcal{F}} |R(f) - R_n(f)|\Big] + 0 + \mathbb{E}\Big[\sup_{f \in \mathcal{F}} |R(f) - R_n(f)|\Big] + \epsilon + \inf_{f \in \mathcal{F}} \mathbb{E}\big[(f(X) - f^\star(X))^2\big],
\end{align*}
where $\mathbb{E}\big[R_n(\hat f_n) - R_n(f_\epsilon)\big] \le 0$ by definition of $\hat f_n$, and where $\inf_{f \in \mathcal{F}} R(f) - R(f^\star) = \inf_{f \in \mathcal{F}} \mathbb{E}\big[(f(X) - f^\star(X))^2\big]$ by Proposition 1.
Since this inequality holds for any $\epsilon > 0$, we also obtain the inequality with $\epsilon = 0$, which concludes the proof.
1.2 Classification
The general principle is quite similar to regression. We consider a law $\mathcal{L}$ on $[0,1]^d \times \{0,1\}$. We are looking for a function $f : [0,1]^d \to \{0,1\}$ (a classifier) such that, with $(X, Y) \sim \mathcal{L}$,
\[
\mathbb{P}\big(f(X) \neq Y\big)
\]
is small. The next proposition provides the optimal function $f$ for this.
Proposition 3  Let $(X, Y) \sim \mathcal{L}$. Let $T^\star(x) = \mathbf{1}_{p^\star(x) \ge 1/2}$ with
\[
p^\star(x) = \mathbb{P}(Y = 1 \mid X = x)
\]
for $x \in [0,1]^d$. Then, for any $f : [0,1]^d \to \{0,1\}$,
\[
\mathbb{P}\big(f(X) \neq Y\big) - \mathbb{P}\big(T^\star(X) \neq Y\big)
= \mathbb{E}\Big[\mathbf{1}_{f(X) \neq T^\star(X)}\, \big|1 - 2 p^\star(X)\big|\Big] \ge 0 .
\]
Hence, we see that a prediction error (that is, predicting $f(X)$ with $f(X) \neq T^\star(X)$) is more harmful when $|1 - 2p^\star(X)|$ is large. This is well interpreted, because when $|1 - 2p^\star(X)| = 0$, we have $p^\star(X) = 1/2$, thus $\mathbb{P}(Y = 1 \mid X) = 1/2$. In this case, $\mathbb{P}(f(X) \neq Y \mid X) = 1/2$, regardless of the value of $f(X)$.
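A quick simulation (not part of the notes) illustrating the proposition in dimension $d = 1$, assuming a toy law where $p^\star$ is known explicitly: the excess risk of a sub-optimal classifier matches the formula $\mathbb{E}\big[\mathbf{1}_{f(X) \neq T^\star(X)} |1 - 2p^\star(X)|\big]$ up to Monte Carlo error. The specific law and classifiers below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
p_star = lambda x: 0.8 * x            # P(Y = 1 | X = x) for X uniform on [0, 1]

n = 200_000
X = rng.uniform(0, 1, n)
Y = (rng.uniform(0, 1, n) < p_star(X)).astype(int)

bayes = lambda x: (p_star(x) >= 0.5).astype(int)   # T*(x) = 1_{p*(x) >= 1/2}
other = lambda x: (x >= 0.3).astype(int)           # a sub-optimal threshold classifier

err_bayes = np.mean(bayes(X) != Y)
err_other = np.mean(other(X) != Y)
# Excess risk predicted by the proposition: E[ 1_{f(X) != T*(X)} |1 - 2 p*(X)| ]
excess_formula = np.mean((other(X) != bayes(X)) * np.abs(1 - 2 * p_star(X)))
print("error of T*:", err_bayes, " error of f:", err_other)
print("observed excess risk:", err_other - err_bayes, " formula:", excess_formula)
```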
Proof of Proposition 3  We write $e(X, Y) = \mathbf{1}_{f(X) \neq Y} - \mathbf{1}_{T^\star(X) \neq Y}$. Using the law of total expectation, we have
\[
\mathbb{P}\big(f(X) \neq Y\big) - \mathbb{P}\big(T^\star(X) \neq Y\big) = \mathbb{E}\big[e(X, Y)\big] := \mathbb{E}\big[\mathbb{E}\big( e(X, Y) \mid X\big)\big].
\]
• If $T^\star(X) = 1$, then
  – if $f(X) = 0$, then
    ∗ $e(X, Y) = 1$ with probability $\mathbb{P}(Y = 1 \mid X) = p^\star(X)$,
    ∗ $e(X, Y) = -1$ with probability $\mathbb{P}(Y = 0 \mid X) = 1 - p^\star(X)$,
  – if $f(X) = 1$, then $e(X, Y) = 0$,
and thus
\[
\mathbb{E}\big( e(X, Y) \mid X\big) = \mathbf{1}_{f(X) \neq T^\star(X)} \big(p^\star(X) - (1 - p^\star(X))\big) = \mathbf{1}_{f(X) \neq T^\star(X)} \big|1 - 2 p^\star(X)\big|,
\]
since $p^\star(X) \ge 1/2$ in this case.
• If $T^\star(X) = 0$, then symmetrically (with $p^\star(X) < 1/2$)
\[
\mathbb{E}\big( e(X, Y) \mid X\big) = \mathbf{1}_{f(X) \neq T^\star(X)} \big((1 - p^\star(X)) - p^\star(X)\big) = \mathbf{1}_{f(X) \neq T^\star(X)} \big|1 - 2 p^\star(X)\big| ,
\]
and thus the same expression holds.
Hence, eventually,
\[
\mathbb{P}\big(f(X) \neq Y\big) - \mathbb{P}\big(T^\star(X) \neq Y\big) = \mathbb{E}\Big[\mathbf{1}_{f(X) \neq T^\star(X)} \big|1 - 2 p^\star(X)\big|\Big] .
\]
We now consider a data set of the form (X1 , Y1 ), . . . , (Xn , Yn ) independent and of law L. We
consider a function that is learned by empirical risk minimization. We consider a set F of functions
from [0, 1]d to {0, 1} and
\[
\hat f_n \in \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{Y_i \neq f(X_i)} .
\]
The proof and the interpretation are the same as for regression.
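As an illustration (not part of the notes), empirical risk minimization with the 0-1 loss can be carried out exactly for the simple class of threshold classifiers on $[0,1]$, since the empirical risk is piecewise constant in the threshold; the data-generating law below is the same toy example as above.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
X = rng.uniform(0, 1, n)
Y = (rng.uniform(0, 1, n) < 0.8 * X).astype(int)   # toy law with p*(x) = 0.8 x

# F = {x -> 1_{x >= a} : a in R}; the empirical 0-1 risk only changes when a
# crosses a data point, so scanning thresholds at the observations suffices.
candidates = np.concatenate(([0.0], np.sort(X), [1.0 + 1e-9]))
risks = [np.mean((X >= a).astype(int) != Y) for a in candidates]
a_hat = candidates[int(np.argmin(risks))]
print("ERM threshold:", a_hat, " empirical risk:", min(risks))
```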
1.3 Neural networks
Figure 1: Inputs, the neurons of the hidden layer, and the output neuron of a feed-forward neural network with one hidden layer.
• A circle (a neuron) sums all the values that are pointed to it by the arrows.
Here $\langle \cdot, \cdot \rangle$ denotes the standard inner product on $\mathbb{R}^d$.
The neural network function is parametrized by
• $w_1, \ldots, w_N \in \mathbb{R}^d$ and $v_1, \ldots, v_N \in \mathbb{R}$, the weights,
• $b_1, \ldots, b_N \in \mathbb{R}$, the biases,
• $\sigma : \mathbb{R} \to \mathbb{R}$, the activation function, for instance
  – linear: $\sigma(t) = t$,
  – ReLU: $\sigma(t) = \max(0, t)$.
For instance, when $d = 1$, the network of Figure 2 encodes the absolute value function with $\sigma$ the ReLU function.
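The notes state that such a network computes the absolute value with the ReLU activation; below is a minimal sketch of one choice of weights achieving this (the specific values are an assumption, since Figure 2 is not reproduced here), namely $|x| = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x)$.

```python
import numpy as np

relu = lambda t: np.maximum(0.0, t)

# One hidden layer with N = 2 neurons: weights w = (1, -1), biases b = (0, 0),
# output weights v = (1, 1), so that f(x) = relu(x) + relu(-x) = |x|.
w = np.array([1.0, -1.0])
b = np.array([0.0, 0.0])
v = np.array([1.0, 1.0])

f = lambda x: v @ relu(w * x + b)

for x in [-2.0, -0.5, 0.0, 1.5]:
    print(x, f(x), abs(x))
```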
Feed-forward neural networks with several hidden layers. This is the same type of representation but with several layers of activation functions. These networks are represented as in Figure 3.
Figure 3: Inputs, hidden layers 1 to c, and the output neuron of a feed-forward neural network with several hidden layers.
where $f_v$ is the map
\[
f_v : \mathbb{R}^{N_c} \to \mathbb{R}, \qquad u \mapsto \sum_{i=1}^{N_c} u_i v_i .
\]
• $\sigma : \mathbb{R} \to \mathbb{R}$, the activation function.
Classes of functions
To come back to regression, the class of functions $\mathcal{F}$ corresponding to neural networks is determined by the architecture parameters:
• $c$, the number of layers,
• $N_1, \ldots, N_c$, the numbers of neurons of the layers,
• $\sigma$, the activation function.
Then, for a given architecture, $\mathcal{F}$ is a parametric set of functions
\[
\mathcal{F} = \Big\{ \text{neural networks parametrized by } \big(v,\, b_1^{(c)}, \ldots, b_{N_c}^{(c)},\, w_1^{(c)}, \ldots, w_{N_c}^{(c)},\, \ldots,\, b_1^{(1)}, \ldots, b_{N_1}^{(1)},\, w_1^{(1)}, \ldots, w_{N_1}^{(1)}\big) \Big\} .
\]
1.4 Backpropagation for neural networks
Motivation.
Here we consider the neural network output with several hidden layers. We fix the architecture, the activation function $\sigma$ and the input $x \in [0,1]^d$. This output depends on the parameters $\theta$ gathering all the weights and biases:
\[
\theta = \big(v,\, b_1^{(c)}, \ldots, b_{N_c}^{(c)},\, w_1^{(c)}, \ldots, w_{N_c}^{(c)},\, \ldots,\, b_1^{(1)}, \ldots, b_{N_1}^{(1)},\, w_1^{(1)}, \ldots, w_{N_1}^{(1)}\big) .
\]
Our aim is to compute the gradient of fθ (x) with respect to θ. This is very useful when the
parameters θ of a neural network are optimized, for instance with least squares in regression with
\[
\min_\theta \frac{1}{n} \sum_{i=1}^n \big(f_\theta(X_i) - Y_i\big)^2 .
\]
To compute the gradient of this function with respect to θ, it is sufficient to compute the gradients of
fθ (Xi ) with respect to θ for i = 1, . . . , n.
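As a small illustration (not part of the notes) of how these gradients are used, here is a gradient-descent loop on the least-squares objective above for a tiny one-hidden-layer ReLU network; the gradient is approximated by finite differences as a stand-in for backpropagation, which is computed exactly below. All specific choices (network size, data, learning rate) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(theta, x):
    """A small one-hidden-layer ReLU network with 2 neurons (illustrative choice)."""
    w, b, v = theta[:2], theta[2:4], theta[4:6]
    return v @ np.maximum(0.0, w * x + b)

def grad_f(theta, x, eps=1e-6):
    # Finite-difference gradient of theta -> f(theta, x); backpropagation
    # computes the same quantity exactly and more cheaply.
    g = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta); e[j] = eps
        g[j] = (f(theta + e, x) - f(theta - e, x)) / (2 * eps)
    return g

X = rng.uniform(0, 1, 50)
Y = np.abs(X - 0.5) + 0.05 * rng.standard_normal(50)

theta = rng.standard_normal(6)
lr = 0.05
for _ in range(300):
    grad = np.mean([2 * (f(theta, x) - y) * grad_f(theta, x) for x, y in zip(X, Y)], axis=0)
    theta -= lr * grad
print("final empirical risk:", np.mean([(f(theta, x) - y) ** 2 for x, y in zip(X, Y)]))
```

The empirical risk typically decreases along the iterations; the point of backpropagation is to replace the costly finite-difference loop by exact gradients.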
Backpropagation algorithm.
At first sight, computing the gradient of fθ (x) with respect to θ is very challenging because of the
compositions of functions in (1). Even in the simpler case where σ is the identity, if we expand (1)
into a sum of products, the number of terms in the sum will be of order N1 × · · · × Nc which can be
of astronomical order for deep networks.
The backpropagation algorithm presented here solves this issue by making it possible to compute the gradient with storage and operations of complexity at most $N_k N_{k+1}$ for each $k \in \{1, \ldots, c-1\}$.
We define the vector $\eta_\theta^{(c)}$ of dimension $1 \times N_c$ equal to
\[
\eta_\theta^{(c)} = \Big(\sigma'\big(g_{\theta,1}^{(c)}(x)\big) v_1, \ldots, \sigma'\big(g_{\theta,N_c}^{(c)}(x)\big) v_{N_c}\Big),
\]
where $g_\theta^{(c)}(x)$ is the vector of size $N_c$ composed of the values at layer $c$ just before the activations $\sigma$.
Lemma 5  $\eta_\theta^{(c)}$ is the (row) gradient of the network output with respect to the vector of values at layer $c$ just before the activations $\sigma$.
In Lemma 5, more precisely, the output is a function of $g_\theta^{(c)}(x)$ and $v$, and we take the gradient with respect to $g_\theta^{(c)}(x)$ at $g_\theta^{(c)}(x)$ for $v$ fixed.
Proof of Lemma 5  The output of the network is a function of $g_\theta^{(c)}(x)$ and the final weights $v_1, \ldots, v_{N_c}$ as follows:
\[
f_\theta(x) = \sum_{i=1}^{N_c} v_i\, \sigma\big(g_{\theta,i}^{(c)}(x)\big).
\]
We indeed see that the derivative with respect to $g_{\theta,i}^{(c)}(x)$ is $\sigma'\big(g_{\theta,i}^{(c)}(x)\big) v_i$, for $i = 1, \ldots, N_c$.
Then, for $k$ going from $c-1$ to $1$, we define (by induction) the vector $\eta_\theta^{(k)}$ of dimension $1 \times N_k$ by
\[
\eta_\theta^{(k)} = \underbrace{\eta_\theta^{(k+1)}}_{1 \times N_{k+1}} \; \underbrace{W^{(k+1)}}_{N_{k+1} \times N_k} \; \underbrace{D_{\sigma'}\big(g_\theta^{(k)}(x)\big)}_{N_k \times N_k}
\]
with
\[
W^{(k+1)} :=
\begin{pmatrix}
w_1^{(k+1)} \\
\vdots \\
w_{N_{k+1}}^{(k+1)}
\end{pmatrix}
\]
of dimension $N_{k+1} \times N_k$, where $g_\theta^{(k)}(x)$ is the vector of size $N_k$ composed of the values at layer $k$ just before the activations $\sigma$, and where, for a vector $z = (z_1, \ldots, z_q)$,
\[
D_{\sigma'}(z) =
\begin{pmatrix}
\sigma'(z_1) & 0 & \cdots & 0 \\
0 & \ddots & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & \sigma'(z_q)
\end{pmatrix}
\]
of dimension $q \times q$.
Lemma 6  For $k = 1, \ldots, c-1$, $\eta_\theta^{(k)}$ is the (row) gradient of the network output with respect to the vector of values at layer $k$ just before the activations $\sigma$.
For Lemma 6, we make the same comment as stated after Lemma 5.
Proof of Lemma 6  Assume that the values of the network at layer $k$ just before $\sigma$ are $g_\theta^{(k)}(x) + z$ for a small $z = (z_1, \ldots, z_{N_k})$. The corresponding output of the network, which we write $f_{\theta,z}(x)$, is a function of $g_\theta^{(k)}(x) + z$ and the parameters of layers $k+1, \ldots, c$ as follows:
\[
f_{\theta,z}(x) = h_\theta^{(k+1)}\Big( W^{(k+1)} \sigma\big(g_\theta^{(k)}(x) + z\big) + b^{(k+1)} \Big),
\]
where $b^{(k+1)} = \big(b_1^{(k+1)}, \ldots, b_{N_{k+1}}^{(k+1)}\big)^\top$, where for a vector $t = (t_1, \ldots, t_q)$ we let $\sigma(t) = (\sigma(t_1), \ldots, \sigma(t_q))$, and where $h_\theta^{(k+1)}$ is the function providing the output of the network from the values at layer $k+1$ just before the activations $\sigma$. We compute a Taylor expansion as $z \to 0$:
\begin{align*}
f_{\theta,z}(x) &= h_\theta^{(k+1)}\Big( W^{(k+1)} \sigma\big(g_\theta^{(k)}(x)\big) + b^{(k+1)} + W^{(k+1)} D_{\sigma'}\big(g_\theta^{(k)}(x)\big) z + o(\|z\|) \Big) \\
&= h_\theta^{(k+1)}\Big( W^{(k+1)} \sigma\big(g_\theta^{(k)}(x)\big) + b^{(k+1)} \Big) + \eta_\theta^{(k+1)} W^{(k+1)} D_{\sigma'}\big(g_\theta^{(k)}(x)\big) z + o(\|z\|) \\
&= f_\theta(x) + \eta_\theta^{(k+1)} W^{(k+1)} D_{\sigma'}\big(g_\theta^{(k)}(x)\big) z + o(\|z\|),
\end{align*}
because by definition $\eta_\theta^{(k+1)}$ is the gradient vector of $h_\theta^{(k+1)}(\cdot)$ with respect to the input "$\cdot$".
Hence, by definition of the gradient vector, the gradient of $f_\theta(x)$ with respect to the values of the network at layer $k$ just before $\sigma$ is
\[
\eta_\theta^{(k+1)} W^{(k+1)} D_{\sigma'}\big(g_\theta^{(k)}(x)\big),
\]
which is the definition of $\eta_\theta^{(k)}$.
Hence backpropagation consists in computing (in this order):
\[
\eta_\theta^{(c)} \longrightarrow \cdots \longrightarrow \eta_\theta^{(1)}.
\]
Note that we can compute beforehand, in a forward pass (coinciding with computing the output $f_\theta(x)$),
\[
g_\theta^{(1)}(x) \longrightarrow \cdots \longrightarrow g_\theta^{(c)}(x).
\]
Proposition 7  For $i = 1, \ldots, N_c$,
\[
\frac{\partial f_\theta(x)}{\partial v_i} = \sigma\big(g_{\theta,i}^{(c)}(x)\big). \tag{2}
\]
For $k = 1, \ldots, c$ and $i = 1, \ldots, N_k$,
\[
\frac{\partial f_\theta(x)}{\partial b_i^{(k)}} = \eta_{\theta,i}^{(k)}. \tag{3}
\]
For $k = 2, \ldots, c$, $i = 1, \ldots, N_k$ and $j = 1, \ldots, N_{k-1}$,
\[
\frac{\partial f_\theta(x)}{\partial w_{i,j}^{(k)}} = \sigma\big(g_{\theta,j}^{(k-1)}(x)\big)\, \eta_{\theta,i}^{(k)}. \tag{4}
\]
For $i = 1, \ldots, N_1$ and $j = 1, \ldots, d$,
\[
\frac{\partial f_\theta(x)}{\partial w_{i,j}^{(1)}} = x_j\, \eta_{\theta,i}^{(1)}. \tag{5}
\]
Proof of Proposition 7
Proof of (2)  We can write
\[
f_\theta(x) = \sum_{i=1}^{N_c} \sigma\big(g_{\theta,i}^{(c)}(x)\big) v_i
\]
and $\sigma\big(g_{\theta,i}^{(c)}(x)\big)$ does not depend on $v_1, \ldots, v_{N_c}$. Hence (2) holds.
Proof of (3)  We can write, if the scalar $b_i^{(k)}$ is replaced by $b_i^{(k)} + z$ for a small $z$, defining $f_{\theta,z}(x)$ as the new network output,
\[
f_{\theta,z}(x) = h_\theta^{(k)}\Big( g_\theta^{(k)}(x) + z\, e_i^{(N_k)} \Big),
\]
where $e_i^{(N_k)}$ is the $i$-th base vector in $\mathbb{R}^{N_k}$ and where $h_\theta^{(k)}$ is the function that maps the values at layer $k$ just before the activations $\sigma$ to the output of the network.
By Lemma 5 or Lemma 6, the gradient of $h_\theta^{(k)}(\cdot)$ with respect to "$\cdot$" at $g_\theta^{(k)}(x)$ is $\eta_\theta^{(k)}$. Hence, by a Taylor expansion,
\[
f_{\theta,z}(x) = h_\theta^{(k)}\big( g_\theta^{(k)}(x) \big) + \eta_{\theta,i}^{(k)} z + o(|z|) = f_\theta(x) + \eta_{\theta,i}^{(k)} z + o(|z|),
\]
which shows (3).
An example
We let $\sigma$ be the identity function for simplicity. We consider a neural network with $c = 2$, $N_0 = d = 2$, $N_1 = 3$, $N_2 = 2$ and with parameters as follows.
\[
w_{1,1}^{(1)} = 1,\; w_{1,2}^{(1)} = -1,\; w_{2,1}^{(1)} = 0,\; w_{2,2}^{(1)} = 1,\; w_{3,1}^{(1)} = 2,\; w_{3,2}^{(1)} = -2,
\]
\[
b_1^{(1)} = 0,\; b_2^{(1)} = 1,\; b_3^{(1)} = -1,
\]
\[
w_{1,1}^{(2)} = 1,\; w_{1,2}^{(2)} = 1,\; w_{1,3}^{(2)} = -1,\; w_{2,1}^{(2)} = 2,\; w_{2,2}^{(2)} = 1,\; w_{2,3}^{(2)} = 1,
\]
\[
b_1^{(2)} = 0,\; b_2^{(2)} = 1, \qquad v_1 = 1,\; v_2 = 1 .
\]
The input is $x = (1, 2)$. The execution of the forward pass (computing the network output) yields
\[
x = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad
g_\theta^{(1)}(x) = \begin{pmatrix} -1 \\ 3 \\ -3 \end{pmatrix}, \qquad
g_\theta^{(2)}(x) = \begin{pmatrix} 5 \\ -1 \end{pmatrix}, \qquad
f_\theta(x) = 4,
\]
and then (since $\sigma$ is the identity, $D_{\sigma'}$ is the identity matrix)
\[
\eta_\theta^{(2)} = \begin{pmatrix} 1 & 1 \end{pmatrix}, \qquad
\eta_\theta^{(1)} = \eta_\theta^{(2)} W^{(2)} D_{\sigma'}\big(g_\theta^{(1)}(x)\big)
= \begin{pmatrix} 1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 1 & -1 \\ 2 & 1 & 1 \end{pmatrix}
= \begin{pmatrix} 3 & 2 & 0 \end{pmatrix}.
\]
Hence from Proposition 7 we have all the partial derivatives as follows.
\[
\frac{\partial f_\theta(x)}{\partial w_{1,1}^{(1)}} = 3,\;
\frac{\partial f_\theta(x)}{\partial w_{1,2}^{(1)}} = 6,\;
\frac{\partial f_\theta(x)}{\partial w_{2,1}^{(1)}} = 2,\;
\frac{\partial f_\theta(x)}{\partial w_{2,2}^{(1)}} = 4,\;
\frac{\partial f_\theta(x)}{\partial w_{3,1}^{(1)}} = 0,\;
\frac{\partial f_\theta(x)}{\partial w_{3,2}^{(1)}} = 0,
\]
\[
\frac{\partial f_\theta(x)}{\partial b_1^{(1)}} = 3,\;
\frac{\partial f_\theta(x)}{\partial b_2^{(1)}} = 2,\;
\frac{\partial f_\theta(x)}{\partial b_3^{(1)}} = 0,
\]
\[
\frac{\partial f_\theta(x)}{\partial w_{1,1}^{(2)}} = -1,\;
\frac{\partial f_\theta(x)}{\partial w_{1,2}^{(2)}} = 3,\;
\frac{\partial f_\theta(x)}{\partial w_{1,3}^{(2)}} = -3,\;
\frac{\partial f_\theta(x)}{\partial w_{2,1}^{(2)}} = -1,\;
\frac{\partial f_\theta(x)}{\partial w_{2,2}^{(2)}} = 3,\;
\frac{\partial f_\theta(x)}{\partial w_{2,3}^{(2)}} = -3,
\]
\[
\frac{\partial f_\theta(x)}{\partial b_1^{(2)}} = 1,\;
\frac{\partial f_\theta(x)}{\partial b_2^{(2)}} = 1, \qquad
\frac{\partial f_\theta(x)}{\partial v_1} = 5,\;
\frac{\partial f_\theta(x)}{\partial v_2} = -1 .
\]
We conclude by computing some derivatives "by hand" in order to confirm that backpropagation provides the correct derivative values.
We have
\begin{align*}
f_\theta(x) = {} & v_1 \Big\{ w_{1,1}^{(2)} \big[ w_{1,1}^{(1)} + 2 w_{1,2}^{(1)} + b_1^{(1)} \big] + w_{1,2}^{(2)} \big[ w_{2,1}^{(1)} + 2 w_{2,2}^{(1)} + b_2^{(1)} \big] + w_{1,3}^{(2)} \big[ w_{3,1}^{(1)} + 2 w_{3,2}^{(1)} + b_3^{(1)} \big] + b_1^{(2)} \Big\} \\
& + v_2 \Big\{ w_{2,1}^{(2)} \big[ w_{1,1}^{(1)} + 2 w_{1,2}^{(1)} + b_1^{(1)} \big] + w_{2,2}^{(2)} \big[ w_{2,1}^{(1)} + 2 w_{2,2}^{(1)} + b_2^{(1)} \big] + w_{2,3}^{(2)} \big[ w_{3,1}^{(1)} + 2 w_{3,2}^{(1)} + b_3^{(1)} \big] + b_2^{(2)} \Big\}.
\end{align*}
Let us differentiate with respect to $w_{3,2}^{(1)}$. The terms that depend on $w_{3,2}^{(1)}$ are
\[
2 w_{3,2}^{(1)} w_{1,3}^{(2)} v_1 + 2 w_{3,2}^{(1)} w_{2,3}^{(2)} v_2 .
\]
Differentiating yields
\[
2 w_{1,3}^{(2)} v_1 + 2 w_{2,3}^{(2)} v_2 = 2 \cdot (-1) \cdot 1 + 2 \cdot 1 \cdot 1 = 0,
\]
confirming the result from backpropagation.
Let us differentiate with respect to $b_1^{(1)}$. The terms that depend on $b_1^{(1)}$ are
\[
w_{1,1}^{(2)} b_1^{(1)} v_1 + w_{2,1}^{(2)} b_1^{(1)} v_2 .
\]
Differentiating yields
\[
w_{1,1}^{(2)} v_1 + w_{2,1}^{(2)} v_2 = 1 \cdot 1 + 2 \cdot 1 = 3,
\]
confirming the result from backpropagation.
As a last example, let us differentiate with respect to $w_{2,3}^{(2)}$. The terms that depend on $w_{2,3}^{(2)}$ are
\[
w_{2,3}^{(2)} \big[ w_{3,1}^{(1)} + 2 w_{3,2}^{(1)} + b_3^{(1)} \big] v_2 .
\]
Differentiating yields
\[
\big[ w_{3,1}^{(1)} + 2 w_{3,2}^{(1)} + b_3^{(1)} \big] v_2 = (2 + 2 \cdot (-2) - 1) \cdot 1 = -3,
\]
confirming the result from backpropagation.
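The worked example can also be reproduced programmatically. The following sketch (not part of the notes) implements the forward pass, the recursion for $\eta_\theta^{(2)}$ and $\eta_\theta^{(1)}$, and the formulas (2)-(5) of Proposition 7 with the parameter values of the example; it recovers the table of partial derivatives above.

```python
import numpy as np

# Architecture of the example: c = 2, d = 2, N1 = 3, N2 = 2, sigma = identity.
sigma = lambda t: t
dsigma = lambda t: np.ones_like(t)

W1 = np.array([[1., -1.], [0., 1.], [2., -2.]]); b1 = np.array([0., 1., -1.])
W2 = np.array([[1., 1., -1.], [2., 1., 1.]]);    b2 = np.array([0., 1.])
v = np.array([1., 1.])
x = np.array([1., 2.])

# Forward pass: g^(1), g^(2), f_theta(x).
g1 = W1 @ x + b1
g2 = W2 @ sigma(g1) + b2
f = v @ sigma(g2)
print("g1 =", g1, " g2 =", g2, " f =", f)           # (-1, 3, -3), (5, -1), 4

# Backward pass: eta^(2), then eta^(1) = eta^(2) W^(2) D_{sigma'}(g^(1)).
eta2 = dsigma(g2) * v
eta1 = eta2 @ W2 @ np.diag(dsigma(g1))
print("eta2 =", eta2, " eta1 =", eta1)               # (1, 1), (3, 2, 0)

# Partial derivatives from Proposition 7.
print("d f / d v     =", sigma(g2))                  # (5, -1)
print("d f / d b^(2) =", eta2)                       # (1, 1)
print("d f / d b^(1) =", eta1)                       # (3, 2, 0)
print("d f / d W^(2) =", np.outer(eta2, sigma(g1)))  # rows (-1, 3, -3), (-1, 3, -3)
print("d f / d W^(1) =", np.outer(eta1, x))          # rows (3, 6), (2, 4), (0, 0)
```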
2 Approximation with neural networks with one hidden layer
2.1 Statement of the theorem
Several theorems tackle the universality of feed-forward neural networks with one hidden layer, of the form
\[
x \in [0,1]^d \mapsto \sum_{i=1}^N v_i\, \sigma\big(\langle w_i, x\rangle + b_i\big)
\]
with $v_1, \ldots, v_N \in \mathbb{R}$, $b_1, \ldots, b_N \in \mathbb{R}$, $w_1, \ldots, w_N \in \mathbb{R}^d$ and $\sigma : \mathbb{R} \to \mathbb{R}$.
We will study the first theorem of the literature, from [Cyb89].
• We have $\mathcal{N}_1 \subset C([0,1]^d, \mathbb{R})$, which means that neural network functions are continuous.
• For all $f \in C([0,1]^d, \mathbb{R})$ and all $\epsilon > 0$, there exist $N \in \mathbb{N}$, $v_1, \ldots, v_N \in \mathbb{R}$, $b_1, \ldots, b_N \in \mathbb{R}$ and $w_1, \ldots, w_N \in \mathbb{R}^d$ such that
\[
\sup_{x \in [0,1]^d} \Big| f(x) - \sum_{i=1}^N v_i\, \sigma\big(\langle w_i, x\rangle + b_i\big) \Big| \le \epsilon .
\]
• Equivalently, for all $f \in C([0,1]^d, \mathbb{R})$ and all $\epsilon > 0$, there exists $g \in \mathcal{N}_1$ such that $\|f - g\|_\infty \le \epsilon$.
• Equivalently, for all $f \in C([0,1]^d, \mathbb{R})$, there exists a sequence $(g_n)_{n \in \mathbb{N}}$ such that $g_n \in \mathcal{N}_1$ for $n \in \mathbb{N}$ and $\|f - g_n\|_\infty \to 0$ as $n \to \infty$.
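A quick numerical illustration (not part of the notes) of this density property in dimension $d = 1$: with randomly drawn inner weights and biases, and output weights $v$ fitted by least squares, the uniform error on a fine grid typically decreases as $N$ grows. All specific choices below (the sigmoid, the target $\sin(2\pi x)$, the weight scales) are illustrative assumptions, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))       # a sigmoidal activation

f_target = lambda x: np.sin(2 * np.pi * x)       # some continuous f on [0, 1]
x_grid = np.linspace(0.0, 1.0, 2001)

for N in [2, 5, 20, 100]:
    # Draw inner weights/biases at random and fit only v by least squares;
    # this is enough to see the sup-norm error shrink as N grows.
    w = rng.normal(0.0, 10.0, N)
    b = rng.uniform(-10.0, 10.0, N)
    Phi = sigma(np.outer(x_grid, w) + b)         # features sigma(w_i x + b_i)
    v, *_ = np.linalg.lstsq(Phi, f_target(x_grid), rcond=None)
    err = np.max(np.abs(Phi @ v - f_target(x_grid)))
    print(f"N = {N:3d}: sup-norm error on the grid = {err:.3f}")
```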
(set of neural networks with $N$ neurons), then we remark that $\mathcal{N}_{1,k} \subset \mathcal{N}_{1,k+1}$ for $k \in \mathbb{N}$. The proof of this inclusion is left as an exercise; one can for instance construct a neural network with $k+1$ neurons and $v_{k+1} = 0$ to obtain the function of a neural network with $k$ neurons. Hence $\inf_{f \in \mathcal{N}_{1,N}} \|f - f^\star\|_\infty$ is non-increasing in $N$. Hence, from Theorem 8, $\inf_{f \in \mathcal{N}_{1,N}} \|f - f^\star\|_\infty \to 0$ as $N \to \infty$ (left as an exercise). Hence, since $\mathbb{E}[g(X)^2] \le \|g\|_\infty^2$ for $g : [0,1]^d \to \mathbb{R}$, we obtain
\[
\inf_{f \in \mathcal{N}_{1,N}} \mathbb{E}\big[(f(X) - f^\star(X))^2\big] \xrightarrow[N \to \infty]{} 0 .
\]
Hence, if we minimize the empirical risk with neural networks with $N$ neurons ($N$ large), the approximation error will be small.
Step 2
We apply the Hahn-Banach theorem to construct a continuous linear map
L : C([0, 1]d , R) → C
such that L(f0 ) = 1 and L(f ) = 0 for all f ∈ N1 .
• Linear means that for g1 , g2 ∈ C([0, 1]d , R) and for α1 , α2 ∈ R, we have
L(α1 g1 + α2 g2 ) = α1 L(g1 ) + α2 L(g2 ).
• Continuous means that for g ∈ C([0, 1]d , R) and for a sequence (gn )n∈N with gn ∈ C([0, 1]d , R)
for n ∈ N and such that ||gn − g||∞ → 0 as n → ∞, we have
L(gn ) → L(g)
as n → ∞.
Step 3
We then use the Riesz representation theorem. There exists a complex-valued Borel measure $\mu$ on $[0,1]^d$ such that
\[
L(f) = \int_{[0,1]^d} f \, d\mu
\]
for $f \in C([0,1]^d, \mathbb{R})$, where the above integral is a Lebesgue integral. That $\mu$ is a complex-valued Borel measure on $[0,1]^d$ means that, with $\mathcal{B}$ the Borel sigma-algebra (the measurable subsets of $[0,1]^d$), we have
\[
\mu : \mathcal{B} \to \mathbb{C}.
\]
Furthermore, for $E \in \mathcal{B}$ such that $E = \cup_{i=1}^\infty E_i$ with $E_1, E_2, \ldots \in \mathcal{B}$ and $E_i \cap E_j = \emptyset$ for $i \neq j$, we have
\[
\mu(E) = \sum_{i=1}^\infty \mu(E_i),
\]
where $\mu(E_i) \in \mathbb{C}$ and $\sum_{i=1}^\infty |\mu(E_i)| < \infty$.
Step 4
We show that $\int_{[0,1]^d} f \, d\mu = 0$ for all $f \in \mathcal{N}_1$ implies that $\mu = 0$, which is a contradiction to $L(f_0) = \int_{[0,1]^d} f_0 \, d\mu = 1$ and concludes the proof.
Remark 9 The steps 1, 2 and 3 could be carried out with N1 replaced by other function spaces F.
These steps are actually classical in approximation theory. The step 4 is on the contrary specific to
neural networks with one hidden layer.
2.3 Complete proof
Let $f \in \mathcal{N}_1$. Then there exist $N \in \mathbb{N}$, $v_1, \ldots, v_N \in \mathbb{R}$, $b_1, \ldots, b_N \in \mathbb{R}$, $w_1, \ldots, w_N \in \mathbb{R}^d$ such that
\[
f : x \in [0,1]^d \mapsto \sum_{i=1}^N v_i\, \sigma\big(\langle w_i, x\rangle + b_i\big).
\]
Theorem 10  There exists a continuous linear map $L : C([0,1]^d, \mathbb{R}) \to \mathbb{C}$ such that $L(f_0) = 1$ and $L(f) = 0$ for all $f \in \overline{\mathcal{N}_1}$.
The above theorem holds because $f_0 \notin \overline{\mathcal{N}_1}$; see for instance [Rud98, Chapters 3 and 6]. We then apply a version of the Riesz representation theorem.
Theorem 11  There exists a complex-valued Borel measure $\mu$ on $[0,1]^d$ such that
\[
L(f) = \int_{[0,1]^d} f \, d\mu
\]
for $f \in C([0,1]^d, \mathbb{R})$. We have seen that $\mu : \mathcal{B} \to \mathbb{C}$ where, for $B \in \mathcal{B}$, we have $B \subset [0,1]^d$. Furthermore, we can define the total variation measure $|\mu|$ by
\[
|\mu|(E) = \sup \sum_{i=1}^\infty |\mu(E_i)|, \qquad E \in \mathcal{B},
\]
where the supremum is over the set of all the $(E_i)_{i \in \mathbb{N}}$ with $E_i \in \mathcal{B}$ for $i \in \mathbb{N}$, $E_i \cap E_j = \emptyset$ for $i \neq j$ and $E = \cup_{i=1}^\infty E_i$. Then $|\mu| : \mathcal{B} \to [0, \infty)$ and $|\mu|$ has finite mass, $|\mu|([0,1]^d) < \infty$.
Finally, there exists $h : [0,1]^d \to \mathbb{C}$, measurable, such that $|h(x)| = 1$ for $x \in [0,1]^d$ and
\[
d\mu = h \, d|\mu| .
\]
with $f_i(x) = \sigma\big(\langle w_i, x\rangle + b_i\big)$ for $i = 1, \ldots, N$.
Specifically, we can choose $v_1 = 1$ and $v_2 = \cdots = v_N = 0$ to obtain, for all $w \in \mathbb{R}^d$ and all $b \in \mathbb{R}$,
\[
L(f) = 0 \qquad \text{with } f(x) = \sigma\big(\langle w, x\rangle + b\big),
\]
and thus
\[
\int_{[0,1]^d} \sigma\big(\langle w, x\rangle + b\big)\, h(x)\, d|\mu|(x) = 0 .
\]
We let $\sigma_{\lambda,\varphi}(x) = \sigma\big(\lambda(\langle w, x\rangle + b) + \varphi\big)$ for $\lambda > 0$ and $\varphi \in \mathbb{R}$. For $x$ such that $\langle w, x\rangle + b > 0$,
\[
\sigma\big(\lambda(\langle w, x\rangle + b) + \varphi\big) \xrightarrow[\lambda \to +\infty]{} 1,
\]
while for $x$ such that $\langle w, x\rangle + b < 0$,
\[
\sigma\big(\lambda(\langle w, x\rangle + b) + \varphi\big) \xrightarrow[\lambda \to +\infty]{} 0,
\]
and for $x$ such that $\langle w, x\rangle + b = 0$ this quantity is constant, equal to $\sigma(\varphi)$.
Furthermore,
\[
\sup_{\lambda > 0} \sup_{x \in [0,1]^d} |\sigma_{\lambda,\varphi}(x)| \le \sup_{t \in \mathbb{R}} |\sigma(t)| = \|\sigma\|_\infty < \infty,
\]
as $\sigma$ is continuous and has finite limits at $\pm\infty$. We recall that $|h(x)| = 1$ for all $x \in [0,1]^d$ and thus
\[
\int_{[0,1]^d} \sup_{\lambda > 0} |\sigma_{\lambda,\varphi}(x)|\, |h(x)|\, d|\mu|(x) \le \int_{[0,1]^d} \sup_{t \in \mathbb{R}} |\sigma(t)|\, d|\mu|(x) = \sup_{t \in \mathbb{R}} |\sigma(t)|\, |\mu|([0,1]^d) < \infty .
\]
We let
\[
\Pi_{w,b} = \big\{ x \in [0,1]^d : \langle w, x\rangle + b = 0 \big\}
\quad \text{and} \quad
H_{w,b} = \big\{ x \in [0,1]^d : \langle w, x\rangle + b > 0 \big\},
\]
and thus, letting $\lambda \to +\infty$ (by dominated convergence, using the previous domination),
\[
\mu(H_{w,b}) + \sigma(\varphi)\, \mu(\Pi_{w,b}) = 0 .
\]
Since $\sigma$ is not constant, we can take $\varphi_1, \varphi_2 \in \mathbb{R}$ with $\sigma(\varphi_1) \neq \sigma(\varphi_2)$ and thus
\[
\begin{pmatrix} 1 & \sigma(\varphi_1) \\ 1 & \sigma(\varphi_2) \end{pmatrix}
\begin{pmatrix} \mu(H_{w,b}) \\ \mu(\Pi_{w,b}) \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\]
and the determinant of the above matrix is $\sigma(\varphi_2) - \sigma(\varphi_1) \neq 0$. Hence, for all $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$,
\[
\mu\big(\underbrace{\Pi_{w,b}}_{\text{hyperplane}}\big) = \mu\big(\underbrace{H_{w,b}}_{\text{half-space}}\big) = 0 .
\]
Let $w \in \mathbb{R}^d$. We write $\|w\|_1 = \sum_{i=1}^d |w_i|$. For a bounded $g : [-\|w\|_1, \|w\|_1] \to \mathbb{C}$ (not necessarily continuous), we let
\[
\psi(g) = \int_{[0,1]^d} g\big(\langle w, x\rangle\big)\, d\mu(x) .
\]
We remark that
\[
|\langle w, x\rangle| = \Big| \sum_{i=1}^d w_i x_i \Big| \le \sum_{i=1}^d |w_i| = \|w\|_1 .
\]
We observe that $\psi$ is linear: for any bounded $g_1, g_2 : [-\|w\|_1, \|w\|_1] \to \mathbb{C}$ and any $\alpha_1, \alpha_2 \in \mathbb{R}$, we have
\begin{align*}
\psi(\alpha_1 g_1 + \alpha_2 g_2) &= \int_{[0,1]^d} \big(\alpha_1 g_1(\langle w, x\rangle) + \alpha_2 g_2(\langle w, x\rangle)\big)\, d\mu(x) \\
&= \alpha_1 \int_{[0,1]^d} g_1(\langle w, x\rangle)\, d\mu(x) + \alpha_2 \int_{[0,1]^d} g_2(\langle w, x\rangle)\, d\mu(x) \\
&= \alpha_1 \psi(g_1) + \alpha_2 \psi(g_2).
\end{align*}
With $\|g_1 - g_2\|_\infty = \sup_{t \in [-\|w\|_1, \|w\|_1]} |g_1(t) - g_2(t)|$, we have
\[
|\psi(g_1) - \psi(g_2)| \le \int_{[0,1]^d} \|g_1 - g_2\|_\infty\, d|\mu|(x) = \|g_1 - g_2\|_\infty \underbrace{|\mu|([0,1]^d)}_{< \infty},
\]
which is a Lipschitz property (stronger than continuity). Then, for $\theta \in \mathbb{R}$ and $g : [-\|w\|_1, \|w\|_1] \to \mathbb{R}$ defined by
\[
g(t) = \mathbf{1}_{t \in [\theta, +\infty)}
\]
for $t \in [-\|w\|_1, \|w\|_1]$, we have
\begin{align*}
\psi(g) &= \int_{[0,1]^d} \mathbf{1}_{\langle w, x\rangle \in [\theta, +\infty)}\, d\mu(x)
= \int_{[0,1]^d} \mathbf{1}_{\langle w, x\rangle - \theta \ge 0}\, d\mu(x) \\
&= \int_{[0,1]^d} \mathbf{1}_{\langle w, x\rangle - \theta > 0}\, d\mu(x) + \int_{[0,1]^d} \mathbf{1}_{\langle w, x\rangle - \theta = 0}\, d\mu(x)
= \int_{H_{w,-\theta}} d\mu(x) + \int_{\Pi_{w,-\theta}} d\mu(x) \\
&= \mu(H_{w,-\theta}) + \mu(\Pi_{w,-\theta}) = 0,
\end{align*}
from what we have seen before. For $g$ defined on $[-\|w\|_1, \|w\|_1]$, valued in $\mathbb{C}$, defined by
\[
g(t) = \mathbf{1}_{t \in (\theta, +\infty)},
\]
we obtain in the same way $\psi(g) = \mu(H_{w,-\theta}) = 0$. Hence, for $\theta_1 \le \theta_2$,
• with $\mathbf{1}_{[\theta_1, \theta_2]} : [-\|w\|_1, \|w\|_1] \to \mathbb{R}$ defined by $\mathbf{1}_{[\theta_1, \theta_2]}(t) = \mathbf{1}_{t \in [\theta_1, \theta_2]}$ for $t \in [-\|w\|_1, \|w\|_1]$,
• with $\mathbf{1}_{[\theta_1, \theta_2)} : [-\|w\|_1, \|w\|_1] \to \mathbb{R}$ defined by $\mathbf{1}_{[\theta_1, \theta_2)}(t) = \mathbf{1}_{t \in [\theta_1, \theta_2)}$ for $t \in [-\|w\|_1, \|w\|_1]$,
• with $\mathbf{1}_{(\theta_1, \theta_2]} : [-\|w\|_1, \|w\|_1] \to \mathbb{R}$ defined by $\mathbf{1}_{(\theta_1, \theta_2]}(t) = \mathbf{1}_{t \in (\theta_1, \theta_2]}$ for $t \in [-\|w\|_1, \|w\|_1]$,
we have
\[
\psi(\mathbf{1}_{[\theta_1, \theta_2]}) = \psi\big(\mathbf{1}_{[\theta_1, +\infty)} - \mathbf{1}_{(\theta_2, +\infty)}\big) = \psi(\mathbf{1}_{[\theta_1, +\infty)}) - \psi(\mathbf{1}_{(\theta_2, +\infty)}) = 0 - 0 = 0,
\]
with $\mathbf{1}_{[\theta_1, +\infty)}(t) = \mathbf{1}_{t \ge \theta_1}$ and $\mathbf{1}_{(\theta_2, +\infty)}(t) = \mathbf{1}_{t > \theta_2}$ for $t \in [-\|w\|_1, \|w\|_1]$, from what we have seen before. Also
\[
\psi(\mathbf{1}_{[\theta_1, \theta_2)}) = \psi(\mathbf{1}_{[\theta_1, +\infty)}) - \psi(\mathbf{1}_{[\theta_2, +\infty)}) = 0 - 0 = 0
\]
and
\[
\psi(\mathbf{1}_{(\theta_1, \theta_2]}) = \psi(\mathbf{1}_{(\theta_1, +\infty)}) - \psi(\mathbf{1}_{(\theta_2, +\infty)}) = 0 - 0 = 0 .
\]
Now let us write $r : [-\|w\|_1, \|w\|_1] \to \mathbb{C}$ defined by
\[
r(t) = e^{it} = \cos(t) + i \sin(t),
\]
with $i^2 = -1$ and for $t \in [-\|w\|_1, \|w\|_1]$. Let us also write, for $k \in \mathbb{N}$ and $t \in [-\|w\|_1, \|w\|_1]$,
\[
r_k(t) = \mathbf{1}_{t = -\|w\|_1}\, r(-\|w\|_1) + \sum_{j=-k}^{k-1} \mathbf{1}_{\left(\frac{j\|w\|_1}{k},\, \frac{(j+1)\|w\|_1}{k}\right]}(t)\; r\Big(\frac{j\|w\|_1}{k}\Big).
\]
Then
\[
\sup_{t \in [-\|w\|_1, \|w\|_1]} |r_k(t) - r(t)| \;\le\; \sup_{\substack{x, y \in [-\|w\|_1, \|w\|_1] \\ |x - y| \le \|w\|_1 / k}} |r(x) - r(y)| \;\xrightarrow[k \to \infty]{}\; 0,
\]
since $r$ is uniformly continuous (and even Lipschitz) on $[-\|w\|_1, \|w\|_1]$. Hence, with the continuity property that we have seen,
\[
\psi(r) = \lim_{k \to \infty} \psi(r_k) = 0,
\]
since $\psi(r_k) = 0$ for $k \in \mathbb{N}$ from what we have seen before. Hence, we have shown that, for any $w \in \mathbb{R}^d$,
\[
\int_{[0,1]^d} e^{i\langle w, x\rangle}\, d\mu(x) = 0 .
\]
We recognize the Fourier transform of the measure $\mu$. This implies that $\mu$ is the zero measure (which can be shown by technical arguments that are not specific to neural networks). Hence
\[
L(f_0) = \int_{[0,1]^d} f_0(x)\, d\mu(x) = 0,
\]
which contradicts $L(f_0) = 1$ and concludes the proof.
We have seen that, intuitively, the larger F is, the larger this generalization error is. A measure of
the “size” or “complexity” of F is given by the following definition.
Definition 12  We call shattering coefficient of $\mathcal{F}$ (at $n$, for $n \in \mathbb{N}$) the quantity
\[
\Pi_{\mathcal{F}}(n) = \max_{x_1, \ldots, x_n \in [0,1]^d} \operatorname{card}\big\{ (f(x_1), \ldots, f(x_n)) \,;\; f \in \mathcal{F} \big\} .
\]
Example
Let d = 1 and
F = {x ∈ [0, 1] 7→ 1x≥a ; a ∈ R} .
Then for any 0 ≤ x1 ≤ · · · ≤ xn ≤ 1 and for f ∈ F, we have
(f (x1 ), . . . , f (xn )) = (0, . . . , 0, 1, . . . , 1) ,
where
• if a > xn then there are only 0s,
• if a ≤ x1 then there are only 1s,
• if x1 < a ≤ xn then the first 1 is at position i ∈ {2, . . . , n} with xi−1 < a and xi ≥ a.
Hence the vectors that we can obtain are
(0, . . . , 0), (0, . . . , 1), (0, . . . , 1, 1), . . . , (1, . . . , 1).
Hence there are n + 1 possibilities. Hence
card {(f (x1 ), . . . , f (xn )) ; f ∈ F} ≤ n + 1.
If we consider $x_1, \ldots, x_n$ that are not necessarily ordered, there is a bijection between $\{(f(x_1), \ldots, f(x_n)) ; f \in \mathcal{F}\}$ and $\{(f(x_{(1)}), \ldots, f(x_{(n)})) ; f \in \mathcal{F}\}$, where $x_{(1)} \le \cdots \le x_{(n)}$ are obtained by ordering $x_1, \ldots, x_n$. Thus we still have
\[
\operatorname{card}\{ (f(x_1), \ldots, f(x_n)) \,;\; f \in \mathcal{F} \} \le n + 1 .
\]
Hence ΠF (n) ≤ n + 1. Furthermore, with x1 = 0, x2 = 1/n, . . . , xn = (n − 1)/n,
• with f given by x 7→ 1x≥2 we have (f (x1 ), . . . , f (xn )) = (0, . . . , 0),
• with f given by x 7→ 1x≥−1 we have (f (x1 ), . . . , f (xn )) = (1, . . . , 1),
• for i ∈ {1, . . . , n − 1}, with f given by x 7→ 1x≥(xi +xi+1 )/2 we have (f (x1 ), . . . , f (xn )) =
(0, . . . , 0, 1, . . . , 1) with i 0s and n − i 1s.
Hence with x1 = 0, x2 = 1/n, . . . , xn = (n − 1)/n we have
card {(f (x1 ), . . . , f (xn )) ; f ∈ F} ≥ n + 1.
Hence finally ΠF (n) ≥ n + 1 and thus
ΠF (n) = n + 1.
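A small numerical companion (not from the notes) to this example: for threshold classifiers in dimension $d = 1$, the number of distinct label vectors on $n$ distinct points can be counted exhaustively, and it equals $n + 1$.

```python
import numpy as np

def n_labelings_thresholds(x):
    """card{(1_{x1>=a}, ..., 1_{xn>=a}) : a in R} for threshold classifiers."""
    x = np.sort(np.asarray(x, dtype=float))
    # The labeling only changes when a crosses a data point, so it suffices
    # to try a below min(x), above max(x), and at each x_i.
    candidates = np.concatenate(([x[0] - 1.0], x, [x[-1] + 1.0]))
    labelings = {tuple((x >= a).astype(int)) for a in candidates}
    return len(labelings)

rng = np.random.default_rng(6)
for n in [1, 2, 5, 10]:
    x = rng.uniform(0, 1, n)
    print(n, n_labelings_thresholds(x), "expected:", n + 1)
```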
Example
Let $d = 2$ and
\[
\mathcal{F} = \big\{ x \in [0,1]^2 \mapsto \mathbf{1}_{\langle w, x\rangle \ge a} \,;\; a \in \mathbb{R},\; w \in \mathbb{R}^2 \big\} .
\]
Figure 4: An example of an affine classifier.
Then for n = 3 and for x1 , x2 , x3 ∈ [0, 1]2 that are not contained in a line, we can obtain the 8
possible classification vectors, as shown in Figure 5.
Figure 5: Obtaining the 8 possible classification vectors with 3 points and affine classifiers.
Hence
\[
\operatorname{card}\{ (f(x_1), f(x_2), f(x_3)) \,;\; f \in \mathcal{F} \} \ge 8 .
\]
Also
\[
\operatorname{card}\{ (f(x_1), f(x_2), f(x_3)) \,;\; f \in \mathcal{F} \} \le \operatorname{card}\big(\{0,1\}^3\big) = 2^3 = 8 .
\]
Hence we have
\[
\Pi_{\mathcal{F}}(3) = 8 .
\]
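As a quick check (not from the notes) that three non-collinear points can indeed be classified in all $2^3 = 8$ ways by the classifiers of $\mathcal{F}$, the sketch below searches for a pair $(w, a)$ realizing each labelling by random trials; the specific points and the random-search strategy are illustrative assumptions, not a proof.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)
points = np.array([[0.1, 0.1], [0.9, 0.2], [0.4, 0.8]])   # not on a common line

# For each of the 2^3 target labelings, look for (w, a) with
# f(x) = 1_{<w, x> >= a} realizing it (random search).
found = {}
for _ in range(100_000):
    w = rng.normal(size=2)
    a = rng.normal()
    labels = tuple((points @ w >= a).astype(int))
    found.setdefault(labels, (w, a))

print("number of labelings realized:", len(found))          # should be 8
for labels in product([0, 1], repeat=3):
    print(labels, "realized" if labels in found else "missing")
```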
Proposition 15  For any set $\mathcal{F}$ of functions from $[0,1]^d$ to $\{0,1\}$, we have
\[
\mathbb{E}\Bigg[ \sup_{f \in \mathcal{F}} \bigg| \mathbb{P}\big(f(X) \neq Y\big) - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{f(X_i) \neq Y_i} \bigg| \Bigg] \le 2 \sqrt{\frac{2 \log\big(2\, \Pi_{\mathcal{F}}(n)\big)}{n}} .
\]
Remarks
• The notation $\log$ stands for the natural logarithm (base $e$).
• We see a dependence in $1/\sqrt{n}$, which is classical when empirical means are compared with expectations.
In the second inequality above, we have used Jensen's inequality, which implies that $\mathbb{E}(|W|) \le \sqrt{\mathrm{Var}(W)}$ for a centered random variable $W$. In the second equality above, we have used that $(X_1, Y_1), \ldots, (X_n, Y_n)$ are independent and distributed as $(X, Y)$. The first inequality above holds because
\[
\mathrm{Var}\big(\mathbf{1}_{f(X) \neq Y}\big) \le \mathbb{E}\big[\mathbf{1}_{f(X) \neq Y}^2\big] \le \mathbb{E}(1) = 1 .
\]
• In all cases, we have $\Pi_{\mathcal{F}}(n) \le 2^n$ and thus the bound of Proposition 15 is smaller than
\[
2 \sqrt{\frac{2 \log(2 \times 2^n)}{n}} = 2 \sqrt{\frac{2 \log(2^{n+1})}{n}} = 2 \sqrt{\frac{2 (n+1) \log(2)}{n}} = 2 \sqrt{2 \log(2)} \sqrt{1 + \frac{1}{n}} \xrightarrow[n \to \infty]{} 2 \sqrt{2 \log(2)} .
\]
This bound based on $\Pi_{\mathcal{F}}(n) \le 2^n$ is not informative, because we already know that
\[
\mathbb{E}\Bigg[\sup_{f \in \mathcal{F}} \Big| \underbrace{\mathbb{P}\big(f(X) \neq Y\big)}_{\in [0,1]} - \underbrace{\frac{1}{n}\sum_{i=1}^n \mathbf{1}_{f(X_i) \neq Y_i}}_{\in [0,1]} \Big|\Bigg] \le \mathbb{E}\Big[\sup_{f \in \mathcal{F}} 1\Big] = 1 .
\]
To summarize
• The bound of Proposition 15 agrees in terms of orders of magnitude with the two extreme cases $\operatorname{card}(\mathcal{F}) = 1$ (then $\Pi_{\mathcal{F}}(n) = 1$) and $\Pi_{\mathcal{F}}(n) \le 2^n$.
• This bound will be particularly useful when ΠF (n) is in between these two cases.
Proof of Proposition 15
The proof is based on a classical argument that is called symmetrization. Without loss of generality,
we can assume that Y ∈ {−1, 1} and F is composed of functions from [0, 1]d to {−1, 1} (the choice of 0
and 1 to define the two classes is arbitrary in classification, here −1 and 1 will be more convenient). We
let (X̃1 , Ỹ1 ), . . . , (X̃n , Ỹn ) be pairs of random variables such that (X1 , Y1 ), . . . , (Xn , Yn ), (X̃1 , Ỹ1 ), . . . , (X̃n , Ỹn )
are independent and with the same distribution as (X, Y ).
Then
\[
\mathbb{P}\big(f(X) \neq Y\big) = \tilde{\mathbb{E}}\bigg[ \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{f(\tilde X_i) \neq \tilde Y_i} \bigg],
\]
writing $\tilde{\mathbb{E}}$ to indicate that the expectation is taken with respect to $(\tilde X_1, \tilde Y_1), \ldots, (\tilde X_n, \tilde Y_n)$. We let
\[
\Delta_n = \mathbb{E}\Bigg[ \sup_{f \in \mathcal{F}} \bigg| \mathbb{P}\big(f(X) \neq Y\big) - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{f(X_i) \neq Y_i} \bigg| \Bigg].
\]
We have
\begin{align*}
\Delta_n &= \mathbb{E}\Bigg[ \sup_{f \in \mathcal{F}} \bigg| \tilde{\mathbb{E}}\bigg[\frac{1}{n} \sum_{i=1}^n \mathbf{1}_{f(\tilde X_i) \neq \tilde Y_i}\bigg] - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{f(X_i) \neq Y_i} \bigg| \Bigg] \\
&= \mathbb{E}\Bigg[ \sup_{f \in \mathcal{F}} \bigg| \tilde{\mathbb{E}}\bigg[\frac{1}{n} \sum_{i=1}^n \mathbf{1}_{f(\tilde X_i) \neq \tilde Y_i} - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{f(X_i) \neq Y_i}\bigg] \bigg| \Bigg] \\
&\le \mathbb{E}\Bigg[ \sup_{f \in \mathcal{F}} \tilde{\mathbb{E}}\bigg[\bigg| \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{f(\tilde X_i) \neq \tilde Y_i} - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{f(X_i) \neq Y_i}\bigg|\bigg] \Bigg] \\
&\le \mathbb{E}\tilde{\mathbb{E}}\Bigg[ \sup_{f \in \mathcal{F}} \bigg| \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{f(\tilde X_i) \neq \tilde Y_i} - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{f(X_i) \neq Y_i}\bigg| \Bigg].
\end{align*}
Let now $\sigma_1, \ldots, \sigma_n$ be independent random variables, independent from $(X_i, Y_i, \tilde X_i, \tilde Y_i)_{i=1,\ldots,n}$ and such that
\[
\mathbb{P}_\sigma(\sigma_i = 1) = \mathbb{P}_\sigma(\sigma_i = -1) = \frac{1}{2}
\]
for $i = 1, \ldots, n$, writing $\mathbb{E}_\sigma$ and $\mathbb{P}_\sigma$ for the expectations and probabilities with respect to $\sigma_1, \ldots, \sigma_n$. We let, for $i = 1, \ldots, n$,
\[
(\bar X_i, \bar Y_i) =
\begin{cases}
(X_i, Y_i) & \text{if } \sigma_i = 1 \\
(\tilde X_i, \tilde Y_i) & \text{if } \sigma_i = -1
\end{cases}
\qquad \text{and} \qquad
(\bar{\bar X}_i, \bar{\bar Y}_i) =
\begin{cases}
(\tilde X_i, \tilde Y_i) & \text{if } \sigma_i = 1 \\
(X_i, Y_i) & \text{if } \sigma_i = -1 .
\end{cases}
\]
Then $(\bar X_i, \bar Y_i)_{i=1,\ldots,n}, (\bar{\bar X}_i, \bar{\bar Y}_i)_{i=1,\ldots,n}$ are independent and have the same distribution as $(X, Y)$. Let
us show this. For any bounded measurable functions g1 , . . . , g2n from [0, 1]d × {−1, 1} to R, we have,
using the law of total expectation,
\begin{align*}
&\mathbb{E}\Bigg[ \bigg(\prod_{i=1}^n g_i(\bar X_i, \bar Y_i)\bigg) \bigg(\prod_{i=1}^n g_{n+i}(\bar{\bar X}_i, \bar{\bar Y}_i)\bigg) \Bigg]
= \mathbb{E}\Bigg[ \mathbb{E}\bigg[ \bigg(\prod_{i=1}^n g_i(\bar X_i, \bar Y_i)\bigg) \bigg(\prod_{i=1}^n g_{n+i}(\bar{\bar X}_i, \bar{\bar Y}_i)\bigg) \,\bigg|\, \sigma_1, \ldots, \sigma_n \bigg] \Bigg] \\
&= \mathbb{E}\Bigg[ \mathbb{E}\bigg[ \prod_{\substack{i=1,\ldots,n \\ \sigma_i = 1}} g_i(X_i, Y_i) \prod_{\substack{i=1,\ldots,n \\ \sigma_i = -1}} g_i(\tilde X_i, \tilde Y_i) \prod_{\substack{j=1,\ldots,n \\ \sigma_j = 1}} g_{n+j}(\tilde X_j, \tilde Y_j) \prod_{\substack{j=1,\ldots,n \\ \sigma_j = -1}} g_{n+j}(X_j, Y_j) \,\bigg|\, \sigma_1, \ldots, \sigma_n \bigg] \Bigg].
\end{align*}
In the above conditional expectation, the $2n$ variables are independent since each of the $(X_i, Y_i)_{i=1,\ldots,n}, (\tilde X_i, \tilde Y_i)_{i=1,\ldots,n}$ appears exactly once. Their common distribution is that of $(X, Y)$. Furthermore, the $2n$ functions $g_1, \ldots, g_{2n}$ appear once each. Hence the conditional expectation given $\sigma_1, \ldots, \sigma_n$ equals $\prod_{i=1}^{2n} \mathbb{E}\big(g_i(X, Y)\big)$, and thus
\[
\mathbb{E}\Bigg[ \bigg(\prod_{i=1}^n g_i(\bar X_i, \bar Y_i)\bigg) \bigg(\prod_{i=1}^n g_{n+i}(\bar{\bar X}_i, \bar{\bar Y}_i)\bigg) \Bigg] = \prod_{i=1}^{2n} \mathbb{E}\big(g_i(X, Y)\big).
\]
Hence, indeed, $(\bar X_i, \bar Y_i)_{i=1,\ldots,n}, (\bar{\bar X}_i, \bar{\bar Y}_i)_{i=1,\ldots,n}$ are independent and have the same distribution as $(X, Y)$.
Hence, we have
\[
\Delta_n \le \mathbb{E}\tilde{\mathbb{E}}\mathbb{E}_\sigma\Bigg[ \sup_{f \in \mathcal{F}} \bigg| \frac{1}{n} \sum_{i=1}^n \Big( \mathbf{1}_{f(\bar X_i) \neq \bar Y_i} - \mathbf{1}_{f(\bar{\bar X}_i) \neq \bar{\bar Y}_i} \Big) \bigg| \Bigg],
\]
because
• if $\sigma_i = 1$, $(\bar X_i, \bar Y_i, \bar{\bar X}_i, \bar{\bar Y}_i) = (X_i, Y_i, \tilde X_i, \tilde Y_i)$,
• if $\sigma_i = -1$, $(\bar X_i, \bar Y_i, \bar{\bar X}_i, \bar{\bar Y}_i) = (\tilde X_i, \tilde Y_i, X_i, Y_i)$.
Hence
\begin{align*}
\Delta_n &\le \mathbb{E}\tilde{\mathbb{E}}\mathbb{E}_\sigma\Bigg[ \sup_{f \in \mathcal{F}} \bigg| \frac{1}{n} \sum_{i=1}^n \sigma_i \Big( \mathbf{1}_{f(X_i) \neq Y_i} - \mathbf{1}_{f(\tilde X_i) \neq \tilde Y_i} \Big) \bigg| \Bigg] \\
&\le \mathbb{E}\tilde{\mathbb{E}}\mathbb{E}_\sigma\Bigg[ \sup_{f \in \mathcal{F}} \bigg| \frac{1}{n} \sum_{i=1}^n \sigma_i \mathbf{1}_{f(X_i) \neq Y_i} \bigg| + \sup_{f \in \mathcal{F}} \bigg| \frac{1}{n} \sum_{i=1}^n \sigma_i \mathbf{1}_{f(\tilde X_i) \neq \tilde Y_i} \bigg| \Bigg] \\
&= 2\, \mathbb{E}\mathbb{E}_\sigma\Bigg[ \sup_{f \in \mathcal{F}} \bigg| \frac{1}{n} \sum_{i=1}^n \sigma_i \mathbf{1}_{f(X_i) \neq Y_i} \bigg| \Bigg] \\
&\le 2 \max_{y_1, \ldots, y_n \in \{-1,1\}} \max_{x_1, \ldots, x_n \in [0,1]^d} \mathbb{E}_\sigma\Bigg[ \sup_{f \in \mathcal{F}} \bigg| \frac{1}{n} \sum_{i=1}^n \sigma_i \mathbf{1}_{f(x_i) \neq y_i} \bigg| \Bigg].
\end{align*}
Then we have
\[
\Delta_n \le \frac{2}{n} \max_{y_1, \ldots, y_n \in \{-1,1\}} \max_{x_1, \ldots, x_n \in [0,1]^d} \mathbb{E}_\sigma\Big[ \sup_{v \in V_{\mathcal{F}}(x,y)} |\langle \sigma, v\rangle| \Big],
\]
with $\sigma = (\sigma_1, \ldots, \sigma_n)$ and $V_{\mathcal{F}}(x,y) = \big\{ \big(\mathbf{1}_{f(x_1) \neq y_1}, \ldots, \mathbf{1}_{f(x_n) \neq y_n}\big) \,;\; f \in \mathcal{F} \big\}$.
We observe that, for all $x_1, \ldots, x_n, y_1, \ldots, y_n$, there is a bijection between $V_{\mathcal{F}}(x,y)$ and $\{(f(x_1), \ldots, f(x_n)); f \in \mathcal{F}\}$. Hence
\[
\max_{y_1, \ldots, y_n \in \{-1,1\}} \max_{x_1, \ldots, x_n \in [0,1]^d} \operatorname{card}\big(V_{\mathcal{F}}(x,y)\big) \le \Pi_{\mathcal{F}}(n).
\]
Hence, if we show that, for any finite set $V$ of vectors $v \in \{0,1\}^n$,
\[
\mathbb{E}_\sigma\Big[ \sup_{v \in V} |\langle \sigma, v\rangle| \Big] \le \sqrt{2 n \log\big(2\,\operatorname{card}(V)\big)}, \tag{6}
\]
then
\[
\Delta_n \le \frac{2}{n} \sqrt{2 n \log\big(2\,\Pi_{\mathcal{F}}(n)\big)} = 2 \sqrt{\frac{2 \log\big(2\,\Pi_{\mathcal{F}}(n)\big)}{n}},
\]
which would conclude the proof.
Let us now show (6). Let us write $-V = \{-v \,;\; v \in V\}$ and $V^\# = V \cup (-V)$. We have, for any $s > 0$,
\[
\mathbb{E}_\sigma\Big[ \sup_{v \in V} |\langle \sigma, v\rangle| \Big]
= \mathbb{E}_\sigma\Big[ \sup_{v \in V^\#} \langle \sigma, v\rangle \Big]
= \mathbb{E}_\sigma\bigg[ \frac{1}{s} \log\Big( e^{s \sup_{v \in V^\#} \langle \sigma, v\rangle} \Big) \bigg].
\]
We now apply Jensen's inequality to the concave function $(1/s) \log$. This gives
\begin{align*}
\mathbb{E}_\sigma\Big[ \sup_{v \in V} |\langle \sigma, v\rangle| \Big]
&\le \frac{1}{s} \log\Big( \mathbb{E}_\sigma\Big[ e^{s \sup_{v \in V^\#} \langle \sigma, v\rangle} \Big] \Big)
= \frac{1}{s} \log\bigg( \mathbb{E}_\sigma\Big[ \sup_{v \in V^\#} e^{s \langle \sigma, v\rangle} \Big] \bigg) \\
&\le \frac{1}{s} \log\bigg( \mathbb{E}_\sigma\Big[ \sum_{v \in V^\#} e^{s \langle \sigma, v\rangle} \Big] \bigg)
= \frac{1}{s} \log\bigg( \sum_{v \in V^\#} \mathbb{E}_\sigma\Big[ e^{s \langle \sigma, v\rangle} \Big] \bigg) \\
&= \frac{1}{s} \log\bigg( \sum_{v \in V^\#} \prod_{i=1}^n \mathbb{E}_\sigma\big( e^{s \sigma_i v_i} \big) \bigg)
= \frac{1}{s} \log\bigg( \sum_{v \in V^\#} \prod_{i=1}^n \frac{1}{2}\big( e^{s v_i} + e^{-s v_i} \big) \bigg).
\end{align*}
We can show simply that, for $x \ge 0$, $e^x + e^{-x} \le 2 e^{x^2/2}$. This gives, using also that $v_i^2 \le 1$ for $i = 1, \ldots, n$ and $v \in V^\#$,
\[
\mathbb{E}_\sigma\Big[ \sup_{v \in V} |\langle \sigma, v\rangle| \Big]
\le \frac{1}{s} \log\bigg( \sum_{v \in V^\#} \prod_{i=1}^n e^{\frac{s^2 v_i^2}{2}} \bigg)
\le \frac{1}{s} \log\bigg( \sum_{v \in V^\#} e^{\frac{n s^2}{2}} \bigg)
\le \frac{1}{s} \log\Big( \operatorname{card}(V^\#)\, e^{\frac{n s^2}{2}} \Big)
= \frac{\log\big(\operatorname{card}(V^\#)\big)}{s} + \frac{n s}{2}.
\]
We let
\[
s = \sqrt{\frac{2 \log\big(\operatorname{card}(V^\#)\big)}{n}},
\]
which gives
\[
\mathbb{E}_\sigma\Big[ \sup_{v \in V} |\langle \sigma, v\rangle| \Big]
\le \frac{1}{\sqrt{2}} \sqrt{n \log\big(\operatorname{card}(V^\#)\big)} + \frac{1}{\sqrt{2}} \sqrt{n \log\big(\operatorname{card}(V^\#)\big)}
= \sqrt{2 n \log\big(\operatorname{card}(V^\#)\big)}
\le \sqrt{2 n \log\big(2\operatorname{card}(V)\big)} .
\]
3.3 VC-dimension
From the previous proposition, the shattering coefficient $\Pi_{\mathcal{F}}(n)$ is important and we would like to quantify its growth as $n$ grows. A tool for this is the Vapnik-Chervonenkis dimension, which we will call VC-dimension.
Definition 16  For a set $\mathcal{F}$ of functions from $[0,1]^d$ to $\{0,1\}$, we write $\mathrm{VCdim}(\mathcal{F})$ and call VC-dimension the quantity
\[
\mathrm{VCdim}(\mathcal{F}) = \sup\{ m \in \mathbb{N} \,;\; \Pi_{\mathcal{F}}(m) = 2^m \},
\]
with the convention $\Pi_{\mathcal{F}}(0) = 1$, so that $\mathrm{VCdim}(\mathcal{F}) \ge 0$. It is possible that $\mathrm{VCdim}(\mathcal{F}) = +\infty$.
Interpretation
The quantity VCdim(F) is the largest number of input points that can be “shattered”, meaning
that they can be classified in all possible ways by varying the classifier in F.
Examples
• When
\[
\mathcal{F} = \{ \text{all the functions from } [0,1]^d \text{ to } \{0,1\} \},
\]
then $\mathrm{VCdim}(\mathcal{F}) = +\infty$. Indeed, for any $n \in \mathbb{N}$, by considering $x_1, \ldots, x_n$ two-by-two distinct, we have $\Pi_{\mathcal{F}}(n) = 2^n$.
• When $\mathcal{F}$ is finite with $\operatorname{card}(\mathcal{F}) \le 2^{m_0}$, then $\mathrm{VCdim}(\mathcal{F}) \le m_0$. Indeed, for $m > m_0$, we have seen that $\Pi_{\mathcal{F}}(m) \le \operatorname{card}(\mathcal{F}) \le 2^{m_0} < 2^m$. Hence
\[
m \notin \{ m \in \mathbb{N} \,;\; \Pi_{\mathcal{F}}(m) = 2^m \}
\]
and thus
\[
\sup\{ m \in \mathbb{N} \,;\; \Pi_{\mathcal{F}}(m) = 2^m \} \le m_0 .
\]
Proof of Remark 17
Since $\Pi_{\mathcal{F}}(V) = 2^V$, there exist $x_1, \ldots, x_V \in [0,1]^d$ such that $\operatorname{card}\{(f(x_1), \ldots, f(x_V)) \,;\; f \in \mathcal{F}\} = 2^V$. This means that we obtain all the possible vectors with components in $\{0,1\}$, and thus we obtain all the possible subvectors for the $i$ first coefficients, for $i = 1, \ldots, V$. Hence $\Pi_{\mathcal{F}}(i) = 2^i$ for $i = 1, \ldots, V$.
Theorem 18  Let
\[
\mathcal{F}_{d,l} = \big\{ x \in [0,1]^d \mapsto \mathbf{1}_{\langle w, x\rangle \ge 0} \,;\; w \in \mathbb{R}^d \big\}
\quad \text{and} \quad
\mathcal{F}_{d,a} = \big\{ x \in [0,1]^d \mapsto \mathbf{1}_{\langle w, x\rangle + a \ge 0} \,;\; w \in \mathbb{R}^d,\; a \in \mathbb{R} \big\} .
\]
Then
\[
\mathrm{VCdim}(\mathcal{F}_{d,l}) = d
\quad \text{and} \quad
\mathrm{VCdim}(\mathcal{F}_{d,a}) = d + 1 .
\]
Remark 19 The VC-dimension coincides here with the number of free parameters and thus with the
usual notion of dimension.
Let $x_1 = e_1, \ldots, x_d = e_d$ be the vectors of the canonical basis of $\mathbb{R}^d$ (which belong to $[0,1]^d$). For any $y_1, \ldots, y_d \in \{0,1\}$, let $z_k = 1$ if $y_k = 1$ and $z_k = -1$ if $y_k = 0$, and consider
\[
x \mapsto \mathbf{1}_{\langle x,\, \sum_{j=1}^d z_j x_j \rangle \ge 0} \in \mathcal{F}_{d,l}.
\]
Then, for $k = 1, \ldots, d$,
\[
\mathbf{1}_{\langle x_k,\, \sum_{j=1}^d z_j x_j \rangle \ge 0} = \mathbf{1}_{\langle x_k,\, z_k x_k \rangle \ge 0} = \mathbf{1}_{z_k \ge 0} = y_k .
\]
Hence
\[
\Pi_{\mathcal{F}_{d,l}}(d) = 2^d
\]
and thus
\[
\mathrm{VCdim}(\mathcal{F}_{d,l}) \ge d .
\]
Assume that
\[
\mathrm{VCdim}(\mathcal{F}_{d,l}) \ge d + 1 .
\]
Then, from Remark 17, $\Pi_{\mathcal{F}_{d,l}}(d+1) = 2^{d+1}$. Hence, there exist $x_1, \ldots, x_{d+1} \in [0,1]^d$ and $w_1, \ldots, w_{2^{d+1}} \in \mathbb{R}^d$ such that the vectors
\[
\begin{pmatrix}
w_i^\top x_1 \\
\vdots \\
w_i^\top x_{d+1}
\end{pmatrix},
\]
for $i = 1, \ldots, 2^{d+1}$, take all possible sign vectors ($< 0$ or $\ge 0$). We write
\[
X = \begin{pmatrix} x_1^\top \\ \vdots \\ x_{d+1}^\top \end{pmatrix}
\]
of dimension $(d+1) \times d$, and
\[
W = \begin{pmatrix} w_1 & \ldots & w_{2^{d+1}} \end{pmatrix}
\]
of dimension $d \times 2^{d+1}$. Then
\[
X W = \begin{pmatrix}
x_1^\top w_1 & \ldots & x_1^\top w_{2^{d+1}} \\
\vdots & \ddots & \vdots \\
x_{d+1}^\top w_1 & \ldots & x_{d+1}^\top w_{2^{d+1}}
\end{pmatrix}
\]
is of dimension $(d+1) \times 2^{d+1}$. Let us show that the $d+1$ rows of $XW$ are linearly independent. Let $a$, of size $(d+1) \times 1$ and non-zero, be such that
\[
a^\top X W = 0,
\]
where the above display is a linear combination of the rows of $XW$. Then, for $k \in \{1, \ldots, 2^{d+1}\}$,
\[
(a^\top X W)_k = a^\top X w_k = a^\top \begin{pmatrix} x_1^\top w_k \\ \vdots \\ x_{d+1}^\top w_k \end{pmatrix}.
\]
Since $a$ is non-zero, we can assume that there is a $j$ such that $a_j < 0$ (up to replacing $a$ by $-a$ at the beginning). Since the columns of $XW$ realize all possible sign vectors, we can choose $k$ such that, for $i = 1, \ldots, d+1$, $x_i^\top w_k \ge 0$ if $a_i \ge 0$ and $x_i^\top w_k < 0$ if $a_i < 0$. Then every term $a_i\, x_i^\top w_k$ is non-negative and
\[
(a^\top X W)_k \ge a_j\, x_j^\top w_k = |a_j|\, |x_j^\top w_k| > 0,
\]
since $x_j^\top w_k < 0$ and $a_j < 0$. This is a contradiction. Hence there does not exist $a$ of size $(d+1) \times 1$, non-zero, such that $a^\top X W = 0$. Hence the $d+1$ rows of $XW$ are linearly independent. Hence the rank of $XW$ is equal to $d+1$. But the rank of $XW$ is smaller than or equal to $d$ because $X$ is of dimension $(d+1) \times d$. Hence we have reached a contradiction and thus
\[
\mathrm{VCdim}(\mathcal{F}_{d,l}) < d + 1 .
\]
Hence
\[
\mathrm{VCdim}(\mathcal{F}_{d,l}) = d .
\]
Let us now consider $\mathcal{F}_{d,a}$. Let
\[
x_1 = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \quad
x_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \quad \ldots, \quad
x_d = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix}
\quad \text{and} \quad
x_{d+1} = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}
\]
in $\mathbb{R}^d$. Then, for any $y_1, \ldots, y_{d+1} \in \{0,1\}$, write, for $i = 1, \ldots, d+1$,
\[
z_i = \begin{cases} 1 & \text{if } y_i = 1 \\ -1 & \text{if } y_i = 0 . \end{cases}
\]
Then, for $k = 1, \ldots, d$,
\[
\mathbf{1}_{\langle x_k,\, \sum_{j=1}^d (z_j - z_{d+1}) x_j \rangle \ge -z_{d+1}}
= \mathbf{1}_{\langle x_k,\, (z_k - z_{d+1}) x_k \rangle \ge -z_{d+1}}
= \mathbf{1}_{z_k - z_{d+1} \ge -z_{d+1}}
= \mathbf{1}_{z_k \ge 0}
= y_k
\]
and
\[
\mathbf{1}_{\langle x_{d+1},\, \sum_{j=1}^d (z_j - z_{d+1}) x_j \rangle \ge -z_{d+1}}
= \mathbf{1}_{0 \ge -z_{d+1}}
= \mathbf{1}_{z_{d+1} \ge 0}
= y_{d+1} .
\]
Hence
\[
\mathrm{VCdim}(\mathcal{F}_{d,a}) \ge d + 1 .
\]
Assume now that
\[
\mathrm{VCdim}(\mathcal{F}_{d,a}) \ge d + 2 .
\]
Then, as seen previously,
\[
\Pi_{\mathcal{F}_{d,a}}(d+2) = 2^{d+2} .
\]
Hence there exist $x_1, \ldots, x_{d+2} \in [0,1]^d$ such that, for all $y_1, \ldots, y_{d+2} \in \{0,1\}$, there exist $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ such that, for $k = 1, \ldots, d+2$,
\[
\mathbf{1}_{\langle w, x_k\rangle + b \ge 0} = y_k .
\]
We write
\[
\bar x_i = \begin{pmatrix} x_i \\ 1 \end{pmatrix}
\]
of size $(d+1) \times 1$ for $i = 1, \ldots, d+2$, and
\[
\bar w = \begin{pmatrix} w \\ b \end{pmatrix},
\]
so that $\langle \bar w, \bar x_k\rangle = \langle w, x_k\rangle + b$. Hence in $\mathbb{R}^{d+1}$ we have shattered the $d+2$ vectors $\bar x_1, \ldots, \bar x_{d+2}$ (we have obtained all the possible sign vectors) with linear classifiers. This implies
\[
\mathrm{VCdim}(\mathcal{F}_{d+1,l}) \ge d + 2,
\]
which is false since we have shown above that $\mathrm{VCdim}(\mathcal{F}_{d+1,l}) = d + 1$. Hence we have
\[
\mathrm{VCdim}(\mathcal{F}_{d,a}) < d + 2 .
\]
Hence
\[
\mathrm{VCdim}(\mathcal{F}_{d,a}) = d + 1 .
\]
Lemma 20 (Sauer's lemma)  Let $\mathcal{F}$ be a non-empty set of functions from $[0,1]^d$ to $\{0,1\}$. Assume that $\mathrm{VCdim}(\mathcal{F}) < \infty$. Then we have, for $n \in \mathbb{N}$,
\[
\Pi_{\mathcal{F}}(n) \le \sum_{i=0}^{\mathrm{VCdim}(\mathcal{F})} \binom{n}{i} \le (n+1)^{\mathrm{VCdim}(\mathcal{F})},
\]
with
\[
\binom{n}{i} = \begin{cases} \dfrac{n!}{i!\,(n-i)!} & \text{if } i \in \{0, \ldots, n\} \\[2mm] 0 & \text{if } i > n . \end{cases}
\]
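A small numerical check (not from the notes) of the two upper bounds in Lemma 20: the quantity $\sum_{i \le \mathrm{VCdim}} \binom{n}{i}$ is polynomial in $n$, hence eventually much smaller than $2^n$.

```python
from math import comb

def sauer_bounds(n, vc):
    """Right-hand sides in Sauer's lemma: sum_{i<=VC} C(n, i) and (n + 1)^VC."""
    return sum(comb(n, i) for i in range(min(vc, n) + 1)), (n + 1) ** vc

for n in [5, 20, 100]:
    for vc in [1, 3, 10]:
        s, p = sauer_bounds(n, vc)
        print(f"n={n:3d}, VCdim={vc:2d}: sum C(n,i)={s}  <=  (n+1)^VC={p}  (2^n={2**n})")
```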
Proof of Lemma 20
For any set $A$, with $H$ a non-empty set of functions from $A$ to $\{0,1\}$, we can define $\Pi_H(n)$ and $\mathrm{VCdim}(H)$ in the same way as when $A = [0,1]^d$. Let us show that, for any $k \in \mathbb{N}$, any set $A$ and any non-empty set $H$ of functions from $A$ to $\{0,1\}$ with $V_H := \mathrm{VCdim}(H) < \infty$,
\[
\Pi_H(k) \le \sum_{i=0}^{V_H} \binom{k}{i} . \tag{7}
\]
We will show (7) by induction on $k$.
Let us show it for $k = 1$. If $V_H = 0$ then $\Pi_H(1) < 2^1 = 2$, hence
\[
\Pi_H(1) \le 1 = \binom{1}{0} = \sum_{i=0}^{0} \binom{1}{i} .
\]
If $V_H \ge 1$ then $\sum_{i=0}^{V_H} \binom{1}{i} \ge \binom{1}{0} + \binom{1}{1} = 2 \ge \Pi_H(1)$. Hence eventually (7) is true for $k = 1$. Assume now that (7) is true for any $k$ from $1$ to $n-1$.
If $V_H = 0$, then there do not exist $x \in A$ and $h_1, h_2 \in H$ such that $h_1(x) = 0$ and $h_2(x) = 1$, because for all $x \in A$, $\operatorname{card}\{h(x) \,;\; h \in H\} < 2^1$. Hence, for all $x_1, \ldots, x_n \in A$, $\operatorname{card}\{(h(x_1), \ldots, h(x_n)) \,;\; h \in H\} = 1$. Hence
\[
\Pi_H(n) = 1 = \sum_{i=0}^{0} \binom{n}{i} .
\]
Assume now that $V_H \ge 1$. Let $x_1, \ldots, x_n \in A$ be such that
\[
\operatorname{card}\big(H(x_1, \ldots, x_n)\big) = \Pi_H(n),
\]
where we write $H(x_1, \ldots, x_n) = \{(h(x_1), \ldots, h(x_n)) \,;\; h \in H\}$.
The set $H(x_1, \ldots, x_n)$ only depends on the values of the functions of $H$ on $\{x_1, \ldots, x_n\}$. Hence, replacing
• $A$ by $A = \{x_1, \ldots, x_n\}$,
• $H$ by
\[
\tilde H = \big\{ h' : \{x_1, \ldots, x_n\} \to \{0,1\} \,;\; \text{there exists } h \in H \text{ such that } h'(x_i) = h(x_i) \text{ for } i = 1, \ldots, n \big\},
\]
we can assume that $A = \{x_1, \ldots, x_n\}$ and $H = \tilde H$. We then define
\[
H' = \big\{ h \in H \,;\; h(x_n) = 1 \text{ and } h - \mathbf{1}_{\{x_n\}} \in H \big\},
\]
composed of the functions that are equal to 1 at $x_n$ and that stay in $H$ if their value at $x_n$ is replaced by 0. Notice that we have written $\mathbf{1}_{\{x_n\}} : \{x_1, \ldots, x_n\} \to \{0,1\}$ defined by $\mathbf{1}_{\{x_n\}}(x_i) = \mathbf{1}_{x_n = x_i}$ for $i = 1, \ldots, n$.
We use the notation, for a set $G$ of functions from $\{x_1, \ldots, x_n\}$ to $\{0,1\}$ and $\{x_{i_1}, \ldots, x_{i_q}\} \subset \{x_1, \ldots, x_n\}$,
\[
G(x_{i_1}, \ldots, x_{i_q}) = \{ (g(x_{i_1}), \ldots, g(x_{i_q})) \,;\; g \in G \}.
\]
We have
\[
H(x_1, \ldots, x_n) = H'(x_1, \ldots, x_n) \cup (H \setminus H')(x_1, \ldots, x_n)
\]
and thus
\[
\operatorname{card} H(x_1, \ldots, x_n) \le \operatorname{card} H'(x_1, \ldots, x_n) + \operatorname{card} (H \setminus H')(x_1, \ldots, x_n) .
\]
Step 1: bounding $\operatorname{card} H'(x_1, \ldots, x_n)$
We observe that
\[
\operatorname{card} H'(x_1, \ldots, x_n) = \operatorname{card} H'(x_1, \ldots, x_{n-1})
\]
because $h(x_n) = 1$ for $h \in H'$.
If $q \in \mathbb{N}$ is such that there exists $\{x_{i_1}, \ldots, x_{i_q}\} \subset \{x_1, \ldots, x_n\}$ with $\operatorname{card} H'(x_{i_1}, \ldots, x_{i_q}) = 2^q$, then $x_n \notin \{x_{i_1}, \ldots, x_{i_q}\}$ (because $h(x_n) = 1$ for $h \in H'$). Also, we have $\operatorname{card} H(x_{i_1}, \ldots, x_{i_q}, x_n) = 2^{q+1}$, because
\[
2^{q+1} = \operatorname{card}\big( \{0,1\}^q \times \{0,1\} \big) = \operatorname{card}\big\{ (h(x_{i_1}), \ldots, h(x_{i_q}), h(x_n)) \,;\; h \in H \big\} :
\]
indeed, any pattern on $x_{i_1}, \ldots, x_{i_q}$ is realized by some $h \in H'$, and then both $h$ and $h - \mathbf{1}_{\{x_n\}}$ belong to $H$ and realize it with the two possible values at $x_n$. Hence $V_H \ge q + 1$, that is, $V_{H'} \le V_H - 1$. Hence, by the induction hypothesis,
\[
\operatorname{card} H'(x_1, \ldots, x_n) = \operatorname{card} H'(x_1, \ldots, x_{n-1}) \le \Pi_{H'}(n-1) \le \sum_{i=0}^{V_{H'}} \binom{n-1}{i} \le \sum_{i=0}^{V_H - 1} \binom{n-1}{i} .
\]
Step 2: bounding $\operatorname{card} (H \setminus H')(x_1, \ldots, x_n)$
If $h_1, h_2 \in H \setminus H'$ coincide on $x_1, \ldots, x_{n-1}$ and differ at $x_n$, say $h_1(x_n) = 1$ and $h_2(x_n) = 0$, then $h_1 - \mathbf{1}_{\{x_n\}} = h_2 \in H$, so $h_1 \in H'$, a contradiction. Hence, using also the induction hypothesis and $V_{H \setminus H'} \le V_H$ (since $H \setminus H' \subset H$),
\[
\operatorname{card} (H \setminus H')(x_1, \ldots, x_n) = \operatorname{card} (H \setminus H')(x_1, \ldots, x_{n-1}) \le \Pi_{H \setminus H'}(n-1) \le \sum_{i=0}^{V_{H \setminus H'}} \binom{n-1}{i} \le \sum_{i=0}^{V_H} \binom{n-1}{i} .
\]
Hence, combining Steps 1 and 2,
\begin{align*}
\operatorname{card} H(x_1, \ldots, x_n)
&\le \sum_{i=0}^{V_H - 1} \binom{n-1}{i} + \sum_{i=0}^{V_H} \binom{n-1}{i} \\
&\le \sum_{i=1}^{V_H} \binom{n-1}{i-1} + \sum_{i=0}^{V_H} \binom{n-1}{i} \\
&= 1 + \sum_{i=1}^{V_H} \bigg( \binom{n-1}{i-1} + \binom{n-1}{i} \bigg) \\
&= 1 + \sum_{i=1}^{V_H} \binom{n}{i} \\
&= \sum_{i=0}^{V_H} \binom{n}{i} .
\end{align*}
We recall that $\operatorname{card} H(x_1, \ldots, x_n) = \Pi_H(n)$ and that we had started with any $A \subset [0,1]^d$ and any set of functions from $A$ to $\{0,1\}$. Hence (7) is shown by induction. Hence we have
\[
\Pi_{\mathcal{F}}(n) \le \sum_{i=0}^{\mathrm{VCdim}(\mathcal{F})} \binom{n}{i},
\]
which gives the first inequality of the lemma. For the second inequality, we have
\begin{align*}
\sum_{i=0}^{\mathrm{VCdim}(\mathcal{F})} \binom{n}{i}
&= \sum_{i=0}^{\min(\mathrm{VCdim}(\mathcal{F}),\, n)} \binom{n}{i}
\le \sum_{i=0}^{\min(\mathrm{VCdim}(\mathcal{F}),\, n)} \frac{n^i}{i!}
\le \sum_{i=0}^{\min(\mathrm{VCdim}(\mathcal{F}),\, n)} n^i \binom{\mathrm{VCdim}(\mathcal{F})}{i} \\
&\le \sum_{i=0}^{\mathrm{VCdim}(\mathcal{F})} n^i \binom{\mathrm{VCdim}(\mathcal{F})}{i}
= (n+1)^{\mathrm{VCdim}(\mathcal{F})},
\end{align*}
using Newton's binomial formula at the end, which shows the second inequality of the lemma.
Finally, using Proposition 15 and Lemma 20, we obtain, for a set of functions $\mathcal{F}$ from $[0,1]^d$ to $\{0,1\}$,
\begin{align*}
\mathbb{E}\Bigg[ \sup_{f \in \mathcal{F}} \bigg| \mathbb{P}\big(f(X) \neq Y\big) - \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{f(X_i) \neq Y_i} \bigg| \Bigg]
&\le 2 \sqrt{\frac{2 \log\big(2\, \Pi_{\mathcal{F}}(n)\big)}{n}} \\
&\le 2 \sqrt{\frac{2 \log\big(2 (n+1)^{\mathrm{VCdim}(\mathcal{F})}\big)}{n}} \\
&= 2 \sqrt{\frac{2 \log(2) + 2\,\mathrm{VCdim}(\mathcal{F}) \log(n+1)}{n}} .
\end{align*}
When $\mathrm{VCdim}(\mathcal{F}) < \infty$, the bound goes to zero at a rate of almost $1/\sqrt{n}$. If we use a set of functions $\mathcal{F}_n$ that depends on $n$ (more complex if there are more observations), then the rate of convergence is almost $\sqrt{\mathrm{VCdim}(\mathcal{F}_n)}/\sqrt{n}$.
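A small helper (not from the notes) evaluating the last bound above; for affine classifiers in dimension $d$, Theorem 18 gives $\mathrm{VCdim} = d + 1$. The bound is only informative once it drops below 1.

```python
import math

def vc_generalization_bound(n, vc_dim):
    """Upper bound 2 * sqrt((2 log 2 + 2 VCdim log(n+1)) / n) from the display above."""
    return 2.0 * math.sqrt((2.0 * math.log(2.0) + 2.0 * vc_dim * math.log(n + 1.0)) / n)

# Example: affine classifiers in dimension d = 10, so VCdim = d + 1 = 11.
d = 10
for n in [100, 1_000, 10_000, 100_000, 1_000_000]:
    print(f"n = {n:>9d}: bound = {vc_generalization_bound(n, d + 1):.3f}")
```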
Figure 6: A directed graph defined by V = {1, 2, 3, 4} and E = {(1, 2), (1, 3), (3, 4)} with 4 vertices
and 3 edges.
We say that the directed graph (V, E) is acyclic if there does not exist any n ∈ N and v1 , . . . , vn ∈ V
such that
• vn = v1 ,
• (v1 , v2 ), . . . , (vn−1 , vn ) ∈ E.
Figure 7: The graph on the left is acyclic and the graph on the right is cyclic (not acyclic).
A directed graph which is acyclic is called a DAG (directed acyclic graph). We call path a vector
(v1 , . . . , vn ) with v1 , . . . , vn ∈ V and (v1 , v2 ), . . . , (vn−1 , vn ) ∈ E. For a DAG (V, E) and v ∈ V we
call indegree of v the quantity card{(v 0 , v), v 0 ∈ V, (v 0 , v) ∈ E}. We call outdegree of v the quantity
card{(v, v 0 ), v 0 ∈ V, (v, v 0 ) ∈ E}. A simple example is given in Figure 8.
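A minimal sketch (not from the notes) computing indegrees and outdegrees and checking acyclicity for the graph of Figure 6; the acyclicity test removes indegree-0 vertices repeatedly (Kahn's algorithm), which is one standard way to detect cycles.

```python
# The example DAG of Figure 6: V = {1, 2, 3, 4}, E = {(1, 2), (1, 3), (3, 4)}.
V = {1, 2, 3, 4}
E = {(1, 2), (1, 3), (3, 4)}

indegree = {v: sum(1 for (a, b) in E if b == v) for v in V}
outdegree = {v: sum(1 for (a, b) in E if a == v) for v in V}
print("indegrees:", indegree)    # {1: 0, 2: 1, 3: 1, 4: 1}
print("outdegrees:", outdegree)  # {1: 2, 2: 0, 3: 1, 4: 0}

def is_acyclic(V, E):
    """Repeatedly remove vertices of indegree 0; a cycle remains iff this gets stuck."""
    remaining, edges = set(V), set(E)
    while remaining:
        sources = {v for v in remaining if all(b != v for (a, b) in edges)}
        if not sources:
            return False   # every remaining vertex has a predecessor: there is a cycle
        remaining -= sources
        edges = {(a, b) for (a, b) in edges if a in remaining and b in remaining}
    return True

print("acyclic:", is_acyclic(V, E))   # True
```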
Figure 8: The vertex 1 has indegree 0 and outdegree 2. The vertex 4 has indegree 2 and outdegree 0.
• An activation function $\sigma : \mathbb{R} \to \mathbb{R}$.
• A DAG $G = (V, E)$ such that $G$ has $d \ge 1$ vertices with indegree 0 and 1 vertex with outdegree 0. We write the $d$ vertices with indegree 0 as
\[
s_1^{(0)}, \ldots, s_d^{(0)} .
\]
• A vector of weights
\[
\big( w_a \,;\; a \in V' \cup E \big),
\]
where $V'$ is the set of vertices with non-zero indegree (there is a weight per vertex [except the $d$ vertices with indegree 0] and a weight per edge).
We write L for the maximal length (number of edges) of a path of G. We have L ≤ card(V ) − 1. We
define the layers 0, 1, . . . , L by induction as follows.
• For $\ell = 1, \ldots, L$,
\[
\text{layer } \ell = \big\{ \text{vertices which have a predecessor in layer } \ell - 1, \text{ possibly other predecessors in layers } 0, 1, \ldots, \ell - 2, \text{ and no other predecessors} \big\}.
\]
• The edges are only of the form (v, v 0 ), with v ∈ layer i and v 0 ∈ layer j with i < j.
Proof of Proposition 22
We call the elements of the layer 0 the roots. For a vertex v, we call inverse path from v to the roots
a vector (v, v1 , . . . , vk ) with vk a root and (v1 , v), (v2 , v1 ), . . . , (vk , vk−1 ) ∈ E (hence (vk , vk−1 , . . . , v2 , v)
is a path). The length of such an inverse path is k (there are k edges in the path). By convention, if
v is a root, we say that v has an inverse path of length 0 to the roots.
Then let us show by induction that, for $\ell = 0, \ldots, L$,
\[
\text{layer } \ell = \{ \text{vertices whose longest inverse path to the roots has length } \ell \}. \tag{8}
\]
Figure 9: An example of the DAG of a neural network. The layer 0 has 3 vertices, representing a
neural network classifier from [0, 1]3 to {0, 1}. These vertices have indegree 0. The layer 1 has 4
vertices (neurons). The layer 2 has 3 vertices (neurons). The layer L = 3 has one (final) vertex of
outdegree 0 (output of the neural network function). The layers 1 and 2 correspond to the hidden
layers.
We consider weights
\[
\big( w_a \,;\; a \in V' \cup E \big),
\]
where $V'$ contains the layers 1 to $L$. The input space is $[0,1]^d$. Consider an input $x = (x_1, \ldots, x_d) \in [0,1]^d$. We define by induction on the layers $0$ to $L$ the outputs associated to each neuron of the layer $\ell$. For the layer 0, the output of the vertex $s_i^{(0)}$ is $x_i$.
For the layer $\ell + 1$, $\ell = 0, \ldots, L-2$, the output of a vertex $v$ is
\[
\sigma\Big( \sum_{i=1}^m w_i S_i + b \Big),
\]
where
• $m$ is the indegree of $v$,
• $v_1', \ldots, v_m'$ are the predecessors of $v$: $(v_1', v), \ldots, (v_m', v) \in E$,
• $S_i$ is the output of $v_i'$ and $w_i = w_{(v_i', v)}$ is the weight associated to the edge $(v_i', v)$, for $i = 1, \ldots, m$,
• $b = w_v$ is the weight associated to the vertex $v$.
For the final layer $L$, the output of the (unique) vertex $v$ is, with the same notation,
\[
\mathbf{1}_{\sum_{i=1}^m w_i S_i + b \ge 0} .
\]
We assume that $\sigma$ is piecewise polynomial: there exist $p + 1$ pieces $I_1, \ldots, I_{p+1}$ ($p \ge 1$), where $I_1, \ldots, I_{p+1}$ are intervals of $\mathbb{R}$, that is, of the form
\[
(-\infty, a),\; (-\infty, a],\; (a, b),\; [a, b),\; (a, b],\; [a, b],\; (a, +\infty),\; [a, +\infty),
\]
such that $\sigma$ coincides with a polynomial of degree at most $D$ on each piece. Examples:
• Threshold function $\sigma(x) = \mathbf{1}_{x \ge 0}$, with $p = 1$, $I_1 = (-\infty, 0)$, $I_2 = [0, +\infty)$ and $D = 0$. The polynomials are $x \mapsto 0$ on $I_1$ and $x \mapsto 1$ on $I_2$.
• ReLU function $\sigma(x) = \max(0, x)$, with $p = 1$, $I_1 = (-\infty, 0)$, $I_2 = [0, +\infty)$ and $D = 1$. The polynomials are $x \mapsto 0$ on $I_1$ and $x \mapsto x$ on $I_2$.
• If D = 0, Wi is the number of parameters (weights and biases) useful to the computation of all
the neurons of the layer i. We have
• If D ≥ 1, Wi is the number of parameters (weights and biases) useful to the computation of all
the neurons of the layers 1 to i. We have
We write
\[
\bar L = \frac{1}{W} \sum_{i=1}^L W_i \in [1, L],
\]
and we note that
• this is equal to 1 if $D = 0$,
• this can be close to $L$ if $D \ge 1$ and if the neurons are concentrated on the first layers.
We define, for $i = 1, \ldots, L$, $k_i$ as the number of vertices of the layer $i$ (so that $k_L = 1$). We write
\[
R = \underbrace{\sum_{i=1}^L k_i \big(1 + (i-1) D^{i-1}\big)}_{\le\, U L D^{L-1}} \quad \text{if } D \ge 1,
\]
and
\[
R = U \quad \text{if } D = 0 .
\]
We define $\mathcal{F}$ as the set of all the feed-forward neural networks defined by $G = (V, E)$, with one weight per vertex of the layers 1 to $L$ and one weight per edge (the structure of the network is fixed and the weights vary).
Then, for $m \ge W$, with $e = \exp(1)$,
\begin{align}
\Pi_{\mathcal{F}}(m) &\le \prod_{i=1}^L 2 \bigg( \frac{2 e m k_i p \big(1 + (i-1) D^{i-1}\big)}{W_i} \bigg)^{W_i} \tag{9} \\
&\le \Big( 4 e m p \big(1 + (L-1) D^{L-1}\big) \Big)^{\sum_{i=1}^L W_i} . \tag{10}
\end{align}
Furthermore,
\[
\mathrm{VCdim}(\mathcal{F}) \le L + \bar L\, W \log_2\big( 4 e p R \log_2(2 e p R) \big) . \tag{11}
\]
In particular we have the following.
• If D = 0, VCdim(F) ≤ L + W log2 (4epU log2 (2epU )) has W as a dominating term (neglecting
logarithms). This is the number of parameters of the neural network functions.
• If D ≥ 1, VCdim(F) has L̄W as a dominating term (neglecting logarithms). This is more than
the number of parameters of the neural network functions. We can interpret this by the fact
that depth can increase L̄ (recall that L̄ ∈ [1, L]) and thus make the family of neural network
functions more complex.
\[
\operatorname{card}\big\{ (\operatorname{sign}(f(x_1, a)), \ldots, \operatorname{sign}(f(x_m, a))) \,;\; a \in \mathbb{R}^W \big\}
\le \sum_{i=1}^N \operatorname{card}\big\{ (\operatorname{sign}(f(x_1, a)), \ldots, \operatorname{sign}(f(x_m, a))) \,;\; a \in P_i \big\},
\]
where $P_1, \ldots, P_N$ are a partition of $\mathbb{R}^W$ which will be chosen such that the $m$ functions $a \mapsto f(x_j, a)$, $j = 1, \ldots, m$, are polynomial on each cell $P_i$. We can then apply Lemma 24.
The main difficulty is to construct a good partition. We will construct by induction partitions
C0 , . . . , CL−1 , where CL−1 will be the final partition P1 , . . . , PN .
The partitions $C_0, \ldots, C_{L-1}$ will be partitions of $\mathbb{R}^W$ such that, for $i \in \{0, \ldots, L-1\}$, $C_i = \{A_1, \ldots, A_q\}$ with $A_1 \cup \cdots \cup A_q = \mathbb{R}^W$ and $A_r \cap A_{r'} = \emptyset$ for $r \neq r'$. We will have the following.
(a) The partitions are nested: any $C \in C_i$ is a union of one or several $C' \in C_{i+1}$ ($0 \le i \le L-2$).
(c) For $i \in \{0, \ldots, L-1\}$, for $E \in C_i$, for $j \in \{1, \ldots, m\}$, the output of a neuron of the layer $i$ (for the input $x_j$) is a polynomial function of $W_i$ variables of $a \in E$, with degree smaller than or equal to $i D^i$.
Induction
When $i = 0$, we have $C_0 = \{\mathbb{R}^W\}$. The output of a neuron of the layer 0 is constant with respect to $a \in \mathbb{R}^W$ and thus the property (c) holds.
Let $1 \le i \le L-1$. Assume that we have constructed nested partitions $C_0, \ldots, C_{i-1}$ satisfying (b) and (c). Let us construct $C_i$.
We write $P_{h, x_j, E}(a)$ for the input (just before $\sigma$) of the neuron $h$ ($h = 1, \ldots, k_i$) of the layer $i$, for the input $x_j$, as a function of $a \in E$ with $E \in C_{i-1}$. From the induction hypothesis (c), since $P_{h, x_j, E}(a)$ is of the form
\[
\sum_k w_k\, (\text{output of neuron } k) + b
\]
and since the partitions are nested, we have that $P_{h, x_j, E}(a)$ is polynomial on $E$ of degree smaller than or equal to $1 + (i-1) D^{i-1}$ and depends on at most $W_i$ variables (we can check that this also holds when $D = 0$).
Because of σ, the output of the neuron h is piecewise polynomial on E. We will divide E into
subcells such that the output is polynomial on each of the subcells, for any neurons h and any input
xj . Figure 10 illustrates the current state of the proof.
Figure 10: Illustration of the construction of the partitions.
We write t1 < t2 < · · · < tp the cuts of the pieces I1 , . . . , Ip+1 , as illustrated in Figure 11.
where in the above display there is a $+$ if $I_{r+1}$ is closed at $t_r$ and a $-$ if $I_{r+1}$ is open at $t_r$. With this,
\[
\mathbf{1}_{\pm(P_{h,x_j,E}(a) - t_r) \ge 0} = \operatorname{sign}\big( \pm \big(P_{h,x_j,E}(a) - t_r\big) \big),
\]
and, from Lemma 24, there are at most $\Pi$ distinct vectors of signs $\big( \operatorname{sign}( \pm (P_{h,x_j,E}(a) - t_r) ) \big)_{h,j,r}$ when $a \in \mathbb{R}^W$, and thus when $a \in E$.
We can thus partition $E$ into fewer than $\Pi$ subcells such that, on each of these subcells, the $P_{h,x_j,E}(a)$ stay in the same interval where $\sigma$ is polynomial as $a$ varies in the subcell. We remark that these $\Pi$ subcells of $E$ are the same for all the neurons $h$ and all the inputs $x_j$ (this is important for the sequel).
Hence we obtain a new partition $C_i$ of cardinality less than $\Pi\, \operatorname{card}(C_{i-1})$. This makes it possible to satisfy the property (b).
Let us now address the property (c). For all $E' \in C_i$, the output of the neuron $h \in \{1, \ldots, k_i\}$,
\[
a \in E' \mapsto \sigma\big( P_{h,x_j,E}(a) \big),
\]
is polynomial with degree smaller than or equal to $D\big(1 + (i-1) D^{i-1}\big) \le i D^i$, where the factor $D$ comes from the application of the polynomial corresponding to $\sigma$. Hence the property (c) holds.
This completes the induction and we have the nested partitions $C_0, \ldots, C_{L-1}$ satisfying (b) and (c).
Use of the partition to conclude the proof
In particular, $C_{L-1}$ is a partition of $\mathbb{R}^W$ such that the output of each neuron of the layers $0, \ldots, L-1$ is polynomial of degree smaller than or equal to $(L-1) D^{L-1}$ on each $E \in C_{L-1}$ (since the partitions are nested) and for all inputs $x_1, \ldots, x_m$.
Hence, for each cell $E \in C_{L-1}$ and each input $x_j$, the function
\[
a \in E \mapsto f(x_j, a)
\]
at the end of the network is polynomial with degree less than or equal to $1 + (L-1) D^{L-1}$, where the 1 comes from the final linear combination.
Hence, from Lemma 24,
\[
\operatorname{card}\big\{ (\operatorname{sign}(f(x_1, a)), \ldots, \operatorname{sign}(f(x_m, a))) \,;\; a \in E \big\} \le 2 \bigg( \frac{2 e m \big(1 + (L-1) D^{L-1}\big)}{W_L} \bigg)^{W_L}
\]
and thus
\begin{align}
\operatorname{card}\big\{ (\operatorname{sign}(f(x_1, a)), \ldots, \operatorname{sign}(f(x_m, a))) \,;\; a \in \mathbb{R}^W \big\}
&\le \sum_{E \in C_{L-1}} \operatorname{card}\big\{ (\operatorname{sign}(f(x_1, a)), \ldots, \operatorname{sign}(f(x_m, a))) \,;\; a \in E \big\} \nonumber \\
&\le \operatorname{card}(C_{L-1})\; 2 \bigg( \frac{2 e m \big(1 + (L-1) D^{L-1}\big)}{W_L} \bigg)^{W_L} . \tag{12}
\end{align}
Then, from the property (b),
\[
\operatorname{card}(C_{L-1}) \le \prod_{i=1}^{L-1} 2 \bigg( \frac{2 e m k_i p \big(1 + (i-1) D^{i-1}\big)}{W_i} \bigg)^{W_i},
\]
which, combined with (12), gives (9). To obtain (10), we use the fact that, for $a_1, \ldots, a_k > 0$ and $y_1, \ldots, y_k > 0$,
\[
\prod_{i=1}^k y_i^{a_i} \le \bigg( \frac{\sum_{i=1}^k a_i y_i}{\sum_{i=1}^k a_i} \bigg)^{\sum_{i=1}^k a_i} .
\]
Then we have
\begin{align}
\Pi_{\mathcal{F}}(m) &\le 2^L \bigg( \frac{2 e m p \sum_{i=1}^L k_i \big(1 + (i-1) D^{i-1}\big)}{\sum_{i=1}^L W_i} \bigg)^{\sum_{i=1}^L W_i} \nonumber \\
\text{(by definition of } R\text{:)} \quad &= 2^L \bigg( \frac{2 e m p R}{\sum_{i=1}^L W_i} \bigg)^{\sum_{i=1}^L W_i} \tag{13} \\
\text{(since } L \le \textstyle\sum_{i=1}^L W_i\text{:)} \quad &\le \bigg( \frac{4 e m p \big(1 + (L-1) D^{L-1}\big) \sum_{i=1}^L k_i}{\sum_{i=1}^L W_i} \bigg)^{\sum_{i=1}^L W_i} \nonumber \\
\text{(since } \textstyle\sum_{i=1}^L k_i \le \sum_{i=1}^L W_i\text{:)} \quad &\le \Big( 4 e m p \big(1 + (L-1) D^{L-1}\big) \Big)^{\sum_{i=1}^L W_i} . \nonumber
\end{align}
Lemma 25  Let $r \ge 16$ and $w \ge t > 0$. Then, for any $m > t + w \log_2\big(2 r \log_2(r)\big) =: x_0$, we have
\[
2^m > 2^t \Big( \frac{m r}{w} \Big)^w .
\]
Applying Lemma 25 to the bound (13) on $\Pi_{\mathcal{F}}(m)$, with $t = L$, $w = \sum_{i=1}^L W_i$ and
\[
r = 2 e p R \ge 2 e U \ge 16,
\]
yields
\[
\mathrm{VCdim}(\mathcal{F}) \le L + \bigg( \sum_{i=1}^L W_i \bigg) \log_2\big( 4 e p R \log_2(2 e p R) \big),
\]
which is (11) since $\bar L\, W = \sum_{i=1}^L W_i$.
References
[AB09] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
[BHLM19] Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1-17, 2019.
[Cyb89] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303-314, 1989.
[Gir14] Christophe Giraud. Introduction to High-Dimensional Statistics, volume 138. CRC Press, 2014.
[Rud98] Walter Rudin. Real and Complex Analysis. McGraw-Hill, 1998.