
Lecture notes

Selected theoretical aspects of machine learning and deep learning


François Bachoc
University Paul Sabatier

January 22, 2024

Contents
1 Generalities on regression, classification and neural networks
  1.1 Regression
  1.2 Classification
  1.3 Neural networks
  1.4 Backpropagation for neural networks

2 Approximation with neural networks with one hidden layer
  2.1 Statement of the theorem
  2.2 Sketch of the proof
  2.3 Complete proof

3 Complements on the generalization error in classification and VC-dimension
  3.1 Shattering coefficients
  3.2 Bounding the generalization error from the shattering coefficients
  3.3 VC-dimension
  3.4 Bounding the shattering coefficients from the VC-dimension

4 VC-dimension of neural networks with several hidden layers
  4.1 Neural networks as directed acyclic graphs
  4.2 Bounding the VC-dimension
  4.3 Proof of the theorem

Acknowledgements
Sections 2 and 4 are taken from lecture notes from Sébastien Gerchinovitz that are available at
https://www.math.univ-toulouse.fr/~fmalgouy/enseignement/mva.html. Parts of Section 3 benefited
from parts of the book [Gir14].

Introduction
By deep learning, we shall mean neural networks with many hidden layers. Deep learning
methods are currently very popular for some tasks, for instance the following ones.

• regression: predicting y ∈ R.

• classification: predicting y ∈ {0, 1} or y ∈ {0, . . . , K}.

• generative modeling: generating vectors x ∈ Rd following an unknown target distribution.

Typical applications are:

• For regression: any type of input x ∈ Rd and of corresponding output y ∈ R to predict. For
instance, y can be the delay of a flight and x can gather characteristics of this flight, such as the
day, position of the airport and duration.
• For classification: x ∈ R^d can be an image (vector of color levels for each pixel) and y can
give the type of image, for instance cat/dog, or the value of a digit.
• For generative modeling: generating images (e.g. faces) or musical pieces.
Goals of the lecture notes. The goal is to study some theoretical aspects of deep learning, and
in some cases of machine learning more broadly. There are many recent contributions and only a few
of them will be covered.

1 Generalities on regression, classification and neural networks


1.1 Regression
We consider a law L on [0,1]^d × R. We aim at finding a function f : [0,1]^d → R such that, for
(X, Y) ∼ L,

E[(f(X) − Y)^2]

is small.
The optimal function f is then the conditional expectation, as shown in the following proposition.
Proposition 1 Let f^* : [0,1]^d → R be defined by

f^*(x) = E[Y | X = x],

for x ∈ [0,1]^d. Then, for any f : [0,1]^d → R,

E[(f(X) − Y)^2] = E[(f^*(X) − Y)^2] + E[(f^*(X) − f(X))^2].

From the previous proposition, f^* minimizes the mean square error among all possible functions,
and the closer a function f is to f^*, the smaller its mean square error.
Proof of Proposition 1 Let us use the law of total expectation:

E[(f(X) − Y)^2] = E[ E[(f(X) − Y)^2 | X] ].

Conditionally on X, we can use the bias-variance decomposition

E[(Z − a(X))^2 | X] = Var(Z | X) + (E[Z | X] − a(X))^2

for a random variable Z and a function a(X). This gives

E[(f(X) − Y)^2] = E[(E[Y | X] − f(X))^2] + E[Var(Y | X)]
                = E[(f^*(X) − f(X))^2] + E[Var(Y | X)]
                = E[(f^*(X) − f(X))^2] + E[ E[(Y − E[Y | X])^2 | X] ]
                = E[(f^*(X) − f(X))^2] + E[(Y − E[Y | X])^2]      (law of total expectation)
                = E[(f^*(X) − f(X))^2] + E[(Y − f^*(X))^2].

We now consider a data set of the form (X_1, Y_1), . . . , (X_n, Y_n), independent and of law L. We
consider a function learned by empirical risk minimization. We let F be a set of functions from [0,1]^d
to R. We consider

f̂_n ∈ argmin_{f ∈ F} (1/n) Σ_{i=1}^n (f(X_i) − Y_i)^2.
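As a minimal numerical illustration (not part of the original notes), the Python sketch below performs empirical risk minimization over a small one-parameter class F = {x ↦ a x : a in a grid} in dimension d = 1; the data-generating law, the grid and the sample size are arbitrary choices made for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated data (X_i, Y_i) with X uniform on [0, 1] and Y = 2X + noise
    n = 200
    X = rng.uniform(0.0, 1.0, size=n)
    Y = 2.0 * X + 0.1 * rng.normal(size=n)

    # Finite class F = {x -> a * x : a in a grid of candidate slopes}
    candidate_slopes = np.linspace(-5.0, 5.0, 201)

    def empirical_risk(a):
        """(1/n) * sum_i (f(X_i) - Y_i)^2 for f(x) = a * x."""
        return np.mean((a * X - Y) ** 2)

    risks = np.array([empirical_risk(a) for a in candidate_slopes])
    a_hat = candidate_slopes[np.argmin(risks)]   # parameter of the ERM function
    print(a_hat, risks.min())

For this simple class the empirical risk minimizer could of course be computed in closed form; the grid search is only meant to mirror the definition of f̂_n literally.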

The next proposition enables us to bound the mean square error of f̂_n.

Proposition 2 Let (X, Y) ∼ L, independent of (X_1, Y_1), . . . , (X_n, Y_n). Then we have

E[(f̂_n(X) − Y)^2] − E[(f^*(X) − Y)^2]
  ≤ 2 E[ sup_{f ∈ F} | E[(f(X) − Y)^2] − (1/n) Σ_{i=1}^n (f(X_i) − Y_i)^2 | ]
    + inf_{f ∈ F} E[(f(X) − f^*(X))^2].

Remarks

• In the term

E[(f̂_n(X) − Y)^2]

the expectation is taken with respect to both (X_1, Y_1), . . . , (X_n, Y_n) and (X, Y).

• We bound

E[(f̂_n(X) − Y)^2] − E[(f^*(X) − Y)^2],

which is always non-negative and is called the excess risk.

• The first component of the bound is

2 E[ sup_{f ∈ F} | E[(f(X) − Y)^2] − (1/n) Σ_{i=1}^n (f(X_i) − Y_i)^2 | ],

which is called the generalization error. The larger the set F is, the larger this error is, because
the supremum is taken over a larger set.

• The second component of the bound is

inf_{f ∈ F} E[(f(X) − f^*(X))^2],

which is called the approximation error. The smaller F is, the larger this error is, because the
infimum is taken over fewer functions.

• Hence, we see that F should be neither too small nor too large, which can be interpreted as a
bias-variance trade-off.

Proof of Proposition 2
We let, for f ∈ F,

R(f) = E[(Y − f(X))^2]

and

R_n(f) = (1/n) Σ_{i=1}^n (f(X_i) − Y_i)^2.

We let, for ε > 0, f_ε be such that

R(f_ε) ≤ inf_{f ∈ F} R(f) + ε.

Then we have

E[(f̂_n(X) − Y)^2] = E[ E[(f̂_n(X) − Y)^2 | X_1, Y_1, . . . , X_n, Y_n] ]      (law of total expectation)
                   = E[R(f̂_n)],

since (X, Y) is independent of X_1, Y_1, . . . , X_n, Y_n and, in R(f̂_n), the function f̂_n is treated as fixed,
as the inner expectation is taken only with respect to X and Y. Then we have

E[R(f̂_n)] − R(f^*) = E[R(f̂_n) − R_n(f̂_n)] + E[R_n(f̂_n) − R_n(f_ε)] + E[R_n(f_ε) − R(f_ε)]
                     + ( R(f_ε) − inf_{f ∈ F} R(f) ) + ( inf_{f ∈ F} R(f) − R(f^*) )
                   ≤ E[ sup_{f ∈ F} |R(f) − R_n(f)| ] + 0 + E[ sup_{f ∈ F} |R(f) − R_n(f)| ]
                     + ε + inf_{f ∈ F} ( R(f) − R(f^*) )
                   = 2 E[ sup_{f ∈ F} |R(f) − R_n(f)| ] + ε + inf_{f ∈ F} E[(f(X) − f^*(X))^2]      (Proposition 1).

Since this inequality holds for any ε > 0, we also obtain the inequality with ε = 0, which concludes the
proof. □

1.2 Classification
The general principle is quite similar to regression. We consider a law L on [0,1]^d × {0, 1}. We are
looking for a function f : [0,1]^d → {0, 1} (a classifier) such that, with (X, Y) ∼ L,

P(f(X) ≠ Y)

is small. The next proposition provides the optimal function f for this.

Proposition 3 Let p^* : [0,1]^d → [0, 1] be defined by

p^*(x) = P(Y = 1 | X = x)

for x ∈ [0,1]^d. We let

T^*(x) = 1{p^*(x) ≥ 1/2}

for x ∈ [0,1]^d. Then, for any f : [0,1]^d → {0, 1},

P(f(X) ≠ Y) = P(T^*(X) ≠ Y) + E[ 1{T^*(X) ≠ f(X)} |1 − 2p^*(X)| ].




Hence, we see that a prediction error (that is, predicting f(X) with f(X) ≠ T^*(X)) is more
harmful when

|1 − 2p^*(X)|

is large. This is well interpreted, because when

|1 − 2p^*(X)| = 0,

we have p^*(X) = 1/2, thus P(Y = 1 | X) = 1/2. In this case, P(f(X) ≠ Y | X) = 1/2, regardless of the
value of f(X).
Proof of Proposition 3 Using the law of total expectation, we have

P(f(X) ≠ Y) − P(T^*(X) ≠ Y) = E[ E[ 1{f(X) ≠ Y} − 1{T^*(X) ≠ Y} | X ] ]
                             := E[ E[ e(X, Y) | X ] ].

Conditionally on X we have the following.

• If T^*(X) = 1, then

  – if f(X) = 1, then e(X, Y) = 0,

  – if f(X) = 0, then
    ∗ e(X, Y) = 1 with probability P(Y = 1 | X) = p^*(X),
    ∗ e(X, Y) = −1 with probability P(Y = 0 | X) = 1 − p^*(X),

  and thus

  E[ e(X, Y) | X ] = 1{f(X) ≠ T^*(X)} ( p^*(X) − (1 − p^*(X)) ) = 1{f(X) ≠ T^*(X)} |1 − 2p^*(X)|.

• If T^*(X) = 0, then

  – if f(X) = 0, then e(X, Y) = 0,

  – if f(X) = 1, then
    ∗ e(X, Y) = 1 with probability P(Y = 0 | X) = 1 − p^*(X),
    ∗ e(X, Y) = −1 with probability P(Y = 1 | X) = p^*(X),

  and thus

  E[ e(X, Y) | X ] = 1{f(X) ≠ T^*(X)} ( 1 − p^*(X) − p^*(X) ) = 1{f(X) ≠ T^*(X)} |1 − 2p^*(X)|.

Hence, eventually,

P(f(X) ≠ Y) − P(T^*(X) ≠ Y) = E[ 1{T^*(X) ≠ f(X)} |1 − 2p^*(X)| ]. □





We now consider a data set of the form (X_1, Y_1), . . . , (X_n, Y_n), independent and of law L. We
consider a function that is learned by empirical risk minimization. We consider a set F of functions
from [0,1]^d to {0, 1} and

f̂_n ∈ argmin_{f ∈ F} (1/n) Σ_{i=1}^n 1{Y_i ≠ f(X_i)}.
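As a minimal illustration (not from the original notes), the sketch below performs empirical risk minimization for the 0-1 loss over the class of threshold classifiers {x ↦ 1{x ≥ a}} in dimension d = 1; the simulated law L is an arbitrary choice made for the example.

    import numpy as np

    rng = np.random.default_rng(1)

    # Simulated data: X uniform on [0, 1], P(Y = 1 | X = x) = p*(x) increasing in x
    n = 500
    X = rng.uniform(0.0, 1.0, size=n)
    p_star = 0.1 + 0.8 * X                      # conditional probability of class 1
    Y = (rng.uniform(size=n) < p_star).astype(int)

    # Class F of threshold classifiers f_a(x) = 1{x >= a}, with a on a grid
    thresholds = np.linspace(0.0, 1.0, 101)

    def empirical_error(a):
        """(1/n) * sum_i 1{Y_i != f_a(X_i)}."""
        predictions = (X >= a).astype(int)
        return np.mean(predictions != Y)

    errors = np.array([empirical_error(a) for a in thresholds])
    a_hat = thresholds[np.argmin(errors)]       # threshold of the ERM classifier
    print(a_hat, errors.min())

Here p*(x) = 0.1 + 0.8x crosses 1/2 at x = 0.5, so the Bayes classifier T* is the threshold classifier at 0.5 and, for large n, the learned threshold is typically close to 0.5.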

The next proposition enables us to bound the probability of error of f̂_n.

Proposition 4 Let (X, Y) ∼ L, independent of (X_1, Y_1), . . . , (X_n, Y_n). Then we have

P(f̂_n(X) ≠ Y) − P(T^*(X) ≠ Y)
  ≤ 2 E[ sup_{f ∈ F} | P(f(X) ≠ Y) − (1/n) Σ_{i=1}^n 1{f(X_i) ≠ Y_i} | ]
    + inf_{f ∈ F} E[ 1{T^*(X) ≠ f(X)} |1 − 2p^*(X)| ].

The proof and the interpretation are the same as for regression.

1.3 Neural networks


Neural networks define a set of functions from [0, 1]d to R.
Feed-forward neural networks with one hidden layer This is the simplest example. These
networks are represented as in Figure 1.

Figure 1: Representation of a feed-forward neural network with one hidden layer (inputs, neurons of
the hidden layer, output neuron).

In Figure 1, the interpretation is the following.


• The arrows mean

– that there is a multiplication by a scalar


– or that a function from R to R is applied and (possibly) a scalar is added.

• The function σ : R → R is called the activation function.

• A circle (a neuron) sums all the values that are pointed to it by the arrows.

• The column with w1 , . . . , wN is called the hidden layer.


The function corresponding to Figure 1 is

x ∈ [0,1]^d ↦ Σ_{i=1}^N v_i σ(⟨w_i, x⟩ + b_i),

with ⟨·, ·⟩ the standard inner product on R^d.
The neural network function is parametrized by

• σ : R → R, the activation function,

• v1 , . . . , vN ∈ R, the output weights,

• w1 , . . . , wN ∈ Rd , the weights (of the neurons of the hidden layer),

• b1 , . . . , bN ∈ R, the biases.

Examples of activation functions are, for t ∈ R,

• linear: σ(t) = t,

• threshold: σ(t) = 1{t ≥ 0},

• sigmoid: σ(t) = e^t / (1 + e^t),

• ReLU: σ(t) = max(0, t).

For instance, when d = 1, the network of Figure 2 encodes the absolute value function with σ the
ReLU function.

Figure 2: Representation of the absolute value function as a neural network.
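To make the parametrization concrete, here is a minimal NumPy sketch (not part of the original notes) of the one-hidden-layer network function x ↦ Σ_i v_i σ(⟨w_i, x⟩ + b_i) with the ReLU activation; the dimensions and random weights are arbitrary, and the last two lines check the absolute-value network of Figure 2.

    import numpy as np

    def relu(t):
        """ReLU activation sigma(t) = max(0, t), applied elementwise."""
        return np.maximum(0.0, t)

    def one_hidden_layer(x, W, b, v, sigma=relu):
        """Network x -> sum_i v_i * sigma(<w_i, x> + b_i).

        W: (N, d) matrix whose rows are the weights w_1, ..., w_N,
        b: (N,) biases, v: (N,) output weights, x: (d,) input."""
        hidden = sigma(W @ x + b)       # values of the N hidden neurons
        return float(v @ hidden)        # scalar output

    rng = np.random.default_rng(0)
    d, N = 3, 5
    W, b, v = rng.normal(size=(N, d)), rng.normal(size=N), rng.normal(size=N)
    print(one_hidden_layer(rng.uniform(size=d), W, b, v))

    # The network of Figure 2 (d = 1): |x| = relu(x) + relu(-x), i.e.
    # N = 2, w_1 = 1, w_2 = -1, b_1 = b_2 = 0, v_1 = v_2 = 1.
    W_abs = np.array([[1.0], [-1.0]])
    print(one_hidden_layer(np.array([-0.7]), W_abs, np.zeros(2), np.ones(2)))  # 0.7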

Feed-forward neural networks with several hidden layers. This is the same type of representation
but with several layers of activation functions. These networks are represented as in Figure 3.

Figure 3: Representation of a feed-forward neural network with several hidden layers (inputs, hidden
layers 1 to c, output neuron).

The neural network function corresponding to Figure 3 is defined by

x ∈ [0,1]^d ↦ f_v ∘ g_c ∘ g_{c−1} ∘ · · · ∘ g_1(x),      (1)

where

f_v : R^{N_c} → R,   u ↦ Σ_{i=1}^{N_c} u_i v_i,

and for i = 1, . . . , c, with N_0 = d,

g_i : R^{N_{i−1}} → R^{N_i}

is defined by, for u ∈ R^{N_{i−1}} and j = 1, . . . , N_i,

(g_i(u))_j = σ( ⟨w_j^{(i)}, u⟩ + b_j^{(i)} ).

The neural network function is parametrized by

• σ : R → R, the activation function,

• v ∈ R^{N_c}, the output weights,

• b_1^{(c)}, . . . , b_{N_c}^{(c)} ∈ R, the biases of the hidden layer c,

• w_1^{(c)}, . . . , w_{N_c}^{(c)} ∈ R^{N_{c−1}}, the weights of the hidden layer c,

• . . .

• b_1^{(2)}, . . . , b_{N_2}^{(2)} ∈ R, the biases of the hidden layer 2,

• w_1^{(2)}, . . . , w_{N_2}^{(2)} ∈ R^{N_1}, the weights of the hidden layer 2,

• b_1^{(1)}, . . . , b_{N_1}^{(1)} ∈ R, the biases of the hidden layer 1,

• w_1^{(1)}, . . . , w_{N_1}^{(1)} ∈ R^d, the weights of the hidden layer 1.
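A minimal NumPy sketch (not from the original notes) of the forward computation of equation (1), here with the sigmoid activation; the architecture and the random parameter values are arbitrary.

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))   # equals e^t / (1 + e^t)

    def forward(x, weights, biases, v, sigma=sigmoid):
        """Network f_v o g_c o ... o g_1 of equation (1).

        weights[i] is the N_{i+1} x N_i matrix whose rows are the w_j^{(i+1)}
        (with N_0 = d), biases[i] the corresponding biases, v the output weights."""
        u = x
        for W, b in zip(weights, biases):
            u = sigma(W @ u + b)          # apply g_1, then g_2, ..., then g_c
        return float(v @ u)               # f_v(u)

    rng = np.random.default_rng(0)
    d, N1, N2 = 2, 3, 2                   # architecture with c = 2 hidden layers
    weights = [rng.normal(size=(N1, d)), rng.normal(size=(N2, N1))]
    biases = [rng.normal(size=N1), rng.normal(size=N2)]
    v = rng.normal(size=N2)
    print(forward(rng.uniform(size=d), weights, biases, v))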

Classes of functions
To come back to regression, the class of functions F corresponding to neural networks is given by

• c, number of hidden layers,

• σ, activation function,

• N1 , . . . , Nc , numbers of neurons in the hidden layers.

These parameters are called architecture parameters. Then, for a given architecture, F is a parametric
set of functions

F = { neural networks parametrized by
      v, b_1^{(c)}, . . . , b_{N_c}^{(c)}, w_1^{(c)}, . . . , w_{N_c}^{(c)}, . . . , b_1^{(1)}, . . . , b_{N_1}^{(1)}, w_1^{(1)}, . . . , w_{N_1}^{(1)} }.

For classification, for g ∈ F, we take

f(x) = 1 if g(x) ≥ 0,   f(x) = 0 if g(x) < 0,

to have a parametric set of classifiers.

1.4 Backpropagation for neural networks


In Section 1.4, we assume that σ : R → R is differentiable with derivative function σ′.

Motivation.

Here we consider the output of a neural network with several hidden layers. We fix the architecture,
the activation function σ and the input x ∈ [0,1]^d. This output depends on the parameters θ gathering
all the weights and biases:

θ = ( v, b_1^{(c)}, . . . , b_{N_c}^{(c)}, w_1^{(c)}, . . . , w_{N_c}^{(c)}, . . . , b_1^{(1)}, . . . , b_{N_1}^{(1)}, w_1^{(1)}, . . . , w_{N_1}^{(1)} ).

We write f_θ(x) for the neural network output.

Our aim is to compute the gradient of f_θ(x) with respect to θ. This is very useful when the
parameters θ of a neural network are optimized, for instance with least squares in regression:

min_θ (1/n) Σ_{i=1}^n (f_θ(X_i) − Y_i)^2.

To compute the gradient of this function with respect to θ, it is sufficient to compute the gradients of
f_θ(X_i) with respect to θ for i = 1, . . . , n.

Backpropagation algorithm.

At first sight, computing the gradient of fθ (x) with respect to θ is very challenging because of the
compositions of functions in (1). Even in the simpler case where σ is the identity, if we expand (1)
into a sum of products, the number of terms in the sum will be of order N1 × · · · × Nc which can be
of astronomical order for deep networks.
The backpropagation algorithm presented here solves this issue by enabling the computation of the gradient
with storage and operations of complexity at most N_k N_{k+1} for k ∈ {1, . . . , c − 1}.
We define the vector η_θ^{(c)} of dimension 1 × N_c equal to

η_θ^{(c)} = ( σ′(g_{θ,1}^{(c)}(x)) v_1, . . . , σ′(g_{θ,N_c}^{(c)}(x)) v_{N_c} ),

where g_θ^{(c)}(x) = (g_{θ,1}^{(c)}(x), . . . , g_{θ,N_c}^{(c)}(x)) and

g_θ^{(c)}(x) is the vector of size N_c composed of the values at layer c just before the activations σ.

Lemma 5 η_θ^{(c)} is the (row) gradient of the network output with respect to the vector of values at layer
c just before the activations σ.

In Lemma 5, more precisely, the output is a function of g_θ^{(c)}(x) and v, and we take the gradient
with respect to g_θ^{(c)}(x) at g_θ^{(c)}(x) for v fixed.
Proof of Lemma 5 The output of the network is a function of g_θ^{(c)}(x) and the final weights
v_1, . . . , v_{N_c} as follows:

f_θ(x) = Σ_{i=1}^{N_c} v_i σ(g_{θ,i}^{(c)}(x)).

We indeed see that the derivative with respect to g_{θ,i}^{(c)}(x) is σ′(g_{θ,i}^{(c)}(x)) v_i, for i = 1, . . . , N_c. □
Then, for k going from c − 1 down to 1, we define (by induction) the vector η_θ^{(k)} of dimension 1 × N_k by

η_θ^{(k)} = η_θ^{(k+1)} W^{(k+1)} D_{σ′}(g_θ^{(k)}(x)),
            (1 × N_{k+1})   (N_{k+1} × N_k)   (N_k × N_k)

with W^{(k+1)} the N_{k+1} × N_k matrix whose rows are w_1^{(k+1)}, . . . , w_{N_{k+1}}^{(k+1)}, where

g_θ^{(k)}(x) is the vector of size N_k composed of the values at layer k just before the activations σ,

and where, for a vector z = (z_1, . . . , z_q),

D_{σ′}(z) = diag( σ′(z_1), . . . , σ′(z_q) )

is of dimension q × q.

Lemma 6 For k = 1, . . . , c − 1, η_θ^{(k)} is the (row) gradient of the network output with respect to the
vector of values at layer k just before the activations σ.

For Lemma 6, we make the same comment as stated after Lemma 5.
Proof of Lemma 6
Assume that the values of the network at layer k just before σ are g_θ^{(k)}(x) + z for a small z =
(z_1, . . . , z_{N_k}). The corresponding output of the network, which we write f_{θ,z}(x), is a function of
g_θ^{(k)}(x) + z and of the parameters of layers k + 1, . . . , c, as follows:

f_{θ,z}(x) = h_θ^{(k+1)}( W^{(k+1)} σ(g_θ^{(k)}(x) + z) + b^{(k+1)} ),

where b^{(k+1)} = (b_1^{(k+1)}, . . . , b_{N_{k+1}}^{(k+1)})^T, where for a vector t = (t_1, . . . , t_q) we let σ(t) = (σ(t_1), . . . , σ(t_q)),
and where h_θ^{(k+1)} is the function providing the output of the network from the values at layer k + 1
just before the activations σ. We compute a Taylor expansion as z → 0:

f_{θ,z}(x) = h_θ^{(k+1)}( W^{(k+1)} σ(g_θ^{(k)}(x)) + b^{(k+1)} + W^{(k+1)} D_{σ′}(g_θ^{(k)}(x)) z + o(‖z‖) )
           = h_θ^{(k+1)}( W^{(k+1)} σ(g_θ^{(k)}(x)) + b^{(k+1)} ) + η_θ^{(k+1)} W^{(k+1)} D_{σ′}(g_θ^{(k)}(x)) z + o(‖z‖)
           = f_θ(x) + η_θ^{(k+1)} W^{(k+1)} D_{σ′}(g_θ^{(k)}(x)) z + o(‖z‖),

because by definition η_θ^{(k+1)} is the gradient vector of h_θ^{(k+1)}(·) with respect to the input “·”.
Hence, by definition of the gradient vector, the gradient of f_θ(x) with respect to the values of the
network at layer k just before σ is

η_θ^{(k+1)} W^{(k+1)} D_{σ′}(g_θ^{(k)}(x)),

which is the definition of η_θ^{(k)}. □
Hence backpropagation consists in computing (in this order)

η_θ^{(c)} −→ · · · −→ η_θ^{(1)}.

Note that we can compute beforehand, in a forward pass (coinciding with computing the output f_θ(x)),

g_θ^{(1)}(x) −→ · · · −→ g_θ^{(c)}(x).
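A minimal NumPy sketch (not part of the original notes) of this forward pass followed by the backward recursion for the η vectors; the sigmoid activation and the random parameters are arbitrary choices.

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def sigmoid_prime(t):
        s = sigmoid(t)
        return s * (1.0 - s)

    def forward_backward(x, Ws, bs, v, sigma=sigmoid, sigma_prime=sigmoid_prime):
        """Forward pass g^(1), ..., g^(c), then backward pass eta^(c), ..., eta^(1).

        Ws[k] is the matrix W^(k+1) whose rows are the w_j^(k+1); bs[k] the biases."""
        g, u = [], x
        for W, b in zip(Ws, bs):
            pre = W @ u + b               # values just before the activation
            g.append(pre)
            u = sigma(pre)
        output = float(v @ u)

        c = len(Ws)
        eta = [None] * c                  # eta[k] = gradient of output w.r.t. g[k]
        eta[c - 1] = sigma_prime(g[c - 1]) * v
        for k in range(c - 2, -1, -1):
            eta[k] = (eta[k + 1] @ Ws[k + 1]) * sigma_prime(g[k])
        return output, g, eta

    rng = np.random.default_rng(0)
    Ws = [rng.normal(size=(3, 2)), rng.normal(size=(2, 3))]
    bs = [rng.normal(size=3), rng.normal(size=2)]
    v = rng.normal(size=2)
    out, g, eta = forward_backward(rng.uniform(size=2), Ws, bs, v)
    print(out, eta[0], eta[1])

Proposition 7 below then turns the η vectors (and the g vectors of the forward pass) into the partial derivatives with respect to all the weights and biases.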
Proposition 7 For i = 1, . . . , N_c,

∂f_θ(x) / ∂v_i = σ(g_{θ,i}^{(c)}(x)).      (2)

For k = 1, . . . , c and i = 1, . . . , N_k,

∂f_θ(x) / ∂b_i^{(k)} = η_{θ,i}^{(k)}.      (3)

For k = 2, . . . , c, i = 1, . . . , N_k and j = 1, . . . , N_{k−1},

∂f_θ(x) / ∂w_{i,j}^{(k)} = σ(g_{θ,j}^{(k−1)}(x)) η_{θ,i}^{(k)}.      (4)

For i = 1, . . . , N_1 and j = 1, . . . , d,

∂f_θ(x) / ∂w_{i,j}^{(1)} = x_j η_{θ,i}^{(1)}.      (5)

Proof of Proposition 7
Proof of (2)
We can write

f_θ(x) = Σ_{i=1}^{N_c} σ(g_{θ,i}^{(c)}(x)) v_i

and σ(g_{θ,i}^{(c)}(x)) does not depend on v_1, . . . , v_{N_c}. Hence (2) holds.
Proof of (3)
We can write, if the scalar b_i^{(k)} is replaced by b_i^{(k)} + z for a small z, defining f_{θ,z}(x) as the new
network output,

f_{θ,z}(x) = h_θ^{(k)}( g_θ^{(k)}(x) + z e_i^{(N_k)} ),

where e_i^{(N_k)} is the i-th base vector in R^{N_k} and where h_θ^{(k)} is the function that maps the values at
layer k just before the activations σ to the output of the network.
By Lemma 5 or Lemma 6, the gradient of h_θ^{(k)}(·) with respect to “·” at g_θ^{(k)}(x) is η_θ^{(k)}. Hence, by a
Taylor expansion,

f_{θ,z}(x) = h_θ^{(k)}( g_θ^{(k)}(x) ) + η_{θ,i}^{(k)} z + o(|z|) = f_θ(x) + η_{θ,i}^{(k)} z + o(|z|)

and thus (3) holds.
Proof of (4)
We can write, if the scalar w_{i,j}^{(k)} is replaced by w_{i,j}^{(k)} + z for a small z, defining f_{θ,z}(x) as the new
network output,

f_{θ,z}(x) = h_θ^{(k)}( g_θ^{(k)}(x) + z σ(g_{θ,j}^{(k−1)}(x)) e_i^{(N_k)} ).

Hence, by a similar Taylor expansion as before,

f_{θ,z}(x) = h_θ^{(k)}( g_θ^{(k)}(x) ) + η_{θ,i}^{(k)} σ(g_{θ,j}^{(k−1)}(x)) z + o(|z|) = f_θ(x) + σ(g_{θ,j}^{(k−1)}(x)) η_{θ,i}^{(k)} z + o(|z|)

and thus (4) holds.
Proof of (5)
We can write, if the scalar w_{i,j}^{(1)} is replaced by w_{i,j}^{(1)} + z for a small z, defining f_{θ,z}(x) as the new
network output,

f_{θ,z}(x) = h_θ^{(1)}( g_θ^{(1)}(x) + z x_j e_i^{(N_1)} ).

Hence, by a similar Taylor expansion as before,

f_{θ,z}(x) = h_θ^{(1)}( g_θ^{(1)}(x) ) + η_{θ,i}^{(1)} x_j z + o(|z|) = f_θ(x) + x_j η_{θ,i}^{(1)} z + o(|z|)

and thus (5) holds. □

An example

We let σ be the identity function for simplicity. We consider a neural network with c = 2,
N0 = d = 2, N1 = 3, N2 = 2 and with parameters as follows.

θ : value

w_{1,1}^{(1)} = 1,   w_{1,2}^{(1)} = −1
w_{2,1}^{(1)} = 0,   w_{2,2}^{(1)} = 1
w_{3,1}^{(1)} = 2,   w_{3,2}^{(1)} = −2
b_1^{(1)} = 0,   b_2^{(1)} = 1,   b_3^{(1)} = −1
w_{1,1}^{(2)} = 1,   w_{1,2}^{(2)} = 1,   w_{1,3}^{(2)} = −1
w_{2,1}^{(2)} = 2,   w_{2,2}^{(2)} = 1,   w_{2,3}^{(2)} = 1
b_1^{(2)} = 0,   b_2^{(2)} = 1
v_1 = 1,   v_2 = 1
The input is x = (1, 2). The execution of the forward pass (computing the network output) yields

x = (1, 2),   g_θ^{(1)}(x) = (−1, 3, −3),   g_θ^{(2)}(x) = (5, −1),   f_θ(x) = 4.

The execution of the backward pass (backpropagation) yields

η_θ^{(2)} = ( σ′(g_{θ,1}^{(2)}(x)) v_1, σ′(g_{θ,2}^{(2)}(x)) v_2 ) = (v_1, v_2) = (1, 1)

and then

η_θ^{(1)} = η_θ^{(2)} W^{(2)} D_{σ′}(g_θ^{(1)}(x)) = (1, 1) [ 1 1 −1 ; 2 1 1 ] = (3, 2, 0).
Hence from Proposition 7 we have all the partial derivatives as follows.

θ : corresponding partial derivative

w_{1,1}^{(1)} : 3,   w_{1,2}^{(1)} : 6
w_{2,1}^{(1)} : 2,   w_{2,2}^{(1)} : 4
w_{3,1}^{(1)} : 0,   w_{3,2}^{(1)} : 0
b_1^{(1)} : 3,   b_2^{(1)} : 2,   b_3^{(1)} : 0
w_{1,1}^{(2)} : −1,   w_{1,2}^{(2)} : 3,   w_{1,3}^{(2)} : −3
w_{2,1}^{(2)} : −1,   w_{2,2}^{(2)} : 3,   w_{2,3}^{(2)} : −3
b_1^{(2)} : 1,   b_2^{(2)} : 1
v_1 : 5,   v_2 : −1
We conclude by computing some derivatives “by hand” in order to confirm that backpropagation
provides the correct derivative values.
We have

f_θ(x) = v_1 { w_{1,1}^{(2)} [ w_{1,1}^{(1)} + 2 w_{1,2}^{(1)} + b_1^{(1)} ] + w_{1,2}^{(2)} [ w_{2,1}^{(1)} + 2 w_{2,2}^{(1)} + b_2^{(1)} ] + w_{1,3}^{(2)} [ w_{3,1}^{(1)} + 2 w_{3,2}^{(1)} + b_3^{(1)} ] + b_1^{(2)} }
       + v_2 { w_{2,1}^{(2)} [ w_{1,1}^{(1)} + 2 w_{1,2}^{(1)} + b_1^{(1)} ] + w_{2,2}^{(2)} [ w_{2,1}^{(1)} + 2 w_{2,2}^{(1)} + b_2^{(1)} ] + w_{2,3}^{(2)} [ w_{3,1}^{(1)} + 2 w_{3,2}^{(1)} + b_3^{(1)} ] + b_2^{(2)} }.

Let us differentiate with respect to w_{3,2}^{(1)}. The terms that depend on w_{3,2}^{(1)} are

2 w_{3,2}^{(1)} w_{1,3}^{(2)} v_1 + 2 w_{3,2}^{(1)} w_{2,3}^{(2)} v_2.

Differentiating yields

2 w_{1,3}^{(2)} v_1 + 2 w_{2,3}^{(2)} v_2 = 2 · (−1) · 1 + 2 · 1 · 1 = 0,

confirming the result from backpropagation.
Let us differentiate with respect to b_1^{(1)}. The terms that depend on b_1^{(1)} are

w_{1,1}^{(2)} b_1^{(1)} v_1 + w_{2,1}^{(2)} b_1^{(1)} v_2.

Differentiating yields

w_{1,1}^{(2)} v_1 + w_{2,1}^{(2)} v_2 = 1 · 1 + 2 · 1 = 3,

confirming the result from backpropagation.
As a last example, let us differentiate with respect to w_{2,3}^{(2)}. The terms that depend on w_{2,3}^{(2)} are

w_{2,3}^{(2)} ( w_{3,1}^{(1)} + 2 w_{3,2}^{(1)} + b_3^{(1)} ) v_2.

Differentiating yields

( w_{3,1}^{(1)} + 2 w_{3,2}^{(1)} + b_3^{(1)} ) v_2 = (2 + 2 · (−2) − 1) · 1 = −3,

confirming the result from backpropagation.
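The same check can be automated with finite differences. The sketch below (not part of the original notes) re-implements the worked example with the identity activation and compares the backpropagation values of Proposition 7 with a numerical derivative; it is only a minimal verification for this specific example.

    import numpy as np

    W1 = np.array([[1.0, -1.0], [0.0, 1.0], [2.0, -2.0]])
    b1 = np.array([0.0, 1.0, -1.0])
    W2 = np.array([[1.0, 1.0, -1.0], [2.0, 1.0, 1.0]])
    b2 = np.array([0.0, 1.0])
    v = np.array([1.0, 1.0])
    x = np.array([1.0, 2.0])

    def f(W1, b1, W2, b2, v):
        # Identity activation: sigma(t) = t, so sigma'(t) = 1 everywhere
        g1 = W1 @ x + b1
        g2 = W2 @ g1 + b2
        return float(v @ g2)

    # Backpropagation with the identity activation: eta2 = v, eta1 = eta2 W2
    eta2 = v.copy()
    eta1 = eta2 @ W2
    print("df/db^(1) =", eta1)                   # expected (3, 2, 0)
    print("df/dw^(1)_{1,1} =", x[0] * eta1[0])   # expected 3

    # Finite-difference check of df/db^(1)_1
    eps = 1e-6
    b1_plus = b1.copy(); b1_plus[0] += eps
    print("finite difference:", (f(W1, b1_plus, W2, b2, v) - f(W1, b1, W2, b2, v)) / eps)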

2 Approximation with neural networks with one hidden layer
2.1 Statement of the theorem
Several theorems tackle the universality of feed-forward neural networks with one hidden layer of the
form
x ∈ [0,1]^d ↦ Σ_{i=1}^N v_i σ(⟨w_i, x⟩ + b_i),

with v_1, . . . , v_N ∈ R, b_1, . . . , b_N ∈ R, w_1, . . . , w_N ∈ R^d and σ : R → R.
We will study the first theorem of the literature, from [Cyb89].

Theorem 8 ([Cyb89]) Let σ : R → R be a continuous function such that

σ(t) → 0 as t → −∞,   σ(t) → 1 as t → +∞.

Then the set N_1 of functions of the form

x ∈ [0,1]^d ↦ Σ_{i=1}^N v_i σ(⟨w_i, x⟩ + b_i);   N ∈ N, v_1, . . . , v_N ∈ R, b_1, . . . , b_N ∈ R, w_1, . . . , w_N ∈ R^d,

is dense in the set C([0,1]^d, R) of the real-valued continuous functions on [0,1]^d, endowed with the
supremum norm ||f||_∞ = sup_{x ∈ [0,1]^d} |f(x)| for f ∈ C([0,1]^d, R).

This theorem means the following.

• We have N_1 ⊂ C([0,1]^d, R), which means that neural network functions are continuous.

• For all f ∈ C([0,1]^d, R), for all ε > 0, there exist N ∈ N, v_1, . . . , v_N ∈ R, b_1, . . . , b_N ∈ R and
w_1, . . . , w_N ∈ R^d such that

sup_{x ∈ [0,1]^d} | f(x) − Σ_{i=1}^N v_i σ(⟨w_i, x⟩ + b_i) | ≤ ε.

• Equivalently, for all f ∈ C([0,1]^d, R), for all ε > 0, there exists g ∈ N_1 such that ||f − g||_∞ ≤ ε.

• Equivalently, for all f ∈ C([0,1]^d, R), there exists a sequence (g_n)_{n ∈ N} such that g_n ∈ N_1 for
n ∈ N and ||f − g_n||_∞ → 0 as n → ∞.

• This theorem is comforting for the approximation error

inf_{f ∈ F} E[(f(X) − f^*(X))^2]

in regression with F = N_1. Indeed, this term is equal to zero if x ↦ f^*(x) (the conditional
expectation in regression) is a continuous function on [0,1]^d.

• Furthermore, if we let, for N ∈ N,

N_{1,N} = { x ∈ [0,1]^d ↦ Σ_{i=1}^N v_i σ(⟨w_i, x⟩ + b_i); v_1, . . . , v_N ∈ R, b_1, . . . , b_N ∈ R, w_1, . . . , w_N ∈ R^d }

(set of neural networks with N neurons), then we remark that N_{1,k} ⊂ N_{1,k+1} for k ∈ N. The proof
of this inclusion is left as an exercise; one can for instance construct a neural network with k + 1
neurons and v_{k+1} = 0 to obtain the function of a neural network with k neurons. Hence, we have
that inf_{f ∈ N_{1,N}} ||f − f^*||_∞ is decreasing in N. Hence, from Theorem 8, inf_{f ∈ N_{1,N}} ||f − f^*||_∞ → 0
as N → ∞ (left as an exercise). Hence, since E[g(X)^2] ≤ ||g||_∞^2 for g : [0,1]^d → R, we obtain

inf_{f ∈ N_{1,N}} E[(f(X) − f^*(X))^2] → 0 as N → ∞.

Hence, if we minimize the empirical risk with neural networks with N neurons (N large), the
approximation error will be small.
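As a numerical illustration (not part of the original notes, and not how the non-constructive proof proceeds), the sketch below fits one-hidden-layer sigmoid networks to a continuous function on [0, 1] by drawing random inner weights and solving a least-squares problem for the output weights; the supremum error on a grid typically decreases as N grows. The target function and the weight distribution are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = lambda t: 1.0 / (1.0 + np.exp(-t))   # sigmoid, e^t / (1 + e^t)

    # A continuous target function on [0, 1]
    f_target = lambda x: np.sin(2 * np.pi * x) + np.abs(x - 0.5)
    x_grid = np.linspace(0.0, 1.0, 400)
    y_grid = f_target(x_grid)

    for N in (5, 20, 100):
        # Random inner weights/biases; output weights v fitted by least squares
        w = rng.normal(scale=20.0, size=N)
        b = rng.uniform(-20.0, 20.0, size=N)
        features = sigma(np.outer(x_grid, w) + b)          # shape (400, N)
        v, *_ = np.linalg.lstsq(features, y_grid, rcond=None)
        sup_error = np.max(np.abs(features @ v - y_grid))  # ||f - g||_inf on the grid
        print(N, sup_error)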

2.2 Sketch of the proof


The proof is by contradiction (it is non-constructive): for f ∈ C([0,1]^d, R) we will not exhibit a neural
network that is close to f.
Step 1
We assume that there exists f_0 ∈ C([0,1]^d, R) \ N̄_1. Here we write N̄_1 for the closure of N_1, which
means that

f ∈ N̄_1  ⟺  there exists (g_N)_{N ∈ N} with g_N ∈ N_1 for N ∈ N such that ||g_N − f||_∞ → 0 as N → ∞.

Step 2
We apply the Hahn-Banach theorem to construct a continuous linear map

L : C([0,1]^d, R) → C

such that L(f_0) = 1 and L(f) = 0 for all f ∈ N_1.

• Linear means that for g_1, g_2 ∈ C([0,1]^d, R) and for α_1, α_2 ∈ R, we have

L(α_1 g_1 + α_2 g_2) = α_1 L(g_1) + α_2 L(g_2).

• Continuous means that for g ∈ C([0,1]^d, R) and for a sequence (g_n)_{n ∈ N} with g_n ∈ C([0,1]^d, R)
for n ∈ N and such that ||g_n − g||_∞ → 0 as n → ∞, we have L(g_n) → L(g) as n → ∞.

Step 3
We then use the Riesz representation theorem. There exists a complex-valued Borel measure µ on
[0,1]^d such that

L(f) = ∫_{[0,1]^d} f dµ

for f ∈ C([0,1]^d, R), where the above integral is a Lebesgue integral. That µ is a complex-valued Borel
measure on [0,1]^d means that, with B the Borel sigma-algebra (the measurable subsets of [0,1]^d), we
have µ : B → C. Furthermore, for E ∈ B such that E = ∪_{i=1}^∞ E_i with E_1, E_2, . . . ∈ B and E_i ∩ E_j = ∅
for i ≠ j, we have

µ(E) = Σ_{i=1}^∞ µ(E_i),

where µ(E_i) ∈ C and Σ_{i=1}^∞ |µ(E_i)| < ∞.
Step 4
We show that ∫_{[0,1]^d} f dµ = 0 for all f ∈ N_1 implies that µ = 0, which is a contradiction to
L(f_0) = ∫_{[0,1]^d} f_0 dµ = 1 and concludes the proof.

Remark 9 The steps 1, 2 and 3 could be carried out with N1 replaced by other function spaces F.
These steps are actually classical in approximation theory. The step 4 is on the contrary specific to
neural networks with one hidden layer.

2.3 Complete proof
Let f ∈ N_1. Then there exist N ∈ N, v_1, . . . , v_N ∈ R, b_1, . . . , b_N ∈ R, w_1, . . . , w_N ∈ R^d such that

f : x ∈ [0,1]^d ↦ Σ_{i=1}^N v_i σ(⟨w_i, x⟩ + b_i).

Since σ is continuous, f is continuous as a sum and composition of continuous functions. Hence
N_1 ⊂ C([0,1]^d, R). Let us now assume that N̄_1 ≠ C([0,1]^d, R). Thus, let f_0 ∈ C([0,1]^d, R) \ N̄_1. We
then apply a version of the Hahn-Banach theorem.

Theorem 10 There exists a continuous linear map

L : C([0, 1]d , R) → C

such that L(f0 ) = 1 and L(f ) = 0 for all f ∈ N1 .

The above theorem holds because f_0 ∉ N̄_1, see for instance [Rud98][Chapters 3 and 6]. We then
apply a version of the Riesz representation theorem.

Theorem 11 There exists a complex-valued Borel measure µ on [0,1]^d such that

L(f) = ∫_{[0,1]^d} f dµ

for f ∈ C([0,1]^d, R). We have seen that µ : B → C where, for B ∈ B, we have B ⊂ [0,1]^d. Furthermore,
we can define the total variation measure |µ| by

|µ|(E) = sup Σ_{i=1}^∞ |µ(E_i)|,   E ∈ B,

where the supremum is over the set of all the (E_i)_{i ∈ N} with E_i ∈ B for i ∈ N, E_i ∩ E_j = ∅ for i ≠ j
and E = ∪_{i=1}^∞ E_i. Then |µ| : B → [0, ∞) and |µ| has finite mass, |µ|([0,1]^d) < ∞.
Finally, there exists h : [0,1]^d → C, measurable, such that |h(x)| = 1 for x ∈ [0,1]^d and

dµ = h d|µ|,

which means that for B ∈ B,

µ(B) = ∫_B h d|µ| = ∫_B h(x) d|µ|(x),

and we have a more classical Lebesgue integral with a function h that plays the role of a density.
R
The above theorem is also given in [Rud98]. The theorem implies that L(f) = ∫_{[0,1]^d} f(x) h(x) d|µ|(x)
for f ∈ C([0,1]^d, R).
We now want to show that µ = 0, which means that µ(B) = 0 for B ∈ B. Since L(f) = 0 for all
f ∈ N_1, we have, for all N ∈ N, v_1, . . . , v_N ∈ R, b_1, . . . , b_N ∈ R, w_1, . . . , w_N ∈ R^d, with

f = Σ_{i=1}^N v_i f_i,

where for i = 1, . . . , N, f_i : [0,1]^d → R is defined by, for x ∈ [0,1]^d,

f_i(x) = σ(⟨w_i, x⟩ + b_i),

that L(f) = 0. Hence, since L is linear,

Σ_{i=1}^N v_i L(f_i) = 0.

Specifically, we can choose v_1 = 1 and v_2 = · · · = v_N = 0 to obtain, for all w ∈ R^d, for all b ∈ R,

L(f) = 0,

with f : [0,1]^d → R defined by, for x ∈ [0,1]^d,

f(x) = σ(⟨w, x⟩ + b).

This gives, for all w ∈ R^d, for all b ∈ R,

L(f) = ∫_{[0,1]^d} f(x) dµ(x) = 0

and thus

∫_{[0,1]^d} σ(⟨w, x⟩ + b) h(x) d|µ|(x) = 0.

Let w ∈ R^d and b, φ ∈ R, λ > 0. Let x ∈ [0,1]^d. We let

σ_{λ,φ}(x) = σ( λ(⟨w, x⟩ + b) + φ ).

Then if ⟨w, x⟩ + b > 0, since σ(t) → 1 as t → +∞, we have

σ( λ(⟨w, x⟩ + b) + φ ) → 1 as λ → +∞.

If ⟨w, x⟩ + b < 0, since σ(t) → 0 as t → −∞, we have

σ( λ(⟨w, x⟩ + b) + φ ) → 0 as λ → +∞.

If ⟨w, x⟩ + b = 0, then we have

σ( λ(⟨w, x⟩ + b) + φ ) = σ(φ).

Hence, we have shown that, as λ → +∞,

σ_{λ,φ}(x) → γ(x) := { 1 if ⟨w, x⟩ + b > 0;  0 if ⟨w, x⟩ + b < 0;  σ(φ) if ⟨w, x⟩ + b = 0 }.

Furthermore, for x ∈ [0,1]^d,

σ_{λ,φ}(x) = σ( λ(⟨w, x⟩ + b) + φ ) = σ( ⟨λw, x⟩ + λb + φ )

and thus σ_{λ,φ} ∈ N_1 (it is a neural network function). Hence

∫_{[0,1]^d} σ_{λ,φ}(x) h(x) d|µ|(x) = 0.

Furthermore,

sup_{λ>0} sup_{x ∈ [0,1]^d} |σ_{λ,φ}(x)| ≤ sup_{t ∈ R} |σ(t)| = ||σ||_∞ < ∞,

as σ is continuous and has finite limits at ±∞. We recall that |h(x)| = 1 for all x ∈ [0,1]^d and thus

∫_{[0,1]^d} sup_{λ>0} |σ_{λ,φ}(x)| |h(x)| d|µ|(x) ≤ ∫_{[0,1]^d} ( sup_{t ∈ R} |σ(t)| ) d|µ|(x) = ( sup_{t ∈ R} |σ(t)| ) |µ|([0,1]^d) < ∞.

Hence we can apply the dominated convergence theorem:

0 = ∫_{[0,1]^d} σ_{λ,φ}(x) h(x) d|µ|(x) → ∫_{[0,1]^d} γ(x) h(x) d|µ|(x)   as λ → +∞
  = ∫_{[0,1]^d} ( 1{⟨w,x⟩+b>0} + σ(φ) 1{⟨w,x⟩+b=0} ) h(x) d|µ|(x).

We let

Π_{w,b} = { x ∈ [0,1]^d : ⟨w, x⟩ + b = 0 }

and

H_{w,b} = { x ∈ [0,1]^d : ⟨w, x⟩ + b > 0 }

for w ∈ R^d and b ∈ R. We then obtain

∫_{H_{w,b}} h(x) d|µ|(x) + σ(φ) ∫_{Π_{w,b}} h(x) d|µ|(x) = 0

and thus

µ(H_{w,b}) + σ(φ) µ(Π_{w,b}) = 0.

Since σ is not constant, we can take φ_1, φ_2 ∈ R with σ(φ_1) ≠ σ(φ_2) and thus

[ 1  σ(φ_1) ; 1  σ(φ_2) ] [ µ(H_{w,b}) ; µ(Π_{w,b}) ] = [ 0 ; 0 ],

and the determinant of the above matrix is σ(φ_2) − σ(φ_1) ≠ 0. Hence, for all w ∈ R^d and b ∈ R,

µ(Π_{w,b}) = µ(H_{w,b}) = 0   (Π_{w,b} is a hyperplane and H_{w,b} a half-space).

Let w ∈ R^d. We write ||w||_1 = Σ_{i=1}^d |w_i|. For a bounded g : [−||w||_1, ||w||_1] → C (not necessarily
continuous), we let

ψ(g) = ∫_{[0,1]^d} g(⟨w, x⟩) dµ(x).

We remark that

|⟨w, x⟩| = | Σ_{i=1}^d w_i x_i | ≤ Σ_{i=1}^d |w_i| = ||w||_1.

We observe that ψ is linear: for any bounded g_1, g_2 : [−||w||_1, ||w||_1] → C, for any α_1, α_2 ∈ R, we have

ψ(α_1 g_1 + α_2 g_2) = ∫_{[0,1]^d} ( α_1 g_1(⟨w, x⟩) + α_2 g_2(⟨w, x⟩) ) dµ(x)
                     = α_1 ∫_{[0,1]^d} g_1(⟨w, x⟩) dµ(x) + α_2 ∫_{[0,1]^d} g_2(⟨w, x⟩) dµ(x)
                     = α_1 ψ(g_1) + α_2 ψ(g_2).

Furthermore, we have a continuity property of ψ of the form

|ψ(g_1) − ψ(g_2)| = | ∫_{[0,1]^d} (g_1 − g_2)(⟨w, x⟩) dµ(x) |
                  = | ∫_{[0,1]^d} (g_1 − g_2)(⟨w, x⟩) h(x) d|µ|(x) |
                  ≤ ∫_{[0,1]^d} |(g_1 − g_2)(⟨w, x⟩)| |h(x)| d|µ|(x)
                  ≤ ||g_1 − g_2||_∞ ∫_{[0,1]^d} |h(x)| d|µ|(x),

with ||g_1 − g_2||_∞ = sup_{t ∈ [−||w||_1, ||w||_1]} |g_1(t) − g_2(t)|. Hence we have

|ψ(g_1) − ψ(g_2)| ≤ ||g_1 − g_2||_∞ ∫_{[0,1]^d} d|µ|(x) = ||g_1 − g_2||_∞ |µ|([0,1]^d)   (with |µ|([0,1]^d) < ∞),
which is a Lipschitz property (stronger than continuity). Then, for θ ∈ R and g : [−||w||_1, ||w||_1] → R
defined by

g(t) = 1{t ∈ [θ, +∞)}

for t ∈ [−||w||_1, ||w||_1], we have

ψ(g) = ∫_{[0,1]^d} 1{⟨w,x⟩ ∈ [θ,+∞)} dµ(x)
     = ∫_{[0,1]^d} 1{⟨w,x⟩−θ ≥ 0} dµ(x)
     = ∫_{[0,1]^d} 1{⟨w,x⟩−θ > 0} dµ(x) + ∫_{[0,1]^d} 1{⟨w,x⟩−θ = 0} dµ(x)
     = µ(H_{w,−θ}) + µ(Π_{w,−θ})
     = 0,

from what we have seen before. For g defined on [−||w||_1, ||w||_1], valued in C, defined by

g(t) = 1{t ∈ (θ, +∞)}

for t ∈ [−||w||_1, ||w||_1], we also have

ψ(g) = ∫_{[0,1]^d} 1{⟨w,x⟩−θ > 0} dµ(x) = µ(H_{w,−θ}) = 0.

Hence,

• with 1_{[θ_1,θ_2]} : [−||w||_1, ||w||_1] → R defined by, for t ∈ [−||w||_1, ||w||_1], 1_{[θ_1,θ_2]}(t) = 1{θ_1 ≤ t ≤ θ_2},

• with 1_{[θ_1,θ_2)} : [−||w||_1, ||w||_1] → R defined by, for t ∈ [−||w||_1, ||w||_1], 1_{[θ_1,θ_2)}(t) = 1{θ_1 ≤ t < θ_2},

• with 1_{(θ_1,θ_2]} : [−||w||_1, ||w||_1] → R defined by, for t ∈ [−||w||_1, ||w||_1], 1_{(θ_1,θ_2]}(t) = 1{θ_1 < t ≤ θ_2},

we have

ψ(1_{[θ_1,θ_2]}) = ψ(1_{[θ_1,+∞)} − 1_{(θ_2,+∞)}),

with 1_{[θ_1,+∞)}(t) = 1{t ≥ θ_1} and 1_{(θ_2,+∞)}(t) = 1{t > θ_2} (for t ∈ [−||w||_1, ||w||_1]). Hence

ψ(1_{[θ_1,θ_2]}) = ψ(1_{[θ_1,+∞)}) − ψ(1_{(θ_2,+∞)}) = 0 − 0 = 0,

from what we have seen before. Also

ψ(1_{[θ_1,θ_2)}) = ψ(1_{[θ_1,+∞)}) − ψ(1_{[θ_2,+∞)}) = 0 − 0 = 0

and

ψ(1_{(θ_1,θ_2]}) = ψ(1_{(θ_1,+∞)}) − ψ(1_{(θ_2,+∞)}) = 0 − 0 = 0.
Now let us write r : [−||w||_1, ||w||_1] → C defined by

r(t) = e^{it} = cos(t) + i sin(t),

with i^2 = −1 and for t ∈ [−||w||_1, ||w||_1]. Let us also write, for k ∈ N and t ∈ [−||w||_1, ||w||_1],

r_k(t) = 1{t = −||w||_1} r(−||w||_1) + Σ_{j=−k}^{k−1} 1_{( j||w||_1/k, (j+1)||w||_1/k ]}(t) r( j||w||_1/k ).

Then

sup_{t ∈ [−||w||_1, ||w||_1]} |r_k(t) − r(t)| ≤ sup_{x,y ∈ [−||w||_1, ||w||_1], |x−y| ≤ ||w||_1/k} |r(x) − r(y)| → 0 as k → ∞,

since r is uniformly continuous (even Lipschitz) on [−||w||_1, ||w||_1]. Hence, with the continuity property
that we have seen,

ψ(r) = lim_{k→∞} ψ(r_k) = 0,

since ψ(r_k) = 0 for k ∈ N from what we have seen before. Hence, we have shown that for any w ∈ R^d,

∫_{[0,1]^d} e^{i⟨w,x⟩} dµ(x) = 0.

We recognize the Fourier transform of the measure µ. This implies that µ is the zero measure (which can
be shown by technical arguments that are not specific to neural networks). Hence

L(f_0) = ∫_{[0,1]^d} f_0(x) dµ(x) = 0,

which is a contradiction with L(f_0) = 1 and concludes the proof of Theorem 8. □


There are two main take-home messages.

• The density result N̄_1 = C([0,1]^d, R).

• The non-constructive proof technique, by contradiction. The use of the Hahn-Banach theorem
to prove a density result is standard.

3 Complements on the generalization error in classification and VC-dimension
3.1 Shattering coefficients
We consider a pair of random variables (X, Y) on [0,1]^d × {0, 1}. We consider (X_1, Y_1), . . . , (X_n, Y_n)
independent, with the same distribution as (X, Y) and independent of (X, Y). We consider a set F
of functions from [0,1]^d to {0, 1}. Then, we have seen in Section 1.2 that the generalization error is

E[ sup_{f ∈ F} | P(f(X) ≠ Y) − (1/n) Σ_{i=1}^n 1{f(X_i) ≠ Y_i} | ].

We have seen that, intuitively, the larger F is, the larger this generalization error is. A measure of
the “size” or “complexity” of F is given by the following definition.

Definition 12 We call shattering coefficient of F (at n, for n ∈ N) the quantity

Π_F(n) = max_{x_1, . . . , x_n ∈ [0,1]^d} card{ (f(x_1), . . . , f(x_n)); f ∈ F }.

We observe that Π_F(n) is increasing with respect to F and n:

• if F_1 ⊂ F_2, then Π_{F_1}(n) ≤ Π_{F_2}(n),

• Π_F(n) ≤ Π_F(n + 1).

Remark 13 We always have

Π_F(n) ≤ card{ (i_1, . . . , i_n) ∈ {0, 1}^n } = 2^n.

Remark 14 If F is a finite set,

Π_F(n) ≤ card(F).

Example
Let d = 1 and

F = { x ∈ [0, 1] ↦ 1{x ≥ a}; a ∈ R }.

Then for any 0 ≤ x_1 ≤ · · · ≤ x_n ≤ 1 and for f ∈ F, we have

(f(x_1), . . . , f(x_n)) = (0, . . . , 0, 1, . . . , 1),

where

• if a > x_n then there are only 0s,

• if a ≤ x_1 then there are only 1s,

• if x_1 < a ≤ x_n then the first 1 is at position i ∈ {2, . . . , n} with x_{i−1} < a and x_i ≥ a.

Hence the vectors that we can obtain are

(0, . . . , 0), (0, . . . , 0, 1), (0, . . . , 0, 1, 1), . . . , (1, . . . , 1).

Hence there are n + 1 possibilities. Hence

card{ (f(x_1), . . . , f(x_n)); f ∈ F } ≤ n + 1.

If we consider x_1, . . . , x_n that are not necessarily ordered, there is a bijection between { (f(x_1), . . . , f(x_n)); f ∈ F }
and { (f(x_{(1)}), . . . , f(x_{(n)})); f ∈ F }, where x_{(1)} ≤ · · · ≤ x_{(n)} are obtained by ordering x_1, . . . , x_n. Thus
we still have

card{ (f(x_1), . . . , f(x_n)); f ∈ F } ≤ n + 1.

Hence Π_F(n) ≤ n + 1. Furthermore, with x_1 = 0, x_2 = 1/n, . . . , x_n = (n − 1)/n,

• with f given by x ↦ 1{x ≥ 2} we have (f(x_1), . . . , f(x_n)) = (0, . . . , 0),

• with f given by x ↦ 1{x ≥ −1} we have (f(x_1), . . . , f(x_n)) = (1, . . . , 1),

• for i ∈ {1, . . . , n − 1}, with f given by x ↦ 1{x ≥ (x_i + x_{i+1})/2} we have (f(x_1), . . . , f(x_n)) =
(0, . . . , 0, 1, . . . , 1) with i 0s and n − i 1s.

Hence with x_1 = 0, x_2 = 1/n, . . . , x_n = (n − 1)/n we have

card{ (f(x_1), . . . , f(x_n)); f ∈ F } ≥ n + 1.

Hence finally Π_F(n) ≥ n + 1 and thus

Π_F(n) = n + 1.
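This count can be checked numerically. The sketch below (not from the original notes) approximates the class of thresholds by a fine grid of values of a and counts the distinct label vectors produced on the points 0, 1/n, . . . , (n − 1)/n; the grid is an arbitrary finite substitute for the full class.

    import numpy as np

    def shattering_count(xs, thresholds):
        """card{(f(x_1),...,f(x_n)); f in F} for F = {x -> 1{x >= a}},
        restricted to a finite grid of thresholds a."""
        labelings = {tuple((xs >= a).astype(int)) for a in thresholds}
        return len(labelings)

    n = 5
    xs = np.arange(n) / n                        # the points 0, 1/n, ..., (n-1)/n
    thresholds = np.linspace(-1.0, 2.0, 2001)    # fine grid of candidate a's
    print(shattering_count(xs, thresholds))      # n + 1 = 6
    print(2 ** n)                                # the trivial upper bound 2^n = 32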
Example
Let d = 2 and

F = { x ∈ [0, 1]^2 ↦ 1{⟨w, x⟩ ≥ a}; a ∈ R, w ∈ R^2 }.

These are affine classifiers as in Figure 4.

Figure 4: An example of an affine classifier.

Then for n = 3 and for x1 , x2 , x3 ∈ [0, 1]2 that are not contained in a line, we can obtain the 8
possible classification vectors, as shown in Figure 5.

Figure 5: Obtaining the 8 possible classification vectors with 3 points and affine classifiers.

Hence

card{ (f(x_1), f(x_2), f(x_3)); f ∈ F } ≥ 8.

Also

card{ (f(x_1), f(x_2), f(x_3)); f ∈ F } ≤ card{ (i_1, i_2, i_3) ∈ {0, 1}^3 } = 2^3 = 8.

Hence we have

Π_F(3) = 8.

3.2 Bounding the generalization error from the shattering coefficients


The next proposition enables us to bound the generalization error from Π_F(n).

Proposition 15 For any set F of functions from [0,1]^d to {0, 1}, we have

E[ sup_{f ∈ F} | P(f(X) ≠ Y) − (1/n) Σ_{i=1}^n 1{f(X_i) ≠ Y_i} | ] ≤ 2 √( 2 log(2 Π_F(n)) / n ).

Remarks

• The notation log stands for the natural logarithm (base e).

• We see a dependence in 1/√n, which is classical when empirical means are compared with
expectations.

• If card(F) = 1 with F = {f}, then

E[ sup_{f ∈ F} | P(f(X) ≠ Y) − (1/n) Σ_{i=1}^n 1{f(X_i) ≠ Y_i} | ]
  = E[ | E[1{f(X) ≠ Y}] − (1/n) Σ_{i=1}^n 1{f(X_i) ≠ Y_i} | ]
  ≤ √( E[ ( E[1{f(X) ≠ Y}] − (1/n) Σ_{i=1}^n 1{f(X_i) ≠ Y_i} )^2 ] )
  = √( Var( (1/n) Σ_{i=1}^n 1{f(X_i) ≠ Y_i} ) )
  = √( (1/n^2) Σ_{i=1}^n Var(1{f(X_i) ≠ Y_i}) )
  = √( (1/n^2) n Var(1{f(X) ≠ Y}) )
  = (1/√n) √( Var(1{f(X) ≠ Y}) )
  ≤ 1/(2√n).

In the first inequality above, we have used Jensen's inequality, which implies that E[|W|] ≤ √(E[W^2])
for a random variable W (here W is centered, so E[W^2] = Var(W)). In the subsequent equalities we
have used that (X_1, Y_1), . . . , (X_n, Y_n) are independent and distributed as (X, Y). The last inequality
holds because 1{f(X) ≠ Y} is a Bernoulli variable, whose variance is at most 1/4.
On the other hand, the upper bound of Proposition 15 is

2 √( 2 log(2) / n ) = 2 √(2 log(2)) (1/√n) ≈ 2.35 / √n.

We obtain the same order of magnitude 1/√n.

• In all cases, we have Π_F(n) ≤ 2^n and thus the bound of Proposition 15 is smaller than

2 √( 2 log(2 × 2^n) / n ) = 2 √( 2 log(2^{n+1}) / n ) = 2 √( 2(n + 1) log(2) / n ) = 2 √(2 log(2)) √(1 + 1/n) → 2 √(2 log(2)) as n → ∞.

This bound based on Π_F(n) ≤ 2^n is not informative, because we already know that

E[ sup_{f ∈ F} | P(f(X) ≠ Y) − (1/n) Σ_{i=1}^n 1{f(X_i) ≠ Y_i} | ] ≤ E[ sup_{f ∈ F} 1 ] = 1,

since both P(f(X) ≠ Y) and (1/n) Σ_{i=1}^n 1{f(X_i) ≠ Y_i} belong to [0, 1].

To summarize:

• The bound of Proposition 15 agrees in terms of order of magnitude with the two extreme cases
card(F) = 1 (then Π_F(n) = 1) and Π_F(n) ≤ 2^n.

• This bound will be particularly useful when Π_F(n) is in between these two cases.
Proof of Proposition 15
The proof is based on a classical argument called symmetrization. Without loss of generality,
we can assume that Y ∈ {−1, 1} and that F is composed of functions from [0,1]^d to {−1, 1} (the choice of 0
and 1 to define the two classes is arbitrary in classification; here −1 and 1 will be more convenient). We
let (X̃_1, Ỹ_1), . . . , (X̃_n, Ỹ_n) be pairs of random variables such that (X_1, Y_1), . . . , (X_n, Y_n), (X̃_1, Ỹ_1), . . . , (X̃_n, Ỹ_n)
are independent and with the same distribution as (X, Y).
Then

P(f(X) ≠ Y) = Ẽ[ (1/n) Σ_{i=1}^n 1{f(X̃_i) ≠ Ỹ_i} ],

writing Ẽ to indicate that the expectation is taken with respect to (X̃_1, Ỹ_1), . . . , (X̃_n, Ỹ_n). We let

Δ_n = E[ sup_{f ∈ F} | P(f(X) ≠ Y) − (1/n) Σ_{i=1}^n 1{f(X_i) ≠ Y_i} | ].
We have (all sums running over i = 1, . . . , n)

Δ_n = E[ sup_{f ∈ F} | Ẽ[ (1/n) Σ_i 1{f(X̃_i) ≠ Ỹ_i} ] − (1/n) Σ_i 1{f(X_i) ≠ Y_i} | ]
    = E[ sup_{f ∈ F} | Ẽ[ (1/n) Σ_i 1{f(X̃_i) ≠ Ỹ_i} − (1/n) Σ_i 1{f(X_i) ≠ Y_i} ] | ]
    ≤ E[ sup_{f ∈ F} Ẽ[ | (1/n) Σ_i 1{f(X̃_i) ≠ Ỹ_i} − (1/n) Σ_i 1{f(X_i) ≠ Y_i} | ] ]
    ≤ E Ẽ[ sup_{f ∈ F} | (1/n) Σ_i 1{f(X̃_i) ≠ Ỹ_i} − (1/n) Σ_i 1{f(X_i) ≠ Y_i} | ].

Let now σ_1, . . . , σ_n be independent random variables, independent of (X_i, Y_i, X̃_i, Ỹ_i)_{i=1,...,n} and such
that

P_σ(σ_i = 1) = P_σ(σ_i = −1) = 1/2

for i = 1, . . . , n, where we write E_σ and P_σ for the expectations and probabilities with respect to σ_1, . . . , σ_n.
We let, for i = 1, . . . , n,

(X̄_i, Ȳ_i) = (X_i, Y_i) if σ_i = 1,   (X̃_i, Ỹ_i) if σ_i = −1,

and

(X̿_i, Y̿_i) = (X̃_i, Ỹ_i) if σ_i = 1,   (X_i, Y_i) if σ_i = −1.

Then (X̄_i, Ȳ_i)_{i=1,...,n}, (X̿_i, Y̿_i)_{i=1,...,n} are independent and have the same distribution as (X, Y). Let
us show this. For any bounded measurable functions g_1, . . . , g_{2n} from [0,1]^d × {−1, 1} to R, we have,
using the law of total expectation,

E[ Π_{i=1}^n g_i(X̄_i, Ȳ_i) Π_{i=1}^n g_{n+i}(X̿_i, Y̿_i) ]
  = E[ E[ Π_{i=1}^n g_i(X̄_i, Ȳ_i) Π_{i=1}^n g_{n+i}(X̿_i, Y̿_i) | σ_1, . . . , σ_n ] ]
  = E[ E[ Π_{i: σ_i=1} g_i(X_i, Y_i) Π_{i: σ_i=−1} g_i(X̃_i, Ỹ_i) Π_{j: σ_j=1} g_{n+j}(X̃_j, Ỹ_j) Π_{j: σ_j=−1} g_{n+j}(X_j, Y_j) | σ_1, . . . , σ_n ] ].

In the above conditional expectation, the 2n variables are independent, since each of the (X_i, Y_i)_{i=1,...,n},
(X̃_i, Ỹ_i)_{i=1,...,n} appears exactly once. Their common distribution is that of (X, Y). Furthermore, the 2n
functions g_1, . . . , g_{2n} appear once each. Hence we have

E[ Π_{i=1}^n g_i(X̄_i, Ȳ_i) Π_{i=1}^n g_{n+i}(X̿_i, Y̿_i) ] = E[ E[ Π_{i=1}^{2n} E[g_i(X, Y)] | σ_1, . . . , σ_n ] ] = Π_{i=1}^{2n} E[g_i(X, Y)].

Hence, indeed, (X̄_i, Ȳ_i)_{i=1,...,n}, (X̿_i, Y̿_i)_{i=1,...,n} are independent and have the same distribution as (X, Y).
Hence, we have

Δ_n ≤ E Ẽ E_σ[ sup_{f ∈ F} | (1/n) Σ_{i=1}^n ( 1{f(X̄_i) ≠ Ȳ_i} − 1{f(X̿_i) ≠ Y̿_i} ) | ],

where E_σ means that only σ_1, . . . , σ_n are random. We observe that

1{f(X̄_i) ≠ Ȳ_i} − 1{f(X̿_i) ≠ Y̿_i} = σ_i ( 1{f(X_i) ≠ Y_i} − 1{f(X̃_i) ≠ Ỹ_i} ),

because

• if σ_i = 1, (X̄_i, Ȳ_i, X̿_i, Y̿_i) = (X_i, Y_i, X̃_i, Ỹ_i),

• if σ_i = −1, (X̄_i, Ȳ_i, X̿_i, Y̿_i) = (X̃_i, Ỹ_i, X_i, Y_i).

Hence

Δ_n ≤ E Ẽ E_σ[ sup_{f ∈ F} | (1/n) Σ_{i=1}^n σ_i ( 1{f(X_i) ≠ Y_i} − 1{f(X̃_i) ≠ Ỹ_i} ) | ]
    ≤ E Ẽ E_σ[ sup_{f ∈ F} | (1/n) Σ_{i=1}^n σ_i 1{f(X_i) ≠ Y_i} | + sup_{f ∈ F} | (1/n) Σ_{i=1}^n σ_i 1{f(X̃_i) ≠ Ỹ_i} | ]
    = 2 E E_σ[ sup_{f ∈ F} | (1/n) Σ_{i=1}^n σ_i 1{f(X_i) ≠ Y_i} | ]
    ≤ 2 max_{y_1,...,y_n ∈ {−1,1}} max_{x_1,...,x_n ∈ [0,1]^d} E_σ[ sup_{f ∈ F} | (1/n) Σ_{i=1}^n σ_i 1{f(x_i) ≠ y_i} | ].

For any y_1, . . . , y_n ∈ {−1, 1} and x_1, . . . , x_n ∈ [0,1]^d we define the set

V_F(x, y) = { ( 1{y_1 ≠ f(x_1)}, . . . , 1{y_n ≠ f(x_n)} ); f ∈ F }.

Then we have

Δ_n ≤ (2/n) max_{y_1,...,y_n ∈ {−1,1}} max_{x_1,...,x_n ∈ [0,1]^d} E_σ[ sup_{v ∈ V_F(x,y)} |⟨σ, v⟩| ],

with σ = (σ_1, . . . , σ_n).
We observe that for all x_1, . . . , x_n, y_1, . . . , y_n, there is a bijection between V_F(x, y) and
{(f(x_1), . . . , f(x_n)); f ∈ F}. Hence

max_{y_1,...,y_n ∈ {−1,1}} max_{x_1,...,x_n ∈ [0,1]^d} card(V_F(x, y)) ≤ Π_F(n).

Assume that we show

For any set V ⊂ {−1, 0, 1}^n,   E_σ[ sup_{v ∈ V} |⟨σ, v⟩| ] ≤ √( 2n log(2 card(V)) ).      (6)

Then we would have

Δ_n ≤ (2/n) √( 2n log(2 Π_F(n)) ) = 2 √( 2 log(2 Π_F(n)) / n ),

which would conclude the proof.
Let us now show (6). Let us write −V = {−v; v ∈ V} and V^# = V ∪ (−V). We have, for any s > 0,

E_σ[ sup_{v ∈ V} |⟨σ, v⟩| ] = E_σ[ sup_{v ∈ V^#} ⟨σ, v⟩ ] = E_σ[ (1/s) log( e^{s sup_{v ∈ V^#} ⟨σ, v⟩} ) ].

We now apply Jensen's inequality to the concave function (1/s) log. This gives

E_σ[ sup_{v ∈ V} |⟨σ, v⟩| ] ≤ (1/s) log( E_σ[ e^{s sup_{v ∈ V^#} ⟨σ, v⟩} ] )
  = (1/s) log( E_σ[ sup_{v ∈ V^#} e^{s⟨σ, v⟩} ] )
  ≤ (1/s) log( E_σ[ Σ_{v ∈ V^#} e^{s⟨σ, v⟩} ] )
  = (1/s) log( Σ_{v ∈ V^#} E_σ[ e^{s⟨σ, v⟩} ] )
  = (1/s) log( Σ_{v ∈ V^#} Π_{i=1}^n E_σ[ e^{s σ_i v_i} ] )
  = (1/s) log( Σ_{v ∈ V^#} Π_{i=1}^n (1/2)( e^{s v_i} + e^{−s v_i} ) ).

We can show simply that for x ≥ 0, e^x + e^{−x} ≤ 2 e^{x^2/2}. This gives, using also that v_i^2 ≤ 1 for
i = 1, . . . , n and v ∈ V,

E_σ[ sup_{v ∈ V} |⟨σ, v⟩| ] ≤ (1/s) log( Σ_{v ∈ V^#} Π_{i=1}^n e^{s^2 v_i^2 / 2} )
  ≤ (1/s) log( Σ_{v ∈ V^#} e^{n s^2 / 2} )
  ≤ (1/s) log( card(V^#) e^{n s^2 / 2} )
  = log(card(V^#))/s + ns/2.

We let

s = √( 2 log(card(V^#)) / n ),

which gives

E_σ[ sup_{v ∈ V} |⟨σ, v⟩| ] ≤ (1/√2) √( n log(card(V^#)) ) + (1/√2) √( n log(card(V^#)) ) = √( 2n log(card(V^#)) )
  ≤ √( 2n log(2 card(V)) ).

Hence (6) is proved and thus the proof of Proposition 15 is concluded. □

3.3 VC-dimension
From the previous proposition, the shattering coefficient Π_F(n) is important and we would like to
quantify its growth as n grows. A tool for this is the Vapnik-Chervonenkis dimension, which we will
call VC-dimension.

Definition 16 For a set F of functions from [0,1]^d to {0, 1}, we write VCdim(F) and call VC-
dimension the quantity

VCdim(F) = sup{ m ∈ N; Π_F(m) = 2^m },

with the convention Π_F(0) = 1, so that VCdim(F) ≥ 0. It is possible that VCdim(F) = +∞.

Interpretation
The quantity VCdim(F) is the largest number of input points that can be “shattered”, meaning
that they can be classified in all possible ways by varying the classifier in F.
Examples

• When

F = {all the functions from [0,1]^d to {0, 1}},

then VCdim(F) = +∞. Indeed, for any n ∈ N, by considering x_1, . . . , x_n two-by-two distinct,
we have Π_F(n) = 2^n.

• When F is finite with card(F) ≤ 2^{m_0}, then VCdim(F) ≤ m_0. Indeed, for m > m_0, we have seen
that Π_F(m) ≤ card(F) ≤ 2^{m_0} < 2^m. Hence

m ∉ { m ∈ N; Π_F(m) = 2^m }

and thus

sup{ m ∈ N; Π_F(m) = 2^m } ≤ m_0.

Remark 17 If VCdim(F) = V < ∞, then for i = 1, . . . , V, Π_F(i) = 2^i.

Proof of Remark 17
Since Π_F(V) = 2^V, there exist x_1, . . . , x_V ∈ [0,1]^d such that

card{ (f(x_1), . . . , f(x_V)); f ∈ F } = 2^V.

This means that we obtain all the possible vectors with components in {0, 1}, and thus we obtain all
the possible subvectors for the first i coefficients, for i = 1, . . . , V. Hence

card{ (f(x_1), . . . , f(x_i)); f ∈ F } = 2^i

and thus Π_F(i) = 2^i. □

Similarly, if for i_0 ∈ N, Π_F(i_0) = 2^{i_0}, then for all i = 1, . . . , i_0, Π_F(i) = 2^i.
We can compute the VC-dimension in the case of linear and affine classifiers.
We can compute the VC-dimension in the case of linear and affine classifiers.

Proposition 18 Let d ∈ N. Let

F_{d,l} = { x ∈ [0,1]^d ↦ 1{⟨w, x⟩ ≥ 0}; w ∈ R^d }

and

F_{d,a} = { x ∈ [0,1]^d ↦ 1{⟨w, x⟩ + a ≥ 0}; w ∈ R^d, a ∈ R }.

Then

VCdim(F_{d,l}) = d

and

VCdim(F_{d,a}) = d + 1.

Remark 19 The VC-dimension coincides here with the number of free parameters and thus with the
usual notion of dimension.

Proof of Proposition 18 Write

x_1 = (1, 0, . . . , 0),   x_2 = (0, 1, 0, . . . , 0),   . . . ,   x_d = (0, . . . , 0, 1)

in R^d. Then for any y_1, . . . , y_d ∈ {0, 1} write

z_i = 1 if y_i = 1,   z_i = −1 if y_i = 0.

Consider

x ↦ 1{ ⟨x, Σ_{j=1}^d z_j x_j⟩ ≥ 0 }.

Then for k = 1, . . . , d,

1{ ⟨x_k, Σ_{j=1}^d z_j x_j⟩ ≥ 0 } = 1{ ⟨x_k, z_k x_k⟩ ≥ 0 } = 1{z_k ≥ 0} = y_k.

Hence we reach all the elements of {0, 1}^d. Hence

Π_{F_{d,l}}(d) = 2^d

and thus

VCdim(F_{d,l}) ≥ d.
Assume that

VCdim(F_{d,l}) ≥ d + 1.

Then, from Remark 17, Π_{F_{d,l}}(d + 1) = 2^{d+1}. Hence, there exist x_1, . . . , x_{d+1} ∈ [0,1]^d and
w_1, . . . , w_{2^{d+1}} ∈ R^d such that the vectors

( w_i^T x_1, . . . , w_i^T x_{d+1} ),

for i = 1, . . . , 2^{d+1}, take all possible sign vectors (< 0 or ≥ 0). We write X for the (d + 1) × d matrix
with rows x_1^T, . . . , x_{d+1}^T and

W = ( w_1 · · · w_{2^{d+1}} )

for the d × 2^{d+1} matrix with columns w_1, . . . , w_{2^{d+1}}. Then

XW = ( x_i^T w_j )_{i=1,...,d+1; j=1,...,2^{d+1}}

is of dimension (d + 1) × 2^{d+1}. Let us show that the d + 1 rows of XW are linearly independent. Let
a be of size (d + 1) × 1, non-zero, such that

a^T X W = 0,

where the above display is a linear combination of the rows of XW. Then, for k ∈ {1, . . . , 2^{d+1}},

(a^T X W)_k = a^T X w_k = a^T ( x_1^T w_k, . . . , x_{d+1}^T w_k )^T.

Let k be such that, for i = 1, . . . , d + 1, a_i ≥ 0 if and only if x_i^T w_k ≥ 0 (such a k exists since we reach
all the possible sign vectors). Then

(a^T X W)_k = Σ_{i=1}^{d+1} a_i (x_i^T w_k) = Σ_{i=1}^{d+1} |a_i| |x_i^T w_k|,

since a_i and x_i^T w_k have the same sign. Since a is non-zero we can assume that there is a j such that
a_j < 0 (up to replacing a by −a at the beginning). Then

(a^T X W)_k ≥ |a_j| |x_j^T w_k| > 0,

since x_j^T w_k < 0 and a_j < 0. This is a contradiction. Hence there does not exist a non-zero a of size
(d + 1) × 1 such that a^T X W = 0. Hence the d + 1 rows of XW are linearly independent. Hence the
rank of XW is equal to d + 1. But the rank of XW is smaller than or equal to d, because X is of dimension
(d + 1) × d. Hence we have reached a contradiction and thus

VCdim(F_{d,l}) < d + 1.

Hence

VCdim(F_{d,l}) = d.
Let us now consider F_{d,a}. Let

x_1 = (1, 0, . . . , 0),   x_2 = (0, 1, 0, . . . , 0),   . . . ,   x_d = (0, . . . , 0, 1),   x_{d+1} = (0, . . . , 0)

in R^d. Then, for any y_1, . . . , y_{d+1} ∈ {0, 1}, write, for i = 1, . . . , d + 1,

z_i = 1 if y_i = 1,   z_i = −1 if y_i = 0.

Consider the function

x ∈ [0,1]^d ↦ 1{ ⟨x, Σ_{j=1}^d (z_j − z_{d+1}) x_j⟩ ≥ −z_{d+1} }.

Then for k = 1, . . . , d,

1{ ⟨x_k, Σ_{j=1}^d (z_j − z_{d+1}) x_j⟩ ≥ −z_{d+1} } = 1{ ⟨x_k, (z_k − z_{d+1}) x_k⟩ ≥ −z_{d+1} } = 1{z_k − z_{d+1} ≥ −z_{d+1}} = 1{z_k ≥ 0} = y_k,

and

1{ ⟨x_{d+1}, Σ_{j=1}^d (z_j − z_{d+1}) x_j⟩ ≥ −z_{d+1} } = 1{0 ≥ −z_{d+1}} = 1{z_{d+1} ≥ 0} = y_{d+1}.

Hence we reach the 2^{d+1} possible vectors and thus

VCdim(F_{d,a}) ≥ d + 1.

Assume now that

VCdim(F_{d,a}) ≥ d + 2.

Then, as seen previously,

Π_{F_{d,a}}(d + 2) = 2^{d+2}.

Hence there exist x_1, . . . , x_{d+2} ∈ [0,1]^d such that for all y_1, . . . , y_{d+2} ∈ {0, 1}, there exist w ∈ R^d and
b ∈ R such that, for k = 1, . . . , d + 2,

1{⟨w, x_k⟩ + b ≥ 0} = y_k.

We write

x̄_i = (x_i, 1)

of size (d + 1) × 1 for i = 1, . . . , d + 2 and

w̄ = (w, b)

of size (d + 1) × 1. Then, for k = 1, . . . , d + 2,

1{⟨w̄, x̄_k⟩ ≥ 0} = 1{⟨w, x_k⟩ + b ≥ 0} = y_k.

Hence in R^{d+1} we have shattered the d + 2 vectors x̄_1, . . . , x̄_{d+2} (we have obtained all the possible sign
vectors) with linear classifiers. This implies

VCdim(F_{d+1,l}) ≥ d + 2,

which is false, since we have shown above that VCdim(F_{d+1,l}) = d + 1. Hence we have

VCdim(F_{d,a}) < d + 2.

Hence

VCdim(F_{d,a}) = d + 1. □
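The explicit construction used in the first part of the proof (shattering the canonical basis vectors with w = Σ_j z_j x_j) can be checked directly. The sketch below (not from the original notes) verifies that all 2^d labelings of the d basis points are realized by linear classifiers; d = 4 is an arbitrary choice.

    import numpy as np
    from itertools import product

    d = 4
    basis = np.eye(d)                    # the points x_1, ..., x_d of the proof

    reached = set()
    for y in product([0, 1], repeat=d):  # all 2^d labelings
        z = np.where(np.array(y) == 1, 1.0, -1.0)
        w = basis.T @ z                  # w = sum_j z_j x_j, as in the proof
        labels = tuple((basis @ w >= 0).astype(int))   # 1{<x_k, w> >= 0}
        reached.add(labels)
        assert labels == y               # each labeling is realized

    print(len(reached), "labelings reached out of", 2 ** d)   # 16 out of 16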


3.4 Bounding the shattering coefficients from the VC-dimension


From the next lemma, we can bound the shattering coefficients from bounds on the VC-dimension.

Lemma 20 (Sauer's lemma) Let F be a non-empty set of functions from [0,1]^d to {0, 1}. Assume
that VCdim(F) < ∞. Then we have, for n ∈ N,

Π_F(n) ≤ Σ_{i=0}^{VCdim(F)} C(n, i) ≤ (n + 1)^{VCdim(F)},

where C(n, i) = n! / (i!(n − i)!) if i ∈ {0, . . . , n} and C(n, i) = 0 if i > n.
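The two bounds of the lemma can be compared numerically. A minimal sketch (not from the original notes), with an arbitrary choice of n and of the VC-dimension:

    from math import comb

    def sauer_bound(n, vc):
        """Sum_{i=0}^{vc} C(n, i), the bound of Lemma 20 on Pi_F(n)."""
        return sum(comb(n, i) for i in range(min(vc, n) + 1))

    n, vc = 1000, 5            # e.g. affine classifiers in dimension 4 have VC-dimension 5
    print(sauer_bound(n, vc))            # polynomial growth in n
    print((n + 1) ** vc)                 # the cruder bound (n + 1)^VCdim
    print(2 ** n > sauer_bound(n, vc))   # far below the trivial bound 2^n: True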

Proof of Lemma 20
For any set A, with H a non-empty set of functions from A to {0, 1}, we can define Π_H(n) and
VCdim(H) in the same way as when A = [0,1]^d. Let us show that:

For any set A, for any non-empty set H of functions from A to {0, 1}:      (7)

Π_H(k) ≤ Σ_{i=0}^{V_H} C(k, i),   for k = 1, . . . , n, with V_H = VCdim(H).

We will show (7) by induction on k.
Let us show it for k = 1. If V_H = 0, then Π_H(1) < 2^1 = 2. Hence

Π_H(1) ≤ 1 = C(1, 0) = Σ_{i=0}^{0} C(k, i).

Hence (7) is proved for k = 1 and V_H = 0.
If V_H ≥ 1 we have

Π_H(1) = 2 = C(1, 0) + C(1, 1) ≤ Σ_{i=0}^{V_H} C(k, i).

Hence eventually (7) is true for k = 1. Assume now that (7) is true for any k from 1 to n − 1.
If V_H = 0, then there do not exist any x ∈ A and h_1, h_2 ∈ H such that h_1(x) = 0 and h_2(x) = 1,
because for all x ∈ A, card{h(x); h ∈ H} < 2^1. Hence for all x_1, . . . , x_n ∈ A,

card{ (h(x_1), . . . , h(x_n)); h ∈ H } = 1.

Hence

Π_H(n) = 1 = Σ_{i=0}^{0} C(n, i).

It thus remains to address the case V_H ≥ 1.


For x_1, . . . , x_n ∈ A, define

H(x_1, . . . , x_n) = { (h(x_1), . . . , h(x_n)); h ∈ H }.

There exist x_1, . . . , x_n ∈ A such that

card(H(x_1, . . . , x_n)) = Π_H(n).

The set H(x_1, . . . , x_n) only depends on the values of the functions in H on {x_1, . . . , x_n}. Hence,
replacing

• A by A = {x_1, . . . , x_n},

• H by

H̃ = { h′ : {x_1, . . . , x_n} → {0, 1}; there exists h ∈ H such that h′(x_i) = h(x_i) for i = 1, . . . , n },

we have Π_H(n) = Π_{H̃}(n).
Hence in the sequel we assume that A = {x_1, . . . , x_n} and that H is a set of functions from {x_1, . . . , x_n}
to {0, 1}, without loss of generality.
Let us consider the set

H′ = { h ∈ H; h(x_n) = 1 and h − 1_{{x_n}} ∈ H },

composed of the functions that are equal to 1 at x_n and that stay in H if their value at x_n is replaced
by 0. Here we have written 1_{{x_n}} : {x_1, . . . , x_n} → {0, 1} for the function defined by 1_{{x_n}}(x_i) = 1{x_n = x_i}
for i = 1, . . . , n.
We use the notation, for a set G of functions from {x_1, . . . , x_n} to {0, 1} and {x_{i_1}, . . . , x_{i_q}} ⊂
{x_1, . . . , x_n},

G(x_{i_1}, . . . , x_{i_q}) = { (g(x_{i_1}), . . . , g(x_{i_q})); g ∈ G }.

We have

H(x_1, . . . , x_n) = H′(x_1, . . . , x_n) ∪ (H\H′)(x_1, . . . , x_n)

and thus

card H(x_1, . . . , x_n) ≤ card H′(x_1, . . . , x_n) + card (H\H′)(x_1, . . . , x_n).

Step 1: bounding card H′(x_1, . . . , x_n)
We observe that

card H′(x_1, . . . , x_n) = card H′(x_1, . . . , x_{n−1}),

because h(x_n) = 1 for h ∈ H′.
If q ∈ N is such that there exists {x_{i_1}, . . . , x_{i_q}} ⊂ {x_1, . . . , x_n} with card H′(x_{i_1}, . . . , x_{i_q}) = 2^q, then
x_n ∉ {x_{i_1}, . . . , x_{i_q}} (because h(x_n) = 1 for h ∈ H′). Also, we have card H(x_{i_1}, . . . , x_{i_q}, x_n) = 2^{q+1},
because

2^{q+1} = card({0, 1}^q × {0, 1}) = card{ (h(x_{i_1}), . . . , h(x_{i_q}), h(x_n)); h ∈ H }

by definition of H′. Hence V_H ≥ q + 1 and thus V_H ≥ V_{H′} + 1 (since q can be taken as V_{H′}). Hence
V_{H′} ≤ V_H − 1. Hence, applying (7) with k = n − 1, we have

card H′(x_1, . . . , x_n) = card H′(x_1, . . . , x_{n−1}) ≤ Π_{H′}(n − 1) ≤ Σ_{i=0}^{V_{H′}} C(n−1, i) ≤ Σ_{i=0}^{V_H − 1} C(n−1, i).

Step 2: bounding card (H\H′)(x_1, . . . , x_n)
If h, h′ ∈ H\H′ satisfy h(x_i) = h′(x_i) for i = 1, . . . , n − 1, then we cannot have h(x_n) ≠ h′(x_n)
(otherwise h or h′ takes the value 1 at x_n and thus belongs to H′). Hence we have

card (H\H′)(x_1, . . . , x_n) = card (H\H′)(x_1, . . . , x_{n−1}).

Also V_{H\H′} ≤ V_H because H\H′ ⊂ H. Hence, using (7) with k = n − 1, we have

card (H\H′)(x_1, . . . , x_{n−1}) ≤ Π_{H\H′}(n − 1) ≤ Σ_{i=0}^{V_{H\H′}} C(n−1, i) ≤ Σ_{i=0}^{V_H} C(n−1, i).

Combining the two steps, we obtain

card H(x_1, . . . , x_n) ≤ Σ_{i=0}^{V_H − 1} C(n−1, i) + Σ_{i=0}^{V_H} C(n−1, i)
  = Σ_{i=1}^{V_H} C(n−1, i−1) + Σ_{i=0}^{V_H} C(n−1, i)
  = 1 + Σ_{i=1}^{V_H} ( C(n−1, i−1) + C(n−1, i) )
  = 1 + Σ_{i=1}^{V_H} C(n, i)
  = Σ_{i=0}^{V_H} C(n, i),

using Pascal's rule C(n−1, i−1) + C(n−1, i) = C(n, i). We recall that card H(x_1, . . . , x_n) = Π_H(n) and
that we had started with any set A and any set of functions from A to {0, 1}. Hence (7) is shown by
induction. Hence we have

Π_F(n) ≤ Σ_{i=0}^{VCdim(F)} C(n, i),
which gives the first inequality of the lemma. For the second inequality, we have

Σ_{i=0}^{VCdim(F)} C(n, i) = Σ_{i=0}^{min(VCdim(F), n)} C(n, i)
  ≤ Σ_{i=0}^{min(VCdim(F), n)} n^i / i!
  ≤ Σ_{i=0}^{min(VCdim(F), n)} n^i C(VCdim(F), i)
  ≤ Σ_{i=0}^{VCdim(F)} n^i C(VCdim(F), i)
  = (n + 1)^{VCdim(F)},

using the Newton binomial formula at the end, which shows the second inequality of the lemma. □
Finally, using Proposition 15 and Lemma 20, we obtain, for a set of functions F from [0,1]^d to
{0, 1},

E[ sup_{f ∈ F} | P(f(X) ≠ Y) − (1/n) Σ_{i=1}^n 1{f(X_i) ≠ Y_i} | ] ≤ 2 √( 2 log(2 Π_F(n)) / n )
  ≤ 2 √( 2 log(2(n + 1)^{VCdim(F)}) / n )
  = 2 √( (2 log(2) + 2 VCdim(F) log(n + 1)) / n ).

When VCdim(F) < ∞ the bound goes to zero at rate almost 1/√n. If we use a set of functions
F_n that depends on n (more complex if there are more observations), then the rate of convergence is
almost √(VCdim(F_n)) / √n.
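This final bound is easy to evaluate numerically. A minimal sketch (not from the original notes), here with the VC-dimension of affine classifiers from Proposition 18 and arbitrary sample sizes:

    import numpy as np

    def vc_generalization_bound(n, vc_dim):
        """2 * sqrt((2*log(2) + 2*vc_dim*log(n+1)) / n), combining
        Proposition 15 and Lemma 20."""
        return 2.0 * np.sqrt((2.0 * np.log(2.0) + 2.0 * vc_dim * np.log(n + 1.0)) / n)

    # Affine classifiers in dimension d have VC-dimension d + 1 (Proposition 18)
    d = 10
    for n in (10**3, 10**4, 10**5, 10**6):
        print(n, vc_generalization_bound(n, d + 1))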

4 VC-dimension of neural networks with several hidden layers


The section is based on [BHLM19].

4.1 Neural networks as directed acyclic graphs


We will use graphs. A directed graph is of the form (V, E), where V stands for vertices and E stands
for edges. The set V is a finite set, for instance V = {v1 , . . . , vn } or V = {1, . . . , n}. The set E is a
subset of V × V that does not contain any element of the form (v, v), v ∈ V .
If (v1 , v2 ) ∈ E, we say that there is a path from v1 to v2 . We say that v1 is a predecessor of v2 and
that v2 is a successor of v1 . A simple example is given in Figure 6.

Figure 6: A directed graph defined by V = {1, 2, 3, 4} and E = {(1, 2), (1, 3), (3, 4)} with 4 vertices
and 3 edges.

We say that the directed graph (V, E) is acyclic if there does not exist any n ∈ N and v1 , . . . , vn ∈ V
such that

• vn = v1 ,

• (v1 , v2 ), . . . , (vn−1 , vn ) ∈ E.

A simple example is given in Figure 7.

Figure 7: The graph on the left is acyclic and the graph on the right is cyclic (not acyclic).

A directed graph which is acyclic is called a DAG (directed acyclic graph). We call path a vector
(v1 , . . . , vn ) with v1 , . . . , vn ∈ V and (v1 , v2 ), . . . , (vn−1 , vn ) ∈ E. For a DAG (V, E) and v ∈ V we
call indegree of v the quantity card{(v 0 , v), v 0 ∈ V, (v 0 , v) ∈ E}. We call outdegree of v the quantity
card{(v, v 0 ), v 0 ∈ V, (v, v 0 ) ∈ E}. A simple example is given in Figure 8.

Figure 8: The vertex 1 has indegree 0 and outdegree 2. The vertex 4 has indegree 2 and outdegree 0.

Definition 21 (general feed-forward neural network) A feed-forward neural network is defined


by the following.

• An activation function σ : R → R.

• A DAG G = (V, E) such that G has d ≥ 1 vertices with indegree 0 and 1 vertex with outdegree 0.
We write the d vertices with indegree 0 as

s_1^{(0)}, . . . , s_d^{(0)}.

• A vector of weights

( w_a; a ∈ V′ ∪ E ),

where V′ is the set of vertices with non-zero indegree (there is a weight per vertex [except the d
vertices with indegree 0] and a weight per edge).

We write L for the maximal length (number of edges) of a path of G. We have L ≤ card(V) − 1. We
define the layers 0, 1, . . . , L by induction as follows.

• The layer 0 is the set of vertices with indegree 0.

• For ℓ = 1, . . . , L,

layer ℓ = { vertices that have a predecessor in the layer ℓ − 1, possibly other predecessors in the
layers 0, 1, . . . , ℓ − 2, and no other predecessors }.

Proposition 22 In the context of Definition 21, we have the following.

• The layers 0, 1, . . . , L are non-empty.

• The layers 0, 1, . . . , L are disjoint sets.

• Any vertex belongs to a layer.

• The layer L is a singleton composed of the unique vertex of outdegree 0.

• The edges are only of the form (v, v′), with v ∈ layer i and v′ ∈ layer j with i < j.

Proof of Proposition 22
We call the elements of the layer 0 the roots. For a vertex v, we call inverse path from v to the roots a vector (v, v1, . . . , vk) with vk a root and (v1, v), (v2, v1), . . . , (vk, vk−1) ∈ E (hence (vk, vk−1, . . . , v1, v) is a path). The length of such an inverse path is k (there are k edges in the path). By convention, if v is a root, we say that v has an inverse path of length 0 to the roots.
Then let us show by induction that, for ℓ = 0, . . . , L,

layer ℓ = {vertices whose longest inverse path to the roots has length ℓ}.     (8)

The property is true for the layer 0 (with our convention).


If the property is true for the layer ℓ, then any vertex of the layer ℓ + 1 has an edge that comes from the layer ℓ, so it has an inverse path to the roots of length ℓ + 1. There is no longer inverse path, because the vertex has no predecessors outside of the layers 0, 1, . . . , ℓ.
Conversely, consider a vertex v whose longest inverse path to the roots has length ℓ + 1. The first predecessor of v in this path belongs to the layer ℓ, from (8) at step ℓ. The only predecessors of v are in the layers 0 to ℓ: if v had another predecessor, this predecessor would not belong to the layers 0 to ℓ, so, using (8), the longest inverse path from v to the roots would have length strictly larger than ℓ + 1, a contradiction. Hence, finally, v belongs to the layer ℓ + 1. Hence we have shown (8) by induction.
We remark that any vertex v has an inverse path to the roots. Indeed, let V̄ be the subset of V composed of the vertices which have an inverse path to the roots. If V̄ ≠ V, then the graph induced on V \ V̄ is a DAG with no vertex of indegree 0 (a vertex of V \ V̄ whose predecessors all belong to V̄ would itself have an inverse path to the roots), which is impossible: in a DAG, there always exists a vertex of indegree zero, otherwise we could construct an arbitrarily long inverse path and thus a cycle.
Hence, from (8), an element of the layer L has outdegree 0 (otherwise there would be a path of length L + 1). Since there is a unique vertex of outdegree 0, the layer L is empty or is a singleton. By construction of the layers, the edges only go from a layer i to a layer j with i < j. Hence, any path of length L goes through each of the layers 0, 1, . . . , L. Since such a path exists, the layers 0, 1, . . . , L are non-empty.
Hence we have proved everything: the layers are non-empty, disjoint, any vertex belongs to one of
the layers and the edges go from a layer to a layer of strictly larger index. 
An example is given in Figure 9.

Figure 9: An example of the DAG of a neural network. The layer 0 has 3 vertices, representing a
neural network classifier from [0, 1]3 to {0, 1}. These vertices have indegree 0. The layer 1 has 4
vertices (neurons). The layer 2 has 3 vertices (neurons). The layer L = 3 has one (final) vertex of
outdegree 0 (output of the neural network function). The layers 1 and 2 correspond to the hidden
layers.
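Computationally, (8) says that the layer of a vertex is the length of its longest inverse path to the roots. A minimal sketch (not from the notes; the small DAG with a skip connection is an arbitrary example):

```python
def layers_of_dag(vertices, edges):
    """Compute the layer of each vertex as the length of its longest inverse
    path to the roots (the vertices of indegree 0), following (8)."""
    preds = {v: [] for v in vertices}
    for (u, v) in edges:
        preds[v].append(u)

    layer = {}
    def longest(v):
        if v not in layer:
            layer[v] = 0 if not preds[v] else 1 + max(longest(u) for u in preds[v])
        return layer[v]

    for v in vertices:
        longest(v)
    return layer

# Two inputs, one hidden neuron, one output, plus a skip edge input -> output.
V = ["x1", "x2", "h", "o"]
E = [("x1", "h"), ("x2", "h"), ("h", "o"), ("x1", "o")]
print(layers_of_dag(V, E))  # {'x1': 0, 'x2': 0, 'h': 1, 'o': 2}
```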

Remark Compared to Section 1.3,


• we do not necessarily have all the possible edges between the layers i and i + 1,

• we allow for edges between the layers i and i + k with k ≥ 2.


Formal definition of a general feed-forward neural network function based on a DAG
Following Definition 21 and Proposition 22, it is a function characterized by
$$\left(w_a;\ a \in V' \cup E\right),$$
where V′ is the set of vertices of the layers 1 to L. The input space is [0, 1]^d. Consider an input x = (x1, . . . , xd) ∈ [0, 1]^d. We define by induction on the layers 0 to L the output associated to each neuron of the layer ℓ. For the layer 0, the output of the vertex $s_i^{(0)}$ is xi.
For the layer ℓ + 1, ℓ = 0, . . . , L − 2, the output of a vertex v is
$$\sigma\left(\sum_{i=1}^{m} w_i S_i + b\right),$$

where
• m is the indegree of v,
• v′1, . . . , v′m are the predecessors of v: (v′1, v), . . . , (v′m, v) ∈ E,
• (w1, . . . , wm) = (w_{(v′1,v)}, . . . , w_{(v′m,v)}) are the weights associated to the edges pointing to v,
• S1, . . . , Sm are the outputs of v′1, . . . , v′m, which are vertices of the layers 0, . . . , ℓ, so these outputs are indeed already defined,

• b = w_v is the weight associated to the vertex v.

For the layer L, with the same notations, the output is
$$1_{\sum_{i=1}^{m} w_i S_i + b \ge 0}.$$
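To make the induction concrete, here is a minimal Python sketch (not from the original notes) that evaluates such a network function; the graph representation (lists of layers and predecessors, weight dictionaries) and the ReLU activation in the example are illustrative assumptions.

```python
def forward(x, layers, predecessors, edge_w, vertex_w, sigma):
    """Evaluate a DAG feed-forward network.
    x: dict mapping each layer-0 vertex to its input coordinate.
    layers: list of lists of vertices, layers[0] being the layer-0 vertices.
    predecessors: dict vertex -> list of predecessor vertices.
    edge_w: dict (u, v) -> weight of the edge (u, v).
    vertex_w: dict v -> weight (bias) of the vertex v, for the layers 1 to L.
    sigma: activation, applied on all layers except the last one (threshold)."""
    out = dict(x)  # outputs of the layer-0 vertices are the input coordinates
    for depth, layer in enumerate(layers[1:], start=1):
        for v in layer:
            s = sum(edge_w[(u, v)] * out[u] for u in predecessors[v]) + vertex_w[v]
            out[v] = (1.0 if s >= 0 else 0.0) if depth == len(layers) - 1 else sigma(s)
    return out[layers[-1][0]]

# Tiny example: 2 inputs -> 1 hidden ReLU neuron -> thresholded output.
layers = [["x1", "x2"], ["h"], ["o"]]
predecessors = {"h": ["x1", "x2"], "o": ["h"]}
edge_w = {("x1", "h"): 1.0, ("x2", "h"): -1.0, ("h", "o"): 1.0}
vertex_w = {"h": 0.0, "o": -0.5}
print(forward({"x1": 0.9, "x2": 0.2}, layers, predecessors, edge_w, vertex_w,
              lambda t: max(0.0, t)))  # 1.0
```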

4.2 Bounding the VC-dimension


We will bound the VC-dimension of these neural network functions based on the following quantities.

• L: number of layers minus 1 (longest path).

• U : number of neurons. U = card(V′) where V′ is the set of vertices of the layers 1 to L.

• W : number of weights. W = U + card(E).

We assume that σ is piecewise polynomial: there exist p + 1 pieces I1, . . . , Ip+1 (p ≥ 1), where I1, . . . , Ip+1 are intervals of R, that is, of the form
(−∞, a), (−∞, a], (a, b), [a, b), (a, b], [a, b], (a, +∞), [a, +∞),
such that Ii ∩ Ij = ∅ for i ≠ j, with R = ∪_{i=1}^{p+1} Ii, and such that σ is polynomial on Ii for i = 1, . . . , p + 1, with a polynomial function of degree smaller than or equal to D ∈ N.
Examples

• Threshold function σ(x) = 1_{x≥0} with p = 1, I1 = (−∞, 0), I2 = [0, +∞) and D = 0. The polynomials are x ↦ 0 on I1 and x ↦ 1 on I2.

• ReLU function σ(x) = max(0, x) with p = 1, I1 = (−∞, 0), I2 = [0, +∞) and D = 1. The polynomials are x ↦ 0 on I1 and x ↦ x on I2.
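A piecewise-polynomial activation is fully described by its breakpoints t1 < · · · < tp and one polynomial per piece. The following sketch (not from the notes) encodes this representation, with pieces taken closed on the left as in the two examples above.

```python
import numpy as np

def make_piecewise(breakpoints, coefs):
    """Piecewise-polynomial function with pieces I_1 = (-inf, t_1), I_2 = [t_1, t_2), ...
    coefs[r] lists the polynomial coefficients (highest degree first) on the piece I_{r+1}."""
    def sigma(x):
        piece = np.searchsorted(breakpoints, x, side="right")  # index of the piece containing x
        return np.polyval(coefs[piece], x)
    return sigma

# ReLU: p = 1, breakpoint 0, pieces x -> 0 and x -> x, so D = 1.
relu = make_piecewise([0.0], [[0.0], [1.0, 0.0]])
print(relu(-2.0), relu(3.0))  # 0.0 3.0
```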

Theorem 23 ([BHLM19]) Let L ≥ 1, U ≥ 3, d ≥ 1, p ≥ 1 and W ≥ U ≥ L. We consider a DAG G = (V, E) whose longest path has length L, with d vertices with indegree 0 and one vertex with outdegree 0. We assume that card(V′) = U, where V′ is the set of vertices in the layers 1 to L. We assume that U + card(E) = W.
We consider a function σ : R → R, piecewise polynomial on p + 1 disjoint intervals, with degrees smaller than or equal to D.
We define the following, for i ∈ {1, . . . , L}.

• If D = 0, Wi is the number of parameters (weights and biases) useful to the computation of all
the neurons of the layer i. We have

Wi = number of edges pointing to the layer i + number of vertices in the layer i.

• If D ≥ 1, Wi is the number of parameters (weights and biases) useful to the computation of all
the neurons of the layers 1 to i. We have

Wi = number of edges pointing to a layer j, j ≤ i + number of vertices in the layers 1 to i .

We write
$$\bar{L} = \frac{1}{W}\sum_{i=1}^{L} W_i \in [1, L],$$

• this is equal to 1 if D = 0,

• this can be close to L if D ≥ 1 and if the neurons are concentrated on the first layers.

We define, for i = 1, . . . , L, ki as the number of vertices of the layer i (kL = 1). We write
$$R = \sum_{i=1}^{L} k_i\,\left(1 + (i-1)D^{i-1}\right) \;\Big(\le U L D^{L-1}\Big) \quad \text{if } D \ge 1,$$
and
$$R = U \quad \text{if } D = 0.$$
We define F as the set of all the feed-forward neural networks defined by G = (V, E), with one weight
per vertex of the layers 1 to L and one weight per edge (the structure of the network is fixed and the
weights are varying).
Then, for m ≥ W, with e = exp(1),
$$\Pi_F(m) \le \prod_{i=1}^{L} 2\left(\frac{2 e m k_i p\,(1 + (i-1)D^{i-1})}{W_i}\right)^{W_i} \qquad (9)$$
$$\le \left(4 e m p\,(1 + (L-1)D^{L-1})\right)^{\sum_{i=1}^{L} W_i}. \qquad (10)$$
Furthermore,
$$\mathrm{VCdim}(F) \le L + \bar{L}\,W \log_2\left(4 e p R \log_2(2 e p R)\right). \qquad (11)$$
In particular we have the following.
• If D = 0, VCdim(F) ≤ L + W log2 (4epU log2 (2epU )) has W as a dominating term (neglecting
logarithms). This is the number of parameters of the neural network functions.
• If D ≥ 1, VCdim(F) has L̄W as a dominating term (neglecting logarithms). This is more than
the number of parameters of the neural network functions. We can interpret this by the fact
that depth can increase L̄ (recall that L̄ ∈ [1, L]) and thus make the family of neural network
functions more complex.
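As an illustration (this numerical example is not in the original notes), the bound (11) can be evaluated for a fully-connected ReLU network, for which D = 1 and p = 1; the layer widths below are arbitrary, and the W_i are the cumulative parameter counts since D ≥ 1.

```python
import math

def vc_bound_relu(widths):
    """Evaluate the bound (11) for a fully-connected ReLU network (D = 1, p = 1)
    with layer sizes widths = [d, k_1, ..., k_L], where k_L = 1."""
    ks = widths[1:]
    L = len(ks)
    U = sum(ks)                                                  # number of neurons
    W = U + sum(a * b for a, b in zip(widths[:-1], widths[1:]))  # number of parameters
    Wi, cum = [], 0
    for i in range(L):                                           # parameters of layers 1..i+1 (cumulative)
        cum += widths[i] * ks[i] + ks[i]
        Wi.append(cum)
    Lbar = sum(Wi) / W
    D, p = 1, 1
    R = sum(k * (1 + i * D**i) for i, k in enumerate(ks))        # coefficient 1 + (i-1) D^(i-1), 1-based i
    return L + Lbar * W * math.log2(4 * math.e * p * R * math.log2(2 * math.e * p * R))

# Input dimension 3, hidden layers of 4 and 3 neurons, one output (U = 8, W = 35).
print(vc_bound_relu([3, 4, 3, 1]))
```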

4.3 Proof of the theorem


Let us prove Theorem 23. The proof relies on the following result from algebraic geometry.
Lemma 24 Let P1, . . . , Pm be polynomial functions of n ≤ m variables, of degree smaller than or equal to D ≥ 1. We write
$$K = \mathrm{card}\left\{(\mathrm{sign}(P_1(x)), \ldots, \mathrm{sign}(P_m(x)));\ x \in \mathbb{R}^n\right\},$$
with sign(t) = 1_{t≥0}. Note that K is the number of possible sign vectors. Then
$$K \le \left(\frac{2 e m D}{n}\right)^{n}.$$
The proof of Lemma 24 can be found in [AB09].
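The bound of Lemma 24 can be checked empirically on a small random example. The sketch below (not from the notes) counts the sign vectors reached by m = 5 random polynomials of degree D = 2 in n = 2 variables over a large sample of inputs; the observed count is always below (2emD/n)^n.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, D = 5, 2, 2
coefs = rng.standard_normal((m, 6))   # coefficients of 1, x, y, x^2, xy, y^2

# Sample many inputs and count the distinct sign vectors that are reached.
pts = rng.uniform(-10, 10, size=(100_000, 2))
x, y = pts[:, 0], pts[:, 1]
feats = np.stack([np.ones_like(x), x, y, x * x, x * y, y * y], axis=1)  # (N, 6)
signs = (feats @ coefs.T >= 0).astype(int)                              # (N, m)
n_patterns = len({tuple(row) for row in signs})

print(n_patterns, "sign vectors observed; Lemma 24 bound:", (2 * np.e * m * D / n) ** n)
```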
Let us write f(x, a) for the output of the network (without the indicator function at the end) for the input x ∈ [0, 1]^d and the vector of parameters a ∈ R^W. Let x1, . . . , xm ∈ [0, 1]^d. In order to bound Π_F(m), let us bound
$$\mathrm{card}\left\{(\mathrm{sign}(f(x_1, a)), \ldots, \mathrm{sign}(f(x_m, a)));\ a \in \mathbb{R}^W\right\} \le \sum_{i=1}^{N} \mathrm{card}\left\{(\mathrm{sign}(f(x_1, a)), \ldots, \mathrm{sign}(f(x_m, a)));\ a \in P_i\right\},$$
where P1, . . . , PN form a partition of R^W which will be chosen such that the m functions a ↦ f(xj, a), j = 1, . . . , m, are polynomial on each cell Pi. We can then apply Lemma 24.
The main difficulty is to construct a good partition. We will construct by induction partitions C0, . . . , C_{L−1}, where C_{L−1} will be the final partition P1, . . . , PN.
The partitions C0, . . . , C_{L−1} will be partitions of R^W such that, for i ∈ {0, . . . , L − 1}, Ci = {A1, . . . , Aq} with A1 ∪ · · · ∪ Aq = R^W and Ar ∩ Ar′ = ∅ for r ≠ r′. We will have the following.

(a) The partitions are nested: any C ∈ Ci is a union of one or several C′ ∈ Ci+1 (0 ≤ i ≤ L − 2).

(b) We have card(C0) = 1 (C0 = {R^W}) and, for i ∈ {1, . . . , L − 1},
$$\frac{\mathrm{card}(C_i)}{\mathrm{card}(C_{i-1})} \le 2\left(\frac{2 e m k_i p\,(1 + (i-1)D^{i-1})}{W_i}\right)^{W_i}.$$

(c) For i ∈ {0, . . . , L − 1}, for E ∈ Ci, for j ∈ {1, . . . , m}, the output of a neuron of the layer i (for the input xj) is a polynomial function of Wi variables of a ∈ E, with degree smaller than or equal to iD^i.

Induction
When i = 0, we have C0 = {R^W}. The output of a neuron of the layer 0 is constant with respect to a ∈ R^W and thus the property (c) holds.
Let 1 ≤ i ≤ L − 1. Assume that we have constructed nested partitions C0, . . . , C_{i−1} satisfying (b) and (c). Let us construct Ci.
We write P_{h,x_j,E}(a) for the input (just before σ) of the neuron h (h = 1, . . . , ki) of the layer i, for the input xj, as a function of a ∈ E with E ∈ C_{i−1}. From the induction hypothesis (c), since P_{h,x_j,E}(a) is of the form
$$\sum_{k} w_k \,(\text{output of neuron } k) + b$$
and since the partitions are nested, we have that P_{h,x_j,E}(a) is polynomial on E of degree smaller than or equal to 1 + (i − 1)D^{i−1} and depends on at most Wi variables (we can check that this holds also when D = 0).
Because of σ, the output of the neuron h is piecewise polynomial on E. We will divide E into subcells such that the output is polynomial on each of the subcells, for any neuron h and any input xj. Figure 10 illustrates the current state of the proof.

Figure 10: Illustration of the construction of the partitions.

We write t1 < t2 < · · · < tp the cuts of the pieces I1 , . . . , Ip+1 , as illustrated in Figure 11.

Figure 11: Illustration of the cuts of the intervals for σ.

We consider the polynomials
$$\pm\left(P_{h,x_j,E}(a) - t_r\right), \qquad h \in \{1, \ldots, k_i\},\ j \in \{1, \ldots, m\},\ r \in \{1, \ldots, p\},$$

where in the above display there is a + if I_{r+1} is closed at t_r and a − if I_{r+1} is open at t_r. With this,
$$1_{\pm\left(P_{h,x_j,E}(a) - t_r\right) \ge 0} = \mathrm{sign}\left(\pm\left(P_{h,x_j,E}(a) - t_r\right)\right)$$
is constant for P_{h,x_j,E}(a) ∈ I_{r+1}.


From Lemma 24, this set of polynomials on R^W reaches at most
$$\Pi = 2\left(\frac{2e\,(k_i m p)\,(1 + (i-1)D^{i-1})}{W_i}\right)^{W_i}$$
distinct vectors of signs $\left(\mathrm{sign}\left(\pm\left(P_{h,x_j,E}(a) - t_r\right)\right)\right)_{h,j,r}$ when a ∈ R^W and thus when a ∈ E. Indeed,

• k_i m p is the number of polynomials,

• 1 + (i − 1)D^{i−1} is the degree bound,

• W_i is the number of variables.

We can thus partition E into at most Π subcells such that, on each of these subcells, each P_{h,x_j,E}(a) stays in the same interval on which σ is polynomial as a varies in the subcell. We remark that these Π subcells of E are the same for all the neurons h and all the inputs xj (this is important for the sequel).
Hence we obtain a new partition Ci of cardinality at most Π card(C_{i−1}). This ensures that the property (b) is satisfied.
Let us now address the property (c). For all E′ ∈ Ci, the output of the neuron h ∈ {1, . . . , ki},
$$a \in E' \mapsto \sigma\left(P_{h,x_j,E}(a)\right),$$
is a polynomial function of Wi variables with degree smaller than or equal to
$$D\left(1 + (i-1)D^{i-1}\right) \le iD^i,$$
where the factor D comes from the application of the polynomial corresponding to σ. Hence the property (c) holds.
This completes the induction and we have the nested partitions C0 , . . . , CL−1 satisfying (b) and (c).
Use of the partition to conclude the proof
In particular, C_{L−1} is a partition of R^W such that the output of each neuron of the layers 0, . . . , L−1 is polynomial of degree smaller than or equal to (L − 1)D^{L−1} on each E ∈ C_{L−1} (since the partitions are nested) and for all inputs x1, . . . , xm.
Hence, for each cell E ∈ C_{L−1} and each input xj, the function
$$a \in E \mapsto f(x_j, a)$$
at the end of the network is polynomial with degree less than or equal to 1 + (L − 1)D^{L−1}, where the 1 comes from the final linear combination.
Hence, from Lemma 24,
$$\mathrm{card}\left\{(\mathrm{sign}(f(x_1, a)), \ldots, \mathrm{sign}(f(x_m, a)));\ a \in E\right\} \le 2\left(\frac{2em\,(1 + (L-1)D^{L-1})}{W_L}\right)^{W_L}$$
and thus
$$\mathrm{card}\left\{(\mathrm{sign}(f(x_1, a)), \ldots, \mathrm{sign}(f(x_m, a)));\ a \in \mathbb{R}^W\right\} \le \sum_{E \in C_{L-1}} \mathrm{card}\left\{(\mathrm{sign}(f(x_1, a)), \ldots, \mathrm{sign}(f(x_m, a)));\ a \in E\right\}$$
$$\le \mathrm{card}(C_{L-1})\, 2\left(\frac{2em\,(1 + (L-1)D^{L-1})}{W_L}\right)^{W_L}. \qquad (12)$$

Then, from the property (b),
$$\mathrm{card}(C_{L-1}) \le \prod_{i=1}^{L-1} 2\left(\frac{2emk_ip\,(1 + (i-1)D^{i-1})}{W_i}\right)^{W_i}$$
and thus, since (12) holds for any x1, . . . , xm ∈ [0, 1]^d,
$$\Pi_F(m) \le \prod_{i=1}^{L} 2\left(\frac{2emk_ip\,(1 + (i-1)D^{i-1})}{W_i}\right)^{W_i}$$
and thus (9) is proved.


For the sequel, we use the inequality between arithmetic and geometric means: for y1, . . . , yk > 0 and a1, . . . , ak ≥ 0 such that $\sum_{i=1}^{k} a_i > 0$,
$$\prod_{i=1}^{k} y_i^{a_i} \le \left(\frac{\sum_{i=1}^{k} a_i y_i}{\sum_{i=1}^{k} a_i}\right)^{\sum_{i=1}^{k} a_i}.$$

Then we have
\begin{align*}
\Pi_F(m) &\le 2^L \left(\frac{2emp\sum_{i=1}^{L} k_i\,(1 + (i-1)D^{i-1})}{\sum_{i=1}^{L} W_i}\right)^{\sum_{i=1}^{L} W_i} \\
\text{(by definition of } R\text{:)}\quad &= 2^L \left(\frac{2empR}{\sum_{i=1}^{L} W_i}\right)^{\sum_{i=1}^{L} W_i} \tag{13} \\
\text{(since } L \le \textstyle\sum_{i=1}^{L} W_i \text{ and } R \le (1 + (L-1)D^{L-1})\sum_{i=1}^{L} k_i\text{:)}\quad &\le \left(\frac{4emp\,(1 + (L-1)D^{L-1})\sum_{i=1}^{L} k_i}{\sum_{i=1}^{L} W_i}\right)^{\sum_{i=1}^{L} W_i} \\
\text{(since } \textstyle\sum_{i=1}^{L} k_i \le \sum_{i=1}^{L} W_i\text{:)}\quad &\le \left(4emp\,(1 + (L-1)D^{L-1})\right)^{\sum_{i=1}^{L} W_i}.
\end{align*}

Hence (10) is proved.


To prove the bound (11) on VCdim(F) we will combine (13) and the next lemma (that we do not
prove).

Lemma 25 Let r ≥ 16 and w ≥ t > 0. Then, for any m > t + w log_2(2r log_2(r)) =: x_0, we have
$$2^m > 2^t \left(\frac{mr}{w}\right)^{w}.$$

Hence, from (13) and by the definition of the VC-dimension, we can apply Lemma 25 with t = L, w = $\sum_{i=1}^{L} W_i$ and
$$r = 2epR \ge 2eU \ge 16.$$
Indeed, if m points (m ≥ W) are shattered by F, then 2^m = Π_F(m) ≤ 2^t (mr/w)^w by (13), so that Lemma 25 gives m ≤ t + w log_2(2r log_2(r)). Hence
$$\mathrm{VCdim}(F) \le L + \left(\sum_{i=1}^{L} W_i\right) \log_2\left(4epR \log_2(2epR)\right) = L + \bar{L}W \log_2\left(4epR \log_2(2epR)\right),$$
which proves (11).
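As a quick numerical sanity check of Lemma 25 (not in the original notes), one can verify the inequality just above the threshold x_0 for some arbitrary admissible values of r, w and t:

```python
import math

def lemma25_holds(m, r, w, t):
    """Check 2**m > 2**t * (m*r/w)**w, comparing logarithms to avoid overflow."""
    return m * math.log(2) > t * math.log(2) + w * math.log(m * r / w)

r, w, t = 16, 10, 5
x0 = t + w * math.log2(2 * r * math.log2(r))
for m in (math.floor(x0) + 1, math.floor(x0) + 10, math.floor(x0) + 100):
    print(m, lemma25_holds(m, r, w, t))  # True for every m > x0
```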

References
[AB09] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.

[BHLM19] Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight
VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal
of Machine Learning Research, 20(63):1–17, 2019.

[Cyb89] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

[Gir14] Christophe Giraud. Introduction to high-dimensional statistics, volume 138. CRC Press,
2014.

[Rud98] Walter Rudin. Real and Complex Analysis. McGraw-Hill, 1998.

