Neural Network Theory22
Neural Network Theory22
University of Vienna
March 2, 2020
Contents
1 Introduction 2
3 ReLU networks 14
3.1 Linear finite elements and ReLU networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Approximation of the square function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Approximation of smooth functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1
1 Introduction
In these notes, we study a mathematical structure called neural networks. These objects have recently received
much attention and have become a central concept in modern machine learning. Historically, however, they
were motivated by the functionality of the human brain. Indeed, the first neural network was devised by
McCulloch and Pitts [17] in an attempt to model a biological neuron.
A McCulloch and Pitts neuron is a function of the form
d
!
X
d
R 3 x 7→ 1R+ w i xi − θ ,
i=1
where d ∈ N, 1R+ : R → R, with 1R+ (x) = 0 for x < 0 and 1R+ (x) = 1 else, and wi , θ ∈ R for i = 1, . . . d. The
function 1R+ is a so-called activation function, θ is called a threshold, and wi are weights. The McCulloch and
Pitts neuron, receives d input signals. If their combined weighted strength exceeds θ, then the neuron fires,
i.e., returns 1. Otherwise the neuron remains inactive.
A network of neurons can be constructed by linking multiple neurons together in the sense that the output
of one neuron forms an input to another. A simple model for such a network is the multilayer perceptron∗ as
introduced by Rosenblatt [26].
Definition 1.1. Let d, L ∈ N, L ≥ 2 and % : R → R. Then a multilayer perceptron (MLP) with d-dimensional
input, L layers, and activation function % is a function F that can be written as
N0 = 8 N1 = 12 N2 = 12 N3 = 12 N4 = 8 N5 = 1
Figure 1.1: Illustration of a multi-layer perceptron with 5 layers. The red dots correspond to the neurons.
While the MLP or variations thereof, are probably the most widely used type of neural network in practice,
they are very different from their biological motivation. Connections only between layers and arbitrary
∗ We will later introduce a notion of neural networks, that differs slightly from that of a multilayer perceptron.
2
activation functions make for an efficient numerical scheme but are not a good representation of the biological
reality.
Nowadays, the field of neural network theory draws most of its motivation from the fact that deep neural
networks are applied in a technique called deep learning [11]. In deep learning, one is concerned with the
algorithmic identification of the most suitable deep neural network for a specific application. It is, therefore,
reasonable to search for purely mathematical arguments why and under which conditions a MLP is an
adequate architecture in practice instead of taking the motivation from the fact that biological neural networks
perform well.
In this note, will study deep neural networks with a very narrow focus. We will exclude all algorithmic
aspects of deep learning and concentrate fully on a functional analytical and well-founded framework.
One the one hand, following this focussed approach, it must be clear that we will not be able to provide a
comprehensive answer to why deep learning methods perform particularly. On the other hand, we will see
that this focus allows us to make rigorous statements which do provide explanations and intuition as to why
certain neural network architectures are preferable over others.
Concretely, we will identify many mathematical properties of sets of MLPs which explain, to some
extent, practically observed phenomena in machine learning. For example, we will see explanations of why
deep neural networks are, in some sense, superior to shallow neural networks or why the neural network
architecture can efficiently reproduce high dimensional functions when most classical approximation schemes
cannot.
2.1 Universality
One of the most famous results in neural network theory is that, under minor conditions on the activation
function, the set of networks is very expressive, meaning that every continuous function on a compact set can
be arbitrarily well approximated by a MLP. This theorem was first shown by Hornik [13] and Cybenko [7].
To talk about approximation, we first need to define a topology on a space of functions of interest. We
define, for K ⊂ Rd
C(K) := {f : K → R : f continuous}
and we equip C(K) with the uniform norm
kf k∞ := sup |f (x)|.
x∈K
If K is a compact space, then the representation theorem of Riesz [28, Theorem 6.19] tells us that the
topological dual space of C(K) is the space
3
Having fixed the topology on C(K), we can define the concept of universality next.
Definition 2.2. Let % : R → R be continuous, d, L ∈ N and K ⊂ Rd be compact. Denote by MLP(%, d, L) the set of
all MLPs with d-dimensional input, L layers, NL = 1, and activation function %.
We say that MLP(%, d, L) is universal, if MLP(%, d, L) is dense in C(K).
Example 2.1 demonstrates that MLP(%, d, L) is not universal for every activation function.
Definition 2.3. Let d ∈ N, K ⊂ Rd , compact. A continuous function f : R → R is called discriminatory if the only
measure µ ∈ M such that Z
f (ax − b)dµ(x) = 0, for all a ∈ Rd , b ∈ R
K
is µ = 0.
Proof. We start by observing that MLP(%, d, 2) is a linear subspace of C(K). Assume towards a contradiction,
that MLP(%, d, 2) is not dense in C(K). Then there exists h ∈ C(K) \ MLP(%, d, 2).
By the theorem of Hahn-Banach [28, Theorem 5.19] there exists a functional
0 6= H ∈ C(K)0
we have that H(%a,b ) = 0 for all a ∈ Rd , b ∈ R. Finally, by the identification C(K)0 = M there exists a
non-zero measure µ so that Z
%a,b dµ = 0, for all a ∈ Rd , b ∈ R.
K
As f is bounded and K compact, we conclude by the dominated convergence theorem that, for every
µ ∈ M, Z Z Z
f (λ(a · −b) + θ)dµ → 1dµ + f (θ)dµ,
K Ha,b,> Ha,b,=
where
Ha,b,> := {x ∈ K : ax − b > 0} and Ha,b,= := {x ∈ K : ax − b = 0}.
4
Figure 2.1: A sigmoidal function according to Definition 2.5.
for every step function g. By a density argument and the dominated convergence theorem, we have that (2.1)
holds for every bounded continuous function g. Thus (2.1) holds, in particular, for g = sin and g = cos. We
conclude that
Z Z
0= cos(ax) + i sin(ax)dµ(x) = eiax dµ(x).
K K
This implies that the Fourier transform of the measure µ vanishes. This can only happen if µ = 0, [27, p.
176].
Remark 2.7. Universality results can be achieved under significantly weaker assumptions than sigmoidality. For
example, in [15] it is shown that Example 2.1 already contains all continuous activation functions that do not generate
universal sets of MLPs.
5
Before we can even begin to analyse this question, we need to introduce a precise notion of the size of a
PL
MLP. One option could certainly be to count the number of neurons, i.e., `=1 N` in (1.1) of Definition 1.1.
However, since a MLP was defined as a function, it is by no means clear if there is a unique representation
with a unique number of neurons. Hence, the notion of ”number of neurons” of a MLP requires some
clarification.
Definition 2.8. Let d, L ∈ N. A neural network (NN) with input dimension d and L layers is a sequence of
matrix-vector tuples
Φ = (A1 , b1 ), (A2 , b2 ), . . . , (AL , bL ) ,
where N0 := d and N1 , . . . , NL ∈ N, and where A` ∈ RN` ×N`−1 and b` ∈ RN` for ` = 1, ..., L.
For a NN Φ and an activation function % : R → R, we define the associated realisation of the NN Φ as
x0 := x,
x` := % (A` x`−1 + b` ) for ` = 1, . . . , L − 1, (2.2)
xL := AL xL−1 + bL .
x 7→ f2 (f1 (x))
is the realisation of a NN and how many weights, neurons, and layers does this new function need to have?
0 0 00
Given two functions f1 : Rd → Rd and f2 : Rd → Rd , where d, d0 , d00 ∈ N, we denote by f1 ◦ f2 the
composition of these functions, i.e., f1 ◦ f2 (x) = f1 (f2 (x)) for x ∈ Rd . Indeed, a similar concept is possible
for NNs.
Definition 2.9. Let L1 , L2 ∈ N and let Φ1 = ((A11 , b11 ), . . . , (A1L1 , b1L1 )), Φ2 = ((A21 , b21 ), . . . , (A2L2 , b2L2 )) be two
NNs such that the input layer of Φ1 has the same dimension as the output layer of Φ2 . Then Φ1 Φ2 denotes the
following L1 + L2 − 1 layer network:
Φ1 Φ2 := A21 , b21 , . . . , A2L2 −1 , b2L2 −1 , A11 A2L2 , A11 b2L2 + b11 , A12 , b12 , . . . , A1L1 , b1L1 .
R Φ1 Φ2 = R Φ1 ◦ R Φ 2 .
6
Figure 2.2: Top: Two networks. Bottom: Concatenation of both networks according to Definition 2.9.
Definition 2.10. Let L, d1 , d2 ∈ N and let Φ1 = ((A11 , b11 ), . . . , (A1L , b1L )), Φ2 = ((A21 , b21 ), . . . , (A2L , b2L )) be two
NNs with L layers and with d1 -dimensional and d2 -dimensional input, respectively. We define
1. P Φ1 , Φ2 := A b1 , bb1 , Ã2 , b̃2 , . . . , ÃL , b̃L , if d1 = d2 ,
2. FP Φ1 , Φ2 := Ã1 , b̃1 , . . . , ÃL , b̃L , for arbitrary d1 , d2 ∈ N,
where
A11 b11 A1` b1`
0
A
b1 := , bb1 := , and Ã` := , b̃` := for 1 ≤ ` ≤ L.
A21 b21 0 A2` b2`
P(Φ1 , Φ2 ) is a NN with d-dimensional input and L layers, called the parallelisation with shared inputs of Φ1 and
Φ2 . FP(Φ1 , Φ2 ) is a NN with d1 + d2 -dimensional input and L layers, called the parallelisation without shared
inputs of Φ1 and Φ2 .
Figure 2.3: Top: Two networks. Bottom: Parallelisation with shared inputs of both networks according to
Definition 2.10.
R% (P(Φ1 , Φ2 ))(x) = (R% (Φ1 )(x), R% (Φ2 )(x)), for all x ∈ Rd . (2.3)
7
We depict the parallelisation of two networks in Figure 2.3. Using the concatenation, we can, for example,
increase the depth of networks without significantly changing their output if we can build a network that
realises the identity function. We demonstrate how to approximate the identity function below. This is our
first quantitative approximation result.
Proposition 2.11. Let d ∈ N, K ⊂ Rd compact, and % : R → R be differentiable and not constant on an open set.
Then, for every > 0, there exists a NN Φ = ((A1 , b1 ), (A2 , b2 )) such that A1 , A2 ∈ Rd×d , b1 , b2 ∈ Rd , M (Φ) ≤ 4d,
and
|R(Φ)(x) − x| < ,
for all x ∈ K.
Proof. Assume d = 1, the general case of d ∈ N then follows immediately by parallelisation without shared
inputs.
Let x∗ ∈ R be such that % is differentiable on a neighbourhood of x∗ and %0 (x∗ ) = θ 6= 0. Define, for λ > 0
By the definition of the derivative, we have that |R(Φ)(x) − x| → 0 for λ → ∞ and all x ∈ K.
Remark 2.12. It follows from Proposition 2.11 that under the assumptions of Theorem 2.4 and Proposition 2.11 we
have that MLP(%, d, L) is universal for every L ∈ N, L ≥ 2.
The operations above can be performed for quite general activation functions. If a special activation is
chosen, then different operations are possible. In Section 3, we will, for example, introduce an exact emulation
of the identity function by realisations of networks with the so-called ReLU activation function.
Here, σ(AN , C) denotes the worst-case error when approximating every element of C by the closest element
in AN . Quite often, it is not so simple to precisely compute σ(AN , C) but instead we can only establish an
asymptotic approximation rate. If h : N → R+ is such that
8
Definition 2.13. A typical example of nested spaces of which we want to understand the approximation capabilities are
spaces of sparse representations in a basis or more generally in a dictionary. Let D := (fi )∞ ‡
i=1 ⊂ H be a dictionary .
We define the spaces
(∞ )
X
AN := ci fi : kck0 ≤ N . (2.7)
i=1
We can introduce a simple procedure to lift approximation theoretical results for N -term approximation
to approximation theoretical results of NNs.
Proof. We aim to show that there exists C > 0 such that every element in AN can be approximated by a NN
with CN weights to arbitrary precision.
PN
Let a ∈ AN , then a = j=1 ci(j) fi(j) . Let > 0 then, by (2.8), we have that there exist NNs (Φj )N
j=1 such
that
L (Φj ) = L, M (Φj ) ≤ C,
R (Φj ) − fi(j)
H ≤ / (N kck∞ ) . (2.9)
We define, Φc := (([ci(1) , ci(2) , . . . , ci(N ) ], 0)) and Φa, := Φc P(Φ1 , Φ2 , · · · , ΦN ). Now it is clear, by the
triangle inequality, that
X N
X N
a,
kR (Φ ) − ak =
ci(j) fi(j) − R (Φj )
≤ |ci(j) |
fi(j) − R (Φj )
≤ .
j=1
j=1
Per Definition 2.9, L(Φc P(Φ1 , Φ2 , · · · , ΦN )) = L(P(Φ1 , Φ2 , · · · , ΦN )) = L and it is not hard to see that
Remark 2.15. In words, Theorem 2.14 states that we can transfer a classical N -term approximation result to approxi-
mation by realisations of NNs if we can approximate every element from the underlying dictionary arbitrarily well by
NNs. It turns out that, under the right assumptions on the activation function, Condition (2.8) is quite often satisfied.
We will see one instance of such a result in the following subsection and another one in Proposition 3.3 below.
‡ We assume here and in the sequel that a dictionary contains only countably many elements. This assumption is not necessary, but
9
2.5 Approximation of smooth functions
We shall proceed by demonstrating that (2.9) holds for the dictionary of multivariate B-splines. This idea,
was probably first applied by Mhaskar in [18].
Towards our first concrete approximation result, we therefore start by reviewing some approximation
properties of B-splines: The univariate cardinal B-spline on [0, k] of order k ∈ N is given by
k
1 X
` k
Nk (x) := (−1) (x − `)k−1
+ , for x ∈ R, (2.10)
(k − 1)! `
`=0
B k := N`,t : ` ∈ N, t` ∈ 2−` Zd .
d
` ,k
(2.11)
Best N -term approximation by multivariate B-splines is a well studied field. For example, we have the
following result by Oswald.
Theorem 2.16 ([21, Theorem 7]). Let d, k ∈ N, p ∈ (0, ∞], 0 < s ≤ k. Then there exists C > 0 such that, for every
f ∈ C s ([0, 1]d ), we have that, for every δ > 0, and every N ∈ N there exists ci ∈ R with |ci | ≤ Ckf k∞ and Bi ∈ B k
for i = 1, . . . , N such that
N
X δ−s
f − ci Bi
. N d kf kC s .
p
i=1 L
In particular, for C := {f ∈ C ([0, 1] ) : kf kC s ≤ 1}, we have that B k achieves a rate of best N -term approximation
s d
To obtain an approximation result by NN via Theorem 2.14, we now only need to check under which
conditions every element of the B-spline dictionary can be represented arbitrarily well by a NN. In this regard,
we first fix a class of activation functions.
%(x) %(x)
→ 0, for x → −∞, → 1, for x → ∞, and
xq xq
|%(x)| . (1 + |x|)q , for all x ∈ R.
Standard examples of sigmoidal functions of order k ∈ N are the functions x 7→ max{0, x}q . We have the
following proposition.
Proposition 2.18. Let k, d ∈ N, K > 0, and % : R → R be sigmoidal of order q ≥ 2. There exists a constant C > 0
such that for every f ∈ B k and every > 0 there is a NN Φ with dlog2 (d)e + dmax{logq (k − 1), 0}e + 1 layers and
C weights, such that
kf − R% (Φ )kL∞ ([−K,K]d ) ≤ .
10
d
Proof. We demonstrate how to approximate a cardinal B-spline of order k, i.e., N0,0,k , by a NN Φ with
d
activation function %. The general case, i.e., N`,t,k , follows by observing that shifting and rescaling of the
realisation of Φ can be done by manipulating the entries of A1 and b1 associated to the first layer of Φ. Towards
this goal, we first approximate a univariate B-spline. We observe with (2.10) that we first need to build
a network that approximates the function x 7→ (x)k−1 + . The rest follows by taking sums and shifting the
function.
It is not hard to see (but probably a good exercise to formally show) that, for every K 0 > 0,
qT
−qT
% ◦ % ◦ · · · ◦ %(ax) −x+ → 0 for a → ∞ uniformly for all x ∈ [−K 0 , K 0 ].
a
| {z }
T − times
Choosing T := dmax{logq (k − 1), 0}e we have that q T ≥ k − 1. We conclude that, for every K 0 > 0 and > 0
there exists a NN Φ∗ with dmax{logq (k − 1), 0}e + 1 layers such that
R (Φ∗ ) (x) − xp+ ≤ ,
(2.12)
2x1 x2 = (x1 + x2 )2 − x21 − x22 = (x1 + x2 )2+ + (−x1 − x2 )2+ − (x1 )2+ − (−x1 )2+ − (x2 )2+ − (−x2 )2+ . (2.14)
Hence, we can conclude that, for every K 0 > 0, we can find a fixed size NN Φmult with input dimension 2
which, for every > 0, approximates the map (x1 , x2 ) 7→ x1 x2 arbitrarily well for (x1 , x2 ) ∈ [−K 0 , K 0 ]2 .
We assume for simplicity, that log2 (d) ∈ N. Then we define
Φmult,d,d/2
:= FP(Φmult , . . . , Φmult ).
| {z }
d/2−times
Now, we set
Φmult,d,1
:= Φmult
Φmult,4,2 . . . Φmult,d,d/2 . (2.15)
We depict the hierarchical construction of (2.15) in Figure 2.4. Per construction, we have that Φmult,d,1
has
log2 (d) + 1 layers and, for every 0 > 0 and K 0 > 0, there exists > 0 such that
(x1 , . . . xd ) − x1 x2 · · · xd ≤ 0 .
mult,d,1
Φ
11
x1 x2 x3 x4 x5 x6 x7 x8
x1 x2 x3 x4 x5 x6 x7 x8
x1 x2 x3 x4 x5 x6 x7 x8
x1 x2 x3 x4 x5 x6 x5 x6
Figure 2.4: Setup of the multiplication network (2.15). Every red dot symbolises a multiplication network
Φmult
and not a regular neuron.
Finally, we set
Φ := Φmult,d,1
FP(Φ∨ , . . . , Φ∨ ).
| {z }
d−times
Per definition of , we have that Φ has dmax{logq (k − 1), 0}e + log2 (d) + 1 many layers. Moreover, the size of
all components of Φ was independent of . By choosing sufficiently small it is clear by construction that Φ
d
approximates N0,0,k arbitrarily well on [−K, K]d for sufficiently small .
As a simple consequence of Theorem 2.14 and Proposition 2.18 we obtain the following corollary.
Corollary 2.19. Let d ∈ N, s > δ > 0 and p ∈ (0, ∞]. Moreover let % : R → R be sigmoidal of order q ≥ 2. Then
there exists a constant C > 0 such that, for every f ∈ C s ([0, 1]d ) with kf kC s ≤ 1 and every 1/2 > > 0, there exists
a NN Φ such that
kf − R(Φ)kLp ≤
d
and M (Φ) ≤ C− s−δ and L(Φ) = dlog2 (d)e + dmax{logq (dse − 1), 0}e + 1.
Remark 2.20. Corollary 2.19 constitutes the first quantitative approximation result of these notes for a large class
of functions. There are a couple of particularly interesting features of this result. First of all, we observe that with
increasing smoothness of the functions, we need smaller networks to achieve a certain accuracy. On the other hand,
at least in the framework of this theorem, we require more layers if the smoothness s is much higher than the order of
sigmoidality of %.
Finally, the order of approximation deteriorates very quickly with increasing dimension d. Such a behaviour is often
called curse of dimension. We will later analyse to what extent NN approximation can overcome this curse.
12
Proposition 2.21. There exists a continuous, piecewise polynomial activation function % : R → R such that for every
function f ∈ C([0, 1]) and every > 0 there is a NN Φf, with M (Φ) ≤ 3, and L(Φ) = 2 such that
f − R Φf,
≤ .
∞
(2.16)
Proof. We denote by ΠQ , the set of univariate polynomials with rational coefficients. It is well-known that
this set is countable and dense in C(K) for every compact set K. Hence, we have that {π|[0,1] : π ∈ ΠQ } is a
countable set and dense in C([0, 1]). We set (πi )i∈Z := {π|[0,1] : π ∈ ΠQ } and define
πi (x − 2i), if x ∈ [2i, 2i + 1],
%(x) :=
πi (1)(2i + 2 − x) + πi+1 (0)(x − 2i − 1), if x ∈ (2i + 1, 2i + 2).
Theorem 2.23 ([14]). For every d ∈ N, there are 2d2 + d univariate, continuous, and increasing functions φp,q ,
p = 1, . . . , d, q = 1, . . . , 2d + 1 such that for every f ∈ C([0, 1]d ) we have that, for all x ∈ [0, 1]d ,
2d+1 d
!
X X
f (x) = gq φp,q (xp ) , (2.18)
q=1 p=1
We can combine Kolmogorov’s superposition theorem and Proposition 2.21 to obtain the following
approximation theorem for realisations of networks with the special activation function from Proposition
2.21.
Theorem 2.24. Let d ∈ N. Then there exists a constant C(d) > 0 and a continuous activation function %, such that
for every function f ∈ C([0, 1]d ) and every > 0 there is a NN Φf,,d with M (Φ) ≤ C(d), and L(Φ) = 3 such that
f − R Φf,,d
≤ .
∞
(2.19)
Proof. Let f ∈ C([0, 1]d ). Let 0 > 0 and let Φ e 1,d := (([1, . . . , 1], 0)) be a network with d dimensional input
1,2d+1 :=
and Φe (([1, . . . , 1], 0)) be a network with 2d + 1 dimensional input. Let gq , φp,q for p = 1, . . . , d,
q = 1, . . . , 2d + 1 be as in (2.18).
We have that there exists C ∈ R such that
Φq,0 := Φ
e 1,d FP Φφ1,q ,0 , Φφ2,q ,0 , . . . , Φφd,q ,0 .
13
It is clear that, for x = (x1 , . . . , xd ) ∈ [0, 1]d ,
d
X
R (Φq,0 ) (x) − φp,q (xp ) ≤ d0 (2.20)
p=1
We have by Proposition 2.21 that, for fixed 1 , the map R (Φgq ,1 ) is uniformly continuous on [−C − 1, C + 1]
for all q = 1, . . . , 2d + 1 and 0 ≤ 1.
Hence, we have that, for each ˜ > 0, there exists δ˜ > 0 such that
for all x, y ∈ [−C − 1, C + 1] so that |x − y| ≤ δ˜ in particular this statement holds for ˜ = 1 .
It follows from the triangle inequality, (2.20), and Proposition 2.21 that
2d+1
d
!
X
X
R Φf , − f
≤ gq ,1 q,0
R (Φ ) (R (Φ )) − g φ
0 1 ∞
q p,q
q=1 p=1 ∞
2d+1
d
!
X
X
gq ,1 q,0 gq ,1
≤
R (Φ ) (R (Φ )) − R (Φ ) φp,q
q=1 p=1 ∞
d
! d
!
X X
+
R (Φgq ,1 ) φp,q − gq φp,q
p=1 p=1 ∞
2d+1
X
=: I0 ,1 + II0 ,1 .
p=1
Choosing d0 < δ1 , we have that I0 ,1 ≤ 1 . Moreover, II ≤ 1
by construction
.
Hence, for every 1/2 > > 0, there exists 0 , 1 such that
R Φf0 − f
∞ ≤ (2d + 1)1 ≤ . We define
Φf,,d := Φf0 ,1 which concludes the proof.
Without knowing the details of the proof of Theorem 2.24 the statement that any function can be arbitrarily
well approximated by a fixed-size network is hardly believable. It seems as if the reason for this result to
hold is that we have put an immense amount of information into the activation function. At the very least,
we have now established that at least from a certain minimal size on, there is no aspect of the architecture of a
NN that fundamentally limits its approximation power. We will later develop fundamental lower bounds on
approximation capabilities. As a consequence of the theorem above, these lower bounds can only be given
for specific activation functions or under further restricting assumptions.
3 ReLU networks
We have already seen a variety of activation functions including sigmoidal and higher-order sigmoidal
functions. In practice, a much simpler function is usually used. This function is called rectified linear unit
14
(ReLU). It is defined by
x for x ≥ 0
x 7→ %R (x) := (x)+ = max{0, x} = (3.1)
0 else.
There are various reasons why this activation function is immensely popular. Most of these reasons are based
on its practicality in the algorithms used to train NNs which we do not want to analyse in this note. One
thing that we can observe, though, is that the evaluation of %R (x) can be done much more quickly than that
of virtually any non-constant function. Indeed, only a single decision has to be made, whereas, for other
activation functions such as, e.g., arctan, the evaluation requires many numerical operations. This function is
probably the simplest function that does not belong to the class described in Example 2.1.
One of the first questions that we can ask ourselves is whether the ReLU is discriminatory. We observe
the following. For a ∈ R, b1 < b2 and every x ∈ R, we have that
Ha (x) := %R (ax − ab1 + 1) − %R (ax − ab1 ) − %R (ax − ab2 ) + %R (ax − ab2 − 1) → 1[b1 ,b2 ] for a → ∞.
Indeed, for x < b1 − 1/a, we have that Ha (x) = 0. If b1 − 1/a < x < b1 , then Ha (x) = a(x − b1 + 1/a) ≤ 1.
If b1 < x < b2 , then Ha (x) = %R (ax − ab1 + 1) − %R (ax − ab1 ) = 1. If b2 ≤ x < b2 + 1/a, then Ha (x) =
1 − %R (ax − ab2 ) = 1 − ax − ab2 ≤ 1. Finally, if x ≥ b2 + 1/a then Ha (x) = 0. We depict Ha in Figure 3.1.
1 1
b1 − a b1 b2 b2 + a
Figure 3.1: Pointwise approximation of a univariate indicator function by sums of ReLU activation functions.
The argument above shows that sums of ReLUs can pointwise approximate arbitrary indicator function.
If we had that Z
%R (ax + b)dµ(x) = 0,
K
d
for all a ∈ R and b1 < b2 . At this point we have the same result as in (2.1). Following the rest of the proof of
Proposition 2.6 yields that %R is discriminatory.
We saw in Proposition 2.18 how higher-order sigmoidal functions can reapproximate B-splines of arbitrary
order. The idea there was that, essentially, through powers of xq+ , we can generate arbitrarily high degrees of
polynomials. This approach does not work anymore if q = 1. Moreover, the crucial multiplication operation
of Equation (2.14) cannot be performed so easily with realisations of networks with the ReLU activation
function.
If we want to use the local approximation by polynomials in a similar way as in Corollary 2.19, we have
two options: being content with approximation by piecewise linear functions, i.e., polynomials of degree one,
or trying to reproduce higher-order monomials by realisations of NNs with the ReLU activation function in a
different way than by simple composition.
Let us start with the first approach, which was established in [12].
15
3.1 Linear finite elements and ReLU networks
We start by recalling some basics on linear finite elements. Below, we will perform a lot of basic operations
on sets and therefore it is reasonable to recall and fix some set-theoretical notation first. For a subset A of a
topological space, we denote by co(A) the convex hull of A, i.e., the smallest convex set containing A. By A we
denote the closure of A, i.e., the smallest closed set containing A. Furthermore, int A denotes the interior of A,
which is the largest open subset of A. Finally, the boundary of A is denoted by ∂A and ∂A := A \ int A.
Let d ∈ N, Ω ⊂ Rd . A set T ⊂ P(Ω) so that
[
T = Ω,
T = (τi )M §
i=1 , where each τi is a d-simplex , and such that τi ∩ τj ⊂ ∂τi ∩ ∂τj is an n-simplex with n < d for
T
every i 6= j is called a simplicial mesh of Ω. We call the τi the cells of the mesh T and the extremal points of the
τi , i = 1 . . . , MT , the nodes of the mesh. We denote the set of nodes by (ηi )M
i=1 .
N
Figure 3.2: A two dimensional simplicial mesh of [0, 1]2 . The nodes are depicted by red x’s.
Since an affine linear function is uniquely defined through its values on d + 1 linearly independent points, it
is clear that every f ∈ VT is uniquely defined through the values (f (ηi ))M i=1 . By the same token, for every
N
MN
choice of (yi )i=1 , there exists a function f in VT such that f (ηi ) = yi for all i = 1, . . . , MN .
For i = 1, . . . , MN we define the Courant elements φi,T ∈ VT to be those functions that satisfy φi,T (ηj ) = δi,j .
See Figure 3.3 for an illustration.
Proposition 3.1. Let d ∈ N and T be a simplicial mesh of Ω ⊂ Rd , then we have that
MN
X
f= f (ηi )φi,T
i=1
16
Figure 3.3: Visualisation of a Courant element on a mesh.
As a consequence of Proposition 3.1, we have that we can build every function f ∈ VT as the realisation
of a NN with ReLU activation function if we can build φi,T for every i = 1, . . . , MN .
We start by making a couple of convenient definitions and then find an alternative representation of φi,T .
We define, for i, j = 1, . . . MN ,
[
F (i) := {j ∈ {1, . . . , MT } : ηi ∈ τj } , G(i) := τj , (3.2)
j∈F (i)
Here F (i) is the set of all indices of cells that contain ηi . Moreover, G(i) is the polyhedron created from
taking the union of all these cells.
Proposition 3.2. Let d ∈ N and T be a locally convex simplicial mesh of Ω ⊂ Rd . Then, for every i = 1, . . . , MN , we
have that
φi,T = max 0, min gj , (3.4)
j∈F (i)
where gj is the unique affine linear function such that gj (ηk ) = 0 for all ηk ∈ H(j, i) and gj (ηi ) = 1.
Proof. Let i ∈ {1, . . . , MN }. By the local convexity assumption we have that G(i) is convex. For simplicity, we
assume that ηi ∈ int G(i).¶
Step 1: We show that
[
∂G(i) = co(H(j, i)). (3.5)
j∈F (i)
The argument below is visualised in Figure 3.4. We have by convexity that G(i) = co(I(i)). Since ηi lies in
the interior of G(i) we have that there exists > 0 such that B (ηi ) ⊂ G(i). By convexity, we have that also the
open set co(int τk , B (ηi )) is a subset of G(i). It is not hard to see that τk \ co(H(k, i)) ⊂ co(int τk , B (ηi )) and
¶ The case ηi ∈ ∂G(i) needs to be treated slightly differently and is left as an excercise.
17
x
ηi
Figure 3.4: Visualisation of the argument in Step 1. The simplex τk is coloured green. The grey ball around
ηi is B (ηi ). The blue × represents x.
S
hence τk \ co(H(k, i)) lies in the interior of G(i). Since we also have that ∂G(i) ⊂ k∈F (i) ∂τk , we conclude
that [
∂G(i) ⊂ co(H(j, i)).
i∈F (i)
Now assume that there is j such that co(H(j, i)) 6⊂ ∂G(i). Since co(H(j, i)) ⊂ G(i) this would imply that
there exist x ∈ co(H(j, i)) such that x is in the interior of G(i). This implies that there exists 0 > 0 such that
B0 (x) ⊂ G(i). Hence, the line from ηi to x can be extended for a distance of 0 /2 to a point x∗ ∈ G(i) \ τj . As
x∗ must belong to a simplex τj ∗ that also contains ηi , we conclude that τj ∗ intersects the interior of τj which
cannot be by assumption on the mesh.
Step 2:
For each j, denote by H(j, i) the hyperplane through H(j, i). The hyperplane H(j, i) splits Rd into two
subsets, and we denote by H int (j, i) the set that contains ηi .
We claim that
\
G(i) = H int (j, i). (3.6)
j∈F (i)
for at least one j. The line between x000 and ηi intersects G(i) and, by Step 1, it intersects co(H(j, i)) for a
j ∈ F (i). It is also clear that x000 is not on the same side as ηi . Hence x000 6∈ H int (j, i).
Step 3: For each ηj ∈ I(i), we have that gk (ηj ) ≥ 0 for all k ∈ F (i).
18
ηi
G(i)
H(j1 , i) H(j1 , i)
Figure 3.5: The set G(i) and two hyperplanes H(j1 , i), H(j2 , i). Since G(i) is convex and H(j, i) extends its
boundary it is intuitively clear that G(i) is only on one side of H(j, i) and that (3.6) holds.
This is because, by (3.6) G(i) lies fully on one side of each hyperplane H(j, i), j ∈ F (i). Since gk vanishes
on H(k, i) and equals 1 on ηi we conclude that gk (ηj ) ≥ 0 for all k ∈ F (i)
Step 4: For every k ∈ F (i) we have that gk ≤ gj on τk for all j ∈ F (i)
If for j ∈ F (i), gj (η` ) ≥ gk (η` ) for all η` ∈ τk , then, since τk = co({η` : η` ∈ τk }), we conclude that gj ≥ gk .
Assume towards a contradiction that gj (η` ) < gk (η` ) for at least one η` ∈ I(i). Clearly this assumption cannot
hold for η` = ηi since there gj (ηi ) = 1 = gk (ηi ). If η` 6= ηi , then gk (η` ) = 0 implying gj (η` ) < 0. Together
with Step 3 this yields a contradiction.
Step 5: For each z 6∈ G(i), we have that there exists at least one k ∈ F (i) such that gk (z) ≤ 0.
This follows as in Step 3. Indeed, if z 6∈ G(i) then, by (3.6) we have that there is a hyperplane H(k, i) so
that z does not lie on the same side as ηi . Hence gk (z) ≤ 0.
Combining Steps 1-5 yields the claim (3.4).
Now that we have a formula for the functions φi,T , we proceed by building these functions as realisations
of NNs.
Proposition 3.3. Let d ∈ N and T be a locally convex simplicial mesh of Ω ⊂ Rd . Let kT denote the maximum
number of neighbouring cells of the mesh, i.e.,
kT := max # {j : ηi ∈ τj } . (3.7)
i=1,...,MN
19
Proof. We now construct the network the realisation of which equals (3.4). The claim (3.8) then follows with
Proposition 3.2.
We start by observing that, for a, b ∈ R,
a + b |a − b| 1
min{a, b} = − = (%R (a + b) − %R (−a − b) − %R (a − b) − %R (b − a)) .
2 2 2
Thus, defining Φmin,2 := ((A1 , 0), (A2 , 0)) with
1 1
−1 −1 1
A1 := 1 −1
, A2 := [1, −1, −1, −1],
2
−1 1
yields R(Φ)(a, b) = min{a, b}, L(Φ) = 2 and M (Φ) = 12. Following an idea that we saw earlier for the
construction of the multiplication network in (2.15), we construct for p ∈ N even, the networks
and for p = 2q
Φmin,p = Φ
e min,2 Φ e min,p .
e min,4 · · · Φ
It is clear that the realisation of Φmin,p is the minimum operator with p inputs. If p is not a power of two then
a small adaptation of the procedure above is necessary. We will omit this discussion here.
We see that L(Φmin,p ) = dlog2 (p)e + 1. To estimate the weights, we first observe that the number of
neurons in the first layer of Φ e min,p is bounded by 2p. It follows that each layer of Φmin,p has less than 2p
neurons. Since all affine maps in this construction are linear, we have that
Clearly, Φaff has one layer, d dimensional input, and #F (i) many output neurons.
We define, for p := #F (i),
Φi,T := ((1, 0), (1, 0)) Φmin,p Φaff .
Per construction and (3.4), we have that R(Φi,T ) = φi,T . Moreover, L(Φi,T )) = L(Φmin,p ) + 1 = dlog2 (p)e + 2.
Also, by construction, the number of neurons in each layer of Φi,T is bounded by 2p. Since, by (3.9), we have
that
Φi,T = ((A1 , b1 ), (A2 , 0), . . . , (AL , 0)),
with A` ∈ RN` ×N`−1 and b1 ∈ Rp , we conclude that
L
X L
X
M (Φi,T ) ≤ p + kA` k0 ≤ p + N`−1 N` ≤ p + 2dp + (2p)2 (dlog2 (p)e).
`=1 `=1
20
Theorem 3.4. Let T be a locally convex partition of Ω ⊂ Rd , d ∈ N. Let T have MN and let kT be defined as in (3.7).
Then, for every f ∈ VT , there exists a NN Φf such that
L Φf ≤ dlog2 (kT )e + 2,
Remark 3.5. One way to read Theorem 3.4 is the following: Whatever one can approximate by piecewise affine linear,
continuous functions with N degrees of freedom can be approximated to the same accuracy by realisations of NNs with
C · N degrees of freedom, for a constant C. If we consider approximation rates, then this implies that realisations of
NNs achieve the same approximation rates as linear finite element spaces.
For example, for Ω := [0, 1]d , one has that there exists a sequence of locally convex simplicial meshes (Tn )∞
n=1 with
MT (Tn ) . n such that
2
inf kf − gkL2 (Ω) . n− d kf kW 2,2d/(d+2) (Ω) ,
g∈VTn
2,2d/(d+2)
for all f ∈ W (Ω), see, e.g., [12].
where
2 0
A1 := 2 , b1 := −1 , A2 := [1, −2, 1].
2 −2
Then
R (Φ∧ ) (x) = %R (2x) − 2%R (2x − 1) + %R (2x − 2)
and L(Φ∧ ) = 2, M (Φ∧ ) = 8, N0 = 1, N1 = 3, N3 = 1 . It is clear that R(Φ∧ ) is a hat function. We depict it in
Figure 3.6.
A quite interesting thing happens if we compose R(Φ∧ ) with itself. We have that
∧
R(Φ
| · · Φ∧}) = R(Φ∧ ) ◦ · · · ◦ R(Φ∧ ))
·{z
| {z }
n−times n−times
21
is a saw-tooth function with 2n−1 hats of width 21−n each. This is depicted in Figure 3.6. Compositions are
∧
notoriously hard to picture, hence it is helpful to establish the precise form of R(Φ
| · · Φ∧}) formally. We
·{z
n−times
analyse this in the following proposition.
Proposition 3.6. For n ∈ N, we have that
∧
Fn = R(Φ
| · · Φ∧})
·{z
n−times
Proof. The proof follows by induction. We have that, for x ∈ [0, 1/2],
where Fn satisfies (3.10). Let x ∈ [0, 1/2] and x ∈ [i2−n−1 , (i + 1)2−n−1 ], i even. Then R(Φ∧ )(x) = 2x ∈
[i2−n , (i + 1)2−n ], i even. Hence, by (3.11), we have
If x ∈ [1/2, 1] and x ∈ [i2−n−1 , (i + 1)2−n−1 ], i even, then R(Φ∧ )(x) = 2 − 2x ∈ [2 − (i + 1)2−n , 2 − i2−n ] =
[(2n+1 − i − 1)2−n , (2n+1 − i)2−n ] = [j2−n , (j + 1)2−n ] for j := (2n+1 − i − 1) odd. We have, by (3.11),
The cases for i odd follow similarly. If x 6∈ (0, 1), then R(Φ∧ )(x) = 0 and per (3.11) we have that Fn+1 (x) = 0.
∧
It is clear by Definition 3.12 that L(Φ| · · Φ∧}) = n + 1. To show that M (Φ
·{z |
∧
· · Φ∧}) ≤ 12n − 2, we
·{z
n−times n−times
observe with
Φ∧ · · · Φ∧ =: ((A1 , b1 ), . . . , (AL , bL )))
Pn+1
that M (Φ∧ · · · Φ∧ ) ≤ `=1 N`−1 N` + N` ≤ (n − 1)(32 + 3) + N0 N1 + N1 + Nn Nn+1 + Nn+1 = 12(n −
1) + 3 + 3 + 3 + 1 = 12n − 2, where we use that N` = 3 for all 1 ≤ ` ≤ n and N0 = NL = 1.
22
1 1
F1(x1)
F1(x2)
F1(x3)
1
x1 x2 x3 1
x2
x3
1
1
Figure 3.6: Top Left: Visualisation of R(Φ∧ ) = F1 . Bottom Right: Visualisation of R(Φ∧ ) ◦ R(Φ∧ ) = F2 ,
Bottom Left: Fn for n = 4.
Remark 3.7. Proposition 3.6 already shows something remarkable. Consider a two layer network Φ with input
dimension 1 and N neurons. Then its realisation with ReLU activation function is given by
N
X
R(Φ) = cj %R (ai x + bj ) − d,
j=1
for cj , aj , bj , d ∈ R. It is clear that R(Φ) is piecewise affine linear with at most M (Φ) pieces. We see, that with this
construction, the resulting networks have not more than M (Φ) pieces. However, the function Fn from Proposition 3.6
M (Φ)+2
has at least 2 12 linear pieces.
The function Fn is therefore a function that can be very efficiently represented by deep networks, but not very
efficiently by shallow networks. This was first observed in [35].
The surprisingly high number of linear pieces of Fn is not the only remarkable thing about the construction
of Proposition 3.6. Yarotsky [38] made the following insightful observation:
Proposition 3.8 ([38]). For every x ∈ [0, 1] and N ∈ N, we have that
N
X Fn (x)
≤ 2−2N −2 .
2
x − x + (3.12)
22n
n=1
23
Proof. We claim that
N
X Fn
HN := x − (3.13)
n=1
22n
is a piecewise linear function with breakpoints k2−N where k = 0, . . . , 2N , and HN (k2−N ) = k 2 2−2N . We
prove this by induction. The result clearly holds for N = 0. Assume that the claim holds for N ∈ N. Then we
see that
FN +1
HN − HN +1 = 2N +2 .
2
Since, by Proposition 3.6, FN +1 is piecewise linear with breakpoints k2−N −1 where k = 0, . . . , 2N +1 and HN
is piecewise linear with breakpoints `2−N −1 where ` = 0, . . . , 2N +1 even, we conclude that HN +1 is piecewise
linear with breakpoints k2−N −1 where k = 0, . . . , 2N +1 . Moreover, by Proposition 3.6, FN +1 vanishes for all
k2−N −1 , where k is even. Hence, by the induction hypothesis HN +1 (k2−N −1 ) = (k2−N −1 )2 for all k even.
To complete the proof, we need to show that
FN +1
(k2−N −1 ) = HN (k2−N −1 ) − (k2−N −1 )2 ,
22N +2
for all k odd. Since HN is linear on [(k − 1)2−N −1 ), (k + 1)2−N −1 )], we have that
1
HN (k2−N −1 ) − (k2−N −1 )2 = ((k − 1)2−N −1 )2 + ((k + 1)2−N −1 )2 − (k2−N −1 )2
(3.14)
2
1
= 2−2N −2 ((k − 1))2 + (k + 1)2 − k 2
2
= 2−2(N +1) = 2−2(N +1) FN +1 (k2−N −1 ),
where the last step follows by Proposition 3.6. This shows that HN +1 (k2−N −1 ) = (k2−N −1 )2 for all k =
0, . . . , 2N +1 and completes the induction.
Finally, let x ∈ [k2−N , (k + 1)2−N ], k = 0, . . . , 2N − 1, then
(k + 1)2 − k 2 2−2N
−N 2
2 2
|HN (x) − x | = HN − x = (k2 ) + (x − k2−N ) − x2 , (3.15)
2−N
where the first step is because x 7→ x2 is convex and therefore its graph lies below that of the linear interpolant
and the second step follows by representing HN locally as the linear map that intersects x 7→ x2 at k2−N and
(k + 1)2−N .
Since (3.15) describes a continuous function on [k2−N , (k + 1)2−N ] vanishing at the boundary, it assumes
its maximum at the critical point
1 (k + 1)2 − k 2 2−2N
∗ 1
x := −N
= (2k + 1)2−N = (2k + 1)2−N −1 = `2−N −1 ,
2 2 2
for ` ∈ {1, . . . 2N +1 } odd. We have already computed in (3.14) that
24
1
3/4
1/2 x 7→ x2
H0
1/4 H1
H2
ReLU specific network operations We saw in Proposition 2.11 that we can approximate the identity func-
tion by realisations of NNs for many activation functions. For the ReLU, we can even go one step further and
rebuild the identity function exactly.
Lemma 3.9. Let d ∈ N, and define
ΦId := ((A1 , b1 ) , (A2 , b2 ))
with
IdRd
A1 := , b1 := 0, A2 := IdRd −IdRd , b2 := 0.
−IdRd
Then R(ΦId ) = IdRd .
Proof. Clearly, for x ∈ Rd , R(ΦId )(x) = %R (x) − %R (−x) = x.
Remark 3.10. Lemma 3.9 can be generalised to yield emulations of the identity function with arbitrary numbers of
layers. For each d ∈ N, and each L ∈ N≥2 , we define
Id d
ΦId
d,L :=
R
, 0 , (IdR2d , 0), . . . , (IdR2d , 0), ([IdRd | −IdRd ] , 0) .
−IdRd | {z }
L−2 times
For L = 1, one can achieve the same bounds, simply by setting ΦId
d,1 := ((IdRd , 0)).
Our first application of the NN of Lemma 3.9 is for a redefinition of the concatentation. Before that, we
first convince ourselves that the current notion of concatenation is not adequate if we want to control the
number of parameters of the concatenated NN.
Example 3.11. Let N ∈ N and Φ = ((A1 , 0), (A2 , 0)) with A1 = [1, . . . , 1]T ∈ RN ×1 , A2 = [1, . . . , 1] ∈ R1×N .
Per definition we have that M (Φ) = 2N .
Moreover, we have that
Φ Φ = ((A1 , 0), (A1 A2 , 0), (A2 , 0)).
It holds that A1 A2 ∈ RN ×N and every entry of A1 A2 equals 1. Hence M (Φ Φ) = N + N 2 + N .
25
Example shows that the number of weights of networks behaves quite undesirably under concatenation.
Indeed, we would expect that it should be possible to construct a concatenation of networks that imple-
ments the composition of the respective realisations and the number of parameters scales linearly instead of
quadratically in the number of parameters of the individual networks.
Fortunately, Lemma 3.9 enables precisely such a construction, see also Figure 3.8 for an illustration.
Definition 3.12. Let L1 , L2 ∈ N, and let Φ1 = ((A11 , b11 ), . . . , (A1L1 , b1L1 )) and Φ2 = ((A21 , b21 ), . . . , (A2L2 , b2L2 )) be
two NNs such that the input layer of Φ1 has the same dimension d as the output layer of Φ2 . Let ΦId be as in Lemma
3.9.
Then the sparse concatenation of Φ1 and Φ2 is defined as
Φ1 Φ2 := Φ1 ΦId Φ2 .
Remark 3.13. It is easy to see that
2 ! !
A2L2
bL2
Φ1 Φ2 = (A21 , b21 ), . . . , (A2L2 −1 , b2L2 −1 ), , A1 −A11 , b11 , (A12 , b12 ), . . . , (A1L1 , b1L1 )
1
,
−A2L2 −b2L2
has L1 + L2 layers and that R(Φ1 Φ2 ) = R(Φ1 ) ◦ R(Φ2 ) and M (Φ1 Φ2 ) ≤ 2M (Φ1 ) + 2M (Φ2 ).
Approximation of the square: We shall now build a NN that approximates the square function on [0, 1].
Of course this is based on the estimate (3.12).
Proposition 3.14 ([38, Proposition 2]). Let 1/2 > > 0. There exists a NN Φsq, such that, for → 0,
L(Φsq, ) = O(log2 (1/)) (3.16)
M (Φ ) = sq,
O(log22 (1/)) (3.17)
R(Φsq, )(x) − x2 ≤ ,
(3.18)
for all x ∈ [0, 1]. In addition, we have that R(Φsq, )(0) = 0.
Proof. By (3.12), we have that, for N := d− log2 ()/2e, it holds that, for all x ∈ [0, 1],
N
2 X Fn (x)
x − x + ≤ . (3.19)
22n n=1
We define, for n = 1, . . . , N ,
∧
Φn := ΦId
1,N −n (Φ
| · · Φ∧}).
·{z (3.20)
n−times
∧
Then we have that L(Φn ) = N − n + L(Φ
| · · Φ∧}) = N + 1 by Proposition 3.6. Moreover, by Remark 3.13,
·{z
n−times
∧
M (Φn ) ≤ 2M (ΦId
1,N −n ) + 2M (Φ
| · · Φ∧}) ≤ 4(N − n) + 2(12n − 2) ≤ 24N,
·{z (3.21)
n−times
where the penultimate inequality follows from Remark 3.10 and Proposition 3.6.
Next, we set
Φsq, := 1, −1/4, . . . , −2−2N , 0 P ΦId
d,N +1 , Φ1 , . . . , ΦN .
26
Figure 3.8: Top: Two neural networks, Middle: Sparse concatenation of the two networks as in Definition
3.12, Bottom: Regular concatenation as in Definition 2.9.
and, by (3.19), we conclude (3.18), for all x ∈ [0, 1], and that R(Φ)(0) = 0. Moreover, we have by Remark 3.13
that
L (Φsq, ) = L 1, −1/4, . . . , −2−2N , 0 + L P ΦId
d,N +1 , Φ1 , . . . , ΦN = N + 2 = d− log2 ()/2e + 2.
This shows (3.16). Finally, by Remark 3.13
M (Φsq, ) ≤ 2M 1, −1/4, . . . , −2−2N , 0 + 2M P ΦId
d,N +1 , Φ1 , . . . , ΦN
N
!
Id
X
= 2(N + 1) + 2 M Φd,N +1 + M (Φn )
n=1
N
X
= 2(N + 1) + 4(N + 1) + 2 M (Φn )
n=1
N
X
≤ 6(N + 1) + 2 24N = 6(N + 1) + 48N 2 ,
n=1
27
and hence
M (Φsq, ) = O log22 (1/) , for → 0,
• Extraction of binary representation is efficient: We have, by Proposition 3.6, that F` vanishes on all i2−` for
i = 0, . . . , 2` even and equals 1 on all i2−` for i = 0, . . . , 2` odd. Therefore
N
!
X
−`
FN 2 x` = x` .
`=1
By a short computation this yields that for all x ∈ [0, 1] that FN (x − 2−N −1 ) > 1/2, if xN = 1 and FN (x −
2−N −1 ) ≤ 1/2, if xN = 0. Hence, by building an approximate Heaviside function 1x≥0.5 with ReLU realisations
of networks, it is clear that one can approximate the map x 7→ x` .
Building N of the binary multiplications therefore requires N bit extractors and N multipliers by 0/1. Hence, this
requires of the order of N neurons, to achieve an error of 2−N .
Approximation of multiplication: Based on the idea, that we have already seen in the proof of Propo-
sition 2.18, in particular, Equation (2.14), we show how an approximation of a square function yields an
approximation of a multiplication operator.
Proposition 3.16. Let p ∈ N, K ∈ N, ∈ (0, 1/2). There exists a NN Φmult,p, such that for → 0
for all x = (x1 , x2 , . . . , xp ) ∈ [−K, K]p . Moreover, R(Φmult,p, )(x) = 0 if x` = 0 for at least one ` ≤ p. Here the
implicit constant depends on p only.
28
Proof. The crucial observation is that, by the parallelogram identity, we have that for x, y ∈ [−K, K]
2 2 !
K2
x+y x−y
x·y = · −
4 K K
2 2 !
K2
%R (x + y) %R (−x − y) %R (x − y) %R (−x + y)
= + − + .
4 K K K K
We set
1 1 2
K2
, 0 , 1 · 1 K
−1 −1 1 0 0
Φ1 := ,0 , and Φ 2 := , − ,0 .
1 −1 K 0 0 1 1 2 2
−1 1
Now we define 2 2
Φmult,2, := Φ2 FP Φsq,/K , Φsq,/K Φ1 .
2
Moreover, the size of Φmult,2, is up to a constant that of Φsq,/K . Thus (3.23)-(3.24) follow from Proposition
3.14. The construction for p > 2 follows by the now well-known stategy of building a binary tree of basic
multiplication networks as in Figure 2.4.
A direct corollary of Proposition 3.16 is the following Corollary that we state without proof.
Corollary 3.17. Let p ∈ N, K ∈ N, ∈ (0, 1/2). There exists a NN Φpow,p, such that, for → 0,
for all x ∈ [−K, K]. Moreover, R(Φpow,p, )(x) = 0 if x = 0. Here the implicit constant depends on p only.
Approximation of B-splines: Now that we can build a NN the realisation of which is a multiplication of
p ∈ N scalars, it is not hard to see with (2.10) that we can rebuild cardinal B-splines by ReLU networks.
Proposition 3.18. Let d, k, ` ∈ N, k ≥ 2, t ∈ Rd , 1/2 > > 0. There exists a NN Φd`,t,k such that for → 0
for all x ∈ Rd .
Proof. Clearly, it is sufficient to show the result for ` = 0 and t = 0. We have by (2.10) that
k
1 X k
Nk (x) = (−1)` (x − `)k−1
+ , for x ∈ R, (3.28)
(k − 1)! `
`=0
29
It is well known, see [31], that supp Nk = [0, k] and kNk k∞ ≤ 1. Let δ > 0, then we set
1 k k k
Φk,δ := ,− , . . . , (−1)k ,0 FP Φpow,k−1,δ , . . . , Φpow,k−1,δ
(k − 1)! 0 1 k | {z }
k+1−times
where
A1 = [1, 1, . . . , 1]T , b1 = −[0, 1, . . . , k]T ,
and IdRk+1 is the identity matrix on Rk+1 . Here K := k + 1 in the definition of Φpow,k−1,δ via Corollary 3.17.
It is now clear, that we can find δ > 0 so that
for x ∈ [−k − 1, k + 1]. With sufficient care, we see that, we can choose δ = Ω(), for → 0. Hence, we can
conclude from Definition 3.12 that Lδ := L(Φk,δ ) = O(L(Φmult,k+1,δ )) = O(log2 (1/)), and M (Φk,δ ) =
O(Φmult,k+1,δ ) ∈ O(log22 (1/)), for → 0 which yields (3.25) and (3.26). At this point, R(Φk,δ ) only accurately
approximates Nk on [−k − 1, k + 1]. To make this approximation global, we multiply R(Φk,δ ) with an
appropriate indicator function.
Let
T T
Φcut := [1, 1, 1, 1] , [1, 0, −k, −k − 1] , ([1, −1, −1, 1] , 0) .
Then R(Φcut ) is a piecewise linear spline with breakpoints −1, 0, k, k + 1. Moreover, R(Φcut ) is equal to 1 on
[0, k], vanishes on [−1, k + 1]c , and is non-negative and bounded by 1. We define
Φe k,δ := Φmult,2,/(4d2d−1 ) P Φk,δ , ΦId δ Φcut .
1,L −2
Since the realisation of the multiplication is 0 as soon as one of the inputs is zero by Proposition 3.16, we
conclude that
R Φ
e k,δ (x) − Nk (x) ≤ /(2d2d−1 ), (3.30)
Now we define
We have that
Yd Yd Yd
d
N0,0,k (x) − R Φd0,0,k, (x) ≤ d
Nk (x j ) − R Φ
e k,δ (x )
j
+
R Φ 0,0,k, (x) − R Φ
e k,δ (x j .
)
j=1 j=1 j=1
30
for all x ∈ Rd . It is clear, by repeated applications of the triangle inequality that for aj ∈ [0, 1], bj ∈ [−1, 1],
for j = 1, . . . , d,
d d
Y Y d−1
max |bj | ≤ d2d−1 max |bj |.
aj − (aj + b )
j
≤ d · 1 + max |b j |
j=1 j=1,...,d j=1,...,d j=1,...,d
j=1
Hence,
d d
Y Y
Nk (x j ) − R Φ
e k,δ (x j ≤ /2.
)
j=1 j=1
This yields (3.27). The statement on the size of Φd0,0,k, follows from Remark 3.13.
Approximation of smooth functions: Having established how to approximate arbitrary B-splines with
Proposition 3.18, we obtain that we can also approximate all functions that can be written as weighted sums of
B-splines with bounded coefficients. Indeed, we can conclude with Theorem 2.16 and with similar arguments
as in Theorem 2.14 the following result. Our overall argument to arrive here followed the strategy of [34].
Theorem 3.19. Let d ∈ N, s > δ > 0 and p ∈ (0, ∞]. Then there exists a constant C > 0 such that, for every
f ∈ C s ([0, 1]d ) with kf kC s ≤ 1 and every 1/2 > > 0, there exists a NN Φ such that
Proof. Let f ∈ C s ([0, 1]d ) with kf kC s ≤ 1 and let s > δ > 0. By Theorem 2.16 there exist a constant C > 0
and, for every N ∈ N, ci ∈ R with |ci | ≤ C and Bi ∈ B k for i = 1, . . . , N and k := dse, such that
N
X δ−s
f − ci Bi
≤ CN d .
i=1 p
δ−s
By Proposition 3.18, each of the Bi can be approximated up to an error of N d /(CN ) with a NN Φi of
δ−s δ−s
depth O(log2 (N d /(CN ))) = O(log2 (N )) and number of weights O(log22 (N d /(CN ))) = O(log22 (N )) for
N → ∞.
We define
ΦN
f := ([c1 , . . . , cN ], 0) P (Φ1 , . . . , ΦN ) .
31
for a constant C 0 > 0. It holds that log22 () = O(−σ ) for every σ > 0. Hence, for every δ 0 > δ with s > δ 0 , we
have 0
−d/(s−δ) log22 () = O(−d/(s−δ ) ), for → 0.
0
As a consequence we have that M (ΦN
f ) = O(
−d/(s−δ )
) for → 0. Since δ was arbitrary, this yields (3.32).
Remark 3.20. • It was shown in [38] that Theorem 3.19 holds with δ = 0 but with the bound M (Φ) ≤
C−d/s log2 (1/). Moreover, it holds for f ∈ C s ([−K, K]d ) for K > 0, but the constant C will then de-
pend on K.
• If L ≥ 3, then there exists a NN Φ with L layers, such that supp R(Φ) = Bk.k1 (0)k ,
• If L ≤ 2, then, for every NN Φ with L layers, such that supp R(Φ) is compact, we have that R(Φ) ≡ 0.
Proof. It is clear that, for every x ∈ Rd , we have that
d
X
(%R (x` ) + %R (−x` )) = kxk1 .
`=1
Moreover, the function %R (1 − kxk1 ) is clearly supported on Bk.k1 (0). Moreover, we have that %R (1 − kxk1 )
can be written as the realisation of a NN with at least 3 layers.
Next we address the second part of the theorem. If L = 1, then the set of realisations of NNs contains
only affine linear functions. It is clear that the only affine linear function that vanishes on a set of non-empty
interior is 0. For L = 2, all realisations of NNs have the form
N
X
x 7→ ci %R (hai , xi + bi ) + d, (4.1)
i=1
32
We next show that every function of the form (4.1) with compact support vanishes everywhere. For an
index i, we have that %R (hai , xi + bi ) is not continously differentiable at the hyperplane given by
bai
Si := − + z : z ⊥ ai .
kai k2
Let f be a function of the form (4.1). We define i ∼ j, if Si = Sj . Then we have that, for J ∈ {1, . . . , N }/ ∼
that a⊥ ⊥
i = aj for all i, j ∈ J. Hence, X
cj %R (haj , xi + bk ),
j∈J
is constant perpendicular to aj for every j ∈ J. And since the sum is piecewise affine linear, we have that it is
either affine linear or not continuously differentiable at every element of Sj . We can write
X X
f (x) = cj %R (haj , xi + bj ) + d.
J∈{1,...,N }/∼ j∈J
If i 6∼ j, then Si and Sj Pintersect in hyperplanes of dimension d − 2. Hence, it is clear that, if for at least
one J ∈ {1, . . . , N }/ ∼, j∈J cj %R (haj , xi + bj ) is not linear, then f is not continuously differentiable almost
everywhere in Sj for j ∈ J. Since Sj is unbounded, this contradicts P the compact support assumption on f .
On the other hand, if, for all J ∈ {1, . . . , N }/ ∼, we have that j∈J cj %R (haj , xi + bj ) is affine linear, then f
is affine linear. By previous observations we have that this necessitates f ≡ 0 to allow compact support of
f.
Remark 4.2. Proposition 4.1, deals with representability only. However, a similar result is true in the framework of
approximation theory. Concretely, two layer networks are inefficient at approximating certain compactly supported
functions, that three layer networks can approximate very well, see e.g. [9].
Theorem 4.3. Let L ∈ N. Let % be piecewise affine linear with p pieces. Then, for every NN Φ with d = 1, NL = 1
and N1 , . . . , NL−1 ≤ N , we have that R(Φ) has at most (pN )L−1 affine linear pieces.
Proof. The proof is given via induction over L. For L = 2, we have that
N1
X
R(Φ) = ck %(hak , xi + bi ) + d,
k=1
where ck , ak , bi , d ∈ R. It is not hard to see that if f1 , f2 are piecewise affine linear with n1 , n2 pieces each,
then f1 + f2 is piecewise affine linear with at most n1 + n2 pieces. Hence, R(Φ) has at most N p many affine
linear pieces.
Assume the statement to be proven for L ∈ N. Let ΦL+1 be a NN with L + 1 layers. We set
It is clear, that
R(ΦL+1 )(x) = AL+1 [%(h1 (x)), . . . , %(hNL (x))]T + bL+1 ,
where for ` = 1, . . . , NL each h` is the realisation of a NN with input and output dimension 1, L layers, and
less than N neurons in each layer.
33
For a piecewise affine linear function f with p̃ pieces, we have that % f has at most p · p̃ pieces. This is
because, for each of the p̃ affine linear pieces of f —let us call one of those pieces A ⊂ R—we have that f is
either constant or injective on A and hence % f has at most p linear pieces on A.
By this observation and the induction hypothesis, we conclude that % h1 has at most p(pN )L−1 affine
linear pieces. Hence,
NL
X
R(ΦL+1 )(x) = (AL+1 )k %(hk (x)) + bL+1
k=1
L−1 L
has at most N p(pN ) = (pN ) many affine linear pieces. This completes the proof.
For functions with input dimension more than 1 we have the following corollary.
Corollary 4.4. Let L, d ∈ N. Let % be piecewise affine linear with p pieces. Then, for every NN Φ with NL = 1 and
N1 , . . . , NL−1 ≤ N , we have that R(Φ) has at most (pN )L−1 affine linear pieces along every line.
Proof. Every line in Rd can be parametrized by R 3 t 7→ x0 + tv for x0 , v ∈ Rd . For Φ as in the statement of
corollary, we have that
R(Φ)(x0 + tv) = R(Φ Φ0 )(t),
where Φ0 = ((v, x0 )), which gives the result via Theorem 4.3.
Theorem 4.6. Let d, L, N ∈ N, and f ∈ C 2 ([0, 1]d ), where f is not affine linear. Let % : R → R be piecewise affine
linear with p pieces. Then for every NN with L layers and fewer than N neurons in each layer, we have that
Proof. Let f ∈ C 2 ([0, 1]d ) and non-linear. Then it is clear that there exists a point x0 and a vector v so that
t 7→ f (x0 + tv) is non-linear in t = 0.
We have that, for every NN Φ with d-dimensional input, one-dimensional output, L layers, and fewer
than N neurons in each layer that
34
5 High dimensional approximation
At this point we have seen two things on an abstract level. Deep NNs can approximate functions as well as
basically every classical approximation scheme. Shallow NNs do not perform as well as deep NNs in many
problems. From these observations we conclude that deep networks are preferable over shallow networks,
but we do not see why we should not use a classical tool, such as B-splines in applications instead. What is it
that makes deep NNs better than classical tools?
One of the advantages will become clear in this section. As it turns out, deep NNs are quite efficient in
approximating high dimensional functions.
If one defines e(n, d) as the smallest number such that there exists an algorithm reconstructing every f ∈ Fd
up to an error of e(n, d) from n point evaluations of f , then
e(n, d) = 1
for all n ≤ 2bd/2c − 1, see [20]. As a result, in any statement of the form
35
Definition 5.1. Let d, k, N ∈ N and let G(d, k, N ) be the set of directed acyclic graphs with N vertices, where the
indegree of every vertex is at most k and the outdegree of all but one vertex is at least 1 and the indegree of exactly d
vertices is 0.
For G ∈ G(d, k, N ), let (ηi )N
i=1 be a topological ordering of G. In other words, every edge ηi ηj in G satisfies i < j.
Moreover, for each i > d we denote
Ti := {j : ηj ηi is an edge of G},
and di = #Ti ≤ k.
With the necessary graph theoretical framework established, we can now define sets of hierarchical
functions.
Definition 5.2. Let d, k, N, s ∈ N. Let G ∈ G(d, k, N ) and let, for i = d + 1, . . . , N , fi ∈ C s (Rdi ) with
kfi kC s (Rdi ) ≤ 1∗∗ . For x ∈ Rd , we define for i = 1, . . . , d vi = xi and vi (x) = fi (vj1 (x), . . . , vjdi (x)), where
j1 , . . . , jdi ∈ Ti and j1 < j2 < · · · < jdi .
We call the function
f : [0, 1]d → R, x 7→ vN (x)
a compositional function associated to G with regularity s. We denote the set of compositional functions associated
to any graph in G(d, k, N ) with regularity s by CF(d, k, N ; s).
We present a visualisation of three types of graphs in Figure 5.1. While we have argued before that it is
reasonable to expect that NNs can efficiently approximate these types of functions, it is not entirely clear
why this is a relevant function class to study. In [19, 25], it is claimed that these functions are particularly
close to the functionality of the human visual cortex. In principle, the visual cortex works by first analysing
very localised features of a scene and then combining the resulting responses in more and more abstract
levels to yield more and more high-level descriptions of the scene.
If the inputs of a function correspond to spatial locations, e.g., come from several sensors, such as in
weather forecasting, then it might make sense to model this function as network of functions that first
aggregate information from spatially close inputs before sending the signal to a central processing unit.
Compositional functions can also be compared with Boolean circuits comprised of simple logic gates.
Let us now show how well functions from CF(d, k, N ; s) can be approximated by ReLU NNs. Here we
are looking for an approximation rate that increases with s and, hopefully, does not depend too badly on d.
Theorem 5.3. Let d, k, N, s ∈ N. Then there exists a constant C > 0 such that for every f ∈ CF(d, k, N ; s) and
every 1/2 > > 0 there exists a NN Φf with
Proof. Let f ∈ CF(d, k, N ; s) and let, for i = d + 1, . . . , N , fi ∈ C s (Rdi ) be according to Definition 5.2. By
Theorem 3.19 and Remark 3.20, we have that there exists a constant C > 0 and NNs Φi such that
|R(Φi )(x) − fi (x)| ≤ , (5.4)
(2k)N
for all x ∈ [−2, 2]di and L(Φi ) ≤ CN log2 (k/) and
di N kN
M (Φi ) ≤ C−di /s (2k) s N log2 (k/) ≤ C−k/s (2k) s N log2 (k/).
∗∗ The restriction ‖f_i‖_{C^s(R^{d_i})} ≤ 1 could be replaced by ‖f_i‖_{C^s(R^{d_i})} ≤ κ for a κ > 1, and Theorem 5.3 below would still hold up to some additional constants depending on κ. This would, however, significantly increase the technicalities and obfuscate the main ideas in Theorem 5.3.
Figure 5.1: Three types of graphs that could be the basis of compositional functions. The associated functions
are composed of two or three dimensional functions only.
For i = d + 1, . . . , N, let P_i be the orthogonal projection from R^{i−1} onto the components in T_i, i.e., for T_i =: {j_1, . . . , j_{d_i}}, where j_1 < · · · < j_{d_i}, we set P_i((x_k)_{k=1}^{i−1}) := (x_{j_k})_{k=1}^{d_i}.
Now we define, for j = d + 1, . . . , N − 1,
Φ̃_j := P(Φ^{Id}_{j−1, L(Φ_j)}, Φ_j P_j),
and
Φ̃_N := Φ_N P_N.
Moreover, for
Φ_f := Φ̃_N Φ̃_{N−1} · · · Φ̃_{d+1},
we have that
M(Φ_f) ≲ 2^{⌈log_2(N)⌉} N max_{j=d+1,...,N} M(Φ̃_j) ≲ N² max_{j=d+1,...,N} M(Φ̃_j).    (5.5)
Furthermore,
max_{j=d+1,...,N} M(Φ̃_j) ≤ max_{j=d+1,...,N−1} M(Φ^{Id}_{j−1, L(Φ_j)}) + max_{j=d+1,...,N} M(Φ_j)
 ≤ 2N max_{j=d+1,...,N} L(Φ_j) + max_{j=d+1,...,N} M(Φ_j).
Next, we claim that, for every j = d + 1, . . . , N − 1 and every x ∈ [0, 1]^d,
|(R(Φ̃_j · · · Φ̃_{d+1})(x))_k − v_k(x)| ≤ ε/(2k)^{N−j}  for all k ≤ j.    (5.6)
We prove (5.6) by induction. Since the realisation of Φ^{Id}_{d, L(Φ_{d+1})} is the identity, we have, by construction, that (R(Φ̃_{d+1})(x))_k = v_k(x) for all k ≤ d. Moreover, by (5.4), we have that
|(R(Φ̃_{d+1})(x))_{d+1} − v_{d+1}(x)| = |(R(Φ̃_{d+1})(x))_{d+1} − f_{d+1}(P_{d+1}(x))| ≤ ε/(2k)^N.
Assume, for the induction step, that (5.6) holds for some j with N − 1 > j > d.
Again, since the identity is implemented exactly, we have by the induction hypothesis that, for all k ≤ j,
|(R(Φ̃_{j+1} · · · Φ̃_{d+1})(x))_k − v_k(x)| ≤ ε/(2k)^{N−j}.
Moreover, we have that v_{j+1}(x) = f_{j+1}(P_{j+1}([v_1(x), . . . , v_j(x)])). Hence,
|(R(Φ̃_{j+1} · · · Φ̃_{d+1})(x))_{j+1} − v_{j+1}(x)|
 = |R(Φ_{j+1}) ∘ P_{j+1} ∘ R(Φ̃_j · · · Φ̃_{d+1})(x) − v_{j+1}(x)|
 ≤ |R(Φ_{j+1}) ∘ P_{j+1} ∘ R(Φ̃_j · · · Φ̃_{d+1})(x) − f_{j+1} ∘ P_{j+1} ∘ R(Φ̃_j · · · Φ̃_{d+1})(x)|
  + |f_{j+1} ∘ P_{j+1} ∘ R(Φ̃_j · · · Φ̃_{d+1})(x) − f_{j+1} ∘ P_{j+1}([v_1(x), . . . , v_j(x)])| =: I + II.
By (5.4), we have that I ≤ ε/(2k)^N (note that P_{j+1} ∘ R(Φ̃_j · · · Φ̃_{d+1})(x) ∈ [−2, 2]^{d_{j+1}} by the induction hypothesis). Moreover, since every partial derivative of f_{j+1} is bounded in absolute value by 1, we have that II ≤ d_{j+1} ε/(2k)^{N−j} ≤ ε/(2(2k)^{N−j−1}) by the induction assumption. Hence I + II ≤ ε/(2k)^{N−j−1}.
Finally, we compute
|R(Φ̃_N · · · Φ̃_{d+1})(x) − v_N(x)|
 = |R(Φ_N) ∘ P_N ∘ R(Φ̃_{N−1} · · · Φ̃_{d+1})(x) − v_N(x)|
 ≤ |R(Φ_N) ∘ P_N ∘ R(Φ̃_{N−1} · · · Φ̃_{d+1})(x) − f_N ∘ P_N ∘ R(Φ̃_{N−1} · · · Φ̃_{d+1})(x)|
  + |f_N ∘ P_N ∘ R(Φ̃_{N−1} · · · Φ̃_{d+1})(x) − f_N ∘ P_N([v_1(x), . . . , v_{N−1}(x)])| =: III + IV.
Using the exact same argument as for estimating I and II above, we conclude that
III + IV ≤ ε,
which yields (5.3).
Remark 5.4. Theorem 5.3 shows what we had already conjectured earlier. The complexity of approximating a composi-
tional function depends asymptotically not on the input dimension d, but on the maximum indegree of the underlying
graph.
We also see that, while the convergence rate does not depend on d, the constants in (5.2) are potentially very large.
In particular, for fixed s the constants grow superexponentially with k.
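To get a feeling for how large these constants are, the following short sketch (my own; the values of N and s are arbitrary illustrative choices, not prescribed by the notes) prints the factor (2k)^{kN/s} from the bound on M(Φ_i) in the proof above for a few values of k, next to e^k for comparison.

import math

N, s = 3, 2                                # illustrative choices
for k in range(1, 8):
    factor = (2 * k) ** (k * N / s)        # constant appearing in the bound on M(Phi_i)
    print(f"k={k}:  (2k)^(kN/s) = {factor:.3e},   e^k = {math.exp(k):.3e}")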
If M is a d′-dimensional manifold with d′ < d, and f ∈ C^k(M), then we would expect to be able to obtain an approximation rate by NNs that does not depend on d but on d′.
To obtain such a result, we first make a convenient definition of certain types of submanifolds of R^d.
Definition 5.5. Let M be a smooth d′-dimensional submanifold of R^d. For N ∈ N and δ > 0, we say that M is (N, δ)-covered if there exist x_1, . . . , x_N ∈ M such that
• ⋃_{i=1}^{N} B_{δ/2}(x_i) ⊃ M,
• for every i = 1, . . . , N, the projection
P_i : M ∩ B_δ(x_i) → T_{x_i} M
is injective and smooth and
P_i^{−1} : P_i(M ∩ B_δ(x_i)) → M
is smooth.
Here T_{x_i} M is the tangent space of M at x_i. See Figure 5.2 for a visualisation. We identify T_{x_i} M with R^{d′} in the sequel.
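As a sanity check (my own construction, not part of the notes), the following sketch verifies numerically that the unit circle in R² is (N, δ)-covered in the sense of Definition 5.5 for N equally spaced points and a suitable δ: the balls B_{δ/2}(x_i) cover the circle, and the projection onto the tangent line at x_i, which for the circle is the tangential coordinate sin(t − θ_i) of the arc parameter t, is injective as long as the arc M ∩ B_δ(x_i) stays within an angular radius of π/2.

import math

N, delta = 12, 0.8                       # assumed values; any delta < sqrt(2) works here
centres = [2 * math.pi * i / N for i in range(N)]    # angles theta_i of the points x_i

def chord(a, b):                         # Euclidean distance between two angles on the circle
    return 2 * abs(math.sin((a - b) / 2))

# Covering: every point of a fine sample of M lies in some B_{delta/2}(x_i).
sample = [2 * math.pi * j / 10000 for j in range(10000)]
covered = all(min(chord(t, c) for c in centres) < delta / 2 for t in sample)

# Injectivity: the arc M ∩ B_delta(x_i) has angular radius 2*arcsin(delta/2); the
# tangential coordinate is strictly monotone there iff this radius is at most pi/2.
half_arc = 2 * math.asin(delta / 2)
injective = half_arc <= math.pi / 2

print("covered:", covered, "projection injective:", injective)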
With this definition, we can prove the following result, which is similar to a number of results in the literature, such as [32, 33, 5, 30].
Figure 5.2: One dimensional manifold embedded in 2D. For two points the tangent space is visualised in red.
The two circles describe areas where the projection onto the tangent space is invertible and smooth.
Theorem 5.6. Let d, k ∈ N and let M ⊂ R^d be an (N, δ)-covered d′-dimensional manifold for an N ∈ N and δ > 0. Then there exists a constant c = c(d, k, δ, N) > 0 such that, for every ε > 0 and every f ∈ C^k(M, R) with ‖f‖_{C^k} ≤ 1, there exists a NN Φ such that
‖f − R(Φ)‖_∞ ≤ ε,
M(Φ) ≤ c · ε^{−d′/k} log_2(1/ε),
L(Φ) ≤ c · log_2(1/ε).
Proof. The proof is structured in two parts. First we show a convenient alternative representation of f , then
we construct the associated NN.
Step 1: Since M is (N, δ)-covered, there exists B > 0 such that M ⊂ [−B, B]d .
Let T be a simplicial mesh on [−B, B]d so that for all nodes ηi ∈ T we have that
See (3.2) for the definition of G(i) and Figure 5.3 for a visualisation of T .
By Proposition 3.1, we have that
1 = Σ_{i=1}^{M_N} φ_{i,T}.
We denote
I_M := {i = 1, . . . , M_N : dist(η_i, M) ≤ δ/8},
where dist(a, M) := min_{y∈M} |a − y|. Per construction, we have that
1 = Σ_{i∈I_M} φ_{i,T}(x),  for all x ∈ M.
and P_i(M ∩ B_{3δ/4}(x_i)), P_i(M ∩ B_{7δ/8}(x_i)) are open. By a C^∞ version of the Urysohn Lemma, there exists a smooth function σ : R^{d′} → [0, 1] such that σ = 1 on P_i(M ∩ B_{3δ/4}(x_i)) and σ = 0 on (P_i(M ∩ B_{7δ/8}(x_i)))^c.
We define
f̃_i := σ f_i on P_i(M ∩ B_δ(x_i)),  and f̃_i := 0 elsewhere.
It is not hard to see that f̃_i ∈ C^k(R^{d′}) with ‖f̃_i‖_{C^k} ≤ C_M, where C_M is a constant depending on M only, and f̃_i = f_i on P_i(M ∩ B_{3δ/4}(x_i)). Hence, with (5.7), we have that
f(x) = Σ_{i∈I_M} φ_{i,T}(x) · ( f̃_{j(i)} ∘ P_{j(i)}(x) ).    (5.8)
Step 2: The form of f given by (5.8) suggests a simple way to construct a ReLU approximation of f .
First of all, for every i ∈ I_M, we have that P_{j(i)} is an affine linear map from [−B, B]^d to R^{d′}. We set Φ^P_i := ((A^i_1, b^i_1)), where A^i_1, b^i_1 are such that A^i_1 x + b^i_1 = P_{j(i)}(x) for all x ∈ R^d.
Let K > 0 be such that P_i(M) ⊂ [−K, K]^{d′} for all i ∈ I_M. For every i ∈ I_M, we have by Theorem 3.19 and Remark 3.20 that for every ε_1 > 0 there exists a NN Φ^f_i such that, for all x ∈ [−K, K]^{d′},
|f̃_{j(i)}(x) − R(Φ^f_i)(x)| ≤ ε_1,
M(Φ^f_i) ≲ ε_1^{−d′/k} log_2(1/ε_1),    (5.9)
L(Φ^f_i) ≲ log_2(1/ε_1).    (5.10)
Per Proposition 3.3, there exists, for every i ∈ I_M, a network Φ^φ_i with
R(Φ^φ_i) = φ_{i,T},  M(Φ^φ_i), L(Φ^φ_i) ≲ 1,    (5.11)
where L* := L(Φ^f_i Φ^P_i) − L(Φ^φ_i). At this point, we assume that L* ≥ 0. If L(Φ^f_i Φ^P_i) < L(Φ^φ_i), then one could instead extend Φ^f_i.
Finally, we define, for Q := |I_M|,
Φ_{ε_1, ε_2} := (([1, . . . , 1], 0)) P( Φ^{φ(fP)}_{i_1}, Φ^{φ(fP)}_{i_2}, . . . , Φ^{φ(fP)}_{i_Q} ).
for a constant c_mult > 0 by Proposition 3.16 and Remark 3.10. By (5.9), we conclude that
M(Φ_{ε_1, ε_2}) ≲ ε^{−d′/k} log_2(1/ε),
where the implicit constant depends on M and d. As the last step, we compute the depth of Φ_{ε_1, ε_2}. We have that
L(Φ_{ε_1, ε_2}) = max_{i∈I_M} L(Φ^{φ(fP)}_i) = L(Φ^{mult,2,ε_2}) + L(Φ^f_i Φ^P_i) ≲ log_2(1/ε_2) + log_2(1/ε_1) ≲ log_2(1/ε)
by (5.10).
Remark 5.7. Theorem 5.6 shows that the approximation rate when approximating C^k regular functions defined on a manifold does not depend badly on the ambient dimension. However, at least in our construction, the constants may still depend on d and even grow rapidly with d. For example, in the estimate in (5.11) the implicit constant depends, because of Proposition 3.3, on the maximal number of neighbouring cells of the underlying mesh. For a typical mesh on the grid Z^d of a d-dimensional space, it is not hard to see that this number grows exponentially with the dimension d.
For C > 0, we consider the set Γ_C of functions f : R^d → R with f̂ ∈ L¹(R^d) and ∫_{R^d} |2πξ| |f̂(ξ)| dξ ≤ C, where f̂ denotes the Fourier transform of f. By the inverse Fourier transform theorem, the condition ‖f̂‖_1 < ∞ implies that every element of Γ_C is continuous. We also denote the unit ball in R^d by B_1^d := {x ∈ R^d : |x| ≤ 1}.
We have the following result:
Theorem 5.8 (cf. [2, Theorem 1]). Let d ∈ N, f ∈ Γ_C, % : R → R be sigmoidal and N ∈ N. Then, for every c > 4C², there exists a NN Φ with
L(Φ) = 2,
M(Φ) ≤ N · (d + 2) + 1,
(1/|B_1^d|) ∫_{B_1^d} |f(x) − R(Φ)(x)|² dx ≤ c/N.
Proof. Let f ∈ \overline{co}(G). For every δ > 0, there exists f* ∈ co(G) so that
‖f − f*‖ ≤ δ,
and therefore X_i := g_i − f*, for i = 1, . . . , N, are i.i.d. random variables with E(X_i) = 0. Since the X_i are independent random variables, we have that
E‖(1/N) Σ_{i=1}^{N} X_i‖² = (1/N²) Σ_{i=1}^{N} E‖X_i‖² = (1/N²) Σ_{i=1}^{N} ( E‖g_i‖² − 2 E⟨g_i, f*⟩ + ‖f*‖² )
 = (1/N²) Σ_{i=1}^{N} ( E‖g_i‖² − ‖f*‖² ) ≤ B²/N.    (5.13)
The argument above is, of course, commonly known as the weak law of large numbers. Because of (5.13), there exists at least one event ω such that
‖(1/N) Σ_{i=1}^{N} X_i(ω)‖² ≤ B²/N
and hence
‖(1/N) Σ_{i=1}^{N} g_i(ω) − f*‖² ≤ B²/N.
Note that the law of large numbers is applied here to random variables that are Hilbert-space-valued.
with ‖(c_i(f))_{i=1}^{∞}‖_{ℓ¹} = 1. We now have that, if Λ_n corresponds to the indices of the n largest of (|c_i(f)|)_{i=1}^{∞} in (5.14), then
‖f − Σ_{j∈Λ_n} c_j(f) φ_j‖² = ‖Σ_{j∉Λ_n} c_j(f) φ_j‖² = Σ_{j∉Λ_n} |c_j(f)|².    (5.15)
Since (n + 1) c̃_{n+1} ≤ Σ_{j=1}^{n+1} c̃_j ≤ 1, we have that c̃_{n+1} ≤ (n + 1)^{−1} and hence the estimate
‖f − Σ_{j∈Λ_n} c_j(f) φ_j‖² ≤ (n + 1)^{−1}
follows. Therefore, in the case that G is an orthogonal basis, we can explicitly construct the gi and ci of
Lemma 5.9.
Remark 5.10. Lemma 5.9 describes a quite powerful procedure. Indeed, to achieve an approximation rate of 1/N for a function f by superpositions of N elements of a set G, it suffices to show that f can be approximated arbitrarily well by convex combinations of elements of G.
In the language of NNs, we could say that every function that can be represented by an arbitrarily wide two-layer NN with bounded activation function, where the weights in the last layer are positive and sum to one, can also be approximated with a network with only N neurons in the first layer and an error proportional to 1/N.
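The following numerical sketch (my own, with an arbitrary dictionary of vectors) illustrates the sampling argument behind Lemma 5.9 in the Hilbert space R^D: draw N dictionary elements independently according to the weights of a convex combination f* and average them; the mean squared error decays like B²/N.

import numpy as np

rng = np.random.default_rng(0)
D, m = 200, 50                                    # ambient dimension, dictionary size
G = rng.normal(size=(m, D))
G /= np.linalg.norm(G, axis=1, keepdims=True)     # ||g_i|| = 1, so B = 1
c = rng.random(m); c /= c.sum()                   # convex weights
f_star = c @ G                                    # f* = sum_i c_i g_i

for N in [10, 100, 1000]:
    errs = []
    for _ in range(200):                          # Monte Carlo estimate of E||avg - f*||^2
        idx = rng.choice(m, size=N, p=c)
        avg = G[idx].mean(axis=0)
        errs.append(np.sum((avg - f_star) ** 2))
    print(N, np.mean(errs), "   bound B^2/N =", 1 / N)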
In view of Lemma 5.9, to show Theorem 5.8, we only need to demonstrate that each function in Γ_C is in the convex hull of functions representable by superpositions of sigmoidal NNs with few weights. Before we prove this, we show that each function f ∈ Γ_C is in the convex hull of functions of the set
G_C := { B_1^d ∋ x ↦ γ 1_{R+}(⟨a, x⟩ + b) : a ∈ R^d, b ∈ R, |γ| ≤ 2C }.
Lemma 5.11. Let f ∈ Γ_C. Then f|_{B_1^d} − f(0) ∈ \overline{co}(G_C). Here the closure is taken with respect to the normalised L² norm on B_1^d, defined by
‖g‖ := ( (1/|B_1^d|) ∫_{B_1^d} |g(x)|² dx )^{1/2}.
Proof. Since f ∈ Γ_C is continuous and f̂ ∈ L¹(R^d), we have by the inverse Fourier transform that
f(x) − f(0) = ∫_{R^d} f̂(ξ) ( e^{2πi⟨x,ξ⟩} − 1 ) dξ
 = ∫_{R^d} |f̂(ξ)| ( e^{i(2π⟨x,ξ⟩ + κ(ξ))} − e^{iκ(ξ)} ) dξ
 = ∫_{R^d} |f̂(ξ)| ( cos(2π⟨x,ξ⟩ + κ(ξ)) − cos(κ(ξ)) ) dξ,
where κ(ξ) is the phase of f̂(ξ) and the last equality follows since f is real-valued.
Since f ∈ Γ_C, we have ∫_{R^d} |2πξ| |f̂(ξ)| dξ ≤ C, and thus
dΛ(ξ) := (1/C) |2πξ| |f̂(ξ)| dξ
defines a finite measure with Λ(R^d) = ∫_{R^d} dΛ(ξ) ≤ 1. With this notation, we have
f(x) − f(0) = C ∫_{R^d} ( cos(2π⟨x, ξ⟩ + κ(ξ)) − cos(κ(ξ)) )/|2πξ| dΛ(ξ).
Since (cos(2πhx, ξi + κ(ξ)) − cos(κ(ξ)))/|2πξ| is continuous and bounded by 1, and hence integrable with
respect to dΛ(ξ) we have by the dominated convergence theorem that, for n → ∞,
| C ∫_{R^d} ( cos(2π⟨x, ξ⟩ + κ(ξ)) − cos(κ(ξ)) )/|2πξ| dΛ(ξ) − Σ_{θ ∈ (1/n)Z^d} C ( cos(2π⟨x, θ⟩ + κ(θ)) − cos(κ(θ)) )/|2πθ| · Λ(I_θ) | → 0,    (5.17)
where I_θ := [0, 1/n]^d + θ. Since f(x) − f(0) is continuous and thus bounded on B_1^d, and
| Σ_{θ ∈ (1/n)Z^d} C ( cos(2π⟨x, θ⟩ + κ(θ)) − cos(κ(θ)) )/|2πθ| · Λ(I_θ) | ≤ C,
the convergence in (5.17) also holds with respect to the normalised L² norm on B_1^d. Since Σ_{θ ∈ (1/n)Z^d} Λ(I_θ) = Λ(R^d) ≤ 1, we conclude that f(x) − f(0) lies in the closure, with respect to this norm, of convex combinations of functions of the form
x ↦ g_θ(x) := α_θ ( cos(2π⟨x, θ⟩ + κ(θ)) − cos(κ(θ)) )/|2πθ|,
for θ ∈ R^d and 0 ≤ α_θ ≤ C. The result follows if we can show that each of the functions g_θ is in \overline{co}(G_C). Setting z = ⟨x, θ/|θ|⟩, it suffices to show that the map
[−1, 1] ∋ z ↦ α_θ ( cos(2π|θ|z + b) − cos(b) )/|2πθ| =: g̃_θ(z),
where b ∈ [0, 2π], can be approximated arbitrarily well by convex combinations of functions of the form
Clearly, g_{T,−} + g_{T,+} converges to g̃_θ for T → ∞ and since
Σ_{i=1}^{T} |g̃_θ(i/T) − g̃_θ((i−1)/T)|/(2C) + Σ_{i=1}^{T} |g̃_θ(−i/T) − g̃_θ((1−i)/T)|/(2C) ≤ 2 Σ_{i=1}^{T} ‖g̃_θ′‖_∞/(2CT) ≤ 1,
we have that g̃_θ can be approximated arbitrarily well by convex combinations of the form (5.19). Therefore, we have that g_θ ∈ \overline{co}(G_C) and by (5.18) this yields that f − f(0) ∈ \overline{co}(G_C).
Proof of Theorem 5.8. Let f ∈ Γ_C. Then, by Lemma 5.11, we have that f|_{B_1^d} − f(0) ∈ \overline{co}(G_C).
Moreover, for every element g ∈ G_C we have that ‖g‖ ≤ 2C in the normalised L² norm on B_1^d. Therefore, by Lemma 5.9, applied to the Hilbert space L²(B_1^d) equipped with this norm, we get that for every N ∈ N there exist |γ_i| ≤ 2C, a_i ∈ R^d, b_i ∈ R, for i = 1, . . . , N, so that
(1/|B_1^d|) ∫_{B_1^d} | f(x) − f(0) − Σ_{i=1}^{N} γ_i 1_{R+}(⟨a_i, x⟩ + b_i) |² dx ≤ 4C²/N.
Since %(λx) → 1_{R+}(x) for λ → ∞ almost everywhere, it is clear that, for every δ > 0, there exist ã_i, b̃_i, i = 1, . . . , N, so that
(1/|B_1^d|) ∫_{B_1^d} | f(x) − f(0) − Σ_{i=1}^{N} γ_i %(⟨ã_i, x⟩ + b̃_i) |² dx ≤ 4C²/N + δ.
Finally, x ↦ f(0) + Σ_{i=1}^{N} γ_i %(⟨ã_i, x⟩ + b̃_i) is the realisation of a network Φ with L(Φ) = 2 and M(Φ) ≤ N · (d + 2) + 1. This is clear by setting
Φ := (([γ_1, . . . , γ_N], f(0))) P( ((ã_1, b̃_1)), . . . , ((ã_N, b̃_N)) ).
Remark 5.12. The fact that the approximation rate of Theorem 5.8 is independent of the dimension is quite surprising at first. However, the following observation might render Theorem 5.8 more plausible. The assumption of having a finite Fourier moment is comparable to a certain dimension-dependent regularity assumption. In other words, the condition of having a finite Fourier moment becomes more restrictive in higher dimensions, meaning that the complexity of the function class does not grow with the dimension, or grows only mildly. Indeed, while this type of regularity is not directly expressible in terms of classical orders of smoothness, Barron notes that a necessary condition for f ∈ Γ_C, for some C > 0, is that f has bounded first-order derivatives. A sufficient condition is that all derivatives of order up to ⌊d/2⌋ + 2 are square-integrable, [2, Section II]. The sufficient condition amounts to f ∈ W^{⌊d/2⌋+2,2}(R^d), which would also imply an approximation rate of N^{−1} in the squared L² norm by sums of at most N B-splines, see [21, 8].
Example 5.13. A natural question, especially in view of Remark 5.12, is which well-known and relevant functions are contained in Γ_C. In [2, Section IX], a long list with properties of this set and elements thereof is presented. Among others, we have that
1. If g ∈ Γ_C, then a^{−d} g(a(· − b)) ∈ Γ_C, for every a ∈ R_+, b ∈ R^d.
2. For g_i ∈ Γ_C, i = 1, . . . , m, and c = (c_i)_{i=1}^{m} it holds that Σ_{i=1}^{m} c_i g_i ∈ Γ_{‖c‖_1 C}.
3. The Gaussian function x ↦ e^{−|x|²/2} is in Γ_C for C = O(d^{1/2}).
4. Functions of high smoothness: if the first ⌈d/2⌉ + 2 derivatives of a function g are square-integrable on R^d, then g ∈ Γ_C, where the constant C depends linearly on ‖g‖_{W^{⌊d/2⌋+2,2}}.
The last three examples show quite nicely how the assumption g ∈ ΓC includes an indirect dependence on the dimension.
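For the Gaussian in item 3 the first Fourier moment can be made explicit. With the convention f̂(ξ) = ∫ f(x) e^{−2πi⟨x,ξ⟩} dx used in the proof of Lemma 5.11, a short calculation (my own, not carried out in [2] or in these notes) reduces ∫_{R^d} |2πξ| |ĝ(ξ)| dξ for g(x) = e^{−|x|²/2} to √2 · Γ((d+1)/2)/Γ(d/2), which indeed grows like d^{1/2}. The sketch below evaluates this expression.

import math

def barron_constant_gaussian(d):
    # sqrt(2) * Gamma((d+1)/2) / Gamma(d/2), via log-gamma for numerical stability
    return math.sqrt(2.0) * math.exp(math.lgamma((d + 1) / 2) - math.lgamma(d / 2))

for d in [1, 2, 10, 100, 1000]:
    C = barron_constant_gaussian(d)
    print(f"d={d:5d}  C={C:10.4f}  C/sqrt(d)={C / math.sqrt(d):.4f}")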
For a set S ⊂ X with |S| = m, we write H_S := {h|_S : h ∈ H}, and the growth function of H is defined as G_H(m) := max_{S⊂X, |S|=m} |H_S|.
The growth function counts the number of functions that result from restricting H to the best possible set S
with m elements. Intuitively, in the framework of binary classification, the growth function tells us in how
many ways we can classify the elements of the best possible sets S of any cardinality by functions in H.
It is clear that for every set S with |S| = m, we have that |HS | ≤ 2m and hence GH (m) ≤ 2m . We say that
a set S with |S| = m for which |HS | = 2m is shattered by H.
A second, more compressed notion of complexity in the context of binary classification is that of the Vapnik–Chervonenkis dimension (VC dimension), [36]. We define VCdim(H) to be the largest integer m such that there exists S ⊂ X with |S| = m that is shattered by H. In other words,
VCdim(H) = sup{ m ∈ N : G_H(m) = 2^m }.
Example 6.1. We consider three examples:
1. Let H := {0, 1} consist of the two constant functions 0 and 1. Then G_H(m) = 2 for all m ≥ 1. Hence, VCdim(H) = 1.
2. Let H := {0, χ_Ω, χ_{Ω^c}, 1} for some fixed non-empty set Ω ⊊ R². Then, choosing S = (x_1, x_2) with x_1 ∈ Ω, x_2 ∈ Ω^c, we have G_H(m) = 4 for all m ≥ 2. Hence, VCdim(H) = 2.
3. Let h := χ_{R_+} and
H := { h_{θ,t} : x ↦ h( ⟨(cos θ, sin θ), x⟩ − t ) : θ ∈ [−π, π], t ∈ R }.
Then H is the set of all linear classifiers. It is not hard to see that, if S contains 3 points in general position, then |H_S| = 8, see Figure 6.1. Hence, these sets S are shattered by H. We will later see that H does not shatter any set of points with at least 4 elements. Hence VCdim(H) = 3. This is intuitively clear when considering Figure 6.2.
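A brute-force sketch (my own; the grid search only yields a lower bound on the number of realisable labellings) for the third example: three points in general position are shattered by the linear classifiers H, while the four points of the configuration in Figure 6.2 are not.

import math

def dichotomies(points, n_theta=400):
    """Lower bound on |H_S| for the linear classifiers h_{theta,t} via grid search."""
    labels = set()
    for i in range(n_theta):
        theta = -math.pi + 2 * math.pi * i / n_theta
        w = (math.cos(theta), math.sin(theta))
        projs = [w[0] * x + w[1] * y for x, y in points]
        for t in sorted(projs) + [min(projs) - 1.0]:      # thresholds at/below the projections
            for eps in (-1e-9, 1e-9):
                labels.add(tuple(1 if p - (t + eps) >= 0 else 0 for p in projs))
    return len(labels)

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]                 # general position
four = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]      # "XOR" configuration as in Figure 6.2
print("3 points:", dichotomies(three), "of", 2 ** 3)         # expect 8: shattered
print("4 points:", dichotomies(four), "of", 2 ** 4)          # expect fewer than 16: not shattered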
As a first step to familiarise ourselves with the new notions, we study the growth function and VC
dimension of realisations of NNs with one neuron and the Heaviside function as activation function. This
situation was discussed before in the third point of Example 6.1.
Figure 6.2: Four points which cannot be classified in every possible way by a single linear classifier. The
classification sketched above requires at least sums of two linear classifiers.
Theorem 6.2 ([1, Theorem 3.4]). Let d ∈ N and % = 1_{R+}. Let SN(d) be the set of realisations of neural networks with two layers, d-dimensional input, one neuron in the first layer and one-dimensional output, where the weights in the second layer satisfy (A_2, b_2) = (1, 0). Then SN(d) shatters a set of points (x_i)_{i=1}^{m} ⊂ R^d if and only if the vectors
((x_i, 1))_{i=1}^{m} ⊂ R^{d+1}    (6.1)
are linearly independent.
Proof. Assume first that (x_i)_{i=1}^{m} is shattered by SN(d) and assume towards a contradiction that the vectors in (6.1) are not linearly independent.
Then, for every v ∈ {0, 1}^m there exists a neural network Φ_v such that, for all j ∈ {1, . . . , m},
R(Φ_v)(x_j) = v_j.
Moreover, since the vectors in (6.1) are not linearly independent, there exist (α_j)_{j=1}^{m} ⊂ R such that, without loss of generality,
Σ_{j=1}^{m−1} α_j (x_j, 1)^T = (x_m, 1)^T.
Let v ∈ {0, 1}^m be such that, for j ∈ {1, . . . , m − 1}, v_j = 1 − 1_{R+}(α_j) and v_m = 1. Then,
1 = v_m = R(Φ_v)(x_m) = 1_{R+}(A_1^v x_m + b_1^v) = 1_{R+}( Σ_{j=1}^{m−1} α_j (A_1^v x_j + b_1^v) ) = 0,
where the last equality is because 1_{R+}(A_1^v x_j + b_1^v) = v_j = 1 − 1_{R+}(α_j). This produces the desired contradiction.
If, on the other hand, the vectors in (6.1) are linearly independent, then the matrix X ∈ R^{m, d+1} whose i-th row is (x_i^T, 1) has rank m. Hence, for every v ∈ {0, 1}^m there exists a vector [A_1^v  b_1^v] ∈ R^{1, d+1} such that X [A_1^v  b_1^v]^T = 2v − 1. Setting Φ_v := ((A_1^v, b_1^v), (1, 0)) yields the claim.
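The second direction of the proof can be turned into a small experiment (my own sketch; the target values 2v − 1 are one convenient choice): if the vectors (x_i, 1) are linearly independent, one can solve a linear system for (A_1^v, b_1^v) realising any labelling v with a single Heaviside neuron.

import numpy as np

rng = np.random.default_rng(1)
d, m = 3, 4                                   # m <= d + 1 allows linear independence
X = np.hstack([rng.normal(size=(m, d)), np.ones((m, 1))])   # rows (x_i, 1)
assert np.linalg.matrix_rank(X) == m          # the criterion of Theorem 6.2

heaviside = lambda z: (z >= 0).astype(int)
for v in range(2 ** m):
    bits = np.array([(v >> i) & 1 for i in range(m)])
    w, *_ = np.linalg.lstsq(X, 2 * bits - 1, rcond=None)    # solve X w = 2v - 1
    assert (heaviside(X @ w) == bits).all()
print("all", 2 ** m, "labellings realised by a single Heaviside neuron")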
In establishing bounds on the VC dimension of a set of neural networks, the activation function plays a
major role. For example, we have the following lemma.
Lemma 6.3 ([1, Lemma 7.2]). Let H := {x 7→ 1R+ (sin(ax)) : a ∈ R}. Then VCdim(H) = ∞.
Proof. Let x_i := 2^{i−1}, for i ∈ N. Next, we will show that, for every k ∈ N, the set {x_1, . . . , x_k} is shattered by H.
The argument is based on the following bit-extraction technique: for given bits b_1, . . . , b_k ∈ {0, 1}, let b := Σ_{j=1}^{k} b_j 2^{−j} + 2^{−k−1}. Setting a := 2πb, we have that
1_{R+}(sin(a x_i)) = 1_{R+}( sin( 2π Σ_{j=1}^{k} b_j 2^{−j} x_i + 2π 2^{−k−1} x_i ) )
 = 1_{R+}( sin( 2π Σ_{j=1}^{i−1} b_j 2^{−j} x_i + 2π Σ_{j=i}^{k} b_j 2^{−j} x_i + 2π 2^{−k−1} x_i ) ) =: I(x_i).
Since Σ_{j=1}^{i−1} b_j 2^{−j} x_i ∈ N_0, we have by the 2π-periodicity of sin that
I(x_i) = 1_{R+}( sin( 2π Σ_{j=i}^{k} b_j 2^{−j} x_i + 2π 2^{−k−1} x_i ) )
 = 1_{R+}( sin( b_i π + π ( Σ_{j=i+1}^{k} b_j 2^{i−j} + 2^{i−k−1} ) ) ).
Since Σ_{j=i+1}^{k} b_j 2^{i−j} + 2^{i−k−1} ∈ (0, 1), we have that
I(x_i) = 0 if b_i = 1,  and I(x_i) = 1 else.
Since the bits b_1, . . . , b_k were chosen arbitrarily, this shows that VCdim(H) ≥ k for all k ∈ N.
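A quick numerical check of the bit-extraction argument (my own sketch; k is kept small so that the computation stays well within double precision):

import math, random

random.seed(0)
k = 8
bits = [random.randint(0, 1) for _ in range(k)]
b = sum(bits[j] * 2.0 ** (-(j + 1)) for j in range(k)) + 2.0 ** (-(k + 1))
a = 2 * math.pi * b

recovered = []
for i in range(1, k + 1):
    x_i = 2 ** (i - 1)
    I_xi = 1 if math.sin(a * x_i) >= 0 else 0     # 1_{R+}(sin(a x_i)) as in the proof
    recovered.append(1 - I_xi)                    # I(x_i) = 0 exactly when b_i = 1
print(recovered == bits)                          # True: all k bits are recovered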
In the previous two results (Theorem 6.2, Lemma 6.3), we observed that the VC dimension of sets of
realisations of NNs depends on their size and also on the associated activation function. We have the
following result, that we state without proof:
Denote, for d, L, M ∈ N, by NN_{d,L,M} the set of neural networks with d-dimensional input, L layers, and at most M weights.
Theorem 6.4 ([1, Theorem 8.8]). Let d, ℓ, p ∈ N, and let % : R → R be piecewise polynomial with at most ℓ pieces of degree at most p. Let, for L, M ∈ N,
H_{L,M} := { 1_{R+} ∘ R(Φ) : Φ ∈ NN_{d,L,M} }.
Then
VCdim(H_{L,M}) ≲ M L log_2(M) + M L².
Proof. Choose for any b ∈ {0, 1}^k an associated h_b ∈ H according to (6.2) and g_b according to (6.3).
Then |g_b(x_i) − b_i| < ε/2 and therefore g_b(x_i) − ε/2 > 0 if b_i = 1 and g_b(x_i) − ε/2 < 0 otherwise. Hence 1_{R+}(g_b(x_i) − ε/2) = b_i for all i = 1, . . . , k, which yields the claim.
Remark 6.6. Lemma 6.5 and Theorem 6.4 allow an interesting observation about approximation by NNs. Indeed, if a set of functions H is sufficiently large so that (6.2) holds, and NNs with M weights and L layers achieve an approximation error less than ε > 0 for every function in H, then M L log_2(M) + M L² ≳ k.
We would now like to establish a lower bound on the size of neural networks that approximate regular functions well. Considering functions f ∈ C^s([0, 1]^d) with ‖f‖_{C^s} ≤ 1, we would, in view of Remark 6.6, like to find out which value of k is achievable for any given ε.
We begin by constructing a single bump function with a controlled C^n norm.
Lemma 6.7. For every n, d ∈ N, there exists a constant C > 0 such that, for every ε > 0, there exists a smooth function f ∈ C^n(R^d) with
f(0) = ε,  supp f ⊂ [−Cε^{1/n}, Cε^{1/n}]^d,  and ‖f‖_{C^n} ≤ 1.    (6.5)
Proof. Let f̃ ∈ C^∞(R^d) with f̃(0) = 1 and supp f̃ ⊂ [−1, 1]^d, and set
f(x) := ε f̃( (ε(1 + ‖f̃‖_{C^n}))^{−1/n} x ).
Then f(0) = ε, supp f ⊂ [−(ε(1 + ‖f̃‖_{C^n}))^{1/n}, (ε(1 + ‖f̃‖_{C^n}))^{1/n}]^d, and ‖f‖_{C^n} ≤ 1 by the chain rule.
Adding up multiple, shifted versions of the function of Lemma 6.7 yields sets of functions that satisfy
(6.2). Concretely, we have the following lemma.
Lemma 6.8. Let n, d ∈ N. There exists C > 0 such that, for every ε > 0, there are points {x_1, . . . , x_k} ⊂ [0, 1]^d with k ≥ Cε^{−d/n} such that, for every b ∈ {0, 1}^k there is f_b ∈ C^n([0, 1]^d) with ‖f_b‖_{C^n} ≤ 1 and
f_b(x_i) = b_i ε.
Proof. Let, for C > 0 as in (6.5), {x_1, . . . , x_k} := Cε^{1/n} Z^d ∩ [0, 1]^d. Clearly, k ≥ C′ε^{−d/n} for a constant C′ > 0. Let b ∈ {0, 1}^k. Now set, for f as in Lemma 6.7,
f_b := Σ_{i=1}^{k} b_i f(· − x_i).
Figure 6.3: Illustration of fb from Lemma 6.8 on [0, 1]2 .
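The construction of Lemma 6.8 can be visualised numerically. The sketch below (my own; the smooth bump ψ, the value of n, and the grid spacing are illustrative choices in the spirit of Lemma 6.7, not the exact constants of the notes) places scaled bumps of height ε on a grid of spacing proportional to ε^{1/n} and checks that f_b takes the value b_i ε at the grid points.

import math

def psi(t):                       # standard C^infinity bump on (-1, 1), psi(0) = 1
    return math.exp(1.0 - 1.0 / (1.0 - t * t)) if abs(t) < 1 else 0.0

n, eps = 3, 1e-2
h = eps ** (1.0 / n)              # support radius ~ eps^{1/n} as in Lemma 6.7 (constants omitted)
grid = [i * 2 * h for i in range(int(1 // (2 * h)) + 1)]   # spacing 2h => disjoint supports
bits = [(i * 7) % 2 for i in range(len(grid))]             # some labelling b in {0,1}^k

def f_b(x):
    return sum(b * eps * psi((x - xi) / h) for b, xi in zip(bits, grid))

print(all(abs(f_b(xi) - b * eps) < 1e-15 for xi, b in zip(grid, bits)))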
Theorem 6.9. Let n, d ∈ N and let % : R → R be piecewise polynomial. Assume that, for all ε > 0, there exist M(ε), L(ε) ∈ N such that
sup_{f ∈ C^n([0,1]^d), ‖f‖_{C^n} ≤ 1}  inf_{Φ ∈ NN_{d, L(ε), M(ε)}} ‖f − R(Φ)‖_∞ ≤ ε/2.
Then
(M(ε) + 1) L(ε) log_2(M(ε) + 1) + (M(ε) + 1) L(ε)² ≳ ε^{−d/n}.
Proof. Let ε > 0, and let {x_1, . . . , x_k} with k ≥ Cε^{−d/n} and the functions (f_b)_{b ∈ {0,1}^k} be as in Lemma 6.8. By assumption, for every b ∈ {0, 1}^k there exists Φ_b ∈ NN_{d, L(ε), M(ε)} with ‖f_b − R(Φ_b)‖_∞ ≤ ε/2; we set G := {R(Φ_b) : b ∈ {0, 1}^k}. Moreover,
{g − ε/2 : g ∈ G} ⊆ {R(Φ) : Φ ∈ NN_{d, L(ε), M(ε)+1}}.
Hence, by Lemma 6.5,
VCdim( { 1_{R+} ∘ R(Φ) : Φ ∈ NN_{d, L(ε), M(ε)+1} } ) ≥ C ε^{−d/n}.
An application of Theorem 6.4 yields the result.
Remark 6.10. Theorem 6.9 shows that to achieve a uniform error of ε > 0 over sets of C^n regular functions requires a number of weights M and layers L such that
M L log_2(M) + M L² ≳ ε^{−d/n}.
If we require L to grow only like log_2(1/ε), then this demonstrates that the rate of Theorem 3.19 / Remark 3.20 is optimal.
For the case that L is arbitrary, [1, Theorem 8.7] yields an upper bound on the VC dimension of
H̃ := ⋃_{ℓ=1}^{∞} { 1_{R+} ∘ R(Φ) : Φ ∈ NN_{d, ℓ, M} }    (6.7)
of the form
VCdim(H̃) ≲ M².    (6.8)
Theorem 6.11. Let n, d ∈ N and let % : R → R be piecewise polynomial. Assume that, for all ε > 0, there exists M(ε) ∈ N such that
sup_{f ∈ C^n([0,1]^d), ‖f‖_{C^n} ≤ 1}  inf_{Φ ∈ ⋃_{ℓ=1}^{∞} NN_{d, ℓ, M(ε)}} ‖f − R(Φ)‖_∞ ≤ ε/2.
Then
M(ε) ≳ ε^{−d/(2n)}.
Proof. The proof is the same as for Theorem 6.9, using (6.8) instead of Theorem 6.4.
Remark 6.12. Comparing Theorem 6.9 and Theorem 6.11, we see that approximation by NNs with arbitrarily many layers can potentially achieve double the rate of approximation with a restricted or only slowly growing number of layers.
Indeed, at least for the ReLU activation function, the lower bound of Theorem 6.11 is sharp. It was shown in [39] that ReLU realisations of NNs with an unrestricted number of layers achieve approximation accuracy ε > 0 using only O(ε^{−d/(2n)}) many weights, uniformly over the unit ball of C^n([0, 1]^d).
Now we have that, for a given architecture S = (d, N1 , . . . , NL ) ∈ NL+1 , a compact set K ⊂ Rd , and for a
continuous activation function % : R → R:
RN N % (S) ⊂ Lp (K),
for all p ∈ [1, ∞]. In this context, we can ask ourselves about the properties of RN N % (S) as a subset of the
normed linear spaces Lp (K).
The results below are based on the following observation about the realisation map:
Theorem 7.2 ([22, Proposition 4]). Let Ω ⊂ R^d be compact and let S = (d, N_1, . . . , N_L) ∈ N^{L+1} be a neural network architecture. If the activation function % : R → R is continuous, then the map
R : NN(S) → L^∞(Ω),  Φ ↦ R(Φ),
is continuous. If % is, in addition, locally Lipschitz continuous, then R is locally Lipschitz continuous.
7.1 Network spaces are not convex
We begin by analysing the simple question of whether, for a given architecture S, the set RNN_%(S) is star-shaped. We start by fixing the notion of a centre and of star-shapedness.
Definition 7.3. Let Z be a subset of a linear space. A point x ∈ Z is called a centre of Z if, for every y ∈ Z, it holds that
{tx + (1 − t)y : t ∈ [0, 1]} ⊂ Z.
A set is called star-shaped if it has at least one centre.
The following proposition follows directly from the definition of a neural network:
Proposition 7.4. Let S = (d, N1 , . . . , NL ) ∈ NL+1 . Then RN N % (S) is scaling invariant, i.e. for every λ ∈ R it
holds that λf ∈ RN N % (S) if f ∈ RN N % (S), and hence 0 ∈ RN N % (S) is a centre of RN N % (S).
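The scaling invariance in Proposition 7.4 is immediate from the definition of the realisation: multiplying the weights and the bias of the last layer by λ multiplies the realisation by λ. A minimal sketch (my own, for a fixed two-layer ReLU architecture):

import numpy as np

rng = np.random.default_rng(2)
A1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)     # first layer of an architecture (3, 5, 1)
A2, b2 = rng.normal(size=(1, 5)), rng.normal(size=1)     # last layer

def realisation(A1, b1, A2, b2, x):
    return A2 @ np.maximum(A1 @ x + b1, 0.0) + b2        # ReLU network realisation

x, lam = rng.normal(size=3), -2.5
lhs = realisation(A1, b1, lam * A2, lam * b2, x)          # network with rescaled last layer
rhs = lam * realisation(A1, b1, A2, b2, x)                # lambda times the original realisation
print(np.allclose(lhs, rhs))                              # True: lambda*f is again a realisation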
Knowing that RN N % (S) is star-shaped with centre 0, we can also ask ourselves if RN N % (S) has more
than this one centre. It is not hard to see that also every constant function is a centre. The following theorem
yields an upper bound on the number of centres.
Theorem 7.5 ([22, Proposition C.4]). Let S = (N_0, N_1, . . . , N_L) be a neural network architecture, let Ω ⊂ R^{N_0}, and let % : R → R be Lipschitz continuous. Then RNN_%(S) contains at most Σ_{ℓ=1}^{L} (N_{ℓ−1} + 1) N_ℓ linearly independent centres, where N_0 = d.
Proof. Let M* := Σ_{ℓ=1}^{L} (N_{ℓ−1} + 1) N_ℓ. We first observe that M* = dim(NN(S)).
Assume towards a contradiction that there are functions (g_i)_{i=1}^{M*+1} ⊂ RNN_%(S) ⊂ L²(Ω) that are linearly independent centres of RNN_%(S).
By the Theorem of Hahn–Banach, there exist (g_i′)_{i=1}^{M*+1} ⊂ (L²(Ω))′ such that g_i′(g_j) = δ_{i,j} for all i, j ∈ {1, . . . , M* + 1}. We define
T : L²(Ω) → R^{M*+1},  g ↦ (g_1′(g), g_2′(g), . . . , g_{M*+1}′(g))^T.
Since T is continuous and linear, we have that T ∘ R is locally Lipschitz continuous by Theorem 7.2. Moreover, since the (g_i)_{i=1}^{M*+1} are linearly independent, they span an (M* + 1)-dimensional linear space V and T(V) = R^{M*+1}.
Next we would like to establish that RNN_%(S) ⊃ V. Let g ∈ V; then
g = Σ_{ℓ=1}^{M*+1} a_ℓ g_ℓ,
for some (a_ℓ)_{ℓ=1}^{M*+1} ⊂ R. We show by induction that g̃^{(m)} := Σ_{ℓ=1}^{m} a_ℓ g_ℓ ∈ RNN_%(S) for every m ≤ M* + 1. This is obviously true for m = 1. Moreover, we have that g̃^{(m+1)} = a_{m+1} g_{m+1} + g̃^{(m)}. Hence the induction step holds true if a_{m+1} = 0. If a_{m+1} ≠ 0, then we have that
g̃^{(m+1)} = 2 a_{m+1} ( (1/2) g_{m+1} + (1/(2 a_{m+1})) g̃^{(m)} ).    (7.1)
By Proposition 7.4 and the induction hypothesis, g̃^{(m)}/a_{m+1} ∈ RNN_%(S). Additionally, g_{m+1} is a centre of RNN_%(S). Therefore, we have that (1/2) g_{m+1} + (1/(2 a_{m+1})) g̃^{(m)} ∈ RNN_%(S). By Proposition 7.4, we conclude that g̃^{(m+1)} ∈ RNN_%(S). Hence V ⊂ RNN_%(S). Therefore, T ∘ R(NN(S)) ⊇ T(V) = R^{M*+1}.
It is a well-known fact of basic analysis that there does not exist a surjective and locally Lipschitz continuous map from R^n to R^{n+1} for any n ∈ N. This yields the contradiction.
For a convex set X, the line between any two points of X is a subset of X. Hence, every point of a convex
set is a centre. This yields the following corollary.
Corollary 7.6. Let S = (N_0, N_1, . . . , N_L) be a neural network architecture, let Ω ⊂ R^{N_0}, and let % : R → R be Lipschitz continuous. If RNN_%(S) contains more than Σ_{ℓ=1}^{L} (N_{ℓ−1} + 1) N_ℓ linearly independent functions, then RNN_%(S) is not convex.
Remark 7.7. It was shown in [22, Theorem 2.1] that the only Lipschitz continuous activation functions such that RNN_%(S) contains not more than Σ_{ℓ=1}^{L} (N_{ℓ−1} + 1) N_ℓ linearly independent functions are the affine linear functions. Additionally, it can be shown that Corollary 7.6 holds for locally Lipschitz continuous activation functions as well. In this case, RNN_%(S) necessarily contains more than Σ_{ℓ=1}^{L} (N_{ℓ−1} + 1) N_ℓ linearly independent functions if the activation function is not a polynomial.
Figure 7.1: Sketch of the set of realisations of neural networks with a fixed architecture. This set is star-shaped,
having 0 in the centre. It is not r-convex for any r and hence we see multiple holes between different rays. It
is not closed, which means that there are limit points outside of the set.
In addition to the non-convexity of RN N % (S), we will now show that, under mild assumptions on the
activation function, RN N % (S) is also very non-convex. Let us first make the notion of convexity quantitative.
Definition 7.8. A subset X of a metric space is called r-convex if ⋃_{x∈X} B_r(x) is convex.
By Proposition 7.4, it is clear that RNN_%(S) + B_r(0) = r(RNN_%(S) + B_1(0)). Hence, RNN_%(S) + B_r(0) is convex if and only if RNN_%(S) + B_{r′}(0) is convex,
for every r, r′ > 0. Therefore, RNN_%(S) is r-convex for one r > 0 if and only if RNN_%(S) is r-convex for every r > 0.
With this observation we can now prove the following result.
Proposition 7.9 ([22, Theorem 2.2.]). Let S ∈ NL+1 , Ω ⊂ RN0 be compact, and % ∈ C 1 be discriminatory and such
that RN N % (S) is not dense in C(Ω). Then there does not exist an r > 0 such that RN N % (S) is r-convex.
Proof. By the discussion leading up to Proposition 7.9, we can assume, towards a contradiction, that RNN_%(S) is r-convex for every r > 0.
We have that
co(RNN_%(S)) ⊂ ⋂_{r>0} (RNN_%(S) + B_r(0)) ⊂ ⋂_{r>0} \overline{RNN_%(S) + B_r(0)} ⊂ \overline{RNN_%(S)}.
Since the sets \overline{RNN_%(S) + B_r(0)} are closed and convex, it follows that \overline{co(RNN_%(S))} ⊂ \overline{RNN_%(S)}, and thus we conclude that \overline{RNN_%(S)} is convex.
We now aim at producing a contradiction by showing that \overline{RNN_%(S)} = C(Ω). We show this for L = 2 and N_2 = 1 only; the general case is demonstrated in [22, Theorem 2.2] (there, also the differentiability of % is used).
Per assumption, for every a ∈ R^{N_0} and t ∈ R,
x ↦ %(⟨a, x⟩ − t) ∈ RNN_%(S).
By the same argument as applied in the proof of Theorem 7.5 in (7.1), we have that for all sequences (a_ℓ)_{ℓ=1}^{∞} ⊂ R^{N_0}, (b_ℓ)_{ℓ=1}^{∞} ⊂ R, and (t_ℓ)_{ℓ=1}^{∞} ⊂ R the function
g^{(m)}(x) := Σ_{ℓ=1}^{m} b_ℓ %(⟨a_ℓ, x⟩ − t_ℓ)
lies in \overline{RNN_%(S)}. Since % is discriminatory, such functions are dense in C(Ω), and hence \overline{RNN_%(S)} = C(Ω), which contradicts the assumption that RNN_%(S) is not dense in C(Ω).
Theorem 7.10. Let L ∈ N, S = (N_0, N_1, . . . , N_{L−1}, 1) ∈ N^{L+1}, where N_1 ≥ 2, let Ω ⊂ R^d be compact with nonempty interior, and let % ∈ C² \ C^∞. Then RNN_%(S) is not closed in L^∞(Ω).
Proof. Since % ∈ C² \ C^∞, there exists k ∈ N such that % ∈ C^k and % ∉ C^{k+1}. It is not hard to see that therefore RNN_%(S) ⊂ C^k(Ω), while the map
F : R^d → R,  x ↦ F(x) := %′(x_1),
is not in C^k(R^d). Therefore, since Ω has non-empty interior, there exists t ∈ R^d so that F(· − t) ∉ C^k(Ω) and thus F(· − t) ∉ RNN_%(S).
Assume for now that S = (N0 , 2, 1). The general statement follows by extending the networks below
to neural networks with architecture (N0 , 2, 1, . . . , 1, 1) by concatenating with the neural networks from
Proposition 2.11. To artificially increase the width of the networks and produce neural networks of architecture
S one can simply zero-pad the weight and shift matrices without altering the associated realisations.
We define the neural network
Φ_n := ( ( [ 1, 0_{1×(N_0−1)} ; 1, 0_{1×(N_0−1)} ], (1/n, 0)^T ), ( [n, −n], 0 ) ),
so that R(Φ_n)(x) = n %(x_1 + 1/n) − n %(x_1). Hence,
|R(Φ_n)(x) − %′(x_1)| = |n(%(x_1 + 1/n) − %(x_1)) − %′(x_1)| ≤ sup_{z∈[−B,B]} |%″(z)|/n,
by the mean value theorem, where B > 0 is such that Ω ⊂ [−B, B]^d. Therefore, R(Φ_n) → F in L^∞(Ω) and hence RNN_%(S) is not closed.
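A numerical illustration of this proof (my own sketch; %(x) = max(x, 0)³ is one concrete choice of an activation in C² \ C^∞, not prescribed by the notes): the realisations n(%(x₁ + 1/n) − %(x₁)) of the networks Φ_n converge uniformly to %′(x₁) at rate 1/n.

import numpy as np

rho = lambda z: np.maximum(z, 0.0) ** 3          # in C^2 but not C^3, hence not C^infinity
rho_prime = lambda z: 3.0 * np.maximum(z, 0.0) ** 2

B = 2.0
x = np.linspace(-B, B, 10001)                    # first coordinate of points in Omega
for n in [10, 100, 1000, 10000]:
    realisation_n = n * (rho(x + 1.0 / n) - rho(x))      # R(Phi_n) as a function of x_1
    err = np.max(np.abs(realisation_n - rho_prime(x)))
    print(f"n={n:6d}  sup-error={err:.2e}   (upper bound ~ sup|rho''|/n = {6.0 * (B + 1) / n:.2e})")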
Remark 7.11. Theorem 7.10 holds in much more generality. In fact, a similar statement holds for various types of
activation functions, see [22, Theorem 3.3]. Surprisingly, the statement does not hold for the ReLU activation function,
[22, Theorem 3.8].
Theorem 7.10 should be contrasted with the following result, which shows that subsets of the set of realisations of neural networks with bounded weights are always closed.
Proposition 7.12. Let S ∈ N^{L+1}, Ω ⊂ R^{N_0} be compact, and % be continuous. For C > 0, we denote by RNN_%(S)^C the set of realisations of neural networks in NN(S) all of whose weights are bounded in absolute value by C. Then RNN_%(S)^C is a closed subset of C(Ω).
Proof. By the Theorem of Heine–Borel, we have that
{Φ ∈ NN(S) : ‖Φ‖_total ≤ C}
is a compact subset of NN(S). Since the realisation map R is continuous by Theorem 7.2, the set RNN_%(S)^C is the image of a compact set under a continuous map, hence compact and in particular closed.
This observation also sheds light on the sequence (Φ_n)_{n∈N} from the proof of Theorem 7.10, where R(Φ_n) → g for a g ∉ RNN_%(S): there, necessarily ‖Φ_n‖_total → ∞, since if ‖Φ_n‖_total remained bounded by some C, then g ∈ \overline{RNN_%(S)^C} = RNN_%(S)^C ⊂ RNN_%(S).
References
[1] M. Anthony and P. L. Bartlett. Neural network learning: theoretical foundations. Cambridge University
Press, Cambridge, 1999.
[2] A. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf.
Theory, 39(3):930–945, 1993.
[3] R. Bellman. On the theory of dynamic programming. Proceedings of the National Academy of Sciences of
the United States of America, 38(8):716, 1952.
[4] E. K. Blum and L. K. Li. Approximation theory and feedforward networks. Neural networks, 4(4):511–515,
1991.
[5] C. K. Chui and H. N. Mhaskar. Deep nets for local manifold learning. Frontiers in Applied Mathematics
and Statistics, 4:12, 2018.
[6] A. Cohen and R. DeVore. Approximation of high-dimensional parametric pdes. Acta Numerica, 24:1–159,
2015.
[7] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. Control Signal Systems,
2(4):303–314, 1989.
[8] R. A. DeVore. Nonlinear approximation. Acta numerica, 7:51–150, 1998.
[9] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In Conference on learning
theory, pages 907–940, 2016.
[10] C. L. Frenzen, T. Sasao, and J. T. Butler. On the number of segments needed in a piecewise linear
approximation. Journal of Computational and Applied mathematics, 234(2):437–446, 2010.
[11] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.
deeplearningbook.org.
[12] J. He, L. Li, J. Xu, and C. Zheng. ReLU deep neural networks and linear finite elements. arXiv preprint
arXiv:1807.03973, 2018.
[13] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators.
Neural Netw., 2(5):359–366, 1989.
[14] A. N. Kolmogorov. The representation of continuous functions of several variables by superpositions of
continuous functions of a smaller number of variables. Doklady Akademii Nauk SSSR, 108(2):179–182,
1956.
[15] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a nonpolynomial
activation function can approximate any function. Neural Netw., 6(6):861–867, 1993.
[16] V. Maiorov and A. Pinkus. Lower bounds for approximation by MLP neural networks. Neurocomputing,
25(1-3):81–91, 1999.
[17] W. McCulloch and W. Pitts. A logical calculus of ideas immanent in nervous activity. Bull. Math. Biophys.,
5:115–133, 1943.
[18] H. N. Mhaskar. Approximation properties of a multilayered feedforward artificial neural network. Adv.
Comput. Math., 1(1):61–80, 1993.
[19] H. N. Mhaskar and T. Poggio. Deep vs. shallow networks: An approximation theory perspective. Analysis
and Applications, 14(06):829–848, 2016.
[20] E. Novak and H. Woźniakowski. Approximation of infinitely differentiable multivariate functions is
intractable. Journal of Complexity, 25(4):398–404, 2009.
[21] P. Oswald. On the degree of nonlinear spline approximation in Besov-Sobolev spaces. J. Approx. Theory,
61(2):131–157, 1990.
[22] P. C. Petersen, M. Raslan, and F. Voigtlaender. Topological properties of the set of functions generated
by neural networks of fixed size. arXiv preprint arXiv:1806.08459, 2018.
[23] P. C. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep
ReLU neural networks. Neural Netw., 180:296–330, 2018.
[24] G. Pisier. Remarques sur un résultat non publié de B. Maurey. Séminaire Analyse fonctionnelle (dit
”Maurey-Schwartz”), 1980-1981.
[25] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao. Why and when can deep-but not shallow-
networks avoid the curse of dimensionality: a review. Int. J. Autom. Comput., 14(5):503–519, 2017.
[26] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the
brain. Psychological review, 65(6):386, 1958.
[27] W. Rudin. Functional analysis. International Series in Pure and Applied Mathematics. McGraw-Hill, Inc.,
New York, second edition, 1991.
[28] W. Rudin. Real and complex analysis. Tata McGraw-Hill Education, 2006.
[29] I. Safran and O. Shamir. Depth-width tradeoffs in approximating natural functions with neural networks.
In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine
Learning Research, pages 2979–2987, 2017.
[30] J. Schmidt-Hieber. Deep ReLU network approximation of functions on a manifold. arXiv preprint
arXiv:1908.00695, 2019.
[31] I. Schoenberg. Cardinal interpolation and spline functions. Journal of Approximation theory, 2(2):167–206,
1969.
[32] U. Shaham, A. Cloninger, and R. R. Coifman. Provable approximation properties for deep neural
networks. Appl. Comput. Harmon. Anal., 44(3):537–557, 2018.
[33] Z. Shen, H. Yang, and S. Zhang. Deep network approximation characterized by number of neurons.
arXiv preprint arXiv:1906.05497, 2019.
[34] T. Suzuki. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces:
optimal rate and curse of dimensionality. arXiv preprint arXiv:1810.08033, 2018.
[35] M. Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101,
2015.
[36] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to
their probabilities. In Measures of complexity, pages 11–30. Springer, 2015.
[37] R. Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47.
Cambridge University Press, 2018.
[38] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw., 94:103–114,
2017.
[39] D. Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks. In Conference
On Learning Theory, pages 639–649, 2018.