
Neural Network Theory

Philipp Christian Petersen

University of Vienna

March 2, 2020

Contents
1 Introduction
2 Classical approximation results by neural networks
2.1 Universality
2.2 Approximation rates
2.3 Basic operations of networks
2.4 Reapproximation of dictionaries
2.5 Approximation of smooth functions
2.6 Fast approximations with Kolmogorov
3 ReLU networks
3.1 Linear finite elements and ReLU networks
3.2 Approximation of the square function
3.3 Approximation of smooth functions
4 The role of depth
4.1 Representation of compactly supported functions
4.2 Number of pieces
4.3 Approximation of non-linear functions
5 High dimensional approximation
5.1 Curse of dimensionality
5.2 Hierarchy assumptions
5.3 Manifold assumptions
5.4 Dimension dependent regularity assumption
6 Complexity of sets of networks
6.1 The growth function and the VC dimension
6.2 Lower bounds on approximation rates
7 Spaces of realisations of neural networks
7.1 Network spaces are not convex
7.2 Network spaces are not closed
1 Introduction
In these notes, we study a mathematical structure called neural networks. These objects have recently received
much attention and have become a central concept in modern machine learning. Historically, however, they
were motivated by the functionality of the human brain. Indeed, the first neural network was devised by
McCulloch and Pitts [17] in an attempt to model a biological neuron.
A McCulloch and Pitts neuron is a function of the form
\[ \mathbb{R}^d \ni x \mapsto 1_{\mathbb{R}_+}\Big(\sum_{i=1}^{d} w_i x_i - \theta\Big), \]

where $d \in \mathbb{N}$, $1_{\mathbb{R}_+} \colon \mathbb{R} \to \mathbb{R}$ with $1_{\mathbb{R}_+}(x) = 0$ for $x < 0$ and $1_{\mathbb{R}_+}(x) = 1$ otherwise, and $w_i, \theta \in \mathbb{R}$ for $i = 1, \ldots, d$. The function $1_{\mathbb{R}_+}$ is a so-called activation function, $\theta$ is called a threshold, and the $w_i$ are weights. The McCulloch and Pitts neuron receives $d$ input signals. If their combined weighted strength exceeds $\theta$, then the neuron fires,
i.e., returns 1. Otherwise the neuron remains inactive.
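As a purely illustrative aside (a minimal ad-hoc Python/NumPy sketch, not part of the formal development of these notes), a McCulloch and Pitts neuron can be evaluated in a few lines:

import numpy as np

# A McCulloch-Pitts neuron: fires (returns 1) iff the weighted input sum
# meets or exceeds the threshold theta.
def mcculloch_pitts(x, w, theta):
    return 1.0 if np.dot(w, x) - theta >= 0 else 0.0

# Example: a neuron with d = 3 inputs that fires when x1 + x2 + x3 >= 2.
w = np.array([1.0, 1.0, 1.0])
print(mcculloch_pitts(np.array([1.0, 1.0, 0.0]), w, theta=2.0))  # 1.0
print(mcculloch_pitts(np.array([1.0, 0.0, 0.0]), w, theta=2.0))  # 0.0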
A network of neurons can be constructed by linking multiple neurons together in the sense that the output
of one neuron forms an input to another. A simple model for such a network is the multilayer perceptron∗ as
introduced by Rosenblatt [26].
Definition 1.1. Let d, L ∈ N, L ≥ 2 and % : R → R. Then a multilayer perceptron (MLP) with d-dimensional
input, L layers, and activation function % is a function F that can be written as

\[ x \mapsto F(x) := T_L\big(\varrho\big(T_{L-1}\big(\ldots \varrho(T_1(x)) \ldots\big)\big)\big), \tag{1.1} \]
where $T_\ell(x) = A_\ell x + b_\ell$ with $A_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ and $b_\ell \in \mathbb{R}^{N_\ell}$ for $N_\ell \in \mathbb{N}$, $N_0 = d$, and $\ell = 1, \ldots, L$. Here $\varrho \colon \mathbb{R} \to \mathbb{R}$ is applied coordinate-wise.
The neurons in the MLP correspond again to the applications of $\varrho \colon \mathbb{R} \to \mathbb{R}$, even though, in contrast to the McCulloch and Pitts neuron, we now allow arbitrary $\varrho$. In Figure 1.1, we visualise an MLP. We should notice that the MLP does not allow arbitrary connections between neurons, but only connections between neurons in adjacent layers, and only from lower layers to higher layers.

Figure 1.1: Illustration of a multilayer perceptron with 5 layers and layer widths $N_0 = 8$, $N_1 = 12$, $N_2 = 12$, $N_3 = 12$, $N_4 = 8$, $N_5 = 1$. The red dots correspond to the neurons.

While the MLP, or variations thereof, is probably the most widely used type of neural network in practice, it is very different from its biological motivation. Connections only between adjacent layers and arbitrary activation functions make for an efficient numerical scheme but are not a good representation of the biological reality.
∗ We will later introduce a notion of neural networks that differs slightly from that of a multilayer perceptron.
Nowadays, the field of neural network theory draws most of its motivation from the fact that deep neural
networks are applied in a technique called deep learning [11]. In deep learning, one is concerned with the
algorithmic identification of the most suitable deep neural network for a specific application. It is, therefore,
reasonable to search for purely mathematical arguments why and under which conditions a MLP is an
adequate architecture in practice instead of taking the motivation from the fact that biological neural networks
perform well.
In these notes, we will study deep neural networks with a very narrow focus. We will exclude all algorithmic aspects of deep learning and concentrate fully on a rigorous, functional-analytic framework. On the one hand, following this focussed approach, it must be clear that we will not be able to provide a comprehensive answer to why deep learning methods perform particularly well. On the other hand, we will see that this focus allows us to make rigorous statements which do provide explanations and intuition as to why certain neural network architectures are preferable over others.
Concretely, we will identify many mathematical properties of sets of MLPs which explain, to some extent, practically observed phenomena in machine learning. For example, we will see explanations of why deep neural networks are, in some sense, superior to shallow neural networks, or why the neural network architecture can efficiently approximate high-dimensional functions when most classical approximation schemes cannot.

2 Classical approximation results by neural networks


The very first question that we would naturally ask ourselves is which functions we can express as a MLP.
Given that the activation function is fixed, it is conceivable that the set of functions that can be represented or
approximated could be quite small.
Example 2.1. • For linear activation functions %(x) = ax, a ∈ R it is clear that every MLP with this activation
function is an affine linear map.
• More generally, if $\varrho$ is a polynomial of degree $k \in \mathbb{N}$, then every MLP with $L$ layers is a polynomial of degree at most $k^{L-1}$.†
Example 2.1 demonstrates that, under some assumptions on the activation function, not every function can be represented, or even approximated, by MLPs of fixed depth.

2.1 Universality
One of the most famous results in neural network theory is that, under minor conditions on the activation
function, the set of networks is very expressive, meaning that every continuous function on a compact set can
be arbitrarily well approximated by a MLP. This theorem was first shown by Hornik [13] and Cybenko [7].
To talk about approximation, we first need to define a topology on a space of functions of interest. We define, for $K \subset \mathbb{R}^d$,
\[ C(K) := \{ f \colon K \to \mathbb{R} : f \text{ continuous} \} \]
and we equip $C(K)$ with the uniform norm
\[ \|f\|_\infty := \sup_{x \in K} |f(x)|. \]

If $K$ is a compact space, then the representation theorem of Riesz [28, Theorem 6.19] tells us that the topological dual space of $C(K)$ is the space
\[ \mathcal{M} := \{ \mu : \mu \text{ is a signed Borel measure on } K \}. \]


†A diligent student would probably want to verify this.

Having fixed the topology on C(K), we can define the concept of universality next.
Definition 2.2. Let % : R → R be continuous, d, L ∈ N and K ⊂ Rd be compact. Denote by MLP(%, d, L) the set of
all MLPs with d-dimensional input, L layers, NL = 1, and activation function %.
We say that MLP(%, d, L) is universal, if MLP(%, d, L) is dense in C(K).
Example 2.1 demonstrates that MLP(%, d, L) is not universal for every activation function.
Definition 2.3. Let $d \in \mathbb{N}$ and $K \subset \mathbb{R}^d$ be compact. A continuous function $f \colon \mathbb{R} \to \mathbb{R}$ is called discriminatory if the only measure $\mu \in \mathcal{M}$ such that
\[ \int_K f(ax - b) \, d\mu(x) = 0, \quad \text{for all } a \in \mathbb{R}^d,\ b \in \mathbb{R}, \]
is $\mu = 0$.

Theorem 2.4 (Universal approximation theorem [7]). Let $d \in \mathbb{N}$, $K \subset \mathbb{R}^d$ compact, and let $\varrho \colon \mathbb{R} \to \mathbb{R}$ be discriminatory. Then MLP($\varrho$, d, 2) is universal.

Proof. We start by observing that MLP($\varrho$, d, 2) is a linear subspace of $C(K)$. Assume towards a contradiction that MLP($\varrho$, d, 2) is not dense in $C(K)$. Then there exists $h \in C(K) \setminus \overline{\mathrm{MLP}(\varrho, d, 2)}$.
By the theorem of Hahn-Banach [28, Theorem 5.19] there exists a functional
\[ 0 \neq H \in C(K)' \]
so that $H = 0$ on $\overline{\mathrm{MLP}(\varrho, d, 2)}$. Since, for $a \in \mathbb{R}^d$, $b \in \mathbb{R}$,
\[ x \mapsto \varrho(ax - b) =: \varrho_{a,b} \in \mathrm{MLP}(\varrho, d, 2), \]
we have that $H(\varrho_{a,b}) = 0$ for all $a \in \mathbb{R}^d$, $b \in \mathbb{R}$. Finally, by the identification $C(K)' = \mathcal{M}$, there exists a non-zero measure $\mu$ so that
\[ \int_K \varrho_{a,b} \, d\mu = 0, \quad \text{for all } a \in \mathbb{R}^d,\ b \in \mathbb{R}. \]
This is a contradiction to the assumption that $\varrho$ is discriminatory.


At this point, we know that all discriminatory activation functions lead to universal spaces of MLPs. Since
the property of being discriminatory seems hard to verify directly, we are now interested in identifying more
accessible sufficient conditions guaranteeing this property.
Definition 2.5. A continuous function f : R → R such that f (x) → 1 for x → ∞ and f (x) → 0 for x → −∞ is
called sigmoidal.

Proposition 2.6. Let d ∈ N, K ⊂ Rd be compact. Then every sigmoidal function f : R → R is discriminatory.


Proof. Let $f$ be sigmoidal. Then it is clear from Definition 2.5 that, for $\lambda \to \infty$,
\[ f(\lambda(ax - b) + \theta) \to \begin{cases} 1 & \text{if } ax - b > 0, \\ f(\theta) & \text{if } ax - b = 0, \\ 0 & \text{if } ax - b < 0. \end{cases} \]
As $f$ is bounded and $K$ compact, we conclude by the dominated convergence theorem that, for every $\mu \in \mathcal{M}$,
\[ \int_K f(\lambda(a \cdot{} - b) + \theta) \, d\mu \to \int_{H_{a,b,>}} 1 \, d\mu + \int_{H_{a,b,=}} f(\theta) \, d\mu, \]
where
\[ H_{a,b,>} := \{x \in K : ax - b > 0\} \quad \text{and} \quad H_{a,b,=} := \{x \in K : ax - b = 0\}. \]

Figure 2.1: A sigmoidal function according to Definition 2.5.

Now assume that
\[ \int_K f(\lambda(a \cdot{} - b) + \theta) \, d\mu = 0 \]
for all $a \in \mathbb{R}^d$, $b \in \mathbb{R}$. Then
\[ \int_{H_{a,b,>}} 1 \, d\mu + \int_{H_{a,b,=}} f(\theta) \, d\mu = 0 \]
and, letting $\theta \to -\infty$, we conclude that $\int_{H_{a,b,>}} 1 \, d\mu = 0$ for all $a \in \mathbb{R}^d$, $b \in \mathbb{R}$.
For fixed $a \in \mathbb{R}^d$ and $b_1 < b_2$, we have that
\[ 0 = \int_{H_{a,b_1,>}} 1 \, d\mu - \int_{H_{a,b_2,>}} 1 \, d\mu = \int_K 1_{[b_1, b_2]}(ax) \, d\mu(x). \]

By linearity, we conclude that
\[ 0 = \int_K g(ax) \, d\mu(x) \tag{2.1} \]
for every step function $g$. By a density argument and the dominated convergence theorem, we have that (2.1) holds for every bounded continuous function $g$. Thus (2.1) holds, in particular, for $g = \sin$ and $g = \cos$. We conclude that
\[ 0 = \int_K \cos(ax) + i \sin(ax) \, d\mu(x) = \int_K e^{iax} \, d\mu(x). \]
This implies that the Fourier transform of the measure $\mu$ vanishes. This can only happen if $\mu = 0$, [27, p. 176].
Remark 2.7. Universality results can be achieved under significantly weaker assumptions than sigmoidality. For
example, in [15] it is shown that Example 2.1 already contains all continuous activation functions that do not generate
universal sets of MLPs.

2.2 Approximation rates


We saw in Theorem 2.4 that MLPs form universal approximators. However, neither the result nor its proof gives any indication of how "large" MLPs need to be to achieve a certain approximation accuracy.

Before we can even begin to analyse this question, we need to introduce a precise notion of the size of an MLP. One option could certainly be to count the number of neurons, i.e., $\sum_{\ell=1}^{L} N_\ell$ in (1.1) of Definition 1.1. However, since an MLP was defined as a function, it is by no means clear whether there is a unique representation with a unique number of neurons. Hence, the notion of "number of neurons" of an MLP requires some clarification.
Definition 2.8. Let $d, L \in \mathbb{N}$. A neural network (NN) with input dimension $d$ and $L$ layers is a sequence of matrix-vector tuples
\[ \Phi = \big( (A_1, b_1), (A_2, b_2), \ldots, (A_L, b_L) \big), \]
where $N_0 := d$ and $N_1, \ldots, N_L \in \mathbb{N}$, and where $A_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ and $b_\ell \in \mathbb{R}^{N_\ell}$ for $\ell = 1, \ldots, L$.
For a NN $\Phi$ and an activation function $\varrho \colon \mathbb{R} \to \mathbb{R}$, we define the associated realisation of the NN $\Phi$ as
\[ R(\Phi) \colon \mathbb{R}^d \to \mathbb{R}^{N_L} \colon x \mapsto x_L := R(\Phi)(x), \]
where the output $x_L \in \mathbb{R}^{N_L}$ results from
\[ x_0 := x, \qquad x_\ell := \varrho(A_\ell x_{\ell-1} + b_\ell) \ \text{ for } \ell = 1, \ldots, L-1, \qquad x_L := A_L x_{L-1} + b_L. \tag{2.2} \]
Here $\varrho$ is understood to act component-wise.


We call $N(\Phi) := d + \sum_{j=1}^{L} N_j$ the number of neurons of the NN $\Phi$, $L(\Phi) := L$ the number of layers or depth, and $M(\Phi) := \sum_{j=1}^{L} M_j(\Phi) := \sum_{j=1}^{L} \|A_j\|_0 + \|b_j\|_0$ the number of weights of $\Phi$. Here $\|\cdot\|_0$ denotes the number of non-zero entries of a matrix or vector.
According to the notion of Definition 2.8, an MLP is the realisation of a NN.
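As an illustration of Definition 2.8 (a minimal ad-hoc Python/NumPy sketch; the names realisation and num_weights are not standard), the following snippet stores a NN as a list of matrix-vector tuples and evaluates its realisation via (2.2):

import numpy as np

def realisation(Phi, rho):
    # Phi is a list of (A_l, b_l); rho acts componentwise, cf. (2.2).
    def R(x):
        for A, b in Phi[:-1]:
            x = rho(A @ x + b)
        A_L, b_L = Phi[-1]
        return A_L @ x + b_L
    return R

def num_weights(Phi):
    # M(Phi): number of non-zero entries of all A_l and b_l.
    return sum(np.count_nonzero(A) + np.count_nonzero(b) for A, b in Phi)

# Example: a random two-layer network with ReLU activation.
rng = np.random.default_rng(0)
Phi = [(rng.standard_normal((5, 3)), rng.standard_normal(5)),
       (rng.standard_normal((1, 5)), rng.standard_normal(1))]
R = realisation(Phi, lambda t: np.maximum(t, 0.0))
print(R(np.array([0.1, -0.2, 0.3])), num_weights(Phi))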

2.3 Basic operations of networks


Before we analyse how many weights and neurons NNs need to possess so that their realisations approximate
certain functions well, we first establish a couple of elementary operations that one can perform with NNs.
This formalism was developed first in [23].
To understand the purpose of the following formalism, we start with the following question: Given two realisations of NNs $f_1 \colon \mathbb{R}^{d} \to \mathbb{R}^{d'}$ and $f_2 \colon \mathbb{R}^{d'} \to \mathbb{R}^{d''}$, is it the case that the function
\[ x \mapsto f_2(f_1(x)) \]
is the realisation of a NN, and how many weights, neurons, and layers does this new network need to have?
Given two functions $f_1 \colon \mathbb{R}^{d'} \to \mathbb{R}^{d''}$ and $f_2 \colon \mathbb{R}^{d} \to \mathbb{R}^{d'}$, where $d, d', d'' \in \mathbb{N}$, we denote by $f_1 \circ f_2$ the composition of these functions, i.e., $f_1 \circ f_2(x) = f_1(f_2(x))$ for $x \in \mathbb{R}^d$. Indeed, a similar concept is possible for NNs.
Definition 2.9. Let $L_1, L_2 \in \mathbb{N}$ and let $\Phi^1 = ((A_1^1, b_1^1), \ldots, (A_{L_1}^1, b_{L_1}^1))$, $\Phi^2 = ((A_1^2, b_1^2), \ldots, (A_{L_2}^2, b_{L_2}^2))$ be two NNs such that the input layer of $\Phi^1$ has the same dimension as the output layer of $\Phi^2$. Then $\Phi^1 \bullet \Phi^2$ denotes the following $L_1 + L_2 - 1$ layer network:
\[ \Phi^1 \bullet \Phi^2 := \big( (A_1^2, b_1^2), \ldots, (A_{L_2-1}^2, b_{L_2-1}^2), (A_1^1 A_{L_2}^2,\ A_1^1 b_{L_2}^2 + b_1^1), (A_2^1, b_2^1), \ldots, (A_{L_1}^1, b_{L_1}^1) \big). \]
We call $\Phi^1 \bullet \Phi^2$ the concatenation of $\Phi^1$ and $\Phi^2$.


It is left as an exercise to show that
\[ R\big(\Phi^1 \bullet \Phi^2\big) = R\big(\Phi^1\big) \circ R\big(\Phi^2\big). \]

A second important operation is that of parallelisation.

Figure 2.2: Top: Two networks. Bottom: Concatenation of both networks according to Definition 2.9.
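The exercise above can also be checked numerically. The following ad-hoc Python sketch implements the concatenation formula of Definition 2.9 and verifies $R(\Phi^1 \bullet \Phi^2) = R(\Phi^1) \circ R(\Phi^2)$ for randomly chosen networks (illustration only, not a reference implementation):

import numpy as np

def concatenate(Phi1, Phi2):
    # Definition 2.9: merge the last layer of Phi2 with the first layer of Phi1.
    A1_first, b1_first = Phi1[0]
    A2_last, b2_last = Phi2[-1]
    merged = (A1_first @ A2_last, A1_first @ b2_last + b1_first)
    return Phi2[:-1] + [merged] + Phi1[1:]

def realisation(Phi, rho):
    def R(x):
        for A, b in Phi[:-1]:
            x = rho(A @ x + b)
        A_L, b_L = Phi[-1]
        return A_L @ x + b_L
    return R

rng = np.random.default_rng(1)
rho = np.tanh
Phi2 = [(rng.standard_normal((4, 3)), rng.standard_normal(4)),
        (rng.standard_normal((2, 4)), rng.standard_normal(2))]
Phi1 = [(rng.standard_normal((5, 2)), rng.standard_normal(5)),
        (rng.standard_normal((1, 5)), rng.standard_normal(1))]
x = rng.standard_normal(3)
lhs = realisation(concatenate(Phi1, Phi2), rho)(x)
rhs = realisation(Phi1, rho)(realisation(Phi2, rho)(x))
print(np.allclose(lhs, rhs))  # True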

Definition 2.10. Let $L, d_1, d_2 \in \mathbb{N}$ and let $\Phi^1 = ((A_1^1, b_1^1), \ldots, (A_L^1, b_L^1))$, $\Phi^2 = ((A_1^2, b_1^2), \ldots, (A_L^2, b_L^2))$ be two NNs with $L$ layers and with $d_1$-dimensional and $d_2$-dimensional input, respectively. We define
\[ 1.\ \ P\big(\Phi^1, \Phi^2\big) := \big( (\widehat{A}_1, \widehat{b}_1), (\widetilde{A}_2, \widetilde{b}_2), \ldots, (\widetilde{A}_L, \widetilde{b}_L) \big), \quad \text{if } d_1 = d_2, \]
\[ 2.\ \ \mathrm{FP}\big(\Phi^1, \Phi^2\big) := \big( (\widetilde{A}_1, \widetilde{b}_1), \ldots, (\widetilde{A}_L, \widetilde{b}_L) \big), \quad \text{for arbitrary } d_1, d_2 \in \mathbb{N}, \]
where
\[ \widehat{A}_1 := \begin{pmatrix} A_1^1 \\ A_1^2 \end{pmatrix}, \quad \widehat{b}_1 := \begin{pmatrix} b_1^1 \\ b_1^2 \end{pmatrix}, \quad \text{and} \quad \widetilde{A}_\ell := \begin{pmatrix} A_\ell^1 & 0 \\ 0 & A_\ell^2 \end{pmatrix}, \quad \widetilde{b}_\ell := \begin{pmatrix} b_\ell^1 \\ b_\ell^2 \end{pmatrix} \quad \text{for } 1 \le \ell \le L. \]
$P(\Phi^1, \Phi^2)$ is a NN with $d_1 (= d_2)$-dimensional input and $L$ layers, called the parallelisation with shared inputs of $\Phi^1$ and $\Phi^2$. $\mathrm{FP}(\Phi^1, \Phi^2)$ is a NN with $(d_1 + d_2)$-dimensional input and $L$ layers, called the parallelisation without shared inputs of $\Phi^1$ and $\Phi^2$.

Figure 2.3: Top: Two networks. Bottom: Parallelisation with shared inputs of both networks according to
Definition 2.10.

One readily verifies that $M(P(\Phi^1, \Phi^2)) = M(\mathrm{FP}(\Phi^1, \Phi^2)) = M(\Phi^1) + M(\Phi^2)$, and
\[ R_\varrho\big(P(\Phi^1, \Phi^2)\big)(x) = \big( R_\varrho(\Phi^1)(x), R_\varrho(\Phi^2)(x) \big), \quad \text{for all } x \in \mathbb{R}^{d_1}. \tag{2.3} \]
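The block structure of Definition 2.10 and property (2.3) can likewise be illustrated with a small ad-hoc Python sketch (shared-input case only; the names are illustrative):

import numpy as np

def parallelise(Phi1, Phi2):
    # Definition 2.10: stack the first layers, arrange later layers block-diagonally.
    (A1, b1), (A2, b2) = Phi1[0], Phi2[0]
    layers = [(np.vstack([A1, A2]), np.concatenate([b1, b2]))]
    for (A1, b1), (A2, b2) in zip(Phi1[1:], Phi2[1:]):
        A = np.block([[A1, np.zeros((A1.shape[0], A2.shape[1]))],
                      [np.zeros((A2.shape[0], A1.shape[1])), A2]])
        layers.append((A, np.concatenate([b1, b2])))
    return layers

def realisation(Phi, rho):
    def R(x):
        for A, b in Phi[:-1]:
            x = rho(A @ x + b)
        A_L, b_L = Phi[-1]
        return A_L @ x + b_L
    return R

rng = np.random.default_rng(2)
rho = np.tanh
Phi1 = [(rng.standard_normal((4, 3)), rng.standard_normal(4)),
        (rng.standard_normal((2, 4)), rng.standard_normal(2))]
Phi2 = [(rng.standard_normal((5, 3)), rng.standard_normal(5)),
        (rng.standard_normal((1, 5)), rng.standard_normal(1))]
x = rng.standard_normal(3)
out = realisation(parallelise(Phi1, Phi2), rho)(x)
print(np.allclose(out[:2], realisation(Phi1, rho)(x)),
      np.allclose(out[2:], realisation(Phi2, rho)(x)))  # True True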

We depict the parallelisation of two networks in Figure 2.3. Using the concatenation, we can, for example,
increase the depth of networks without significantly changing their output if we can build a network that
realises the identity function. We demonstrate how to approximate the identity function below. This is our
first quantitative approximation result.
Proposition 2.11. Let $d \in \mathbb{N}$, $K \subset \mathbb{R}^d$ compact, and let $\varrho \colon \mathbb{R} \to \mathbb{R}$ be differentiable and not constant on an open set. Then, for every $\epsilon > 0$, there exists a NN $\Phi = ((A_1, b_1), (A_2, b_2))$ such that $A_1, A_2 \in \mathbb{R}^{d \times d}$, $b_1, b_2 \in \mathbb{R}^d$, $M(\Phi) \le 4d$, and
\[ |R(\Phi)(x) - x| < \epsilon, \quad \text{for all } x \in K. \]
Proof. Assume $d = 1$; the general case $d \in \mathbb{N}$ then follows immediately by parallelisation without shared inputs.
Let $x^* \in \mathbb{R}$ be such that $\varrho$ is differentiable in a neighbourhood of $x^*$ and $\varrho'(x^*) = \theta \neq 0$. Define, for $\lambda > 0$,
\[ b_1 := x^*, \quad A_1 := 1/\lambda, \quad b_2 := -\lambda \varrho(x^*)/\theta, \quad A_2 := \lambda/\theta. \]
Then we have, for all $x \in K$,
\[ |R(\Phi)(x) - x| = \left| \lambda \frac{\varrho(x/\lambda + x^*) - \varrho(x^*)}{\theta} - x \right|. \tag{2.4} \]
If $x = 0$, then (2.4) shows that $|R(\Phi)(x) - x| = 0$. Otherwise
\[ |R(\Phi)(x) - x| = \frac{|x|}{|\theta|} \left| \frac{\varrho(x/\lambda + x^*) - \varrho(x^*)}{x/\lambda} - \theta \right|. \]
By the definition of the derivative, we have that $|R(\Phi)(x) - x| \to 0$ for $\lambda \to \infty$ and all $x \in K$.
Remark 2.12. It follows from Proposition 2.11 that under the assumptions of Theorem 2.4 and Proposition 2.11 we
have that MLP(%, d, L) is universal for every L ∈ N, L ≥ 2.
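The scaling argument in the proof of Proposition 2.11 can be observed numerically. The following toy sketch uses $\varrho = \tanh$, $x^* = 0$ and $\theta = 1$, an assumption made purely for illustration:

import numpy as np

# Identity approximation of Proposition 2.11 for rho = tanh, x* = 0, theta = 1:
# x -> lam * (rho(x/lam + x*) - rho(x*)) / theta.
rho, x_star, theta = np.tanh, 0.0, 1.0
x = np.linspace(-2.0, 2.0, 101)
for lam in [1.0, 10.0, 100.0, 1000.0]:
    approx = lam * (rho(x / lam + x_star) - rho(x_star)) / theta
    print(lam, np.max(np.abs(approx - x)))  # the error decreases as lam grows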
The operations above can be performed for quite general activation functions. If a special activation is
chosen, then different operations are possible. In Section 3, we will, for example, introduce an exact emulation
of the identity function by realisations of networks with the so-called ReLU activation function.

2.4 Reapproximation of dictionaries


Approximation theory is a well-established field in applied mathematics. This field is concerned with establishing the trade-off between the size of certain sets and their capability of approximately representing a function. Concretely, let $H$ be a normed space, let $(A_N)_{N \in \mathbb{N}}$ be a nested sequence (i.e., $A_N \subset A_{N+1}$ for every $N \in \mathbb{N}$) of subsets of $H$, and let $\mathcal{C} \subset H$.
For $N \in \mathbb{N}$, we are interested in the following number:
\[ \sigma(A_N, \mathcal{C}) := \sup_{f \in \mathcal{C}} \inf_{g \in A_N} \|f - g\|_H. \tag{2.5} \]

Here, σ(AN , C) denotes the worst-case error when approximating every element of C by the closest element
in AN . Quite often, it is not so simple to precisely compute σ(AN , C) but instead we can only establish an
asymptotic approximation rate. If h : N → R+ is such that

σ(AN , C) = O(h(N )), for N → ∞, (2.6)

then we say that (AN )N ∈N achieves an approximation rate of h for C.

Definition 2.13. A typical example of nested spaces whose approximation capabilities we want to understand are spaces of sparse representations in a basis or, more generally, in a dictionary. Let $\mathcal{D} := (f_i)_{i=1}^{\infty} \subset H$ be a dictionary.‡ We define the spaces
\[ A_N := \Big\{ \sum_{i=1}^{\infty} c_i f_i : \|c\|_0 \le N \Big\}. \tag{2.7} \]
Here $\|c\|_0 = \#\{i \in \mathbb{N} : c_i \neq 0\}$.
With this notion of $A_N$, we call $\sigma(A_N, \mathcal{C})$ the best $N$-term approximation error of $\mathcal{C}$ with respect to $\mathcal{D}$. Moreover, if $h$ satisfies (2.6), then we say that $\mathcal{D}$ achieves a rate of best $N$-term approximation error of $h$ for $\mathcal{C}$.

We can introduce a simple procedure to lift approximation theoretical results for $N$-term approximation to approximation theoretical results for NNs.

Theorem 2.14. Let $d \in \mathbb{N}$, let $H \subset \{f : \mathbb{R}^d \to \mathbb{R}\}$ be a normed space, $\varrho \colon \mathbb{R} \to \mathbb{R}$, and let $\mathcal{D} := (f_i)_{i=1}^{\infty} \subset H$ be a dictionary. Assume that there exist $L, C \in \mathbb{N}$ such that, for every $i \in \mathbb{N}$ and every $\epsilon > 0$, there exists a NN $\Phi_i^\epsilon$ such that
\[ L(\Phi_i^\epsilon) = L, \qquad M(\Phi_i^\epsilon) \le C, \qquad \|R(\Phi_i^\epsilon) - f_i\|_H \le \epsilon. \tag{2.8} \]
For every $\mathcal{C} \subset H$, define $A_N$ as in (2.7) and
\[ B_N := \{ R(\Phi) : \Phi \text{ is a NN with } d\text{-dimensional input},\ L(\Phi) = L,\ M(\Phi) \le N \}. \]
Then, for every $\mathcal{C} \subset H$,
\[ \sigma(B_{CN}, \mathcal{C}) \le \sigma(A_N, \mathcal{C}). \]

Proof. We aim to show that every element in $A_N$ can be approximated to arbitrary precision by a NN with $L$ layers and at most $CN$ weights.
Let $a \in A_N$; then $a = \sum_{j=1}^{N} c_{i(j)} f_{i(j)}$. Let $\epsilon > 0$. Then, by (2.8), there exist NNs $(\Phi_j^\epsilon)_{j=1}^{N}$ such that
\[ L(\Phi_j^\epsilon) = L, \qquad M(\Phi_j^\epsilon) \le C, \qquad \big\| R(\Phi_j^\epsilon) - f_{i(j)} \big\|_H \le \epsilon / (N \|c\|_\infty). \tag{2.9} \]
We define $\Phi^c := \big( ([c_{i(1)}, c_{i(2)}, \ldots, c_{i(N)}], 0) \big)$ and $\Phi^{a,\epsilon} := \Phi^c \bullet P(\Phi_1^\epsilon, \Phi_2^\epsilon, \ldots, \Phi_N^\epsilon)$. Now it is clear, by the triangle inequality, that
\[ \| R(\Phi^{a,\epsilon}) - a \| = \Big\| \sum_{j=1}^{N} c_{i(j)} \big( f_{i(j)} - R(\Phi_j^\epsilon) \big) \Big\| \le \sum_{j=1}^{N} |c_{i(j)}| \, \big\| f_{i(j)} - R(\Phi_j^\epsilon) \big\| \le \epsilon. \]
Per Definition 2.9, $L(\Phi^c \bullet P(\Phi_1^\epsilon, \ldots, \Phi_N^\epsilon)) = L(P(\Phi_1^\epsilon, \ldots, \Phi_N^\epsilon)) = L$, and it is not hard to see that
\[ M\big(\Phi^c \bullet P(\Phi_1^\epsilon, \ldots, \Phi_N^\epsilon)\big) \le M\big(P(\Phi_1^\epsilon, \ldots, \Phi_N^\epsilon)\big) \le N \max_{j=1,\ldots,N} M(\Phi_j^\epsilon) \le NC. \]

Remark 2.15. In words, Theorem 2.14 states that we can transfer a classical N -term approximation result to approxi-
mation by realisations of NNs if we can approximate every element from the underlying dictionary arbitrarily well by
NNs. It turns out that, under the right assumptions on the activation function, Condition (2.8) is quite often satisfied.
We will see one instance of such a result in the following subsection and another one in Proposition 3.3 below.
‡ We assume here and in the sequel that a dictionary contains only countably many elements. This assumption is not necessary, but

simplifies the notation a bit.

2.5 Approximation of smooth functions
We shall proceed by demonstrating that (2.8) holds for the dictionary of multivariate B-splines. This idea was probably first applied by Mhaskar in [18].
Towards our first concrete approximation result, we therefore start by reviewing some approximation properties of B-splines: The univariate cardinal B-spline on $[0, k]$ of order $k \in \mathbb{N}$ is given by
\[ N_k(x) := \frac{1}{(k-1)!} \sum_{\ell=0}^{k} (-1)^\ell \binom{k}{\ell} (x - \ell)_+^{k-1}, \quad \text{for } x \in \mathbb{R}, \tag{2.10} \]
where we adopt the convention that $0^0 = 0$.
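For illustration (an ad-hoc sketch; for $k = 1$ one would additionally have to enforce the convention $0^0 = 0$ by hand), formula (2.10) can be evaluated directly:

import numpy as np
from math import comb, factorial

# Univariate cardinal B-spline N_k of (2.10); for k = 2 this is the hat function on [0, 2].
def cardinal_bspline(x, k):
    x = np.asarray(x, dtype=float)
    s = np.zeros_like(x)
    for l in range(k + 1):
        s += (-1) ** l * comb(k, l) * np.maximum(x - l, 0.0) ** (k - 1)
    return s / factorial(k - 1)

x = np.linspace(0.0, 2.0, 5)
print(cardinal_bspline(x, k=2))  # 0, 0.5, 1, 0.5, 0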


For $t \in \mathbb{R}$ and $\ell \in \mathbb{N}$, we define $N_{\ell,t,k} := N_k(2^\ell(\cdot - t))$. Additionally, we denote, for $d \in \mathbb{N}$, $\ell \in \mathbb{N}$, $t \in \mathbb{R}^d$, the multivariate B-splines by
\[ N_{\ell,t,k}^d(x) := \prod_{i=1}^{d} N_{\ell,t_i,k}(x_i), \quad \text{for } x = (x_1, \ldots, x_d) \in \mathbb{R}^d. \]
Finally, for $d \in \mathbb{N}$, we define the dictionary of dyadic B-splines of order $k$ by
\[ \mathcal{B}^k := \big\{ N_{\ell,t,k}^d : \ell \in \mathbb{N},\ t \in 2^{-\ell} \mathbb{Z}^d \big\}. \tag{2.11} \]

Best $N$-term approximation by multivariate B-splines is a well-studied field. For example, we have the following result by Oswald.

Theorem 2.16 ([21, Theorem 7]). Let $d, k \in \mathbb{N}$, $p \in (0, \infty]$, $0 < s \le k$. Then there exists $C > 0$ such that, for every $f \in C^s([0,1]^d)$, every $\delta > 0$, and every $N \in \mathbb{N}$, there exist $c_i \in \mathbb{R}$ with $|c_i| \le C \|f\|_\infty$ and $B_i \in \mathcal{B}^k$ for $i = 1, \ldots, N$ such that
\[ \Big\| f - \sum_{i=1}^{N} c_i B_i \Big\|_{L^p} \lesssim N^{\frac{\delta - s}{d}} \|f\|_{C^s}. \]
In particular, for $\mathcal{C} := \{f \in C^s([0,1]^d) : \|f\|_{C^s} \le 1\}$, we have that $\mathcal{B}^k$ achieves a rate of best $N$-term approximation error of order $N^{(\delta - s)/d}$ for every $\delta > 0$.(a)

(a) In [21, Theorem 7] this statement is formulated in much more generality. We cite here a simplified version so that we do not have to introduce Besov spaces.

To obtain an approximation result by NNs via Theorem 2.14, we now only need to check under which conditions every element of the B-spline dictionary can be approximated arbitrarily well by a NN. In this regard, we first fix a class of activation functions.

Definition 2.17. A function $\varrho \colon \mathbb{R} \to \mathbb{R}$ is called sigmoidal of order $q \in \mathbb{N}$, if $\varrho \in C^{q-1}(\mathbb{R})$ and
\[ \frac{\varrho(x)}{x^q} \to 0 \ \text{ for } x \to -\infty, \qquad \frac{\varrho(x)}{x^q} \to 1 \ \text{ for } x \to \infty, \qquad \text{and} \qquad |\varrho(x)| \lesssim (1 + |x|)^q \ \text{ for all } x \in \mathbb{R}. \]

Standard examples of sigmoidal functions of order $q \in \mathbb{N}$ are the functions $x \mapsto \max\{0, x\}^q$. We have the following proposition.

Proposition 2.18. Let $k, d \in \mathbb{N}$, $K > 0$, and let $\varrho \colon \mathbb{R} \to \mathbb{R}$ be sigmoidal of order $q \ge 2$. There exists a constant $C > 0$ such that for every $f \in \mathcal{B}^k$ and every $\epsilon > 0$ there is a NN $\Phi_\epsilon$ with $\lceil \log_2(d) \rceil + \lceil \max\{\log_q(k-1), 0\} \rceil + 1$ layers and $C$ weights, such that
\[ \| f - R_\varrho(\Phi_\epsilon) \|_{L^\infty([-K,K]^d)} \le \epsilon. \]
Proof. We demonstrate how to approximate a cardinal B-spline of order $k$, i.e., $N_{0,0,k}^d$, by a NN $\Phi$ with activation function $\varrho$. The general case, i.e., $N_{\ell,t,k}^d$, follows by observing that shifting and rescaling of the realisation of $\Phi$ can be done by manipulating the entries of $A_1$ and $b_1$ associated to the first layer of $\Phi$. Towards this goal, we first approximate a univariate B-spline. We observe with (2.10) that we first need to build a network that approximates the function $x \mapsto (x)_+^{k-1}$. The rest follows by taking sums and shifting the function.
It is not hard to see (but probably a good exercise to formally show) that, for every $K' > 0$,
\[ \Big| a^{-q^T} \underbrace{\varrho \circ \varrho \circ \cdots \circ \varrho}_{T \text{ times}}(ax) - x_+^{q^T} \Big| \to 0 \quad \text{for } a \to \infty, \text{ uniformly for all } x \in [-K', K']. \]

Choosing $T := \lceil \max\{\log_q(k-1), 0\} \rceil$ we have that $q^T \ge k - 1$. We conclude that, for every $K' > 0$ and $\epsilon > 0$, there exists a NN $\Phi^*_\epsilon$ with $\lceil \max\{\log_q(k-1), 0\} \rceil + 1$ layers such that
\[ \big| R(\Phi^*_\epsilon)(x) - x_+^p \big| \le \epsilon, \tag{2.12} \]
for every $x \in [-K', K']$, where $p \ge k - 1$. We observe that, for all $x \in [-K', K']$,
\[ \frac{R(\Phi^*_{\delta^2})(x + \delta) - R(\Phi^*_{\delta^2})(x)}{\delta} \to p \, x_+^{p-1} \quad \text{for } \delta \to 0. \tag{2.13} \]
Repeating the 'derivative trick' of (2.13), we can find, for every $K' > 0$ and $\epsilon > 0$, a NN $\Phi^\dagger_\epsilon$ such that, for all $x \in [-K', K']$,
\[ \big| R(\Phi^\dagger_\epsilon)(x) - x_+^{k-1} \big| \le \epsilon. \]
By (2.10), it is now clear that there exists a NN $\Phi^\vee_\epsilon$, the size of which is independent of $\epsilon$, which approximates a univariate cardinal B-spline up to an error of $\epsilon$.
As a second step, we would like to construct a network which multiplies all entries of the $d$-dimensional output of the realisation of the NN $\mathrm{FP}(\Phi^\vee_\epsilon, \ldots, \Phi^\vee_\epsilon)$. Since $\varrho$ is a sigmoidal function of order at least 2, we observe by the 'derivative trick' that led to (2.12) that we can also build a fixed-size NN with two layers which, for every $K' > 0$ and $\epsilon > 0$, approximates the map $x \mapsto x_+^2$ arbitrarily well for $x \in [-K', K']$.
We have that, for every $x = (x_1, x_2) \in \mathbb{R}^2$,
\[ 2 x_1 x_2 = (x_1 + x_2)^2 - x_1^2 - x_2^2 = (x_1 + x_2)_+^2 + (-x_1 - x_2)_+^2 - (x_1)_+^2 - (-x_1)_+^2 - (x_2)_+^2 - (-x_2)_+^2. \tag{2.14} \]
Hence, we can conclude that, for every $K' > 0$, we can find a fixed-size NN $\Phi^{\mathrm{mult}}_\epsilon$ with input dimension 2 which, for every $\epsilon > 0$, approximates the map $(x_1, x_2) \mapsto x_1 x_2$ arbitrarily well for $(x_1, x_2) \in [-K', K']^2$.
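Identity (2.14), which turns an approximate squaring network into an approximate multiplication network, can be checked numerically with a short ad-hoc sketch:

import numpy as np

def sq_plus(t):
    # The map t -> (t)_+^2 used in (2.14).
    return np.maximum(t, 0.0) ** 2

rng = np.random.default_rng(3)
x1, x2 = rng.standard_normal(2)
lhs = 2 * x1 * x2
rhs = (sq_plus(x1 + x2) + sq_plus(-x1 - x2)
       - sq_plus(x1) - sq_plus(-x1) - sq_plus(x2) - sq_plus(-x2))
print(np.isclose(lhs, rhs))  # True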
We assume for simplicity that $\log_2(d) \in \mathbb{N}$. Then we define
\[ \Phi^{\mathrm{mult},d,d/2}_\epsilon := \mathrm{FP}\big( \underbrace{\Phi^{\mathrm{mult}}_\epsilon, \ldots, \Phi^{\mathrm{mult}}_\epsilon}_{d/2 \text{ times}} \big). \]
It is clear that, for all $x \in [-K', K']^d$,
\[ \Big| R\big(\Phi^{\mathrm{mult},d,d/2}_\epsilon\big)(x_1, \ldots, x_d) - (x_1 x_2, x_3 x_4, \ldots, x_{d-1} x_d) \Big| \le \epsilon. \]
Now, we set
\[ \Phi^{\mathrm{mult},d,1}_\epsilon := \Phi^{\mathrm{mult}}_\epsilon \bullet \Phi^{\mathrm{mult},4,2}_\epsilon \bullet \cdots \bullet \Phi^{\mathrm{mult},d,d/2}_\epsilon. \tag{2.15} \]
We depict the hierarchical construction of (2.15) in Figure 2.4. Per construction, we have that $\Phi^{\mathrm{mult},d,1}_\epsilon$ has $\log_2(d) + 1$ layers and, for every $\epsilon' > 0$ and $K' > 0$, there exists $\epsilon > 0$ such that
\[ \Big| R\big(\Phi^{\mathrm{mult},d,1}_\epsilon\big)(x_1, \ldots, x_d) - x_1 x_2 \cdots x_d \Big| \le \epsilon'. \]
Figure 2.4: Setup of the multiplication network (2.15). Every red dot symbolises a multiplication network $\Phi^{\mathrm{mult}}_\epsilon$ and not a regular neuron.

Finally, we set
\[ \Phi_\epsilon := \Phi^{\mathrm{mult},d,1}_\epsilon \bullet \mathrm{FP}\big( \underbrace{\Phi^\vee_\epsilon, \ldots, \Phi^\vee_\epsilon}_{d \text{ times}} \big). \]
Per definition of $\Phi_\epsilon$, we have that $\Phi_\epsilon$ has $\lceil \max\{\log_q(k-1), 0\} \rceil + \log_2(d) + 1$ many layers. Moreover, the size of all components of $\Phi_\epsilon$ is independent of $\epsilon$. By choosing $\epsilon$ sufficiently small, it is clear by construction that $\Phi_\epsilon$ approximates $N_{0,0,k}^d$ arbitrarily well on $[-K, K]^d$.
As a simple consequence of Theorem 2.14 and Proposition 2.18 we obtain the following corollary.

Corollary 2.19. Let $d \in \mathbb{N}$, $s > \delta > 0$, and $p \in (0, \infty]$. Moreover, let $\varrho \colon \mathbb{R} \to \mathbb{R}$ be sigmoidal of order $q \ge 2$. Then there exists a constant $C > 0$ such that, for every $f \in C^s([0,1]^d)$ with $\|f\|_{C^s} \le 1$ and every $1/2 > \epsilon > 0$, there exists a NN $\Phi_\epsilon$ such that
\[ \| f - R(\Phi_\epsilon) \|_{L^p} \le \epsilon, \]
with $M(\Phi_\epsilon) \le C \epsilon^{-\frac{d}{s - \delta}}$ and $L(\Phi_\epsilon) = \lceil \log_2(d) \rceil + \lceil \max\{\log_q(\lceil s \rceil - 1), 0\} \rceil + 1$.
Remark 2.20. Corollary 2.19 constitutes the first quantitative approximation result of these notes for a large class
of functions. There are a couple of particularly interesting features of this result. First of all, we observe that with
increasing smoothness of the functions, we need smaller networks to achieve a certain accuracy. On the other hand,
at least in the framework of this theorem, we require more layers if the smoothness s is much higher than the order of
sigmoidality of %.
Finally, the order of approximation deteriorates very quickly with increasing dimension d. Such a behaviour is often called the curse of dimensionality. We will later analyse to what extent NN approximation can overcome this curse.

2.6 Fast approximations with Kolmogorov


One observation that we made in the previous subsection is that some activation functions yield better approximation rates than others. In particular, in Corollary 2.19, we see that if the activation function $\varrho$ has a low order of sigmoidality, then we need to use much deeper networks to obtain the same approximation rates as with a sigmoidal function of high order.
Naturally, we can ask ourselves if, by a smart choice of activation function, we could even improve
Corollary 2.19 further. The following proposition shows how to achieve an incredible improvement if d = 1.
The idea for the following proposition and Theorem 2.24 below appeared in [16] first, but is presented in a
slightly simplified version here.

Proposition 2.21. There exists a continuous, piecewise polynomial activation function $\varrho \colon \mathbb{R} \to \mathbb{R}$ such that for every function $f \in C([0,1])$ and every $\epsilon > 0$ there is a NN $\Phi^{f,\epsilon}$ with $M(\Phi^{f,\epsilon}) \le 3$ and $L(\Phi^{f,\epsilon}) = 2$ such that
\[ \big\| f - R\big(\Phi^{f,\epsilon}\big) \big\|_\infty \le \epsilon. \tag{2.16} \]

Proof. We denote by $\Pi_{\mathbb{Q}}$ the set of univariate polynomials with rational coefficients. It is well known that this set is countable and dense in $C(K)$ for every compact set $K$. Hence, $\{\pi|_{[0,1]} : \pi \in \Pi_{\mathbb{Q}}\}$ is a countable set that is dense in $C([0,1])$. We set $(\pi_i)_{i \in \mathbb{Z}} := \{\pi|_{[0,1]} : \pi \in \Pi_{\mathbb{Q}}\}$ and define
\[ \varrho(x) := \begin{cases} \pi_i(x - 2i), & \text{if } x \in [2i, 2i+1], \\ \pi_i(1)(2i + 2 - x) + \pi_{i+1}(0)(x - 2i - 1), & \text{if } x \in (2i+1, 2i+2). \end{cases} \]
It is clear that $\varrho$ is continuous and piecewise polynomial.
Finally, let us construct the network such that (2.16) holds. For $f \in C([0,1])$ and $\epsilon > 0$, we have by density of $(\pi_i)_{i \in \mathbb{Z}}$ that there exists $i \in \mathbb{Z}$ such that $\|f - \pi_i\|_\infty \le \epsilon$. Hence, for $x \in [0,1]$,
\[ |f(x) - \varrho(x + 2i)| = |f(x) - \pi_i(x)| \le \epsilon. \tag{2.17} \]
The claim follows by defining $\Phi^{f,\epsilon} := ((1, 2i), (1, 0))$.
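The construction in this proof can be imitated on a toy scale. In the following ad-hoc sketch, the countable enumeration $(\pi_i)$ is replaced by a short hard-coded list of polynomials; everything here, including the chosen polynomials and the finite truncation, is purely illustrative:

import numpy as np

# Toy version of the activation from Proposition 2.21: finitely many polynomials
# are packed into rho; on [2i, 2i+1] rho equals pi_i(. - 2i), in between it
# interpolates linearly, mimicking the piecewise definition in the proof.
polys = [np.polynomial.Polynomial(c) for c in
         ([0.0, 1.0], [0.5, 0.0, 1.0], [1.0, -1.0])]  # x, 0.5 + x^2, 1 - x

def rho(x):
    i = int(np.floor(x / 2.0))
    if i < 0 or i >= len(polys):
        return 0.0
    t = x - 2 * i
    if t <= 1.0:
        return polys[i](t)
    right = polys[i + 1](0.0) if i + 1 < len(polys) else 0.0
    return polys[i](1.0) * (2.0 - t) + right * (t - 1.0)

# A target that happens to coincide with one of the stored polynomials:
f = lambda x: 0.5 + x ** 2
i = 1                                   # index of the matching polynomial
xs = np.linspace(0.0, 1.0, 5)
print(max(abs(f(x) - rho(x + 2 * i)) for x in xs))  # 0.0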


Remark 2.22. It is clear that the restriction to functions defined on [0, 1] is arbitrary. For every function f ∈
C([−K, K]) for a constant K > 0, we have that f (2K(· − 1/2)) ∈ C([0, 1]). Therefore, the result of Proposition 2.21
holds by replacing C([0, 1]) by C([−K, K]).
We will discuss to what extent the activation function % of Proposition 2.21 is sensible a bit further
below. Before that, we would like to generalise this result to higher dimensions. This can be done by using
Kolmogorov’s superposition theorem.

Theorem 2.23 ([14]). For every $d \in \mathbb{N}$, there are $2d^2 + d$ univariate, continuous, and increasing functions $\phi_{p,q}$, $p = 1, \ldots, d$, $q = 1, \ldots, 2d+1$, such that for every $f \in C([0,1]^d)$ we have that, for all $x \in [0,1]^d$,
\[ f(x) = \sum_{q=1}^{2d+1} g_q \Big( \sum_{p=1}^{d} \phi_{p,q}(x_p) \Big), \tag{2.18} \]
where $g_q$, $q = 1, \ldots, 2d+1$, are univariate continuous functions depending on $f$.

We can combine Kolmogorov's superposition theorem and Proposition 2.21 to obtain the following approximation theorem for realisations of networks with the special activation function from Proposition 2.21.

Theorem 2.24. Let $d \in \mathbb{N}$. Then there exist a constant $C(d) > 0$ and a continuous activation function $\varrho$ such that for every function $f \in C([0,1]^d)$ and every $\epsilon > 0$ there is a NN $\Phi^{f,\epsilon,d}$ with $M(\Phi^{f,\epsilon,d}) \le C(d)$ and $L(\Phi^{f,\epsilon,d}) = 3$ such that
\[ \big\| f - R\big(\Phi^{f,\epsilon,d}\big) \big\|_\infty \le \epsilon. \tag{2.19} \]

Proof. Let $f \in C([0,1]^d)$. Let $\epsilon_0 > 0$, let $\widetilde{\Phi}^{1,d} := (([1, \ldots, 1], 0))$ be a network with $d$-dimensional input, and let $\widetilde{\Phi}^{1,2d+1} := (([1, \ldots, 1], 0))$ be a network with $(2d+1)$-dimensional input. Let $g_q$, $\phi_{p,q}$ for $p = 1, \ldots, d$, $q = 1, \ldots, 2d+1$ be as in (2.18).
We have that there exists $C \in \mathbb{R}$ such that
\[ \mathrm{ran}(\phi_{p,q}) \subset [-C, C], \quad \text{for all } p = 1, \ldots, d,\ q = 1, \ldots, 2d+1. \]
We define, with Proposition 2.21,
\[ \Phi^{q,\epsilon_0} := \widetilde{\Phi}^{1,d} \bullet \mathrm{FP}\big( \Phi^{\phi_{1,q},\epsilon_0}, \Phi^{\phi_{2,q},\epsilon_0}, \ldots, \Phi^{\phi_{d,q},\epsilon_0} \big). \]
It is clear that, for $x = (x_1, \ldots, x_d) \in [0,1]^d$,
\[ \Big| R\big(\Phi^{q,\epsilon_0}\big)(x) - \sum_{p=1}^{d} \phi_{p,q}(x_p) \Big| \le d \epsilon_0, \tag{2.20} \]
and, by construction, $M(\Phi^{q,\epsilon_0}) \le 3d$. Now define, for $\epsilon_1 > 0$,
\[ \Phi^{f_{\epsilon_0,\epsilon_1}} := \widetilde{\Phi}^{1,2d+1} \bullet \mathrm{FP}\big( \Phi^{g_1,\epsilon_1}, \Phi^{g_2,\epsilon_1}, \ldots, \Phi^{g_{2d+1},\epsilon_1} \big) \bullet P\big( \Phi^{1,\epsilon_0}, \Phi^{2,\epsilon_0}, \ldots, \Phi^{2d+1,\epsilon_0} \big), \tag{2.21} \]
where $\Phi^{g_q,\epsilon_1}$ is according to Remark 2.22 with $K = C + 1$.


Per definition, it follows that $L(\Phi^{f_{\epsilon_0,\epsilon_1}}) \le 3$ and that the size of $\Phi^{f_{\epsilon_0,\epsilon_1}}$ is independent of $\epsilon_0$ and $\epsilon_1$. We also have that
\[ R\big(\Phi^{f_{\epsilon_0,\epsilon_1}}\big) = \sum_{q=1}^{2d+1} R\big(\Phi^{g_q,\epsilon_1}\big) \circ R\big(\Phi^{q,\epsilon_0}\big). \]
We have by Proposition 2.21 that, for fixed $\epsilon_1$, the map $R(\Phi^{g_q,\epsilon_1})$ is uniformly continuous on $[-C-1, C+1]$ for all $q = 1, \ldots, 2d+1$ and $\epsilon_0 \le 1$.
Hence, we have that, for each $\tilde{\epsilon} > 0$, there exists $\delta_{\tilde{\epsilon}} > 0$ such that
\[ \big| R\big(\Phi^{g_q,\epsilon_1}\big)(x) - R\big(\Phi^{g_q,\epsilon_1}\big)(y) \big| \le \tilde{\epsilon}, \]
for all $x, y \in [-C-1, C+1]$ such that $|x - y| \le \delta_{\tilde{\epsilon}}$; in particular, this statement holds for $\tilde{\epsilon} = \epsilon_1$.
It follows from the triangle inequality, (2.20), and Proposition 2.21 that
\[ \big\| R\big(\Phi^{f_{\epsilon_0,\epsilon_1}}\big) - f \big\|_\infty \le \sum_{q=1}^{2d+1} \Big\| R\big(\Phi^{g_q,\epsilon_1}\big) \circ R\big(\Phi^{q,\epsilon_0}\big) - g_q\Big( \sum_{p=1}^{d} \phi_{p,q} \Big) \Big\|_\infty \]
\[ \le \sum_{q=1}^{2d+1} \bigg( \Big\| R\big(\Phi^{g_q,\epsilon_1}\big) \circ R\big(\Phi^{q,\epsilon_0}\big) - R\big(\Phi^{g_q,\epsilon_1}\big)\Big( \sum_{p=1}^{d} \phi_{p,q} \Big) \Big\|_\infty + \Big\| R\big(\Phi^{g_q,\epsilon_1}\big)\Big( \sum_{p=1}^{d} \phi_{p,q} \Big) - g_q\Big( \sum_{p=1}^{d} \phi_{p,q} \Big) \Big\|_\infty \bigg) =: \sum_{q=1}^{2d+1} \big( \mathrm{I}_{\epsilon_0,\epsilon_1} + \mathrm{II}_{\epsilon_0,\epsilon_1} \big). \]
Choosing $d\epsilon_0 < \delta_{\epsilon_1}$, we have that $\mathrm{I}_{\epsilon_0,\epsilon_1} \le \epsilon_1$. Moreover, $\mathrm{II}_{\epsilon_0,\epsilon_1} \le \epsilon_1$ by construction.
Hence, for every $1/2 > \epsilon > 0$, there exist $\epsilon_0, \epsilon_1$ such that $\| R(\Phi^{f_{\epsilon_0,\epsilon_1}}) - f \|_\infty \le 2(2d+1)\epsilon_1 \le \epsilon$. We define $\Phi^{f,\epsilon,d} := \Phi^{f_{\epsilon_0,\epsilon_1}}$, which concludes the proof.
Without knowing the details of the proof of Theorem 2.24, the statement that any continuous function can be arbitrarily well approximated by a fixed-size network is hardly believable. The reason this result holds is that we have put an immense amount of information into the activation function. At the very least, we have now established that, from a certain minimal size on, there is no aspect of the architecture of a NN that fundamentally limits its approximation power. We will later develop fundamental lower bounds on approximation capabilities. As a consequence of the theorem above, these lower bounds can only be given for specific activation functions or under further restricting assumptions.

3 ReLU networks
We have already seen a variety of activation functions, including sigmoidal and higher-order sigmoidal functions. In practice, a much simpler function is usually used. This function is called the rectified linear unit (ReLU). It is defined by
\[ x \mapsto \varrho_R(x) := (x)_+ = \max\{0, x\} = \begin{cases} x & \text{for } x \ge 0, \\ 0 & \text{else.} \end{cases} \tag{3.1} \]

There are various reasons why this activation function is immensely popular. Most of these reasons are based
on its practicality in the algorithms used to train NNs which we do not want to analyse in this note. One
thing that we can observe, though, is that the evaluation of %R (x) can be done much more quickly than that
of virtually any non-constant function. Indeed, only a single decision has to be made, whereas, for other
activation functions such as, e.g., arctan, the evaluation requires many numerical operations. This function is
probably the simplest function that does not belong to the class described in Example 2.1.
One of the first questions that we can ask ourselves is whether the ReLU is discriminatory. We observe the following. For $a \in \mathbb{R}$, $b_1 < b_2$, and every $x \in \mathbb{R}$, we have that
\[ H_a(x) := \varrho_R(ax - ab_1 + 1) - \varrho_R(ax - ab_1) - \varrho_R(ax - ab_2) + \varrho_R(ax - ab_2 - 1) \to 1_{[b_1, b_2]}(x) \quad \text{for } a \to \infty. \]
Indeed, for $x < b_1 - 1/a$, we have that $H_a(x) = 0$. If $b_1 - 1/a < x < b_1$, then $H_a(x) = a(x - b_1 + 1/a) \le 1$. If $b_1 < x < b_2$, then $H_a(x) = \varrho_R(ax - ab_1 + 1) - \varrho_R(ax - ab_1) = 1$. If $b_2 \le x < b_2 + 1/a$, then $H_a(x) = 1 - \varrho_R(ax - ab_2) = 1 - (ax - ab_2) \le 1$. Finally, if $x \ge b_2 + 1/a$, then $H_a(x) = 0$. We depict $H_a$ in Figure 3.1.

Figure 3.1: Pointwise approximation of a univariate indicator function by sums of ReLU activation functions, with breakpoints $b_1 - 1/a$, $b_1$, $b_2$, $b_2 + 1/a$.
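The convergence of $H_a$ to the indicator function can be observed numerically with a short ad-hoc sketch:

import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def H(x, a, b1, b2):
    # Sum of four ReLUs approximating the indicator of [b1, b2].
    return (relu(a * x - a * b1 + 1) - relu(a * x - a * b1)
            - relu(a * x - a * b2) + relu(a * x - a * b2 - 1))

b1, b2 = 0.25, 0.75
x = np.linspace(0.0, 1.0, 9)
for a in [10.0, 100.0, 1000.0]:
    print(a, np.round(H(x, a, b1, b2), 3))  # approaches the indicator of [b1, b2]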

The argument above shows that sums of ReLUs can pointwise approximate arbitrary univariate indicator functions. If we had that
\[ \int_K \varrho_R(ax + b) \, d\mu(x) = 0 \]
for a $\mu \in \mathcal{M}$ and all $a \in \mathbb{R}^d$ and $b \in \mathbb{R}$, then this would imply, by the dominated convergence theorem,
\[ \int_K 1_{[b_1, b_2]}(ax) \, d\mu(x) = 0 \]
for all $a \in \mathbb{R}^d$ and $b_1 < b_2$. At this point we have the same result as in (2.1). Following the rest of the proof of Proposition 2.6 yields that $\varrho_R$ is discriminatory.
We saw in Proposition 2.18 how higher-order sigmoidal functions can reapproximate B-splines of arbitrary order. The idea there was that, essentially, through powers of $x_+^q$, we can generate arbitrarily high polynomial degrees. This approach does not work anymore if $q = 1$. Moreover, the crucial multiplication operation of Equation (2.14) cannot be performed so easily with realisations of networks with the ReLU activation function.
If we want to use the local approximation by polynomials in a similar way as in Corollary 2.19, we have two options: being content with approximation by piecewise linear functions, i.e., polynomials of degree one, or trying to reproduce higher-order monomials by realisations of NNs with the ReLU activation function in a different way than by simple composition.
Let us start with the first approach, which was established in [12].

3.1 Linear finite elements and ReLU networks
We start by recalling some basics on linear finite elements. Below, we will perform a lot of basic operations on sets, and therefore it is reasonable to recall and fix some set-theoretical notation first. For a subset $A$ of a topological space, we denote by $\mathrm{co}(A)$ the convex hull of $A$, i.e., the smallest convex set containing $A$. By $\overline{A}$ we denote the closure of $A$, i.e., the smallest closed set containing $A$. Furthermore, $\mathrm{int}\, A$ denotes the interior of $A$, which is the largest open subset of $A$. Finally, the boundary of $A$ is denoted by $\partial A$ and $\partial A := \overline{A} \setminus \mathrm{int}\, A$.
Let $d \in \mathbb{N}$, $\Omega \subset \mathbb{R}^d$. A set $\mathcal{T} \subset \mathcal{P}(\Omega)$ such that
\[ \bigcup \mathcal{T} = \Omega, \]
$\mathcal{T} = (\tau_i)_{i=1}^{M_{\mathcal{T}}}$, where each $\tau_i$ is a $d$-simplex,§ and such that $\tau_i \cap \tau_j \subset \partial\tau_i \cap \partial\tau_j$ is an $n$-simplex with $n < d$ for every $i \neq j$, is called a simplicial mesh of $\Omega$. We call the $\tau_i$ the cells of the mesh $\mathcal{T}$ and the extremal points of the $\tau_i$, $i = 1, \ldots, M_{\mathcal{T}}$, the nodes of the mesh. We denote the set of nodes by $(\eta_i)_{i=1}^{M_N}$.

Figure 3.2: A two dimensional simplicial mesh of [0, 1]2 . The nodes are depicted by red x’s.

We say that a mesh $\mathcal{T} = (\tau_i)_{i=1}^{M_{\mathcal{T}}}$ is locally convex if, for every $\eta_i$, it holds that $\bigcup \{\tau_j : \eta_i \in \tau_j\}$ is convex.
For any mesh $\mathcal{T}$ one defines the linear finite element space
\[ V_{\mathcal{T}} := \big\{ f \in C(\Omega) : f|_{\tau_i} \text{ affine linear for all } i = 1, \ldots, M_{\mathcal{T}} \big\}. \]
Since an affine linear function on a $d$-simplex is uniquely defined through its values on the $d + 1$ vertices, it is clear that every $f \in V_{\mathcal{T}}$ is uniquely defined through the values $(f(\eta_i))_{i=1}^{M_N}$. By the same token, for every choice of $(y_i)_{i=1}^{M_N}$, there exists a function $f \in V_{\mathcal{T}}$ such that $f(\eta_i) = y_i$ for all $i = 1, \ldots, M_N$.
For i = 1, . . . , MN we define the Courant elements φi,T ∈ VT to be those functions that satisfy φi,T (ηj ) = δi,j .
See Figure 3.3 for an illustration.
Proposition 3.1. Let $d \in \mathbb{N}$ and let $\mathcal{T}$ be a simplicial mesh of $\Omega \subset \mathbb{R}^d$. Then we have that
\[ f = \sum_{i=1}^{M_N} f(\eta_i) \, \phi_{i,\mathcal{T}} \]
holds for every $f \in V_{\mathcal{T}}$.


§A d-simplex is a convex hull of d + 1 points v0 , . . . , vd such that (v1 − v0 ), (v2 − v0 ), . . . , (vd − v0 ) are linearly independent.

Figure 3.3: Visualisation of a Courant element on a mesh.

As a consequence of Proposition 3.1, we have that we can build every function $f \in V_{\mathcal{T}}$ as the realisation of a NN with ReLU activation function if we can build $\phi_{i,\mathcal{T}}$ for every $i = 1, \ldots, M_N$.
We start by making a couple of convenient definitions and then find an alternative representation of $\phi_{i,\mathcal{T}}$. We define, for $i, j = 1, \ldots, M_N$,
\[ F(i) := \{ j \in \{1, \ldots, M_{\mathcal{T}}\} : \eta_i \in \tau_j \}, \qquad G(i) := \bigcup_{j \in F(i)} \tau_j, \tag{3.2} \]
\[ H(j, i) := \{ \eta_k \in \tau_j : \eta_k \neq \eta_i \}, \qquad I(i) := \{ \eta_k \in G(i) \}. \tag{3.3} \]
Here $F(i)$ is the set of all indices of cells that contain $\eta_i$. Moreover, $G(i)$ is the polyhedron created from taking the union of all these cells.
Proposition 3.2. Let $d \in \mathbb{N}$ and let $\mathcal{T}$ be a locally convex simplicial mesh of $\Omega \subset \mathbb{R}^d$. Then, for every $i = 1, \ldots, M_N$, we have that
\[ \phi_{i,\mathcal{T}} = \max\Big\{ 0, \min_{j \in F(i)} g_j \Big\}, \tag{3.4} \]
where $g_j$ is the unique affine linear function such that $g_j(\eta_k) = 0$ for all $\eta_k \in H(j, i)$ and $g_j(\eta_i) = 1$.

Proof. Let $i \in \{1, \ldots, M_N\}$. By the local convexity assumption we have that $G(i)$ is convex. For simplicity, we assume that $\eta_i \in \mathrm{int}\, G(i)$.¶
¶ The case $\eta_i \in \partial G(i)$ needs to be treated slightly differently and is left as an exercise.
Step 1: We show that
\[ \partial G(i) = \bigcup_{j \in F(i)} \mathrm{co}(H(j, i)). \tag{3.5} \]
The argument below is visualised in Figure 3.4. We have by convexity that $G(i) = \mathrm{co}(I(i))$. Since $\eta_i$ lies in the interior of $G(i)$, there exists $\epsilon > 0$ such that $B_\epsilon(\eta_i) \subset G(i)$. By convexity, for any $k \in F(i)$, the open set $\mathrm{co}(\mathrm{int}\, \tau_k, B_\epsilon(\eta_i))$ is also a subset of $G(i)$.

Figure 3.4: Visualisation of the argument in Step 1. The simplex $\tau_k$ is coloured green. The grey ball around $\eta_i$ is $B_\epsilon(\eta_i)$. The blue × represents $x$.

It is not hard to see that $\tau_k \setminus \mathrm{co}(H(k, i)) \subset \mathrm{co}(\mathrm{int}\, \tau_k, B_\epsilon(\eta_i))$, and hence $\tau_k \setminus \mathrm{co}(H(k, i))$ lies in the interior of $G(i)$. Since we also have that $\partial G(i) \subset \bigcup_{k \in F(i)} \partial \tau_k$, we conclude that
\[ \partial G(i) \subset \bigcup_{j \in F(i)} \mathrm{co}(H(j, i)). \]
Now assume that there is $j$ such that $\mathrm{co}(H(j, i)) \not\subset \partial G(i)$. Since $\mathrm{co}(H(j, i)) \subset G(i)$, this would imply that there exists $x \in \mathrm{co}(H(j, i))$ such that $x$ is in the interior of $G(i)$. This implies that there exists $\epsilon' > 0$ such that $B_{\epsilon'}(x) \subset G(i)$. Hence, the line from $\eta_i$ to $x$ can be extended by a distance of $\epsilon'/2$ to a point $x^* \in G(i) \setminus \tau_j$. As $x^*$ must belong to a simplex $\tau_{j^*}$ that also contains $\eta_i$, we conclude that $\tau_{j^*}$ intersects the interior of $\tau_j$, which cannot be by assumption on the mesh.
Step 2: For each $j$, denote by $\mathcal{H}(j, i)$ the hyperplane through $H(j, i)$. The hyperplane $\mathcal{H}(j, i)$ splits $\mathbb{R}^d$ into two subsets, and we denote by $\mathcal{H}^{\mathrm{int}}(j, i)$ the closed half-space that contains $\eta_i$.
We claim that
\[ G(i) = \bigcap_{j \in F(i)} \mathcal{H}^{\mathrm{int}}(j, i). \tag{3.6} \]
This is intuitively clear and sketched in Figure 3.5.
We first prove that $G(i) \subset \bigcap_{j \in F(i)} \mathcal{H}^{\mathrm{int}}(j, i)$. Assume towards a contradiction that $x' \in G(i)$ is a point in $\mathbb{R}^d \setminus \mathcal{H}^{\mathrm{int}}(j, i)$ for a $j \in F(i)$.
Since $\eta_i$ does not lie in the boundary of $G(i)$, there exists $\epsilon > 0$ such that $B_\epsilon(\eta_i) \subset G(i)$ and therefore, by convexity, $\mathrm{co}(B_\epsilon(\eta_i), x') \subset G(i)$. Since $\eta_i$ and $x'$ are on different sides of $\mathcal{H}(j, i)$, there is a point $x'' \in \mathcal{H}(j, i)$ and $\epsilon' > 0$ such that $B_{\epsilon'}(x'') \subset G(i)$. Therefore, $\mathrm{co}(B_{\epsilon'}(x''), \mathrm{int}\, \mathrm{co}(H(j, i))) \subset G(i)$ is open. In particular, $\mathrm{co}(B_{\epsilon'}(x''), \mathrm{int}\, \mathrm{co}(H(j, i))) \cap \partial G(i) = \emptyset$. We conclude that $\mathrm{int}\, \mathrm{co}(H(j, i)) \cap \partial G(i) = \emptyset$. This constitutes a contradiction to (3.5).
Next we prove that $G(i) \supset \bigcap_{j \in F(i)} \mathcal{H}^{\mathrm{int}}(j, i)$. Let $x''' \notin G(i)$. We show that $x'''$ lies in $\mathbb{R}^d \setminus \mathcal{H}^{\mathrm{int}}(j, i)$ for at least one $j$. The line between $x'''$ and $\eta_i$ intersects $\partial G(i)$ and, by Step 1, it intersects $\mathrm{co}(H(j, i))$ for a $j \in F(i)$. It is also clear that $x'''$ is not on the same side of $\mathcal{H}(j, i)$ as $\eta_i$. Hence $x''' \notin \mathcal{H}^{\mathrm{int}}(j, i)$.
Step 3: For each $\eta_j \in I(i)$, we have that $g_k(\eta_j) \ge 0$ for all $k \in F(i)$.

Figure 3.5: The set $G(i)$ and two hyperplanes $\mathcal{H}(j_1, i)$, $\mathcal{H}(j_2, i)$. Since $G(i)$ is convex and $\mathcal{H}(j, i)$ extends a part of its boundary, it is intuitively clear that $G(i)$ lies on only one side of $\mathcal{H}(j, i)$ and that (3.6) holds.

This is because, by (3.6), $G(i)$ lies fully on one side of each hyperplane $\mathcal{H}(j, i)$, $j \in F(i)$. Since $g_k$ vanishes on $\mathcal{H}(k, i)$ and equals $1$ at $\eta_i$, we conclude that $g_k(\eta_j) \ge 0$ for all $k \in F(i)$.
Step 4: For every $k \in F(i)$ we have that $g_k \le g_j$ on $\tau_k$ for all $j \in F(i)$.
If, for $j \in F(i)$, $g_j(\eta_\ell) \ge g_k(\eta_\ell)$ for all $\eta_\ell \in \tau_k$, then, since $\tau_k = \mathrm{co}(\{\eta_\ell : \eta_\ell \in \tau_k\})$ and $g_j - g_k$ is affine linear, we conclude that $g_j \ge g_k$ on $\tau_k$. Assume towards a contradiction that $g_j(\eta_\ell) < g_k(\eta_\ell)$ for at least one $\eta_\ell \in \tau_k$. Clearly this assumption cannot hold for $\eta_\ell = \eta_i$, since there $g_j(\eta_i) = 1 = g_k(\eta_i)$. If $\eta_\ell \neq \eta_i$, then $g_k(\eta_\ell) = 0$, implying $g_j(\eta_\ell) < 0$. Together with Step 3 this yields a contradiction.
Step 5: For each $z \notin G(i)$, there exists at least one $k \in F(i)$ such that $g_k(z) \le 0$.
This follows as in Step 3. Indeed, if $z \notin G(i)$ then, by (3.6), there is a hyperplane $\mathcal{H}(k, i)$ such that $z$ does not lie on the same side as $\eta_i$. Hence $g_k(z) \le 0$.
Combining Steps 1-5 yields the claim (3.4).
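As a one-dimensional illustration of formula (3.4) (an ad-hoc sketch on the mesh $0 < 1/2 < 1$ of $[0,1]$, which is trivially locally convex):

import numpy as np

# Courant element of the middle node eta_i = 1/2: phi = max{0, min(g1, g2)},
# with g1(x) = 2x (zero at 0, one at 1/2) and g2(x) = 2 - 2x (zero at 1, one at 1/2).
def phi(x):
    return np.maximum(0.0, np.minimum(2 * x, 2 - 2 * x))

x = np.linspace(-0.5, 1.5, 9)
print(np.round(phi(x), 3))  # the hat function supported on [0, 1] with peak 1 at 1/2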
Now that we have a formula for the functions φi,T , we proceed by building these functions as realisations
of NNs.
Proposition 3.3. Let $d \in \mathbb{N}$ and let $\mathcal{T}$ be a locally convex simplicial mesh of $\Omega \subset \mathbb{R}^d$. Let $k_{\mathcal{T}}$ denote the maximum number of neighbouring cells of the mesh, i.e.,
\[ k_{\mathcal{T}} := \max_{i = 1, \ldots, M_N} \# \{ j : \eta_i \in \tau_j \}. \tag{3.7} \]
Then, for every $i = 1, \ldots, M_N$, there exists a NN $\Phi^i$ with
\[ L(\Phi^i) = \lceil \log_2(k_{\mathcal{T}}) \rceil + 2, \quad \text{and} \quad M(\Phi^i) \le C \cdot (k_{\mathcal{T}} + d) \, k_{\mathcal{T}} \, \lceil \log_2(k_{\mathcal{T}}) \rceil \]
for a universal constant $C > 0$, and
\[ R(\Phi^i) = \phi_{i,\mathcal{T}}, \tag{3.8} \]
where the activation function is the ReLU.

Proof. We now construct a network whose realisation equals (3.4). The claim (3.8) then follows with Proposition 3.2.
We start by observing that, for $a, b \in \mathbb{R}$,
\[ \min\{a, b\} = \frac{a + b}{2} - \frac{|a - b|}{2} = \frac{1}{2} \big( \varrho_R(a + b) - \varrho_R(-a - b) - \varrho_R(a - b) - \varrho_R(b - a) \big). \]
Thus, defining $\Phi^{\min,2} := ((A_1, 0), (A_2, 0))$ with
\[ A_1 := \begin{pmatrix} 1 & 1 \\ -1 & -1 \\ 1 & -1 \\ -1 & 1 \end{pmatrix}, \qquad A_2 := \frac{1}{2} [1, -1, -1, -1], \]
yields $R(\Phi^{\min,2})(a, b) = \min\{a, b\}$, $L(\Phi^{\min,2}) = 2$ and $M(\Phi^{\min,2}) = 12$. Following an idea that we saw earlier for the construction of the multiplication network in (2.15), we construct, for $p \in \mathbb{N}$ even, the networks
\[ \widetilde{\Phi}^{\min,p} := \mathrm{FP}\big( \underbrace{\Phi^{\min,2}, \ldots, \Phi^{\min,2}}_{p/2 \text{ times}} \big) \]
and, for $p = 2^q$,
\[ \Phi^{\min,p} := \widetilde{\Phi}^{\min,2} \bullet \widetilde{\Phi}^{\min,4} \bullet \cdots \bullet \widetilde{\Phi}^{\min,p}. \]
It is clear that the realisation of $\Phi^{\min,p}$ is the minimum operator with $p$ inputs. If $p$ is not a power of two, then a small adaptation of the procedure above is necessary. We will omit this discussion here.
We see that $L(\Phi^{\min,p}) = \lceil \log_2(p) \rceil + 1$. To estimate the weights, we first observe that the number of neurons in the first layer of $\widetilde{\Phi}^{\min,p}$ is bounded by $2p$. It follows that each layer of $\Phi^{\min,p}$ has at most $2p$ neurons. Since all affine maps in this construction are linear, we have that
\[ \Phi^{\min,p} = ((A_1, b_1), \ldots, (A_L, b_L)) = ((A_1, 0), \ldots, (A_L, 0)). \tag{3.9} \]

We have that $g_k = G_k(\cdot) + \theta_k$ for $\theta_k \in \mathbb{R}$ and $G_k \in \mathbb{R}^{1 \times d}$. Let
\[ \Phi^{\mathrm{aff}} := P\big( ((G_1, \theta_1)), ((G_2, \theta_2)), \ldots, ((G_{\#F(i)}, \theta_{\#F(i)})) \big). \]
Clearly, $\Phi^{\mathrm{aff}}$ has one layer, $d$-dimensional input, and $\#F(i)$ many output neurons.
We define, for $p := \#F(i)$,
\[ \Phi^{i,\mathcal{T}} := ((1, 0), (1, 0)) \bullet \Phi^{\min,p} \bullet \Phi^{\mathrm{aff}}. \]
Per construction and (3.4), we have that $R(\Phi^{i,\mathcal{T}}) = \phi_{i,\mathcal{T}}$. Moreover, $L(\Phi^{i,\mathcal{T}}) = L(\Phi^{\min,p}) + 1 = \lceil \log_2(p) \rceil + 2$. Also, by construction, the number of neurons in each layer of $\Phi^{i,\mathcal{T}}$ is bounded by $2p$. Since, by (3.9), we have that
\[ \Phi^{i,\mathcal{T}} = ((A_1, b_1), (A_2, 0), \ldots, (A_L, 0)), \]
with $A_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ and $b_1 \in \mathbb{R}^p$, we conclude that
\[ M(\Phi^{i,\mathcal{T}}) \le p + \sum_{\ell=1}^{L} \|A_\ell\|_0 \le p + \sum_{\ell=1}^{L} N_{\ell-1} N_\ell \le p + 2dp + (2p)^2 \lceil \log_2(p) \rceil. \]
Finally, per assumption $p \le k_{\mathcal{T}}$, which yields the claim.
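The ReLU representation of the minimum used in this proof is easy to verify numerically (ad-hoc sketch):

import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def min_relu(a, b):
    # min{a, b} = 1/2 (rho(a+b) - rho(-a-b) - rho(a-b) - rho(b-a)) with rho the ReLU.
    return 0.5 * (relu(a + b) - relu(-a - b) - relu(a - b) - relu(b - a))

rng = np.random.default_rng(4)
a, b = rng.standard_normal(1000), rng.standard_normal(1000)
print(np.allclose(min_relu(a, b), np.minimum(a, b)))  # True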


As a consequence of Propositions 3.3 and 3.1, we conclude that one can represent every continuous piecewise linear function on a locally convex simplicial mesh with $M_N$ nodes as the realisation of a NN with $C M_N$ weights, where the constant $C$ depends on the maximum number of cells neighbouring a vertex, $k_{\mathcal{T}}$, and on the input dimension $d$.

Theorem 3.4. Let $\mathcal{T}$ be a locally convex simplicial mesh of $\Omega \subset \mathbb{R}^d$, $d \in \mathbb{N}$. Let $\mathcal{T}$ have $M_N$ nodes and let $k_{\mathcal{T}}$ be defined as in (3.7). Then, for every $f \in V_{\mathcal{T}}$, there exists a NN $\Phi^f$ such that
\[ L(\Phi^f) \le \lceil \log_2(k_{\mathcal{T}}) \rceil + 2, \qquad M(\Phi^f) \le C M_N \cdot (k_{\mathcal{T}} + d) \, k_{\mathcal{T}} \lceil \log_2(k_{\mathcal{T}}) \rceil, \qquad R(\Phi^f) = f, \]
for a universal constant $C > 0$.

Remark 3.5. One way to read Theorem 3.4 is the following: whatever one can approximate by piecewise affine linear, continuous functions with $N$ degrees of freedom can be approximated to the same accuracy by realisations of NNs with $C \cdot N$ degrees of freedom, for a constant $C$. If we consider approximation rates, then this implies that realisations of NNs achieve the same approximation rates as linear finite element spaces.
For example, for $\Omega := [0,1]^d$, one has that there exists a sequence of locally convex simplicial meshes $(\mathcal{T}_n)_{n=1}^{\infty}$ with $M_{\mathcal{T}}(\mathcal{T}_n) \lesssim n$ such that
\[ \inf_{g \in V_{\mathcal{T}_n}} \| f - g \|_{L^2(\Omega)} \lesssim n^{-\frac{2}{d}} \| f \|_{W^{2, 2d/(d+2)}(\Omega)}, \]
for all $f \in W^{2, 2d/(d+2)}(\Omega)$, see, e.g., [12].

3.2 Approximation of the square function


With Theorem 3.4, we are able to reproduce approximation results of piecewise linear functions by realisations
of NNs. However, the approximation rates of piecewise affine linear functions when approximating C s
regular functions do not improve for increasing s as soon as s ≥ 1, see, e.g., Theorem 2.16. To really benefit
from higher-order smoothness, one requires piecewise polynomials of higher degree.
Therefore, if we want to approximate smooth functions in the spirit of Corollary 2.19, then we need to be
able to efficiently approximate continuous piecewise polynomials of degree higher than 1 by realisations of
NNs.
It is clear that this emulation of polynomials cannot be performed as in Corollary 2.19, since the ReLU is
piecewise linear. However, if we allow sufficiently deep networks there is, in fact, a surprisingly effective
possibility to approximate square functions and thereby polynomials by realisations of NNs with ReLU
activation functions.
To see this, we first consider the remarkable construction below.

Efficient construction of saw-tooth functions: Let
\[ \Phi^\wedge := ((A_1, b_1), (A_2, 0)), \]
where
\[ A_1 := \begin{pmatrix} 2 \\ 2 \\ 2 \end{pmatrix}, \qquad b_1 := \begin{pmatrix} 0 \\ -1 \\ -2 \end{pmatrix}, \qquad A_2 := [1, -2, 1]. \]
Then
\[ R(\Phi^\wedge)(x) = \varrho_R(2x) - 2\varrho_R(2x - 1) + \varrho_R(2x - 2) \]
and $L(\Phi^\wedge) = 2$, $M(\Phi^\wedge) = 8$, $N_0 = 1$, $N_1 = 3$, $N_2 = 1$. It is clear that $R(\Phi^\wedge)$ is a hat function. We depict it in Figure 3.6.
A quite interesting thing happens if we compose $R(\Phi^\wedge)$ with itself. We have that
\[ R\big( \underbrace{\Phi^\wedge \bullet \cdots \bullet \Phi^\wedge}_{n \text{ times}} \big) = \underbrace{R(\Phi^\wedge) \circ \cdots \circ R(\Phi^\wedge)}_{n \text{ times}} \]
is a saw-tooth function with $2^{n-1}$ hats of width $2^{1-n}$ each. This is depicted in Figure 3.6. Compositions are notoriously hard to picture, hence it is helpful to establish the precise form of $R(\Phi^\wedge \bullet \cdots \bullet \Phi^\wedge)$ formally. We analyse this in the following proposition.
Proposition 3.6. For $n \in \mathbb{N}$, we have that
\[ F_n := R\big( \underbrace{\Phi^\wedge \bullet \cdots \bullet \Phi^\wedge}_{n \text{ times}} \big) \]
satisfies, for $x \in (0, 1)$,
\[ F_n(x) = \begin{cases} 2^n (x - i 2^{-n}) & \text{for } x \in [i 2^{-n}, (i+1) 2^{-n}],\ i \text{ even}, \\ 2^n ((i+1) 2^{-n} - x) & \text{for } x \in [i 2^{-n}, (i+1) 2^{-n}],\ i \text{ odd}, \end{cases} \tag{3.10} \]
and $F_n(x) = 0$ for $x \notin (0, 1)$. Moreover, $L(\Phi^\wedge \bullet \cdots \bullet \Phi^\wedge) = n + 1$ and $M(\Phi^\wedge \bullet \cdots \bullet \Phi^\wedge) \le 12n - 2$.

Proof. The proof follows by induction. We have that, for $x \in [0, 1/2]$,
\[ R(\Phi^\wedge)(x) = \varrho_R(2x) = 2x. \]
Moreover, for $x \in [1/2, 1]$ we conclude
\[ R(\Phi^\wedge)(x) = 2x - 2(2x - 1) = 2 - 2x. \]
Finally, if $x \notin (0, 1)$, then
\[ \varrho_R(2x) - 2\varrho_R(2x - 1) + \varrho_R(2x - 2) = 0. \]
This completes the case $n = 1$. We assume that we have shown (3.10) for $n \in \mathbb{N}$. Hence, we have that
\[ F_{n+1} = F_n \circ R(\Phi^\wedge), \tag{3.11} \]
where $F_n$ satisfies (3.10). Let $x \in [0, 1/2]$ and $x \in [i 2^{-n-1}, (i+1) 2^{-n-1}]$, $i$ even. Then $R(\Phi^\wedge)(x) = 2x \in [i 2^{-n}, (i+1) 2^{-n}]$, $i$ even. Hence, by (3.11), we have
\[ F_{n+1}(x) = 2^n (2x - i 2^{-n}) = 2^{n+1} (x - i 2^{-n-1}). \]
If $x \in [1/2, 1]$ and $x \in [i 2^{-n-1}, (i+1) 2^{-n-1}]$, $i$ even, then $R(\Phi^\wedge)(x) = 2 - 2x \in [2 - (i+1) 2^{-n}, 2 - i 2^{-n}] = [(2^{n+1} - i - 1) 2^{-n}, (2^{n+1} - i) 2^{-n}] = [j 2^{-n}, (j+1) 2^{-n}]$ for $j := 2^{n+1} - i - 1$ odd. We have, by (3.11),
\[ F_{n+1}(x) = 2^n \big( (j+1) 2^{-n} - (2 - 2x) \big) = 2^n \big( (2 - i 2^{-n}) - (2 - 2x) \big) = 2^n \big( 2x - i 2^{-n} \big) = 2^{n+1} \big( x - i 2^{-n-1} \big). \]
The cases for $i$ odd follow similarly. If $x \notin (0, 1)$, then $R(\Phi^\wedge)(x) = 0$ and per (3.11) we have that $F_{n+1}(x) = 0$.

It is clear by Definition 2.9 that $L(\Phi^\wedge \bullet \cdots \bullet \Phi^\wedge) = n + 1$. To show that $M(\Phi^\wedge \bullet \cdots \bullet \Phi^\wedge) \le 12n - 2$, we observe with
\[ \Phi^\wedge \bullet \cdots \bullet \Phi^\wedge =: ((A_1, b_1), \ldots, (A_L, b_L)) \]
that
\[ M(\Phi^\wedge \bullet \cdots \bullet \Phi^\wedge) \le \sum_{\ell=1}^{n+1} N_{\ell-1} N_\ell + N_\ell \le (n-1)(3^2 + 3) + N_0 N_1 + N_1 + N_n N_{n+1} + N_{n+1} = 12(n-1) + 3 + 3 + 3 + 1 = 12n - 2, \]
where we use that $N_\ell = 3$ for all $1 \le \ell \le n$ and $N_0 = N_{n+1} = 1$.
Figure 3.6: Top left: Visualisation of $R(\Phi^\wedge) = F_1$. Bottom right: Visualisation of $R(\Phi^\wedge) \circ R(\Phi^\wedge) = F_2$. Bottom left: $F_n$ for $n = 4$.

Remark 3.7. Proposition 3.6 already shows something remarkable. Consider a two-layer network $\Phi$ with input dimension 1 and $N$ neurons. Then its realisation with ReLU activation function is given by
\[ R(\Phi) = \sum_{j=1}^{N} c_j \varrho_R(a_j x + b_j) - d, \]
for $c_j, a_j, b_j, d \in \mathbb{R}$. It is clear that $R(\Phi)$ is piecewise affine linear with at most $M(\Phi)$ pieces; in other words, with this shallow construction, the resulting realisations have no more than $M(\Phi)$ linear pieces. However, the function $F_n$ from Proposition 3.6 is realised by a network with $M \le 12n - 2$ weights and has $2^n \ge 2^{(M+2)/12}$ linear pieces.
The function $F_n$ is therefore a function that can be very efficiently represented by deep networks, but not very efficiently by shallow networks. This was first observed in [35].
The surprisingly high number of linear pieces of Fn is not the only remarkable thing about the construction
of Proposition 3.6. Yarotsky [38] made the following insightful observation:
Proposition 3.8 ([38]). For every x ∈ [0, 1] and N ∈ N, we have that
N

X Fn (x)
≤ 2−2N −2 .
2
x − x + (3.12)
22n
n=1

Proof. We claim that
\[ H_N := x - \sum_{n=1}^{N} \frac{F_n}{2^{2n}} \tag{3.13} \]
is a piecewise linear function with breakpoints $k 2^{-N}$, $k = 0, \ldots, 2^N$, and $H_N(k 2^{-N}) = k^2 2^{-2N}$. We prove this by induction. The result clearly holds for $N = 0$. Assume that the claim holds for $N \in \mathbb{N}$. Then we see that
\[ H_N - H_{N+1} = \frac{F_{N+1}}{2^{2N+2}}. \]
Since, by Proposition 3.6, $F_{N+1}$ is piecewise linear with breakpoints $k 2^{-N-1}$, $k = 0, \ldots, 2^{N+1}$, and $H_N$ is piecewise linear with breakpoints $\ell 2^{-N-1}$, $\ell = 0, \ldots, 2^{N+1}$ even, we conclude that $H_{N+1}$ is piecewise linear with breakpoints $k 2^{-N-1}$, $k = 0, \ldots, 2^{N+1}$. Moreover, by Proposition 3.6, $F_{N+1}$ vanishes at all $k 2^{-N-1}$ with $k$ even. Hence, by the induction hypothesis, $H_{N+1}(k 2^{-N-1}) = (k 2^{-N-1})^2$ for all $k$ even.
To complete the proof of the claim, we need to show that
\[ \frac{F_{N+1}}{2^{2N+2}}(k 2^{-N-1}) = H_N(k 2^{-N-1}) - (k 2^{-N-1})^2 \]
for all $k$ odd. Since $H_N$ is linear on $[(k-1) 2^{-N-1}, (k+1) 2^{-N-1}]$, we have that
\[ H_N(k 2^{-N-1}) - (k 2^{-N-1})^2 = \tfrac{1}{2}\Big( ((k-1) 2^{-N-1})^2 + ((k+1) 2^{-N-1})^2 \Big) - (k 2^{-N-1})^2 \tag{3.14} \]
\[ = 2^{-2N-2} \Big( \tfrac{1}{2}\big( (k-1)^2 + (k+1)^2 \big) - k^2 \Big) = 2^{-2(N+1)} = 2^{-2(N+1)} F_{N+1}(k 2^{-N-1}), \]
where the last step follows by Proposition 3.6. This shows that $H_{N+1}(k 2^{-N-1}) = (k 2^{-N-1})^2$ for all $k = 0, \ldots, 2^{N+1}$ and completes the induction.
Finally, let $x \in [k 2^{-N}, (k+1) 2^{-N}]$, $k = 0, \ldots, 2^N - 1$. Then
\[ |H_N(x) - x^2| = H_N(x) - x^2 = (k 2^{-N})^2 + \frac{\big( (k+1)^2 - k^2 \big) 2^{-2N}}{2^{-N}} \big( x - k 2^{-N} \big) - x^2, \tag{3.15} \]
where the first step is because $x \mapsto x^2$ is convex and therefore its graph lies below that of the linear interpolant, and the second step follows by representing $H_N$ locally as the linear map that intersects $x \mapsto x^2$ at $k 2^{-N}$ and $(k+1) 2^{-N}$.
Since (3.15) describes a continuous function on $[k 2^{-N}, (k+1) 2^{-N}]$ vanishing at the boundary, it assumes its maximum at the critical point
\[ x^* := \frac{1}{2} \, \frac{\big( (k+1)^2 - k^2 \big) 2^{-2N}}{2^{-N}} = \frac{1}{2} (2k+1) 2^{-N} = (2k+1) 2^{-N-1} = \ell 2^{-N-1}, \]
for $\ell \in \{1, \ldots, 2^{N+1}\}$ odd. We have already computed in (3.14) that
\[ |H_N(x^*) - (x^*)^2| \le 2^{-2(N+1)}. \]
This yields the claim.
Equation 3.12 and Proposition 3.6 make us optimistic that, with sufficiently deep networks, we can
approximate the square function very efficiently. Before we can do this properly, we need to enlarge our
toolbox slightly and introduce a couple of additional operations on NNs.

Figure 3.7: Visualisation of the construction of $H_N$ of (3.13), showing $x \mapsto x^2$ together with $H_0$, $H_1$, and $H_2$ on $[0, 1]$.

ReLU specific network operations We saw in Proposition 2.11 that we can approximate the identity func-
tion by realisations of NNs for many activation functions. For the ReLU, we can even go one step further and
rebuild the identity function exactly.
Lemma 3.9. Let d ∈ N, and define
ΦId := ((A1 , b1 ) , (A2 , b2 ))
with
 
IdRd 
A1 := , b1 := 0, A2 := IdRd −IdRd , b2 := 0.
−IdRd
Then R(ΦId ) = IdRd .
Proof. Clearly, for x ∈ Rd , R(ΦId )(x) = %R (x) − %R (−x) = x.
Remark 3.10. Lemma 3.9 can be generalised to yield emulations of the identity function with arbitrary numbers of
layers. For each d ∈ N, and each L ∈ N≥2 , we define
 
  
Id d
ΦId
d,L :=
 R
, 0 , (IdR2d , 0), . . . , (IdR2d , 0), ([IdRd | −IdRd ] , 0) .
−IdRd | {z }
L−2 times

For L = 1, one can achieve the same bounds, simply by setting ΦId
d,1 := ((IdRd , 0)).

Our first application of the NN of Lemma 3.9 is for a redefinition of the concatentation. Before that, we
first convince ourselves that the current notion of concatenation is not adequate if we want to control the
number of parameters of the concatenated NN.
Example 3.11. Let N ∈ N and Φ = ((A1 , 0), (A2 , 0)) with A1 = [1, . . . , 1]T ∈ RN ×1 , A2 = [1, . . . , 1] ∈ R1×N .
Per definition we have that M (Φ) = 2N .
Moreover, we have that
Φ Φ = ((A1 , 0), (A1 A2 , 0), (A2 , 0)).
It holds that A1 A2 ∈ RN ×N and every entry of A1 A2 equals 1. Hence M (Φ Φ) = N + N 2 + N .

25
Example shows that the number of weights of networks behaves quite undesirably under concatenation.
Indeed, we would expect that it should be possible to construct a concatenation of networks that imple-
ments the composition of the respective realisations and the number of parameters scales linearly instead of
quadratically in the number of parameters of the individual networks.
Fortunately, Lemma 3.9 enables precisely such a construction, see also Figure 3.8 for an illustration.
Definition 3.12. Let L1 , L2 ∈ N, and let Φ1 = ((A11 , b11 ), . . . , (A1L1 , b1L1 )) and Φ2 = ((A21 , b21 ), . . . , (A2L2 , b2L2 )) be
two NNs such that the input layer of Φ1 has the same dimension d as the output layer of Φ2 . Let ΦId be as in Lemma
3.9.
Then the sparse concatenation of Φ1 and Φ2 is defined as

Φ1 Φ2 := Φ1 ΦId Φ2 .
Remark 3.13. It is easy to see that
  2 ! !
A2L2

bL2
Φ1 Φ2 = (A21 , b21 ), . . . , (A2L2 −1 , b2L2 −1 ), , A1 −A11 , b11 , (A12 , b12 ), . . . , (A1L1 , b1L1 )
 1  
,
−A2L2 −b2L2

has L1 + L2 layers and that R(Φ1 Φ2 ) = R(Φ1 ) ◦ R(Φ2 ) and M (Φ1 Φ2 ) ≤ 2M (Φ1 ) + 2M (Φ2 ).

Approximation of the square: We shall now build a NN that approximates the square function on [0, 1].
Of course this is based on the estimate (3.12).
Proposition 3.14 ([38, Proposition 2]). Let 1/2 >  > 0. There exists a NN Φsq, such that, for  → 0,
L(Φsq, ) = O(log2 (1/)) (3.16)
M (Φ ) = sq,
O(log22 (1/)) (3.17)
R(Φsq, )(x) − x2 ≤ ,

(3.18)
for all x ∈ [0, 1]. In addition, we have that R(Φsq, )(0) = 0.
Proof. By (3.12), we have that, for N := d− log2 ()/2e, it holds that, for all x ∈ [0, 1],
N


2 X Fn (x)
x − x + ≤ . (3.19)
22n n=1

We define, for n = 1, . . . , N ,

Φn := ΦId
1,N −n (Φ
| · · Φ∧}).
·{z (3.20)
n−times


Then we have that L(Φn ) = N − n + L(Φ
| · · Φ∧}) = N + 1 by Proposition 3.6. Moreover, by Remark 3.13,
·{z
n−times


M (Φn ) ≤ 2M (ΦId
1,N −n ) + 2M (Φ
| · · Φ∧}) ≤ 4(N − n) + 2(12n − 2) ≤ 24N,
·{z (3.21)
n−times

where the penultimate inequality follows from Remark 3.10 and Proposition 3.6.
Next, we set
Φsq, := 1, −1/4, . . . , −2−2N , 0 P ΦId
   
d,N +1 , Φ1 , . . . , ΦN .

Per construction, we have that


N N
X
−2n
X Fn (x)
R (Φsq, ) (x) = R ΦId

d,N +1 (x) − 2 R (Φj ) (x) = x − ,
n=1 n=1
22n

26
Figure 3.8: Top: Two neural networks, Middle: Sparse concatenation of the two networks as in Definition
3.12, Bottom: Regular concatenation as in Definition 2.9.

and, by (3.19), we conclude (3.18), for all x ∈ [0, 1], and that R(Φ)(0) = 0. Moreover, we have by Remark 3.13
that
L (Φsq, ) = L 1, −1/4, . . . , −2−2N , 0 + L P ΦId
   
d,N +1 , Φ1 , . . . , ΦN = N + 2 = d− log2 ()/2e + 2.
This shows (3.16). Finally, by Remark 3.13
M (Φsq, ) ≤ 2M 1, −1/4, . . . , −2−2N , 0 + 2M P ΦId
   
d,N +1 , Φ1 , . . . , ΦN
N
!
Id
 X
= 2(N + 1) + 2 M Φd,N +1 + M (Φn )
n=1
N
X
= 2(N + 1) + 4(N + 1) + 2 M (Φn )
n=1
N
X
≤ 6(N + 1) + 2 24N = 6(N + 1) + 48N 2 ,
n=1

where we applied (3.21) in the last estimate. Clearly,


6(N + 1) + 48N 2 = O N 2 , for N → ∞,


27
and hence
M (Φsq, ) = O log22 (1/) , for  → 0,


which yields (3.17).


Remark 3.15. In [29, Theorem 5], a proof of the result above is given that does not require Proposition 3.8, but instead
is based on three fascinating ideas:
• Multiplication
P∞ can be approximated by finitely many semi-binary multiplications: For x ∈ [0, 1], we
write x = `=1 x` 2` . Then

X N
X
x·y = 2−` x` y = 2−` x` y + O(2−N ), for N → ∞.
`=1 `=1

• Multiplication on [0, 1] by 0 or 1 can be build with a single ReLU: It holds that


 −`
2 y if x` = 1
%R (2−` y + x` − 1) =
0 else
= 2−` x` y.

• Extraction of binary representation is efficient: We have, by Proposition 3.6, that F` vanishes on all i2−` for
i = 0, . . . , 2` even and equals 1 on all i2−` for i = 0, . . . , 2` odd. Therefore
N
!
X
−`
FN 2 x` = x` .
`=1

By a short computation this yields that for all x ∈ [0, 1] that FN (x − 2−N −1 ) > 1/2, if xN = 1 and FN (x −
2−N −1 ) ≤ 1/2, if xN = 0. Hence, by building an approximate Heaviside function 1x≥0.5 with ReLU realisations
of networks, it is clear that one can approximate the map x 7→ x` .
Building N of the binary multiplications therefore requires N bit extractors and N multipliers by 0/1. Hence, this
requires of the order of N neurons, to achieve an error of 2−N .

3.3 Approximation of smooth functions


With the emulation of the square function on [0, 1] we have, in principle, a way of emulating the higher-order
sigmoidal function x2+ by ReLU networks. As we have seen in Section 2.5, sums and compositions of these
functions can be used to approximate smooth functions very efficiently.

Approximation of multiplication: Based on the idea, that we have already seen in the proof of Propo-
sition 2.18, in particular, Equation (2.14), we show how an approximation of a square function yields an
approximation of a multiplication operator.
Proposition 3.16. Let p ∈ N, K ∈ N,  ∈ (0, 1/2). There exists a NN Φmult,p, such that for  → 0

L(Φmult,p, ) = O(log2 (K) · log2 (1/)) (3.22)


M (Φ mult,p,
) = O(log2 (K) · log22 (1/)) (3.23)
p

Y
mult,p,
R(Φ )(x) − x` ≤ , (3.24)


`=1

for all x = (x1 , x2 , . . . , xp ) ∈ [−K, K]p . Moreover, R(Φmult,p, )(x) = 0 if x` = 0 for at least one ` ≤ p. Here the
implicit constant depends on p only.

28
Proof. The crucial observation is that, by the parallelogram identity, we have that for x, y ∈ [−K, K]
2  2 !
K2

x+y x−y
x·y = · −
4 K K
2  2 !
K2

%R (x + y) %R (−x − y) %R (x − y) %R (−x + y)
= + − + .
4 K K K K

We set
   
1 1  2
K2
     
 , 0 , 1 · 1 K
 −1 −1   1 0 0 
Φ1 :=  ,0  , and Φ 2 := , − ,0 .
 1 −1   K 0 0 1 1  2 2
−1 1

Now we define  2 2

Φmult,2, := Φ2 FP Φsq,/K , Φsq,/K Φ1 .

It is clear that, for all x, y ∈ [−K, K],


R Φmult,2, (x, y) − x · y ≤ .


2
Moreover, the size of Φmult,2, is up to a constant that of Φsq,/K . Thus (3.23)-(3.24) follow from Proposition
3.14. The construction for p > 2 follows by the now well-known stategy of building a binary tree of basic
multiplication networks as in Figure 2.4.
A direct corollary of Proposition 3.16 is the following Corollary that we state without proof.
Corollary 3.17. Let p ∈ N, K ∈ N,  ∈ (0, 1/2). There exists a NN Φpow,p, such that, for  → 0,

L(Φpow,p, ) = O(log2 (K) · log2 (1/))


M (Φpow,p, ) = O(log2 (K) · log22 (1/))
|R(Φpow,p, )(x) − xp | ≤ ,

for all x ∈ [−K, K]. Moreover, R(Φpow,p, )(x) = 0 if x = 0. Here the implicit constant depends on p only.

Approximation of B-splines: Now that we can build a NN the realisation of which is a multiplication of
p ∈ N scalars, it is not hard to see with (2.10) that we can rebuild cardinal B-splines by ReLU networks.
Proposition 3.18. Let d, k, ` ∈ N, k ≥ 2, t ∈ Rd , 1/2 >  > 0. There exists a NN Φd`,t,k such that for  → 0

L(d, k) := L(Φd`,t,k ) = O(log2 (1/)), (3.25)


M (d, k) := M (Φd`,t,k ) = O(log22 (1/)), (3.26)
R(Φd`,t,k )(x) − N`,t,k
d

(x) ≤ , (3.27)

for all x ∈ Rd .
Proof. Clearly, it is sufficient to show the result for ` = 0 and t = 0. We have by (2.10) that
k  
1 X k
Nk (x) = (−1)` (x − `)k−1
+ , for x ∈ R, (3.28)
(k − 1)! `
`=0

29
It is well known, see [31], that supp Nk = [0, k] and kNk k∞ ≤ 1. Let δ > 0, then we set
 
       
1 k k k
Φk,δ := ,− , . . . , (−1)k ,0 FP Φpow,k−1,δ , . . . , Φpow,k−1,δ 
(k − 1)! 0 1 k | {z }
k+1−times

((A1 , b1 ), (IdRk+1 , 0)),

where
A1 = [1, 1, . . . , 1]T , b1 = −[0, 1, . . . , k]T ,
and IdRk+1 is the identity matrix on Rk+1 . Here K := k + 1 in the definition of Φpow,k−1,δ via Corollary 3.17.
It is now clear, that we can find δ > 0 so that

|R(Φk,δ )(x) − Nk (x)| ≤ /(4d2d−1 ), (3.29)

for x ∈ [−k − 1, k + 1]. With sufficient care, we see that, we can choose δ = Ω(), for  → 0. Hence, we can
conclude from Definition 3.12 that Lδ := L(Φk,δ ) = O(L(Φmult,k+1,δ )) = O(log2 (1/)), and M (Φk,δ ) =
O(Φmult,k+1,δ ) ∈ O(log22 (1/)), for  → 0 which yields (3.25) and (3.26). At this point, R(Φk,δ ) only accurately
approximates Nk on [−k − 1, k + 1]. To make this approximation global, we multiply R(Φk,δ ) with an
appropriate indicator function.
Let   
T T
Φcut := [1, 1, 1, 1] , [1, 0, −k, −k − 1] , ([1, −1, −1, 1] , 0) .

Then R(Φcut ) is a piecewise linear spline with breakpoints −1, 0, k, k + 1. Moreover, R(Φcut ) is equal to 1 on
[0, k], vanishes on [−1, k + 1]c , and is non-negative and bounded by 1. We define
 
Φe k,δ := Φmult,2,/(4d2d−1 ) P Φk,δ , ΦId δ Φcut .
 1,L −2

Since the realisation of the multiplication is 0 as soon as one of the inputs is zero by Proposition 3.16, we
conclude that
 
R Φ
e k,δ (x) − Nk (x) ≤ /(2d2d−1 ), (3.30)


for all x ∈ R. Recall that


d
Y
d
N0,0,k (x) := Nk (xj ) , for x = (x1 , . . . , xd ) ∈ Rd .
j=1

Now we define

Φd0,0,k, := Φmult,d,/2 FP(Φ


e k,δ , . . . , Φ
e k,δ ).
|  {z }
d−times

We have that

Yd Yd   Yd  
d
N0,0,k (x) − R Φd0,0,k, (x) ≤ d
 
Nk (x j ) − R Φ
e k,δ (x )
j
+
R Φ 0,0,k, (x) − R Φ
e k,δ (x j .
)
j=1 j=1 j=1

Additionally, we have by (3.30) that




Yd    
e k,δ (xj ) − R Φ mult,d,/2
R Φ ◦ R(FP(Φ e k,δ ))(x) ≤ /2,
e k,δ , . . . , Φ
e
 
| {z }
j=1 d−times

30
for all x ∈ Rd . It is clear, by repeated applications of the triangle inequality that for aj ∈ [0, 1], bj ∈ [−1, 1],
for j = 1, . . . , d,

d d
Y Y  d−1
max |bj | ≤ d2d−1 max |bj |.

aj − (aj + b )
j
≤ d · 1 + max |b j |

j=1 j=1,...,d j=1,...,d j=1,...,d
j=1

Hence,
d d
Y Y  


Nk (x j ) − R Φ
e k,δ (x j ≤ /2.
)
j=1 j=1

This yields (3.27). The statement on the size of Φd0,0,k, follows from Remark 3.13.

Approximation of smooth functions: Having established how to approximate arbitrary B-splines with
Proposition 3.18, we obtain that we can also approximate all functions that can be written as weighted sums of
B-splines with bounded coefficients. Indeed, we can conclude with Theorem 2.16 and with similar arguments
as in Theorem 2.14 the following result. Our overall argument to arrive here followed the strategy of [34].

Theorem 3.19. Let d ∈ N, s > δ > 0 and p ∈ (0, ∞]. Then there exists a constant C > 0 such that, for every
f ∈ C s ([0, 1]d ) with kf kC s ≤ 1 and every 1/2 >  > 0, there exists a NN Φ such that

L(Φ) ≤ C log2 (1/), (3.31)


d
− s−δ
M (Φ) ≤ C , (3.32)
kf − R(Φ)kLp ≤ . (3.33)

Here the activation function is the ReLU.

Proof. Let f ∈ C s ([0, 1]d ) with kf kC s ≤ 1 and let s > δ > 0. By Theorem 2.16 there exist a constant C > 0
and, for every N ∈ N, ci ∈ R with |ci | ≤ C and Bi ∈ B k for i = 1, . . . , N and k := dse, such that

N
X δ−s
f − ci Bi ≤ CN d .


i=1 p

δ−s
By Proposition 3.18, each of the Bi can be approximated up to an error of N d /(CN ) with a NN Φi of
δ−s δ−s
depth O(log2 (N d /(CN ))) = O(log2 (N )) and number of weights O(log22 (N d /(CN ))) = O(log22 (N )) for
N → ∞.
We define
ΦN
f := ([c1 , . . . , cN ], 0) P (Φ1 , . . . , ΦN ) .

It is not hard to see that, for N → ∞,


2
M (ΦN
f ) = O(N log2 (N )) and L(ΦN
f ) = O(log2 (N )).

Additionally, by the triangle inequality


δ−s
f − R(ΦN

f ) p ≤ 2N
d .

To achieve (3.33), we, therefore, need to choose N = N := d(/2)d/(δ−s) e.


A simple estimate yields that L(ΦN f ) = O(log2 (1/)) for  → 0, i.e, (3.31). Moreover, we have that


N log22 (N ) ≤ 4d/(s − δ)(/2)d/(δ−s) log22 (/2) ≤ C 0 −d/(s−δ) log22 (),

31
for a constant C 0 > 0. It holds that log22 () = O(−σ ) for every σ > 0. Hence, for every δ 0 > δ with s > δ 0 , we
have 0
−d/(s−δ) log22 () = O(−d/(s−δ ) ), for  → 0.
0
As a consequence we have that M (ΦN
f ) = O(
 −d/(s−δ )
) for  → 0. Since δ was arbitrary, this yields (3.32).
Remark 3.20. • It was shown in [38] that Theorem 3.19 holds with δ = 0 but with the bound M (Φ) ≤
C−d/s log2 (1/). Moreover, it holds for f ∈ C s ([−K, K]d ) for K > 0, but the constant C will then de-
pend on K.

4 The role of depth


We have seen in the previous results that NNs can efficiently emulate the approximation of classical approxi-
mation tools, such as linear finite elements or B-splines. Already in Corollary 2.19, we have seen that deep
networks are sometimes more efficient at this task than shallow networks. In Remark 3.7, we found that
ReLU-realisations of deep NNs can represent certain saw-tooth functions with N linear pieces using only
O(log2 (N )) many weights, whereas shallow NNs require O(N ) many weights for N → ∞.
In this section, we investigate further examples of representation or approximation tasks that can be
performed easily with deep networks but cannot be achieved by small shallow networks or any shallow
networks.

4.1 Representation of compactly supported functions


Below we show that compactly supported functions cannot be represented by weighted sums of functions of
the form x 7→ %R (ha, xi), but they can be represented by 3-layer networks. This result is based on [4, Section
3].
Proposition 4.1. Let d ∈ N, d ≥ 2. The following two statements hold for the activation function %R :

• If L ≥ 3, then there exists a NN Φ with L layers, such that supp R(Φ) = Bk.k1 (0)k ,
• If L ≤ 2, then, for every NN Φ with L layers, such that supp R(Φ) is compact, we have that R(Φ) ≡ 0.
Proof. It is clear that, for every x ∈ Rd , we have that
d
X
(%R (x` ) + %R (−x` )) = kxk1 .
`=1

Moreover, the function %R (1 − kxk1 ) is clearly supported on Bk.k1 (0). Moreover, we have that %R (1 − kxk1 )
can be written as the realisation of a NN with at least 3 layers.
Next we address the second part of the theorem. If L = 1, then the set of realisations of NNs contains
only affine linear functions. It is clear that the only affine linear function that vanishes on a set of non-empty
interior is 0. For L = 2, all realisations of NNs have the form
N
X
x 7→ ci %R (hai , xi + bi ) + d, (4.1)
i=1

for N ∈ N, ci , bi , d ∈ R and ai ∈ Rd , for i = 1, . . . , N . We assume without loss of generality that all


ai 6= 0 otherwise %R (hai , xi + bi ) would be constant and one could remove the term from (4.1) by adapting d
accordingly.
Here kxkpp :=
k
Pd
k=1 |xk |p for p ∈ (0, ∞).

32
We next show that every function of the form (4.1) with compact support vanishes everywhere. For an
index i, we have that %R (hai , xi + bi ) is not continously differentiable at the hyperplane given by
 
bai
Si := − + z : z ⊥ ai .
kai k2

Let f be a function of the form (4.1). We define i ∼ j, if Si = Sj . Then we have that, for J ∈ {1, . . . , N }/ ∼
that a⊥ ⊥
i = aj for all i, j ∈ J. Hence, X
cj %R (haj , xi + bk ),
j∈J

is constant perpendicular to aj for every j ∈ J. And since the sum is piecewise affine linear, we have that it is
either affine linear or not continuously differentiable at every element of Sj . We can write
 
X X
f (x) =  cj %R (haj , xi + bj ) + d.
J∈{1,...,N }/∼ j∈J

If i 6∼ j, then Si and Sj Pintersect in hyperplanes of dimension d − 2. Hence, it is clear that, if for at least
one J ∈ {1, . . . , N }/ ∼, j∈J cj %R (haj , xi + bj ) is not linear, then f is not continuously differentiable almost
everywhere in Sj for j ∈ J. Since Sj is unbounded, this contradicts P the compact support assumption on f .
On the other hand, if, for all J ∈ {1, . . . , N }/ ∼, we have that j∈J cj %R (haj , xi + bj ) is affine linear, then f
is affine linear. By previous observations we have that this necessitates f ≡ 0 to allow compact support of
f.
Remark 4.2. Proposition 4.1, deals with representability only. However, a similar result is true in the framework of
approximation theory. Concretely, two layer networks are inefficient at approximating certain compactly supported
functions, that three layer networks can approximate very well, see e.g. [9].

4.2 Number of pieces


We start by estimating the number of piecewise linear pieces of the realisations of NNs with input and output
dimension 1 and L layers. This argument can be found in [35, Lemma 2.1].

Theorem 4.3. Let L ∈ N. Let % be piecewise affine linear with p pieces. Then, for every NN Φ with d = 1, NL = 1
and N1 , . . . , NL−1 ≤ N , we have that R(Φ) has at most (pN )L−1 affine linear pieces.

Proof. The proof is given via induction over L. For L = 2, we have that
N1
X
R(Φ) = ck %(hak , xi + bi ) + d,
k=1

where ck , ak , bi , d ∈ R. It is not hard to see that if f1 , f2 are piecewise affine linear with n1 , n2 pieces each,
then f1 + f2 is piecewise affine linear with at most n1 + n2 pieces. Hence, R(Φ) has at most N p many affine
linear pieces.
Assume the statement to be proven for L ∈ N. Let ΦL+1 be a NN with L + 1 layers. We set

ΦL+1 =: ((A1 , b1 ) , . . . , (AL+1 , bL+1 )) .

It is clear, that
R(ΦL+1 )(x) = AL+1 [%(h1 (x)), . . . , %(hNL (x))]T + bL+1 ,
where for ` = 1, . . . , NL each h` is the realisation of a NN with input and output dimension 1, L layers, and
less than N neurons in each layer.

33
For a piecewise affine linear function f with p̃ pieces, we have that % f has at most p · p̃ pieces. This is
because, for each of the p̃ affine linear pieces of f —let us call one of those pieces A ⊂ R—we have that f is
either constant or injective on A and hence % f has at most p linear pieces on A.
By this observation and the induction hypothesis, we conclude that % h1 has at most p(pN )L−1 affine
linear pieces. Hence,
NL
X
R(ΦL+1 )(x) = (AL+1 )k %(hk (x)) + bL+1
k=1
L−1 L
has at most N p(pN ) = (pN ) many affine linear pieces. This completes the proof.
For functions with input dimension more than 1 we have the following corollary.
Corollary 4.4. Let L, d ∈ N. Let % be piecewise affine linear with p pieces. Then, for every NN Φ with NL = 1 and
N1 , . . . , NL−1 ≤ N , we have that R(Φ) has at most (pN )L−1 affine linear pieces along every line.
Proof. Every line in Rd can be parametrized by R 3 t 7→ x0 + tv for x0 , v ∈ Rd . For Φ as in the statement of
corollary, we have that
R(Φ)(x0 + tv) = R(Φ Φ0 )(t),
where Φ0 = ((v, x0 )), which gives the result via Theorem 4.3.

4.3 Approximation of non-linear functions


Through the bounds on the number of pieces of a realisation of a NN with an piecewise affine linear activation
function, we can deduce a limit on approximability through NNs with bounds on the width and numbers of
layers for certain non-linear functions. This is based on the following observation, which can, e.g., be found
in [10].
Proposition 4.5. Let f ∈ C 2 ([a, b]), for a < b < ∞ so that f is not affine linear, then there exists a constant
c = c(f ) > 0 so that, for every p ∈ N,
kg − f k∞ > cp−2 ,
for all g which are piecewise affine linear with at most p pieces.
From this argument, we can now conclude the following lower bound to approximating functions which
are not affine linear by realisations of NNs with fixed numbers of layers.

Theorem 4.6. Let d, L, N ∈ N, and f ∈ C 2 ([0, 1]d ), where f is not affine linear. Let % : R → R be piecewise affine
linear with p pieces. Then for every NN with L layers and fewer than N neurons in each layer, we have that

kf − R(Φ)k∞ ≥ c(pN )−2(L−1) .

Proof. Let f ∈ C 2 ([0, 1]d ) and non-linear. Then it is clear that there exists a point x0 and a vector v so that
t 7→ f (x0 + tv) is non-linear in t = 0.
We have that, for every NN Φ with d-dimensional input, one-dimensional output, L layers, and fewer
than N neurons in each layer that

kf − R(Φ)k∞ ≥ kf (x0 + ·v) − R(Φ)(x0 + ·v)k∞ ≥ c · (pN )−2(L−1) ,


where the last estimate is by Corollary 4.4 and Proposition 4.5.
Remark 4.7. Theorem 4.6 shows that Theorem 3.19 would not hold with a fixed, bounded number of layers L as soon
as s sufficiently large. In other words, for very smooth functions, shallow networks yield suboptimal approximation
rates.
Moreover, no twice continuously differentiable and non-linear function can be approximated with an error that
decays with a super polynomial rate in the number of neurons by NNs with a fixed number of layers. In particular, the
approximation rate of Proposition 3.14 is not achievable by sequences of NNs of fixed finite depth.

34
5 High dimensional approximation
At this point we have seen two things on an abstract level. Deep NNs can approximate functions as well as
basically every classical approximation scheme. Shallow NNs do not perform as well as deep NNs in many
problems. From these observations we conclude that deep networks are preferable over shallow networks,
but we do not see why we should not use a classical tool, such as B-splines in applications instead. What is it
that makes deep NNs better than classical tools?
One of the advantages will become clear in this section. As it turns out, deep NNs are quite efficient in
approximating high dimensional functions.

5.1 Curse of dimensionality


The curse of dimensionality is a term introduced by Bellman [3] which is commonly used to describe an
exponentially increasing difficulty of problems with increasing dimension. A typical example is that of
function interpolation. We define the following function class, for d ∈ N,
( )
Fd := f ∈ C ∞ ([0, 1]d ) : sup kDα f k ≤ 1 .
|α|=1

If one defines e(n, d) as the smallest number such that there exists an algorithm reconstructing every f ∈ Fd
up to an error of e(n, d) from n point evaluations of f , then

e(n, d) = 1

for all n ≤ 2bd/2c − 1, see [20]. As a result, in any statement of the form

e(n, d) ≤ Cd,r n−r ,

we have that the constant Cd,r depends exponentially on d.


Another instance of this principle can be observed when approximating non-smooth functions. For
example, in Theorem 2.16, we saw that the approximation rate, when approximating functions f ∈ C s ([0, 1]d )
deteriorates exponentially with the dimension d. In fact, the approximation rates of Theorem 2.16 are, up to
the δ, optimal under some very reasonable assumptions on the approximation scheme, see [8] and discussions
later in the manuscript. Hence, there is a fundamental lower bound on approximation capabilities of any
approximation scheme that increases exponentially with the dimension.
Careful inspection of the arguments above show that these arguments also apply to approximation by
deep NNs. Hence, whenever we say below, that NNs overcome the curse of dimensionality then we mean that
under a certain additional assumption on the functions to approximate, we will not see a terrible dependence
of the approximation rate on the dimension.

5.2 Hierarchy assumptions


We have seen in Corollary 2.19 and Theorem 3.19 that, to approximate a C s regular function by a NN with a
higher-order sigmoidal function or a ReLU as activation function up to an accuracy  > 0, we need essentially
Pd
O(−d/s ) many weights. In contrast to that, a d-dimensional function f so that f (x) = i=1 gi (xi ), where
all the gi are one dimensional can be approximated using essentially dO(−1/s ) many weights, which is
asymptotically much less than O(−d/s ) for  → 0.
It is, therefore, reasonable to assume that high dimensional functions that are build from lower dimensional
functions in a way that can be emulated well with NNs, can be much more efficiently approximated than
high dimensional functions without this structure.
This observation was used in [25] to study approximation of so-called compositional functions. The
definition of these functions is based on special types of graphs.

35
Definition 5.1. Let d, k, N ∈ N and let G(d, k, N ) be the set of directed acyclic graphs with N vertices, where the
indegree of every vertex is at most k and the outdegree of all but one vertex is at least 1 and the indegree of exactly d
vertices is 0.
For G ∈ G(d, k, N ), let (ηi )N
i=1 be a topological ordering of G. In other words, every edge ηi ηj in G satisfies i < j.
Moreover, for each i > d we denote
Ti := {j : ηj ηi is an edge of G},
and di = #Ti ≤ k.
With the necessary graph theoretical framework established, we can now define sets of hierarchical
functions.
Definition 5.2. Let d, k, N, s ∈ N. Let G ∈ G(d, k, N ) and let, for i = d + 1, . . . , N , fi ∈ C s (Rdi ) with
kfi kC s (Rdi ) ≤ 1∗∗ . For x ∈ Rd , we define for i = 1, . . . , d vi = xi and vi (x) = fi (vj1 (x), . . . , vjdi (x)), where
j1 , . . . , jdi ∈ Ti and j1 < j2 < · · · < jdi .
We call the function
f : [0, 1]d → R, x 7→ vN (x)
a compositional function associated to G with regularity s. We denote the set of compositional functions associated
to any graph in G(d, k, N ) with regularity s by CF(d, k, N ; s).
We present a visualisation of three types of graphs in Figure 5.1. While we have argued before that it is
reasonable to expect that NNs can efficiently approximate these types of functions, it is not entirely clear
why this is a relevant function class to study. In [19, 25], it is claimed that these functions are particularly
close to the functionality of the human visual cortex. In principle, the visual cortex works by first analysing
very localised features of a scene and then combining the resulting responses in more and more abstract
levels to yield more and more high-level descriptions of the scene.
If the inputs of a function correspond to spatial locations, e.g., come from several sensors, such as in
weather forecasting, then it might make sense to model this function as network of functions that first
aggregate information from spatially close inputs before sending the signal to a central processing unit.
Compositional functions can also be compared with Boolean circuits comprised of simple logic gates.
Let us now show how well functions from CF(d, k, N ; s) can be approximated by ReLU NNs. Here we
are looking for an approximation rate that increases with s and, hopefully, does not depend too badly on d.
Theorem 5.3. Let d, k, N, s ∈ N. Then there exists a constant C > 0 such that for every f ∈ CF(d, k, N ; s) and
every 1/2 >  > 0 there exists a NN Φf with

L(Φf ) ≤ CN 2 log2 (k/) (5.1)


kN
−k
M (Φf ) ≤ CN 4 (2k) s  s log2 (k/) (5.2)
kf − R(Φf )k∞ ≤ , (5.3)

where the activation function is the ReLU.

Proof. Let f ∈ CF(d, k, N ; s) and let, for i = d + 1, . . . , N , fi ∈ C s (Rdi ) be according to Definition 5.2. By
Theorem 3.19 and Remark 3.20, we have that there exists a constant C > 0 and NNs Φi such that

|R(Φi )(x) − fi (x)| ≤ , (5.4)
(2k)N
for all x ∈ [−2, 2]di and L(Φi ) ≤ CN log2 (k/) and
di N kN
M (Φi ) ≤ C−di /s (2k) s N log2 (k/) ≤ C−k/s (2k) s N log2 (k/).
∗∗ The restriction kfi kC s (Rdi ) ≤ 1 could be replaced by kfi kC s (Rdi ) ≤ κ for a κ > 1, and Theorem 5.3 below would still hold up to
some additional constants depending on κ. This would, however, significantly increase the technicalities and obfuscate the main ideas
in Theorem 5.3.

36
Figure 5.1: Three types of graphs that could be the basis of compositional functions. The associated functions
are composed of two or three dimensional functions only.

For i = d + 1, . . . , N , let Pi be the orthogonal projection from Ri−1 to the components in Ti , i.e, for
di
Ti =: {j1 , . . . , jdi }, where j1 < · · · < jdi , we set Pi ((xk )i−1
k=1 ) = (xjk )k=1 .
Now we define for j = d + 1, . . . , N − 1,
 
e j := P ΦId
Φ j−1,L(Φj ) , Φj Pj ,

and
e N := ΦN PN .
Φ
Moreover,
Φf := Φ e N −1 . . . Φ
eN Φ e d+1 .

We first analyse the size of Φf . It is clear that


N
  N
L (Φf ) ≤ N max L Φ e j ≤ N max L (Φj ) ≤ CN 2 log2 (k/),
j=d+1 j=d+1

which yields (5.1). Additionally, since


   
M Φ e N −1 . . . Φ
eN Φ e d+1 ≤ 2M ΦeN Φe N −1 . . . Φ
e d(N +d+1)/2e
 
+ 2M Φ e d(N +d+1)/2e−1 . . . Φe N +d+1/2 ,

we have that
N
  N
 
M (Φf ) . 2dlog2 (N )e N max M Φ
e j . N 2 max ej .
M Φ (5.5)
j=d+1 j=d+1

37
Furthermore,
N −1
N
    N
e j ≤ max
max M Φ M ΦId
j−1,L(Φj ) + max M (Φj )
j=d+1 j=d+1 j=d+1
N
≤ 2N L (Φj ) + max M (Φj )
j=d+1

≤ 2CN log2 (k/) + C−k/s (2k)N k/s N log2 (k/),


2

where the penultimate estimate follows by Remark 3.10. Therefore, by (5.5),


M (Φf ) . −k/s (2k)N k/s N 4 log2 (k/),
which implies (5.2).
Finally, we prove (5.3). We claim that for N > j > d in the notation of Definition 5.2, for x ∈ [0, 1]d ,
 
R Φ
e d+1 (x) − [v1 (x), v2 (x), . . . , vj (x)] ≤ / (2k)N −j .
ej . . . Φ (5.6)

We prove (5.6) by induction. Since the realisation of ΦIdd,L(Φd+1 ) is the identity, we have, by construction that
e d+1 )(x))k = vk (x) for all k ≤ d. Moreover, by (5.4), we have that
(R(Φ
       

e d+1 (x)
d+1
N
R Φ − vd+1 (x) = R Φ
e (x) − fd+1 (x) ≤ / (2k) .
d+1 d+1

Assume, for the induction step, that (5.6) holds for N − 1 > j > d.
Again, since the identity is implemented exactly, we have by the induction hypothesis that, for all k ≤ j,
   
R Φ
e d+1 (x) − vk (x) ≤ / (2k)N −j .
e j+1 . . . Φ
k

Moreover, we have that vj+1 (x) = fj+1 (Pj+1 (v1 (x), . . . , vj (x)])). Hence,
   
e j+1 . . . Φ
e d+1 (x)

R Φ − v j+1 (x)
j+1

 
= R (Φj+1 ) ◦ Pj+1 ◦ R Φ
ej . . . Φ e d+1 (x) − vj+1 (x)
   
≤ R (Φj+1 ) ◦ Pj+1 ◦ R Φ
ej . . . Φ e d+1 (x) − fj+1 ◦ Pj+1 ◦ R Φ e d+1 (x)
ej . . . Φ
 
+ fj+1 ◦ Pj+1 ◦ R Φ
ej . . . Φe d+1 (x) − fj+1 ◦ Pj+1 ◦ [v1 (x), . . . , vj (x)] =: I + II.
 
Per (5.4), we have that I ≤ /(2k)N (Note that Pj+1 ◦ R Φ e d+1 (x) ⊂ [−2, 2]dj+1 by the induction
ej . . . Φ
hypothesis). Moreover,  since every partial  derivative of fj+1 is bounded in absolute value by 1 we have that
II ≤ dj+1 / (2k)N −j ≤ / 2(2k)N −j−1 by the induction assumption. Hence I + II ≤ /(2k)N −j−1
Finally, we compute
 
R Φ
eN . . . Φe d+1 (x) − vN (x)
 
= R (ΦN ) ◦ PN ◦ R Φ
e N −1 . . . Φe d+1 (x) − vN (x)
   
≤ R (ΦN ) ◦ PN ◦ R Φ
e N −1 . . . Φe d+1 (x) − fN ◦ PN ◦ R Φ e d+1 (x)
e N −1 . . . Φ
 
+ fN ◦ PN ◦ R Φ
e N −1 . . . Φe d+1 (x) − fN ◦ PN ◦ [v1 (x), . . . , vN −1 (x)] =: III + IV.

Using the exact same argument as for estimating I and II above, we conclude that
III + IV ≤ ,
which yields (5.3).

38
Remark 5.4. Theorem 5.3 shows what we had already conjectured earlier. The complexity of approximating a composi-
tional function depends asymptotically not on the input dimension d, but on the maximum indegree of the underlying
graph.
We also see that, while the convergence rate does not depend on d, the constants in (5.2) are potentially very large.
In particular, for fixed s the constants grow superexponentially with k.

5.3 Manifold assumptions


Realisations of deep NNs are, by definition, always functions on a d dimensional euclidean space. Of course,
we may only care about the values that this function takes on subsets of this space. For example, we may
only study approximation by NNs on compact subsets of Rd . In this manuscript, we have mostly studied this
setup for compact subsets of the form [A, B]d , where A < B.
Another approach could be, that we only care about the approximation of functions that live on low
dimensional submanifolds M ⊂ Rd . In applications, such as image classification, it is conceivable that the
input data, only come from the (potentially) low dimensional submanifold of natural images. In that context,
it is clear that the approximation properties of NNs are only interesting to us on that submanifold. In other
words, we would not care about the behaviour of a NN on inputs that are just unstructured combinations of
pixel values.
For a function f : M → Rn and  > 0, we now search for a NN Φ with input dimension d and output
dimension n, such that

|f (x) − R(Φ)(x)| ≤ , for all x ∈ M.

If M is a d0 -dimensional manifold with d0 < d, and f ∈ C n (M), then we would expect to be able to obtain
an approximation rate by NNs, that does not depend on d but on d0 .
To obtain such a result, we first make a convenient definition of certain types of submanifolds of Rd .
Definition 5.5. Let M be a smooth d0 -dimensional submanifold of Rd . For N ∈ N, δ > 0, We say that M is
(N, δ)-covered, if there exist x1 . . . xN ⊂ M and such that
SN
• i=1 Bδ/2 (xi ) ⊃ M
• the projection
Pi : M ∩ Bδ (xi ) → Txi M
is injective and smooth and
Pi−1 : Pi (M ∩ Bδ (xi )) → M
is smooth.
0
Here Txi M is the tangent space of M at xi . See Figure 5.2 for a visualisation. We identify Txi M with Rd in the
sequel.

Next, we need to define spaces of smooth functions on M. For k ∈ N, a function f on M is k-times


continuously differentiable if f ◦ ϕ−1 is k-times continuously differentiable for every coordinate chart ϕ. If
M is (N, δ) covered, then we can even introduce a convenient C k - norm on the space of k-times continuously
differentiable functions on M by

kf kC k ,δ,N := sup f ◦ Pi−1 C k (Pi (M∩B (xi ))) .



δ
i=1,...,N

With this definition, we can have the following result which is similar to a number of results in the
literature, such as [32, 33, 5, 30].

39
M

Figure 5.2: One dimensional manifold embedded in 2D. For two points the tangent space is visualised in red.
The two circles describe areas where the projection onto the tangent space is invertible and smooth.

Theorem 5.6. Let d, k ∈ N, M ⊂ Rd be a (N, δ)-covered d0 -dimensional manifold for an N ∈ N and δ > 0. Then
there exists a constant c > 0, such that, for every  > 0, and f ∈ C k (M, R) with kf kC k ,δ,N ≤ 1, there exists a NN Φ,
such that

kf − R(Φ)k∞ ≤ ,
 d0 
M (Φ) ≤ c · − k log2 (1/)
L(Φ) ≤ c · (log2 (1/)) .

Here the activation function is the ReLU.

Proof. The proof is structured in two parts. First we show a convenient alternative representation of f , then
we construct the associated NN.
Step 1: Since M is (N, δ)-covered, there exists B > 0 such that M ⊂ [−B, B]d .
Let T be a simplicial mesh on [−B, B]d so that for all nodes ηi ∈ T we have that

G(i) ⊂ Bδ/8 (ηi ).

See (3.2) for the definition of G(i) and Figure 5.3 for a visualisation of T .
By Proposition 3.1, we have that
MN
X
1= φi,T .
i=1

We denote
IM := {i = 1, . . . , MN : dist(ηi , M) ≤ δ/8},
where dist(a, M) = miny∈M |a − y|. Per construction, we have that
X
1= φi,T (x), for all x ∈ M.
i∈IM

40
M

Figure 5.3: Construction of mesh and choice of IM for a given manifold M

In Figure 5.3, we highlight the cells corresponding to IM .


SN
Moreover, by Definition 5.5, there exist x1 . . . xN ∈ M such that i=1 Bδ/2 (xi ) ⊃ M. Hence, ηi ∈
SN
i=1 B5δ/8 (xi ) for all i ∈ IM . Thus, for each ηi there exists j(i) ∈ {1, . . . , N } such that Bδ/8 (ηi ) ⊂ B3δ/4 (xj(i) ).
We rewrite f as follows: For x ∈ M, we have that
X
f (x) = φi,T (x) · f (x)
i∈IM
X  
−1
= φi,T (x) · f ◦ Pj(i) ◦ Pj(i) (x)
i∈IM
X 
=: φi,T (x) · fj(i) ◦ Pj(i) (x) , (5.7)
i∈IM

where fi : Pi (M ∩ Bδ (xi )) → R has C k norm bounded by 1. We have that

Pi (M ∩ B3δ/4 (xi )) ⊂ Pi (M ∩ B3δ/4 (xi )) ⊂ Pi (M ∩ B7δ/8 (xi ))

and Pi (M ∩ B3δ/4 (xi )), Pi (M ∩ B7δ/8 (xi )) are open. By a C ∞ version of the Urysohn Lemma, there exists a
smooth function σ : Rd → [0, 1] such that σ = 1 on Pi (M ∩ B3δ/4 (xi )) and σ = 0 on (Pi (M ∩ B7δ/8 (xi )))c .
We define 
σfi for x ∈ Pi (M ∩ Bδ (xi ))
f˜i :=
0 else.
0
It is not hard to see that f˜i ∈ C k (Rd ) with kf kC k ≤ CM , where CM is a constant depending on M only and
f˜i = fi on Pi (M ∩ B3δ/4 (xi )). Hence, with (5.7), we have that
X  
f (x) = φi,T (x) · f˜j(i) ◦ Pj(i) (x) . (5.8)
i∈IM

Step 2: The form of f given by (5.8) suggests a simple way to construct a ReLU approximation of f .

41
0
First of all, for every i ∈ IM , we have that Pj(i) is an affine linear map from [−B, B]d to Rd . We set
P :=
Φi ((Ai1 , bi1 )), where Ai1 , bi1 are such that Ai1 x + bi1 = Pj(i) (x) for all x ∈ Rd .
0
Let K > 0 be such that Pi (M) ⊂ [−K, K]d for all i ∈ IM . For every i ∈ IM , we have by Theorem 3.19
0
and Remark 3.20 that for every 1 > 0 there exists a NN Φfi such that, for all x ∈ [−K, K]d ,
 
˜ f
fj(i) (x) − R Φi (x) ≤ 1 ,

−d0 /k
 
M Φfi . 1 log2 (1/1 ), (5.9)
 
L Φfi . log2 (1/1 ). (5.10)

Per Proposition 3.3, there exists, for every i ∈ IM , a network Φφi with
     
R Φφi = φi,T , M Φφi , L Φφi . 1, (5.11)

with a constant depending on d.


Now we define, with Proposition 3.16, for 2 > 0,
 
φ(f P ) φ f
Φi := Φmult,2,2 P ΦId
1,L ∗ Φ i , Φ i Φ P
i ,

where L∗ := L(Φfi ΦP φ ∗ f P φ
i ) − L(Φi ). At this point, we assume that L ≥ 0. If L(Φi Φi ) < L(Φi ), then one
could instead extend Φfi .
Finally, we define, for Q := |IM |,
 
φ(f P ) φ(f P ) φ(f P )
Φ1 ,2 := (([1, . . . , 1], 0)) P Φi1 , Φi 2 , . . . , ΦiQ .

We have, by (5.8) that


   
φ(f P )
kf − R(Φ1 ,2 )k∞ ≤ Q max φi,T · f˜j(i) ◦ Pj(i) − R Φi


i∈IM ∞
     
≤ Q max φi,T · f˜j(i) ◦ Pj(i) − R Φi · R Φfi ΦP
φ

i
i∈IM ∞
     
φ f P φ(f P )
+ Q max R Φi · R Φi Φi − R Φi =: Q · (I + II).

i∈IM ∞
 
We proceed by estimating I. By (5.11) we have that φi,T = R Φφi and hence
   
I = max φi,T · f˜j(i) ◦ Pj(i) − φi,T · R Φfi ΦP

i
i∈IM ∞
   
˜ f P
≤ max fj(i) ◦ Pj(i) − R Φi Φi ≤ 1 .
i∈IM ∞

Moreover, II ≤ 2 by construction. We have, for  > 0 and 1 := 2 := /(2Q), that

kf − R (Φ1 ,2 )k∞ ≤ .

Finally, we estimate the size of Φ1 ,2 . We have that


 
φ(f P )
M (Φ1 ,2 ) ≤ Q max M Φi
i∈IM
    
φ f
≤ 2Q · M Φmult,2,2 + M ΦId P

1,L ∗ Φ i + M Φ i Φ i
     
∗ φ f
≤ 2Q · cmult log2 (1/) + 4L + 2M Φi + 2M Φi + 2M ΦP i ,

42
for a constant cmult > 0 by Proposition 3.16 and Remark 3.10. By (5.9), we conclude that
0
M (Φ1 ,2 ) . −d /k log2 (1/),
where the implicit constant depends on M and d. As the last step, we compute the depth of Φ1 ,2 . We have
that
 
φ(f P )
L(Φ1 ,2 ) = max L Φi
i∈IM
 
= L Φmult,2,2 + L Φfi ΦP

i ,
. log2 (1/2 ) + log2 (1/1 ) . log2 (1/)
by (5.10).
Remark 5.7. Theorem 5.6 shows that the approximation rate when approximating C k regular functions defined on a
manifold does not depend badly on the ambient dimension. However, at least in our construction, the constants may
still depend on d and even grow rapidly with d. For example, in the estimate in (5.11) the implicit constant depends,
because of Proposition 3.3 on the maximal number of neighbouring cells of the underlying mesh. For a typical mesh on
a grid Zd of a d dimensional space, it is not hard to see that this number grows exponentially with the dimension d.

5.4 Dimension dependent regularity assumption


The last instance of an approximation result without curse of dimension that we shall discuss in this section
is arguably the historically first result of this form. In [2], it was shown that, under suitable assumptions on
the integrability of the Fourier transform of a function, approximation rates that are (almost) independent of
the underlying dimensions are possible.
Here we demonstrate a slightly simplified result compared to that of [2]. Let, for C > 0,
 Z 
1 d ˆ ˆ

ΓC := f ∈ L R : kf k1 < ∞, |2πξ| f (ξ) dξ < C ,
Rd

where fˆ denotes the Fourier transform of f . By the inverse Fourier transform theorem, the condition kfˆk1 < ∞
implies that every element of ΓC is continuous. We also denote the unit ball in Rd by B1d := {x ∈ Rd : |x| ≤ 1}.
We have the following result:
Theorem 5.8 (cf. [2, Theorem 1]). Let d ∈ N, f ∈ ΓC , % : R → R be sigmoidal and N ∈ N. Then, for every
c > 4C 2 , there exists a NN Φ with

L(Φ) = 2,
M (Φ) ≤ N · (d + 2) + 1,
Z
1 2 c
|f (x) − R(Φ)(x)| dx ≤ ,
|B1d | d
B1 N

where |B1d | denotes the Lebesgue measure of B1d .


Before we present the proof of Theorem 5.8, we show the following auxilliary result, which is sometimes
called Approximate Caratheodory theorem, [37, Theorem 0.0.2].
Lemma 5.9 ([2, 24]). Let G be a subset of a Hilbert space and let G be such that the norm of each element of G is
bounded by B > 0. Let f ∈ co(G). Then, for every N ∈ N and c0 > B 2 there exist (gi )N N
i=1 ⊂ G and (ci )i=1 ⊂ [0, 1]
PN
with i=1 ci = 1 such that
N
2
X c0
f − ci gi ≤ . (5.12)


i=1
N

43
Proof. Let f ∈ co(G). For every δ > 0, there exists f ∗ ∈ co(G) so that

kf − f ∗ k ≤ δ.

Since f ∗ ∈ co(G), there exists m ∈ N so that


m
X
f∗ = c0i gi0
i=1
Pm
for some (gi0 )m 0 m
i=1 ⊂ G, (ci )i=1 ⊂ [0, 1], with
0
i=1 ci = 1.
At this point, there exists an at most m dimensional linear space Lm such that (gi0 )m i=1 ⊂ Lm which is
isometrically isomorphic to Rm . Hence, we can think of gi0 , and f ∗ to be elements of Rm in the sequel.††
Let σ be a probability distribution on {1, . . . , m} with Pσ (k) = c0k for k ∈ {1, . . . , m}. We have that
m
X
E(gi ) = E(g1 ) = c0i gi0 = f ∗ ,
i=1

and therefore Xi = gi − f ∗ , for i = 1, . . . , m, are i.i.d random variables with E(Xi ) = 0. Since the Xi are
independent random variables, we have that

N
2  N N
1 X 1 X 1 X
E kXi k2 = 2 E kgi k2 − 2 hgi , f ∗ i + kf ∗ k2
 
Xi  = 2

E 
N N i=1 N i=1
i=1
N
1 X B2
E kgi k2 − kf ∗ k2 ≤

= 2 . (5.13)
N i=1 N

The argument above is, of course, commonly known as the weak law of large numbers. Because of (5.13)
there exists at least one event ω such that
N
2
1 X B2
Xi (ω) ≤

N N


i=1

and hence 2
N
1 X B2
gi (ω) − f ∗ ≤ .

N N


i=1

By the triangle inequality, we have that


N
2
1 X B2
gi (ω) − f ≤ + δ.

N N


i=1

Since δ was arbitrary this yields the result.


We can have a more intuitive and elementary argument yielding Lemma 5.9 if G = (φi )∞ i=1 is an orthonor-
mal basis. This is based on an argument usually referred to as Stechkin’s estimate, see e.g. [6, Lemma 3.6].
Let f ∈ co(G), then

X ∞
X
f= hf, φi iφi =: ci (f )φi (5.14)
i=1 i=1
†† This simplification is not necessary at all, but some people might find it easier to think of real-valued random variables instead of

Hilbert-space-valued.

44
with kci k1 = 1. We have now that if Λn corresponds the indices of the n largest of (|ci (f )|)∞
i=1 in (5.14), then
2 2
X X X
|cj (f )|2 .

f − cj (f )φj = cj (f )φj (5.15)
=


j∈Λn j6∈Λn j6∈Λn

by Parseval’s identity. Let (c̃k (f ))∞ ∞


k=1 be a non-increasing rearrangement of (|cj (f )|)j=1 . We have that
X X X
|cj (f )|2 = c̃k (f )2 ≤ c̃n+1 (f ) c̃k (f ) ≤ c̃n+1 (f ), (5.16)
j6∈Λn k≥n+1 k≥n+1

Pn+1
Since (n + 1)c̃n+1 ≤ j=1 c̃j ≤ 1, we have that c̃n+1 ≤ (n + 1)−1 and hence, the estimate
2
X
−1

f −
≤ (n + 1) ,
cj (f )φj
j∈Λn

follows. Therefore, in the case that G is an orthogonal basis, we can explicitly construct the gi and ci of
Lemma 5.9.
Remark 5.10. Lemma 5.9 allows a quite powerful procedure. Indeed, to achieve an approximation rate of 1/N for a
function f by superpositions of N elements of a set G, it suffices to show that any convex combination of elements of G
approximates f .
In the language of NNs, we could say that every function that can be represented by an arbitrary wide two-layer
NN with bounded activation function and where the weights in the last layer are positive and sum to one can also be
approximated with a network with only N neurons in the first layer and an error proportional to 1/N .

In view of Lemma 5.9, to show Theorem 5.8, we only need to demonstrate that each function in ΓC is in
the convex hull of functions representable by superpositions of sigmoidal NNs with few weights. Before we
prove this, we show that each function f ∈ ΓC is in the convex hull of functions of the set

GC := B1d 3 x 7→ γ · 1R+ (ha, xi + b) : a ∈ Rd , b ∈ R, |γ| ≤ 2C .




Lemma 5.11. Let f ∈ ΓC . Then f|B1d −f (0) ∈ co(GC ). Here the closure is taken with respect to the norm k·kL2, (B1d ) ,
defined by
Z !1/2
1 2
kgkL2, (B1d ) := |g(x)| dx .
|B1d | B1d

Proof. Since f ∈ ΓC is continuous and fˆ ∈ L1 (Rd ), we have by the inverse Fourier transform that
Z  
f (x) − f (0) = fˆ(ξ) e2πihx,ξi − 1 dξ
d
ZR  
ˆ 2πi(hx,ξi+κ(ξ))
= f (ξ) e − e2πiκ(ξ) dξ
d
ZR
ˆ
= f (ξ) (cos(2πhx, ξi + κ(ξ)) − cos(κ(ξ))) dξ,
Rd

where κ(ξ) is the phase of fˆ(ξ) and the last inequality follows since f is real-valued. Moreover, we have that

(cos(2πhx, ξi + κ(ξ)) − cos(κ(ξ)))


Z Z
ˆ
f (ξ) (cos(2πhx, ξi + κ(ξ)) − cos(κ(ξ))) dξ = |2πξ| fˆ(ξ) dξ.

|2πξ|

R d Rd

45
|2πξ||fˆ(ξ)|dξ ≤ C, and thus Λ such that
R
Since f ∈ ΓC , we have Rd

1
dΛ(ξ) := |2πξ||fˆ(ξ)|dξ
C
is a finite measure with Λ(Rd ) =
R
dΛ(ξ) ≤ 1. In this notion, we have
Rd

(cos(2πhx, ξi + κ(ξ)) − cos(κ(ξ)))


Z
f (x) − f (0) = C dΛ(ξ).
Rd |2πξ|

Since (cos(2πhx, ξi + κ(ξ)) − cos(κ(ξ)))/|2πξ| is continuous and bounded by 1, and hence integrable with
respect to dΛ(ξ) we have by the dominated convergence theorem that, for n → ∞,

Z
(cos(2πhx, ξi + κ(ξ)) − cos(κ(ξ))) X (cos(2πhx, θi + κ(θ)) − cos(κ(θ)))
C dΛ(ξ) − C · Λ(Iθ → 0,
)

Rd |2πξ| 1 d
|2πθ|
θ∈ n Z

(5.17)

where Iθ := [0, 1/n]d + θ. Since f (x) − f (0) is continuous and thus bounded on B1d and

X
(cos(2πhx, θi + κ(θ)) − cos(κ(θ)))
C · Λ(Iθ ) ≤ C,
θ∈ 1 Zd |2πθ|
n

we have by the dominated convergence theorem that


2

(cos(2πhx, θi + κ(θ)) − cos(κ(θ)))
Z
1
f (x) − f (0) − C
X
d
· Λ(Iθ dx → 0.
) (5.18)
|B1 | B1d

1 d
|2πθ|
θ∈ Z

n

Since θ∈ 1 Zd Λ(Iθ ) = Λ(Rd ) ≤ 1, we conclude that f (x) − f (0) is in the L2, (B1d ) closure of convex combi-
P
n
nations of functions of the form
cos(2πhx, θi + κ(θ)) − cos(κ(θ))
x 7→ gθ (x) := αθ ,
|2πθ|

for θ ∈ Rd and 0 ≤ αθ ≤ C. The result follows, if we can show that each of the functions gθ is in co(GC ).
Setting z = hx, θ/|θ|i, it suffices to show that the map

cos(2π|θ|z + b) − cos(b)
[−1, 1] 3 z 7→ αθ =: g̃θ (z),
|2πθ|

where b ∈ [0, 2π] can be approximated arbitrarily well by convex combinations of functions of the form

[−1, 1] 3 z 7→ γ · 1R+ (a0 z + b0 ) , (5.19)

where a0 , b0 ∈ R and |γ| ≤ 2C.


Per definition, we have that kg̃θ0 k ≤ C. We define, for T ∈ N,
T
X |g̃θ (i/T ) − g̃θ ((i − 1)/T )|
gT,+ := · (2C · sign(g̃θ (i/T ) − g̃θ ((i − 1)/T )) · 1R+ (x − i/T )) ,
i=1
2C
T
X |g̃θ (−i/T ) − g̃θ ((1 − i)/T )|
gT,− := (2C · sign(g̃θ (−i/T ) − g̃θ ((1 − i)/T )) · 1R+ (−x + i/T )) .
i=1
2C

46
Clearly, (gT,− ) + (gT,+ ) converges to g̃θ for T → ∞ and since
T T T
X |g̃θ (i/T ) − g̃θ ((i − 1)/T )| X |g̃θ (−i/T ) − g̃θ ((1 − i)/T )| X
+ ≤2 kg̃θ0 k∞ /(2CT ) ≤ 1
i=1
2C i=1
2C i=1

we have that g̃θ can be arbitrarily well approximated by convex combinations of the form (5.19). Therefore,
we have that gθ ∈ co(GC ) and by (5.18) this yields that f − f (0) ∈ co(GC ).
Proof of Theorem 5.8. Let f ∈ ΓC , then, by Lemma 5.11, we have that

f|B1d − f (0) ∈ co(GC ).

Moreover, for every element g ∈ GC we have that kgkL2, (B1d ) ≤ 2C. Therefore, by Lemma 5.9, applied to the
Hilbert space L2, (B1d ), we get that for every N ∈ N, there exist |γi | ≤ 2C, ai ∈ Rd , bi ∈ R, for i = 1, . . . , N ,
so that 2
Z N
1 X 4C 2
f (x) − f (0) − γ 1 (ha , xi + b i dx ≤
) .

d i R + i
d B1
|B1 | B1d N


i=1

Since %(λx) → 1R+ (x) for λ → ∞ almost everywhere, it is clear that, for every δ > 0, there exist ãi , b̃i ,
i = 1, . . . , N , so that

 2

N
4C 2
Z
1 X 
fB1d (x) − f (0) − γi % hãi , xi + b̃i dx ≤ + δ.

|B1d | B1d i=1
N

The result follows by observing that


N
X  
γi % hãi , xi + b̃i + f (0)
i=1

is the realisation of a network Φ with L(Φ) = 2 and M (Φ) ≤ N · (d + 3). This is clear by setting
 
Φ := (([γ1 , . . . , γN ], f (0))) P ((ã1 , b̃1 )), . . . , ((ãN , b̃N )) .

Remark 5.12. The fact, that the approximation rate of Theorem 5.8 is independent from the dimension is quite surprising
at first. However, the following observation might render Theorem 5.8 more plausible. The assumption of having a
finite Fourier moment is comparable to a certain dimension dependent regularity assumption. In other words, the
condition of having a finite Fourier moment becomes more restrictive in higher dimensions, meaning that the complexity
of the function class does not, or only mildly grow with the dimension. Indeed, while this type of regularity is not
directly expressible in terms of classical orders of smoothness, Barron notes that a necessary condition for f ∈ ΓC , for
some C > 0, is that f has bounded first-order derivatives. A sufficient condition is that all derivatives of order up to
bd/2c + 2 are square-integrable, [2][Section II]. The sufficient condition amounts to f ∈ W bd/2c+2,2 (Rd ) which would
also imply an approximation rate of N −1 in the squared L2 norm by sums of at most N B-splines, see [21, 8].
Example 5.13. A natural question, especially in view of Remark 5.12, is which well known and relevant functions are
contained in ΓC . In [2, Section IX], a long list with properties of this set and elements thereof is presented. Among
others, we have that
1. If g ∈ ΓC , then
a−d g (a(· − b)) ∈ ΓC ,
for every a ∈ R+ , b ∈ Rd .

47
2. For gi ∈ ΓC , i = 1, . . . , m and c = (ci )m
i=1 it holds that

m
X
ci gi ∈ Γkck1 C .
i=1

2
3. The Gaussian function: x 7→ e−|x| /2
is in ΓC for C = O(d1/2 ).
4. Functions of high smoothness. If the first dd/2e + 2 derivatives of a function g are square integrable on Rd , then
g ∈ ΓC , where the constant C depends linearly on kgkW bd/2c+2,2 .

The last three examples show quite nicely how the assumption g ∈ ΓC includes an indirect dependence on the dimension.

6 Complexity of sets of networks


Until this point, we have mostly tried understanding the capabilities of NNs through the lens of approximation
theory. This analysis is based on two pillars: First, we are interested in asymptotic performance, i.e., we are
aiming to understand the behaviour of NNs for increasing sizes. Second, we measure our success over a
continuum by studying Lp norms for p ∈ [1, ∞].
This point of view is certainly not the only possible, and different applications require a different analysis
of the capabilities of NNs. Consider, for example, a binary classification task, i.e., a process, where values
(xi )N
i=1 should be classified as either 0 or 1. In this scenario, it is interesting to establish if for every possible
classification of the values (xi )N
i=1 as 0 or 1 there exists a NN the realisation of which is a function performing
this classification.
This question, in contrast to the point of view of approximation theory, is non-asymptotic and only studies
the success of networks on a finite, discrete set of samples. Nonetheless, we will later see, that the complexity
measures that we will introduce below are also closely related to some questions in approximation theory
and can be used to establish lower bounds on approximation rates.
The following sections are strongly inspired by [1, Sections 3-8].

6.1 The growth function and the VC dimension


We now introduce two notions of the capability of a class of functions to perform binary classification of
points: Let X be a space, H ⊂ {h : X → {0, 1}} and S ⊂ X be finite. We define by

HS := {h|S : h ∈ H},

the restriction of H to S. We define, the growth function of H by

GH (m) := max {|HS | : S ⊂ X, |S| = m} , for m ∈ N.

The growth function counts the number of functions that result from restricting H to the best possible set S
with m elements. Intuitively, in the framework of binary classification, the growth function tells us in how
many ways we can classify the elements of the best possible sets S of any cardinality by functions in H.
It is clear that for every set S with |S| = m, we have that |HS | ≤ 2m and hence GH (m) ≤ 2m . We say that
a set S with |S| = m for which |HS | = 2m is shattered by H.
A second, more compressed notion of complexity in the context of binary classification is that of the
Vapnik–Chervonenkis dimension (VC Dimension), [36]. We define VCdim(H) to be the largest integer m such
that there exists S ⊂ X with |S| = m that is shattered by H. In other words,

VCdim(H) := max {m ∈ N : GH (m) = 2m } .

Example 6.1. Let X = R2 .

48
1. Let H := {0, 1}, then GH (m) = 2 for all m ≥ 1. Hence, VCdim(H) = 1.
2. Let H := {0, χΩ , χΩc , 1} for some fixed non-empty set Ω ( R2 . Then, choosing S = (x1 , x2 ) with x1 ∈ Ω, x2 ∈
Ωc , we have GH (2) = 4 for all m ≥ 2. Hence, VCdim(H) = 2.
3. Let h := χR+ and ( ! )
 T
cos θ 2
H := hθ,t := h · −t : θ ∈ [−π, π], t ∈ R .
sin θ
Then H is the set of all linear classifiers. It is not hard to see, that if S contains 3 points in general position, then
|H|S | = 8, see Figure 6.1. Hence, these sets S are shattered by H. We will later see that H does not shatter any
set of points with at least 4 elements. Hence VCdim(H) = 3. This is intuitively clear when considering Figure
6.2.

Figure 6.1: Three points shattered by a set of linear classifiers.

As a first step to familiarise ourselves with the new notions, we study the growth function and VC
dimension of realisations of NNs with one neuron and the Heaviside function as activation function. This
situation was discussed before in the third point of Example 6.1.

Figure 6.2: Four points which cannot be classified in every possible way by a single linear classifier. The
classification sketched above requires at least sums of two linear classifiers.

We have the following theorem:

49
Theorem 6.2 ([1, Theorem 3.4]). Let d ∈ N and % = 1R+ . Let SN (d) be the set of realisations of neural networks
with two layers, d-dimensional input, one neuron in the first layer and one dimensional output and the weights in the
second layer satisfy (A2 , b2 ) = (1, 0). Then SN (d) shatters a set of points (xi )m d
i=1 ⊂ R if and only if

(x1 , 1), (x2 , 1), . . . , (xm , 1) (6.1)

are linearly independent centres. In particular, VCdim(SN (d)) = d + 1.

Proof. Assume first, that (xi )mi=1 is such that it is shattered by SN (d) and assume towards a contradiction
that (6.1) are not linearly independent.
Then we have that for every v ∈ {0, 1}m there exists a neural network Φv , such that, for all j ∈ {1, . . . , m},

R(Φv )(xj ) = vj .

Moreover, since (6.1) are not linearly independent there exist (αj )m
j=1 ⊂ R such that, without loss of generality,

m−1    
X xj xm
αj = .
j=1
1 1

Let v ∈ {0, 1}m be such that, for j ∈ {1, . . . , m − 1}, vj = 1 − 1R+ (αj ) and vm = 1. Then,

R (Φv ) (xm ) = % [ Av1 bv1 ][ xm 1 ]



 
m−1
X
= % αj · (Av1 xj + bv1 ) = 0,
j=1

where the last equality is because 1R+ (Av1 xj +bv1 ) = vj = 1−1R+ (αj ). This produces the desired contradiction.
If, on the other hand (6.1) are linearly independent, then the matrix
 
x1 1
 x2 1 
X= . .. 
 
 .. . 
xm 1

has rank m. Hence, for every v ∈ {0, 1}m there exists a vector [ Av1 bv1 ] ∈ R1,d+1 such that X[ Av1 bv1 ]T =
v. Setting Φv := ((Av1 , bv1 ), (1, 0)) yields the claim.
In establishing bounds on the VC dimension of a set of neural networks, the activation function plays a
major role. For example, we have the following lemma.
Lemma 6.3 ([1, Lemma 7.2]). Let H := {x 7→ 1R+ (sin(ax)) : a ∈ R}. Then VCdim(H) = ∞.
Proof. Let xi := 2i−1 , for i ∈ N. Next, we will show that, for every k ∈ N, the set {x1 , . . . , xk } is shattered by
H. Pk
The argument is based on the following bit-extraction technique: Let b := j=1 bj 2−j + 2−k−1 . Setting
a := 2πb, we have that
  
Xk
1R+ (sin(axi )) = 1R+ sin 2π bj 2−j xi + 2π2−k−1 xi 
j=1
  
i−1
X k
X
= 1R+ sin 2π bj 2−j xi + 2π bj 2−j xi + 2π2−k−1 xi  =: I(xi ).
j=1 j=i

50
Pi−1 −j
Since j=1 bi 2 xi ∈ N, we have by the 2π periodicity of sin that
  
k
X
I(xi ) = 1R+ sin 2π bj 2−j xi + 2π2−k−1 xi 
j=i
   
k
X
= 1R+ sin bi π + π ·  bj 2i+1−j + 2i−k  .
j=i+1

P 
k i+1−j
Since j=i+1 bj 2 + 2i−k ∈ (0, 1), we have that

0 if bi = 1,
I(xi ) =
1 else.

Since b was chosen arbitrary, this shows that VCdim(H) ≥ k for all k ∈ N.

In the previous two results (Theorem 6.2, Lemma 6.3), we observed that the VC dimension of sets of
realisations of NNs depends on their size and also on the associated activation function. We have the
following result, that we state without proof:
Denote, d, L, M ∈ N by N N d,L,M the set of neural networks with d dimensional input, L layers and at
most M weights.

Theorem 6.4 ([1, Theorem 8.8]). Let d, `, p ∈ N, and % : R → R be a piecewise polynomial with at most ` pieces of
degree at most p. Let, for L, M ∈ N,

H := {1R+ ◦ R(Φ) : Φ ∈ N N d,L,M } .a

Then, for all L, M ∈ N,


VCdim(H) . M L log2 (M ) + M L2 .
a We are a bit sloppy with the notation here. In [1, Theorem 8.8] the result only applies to sets of neural networks that all have the

same M indices of weights potentially non-zero.

6.2 Lower bounds on approximation rates


We will see next that the bound on the VC dimension of sets of neural networks of Theorem 6.4 implies a
lower bound on the approximation capabilities of neural networks. The argument below follows [39, Section
4.2].
We first show the following auxiliary result.
Lemma 6.5. Let d, k ∈ N, K ⊂ Rd , H ⊂ {h : K → R} be such that, for  > 0, {x1 , . . . , xk } ⊂ K and every
b ∈ {0, 1}k , there exists h ∈ H such that

h(xi ) = bi , for all i = 1, . . . , k. (6.2)

Let G ⊂ {g : K → R} be such that for every h ∈ H, there exists a g ∈ G satisfying

sup |g(x) − h(x)| < /2. (6.3)


x∈K

Then

VCdim({1R+ ◦ (g − /2) : g ∈ G}) ≥ k. (6.4)

51
Proof. Choose for any b ∈ {0, 1}^k an associated h_b ∈ H according to (6.2) and g_b ∈ G according to (6.3).
Then |g_b(x_i) − b_i ε| < ε/2, and therefore g_b(x_i) − ε/2 > 0 if b_i = 1 and g_b(x_i) − ε/2 < 0 otherwise. Hence

\[
1_{\mathbb{R}_+}(g_b - \varepsilon/2)(x_i) = b_i,
\]

which yields the claim.

Remark 6.6. Lemma 6.5 and Theorem 6.4 allow an interesting observation about approximation by NNs. Indeed, if a set of functions H is sufficiently large so that (6.2) holds, and NNs with M weights and L layers achieve an approximation error of less than ε/2 for every function in H, then M L log_2(M) + M L^2 ≳ k.
We would now like to establish a lower bound on the size of neural networks that approximate regular functions well. Considering functions f ∈ C^s([0, 1]^d) with ‖f‖_{C^s} ≤ 1, we would, in view of Remark 6.6, like to find out which value of k is achievable for any given ε.
We begin by constructing a single bump function with controlled C^n norm.
Lemma 6.7. For every n, d ∈ N, there exists a constant C > 0 such that, for every ε ∈ (0, 1], there exists a smooth function f ∈ C^n(R^d) with

\[
\mathrm{supp}\, f \subset [-C \varepsilon^{1/n}, C \varepsilon^{1/n}]^d, \qquad (6.5)
\]

such that f(0) = ε and ‖f‖_{C^n(R^d)} ≤ 1.


Proof. The function

\[
\tilde{f}(x) := \begin{cases} e^{1 - 1/(1 - |x|^2)} & \text{for } |x| < 1, \\ 0 & \text{else,} \end{cases}
\]

is smooth, supported in [−1, 1]^d, and satisfies f̃(0) = 1. We set

\[
f(x) := \varepsilon \, \tilde{f}\!\left( \frac{\varepsilon^{-1/n}}{1 + \|\tilde{f}\|_{C^n}} \, x \right).
\]

Then f(0) = ε, supp f ⊂ [−(1 + ‖f̃‖_{C^n}) ε^{1/n}, (1 + ‖f̃‖_{C^n}) ε^{1/n}]^d, so (6.5) holds with C = 1 + ‖f̃‖_{C^n}, and ‖f‖_{C^n} ≤ 1 by the chain rule, since ‖D^α f‖_∞ ≤ ε^{1−|α|/n} (1 + ‖f̃‖_{C^n})^{−|α|} ‖D^α f̃‖_∞ ≤ 1 for every multi-index α with |α| ≤ n.
Adding up multiple, shifted versions of the function of Lemma 6.7 yields sets of functions that satisfy
(6.2). Concretely, we have the following lemma.

Lemma 6.8. Let n, d ∈ N. There exists C > 0 such that, for every ε ∈ (0, 1], there are points {x_1, . . . , x_k} ⊂ [0, 1]^d with k ≥ C ε^{−d/n} such that, for every b ∈ {0, 1}^k, there is f_b ∈ C^n([0, 1]^d) with ‖f_b‖_{C^n} ≤ 1 and

\[
f_b(x_i) = b_i \varepsilon, \quad \text{for all } i = 1, \dots, k.
\]

Proof. Let, for C > 0 as in (6.5), {x_1, . . . , x_k} := C ε^{1/n} Z^d ∩ [0, 1]^d. Clearly, k ≥ C′ ε^{−d/n} for a constant C′ > 0.
Let b ∈ {0, 1}^k. Now set, for f as in Lemma 6.7,

\[
f_b := \sum_{i=1}^{k} b_i \, f(\cdot - x_i).
\]

By the properties of f, the function f(· − x_i) vanishes at every x_j with j ≠ i, and hence

\[
f_b(x_i) = b_i \varepsilon, \quad \text{for all } i = 1, \dots, k.
\]

The construction of fb is depicted in Figure 6.3.
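The construction can also be sketched numerically, mirroring the picture in Figure 6.3. Below is a minimal one-dimensional illustration (assuming NumPy; the choices d = n = 1, ε = 0.01, and a grid spacing of two support radii are illustrative and not the sharp constants of the lemmas): a scaled bump as in Lemma 6.7 is placed at each grid point with b_i = 1, and the resulting f_b indeed takes the value b_i ε at every grid point.

import numpy as np

eps, n = 0.01, 1                                  # illustrative choices (d = n = 1)

def bump(x):
    # smooth bump with bump(0) = 1 and support in [-1, 1]
    out = np.zeros_like(x)
    inside = np.abs(x) < 1
    out[inside] = np.exp(1.0 - 1.0 / (1.0 - x[inside] ** 2))
    return out

scale = eps ** (1.0 / n)                          # support radius of order eps^{1/n}

def f(x):
    # scaled bump: f(0) = eps, support contained in [-scale, scale]
    return eps * bump(x / scale)

grid = np.arange(0.0, 1.0 + 1e-12, 2 * scale)     # spacing 2*scale, so the bumps do not overlap
k = len(grid)
b = np.random.default_rng(1).integers(0, 2, size=k)

def f_b(x):
    # sum of shifted bumps as in Lemma 6.8
    return sum(b[i] * f(x - grid[i]) for i in range(k))

assert np.allclose(f_b(grid), b * eps)
print(k, "grid points; f_b(x_i) = b_i * eps verified")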


Combining all observations up to this point yields the following result.

Figure 6.3: Illustration of fb from Lemma 6.8 on [0, 1]2 .

Theorem 6.9. Let n, d ∈ N and let % : R → R be piecewise polynomial. Assume that, for all ε > 0, there exist M(ε), L(ε) ∈ N such that

\[
\sup_{f \in C^n([0,1]^d), \, \|f\|_{C^n} \leq 1} \ \inf_{\Phi \in \mathcal{NN}_{d, L(\varepsilon), M(\varepsilon)}} \|f - R(\Phi)\|_\infty \leq \varepsilon/2.
\]

Then

\[
(M(\varepsilon) + 1) \, L(\varepsilon) \log_2(M(\varepsilon) + 1) + (M(\varepsilon) + 1) \, L(\varepsilon)^2 \gtrsim \varepsilon^{-d/n}.
\]

Proof. Let H := {h ∈ C^n([0, 1]^d) : ‖h‖_{C^n} ≤ 1} and G := {R(Φ) : Φ ∈ N N_{d,L(ε),M(ε)}}.
H satisfies (6.2) with k ≥ C ε^{−d/n} due to Lemma 6.8, and G satisfies (6.3) by assumption. Hence,

\[
\mathrm{VCdim}\big( \{ 1_{\mathbb{R}_+} \circ (g - \varepsilon/2) : g \in G \} \big) \geq k. \qquad (6.6)
\]

Moreover,

\[
\{ g - \varepsilon/2 : g \in G \} \subseteq \{ R(\Phi) : \Phi \in \mathcal{NN}_{d, L(\varepsilon), M(\varepsilon)+1} \}.
\]

Hence

\[
\mathrm{VCdim}\big( \{ 1_{\mathbb{R}_+} \circ R(\Phi) : \Phi \in \mathcal{NN}_{d, L(\varepsilon), M(\varepsilon)+1} \} \big) \geq C \varepsilon^{-d/n}.
\]
An application of Theorem 6.4 yields the result.
Remark 6.10. Theorem 6.9 shows that achieving a uniform error of ε > 0 over the unit ball of C^n([0, 1]^d) requires a number of weights M and a number of layers L such that

\[
M L \log_2(M) + M L^2 \gtrsim \varepsilon^{-d/n}.
\]

If we require L to grow only like \log_2(1/\varepsilon), then this demonstrates that the rate of Theorem 3.19 / Remark 3.20 is optimal.
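As a rough numerical illustration of this bound (all constants are suppressed, and the values d = 4, n = 2 as well as the choice L ≈ log_2(1/ε) below are arbitrary), the following sketch determines the smallest power-of-two M satisfying M L log_2(M) + M L^2 ≥ ε^{−d/n}:

import math

def min_weights(eps, d, n):
    # smallest power-of-two M with M*L*log2(M) + M*L**2 >= eps**(-d/n), for L ~ log2(1/eps);
    # constants are suppressed, so this only illustrates how the bound scales
    L = max(1, math.ceil(math.log2(1.0 / eps)))
    target = eps ** (-d / n)
    M = 2
    while M * L * math.log2(M) + M * L ** 2 < target:
        M *= 2
    return M, L

for eps in (1e-1, 1e-2, 1e-3):
    M, L = min_weights(eps, d=4, n=2)
    print(f"eps = {eps:.0e}: depth L = {L}, at least on the order of {M} weights needed")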
For the case that L is arbitrary, [1, Theorem 8.7] yields an upper bound on the VC dimension of

\[
\widetilde{H} := \left\{ 1_{\mathbb{R}_+} \circ R(\Phi) : \Phi \in \bigcup_{\ell = 1}^{\infty} \mathcal{NN}_{d, \ell, M} \right\} \qquad (6.7)
\]

of the form

\[
\mathrm{VCdim}\big( \widetilde{H} \big) \lesssim M^2. \qquad (6.8)
\]

Using (6.8) yields the following result:

Theorem 6.11. Let n, d ∈ N and let % : R → R be piecewise polynomial. Assume that, for all ε > 0, there exists M(ε) ∈ N such that

\[
\sup_{f \in C^n([0,1]^d), \, \|f\|_{C^n} \leq 1} \ \inf_{\Phi \in \bigcup_{\ell=1}^{\infty} \mathcal{NN}_{d, \ell, M(\varepsilon)}} \|f - R(\Phi)\|_\infty \leq \varepsilon/2.
\]

Then

\[
M(\varepsilon) \gtrsim \varepsilon^{-d/(2n)}.
\]

Proof. The proof is the same as for Theorem 6.9, using (6.8) instead of Theorem 6.4.
Remark 6.12. Comparing Theorem 6.9 and Theorem 6.11, we see that approximation by NNs with arbitrarily many layers can potentially achieve double the rate of approximation by NNs with a restricted or only slowly growing number of layers.
Indeed, at least for the ReLU activation function, the lower bound of Theorem 6.11 is sharp. It was shown in [39] that ReLU realisations of NNs with an unrestricted number of layers achieve approximation accuracy ε > 0 using only O(ε^{−d/(2n)}) many weights, uniformly over the unit ball of C^n([0, 1]^d).

7 Spaces of realisations of neural networks


As a final step of our analysis of deep neural networks from a functional analytical point of view, we would
like to understand set-topological aspects of sets of realisations of NNs. What we are analysing in this section
are sets of neural networks of a fixed architecture. We first define the notion of an architecture.
Definition 7.1. A vector S = (N_0, N_1, . . . , N_L) ∈ N^{L+1} is called the architecture of a neural network Φ = ((A_1, b_1), . . . , (A_L, b_L)) if A_ℓ ∈ R^{N_ℓ × N_{ℓ−1}} for all ℓ = 1, . . . , L. We denote by N N(S) the set of neural networks with architecture S and, for an activation function % : R → R, we denote by

RN N % (S) := {R(Φ) : Φ ∈ N N (S)}

the set of realisations of neural networks with architecture S.


For any architecture S, N N(S) is a finite-dimensional vector space on which we use the norm

\[
\|\Phi\|_{\mathrm{total}} := |\Phi|_{\mathrm{scaling}} + |\Phi|_{\mathrm{shift}} := \max_{\ell = 1, \dots, L} \|A_\ell\|_\infty + \max_{\ell = 1, \dots, L} \|b_\ell\|_\infty.
\]
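For concreteness, a network with architecture S can be stored simply as the list of its weight matrices and shift vectors. The following is a minimal NumPy sketch (the ReLU activation, the example architecture, and the reading of ‖·‖_∞ as the maximal absolute entry are illustrative assumptions, not prescriptions of these notes) of the realisation map, the norm ‖·‖_total, and the parameter count Σ_ℓ (N_{ℓ−1} + 1) N_ℓ, which reappears as dim(N N(S)) in Theorem 7.5 below.

import numpy as np

# A network Phi = ((A_1, b_1), ..., (A_L, b_L)) stored as a list of (matrix, vector) pairs.
def realisation(phi, x, rho=lambda t: np.maximum(t, 0.0)):
    # R(Phi)(x): apply rho after every layer except the last (ReLU chosen only for illustration)
    for A, b in phi[:-1]:
        x = rho(A @ x + b)
    A, b = phi[-1]
    return A @ x + b

def total_norm(phi):
    # |Phi|_scaling + |Phi|_shift, reading ||.||_inf as the maximal absolute entry
    return max(np.abs(A).max() for A, _ in phi) + max(np.abs(b).max() for _, b in phi)

def n_parameters(arch):
    # dim NN(S) = sum_l (N_{l-1} + 1) N_l for S = (N_0, ..., N_L)
    return sum((arch[l - 1] + 1) * arch[l] for l in range(1, len(arch)))

S = (3, 5, 4, 1)                                   # illustrative architecture
rng = np.random.default_rng(0)
phi = [(rng.standard_normal((S[l], S[l - 1])), rng.standard_normal(S[l]))
       for l in range(1, len(S))]
x = rng.standard_normal(S[0])
print(realisation(phi, x), total_norm(phi), n_parameters(S))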

Now we have that, for a given architecture S = (d, N1 , . . . , NL ) ∈ NL+1 , a compact set K ⊂ Rd , and for a
continuous activation function % : R → R:

RN N % (S) ⊂ Lp (K),

for all p ∈ [1, ∞]. In this context, we can ask ourselves about the properties of RN N % (S) as a subset of the
normed linear spaces Lp (K).
The results below are based on the following observation about the realisation map:

Theorem 7.2 ([22, Proposition 4]). Let Ω ⊂ Rd be compact and let S = (d, N1 , . . . , NL ) ∈ NL+1 be a neural
network architecture. If the activation function % : R → R is continuous, then the map

R : N N (S) → L∞ (Ω)
Φ 7→ R(Φ)

is continuous. Moreover, if % is locally Lipschitz continuous, then R is locally Lipschitz continuous.

7.1 Network spaces are not convex
We begin by analysing the simple question if, for a given architecture S, the set RN N % (S) is star-shaped.
We start by fixing the notion of a centre and of star-shapedness.
Definition 7.3. Let Z be a subset of a linear space. A point x ∈ Z is called a centre of Z if, for every y ∈ Z it holds
that
{tx + (1 − t)y : t ∈ [0, 1]} ⊂ Z.
A set is called star-shaped if it has at least one centre.
The following proposition follows directly from the definition of a neural network:
Proposition 7.4. Let S = (d, N1 , . . . , NL ) ∈ NL+1 . Then RN N % (S) is scaling invariant, i.e. for every λ ∈ R it
holds that λf ∈ RN N % (S) if f ∈ RN N % (S), and hence 0 ∈ RN N % (S) is a centre of RN N % (S).
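Indeed, scaling only the last layer already gives the scaling invariance: if Φ = ((A_1, b_1), . . . , (A_L, b_L)), then the network

\[
\Phi_\lambda := \big( (A_1, b_1), \dots, (A_{L-1}, b_{L-1}), (\lambda A_L, \lambda b_L) \big)
\]

has the same architecture S and satisfies R(Φ_λ) = λ R(Φ).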
Knowing that RN N % (S) is star-shaped with centre 0, we can also ask ourselves if RN N % (S) has more
than this one centre. It is not hard to see that also every constant function is a centre. The following theorem
yields an upper bound on the number of centres.

Theorem 7.5 ([22, Proposition C.4]). Let S = (N_0, N_1, . . . , N_L) be a neural network architecture, let Ω ⊂ R^{N_0}, and let % : R → R be Lipschitz continuous. Then RN N_%(S) contains at most Σ_{ℓ=1}^{L} (N_{ℓ−1} + 1) N_ℓ linearly independent centres.
Proof. Let M^∗ := Σ_{ℓ=1}^{L} (N_{ℓ−1} + 1) N_ℓ. We first observe that M^∗ = dim(N N(S)).

Assume, towards a contradiction, that there are linearly independent centres (g_i)_{i=1}^{M^∗+1} ⊂ RN N_%(S) ⊂ L^2(Ω).

By the Hahn-Banach theorem, there exist functionals (g_i')_{i=1}^{M^∗+1} ⊂ (L^2(Ω))' such that g_i'(g_j) = δ_{i,j} for all i, j ∈ {1, . . . , M^∗ + 1}. We define

\[
T : L^2(\Omega) \to \mathbb{R}^{M^*+1}, \qquad g \mapsto \big( g_1'(g), \, g_2'(g), \, \dots, \, g_{M^*+1}'(g) \big)^T.
\]

Since T is continuous and linear, we have that T ◦ R is locally Lipschitz continuous by Theorem 7.2. Moreover,

since the (g_i)_{i=1}^{M^∗+1} are linearly independent, they span an (M^∗ + 1)-dimensional linear space V and T(V) = R^{M^∗+1}.
Next we would like to establish that RN N_%(S) ⊃ V. Let g ∈ V; then

\[
g = \sum_{\ell=1}^{M^*+1} a_\ell \, g_\ell,
\]

for some (a_ℓ)_{ℓ=1}^{M^∗+1} ⊂ R. We show by induction that g̃^{(m)} := Σ_{ℓ=1}^{m} a_ℓ g_ℓ ∈ RN N_%(S) for every m ≤ M^∗ + 1.
This is obviously true for m = 1. Moreover, we have that g̃^{(m+1)} = a_{m+1} g_{m+1} + g̃^{(m)}. Hence the induction step holds true if a_{m+1} = 0. If a_{m+1} ≠ 0, then we have that

\[
\tilde{g}^{(m+1)} = 2 a_{m+1} \left( \frac{1}{2} g_{m+1} + \frac{1}{2 a_{m+1}} \tilde{g}^{(m)} \right). \qquad (7.1)
\]

By Proposition 7.4, g̃^{(m)}/a_{m+1} ∈ RN N_%(S). Additionally, g_{m+1} is a centre of RN N_%(S). Therefore, we have that (1/2) g_{m+1} + (1/(2a_{m+1})) g̃^{(m)} ∈ RN N_%(S). By Proposition 7.4, we conclude that g̃^{(m+1)} ∈ RN N_%(S). Hence V ⊂ RN N_%(S). Therefore, T ◦ R(N N(S)) ⊇ T(V) = R^{M^∗+1}.
It is a well-known fact from basic analysis that there does not exist a surjective, locally Lipschitz continuous map from R^n to R^{n+1} for any n ∈ N. Since N N(S) is an M^∗-dimensional vector space and T ◦ R is locally Lipschitz continuous, this yields the desired contradiction.

For a convex set X, the line segment between any two points of X is a subset of X. Hence, every point of a convex set is a centre. This yields the following corollary.
Corollary 7.6. Let S = (N_0, N_1, . . . , N_L) be a neural network architecture, let Ω ⊂ R^{N_0}, and let % : R → R be Lipschitz continuous. If RN N_%(S) contains more than Σ_{ℓ=1}^{L} (N_{ℓ−1} + 1) N_ℓ linearly independent functions, then RN N_%(S) is not convex.
Remark 7.7. It was shown in [22, Theorem 2.1] that the only Lipschitz continuous activation functions for which RN N_%(S) contains no more than Σ_{ℓ=1}^{L} (N_{ℓ−1} + 1) N_ℓ linearly independent functions are the affine linear ones. Additionally, it can be shown that Corollary 7.6 holds for locally Lipschitz continuous activation functions as well. In this case, RN N_%(S) necessarily contains more than Σ_{ℓ=1}^{L} (N_{ℓ−1} + 1) N_ℓ linearly independent functions if the activation function is not a polynomial.

Figure 7.1: Sketch of the set of realisations of neural networks with a fixed architecture. This set is star-shaped,
having 0 in the centre. It is not r-convex for any r and hence we see multiple holes between different rays. It
is not closed, which means that there are limit points outside of the set.

In addition to the non-convexity of RN N % (S), we will now show that, under mild assumptions on the
activation function, RN N % (S) is also very non-convex. Let us first make the notion of convexity quantitative.
Definition 7.8. A subset X of a metric space is called r-convex if ∪_{x∈X} B_r(x) is convex.
By Proposition 7.4, it is clear that RN N % (S) + Br (0) = r (RN N % (S) + B1 (0)). Hence,

RN N_%(S) + B_r(0) = (r/r′) · (RN N_%(S) + B_{r′}(0)),

for every r, r′ > 0. Therefore, RN N_%(S) is r-convex for one r > 0 if and only if RN N_%(S) is r-convex for every r > 0.
With this observation we can now prove the following result.

Proposition 7.9 ([22, Theorem 2.2.]). Let S ∈ NL+1 , Ω ⊂ RN0 be compact, and % ∈ C 1 be discriminatory and such
that RN N % (S) is not dense in C(Ω). Then there does not exist an r > 0 such that RN N % (S) is r-convex.
Proof. By the discussion leading up to Proposition 7.9, we can assume, towards a contradiction, that RN N_%(S) is r-convex for every r > 0.
We have that

\[
\mathrm{co}\big(\mathcal{RNN}_\varrho(S)\big) \subset \bigcap_{r>0} \big(\mathcal{RNN}_\varrho(S) + B_r(0)\big) \subset \bigcap_{r>0} \overline{\mathcal{RNN}_\varrho(S) + B_r(0)} \subset \overline{\mathcal{RNN}_\varrho(S)}.
\]

Therefore, the closure \overline{RN N_%(S)} coincides with the closure of the convex hull co(RN N_%(S)), and we conclude that \overline{RN N_%(S)} is convex.
We now aim at producing a contradiction by showing that \overline{RN N_%(S)} = C(Ω). We show this only for L = 2 and N_2 = 1; the general case is treated in [22, Theorem 2.2] (there, also the differentiability of % is used).
Per assumption, for every a ∈ R^{N_0} and t ∈ R,

\[
x \mapsto \varrho(a^T x - t) \in \mathcal{RNN}_\varrho(S).
\]

By the same argument as in (7.1) in the proof of Theorem 7.5, now applied within the convex set \overline{RN N_%(S)}, we have that for all sequences (a_ℓ)_{ℓ=1}^{∞} ⊂ R^{N_0} and (b_ℓ)_{ℓ=1}^{∞}, (t_ℓ)_{ℓ=1}^{∞} ⊂ R the function

\[
g^{(m)}(x) := \sum_{\ell=1}^{m} b_\ell \, \varrho(a_\ell^T x - t_\ell)
\]

satisfies g^{(m)} ∈ \overline{RN N_%(S)} for all m ∈ N.


By Theorem 2.4, we have that

\[
\overline{ \left\{ \sum_{\ell=1}^{m} b_\ell \, \varrho(a_\ell^T \cdot - t_\ell) \, : \, m \in \mathbb{N}, \; a_\ell \in \mathbb{R}^{N_0}, \; b_\ell, t_\ell \in \mathbb{R} \right\} } = C(\Omega),
\]

and hence C(Ω) ⊂ \overline{RN N_%(S)}, which yields the desired contradiction.

7.2 Network spaces are not closed


The second property that we would like to understand is closedness. To make this more precise, we need to
decide on a norm first. We will now study closedness in the uniform norm.

Theorem 7.10. Let L ∈ N and S = (N_0, N_1, . . . , N_{L−1}, 1) ∈ N^{L+1} with N_1 ≥ 2, let Ω ⊂ R^d (with d = N_0) be compact with nonempty interior, and let % ∈ C^2(R) \ C^∞(R). Then RN N_%(S) is not closed in L^∞(Ω).

Proof. Since % ∈ C^2 \ C^∞, there exists k ∈ N such that % ∈ C^k and % ∉ C^{k+1}. It is not hard to see that therefore RN N_%(S) ⊂ C^k(Ω), while the map

\[
F : \mathbb{R}^d \to \mathbb{R}, \qquad x \mapsto F(x) = \varrho'(x_1),
\]

is not in C^k(R^d). Therefore, since Ω has non-empty interior, there exists t ∈ R^d so that F(· − t) ∉ C^k(Ω) and thus F(· − t) ∉ RN N_%(S).
Assume for now that S = (N0 , 2, 1). The general statement follows by extending the networks below
to neural networks with architecture (N0 , 2, 1, . . . , 1, 1) by concatenating with the neural networks from
Proposition 2.11. To artificially increase the width of the networks and produce neural networks of architecture
S one can simply zero-pad the weight and shift matrices without altering the associated realisations.

We define the neural network

\[
\Phi_n := \left( \left( \begin{bmatrix} 1 & 0_{1 \times (N_0 - 1)} \\ 1 & 0_{1 \times (N_0 - 1)} \end{bmatrix}, \begin{bmatrix} 1/n \\ 0 \end{bmatrix} \right), \big( [\, n, \ -n \,], \ 0 \big) \right)
\]

and observe that for every x ∈ Ω

\[
|R(\Phi_n)(x) - \varrho'(x_1)| = |n(\varrho(x_1 + 1/n) - \varrho(x_1)) - \varrho'(x_1)| \leq \sup_{z \in [-B, B+1]} |\varrho''(z)| / n,
\]

by the mean value theorem, where B > 0 is such that Ω ⊂ [−B, B]d . Therefore, R(Φn ) → F in L∞ (Ω) and
hence RN N % (S) is not closed.
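The divergence mechanism behind this proof is easy to observe numerically. The following is a minimal sketch (assuming NumPy; the activation %(x) = max(0, x)^3, which lies in C^2 but not in C^3, and the interval [−1, 1] standing in for Ω are illustrative choices): the realisation R(Φ_n)(x) = n(%(x_1 + 1/n) − %(x_1)) approaches %′(x_1) = 3 max(0, x_1)^2 at rate 1/n, while ‖Φ_n‖_total grows like n, in line with the observation after Proposition 7.12 below.

import numpy as np

def rho(t):
    # activation in C^2 but not C^3, hence in C^2 \ C^infty
    return np.maximum(t, 0.0) ** 3

def rho_prime(t):
    return 3.0 * np.maximum(t, 0.0) ** 2

xs = np.linspace(-1.0, 1.0, 2001)        # a compact set standing in for Omega (d = 1)

for n in (10, 100, 1000):
    # R(Phi_n)(x) = n * rho(x + 1/n) - n * rho(x): two hidden neurons, output weights n and -n
    realisation_n = n * (rho(xs + 1.0 / n) - rho(xs))
    sup_err = np.max(np.abs(realisation_n - rho_prime(xs)))
    total_norm = n + 1.0 / n             # max weight entry is n, max shift entry is 1/n
    print(f"n = {n:5d}: sup-error = {sup_err:.2e}, ||Phi_n||_total = {total_norm:.2f}")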
Remark 7.11. Theorem 7.10 holds in much more generality. In fact, a similar statement holds for various types of
activation functions, see [22, Theorem 3.3]. Surprisingly, the statement does not hold for the ReLU activation function,
[22, Theorem 3.8].

Theorem 7.10 should be contrasted with the following result, which shows that sets of realisations of neural networks with bounded weights are always closed.
Proposition 7.12. Let S ∈ NL+1 , Ω ⊂ RN0 be compact, and % be continuous. For C > 0, we denote by

RN N C := {R(Φ) : Φ ∈ N N (S), kΦktotal ≤ C}

the set of realisations of neural networks with weights bounded by C. Then RN N C is a closed subset of C(Ω).
Proof. By the Heine-Borel theorem, the set

\[
\{ \Phi \in \mathcal{NN}(S) : \|\Phi\|_{\mathrm{total}} \leq C \}
\]

is compact. Since the realisation map R is continuous by Theorem 7.2, its image RN N_C is compact and therefore closed in C(Ω).


Combining Theorem 7.10 and Proposition 7.12 yields the following observation: Consider a function g in the closure of RN N_%(S) that does not belong to RN N_%(S) itself, and a sequence Φ_n ∈ N N(S) such that

\[
R(\Phi_n) \to g.
\]

Then ‖Φ_n‖_total → ∞: if (‖Φ_n‖_total)_n remained bounded by some C > 0, then g would lie in the closure of RN N_C, which by Proposition 7.12 equals RN N_C ⊂ RN N_%(S), a contradiction.

References
[1] M. Anthony and P. L. Bartlett. Neural network learning: theoretical foundations. Cambridge University
Press, Cambridge, 1999.
[2] A. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf.
Theory, 39(3):930–945, 1993.
[3] R. Bellman. On the theory of dynamic programming. Proceedings of the National Academy of Sciences of
the United States of America, 38(8):716, 1952.
[4] E. K. Blum and L. K. Li. Approximation theory and feedforward networks. Neural networks, 4(4):511–515,
1991.
[5] C. K. Chui and H. N. Mhaskar. Deep nets for local manifold learning. Frontiers in Applied Mathematics
and Statistics, 4:12, 2018.

[6] A. Cohen and R. DeVore. Approximation of high-dimensional parametric pdes. Acta Numerica, 24:1–159,
2015.
[7] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. Control Signal Systems,
2(4):303–314, 1989.
[8] R. A. DeVore. Nonlinear approximation. Acta numerica, 7:51–150, 1998.
[9] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In Conference on learning
theory, pages 907–940, 2016.
[10] C. L. Frenzen, T. Sasao, and J. T. Butler. On the number of segments needed in a piecewise linear
approximation. Journal of Computational and Applied mathematics, 234(2):437–446, 2010.
[11] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[12] J. He, L. Li, J. Xu, and C. Zheng. ReLU deep neural networks and linear finite elements. arXiv preprint
arXiv:1807.03973, 2018.
[13] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators.
Neural Netw., 2(5):359–366, 1989.
[14] A. N. Kolmogorov. The representation of continuous functions of several variables by superpositions of
continuous functions of a smaller number of variables. Doklady Akademii Nauk SSSR, 108(2):179–182,
1956.
[15] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a nonpolynomial
activation function can approximate any function. Neural Netw., 6(6):861–867, 1993.
[16] V. Maiorov and A. Pinkus. Lower bounds for approximation by MLP neural networks. Neurocomputing,
25(1-3):81–91, 1999.
[17] W. McCulloch and W. Pitts. A logical calculus of ideas immanent in nervous activity. Bull. Math. Biophys.,
5:115–133, 1943.
[18] H. N. Mhaskar. Approximation properties of a multilayered feedforward artificial neural network. Adv.
Comput. Math., 1(1):61–80, 1993.
[19] H. N. Mhaskar and T. Poggio. Deep vs. shallow networks: An approximation theory perspective. Analysis
and Applications, 14(06):829–848, 2016.
[20] E. Novak and H. Woźniakowski. Approximation of infinitely differentiable multivariate functions is
intractable. Journal of Complexity, 25(4):398–404, 2009.
[21] P. Oswald. On the degree of nonlinear spline approximation in Besov-Sobolev spaces. J. Approx. Theory,
61(2):131–157, 1990.
[22] P. C. Petersen, M. Raslan, and F. Voigtlaender. Topological properties of the set of functions generated
by neural networks of fixed size. arXiv preprint arXiv:1806.08459, 2018.
[23] P. C. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep
ReLU neural networks. Neural Netw., 108:296–330, 2018.
[24] G. Pisier. Remarques sur un résultat non publié de B. Maurey. Séminaire Analyse fonctionnelle (dit
”Maurey-Schwartz”), 1980-1981.
[25] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao. Why and when can deep-but not shallow-
networks avoid the curse of dimensionality: a review. Int. J. Autom. Comput., 14(5):503–519, 2017.

[26] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the
brain. Psychological review, 65(6):386, 1958.
[27] W. Rudin. Functional analysis. International Series in Pure and Applied Mathematics. McGraw-Hill, Inc.,
New York, second edition, 1991.

[28] W. Rudin. Real and complex analysis. Tata McGraw-Hill Education, 2006.
[29] I. Safran and O. Shamir. Depth-width tradeoffs in approximating natural functions with neural networks.
In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine
Learning Research, pages 2979–2987, 2017.
[30] J. Schmidt-Hieber. Deep ReLU network approximation of functions on a manifold. arXiv preprint
arXiv:1908.00695, 2019.
[31] I. Schoenberg. Cardinal interpolation and spline functions. Journal of Approximation theory, 2(2):167–206,
1969.
[32] U. Shaham, A. Cloninger, and R. R. Coifman. Provable approximation properties for deep neural
networks. Appl. Comput. Harmon. Anal., 44(3):537–557, 2018.
[33] Z. Shen, H. Yang, and S. Zhang. Deep network approximation characterized by number of neurons.
arXiv preprint arXiv:1906.05497, 2019.
[34] T. Suzuki. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces:
optimal rate and curse of dimensionality. arXiv preprint arXiv:1810.08033, 2018.

[35] M. Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101,
2015.
[36] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to
their probabilities. In Measures of complexity, pages 11–30. Springer, 2015.

[37] R. Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47.
Cambridge University Press, 2018.
[38] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw., 94:103–114,
2017.
[39] D. Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks. In Conference
On Learning Theory, pages 639–649, 2018.

