Mathematical Theory of Deep Learning
1 Introduction
1.1 Mathematics of deep learning
1.2 High-level overview of deep learning
1.3 Why does it work?
1.4 Outline and philosophy
1.5 Material not covered in this book
3 Universal approximation
3.1 A universal approximation theorem
3.2 Superexpressive activations and Kolmogorov’s superposition theorem
4 Splines
4.1 B-splines and smooth functions
4.2 Reapproximation of B-splines with sigmoidal activations
8 High-dimensional approximation
8.1 The Barron class
8.2 Functions with compositionality structure
8.3 Functions on manifolds
9 Interpolation
9.1 Universal interpolation
9.2 Optimal interpolation and reconstruction
15 Generalization in the overparameterized regime
15.1 The double descent phenomenon
15.2 Size of weights
15.3 Theoretical justification
15.4 Double descent for neural network learning
Preface
This book serves as an introduction to the key ideas in the mathematical analysis of deep learning.
It is designed to help students and researchers to quickly familiarize themselves with the area and to
provide a foundation for the development of university courses on the mathematics of deep learning.
Our main goal in the composition of this book was to present various rigorous, but easy to grasp,
results that help to build an understanding of fundamental mathematical concepts in deep learning.
To achieve this, we prioritize simplicity over generality.
As a mathematical introduction to deep learning, this book does not aim to give an exhaustive
survey of the entire (and rapidly growing) field, and some important research directions are missing.
In particular, we have favored mathematical results over empirical research, even though an accurate
account of the theory of deep learning requires both.
The book is intended for students and researchers in mathematics and related areas. While we
believe that every diligent researcher or student will be able to work through this manuscript, it
should be emphasized that a familiarity with analysis, linear algebra, probability theory, and basic
functional analysis is recommended for an optimal reading experience. To assist readers, a review
of key concepts in probability theory and functional analysis is provided in the appendix.
The material is structured around the three main pillars of deep learning theory: Approximation
theory, Optimization theory, and Statistical Learning theory. Chapter 1 provides an overview and
outlines key questions for understanding deep learning. Chapters 2 - 9 explore results in approximation
theory, Chapters 10 - 13 discuss optimization theory for deep learning, and the remaining Chapters
14 - 16 address the statistical aspects of deep learning.
This book is the result of a series of lectures given by the authors. Parts of the material were
presented by P.P. in a lecture titled “Neural Network Theory” at the University of Vienna, and by
J.Z. in a lecture titled “Theory of Deep Learning” at Heidelberg University. The lecture notes of
these courses formed the basis of this book. We are grateful to the many colleagues and students
who contributed to this book through insightful discussions and valuable suggestions. We would
like to offer special thanks to the following individuals:
Jonathan Garcia Rebellon, Jakob Lanser, Andrés Felipe Lerma Pineda, Martin Mauser, Davide
Modesto, Martina Neuman, Bruno Perreaux, Johannes Asmus Petersen, Milutin Popovic, Tuan
Quach, Lorenz Riess, Jakob Fabian Rohner, Jonas Schuhmann, Peter Školnı́k, Matej Vedak, Simon
Weissmann, Ashia Wilson.
Notation
In this section, we provide a summary of the notations used throughout the manuscript for the
reader’s convenience.
Symbol    Description    Reference
⊗    componentwise (Hadamard) product
hS    empirical risk minimizer for a sample S    Definition 14.5
Φ^id_L    identity ReLU neural network    Lemma 5.1
1_S    indicator function of the set S
⟨·, ·⟩    Euclidean inner product on R^d
⟨·, ·⟩_H    inner product on a vector space H    Definition B.9
k_T    maximal number of elements shared by a single node of a triangulation    (5.3.2)
K^LC    neural tangent kernel for the LeCun initialization    Theorem 11.16
K̂_n(x, x′)    empirical tangent kernel    (11.3.4)
K^NTK    neural tangent kernel for the NTK initialization    Theorem 11.30
Λ_{A,σ,S,L}    loss landscape defining function    Definition 12.2
Lip(f)    Lipschitz constant of a function f    (9.2.1)
Lip_M(Ω)    M-Lipschitz continuous functions on Ω    (9.2.4)
L    general loss function    Section 14.1
L_{0-1}    0-1 loss    Section 14.1
L_ce    binary cross entropy loss    Section 14.1
L_2    square loss    Section 14.1
L^p(Ω)    Lebesgue space over Ω    Section B.1.3
M    piecewise continuous and locally bounded functions    Definition 3.1.1
N_d^m(σ; L, n)    set of multilayer perceptrons with d-dim input, m-dim output, activation function σ, depth L, and width n    Definition 3.6
N_d^m(σ; L)    union of N_d^m(σ; L, n) for all n ∈ N    Definition 3.6
N(σ; A, B)    set of neural networks with architecture A, activation function σ and all weights bounded in modulus by B    Definition 12.1
N*(σ, A, B)    neural networks in N(σ; A, B) with range in [−1, 1]    (14.5.1)
N    positive natural numbers
N_0    natural numbers including 0
N(m, C)    multivariate normal distribution with mean m ∈ R^d and covariance C ∈ R^{d×d}
n_A    number of parameters of a neural network with layer widths described by A    Definition 12.1
∥·∥    Euclidean norm for vectors in R^d and spectral norm for matrices in R^{n×d}
∥·∥_F    Frobenius norm for matrices
∥·∥_∞    ∞-norm on R^d or supremum norm for functions
∥·∥_p    p-norm on R^d
∥·∥_X    norm on a vector space X
0    zero vector in R^d
O(·)    Landau notation
ω(η)    patch of the node η    (5.3.5)
Ω_Λ(c)    sublevel set of loss landscape    Definition 12.3
P_n    short for P_n(R^d)
P_n(R^d)    space of multivariate polynomials of degree n in R^d    Example 3.5
P    short for P(R^d)
P[A]    probability of event A    Definition A.5
P[A|B]    conditional probability of event A given B    Definition A.3.2
P_X    distribution of random variable X    Definition A.10
P(R^d)    space of multivariate polynomials in R^d    Example 3.5
Φ^lin    linearization of a model around initialization    (11.3.1)
Φ^min_n    minimum neural network    Lemma 5.11
Φ^×_ε    multiplication neural network    Lemma 7.3
Φ^×_{n,ε}    multiplication of n numbers neural network    Proposition 7.4
Φ_2 ◦ Φ_1    composition of neural networks    Lemma 5.2
Φ_2 • Φ_1    sparse composition of neural networks    Lemma 5.2
(Φ_1, ..., Φ_m)    parallelization of neural networks    (5.1.1)
Pieces(f, Ω)    number of pieces of f on Ω    Definition 6.1
PN(A, B)    parameter set of neural networks with architecture A and all weights bounded in modulus by B    Definition 12.1
Q    rational numbers
R    real numbers
R_−    non-positive real numbers
R_+    non-negative real numbers
R_σ    realization map    Definition 12.1
R*    Bayes risk    (14.1.1)
R(h)    risk of hypothesis h    Definition 14.2
R̂_S(h)    empirical risk of h for sample S    (1.2.3), Definition 14.4
S_n    cardinal B-spline    Definition 4.1
S^d_{ℓ,t,n}    multivariate cardinal B-spline    Definition 4.2
|S|    cardinality of an arbitrary set S, or Lebesgue measure of S ⊆ R^d
S̊    interior of a set S
S̄    closure of a set S
∂S    boundary of a set S
S^c    complement of a set S
σ    general activation function
σ_a    parametric ReLU activation function    Section 2.3
σ_ReLU    ReLU activation function    Section 2.3
sign    sign function
size(Φ)    number of free network parameters in Φ    Definition 2.4
span(S)    linear hull or span of S
T    triangulation    Definition 5.13
V[X]    variance of random variable X    Section A.2.2
VCdim(H)    VC dimension of a set of functions H    Definition 14.16
W    distribution of weight initialization    Section 11.5.1
W^(ℓ), b^(ℓ)    weights and biases in layer ℓ of a neural network    Definition 2.1
width(Φ)    width of Φ    Definition 2.1
x^(ℓ)    output of the ℓ-th layer of a neural network    Definition 2.1
x̄^(ℓ)    preactivations    (10.3.3)
X′    dual space to a normed space X    Definition B.7
Chapter 1
Introduction
Figure 1.1: Illustration of a single neuron ν. The neuron receives six inputs (x_1, ..., x_6) = x, computes their weighted sum ∑_{j=1}^6 x_j w_j, adds a bias b, and finally applies the activation function σ to produce the output ν(x).
Deep Neural Networks Deep neural networks are formed by a combination of neurons. A neuron is a function of the form

    R^d ∋ x ↦ σ(∑_{j=1}^d w_j x_j + b),

with weights w_1, ..., w_d ∈ R and bias b ∈ R, as illustrated in Figure 1.1. Combining several neurons in parallel and composing the result with a further affine map yields a function of the form

    R^d ∋ x ↦ Φ(x) = T_1 ◦ σ ◦ T_0(x),

a so-called shallow neural network (see Figure 1.2), where T_0, T_1 are affine transformations and σ acts componentwise. Deep neural networks arise by iterating this construction, i.e., they are functions of the form

    R^d ∋ x ↦ Φ(x) = T_{L+1} ◦ σ ◦ T_L ◦ σ ◦ ⋯ ◦ σ ◦ T_0(x),

where L ∈ N and (T_j)_{j=0}^{L+1} are affine transformations. The number of compositions L is referred to as the number of layers of the deep neural network. Similar to a single neuron, (deep) neural networks can be viewed as a parameterized function class, with the parameters being the entries of the matrices and vectors determining the affine transformations (T_j)_{j=0}^{L+1}.
Figure 1.2: Illustration of a shallow neural network. The affine transformation T_0 is of the form (x_1, ..., x_6) = x ↦ W x + b, where the rows of W are the weight vectors w_1, w_2, w_3 for each respective neuron.
Gradient-based training After defining the structure or architecture of the neural network,
e.g., the activation function and the number of layers, the second step of deep learning consists of
determining optimal values for its parameters. This optimization is carried out by minimizing an
objective function. In supervised learning, which will be our focus, this objective depends on a
collection of input-output pairs known as a sample. Concretely, let S = (x_i, y_i)_{i=1}^m be a sample, where x_i ∈ R^d represents the inputs and y_i ∈ R^k the corresponding outputs with d, k ∈ N. Our goal is to find a deep neural network Φ such that

    Φ(x_i) ≈ y_i    for all i = 1, ..., m    (1.2.2)

in a meaningful sense. For example, we could interpret “≈” to mean closeness with respect to
the Euclidean norm, or more generally, that L(Φ(xi ), y i ) is small for a function L measuring the
dissimilarity between its inputs. Such a function L is called a loss function. A standard way of
achieving (1.2.2) is by minimizing the so-called empirical risk of Φ with respect to the sample
S defined as
    R̂_S(Φ) = (1/m) ∑_{i=1}^m L(Φ(x_i), y_i).    (1.2.3)
If L is differentiable, and for all x_i the output Φ(x_i) depends differentiably on the parameters of the neural network, then the gradient of the empirical risk R̂_S(Φ) with respect to the parameters is well-defined. This gradient can be efficiently computed using a technique called backpropagation. This makes it possible to minimize (1.2.3) with optimization algorithms such as (stochastic) gradient descent. These algorithms produce a sequence of neural network parameters, and corresponding neural network functions Φ_1, Φ_2, ..., for which the empirical risk is expected to decrease. Figure 1.3 illustrates a possible behavior of this sequence.
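The following sketch illustrates (1.2.3) and plain gradient descent for a one-neuron model with the square loss; the model, the data, the learning rate, and the number of steps are assumptions of this toy example and are not prescribed by the text.

```python
import numpy as np

def model(x, w, b):
    # one-neuron model Phi(x) = max(0, w*x + b) with scalar input and output
    return np.maximum(w * x + b, 0.0)

def empirical_risk(w, b, xs, ys):
    # (1/m) * sum_i L(Phi(x_i), y_i) with the square loss L(y', y) = (y' - y)^2
    return np.mean((model(xs, w, b) - ys) ** 2)

def gradient(w, b, xs, ys):
    # gradient of the empirical risk; the ReLU is differentiable almost everywhere,
    # here we simply use 1_{w*x + b > 0} as its derivative
    pre = w * xs + b
    active = (pre > 0).astype(float)
    err = 2.0 * (np.maximum(pre, 0.0) - ys) / len(xs)
    return np.sum(err * active * xs), np.sum(err * active)

rng = np.random.default_rng(1)
xs = rng.uniform(-1.0, 1.0, size=20)
ys = np.maximum(0.7 * xs + 0.1, 0.0) + 0.01 * rng.standard_normal(20)  # noisy targets

w, b, lr = 0.0, 0.5, 0.1
for step in range(1000):                 # plain gradient descent
    gw, gb = gradient(w, b, xs, ys)
    w, b = w - lr * gw, b - lr * gb
print("empirical risk:", empirical_risk(w, b, xs, ys))
```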
Prediction The final part of deep learning concerns the question of whether we have actually
learned something by the procedure above. Suppose that our optimization routine has either con-
verged or has been terminated, yielding a neural network Φ∗ . While the optimization aimed to
minimize the empirical risk on the training sample S, our ultimate interest is not in how well Φ∗ per-
forms on S. Rather, we are interested in its performance on new, unseen data points (xnew , y new ).
To make meaningful statements about this performance, we need to assume a relationship between
the training sample S and other data points.
The standard approach is to assume existence of a data distribution D on the input-output
space—in our case, this is Rd × Rk —such that both the elements of S and all other considered data
points are drawn from this distribution. In other words, we treat S as an i.i.d. draw from D, with (x_new, y_new) also sampled independently from D. If we want Φ∗ to perform well on average, then this amounts to controlling the expression

    R(Φ∗) := E_{(x,y)∼D}[L(Φ∗(x), y)],

which is called the risk of Φ∗. If the risk is not much larger than the empirical risk, then we say
that the neural network Φ∗ has a small generalization error. On the other hand, if the risk is
much larger than the empirical risk, then we say that Φ∗ overfits the training data, meaning that
Φ∗ has memorized the training samples, but does not generalize well to new data.
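In practice D is unknown, and the risk is typically estimated on fresh, held-out samples. The toy distribution and predictor below are assumptions chosen only to make the comparison between empirical risk and (estimated) risk concrete.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_from_D(m):
    # toy data distribution: y = sin(3x) + noise
    x = rng.uniform(-1.0, 1.0, size=m)
    y = np.sin(3.0 * x) + 0.05 * rng.standard_normal(m)
    return x, y

def square_loss(pred, y):
    return (pred - y) ** 2

# a fixed (hand-picked) predictor standing in for a trained network
phi_star = lambda x: np.sin(3.0 * x)

x_train, y_train = sample_from_D(50)
x_test, y_test = sample_from_D(100_000)   # large fresh sample: Monte Carlo estimate of the risk

empirical_risk = np.mean(square_loss(phi_star(x_train), y_train))
estimated_risk = np.mean(square_loss(phi_star(x_test), y_test))
print("empirical risk:", empirical_risk)
print("estimated risk:", estimated_risk)  # small gap: this predictor generalizes well
```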
Figure 1.3: A sequence of one-dimensional neural networks Φ_1, ..., Φ_4 that successively minimizes the empirical risk for the sample S = (x_i, y_i)_{i=1}^6.
1.3 Why does it work?
It is natural to wonder why the deep learning pipeline, as outlined in the previous subsection,
ultimately succeeds in learning, i.e., achieving a small risk. Is it true that for a given sample
(x_i, y_i)_{i=1}^m there exists a neural network Φ such that Φ(x_i) ≈ y_i for all i = 1, ..., m? Does the
optimization routine produce a meaningful result? Can we control the risk, knowing only that the
empirical risk is small?
While most of these questions can be answered affirmatively under certain assumptions, these
assumptions often do not apply to deep learning in practice. We next explore some potential
explanations and show that they lead to even more questions.
Approximation A fundamental result in the study of neural networks is the so-called universal
approximation theorem, which will be discussed in Chapter 3. This result states that every con-
tinuous function on a compact domain can be approximated arbitrarily well (in a uniform sense)
by a shallow neural network.
This result, however, does not answer questions that are more specific to deep learning, such
as the question of efficiency. For example, if we aim for computational efficiency, then we might
be interested in the smallest neural network that fits the data. This raises the question: What is
the role of the architecture for the expressive capabilities of neural networks? Furthermore, if we
consider reducing the empirical risk as an approximation problem, we are confronted with one of the
main issues of approximation theory, which is the curse of dimensionality. Function approximation
in high dimensions is notoriously difficult and gets exponentially harder with increasing dimension.
In practice, many successful deep learning architectures operate in this high-dimensional regime.
Why do these neural networks not seem to suffer from the curse of dimensionality?
Optimization While gradient descent can sometimes be proven to converge to a global minimum
as we will discuss in Chapter 10, this typically requires the objective function to be at least convex.
However, there is no reason to believe that for example the empirical risk is a convex function of
the network parameters. In fact, due to the repeatedly occurring compositions with the nonlinear
activation function in the network, the empirical risk is typically highly non-linear and not convex.
Therefore, there is generally no guarantee that the optimization routine will converge to a global
minimum, and it may get stuck in a local (and non-global) minimum or a saddle point. Why is the
output of the optimization nonetheless often meaningful in practice?
Generalization In traditional statistical learning theory, which we will review in Chapter 14,
the extent to which the risk exceeds the empirical risk can be bounded a priori; such bounds are
often expressed in terms of a notion of complexity of the set of admissible functions (the class of
neural networks) divided by the number of training samples. For the class of neural networks of a
fixed architecture, the complexity roughly amounts to the number of neural network parameters.
In practice, typically neural networks with more parameters than training samples are used. This
is dubbed the overparameterized regime. In this regime, the classical estimates described above are
void.
Why is it that, nonetheless, deep overparameterized architectures are capable of making accu-
rate predictions on unseen data? Furthermore, while deep architectures often generalize well, they
sometimes fail spectacularly on specific, carefully crafted examples. In image classification tasks,
these examples may differ only slightly from correctly classified images in a way that is not per-
ceptible to the human eye. Such examples are known as adversarial examples, and their existence
poses a great challenge for applications of deep learning.
we prove substantially better approximation rates than we obtained for shallow neural networks.
This again adds to our understanding of depth and its connection to the expressive power of neural
network architectures.
Chapter 8: High-dimensional approximation. The convergence rates established in the
previous chapters deteriorate significantly in high-dimensional settings. This chapter examines
three scenarios under which neural networks can provably overcome the curse of dimensionality.
Chapter 9: Interpolation. In this chapter we shift our perspective from approximation to
exact interpolation of the training data. We analyze conditions under which exact interpolation is
possible, and discuss the implications for empirical risk minimization. Furthermore, we present a
constructive proof showing that ReLU networks can express an optimal interpolant of the data (in
a specific sense).
Chapter 10: Training of neural networks. We start to examine the training process of deep
learning. First, we study the fundamentals of (stochastic) gradient descent and convex optimization.
Then, we discuss how the backpropagation algorithm can be used to implement these optimization
algorithms for training neural networks. Finally, we examine accelerated methods and highlight
the key principles behind popular and more advanced training algorithms such as Adam.
Chapter 11: Wide neural networks and the neural tangent kernel. This chapter
introduces the neural tangent kernel as a tool for analyzing the training behavior of neural networks.
We begin by revisiting linear and kernel regression for the approximation of functions based on data.
Afterwards, we demonstrate in an abstract setting that under certain assumptions, the training
dynamics of gradient descent for neural networks resemble those of kernel regression, converging
to a global minimum. Using standard initialization schemes, we then show that the assumptions
for such a statement to hold are satisfied with high probability, if the network is sufficiently wide
(overparameterized). This analysis provides insights into why, under certain conditions, we can
train neural networks without getting stuck in (bad) local minima, despite the non-convexity of
the objective function. Additionally, we discuss a well-known link between neural networks and
Gaussian processes, giving some indication why overparameterized networks do not necessarily
overfit in practice.
Chapter 12: Loss landscape analysis. In this chapter, we present an alternative view on the
optimization problem, by analyzing the loss landscape—the empirical risk as a function of the neural
network parameters. We give theoretical arguments showing that increasing overparameterization
leads to greater connectivity between the valleys and basins of the loss landscape. Consequently,
overparameterized architectures make it easier to reach a region where all minima are global minima.
Additionally, we observe that most stationary points associated with non-global minima are saddle
points. This sheds further light on the empirically observed fact that deep architectures can often
be optimized without getting stuck in non-global minima.
Chapter 13: Shape of neural network spaces. While Chapters 11 and 12 highlight
potential reasons for the success of neural network training, in this chapter, we show that the set
of neural networks of a fixed architecture has some undesirable properties from an optimization
perspective. Specifically, we show that this set is typically non-convex. Moreover, in general it does
not possess the best-approximation property, meaning that there might not exist a neural network
within the set yielding the best approximation for a given function.
Chapter 14: Generalization properties of deep neural networks. To understand
why deep neural networks successfully generalize to unseen data points (outside of the training
set), we study classical statistical learning theory, with a focus on neural network functions as the
hypothesis class. We then show how to establish generalization bounds for deep learning, providing
theoretical insights into the performance on unseen data.
Chapter 15: Generalization in the overparameterized regime. The generalization
bounds of the previous chapter are not meaningful when the number of parameters of a neural net-
work surpasses the number of training samples. However, this overparameterized regime is where
many successful network architectures operate. To gain a deeper understanding of generalization
in this regime, we describe the phenomenon of double descent and present a potential explana-
tion. This addresses the question of why deep neural networks perform well despite being highly
overparameterized.
Chapter 16: Robustness and adversarial examples. In the final chapter, we explore
the existence of adversarial examples—inputs designed to deceive neural networks. We provide
some theoretical explanations of why adversarial examples arise, and discuss potential strategies to
prevent them.
be found in [149]. Regarding the topic of fairness, we refer for instance to [55, 8].
Unsupervised and Reinforcement Learning: While this book focuses on supervised learn-
ing, where each data point xi has a label yi , there is a vast field of machine learning called unsuper-
vised learning, where labels are absent. Classical unsupervised learning problems include clustering
and dimensionality reduction [212, Chapters 22/23].
A popular area in deep learning, where no labels are used, is physics-informed neural networks
[187]. Here, a neural network is trained to satisfy a partial differential equation (PDE), with the
loss function quantifying the deviation from this PDE.
Finally, reinforcement learning is a technique where an agent can interact with an environment
and receives feedback based on its actions. The actions are guided by a so-called policy, which is
to be learned, [148, Chapter 17]. In deep reinforcement learning, this policy is modeled by a deep
neural network. Reinforcement learning is the basis of the aforementioned AlphaGo.
Implementation: While this book focuses on provable theoretical results, the field of deep
learning is strongly driven by applications, and a thorough understanding of deep learning cannot
be achieved without practical experience. For this, there exist numerous resources with excellent
explanations. We recommend [67, 38, 182] as well as the countless online tutorials that are just a
Google (or alternative) search away.
Many more: The field is evolving rapidly, and new ideas are constantly being generated
and tested. This book cannot give a complete overview. However, we hope that it provides the
reader with a solid foundation in the fundamental knowledge and principles to quickly grasp and
understand new developments in the field.
Chapter 2

Feedforward neural networks
Feedforward neural networks, henceforth simply referred to as neural networks (NNs), constitute
the central object of study of this book. In this chapter, we provide a formal definition of neural
networks, discuss the size of a neural network, and give a brief overview of common activation
functions.
    x^(0) := x,    (2.1.1a)
    x^(ℓ) := σ(W^(ℓ−1) x^(ℓ−1) + b^(ℓ−1))    for ℓ ∈ {1, ..., L},    (2.1.1b)
    x^(L+1) := W^(L) x^(L) + b^(L),    (2.1.1c)

and Φ(x) := x^(L+1).

We call L the depth, d_max = max_{ℓ=1,...,L} d_ℓ the width, σ the activation function, and (σ; d_0, ..., d_{L+1}) the architecture of the neural network Φ. Moreover, W^(ℓ) ∈ R^{d_{ℓ+1}×d_ℓ} are the weight matrices and b^(ℓ) ∈ R^{d_{ℓ+1}} the bias vectors of Φ for ℓ = 0, ..., L.
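The recursion (2.1.1) translates directly into code. The sketch below assumes a concrete architecture, the ReLU activation, and random parameters purely for illustration.

```python
import numpy as np

def forward(x, weights, biases, sigma):
    """Evaluate Phi(x) for weights W^(0..L) and biases b^(0..L) as in (2.1.1)."""
    xl = np.asarray(x, dtype=float)            # x^(0) := x
    L = len(weights) - 1
    for l in range(L):                         # x^(l+1) := sigma(W^(l) x^(l) + b^(l))
        xl = sigma(weights[l] @ xl + biases[l])
    return weights[L] @ xl + biases[L]         # x^(L+1) := W^(L) x^(L) + b^(L)

relu = lambda z: np.maximum(z, 0.0)

# architecture (sigma; d_0, ..., d_{L+1}) = (relu; 3, 4, 3, 4, 2), as in Figure 2.1
dims = [3, 4, 3, 4, 2]
rng = np.random.default_rng(3)
weights = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
biases = [rng.standard_normal(dims[l + 1]) for l in range(len(dims) - 1)]

print(forward(rng.standard_normal(3), weights, biases, relu))
```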
Remark 2.2. Typically, there exist different choices of architectures, weights, and biases yielding
the same function Φ : Rd0 → RdL+1 . For this reason we cannot associate a unique meaning to these
notions solely based on the function realized by Φ. In the following, when we refer to the properties
of a neural network Φ, it is always understood to mean that there exists at least one construction
as in Definition 2.1, which realizes the function Φ and uses parameters that satisfy those properties.
The architecture of a neural network is often depicted as a connected graph, as illustrated in
Figure 2.1. The nodes in such graphs represent (the output of) the neurons. They are arranged in
layers, with x(ℓ) in Definition 2.1 corresponding to the neurons in layer ℓ. We also refer to x(0) in
(2.1.1a) as the input layer and to x(L+1) in (2.1.1c) as the output layer. All layers in between
are referred to as the hidden layers and their output is given by (2.1.1b). The number of hidden
layers corresponds to the depth. For the correct interpretation of such graphs, we note that by our
conventions in Definition 2.1, the activation function is applied after each affine transformation,
except in the final layer.
Neural networks of depth one are called shallow; if the depth is larger than one, they are called
deep. The notion of deep neural networks is not used entirely consistently in the literature, and
some authors use the word deep only in case the depth is much larger than one, where the precise
meaning of “much larger” depends on the application.
Throughout, we only consider neural networks in the sense of Definition 2.1. We emphasize
however, that this is just one (simple but very common) type of neural network. Many adjustments
to this construction are possible and also widely used. For example:
• We may use different activation functions σℓ in each layer ℓ or we may even use a different
activation function for each node.
• Residual neural networks allow “skip connections”. This means that information is allowed
to skip layers in the sense that the nodes in layer ℓ may have x(0) , . . . , x(ℓ−1) as their input
(and not just x(ℓ−1) ), cf. (2.1.1).
Let us clarify some further common terminology used in the context of neural networks:
• parameters: The parameters of a neural network refer to the set of all entries of the weight
matrices and bias vectors. These are often collected in a single vector
These parameters are adjustable and are learned during the training process, determining the
specific function realized by the network.
• hyperparameters: Hyperparameters are settings that define the network’s architecture (and
training process), but are not directly learned during training. Examples include the depth,
the number of neurons in each layer, and the choice of activation function. They are typically
set before training begins.
• weights: The term “weights” is often used broadly to refer to all parameters of a neural
network, including both the weight matrices and bias vectors.
Figure 2.1: Sketch of a neural network with three hidden layers, and d0 = 3, d1 = 4, d2 = 3, d3 = 4,
d4 = 2. The neural network has depth three and width four.
• model: For a fixed architecture, every choice of network parameters w as in (2.1.2) defines
a specific function x 7→ Φw (x). In deep learning this function is often referred to as a model.
More generally, “model” can be used to describe any function parameterized by a set of parameters w ∈ R^n, n ∈ N.
(ii) if d_0^1 = d_0^2 =: d_0 and L_1 = L_2 =: L, then there exists a neural network Φ^parallel with architecture (σ; d_0, d_1^1 + d_1^2, ..., d_{L+1}^1 + d_{L+1}^2) such that

    Φ^parallel(x) = (Φ^1(x), Φ^2(x))    for all x ∈ R^{d_0},

(iii) if d_0^1 = d_0^2 =: d_0, L_1 = L_2 =: L, and d_{L+1}^1 = d_{L+1}^2 =: d_{L+1}, then there exists a neural network Φ^sum with architecture (σ; d_0, d_1^1 + d_1^2, ..., d_L^1 + d_L^2, d_{L+1}) such that

    Φ^sum(x) = Φ^1(x) + Φ^2(x)    for all x ∈ R^{d_0},

(iv) if d_{L_1+1}^1 = d_0^2, then there exists a neural network Φ^comp with architecture (σ; d_0^1, d_1^1, ..., d_{L_1}^1, d_1^2, ..., d_{L_2+1}^2) such that

    Φ^comp(x) = Φ^2 ◦ Φ^1(x)    for all x ∈ R^{d_0^1}.
• weight sharing: This is a technique where specific entries of the weight matrices (or bias
vectors) are constrained to be equal. Formally, this means imposing conditions of the form
W^(i)_{k,l} = W^(j)_{s,t}, i.e. the entry (k, l) of the ith weight matrix is equal to the entry at position
(s, t) of weight matrix j. We denote this assumption by (i, k, l) ∼ (j, s, t), paying tribute
to the trivial fact that “∼” is an equivalence relation. During training, shared weights are
updated jointly, meaning that any change to one weight is simultaneously applied to all other
weights of this class. Weight sharing can also be applied to the entries of bias vectors.
• sparsity: This refers to imposing a sparsity structure on the weight matrices (or bias vectors).
Specifically, we a priori set W^(i)_{k,l} = 0 for certain (k, l, i), i.e. we require entry (k, l) of the ith weight matrix to be 0. These zero-valued entries are considered fixed, and are not adjusted during training. The condition W^(i)_{k,l} = 0 corresponds to node l of layer i − 1 not serving as an
input to node k in layer i. If we represent the neural network as a graph, this is indicated by
not connecting the corresponding nodes. Sparsity can also be imposed on the bias vectors.
Both of these restrictions decrease the number of learnable parameters in the neural network. The
number of parameters can be seen as a measure of the complexity of the represented function class.
For this reason, we introduce size(Φ) as a notion for the number of learnable parameters. Formally
(with |S| denoting the cardinality of a set S):
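As a rough illustration of counting learnable parameters (Definition 2.4 itself is not reproduced in this excerpt), the sketch below counts the non-zero entries of all weight matrices and bias vectors; treating entries fixed to zero by a sparsity pattern as not learnable is an assumption about the precise convention.

```python
import numpy as np

def size_of_network(weights, biases):
    # count non-zero entries of all weight matrices and bias vectors;
    # entries fixed to zero by a sparsity pattern are not counted
    return sum(int(np.count_nonzero(W)) for W in weights) + \
           sum(int(np.count_nonzero(b)) for b in biases)

W0 = np.array([[1.0, 0.0], [0.5, -2.0]])   # one entry fixed to zero (sparsity)
b0 = np.array([0.0, 1.0])
W1 = np.array([[3.0, 0.0]])
b1 = np.array([0.5])
print(size_of_network([W0, W1], [b0, b1]))  # 3 + 1 + 1 + 1 = 6
```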
2.3 Activation functions
Activation functions are a crucial part of neural networks, as they introduce nonlinearity into the
model. If an affine activation function were used, the resulting neural network function would also
be affine and hence very restricted in what it can represent.
The choice of activation function can have a significant impact on the performance, but there
does not seem to be a universally optimal one. We next discuss a few important activation functions
and highlight some common issues associated with them.
[Figure 2.2: plots of common activation functions: (a) the sigmoid, (b) the ReLU and the SiLU, (c) the parametric ReLU for a = 0.05, 0.1, 0.2.]

ReLU (Rectified Linear Unit): The ReLU activation function is defined as σ_ReLU(x) := max{0, x} for x ∈ R,
and depicted in Figure 2.2 (b). It is piecewise linear, and due to its simplicity its evaluation is
computationally very efficient. It is one of the most popular activation functions in practice. Since
its derivative is always zero or one, it does not suffer from the vanishing gradient problem to the
same extent as the sigmoid function. However, ReLU can suffer from the so-called dead neurons
problem. Consider the neural network
depending on the bias b ∈ R. If b < 0, then Φ(x) = 0 for all x ∈ R. The neuron corresponding to
the second application of σ_ReLU thus produces a constant signal. Moreover, if b < 0, then (d/db)Φ(x) = 0
for all x ∈ R. As a result, every negative value of b yields a stationary point of the empirical risk.
A gradient-based method will not be able to further train the parameter b. We thus refer to this
neuron as a dead neuron.
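The display defining the example network is not reproduced above. As a stand-in, the following check uses the hypothetical ReLU network Φ_b(x) = σ_ReLU(−σ_ReLU(x) + b), which shows the same effect: for b < 0 both the output and its derivative with respect to b vanish identically.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def phi(x, b):
    # hypothetical example network Phi_b(x) = relu(-relu(x) + b)
    return relu(-relu(x) + b)

xs = np.linspace(-3.0, 3.0, 7)
b = -0.1                                        # any negative bias
eps = 1e-6
outputs = phi(xs, b)                            # identically zero
d_db = (phi(xs, b + eps) - phi(xs, b)) / eps    # finite-difference derivative in b
print(outputs)   # all zeros
print(d_db)      # all zeros: gradient-based training cannot move b
```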
SiLU (Sigmoid Linear Unit): An important difference between the ReLU and the Sigmoid is
that the ReLU is not differentiable at 0. The SiLU activation function (also referred to as “swish”)
can be interpreted as a smooth approximation to the ReLU. It is defined as
    σ_SiLU(x) := x σ_sig(x) = x/(1 + e^{−x})    for x ∈ R,
and is depicted in Figure 2.2 (b). There exist various other smooth activation functions that
mimic the ReLU, including the Softplus x 7→ log(1 + exp(x)), the GELU (Gaussian Error Linear
Unit) x 7→ xF (x) where F (x) denotes the cumulative distribution function of the standard normal
distribution, and the Mish x 7→ x tanh(log(1 + exp(x))).
Parametric ReLU or Leaky ReLU: This variant of the ReLU addresses the dead neuron
problem. For some a ∈ (0, 1), the parametric ReLU is defined as

    σ_a(x) := max{ax, x}    for x ∈ R,

and is depicted in Figure 2.2 (c) for three different values of a. Since the output of σ_a does not
have flat regions like the ReLU, the dying ReLU problem is mitigated. If a is not chosen too small,
then there is less of a vanishing gradient problem than for the Sigmoid. In practice, the additional
parameter a has to be fine-tuned depending on the application. Like the ReLU, the parametric
ReLU is not differentiable at 0.
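For reference, the activation functions discussed in this section can be written down in a few lines; the sigmoid is taken to be σ_sig(x) = 1/(1 + e^{−x}) and the parametric ReLU to be max{ax, x}, the usual conventions.

```python
import numpy as np

def sigmoid(x):                 # sigma_sig(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):                    # sigma_ReLU(x) = max{0, x}
    return np.maximum(x, 0.0)

def silu(x):                    # sigma_SiLU(x) = x * sigma_sig(x)
    return x * sigmoid(x)

def parametric_relu(x, a=0.1):  # sigma_a(x) = max{a*x, x} for a in (0, 1)
    return np.maximum(a * x, x)

xs = np.linspace(-5.0, 5.0, 5)
for f in (sigmoid, relu, silu, parametric_relu):
    print(f.__name__, f(xs))
```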
Exercises
Exercise 2.5. Prove Proposition 2.3.
Exercise 2.6. In this exercise, we show that ReLU and parametric ReLU create similar sets of
neural network functions. Fix a > 0.
(i) Find a set of weight matrices and bias vectors such that the associated neural network Φ1 with the ReLU activation function σReLU satisfies Φ1(x) = σa(x) for all x ∈ R.
(ii) Find a set of weight matrices and bias vectors such that the associated neural network Φ2 with the parametric ReLU activation function σa satisfies Φ2(x) = σReLU(x) for all x ∈ R.
(iii) Conclude that every ReLU neural network can be expressed as a leaky ReLU neural network
and vice versa.
Exercise 2.7. Let d ∈ N, and let Φ1 be a neural network with the ReLU as activation function,
input dimension d, and output dimension 1. Moreover, let Φ2 be a neural network with the sigmoid
activation function, input dimension d, and output dimension 1. Show that, if Φ1 = Φ2 , then Φ1 is
a constant function.
Exercise 2.8. In this exercise, we show that for the sigmoid activation functions, dead-neuron-like
behavior is very rare. Let Φ be a neural network with the sigmoid activation function. Assume
that Φ is a constant function. Show that for every ε > 0 there is a non-constant neural network Φ̃ with the same architecture as Φ such that for all ℓ = 0, ..., L,

    ∥W^(ℓ) − W̃^(ℓ)∥ ≤ ε    and    ∥b^(ℓ) − b̃^(ℓ)∥ ≤ ε,

where W^(ℓ), b^(ℓ) are the weights and biases of Φ and W̃^(ℓ), b̃^(ℓ) are the weights and biases of Φ̃.
Show that such a statement does not hold for ReLU neural networks. What about leaky ReLU?
Chapter 3
Universal approximation
After introducing neural networks in Chapter 2, it is natural to inquire about their capabilities.
Specifically, we might wonder if there exist inherent limitations to the type of functions a neural
network can represent. Could there be a class of functions that neural networks cannot approx-
imate? If so, it would suggest that neural networks are specialized tools, similar to how linear
regression is suited for linear relationships, but not for data with nonlinear relationships.
In this chapter, we will show that this is not the case, and neural networks are indeed a universal
tool. More precisely, given sufficiently large and complex architectures, they can approximate
almost every sensible input-output relationship. We will formalize and prove this claim in the
subsequent sections.
Throughout what follows, we always consider C 0 (Rd ) equipped with the topology of Defini-
tion 3.1 (also see Exercise 3.22), and every subset such as C 0 (D) with the subspace topology:
for example, if D ⊆ Rd is bounded, then convergence in C 0 (D) refers to uniform convergence
limn→∞ supx∈D |fn (x) − f (x)| = 0.
Definition 3.2. Let d ∈ N. A set of functions H from Rd to R is a universal approximator (of
C 0 (Rd )), if for every ε > 0, every compact K ⊆ Rd , and every f ∈ C 0 (Rd ), there exists g ∈ H such
that supx∈K |f (x) − g(x)| < ε.
For a set of (not necessarily continuous) functions H mapping between R^d and R, we denote by \overline{H}^{cc} its closure with respect to compact convergence.
The relationship between a universal approximator and the closure with respect to compact convergence is established in the proposition below.

Proposition 3.3. Let d ∈ N and let H be a set of functions from R^d to R. Then H is a universal approximator of C^0(R^d) if and only if C^0(R^d) ⊆ \overline{H}^{cc}.

Proof. Suppose that H is a universal approximator and fix f ∈ C^0(R^d). For n ∈ N, define K_n := [−n, n]^d ⊆ R^d. Then for every n ∈ N there exists f_n ∈ H such that sup_{x∈K_n} |f_n(x) − f(x)| < 1/n. Since for every compact K ⊆ R^d there exists n_0 such that K ⊆ K_n for all n ≥ n_0, it holds f_n → f with respect to compact convergence.
The “only if” part of the assertion is trivial.
A key tool to show that a set is a universal approximator is the Stone-Weierstrass theorem, see
for instance [196, Sec. 5.7].
(c) H is an algebra of functions, i.e., H is closed under addition, multiplication and scalar mul-
tiplication.
write

    P_n := span{x^α | α ∈ N_0^d, |α| ≤ n},

i.e., P_n is the space of polynomials of degree at most n (with real coefficients). It is easy to check that P := ⋃_{n∈N} P_n(R^d) satisfies the assumptions of Theorem 3.4 on every compact set K ⊆ R^d. Thus the space of polynomials P is a universal approximator of C^0(R^d), and by Proposition 3.3, P is dense in C^0(R^d). In case we wish to emphasize the dimension of the underlying space, in the following we will also write P_n(R^d) or P(R^d) to denote P_n, P respectively.
3.1.2 Shallow neural networks
With the necessary formalism established, we can now show that shallow neural networks of ar-
bitrary width form a universal approximator under certain (mild) conditions on the activation
function. The results in this section are based on [132], and for the proofs we follow the arguments
in that paper.
We first introduce notation for the set of all functions realized by certain architectures.
Definition 3.6. Let d, m, L, n ∈ N and σ : R → R. The set of all functions realized by neural
networks with d-dimensional input, m-dimensional output, depth at most L, width at most n, and
activation function σ is denoted by N_d^m(σ; L, n). Furthermore,

    N_d^m(σ; L) := ⋃_{n∈N} N_d^m(σ; L, n).
In the sequel, we require the activation function σ to belong to the set of piecewise continuous
and locally bounded functions
    M := {σ ∈ L^∞_loc(R) | there exist intervals I_1, ..., I_M partitioning R s.t. σ ∈ C^0(I_j) for all j = 1, ..., M}.    (3.1.1)
Here, M ∈ N is finite, and the intervals Ij are understood to have positive (possibly infinite)
Lebesgue measure, i.e. Ij is e.g. not allowed to be empty or a single point. Hence, σ is a piecewise
continuous function with at most finitely many discontinuities.
Example 3.7. Activation functions belonging to M include, in particular, all continuous non-
polynomial functions, which in turn includes all practically relevant activation functions such as
the ReLU, the SiLU, and the Sigmoid discussed in Section 2.3. In these cases, we can choose M = 1
and I1 = R. Discontinuous functions include for example the Heaviside function x 7→ 1x>0 (also
called a “perceptron” in this context) but also x 7→ 1x>0 sin(1/x): Both belong to M with M = 2,
I1 = (−∞, 0] and I2 = (0, ∞). We exclude for example the function x 7→ 1/x, which is not locally
bounded.
The rest of this subsection is dedicated to proving the following theorem that has now already
been announced repeatedly.
Theorem 3.8. Let d ∈ N and σ ∈ M. Then Nd1 (σ; 1) is a universal approximator of C 0 (Rd ) if
and only if σ is not a polynomial.
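Theorem 3.8 is a pure existence statement, but its content is easy to probe numerically: fix many random ReLU features x ↦ σ(w_j x + b_j) and fit only the outer coefficients by least squares. The target function, the feature distribution, and the width below are assumptions of this experiment, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(4)
relu = lambda z: np.maximum(z, 0.0)

# target: a continuous function on the compact set [-1, 1]
f = lambda x: np.cos(4.0 * x) + np.abs(x)

n = 400                                    # width of the shallow network
w = rng.standard_normal(n) * 5.0
b = rng.uniform(-5.0, 5.0, size=n)

x_train = np.linspace(-1.0, 1.0, 2000)
features = relu(np.outer(x_train, w) + b)  # shape (2000, n): sigma(w_j x + b_j)
c, *_ = np.linalg.lstsq(features, f(x_train), rcond=None)

x_test = np.linspace(-1.0, 1.0, 999)
approx = relu(np.outer(x_test, w) + b) @ c
print("max error on a test grid in [-1, 1]:", np.max(np.abs(approx - f(x_test))))
```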
Remark 3.9. We will see in Exercise 3.26 and Corollary 3.18 that neural networks can also arbitrarily
well approximate non-continuous functions with respect to suitable norms.
The universal approximation theorem by Leshno, Lin, Pinkus and Schocken [132]—of which
Theorem 3.8 is a special case—is even formulated for a much larger set M, which allows for
activation functions that have discontinuities at a (possibly non-finite) set of Lebesgue measure
zero. Instead of proving the theorem in this generality, we resort to the simpler case stated above.
This allows to avoid some technicalities, but the main ideas remain the same. The proof strategy
is to verify the following three claims:
(i) if C^0(R^1) ⊆ \overline{N_1^1(σ; 1)}^{cc}, then C^0(R^d) ⊆ \overline{N_d^1(σ; 1)}^{cc},

(ii) if σ ∈ C^∞(R) is not a polynomial, then C^0(R^1) ⊆ \overline{N_1^1(σ; 1)}^{cc},

(iii) if σ ∈ M is not a polynomial, then there exists σ̃ ∈ C^∞(R) ∩ \overline{N_1^1(σ; 1)}^{cc} which is not a polynomial.

Upon observing that σ̃ ∈ \overline{N_1^1(σ; 1)}^{cc} implies \overline{N_1^1(σ̃; 1)}^{cc} ⊆ \overline{N_1^1(σ; 1)}^{cc}, it is easy to see that these
statements together with Proposition 3.3 establish the implication “⇐” asserted in Theorem 3.8.
The reverse direction is straightforward to check and will be the content of Exercise 3.23.
We start with a more general version of (i) and reduce the problem to the one dimensional case.
Lemma 3.10. Assume that H is a universal approximator of C^0(R). Then for every d ∈ N,

    span{x ↦ g(w · x) | w ∈ R^d, g ∈ H}

is a universal approximator of C^0(R^d).

Proof. For k ∈ N_0, denote by H_k the span of all monomials x ↦ x^α with α ∈ N_0^d and |α| = k, i.e., the space of k-homogeneous polynomials on R^d. We claim that

    H_k ⊆ \overline{span{R^d ∋ x ↦ g(w · x) | w ∈ R^d, g ∈ H}}^{cc} =: X    (3.1.2)
for all k ∈ N0 . This implies that all multivariate polynomials belong to X. An application of the
Stone-Weierstrass theorem (cp. Example 3.5) and Proposition 3.3 then conclude the proof.
For every α, β ∈ N_0^d with |α| = |β| = k, it holds D^β x^α = δ_{β,α} α!, where α! := ∏_{j=1}^d α_j! and δ_{β,α} = 1 if β = α and δ_{β,α} = 0 otherwise. Hence, since {x ↦ x^α | |α| = k} is a basis of H_k, the set {D^α | |α| = k} is a basis of its topological dual H_k′. Thus each linear functional l ∈ H_k′ allows the representation l = p(D) for some p ∈ H_k (here D stands for the differential).
By the multinomial formula,

    (w · x)^k = (∑_{j=1}^d w_j x_j)^k = ∑_{α∈N_0^d, |α|=k} (k!/α!) w^α x^α.
Therefore, we have that (x ↦ (w · x)^k) ∈ H_k. Moreover, for every l = p(D) ∈ H_k′ and all w ∈ R^d we have that

    p(D)(x ↦ (w · x)^k) = k! p(w).

Hence, if l(x ↦ (w · x)^k) = p(D)(x ↦ (w · x)^k) = 0 for all w ∈ R^d, then p ≡ 0 and thus l ≡ 0.
This implies span{x 7→ (w · x)k | w ∈ Rd } = Hk . Indeed, if there exists h ∈ Hk which is not
in span{x 7→ (w · x)k | w ∈ Rd }, then by the theorem of Hahn-Banach (see Theorem B.8), there
exists a non-zero functional in H′k vanishing on span{x 7→ (w · x)k | w ∈ Rd }. This contradicts the
previous observation.
By the universality of H it is not hard to see that x 7→ (w · x)k ∈ X for all w ∈ Rd . Therefore,
we have Hk ⊆ X for all k ∈ N0 .
By the above lemma, in order to verify that Nd1 (σ; 1) is a universal approximator, it suffices to
show that N11 (σ; 1) is a universal approximator. We first show that this is the case for sigmoidal
activations.
For sigmoidal activation functions we can now conclude the universality in the univariate case.
Lemma 3.12. Let σ : R → R be monotonically increasing and sigmoidal. Then C^0(R) ⊆ \overline{N_1^1(σ; 1)}^{cc}.
We prove Lemma 3.12 in Exercise 3.24. Lemma 3.10 and Lemma 3.12 show Theorem 3.8 in
the special case where σ is monotonically increasing and sigmoidal. For the general case, let us
continue with (ii) and consider C ∞ activations.
Lemma 3.13. If σ ∈ C ∞ (R) and σ is not a polynomial, then N11 (σ; 1) is dense in C 0 (R).
Proof. Denote X := \overline{N_1^1(σ; 1)}^{cc}. We show again that all polynomials belong to X. An application
of the Stone-Weierstrass theorem then gives the statement.
Fix b ∈ R and denote f_x(w) := σ(wx + b) for all x, w ∈ R. By Taylor's theorem, for h ≠ 0,

    (f_x(w + h) − f_x(w))/h = x σ′(wx + b) + (h/2) x^2 σ′′(ξ x + b)

for some ξ = ξ(h) between w and w + h. Note that the left-hand side belongs to N_1^1(σ; 1) as a function of x. Since σ′′ ∈ C^0(R), for every compact set K ⊆ R,

    sup_{x∈K} sup_{|h|≤1} |x^2 σ′′(ξ(h)x + b)| ≤ sup_{x∈K} sup_{η∈[w−1,w+1]} |x^2 σ′′(ηx + b)| < ∞.
Finally, we come to the proof of (iii)—the claim that there exists at least one non-polynomial
C ∞ (R) function in the closure of N11 (σ; 1). The argument is split into two lemmata. Denote in the
following by Cc∞ (R) the set of compactly supported C ∞ (R) functions.
Lemma 3.14. Let σ ∈ M. Then for each φ ∈ C_c^∞(R) it holds σ ∗ φ ∈ \overline{N_1^1(σ; 1)}^{cc}.
Proof. Fix φ ∈ C_c^∞(R) and let a > 0 such that supp φ ⊆ [−a, a]. We have

    σ ∗ φ(x) = ∫_R σ(x − y)φ(y) dy.

For n ∈ N set y_j := −a + 2aj/n for j = 0, ..., n, and define the Riemann-sum approximation

    f_n(x) := ∑_{j=0}^{n−1} σ(x − y_j) φ(y_j) (y_{j+1} − y_j).

Clearly, f_n ∈ N_1^1(σ; 1). We will show f_n → σ ∗ φ with respect to compact convergence as n → ∞. To do so we verify uniform convergence
of fn towards σ ∗ φ on the interval [−b, b] with b > 0 arbitrary but fixed.
For x ∈ [−b, b],

    |σ ∗ φ(x) − f_n(x)| ≤ ∑_{j=0}^{n−1} |∫_{y_j}^{y_{j+1}} (σ(x − y)φ(y) − σ(x − y_j)φ(y_j)) dy|.    (3.1.4)
Fix ε ∈ (0, 1). Since σ ∈ M, there exist z_1, ..., z_M ∈ R such that σ is continuous on R \ {z_1, ..., z_M} (cp. (3.1.1)). With D_ε := ⋃_{j=1}^M (z_j − ε, z_j + ε), observe that σ is uniformly continuous on the compact set K_ε := [−a − b, a + b] ∩ D_ε^c. Now let J_c ∪ J_d = {0, ..., n − 1} be a partition (depending on x),
such that j ∈ Jc if and only if [x − yj+1 , x − yj ] ⊆ Kε . Hence, j ∈ Jd implies the existence of
i ∈ {1, . . . , M } such that the distance of zi to [x − yj+1 , x − yj ] is at most ε. Due to the interval
[x − y_{j+1}, x − y_j] having length 2a/n, we can bound

    ∑_{j∈J_d} (y_{j+1} − y_j) = |⋃_{j∈J_d} [x − y_{j+1}, x − y_j]|
        ≤ |⋃_{i=1}^M [z_i − ε − 2a/n, z_i + ε + 2a/n]|
        ≤ M (2ε + 4a/n).
Next, because of the local boundedness of σ and the fact that φ ∈ C_c^∞, it holds sup_{|y|≤a+b} |σ(y)| + sup_{|y|≤a} |φ(y)| =: γ < ∞. Hence

    |σ ∗ φ(x) − f_n(x)| ≤ ∑_{j∈J_c∪J_d} |∫_{y_j}^{y_{j+1}} (σ(x − y)φ(y) − σ(x − y_j)φ(y_j)) dy|
        ≤ 2γ^2 M (2ε + 4a/n)
          + 2a sup_{j∈J_c} max_{y∈[y_j, y_{j+1}]} |σ(x − y)φ(y) − σ(x − y_j)φ(y_j)|.    (3.1.5)
Finally, uniform continuity of σ on Kε and φ on [−a, a] imply that the last term tends to 0 as
n → ∞ uniformly for all x ∈ [−b, b]. This shows that there exist C < ∞ (independent of ε and x)
and nε ∈ N (independent of x) such that the term in (3.1.5) is bounded by Cε for all n ≥ nε . Since
ε was arbitrary, this yields the claim.
Lemma 3.15. If σ ∈ M and σ ∗ φ is a polynomial for all φ ∈ Cc∞ (R), then σ is a polynomial.
Proof. Fix −∞ < a < b < ∞ and consider Cc∞ (a, b) := {φ ∈ C ∞ (R) | supp φ ⊆ [a, b]}. Define a
metric ρ on Cc∞ (a, b) via
    ρ(φ, ψ) := ∑_{j∈N_0} 2^{−j} |φ − ψ|_{C^j(a,b)} / (1 + |φ − ψ|_{C^j(a,b)}),

where |φ|_{C^j(a,b)} := sup_{x∈[a,b]} |φ^{(j)}(x)|. Since the space of j times differentiable functions on [a, b] is complete with respect to the norm ∑_{i=0}^{j} |·|_{C^i(a,b)}, see for instance [89, Satz 104.3], the space C_c^∞(a, b) is complete with the metric ρ.
For k ∈ N set

    V_k := {φ ∈ C_c^∞(a, b) | σ ∗ φ ∈ P_k},

which is a closed subspace of C_c^∞(a, b), and note that by assumption ⋃_{k∈N} V_k = C_c^∞(a, b). Baire's category theorem implies the existence of k_0 ∈ N (depending on a, b) such that V_{k_0} contains an open subset of C_c^∞(a, b). Since V_{k_0} is a vector space, it must hold V_{k_0} = C_c^∞(a, b).
We now show that φ ∗ σ ∈ Pk0 for every φ ∈ Cc∞ (R); in other words, k0 = k0 (a, b) can be chosen
independent of a and b. First consider a shift s ∈ R and let ã := a + s and b̃ := b + s. Then with
S(x) := x + s, for any φ ∈ Cc∞ (ã, b̃) holds φ ◦ S ∈ Cc∞ (a, b), and thus (φ ◦ S) ∗ σ ∈ Pk0 . Since
(φ ◦ S) ∗ σ(x) = φ ∗ σ(x + s), we conclude that φ ∗ σ ∈ Pk0 . Next let −∞ < ã < b̃ < ∞ be arbitrary.
Then, for an integer n > (b̃ − ã)/(b − a) we can cover (ã, b̃) with n ∈ N overlapping open intervals (a_1, b_1), ..., (a_n, b_n), each of length b − a. Any φ ∈ C_c^∞(ã, b̃) can be written as φ = ∑_{j=1}^n φ_j where φ_j ∈ C_c^∞(a_j, b_j). Then φ ∗ σ = ∑_{j=1}^n φ_j ∗ σ ∈ P_{k_0}, and thus φ ∗ σ ∈ P_{k_0} for every φ ∈ C_c^∞(R).
Finally, Exercise 3.25 implies σ ∈ Pk0 .
3.1.3 Deep neural networks
Theorem 3.8 shows the universal approximation capability of single-hidden-layer neural networks
with activation functions σ ∈ M\P: they can approximate every continuous function on every
compact set to arbitrary precision, given sufficient width. This result directly extends to neural
networks of any fixed depth L ≥ 1. The idea is to use the fact that the identity function can be
approximated with a shallow neural network. Composing a shallow neural network approximation of
the target function f with (multiple) shallow neural networks approximating the identity function,
gives a deep neural network approximation of f .
Instead of directly applying Theorem 3.8, we first establish the following proposition regarding
the approximation of the identity function. Rather than σ ∈ M\P, it requires a different (mild)
assumption on the activation function. This allows for a constructive proof, yielding explicit bounds
on the neural network size, which will prove useful later in the book.
Proposition 3.16. Let d, L ∈ N, let K ⊆ Rd be compact, and let σ : R → R be such that there
exists an open set on which σ is differentiable and not constant. Then, for every ε > 0, there exists
a neural network Φ ∈ N_d^d(σ; L, d) such that

    sup_{x∈K} ∥Φ(x) − x∥ ≤ ε.
Proof. The proof uses the same idea as in Lemma 3.13, where we approximate the derivative of the
activation function by a simple neural network. Let us first assume d ∈ N and L = 1.
Let x∗ ∈ R be such that σ is differentiable on a neighborhood of x∗ and σ ′ (x∗ ) = θ ̸= 0.
Moreover, let x* = (x*, ..., x*) ∈ R^d. Then, for λ > 0 we define

    Φ_λ(x) := (λ/θ) σ(x/λ + x*) − (λ/θ) σ(x*),

where σ and the subtraction are applied componentwise. Then, we have, for all x ∈ K,

    Φ_λ(x) − x = λ (σ(x/λ + x*) − σ(x*))/θ − x.    (3.1.6)

If x_i = 0 for i ∈ {1, ..., d}, then (3.1.6) shows that (Φ_λ(x) − x)_i = 0. Otherwise,

    (Φ_λ(x) − x)_i = x_i ((σ(x_i/λ + x*) − σ(x*)) / (θ x_i/λ) − 1).
By the definition of the derivative, we have that |(Φλ (x) − x)i | → 0 for λ → ∞ uniformly for all
x ∈ K and i ∈ {1, . . . , d}. Therefore, |Φλ (x) − x| → 0 for λ → ∞ uniformly for all x ∈ K.
The extension to L > 1 is straightforward and is the content of Exercise 3.27.
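The construction in the proof can be checked numerically for a concrete smooth activation. With the sigmoid, x* = 0 and θ = σ′(0) = 1/4, and Φ_λ(x) = (λ/θ)(σ(x/λ + x*) − σ(x*)) approaches the identity on a compact set as λ grows; the choice of activation and of x* is an assumption of this check.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x_star = 0.0
theta = 0.25                    # sigma'(0) = 1/4 for the sigmoid

def phi_lambda(x, lam):
    # Phi_lambda(x) = (lam / theta) * (sigma(x / lam + x*) - sigma(x*))
    return (lam / theta) * (sigmoid(x / lam + x_star) - sigmoid(x_star))

xs = np.linspace(-2.0, 2.0, 1001)          # compact set K = [-2, 2]
for lam in (1.0, 10.0, 100.0, 1000.0):
    err = np.max(np.abs(phi_lambda(xs, lam) - xs))
    print(f"lambda = {lam:7.1f}   sup_K |Phi_lambda(x) - x| = {err:.5f}")
```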
Corollary 3.17. Let d ∈ N, L ∈ N and σ ∈ M. Then Nd1 (σ; L) is a universal approximator of
C 0 (Rd ) if and only if σ is not a polynomial.
Proof. We only show the implication “⇐”. The other direction is again left as an exercise, see
Exercise 3.23.
Assume σ ∈ M is not a polynomial, let K ⊆ Rd be compact, and let f ∈ C 0 (Rd ). Fix ε ∈ (0, 1).
We need to show that there exists a neural network Φ ∈ Nd1 (σ; L) such that supx∈K |f (x)−Φ(x)| <
ε. The case L = 1 holds by Theorem 3.8, so let L > 1.
By Theorem 3.8, there exist Φshallow ∈ Nd1 (σ; 1) such that
    sup_{x∈K} |f(x) − Φ_shallow(x)| < ε/2.    (3.1.7)
Compactness of {f (x) | x ∈ K} implies that we can find n > 0 such that
{Φshallow (x) | x ∈ K} ⊆ [−n, n]. (3.1.8)
Let Φid ∈ N11 (σ; L − 1) be an approximation to the identity such that
    sup_{x∈[−n,n]} |x − Φ_id(x)| < ε/2,    (3.1.9)

which is possible by the extension of Proposition 3.16 to general non-polynomial activation functions
σ ∈ M.
Denote Φ := Φ_id ◦ Φ_shallow. According to Proposition 2.3 (iv), Φ ∈ N_d^1(σ; L) as desired.
Moreover (3.1.7), (3.1.8), (3.1.9) imply
    sup_{x∈K} |f(x) − Φ(x)| = sup_{x∈K} |f(x) − Φ_id(Φ_shallow(x))|
        ≤ sup_{x∈K} (|f(x) − Φ_shallow(x)| + |Φ_shallow(x) − Φ_id(Φ_shallow(x))|)
        ≤ ε/2 + ε/2 = ε.
This concludes the proof.
Corollary 3.18. Let d ∈ N, L ∈ N, p ∈ [1, ∞), and let σ ∈ M not be a polynomial. Then for
every ε > 0, every compact K ⊆ Rd , and every f ∈ Lp (K) there exists Φf,ε ∈ Nd1 (σ; L) such that
    (∫_K |f(x) − Φ_{f,ε}(x)|^p dx)^{1/p} ≤ ε.
3.2 Superexpressive activations and Kolmogorov’s superposition
theorem
In the previous section, we saw that a large class of activation functions allow for universal approx-
imation. However, these results did not provide any insights into the necessary neural network size
for achieving a specific accuracy.
Before exploring this topic further in the following chapters, we next present a remarkable result
that shows how the required neural network size is significantly influenced by the choice of activation
function. The result asserts that, with the appropriate activation function, every f ∈ C 0 (K) on a
compact set K ⊆ Rd can be approximated to every desired accuracy ε > 0 using a neural network
of size O(d2 ); in particular the neural network size is independent of ε > 0, K, and f . We will first
discuss the one-dimensional case.
Proposition 3.19. There exists a continuous activation function σ : R → R such that for every
compact K ⊆ R, every ε > 0 and every f ∈ C^0(K) there exists Φ(x) = σ(wx + b) ∈ N_1^1(σ; 1, 1) such that

    sup_{x∈K} |f(x) − Φ(x)| < ε.

Proof. Denote by P̃_n all polynomials p(x) = ∑_{j=0}^n q_j x^j with rational coefficients, i.e. such that q_j ∈ Q for all j. The set P̃ := ⋃_{n∈N} P̃_n is countable, so we may fix an enumeration (p_i)_{i∈Z} of P̃ and define σ on [2i, 2i + 1] by σ(2i + x) := p_i(x) for x ∈ [0, 1], and on [2i + 1, 2i + 2] by linear interpolation, so that σ is continuous.
First let K = [0, 1]. By the Weierstrass approximation theorem there exists a polynomial p(x) = ∑_{j=0}^n r_j x^j such that sup_{x∈[0,1]} |p(x) − f(x)| < ε/2. Now choose q_j ∈ Q so close to r_j such that p̃(x) := ∑_{j=0}^n q_j x^j satisfies sup_{x∈[0,1]} |p̃(x) − p(x)| < ε/2. Let i ∈ Z such that p̃(x) = p_i(x), i.e., p_i(x) = σ(2i + x) for all x ∈ [0, 1]. Then sup_{x∈[0,1]} |f(x) − σ(x + 2i)| < ε.
For general compact K assume that K ⊆ [a, b]. By Tietze’s extension theorem, f allows a
continuous extension to [a, b], so without loss of generality K = [a, b]. By the first case we can find
i ∈ Z such that with y = (x − a)/(b − a) (i.e. y ∈ [0, 1] if x ∈ [a, b])
    sup_{x∈[a,b]} |f(x) − σ((x − a)/(b − a) + 2i)| = sup_{y∈[0,1]} |f(y(b − a) + a) − σ(y + 2i)| < ε.
To extend this result to arbitrary dimension, we will use Kolmogorov’s superposition theorem.
It states that every continuous function of d variables can be expressed as a composition of functions
that each depend only on one variable. We omit the technical proof, which can be found in [120].
Theorem 3.20 (Kolmogorov). For every d ∈ N there exist 2d2 + d monotonically increasing
functions φi,j ∈ C 0 (R), i = 1, . . . , d, j = 1, . . . , 2d + 1, such that for every f ∈ C 0 ([0, 1]d ) there
exist functions fj ∈ C 0 (R), j = 1, . . . , 2d + 1 satisfying
    f(x) = ∑_{j=1}^{2d+1} f_j(∑_{i=1}^d φ_{i,j}(x_i))    for all x ∈ [0, 1]^d.
Corollary 3.21. Let d ∈ N. With the activation function σ : R → R from Proposition 3.19, for
every compact K ⊆ Rd , every ε > 0 and every f ∈ C 0 (K) there exists Φ ∈ Nd1 (σ; 2, 2d2 + d) (i.e.
width(Φ) = 2d^2 + d and depth(Φ) = 2) such that

    sup_{x∈K} |f(x) − Φ(x)| ≤ ε.
Proof. Without loss of generality we can assume K = [0, 1]d : the extension to the general case then
follows by Tietze’s extension theorem and a scaling argument as in the proof of Proposition 3.19.
Let fj , φi,j , i = 1, . . . , d, j = 1, . . . , 2d + 1 be as in Theorem 3.20. Fix ε > 0. Let a > 0 be so
large that
Since each fj is uniformly continuous on the compact set [−da, da], we can find δ > 0 such that
    sup_j sup_{|y−ỹ|<δ, |y|,|ỹ|≤da} |f_j(y) − f_j(ỹ)| < ε/(2(2d + 1)).    (3.2.1)
Thus with

    y_j := ∑_{i=1}^d φ_{i,j}(x_i),    ỹ_j := ∑_{i=1}^d φ̃_{i,j}(x_i),

we obtain

    |f(x) − ∑_{j=1}^{2d+1} σ(w_j ∑_{i=1}^d σ(w_{i,j} x_i + b_{i,j}) + b_j)| = |∑_{j=1}^{2d+1} (f_j(y_j) − f̃_j(ỹ_j))|
        ≤ ∑_{j=1}^{2d+1} (|f_j(y_j) − f_j(ỹ_j)| + |f_j(ỹ_j) − f̃_j(ỹ_j)|)
        ≤ ∑_{j=1}^{2d+1} (ε/(2(2d + 1)) + ε/(2(2d + 1))) ≤ ε.
Exercises
Exercise 3.22. Write down a generator of a (minimal) topology on C^0(R^d) such that f_n → f ∈ C^0(R^d) if and only if f_n → f with respect to compact convergence, and show this equivalence. This topology is referred to as the topology of compact convergence.
Exercise 3.23. Show the implication “⇒” of Theorem 3.8 and Corollary 3.17.
Exercise 3.24. Prove Lemma 3.12. Hint: Consider σ(nx) for large n ∈ N.
Exercise 3.25. Let k ∈ N, σ ∈ M and assume that σ ∗ φ ∈ Pk for all φ ∈ Cc∞ (R). Show that
σ ∈ Pk .
Hint: Consider ψ ∈ C_c^∞(R) such that ψ ≥ 0 and ∫_R ψ(x) dx = 1, and set ψ_ε(x) := ψ(x/ε)/ε.
Use that away from the discontinuities of σ it holds ψε ∗ σ(x) → σ(x) as ε → 0. Conclude that σ
is piecewise in Pk , and finally show that σ ∈ C k (R).
Exercise 3.26. Prove Corollary 3.18 with the use of Corollary 3.17.
Chapter 4
Splines
In Chapter 3, we saw that sufficiently large neural networks can approximate every continuous
function to arbitrary accuracy. However, these results did not further specify the meaning of
“sufficiently large” or what constitutes a suitable architecture. Ideally, given a function f , and a
desired accuracy ε > 0, we would like to have a (possibly sharp) bound on the required size, depth,
and width guaranteeing the existence of a neural network approximating f up to error ε.
The field of approximation theory establishes such trade-offs between properties of the function f
(e.g., its smoothness), the approximation accuracy, and the number of parameters needed to achieve
this accuracy. For example, given k, d ∈ N, how many parameters are required to approximate a
function f : [0, 1]d → R with ∥f ∥C k ([0,1]d ) ≤ 1 up to uniform error ε? Splines are known to achieve
this approximation accuracy with a superposition of O(ε−d/k ) simple (piecewise polynomial) basis
functions. In this chapter, following [146], we show that certain sigmoidal neural networks can
match this performance in terms of the neural network size. In fact, from an approximation
theoretical viewpoint we show that the considered neural networks are at least as expressive as
superpositions of splines.
By shifting and dilating the cardinal B-spline, we obtain a system of univariate splines. Taking
tensor products of these univariate splines yields a set of higher-dimensional functions known as
the multivariate B-splines.
Definition 4.2. For t ∈ R and n, ℓ ∈ N we define S_{ℓ,t,n} := S_n(2^ℓ(· − t)). Additionally, for d ∈ N, t ∈ R^d, and n, ℓ ∈ N, we define the multivariate B-spline S^d_{ℓ,t,n} as

    S^d_{ℓ,t,n}(x) := ∏_{i=1}^d S_{ℓ,t_i,n}(x_i)    for x = (x_1, ..., x_d) ∈ R^d,

and

    B^n := {S^d_{ℓ,t,n} | ℓ ∈ N, t ∈ R^d}.
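Definition 4.1 is not reproduced in this excerpt; a common closed form, consistent with the remark in the proof of the reapproximation result below that S_n is a linear combination of n + 1 shifted powers of the ReLU, is S_n(x) = (1/(n−1)!) ∑_{j=0}^n (−1)^j binom(n, j) σ_ReLU(x − j)^{n−1}. The normalization used in the following sketch is therefore an assumption.

```python
import numpy as np
from math import comb, factorial

def cardinal_bspline(x, n):
    # assumed form (valid for n >= 2):
    # S_n(x) = 1/(n-1)! * sum_{j=0}^{n} (-1)^j * C(n, j) * relu(x - j)^(n-1)
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for j in range(n + 1):
        out += (-1) ** j * comb(n, j) * np.maximum(x - j, 0.0) ** (n - 1)
    return out / factorial(n - 1)

def shifted_dilated(x, n, ell, t):
    # S_{ell, t, n}(x) = S_n(2^ell * (x - t)), cf. Definition 4.2
    return cardinal_bspline(2.0 ** ell * (np.asarray(x) - t), n)

xs = np.linspace(-1.0, 4.0, 11)
print(cardinal_bspline(xs, 2))          # hat function supported on [0, 2]
print(shifted_dilated(xs, 2, 1, 0.5))   # the same hat, shifted to 0.5 and compressed by 2
```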
Having introduced the system B n , we would like to understand how well we can represent each
smooth function by superpositions of elements of B n . The following theorem is adapted from the
more general result [168, Theorem 7]; also see [141, Theorem D.3] for a presentation closer to the
present formulation.
Theorem 4.3. Let d, n, k ∈ N such that 0 < k ≤ n. Then there exists C such that for every
f ∈ C k ([0, 1]d ) and every N ∈ N, there exist ci ∈ R with |ci | ≤ C∥f ∥L∞ ([0,1]d ) and Bi ∈ B n for
i = 1, . . . , N , such that
$$\Big\| f - \sum_{i=1}^{N} c_i B_i \Big\|_{L^\infty([0,1]^d)} \le C N^{-\frac{k}{d}}\, \|f\|_{C^k([0,1]^d)}.$$
Remark 4.4. There are a couple of critical concepts in Theorem 4.3 that will reappear throughout
this book. The number of parameters N determines the approximation accuracy N −k/d . This im-
plies that achieving accuracy ε > 0 requires O(ε−d/k ) parameters (according to this upper bound),
which grows exponentially in d. This exponential dependence on d is referred to as the “curse of
dimension” and will be discussed again in the subsequent chapters. The smoothness parameter
k has the opposite effect of d, and improves the convergence rate. Thus, smoother functions can
be approximated with fewer B-splines than rougher functions. This more efficient approximation
requires the use of B-splines of order n with n ≥ k. We will see in the following that the order of
the B-spline is closely linked to the concept of depth in neural networks.
Definition 4.5. A function σ : R → R is called sigmoidal of order q ∈ N, if σ ∈ C q−1 (R) and
there exists C > 0 such that
$$\frac{\sigma(x)}{x^q} \to 0 \ \text{ as } x \to -\infty, \qquad \frac{\sigma(x)}{x^q} \to 1 \ \text{ as } x \to \infty, \qquad |\sigma(x)| \le C \cdot (1 + |x|)^q \ \text{ for all } x \in \mathbb{R}.$$
Example 4.6. The rectified power unit x ↦ σ_ReLU(x)^q is sigmoidal of order q.
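As a quick sanity check of Definition 4.5 for this example, the following snippet (plain NumPy, helper names are ours) evaluates σ_ReLU(x)^q/x^q at large positive and negative arguments; the ratio is exactly 0 for x < 0 and exactly 1 for x > 0, so the two limits hold trivially, and the growth bound holds with C = 1.

```python
import numpy as np

def relu_power(x, q):
    """Rectified power unit sigma(x) = max(x, 0)**q from Example 4.6."""
    return np.maximum(x, 0.0) ** q

q = 3
for x in [-1e3, -1.0, 1.0, 1e3]:
    print(f"x = {x:8.1f}   sigma(x)/x^q = {relu_power(x, q) / x ** q:.1f}")
# prints 0.0, 0.0, 1.0, 1.0
```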
Our goal in the following is to show that neural networks can approximate a linear combination
of N B-splines with a number of parameters that is proportional to N . As an immediate conse-
quence of Theorem 4.3, we then obtain a convergence rate for neural networks. Let us start by
approximating a single univariate B-spline with a neural network of fixed size.
$$\big\| S_n - \Phi^{S_n} \big\|_{L^\infty([-K,K])} \le \varepsilon.$$
Proof. By definition (4.1.1), S_n is a linear combination of n + 1 shifts of σ_ReLU^{n−1}. We start by approximating σ_ReLU^{n−1}. It is not hard to see (Exercise 4.10) that, for every K′ > 0 and every t ∈ N,
$$a^{-q^t}\, \underbrace{\sigma \circ \sigma \circ \cdots \circ \sigma}_{t\ \text{times}}(ax) \;-\; \sigma_{\mathrm{ReLU}}(x)^{q^t} \;\to\; 0 \qquad \text{as } a \to \infty, \tag{4.2.1}$$
uniformly for x ∈ [−K′, K′].
This shows that we can approximate the ReLU to the power of q^t ≥ n − 1. However, our goal is to obtain an approximation of the ReLU raised to the power n − 1, which could be smaller than q^t. To reduce the order, we emulate approximate derivatives of Φ^{q^t}_ε. Concretely, we show the following claim: For all 1 ≤ p ≤ q^t, for every K′ > 0 and ε > 0, there exists a neural network Φ^p_ε having ⌈log_q(n − 1)⌉ layers and satisfying
$$\sup_{x \in [-K', K']} \big| \Phi^p_\varepsilon(x) - \sigma_{\mathrm{ReLU}}(x)^p \big| \le \varepsilon. \tag{4.2.3}$$
The claim holds for p = q^t. We now proceed by induction over p = q^t, q^t − 1, . . . Assume (4.2.3) holds for some p ∈ {2, . . . , q^t}. Fix δ ≥ 0. Then
Hence, by the binomial theorem it follows that there exists δ_* > 0 such that
for all x ∈ [−K′, K′]. By Proposition 2.3, (Φ^p_{δ_*^2}(x + δ_*) − Φ^p_{δ_*^2}(x))/(p δ_*) is a neural network with ⌈log_q(n − 1)⌉ layers and size independent of ε. Calling this neural network Φ^{p−1}_ε shows that (4.2.3) holds for p − 1, which concludes the induction argument and proves the claim.
For every neural network Φ, every spatial translation Φ(· − t) is a neural network of the same
architecture. Hence, every term in the sum (4.1.1) can be approximated to arbitrary accuracy by
a neural network of a fixed size. Since by Proposition 2.3, sums of neural networks of the same
depth are again neural networks of the same depth, the result follows.
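The limit (4.2.1) can be illustrated numerically. In the following sketch, the activation σ(x) = x^q/(1 + e^{−x}) is our own choice of an order-q sigmoidal function (it satisfies Definition 4.5); it is not an activation used in the text, and the scaling a^{−q^t} follows (4.2.1).

```python
import numpy as np

def sigmoidal_q(x, q):
    """An order-q sigmoidal activation (illustrative choice): x^q * logistic(x)."""
    return x ** q / (1.0 + np.exp(-x))

def composed_scaled(x, q, t, a):
    """a^(-q^t) * sigma(...sigma(a x)...) with t-fold composition, cf. (4.2.1)."""
    y = a * x
    for _ in range(t):
        y = sigmoidal_q(y, q)
    return y / a ** (q ** t)

q, t = 2, 2
x = np.linspace(-2.0, 2.0, 401)
target = np.maximum(x, 0.0) ** (q ** t)      # sigma_ReLU(x)^(q^t)
for a in [4.0, 16.0, 64.0]:
    err = np.max(np.abs(composed_scaled(x, q, t, a) - target))
    print(f"a = {a:5.0f}   sup error on [-2, 2]: {err:.3e}")
```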
Next, we extend Proposition 4.7 to the multivariate splines S^d_{ℓ,t,n} for arbitrary ℓ, d ∈ N, t ∈ R^d.
Proof. By definition, S^d_{ℓ,t,n}(x) = ∏_{i=1}^{d} S_{ℓ,t_i,n}(x_i), where S_{ℓ,t_i,n} = S_n(2^ℓ(· − t_i)).
By Proposition 4.7 there exists a constant C′ > 0 such that for each i = 1, . . . , d and all ε > 0, there is a neural network Φ^{S_{ℓ,t_i,n}} with size C′ and ⌈log_q(n − 1)⌉ layers such that
If d = 1, this shows the statement. For general d, it remains to show that the product of the Φ^{S_{ℓ,t_i,n}} for i = 1, . . . , d can be approximated.
We first prove the following claim by induction: For every d ∈ N, d ≥ 2, there exists a constant
C ′′ > 0, such that for all K ′ ≥ 1 and all ε > 0 there exists a neural network Φmult,ε,d with size
C ′′ , ⌈log2 (d)⌉ layers, and activation function σ such that for all x1 , . . . , xd with |xi | ≤ K ′ for all
i = 1, . . . , d,
$$\Big| \Phi^{\mathrm{mult},\varepsilon,d}(x_1, \dots, x_d) - \prod_{i=1}^{d} x_i \Big| < \varepsilon. \tag{4.2.4}$$
For the base case, let d = 2. Similar to the proof of Proposition 4.7, one can show that there exists
C ′′′ > 0 such that for every ε > 0 and K ′ > 0 there exists a neural network Φsquare,ε with one
hidden layer and size C ′′′ such that
Each term on the right-hand side can be approximated up to uniform error ε/6 with a network of
size C ′′′ and one hidden layer. By Proposition 2.3, we conclude that there exists a neural network
Φmult,ε,2 satisfying (4.2.4) for d = 2.
Assume the induction hypothesis (4.2.4) holds for d − 1 ≥ 1, and let ε > 0 and K ′ ≥ 1. We
have
$$\prod_{i=1}^{d} x_i = \prod_{i=1}^{\lfloor d/2 \rfloor} x_i \cdot \prod_{i=\lfloor d/2 \rfloor + 1}^{d} x_i. \tag{4.2.6}$$
We will now approximate each of the terms in the product on the right-hand side of (4.2.6) by a
neural network using the induction assumption.
For simplicity assume in the following that ⌈log2 (⌊d/2⌋)⌉ = ⌈log2 (d − ⌊d/2⌋)⌉. The general
case can be addressed via Proposition 3.16. By the induction assumption there then exist neural
networks Φmult,1 and Φmult,2 both with ⌈log2 (⌊d/2⌋)⌉ layers, such that for all xi with |xi | ≤ K ′ for
i = 1, . . . , d
$$\Big| \Phi^{\mathrm{mult},1}(x_1, \dots, x_{\lfloor d/2 \rfloor}) - \prod_{i=1}^{\lfloor d/2 \rfloor} x_i \Big| < \frac{\varepsilon}{4\big((K')^{\lfloor d/2 \rfloor} + \varepsilon\big)}, \qquad \Big| \Phi^{\mathrm{mult},2}(x_{\lfloor d/2 \rfloor + 1}, \dots, x_d) - \prod_{i=\lfloor d/2 \rfloor + 1}^{d} x_i \Big| < \frac{\varepsilon}{4\big((K')^{\lfloor d/2 \rfloor} + \varepsilon\big)}.$$
By Proposition 2.3, Φmult,ε,d := Φmult,ε/2,2 ◦(Φmult,1 , Φmult,2 ) is a neural network with 1+⌈log2 (⌊d/2⌋)⌉ =
⌈log2 (d)⌉ layers. By construction, the size of Φmult,ε,d does not depend on K ′ or ε. Thus, to complete
the induction, it only remains to show (4.2.4).
For all a, b, c, d ∈ R holds
Hence, for x_1, . . . , x_d with |x_i| ≤ K′ for all i = 1, . . . , d, we have that
$$\Big| \prod_{i=1}^{d} x_i - \Phi^{\mathrm{mult},\varepsilon,d}(x_1, \dots, x_d) \Big| \le \frac{\varepsilon}{2} + \Big| \prod_{i=1}^{\lfloor d/2 \rfloor} x_i \cdot \prod_{i=\lfloor d/2 \rfloor + 1}^{d} x_i - \Phi^{\mathrm{mult},1}(x_1, \dots, x_{\lfloor d/2 \rfloor})\, \Phi^{\mathrm{mult},2}(x_{\lfloor d/2 \rfloor + 1}, \dots, x_d) \Big|$$
$$\le \frac{\varepsilon}{2} + |K'|^{\lfloor d/2 \rfloor}\, \frac{\varepsilon}{4\big((K')^{\lfloor d/2 \rfloor} + \varepsilon\big)} + \big(|K'|^{\lceil d/2 \rceil} + \varepsilon\big)\, \frac{\varepsilon}{4\big((K')^{\lfloor d/2 \rfloor} + \varepsilon\big)} < \varepsilon.$$
This completes the proof of (4.2.4).
The overall result follows by using Proposition 2.3 to show that the multiplication network can
be composed with a neural network comprised of the ΦSℓ,ti ,n for i = 1, . . . , d. Since in no step above
the size of the individual networks was dependent on the approximation accuracy, this is also true
for the final network.
Proposition 4.8 shows that we can approximate a single multivariate B-spline with a neural
network with a size that is independent of the accuracy. Combining this observation with Theorem
4.3 leads to the following result.
Theorem 4.9. Let d, n, k ∈ N such that 0 < k ≤ n and n ≥ 2. Let q ≥ 2, and let σ be sigmoidal
of order q.
Then there exists C such that for every f ∈ C k ([0, 1]d ) and every N ∈ N there exists a neural
network ΦN with activation function σ, ⌈log2 (d)⌉ + ⌈logq (k − 1)⌉ layers, and size bounded by CN ,
such that
$$\big\| f - \Phi^N \big\|_{L^\infty([0,1]^d)} \le C N^{-\frac{k}{d}}\, \|f\|_{C^k([0,1]^d)}.$$
Proof. Fix N ∈ N. By Theorem 4.3, there exist coefficients |ci | ≤ C∥f ∥L∞ ([0,1]d ) and Bi ∈ B n for
i = 1, . . . , N , such that
$$\Big\| f - \sum_{i=1}^{N} c_i B_i \Big\|_{L^\infty([0,1]^d)} \le C N^{-\frac{k}{d}}\, \|f\|_{C^k([0,1]^d)}.$$
Moreover, by Proposition 4.8, for each i = 1, . . . , N there exists a neural network Φ^{B_i} with ⌈log_2(d)⌉ + ⌈log_q(k − 1)⌉ layers and a fixed size, which approximates B_i on [−1, 1]^d ⊇ [0, 1]^d up to an error of ε := N^{−k/d}/N. The size of Φ^{B_i} is independent of i and N.
By Proposition 2.3, there exists a neural network Φ^N that uniformly approximates Σ_{i=1}^N c_i B_i up to error ε on [0, 1]^d, and has ⌈log_2(d)⌉ + ⌈log_q(k − 1)⌉ layers. The size of this network is linear in N (see Exercise 4.11). This concludes the proof.
Theorem 4.9 shows that neural networks with higher-order sigmoidal functions can approximate
smooth functions with the same accuracy as spline approximations while having a comparable
number of parameters. The network depth is required to behave like O(log(k)) in terms of the
smoothness parameter k, cp. Remark 4.4.
Bibliography and further reading
The argument of linking sigmoidal activation functions with spline based approximation was first
introduced in [146, 144]. For further details on spline approximation, see [168] or the book [207].
The general strategy of approximating basis functions by neural networks, and then lifting approximation results for those bases, has been employed widely in the literature and will reappear in this book. While the following chapters primarily focus on ReLU activation, we highlight
a few notable approaches with non-ReLU activations based on the outlined strategy: To approx-
imate analytic functions, [145] emulates a monomial basis. To approximate periodic functions, a
basis of trigonometric polynomials is recreated in [147]. Wavelet bases have been emulated in [171].
Moreover, neural networks have been studied through the representation system of ridgelets [30]
and ridge functions [103]. A general framework describing the emulation of representation systems
to transfer approximation results was presented in [21].
Exercises
Exercise 4.10. Show that (4.2.1) holds.
Exercise 4.11. Let L ∈ N, σ : R → R, and let Φ_1, Φ_2 be two neural networks with architectures (σ; d_0, d_1^{(1)}, . . . , d_L^{(1)}, d_{L+1}) and (σ; d_0, d_1^{(2)}, . . . , d_L^{(2)}, d_{L+1}). Show that Φ_1 + Φ_2 is a neural network with size(Φ_1 + Φ_2) ≤ size(Φ_1) + size(Φ_2).
Exercise 4.12. Show that, for σ = σ_ReLU^2 and k ≤ 2, for all f ∈ C^k([0, 1]^d) all weights of the approximating neural network of Theorem 4.9 can be bounded in absolute value by O(max{2, ∥f∥_{C^k([0,1]^d)}}).
Chapter 5
In this chapter, we discuss feedforward neural networks using the ReLU activation function σReLU
introduced in Section 2.3. We refer to these functions as ReLU neural networks. Due to its simplicity
and the fact that it reduces the vanishing and exploding gradients phenomena, the ReLU is one of
the most widely used activation functions in practice.
A key component of the proofs in the previous chapters was the approximation of derivatives of
the activation function to emulate polynomials. Since the ReLU is piecewise linear, this trick is not
applicable. This makes the analysis fundamentally different from the case of smoother activation
functions. Nonetheless, we will see that even this extremely simple activation function yields a very
rich class of functions possessing remarkable approximation capabilities.
To formalize these results, we begin this chapter by adopting a framework from [174]. This
framework enables the tracking of the number of network parameters for basic manipulations such
as adding up or composing two neural networks. This will allow us to bound the network complexity when constructing more elaborate networks from simpler ones. With these preliminaries at hand,
the rest of the chapter is dedicated to the exploration of links between ReLU neural networks and
the class of “continuous piecewise linear functions.” In Section 5.2, we will see that every such
function can be exactly represented by a ReLU neural network. Afterwards, in Section 5.3 we will
give a more detailed analysis of the required network complexity. Finally, we will use these results
to prove a first approximation theorem for ReLU neural networks in Section 5.4. The argument is
similar in spirit to Chapter 4, in that we transfer established approximation theory for piecewise
linear functions to the class of ReLU neural networks of a certain architecture.
• Reproducing an identity: We have seen in Proposition 3.16 that for most activation functions,
an approximation to the identity can be built by neural networks. For ReLUs, we can have
an even stronger result and reproduce the identity exactly. This identity will play a crucial
role in order to extend certain neural networks to deeper neural networks, and to facilitate
an efficient composition operation.
• Composition: We saw in Proposition 2.3 that we can produce a composition of two neural
networks and the resulting function is a neural network as well. There we did not study the
size of the resulting neural networks. For ReLU activation functions, this composition can be done very efficiently, leading to a neural network whose number of weights is, up to a constant, bounded by the total number of weights of the two initial neural networks.
• Parallelization: The parallelization of two neural networks was also discussed in Proposition 2.3. We will refine this notion and make precise the size of the resulting neural networks.
• Linear combinations: Similarly, for the sum of two neural networks, we will give precise
bounds on the size of the resulting neural network.
5.1.1 Identity
We start with expressing the identity on Rd as a neural network of depth L ∈ N.
Lemma 5.1 (Identity). Let L ∈ N. Then there exists a ReLU neural network Φ^id_L such that Φ^id_L(x) = x for all x ∈ R^d. Moreover, depth(Φ^id_L) = L, width(Φ^id_L) = 2d, and size(Φ^id_L) = 2d · (L + 1).
Proof. Writing I_d ∈ R^{d×d} for the identity matrix, we choose the weights
$$\big(W^{(0)}, b^{(0)}\big), \dots, \big(W^{(L)}, b^{(L)}\big) := \left( \begin{pmatrix} I_d \\ -I_d \end{pmatrix}, 0 \right), \underbrace{(I_{2d}, 0), \dots, (I_{2d}, 0)}_{L-1\ \text{times}}, \big( (I_d, -I_d), 0 \big).$$
Using that x = σ_ReLU(x) − σ_ReLU(−x) for all x ∈ R and σ_ReLU(x) = x for all x ≥ 0, it is obvious that the neural network Φ^id_L associated to the weights above satisfies the assertion of the lemma.
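The construction is easy to verify numerically. A minimal NumPy sketch (all biases are zero and are therefore omitted; helper names are ours, not the book's notation):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def identity_network_weights(d, L):
    """Weight matrices of Phi^id_L from Lemma 5.1: x = ReLU(x) - ReLU(-x)."""
    W0 = np.vstack([np.eye(d), -np.eye(d)])          # lift to the 2d coordinates (x, -x)
    hidden = [np.eye(2 * d) for _ in range(L - 1)]   # pass the nonnegative parts through
    WL = np.hstack([np.eye(d), -np.eye(d)])          # recombine: ReLU(x) - ReLU(-x) = x
    return [W0, *hidden, WL]

def apply_network(weights, x):
    """Evaluate a bias-free ReLU network; the activation acts after every layer but the last."""
    for W in weights[:-1]:
        x = relu(W @ x)
    return weights[-1] @ x

d, L = 3, 4
x = np.array([0.5, -2.0, 1.5])
assert np.allclose(apply_network(identity_network_weights(d, L), x), x)
```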
We will see in Exercise 5.23 that the property to exactly represent the identity is not shared by
sigmoidal activation functions. It does hold for polynomial activation functions, though.
5.1.2 Composition
Assume we have two neural networks Φ_1, Φ_2 with architectures (σ_ReLU; d_0^1, . . . , d_{L_1+1}^1) and (σ_ReLU; d_0^2, . . . , d_{L_2+1}^2), respectively. Moreover, we assume that they have weights and biases given by
$$\big(W_1^{(0)}, b_1^{(0)}\big), \dots, \big(W_1^{(L_1)}, b_1^{(L_1)}\big) \qquad \text{and} \qquad \big(W_2^{(0)}, b_2^{(0)}\big), \dots, \big(W_2^{(L_2)}, b_2^{(L_2)}\big),$$
respectively. If the output dimension d_{L_1+1}^1 of Φ_1 equals the input dimension d_0^2 of Φ_2, we can define two types of concatenations: First, Φ_2 ◦ Φ_1 is the neural network with weights and biases given by
$$\big(W_1^{(0)}, b_1^{(0)}\big), \dots, \big(W_1^{(L_1-1)}, b_1^{(L_1-1)}\big), \big(W_2^{(0)} W_1^{(L_1)},\ W_2^{(0)} b_1^{(L_1)} + b_2^{(0)}\big), \big(W_2^{(1)}, b_2^{(1)}\big), \dots, \big(W_2^{(L_2)}, b_2^{(L_2)}\big).$$
Second, Φ_2 • Φ_1 is the neural network defined as Φ_2 ◦ Φ^id_1 ◦ Φ_1. In terms of weights and biases, Φ_2 • Φ_1 is given as
$$\big(W_1^{(0)}, b_1^{(0)}\big), \dots, \big(W_1^{(L_1-1)}, b_1^{(L_1-1)}\big), \left( \begin{pmatrix} W_1^{(L_1)} \\ -W_1^{(L_1)} \end{pmatrix}, \begin{pmatrix} b_1^{(L_1)} \\ -b_1^{(L_1)} \end{pmatrix} \right), \Big( \big(W_2^{(0)}, -W_2^{(0)}\big),\ b_2^{(0)} \Big), \big(W_2^{(1)}, b_2^{(1)}\big), \dots, \big(W_2^{(L_2)}, b_2^{(L_2)}\big).$$
Lemma 5.2 (Composition). Let Φ_1, Φ_2 be neural networks with architectures (σ_ReLU; d_0^1, . . . , d_{L_1+1}^1) and (σ_ReLU; d_0^2, . . . , d_{L_2+1}^2). Assume d_{L_1+1}^1 = d_0^2. Then Φ_2 ◦ Φ_1(x) = Φ_2 • Φ_1(x) = Φ_2(Φ_1(x)) for all x ∈ R^{d_0^1}. Moreover,
and
Proof. The fact that Φ_2 ◦ Φ_1(x) = Φ_2 • Φ_1(x) = Φ_2(Φ_1(x)) for all x ∈ R^{d_0^1} follows immediately from the construction. The same can be said for the width and depth bounds. To confirm the size bound, we note that W_2^{(0)} W_1^{(L_1)} ∈ R^{d_1^2 × d_{L_1}^1} and hence W_2^{(0)} W_1^{(L_1)} has not more than d_1^2 · d_{L_1}^1 (nonzero) entries. Moreover, W_2^{(0)} b_1^{(L_1)} + b_2^{(0)} ∈ R^{d_1^2}. Thus, the L_1-th layer of Φ_2 ◦ Φ_1 has at most d_1^2 · (1 + d_{L_1}^1) entries. The rest is obvious from the construction.
Interpreting linear transformations as neural networks of depth 0, the previous lemma is also
valid in case Φ1 or Φ2 is a linear mapping.
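The ◦-concatenation merges the last affine map of Φ_1 with the first affine map of Φ_2. The following sketch assumes NumPy and the convention used here (the activation acts after every layer except the last); names are ours, not the book's notation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def apply(net, x):
    """Evaluate a ReLU network given as a list of (W, b) tuples."""
    *hidden, (W_out, b_out) = net
    for W, b in hidden:
        x = relu(W @ x + b)
    return W_out @ x + b_out

def compose(net2, net1):
    """Phi_2 o Phi_1: merge the last affine map of Phi_1 with the first of Phi_2."""
    W1L, b1L = net1[-1]
    W20, b20 = net2[0]
    merged = (W20 @ W1L, W20 @ b1L + b20)
    return net1[:-1] + [merged] + net2[1:]

rng = np.random.default_rng(0)
def random_net(dims):
    return [(rng.standard_normal((m, n)), rng.standard_normal(m))
            for n, m in zip(dims[:-1], dims[1:])]

net1, net2 = random_net([3, 5, 2]), random_net([2, 4, 1])
x = rng.standard_normal(3)
assert np.allclose(apply(compose(net2, net1), x), apply(net2, apply(net1, x)))
```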
5.1.3 Parallelization
Let (Φ_i)_{i=1}^m be neural networks with architectures (σ_ReLU; d_0^i, . . . , d_{L_i+1}^i), respectively. We proceed to build a neural network (Φ_1, . . . , Φ_m) realizing the function
$$(\Phi_1, \dots, \Phi_m) : \mathbb{R}^{\sum_{j=1}^{m} d_0^j} \to \mathbb{R}^{\sum_{j=1}^{m} d_{L_j+1}^j}, \qquad (x_1, \dots, x_m) \mapsto (\Phi_1(x_1), \dots, \Phi_m(x_m)). \tag{5.1.1}$$
To do so we first assume L_1 = · · · = L_m = L, and define (Φ_1, . . . , Φ_m) via the following sequence of weight-bias tuples:
$$\left( \begin{pmatrix} W_1^{(0)} & & \\ & \ddots & \\ & & W_m^{(0)} \end{pmatrix}, \begin{pmatrix} b_1^{(0)} \\ \vdots \\ b_m^{(0)} \end{pmatrix} \right), \dots, \left( \begin{pmatrix} W_1^{(L)} & & \\ & \ddots & \\ & & W_m^{(L)} \end{pmatrix}, \begin{pmatrix} b_1^{(L)} \\ \vdots \\ b_m^{(L)} \end{pmatrix} \right), \tag{5.1.2}$$
where these matrices are understood as block-diagonal, filled up with zeros. For the general case where the Φ_j might have different depths, let L_max := max_{1≤i≤m} L_i and I := {1 ≤ i ≤ m | L_i < L_max}. For j ∈ I^c set Φ̃_j := Φ_j, and for each j ∈ I
$$\tilde\Phi_j := \Phi^{\mathrm{id}}_{L_{\max} - L_j} \circ \Phi_j. \tag{5.1.3}$$
Finally,
$$(\Phi_1, \dots, \Phi_m) := (\tilde\Phi_1, \dots, \tilde\Phi_m). \tag{5.1.4}$$
Lemma 5.3 (Parallelization). Let m ∈ N and let (Φ_i)_{i=1}^m be neural networks with architectures (σ_ReLU; d_0^i, . . . , d_{L_i+1}^i), respectively. Then the neural network (Φ_1, . . . , Φ_m) satisfies
$$(\Phi_1, \dots, \Phi_m)(x) = (\Phi_1(x_1), \dots, \Phi_m(x_m)) \qquad \text{for all } x = (x_1, \dots, x_m) \in \mathbb{R}^{\sum_{j=1}^{m} d_0^j}.$$
Proof. All statements except for the bound on the size follow immediately from the construction. To obtain the bound on the size, we note that by construction the sizes of the (Φ̃_i)_{i=1}^m in (5.1.3) are simply added. The size of each Φ̃_i can be bounded with Lemma 5.2.
In terms of the construction (5.1.2), the only required change is that the block-diagonal matrix diag(W_1^{(0)}, . . . , W_m^{(0)}) becomes the matrix in R^{(\sum_{j=1}^m d_1^j) × d_0^1} which stacks W_1^{(0)}, . . . , W_m^{(0)} on top of each other. Similarly, we will allow Φ_j to only take some of the entries of x as input. For parallelization with shared inputs we will use the same notation (Φ_j)_{j=1}^m as before, where the precise meaning will always be clear from context. Note that Lemma 5.3 remains valid in this case.
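For networks of equal depth, (5.1.2) amounts to building block-diagonal weight matrices and stacking the biases. A minimal self-contained NumPy sketch (repeating the small helpers from the previous sketch; names are ours):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def apply(net, x):
    *hidden, (W, b) = net
    for Wh, bh in hidden:
        x = relu(Wh @ x + bh)
    return W @ x + b

def parallelize(nets):
    """Equal-depth parallelization (5.1.2): block-diagonal weights, stacked biases."""
    layers = []
    for layer in zip(*nets):
        Ws, bs = zip(*layer)
        rows, cols = sum(W.shape[0] for W in Ws), sum(W.shape[1] for W in Ws)
        Wbig, r, c = np.zeros((rows, cols)), 0, 0
        for W in Ws:                     # place each W_j on the block diagonal
            Wbig[r:r + W.shape[0], c:c + W.shape[1]] = W
            r, c = r + W.shape[0], c + W.shape[1]
        layers.append((Wbig, np.concatenate(bs)))
    return layers

rng = np.random.default_rng(1)
mk = lambda dims: [(rng.standard_normal((m, n)), rng.standard_normal(m))
                   for n, m in zip(dims[:-1], dims[1:])]
net1, net2 = mk([2, 4, 3]), mk([1, 5, 2])
x1, x2 = rng.standard_normal(2), rng.standard_normal(1)
out = apply(parallelize([net1, net2]), np.concatenate([x1, x2]))
assert np.allclose(out, np.concatenate([apply(net1, x1), apply(net2, x2)]))
```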
5.1.4 Linear combinations
Let m ∈ N and let (Φ_i)_{i=1}^m be ReLU neural networks that have architectures (σ_ReLU; d_0^i, . . . , d_{L_i+1}^i), respectively. This corresponds to the parallelization (Φ_1, . . . , Φ_m) composed with the linear transformation (z_1, . . . , z_m) ↦ Σ_{j=1}^m α_j z_j. The following result holds.
Lemma 5.4 (Linear combinations). Let m ∈ N and let (Φ_i)_{i=1}^m be neural networks with architectures (σ_ReLU; d_0^i, . . . , d_{L_i+1}^i), respectively. Assume that d_{L_1+1}^1 = · · · = d_{L_m+1}^m, let α ∈ R^m, and set L_max := max_{j≤m} L_j. Then there exists a neural network Σ_{j=1}^m α_j Φ_j such that (Σ_{j=1}^m α_j Φ_j)(x) = Σ_{j=1}^m α_j Φ_j(x_j) for all x = (x_j)_{j=1}^m ∈ R^{\sum_{j=1}^m d_0^j}. Moreover,
$$\mathrm{width}\Big(\sum_{j=1}^{m} \alpha_j \Phi_j\Big) \le 2 \sum_{j=1}^{m} \mathrm{width}(\Phi_j), \tag{5.1.6a}$$
$$\mathrm{depth}\Big(\sum_{j=1}^{m} \alpha_j \Phi_j\Big) = \max_{j \le m} \mathrm{depth}(\Phi_j), \tag{5.1.6b}$$
$$\mathrm{size}\Big(\sum_{j=1}^{m} \alpha_j \Phi_j\Big) \le 2 \sum_{j=1}^{m} \mathrm{size}(\Phi_j) + 2 \sum_{j=1}^{m} (L_{\max} - L_j)\, d_{L_j+1}^j. \tag{5.1.6c}$$
For general depths, we define the sum of the neural networks to be the sum of the extended neural networks Φ̃_i as in (5.1.3). All statements of the lemma follow immediately from this construction.
In case d_0^1 = · · · = d_0^m =: d_0 (all neural networks have the same input dimension), we will also consider linear combinations with shared inputs, i.e., a neural network realizing
$$x \mapsto \sum_{j=1}^{m} \alpha_j \Phi_j(x) \qquad \text{for } x \in \mathbb{R}^{d_0}.$$
This requires the same minor adjustment as discussed at the end of Section 5.1.3. Lemma 5.4
remains valid in this case and again we do not distinguish in notation for linear combinations with
or without shared inputs.
Remark 5.6. A “continuous piecewise linear function” as in Definition 5.5 is actually piecewise
affine. To maintain consistency with the literature, we use the terminology cpwl.
In the following, we will refer to the connected domains on which f is equal to one of the
functions gj , also as regions or pieces. If f is cpwl with q ∈ N regions, then with n ∈ N denoting
the number of affine functions it holds n ≤ q.
Note that the mapping x ↦ σ_ReLU(w^⊤x + b), which is a ReLU neural network with a single neuron, is cpwl (with two regions). Consequently, every ReLU neural network is a repeated composition of linear combinations of cpwl functions. It is not hard to see that the set of cpwl functions
is closed under compositions and linear combinations. Hence, every ReLU neural network is a cpwl
function. Interestingly, the reverse direction of this statement is also true, meaning that every cpwl
function can be represented by a ReLU neural network as we shall demonstrate below. Therefore,
we can identify the class of functions realized by arbitrary ReLU neural networks as the class of
cpwl functions.
A statement similar to Theorem 5.7 can be found in [4, 85]. There, the authors give a construction with a depth that behaves logarithmically in d and is independent of n, but with significantly larger bounds on the size. As we shall see, the proof of Theorem 5.7 is a simple consequence of the
following well-known result from [225]; also see [169, 237]. It states that every cpwl function can
be expressed as a finite maximum of a finite minimum of certain affine functions.
Proposition 5.8. Let d ∈ N, Ω ⊆ Rd be convex, and let f : Ω → R be cpwl with n ∈ N affine
functions as in Definition 5.5. Then there exists m ∈ N and sets sj ⊆ {1, . . . , n} for j ∈ {1, . . . , m},
such that
Proof. Step 1. We start with d = 1, i.e., Ω ⊆ R is a (possibly unbounded) interval and for each
x ∈ Ω there exists j ∈ {1, . . . , n} such that with gj (x) := wj x + bj it holds that f (x) = gj (x).
Without loss of generality, we can assume that gi ̸= gj for all i ̸= j. Since the graphs of the gj are
lines, they intersect at (at most) finitely many points in Ω.
Since f is continuous, we conclude that there exist finitely many intervals covering Ω, such that
f coincides with one of the gj on each interval. For each x ∈ Ω let
sx := {1 ≤ j ≤ n | gj (x) ≥ f (x)}
and
Since there exist only finitely many possibilities to choose a subset of {1, . . . , n}, we conclude that
(5.2.1) holds for d = 1.
It remains to verify the claim (5.2.2). Fix y ̸= x ∈ Ω. Without loss of generality, let x < y
and let x = x0 < · · · < xk = y be such that f |[xi−1 ,xi ] equals some gj for each i ∈ {1, . . . , k}. In
order to show (5.2.2), it suffices to prove that there exists at least one j such that gj (x0 ) ≥ f (x0 )
and gj (xk ) ≤ f (xk ). The claim is trivial for k = 1. We proceed by induction. Suppose the
claim holds for k − 1, and consider the partition x0 < · · · < xk . Let r ∈ {1, . . . , n} be such
that f |[x0 ,x1 ] = gr |[x0 ,x1 ] . Applying the induction hypothesis to the interval [x1 , xk ], we can find
j ∈ {1, . . . , n} such that gj (x1 ) ≥ f (x1 ) and gj (xk ) ≤ f (xk ). If gj (x0 ) ≥ f (x0 ), then gj is the desired
function. Otherwise, gj (x0 ) < f (x0 ). Then gr (x0 ) = f (x0 ) > gj (x0 ) and gr (x1 ) = f (x1 ) ≤ gj (x1 ).
Therefore gr (x) ≤ gj (x) for all x ≥ x1 , and in particular gr (xk ) ≤ gj (xk ). Thus gr is the desired
function.
Step 2. For general d ∈ N, let gj (x) := w⊤ j x + bj for j = 1, . . . , n. For each x ∈ Ω, let
sx := {1 ≤ j ≤ n | gj (x) ≥ f (x)}
and for all y ∈ Ω, let
For an arbitrary 1-dimensional affine subspace S ⊆ Rd passing through x consider the line
(segment) I := S ∩ Ω, which is connected since Ω is convex. By Step 1, it holds
on all of I. Since I was arbitrary the formula is valid for all y ∈ Ω. This again implies (5.2.1) as
in Step 1.
Remark 5.9. Using min(a, b) = −max(−a, −b), there exist m̃ ∈ N and sets s̃_j ⊆ {1, . . . , n} for j = 1, . . . , m̃, such that for all x ∈ R^d
$$= \max_{1 \le j \le \tilde m} \Big( -\max_{i \in \tilde s_j} \big( -w_i^\top x - b_i \big) \Big) = \max_{1 \le j \le \tilde m} \Big( \min_{i \in \tilde s_j} \big( w_i^\top x + b_i \big) \Big).$$
To prove Theorem 5.7, it therefore suffices to show that the minimum and the maximum are
expressible by ReLU neural networks.
and
Proof. We have
$$\max\{x, y\} = y + \begin{cases} 0 & \text{if } y > x \\ x - y & \text{if } x \ge y \end{cases} \;=\; y + \sigma_{\mathrm{ReLU}}(x - y).$$
Using y = σ_ReLU(y) − σ_ReLU(−y), the claim for the maximum follows. For the minimum observe that min{x, y} = −max{−x, −y}.
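The identities in the proof translate directly into a one-hidden-layer ReLU network with three neurons. A minimal NumPy sketch (names are ours):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def max2(x, y):
    """max{x, y} = ReLU(y) - ReLU(-y) + ReLU(x - y): one hidden layer, 3 neurons."""
    return relu(y) - relu(-y) + relu(x - y)

def min2(x, y):
    """min{x, y} = -max{-x, -y}."""
    return -max2(-x, -y)

for x, y in [(1.0, -2.0), (-3.0, 0.5), (2.0, 2.0)]:
    assert np.isclose(max2(x, y), max(x, y)) and np.isclose(min2(x, y), min(x, y))
```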
Figure 5.1: Sketch of the neural network in Lemma 5.10 with inputs x, y and output min{x, y}. Only edges with non-zero weights are drawn.
$$\mathrm{size}(\Phi^{\min}_n) \le 16n, \qquad \mathrm{width}(\Phi^{\min}_n) \le 3n, \qquad \mathrm{depth}(\Phi^{\min}_n) \le \lceil \log_2(n) \rceil.$$
Proof. Throughout, denote by Φ^min_2 : R² → R the neural network from Lemma 5.10. It is of depth 1 and size 7 (since all biases are zero, it suffices to count the number of connections in Figure 5.1).
Step 1. Consider first the case where n = 2^k for some k ∈ N. We proceed by induction over k. For k = 1 the claim is proven. For k ≥ 2 set
$$\Phi^{\min}_{2^k} := \Phi^{\min}_2 \circ \big( \Phi^{\min}_{2^{k-1}},\ \Phi^{\min}_{2^{k-1}} \big). \tag{5.2.3}$$
Next, we bound the size of the neural network. Note that all biases in this neural network are set to 0, since the Φ^min_2 neural network in Lemma 5.10 has no biases. Thus, the size of the neural network Φ^min_{2^k} corresponds to the number of connections in the graph (the number of nonzero weights). Careful inspection of the neural network architecture, see Figure 5.2, reveals that
$$\mathrm{size}(\Phi^{\min}_{2^k}) = 4 \cdot 2^{k-1} + 12 \cdot \sum_{j=0}^{k-2} 2^{j} + 3 = 2n + 12 \cdot (2^{k-1} - 1) + 3 = 2n + 6n - 9 \le 8n,$$
Let
$$\Phi^{\min}_1(x) := x \qquad \text{for all } x \in \mathbb{R}$$
be the identity on R, i.e., a linear transformation and thus formally a depth-0 neural network. Then, for all n ≥ 2,
$$\Phi^{\min}_n := \Phi^{\min}_2 \circ \begin{cases} \big( \Phi^{\mathrm{id}}_1 \circ \Phi^{\min}_{\lfloor n/2 \rfloor},\ \Phi^{\min}_{\lceil n/2 \rceil} \big) & \text{if } n \in \{2^k + 1 \mid k \in \mathbb{N}\}, \\ \big( \Phi^{\min}_{\lfloor n/2 \rfloor},\ \Phi^{\min}_{\lceil n/2 \rceil} \big) & \text{otherwise}. \end{cases} \tag{5.2.4}$$
This definition extends (5.2.3) to arbitrary n ≥ 2, since the first case in (5.2.4) never occurs if n ≥ 2
is a power of two.
To analyze (5.2.4), we start with the depth and claim that
$$\mathrm{depth}(\Phi^{\min}_n) = k \qquad \text{for all } 2^{k-1} < n \le 2^k$$
and all k ∈ N. We proceed by induction over k. The case k = 1 is clear. For the induction step, assume the statement holds for some fixed k ∈ N and fix an integer n with 2^k < n ≤ 2^{k+1}. Then
$$\Big\lceil \frac{n}{2} \Big\rceil \in (2^{k-1}, 2^k] \cap \mathbb{N}$$
and
$$\Big\lfloor \frac{n}{2} \Big\rfloor \in \begin{cases} \{2^{k-1}\} & \text{if } n = 2^k + 1, \\ (2^{k-1}, 2^k] \cap \mathbb{N} & \text{otherwise}. \end{cases}$$
Using the induction assumption, (5.2.4), and Lemmas 5.1 and 5.2, this shows
$$\mathrm{depth}(\Phi^{\min}_n) = \mathrm{depth}(\Phi^{\min}_2) + k = 1 + k,$$
$$\Phi^{\max}_n(x_1, \dots, x_n) := -\Phi^{\min}_n(-x_1, \dots, -x_n).$$
and
$$\mathrm{width}(\Phi) \le 2 \max\Big\{ \mathrm{width}(\Phi^{\max}_m),\ \sum_{j=1}^{m} \mathrm{width}(\Phi^{\min}_{|s_j|}),\ \sum_{j=1}^{m} \mathrm{width}\big( (w_i^\top x + b_i)_{i \in s_j} \big) \Big\} \le 2 \max\{3m,\ 3mn,\ mdn\} = O(d n 2^n)$$
Figure 5.2: Construction of Φ^min_{2^k} (shown for 2^k = 8 inputs x_1, . . . , x_8 and output min{x_1, . . . , x_8}) as a binary tree of Φ^min_2 blocks; the numbers of connections between consecutive layers are 2^{k−1}·4, 2^{k−2}·12, 2^{k−3}·12, . . . , 3. For general n (e.g., n = 5, 6), identity networks Φ^id_1 are inserted to equalize depths as in (5.2.4).
and
$$\mathrm{size}(\Phi) \le 4\Big( \mathrm{size}(\Phi^{\max}_m) + \mathrm{size}\big( (\Phi^{\min}_{|s_j|})_{j=1}^m \big) + \mathrm{size}\big( ((w_i^\top x + b_i)_{i \in s_j})_{j=1}^m \big) \Big) \le 4\Big( 16m + 2 \sum_{j=1}^{m} \big( 16|s_j| + 2\lceil \log_2(n) \rceil \big) + nm(d+1) \Big) = O(d n 2^n).$$
5.3.1 Triangulations of Ω
For the ensuing discussion, we will consider Ω ⊆ Rd to be partitioned into simplices. This parti-
tioning will be termed a triangulation of Ω. Other notions prevalent in the literature include a
tessellation of Ω, or a simplicial mesh on Ω. To give a precise definition, let us first recall some
terminology. For a set S ⊆ R^d we denote the convex hull of S by
$$\mathrm{co}(S) := \Big\{ \sum_{j=1}^{n} \alpha_j x_j \ \Big|\ n \in \mathbb{N},\ x_j \in S,\ \alpha_j \ge 0,\ \sum_{j=1}^{n} \alpha_j = 1 \Big\}. \tag{5.3.1}$$
An n-simplex is the convex hull of n ∈ N points that are independent in a specific sense. This
is made precise in the following definition.
Definition 5.13. Let d ∈ N, and let Ω ⊆ R^d be compact. Let T be a finite set of d-simplices, and for each τ ∈ T let V(τ) ⊆ Ω have cardinality d + 1 such that τ = co(V(τ)). We call T a regular triangulation of Ω if and only if
(i) ⋃_{τ∈T} τ = Ω,
(ii) for all τ, τ′ ∈ T it holds that τ ∩ τ′ = co(V(τ) ∩ V(τ′)).
We call η ∈ V := ⋃_{τ∈T} V(τ) a node (or vertex) and τ ∈ T an element of the triangulation.
Figure 5.4: The first is a regular triangulation, while the second and the third are not.
We will split the proof into several lemmata. The strategy is to introduce a basis of the space
of cpwl functions on T the elements of which vanish on the boundary of Ω. We will then show
that there exist O(|T |) basis functions, each of which can be represented with a neural network the
size of which depends only on kT and d. To construct this basis, we first point out that an affine
function on a simplex is uniquely defined by its values at the nodes.
Figure 5.5: Visualization of Lemma 5.16 in two dimensions. The patch ω(η) consists of the union of all 2-simplices τ_i containing η. Its boundary consists of the union of all 1-simplices made up by the nodes of each τ_i without the center node, i.e., the convex hulls of V(τ_i)\{η} (for example, co(V(τ_1)\{η}) = co({η_1, η_2})).
coefficients do not sum to 1). Hence, g is uniquely determined by its values at the nodes.
Since Ω is the union of the simplices τ ∈ T , every cpwl function with respect to T is thus
uniquely defined through its values at the nodes. Hence, the desired basis consists of cpwl functions
φ_η : Ω → R with respect to T such that
$$\varphi_\eta(\mu) = \delta_{\eta\mu} \qquad \text{for all } \mu \in \mathcal{V}, \tag{5.3.4}$$
where δ_{ηµ} denotes the Kronecker delta. Assuming φ_η to be well-defined for the moment, we can then represent every cpwl function f : Ω → R that vanishes on the boundary ∂Ω as
$$f(x) = \sum_{\eta \in \mathcal{V} \cap \mathring\Omega} f(\eta)\, \varphi_\eta(x) \qquad \text{for all } x \in \Omega.$$
Note that it suffices to sum over the set of interior nodes V ∩ Ω̊, since f (η) = 0 whenever η ∈
∂Ω. To formally verify existence and well-definedness of φη , we first need a lemma characterizing
the boundary of so-called “patches” of the triangulation: For each η ∈ V, we introduce the patch
ω(η) of the node η as the union of all elements containing η, i.e.,
$$\omega(\eta) := \bigcup_{\{\tau \in \mathcal{T} \,\mid\, \eta \in \tau\}} \tau. \tag{5.3.5}$$
We refer to Figure 5.5 for a visualization of Lemma 5.16. The proof of Lemma 5.16 is quite
technical but nonetheless elementary. We therefore only outline the general argument but leave
the details to the reader in Exercise 5.27: The boundary of ω(η) must be contained in the union
of the boundaries of all τ in the patch ω(η). Since η is an interior point of Ω, it must also be
an interior point of ω(η). This can be used to show that for every S := {η i0 , . . . , η ik } ⊆ V (τ ) of
cardinality k + 1 ≤ d, the interior of (the k-dimensional manifold) co(S) belongs to the interior
of ω(η) whenever η ∈ S. Using Exercise 5.27, it then only remains to check that co(S) ⊆ ∂ω(η)
whenever η ∈ / S, which yields the claimed formula. We are now in position to show well-definedness
of the basis functions in (5.3.4).
Lemma 5.17. For each interior node η ∈ V ∩ Ω̊ there exists a unique cpwl function φη : Ω → R
satisfying (5.3.4). Moreover, φη can be expressed by a ReLU neural network with size, width, and
depth bounds that only depend on d and kT .
Proof. By Lemma 5.15, on each τ ∈ T , the affine function φη |τ is uniquely defined through the
values at the nodes of τ . This defines a continuous function φη : Ω → R. Indeed, whenever
τ ∩ τ ′ ̸= ∅, then τ ∩ τ ′ is a subsimplex of both τ and τ ′ in the sense of Definition 5.13 (ii). Thus,
applying Lemma 5.15 again, the affine functions on τ and τ ′ coincide on τ ∩ τ ′ .
Using Lemma 5.15, Lemma 5.16 and the fact that φη (µ) = 0 whenever µ ̸= η, we find that
φη vanishes on the boundary of the patch ω(η) ⊆ Ω. Thus, φη vanishes on the boundary of Ω.
Extending by zero, it becomes a cpwl function φη : Rd → R. This function is nonzero only on
elements τ for which η ∈ τ . Hence, it is a cpwl function with at most n := kT + 1 affine functions.
By Theorem 5.7, φη can be expressed as a ReLU neural network with the claimed size, width and
depth bounds.
it holds that Φ : Ω → R satisfies Φ(η) = f (η) for all η ∈ V. By Lemma 5.15 this implies that
f equals Φ on each τ , and thus f equals Φ on all of Ω. Since each element τ is the convex hull
of d + 1 nodes η ∈ V, the cardinality of V is bounded by the cardinality of T times d + 1. Thus,
the summation in (5.3.6) is over O(|T |) terms. Using Lemma 5.4 and Lemma 5.17 we obtain the
claimed bounds on size, width, and depth of the neural network.
Definition 5.18. A regular triangulation T is called locally convex if and only if ω(η) is convex
for all interior nodes η ∈ V ∩ Ω̊.
Theorem 5.19. Let d ∈ N, and let Ω ⊆ Rd be a bounded domain. Let T be a locally convex regular
triangulation of Ω. Let f : Ω → R be cpwl with respect to T and f |∂Ω = 0. Then, there exists a
constant C > 0 (independent of d, f and T ) and there exists a neural network Φf : Ω → R such
that Φf = f ,
$$\mathrm{size}(\Phi^f) \le C \cdot (1 + d^2 k_{\mathcal{T}} |\mathcal{T}|), \qquad \mathrm{width}(\Phi^f) \le C \cdot (1 + d \log(k_{\mathcal{T}}) |\mathcal{T}|), \qquad \mathrm{depth}(\Phi^f) \le C \cdot (1 + \log_2(k_{\mathcal{T}})).$$
Assume in the following that T is a locally convex triangulation. We will split the proof of the
theorem again into a few lemmata. First, we will show that a convex patch can be written as an
intersection of finitely many half-spaces. Specifically, with the affine hull of a set S defined as
$$\mathrm{aff}(S) := \Big\{ \sum_{j=1}^{n} \alpha_j x_j \ \Big|\ n \in \mathbb{N},\ x_j \in S,\ \alpha_j \in \mathbb{R},\ \sum_{j=1}^{n} \alpha_j = 1 \Big\}, \tag{5.3.7}$$
let H_0(τ, η) := aff(V(τ)\{η}) be the affine hyperplane passing through all nodes in V(τ)\{η}, and let further
Lemma 5.20. Let η be an interior node. Then a patch ω(η) is convex if and only if
$$\omega(\eta) = \bigcap_{\{\tau \in \mathcal{T} \,\mid\, \eta \in \tau\}} H_+(\tau, \eta). \tag{5.3.8}$$
Proof. The right-hand side is a finite intersection of (convex) half-spaces, and thus itself convex. It
remains to show that if ω(η) is convex, then (5.3.8) holds. We start with “⊇”. Suppose x ∉ ω(η).
Then the straight line co({x, η}) must pass through ∂ω(η), and by Lemma 5.16 this implies that
there exists τ ∈ T with η ∈ τ such that co({x, η}) passes through aff(V (τ )\{η}) = H0 (τ, η).
Hence η and x lie on different sides of this affine hyperplane, which shows “⊇”. Now we show “⊆”.
Let τ ∈ T be such that η ∈ τ and fix x in the complement of H+ (τ, η). Suppose that x ∈ ω(η). By
convexity, we then have co({x} ∪ τ ) ⊆ ω(η). This implies that there exists a point in co(V (τ )\{η})
belonging to the interior of ω(η). This contradicts Lemma 5.16. Thus, x ∉ ω(η).
The above lemma allows us to explicitly construct the basis functions φη in (5.3.4). To see this,
denote in the following for τ ∈ T and η ∈ V (τ ) by gτ,η ∈ P1 (Rd ) the affine function such that
$$g_{\tau,\eta}(\mu) = \begin{cases} 1 & \text{if } \eta = \mu \\ 0 & \text{if } \eta \ne \mu \end{cases} \qquad \text{for all } \mu \in V(\tau).$$
This function exists and is unique by Lemma 5.15. Observe that φη (x) = gτ,η (x) for all x ∈ τ .
Lemma 5.21. Let η ∈ V ∩ Ω̊ be an interior node and let ω(η) be a convex patch. Then
$$\varphi_\eta(x) = \max\Big\{ 0,\ \min_{\{\tau \in \mathcal{T} \,\mid\, \eta \in \tau\}} g_{\tau,\eta}(x) \Big\} \qquad \text{for all } x \in \mathbb{R}^d. \tag{5.3.9}$$
Thus
i.e., (5.3.9) holds for all x ∈ R^d\ω(η). Next, let τ, τ′ ∈ T such that η ∈ τ and η ∈ τ′. We wish to
show that gτ,η (x) ≤ gτ ′ ,η (x) for all x ∈ τ . Since gτ,η (x) = φη (x) for all x ∈ τ , this then concludes
the proof of (5.3.9). By Lemma 5.20 it holds
µ ∈ H+ (τ ′ , η) for all µ ∈ V (τ ).
Hence, by (5.3.10)
Moreover, gτ,η (η) = gτ ′ ,η (η) = 1. Thus, gτ,η (µ) ≥ gτ ′ ,η (µ) for all µ ∈ V (τ ′ ) and therefore
Proof of Theorem 5.19. For every interior node η ∈ V ∩ Ω̊, the cpwl basis function φ_η in (5.3.4) can be expressed as in (5.3.9), i.e.,
$$\varphi_\eta(x) = \sigma \bullet \Phi^{\min}_{|\{\tau \in \mathcal{T} \,\mid\, \eta \in \tau\}|} \bullet \big( g_{\tau,\eta}(x) \big)_{\{\tau \in \mathcal{T} \,\mid\, \eta \in \tau\}},$$
where (gτ,η (x)){τ ∈T | η∈τ } denotes the parallelization with shared inputs of the functions gτ,η (x) for
all τ ∈ T such that η ∈ τ .
For this neural network, with |{τ ∈ T | η ∈ τ }| ≤ kT , we have by Lemma 5.2
and similarly
Since for every interior node the number of simplices touching the node must be greater than or equal to d, we can assume max{k_T, d} = k_T in the following (otherwise there exist no interior nodes, and the function f is constant 0). As in the proof of Theorem 5.14, the neural network
$$\Phi(x) := \sum_{\eta \in \mathcal{V} \cap \mathring\Omega} f(\eta)\, \varphi_\eta(x)$$
realizes the function f on all of Ω. Since the number of nodes |V| is bounded by (d + 1)|T|, an application of Lemma 5.4 yields the desired bounds.
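In one dimension on a uniform grid, (5.3.9) reduces to the familiar hat function. The following sketch (plain NumPy; the grid and names are our own illustration, not part of the text) evaluates φ_η as the positive part of the minimum of the two affine functions g_{τ,η} on the two intervals touching η.

```python
import numpy as np

def hat(x, eta, h):
    """1D instance of (5.3.9): cpwl basis function phi_eta on a uniform grid with spacing h."""
    g_left = (x - (eta - h)) / h    # affine, equals 1 at eta and 0 at eta - h
    g_right = ((eta + h) - x) / h   # affine, equals 1 at eta and 0 at eta + h
    return np.maximum(0.0, np.minimum(g_left, g_right))

x = np.linspace(0.0, 1.0, 11)
print(hat(x, eta=0.5, h=0.1))   # equals 1 at the node 0.5 and 0 at all other grid nodes
```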
Then, C 0,s (Ω) is the set of functions f ∈ C 0 (Ω) for which ∥f ∥C 0,s (Ω) < ∞.
Hölder continuous functions can be approximated well by certain cpwl functions. Therefore, we
obtain the following result.
Theorem 5.22. Let d ∈ N. There exists a constant C = C(d) such that for every f ∈ C 0,s ([0, 1]d )
and every N there exists a ReLU neural network ΦfN with
and
$$\sup_{x \in [0,1]^d} \big| f(x) - \Phi^f_N(x) \big| \le C \|f\|_{C^{0,s}([0,1]^d)}\, N^{-\frac{s}{d}}.$$
Proof. For M ≥ 2, consider the set of nodes {ν/M | ν ∈ {−1, . . . , M + 1}d } where ν/M =
(ν1 /M, . . . , νd /M ). These nodes suggest a partition of [−1/M, 1 + 1/M ]d into (2 + M )d sub-
hypercubes. Each such sub-hypercube can be partitioned into d! simplices, such that we obtain a
regular triangulation T with d!(2+M )d elements on [0, 1]d . According to Theorem 5.14 there exists a
neural network Φ that is cpwl with respect to T and Φ(ν/M ) = f (ν/M ) whenever ν ∈ {0, . . . , M }d
and Φ(ν/M ) = 0 for all other (boundary) nodes. It holds
Since Φ|_τ is the linear interpolant of f at the nodes V(τ) of the simplex τ, Φ(x) is a convex combination of the (f(η))_{η∈V(τ)}. Fix an arbitrary node η_0 ∈ V(τ). Then ∥x − η_0∥_2 ≤ ε and
$$|f(x) - f(\eta_0)| \le \|f\|_{C^{0,s}([0,1]^d)}\, \|x - \eta_0\|_2^{\,s} \le \|f\|_{C^{0,s}([0,1]^d)}\, \varepsilon^{s}.$$
Hence, using f(η_0) = Φ(η_0),
The principle behind Theorem 5.22 can be applied in even more generality. Since we can
represent every cpwl function on a regular triangulation with a neural network of size O(N ), where
N denotes the number of elements, all of classical (e.g. finite element) approximation theory for
cpwl functions can be lifted to generate statements about ReLU approximation. For instance, it is well known that functions in the Sobolev space H²([0, 1]^d) can be approximated by cpwl functions on a regular triangulation in terms of L²([0, 1]^d) with the rate 2/d. Similarly to the proof of Theorem 5.22, for every f ∈ H²([0, 1]^d) and every N ∈ N there then exists a ReLU neural network Φ^N such that size(Φ^N) = O(N) and
$$\|f - \Phi^N\|_{L^2([0,1]^d)} \le C \|f\|_{H^2([0,1]^d)}\, N^{-\frac{2}{d}}.$$
Finally, we can wonder how to approximate even smoother functions, i.e., those that have many
continuous derivatives. Since more smoothness is a restrictive assumption on the set of functions
to approximate, we would hope that this will allow us to have smaller neural networks. Essentially,
we desire a result similar to Theorem 4.9, but with the ReLU activation function.
However, we will see in the following chapter, that the emulation of piecewise affine functions
on regular triangulations cannot yield the approximation rates of Theorem 4.9. To harness the
smoothness, it will be necessary to build ReLU neural networks that emulate polynomials. Sur-
prisingly, we will see in Chapter 7 that polynomials can be very efficiently approximated by deep
ReLU neural networks.
Exercises
Exercise 5.23. Let p : R → R be a polynomial of degree n ≥ 1 (with leading coefficient nonzero)
and let s : R → R be a continuous sigmoidal activation function. Show that the identity map
x 7→ x : R → R belongs to N11 (p; 1, n + 1) but not to N11 (s; L) for any L ∈ N.
Exercise 5.24. Consider cpwl functions f : R → R with n ∈ N0 breakpoints (points where the
function is not C 1 ). Determine the minimal size required to exactly express every such f with a
depth-1 ReLU neural network.
Exercise 5.25. Show that the notion of affine independence is invariant under permutations of the points.
Exercise 5.27. Let τ = co({η_0, . . . , η_d}) be a d-simplex. Show that the boundary of τ is given by ⋃_{i=0}^{d} co({η_0, . . . , η_d}\{η_i}).
Chapter 6
In the previous chapters, we observed some remarkable approximation results of shallow ReLU
neural networks. In practice, however, deeper architectures are more common. To understand why, in this chapter we discuss some potential shortcomings of shallow ReLU networks compared to deep ReLU networks.
Traditionally, an insightful approach to study limitations of ReLU neural networks has been to
analyze the number of linear regions these functions can generate.
Definition 6.1. Let d ∈ N, Ω ⊆ R^d, and let f : Ω → R be cpwl (see Definition 5.5). We say that f has p ∈ N pieces (or linear regions) if p is the smallest number of connected open sets (Ω_i)_{i=1}^p such that ⋃_{i=1}^p Ω_i = Ω and f|_{Ω_i} is an affine function for all i = 1, . . . , p. We denote Pieces(f, Ω) := p.
For d = 1 we call every point where f is not differentiable a break point of f.
To get an accurate cpwl approximation of a function, the approximating function needs to have
many pieces. The next theorem, corresponding to [62, Theorem 2], quantifies this statement.
Theorem 6.2. Let −∞ < a < b < ∞ and f ∈ C³([a, b]) such that f is not affine. Then there exists a constant c > 0, depending only on ∫_a^b √(|f′′(x)|) dx, such that
The proof of the theorem is left to the reader, see Exercise 6.12.
Theorem 6.2 implies that for ReLU neural networks we need architectures allowing for many
pieces, if we want to approximate non-linear functions to high accuracy. But how many pieces can
we create for a fixed depth and width? We will establish a simple theoretical upper bound in Section
6.1. Subsequently, we will investigate under which conditions these upper bounds are attainable
in Section 6.2. This will reveal that certain functions necessitate very large shallow networks for
approximation, whereas relatively small deep networks can also approximate them. These findings
are presented in Section 6.3.
Finally, we will question the practical relevance of this analysis by examining how many pieces
typical neural networks possess. Surprisingly, in Section 6.4 we will find that randomly initialized
deep neural networks on average do not have a number of pieces that is anywhere close to the
theoretical upper bound.
This holds because the sum is affine in every point where both f1 and f2 are affine. Therefore,
the sum has at most as many break points as f1 and f2 combined. Moreover, the number of
pieces of a univariate function equals the number of its break points plus one.
This is because for each of the affine pieces of f2 —let us call one of those pieces A ⊆ R—we
have that f2 is either constant or injective on A. If it is constant, then f1 ◦ f2 is constant. If
it is injective, then Pieces(f1 ◦ f2 , A) = Pieces(f1 , f2 (A)) ≤ Pieces(f1 , Rd ). Since this holds
for all pieces of f2 we get (6.1.2).
These considerations give the following result, which follows the argument of [227, Lemma 2.1].
We state it for general cpwl activation functions. The ReLU activation function corresponds to
p = 2.
Theorem 6.3. Let L ∈ N and let σ be cpwl with p pieces. Then every neural network Φ with architecture (σ; 1, d_1, . . . , d_L, 1) has at most (p · width(Φ))^L pieces.
Proof. The proof is via induction over the depth L. Let L = 1, and let Φ : R → R be a neural
network of architecture (σ; 1, d1 , 1). Then
$$\Phi(x) = \sum_{k=1}^{d_1} w_k^{(1)}\, \sigma\big( w_k^{(0)} x + b_k^{(0)} \big) + b^{(1)} \qquad \text{for } x \in \mathbb{R},$$
Figure 6.1: Top: Composition of two cpwl functions f1 ◦ f2 can create a piece whenever the value
of f2 crosses a level that is associated to a break point of f1 . Bottom: Addition of two cpwl
functions f1 + f2 produces a cpwl function that can have break points at positions where either f1
or f2 has a break point.
for certain w(0) , w(1) , b(0) ∈ Rd1 and b(1) ∈ R. By (6.1.1), Pieces(Φ) ≤ p · width(Φ).
For the induction step, assume the statement holds for L ∈ N, and let Φ : R → R be a neural
network of architecture (σ; 1, d1 , . . . , dL+1 , 1). Then, we can write
$$\Phi(x) = \sum_{j=1}^{d_{L+1}} w_j\, \sigma(h_j(x)) + b \qquad \text{for } x \in \mathbb{R},$$
for some w ∈ RdL+1 , b ∈ R, and where each hj is a neural network of architecture (σ; 1, d1 , . . . , dL , 1).
Using the induction hypothesis, each σ ◦ h_j has at most p · (p · width(Φ))^L affine pieces. Hence Φ has at most width(Φ) · p · (p · width(Φ))^L = (p · width(Φ))^{L+1} affine pieces. This completes the
proof.
Theorem 6.3 shows that there are limits to how many pieces can be created with a certain
architecture. It is noteworthy that the effects of the depth and the width of a neural network
are vastly different. While increasing the width can polynomially increase the number of pieces,
increasing the depth can result in exponential increase. This is a first indication of the prowess of
depth of neural networks.
To understand the effect of this on the approximation problem, we apply the bound of Theorem
6.3 to Theorem 6.2.
Theorem 6.4 gives a lower bound on achievable approximation rates in dependence of the depth
L. As target functions become smoother, we expect that we can achieve faster convergence rates
(cp. Chapter 4). However, without increasing the depth, it seems to be impossible to leverage such
additional smoothness.
This observation strongly indicates that deeper architectures can be superior. Before we can
make such statements, we first explore whether the upper bounds of Theorem 6.3 are even achiev-
able.
This function can be expressed by a ReLU neural network of depth one and with two nodes
h1 (x) = σReLU (2x) − σReLU (4x − 2) for all x ∈ [0, 1]. (6.2.1)
It turns out that this function has a rather interesting behavior: it is a “sawtooth” function with 2^{n−1} spikes, see Figure 6.2.
Proof. The case n = 1 holds by definition. We proceed by induction, and assume the statement
holds for n. Let x ∈ [0, 1/2] and i ≥ 0 even such that x ∈ [i2−(n+1) , (i + 1)2−(n+1) ]. Then
2x ∈ [i2−n , (i + 1)2−n ]. Thus
Similarly, if x ∈ [0, 1/2] and i ≥ 1 odd such that x ∈ [i2−(n+1) , (i + 1)2−(n+1) ], then h1 (x) = 2x ∈
[i2−n , (i + 1)2−n ] and
The case x ∈ [1/2, 1] follows by observing that hn+1 is symmetric around 1/2.
The neural network h_n has size O(n) and is piecewise linear with at least 2^n pieces. This shows that the number of pieces can indeed increase exponentially in the neural network size; see also the upper bound in Theorem 6.3.
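The following NumPy sketch (function names are ours) evaluates h_n by n-fold composition of h_1 from (6.2.1) and prints its values on the dyadic grid k2^{−n}, which alternate between 0 and 1 as stated in Lemma 6.5.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def h(n, x):
    """Sawtooth h_n: n-fold composition of h_1(x) = ReLU(2x) - ReLU(4x - 2), cf. (6.2.1)."""
    for _ in range(n):
        x = relu(2 * x) - relu(4 * x - 2)
    return x

n = 4
grid = np.arange(2 ** n + 1) / 2 ** n
print(h(n, grid))   # [0, 1, 0, 1, ...]: 2^{n-1} spikes and 2^n linear pieces on [0, 1]
```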
Figure 6.2: The sawtooth functions h_1, h_2, and h_3 on [0, 1].
Theorem 6.6. For every n ∈ N there exists a neural network f ∈ N_1^1(σ_ReLU; n² + 3, 2) such that for every g ∈ N_1^1(σ_ReLU; n, 2^{n−1}) it holds that
$$\int_0^1 |f(x) - g(x)|\, dx \ge \frac{1}{32}.$$
The neural network f may have quadratically more layers than g, but width(g) = 2^{n−1} and width(f) = 2. Hence the size of g may be exponentially larger than the size of f, but nonetheless no such g can approximate f. Thus even an exponential increase in width cannot necessarily compensate for an increase in depth. The proof is based on the following observations stated in [228]:
(i) Functions with few oscillations poorly approximate functions with many oscillations,
(ii) neural networks with few layers can only produce functions with few oscillations, and
(iii) neural networks with many layers can have many oscillations.
Proof of Theorem 6.6. Fix n ∈ N. Let f := h_{n²+3} ∈ N_1^1(σ_ReLU; n² + 3, 2). For arbitrary g ∈ N_1^1(σ_ReLU; n, 2^{n−1}), by Theorem 6.3, g is piecewise linear with at most (2 · 2^{n−1})^n = 2^{n²} break points. The function f is the sawtooth function with 2^{n²+2} spikes. The number of triangles formed by the graph of f and the constant line at 1/2 equals 2^{n²+3} − 1, each with area 2^{−(n²+5)}, see Figure 6.3. For the m triangles in between two break points of g, the graph of g does not cross at
Figure 6.3: Left: The functions h_n form 2^n − 1 triangles with the line at 1/2, each with area 2^{−(n+2)}. Right: For an affine function with m (in this sketch m = 5) triangles in between two break points, the function can cross at most ⌈m/2⌉ + 1 ≤ m/2 + 2 of them. Figure adapted from [229, Section 5].
Figure 6.4: Two randomly initialized neural networks Φ1 and Φ2 with architectures
(σReLU ; 1, 10, 10, 1) and (σReLU ; 1, 5, 5, 5, 5, 5, 1). The initialization scheme was He initialization
[86]. The number of linear regions equals 114 and 110, respectively.
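The observation behind Figure 6.4 can be reproduced with a short experiment: count how often the activation pattern of a random ReLU network changes along a line in input space. This is a minimal NumPy sketch, with names of our own; it draws the biases from a small normal distribution (an assumption of this sketch, since with all biases exactly zero every break point would sit at the origin), so the counts are only indicative and need not match the figure.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def random_net(dims, rng):
    """He-initialized weights; biases from a small normal (assumption of this sketch)."""
    return [(rng.normal(0.0, np.sqrt(2.0 / n), size=(m, n)),
             rng.normal(0.0, 0.5, size=m)) for n, m in zip(dims[:-1], dims[1:])]

def pattern(net, x):
    """Sign pattern of all hidden pre-activations; constant on each linear region."""
    signs = []
    for W, b in net[:-1]:
        z = W @ x + b
        signs.extend((z > 0).tolist())
        x = relu(z)
    return tuple(signs)

def pieces_on_segment(net, xs):
    """Estimate the number of linear regions along a 1D segment via pattern changes."""
    pats = [pattern(net, np.array([x])) for x in xs]
    return 1 + sum(p != q for p, q in zip(pats, pats[1:]))

rng = np.random.default_rng(0)
xs = np.linspace(-3.0, 3.0, 4001)
for dims in [(1, 10, 10, 1), (1, 5, 5, 5, 5, 5, 1)]:
    counts = [pieces_on_segment(random_net(dims, rng), xs) for _ in range(10)]
    print(dims, "mean number of pieces:", np.mean(counts))
```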
where h is a cpwl function, a is a vector, and b is a scalar, many pieces can be generated if ⟨a, h(x)⟩
crosses the −b level often.
If a, b are random variables, and we know that h does not oscillate too much, then we can
quantify the probability of ⟨a, h(x)⟩ crossing the −b level often. The following lemma from [115,
Lemma 3.1] provides the details.
Lemma 6.7. Let c > 0 and let h : [0, c] → R be a cpwl function on [0, c]. Let t ∈ N, let A ⊆ R be a Lebesgue measurable set, and assume that for every y ∈ A it holds that
$$|\{x \in [0, c] \mid h(x) = y\}| \ge t.$$
Then c∥h′∥_{L∞} ≥ ∥h′∥_{L¹} ≥ |A| · t, where |A| is the Lebesgue measure of A.
In particular, if h has at most P ∈ N pieces and ∥h′∥_{L¹} is finite, then it holds for all δ > 0 and all t ≤ P that
$$\mathbb{P}\big[\, |\{x \in [0, c] \mid h(x) = U\}| \ge t \,\big] \le \frac{\|h'\|_{L^1}}{\delta t}, \qquad \mathbb{P}\big[\, |\{x \in [0, c] \mid h(x) = U\}| > P \,\big] = 0,$$
where U is a random variable that is uniformly distributed on an interval of length δ.
Proof. We will assume c = 1. The general case then follows by considering h̃(x) = h(x/c).
Let, for (c_i)_{i=1}^{P+1} ⊆ [0, 1] with c_1 = 0, c_{P+1} = 1, and c_i ≤ c_{i+1} for all i = 1, . . . , P + 1, the pieces of h be given by ((c_i, c_{i+1}))_{i=1}^{P}. We denote, for i = 1, . . . , P,
$$\tilde V_i := \bigcup_{j=1}^{i-1} V_j.$$
In words, T_{i,n} contains the values of A that are hit on V_i for the n-th time. Since h is cpwl, we observe that for all i = 1, . . . , P
(i) T_{i,n_1} ∩ T_{i,n_2} = ∅ for all n_1, n_2 ∈ N ∪ {∞}, n_1 ≠ n_2,
(ii) T_{i,∞} ∪ ⋃_{n=1}^{∞} T_{i,n} = h(V_i) ∩ A,
(iv) |T_{i,∞}| = 0.
Note that, since h is affine on V_i, it holds that |h′| = |h(V_i)|/|V_i| on V_i. Hence, for t ≤ P,
$$\|h'\|_{L^1} \ge \sum_{i=1}^{P} |h(V_i)| \ge \sum_{i=1}^{P} |h(V_i) \cap A| = \sum_{i=1}^{P} \Big( \sum_{n=1}^{\infty} |T_{i,n}| + |T_{i,\infty}| \Big) = \sum_{i=1}^{P} \sum_{n=1}^{\infty} |T_{i,n}| \ge \sum_{n=1}^{t} \sum_{i=1}^{P} |T_{i,n}|,$$
where the first equality follows by (i), (ii), the second by (iv), and the last inequality by (iii).
Note that, by assumption, for all n ≤ t every y ∈ A is an element of T_{i,n} or T_{i,∞} for some i ≤ P. Therefore, by (iv),
$$\sum_{i=1}^{P} |T_{i,n}| \ge |A|,$$
which completes the proof.
Lemma 6.7 applied to neural networks essentially states that, in a single neuron, if the bias
term is chosen uniformly randomly on an interval of length δ, then the probability of generating at
least t pieces by composition scales reciprocal to t.
Next, we will analyze how Lemma 6.7 implies an upper bound on the number of pieces generated
in a randomly initialized neural network. For simplicity, we only consider random biases in the
following, but mention that similar results hold if both the biases and weights are random variables
[82].
Definition 6.8. Let L ∈ N, (d0 , d1 , . . . , dL , 1) ∈ NL+2 and W (ℓ) ∈ Rdℓ+1 ×dℓ for ℓ = 0, . . . , L. Fur-
thermore, let δ > 0 and let the bias vectors b(ℓ) ∈ Rdℓ+1 , for ℓ = 0, . . . , L, be random variables such
that each entry of each b(ℓ) is independently and uniformly distributed on the interval [−δ/2, δ/2].
We call the associated ReLU neural network a random-bias neural network.
To apply Lemma 6.7 to a single neuron with random biases, we also need some bound on the
derivative of the input to the neuron.
Definition 6.9. Let L ∈ N, (d_0, d_1, . . . , d_L, 1) ∈ N^{L+2}, and W^{(ℓ)} ∈ R^{d_{ℓ+1}×d_ℓ} and b^{(ℓ)} ∈ R^{d_{ℓ+1}} for ℓ = 0, . . . , L. Moreover, let δ > 0.
For ℓ = 1, . . . , L + 1, i = 1, . . . , d_ℓ, introduce the functions η_{ℓ,i}(·; (W^{(j)}, b^{(j)})_{j=0}^{ℓ−1}) (the i-th pre-activation of the ℓ-th layer), and define the maximal internal derivative ν((W^{(ℓ)})_{ℓ=0}^L, δ) as the supremum of the derivatives of these functions over all
$$(b^{(j)})_{j=0}^{L} \in \prod_{j=0}^{L} [-\delta/2, \delta/2]^{d_{j+1}}, \qquad \ell = 1, \dots, L,\quad i = 1, \dots, d_\ell.$$
Theorem 6.10. Let L ∈ N and let (d_0, d_1, . . . , d_L, 1) ∈ N^{L+2}. Let δ ∈ (0, 1]. Let W^{(ℓ)} ∈ R^{d_{ℓ+1}×d_ℓ}, for ℓ = 0, . . . , L, be such that ν((W^{(ℓ)})_{ℓ=0}^L, δ) ≤ C_ν for some C_ν > 0.
For an associated random-bias neural network Φ, we have that for a line segment s ⊆ R^{d_0} of length 1,
$$\mathbb{E}[\mathrm{Pieces}(\Phi, s)] \le 1 + d_1 + \frac{C_\nu}{\delta}\, \big( 1 + (L-1)\ln(2\,\mathrm{width}(\Phi)) \big) \sum_{j=2}^{L} d_j. \tag{6.4.1}$$
Proof. Let W^{(ℓ)} ∈ R^{d_{ℓ+1}×d_ℓ} for ℓ = 0, . . . , L. Moreover, let b^{(ℓ)} ∈ [−δ/2, δ/2]^{d_{ℓ+1}} for ℓ = 0, . . . , L be uniformly distributed random variables. We denote
$$\theta_\ell : s \to \mathbb{R}^{d_\ell}, \qquad x \mapsto \big( \eta_{\ell,i}(x; (W^{(j)}, b^{(j)})_{j=0}^{\ell-1}) \big)_{i=1}^{d_\ell}.$$
Let κ : s → [0, 1] be an isomorphism. Since each coordinate of θℓ is cpwl, there are points
x0 , x1 , . . . , xqℓ ∈ s with κ(xj ) < κ(xj+1 ) for j = 0, . . . , qℓ − 1, such that θℓ is affine (as a function
into Rdℓ ) on [κ(xj ), κ(xj+1 )] for all j = 0, . . . , qℓ − 1 as well as on [0, κ(x0 )] and [κ(xqℓ ), 1].
We will now inductively find an upper bound on the qℓ .
Let ℓ = 2, then
θ2 (x) = W (1) σReLU (W (0) x + b(0) ).
Since W (1) · +b(1) is an affine function, it follows that θ2 can only be non-affine in points where
σReLU (W (0) · +b(0) ) is not affine. Therefore, θ2 is only non-affine if one coordinate of W (0) · +b(0)
intersects 0 nontrivially. This can happen at most d1 times. We conclude that we can choose
q2 = d 1 .
Next, let us find an upper bound on qℓ+1 from qℓ . Note that
Now θℓ+1 is affine in every point x ∈ s where θℓ is affine and (θℓ (x) + b(ℓ−1) )i ̸= 0 for all coordinates
i = 1, . . . , dℓ . As a result, we have that we can choose qℓ+1 such that
Therefore, for ℓ ≥ 2,
$$q_{\ell+1} \le d_1 + \sum_{j=2}^{\ell} \Big| \big\{ x \in s \ \big|\ (\theta_j(x) + b^{(j)})_i = 0 \text{ for at least one } i = 1, \dots, d_j \big\} \Big| \le d_1 + \sum_{j=2}^{\ell} \sum_{i=1}^{d_j} \Big| \big\{ x \in s \ \big|\ \eta_{j,i}(x) = -b_i^{(j)} \big\} \Big|.$$
pk,ℓ,i = 0.
It holds that
$$\mathbb{E}\Big[ \sum_{j=2}^{L} \sum_{i=1}^{d_j} \big| \{ x \in s \mid \eta_{j,i}(x) = -b_i^{(j)} \} \big| \Big] \le \sum_{j=2}^{L} \sum_{i=1}^{d_j} \sum_{k=1}^{\infty} k \cdot \mathbb{P}\Big[ \big| \{ x \in s \mid \eta_{j,i}(x) = -b_i^{(j)} \} \big| = k \Big] \le \sum_{j=2}^{L} \sum_{i=1}^{d_j} \sum_{k=1}^{\infty} k \cdot (p_{k,j,i} - p_{k+1,j,i}).$$
Pieces(Φ, s) ≤ qL+1 + 1
• Maximal internal derivative: Theorem 6.10 requires the weights to be chosen such that the
maximal internal derivative is bounded by a certain number. However, if they are randomly
initialized in such a way that with high probability the maximal internal derivative is bounded
by a small number, then similar results can be shown. In practice, weights in the ℓ-th layer are often initialized according to a centered normal distribution with standard deviation √(2/d_ℓ) [86]. Since the variance is inversely proportional to the width of the layers, the internal derivatives remain bounded with high probability, independent of the width of the neural networks. This explains the observation from Figure 6.4.
Exercises
Exercise 6.12. Let −∞ < a < b < ∞ and let f ∈ C 3 ([a, b])\P1 . Denote by p(ε) ∈ N the minimal
number of intervals partitioning [a, b], such that a (not necessarily continuous) piecewise linear
function on p(ε) intervals can approximate f on [a, b] uniformly up to error ε > 0. In this exercise,
we wish to show
$$\liminf_{\varepsilon \searrow 0}\ p(\varepsilon)\, \sqrt{\varepsilon} > 0. \tag{6.4.2}$$
Therefore, we can find a constant C > 0 such that ε ≥ Cp(ε)−2 for all ε > 0. This shows a variant
of Theorem 6.2. Proceed as follows to prove (6.4.2):
(i) Fix ε > 0 and let a = x_0 < x_1 < · · · < x_{p(ε)} = b be a partitioning into p(ε) pieces. For i = 0, . . . , p(ε) − 1 and x ∈ [x_i, x_{i+1}] let
$$e_i(x) := f(x) - \Big( f(x_i) + (x - x_i)\, \frac{f(x_{i+1}) - f(x_i)}{x_{i+1} - x_i} \Big).$$
$$\max_{x \in [x_i, x_{i+1}]} |e_i(x)| = \frac{h_i^2}{8}\, |f''(m_i)| + O(h_i^3).$$
(iv) Conclude that (6.4.2) holds for general non-linear f ∈ C 3 ([a, b]).
Exercise 6.13. Show that, for L = 1, Theorem 6.3 holds for piecewise smooth functions, when
replacing the number of affine pieces by the number of smooth pieces. These are defined by replacing
“affine” by “smooth” (meaning C ∞ ) in Definition 6.1.
Exercise 6.14. Show that, for L > 1, Theorem 6.3 does not hold for piecewise smooth functions, when replacing the number of affine pieces by the number of smooth pieces.
Exercise 6.15. For p ∈ N, p > 2, and n ∈ N, construct a function h_n^{(p)} similar to h_n of (6.5), such that h_n^{(p)} ∈ N_1^1(σ_ReLU; n, p) and such that h_n^{(p)} has p^n pieces and size O(p²n).
Chapter 7
In the previous chapter, we observed that many layers are a necessary prerequisite for ReLU neural
networks to approximate smooth functions with high rates. We now analyze which depth is sufficient
to achieve good approximation rates for smooth functions.
To approximate smooth functions efficiently, one of the main tools in Chapter 4 was to rebuild
polynomial-based functions, such as higher-order B-splines. For smooth activation functions, we
were able to reproduce polynomials by using the nonlinearity of the activation functions. This
argument certainly cannot be repeated for the piecewise linear ReLU. On the other hand, up until
now, we have seen that deep ReLU neural networks are extremely efficient at producing the strongly
oscillating sawtooth functions discussed in Lemma 6.5.
The main observation of this chapter is that the efficient representation of sawtooth functions is intimately linked to the approximation of the square function and hence allows very efficient approximations of polynomial functions. This observation was first made by Dmitry Yarotsky [245] in 2016, and the present chapter is primarily based on this paper.
First, in Section 7.1, we will give an efficient neural network approximation of the squaring func-
tion. Second, in Section 7.2, we will demonstrate how the squaring neural network can be modified
to yield a neural network that approximates the function that multiplies its inputs. Using these
two tools, we conclude in Section 7.3 that deep ReLU neural networks can efficiently approximate
k-times continuously differentiable functions with Hölder continuous derivatives.
is a piecewise linear function on [0, 1] with break points x_{n,j} = j2^{−n}, j = 0, . . . , 2^n. Moreover, s_n(x_{n,k}) = x_{n,k}^2 for all k = 0, . . . , 2^n, i.e., s_n is the piecewise linear interpolant of x ↦ x² on [0, 1].
Proof. The statement holds for n = 1. We proceed by induction. Assume the statement holds for s_n and let k ∈ {0, . . . , 2^{n+1}}. By Lemma 6.5, h_{n+1}(x_{n+1,k}) = 0 whenever k is even. Hence for even k ∈ {0, . . . , 2^{n+1}},
$$s_{n+1}(x_{n+1,k}) = x_{n+1,k} - \sum_{j=1}^{n+1} \frac{h_j(x_{n+1,k})}{2^{2j}} = s_n(x_{n+1,k}) - \frac{h_{n+1}(x_{n+1,k})}{2^{2(n+1)}} = s_n(x_{n+1,k}) = x_{n+1,k}^2,$$
where we used the induction assumption s_n(x_{n+1,k}) = x_{n+1,k}^2 for x_{n+1,k} = k2^{−(n+1)} = (k/2)\,2^{−n} = x_{n,k/2}.
Now let k ∈ {1, . . . , 2^{n+1} − 1} be odd. Then by Lemma 6.5, h_{n+1}(x_{n+1,k}) = 1. Moreover, since s_n is linear on [x_{n,(k−1)/2}, x_{n,(k+1)/2}] = [x_{n+1,k−1}, x_{n+1,k+1}] and x_{n+1,k} is the midpoint of this interval,
$$s_{n+1}(x_{n+1,k}) = s_n(x_{n+1,k}) - \frac{h_{n+1}(x_{n+1,k})}{2^{2(n+1)}} = \frac{1}{2}\big( x_{n+1,k-1}^2 + x_{n+1,k+1}^2 \big) - \frac{1}{2^{2(n+1)}} = \frac{(k-1)^2}{2^{2(n+1)+1}} + \frac{(k+1)^2}{2^{2(n+1)+1}} - \frac{2}{2^{2(n+1)+1}} = \frac{1}{2}\, \frac{2k^2}{2^{2(n+1)}} = x_{n+1,k}^2.$$
This completes the proof.
Proof. Set $e_n(x) := x^2 - s_n(x)$. Let $x$ be in the interval $[x_{n,k}, x_{n,k+1}] = [k2^{-n}, (k+1)2^{-n}]$ of length $2^{-n}$. Since $s_n$ is the linear interpolant of $x^2$ on this interval, we have
\[
|e_n'(x)| = \Big| 2x - \frac{x_{n,k+1}^2 - x_{n,k}^2}{2^{-n}} \Big| = \Big| 2x - \frac{2k+1}{2^n} \Big| \le \frac{1}{2^n}.
\]
Thus $e_n : [0,1] \to \mathbb{R}$ has Lipschitz constant $2^{-n}$. Since $e_n(x_{n,k}) = 0$ for all $k = 0, \dots, 2^n$, and the length of the interval $[x_{n,k}, x_{n,k+1}]$ equals $2^{-n}$, we get
\[
\sup_{x \in [0,1]} |e_n(x)| \le \frac{1}{2}\, 2^{-n}\, 2^{-n} = 2^{-2n-1}.
\]
Figure 7.1: The neural networks $h_1(x) = \sigma_{\mathrm{ReLU}}(2x) - \sigma_{\mathrm{ReLU}}(4x-2)$ and $s_n(x) = \sigma_{\mathrm{ReLU}}(s_{n-1}(x)) - h_n(x)/2^{2n}$, where $h_n = h_1 \circ h_{n-1}$.
Finally, to see that $s_n$ can be represented by a neural network of the claimed architecture, note that for $n \ge 2$
\[
s_n(x) = x - \sum_{j=1}^{n} \frac{h_j(x)}{2^{2j}} = s_{n-1}(x) - \frac{h_n(x)}{2^{2n}} = \sigma_{\mathrm{ReLU}}(s_{n-1}(x)) - \frac{h_1 \circ h_{n-1}(x)}{2^{2n}}.
\]
Here we used that $s_{n-1}$ is the piecewise linear interpolant of $x^2$, so that $s_{n-1}(x) \ge 0$ and thus $s_{n-1}(x) = \sigma_{\mathrm{ReLU}}(s_{n-1}(x))$ for all $x \in [0,1]$. Hence $s_n$ is of depth $n$ and width 3, see Figure 7.1.
In conclusion, we have shown that sn : [0, 1] → [0, 1] approximates the square function uniformly
on [0, 1] with exponentially decreasing error in the neural network size. Note that due to Theorem
6.4, this would not be possible with a shallow neural network, which can at best interpolate x2 on
a partition of [0, 1] with polynomially many (w.r.t. the neural network size) pieces.
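To make the construction concrete, the following short Python sketch (an illustration, not part of the formal argument) evaluates $s_n$ via the recursion $s_n = s_{n-1} - h_n/2^{2n}$ with $h_n = h_1 \circ h_{n-1}$ and checks the exponential error decay numerically:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def h1(x):
    # sawtooth on [0, 1]: h1(x) = relu(2x) - relu(4x - 2)
    return relu(2 * x) - relu(4 * x - 2)

def s(n, x):
    # s_n(x) = x - sum_{j=1}^n h_j(x) / 2^(2j), with h_j = h1 composed j times
    h = x.copy()
    out = x.copy()
    for j in range(1, n + 1):
        h = h1(h)
        out = out - h / 2 ** (2 * j)
    return out

x = np.linspace(0.0, 1.0, 10_001)
for n in [1, 3, 5, 7]:
    err = np.max(np.abs(x ** 2 - s(n, x)))
    # the error should be of order 2^(-2n-1), i.e. it decays exponentially in n
    print(n, err, 2.0 ** (-2 * n - 1))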
7.2 Multiplication
According to Lemma 7.2, depth can help in the approximation of x 7→ x2 , which, on first sight,
seems like a rather specific example. However, as we shall discuss in the following, this opens
up a path towards fast approximation of functions with high regularity, e.g., C k ([0, 1]d ) for some
k > 1. The crucial observation is that, via the polarization identity, we can write the product of two numbers as a difference of squares,
\[
x \cdot y = \frac{(x+y)^2 - (x-y)^2}{4} \qquad (7.2.1)
\]
for all x, y ∈ R. Efficient approximation of the operation of multiplication allows efficient ap-
proximation of polynomials. Those in turn are well-known to be good approximators for functions
exhibiting k ∈ N derivatives. Before exploring this idea further in the next section, we first make
precise the observation that neural networks can efficiently approximate the multiplication of real
numbers.
We start with the multiplication of two numbers, in which case neural networks of logarithmic
size in the desired accuracy are sufficient.
Lemma 7.3. For every $\varepsilon > 0$ there exists a ReLU neural network $\Phi^\times_\varepsilon : [-1,1]^2 \to [-1,1]$ such that
\[
\sup_{x,y \in [-1,1]} |x \cdot y - \Phi^\times_\varepsilon(x,y)| \le \varepsilon,
\]
Since $|a| = \sigma_{\mathrm{ReLU}}(a) + \sigma_{\mathrm{ReLU}}(-a)$, by (7.2.1) we have for all $x, y \in [-1,1]$
\begin{align*}
|x \cdot y - \Phi^\times_\varepsilon(x,y)| &= \Big| \frac{(x+y)^2 - (x-y)^2}{4} - \Big( s_n\Big(\frac{|x+y|}{2}\Big) - s_n\Big(\frac{|x-y|}{2}\Big) \Big) \Big| \\
&= \Big| \frac{4(\frac{x+y}{2})^2 - 4(\frac{x-y}{2})^2}{4} - \frac{4 s_n(\frac{|x+y|}{2}) - 4 s_n(\frac{|x-y|}{2})}{4} \Big| \\
&\le \frac{4(2^{-2n-1} + 2^{-2n-1})}{4} = 4^{-n} \le \varepsilon,
\end{align*}
where we used $|x+y|/2, |x-y|/2 \in [0,1]$. We have $\mathrm{depth}(\Phi^\times_\varepsilon) = 1 + \mathrm{depth}(s_n) = 1 + n \le 1 + \lceil \log_4(1/\varepsilon) \rceil$ and $\mathrm{size}(\Phi^\times_\varepsilon) \le C + 2\,\mathrm{size}(s_n) \le Cn \le C \cdot (1 - \log(\varepsilon))$ for some constant $C > 0$.

The fact that $\Phi^\times_\varepsilon$ maps from $[-1,1]^2$ to $[-1,1]$ follows by (7.2.2) and because $s_n : [0,1] \to [0,1]$. Finally, if $x = 0$, then $\Phi^\times_\varepsilon(x,y) = s_n(|x+y|/2) - s_n(|x-y|/2) = s_n(|y|/2) - s_n(|y|/2) = 0$. If $y = 0$ the same argument can be made.
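The following Python sketch (illustrative only; $s_n$ is realized via the recursion from Section 7.1) implements the construction behind Lemma 7.3, namely $\Phi^\times(x,y) = s_n(|x+y|/2) - s_n(|x-y|/2)$, and checks the error bound $4^{-n}$ on random inputs:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def s(n, x):
    # piecewise linear interpolant of x^2 on [0, 1], as constructed in Section 7.1
    h, out = x.copy(), x.copy()
    for j in range(1, n + 1):
        h = relu(2 * h) - relu(4 * h - 2)
        out = out - h / 2 ** (2 * j)
    return out

def approx_mult(n, x, y):
    # polarization: x*y = ((x+y)/2)^2 - ((x-y)/2)^2, with the squares replaced by s_n
    return s(n, np.abs(x + y) / 2) - s(n, np.abs(x - y) / 2)

rng = np.random.default_rng(0)
x, y = rng.uniform(-1, 1, 10_000), rng.uniform(-1, 1, 10_000)
for n in [2, 4, 6]:
    err = np.max(np.abs(x * y - approx_mult(n, x, y)))
    print(n, err, 4.0 ** (-n))  # error should stay below 4^(-n)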
In a similar way as in Proposition 4.8 and Lemma 5.11, we can apply operations with two inputs in the form of a binary tree to extend them to an operation on arbitrarily many inputs.
Proposition 7.4. For every $n \ge 2$ and $\varepsilon > 0$ there exists a ReLU neural network $\Phi^\times_{n,\varepsilon} : [-1,1]^n \to [-1,1]$ such that
\[
\sup_{x_j \in [-1,1]} \Big| \prod_{j=1}^{n} x_j - \Phi^\times_{n,\varepsilon}(x_1, \dots, x_n) \Big| \le \varepsilon,
\]
Proof. We begin with the case $n = 2^k$. For $k = 1$ let $\tilde\Phi^\times_{2,\delta} := \Phi^\times_\delta$. If $k \ge 2$ let
\[
\tilde\Phi^\times_{2^k,\delta} := \Phi^\times_\delta \circ \big( \tilde\Phi^\times_{2^{k-1},\delta}, \tilde\Phi^\times_{2^{k-1},\delta} \big).
\]
Using Lemma 7.3, we find that this neural network has depth bounded by
\[
\mathrm{depth}\big(\tilde\Phi^\times_{2^k,\delta}\big) \le k\,\mathrm{depth}(\Phi^\times_\delta) \le Ck \cdot (1 + |\log(\delta)|) \le C\log(n)(1 + |\log(\delta)|).
\]
Set
\[
e_k := \sup_{x_j \in [-1,1]} \Big| \prod_{j \le 2^k} x_j - \tilde\Phi^\times_{2^k,\delta}(x) \Big|.
\]
and we denote by C k,s (Ω) the set of functions f ∈ C k (Ω) for which ∥f ∥C k,s (Ω) < ∞.
Lemma 7.6. Let $d \in \mathbb{N}$, $k \in \mathbb{N}$, $s \in [0,1]$, $\Omega = [0,1]^d$ and $f \in C^{k,s}(\Omega)$. Then for all $a, x \in \Omega$
\[
f(x) = \sum_{\{\alpha \in \mathbb{N}_0^d \,|\, 0 \le |\alpha| \le k\}} \frac{D^\alpha f(a)}{\alpha!} (x - a)^\alpha + R_k(x), \qquad (7.3.2)
\]
where with $h := \max_{i \le d} |a_i - x_i|$ we have $|R_k(x)| \le h^{k+s} \frac{d^{k+1/2}}{k!} \|f\|_{C^{k,s}(\Omega)}$.
for some $\xi$ between $a$ and $t$. Now let $f \in C^{k,s}(\mathbb{R}^d)$ and $a, x \in \mathbb{R}^d$. Thus with $g(t) := f(a + t \cdot (x - a))$ it holds for $f(x) = g(1)$ that
\[
f(x) = \sum_{j=0}^{k-1} \frac{g^{(j)}(0)}{j!} + \frac{g^{(k)}(\xi)}{k!},
\]
where we use the multivariate notations $\binom{j}{\alpha} = \frac{j!}{\alpha!}$, $\alpha! = \prod_{j=1}^{d} \alpha_j!$ and $(x-a)^\alpha = \prod_{j=1}^{d} (x_j - a_j)^{\alpha_j}$.
Hence
\[
f(x) = \underbrace{\sum_{\{\alpha \in \mathbb{N}_0^d \,|\, 0 \le |\alpha| \le k\}} \frac{D^\alpha f(a)}{\alpha!} (x-a)^\alpha}_{\in \mathbb{P}_k}
+ \underbrace{\sum_{|\alpha| = k} \frac{D^\alpha f(a + \xi \cdot (x-a)) - D^\alpha f(a)}{\alpha!} (x-a)^\alpha}_{=: R_k},
\]
for some $\xi \in [0,1]$. Using the definition of $h$, the remainder term can be bounded by
\begin{align*}
|R_k| &\le h^k \max_{|\alpha|=k}\, \sup_{x \in \Omega,\, t \in [0,1]} |D^\alpha f(a + t \cdot (x-a)) - D^\alpha f(a)| \; \frac{1}{k!} \sum_{\{\alpha \in \mathbb{N}_0^d \,|\, |\alpha| = k\}} \binom{k}{\alpha} \\
&\le h^{k+s} \frac{d^{k+\frac12}}{k!} \|f\|_{C^{k,s}(\Omega)},
\end{align*}
where we used (7.3.1), $\|x - a\|_2 \le \sqrt{d}\, h$ and $\sum_{\{\alpha \in \mathbb{N}_0^d \,|\, |\alpha|=k\}} \binom{k}{\alpha} = (1 + \dots + 1)^k = d^k$ by the multinomial formula.
We now come to the main statement of this section. Up to logarithmic terms, it shows the
convergence rate (k + s)/d for approximating functions in C k,s ([0, 1]d ).
Theorem 7.7. Let $d \in \mathbb{N}$, $k \in \mathbb{N}_0$, $s \in [0,1]$, and $\Omega = [0,1]^d$. Then, there exists a constant $C > 0$ such that for every $f \in C^{k,s}(\Omega)$ and every $N \ge 2$ there exists a ReLU neural network $\Phi^f_N$ such that
\[
\sup_{x \in \Omega} |f(x) - \Phi^f_N(x)| \le C N^{-\frac{k+s}{d}} \|f\|_{C^{k,s}(\Omega)}, \qquad (7.3.3)
\]
Proof. The idea of the proof is to use the so-called “partition of unity method”: First we will construct a partition of unity $(\varphi_\nu)_\nu$, such that for an appropriately chosen $M \in \mathbb{N}$ each $\varphi_\nu$ has support on a $O(1/M)$ neighborhood of a point $\eta \in \Omega$. On each of these neighborhoods we will use the local Taylor polynomial $p_\nu$ of $f$ around $\eta$ to approximate the function. Then $\sum_\nu \varphi_\nu p_\nu$ gives an approximation to $f$ on $\Omega$. This approximation can be emulated by a neural network of the type $\sum_\nu \Phi^\times_\varepsilon(\varphi_\nu, \hat p_\nu)$, where $\hat p_\nu$ is a neural network approximation to the polynomial $p_\nu$.
It suffices to show the theorem in the case where
\[
\max\Big\{ \frac{d^{k+1/2}}{k!},\, \exp(d) \Big\} \|f\|_{C^{k,s}(\Omega)} \le 1.
\]
Step 1. We construct the neural network. Define
\[
M := \lceil N^{1/d} \rceil \quad \text{and} \quad \varepsilon := N^{-\frac{k+s}{d}}. \qquad (7.3.4)
\]
Consider a uniform simplicial mesh with nodes $\{\nu/M \,|\, \nu \le M\}$ where $\nu/M := (\nu_1/M, \dots, \nu_d/M)$, and where “$\nu \le M$” is short for $\{\nu \in \mathbb{N}_0^d \,|\, \nu_i \le M \text{ for all } i \le d\}$. We denote by $\varphi_\nu$ the cpwl basis function on this mesh such that $\varphi_\nu(\nu/M) = 1$ and $\varphi_\nu(\mu/M) = 0$ whenever $\mu \ne \nu$. As shown in Chapter 5, $\varphi_\nu$ is a neural network of size $O(1)$. Then
\[
\sum_{\nu \le M} \varphi_\nu \equiv 1 \quad \text{on } \Omega, \qquad (7.3.5)
\]
where $(i_{\alpha,1}, \dots, i_{\alpha,k}) \in \{0, \dots, d\}^k$ is arbitrary but fixed such that $|\{j \,|\, i_{\alpha,j} = r\}| = \alpha_r$ for all $r = 1, \dots, d$. Finally, define
\[
\Phi^f_N := \sum_{\nu \le M} \Phi^\times_\varepsilon(\varphi_\nu, \hat p_\nu). \qquad (7.3.7)
\]
Step 2. We bound the approximation error. First, for each $x \in \Omega$, using (7.3.5) and (7.3.6)
\[
\Big| f(x) - \sum_{\nu \le M} \varphi_\nu(x) p_\nu(x) \Big| \le \sum_{\nu \le M} |\varphi_\nu(x)|\,|p_\nu(x) - f(x)|,
\]
so that by Lemma 7.6
\[
\sup_{x \in \Omega} \Big| f(x) - \sum_{\nu \le M} \varphi_\nu(x) p_\nu(x) \Big| \le M^{-(k+s)} \frac{d^{k+\frac12}}{k!} \|f\|_{C^{k,s}(\Omega)} \le M^{-(k+s)}. \qquad (7.3.8)
\]
Next, fix $\nu \le M$ and $y \in \Omega$ such that $\|\nu/M - y\|_\infty \le 1/M \le 1$. Then by Proposition 7.4
\begin{align*}
|p_\nu(y) - \hat p_\nu(y)| &\le \sum_{|\alpha| \le k} \frac{|D^\alpha f(\frac{\nu}{M})|}{\alpha!} \Big| \prod_{j=1}^{k} \Big( y_{i_{\alpha,j}} - \frac{\nu_{i_{\alpha,j}}}{M} \Big) - \Phi^\times_{|\alpha|,\varepsilon}\Big( y_{i_{\alpha,1}} - \frac{\nu_{i_{\alpha,1}}}{M}, \dots, y_{i_{\alpha,k}} - \frac{\nu_{i_{\alpha,k}}}{M} \Big) \Big| \\
&\le \varepsilon \sum_{|\alpha| \le k} \frac{|D^\alpha f(\frac{\nu}{M})|}{\alpha!} \le \varepsilon \exp(d)\|f\|_{C^{k,s}(\Omega)} \le \varepsilon, \qquad (7.3.9)
\end{align*}
where we used
\[
\sum_{\{\alpha \in \mathbb{N}_0^d \,|\, |\alpha| \le k\}} \frac{1}{\alpha!} = \sum_{j=0}^{k} \sum_{\{\alpha \in \mathbb{N}_0^d \,|\, |\alpha| = j\}} \frac{1}{\alpha!} = \sum_{j=0}^{k} \frac{d^j}{j!} \le \sum_{j=0}^{\infty} \frac{d^j}{j!} = \exp(d).
\]
Fix $x \in \Omega$. Then $x$ belongs to a simplex of the mesh, and thus $x$ can be in the support of at most $d + 1$ (the number of nodes of a simplex) functions $\varphi_\nu$. Moreover, Lemma 7.3 implies that $\mathrm{supp}\,\Phi^\times_\varepsilon(\varphi_\nu(\cdot), \hat p_\nu(\cdot)) \subseteq \mathrm{supp}\,\varphi_\nu$. Hence, using Lemma 7.3 and (7.3.9)
\[
\Big| \sum_{\nu \le M} \varphi_\nu(x) p_\nu(x) - \sum_{\nu \le M} \Phi^\times_\varepsilon(\varphi_\nu(x), \hat p_\nu(x)) \Big|
\le \sum_{\{\nu \le M \,|\, x \in \mathrm{supp}\,\varphi_\nu\}} \big( |\varphi_\nu(x) p_\nu(x) - \varphi_\nu(x) \hat p_\nu(x)|
\]
With our choices in (7.3.4) this yields the error bound (7.3.3).
Step 3. It remains to bound the size and depth of the neural network in (7.3.7).
By Lemma 5.17, for each 0 ≤ ν ≤ M we have
where kT is the maximal number of simplices attached to a node in the mesh. Note that kT is
independent of M , so that the size and depth of φν are bounded by a constant Cφ independent of
M.
Lemma 7.3 and Proposition 7.4 thus imply with our choice of $\varepsilon = N^{-(k+s)/d}$
\begin{align*}
\mathrm{depth}(\Phi^f_N) &= \mathrm{depth}(\Phi^\times_\varepsilon) + \max_{\nu \le M} \mathrm{depth}(\varphi_\nu) + \max_{\nu \le M} \mathrm{depth}(\hat p_\nu) \\
&\le C \cdot (1 + |\log(\varepsilon)| + C_\varphi) + \mathrm{depth}(\Phi^\times_{k,\varepsilon}) \\
&\le C \cdot (1 + |\log(\varepsilon)| + C_\varphi) \\
&\le C \cdot (1 + \log(N))
\end{align*}
for some constant $C > 0$ depending on $k$ and $d$ (we use “$C$” to denote a generic constant that can change its value in each line).
To bound the size, we first observe with Lemma 5.4 that
\[
\mathrm{size}(\hat p_\nu) \le C \cdot \Big( 1 + \sum_{|\alpha| \le k} \mathrm{size}\big(\Phi^\times_{|\alpha|,\varepsilon}\big) \Big) \le C \cdot (1 + |\log(\varepsilon)|)
\]
for some $C$ depending on $k$. Thus, for the size of $\Phi^f_N$ we obtain with $M = \lceil N^{1/d} \rceil$
\begin{align*}
\mathrm{size}(\Phi^f_N) &\le C \cdot \Big( 1 + \sum_{\nu \le M} \big( \mathrm{size}(\Phi^\times_\varepsilon) + \mathrm{size}(\varphi_\nu) + \mathrm{size}(\hat p_\nu) \big) \Big) \\
&\le C \cdot (1 + M)^d (1 + |\log(\varepsilon)| + C_\varphi) \\
&\le C \cdot (1 + N^{1/d})^d (1 + C_\varphi + \log(N)) \\
&\le C N \log(N),
\end{align*}
which concludes the proof.
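The partition-of-unity strategy behind this proof can be illustrated numerically; the following Python sketch (a one-dimensional toy example, not part of the proof) approximates a $C^2$ function by $\sum_\nu \varphi_\nu p_\nu$ with hat functions $\varphi_\nu$ on a uniform grid and first-order Taylor polynomials $p_\nu$, where the products $\varphi_\nu \cdot p_\nu$ are computed exactly instead of via the network $\Phi^\times_\varepsilon$:

import numpy as np

def f(x):
    return np.sin(2 * np.pi * x)

def df(x):
    return 2 * np.pi * np.cos(2 * np.pi * x)

def pou_approx(M, x):
    # hat functions phi_nu centered at nu/M combined with local Taylor polynomials
    centers = np.arange(M + 1) / M
    approx = np.zeros_like(x)
    for c in centers:
        phi = np.maximum(1 - M * np.abs(x - c), 0.0)   # cpwl "hat" basis function
        p = f(c) + df(c) * (x - c)                      # first-order Taylor polynomial at c
        approx += phi * p
    return approx

x = np.linspace(0.0, 1.0, 5_001)
for M in [4, 8, 16, 32]:
    err = np.max(np.abs(f(x) - pou_approx(M, x)))
    print(M, err)   # error decays roughly like M^(-2) for this C^2 function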
Theorem 7.7 shows the convergence rate $(k+s)/d$ for approximating a $C^{k,s}$-function $f : [0,1]^d \to \mathbb{R}$. As long as $k$ is large, in principle we can achieve arbitrarily large (and $d$-independent if $k \ge d$) convergence rates. Crucially, and in contrast to Theorem 5.22, achieving error $N^{-\frac{k+s}{d}}$ requires the neural networks to be of size $O(N\log(N))$ and depth $O(\log(N))$, i.e. to get more and more accurate approximations, the neural network depth is required to increase.
Remark 7.8. Under the stronger assumption that f is an analytic function (in particular such an f is
in C ∞ ), one can show exponential convergence rates for ReLU networks of the type exp(−βN 1/(d+1) )
for some fixed β > 0 and where N corresponds again to the neural network size (up to logarithmic
terms), see [58, 166].
Remark 7.9. Let $L : x \mapsto Ax + b$, $\mathbb{R}^d \to \mathbb{R}^d$, be a bijective affine transformation and set $\Omega := L([0,1]^d) \subseteq \mathbb{R}^d$. Then for a function $f \in C^{k,s}(\Omega)$, by Theorem 7.7 there exists a neural network $\Phi^f_N$ such that
\[
\sup_{x \in \Omega} |f(x) - \Phi^f_N(L^{-1}(x))| = \sup_{x \in [0,1]^d} |f(L(x)) - \Phi^f_N(x)| \le C \|f \circ L\|_{C^{k,s}([0,1]^d)} N^{-\frac{k+s}{d}}.
\]
Since for $x \in [0,1]^d$ it holds $|f(L(x))| \le \sup_{y \in \Omega} |f(y)|$ and, if $0 \ne \alpha \in \mathbb{N}_0^d$ is a multiindex, $|D^\alpha(f(L(x)))| \le \|A\|_2^{|\alpha|} \sup_{y \in \Omega} |D^\alpha f(y)|$, we have $\|f \circ L\|_{C^{k,s}([0,1]^d)} \le (1 + \|A\|_2^{k+s})\|f\|_{C^{k,s}(\Omega)}$. Thus the convergence rate $N^{-\frac{k+s}{d}}$ is achieved on every set of the type $L([0,1]^d)$ for an affine map $L$, and in particular on every hypercube $\times_{j=1}^{d} [a_j, b_j]$.
Bibliography and further reading
This chapter is based on the seminal 2017 paper by Yarotsky [245], where the construction of
approximating the square function, the multiplication, and polynomials (discussed in Sections 7.1
and 7.2) was first introduced and analyzed. The construction relies on the sawtooth function
discussed in Section 6.2 and originally introduced by Telgarsky in [227]. Yarotsky’s work has
since sparked a large body of research, as it allows to lift polynomial approximation theory to
neural network classes. Convergence results based on this type of argument include for example
[174, 59, 150, 58, 166].
The approximation result derived in Section 7.3 for Hölder continuous functions follows by
standard approximation theory for piecewise polynomial functions. We point out that similar
results for the approximation of functions in C k or functions that are analytic can also be shown for
other activation functions than the ReLU; see in particular the works of Mhaskar [144, 145] and Section
6 in Pinkus’ Acta Numerica article [176] for sigmoidal and smooth activations. Additionally, the
more recent paper [48] specifically addresses the hyperbolic tangent activation. Finally, [81] studies
general activation functions that allow for the construction of approximate partitions of unity.
Chapter 8
High-dimensional approximation
In the previous chapters we established convergence rates for the approximation of a function f :
[0, 1]d → R by a neural network. For example, Theorem 7.7 provides the error bound O(N −(k+s)/d )
in terms of the network size N (up to logarithmic terms), where k and s describe the smoothness
of f . Achieving an accuracy of ε > 0, therefore, necessitates a network size N = O(ε−d/(k+s) )
(according to this bound). Hence, the size of the network needs to increase exponentially in d.
This exponential dependence on the dimension d is referred to as the curse of dimensionality
[16]. For classical smoothness spaces, such exponential d dependence cannot be avoided [16, 52, 164].
However, functions f that are of interest in practice may have additional properties, which allow
for better convergence rates.
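To put numbers on this effect (a simple illustration of the bound from Theorem 7.7 with all constants ignored), the following Python snippet evaluates the suggested network size $N = \varepsilon^{-d/(k+s)}$ for a fixed target accuracy and increasing dimension:

# required network size N = eps^(-d/(k+s)) suggested by Theorem 7.7 (constants ignored)
eps = 1e-2
k, s = 2, 0.0
for d in [1, 2, 4, 8, 16]:
    N = eps ** (-d / (k + s))
    print(d, f"{N:.3e}")
# for d = 16 and k + s = 2 this already requires N of order 10^16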
In this chapter, we discuss three scenarios under which the curse of dimensionality can be
mitigated. First, we examine an assumption limiting the behavior of functions in their Fourier
domain. This assumption allows for slow but dimension independent approximation rates. Second,
we consider functions with a specific compositional structure. Concretely, these functions are
constructed by compositions and linear combinations of simple low-dimensional subfunctions. In
this case, the curse of dimension is present but only through the input dimension of the subfunctions.
Finally, we study the situation, where we still approximate high-dimensional functions, but only care
about the approximation accuracy on a lower dimensional submanifold. Here, the approximation
rate is governed by the smoothness and the dimension of the manifold.
its Fourier transform. Then, for $C > 0$ the Barron class is defined as
\[
\Gamma_C := \Big\{ f \in L^1(\mathbb{R}^d) \,\Big|\, \|\hat f\|_{L^1(\mathbb{R}^d)} < \infty,\ \int_{\mathbb{R}^d} |2\pi\xi|\,|\hat f(\xi)| \, d\xi < C \Big\}.
\]
We point out that the definition of ΓC in [10] is more general, but our assumption will simplify
some of the arguments. Nonetheless, the following proof is very close to the original result, and the
presentation is similar to [175, Section 5]. Theorem 1 in [10] reads as follows.
Theorem 8.1. Let σ : R → R be sigmoidal (see Definition 3.11) and let f ∈ ΓC for some C > 0.
Denote by B1d := {x ∈ Rd | ∥x∥ ≤ 1} the unit ball. Then, for every c > 4C 2 and every N ∈ N there
exists a neural network Φf with architecture (σ; d, N, 1) such that
\[
\frac{1}{|B_1^d|} \int_{B_1^d} \big| f(x) - \Phi^f(x) \big|^2 \, dx \le \frac{c}{N}, \qquad (8.1.1)
\]
Remark 8.2. The approximation rate in (8.1.1) can be slightly improved under some assumptions on the activation function, such as powers of the ReLU, [213].
Importantly, the dimension $d$ does not enter on the right-hand side of (8.1.1); in particular the convergence rate is not directly affected by the dimension, which is in stark contrast to the results of the previous chapters. However, it should be noted that the constant $C_f$ may still have some inherent $d$-dependence, see Exercise 8.10.
The proof of Theorem 8.1 is based on a peculiar property of high-dimensional convex sets, which
is described by the (approximate) Caratheodory theorem, the original version of which was given
in [31]. The more general version stated in the following lemma follows [236, Theorem 0.0.2] and
[10, 177]. For its statement recall that co(G) denotes the closure of the convex hull of G.
Lemma 8.3. Let $H$ be a Hilbert space, and let $G \subseteq H$ be such that for some $B > 0$ it holds that $\|g\|_H \le B$ for all $g \in G$. Let $f \in \mathrm{co}(G)$. Then, for every $N \in \mathbb{N}$ and every $c > B^2$ there exist $(g_i)_{i=1}^N \subseteq G$ such that
\[
\Big\| f - \frac{1}{N} \sum_{i=1}^{N} g_i \Big\|_H^2 \le \frac{c}{N}. \qquad (8.1.2)
\]
Proof. Fix $\varepsilon > 0$ and $N \in \mathbb{N}$. Since $f \in \mathrm{co}(G)$, there exist coefficients $\alpha_1, \dots, \alpha_m \in [0,1]$ summing to 1, and linearly independent elements $h_1, \dots, h_m \in G$ such that
\[
f^* := \sum_{j=1}^{m} \alpha_j h_j
\]
satisfies $\|f - f^*\|_H < \varepsilon$. We claim that there exist $g_1, \dots, g_N$, each in $\{h_1, \dots, h_m\}$, such that
\[
\Big\| f^* - \frac{1}{N} \sum_{j=1}^{N} g_j \Big\|_H^2 \le \frac{B^2}{N}. \qquad (8.1.3)
\]
Since ε > 0 was arbitrary, this then concludes the proof. Since there exists an isometric isomorphism
from span{h1 , . . . , hm } to Rm , there is no loss of generality in assuming H = Rm in the following.
Let Xi , i = 1, . . . , N , be i.i.d. Rm -valued random variables with
Lemma 8.3 provides a powerful tool: If we want to approximate a function f with a superposition
of N elements in a set G, then it is sufficient to show that f can be represented as an arbitrary
(infinite) convex combination of elements of G.
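The probabilistic mechanism behind Lemma 8.3 (sampling the $g_i$ i.i.d. according to the convex weights, so that the empirical mean has squared error of order $B^2/N$) can be illustrated numerically; the following Python sketch uses a toy finite-dimensional example and is not part of the proof:

import numpy as np

rng = np.random.default_rng(1)
m, dim = 50, 20
# elements of G: random vectors with norm at most B = 1
H = rng.normal(size=(m, dim))
H /= np.linalg.norm(H, axis=1, keepdims=True)
alpha = rng.dirichlet(np.ones(m))      # convex weights
f_star = alpha @ H                     # f* = sum_j alpha_j h_j

for N in [10, 100, 1000]:
    # draw g_1, ..., g_N i.i.d. with P(g = h_j) = alpha_j and average them
    idx = rng.choice(m, size=N, p=alpha)
    emp_mean = H[idx].mean(axis=0)
    err2 = np.sum((f_star - emp_mean) ** 2)
    print(N, err2, 1.0 / N)   # squared error is of order B^2 / N = 1/N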
Lemma 8.3 suggests that we can prove Theorem 8.1 by showing that each function in ΓC belongs
to the convex hull of neural networks with just a single neuron. We make a small detour before
proving this result. We first show that each function f ∈ ΓC is in the convex hull of affine transforms
of Heaviside functions. We define the set of affine transforms of Heaviside functions GC as
\[
G_C := \Big\{ B_1^d \ni x \mapsto \gamma \cdot \mathbb{1}_{\mathbb{R}_+}(\langle a, x\rangle + b) \,\Big|\, a \in \mathbb{R}^d,\ b \in \mathbb{R},\ |\gamma| \le 2C \Big\}.
\]
The following lemma, corresponding to [175, Lemma 5.12], provides a link between $\Gamma_C$ and $G_C$.

Lemma 8.4. Let $d \in \mathbb{N}$, $C > 0$ and $f \in \Gamma_C$. Then $f|_{B_1^d} - f(0) \in \mathrm{co}(G_C)$, where the closure is taken with respect to the norm
\[
\|g\|_{L^{2,\diamond}(B_1^d)} := \Big( \frac{1}{|B_1^d|} \int_{B_1^d} |g(x)|^2 \, dx \Big)^{1/2}.
\]
Proof. Since $f \in \Gamma_C$, we have that $f, \hat f \in L^1(\mathbb{R}^d)$. Hence, we can apply the inverse Fourier transform and get the following computation:
\begin{align*}
f(x) - f(0) &= \int_{\mathbb{R}^d} \hat f(\xi) \big( e^{2\pi i \langle x, \xi\rangle} - 1 \big) \, d\xi \\
&= \int_{\mathbb{R}^d} |\hat f(\xi)| \big( e^{2\pi i \langle x, \xi\rangle + i\kappa(\xi)} - e^{i\kappa(\xi)} \big) \, d\xi \\
&= \int_{\mathbb{R}^d} |\hat f(\xi)| \big( \cos(2\pi \langle x, \xi\rangle + \kappa(\xi)) - \cos(\kappa(\xi)) \big) \, d\xi,
\end{align*}
where $\kappa(\xi)$ is the phase of $\hat f(\xi)$ and the last equality follows since $f$ is real-valued.
To use the fact that $f$ has a bounded Fourier moment, we reformulate the integral as
\[
\int_{\mathbb{R}^d} |\hat f(\xi)| \big( \cos(2\pi \langle x, \xi\rangle + \kappa(\xi)) - \cos(\kappa(\xi)) \big) \, d\xi
= \int_{\mathbb{R}^d} |2\pi\xi|\,|\hat f(\xi)| \, \frac{\cos(2\pi \langle x, \xi\rangle + \kappa(\xi)) - \cos(\kappa(\xi))}{|2\pi\xi|} \, d\xi.
\]
We define a new measure $\Lambda$ with density
\[
d\Lambda(\xi) := \frac{1}{C} |2\pi\xi|\,|\hat f(\xi)| \, d\xi.
\]
Since $f \in \Gamma_C$, it follows that $\Lambda$ is a probability measure on $\mathbb{R}^d$. Now we have that
\[
f(x) - f(0) = C \int_{\mathbb{R}^d} \frac{\cos(2\pi \langle x, \xi\rangle + \kappa(\xi)) - \cos(\kappa(\xi))}{|2\pi\xi|} \, d\Lambda(\xi). \qquad (8.1.5)
\]
Next, we would like to replace the integral of (8.1.5) by an appropriate finite sum.
The cosine function is 1-Lipschitz. Hence, we note that ξ 7→ qx (ξ) := (cos(2π⟨x, ξ⟩ + κ(ξ)) −
cos(κ(ξ)))/|2πξ| is bounded by 1. In addition, it is easy to see that qx is well-defined and continuous
even in the origin.
Therefore, the integral (8.1.5) can be approximated by a Riemann sum, i.e.,
\[
\Big| C \int_{\mathbb{R}^d} q_x(\xi) \, d\Lambda(\xi) - C \sum_{\theta \in \frac{1}{n}\mathbb{Z}^d} q_x(\theta) \cdot \Lambda(I_\theta) \Big| \to 0, \qquad (8.1.6)
\]
Since $\sum_{\theta \in \frac{1}{n}\mathbb{Z}^d} \Lambda(I_\theta) = \Lambda(\mathbb{R}^d) = 1$, we conclude that $f(x) - f(0)$ is in the $L^{2,\diamond}(B_1^d)$ closure of convex combinations of functions of the form
\[
x \mapsto g_\theta(x) := \alpha_\theta\, q_x(\theta),
\]
for θ ∈ Rd and 0 ≤ αθ ≤ C.
Now we only need to prove that each $g_\theta$ is in $\mathrm{co}(G_C)$. By setting $z = \langle x, \theta/|\theta|\rangle$, we observe that the result follows if the map
\[
[-1,1] \ni z \mapsto \gamma \mathbb{1}_{\mathbb{R}_+}(a' z + b'), \qquad (8.1.8)
\]
Per construction, $g_{T,-} + g_{T,+}$ converges to $\tilde g_\theta$ for $T \to \infty$. Moreover, $\|\tilde g_\theta'\|_{L^\infty(\mathbb{R})} \le C$ and hence
\[
\sum_{i=1}^{T} \frac{|\tilde g_\theta(i/T) - \tilde g_\theta((i-1)/T)|}{2C} + \sum_{i=1}^{T} \frac{|\tilde g_\theta(-i/T) - \tilde g_\theta((1-i)/T)|}{2C}
\le \frac{2}{2CT} \sum_{i=1}^{T} \|\tilde g_\theta'\|_{L^\infty(\mathbb{R})} \le 1.
\]
We conclude that gT,− + gT,+ is a convex combination of functions of the form (8.1.8). Hence,
g̃θ can be arbitrarily well approximated by convex combinations of the form (8.1.8). Therefore
gθ ∈ co(GC ). Finally, (8.1.7) yields that f − f (0) ∈ co(GC ).
Proof of Theorem 8.1. By Lemma 8.4, it holds that $f|_{B_1^d} - f(0) \in \mathrm{co}(G_C)$.
It is not hard to see that for every $g \in G_C$ it holds $\|g\|_{L^{2,\diamond}(B_1^d)} \le 2C$. Applying Lemma 8.3 with the Hilbert space $L^{2,\diamond}(B_1^d)$, we get that for every $N \in \mathbb{N}$ there exist $|\gamma_i| \le 2C$, $a_i \in \mathbb{R}^d$, $b_i \in \mathbb{R}$, for $i = 1, \dots, N$, so that
\[
\frac{1}{|B_1^d|} \int_{B_1^d} \Big| f(x) - f(0) - \sum_{i=1}^{N} \gamma_i \mathbb{1}_{\mathbb{R}_+}(\langle a_i, x\rangle + b_i) \Big|^2 dx \le \frac{4C^2}{N}.
\]
By Exercise 3.24, it holds that σ(λ·) → 1R+ for λ → ∞ almost everywhere. Thus, for every δ > 0
there exist ãi , b̃i , i = 1, . . . , N , so that
\[
\frac{1}{|B_1^d|} \int_{B_1^d} \Big| f(x) - f(0) - \sum_{i=1}^{N} \gamma_i \sigma\big(\langle \tilde a_i, x\rangle + \tilde b_i\big) \Big|^2 dx \le \frac{4C^2}{N} + \delta.
\]
The result follows by observing that
\[
\sum_{i=1}^{N} \gamma_i \sigma\big(\langle \tilde a_i, x\rangle + \tilde b_i\big) + f(0)
\]
is a neural network with architecture $(\sigma; d, N, 1)$.
The dimension-independent approximation rate of Theorem 8.1 may seem surprising, especially
in comparison to the results in Chapters 4 and 5. However, this can be explained by recognizing
that the assumption of a finite Fourier moment is effectively a dimension-dependent regularity
assumption. Indeed, the condition becomes more restrictive in higher dimensions and hence the
complexity of ΓC does not grow with the dimension.
To further explain this, let us relate the Barron class to classical function spaces. In [10, Section
II] it was observed that a sufficient condition is that all derivatives of order up to ⌊d/2⌋ + 2 are
square-integrable. In other words, if f belongs to the Sobolev space H ⌊d/2⌋+2 (Rd ), then f is a
Barron function. Importantly, the functions must become smoother, as the dimension increases.
This assumption would also imply an approximation rate of N −1/2 in the L2 norm by sums of
at most N B-splines, see [168, 52]. However, in such estimates some constants may still depend
exponentially on d, whereas all constants in Theorem 8.1 are controlled independently of d.
Another notable aspect of the approximation of Barron functions is that the absolute values of
the weights other than the output weights are not bounded by a constant. To see this, we refer
to (8.1.6), where arbitrarily large θ need to be used. While ΓC is a compact set, the set of neural
networks of the specified architecture for a fixed N ∈ N is not parameterized with a compact
parameter set. In a certain sense, this is reminiscent of Proposition 3.19 and Theorem 3.20, where
arbitrarily strong approximation rates were achieved by using a very complex activation function
and a non-compact parameter space.
With each vertex ηj for j > d we associate a function fj : Rdj → R. Here dj denotes the
cardinality of the set Sj , which is defined as the set of indices i corresponding to vertices ηi for
which we have an edge from ηi to ηj . Without loss of generality, we assume that m ≥ dj = |Sj | ≥ 1
for all j > d. Finally, we let
and1
we denote the set of all functions of the type FM by F k,s (m, d, M ). Figure 8.1 shows possible
graphs of such functions.
Clearly, for $s = 0$, $\mathcal{F}^{k,0}(m,d,M) \subseteq C^k(\mathbb{R}^d)$ since the composition of functions in $C^k$ belongs again to $C^k$. A direct application of Theorem 7.7 allows to approximate $F_M \in \mathcal{F}^{k}(m,d,M)$ with a neural network of size $O(N\log(N))$ and error $O(N^{-\frac{k}{d}})$. Since each $f_j$ depends only on $m$ variables, intuitively we expect an error convergence of type $O(N^{-\frac{k}{m}})$ with the constant somehow depending on the number $M$ of vertices. To show that this is actually possible, in the following we associate with each node $\eta_j$ a depth $l_j \ge 0$, such that $l_j$ is the maximum number of edges connecting $\eta_j$ to one of the nodes $\{\eta_1, \dots, \eta_d\}$.
Figure 8.1: Three types of graphs that could be the basis of compositional functions. The associated
functions are composed of two or three-dimensional functions only.
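To illustrate the class $\mathcal{F}^{k,s}(m,d,M)$ with a toy example (the subfunctions below are hypothetical and chosen for illustration, not taken from the text), the following Python snippet builds a function of $d = 4$ variables from three bivariate subfunctions arranged as a binary tree, so that $m = 2$ governs the expected approximation rate rather than $d = 4$:

import numpy as np

# bivariate subfunctions f_j (each depends on at most m = 2 variables)
def f1(a, b):
    return np.sin(a + b) / 2

def f2(a, b):
    return np.cos(a - b) / 2

def f3(a, b):
    return a * b

def F(x):
    # compositional structure: F(x1,...,x4) = f3(f1(x1,x2), f2(x3,x4))
    return f3(f1(x[..., 0], x[..., 1]), f2(x[..., 2], x[..., 3]))

x = np.random.default_rng(0).uniform(-1, 1, size=(5, 4))
print(F(x))   # a 4-dimensional function whose "effective" input dimension is 2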
Proposition 8.5. Let k, m, d, M ∈ N and s > 0. Let FM ∈ F k,s (m, d, M ). Then there exists a
constant C = C(m, k + s, M ) such that for every N ∈ N there exists a ReLU neural network ΦFM
1
The ordering of the inputs (Fi )i∈Sj in (8.2.1b) is arbitrary but considered fixed throughout.
such that
and
\[
\sup_{x \in [0,1]^d} |F_M(x) - \Phi^{F_M}(x)| \le N^{-\frac{k+s}{m}}.
\]
Proof. Throughout this proof we assume without loss of generality that the indices follow a topo-
logical ordering, i.e., they are ordered such that Sj ⊆ {1, . . . , j − 1} for all j (i.e. the inputs of
vertex ηj can only be vertices ηi with i < j).
Step 1. First assume that fˆj are functions such that
|fj (x) − fˆj (x)| ≤ δj := ε · (2m)−(M +1−j) for all x ∈ [−2, 2]dj . (8.2.3)
Let F̂j be defined as in (8.2.1), but with all fj in (8.2.1b) replaced by fˆj . We now check the error
of the approximation F̂M to FM . To do so we proceed by induction over j and show that for all
x ∈ [−1, 1]d
Note that due to ∥fj ∥C k ≤ 1 we have |Fj (x)| ≤ 1 and thus (8.2.4) implies in particular that
F̂j (x) ∈ [−2, 2].
For j = 1 it holds F1 (x1 ) = F̂1 (x1 ) = x1 , and thus (8.2.4) is valid for all x1 ∈ [−1, 1]. For the
induction step, for all x ∈ [−1, 1]d by (8.2.3) and the induction hypothesis
\begin{align*}
|F_j(x) - \hat F_j(x)| &= |f_j((F_i)_{i \in S_j}) - \hat f_j((\hat F_i)_{i \in S_j})| \\
&\le |f_j((F_i)_{i \in S_j}) - f_j((\hat F_i)_{i \in S_j})| + |f_j((\hat F_i)_{i \in S_j}) - \hat f_j((\hat F_i)_{i \in S_j})| \\
&\le \sum_{i \in S_j} |F_i - \hat F_i| + \delta_j
\end{align*}
This shows that (8.2.4) holds, and thus for all x ∈ [−1, 1]d
Step 2. We sketch a construction of how to write $\hat F_M$ from Step 1 as a neural network $\Phi^{F_M}$ of the claimed size and depth bounds. Fix $N \in \mathbb{N}$ and let
\[
N_j := \big\lceil N (2m)^{\frac{m}{k+s}(M+1-j)} \big\rceil.
\]
By Theorem 7.7 (and Remark 7.9) there exist ReLU neural networks $\Phi^{f_j}$ such that
\[
\sup_{x \in [-2,2]^{d_j}} |f_j(x) - \Phi^{f_j}(x)| \le N_j^{-\frac{k+s}{m}} \le N^{-\frac{k+s}{m}} (2m)^{-(M+1-j)} \qquad (8.2.5)
\]
and
\[
\mathrm{size}(\Phi^{f_j}) \le C N_j \log(N_j) \le C N (2m)^{\frac{m(M+1-j)}{k+s}} \Big( \log(N) + \frac{m(M+1-j)}{k+s} \log(2m) \Big)
\]
as well as
\[
\mathrm{depth}(\Phi^{f_j}) \le C \cdot \Big( \log(N) + \frac{m(M+1-j)}{k+s} \log(2m) \Big).
\]
Then
\[
\sum_{j=1}^{M} \mathrm{size}(\Phi^{f_j}) \le 2CN\log(N) \sum_{j=1}^{M} (2m)^{\frac{m(M+1-j)}{k+s}} \le 2CN\log(N) \sum_{j=1}^{M} (2m)^{\frac{mj}{k+s}} \le 2CN\log(N)\,(2m)^{\frac{m(M+1)}{k+s}}.
\]
Here we used $\sum_{j=1}^{M} a^j \le \int_1^{M+1} \exp(\log(a)x)\, dx \le \frac{1}{\log(a)} a^{M+1}$.
The function $\hat F_M$ from Step 1 then will yield error $N^{-\frac{k+s}{m}}$ by (8.2.3) and (8.2.5). We observe that $\hat F_M$ can be constructed inductively as a neural network $\Phi^{F_M}$ by propagating all values $\Phi^{F_1}, \dots, \Phi^{F_j}$ to all consecutive layers using identity neural networks and then using the outputs of $(\Phi^{F_i})_{i \in S_{j+1}}$ as input to $\Phi^{f_{j+1}}$. The depth of this neural network is bounded by
\[
\sum_{j=1}^{M} \mathrm{depth}(\Phi^{f_j}) = O(M\log(N)).
\]
We have at most $\sum_{j=1}^{M} |S_j| \le mM$ values which need to be propagated through these $O(M\log(N))$ layers, amounting to an overhead $O(mM^2\log(N)) = O(\log(N))$ for the identity neural networks. In all, the neural network size is thus $O(N\log(N))$.
Remark 8.6. From the proof we observe that the constant $C$ in Proposition 8.5 behaves like $O\big((2m)^{\frac{m(M+1)}{k+s}}\big)$.
approximation error on $\mathcal{M}$, then we can again show that it is $m$ rather than $d$ that determines the rate of convergence.
To explain the idea, we assume in the following that M is a smooth, compact m-dimensional
manifold in Rd . Moreover, we suppose that there exists δ > 0 and finitely many points x1 , . . . , xM ∈
M such that the balls $B_{\delta/2}(x_i) := \{y \in \mathbb{R}^d \,|\, \|y - x_i\|_2 < \delta/2\}$ for $i = 1, \dots, M$ cover $\mathcal{M}$ (for
every δ > 0 such xi exist since M is compact). Moreover, denoting by Tx M ≃ Rm the tangential
space of M at x, we assume δ > 0 to be so small that the orthogonal projection
is injective, the set πj (Bδ (xj ) ∩ M) ⊆ Txj M has C ∞ boundary, and the inverse projection
where
In the following, for f : M → R, k ∈ N0 , and s ∈ [0, 1) we let
Proof. Since $\mathcal{M}$ is compact there exists $A > 0$ such that $\mathcal{M} \subseteq [-A,A]^d$. Similarly as in the proof of Theorem 7.7, we consider a uniform mesh with nodes $\{-A + 2A\frac{\nu}{n} \,|\, \nu \le n\}$, and the corresponding piecewise linear basis functions forming the partition of unity $\sum_{\nu \le n} \varphi_\nu \equiv 1$ on $[-A,A]^d$, where $\mathrm{supp}\,\varphi_\nu \subseteq \{y \in \mathbb{R}^d \,|\, \|(-A + 2A\frac{\nu}{n}) - y\|_\infty \le \frac{2A}{n}\}$. Let $\delta > 0$ be such as in the beginning of this section. Since $\mathcal{M}$ is covered by the balls $(B_{\delta/2}(x_j))_{j=1}^M$, fixing $n \in \mathbb{N}$ large enough, for each $\nu$ such that $\mathrm{supp}\,\varphi_\nu \cap \mathcal{M} \ne \emptyset$ there exists $j(\nu) \in \{1, \dots, M\}$ such that $\mathrm{supp}\,\varphi_\nu \subseteq B_\delta(x_{j(\nu)})$, and we set $I_j := \{\nu \le n \,|\, j = j(\nu)\}$. Then we have for all $x \in \mathcal{M}$
\[
f(x) = \sum_{\nu \le n} \varphi_\nu(x)\, f_{j(\nu)}(\pi_{j(\nu)}(x)) = \sum_{j=1}^{M} \sum_{\nu \in I_j} \varphi_\nu(x)\, f_j(\pi_j(x)). \qquad (8.3.3)
\]
Next, we approximate the functions $f_j$. Let $C_j$ be the smallest ($m$-dimensional) cube in $T_{x_j}\mathcal{M} \simeq \mathbb{R}^m$ such that $\pi_j(B_\delta(x_j) \cap \mathcal{M}) \subseteq C_j$. The function $f_j$ can be extended to a function on $C_j$ (we will use the same notation for this extension) such that
for some constant depending on $\pi_j(B_\delta(x_j) \cap \mathcal{M})$ but independent of $f$. Such an extension result can, for example, be found in [216, Chapter VI]. By Theorem 7.7 (also see Remark 7.9), there exists a neural network $\hat f_j : C_j \to \mathbb{R}$ such that
\[
\sup_{x \in C_j} |f_j(x) - \hat f_j(x)| \le C N^{-\frac{k+s}{m}} \qquad (8.3.4)
\]
Define
\[
\Phi_N := \sum_{j=1}^{M} \sum_{\nu \in I_j} \Phi^\times_\varepsilon(\varphi_\nu, \hat f_j \circ \pi_j),
\]
where we note that πj is linear and thus fˆj ◦ πj can be expressed by a neural network. First let us
estimate the error of this approximation. For $x \in \mathcal{M}$
\begin{align*}
|f(x) - \Phi_N(x)| &\le \sum_{j=1}^{M} \sum_{\nu \in I_j} |\varphi_\nu(x) f_j(\pi_j(x)) - \Phi^\times_\varepsilon(\varphi_\nu(x), \hat f_j(\pi_j(x)))| \\
&\le \sum_{j=1}^{M} \sum_{\nu \in I_j} \Big( |\varphi_\nu(x) f_j(\pi_j(x)) - \varphi_\nu(x) \hat f_j(\pi_j(x))| + |\varphi_\nu(x) \hat f_j(\pi_j(x)) - \Phi^\times_\varepsilon(\varphi_\nu(x), \hat f_j(\pi_j(x)))| \Big) \\
&\le \sup_{i \le M} \|f_i - \hat f_i\|_{L^\infty(C_i)} \sum_{j=1}^{M} \sum_{\nu \in I_j} |\varphi_\nu(x)| + \sum_{j=1}^{M} \sum_{\{\nu \in I_j \,|\, x \in \mathrm{supp}\,\varphi_\nu\}} \varepsilon \\
&\le C N^{-\frac{k+s}{m}} + d\varepsilon \le C N^{-\frac{k+s}{m}},
\end{align*}
where we used that x can be in the support of at most d of the φν , and where C is a constant
depending on d and M.
Finally, let us bound the size and depth of this approximation. Using $\mathrm{size}(\varphi_\nu) \le C$, $\mathrm{depth}(\varphi_\nu) \le C$ (see (5.3.12)), $\mathrm{size}(\Phi^\times_\varepsilon) \le C|\log(\varepsilon)| \le C\log(N)$ and $\mathrm{depth}(\Phi^\times_\varepsilon) \le C|\log(\varepsilon)| \le C\log(N)$ (see Lemma 7.3) we find
\[
\sum_{j=1}^{M} \sum_{\nu \in I_j} \Big( \mathrm{size}(\Phi^\times_\varepsilon) + \mathrm{size}(\varphi_\nu) + \mathrm{size}(\hat f_j \circ \pi_j) \Big) \le \sum_{j=1}^{M} \sum_{\nu \in I_j} \Big( C\log(N) + C + CN\log(N) \Big)
\]
Another prominent direction, omitted in this chapter, pertains to scientific machine learn-
ing. High-dimensional functions often arise from (parametric) PDEs, which have a rich literature
describing their properties and structure. Various results have shown that neural networks can
leverage the inherent low-dimensionality known to exist in such problems. Efficient approximation
of certain classes of high-dimensional (or even infinite-dimensional) analytic functions, ubiquitous
in parametric PDEs, has been verified in [208, 209]. Further general analyses for high-dimensional
parametric problems can be found in [167, 122], and results exploiting specific structural conditions
of the underlying PDEs, e.g., in [125, 198]. Additionally, [58, 150, 166] provide results regarding
fast convergence for certain smooth functions in potentially high but finite dimensions.
For high-dimensional PDEs, elliptic problems have been addressed in [78], linear and semilin-
ear parabolic evolution equations have been explored in [79, 71, 100], and stochastic differential
equations in [109, 80].
Exercises
Exercise 8.8. Let C > 0 and d ∈ N. Show that, if g ∈ ΓC , then
for every a ∈ R+ , b ∈ Rd .
Exercise 8.9. Let $C > 0$ and $d \in \mathbb{N}$. Show that, for $g_i \in \Gamma_C$, $i = 1, \dots, m$ and $c = (c_i)_{i=1}^m$ it holds that
\[
\sum_{i=1}^{m} c_i g_i \in \Gamma_{\|c\|_1 C}.
\]
Exercise 8.10. Show that for every $d \in \mathbb{N}$ the function $f(x) := \exp(-\|x\|_2^2/2)$, $x \in \mathbb{R}^d$, belongs to $\Gamma_d$, and that it holds $C_f = O(\sqrt{d})$ for $d \to \infty$.
Chapter 9
Interpolation
The learning problem associated to minimizing the empirical risk of (1.2.3) is based on minimizing
an error that results from evaluating a neural network on a finite set of (training) points. In
contrast, all previous approximation results focused on achieving uniformly small errors across the
entire domain. Finding neural networks that achieve a small training error appears to be much
simpler, since, instead of ∥f − Φn ∥∞ → 0 for a sequence of neural networks Φn , it suffices to have
Φn (xi ) → f (xi ) for all xi in the training set.
In this chapter, we study the extreme case of the aforementioned approximation problem. We
analyze under which conditions it is possible to find a neural network that coincides with the target
function f at all training points. This is referred to as interpolation. To make this notion more
precise, we state the following definition.
Definition 9.1 (Interpolation). Let $d, m \in \mathbb{N}$, and let $\Omega \subseteq \mathbb{R}^d$. We say that a set of functions $\mathcal{H} \subseteq \{h : \Omega \to \mathbb{R}\}$ interpolates $m$ points in $\Omega$, if for every $S = (x_i, y_i)_{i=1}^m \subseteq \Omega \times \mathbb{R}$, such that $x_i \ne x_j$ for $i \ne j$, there exists a function $h \in \mathcal{H}$ such that $h(x_i) = y_i$ for all $i = 1, \dots, m$.
We start our analysis of the interpolation properties of neural networks by presenting a result
similar to the universal approximation theorem but for interpolation in the following section. In
the subsequent section, we then look at interpolation with desirable properties.
Theorem 9.3 (Universal Interpolation Theorem). Let d, n ∈ N and let σ ∈ M not be a polynomial.
Then Nd1 (σ, 1, n) interpolates n + 1 points in Rd .
To do so, we proceed by induction over k = 0, . . . , n, to show that there exist (wj )kj=1 and
(bj )kj=1 such that the first k + 1 columns of A are linearly independent. The case k = 0 is trivial.
Next let 0 < k < n and assume that the first k columns of A are linearly independent. We wish to
find wk , bk such that the first k + 1 columns are linearly independent. Suppose such wk , bk do not
exist and denote by $Y_k \subseteq \mathbb{R}^{n+1}$ the space spanned by the first $k$ columns of $A$. Then for all $w \in \mathbb{R}^d$, $b \in \mathbb{R}$ the vector $(\sigma(w^\top x_i + b))_{i=1}^{n+1} \in \mathbb{R}^{n+1}$ must belong to $Y_k$. Fix $y = (y_i)_{i=1}^{n+1} \in \mathbb{R}^{n+1} \setminus Y_k$. Then
\[
\inf_{\tilde\Phi \in \mathcal{N}_d^1(\sigma,1)} \|(\tilde\Phi(x_i))_{i=1}^{n+1} - y\|_2^2 = \inf_{N, w_j, b_j, v_j, c} \sum_{i=1}^{n+1} \Big( \sum_{j=1}^{N} v_j \sigma(w_j^\top x_i + b_j) + c - y_i \Big)^2
\]
Since we can find a continuous function f : Rd → R such that f (xi ) = yi for all i = 1, . . . , n + 1,
this contradicts Theorem 3.8.
9.2.1 Motivation
In the previous section, we observed that neural networks with m − 1 ∈ N hidden neurons can
interpolate m points for every reasonable activation function. However, not all interpolants are
equally suitable for a given application. For instance, consider Figure 9.1 for a comparison between
polynomial and piecewise affine interpolation on the unit interval.
The two interpolants exhibit rather different behaviors. In general, there is no way of deter-
mining which constitutes a better approximation to f . In particular, given our limited information
about f , we cannot accurately reconstruct any additional features that may exist between inter-
polation points x1 , . . . , xm . In accordance with Occam’s razor, it thus seems reasonable to assume
that f does not exhibit extreme oscillations or behave erratically between interpolation points.
As such, the piecewise interpolant appears preferable in this scenario. One way to formalize the
assumption that f does not “exhibit extreme oscillations” is to assume that the Lipschitz constant
\[
\mathrm{Lip}(f) := \sup_{x \ne y} \frac{|f(x) - f(y)|}{\|x - y\|} \qquad (9.2.1)
\]
of f is bounded by a fixed value M ∈ R. Here ∥ · ∥ denotes an arbitrary fixed norm on Rd .
How should we choose M ? For every function f : Ω → R satisfying
\[
f(x_i) = y_i \quad \text{for all } i = 1, \dots, m, \qquad (9.2.2)
\]
we have
\[
\mathrm{Lip}(f) = \sup_{x \ne y \in \Omega} \frac{|f(x) - f(y)|}{\|x - y\|} \ge \sup_{i \ne j} \frac{|y_i - y_j|}{\|x_i - x_j\|} =: \tilde M. \qquad (9.2.3)
\]
Figure 9.1: Interpolation of eight points by a polynomial of degree seven and by a piecewise affine
spline. The polynomial interpolation has a significantly larger derivative or Lipschitz constant than
the piecewise affine interpolator.
Because of this, we fix M as a real number greater than or equal to M̃ for the remainder of our
analysis.
denoting the set of all functions with Lipschitz constant at most M , we want to solve the following
problem:
The next theorem shows that a function Φ as in (9.2.5) indeed exists. This Φ not only allows
for an explicit formula, it also belongs to LipM (Ω) and additionally interpolates the data. Hence,
it is not just an optimal reconstruction, it is also an optimal interpolant. This theorem goes back
to [13], which, in turn, is based on [219].
Theorem 9.5. Let m, d ∈ N, Ω ⊆ Rd , f : Ω → R, and let x1 , . . . , xm ∈ Ω, y1 , . . . , ym ∈ R satisfy
(9.2.2) and (9.2.3) with M̃ > 0. Further, let M ≥ M̃ .
Then, Problem 9.4 has at least one solution given by
\[
\Phi(x) := \frac{1}{2}\big(f_{\mathrm{upper}}(x) + f_{\mathrm{lower}}(x)\big) \quad \text{for } x \in \Omega, \qquad (9.2.6)
\]
where
Moreover, Φ ∈ LipM (Ω) and Φ interpolates the data (i.e. satisfies (9.2.2)).
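The construction can be written down directly; the following Python sketch (an illustration, using the formulas $f_{\mathrm{upper}}(x) = \min_k (y_k + M\|x - x_k\|)$ and $f_{\mathrm{lower}}(x) = \max_k (y_k - M\|x - x_k\|)$ that appear in the proof below) computes the optimal interpolant $\Phi$ for scattered data in $\mathbb{R}^d$:

import numpy as np

def lipschitz_interpolant(X, y, M):
    # X: (m, d) interpolation points, y: (m,) values, M >= max_{i != j} |y_i - y_j| / ||x_i - x_j||
    def f_upper(x):
        return np.min(y + M * np.linalg.norm(X - x, axis=1))
    def f_lower(x):
        return np.max(y - M * np.linalg.norm(X - x, axis=1))
    def phi(x):
        return 0.5 * (f_upper(x) + f_lower(x))
    return phi

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(8, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1]
M_tilde = max(abs(y[i] - y[j]) / np.linalg.norm(X[i] - X[j])
              for i in range(len(y)) for j in range(len(y)) if i != j)
phi = lipschitz_interpolant(X, y, M_tilde)
print([abs(phi(X[i]) - y[i]) for i in range(len(y))])  # interpolation: all values ~ 0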
Proof. First we claim that for all h1 , h2 ∈ LipM (Ω) holds max{h1 , h2 } ∈ LipM (Ω) as well as
min{h1 , h2 } ∈ LipM (Ω). Since min{h1 , h2 } = − max{−h1 , −h2 }, it suffices to show the claim for
the maximum. We need to check that
\[
\frac{|\max\{h_1(x), h_2(x)\} - \max\{h_1(y), h_2(y)\}|}{\|x - y\|} \le M \qquad (9.2.7)
\]
for all x ̸= y ∈ Ω. Fix x ̸= y. Without loss of generality we assume that
max{h1 (x), h2 (x)} ≥ max{h1 (y), h2 (y)} and max{h1 (x), h2 (x)} = h1 (x).
If max{h1 (y), h2 (y)} = h1 (y) then the numerator in (9.2.7) equals h1 (x) − h1 (y) which is bounded
by M ∥x − y∥. If max{h1 (y), h2 (y)} = h2 (y), then the numerator equals h1 (x) − h2 (y) which is
bounded by h1 (x) − h1 (y) ≤ M ∥x − y∥. In either case (9.2.7) holds.
Clearly, x 7→ yk −M ∥x−xk ∥ ∈ LipM (Ω) for each k = 1, . . . , m and thus fupper , flower ∈ LipM (Ω)
as well as Φ ∈ LipM (Ω).
Next we claim that for all f ∈ LipM (Ω) satisfying (9.2.2) holds
flower (x) ≤ f (x) ≤ fupper (x) for all x ∈ Ω. (9.2.8)
This is true since for every k ∈ {1, . . . , m} and x ∈ Ω
|yk − f (x)| = |f (xk ) − f (x)| ≤ M ∥x − xk ∥
so that for all x ∈ Ω
\[
f(x) \le \min_{k=1,\dots,m} (y_k + M\|x - x_k\|), \qquad f(x) \ge \max_{k=1,\dots,m} (y_k - M\|x - x_k\|).
\]
Since fupper , flower ∈ LipM (Ω) satisfy (9.2.2), we conclude that for every h : Ω → R holds
\begin{align*}
\sup_{\substack{f \in \mathrm{Lip}_M(\Omega) \\ f \text{ satisfies } (9.2.2)}} \sup_{x \in \Omega} |f(x) - h(x)| &\ge \sup_{x \in \Omega} \max\{|f_{\mathrm{lower}}(x) - h(x)|, |f_{\mathrm{upper}}(x) - h(x)|\} \\
&\ge \sup_{x \in \Omega} \frac{|f_{\mathrm{lower}}(x) - f_{\mathrm{upper}}(x)|}{2}. \qquad (9.2.9)
\end{align*}
Moreover, using (9.2.8),
\begin{align*}
\sup_{\substack{f \in \mathrm{Lip}_M(\Omega) \\ f \text{ satisfies } (9.2.2)}} \sup_{x \in \Omega} |f(x) - \Phi(x)| &\le \sup_{x \in \Omega} \max\{|f_{\mathrm{lower}}(x) - \Phi(x)|, |f_{\mathrm{upper}}(x) - \Phi(x)|\} \\
&= \sup_{x \in \Omega} \frac{|f_{\mathrm{lower}}(x) - f_{\mathrm{upper}}(x)|}{2}. \qquad (9.2.10)
\end{align*}
Figure 9.2 depicts fupper , flower , and Φ for the interpolation problem shown in Figure 9.1, while
Figure 9.3 provides a two-dimensional example.
Figure 9.2: Interpolation of the points from Figure 9.1 with the optimal Lipschitz interpolant.
Then, there exists a ReLU neural network Φ ∈ LipM (Ω) that interpolates the data (i.e. satisfies
(9.2.2)) and satisfies
Moreover, depth(Φ) = O(log(m)), width(Φ) = O(dm) and all weights of Φ are bounded in absolute
value by max{M, ∥y∥∞ }.
Proof. To prove the result, we simply need to show that the function in (9.2.6) can be expressed as
a ReLU neural network with the size bounds described in the theorem. First we notice, that there
is a simple ReLU neural network that implements the 1-norm. It holds for all x ∈ Rd that
\[
\|x\|_1 = \sum_{i=1}^{d} \big( \sigma(x_i) + \sigma(-x_i) \big).
\]
Thus, there exists a ReLU neural network Φ∥·∥1 such that for all x ∈ Rd
for all x ∈ Rd . Using the parallelization of neural networks introduced in Section 5.1.3, there exists
a ReLU neural network Φall := (Φ1 , . . . , Φm ) : Rd → Rm such that
and
Using Lemma 5.11, we can now find a ReLU neural network Φupper such that Φupper = fupper (x)
for all x ∈ Ω, width(Φupper ) ≤ max{16m, 4md}, and depth(Φupper ) ≤ 1 + log(m).
Essentially the same construction yields a ReLU neural network Φlower with the respective
properties. Lemma 5.4 then completes the proof.
limx→−∞ σ(x) = 0, and limx→∞ σ(x) = 1. This result was improved in [97], which dropped the
nondecreasing assumption on σ.
The main idea of the optimal Lipschitz interpolation theorem in Section 9.2 is due to [13]. A
neural network construction of Lipschitz interpolants, which however is not the optimal interpolant
in the sense of Problem 9.4, is given in [108, Theorem 2.27].
Exercises
Exercise 9.7. Under the assumptions of Theorem 9.5, we define for x ∈ Ω the set of nearest
neighbors by
\[
I_x := \operatorname*{argmin}_{i=1,\dots,m} \|x_i - x\|.
\]
The one-nearest-neighbor classifier $f_{\mathrm{1NN}}$ is defined by
\[
f_{\mathrm{1NN}}(x) = \frac{1}{2}\Big( \min_{i \in I_x} y_i + \max_{i \in I_x} y_i \Big).
\]
Exercise 9.8. Extend Theorem 9.6 to the ∥ · ∥∞ -norm. Hint: The resulting neural network will
need to be deeper than the one of Theorem 9.6.
Figure 9.3: Two-dimensional example of the interpolation method of (9.2.6). From top left to
bottom we see fupper , flower , and Φ. The interpolation points (xi , yi )6i=1 are marked with red
crosses.
Chapter 10
Up to this point, we have discussed the representation and approximation of certain function classes
using neural networks. The second pillar of deep learning concerns the question of how to fit a
neural network to given data, i.e., having fixed an architecture, how to find suitable weights and
biases. This task amounts to minimizing a so-called objective function such as the empirical risk
R̂S in (1.2.3). Throughout this chapter we denote the objective function by
f : Rn → R,
and interpret it as a function of all neural network weights and biases collected in a vector in Rn .
The goal is to (approximately) determine a minimizer, i.e., some w∗ ∈ Rn satisfying
f (w∗ ) ≤ f (w) for all w ∈ Rn .
Standard approaches include, in particular, variants of (stochastic) gradient descent. These are
the topic of this chapter, in which we present basic ideas and results in convex optimization using
gradient-based methods.
Figure 10.1: Two examples of gradient descent as defined in (10.1.2). The red points represent the
wk .
(i) hk needs to be sufficiently small so that with v = −hk ∇f (wk ), the second-order term in
(10.1.1) is not dominating. This ensures that the update (10.1.2) decreases the objective
function.
(ii) hk should be large enough to ensure significant decrease of the objective function, which
facilitates faster convergence of the algorithm.
A learning rate that is too high might overshoot the minimum, while a rate that is too low results
in slow convergence. Common strategies include, in particular, constant learning rates (hk = h
for all k ∈ N0 ), learning rate schedules such as decaying learning rates (hk ↘ 0 as k → ∞), and
adaptive methods. For adaptive methods the algorithm dynamically adjusts $h_k$ based on the values of $f(w_j)$ or $\nabla f(w_j)$ for $j \le k$.
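A minimal gradient descent loop with a constant step size (a generic sketch; the objective and learning rate below are placeholders for illustration) looks as follows in Python:

import numpy as np

def gradient_descent(grad_f, w0, h=0.1, num_steps=100):
    # w_{k+1} = w_k - h * grad f(w_k), cf. (10.1.2), with constant step size h
    w = np.asarray(w0, dtype=float)
    for _ in range(num_steps):
        w = w - h * grad_f(w)
    return w

# example: f(w) = 0.5 * w^T A w with A symmetric positive definite
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_f = lambda w: A @ w
w_star = gradient_descent(grad_f, w0=[5.0, -3.0], h=0.1, num_steps=200)
print(w_star)   # should be close to the minimizer w* = 0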
Remark 10.1. It is instructive to interpret (10.1.2) as an Euler discretization of the “gradient flow”
This ODE describes the movement of a particle w(t), whose velocity at time t ≥ 0 equals −∇f (w(t))—
the vector of steepest descent. Note that
\[
\frac{d f(w(t))}{dt} = \big\langle \nabla f(w(t)), w'(t) \big\rangle = -\|\nabla f(w(t))\|^2,
\]
and thus the dynamics (10.1.3) necessarily decreases the value of the objective function along its
path as long as ∇f (w(t)) ̸= 0.
Throughout the rest of Section 10.1 we assume that w0 ∈ Rn is arbitrary, and the sequence
(wk )k∈N0 is generated by (10.1.2). We will analyze the convergence of this algorithm under suitable
assumptions on f and the hk . The proofs primarily follow the arguments in [159, Chapter 2]. We
also refer to that book for a much more detailed discussion of gradient descent, and further reading
on convex optimization.
10.1.1 L-smoothness
A key assumption to analyze convergence of (10.1.2) is Lipschitz continuity of ∇f .
for ∇f . Integrating the gradient along lines in Rn then shows that f is bounded from above by a
quadratic function touching the graph of f at w, as stated in the next lemma; also see Figure 10.2.
Thus
\[
|f(v) - f(w) - \langle \nabla f(w), v - w\rangle| \le \int_0^1 L\|t(v - w)\|\,\|v - w\| \, dt = \frac{L}{2}\|v - w\|^2,
\]
Remark 10.4. The argument in the proof of Lemma 10.3 also gives the lower bound
\[
f(v) \ge f(w) + \langle \nabla f(w), v - w\rangle - \frac{L}{2}\|w - v\|^2 \quad \text{for all } w, v \in \mathbb{R}^n. \qquad (10.1.5)
\]
The previous lemma allows us to show a decay property for the gradient descent iterates.
Specifically, the values of f necessarily decrease in each iteration as long as the step size hk is small
enough, and ∇f (wk ) ̸= 0.
Lemma 10.5. Let $n \in \mathbb{N}$ and $L > 0$. Let $f : \mathbb{R}^n \to \mathbb{R}$ be $L$-smooth. Further, let $(h_k)_{k=1}^\infty$ be positive numbers and let $(w_k)_{k=0}^\infty \subseteq \mathbb{R}^n$ be defined by (10.1.2).
Then, for all $k \in \mathbb{N}$
\[
f(w_{k+1}) \le f(w_k) - \Big( h_k - \frac{L h_k^2}{2} \Big) \|\nabla f(w_k)\|^2. \qquad (10.1.6)
\]
Proof. Set $c := h - (Lh^2)/2 = (2h - Lh^2)/2 > 0$. By (10.1.6) for $j \ge 0$
\[
f(w_j) - f(w_{j+1}) \ge c\|\nabla f(w_j)\|^2.
\]
Hence
\[
\sum_{j=0}^{k} \|\nabla f(w_j)\|^2 \le \frac{1}{c} \sum_{j=0}^{k} \big( f(w_j) - f(w_{j+1}) \big) = \frac{1}{c}\big( f(w_0) - f(w_{k+1}) \big).
\]
Thus, lower boundedness of the objective function together with L-smoothness already suffice to
obtain some form of convergence of the gradients to 0. We emphasize that this does not imply
convergence of wk towards some w∗ with ∇f (w∗ ) = 0 as the example f (w) = arctan(w), w ∈ R,
shows.
10.1.2 Convexity
While L-smoothness entails some interesting properties of gradient descent, it does not have any
direct implications on the existence or uniqueness of minimizers. To show convergence of f (wk )
towards minw f (w) for k → ∞ (assuming this minimum exists), we will assume that f is a convex
function.
as shown in Exercise 10.27. Thus, f ∈ C 1 (Rn ) is convex if and only if the graph of f lies above
each of its tangents, see Figure 10.2.
For convex f , a minimizer neither needs to exist (e.g., f (w) = w for w ∈ R) nor be unique
(e.g., f (w) = 0 for w ∈ Rn ). However, if w∗ and v ∗ are two minimizers, then every convex
combination λw∗ + (1 − λ)v ∗ , λ ∈ [0, 1], is also a minimizer due to (10.1.8). Thus, the set of all
minimizers is convex. In particular, a convex objective function has either zero, one, or infinitely
many minimizers. Moreover, if f ∈ C 1 (Rn ) then ∇f (w) = 0 implies
Lemma 10.9. Let $n \in \mathbb{N}$ and $L > 0$. Let $f : \mathbb{R}^n \to \mathbb{R}$ be $L$-smooth and convex. Further, let $h_k \in (0, 2/L)$ for all $k \in \mathbb{N}_0$, and let $(w_k)_{k=0}^\infty \subseteq \mathbb{R}^n$ be defined by (10.1.2). Suppose that $w_*$ is a minimizer of $f$.
Then, for all $k \in \mathbb{N}_0$
\[
\|w_{k+1} - w_*\|^2 \le \|w_k - w_*\|^2 - h_k \cdot \Big( \frac{2}{L} - h_k \Big) \|\nabla f(w_k)\|^2.
\]
To prove the lemma, we will require the following inequality [159, Theorem 2.1.5].
Lemma 10.10. Let $n \in \mathbb{N}$ and $L > 0$. Let $f : \mathbb{R}^n \to \mathbb{R}$ be $L$-smooth and convex.
Then,
\[
\frac{1}{L}\|\nabla f(w) - \nabla f(v)\|^2 \le \langle \nabla f(w) - \nabla f(v), w - v\rangle \quad \text{for all } w, v \in \mathbb{R}^n.
\]
Proof. Fix w ∈ Rn and set Ψ(u) := f (u) − ⟨∇f (w), u⟩ for all u ∈ Rn . Then ∇Ψ(u) = ∇f (u) −
∇f (w) and thus Ψ is L-smooth. Moreover, convexity of f , specifically (10.1.9), yields Ψ(u) ≥
f (w) − ⟨∇f (w), w⟩ = Ψ(w) for all u ∈ Rn , and thus w is a minimizer of Ψ. Using (10.1.4) on Ψ
we get for every v ∈ Rn
\begin{align*}
\Psi(w) = \min_{u \in \mathbb{R}^n} \Psi(u) &\le \min_{u \in \mathbb{R}^n} \Big( \Psi(v) + \langle \nabla\Psi(v), u - v\rangle + \frac{L}{2}\|u - v\|^2 \Big) \\
&= \min_{t \ge 0} \Big( \Psi(v) - t\|\nabla\Psi(v)\|^2 + \frac{L}{2}t^2\|\nabla\Psi(v)\|^2 \Big) \\
&= \Psi(v) - \frac{1}{2L}\|\nabla\Psi(v)\|^2,
\end{align*}
since the minimum of $t \mapsto \frac{L}{2}t^2 - t$ is attained at $t = L^{-1}$. This implies
\[
f(w) - f(v) + \frac{1}{2L}\|\nabla f(w) - \nabla f(v)\|^2 \le \langle \nabla f(w), w - v\rangle.
\]
Adding the same inequality with the roles of w and v switched gives the result.
These preparations allow us to show that for constant step size h < 2/L, we obtain convergence
of f (wk ) towards f (w∗ ) with rate O(k −1 ), as stated in the next theorem which corresponds to
[159, Theorem 2.1.14].
Theorem 10.11. Let $n \in \mathbb{N}$ and $L > 0$. Let $f : \mathbb{R}^n \to \mathbb{R}$ be $L$-smooth and convex. Further, let $h_k = h \in (0, 2/L)$ for all $k \in \mathbb{N}_0$, and let $(w_k)_{k=0}^\infty \subseteq \mathbb{R}^n$ be defined by (10.1.2). Suppose that $w_*$ is a minimizer of $f$.
Then, $f(w_k) - f(w_*) = O(k^{-1})$ for $k \to \infty$, and for the specific choice $h = 1/L$
\[
f(w_k) - f(w_*) \le \frac{2L}{4 + k} \|w_0 - w_*\|^2 \quad \text{for all } k \in \mathbb{N}_0. \qquad (10.1.10)
\]
Proof. The case w0 = w∗ is trivial and throughout we assume w0 ̸= w∗ .
Step 1. Let $j \in \mathbb{N}_0$. Using convexity (10.1.9)
and
\[
f(w_k) - f(w_*) = e_k \le \frac{1}{\frac{1}{e_0} + k\,\omega} = \frac{1}{\frac{1}{f(w_0) - f(w_*)} + k\,\frac{h - Lh^2/2}{\|w_0 - w_*\|^2}}.
\]
Figure 10.2: The graph of $L$-smooth functions lies between two quadratic functions at each point, see (10.1.4) and (10.1.5), the graph of convex functions lies above the tangent at each point, see (10.1.9), and the graph of $\mu$-strongly convex functions lies above a quadratic function at each point, see (10.1.15).
Remark 10.12. The step size h = 1/L is again such that the upper bound in (10.1.14) is minimized.
We emphasize, that while under the assumptions of Theorem 10.11 it holds f (wk ) → f (w∗ ),
in general it is not true that wk → w∗ as k → ∞. To show the convergence of the wk , we need to
introduce stronger assumptions that guarantee the existence of a unique minimizer.
Definition 10.13. Let n ∈ N and µ > 0. A function f ∈ C 1 (Rn ) is called µ-strongly convex if
\[
f(v) \ge f(w) + \langle \nabla f(w), v - w\rangle + \frac{\mu}{2}\|v - w\|^2 \quad \text{for all } w, v \in \mathbb{R}^n. \qquad (10.1.15)
\]
Note that (10.1.15) is the opposite of the bound (10.1.4) implied by L-smoothness. We depict
the three notions of L-smoothness, convexity, and µ-strong convexity in Figure 10.2.
Every µ-strongly convex function has a unique minimizer. To see this note first that (10.1.15)
implies f to be lower bounded by a convex quadratic function, so that there exists at least one
minimizer w∗ , and ∇f (w∗ ) = 0. By (10.1.15) we then have f (v) > f (w∗ ) for every v ̸= w∗ .
The next theorem shows that the gradient descent iterates converge linearly towards the unique
minimizer for L-smooth and µ-strongly convex functions. Recall that a sequence ek is said to
converge linearly to 0, if and only if there exist constants C > 0 and c ∈ [0, 1) such that
The constant c is also referred to as the rate of convergence. Before giving the statement, we first
note that comparing (10.1.4) and (10.1.15) it necessarily holds L ≥ µ and therefore κ := L/µ ≥ 1.
This term is known as the condition number of f . It crucially influences the rate of convergence.
Theorem 10.14. Let $n \in \mathbb{N}$ and $L \ge \mu > 0$. Let $f : \mathbb{R}^n \to \mathbb{R}$ be $L$-smooth and $\mu$-strongly convex. Further, let $h_k = h \in (0, 1/L]$ for all $k \in \mathbb{N}_0$, let $(w_k)_{k=0}^\infty \subseteq \mathbb{R}^n$ be defined by (10.1.2), and let $w_*$ be the unique minimizer of $f$.
Then, $f(w_k) \to f(w_*)$ and $w_k \to w_*$ converge linearly for $k \to \infty$. For the specific choice $h = 1/L$
\begin{align*}
\|w_k - w_*\|^2 &\le \Big( 1 - \frac{\mu}{L} \Big)^k \|w_0 - w_*\|^2, \qquad (10.1.16a) \\
f(w_k) - f(w_*) &\le \frac{L}{2} \Big( 1 - \frac{\mu}{L} \Big)^k \|w_0 - w_*\|^2. \qquad (10.1.16b)
\end{align*}
Proof. It suffices to show (10.1.16a) since (10.1.16b) follows directly by Lemma 10.3 and because
∇f (w∗ ) = 0. The case k = 0 is trivial, so let k ∈ N.
Expanding wk = wk−1 − h∇f (wk−1 ) and using µ-strong convexity (10.1.15)
The descent property also implies f (wk−1 ) − f (w∗ ) ≥ f (wk−1 ) − f (wk ). Thus the right-hand side
of (10.1.17) is less or equal to zero as long as 2h ≥ h/(1 − Lh/2), which is equivalent to h ≤ 1/L.
Hence
Remark 10.15. With a more refined argument, see [159, Theorem 2.1.15], the constraint on the step size can be relaxed to $h \le 2/(\mu + L)$. For $h = 2/(\mu + L)$ one then obtains (10.1.16) with $1 - \mu/L = 1 - \kappa^{-1}$ replaced by
\[
\Big( \frac{L/\mu - 1}{L/\mu + 1} \Big)^2 = \Big( \frac{\kappa - 1}{\kappa + 1} \Big)^2 \in [0, 1). \qquad (10.1.18)
\]
We have
\[
\Big( \frac{\kappa - 1}{\kappa + 1} \Big)^2 = 1 - 4\kappa^{-1} + O(\kappa^{-2})
\]
as $\kappa \to \infty$. Thus, (10.1.18) gives a slightly better, but conceptually similar, rate of convergence than the $1 - \kappa^{-1}$ shown in Theorem 10.14.
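The dependence of the convergence rate on the condition number $\kappa = L/\mu$ is easy to observe numerically; the following Python sketch (a toy quadratic objective chosen for illustration) runs gradient descent with $h = 1/L$ for two values of $\kappa$:

import numpy as np

def gd_errors(L, mu, num_steps=200):
    # f(w) = 0.5 * (L * w1^2 + mu * w2^2) is L-smooth and mu-strongly convex
    grad = lambda w: np.array([L * w[0], mu * w[1]])
    w, h = np.array([1.0, 1.0]), 1.0 / L
    errs = []
    for _ in range(num_steps):
        w = w - h * grad(w)
        errs.append(np.sum(w ** 2))   # squared distance to the minimizer w* = 0
    return errs

for L, mu in [(10.0, 5.0), (10.0, 0.1)]:   # condition numbers 2 and 100
    errs = gd_errors(L, mu)
    # observed error after 51 steps vs. the bound (1 - 1/kappa)^51 * ||w0 - w*||^2
    print(L / mu, errs[50], 2 * (1 - mu / L) ** 51)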
10.1.4 PL-inequality
Linear convergence for gradient descent can also be shown under a weaker assumption known as
the Polyak-Lojasiewicz-inequality, or PL-inequality for short.
Lemma 10.16. Let n ∈ N and µ > 0. Let f : Rn → R be µ-strongly convex and denote its unique
minimizer by w∗ . Then f satisfies the PL-inequality
\[
\mu \cdot (f(w) - f(w_*)) \le \frac{1}{2}\|\nabla f(w)\|^2 \quad \text{for all } w \in \mathbb{R}^n. \qquad (10.1.19)
\]
As the lemma states, the PL-inequality is implied by strong convexity. Moreover, it is indeed
weaker than strong convexity, and does not even imply convexity, see Exercise 10.28. The next
theorem, which corresponds to [220, Theorem 1], gives a convergence result for L-smooth functions
satisfying the PL-inequality. It therefore does not require convexity. The proof is left as an exercise.
We only note that the PL-inequality bounds the distance to the minimal value of the objective
function by the squared norm of the gradient. It is thus precisely the type of bound required to
show convergence of gradient descent.
Theorem 10.17. Let $n \in \mathbb{N}$ and $L > 0$. Let $f : \mathbb{R}^n \to \mathbb{R}$ be $L$-smooth. Further, let $h_k = 1/L$ for all $k \in \mathbb{N}_0$, let $(w_k)_{k=0}^\infty \subseteq \mathbb{R}^n$ be defined by (10.1.2), and let $w_*$ be a (not necessarily unique) minimizer of $f$, so that the PL-inequality (10.1.19) holds.
Then, it holds for all $k \in \mathbb{N}_0$ that
\[
f(w_k) - f(w_*) \le \Big( 1 - \frac{\mu}{L} \Big)^k \big( f(w_0) - f(w_*) \big).
\]
we denote by $G_k$. We interpret $G_k$ as an approximation to $\nabla f(w_k)$; specifically, throughout we will assume that (given $w_k$) $G_k$ is an unbiased estimator, i.e.
\[
\mathbb{E}[G_k \,|\, w_k] = \nabla f(w_k). \qquad (10.2.1)
\]
The iterates are then defined as
\[
w_{k+1} := w_k - h_k G_k, \qquad (10.2.2)
\]
where hk > 0 denotes again the step size, and unlike in Section 10.1, we focus here on the case of hk
depending on k. The iteration (10.2.2) creates a Markov chain (w0 , w1 , . . . ), meaning that wk is
a random variable, and its state only depends1 on wk−1 . The main reason for replacing the actual
gradient by an estimator, is not to improve the accuracy or convergence rate, but rather to decrease
the computational cost and storage requirements of the algorithm. The underlying assumption is
that Gk−1 can be computed at a fraction of the cost required for the computation of ∇f (wk−1 ).
The next example illustrates this in the standard setting.
Example 10.18 (Empirical risk minimization). Suppose we have some data S := (xj , yj )m j=1 ,
d
where yj ∈ R is the label corresponding to the data point xj ∈ R . Using the square loss, we wish
to fit a neural network Φ(·, w) : Rd → R depending on parameters (i.e. weights and biases) w ∈ Rn ,
such that the empirical risk
m
1 X
f (w) := R̂S (w) = (Φ(xj , w) − yj )2 ,
2m
j=1
and thus the computation of m gradients of the neural network Φ. For large m (in practice m can
be in the millions or even larger), this computation might be infeasible. To decrease computational
complexity, we replace the full gradient (10.2.3) by
where j ∼ uniform(1, . . . , m) is a random variable with uniform distribution on the discrete set
{1, . . . , m}. Then
\[
\mathbb{E}[G] = \frac{1}{m} \sum_{j=1}^{m} (\Phi(x_j, w) - y_j)\,\nabla_w \Phi(x_j, w) = \nabla f(w),
\]
but an evaluation of $G$ merely requires the computation of a single gradient of the neural network. More generally, one can choose a mini-batch size $m_b$ (where $m_b \ll m$) and let $G = \frac{1}{m_b}\sum_{j \in J} (\Phi(x_j, w) - y_j)\nabla_w \Phi(x_j, w)$, where $J$ is a random subset of $\{1, \dots, m\}$ of cardinality $m_b$.
1
More precisely, given wk−1 , the state of wk is conditionally independent of w1 , . . . , wk−2 . See Appendix A.3.3.
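A minimal sketch of this mini-batch gradient estimator together with the SGD update (10.2.2), here for a linear model instead of a neural network so that the example stays self-contained, could look as follows in Python:

import numpy as np

rng = np.random.default_rng(0)
m, d = 1000, 5
X = rng.normal(size=(m, d))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.01 * rng.normal(size=m)

def minibatch_grad(w, batch_size=32):
    # unbiased estimator of grad f(w) for f(w) = 1/(2m) * sum_j (x_j^T w - y_j)^2
    J = rng.choice(m, size=batch_size, replace=False)
    residual = X[J] @ w - y[J]
    return X[J].T @ residual / batch_size

w, h0 = np.zeros(d), 0.5
for k in range(2000):
    h = h0 / (1 + k / 100)          # decaying step size
    w = w - h * minibatch_grad(w)   # SGD update (10.2.2)
print(w)   # should be close to the true coefficients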
Remark 10.19. In practice, the following variant is also common: Let $m_b k = m$ for $m_b, k, m \in \mathbb{N}$, i.e. the number of data points $m$ is a $k$-fold multiple of the mini-batch size $m_b$. In each epoch, first a random partition $\dot\bigcup_{i=1}^{k} J_i = \{1, \dots, m\}$ is determined. Then for each $i = 1, \dots, k$, the weights are updated with the gradient estimate
\[
\frac{1}{m_b} \sum_{j \in J_i} (\Phi(x_j, w) - y_j)\,\nabla_w \Phi(x_j, w).
\]
Hence, in one epoch (which corresponds to k updates of the neural network weights), the algorithm
sweeps through the whole dataset.
SGD can be analyzed in various settings. To give the general idea, we concentrate on the case
of L-smooth and µ-strongly convex objective functions. Let us start by looking at a property akin
to the (descent) Lemma 10.5. Using Lemma 10.3
\[
f(w_{k+1}) \le f(w_k) - h_k \langle \nabla f(w_k), G_k\rangle + \frac{L}{2} h_k^2 \|G_k\|^2.
\]
In contrast to gradient descent, we cannot say anything about the sign of the term in the middle of
the right-hand side. Thus, (10.2.2) need not necessarily decrease the value of the objective function
in every step. The key insight is that in expectation the value is still decreased under certain
assumptions, namely
\begin{align*}
\mathbb{E}[f(w_{k+1}) \,|\, w_k] &\le f(w_k) - h_k \mathbb{E}[\langle \nabla f(w_k), G_k\rangle \,|\, w_k] + \frac{L}{2} h_k^2\, \mathbb{E}\big[\|G_k\|^2 \,\big|\, w_k\big] \\
&= f(w_k) - h_k \|\nabla f(w_k)\|^2 + \frac{L}{2} h_k^2\, \mathbb{E}\big[\|G_k\|^2 \,\big|\, w_k\big] \\
&= f(w_k) - h_k \Big( \|\nabla f(w_k)\|^2 - \frac{L}{2} h_k\, \mathbb{E}[\|G_k\|^2 \,|\, w_k] \Big),
\end{align*}
where we used (10.2.1).
Assuming, for some fixed $\gamma > 0$, the uniform bound
\[
\mathbb{E}[\|G_k\|^2 \,|\, w_k] \le \gamma
\]
and that $\|\nabla f(w_k)\| > 0$ (which is true unless $w_k$ is the minimizer), upon choosing
\[
0 < h_k < \frac{2\|\nabla f(w_k)\|^2}{L\gamma},
\]
the expectation of the objective function decreases. Since $\nabla f(w_k)$ tends to $0$ as we approach the minimum, this also indicates that we should choose step sizes $h_k$ that tend to $0$ for $k \to \infty$. For our analysis we will work with the specific choice
\[
h_k := \frac{1}{\mu}\,\frac{(k+1)^2 - k^2}{(k+1)^2} \quad \text{for all } k \in \mathbb{N}_0, \qquad (10.2.4)
\]
as, e.g., in [76]. Note that
\[
h_k = \frac{2k+1}{\mu(k+1)^2} = \frac{2}{\mu(k+1)} + O(k^{-2}) = O(k^{-1}).
\]
Since wk is a random variable by construction, a convergence statement can only be stochastic,
e.g., in expectation or with high probability. We concentrate here on the former, but emphasize
that also the latter can be shown.
Theorem 10.20. Let $n \in \mathbb{N}$ and $L, \mu, \gamma > 0$. Let $f : \mathbb{R}^n \to \mathbb{R}$ be $L$-smooth and $\mu$-strongly convex. Let $(h_k)_{k=0}^\infty$ satisfy (10.2.4) and let $(G_k)_{k=0}^\infty$, $(w_k)_{k=0}^\infty$ be sequences of random variables satisfying (10.2.1) and (10.2.2). Assume that $\mathbb{E}[\|G_k\|^2 \,|\, w_k] \le \gamma$ for all $k \in \mathbb{N}_0$.
Then
\begin{align*}
\mathbb{E}[\|w_k - w_*\|^2] &\le \frac{4\gamma}{\mu^2 k} = O(k^{-1}), \\
\mathbb{E}[f(w_k)] - f(w_*) &\le \frac{4L\gamma}{2\mu^2 k} = O(k^{-1})
\end{align*}
for $k \to \infty$.
Proof. By (10.2.2) and (10.2.1)
\begin{align*}
\mathbb{E}[\|w_k - w_*\|^2 \,|\, w_{k-1}] &= \|w_{k-1} - w_*\|^2 - 2h_{k-1} \mathbb{E}[\langle G_{k-1}, w_{k-1} - w_*\rangle \,|\, w_{k-1}] + h_{k-1}^2 \mathbb{E}[\|G_{k-1}\|^2 \,|\, w_{k-1}] \\
&= \|w_{k-1} - w_*\|^2 - 2h_{k-1} \langle \nabla f(w_{k-1}), w_{k-1} - w_*\rangle + h_{k-1}^2 \mathbb{E}[\|G_{k-1}\|^2 \,|\, w_{k-1}].
\end{align*}
Thus
so that
By choice of $h_i$
\[
\prod_{i=j}^{k-1} (1 - \mu h_i) = \prod_{i=j}^{k-1} \frac{i^2}{(i+1)^2} = \frac{j^2}{k^2}
\]
and thus
\begin{align*}
e_k &\le \frac{\gamma}{\mu^2} \sum_{j=0}^{k-1} \Big( \frac{(j+1)^2 - j^2}{(j+1)^2} \Big)^2 \frac{(j+1)^2}{k^2} \\
&\le \frac{\gamma}{\mu^2} \frac{1}{k^2} \sum_{j=0}^{k-1} \underbrace{\frac{(2j+1)^2}{(j+1)^2}}_{\le 4} \\
&\le \frac{\gamma}{\mu^2} \frac{4k}{k^2} \le \frac{4\gamma}{\mu^2 k}.
\end{align*}
Since E[∥wk − w∗ ∥2 ] is the expectation of E[∥wk − w∗ ∥2 |wk−1 ] with respect to the random
variable wk−1 , and e0 /k 2 + 4γ/(µ2 k) is a constant independent of wk−1 , we obtain
\[
\mathbb{E}[\|w_k - w_*\|^2] \le \frac{4\gamma}{\mu^2 k}.
\]
Finally, using L-smoothness
\[
f(w_k) - f(w_*) \le \langle \nabla f(w_*), w_k - w_*\rangle + \frac{L}{2}\|w_k - w_*\|^2 = \frac{L}{2}\|w_k - w_*\|^2,
\]
and taking the expectation concludes the proof.
The specific choice of hk in (10.2.4) simplifies the calculations in the proof, but it is not necessary
in order for the asymptotic convergence to hold. One can show similar convergence results with
hk = c1 /(c2 + k) under certain assumptions on c1 , c2 , e.g. [23, Theorem 4.7].
10.3 Backpropagation
We now explain how to apply gradient-based methods to the training of neural networks. Let $\Phi \in \mathcal{N}_{d_0}^{d_{L+1}}(\sigma; L, n)$ (see Definition 3.6) and assume that the activation function satisfies $\sigma \in C^1(\mathbb{R})$.
As earlier, we denote the neural network parameters by
with weight matrices W (ℓ) ∈ Rdℓ+1 ×dℓ and bias vectors b(ℓ) ∈ Rdℓ+1 . Additionally, we fix a differ-
entiable loss function L : RdL+1 × RdL+1 → R, e.g., L(w, w̃) = ∥w − w̃∥2 /2, and assume given data
$(x_j, y_j)_{j=1}^m \subseteq \mathbb{R}^{d_0} \times \mathbb{R}^{d_{L+1}}$. The goal is to minimize an empirical risk of the form
\[
f(w) := \frac{1}{m} \sum_{j=1}^{m} \mathcal{L}(\Phi(x_j, w), y_j)
\]
as a function of the neural network parameters w. An application of the gradient step (10.1.2) to
update the parameters requires the computation of
\[
\nabla f(w) = \frac{1}{m} \sum_{j=1}^{m} \nabla_w \mathcal{L}(\Phi(x_j, w), y_j).
\]
For stochastic methods, as explained in Example 10.18, we only compute the average over a (ran-
dom) subbatch of the dataset. In either case, we need an algorithm to determine ∇w L(Φ(x, w), y),
i.e. the gradients
∇_{b^(ℓ)} L(Φ(x, w), y) ∈ R^{dℓ+1},   ∇_{W^(ℓ)} L(Φ(x, w), y) ∈ R^{dℓ+1 × dℓ}   (10.3.2)
for all ℓ = 0, . . . , L.
The backpropagation algorithm [197] provides an efficient way to do so. To explain it, for fixed
input x ∈ Rd0 introduce the notation
where the application of σ : R → R to a vector is, as always, understood componentwise. With the
notation of Definition 2.1, x(ℓ) = σ(x̄(ℓ) ) ∈ Rdℓ for ℓ = 1, . . . , L and x̄(L+1) = x(L+1) = Φ(x, w) ∈
RdL+1 is the output of the neural network. Therefore, the x̄(ℓ) for ℓ = 1, . . . , L are sometimes also
referred to as the preactivations.
In the following, we additionally fix y ∈ RdL+1 and write
Note that x̄(k) depends on (W (ℓ) , b(ℓ) ) only if k > ℓ. Since x̄(ℓ+1) is a function of x̄(ℓ) for each ℓ,
by repeated application of the chain rule
∂L/∂W^(ℓ)_{ij} = (∂L/∂x̄^(L+1)) (∂x̄^(L+1)/∂x̄^(L)) ··· (∂x̄^(ℓ+2)/∂x̄^(ℓ+1)) (∂x̄^(ℓ+1)/∂W^(ℓ)_{ij}),   (10.3.4)
where the factors are of dimensions R^{1×dL+1}, R^{dL+1×dL}, . . . , R^{dℓ+2×dℓ+1}, and R^{dℓ+1×1}, respectively.
An analogous calculation holds for ∂L/∂b^(ℓ)_j. Since all terms in (10.3.4) are easy to compute (see
(10.3.3)), in principle we could use this formula to determine the gradients in (10.3.2). To avoid
unnecessary computations, the main idea of backpropagation is to introduce
As the following lemma shows, the α(ℓ) can be computed recursively for ℓ = L + 1, . . . , 1. This
explains the name “backpropagation”. In the following, ⊙ denotes the componentwise (Hadamard)
product, i.e. a ⊙ b = (a_i b_i)_{i=1}^d for every a, b ∈ R^d.
Lemma 10.21. It holds
and
Proof. Equation (10.3.5) holds by definition. For ℓ ∈ {1, . . . , L} by the chain rule
α^(ℓ) = ∂L/∂x̄^(ℓ) = (∂x̄^(ℓ+1)/∂x̄^(ℓ))^⊤ ∂L/∂x̄^(ℓ+1) = (∂x̄^(ℓ+1)/∂x̄^(ℓ))^⊤ α^(ℓ+1).
and
and
Thus, with ei = (δki)_{k=1}^{dℓ+1},
∂L/∂b^(ℓ)_i = (∂x̄^(ℓ+1)/∂b^(ℓ)_i)^⊤ ∂L/∂x̄^(ℓ+1) = ei^⊤ α^(ℓ+1) = α^(ℓ+1)_i   for ℓ ∈ {0, . . . , L}
and similarly
∂L/∂W^(0)_{ij} = (∂x̄^(1)/∂W^(0)_{ij})^⊤ α^(1) = x̄^(0)_j ei^⊤ α^(1) = x̄^(0)_j α^(1)_i
and
∂L/∂W^(ℓ)_{ij} = σ(x̄^(ℓ)_j) α^(ℓ+1)_i   for ℓ ∈ {1, . . . , L}.
Lemma 10.21 and Proposition 10.22 motivate Algorithm 1, in which a forward pass computing
x̄(ℓ) , ℓ = 1, . . . , L + 1, is followed by a backward pass to determine the α(ℓ) , ℓ = L + 1, . . . , 1,
and the gradients of L with respect to the neural network parameters. This shows how to use
gradient-based optimizers from the previous sections for the training of neural networks.
Two important remarks are in order. First, the objective function associated with neural networks
is typically not convex as a function of the neural network weights and biases. Thus, the analysis
of the previous sections will in general not be directly applicable, although it may still give some
insight into the convergence behavior locally around a minimizer. Second, to derive the back-
propagation algorithm we assumed the activation function to be continuously differentiable, which
does not hold for ReLU. Using the concept of subgradients, gradient-based algorithms and their
analysis can be generalized to some extent to also accommodate non-differentiable objectives,
see Exercises 10.31–10.33.
10.4 Acceleration
Acceleration is an important tool for the training of neural networks [221]. The idea was first
introduced by Polyak in 1964 under the name “heavy ball method” [180]. It is inspired by the
dynamics of a heavy ball rolling down the valley of the loss landscape. Since then other types of
acceleration have been proposed and analyzed, with Nesterov acceleration being the most prominent
example [160]. In this section, we first give some intuition by discussing the heavy ball method for
a simple quadratic loss. Afterwards we turn to Nesterov acceleration and give a convergence proof
for L-smooth and µ-strongly convex objective functions that improves upon the bounds obtained
for gradient descent.
Algorithm 1 Backpropagation
Input: Network input x, target output y, neural network parameters
((W^(0), b^(0)), . . . , (W^(L), b^(L)))
Output: Gradients of the loss function L with respect to neural network parameters
Forward pass
x̄(1) ← W (0) x + b(0)
for ℓ = 1, . . . , L do
x̄(ℓ+1) ← W (ℓ) σ(x̄(ℓ) ) + b(ℓ)
end for
Backward pass
α(L+1) ← ∇x̄(L+1) L(x̄(L+1) , y)
for ℓ = L, . . . , 1 do
∇b(ℓ) L ← α(ℓ+1)
∇W (ℓ) L ← α(ℓ+1) σ(x̄(ℓ) )⊤
α(ℓ) ← σ ′ (x̄(ℓ) ) ⊙ (W (ℓ) )⊤ α(ℓ+1)
end for
∇b(0) L ← α(1)
∇W (0) L ← α(1) x⊤
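A minimal NumPy sketch of Algorithm 1 is given below, assuming the squared loss L(w, w̃) = ∥w − w̃∥²/2 mentioned above; the layer sizes and the tanh activation in the usage example are arbitrary illustrative choices, not taken from the text.

import numpy as np

def backprop(x, y, weights, biases, sigma, sigma_prime):
    # weights[l], biases[l] correspond to W^(l), b^(l) for l = 0, ..., L
    L = len(weights) - 1
    # forward pass: store the preactivations xbar^(1), ..., xbar^(L+1)
    xbar = [weights[0] @ x + biases[0]]
    for l in range(1, L + 1):
        xbar.append(weights[l] @ sigma(xbar[l - 1]) + biases[l])
    # backward pass: alpha^(L+1) is the gradient of the loss w.r.t. the network output
    alpha = xbar[L] - y                              # for L(w, y) = ||w - y||^2 / 2
    grads_W, grads_b = [None] * (L + 1), [None] * (L + 1)
    for l in range(L, 0, -1):
        grads_b[l] = alpha
        grads_W[l] = np.outer(alpha, sigma(xbar[l - 1]))
        alpha = sigma_prime(xbar[l - 1]) * (weights[l].T @ alpha)
    grads_b[0] = alpha
    grads_W[0] = np.outer(alpha, x)
    return grads_W, grads_b

# usage example with one hidden layer (L = 1) and tanh activation
rng = np.random.default_rng(1)
weights = [rng.normal(size=(8, 3)), rng.normal(size=(2, 8))]
biases = [np.zeros(8), np.zeros(2)]
gW, gb = backprop(rng.normal(size=3), np.ones(2), weights, biases,
                  np.tanh, lambda t: 1.0 - np.tanh(t) ** 2)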
Figure 10.3: 20 steps of gradient descent and the heavy ball method on the objective function
(10.4.1) with λ1 = 12 ≫ 1 = λ2 , step size h = α = h∗ as in (10.4.2), and β = 1/3.
Remark 10.23. Interpreting (10.4.4) as a second-order Taylor expansion of some objective function
f˜ around its minimizer w∗ , we note that the described effects also occur for general objective
functions with ill-conditioned Hessians at the minimizer.
Figure 10.3 gives further insight into the poor performance of gradient descent for (10.4.1) with
λ1 ≫ λ2 . The loss-landscape looks like a ravine (the derivative is much larger in one direction than
the other), and away from the floor, ∇f mainly points to the opposite side. Therefore the iterates
oscillate back and forth in the first coordinate, and make little progress in the direction of the valley
along the second coordinate axis. To address this problem, the heavy ball method introduces a
“momentum” term which can mitigate this effect to some extent. The idea is to choose the update
not just according to the gradient at the current location, but to add information from the previous
steps. After initializing w0 and, e.g., w1 = w0 − α∇f (w0 ), let for k ∈ N
This is known as Polyak’s heavy ball method [180]. Here α > 0 and β ∈ (0, 1) are hyperparameters
(that could also depend on k) and in practice need to be carefully tuned to balance the strength of
the gradient and the momentum term. Iteratively expanding (10.4.5) with the given initialization,
observe that for k ≥ 0
wk+1 = wk − α ( Σ_{j=0}^{k} β^j ∇f(wk−j) ).   (10.4.6)
Thus, wk is updated using an exponentially weighted average of all past gradients. Choosing the
momentum parameter β in the interval (0, 1) ensures that the influence of previous gradients on the
update decays exponentially. The concrete value of β determines the balance between the impact
of recent and past gradients.
Intuitively, this (exponentially weighted) linear combination of the past gradients averages out
some of the oscillation observed for gradient descent in Figure 10.3 in the first coordinate, and thus
“smoothes” the path. The partial derivative in the second coordinate, along which the objective
function is very flat, does not change much from one iterate to the next. Thus, its proportion in
the update is strengthened through the addition of momentum. This is observed in Figure 10.3.
As mentioned earlier, the heavy ball method can be interpreted as a discretization of the dy-
namics of a ball rolling down the valley of the loss landscape. If the ball has positive mass, i.e. is
“heavy”, its momentum prevents the ball from bouncing back and forth too strongly. The following
remark further elucidates this connection.
Remark 10.24. As pointed out, e.g., in [181, 183], for suitable choices of α and β, (10.4.5) can be
interpreted as a discretization of the second-order ODE
This equation describes the movement of a point mass m under the influence of the force field −∇f(w(t));
the term −w′(t), which points in the negative direction of the current velocity, corresponds to friction,
and r > 0 is the friction coefficient. The discretization
m (wk+1 − 2wk + wk−1)/h² = −∇f(wk) − r (wk+1 − wk)/h
then leads to
wk+1 = wk − (h²/(m − rh)) ∇f(wk) + (m/(m − rh)) (wk − wk−1),   (10.4.8)
i.e., (10.4.5) with α = h²/(m − rh) and β = m/(m − rh).
for j ∈ {1, 2} and k ≥ 1. The smaller the modulus of the eigenvalues of the matrix in (10.4.9),
the faster the convergence towards the minimizer w∗,j = 0 ∈ R for arbitrary initialization. Hence,
the goal is to choose α > 0 and β ∈ (0, 1) such that the maximal modulus of the eigenvalues
of the matrix for j ∈ {1, 2} is as small as possible. We omit the details of this calculation (also see
[181, 165, 70]), but mention that this is obtained for
α = ( 2/(√λ1 + √λ2) )²   and   β = ( (√λ1 − √λ2)/(√λ1 + √λ2) )².
With these choices, the modulus of the maximal eigenvalue is bounded by
√β = (√κ − 1)/(√κ + 1) ∈ [0, 1),
where again κ = λ1 /λ2 . Due to (10.4.9), this expression gives a rate of convergence for (10.4.5).
Contrary to gradient descent, see (10.4.3), for this problem the heavy ball method achieves a
convergence rate that only depends on the square root of the condition number κ. This explains
the improved performance observed in Figure 10.3.
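The following Python sketch compares the two methods numerically. It assumes that (10.4.1) is the quadratic f(w) = (λ1 w1² + λ2 w2²)/2 (consistent with the gradient (λ1 wj,1, λ2 wj,2)⊤ used in Remark 10.26 below), that the heavy ball update (10.4.5) reads wk+1 = wk − α∇f(wk) + β(wk − wk−1), and that 2/(λ1 + λ2) is a reasonable gradient descent step size (cf. Exercise 10.35); it is an illustration, not a reproduction of Figure 10.3.

import numpy as np

lam1, lam2 = 12.0, 1.0
grad = lambda w: np.array([lam1 * w[0], lam2 * w[1]])   # gradient of f(w) = (lam1*w1^2 + lam2*w2^2)/2

# gradient descent
w_gd = np.array([-1.0, 1.0])
h = 2.0 / (lam1 + lam2)
for _ in range(20):
    w_gd = w_gd - h * grad(w_gd)

# heavy ball with the optimal parameters derived above
alpha = (2.0 / (np.sqrt(lam1) + np.sqrt(lam2))) ** 2
beta = ((np.sqrt(lam1) - np.sqrt(lam2)) / (np.sqrt(lam1) + np.sqrt(lam2))) ** 2
w_prev = np.array([-1.0, 1.0])
w_hb = w_prev - alpha * grad(w_prev)                    # initialization w1 = w0 - alpha*grad f(w0)
for _ in range(19):
    w_hb, w_prev = w_hb - alpha * grad(w_hb) + beta * (w_hb - w_prev), w_hb

print("gradient descent distance to minimizer:", np.linalg.norm(w_gd))
print("heavy ball distance to minimizer:      ", np.linalg.norm(w_hb))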
Theorem 10.25. Let n ∈ N and L, µ > 0. Let f : Rⁿ → R be L-smooth and µ-strongly convex.
Further, let v0, w0 ∈ Rⁿ and let τ = √(µ/L). Let (wk, vk+1, uk+1)_{k=0}^∞ ⊆ Rⁿ be defined by (10.4.11a),
and let w∗ be the unique minimizer of f.
Then, for all k ∈ N0, it holds that
∥uk − w∗∥² ≤ (2/µ) (1 − √(µ/L))^k ( f(v0) − f(w∗) + (µ/2)∥u0 − w∗∥² ),   (10.4.12a)
f(vk) − f(w∗) ≤ (1 − √(µ/L))^k ( f(v0) − f(w∗) + (µ/2)∥u0 − w∗∥² ).   (10.4.12b)
Proof. Define
ek := f(vk) − f(w∗) + (µ/2)∥uk − w∗∥².   (10.4.13)
To show (10.4.12), it suffices to prove with c = 1 − τ that ek+1 ≤ cek for all k ∈ N0 .
We start with the last term in (10.4.13). By (10.4.11c)
(µ/2)∥uk+1 − w∗∥² − (µ/2)∥uk − w∗∥²
= (µ/2)∥uk+1 − uk + uk − w∗∥² − (µ/2)∥uk − w∗∥²
= (µ/2)∥uk+1 − uk∥² + (µ/2) · 2 ⟨ τ(wk − uk) − (τ/µ)∇f(wk), uk − w∗ ⟩
= (µ/2)∥uk+1 − uk∥² + τ⟨∇f(wk), w∗ − uk⟩ − τµ⟨wk − uk, w∗ − uk⟩.   (10.4.14)
From (10.4.11a) we have τ uk = (1 + τ )wk − v k so that
so that
f(vk+1) − f(w∗) − τ · (f(wk) − f(w∗)) ≤ (1 − τ)(f(wk) − f(w∗)) − (1/(2L))∥∇f(wk)∥²
                                      = c · (f(vk) − f(w∗)) + c · (f(wk) − f(vk)) − (1/(2L))∥∇f(wk)∥².   (10.4.17)
Now, (10.4.16) and (10.4.17) imply
ek+1 ≤ cek + c · (f(wk) − f(vk)) − (1/(2L))∥∇f(wk)∥² + (µ/2)∥uk+1 − uk∥²
       + ⟨∇f(wk), vk − wk⟩ − (τµ/2)∥wk − uk∥².
Since we wish to bound ek+1 by cek , we now show that all terms except cek on the right-hand side
of the inequality above sum up to a non-positive value. By (10.4.11c) and (10.4.15)
(µ/2)∥uk+1 − uk∥² = (µ/2)∥vk − wk∥² − τ⟨∇f(wk), vk − wk⟩ + (τ²/(2µ))∥∇f(wk)∥².
Moreover, using µ-strong convexity
⟨∇f(wk), vk − wk⟩ ≤ τ⟨∇f(wk), vk − wk⟩ + (1 − τ) ( f(vk) − f(wk) − (µ/2)∥vk − wk∥² ).
Thus, we arrive at
ek+1 ≤ cek + c · (f(wk) − f(vk)) − (1/(2L))∥∇f(wk)∥² + (µ/2)∥vk − wk∥²
       − τ⟨∇f(wk), vk − wk⟩ + (τ²/(2µ))∥∇f(wk)∥² + τ⟨∇f(wk), vk − wk⟩
       + c · (f(vk) − f(wk)) − c(µ/2)∥vk − wk∥² − (τµ/2)∥wk − uk∥²
     = cek + ( τ²/(2µ) − 1/(2L) )∥∇f(wk)∥² + (µ/2)( τ − 1/τ )∥wk − vk∥²
     ≤ cek,
where we used once more (10.4.15), and the fact that τ²/(2µ) − 1/(2L) = 0 and τ − 1/τ ≤ 0 since
τ = √(µ/L) ∈ (0, 1].
Comparing the result for gradient descent (10.1.16) with NAG (10.4.12), the improvement lies
in the convergence rate, which is 1 − κ⁻¹ for gradient descent (also see Remark 10.15), and 1 − κ^{−1/2}
for NAG, where κ = L/µ. In contrast to gradient descent, for NAG the convergence depends only
on the square root of the condition number κ. For ill-conditioned problems where κ is large, we
therefore expect much better performance for accelerated methods.
Finally, we mention that NAG also achieves faster convergence in the case of L-smooth and
convex objective functions. While the error decays like O(k −1 ) for gradient descent, see Theorem
10.11, for NAG one obtains convergence O(k −2 ), see [160, 158, 241].
10.5 Other methods
In recent years, a multitude of first order (gradient descent) methods has been proposed and studied
for the training of neural networks. They typically employ (a subset of) three key strategies:
mini-batches, acceleration, and adaptive step sizes. The concepts of mini-batches and acceleration
have been covered in the previous sections, and we will touch upon adaptive learning rates in the
present one. Specifically, we present three algorithms—AdaGrad, RMSProp, and Adam—which
have been among the most influential in the field, and serve to explore the main ideas. An intuitive
overview of first order methods can also be found in [194], which discusses additional variants that
are omitted here. Moreover, in practice, various other techniques and heuristics such as batch
normalization, gradient clipping, data augmentation, regularization and dropout, early stopping,
specific weight initializations etc. are used. We do not discuss them here, and refer to [22] or to
[67, Chapter 11] for a practitioner's guide.
After initializing m0 = 0 ∈ Rn , v 0 = 0 ∈ Rn , and w0 ∈ Rn , all methods discussed below are
special cases of the update
10.5.1 AdaGrad
In Section 10.2 we argued that for stochastic methods the learning rate should decrease in order
to get convergence. The choice of how to decrease the learning rate can have significant impact
in practice. AdaGrad [57], which stands for adaptive gradient algorithm, provides a method to
dynamically adjust learning rates during optimization. Moreover, it does so by using individual
learning rates for each component.
AdaGrad corresponds to (10.5.1) with
β1 = 0, γ1 = β2 = γ2 = 1, αk = α for all k ∈ N0 .
This leaves the hyperparameters ε > 0 and α > 0. The constant ε > 0 is chosen small but positive
to avoid division by zero in (10.5.1c). Possible default values are α = 0.01 and ε = 10−8 . The
AdaGrad update then reads
Due to
vk+1 = Σ_{j=0}^{k} ∇f(wj) ⊙ ∇f(wj),   (10.5.2)
the algorithm scales the gradient ∇f (wk ) in the update component-wise by the inverse square root
of the sum over all past squared gradients plus ε. Note that the scaling factor (vk+1,i + ε)−1/2 for
component i will be large, if the previous gradients for that component were small, and vice versa.
In the words of the authors of [57]: “our procedures give frequently occurring features very low
learning rates and infrequent features high learning rates.”
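A short Python sketch of this scheme is given below. Since the display specializing (10.5.1) to AdaGrad is not reproduced here, the update is assumed to take the standard form wk+1 = wk − α ∇f(wk) ⊘ √(vk+1 + ε) with vk+1 as in (10.5.2); the test function is an arbitrary illustrative choice.

import numpy as np

def adagrad(grad, w0, alpha=0.01, eps=1e-8, steps=10_000):
    # componentwise scaling by the inverse square root of the sum of past squared gradients, cf. (10.5.2)
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        v += g * g                         # v_{k+1} = v_k + grad f(w_k) ⊙ grad f(w_k)
        w -= alpha * g / np.sqrt(v + eps)  # assumed AdaGrad step
    return w

# ill-conditioned quadratic as in (10.4.1): the componentwise scaling removes the lambda factors
lam1, lam2 = 12.0, 1.0
w_final = adagrad(lambda w: np.array([lam1 * w[0], lam2 * w[1]]), [-1.0, 1.0])
print(w_final)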
Remark 10.26. A benefit of the componentwise scaling can be observed for the ill-conditioned
objective function in (10.4.1). Since in this case ∇f (wj ) = (λ1 wj,1 , λ2 wj,2 )⊤ for each j = 1, . . . , k,
setting ε = 0 AdaGrad performs the update
wk+1 = wk − α ( wk,1 (Σ_{j=0}^{k} wj,1²)^{−1/2} ,  wk,2 (Σ_{j=0}^{k} wj,2²)^{−1/2} )^⊤.
Note how the λ1 and λ2 factors in the update have vanished due to the division by √(vk+1). This
makes the method invariant to a componentwise rescaling of the gradient, and results in a more
direct path towards the minimizer.
10.5.2 RMSProp
The sum of past squared gradients can increase rapidly, leading to a significant reduction in learning
rates when training neural networks with AdaGrad. This often results in slow convergence, see
for example [242]. RMSProp [90] seeks to rectify this by adjusting the learning rates using an
exponentially weighted average of past gradients.
RMSProp corresponds to (10.5.1) with
effectively leaving the hyperparameters ε > 0, γ1 ∈ (0, 1) and α > 0. Typically, recommended
default values are ε = 10−8 , α = 0.01 and γ1 = 0.9. The algorithm is given through
Note that
vk+1 = (1 − γ1) Σ_{j=0}^{k} γ1^j ∇f(wk−j) ⊙ ∇f(wk−j),
so that, contrary to AdaGrad (10.5.2), the influence of gradient ∇f (wk−j ) on the weight v k+1
decays exponentially in j.
10.5.3 Adam
Adam [116], short for adaptive moment estimation, combines adaptive learning rates based on
exponentially weighted averages as in RMSProp, with heavy ball momentum. Contrary to AdaGrad
and RMSProp it thus uses a value β1 > 0.
More precisely, Adam corresponds to (10.5.1) with
β2 = 1 − β1 ∈ (0, 1),   γ2 = 1 − γ1 ∈ (0, 1),   αk = α √(1 − γ1^{k+1}) / (1 − β1^{k+1})
for all k ∈ N0 , for some α > 0. The default values for the remaining parameters recommended in
[116] are ε = 10−8 , α = 0.001, β1 = 0.9 and γ1 = 0.999. The update can be formulated as
mk+1 = β1 mk + (1 − β1)∇f(wk),             m̂k+1 = mk+1 / (1 − β1^{k+1}),   (10.5.4a)
vk+1 = γ1 vk + (1 − γ1)∇f(wk) ⊙ ∇f(wk),    v̂k+1 = vk+1 / (1 − γ1^{k+1}),   (10.5.4b)
wk+1 = wk − α m̂k+1 ⊘ √(v̂k+1 + ε).   (10.5.4c)
The mk+1 in (10.5.4a) are exponentially weighted averages of the past gradients and thus correspond to heavy ball style momentum with momentum parameter β = β1, see (10.4.6).
The normalized version m̂k+1 is introduced to account for the bias towards 0, stemming from the
initialization m0 = 0. The weight-vector v k+1 in (10.5.4b) is analogous to the exponentially
weighted average of RMSProp in (10.5.3a), and the normalization again serves to counter the bias
from v 0 = 0.
It should be noted that there exist examples of convex functions for which Adam does not
converge to a minimizer, see [190]. The authors of [190] propose a modification termed AMSGrad,
which avoids this issue and their analysis also applies to RMSProp. Nonetheless, Adam remains a
highly popular and successful algorithm for the training of neural networks. We also mention that
the proof of convergence in the stochastic setting requires k-dependent decreasing learning rates
such as α = O(k −1/2 ) in (10.5.3b) and (10.5.4c).
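For concreteness, the following Python sketch implements the update (10.5.4) with the default parameters listed above, applied to the full gradient of an arbitrary illustrative test function; in the stochastic setting, grad would instead return a mini-batch gradient.

import numpy as np

def adam(grad, w0, alpha=0.001, beta1=0.9, gamma1=0.999, eps=1e-8, steps=5000):
    # Adam update (10.5.4) with the default parameters recommended in [116]
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for k in range(steps):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g            # (10.5.4a)
        v = gamma1 * v + (1 - gamma1) * g * g      # (10.5.4b)
        m_hat = m / (1 - beta1 ** (k + 1))         # bias corrections for the zero initializations
        v_hat = v / (1 - gamma1 ** (k + 1))
        w -= alpha * m_hat / np.sqrt(v_hat + eps)  # (10.5.4c)
    return w

w_final = adam(lambda w: np.array([12.0 * w[0], 1.0 * w[1]]), [-1.0, 1.0])
print(w_final)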
and for a textbook specifically on stochastic optimization also see [126]. The backpropagation al-
gorithm discussed in Section 10.3 was popularized by Rumelhart, Hinton and Williams [197]; for
further details on the historical development we refer to [202, Section 5.5], and for a more in-depth
discussion of the algorithm, see for instance [84]. The heavy ball method in Section 10.4 goes back
to Polyak [180]. To motivate the algorithm we proceed similar as in [70, 181, 183]. For the analysis
of Nesterov acceleration [160], we follow the Lyapunov type proofs given in [231, 241]. Finally,
for Section 10.5 on other algorithms, we refer to the original works that introduced AdaGrad [57],
RMSProp [90] and Adam [116]. A good overview of gradient descent methods popular for deep
learning can be found in [194]. Regarding the analysis of RMSProp and Adam, we refer to [190],
which gave an example of a convex function for which Adam does not converge, and provided a prov-
ably convergent modification of the algorithm. Convergence proofs (for variations of) AdaGrad and
Adam can also be found in [49].
For a general discussion and analysis of optimization algorithms in machine learning see [23].
Details on implementations in Python can for example be found in [67], and for recommendations
and tricks regarding the implementation we also refer to [22, 129].
Exercises
Exercise 10.27. Let f ∈ C 1 (Rn ). Show that f is convex in the sense of Definition 10.8 if and only
if
Exercise 10.28. Find a function f : R → R that is L-smooth, satisfies the PL-inequality (10.1.19)
for some µ > 0, has a unique minimizer w∗ ∈ R, but is not convex and thus also not strongly
convex.
Exercise 10.29. Prove Theorem 10.17, i.e. show that L-smoothness and the PL-inequality (10.1.19)
yield linear convergence of f (wk ) → f (w∗ ) as k → ∞.
A subgradient always exists, i.e. ∂f (v) is necessarily nonempty. This statement is also known
under the name “Hyperplane separation theorem”. Subgradients generalize the notion of gradients
for convex functions, since for any convex continuously differentiable f , (10.5.5) is satisfied with
g = ∇f (v).
Exercise 10.31. Let f : Rn → R be convex and Lip(f ) ≤ L. Show that for any g ∈ ∂f (v) holds
∥g∥ ≤ L.
wk+1 := wk − hk g k ,
Hint: Start by recursively expanding ∥wk − w∗ ∥2 = · · · , and then apply the property of the
subgradient.
Exercise 10.33. Consider the setting of Exercise 10.32. Determine step sizes h1 , . . . , hk (which
may depend on k, i.e. hk,1 , . . . , hk,k ) such that for any arbitrarily small δ > 0
Exercise 10.34. Let A ∈ Rn×n be symmetric positive semidefinite, b ∈ Rn and c ∈ R. Denote
the eigenvalues of A by λ1 ≥ · · · ≥ λn ≥ 0. Show that the objective function
f(w) := (1/2) w^⊤ A w + b^⊤ w + c   (10.5.6)
is convex and λ1 -smooth. Moreover, if λn > 0, then f is λn -strongly convex. Show that these
values are optimal in the sense that f is neither L-smooth nor µ-strongly convex if L < λ1 and
µ > λn .
Hint: Note that L-smoothness and µ-strong convexity are invariant under shifts and the addition
of constants. That is, for every α ∈ R and β ∈ Rn , f˜(w) := α + f (w + β) is L-smooth or µ-strongly
convex if and only if f is. It thus suffices to consider w⊤ Aw/2.
Exercise 10.35. Let f be as in Exercise 10.34. Show that gradient descent converges for arbitrary
initialization w0 ∈ Rn , if and only if
Show that argminh>0 maxj=1,...,n |1 − hλj | = 2/(λ1 + λn ) and conclude that the convergence will be
slow if f is ill-conditioned, i.e. if λ1 /λn ≫ 1.
Hint: Assume first that b = 0 ∈ Rn and c = 0 ∈ R in (10.5.6), and use the singular value
decomposition A = U ⊤ diag(λ1 , . . . , λn )U .
Exercise 10.36. Show that (10.4.10) can equivalently be written as (10.4.11) with τ = √(µ/L),
α = 1/L, β = (1 − τ)/(1 + τ) and the initialization u0 = ((1 + τ)w0 − v0)/τ.
Chapter 11
In this chapter we explore the dynamics of training neural networks of large width. Throughout
we focus on the situation where we have data pairs
and wish to train a neural network Φ(x, w) depending on the input x ∈ Rd and the parameters
w ∈ Rn , by minimizing the square loss objective defined as
f(w) := Σ_{i=1}^m (Φ(xi, w) − yi)²,   (11.0.1b)
which is a multiple of the empirical risk R̂_S(Φ) in (1.2.3) for the sample S = (xi, yi)_{i=1}^m and the
square loss. We exclusively focus on gradient descent with a constant step size h, which yields a
sequence of parameters (wk )k∈N . We aim to understand the evolution of Φ(x, wk ) as k progresses.
For linear mappings w 7→ Φ(x, w), the objective function (11.0.1b) is convex. As established in
the previous chapter, gradient descent then finds a global minimizer. For typical neural network
architectures, w 7→ Φ(x, w) is not linear, and such a statement is in general not true.
Recent research has shown that neural network behavior tends to linearize in the parameters
as network width increases [106]. This allows one to transfer some of the results and techniques from
the linear case to the training of neural networks. We start this chapter in Sections 11.1 and 11.2
by recalling (kernel) least-squares methods, which describe linear (in w) models. Following [131],
the subsequent sections explore why in the infinite width limit neural networks exhibit linear-like
behavior. In Section 11.5.2 we formally introduce the linearization of w 7→ Φ(x, w). Section 11.4
presents an abstract result showing convergence of gradient descent, under the condition that Φ
does not deviate too much from its linearization. In Sections 11.5 and 11.6, we then detail the
implications for wide neural networks for two (slightly) different architectures. In particular, we
will prove that gradient descent can find global minimizers when applied to (11.0.1b) for networks
of very large width. We emphasize that this analysis treats the case of strong overparametrization,
specifically the case of increasing the network width while keeping the number of data points m
fixed.
11.1 Linear least-squares
Arguably one of the simplest machine learning algorithms is linear least-squares regression. Given
data (11.0.1a), linear regression tries to fit a linear function Φ(x, w) := x⊤ w in terms of w by
minimizing f (w) in (11.0.1b). With
A = (x1, . . . , xm)^⊤ ∈ R^{m×d}   and   y = (y1, . . . , ym)^⊤ ∈ Rᵐ,   (11.1.1)
it holds
f (w) = ∥Aw − y∥2 . (11.1.2)
Remark 11.1. More generally, the ansatz Φ(x, (w, b)) := w⊤ x + b corresponds to
Φ(x, (w, b)) = (1, x^⊤) (b, w^⊤)^⊤.
Therefore, additionally allowing for a bias can be treated analogously.
The model Φ(x, w) = x⊤ w is linear in both x and w. In particular, w 7→ f (w) is a convex
function by Exercise 10.34, and we may apply the convergence results of Chapter 10 when using
gradient based algorithms. If A is invertible, then f has a unique minimizer given by w∗ = A−1 y. If
rank(A) = d, then f is strongly convex by Exercise 10.34, and there still exists a unique minimizer.
If however rank(A) < d, then ker(A) ̸= {0} and there exist infinitely many minimizers of f . To
ensure uniqueness, we look for the minimum norm solution (or minimum 2-norm solution)
w∗ := argmin{w∈Rd | f (w)≤f (v) ∀v∈Rd } ∥w∥. (11.1.3)
The following proposition establishes the uniqueness of w∗ and demonstrates that it can be repre-
sented as a superposition of the (xi)_{i=1}^m.
Proposition 11.2. Let A ∈ Rm×d and y ∈ Rm be as in (11.1.1). There exists a unique minimum
2-norm solution of (11.1.2). Denoting H̃ := span{x1 , . . . , xm } ⊆ Rd , it is the unique element
Proof. We start with existence and uniqueness. Let C ⊆ Rm be the space spanned by the columns
of A. Then C is closed and convex, and therefore y ∗ = argminỹ∈C ∥y − ỹ∥ exists and is unique
(this is a fundamental property of Hilbert spaces, see Theorem B.14). In particular, the set M =
{w ∈ Rd | Aw = y ∗ } ⊆ Rd of minimizers of f is not empty. Clearly M is also closed and convex.
By the same argument as before, w∗ = argminw∗ ∈M ∥w∗ ∥ exists and is unique.
It remains to show (11.1.4). Denote by w∗ the minimum norm solution and decompose w∗ =
w̃ + ŵ with w̃ ∈ H̃ and ŵ ∈ H̃ ⊥ . We have Aw∗ = Aw̃ and ∥w∗ ∥2 = ∥w̃∥2 + ∥ŵ∥2 . Since w∗
is the minimal norm solution it must hold ŵ = 0. Thus w∗ ∈ H̃. Finally assume there exists a
minimizer v of f in H̃ different from w∗ . Then 0 ̸= w∗ − v ∈ H̃, and since H̃ is spanned by the
rows of A we have A(w∗ − v) ̸= 0. Thus y ∗ = Aw∗ ̸= Av, which contradicts that v minimizes
f.
The condition of minimizing the 2-norm is a form of regularization. Interestingly, gradient
descent converges to the minimum norm solution for the quadratic objective (11.1.2), as long as w0
is initialized within H̃ = span{x1 , . . . , xm } (e.g. w0 = 0). Therefore, it does not find an “arbitrary”
minimizer but implicitly regularizes the problem in this sense. In the following smax (A) denotes
the maximal singular value of A.
Theorem 11.3. Let A ∈ Rm×d be as in (11.1.1), let w0 = w̃0 + ŵ0 where w̃0 ∈ H̃ and ŵ0 ∈ H̃ ⊥ .
Fix h ∈ (0, 1/(2smax (A)2 )) and set
We sketch the argument in case w0 ∈ H̃, and leave the full proof to the reader, see Exercise
11.32. Note that H̃ is the space spanned by the rows of A (or the columns of A⊤ ). The gradient
of the objective function equals
∇f (w) = 2A⊤ (Aw − y).
Therefore, if w0 ∈ H̃, then the iterates of gradient descent never leave the subspace H̃. By Exercise
10.34 and Theorem 10.11, for small enough step size, it holds f (wk ) → 0. By Proposition 11.2
there only exists one minimizer in H̃, corresponding to the minimum norm solution. Thus wk
converges to the minimal norm solution.
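The following NumPy sketch illustrates Theorem 11.3 numerically for the initialization w0 = 0 ∈ H̃: the gradient descent iterates approach the minimum norm solution, computed here via the Moore–Penrose pseudoinverse A†y (cf. Exercise 11.32). The dimensions, the random data, and the number of iterations are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 10                               # underdetermined: rank(A) < d, infinitely many minimizers
A = rng.normal(size=(m, d))
y = rng.normal(size=m)

w = np.zeros(d)                            # w0 = 0 lies in the row space H~ of A
h = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)  # step size below 1/(2 s_max(A)^2)
for _ in range(10_000):
    w = w - h * 2 * A.T @ (A @ w - y)      # gradient of f(w) = ||Aw - y||^2

w_min_norm = np.linalg.pinv(A) @ y         # minimum norm solution A^† y
print(np.linalg.norm(w - w_min_norm))      # close to zero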
As we will see below, w∗ is well-defined. We will call Φ(x, w∗ ) = ⟨ϕ(x), w∗ ⟩H the kernel least
squares estimator. The nonlinearity of the feature map allows for more expressive models x 7→
Φ(x, w) capable of capturing more complicated structures beyond linearity in the data.
Remark 11.4 (Gradient descent). Let H = Rn be equipped with the Euclidean inner product. Con-
sider the sequence (wk )k∈N0 ⊆ Rn generated by gradient descent to minimize (11.2.2). Assuming
sufficiently small step size, by Theorem 11.3 for x ∈ Rd
Here, ŵ0 ∈ Rn denotes the orthogonal projection of w0 ∈ Rn onto H̃ ⊥ where H̃ := span{ϕ(x1 ), . . . , ϕ(xm )}.
Gradient descent thus yields the kernel least squares estimator plus ⟨ϕ(x), ŵ0 ⟩. Notably, on the
set
{x ∈ Rd | ϕ(x) ∈ span{ϕ(x1 ), . . . , ϕ(xm )}}, (11.2.4)
(11.2.3) thus coincides with the kernel least squares estimator independent of the initialization w0 .
11.2.1 Examples
To motivate the concept of feature maps consider the following example from [155].
Example 11.5. Let xi ∈ R2 with associated labels yi ∈ {−1, 1} for i = 1, . . . , m. The goal is to
find some model Φ(·, w) : R2 → R, for which
[Figure: scatter plots of dataset 1 (left) and dataset 2 (right) in the (x1, x2)-plane.]
The first dataset is separable by an affine hyperplane as depicted by the dashed line. Thus a linear
model is capable of correctly classifying all datapoints. For the second dataset this is not possible.
To enhance model expressivity, introduce a feature map ϕ : R2 → R6 via
For w ∈ R6 , this allows Φ(x) = w⊤ ϕ(x) to represent arbitrary polynomials of degree 2. With this
kernel approach, the decision boundary of (11.2.5) becomes the set of all hyperplanes in the feature
space passing through 0 ∈ R6 . Visualizing the last two features of the second dataset, we obtain
[Figure: dataset 2 visualized in the feature coordinates x1² and x2².]
Note how in the feature space R6 , the datapoints are again separated by such a hyperplane. Thus,
with the feature map in (11.2.6), the predictor (11.2.5) can perfectly classify all points also for the
second dataset.
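A small Python sketch illustrating this effect is given below. Since the concrete feature map (11.2.6) is not reproduced here, we use the degree-2 monomial map ϕ(x) = (1, x1, x2, x1 x2, x1², x2²) as one possible choice, and a toy radial dataset standing in for dataset 2.

import numpy as np

# assumed degree-2 feature map phi : R^2 -> R^6 (one concrete choice of (11.2.6))
phi = lambda x: np.array([1.0, x[0], x[1], x[0] * x[1], x[0] ** 2, x[1] ** 2])

# toy version of "dataset 2": the label depends only on the radius, so no affine
# hyperplane in R^2 separates the classes, but a hyperplane in feature space does
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(np.sum(X ** 2, axis=1) < 0.4, 1.0, -1.0)

A = np.stack([phi(x) for x in X])              # features of all datapoints
w, *_ = np.linalg.lstsq(A, y, rcond=None)      # least squares fit of Phi(x) = w^T phi(x)
predictions = np.sign(A @ w)
print("training accuracy:", np.mean(predictions == y))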
In the above example we chose the feature space H = R6 . It is also possible to work with
infinite dimensional feature spaces as the next example demonstrates.
Example 11.6. Let H = ℓ2 (N) be the space of square summable sequences and ϕ : Rd → ℓ2 (N)
some map. Fitting the corresponding model
Φ(x, w) = ⟨ϕ(x), w⟩_{ℓ²} = Σ_{i∈N} ϕi(x) wi
to data (xi, yi)_{i=1}^m requires minimizing
f(w) = Σ_{j=1}^m ( Σ_{i∈N} ϕi(xj) wi − yj )²,   w ∈ ℓ²(N).
Proof. Let w̃1, . . . , w̃n be a basis of H̃. If H̃ = {0} the statement is trivial, so we assume 1 ≤ n ≤ m.
Let A = (⟨ϕ(xi), w̃j⟩)_{ij} ∈ R^{m×n}. Every w̃ ∈ H̃ has a unique representation w̃ = Σ_{j=1}^n αj w̃j for
some α ∈ Rⁿ. With this ansatz
f(w̃) = Σ_{i=1}^m (⟨ϕ(xi), w̃⟩ − yi)² = Σ_{i=1}^m ( Σ_{j=1}^n ⟨ϕ(xi), w̃j⟩ αj − yi )² = ∥Aα − y∥².   (11.2.8)
Note that A : Rⁿ → Rᵐ is injective, since for every α ∈ Rⁿ \ {0} it holds Σ_{j=1}^n αj w̃j ∈ H̃ \ {0} and
hence Aα = ( ⟨ϕ(xi), Σ_{j=1}^n αj w̃j⟩ )_{i=1}^m ≠ 0. Therefore, there exists a unique minimizer α ∈ Rⁿ of
the right-hand side of (11.2.8), and thus there exists a unique minimizer w∗ ∈ H̃ in (11.2.7).
For arbitrary w ∈ H we wish to show f (w) ≥ f (w∗ ), so that w∗ minimizes f in H. Decompose
w = w̃ + ŵ with w̃ ∈ H̃ and ŵ ∈ H̃ ⊥ , i.e. ⟨ϕ(xj ), ŵ⟩H = 0 for all j = 1, . . . , m. Then, using that
w∗ minimizes f in H̃,
f(w) = Σ_{j=1}^m (⟨ϕ(xj), w⟩_H − yj)² = Σ_{j=1}^m (⟨ϕ(xj), w̃⟩_H − yj)² = f(w̃) ≥ f(w∗).
Finally, let w ∈ H be any minimizer of f in H different from w∗ . It remains to show ∥w∥H >
∥w∗ ∥H . Decompose again w = w̃ + ŵ with w̃ ∈ H̃ and ŵ ∈ H̃ ⊥ . As above f (w) = f (w̃) and
thus w̃ is a minimizer of f . Uniqueness of w∗ in (11.2.7) implies w̃ = w∗ . Therefore ŵ ̸= 0 and
∥w∗ ∥2H < ∥w̃∥2H + ∥ŵ∥2H = ∥w∥2H .
Instead of looking for the minimum norm minimizer w∗ in the Hilbert space H, by Proposition
11.2 it suffices to determine the unique minimizer in the at most m-dimensional subspace H̃ spanned
by ϕ(x1 ), . . . , ϕ(xm ). This significantly simplifies the problem. To do so we first introduce the
notion of kernels.
Proposition 11.9. Let α ∈ Rᵐ be any minimizer of (11.2.9). Then w∗ = Σ_{j=1}^m αj ϕ(xj) is the
unique minimum H-norm solution of (11.2.2).
Proposition 11.9, the proof of which is left as an exercise, suggests the following algorithm to
compute the kernel least squares estimator:
(i) compute the kernel matrix G = (K(xi, xj))_{i,j=1}^m,
Thus, minimizing (11.2.2) and expressing the kernel least squares estimator requires neither explicit
knowledge of the feature map ϕ nor of the minimum norm solution w∗ ∈ H. It is sufficient
to choose a kernel map K : Rd × Rd → R; this is known as the kernel trick. Given a kernel K, we
will therefore also refer to (11.2.10) as the kernel least squares estimator without specifying H or
ϕ.
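The following Python sketch carries out this procedure. It assumes, as the derivation via (11.2.8) suggests, that step (ii) amounts to choosing α as a minimizer of ∥Gα − y∥² and that the resulting estimator is x ↦ Σⱼ αⱼ K(x, xⱼ), cf. Proposition 11.9; the Gaussian kernel and the toy data are illustrative choices not taken from the text.

import numpy as np

def kernel_least_squares(K, X, y):
    # (i) kernel matrix G = (K(x_i, x_j))_{i,j=1}^m
    G = np.array([[K(xi, xj) for xj in X] for xi in X])
    # (ii) coefficient vector: any minimizer alpha of ||G alpha - y||^2 (least-squares solve)
    alpha, *_ = np.linalg.lstsq(G, y, rcond=None)
    # (iii) kernel least squares estimator x -> sum_j alpha_j K(x, x_j)
    return lambda x: sum(a * K(x, xj) for a, xj in zip(alpha, X))

# illustrative Gaussian kernel and a toy one-dimensional dataset
K = lambda x, z: np.exp(-np.sum((np.atleast_1d(x) - np.atleast_1d(z)) ** 2))
X = [np.array([t]) for t in np.linspace(-1, 1, 8)]
y = np.array([np.sin(3 * t[0]) for t in X])
estimator = kernel_least_squares(K, X, y)
print(estimator(np.array([0.3])))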
i.e. K is the corresponding kernel. See for instance [217, Thm. 4.49].
which is the first order Taylor approximation of Φ around the initial parameter w0 . Introduce the
notation
δi := Φ(xi , w0 ) − ∇w Φ(xi , w0 )⊤ w0 − yi for all i = 1, . . . , m. (11.3.2)
The square loss for the linearized model then reads
f^lin(w) := Σ_{i=1}^m (Φ^lin(xi, w) − yi)²
          = Σ_{i=1}^m (⟨∇wΦ(xi, w0), w⟩ + δi)²,   (11.3.3)
where ⟨·, ·⟩ stands for the Euclidean inner product in Rn . Comparing with (11.2.2), minimizing f lin
corresponds to a kernel least squares regression with feature map
ϕ(x) = ∇w Φ(x, w0 ) ∈ Rn .
We refer to K̂n as the empirical tangent kernel, as it arises from the first order Taylor approxima-
tion (the tangent) of the original model Φ around initialization w0 . Note that the kernel depends
on the choice of w0 . As explained in Remark 11.4, training Φlin with gradient descent yields the
kernel least-squares estimator with kernel K̂n plus an additional term depending on w0 .
Of course the linearized model Φlin only captures the behaviour of Φ for parameters w that are
close to w0 . If we assume for the moment that during training of Φ, the parameters remain close to
initialization, then we can expect similar behaviour and performance of Φ and Φlin . Under certain
assumptions, we will see in the next sections that this is precisely what happens, when the width
of a neural network increases. Before we make this precise, in Section 11.4 we investigate whether
gradient descent applied to f (w) will find a global minimizer, under the assumption that Φlin is a
good approximation of Φ.
Figure 11.1: Graph of w ↦ Φ(x1, w) and the linearization w ↦ Φ^lin(x1, w) at the initial parameter
w0, s.t. (d/dw)Φ(x1, w0) ≠ 0. If Φ and Φ^lin are close, then there exists w s.t. Φ(x1, w) = y1 (left). If
the derivatives are also close, the loss (Φ(x1 , w) − y1 )2 is nearly convex in w, and gradient descent
finds a global minimizer (right).
Figure 11.2: Same as Figure 11.1. If Φ and Φlin are not close, there need not exist w such that
Φ(x1 , w) = y1 , and gradient descent need not converge to a global minimizer.
(c) L ≤ λmin² / ( 12 m^{3/2} U² √(f(w0)) )   and   r = ( 2√m U / λmin ) √(f(w0)).   (11.4.3)
The regularity of the kernel matrix in Assumption 11.12 (a) is equivalent to (∇wΦ(xi, w0)^⊤)_{i=1}^m ∈
R^{m×n} having full rank m ≤ n (in particular we have at least as many parameters n as training
data m). In the context of Figure 11.1, this means that (d/dw)Φ(x1, w0) ≠ 0 and thus Φ^lin is not a
constant function. This condition guarantees that there exists w such that Φ^lin(xi, w) = yi for all
i = 1, . . . , m. In other words, already the linearized model Φ^lin is sufficiently expressive to interpo-
i = 1, . . . , m. In other words, already the linearized model Φlin is sufficiently expressive to interpo-
late the data. Assumption 11.12 (b) formalizes the closeness condition of Φ and Φlin . Apart from
giving an upper bound on ∇w Φ(xi , w), it assumes w 7→ Φ(xi , w) to be L-smooth in a ball of radius
r > 0 around w0 , for all i = 1, . . . , m. This allows to control how far Φ(xi , w) and Φlin (xi , w) and
their derivatives may deviate from each other for w in this ball. Finally Assumption 11.12 (c) ties
together all constants, ensuring the full model to be sufficiently close to its linearization in a large
enough neighbourhood of w0 .
We are now ready to state the following theorem, which is a variant of [131, Thm. G.1]. In
Section 11.5 we will see that its main requirement—Assumption 11.12—is satisfied with high prob-
ability for certain (wide) neural networks.
Theorem 11.13. Let Assumption 11.12 be satisfied and fix a positive learning rate
h ≤ 1/(λmin + λmax).   (11.4.4)
Set for all k ∈ N
wk+1 = wk − h∇f(wk).   (11.4.5)
It then holds for all k ∈ N
∥wk − w0∥ ≤ ( 2√m U / λmin ) √(f(w0)),   (11.4.6a)
f(wk) ≤ (1 − hλmin)^{2k} f(w0).   (11.4.6b)
E(w) := (Φ(xi, w) − yi)_{i=1}^m ∈ Rᵐ
such that
∇E(w) = (∇wΦ(xi, w))_{i=1}^m ∈ R^{m×n}
and similarly
∥∇E(w) − ∇E(v)∥² ≤ Σ_{i=1}^m ∥∇wΦ(xi, w) − ∇wΦ(xi, v)∥² ≤ mL²∥w − v∥²   for all w, v ∈ Br(w0).   (11.4.8b)
for all k ∈ N0, where an empty sum is understood as zero. Since Σ_{j=0}^∞ c^j = (1 − c)⁻¹ = (hλmin)⁻¹
and f(wk) = ∥E(wk)∥², these inequalities directly imply (11.4.6).
The case k = 0 is trivial. For the induction step, assume (11.4.9) holds for some k ∈ N0 .
Step 1. We show (11.4.9a) for k + 1. The induction assumption and (11.4.3) give
∥wk − w0∥ ≤ 2h√m U ∥E(w0)∥ Σ_{j=0}^∞ c^j = ( 2√m U / λmin ) √(f(w0)) = r,   (11.4.10)
Since the eigenvalues of ∇E(w0 )∇E(w0 )⊤ belong to [λmin , λmax ] by (11.4.7) and Assumption 11.12
(a), as long as h ≤ (λmin + λmax )−1 we have
∥I m − 2h∇E(w̃k )∇E(wk )⊤ ∥ ≤ ∥I m − 2h∇E(w0 )∇E(w0 )⊤ ∥ + 6hmU Lr
≤ 1 − 2hλmin + 6hmU Lr
≤ 1 − 2h(λmin − 3mU Lr)
≤ 1 − hλmin = c,
where we have used the equality for r and the upper bound for L in (11.4.3).
Let us emphasize the main statement of Theorem 11.13. By (11.4.6b), full batch gradient
descent (11.4.5) achieves zero loss in the limit, i.e. the data is interpolated by the limiting model. In
particular, this yields convergence for the (possibly nonconvex) optimization problem of minimizing
f (w).
11.5.1 Architecture
Let Φ : Rd → R be a neural network of depth one and width n ∈ N of type
Here x ∈ Rd is the input, and U ∈ Rn×d , v ∈ Rn , b ∈ Rn and c ∈ R are the parameters which we
collect in the vector w = (U , b, v, c) ∈ Rn(d+2)+1 (with U suitably reshaped). For future reference
we note that
∇UΦ(x, w) = (v ⊙ σ′(Ux + b)) x^⊤ ∈ R^{n×d},
∇bΦ(x, w) = v ⊙ σ′(Ux + b) ∈ Rⁿ,
∇vΦ(x, w) = σ(Ux + b) ∈ Rⁿ,                 (11.5.2)
∇cΦ(x, w) = 1 ∈ R,
where ⊙ denotes the Hadamard product. We also write ∇w Φ(x, w) ∈ Rn(d+2)+1 to denote the full
gradient with respect to all parameters.
In practice, it is common to initialize the weights randomly, and in this section we consider so-
called LeCun initialization. The following condition on the distribution used for this initialization
will be assumed throughout the rest of Section 11.5.
Assumption 11.14. The distribution W on R has expectation zero, variance one, and finite
moments up to order eight.
To explicitly indicate the expectation and variance in the notation, we also write W(0, 1) instead
of W, and for µ ∈ R and ς > 0 we use W(µ, ς 2 ) to denote the corresponding scaled and shifted
measure with expectation µ and variance ς 2 ; thus, if X ∼ W(0, 1) then µ + ςX ∼ W(µ, ς 2 ). LeCun
initialization [129] sets the variance of the weights in each layer to be reciprocal to the input
dimension of the layer, thereby normalizing the output variance across all network nodes. The
initial parameters
w0 = (U 0 , b0 , v 0 , c0 )
are thus randomly initialized with components
U0;ij ~ W(0, 1/d) i.i.d.,   v0;i ~ W(0, 1/n) i.i.d.,   b0;i = c0 = 0,   (11.5.3)
independently for all i = 1, . . . , n, j = 1, . . . , d. For a fixed ς > 0 one might choose variances ς 2 /d
and ς 2 /n in (11.5.3), which would require only minor modifications in the rest of this section. Biases
are set to zero for simplicity, with nonzero initialization discussed in the exercises. All expectations
and probabilities in Section 11.5 are understood with respect to this random initialization.
Example 11.15. Typical examples for W(0, 1) are the standard normal distribution on R or the
uniform distribution on [−√3, √3].
of the shallow network (11.5.1). Scaled properly, it converges in the infinite width limit n → ∞
towards a specific kernel known as the neural tangent kernel (NTK). Its precise formula depends
on the architecture and initialization. For the LeCun initialization (11.5.3) we denote it by K LC .
Theorem 11.16. Let R < ∞ such that |σ(x)| ≤ R · (1 + |x|) and |σ′(x)| ≤ R · (1 + |x|) for all
x ∈ R. For any x, z ∈ Rᵈ and ui ~ W(0, 1/d) i.i.d., i = 1, . . . , d, it then holds
lim_{n→∞} (1/n) K̂n(x, z) = E[σ(u^⊤x)σ(u^⊤z)] =: K^LC(x, z)
almost surely.
Moreover, for every δ, ε > 0 there exists n0(δ, ε, R) ∈ N such that for all n ≥ n0 and all x,
z ∈ Rᵈ with ∥x∥, ∥z∥ ≤ R
P[ |(1/n) K̂n(x, z) − K^LC(x, z)| < ε ] ≥ 1 − δ.
are i.i.d. with finite pth moment (independent of n) for all 1 ≤ p ≤ 8. Due to the linear growth bound
on σ and σ′, the same holds for the (σ(xi^(1)))_{i=1}^n and the (σ′(xi^(1)))_{i=1}^n. Similarly, the (σ(zi^(1)))_{i=1}^n
and (σ′(zi^(1)))_{i=1}^n are collections of i.i.d. random variables with finite pth moment for all 1 ≤ p ≤ 8.
Denote ṽi = √n v0;i such that ṽi ~ W(0, 1) i.i.d. By (11.5.2)
(1/n) K̂n(x, z) = (1 + x^⊤z) (1/n²) Σ_{i=1}^n ṽi² σ′(xi^(1)) σ′(zi^(1)) + (1/n) Σ_{i=1}^n σ(xi^(1)) σ(zi^(1)) + 1/n.
Since
(1/n) Σ_{i=1}^n ṽi² σ′(xi^(1)) σ′(zi^(1))   (11.5.4)
is an average over i.i.d. random variables with finite variance, the law of large numbers implies
almost sure convergence of this expression towards
E[ ṽi² σ′(xi^(1)) σ′(zi^(1)) ] = E[ṽi²] E[ σ′(xi^(1)) σ′(zi^(1)) ],
where we used that ṽi² is independent of σ′(xi^(1)) σ′(zi^(1)). By the same argument
(1/n) Σ_{i=1}^n σ(xi^(1)) σ(zi^(1)) → E[σ(u^⊤x)σ(u^⊤z)]
Example 11.17 (K^LC for ReLU). Let σ(x) = max{0, x} and let W(0, 1) be the standard normal
distribution. For x, z ∈ Rᵈ denote by
θ = arccos( x^⊤z / (∥x∥∥z∥) )
the angle between these vectors. Then according to [37, Appendix A], it holds with ui ~ W(0, 1/d) i.i.d.,
i = 1, . . . , d,
K^LC(x, z) = E[σ(u^⊤x)σ(u^⊤z)] = ( ∥x∥∥z∥ / (2πd) ) ( sin(θ) + (π − θ) cos(θ) ).
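This closed-form expression can be checked by a simple Monte Carlo experiment; the points x, z, the dimension, and the sample size in the following Python sketch are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
d = 3
x = np.array([1.0, 0.5, -0.2])
z = np.array([0.3, -1.0, 0.8])

# Monte Carlo estimate of E[relu(u^T x) relu(u^T z)] with u ~ N(0, I_d / d)
U = rng.normal(scale=1.0 / np.sqrt(d), size=(1_000_000, d))
mc = np.mean(np.maximum(U @ x, 0) * np.maximum(U @ z, 0))

# closed form from Example 11.17
theta = np.arccos(x @ z / (np.linalg.norm(x) * np.linalg.norm(z)))
closed = (np.linalg.norm(x) * np.linalg.norm(z) / (2 * np.pi * d)
          * (np.sin(theta) + (np.pi - theta) * np.cos(theta)))
print(mc, closed)   # the two values agree up to Monte Carlo error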
(K^LC(xi, xj))_{i,j=1}^m ∈ R^{m×m}
We start by showing Assumption 11.12 (a) for the present setting. More precisely, we give
bounds for the eigenvalues of the empirical tangent kernel.
Lemma 11.19. Let Assumption 11.18 be satisfied. Then for every δ > 0 there exists
n0(δ, λ^LC_min, m, R) ∈ R such that for all n ≥ n0 with probability at least 1 − δ all eigenvalues of
(K̂n(xi, xj))_{i,j=1}^m = ( ⟨∇wΦ(xi, w0), ∇wΦ(xj, w0)⟩ )_{i,j=1}^m ∈ R^{m×m}
belong to [nλ^LC_min/2, 2nλ^LC_max].
Proof. Denote Ĝn := (K̂n(xi, xj))_{i,j=1}^m and G^LC := (K^LC(xi, xj))_{i,j=1}^m. By Theorem 11.16, there
exists n0 such that for all n ≥ n0 holds with probability at least 1 − δ that
∥G^LC − (1/n) Ĝn∥ ≤ λ^LC_min / 2,
where we have used that λ^LC_min is the smallest eigenvalue, and thus singular value, of the symmetric
positive definite matrix G^LC. This shows that the smallest eigenvalue of (1/n) Ĝn is larger or equal to
λ^LC_min/2. Similarly, we conclude that the largest eigenvalue of (1/n) Ĝn is bounded from above by
λ^LC_max + λ^LC_min/2 ≤ 2λ^LC_max. This concludes the proof.
Next we check Assumption 11.12 (b). To this end we first bound the norm of a random matrix.
Lemma 11.20. Let W(0, 1) be as in Assumption 11.14, and let W ∈ R^{n×d} with Wij ~ W(0, 1) i.i.d.
Denote the fourth moment of W(0, 1) by µ4. Then
P[ ∥W∥ ≤ √(n(d + 1)) ] ≥ 1 − dµ4/n.
Proof. It holds
∥W∥ ≤ ∥W∥_F = ( Σ_{i=1}^n Σ_{j=1}^d Wij² )^{1/2}.
The αi := Σ_{j=1}^d Wij², i = 1, . . . , n, are i.i.d. distributed with expectation d and finite variance dC,
where C ≤ µ4 is the variance of W11². By Theorem A.23
P[ ∥W∥ > √(n(d + 1)) ] ≤ P[ (1/n) Σ_{i=1}^n αi > d + 1 ] ≤ P[ |(1/n) Σ_{i=1}^n αi − d| > 1 ] ≤ dµ4/n,
Lemma 11.21. Let Assumption 11.18 (a) be satisfied with some constant R. Then there exists
M (R), and for all c, δ > 0 there exists n0 (c, d, δ, R) ∈ N such that for all n ≥ n0 it holds with
probability at least 1 − δ
∥∇wΦ(x, w)∥ ≤ M√n   for all w ∈ B_{cn^{−1/2}}(w0),
∥∇wΦ(x, w) − ∇wΦ(x, v)∥ ≤ M√n ∥w − v∥   for all w, v ∈ B_{cn^{−1/2}}(w0).
Proof. Due to the initialization (11.5.3), by Lemma 11.20 we can find n0 (δ, d) such that for all
n ≥ n0 holds with probability at least 1 − δ that
∥v0∥ ≤ 2   and   ∥U0∥ ≤ 2√n.   (11.5.5)
For the rest of this proof we fix arbitrary x ∈ Rd and n ≥ n0 ≥ c2 such that
We need to show that the claimed inequalities hold as long as (11.5.5) is satisfied. We will several
times use that for all p, q ∈ Rn
∥p ⊙ q∥ ≤ ∥p∥∥q∥   and   ∥σ(p)∥ ≤ R√n + R∥p∥
w = (U , b, v, c) s.t. ∥w − w0 ∥ ≤ cn−1/2 .
Due to
∥U∥ ≤ ∥U0∥ + ∥U0 − U∥_F ≤ 2√n + cn^{−1/2} ≤ 3√n,   (11.5.7)
the last norm in (11.5.6) is bounded by
and therefore
∥∇bΦ(x, w)∥ ≤ √n (6R + 9R²).
For the gradient with respect to U we use ∇U Φ(x, w) = ∇b Φ(x, w)x⊤ , so that
∥∇UΦ(x, w)∥_F = ∥∇bΦ(x, w) x^⊤∥_F = ∥∇bΦ(x, w)∥ ∥x∥ ≤ √n (6R² + 9R³).
Next
Next
and finally ∇cΦ(x, w) = 1 is constant. With M2(R) := R + 6R² + 6R³ this shows
∥∇wΦ(x, w) − ∇wΦ(x, w̃)∥ ≤ √n M2(R) ∥w − w̃∥.
In all, this concludes the proof with M (R) := max{M1 (R), M2 (R)}.
Before coming to the main result of this section, we first show that the initial error f (w0 )
remains bounded with high probability.
Lemma 11.22. Let Assumption 11.18 (a), (b) be satisfied. Then for every δ > 0 there exists
R0(δ, m, R) > 0 such that for all n ∈ N
P[f (w0 ) ≤ R0 ] ≥ 1 − δ.
Proof. Let i ∈ {1, . . . , m}, and set α := U0 xi and ṽj := √n v0;j for j = 1, . . . , n, so that ṽj ~ W(0, 1)
i.i.d. Then
Φ(xi, w0) = (1/√n) Σ_{j=1}^n ṽj σ(αj).
By Assumption 11.14 and (11.5.3), the ṽj σ(αj ), j = 1, . . . , n, are i.i.d. centered random variables
with finite variance bounded by a constant C(R) independent of n. Thus the variance of Φ(xi , w0 )
is also bounded by C(R). By Chebyshev’s inequality, see Lemma A.22, for every k > 0
P[ |Φ(xi, w0)| ≥ k√C ] ≤ 1/k².
Setting k = √(m/δ),
P[ Σ_{i=1}^m |Φ(xi, w0) − yi|² ≥ m (k√C + R)² ] ≤ Σ_{i=1}^m P[ |Φ(xi, w0) − yi|² ≥ (k√C + R)² ]
                                              ≤ Σ_{i=1}^m P[ |Φ(xi, w0)| ≥ k√C ] ≤ δ,
which shows the claim with R0 = m · (√(Cm/δ) + R)².
The next theorem is the main result of this section. It states that in the present setting gradient
descent converges to a global minimizer and the limiting network achieves zero loss, i.e. interpolates
the data. Moreover, during training the network weights remain close to initialization if the network
width n is large.
Theorem 11.23. Let Assumption 11.18 be satisfied, and let the parameters w0 of the neural
network Φ in (11.5.1) be initialized according to (11.5.3). Fix a learning rate
h < 2 / ( (λ^LC_min + 4λ^LC_max) n ).
Then for every δ > 0 there exist C > 0, n0 ∈ N such that for all n ≥ n0 holds with probability
at least 1 − δ that for all k ∈ N
∥wk − w0∥ ≤ C/√n,
f(wk) ≤ C (1 − h n λ^LC_min / 2)^{2k}.
Proof. We wish to apply Theorem 11.13, which requires Assumption 11.12 to be satisfied. By
Lemma 11.19, 11.21 and 11.22, for every c > 0 we can find n0 such that for all n ≥ n0 with
probability at least 1 − δ we have √(f(w0)) ≤ √R0 and Assumption 11.12 (a), (b) holds with the
values
L = M√n,   U = M√n,   r = cn^{−1/2},   λmin = nλ^LC_min/2,   λmax = 2nλ^LC_max.
For Assumption 11.12 (c), it suffices that
M√n ≤ ( n λ^LC_min / 2 )² / ( 12 m^{3/2} M² n √R0 )   and   cn^{−1/2} ≥ ( 2√m M√n / (n λ^LC_min/2) ) √R0.
Choosing c > 0 and n large enough, the inequalities hold. The statement is now a direct consequence
of Theorem 11.13.
for the full and linearized models, respectively. Let us consider the dynamics of the prediction of
the network on the training data. Writing
it holds
∇w f (w) = ∇w ∥Φ(X, w) − y∥2 = 2∇w Φ(X, w)⊤ (Φ(X, w) − y).
Thus for the full model
where w̃k is in the convex hull of wk and wk+1 .
Similarly, for the linearized model with (cp. (11.3.1))
such that
∇p f lin (p) = ∇p ∥Φlin (X, p) − y∥2 = 2∇w Φ(X, w0 )⊤ (Φlin (X, p) − y)
and
Remark 11.24. From (11.5.9) it is easy to see that with A := 2h∇w Φ(X, w0 )∇w Φ(X, w0 )⊤ and
B := I m − A holds the explicit formula
Φ^lin(X, pk) = B^k Φ^lin(X, p0) + Σ_{j=0}^{k−1} B^j A y.
The dynamics (11.5.8) and (11.5.9) are thus driven by the matrices 2h∇wΦ(X, w̃k)∇wΦ(X, wk)^⊤ and 2h∇wΦ(X, w0)∇wΦ(X, w0)^⊤, respectively.
Recall that the step size h in Theorem 11.23 scales like 1/n.
Proposition 11.25. Consider the setting of Theorem 11.23. Then there exists C < ∞, and for
every δ > 0 there exists n0 such that for all n ≥ n0 holds with probability at least 1 − δ that for all
k∈N
(1/n) ∥∇wΦ(X, w̃k) ∇wΦ(X, wk)^⊤ − ∇pΦ(X, p0) ∇pΦ(X, p0)^⊤∥ ≤ C n^{−1/2}.
Proof. Consider the setting of the proof of Theorem 11.23. Then for every k ∈ N holds ∥wk −w0 ∥ ≤
r and thus also ∥w̃k − w0 ∥ ≤ r, where r = cn−1/2 . Thus Lemma 11.21 implies the norm to be
bounded by
(1/n) ∥∇wΦ(X, w̃k) − ∇pΦ(X, p0)∥ ∥∇wΦ(X, wk)^⊤∥
+ (1/n) ∥∇pΦ(X, p0)∥ ∥∇wΦ(X, wk)^⊤ − ∇pΦ(X, p0)^⊤∥
≤ mM (∥w̃k − p0∥ + ∥wk − p0∥) ≤ c m M n^{−1/2}
By Proposition 11.25 the two matrices driving the dynamics (11.5.8) and (11.5.9) remain in
an O(n−1/2 ) neighbourhood of each other throughout training. This allows to show the following
proposition, which states that the prediction function learned by the network gets arbitrarily close
to the one learned by the linearized version in the limit n → ∞. The proof, which we omit, is based
on Grönwall’s inequality. See [106, 131].
Proposition 11.26. Consider the setting of Theorem 11.23. Then there exists C < ∞, and for
every δ > 0 there exists n0 such that for all n ≥ n0 holds with probability at least 1 − δ that for all
∥x∥ ≤ 1
sup_{k∈N} |Φ(x, wk) − Φ^lin(x, pk)| ≤ C n^{−1/2}.
Definition 11.27. Let (Ω, P) be a probability space, and let g : Rd ×Ω → R. We call g a Gaussian
process with mean function m : Rd → R and covariance function c : Rd × Rd → R if
(b) for all k ∈ N and all x1 , . . . , xk ∈ Rd the random variables g(x1 , ·), . . . , g(xk , ·) have a joint
Gaussian distribution such that
(g(x1, ω), . . . , g(xk, ω)) ∼ N( (m(xi))_{i=1}^k, (c(xi, xj))_{i,j=1}^k ).
Proposition 11.28. Consider width-n networks Φn as in (11.5.1) with initialization (11.5.3), and
define with ui ~ W(0, 1/d) i.i.d., i = 1, . . . , d,
Proof. Set ṽi := √n v0,i and ũi = (U0,i1, . . . , U0,id) ∈ Rᵈ, so that ṽi ~ W(0, 1) i.i.d., and the ũi ∈ Rᵈ are
also i.i.d., with each component distributed according to W(0, 1/d).
Then for any x1, . . . , xk,
Zi := ( ṽi σ(ũi^⊤ x1), . . . , ṽi σ(ũi^⊤ xk) )^⊤ ∈ Rᵏ,   i = 1, . . . , n,
defines n centered i.i.d. vectors in Rᵏ. By the central limit theorem, see Theorem A.25,
( Φ(x1, w0), . . . , Φ(xk, w0) )^⊤ = (1/√n) Σ_{i=1}^n Zi
In the sense of Proposition 11.28, the network Φ(x, w0 ) converges to a Gaussian process as the
width n tends to infinity. Using the explicit dynamics of the linearized network outlined in Remark
11.24, one can show that the linearized network after training also corresponds to a Gaussian
process (for some mean and covariance function depending on the data, the architecture, and the
initialization). As the full and linearized models converge in the infinite width limit, we can infer
that wide networks post-training resemble draws from a Gaussian process, see [131, Sec. 2.3.1] and
[46].
Rather than delving into the technical details of such statements, in Figure 11.3 we plot 80
different realizations of a neural network before and after training, i.e. the functions
We chose the architecture as (11.5.1) with activation function σ(x) = arctan(x), width n = 250 and
initialization
U0;ij ~ N(0, 3/d) i.i.d.,   v0;i ~ N(0, 3/n) i.i.d.,   b0;i, c0 ~ N(0, 2) i.i.d.   (11.5.11)
The network was trained on a dataset of size m = 3 with k = 1000 steps of gradient descent
and constant step size h = 1/n. Before training, the network’s outputs resemble random draws
from a Gaussian process with a constant zero mean function. Post-training, the outputs show
minimal variance at the data points, since they essentially interpolate the data, cp. Remark 11.4
and (11.2.4). They exhibit increased variance further from these points, with the precise amount
depending on the initialization variance chosen in (11.5.11).
(a) before training (b) after training
Figure 11.3: 80 realizations of a neural network at initialization (a) and after training on the red
data points (b). The blue dashed line shows the mean. Figure based on [131, Fig. 2].
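A sketch reproducing the qualitative behavior of panel (a) is given below: it samples realizations of the shallow network Φ(x, w) = v^⊤σ(Ux + b) + c from (11.5.1) (this form is consistent with the gradients (11.5.2)) under the initialization (11.5.11) and evaluates them on a grid. Plotting the rows of the resulting array against xs yields curves as in Figure 11.3 (a); the grid and seed are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
d, n = 1, 250
xs = np.linspace(-3, 3, 200)

def sample_network():
    # initialization (11.5.11): U ~ N(0, 3/d), v ~ N(0, 3/n), b, c ~ N(0, 2)
    U = rng.normal(scale=np.sqrt(3.0 / d), size=(n, d))
    v = rng.normal(scale=np.sqrt(3.0 / n), size=n)
    b = rng.normal(scale=np.sqrt(2.0), size=n)
    c = rng.normal(scale=np.sqrt(2.0))
    # shallow network (11.5.1) with sigma = arctan, evaluated on the grid xs
    return np.array([v @ np.arctan(U @ np.array([x]) + b) + c for x in xs])

realizations = np.stack([sample_network() for _ in range(80)])
print(realizations.mean(), realizations.std())   # roughly zero mean, Gaussian-process-like spread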
As a result of this different scaling, gradient descent with step size O(n⁻¹) as in Theorem 11.23
will primarily train the weights v in the output layer, and will barely move the remaining parameters
U , b, and c. This is also reflected in the expression for the obtained kernel K LC computed in
Theorem 11.16, which corresponds to the contribution of the term ⟨∇v Φ, ∇v Φ⟩.
Remark 11.29. For optimization methods such as ADAM, which scale each component of the
gradient individually, the same does not hold in general.
LeCun initialization aims to normalize the variance of the output of all nodes at initialization
(the forward dynamics). To also normalize the variance of the gradients (the backward dynamics),
in this section we briefly discuss a different architecture and initialization, consistent with the one
used in the original NTK paper [106].
11.6.1 Architecture
Let Φ : Rd → R be a depth-one neural network
Φ(x, w) = (1/√n) v^⊤ σ( (1/√d) U x + b ) + c,   (11.6.1)
At initialization, (11.6.1), (11.6.2) is equivalent to (11.5.1), (11.5.3). However, for the gradient we
obtain
∇UΦ(x, w) = n^{−1/2} ( v ⊙ σ′(d^{−1/2} U x + b) ) d^{−1/2} x^⊤ ∈ R^{n×d},
∇bΦ(x, w) = n^{−1/2} v ⊙ σ′(d^{−1/2} U x + b) ∈ Rⁿ,
∇vΦ(x, w) = n^{−1/2} σ(d^{−1/2} U x + b) ∈ Rⁿ,          (11.6.3)
∇cΦ(x, w) = 1 ∈ R.
Contrary to (11.5.2), the three gradients with O(n) entries are all scaled by the factor n−1/2 . This
leads to a different training dynamics.
as n → ∞. Here and in the following we consider the setting (11.6.1)–(11.6.2) for Φ and w0 .
Since this is also referred to as the NTK initialization, we denote the kernel by K NTK . Due to the
different training dynamics, we obtain additional terms in the NTK compared to Theorem 11.23.
Theorem 11.30. Let R < ∞ such that |σ(x)| ≤ R · (1 + |x|) and |σ′(x)| ≤ R · (1 + |x|) for all
x ∈ R, and let W satisfy Assumption 11.14. For any x, z ∈ Rᵈ and ui ~ W(0, 1/d) i.i.d., i = 1, . . . , d,
it then holds
lim_{n→∞} K̂n(x, z) = (1 + x^⊤z/d) E[σ′(u^⊤x)σ′(u^⊤z)] + E[σ(u^⊤x)σ(u^⊤z)] + 1 =: K^NTK(x, z)
almost surely.
Proof. Denote x^(1) = d^{−1/2} U0 x + b0 ∈ Rⁿ and z^(1) = d^{−1/2} U0 z + b0 ∈ Rⁿ. Due to the initialization (11.6.2)
and our assumptions on W(0, 1), the components
xi^(1) = d^{−1/2} Σ_{j=1}^d U0;ij xj ∼ u^⊤x,   i = 1, . . . , n,
are i.i.d. with finite pth moment (independent of n) for all 1 ≤ p ≤ 8, and the same holds for the
(σ(xi^(1)))_{i=1}^n, (σ′(xi^(1)))_{i=1}^n, (σ(zi^(1)))_{i=1}^n, and (σ′(zi^(1)))_{i=1}^n.
Then
K̂n(x, z) = (1 + x^⊤z/d) (1/n) Σ_{i=1}^n vi² σ′(xi^(1)) σ′(zi^(1)) + (1/n) Σ_{i=1}^n σ(xi^(1)) σ(zi^(1)) + 1.
By the law of large numbers and because E[vi2 ] = 1, this converges almost surely to K NTK (x, z).
The existence of n0 follows similarly by an application of Theorem A.23.
Example 11.31 (K NTK for ReLU). Let σ(x) = max{0, x} and let W(0, 1/d) be the centered
normal distribution on R with variance 1/d. For x, z ∈ Rd holds by [37, Appendix A] (also see
iid
Exercise 11.36), that with ui ∼ W(0, 1/d), i = 1, . . . , d,
E[σ′(u^⊤x) σ′(u^⊤z)] = ( π − arccos( x^⊤z / (∥x∥∥z∥) ) ) / (2π).
Together with Example 11.17, this yields an explicit formula for K^NTK in Theorem 11.30.
For this network architecture and under suitable assumptions on W, similar arguments as
in Section 11.5 can be used to show convergence of gradient descent to a global minimizer and
proximity of the full to the linearized model. We refer to the literature in the bibliography section.
Exercises
Exercise 11.32. Prove Theorem 11.3.
Hint: Assume first that w0 ∈ ker(A)^⊥ (i.e. w0 ∈ H̃). For rank(A) < d, using wk = wk−1 −
h∇f(wk−1) and the singular value decomposition of A, write down an explicit formula for wk.
Observe that due to 1/(1 − x) = Σ_{k∈N0} x^k for all x ∈ (0, 1) it holds wk → A†y as k → ∞, where
A† is the Moore-Penrose pseudoinverse of A.
E[ 1_{[0,∞)}(u^⊤x) 1_{[0,∞)}(u^⊤z) ] = (π − θ)/(2π),   θ = arccos( x^⊤z / (∥x∥∥z∥) ).
This shows the formula for the ReLU NTK with Gaussian initialization as discussed in Example
11.31.
Hint: Consider the following sketch
Exercise 11.37. Consider the network (11.5.1) with LeCun initialization as in (11.5.3), but with
the biases instead initialized as
c, bi ~ W(0, 1) i.i.d.   for all i = 1, . . . , n.   (11.6.4)
Compute the corresponding NTK as in Theorem 11.23. Moreover, compute the NTK also for the
normalized network (11.6.1) with initialization (11.6.2) as in Theorem 11.30, but replace again the
bias initialization with that given in (11.6.4).
Chapter 12
In Chapter 10, we saw how the weights of neural networks get adapted during training, using, e.g.,
variants of gradient descent. For certain cases, including the wide networks considered in Chapter
11, the corresponding iterative scheme converges to a global minimizer. In general, this is not
guaranteed, and gradient descent can for instance get stuck in non-global minima or saddle points.
To get a better understanding of these situations, in this chapter we discuss the so-called loss
landscape. This term refers to the graph of the empirical risk as a function of the weights. We
give a more rigorous definition below, and first introduce notation for neural networks and their
realizations for a fixed architecture.
Definition 12.1. Let A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 , let σ : R → R be an activation function, and
let B > 0. We denote the set of neural networks Φ with L layers, layer widths d0 , d1 , . . . , dL+1 , all
weights bounded in modulus by B, and using the activation function σ by N (σ; A, B). Additionally,
we define
PN(A, B) := ⨉_{ℓ=0}^{L} ( [−B, B]^{dℓ+1 × dℓ} × [−B, B]^{dℓ+1} ),
Rσ : PN(A, B) → N(σ; A, B),   (W^(ℓ), b^(ℓ))_{ℓ=0}^L ↦ Φ,   (12.0.1)
where Φ is the neural network with weights and biases given by (W^(ℓ), b^(ℓ))_{ℓ=0}^L.
Throughout, we will identify PN(A, B) with the cube [−B, B]^{nA}, where nA := Σ_{ℓ=0}^L dℓ+1(dℓ + 1).
Now we can introduce the loss landscape of a neural network architecture.
Figure 12.1: Two-dimensional section of a loss landscape. The loss landscape shows a spurious
valley with local minima, global minima, as well as a region where saddle points appear. Moreover,
a sharp minimum is shown.
ΛA,σ,S,L : PN(A, ∞) → R,   θ ↦ R̂_S(Rσ(θ)),
with R̂_S in (1.2.3) and Rσ in (12.0.1).
Identifying PN (A, ∞) with RnA , we can consider ΛA,σ,S,L as a map on RnA and the loss
landscape is a subset of RnA × R. The loss landscape is a high-dimensional surface, with hills and
valleys. For visualization a two-dimensional section of a loss landscape is shown in Figure 12.1.
Questions of interest regarding the loss landscape include for example: How likely is it that we
find local instead of global minima? Are these local minima typically sharp, having small volume,
or are they part of large flat valleys that are difficult to escape? How bad is it to end up in a local
minimum? Are most local minima as deep as the global minimum, or can they be significantly
higher? How rough is the surface generally, and how do these characteristics depend on the network
architecture? While providing complete answers to these questions is hard in general, in the rest
of this chapter we give some intuition and mathematical insights for specific cases.
12.1 Visualization of loss landscapes
Visualizing loss landscapes can provide valuable insights into the effects of neural network depth,
width, and activation functions. However, we can only visualize an at most two-dimensional surface
embedded into three-dimensional space, whereas the loss landscape is a very high-dimensional
object (unless the neural networks have only very few weights and biases).
To make the loss landscape accessible, we need to reduce its dimensionality. This can be achieved
by evaluating the function ΛA,σ,S,L on a two-dimensional subspace of PN (A, ∞). Specifically, we
choose three parameters µ, θ1, θ2 and examine the function (α1, α2) ↦ ΛA,σ,S,L(µ + α1θ1 + α2θ2). (12.1.1)
• Based on critical points: For a more global perspective, µ, θ1 , θ2 can be chosen to ensure the
observation of multiple critical points. One way to achieve this is by running the optimization
procedure three times with final parameters θ(1) , θ(2) , θ(3) . If the procedures have converged,
then each of these parameters is close to a critical point of ΛA,σ,S,L . We can now set µ = θ(1) ,
θ1 = θ(2) − µ, θ2 = θ(3) − µ. This then guarantees that (12.1.1) passes through or at least
comes very close to three critical points (at (α1 , α2 ) = (0, 0), (0, 1), (1, 0)). We present six
visualizations of this form in Figure 12.2.
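As an illustration of this slicing procedure, the following sketch (our own; the tiny tanh network, the random data, and the three random parameter vectors standing in for θ^{(1)}, θ^{(2)}, θ^{(3)} are placeholders) evaluates the empirical risk on the plane through three parameters and plots the resulting two-dimensional section.

```python
# Plot (a1, a2) -> Lambda(mu + a1*dir1 + a2*dir2) for a tiny tanh network.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)   # toy sample S

def risk(theta):                      # empirical risk for architecture A = (3, 8, 1)
    W0, b0 = theta[:24].reshape(8, 3), theta[24:32]
    W1, b1 = theta[32:40], theta[40]
    pred = np.tanh(X @ W0.T + b0) @ W1 + b1
    return np.mean((pred - y) ** 2)

# stand-ins for three (approximate) critical points from three training runs
theta1, theta2, theta3 = (rng.normal(size=41) for _ in range(3))
mu, dir1, dir2 = theta1, theta2 - theta1, theta3 - theta1

a = np.linspace(-0.5, 1.5, 80)
Z = np.array([[risk(mu + a1 * dir1 + a2 * dir2) for a1 in a] for a2 in a])
plt.contourf(a, a, Z, levels=30)
plt.xlabel("alpha_1")
plt.ylabel("alpha_2")
plt.show()
```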
Figure 12.2 gives some interesting insight into the effect of depth and width on the shape of the loss landscape. For very wide and shallow neural networks, we have the widest minima, which, in the case of the tanh activation function, also seem to belong to the same valley. With increasing depth and smaller width, the minima become steeper and more disconnected.
it is clear that for every permutation matrix P, permuting the neurons of a hidden layer according to P (together with the corresponding rows and columns of the adjacent weight matrices and bias vectors) does not change the realization. Hence, in general there exist multiple parameterizations realizing the same output function. Moreover, if at least one global minimum with non-permutation-invariant weights exists, then there is more than one global minimum of the loss landscape.
This is not problematic; in fact, having many global minima is beneficial. The larger issue is the
existence of non-global minima. Following [235], we start by generalizing the notion of non-global
minima to spurious valleys.
A path-connected component of ΩΛ(c) that does not contain a global minimum of ΛA,σ,S,L is called a spurious valley.
The next proposition shows that spurious local minima do not exist for shallow overparameter-
ized neural networks, i.e., for neural networks that have at least as many parameters in the hidden
layer as there are training samples.
Proof. Let θa , θb ∈ PN (A, ∞) with ΛA,σ,S,L (θa ) > ΛA,σ,S,L (θb ). Then we will show below that
there is another parameter θc such that
• there is a continuous path α : [0, 1] → PN (A, ∞) such that α(0) = θa , α(1) = θc , and
ΛA,σ,S,L (α) is monotonically decreasing.
By Exercise 12.7, the construction above rules out the existence of spurious valleys by choosing θa
an element of a spurious valley and θb a global minimum.
Next, we present the construction: Let us denote
θ_o = (W_o^{(ℓ)}, b_o^{(ℓ)})_{ℓ=0}^{1} for o ∈ {a, b, c}.
Moreover, for j = 1, . . . , d_1, we introduce v_o^j ∈ R^m defined as
(v_o^j)_i = σ((W_o^{(0)} x_i + b_o^{(0)})_j) for i = 1, . . . , m.
Next, we observe that there exists v^∗ ∈ R^m which is linearly independent from all v_a^j, j = 1, . . . , d_1, and can be written as (v^∗)_i = σ((w^∗)^⊤ x_i + b^∗) for some w^∗ ∈ R^{d_0}, b^∗ ∈ R. Indeed, if we assume that
where
(W_a^{(1)}(t))_1 = (1 − 2t)(W_a^{(1)})_1 and (W_a^{(1)}(t))_i = (W_a^{(1)})_i + 2tα_i(W_a^{(1)})_1,
where
(W_a^{(0)}(t))_1 = 2(t − 1/2)(W_a^{(0)})_1 + (2t − 1)w^∗ and (W_a^{(0)}(t))_i = (W_a^{(0)})_i
for i = 2, . . . , d_1, (b_a^{(0)}(t))_1 = 2(t − 1/2)(b_a^{(0)})_1 + (2t − 1)b^∗, and (b_a^{(0)}(t))_i = (b_a^{(0)})_i for i = 2, . . . , d_1.
It is clear by (12.2.2) that (Rσ(α_1)(x_i))_{i=1}^m is constant. Moreover, Rσ(α_2)(x) is constant for all x ∈ R^{d_0}. In addition, by construction, for
v̄^j := (σ((W_a^{(0)}(1)x_i + b_a^{(0)}(1))_j))_{i=1}^m
e_i = Φ_θ(x_i) − y_i for i = 1, . . . , m.
If we use the square loss, then
R̂_S(Φ_θ) = (1/m) Σ_{i=1}^m e_i². (12.3.1)
Proposition 12.5. Let A = (d_0, d_1, 1) and σ : R → R. Then, for every θ ∈ PN(A, ∞) at which R̂_S(Φ_θ) in (12.3.1) is twice continuously differentiable with respect to the weights, it holds that
H(θ) = H_0(θ) + H_1(θ),
where H(θ) is the Hessian of R̂_S(Φ_θ) at θ, H_0(θ) is a positive semi-definite matrix which is independent of (y_i)_{i=1}^m, and H_1(θ) is a symmetric matrix that for fixed θ and (x_i)_{i=1}^m depends linearly on (e_i)_{i=1}^m.
Proof. Using the identification introduced after Definition 12.2, we can consider θ as a vector in R^{nA}. For k = 1, . . . , nA, we have that
∂R̂_S(Φ_θ)/∂θ_k = (2/m) Σ_{i=1}^m e_i ∂Φ_θ(x_i)/∂θ_k.
we have that H_0(θ) = (2/m) Σ_{i=1}^m J_{i,θ} J_{i,θ}^⊤, and hence H_0(θ) is a sum of positive semi-definite matrices, which shows that H_0(θ) is positive semi-definite.
The symmetry of H_1(θ) follows directly from the symmetry of second derivatives, which holds since we assumed twice continuous differentiability at θ. The linearity of H_1(θ) in (e_i)_{i=1}^m is clear from (12.3.2).
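The positive semi-definiteness of H_0(θ) can also be checked numerically. In the sketch below (our own; the shallow ReLU network, the data, and the finite-difference gradients are illustrative choices), J_{i,θ} denotes the numerically estimated gradient of Φ_θ(x_i) with respect to θ, and H_0(θ) = (2/m) Σ_i J_{i,θ}J_{i,θ}^⊤ is assembled and its smallest eigenvalue inspected.

```python
# Assemble the Gauss-Newton part H_0(theta) of the Hessian for a tiny
# shallow ReLU network and verify that it is positive semi-definite.
import numpy as np

rng = np.random.default_rng(3)
m, d0, d1 = 10, 2, 4
X = rng.normal(size=(m, d0))
n_params = d1 * d0 + d1 + d1 + 1              # architecture A = (d0, d1, 1)
theta = rng.normal(size=n_params)

def net(theta, x):
    W0 = theta[:d1 * d0].reshape(d1, d0)
    b0 = theta[d1 * d0:d1 * d0 + d1]
    W1 = theta[d1 * d0 + d1:d1 * d0 + 2 * d1]
    b1 = theta[-1]
    return np.maximum(W0 @ x + b0, 0.0) @ W1 + b1

def grad(theta, x, eps=1e-6):                 # finite-difference gradient in theta
    g = np.zeros_like(theta)
    for k in range(len(theta)):
        e = np.zeros_like(theta)
        e[k] = eps
        g[k] = (net(theta + e, x) - net(theta - e, x)) / (2 * eps)
    return g

J = np.array([grad(theta, x) for x in X])     # row i is J_{i,theta}
H0 = 2.0 / m * J.T @ J                        # = (2/m) sum_i J_i J_i^T
print(np.min(np.linalg.eigvalsh(H0)) >= -1e-8)   # True: H_0 is PSD
```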
How does Proposition 12.5 imply the claimed relationship between the size of the loss and the
prevalence of saddle points?
Let θ correspond to a critical point. If H(θ) has at least one negative eigenvalue, then θ cannot
be a minimum, but instead must be either a saddle point or a maximum. While we do not know
anything about H_1(θ) other than that it is symmetric, it is not unreasonable to assume that it has a negative eigenvalue, especially if nA is very large. With this in mind, let us consider the following model:
Fix a parameter θ. Let S^0 = (x_i, y_i^0)_{i=1}^m be a sample and let (e_i^0)_{i=1}^m be the associated errors. Further, let H^0(θ), H_0^0(θ), H_1^0(θ) be the matrices according to Proposition 12.5.
Further, let for λ > 0, S^λ = (x_i, y_i^λ)_{i=1}^m be such that the associated errors are (e_i^λ)_{i=1}^m = λ(e_i^0)_{i=1}^m. The Hessian of R̂_{S^λ}(Φ_θ) at θ is then H^λ(θ) satisfying
H^λ(θ) = H_0^0(θ) + λH_1^0(θ).
In particular, for a unit vector v with v^⊤H_1^0(θ)v < 0, we obtain v^⊤H^λ(θ)v = v^⊤H_0^0(θ)v + λv^⊤H_1^0(θ)v,
which we can expect to be negative for large λ. Thus, H^λ(θ) has a negative eigenvalue for large λ.
On the other hand, if λ is small, then H^λ(θ) is merely a perturbation of H_0^0(θ), and we can expect its spectrum to resemble that of H_0^0(θ) more and more.
What we see is that the same parameter is more likely to be a saddle point for a sample that produces a high empirical risk than for a sample with small risk. Note that, since H_0^0(θ) was only shown to be positive semi-definite, the argument above does not rule out saddle points even for very small λ. But it does show that for small λ, every negative eigenvalue would be very small.
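The following toy computation (our own; H_0^0 and H_1^0 are random matrices, not Hessians of an actual network) illustrates the mechanism: for a positive semi-definite H_0^0 and an indefinite symmetric H_1^0, the smallest eigenvalue of H_0^0 + λH_1^0 becomes negative once λ is large.

```python
# Smallest eigenvalue of H_0 + lambda * H_1 as lambda (the error scale) grows.
import numpy as np

rng = np.random.default_rng(4)
n = 20
A = rng.normal(size=(n, n))
H0 = A @ A.T / n                      # positive semi-definite
B = rng.normal(size=(n, n))
H1 = (B + B.T) / 2                    # symmetric, generically indefinite

for lam in [0.0, 0.1, 1.0, 10.0]:
    lam_min = np.linalg.eigvalsh(H0 + lam * H1).min()
    print(f"lambda = {lam:5.1f}:  smallest eigenvalue = {lam_min:8.3f}")
```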
A more refined analysis where we compare different parameters but for the same sample and
quantify the likelihood of local minima versus saddle points requires the introduction of a probability
distribution on the weights. We refer to [172] for the details.
Exercises
Exercise 12.6. In view of Definition 12.3, show that a local minimum of a differentiable function
is contained in a spurious valley.
Exercise 12.7. Show that if there exists a continuous path α between a parameter θ1 and a global
minimum θ2 such that ΛA,σ,S,L (α) is monotonically decreasing, then θ1 cannot be an element of a
spurious valley.
Figure 12.2: A collection of loss landscapes. In the left column are neural networks with the ReLU activation function; the right column shows loss landscapes of neural networks with the hyperbolic tangent activation function. All neural networks have five-dimensional input and one-dimensional output. Moreover, from top to bottom the hidden layers have sizes 1000, 20, 10, and the numbers of layers are 1, 4, 7.
Chapter 13
As we have seen in the previous chapter, the loss landscape of neural networks can be quite intricate
and is typically not convex. In some sense, the reason for this is that we take the point of view
of a map from the parameterization of a neural network. Let us consider a convex loss function
L : R × R → R and a sample S = (x_i, y_i)_{i=1}^m ∈ (R^d × R)^m.
Then, for two neural networks Φ_1, Φ_2 and for α ∈ (0, 1) it holds that
R̂_S(αΦ_1 + (1 − α)Φ_2) = (1/m) Σ_{i=1}^m L(αΦ_1(x_i) + (1 − α)Φ_2(x_i), y_i)
≤ (1/m) Σ_{i=1}^m (αL(Φ_1(x_i), y_i) + (1 − α)L(Φ_2(x_i), y_i))
= αR̂_S(Φ_1) + (1 − α)R̂_S(Φ_2).
Hence, the empirical risk is convex when considered as a map depending on the neural network functions rather than on the neural network parameters. A convex function does not have spurious minima or saddle points. As a result, the issues from the previous chapter are avoided if we take the perspective of neural network sets.
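A short numerical check of this convexity inequality (our own sketch; the data and the two placeholder predictors are arbitrary):

```python
# The empirical risk of a convex combination of two predictors never exceeds
# the convex combination of their empirical risks (square loss).
import numpy as np

rng = np.random.default_rng(5)
X, y = rng.normal(size=(30, 2)), rng.normal(size=30)
Phi1 = np.tanh(X @ np.array([1.0, -2.0]))             # outputs of two placeholder predictors
Phi2 = np.maximum(X @ np.array([0.5, 3.0]), 0.0)
risk = lambda pred: np.mean((pred - y) ** 2)          # empirical risk with the square loss

alpha = 0.3
print(risk(alpha * Phi1 + (1 - alpha) * Phi2)
      <= alpha * risk(Phi1) + (1 - alpha) * risk(Phi2))   # True
```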
So why do we not optimize over the sets of neural networks instead of the parameters? To
understand this, we will now study the set of neural networks associated with a fixed architecture
as a subset of other function spaces.
We start by investigating the realization map Rσ introduced in Definition 12.1. Concretely,
we show in Section 13.1, that if σ is Lipschitz, then the set of neural networks is the image of
PN (A, ∞) under a locally Lipschitz map. We will use this fact to show in Section 13.2 that sets of
neural networks are typically non-convex, and even have arbitrarily large holes. Finally, in Section
13.3, we study the extent to which there exist best approximations to arbitrary functions, in the set
of neural networks. We will demonstrate that the lack of best approximations causes the weights
of neural networks to grow infinitely during training.
Proposition 13.1. Let A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 , let σ : R → R be Cσ -Lipschitz continuous
with Cσ ≥ 1, let |σ(x)| ≤ Cσ |x| for all x ∈ R, and let B ≥ 1.
Then, for all θ, θ′ ∈ PN (A, B),
Proof. Let θ, θ′ ∈ PN(A, B) and define δ := ∥θ − θ′∥∞. Repeatedly using the triangle inequality, we find a sequence (θ_j)_{j=0}^{nA} such that θ_0 = θ, θ_{nA} = θ′, ∥θ_j − θ_{j+1}∥∞ ≤ δ, and θ_j and θ_{j+1} differ in one entry only for all j = 0, . . . , nA − 1. We conclude that for all x ∈ [−1, 1]^{d_0}
∥Rσ(θ)(x) − Rσ(θ′)(x)∥∞ ≤ Σ_{j=0}^{nA−1} ∥Rσ(θ_j)(x) − Rσ(θ_{j+1})(x)∥∞. (13.1.1)
To upper bound (13.1.1), we now only need to understand the effect of changing one weight in a
neural network by δ.
Before we can complete the proof, we need two auxiliary lemmas. The first of these holds under slightly weaker assumptions than Proposition 13.1.
Lemma 13.2. Under the assumptions of Proposition 13.1, but with B allowed to be an arbitrary positive number, it holds for all Φ ∈ N(σ; A, B) and all x, x′ ∈ R^{d_0} that
∥Φ(x) − Φ(x′)∥∞ ≤ Cσ^L (Bd_max)^{L+1} ∥x − x′∥∞.
Proof. We start with the case where L = 1. Then, for (d_0, d_1, d_2) = A, we have that
Φ(x) = W^{(1)} σ(W^{(0)} x + b^{(0)}) + b^{(1)}
for certain W^{(0)}, W^{(1)}, b^{(0)}, b^{(1)} with all entries bounded by B. As a consequence, we can estimate
∥Φ(x) − Φ(x′)∥∞ = ∥W^{(1)}(σ(W^{(0)}x + b^{(0)}) − σ(W^{(0)}x′ + b^{(0)}))∥∞
≤ d_1 B ∥σ(W^{(0)}x + b^{(0)}) − σ(W^{(0)}x′ + b^{(0)})∥∞
≤ d_1 B Cσ ∥W^{(0)}(x − x′)∥∞
≤ d_1 d_0 B² Cσ ∥x − x′∥∞
≤ Cσ (d_max B)² ∥x − x′∥∞,
where we used the Lipschitz property of σ and the fact that ∥Ax∥∞ ≤ n max_{i,j} |A_{ij}| ∥x∥∞ for every matrix A = (A_{ij}) ∈ R^{m×n}.
The induction step from L to L+1 follows similarly. This concludes the proof of the lemma.
Lemma 13.3. Under the assumptions of Proposition 13.1 it holds that
Resolving the recursive estimate ∥x^{(ℓ)}∥∞ ≤ 2CσBd_max max{1, ∥x^{(ℓ−1)}∥∞}, we conclude that
∥x^{(ℓ)}∥∞ ≤ (2CσBd_max)^ℓ max{1, ∥x^{(0)}∥∞} = (2CσBd_max)^ℓ.
This concludes the proof of the lemma.
We can now proceed with the proof of Proposition 13.1. Assume that θ_{j+1} and θ_j differ only in one entry. We assume this entry to be in the ℓth layer, and we start with the case ℓ < L. It holds
|Rσ(θ_j)(x) − Rσ(θ_{j+1})(x)| = |Φ_ℓ(σ(W^{(ℓ)}x^{(ℓ)} + b^{(ℓ)})) − Φ_ℓ(σ(W̄^{(ℓ)}x^{(ℓ)} + b̄^{(ℓ)}))|,
where Φ_ℓ ∈ N(σ; A_ℓ, B) for A_ℓ = (d_{ℓ+1}, . . . , d_{L+1}) and (W^{(ℓ)}, b^{(ℓ)}), (W̄^{(ℓ)}, b̄^{(ℓ)}) differ in one entry only.
Using the Lipschitz continuity of Φ_ℓ from Lemma 13.2, we have
|Rσ(θ_j)(x) − Rσ(θ_{j+1})(x)| ≤ Cσ^{L−ℓ−1}(Bd_max)^{L−ℓ} |σ(W^{(ℓ)}x^{(ℓ)} + b^{(ℓ)}) − σ(W̄^{(ℓ)}x^{(ℓ)} + b̄^{(ℓ)})|
≤ Cσ^{L−ℓ}(Bd_max)^{L−ℓ} ∥W^{(ℓ)}x^{(ℓ)} + b^{(ℓ)} − W̄^{(ℓ)}x^{(ℓ)} − b̄^{(ℓ)}∥∞
≤ Cσ^{L−ℓ}(Bd_max)^{L−ℓ} δ max{1, ∥x^{(ℓ)}∥∞},
where, as before, δ = ∥θ − θ′∥∞. Invoking Lemma 13.3, we conclude that
|Rσ(θ_j)(x) − Rσ(θ_{j+1})(x)| ≤ (2CσBd_max)^ℓ Cσ^{L−ℓ}(Bd_max)^{L−ℓ} δ ≤ (2CσBd_max)^L ∥θ − θ′∥∞.
For the case ℓ = L, a similar estimate can be shown. Combining this with (13.1.1) yields the result.
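The single-entry estimate derived in the proof can be verified numerically. The sketch below (our own; the architecture, the bound B, and the perturbation size are arbitrary, and Cσ = 1 for the ReLU activation) perturbs one weight of a shallow ReLU network by δ and compares the resulting change of the output on [−1, 1]^{d_0} with (2CσBd_max)^Lδ.

```python
# Changing one weight of a ReLU network by delta changes the output on
# [-1, 1]^{d0} by at most (2 * C_sigma * B * d_max)^L * delta (here C_sigma = 1).
import numpy as np

rng = np.random.default_rng(6)
d0, d1, d2, L, B = 3, 5, 1, 1, 1.0
delta = 1e-3

def realize(W0, b0, W1, b1, x):
    return np.maximum(W0 @ x + b0, 0.0) @ W1 + b1

W0 = rng.uniform(-B, B, size=(d1, d0))
b0 = rng.uniform(-B, B, size=d1)
W1 = rng.uniform(-B, B, size=d1)
b1 = rng.uniform(-B, B)

W0p = W0.copy()
W0p[0, 0] += delta                               # perturb a single entry
xs = rng.uniform(-1, 1, size=(10_000, d0))
diff = max(abs(realize(W0, b0, W1, b1, x) - realize(W0p, b0, W1, b1, x)) for x in xs)

dmax = max(d0, d1, d2)
print(diff <= (2 * 1.0 * B * dmax) ** L * delta)  # True
```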
Using Proposition 13.1, we can now consider the set of neural networks with a fixed architecture N(σ; A, ∞) as a subset of L^∞([−1, 1]^{d_0}). What is more, N(σ; A, ∞) is the image of PN(A, ∞) under a locally Lipschitz map.
13.2 Convexity of neural network spaces
As a first step towards understanding N (σ; A, ∞) as a subset of L∞ ([−1, 1]d0 ), we notice that it is
star-shaped with few centers. Let us first introduce the necessary terminology.
Definition 13.4. Let Z be a subset of a linear space. A point x ∈ Z is called a center of Z if,
for every y ∈ Z it holds that
{tx + (1 − t)y | t ∈ [0, 1]} ⊆ Z.
A set is called star-shaped if it has at least one center.
The following proposition follows directly from the definition of a neural network and is the
content of Exercise 13.15.
Knowing that N (σ; A, B) is star-shaped with center 0, we can also ask ourselves if N (σ; A, B)
has more than this one center. It is not hard to see that also every constant function is a center.
The following theorem, which corresponds to [173, Proposition C.4], yields an upper bound on the
number of linearly independent centers.
T : L^∞([−1, 1]^{d_0}) → R^{nA+1},   g ↦ (g′_1(g), g′_2(g), . . . , g′_{nA+1}(g))^⊤.
Since T is continuous and linear, we have that T ∘ Rσ is locally Lipschitz continuous by Proposition 13.1. Moreover, since the (g_i)_{i=1}^{nA+1} are linearly independent, we have that T(span((g_i)_{i=1}^{nA+1})) = R^{nA+1}. We denote V := span((g_i)_{i=1}^{nA+1}).
Next, we would like to establish that N(σ; A, ∞) ⊇ V. Let g ∈ V. Then
g = Σ_{ℓ=1}^{nA+1} a_ℓ g_ℓ,
By the induction assumption, g̃^{(m)} ∈ N(σ; A, ∞) and hence, by Proposition 13.5, g̃^{(m)}/a_{m+1} ∈ N(σ; A, ∞). Additionally, since g_{m+1} is a center of N(σ; A, ∞), we have that (1/2)g_{m+1} + (1/(2a_{m+1}))g̃^{(m)} ∈ N(σ; A, ∞). By Proposition 13.5, we conclude that g̃^{(m+1)} ∈ N(σ; A, ∞).
The induction shows that g ∈ N(σ; A, ∞) and thus V ⊆ N(σ; A, ∞). As a consequence, T ∘ Rσ(PN(A, ∞)) ⊇ T(V) = R^{nA+1}.
It is a well-known fact of basic analysis that for every n ∈ N there does not exist a surjective and locally Lipschitz continuous map from R^n to R^{n+1}. We recall that nA = dim(PN(A, ∞)). This yields the contradiction.
For a convex set X, the line segment between any two points of X is a subset of X. Hence, every point of a convex set is a center. This yields the following corollary.
Corollary 13.7 tells us that we cannot expect convex sets of neural networks if the set of neural networks has many linearly independent elements. Sets of neural networks contain, for each f ∈ N(σ; A, ∞), also all shifts of this function, i.e., f(· + b) for b ∈ R^{d_0} is an element of N(σ; A, ∞). For a set of functions, being shift invariant and having only finitely many linearly independent functions at the same time is a very restrictive condition. Indeed, it was shown in [173, Proposition C.6] that if N(σ; A, ∞) has only finitely many linearly independent functions and σ is differentiable in at least one point with non-zero derivative there, then σ is necessarily a polynomial.
We conclude that the set of neural networks is in general non-convex and star-shaped, with 0 and the constant functions being centers. One could visualize this set in 3D as in Figure 13.1.
The fact that the neural network space is not convex could also mean that it merely fails to be convex at one point. For example, R² \ {0} is not convex, but for an optimization algorithm this would likely not pose a problem. We will next observe that N(σ; A, ∞) does not have such a benign non-convexity and, in fact, has arbitrarily large holes.
To make this claim mathematically precise, we first introduce the notion of ε-convexity.
Figure 13.1: Sketch of the space of neural networks in 3D. The vertical axis corresponds to the
constant neural network functions, each of which is a center. The set of neural networks consists
of many low-dimensional linear subspaces spanned by certain neural networks (Φ1 , . . . , Φ6 in this
sketch) and linear functions. Between these low-dimensional subspaces, there is not always a
straight-line connection by Corollary 13.7 and Theorem 13.9.
Definition 13.8. For ε > 0, we say that a subset A of a normed vector space X is ε-convex if
co(A) ⊆ A + Bε (0),
where co(A) denotes the convex hull of A and Bε (0) is an ε ball around 0 with respect to the norm
of X.
Intuitively speaking, a set that is convex when one fills up all holes smaller than ε is ε-convex.
Now we show that there is no ε > 0 such that N (σ; A, ∞) is ε-convex.
Theorem 13.9. Let L ∈ N and A = (d0 , d1 , . . . , dL , 1) ∈ NL+2 . Let K ⊆ Rd0 be compact and let
σ ∈ M, with M as in (3.1.1) and assume that σ is not a polynomial. Moreover, assume that there
exists an open set, where σ is differentiable and not constant.
If there exists an ε > 0 such that N (σ; A, ∞) is ε-convex, then N (σ; A, ∞) is dense in C(K).
Proof. Step 1. We show that ε-convexity implies that the closure N̄(σ; A, ∞) of N(σ; A, ∞) in C(K) is convex. By Proposition 13.5,
we have that N (σ; A, ∞) is scaling invariant. This implies that co(N (σ; A, ∞)) is scaling invariant
as well. Hence, if there exists ε > 0 such that N(σ; A, ∞) is ε-convex, then for every ε′ > 0
co(N(σ; A, ∞)) = (ε′/ε) co(N(σ; A, ∞)) ⊆ (ε′/ε)(N(σ; A, ∞) + B_ε(0)) = N(σ; A, ∞) + B_{ε′}(0).
This yields that N (σ; A, ∞) is ε′ -convex. Since ε′ was arbitrary, we have that N (σ; A, ∞) is
ε-convex for all ε > 0.
As a consequence, we have that
co(N(σ; A, ∞)) ⊆ ⋂_{ε>0} (N(σ; A, ∞) + B_ε(0)) ⊆ ⋂_{ε>0} (N̄(σ; A, ∞) + B_ε(0)) = N̄(σ; A, ∞).
Hence, co(N(σ; A, ∞)) ⊆ N̄(σ; A, ∞) and, by the well-known fact that in every metric vector space the convex hull of the closure of a set is contained in the closure of its convex hull, we conclude that N̄(σ; A, ∞) is convex.
Step 2. We show that N_{d_1}(σ; 1) ⊆ N̄(σ; A, ∞). If N(σ; A, ∞) is ε-convex, then by Step 1 N̄(σ; A, ∞) is convex. The scaling invariance of N(σ; A, ∞) then shows that N̄(σ; A, ∞) is a closed linear subspace of C(K).
Note that, by Proposition 3.16 for every w ∈ Rd0 and b ∈ R there exists a function f ∈
N (σ; A, ∞) such that
By definition, every constant function is an element of N (σ; A, ∞). Since N (σ; A, ∞) is a subspace,
this implies that all constant functions are in N (σ; A, ∞).
Since N̄(σ; A, ∞) is a closed vector space, this implies that for all n ∈ N and all w_1^{(1)}, . . . , w_n^{(1)} ∈ R^{d_0}, w_1^{(2)}, . . . , w_n^{(2)} ∈ R, b_1^{(1)}, . . . , b_n^{(1)} ∈ R, b^{(2)} ∈ R,
x ↦ Σ_{i=1}^n w_i^{(2)} σ((w_i^{(1)})^⊤ x + b_i^{(1)}) + b^{(2)} ∈ N̄(σ; A, ∞). (13.2.3)
Step 3. From (13.2.3), we conclude that N_{d_1}(σ; 1) ⊆ N̄(σ; A, ∞). In words, the whole set of shallow neural networks of arbitrary width is contained in the closure of the set of neural networks with a fixed architecture. By Theorem 3.8, we have that N_{d_1}(σ; 1) is dense in C(K), which yields the result.
For any activation function of practical relevance, a set of neural networks with fixed architecture
is not dense in C(K). This is only the case for very strange activation functions such as the one
discussed in Subsection 3.2. Hence, Theorem 13.9 shows that in general, sets of neural networks of
fixed architectures have arbitrarily large holes.
Consider A = (d0 , . . . , dL+1 ) ∈ NL+2 and an activation function σ. Let H be a normed function
space on [−1, 1]^{d_0} such that N(σ; A, ∞) ⊆ H. For h ∈ H we would like to find a neural network that best approximates h, i.e., to find Φ ∈ N(σ; A, ∞) such that
∥h − Φ∥_H = inf_{Φ̃ ∈ N(σ; A, ∞)} ∥h − Φ̃∥_H. (13.3.1)
We say that N(σ; A, ∞) has
• the best approximation property, if for all h ∈ H there exists at least one Φ ∈ N(σ; A, ∞) such that (13.3.1) holds,
• the unique best approximation property, if for all h ∈ H there exists exactly one Φ ∈ N(σ; A, ∞) such that (13.3.1) holds,
We will see in the sequel that, in the absence of the best approximation property, the learning problem necessarily requires the weights of the neural networks to tend to infinity, which may be undesirable in applications.
Moreover, having a continuous selection procedure is desirable, as it implies the existence of a stable selection algorithm, that is, an algorithm which for similar problems yields similar neural networks satisfying (13.3.1).
Below, we will study the properties above for L^p spaces, p ∈ [1, ∞). As we will see, neural network classes typically satisfy neither the continuous selection property nor the best approximation property.
Theorem 13.10. Let p ∈ (1, ∞). Every subset of Lp ([−1, 1]d0 ) with the unique best approximation
property is convex.
Proof. We observe from Theorem 13.6 and the discussion below it that, under the assumptions of this result, N(σ; A, ∞) is not convex.
We conclude that N(σ; A, ∞) does not have the unique best approximation property. Moreover, if the set N(σ; A, ∞) does not have the best approximation property, then it is obvious that it cannot have the continuous selection property. Thus, we can assume without loss of generality that N(σ; A, ∞) has the best approximation property and that there exist a point h ∈ L^p([−1, 1]^{d_0}) and two different Φ_1, Φ_2 such that
∥Φ_1 − h∥_{L^p} = ∥Φ_2 − h∥_{L^p} = inf_{Φ^∗ ∈ N(σ; A, ∞)} ∥Φ^∗ − h∥_{L^p}. (13.3.2)
13.3.2 Existence of best approximations
We have seen in Proposition 13.11 that under very mild assumptions, the continuous selection prop-
erty cannot hold. Moreover, the next result shows that in many cases, also the best approximation
property fails to be satisfied. We provide below a simplified version of [173, Theorem 3.1]. We also
refer to [69] for earlier work on this problem.
Then f_n can be written as a neural network with activation function σ and architecture A = (1, 2, 1). Moreover, for x > 0 we observe with the fundamental theorem of calculus and using integration by substitution that
f_n(x) = ∫_x^{x+1/n} nσ′(nz) dz = ∫_{nx}^{nx+1} σ′(z) dz. (13.3.4)
It is not hard to see that the right-hand side of (13.3.4) converges to α for n → ∞. Similarly, for x < 0, we observe that f_n(x) converges to α′ for n → ∞. We conclude that
f_n → α·1_{R_+} + α′·1_{R_−}
13.3.3 Exploding weights phenomenon
Finally, we discuss one of the consequences of the non-existence of best approximations of Propo-
sition 13.12.
Consider a regression problem, where we aim to learn a function f using neural networks with
a fixed architecture N (A; σ, ∞). As discussed in the Chapters 10 and 11, we wish to produce a
sequence of neural networks (Φn )∞n=1 such that the risk defined in (1.2.4) converges to 0. If the loss
L is the squared loss, µ is a probability measure on [−1, 1]d0 , and the data is given by (x, f (x)) for
x ∼ µ, then
R(Φ_n) = ∥Φ_n − f∥²_{L²([−1,1]^{d_0},µ)} = ∫_{[−1,1]^{d_0}} |Φ_n(x) − f(x)|² dµ(x) → 0 for n → ∞. (13.3.5)
According to Proposition 13.12, for a given A, and an activation function σ, it is possible that
(13.3.5) holds, but f ̸∈ N (σ; A, ∞). The following result shows that in this situation, the weights
of Φn diverge.
Then
lim sup_{n→∞} max{∥W_n^{(ℓ)}∥∞, ∥b_n^{(ℓ)}∥∞ | ℓ = 0, . . . , L} = ∞. (13.3.7)
Proof. We assume towards a contradiction that the left-hand side of (13.3.7) is finite. As a result,
there exists C > 0 such that Φn ∈ N (σ; A, C) for all n ∈ N.
By Proposition 13.1, we conclude that N (σ; A, C) is the image of a compact set under a continu-
ous map and hence is itself a compact set in L2 ([−1, 1]d0 , µ). In particular, we have that N (σ; A, C)
is closed. Hence, (13.3.6) implies f ∈ N (σ; A, C). This gives a contradiction.
Proposition 13.14 can be extended to all f for which there is no best approximation in N (σ; A, ∞),
see Exercise 13.18. The results imply that for functions we wish to learn that lack a best approxima-
tion within a neural network set, we must expect the weights of the approximating neural networks
to grow to infinity. This can be undesirable because, as we will see in the following sections on
generalization, a bounded parameter space facilitates many generalization bounds.
of sets of shallow neural networks. The results on convexity and closedness presented in this chapter
follow mostly the arguments of [173]. Similar results were also derived for other norms in [139].
Exercises
Exercise 13.15. Prove Proposition 13.5.
Exercise 13.17. Use Proposition 3.16, to extend Proposition 13.12 to arbitrary depth.
Exercise 13.18. Extend Proposition 13.14 to functions f for which there is no best-approximation
in N (σ; A, ∞). To do this, replace (13.3.6) by
Chapter 14
As discussed in the introduction in Section 1.2, we generally learn based on a finite data set. For
example, given data (xi , yi )m
i=1 , we try to find a network Φ that satisfies Φ(xi ) = yi for i = 1, . . . , m.
The field of generalization is concerned with how well such Φ performs on unseen data, which refers
to any x outside of training data {x1 , . . . , xm }. In this chapter we discuss generalization through
the use of covering numbers.
In Sections 14.1 and 14.2 we revisit and formalize the general setup of learning and empirical risk
minimization in a general context. Although some notions introduced in these sections have already
appeared in the previous chapters, we reintroduce them here for a more coherent presentation. In
Sections 14.3-14.5, we first discuss the concepts of generalization bounds and covering numbers,
and then apply these arguments specifically to neural networks. In Section 14.6 we explore the
so-called “approximation-complexity trade-off”, and finally in Sections 14.7-14.8 we introduce the
“VC dimension” and give some implications for classes of neural networks.
Figure 14.1: Collection of coffee data. The last row lacks a “Quality” label. Our aim is to predict
the label without the need for an (expensive) taste test.
with higher numbers indicating better quality. Let us assume that our subjective assessment of
quality of coffee is related to six features: “Acidity”, “Caffeine content”, “Price”, “Aftertaste”,
“Roast level”, and “Origin”. The feature space X thus corresponds to the set of six-tuples describing
these attributes, which can be either numeric or categorical (see Figure 14.1).
We aim to understand the relationship between elements of X and elements of Y , but we can
neither afford, nor do we have the time to taste all the coffees in the world. Instead, we can sample
some coffees, taste them, and grow our database accordingly as depicted in Figure 14.1. This way
we obtain samples of pairs in X × Y . The distribution D from which they are drawn depends on
various external factors. For instance, we might have avoided particularly cheap coffees, believing
them to be inferior. As a result they do not occur in our database. Moreover, if a colleague
contributes to our database, he might have tried the same brand and arrived at a different rating.
In this case, the quality label is not deterministic anymore.
Based on our database, we wish to predict the quality of an untasted coffee. Before proceeding,
we first formalize what it means to be a “good” prediction.
Characterizing how good a predictor is requires a notion of discrepancy in the label space. This
is the purpose of the so-called loss function, which is a measurable mapping L : Y × Y → R+ .
Based on the risk, we can now formalize what we consider a good predictor. The best predictor
is one such that its risk is as close as possible to the smallest that any function can achieve. More
precisely, we would like a risk that is close to the so-called Bayes risk
Example 14.3 (Loss functions). The choice of a loss function L usually depends on the application.
For a regression problem, i.e., a learning problem where Y is a non-discrete subset of a Euclidean
space, a common choice is the square loss L2 (y, y ′ ) = ∥y − y ′ ∥2 .
For binary classification problems, i.e., when Y is a discrete set of cardinality two, the “0−1 loss”
L_{0−1}(y, y′) = 1 if y ≠ y′ and L_{0−1}(y, y′) = 0 if y = y′
is a common choice.
Another frequently used loss for binary classification, especially when we want to predict prob-
abilities (i.e., if Y = [0, 1] but all labels are binary), is the binary cross-entropy loss
Lce (y, y ′ ) = −(y log(y ′ ) + (1 − y) log(1 − y ′ )).
In contrast to the 0 − 1 loss, the cross-entropy loss is differentiable, which is desirable in deep
learning as we saw in Chapter 10.
In the coffee quality prediction problem, the quality is given as a fraction of the form k/10
for k = 0, . . . , 10. While this is a discrete set, it makes sense to more heavily penalize predictions
that are wrong by a larger amount. For example, predicting 4/10 instead of 8/10 should produce
a higher loss than predicting 7/10. Hence, we would not use the 0 − 1 loss but, for example, the
square loss.
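For concreteness, the three losses from this example can be written as plain numpy functions (a direct transcription, our own code):

```python
# The square loss, 0-1 loss, and binary cross-entropy loss from Example 14.3.
import numpy as np

def square_loss(y, y_pred):
    return np.sum((np.asarray(y) - np.asarray(y_pred)) ** 2)

def zero_one_loss(y, y_pred):
    return float(y != y_pred)

def binary_cross_entropy(y, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)          # avoid log(0)
    return -(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

print(square_loss(0.8, 0.4))                        # larger errors are penalized more
print(zero_one_loss(1, 0), zero_one_loss(1, 1))
print(binary_cross_entropy(1.0, 0.9))
```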
How do we find a function h : X → Y with a risk that is as close as possible to the Bayes risk?
We will introduce a procedure to tackle this task in the next section.
¹In practice, the assumption of independence of the samples is often unclear and typically not satisfied. For instance, the selection of the six previously tested coffees might be influenced by external factors such as personal preferences or availability at the local store, which introduce bias into the dataset.
If the sample S is drawn i.i.d. according to D, then we immediately see from the linearity of the expected value that R̂_S(h) is an unbiased estimator of R(h), i.e., E_{S∼D^m}[R̂_S(h)] = R(h).
Moreover, the weak law of large numbers states that the sample mean of an i.i.d. sequence of
integrable random variables converges to the expected value in probability. Hence, there is some
hope that, at least for large m ∈ N, minimizing the empirical risk instead of the population risk
might lead to a good hypothesis. We formalize this approach in the next definition.
R̂_S(h_S) = inf_{h∈H} R̂_S(h) (14.2.1)
From a generalization perspective, deep learning is empirical risk minimization over sets of
neural networks. The question we want to address next is how effective this approach is at producing
hypotheses that achieve a risk close to the Bayes risk.
Let H be some hypothesis set, such that an empirical risk minimizer hS exists for all S ∈
(X × Y )m ; see Exercise 14.25 for an explanation of why this is a reasonable assumption. Moreover,
let h∗ ∈ H be arbitrary. Then
R(h_S) − R^∗ = R(h_S) − R̂_S(h_S) + R̂_S(h_S) − R^∗ (14.2.2)
≤ |R(h_S) − R̂_S(h_S)| + R̂_S(h^∗) − R^∗
≤ 2 sup_{h∈H} |R(h) − R̂_S(h)| + R(h^∗) − R^∗,
where in the first inequality we used that h_S is the empirical risk minimizer. By taking the infimum over all h^∗, we conclude that
R(h_S) − R^∗ ≤ 2 sup_{h∈H} |R(h) − R̂_S(h)| + inf_{h∈H} R(h) − R^∗.
Definition 14.6 (Generalization bound). Let H ⊆ {h : X → Y} be a hypothesis set, and let L : Y × Y → R be a loss function. Let κ : (0, 1) × N → R_+ be such that for every δ ∈ (0, 1) it holds that κ(δ, m) → 0 for m → ∞. We call κ a generalization bound for H if for every distribution D on X × Y, every m ∈ N, and every δ ∈ (0, 1), it holds with probability at least 1 − δ over the random sample S ∼ D^m that
sup_{h∈H} |R(h) − R̂_S(h)| ≤ κ(δ, m).
as soon as m is so large that κ(δ, m) ≤ ε. If there exists an empirical risk minimizer h_S such that R̂_S(h_S) = 0, then with high probability the empirical risk minimizer will also have a small risk R(h_S). Empirical risk minimization is often referred to as a “PAC” algorithm, which stands for probably (δ) approximately correct (ε).
Definition 14.6 requires the upper bound κ on the discrepancy between the empirical risk and
the risk to be independent from the distribution D. Why should this be possible? After all, we could
have an underlying distribution that is not uniform and hence, certain data points could appear
very rarely in the sample. As a result, it should be very hard to produce a correct prediction
for such points. At first sight, this suggests that non-uniform distributions should be much more
challenging than uniform distributions. This intuition is incorrect, as the following argument based
on Example 14.1 demonstrates.
Example 14.8 (Generalization in the coffee quality problem). In Example 14.1, the underlying
distribution describes both our process of choosing coffees and the relation between the attributes
and the quality. Suppose we do not enjoy drinking coffee that costs less than 1€. Consequently, we
do not have a single sample of such coffee in the dataset, and therefore we have no chance of learning the quality of cheap coffees.
However, the absence of coffee samples costing less than 1€ in our dataset is due to our general
avoidance of such coffee. As a result, we run a low risk of incorrectly classifying the quality of a
coffee that is cheaper than 1€, since it is unlikely that we will choose such a coffee in the future.
To establish generalization bounds, we use stochastic tools that guarantee that the empirical
risk converges to the true risk as the sample size increases. This is typically achieved through
concentration inequalities. One of the simplest and most well-known is Hoeffding’s inequality, see
Theorem A.24. We will now apply Hoeffding’s inequality to obtain a first generalization bound.
This generalization bound is well-known and can be found in many textbooks on machine learning,
e.g., [148, 212]. Although the result does not yet encompass neural networks, it forms the basis for
a similar result applicable to neural networks, as we discuss subsequently.
Proposition 14.9 (Finite hypothesis set). Let H ⊆ {h : X → Y} be a finite hypothesis set. Let L : Y × Y → R be such that L(Y × Y) ⊆ [c_1, c_2] with c_2 − c_1 = C > 0.
Then, for every m ∈ N and every distribution D on X × Y it holds with probability at least 1 − δ over the sample S ∼ D^m that
sup_{h∈H} |R(h) − R̂_S(h)| ≤ C √((log(|H|) + log(2/δ))/(2m)).
Note that R̂_S(h_i) is the mean of independent random variables which take their values almost surely in [0, C]. Additionally, R(h_i) is the expectation of R̂_S(h_i). The proof can therefore be finished by applying Theorem A.24. This will be addressed in Exercise 14.26.
Consider now a non-finite set of neural networks H, and assume that it can be covered by a
finite set of (small) balls. Applying Proposition 14.9 to the centers of these balls, then allows to
derive a similar bound as in the proposition for H. This intuitive argument will be made rigorous
in the following section.
Definition 14.10. Let A be a relatively compact subset of a metric space (X, d). For ε > 0, we
call
G(A, ε, (X, d)) := min{ m ∈ N | ∃ (x_i)_{i=1}^m ⊆ X s.t. ⋃_{i=1}^m B_ε(x_i) ⊇ A },
where Bε (x) = {z ∈ X | d(z, x) ≤ ε}, the ε-covering number of A in X. In case X or d are clear
from context, we also write G(A, ε, d) or G(A, ε, X) instead of G(A, ε, (X, d)).
A visualization of Definition 14.10 is given in Figure 14.2. As we will see, it is possible to upper
bound the ε-covering numbers of neural networks as a subset of L∞ ([0, 1]d ), assuming the weights
are confined to a fixed bounded set. The precise estimates are postponed to Section 14.5. Before
that, let us show how a finite covering number facilitates a generalization bound. We only consider
Euclidean feature spaces X in the following result. A more general version could be easily derived.
Figure 14.2: Illustration of the concept of covering numbers of Definition 14.10. The shaded set
A ⊆ R2 is covered by sixteen Euclidean balls of radius ε. Therefore, G(A, ε, R2 ) ≤ 16.
Theorem 14.11. Let CY , CL > 0 and α > 0. Let Y ⊆ [−CY , CY ], X ⊆ Rd for some d ∈ N, and
H ⊆ {h : X → Y }. Further, let L : Y × Y → R be CL -Lipschitz.
Then, for every distribution D on X × Y and every m ∈ N it holds with probability at least 1 − δ
over the sample S ∼ Dm that for all h ∈ H
|R(h) − R̂_S(h)| ≤ 4C_Y C_L √((log(G(H, m^{−α}, L^∞(X))) + log(2/δ))/m) + 2C_L/m^α.
Proof. Let
M = G(H, m^{−α}, L^∞(X)) (14.4.1)
and let H_M = (h_i)_{i=1}^M ⊆ H be such that for every h ∈ H there exists h_i ∈ H_M with ∥h − h_i∥_{L^∞(X)} ≤ 1/m^α. The existence of H_M follows by Definition 14.10.
Fix for the moment such h ∈ H and h_i ∈ H_M. By the reverse and normal triangle inequalities, we have
|R(h) − R̂_S(h)| − |R(h_i) − R̂_S(h_i)| ≤ |R(h) − R(h_i)| + |R̂_S(h) − R̂_S(h_i)|.
Moreover, from the monotonicity of the expected value and the Lipschitz property of L it follows that
|R(h) − R(h_i)| ≤ C_L ∥h − h_i∥_{L^∞(X)} ≤ C_L/m^α and, similarly, |R̂_S(h) − R̂_S(h_i)| ≤ C_L/m^α.
We thus conclude that for every ε > 0
P_{S∼D^m}[∃h ∈ H : |R(h) − R̂_S(h)| ≥ ε] ≤ P_{S∼D^m}[∃h_i ∈ H_M : |R(h_i) − R̂_S(h_i)| ≥ ε − 2C_L/m^α]. (14.4.2)
From Proposition 14.9, we know that for ε > 0 and δ ∈ (0, 1)
P_{S∼D^m}[∃h_i ∈ H_M : |R(h_i) − R̂_S(h_i)| ≥ ε − 2C_L/m^α] ≤ δ (14.4.3)
as long as
ε − 2C_L/m^α > C √((log(M) + log(2/δ))/(2m)),
where C is such that L(Y × Y) ⊆ [c_1, c_2] with c_2 − c_1 ≤ C. By the Lipschitz property of L we can choose C = 2√2 C_L C_Y.
Therefore, the definition of M in (14.4.1) together with (14.4.2) and (14.4.3) gives that with probability at least 1 − δ it holds for all h ∈ H
|R(h) − R̂_S(h)| ≤ 2√2 C_L C_Y √((log(G(H, m^{−α}, L^∞)) + log(2/δ))/(2m)) + 2C_L/m^α.
This concludes the proof.
Lemma 14.12. Let X_1, X_2 be two metric spaces and let f : X_1 → X_2 be Lipschitz continuous with Lipschitz constant C_Lip. Then, for every relatively compact A ⊆ X_1 and all ε > 0 it holds that
G(f(A), C_Lip ε, X_2) ≤ G(A, ε, X_1).
The proof of Lemma 14.12 is left as an exercise. If we can represent the set of neural networks
as the image under the Lipschitz map of another set with known covering numbers, then Lemma
14.12 gives a direct way to bound the covering number of the neural network class.
Conveniently, we have already observed in Proposition 13.1, that the set of neural networks is
the image of PN (A, B) as in Definition 12.1 under the Lipschitz continuous realization map Rσ . It
thus suffices to establish the ε-covering number of PN (A, B) or equivalently of [−B, B]nA . Then,
using the Lipschitz property of Rσ that holds by Proposition 13.1, we can apply Lemma 14.12 to
find the covering numbers of N (σ; A, B). This idea is depicted in Figure 14.3.
Figure 14.3: Illustration of the main idea to deduce covering numbers of neural network spaces.
Points θ ∈ R2 in parameter space in the left figure correspond to functions Rσ (θ) in the right figure
(with matching colors). By Lemma 14.12, a covering of the parameter space on the left translates
to a covering of the function space on the right.
It is clear that all points between −B and x_{k−1} have distance at most ε to one of the x_j. Also, x_{k−1} = −B + ε + 2(k − 1)ε ≥ B − ε. We conclude that G([−B, B], ε, R) ≤ ⌈B/ε⌉. Set X_k := {x_0, . . . , x_{k−1}}.
For arbitrary q, we observe that for every x ∈ [−B, B]^q there is an element in X_k^q := ×_{j=1}^q X_k with ∥·∥∞-distance less than ε. Clearly, |X_k^q| = ⌈B/ε⌉^q, which completes the proof.
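The grid construction from this proof is easy to implement; the helper below (our own sketch) returns such an ε-covering of [−B, B]^q and its cardinality ⌈B/ε⌉^q.

```python
# Build an epsilon-covering of [-B, B]^q in the sup-norm with ceil(B/eps)^q points.
import itertools
import math

def cover_cube(B, eps, q):
    k = math.ceil(B / eps)
    pts_1d = [-B + eps + 2 * eps * j for j in range(k)]   # x_0, ..., x_{k-1}
    return list(itertools.product(pts_1d, repeat=q))

cover = cover_cube(B=1.0, eps=0.3, q=2)
print(len(cover), math.ceil(1.0 / 0.3) ** 2)              # 16 16
```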
Having established a covering number for [−B, B]^{nA} and hence for PN(A, B), we can now estimate the covering numbers of deep neural networks by combining Lemma 14.12 with Propositions 13.1 and 14.13.
G(N(σ; A, B), ε, L^∞([0, 1]^{d_0})) ≤ G([−B, B]^{nA}, ε/(2CσBd_max)^L, (R^{nA}, ∥·∥∞))
≤ ⌈nA/ε⌉^{nA} ⌈2CσBd_max⌉^{nA L}.
We end this section by applying the previous theorem to the generalization bound of Theorem 14.11 with α = 1/2. To simplify the analysis, we restrict the discussion to neural networks with range [−1, 1]. To this end, denote
N^∗(σ; A, B) := {Φ ∈ N(σ; A, B) | Φ(x) ∈ [−1, 1] for all x ∈ [0, 1]^{d_0}}.
Since N^∗(σ; A, B) ⊆ N(σ; A, B), we can bound the covering numbers of N^∗(σ; A, B) by those of N(σ; A, B). This yields the following result.
Theorem 14.15. Let CL > 0 and let L : [−1, 1]×[−1, 1] → R be CL -Lipschitz continuous. Further,
let A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 , let σ : R → R be Cσ -Lipschitz continuous with Cσ ≥ 1, and
|σ(x)| ≤ Cσ |x| for all x ∈ R, and let B ≥ 1.
Then, for every m ∈ N, and every distribution D on X × [−1, 1] it holds with probability at least
1 − δ over S ∼ Dm that for all Φ ∈ N ∗ (σ; A, B)
|R(Φ) − R̂_S(Φ)| ≤ 4C_L √((nA log(⌈nA√m⌉) + LnA log(⌈2CσBd_max⌉) + log(2/δ))/m) + 2C_L/√m.
where R∗ is the Bayes risk defined in (14.1.1). We make the following observations about the
approximation error εapprox and generalization error εgen in the context of neural network based
learning:
• Scaling of generalization error: By Theorem 14.15, for a hypothesis class H of neural networks with nA weights and L layers, and for a sample of size m ∈ N, the generalization error ε_gen essentially scales like
ε_gen = O(√((nA log(nA m) + LnA log(nA))/m)) as m → ∞.
• Scaling of approximation error: Assume there exists h^∗ such that R(h^∗) = R^∗, and let the loss function L be Lipschitz continuous in the first coordinate. Then
ε_approx = inf_{h∈H} R(h) − R(h^∗) = inf_{h∈H} E_{(x,y)∼D}[L(h(x), y) − L(h^∗(x), y)] ≤ C inf_{h∈H} ∥h − h^∗∥_{L^∞},
for some constant C > 0. We have seen in Chapters 5 and 7 that if we choose H as a set of neural networks with size nA and L layers, then, for appropriate activation functions, inf_{h∈H} ∥h − h^∗∥_{L^∞} behaves like nA^{−r} if, e.g., h^∗ is a d-dimensional s-Hölder regular function and r = s/d (Theorem 5.22), or h^∗ ∈ C^{k,s}([0, 1]^d) and r < (k + s)/d (Theorem 7.7).
By these considerations, we conclude that for an empirical risk minimizer Φ_S from a set of neural networks with nA weights and L layers, it holds that
R(Φ_S) − R^∗ ≤ O(√((nA log(m) + LnA log(nA))/m)) + O(nA^{−r}), (14.6.1)
for m → ∞ and for some r depending on the regularity of h^∗. Note that enlarging the neural network set, i.e., increasing nA, has two effects: the term associated with approximation decreases, and the term associated with generalization increases. This trade-off is known as the approximation-complexity trade-off. The situation is depicted in Figure 14.4. The figure and (14.6.1) suggest that the perfect model achieves the optimal trade-off between approximation and generalization error. Using this notion, we can also separate all models into three classes:
• Underfitting: If the approximation error decays faster than the estimation error increases.
• Optimal : If the sum of approximation error and generalization error is at a minimum.
• Overfitting: If the approximation error decays slower than the estimation error increases.
In Chapter 15, we will see that deep learning often operates in the regime where the number of parameters nA exceeds the optimal trade-off point. For certain architectures used in practice, nA can be so large that the theory of the approximation-complexity trade-off suggests that learning should be impossible. However, we emphasize that the present analysis only provides upper bounds. It does not prove that learning is impossible or even impractical in the overparameterized regime. Moreover, in Chapter 11 we have already seen indications that learning in the overparameterized regime need not necessarily lead to large generalization errors.
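The trade-off can be visualized by plotting the two terms of a bound of the form (14.6.1) with arbitrary constants; the sketch below (our own, purely qualitative, with placeholder values for m and r) locates the network size that minimizes their sum.

```python
# Qualitative approximation-complexity trade-off: generalization term grows
# with n_A, approximation term decays like n_A^{-r}; the sum has a minimum.
import numpy as np

m, r = 10_000, 0.5                    # sample size and smoothness-dependent rate
n_A = np.arange(10, 5000, 10)
gen = np.sqrt(n_A * np.log(n_A * m) / m)
approx = n_A ** (-r)
total = gen + approx

best = n_A[np.argmin(total)]
print(f"bound minimized at n_A ~ {best}")
```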
Definition 14.16. The VC dimension of H is the cardinality of the largest set S ⊆ Rd that is
shattered by H. We denote the VC dimension by VCdim(H).
Figure 14.4: The approximation-complexity trade-off, with the underfitting and overfitting regimes around the optimal trade-off.
Example 14.17 (Intervals). Let H = {1[a,b] | a, b ∈ R}. It is clear that VCdim(H) ≥ 2 since for
x1 < x2 the functions
1[x1 −2,x1 −1] , 1[x1 −1,x1 ] , 1[x1 ,x2 ] , 1[x2 ,x2 +1] ,
Figure 14.5: Different ways to classify two or three points. The colored blocks correspond to intervals that produce different classifications of the points.
three points. More generally, for d ≥ 2 with
H_d := {x ↦ 1_{[0,∞)}(w^⊤x + b) | w ∈ R^d, b ∈ R}
In the example above, the VC dimension coincides with the number of parameters. However,
this is not true in general as the following example shows.
Example 14.19 (Infinite VC dimension). Let for x ∈ R
Theorem 14.20. Let d, k ∈ N and H ⊆ {h : Rd → {0, 1}} have VC dimension k. Let D be a
distribution on Rd × {0, 1}. Then, for every δ > 0 and m ∈ N, it holds with probability at least 1 − δ
over a sample S ∼ Dm that for every h ∈ H
|R(h) − R̂_S(h)| ≤ √(2k log(em/k)/m) + √(log(1/δ)/(2m)). (14.7.1)
In words, Theorem 14.20 tells us that if a hypothesis class has finite VC dimension, then a
hypothesis with a small empirical risk will have a small risk if the number of samples is large. This
shows that empirical risk minimization is a viable strategy in this scenario. Will this approach also
work if the VC dimension is not bounded? No, in fact, in that case, no learning algorithm will
succeed in reliably producing a hypothesis for which the risk is close to the best possible. We omit
the technical proof of the following theorem from [148, Theorem 3.23].
Theorem 14.21. Let k ∈ N and let H ⊆ {h : X → {0, 1}} be a hypothesis set with VC dimension
k. Then, for every m ∈ N and every learning algorithm A : (X × {0, 1})m → H there exists a
distribution D on X × {0, 1} such that
P_{S∼D^m}[R(A(S)) − inf_{h∈H} R(h) > √(k/(320m))] ≥ 1/64.
Theorem 14.21 immediately implies the following statement for the generalization bound.
Corollary 14.22. Let k ∈ N and let H ⊆ {h : X → {0, 1}} be a hypothesis set with VC dimension
k. Then, for every m ∈ N there exists a distribution D on X × {0, 1} such that
P_{S∼D^m}[sup_{h∈H} |R(h) − R̂_S(h)| > √(k/(1280m))] ≥ 1/64.
Then, applying Theorem 14.21 with A(S) = hS it holds that
2 sup_{h∈H} |R(h) − R̂_S(h)| ≥ |R(h_S) − R̂_S(h_S)| + |R(h_δ) − R̂_S(h_δ)|
≥ R(h_S) − R̂_S(h_S) + R̂_S(h_δ) − R(h_δ)
≥ R(h_S) − R(h_δ)
> R(h_S) − inf_{h∈H} R(h) − δ,
where we used the definition of hS in the third inequality. The proof is completed by applying
Theorem 14.21 and using that δ was arbitrary.
We have now seen that we have a generalization bound scaling like O(1/√m) for m → ∞ if and only if the VC dimension of a hypothesis class is finite. In more quantitative terms, we require the VC dimension of a neural network class to be smaller than m.
What does this imply for neural network functions? For ReLU neural networks there holds the
following [3, Theorem 8.8].
The bound (14.7.1) is meaningful if m ≫ k. For ReLU neural networks as in Theorem 14.23, this means m ≫ nA L log(nA) + nA L². Fixing L = 1, this amounts to m ≫ nA log(nA) for a shallow neural network with nA parameters. This condition is contrary to what we assumed in Chapter 11, where it was crucial that nA ≫ m. If the VC dimension of the neural network sets scales like O(nA log(nA)), then Theorem 14.21 and Corollary 14.22 indicate that, at least for certain distributions, generalization should not be possible in this regime. We will discuss the resolution of this potential paradox in Chapter 15.
Theorem 14.24. Let k, d ∈ N. Assume that for every ε > 0 there exist L_ε ∈ N and A_ε with L_ε layers and input dimension d such that
sup_{∥f∥_{C^k([0,1]^d)} ≤ 1} inf_{Φ ∈ N(σ_ReLU; A_ε, ∞)} ∥f − Φ∥_{C^0([0,1]^d)} < ε/2.
Then there exists C > 0, solely depending on k and d, such that for all ε ∈ (0, 1)
n_{A_ε} L_ε log(n_{A_ε}) + n_{A_ε} L_ε² ≥ C ε^{−d/k}.
sup_{x∈[0,1]^d} |f_y(x) − Φ_y(x)| < ε/(2τ_k).
Then
1_{[0,∞)}(Φ_y(x_j) − ε/(2τ_k)) = y_j for all j = 1, . . . , N(ε).
Hence, the VC dimension of N(σ_ReLU; A_{τ_k^{−1}ε}, ∞) is larger or equal to N(ε). Theorem 14.23 thus implies
N(ε) ≃ ε^{−d/k} ≤ C · (n_{A_{τ_k^{−1}ε}} L_{τ_k^{−1}ε} log(n_{A_{τ_k^{−1}ε}}) + n_{A_{τ_k^{−1}ε}} L²_{τ_k^{−1}ε}),
or equivalently
τ_k^{d/k} ε^{−d/k} ≤ C · (n_{A_{τ_k^{−1}ε}} L_ε log(n_{A_{τ_k^{−1}ε}}) + n_{A_{τ_k^{−1}ε}} L_ε²).
Figure 14.7: Illustration of fy from Equation (14.8.1) on [0, 1]2 .
In terms of the neural network size, this (necessary) condition becomes n_{A_ε} ≥ Cε^{−d/k}/log(ε)². As we have shown in Chapter 7, in particular Theorem 7.7, up to log terms this condition is also sufficient. Hence, while the constructive proof of Theorem 7.7 might have seemed rather specific, under the assumption of the depth increasing at most logarithmically (which the construction in Chapter 7 satisfies), it was essentially optimal! The neural networks in this proof are shown to have size O(ε^{−d/k}) up to log terms.
• If we allow the depth L_ε to increase faster than logarithmically in ε, then the lower bound on the required neural network size improves. Fixing for example A_ε with L_ε layers such that n_{A_ε} ≤ W L_ε for some fixed, ε-independent W ∈ N, the (necessary) condition on the depth becomes
W log(W L_ε) L_ε² + W L_ε³ ≥ C ε^{−d/k}
Bibliography and further reading
Classical statistical learning theory is based on the foundational work of Vapnik and Chervonenkis
[233]. This led to the formulation of the probably approximately correct (PAC) learning model
in [232], which is primarily utilized in this chapter. A streamlined mathematical introduction to
statistical learning theory can be found in [43].
Since statistical learning theory is well-established, there exists a substantial amount of excellent
expository work describing this theory. Some highly recommended books on the topic are [148,
212, 3]. The specific approach of characterizing learning via covering numbers has been discussed
extensively in [3, Chapter 14]. Specific results for ReLU activation used in this chapter were derived
in [204, 18]. The results of Section 14.8 describe some of the findings in [245, 246], and we also refer
to [51] for general lower bounds (also applicable to neural networks) when approximating classes
of Sobolev functions.
Exercises
Exercise 14.25. Let H be a set of neural networks with fixed architecture, where the weights are
taken from a compact set. Moreover, assume that the activation function is continuous. Show that
for every sample S there always exists an empirical risk minimizer hS .
Exercise 14.28. Show that the VC dimension of H of Example 14.18 is indeed 3, by demonstrating that no set of four points can be shattered by H.
H := {x 7→ 1[0,∞) (sin(wx)) | w ∈ R}
is infinite.
Chapter 15
Generalization in the
overparameterized regime
In the previous chapter, we discussed the theory of generalization for deep neural networks trained
by minimizing the empirical risk. A key conclusion was that good generalization is possible as long
as we choose an architecture that has a moderate number of neural network parameters relative to
the number of training samples. Moreover, we saw in Section 14.6 that the best performance can be
expected when the neural network size is chosen to balance the generalization and approximation
errors, by minimizing their sum.
Figure 15.1: ImageNet Classification Competition: Final score on the test set in the Top 1 cat-
egory vs. Parameters-to-Training-Samples Ratio. Note that all architectures have more parame-
ters than training samples. Architectures include AlexNet [121], VGG16 [215], GoogLeNet [222],
ResNet50/ResNet152 [87], DenseNet121 [96], ViT-G/14 [248], EfficientNetB0 [224], and Amoe-
baNet [189].
Surprisingly, successful neural network architectures do not necessarily follow these theoretical
observations. Consider the neural network architectures in Figure 15.1. They represent some
of the most renowned image classification models, and all of them participated in the ImageNet
Classification Competition [50]. The training set consisted of 1.2 million images. The x-axis shows
the model performance, and the y-axis displays the ratio of the number of parameters to the size of
the training set; notably, all architectures have a ratio larger than one, i.e., they have more parameters than training samples. For the largest model, there are a factor of 1000 more neural network parameters than training samples.
Given that the practical application of deep learning appears to operate in a regime significantly
different from the one analyzed in Chapter 14, we must ask: Why do these methods still work
effectively?
Figure 15.2: Illustration of the double descent phenomenon: the risk R(h) and the empirical risk R̂_S(h) as functions of the expressivity of H, with the underfitting and overfitting regimes separated by the interpolation threshold.
The goal is to determine coefficients w ∈ R^n minimizing the empirical risk
R̂_S(w) = (1/m) Σ_{j=1}^m (Σ_{i=1}^n w_i ϕ_i(x_j) − y_j)² = (1/m) Σ_{j=1}^m (⟨ϕ(x_j), w⟩ − y_j)².
With
A_n := (ϕ_i(x_j))_{j=1,...,m; i=1,...,n} ∈ R^{m×n}, whose j-th row is ϕ(x_j)^⊤, (15.1.2)
and y = (y_1, . . . , y_m)^⊤ it holds
R̂_S(w) = (1/m) ∥A_n w − y∥². (15.1.3)
As discussed in Sections 11.1-11.2, a unique minimizer of (15.1.3) only exists if A_n has rank n. For a minimizer w_n, the fitted function reads
f_n(x) := Σ_{j=1}^n w_{n,j} ϕ_j(x). (15.1.4)
We are interested in the behavior of the fn as a function of n (the number of ansatz func-
tions/parameters of our model), and distinguish between two cases:
In the overparameterized case, there exist many minimizers of R b S . The training algorithm we
use to compute a minimizer determines the type of prediction function fn we obtain. To observe
double descent, i.e. to achieve good generalization for large n, we need to choose the minimizer
carefully. In the following, we consider the unique minimal 2-norm minimizer, which is defined as
w_{n,∗} = argmin_{w ∈ R^n : R̂_S(w) ≤ R̂_S(v) for all v ∈ R^n} ∥w∥ ∈ R^n. (15.1.5)
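A minimal skeleton of the experiment described in the next subsection (our own code; the ansatz functions here are random Fourier-type features rather than the Gaussian-process draws used for Figure 15.3) computes the minimal 2-norm fit (15.1.5) via the pseudoinverse and reports the test error and ∥w_{n,∗}∥ for several values of n.

```python
# Fit the Runge function with n random ansatz functions via the minimal-norm
# least-squares solution w_{n,*} = pinv(A_n) @ y.
import numpy as np

rng = np.random.default_rng(8)
f = lambda x: 1.0 / (1.0 + 25.0 * x ** 2)             # Runge function
m, n_max = 18, 40
x_train = np.linspace(-1, 1, m)
y_train = f(x_train)
x_test = np.linspace(-1, 1, 500)

freqs, phases = 3.0 * rng.normal(size=n_max), rng.uniform(0, 2 * np.pi, size=n_max)
phi = lambda x, j: np.cos(freqs[j] * x + phases[j])   # ansatz functions phi_j

for n in (2, 15, 18, 40):
    A_n = np.column_stack([phi(x_train, j) for j in range(n)])
    w = np.linalg.pinv(A_n) @ y_train                 # minimal-norm minimizer w_{n,*}
    f_n = sum(w[j] * phi(x_test, j) for j in range(n))
    err = np.sqrt(np.mean((f_n - f(x_test)) ** 2))
    print(f"n = {n:2d}:  error ~ {err:.3f},  ||w_n,*|| = {np.linalg.norm(w):.1f}")
```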
15.1.2 An example
Now let us consider a concrete example. In Figure 15.3 we plot a set of 40 ansatz functions
ϕ1 , . . . , ϕ40 , which are drawn from a Gaussian process. Additionally, the figure shows a plot of the
Runge function f , and m = 18 equispaced points which are used as the training data points. We
then fit a function in span{ϕ1 , . . . , ϕn } via (15.1.5) and (15.1.4). The result is displayed in Figure
15.4:
(a) ansatz functions ϕ_j. (b) Runge function f and data points.
Figure 15.3: Ansatz functions ϕ_1, . . . , ϕ_40 drawn from a Gaussian process, along with the Runge function and 18 equispaced data points.
• n = 2: The model can only represent functions in span{ϕ1 , ϕ2 }. It is not yet expressive
enough to give a meaningful approximation of f .
• n = 15: The model has sufficient expressivity to capture the main characteristics of f. Since n = 15 < 18 = m, it is not yet able to interpolate the data. Thus it allows us to strike a good balance between the approximation and generalization error, which corresponds to the scenario discussed in Chapter 14.
• n = 18: We are at the interpolation threshold. The model is capable of interpolating the data,
and there is a unique w such that R b S (w) = 0. Yet, in between data points the behavior of the
predictor f18 seems erratic, and displays strong oscillations. This is referred to as overfitting,
and is to be expected due to our analysis in Chapter 14; while the approximation error at the
data points has improved compared to the case n = 15, the generalization error has gotten
worse.
• n = 40: This is the overparameterized regime, where we have significantly more parameters
than data points. Our prediction f40 interpolates the data and appears to be the best overall
approximation to f so far, due to a “good” choice of minimizer of R b S , namely (15.1.5).
We also note that, while quite good, the fit is not perfect. We cannot expect significant
improvement in performance by further increasing n, since at this point the main limiting
factor is the amount of available data. Also see Figure 15.5 (a).
Figure 15.5 (a) displays the error ∥f − f_n∥_{L²([−1,1])} over n. We observe the characteristic double descent curve: the error initially decreases, peaks at the interpolation threshold, which is marked by the dashed red line, and afterwards, in the overparameterized regime, it starts to decrease again. Figure 15.5 (b) displays ∥w_{n,∗}∥. Note how the Euclidean norm of the coefficient vector also peaks at the interpolation threshold.
We emphasize that the precise nature of the convergence curves depends strongly on various
factors, such as the distribution and number of training points m, the ground truth f , and the
choice of ansatz functions ϕj (e.g., the specific kernel used to generate the ϕj in Figure 15.3 (a)).
In the present setting we achieve a good approximation of f for n = 15 < 18 = m corresponding to
the regime where the approximation and interpolation errors are balanced. However, as Figure 15.5
(a) n = 2 (underparameterization). (b) n = 15 (balance of approximation and generalization error). (c) n = 18 (interpolation threshold). (d) n = 40 (overparameterization).
Figure 15.4: Fit of the m = 18 red data points using the ansatz functions ϕ_1, . . . , ϕ_n from Figure 15.3, employing equations (15.1.5) and (15.1.4) for different numbers of ansatz functions n.
However, as Figure 15.5 (a) shows, it can be difficult to determine a suitable value of n < m a priori, and the acceptable range of n values can be quite narrow. In the overparameterized regime (n ≫ m), the precise choice of n is less critical, potentially making the algorithm more stable. We encourage the reader to conduct similar experiments and explore different settings to get a better feeling for the double descent phenomenon.
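The following Python sketch implements an experiment of this type. It is not the code used to produce Figures 15.3–15.5: the squared-exponential kernel and length scale for the Gaussian process, the random seed, and the evaluation grid are arbitrary illustrative choices, and the minimal norm least squares solution (15.1.5) is computed via the pseudoinverse.

import numpy as np

def runge(x):
    return 1.0 / (1.0 + 25.0 * x**2)

rng = np.random.default_rng(0)
m, n_max = 18, 40
x_train = np.linspace(-1.0, 1.0, m)          # m = 18 equispaced data points
x_plot = np.linspace(-1.0, 1.0, 400)         # fine grid for evaluating the fit
y_train = runge(x_train)

# Draw n_max ansatz functions phi_j from a Gaussian process with a
# squared-exponential kernel, evaluated jointly on both point sets.
x_all = np.concatenate([x_train, x_plot])
K = np.exp(-(x_all[:, None] - x_all[None, :])**2 / (2 * 0.3**2))
phi_all = rng.multivariate_normal(np.zeros(len(x_all)), K + 1e-10 * np.eye(len(x_all)), size=n_max)
phi_train, phi_plot = phi_all[:, :m], phi_all[:, m:]

for n in [2, 15, 18, 40]:
    A_n = phi_train[:n].T                     # A_n[i, j] = phi_j(x_i), cf. (15.1.2)
    w_star = np.linalg.pinv(A_n) @ y_train    # minimal norm least squares solution, cf. (15.1.5)
    f_n = w_star @ phi_plot[:n]               # predictor f_n on the fine grid
    err = np.sqrt(2.0 * np.mean((f_n - runge(x_plot))**2))   # approximate L2([-1,1]) error
    print(f"n = {n:2d}:  L2 error ~ {err:.3f},  ||w_n,*|| = {np.linalg.norm(w_star):.2f}")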
Figure 15.5: (a) The L²-error ∥f − f_n∥_{L²([−1,1])} for the fitted functions in Figure 15.4 and (b) the ℓ²-norm of the corresponding coefficient vector w_{n,∗} defined in (15.1.5), both plotted over n; the dashed line marks the interpolation threshold n = 18.
Proposition 15.1. Assume that x_1, . . . , x_m and the (ϕ_j)_{j∈N} are such that A_n in (15.1.2) has full rank n for all n ≤ m. Given y ∈ R^m, denote by w_{n,∗}(y) the vector in (15.1.5). Then

n ↦ sup_{∥y∥=1} ∥w_{n,∗}(y)∥ is monotonically increasing for n < m, and monotonically decreasing for n ≥ m.
Proof. We start with the case n ≥ m. By assumption A_m has full rank m, and thus A_n has rank m for all n ≥ m, see (15.1.2). Now fix y ∈ R^m. Then there exists w_n ∈ R^n such that A_n w_n = y; let w_n be any such vector. The vector w_{n+1} := (w_n, 0) ∈ R^{n+1} satisfies A_{n+1} w_{n+1} = y and ∥w_{n+1}∥ = ∥w_n∥. Thus necessarily ∥w_{n+1,∗}(y)∥ ≤ ∥w_{n,∗}(y)∥ for the minimal norm solutions defined in (15.1.5). Since this holds for every y, we obtain the statement for n ≥ m.
Now let n < m. Recall that the minimal norm solution can be written through the pseudo
inverse
wn,∗ (y) = A†n y,
see for instance Exercise 11.32. Here,
−1
σn,1 0
†
An = V n
.. .. ⊤
Un ∈ R
n×m
. .
−1
σn,n 0
218
where An = U n Σ n V ⊤
n is the singular value decomposition of An , and
σn,1
..
.
σ n,n
m×n
Σn = ∈R
0
..
.
0
contains the singular values σn,1 ≥ · · · ≥ σn,n > 0 of An ∈ Rm×n ordered by decreasing size. Since
V n ∈ Rn×n and U n ∈ Rm×m are orthogonal matrices, we have
sup ∥wn,∗ (y)∥ = sup ∥A†n y∥ = σn,n
−1
.
∥y∥=1 ∥y∥=1
we observe that n 7→ σn,n is monotonically decreasing for n ≤ m. This concludes the proof.
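As a quick numerical sanity check of Proposition 15.1 (our own sketch, not part of the original argument), the following code computes sup_{∥y∥=1} ∥w_{n,∗}(y)∥ = ∥A_n^†∥ = σ_{n,n}^{-1} for a random matrix whose columns play the role of (ϕ_j(x_i))_{i≤m}, and verifies the claimed monotonicity.

import numpy as np

rng = np.random.default_rng(0)
m, n_max = 18, 40
A = rng.standard_normal((m, n_max))   # column j stands in for (phi_j(x_1), ..., phi_j(x_m))

norms = []
for n in range(1, n_max + 1):
    s = np.linalg.svd(A[:, :n], compute_uv=False)   # singular values, decreasing
    norms.append(1.0 / s[min(n, m) - 1])            # ||A_n^dagger|| = 1 / smallest nonzero singular value

print("increasing for n < m :", all(a <= b + 1e-12 for a, b in zip(norms[:m-1], norms[1:m])))
print("decreasing for n >= m:", all(a >= b - 1e-12 for a, b in zip(norms[m-1:], norms[m:])))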
An assumption of the type B ≤ c_B · (d C_σ)^{−1}, i.e. a scaling of the weights by the reciprocal 1/d of the width, is not unreasonable in practice: standard initialization schemes, such as LeCun [129] or He [86] initialization, use random weights whose variance is scaled inversely proportional to the input dimension of each layer. Moreover, as we saw in Chapter 11, for very wide neural networks the weights do not move significantly from their initialization during training. Additionally, many training routines use regularization terms on the weights, thereby encouraging the optimization routine to find small weights.
We study the generalization capacity of Lipschitz functions through the covering-number-based learning results of Chapter 14. The set Lip_C(Ω) of C-Lipschitz functions on a compact d-dimensional Euclidean domain Ω has covering numbers bounded according to

log(G(Lip_C(Ω), ε, L^∞)) ≤ C_cov · (C/ε)^d   for all ε > 0   (15.3.2)
for some constant Ccov independent of ε > 0. A proof can be found in [75, Lemma 7], see also [230].
As a result of these considerations, we can identify two regimes:
• Standard regime: For small neural network size nA , we consider neural networks as a set
parameterized by nA parameters. As we have seen before, this yields a bound on the gen-
eralization error that scales linearly with nA . As long as nA is small in comparison to the
number of samples, we can expect good generalization by Theorem 14.15.
• Overparameterized regime: For large neural network size nA and small weights, we consider
neural networks as a subset of LipC (Ω) for a constant C > 0. This set has a covering number
bound that is independent of the number of parameters nA .
Choosing the better of the two generalization bounds for each regime yields the following result.
Recall that N ∗ (σ; A, B) denotes all neural networks in N (σ; A, B) with a range contained in [−1, 1]
(see (14.5.1)).
Theorem 15.2. Let C, CL > 0 and let L : [−1, 1] × [−1, 1] → R be CL -Lipschitz. Further, let
A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 , let σ : R → R be Cσ -Lipschitz continuous with Cσ ≥ 1, and
|σ(x)| ≤ Cσ |x| for all x ∈ R, and let B > 0.
Then there exist c_1, c_2 > 0 such that for every m ∈ N, every δ ∈ (0, 1), and every distribution D on [−1, 1]^{d_0} × [−1, 1] it holds with probability at least 1 − δ over S ∼ D^m that for all Φ ∈ N^∗(σ; A, B) ∩ Lip_C([−1, 1]^{d_0})

|R(Φ) − R̂_S(Φ)| ≤ g(A, C_σ, B, m) + 4C_L √(log(4/δ)/m),   (15.3.3)

where

g(A, C_σ, B, m) = min{ c_1 √( (n_A log(n_A ⌈√m⌉) + L n_A log(d_max)) / m ),  c_2 m^{−1/(2+d_0)} }.
Proof. Applying Theorem 14.11 with α = 1/(2 + d0 ) and (15.3.2), we obtain that with probability
at least 1 − δ/2 it holds for all Φ ∈ LipC ([−1, 1]d0 )
|R(Φ) − R̂_S(Φ)| ≤ 4C_L √( (C_cov (m^α C)^{d_0} + log(4/δ)) / m ) + 2C_L / m^α
 ≤ 4C_L √( C_cov C^{d_0} m^{d_0/(d_0+2) − 1} ) + 2C_L / m^α + 4C_L √( log(4/δ)/m )
 = 4C_L √( C_cov C^{d_0} m^{−2/(d_0+2)} ) + 2C_L / m^α + 4C_L √( log(4/δ)/m )
 = (4C_L √(C_cov C^{d_0}) + 2C_L) / m^α + 4C_L √( log(4/δ)/m ),

where we used in the second inequality that √(x + y) ≤ √x + √y for all x, y ≥ 0.
In addition, Theorem 14.15 yields that with probability at least 1 − δ/2 it holds for all Φ ∈
N ∗ (σ; A, B)
|R(Φ) − R̂_S(Φ)| ≤ 4C_L √( (n_A log(⌈n_A √m⌉) + L n_A log(⌈2C_σ B d_max⌉) + log(4/δ)) / m ) + 2C_L/√m
 ≤ 6C_L √( (n_A log(⌈n_A √m⌉) + L n_A log(⌈2C_σ B d_max⌉)) / m ) + 4C_L √( log(4/δ)/m ).
Finally, a union bound shows that, for Φ ∈ N^∗(σ; A, B) ∩ Lip_C([−1, 1]^{d_0}), both estimates, and hence the minimum of the two upper bounds, hold simultaneously with probability at least 1 − δ.
The two regimes in Theorem 15.2 correspond to the two terms comprising the minimum in the definition of g(A, C_σ, B, m). The first term increases with n_A, while the second is constant. In the first regime, where the first term is smaller, the generalization gap |R(Φ) − R̂_S(Φ)| increases with n_A.
In the second regime, where the second term is smaller, the generalization gap is constant in n_A. Moreover, it is reasonable to assume that the empirical risk R̂_S will decrease with an increasing number of parameters n_A.
By (15.3.3) we can bound the risk by

R(Φ) ≤ R̂_S(Φ) + g(A, C_σ, B, m) + 4C_L √(log(4/δ)/m).
In the second regime, this upper bound is monotonically decreasing. In the first regime it may
both decrease and increase. In some cases, this behavior can lead to an upper bound on the risk
resembling the curve of Figure 15.2. The following section describes a specific scenario where this
is the case.
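To get a feeling for the two regimes, the following sketch evaluates g(A, C_σ, B, m) for increasing n_A. The constants c_1, c_2 and the values of L, d_max, m, d_0 below are arbitrary illustrative choices, not values taken from the text; the point is only that the first term grows with n_A until the constant second term takes over.

import numpy as np

c1, c2 = 1e-3, 5e-2
L, d_max, m, d0 = 5, 100, 10_000, 6

def g(n_A):
    # the two terms in the minimum defining g(A, C_sigma, B, m) in Theorem 15.2
    term_nn = c1 * np.sqrt((n_A * np.log(n_A * np.ceil(np.sqrt(m))) + L * n_A * np.log(d_max)) / m)
    term_lip = c2 * m ** (-1.0 / (2 + d0))
    return min(term_nn, term_lip)

for n_A in [10, 100, 1_000, 10_000, 100_000]:
    print(f"n_A = {n_A:>6d}:  g = {g(n_A):.4f}")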
Remark 15.3. Theorem 15.2 assumes C-Lipschitz continuity of the neural networks. As we saw in Sections 15.1.2 and 15.2, this assumption may not hold near the interpolation threshold. Hence, Theorem 15.2 likely gives an overly optimistic upper bound near the interpolation threshold.
To derive an upper bound on the risk, we start by upper bounding the empirical risk and then
applying Theorem 15.2 to establish an upper bound on the generalization gap. In combination,
these estimates provide an upper bound on the risk. We will then observe that this upper bound
follows the double descent curve in Figure 15.2.
for a constant C_approx > 0. Note that we can already interpolate the sample S with d_0 m parameters by Theorem 9.3. However, it is not guaranteed that this can be done using C_M-Lipschitz neural networks.
where

κ_NN(A_d, m; c_1) := c_1 √( (n_{A_d} log(⌈n_{A_d} √m⌉) + L n_{A_d} log(d)) / m ),
κ_Lip(A_d, m; c_2) := c_2 m^{−1/(2+d_0)},   (15.4.2)

and

R̃(Φ_{A_d}) := R̃_S(Φ_{A_d}) + min{ κ_NN(A_d, m; c_1), κ_Lip(A_d, m; c_2) } + 4C_L √( log(4/δ)/m ).   (15.4.3)
We depict in Figure 15.6 the upper bound on the risk given by (15.4.3) (excluding the terms that do not depend on the architecture). The upper bound clearly resembles the double descent phenomenon of Figure 15.2. Note that the point at which Lipschitz interpolation becomes possible, i.e., where we assume the empirical risk to be 0, lies slightly beyond the interpolation threshold. To produce the plot, we chose L = 5, c_1 = 1.2 · 10^{−4}, c_2 = 6.5 · 10^{−3}, m = 10,000, d_0 = 6, C_approx = 30. We mention that the double descent phenomenon is not visible for all choices of parameters. Moreover, the fact that the peak coincides with the interpolation threshold is due to our choice of constants and does not emerge from the model itself. Other models of double descent explain the location of the peak more accurately [143, 83]. We note that, as observed in Remark 15.3, the peak close to the interpolation threshold that we see in Figure 15.6 would likely be more pronounced in practical scenarios.
Figure 15.6: Upper bound on R(ΦAd ) derived in (15.4.3). For better visibility the part correspond-
ing to y-values between 0.0012 and 0.0022 is not shown. The vertical dashed line indicates the
interpolation threshold according to Theorem 9.3.
For a broader overview of the generalization capabilities of overparameterized neural networks, we refer to [19, Section 2]. Here, we only briefly mention two additional directions of inquiry. First, if the learning algorithm introduces a form of robustness, this can be leveraged to yield generalization bounds [6, 244, 24, 179]. Second, for very overparameterized neural networks, it was stipulated in [106] that neural networks become linear kernel interpolators based on the neural tangent kernel of Section 11.5.2. Thus, for large neural networks, generalization can be studied through kernel regression [106, 131, 15, 135].
Exercises
Exercise 15.4. Let f : [−1, 1] → R be a continuous function, and let −1 ≤ x1 < · · · < xm ≤ 1 for
some fixed m ∈ N. As in Section 15.1.2, we wish to approximate f by a least squares approximation.
To this end we use the Fourier ansatz functions
b_0(x) := 1/2   and   b_j(x) := sin(⌈j/2⌉ π x) if j ≥ 1 is odd, cos(⌈j/2⌉ π x) if j ≥ 1 is even.   (15.4.4)

For n ∈ N, denote by w_{n,∗} ∈ R^{n+1} the minimal norm minimizer of R̂_S, and set f_n(x) := Σ_{i=0}^{n} w_{n,∗,i} b_i(x).
Show that in this case generalization fails in the overparametrized regime: for sufficiently large
n ≫ m, fn is not necessarily a good approximation to f . What does fn converge to as n → ∞?
Exercise 15.5. Consider the setting of Exercise 15.4. We adapt the ansatz functions in (15.4.4)
by rescaling them via
b̃j := cj bj .
Choose real numbers cj ∈ R, such that the corresponding minimal norm least squares solution
avoids the phenomenon encountered in Exercise 15.4.
Hint: Should ansatz functions corresponding to large frequencies be scaled by large or small
numbers to avoid overfitting?
Chapter 16

Adversarial examples
How sensitive is the output of a neural network to small changes in its input? Real-world obser-
vations of trained neural networks often reveal that even barely noticeable modifications of the
input can lead to drastic variations in the network’s predictions. This intriguing behavior was first
documented in the context of image classification in [223].
Figure 16.1 illustrates this concept. The left panel shows a picture of a panda that the neural
network correctly classifies as a panda. By adding an almost imperceptible amount of noise to the
image, we obtain the modified image in the right panel. To a human, there is no visible difference,
but the neural network classifies the perturbed image as a wombat. This phenomenon, where
a correctly classified image is misclassified after a slight perturbation, is termed an adversarial
example.
In practice, such behavior is highly undesirable. It indicates that our learning algorithm might
not be very reliable and poses a potential security risk, as malicious actors could exploit it to trick
the algorithm. In this chapter, we describe the basic mathematical principles behind adversarial
examples and investigate simple conditions under which they might or might not occur. For sim-
plicity, we restrict ourselves to a binary classification problem but note that the main ideas remain
valid in more general situations.
Figure 16.1: An image that both a human and the neural network classify as a panda (the network with high confidence), plus 0.01 times a barely visible noise image (classified by the network as a flamingo with low confidence), yields a perturbed image that a human still recognizes as a panda, but that the network classifies as a wombat with high confidence.
16.1 Adversarial examples
Let us start by formalizing the notion of an adversarial example. We consider the problem of
assigning a label y ∈ {−1, 1} to a vector x ∈ Rd . It is assumed that the relation between x and y
is described by a distribution D on Rd × {−1, 1}. In particular, for a given x, both values −1 and
1 could have positive probability, i.e. the label is not necessarily deterministic. Additionally, we let
Dx := {x ∈ Rd | ∃y s.t. (x, y) ∈ supp(D)}, (16.1.1)
and refer to Dx as the feature support.
Throughout this chapter we denote by
g : Rd → {−1, 0, 1}
a fixed so-called ground-truth classifier, satisfying¹
P[y = g(x)|x] ≥ P[y = −g(x)|x] for all x ∈ Dx . (16.1.2)
Note that we allow g to take the value 0, which is to be understood as an additional label corre-
sponding to nonrelevant or nonsensical input data x. We will refer to g −1 (0) as the nonrelevant
class. The ground truth g is interpreted as how a human would classify the data, as the following
example illustrates.
Example 16.1. We wish to classify whether an image shows a panda (y = 1) or a wombat (y = −1).
Consider again Figure 16.1, and denote the three images by x1 , x2 , x3 . The first image x1 is a
photograph of a panda. Together with a label y, it can be interpreted as a draw (x1 , y) from D,
i.e. x1 ∈ Dx and g(x1 ) = 1. The second image x2 displays noise and corresponds to nonrelevant
data as it shows neither a panda nor a wombat. In particular, x2 ∈ Dxc and g(x2 ) = 0. The third
(perturbed) image x3 also belongs to Dxc , as it is not a photograph but a noise corrupted version
of x1 . Nonetheless, it is not nonrelevant, as a human would classify it as a panda. Thus g(x3 ) = 1.
In addition to the ground truth g, we denote by

h : R^d → {−1, 1}

some trained classifier.
¹To be more precise, the conditional distribution of y|x is only well-defined almost everywhere w.r.t. the marginal distribution of x. Thus (16.1.2) can only be assumed to hold for almost every x ∈ D_x w.r.t. the marginal distribution of x.
In words, x′ is an adversarial example to x with perturbation δ, if (i) the distance of x and x′
is at most δ, (ii) x and x′ belong to the same (not nonrelevant) class according to the ground truth
classifier, and (iii) the classifier h correctly classifies x but misclassifies x′ .
Remark 16.3. We emphasize that the concept of a ground-truth classifier g differs from a minimizer
of the Bayes risk (14.1.1) for two reasons. First, we allow for an additional label 0 corresponding to
the nonrelevant class, which does not exist for the data generating distribution D. Second, g should
correctly classify points outside of D_x; small perturbations of images, as they appear in adversarial examples, are not regular images in D_x. Nonetheless, a human classifier can still classify these images, and g models this property of human classification.
(iv) Not a Bayes classifier/non-exhaustive distribution: In this case everything is possible. Data points and their associated adversarial examples can both appear in the feature support of the distribution, and adversarial examples to elements of the feature support can also be created by leaving the feature support. We will see examples in the following section.
Theorem 16.4. Let w, w̄ ∈ R^d be nonzero. For x ∈ R^d, let h(x) = sign(w⊤x) be a classifier and let g(x) = sign(w̄⊤x) be the ground-truth classifier.
For every x ∈ R^d with h(x)g(x) > 0 and all ε ∈ (0, |w⊤x|) such that

|w̄⊤x| / ∥w̄∥ > (ε + |w⊤x|)/∥w∥ · |w⊤w̄| / (∥w∥ ∥w̄∥)   (16.3.2)

it holds that

x′ = x − h(x) (ε + |w⊤x|)/∥w∥² · w   (16.3.3)

is an adversarial example to x with perturbation δ = (ε + |w⊤x|)/∥w∥.
Before we present the proof, we give some interpretation of this result. First, note that {x ∈ R^d | w⊤x = 0} is the decision boundary of h, meaning that points lying on opposite sides of this hyperplane are classified differently by h. Due to |w⊤w̄| ≤ ∥w∥∥w̄∥, condition (16.3.2) is satisfied for sufficiently small ε > 0, and hence an adversarial example exists, whenever

|w̄⊤x| / ∥w̄∥ > |w⊤x| / ∥w∥.   (16.3.4)

The left term is the decision margin of x for g, i.e. the distance of x to the decision boundary of g. Similarly, the term on the right is the decision margin of x for h. Thus we conclude that adversarial examples exist if the decision margin of x for the ground truth g is larger than that for the classifier h.
Second, the term (w⊤w̄)/(∥w∥∥w̄∥) describes the alignment of the two classifiers. If the classifiers are not aligned, i.e., w and w̄ have a large angle between them, then adversarial examples exist even if the margin of the classifier is larger than that of the ground-truth classifier.
Finally, adversarial examples with small perturbation are possible if |w⊤x| ≪ ∥w∥. The extreme case w⊤x = 0 means that x lies on the decision boundary of h, and if |w⊤x| ≪ ∥w∥ then x is close to the decision boundary of h.
Proof of Theorem 16.4. We verify that x′ in (16.3.3) satisfies the conditions of an adversarial example in Definition 16.2. In the following we will use that, due to h(x)g(x) > 0,

g(x) = sign(w̄⊤x) = sign(w⊤x) = h(x) ≠ 0.   (16.3.5)

First, it holds

∥x − x′∥ = ∥ (ε + |w⊤x|)/∥w∥² · w ∥ = (ε + |w⊤x|)/∥w∥ = δ.

Next we show g(x)g(x′) > 0, i.e. that (w̄⊤x)(w̄⊤x′) is positive. Plugging in the definition of x′, this term reads

w̄⊤x ( w̄⊤x − h(x) (ε + |w⊤x|)/∥w∥² · w⊤w̄ ) = |w̄⊤x|² − |w̄⊤x| (ε + |w⊤x|)/∥w∥² · w⊤w̄
 ≥ |w̄⊤x|² − |w̄⊤x| (ε + |w⊤x|)/∥w∥² · |w⊤w̄|,   (16.3.6)

where the equality holds because h(x) = g(x) = sign(w̄⊤x) by (16.3.5). Dividing the right-hand side of (16.3.6) by |w̄⊤x| ∥w̄∥, which is positive by (16.3.5), we obtain

|w̄⊤x|/∥w̄∥ − (ε + |w⊤x|)/∥w∥ · |w⊤w̄|/(∥w∥ ∥w̄∥).   (16.3.7)

The term (16.3.7) is positive thanks to (16.3.2).
Finally, we check that 0 ≠ h(x′) ≠ h(x), i.e. (w⊤x)(w⊤x′) < 0. We have that

(w⊤x)(w⊤x′) = |w⊤x|² − w⊤x h(x) (ε + |w⊤x|)/∥w∥² · w⊤w = |w⊤x|² − |w⊤x|(ε + |w⊤x|) < 0,

where we used that h(x) = sign(w⊤x). This completes the proof.
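A minimal numerical illustration of the construction (16.3.3) follows (our own sketch; the weight vectors, the point x, and ε are arbitrary choices satisfying the assumptions of Theorem 16.4):

import numpy as np

w = np.array([0.3, 1.0])        # classifier h(x) = sign(w^T x)
w_bar = np.array([1.0, 0.0])    # ground truth g(x) = sign(w_bar^T x)
x = np.array([2.0, -0.5])       # here h(x) = g(x) = 1
eps = 0.05

h = lambda z: np.sign(w @ z)
g = lambda z: np.sign(w_bar @ z)

# check condition (16.3.2)
lhs = abs(w_bar @ x) / np.linalg.norm(w_bar)
rhs = (eps + abs(w @ x)) / np.linalg.norm(w) * abs(w @ w_bar) / (np.linalg.norm(w) * np.linalg.norm(w_bar))
assert lhs > rhs

# adversarial example (16.3.3) and its perturbation delta
x_adv = x - h(x) * (eps + abs(w @ x)) / np.linalg.norm(w) ** 2 * w
delta = (eps + abs(w @ x)) / np.linalg.norm(w)

print("h(x), g(x)  :", h(x), g(x))           # both +1
print("h(x'), g(x'):", h(x_adv), g(x_adv))   # h flips, g does not
print("||x - x'||  :", np.linalg.norm(x - x_adv), "= delta =", delta)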
Theorem 16.4 readily implies the following proposition for affine classifiers.
it holds that

x′ = x − h(x) (ε + |w⊤x + b|)/∥w∥² · w

is an adversarial example to x with perturbation δ = (ε + |w⊤x + b|)/∥w∥.
Next fix α ∈ (0, 1) and set w := α w̄ + (1 − α²)^{1/2} v for some v ∈ w̄^⊥ with ∥v∥ = 1, so that ∥w∥ = 1. We let h(x) := sign(w⊤x). We now show that every x ∈ D_x satisfies the assumptions of Theorem 16.4, and therefore admits an adversarial example.
Note that h(x) = g(x) for every x ∈ D_x. Hence h is a Bayes classifier. Now fix x ∈ D_x. Then |w⊤x| ≤ α|w̄⊤x|, so that (16.3.2) is satisfied. Furthermore, for every ε > 0 it holds that

δ := (ε + |w⊤x|)/∥w∥ ≤ ε + α.

Hence, for ε < |w⊤x| it holds by Theorem 16.4 that there exists an adversarial example with perturbation less than ε + α. For small α, the situation is depicted in the upper panel of Figure 16.2.
For the second example, we construct a distribution with global feature support and a classifier
which is not a Bayes classifier. This corresponds to case (iv) in Section 16.2.
Figure 16.2: Illustration of the two types of adversarial examples in Examples 16.6 and 16.7. In panel A) the feature support D_x corresponds to the dashed line. We depict the two decision boundaries DB_h = {x | w⊤x = 0} of h(x) = sign(w⊤x) and DB_g = {x | w̄⊤x = 0} of g(x) = sign(w̄⊤x). Both h and g perfectly classify every data point in D_x. One data point x is shifted outside of the support of the distribution in a way that changes its label according to h. This creates an adversarial example x′. In panel B) the data distribution is globally supported. However, h and g do not coincide. Thus the decision boundaries DB_h and DB_g do not coincide. Moving data points across DB_h can create adversarial examples, as depicted by x and x′.
Example 16.7. Let D_x be a distribution on R^d with positive Lebesgue density everywhere outside the decision boundary DB_g = {x | w̄⊤x = 0} of g. We define D to be the distribution of (X, g(X)) for X ∼ D_x. In addition, let w ∉ {±w̄}, ∥w∥ = 1, and h(x) = sign(w⊤x). We exclude w = −w̄ because in this case every prediction of h is wrong, and thus no adversarial examples are possible.
By construction the feature support is given by D_x = R^d. Moreover, h^{-1}({±1}) and g^{-1}({±1}) are half-spaces, which implies, in the notation of (16.2.2), that
Hence, for every δ > 0 there is a positive probability of observing x to which an adversarial example
with perturbation δ exists.
The situation is depicted in the lower panel of Figure 16.2.
i.e., as the distance of x to the closest element that is classified differently from x or the infimum
over all distances to elements from other classes if no closest element exists. Additionally, we denote
the distance of x to the closest adjacent affine piece by ν_Φ(x) := inf{∥x − y∥ | y ∉ A_{Φ,x}}, where A_{Φ,x} is the largest connected region on which Φ is affine and which contains x. We have the
following theorem.
µ_g(x), ν_Φ(x) > (ε + |Φ(x)|)/∥∇Φ(x)∥.

Then

x′ := x − h(x) (ε + |Φ(x)|)/∥∇Φ(x)∥² · ∇Φ(x)

is an adversarial example to x with perturbation δ := (ε + |Φ(x)|)/∥∇Φ(x)∥.
Proof. We show that x′ satisfies the properties in Definition 16.2.
By construction ∥x − x′ ∥ ≤ δ. Since µg (x) > δ it follows that g(x) = g(x′ ). Moreover, by
assumption g(x) ̸= 0, and thus g(x)g(x′ ) > 0.
It only remains to show that h(x′ ) ̸= h(x). Since δ < νΦ (x), we have that Φ(x) = ∇Φ(x)⊤ x + b
and Φ(x′ ) = ∇Φ(x)⊤ x′ + b for some b ∈ R. Therefore,
Φ(x) − Φ(x′) = ∇Φ(x)⊤(x − x′) = ∇Φ(x)⊤ h(x) (ε + |Φ(x)|)/∥∇Φ(x)∥² · ∇Φ(x) = h(x)(ε + |Φ(x)|).
Since h(x)|Φ(x)| = Φ(x) it follows that Φ(x′ ) = −h(x)ε. Hence, h(x′ ) = −h(x), which completes
the proof.
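The following sketch carries out this construction for a small, randomly initialized one-hidden-layer ReLU network (an illustrative example of our own, not taken from the text). On the affine piece containing x the gradient is constant, so if ν_Φ(x) is large enough the perturbed point satisfies Φ(x′) = −h(x)ε and the predicted class flips.

import numpy as np

rng = np.random.default_rng(3)
d, width = 5, 20
W1, b1 = rng.standard_normal((width, d)), rng.standard_normal(width)
w2, b2 = rng.standard_normal(width), rng.standard_normal()

def phi(x):
    return w2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def grad_phi(x):
    active = (W1 @ x + b1 > 0).astype(float)   # ReLU activation pattern at x
    return (w2 * active) @ W1                  # gradient on the affine piece containing x

x = rng.standard_normal(d)
eps = 0.01
h_x = np.sign(phi(x))
gvec = grad_phi(x)

x_adv = x - h_x * (eps + abs(phi(x))) / np.linalg.norm(gvec) ** 2 * gvec

print("Phi(x)  =", phi(x), " class:", h_x)
print("Phi(x') =", phi(x_adv), " class:", np.sign(phi(x_adv)))   # flips if x' lies in the same affine piece
print("||x - x'|| =", np.linalg.norm(x - x_adv))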
Remark 16.9. We look at the key parameters in Theorem 16.8 to understand which factors facilitate
adversarial examples.
• The geometric margin of the ground-truth classifier µg (x): To make the construction possible,
we need to be sufficiently far away from points that belong to a different class than x or to
the nonrelevant class.
• The distance to the next affine piece νΦ (x): Since we are looking for an adversarial example
within the same affine piece as x, we need this piece to be sufficiently large.
16.5 Robustness
Having established that adversarial examples can arise in various ways under mild assumptions, we
now turn our attention to conditions that prevent their existence.
Proposition 16.10. Let Φ : Rd → R be CL -Lipschitz with CL > 0, and let s > 0. Let h(x) =
sign(Φ(x)) be a classifier, and let g : Rd → {−1, 0, 1} be a ground-truth classifier. Moreover, let
x ∈ Rd be such that
Φ(x)g(x) ≥ s. (16.5.1)
Then there does not exist an adversarial example to x of perturbation δ < s/CL .
Proof. Let x ∈ R^d satisfy (16.5.1) and let x′ ∈ R^d with ∥x′ − x∥ ≤ δ < s/C_L. The Lipschitz continuity of Φ implies

|Φ(x′) − Φ(x)| ≤ C_L ∥x′ − x∥ ≤ C_L δ < s.

Since |Φ(x)| ≥ s, we conclude that Φ(x′) has the same sign as Φ(x), which shows that x′ cannot be an adversarial example to x.
Remark 16.11. As we have seen in Lemma 13.2, we can bound the Lipschitz constant of ReLU
neural networks by restricting the magnitude and number of their weights and the number of
layers.
There has been some criticism of results of this form, see, e.g., [99], since an assumption on the Lipschitz constant may potentially restrict the capabilities of the neural network too much. We next present a result that shows under which assumptions on the training set there exists a neural network that classifies the training set correctly, but does not allow for adversarial examples within the training set.
sup_{i≠j} |g(x_i) − g(x_j)| / ∥x_i − x_j∥ =: M̃ > 0.

Then there exists a ReLU neural network Φ with depth(Φ) = O(log(m)) and width(Φ) = O(dm) such that for all i = 1, . . . , m

sign(Φ(x_i)) = g(x_i),

and there is no adversarial example of perturbation δ = 1/M̃ to x_i.
Proof. The result follows directly from Theorem 9.6 and Proposition 16.10. The reader is invited
to complete the argument in Exercise 16.20.
The Lipschitz bound given in Lemma 13.2 grows exponentially with the depth of the neural network. However, in practice this bound may be pessimistic, and locally the neural network might have significantly smaller gradients than the global Lipschitz constant.
Because of this, it is reasonable to study results preventing adversarial examples under local
Lipschitz bounds. Such a result together with an algorithm providing bounds on the local Lipschitz
constant was proposed in [88]. We state the theorem adapted to our set-up.
Theorem 16.13. Let h : Rd → {−1, 1} be a classifier of the form h(x) = sign(Φ(x)) and let
g : Rd → {−1, 0, 1} be the ground-truth classifier. Let x ∈ Rd satisfy g(x) ̸= 0, and set
α := max_{R>0} min{ Φ(x)g(x) / ( sup_{∥y−x∥∞ ≤ R, y≠x} |Φ(y) − Φ(x)| / ∥x − y∥∞ ), R },   (16.5.2)
where the minimum is understood to be R in case the supremum is zero. Then there are no adver-
sarial examples to x with perturbation δ < α.
Proof. Let x ∈ R^d be as in the statement of the theorem. Assume, towards a contradiction, that for some 0 < δ < α there exists an adversarial example x′ to x with perturbation δ.
If the supremum in (16.5.2) is zero, then Φ is constant on a ball of radius R around x. In particular, for ∥x′ − x∥ ≤ δ < R it holds that h(x′) = h(x), and x′ cannot be an adversarial example.
Now assume the supremum in (16.5.2) is not zero. Since δ < α, it follows from (16.5.2) that there exists R > 0 with δ ≤ R and

δ < Φ(x)g(x) / ( sup_{∥y−x∥∞ ≤ R, y≠x} |Φ(y) − Φ(x)| / ∥x − y∥∞ ).   (16.5.3)

Moreover,

|Φ(x′) − Φ(x)| ≤ ( sup_{∥y−x∥∞ ≤ R, y≠x} |Φ(y) − Φ(x)| / ∥x − y∥∞ ) · ∥x − x′∥∞
 ≤ ( sup_{∥y−x∥∞ ≤ R, y≠x} |Φ(y) − Φ(x)| / ∥x − y∥∞ ) · δ < Φ(x)g(x),

where we used (16.5.3) in the last step. Since x′ is an adversarial example, h correctly classifies x, so that Φ(x)g(x) = |Φ(x)|. Hence Φ(x′) has the same sign as Φ(x), i.e. h(x′) = h(x), contradicting the assumption that x′ is an adversarial example to x.
The supremum in (16.5.2) is bounded by the Lipschitz constant of Φ on BR (x). Thus Theorem
16.13 depends only on the local Lipschitz constant of Φ. One obvious criticism of this result is
that the computation of (16.5.2) is potentially prohibitive. We next show a different result, for
which the assumptions can immediately be checked by applying a simple algorithm that we present
subsequently.
To state the following proposition, for a continuous function Φ : R^d → R, a point x ∈ R^d, and δ > 0 we define

z_{δ,max} := max_{∥y−x∥∞ ≤ δ} Φ(y)   and   z_{δ,min} := min_{∥y−x∥∞ ≤ δ} Φ(y).
Proposition 16.14. Let h : R^d → {−1, 1} be a classifier of the form h(x) = sign(Φ(x)) and g : R^d → {−1, 0, 1}, and let x be such that h(x) = g(x). Then x does not have an adversarial example of perturbation δ if z_{δ,max} z_{δ,min} > 0.
Proof. The proof is immediate, since z_{δ,max} z_{δ,min} > 0 implies that all points in a δ-neighborhood of x are classified the same.
To apply Proposition 16.14, we only have to compute z_{δ,max} and z_{δ,min}. It turns out that if Φ is a neural network, then z_{δ,max}, z_{δ,min} can be bounded by a computation similar to a forward pass of Φ. Denote by |A| the matrix obtained by taking the absolute value of each entry of the matrix A. Additionally, we define

A_+ = (|A| + A)/2   and   A_− = (|A| − A)/2.

The idea behind Algorithm 2 is common in the area of neural network verification, see, e.g., [66, 61, 7, 238].
Remark 16.15. Up to constants, Algorithm 2 has the same computational complexity as a forward pass, see also Algorithm 1. In addition, in contrast to upper bounds based on estimating the global Lipschitz constant of Φ via its weights, the upper bounds found via Algorithm 2 include the effect of the activation function σ. For example, if σ is the ReLU, then we may often end up in a situation where δ^{(ℓ),up} or δ^{(ℓ),low} have many entries that are 0. If an entry of W^{(ℓ)} x^{(ℓ)} + b^{(ℓ)} is nonpositive, then it is guaranteed that the associated entry of δ^{(ℓ+1),low} will be zero. Similarly, if W^{(ℓ)} has only few positive entries, then most of the entries of δ^{(ℓ),up} are not propagated to δ^{(ℓ+1),up}.
Next, we prove that Algorithm 2 indeed produces sensible output.
Proposition 16.16. Let Φ be a neural network with weight matrices W^{(ℓ)} ∈ R^{d_{ℓ+1}×d_ℓ} and bias vectors b^{(ℓ)} ∈ R^{d_{ℓ+1}} for ℓ = 0, . . . , L, and a monotonically increasing activation function σ. Let x ∈ R^{d_0} and δ > 0. Then the output of Algorithm 2 satisfies x^{(L+1)} = Φ(x) as well as

x^{(L+1)} − δ^{(L+1),low} ≤ z_{δ,min}   and   z_{δ,max} ≤ x^{(L+1)} + δ^{(L+1),up}.
Algorithm 2 Compute Φ(x) and bounds on z_{δ,max} and z_{δ,min} for a given neural network
Input: weight matrices W^{(ℓ)} ∈ R^{d_{ℓ+1}×d_ℓ} and bias vectors b^{(ℓ)} ∈ R^{d_{ℓ+1}} for ℓ = 0, . . . , L with d_{L+1} = 1, monotonically increasing activation function σ, input vector x ∈ R^{d_0}, neighborhood size δ > 0
Output: Φ(x) and bounds for z_{δ,max} and z_{δ,min}
x^{(0)} = x
δ^{(0),up} = δ1 ∈ R^{d_0}
δ^{(0),low} = δ1 ∈ R^{d_0}
for ℓ = 0, . . . , L − 1 do
  x^{(ℓ+1)} = σ(W^{(ℓ)} x^{(ℓ)} + b^{(ℓ)})
  δ^{(ℓ+1),up} = σ(W^{(ℓ)} x^{(ℓ)} + (W^{(ℓ)})_+ δ^{(ℓ),up} + (W^{(ℓ)})_− δ^{(ℓ),low} + b^{(ℓ)}) − x^{(ℓ+1)}
  δ^{(ℓ+1),low} = x^{(ℓ+1)} − σ(W^{(ℓ)} x^{(ℓ)} − (W^{(ℓ)})_+ δ^{(ℓ),low} − (W^{(ℓ)})_− δ^{(ℓ),up} + b^{(ℓ)})
end for
x^{(L+1)} = W^{(L)} x^{(L)} + b^{(L)}
δ^{(L+1),up} = (W^{(L)})_+ δ^{(L),up} + (W^{(L)})_− δ^{(L),low}
δ^{(L+1),low} = (W^{(L)})_+ δ^{(L),low} + (W^{(L)})_− δ^{(L),up}
return x^{(L+1)}, x^{(L+1)} + δ^{(L+1),up}, x^{(L+1)} − δ^{(L+1),low}
Proof. Fix y, x ∈ R^{d_0} with ∥y − x∥∞ ≤ δ and let y^{(ℓ)}, x^{(ℓ)} for ℓ = 0, . . . , L + 1 be as in Algorithm 2 applied to y and x, respectively. Moreover, let δ^{(ℓ),up}, δ^{(ℓ),low} for ℓ = 0, . . . , L + 1 be as in Algorithm 2 applied to x. We will prove by induction over ℓ = 0, . . . , L + 1 that

x^{(ℓ)} − δ^{(ℓ),low} ≤ y^{(ℓ)} ≤ x^{(ℓ)} + δ^{(ℓ),up},   (16.5.6)

where the inequalities are understood entry-wise for vectors. Since y was arbitrary, this then proves the result.
The case ℓ = 0 follows immediately from ∥y − x∥∞ ≤ δ. Assume now that the statement was shown for some ℓ < L. Writing W^{(ℓ)} = (W^{(ℓ)})_+ − (W^{(ℓ)})_− and using the monotonicity of σ, we have that

y^{(ℓ+1)} = σ(W^{(ℓ)} y^{(ℓ)} + b^{(ℓ)})
 = σ(W^{(ℓ)} x^{(ℓ)} + (W^{(ℓ)})_+ (y^{(ℓ)} − x^{(ℓ)}) − (W^{(ℓ)})_− (y^{(ℓ)} − x^{(ℓ)}) + b^{(ℓ)})
 ≤ σ(W^{(ℓ)} x^{(ℓ)} + (W^{(ℓ)})_+ δ^{(ℓ),up} + (W^{(ℓ)})_− δ^{(ℓ),low} + b^{(ℓ)}) = x^{(ℓ+1)} + δ^{(ℓ+1),up},

where we used the induction assumption in the last inequality. This shows the upper estimate in (16.5.6). Similarly,

y^{(ℓ+1)} ≥ σ(W^{(ℓ)} x^{(ℓ)} − (W^{(ℓ)})_+ δ^{(ℓ),low} − (W^{(ℓ)})_− δ^{(ℓ),up} + b^{(ℓ)}) = x^{(ℓ+1)} − δ^{(ℓ+1),low},

where we again used the induction assumption. This completes the proof of (16.5.6) for all ℓ ≤ L.
The case ℓ = L + 1 follows by the same argument, but replacing σ by the identity.
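The bound propagation of Algorithm 2 is straightforward to implement. The following Python sketch is an illustrative implementation for ReLU networks; it is not the authors' code, and the architecture, weights, and input below are arbitrary choices.

import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def algorithm2(weights, biases, x, delta, sigma=relu):
    # Propagate entrywise deviations as in Algorithm 2; returns Phi(x) together
    # with an upper bound on z_{delta,max} and a lower bound on z_{delta,min}.
    x_cur = np.asarray(x, dtype=float)
    d_up = np.full_like(x_cur, delta)
    d_low = np.full_like(x_cur, delta)
    for W, b in zip(weights[:-1], biases[:-1]):
        Wp, Wm = np.maximum(W, 0.0), np.maximum(-W, 0.0)   # (W)_+ and (W)_-
        pre = W @ x_cur + b
        x_next = sigma(pre)
        d_up_next = sigma(pre + Wp @ d_up + Wm @ d_low) - x_next
        d_low_next = x_next - sigma(pre - Wp @ d_low - Wm @ d_up)
        x_cur, d_up, d_low = x_next, d_up_next, d_low_next
    W, b = weights[-1], biases[-1]
    Wp, Wm = np.maximum(W, 0.0), np.maximum(-W, 0.0)
    out = W @ x_cur + b
    return out, out + Wp @ d_up + Wm @ d_low, out - (Wp @ d_low + Wm @ d_up)

# tiny example with random weights
rng = np.random.default_rng(0)
dims = [4, 16, 16, 1]
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
bs = [rng.standard_normal(dims[i + 1]) for i in range(len(dims) - 1)]
x0 = rng.standard_normal(dims[0])
val, upper, lower = algorithm2(Ws, bs, x0, delta=0.1)
print(val, lower, upper)   # lower <= Phi(y) <= upper whenever ||y - x0||_inf <= 0.1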
Exercises
Exercise 16.17. Prove (16.3.1) by comparing the volume of the d-dimensional Euclidean unit ball
with the volume of the d-dimensional 1-ball of radius c for a given c > 0.
Exercise 16.18. Fix δ > 0. For a pair of classifiers h and g such that C_1 ∪ C_{−1} = ∅ in (16.2.2), there trivially cannot exist any adversarial examples. Construct an example of h, g, D such that C_1, C_{−1} ≠ ∅, h is not a Bayes classifier, and g is such that no adversarial examples with a perturbation δ exist.
Is this also possible if g −1 (0) = ∅?
Appendix A
Probability theory
This appendix provides some basic notions and results in probability theory required in the main
text. It is intended as a revision for a reader already familiar with these concepts. For more details
and proofs, we refer for example to the standard textbook [117].
Definition A.1. Let Ω be a set. A set A ⊆ 2^Ω of subsets of Ω is called a sigma-algebra (on Ω) if
(i) Ω ∈ A,
(ii) A^c ∈ A whenever A ∈ A,
(iii) ⋃_{i∈N} A_i ∈ A whenever A_i ∈ A for all i ∈ N.
For a sigma-algebra A on Ω, the tuple (Ω, A) is also referred to as a measurable space. For a
measurable space, a subset A ⊆ Ω is called measurable, if A ∈ A. Measurable sets are also called
events.
Another key system of subsets of Ω is that of a topology.
Definition A.2. Let Ω be a set. A set T ⊆ 2^Ω of subsets of Ω is called a topology (on Ω) if
(i) ∅, Ω ∈ T,
(ii) O_1 ∩ O_2 ∈ T whenever O_1, O_2 ∈ T,
(iii) ⋃_{i∈I} O_i ∈ T whenever O_i ∈ T for all i in an index set I.
If T is a topology on Ω, we call (Ω, T) a topological space, and a set O ⊆ Ω is called open if and
only if O ∈ T.
Remark A.3. The two notions differ in that a topology allows for unions of arbitrary (possibly uncountably many) sets, but only for finite intersections, whereas a sigma-algebra allows for countable unions and intersections.
Example A.4. Let d ∈ N and denote by Bε (x) = {y ∈ Rd | ∥y − x∥ < ε} the set of points
whose Euclidean distance to x is less than ε. Then for every A ⊆ Rd , the smallest topology on A
containing A ∩ Bε (x) for all ε > 0, x ∈ Rd , is called the Euclidean topology on A.
If (Ω, T) is a topological space, then the Borel sigma-algebra refers to the smallest sigma-
algebra on Ω containing all open sets, i.e. all elements of T. Throughout this book, subsets of Rd
are always understood to be equipped with the Euclidean topology and the Borel sigma-algebra.
The Borel sigma-algebra on Rd is denoted by Bd .
We can now introduce measures.
Definition A.5. Let (Ω, A) be a measurable space. A mapping µ : A → [0, ∞] is called a measure
if it satisfies
(i) µ(∅) = 0,
(ii) for every sequence (A_i)_{i∈N} ⊆ A such that A_i ∩ A_j = ∅ whenever i ≠ j, it holds

µ(⋃_{i∈N} A_i) = Σ_{i∈N} µ(A_i).
Example A.6. One can show that there exists a unique measure λ on (R^d, B_d) such that for all sets of the type ×_{i=1}^d [a_i, b_i) with −∞ < a_i ≤ b_i < ∞ it holds

λ(×_{i=1}^d [a_i, b_i)) = ∏_{i=1}^d (b_i − a_i).
This measure is called the Lebesgue measure.
If µ is a measure on the measurable space (Ω, A), then the triplet (Ω, A, µ) is called a measure space. In case µ is a probability measure, i.e. µ(Ω) = 1, it is called a probability space.
Let (Ω, A, µ) be a measure space. A subset N ⊆ Ω is called a null-set, if N is measurable and
µ(N ) = 0. Moreover, an equality or inequality is said to hold µ-almost everywhere or µ-almost
surely, if it is satisfied on the complement of a null-set. In case µ is clear from context, we simply
write “almost everywhere” or “almost surely” instead. Usually this refers to the Lebesgue measure.
Definition A.7. Let (Ω_1, A_1) and (Ω_2, A_2) be two measurable spaces. A function f : Ω_1 → Ω_2 is called measurable if f^{-1}(A_2) ∈ A_1 for all A_2 ∈ A_2.
Remark A.8. We again point out the parallels to topological spaces: A function f : Ω1 → Ω2
between two topological spaces (Ω1 , T1 ) and (Ω2 , T2 ) is called continuous if f −1 (O2 ) ∈ T1 for all
O2 ∈ T2 .
Let Ω1 be a set and let (Ω2 , A2 ) be a measurable space. For X : Ω1 → Ω2 , we can ask for
the smallest sigma-algebra AX on Ω1 , such that X is measurable as a mapping from (Ω1 , AX ) to
(Ω2 , A2 ). Clearly, for every sigma-algebra A1 on Ω1 , X is measurable as a mapping from (Ω1 , A1 )
to (Ω2 , A2 ) if and only if every A ∈ AX belongs to A1 ; or in other words, AX is a sub sigma-algebra
of A1 . It is easy to check that AX is given through the following definition.
AX := {X −1 (A2 ) | A2 ∈ A2 } ⊆ 2Ω1
Definition A.10. The measure P_X is called the distribution of X. If (Ω_2, A_2) = (R^d, B_d) and there exists a function f_X : R^d → R such that

P_X[A] = ∫_A f_X(x) dx   for all A ∈ B_d,

then f_X is called the density of X.
Remark A.11. The term distribution is often used without specifying an underlying probability space and random variable. In this case, “distribution” stands interchangeably for “probability measure”. For example, “µ is a distribution on Ω_2” states that µ is a probability measure on the measurable space (Ω_2, A_2). In this case, there always exists a probability space (Ω_1, A_1, P) and a random variable X : Ω_1 → Ω_2 such that P_X = µ; namely (Ω_1, A_1, P) = (Ω_2, A_2, µ) and X(ω) = ω.
Example A.12. Some important distributions include the following.
• Bernoulli distribution: A random variable X : Ω → {0, 1} is Bernoulli distributed if there
exists p ∈ [0, 1] such that P[X = 1] = p and P[X = 0] = 1 − p.
• Uniform distribution: A random variable X : Ω → Rd is uniformly distributed on a
measurable set A ∈ Bd , if its density equals
f_X(x) = (1/|A|) 1_A(x),

where |A| < ∞ is the Lebesgue measure of A.
• Gaussian distribution: A random variable X : Ω → R^d is Gaussian distributed with mean m ∈ R^d and regular covariance matrix C ∈ R^{d×d}, if its density equals

f_X(x) = (2π)^{−d/2} det(C)^{−1/2} exp( −(1/2) (x − m)^⊤ C^{−1} (x − m) ).

We denote this distribution by N(m, C).
Let (Ω, A, P) be a probability space, let X : Ω → Rd be an Rd -valued random variable. We then
call the Lebesgue integral
E[X] := ∫_Ω X(ω) dP(ω) = ∫_{R^d} x dP_X(x)   (A.2.1)

the expectation of X. Moreover, for k ∈ N we say that X has finite k-th moment if E[∥X∥^k] < ∞. Similarly, for a probability measure µ on R^d and k ∈ N, we say that µ has finite k-th moment if

∫_{R^d} ∥x∥^k dµ(x) < ∞.
Furthermore, the matrix

∫_Ω (X(ω) − E[X]) (X(ω) − E[X])^⊤ dP(ω) ∈ R^{d×d}

is called the covariance of X.
(ii) converge in probability to X, if for all ε > 0

lim_{j→∞} P[{ω ∈ Ω | |X_j(ω) − X(ω)| > ε}] = 0,
The notions in Definition A.13 are ordered by decreasing strength, i.e. almost sure convergence implies convergence in probability, and convergence in probability implies weak convergence, see for example [117, Chapter 13]. Since E[f ◦ X] = ∫_{R^d} f(x) dP_X(x), the notion of weak convergence only depends on the distribution P_X of X. We thus also say that a sequence of random variables converges weakly towards a measure µ.
and similarly for P_Y. Thus the marginals P_X, P_Y can be constructed from the joint distribution P_Z. Conversely, knowledge of the marginals is in general not sufficient to reconstruct the joint distribution.
A.3.2 Independence
The concept of independence serves to formalize the situation, where knowledge of one random
variable provides no information about another random variable. We first give the formal definition,
and afterwards discuss the roll of a die as a simple example.
Definition A.14. Let (Ω, A, P) be a probability space. Then two events A, B ∈ A are called independent if

P[A ∩ B] = P[A] P[B].

Two random variables X : Ω → R^{d_X} and Y : Ω → R^{d_Y} are called independent, if the events {X ∈ A} and {Y ∈ B} are independent for all A ∈ B_{d_X} and B ∈ B_{d_Y}.
Two random variables are thus independent, if and only if all events in their induced sigma-
algebras are independent. This turns out to be equivalent to the joint distribution P_{(X,Y)} being equal to the product measure P_X ⊗ P_Y; the latter is characterized as the unique measure µ on R^{d_X + d_Y} satisfying µ(A × B) = P_X[A] P_Y[B] for all A ∈ B_{d_X}, B ∈ B_{d_Y}.
Example A.15. Let Ω = {1, . . . , 6} represent the outcomes of rolling a fair die, let A = 2Ω be the
sigma-algebra, and let P[ω] = 1/6 for all ω ∈ Ω. Consider the three random variables
X_1(ω) = 0 if ω is odd and X_1(ω) = 1 if ω is even;  X_2(ω) = 0 if ω ≤ 3 and X_2(ω) = 1 if ω ≥ 4;  X_3(ω) = 0 if ω ∈ {1, 2}, X_3(ω) = 1 if ω ∈ {3, 4}, and X_3(ω) = 2 if ω ∈ {5, 6}.
E[XY] = ∫_{R²} x y dP_{(X,Y)}(x, y) = ∫_R x dP_X(x) · ∫_R y dP_Y(y) = E[X] E[Y].
Using this observation, it is easy to see that for a sequence of independent R-valued random variables (X_i)_{i=1}^n with bounded second moments, there holds Bienaymé's identity

V[ Σ_{i=1}^n X_i ] = Σ_{i=1}^n V[X_i].   (A.3.1)
Definition A.18 (regular conditional distribution). Let (Ω, A, P) be a probability space, and let
X : Ω → RdX and Y : Ω → RdY be two random variables. Let τX|Y : BdX × RdY → [0, 1] satisfy
(i) y 7→ τX|Y (A, y) : RdY → [0, 1] is measurable for every fixed A ∈ BdX ,
(ii) A 7→ τX|Y (A, y) is a probability measure on (RdX , BdX ) for every y ∈ Y (Ω),
Definition A.18 provides a mathematically rigorous way of assigning a distribution to a random
variable conditioned on an event that may have probability zero, as in Example A.17. Existence
and uniqueness of these conditional distributions hold in the following sense, see for example [117,
Chapter 8] or [201, Chapter 3] for the specific statement given here.
Theorem A.19. Let (Ω, A, P) be a probability space, and let X : Ω → RdX , Y : Ω → RdY be two
random variables. Then there exists a regular version of the conditional distribution τ1 .
Let τ2 be another regular version of the conditional distribution. Then there exists a PY -null
set N ⊆ RdY , such that for all y ∈ N c ∩ Y (Ω), the two probability measures τ1 (·, y) and τ2 (·, y)
coincide.
Definition A.20. Let (Ω, A, P) be a probability space, and let X : Ω → RdX , Y : Ω → RdY ,
Z : Ω → RdZ be three random variables. We say that X and Z are conditionally independent
given Y , if the two distributions X|Y = y and Z|Y = y are independent for PY -almost every
y ∈ Y (Ω).
This is for example the case if there exists C < ∞ such that V[Xi ] ≤ C for all i ∈ N. Concentration
inequalities provide bounds on the rate of this convergence.
We start with Markov’s inequality.
Lemma A.21 (Markov’s inequality). Let X : Ω → R be a random variable, and let φ : [0, ∞) →
[0, ∞) be monotonically increasing. Then for all ε > 0
P[|X| ≥ ε] ≤ E[φ(|X|)] / φ(ε).
Proof. We have

P[|X| ≥ ε] = ∫_{{|X| ≥ ε}} 1 dP(ω) ≤ ∫_Ω φ(|X(ω)|)/φ(ε) dP(ω) = E[φ(|X|)]/φ(ε),

where we used that φ is monotonically increasing.
Applying Markov’s inequality with φ(x) := x2 to the random variable X − E[X] directly gives
Chebyshev’s inequality.
Lemma A.22 (Chebyshev’s inequality). Let X : Ω → R be a random variable with finite variance.
Then for all ε > 0
P[|X − E[X]| ≥ ε] ≤ V[X]/ε².
From Chebyshev’s inequality we obtain the next result, which is a quite general concentration
inequality for random variables with finite variances.
Theorem A.23. Let X_1, . . . , X_n be n ∈ N independent real-valued random variables such that for some ς > 0 it holds E[|X_i − µ|²] ≤ ς² for all i = 1, . . . , n, where

µ := E[ (1/n) Σ_{j=1}^n X_j ].   (A.4.2)

Then for every ε > 0

P[ | (1/n) Σ_{j=1}^n X_j − µ | ≥ ε ] ≤ ς² / (n ε²).
Proof. Let S_n := (1/n) Σ_{j=1}^n (X_j − E[X_j]) = (1/n) Σ_{j=1}^n X_j − µ. By Bienaymé's identity (A.3.1), it holds that

V[S_n] = (1/n²) Σ_{j=1}^n E[(X_j − E[X_j])²] ≤ ς²/n,

where we used that E[(X_j − E[X_j])²] ≤ E[(X_j − µ)²] ≤ ς², since the expectation minimizes the mean squared deviation. The claim now follows by applying Chebyshev's inequality (Lemma A.22) to S_n, since E[S_n] = 0.
If we have additional information about the random variables, then we can derive sharper
bounds. In case of uniformly bounded random variables (rather than just bounded variance),
Hoeffding’s inequality, which we recall next, shows an exponential rate of concentration around the
mean.
Theorem A.24 (Hoeffding's inequality). Let a, b ∈ R. Let X_1, . . . , X_n be n ∈ N independent real-valued random variables such that a ≤ X_i ≤ b almost surely for all i = 1, . . . , n, and let µ be as in (A.4.2). Then, for every ε > 0

P[ | (1/n) Σ_{j=1}^n X_j − µ | > ε ] ≤ 2 exp( −2 n ε² / (b − a)² ).

A proof can, for example, be found in [212, Section B.4], from which this version is also taken.
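A quick simulation (our own illustration, with arbitrarily chosen parameters) shows that the Hoeffding bound is valid, though typically far from tight:

import numpy as np

rng = np.random.default_rng(1)
n, eps, trials = 100, 0.1, 20_000
a, b = 0.0, 1.0
X = rng.uniform(a, b, size=(trials, n))     # i.i.d. samples, uniform on [a, b], mean 0.5
dev = np.abs(X.mean(axis=1) - 0.5)          # |sample mean - mu|
print("empirical P[dev > eps]:", (dev > eps).mean())
print("Hoeffding bound       :", 2 * np.exp(-2 * n * eps**2 / (b - a) ** 2))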
Finally, we recall the central limit theorem, in its multivariate formulation. We say that (Xj )j∈N
is an i.i.d. sequence of random variables, if the random variables are (pairwise) independent
and identically distributed. For a proof see [117, Theorem 15.58].
Theorem A.25 (Multivariate central limit theorem). Let (Xn )n∈N be an i.i.d. sequence of Rd -
valued random variables, such that E[Xn ] = 0 ∈ Rd and E[Xn,i Xn,j ] = Cij for all i, j = 1, . . . , d.
Let

Y_n := (X_1 + · · · + X_n)/√n.
Then Yn converges weakly to N(0, C) as n → ∞.
Appendix B
Functional analysis
This appendix provides some basic notions and results in functional analysis required in the main
text. It is intended as a revision for a reader already familiar with these concepts. For more details
and proofs, we refer for example to the standard textbooks [195, 196, 41, 77].
Definition B.1. Let K ∈ {R, C}. A vector space (over K) is a set X such that the following
holds:
(i) Properties of addition: For every x, y ∈ X there exists x + y ∈ X such that for all z ∈ X
Moreover, there exists a unique element 0 ∈ X such that x + 0 = x for all x ∈ X and for each
x ∈ X there exists a unique −x ∈ X such that x + (−x) = 0.
(ii) Properties of scalar multiplication: There exists a map (α, x) 7→ αx from K × X to X called
scalar multiplication. It satisfies 1x = x and (αβ)x = α(βx) for all x ∈ X.
If the field is clear from context, we simply refer to X as a vector space. We will primarily consider
the case K = R, and in this case we also say that X is a real vector space.
To introduce a notion of convergence on a vector space X, it needs to be equipped with a
topology, see Definition A.2. A topological vector space is a vector space which is also a
topological space, and in which addition and scalar multiplication are continuous maps. We next
discuss the most important instances of topological vector spaces.
Definition B.2. For a set X, we call a map d_X : X × X → R_+ a metric, if for all x, y, z ∈ X
(i) d_X(x, y) = 0 if and only if x = y,
(ii) d_X(x, y) = d_X(y, x),
(iii) d_X(x, z) ≤ d_X(x, y) + d_X(y, z).
In a metric space (X, d_X), we denote the open ball with center x and radius r > 0 by B_r(x) := {y ∈ X | d_X(x, y) < r}.
Every metric space is naturally equipped with a topology: A set A ⊆ X is open if and only if for
every x ∈ A exists ε > 0 such that Bε (x) ⊆ A. Therefore every metric vector space is a topological
vector space.
Definition B.3. A metric space (X, dX ) is called complete, if every Cauchy sequence with respect
to d converges to an element in X.
For complete metric spaces, an immensely powerful tool is Baire’s category theorem. To state
it, we require the notion of density of sets. Let A, B ⊆ X for a topological space X. Then A is
dense in B if the closure of A, denoted by Ā, satisfies Ā ⊇ B.
Theorem B.4 (Baire's category theorem). Let X be a complete metric space. Then the intersection of every countable collection of dense open subsets of X is dense in X.
Definition B.5. Let X be a vector space over a field K ∈ {R, C}. A map ∥·∥_X : X → [0, ∞) is called a norm if the following hold for all x, y ∈ X and all α ∈ K:
(i) ∥x∥_X = 0 if and only if x = 0,
(ii) ∥αx∥_X = |α| ∥x∥_X,
(iii) ∥x + y∥_X ≤ ∥x∥_X + ∥y∥_X.
We call (X, ∥·∥_X) a normed space and omit ∥·∥_X from the notation if it is clear from the context.
Every norm induces a metric dX and hence a topology via dX (x, y) := ∥x − y∥X . In particular,
every normed vector space is a topological vector space with respect to this topology.
Definition B.6. A normed vector space is called a Banach space if and only if it is complete.
Before presenting the main results on Banach spaces, we collect a couple of important examples.
• Euclidean spaces: Let d ∈ N. Then (Rd , ∥ · ∥) is a Banach space.
• Continuous functions: Let d ∈ N and let K ⊆ Rd be compact. The set of continuous functions
from K to R is denoted by C(K). For α, β ∈ R and f , g ∈ C(K), we define addition and
scalar multiplication by (αf + βg)(x) = αf (x) + βg(x) for all x ∈ K. The vector space C(K)
equipped with the supremum norm
∥f∥_∞ := sup_{x∈K} |f(x)|,
is a Banach space.
• Lebesgue spaces: Let (Ω, A, µ) be a measure space and let 1 ≤ p < ∞. Then the Lebesgue
space Lp (Ω, µ) is defined as the vector space of all equivalence classes of measurable functions
f : Ω → R that coincide µ-almost everywhere and satisfy
∥f∥_{L^p(Ω,µ)} := ( ∫_Ω |f(x)|^p dµ(x) )^{1/p} < ∞.   (B.1.2)
The definition can be extended to complex or Rd -valued functions. In the latter case the
integrand in (B.1.2) is replaced by ∥f (x)∥p . We denote these spaces again by Lp (Ω, µ) with
the precise meaning being clear from context.
• Essentially bounded functions: Let (Ω, A, µ) be a measure space. The Lp spaces can be
extended to p = ∞ by defining the L∞ -norm
∥f∥_{L^∞(Ω,µ)} := inf{ C ≥ 0 | µ({|f| > C}) = 0 }.
This is indeed a norm on the space of equivalence classes of measurable functions from Ω → R
that coincide µ-almost everywhere. Moreover, with this norm, L∞ (Ω, µ) is a Banach space. If
Ω = N and µ is the counting measure, we denote the resulting space by ℓ∞ (N) or simply ℓ∞ .
As in the case p < ∞, it is straightforward to extend the definition to complex or Rd -valued
functions, for which the same notation will be used.
We continue by introducing the concept of dual spaces.
Definition B.7. Let (X, ∥ · ∥X ) be a normed vector space over K ∈ {R, C}. Linear maps from
X → K are called linear functionals. The vector space of all continuous linear functionals on X
is called the (topological) dual space of X and is denoted by X ′ .
Together with the natural addition and scalar multiplication (for all h, g ∈ X ′ , α ∈ K and
x ∈ X)
(h + g)(x) := h(x) + g(x) and (αh)(x) := α(h(x)),
X ′ is a vector space. We equip X ′ with the norm
∥f∥_{X′} := sup_{x∈X, ∥x∥_X = 1} |f(x)|.
The space (X ′ , ∥ · ∥X ′ ) is always a Banach space, even if (X, ∥ · ∥X ) is not complete [196, Theorem
4.1].
The dual space can often be used to characterize the original Banach space. One way in which
the dual space X ′ captures certain algebraic and geometric properties of the Banach space X is
through the Hahn-Banach theorem. In this book, we use one specific variant of this theorem and
its implication for the existence of dual bases, see for instance [196, Theorem 3.5].
An immediate consequence of Theorem B.8 that will be used throughout this book is the
existence of a dual basis. Let X be a Banach space and let (xi )i∈N ⊆ X be such that for all i ∈ N
xi ̸∈ span{xj | j ∈ N, j ̸= i}.
Then, for every i ∈ N, there exists fi ∈ X ′ such that fi (xj ) = 0 if i ̸= j and fi (xi ) = 1.
B.1.4 Hilbert spaces
Often, we require more structure than that provided by normed spaces. An inner product offers
additional tools to compare vectors by introducing notions of angle and orthogonality. For simplicity
we restrict ourselves to real vector spaces in the following.
Definition B.9. Let X be a real vector space. A map ⟨·, ·⟩_X : X × X → R is called an inner product on X if the following hold for all x, y, z ∈ X and all α, β ∈ R:
(i) ⟨x, x⟩_X ≥ 0, with ⟨x, x⟩_X = 0 if and only if x = 0,
(ii) ⟨x, y⟩_X = ⟨y, x⟩_X,
(iii) ⟨αx + βy, z⟩_X = α⟨x, z⟩_X + β⟨y, z⟩_X.
Theorem B.10 (Cauchy-Schwarz inequality). Let X be a vector space with inner product ⟨·, ·⟩X .
Then it holds for all x, y ∈ X that

|⟨x, y⟩_X| ≤ √( ⟨x, x⟩_X ⟨y, y⟩_X ).
Proof. Let x, y ∈ X. If y = 0 then ⟨x, y⟩X = 0 and thus the statement is trivial. Assume in the
following y ̸= 0, so that ⟨y, y⟩X > 0. Using the linearity and symmetry properties it holds for all
α∈R
0 ≤ ⟨x − αy, x − αy⟩_X = ⟨x, x⟩_X − 2α⟨x, y⟩_X + α²⟨y, y⟩_X.

Letting α := ⟨x, y⟩_X / ⟨y, y⟩_X we get

0 ≤ ⟨x, x⟩_X − ⟨x, y⟩_X² / ⟨y, y⟩_X,

which upon rearranging yields the claim.
Definition B.11. Let H be a real vector space with inner product ⟨·, ·⟩H . Then (H, ⟨·, ·⟩H ) is
called a Hilbert space if and only if H is complete with respect to the norm ∥ · ∥H induced by
the inner product.
Definition B.12. Let (H, ⟨·, ·⟩H ) be a Hilbert space and let f , g ∈ H. We say that f and g are
orthogonal if ⟨f, g⟩H = 0, denoted by f ⊥ g. Moreover, for F , G ⊆ H we write F ⊥ G if f ⊥ g
for all f ∈ F , g ∈ G.
For orthogonal vectors, the polar identity immediately implies the Pythagorean theorem.
Theorem B.13 (Pythagorean theorem). Let (H, ⟨·, ·⟩H ) be a Hilbert space, n ∈ N, and let
f1 , . . . , fn ∈ H be pairwise orthogonal vectors. Then,
∥ Σ_{i=1}^n f_i ∥²_H = Σ_{i=1}^n ∥f_i∥²_H.
A final property of Hilbert spaces that we encounter in this book is the existence of unique
projections onto convex sets. For a proof, see for instance [195, Thm. 4.10].
Theorem B.14. Let (H, ⟨·, ·⟩H ) be a Hilbert space and let K ̸= ∅ be a closed convex subset of H.
Then for all h ∈ H there exists a unique k_0 ∈ K such that

∥h − k_0∥_H = inf_{k∈K} ∥h − k∥_H.
B.2 Fourier transform
The Fourier transform is a powerful tool in analysis. It allows one to represent functions as a superposition of frequencies.
It is immediately clear from the definition that ∥f̂∥_{L^∞(R^d)} ≤ ∥f∥_{L^1(R^d)}. As a result, the operator F : f ↦ f̂ is a bounded linear map from L^1(R^d) to L^∞(R^d). We point out that f̂ can take complex values and that the definition is also meaningful for complex-valued functions f.
If fˆ ∈ L1 (Rd ), then we can reverse the process of taking the Fourier transform by taking the
inverse Fourier transform, see [195, Theorem 9.11].
Bibliography
[2] Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural
networks, going beyond two layers. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-
Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems,
volume 32. Curran Associates, Inc., 2019.
[3] M. Anthony and P. L. Bartlett. Neural network learning: theoretical foundations. Cambridge
University Press, Cambridge, 1999.
[4] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee. Understanding deep neural networks with
rectified linear units. In International Conference on Learning Representations, 2018.
[5] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang. On exact computation
with an infinitely wide neural net. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-
Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems,
volume 32. Curran Associates, Inc., 2019.
[6] S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger generalization bounds for deep
nets via a compression approach. In International Conference on Machine Learning, pages
254–263. PMLR, 2018.
[7] M. Baader, M. Mirman, and M. Vechev. Universal approximation with certified networks.
arXiv preprint arXiv:1909.13846, 2019.
[8] S. Barocas, M. Hardt, and A. Narayanan. Fairness and Machine Learning. fairmlbook.org,
2019. http://www.fairmlbook.org.
[9] A. R. Barron. Neural net approximation. In Proc. 7th Yale workshop on adaptive and learning
systems, volume 1, pages 69–72, 1992.
[11] A. R. Barron and J. M. Klusowski. Approximation and estimation for high-dimensional deep
learning networks. arXiv preprint arXiv:1809.03090, 2018.
[12] P. Bartlett. For valid generalization the size of the weights is more important than the size
of the network. Advances in neural information processing systems, 9, 1996.
[13] G. Beliakov. Interpolation of lipschitz functions. Journal of Computational and Applied
Mathematics, 196(1):20–44, 2006.
[14] M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning practice
and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences,
116(32):15849–15854, 2019.
[15] M. Belkin, S. Ma, and S. Mandal. To understand deep learning we need to understand kernel
learning. In International Conference on Machine Learning, pages 541–549. PMLR, 2018.
[16] R. Bellman. On the theory of dynamic programming. Proceedings of the national Academy
of Sciences, 38(8):716–719, 1952.
[17] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent
is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[18] J. Berner, P. Grohs, and A. Jentzen. Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black–Scholes partial differential equations. SIAM Journal on Mathematics of Data Science, 2(3):631–657, 2020.
[19] J. Berner, P. Grohs, G. Kutyniok, and P. Petersen. The modern mathematics of deep learning,
2021.
[20] D. P. Bertsekas. Nonlinear programming. Athena Scientific Optimization and Computation
Series. Athena Scientific, Belmont, MA, third edition, 2016.
[21] H. Bolcskei, P. Grohs, G. Kutyniok, and P. Petersen. Optimal approximation with sparsely
connected deep neural networks. SIAM Journal on Mathematics of Data Science, 1(1):8–45,
2019.
[22] L. Bottou. Stochastic Gradient Descent Tricks, pages 421–436. Springer Berlin Heidelberg,
Berlin, Heidelberg, 2012.
[23] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine
learning. SIAM Review, 60(2):223–311, 2018.
[24] O. Bousquet and A. Elisseeff. Stability and generalization. The Journal of Machine Learning
Research, 2:499–526, 2002.
[25] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, Cambridge,
2004.
[26] J. Braun and M. Griebel. On a constructive proof of Kolmogorov's superposition theorem. Constructive Approximation, 30(3):653–675, Dec 2009.
[27] M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković. Geometric deep learning: Grids,
groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.
[28] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in
neural information processing systems, 33:1877–1901, 2020.
[29] O. Calin. Deep learning architectures. Springer, 2020.
[31] C. Carathéodory. Über den variabilitätsbereich der fourier’schen konstanten von posi-
tiven harmonischen funktionen. Rendiconti del Circolo Matematico di Palermo (1884-1940),
32:193–217, 1911.
[32] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
[33] S. M. Carroll and B. W. Dickinson. Construction of neural nets using the Radon transform. International 1989 Joint Conference on Neural Networks, pages 607–611 vol. 1, 1989.
[35] M. Chen, H. Jiang, W. Liao, and T. Zhao. Efficient approximation of deep ReLU networks for functions on low dimensional manifolds. Advances in Neural Information Processing Systems, 32, 2019.
[37] Y. Cho and L. Saul. Kernel methods for deep learning. In Y. Bengio, D. Schuurmans,
J. Lafferty, C. Williams, and A. Culotta, editors, Advances in Neural Information Processing
Systems, volume 22. Curran Associates, Inc., 2009.
[38] F. Chollet. Deep learning with Python. Simon and Schuster, 2021.
[39] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of
multilayer networks. In Artificial intelligence and statistics, pages 192–204. PMLR, 2015.
[40] C. K. Chui and H. N. Mhaskar. Deep nets for local manifold learning. Frontiers in Applied
Mathematics and Statistics, 4:12, 2018.
[42] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other
Kernel-based Learning Methods. Cambridge University Press, 1 edition, 2000.
[43] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the
American mathematical society, 39(1):1–49, 2002.
[45] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and
attacking the saddle point problem in high-dimensional non-convex optimization. Advances
in neural information processing systems, 27, 2014.
[48] T. De Ryck, S. Lanthaler, and S. Mishra. On the approximation of functions by tanh neural
networks. Neural Networks, 143:732–750, 2021.
[49] A. Défossez, L. Bottou, F. R. Bach, and N. Usunier. A simple convergence proof of adam
and adagrad. Trans. Mach. Learn. Res., 2022, 2022.
[50] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale
hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern
Recognition, pages 248–255, 2009.
[53] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio. Sharp minima can generalize for deep nets.
In International Conference on Machine Learning, pages 1019–1028. PMLR, 2017.
[55] M. Du, F. Yang, N. Zou, and X. Hu. Fairness in deep learning: A computational perspective.
IEEE Intelligent Systems, 36(4):25–34, 2021.
[56] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep
neural networks. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th
International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning
Research, pages 1675–1685. PMLR, 09–15 Jun 2019.
[57] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[58] W. E and Q. Wang. Exponential convergence of the deep neural network approximation for
analytic functions. Sci. China Math., 61(10):1733–1740, 2018.
[59] K. Eckle and J. Schmidt-Hieber. A comparison of deep networks with ReLU activation function
and linear spline-type methods. Neural Networks, 110:232–242, 2019.
[60] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In V. Feldman,
A. Rakhlin, and O. Shamir, editors, 29th Annual Conference on Learning Theory, volume 49
of Proceedings of Machine Learning Research, pages 907–940, Columbia University, New York,
New York, USA, 23–26 Jun 2016. PMLR.
[62] C. L. Frenzen, T. Sasao, and J. T. Butler. On the number of segments needed in a piecewise
linear approximation. Journal of Computational and Applied Mathematics, 234(2):437–446,
2010.
[63] K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks.
Neural Networks, 2(3):183–192, 1989.
[65] G. Garrigos and R. M. Gower. Handbook of convergence theorems for (stochastic) gradient
methods, 2023.
[67] A. Géron. Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools,
and techniques to build intelligent systems. O’Reilly Media, Sebastopol, CA, 2017.
[69] F. Girosi and T. Poggio. Networks and the best approximation property. Biological cyber-
netics, 63(3):169–176, 1990.
[71] L. Gonon and C. Schwab. Deep ReLU network expression rates for option prices in high-
dimensional, exponential Lévy models. Finance and Stochastics, 25(4):615–657, 2021.
[72] I. J. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, Cambridge, MA,
USA, 2016. http://www.deeplearningbook.org.
[73] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples.
In International Conference on Learning Representations (ICLR), 2015.
[75] L.-A. Gottlieb, A. Kontorovich, and R. Krauthgamer. Efficient regression in metric spaces
via approximate Lipschitz extension. IEEE Transactions on Information Theory, 63(8):4838–
4849, 2017.
[76] R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtárik. SGD: General
analysis and improved rates. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of
the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine
Learning Research, pages 5200–5209. PMLR, 09–15 Jun 2019.
[77] K. Gröchenig. Foundations of time-frequency analysis. Springer Science & Business Media,
2013.
[78] P. Grohs and L. Herrmann. Deep neural network approximation for high-dimensional elliptic
PDEs with boundary conditions. IMA Journal of Numerical Analysis, 42(3):2055–2082, 2022.
[79] P. Grohs, F. Hornung, A. Jentzen, and P. Von Wurstemberger. A proof that artificial neural
networks overcome the curse of dimensionality in the numerical approximation of Black–
Scholes partial differential equations, volume 284. American Mathematical Society, 2023.
[80] P. Grohs, F. Hornung, A. Jentzen, and P. von Wurstemberger. A proof that artificial neural
networks overcome the curse of dimensionality in the numerical approximation of Black-
Scholes partial differential equations. Mem. Amer. Math. Soc., 284(1410):v+93, 2023.
[81] I. Gühring and M. Raslan. Approximation rates for neural networks with encodable weights
in smoothness spaces. Neural Networks, 134:107–130, 2021.
[82] B. Hanin and D. Rolnick. Complexity of linear regions in deep networks. In International
Conference on Machine Learning, pages 2596–2604. PMLR, 2019.
[84] S. S. Haykin. Neural networks and learning machines. Pearson Education, Upper Saddle
River, NJ, third edition, 2009.
[85] J. He, L. Li, J. Xu, and C. Zheng. ReLU deep neural networks and linear finite elements. J.
Comput. Math., 38(3):502–527, 2020.
[86] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level
performance on ImageNet classification. Proceedings of the IEEE international conference on
computer vision, 2015.
[87] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–
778, 2016.
[88] M. Hein and M. Andriushchenko. Formal guarantees on the robustness of a classifier against
adversarial manipulation. Advances in neural information processing systems, 30, 2017.
[89] H. Heuser. Lehrbuch der Analysis. Teil 1. Vieweg + Teubner, Wiesbaden, revised edition,
2009.
[90] G. Hinton. Divide the gradient by a running average of its recent magnitude. https://www.
cs.toronto.edu/~hinton/coursera/lecture6/lec6e.mp4, 2012. Lecture 6e.
[92] S. Hochreiter and J. Schmidhuber. Flat minima. Neural computation, 9(1):1–42, 1997.
[93] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–
1780, 1997.
[95] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal
approximators. Neural Networks, 2(5):359–366, 1989.
[96] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected con-
volutional networks. Proceedings of the IEEE conference on computer vision and pattern
recognition, 1(2):3, 2017.
[97] G.-B. Huang and H. A. Babri. Upper bounds on the number of hidden neurons in feedforward
networks with arbitrary bounded nonlinear activation functions. IEEE Transactions on Neural
Networks, 9(1):224–229, 1998.
[99] T. Huster, C.-Y. J. Chiang, and R. Chadha. Limitations of the Lipschitz constant as a defense
against adversarial examples. In ECML PKDD 2018 Workshops: Nemesis 2018, UrbReas
2018, SoGood 2018, IWAISe 2018, and Green Data Mining 2018, Dublin, Ireland, September
10-14, 2018, Proceedings 18, pages 16–29. Springer, 2019.
[100] M. Hutzenthaler, A. Jentzen, T. Kruse, and T. A. Nguyen. A proof that rectified deep neural
networks overcome the curse of dimensionality in the numerical approximation of semilinear
heat equations. SN partial differential equations and applications, 1(2):10, 2020.
[101] J. Håstad. Computational limitations of small depth circuits. PhD thesis, Massachusetts
Institute of Technology, Department of Mathematics, 1987.
[102] D. J. Im, M. Tao, and K. Branson. An empirical analysis of deep network loss surfaces. 2016.
[103] V. E. Ismailov. Ridge functions and applications in neural networks, volume 263. American
Mathematical Society, 2021.
[104] V. E. Ismailov. A three layer neural network can represent any multivariate function. Journal
of Mathematical Analysis and Applications, 523(1):127096, 2023.
[105] Y. Ito and K. Saito. Superposition of linearly independent functions and finite mappings by
neural networks. The Mathematical Scientist, 21(1):27, 1996.
[106] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization
in neural networks. Advances in neural information processing systems, 31, 2018.
[107] A. Jentzen, B. Kuckuck, and P. von Wurstemberger. Mathematical introduction to deep
learning: methods, implementations, and theory. arXiv preprint arXiv:2310.20360, 2023.
[108] A. Jentzen and A. Riekert. On the existence of global minima and convergence analy-
ses for gradient descent methods in the training of deep neural networks. arXiv preprint
arXiv:2112.09684, 2021.
[109] A. Jentzen, D. Salimova, and T. Welti. A proof that deep artificial neural networks overcome
the curse of dimensionality in the numerical approximation of Kolmogorov partial differen-
tial equations with constant diffusion and nonlinear drift coefficients. Commun. Math. Sci.,
19(5):1167–1205, 2021.
[110] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvu-
nakool, R. Bates, A. Žídek, A. Potapenko, et al. Highly accurate protein structure prediction
with alphafold. Nature, 596(7873):583–589, 2021.
[111] P. C. Kainen, V. Kurkova, and A. Vogt. Approximation by neural networks is not continuous.
Neurocomputing, 29(1-3):47–56, 1999.
[112] P. C. Kainen, V. Kurkova, and A. Vogt. Continuity of approximation by neural networks in
$L^p$ spaces. Annals of Operations Research, 101:143–147, 2001.
[113] P. C. Kainen, V. Kurkova, and A. Vogt. Best approximation by linear combinations of
characteristic functions of half-spaces. Journal of Approximation Theory, 122(2):151–159,
2003.
[114] H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient
methods under the Polyak-Łojasiewicz condition. In P. Frasconi, N. Landwehr, G. Manco,
and J. Vreeken, editors, Machine Learning and Knowledge Discovery in Databases, pages
795–811, Cham, 2016. Springer International Publishing.
[115] C. Karner, V. Kazeev, and P. C. Petersen. Limitations of gradient descent due to numerical
instability of backpropagation. arXiv preprint arXiv:2210.00805, 2022.
[116] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd Interna-
tional Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings.
International Conference on Learning Representations, ICLR, 2015.
[117] A. Klenke. Wahrscheinlichkeitstheorie. Springer, 2006.
[118] M. Kohler, A. Krzyżak, and S. Langer. Estimation of a function of low local dimensionality
by deep neural networks. IEEE Transactions on Information Theory, 68(6):4032–4042, 2022.
[119] M. Kohler and S. Langer. On the rate of convergence of fully connected deep neural network
regression estimates. The Annals of Statistics, 49(4):2231–2249, 2021.
[120] A. N. Kolmogorov. On the representation of continuous functions of many variables by
superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR,
114:953–956, 1957.
[121] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional
neural networks. In Advances in neural information processing systems, pages 1097–1105,
2012.
[122] G. Kutyniok, P. Petersen, M. Raslan, and R. Schneider. A theoretical analysis of deep neural
networks and parametric PDEs. Constructive Approximation, 55(1):73–125, 2022.
[123] V. Kůrková. Kolmogorov’s theorem is relevant. Neural Computation, 3(4):617–622, 1991.
[124] V. Kůrková. Kolmogorov’s theorem and multilayer neural networks. Neural Networks,
5(3):501–506, 1992.
[125] F. Laakmann and P. Petersen. Efficient approximation of solutions of parametric linear
transport equations by ReLU DNNs. Advances in Computational Mathematics, 47(1):11, 2021.
[126] G. Lan. First-order and Stochastic Optimization Methods for Machine Learning. Springer
Series in the Data Sciences. Springer International Publishing, Cham, first edition, 2020.
[127] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
[128] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D.
Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation,
1(4):541–551, 1989.
[129] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient BackProp, pages 9–48.
Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[130] J. Lee, J. Sohl-dickstein, J. Pennington, R. Novak, S. Schoenholz, and Y. Bahri. Deep neural
networks as Gaussian processes. In International Conference on Learning Representations,
2018.
[131] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide
neural networks of any depth evolve as linear models under gradient descent. In H. Wallach,
H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in
Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[132] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a
nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–
867, 1993.
[133] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via
integral quadratic constraints. SIAM J. Optim., 26(1):57–95, 2016.
[134] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the loss landscape of neural
nets. Advances in neural information processing systems, 31, 2018.
[135] W. Li. Generalization error of minimum weighted norm and kernel interpolation. SIAM
Journal on Mathematics of Data Science, 3(1):414–438, 2021.
[136] M. Longo, J. A. Opschoor, N. Disch, C. Schwab, and J. Zech. De Rham compatible deep
neural network FEM. Neural Networks, 165:721–739, 2023.
[137] C. Ma, S. Wojtowytsch, L. Wu, et al. Towards a mathematical understanding of neu-
ral network-based machine learning: what we know and what we don’t. arXiv preprint
arXiv:2009.10713, 2020.
[138] C. Ma, L. Wu, et al. A priori estimates of the population risk for two-layer neural networks.
arXiv preprint arXiv:1810.06397, 2018.
[139] S. Mahan, E. J. King, and A. Cloninger. Nonclosedness of sets of neural networks in Sobolev
spaces. Neural Networks, 137:85–96, 2021.
[140] V. Maiorov and A. Pinkus. Lower bounds for approximation by MLP neural networks.
Neurocomputing, 25(1):81–91, 1999.
[141] Y. Marzouk, Z. Ren, S. Wang, and J. Zech. Distribution learning via neural differential
equations: a nonparametric statistical perspective. Journal of Machine Learning Research
(accepted), 2024.
[142] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity.
The bulletin of mathematical biophysics, 5:115–133, 1943.
[143] S. Mei and A. Montanari. The generalization error of random features regression: Precise
asymptotics and the double descent curve. Communications on Pure and Applied Mathemat-
ics, 75(4):667–766, 2022.
[145] H. N. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions.
Neural computation, 8(1):164–177, 1996.
[148] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT Press,
2018.
[150] H. Montanelli and Q. Du. New error bounds for deep ReLU networks using sparse grids. SIAM
Journal on Mathematics of Data Science, 1(1):78–92, 2019.
[151] H. Montanelli and H. Yang. Error bounds for deep ReLU networks using the Kolmogorov–Arnold
superposition theorem. Neural Networks, 129:1–6, 2020.
[152] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep
neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger,
editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates,
Inc., 2014.
[153] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial pertur-
bations. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 1765–1773, 2017.
[154] E. Moulines and F. Bach. Non-asymptotic analysis of stochastic approximation algorithms for
machine learning. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger,
editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates,
Inc., 2011.
[155] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-
based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201, 2001.
[156] R. Nakada and M. Imaizumi. Adaptive approximation and generalization of deep neural
network with intrinsic dimensionality. Journal of Machine Learning Research, 21(174):1–38,
2020.
[157] R. M. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.
[159] Y. Nesterov. Lectures on convex optimization, volume 137 of Springer Optimization and Its
Applications. Springer, Cham, second edition, 2018.
[160] Y. E. Nesterov. A method for solving the convex programming problem with convergence
rate $O(1/k^2)$. Dokl. Akad. Nauk SSSR, 269(3):543–547, 1983.
[161] B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks.
In Conference on learning theory, pages 1376–1401. PMLR, 2015.
[163] J. Nocedal and S. J. Wright. Numerical optimization. Springer Series in Operations Research
and Financial Engineering. Springer, New York, second edition, 2006.
[165] B. O’Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Found.
Comput. Math., 15(3):715–732, 2015.
[166] J. A. A. Opschoor, C. Schwab, and J. Zech. Exponential ReLU DNN expression of holomor-
phic maps in high dimension. Constructive Approximation, 2021.
[167] J. A. A. Opschoor, C. Schwab, and J. Zech. Deep learning in high dimension: ReLU neural
network expression for Bayesian PDE inversion. In Optimization and control for partial
differential equations—uncertainty quantification, open and closed-loop control, and shape
optimization, volume 29 of Radon Ser. Comput. Appl. Math., pages 419–462. De Gruyter,
Berlin, 2022.
[168] P. Oswald. On the degree of nonlinear spline approximation in Besov-Sobolev spaces. J.
Approx. Theory, 61(2):131–157, 1990.
[170] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-
box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference
on computer and communications security, pages 506–519, 2017.
[171] Y. C. Pati and P. S. Krishnaprasad. Analysis and synthesis of feedforward neural networks us-
ing discrete affine wavelet transformations. IEEE Transactions on Neural Networks, 4(1):73–
85, 1993.
[172] J. Pennington and Y. Bahri. Geometry of neural network loss surfaces via random matrix
theory. In International Conference on Machine Learning, pages 2798–2806. PMLR, 2017.
[173] P. Petersen, M. Raslan, and F. Voigtlaender. Topological properties of the set of func-
tions generated by neural networks of fixed size. Foundations of computational mathematics,
21:375–444, 2021.
[174] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using
deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
[176] A. Pinkus. Approximation theory of the MLP model in neural networks. In Acta numerica,
1999, volume 8 of Acta Numer., pages 143–195. Cambridge Univ. Press, Cambridge, 1999.
[177] G. Pisier. Remarques sur un résultat non publié de B. Maurey. Séminaire Analyse fonction-
nelle (dit “Maurey–Schwartz”), 1980–1981.
[178] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao. Why and when can deep-but
not shallow-networks avoid the curse of dimensionality: a review. Int. J. Autom. Comput.,
14(5):503–519, 2017.
[179] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for predictivity in
learning theory. Nature, 428(6981):419–422, 2004.
[180] B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[183] N. Qian. On the momentum term in gradient descent learning algorithms. Neural Networks,
12(1):145–151, 1999.
[184] Q. Nguyen, M. C. Mukkamala, and M. Hein. On the loss landscape of a class of
deep neural networks with no bad local valleys. In International Conference on Learning
Representations (ICLR), 2018.
[186] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In J. Platt,
D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing
Systems, volume 20. Curran Associates, Inc., 2007.
[188] C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. Adaptive
computation and machine learning. MIT Press, 2006.
[189] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, Q. Le, and A. Kurakin. Regularized
evolution for image classifier architecture search. Proceedings of the AAAI Conference on
Artificial Intelligence, 33:4780–4789, 2019.
[190] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In 6th
International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada,
April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.
[191] H. Robbins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical
Statistics, 22(3):400–407, 1951.
[192] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organiza-
tion in the brain. Psychological Review, 65(6):386–408, 1958.
[193] W. Ruan, X. Yi, and X. Huang. Adversarial robustness of deep learning: Theory, algorithms,
and applications. In Proceedings of the 30th ACM international conference on information
& knowledge management, pages 4866–4869, 2021.
[195] W. Rudin. Real and complex analysis. McGraw-Hill Book Co., New York, third edition, 1987.
[196] W. Rudin. Functional analysis. International Series in Pure and Applied Mathematics.
McGraw-Hill, Inc., New York, second edition, 1991.
[198] T. De Ryck and S. Mishra. Error analysis for deep neural network approximations of paramet-
ric hyperbolic conservation laws. Mathematics of Computation, 2023. Article electronically
published on December 15, 2023.
[199] I. Safran and O. Shamir. Depth separation in ReLU networks for approximating smooth
nonlinear functions. arXiv preprint arXiv:1610.09887, 2016.
[200] M. A. Sartori and P. J. Antsaklis. A simple method to derive bounds on the size and to train
multilayer neural networks. IEEE Transactions on Neural Networks, 2(4):467–471, 1991.
[201] R. Scheichl and J. Zech. Numerical methods for Bayesian inverse problems, 2021. Lecture
Notes.
[202] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117,
2015.
[204] J. Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation
function. 2020.
[206] B. Schölkopf and A. J. Smola. Learning with kernels : support vector machines, regularization,
optimization, and beyond. Adaptive computation and machine learning. MIT Press, 2002.
[207] L. Schumaker. Spline Functions: Basic Theory. Cambridge Mathematical Library. Cambridge
University Press, 3 edition, 2007.
[208] C. Schwab and J. Zech. Deep learning in high dimension: neural network expression rates for
generalized polynomial chaos expansions in UQ. Anal. Appl. (Singap.), 17(1):19–55, 2019.
[209] C. Schwab and J. Zech. Deep learning in high dimension: neural network expression rates for
analytic functions in $L^2(\mathbb{R}^d, \gamma_d)$. SIAM/ASA J. Uncertain. Quantif., 11(1):199–
234, 2023.
[210] T. Serra, C. Tjandraatmadja, and S. Ramalingam. Bounding and counting linear regions of
deep neural networks, 2018.
[211] U. Shaham, A. Cloninger, and R. R. Coifman. Provable approximation properties for deep
neural networks. Applied and Computational Harmonic Analysis, 44(3):537–557, 2018.
[213] J. W. Siegel and J. Xu. High-order approximation rates for shallow neural networks with
cosine and ReLU$^k$ activation functions. Applied and Computational Harmonic Analysis, 58:1–
26, 2022.
[214] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser,
I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep
neural networks and tree search. Nature, 529(7587):484–489, 2016.
[215] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. In ICLR, 2014.
[216] E. M. Stein. Singular integrals and differentiability properties of functions. Princeton Math-
ematical Series, No. 30. Princeton University Press, Princeton, N.J., 1970.
[217] I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.
[218] D. Stutz, M. Hein, and B. Schiele. Disentangling adversarial robustness and generalization.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 6976–6987, 2019.
[219] A. Sukharev. Optimal method of constructing best uniform approximations for functions of
a certain class. USSR Computational Mathematics and Mathematical Physics, 18(2):21–31,
1978.
[220] T. Sun, L. Qiao, and D. Li. Nonergodic complexity of proximal inertial gradient descents.
IEEE Trans. Neural Netw. Learn. Syst., 32(10):4613–4626, 2021.
[221] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and
momentum in deep learning. In S. Dasgupta and D. McAllester, editors, Proceedings of the
30th International Conference on Machine Learning, volume 28 of Proceedings of Machine
Learning Research, pages 1139–1147, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
[224] M. Tan and Q. V. Le. EfficientNet: Rethinking model scaling for convolutional neural net-
works. In Proceedings of the 36th International Conference on Machine Learning, pages
6105–6114, 2019.
[225] J. Tarela and M. Martínez. Region configurations for realizability of lattice piecewise-linear
models. Mathematical and Computer Modelling, 30(11):17–27, 1999.
[226] J. M. Tarela, E. Alonso, and M. V. Martínez. A representation method for PWL functions
oriented to parallel processing. Math. Comput. Modelling, 13(10):75–83, 1990.
[228] M. Telgarsky. Benefits of depth in neural networks. In V. Feldman, A. Rakhlin, and O. Shamir,
editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine
Learning Research, pages 1517–1539, Columbia University, New York, New York, USA, 23–26
Jun 2016. PMLR.
[230] V. M. Tikhomirov. ε-entropy and ε-capacity of sets in functional spaces. Selected Works
of A. N. Kolmogorov, Volume III: Information Theory and the Theory of Algorithms, pages
86–170, 1993.
[235] L. Venturi, A. S. Bandeira, and J. Bruna. Spurious valleys in one-hidden-layer neural network
optimization landscapes. Journal of Machine Learning Research, 20:133, 2019.
[237] S. Wang and X. Sun. Generalization of hinging hyperplanes. IEEE Transactions on Infor-
mation Theory, 51(12):4425–4431, 2005.
[238] Z. Wang, A. Albarghouthi, G. Prakriya, and S. Jha. Interval universal approximation for
neural networks. Proceedings of the ACM on Programming Languages, 6(POPL):1–29, 2022.
[239] E. Weinan, C. Ma, and L. Wu. Barron spaces and the compositional function spaces for
neural network models. arXiv preprint arXiv:1906.08039, 2019.
[240] E. Weinan and S. Wojtowytsch. Representation formulas and pointwise properties for Barron
functions. Calculus of Variations and Partial Differential Equations, 61(2):46, 2022.
[242] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive
gradient methods in machine learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wal-
lach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information
Processing Systems, volume 30. Curran Associates, Inc., 2017.
[243] C. Xiao, J.-Y. Zhu, B. Li, W. He, M. Liu, and D. Song. Spatially transformed adversarial
examples. arXiv preprint arXiv:1801.02612, 2018.
[244] H. Xu and S. Mannor. Robustness and generalization. Machine learning, 86:391–423, 2012.
[245] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw.,
94:103–114, 2017.
[246] D. Yarotsky and A. Zhevnerchuk. The phase diagram of approximation rates for deep neural
networks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors,
Advances in Neural Information Processing Systems, volume 33, pages 13005–13015. Curran
Associates, Inc., 2020.
[248] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer. Scaling vision transformers. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–
12113, 2022.