Deep Learning Math
April 8, 2025
Contents
1 Introduction
1.1 Mathematics of deep learning
1.2 High-level overview of deep learning
1.3 Why does it work?
1.4 Outline and philosophy
1.5 Material not covered in this book
3 Universal approximation
3.1 A universal approximation theorem
3.2 Superexpressive activations and Kolmogorov’s superposition theorem
4 Splines
4.1 B-splines and smooth functions
4.2 Reapproximation of B-splines with sigmoidal activations
8 High-dimensional approximation
8.1 The Barron class
8.2 Functions with compositionality structure
8.3 Functions on manifolds
9 Interpolation
9.1 Universal interpolation
9.2 Optimal interpolation and reconstruction
15 Generalization in the overparameterized regime
15.1 The double descent phenomenon
15.2 Size of weights
15.3 Theoretical justification
Preface
This book serves as an introduction to the key ideas in the mathematical analysis of deep learning.
It is designed to help students and researchers to quickly familiarize themselves with the area and to
provide a foundation for the development of university courses on the mathematics of deep learning.
Our main goal in the composition of this book was to present various rigorous, but easy to grasp,
results that help to build an understanding of fundamental mathematical concepts in deep learning.
To achieve this, we prioritize simplicity over generality.
As a mathematical introduction to deep learning, this book does not aim to give an exhaustive
survey of the entire (and rapidly growing) field, and some important research directions are missing.
In particular, we have favored mathematical results over empirical research, even though an accurate
account of the theory of deep learning requires both.
The book is intended for students and researchers in mathematics and related areas. While we
believe that every diligent researcher or student will be able to work through this manuscript, it
should be emphasized that a familiarity with analysis, linear algebra, probability theory, and basic
functional analysis is recommended for an optimal reading experience. To assist readers, a review
of key concepts in probability theory and functional analysis is provided in the appendix.
The material is structured around the three main pillars of deep learning theory: Approximation
theory, Optimization theory, and Statistical Learning theory. This structure, which corresponds
to the three error terms typically occurring in the theoretical analysis of deep learning models, is inspired by other recent texts on the topic following the same outline [213, 271, 132]. More specifically, Chapter 1 provides an overview and introduces key questions for understanding deep learning.
Chapters 2 - 9 explore results in approximation theory, Chapters 10 - 13 discuss optimization the-
ory for deep learning, and the remaining Chapters 14 - 16 address the statistical aspects of deep
learning.
This book is the result of a series of lectures given by the authors. Parts of the material were
presented by P.P. in a lecture titled “Neural Network Theory” at the University of Vienna, and by
J.Z. in a lecture titled “Theory of Deep Learning” at Heidelberg University. The lecture notes of
these courses formed the basis of this book. We are grateful to the many colleagues and students
who contributed to this book through insightful discussions and valuable suggestions. We would
like to offer special thanks to the following individuals:
Jonathan Garcia Rebellon, Jakob Lanser, Andrés Felipe Lerma Pineda, Marvin Koß, Martin
Mauser, Davide Modesto, Martina Neuman, Bruno Perreaux, Johannes Asmus Petersen, Milutin
Popovic, Tuan Quach, Tim Rakowski, Lorenz Riess, Jakob Fabian Rohner, Jonas Schuhmann,
Peter Školnı́k, Matej Vedak, Simon Weissmann, Josephine Westermann, Ashia Wilson.
Notation
Symbol : Description (Reference)

Φ^id_L : identity ReLU neural network (Lemma 5.1)
1_S : indicator function of the set S
⟨·, ·⟩ : Euclidean inner product on R^d
⟨·, ·⟩_H : inner product on a vector space H (Definition B.11)
k_T : maximal number of elements shared by a single node of a triangulation (5.3.2)
K̂_n(x, x′) : empirical tangent kernel (11.3.4)
Λ_{A,σ,S,L} : loss landscape defining function (Definition 12.2)
Lip(f) : Lipschitz constant of a function f (9.2.1)
Lip_M(Ω) : M-Lipschitz continuous functions on Ω (9.2.4)
L : general loss function (Section 14.1)
L_{0−1} : 0-1 loss (Section 14.1)
L_ce : binary cross entropy loss (Section 14.1)
L_2 : square loss (Section 14.1)
ℓ^p(N) : space of p-summable sequences indexed over N (Section B.2.3)
L^p(Ω) : Lebesgue space over Ω (Section B.2.3)
M : piecewise continuous and locally bounded functions (Definition 3.1.1)
N_d^m(σ; L, n) : set of multilayer perceptrons with d-dim input, m-dim output, activation function σ, depth L, and width n (Definition 3.6)
N_d^m(σ; L) : union of N_d^m(σ; L, n) for all n ∈ N (Definition 3.6)
N(σ; A, B) : set of neural networks with architecture A, activation function σ and all weights bounded in modulus by B (Definition 12.1)
N*(σ, A, B) : neural networks in N(σ; A, B) with range in [−1, 1] (14.5.1)
N : positive natural numbers
N_0 : natural numbers including 0
N(m, C) : multivariate normal distribution with mean m ∈ R^d and covariance C ∈ R^{d×d}
n_A : number of parameters of a neural network with layer widths described by A (Definition 12.1)
∥·∥ : Euclidean norm for vectors in R^d and spectral norm for matrices in R^{n×d}
∥·∥_F : Frobenius norm for matrices
∥·∥_∞ : ∞-norm on R^d or supremum norm for functions
∥·∥_p : p-norm on R^d
∥·∥_X : norm on a vector space X
0 : zero vector or zero matrix
O(·) : Landau notation
ω(η) : patch of the node η (5.3.5)
Ω_Λ(c) : sublevel set of loss landscape (Definition 12.3)
∂f(x) : set of subgradients of f at x (Definition 10.19)
P_n(R^d) or P_n : space of multivariate polynomials of degree n on R^d (Example 3.5)
P(R^d) or P : space of multivariate polynomials of arbitrary degree on R^d (Example 3.5)
P_X : distribution of random variable X (Definition A.10)
P[A] : probability of event A (Definition A.5)
P[A|B] : conditional probability of event A given B (Definition A.3.2)
PN(A, B) : parameter set of neural networks with architecture A and all weights bounded in modulus by B (Definition 12.1)
Pieces(f, Ω) : number of pieces of f on Ω (Definition 6.1)
Φ(x) : model (e.g. neural network) in terms of input x (parameter dependence suppressed)
Φ(x, w) : model (e.g. neural network) in terms of input x and parameters w
Φ^lin : linearization around initialization (11.3.1)
Φ^min_n : minimum neural network (Lemma 5.11)
Φ^×_ε : multiplication neural network (Lemma 7.3)
Φ^×_{n,ε} : multiplication of n numbers neural network (Proposition 7.4)
Φ_2 ∘ Φ_1 : composition of neural networks (Lemma 5.2)
Φ_2 • Φ_1 : sparse composition of neural networks (Lemma 5.2)
(Φ_1, . . . , Φ_m) : parallelization of neural networks (5.1.1)
A† : pseudoinverse of a matrix A
Q : rational numbers
R : real numbers
R_− : non-positive real numbers
R_+ : non-negative real numbers
R_σ : realization map (Definition 12.1)
R* : Bayes risk (14.1.1)
R(h) : risk of hypothesis h (Definition 14.2)
R̂_S(h) : empirical risk of h for sample S ((1.2.3), Definition 14.4)
S_n : cardinal B-spline (Definition 4.1)
S^d_{ℓ,t,n} : multivariate cardinal B-spline (Definition 4.2)
|S| : cardinality of an arbitrary set S, or Lebesgue measure of S ⊆ R^d
S̊ : interior of a set S
S̄ : closure of a set S
∂S : boundary of a set S
S^c : complement of a set S
S^⊥ : orthogonal complement of a set S (Definition B.15)
σ : general activation function
σ_a : parametric ReLU activation function (Section 2.3)
σ_ReLU : ReLU activation function (Section 2.3)
sign : sign function
s_max(A) : maximal singular value of a matrix A
s_min(A) : minimal (positive) singular value of a matrix A
size(Φ) : number of free network parameters in Φ (Definition 2.4)
span(S) : linear hull or span of S
T : triangulation (Definition 5.13)
V : set of nodes in a triangulation (Definition 5.13)
V[X] : variance of random variable X (Section A.2.2)
VCdim(H) : VC dimension of a set of functions H (Definition 14.16)
W : distribution of weight initialization (Section 11.6.1)
W^(ℓ), b^(ℓ) : weights and biases in layer ℓ of a neural network (Definition 2.1)
width(Φ) : width of Φ (Definition 2.1)
x^(ℓ) : output of ℓ-th layer of a neural network (Definition 2.1)
x̄^(ℓ) : preactivations (10.5.3)
X′ : dual space to a normed space X (Definition B.9)
Chapter 1
Introduction
Figure 1.1: Illustration of a single neuron ν. The neuron receives six inputs (x_1, . . . , x_6) = x, computes their weighted sum ∑_{j=1}^6 x_j w_j, adds a bias b, and finally applies the activation function σ to produce the output ν(x).
Deep Neural Networks  Deep neural networks are formed by a combination of neurons. A neuron is a function of the form illustrated in Figure 1.1, i.e., x ↦ σ(∑_j x_j w_j + b). Composing many such neurons arranged in layers yields a deep neural network, which is a function of the form

R^d ∋ x ↦ Φ(x) = T_{L+1} ∘ σ ∘ T_L ∘ ⋯ ∘ T_1 ∘ σ ∘ T_0(x),

where L ∈ N, (T_j)_{j=0}^{L+1} are affine transformations, and σ is applied componentwise. The number of compositions L is referred to as the number of layers of the deep neural network. Similar to a single neuron, (deep) neural networks can be viewed as a parameterized function class, with the parameters being the entries of the matrices and vectors determining the affine transformations (T_j)_{j=0}^{L+1}.
Figure 1.2: Illustration of a shallow neural network. The affine transformation T0 is of the form
(x1 , . . . , x6 ) = x 7→ W x + b, where the rows of W are the weight vectors w1 , w2 , w3 for each
respective neuron.
Gradient-based training After defining the structure or architecture of the neural network,
e.g., the activation function and the number of layers, the second step of deep learning consists
of determining suitable values for its parameters. In practice this is achieved by minimizing an
objective function. In supervised learning, which will be our focus, this objective depends
on a collection of input-output pairs, commonly known as training data or simply as a sample.
Concretely, let S = (x_i, y_i)_{i=1}^m be a sample, where x_i ∈ R^d represents the inputs and y_i ∈ R^k the corresponding outputs with d, k ∈ N. Our goal is to find a deep neural network Φ such that

Φ(x_i) ≈ y_i  for all i = 1, . . . , m  (1.2.2)

in a meaningful sense. For example, we could interpret “≈” to mean closeness with respect to the Euclidean norm, or more generally, that L(Φ(x_i), y_i) is small for a function L measuring the
dissimilarity between its inputs. Such a function L is called a loss function. A standard way of
achieving (1.2.2) is by minimizing the so-called empirical risk of Φ with respect to the sample S, defined as

R̂_S(Φ) = (1/m) ∑_{i=1}^m L(Φ(x_i), y_i).  (1.2.3)
This quantity serves as a measure of how well Φ predicts y i at the training points x1 , . . . , xm .
If L is differentiable, and for all xi the output Φ(xi ) depends differentiably on the parameters
of the neural network, then the gradient of the empirical risk R̂_S(Φ) with respect to the parameters is well-defined. This gradient can be efficiently computed using a technique called backpropagation. This makes it possible to minimize (1.2.3) with optimization algorithms such as (stochastic) gradient descent. These algorithms produce a sequence of neural network parameters, and corresponding neural network functions Φ_1, Φ_2, . . . , for which the empirical risk is expected to decrease. Figure 1.3 illustrates a possible behavior of this sequence.
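To make this concrete, here is a minimal NumPy sketch (all architectural and numerical choices are illustrative, not taken from the text) that fits a small one-hidden-layer network to a toy sample by gradient descent on the empirical risk (1.2.3) with the square loss, with the gradients written out by hand in place of a backpropagation library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sample S = (x_i, y_i)_{i=1}^m with d = k = 1.
m = 50
x = rng.uniform(-1.0, 1.0, size=(m, 1))
y = np.sin(3 * x)                        # target input-output relationship

# Shallow network Phi(x) = W1 @ relu(W0 x + b0) + b1.
n = 32                                   # width (illustrative)
W0, b0 = rng.normal(size=(n, 1)), np.zeros(n)
W1, b1 = rng.normal(size=(1, n)) / n, np.zeros(1)

def phi(x):
    return np.maximum(x @ W0.T + b0, 0.0) @ W1.T + b1

lr = 0.05
for step in range(2000):
    z = x @ W0.T + b0                    # pre-activations
    a = np.maximum(z, 0.0)               # hidden layer
    out = a @ W1.T + b1
    r = out - y                          # residual of the square loss
    # Gradients of the empirical risk (1/m) * sum_i (Phi(x_i) - y_i)^2,
    # computed by hand for this two-layer case.
    gW1 = 2.0 / m * r.T @ a
    gb1 = 2.0 / m * r.sum(axis=0)
    da = 2.0 / m * r @ W1                # gradient w.r.t. hidden activations
    dz = da * (z > 0)                    # ReLU derivative
    gW0 = dz.T @ x
    gb0 = dz.sum(axis=0)
    W1, b1, W0, b0 = W1 - lr * gW1, b1 - lr * gb1, W0 - lr * gW0, b0 - lr * gb0

print("empirical risk:", np.mean((phi(x) - y) ** 2))
```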
Prediction The final part of deep learning concerns the question of whether we have actually
learned something by the procedure above. Suppose that our optimization routine has either
converged or has been terminated, yielding a neural network Φ∗ . While the optimization aimed
to minimize the empirical risk on the training sample S, our ultimate interest is not in how well
Φ∗ performs on S. Rather, we are interested in its performance on new data points (xnew , y new )
outside of S.
To make meaningful statements about this, we assume existence of a data distribution D on
the input-output space—in our case, this is Rd × Rk —such that both the elements of S and all
other data points are drawn from this distribution. In other words, we treat S as an i.i.d. draw
from D, and (xnew , y new ) also as sampled independently from D. If we want Φ∗ to perform well on
average, then this amounts to controlling the following expression

E_{(x,y)∼D}[L(Φ*(x), y)],

which is called the risk of Φ*. If the risk is not much larger than the empirical risk, then we say
that the neural network Φ∗ has a small generalization error. On the other hand, if the risk is
much larger than the empirical risk, then we say that Φ∗ overfits the training data, meaning that
Φ∗ has memorized the training samples, but does not generalize well to data outside of the training
set.
Figure 1.3: A sequence of one-dimensional neural networks Φ_1, . . . , Φ_4 that successively minimizes the empirical risk for the sample S = (x_i, y_i)_{i=1}^6.
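The gap between empirical risk and risk is easy to observe numerically. The sketch below (data distribution, model class, and all constants are invented for illustration) fits a very flexible model to a small noisy sample and estimates its risk by Monte Carlo on fresh draws from the same distribution; the large gap is exactly the overfitting behavior described above.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(m):
    # Data distribution D: x uniform on [-1, 1], y = x^2 plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=m)
    return x, x**2 + 0.1 * rng.normal(size=m)

# Small training sample and a very flexible model (a degree-14 polynomial fit,
# standing in for a heavily overparameterized network).
x_tr, y_tr = sample(15)
coeffs = np.polyfit(x_tr, y_tr, deg=14)
model = lambda x: np.polyval(coeffs, x)

sq_loss = lambda yhat, y: (yhat - y) ** 2
emp_risk = sq_loss(model(x_tr), y_tr).mean()     # empirical risk on S

x_new, y_new = sample(100_000)                   # fresh draws from D
risk = sq_loss(model(x_new), y_new).mean()       # Monte Carlo estimate of the risk

print(f"empirical risk {emp_risk:.4f}  vs.  estimated risk {risk:.4f}")
```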
1.3 Why does it work?
It is natural to wonder why the deep learning pipeline, as outlined in the previous subsection,
ultimately succeeds in learning, i.e., achieving a small risk. Is it true that for a given sample (x_i, y_i)_{i=1}^m there exists a neural network Φ such that Φ(x_i) ≈ y_i for all i = 1, . . . , m? Does the
optimization routine produce a meaningful result? Can we control the risk, knowing only that the
empirical risk is small?
While most of these questions can be answered affirmatively under certain assumptions, these
assumptions often do not apply to deep learning in practice. We next explore some potential
explanations and show that they lead to even more questions.
Approximation A fundamental result in the study of neural networks is the so-called universal
approximation theorem, which will be discussed in Chapter 3. This result states that every con-
tinuous function on a compact domain can be approximated arbitrarily well (in a uniform sense)
by a shallow neural network.
This result, however, does not address the practically relevant question of efficiency. For exam-
ple, if we aim for computational efficiency, then we may be interested in identifying the smallest
neural network that fits the data. This naturally raises the question: What is the role of the
architecture for the expressive capabilities of neural networks? Furthermore, viewing empirical
risk minimization as an approximation problem, we are confronted with a central challenge in ap-
proximation theory: the curse of dimensionality. Function approximation in high dimensions is
notoriously difficult and becomes exponentially harder as the dimensionality increases. Yet, many
successful deep learning architectures operate in this high-dimensional regime. Why do these neural
networks appear to overcome this so-called curse?
Optimization While gradient descent can sometimes be proven to converge to a global minimum,
as we will discuss in Chapter 10, this typically requires the objective function to be at least convex.
However, there is no reason to believe that for example the empirical risk is a convex function of
the network parameters. In fact, due to the repeatedly occurring compositions with the nonlinear
activation function in the network, the empirical risk is typically highly non-linear and not convex.
Therefore, there is generally no guarantee that the optimization routine will converge to a global
minimum, and it may get stuck in a local (and non-global) minimum or a saddle point. Why is the
output of the optimization nonetheless often meaningful in practice?
Generalization In traditional statistical learning theory, which we will review in Chapter 14,
the extent to which the risk exceeds the empirical risk can be bounded a priori; such bounds are
often expressed in terms of a notion of complexity of the set of admissible functions (the class of
neural networks) divided by the number of training samples. For the class of neural networks of a
fixed architecture, the complexity roughly amounts to the number of neural network parameters.
In practice, typically neural networks with more parameters than training samples are used. This
is dubbed the overparameterized regime. In this regime, the classical estimates described above are
void.
Why is it that, nonetheless, deep overparameterized architectures are capable of making accu-
rate predictions on unseen data? Furthermore, while deep architectures often generalize well, they
sometimes fail spectacularly on specific, carefully crafted examples. In image classification tasks,
these examples may differ only slightly from correctly classified images in a way that is not per-
ceptible to the human eye. Such examples are known as adversarial examples, and their existence
poses a great challenge for applications of deep learning.
we prove substantially better approximation rates than we obtained for shallow neural networks.
This adds again to our understanding of depth and its connections to expressive power of neural
network architectures.
Chapter 8: High-dimensional approximation. The convergence rates established in the
previous chapters deteriorate significantly in high-dimensional settings. This chapter examines
three scenarios under which neural networks can provably overcome the curse of dimensionality.
Chapter 9: Interpolation. In this chapter we shift our perspective from approximation to
exact interpolation of the training data. We analyze conditions under which exact interpolation is
possible, and discuss the implications for empirical risk minimization. Furthermore, we present a
constructive proof showing that ReLU networks can express an optimal interpolant of the data (in
a specific sense).
Chapter 10: Training of neural networks. We start to examine the training process
of deep learning. First, we study the fundamentals of (stochastic) gradient descent and convex
optimization. Additionally, we examine accelerated methods and highlight the key principles behind
popular training algorithms such as Adam. Finally, we discuss how the backpropagation algorithm
can be used to implement these optimization algorithms for training neural networks.
Chapter 11: Wide neural networks and the neural tangent kernel. This chapter
introduces the neural tangent kernel as a tool for analyzing the training behavior of neural networks.
We begin by revisiting linear and kernel regression for the approximation of functions based on
data. Additionally we discuss the effect of adding a regularization term to the objective function.
Afterwards, we show for certain architectures of sufficient width, that the training dynamics of
gradient descent resemble those of kernel regression and converge to a global minimum. This
analysis provides insights into why, under certain conditions, we can train neural networks without
getting stuck in (bad) local minima, despite the non-convexity of the objective function. Finally, we
discuss a well-known link between neural networks and Gaussian processes, giving some indication
why overparameterized networks do not necessarily overfit in practice.
Chapter 12: Loss landscape analysis. In this chapter, we present an alternative view on the
optimization problem, by analyzing the loss landscape—the empirical risk as a function of the neural
network parameters. We give theoretical arguments showing that increasing overparameterization
leads to greater connectivity between the valleys and basins of the loss landscape. Consequently,
overparameterized architectures make it easier to reach a region where all minima are global minima.
Additionally, we observe that most stationary points associated with non-global minima are saddle
points. This sheds further light on the empirically observed fact that deep architectures can often
be optimized without getting stuck in non-global minima.
Chapter 13: Shape of neural network spaces. While Chapters 11 and 12 highlight
potential reasons for the success of neural network training, in this chapter, we show that the set
of neural networks of a fixed architecture has some undesirable properties from an optimization
perspective. Specifically, we show that this set is typically non-convex. Moreover, in general it does
not possess the best-approximation property, meaning that there might not exist a neural network
within the set yielding the best approximation for a given function.
Chapter 14: Generalization properties of deep neural networks. To understand
why deep neural networks successfully generalize to unseen data points (outside of the training
set), we study classical statistical learning theory, with a focus on neural network functions as the
hypothesis class. We then show how to establish generalization bounds for deep learning, providing
theoretical insights into the performance on unseen data.
Chapter 15: Generalization in the overparameterized regime. The generalization
bounds of the previous chapter are not meaningful when the number of parameters of a neural net-
work surpasses the number of training samples. However, this overparameterized regime is where
many successful network architectures operate. To gain a deeper understanding of generalization
in this regime, we describe the phenomenon of double descent and present a potential explana-
tion. This addresses the question of why deep neural networks perform well despite being highly
overparameterized.
Chapter 16: Robustness and adversarial examples. In the final chapter, we explore
the existence of adversarial examples—inputs designed to deceive neural networks. We provide
some theoretical explanations of why adversarial examples arise, and discuss potential strategies to
prevent them.
model suggests or rejects a particular configuration can help engineers identify potential vulnerabil-
ities, ultimately leading to safer and more efficient designs. Ethically, transparent decision-making
is crucial, especially when the outcomes have significant consequences for individuals or society; bi-
ases present in the data or model design can lead to discriminatory outcomes, making explainability
essential.
However, explaining the predictions of deep neural networks is not straightforward. Despite
knowledge of the network weights and biases, the repeated and complex interplay of linear trans-
formations and non-linear activation functions often renders these models black boxes. A compre-
hensive overview of various techniques for interpretability, not only for deep neural networks, can
be found in [179]. Regarding the topic of fairness, we refer for instance to [72, 11].
Implementation: While this book focuses on provable theoretical results, the field of deep
learning is strongly driven by applications, and a thorough understanding of deep learning cannot
be achieved without practical experience. For this, there exist numerous resources with excellent
explanations. We recommend [87, 51, 218] as well as the countless online tutorials that are just a
Google (or alternative) search away.
Many more: The field is evolving rapidly, and new ideas are constantly being generated
and tested. This book cannot give a complete overview. However, we hope that it provides the
reader with a solid foundation in the fundamental knowledge and principles to quickly grasp and
understand new developments in the field.
Chapter 2
Feedforward neural networks, henceforth simply referred to as neural networks (NNs), constitute
the central object of study of this book. In this chapter, we provide a formal definition of neural
networks, discuss the size of a neural network, and give a brief overview of common activation
functions.
x^(0) := x,  (2.1.1a)
x^(ℓ) := σ(W^(ℓ−1) x^(ℓ−1) + b^(ℓ−1))  for ℓ ∈ {1, . . . , L},  (2.1.1b)
x^(L+1) := W^(L) x^(L) + b^(L)  (2.1.1c)

holds. We call L the depth, d_max = max_{ℓ=1,...,L} d_ℓ the width, σ the activation function, and (σ; d_0, . . . , d_{L+1}) the architecture of the neural network Φ. Moreover, W^(ℓ) ∈ R^{d_{ℓ+1}×d_ℓ} are the weight matrices and b^(ℓ) ∈ R^{d_{ℓ+1}} the bias vectors of Φ for ℓ = 0, . . . , L.
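The recursion (2.1.1) translates directly into code. The following sketch evaluates Φ for given weight matrices and bias vectors; the concrete architecture, random weights, and choice of activation are placeholders.

```python
import numpy as np

def forward(x, weights, biases, sigma):
    """Evaluate Phi(x) following (2.1.1): x^(0) = x, then
    x^(l) = sigma(W^(l-1) x^(l-1) + b^(l-1)) for l = 1, ..., L,
    and x^(L+1) = W^(L) x^(L) + b^(L) without activation in the last step."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = sigma(W @ x + b)                 # hidden layers (2.1.1b)
    return weights[-1] @ x + biases[-1]      # output layer (2.1.1c)

# Example: architecture (ReLU; d0, d1, d2) = (ReLU; 3, 4, 2), i.e. depth L = 1.
rng = np.random.default_rng(0)
dims = [3, 4, 2]
Ws = [rng.normal(size=(dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
bs = [np.zeros(dims[l + 1]) for l in range(len(dims) - 1)]
relu = lambda z: np.maximum(z, 0.0)
print(forward(rng.normal(size=3), Ws, bs, relu))
```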
Remark 2.2. Typically, there exist different choices of architectures, weights, and biases yielding
the same function Φ : Rd0 → RdL+1 . For this reason we cannot associate a unique meaning to these
notions solely based on the function realized by Φ. In the following, when we refer to the properties
of a neural network Φ, it is always understood to mean that there exists at least one construction
as in Definition 2.1, which realizes the function Φ and uses parameters that satisfy those properties.
The architecture of a neural network is often depicted as a connected graph, as illustrated in
Figure 2.1. The nodes in such graphs represent (the output of) the neurons. They are arranged in
layers, with x(ℓ) in Definition 2.1 corresponding to the neurons in layer ℓ. We also refer to x(0) in
(2.1.1a) as the input layer and to x(L+1) in (2.1.1c) as the output layer. All layers in between
are referred to as the hidden layers and their output is given by (2.1.1b). The number of hidden
layers corresponds to the depth. For the correct interpretation of such graphs, we note that by our
conventions in Definition 2.1, the activation function is applied after each affine transformation,
except in the final layer.
Neural networks of depth one are called shallow; if the depth is larger than one, they are called deep. The notion of deep neural networks is not used entirely consistently in the literature, and
some authors use the word deep only in case the depth is much larger than one, where the precise
meaning of “much larger” depends on the application.
Throughout, we only consider neural networks in the sense of Definition 2.1. We emphasize
however, that this is just one (simple but very common) type of neural network. Many adjustments
to this construction are possible and also widely used. For example:
• We may use different activation functions σℓ in each layer ℓ or we may even use a different
activation function for each node.
• Residual neural networks allow “skip connections” [112]. This means that information is
allowed to skip layers in the sense that the nodes in layer ℓ may have x(0) , . . . , x(ℓ−1) as their
input (and not just x(ℓ−1) ), cf. (2.1.1).
Let us clarify some further common terminology used in the context of neural networks:
• parameters: The parameters of a neural network refer to the set of all entries of the weight
matrices and bias vectors. These are often collected in a single vector
These parameters are adjustable and are learned during the training process, determining the
specific function realized by the network.
• hyperparameters: Hyperparameters are settings that define the network’s architecture (and
training process), but are not directly learned during training. Examples include the depth,
the number of neurons in each layer, and the choice of activation function. They are typically
set before training begins.
• weights: The term “weights” is often used broadly to refer to all parameters of a neural
network, including both the weight matrices and bias vectors.
Figure 2.1: Sketch of a neural network with three hidden layers, and d0 = 3, d1 = 4, d2 = 3, d3 = 4,
d4 = 2. The neural network has depth three and width four.
• model: For a fixed architecture, every choice of network parameters w as in (2.1.2) defines
a specific function x 7→ Φw (x). In deep learning this function is often referred to as a model.
More generally, “model” can be used to describe any function parameterized by a set of parameters w ∈ R^n, n ∈ N.
(ii) if d_0^1 = d_0^2 =: d_0 and L_1 = L_2 =: L, then there exists a neural network Φ_parallel with architecture (σ; d_0, d_1^1 + d_1^2, . . . , d_{L+1}^1 + d_{L+1}^2) such that

(iii) if d_0^1 = d_0^2 =: d_0, L_1 = L_2 =: L, and d_{L+1}^1 = d_{L+1}^2 =: d_{L+1}, then there exists a neural network Φ_sum with architecture (σ; d_0, d_1^1 + d_1^2, . . . , d_L^1 + d_L^2, d_{L+1}) such that

(iv) if d_{L_1+1}^1 = d_0^2, then there exists a neural network Φ_comp with architecture (σ; d_0^1, d_1^1, . . . , d_{L_1}^1, d_1^2, . . . , d_{L_2+1}^2) such that

Φ_comp(x) = Φ_2 ∘ Φ_1(x)  for all x ∈ R^{d_0^1}.
• weight sharing: This is a technique where specific entries of the weight matrices (or bias vectors) are constrained to be equal. Formally, this means imposing conditions of the form W^(i)_{k,l} = W^(j)_{s,t}, i.e., the entry (k, l) of the ith weight matrix is equal to the entry at position (s, t) of weight matrix j. We denote this assumption by (i, k, l) ∼ (j, s, t), paying tribute to the trivial fact that “∼” is an equivalence relation. During training, shared weights are updated jointly, meaning that any change to one weight is simultaneously applied to all other weights of this class. Weight sharing can also be applied to the entries of bias vectors.
• sparsity: This refers to imposing a sparsity structure on the weight matrices (or bias vectors). Specifically, we a priori set W^(i)_{k,l} = 0 for certain (k, l, i), i.e., we impose entry (k, l) of the ith weight matrix to be 0. These zero-valued entries are considered fixed, and are not adjusted during training. The condition W^(i)_{k,l} = 0 corresponds to node l of layer i − 1 not serving as an input to node k in layer i. If we represent the neural network as a graph, this is indicated by not connecting the corresponding nodes. Sparsity can also be imposed on the bias vectors.
Both of these restrictions decrease the number of learnable parameters in the neural network. The
number of parameters can be seen as a measure of the complexity of the represented function class.
For this reason, we introduce size(Φ) as a notion for the number of learnable parameters. Formally
(with |S| denoting the cardinality of a set S):
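Since Definition 2.4 itself is not reproduced in this excerpt, the following sketch only illustrates one plausible counting convention for size(Φ) under sparsity and weight sharing: fixed zeros are not counted, and each class of shared weights is counted once.

```python
import numpy as np

def count_parameters(weight_shapes, zero_entries=(), shared_groups=()):
    """Count learnable parameters of a network with the given weight-matrix shapes.
    `zero_entries` lists (i, k, l) positions fixed to zero; `shared_groups` lists
    groups of (i, k, l) positions constrained to be equal (each group counts once).
    Bias vectors are ignored for brevity. This is one plausible counting
    convention, not a verbatim transcription of Definition 2.4."""
    total = sum(r * c for r, c in weight_shapes)
    total -= len(set(zero_entries))                        # fixed zeros are not learned
    total -= sum(len(set(g)) - 1 for g in shared_groups)   # each shared group counts once
    return total

# Two weight matrices of shapes 4x3 and 2x4 (12 + 8 = 20 entries).
shapes = [(4, 3), (2, 4)]
print(count_parameters(shapes))                                      # 20
print(count_parameters(shapes, zero_entries=[(0, 0, 1), (0, 0, 2)]))  # 18
print(count_parameters(shapes, shared_groups=[[(0, 0, 0), (1, 0, 0), (1, 1, 1)]]))  # 18
```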
2.3 Activation functions
Activation functions are a crucial part of neural networks, as they introduce nonlinearity into the
model. If an affine activation function were used, the resulting neural network function would also
be affine and hence very restricted in what it can represent.
The choice of activation function can have a significant impact on the performance, but there
does not seem to be a universally optimal one. We next discuss a few important activation functions
and highlight some common issues associated with them.
[Figure 2.2: activation functions. Panel (b) shows the ReLU and the SiLU, panel (c) the parametric ReLU for a = 0.05, 0.1, and 0.2.]
and depicted in Figure 2.2 (b). It is piecewise linear, and due to its simplicity its evaluation is
computationally very efficient. It is one of the most popular activation functions in practice. Since
its derivative is always zero or one, it does not suffer from the vanishing gradient problem to the
same extent as the sigmoid function. However, ReLU can suffer from the so-called dead neurons
problem. Consider the neural network
depending on the bias b ∈ R. If b < 0, then Φ(x) = 0 for all x ∈ R. The neuron corresponding to the second application of σ_ReLU thus produces a constant signal. Moreover, if b < 0, then (d/db)Φ(x) = 0 for all x ∈ R. As a result, every negative value of b yields a stationary point of the empirical risk. A gradient-based method will not be able to further train the parameter b. We thus refer to this neuron as a dead neuron.
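The effect can be reproduced numerically. Since the displayed example network is not shown above, the sketch uses a hypothetical stand-in with the same qualitative behavior, Φ(x) = σ_ReLU(−σ_ReLU(x) + b): for every b < 0 the output is identically zero and the derivative with respect to b vanishes, so a gradient step cannot revive the neuron.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# Hypothetical dead-neuron example (not the book's exact formula):
# Phi(x) = relu(-relu(x) + b). For b < 0 the inner argument is always negative,
# so Phi is identically zero and the output neuron is "dead".
def phi(x, b):
    return relu(-relu(x) + b)

x = np.linspace(-3, 3, 7)
b = -0.5
print(phi(x, b))                                  # all zeros

# Central finite-difference derivative d/db Phi(x): also identically zero,
# so a gradient step on b leaves the network unchanged.
eps = 1e-6
print((phi(x, b + eps) - phi(x, b - eps)) / (2 * eps))
```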
SiLU (Sigmoid Linear Unit): An important difference between the ReLU and the Sigmoid is
that the ReLU is not differentiable at 0. The SiLU activation function (also referred to as “swish”)
can be interpreted as a smooth approximation to the ReLU. It is defined as

σ_SiLU(x) := x σ_sig(x) = x / (1 + e^{−x})  for x ∈ R,
and is depicted in Figure 2.2 (b). There exist various other smooth activation functions that
mimic the ReLU, including the Softplus x 7→ log(1 + exp(x)), the GELU (Gaussian Error Linear
Unit) x 7→ xF (x) where F (x) denotes the cumulative distribution function of the standard normal
distribution, and the Mish x 7→ x tanh(log(1 + exp(x))).
Parametric ReLU or Leaky ReLU: This variant of the ReLU addresses the dead neuron
problem. For some a ∈ (0, 1), the parametric ReLU is defined as

σ_a(x) := max{x, ax}  for x ∈ R,

and is depicted in Figure 2.2 (c) for three different values of a. Since the output of σ_a does not
have flat regions like the ReLU, the dying ReLU problem is mitigated. If a is not chosen too small,
then there is less of a vanishing gradient problem than for the Sigmoid. In practice, the additional
parameter a has to be fine-tuned depending on the application. Like the ReLU, the parametric
ReLU is not differentiable at 0.
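For reference, the activations discussed in this section can be written in a few lines of NumPy; the sigmoid formula σ_sig(x) = 1/(1 + e^{−x}) is assumed here, since its definition falls outside the excerpt above.

```python
import numpy as np
from scipy.stats import norm

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def silu(x):                    # "swish": x * sigmoid(x) = x / (1 + exp(-x))
    return x * sigmoid(x)

def softplus(x):
    return np.log1p(np.exp(x))

def gelu(x):                    # x * F(x), F the standard normal CDF
    return x * norm.cdf(x)

def mish(x):
    return x * np.tanh(softplus(x))

def parametric_relu(x, a=0.1):  # x for x >= 0, a*x for x < 0
    return np.where(x >= 0.0, x, a * x)

xs = np.linspace(-4.0, 4.0, 9)
print(silu(xs))
print(parametric_relu(xs, a=0.05))
```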
Exercises
Exercise 2.5. Prove Proposition 2.3.
Exercise 2.6. In this exercise, we show that ReLU and parametric ReLU create similar sets of
neural network functions. Fix a > 0.
(i) Find a set of weight matrices and bias vectors, such that the associated neural network Φ1 ,
with the ReLU activation function σReLU satisfies Φ1 (x) = σa (x) for all x ∈ R.
(ii) Find a set of weight matrices and bias vectors, such that the associated neural network Φ2 ,
with the parametric ReLU activation function σa satisfies Φ2 (x) = σReLU (x) for all x ∈ R.
(iii) Conclude that every ReLU neural network can be expressed as a leaky ReLU neural network
and vice versa.
Exercise 2.7. Let d ∈ N, and let Φ1 be a neural network with the ReLU as activation function,
input dimension d, and output dimension 1. Moreover, let Φ2 be a neural network with the sigmoid
activation function, input dimension d, and output dimension 1. Show that, if Φ1 = Φ2 , then Φ1 is
a constant function.
Exercise 2.8. In this exercise, we show that for the sigmoid activation functions, dead-neuron-like
behavior is very rare. Let Φ be a neural network with the sigmoid activation function. Assume
that Φ is a constant function. Show that for every ε > 0 there is a non-constant neural network Φ̃ with the same architecture as Φ such that for all ℓ = 0, . . . , L,

∥W^(ℓ) − W̃^(ℓ)∥ ≤ ε  and  ∥b^(ℓ) − b̃^(ℓ)∥ ≤ ε,

where W^(ℓ), b^(ℓ) are the weights and biases of Φ and W̃^(ℓ), b̃^(ℓ) are the weights and biases of Φ̃.
Show that such a statement does not hold for ReLU neural networks. What about leaky ReLU?
Chapter 3
Universal approximation
After introducing neural networks in Chapter 2, it is natural to inquire about their capabilities.
Specifically, we might wonder if there exist inherent limitations to the type of functions a neural
network can represent. Could there be a class of functions that neural networks cannot approx-
imate? If so, it would suggest that neural networks are specialized tools, similar to how linear
regression is suited for linear relationships, but not for data with nonlinear relationships.
In this chapter, primarily following [159], we will show that this is not the case, and neural
networks are indeed a universal tool. More precisely, given sufficiently large and complex archi-
tectures, they can approximate almost every sensible input-output relationship. We will formalize
and prove this claim in the subsequent sections.
Throughout what follows, we always consider C 0 (Rd ) equipped with the topology of Defini-
tion 3.1 (also see Exercise 3.22), and every subset such as C 0 (D) with the subspace topology:
for example, if D ⊆ Rd is bounded, then convergence in C 0 (D) refers to uniform convergence
limn→∞ supx∈D |fn (x) − f (x)| = 0.
Definition 3.2. Let d ∈ N. A set of functions H from Rd to R is a universal approximator (of
C 0 (Rd )), if for every ε > 0, every compact K ⊆ Rd , and every f ∈ C 0 (Rd ), there exists g ∈ H such
that supx∈K |f (x) − g(x)| < ε.
For a set of (not necessarily continuous) functions H mapping between R^d and R, we denote by H^cc its closure with respect to compact convergence.
The relationship between a universal approximator and the closure with respect to compact
convergence is established in the proposition below.
Proof. Suppose that H is a universal approximator and fix f ∈ C 0 (Rd ). For n ∈ N, define Kn :=
[−n, n]d ⊆ Rd . Then for every n ∈ N there exists fn ∈ H such that supx∈Kn |fn (x) − f (x)| < 1/n.
Since for every compact K ⊆ R^d there exists n_0 such that K ⊆ K_n for all n ≥ n_0, it holds that f_n converges compactly to f.
The “only if” part of the assertion is trivial.
A key tool to show that a set is a universal approximator is the Stone-Weierstrass theorem, see
for instance [233, Sec. 5.7].
(c) H is an algebra of functions, i.e., H is closed under addition, multiplication and scalar mul-
tiplication.
3.1.2 Shallow neural networks
With the necessary formalism established, we can now show that shallow neural networks of ar-
bitrary width form a universal approximator under certain (mild) conditions on the activation
function. The results in this section are based on [159], and for the proofs we follow the arguments
in that paper.
We first introduce notation for the set of all functions realized by certain architectures.
Definition 3.6. Let d, m, L, n ∈ N and σ : R → R. The set of all functions realized by neural
networks with d-dimensional input, m-dimensional output, depth at most L, width at most n, and
activation function σ is denoted by
Furthermore,

N_d^m(σ; L) := ⋃_{n∈N} N_d^m(σ; L, n).
In the sequel, we require the activation function σ to belong to the set of piecewise continuous
and locally bounded functions
M := {σ ∈ L^∞_loc(R) | there exist intervals I_1, . . . , I_M partitioning R s.t. σ ∈ C^0(I_j) for all j = 1, . . . , M}.  (3.1.1)
Here, M ∈ N is finite, and the intervals Ij are understood to have positive (possibly infinite)
Lebesgue measure, i.e. Ij is not allowed to be empty or a single point. Hence, σ is a piecewise
continuous function, and it has discontinuities at at most finitely many points.
Example 3.7. Activation functions belonging to M include, in particular, all continuous non-
polynomial functions, which in turn includes all practically relevant activation functions such as
the ReLU, the SiLU, and the Sigmoid discussed in Section 2.3. In these cases, we can choose M = 1
and I1 = R. Discontinuous functions include for example the Heaviside function x 7→ 1x>0 (also
called a “perceptron” in this context) but also x 7→ 1x>0 sin(1/x): Both belong to M with M = 2,
I1 = (−∞, 0] and I2 = (0, ∞). We exclude for example the function x 7→ 1/x, which is not locally
bounded. ♢
The rest of this subsection is dedicated to proving the following theorem that has now already
been announced repeatedly.
Theorem 3.8. Let d ∈ N and σ ∈ M. Then Nd1 (σ; 1) is a universal approximator of C 0 (Rd ) if
and only if σ is not a polynomial.
Remark 3.9. We will see in Corollary 3.18 and Exercise 3.26 that neural networks can also arbitrarily
well approximate non-continuous functions with respect to suitable norms.
The universal approximation theorem by Leshno, Lin, Pinkus and Schocken [159]—of which
Theorem 3.8 is a special case—is even formulated for a much larger set M, which allows for
activation functions that have discontinuities at a (possibly non-finite) set of Lebesgue measure
zero. Instead of proving the theorem in this generality, we resort to the simpler case stated above.
This allows us to avoid some technicalities, but the main ideas remain the same. The proof strategy
is to verify the following three claims:
(i) if C^0(R^1) ⊆ N_1^1(σ; 1)^cc then C^0(R^d) ⊆ N_d^1(σ; 1)^cc,

(ii) if σ ∈ C^∞(R) is not a polynomial then C^0(R^1) ⊆ N_1^1(σ; 1)^cc,

(iii) if σ ∈ M is not a polynomial then there exists σ̃ ∈ C^∞(R) ∩ N_1^1(σ; 1)^cc which is not a polynomial.

Upon observing that σ̃ ∈ N_1^1(σ; 1)^cc implies N_1^1(σ̃; 1)^cc ⊆ N_1^1(σ; 1)^cc, it is easy to see that these statements together with Proposition 3.3 establish the implication “⇐” asserted in Theorem 3.8.
The reverse direction is straightforward to check and will be the content of Exercise 3.23.
We start with a more general version of (i) and reduce the problem to the one dimensional case
following [165, Theorem 2.1].
Lemma 3.10. Assume that H is a universal approximator of C^0(R). Then for every d ∈ N,

span{x ↦ g(w · x) | w ∈ R^d, g ∈ H}

is a universal approximator of C^0(R^d).

Proof. We claim that
H_k ⊆ span{R^d ∋ x ↦ g(w · x) | w ∈ R^d, g ∈ H}^cc =: X  (3.1.2)

for all k ∈ N_0, where H_k denotes the space of homogeneous polynomials of degree k, i.e. H_k = span{x ↦ x^α | α ∈ N_0^d, |α| = k}. This implies that all multivariate polynomials belong to X. An application of the
Stone-Weierstrass theorem (cp. Example 3.5) and Proposition 3.3 then conclude the proof.
For every α, β ∈ N_0^d with |α| = |β| = k, it holds D^β x^α = δ_{β,α} α!, where α! := ∏_{j=1}^d α_j! and δ_{β,α} = 1 if β = α and δ_{β,α} = 0 otherwise. Hence, since {x ↦ x^α | |α| = k} is a basis of H_k, the set {D^α | |α| = k} is a basis of its topological dual H_k′. Thus each linear functional l ∈ H_k′ allows
the representation l = p(D) for some p ∈ Hk (here D stands for the differential).
By the multinomial formula,

(w · x)^k = (∑_{j=1}^d w_j x_j)^k = ∑_{α∈N_0^d, |α|=k} (k!/α!) w^α x^α.
Therefore, we have that (x ↦ (w · x)^k) ∈ H_k. Moreover, for every l = p(D) ∈ H_k′ and all w ∈ R^d we have that

l(x ↦ (w · x)^k) = p(D)(x ↦ (w · x)^k) = k! p(w).

Hence, if l(x ↦ (w · x)^k) = p(D)(x ↦ (w · x)^k) = 0 for all w ∈ R^d, then p ≡ 0 and thus l ≡ 0.
This implies span{x 7→ (w · x)k | w ∈ Rd } = Hk . Indeed, if there exists h ∈ Hk which is not
in span{x 7→ (w · x)k | w ∈ Rd }, then by the theorem of Hahn-Banach (see Theorem B.10), there
exists a non-zero functional in H′k vanishing on span{x 7→ (w · x)k | w ∈ Rd }. This contradicts the
previous observation.
By the universality of H it is not hard to see that x 7→ (w · x)k ∈ X for all w ∈ Rd . Therefore,
we have Hk ⊆ X for all k ∈ N0 .
By the above lemma, in order to verify that Nd1 (σ; 1) is a universal approximator, it suffices to
show that N11 (σ; 1) is a universal approximator. We first show that this is the case for sigmoidal
activations.
For sigmoidal activation functions we can now conclude the universality in the univariate case.
Lemma 3.12. Let σ : R → R be monotonically increasing and sigmoidal. Then C^0(R) ⊆ N_1^1(σ; 1)^cc.
We prove Lemma 3.12 in Exercise 3.24. Lemma 3.10 and Lemma 3.12 show Theorem 3.8 in
the special case where σ is monotonically increasing and sigmoidal. For the general case, let us
continue with (ii) and consider C ∞ activations.
Lemma 3.13. If σ ∈ C ∞ (R) and σ is not a polynomial, then N11 (σ; 1) is dense in C 0 (R).
Proof. Denote X := N_1^1(σ; 1)^cc. We show again that all polynomials belong to X. An application
of the Stone-Weierstrass theorem then gives the statement.
Fix b ∈ R and denote f_x(w) := σ(wx + b) for all x, w ∈ R. By Taylor's theorem, for h ≠ 0,

(f_x(w + h) − f_x(w))/h = (σ((w + h)x + b) − σ(wx + b))/h = x σ′(wx + b) + (h/2) x² σ′′(ξ x + b)

for some ξ = ξ(h) between w and w + h. Note that the left-hand side belongs to N_1^1(σ; 1) as a
function of x. Since σ ′′ ∈ C 0 (R), for every compact set K ⊆ R
sup_{x∈K} sup_{|h|≤1} |x² σ′′(ξ(h)x + b)| ≤ sup_{x∈K} sup_{η∈[w−1,w+1]} |x² σ′′(ηx + b)| < ∞.
Finally, we come to the proof of (iii)—the claim that there exists at least one non-polynomial
C ∞ (R) function in the closure of N11 (σ; 1). The argument is split into two lemmata. Denote in
the following by Cc∞ (R) the set of compactly supported C ∞ (R) functions, and for two functions f ,
g : R → R let

f ∗ g(x) := ∫_R f(x − y) g(y) dy  for all x ∈ R  (3.1.4)
be the convolution of f and g.
Lemma 3.14. Let σ ∈ M. Then for each φ ∈ C_c^∞(R) it holds σ ∗ φ ∈ N_1^1(σ; 1)^cc.
Proof. Fix φ ∈ Cc∞ (R) and let a > 0 such that supp φ ⊆ [−a, a]. Denote yj := −a + 2aj/n for
j = 0, . . . , n and define for x ∈ R
f_n(x) := (2a/n) ∑_{j=0}^{n−1} σ(x − y_j) φ(y_j).

Clearly, f_n ∈ N_1^1(σ; 1). We will show that f_n converges compactly to σ ∗ φ as n → ∞. To do so we verify uniform convergence
of fn towards σ ∗ φ on the interval [−b, b] with b > 0 arbitrary but fixed.
For x ∈ [−b, b]
|σ ∗ φ(x) − f_n(x)| ≤ ∑_{j=0}^{n−1} ∫_{y_j}^{y_{j+1}} |σ(x − y)φ(y) − σ(x − y_j)φ(y_j)| dy.  (3.1.5)
Fix ε ∈ (0, 1). Since σ ∈ M, there exist z_1, . . . , z_M ∈ R such that σ is continuous on R\{z_1, . . . , z_M} (cp. (3.1.1)). With D_ε := ⋃_{j=1}^M (z_j − ε, z_j + ε), observe that σ is uniformly continuous on the compact
set Kε := [−a − b, a + b] ∩ Dεc . Now let Jc ∪ Jd = {0, . . . , n − 1} be a partition (depending on x),
such that j ∈ Jc if and only if [x − yj+1 , x − yj ] ⊆ Kε . Hence, j ∈ Jd implies the existence of
i ∈ {1, . . . , M } such that the distance of zi to [x − yj+1 , x − yj ] is at most ε. Due to the interval
[x − y_{j+1}, x − y_j] having length 2a/n, we can bound

∑_{j∈J_d} (y_{j+1} − y_j) = |⋃_{j∈J_d} [x − y_{j+1}, x − y_j]| ≤ |⋃_{i=1}^M [z_i − ε − 2a/n, z_i + ε + 2a/n]| ≤ M (2ε + 4a/n),
where |A| denotes the Lebesgue measure of a measurable set A ⊆ R. Next, because of the local
boundedness of σ and the fact that φ ∈ Cc∞ , it holds sup|y|≤a+b |σ(y)| + sup|y|≤a |φ(y)| =: γ < ∞.
Hence

|σ ∗ φ(x) − f_n(x)| ≤ ∑_{j∈J_c∪J_d} ∫_{y_j}^{y_{j+1}} |σ(x − y)φ(y) − σ(x − y_j)φ(y_j)| dy
≤ 2γ² M (2ε + 4a/n) + 2a sup_{j∈J_c} max_{y∈[y_j, y_{j+1}]} |σ(x − y)φ(y) − σ(x − y_j)φ(y_j)|.  (3.1.6)
Finally, uniform continuity of σ on Kε and φ on [−a, a] imply that the last term tends to 0 as
n → ∞ uniformly for all x ∈ [−b, b]. This shows that there exist C < ∞ (independent of ε and x)
and nε ∈ N (independent of x) such that the term in (3.1.6) is bounded by Cε for all n ≥ nε . Since
ε was arbitrary, this yields the claim.
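The Riemann-type sum f_n from the proof is easy to test numerically. The sketch below uses the ReLU for σ ∈ M and a smooth bump for φ (both illustrative choices) and compares f_n with a fine-grid quadrature of σ ∗ φ on a compact interval; the uniform error decreases as n grows.

```python
import numpy as np

sigma = lambda z: np.maximum(z, 0.0)              # an activation in M (here: ReLU)
a = 1.0                                           # supp(phi) contained in [-a, a]

def phi(y):                                       # smooth bump supported in (-1, 1)
    y = np.asarray(y, dtype=float)
    out = np.zeros_like(y)
    inside = np.abs(y) < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - y[inside] ** 2))
    return out

def f_n(x, n):                                    # f_n(x) = (2a/n) sum_j sigma(x - y_j) phi(y_j)
    y = -a + 2.0 * a * np.arange(n) / n
    return (2.0 * a / n) * np.sum(sigma(x[:, None] - y[None, :]) * phi(y), axis=1)

def conv(x, grid=20000):                          # Riemann-sum reference for (sigma * phi)(x)
    y = np.linspace(-a, a, grid)
    dy = y[1] - y[0]
    return dy * np.sum(sigma(x[:, None] - y[None, :]) * phi(y), axis=1)

x = np.linspace(-2.0, 2.0, 101)                   # uniform error on a compact interval
for n in (10, 100, 1000):
    print(n, np.max(np.abs(f_n(x, n) - conv(x))))
```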
Lemma 3.15. If σ ∈ M and σ ∗ φ is a polynomial for all φ ∈ Cc∞ (R), then σ is a polynomial.
Proof. Fix −∞ < a < b < ∞ and consider Cc∞ (a, b) := {φ ∈ C ∞ (R) | supp φ ⊆ [a, b]}. Define a
metric ρ on Cc∞ (a, b) via
ρ(φ, ψ) := ∑_{j∈N_0} 2^{−j} |φ − ψ|_{C^j(a,b)} / (1 + |φ − ψ|_{C^j(a,b)}),
where |φ|_{C^j(a,b)} := sup_{x∈[a,b]} |φ^{(j)}(x)|. Since the space of j times differentiable functions on [a, b] is complete with respect to the norm ∑_{i=0}^j |·|_{C^i(a,b)}, see for instance [114, Satz 104.3], the space C_c^∞(a, b) is complete with the metric ρ.
For k ∈ N set
Vk := {φ ∈ Cc∞ (a, b) | σ ∗ φ ∈ Pk },
Proof (of Theorem 3.8). By Exercise 3.23 we have the implication “⇒”.
For the other direction we assume that σ ∈ M is not a polynomial. Then by Lemma 3.15
there exists φ ∈ C_c^∞(R) such that σ ∗ φ is not a polynomial. According to Lemma 3.14 we have σ ∗ φ ∈ N_1^1(σ; 1)^cc. We conclude with Lemma 3.13 that N_1^1(σ; 1) is a universal approximator of
C 0 (R).
Finally, by Lemma 3.10, Nd1 (σ; 1) is a universal approximator of C 0 (Rd ).
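Theorem 3.8 is purely qualitative, but the effect is easy to observe in a small experiment. The sketch below fits shallow ReLU networks of increasing width to a continuous target on [−1, 1] by solving a least-squares problem for the outer coefficients with randomly drawn inner weights; this is only an illustration of expressivity, not the construction used in the proof, and all numerical choices are ad hoc.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.cos(3.0 * x) + 0.5 * np.abs(x)      # continuous target on K = [-1, 1]

x_fit = np.linspace(-1.0, 1.0, 2000)                 # dense grid standing in for K
x_chk = np.linspace(-1.0, 1.0, 4001)

def shallow_relu_fit(width):
    # One hidden layer: features sigma(w_j x + b_j) with random (w_j, b_j);
    # the outer linear layer is obtained by least squares on the grid.
    w = rng.normal(size=width) * 5.0
    b = rng.uniform(-5.0, 5.0, size=width)
    feats = lambda x: np.maximum(np.outer(x, w) + b, 0.0)
    A = np.hstack([feats(x_fit), np.ones((x_fit.size, 1))])
    c, *_ = np.linalg.lstsq(A, f(x_fit), rcond=None)
    return lambda x: np.hstack([feats(x), np.ones((x.size, 1))]) @ c

for n in (5, 20, 100, 500):
    phi = shallow_relu_fit(n)
    print(n, np.max(np.abs(phi(x_chk) - f(x_chk))))   # sup-error typically shrinks with width
```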
3.1.3 Deep neural networks
Theorem 3.8 shows the universal approximation capability of single-hidden-layer neural networks
with activation functions σ ∈ M\P: they can approximate every continuous function on every
compact set to arbitrary precision, given sufficient width. This result directly extends to neural
networks of any fixed depth L ≥ 1. The idea is to use the fact that the identity function can be
approximated with a shallow neural network. Composing a shallow neural network approximation of
the target function f with (multiple) shallow neural networks approximating the identity function,
gives a deep neural network approximation of f .
Instead of directly applying Theorem 3.8, we first establish the following proposition regarding
the approximation of the identity function. Rather than σ ∈ M\P, it requires a different (mild)
assumption on the activation function. This allows for a constructive proof, yielding explicit bounds
on the neural network size, which will prove useful later in the book.
Proposition 3.16. Let d, L ∈ N, let K ⊆ Rd be compact, and let σ : R → R be such that there
exists an open set on which σ is differentiable and not constant. Then, for every ε > 0, there exists
a neural network Φ ∈ N_d^d(σ; L, d) such that

sup_{x∈K} ∥Φ(x) − x∥ ≤ ε.
Proof. The proof uses the same idea as in Lemma 3.13, where we approximate the derivative of the
activation function by a simple neural network. Let us first assume d ∈ N and L = 1.
Let x∗ ∈ R be such that σ is differentiable on a neighborhood of x∗ and σ ′ (x∗ ) = θ ̸= 0.
Moreover, let x∗ = (x∗ , . . . , x∗ ) ∈ Rd . Then, for λ > 0 we define
Φ_λ(x) := (λ/θ) σ(x/λ + x*) − (λ/θ) σ(x*).

Then, we have, for all x ∈ K,

Φ_λ(x) − x = λ (σ(x/λ + x*) − σ(x*))/θ − x.  (3.1.7)
If xi = 0 for i ∈ {1, . . . , d}, then (3.1.7) shows that (Φλ (x) − x)i = 0. Otherwise
By the definition of the derivative, we have that |(Φλ (x) − x)i | → 0 for λ → ∞ uniformly for all
x ∈ K and i ∈ {1, . . . , d}. Therefore, |Φλ (x) − x| → 0 for λ → ∞ uniformly for all x ∈ K.
The extension to L > 1 is straightforward and is the content of Exercise 3.27.
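The construction Φ_λ from the proof can be checked directly. In the sketch below, σ is the sigmoid (differentiable and non-constant everywhere), x* = 0 and θ = σ′(0) = 1/4; these concrete choices are for illustration only. The sup-error over a compact interval shrinks as λ grows.

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # differentiable, non-constant
x_star = 0.0
theta = 0.25                                 # sigma'(0) = 1/4

def phi_lambda(x, lam):
    # Phi_lambda(x) = (lam / theta) * (sigma(x / lam + x*) - sigma(x*)),  cf. the proof
    return (lam / theta) * (sigma(x / lam + x_star) - sigma(x_star))

x = np.linspace(-5.0, 5.0, 1001)             # a compact set K
for lam in (1.0, 10.0, 100.0, 1000.0):
    print(lam, np.max(np.abs(phi_lambda(x, lam) - x)))
```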
Corollary 3.17. Let d ∈ N, L ∈ N and σ ∈ M. Then Nd1 (σ; L) is a universal approximator of
C 0 (Rd ) if and only if σ is not a polynomial.
Proof. We only show the implication “⇐”. The other direction is again left as an exercise, see
Exercise 3.23.
Assume σ ∈ M is not a polynomial, let K ⊆ Rd be compact, and let f ∈ C 0 (Rd ). Fix ε ∈ (0, 1).
We need to show that there exists a neural network Φ ∈ Nd1 (σ; L) such that supx∈K |f (x)−Φ(x)| <
ε. The case L = 1 holds by Theorem 3.8, so let L > 1.
By Theorem 3.8, there exist Φshallow ∈ Nd1 (σ; 1) such that
sup_{x∈K} |f(x) − Φ_shallow(x)| < ε/2.  (3.1.8)
Compactness of {f (x) | x ∈ K} implies that we can find n > 0 such that
{Φshallow (x) | x ∈ K} ⊆ [−n, n]. (3.1.9)
Let Φid ∈ N11 (σ; L − 1) be an approximation to the identity such that
sup_{x∈[−n,n]} |x − Φ_id(x)| < ε/2,  (3.1.10)
which is possible by the extension of Proposition 3.16 to general non-polynomial activation functions
σ ∈ M.
Denote Φ := Φ_id ∘ Φ_shallow. According to Proposition 2.3 (iv), it holds that Φ ∈ N_d^1(σ; L), as desired.
Moreover (3.1.8), (3.1.9), (3.1.10) imply
sup_{x∈K} |f(x) − Φ(x)| = sup_{x∈K} |f(x) − Φ_id(Φ_shallow(x))|
≤ sup_{x∈K} (|f(x) − Φ_shallow(x)| + |Φ_shallow(x) − Φ_id(Φ_shallow(x))|)
≤ ε/2 + ε/2 = ε.
This concludes the proof.
Corollary 3.18. Let d ∈ N, L ∈ N, p ∈ [1, ∞), and let σ ∈ M not be a polynomial. Then for
every ε > 0, every compact K ⊆ Rd , and every f ∈ Lp (K) there exists Φf,ε ∈ Nd1 (σ; L) such that
(∫_K |f(x) − Φ_{f,ε}(x)|^p dx)^{1/p} ≤ ε.
3.2 Superexpressive activations and Kolmogorov’s superposition
theorem
In the previous section, we saw that a large class of activation functions allow for universal approx-
imation. However, these results did not provide any insights into the necessary neural network size
for achieving a specific accuracy.
Before exploring this topic further in the following chapters, we next present a remarkable result
that shows how the required neural network size is significantly influenced by the choice of activation
function. The result asserts that, with the appropriate activation function, every f ∈ C 0 (K) on a
compact set K ⊆ Rd can be approximated to every desired accuracy ε > 0 using a neural network
of size O(d2 ); in particular the neural network size is independent of ε > 0, K, and f . We will first
discuss the one-dimensional case.
Proposition 3.19. There exists a continuous activation function σ : R → R such that for every
compact K ⊆ R, every ε > 0 and every f ∈ C 0 (K) there exists Φ(x) = σ(wx + b) ∈ N11 (σ; 1, 1)
such that

sup_{x∈K} |f(x) − Φ(x)| < ε.
Proof. Denote by P̃_n all polynomials p(x) = ∑_{j=0}^n q_j x^j with rational coefficients, i.e. such that q_j ∈ Q for j = 0, . . . , n, and let (p_i)_{i∈Z} be an enumeration of ⋃_{n∈N} P̃_n. First let K = [0, 1]. By the Weierstrass theorem there exists a polynomial p = ∑_{j=0}^n r_j x^j with real coefficients r_j such that sup_{x∈[0,1]} |p(x) − f(x)| < ε/2. Now choose q_j ∈ Q so close to r_j such that p̃(x) := ∑_{j=0}^n q_j x^j satisfies sup_{x∈[0,1]} |p̃(x) − p(x)| < ε/2. Let i ∈ Z such that p̃ = p_i, i.e., p_i(x) = σ(2i + x) for all x ∈ [0, 1]. Then sup_{x∈[0,1]} |f(x) − σ(x + 2i)| < ε.
For general compact K assume that K ⊆ [a, b]. By Tietze’s extension theorem, f allows a
continuous extension to [a, b], so without loss of generality K = [a, b]. By the first case we can find
i ∈ Z such that with y = (x − a)/(b − a) (i.e. y ∈ [0, 1] if x ∈ [a, b])
x−a
sup f (x) − σ + 2i = sup |f (y · (b − a) + a) − σ(y + 2i)| < ε,
x∈[a,b] b−a y∈[0,1]
To extend this result to arbitrary dimension, we will use Kolmogorov’s superposition theorem.
It states that every continuous function of d variables can be expressed as a composition of functions
that each depend only on one variable. We omit the technical proof, which can be found in [147].
Theorem 3.20 (Kolmogorov). For every d ∈ N there exist 2d2 + d monotonically increasing
functions φi,j ∈ C 0 (R), i = 1, . . . , d, j = 1, . . . , 2d + 1, such that for every f ∈ C 0 ([0, 1]d ) there
exist functions fj ∈ C 0 (R), j = 1, . . . , 2d + 1 satisfying
f(x) = ∑_{j=1}^{2d+1} f_j(∑_{i=1}^d φ_{i,j}(x_i))  for all x ∈ [0, 1]^d.
Corollary 3.21. Let d ∈ N. With the activation function σ : R → R from Proposition 3.19, for
every compact K ⊆ Rd , every ε > 0 and every f ∈ C 0 (K) there exists Φ ∈ Nd1 (σ; 2, 2d2 + d) (i.e.
width(Φ) = 2d² + d and depth(Φ) = 2) such that

sup_{x∈K} |f(x) − Φ(x)| ≤ ε.
Proof. Without loss of generality we can assume K = [0, 1]d : the extension to the general case then
follows by Tietze’s extension theorem and a scaling argument as in the proof of Proposition 3.19.
Let fj , φi,j , i = 1, . . . , d, j = 1, . . . , 2d + 1 be as in Theorem 3.20. Fix ε > 0. Let a > 0 be so
large that
Since each fj is uniformly continuous on the compact set [−da, da], we can find δ > 0 such that
sup_j sup_{|y−ỹ|<δ, |y|,|ỹ|≤da} |f_j(y) − f_j(ỹ)| < ε/(2(2d + 1)).  (3.2.1)
Thus, with

y_j := ∑_{i=1}^d φ_{i,j}(x_i)  and  ỹ_j := ∑_{i=1}^d φ̃_{i,j}(x_i),

|f(x) − ∑_{j=1}^{2d+1} σ(w_j ∑_{i=1}^d σ(w_{i,j} x_i + b_{i,j}) + b_j)| = |∑_{j=1}^{2d+1} (f_j(y_j) − f̃_j(ỹ_j))|
≤ ∑_{j=1}^{2d+1} (|f_j(y_j) − f_j(ỹ_j)| + |f_j(ỹ_j) − f̃_j(ỹ_j)|)
≤ ∑_{j=1}^{2d+1} (ε/(2(2d + 1)) + ε/(2(2d + 1))) ≤ ε.
“magic” activation function in Section 3.2 comes from [170] where it is shown that such an activation
function can even be chosen monotonically increasing.
Exercises
Exercise 3.22. Write down a generator of a (minimal) topology on C^0(R^d) such that f_n → f ∈ C^0(R^d) if and only if f_n converges compactly to f, and show this equivalence. This topology is referred to as the
topology of compact convergence.
Exercise 3.23. Show the implication “⇒” of Theorem 3.8 and Corollary 3.17.
Exercise 3.24. Prove Lemma 3.12. Hint: Consider σ(nx) for large n ∈ N.
Exercise 3.25. Let k ∈ N, σ ∈ M and assume that σ ∗ φ ∈ Pk for all φ ∈ Cc∞ (R). Show that
σ ∈ Pk .
Hint: Consider ψ ∈ C_c^∞(R) such that ψ ≥ 0 and ∫_R ψ(x) dx = 1, and set ψ_ε(x) := ψ(x/ε)/ε.
Use that away from the discontinuities of σ it holds ψε ∗ σ(x) → σ(x) as ε → 0. Conclude that σ
is piecewise in Pk , and finally show that σ ∈ C k (R).
Exercise 3.26. Prove Corollary 3.18 with the use of Corollary 3.17.
Chapter 4
Splines
In Chapter 3, we saw that sufficiently large neural networks can approximate every continuous
function to arbitrary accuracy. However, these results did not further specify the meaning of
“sufficiently large” or what constitutes a suitable architecture. Ideally, given a function f , and a
desired accuracy ε > 0, we would like to have a (possibly sharp) bound on the required size, depth,
and width guaranteeing the existence of a neural network approximating f up to error ε.
The field of approximation theory establishes such trade-offs between properties of the function f
(e.g., its smoothness), the approximation accuracy, and the number of parameters needed to achieve
this accuracy. For example, given k, d ∈ N, how many parameters are required to approximate a
function f : [0, 1]d → R with ∥f ∥C k ([0,1]d ) ≤ 1 up to uniform error ε? Splines are known to achieve
this approximation accuracy with a superposition of O(ε−d/k ) simple (piecewise polynomial) basis
functions. In this chapter, following [176], we show that certain sigmoidal neural networks can
match this performance in terms of the neural network size. In fact, from an approximation
theoretical viewpoint we show that the considered neural networks are at least as expressive as
superpositions of splines.
By shifting and dilating the cardinal B-spline, we obtain a system of univariate splines. Taking
tensor products of these univariate splines yields a set of higher-dimensional functions known as
the multivariate B-splines.
Definition 4.2. For t ∈ R and n, ℓ ∈ N we define S_{ℓ,t,n} := S_n(2^ℓ(· − t)). Additionally, for d ∈ N, t ∈ R^d, and n, ℓ ∈ N, we define the multivariate B-spline S^d_{ℓ,t,n} as

S^d_{ℓ,t,n}(x) := ∏_{i=1}^d S_{ℓ,t_i,n}(x_i)  for x = (x_1, . . . , x_d) ∈ R^d,

and

B^n := {S^d_{ℓ,t,n} | ℓ ∈ N, t ∈ R^d}.
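A multivariate B-spline as in Definition 4.2 can be evaluated as a product of shifted and dilated univariate cardinal B-splines. The sketch below identifies S_n with scipy's B-spline basis element on the knots 0, 1, . . . , n + 1; this identification, and all parameter values, are assumptions of the illustration (Definition 4.1 is not reproduced in this excerpt).

```python
import numpy as np
from scipy.interpolate import BSpline

def cardinal_bspline(n):
    # Degree-n B-spline on the knots 0, 1, ..., n+1 (assumed to coincide with S_n).
    basis = BSpline.basis_element(np.arange(n + 2), extrapolate=False)
    return lambda x: np.nan_to_num(basis(x))          # 0 outside the support

def multivariate_bspline(ell, t, n):
    # S^d_{ell,t,n}(x) = prod_i S_n(2^ell * (x_i - t_i)),  cf. Definition 4.2
    s_n = cardinal_bspline(n)
    t = np.asarray(t, dtype=float)
    return lambda x: np.prod(s_n(2.0 ** ell * (np.asarray(x, dtype=float) - t)), axis=-1)

S = multivariate_bspline(ell=1, t=[0.0, 0.25], n=2)   # d = 2, n = 2, ell = 1
grid = np.stack(np.meshgrid(np.linspace(0, 2, 5), np.linspace(0, 2, 5)), axis=-1)
print(S(grid))                                        # values on a 5 x 5 grid of points
```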
Having introduced the system B n , we would like to understand how well we can represent each
smooth function by superpositions of elements of B n . The following theorem is adapted from the
more general result [202, Theorem 7]; also see [171, Theorem D.3] for a presentation closer to the
present formulation.
Theorem 4.3. Let d, n, k ∈ N such that 0 < k ≤ n. Then there exists C such that for every
f ∈ C k ([0, 1]d ) and every N ∈ N, there exist ci ∈ R with |ci | ≤ C∥f ∥L∞ ([0,1]d ) and Bi ∈ B n for
i = 1, . . . , N , such that
∥f − ∑_{i=1}^N c_i B_i∥_{L^∞([0,1]^d)} ≤ C N^{−k/d} ∥f∥_{C^k([0,1]^d)}.
Remark 4.4. There are a couple of critical concepts in Theorem 4.3 that will reappear throughout
this book. The number of parameters N determines the approximation accuracy N −k/d . This im-
plies that achieving accuracy ε > 0 requires O(ε−d/k ) parameters (according to this upper bound),
which grows exponentially in d. This exponential dependence on d is referred to as the “curse of
dimension” and will be discussed again in the subsequent chapters. The smoothness parameter
k has the opposite effect of d, and improves the convergence rate. Thus, smoother functions can
be approximated with fewer B-splines than rougher functions. This more efficient approximation
requires the use of B-splines of order n with n ≥ k. We will see in the following that the order of the B-spline is closely linked to the concept of depth in neural networks.
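The exponential growth in d described in Remark 4.4 can be made tangible with a one-line computation of ε^{−d/k} (constants ignored; purely illustrative).

for d in (1, 2, 4, 8):
    for k in (1, 2, 4):
        eps = 1e-2
        # Rough parameter count N ~ eps^(-d/k) implied by the rate N^(-k/d).
        print(f"d={d}, k={k}: N ~ {eps ** (-d / k):.1e}")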
Definition 4.5. A function σ : R → R is called sigmoidal of order q ∈ N, if σ ∈ C q−1 (R) and
there exists C > 0 such that
    σ(x)/x^q → 0   as x → −∞,
    σ(x)/x^q → 1   as x → ∞,
    |σ(x)| ≤ C · (1 + |x|)^q   for all x ∈ R.
Example 4.6. The rectified power unit x 7→ σReLU (x)q is sigmoidal of order q. ♢
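A minimal numerical sketch of Example 4.6: the rectified power unit and a quick check of the two limits in Definition 4.5 (the name repu and the test points are ours).

import numpy as np

def repu(x, q):
    # Rectified power unit: relu(x)^q; it is C^(q-1) and satisfies |repu(x)| <= (1+|x|)^q.
    return np.maximum(x, 0.0) ** q

q = 3
for x in (-1e6, -1e3, 1e3, 1e6):
    print(x, repu(x, q) / x**q)   # tends to 0 as x -> -inf and to 1 as x -> +inf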
Our goal in the following is to show that neural networks can approximate a linear combination
of N B-splines with a number of parameters that is proportional to N . As an immediate conse-
quence of Theorem 4.3, we then obtain a convergence rate for neural networks. Let us start by
approximating a single univariate B-spline with a neural network of fixed size.
    ∥ S_n − Φ^{S_n} ∥_{L∞([−K,K])} ≤ ε.

Proof. By definition (4.1.1), S_n is a linear combination of n + 1 shifts of σReLU^{n−1}. We start by approximating σReLU^{n−1}. It is not hard to see (Exercise 4.10) that, for every K′ > 0 and every t ∈ N,
    a^{−q^t} · (σ ◦ σ ◦ · · · ◦ σ)(ax) − σReLU(x)^{q^t} → 0   as a → ∞,   (4.2.1)

where σ is composed t times and the convergence is uniform on [−K′, K′].
This shows that we can approximate the ReLU raised to the power q^t ≥ n − 1. However, our goal is to obtain an approximation of the ReLU raised to the power n − 1, which could be smaller than q^t. To reduce the order, we emulate approximate derivatives of Φ^{q^t}_ε. Concretely, we show the following claim: For all 1 ≤ p ≤ q^t, every K′ > 0, and every ε > 0, there exists a neural network Φ^p_ε having ⌈log_q(n − 1)⌉ layers and satisfying
The claim holds for p = q^t. We now proceed by induction over p = q^t, q^t − 1, . . . . Assume (4.2.3) holds for some p ∈ {2, . . . , q^t}. Fix δ ≥ 0. Then
Hence, by the binomial theorem it follows that there exists δ_∗ > 0 such that
for all x ∈ [−K′, K′]. By Proposition 2.3, (Φ^p_{δ_∗²}(x + δ_∗) − Φ^p_{δ_∗²}(x))/(p δ_∗) is a neural network with ⌈log_q(n − 1)⌉ layers and size independent of ε. Calling this neural network Φ^{p−1}_ε shows that (4.2.3) holds for p − 1, which concludes the induction argument and proves the claim.
For every neural network Φ, every spatial translation Φ(· − t) is a neural network of the same
architecture. Hence, every term in the sum (4.1.1) can be approximated to arbitrary accuracy by
a neural network of a fixed size. Since by Proposition 2.3, sums of neural networks of the same
depth are again neural networks of the same depth, the result follows.
Next, we extend Proposition 4.7 to the multivariate splines S^d_{ℓ,t,n} for arbitrary ℓ, d ∈ N, t ∈ R^d.

Proof. By definition, S^d_{ℓ,t,n}(x) = ∏_{i=1}^{d} S_{ℓ,t_i,n}(x_i), where
By Proposition 4.7 there exists a constant C′ > 0 such that for each i = 1, . . . , d and all ε > 0, there is a neural network Φ^{S_{ℓ,t_i,n}} with size C′ and ⌈log_q(n − 1)⌉ layers such that
If d = 1, this shows the statement. For general d, it remains to show that the product of the ΦSℓ,ti ,n
for i = 1, . . . , d can be approximated.
We first prove the following claim by induction: For every d ∈ N, d ≥ 2, there exists a constant
C ′′ > 0, such that for all K ′ ≥ 1 and all ε > 0 there exists a neural network Φmult,ε,d with size
C ′′ , ⌈log2 (d)⌉ layers, and activation function σ such that for all x1 , . . . , xd with |xi | ≤ K ′ for all
i = 1, . . . , d,
    | Φ_{mult,ε,d}(x_1, . . . , x_d) − ∏_{i=1}^{d} x_i | < ε.   (4.2.4)
For the base case, let d = 2. Similar to the proof of Proposition 4.7, one can show that there exists
C ′′′ > 0 such that for every ε > 0 and K ′ > 0 there exists a neural network Φsquare,ε with one
hidden layer and size C ′′′ such that
Each term on the right-hand side can be approximated up to uniform error ε/6 with a network of
size C ′′′ and one hidden layer. By Proposition 2.3, we conclude that there exists a neural network
Φmult,ε,2 satisfying (4.2.4) for d = 2.
Assume the induction hypothesis (4.2.4) holds for d − 1 ≥ 1, and let ε > 0 and K ′ ≥ 1. We
have
    ∏_{i=1}^{d} x_i = ( ∏_{i=1}^{⌊d/2⌋} x_i ) · ( ∏_{i=⌊d/2⌋+1}^{d} x_i ).   (4.2.6)
We will now approximate each of the terms in the product on the right-hand side of (4.2.6) by a
neural network using the induction assumption.
For simplicity assume in the following that ⌈log2 (⌊d/2⌋)⌉ = ⌈log2 (d − ⌊d/2⌋)⌉. The general
case can be addressed via Proposition 3.16. By the induction assumption there then exist neural
networks Φmult,1 and Φmult,2 both with ⌈log2 (⌊d/2⌋)⌉ layers, such that for all xi with |xi | ≤ K ′ for
i = 1, . . . , d
    | Φ_{mult,1}(x_1, . . . , x_{⌊d/2⌋}) − ∏_{i=1}^{⌊d/2⌋} x_i | < ε / (4((K′)^{⌊d/2⌋} + ε)),
    | Φ_{mult,2}(x_{⌊d/2⌋+1}, . . . , x_d) − ∏_{i=⌊d/2⌋+1}^{d} x_i | < ε / (4((K′)^{⌊d/2⌋} + ε)).
By Proposition 2.3, Φmult,ε,d := Φmult,ε/2,2 ◦(Φmult,1 , Φmult,2 ) is a neural network with 1+⌈log2 (⌊d/2⌋)⌉ =
⌈log2 (d)⌉ layers. By construction, the size of Φmult,ε,d does not depend on K ′ or ε. Thus, to complete
the induction, it only remains to show (4.2.4).
For all a, b, c, d ∈ R it holds that
Hence, for x_1, . . . , x_d with |x_i| ≤ K′ for all i = 1, . . . , d, we have that

    | ∏_{i=1}^{d} x_i − Φ_{mult,ε,d}(x_1, . . . , x_d) |
      ≤ ε/2 + | ∏_{i=1}^{⌊d/2⌋} x_i · ∏_{i=⌊d/2⌋+1}^{d} x_i − Φ_{mult,1}(x_1, . . . , x_{⌊d/2⌋}) Φ_{mult,2}(x_{⌊d/2⌋+1}, . . . , x_d) |
      ≤ ε/2 + |K′|^{⌊d/2⌋} · ε/(4((K′)^{⌊d/2⌋} + ε)) + (|K′|^{⌈d/2⌉} + ε) · ε/(4((K′)^{⌊d/2⌋} + ε))
      < ε.
This completes the proof of (4.2.4).
The overall result follows by using Proposition 2.3 to show that the multiplication network can
be composed with a neural network comprised of the ΦSℓ,ti ,n for i = 1, . . . , d. Since in no step above
the size of the individual networks was dependent on the approximation accuracy, this is also true
for the final network.
Proposition 4.8 shows that we can approximate a single multivariate B-spline with a neural
network with a size that is independent of the accuracy. Combining this observation with Theorem
4.3 leads to the following result.
Theorem 4.9. Let d, n, k ∈ N such that 0 < k ≤ n and n ≥ 2. Let q ≥ 2, and let σ be sigmoidal
of order q.
Then there exists C such that for every f ∈ C k ([0, 1]d ) and every N ∈ N there exists a neural
network ΦN with activation function σ, ⌈log2 (d)⌉ + ⌈logq (k − 1)⌉ layers, and size bounded by CN ,
such that
    ∥ f − Φ_N ∥_{L∞([0,1]^d)} ≤ C N^{−k/d} ∥f∥_{C^k([0,1]^d)}.
Proof. Fix N ∈ N. By Theorem 4.3, there exist coefficients |ci | ≤ C∥f ∥L∞ ([0,1]d ) and Bi ∈ B n for
i = 1, . . . , N , such that
    ∥ f − Σ_{i=1}^{N} c_i B_i ∥_{L∞([0,1]^d)} ≤ C N^{−k/d} ∥f∥_{C^k([0,1]^d)}.
Moreover, by Proposition 4.8, for each i = 1, . . . , N there exists a neural network Φ^{B_i} with ⌈log_2(d)⌉ + ⌈log_q(k − 1)⌉ layers and a fixed size, which approximates B_i on [−1, 1]^d ⊇ [0, 1]^d up to error ε := N^{−k/d}/N. The size of Φ^{B_i} is independent of i and N.
By Proposition 2.3, there exists a neural network Φ_N that uniformly approximates Σ_{i=1}^{N} c_i B_i up to error ε on [0, 1]^d, and has ⌈log_2(d)⌉ + ⌈log_q(k − 1)⌉ layers. The size of this network is linear in N (see Exercise 4.11). This concludes the proof.
Theorem 4.9 shows that neural networks with higher-order sigmoidal functions can approximate
smooth functions with the same accuracy as spline approximations while having a comparable
number of parameters. The network depth is required to behave like O(log(k)) in terms of the
smoothness parameter k, cp. Remark 4.4.
Bibliography and further reading
The argument of linking sigmoidal activation functions with spline based approximation was first
introduced in [176, 174]. For further details on spline approximation, see [202] or the book [245].
The general strategy of approximating basis functions by neural networks, and then lifting ap-
proximation results for those bases has been employed widely in the literature, and will also reappear
again in this book. While the following chapters primarily focus on ReLU activation, we highlight
a few notable approaches with non-ReLU activations based on the outlined strategy: To approx-
imate analytic functions, [175] emulates a monomial basis. To approximate periodic functions, a
basis of trigonometric polynomials is recreated in [177]. Wavelet bases have been emulated in [205].
Moreover, neural networks have been studied through the representation system of ridgelets [43]
and ridge functions [128]. A general framework describing the emulation of representation systems
to transfer approximation results was presented in [30].
Exercises
Exercise 4.10. Show that (4.2.1) holds.
Exercise 4.11. Let L ∈ N, σ : R → R, and let Φ_1, Φ_2 be two neural networks with architectures (σ; d_0, d_1^{(1)}, . . . , d_L^{(1)}, d_{L+1}) and (σ; d_0, d_1^{(2)}, . . . , d_L^{(2)}, d_{L+1}). Show that Φ_1 + Φ_2 is a neural network with size(Φ_1 + Φ_2) ≤ size(Φ_1) + size(Φ_2).
Exercise 4.12. Show that, for σ = σReLU² and k ≤ 2, for all f ∈ C^k([0, 1]^d) all weights of the approximating neural network of Theorem 4.9 can be bounded in absolute value by O(max{2, ∥f∥_{C^k([0,1]^d)}}).
Chapter 5
In this chapter, we discuss feedforward neural networks using the ReLU activation function σReLU
introduced in Section 2.3. We refer to these functions as ReLU neural networks. Due to its simplicity
and the fact that it reduces the vanishing and exploding gradients phenomena, the ReLU is one of
the most widely used activation functions in practice.
A key component of the proofs in the previous chapters was the approximation of derivatives of
the activation function to emulate polynomials. Since the ReLU is piecewise linear, this trick is not
applicable. This makes the analysis fundamentally different from the case of smoother activation
functions. Nonetheless, we will see that even this extremely simple activation function yields a very
rich class of functions possessing remarkable approximation capabilities.
To formalize these results, we begin this chapter by adopting a framework from [208], which
enables the tracking of the number of network parameters for basic manipulations such as adding
up or composing two neural networks. This will allow us to bound the network complexity when constructing more elaborate networks from simpler ones. With these preliminaries at hand, the
rest of the chapter is dedicated to the exploration of links between ReLU neural networks and the
class of “continuous piecewise linear functions.” In Section 5.2, we will see that every such function
can be exactly represented by a ReLU neural network. Afterwards, in Section 5.3 we will give a
more detailed analysis of the required network complexity. Finally, we will use these results to
prove a first approximation theorem for ReLU neural networks in Section 5.4. The argument is
similar in spirit to Chapter 4, in that we transfer established approximation theory for piecewise
linear functions to the class of ReLU neural networks of a certain architecture.
• Reproducing an identity: We have seen in Proposition 3.16 that for most activation functions,
an approximation to the identity can be built by neural networks. For ReLUs, we can have
an even stronger result and reproduce the identity exactly. This identity will play a crucial
role in order to extend certain neural networks to deeper neural networks, and to facilitate
an efficient composition operation.
• Composition: We saw in Proposition 2.3 that we can produce a composition of two neural
networks and the resulting function is a neural network as well. There we did not study the
size of the resulting neural networks. For ReLU activation functions, this composition can be done in a very efficient way, leading to a neural network that has, up to a constant, no more weights than the two initial neural networks combined.
• Parallelization: The parallelization of two neural networks was also discussed in Proposition 2.3. We will refine this notion and make precise the size of the resulting neural networks.
• Linear combinations: Similarly, for the sum of two neural networks, we will give precise
bounds on the size of the resulting neural network.
5.1.1 Identity
We start with expressing the identity on Rd as a neural network of depth L ∈ N.
Lemma 5.1 (Identity). Let L ∈ N. Then, there exists a ReLU neural network Φ^id_L such that Φ^id_L(x) = x for all x ∈ R^d. Moreover, depth(Φ^id_L) = L, width(Φ^id_L) = 2d, and size(Φ^id_L) = 2d · (L + 1).
Proof. Writing I_d ∈ R^{d×d} for the identity matrix, we choose the weights
Using that x = σReLU(x) − σReLU(−x) for all x ∈ R and σReLU(x) = x for all x ≥ 0, it is obvious that the neural network Φ^id_L associated to the weights above satisfies the assertion of the lemma.
We will see in Exercise 5.24 that the property of representing the identity exactly is not shared by sigmoidal activation functions. It does hold for polynomial activation functions, though; also see Proposition 3.16.
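A minimal numerical sketch of Lemma 5.1, assuming one natural weight choice for Φ^id_L (first layer (I_d; −I_d), then L − 1 identity layers of width 2d, then the output layer (I_d, −I_d), zero biases); these specific matrices and names are our illustration, not a display from the book.

import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def identity_network(d, L):
    Ws = [np.vstack([np.eye(d), -np.eye(d)])]        # layer 0: d -> 2d
    Ws += [np.eye(2 * d) for _ in range(L - 1)]      # L-1 hidden layers of width 2d
    Ws += [np.hstack([np.eye(d), -np.eye(d)])]       # output layer: 2d -> d
    return Ws

def apply(Ws, x):
    z = x
    for W in Ws[:-1]:
        z = relu(W @ z)                              # ReLU after every affine map except the last
    return Ws[-1] @ z

d, L = 3, 4
x = np.random.randn(d)
print(np.allclose(apply(identity_network(d, L), x), x))          # True
print(sum(np.count_nonzero(W) for W in identity_network(d, L)))  # 2d(L+1) = 30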
5.1.2 Composition
Assume we have two neural networks Φ_1, Φ_2 with architectures (σReLU; d^1_0, . . . , d^1_{L_1+1}) and (σReLU; d^2_0, . . . , d^2_{L_2+1}), respectively. Moreover, we assume that they have weights and biases given by

    (W_1^{(0)}, b_1^{(0)}), . . . , (W_1^{(L_1)}, b_1^{(L_1)})   and   (W_2^{(0)}, b_2^{(0)}), . . . , (W_2^{(L_2)}, b_2^{(L_2)}),

respectively. If the output dimension d^1_{L_1+1} of Φ_1 equals the input dimension d^2_0 of Φ_2, we can define two types of concatenations: First, Φ_2 ◦ Φ_1 is the neural network with weights and biases
given by

    (W_1^{(0)}, b_1^{(0)}), . . . , (W_1^{(L_1−1)}, b_1^{(L_1−1)}), (W_2^{(0)} W_1^{(L_1)}, W_2^{(0)} b_1^{(L_1)} + b_2^{(0)}),
    (W_2^{(1)}, b_2^{(1)}), . . . , (W_2^{(L_2)}, b_2^{(L_2)}).
Second, Φ_2 • Φ_1 is the neural network defined as Φ_2 ◦ Φ^id_1 ◦ Φ_1. In terms of weights and biases, Φ_2 • Φ_1 is given as

    (W_1^{(0)}, b_1^{(0)}), . . . , (W_1^{(L_1−1)}, b_1^{(L_1−1)}),
    ( ( W_1^{(L_1)} ; −W_1^{(L_1)} ), ( b_1^{(L_1)} ; −b_1^{(L_1)} ) ),
    ( (W_2^{(0)}, −W_2^{(0)}), b_2^{(0)} ), (W_2^{(1)}, b_2^{(1)}), . . . , (W_2^{(L_2)}, b_2^{(L_2)}),

where ( W_1^{(L_1)} ; −W_1^{(L_1)} ) denotes the vertical stacking of W_1^{(L_1)} and −W_1^{(L_1)}, and (W_2^{(0)}, −W_2^{(0)}) the horizontal concatenation.
Lemma 5.2 (Composition). Let Φ_1, Φ_2 be neural networks with architectures (σReLU; d^1_0, . . . , d^1_{L_1+1}) and (σReLU; d^2_0, . . . , d^2_{L_2+1}). Assume d^1_{L_1+1} = d^2_0. Then Φ_2 ◦ Φ_1(x) = Φ_2 • Φ_1(x) = Φ_2(Φ_1(x)) for all x ∈ R^{d^1_0}. Moreover,

and

Proof. The fact that Φ_2 ◦ Φ_1(x) = Φ_2 • Φ_1(x) = Φ_2(Φ_1(x)) for all x ∈ R^{d^1_0} follows immediately from the construction. The same can be said for the width and depth bounds. To confirm the size bound, we note that W_2^{(0)} W_1^{(L_1)} ∈ R^{d^2_1 × d^1_{L_1}} and hence W_2^{(0)} W_1^{(L_1)} has not more than d^2_1 · d^1_{L_1} (nonzero) entries. Moreover, W_2^{(0)} b_1^{(L_1)} + b_2^{(0)} ∈ R^{d^2_1}. Thus, the L_1-th layer of Φ_2 ◦ Φ_1 has at most d^2_1 · (1 + d^1_{L_1}) entries. The rest is obvious from the construction.
Interpreting linear transformations as neural networks of depth 0, the previous lemma is also
valid in case Φ1 or Φ2 is a linear mapping.
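To make the ◦-composition concrete, here is a small sketch (not the book's code) that merges the output layer of Φ_1 with the input layer of Φ_2 exactly as described above and checks the result against Φ_2(Φ_1(x)) on random weights.

import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def forward(Ws, bs, x):
    # ReLU forward pass: activation after every layer except the last.
    for W, b in zip(Ws[:-1], bs[:-1]):
        x = relu(W @ x + b)
    return Ws[-1] @ x + bs[-1]

def compose(W1s, b1s, W2s, b2s):
    # Phi2 ∘ Phi1: merge the output layer of Phi1 with the input layer of Phi2.
    W_merge = W2s[0] @ W1s[-1]
    b_merge = W2s[0] @ b1s[-1] + b2s[0]
    return W1s[:-1] + [W_merge] + W2s[1:], b1s[:-1] + [b_merge] + b2s[1:]

rng = np.random.default_rng(1)
W1s = [rng.normal(size=(5, 3)), rng.normal(size=(2, 5))]   # Phi1: R^3 -> R^2
b1s = [rng.normal(size=5), rng.normal(size=2)]
W2s = [rng.normal(size=(4, 2)), rng.normal(size=(1, 4))]   # Phi2: R^2 -> R
b2s = [rng.normal(size=4), rng.normal(size=1)]

Ws, bs = compose(W1s, b1s, W2s, b2s)
x = rng.normal(size=3)
print(np.allclose(forward(Ws, bs, x), forward(W2s, b2s, forward(W1s, b1s, x))))  # True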
5.1.3 Parallelization
Let (Φ_i)_{i=1}^m be neural networks with architectures (σReLU; d^i_0, . . . , d^i_{L_i+1}), respectively. We proceed to build a neural network (Φ_1, . . . , Φ_m) realizing the function

    (Φ_1, . . . , Φ_m) : R^{Σ_{j=1}^m d^j_0} → R^{Σ_{j=1}^m d^j_{L_j+1}},   (x_1, . . . , x_m) 7→ (Φ_1(x_1), . . . , Φ_m(x_m)),   (5.1.1)

where these matrices are understood as block-diagonal, filled up with zeros. For the general case where the Φ_j might have different depths, let L_max := max_{1≤i≤m} L_i and I := {1 ≤ i ≤ m | L_i < L_max}. For j ∈ I^c set Φ̃_j := Φ_j, and for each j ∈ I

    Φ̃_j := Φ^id_{L_max − L_j} ◦ Φ_j.   (5.1.3)

Finally,

    (Φ_1, . . . , Φ_m) := (Φ̃_1, . . . , Φ̃_m).   (5.1.4)
Lemma 5.3 (Parallelization). Let m ∈ N and let (Φ_i)_{i=1}^m be neural networks with architectures (σReLU; d^i_0, . . . , d^i_{L_i+1}), respectively. Then the neural network (Φ_1, . . . , Φ_m) satisfies

    (Φ_1, . . . , Φ_m)(x) = (Φ_1(x_1), . . . , Φ_m(x_m))   for all x ∈ R^{Σ_{j=1}^m d^j_0}.
Proof. All statements except for the bound on the size follow immediately from the construction. To obtain the bound on the size, we note that by construction the sizes of the (Φ̃_i)_{i=1}^m in (5.1.3) will simply be added. The size of each Φ̃_i can be bounded with Lemma 5.2.
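The block-diagonal parallelization described above (the construction (5.1.2) itself is not reproduced here) can be sketched in a few lines; the helper names are ours.

import numpy as np
from scipy.linalg import block_diag

relu = lambda z: np.maximum(z, 0.0)

def forward(Ws, bs, x):
    for W, b in zip(Ws[:-1], bs[:-1]):
        x = relu(W @ x + b)
    return Ws[-1] @ x + bs[-1]

def parallelize(nets):
    # nets: list of (Ws, bs) pairs, all assumed to have the same number of layers.
    # Per layer: weight matrices stacked block-diagonally, biases concatenated.
    L = len(nets[0][0])
    Ws = [block_diag(*[Wl[l] for (Wl, _) in nets]) for l in range(L)]
    bs = [np.concatenate([bl[l] for (_, bl) in nets]) for l in range(L)]
    return Ws, bs

rng = np.random.default_rng(2)
net1 = ([rng.normal(size=(4, 2)), rng.normal(size=(1, 4))], [rng.normal(size=4), rng.normal(size=1)])
net2 = ([rng.normal(size=(3, 5)), rng.normal(size=(2, 3))], [rng.normal(size=3), rng.normal(size=2)])
Ws, bs = parallelize([net1, net2])
x1, x2 = rng.normal(size=2), rng.normal(size=5)
out = forward(Ws, bs, np.concatenate([x1, x2]))
print(np.allclose(out, np.concatenate([forward(*net1, x1), forward(*net2, x2)])))  # True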
If all input dimensions d^1_0 = · · · = d^m_0 =: d_0 are the same, we will also use parallelization with shared inputs to realize the function x 7→ (Φ_1(x), . . . , Φ_m(x)) from R^{d_0} to R^{d^1_{L_1+1} + · · · + d^m_{L_m+1}}. In terms of the construction (5.1.2), the only required change is that the block-diagonal matrix diag(W_1^{(0)}, . . . , W_m^{(0)}) becomes the matrix in R^{(Σ_{j=1}^m d^j_1) × d_0} which stacks W_1^{(0)}, . . . , W_m^{(0)} on top of each other. Similarly, we will allow Φ_j to only take some of the entries of x as input. For parallelization with shared inputs we will use the same notation (Φ_j)_{j=1}^m as before, where the precise meaning will always be clear from context. Note that Lemma 5.3 remains valid in this case.
This corresponds to the parallelization (Φ_1, . . . , Φ_m) composed with the linear transformation (z_1, . . . , z_m) 7→ Σ_{j=1}^m α_j z_j. The following result holds.
j=1 αj z j . The following result holds.
Lemma 5.4 (Linear combinations). Let m ∈ N and let (Φ_i)_{i=1}^m be neural networks with architectures (σReLU; d^i_0, . . . , d^i_{L_i+1}), respectively. Assume that d^1_{L_1+1} = · · · = d^m_{L_m+1}, let α ∈ R^m, and set L_max := max_{j≤m} L_j. Then, there exists a neural network Σ_{j=1}^m α_j Φ_j such that (Σ_{j=1}^m α_j Φ_j)(x) = Σ_{j=1}^m α_j Φ_j(x_j) for all x = (x_j)_{j=1}^m ∈ R^{Σ_{j=1}^m d^j_0}. Moreover,

    width(Σ_{j=1}^m α_j Φ_j) ≤ 2 Σ_{j=1}^m width(Φ_j),   (5.1.6a)
    depth(Σ_{j=1}^m α_j Φ_j) = max_{j≤m} depth(Φ_j),   (5.1.6b)
    size(Σ_{j=1}^m α_j Φ_j) ≤ 2 Σ_{j=1}^m size(Φ_j) + 2 Σ_{j=1}^m (L_max − L_j) d^j_{L_j+1}.   (5.1.6c)
For general depths, we define the sum of the neural networks to be the sum of the extended neural networks Φ̃_i as in (5.1.3). All statements of the lemma then follow immediately from this construction.
In case d^1_0 = · · · = d^m_0 =: d_0 (all neural networks have the same input dimension), we will also consider linear combinations with shared inputs, i.e., a neural network realizing

    x 7→ Σ_{j=1}^m α_j Φ_j(x)   for x ∈ R^{d_0}.
This requires the same minor adjustment as discussed at the end of Section 5.1.3. Lemma 5.4
remains valid in this case and again we do not distinguish in notation for linear combinations with
or without shared inputs.
Remark 5.6. A “continuous piecewise linear function” as in Definition 5.5 is actually piecewise
affine. To maintain consistency with the literature, we use the terminology cpwl.
In the following, we will also refer to the connected domains on which f is equal to one of the functions g_j as regions or pieces. If f is cpwl with q ∈ N regions, then, with n ∈ N denoting the number of affine functions, it holds that n ≤ q.
Note that the mapping x 7→ σReLU(w^⊤x + b), which is a ReLU neural network with a single neuron, is cpwl (with two regions). Consequently, every ReLU neural network is a repeated compo-
sition of linear combinations of cpwl functions. It is not hard to see that the set of cpwl functions
is closed under compositions and linear combinations. Hence, every ReLU neural network is a cpwl
function. Interestingly, the reverse direction of this statement is also true, meaning that every cpwl
function can be represented by a ReLU neural network as we shall demonstrate below. Therefore,
we can identify the class of functions realized by arbitrary ReLU neural networks as the class of
cpwl functions.
A statement similar to Theorem 5.7 can be found in [7, 110]. There, the authors give a construction with a depth that behaves logarithmically in d and is independent of n, but with significantly larger bounds on the size. As we shall see, the proof of Theorem 5.7 is a simple consequence of the
following well-known result from [266]; also see [203], and for sharper bounds [282]. It states that
every cpwl function can be expressed as a finite maximum of a finite minimum of certain affine
functions.
Proof. Step 1. We start with d = 1, i.e., Ω ⊆ R is a (possibly unbounded) interval and for each
x ∈ Ω there exists j ∈ {1, . . . , n} such that with gj (x) := wj x + bj it holds that f (x) = gj (x).
Without loss of generality, we can assume that gi ̸= gj for all i ̸= j. Since the graphs of the gj are
lines, they intersect at (at most) finitely many points in Ω.
Since f is continuous, we conclude that there exist finitely many intervals covering Ω, such that
f coincides with one of the gj on each interval. For each x ∈ Ω let
sx := {1 ≤ j ≤ n | gj (x) ≥ f (x)}
and
Since there exist only finitely many possibilities to choose a subset of {1, . . . , n}, we conclude that
(5.2.1) holds for d = 1.
It remains to verify the claim (5.2.2). Fix y ̸= x ∈ Ω. Without loss of generality, let x < y
and let x = x0 < · · · < xk = y be such that f |[xi−1 ,xi ] equals some gj for each i ∈ {1, . . . , k}. In
order to show (5.2.2), it suffices to prove that there exists at least one j such that gj (x0 ) ≥ f (x0 )
and gj (xk ) ≤ f (xk ). The claim is trivial for k = 1. We proceed by induction. Suppose the
claim holds for k − 1, and consider the partition x0 < · · · < xk . Let r ∈ {1, . . . , n} be such
that f |[x0 ,x1 ] = gr |[x0 ,x1 ] . Applying the induction hypothesis to the interval [x1 , xk ], we can find
j ∈ {1, . . . , n} such that gj (x1 ) ≥ f (x1 ) and gj (xk ) ≤ f (xk ). If gj (x0 ) ≥ f (x0 ), then gj is the desired
function. Otherwise, gj (x0 ) < f (x0 ). Then gr (x0 ) = f (x0 ) > gj (x0 ) and gr (x1 ) = f (x1 ) ≤ gj (x1 ).
Therefore gr (x) ≤ gj (x) for all x ≥ x1 , and in particular gr (xk ) ≤ gj (xk ). Thus gr is the desired
function.
Step 2. For general d ∈ N, let g_j(x) := w_j^⊤ x + b_j for j = 1, . . . , n. For each x ∈ Ω, let
sx := {1 ≤ j ≤ n | gj (x) ≥ f (x)}
For an arbitrary 1-dimensional affine subspace S ⊆ Rd passing through x consider the line
(segment) I := S ∩ Ω, which is connected since Ω is convex. By Step 1, it holds
on all of I. Since I was arbitrary the formula is valid for all y ∈ Ω. This again implies (5.2.1) as
in Step 1.
Remark 5.9. For any a_1, . . . , a_k ∈ R it holds that min{−a_1, . . . , −a_k} = − max{a_1, . . . , a_k}. Thus, in the setting of Proposition 5.8, there exist m̃ ∈ N and sets s̃_j ⊆ {1, . . . , n} for j = 1, . . . , m̃, such that for all x ∈ Ω
To prove Theorem 5.7, it therefore suffices to show that the minimum and the maximum are
expressible by ReLU neural networks.
and
Proof. We have
    max{x, y} = y + { 0 if y > x,   x − y if x ≥ y } = y + σReLU(x − y).
Using y = σReLU (y) − σReLU (−y), the claim for the maximum follows. For the minimum observe
that min{x, y} = − max{−x, −y}.
Figure 5.1: Sketch of the neural network in Lemma 5.10. Only edges with non-zero weights are drawn.
    size(Φ^min_n) ≤ 16n,   width(Φ^min_n) ≤ 3n,   depth(Φ^min_n) ≤ ⌈log_2(n)⌉.
Proof. Throughout, denote by Φ^min_2 : R² → R the neural network from Lemma 5.10. It is of depth 1 and size 7 (since all biases are zero, it suffices to count the number of connections in Figure 5.1).
Step 1. Consider first the case where n = 2^k for some k ∈ N. We proceed by induction over k. For k = 1 the claim is proven. For k ≥ 2 set

    Φ^min_{2^k} := Φ^min_2 ◦ (Φ^min_{2^{k−1}}, Φ^min_{2^{k−1}}).   (5.2.3)
Next, we bound the size of the neural network. Note that all biases in this neural network are set to 0, since the Φ^min_2 neural network in Lemma 5.10 has no biases. Thus, the size of the neural network Φ^min_{2^k} corresponds to the number of connections in the graph (the number of nonzero weights). Careful inspection of the neural network architecture, see Figure 5.2, reveals that

    size(Φ^min_{2^k}) = 4 · 2^{k−1} + Σ_{j=0}^{k−2} 12 · 2^j + 3 = 2n + 12 · (2^{k−1} − 1) + 3 = 2n + 6n − 9 ≤ 8n,
    Φ^min_1(x) := x   for all x ∈ R
be the identity on R, i.e., a linear transformation and thus formally a depth-0 neural network. Then, for all n ≥ 2,

    Φ^min_n := Φ^min_2 ◦ { (Φ^id_1 ◦ Φ^min_{⌊n/2⌋}, Φ^min_{⌈n/2⌉})   if n ∈ {2^k + 1 | k ∈ N},
                           (Φ^min_{⌊n/2⌋}, Φ^min_{⌈n/2⌉})             otherwise.   (5.2.4)

This definition extends (5.2.3) to arbitrary n ≥ 2, since the first case in (5.2.4) never occurs if n ≥ 2 is a power of two.
To analyze (5.2.4), we start with the depth and claim that

    depth(Φ^min_n) = k   for all 2^{k−1} < n ≤ 2^k

and all k ∈ N. We proceed by induction over k. The case k = 1 is clear. For the induction step, assume the statement holds for some fixed k ∈ N and fix an integer n with 2^k < n ≤ 2^{k+1}. Then

    ⌈n/2⌉ ∈ (2^{k−1}, 2^k] ∩ N

and

    ⌊n/2⌋ ∈ { {2^{k−1}}   if n = 2^k + 1,
              (2^{k−1}, 2^k] ∩ N   otherwise.

Using the induction assumption, (5.2.4), and Lemmas 5.1 and 5.2, this shows

    depth(Φ^min_n) = depth(Φ^min_2) + k = 1 + k,
    Φ^max_n(x_1, . . . , x_n) := −Φ^min_n(−x_1, . . . , −x_n).
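As a quick sanity check of the constructions (5.2.3)–(5.2.4), the following sketch performs the binary-tree reduction using only the ReLU identities of Lemma 5.10; it tracks function values, not the explicit weight matrices, and all names are ours.

import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def min2(x, y):
    # min{x,y} = -max{-x,-y}, with max{a,b} = relu(b) - relu(-b) + relu(a-b).
    return relu(y) - relu(-y) - relu(y - x)

def min_n(xs):
    # Binary-tree reduction as in (5.2.4); depth ~ ceil(log2 n).
    xs = list(xs)
    while len(xs) > 1:
        nxt = [min2(xs[i], xs[i + 1]) for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2 == 1:
            nxt.append(xs[-1])   # odd element passed through (the identity network)
        xs = nxt
    return xs[0]

vals = np.random.randn(7)
print(np.isclose(min_n(vals), vals.min()))  # True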
[Figure 5.2: Sketch of the tree construction of Φ^min_n from Φ^min_2 blocks and identity networks Φ^id_1 (shown for the computation of min{x_1, . . . , x_8}, and below for n = 5, 6, 8). The number of connections between consecutive layers of Φ^min_{2^k} is 2^{k−1}·4, 2^{k−2}·12, . . . , 2·12, 3.]
and

    width(Φ) ≤ 2 max{ width(Φ^max_m), Σ_{j=1}^m width(Φ^min_{|s_j|}), Σ_{j=1}^m width((w_i^⊤ x + b_i)_{i∈s_j}) }
             ≤ 2 max{3m, 3mn, mdn} = O(dn2^n)

and

    size(Φ) ≤ 4 ( size(Φ^max_m) + size((Φ^min_{|s_j|})_{j=1}^m) + size(((w_i^⊤ x + b_i)_{i∈s_j})_{j=1}^m) )
            ≤ 4 ( 16m + 2 Σ_{j=1}^m (16|s_j| + 2⌈log_2(n)⌉) + nm(d + 1) ) = O(dn2^n).
5.3.1 Triangulations of Ω
For the ensuing discussion, we will consider Ω ⊆ Rd to be partitioned into simplices. This parti-
tioning will be termed a triangulation of Ω. Other notions prevalent in the literature include a
tessellation of Ω, or a simplicial mesh on Ω. To give a precise definition, let us first recall some
terminology. For a set S ⊆ R^d we denote the convex hull of S by

    co(S) := { Σ_{j=1}^n α_j x_j | n ∈ N, x_j ∈ S, α_j ≥ 0, Σ_{j=1}^n α_j = 1 }.   (5.3.1)
An n-simplex is the convex hull of n ∈ N points that are independent in a specific sense. This
is made precise in the following definition.
Figure 5.4: The first is a regular triangulation, while the second and the third are not.
Definition 5.13. Let d ∈ N, and Ω ⊆ Rd be compact. Let T be a finite set of d-simplices, and
for each τ ∈ T let V (τ ) ⊆ Ω have cardinality d + 1 such that τ = co(V (τ )). We call T a regular
triangulation of Ω, if and only if
(i) ∪_{τ ∈ T} τ = Ω,
We will split the proof into several lemmata. The strategy is to introduce a basis of the space of cpwl functions on T whose elements vanish on the boundary of Ω. We will then show that there exist O(|T|) basis functions, each of which can be represented by a neural network whose size depends only on k_T and d. To construct this basis, we first point out that an affine function on a simplex is uniquely defined by its values at the nodes.
coefficients do not sum to 1). Hence, g is uniquely determined by its values at the nodes.
Since Ω is the union of the simplices τ ∈ T , every cpwl function with respect to T is thus
uniquely defined through its values at the nodes. Hence, the desired basis consists of cpwl functions
φη : Ω → R with respect to T such that
where δηµ denotes the Kronecker delta. Assuming φη to be well-defined for the moment, we can
then represent every cpwl function f : Ω → R that vanishes on the boundary ∂Ω as
    f(x) = Σ_{η ∈ V ∩ Ω̊} f(η) φ_η(x)   for all x ∈ Ω.
Note that it suffices to sum over the set of interior nodes V ∩ Ω̊, since f (η) = 0 whenever η ∈
∂Ω. To formally verify existence and well-definedness of φη , we first need a lemma characterizing
the boundary of so-called “patches” of the triangulation: For each η ∈ V, we introduce the patch
ω(η) of the node η as the union of all elements containing η, i.e.,
    ω(η) := ∪_{ {τ ∈ T | η ∈ τ} } τ.   (5.3.5)
We refer to Figure 5.5 for a visualization of Lemma 5.16. The proof of Lemma 5.16 is quite
technical but nonetheless elementary. We therefore only outline the general argument but leave
the details to the reader in Exercise 5.28: The boundary of ω(η) must be contained in the union
Figure 5.5: Visualization of Lemma 5.16 in two dimensions. The patch ω(η) consists of the union of all 2-simplices τ_i containing η. Its boundary consists of the union of all 1-simplices made up by the nodes of each τ_i without the center node, i.e., the convex hulls of V(τ_i)\{η}, for instance co(V(τ_1)\{η}) = co({η_1, η_2}).
of the boundaries of all τ in the patch ω(η). Since η is an interior point of Ω, it must also be
an interior point of ω(η). This can be used to show that for every S := {η i0 , . . . , η ik } ⊆ V (τ ) of
cardinality k + 1 ≤ d, the interior of (the k-dimensional manifold) co(S) belongs to the interior
of ω(η) whenever η ∈ S. Using Exercise 5.28, it then only remains to check that co(S) ⊆ ∂ω(η)
whenever η ∉ S, which yields the claimed formula. We are now in a position to show the well-definedness of the basis functions in (5.3.4).
Lemma 5.17. For each interior node η ∈ V ∩ Ω̊ there exists a unique cpwl function φη : Ω → R
satisfying (5.3.4). Moreover, φη can be expressed by a ReLU neural network with size, width, and
depth bounds that only depend on d and kT .
Proof. By Lemma 5.15, on each τ ∈ T , the affine function φη |τ is uniquely defined through the
values at the nodes of τ . This defines a continuous function φη : Ω → R. Indeed, whenever
τ ∩ τ ′ ̸= ∅, then τ ∩ τ ′ is a subsimplex of both τ and τ ′ in the sense of Definition 5.13 (ii). Thus,
applying Lemma 5.15 again, the affine functions on τ and τ ′ coincide on τ ∩ τ ′ .
Using Lemma 5.15, Lemma 5.16 and the fact that φη (µ) = 0 whenever µ ̸= η, we find that
φη vanishes on the boundary of the patch ω(η) ⊆ Ω. Thus, φη vanishes on the boundary of Ω.
Extending by zero, it becomes a cpwl function φη : Rd → R. This function is nonzero only on
elements τ for which η ∈ τ . Hence, it is a cpwl function with at most n := kT + 1 affine functions.
By Theorem 5.7, φη can be expressed as a ReLU neural network with the claimed size, width and
depth bounds; to apply Theorem 5.7 we used that (the extension of) φη is defined on the convex
domain Rd .
it holds that Φ : Ω → R satisfies Φ(η) = f (η) for all η ∈ V. By Lemma 5.15 this implies that
f equals Φ on each τ , and thus f equals Φ on all of Ω. Since each element τ is the convex hull
of d + 1 nodes η ∈ V, the cardinality of V is bounded by the cardinality of T times d + 1. Thus,
the summation in (5.3.6) is over O(|T |) terms. Using Lemma 5.4 and Lemma 5.17 we obtain the
claimed bounds on size, width, and depth of the neural network.
Definition 5.18. A regular triangulation T is called locally convex if and only if ω(η) is convex
for all interior nodes η ∈ V ∩ Ω̊.
Theorem 5.19. Let d ∈ N, and let Ω ⊆ Rd be a bounded domain. Let T be a locally convex regular
triangulation of Ω. Let f : Ω → R be cpwl with respect to T and f |∂Ω = 0. Then, there exists a
constant C > 0 (independent of d, f and T) and there exists a neural network Φ^f : Ω → R such that Φ^f = f,

    size(Φ^f) ≤ C · (1 + d² k_T |T|),   width(Φ^f) ≤ C · (1 + d log(k_T) |T|),   depth(Φ^f) ≤ C · (1 + log_2(k_T)).
Assume in the following that T is a locally convex triangulation. We will split the proof of the
theorem again into a few lemmata. First, we will show that a convex patch can be written as an
intersection of finitely many half-spaces. Specifically, with the affine hull of a set S defined as
    aff(S) := { Σ_{j=1}^n α_j x_j | n ∈ N, x_j ∈ S, α_j ∈ R, Σ_{j=1}^n α_j = 1 },   (5.3.7)
be the affine hyperplane passing through all nodes in V (τ )\{η}, and let further
Lemma 5.20. Let η be an interior node. Then a patch ω(η) is convex if and only if
    ω(η) = ∩_{ {τ ∈ T | η ∈ τ} } H_+(τ, η).   (5.3.8)
Proof. The right-hand side is a finite intersection of (convex) half-spaces, and thus itself convex. It
remains to show that if ω(η) is convex, then (5.3.8) holds. We start with “⊇”. Suppose x ∉ ω(η).
Then the straight line co({x, η}) must pass through ∂ω(η), and by Lemma 5.16 this implies that
there exists τ ∈ T with η ∈ τ such that co({x, η}) passes through aff(V (τ )\{η}) = H0 (τ, η).
Hence η and x lie on different sides of this affine hyperplane, which shows “⊇”. Now we show “⊆”.
Let τ ∈ T be such that η ∈ τ and fix x in the complement of H+ (τ, η). Suppose that x ∈ ω(η). By
convexity, we then have co({x} ∪ τ ) ⊆ ω(η). This implies that there exists a point in co(V (τ )\{η})
belonging to the interior of ω(η). This contradicts Lemma 5.16. Thus, x ∉ ω(η).
The above lemma allows us to explicitly construct the basis functions φη in (5.3.4). To see this,
denote in the following for τ ∈ T and η ∈ V (τ ) by gτ,η ∈ P1 (Rd ) the affine function such that
    g_τ,η(µ) = { 1 if η = µ,   0 if η ≠ µ }   for all µ ∈ V(τ).
This function exists and is unique by Lemma 5.15. Observe that φη (x) = gτ,η (x) for all x ∈ τ .
Lemma 5.21. Let η ∈ V ∩ Ω̊ be an interior node and let ω(η) be a convex patch. Then
    φ_η(x) = max{ 0, min_{ {τ ∈ T | η ∈ τ} } g_τ,η(x) }   for all x ∈ R^d.   (5.3.9)
Thus
i.e., (5.3.9) holds for all x ∈ R^d \ ω(η). Next, let τ, τ′ ∈ T be such that η ∈ τ and η ∈ τ′. We wish to show that g_τ,η(x) ≤ g_τ′,η(x) for all x ∈ τ. Since g_τ,η(x) = φ_η(x) for all x ∈ τ, this then concludes the proof of (5.3.9). By Lemma 5.20 it holds that

    µ ∈ H_+(τ′, η)   for all µ ∈ V(τ).
Hence, by (5.3.10),

    g_τ′,η(µ) ≥ 0 = g_τ,η(µ)   for all µ ∈ V(τ)\{η}.

Moreover, g_τ,η(η) = g_τ′,η(η) = 1. Thus, g_τ′,η(µ) ≥ g_τ,η(µ) for all µ ∈ V(τ), and therefore g_τ′,η(x) ≥ g_τ,η(x) for all x ∈ co(V(τ)) = τ.
Proof (of Theorem 5.19). For every interior node η ∈ V ∩ Ω̊, the cpwl basis function φη in (5.3.4)
can be expressed as in (5.3.9), i.e.,
    φ_η(x) = σ • Φ^min_{|{τ ∈ T | η ∈ τ}|} • (g_τ,η(x))_{ {τ ∈ T | η ∈ τ} },
where (gτ,η (x)){τ ∈T | η∈τ } denotes the parallelization with shared inputs of the functions gτ,η (x) for
all τ ∈ T such that η ∈ τ .
For this neural network, with |{τ ∈ T | η ∈ τ }| ≤ kT , we have by Lemma 5.2
    size(φ_η) ≤ 4 ( size(σ) + size(Φ^min_{|{τ ∈ T | η ∈ τ}|}) + size((g_τ,η)_{ {τ ∈ T | η ∈ τ} }) ) ≤ 4 (2 + 16 k_T + k_T d)   (5.3.11)

and similarly

    depth(φ_η) ≤ 4 + ⌈log_2(k_T)⌉,   width(φ_η) ≤ max{1, 3 k_T, d}.   (5.3.12)
Since for every interior node the number of simplices touching the node must be larger than or equal to d, we can assume max{k_T, d} = k_T in the following (otherwise there exist no interior nodes, and the function f is constant 0). As in the proof of Theorem 5.14, the neural network

    Φ(x) := Σ_{η ∈ V ∩ Ω̊} f(η) φ_η(x)
realizes the function f on all of Ω. Since the number of nodes |V| is bounded by (d + 1)|T |, an
application of Lemma 5.4 yields the desired bounds.
    ∥f∥_{C^{0,s}(Ω)} := sup_{x∈Ω} |f(x)| + sup_{x≠y∈Ω} |f(x) − f(y)| / ∥x − y∥_2^s,   (5.4.1)
and we denote by C 0,s (Ω) the set of functions f ∈ C 0 (Ω) for which ∥f ∥C 0,s (Ω) < ∞.
Hölder continuous functions can be approximated well by cpwl functions. This leads to the
following result.
Theorem 5.23. Let d ∈ N. There exists a constant C = C(d) such that for every f ∈ C^{0,s}([0, 1]^d) and every N there exists a ReLU neural network Φ^f_N with

and

    sup_{x∈[0,1]^d} |f(x) − Φ^f_N(x)| ≤ C ∥f∥_{C^{0,s}([0,1]^d)} N^{−s/d}.
Proof. For M ≥ 2, consider the set of nodes {ν/M | ν ∈ {−1, . . . , M + 1}d } where ν/M =
(ν1 /M, . . . , νd /M ). These nodes suggest a partition of [−1/M, 1 + 1/M ]d into (2 + M )d sub-
hypercubes. Each such sub-hypercube can be partitioned into d! simplices, such that we obtain a
regular triangulation T with d!(2+M )d elements on [0, 1]d . According to Theorem 5.14 there exists a
neural network Φ that is cpwl with respect to T and Φ(ν/M ) = f (ν/M ) whenever ν ∈ {0, . . . , M }d
and Φ(ν/M ) = 0 for all other (boundary) nodes. It holds
    size(Φ) ≤ C|T| = C d!(2 + M)^d,   width(Φ) ≤ C|T| = C d!(2 + M)^d,   depth(Φ) ≤ C,   (5.4.2)
for a constant C that only depends on d (since for our regular triangulation T , kT in (5.3.2) is a
fixed d-dependent constant).
Let us bound the error. Fix a point x ∈ [0, 1]d . Then x belongs to one of the interior simplices
τ of the triangulation. Two nodes of the simplex have distance at most
    ( Σ_{j=1}^{d} (1/M)² )^{1/2} = √d / M =: ε.
Since Φ|τ is the linear interpolant of f at the nodes V (τ ) of the simplex τ , Φ(x) is a convex
combination of the (f (η))η∈V (τ ) . Fix an arbitrary node η 0 ∈ V (τ ). Then ∥x − η 0 ∥2 ≤ ε and
    |Φ(x) − Φ(η_0)| ≤ max_{η,µ ∈ V(τ)} |f(η) − f(µ)| ≤ sup_{x,y ∈ [0,1]^d, ∥x−y∥_2 ≤ ε} |f(x) − f(y)| ≤ ∥f∥_{C^{0,s}([0,1]^d)} ε^s.

Hence, using f(η_0) = Φ(η_0),

    |f(x) − Φ(x)| ≤ |f(x) − f(η_0)| + |Φ(x) − Φ(η_0)| ≤ 2 ∥f∥_{C^{0,s}([0,1]^d)} ε^s = 2 ∥f∥_{C^{0,s}([0,1]^d)} d^{s/2} M^{−s} = 2 d^{s/2} ∥f∥_{C^{0,s}([0,1]^d)} N^{−s/d},   (5.4.3)
where N := M^d. The statement follows by (5.4.2) and (5.4.3).
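The rate of Theorem 5.23 can already be observed numerically for d = 1. In the sketch below, np.interp plays the role of the cpwl interpolant, and the Hölder-1/2 test function √x is our choice; the scaled error N^{1/2}·error should stay roughly constant.

import numpy as np

f = lambda x: np.sqrt(x)          # f is in C^{0,1/2}([0,1])
xs = np.linspace(0, 1, 100001)    # fine evaluation grid
for N in (10, 100, 1000):
    nodes = np.linspace(0, 1, N + 1)
    err = np.max(np.abs(f(xs) - np.interp(xs, nodes, f(nodes))))
    print(f"N={N:5d}  error={err:.3e}  N^(1/2)*error={err * np.sqrt(N):.3f}")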
The principle behind Theorem 5.23 can be applied in even more generality. Since we can
represent every cpwl function on a regular triangulation with a neural network of size O(N ), where
N denotes the number of elements, most classical (e.g. finite element) approximation theory for
cpwl functions can be lifted to generate statements about ReLU approximation. For instance, it is well known that functions in the Sobolev space H²([0, 1]^d) can be approximated by cpwl functions on a regular triangulation in terms of L²([0, 1]^d) with the rate 2/d, e.g., [80, Chapter 22]. Similarly to the proof of Theorem 5.23, for every f ∈ H²([0, 1]^d) and every N ∈ N there then exists a ReLU neural network Φ_N such that size(Φ_N) = O(N) and
    ∥f − Φ_N∥_{L²([0,1]^d)} ≤ C ∥f∥_{H²([0,1]^d)} N^{−2/d}.
Finally, we may consider how to approximate smoother functions such as f ∈ C k ([0, 1]d ), k > 1,
with ReLU neural networks. As discussed in Chapter 4 for sigmoidal activation functions, larger k
can lead to faster convergence. However, we will see in the following chapter that the emulation of piecewise affine functions on regular triangulations will not yield improved approximation rates as k increases. To leverage such smoothness with ReLU networks, in Chapter 7 we will first build networks that emulate polynomials. Surprisingly, it turns out that polynomials can be approximated
very efficiently by deep ReLU neural networks.
Exercises
Exercise 5.24. Let p : R → R be a polynomial of degree n ≥ 1 (with leading coefficient nonzero)
and let s : R → R be a continuous sigmoidal activation function. Show that the identity map
x 7→ x : R → R belongs to N11 (p; 1, n + 1) but not to N11 (s; L) for any L ∈ N.
Exercise 5.25. Consider cpwl functions f : R → R with n ∈ N0 breakpoints (points where the
function is not C 1 ). Determine the minimal size required to exactly express every such f with a
depth-1 ReLU neural network.
Exercise 5.26. Show that the notion of affine independence is invariant under permutations of the points.
Exercise 5.28. Let τ = co(η_0, . . . , η_d) be a d-simplex. Show that the boundary of τ is given by ∪_{i=0}^{d} co({η_0, . . . , η_d}\{η_i}).
Chapter 6
In the previous chapters, we observed some remarkable approximation results of shallow ReLU
neural networks. In practice, however, deeper architectures are more common. To understand why,
in this chapter we discuss some potential shortcomings of shallow ReLU networks compared to deep
ReLU networks.
Traditionally, an insightful approach to study limitations of ReLU neural networks has been to
analyze the number of linear regions these functions can generate.
Definition 6.1. Let d ∈ N, Ω ⊆ R^d, and let f : Ω → R be cpwl (see Definition 5.5). We say that f has p ∈ N pieces (or linear regions), if p is the smallest number of connected open sets (Ω_i)_{i=1}^p such that ∪_{i=1}^p Ω_i = Ω, and f|_{Ω_i} is an affine function for all i = 1, . . . , p. We denote Pieces(f, Ω) := p.
To get an accurate cpwl approximation of a function, the approximating function needs to have
many pieces. The next theorem, corresponding to [82, Theorem 2], quantifies this statement.
Theorem 6.2. Let −∞ < a < b < ∞ and f ∈ C³([a, b]) so that f is not affine. Then there exists a constant C > 0 depending only on ∫_a^b √(|f″(x)|) dx so that
The proof of the theorem is left to the reader, see Exercise 6.11.
Theorem 6.2 implies that for ReLU neural networks we need architectures allowing for many
pieces, if we want to approximate non-linear functions to high accuracy. How many pieces can we
create for a fixed depth and width? We establish a simple theoretical upper bound in Section 6.1.
Subsequently, we investigate under which conditions these upper bounds are attainable in Section
6.2. Lastly, in Section 6.3, we will discuss the practical relevance of this analysis by examining how
many pieces “typical” neural networks possess. Surprisingly, it turns out that randomly initialized
deep neural networks on average do not have a number of pieces that is anywhere close to the
theoretically achievable maximum.
This holds because the sum is affine in every point where both f1 and f2 are affine. Therefore,
the sum has at most as many break points as f1 and f2 combined. Moreover, the number of
pieces of a univariate function equals the number of its break points plus one.
This is because for each of the affine pieces of f2 —let us call one of those pieces A ⊆ R—we
have that f2 is either constant or injective on A. If it is constant, then f1 ◦ f2 is constant. If
it is injective, then Pieces(f1 ◦ f2 , A) = Pieces(f1 , f2 (A)) ≤ Pieces(f1 , Rd ). Since this holds
for all pieces of f2 we get (6.1.2).
Figure 6.1: Top: Composition of two cpwl functions f1 ◦ f2 can create a piece whenever the value
of f2 crosses a level that is associated to a break point of f1 . Bottom: Addition of two cpwl
functions f1 + f2 produces a cpwl function that can have break points at positions where either f1
or f2 has a break point.
These considerations give the following result, which follows the argument of [268, Lemma 2.1].
We state it for general cpwl activation functions. The ReLU activation function corresponds to
p = 2. Recall that the notation (σ; d0 , . . . , dL+1 ) denotes the architecture of a feedforward neural
network, see Definition 2.1.
Theorem 6.3. Let L ∈ N and let σ be cpwl with p pieces. Then, every neural network Φ with architecture (σ; 1, d_1, . . . , d_L, 1) has at most (p · width(Φ))^L pieces.
Proof. The proof is via induction over the depth L. Let L = 1, and let Φ : R → R be a neural
network of architecture (σ; 1, d1 , 1). Then
    Φ(x) = Σ_{k=1}^{d_1} w_k^{(1)} σ(w_k^{(0)} x + b_k^{(0)}) + b^{(1)}   for x ∈ R,
for certain w(0) , w(1) , b(0) ∈ Rd1 and b(1) ∈ R. By (6.1.1), Pieces(Φ) ≤ p · width(Φ).
For the induction step, assume the statement holds for L ∈ N, and let Φ : R → R be a neural network of architecture (σ; 1, d_1, . . . , d_{L+1}, 1). Then, we can write

    Φ(x) = Σ_{j=1}^{d_{L+1}} w_j σ(h_j(x)) + b   for x ∈ R,

for some w ∈ R^{d_{L+1}}, b ∈ R, and where each h_j is a neural network of architecture (σ; 1, d_1, . . . , d_L, 1). Using the induction hypothesis, each σ ◦ h_j has at most p · (p · width(Φ))^L affine pieces. Hence, Φ has at most width(Φ) · p · (p · width(Φ))^L = (p · width(Φ))^{L+1} affine pieces. This completes the proof.
Theorem 6.3 shows that there are limits to how many pieces can be created with a certain
architecture. It is noteworthy that the effects of the depth and the width of a neural network
are vastly different. While increasing the width can polynomially increase the number of pieces,
increasing the depth can result in exponential increase. This is a first indication of the prowess of
depth of neural networks.
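To get a feeling for this asymmetry, the following back-of-the-envelope computation evaluates the bound (p · width)^L of Theorem 6.3 for p = 2 (ReLU) and a few roughly, though not exactly, comparable width/depth combinations (the specific numbers are our choice).

for width, depth in ((1000, 1), (100, 10), (20, 50)):
    bound = float((2 * width) ** depth)   # (p * width)^L with p = 2
    print(f"width={width:5d}, depth={depth:3d}: (2*width)^depth = {bound:.2e}")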
To understand the effect of this on the approximation problem, we apply the bound of Theorem
6.3 to Theorem 6.2.
Theorem 6.4 gives a lower bound on achievable approximation rates in dependence of the depth
L. As target functions become smoother, we expect that we can achieve faster convergence rates
(cp. Chapter 4). However, without increasing the depth, it seems to be impossible to leverage such
additional smoothness.
This observation strongly indicates that deeper architectures can be superior. Before making
this more concrete, we first explore whether the upper bounds of Theorem 6.3 are also achievable.
6.2 Tightness of upper bounds
We follow [268] to construct a ReLU neural network that realizes the upper bound of Theorem 6.3. First, let h_1 : [0, 1] → R be the hat function

    h_1(x) := { 2x        if x ∈ [0, 1/2],
                2 − 2x    if x ∈ [1/2, 1].
This function can be expressed by a ReLU neural network of depth one and with two nodes
h1 (x) = σReLU (2x) − σReLU (4x − 2) for all x ∈ [0, 1]. (6.2.1a)
We recursively set h_n := h_1 ◦ h_{n−1} for n ≥ 2, i.e., h_n = h_1 ◦ · · · ◦ h_1 is the n-fold composition of h_1. Since h_1 : [0, 1] → [0, 1], we have h_n : [0, 1] → [0, 1]. It turns out that this function has a rather interesting behavior: it is a “sawtooth” function with 2^{n−1} spikes, see Figure 6.2.
Proof. The case n = 1 holds by definition. We proceed by induction, and assume the statement
holds for n. Let x ∈ [0, 1/2] and i ≥ 0 even such that x ∈ [i2^{−(n+1)}, (i + 1)2^{−(n+1)}]. Then 2x ∈ [i2^{−n}, (i + 1)2^{−n}]. Thus
Similarly, if x ∈ [0, 1/2] and i ≥ 1 odd such that x ∈ [i2^{−(n+1)}, (i + 1)2^{−(n+1)}], then h_1(x) = 2x ∈ [i2^{−n}, (i + 1)2^{−n}] and
The case x ∈ [1/2, 1] follows by observing that hn+1 is symmetric around 1/2.
The neural network h_n has size O(n) and is piecewise linear with at least 2^n pieces. This shows that the number of pieces can indeed increase exponentially in the neural network size; also see the upper bound in Theorem 6.3.
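The sawtooth values h_n(k2^{−n}) (0 for even k, 1 for odd k), established by the proof above and used again via Lemma 6.5 in Chapter 7, can be checked numerically with a few lines; the sketch below is one ReLU realization of (6.2.1).

import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def h1(x):
    return relu(2 * x) - relu(4 * x - 2)   # the hat function (6.2.1a)

def hn(x, n):
    for _ in range(n):
        x = h1(x)                          # n-fold composition h_n = h_1 ∘ ... ∘ h_1
    return x

for n in (1, 2, 3, 4, 5):
    k = np.arange(2**n + 1)
    vals = hn(k / 2**n, n)
    print(n, np.allclose(vals, k % 2))     # h_n(k 2^(-n)) = 0 for even k, 1 for odd k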
[Figure 6.2: The sawtooth functions h_1, h_2, and h_3 on [0, 1].]
Figure 6.3: Two randomly initialized neural networks Φ1 and Φ2 with architectures
(σReLU ; 2, 10, 10, 1) and (σReLU ; 2, 5, 5, 5, 5, 5, 1). The initialization scheme was He initialization
[111]. The number of linear regions equals 114 and 110, respectively.
For a single neuron of the form x 7→ σReLU(⟨a, h(x)⟩ + b), where h is a cpwl function, a is a vector, and b is a scalar, many pieces can be generated if ⟨a, h(x)⟩ crosses the −b level often.
If a, b are random variables, and we know that h does not oscillate too much, then we can
quantify the probability of ⟨a, h(x)⟩ crossing the −b level often. The following lemma from [140,
Lemma 3.1] provides the details.
Lemma 6.6. Let c > 0 and let h : [0, c] → R be a cpwl function on [0, c]. Let t ∈ N, let A ⊆ R be
a Lebesgue measurable set, and assume that for every y ∈ A
Then, c∥h′∥_{L∞} ≥ ∥h′∥_{L¹} ≥ |A| · t, where |A| is the Lebesgue measure of A. In particular, if h has at most P ∈ N pieces and ∥h′∥_{L¹} < ∞, then for all δ > 0, t ≤ P,

    P[ |{x ∈ [0, c] | h(x) = U}| ≥ t ] ≤ ∥h′∥_{L¹} / (δ t),
    P[ |{x ∈ [0, c] | h(x) = U}| > P ] = 0,
P [|{x ∈ [0, c] | h(x) = U }| > P ] = 0,
Proof. We will assume c = 1. The general case then follows by considering h̃(x) = h(x/c).
Let for (ci )Pi=1
+1
⊆ [0, 1] with c1 = 0, cP +1 = 1 and ci ≤ ci+1 for all i = 1, . . . , P + 1 the pieces of
h be given by ((ci , ci+1 ))Pi=1 . We denote
and for i = 1, . . . , P + 1
i−1
[
Vei := Vj .
j=1
In words, Ti,n contains the values of A that are hit on Vi for the nth time. Since h is cpwl, we
observe that for all i = 1, . . . , P
(ii) Ti,∞ ∪ ∞
S
n=1 Ti,n = h(Vi ) ∩ A,
(iv) |Ti,∞ | = 0.
Note that, since h is affine on V_i, it holds that h′ = |h(V_i)|/|V_i| on V_i. Hence, for t ≤ P

    ∥h′∥_{L¹} ≥ Σ_{i=1}^{P} |h(V_i)| ≥ Σ_{i=1}^{P} |h(V_i) ∩ A|
             = Σ_{i=1}^{P} ( Σ_{n=1}^{∞} |T_{i,n}| + |T_{i,∞}| )
             = Σ_{i=1}^{P} Σ_{n=1}^{∞} |T_{i,n}|
             ≥ Σ_{n=1}^{t} Σ_{i=1}^{P} |T_{i,n}|,

where the first equality follows by (i), (ii), the second by (iv), and the last inequality by (iii). Note that, by assumption, for all n ≤ t every y ∈ A is an element of T_{i,n} or T_{i,∞} for some i ≤ P. Therefore, by (iv),

    Σ_{i=1}^{P} |T_{i,n}| ≥ |A|,
which completes the proof.
Lemma 6.6 applied to neural networks essentially states that, in a single neuron, if the bias term is chosen uniformly at random on an interval of length δ, then the probability of generating at least t pieces by composition scales like 1/t.
Next, we will analyze how Lemma 6.6 implies an upper bound on the number of pieces generated
in a randomly initialized neural network. For simplicity, we only consider random biases in the
following, but mention that similar results hold if both the biases and weights are random variables
[104].
Definition 6.7. Let L ∈ N, (d0 , d1 , . . . , dL , 1) ∈ NL+2 and W (ℓ) ∈ Rdℓ+1 ×dℓ for ℓ = 0, . . . , L. Fur-
thermore, let δ > 0 and let the bias vectors b(ℓ) ∈ Rdℓ+1 , for ℓ = 0, . . . , L, be random variables such
that each entry of each b(ℓ) is independently and uniformly distributed on the interval [−δ/2, δ/2].
We call the associated ReLU neural network a random-bias neural network.
To apply Lemma 6.6 to a single neuron with random biases, we also need some bound on the
derivative of the input to the neuron.
Definition 6.8. Let L ∈ N, (d0 , d1 , . . . , dL , 1) ∈ NL+2 , and W (ℓ) ∈ Rdℓ+1 ×dℓ and b(ℓ) ∈ Rdℓ+1 for
ℓ = 0, . . . , L. Moreover let δ > 0.
For ℓ = 1, . . . , L + 1, i = 1, . . . , dℓ introduce the functions
where x^{(ℓ−1)} is as in (2.1.1). We call

    ν((W^{(ℓ)})_{ℓ=1}^{L}, δ) := max{ ∥η′_{ℓ,i}( · ; (W^{(j)}, b^{(j)})_{j=0}^{ℓ−1})∥_2 | (b^{(j)})_{j=0}^{L} ∈ ∏_{j=0}^{L} [−δ/2, δ/2]^{d_{j+1}}, ℓ = 1, . . . , L, i = 1, . . . , d_ℓ }

the maximal internal derivative.
Theorem 6.9. Let L ∈ N and let (d_0, d_1, . . . , d_L, 1) ∈ N^{L+2}. Let δ ∈ (0, 1]. Let W^{(ℓ)} ∈ R^{d_{ℓ+1}×d_ℓ}, for ℓ = 0, . . . , L, be such that ν((W^{(ℓ)})_{ℓ=0}^{L}, δ) ≤ C_ν for a C_ν > 0.
For an associated random-bias neural network Φ, we have that for a line segment s ⊆ R^{d_0} of length 1,

    E[Pieces(Φ, s)] ≤ 1 + d_1 + (C_ν/δ) (1 + (L − 1) ln(2 width(Φ))) Σ_{j=2}^{L} d_j.   (6.3.1)
Proof. Let W^{(ℓ)} ∈ R^{d_{ℓ+1}×d_ℓ} for ℓ = 0, . . . , L. Moreover, let b^{(ℓ)} ∈ [−δ/2, δ/2]^{d_{ℓ+1}} for ℓ = 0, . . . , L be uniformly distributed random variables. We denote

    θ_ℓ : s → R^{d_ℓ},   x 7→ (η_{ℓ,i}(x; (W^{(j)}, b^{(j)})_{j=0}^{ℓ−1}))_{i=1}^{d_ℓ}.
Let κ : s → [0, 1] be an isomorphism. Since each coordinate of θℓ is cpwl, there are points
x0 , x1 , . . . , xqℓ ∈ s with κ(xj ) < κ(xj+1 ) for j = 0, . . . , qℓ − 1, such that θℓ is affine (as a function
into Rdℓ ) on [κ(xj ), κ(xj+1 )] for all j = 0, . . . , qℓ − 1 as well as on [0, κ(x0 )] and [κ(xqℓ ), 1].
We will now inductively find an upper bound on the qℓ .
Let ℓ = 2, then
θ2 (x) = W (1) σReLU (W (0) x + b(0) ).
Since W (1) · +b(1) is an affine function, it follows that θ2 can only be non-affine in points where
σReLU (W (0) · +b(0) ) is not affine. Therefore, θ2 is only non-affine if one coordinate of W (0) · +b(0)
intersects 0 nontrivially. This can happen at most d1 times. We conclude that we can choose
q2 = d1 .
Next, let us find an upper bound on qℓ+1 from qℓ . Note that
Now θℓ+1 is affine in every point x ∈ s where θℓ is affine and (θℓ (x) + b(ℓ−1) )i ̸= 0 for all coordinates
i = 1, . . . , dℓ . As a result, we have that we can choose qℓ+1 such that
Therefore, for ℓ ≥ 2,

    q_{ℓ+1} ≤ d_1 + Σ_{j=3}^{ℓ} |{x ∈ s | (θ_j(x) + b^{(j)})_i = 0 for at least one i = 1, . . . , d_j}|
            ≤ d_1 + Σ_{j=2}^{ℓ} Σ_{i=1}^{d_j} |{x ∈ s | η_{j,i}(x) = −b_i^{(j)}}|.
pk,ℓ,i = 0.
It holds
    E[ Σ_{j=2}^{L} Σ_{i=1}^{d_j} |{x ∈ s | η_{j,i}(x) = −b_i^{(j)}}| ]
      ≤ Σ_{j=2}^{L} Σ_{i=1}^{d_j} Σ_{k=1}^{∞} k · P[ |{x ∈ s | η_{j,i}(x) = −b_i^{(j)}}| = k ]
      ≤ Σ_{j=2}^{L} Σ_{i=1}^{d_j} Σ_{k=1}^{∞} k · (p_{k,j,i} − p_{k+1,j,i}).
The inner sum can be bounded by

    Σ_{k=1}^{∞} k · (p_{k,j,i} − p_{k+1,j,i}) = Σ_{k=1}^{∞} k · p_{k,j,i} − Σ_{k=1}^{∞} k · p_{k+1,j,i}
      = Σ_{k=1}^{∞} k · p_{k,j,i} − Σ_{k=2}^{∞} (k − 1) · p_{k,j,i}
      = p_{1,j,i} + Σ_{k=2}^{∞} p_{k,j,i}
      = Σ_{k=1}^{∞} p_{k,j,i}
      ≤ C_ν δ^{−1} Σ_{k=1}^{(2 width(Φ))^{L−1}} 1/k
      ≤ C_ν δ^{−1} ( 1 + ∫_{1}^{(2 width(Φ))^{L−1}} (1/x) dx )
      ≤ C_ν δ^{−1} (1 + (L − 1) ln(2 width(Φ))).
Pieces(Φ, s) ≤ qL+1 + 1
• Non-exponential dependence on depth: If we consider (6.3.1), we see that the number of pieces
scales in expectation essentially like O(LN ), where N is the total number of neurons of the
architecture. This shows that in expectation, the number of pieces is linear in the number of
layers, as opposed to the exponential upper bound of Theorem 6.3.
• Maximal internal derivative: Theorem 6.9 requires the weights to be chosen such that the
maximal internal derivative is bounded by a certain number. However, if they are randomly
initialized in such a way that with high probability the maximal internal derivative is bounded
by a small number, then similar results can be shown. In practice, weights in the ℓ-th layer are often initialized according to a centered normal distribution with standard deviation √(2/d_ℓ) [111]. Since the variance is inversely proportional to the width of the layers, the internal derivatives remain bounded with high probability, independently of the width of the neural networks. This explains the observation from Figure 6.3.
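The following sketch mimics the experiment behind Figure 6.3 and Theorem 6.9: it draws a random-bias ReLU network (He weights, biases uniform on [−δ/2, δ/2] as in Definition 6.7) and counts activation-pattern changes along a line segment as a proxy for Pieces(Φ, s). The counting heuristic and all parameter choices are ours, not the book's.

import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def random_bias_net(widths, delta=1.0):
    # He-initialized weights; biases uniform on [-delta/2, delta/2].
    Ws = [rng.normal(0.0, np.sqrt(2.0 / widths[i]), size=(widths[i + 1], widths[i]))
          for i in range(len(widths) - 1)]
    bs = [rng.uniform(-delta / 2, delta / 2, size=widths[i + 1])
          for i in range(len(widths) - 1)]
    return Ws, bs

def pieces_on_segment(Ws, bs, x0, x1, m=100000):
    # Counts activation-pattern changes along the segment; generically this
    # equals the number of linear pieces of the restriction of the network.
    ts = np.linspace(0.0, 1.0, m)
    Z = x0[:, None] + ts[None, :] * (x1 - x0)[:, None]      # d0 x m sample points
    changes = np.zeros(m - 1, dtype=bool)
    for W, b in zip(Ws[:-1], bs[:-1]):
        A = W @ Z + b[:, None]                              # pre-activations
        changes |= np.any(np.diff((A > 0).astype(np.int8), axis=1) != 0, axis=0)
        Z = relu(A)
    return 1 + int(changes.sum())

Ws, bs = random_bias_net((2, 10, 10, 1))
x0, x1 = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
print(pieces_on_segment(Ws, bs, x0, x1))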
Bibliography and further reading
Establishing bounds on the number of linear regions of a ReLU network has been a popular tool
to investigate the complexity of ReLU neural networks, see [182, 221, 7, 248, 104]. The bound
presented in Section 6.1, is based on [268]. For the construction of the sawtooth function in Section
6.2, we follow the arguments in [268, 269]. Together with the lower bound on the number of
required linear regions given in [82], this analysis shows how depth can be a limiting factor in terms
of achievable convergence rates, as stated in Theorem 6.4. Finally, the analysis of the number of pieces attained by randomly initialized deep neural networks (Section 6.3) is based on [104] and [140].
Exercises
Exercise 6.11. Let −∞ < a < b < ∞ and let f ∈ C 3 ([a, b])\P1 . Denote by p(ε) ∈ N the minimal
number of intervals partitioning [a, b], such that a (not necessarily continuous) piecewise linear
function on p(ε) intervals can approximate f on [a, b] uniformly up to error ε > 0. In this exercise,
we wish to show
    lim inf_{ε↘0} p(ε) √ε > 0.   (6.3.2)

Therefore, we can find a constant C > 0 such that ε ≥ C p(ε)^{−2} for all ε > 0. This shows a variant
of Theorem 6.2. Proceed as follows to prove (6.3.2):
(i) Fix ε > 0 and let a = x_0 < x_1 < · · · < x_{p(ε)} = b be a partitioning into p(ε) pieces. For i = 0, . . . , p(ε) − 1 and x ∈ [x_i, x_{i+1}] let

    e_i(x) := f(x) − ( f(x_i) + (x − x_i) · (f(x_{i+1}) − f(x_i))/(x_{i+1} − x_i) ).
    max_{x∈[x_i,x_{i+1}]} |e_i(x)| = (h_i²/8) |f″(m_i)| + O(h_i³).
(iv) Conclude that (6.3.2) holds for general non-linear f ∈ C 3 ([a, b]).
Exercise 6.12. Show that, for L = 1, Theorem 6.3 holds for piecewise smooth functions, when
replacing the number of affine pieces by the number of smooth pieces. These are defined by replacing
“affine” by “smooth” (meaning C ∞ ) in Definition 6.1.
Exercise 6.13. Show that, for L > 1, Theorem 6.3 does not hold for piecewise smooth functions,
when replacing the number of affine pieces by the number of smooth pieces.
Exercise 6.14. For p ∈ N, p > 2 and n ∈ N, construct a function h_n^{(p)} similar to h_n of (6.5), such that h_n^{(p)} ∈ N11(σReLU; n, p) and such that h_n^{(p)} has p^n pieces and size O(p²n).
Chapter 7
In the previous chapter, we observed that many layers are a necessary prerequisite for ReLU neural
networks to approximate smooth functions with high rates. We now analyze which depth is sufficient
to achieve good approximation rates for smooth functions.
To approximate smooth functions efficiently, one of the main tools in Chapter 4 was to rebuild
polynomial-based functions, such as higher-order B-splines. For smooth activation functions, we
were able to reproduce polynomials by using the nonlinearity of the activation functions. This
argument certainly cannot be repeated for the piecewise linear ReLU. On the other hand, up until
now, we have seen that deep ReLU neural networks are extremely efficient at producing the strongly
oscillating sawtooth functions discussed in Lemma 6.5. The main observation in this chapter is
that the sawtooth functions are intimately linked to the squaring function, which again leads to
polynomials. This observation was first made by Dmitry Yarotsky [292] in 2016, and the present
chapter is primarily based on this paper.
In Sections 7.1 and 7.2, we give Yarotsky’s approximation of the squaring and multiplication
functions. As a direct consequence, we show in Section 7.3 that deep ReLU neural networks
can be significantly more efficient than shallow ones in approximating analytic functions such as
polynomials and (certain) trigonometric functions. Using these tools, we conclude in Section 7.4
that deep ReLU neural networks can efficiently approximate k-times continuously differentiable
functions with Hölder continuous derivatives.
is a piecewise linear function on [0, 1] with break points xn,j = j2−n , j = 0, . . . , 2n . Moreover,
sn (xn,k ) = x2n,k for all k = 0, . . . , 2n , i.e. sn is the piecewise linear interpolant of x2 on [0, 1].
[Figure 7.1: The functions x − x², x − x² − h_1(x)/4, and x − x² − h_1(x)/4 − h_2(x)/16, together with h_1(x)/4, h_2(x)/16, and h_3(x)/64.]
Proof. The statement holds for n = 1. We proceed by induction. Assume the statement holds for s_n and let k ∈ {0, . . . , 2^{n+1}}. By Lemma 6.5, h_{n+1}(x_{n+1,k}) = 0 whenever k is even. Hence for even k ∈ {0, . . . , 2^{n+1}}

    s_{n+1}(x_{n+1,k}) = x_{n+1,k} − Σ_{j=1}^{n+1} h_j(x_{n+1,k})/2^{2j}
                       = s_n(x_{n+1,k}) − h_{n+1}(x_{n+1,k})/2^{2(n+1)} = s_n(x_{n+1,k}) = x²_{n+1,k},

where we used the induction assumption s_n(x_{n+1,k}) = x²_{n+1,k} for x_{n+1,k} = k2^{−(n+1)} = (k/2)·2^{−n} = x_{n,k/2}.
Now let k ∈ {1, . . . , 2^{n+1} − 1} be odd. Then by Lemma 6.5, h_{n+1}(x_{n+1,k}) = 1. Moreover, since s_n is linear on [x_{n,(k−1)/2}, x_{n,(k+1)/2}] = [x_{n+1,k−1}, x_{n+1,k+1}] and x_{n+1,k} is the midpoint of this interval,

    s_{n+1}(x_{n+1,k}) = s_n(x_{n+1,k}) − h_{n+1}(x_{n+1,k})/2^{2(n+1)}
      = (1/2)(x²_{n+1,k−1} + x²_{n+1,k+1}) − 1/2^{2(n+1)}
      = (k − 1)²/2^{2(n+1)+1} + (k + 1)²/2^{2(n+1)+1} − 2/2^{2(n+1)+1}
      = (1/2) · 2k²/2^{2(n+1)} = k²/2^{2(n+1)} = x²_{n+1,k}.

This completes the proof.
82
x s1 (x) s2 (x) sn−1 (x)
Figure 7.2: The neural networks h1 (x) = σReLU (2x) − σReLU (4x − 2) and sn (x) = σReLU (sn−1 (x)) −
hn (x)/22n where hn = h1 ◦ hn−1 . Figure based on [292, Fig. 2c] and [246, Fig. 1a].
Proof. Set en (x) := x2 − sn (x). Let x be in the interval [xn,k , xn,k+1 ] = [k2−n , (k + 1)2−n ] of length
2−n . Since sn is the linear interpolant of x2 on this interval, we have
x2n,k+1 − x2n,k 2k + 1 1
|e′n (x)| = 2x − = 2x − ≤ n.
2−n 2n 2
Thus en : [0, 1] → R has Lipschitz constant 2−n . Since en (xn,k ) = 0 for all k = 0, . . . , 2n , and the
length of the interval [xn,k , xn,k+1 ] equals 2−n we get
1
sup |en (x)| ≤ 2−n 2−n = 2−2n−1 .
x∈[0,1] 2
Finally, to see that sn can be represented by a neural network of the claimed architecture, note
that for n ≥ 2
n
X hj (x) hn (x) h1 ◦ hn−1 (x)
sn (x) = x − = sn−1 (x) − = σReLU ◦ sn−1 (x) − .
22j 2 2n 22n
j=1
Here we used that sn−1 is the piecewise linear interpolant of x2 , so that sn−1 (x) ≥ 0 and thus
sn−1 (x) = σReLU (sn−1 (x)) for all x ∈ [0, 1]. Hence sn is of depth n and width 3, see Figure 7.2.
In conclusion, we have shown that sn : [0, 1] → [0, 1] approximates the square function uniformly
on [0, 1] with exponentially decreasing error in the neural network size. Note that due to Theorem
6.4, this would not be possible with a shallow neural network, which can at best interpolate x2 on
a partition of [0, 1] with polynomially many (w.r.t. the neural network size) pieces.
83
7.2 Multiplication
According to Lemma 7.2, depth can help in the approximation of x 7→ x2 , which, on first sight,
seems like a rather specific example. However, as we shall discuss in the following, this opens
up a path towards fast approximation of functions with high regularity, e.g., C k ([0, 1]d ) for some
k > 1. The crucial observation is that, via the polarization identity we can write the product of
two numbers as a sum of squares
(x + y)2 − (x − y)2
x·y = (7.2.1)
4
for all x, y ∈ R. Efficient approximation of the operation of multiplication allows efficient ap-
proximation of polynomials. Those in turn are well-known to be good approximators for functions
exhibiting k ∈ N derivatives. Before exploring this idea further in the next section, we first make
precise the observation that neural networks can efficiently approximate the multiplication of real
numbers.
We start with the multiplication of two numbers, in which case neural networks of logarithmic
size in the desired accuracy are sufficient, [292, Proposition 3].
Lemma 7.3. For every ε > 0 there exists a ReLU neural network Φ× 2
ε : [−1, 1] → [−1, 1] such that
sup |x · y − Φ×
ε (x, y)| ≤ ε,
x,y∈[−1,1]
Since |a| = σReLU (a) + σReLU (−a), by (7.2.1) we have for all x, y ∈ [−1, 1]
(x + y)2 − (x − y)2
× |x + y| |x − y|
x · y − Φε (x, y) = − sn − sn
4 2 2
4( x+y 2 x−y 2
2 ) − 4( 2 ) 4sn ( |x+y| |x−y|
2 ) − 4sn ( 2 )
= −
4 4
4(2−2n−1 + 2−2n−1 )
≤ = 4−n ≤ ε,
4
where we used |x+y|/2, |x−y|/2 ∈ [0, 1]. We have depth(Φ× ε ) = 1+depth(sn ) = 1+n ≤ 1+⌈log4 (ε)⌉
and size(Φ×
ε ) ≤ C + 2size(s n ) ≤ Cn ≤ C · (1 − log(ε)) for some constant C > 0.
84
The fact that Φ× 2
ε maps from [−1, 1] → [−1, 1] follows by (7.2.2) and because sn : [0, 1] → [0, 1].
Finally, if x = 0, then Φ×
ε (x, y) = sn (|x + y|) − sn (|x − y|) = sn (|y|) − sn (|y|) = 0. If y = 0 the same
argument can be made.
In a similar way as in Proposition 4.8 and Lemma 5.11, we can apply operations with two inputs
in the form of a binary tree to extend them to an operation on arbitrary many inputs; see again
[292], and [246, Proposition 3.3] for the specific argument considered here.
Proposition 7.4. For every n ≥ 2 and ε > 0 there exists a ReLU neural network Φ× n
n,ε : [−1, 1] →
[−1, 1] such that
n
Y
sup xj − Φ×
n,ε (x1 , . . . , xn ) ≤ ε,
xj ∈[−1,1] j=1
Using Lemma 7.3, we find that this neural network has depth bounded by
depth Φ̃× k
2 ,δ
≤ kdepth(Φ×δ ) ≤ Ck · (1 + | log(δ)|) ≤ C log(n)(1 + | log(δ)|).
Y
ek := sup xj − Φ̃×
2k ,δ
(x) .
xj ∈[−1,1]
j≤2k
≤ 2k δ = nδ.
85
k−1
Here we used e1 ≤ δ, and that Φ̃× 2k ,δ
maps [−1, 1]2 to [−1, 1], which is a consequence of Lemma
7.3.
The case for general n ≥ 2 (not necessarily n = 2k ) is treated similar as in Lemma 5.11, by
replacing some Φ× δ neural networks with identity neural networks.
×
Finally, setting δ := ε/n and Φ×n,ε := Φ̃n,δ concludes the proof.
One possibility to approximate p is via the Horner scheme and the approximate multiplication Φ×
ε
from Lemma 7.3, yielding
p(x) = c0 + x · (c1 + x · (· · · + x · cn ) . . . )
≃ c0 + Φ× × ×
ε (x, c1 + Φε (x, c2 · · · + Φε (x, cn )) . . . ).
This scheme requires depth O(n) due to the nested multiplications. An alternative is to approximate
all monomials 1, x, . . . , xn with a binary tree using approximate multiplications Φ×
ε , and combing
them in the output layer, see Figure 7.3. This idea leads to a network of size O(n log(n)) and depth
O(log(n)). The following lemma formalizes this, see [208, Lemma A.5], [77, Proposition III.5], and
in particular [199, Lemma 4.3]. The proof is left as Exercise 7.13.
Lemma 7.5. There exists a constant C > 0, such that for any ε ∈ (0, 1) and any polynomial p of
degree n ≥ 2 as in (7.3.1), there exists a neural network Φpε such that
n
X
sup |p(x) − Φpε (x)| ≤ Cε |cj |
x∈[−1,1] j=0
Lemma 7.5 shows that deep ReLU networks can approximate polynomials efficiently. This
leads to an interesting implication regarding the superiority of deep architectures. Recall that
f : [−1, 1] → R is analytic if its Taylor series around any point x ∈ [−1, 1] converges to f in a
neighbourhood of x. For instance all polynomials, sin, cos, exp etc. are analytic. We now show that
these functions (except linear ones) can be approximated much more efficiently with deep ReLU
networks than by shallow ones: for fixed-depth networks, the number of parameters must grow
faster than any polynomial compared to the required size of deep architectures. More precisely
there holds the following.
86
1 x
1 x x2
1 x x2 x3 x4
Figure 7.3: Monomials 1, . . . , xn with n = 2k can be generated in a binary tree of depth k. Each
node represents the product of its inputs, with single-input nodes interpreted as squares.
Proposition 7.6. Let L ∈ N and let f : [−1, 1] → R be analytic but not linear. Then there exist
constants C, β > 0 such that for every ε > 0, there exists a ReLU neural network Φdeep satisfying
q
sup |f (x) − Φdeep (x)| ≤ C exp − β size(Φdeep ) ≤ ε, (7.3.2)
x∈[−1,1]
but for any ReLU neural network Φshallow of depth at most L holds
for some C̃ = C̃(r, Cr ). Choosing n = n(ε) = ⌈log(ε)/ log(r)⌉, with the bounds from Lemma 7.5
we find that
sup |f (x) − Φpε n (x)| ≤ 2C̃ε
x∈[−1,1]
87
This implies the existence of C, β > 0 and Φdeep as in (7.3.2).
The general case, where the Taylor expansions of f converges only locally is left as Exercise
7.14.
The proposition shows that the approximation of certain (highly relevant) functions requires
significantly more parameters when using shallow instead of deep architectures. Such statements
are known as depth separation results. We refer for instance to [268, 269, 271], where such a result
was shown by Telgarsky based on the sawtooth function constructed in Section 6.2. Lower bounds
on the approximation in the spirit of Proposition 7.6 were also given in [163] and [292].
Remark 7.7. Proposition
√ 7.6 shows in particular that for analytic f : [−1, 1] → R, holds the error
bound exp(−β N ) in terms of the network size N . This can be generalized to multivariate analytic
functions f : [−1, 1]d → R, in which case the bound reads exp(−βN 1/(1+d) ), see [75, 200].
and we denote by C k,s (Ω) the set of functions f ∈ C k (Ω) for which ∥f ∥C k,s (Ω) < ∞.
Lemma 7.9. Let d ∈ N, k ∈ N, s ∈ [0, 1], Ω = [0, 1]d and f ∈ C k,s (Ω). Then for all a, x ∈ Ω
X Dα f (a)
f (x) = (x − a)α + Rk (x) (7.4.2)
α!
{α∈Nd0 | 0≤|α|≤k}
k+1/2
where with h := maxi≤d |ai − xi | we have |Rk (x)| ≤ hk+s d k! ∥f ∥C k,s (Ω) .
88
Proof. First, for a function g ∈ C k (R) and a, t ∈ R
k−1 (j)
X g (a) g (k) (ξ)
g(t) = (t − a)j + (t − a)k
j! k!
j=0
k
X g (j) (a) g (k) (ξ) − g (k) (a)
= (t − a)j + (t − a)k ,
j! k!
j=0
for some ξ between a and t. Now let f ∈ C k,s (Rd ) and a, x ∈ Rd . Thus with g(t) := f (a+t·(x−a))
holds for f (x) = g(1)
k−1 (j)
X g (0) g (k) (ξ)
f (x) = + .
j! k!
j=0
j j! j! Qd
and (x − a)α = − aj )αj .
where we use the multivariate notations α = α! = Qd j=1 (xj
j=1 αj !
Hence
X Dα f (a)
f (x) = (x − a)α
α!
{α∈Nd0 | 0≤|α|≤k}
| {z }
∈Pk
X Dα f (a + ξ · (x − a)) − Dα f (a)
+ (x − a)α ,
α!
|α|=k
| {z }
=:Rk
for some ξ ∈ [0, 1]. Using the definition of h, the remainder term can be bounded by
k α α 1 X k
|Rk | ≤ h max sup |D f (a + t · (x − a)) − D f (a)|
|α|=k x∈Ω k! d
α
t∈[0,1] {α∈N0 | |α|=k}
k+ 2s
d
≤ hk+s ∥f ∥C k,s (Ω) ,
k!
√ k
= (1 + · · · + 1)k = dk by the
P
where we used (7.4.1), ∥x − a∥ ≤ dh, and {α∈Nd0 | |α|=k} α
multinomial formula.
We now come to the main statement of this section. Up to logarithmic terms, it shows the
convergence rate (k + s)/d for approximating functions in C k,s ([0, 1]d ).
89
Theorem 7.10. Let d ∈ N, k ∈ N0 , s ∈ [0, 1], and Ω = [0, 1]d . Then, there exists a constant C > 0
such that for every f ∈ C k,s (Ω) and every N ≥ 2 there exists a ReLU neural network ΦfN such that
k+s
sup |f (x) − ΦfN (x)| ≤ C∥f ∥C k,s (Ω) N − d , (7.4.3)
x∈Ω
Proof. The idea of the proof is to use the so-called “partition of unity method”: First we will
construct a partition of unity (φν )ν , such that for an appropriately chosen M ∈ N each φν has
support on a O(1/M ) neighborhood of a point η ∈ Ω. On each of these neighborhoods P we will use
the local Taylor polynomial pν of f around η to approximate the function. Then ν φν pν gives
an
P approximation to f on Ω. This approximation can be emulated by a neural network of the type
×
ν Φε (φν , p̂ν ), where p̂ν is an neural network approximation to the polynomial pν .
It suffices to show the theorem in the case where
( )
dk+1/2
max , exp(d) ∥f ∥C k,s (Ω) ≤ 1.
k!
90
where (iα,1 , . . . , iα,k ) ∈ {0, . . . , d}k is arbitrary but fixed such that |{j | iα,j = r}| = αr for all
r = 1, . . . , d. Finally, define
ΦfN :=
X
Φ×
ε (φν , p̂ν ). (7.4.7)
ν≤M
Step 2. We bound the approximation error. First, for each x ∈ Ω, using (7.4.5) and (7.4.6)
X X
f (x) − φν (x)pν (x) ≤ |φν (x)||pν (x) − f (x)|
ν≤M ν≤M
Next, fix ν ≤ M and y ∈ Ω such that ∥ν/M − y∥∞ ≤ 1/M ≤ 1. Then by Proposition 7.4
k
X Dα f ν Y νi
M
|pν (y) − p̂ν (y)| ≤ yiα,j − α,j
α! M
|α|≤k j=1
× νiα,1 iα,k
− Φ|α|,ε yiα,1 − , . . . , yiα,k −
M M
X Dα f ( ν )
M
≤ε ≤ ε exp(d)∥f ∥C k,s (Ω) ≤ ε, (7.4.9)
α!
|α|≤k
X X
φν (x)pν (x) − Φ×
ε (φν (x), p̂ν (x))
ν≤M ν≤M
X
≤ (|φν (x)pν (x) − φν (x)p̂ν (x)|
{ν≤M | x∈supp φν }
91
In total, together with (7.4.8)
With our choices in (7.4.4) this yields the error bound (7.4.3).
Step 3. It remains to bound the size and depth of the neural network in (7.4.7).
By Lemma 5.17, for each 0 ≤ ν ≤ M we have
where kT is the maximal number of simplices attached to a node in the mesh. Note that kT is
independent of M , so that the size and depth of φν are bounded by a constant Cφ independent of
M.
Lemma 7.3 and Proposition 7.4 thus imply with our choice of ε = N −(k+s)/d
depth(ΦfN ) = depth(Φ×
ε ) + max depth(φη ) + max depth(p̂ν )
ν≤M ν≤M
≤ C · (1 + | log(ε)| + Cφ ) + depth(Φ×
k,ε )
≤ C · (1 + | log(ε)| + Cφ )
≤ C · (1 + log(N ))
for some constant C > 0 depending on k and d (we use “C” to denote a generic constant that can
change its value in each line).
To bound the size, we first observe with Lemma 5.4 that
X
size(p̂ν ) ≤ C · 1 + size Φ×
|α|,ε
≤ C · (1 + | log(ε)|)
|α|≤k
for some C depending on k. Thus, for the size of ΦfN we obtain with M = ⌈N 1/d ⌉
size(ΦfN ) ≤ C · 1 +
X
size(Φ×
ε ) + size(φν ) + size(p̂ν )
ν≤M
≤ C · (1 + M )d (1 + | log(ε)| + Cφ )
≤ C · (1 + N 1/d )d (1 + Cφ + log(N ))
≤ CN log(N ),
Theorem 7.10 is similar in spirit to [292, Section 3.2]; the main differences are that [292] considers
the class C k ([0, 1]d ) instead of C k,s ([0, 1]d ), and uses an approximate partition of unity, while we
use the exact partition of unity constructed in Chapter 5. Up to logarithmic terms, the theorem
shows the convergence rate (k + s)/d. As long as k is large, in principle we can achieve arbitrarily
large (and d-independent if k ≥ d) convergence rates. In contrast to Theorem 5.23, achieving error
k+s
N − d requires depth O(log(N )), i.e. the neural network depth is required to increase. This can
be avoided however, and networks of depth O(k/d) suffice to attain these convergence rates [208].
92
Remark 7.11. Let L : x 7→ Ax + b : Rd → Rd be a bijective affine transformation and set
Ω := L([0, 1]d ) ⊆ Rd . Then for a function f ∈ C k,s (Ω), by Theorem 7.10 there exists a neural
network ΦfN such that
Since for x ∈ [0, 1]d holds |f (L(x))| ≤ supy∈Ω |f (y)| and if 0 ̸= α ∈ Nd0 is a multiindex |Dα (f (L(x))| ≤
∥A∥|α| supy∈Ω |Dα f (y)|, we have ∥f ◦ L∥C k,s ([0,1]d ) ≤ (1 + ∥A∥k+s )∥f ∥C k,s (Ω) . Thus the convergence
k+s
rate N − d is achieved on every set of the type L([0, 1]d ) for an affine map L, and in particular on
every hypercube ×dj=1 [aj , bj ].
93
Exercises
Exercise 7.12. We show another type of depth separation result: Let d ≥ 2. Prove that there
exist ReLU NNs Φ : Rd → R of depth two, which cannot be represented exactly by ReLU NNs
Φ : Rd → R of depth one.
Hint: Show that nonzero ReLU NNs of depth one necessarily have unbounded support.
Exercise 7.14. Show Proposition 7.6 in the general case where the Taylor series of f only converges
locally (see proof of Proposition 7.6).
Hint: Use the partition of unity method from the proof of Theorem 7.10.
94
Chapter 8
High-dimensional approximation
In the previous chapters we established convergence rates for the approximation of a function f :
[0, 1]d → R by a neural network. For example, Theorem 7.10 provides the error bound O(N −(k+s)/d )
in terms of the network size N (up to logarithmic terms), where k and s describe the smoothness
of f . Achieving an accuracy of ε > 0, therefore, necessitates a network size N = O(ε−d/(k+s) )
(according to this bound). Hence, the size of the network needs to increase exponentially in d.
This exponential dependence on the dimension d is referred to as the curse of dimensionality
[20]. For classical smoothness spaces, such exponential d dependence cannot be avoided [20, 69, 195].
However, functions f that are of interest in practice may have additional properties, which allow
for better convergence rates.
In this chapter, we discuss three scenarios under which the curse of dimensionality can be
mitigated. First, we examine an assumption limiting the behavior of functions in their Fourier
domain. This assumption allows for slow but dimension independent approximation rates. Second,
we consider functions with a specific compositional structure. Concretely, these functions are
constructed by compositions and linear combinations of simple low-dimensional subfunctions. In
this case, the curse of dimension is present but only through the input dimension of the subfunctions.
Finally, we study the situation, where we still approximate high-dimensional functions, but only care
about the approximation accuracy on a lower dimensional submanifold. Here, the approximation
rate is goverened by the smoothness and the dimension of the manifold.
its inverse Fourier transform. Then, for C > 0 the Barron class is defined as
Z
d 1 d
ΓC := f ∈ C(R ) ∃g ∈ L (R ), |ξ||g(ξ)| dξ ≤ C and f = ǧ .
Rd
95
We say that a function f ∈ ΓC has a finite Fourier moment, even though technically the Fourier
transform of f may not be well-defined, since f does not need to be integrable. By the Riemann-
Lebesgue Lemma, [99, Lemma 1.1.1], the condition f ∈ C(Rd ) in the definition of ΓC is automati-
cally satisfied if g ∈ L1 (Rd ) as in the definition exists.
The following proof approximation result for functions in ΓC is due to [13]. The presentation
of the proof is similar to [209, Section 5].
Theorem 8.1. Let σ : R → R be sigmoidal (see Definition 3.11) and let f ∈ ΓC for some C > 0.
Denote by B1d := {x ∈ Rd | ∥x∥ ≤ 1} the unit ball. Then, for every c > 4C 2 and every N ∈ N there
exists a neural network Φf with architecture (σ; d, N, 1) such that
Z
1 2 c
d
f (x) − Φf (x) dx ≤ , (8.1.1)
|B1 | B1d N
Remark 8.2. The approximation rate in (8.1.1) can be slightly improved under some assumptions
on the activation function such as powers of the ReLU, [253].
Importantly, the dimension d does not enter on the right-hand side of (8.1.1), in particular the
convergence rate is not directly affected by the dimension, which is in stark contrast to the results
of the previous chapters. However, it should be noted, that the constant C may still have some
inherent d-dependence, see Exercise 8.10.
The proof of Theorem 8.1 is based on a peculiar property of high-dimensional convex sets, which
is described by the (approximate) Carathéodory theorem, the original version of which was given
in [44]. The more general version stated in the following lemma follows [280, Theorem 0.0.2] and
[13, 212]. For its statement recall that co(G) denotes the the closure of the convex hull of G.
Lemma 8.3. Let H be a Hilbert space, and let G ⊆ H be such that for some B > 0 it holds that
∥g∥H ≤ B for all g ∈ G. Let f ∈ co(G). Then, for every N ∈ N and every c > B 2 there exist
(gi )N
i=1 ⊆ G such that
N 2
1 X c
f− gi ≤ . (8.1.2)
N N
i=1 H
Proof. Fix ε > 0 and N ∈ N. Since f ∈ co(G), there exist coefficients α1 , . . . , αm ∈ [0, 1] summing
to 1, and linearly independent elements h1 , . . . , hm ∈ G such that
m
X
f ∗ := αj hj
j=1
96
satisfies ∥f − f ∗ ∥H < ε. We claim that there exists g1 , . . . , gN , each in {h1 , . . . , hm }, such that
2
N
1 X B2
f∗ − gj ≤ . (8.1.3)
N N
j=1
H
Since ε > 0 was arbitrary, this then concludes the proof. Since there exists an isometric isomorphism
from span{h1 , . . . , hm } to Rm , there is no loss of generality in assuming H = Rm in the following.
Let Xi , i = 1, . . . , N , be i.i.d. Rm -valued random variables with
Lemma 8.3 provides a powerful tool: If we want to approximate a function f with a superposition
of N elements in a set G, then it is sufficient to show that f can be represented as an arbitrary
(infinite) convex combination of elements of G.
Lemma 8.3 suggests that we can prove Theorem 8.1 by showing that each function in ΓC belongs
to the closure of the convex hull of all neural networks with a single neuron, i.e. the set of all affine
transforms of the sigmoidal activation function σ. We make a small detour before proving this
result. We first show that each function f ∈ ΓC is in the closure of the convex hull of the set of
affine transforms of Heaviside functions, i.e. the set
n o
GC := B1d ∋ x 7→ γ · 1R+ (⟨a, x⟩ + b) a ∈ Rd , b ∈ R, |γ| ≤ 2C .
The following lemma, corresponding to [13, Theorem 2] and [209, Lemma 5.12], provides a link
between ΓC and GC .
97
Lemma 8.4. Let d ∈ N, C > 0 and f ∈ ΓC . Then f |B d − f (0) ∈ co(GC ), where the closure is
1
taken with respect to the norm
Z !1/2
1 2
∥g∥L2,⋄ (B d ) := |g(x)| dx . (8.1.5)
1 |B1d | B1d
where κ(ξ) is the phase of g(ξ), i.e. g(ξ) = |g(ξ)|ei κ(ξ) , and the last equality follows since f is
real-valued. Define a measure µ on Rd via its Lebesgue density
1
dµ(ξ) := |ξ||g(ξ)| dξ,
C′
where C ′ :=
R
|ξ||g(ξ)| dξ ≤ C; this is possible since f ∈ ΓC . Then (8.1.6) leads to
cos(⟨x, ξ⟩ + κ(ξ)) − cos(κ(ξ))
Z
′
f (x) − f (0) = C dµ(ξ). (8.1.7)
Rd |ξ|
Step 2. We show that x 7→ f (x) − f (0) is in the L2,⋄ (B1d ) closure of convex combinations of
the functions x 7→ qx (θ), where θ ∈ Rd , and
qx :B1d → R
cos(⟨x, ξ⟩ + κ(ξ)) − cos(κ(ξ)) (8.1.8)
ξ 7→ C ′ .
|ξ|
The cosine function is 1-Lipschitz. Hence for any ξ ∈ Rd the map (8.1.8) is bounded by one. In
addition, it is easy to see that qx is well-defined and continuous even in the origin. Therefore, for
x ∈ B1d , the integral (8.1.7) can be approximated by a Riemann sum, i.e.,
Z X
′
C qx (ξ) dµ(ξ) − C ′ qx (θ) · µ(Iθ ) → 0 as n → ∞ (8.1.9)
Rd 1 d
θ∈ n Z
where Iθ := [0, 1/n)d + θ. Since x 7→ f (x) − f (0) is continuous and thus bounded on B1d , we have
by the dominated convergence theorem that
2
Z
1 X
f (x) − f (0) − C ′ qx (θ) · µ(Iθ ) dx → 0. (8.1.10)
|B1d | B1d 1 d
θ∈ n Z
98
µ(Iθ ) = µ(Rd ) = 1, the claim holds.
P
Since 1 d
θ∈ n Z
Step 3. We prove that x 7→ qx (θ) is in the L2,⋄ (B1d ) closure of convex combinations of GC for
every θ ∈ Rd . Together with Step 2, this then concludes the proof.
Setting z = ⟨x, θ/|θ|⟩, the result follows if the maps
hθ :[−1, 1] → R
cos(|θ|z + κ(θ)) − cos(κ(θ)) (8.1.11)
z 7→ C ′
|θ|
[−1, 1] ∋ z 7→ γ1R+ a′ z + b′ ,
(8.1.12)
T T
X |hθ (i/T ) − hθ ((i − 1)/T )| X |hθ (−i/T ) − hθ ((1 − i)/T )|
+
2C 2C
i=1 i=1
T
2 X ′
≤ ∥hθ ∥L∞ (R) ≤ 1,
2CT
i=1
where we used C ′ ≤ C for the last inequality. We conclude that gT,− + gT,+ is a convex combina-
tion of functions of the form (8.1.12). Hence, hθ can be arbitrarily well approximated by convex
combinations of the form (8.1.12). This concludes the proof.
f |B d − f (0) ∈ co(GC ),
1
where the closure is understood with respect to the norm (8.1.5). It is not hard to see that for
every g ∈ GC it holds that ∥g∥L2,⋄ (B d ) ≤ 2C. Applying Lemma 8.3 with the Hilbert space L2,⋄ (B1d ),
1
we get that for every N ∈ N there exist |γi | ≤ 2C, ai ∈ Rd , bi ∈ R, for i = 1, . . . , N , so that
N 2
4C 2
Z
1 1 X
f (x) − f (0) − γi 1R+ (⟨ai , x⟩ + bi ) dx ≤ .
|B1d | B1d N
i=1
N
99
By Exercise 3.24, it holds that σ(λ·) → 1R+ for λ → ∞ almost everywhere. Thus, for every δ > 0
there exist ãi , b̃i , i = 1, . . . , N , so that
N 2
4C 2
Z
1 1 X
f (x) − f (0) − γi σ ⟨ãi , x⟩ + b̃i dx ≤ + δ.
|B1d | B1d N
i=1
N
The result follows by observing that
N
1 X
γi σ ⟨ãi , x⟩ + b̃i + f (0)
N
i=1
The dimension-independent approximation rate of Theorem 8.1 may seem surprising, especially
when comparing to the results in Chapters 4 and 5. However, this can be explained by recognizing
that the assumption of a finite Fourier moment is effectively a dimension-dependent regularity
assumption. Indeed, the condition becomes more restrictive in higher dimensions and hence the
complexity of ΓC does not grow with the dimension.
To further explain this, let us relate the Barron class to classical function spaces. In [13, Section
II] it was observed that a sufficient condition is that all derivatives of order up to ⌊d/2⌋ + 2 are
square-integrable. In other words, if f belongs to the Sobolev space H ⌊d/2⌋+2 (Rd ), then f is a
Barron function. Importantly, the functions must become smoother, as the dimension increases.
This assumption would also imply an approximation rate of N −1/2 in the L2 norm by sums of
at most N B-splines, see [202, 69]. However, in such estimates some constants may still depend
exponentially on d, whereas all constants in Theorem 8.1 are controlled independently of d.
Another notable aspect of the approximation of Barron functions is that the absolute values
of the weights other than the output weights are not bounded by a constant. To see this, we
refer to (8.1.9), where arbitrarily large θ need to be used. While ΓC is a compact set, the set of
neural networks of the specified architecture for a fixed N ∈ N is not parameterized with a compact
parameter set. In a certain sense, this is reminiscent of Proposition 3.19 and Theorem 3.20, where
arbitrarily strong approximation rates where achieved by using a very complex activation function
and a non-compact parameter space.
100
• exactly one vertex, ηM , has no outgoing edge.
With each vertex ηj for j > d we associate a function fj : Rdj → R. Here dj denotes the
cardinality of the set Sj , which is defined as the set of indices i corresponding to vertices ηi for
which we have an edge from ηi to ηj . Without loss of generality, we assume that m ≥ dj = |Sj | ≥ 1
for all j > d. Finally, we let
and1
we denote the set of all functions of the type FM by F k,s (m, d, M ). Figure 8.1 shows possible
graphs of such functions.
Clearly, for s = 0, F k,0 (m, d, M ) ⊆ C k (Rd ) since the composition of functions in C k belongs
again to C k . A direct application of Theorem 7.10 allows to approximate FM ∈ F k (m, d, M ) with a
k
neural network of size O(N log(N )) and error O(N − d ). Since each fj depends only on m variables,
k
intuitively we expect an error convergence of type O(N − m ) with the constant somehow depending
on the number M of vertices. To show that this is actually possible, in the following we associate
with each node ηj a depth lj ≥ 0, such that lj is the maximum number of edges connecting ηj to
one of the nodes {η1 , . . . , ηd }.
Figure 8.1: Three types of graphs that could be the basis of compositional functions. The associated
functions are composed of two or three-dimensional functions only.
1
The ordering of the inputs (Fi )i∈Sj in (8.2.1b) is arbitrary but considered fixed throughout.
101
Proposition 8.5. Let k, m, d, M ∈ N and s > 0. Let FM ∈ F k,s (m, d, M ). Then there exists a
constant C = C(m, k + s, M ) such that for every N ∈ N there exists a ReLU neural network ΦFM
such that
and
k+s
sup |FM (x) − ΦFM (x)| ≤ N − m .
x∈[0,1]d
Proof. Throughout this proof we assume without loss of generality that the indices follow a topo-
logical ordering, i.e., they are ordered such that Sj ⊆ {1, . . . , j − 1} for all j (i.e. the inputs of
vertex ηj can only be vertices ηi with i < j).
Step 1. First assume that fˆj are functions such that with 0 < ε ≤ 1
|fj (x) − fˆj (x)| ≤ δj := ε · (2m)−(M +1−j) for all x ∈ [−2, 2]dj . (8.2.3)
Let F̂j be defined as in (8.2.1), but with all fj in (8.2.1b) replaced by fˆj . We now check the error
of the approximation F̂M to FM . To do so we proceed by induction over j and show that for all
x ∈ [−1, 1]d
Note that due to ∥fj ∥C k ≤ 1 we have |Fj (x)| ≤ 1 and thus (8.2.4) implies in particular that
F̂j (x) ∈ [−2, 2].
For j = 1 it holds F1 (x1 ) = F̂1 (x1 ) = x1 , and thus (8.2.4) is valid for all x1 ∈ [−1, 1]. For the
induction step, for all x ∈ [−1, 1]d by (8.2.3) and the induction hypothesis
|Fj (x) − F̂j (x)| = |fj ((Fi )i∈Sj ) − fˆj ((F̂i )i∈Sj )|
= |fj ((Fi )i∈Sj ) − fj ((F̂i )i∈Sj )| + |fj ((F̂i )i∈Sj ) − fˆj ((F̂i )i∈Sj )|
X
≤ |Fi − F̂i | + δj
i∈Sj
Here we used that | dxdr fj ((xi )i∈Sj )| ≤ 1 for all r ∈ Sj so that by the triangle inequality and the
mean value theorem
X
|fj ((xi )i∈Sj ) − fj ((yi )i∈Sj )| ≤ |f ((xi )i∈Sj , (yi )i∈Sj ) − f ((xi )i∈Sj , (yi )i∈Sj )|
r∈Sj i≤r i>r i<r i≥r
X
≤ |xr − yr |.
r∈Sj
102
This shows that (8.2.4) holds, and thus for all x ∈ [−1, 1]d
Step 2. We sketch a construction, of how to write F̂M from Step 1 as a neural network ΦFM
of the asserted size and depth bounds. Fix N ∈ N and let
m
Nj := ⌈N (2m) k+s (M +1−j) ⌉.
and
fj
m(M +1−j) m(M + 1 − j)
size(Φ ) ≤ CNj log(Nj ) ≤ CN (2m) k+s log(N ) + log(2m)
k+s
as well as
fj m(M + 1 − j)
depth(Φ ) ≤ C · log(N ) + log(2m) .
k+s
Then
M M
X X m(M +1−j)
size(Φfj ) ≤ 2CN log(N ) (2m) k+s
j=1 j=1
M j
X m
≤ 2CN log(N ) (2m) k+s
j=1
m(M +1)
≤ 2CN log(N )(2m) k+s .
PM j
R M +1 1 M +1 .
Here we used j=1 a ≤ 1 exp(log(a)x) dx ≤ log(a) a
− k+s
The function F̂M from Step 1 then will yield error N by (8.2.3) and (8.2.5). We observe that
m
F̂M can be constructed inductively as a neural network ΦFM by propagating all values ΦF1 , . . . , Φ̂Fj
to all consecutive layers using identity neural networks and then using the outputs of (ΦFi )i∈Sj+1
as input to Φfj+1 . The depth of this neural network is bounded by
M
X
depth(Φfj ) = O(M log(N )).
j=1
We have at most M
P
j=1 |Sj | ≤ mM values which need to be propagated through these O(M log(N ))
layers, amounting to an overhead O(mM 2 log(N )) = O(log(N )) for the identity neural networks.
In all the neural network size is thus O(N log(N )).
Remark 8.6. From the proof we observe that the constant C in Proposition 8.5 behaves like
m(M +1)
O((2m) k+s ).
103
M
104
where
Proof. Since M is compact there exists A > 0 such that M ⊆ [−A, A]d . Similar as in the proof of
ν
Theorem 7.10, we consider a uniform mesh with nodes {−A + 2AP n | ν ≤ n}, and the corresponding
piecewise linear basis functions forming the partition of unity ν≤n φν ≡ 1 on [−A, A]d where
supp φν ⊆ {y ∈ Rd | ∥ νn − y∥∞ ≤ A n }. Let δ > 0 be as in the beginning of this section. Since M is
covered by the balls (Bδ/2 (xj ))M j=1 fixing n ∈ N large enough, for each ν such that supp φν ∩M =
, ̸ ∅
there exists j(ν) ∈ {1, . . . , M } such that supp φν ⊆ Bδ (xj(ν) ) and we set Ij := {ν ≤ n | j = j(ν)}.
Using (8.3.3) we then have for all x ∈ M
X M X
X
f (x) = φν (x)f (x) = φν (x)fj (πj (x)). (8.3.4)
ν≤n j=1 ν∈Ij
Next, we approximate the functions fj . Let Cj be the smallest (m-dimensional) cube in Txj M ≃
Rm such that πj (Bδ (xj ) ∩ M) ⊆ Cj . The function fj can be extended to a function on Cj (we will
use the same notation for this extension) such that
for some constant depending on πj (Bδ (xj ) ∩ M) but independent of f . Such an extension result
can, for example, be found in [257, Chapter VI]. By Theorem 7.10 (also see Remark 7.11), there
exists a neural network fˆj : Cj → R such that
k+s
sup |fj (x) − fˆj (x)| ≤ CN − m (8.3.5)
x∈Cj
and
105
k+s
To approximate f in (8.3.4) we now let with ε := N − m
M X
X
ΦN := Φ× ˆ
ε (φν , fi ◦ πj ),
j=1 ν∈Ij
where we note that πj is linear and thus fˆj ◦ πj can be expressed by a neural network. First let us
estimate the error of this approximation. For x ∈ M
M X
X
|f (x) − ΦN (x)| ≤ |φν (x)fj (πj (x)) − Φ× ˆ
ε (φν (x), fj (πj (x)))|
j=1 ν∈Ij
M X
X
≤ |φν (x)fj (πj (x)) − φν (x)fˆj (πj (x))|
j=1 ν∈Ij
+|φν (x)fˆj (πj (x)) − Φ×
ε (φν (x), ˆj (πj (x)))|
f
M X
X M
X X
≤ sup ∥fi − fˆi ∥L∞ (Ci ) |φν (x)| + ε
i≤M j=1 ν∈Ij j=1 {ν∈Ij | x∈supp φν }
k+s k+s
≤ CN − m + dε ≤ CN − m ,
where we used that x can be in the support of at most d of the φν , and where C is a constant
depending on d and M.
Finally, let us bound the size and depth of this approximation. Using size(φν ) ≤ C, depth(φν ) ≤
C (see (5.3.12)) and size(Φ× ×
ε ) ≤ C log(ε) ≤ C log(N ) and depth(Φε ) ≤ Cdepth(ε) ≤ C log(N ) (see
Lemma 7.3) we find
M X
X XM X
size(Φ×
ε ) + size(φ ν ) + size(fˆi ◦ π j ) ≤ C log(N ) + C + CN log(N )
j=1 ν∈Ij j=1 ν∈Ij
106
The compositionality assumption of Section 8.2 was discussed in the form presented in [214].
An alternative approach, known as the hierarchical composition/interaction model, was studied in
[146].
The manifold assumption discussed in Section 8.3 is frequently found in the literature, with
notable examples including [249, 53, 48, 240, 185, 145].
Another prominent direction, omitted in this chapter, pertains to scientific machine learn-
ing. High-dimensional functions often arise from (parametric) PDEs, which have a rich literature
describing their properties and structure. Various results have shown that neural networks can
leverage the inherent low-dimensionality known to exist in such problems. Efficient approximation
of certain classes of high-dimensional (or even infinite-dimensional) analytic functions, ubiquitous
in parametric PDEs, has been verified in [246, 247]. Further general analyses for high-dimensional
parametric problems can be found in [201, 149], and results exploiting specific structural conditions
of the underlying PDEs, e.g., in [152, 235]. Additionally, [75, 180, 200] provide results regarding
fast convergence for certain smooth functions in potentially high but finite dimensions.
For high-dimensional PDEs, elliptic problems have been addressed in [100], linear and semilin-
ear parabolic evolution equations have been explored in [101, 93, 125], and stochastic differential
equations in [134, 102].
107
Exercises
Exercise 8.8. Let C > 0 and d ∈ N. Show that, if g ∈ ΓC , then
for every a ∈ R+ , b ∈ Rd .
Exercise 8.9. Let C > 0 and d ∈ N. Show that, for gi ∈ ΓC , i = 1, . . . , m and c = (ci )m
i=1 it holds
that
Xm
ci gi ∈ Γ∥c∥1 C .
i=1
Exercise 8.10. Show that √ for every d ∈ N the function f (x) := exp(−∥x∥22 /2), x ∈ Rd , belongs
to Γd , and it holds Cf = O( d), for d → ∞.
108
Chapter 9
Interpolation
The learning problem associated to minimizing the empirical risk of (1.2.3) is based on minimizing
an error that results from evaluating a neural network on a finite set of (training) points. In
contrast, all previous approximation results focused on achieving uniformly small errors across the
entire domain. Finding neural networks that achieve a small training error appears to be much
simpler, since, instead of ∥f − Φn ∥∞ → 0 for a sequence of neural networks Φn , it suffices to have
Φn (xi ) → f (xi ) for all xi in the training set.
In this chapter, we study the extreme case of the aforementioned approximation problem. We
analyze under which conditions it is possible to find a neural network that coincides with the target
function f at all training points. This is referred to as interpolation. To make this notion more
precise, we state the following definition.
Definition 9.1 (Interpolation). Let d, m ∈ N, and let Ω ⊆ Rd . We say that a set of functions
H ⊆ {h : Ω → R} interpolates m points in Ω, if for every S = (xi , yi )m i=1 ⊆ Ω × R, such that
xi ̸= xj for i ̸= j, there exists a function h ∈ H such that h(xi ) = yi for all i = 1, . . . , m.
109
We start our analysis of the interpolation properties of neural networks by presenting a result
similar to the universal approximation theorem but for interpolation in the following section. In
the subsequent section, we then look at interpolation with desirable properties.
Example 9.2. Let H := {f ∈ C 0 ([0, 1]) | f (0) ∈ Q}. Then H is dense in C 0 ([0, 1]), but H does not
even interpolate one point in [0, 1].
♢
Moreover, Theorem 3.8 is an asymptotic result that only states that a given function can be
approximated for sufficiently large neural network architectures, but it does not state how large
the architecture needs to be.
Surprisingly, Theorem 3.8 can nonetheless be used to give a guarantee that a fixed-size archi-
tecture yields sets of neural networks that allow the interpolation of m points. This result is due
to [211]; for a more detailed discussion of previous results see the bibliography section. Due to its
similarity to the universal approximation theorem and the fact that it uses the same assumptions,
we call the following theorem the “Universal Interpolation Theorem”. For its statement recall the
definition of the set of allowed activation functions M in (3.1.1) and the class Nd1 (σ, 1, n) of shallow
neural networks of width n introduced in Definition 3.6.
Theorem 9.3 (Universal Interpolation Theorem). Let d, n ∈ N and let σ ∈ M not be a polynomial.
Then Nd1 (σ, 1, n) interpolates n + 1 points in Rd .
110
Then A being regular implies that for each (yi )n+1 n
i=1 exist c and (vj )j=1 such that (9.1.1) holds.
Hence, it suffices to find (wj )nj=1 and (bj )nj=1 such that A is regular.
To do so, we proceed by induction over k = 0, . . . , n, to show that there exist (wj )kj=1 and
(bj )kj=1 such that the first k + 1 columns of A are linearly independent. The case k = 0 is trivial.
Next let 0 < k < n and assume that the first k columns of A are linearly independent. We wish to
find wk , bk such that the first k + 1 columns are linearly independent. Suppose such wk , bk do not
exist and denote by Yk ⊆ Rn+1 the space spanned by the first k columns of A. Then for all w ∈ Rn ,
b ∈ R the vector (σ(w⊤ xi + b))n+1 i=1 ∈ R
n+1 must belong to Y . Fix y = (y )n+1 ∈ Rn+1 \Y . Then
k i i=1 k
n+1
XX N 2
inf ∥(Φ̃(xi ))n+1
i=1 − y∥22 = inf vj σ(w⊤
j xi + bj ) + c − yi
Φ̃∈Nd1 (σ,1) N,wj ,bj ,vj ,c
i=1 j=1
Since we can find a continuous function f : Rd → R such that f (xi ) = yi for all i = 1, . . . , n + 1,
this contradicts Theorem 3.8.
9.2.1 Motivation
In the previous section, we observed that neural networks with m − 1 ∈ N hidden neurons can
interpolate m points for every reasonable activation function. However, not all interpolants are
equally suitable for a given application. For instance, consider Figure 9.1 for a comparison between
polynomial and piecewise affine interpolation on the unit interval.
The two interpolants exhibit rather different behaviors. In general, there is no way of deter-
mining which constitutes a better approximation to f . In particular, given our limited information
about f , we cannot accurately reconstruct any additional features that may exist between inter-
polation points x1 , . . . , xm . In accordance with Occam’s razor, it thus seems reasonable to assume
that f does not exhibit extreme oscillations or behave erratically between interpolation points.
As such, the piecewise interpolant appears preferable in this scenario. One way to formalize the
assumption that f does not “exhibit extreme oscillations” is to assume that the Lipschitz constant
|f (x) − f (y)|
Lip(f ) := sup (9.2.1)
x̸=y ∥x − y∥
111
Figure 9.1: Interpolation of eight points by a polynomial of degree seven and by a piecewise affine
spline. The polynomial interpolation has a significantly larger derivative or Lipschitz constant than
the piecewise affine interpolator.
we have
|f (x) − f (y)| |yi − yj |
Lip(f ) = sup ≥ sup =: M̃ . (9.2.3)
x̸=y∈Ω ∥x − y∥ i̸=j ∥xi − xj ∥
Because of this, we fix M as a real number greater than or equal to M̃ for the remainder of our
analysis.
denoting the set of all functions with Lipschitz constant at most M , we want to solve the following
problem:
Problem 9.4. We wish to find an element
The next theorem shows that a function Φ as in (9.2.5) indeed exists. This Φ not only allows
for an explicit formula, it also belongs to LipM (Ω) and additionally interpolates the data. Hence,
it is not just an optimal reconstruction, it is also an optimal interpolant. This theorem goes back
to [17], which, in turn, is based on [261].
112
Theorem 9.5. Let m, d ∈ N, Ω ⊆ Rd , f : Ω → R, and let x1 , . . . , xm ∈ Ω, y1 , . . . , ym ∈ R satisfy
(9.2.2) and (9.2.3) with M̃ ≥ 0. Further, let M ≥ M̃ .
Then, Problem 9.4 has at least one solution given by
1
Φ(x) := (fupper (x) + flower (x)) for x ∈ Ω, (9.2.6)
2
where
Moreover, Φ ∈ LipM (Ω) and Φ interpolates the data (i.e. satisfies (9.2.2)).
Proof. First we claim that for all h1 , h2 ∈ LipM (Ω) holds max{h1 , h2 } ∈ LipM (Ω) as well as
min{h1 , h2 } ∈ LipM (Ω). Since min{h1 , h2 } = − max{−h1 , −h2 }, it suffices to show the claim for
the maximum. We need to check that
| max{h1 (x), h2 (x)} − max{h1 (y), h2 (y)}|
≤M (9.2.7)
∥x − y∥
for all x ̸= y ∈ Ω. Fix x ̸= y. Without loss of generality we assume that
max{h1 (x), h2 (x)} ≥ max{h1 (y), h2 (y)} and max{h1 (x), h2 (x)} = h1 (x).
If max{h1 (y), h2 (y)} = h1 (y) then the numerator in (9.2.7) equals h1 (x) − h1 (y) which is bounded
by M ∥x − y∥. If max{h1 (y), h2 (y)} = h2 (y), then the numerator equals h1 (x) − h2 (y) which is
bounded by h1 (x) − h1 (y) ≤ M ∥x − y∥. In either case (9.2.7) holds.
Clearly, x 7→ yk −M ∥x−xk ∥ ∈ LipM (Ω) for each k = 1, . . . , m and thus fupper , flower ∈ LipM (Ω)
as well as Φ ∈ LipM (Ω).
Next we claim that for all f ∈ LipM (Ω) satisfying (9.2.2) holds
flower (x) ≤ f (x) ≤ fupper (x) for all x ∈ Ω. (9.2.8)
This is true since for every k ∈ {1, . . . , m} and x ∈ Ω
|yk − f (x)| = |f (xk ) − f (x)| ≤ M ∥x − xk ∥
so that for all x ∈ Ω
f (x) ≤ min (yk + M ∥x − xk ∥), f (x) ≥ max (yk − M ∥x − xk ∥).
k=1,...,m k=1,...,m
Since fupper , flower ∈ LipM (Ω) satisfy (9.2.2), we conclude that for every h : Ω → R holds
sup sup |f (x) − h(x)| ≥ sup max{|flower (x) − h(x)|, |fupper (x) − h(x)|}
f ∈LipM (Ω) x∈Ω x∈Ω
f satisfies (9.2.2)
|flower (x) − fupper (x)|
≥ sup . (9.2.9)
x∈Ω 2
113
Moreover, using (9.2.8),
sup sup |f (x) − Φ(x)| ≤ sup max{|flower (x) − Φ(x)|, |fupper (x) − Φ(x)|}
f ∈LipM (Ω) x∈Ω x∈Ω
f satisfies (9.2.2)
|flower (x) − fupper (x)|
= sup . (9.2.10)
x∈Ω 2
Figure 9.2 depicts fupper , flower , and Φ for the interpolation problem shown in Figure 9.1, while
Figure 9.3 provides a two-dimensional example.
Figure 9.2: Interpolation of the points from Figure 9.1 with the optimal Lipschitz interpolant.
114
Figure 9.3: Two-dimensional example of the interpolation method of (9.2.6). From top left to
bottom we see fupper , flower , and Φ. The interpolation points (xi , yi )6i=1 are marked with red
crosses.
115
Then, there exists a ReLU neural network Φ ∈ LipM (Ω) that interpolates the data (i.e. satisfies
(9.2.2)) and satisfies
Moreover, depth(Φ) = O(log(m)), width(Φ) = O(dm) and all weights of Φ are bounded in absolute
value by max{M, ∥y∥∞ }.
Proof. To prove the result, we simply need to show that the function in (9.2.6) can be expressed as
a ReLU neural network with the size bounds described in the theorem. First we notice, that there
is a simple ReLU neural network that implements the 1-norm. It holds for all x ∈ Rd that
d
X
∥x∥1 = (σ(xi ) + σ(−xi )) .
i=1
Thus, there exists a ReLU neural network Φ∥·∥1 such that for all x ∈ Rd
for all x ∈ Rd . Using the parallelization of neural networks introduced in Section 5.1.3, there exists
a ReLU neural network Φall := (Φ1 , . . . , Φm ) : Rd → Rm such that
and
Using Lemma 5.11, we can now find a ReLU neural network Φupper such that Φupper = fupper (x)
for all x ∈ Ω, width(Φupper ) ≤ max{16m, 4md}, and depth(Φupper ) ≤ 1 + log(m).
Essentially the same construction yields a ReLU neural network Φlower with the respective
properties. Lemma 5.4 then completes the proof.
116
limx→−∞ σ(x) = 0, and limx→∞ σ(x) = 1. This result was improved in [122], which dropped the
nondecreasing assumption on σ.
The main idea of the optimal Lipschitz interpolation theorem in Section 9.2 is due to [17]. A
neural network construction of Lipschitz interpolants, which however is not the optimal interpolant
in the sense of Problem 9.4, is given in [133, Theorem 2.27].
117
Exercises
Exercise 9.7. Under the assumptions of Theorem 9.5, we define for x ∈ Ω the set of nearest
neighbors by
Ix := argmini=1,...,m ∥xi − x∥.
The one-nearest-neighbor classifier f1NN is defined by
1
f1NN (x) = (min yi + max yi ).
2 i∈Ix i∈Ix
118
Chapter 10
Up to this point, we have discussed the representation and approximation of certain function classes
using neural networks. The second pillar of deep learning concerns the question of how to fit a
neural network to given data, i.e., having fixed an architecture, how to find suitable weights and
biases. This task amounts to minimizing a so-called objective function such as the empirical risk
R̂S in (1.2.3). Throughout this chapter we denote the objective function by
f : Rn → R,
and interpret it as a function of all neural network weights and biases collected in a vector in Rn .
The goal1 is to (approximately) determine a minimizer, i.e., some w∗ ∈ Rn satisfying
Standard approaches primarily include variants of (stochastic) gradient descent. These are the
focus of the present chapter, in which we discuss basic ideas and results in convex optimization
using gradient-based algorithms. In Sections 10.1, 10.2, and 10.3, we explore gradient descent,
stochastic gradient descent, and accelerated gradient descent, and provide convergence proofs for
smooth and strongly convex objectives. Section 10.4 discusses adaptive step sizes and explains
the core principles behind popular algorithms such as Adam. Finally, Section 10.5 introduces the
backpropagation algorithm, which enables the efficient application of gradient-based methods to
neural network training.
119
This shows that the change in f around wk is locally described by the gradient ∇f (wk ). For
small v the contribution of the second order term is negligible, and the direction v along which the
decrease of the risk is maximized equals the negative gradient −∇f (wk ).
Thus, −∇f (wk ) is also called the direction of steepest descent. This leads to an update of the
form
where hk > 0 is referred to as the step size or learning rate. We refer to this iterative algorithm
as gradient descent.
Figure 10.1: Two examples of gradient descent as defined in (10.1.2). The red points represent the
wk .
so that if ∇f (wk ) ̸= 0, a small enough step size hk ensures that the algorithm decreases the value
of the objective function. In practice, tuning the learning rate hk can be a subtle issue as it should
strike a balance between the following dissenting requirements:
(i) hk needs to be sufficiently small so that the second-order term in (10.1.3) is not dominating,
and the update (10.1.2) decreases the objective function.
(ii) hk should be large enough to ensure significant decrease of the objective function, which
facilitates faster convergence of the algorithm.
A learning rate that is too high might overshoot the minimum, while a rate that is too low results
in slow convergence. Common strategies include, in particular, constant learning rates (hk = h
for all k ∈ N0 ), learning rate schedules such as decaying learning rates (hk ↘ 0 as k → ∞), and
adaptive methods. For adaptive methods the algorithm dynamically adjusts hk based on the values
of f (wj ) or ∇f (wj ) for j ≤ k.
120
smooth convex strongly convex
Figure 10.2: The graph of L-smooth functions lies between two quadratic functions at each point,
see (10.1.4), the graph of convex function lies above the tangent at each point, see (10.1.8), and
the graph of µ-strongly convex functions lies above a convex quadratic function at each point, see
(10.1.9).
(i) smooth if, at each w ∈ Rn , f is bounded above and below by a quadratic function that
touches its graph at w,
(iii) strongly convex if, at each w ∈ Rn , f lies above its tangent at w plus a convex quadratic
term.
These concepts are illustrated in Figure 10.2. We next give the precise mathematical definitions.
By definition, L-smooth functions satisfy the geometric property (i). In the literature, L-
smoothness is often instead defined as Lipschitz continuity of the gradient
Lemma 10.2. Let L > 0. Then f ∈ C 1 (Rn ) is L-smooth if and only if (10.1.5) holds.
121
Proof. We show that (10.1.4) implies (10.1.5). To this end assume first that f ∈ C 2 (Rn ), and that
(10.1.5) does not hold. Then we can find w ̸= v with
Z 1
w−v
∥w − v∥ sup e⊤ ∇2 f (v + t(w − v)) dt = ∥∇f (w) − ∇f (v)∥ > L∥w − v∥,
∥e∥=1 0 ∥w − v∥
where ∇2 f ∈ Rn×n denotes the Hessian. Since the Hessian is symmetric, this implies existence of
u, e ∈ Rn with ∥e∥ = 1 and |e⊤ ∇2 f (u)e| > L. Without loss of generality
Continuity of t 7→ e⊤ ∇2 f (u + te)e and (10.1.6) implies that for h > 0 small enough
Z h
f (u + he) > f (u) + h ⟨∇f (u), e⟩ + L (h − t) dt
0
L
= f (u) + ⟨∇f (u), he⟩ + ∥he∥2 .
2
This contradicts (10.1.4a).
Now let f ∈ C 1 (Rn ) and assume again that (10.1.5) does not hold. Then for every ε > 0 and
every compact set K ⊆ Rn there exists fε,K ∈ C 2 (Rn ) such that ∥f − fε,K ∥C 1 (K) < ε. By choosing
ε > 0 sufficiently small and K sufficiently large, it follows that fε,K violates (10.1.5). Consequently,
by the previous argument, fε,K must also violate (10.1.4), which in turn implies that f does not
satisfy (10.1.4) either.
Finally, the fact that (10.1.5) implies (10.1.4) is left as Exercise 10.17.
In case f is continuously differentiable, this is equivalent to the geometric property (ii) as the
next lemma shows. The proof is left as Exercise 10.18.
122
The concept of convexity is strengthened by so-called strong-convexity, which requires an addi-
tional positive quadratic term on the right-hand side of (10.1.8), and thus corresponds to geometric
property (iii) by definition.
A convex function need not be bounded from below (e.g. w 7→ w) and thus need not have any
(global) minimizers. And even if it is bounded from below, there need not exist minimizers (e.g.
w 7→ exp(w)). However we have the following statement.
(i) convex, then the set of minimizers of f is convex and has cardinality 0, 1, or ∞,
Proof. Let f be convex, and assume that w∗ and v ∗ are two minimizers of f . Then every convex
combination λw∗ + (1 − λ)v ∗ , λ ∈ [0, 1], is also a minimizer due to (10.1.7). This shows the first
claim.
Now let f be µ-strongly convex. Then (10.1.9) implies f to be lower bounded by a convex
quadratic function. Hence there exists at least one minimizer w∗ , and ∇f (w∗ ) = 0. By (10.1.9)
we then have f (v) > f (w∗ ) for every v ̸= w∗ .
The constant c is also referred to as the rate of convergence. Before giving the statement, we first
note that comparing (10.1.4a) and (10.1.9) it necessarily holds L ≥ µ and therefore κ := L/µ ≥ 1.
This term, known as the condition number of f , determines the rate of convergence.
123
Theorem 10.7. Let n ∈ N and L ≥ µ > 0. Let f : Rn → R be L-smooth and µ-strongly convex.
Further, let hk = h ∈ (0, 1/L] for all k ∈ N0 , let (wk )∞ n
k=0 ⊆ R be defined by (10.1.2), and let w ∗
be the unique minimizer of f .
Then, f (wk ) → f (w∗ ) and wk → w∗ converge linearly for k → ∞. For the specific choice
h = 1/L it holds for all k ∈ N
µ k
∥wk − w∗ ∥2 ≤ 1 − ∥w0 − w∗ ∥2 (10.1.10a)
L
L µ k
f (wk ) − f (w∗ ) ≤ 1− ∥w0 − w∗ ∥2 . (10.1.10b)
2 L
Proof. It suffices to show (10.1.10a), since (10.1.10b) follows directly by (10.1.10a) and (10.1.4a)
because ∇f (w∗ ) = 0. The case k = 0 is trivial, so let k ∈ N.
Expanding wk = wk−1 − h∇f (wk−1 ) and using µ-strong convexity (10.1.9)
To bound the sum of the last two terms, we first use (10.1.4a) to get
L
f (wk ) ≤ f (wk−1 ) + ⟨∇f (wk−1 ), −h∇f (wk−1 )⟩ + ∥h∇f (wk−1 )∥2
2
L 1
= f (wk−1 ) + − h2 ∥∇f (wk−1 )∥2
2 h
and therefore
If 2h ≥ 1/(1/h − L/2), which is equivalent to h ≤ 1/L, then the last term is less or equal to zero.
Hence (10.1.11) implies for h ≤ 1/L
124
Remark 10.8 (Convex objective functions). Let f : Rn → R be a convex and L-smooth function
with a minimizer at w∗ . As shown in Lemma 10.6, the minimizer need not be unique, so we cannot
expect wk → w∗ in general. However, the objective values still converge. Specifically, under these
assumptions, the following holds [189, Theorem 2.1.14, Corollary 2.1.2]: If hk = h ∈ (0, 2/L) for all
k ∈ N0 and (wk )∞ n
k=0 ⊆ R is generated by (10.1.2), then
wk+1 := wk − hk Gk , (10.2.2)
where hk > 0 denotes again the step size. Unlike in Section 10.1, we focus here on k-dependent
step sizes hk .
Example 10.9 (Empirical risk minimization). Suppose we have a data sample S := (xj , yj )m j=1 ,
d
where yj ∈ R is the label corresponding to the data point xj ∈ R . Using the square loss, we wish
to fit a neural network Φ(·, w) : Rd → R depending on parameters (i.e. weights and biases) w ∈ Rn ,
such that the empirical risk in (1.2.3)
m
1 X
f (w) := R̂S (w) = (Φ(xj , w) − yj )2 ,
m
j=1
and thus the computation of m gradients of the neural network Φ. For large m (in practice m can
be in the millions or even larger), this computation might be infeasible.
To reduce computational cost, we replace the full gradient (10.2.3) by the stochastic gradient
125
where j ∼ uniform(1, . . . , m) is a random variable with uniform distribution on the discrete set
{1, . . . , m}. Then
m
2 X
E[G] = (Φ(xj , w) − yj )∇w Φ(xj , w) = ∇f (w),
m
j=1
Hence, in one epoch (corresponding to k updates of the weights), the algorithm sweeps through the
whole dataset, and each data point is used exactly once. ♢
Let wk be generated by (10.2.2). Using L-smoothness (10.1.5) we then have [33, Lemma 4.2]
L
E[f (wk+1 )|wk ] − f (wk ) ≤ E[⟨∇f (wk ), wk+1 − wk ⟩] + E[∥wk+1 − wk ∥2 |wk ]
2
L
= −hk ∥∇f (wk )∥2 + E[∥hk Gk ∥2 |wk ].
2
For the objective function to decrease in expectation, the first term hk ∥∇f (wk )∥2 should dominate
the second term L2 E[∥hk Gk ∥2 |wk ]. As wk approaches the minimum, we have ∥∇f (wk )∥ → 0,
which suggests that E[∥hk Gk ∥2 |wk ] should also decrease over time.
This is illustrated in Figure 10.3 (a), which shows the progression of stochastic gradient descent
(SGD) with a constant learning rate, hk = h, on a quadratic objective function and using artificially
added gradient noise, such that E[∥Gk ∥2 |wk ] does not tend to zero. The stochastic updates in
(10.2.2) cause fluctuations in the optimization path. Since these fluctuations do not decrease as the
algorithm approaches the minimum, the iterates will not converge. Instead they stabilize within
a neighborhood of the minimum, and oscillate around it, e.g. [85, Theorem 9.8]. In practice, this
might yield a good enough approximation to w∗ . To achieve convergence in the limit, the variance
of the update vector, −hk Gk , must decrease over time however. This can be achieved either by
reducing the variance of Gk , for example through larger mini-batches (cf. Example 10.9), or more
commonly, by decreasing the step size hk as k progresses. Figure 10.3 (b) shows this for hk ∼ 1/k;
also see Figure 10.4.
2
In contrast to using the full batch, which corresponds to standard gradient descent.
126
1.0 1.0
GD GD
SGD (constant LR) SGD (decreasing LR)
0.8 0.8
0.6 0.6
0.4 0.4
0.2 0.2
0.0 0.0
0.2 0.2
0.4 0.2 0.0 0.2 0.4 0.6 0.8 0.4 0.2 0.0 0.2 0.4 0.6 0.8
(a) constant learning rate for SGD (b) decreasing learning rate for SGD
Figure 10.3: 20 steps of gradient descent (GD) and stochastic gradient descent (SGD) for a strongly
convex quadratic objective function. GD was computed with a constant learning rate, while SGD
was computed with either a constant learning rate (hk = h) or a decreasing learning rate (hk ∼ 1/k).
wk
wk
w∗ w∗
Figure 10.4: The update vector −hk Gk (black) is a draw from a random variable with expectation
−hk ∇f (wk ) (blue). In order to get convergence, the variance of the update vector should decrease
as wk approaches the minimizer w∗ . Otherwise, convergence will in general not hold, see Figure
10.3 (a).
127
10.2.2 Convergence of stochastic gradient descent
Since wk in (10.2.2) is a random variable by construction, a convergence statement can only be
stochastic, e.g., in expectation or with high probability. The next theorem, which is based on
[98, Theorem 3.2] and [33, Theorem 4.7], concentrates on the former. The result is stated under
assumption (10.2.6), which bounds the second moments of the stochastic gradients Gk and ensures
that they grow at most linearly with ∥∇f (wk )∥2 . Moreover, the analysis relies on the following
decreasing step sizes
µ 1 (k + 1)2 − k 2
hk := min , for all k ∈ N0 , (10.2.4)
L2 γ µ (k + 1)2
(k + 1)2 − k 2 2k + 1 2
= = + O(k −2 ). (10.2.5)
(k + 1)2 (k + 1)2 (k + 1)
This learning rate decay will allow us to establish a convergence rate. However, in practice, a less
aggressive decay, such as hk = O(k −1/2 ), or heuristic methods that decrease the learning rate based
on the observed convergence behaviour may be preferred.
Theorem 10.10. Let n ∈ N and L, µ, γ > 0. Let f : Rn → R be L-smooth and µ-strongly convex.
Fix w0 ∈ Rn , let (hk )∞ ∞ ∞
k=0 be as in (10.2.4) and let (Gk )k=0 , (w k )k=1 be sequences of random
variables as in (10.2.1) and (10.2.2). Assume that, for some fixed γ > 0, and all k ∈ N
C
E[∥wk − w∗ ∥2 ] ≤ ,
k
C
E[f (wk )] − f (w∗ ) ≤ .
k
E[∥wk − w∗ ∥2 |wk−1 ]
= ∥wk−1 − w∗ ∥2 − 2hk−1 E[⟨Gk−1 , wk−1 − w∗ ⟩ |wk−1 ] + h2k−1 E[∥Gk−1 ∥2 |wk−1 ]
≤ ∥wk−1 − w∗ ∥2 − 2hk−1 ⟨∇f (wk−1 ), wk−1 − w∗ ⟩ + h2k−1 γ(1 + ∥∇f (wk−1 )∥2 ).
128
Moreover, L-smoothness, µ-strong convexity and ∇f (w∗ ) = 0 imply
2L2
∥∇f (wk−1 )∥2 ≤ L2 ∥wk−1 − w∗ ∥2 ≤ (f (wk−1 ) − f (w∗ )).
µ
Combining the previous estimates we arrive at
The choice of hk−1 ≤ µ/(L2 γ) and the fact that (cf. (10.2.1))
thus give
1 (i + 1)2 − i2
hi = for all i ≥ k0 .
µ (i + 1)2
and thus
k−1 2
γ X (j + 1)2 − j 2 (j + 1)2
ek ≤ C̃ 2
µ (j + 1)2 k2
j=0
k−1
C̃γ 1 X (2j + 1)2
≤ 2 2
µ k (j + 1)2
j=0 | {z }
≤4
C̃γ 4k C
≤ 2 2
= ,
µ k k
129
with C := 4C̃γ/µ2 .
Since E[∥wk − w∗ ∥2 ] is the expectation of ek = E[∥wk − w∗ ∥2 |wk−1 ] with respect to the random
variable wk−1 , and C/k is a constant independent of wk−1 , we obtain
C
E[∥wk − w∗ ∥2 ] ≤ .
k
Finally, using L-smoothness
L L
f (wk ) − f (w∗ ) ≤ ⟨∇f (w∗ ), wk − w∗ ⟩ + ∥wk − w∗ ∥2 = ∥wk − w∗ ∥2 ,
2 2
and taking the expectation concludes the proof.
The specific choice of hk in (10.2.4) simplifies the calculations in the proof, but it is not necessary
in order for the asymptotic convergence to hold; see for instance [33, Theorem 4.7] or [27, Chapter
4] for more general statements.
10.3 Acceleration
Acceleration is an important tool for the training of neural networks [262]. The idea was first
introduced by Polyak in 1964 under the name “heavy ball method” [216]. It is inspired by the
dynamics of a heavy ball rolling down the valley of the loss landscape. Since then other types of
acceleration have been proposed and analyzed, with Nesterov acceleration being the most prominent
example [190]. In this section, we first give some intuition by discussing the heavy ball method for
a simple quadratic loss. Afterwards we turn to Nesterov acceleration and give a convergence proof
for L-smooth and µ-strongly convex objective functions that improves upon the bounds obtained
for gradient descent.
(1 − hζ1 )k+1
1 − hζ1 0 0
wk+1 = wk − hDwk = wk = w0 .
0 1 − hζ2 0 (1 − hζ2 )k+1
The optimal step size balancing the rate of convergence in both coordinates is
2
h∗ = argminh>0 max{|1 − hζ1 |, |1 − hζ2 |} = . (10.3.2)
ζ1 + ζ2
130
1.0
GD
HB
0.8
0.6
0.4
0.2
0.0
1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00
Figure 10.5: 20 steps of gradient descent (GD) and the heavy ball method (HB) on the objective
function (10.3.1) with ζ1 = 12 ≫ 1 = ζ2 , step size h = α = h∗ as in (10.3.2), and β = 1/3. Figure
based on [217, Fig. 6].
This is known as Polyak’s heavy ball method [216, 217]. Here α > 0 and β ∈ (0, 1) are hyperpa-
rameters (that could also depend on k) and in practice need to be carefully tuned to balance the
131
strength of the gradient and the momentum term. Iteratively expanding (10.3.5) with the given
initialization, observe that for k ≥ 0
k
!
X
j
wk+1 = wk − α β ∇f (wk−j ) . (10.3.6)
j=0
Thus, wk is updated using an exponentially weighted moving average of all past gradients. Choosing
the momentum parameter β in the interval (0, 1) ensures that the influence of previous gradients
on the update decays exponentially. The concrete value of β determines the balance between the
impact of recent and past gradients.
Intuitively, this linear combination of the past gradients averages out some of the oscillation
observed for gradient descent in Figure 10.5; the update vector is strengthened in directions where
past gradients are aligned (the second coordinate axis), while it is dampened in directions where
the gradients’ signs alternate (the first coordinate axis). Similarly, when using stochastic gradients,
it can help to reduce some of the variance.
As mentioned earlier, the heavy ball method can be interpreted as a discretization of the dy-
namics of a ball rolling down the valley of the loss landscape. If the ball has positive mass, i.e. is
“heavy”, its momentum prevents the ball from bouncing back and forth too strongly. The following
remark elucidates this connection.
Remark 10.12. As pointed out, e.g., in [217, 219], for suitable choices of α and β, (10.3.5) can be
interpreted as a discretization of the second-order ODE
This equation describes the movement of a point mass m under influence of the force field −∇f (w(t));
the term −w′ (t), which points in the negative direction of the current velocity, corresponds to fric-
tion, and r > 0 is the friction coefficient. The discretization
wk+1 − 2wk + wk−1 wk+1 − wk
m 2
= −∇f (wk ) −
h h
then leads to
h2 m
wk+1 = wk − ∇f (wk ) + (wk − wk−1 ), (10.3.8)
m − rh m − rh
| {z } | {z }
=α =β
Note that −∇f (w(t)) represents the velocity of w(t) in (10.3.9), whereas in (10.3.7), up to the
friction term, it corresponds to an acceleration.
132
10.3.2 Nesterov acceleration
Nesterov’s accelerated gradient method (NAG) [190, 189] builds on the heavy ball method. After
initializing w0 , v 0 ∈ Rn , the update is formulated for k ≥ 0 as the two-step process
where again α > 0 and β ∈ (0, 1) are hyperparameters. Substituting the second line into the first
we get for k ≥ 1
Comparing with the heavy ball method (10.3.5), the key difference is that the gradient is not
evaluated at the current position wk , but instead at the point v k = wk + β(wk − wk−1 ), which
can be interpreted as an estimate of the position at the next iteration. This improves stability and
robustness with respect to hyperparameter settings, see [160, Sections 4 and 5].
We now discuss the convergence of NAG for L-smooth and µ-strongly convex objective functions
f . To
pgive the analysis, it is convenient to first rewrite (10.3.10) as a threen sequence update: Let
τ = µ/L, α = 1/L, and β = (1 − τ )/(1 + τ ). After initializing w0 , v 0 ∈ R , (10.3.10) can also be
written as u0 = ((1 + τ )v 0 − w0 )/τ and for k ≥ 0
τ 1
vk = uk + wk (10.3.11a)
1+τ 1+τ
1
wk+1 = v k − ∇f (v k ) (10.3.11b)
L
τ
uk+1 = uk + τ · (v k − uk ) − ∇f (v k ), (10.3.11c)
µ
see Exercise 10.25. The proof of the next theorem proceeds along the lines of [275, Theorem A.3.1],
[287, Proposition 10]; also see [286, Proposition 20] who present a similar proof of a related result
based on the same references.
Theorem 10.13. Let n ∈ N, 0 < µp ≤ L, and let f : Rn → R be L-smooth and µ-strongly convex.
Further, let w0 , v 0 ∈ Rn and let τ = µ/L. Let (v k , wk+1 , uk+1 )∞ n
k=0 ⊆ R be defined by (10.3.11a),
and let w∗ be the unique minimizer of f .
Then, for all k ∈ N0 , it holds that
r
2 2 µ k µ
∥uk − w∗ ∥ ≤ 1− f (w0 ) − f (w∗ ) + ∥u0 − w∗ ∥2 , (10.3.12a)
µ L 2
r
µ k µ
f (wk ) − f (w∗ ) ≤ 1 − f (w0 ) − f (w∗ ) + ∥u0 − w∗ ∥2 . (10.3.12b)
L 2
Proof. Define
µ
ek := (f (wk ) − f (w∗ )) + ∥uk − w∗ ∥2 . (10.3.13)
2
133
To show (10.3.12), it suffices to prove with c := 1 − τ that ek+1 ≤ cek for all k ∈ N0 .
Step 1. We bound the first term in ek+1 defined in (10.3.13). Using L-smoothness (10.1.4a)
and (10.3.11b)
L 1
f (wk+1 ) − f (v k ) ≤ ⟨∇f (v k ), wk+1 − v k ⟩ + ∥wk+1 − v k ∥2 = − ∥∇f (v k )∥2 .
2 2L
Thus, since c + τ = 1,
1
f (wk+1 ) − f (w∗ ) ≤ (f (v k ) − f (w∗ )) − ∥∇f (v k )∥2
2L
= c · (f (wk ) − f (w∗ )) + c · (f (v k ) − f (wk ))
1
+ τ · (f (v k ) − f (w∗ )) − ∥∇f (v k )∥2 . (10.3.14)
2L
Step 2. We bound the second term in ek+1 defined in (10.3.13). By (10.3.11c)
µ µ
∥uk+1 − w∗ ∥2 − ∥uk − w∗ ∥2
2 2
µ µ
= ∥uk+1 − uk + uk − w∗ ∥2 − ∥uk − w∗ ∥2
2 2
µ 2 τ
= ∥uk+1 − uk ∥ + µ τ · (v k − uk ) − ∇f (v k ), uk − w∗
2 µ
µ
= ∥uk+1 − uk ∥2 + τ ⟨∇f (v k ), w∗ − uk ⟩ − τ µ ⟨v k − uk , w∗ − uk ⟩ . (10.3.15)
2
Using µ-strong convexity (10.1.9), we get
134
Step 3. We show ek+1 ≤ cek . Adding (10.3.14) and (10.3.16) gives
1 µ
ek+1 ≤ cek + c · (f (v k ) − f (wk )) − ∥∇f (v k )∥2 + ∥uk+1 − uk ∥2
2L 2
µ 2
+ ⟨∇f (v k ), wk − v k ⟩ − ∥wk − v k ∥ .
2τ
Using (10.3.11a), (10.3.11c) we expand
2
µ µ τ
∥uk+1 − uk ∥2 = wk − v k − ∇f (v k )
2 2 µ
µ τ2
= ∥wk − v k ∥2 − τ ⟨∇f (v k ), wk − v k ⟩ + ∥∇f (v k )∥2 ,
2 2µ
to obtain
τ2 1 µ
ek+1 ≤ cek + − ∥∇f (v k )∥2 − ∥wk − v k ∥2
2µ 2L 2τ
µ
+ c · f (v k ) − f (wk ) + ⟨∇f (v k ), wk − v k ⟩ + ∥wk − v k ∥2 .
2
The last line can be bounded using µ-strong convexity (10.1.9) and µ ≤ L
µ
∥v k − wk ∥2
c · (f (v k ) − f (wk ) + ⟨∇f (v k ), wk − v k ⟩) +
2
µ µ τL
≤ −(1 − τ ) ∥v k − wk ∥2 + ∥v k − wk ∥2 ≤ ∥v k − wk ∥2 .
2 2 2
In all
τ2
1 τL µ
ek+1 ≤ cek + ∥∇f (v k )∥2 +
− − ∥wk − v k ∥2 = cek ,
2µ 2L 2 2τ
p
where the terms in brackets vanished since τ = µ/L. This concludes the proof.
Comparing the result for gradient descent (10.1.10) with NAG (10.3.12), the improvement for
strongly convex objectives lies in the convergence rate, which is 1 − κ−1 for gradient descent3 , and
1 − κ−1/2 for NAG, where κ = L/µ. For NAG the convergence rate depends only on the square
root of the condition number κ. For ill-conditioned problems where κ is large, we therefore expect
much better performance for accelerated methods.
135
the main ideas behind AdaGrad, RMSProp, and Adam. The paper [231] provides an intuitive
general overview of first order methods and discusses several additional variants that are omitted
here. Moreover, in practice, various other techniques and heuristics such as batch normalization,
gradient clipping, regularization and dropout, early stopping, specific weight initializations etc. are
used. We do not discuss them here, and refer for example to [32], [94], or [87, Chapter 11].
136
Note how the outer scaling factor ζ has vanished due to the division, in this sense making the
update invariant to a componentwise rescaling of the objective. ♢
10.4.2 Algorithms
AdaGrad
AdaGrad [74], which stands for Adaptive Gradient Algorithm, corresponds to (10.4.1) with
β1 = 0, γ1 = β2 = γ2 = 1, αk = α for all k ∈ N0 .
This leaves the hyperparameters ε > 0 and α > 0. Here α > 0 can be understood as a “global”
learning rate. The default values in tensorflow [1] are α = 0.001 and ε = 10−7 . The AdaGrad
update then reads
Due to
k
X
sk+1 = ∇f (wj ) ⊙ ∇f (wj ), (10.4.2)
j=0
the algorithm therefore scales the gradient ∇f (wk ) in the update componentwise by the inverse
square root of the sum over all past squared gradients plus ε. Note that the scaling factor (sk+1,i +
ε)−1/2 for component i will be large, if the previous gradients for that component were small, and
vice versa.
RMSProp
RMSProp, which stands for Root Mean Squared Propagation, was introduced by Tieleman and
Hinton [272]. It corresponds to (10.4.1) with
effectively leaving the hyperparameters ε > 0, γ1 ∈ (0, 1) and α > 0. The default values in
tensorflow [1] are ε = 10−7 , α = 0.001 and γ1 = 0.9. The algorithm is thus given through
and corresponds to an exponentially weighted moving average over the past squared gradients.
Unlike for AdaGrad (10.4.2), where past gradients accumulate indefinitely, RMSprop exponentially
137
downweights older gradients, giving more weight to recent updates. This prevents the overly rapid
decay of learning rates and slow convergence sometimes observed in AdaGrad, e.g. [288, 87]. For
the same reason, the authors of Adadelta [295] proposed to use as a scaling vector the average
over a moving window of the past m squared gradients, for some fixed m ∈ N. For more details
on Adadelta, see [295, 231]. The standard RMSProp algorithm does not incorporate momentum,
however this possibility is already mentioned in [272], also see [262].
Adam
Adam [143], which stands for Adaptive Moment Estimation, corresponds to (10.4.1) with
q
1 − γ1k+1
β2 = 1 − β1 ∈ (0, 1), γ2 = 1 − γ1 ∈ (0, 1), αk = α
1 − β1k+1
for all k ∈ N0 , for some α > 0. The default values for the remaining parameters recommended in
[143] are ε = 10−8 , α = 0.001, β1 = 0.9 and γ1 = 0.999. The update can be formulated as
uk+1
uk+1 = β1 uk + (1 − β1 )∇f (wk ), ûk+1 = (10.4.4a)
1 − β1k+1
sk+1
sk+1 = γ1 sk + (1 − γ1 )∇f (wk ) ⊙ ∇f (wk ), ŝk+1 = (10.4.4b)
1 − γ1k+1
p
wk+1 = wk − αûk+1 ⊘ ŝk+1 + ε. (10.4.4c)
which corresponds to heavy ball momentum (cf. (10.3.6)). Second, to counteract the initialization
bias from u0 = 0 and s0 = 0, Adam applies a bias correction via
uk sk
ûk = , ŝk = .
1 − β1k 1 − γ1k
It should be noted that there exist specific settings and convex optimization problems for which
Adam (and RMSProp and Adadelta) does not necessarily converge to a minimizer, see [227]. The
authors of [227] propose a modification termed AMSGrad, which avoids this issue. Nonetheless,
Adam remains a highly popular algorithm for the training of neural networks. We also note that,
in the stochastic optimization setting, convergence proofs of such algorithms in general still require
k-dependent decrease of the “global” learning rate such as α = O(k −1/2 ) in (10.4.3b) and (10.4.4c).
10.5 Backpropagation
We now explain how to apply gradient-based methods to the training of neural networks. Let
d
Φ ∈ Nd0L+1 (σ; L, n) (see Definition 3.6) and assume that the activation function satisfies σ ∈ C 1 (R).
As earlier, we denote the neural network parameters by
138
with weight matrices W (ℓ) ∈ Rdℓ+1 ×dℓ and bias vectors b(ℓ) ∈ Rdℓ+1 . Additionally, we fix a differ-
entiable loss function L : RdL+1 × RdL+1 → R, e.g., L(w, w̃) = ∥w − w̃∥2 /2, and assume given data
(xj , y j )m d0
j=1 ⊆ R × R
dL+1 . The goal is to minimize an empirical risk of the form
m
1 X
f (w) := L(Φ(xj , w), y j )
m
j=1
as a function of the neural network parameters w. An application of the gradient step (10.1.2) to
update the parameters requires the computation of
m
1 X
∇f (w) = ∇w L(Φ(xj , w), y j ).
m
j=1
For stochastic methods, as explained in Example 10.9, we only compute the average over a (random)
subbatch of the dataset. In either case, we need an algorithm to determine ∇w L(Φ(x, w), y), i.e.
the gradients
∇b(ℓ) L(Φ(x, w), y) ∈ Rdℓ+1 , ∇W (ℓ) L(Φ(x, w), y) ∈ Rdℓ+1 ×dℓ (10.5.2)
for all ℓ = 0, . . . , L.
The backpropagation algorithm [234] provides an efficient way to do so, by storing intermediate
values in the calculation. To explain it, for fixed input x ∈ Rd0 introduce the notation
Note that x̄(k) depends on (W (ℓ) , b(ℓ) ) only if k > ℓ. Since x̄(ℓ+1) is a function of x̄(ℓ) for each ℓ,
by repeated application of the chain rule
(ℓ)
An analogous calculation holds for ∂L/∂bj . Since all terms in (10.5.4) are easy to compute (see
(10.5.3)), in principle we could use this formula to determine the gradients in (10.5.2). To avoid
unnecessary computations, the main idea of backpropagation is to introduce
139
and observe that
∂L ∂ x̄(ℓ+1)
(ℓ)
= (α(ℓ+1) )⊤ (ℓ)
.
∂Wij ∂Wij
As the following lemma shows, the α(ℓ) can be computed recursively for ℓ = L + 1, . . . , 1. This
explains the name “backpropagation”. As before, ⊙ denotes the componentwise product.
and
Proof. Equation (10.5.5) holds by definition. For ℓ ∈ {1, . . . , L} by the chain rule
∂L ∂ x̄(ℓ+1) ⊤ ∂L ∂ x̄(ℓ+1) ⊤
α(ℓ) = = = α(ℓ+1) .
∂ x̄(ℓ) | ∂ x̄{z (ℓ) ∂ x̄ (ℓ+1)
} | {z } ∂ x̄ (ℓ)
and
and
140
Proof. By (10.5.3a) for i, k ∈ {1, . . . , d1 }, and j ∈ {1, . . . , d0 }
(1) (1)
∂ x̄k ∂ x̄k
(0)
= δki and (0)
= δki xj ,
∂bi ∂Wij
ℓ+1 d
Thus, with ei = (δki )k=1
∂L ∂ x̄(ℓ+1) ⊤ ∂L (ℓ+1)
(ℓ)
= (ℓ)
= e⊤
i α
(ℓ+1)
= αi for ℓ ∈ {0, . . . , L}
∂bi ∂bi ∂ x̄(ℓ+1)
and similarly
∂L ∂ x̄(1) ⊤
(0) (0) (1)
(0)
= (0)
α(1) = x̄j e⊤
i α
(1)
= x̄j αi
∂Wij ∂Wij
and
∂L (ℓ) (ℓ+1)
(ℓ)
= σ(x̄j )αi for ℓ ∈ {1, . . . , L}.
∂Wij
Lemma 10.15 and Proposition 10.16 motivate Algorithm 1, in which a forward pass computing
x̄(ℓ) , ℓ = 1, . . . , L + 1, is followed by a backward pass to determine the α(ℓ) , ℓ = L + 1, . . . , 1,
and the gradients of L with respect to the neural network parameters. This shows how to use
gradient-based optimizers from the previous sections for the training of neural networks.
Two important remarks are in order. First, the objective function associated to neural networks
is typically not convex as a function of the neural network weights and biases. Thus, the analysis
of the previous sections will in general not be directly applicable. It may still give some insight
about the convergence behavior locally around a (local) minimizer however. Second,
we assumed the activation function to be continuously differentiable, which does not hold for
ReLU. Using the concept of subgradients, gradient-based algorithms and their analysis may be
generalized to some extent to also accommodate non-differentiable loss functions, see Exercises
10.20–10.22.
141
Algorithm 1 Backpropagation
Input: Network input x, target output y, neural network parameters
(0) (0) (L) (L)
((W , b ), . . . , (W , b ))
Output: Gradients of the loss function L with respect to neural network parameters
Forward pass
x̄(1) ← W (0) x + b(0)
for ℓ = 1, . . . , L do
x̄(ℓ+1) ← W (ℓ) σ(x̄(ℓ) ) + b(ℓ)
end for
Backward pass
α(L+1) ← ∇x̄(L+1) L(x̄(L+1) , y)
for ℓ = L, . . . , 1 do
∇b(ℓ) L ← α(ℓ+1)
∇W (ℓ) L ← α(ℓ+1) σ(x̄(ℓ) )⊤
α(ℓ) ← σ ′ (x̄(ℓ) ) ⊙ (W (ℓ) )⊤ α(ℓ+1)
end for
∇b(0) L ← α(1)
∇W (0) L ← α(1) x⊤
convergence proofs under weaker assumptions than those considered here. For convergence results
assuming for example the Polyak-Lojasiewicz inequality, which does not require convexity, see, e.g.,
[139].
Stochastic gradient descent (SGD) discussed in Section 10.2 originally dates back to Robbins and
Monro [228]. The proof presented here for strongly convex objective functions is based on [98, 33]
and in particular uses the step size from [98]; also see [184, 224, 187, 251]. For insights into the
potential benefits of SGD in terms of generalization properties, see, e.g., [289, 105, 297, 141, 256].
The heavy ball method in Section 10.3 goes back to Polyak [216]. To motivate the algorithm
we proceed as in [91, 217, 219], and also refer to [259, 196]. The analysis of Nesterov acceleration
[190] follows the arguments in [275, 287], with a similar proof also given in [286].
For Section 10.4 on adaptive learning rates, we follow the overviews [94, Section 8.5], [231], and
[87, Chapter 11] and the original works that introduced AdaGrad [74], Adadelta [295], RMSProp
[272] and Adam [143]. Regarding the analysis of RMSProp and Adam, we refer to [227] which
give an example of non-convergence, and provide a modification of the algorithm together with a
convergence analysis. Convergence proofs (for variations of) AdaGrad and Adam can also be found
in [64].
The backpropagation algorithm discussed in Section 10.5 was popularized by Rumelhart, Hinton
and Williams [234]; for further details on the historical developement we refer to [239, Section 5.5],
and for a more in-depth discussion of the algorithm, see for instance [109, 28, 192].
Similar discussions of gradient descent algorithms in the context of deep learning as given
here were presented in [271] and [132]: [271, Chapter 7] provides accessible convergence proofs of
(stochastic) gradient descent and gradient flow under different smoothness and convexity assump-
tions, and [132, Part III] gives a broader overview of optimization techniques in deep learning, but
142
restricts part of the analysis to quadratic objective functions. As in [33], our analysis of gradient de-
scent, stochastic gradient descent, and Nesterov acceleration, exclusively focused on strongly convex
objective functions. We also refer to this paper for a more detailed general treatment and analysis
of optimization algorithms in machine learning, covering various methods that are omitted here.
Details on implementations in Python can for example be found in [87], and for recommendations
and tricks regarding the implementation we also refer to [156, 32].
143
Exercises
Exercise 10.17. Let L > 0 and let f : Rn → R be continuously differentiable. Show that (10.1.5)
implies (10.1.4).
Exercise 10.18. Let f ∈ C 1 (Rn ). Show that f is convex in the sense of Definition 10.3 if and only
if
For convex functions f , a subgradient always exists, i.e. ∂f (v) is necessarily nonempty, e.g.,
[41, Section 1.2]. Subgradients generalize the notion of gradients for convex functions, since for
any convex continuously differentiable f , (10.5.6) is satisfied with g = ∇f (v). The following three
exercises on subgradients are based on the lecture notes [36]. Also see, e.g., [252, 41, 153] for more
details on subgradient descent.
Exercise 10.20. Let f : Rn → R be convex and Lip(f ) ≤ L. Show that for any g ∈ ∂f (v) holds
∥g∥ ≤ L.
wk+1 := wk − hk g k ,
Hint: Start by recursively expanding ∥wk − w∗ ∥2 = · · · , and then apply the property of the
subgradient.
Exercise 10.22. Consider the setting of Exercise 10.21. Determine step sizes h1 , . . . , hk (which
may depend on k, i.e. hk,1 , . . . , hk,k ) such that for any arbitrarily small δ > 0
144
Exercise 10.23. Let A ∈ Rn×n be symmetric positive semidefinite, b ∈ Rn and c ∈ R. Denote
the eigenvalues of A by ζ1 ≥ · · · ≥ ζn ≥ 0. Show that the objective function
1
f (w) := w⊤ Aw + b⊤ w + c (10.5.7)
2
is convex and ζ1 -smooth. Moreover, if ζn > 0, then f is ζn -strongly convex. Show that these values
are optimal in the sense that f is neither L-smooth nor µ-strongly convex if L < ζ1 and µ > ζn .
Hint: Note that L-smoothness and µ-strong convexity are invariant under shifts and the addition
of constants. That is, for every α ∈ R and β ∈ Rn , f˜(w) := α + f (w + β) is L-smooth or µ-strongly
convex if and only if f is. It thus suffices to consider w⊤ Aw/2.
Exercise 10.24. Let f be as in Exercise 10.23. Show that gradient descent converges for arbitrary
initialization w0 ∈ Rn , if and only if
Show that argminh>0 maxj=1,...,n |1 − hζj | = 2/(ζ1 + ζn ) and conclude that the convergence will be
slow if f is ill-conditioned, i.e. if ζ1 /ζn ≫ 1.
Hint: Assume first that b = 0 ∈ Rn and c = 0 ∈ R in (10.5.7), and use the singular value
decomposition A = U ⊤ diag(ζ1 , . . . , ζn )U .
p
Exercise 10.25. Show that (10.3.10) can equivalently be written as (10.3.11) with τ = µ/L,
α = 1/L, β = (1 − τ )/(1 + τ ) and the initialization u0 = ((1 + τ )w0 − s0 )/τ .
145
Chapter 11
In this chapter we explore the dynamics of training (shallow) neural networks of large width.
Throughout assume given data pairs
for distinct xi . We wish to train a model (e.g. a neural network) Φ(x, w) depending on the input
x ∈ Rd and the parameters w ∈ Rn . To this end we consider either minimization of the ridgeless
(unregularized) objective
m
X
f (w) := (Φ(xi , w) − yi )2 , (11.0.1b)
i=1
or, for some regularization parameter λ ≥ 0, of the ridge regularized objective
The adjectives ridge and ridgeless thus indicate the presence or absence of the regularization term
∥w∥2 .
In the ridgeless case, the objective is a multiple of the empirical risk R
b S (Φ) in (1.2.3) for the
m
sample S = (xi , yi )i=1 and the square-loss. Regularization is a common tool in machine learning
to improve model generalization and stability. The goal of this chapter is to get some insight into
the dynamics of Φ(x, wk ) as the parameter vector wk progresses during training. Additionally, we
want to gain some intuition about the influence of regularization, and the behaviour of the trained
model x 7→ Φ(x, wk ) for large k. We do so through the lense of so-called kernel methods. As a
training algorithm we exclusively focus on gradient descent with constant step size.
If Φ(x, w) depends linearly on the parameters w, the objective function (11.0.1c) is convex. As
established in the previous chapter (cf. Remark 10.8), gradient descent then finds a global minimizer.
For typical neural network architectures, w 7→ Φ(x, w) is not linear, and such a statement is in
general not true. Recent research has shown that neural network behavior tends to linearize in w
as network width increases [131]. This allows to transfer some of the results and techniques from
the linear case to the training of neural networks.
We start this chapter in Sections 11.1 and 11.2 by recalling (kernel) least-squares methods,
which describe linear (in w) models. Following [158], the subsequent sections examine why neural
146
networks exhibit linear-like behavior in the infinite-width limit. In Section 11.3 we introduce the so-
called tangent kernel. Section 11.4 presents abstract results showing, under suitable assumptions,
convergence towards a global minimizer when training the model. Section 11.5 builds on this
analysis and discusses connections to kernel regression with the tangent kernel. In Section 11.6
we then detail the implications for wide neural networks. A similar treatment of these results was
previously given by Telgarsky in [271, Chapter 8] for gradient flow (rather than gradient descent),
based on [49].
H̃ := span{x1 , . . . , xm } ⊆ Rd . (11.1.3)
It’s a standard result that w∗ is well-defined and belongs to the span H̃ of the training points,
e.g., [29, 65, 92]. While one way to prove this is through the pseudoinverse (see Theorem B.2), we
provide an alternative argument here, which can be directly extended to the infinite-dimensional
case as discussed in Section 11.2 ahead.
147
Theorem 11.2. There is a unique minimum norm solution w∗ ∈ Rd in (11.1.4). It lies in the
subspace H̃, and is the unique minimizer of f in H̃, i.e.
Then C is a finite dimensional space, and as such it is closed and convex. Therefore y ∗ =
argminỹ∈C ∥ỹ − y∥ exists and is unique (this is a fundamental property of Hilbert spaces, see
Theorem B.17). In particular, the set M = {w ∈ Rd | Aw = y ∗ } ⊆ Rd of minimizers of f is not
empty. Clearly M ⊆ Rd is closed and convex. As before, w∗ = argminw∈M ∥w∥ exists and is
unique.
It remains to show (11.1.5). Decompose w∗ = w̃ + ŵ with w̃ ∈ H̃ and ŵ ∈ H̃ ⊥ (see Definition
B.15). By definition of A it holds Aw∗ = Aw̃ and f (w∗ ) = f (w̃). Moreover ∥w∗ ∥2 = ∥w̃∥2 +∥ŵ∥2 .
Since w∗ is the minimum norm solution, w∗ = w̃ ∈ H̃. To conclude the proof, we need to show
that w∗ is the only minimizer of f in H̃. Assume there exists a minimizer v of f in H̃ different
from w∗ . Then 0 ̸= w∗ − v ∈ H̃. Thus A(w∗ − v) ̸= 0 and y ∗ = Aw∗ ̸= Av, which contradicts
that v minimizes f .
Next let λ > 0 in (11.1.2). Then minimizing fλ is referred to as ridge regression or Tikhonov
regularized least squares [274, 118, 79, 108]. The next proposition shows that there exists a unique
minimizer of fλ , which is closely connected to the minimum norm solution, e.g. [79, Theorem 5.2].
Theorem 11.3. Let λ > 0. Then, with fλ in (11.1.2), there exists a unique minimizer
Proof. According to Exercise 11.29, w 7→ fλ (w) is strongly convex on Rd , and thus also on the
subspace H̃ ⊆ Rd . Therefore, there exists a unique minimizer of fλ in H̃, which we denote by
w∗,λ ∈ H̃. To show that there exists no other minimizer of fλ in Rd , fix w ∈ Rd \H̃ and decompose
w = w̃ + ŵ with w̃ ∈ H̃ and 0 ̸= ŵ ∈ H̃ ⊥ . Then
and
∥w∥2 = ∥w̃∥2 + ∥ŵ∥2 > ∥w̃∥2 .
148
Thus fλ (w) > fλ (w̃) ≥ fλ (w∗,λ ).
It remains to show (11.1.6). We have
fλ (w) = (Aw − y)⊤ (Aw − y) + λw⊤ w
= w⊤ (A⊤ A + λI d )w − 2w⊤ A⊤ y,
where I d ∈ Rd×d is the identity matrix. The minimizer is reached at ∇fλ (w) = 0, which yields
w∗,λ = (A⊤ A + λI d )−1 A⊤ y.
Let A = U Σ V ⊤ be the singular value decomposition of A, where Σ ∈ Rm×d contains the nonzero
singular values s1 ≥ · · · ≥ sr > 0, and U ∈ Rm×m , V ∈ Rd×d are orthogonal. Then
w∗,λ = (V (Σ ⊤ Σ + λI d )V ⊤ )−1 V Σ ⊤ U ⊤ y
s
1
2
s1 +λ ..
=V . 0 U ⊤ y,
sr
s2r +λ
0 0
| {z }
∈Rd×m
where 0 stands for a zero block of suitable size. As λ → 0, this converges towards A† y, where A†
denotes the pseudoinverse of A, see (B.1.3). By Theorem B.2, A† y = w∗ .
Proposition 11.4. Let λ = 0 and fix h ∈ (0, smax (A)−2 ). Let w0 = w̃0 + ŵ0 where w̃0 ∈ H̃ and
ŵ0 ∈ H̃ ⊥ , and let (wk )k∈N be defined by (11.1.7). Then
lim wk = w∗ + ŵ0 .
k→∞
149
Next we consider ridge regression, where λ > 0 in (11.1.2), (11.1.7). The condition on the step
size in the next proposition can be weakened to h ∈ (λ + smax (A)2 )−1 , but we omit doing so for
simplicity.
Proposition 11.5. Let λ > 0, and fix h ∈ (0, (2λ + 2smax (A)2 )−1 ). Let w0 ∈ Rn and let (wk )k∈N
be defined by (11.1.7). Then
lim wk = w∗,λ
k→∞
and
λ
∥w∗ − w∗,λ ∥ ≤ ∥y∥ = O(λ) as λ → 0.
smin (A)3 + smin (A)λ
Proof. By Exercise 10.23, fλ is (2λ + 2smax (A)2 )-smooth, and by Exercise 11.29, fλ is strongly
convex. Thus Theorem 10.7 implies convergence of gradient descent towards the unique minimizer
w∗,λ .
For the bound on the distance to w∗ , assume A ̸= 0 (the case A = 0 is trivial). Expressing w∗
via the pseudoinverse of A (see Appendix B.1) we get
1
s1
..
† . 0
w∗ = A y = V U ⊤ y,
1
sr
0 0
where A = U Σ V ⊤ is the singular value decomposition of A, and s1 ≥ · · · ≥ sr > 0 denote the
singular values of A. The explicit formula for w∗,λ obtained in the proof of Theorem 11.3 then
yields
si 1
∥w∗ − w∗,λ ∥ ≤ max 2 − ∥y∥.
i≤r si + λ si
This gives the claimed bound.
By Proposition 11.5, if we use ridge regression with a small regularization parameter λ > 0,
then gradient descent converges to a vector w∗,λ which is O(λ) close to the minimal norm solution
w∗ , regardless of the initialization w0 .
150
Let us formalize this idea. For reasons that will become apparent later (see Remark 11.10), it
is useful to allow for the case n = ∞. To this end, let (H, ⟨·, ·⟩H ) be a Hilbert space (see Appendix
B.2.4), referred to as the feature space, and let ϕ : Rd → H denote the feature map. The model
is defined as
Φ(x, w) := ⟨ϕ(x), w⟩H (11.2.1)
with w ∈ H. We may think of H in the following either as Rn for some n ∈ N, or as ℓ2 (N) (see
Example B.12); in this case the components of ϕ are referred to as features. For some λ ≥ 0, the
goal is to minimize the objective
m
X 2
f (w) := ⟨ϕ(xj ), w⟩H − yj or fλ (w) := f (w) + λ∥w∥2H . (11.2.2)
j=1
H̃ := span{ϕ(x1 ), . . . , ϕ(xm )} ⊆ H
Theorem 11.7. There is a unique minimum norm solution w∗ ∈ H in (11.2.3). It lies in the
subspace H̃, and is the unique minimizer of f in H̃, i.e.
The proof of Theorem 11.2 is formulated such that it extends verbatim to Theorem 11.7, upon
replacing Rd with H and the matrix A ∈ Rm×d with the linear map
A :H → Rm
w 7→ (⟨ϕ(xi ), w⟩H )m
i=1 .
Similarly, Theorem 11.3 extends to the current setting with small modifications. The key obser-
vation is that by Theorem 11.7, the minimizer is attained in the finite-dimensional subspace H̃.
Selecting a basis for H̃, the proof then proceeds analogously. We leave it to the reader to check
this, see Exercise 11.30. This leads to the following statement.
151
Theorem 11.8. Let λ > 0. Then, with fλ in (11.2.2), there exists a unique minimizer
Statements as in Theorems 11.7 and 11.8, which yield that the minimizer is attained in the
finite dimensional subspace H̃, are known in the literature as representer theorems, [142, 243].
Such a minimizing α need not bePunique (if G is not regular), however, any such α yields a
minimizer in H̃, and thus w∗,λ = m j=1 αj ϕ(xj ) for any λ ≥ 0 by Theorems 11.7 and 11.8. This
suggests the following algorithm:
Given the well-definedness of w∗,0 := w∗ and w∗,λ for λ ≥ 0, we refer to
152
Algorithm 2 Kernel least-squares regression
Input: Data (xi , yi )m d d d
i=1 ∈ R × R, kernel K : R × R → R, regularization parameter λ ≥ 0,
evaluation point x ∈ R d
i.e. K is the corresponding kernel. See for instance [23, Sec. 3.2] or [258, Thm. 4.49].
where now
ϕ(x1 )⊤
A= ..
.
.
ϕ(xm )⊤
This corresponds to the situation discussed in Section 11.1.2.
Let λ = 0. For sufficiently small step size, by Proposition 11.4 for x ∈ Rd
where
w0 = w̃0 + ŵ0
with w̃0 ∈ H̃ = span{ϕ(x0 ), . . . , ϕ(xm )} ⊆ Rm , and ŵ0 ∈ H̃ ⊥ . For λ = 0, gradient descent thus
yields the ridgeless kernel least squares estimator plus an additional term ⟨ϕ(x), ŵ0 ⟩ depending on
initialization. Notably, on the set
153
(11.2.8) always coincides with the ridgeless least squares estimator.
Now let λ > 0. For sufficiently small step size, by Proposition 11.5 for x ∈ Rd
Thus, for λ > 0 gradient descent determines the ridge kernel least-squares estimator regardless of
the initialization. Moreover, for fixed x, the limiting model is O(λ) close to the ridgeless kernel
least-squares estimator.
which is the first order Taylor approximation of Φ around the initial parameter w0 . The parameters
of the linearized model will always be denoted by p ∈ Rn to distinguish them from the parameters
w of the full model. Introduce
The square loss objective for the linearized model then reads
m m
lin
X
lin 2
X 2
f (p) := (Φ (xj , p) − yj ) = ⟨∇w Φ(xj , w0 ), p⟩ − δj (11.3.3)
j=1 j=1
where ⟨·, ·⟩ stands for the Euclidean inner product in Rn . Comparing with (11.2.2), minimizing f lin
corresponds to kernel least squares regression with feature map
ϕ(x) = ∇w Φ(x, w0 ) ∈ Rn .
We refer to K̂n as the empirical tangent kernel, as it arises from the first order Taylor approxi-
mation (the tangent) of the original model Φ around initialization w0 . Note that K̂n depends on
the choice of w0 . For later reference we point out that as explained in Section 11.2.3, minimizing
154
(Φ(x1 , w) − y1 )2
(Φlin (x1 , w) − y1 )2
y1
w0 w0
Φlin (x1 , w)
Φ(x1 , w)
Figure 11.1: Graph of w 7→ Φ(x1 , w) and the linearization w 7→ Φlin (x1 , w) at the initial parameter
∂
w0 , s.t. ∂w Φ(x1 , w0 ) ̸= 0. If Φ and Φlin are close, then there exists w s.t. Φ(x1 , w) = y1 (left). If
the derivatives are also close, the loss (Φ(x1 , w) − y1 )2 is nearly convex in w, and gradient descent
finds a global minimizer (right).
f lin with gradient descent, sufficiently small step size, no regularization, yields a sequence (pk )k∈N0
satisfying
where p̂0 is the projection of the initialization p0 onto span{ϕ(x1 ), . . . , ϕ(xn )}⊥ . In particular, the
second term vanishes at the training points.
In this section we discuss sufficient conditions under which gradient descent converges to a global
minimizer.
The idea is as follows: if w 7→ Φ(x, w) is nonlinear but sufficiently close to its linearization Φlin
in (11.3.1) within some region, the objective function behaves almost like a convex function there.
If the region is large enough to contain both the intial value w0 and a global minimum, then we
expect gradient descent to never leave this (almost convex) basin during training and find a global
minimizer.
To illustrate this, consider Figures 11.1 and 11.2 where we set the number of training samples
to m = 1 and the number of parameters to n = 1. For the above reasoning to hold, the difference
between Φ and Φlin , as well as the difference in their derivatives, must remain small within a
neighborhood of w0 . The neighbourhood should be large enough to contain the global minimizer,
and thus depends critically on two factors: the initial error Φ(x1 , w0 ) − y1 , and the magnitude of
∂
the derivative ∂w Φ(x1 , w0 ).
For general m and n, we now make the required assumptions on Φ precise.
155
Φ(x1 , w) (Φ(x1 , w) − y1 )2
(Φlin (x1 , w) − y1 )2
y1
w0 w0
Φlin (x1 , w)
Figure 11.2: Same as Figure 11.1. If Φ and Φlin are not close, there need not exist w such that
Φ(x1 , w) = y1 , and gradient descent need not converge to a global minimizer.
Assumption 11.11. Let Φ ∈ C 1 (Rd × Rn ) and w0 ∈ Rn . There exist constants r, R, U, L > 0 and
0 < θmin ≤ θmax < ∞ such that ∥xi ∥ ≤ R for all i = 1, . . . , m, and it holds that
(c)
2
√ p
θmin 2 mU f (w0 )
L≤ p and r= . (11.4.3)
12m3/2 U 2 f (w0 ) θmin
Let us give more intuitive explanations of these technical assumptions: (a) implies in particular
that (∇w Φ(xi , w0 )⊤ )mi=1 ∈ R
m×n has full rank m ≤ n (thus we have at least as many parameters
∂
n as training data m). In the context of Figure 11.1, this means that ∂w Φ(x1 , w0 ) ̸= 0 and thus
Φ is a not a constant function. This guarantees existence of p such that Φlin (xi , p) = yi for all
lin
i = 1, . . . , m. Next, (b) formalizes in particular the required closeness of Φ and its linearization
Φlin . For example, since Φlin is the first order Taylor approximation of Φ at w0 ,
|Φ(x, w) − Φlin (x, w)| = |(∇w Φ(x, w̃) − ∇w Φ(x, w0 ))⊤ (w − w0 )| ≤ L∥w − w0 ∥2 ,
for some w̃ in the convex hull of w and w0 . Finally, (c) ties together all constants, ensuring the
full model to be sufficiently close to its linearizationpin a large enough ball of radius r around w0 .
Notably, r may be smaller for smaller initial error f (w0 ) and for larger θmin , which aligns with
our intuition from Figure 11.1.
We are now ready to state the following theorem, which is a variant of [158, Thm. G.1]. The
proof closely follows the arguments given there. In Section 11.6 we will see that the theorem’s
main requirement—Assumption 11.11—is satisfied with high probability for certain (wide) neural
networks.
156
Theorem 11.12. Let Assumption 11.11 hold. Fix a positive learning rate
1
h≤ . (11.4.4)
θmin + θmax
Let (wk )k∈N be generated by gradient descent, i.e. for all k ∈ N0
∥wk − w0 ∥ ≤ r (11.4.6a)
2k
f (wk ) ≤ (1 − hθmin ) f (w0 ). (11.4.6b)
and with the empirical tangent kernel K̂n in Assumption 11.11 (a)
∇e(w0 )∇e(w0 )⊤ = (K̂n (xi , xj ))m
i,j=1 ∈ R
m×m
. (11.4.7)
By (11.4.2a)
m
X
∥∇e(w)∥2 ≤ ∥∇e(w)∥2F = ∥∇w Φ(xj , w)∥2 ≤ mU 2 . (11.4.8a)
j=1
157
these inequalities directly imply (11.4.6).
The case k = 0 is trivial. For the induction step, assume (11.4.9) holds for some k ∈ N0 .
Step 2. We show (11.4.9a) for k + 1. The induction assumption (11.4.9a) and (11.4.10) give
wk ∈ Br (w0 ). Next
∇f (wk ) = ∇(e(wk )⊤ e(wk )) = 2∇e(wk )⊤ e(wk ). (11.4.11)
Using the iteration rule (11.4.5) and the bounds (11.4.8a) and (11.4.9b)
∥wk+1 − wk ∥ = 2h∥∇e(wk )⊤ e(wk )∥
√
≤ 2h mU ∥e(wk )∥
√
≤ 2h mU ∥e(w0 )∥ck .
This shows (11.4.9a) for k + 1. In particular by (11.4.10)
wk+1 , wk ∈ Br (w0 ). (11.4.12)
Step 3. We show (11.4.9b) for k + 1. Since e : Rn → Rm is continuously differentiable, there
exists w̃k in the convex hull of wk and wk+1 such that
e(wk+1 ) = e(wk ) + ∇e(w̃k )(wk+1 − wk ) = e(wk ) − h∇e(w̃k )∇f (wk ),
and thus by (11.4.11)
e(wk+1 ) = e(wk ) − 2h∇e(w̃k )∇e(wk )⊤ e(wk )
= I m − 2h∇e(w̃k )∇e(wk )⊤ e(wk ),
158
Let us emphasize that (11.4.6b) implies that gradient descent (11.4.5) achieves zero loss in the
limit. Consequently, the limiting model interpolates the training data. This shows in particular
convergence to a global minimizer for the (generally nonconvex) optimization problem of minimizing
f (w).
to denote the predicted values at the training points x1 , . . . , xm for given parameter choices w,
p ∈ Rn . Moreover
∇w Φ(x1 , w)⊤
∇w Φ(X, w) = .. m×n
∈R
.
∇w Φ(xm , w)⊤
and similarly for ∇w Φlin (X, w). Given x ∈ Rd , the model predictions at x and X evolve under
gradient descent as follows:
• full model: Initialize w0 ∈ Rn , and set for step size h > 0 and all k ∈ N0
Then
∇w f (w) = ∇w ∥Φ(X, w) − y∥2 = 2∇w Φ(X, w)⊤ (Φ(X, w) − y).
Thus
159
this yields
• linearized model: Initialize p0 := w0 ∈ Rn , and set for step size h > 0 and all k ∈ N0
and
this yields
Φlin (x, pk+1 ) = Φlin (x, pk ) − 2hGlin (x, X)(Φlin (X, pk ) − y) (11.5.6a)
lin lin lin lin
Φ (X, pk+1 ) = Φ (X, pk ) − 2hG (X, X)(Φ (X, pk ) − y). (11.5.6b)
The full dynamics (11.5.3) are governed by the k-dependent kernel matrices Gk . In contrast, the
linear model’s dynamics are entirely determined by the initial kernel matrix Glin . The following
corollary gives an upper bound on how much these matrices may deviate during training, [158,
Thm. G.1].
Corollary 11.13. Let w0 = p0 ∈ Rn , and let Assumption 11.11 be satisfied for some
r, R, U, L, θmin , θmax > 0. Let (wk )k∈N , (pk )k∈N be generated by gradient descent (11.5.1), (11.5.4)
with a positive step size
1
h< .
θmin + θmax
Then for all x ∈ Rd with ∥x∥ ≤ R
√
sup ∥Gk (x, X) − Glin (x, X)∥ ≤ 2 mU Lr, (11.5.7a)
k∈N
sup ∥Gk (X, X) − Glin (X, X)∥ ≤ 2mU Lr. (11.5.7b)
k∈N
160
Proof. By Theorem 11.12 it holds wk ∈ Br (w0 ) for all k ∈ N, and thus also w̃k ∈ Br (w0 ) for w̃k
in the convex hull of wk and wk+1 as in (11.5.2). Using Assumption 11.11 (b), the definitions of
Gk and Glin give
∥Gk (x, X) − Glin (x, X)∥ ≤ ∥∇w Φ(x, w̃k )∥∥∇w Φ(X, wk ) − ∇w Φ(X, w0 )∥
+ ∥∇w Φ(X, w0 )∥∥∇w Φ(x, w̃k ) − ∇w Φ(x, w0 )∥
√
≤ ( m + 1)U Lr.
Theorem 11.14. Consider the setting of Corollary 11.13, in particular let r, R, θmin , θmax be as
in Assumption 11.11. Then for all x ∈ Rd with ∥x∥ ≤ R
√
mU 2
lin 4 mU Lr p
sup ∥Φ(x, wk ) − Φ (x, pk )∥ ≤ 1+ 2
f (w0 ).
k∈N θmin (hθmin ) (θmin + θmax )
To prove the theorem, we first examine the difference between the full and linearized models on
the training data.
161
where I m ∈ Rm×m is the identity. Set c := 1 − hθmin . Then by (11.5.7b), (11.4.6b), and because
h < (θmin + θmax )−1 , we can bound the second term by
mU Lr p
∥2h(Gk − Glin )(Φ(X, wk ) − y)∥ ≤ 2 f (w0 ) ck .
θmin + θmax
| {z }
=:α̃
Moreover, assumption 11.11 (a) and h < (θmin + θmax )−1 yield
lin
∥I m − 2hG ∥ ≤ 1 − 2hθmin ≤ c.
P∞
Hence, using j=0 cj = (hθmin )−1
k
X α̃
∥ek+1 ∥ ≤ c∥ek ∥ + α̃ck ≤ · · · ≤ ck ∥e0 ∥ + ck−j α̃cj ≤ ck ∥e0 ∥ + (k + 1)ck .
hθmin
j=0
Since w0 = p0 it holds Φlin (X, p0 ) = Φ(X, w0 ) (cf. (11.3.1)). Thus ∥e0 ∥ = 0 which gives the
statement.
We are now in position to prove the theorem.
Proof (of Theorem 11.14). Throughout this proof we write for short
Gk = Gk (x, X) ∈ R1×m and Glin = Glin (x, X) ∈ R1×m ,
and set for k ∈ N
ek := Φ(x, wk ) − Φlin (x, pk ).
Subtracting (11.5.6a) from (11.5.3a)
ek+1 = ek − 2hGk (Φ(X, wk ) − y) + 2hGlin (Φlin (X, pk ) − y)
= ek − 2h(Gk − Glin )(Φ(X, wk ) − y) + 2hGlin (Φlin (X, pk ) − Φ(X, wk )).
Denote c := 1 − hθmin . By (11.5.7a) and (11.4.6b)
√
2h∥Gk − Glin ∥ ≤ 4h mU Lr
p
∥Φ(X, wk ) − y∥ ≤ ck f (w0 )
and by (11.4.2a) (cf. (11.5.5)) and Proposition 11.15
√
2h∥Glin ∥ ≤ 2h mU 2
∥Φ(X, wk ) − Φlin (X, pk )∥ ≤ αkck−1 .
Hence for k ≥ 0
√ p √
|ek+1 | ≤ |ek | + 4h mU Lr f (w0 ) ck + 2h mU 2 α kck−1 .
| {z } | {z }
=:β1 =:β2
j = (1 − c)−1 = (hθmin )−1 and j−1
P P
Repeatedly applying this bound and using j≥0 c j≥0 jc =
(1 − c)−2 = (hθmin )−2
k k
X
j
X β1 β2
|ek+1 | ≤ |e0 | + β1 c + β2 jcj−1 ≤ + .
hθmin (hθmin )2
j=0 j=0
162
11.6 Training dynamics for shallow neural networks
In this section, following [158], we discuss the implications of Theorems 11.12 and 11.14 for wide
neural networks. As in [271], for ease of presentation we focus on a shallow architecture with
only one hidden layer, but stress that similar considerations also hold for deep networks, see the
bibliography section.
11.6.1 Architecture
Let Φ : Rd → R be a neural network of depth one and width n ∈ N of type
Here x ∈ Rd is the input, and U ∈ Rn×d , v ∈ Rn , b ∈ Rn and c ∈ R are the parameters which we
collect in the vector w = (U , b, v, c) ∈ Rn(d+2)+1 (with U suitably reshaped). For future reference
we note that
∇U Φ(x, w) = (v ⊙ σ ′ (U x + b))x⊤ ∈ Rn×d
∇b Φ(x, w) = v ⊙ σ ′ (U x + b) ∈ Rn
(11.6.2)
∇v Φ(x, w) = σ(U x + b) ∈ Rn
∇c Φ(x, w) = 1 ∈ R,
where ⊙ denotes the Hadamard product. We also write ∇w Φ(x, w) ∈ Rn(d+2)+1 to denote the full
gradient with respect to all parameters.
In practice, it is common to initialize the weights randomly, and in this section we consider
so-called LeCun initialization [156]. The following condition on the distribution W used for this
initialization will be assumed throughout the rest of Section 11.6.
Assumption 11.16. The distribution W on R has expectation zero, variance one, and finite
moments up to order eight.
To explicitly indicate the expectation and variance in the notation, we also write W(0, 1) instead
of W, and for µ ∈ R and ς > 0 we use W(µ, ς 2 ) to denote the corresponding scaled and shifted
measure with expectation µ and variance ς 2 ; thus, if X ∼ W(0, 1) then µ + ςX ∼ W(µ, ς 2 ). LeCun
initialization sets the variance of the weights in each layer to be reciprocal to the input dimension of
the layer: the idea is to normalize the output variance of all network nodes. The initial parameters
w0 = (U 0 , b0 , v 0 , c0 )
163
11.6.2 Neural tangent kernel
We begin our analysis by investigating the empirical tangent kernel
K̂n (x, z) = ⟨∇w Φ(x, w0 ), ∇w Φ(z, w0 )⟩
of the shallow network (11.6.1) with initialization 11.6.3. Scaled properly, it converges in the infinite
width limit n → ∞ towards a specific kernel known as the neural tangent kernel (NTK) [131].
This kernel depends on both the architecture and the initialization scheme. Since we focus on the
specific setting introduced in Section 11.6.1 in the following, we simply denote it by K NTK .
Theorem 11.18. Let R < ∞ such that |σ(x)| ≤ R · (1 + |x|) and |σ ′ (x)| ≤ R · (1 + |x|) for all
iid
x ∈ R. For any x, z ∈ Rd and ui ∼ W(0, 1/d), i = 1, . . . , d, it then holds
1
lim K̂n (x, z) = E[σ(u⊤ x)σ(u⊤ z)] =: K NTK (x, z)
n→∞ n
almost surely.
Moreover, for every δ, ε > 0 there exists n0 (δ, ε, R) ∈ N such that for all n ≥ n0 and all x,
z ∈ Rd with ∥x∥, ∥z∥ ≤ R
" #
1
P K̂n (x, z) − K NTK (x, z) < ε ≥ 1 − δ.
n
are i.i.d. with finite pth moment (independent of n) for all 1 ≤ p ≤ 8. Due to the linear growth bound
(1) (1) (1)
on σ and σ ′ , the same holds for the (σ(xi ))ni=1 and the (σ ′ (xi ))ni=1 . Similarly, the (σ(zi ))ni=1
(1)
and (σ ′ (zi ))ni=1 are collections of i.i.d. random variables with finite pth moment for all 1 ≤ p ≤ 8.
√ iid
Denote ṽi = nv0;i such that ṽi ∼ W(0, 1). By (11.6.2)
n n
1 ⊤ 1 X 2 ′ (1) ′ (1) 1X (1) (1) 1
K̂n (x, z) = (1 + x z) 2 ṽi σ (xi )σ (zi ) + σ(xi )σ(zi ) + .
n n n n
i=1 i=1
Since
n
1 X 2 ′ (1) ′ (1)
ṽi σ (xi )σ (zi ) (11.6.4)
n
i=1
is an average over i.i.d. random variables with finite variance, the law of large numbers implies
almost sure convergence of this expression towards
(1) (1) (1) (1)
E ṽi2 σ ′ (xi )σ ′ (zi ) = E[ṽi2 ]E[σ ′ (xi )σ ′ (zi )]
164
(1) (1)
where we used that ṽi2 is independent of σ ′ (xi )σ ′ (zi ). By the same argument
n
1X (1) (1)
σ(xi )σ(zi ) → E[σ(u⊤ x)σ(u⊤ z)]
n
i=1
NTK , θ NTK ].
is regular and its eigenvalues belong to [θmin max
We start by showing Assumption 11.11 (a) for the present setting. More precisely, we give
bounds for the eigenvalues of the empirical tangent kernel.
Lemma 11.21. Let Assumption 11.20 be satisfied. Then for every δ > 0 there exists
NTK , m, R) ∈ R such that for all n ≥ n it holds with probability at least 1 − δ that all
n0 (δ, θmin 0
eigenvalues of
m
(K̂n (xi , xj ))m
i,j=1 = ⟨∇w Φ(xi , w 0 ), ∇w Φ(xj , w 0 )⟩ i,j=1 ∈ R
m×m
165
NTK :=
Proof. Denote Ĝn := (K̂n (xi , xj ))m
i,j=1 and G (K NTK (xi , xj ))m
i,j=1 . By Theorem 11.18,
there exists n0 such that for all n ≥ n0 holds with probability at least 1 − δ that
1 θNTK
GNTK − Ĝn ≤ min .
n 2
Assuming this bound to hold
1 1 θNTK θNTK θNTK
∥Ĝn ∥ = sup ∥Ĝn a∥ ≥ infm ∥GNTK a∥ − min ≥ θmin
NTK
− min ≥ min ,
n a∈Rm n a∈R 2 2 2
∥a∥=1 ∥a∥=1
where we have used that θminNTK is the smallest eigenvalue, and thus singular value, of the symmetric
NTK
positive definite matrix G . This shows that (with probability at least δ) the smallest eigenvalue
NTK
of Ĝn is larger or equal to nθmin /2. Similarly, we conclude that the largest eigenvalue is bounded
from above by n(θmaxNTK + θ NTK /2) ≤ 2nθ NTK . This concludes the proof.
min max
Next we check Assumption 11.11 (b). To this end we first bound the norm of a random matrix.
iid
Lemma 11.22. Let W(0, 1) be as in Assumption 11.16, and let W ∈ Rn×d with Wij ∼ W(0, 1).
Denote the fourth moment of W(0, 1) by µ4 . Then
h p i dµ4
P ∥W ∥ ≤ n(d + 1) ≥ 1 − .
n
Proof. It holds
n X
X d 1/2
∥W ∥ ≤ ∥W ∥F = Wij2 .
i=1 j=1
The αi := dj=1 Wij2 , i = 1, . . . , n, are i.i.d. distributed with expectation d and finite variance dC,
P
h i n
h1 X i h 1Xn i dµ
4
p
P ∥W ∥ > n(d + 1) ≤ P αi > d + 1 ≤ P αi − d > 1 ≤ ,
n n n
i=1 i=1
Lemma 11.23. Let Assumption 11.20 (a) be satisfied with some constant R. Then there exists
M (R), and for all c, δ > 0 there exists n0 (c, d, δ) ∈ N such that for all n ≥ n0 it holds with
probability at least 1 − δ that
√
∥∇w Φ(x, w)∥ ≤ M n for all w ∈ Bcn−1/2 (w0 )
√
∥∇w Φ(x, w) − ∇w Φ(x, v)∥ ≤ M n∥w − v∥ for all w, v ∈ Bcn−1/2 (w0 )
166
Proof. Due to the initialization (11.6.3), by Lemma 11.22 we can find ñ0 (δ, d) such that for all
n ≥ ñ0 holds with probability at least 1 − δ that
√
∥v 0 ∥ ≤ 2 and ∥U 0 ∥ ≤ 2 n. (11.6.5)
For the rest of this proof we let x ∈ Rd arbitrary with ∥x∥ ≤ R, we set
and we fix n ≥ n0 so that n−1/2 c ≤ 1. To prove the lemma we need to show that the claimed
inequalities hold as long as (11.6.5) is satisfied. We will several times use that for all p, q ∈ Rn
√
∥p ⊙ q∥ ≤ ∥p∥∥q∥ and ∥σ(p)∥ ≤ R n + R∥p∥
w = (U , b, v, c) s.t. ∥w − w0 ∥ ≤ cn−1/2 .
Using formula (11.6.2) for ∇b Φ, the fact that b0 = 0 by (11.6.3), and the above inequalities
Due to √ √
∥U ∥ ≤ ∥U 0 ∥ + ∥U 0 − U ∥F ≤ 2 n + cn−1/2 ≤ 3 n, (11.6.7)
and using the fact that σ ′ has Lipschitz constant R, the last norm in (11.6.6) is bounded by
and therefore √
∥∇b Φ(x, w)∥ ≤ n(6R + 9R2 ).
For the gradient with respect to U we use ∇U Φ(x, w) = ∇b Φ(x, w)x⊤ , so that
√
∥∇U Φ(x, w)∥F = ∥∇b Φ(x, w)x⊤ ∥F = ∥∇b Φ(x, w)∥∥x∥ ≤ n(6R2 + 9R3 ).
Next
167
and finally ∇c Φ(x, w) = 1. In all, with M1 (R) := (1 + 8R + 12R2 )
√
∥∇w Φ(x, w̃)∥ ≤ nM1 (R).
Next
and finally ∇c Φ(x, w) = 1 is constant. With M2 (R) := R + 6R2 + 6R3 this shows
√
∥∇w Φ(x, w) − ∇w Φ(x, w̃)∥ ≤ nM2 (R)∥w − w̃∥.
In all, this concludes the proof with M (R) := max{M1 (R), M2 (R)}.
Next, we show that the initial error f (w0 ) remains bounded with high probability.
Lemma 11.24. Let Assumption 11.20 (a), (b) be satisfied. Then for every δ > 0 exists
R0 (δ, m, R) > 0 such that for all n ∈ N
P[f (w0 ) ≤ R0 ] ≥ 1 − δ.
√ iid
Proof. Let i ∈ {1, . . . , m}, and set α := U 0 xi and ṽj := nv0;j for j = 1, . . . , n, so that ṽj ∼
W(0, 1). Then
n
1 X
Φ(xi , w0 ) = √ ṽj σ(αj ).
n
j=1
168
By Assumption 11.16 and (11.6.3), the ṽj σ(αj ), j = 1, . . . , n, are i.i.d. centered random variables
with finite variance bounded by a constant C(R) independent of n. Thus the variance of Φ(xi , w0 )
is also bounded by C(R). By Chebyshev’s inequality, see Lemma A.22, for every k > 0
√ 1
P[|Φ(xi , w0 )| ≥ k C] ≤ 2 .
k
p
Setting k = m/δ
m
hX √ i Xm h √ i
P |Φ(xi , w0 ) − yi |2 ≥ m(k C + R)2 ≤ P |Φ(xi , w0 ) − yi | ≥ k C + R
i=1 i=1
m
X h √ i
≤ P |Φ(xi , w0 )| ≥ k C ≤ δ,
i=1
p
which shows the claim with R0 = m · ( Cm/δ + R)2 .
The next theorem, which corresponds to [158, Thms. G.1 and H.1], is the main result of this
section. It summarizes our findings in the present setting for a shallow network of width n: with high
probability, gradient descent converges to a global minimizer and the limiting network interpolates
the data. During training the network weights remain close to initialization. The trained network
gives predictions that are O(n−1/2 ) close to the predictions of the trained linearized model. In the
statement of the theorem we denote again by Φlin the linearization of Φ defined in (11.3.1), and by
f lin , f , the corresponding square loss objectives defined in (11.0.1b), (11.3.3), respectively.
Theorem 11.25. Let Assumption 11.20 be satisfied, and let the parameters w0 of the width-n
neural network Φ in (11.6.1) be initialized according to (11.6.3). Fix the learning rate
1 1
h= NTK NTK
,
θmin + 4θmax n
Then for every δ > 0 there exist C > 0, n0 ∈ N such that for all n ≥ n0 it holds with probability
at least 1 − δ that for all k ∈ N and all x ∈ Rd with ∥x∥ ≤ R
C
∥wk − w0 ∥ ≤ √ (11.6.8a)
n
hn 2k
f (wk ) ≤ C 1 − NTK
(11.6.8b)
2θmin
C
∥Φ(x, wk ) − Φlin (x, pk )∥ ≤ √ . (11.6.8c)
n
Proof. We wish to apply Theorems 11.12 and 11.14. This requires Assumption 11.11 to be satisfied.
169
p Fix δ > 0 and let R0 (δ) be as in Lemma 11.24, so that with probability at least 1 − δ/2 it holds
f (w0 ) ≤ R0 . Next, let M = M (R) be as in Lemma 11.23, and fix n0,1 ∈ N and c > 0 so large
that for all n ≥ n0,1
√ √
√ n2 (θmin
NTK /2)2
−1/2 2 mM n p
M n≤ √ and cn = NTK
R0 . (11.6.9)
12m3/2 M 2 n R0 nθmin
By Lemma 11.21 and 11.23, we can then find n0,2 such that for all n ≥ n0,2 with probability at
least 1 − δ/2 we have that Assumption 11.11 (a), (b) holds with the values
√ √ NTK
nθmin
L = M n, U = M n, r = cn−1/2 , θmin = , NTK
θmax = 2nθmax . (11.6.10)
2
Together with (11.6.9), this shows that Assumption 11.11 holds with probability at least 1 − δ as
long as n ≥ n0 := max{n0,1 , n0,2 }.
Inequalities (11.6.8a), (11.6.8b) are then a direct consequence of Theorem 11.12. For (11.6.8c),
we plug the values of (11.6.10) into the bound in Theorem 11.14 to obtain for k ∈ N
√
mU 2
lin 4 mU Lr p
∥Φ(x, wk ) − Φ (x, pk )∥ ≤ 1+ f (w0 )
θmin (hθmin )2 (θmin + θmax )
C1 p
≤ √ (1 + C2 ) f (w0 ),
n
LC , θ LC but independent of n.
for some constants C1 , C2 depending on m, M , c, θmin max
Note that the convergence rate in (11.6.10) does not improve as n grows, since h is bounded by
a constant times 1/n.
Definition 11.26. Let (Ω, A, P) be a probability space (see Section A.1), and let g : Rd × Ω →
R. We call g a Gaussian process with mean function µ : Rd → R and covariance function
c : Rd × Rd → R if
(b) for all k ∈ N and all x1 , . . . , xk ∈ Rd the random variables g(x1 , ·), . . . , g(xk , ·) are jointly
Gaussian distributed with
(g(x1 , ω), . . . , g(xk , ω)) ∼ N µ(xi )ki=1 , (c(xi , xj ))ki,j=1 .
170
In words, g is a Gaussian process, if ω 7→ g(x, ω) defines a collection of random variables indexed
over x ∈ Rd , and the joint distribution of (g(x1 , ·))kj=1 is a Gaussian whose mean and variance are
determined by µ and c respectively. Fixing ω ∈ Ω, we can then interpret x 7→ g(x, ω) as a random
draw from a distribution over functions.
As first observed in [186], certain neural networks at initialization tend to Gaussian processes
in the infinite width limit.
Proposition 11.27. Let |σ(x)| ≤ R(1 + |x|)4 for all x ∈ R. Consider depth-n networks Φ as in
(11.6.1) with initialization (11.6.3). Let K NTK : Rd × Rd be as in Theorem 11.18.
Then for all distinct x1 , . . . , xk ∈ Rd it holds that
√ iid
Proof. Set ṽi := nv0,i and ũi = (U0,i1 , . . . , U0,id ) ∈ Rd , so that ṽi ∼ W(0, 1), and the ũi ∈ Rd are
also i.i.d., with each component distributed according to W(0, 1/d).
Then for any x1 , . . . , xk
ṽi σ(ũ⊤
i x1 )
.. k
Z i := ∈R i = 1, . . . , n,
.
ṽi σ(ũ⊤
i xk )
defines n centered i.i.d. vectors in Rk with finite second moments (here we use the assumption on
σ and the fact that W(0, 1) has finite moments of order 8 by Assumption 11.16). By the central
limit theorem, see Theorem A.25,
Φ(x1 , w0 ) n
.. 1 X
= √ Zi
. n
Φ(xk , w0 ) j=1
In the sense of Proposition 11.27, the network Φ(x, w0 ) converges to a Gaussian process as
the width n tends to infinity. It can also be shown that the linearized network after training
corresponds to a Gaussian process, with a mean and covariance function that depend on the data,
architecture, and initialization. Since the full and linearized models coincide in the infinite width
limit (see Theorem 11.25) we can infer that wide networks post-training resemble draws from a
Gaussian process, see [158, Section 2.3.1] and [61].
171
To motivate the mean function of this Gaussian process, we informally take an expectation
(over random initialization) of (11.3.5), yielding
h i
E lim Φlin (x, pk ) = ⟨ϕ(x), p∗ ⟩ + E[⟨ϕ(x), ŵ0 ⟩] .
k→∞ | {z } | {z }
ridgeless kernel least-squares =0
estimator with kernel K̂n
Here the second term vanishes, because ŵ0 is the orthogonal projection of the centered random
variable w0 onto a subspace, so that E[ŵ0 ] = 0. This suggests, that after training for a long time,
the mean of the trained linearized model resembles the ridgeless kernel least-squares estimator with
kernel K̂n . Since K̂n /n converges to K NTK by Theorem 11.18, and a scaling of the kernel by 1/n
does not affect the corresponding kernel least-squares estimator, we expect that for large widths n
and large k
h i h i
E Φ(x, wk ) ≃ E Φlin (x, pk ) ≃ ridgeless kernel least-squares estimator
with kernel K NTK evaluated at x
. (11.6.11)
In words, after sufficient training, the mean (over random initializations) of the trained neural
network x 7→ Φ(x, wk ) resembles the kernel least-squares estimator with kernel K NTK . Thus, under
these assumptions, we obtain an explicit characterization of what the network prediction looks like
after training with gradient descent. For more details and a characterization of the corresponding
covariance function, we refer again to [158, Section 2.3.1].
Let us now consider a numerical experiment to visualize this observation. In Figure 11.3 we
plot 80 different realizations of a neural network before and after training, i.e. the functions
x 7→ Φ(x, w0 ) and x 7→ Φ(x, wk ). (11.6.12)
The architecture was chosen as in (11.6.1) with activation function σ = arctan(x), width n = 250
and initialization
3 3
iid iid iid
U0;ij ∼ N 0, , v0;i ∼ N 0, , b0;i , c0 ∼ N(0, 2). (11.6.13)
d n
The network was trained on the ridgeless square loss
m
X
f (w) = (Φ(xj , w) − yj )2 ,
j=1
and a dataset of size m = 3 with k = 5000 steps of gradient descent and constant step size h = 1/n.
Before training, the network’s outputs resemble random draws from a Gaussian process with a
constant zero mean function. Post-training, the outputs show minimal variance at the training
points, since they essentially interpolate the data, as can be expected due to Theorem 11.25—
specifically (11.6.8b). Outside of the training points, we observe increased variance stemming from
the second term in (11.3.5). The mean should be close to the ridgeless kernel least squares estimator
with kernel K NTK by (11.6.11).
Figure 11.4 shows realizations of the network trained with ridge regularization, i.e. using the
loss function (11.0.1c). Initialization and all hyperparameters match those in Figure 11.3, with the
regularization parameter λ set to 0.01. For a linear model, the prediction after training with ridge
regularization is expected to exhibit reduced randomness, as the trained model is O(λ) close to the
ridgeless kernel least-squares estimator (see Section 11.2.3). We emphasize that Theorem 11.14,
showing closeness of the trained linearized and full model, does not encompass ridge regularization,
however in this example we observe a similar effect.
172
2 2
1 1
0 0
1 1
2 2
3 2 1 0 1 2 3 3 2 1 0 1 2 3
(a) before training (b) after training without regularization
Figure 11.3: 80 realizations of a neural network at initialization (a) and after training without
regularization on the red data points (b). The dashed line shows the mean. Figure based on [131,
Fig. 2], [158, Fig. 2].
Due to this different scaling, gradient descent with step size O(n−1 ) as in Theorem 11.25, will
primarily adjust the weigths v in the output layer, while only slightly modifying the remaining
parameters U , b, and c. This is also reflected in the expression for the obtained kernel K NTK
computed in Theorem 11.18, which corresponds only to the contribution of the term ⟨∇v Φ, ∇v Φ⟩.
LeCun initialization [156], sets the variance of the weight initialization inversely proportional to
the input dimension of each layer, so that the variance of all node outputs remains stable and does
not blow up as the width increases; also see [111]. However, it does not normalize the backward
dynamics, i.e., it does not ensure that the gradients with respect to the parameters have similar
variance. To balance the normalization of both the forward and backward dynamics, Glorot and
Bengio proposed a normalized initialization, where the variance is chosen inversely proportional to
the sum of the input and output dimensions of each layer [90]. We emphasize that the choice of
initialization strongly affects the neural tangent kernel (NTK) and, consequently, the predictions
of the trained network. For an initialization that explicitly normalizes the backward dynamics, we
refer to the original NTK paper [131].
173
2
2
3 2 1 0 1 2 3
Figure 11.4: 80 realizations of the neural network in Figure 11.3 after training on the red data
points with added ridge regularization. The dashed line shows the mean.
174
Exercises
Exercise 11.28. Prove Proposition 11.4.
Hint: Assume first that $w_0 \in \ker(A)^\perp$ (i.e. $w_0 \in \tilde{H}$). For $\mathrm{rank}(A) < d$, using $w_k = w_{k-1} - h\nabla f(w_{k-1})$ and the singular value decomposition of $A$, write down an explicit formula for $w_k$. Observe that due to $1/(1-x) = \sum_{k\in\mathbb{N}_0} x^k$ for all $x \in (0,1)$ it holds that $w_k \to A^\dagger y$ as $k \to \infty$, where $A^\dagger$ is the Moore-Penrose pseudoinverse of $A$.
Exercise 11.29. Let $A \in \mathbb{R}^{d\times d}$ be symmetric positive semidefinite, $b \in \mathbb{R}^d$, and $c \in \mathbb{R}$. For $\lambda > 0$ let
$f(w) := w^\top A w + b^\top w + c \quad \text{and} \quad f_\lambda(w) := f(w) + \lambda\|w\|^2.$
Show that $f_\lambda$ is $2\lambda$-strongly convex.
Hint: Use Exercise 10.23.
Exercise 11.30. Let $(H, \langle\cdot,\cdot\rangle_H)$ be a Hilbert space, and $\phi : \mathbb{R}^d \to H$ a mapping. Given $(x_j, y_j)_{j=1}^m \in (\mathbb{R}^d \times \mathbb{R})^m$, for $\lambda > 0$ denote
$f_\lambda(w) := \sum_{j=1}^m \big(\langle\phi(x_j), w\rangle_H - y_j\big)^2 + \lambda\|w\|_H^2 \quad \text{for all } w \in H.$
Prove that $f_\lambda$ has a unique minimizer $w_{*,\lambda} \in H$, that $w_{*,\lambda} \in \tilde{H} := \mathrm{span}\{\phi(x_1), \dots, \phi(x_m)\}$, and that $\lim_{\lambda\to 0} w_{*,\lambda} = w_*$, where $w_*$ is as in (11.2.3).
Hint: Assuming existence of w∗,λ , first show that w∗,λ belongs to the finite dimensional space
H̃. Now express w∗,λ in terms of an orthonormal basis of H̃, and prove that w∗,λ → w∗ .
Exercise 11.33. Consider the radial basis function (RBF) kernel $K : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, $K(x, z) := \exp(-(x - z)^2)$. Find a Hilbert space $H$ and a feature map $\phi : \mathbb{R} \to H$ such that $K(x, z) = \langle\phi(x), \phi(z)\rangle_H$.
Exercise 11.34. Consider the network (11.6.1) with LeCun initialization as in (11.6.3), but with
the biases instead initialized as
$c, b_i \overset{iid}{\sim} W(0, 1)$ for all $i = 1, \dots, n$. (11.6.14)
Chapter 12
In Chapter 10, we saw how the weights of neural networks get adapted during training, using, e.g.,
variants of gradient descent. For certain cases, including the wide networks considered in Chapter
11, the corresponding iterative scheme converges to a global minimizer. In general, this is not
guaranteed, and gradient descent can for instance get stuck in non-global minima or saddle points.
To get a better understanding of these situations, in this chapter we discuss the so-called loss
landscape. This term refers to the graph of the empirical risk as a function of the weights. We
give a more rigorous definition below, and first introduce notation for neural networks and their
realizations for a fixed architecture.
Definition 12.1. Let A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 , let σ : R → R be an activation function, and
let B > 0. We denote the set of neural networks Φ with L layers, layer widths d0 , d1 , . . . , dL+1 , all
weights bounded in modulus by B, and using the activation function σ by N (σ; A, B). Additionally,
we define
$PN(A, B) := \bigtimes_{\ell=0}^{L}\big([-B, B]^{d_{\ell+1}\times d_\ell} \times [-B, B]^{d_{\ell+1}}\big),$
$R_\sigma : PN(A, B) \to N(\sigma; A, B), \qquad (W^{(\ell)}, b^{(\ell)})_{\ell=0}^{L} \mapsto \Phi, \qquad (12.0.1)$
where $\Phi$ is the neural network with weights and biases given by $(W^{(\ell)}, b^{(\ell)})_{\ell=0}^{L}$.
Throughout, we will identify $PN(A, B)$ with the cube $[-B, B]^{n_A}$, where $n_A := \sum_{\ell=0}^{L} d_{\ell+1}(d_\ell + 1)$. Now we can introduce the loss landscape of a neural network architecture.
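The realization map $R_\sigma$ of Definition 12.1 simply evaluates the network described by a parameter vector. A minimal sketch in Python (the nested parameter list and helper names are illustrative conventions of our own, not notation from the text; as elsewhere in the book, no activation is applied in the output layer):

import numpy as np

def realization(params, sigma):
    # R_sigma(theta): evaluate the network given params = [(W0, b0), ..., (WL, bL)];
    # the activation is applied in all but the last layer.
    def phi(x):
        z = np.asarray(x, dtype=float)
        for W, b in params[:-1]:
            z = sigma(W @ z + b)
        W, b = params[-1]
        return W @ z + b
    return phi

# Example: architecture A = (2, 3, 1) with all weights drawn from [-B, B].
rng = np.random.default_rng(0)
A, B = (2, 3, 1), 1.0
params = [(rng.uniform(-B, B, (A[l + 1], A[l])), rng.uniform(-B, B, A[l + 1]))
          for l in range(len(A) - 1)]
n_A = sum(A[l + 1] * (A[l] + 1) for l in range(len(A) - 1))   # here n_A = 13
Phi = realization(params, np.tanh)
print(n_A, Phi(np.array([0.5, -0.3])))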
Figure 12.1: Two-dimensional section of a loss landscape. The loss landscape shows a spurious
valley with local minima, global minima, as well as a region where saddle points appear. Moreover,
a sharp minimum is shown.
$\Lambda_{A,\sigma,S,L} : PN(A, \infty) \to \mathbb{R}, \qquad \theta \mapsto \widehat{R}_S(R_\sigma(\theta)),$
with $\widehat{R}_S$ in (1.2.3) and $R_\sigma$ in (12.0.1).
Identifying PN (A, ∞) with RnA , we can consider ΛA,σ,S,L as a map on RnA and the loss
landscape is a subset of RnA × R. The loss landscape is a high-dimensional surface, with hills and
valleys. For visualization a two-dimensional section of a loss landscape is shown in Figure 12.1.
Questions of interest regarding the loss landscape include for example: How likely is it that we
find local instead of global minima? Are these local minima typically sharp, having small volume,
or are they part of large flat valleys that are difficult to escape? How bad is it to end up in a local
minimum? Are most local minima as deep as the global minimum, or can they be significantly
higher? How rough is the surface generally, and how do these characteristics depend on the network
architecture? While providing complete answers to these questions is hard in general, in the rest
of this chapter we give some intuition and mathematical insights for specific cases.
12.1 Visualization of loss landscapes
Visualizing loss landscapes can provide valuable insights into the effects of neural network depth,
width, and activation functions. However, we can only visualize an at most two-dimensional surface
embedded into three-dimensional space, whereas the loss landscape is a very high-dimensional
object (unless the neural networks have only very few weights and biases).
To make the loss landscape accessible, we need to reduce its dimensionality. This can be achieved
by evaluating the function ΛA,σ,S,L on a two-dimensional subspace of PN (A, ∞). Specifically, we
choose three parameters $\mu, \theta_1, \theta_2$ and examine the function
$(\alpha_1, \alpha_2) \mapsto \Lambda_{A,\sigma,S,L}(\mu + \alpha_1\theta_1 + \alpha_2\theta_2). \qquad (12.1.1)$
• Based on critical points: For a more global perspective, µ, θ1 , θ2 can be chosen to ensure the
observation of multiple critical points. One way to achieve this is by running the optimization
procedure three times with final parameters θ(1) , θ(2) , θ(3) . If the procedures have converged,
then each of these parameters is close to a critical point of ΛA,σ,S,L . We can now set µ = θ(1) ,
θ1 = θ(2) − µ, θ2 = θ(3) − µ. This then guarantees that (12.1.1) passes through or at least
comes very close to three critical points (at (α1 , α2 ) = (0, 0), (0, 1), (1, 0)). We present six
visualizations of this form in Figure 12.2.
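A sketch of this visualization procedure (the quadratic toy loss and the random "critical points" below are placeholders for the empirical risk $\Lambda_{A,\sigma,S,L}$ and the parameters $\theta^{(1)}, \theta^{(2)}, \theta^{(3)}$ obtained from actual training runs):

import numpy as np
import matplotlib.pyplot as plt

def loss(theta):
    # Placeholder for Lambda_{A,sigma,S,L}(theta); in practice this evaluates the
    # empirical risk of R_sigma(theta) on the training sample S.
    return float(np.sum((theta ** 2 - 1.0) ** 2) + 0.1 * np.sum(theta))

rng = np.random.default_rng(0)
n_params = 50
# Stand-ins for the parameters returned by three training runs.
theta_1, theta_2, theta_3 = (rng.normal(size=n_params) for _ in range(3))
mu, dir_1, dir_2 = theta_1, theta_2 - theta_1, theta_3 - theta_1

alphas = np.linspace(-0.5, 1.5, 101)
Z = np.array([[loss(mu + a1 * dir_1 + a2 * dir_2) for a1 in alphas] for a2 in alphas])

plt.contourf(alphas, alphas, Z, levels=30)   # two-dimensional section as in (12.1.1)
plt.xlabel("alpha_1"); plt.ylabel("alpha_2")
plt.show()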
Figure 12.2 gives some interesting insight into the effect of depth and width on the shape of the
loss landscape. For very wide and shallow neural networks, we have the widest minima, which, in
the case of the tanh activation function also seem to belong to the same valley. With increasing
depth and smaller width the minima get steeper and more disconnected.
it is clear that for every permutation matrix $P$ the realized function is unchanged when the rows of the hidden-layer weight matrix and bias vector are permuted by $P$ and the columns of the output weight matrix are permuted accordingly. Hence, in general there exist multiple parameterizations realizing the same output function. Moreover, if at least one global minimum with non-permutation-invariant weights exists, then there is more than one global minimum of the loss landscape.
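For a shallow network $\Phi(x) = W^{(1)}\sigma(W^{(0)}x + b^{(0)}) + b^{(1)}$, this invariance is easy to verify numerically; the following sketch (with an arbitrary architecture and tanh activation chosen purely for illustration) permutes the hidden neurons and checks that the realization is unchanged:

import numpy as np

def phi(x, W0, b0, W1, b1):
    # Shallow network Phi(x) = W1 sigma(W0 x + b0) + b1 with sigma = tanh.
    return W1 @ np.tanh(W0 @ x + b0) + b1

rng = np.random.default_rng(0)
d0, d1 = 4, 7
W0, b0 = rng.normal(size=(d1, d0)), rng.normal(size=d1)
W1, b1 = rng.normal(size=(1, d1)), rng.normal(size=1)

# Permute the hidden neurons: rows of (W0, b0) and, accordingly, columns of W1.
P = np.eye(d1)[rng.permutation(d1)]
x = rng.normal(size=d0)
print(np.allclose(phi(x, W0, b0, W1, b1),
                  phi(x, P @ W0, P @ b0, W1 @ P.T, b1)))   # True: same realization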
This is not problematic; in fact, having many global minima is beneficial. The larger issue is the
existence of non-global minima. Following [279], we start by generalizing the notion of non-global
minima to spurious valleys.
A path-connected component of ΩΛ (c), which does not contain a global minimum of ΛA,σ,S,L is
called a spurious valley.
The next proposition shows that spurious local minima do not exist for shallow overparameter-
ized neural networks, i.e., for neural networks that have at least as many parameters in the hidden
layer as there are training samples.
Proof. Let θa , θb ∈ PN (A, ∞) with ΛA,σ,S,L (θa ) > ΛA,σ,S,L (θb ). Then we will show below that
there is another parameter θc such that
• there is a continuous path α : [0, 1] → PN (A, ∞) such that α(0) = θa , α(1) = θc , and
ΛA,σ,S,L (α) is monotonically decreasing.
By Exercise 12.7, the construction above rules out the existence of spurious valleys by choosing θa
an element of a spurious valley and θb a global minimum.
Next, we present the construction: Let us denote
$\theta_o = \big(W_o^{(\ell)}, b_o^{(\ell)}\big)_{\ell=0}^{1} \quad \text{for } o \in \{a, b, c\}.$
Moreover, for $j = 1, \dots, d_1$, we introduce $v_o^j \in \mathbb{R}^m$ defined as
$(v_o^j)_i = \sigma\big((W_o^{(0)} x_i + b_o^{(0)})_j\big) \quad \text{for } i = 1, \dots, m.$
that $t \mapsto \Lambda_{A,\sigma,S,L}(\alpha(t))$ has a minimum $t^*$ on $[0, 1]$ with $\Lambda_{A,\sigma,S,L}(\alpha(t^*)) \leq \Lambda_{A,\sigma,S,L}(\theta_b)$. Moreover, $t \mapsto \Lambda_{A,\sigma,S,L}(\alpha(t))$ is monotonically decreasing on $[0, t^*]$. Setting $\theta_c = \alpha(t^*)$ completes this case.
Case 2: Assume that Va has rank less than m. In this case, we show that we find a continuous
path from θa to another neural network parameter with higher rank. The path will be such that
ΛA,σ,S,L is monotonically decreasing.
Under the assumptions, we have that one $v_a^j$ can be written as a linear combination of the remaining $v_a^i$, $i \neq j$. Without loss of generality, we assume $j = 1$. Then, there exist $(\alpha_i)_{i=2}^{m}$ such that
$v_a^1 = \sum_{i=2}^{m}\alpha_i v_a^i. \qquad (12.2.2)$
Next, we observe that there exists $v^* \in \mathbb{R}^m$ which is linearly independent from all $(v_a^j)_j$ and can be written as $(v^*)_i = \sigma((w^*)^\top x_i + b^*)$ for some $w^* \in \mathbb{R}^{d_0}$, $b^* \in \mathbb{R}$. Indeed, if we assume that such a $v^*$ does not exist, then for all $w \in \mathbb{R}^{d_0}$, $b \in \mathbb{R}$ the vector $(\sigma(w^\top x_i + b))_{i=1}^m$ belongs to the same $(m-1)$-dimensional subspace. It would follow that $\mathrm{span}\{(\sigma(w^\top x_i + b))_{i=1}^m \mid w \in \mathbb{R}^{d_0}, b \in \mathbb{R}\}$ is an $(m-1)$-dimensional subspace of $\mathbb{R}^m$, which yields a contradiction to Theorem 9.3.
Now, we define two paths: First,
where
$(W_a^{(1)}(t))_1 = (1 - 2t)(W_a^{(1)})_1 \quad \text{and} \quad (W_a^{(1)}(t))_i = (W_a^{(1)})_i + 2t\alpha_i(W_a^{(1)})_1$
where
$(W_a^{(0)}(t))_1 = 2(t - 1/2)(W_a^{(0)})_1 + (2t - 1)w^* \quad \text{and} \quad (W_a^{(0)}(t))_i = (W_a^{(0)})_i$
$\bar{v}^j := \Big(\sigma\big((W_a^{(0)}(1)x_i + b_a^{(0)}(1))_j\big)\Big)_{i=1}^{m}$
$e_i = \Phi_\theta(x_i) - y_i \quad \text{for } i = 1, \dots, m.$
Proposition 12.5. Let $A = (d_0, d_1, 1)$ and $\sigma : \mathbb{R} \to \mathbb{R}$. Then, for every $\theta \in PN(A, \infty)$ where $\widehat{R}_S(\Phi_\theta)$ in (12.3.1) is twice continuously differentiable with respect to the weights, it holds that
$H(\theta) = H_0(\theta) + H_1(\theta),$
where $H(\theta)$ is the Hessian of $\widehat{R}_S(\Phi_\theta)$ at $\theta$, $H_0(\theta)$ is a positive semi-definite matrix which is independent from $(y_i)_{i=1}^m$, and $H_1(\theta)$ is a symmetric matrix that for fixed $\theta$ and $(x_i)_{i=1}^m$ depends linearly on $(e_i)_{i=1}^m$.
Proof. Using the identification introduced after Definition 12.2, we can consider θ a vector in RnA .
For $k = 1, \dots, n_A$, we have that
$\frac{\partial\widehat{R}_S(\Phi_\theta)}{\partial\theta_k} = \frac{2}{m}\sum_{i=1}^{m} e_i\frac{\partial\Phi_\theta(x_i)}{\partial\theta_k}.$
It remains to show that $H_0(\theta)$ and $H_1(\theta)$ have the asserted properties. Note that, setting
$J_{i,\theta} = \Big(\frac{\partial\Phi_\theta(x_i)}{\partial\theta_1}, \dots, \frac{\partial\Phi_\theta(x_i)}{\partial\theta_{n_A}}\Big)^\top \in \mathbb{R}^{n_A},$
we have that $H_0(\theta) = \frac{2}{m}\sum_{i=1}^{m} J_{i,\theta}J_{i,\theta}^\top$ and hence $H_0(\theta)$ is a sum of positive semi-definite matrices, which shows that $H_0(\theta)$ is positive semi-definite.
The symmetry of $H_1(\theta)$ follows directly from the symmetry of second derivatives, which holds since we assumed twice continuous differentiability at $\theta$. The linearity of $H_1(\theta)$ in $(e_i)_{i=1}^m$ is clear from (12.3.2).
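The matrix $H_0(\theta)$ can be assembled exactly as in the proof; the sketch below does this for a shallow tanh network, using a finite-difference approximation of the Jacobians $J_{i,\theta}$ as a generic stand-in for the analytic derivatives, and checks positive semi-definiteness numerically.

import numpy as np

def phi(theta, x, d0, d1):
    # Shallow network with architecture (d0, d1, 1) and tanh activation;
    # theta collects (W0, b0, W1, b1) in one vector of length n_A.
    W0 = theta[: d0 * d1].reshape(d1, d0)
    b0 = theta[d0 * d1 : d0 * d1 + d1]
    W1 = theta[d0 * d1 + d1 : d0 * d1 + 2 * d1]
    b1 = theta[-1]
    return W1 @ np.tanh(W0 @ x + b0) + b1

def jacobian(theta, x, d0, d1, eps=1e-6):
    # Finite-difference approximation of J_{i,theta} = grad_theta Phi_theta(x_i).
    J = np.zeros_like(theta)
    for k in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[k] += eps; tm[k] -= eps
        J[k] = (phi(tp, x, d0, d1) - phi(tm, x, d0, d1)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
d0, d1, m = 2, 5, 10
nA = d1 * (d0 + 1) + (d1 + 1)
theta = rng.normal(size=nA)
X = rng.normal(size=(m, d0))

# H_0(theta) = (2/m) * sum_i J_i J_i^T is positive semi-definite by construction.
H0 = np.zeros((nA, nA))
for i in range(m):
    J = jacobian(theta, X[i], d0, d1)
    H0 += 2.0 / m * np.outer(J, J)
print(np.min(np.linalg.eigvalsh(H0)) >= -1e-8)   # numerically PSD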
How does Proposition 12.5 imply the claimed relationship between the size of the loss and the
prevalence of saddle points?
Let θ correspond to a critical point. If H(θ) has at least one negative eigenvalue, then θ cannot
be a minimum, but instead must be either a saddle point or a maximum. While we do not know
anything about H 1 (θ) other than that it is symmetric, it is not unreasonable to assume that it
has a negative eigenvalue especially if nA is very large. With this consideration, let us consider the
following model:
Fix a parameter $\theta$. Let $S^0 = (x_i, y_i^0)_{i=1}^m$ be a sample and $(e_i^0)_{i=1}^m$ be the associated errors. Further let $H^0(\theta)$, $H_0^0(\theta)$, $H_1^0(\theta)$ be the matrices according to Proposition 12.5.
Further let for $\lambda > 0$, $S^\lambda = (x_i, y_i^\lambda)_{i=1}^m$ be such that the associated errors are $(e_i^\lambda)_{i=1}^m = \lambda(e_i^0)_{i=1}^m$. The Hessian of $\widehat{R}_{S^\lambda}(\Phi_\theta)$ at $\theta$ is then $H^\lambda(\theta)$ satisfying
$H^\lambda(\theta) = H_0^0(\theta) + \lambda H_1^0(\theta).$
If $v$ is an eigenvector of $H_1^0(\theta)$ associated to a negative eigenvalue, then $v^\top H^\lambda(\theta)v = v^\top H_0^0(\theta)v + \lambda v^\top H_1^0(\theta)v$, which we can expect to be negative for large $\lambda$. Thus, $H^\lambda(\theta)$ has a negative eigenvalue for large $\lambda$.
On the other hand, if $\lambda$ is small, then $H^\lambda(\theta)$ is merely a perturbation of $H_0^0(\theta)$ and we can expect its spectrum to resemble that of $H_0^0(\theta)$ more and more.
What we see is that the same parameter is more likely to be a saddle point for a sample that produces a high empirical risk than for a sample with small risk. Note that, since $H_0^0(\theta)$ was only shown to be positive semi-definite, the argument above does not rule out saddle points even for very small $\lambda$. But it does show that for small $\lambda$, every negative eigenvalue would be very small.
A more refined analysis where we compare different parameters but for the same sample and
quantify the likelihood of local minima versus saddle points requires the introduction of a probability
distribution on the weights. We refer to [206] for the details.
Exercises
Exercise 12.6. In view of Definition 12.3, show that a local minimum of a differentiable function
is contained in a spurious valley.
Exercise 12.7. Show that if there exists a continuous path α between a parameter θ1 and a global
minimum θ2 such that ΛA,σ,S,L (α) is monotonically decreasing, then θ1 cannot be an element of a
spurious valley.
Figure 12.2: A collection of loss landscapes. In the left column are neural networks with ReLU
activation function, the right column shows loss landscapes of neural networks with the hyperbolic
tangent activation function. All neural networks have five dimensional input, and one dimensional
output. Moreover, from top to bottom the hidden layers have widths 1000, 20, 10, and the number
of hidden layers are 1, 4, 7.
Chapter 13
As we have seen in the previous chapter, the loss landscape of neural networks can be quite intricate
and is typically not convex. In some sense, the reason for this is that we take the point of view
of a map from the parameterization of a neural network. Let us consider a convex loss function
$L : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ and a sample $S = (x_i, y_i)_{i=1}^m \in (\mathbb{R}^d \times \mathbb{R})^m$. Then, for two neural networks $\Phi_1, \Phi_2$ and for $\alpha \in (0, 1)$ it holds that
$\widehat{R}_S(\alpha\Phi_1 + (1-\alpha)\Phi_2) = \frac{1}{m}\sum_{i=1}^{m} L(\alpha\Phi_1(x_i) + (1-\alpha)\Phi_2(x_i), y_i) \leq \frac{1}{m}\sum_{i=1}^{m}\big(\alpha L(\Phi_1(x_i), y_i) + (1-\alpha)L(\Phi_2(x_i), y_i)\big) = \alpha\widehat{R}_S(\Phi_1) + (1-\alpha)\widehat{R}_S(\Phi_2).$
Hence, the empirical risk is convex when considered as a map depending on the neural network
functions rather than the neural network parameters. A convex function does not have spurious
minima or saddle points. As a result, the issues from the previous section are avoided if we take
the perspective of neural network sets.
So why do we not optimize over the sets of neural networks instead of the parameters? To
understand this, we will now study the set of neural networks associated with a fixed architecture
as a subset of other function spaces.
We start by investigating the realization map Rσ introduced in Definition 12.1. Concretely,
we show in Section 13.1, that if σ is Lipschitz, then the set of neural networks is the image of
PN (A, ∞) under a locally Lipschitz map. We will use this fact to show in Section 13.2 that sets of
neural networks are typically non-convex, and even have arbitrarily large holes. Finally, in Section
13.3, we study the extent to which there exist best approximations to arbitrary functions, in the set
of neural networks. We will demonstrate that the lack of best approximations causes the weights
of neural networks to grow infinitely during training.
Proposition 13.1. Let A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 , let σ : R → R be Cσ -Lipschitz continuous
with Cσ ≥ 1, let |σ(x)| ≤ Cσ |x| for all x ∈ R, and let B ≥ 1.
Then, for all θ, θ′ ∈ PN (A, B),
Proof. Let θ, θ′ ∈ PN (A, B) and define δ := ∥θ − θ′ ∥∞ . Repeatedly using the triangle inequality
we find a sequence $(\theta_j)_{j=0}^{n_A}$ such that $\theta_0 = \theta$, $\theta_{n_A} = \theta'$, $\|\theta_j - \theta_{j+1}\|_\infty \leq \delta$, and $\theta_j$ and $\theta_{j+1}$ differ in one entry only for all $j = 0, \dots, n_A - 1$. We conclude that for all $x \in [-1, 1]^{d_0}$
$\|R_\sigma(\theta)(x) - R_\sigma(\theta')(x)\|_\infty \leq \sum_{j=0}^{n_A-1}\|R_\sigma(\theta_j)(x) - R_\sigma(\theta_{j+1})(x)\|_\infty. \qquad (13.1.1)$
To upper bound (13.1.1), we now only need to understand the effect of changing one weight in a
neural network by δ.
Before we can complete the proof, we need two auxiliary lemmas. The first of these holds under slightly weaker assumptions than Proposition 13.1.
Lemma 13.2. Under the assumptions of Proposition 13.1, but with B being allowed to be arbitrary
positive, it holds for all Φ ∈ N (σ; A, B)
Proof. We start with the case where $L = 1$. Then, for $(d_0, d_1, d_2) = A$, we have that $\Phi(x) = W^{(1)}\sigma(W^{(0)}x + b^{(0)}) + b^{(1)}$ for certain $W^{(0)}, W^{(1)}, b^{(0)}, b^{(1)}$ with all entries bounded by $B$. As a consequence, we can estimate
$\|\Phi(x) - \Phi(x')\|_\infty = \|W^{(1)}\big(\sigma(W^{(0)}x + b^{(0)}) - \sigma(W^{(0)}x' + b^{(0)})\big)\|_\infty$
$\leq d_1 B\|\sigma(W^{(0)}x + b^{(0)}) - \sigma(W^{(0)}x' + b^{(0)})\|_\infty$
$\leq d_1 B C_\sigma\|W^{(0)}(x - x')\|_\infty$
$\leq d_1 d_0 B^2 C_\sigma\|x - x'\|_\infty$
$\leq C_\sigma(d_{max}B)^2\|x - x'\|_\infty,$
where we used the Lipschitz property of $\sigma$ and the fact that $\|Ax\|_\infty \leq n\max_{i,j}|A_{ij}|\,\|x\|_\infty$ for every matrix $A = (A_{ij})_{i=1,j=1}^{m,n} \in \mathbb{R}^{m\times n}$.
The induction step from L to L+1 follows similarly. This concludes the proof of the lemma.
Lemma 13.3. Under the assumptions of Proposition 13.1 it holds that
Resolving the recursive estimate $\|x^{(\ell)}\|_\infty \leq 2C_\sigma B d_{max}\max\{1, \|x^{(\ell-1)}\|_\infty\}$, we conclude that
$\|x^{(\ell)}\|_\infty \leq (2C_\sigma B d_{max})^\ell\max\{1, \|x^{(0)}\|_\infty\} = (2C_\sigma B d_{max})^\ell.$
This concludes the proof of the lemma.
We can now proceed with the proof of Proposition 13.1. Assume that θj+1 and θj differ only in
one entry. We assume this entry to be in the ℓth layer, and we start with the case ℓ < L. It holds
$|R_\sigma(\theta_j)(x) - R_\sigma(\theta_{j+1})(x)| = |\Phi_\ell(\sigma(W^{(\ell)}x^{(\ell)} + b^{(\ell)})) - \Phi_\ell(\sigma(\overline{W}^{(\ell)}x^{(\ell)} + \overline{b}^{(\ell)}))|,$
where $\Phi_\ell \in N(\sigma; A_\ell, B)$ for $A_\ell = (d_{\ell+1}, \dots, d_{L+1})$ and $(W^{(\ell)}, b^{(\ell)})$, $(\overline{W}^{(\ell)}, \overline{b}^{(\ell)})$ differ in one entry only.
Using the Lipschitz continuity of $\Phi_\ell$ of Lemma 13.2, we have
$|R_\sigma(\theta_j)(x) - R_\sigma(\theta_{j+1})(x)| \leq C_\sigma^{L-\ell-1}(Bd_{max})^{L-\ell}|\sigma(W^{(\ell)}x^{(\ell)} + b^{(\ell)}) - \sigma(\overline{W}^{(\ell)}x^{(\ell)} + \overline{b}^{(\ell)})|$
$\leq C_\sigma^{L-\ell}(Bd_{max})^{L-\ell}\|W^{(\ell)}x^{(\ell)} + b^{(\ell)} - \overline{W}^{(\ell)}x^{(\ell)} - \overline{b}^{(\ell)}\|_\infty$
$\leq 2C_\sigma^{L-\ell}(Bd_{max})^{L-\ell}\,\delta\max\{1, \|x^{(\ell)}\|_\infty\},$
where $\delta := \|\theta - \theta'\|_\infty$. Invoking Lemma 13.3, we conclude that
$|R_\sigma(\theta_j)(x) - R_\sigma(\theta_{j+1})(x)| \leq (2C_\sigma Bd_{max})^\ell\,C_\sigma^{L-\ell}(Bd_{max})^{L-\ell}\,\delta \leq (2C_\sigma Bd_{max})^L\|\theta - \theta'\|_\infty.$
For the case ℓ = L, a similar estimate can be shown. Combining this with (13.1.1) yields the
result.
Using Proposition 13.1, we can now consider the set of neural networks with a fixed architec-
ture $N(\sigma; A, \infty)$ as a subset of $L^\infty([-1, 1]^{d_0})$. What is more, $N(\sigma; A, \infty)$ is the image of $PN(A, \infty)$ under a locally Lipschitz map.
13.2 Convexity of neural network spaces
As a first step towards understanding N (σ; A, ∞) as a subset of L∞ ([−1, 1]d0 ), we notice that it is
star-shaped with few centers. Let us first introduce the necessary terminology.
Definition 13.4. Let Z be a subset of a linear space. A point x ∈ Z is called a center of Z if,
for every y ∈ Z it holds that
{tx + (1 − t)y | t ∈ [0, 1]} ⊆ Z.
A set is called star-shaped if it has at least one center.
The following proposition follows directly from the definition of a neural network and is the
content of Exercise 13.15.
Knowing that N (σ; A, B) is star-shaped with center 0, we can also ask ourselves if N (σ; A, B)
has more than this one center. It is not hard to see that also every constant function is a center.
The following theorem, which corresponds to [207, Proposition C.4], yields an upper bound on the
number of linearly independent centers.
$T : L^\infty([-1, 1]^{d_0}) \to \mathbb{R}^{n_A+1}, \qquad g \mapsto \big(g_1'(g), g_2'(g), \dots, g_{n_A+1}'(g)\big)^\top.$
Since $T$ is continuous and linear, we have that $T \circ R_\sigma$ is locally Lipschitz continuous by Proposition 13.1. Moreover, since the $(g_i)_{i=1}^{n_A+1}$ are linearly independent, we have that $T(\mathrm{span}((g_i)_{i=1}^{n_A+1})) = \mathbb{R}^{n_A+1}$. We denote $V := \mathrm{span}((g_i)_{i=1}^{n_A+1})$.
Next, we would like to establish that $N(\sigma; A, \infty) \supseteq V$. Let $g \in V$, then
$g = \sum_{\ell=1}^{n_A+1} a_\ell g_\ell,$
By the induction assumption $\tilde{g}^{(m)} \in N(\sigma; A, \infty)$ and hence by Proposition 13.5 $\tilde{g}^{(m)}/a_{m+1} \in N(\sigma; A, \infty)$. Additionally, since $g_{m+1}$ is a center of $N(\sigma; A, \infty)$, we have that $\frac{1}{2}g_{m+1} + \frac{1}{2a_{m+1}}\tilde{g}^{(m)} \in N(\sigma; A, \infty)$. By Proposition 13.5, we conclude that $\tilde{g}^{(m+1)} \in N(\sigma; A, \infty)$.
The induction shows that $g \in N(\sigma; A, \infty)$ and thus $V \subseteq N(\sigma; A, \infty)$. As a consequence,
$T \circ R_\sigma(PN(A, \infty)) \supseteq T(V) = \mathbb{R}^{n_A+1}.$
It is a well known fact of basic analysis that for every n ∈ N there does not exist a surjective
and locally Lipschitz continuous map from Rn to Rn+1 . We recall that nA = dim(PN (A, ∞)).
This yields the contradiction.
For a convex set $X$, the line segment between any two points of $X$ is a subset of $X$. Hence, every point of a convex set is a center. This yields the following corollary.
Corollary 13.7 tells us that we cannot expect sets of neural networks to be convex if the set of neural networks has many linearly independent elements. Sets of neural networks contain, for each $f \in N(\sigma; A, \infty)$, also all shifts of this function, i.e., $f(\cdot + b)$ for $b \in \mathbb{R}^d$ are elements of $N(\sigma; A, \infty)$. For a set of functions, being shift invariant and having only finitely many linearly independent functions at the same time is a very restrictive condition. Indeed, it was shown in
[207, Proposition C.6] that if N (σ; A, ∞) has only finitely many linearly independent functions and
σ is differentiable in at least one point and has non-zero derivative there, then σ is necessarily a
polynomial.
We conclude that the set of neural networks is in general non-convex and star-shaped with 0
and constant functions being centers. One could visualize this set in 3D as in Figure 13.1.
The fact that the neural network space is not convex could also mean that it merely fails to be convex at a single point. For example, $\mathbb{R}^2 \setminus \{0\}$ is not convex, but for an optimization algorithm this would likely not pose a problem.
We will next observe that N (σ; A, ∞) does not have such a benign non-convexity and in fact,
has arbitrarily large holes.
To make this claim mathematically precise, we first introduce the notion of ε-convexity.
Figure 13.1: Sketch of the space of neural networks in 3D. The vertical axis corresponds to the
constant neural network functions, each of which is a center. The set of neural networks consists
of many low-dimensional linear subspaces spanned by certain neural networks (Φ1 , . . . , Φ6 in this
sketch) and linear functions. Between these low-dimensional subspaces, there is not always a
straight-line connection by Corollary 13.7 and Theorem 13.9.
Definition 13.8. For ε > 0, we say that a subset A of a normed vector space X is ε-convex if
co(A) ⊆ A + Bε (0),
where co(A) denotes the convex hull of A and Bε (0) is an ε ball around 0 with respect to the norm
of X.
Intuitively speaking, a set that is convex when one fills up all holes smaller than ε is ε-convex.
Now we show that there is no ε > 0 such that N (σ; A, ∞) is ε-convex.
Theorem 13.9. Let L ∈ N and A = (d0 , d1 , . . . , dL , 1) ∈ NL+2 . Let K ⊆ Rd0 be compact and let
σ ∈ M, with M as in (3.1.1) and assume that σ is not a polynomial. Moreover, assume that there
exists an open set, where σ is differentiable and not constant.
If there exists an ε > 0 such that N (σ; A, ∞) is ε-convex, then N (σ; A, ∞) is dense in C(K).
Proof. Step 1. We show that ε-convexity implies N (σ; A, ∞) to be convex. By Proposition 13.5,
we have that N (σ; A, ∞) is scaling invariant. This implies that co(N (σ; A, ∞)) is scaling invariant
as well. Hence, if there exists ε > 0 such that N (σ; A, ∞) is ε-convex, then for every ε′ > 0
$\mathrm{co}(N(\sigma; A, \infty)) = \frac{\varepsilon'}{\varepsilon}\,\mathrm{co}(N(\sigma; A, \infty)) \subseteq \frac{\varepsilon'}{\varepsilon}\big(N(\sigma; A, \infty) + B_\varepsilon(0)\big) = N(\sigma; A, \infty) + B_{\varepsilon'}(0).$
This yields that N (σ; A, ∞) is ε′ -convex. Since ε′ was arbitrary, we have that N (σ; A, ∞) is
ε-convex for all ε > 0.
As a consequence, we have that
$\mathrm{co}(N(\sigma; A, \infty)) \subseteq \bigcap_{\varepsilon>0}\big(N(\sigma; A, \infty) + B_\varepsilon(0)\big) = \overline{N(\sigma; A, \infty)}.$
Hence, $\mathrm{co}(N(\sigma; A, \infty)) \subseteq \overline{N(\sigma; A, \infty)}$ and, by the well-known fact that in every metric vector space $\mathrm{co}(\overline{A}) \subseteq \overline{\mathrm{co}(A)}$, we conclude that $\overline{N(\sigma; A, \infty)}$ is convex.
Step 2. We show that $N_{d_1}(\sigma; 1) \subseteq \overline{N(\sigma; A, \infty)}$. If $N(\sigma; A, \infty)$ is $\varepsilon$-convex, then by Step 1 $\overline{N(\sigma; A, \infty)}$ is convex. The scaling invariance of $N(\sigma; A, \infty)$ then shows that $\overline{N(\sigma; A, \infty)}$ is a closed linear subspace of $C(K)$.
Note that, by Proposition 3.16 for every w ∈ Rd0 and b ∈ R there exists a function f ∈
N (σ; A, ∞) such that
Since $\overline{N(\sigma; A, \infty)}$ is a closed vector space, this implies that for all $n \in \mathbb{N}$ and all $w_1^{(1)}, \dots, w_n^{(1)} \in \mathbb{R}^{d_0}$, $w_1^{(2)}, \dots, w_n^{(2)} \in \mathbb{R}$, $b_1^{(1)}, \dots, b_n^{(1)} \in \mathbb{R}$, $b^{(2)} \in \mathbb{R}$
$x \mapsto \sum_{i=1}^{n} w_i^{(2)}\sigma\big((w_i^{(1)})^\top x + b_i^{(1)}\big) + b^{(2)} \in \overline{N(\sigma; A, \infty)}. \qquad (13.2.3)$
Step 3. From (13.2.3), we conclude that $N_{d_1}(\sigma; 1) \subseteq \overline{N(\sigma; A, \infty)}$. In words, the whole set of
shallow neural networks of arbitrary width is contained in the closure of the set of neural networks
with a fixed architecture. By Theorem 3.8, we have that Nd1 (σ; 1) is dense in C(K), which yields
the result.
For any activation function of practical relevance, a set of neural networks with fixed architecture
is not dense in C(K). This is only the case for very strange activation functions such as the one
discussed in Subsection 3.2. Hence, Theorem 13.9 shows that in general, sets of neural networks of
fixed architectures have arbitrarily large holes.
• the best approximation property, if for all h ∈ H there exists at least one Φ ∈ N (σ; A, ∞)
such that (13.3.1) holds,
• the unique best approximation property, if for all h ∈ H there exists exactly one
Φ ∈ N (σ; A, ∞) such that (13.3.1) holds,
We will see in the sequel that, in the absence of the best approximation property, we are able to prove that the learning problem necessarily requires the weights of the neural networks to tend to infinity, which may or may not be desirable in applications.
Moreover, having a continuous selection procedure is desirable as it implies the existence of a
stable selection algorithm; that is, an algorithm which, for similar problems yields similar neural
networks satisfying (13.3.1).
Below, we will study the properties above for Lp spaces, p ∈ [1, ∞). As we will see, neu-
ral network classes typically neither satisfy the continuous selection nor the best approximation
property.
13.3.1 Continuous selection
As shown in [136], neural network spaces essentially never admit the continuous selection property.
To give the argument, we first recall the following result from [136, Theorem 3.4] without proof.
Theorem 13.10. Let p ∈ (1, ∞). Every subset of Lp ([−1, 1]d0 ) with the unique best approximation
property is convex.
Proof. We observe from Theorem 13.6 and the discussion below, that under the assumptions of
this proposition, N (σ; A, ∞) is not convex.
We conclude from Theorem 13.10 that N (σ; A, ∞) does not have the unique best approximation
property. Moreover, if the set N (σ; A, ∞) does not have the best approximation property, then it
is obvious that it cannot have continuous selection. Thus, we can assume without loss of generality,
that N (σ; A, ∞) has the best approximation property and there exists a point h ∈ Lp ([−1, 1]d0 )
and two different Φ1 ,Φ2 such that
Assume towards a contradiction, that there exists Φ∗ ∈ N (σ; A, ∞) with Φ∗ ̸= Φ1 such that for
λ ∈ (−1, 0)
Then
Since Φ1 is a best approximation to h this implies that every inequality in the estimate above is an
equality. Hence, we have that
However, in a strictly convex space like Lp ([−1, 1]d0 ) for p > 1 this implies that
Φ∗ − P (λ) = c · (P (λ) − h)
Φ∗ = h + (c + 1)λ · (h − Φ1 )
and plugging into (13.3.3) yields |(c + 1)λ| = 1. If (c + 1)λ = −1, then we have Φ∗ = Φ1 which
produces a contradiction. If (c + 1)λ = 1, then
Then fn can be written as a neural network with architecture (σ; 1, 2, 1), i.e., A = (1, 2, 1). More-
over, for x > 0 we observe with the fundamental theorem of calculus and using integration by
substitution that
$f_n(x) = \int_x^{x+1/n} n\sigma'(nz)\,dz = \int_{nx}^{nx+1}\sigma'(z)\,dz. \qquad (13.3.4)$
It is not hard to see that the right hand side of (13.3.4) converges to α for n → ∞.
Similarly, for x < 0, we observe that fn (x) converges to α′ for n → ∞. We conclude that
fn → α1R+ + α′ 1R−
According to Proposition 13.12, for a given A, and an activation function σ, it is possible that
(13.3.5) holds, but f ̸∈ N (σ; A, ∞). The following result shows that in this situation, the weights
of Φn diverge.
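A concrete illustration (with $\sigma = \sigma_{ReLU}$, so that $\alpha = 1$ and $\alpha' = 0$; this specific activation is our choice, made only to keep the sketch short): by (13.3.4), $f_n(x) = \sigma(nx + 1) - \sigma(nx)$, a network of architecture $(1, 2, 1)$ whose hidden-layer weights equal $n$.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def f_n(x, n, sigma=relu):
    # f_n(x) = sigma(n*x + 1) - sigma(n*x): a network of architecture (1, 2, 1)
    # whose two hidden-layer weights both equal n (cf. (13.3.4)).
    return sigma(n * x + 1.0) - sigma(n * x)

x = np.linspace(-1.0, 1.0, 2001)
limit = (x > 0).astype(float)        # alpha * 1_{R+} + alpha' * 1_{R-} with alpha = 1, alpha' = 0
for n in [1, 10, 100, 1000]:
    err = np.max(np.abs(f_n(x, n) - limit)[np.abs(x) > 0.05])   # error away from the jump
    print(n, err)
# The approximation error tends to zero, but the inner weight n diverges, consistent with
# Proposition 13.14: the discontinuous limit cannot be realized with bounded weights.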
Assume that there exists a sequence Φn ∈ N (σ; A, ∞) and f ∈ L2 ([−1, 1]d0 , µ) \ N (σ; A, ∞)
such that
Then
$\limsup_{n\to\infty}\ \max\Big\{\|W_n^{(\ell)}\|_\infty,\ \|b_n^{(\ell)}\|_\infty\ \Big|\ \ell = 0, \dots, L\Big\} = \infty. \qquad (13.3.7)$
Proof. We assume towards a contradiction that the left-hand side of (13.3.7) is finite. As a result,
there exists C > 0 such that Φn ∈ N (σ; A, C) for all n ∈ N.
By Proposition 13.1, we conclude that N (σ; A, C) is the image of a compact set under a continu-
ous map and hence is itself a compact set in L2 ([−1, 1]d0 , µ). In particular, we have that N (σ; A, C)
is closed. Hence, (13.3.6) implies f ∈ N (σ; A, C). This gives a contradiction.
Proposition 13.14 can be extended to all f for which there is no best approximation in N (σ; A, ∞),
see Exercise 13.18. The results imply that for functions we wish to learn that lack a best approxima-
tion within a neural network set, we must expect the weights of the approximating neural networks
to grow to infinity. This can be undesirable because, as we will see in the following sections on
generalization, a bounded parameter space facilitates many generalization bounds.
Exercises
Exercise 13.15. Prove Proposition 13.5.
Exercise 13.17. Use Proposition 3.16, to extend Proposition 13.12 to arbitrary depth.
Exercise 13.18. Extend Proposition 13.14 to functions f for which there is no best-approximation
in N (σ; A, ∞). To do this, replace (13.3.6) by
Chapter 14
As discussed in the introduction in Section 1.2, we generally learn based on a finite data set. For
example, given data (xi , yi )m
i=1 , we try to find a network Φ that satisfies Φ(xi ) = yi for i = 1, . . . , m.
The field of generalization is concerned with how well such Φ performs on unseen data, which refers
to any x outside of training data {x1 , . . . , xm }. In this chapter we discuss generalization through
the use of covering numbers.
In Sections 14.1 and 14.2 we revisit and formalize the general setup of learning and empirical risk
minimization in a general context. Although some notions introduced in these sections have already
appeared in the previous chapters, we reintroduce them here for a more coherent presentation. In
Sections 14.3-14.5, we first discuss the concepts of generalization bounds and covering numbers,
and then apply these arguments specifically to neural networks. In Section 14.6 we explore the
so-called “approximation-complexity trade-off”, and finally in Sections 14.7-14.8 we introduce the
“VC dimension” and give some implications for classes of neural networks.
Figure 14.1: Collection of coffee data. The last row lacks a “Quality” label. Our aim is to predict
the label without the need for an (expensive) taste test.
with higher numbers indicating better quality. Let us assume that our subjective assessment of
quality of coffee is related to six features: “Acidity”, “Caffeine content”, “Price”, “Aftertaste”,
“Roast level”, and “Origin”. The feature space X thus corresponds to the set of six-tuples describing
these attributes, which can be either numeric or categorical (see Figure 14.1).
We aim to understand the relationship between elements of X and elements of Y , but we can
neither afford, nor do we have the time to taste all the coffees in the world. Instead, we can sample
some coffees, taste them, and grow our database accordingly as depicted in Figure 14.1. This way
we obtain samples of pairs in X × Y . The distribution D from which they are drawn depends on
various external factors. For instance, we might have avoided particularly cheap coffees, believing
them to be inferior. As a result they do not occur in our database. Moreover, if a colleague
contributes to our database, he might have tried the same brand and arrived at a different rating.
In this case, the quality label is not deterministic anymore.
Based on our database, we wish to predict the quality of an untasted coffee. Before proceeding,
we first formalize what it means to be a “good” prediction. ♢
Characterizing how good a predictor is requires a notion of discrepancy in the label space. This
is the purpose of the so-called loss function, which is a measurable mapping L : Y × Y → R+ .
Based on the risk, we can now formalize what we consider a good predictor. The best predictor
is one such that its risk is as close as possible to the smallest that any function can achieve. More
precisely, we would like a risk that is close to the so-called Bayes risk
Example 14.3 (Loss functions). The choice of a loss function L usually depends on the application.
For a regression problem, i.e., a learning problem where Y is a non-discrete subset of a Euclidean
space, a common choice is the square loss L2 (y, y ′ ) = ∥y − y ′ ∥2 .
For binary classification problems, i.e. when $Y$ is a discrete set of cardinality two, the “0-1 loss”
$L_{0-1}(y, y') = \begin{cases} 1 & y \neq y' \\ 0 & y = y' \end{cases}$
seems more natural.
Another frequently used loss for binary classification, especially when we want to predict prob-
abilities (i.e., if Y = [0, 1] but all labels are binary), is the binary cross-entropy loss
Lce (y, y ′ ) = −(y log(y ′ ) + (1 − y) log(1 − y ′ )).
In contrast to the 0 − 1 loss, the cross-entropy loss is differentiable, which is desirable in deep
learning as we saw in Chapter 10.
In the coffee quality prediction problem, the quality is given as a fraction of the form k/10
for k = 0, . . . , 10. While this is a discrete set, it makes sense to more heavily penalize predictions
that are wrong by a larger amount. For example, predicting 4/10 instead of 8/10 should produce
a higher loss than predicting 7/10. Hence, we would not use the 0 − 1 loss but, for example, the
square loss. ♢
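These loss functions are straightforward to implement; a small sketch in Python (vectorized over pairs of labels; the clipping in the cross-entropy is only a numerical safeguard, not part of its definition):

import numpy as np

def square_loss(y, y_pred):
    return (y - y_pred) ** 2

def zero_one_loss(y, y_pred):
    return (y != y_pred).astype(float)

def binary_cross_entropy(y, y_pred, eps=1e-12):
    p = np.clip(y_pred, eps, 1.0 - eps)          # numerical safeguard against log(0)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y_true = np.array([0.8, 0.4])                    # e.g. coffee quality labels k/10
y_pred = np.array([0.7, 0.4])
print(square_loss(y_true, y_pred), zero_one_loss(y_true, y_pred))
print(binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.2])))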
How do we find a function h : X → Y with a risk that is as close as possible to the Bayes risk?
We will introduce a procedure to tackle this task in the next section.
1) In practice, the assumption of independence of the samples is often unclear and typically not satisfied. For
instance, the selection of the six previously tested coffees might be influenced by external factors such as personal
preferences or availability at the local store, which introduce bias into the dataset.
If the sample S is drawn i.i.d. according to D, then we immediately see from the linearity
of the expected value that R b S (h) is an unbiased estimator of R(h), i.e., ES∼Dm [R
b S (h)] = R(h).
Moreover, the weak law of large numbers states that the sample mean of an i.i.d. sequence of
integrable random variables converges to the expected value in probability. Hence, there is some
hope that, at least for large m ∈ N, minimizing the empirical risk instead of the population risk
might lead to a good hypothesis. We formalize this approach in the next definition.
$\widehat{R}_S(h_S) = \inf_{h\in H}\widehat{R}_S(h) \qquad (14.2.1)$
From a generalization perspective, supervised deep learning is empirical risk minimization over
sets of neural networks. The question we want to address next is how effective this approach is at
producing hypotheses that achieve a risk close to the Bayes risk.
Let H be some hypothesis set, such that an empirical risk minimizer hS exists for all S ∈
(X × Y )m ; see Exercise 14.25 for an explanation of why this is a reasonable assumption. Moreover,
let g ∈ H be arbitrary. Then
$R(h_S) - R^* = R(h_S) - \widehat{R}_S(h_S) + \widehat{R}_S(h_S) - R^*$
$\leq |R(h_S) - \widehat{R}_S(h_S)| + \widehat{R}_S(g) - R^* \qquad (14.2.2)$
$\leq 2\sup_{h\in H}|R(h) - \widehat{R}_S(h)| + R(g) - R^*,$
where in the first inequality we used that hS is the empirical risk minimizer. By taking the infimum
over all g, we conclude that
$R(h_S) - R^* \leq 2\sup_{h\in H}|R(h) - \widehat{R}_S(h)| + \inf_{g\in H} R(g) - R^*.$
Definition 14.6 (Generalization bound). Let H ⊆ {h : X → Y } be a hypothesis set, and let
L : Y × Y → R be a loss function. Let κ : (0, 1) × N → R+ be such that for every δ ∈ (0, 1) holds
κ(δ, m) → 0 for m → ∞. We call κ a generalization bound for H if for every distribution D on
X × Y , every m ∈ N and every δ ∈ (0, 1), it holds with probability at least 1 − δ over the random
sample S ∼ Dm that
$\sup_{h\in H}|R(h) - \widehat{R}_S(h)| \leq \kappa(\delta, m).$
as soon as m is so large that κ(δ, m) ≤ ε. If there exists an empirical risk minimizer hS such that
R
b S (hS ) = 0, then with high probability the empirical risk minimizer will also have a small risk
R(hS ). Empirical risk minimization is often referred to as a “PAC” algorithm, which stands for
probably (δ) approximately correct (ε).
Definition 14.6 requires the upper bound κ on the discrepancy between the empirical risk and
the risk to be independent from the distribution D. Why should this be possible? After all, we could
have an underlying distribution that is not uniform and hence, certain data points could appear
very rarely in the sample. As a result, it should be very hard to produce a correct prediction
for such points. At first sight, this suggests that non-uniform distributions should be much more
challenging than uniform distributions. This intuition is incorrect, as the following argument based
on Example 14.1 demonstrates.
Example 14.8 (Generalization in the coffee quality problem). In Example 14.1, the underlying
distribution describes both our process of choosing coffees and the relation between the attributes
and the quality. Suppose we do not enjoy drinking coffee that costs less than 1€. Consequently,
we do not have a single sample of such coffee in the dataset, and therefore we have no chance of
learning the quality of cheap coffees.
However, the absence of coffee samples costing less than 1€ in our dataset is due to our general
avoidance of such coffee. As a result, we run a low risk of incorrectly classifying the quality of a
coffee that is cheaper than 1€, since it is unlikely that we will choose such a coffee in the future. ♢
To establish generalization bounds, we use stochastic tools that guarantee that the empirical
risk converges to the true risk as the sample size increases. This is typically achieved through
concentration inequalities. One of the simplest and most well-known is Hoeffding’s inequality, see
Theorem A.24. We will now apply Hoeffding’s inequality to obtain a first generalization bound.
This generalization bound is well-known and can be found in many textbooks on machine learning,
e.g., [178, 250]. Although the result does not yet encompass neural networks, it forms the basis for
a similar result applicable to neural networks, as we discuss subsequently.
Proposition 14.9 (Finite hypothesis set). Let $H \subseteq \{h : X \to Y\}$ be a finite hypothesis set. Let $L : Y \times Y \to \mathbb{R}$ be such that $L(Y \times Y) \subseteq [c_1, c_2]$ with $c_2 - c_1 = C > 0$.
Then, for every $m \in \mathbb{N}$ and every distribution $D$ on $X \times Y$ it holds with probability at least $1 - \delta$ over the sample $S \sim D^m$ that
$\sup_{h\in H}|R(h) - \widehat{R}_S(h)| \leq C\sqrt{\frac{\log(|H|) + \log(2/\delta)}{2m}}.$
Note that R b S (hi ) is the mean of independent random variables which take their values almost
surely in [c1 , c2 ]. Additionally, R(hi ) is the expectation of R
b S (hi ). The proof can therefore be
finished by applying Theorem A.24. This will be addressed in Exercise 14.26.
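For orientation, the bound of Proposition 14.9 can be evaluated numerically; the sketch below uses arbitrary illustrative values for $|H|$, $C$, $\delta$ and $m$.

import numpy as np

def finite_hypothesis_bound(card_H, C, delta, m):
    # Bound of Proposition 14.9: C * sqrt((log|H| + log(2/delta)) / (2m)).
    return C * np.sqrt((np.log(card_H) + np.log(2.0 / delta)) / (2.0 * m))

for m in [100, 1000, 10000, 100000]:
    print(m, finite_hypothesis_bound(card_H=10 ** 6, C=1.0, delta=0.01, m=m))
# The bound decays like 1/sqrt(m) and grows only logarithmically in |H|.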
Consider now a non-finite set of neural networks H, and assume that it can be covered by a
finite set of (small) balls. Applying Proposition 14.9 to the centers of these balls then allows us to derive a similar bound as in the proposition for $H$. This intuitive argument will be made rigorous
in the following section.
Definition 14.10. Let A be a relatively compact subset of a metric space (X, d). For ε > 0, we
call
$G(A, \varepsilon, (X, d)) := \min\Big\{n \in \mathbb{N}\ \Big|\ \exists\,(x_i)_{i=1}^{n} \subseteq X \text{ s.t. } \bigcup_{i=1}^{n} B_\varepsilon(x_i) \supseteq A\Big\},$
where Bε (x) = {z ∈ X | d(z, x) ≤ ε}, the ε-covering number of A in X. In case X or d are clear
from context, we also write G(A, ε, d) or G(A, ε, X) instead of G(A, ε, (X, d)).
A visualization of Definition 14.10 is given in Figure 14.2. As we will see, it is possible to upper
bound the ε-covering numbers of neural networks as a subset of L∞ ([0, 1]d ), assuming the weights
are confined to a fixed bounded set. The precise estimates are postponed to Section 14.5. Before
that, let us show how a finite covering number facilitates a generalization bound. We only consider
Euclidean feature spaces X in the following result. A more general version could be easily derived.
Figure 14.2: Illustration of the concept of covering numbers of Definition 14.10. The shaded set
A ⊆ R2 is covered by sixteen Euclidean balls of radius ε. Therefore, G(A, ε, R2 ) ≤ 16.
Theorem 14.11. Let CY , CL > 0 and α > 0. Let Y ⊆ [−CY , CY ], X ⊆ Rd for some d ∈ N, and
H ⊆ {h : X → Y }. Further, let L : Y × Y → R be CL -Lipschitz.
Then, for every distribution D on X × Y and every m ∈ N it holds with probability at least 1 − δ
over the sample S ∼ Dm that for all h ∈ H
$|R(h) - \widehat{R}_S(h)| \leq 4C_Y C_L\sqrt{\frac{\log(G(H, m^{-\alpha}, L^\infty(X))) + \log(2/\delta)}{m}} + \frac{2C_L}{m^\alpha}.$
Proof. Let
$M = G(H, m^{-\alpha}, L^\infty(X)) \qquad (14.4.1)$
and let $H_M = (h_i)_{i=1}^{M} \subseteq H$ be such that for every $h \in H$ there exists $h_i \in H_M$ with $\|h - h_i\|_{L^\infty(X)} \leq 1/m^\alpha$. The existence of $H_M$ follows by Definition 14.10.
Fix for the moment such h ∈ H and hi ∈ HM . By the reverse and normal triangle inequalities,
we have
$|R(h) - \widehat{R}_S(h)| - |R(h_i) - \widehat{R}_S(h_i)| \leq |R(h) - R(h_i)| + |\widehat{R}_S(h) - \widehat{R}_S(h_i)|.$
Moreover, from the monotonicity of the expected value and the Lipschitz property of L it follows
that
We thus conclude that for every $\varepsilon > 0$
$P_{S\sim D^m}\big[\exists h \in H : |R(h) - \widehat{R}_S(h)| \geq \varepsilon\big] \leq P_{S\sim D^m}\big[\exists h_i \in H_M : |R(h_i) - \widehat{R}_S(h_i)| \geq \varepsilon - \tfrac{2C_L}{m^\alpha}\big]. \qquad (14.4.2)$
m
From Proposition 14.9, we know that for ε > 0 and δ ∈ (0, 1)
PS∼Dm ∃hi ∈ HM : |R(hi ) − R b S (hi )| ≥ ε − 2CL ≤ δ (14.4.3)
mα
as long as r
2CL log(M ) + log(2/δ)
ε− α >C ,
m 2m
√ that L(Y × Y ) ⊆ [c1 , c2 ] with c2 − c1 ≤ C. By the Lipschitz property of L we can
where C is such
choose C = 2 2CL CY .
Therefore, the definition of M in (14.4.1) together with (14.4.2) and (14.4.3) give that with
probability at least 1 − δ it holds for all h ∈ H
$|R(h) - \widehat{R}_S(h)| \leq 2\sqrt{2}\,C_L C_Y\sqrt{\frac{\log(G(H, m^{-\alpha}, L^\infty)) + \log(2/\delta)}{2m}} + \frac{2C_L}{m^\alpha}.$
This concludes the proof.
Lemma 14.12. Let X1 , X2 be two metric spaces and let f : X1 → X2 be Lipschitz continuous with
Lipschitz constant CLip . For every relatively compact A ⊆ X1 it holds that for all ε > 0
The proof of Lemma 14.12 is left as an exercise. If we can represent the set of neural networks
as the image under the Lipschitz map of another set with known covering numbers, then Lemma
14.12 gives a direct way to bound the covering number of the neural network class.
Conveniently, we have already observed in Proposition 13.1, that the set of neural networks is
the image of PN (A, B) as in Definition 12.1 under the Lipschitz continuous realization map Rσ . It
thus suffices to establish the ε-covering number of PN (A, B) or equivalently of [−B, B]nA . Then,
using the Lipschitz property of Rσ that holds by Proposition 13.1, we can apply Lemma 14.12 to
find the covering numbers of N (σ; A, B). This idea is depicted in Figure 14.3.
Figure 14.3: Illustration of the main idea to deduce covering numbers of neural network spaces.
Points θ ∈ R2 in parameter space in the left figure correspond to functions Rσ (θ) in the right figure
(with matching colors). By Lemma 14.12, a covering of the parameter space on the left translates
to a covering of the function space on the right.
It is clear that all points between $-B$ and $x_{k-1}$ have distance at most $\varepsilon$ to one of the $x_j$. Also, $x_{k-1} = -B + \varepsilon + 2(k-1)\varepsilon \geq B - \varepsilon$. We conclude that $G([-B, B], \varepsilon, \mathbb{R}) \leq \lceil B/\varepsilon\rceil$. Set $X_k := \{x_0, \dots, x_{k-1}\}$.
For arbitrary $q$, we observe that for every $x \in [-B, B]^q$ there is an element in $X_k^q = X_k \times \dots \times X_k$ ($q$ times) with $\|\cdot\|_\infty$-distance less than $\varepsilon$. Clearly, $|X_k^q| = \lceil B/\varepsilon\rceil^q$, which completes the proof.
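The grid from the proof can be constructed explicitly; the following sketch builds it for the cube $[-B, B]^q$ and checks the covering property on random points (a numerical illustration only, with arbitrary values of $B$, $\varepsilon$ and $q$):

import numpy as np

def cube_cover_1d(B, eps):
    # Centers -B + eps, -B + 3*eps, ..., as in the proof; ceil(B/eps) of them suffice.
    k = int(np.ceil(B / eps))
    return -B + eps + 2.0 * eps * np.arange(k)

B, eps, q = 2.0, 0.3, 3
grid_1d = cube_cover_1d(B, eps)
print(len(grid_1d) == int(np.ceil(B / eps)))       # at most ceil(B/eps) centers per axis

# Every point of [-B, B]^q lies within eps (sup-norm) of a point of the product grid,
# hence G([-B, B]^q, eps, ||.||_inf) <= ceil(B/eps)^q.
x = np.random.default_rng(0).uniform(-B, B, size=(1000, q))
per_coord = np.min(np.abs(x[:, :, None] - grid_1d[None, None, :]), axis=2)
print(np.all(np.max(per_coord, axis=1) <= eps + 1e-12))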
Having established a covering number for [−B, B]nA and hence PN (A, B), we can now estimate
the covering numbers of deep neural networks by combining Lemma 14.12 and Propositions 13.1
and 14.13 .
We end this section by applying the previous theorem to the generalization bound of Theorem 14.11 with $\alpha = 1/2$. To simplify the analysis, we restrict the discussion to neural networks with range $[-1, 1]$. To this end, denote
$N^*(\sigma; A, B) := \big\{\Phi \in N(\sigma; A, B)\ \big|\ \mathrm{range}(\Phi) \subseteq [-1, 1]\big\}.$
Since N ∗ (σ; A, B) ⊆ N (σ; A, B) we can bound the covering numbers of N ∗ (σ; A, B) by those of
N (σ; A, B). This yields the following result.
Theorem 14.15. Let CL > 0 and let L : [−1, 1]×[−1, 1] → R be CL -Lipschitz continuous. Further,
let A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 , let σ : R → R be Cσ -Lipschitz continuous with Cσ ≥ 1, and
|σ(x)| ≤ Cσ |x| for all x ∈ R, and let B ≥ 1.
Then, for every m ∈ N, and every distribution D on X × [−1, 1] it holds with probability at least
1 − δ over S ∼ Dm that for all Φ ∈ N ∗ (σ; A, B)
$|R(\Phi) - \widehat{R}_S(\Phi)| \leq 4C_L\sqrt{\frac{n_A\log(\lceil n_A\sqrt{m}\rceil) + L\,n_A\log(\lceil 2C_\sigma B d_{max}\rceil) + \log(2/\delta)}{m}} + \frac{2C_L}{\sqrt{m}}.$
where R∗ is the Bayes risk defined in (14.1.1). We make the following observations about the
approximation error εapprox and generalization error εgen in the context of neural network based
learning:
• Scaling of generalization error: By Theorem 14.15, for a hypothesis class H of neural networks
with nA weights and L layers, and for sample of size m ∈ N, the generalization error εgen
essentially scales like
$\varepsilon_{gen} = O\Big(\sqrt{\big(n_A\log(n_A m) + L\,n_A\log(n_A)\big)/m}\Big) \quad \text{as } m \to \infty.$
• Scaling of approximation error: Assume there exists h∗ such that R(h∗ ) = R∗ , and let the
loss function L be Lipschitz continuous in the first coordinate. Then
$\varepsilon_{approx} = \inf_{h\in H} R(h) - R(h^*) = \inf_{h\in H} E_{(x,y)\sim D}\big[L(h(x), y) - L(h^*(x), y)\big] \leq C\inf_{h\in H}\|h - h^*\|_{L^\infty},$
for some constant C > 0. We have seen in Chapters 5 and 7 that if we choose H as a
set of neural networks with size nA and L layers, then, for appropriate activation functions,
inf h∈H ∥h − h∗ ∥L∞ behaves like nA −r if, e.g., h∗ is a d-dimensional s-Hölder regular function
and r = s/d (Theorem 5.23), or h∗ ∈ C k,s ([0, 1]d ) and r < (k + s)/d (Theorem 7.10).
By these considerations, we conclude that for an empirical risk minimizer ΦS from a set of neural
networks with nA weights and L layers, it holds that
$R(\Phi_S) - R^* \leq O\Big(\sqrt{\big(n_A\log(m) + L\,n_A\log(n_A)\big)/m}\Big) + O(n_A^{-r}), \qquad (14.6.1)$
for m → ∞ and for some r depending on the regularity of h∗ . Note that, enlarging the neural
network set, i.e., increasing nA has two effects: The term associated to approximation decreases,
and the term associated to generalization increases. This trade-off is known as approximation-
complexity trade-off. The situation is depicted in Figure 14.4. The figure and (14.6.1) suggest that the perfect model achieves the optimal trade-off between approximation and generalization error. Using this notion, we can also separate all models into three classes:
• Underfitting: If the approximation error decays faster than the estimation error increases.
• Optimal : If the sum of approximation error and generalization error is at a minimum.
• Overfitting: If the approximation error decays slower than the estimation error increases.
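The two competing terms in (14.6.1) can be explored numerically. In the sketch below the constants hidden in the O-notation are set to one and the values of m, L and r are arbitrary illustrative choices, since (14.6.1) only fixes the asymptotic behavior:

import numpy as np

def upper_bound(n_A, m, L, r):
    # The two terms of (14.6.1) with all hidden constants set to one.
    generalization = np.sqrt((n_A * np.log(m) + L * n_A * np.log(n_A)) / m)
    approximation = n_A ** (-r)
    return generalization + approximation

m, L, r = 10 ** 5, 3, 0.5
sizes = np.arange(10, 20001, 10, dtype=float)
values = upper_bound(sizes, m, L, r)
print("optimal trade-off at roughly n_A =", int(sizes[np.argmin(values)]))
# For smaller n_A the approximation term dominates (underfitting),
# for larger n_A the generalization term dominates (overfitting).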
In Chapter 15, we will see that deep learning often operates in the regime where the number of
parameters nA exceeds the optimal trade-off point. For certain architectures used in practice, nA
can be so large that the theory of the approximation-complexity trade-off suggests that learning
should be impossible. However, we emphasize, that the present analysis only provides upper bounds.
It does not prove that learning is impossible or even impractical in the overparameterized regime.
Moreover, in Chapter 11 we have already seen indications that learning in the overparametrized
regime need not necessarily lead to large generalization errors.
Definition 14.16. The VC dimension of H is the cardinality of the largest set S ⊆ Rd that is
shattered by H. We denote the VC dimension by VCdim(H).
Figure 14.4: Illustration of the approximation-complexity trade-off, with the underfitting and overfitting regimes separated by the optimal trade-off.
Example 14.17 (Intervals). Let H = {1[a,b] | a, b ∈ R}. It is clear that VCdim(H) ≥ 2 since for
x1 < x2 the functions
1[x1 −2,x1 −1] , 1[x1 −1,x1 ] , 1[x1 ,x2 ] , 1[x2 ,x2 +1] ,
Figure 14.5: Different ways to classify two or three points. The colored-blocks correspond to
intervals that produce different classifications of the points.
three points. More generally, for $d \geq 2$ with
$H_d := \{x \mapsto 1_{[0,\infty)}(w^\top x + b) \mid w \in \mathbb{R}^d, b \in \mathbb{R}\}$
Figure 14.6: Different ways to classify three points by a half-space, [244, Figure 1.4].
In the example above, the VC dimension coincides with the number of parameters. However,
this is not true in general as the following example shows.
Example 14.19 (Infinite VC dimension). Let for x ∈ R
Theorem 14.20. Let d, k ∈ N and H ⊆ {h : Rd → {0, 1}} have VC dimension k. Let D be a
distribution on Rd × {0, 1}. Then, for every δ > 0 and m ∈ N, it holds with probability at least 1 − δ
over a sample S ∼ Dm that for every h ∈ H
$|R(h) - \widehat{R}_S(h)| \leq \sqrt{\frac{2k\log(em/k)}{m}} + \sqrt{\frac{\log(1/\delta)}{2m}}. \qquad (14.7.1)$
In words, Theorem 14.20 tells us that if a hypothesis class has finite VC dimension, then a
hypothesis with a small empirical risk will have a small risk if the number of samples is large. This
shows that empirical risk minimization is a viable strategy in this scenario. Will this approach also
work if the VC dimension is not bounded? No, in fact, in that case, no learning algorithm will
succeed in reliably producing a hypothesis for which the risk is close to the best possible. We omit
the technical proof of the following theorem from [178, Theorem 3.23].
Theorem 14.21. Let k ∈ N and let H ⊆ {h : X → {0, 1}} be a hypothesis set with VC dimension
k. Then, for every m ∈ N and every learning algorithm A : (X × {0, 1})m → H there exists a
distribution D on X × {0, 1} such that
$P_{S\sim D^m}\Big[R(A(S)) - \inf_{h\in H} R(h) > \sqrt{\tfrac{k}{320m}}\Big] \geq \tfrac{1}{64}.$
Theorem 14.21 immediately implies the following statement for the generalization bound.
Corollary 14.22. Let k ∈ N and let H ⊆ {h : X → {0, 1}} be a hypothesis set with VC dimension
k. Then, for every m ∈ N there exists a distribution D on X × {0, 1} such that
$P_{S\sim D^m}\Big[\sup_{h\in H}|R(h) - \widehat{R}_S(h)| > \sqrt{\tfrac{k}{1280m}}\Big] \geq \tfrac{1}{64}.$
Then, applying Theorem 14.21 with $A(S) = h_S$ it holds that
$2\sup_{h\in H}|R(h) - \widehat{R}_S(h)| \geq |R(h_S) - \widehat{R}_S(h_S)| + |R(h_\delta) - \widehat{R}_S(h_\delta)|$
$\geq R(h_S) - \widehat{R}_S(h_S) + \widehat{R}_S(h_\delta) - R(h_\delta)$
$\geq R(h_S) - R(h_\delta)$
$> R(h_S) - \inf_{h\in H} R(h) - \delta,$
where we used the definition of hS in the third inequality. The proof is completed by applying
Theorem 14.21 and using that δ was arbitrary.
We have now seen that we have a generalization bound scaling like $O(1/\sqrt{m})$ for $m \to \infty$ if and only if the VC dimension of a hypothesis class is finite. In more quantitative terms, we require the VC dimension of a neural network to be smaller than m.
What does this imply for neural network functions? For ReLU neural networks there holds the
following [6, Theorem 8.8].
The bound (14.7.1) is meaningful if m ≫ k. For ReLU neural networks as in Theorem 14.23,
this means m ≫ nA L log(nA ) + nA L2 . Fixing L = 1 this amounts to m ≫ nA log(nA ) for a
shallow neural network with nA parameters. This condition is contrary to what we assumed in
Chapter 11, where it was crucial that nA ≫ m. If the VC dimension of the neural network sets
scale like O(nA log(nA )), then Theorem 14.21 and Corollary 14.22 indicate that, at least for certain
distributions, generalization should not be possible in this regime. We will discuss the resolution
of this potential paradox in Chapter 15.
Theorem 14.24. Let k, d ∈ N. Assume that for every ε > 0 there exists Lε ∈ N and Aε with Lε
layers and input dimension d such that
$\sup_{\|f\|_{C^k([0,1]^d)}\leq 1}\ \inf_{\Phi\in N(\sigma_{ReLU}; A_\varepsilon, \infty)}\|f - \Phi\|_{C^0([0,1]^d)} < \frac{\varepsilon}{2}.$
Then there exists C > 0 solely depending on k and d, such that for all ε ∈ (0, 1)
$n_{A_\varepsilon} L_\varepsilon\log(n_{A_\varepsilon}) + n_{A_\varepsilon} L_\varepsilon^2 \geq C\varepsilon^{-\frac{d}{k}}.$
$\sup_{x\in[0,1]^d}|f_y(x) - \Phi_y(x)| < \frac{\varepsilon}{2\tau_k}.$
Then
$1_{[0,\infty)}\Big(\Phi_y(x_j) - \frac{\varepsilon}{2\tau_k}\Big) = y_j \quad \text{for all } j = 1, \dots, N(\varepsilon).$
Hence, the VC dimension of $N(\sigma_{ReLU}; A_{\tau_k^{-1}\varepsilon}, \infty)$ is larger or equal to $N(\varepsilon)$. Theorem 14.23 thus implies
$N(\varepsilon) \simeq \varepsilon^{-\frac{d}{k}} \leq C\cdot\big(n_{A_{\tau_k^{-1}\varepsilon}} L_{\tau_k^{-1}\varepsilon}\log(n_{A_{\tau_k^{-1}\varepsilon}}) + n_{A_{\tau_k^{-1}\varepsilon}} L_{\tau_k^{-1}\varepsilon}^2\big)$
or equivalently
$\tau_k^{\frac{d}{k}}\varepsilon^{-\frac{d}{k}} \leq C\cdot\big(n_{A_\varepsilon} L_\varepsilon\log(n_{A_\varepsilon}) + n_{A_\varepsilon} L_\varepsilon^2\big).$
Figure 14.7: Illustration of fy from Equation (14.8.1) on [0, 1]2 .
In terms of the neural network size, this (necessary) condition becomes nAε ≥ Cε−d/k / log(ε)2 .
As we have shown in Chapter 7, in particular Theorem 7.10, up to log terms this condition
is also sufficient. Hence, while the constructive proof of Theorem 7.10 might have seemed
rather specific, under the assumption of the depth increasing at most logarithmically (which
the construction in Chapter 7 satisfies), it was essentially optimal! The neural networks in
this proof are shown to have size O(ε−d/k ) up to log terms.
• If we allow the depth Lε to increase faster than logarithmically in ε, then the lower bound on
the required neural network size improves. Fixing for example Aε with Lε layers such that
nAε ≤ W Lε for some fixed ε independent W ∈ N, the (necessary) condition on the depth
becomes
$W\log(WL_\varepsilon)L_\varepsilon^2 + WL_\varepsilon^3 \geq C\varepsilon^{-\frac{d}{k}}$
Bibliography and further reading
Classical statistical learning theory is based on the foundational work of Vapnik and Chervonenkis
[277]. This led to the formulation of the probably approximately correct (PAC) learning model
in [276], which is primarily utilized in this chapter. A streamlined mathematical introduction to
statistical learning theory can be found in [58].
Since statistical learning theory is well-established, there exists a substantial amount of excellent
expository work describing this theory. Some highly recommended books on the topic are [178,
250, 6]. The specific approach of characterizing learning via covering numbers has been discussed
extensively in [6, Chapter 14]. Specific results for ReLU activation used in this chapter were derived
in [241, 24]. The results of Section 14.8 describe some of the findings in [292, 293]. Other scenarios
in which the tightness of the upper bounds were shown are, for example, if quantization of weights
is assumed, [30, 77, 208], or when some form of continuity of the approximation scheme is assumed,
see [67] for general lower bounds (also applicable to neural networks).
Exercises
Exercise 14.25. Let H be a set of neural networks with fixed architecture, where the weights are
taken from a compact set. Moreover, assume that the activation function is continuous. Show that
for every sample S there always exists an empirical risk minimizer hS .
Exercise 14.28. Show that, the VC dimension of H of Example 14.18 is indeed 3, by demonstrating
that no set of four points can be shattered by H.
H := {x 7→ 1[0,∞) (sin(wx)) | w ∈ R}
is infinite.
Chapter 15
Generalization in the
overparameterized regime
In the previous chapter, we discussed the theory of generalization for deep neural networks trained
by minimizing the empirical risk. A key conclusion was that good generalization is possible as long
as we choose an architecture that has a moderate number of neural network parameters relative to
the number of training samples. Moreover, we saw in Section 14.6 that the best performance can be
expected when the neural network size is chosen to balance the generalization and approximation
errors, by minimizing their sum.
Figure 15.1: ImageNet Classification Competition: Final score on the test set in the Top 1 cat-
egory vs. Parameters-to-Training-Samples Ratio. Note that all architectures have more parame-
ters than training samples. Architectures include AlexNet [148], VGG16 [255], GoogLeNet [263],
ResNet50/ResNet152 [112], DenseNet121 [121], ViT-G/14 [296], EfficientNetB0 [265], and Amoe-
baNet [226].
Surprisingly, successful neural network architectures do not necessarily follow these theoretical
observations. Consider the neural network architectures in Figure 15.1. They represent some
of the most renowned image classification models, and all of them participated in the ImageNet
Classification Competition [66]. The training set consisted of 1.2 million images. The x-axis shows
the model performance, and the y-axis displays the ratio of the number of parameters to the size of
the training set; notably, all architectures have a ratio larger than one, i.e. have more parameters
than training samples. For the largest model, there are roughly 1000 times more neural network parameters than training samples.
Given that the practical application of deep learning appears to operate in a regime significantly
different from the one analyzed in Chapter 14, we must ask: Why do these methods still work
effectively?
Figure 15.2: Illustration of the double descent phenomenon: the risk $R(h)$ and the empirical risk $\widehat{R}_S(h)$ as functions of the expressivity of $H$, with the underfitting and overfitting regimes separated by the interpolation threshold.
The goal is to determine coefficients $w \in \mathbb{R}^n$ minimizing the empirical risk
$\widehat{R}_S(w) = \frac{1}{m}\sum_{j=1}^{m}\Big(\sum_{i=1}^{n} w_i\phi_i(x_j) - y_j\Big)^2 = \frac{1}{m}\sum_{j=1}^{m}\big(\langle\phi(x_j), w\rangle - y_j\big)^2.$
With
$A_n := \begin{pmatrix}\phi_1(x_1) & \dots & \phi_n(x_1)\\ \vdots & & \vdots\\ \phi_1(x_m) & \dots & \phi_n(x_m)\end{pmatrix} = \begin{pmatrix}\phi(x_1)^\top\\ \vdots\\ \phi(x_m)^\top\end{pmatrix} \in \mathbb{R}^{m\times n} \qquad (15.1.2)$
and $y = (y_1, \dots, y_m)^\top$ it holds
$\widehat{R}_S(w) = \frac{1}{m}\|A_n w - y\|^2. \qquad (15.1.3)$
As discussed in Sections 11.1-11.2, a unique minimizer of (15.1.3) only exists if An has rank n.
For a minimizer wn , the fitted function reads
$f_n(x) := \sum_{j=1}^{n} w_{n,j}\phi_j(x). \qquad (15.1.4)$
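A minimal sketch of this setup, using random Fourier features as stand-ins for the Gaussian-process ansatz functions of Section 15.1.2 and the pseudoinverse to compute a minimum-norm least-squares solution (our reading of the minimizer (15.1.5) referenced below):

import numpy as np

def runge(x):
    return 1.0 / (1.0 + 25.0 * x ** 2)

rng = np.random.default_rng(0)
m = 18
x_train = np.linspace(-1.0, 1.0, m)
y_train = runge(x_train)
x_test = np.linspace(-1.0, 1.0, 500)

# Random Fourier features as stand-ins for the ansatz functions phi_1, ..., phi_40.
freqs = rng.normal(scale=3.0, size=40)
shifts = rng.uniform(0.0, 2.0 * np.pi, size=40)
def features(x, n):
    return np.cos(np.outer(x, freqs[:n]) + shifts[:n])     # the matrix A_n of (15.1.2)

for n in [2, 15, 18, 40]:
    w = np.linalg.pinv(features(x_train, n)) @ y_train     # minimum-norm least-squares solution
    f_n = features(x_test, n) @ w                          # fitted function (15.1.4)
    print(n, np.sqrt(np.mean((f_n - runge(x_test)) ** 2)), np.linalg.norm(w))
# Typically both the error and ||w|| peak near the interpolation threshold n = m and
# decrease again in the overparameterized regime (double descent).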
We are interested in the behavior of the fn as a function of n (the number of ansatz func-
tions/parameters of our model), and distinguish between two cases:
• Underparameterized: If $n < m$ we have fewer parameters $n$ than training points $m$. For the least squares problem of minimizing $\widehat{R}_S$, this means that there are more conditions $m$ than free parameters $n$. Thus, in general, we cannot interpolate the data, and we have $\min_{w\in\mathbb{R}^n}\widehat{R}_S(w) > 0$.
15.1.2 An example
We consider a concrete example. In Figure 15.3 we plot a set of 40 ansatz functions ϕ1 , . . . , ϕ40 ,
which are drawn from a Gaussian process. Additionally, the figure shows a plot of the Runge
function f , and m = 18 equispaced points which are used as the training data points. We then fit
a function in span{ϕ1 , . . . , ϕn } via (15.1.5) and (15.1.4). The result is displayed in Figure 15.4:
1 Here, the index n emphasizes the dimension of w_{n,∗} ∈ R^n. This notation should not be confused with the ridge-regularized minimizer w_{λ,∗} introduced in Chapter 11.
(a) ansatz functions ϕj (b) Runge function f and data points
Figure 15.3: Ansatz functions ϕ1 , . . . , ϕ40 drawn from a Gaussian process, along with the Runge
function and 18 equispaced data points.
• n = 2: The model can only represent functions in span{ϕ1 , ϕ2 }. It is not yet expressive
enough to give a meaningful approximation of f .
• n = 15: The model has sufficient expressivity to capture the main characteristics of f. Since
n = 15 < 18 = m, it is not yet able to interpolate the data. Thus it strikes a good balance
between the approximation and generalization errors, which corresponds to the scenario
discussed in Chapter 14.
• n = 18: We are at the interpolation threshold. The model is capable of interpolating the data,
and there is a unique w such that R̂_S(w) = 0. Yet, in between data points the behavior of the
predictor f18 seems erratic, and displays strong oscillations. This is referred to as overfitting,
and is to be expected due to our analysis in Chapter 14; while the approximation error at the
data points has improved compared to the case n = 15, the generalization error has gotten
worse.
• n = 40: This is the overparameterized regime, where we have significantly more parameters
than data points. Our prediction f40 interpolates the data and appears to be the best overall
approximation to f so far, due to a “good” choice of minimizer of R̂_S, namely (15.1.5).
We also note that, while quite good, the fit is not perfect. We cannot expect significant
improvement in performance by further increasing n, since at this point the main limiting
factor is the amount of available data. Also see Figure 15.5 (a).
Figure 15.5 (a) displays the error ∥f − fn ∥L2 ([−1,1]) over n. We observe the characteristic double
descent curve, where the error initially decreases and then peaks at the interpolation threshold,
which is marked by the dashed red line. Afterwards, in the overparameterized regime, it starts to
decrease again. Figure 15.5 (b) displays ∥wn,∗ ∥. Note how the Euclidean norm of the coefficient
vector also peaks at the interpolation threshold.
We emphasize that the precise nature of the convergence curves depends strongly on various
factors, such as the distribution and number of training points m, the ground truth f , and the
choice of ansatz functions ϕj (e.g., the specific kernel used to generate the ϕj in Figure 15.3 (a)).
In the present setting we achieve a good approximation of f for n = 15 < 18 = m corresponding to
the regime where the approximation and generalization errors are balanced. However, as Figure 15.5
(a) n = 2 (underparameterization) (b) n = 15 (balance of appr. and gen. error)
(c) n = 18 (interpolation threshold) (d) n = 40 (overparameterization)
Figure 15.4: Fit of the m = 18 red data points using the ansatz functions ϕ1 , . . . , ϕn from Figure
15.3, employing equations (15.1.5) and (15.1.4) for different numbers of ansatz functions n.
(a) ∥f − fn∥L2([−1,1]) (b) ∥wn,∗∥
Figure 15.5: The L2 -error for the fitted functions in Figure 15.4, and the Euclidean norm of the
corresponding coefficient vector wn,∗ defined in (15.1.5).
(a) shows, it can be difficult to determine a suitable value of n < m a priori, and the acceptable
range of n values can be quite narrow. For overparametrization (n ≫ m), the precise choice of n is
less critical, potentially making the algorithm more stable in this regime. We encourage the reader
to conduct similar experiments and explore different settings to get a better feeling for the double
descent phenomenon.
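The reader may, for instance, start from a short script along the following lines. It is a minimal sketch of the experiment of this section, using the pseudoinverse to compute the minimal norm least squares solution (15.1.5); the Gaussian process kernel, its length scale, and the grid used to approximate the L2-error are illustrative assumptions that the text does not prescribe.

```python
# A minimal numerical sketch of the double descent experiment in Section 15.1.2.
# The Gaussian process kernel, its length scale, and the evaluation grid are
# illustrative assumptions; the book does not specify them.
import numpy as np

rng = np.random.default_rng(0)

def f(x):                                    # Runge function
    return 1.0 / (1.0 + 25.0 * x**2)

m = 18
x_train = np.linspace(-1.0, 1.0, m)          # equispaced training points
y_train = f(x_train)
x_grid = np.linspace(-1.0, 1.0, 400)         # grid to approximate the L2 error

# Draw 40 ansatz functions phi_1, ..., phi_40 from a Gaussian process with a
# squared-exponential kernel, represented by their values on all points.
x_all = np.concatenate([x_train, x_grid])
K = np.exp(-0.5 * (x_all[:, None] - x_all[None, :])**2 / 0.2**2)
K += 1e-10 * np.eye(len(x_all))              # jitter for numerical stability
phi = rng.multivariate_normal(np.zeros(len(x_all)), K, size=40).T  # columns phi_j

for n in (2, 15, 18, 40):
    A_n = phi[:m, :n]                        # matrix A_n from (15.1.2)
    w_n = np.linalg.pinv(A_n) @ y_train      # minimal norm least squares solution
    f_n = phi[m:, :n] @ w_n                  # fitted function f_n on the grid
    err = np.sqrt(2.0 * np.mean((f(x_grid) - f_n)**2))   # approximate L2([-1,1]) error
    print(f"n = {n:2d}: L2 error ~ {err:.3f}, ||w_n|| = {np.linalg.norm(w_n):.2f}")
```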
Proposition 15.1. Assume that x1 , . . . , xm and the (ϕj )j∈N are such that An in (15.1.2) has full
rank n for all n ≤ m. Given y ∈ Rm , denote by wn,∗ (y) the vector in (15.1.5). Then
n \mapsto \sup_{\|y\|=1} \|w_{n,*}(y)\| \quad\text{is monotonically}\quad \begin{cases} \text{increasing} & \text{for } n < m,\\ \text{decreasing} & \text{for } n \ge m.\end{cases}
Proof. We start with the case n ≥ m. By assumption Am has full rank m, and thus An has rank
m for all n ≥ m, see (15.1.2). In particular, there exists wn ∈ Rn such that An wn = y. Now fix
y ∈ Rm and let wn be any such vector. Then wn+1 := (wn , 0) ∈ Rn+1 satisfies An+1 wn+1 = y
and ∥wn+1 ∥ = ∥wn ∥. Thus necessarily ∥wn+1,∗ ∥ ≤ ∥wn,∗ ∥ for the minimal norm solutions defined
in (15.1.5). Since this holds for every y, we obtain the statement for n ≥ m.
Now let n < m. Recall that the minimal norm solution can be written through the pseudoinverse
w_{n,*}(y) = A_n^\dagger y,
see Appendix B.1. That is,
A_n^\dagger = V_n \begin{pmatrix} s_{n,1}^{-1} & & & \\ & \ddots & & 0 \\ & & s_{n,n}^{-1} & \end{pmatrix} U_n^\top \in \mathbb{R}^{n \times m},
where A_n = U_n \Sigma_n V_n^\top is the singular value decomposition of A_n, and
\Sigma_n = \begin{pmatrix} s_{n,1} & & \\ & \ddots & \\ & & s_{n,n} \\ & 0 & \end{pmatrix} \in \mathbb{R}^{m \times n}
contains the singular values s_{n,1} ≥ · · · ≥ s_{n,n} > 0 of A_n ∈ R^{m×n} ordered by decreasing size. Since
V_n ∈ R^{n×n} and U_n ∈ R^{m×m} are orthogonal matrices, we have
\sup_{\|y\|=1} \|w_{n,*}(y)\| = \sup_{\|y\|=1} \|A_n^\dagger y\| = \frac{1}{s_{n,n}}.
Since appending a column to A_n cannot increase its smallest singular value, we observe that n ↦ s_{n,n} is monotonically decreasing for n ≤ m. This concludes the proof.
An assumption of the type B ≤ c_B · (dCσ)^{−1}, i.e. a scaling of the weights by the reciprocal 1/d of
the width, is not unreasonable in practice: standard initialization schemes such as LeCun [156] or
He [111] initialization use random weights whose variance is scaled inversely proportional to the input
dimension of each layer. Moreover, as we saw in Chapter 11, for very wide neural networks, the
weights do not move significantly from their initialization during training. Additionally, many
training routines use regularization terms on the weights, thereby encouraging the optimization
routine to find small weights.
We study the generalization capacity of Lipschitz functions through the covering-number-
based learning results of Chapter 14. The set LipC (Ω) of C-Lipschitz functions on a compact
d-dimensional Euclidean domain Ω has covering numbers bounded according to
\log\big(\mathcal{G}(\mathrm{Lip}_C(\Omega), \varepsilon, L^\infty)\big) \le C_{\mathrm{cov}} \cdot \Big(\frac{C}{\varepsilon}\Big)^{d} \quad \text{for all } \varepsilon > 0   (15.3.2)
for some constant Ccov independent of ε > 0. A proof can be found in [97, Lemma 7], see also [273].
As a result of these considerations, we can identify two regimes:
• Standard regime: For small neural network size nA , we consider neural networks as a set
parameterized by nA parameters. As we have seen before, this yields a bound on the gen-
eralization error that scales linearly with nA . As long as nA is small in comparison to the
number of samples, we can expect good generalization by Theorem 14.15.
• Overparameterized regime: For large neural network size nA and small weights, we consider
neural networks as a subset of LipC (Ω) for a constant C > 0. This set has a covering number
bound that is independent of the number of parameters nA .
Choosing the better of the two generalization bounds for each regime yields the following result.
Recall that N ∗ (σ; A, B) denotes all neural networks in N (σ; A, B) with a range contained in [−1, 1]
(see (14.5.1)).
Theorem 15.2. Let C, CL > 0 and let L : [−1, 1] × [−1, 1] → R be CL -Lipschitz. Further, let
A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 , let σ : R → R be Cσ -Lipschitz continuous with Cσ ≥ 1, and
|σ(x)| ≤ Cσ |x| for all x ∈ R, and let B > 0.
Then, there exist c1, c2 > 0, such that for every m ∈ N, every δ ∈ (0, 1), and every distribution D on
[−1, 1]^{d_0} × [−1, 1] it holds with probability at least 1 − δ over S ∼ D^m that for all Φ ∈
N ∗(σ; A, B) ∩ LipC([−1, 1]^{d_0})
|R(\Phi) - \widehat{R}_S(\Phi)| \le g(A, C_\sigma, B, m) + 4 C_L \sqrt{\frac{\log(4/\delta)}{m}},   (15.3.3)
where
g(A, C_\sigma, B, m) = \min\bigg\{ c_1 \sqrt{\frac{n_A \log(n_A \lceil\sqrt{m}\rceil) + L\, n_A \log(d_{\max})}{m}},\; c_2\, m^{-\frac{1}{2+d_0}} \bigg\}.
Proof. Applying Theorem 14.11 with α = 1/(2 + d0 ) and (15.3.2), we obtain that with probability
at least 1 − δ/2 it holds for all Φ ∈ LipC ([−1, 1]d0 )
|R(\Phi) - \widehat{R}_S(\Phi)| \le 4C_L \sqrt{\frac{C_{\mathrm{cov}}\,(m^{\alpha} C)^{d_0} + \log(4/\delta)}{m}} + \frac{2C_L}{m^{\alpha}}
\le 4C_L \sqrt{C_{\mathrm{cov}}\, C^{d_0}\, m^{d_0/(d_0+2)-1}} + \frac{2C_L}{m^{\alpha}} + 4C_L\sqrt{\frac{\log(4/\delta)}{m}}
= 4C_L \sqrt{C_{\mathrm{cov}}\, C^{d_0}\, m^{-2/(d_0+2)}} + \frac{2C_L}{m^{\alpha}} + 4C_L\sqrt{\frac{\log(4/\delta)}{m}}
= \frac{4C_L \sqrt{C_{\mathrm{cov}}\, C^{d_0}} + 2C_L}{m^{\alpha}} + 4C_L\sqrt{\frac{\log(4/\delta)}{m}},
where we used in the second inequality that \sqrt{x + y} \le \sqrt{x} + \sqrt{y} for all x, y ≥ 0.
In addition, Theorem 14.15 yields that with probability at least 1 − δ/2 it holds for all Φ ∈
N ∗(σ; A, B)
|R(\Phi) - \widehat{R}_S(\Phi)| \le 4C_L \sqrt{\frac{n_A \log(\lceil n_A \sqrt{m}\rceil) + L\, n_A \log(\lceil 2C_\sigma B d_{\max}\rceil) + \log(4/\delta)}{m}} + \frac{2C_L}{\sqrt{m}}
\le 6C_L \sqrt{\frac{n_A \log(\lceil n_A \sqrt{m}\rceil) + L\, n_A \log(\lceil 2C_\sigma B d_{\max}\rceil)}{m}} + 4C_L \sqrt{\frac{\log(4/\delta)}{m}}.
Then, by a union bound, for Φ ∈ N ∗(σ; A, B) ∩ LipC([−1, 1]^{d_0}) the minimum of both upper bounds holds with
probability at least 1 − δ.
The two regimes in Theorem 15.2 correspond to the two terms comprising the minimum in the
definition of g(A, Cσ, B, m). The first term increases with nA while the second is constant. In the
first regime, where the first term is smaller, the generalization gap |R(Φ) − R̂_S(Φ)| increases with
nA.
In the second regime, where the second term is smaller, the generalization gap is constant in
nA. Moreover, it is reasonable to assume that the empirical risk R̂_S will decrease with increasing
number of parameters nA.
By (15.3.3) we can bound the risk by
R(\Phi) \le \widehat{R}_S(\Phi) + g(A, C_\sigma, B, m) + 4C_L\sqrt{\frac{\log(4/\delta)}{m}}.
In the second regime, this upper bound is monotonically decreasing. In the first regime it may
both decrease and increase. In some cases, this behavior can lead to an upper bound on the risk
resembling the curve of Figure 15.2.
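To get a feeling for the shape of this bound, one can simply evaluate the two terms inside the minimum defining g(A, Cσ, B, m) as functions of nA. The following sketch does this; the constants c1, c2 and the quantities m, d0, L, dmax are arbitrary illustrative choices, not values derived in the text.

```python
# A small sketch of the two terms in the minimum defining g(A, C_sigma, B, m).
# The constants c1, c2 and the architecture-dependent quantities are arbitrary
# illustrative choices, not values derived in the text.
import numpy as np

m, d0, L, d_max = 10_000, 16, 5, 64
c1, c2 = 1.0, 1.0

n_A = np.logspace(1, 6, 200)                     # number of parameters
term1 = c1 * np.sqrt((n_A * np.log(n_A * np.ceil(np.sqrt(m)))
                      + L * n_A * np.log(d_max)) / m)
term2 = c2 * m ** (-1.0 / (2 + d0)) * np.ones_like(n_A)

g = np.minimum(term1, term2)
switch = n_A[np.argmax(term1 > term2)]           # where the parameter-count bound stops helping
print(f"bound switches to the Lipschitz regime at roughly n_A ~ {switch:.0f}")
```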
Remark 15.3. Theorem 15.2 assumes C-Lipschitz continuity of the neural networks. As we saw in
Sections 15.1.2 and 15.2, this assumption may not hold near the interpolation threshold. Hence,
Theorem 15.2 likely gives an overly optimistic upper bound near the interpolation threshold.
Bibliography and further reading
The discussion on kernel regression and the effect of the number of parameters on the norm of the
weights was already given in [18]. Similar analyses, with more complex ansatz systems and more
precise asymptotic estimates, are found in [173, 107]. Our results in Section 15.3 are inspired by
[15]; see also [191].
For a detailed account of further arguments justifying the surprisingly good generalization
capabilities of overparameterized neural networks, we refer to [25, Section 2]. Here, we only briefly
mention two additional directions of inquiry. First, if the learning algorithm introduces a form of
robustness, this can be leveraged to yield generalization bounds [9, 291, 34, 215]. Second, for very
overparameterized neural networks, it was stipulated in [131] that neural networks become linear
kernel interpolators as discussed in Chapter 11. Thus, for large neural networks, generalization can
be studied through kernel regression [131, 158, 19, 162].
Exercises
Exercise 15.4. Let f : [−1, 1] → R be a continuous function, and let −1 ≤ x1 < · · · < xm ≤ 1 for
some fixed m ∈ N. As in Section 15.1.2, we wish to approximate f by a least squares approximation.
To this end we use the Fourier ansatz functions
b_0(x) := \frac{1}{2} \qquad\text{and}\qquad b_j(x) := \begin{cases} \sin(\lceil \tfrac{j}{2}\rceil \pi x) & j \ge 1 \text{ odd},\\ \cos(\lceil \tfrac{j}{2}\rceil \pi x) & j \ge 1 \text{ even}. \end{cases}   (15.3.4)
Denote by w^n_∗ ∈ R^{n+1} the minimal norm minimizer of R̂_S, and set f_n(x) := \sum_{i=0}^{n} w^n_{∗,i}\, b_i(x).
Show that in this case generalization fails in the overparametrized regime: for sufficiently large
n ≫ m, fn is not necessarily a good approximation to f . What does fn converge to as n → ∞?
Exercise 15.5. Consider the setting of Exercise 15.4. We adapt the ansatz functions in (15.3.4)
by rescaling them via
b̃j := cj bj .
Choose real numbers cj ∈ R, such that the corresponding minimal norm least squares solution
avoids the phenomenon encountered in Exercise 15.4.
Hint: Should ansatz functions corresponding to large frequencies be scaled by large or small
numbers to avoid overfitting?
Chapter 16
How sensitive is the output of a neural network to small changes in its input? Real-world obser-
vations of trained neural networks often reveal that even barely noticeable modifications of the
input can lead to drastic variations in the network’s predictions. This intriguing behavior was first
documented in the context of image classification in [264].
Figure 16.1 illustrates this concept. The left panel shows a picture of a panda that the neural
network correctly classifies as a panda. By adding an almost imperceptible amount of noise to the
image, we obtain the modified image in the right panel. To a human, there is no visible difference,
but the neural network classifies the perturbed image as a wombat. This phenomenon, where
a correctly classified image is misclassified after a slight perturbation, is termed an adversarial
example.
In practice, such behavior is highly undesirable. It indicates that our learning algorithm might
not be very reliable and poses a potential security risk, as malicious actors could exploit it to trick
the algorithm. In this chapter, we describe the basic mathematical principles behind adversarial
examples and investigate simple conditions under which they might or might not occur. For sim-
plicity, we restrict ourselves to a binary classification problem but note that the main ideas remain
valid in more general situations.
Figure 16.1: A panda image plus 0.01 times a barely visible noise image yields a perturbed image. To a human the three images read: panda / noise / still a panda. The neural network classifier predicts: panda (high confidence) / flamingo (low confidence) / wombat (high confidence).
16.1 Adversarial examples
Let us start by formalizing the notion of an adversarial example. We consider the problem of
assigning a label y ∈ {−1, 1} to a vector x ∈ Rd . It is assumed that the relation between x and y
is described by a distribution D on Rd × {−1, 1}. In particular, for a given x, both values −1 and
1 could have positive probability, i.e. the label is not necessarily deterministic. Additionally, we let
Dx := {x ∈ Rd | ∃y s.t. (x, y) ∈ supp(D)}, (16.1.1)
and refer to Dx as the feature support.
Throughout this chapter we denote by
g : Rd → {−1, 0, 1}
a fixed so-called ground-truth classifier, satisfying1
P[y = g(x)|x] ≥ P[y = −g(x)|x] for all x ∈ Dx . (16.1.2)
Note that we allow g to take the value 0, which is to be understood as an additional label corre-
sponding to nonrelevant or nonsensical input data x. We will refer to g −1 (0) as the nonrelevant
class. The ground truth g is interpreted as how a human would classify the data, as the following
example illustrates.
Example 16.1. We wish to classify whether an image shows a panda (y = 1) or a wombat (y = −1).
Consider again Figure 16.1, and denote the three images by x1 , x2 , x3 . The first image x1 is a
photograph of a panda. Together with a label y, it can be interpreted as a draw (x1 , y) from a
distribution of images D, i.e. x1 ∈ Dx and g(x1 ) = 1. The second image x2 displays noise and
corresponds to nonrelevant data as it shows neither a panda nor a wombat. In particular, x2 ∈ Dxc
and g(x2 ) = 0. The third (perturbed) image x3 also belongs to Dxc , as it is not a photograph but
a noise corrupted version of x1 . Nonetheless, it is not nonrelevant, as a human would classify it as
a panda. Thus g(x3 ) = 1. ♢
Additional to the ground truth g, we denote by
h : Rd → {−1, 1}
some trained classifier.
1 To be more precise, the conditional distribution of y|x is only well-defined almost everywhere w.r.t. the marginal
distribution of x. Thus (16.1.2) can only be assumed to hold for almost every x ∈ Dx w.r.t. the marginal
distribution of x.
In words, x′ is an adversarial example to x with perturbation δ, if (i) the distance of x and x′
is at most δ, (ii) x and x′ belong to the same (not nonrelevant) class according to the ground truth
classifier, and (iii) the classifier h correctly classifies x but misclassifies x′ .
Remark 16.3. We emphasize that the concept of a ground-truth classifier g differs from a minimizer
of the Bayes risk (14.1.1) for two reasons. First, we allow for an additional label 0 corresponding to
the nonrelevant class, which does not exist for the data generating distribution D. Second, g should
correctly classify points outside of Dx ; small perturbations of images as we find them in adversarial
examples, are not regular images in Dx . Nonetheless, a human classifier can still classify these
images, and g models this property of human classification.
(iv) Not a Bayes classifier/non-exhaustive distribution: In this case everything is possible. Data
points and their associated adversarial examples can both appear in the feature support of the
distribution, and adversarial examples to elements of the feature support can also be created
by leaving the feature support. We will see examples in the following section.
Theorem 16.4. Let w, w̄ ∈ R^d be nonzero. For x ∈ R^d, let h(x) = sign(w⊤x) be a classifier and
let g(x) = sign(w̄⊤x) be the ground-truth classifier.
For every x ∈ R^d with h(x)g(x) > 0 and all ε ∈ (0, |w⊤x|) such that
\frac{|\bar{w}^\top x|}{\|\bar{w}\|} > \frac{\varepsilon + |w^\top x|}{\|w\|} \cdot \frac{|w^\top \bar{w}|}{\|w\|\,\|\bar{w}\|}   (16.3.2)
it holds that
x' = x - h(x)\,\frac{\varepsilon + |w^\top x|}{\|w\|^2}\, w   (16.3.3)
is an adversarial example to x with perturbation δ = (ε + |w⊤x|)/∥w∥.
Before we present the proof, we give some interpretation of this result. First, note that {x ∈
R^d | w⊤x = 0} is the decision boundary of h, meaning that points lying on opposite sides of this
hyperplane are classified differently by h. Due to |w⊤w̄| ≤ ∥w∥∥w̄∥, (16.3.2) implies that an
adversarial example always exists whenever
\frac{|\bar{w}^\top x|}{\|\bar{w}\|} > \frac{|w^\top x|}{\|w\|}.   (16.3.4)
The left term is the decision margin of x for g, i.e. the distance of x to the decision boundary
of g. Similarly, the term on the right is the decision margin of x for h. Thus we conclude that
adversarial examples exist if the decision margin of x for the ground truth g is larger than that for
the classifier h.
Second, the term (w⊤w̄)/(∥w∥∥w̄∥) describes the alignment of the two classifiers. If the clas-
sifiers are not aligned, i.e., w and w̄ have a large angle between them, then adversarial examples
exist even if the margin of the classifier is larger than that of the ground-truth classifier.
Finally, adversarial examples with small perturbation are possible if |w⊤ x| ≪ ∥w∥. The ex-
treme case w⊤ x = 0 means that x lies on the decision boundary of h, and if |w⊤ x| ≪ ∥w∥ then
x is close to the decision boundary of h.
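The construction (16.3.3) is easy to verify numerically. Below is a minimal sketch; the vectors w and w̄, the point x, and ε are arbitrary illustrative choices that happen to satisfy the assumptions of Theorem 16.4.

```python
# A minimal sketch of the construction in Theorem 16.4 for linear classifiers.
# The vectors w (classifier), w_bar (ground truth), the point x, and epsilon
# are arbitrary illustrative choices.
import numpy as np

w_bar = np.array([1.0, 0.0])                 # ground-truth normal, g(x) = sign(w_bar @ x)
w = np.array([1.0, 0.8])                     # classifier normal,   h(x) = sign(w @ x)
x = np.array([1.0, -0.3])
eps = 0.01

h = lambda z: np.sign(w @ z)
g = lambda z: np.sign(w_bar @ z)

# construction (16.3.3)
x_adv = x - h(x) * (eps + abs(w @ x)) / (w @ w) * w
delta = (eps + abs(w @ x)) / np.linalg.norm(w)

print("perturbation  :", np.linalg.norm(x - x_adv), "=", delta)
print("h(x), h(x_adv):", h(x), h(x_adv))     # prediction flips
print("g(x), g(x_adv):", g(x), g(x_adv))     # ground-truth class is unchanged
```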
Proof (of Theorem 16.4). We verify that x′ in (16.3.3) satisfies the conditions of an adversarial
example in Definition 16.2. In the following we will use that due to h(x)g(x) > 0
g(x) = sign(w̄⊤x) = sign(w⊤x) = h(x) ̸= 0.   (16.3.5)
First, it holds
\|x - x'\| = \Big\| \frac{\varepsilon + |w^\top x|}{\|w\|^2}\, w \Big\| = \frac{\varepsilon + |w^\top x|}{\|w\|} = \delta.
Next we show g(x)g(x′) > 0, i.e. that (w̄⊤x)(w̄⊤x′) is positive. Plugging in the definition of
x′, this term reads
\bar{w}^\top x \Big( \bar{w}^\top x - h(x)\,\frac{\varepsilon + |w^\top x|}{\|w\|^2}\, \bar{w}^\top w \Big) = |\bar{w}^\top x|^2 - |\bar{w}^\top x|\, \frac{\varepsilon + |w^\top x|}{\|w\|^2}\, \bar{w}^\top w
\ge |\bar{w}^\top x|^2 - |\bar{w}^\top x|\, \frac{\varepsilon + |w^\top x|}{\|w\|^2}\, |\bar{w}^\top w|,   (16.3.6)
where the equality holds because h(x) = g(x) = sign(w̄⊤x) by (16.3.5). Dividing the right-hand
side of (16.3.6) by |w̄⊤x|∥w̄∥, which is positive by (16.3.5), we obtain
\frac{|\bar{w}^\top x|}{\|\bar{w}\|} - \frac{\varepsilon + |w^\top x|}{\|w\|} \cdot \frac{|\bar{w}^\top w|}{\|w\|\,\|\bar{w}\|}.   (16.3.7)
The term (16.3.7) is positive thanks to (16.3.2).
Finally, we check that 0 ̸= h(x′) ̸= h(x), i.e. (w⊤x)(w⊤x′) < 0. We have that
(w^\top x)(w^\top x') = |w^\top x|^2 - w^\top x\, h(x)\, \frac{\varepsilon + |w^\top x|}{\|w\|^2}\, w^\top w = |w^\top x|^2 - |w^\top x|(\varepsilon + |w^\top x|) < 0,
where we used that h(x) = sign(w⊤x). This completes the proof.
Theorem 16.4 readily implies the following proposition for affine classifiers.
it holds that
x' = x - h(x)\,\frac{\varepsilon + |w^\top x + b|}{\|w\|^2}\, w
is an adversarial example with perturbation δ = (ε + |w⊤x + b|)/∥w∥ to x.
Next fix α ∈ (0, 1) and set w := αw̄ + (1 − α)v for some v ∈ w̄⊥ with ∥v∥ = 1, so that ∥w∥ = 1.
We let h(x) := sign(w⊤x). We now show that every x ∈ Dx satisfies the assumptions of Theorem
16.4, and therefore admits an adversarial example.
Note that h(x) = g(x) for every x ∈ Dx. Hence h is a Bayes classifier. Now fix x ∈ Dx. Then
|w⊤x| ≤ α|w̄⊤x|, so that (16.3.2) is satisfied. Furthermore, for every ε > 0 it holds that
\delta := \frac{\varepsilon + |w^\top x|}{\|w\|} \le \varepsilon + \alpha.
Hence, for ε < |w⊤x| it holds by Theorem 16.4 that there exists an adversarial example with
perturbation less than ε + α. For small α, the situation is depicted in the upper panel of Figure
16.2. ♢
For the second example, we construct a distribution with global feature support and a classifier
which is not a Bayes classifier. This corresponds to case (iv) in Section 16.2.
Example 16.7. Let Dx be a distribution on R^d with positive Lebesgue density everywhere outside
the decision boundary DBg = {x | w̄⊤x = 0} of g. We define D to be the distribution of (X, g(X))
for X ∼ Dx. In addition, let w ∉ {±w̄}, ∥w∥ = 1 and h(x) = sign(w⊤x). We exclude w = −w̄
because, in this case, every prediction of h is wrong. Thus no adversarial examples are possible.
By construction the feature support is given by Dx = R^d. Moreover, h^{−1}({−1}), h^{−1}({1}) and
g^{−1}({−1}), g^{−1}({1}) are half spaces, which implies, in the notation of (16.2.2), that both C1 and C−1
have positive probability.
Hence, for every δ > 0 there is a positive probability of observing x to which an adversarial example
with perturbation δ exists.
The situation is depicted in the lower panel of Figure 16.2. ♢
i.e., as the distance of x to the closest element that is classified differently from x or the infimum
over all distances to elements from other classes if no closest element exists. Additionally, we denote
the distance of x to the closest adjacent affine piece by
where AΦ,x is the largest connected region on which Φ is affine and which contains x. We have the
following theorem.
\mu_g(x),\, \nu_\Phi(x) > \frac{\varepsilon + |\Phi(x)|}{\|\nabla\Phi(x)\|}.
Then
x' := x - h(x)\,\frac{\varepsilon + |\Phi(x)|}{\|\nabla\Phi(x)\|^2}\, \nabla\Phi(x)
Proof. We show that x′ satisfies the properties in Definition 16.2.
By construction ∥x − x′ ∥ ≤ δ. Since µg (x) > δ it follows that g(x) = g(x′ ). Moreover, by
assumption g(x) ̸= 0, and thus g(x)g(x′ ) > 0.
It only remains to show that h(x′ ) ̸= h(x). Since δ < νΦ (x), we have that Φ(x) = ∇Φ(x)⊤ x + b
and Φ(x′ ) = ∇Φ(x)⊤ x′ + b for some b ∈ R. Therefore,
\Phi(x) - \Phi(x') = \nabla\Phi(x)^\top (x - x') = \nabla\Phi(x)^\top\, h(x)\, \frac{\varepsilon + |\Phi(x)|}{\|\nabla\Phi(x)\|^2}\, \nabla\Phi(x) = h(x)\,(\varepsilon + |\Phi(x)|).
Since h(x)|Φ(x)| = Φ(x) it follows that Φ(x′ ) = −h(x)ε. Hence, h(x′ ) = −h(x), which completes
the proof.
Remark 16.9. We look at the key parameters in Theorem 16.8 to understand which factors facilitate
adversarial examples.
• The geometric margin of the ground-truth classifier µg (x): To make the construction possible,
we need to be sufficiently far away from points that belong to a different class than x or to
the nonrelevant class.
• The distance to the next affine piece νΦ (x): Since we are looking for an adversarial example
within the same affine piece as x, we need this piece to be sufficiently large.
16.5 Robustness
Having established that adversarial examples can arise in various ways under mild assumptions, we
now turn our attention to conditions that prevent their existence.
Proposition 16.10. Let Φ : Rd → R be CL -Lipschitz with CL > 0, and let s > 0. Let h(x) =
sign(Φ(x)) be a classifier, and let g : Rd → {−1, 0, 1} be a ground-truth classifier. Moreover, let
x ∈ Rd be such that
Φ(x)g(x) ≥ s. (16.5.1)
Then there does not exist an adversarial example to x of perturbation δ < s/CL .
Proof. Let x ∈ Rd satisfy (16.5.1) and assume that ∥x′ − x∥ ≤ δ. The Lipschitz continuity of Φ
implies
|\Phi(x') - \Phi(x)| \le C_L \|x' - x\| \le C_L \delta < s.
Since |Φ(x)| ≥ s and C_Lδ < s, we conclude that Φ(x′) has the same sign as Φ(x), which shows that x′ cannot be
an adversarial example to x.
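The following is a hedged numerical sketch of this certificate for a small ReLU network with random illustrative weights. The Lipschitz constant is bounded by the product of the spectral norms of the weight matrices, a standard but often loose bound which is not necessarily the one from Lemma 13.2, and the ground truth is assumed to agree with the classifier at the chosen point.

```python
# A minimal sketch of the robustness certificate in Proposition 16.10 for a small
# ReLU network. The weights are random illustrative values; the product of spectral
# norms is one standard (often loose) upper bound on the Lipschitz constant.
import numpy as np

rng = np.random.default_rng(1)
W0, b0 = rng.normal(size=(8, 2)) / 4, np.zeros(8)
W1, b1 = rng.normal(size=(1, 8)) / 4, np.zeros(1)

def Phi(x):
    return (W1 @ np.maximum(W0 @ x + b0, 0.0) + b1)[0]

C_L = np.linalg.norm(W0, 2) * np.linalg.norm(W1, 2)   # ReLU is 1-Lipschitz

x = np.array([0.7, -0.2])
g_x = np.sign(Phi(x))                                 # assume the ground truth agrees at x
s = Phi(x) * g_x                                      # margin, so (16.5.1) holds with this s
print(f"margin s = {s:.3f}, Lipschitz bound C_L = {C_L:.3f}")
print(f"no adversarial example with perturbation below s/C_L = {s / C_L:.4f}")
```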
Remark 16.11. As we have seen in Lemma 13.2, we can bound the Lipschitz constant of ReLU
neural networks by restricting the magnitude and number of their weights and the number of
layers.
There has been some criticism of results of this form, see, e.g., [124], since an assumption on
the Lipschitz constant may potentially restrict the capabilities of the neural network too much. We
next present a result showing under which assumptions on the training set there exists a neural
network that classifies the training set correctly, but does not allow for adversarial examples within
the training set.
\sup_{i \ne j} \frac{|g(x_i) - g(x_j)|}{\|x_i - x_j\|} =: \widetilde{M} > 0.
Then there exists a ReLU neural network Φ with depth(Φ) = O(log(m)) and width(Φ) = O(dm)
such that for all i = 1, . . . , m
sign(\Phi(x_i)) = g(x_i)
and there is no adversarial example of perturbation δ = 1/\widetilde{M} to x_i.
Proof. The result follows directly from Theorem 9.6 and Proposition 16.10. The reader is invited
to complete the argument in Exercise 16.20.
bound given in Lemma 13.2 is
which grows exponentially with the depth of the neural network. However, in practice this bound
may be pessimistic, and locally the neural network might have significantly smaller gradients than
the global Lipschitz constant.
Because of this, it is reasonable to study results preventing adversarial examples under local
Lipschitz bounds. Such a result together with an algorithm providing bounds on the local Lipschitz
constant was proposed in [113]. We state the theorem adapted to our set-up.
Theorem 16.13. Let h : Rd → {−1, 1} be a classifier of the form h(x) = sign(Φ(x)) and let
g : Rd → {−1, 0, 1} be the ground-truth classifier. Let x ∈ Rd satisfy g(x) ̸= 0, and set
\alpha := \max_{R>0}\, \min\bigg\{ \Phi(x)g(x) \Big/ \sup_{\substack{\|y-x\|_\infty \le R \\ y \ne x}} \frac{|\Phi(y)-\Phi(x)|}{\|x-y\|_\infty},\; R \bigg\},   (16.5.2)
where the minimum is understood to be R in case the supremum is zero. Then there are no adver-
sarial examples to x with perturbation δ < α.
Proof. Let x ∈ Rd be as in the statement of the theorem. Assume, towards a contradiction, that
for 0 < δ < α satisfying (16.5.2), there exists an adversarial example x′ to x with perturbation δ.
If the supremum in (16.5.2) is zero, then Φ is constant on a ball of radius R around x. In
particular for ∥x′ − x∥ ≤ δ < R it holds that h(x′ ) = h(x) and x′ cannot be an adversarial
example.
Now assume the supremum in (16.5.2) is not zero. It holds by (16.5.2), for δ < R, that
\delta < \Phi(x)g(x) \Big/ \sup_{\substack{\|y-x\|_\infty \le R \\ y \ne x}} \frac{|\Phi(y)-\Phi(x)|}{\|x-y\|_\infty}.   (16.5.3)
Moreover,
|\Phi(x') - \Phi(x)| \le \sup_{\substack{\|y-x\|_\infty \le R \\ y \ne x}} \frac{|\Phi(y)-\Phi(x)|}{\|x-y\|_\infty}\, \|x - x'\|_\infty \le \sup_{\substack{\|y-x\|_\infty \le R \\ y \ne x}} \frac{|\Phi(y)-\Phi(x)|}{\|x-y\|_\infty}\, \delta < \Phi(x)g(x).
Since x′ is an adversarial example, h(x) = g(x), so that Φ(x)g(x) = |Φ(x)|. Hence |Φ(x′) − Φ(x)| < |Φ(x)|,
which implies that Φ(x′) has the same sign as Φ(x). Thus h(x′) = h(x), contradicting the assumption
that x′ is an adversarial example to x.
The supremum in (16.5.2) is bounded by the Lipschitz constant of Φ on BR (x). Thus Theorem
16.13 depends only on the local Lipschitz constant of Φ. One obvious criticism of this result is
that the computation of (16.5.2) is potentially prohibitive. We next show a different result, for
which the assumptions can immediately be checked by applying a simple algorithm that we present
subsequently.
To state the following proposition, for a continuous function Φ : R^d → R, a point x ∈ R^d, and δ > 0 we define
z^{\delta,\max}(x) := \max_{\|y-x\|_\infty \le \delta} \Phi(y) \qquad\text{and}\qquad z^{\delta,\min}(x) := \min_{\|y-x\|_\infty \le \delta} \Phi(y).
Proposition 16.14. Let h : R^d → {−1, 1} be a classifier of the form h(x) = sign(Φ(x)) and
g : R^d → {−1, 0, 1}, and let x be such that h(x) = g(x). Then x does not have an adversarial example
of perturbation δ if z^{δ,max}(x) · z^{δ,min}(x) > 0.
Proof. The proof is immediate, since z^{δ,max}(x) · z^{δ,min}(x) > 0 implies that all points in a δ-neighborhood
of x are classified the same.
To apply Proposition 16.14, we only have to compute z^{δ,max} and z^{δ,min}. It turns out that if Φ is a neural
network, then z^{δ,max}, z^{δ,min} can be approximated by a computation similar to a forward pass of
Φ. Denote by |A| the matrix obtained by taking the absolute value of each entry of the matrix A.
Additionally, we define
A_+ = (|A| + A)/2 and A_− = (|A| − A)/2.
The idea behind Algorithm 3 is common in the area of neural network verification, see, e.g.,
[86, 81, 10, 283].
Remark 16.15. Up to constants, Algorithm 3 has the same computational complexity as a forward
pass, see also Algorithm 1. In addition, in contrast to upper bounds based on estimating the global
Lipschitz constant of Φ via its weights, the upper bounds found via Algorithm 3 include the effect of
the activation function σ. For example, if σ is the ReLU, then we may often end up in a situation
where δ^{(ℓ),up} or δ^{(ℓ),low} have many entries that are 0. If an entry of W^{(ℓ)}x^{(ℓ)} + b^{(ℓ)} is nonpositive,
then it is guaranteed that the associated entry in δ^{(ℓ+1),low} will be zero. Similarly, if W^{(ℓ)} has only
few positive entries, then most of the entries of δ^{(ℓ),up} are not propagated to δ^{(ℓ+1),up}.
Next, we prove that Algorithm 3 indeed produces sensible output.
Proposition 16.16. Let Φ be a neural network with weight matrices W (ℓ) ∈ Rdℓ+1 ×dℓ and bias
vectors b(ℓ) ∈ Rdℓ+1 for ℓ = 0, . . . , L, and a monotonically increasing activation function σ.
Let x ∈ R^{d_0}. Then the output of Algorithm 3 satisfies x^{(L+1)} = Φ(x) as well as
x^{(L+1)} + δ^{(L+1),up} ≥ z^{δ,max}(x) and x^{(L+1)} − δ^{(L+1),low} ≤ z^{δ,min}(x).
Algorithm 3 Compute Φ(x), z δ,max and z δ,min for a given neural network
Input: weight matrices W (ℓ) ∈ Rdℓ+1 ×dℓ and bias vectors b(ℓ) ∈ Rdℓ+1 for ℓ = 0, . . . , L with
dL+1 = 1, monotonous activation function σ, input vector x ∈ Rd0 , neighborhood size δ > 0
Output: Bounds for z δ,max and z δ,min
x(0) = x
δ (0),up = δ1 ∈ Rd0
δ (0),low = δ1 ∈ Rd0
for ℓ = 0, . . . , L − 1 do
x(ℓ+1) = σ(W (ℓ) x(ℓ) + b(ℓ) )
δ (ℓ+1),up = σ(W (ℓ) x(ℓ) + (W (ℓ) )+ δ (ℓ),up + (W (ℓ) )− δ (ℓ),low + b(ℓ) ) − x(ℓ+1)
δ (ℓ+1),low = x(ℓ+1) − σ(W (ℓ) x(ℓ) − (W (ℓ) )+ δ (ℓ),low − (W (ℓ) )− δ (ℓ),up + b(ℓ) )
end for
x(L+1) = W (L) x(L) + b(L)
δ (L+1),up = (W (L) )+ δ (L),up + (W (L) )− δ (L),low
δ (L+1),low = (W (L) )+ δ (L),low + (W (L) )− δ (L),up
return x(L+1) , x(L+1) + δ (L+1),up , x(L+1) − δ (L+1),low
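For concreteness, here is a short NumPy sketch of Algorithm 3, together with the check from Proposition 16.14. The network weights, the input point, and δ are illustrative placeholders.

```python
# A direct sketch of Algorithm 3 in NumPy; weights, input, and delta are placeholders.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def algorithm3(weights, biases, sigma, x, delta):
    """Return Phi(x) and the bounds x^(L+1) + delta_up, x^(L+1) - delta_low."""
    pos = lambda A: (np.abs(A) + A) / 2            # A_+
    neg = lambda A: (np.abs(A) - A) / 2            # A_-
    x_l = x
    d_up = delta * np.ones_like(x)
    d_low = delta * np.ones_like(x)
    for W, b in zip(weights[:-1], biases[:-1]):    # layers 0, ..., L-1 with activation
        pre = W @ x_l + b
        x_next = sigma(pre)
        d_up_next = sigma(pre + pos(W) @ d_up + neg(W) @ d_low) - x_next
        d_low_next = x_next - sigma(pre - pos(W) @ d_low - neg(W) @ d_up)
        x_l, d_up, d_low = x_next, d_up_next, d_low_next
    W, b = weights[-1], biases[-1]                 # output layer, no activation
    out = W @ x_l + b
    up = pos(W) @ d_up + neg(W) @ d_low
    low = pos(W) @ d_low + neg(W) @ d_up
    return out, out + up, out - low

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(16, 4)) / 4, rng.normal(size=(16, 16)) / 4, rng.normal(size=(1, 16)) / 4]
bs = [np.zeros(16), np.zeros(16), np.zeros(1)]
x = rng.normal(size=4)

phi_x, z_max_bound, z_min_bound = algorithm3(Ws, bs, relu, x, delta=0.05)
print("Phi(x) =", phi_x[0], " upper bound:", z_max_bound[0], " lower bound:", z_min_bound[0])
if z_max_bound[0] * z_min_bound[0] > 0:            # criterion of Proposition 16.14
    print("no adversarial example with perturbation 0.05 at x")
```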
Proof. Fix y, x ∈ Rd with ∥y − x∥∞ ≤ δ and let y (ℓ) , x(ℓ) for ℓ = 0, . . . , L + 1 be as in Algorithm
3 applied to y, x, respectively. Moreover, let δ ℓ,up , δ ℓ,low for ℓ = 0, . . . , L + 1 be as in Algorithm 3
applied to x. We will prove by induction over ℓ = 0, . . . , L + 1 that
x^{(\ell)} - \delta^{(\ell),\mathrm{low}} \le y^{(\ell)} \le x^{(\ell)} + \delta^{(\ell),\mathrm{up}},   (16.5.6)
where the inequalities are understood entry-wise for vectors. Since y was arbitrary this then proves
the result.
The case ℓ = 0 follows immediately from ∥y − x∥∞ ≤ δ. Assume now that the statement has been
shown for some ℓ < L. We have that
where we used the induction assumption in the last line. This shows the first estimate in (16.5.6).
Similarly,
where we used the induction assumption in the last line. This completes the proof of (16.5.6) for
all ℓ ≤ L.
The case ℓ = L + 1 follows by the same argument, but replacing σ by the identity.
Exercises
Exercise 16.17. Prove (16.3.1) by comparing the volume of the d-dimensional Euclidean unit ball
with the volume of the d-dimensional 1-ball of radius c for a given c > 0.
Exercise 16.18. Fix δ > 0. For a pair of classifiers h and g such that C1 ∪C−1 = ∅ in (16.2.2), there
trivially cannot exist any adversarial examples. Construct an example of h, g, D such that C1,
C−1 ̸= ∅, h is not a Bayes classifier, and g is such that no adversarial examples with a perturbation
δ exist.
Is this also possible if g −1 (0) = ∅?
Figure 16.2: Illustration of the two types of adversarial examples in Examples 16.6 and 16.7. In
panel A) the feature support Dx corresponds to the dashed line. We depict the two decision
boundaries DBh = {x | w⊤x = 0} of h(x) = sign(w⊤x) and DBg = {x | w̄⊤x = 0} of g(x) =
sign(w̄⊤x). Both h and g perfectly classify every data point in Dx. One data point x is shifted
outside of the support of the distribution in a way to change its label according to h. This creates
an adversarial example x′ . In panel B) the data distribution is globally supported. However, h
and g do not coincide. Thus the decision boundaries DBh and DBg do not coincide. Moving data
points across DBh can create adversarial examples, as depicted by x and x′ .
Appendix A
Probability theory
This appendix provides some basic notions and results in probability theory required in the main
text. It is intended as a revision for a reader already familiar with these concepts. For more details
and further proofs, we refer for example to the standard textbook [144].
Definition A.1. Let Ω be a set. A collection A ⊆ 2^Ω of subsets of Ω is called a sigma-algebra on Ω if
(i) Ω ∈ A,
(ii) A^c ∈ A whenever A ∈ A,
(iii) ∪_{i∈N} A_i ∈ A whenever A_i ∈ A for all i ∈ N.
For a sigma-algebra A on Ω, the tuple (Ω, A) is also referred to as a measurable space. For a
measurable space, a subset A ⊆ Ω is called measurable, if A ∈ A. Measurable sets are also called
events.
Another key system of subsets of Ω is that of a topology.
Definition A.2. A collection T ⊆ 2^Ω is called a topology on Ω if
(i) ∅ ∈ T and Ω ∈ T,
(ii) O_1 ∩ O_2 ∈ T whenever O_1, O_2 ∈ T,
(iii) ∪_{i∈I} O_i ∈ T whenever O_i ∈ T for all i ∈ I, for an arbitrary index set I.
If T is a topology on Ω, we call (Ω, T) a topological space, and a set O ⊆ Ω is called open if and
only if O ∈ T.
Remark A.3. The two notions differ in that a topology allows for unions of arbitrary (possibly un-
countably many) sets, but only for finite intersection, whereas a sigma-algebra allows for countable
unions and intersections.
Example A.4. Let d ∈ N and denote by Bε (x) = {y ∈ Rd | ∥y − x∥ < ε} the set of points
whose Euclidean distance to x is less than ε. Then for every A ⊆ Rd , the smallest topology on A
containing A ∩ Bε (x) for all ε > 0, x ∈ Rd , is called the Euclidean topology on A. ♢
If (Ω, T) is a topological space, then the Borel sigma-algebra refers to the smallest sigma-
algebra on Ω containing all open sets, i.e. all elements of T. Throughout this book, subsets of Rd
are always understood to be equipped with the Euclidean topology and the Borel sigma-algebra.
The Borel sigma-algebra on Rd is denoted by Bd .
We can now introduce measures.
Definition A.5. Let (Ω, A) be a measurable space. A mapping µ : A → [0, ∞] is called a measure
if it satisfies
(i) µ(∅) = 0,
(ii) for every sequence (Ai )i∈N ⊆ A such that Ai ∩ Aj = ∅ whenever i ̸= j, it holds
\mu\Big(\bigcup_{i\in\mathbb{N}} A_i\Big) = \sum_{i\in\mathbb{N}} \mu(A_i).
Example A.6. One can show that there exists a unique measure λ on (R^d, B_d), such that for all
sets of the type ×_{i=1}^d [a_i, b_i) with −∞ < a_i ≤ b_i < ∞ it holds
\lambda\big(\times_{i=1}^d [a_i, b_i)\big) = \prod_{i=1}^d (b_i - a_i).
This measure is called the Lebesgue measure. ♢
If µ is a measure on the measurable space (Ω, A), then the triplet (Ω, A, µ) is called a measure
space. In case µ is a probability measure, i.e. µ(Ω) = 1, it is called a probability space.
Let (Ω, A, µ) be a measure space. A subset N ⊆ Ω is called a null-set, if N is measurable and
µ(N ) = 0. Moreover, an equality or inequality is said to hold µ-almost everywhere or µ-almost
surely, if it is satisfied on the complement of a null-set. In case µ is clear from context, we simply
write “almost everywhere” or “almost surely” instead. Usually this refers to the Lebesgue measure.
Definition A.7. Let (Ω1 , A1 ) and (Ω2 , A2 ) be two measurable spaces. A function f : Ω1 → Ω2 is
called measurable if f^{−1}(A_2) ∈ A_1 for all A_2 ∈ A_2.
Remark A.8. We again point out the parallels to topological spaces: A function f : Ω1 → Ω2
between two topological spaces (Ω1 , T1 ) and (Ω2 , T2 ) is called continuous if f −1 (O2 ) ∈ T1 for all
O2 ∈ T2 .
Let Ω1 be a set and let (Ω2 , A2 ) be a measurable space. For X : Ω1 → Ω2 , we can ask for
the smallest sigma-algebra AX on Ω1 , such that X is measurable as a mapping from (Ω1 , AX ) to
(Ω2 , A2 ). Clearly, for every sigma-algebra A1 on Ω1 , X is measurable as a mapping from (Ω1 , A1 )
to (Ω2 , A2 ) if and only if every A ∈ AX belongs to A1 ; or in other words, AX is a sub sigma-algebra
of A1 . It is easy to check that AX is given through the following definition.
AX := {X −1 (A2 ) | A2 ∈ A2 } ⊆ 2Ω1
Definition A.10. The measure PX is called the distribution of X. If (Ω2 , A2 ) = (Rd , Bd ), and
there exists a function fX : R^d → R such that
P_X[A] = \int_A f_X(x)\, dx \quad \text{for all } A \in \mathcal{B}_d,
then fX is called the density of X.
Remark A.11. The term distribution is often used without specifying an underlying probability
space and random variable. In this case, “distribution” stands interchangeably for “probability
measure”. For example, µ is a distribution on Ω2 states that µ is a probability measure on the
measurable space (Ω2 , A2 ). In this case, there always exists a probability space (Ω1 , A1 , P) and a
random variable X : Ω1 → Ω2 such that PX = µ; namely (Ω1 , A1 , P) = (Ω2 , A2 , µ) and X(ω) = ω.
the expectation of X. Moreover, for k ∈ N we say that X has finite k-th moment if E[∥X∥k ] <
∞. Similarly, for a probability measure µ on Rd and k ∈ N, we say that µ has finite k-th moment
if
\int_{\mathbb{R}^d} \|x\|^k\, d\mu(x) < \infty.
Furthermore, the matrix
\int_{\Omega} (X(\omega) - E[X])(X(\omega) - E[X])^\top\, dP(\omega) \in \mathbb{R}^{d\times d}
is called the covariance matrix of X.
(i) converge almost surely to X, if
P\big[\big\{\omega \in \Omega \,\big|\, \lim_{j\to\infty} X_j(\omega) = X(\omega)\big\}\big] = 1,
(ii) converge in probability to X, if
\text{for all } \varepsilon > 0: \quad \lim_{j\to\infty} P\big[\{\omega \in \Omega \,|\, |X_j(\omega) - X(\omega)| > \varepsilon\}\big] = 0,
The notions in Definition A.13 are ordered by decreasing strength, i.e. almost sure convergence
implies convergence in probability, and convergence in probability implies convergence in
distribution, see for example [144, Chapter 13]. Since E[f ◦ X] = \int_{\mathbb{R}^d} f(x)\, dP_X(x), the notion of
convergence in distribution only depends on the distribution PX of X. We thus also say that a
sequence of random variables converges in distribution towards a measure µ.
and similarly for PY . Thus the marginals PX , PY , can be constructed from the joint distribution
PZ . In turn, knowledge of the marginals is not sufficient to construct the joint distribution.
A.3.2 Independence
The concept of independence serves to formalize the situation, where knowledge of one random
variable provides no information about another random variable. We first give the formal definition,
and afterwards discuss the roll of a die as a simple example.
Definition A.14. Let (Ω, A, P) be a probability space. Then two events A, B ∈ A are called
independent if
P[A ∩ B] = P[A]P[B].
Two random variables X : Ω → R^{dX} and Y : Ω → R^{dY} are called independent, if
P[X ∈ A, Y ∈ B] = P[X ∈ A] · P[Y ∈ B] for all A ∈ B_{dX}, B ∈ B_{dY}.
Two random variables are thus independent, if and only if all events in their induced sigma-
algebras are independent. This turns out to be equivalent to the joint distribution P(X,Y ) being
equal to the product measure PX ⊗ PY ; the latter is characterized as the unique measure µ on
R^{dX+dY} satisfying µ(A × B) = PX[A] PY[B] for all A ∈ B_{dX}, B ∈ B_{dY}.
Example A.15. Let Ω = {1, . . . , 6} represent the outcomes of rolling a fair die, let A = 2Ω be the
sigma-algebra, and let P[ω] = 1/6 for all ω ∈ Ω. Consider the three random variables
X_1(\omega) = \begin{cases} 0 & \text{if } \omega \text{ is odd}\\ 1 & \text{if } \omega \text{ is even} \end{cases} \qquad X_2(\omega) = \begin{cases} 0 & \text{if } \omega \le 3\\ 1 & \text{if } \omega \ge 4 \end{cases} \qquad X_3(\omega) = \begin{cases} 0 & \text{if } \omega \in \{1, 2\}\\ 1 & \text{if } \omega \in \{3, 4\}\\ 2 & \text{if } \omega \in \{5, 6\}. \end{cases}
= \int_{\mathbb{R}^2} xy\, dP_{(X,Y)}(x, y) = \int_{\mathbb{R}} x\, dP_X(x) \int_{\mathbb{R}} y\, dP_Y(y) = E[X]\,E[Y].
Using this observation, it is easy to see that for a sequence of independent R-valued random variables
(X_i)_{i=1}^n with bounded second moments, there holds Bienaymé’s identity
V\Big[\sum_{i=1}^n X_i\Big] = \sum_{i=1}^n V[X_i].   (A.3.1)
Definition A.18 (regular conditional distribution). Let (Ω, A, P) be a probability space, and let
X : Ω → RdX and Y : Ω → RdY be two random variables. Let τX|Y : BdX × RdY → [0, 1] satisfy
(i) y 7→ τX|Y (A, y) : RdY → [0, 1] is measurable for every fixed A ∈ BdX ,
(ii) A 7→ τX|Y (A, y) is a probability measure on (RdX , BdX ) for every y ∈ Y (Ω),
(iii) for all A ∈ B_{dX} and all B ∈ B_{dY} holds
P[X ∈ A, Y ∈ B] = \int_B \tau_{X|Y}(A, y)\, dP_Y(y).
Then τ_{X|Y} is called a regular version of the conditional distribution of X given Y.
Definition A.18 provides a mathematically rigorous way of assigning a distribution to a random
variable conditioned on an event that may have probability zero, as in Example A.17. Existence
and uniqueness of these conditional distributions hold in the following sense, see for example [144,
Chapter 8] or [238, Chapter 3] for the specific statement given here.
Theorem A.19. Let (Ω, A, P) be a probability space, and let X : Ω → RdX , Y : Ω → RdY be two
random variables. Then there exists a regular version of the conditional distribution τ1 .
Let τ2 be another regular version of the conditional distribution. Then there exists a PY -null
set N ⊆ RdY , such that for all y ∈ N c ∩ Y (Ω), the two probability measures τ1 (·, y) and τ2 (·, y)
coincide.
Definition A.20. Let (Ω, A, P) be a probability space, and let X : Ω → RdX , Y : Ω → RdY ,
Z : Ω → RdZ be three random variables. We say that X and Z are conditionally independent
given Y , if the two distributions X|Y = y and Z|Y = y are independent for PY -almost every
y ∈ Y (Ω).
This is for example the case if there exists C < ∞ such that V[Xi ] ≤ C for all i ∈ N. Concentration
inequalities provide bounds on the rate of this convergence.
We start with Markov’s inequality.
Lemma A.21 (Markov’s inequality). Let X : Ω → R be a random variable, and let φ : [0, ∞) →
[0, ∞) be monotonically increasing. Then for all ε > 0
P[|X| \ge \varepsilon] \le \frac{E[\varphi(|X|)]}{\varphi(\varepsilon)}.
Proof. We have
P[|X| \ge \varepsilon] = \int_{X^{-1}([\varepsilon,\infty))} 1\, dP(\omega) \le \int_\Omega \frac{\varphi(|X(\omega)|)}{\varphi(\varepsilon)}\, dP(\omega) = \frac{E[\varphi(|X|)]}{\varphi(\varepsilon)},
where we used that φ is monotonically increasing.
Applying Markov’s inequality with φ(x) := x2 to the random variable X − E[X] directly gives
Chebyshev’s inequality.
Lemma A.22 (Chebyshev’s inequality). Let X : Ω → R be a random variable with finite variance.
Then for all ε > 0
P[|X - E[X]| \ge \varepsilon] \le \frac{V[X]}{\varepsilon^2}.
From Chebyshev’s inequality we obtain the next result, which is a quite general concentration
inequality for random variables with finite variances.
Theorem A.23. Let X1 , . . . , Xn be n ∈ N independent real-valued random variables such that for
some ς > 0 holds E[|Xi − µ|2 ] ≤ ς 2 for all i = 1, . . . , n. Denote
\mu := E\Big[\frac{1}{n}\sum_{j=1}^n X_j\Big].   (A.4.2)
Then for every ε > 0 it holds
P\Big[\Big|\frac{1}{n}\sum_{j=1}^n X_j - \mu\Big| \ge \varepsilon\Big] \le \frac{\varsigma^2}{n\varepsilon^2}.
Proof. Let S_n := \frac{1}{n}\sum_{j=1}^n (X_j - E[X_j]) = \frac{1}{n}\sum_{j=1}^n X_j - \mu. By Bienaymé’s identity (A.3.1), it holds
that
V[S_n] = \frac{1}{n^2}\sum_{j=1}^n E[(X_j - E[X_j])^2] \le \frac{\varsigma^2}{n}.
The statement now follows from Chebyshev’s inequality (Lemma A.22).
If we have additional information about the random variables, then we can derive sharper
bounds. In case of uniformly bounded random variables (rather than just bounded variance),
Hoeffding’s inequality, which we recall next, shows an exponential rate of concentration around the
mean.
Theorem A.24 (Hoeffding’s inequality). Let a, b ∈ R. Let X1 , . . . , Xn be n ∈ N independent
real-valued random variables such that a ≤ Xi ≤ b almost surely for all i = 1, . . . , n, and let µ be
as in (A.4.2). Then, for every ε > 0
P\Big[\Big|\frac{1}{n}\sum_{j=1}^n X_j - \mu\Big| > \varepsilon\Big] \le 2\, e^{-\frac{2n\varepsilon^2}{(b-a)^2}}.
A proof can, for example, be found in [250, Section B.4], where this version is also taken from.
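A quick Monte Carlo illustration of the inequality, as a hedged sketch: the distribution (uniform on [0, 1], so that a = 0, b = 1 and µ = 1/2), the sample size n, and ε below are arbitrary choices.

```python
# A quick Monte Carlo check of Hoeffding's inequality for uniform random variables.
# The distribution, n, and epsilon are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 100, 0.1, 100_000
X = rng.uniform(0.0, 1.0, size=(trials, n))      # a = 0, b = 1, mean mu = 1/2
deviation = np.abs(X.mean(axis=1) - 0.5)
empirical = np.mean(deviation > eps)
bound = 2 * np.exp(-2 * n * eps**2 / (1 - 0)**2)
print(f"empirical probability = {empirical:.4f} <= Hoeffding bound {bound:.4f}")
```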
Finally, we recall the central limit theorem, in its multivariate formulation. We say that (Xj )j∈N
is an i.i.d. sequence of random variables, if the random variables are (pairwise) independent
and identically distributed. For a proof see [144, Theorem 15.58].
Theorem A.25 (Multivariate central limit theorem). Let (X n )n∈N be an i.i.d. sequence of Rd -
valued random variables, such that E[X n ] = 0 ∈ Rd and E[Xn,i Xn,j ] = Cij for all i, j = 1, . . . , d.
Let
Y_n := \frac{X_1 + \cdots + X_n}{\sqrt{n}} \in \mathbb{R}^d.
Then Y n converges in distribution to N(0, C) as n → ∞.
Appendix B
This appendix provides some basic notions and results in linear algebra and functional analysis
required in the main text. It is intended as a revision for a reader already familiar with these
concepts. For more details and further proofs, we refer for example to the standard textbooks
[29, 232, 233, 55, 99].
Theorem B.1 (Singular value decomposition). Let A ∈ R^{m×n}. Then there exist orthogonal ma-
trices U ∈ R^{m×m}, V ∈ R^{n×n} such that with
\Sigma := \begin{pmatrix} s_1 & & & \\ & \ddots & & 0 \\ & & s_r & \\ & 0 & & 0 \end{pmatrix} \in \mathbb{R}^{m\times n}
and s_1 ≥ · · · ≥ s_r > 0, it holds A = U Σ V^⊤.
Aw = y. (B.1.1)
If A is not a regular square matrix, then in general there need not be a unique solution w ∈ Rn to
(B.1.1). However, there exists a unique minimal norm solution
254
The minimal norm solution can be expressed via the Moore-Penrose pseudoinverse A† ∈ Rn×m
of A; given an (arbitrary) SVD A = U ΣV ⊤ , it is defined as
−1
s1
..
A† := V Σ † U ⊤ where Σ † :=
. 0
∈ Rn×m . (B.1.3)
−1
sr
0 0
The following theorem makes this precise, e.g., [29, Theorem 1.2.10].
Theorem B.2. Let A ∈ Rm×n . Then there exists a unique minimum norm solution w∗ ∈ Rn in
(B.1.2) and it holds w∗ = A† y.
Proof. Denote by Σ_r ∈ R^{r×r} the upper left quadrant of Σ. Since U ∈ R^{m×m} is orthogonal,
\|Aw - y\| = \Big\| \begin{pmatrix} \Sigma_r & 0 \\ 0 & 0 \end{pmatrix} V^\top w - U^\top y \Big\|.
We can thus write M in (B.1.2) as
M = \Big\{ w \in \mathbb{R}^n \,\Big|\, \big(\Sigma_r \ \ 0\big) V^\top w = (U^\top y)_{i=1}^r \Big\}
= \Big\{ w \in \mathbb{R}^n \,\Big|\, (V^\top w)_{i=1}^r = \Sigma_r^{-1} (U^\top y)_{i=1}^r \Big\}
= \Big\{ V z \,\Big|\, z \in \mathbb{R}^n,\ (z)_{i=1}^r = \Sigma_r^{-1} (U^\top y)_{i=1}^r \Big\},
where (a)_{i=1}^r denotes the first r entries of a vector a, and for the last equality we used orthogonality
of V ∈ R^{n×n}. Since ∥V z∥ = ∥z∥, the unique minimal norm solution is obtained by setting
components r + 1, . . . , n of z to zero, which yields
w_* = V \begin{pmatrix} \Sigma_r^{-1} (U^\top y)_{i=1}^r \\ 0 \end{pmatrix} = V \Sigma^\dagger U^\top y = A^\dagger y
as claimed.
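A small numerical illustration of this statement (the matrix and right-hand side are arbitrary): perturbing A†y by any element of the kernel of A leaves the residual unchanged but increases the Euclidean norm.

```python
# A small numerical check of Theorem B.2 with NumPy: for an underdetermined system,
# the pseudoinverse solution has the smallest Euclidean norm among all least squares
# solutions. The matrix and right-hand side are arbitrary examples.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 6))                  # m = 3 < n = 6, generically rank 3
y = rng.normal(size=3)

w_star = np.linalg.pinv(A) @ y               # minimal norm solution A^dagger y
print("residual:", np.linalg.norm(A @ w_star - y))          # ~ 0, system is solvable

# any other solution w_star + v with v in the null space of A has a larger norm
null_basis = np.linalg.svd(A)[2][3:]         # rows spanning ker(A)
v = null_basis.T @ rng.normal(size=3)
w_other = w_star + v
print("||w*|| =", np.linalg.norm(w_star), "<=", np.linalg.norm(w_other), "= ||w* + v||")
```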
Definition B.3. Let K ∈ {R, C}. A vector space (over K) is a set X such that the following
holds:
(i) Properties of addition: For every x, y ∈ X there exists x + y ∈ X such that for all z ∈ X
x + y = y + x and (x + y) + z = x + (y + z).
Moreover, there exists a unique element 0 ∈ X such that x + 0 = x for all x ∈ X and for each
x ∈ X there exists a unique −x ∈ X such that x + (−x) = 0.
(ii) Properties of scalar multiplication: There exists a map (α, x) 7→ αx from K × X to X called
scalar multiplication. It satisfies 1x = x and (αβ)x = α(βx) for all x ∈ X.
If the field is clear from context, we simply refer to X as a vector space. We will primarily consider
the case K = R, and in this case we also say that X is a real vector space.
To introduce a notion of convergence on a vector space X, it needs to be equipped with a
topology, see Definition A.2. A topological vector space is a vector space which is also a
topological space, and in which addition and scalar multiplication are continuous maps. We next
discuss the most important instances of topological vector spaces.
In a metric space (X, dX), we denote the open ball with center x and radius r > 0 by
B_r(x) := \{y \in X \,|\, d_X(x, y) < r\}.
Every metric space is naturally equipped with a topology: A set A ⊆ X is open if and only if for
every x ∈ A there exists ε > 0 such that Bε(x) ⊆ A. Therefore every metric vector space is a topological
vector space.
Definition B.5. A metric space (X, dX ) is called complete, if every Cauchy sequence with respect
to d converges to an element in X.
For complete metric spaces, an immensely powerful tool is Baire’s category theorem. To state
it, we require the notion of density of sets. Let A, B ⊆ X for a topological space X. Then A is
dense in B if the closure of A, denoted by Ā, satisfies Ā ⊇ B.
Theorem B.6 (Baire’s category theorem). Let X be a complete metric space. Then the intersection
of every countable collection of dense open subsets of X is dense in X.
Definition B.7. Let X be a vector space over a field K ∈ {R, C}. A map ∥ · ∥X : X → [0, ∞) is
called a norm if the following hold for all x, y ∈ X and all α ∈ K:
(i) ∥x∥X = 0 if and only if x = 0,
(ii) ∥αx∥X = |α| ∥x∥X,
(iii) ∥x + y∥X ≤ ∥x∥X + ∥y∥X.
We call (X, ∥ · ∥X ) a normed space and omit ∥ · ∥X from the notation if it is clear from the
context.
Every norm induces a metric dX and hence a topology via dX (x, y) := ∥x − y∥X . In particular,
every normed vector space is a topological vector space with respect to this topology.
Definition B.8. A normed vector space is called a Banach space if and only if it is complete.
Before presenting the main results on Banach spaces, we collect a couple of important examples.
• Euclidean spaces: Let d ∈ N. Then (Rd , ∥ · ∥) is a Banach space.
• Continuous functions: Let d ∈ N and let K ⊆ Rd be compact. The set of continuous functions
from K to R is denoted by C(K). For α, β ∈ R and f , g ∈ C(K), we define addition and
scalar multiplication by (αf + βg)(x) = αf (x) + βg(x) for all x ∈ K. The vector space C(K)
equipped with the supremum norm
∥f ∥∞ := sup |f (x)|,
x∈K
is a Banach space.
• Lebesgue spaces: Let (Ω, A, µ) be a measure space and let 1 ≤ p < ∞. Then the Lebesgue
space Lp (Ω, µ) is defined as the vector space of all equivalence classes of measurable functions
f : Ω → R that coincide µ-almost everywhere and satisfy
\|f\|_{L^p(\Omega,\mu)} := \Big( \int_\Omega |f(x)|^p\, d\mu(x) \Big)^{1/p} < \infty.   (B.2.2)
Definition B.9. Let (X, ∥ · ∥X ) be a normed vector space over K ∈ {R, C}. Linear maps from
X → K are called linear functionals. The vector space of all continuous linear functionals on X
is called the (topological) dual space of X and is denoted by X ′ .
Together with the natural addition and scalar multiplication (for all h, g ∈ X ′ , α ∈ K and
x ∈ X)
(h + g)(x) := h(x) + g(x) and (αh)(x) := α(h(x)),
X ′ is a vector space. We equip X ′ with the norm
\|f\|_{X'} := \sup_{x \in X,\ \|x\|_X = 1} |f(x)|.
The space (X ′ , ∥ · ∥X ′ ) is always a Banach space, even if (X, ∥ · ∥X ) is not complete [233, Theorem
4.1].
The dual space can often be used to characterize the original Banach space. One way in which
the dual space X ′ captures certain algebraic and geometric properties of the Banach space X is
through the Hahn-Banach theorem. In this book, we use one specific variant of this theorem and
its implication for the existence of dual bases, see for instance [233, Theorem 3.5].
Theorem B.10 (Geometric Hahn-Banach, subspace version). Let M be a subspace of a Banach
space X and let x0 ∈ X. If x0 is not in the closure of M , then there exists f ∈ X ′ such that
f (x0 ) = 1 and f (x) = 0 for every x ∈ M .
An immediate consequence of Theorem B.10 that will be used throughout this book is the
existence of a dual basis. Let X be a Banach space and let (xi )i∈N ⊆ X be such that for all i ∈ N
x_i \notin \overline{\mathrm{span}\{x_j \mid j \in \mathbb{N},\ j \ne i\}},
i.e. xi does not lie in the closure of the span of the remaining elements.
Then, for every i ∈ N, there exists fi ∈ X ′ such that fi (xj ) = 0 if i ̸= j and fi (xi ) = 1.
Definition B.11. Let X be a real vector space. A map ⟨·, ·⟩X : X × X → R is called an inner
product on X if the following hold for all x, y, z ∈ X and all α, β ∈ R:
(i) ⟨x, x⟩X ≥ 0, with ⟨x, x⟩X = 0 if and only if x = 0,
(ii) ⟨x, y⟩X = ⟨y, x⟩X,
(iii) ⟨αx + βy, z⟩X = α⟨x, z⟩X + β⟨y, z⟩X.
Example B.12. For p = 2, the Lebesgue spaces L2 (Ω) and ℓ2 (N) are Hilbert spaces with inner
products
\langle f, g\rangle_{L^2(\Omega)} = \int_\Omega f(x)\, g(x)\, dx \quad \text{for all } f, g \in L^2(\Omega),
and
\langle x, y\rangle_{\ell^2(\mathbb{N})} = \sum_{j\in\mathbb{N}} x_j y_j \quad \text{for all } x = (x_j)_{j\in\mathbb{N}},\ y = (y_j)_{j\in\mathbb{N}} \in \ell^2(\mathbb{N}).
Theorem B.13 (Cauchy-Schwarz inequality). Let X be a vector space with inner product ⟨·, ·⟩X .
Then it holds for all x, y ∈ X
|\langle x, y\rangle_X| \le \sqrt{\langle x, x\rangle_X\, \langle y, y\rangle_X}.
Proof. Let x, y ∈ X. If y = 0 then ⟨x, y⟩X = 0 and thus the statement is trivial. Assume in the
following y ̸= 0, so that ⟨y, y⟩X > 0. Using the linearity and symmetry properties it holds for all
α∈R
0 ≤ ⟨x − αy, x − αy⟩X = ⟨x, x⟩X − 2α ⟨x, y⟩X + α2 ⟨y, y⟩X .
Letting α := ⟨x, y⟩X / ⟨y, y⟩X we get
0 \le \langle x, x\rangle_X - \frac{\langle x, y\rangle_X^2}{\langle y, y\rangle_X},
which rearranges to the claim.
The properties of the inner product immediately yield the polar identity
\|x + y\|_X^2 = \|x\|_X^2 + 2\langle x, y\rangle_X + \|y\|_X^2,   (B.2.4)
where ∥x∥X := \sqrt{\langle x, x\rangle_X} denotes the induced norm from (B.2.3).
The fact that (B.2.3) indeed defines a norm follows by an application of the Cauchy-Schwarz
inequality to (B.2.4), which yields that ∥ · ∥X satisfies the triangle inequality. This gives rise to the
definition of a Hilbert space.
Definition B.14. Let H be a real vector space with inner product ⟨·, ·⟩H . Then (H, ⟨·, ·⟩H ) is
called a Hilbert space if and only if H is complete with respect to the norm ∥ · ∥H induced by
the inner product.
Definition B.15. Let (H, ⟨·, ·⟩H ) be a Hilbert space and let f , g ∈ H. We say that f and g are
orthogonal if ⟨f, g⟩H = 0, denoted by f ⊥ g. For F , G ⊆ H we write F ⊥ G if f ⊥ g for all
f ∈ F , g ∈ G. Finally, for F ⊆ H, the set F ⊥ = {g ∈ H | g ⊥ f ∀f ∈ F } is called the orthogonal
complement of F in H.
For orthogonal vectors, the polar identity immediately implies the Pythagorean theorem.
Theorem B.16 (Pythagorean theorem). Let (H, ⟨·, ·⟩H ) be a Hilbert space, n ∈ N, and let
f1 , . . . , fn ∈ H be pairwise orthogonal vectors. Then,
\Big\| \sum_{i=1}^n f_i \Big\|_H^2 = \sum_{i=1}^n \|f_i\|_H^2.
A final property of Hilbert spaces that we encounter in this book is the existence of unique
projections onto convex sets. For a proof, see for instance [232, Thm. 4.10].
Theorem B.17. Let (H, ⟨·, ·⟩H ) be a Hilbert space and let K ̸= ∅ be a closed convex subset of H.
Then for every h ∈ H there exists a unique k0 ∈ K such that
\|h - k_0\|_H = \min_{k \in K} \|h - k\|_H.
Bibliography
[11] S. Barocas, M. Hardt, and A. Narayanan. Fairness and Machine Learning. fairmlbook.org,
2019. http://www.fairmlbook.org.
[12] A. R. Barron. Neural net approximation. In Proc. 7th Yale workshop on adaptive and learning
systems, volume 1, pages 69–72, 1992.
[14] A. R. Barron and J. M. Klusowski. Approximation and estimation for high-dimensional deep
learning networks. arXiv preprint arXiv:1809.03090, 2018.
[15] P. Bartlett. For valid generalization the size of the weights is more important than the size
of the network. Advances in neural information processing systems, 9, 1996.
[16] D. Beaglehole, M. Belkin, and P. Pandit. On the inconsistency of kernel ridgeless regression
in fixed dimensions. SIAM J. Math. Data Sci., 5(4):854–872, 2023.
[18] M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning practice
and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences,
116(32):15849–15854, 2019.
[19] M. Belkin, S. Ma, and S. Mandal. To understand deep learning we need to understand kernel
learning. In International Conference on Machine Learning, pages 541–549. PMLR, 2018.
[20] R. Bellman. On the theory of dynamic programming. Proceedings of the national Academy
of Sciences, 38(8):716–719, 1952.
[21] A. Ben-Israel and A. Charnes. Contributions to the theory of generalized inverses. Journal
of the Society for Industrial and Applied Mathematics, 11(3):667–699, 1963.
[22] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent
is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[23] A. Berlinet and C. Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and
statistics. Kluwer Academic Publishers, Boston, MA, 2004. With a preface by Persi Diaconis.
[24] J. Berner, P. Grohs, and A. Jentzen. Analysis of the generalization error: Empirical risk
minimization over deep artificial neural networks overcomes the curse of dimensionality in
the numerical approximation of Black–Scholes partial differential equations. SIAM Journal
on Mathematics of Data Science, 2(3):631–657, 2020.
[25] J. Berner, P. Grohs, G. Kutyniok, and P. Petersen. The modern mathematics of deep learning,
2021.
[27] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming., volume 3 of Optimization
and neural computation series. Athena Scientific, 1996.
[28] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statis-
tics). Springer, 1 edition, 2007.
[29] Å. Björck. Numerical Methods for Least Squares Problems. Society for Industrial and Applied
Mathematics, 1996.
[30] H. Bolcskei, P. Grohs, G. Kutyniok, and P. Petersen. Optimal approximation with sparsely
connected deep neural networks. SIAM Journal on Mathematics of Data Science, 1(1):8–45,
2019.
[31] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin
classifiers. In D. Haussler, editor, Proceedings of the 5th Annual Workshop on Computational
Learning Theory (COLT’92), pages 144–152, Pittsburgh, PA, USA, July 1992. ACM Press.
[32] L. Bottou. Stochastic Gradient Descent Tricks, pages 421–436. Springer Berlin Heidelberg,
Berlin, Heidelberg, 2012.
[33] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine
learning. SIAM Review, 60(2):223–311, 2018.
[34] O. Bousquet and A. Elisseeff. Stability and generalization. The Journal of Machine Learning
Research, 2:499–526, 2002.
[35] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, Cambridge,
2004.
[38] S. Brenner and R. Scott. The Mathematical Theory of Finite Element Methods. Texts in
Applied Mathematics. Springer New York, 2007.
[39] M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković. Geometric deep learning: Grids,
groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.
[41] S. Bubeck. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn.,
8:231–357, 2014.
[44] C. Carathéodory. Über den Variabilitätsbereich der Fourier’schen Konstanten von positiven
harmonischen Funktionen. Rendiconti del Circolo Matematico di Palermo (1884-1940),
32:193–217, 1911.
[45] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In 2017
IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
[46] S. M. Carroll and B. W. Dickinson. Construction of neural nets using the radon transform.
International 1989 Joint Conference on Neural Networks, pages 607–611 vol.1, 1989.
[48] M. Chen, H. Jiang, W. Liao, and T. Zhao. Efficient approximation of deep relu networks for
functions on low dimensional manifolds. Advances in neural information processing systems,
32, 2019.
[50] Y. Cho and L. Saul. Kernel methods for deep learning. In Y. Bengio, D. Schuurmans,
J. Lafferty, C. Williams, and A. Culotta, editors, Advances in Neural Information Processing
Systems, volume 22. Curran Associates, Inc., 2009.
[51] F. Chollet. Deep learning with Python. Simon and Schuster, 2021.
[52] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of
multilayer networks. In Artificial intelligence and statistics, pages 192–204. PMLR, 2015.
[53] C. K. Chui and H. N. Mhaskar. Deep nets for local manifold learning. Frontiers in Applied
Mathematics and Statistics, 4:12, 2018.
[54] P. Ciarlet. The Finite Element Method for Elliptic Problems. Studies in Mathematics and its
Applications. North Holland, 1978.
[56] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.
[57] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other
Kernel-based Learning Methods. Cambridge University Press, 1 edition, 2000.
[58] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1–49, 2002.
[60] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and
attacking the saddle point problem in high-dimensional non-convex optimization. Advances
in Neural Information Processing Systems, 27, 2014.
[63] T. De Ryck, S. Lanthaler, and S. Mishra. On the approximation of functions by tanh neural
networks. Neural Networks, 143:732–750, 2021.
[64] A. Défossez, L. Bottou, F. R. Bach, and N. Usunier. A simple convergence proof of Adam and Adagrad. Trans. Mach. Learn. Res., 2022, 2022.
[66] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale
hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern
Recognition, pages 248–255, 2009.
[70] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio. Sharp minima can generalize for deep nets.
In International Conference on Machine Learning, pages 1019–1028. PMLR, 2017.
[72] M. Du, F. Yang, N. Zou, and X. Hu. Fairness in deep learning: A computational perspective.
IEEE Intelligent Systems, 36(4):25–34, 2021.
[73] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep
neural networks. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th
International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning
Research, pages 1675–1685. PMLR, 09–15 Jun 2019.
[74] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[75] W. E and Q. Wang. Exponential convergence of the deep neural network approximation for
analytic functions. Sci. China Math., 61(10):1733–1740, 2018.
[76] K. Eckle and J. Schmidt-Hieber. A comparison of deep networks with ReLU activation function
and linear spline-type methods. Neural Networks, 110:232–242, 2019.
[77] D. Elbrächter, D. Perekrestenko, P. Grohs, and H. Bölcskei. Deep neural network approxi-
mation theory. IEEE Transactions on Information Theory, 67(5):2581–2623, 2021.
[78] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In V. Feldman,
A. Rakhlin, and O. Shamir, editors, 29th Annual Conference on Learning Theory, volume 49
of Proceedings of Machine Learning Research, pages 907–940, Columbia University, New York,
New York, USA, 23–26 Jun 2016. PMLR.
[79] H. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Mathematics and
Its Applications. Springer Netherlands, 2000.
[80] A. Ern and J. Guermond. Finite Elements I: Approximation and Interpolation. Texts in
Applied Mathematics. Springer International Publishing, 2021.
[82] C. L. Frenzen, T. Sasao, and J. T. Butler. On the number of segments needed in a piecewise
linear approximation. Journal of Computational and Applied Mathematics, 234(2):437–446,
2010.
[83] K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks.
Neural Networks, 2(3):183–192, 1989.
[85] G. Garrigos and R. M. Gower. Handbook of convergence theorems for (stochastic) gradient
methods, 2023.
[87] A. Géron. Hands-on machine learning with Scikit-Learn and TensorFlow : concepts, tools,
and techniques to build intelligent systems. O’Reilly Media, Sebastopol, CA, 2017.
[89] F. Girosi and T. Poggio. Networks and the best approximation property. Biological Cybernetics, 63(3):169–176, 1990.
[90] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural
networks. In Y. W. Teh and M. Titterington, editors, Proceedings of the Thirteenth In-
ternational Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of
Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May
2010. PMLR.
[92] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, fourth edition, 2013.
[93] L. Gonon and C. Schwab. Deep ReLU network expression rates for option prices in high-dimensional, exponential Lévy models. Finance and Stochastics, 25(4):615–657, 2021.
[94] I. J. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, Cambridge, MA,
USA, 2016. http://www.deeplearningbook.org.
[95] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples.
In International Conference on Learning Representations (ICLR), 2015.
[97] L.-A. Gottlieb, A. Kontorovich, and R. Krauthgamer. Efficient regression in metric spaces
via approximate Lipschitz extension. IEEE Transactions on Information Theory, 63(8):4838–
4849, 2017.
[98] R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtárik. SGD: General
analysis and improved rates. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of
the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine
Learning Research, pages 5200–5209. PMLR, 09–15 Jun 2019.
[99] K. Gröchenig. Foundations of time-frequency analysis. Springer Science & Business Media,
2013.
[100] P. Grohs and L. Herrmann. Deep neural network approximation for high-dimensional elliptic
PDEs with boundary conditions. IMA Journal of Numerical Analysis, 42(3):2055–2082, 2022.
[101] P. Grohs, F. Hornung, A. Jentzen, and P. Von Wurstemberger. A proof that artificial neural
networks overcome the curse of dimensionality in the numerical approximation of Black–
Scholes partial differential equations, volume 284. American Mathematical Society, 2023.
[102] P. Grohs, F. Hornung, A. Jentzen, and P. von Wurstemberger. A proof that artificial neural
networks overcome the curse of dimensionality in the numerical approximation of Black-
Scholes partial differential equations. Mem. Amer. Math. Soc., 284(1410):v+93, 2023.
[103] I. Gühring and M. Raslan. Approximation rates for neural networks with encodable weights
in smoothness spaces. Neural Networks, 134:107–130, 2021.
[104] B. Hanin and D. Rolnick. Complexity of linear regions in deep networks. In International
Conference on Machine Learning, pages 2596–2604. PMLR, 2019.
[105] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic
gradient descent. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd
International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning
Research, pages 1225–1234, New York, New York, USA, 20–22 Jun 2016. PMLR.
[108] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining,
inference and prediction. Springer, second edition, 2009.
[109] S. S. Haykin. Neural networks and learning machines. Pearson Education, Upper Saddle
River, NJ, third edition, 2009.
[110] J. He, L. Li, J. Xu, and C. Zheng. ReLU deep neural networks and linear finite elements. J.
Comput. Math., 38(3):502–527, 2020.
[111] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level
performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
[112] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–
778, 2016.
[113] M. Hein and M. Andriushchenko. Formal guarantees on the robustness of a classifier against
adversarial manipulation. Advances in Neural Information Processing Systems, 30, 2017.
[114] H. Heuser. Lehrbuch der Analysis. Teil 1. Vieweg + Teubner, Wiesbaden, revised edition,
2009.
[116] S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
[117] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–
1780, 1997.
[118] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal
problems. Technometrics, 12:55–67, 1970.
[120] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal
approximators. Neural Networks, 2(5):359–366, 1989.
[121] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected con-
volutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1(2):3, 2017.
[122] G.-B. Huang and H. A. Babri. Upper bounds on the number of hidden neurons in feedforward
networks with arbitrary bounded nonlinear activation functions. IEEE Transactions on Neural Networks, 9(1):224–229, 1998.
[124] T. Huster, C.-Y. J. Chiang, and R. Chadha. Limitations of the Lipschitz constant as a defense
against adversarial examples. In ECML PKDD 2018 Workshops: Nemesis 2018, UrbReas
2018, SoGood 2018, IWAISe 2018, and Green Data Mining 2018, Dublin, Ireland, September
10-14, 2018, Proceedings 18, pages 16–29. Springer, 2019.
[125] M. Hutzenthaler, A. Jentzen, T. Kruse, and T. A. Nguyen. A proof that rectified deep neural
networks overcome the curse of dimensionality in the numerical approximation of semilinear
heat equations. SN Partial Differential Equations and Applications, 1(2):10, 2020.
[126] J. Håstad. Computational limitations of small depth circuits. PhD thesis, Department of Mathematics, Massachusetts Institute of Technology, 1987.
[127] D. J. Im, M. Tao, and K. Branson. An empirical analysis of deep network loss surfaces. 2016.
[128] V. E. Ismailov. Ridge functions and applications in neural networks, volume 263. American
Mathematical Society, 2021.
[129] V. E. Ismailov. A three layer neural network can represent any multivariate function. Journal
of Mathematical Analysis and Applications, 523(1):127096, 2023.
[130] Y. Ito and K. Saito. Superposition of linearly independent functions and finite mappings by
neural networks. The Mathematical Scientist, 21(1):27, 1996.
[131] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization
in neural networks. Advances in Neural Information Processing Systems, 31, 2018.
[133] A. Jentzen and A. Riekert. On the existence of global minima and convergence analy-
ses for gradient descent methods in the training of deep neural networks. arXiv preprint
arXiv:2112.09684, 2021.
[134] A. Jentzen, D. Salimova, and T. Welti. A proof that deep artificial neural networks overcome
the curse of dimensionality in the numerical approximation of Kolmogorov partial differen-
tial equations with constant diffusion and nonlinear drift coefficients. Commun. Math. Sci.,
19(5):1167–1205, 2021.
[135] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvu-
nakool, R. Bates, A. Žídek, A. Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
[136] P. C. Kainen, V. Kůrková, and A. Vogt. Approximation by neural networks is not continuous.
Neurocomputing, 29(1-3):47–56, 1999.
[139] H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient
methods under the Polyak-Łojasiewicz condition. In P. Frasconi, N. Landwehr, G. Manco,
and J. Vreeken, editors, Machine Learning and Knowledge Discovery in Databases, pages
795–811, Cham, 2016. Springer International Publishing.
[140] C. Karner, V. Kazeev, and P. C. Petersen. Limitations of gradient descent due to numerical
instability of backpropagation. arXiv preprint arXiv:2210.00805, 2022.
[143] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd Interna-
tional Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings.
International Conference on Learning Representations, ICLR, 2015.
[145] M. Kohler, A. Krzyżak, and S. Langer. Estimation of a function of low local dimensionality
by deep neural networks. IEEE Transactions on Information Theory, 68(6):4032–4042, 2022.
[146] M. Kohler and S. Langer. On the rate of convergence of fully connected deep neural network
regression estimates. The Annals of Statistics, 49(4):2231–2249, 2021.
[148] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105,
2012.
[149] G. Kutyniok, P. Petersen, M. Raslan, and R. Schneider. A theoretical analysis of deep neural
networks and parametric PDEs. Constructive Approximation, 55(1):73–125, 2022.
[150] V. Kůrková. Kolmogorov’s theorem is relevant. Neural Computation, 3(4):617–622, 1991.
[151] V. Kůrková. Kolmogorov’s theorem and multilayer neural networks. Neural Networks,
5(3):501–506, 1992.
[152] F. Laakmann and P. Petersen. Efficient approximation of solutions of parametric linear
transport equations by ReLU DNNs. Advances in Computational Mathematics, 47(1):11, 2021.
[153] G. Lan. First-order and Stochastic Optimization Methods for Machine Learning. Springer
Series in the Data Sciences. Springer International Publishing, Cham, first edition, 2020.
[154] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
[155] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D.
Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation,
1(4):541–551, 1989.
[156] Y. LeCun, L. Bottou, G. Orr, and K. Müller. Efficient backprop. In Neural Networks: Tricks
of the Trade, Lecture Notes in Computer Science, chapter 2, page 546. Springer Berlin /
Heidelberg, 1998.
[157] J. Lee, J. Sohl-dickstein, J. Pennington, R. Novak, S. Schoenholz, and Y. Bahri. Deep neural
networks as Gaussian processes. In International Conference on Learning Representations,
2018.
[158] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide
neural networks of any depth evolve as linear models under gradient descent. In H. Wallach,
H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in
Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[159] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a
nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–
867, 1993.
[160] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via
integral quadratic constraints. SIAM J. Optim., 26(1):57–95, 2016.
[161] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the loss landscape of neural
nets. Advances in Neural Information Processing Systems, 31, 2018.
[162] W. Li. Generalization error of minimum weighted norm and kernel interpolation. SIAM
Journal on Mathematics of Data Science, 3(1):414–438, 2021.
[163] S. Liang and R. Srikant. Why deep neural networks for function approximation? In Proc. of
ICLR 2017, pages 1–17, 2017.
[164] T. Liang and A. Rakhlin. Just interpolate: kernel “ridgeless” regression can generalize. Ann.
Statist., 48(3):1329–1347, 2020.
[165] V. Lin and A. Pinkus. Fundamentality of ridge functions. Journal of Approximation Theory,
75(3):295–311, 1993.
[166] M. Longo, J. A. Opschoor, N. Disch, C. Schwab, and J. Zech. De Rham compatible deep neural network FEM. Neural Networks, 165:721–739, 2023.
[168] C. Ma, L. Wu, et al. A priori estimates of the population risk for two-layer neural networks.
arXiv preprint arXiv:1810.06397, 2018.
[169] S. Mahan, E. J. King, and A. Cloninger. Nonclosedness of sets of neural networks in Sobolev
spaces. Neural Networks, 137:85–96, 2021.
[170] V. Maiorov and A. Pinkus. Lower bounds for approximation by MLP neural networks. Neurocomputing, 25(1):81–91, 1999.
[171] Y. Marzouk, Z. Ren, S. Wang, and J. Zech. Distribution learning via neural differential
equations: a nonparametric statistical perspective. Journal of Machine Learning Research
(accepted), 2024.
[172] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity.
The bulletin of mathematical biophysics, 5:115–133, 1943.
[173] S. Mei and A. Montanari. The generalization error of random features regression: Precise
asymptotics and the double descent curve. Communications on Pure and Applied Mathemat-
ics, 75(4):667–766, 2022.
[175] H. N. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions.
Neural Computation, 8(1):164–177, 1996.
[178] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT Press,
2018.
[180] H. Montanelli and Q. Du. New error bounds for deep ReLU networks using sparse grids. SIAM
Journal on Mathematics of Data Science, 1(1):78–92, 2019.
[181] H. Montanelli and H. Yang. Error bounds for deep ReLU networks using the Kolmogorov–Arnold
superposition theorem. Neural Networks, 129:1–6, 2020.
[182] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep
neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger,
editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates,
Inc., 2014.
[183] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial pertur-
bations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 1765–1773, 2017.
[184] E. Moulines and F. Bach. Non-asymptotic analysis of stochastic approximation algorithms for
machine learning. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger,
editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates,
Inc., 2011.
[185] R. Nakada and M. Imaizumi. Adaptive approximation and generalization of deep neural
network with intrinsic dimensionality. Journal of Machine Learning Research, 21(174):1–38,
2020.
[186] R. M. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.
[187] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation ap-
proach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[189] Y. Nesterov. Lectures on convex optimization, volume 137 of Springer Optimization and Its
Applications. Springer, Cham, second edition, 2018.
[190] Y. E. Nesterov. A method for solving the convex programming problem with convergence
rate O(1/k^2). Dokl. Akad. Nauk SSSR, 269(3):543–547, 1983.
[191] B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks.
In Conference on learning theory, pages 1376–1401. PMLR, 2015.
[194] J. Nocedal and S. J. Wright. Numerical optimization. Springer Series in Operations Research
and Financial Engineering. Springer, New York, second edition, 2006.
[196] B. O’Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Found.
Comput. Math., 15(3):715–732, 2015.
[197] J. A. Opschoor. Constructive deep neural network approximations of weighted analytic so-
lutions to partial differential equations in polygons, 2023. Dissertation 29278, ETH Zürich,
https://doi.org/10.3929/ethz-b-000614671.
[198] J. A. Opschoor and C. Schwab. Deep ReLU networks and high-order finite element methods II: Chebyšev emulation. Computers & Mathematics with Applications, 169:142–162, 2024.
[199] J. A. A. Opschoor, P. C. Petersen, and C. Schwab. Deep ReLU networks and high-order finite
element methods. Analysis and Applications, 18(05):715–770, 2020.
[200] J. A. A. Opschoor, C. Schwab, and J. Zech. Exponential ReLU DNN expression of holomor-
phic maps in high dimension. Constructive Approximation, 2021.
[201] J. A. A. Opschoor, C. Schwab, and J. Zech. Deep learning in high dimension: ReLU neural
network expression for Bayesian PDE inversion. In Optimization and control for partial
differential equations—uncertainty quantification, open and closed-loop control, and shape
optimization, volume 29 of Radon Ser. Comput. Appl. Math., pages 419–462. De Gruyter,
Berlin, 2022.
[204] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-
box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference
on computer and communications security, pages 506–519, 2017.
[205] Y. C. Pati and P. S. Krishnaprasad. Analysis and synthesis of feedforward neural networks us-
ing discrete affine wavelet transformations. IEEE Transactions on Neural Networks, 4(1):73–
85, 1993.
[206] J. Pennington and Y. Bahri. Geometry of neural network loss surfaces via random matrix
theory. In International Conference on Machine Learning, pages 2798–2806. PMLR, 2017.
[207] P. Petersen, M. Raslan, and F. Voigtlaender. Topological properties of the set of func-
tions generated by neural networks of fixed size. Foundations of Computational Mathematics,
21:375–444, 2021.
[208] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using
deep ReLU neural networks. Neural Networks, 108:296–330, 2018.
[210] M. D. Petković and P. S. Stanimirović. Iterative method for computing the Moore–Penrose inverse based on Penrose equations. Journal of Computational and Applied Mathematics,
235(6):1604–1613, 2011.
[211] A. Pinkus. Approximation theory of the MLP model in neural networks. In Acta numerica,
1999, volume 8 of Acta Numer., pages 143–195. Cambridge Univ. Press, Cambridge, 1999.
[212] G. Pisier. Remarques sur un résultat non publié de B. Maurey. Séminaire Analyse fonction-
nelle (dit “Maurey-Schwartz”), 1980–1981.
[213] T. Poggio, A. Banburski, and Q. Liao. Theoretical issues in deep networks. Proceedings of
the National Academy of Sciences, 117(48):30039–30045, 2020.
[214] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao. Why and when can deep-but
not shallow-networks avoid the curse of dimensionality: a review. Int. J. Autom. Comput.,
14(5):503–519, 2017.
[215] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for predictivity in
learning theory. Nature, 428(6981):419–422, 2004.
[216] B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[219] N. Qian. On the momentum term in gradient descent learning algorithms. Neural Networks,
12(1):145–151, 1999.
[220] Q. Nguyen, M. C. Mukkamala, and M. Hein. On the loss landscape of a class of deep neural networks with no bad local valleys. In International Conference on Learning Representations (ICLR), 2018.
[222] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In J. Platt,
D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing
Systems, volume 20. Curran Associates, Inc., 2007.
[224] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly
convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, ICML’12, pages 1571–1578, Madison, WI, USA,
2012. Omnipress.
[225] C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. Adaptive
computation and machine learning. MIT Press, 2006.
[226] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, Q. Le, and A. Kurakin. Regularized
evolution for image classifier architecture search. Proceedings of the AAAI Conference on
Artificial Intelligence, 33:4780–4789, 2019.
[227] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In 6th
International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada,
April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.
[228] H. Robbins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical
Statistics, 22(3):400 – 407, 1951.
[229] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organiza-
tion in the brain. Psychological Review, 65(6):386–408, 1958.
[230] W. Ruan, X. Yi, and X. Huang. Adversarial robustness of deep learning: Theory, algorithms,
and applications. In Proceedings of the 30th ACM international conference on information
& knowledge management, pages 4866–4869, 2021.
[232] W. Rudin. Real and complex analysis. McGraw-Hill Book Co., New York, third edition, 1987.
[233] W. Rudin. Functional analysis. International Series in Pure and Applied Mathematics.
McGraw-Hill, Inc., New York, second edition, 1991.
[235] T. De Ryck and S. Mishra. Error analysis for deep neural network approximations of paramet-
ric hyperbolic conservation laws. Mathematics of Computation, 2023. Article electronically
published on December 15, 2023.
[236] I. Safran and O. Shamir. Depth separation in ReLU networks for approximating smooth non-linear functions. arXiv preprint arXiv:1610.09887, 2016.
[237] M. A. Sartori and P. J. Antsaklis. A simple method to derive bounds on the size and to train
multilayer neural networks. IEEE Transactions on Neural Networks, 2(4):467–471, 1991.
[238] R. Scheichl and J. Zech. Numerical methods for Bayesian inverse problems, 2021. Lecture
Notes.
[239] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117,
2015.
[241] J. Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4):1875–1897, 2020.
[242] J. Schmidt-Hieber. The Kolmogorov–Arnold representation theorem revisited. Neural Net-
works, 137:119–126, 2021.
[243] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings
of the Annual Conference on Learning Theory, 2001.
[244] B. Schölkopf and A. J. Smola. Learning with kernels : support vector machines, regularization,
optimization, and beyond. Adaptive computation and machine learning. MIT Press, 2002.
[245] L. Schumaker. Spline Functions: Basic Theory. Cambridge Mathematical Library. Cambridge
University Press, third edition, 2007.
[246] C. Schwab and J. Zech. Deep learning in high dimension: neural network expression rates for
generalized polynomial chaos expansions in UQ. Anal. Appl. (Singap.), 17(1):19–55, 2019.
[247] C. Schwab and J. Zech. Deep learning in high dimension: neural network expression rates for
analytic functions in L^2(R^d, γ_d). SIAM/ASA J. Uncertain. Quantif., 11(1):199–234, 2023.
[248] T. Serra, C. Tjandraatmadja, and S. Ramalingam. Bounding and counting linear regions of
deep neural networks, 2018.
[249] U. Shaham, A. Cloninger, and R. R. Coifman. Provable approximation properties for deep
neural networks. Applied and Computational Harmonic Analysis, 44(3):537–557, 2018.
[250] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning - From Theory to
Algorithms. Cambridge University Press, 2014.
[251] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Con-
vergence results and optimal averaging schemes. In S. Dasgupta and D. McAllester, editors,
Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceed-
ings of Machine Learning Research, pages 71–79, Atlanta, Georgia, USA, 17–19 Jun 2013.
PMLR.
[252] N. Z. Shor. Minimization Methods for Non-Differentiable Functions, volume 3 of Springer
Series in Computational Mathematics. Springer-Verlag, Berlin, Heidelberg, 1985.
[253] J. W. Siegel and J. Xu. High-order approximation rates for shallow neural networks with
cosine and ReLU^k activation functions. Applied and Computational Harmonic Analysis, 58:1–
26, 2022.
[254] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser,
I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep
neural networks and tree search. Nature, 529(7587):484–489, 2016.
[255] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. In ICLR, 2014.
[256] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient
descent on separable data. Journal of Machine Learning Research, 19(70):1–57, 2018.
[257] E. M. Stein. Singular integrals and differentiability properties of functions. Princeton Math-
ematical Series, No. 30. Princeton University Press, Princeton, N.J., 1970.
[258] I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.
[260] D. Stutz, M. Hein, and B. Schiele. Disentangling adversarial robustness and generalization.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 6976–6987, 2019.
[261] A. Sukharev. Optimal method of constructing best uniform approximations for functions of
a certain class. USSR Computational Mathematics and Mathematical Physics, 18(2):21–31,
1978.
[262] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and
momentum in deep learning. In S. Dasgupta and D. McAllester, editors, Proceedings of the
30th International Conference on Machine Learning, volume 28 of Proceedings of Machine
Learning Research, pages 1139–1147, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
[265] M. Tan and Q. V. Le. EfficientNet: Rethinking model scaling for convolutional neural net-
works. In Proceedings of the 36th International Conference on Machine Learning, pages
6105–6114, 2019.
[266] J. Tarela and M. Martínez. Region configurations for realizability of lattice piecewise-linear
models. Mathematical and Computer Modelling, 30(11):17–27, 1999.
[267] J. M. Tarela, E. Alonso, and M. V. Martínez. A representation method for PWL functions
oriented to parallel processing. Math. Comput. Modelling, 13(10):75–83, 1990.
[269] M. Telgarsky. Benefits of depth in neural networks. In V. Feldman, A. Rakhlin, and O. Shamir,
editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine
Learning Research, pages 1517–1539, Columbia University, New York, New York, USA, 23–26
Jun 2016. PMLR.
[270] M. Telgarsky. Neural networks and rational functions. In D. Precup and Y. W. Teh, edi-
tors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of
Proceedings of Machine Learning Research, pages 3387–3393. PMLR, 06–11 Aug 2017.
[271] M. Telgarsky. Deep learning theory lecture notes. https://mjt.cs.illinois.edu/dlt/,
2021. Version: 2021-10-27 v0.0-e7150f2d (alpha).
[272] T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running
average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.
[273] V. M. Tikhomirov. ε-entropy and ε-capacity of sets in functional spaces. Selected Works
of A. N. Kolmogorov: Volume III: Information Theory and the Theory of Algorithms, pages
86–170, 1993.
[279] L. Venturi, A. S. Bandeira, and J. Bruna. Spurious valleys in one-hidden-layer neural network
optimization landscapes. Journal of Machine Learning Research, 20:133, 2019.
[282] S. Wang and X. Sun. Generalization of hinging hyperplanes. IEEE Transactions on Infor-
mation Theory, 51(12):4425–4431, 2005.
[283] Z. Wang, A. Albarghouthi, G. Prakriya, and S. Jha. Interval universal approximation for
neural networks. Proceedings of the ACM on Programming Languages, 6(POPL):1–29, 2022.
[284] E. Weinan, C. Ma, and L. Wu. Barron spaces and the compositional function spaces for
neural network models. arXiv preprint arXiv:1906.08039, 2019.
[285] E. Weinan and S. Wojtowytsch. Representation formulas and pointwise properties for Barron
functions. Calculus of Variations and Partial Differential Equations, 61(2):46, 2022.
[286] S. Weissmann, A. Wilson, and J. Zech. Multilevel optimization for inverse problems. In P.-L.
Loh and M. Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory,
volume 178 of Proceedings of Machine Learning Research, pages 5489–5524. PMLR, 02–05
Jul 2022.
[288] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive
gradient methods in machine learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wal-
lach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information
Processing Systems, volume 30. Curran Associates, Inc., 2017.
[289] D. R. Wilson and T. R. Martinez. The general inefficiency of batch training for gradient
descent learning. Neural Netw., 16(10):1429–1451, Dec. 2003.
[290] C. Xiao, J.-Y. Zhu, B. Li, W. He, M. Liu, and D. Song. Spatially transformed adversarial
examples. arXiv preprint arXiv:1801.02612, 2018.
[291] H. Xu and S. Mannor. Robustness and generalization. Machine learning, 86:391–423, 2012.
[292] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw.,
94:103–114, 2017.
[293] D. Yarotsky and A. Zhevnerchuk. The phase diagram of approximation rates for deep neural
networks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors,
Advances in Neural Information Processing Systems, volume 33, pages 13005–13015. Curran
Associates, Inc., 2020.
[295] M. D. Zeiler. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701, 2012.
[296] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer. Scaling vision transformers. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–
12113, 2022.
[297] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning
requires rethinking generalization, 2016.