
arXiv:2407.18384v2 [cs.LG] 11 Oct 2024

Mathematical theory of deep learning

Philipp Petersen¹ and Jakob Zech²

¹ Universität Wien, Fakultät für Mathematik, 1090 Wien, Austria, [email protected]
² Universität Heidelberg, Interdisziplinäres Zentrum für Wissenschaftliches Rechnen, 69120 Heidelberg, Germany, [email protected]

October 14, 2024


Contents

1 Introduction 9
1.1 Mathematics of deep learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 High-level overview of deep learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Why does it work? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Outline and philosophy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5 Material not covered in this book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2 Feedforward neural networks 18


2.1 Formal definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Notion of size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Activation functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3 Universal approximation 25
3.1 A universal approximation theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Superexpressive activations and Kolmogorov’s superposition theorem . . . . . . . . 35

4 Splines 39
4.1 B-splines and smooth functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Reapproximation of B-splines with sigmoidal activations . . . . . . . . . . . . . . . . 40

5 ReLU neural networks 47


5.1 Basic ReLU calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2 Continuous piecewise linear functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3 Simplicial pieces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.4 Convergence rates for Hölder continuous functions . . . . . . . . . . . . . . . . . . . 64

6 Affine pieces for ReLU neural networks 68


6.1 Upper bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2 Tightness of upper bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.3 Depth separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.4 Number of pieces in practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

7 Deep ReLU neural networks 81


7.1 The square function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.2 Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.3 C k,s functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

8 High-dimensional approximation 92
8.1 The Barron class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.2 Functions with compositionality structure . . . . . . . . . . . . . . . . . . . . . . . . 97
8.3 Functions on manifolds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

9 Interpolation 106
9.1 Universal interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
9.2 Optimal interpolation and reconstruction . . . . . . . . . . . . . . . . . . . . . . . . 108

10 Training of neural networks 116


10.1 Gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
10.2 Stochastic gradient descent (SGD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
10.3 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
10.4 Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
10.5 Other methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

11 Wide neural networks and the neural tangent kernel 145


11.1 Linear least-squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
11.2 Kernel least-squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
11.3 Tangent kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
11.4 Convergence to global minimizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
11.5 Training dynamics for LeCun initialization . . . . . . . . . . . . . . . . . . . . . . . . 156
11.6 Normalized initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

12 Loss landscape analysis 171


12.1 Visualization of loss landscapes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
12.2 Spurious valleys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
12.3 Saddle points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

13 Shape of neural network spaces 181


13.1 Lipschitz parameterizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
13.2 Convexity of neural network spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
13.3 Closedness and best-approximation property . . . . . . . . . . . . . . . . . . . . . . . 187

14 Generalization properties of deep neural networks 194


14.1 Learning setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
14.2 Empirical risk minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
14.3 Generalization bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
14.4 Generalization bounds from covering numbers . . . . . . . . . . . . . . . . . . . . . . 199
14.5 Covering numbers of deep neural networks . . . . . . . . . . . . . . . . . . . . . . . . 201
14.6 The approximation-complexity trade-off . . . . . . . . . . . . . . . . . . . . . . . . . 203
14.7 PAC learning from VC dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
14.8 Lower bounds on achievable approximation rates . . . . . . . . . . . . . . . . . . . . 208

15 Generalization in the overparameterized regime 213
15.1 The double descent phenomenon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
15.2 Size of weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
15.3 Theoretical justification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
15.4 Double descent for neural network learning . . . . . . . . . . . . . . . . . . . . . . . 221

16 Robustness and adversarial examples 226


16.1 Adversarial examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
16.2 Bayes classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
16.3 Affine classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
16.4 ReLU neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
16.5 Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234

A Probability theory 241


A.1 Sigma-algebras, topologies, and measures . . . . . . . . . . . . . . . . . . . . . . . . 241
A.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
A.3 Conditionals, marginals, and independence . . . . . . . . . . . . . . . . . . . . . . . 245
A.4 Concentration inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

B Functional analysis 251


B.1 Vector spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
B.2 Fourier transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257

Preface

This book serves as an introduction to the key ideas in the mathematical analysis of deep learning.
It is designed to help students and researchers to quickly familiarize themselves with the area and to
provide a foundation for the development of university courses on the mathematics of deep learning.
Our main goal in the composition of this book was to present various rigorous, but easy to grasp,
results that help to build an understanding of fundamental mathematical concepts in deep learning.
To achieve this, we prioritize simplicity over generality.
As a mathematical introduction to deep learning, this book does not aim to give an exhaustive
survey of the entire (and rapidly growing) field, and some important research directions are missing.
In particular, we have favored mathematical results over empirical research, even though an accurate
account of the theory of deep learning requires both.
The book is intended for students and researchers in mathematics and related areas. While we
believe that every diligent researcher or student will be able to work through this manuscript, it
should be emphasized that a familiarity with analysis, linear algebra, probability theory, and basic
functional analysis is recommended for an optimal reading experience. To assist readers, a review
of key concepts in probability theory and functional analysis is provided in the appendix.
The material is structured around the three main pillars of deep learning theory: Approximation
theory, Optimization theory, and Statistical Learning theory. Chapter 1 provides an overview and
outlines key questions for understanding deep learning. Chapters 2 to 9 explore results in approximation
theory, Chapters 10 to 13 discuss optimization theory for deep learning, and the remaining Chapters
14 to 16 address the statistical aspects of deep learning.
This book is the result of a series of lectures given by the authors. Parts of the material were
presented by P.P. in a lecture titled “Neural Network Theory” at the University of Vienna, and by
J.Z. in a lecture titled “Theory of Deep Learning” at Heidelberg University. The lecture notes of
these courses formed the basis of this book. We are grateful to the many colleagues and students
who contributed to this book through insightful discussions and valuable suggestions. We would
like to offer special thanks to the following individuals:
Jonathan Garcia Rebellon, Jakob Lanser, Andrés Felipe Lerma Pineda, Martin Mauser, Davide
Modesto, Martina Neuman, Bruno Perreaux, Johannes Asmus Petersen, Milutin Popovic, Tuan
Quach, Lorenz Riess, Jakob Fabian Rohner, Jonas Schuhmann, Peter Školnı́k, Matej Vedak, Simon
Weissmann, Ashia Wilson.

Notation
In this section, we provide a summary of the notations used throughout the manuscript for the
reader’s convenience.

Symbol  Description  Reference

A  vector of layer widths  Definition 12.1
A  a sigma-algebra  Definition A.1
aff(S)  affine hull of S  (5.3.7)
B^d  the Borel sigma-algebra on R^d  Section A.1
B_n  B-splines of order n  Definition 4.2
B_r(x)  ball of radius r ≥ 0 around x in a metric space X  (B.1.1)
B^d_r  ball of radius r ≥ 0 around 0 in R^d
C^k(Ω)  k-times continuously differentiable functions from Ω → R
C^∞_c(Ω)  infinitely differentiable functions from Ω → R with compact support in Ω
C^{0,s}(Ω)  s-Hölder continuous functions from Ω → R
C^{k,s}(Ω)  C^k(Ω) functions f for which f^(k) ∈ C^{0,s}(Ω)  Definition 7.5
fn →cc f  compact convergence of fn to f  Definition 3.1
co(S)  convex hull of a set S  (5.3.1)
f ∗ g  convolution of f and g
D  data distribution  (1.2.4)/Section 14.1
D^α  partial derivative
depth(Φ)  depth of Φ  Definition 2.1
ε_approx  approximation error  (14.2.3)
ε_gen  generalization error  (14.2.3)
ε_int  interpolation error  (14.2.4)
E[X]  expectation of random variable X  (A.2.1)
E[X|Y]  conditional expectation of random variable X  Subsection A.3.3
F(f) or f̂  Fourier transform of f  Definition B.15
G(S, ε, X)  ε-covering number of a set S ⊆ X  Definition 14.10
Γ_C  Barron space with constant C  Section 8.1
∇x f  gradient of f w.r.t. x
⊘  componentwise (Hadamard) division
⊗  componentwise (Hadamard) product
h_S  empirical risk minimizer for a sample S  Definition 14.5
Φ^id_L  identity ReLU neural network  Lemma 5.1
1_S  indicator function of the set S
⟨·, ·⟩  Euclidean inner product on R^d
⟨·, ·⟩_H  inner product on a vector space H  Definition B.9
k_T  maximal number of elements shared by a single node of a triangulation  (5.3.2)
K^LC  neural tangent kernel for the LeCun initialization  Theorem 11.16
K̂_n(x, x′)  empirical tangent kernel  (11.3.4)
K^NTK  neural tangent kernel for the NTK initialization  Theorem 11.30
Λ_{A,σ,S,L}  loss landscape defining function  Definition 12.2
Lip(f)  Lipschitz constant of a function f  (9.2.1)
Lip_M(Ω)  M-Lipschitz continuous functions on Ω  (9.2.4)
L  general loss function  Section 14.1
L_{0−1}  0-1 loss  Section 14.1
L_ce  binary cross entropy loss  Section 14.1
L_2  square loss  Section 14.1
L^p(Ω)  Lebesgue space over Ω  Section B.1.3
M  piecewise continuous and locally bounded functions  Definition 3.1.1
N^m_d(σ; L, n)  set of multilayer perceptrons with d-dim input, m-dim output, activation function σ, depth L, and width n  Definition 3.6
N^m_d(σ; L)  union of N^m_d(σ; L, n) for all n ∈ N  Definition 3.6
N(σ; A, B)  set of neural networks with architecture A, activation function σ and all weights bounded in modulus by B  Definition 12.1
N∗(σ, A, B)  neural networks in N(σ; A, B) with range in [−1, 1]  (14.5.1)
N  positive natural numbers
N_0  natural numbers including 0
N(m, C)  multivariate normal distribution with mean m ∈ R^d and covariance C ∈ R^{d×d}
n_A  number of parameters of a neural network with layer widths described by A  Definition 12.1
∥·∥  Euclidean norm for vectors in R^d and spectral norm for matrices in R^{n×d}
∥·∥_F  Frobenius norm for matrices
∥·∥_∞  ∞-norm on R^d or supremum norm for functions
∥·∥_p  p-norm on R^d
∥·∥_X  norm on a vector space X
0  zero vector in R^d
O(·)  Landau notation
ω(η)  patch of the node η  (5.3.5)
Ω_Λ(c)  sublevel set of loss landscape  Definition 12.3
P_n  short for P_n(R^d)
P_n(R^d)  space of multivariate polynomials of degree n in R^d  Example 3.5
P  short for P(R^d)
P[A]  probability of event A  Definition A.5
P[A|B]  conditional probability of event A given B  Definition A.3.2
P_X  distribution of random variable X  Definition A.10
P(R^d)  space of multivariate polynomials in R^d  Example 3.5
Φ^lin  linearization of a model around initialization  (11.3.1)
Φ^min_n  minimum neural network  Lemma 5.11
Φ^×_ε  multiplication neural network  Lemma 7.3
Φ^×_{n,ε}  multiplication of n numbers neural network  Proposition 7.4
Φ2 ◦ Φ1  composition of neural networks  Lemma 5.2
Φ2 • Φ1  sparse composition of neural networks  Lemma 5.2
(Φ1, . . . , Φm)  parallelization of neural networks  (5.1.1)
Pieces(f, Ω)  number of pieces of f on Ω  Definition 6.1
PN(A, B)  parameter set of neural networks with architecture A and all weights bounded in modulus by B  Definition 12.1
Q  rational numbers
R  real numbers
R_−  non-positive real numbers
R_+  non-negative real numbers
R_σ  realization map  Definition 12.1
R∗  Bayes risk  (14.1.1)
R(h)  risk of hypothesis h  Definition 14.2
R̂_S(h)  empirical risk of h for sample S  (1.2.3), Definition 14.4
S_n  cardinal B-spline  Definition 4.1
S^d_{ℓ,t,n}  multivariate cardinal B-spline  Definition 4.2
|S|  cardinality of an arbitrary set S, or Lebesgue measure of S ⊆ R^d
S̊  interior of a set S
S̄  closure of a set S
∂S  boundary of a set S
S^c  complement of a set S
σ  general activation function
σ_a  parametric ReLU activation function  Section 2.3
σ_ReLU  ReLU activation function  Section 2.3
sign  sign function
size(Φ)  number of free network parameters in Φ  Definition 2.4
span(S)  linear hull or span of S
T  triangulation  Definition 5.13
V[X]  variance of random variable X  Section A.2.2
VCdim(H)  VC dimension of a set of functions H  Definition 14.16
W  distribution of weight initialization  Section 11.5.1
W^(ℓ), b^(ℓ)  weights and biases in layer ℓ of a neural network  Definition 2.1
width(Φ)  width of Φ  Definition 2.1
x^(ℓ)  output of ℓ-th layer of a neural network  Definition 2.1
x̄^(ℓ)  preactivations  (10.3.3)
X′  dual space to a normed space X  Definition B.7
Chapter 1

Introduction

1.1 Mathematics of deep learning


In 2012, a deep learning architecture revolutionized the field of computer vision by achieving un-
precedented performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
[121]. This architecture, known as AlexNet, significantly outperformed all competing
technologies. A few years later, in March 2016, a deep learning-based architecture called AlphaGo
defeated the best Go player at the time, Lee Sedol, in a five-game match [214]. Go is a highly
complex board game with a vast number of possible moves, making it a challenging problem for
artificial intelligence. Because of this complexity, many researchers believed that defeating a top
human Go player was a feat that would only be achieved decades later.
These breakthroughs, along with many others including DeepMind’s AlphaFold [110], which
revolutionized protein structure prediction in 2020, the unprecedented language capabilities of
large language models like GPT-3 (and later versions) [234, 28], and the emergence of generative
AI models like Stable Diffusion, Midjourney, and DALL-E, have sparked interest among scientists
across (almost) all disciplines. Likewise, while mathematical research on neural networks has a
long history, these groundbreaking developments revived interest in the theoretical underpinnings
of deep learning among mathematicians. However, initially, there was a clear consensus in the
mathematics community: We do not understand why this technology works so well! In fact, there
are many mathematical reasons that, at least superficially, should prevent the observed success.
Over the past decade the field has matured, and mathematicians have gained a more profound
understanding of deep learning, although many open questions remain. Recent years have brought
various new explanations and insights into the inner workings of deep learning models. Before
discussing these in detail in the following chapters, we first give a high-level introduction to deep
learning, with a focus on the supervised learning framework, which is the central theme of this
book.

1.2 High-level overview of deep learning


Deep learning refers to the application of deep neural networks, trained by gradient-based methods,
to identify unknown input-output relationships. This approach has three key ingredients: deep
neural networks, gradient-based training, and prediction. We now explain each of these ingredients
separately.

Figure 1.1: Illustration of a single neuron ν. The neuron receives six inputs (x1 , . . . , x6 ) = x, computes their weighted sum ∑_{j=1}^6 xj wj , adds a bias b, and finally applies the activation function σ to produce the output ν(x).

Deep Neural Networks Deep neural networks are formed by a combination of neurons. A
neuron is a function of the form

Rd ∋ x ↦ ν(x) = σ(w⊤ x + b), (1.2.1)

where w ∈ Rd is a weight vector, b ∈ R is called bias, and the function σ is referred to as an


activation function. This concept is due to McCulloch and Pitts [142] and is a mathematical
model for biological neurons. If we consider σ to be the Heaviside function, i.e., σ = 1R+ with
R+ := [0, ∞), then the neuron “fires” if the weighted sum of the inputs x surpasses the threshold
−b. We depict a neuron in Figure 1.1. Note that if we fix d and σ, then the set of neurons can be
naturally parameterized by the d + 1 real values w1 , . . . , wd , b ∈ R.
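To make (1.2.1) concrete, the following is a minimal NumPy sketch of a single neuron with the Heaviside activation; the particular weights, bias, and input are arbitrary and serve only as an illustration.

```python
import numpy as np

def neuron(x, w, b, activation):
    """Evaluate nu(x) = activation(w^T x + b), cf. (1.2.1)."""
    return activation(np.dot(w, x) + b)

# Heaviside activation sigma = 1_{[0, infinity)}: the neuron "fires" once w^T x reaches -b.
heaviside = lambda z: 1.0 if z >= 0 else 0.0

w = np.array([0.5, -1.0, 0.2, 0.0, 1.5, -0.3])   # weight vector (d = 6)
b = -1.0                                          # bias
x = np.array([1.0, 0.5, 2.0, -1.0, 0.8, 0.1])     # input
print(neuron(x, w, b, heaviside))                 # 1.0, since w^T x + b > 0 here
```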
Neural networks are functions formed by connecting neurons, where the output of one neuron
becomes the input to another. One simple but very common type of neural network is the so-called
feedforward neural network. This structure distinguishes itself by having the neurons grouped in
layers, such that the inputs to neurons in the (ℓ + 1)-st layer are exclusively the outputs of neurons in the ℓ-th layer.
We start by defining a shallow feedforward neural network as an affine transformation
applied to the output of a set of neurons that share the same input and the same activation
function. Here, an affine transformation is a map T : Rp → Rq such that T (x) = W x + b for
some W ∈ Rq×p , b ∈ Rq where p, q ∈ N.
Formally, a shallow feedforward neural network is, therefore, a map Φ of the form

Rd ∋ x ↦ Φ(x) = T1 ◦ σ ◦ T0 (x),

where T0 , T1 are affine transformations, and the application of σ is understood to be in each component of T0 (x). A visualization of a shallow neural network is given in Figure 1.2.
A deep feedforward neural network is constructed by compositions of shallow neural net-
works. This yields a map of the type

Rd ∋ x ↦ Φ(x) = TL+1 ◦ σ ◦ · · · ◦ T1 ◦ σ ◦ T0 (x),

where L ∈ N and (Tj )_{j=0}^{L+1} are affine transformations. The number of compositions L is referred to as the number of layers of the deep neural network. Similar to a single neuron, (deep) neural networks can be viewed as a parameterized function class, with the parameters being the entries of the matrices and vectors determining the affine transformations (Tj )_{j=0}^{L+1}.

Figure 1.2: Illustration of a shallow neural network. The affine transformation T0 is of the form (x1 , . . . , x6 ) = x ↦ W x + b, where the rows of W are the weight vectors w1 , w2 , w3 for each respective neuron.

Gradient-based training After defining the structure or architecture of the neural network,
e.g., the activation function and the number of layers, the second step of deep learning consists of
determining optimal values for its parameters. This optimization is carried out by minimizing an
objective function. In supervised learning, which will be our focus, this objective depends on a
collection of input-output pairs known as a sample. Concretely, let S = (xi , y i )_{i=1}^m be a sample, where xi ∈ Rd represents the inputs and y i ∈ Rk the corresponding outputs with d, k ∈ N. Our
goal is to find a deep neural network Φ such that

Φ(xi ) ≈ y i for all i = 1, . . . , m (1.2.2)

in a meaningful sense. For example, we could interpret “≈” to mean closeness with respect to
the Euclidean norm, or more generally, that L(Φ(xi ), y i ) is small for a function L measuring the
dissimilarity between its inputs. Such a function L is called a loss function. A standard way of
achieving (1.2.2) is by minimizing the so-called empirical risk of Φ with respect to the sample S defined as

R̂_S (Φ) = (1/m) ∑_{i=1}^m L(Φ(xi ), y i ).    (1.2.3)

If L is differentiable, and for all xi the output Φ(xi ) depends differentiably on the parameters of the neural network, then the gradient of the empirical risk R̂_S (Φ) with respect to the parameters is well-defined. This gradient can be efficiently computed using a technique called backpropagation. This allows us to minimize (1.2.3) with optimization algorithms such as (stochastic) gradient descent. These methods produce a sequence of neural network parameters, and corresponding neural network functions Φ1 , Φ2 , . . . , for which the empirical risk is expected to decrease. Figure 1.3 illustrates a possible behavior of this sequence.
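The whole pipeline (network, empirical risk, gradients via backpropagation, parameter updates) fits in a few lines of PyTorch. The sketch below is only an illustration under arbitrary choices: a synthetic one-dimensional regression sample, a small ReLU architecture, the square loss, and plain gradient descent.

```python
import torch

# Synthetic sample S = (x_i, y_i), i = 1, ..., m (arbitrary illustration).
m = 64
X = torch.linspace(-2.0, 2.0, m).unsqueeze(1)
Y = torch.sin(3.0 * X) + 0.1 * torch.randn(m, 1)

# A small feedforward neural network Phi and the square loss L.
Phi = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)
loss_fn = torch.nn.MSELoss()
opt = torch.optim.SGD(Phi.parameters(), lr=1e-2)

for step in range(2000):
    opt.zero_grad()
    empirical_risk = loss_fn(Phi(X), Y)   # (1.2.3) with the square loss
    empirical_risk.backward()             # gradients via backpropagation
    opt.step()                            # one gradient descent update

print(float(loss_fn(Phi(X), Y)))          # final empirical risk on S
```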

Prediction The final part of deep learning concerns the question of whether we have actually
learned something by the procedure above. Suppose that our optimization routine has either con-
verged or has been terminated, yielding a neural network Φ∗ . While the optimization aimed to
minimize the empirical risk on the training sample S, our ultimate interest is not in how well Φ∗ per-
forms on S. Rather, we are interested in its performance on new, unseen data points (xnew , y new ).
To make meaningful statements about this performance, we need to assume a relationship between
the training sample S and other data points.
The standard approach is to assume the existence of a data distribution D on the input-output space (in our case, Rd × Rk ) such that both the elements of S and all other considered data points are drawn from this distribution. In other words, we treat S as an i.i.d. draw from D, with (xnew , y new ) also sampled independently from D. If we want Φ∗ to perform well on average, then
this amounts to controlling the following expression

R(Φ∗ ) = E(xnew ,ynew )∼D [L(Φ∗ (xnew ), y new )], (1.2.4)

which is called the risk of Φ∗ . If the risk is not much larger than the empirical risk, then we say
that the neural network Φ∗ has a small generalization error. On the other hand, if the risk is
much larger than the empirical risk, then we say that Φ∗ overfits the training data, meaning that
Φ∗ has memorized the training samples, but does not generalize well to new data.
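Since D is unknown in practice, the risk (1.2.4) is usually estimated by evaluating the trained model on held-out data. A brief sketch, reusing Phi, loss_fn, X, and Y from the previous snippet and assuming we can sample fresh points from the same synthetic distribution:

```python
with torch.no_grad():
    X_new = 4.0 * torch.rand(1000, 1) - 2.0                 # fresh inputs from the same range
    Y_new = torch.sin(3.0 * X_new) + 0.1 * torch.randn(1000, 1)
    risk_on_sample = loss_fn(Phi(X), Y)                     # empirical risk on the training sample S
    estimated_risk = loss_fn(Phi(X_new), Y_new)             # Monte Carlo estimate of R(Phi*)
    print(float(estimated_risk - risk_on_sample))           # proxy for the generalization error
```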

Figure 1.3: A sequence of one-dimensional neural networks Φ1 , . . . , Φ4 that successively minimizes the empirical risk for the sample S = (xi , yi )_{i=1}^6 .

1.3 Why does it work?
It is natural to wonder why the deep learning pipeline, as outlined in the previous subsection,
ultimately succeeds in learning, i.e., achieving a small risk. Is it true that for a given sample
(xi , y i )_{i=1}^m there exists a neural network Φ such that Φ(xi ) ≈ y i for all i = 1, . . . , m? Does the
optimization routine produce a meaningful result? Can we control the risk, knowing only that the
empirical risk is small?
While most of these questions can be answered affirmatively under certain assumptions, these
assumptions often do not apply to deep learning in practice. We next explore some potential
explanations and show that they lead to even more questions.

Approximation A fundamental result in the study of neural networks is the so-called universal
approximation theorem, which will be discussed in Chapter 3. This result states that every con-
tinuous function on a compact domain can be approximated arbitrarily well (in a uniform sense)
by a shallow neural network.
This result, however, does not answer questions that are more specific to deep learning, such
as the question of efficiency. For example, if we aim for computational efficiency, then we might
be interested in the smallest neural network that fits the data. This raises the question: What is
the role of the architecture for the expressive capabilities of neural networks? Furthermore, if we
view the reduction of the empirical risk as an approximation problem, we are confronted with one of the
main issues of approximation theory, which is the curse of dimensionality. Function approximation
in high dimensions is notoriously difficult and gets exponentially harder with increasing dimension.
In practice, many successful deep learning architectures operate in this high-dimensional regime.
Why do these neural networks not seem to suffer from the curse of dimensionality?

Optimization While gradient descent can sometimes be proven to converge to a global minimum
as we will discuss in Chapter 10, this typically requires the objective function to be at least convex.
However, there is no reason to believe that, for example, the empirical risk is a convex function of
the network parameters. In fact, due to the repeatedly occurring compositions with the nonlinear
activation function in the network, the empirical risk is typically highly non-linear and not convex.
Therefore, there is generally no guarantee that the optimization routine will converge to a global
minimum, and it may get stuck in a local (and non-global) minimum or a saddle point. Why is the
output of the optimization nonetheless often meaningful in practice?

Generalization In traditional statistical learning theory, which we will review in Chapter 14,
the extent to which the risk exceeds the empirical risk can be bounded a priori; such bounds are
often expressed in terms of a notion of complexity of the set of admissible functions (the class of
neural networks) divided by the number of training samples. For the class of neural networks of a
fixed architecture, the complexity roughly amounts to the number of neural network parameters.
In practice, neural networks with more parameters than training samples are typically used. This
is dubbed the overparameterized regime. In this regime, the classical estimates described above are
void.
Why is it that, nonetheless, deep overparameterized architectures are capable of making accu-
rate predictions on unseen data? Furthermore, while deep architectures often generalize well, they
sometimes fail spectacularly on specific, carefully crafted examples. In image classification tasks, these examples may differ only slightly from correctly classified images in a way that is not perceptible to the human eye. Such examples are known as adversarial examples, and their existence
poses a great challenge for applications of deep learning.

1.4 Outline and philosophy


This book addresses the questions raised in the previous section, providing answers that are mathe-
matically rigorous and accessible. Our focus will be on provable statements, presented in a manner
that prioritizes simplicity and clarity over generality. We will sometimes illustrate key ideas only
in special cases, or under strong assumptions, both to avoid an overly technical exposition, and
because definitive answers are often not yet available. In the following, we summarize the content
of each chapter and highlight parts pertaining to the questions stated in the previous section.
Chapter 2: Feedforward neural networks. In this chapter, we introduce the main object
of study of this book—the feedforward neural network.
Chapter 3: Universal approximation. We present the classical view of function approx-
imation by neural networks, and give two instances of so-called universal approximation results.
Such statements describe the ability of neural networks to approximate every function of a given
class to arbitrary accuracy, given that the network size is sufficiently large. The first result, which
holds under very broad assumptions on the activation function, is on uniform approximation of
continuous functions on compact domains. The second result shows that for a very specific acti-
vation function, the network size can be chosen independent of the desired accuracy, highlighting
that universal approximation needs to be interpreted with caution.
Chapter 4: Splines. Going beyond universal approximation, this chapter starts to explore
approximation rates of neural networks. Specifically, we examine how well certain functions can be
approximated relative to the number of parameters in the network. For so-called sigmoidal activa-
tion functions, we establish a link between neural network approximation and spline approximation. This reveals that smoother functions require fewer network parameters. However, achieving this increased effi-
ciency necessitates the use of deeper neural networks. This observation offers a first glimpse into
the importance of depth in deep learning.
Chapter 5: ReLU neural networks. This chapter focuses on one of the most popular ac-
tivation functions in practice—the ReLU. We prove that the class of ReLU networks is equal to
the set of continuous piecewise linear functions, thus providing a theoretical foundation for their
expressive power. Furthermore, given a continuous piecewise linear function, we investigate the
necessary width and depth of a ReLU network to represent it. Finally, we leverage approxima-
tion theory for piecewise linear functions to derive convergence rates for approximating Hölder
continuous functions.
Chapter 6: Affine pieces for ReLU neural networks. Having gained some intuition about
ReLU neural networks, in this chapter, we address some potential limitations. We analyze ReLU
neural networks by counting the number of affine regions that they generate. The key insight of
this chapter is that deep neural networks can generate exponentially more regions than shallow
ones. This observation provides further evidence for the potential advantages of depth in neural
network architectures.
Chapter 7: Deep ReLU neural networks. Having identified the ability of deep ReLU
neural networks to generate a large number of affine regions, we investigate whether this translates
into an actual advantage in function approximation. Indeed, for approximating smooth functions, we prove substantially better approximation rates than we obtained for shallow neural networks. This adds again to our understanding of depth and its connection to the expressive power of neural
network architectures.
Chapter 8: High-dimensional approximation. The convergence rates established in the
previous chapters deteriorate significantly in high-dimensional settings. This chapter examines
three scenarios under which neural networks can provably overcome the curse of dimensionality.
Chapter 9: Interpolation. In this chapter we shift our perspective from approximation to
exact interpolation of the training data. We analyze conditions under which exact interpolation is
possible, and discuss the implications for empirical risk minimization. Furthermore, we present a
constructive proof showing that ReLU networks can express an optimal interpolant of the data (in
a specific sense).
Chapter 10: Training of neural networks. We start to examine the training process of deep
learning. First, we study the fundamentals of (stochastic) gradient descent and convex optimization.
Then, we discuss how the backpropagation algorithm can be used to implement these optimization
algorithms for training neural networks. Finally, we examine accelerated methods and highlight
the key principles behind popular and more advanced training algorithms such as Adam.
Chapter 11: Wide neural networks and the neural tangent kernel. This chapter
introduces the neural tangent kernel as a tool for analyzing the training behavior of neural networks.
We begin by revisiting linear and kernel regression for the approximation of functions based on data.
Afterwards, we demonstrate in an abstract setting that under certain assumptions, the training
dynamics of gradient descent for neural networks resemble those of kernel regression, converging
to a global minimum. Using standard initialization schemes, we then show that the assumptions
for such a statement to hold are satisfied with high probability, if the network is sufficiently wide
(overparameterized). This analysis provides insights into why, under certain conditions, we can
train neural networks without getting stuck in (bad) local minima, despite the non-convexity of
the objective function. Additionally, we discuss a well-known link between neural networks and
Gaussian processes, giving some indication why overparameterized networks do not necessarily
overfit in practice.
Chapter 12: Loss landscape analysis. In this chapter, we present an alternative view on the
optimization problem, by analyzing the loss landscape—the empirical risk as a function of the neural
network parameters. We give theoretical arguments showing that increasing overparameterization
leads to greater connectivity between the valleys and basins of the loss landscape. Consequently,
overparameterized architectures make it easier to reach a region where all minima are global minima.
Additionally, we observe that most stationary points associated with non-global minima are saddle
points. This sheds further light on the empirically observed fact that deep architectures can often
be optimized without getting stuck in non-global minima.
Chapter 13: Shape of neural network spaces. While Chapters 11 and 12 highlight
potential reasons for the success of neural network training, in this chapter, we show that the set
of neural networks of a fixed architecture has some undesirable properties from an optimization
perspective. Specifically, we show that this set is typically non-convex. Moreover, in general it does
not possess the best-approximation property, meaning that there might not exist a neural network
within the set yielding the best approximation for a given function.
Chapter 14: Generalization properties of deep neural networks. To understand why deep neural networks successfully generalize to unseen data points (outside of the training set), we study classical statistical learning theory, with a focus on neural network functions as the hypothesis class. We then show how to establish generalization bounds for deep learning, providing
theoretical insights into the performance on unseen data.
Chapter 15: Generalization in the overparameterized regime. The generalization
bounds of the previous chapter are not meaningful when the number of parameters of a neural net-
work surpasses the number of training samples. However, this overparameterized regime is where
many successful network architectures operate. To gain a deeper understanding of generalization
in this regime, we describe the phenomenon of double descent and present a potential explana-
tion. This addresses the question of why deep neural networks perform well despite being highly
overparameterized.
Chapter 16: Robustness and adversarial examples. In the final chapter, we explore
the existence of adversarial examples—inputs designed to deceive neural networks. We provide
some theoretical explanations of why adversarial examples arise, and discuss potential strategies to
prevent them.

1.5 Material not covered in this book


This book studies some central topics of deep learning but leaves out even more. Interesting
questions associated with the field that were omitted, as well as some pointers to related works are
listed below:
Advanced architectures: The (deep) feedforward neural network is far from the only type of
neural network. In practice, architectures must be adapted to the type of data. For example, images
exhibit strong spatial dependencies in the sense that adjacent pixels often have similar values.
Convolutional neural networks [128] are particularly well suited for this type of input, as they
employ convolutional filters that aggregate information from neighboring pixels, thus capturing the
data structure better than a fully connected feedforward network. Similarly, graph neural networks
[27] are a natural choice for graph-based data. For sequential data, such as natural language,
architectures with some form of memory component are used, including Long Short-Term Memory
(LSTM) networks [93] and attention-based architectures like transformers [234].
Interpretability/Explainability and Fairness: The use of deep neural networks in critical
decision-making processes, such as allocating scarce resources (e.g., organ transplants in medicine,
financial credit approval, hiring decisions) or engineering (e.g., optimizing bridge structures, au-
tonomous vehicle navigation, predictive maintenance), necessitates an understanding of their decision-
making process. This is crucial for both practical and ethical reasons.
Practically, understanding how a model arrives at a decision can help us improve its performance
and mitigate problems. It allows us to ensure that the model performs according to our intentions
and does not produce undesirable outcomes. For example, in bridge design, understanding why a
model suggests or rejects a particular configuration can help engineers identify potential vulnerabil-
ities, ultimately leading to safer and more efficient designs. Ethically, transparent decision-making
is crucial, especially when the outcomes have significant consequences for individuals or society; bi-
ases present in the data or model design can lead to discriminatory outcomes, making explainability
essential.
However, explaining the predictions of deep neural networks is not straightforward. Despite
knowledge of the network weights and biases, the repeated and complex interplay of linear trans-
formations and non-linear activation functions often renders these models black boxes. A compre-
hensive overview of various techniques for interpretability, not only for deep neural networks, can be found in [149]. Regarding the topic of fairness, we refer for instance to [55, 8].
Unsupervised and Reinforcement Learning: While this book focuses on supervised learn-
ing, where each data point xi has a label yi , there is a vast field of machine learning called unsuper-
vised learning, where labels are absent. Classical unsupervised learning problems include clustering
and dimensionality reduction [212, Chapters 22/23].
A popular area in deep learning, where no labels are used, is physics-informed neural networks
[187]. Here, a neural network is trained to satisfy a partial differential equation (PDE), with the
loss function quantifying the deviation from this PDE.
Finally, reinforcement learning is a technique where an agent can interact with an environment
and receives feedback based on its actions. The actions are guided by a so-called policy, which is
to be learned [148, Chapter 17]. In deep reinforcement learning, this policy is modeled by a deep
neural network. Reinforcement learning is the basis of the aforementioned AlphaGo.
Implementation: While this book focuses on provable theoretical results, the field of deep
learning is strongly driven by applications, and a thorough understanding of deep learning cannot
be achieved without practical experience. For this, there exist numerous resources with excellent
explanations. We recommend [67, 38, 182] as well as the countless online tutorials that are just a
Google (or alternative) search away.
Many more: The field is evolving rapidly, and new ideas are constantly being generated
and tested. This book cannot give a complete overview. However, we hope that it provides the
reader with a solid foundation in the fundamental knowledge and principles to quickly grasp and
understand new developments in the field.

Bibliography and further reading


Throughout this book, we will end each chapter with a short overview of related work and the
references used in the chapter.
In this introductory chapter, we highlight several other recent textbooks and works on deep
learning. For a historical survey on neural networks see [202] and also [127]. For general textbooks
on neural networks and deep learning, we refer to the more recent monographs [84, 72, 182]. A more
mathematical introduction to the topic is given, for example, in [3, 107, 29]. For the implementation
of neural networks we refer for example to [67, 38].

Chapter 2

Feedforward neural networks

Feedforward neural networks, henceforth simply referred to as neural networks (NNs), constitute
the central object of study of this book. In this chapter, we provide a formal definition of neural
networks, discuss the size of a neural network, and give a brief overview of common activation
functions.

2.1 Formal definition


We previously defined a single neuron ν in (1.2.1) and Figure 1.1. A neural network is constructed
by connecting multiple neurons. Let us now make precise this connection procedure.

Definition 2.1. Let L ∈ N, d0 , . . . , dL+1 ∈ N, and let σ : R → R. A function Φ : Rd0 → RdL+1


is called a neural network if there exist matrices W (ℓ) ∈ Rdℓ+1 ×dℓ and vectors b(ℓ) ∈ Rdℓ+1 ,
ℓ = 0, . . . , L, such that with

x(0) := x                                                            (2.1.1a)
x(ℓ) := σ(W (ℓ−1) x(ℓ−1) + b(ℓ−1) )   for ℓ ∈ {1, . . . , L}          (2.1.1b)
x(L+1) := W (L) x(L) + b(L)                                          (2.1.1c)

holds

Φ(x) = x(L+1) for all x ∈ Rd0 .

We call L the depth, dmax = maxℓ=1,...,L dℓ the width, σ the activation function, and
(σ; d0 , . . . , dL+1 ) the architecture of the neural network Φ. Moreover, W (ℓ) ∈ Rdℓ+1 ×dℓ are the
weight matrices and b(ℓ) ∈ Rdℓ+1 the bias vectors of Φ for ℓ = 0, . . . , L.
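The recursion (2.1.1) translates directly into code. The following NumPy sketch is only an illustration: the architecture, the tanh activation, and the random weights are arbitrary choices and not part of Definition 2.1.

```python
import numpy as np

def realize(weights, biases, sigma, x):
    """Evaluate Phi(x) via (2.1.1); sigma acts componentwise in every layer except the last."""
    x_l = np.asarray(x, dtype=float)                  # x^(0)
    for l in range(len(weights) - 1):
        x_l = sigma(weights[l] @ x_l + biases[l])     # x^(l+1) = sigma(W^(l) x^(l) + b^(l))
    return weights[-1] @ x_l + biases[-1]             # x^(L+1), no activation in the last layer

# Architecture (tanh; 3, 4, 4, 2): depth L = 2, width 4, random weights for illustration.
dims = [3, 4, 4, 2]
rng = np.random.default_rng(1)
Ws = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
bs = [rng.standard_normal(dims[l + 1]) for l in range(len(dims) - 1)]
print(realize(Ws, bs, np.tanh, [1.0, -0.5, 0.25]))
```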

Remark 2.2. Typically, there exist different choices of architectures, weights, and biases yielding
the same function Φ : Rd0 → RdL+1 . For this reason we cannot associate a unique meaning to these
notions solely based on the function realized by Φ. In the following, when we refer to the properties of a neural network Φ, it is always understood to mean that there exists at least one construction as in Definition 2.1, which realizes the function Φ and uses parameters that satisfy those properties.
The architecture of a neural network is often depicted as a connected graph, as illustrated in
Figure 2.1. The nodes in such graphs represent (the output of) the neurons. They are arranged in
layers, with x(ℓ) in Definition 2.1 corresponding to the neurons in layer ℓ. We also refer to x(0) in
(2.1.1a) as the input layer and to x(L+1) in (2.1.1c) as the output layer. All layers in between
are referred to as the hidden layers and their output is given by (2.1.1b). The number of hidden
layers corresponds to the depth. For the correct interpretation of such graphs, we note that by our
conventions in Definition 2.1, the activation function is applied after each affine transformation,
except in the final layer.
Neural networks of depth one are called shallow; if the depth is larger than one, they are called deep. The notion of deep neural networks is not used entirely consistently in the literature, and
some authors use the word deep only in case the depth is much larger than one, where the precise
meaning of “much larger” depends on the application.
Throughout, we only consider neural networks in the sense of Definition 2.1. We emphasize
however, that this is just one (simple but very common) type of neural network. Many adjustments
to this construction are possible and also widely used. For example:

• We may use different activation functions σℓ in each layer ℓ or we may even use a different
activation function for each node.

• Residual neural networks allow “skip connections”. This means that information is allowed
to skip layers in the sense that the nodes in layer ℓ may have x(0) , . . . , x(ℓ−1) as their input
(and not just x(ℓ−1) ), cf. (2.1.1).

• In contrast to feedforward neural networks, recurrent neural networks allow information to


flow backward, in the sense that x(ℓ−1) , . . . , x(L+1) may serve as input for the nodes in layer ℓ
(and not just x(ℓ−1) ). This creates loops in the flow of information, and one has to introduce
a time index t ∈ N, as the output of a node in time step t might be different from the output
in time step t + 1.

Let us clarify some further common terminology used in the context of neural networks:

• parameters: The parameters of a neural network refer to the set of all entries of the weight
matrices and bias vectors. These are often collected in a single vector

w = ((W (0) , b(0) ), . . . , (W (L) , b(L) )). (2.1.2)

These parameters are adjustable and are learned during the training process, determining the
specific function realized by the network.

• hyperparameters: Hyperparameters are settings that define the network’s architecture (and
training process), but are not directly learned during training. Examples include the depth,
the number of neurons in each layer, and the choice of activation function. They are typically
set before training begins.

• weights: The term “weights” is often used broadly to refer to all parameters of a neural
network, including both the weight matrices and bias vectors.

Figure 2.1: Sketch of a neural network with three hidden layers, and d0 = 3, d1 = 4, d2 = 3, d3 = 4, d4 = 2. The neural network has depth three and width four.

• model: For a fixed architecture, every choice of network parameters w as in (2.1.2) defines a specific function x ↦ Φw (x). In deep learning this function is often referred to as a model. More generally, “model” can be used to describe any function parameterized by a set of parameters w ∈ Rn , n ∈ N.

2.1.1 Basic operations on neural networks


There are various ways in which neural networks can be combined with one another. The next propo-
sition addresses this for linear combinations, compositions, and parallelization. The formal proof,
which is a good exercise to familiarize oneself with neural networks, is left as Exercise 2.5.

Proposition 2.3. For two neural networks Φ1 , Φ2 with architectures

(σ; d^1_0 , d^1_1 , . . . , d^1_{L1+1})   and   (σ; d^2_0 , d^2_1 , . . . , d^2_{L2+1})

respectively, it holds that

(i) for all α ∈ R there exists a neural network Φα with architecture (σ; d^1_0 , d^1_1 , . . . , d^1_{L1+1}) such that

Φα (x) = αΦ1 (x) for all x ∈ R^{d^1_0},

(ii) if d^1_0 = d^2_0 =: d0 and L1 = L2 =: L, then there exists a neural network Φparallel with architecture (σ; d0 , d^1_1 + d^2_1 , . . . , d^1_{L+1} + d^2_{L+1}) such that

Φparallel (x) = (Φ1 (x), Φ2 (x)) for all x ∈ R^{d0},

(iii) if d^1_0 = d^2_0 =: d0 , L1 = L2 =: L, and d^1_{L+1} = d^2_{L+1} =: dL+1 , then there exists a neural network Φsum with architecture (σ; d0 , d^1_1 + d^2_1 , . . . , d^1_L + d^2_L , dL+1 ) such that

Φsum (x) = Φ1 (x) + Φ2 (x) for all x ∈ R^{d0},

(iv) if d^1_{L1+1} = d^2_0 , then there exists a neural network Φcomp with architecture (σ; d^1_0 , d^1_1 , . . . , d^1_{L1} , d^2_1 , . . . , d^2_{L2+1}) such that

Φcomp (x) = Φ2 ◦ Φ1 (x) for all x ∈ R^{d^1_0}.
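The constructions behind Proposition 2.3 are the subject of Exercise 2.5, but the statements are easy to check numerically. The sketch below illustrates case (iii) under simplifying assumptions (NumPy, identical architectures for both networks) and reuses the realize function from the snippet after Definition 2.1: the first-layer weights are stacked, the hidden-layer weights are arranged block-diagonally, and the output-layer weights are concatenated.

```python
import numpy as np

def sum_network(Ws1, bs1, Ws2, bs2):
    """Weights and biases of Phi_sum from Proposition 2.3 (iii), for equal depth and
    equal input/output dimensions."""
    L = len(Ws1) - 1
    Ws = [np.vstack([Ws1[0], Ws2[0]])]                      # both networks read the same input
    bs = [np.concatenate([bs1[0], bs2[0]])]
    for l in range(1, L):                                   # run the hidden layers in parallel
        Ws.append(np.block([[Ws1[l], np.zeros((Ws1[l].shape[0], Ws2[l].shape[1]))],
                            [np.zeros((Ws2[l].shape[0], Ws1[l].shape[1])), Ws2[l]]]))
        bs.append(np.concatenate([bs1[l], bs2[l]]))
    Ws.append(np.hstack([Ws1[L], Ws2[L]]))                  # last layer adds the two outputs
    bs.append(bs1[L] + bs2[L])
    return Ws, bs

rng = np.random.default_rng(2)
dims = [3, 4, 5, 2]                                         # shared architecture (sigma; 3, 4, 5, 2)
make = lambda: ([rng.standard_normal((dims[l + 1], dims[l])) for l in range(3)],
                [rng.standard_normal(dims[l + 1]) for l in range(3)])
(W1, b1), (W2, b2) = make(), make()
Wsum, bsum = sum_network(W1, b1, W2, b2)
x = np.array([0.3, -1.2, 0.7])
lhs = realize(Wsum, bsum, np.tanh, x)                       # Phi_sum(x)
rhs = realize(W1, b1, np.tanh, x) + realize(W2, b2, np.tanh, x)
print(np.allclose(lhs, rhs))                                # True
```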

2.2 Notion of size


Neural networks provide a framework to parametrize functions. Ultimately, our goal is to find
a neural network that fits some underlying input-output relation. As mentioned above, the ar-
chitecture (depth, width, and activation function) is typically chosen a priori and considered fixed.
During training of the neural network, its parameters (weights and biases) are suitably adapted by
some algorithm. Depending on the application, on top of the stated architecture choices, further
restrictions on the weights and biases can be desirable. For example, the following two appear
frequently:

• weight sharing: This is a technique where specific entries of the weight matrices (or bias vectors) are constrained to be equal. Formally, this means imposing conditions of the form W^(i)_{k,l} = W^(j)_{s,t} , i.e. the entry (k, l) of the ith weight matrix is equal to the entry at position (s, t) of weight matrix j. We denote this assumption by (i, k, l) ∼ (j, s, t), paying tribute to the trivial fact that “∼” is an equivalence relation. During training, shared weights are updated jointly, meaning that any change to one weight is simultaneously applied to all other weights of this class. Weight sharing can also be applied to the entries of bias vectors.

• sparsity: This refers to imposing a sparsity structure on the weight matrices (or bias vectors). Specifically, we set W^(i)_{k,l} = 0 a priori for certain (k, l, i), i.e. we impose entry (k, l) of the ith weight matrix to be 0. These zero-valued entries are considered fixed, and are not adjusted during training. The condition W^(i)_{k,l} = 0 corresponds to node l of layer i − 1 not serving as an input to node k in layer i. If we represent the neural network as a graph, this is indicated by not connecting the corresponding nodes. Sparsity can also be imposed on the bias vectors.

Both of these restrictions decrease the number of learnable parameters in the neural network. The
number of parameters can be seen as a measure of the complexity of the represented function class.
For this reason, we introduce size(Φ) as a notion for the number of learnable parameters. Formally
(with |S| denoting the cardinality of a set S):

Definition 2.4. Let Φ be as in Definition 2.1. Then the size of Φ is

size(Φ) := |({(i, k, l) | W^(i)_{k,l} ≠ 0} ∪ {(i, k) | b^(i)_k ≠ 0})/∼| .    (2.2.1)
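In the common case without weight sharing, the equivalence relation ∼ is trivial and (2.2.1) simply counts the nonzero weight and bias entries. A small self-contained sketch, assuming NumPy; the example architecture is arbitrary.

```python
import numpy as np

def size_of(Ws, bs):
    """Number of nonzero parameters, i.e. (2.2.1) when no weights are shared."""
    return (sum(int(np.count_nonzero(W)) for W in Ws)
            + sum(int(np.count_nonzero(b)) for b in bs))

# Dense architecture (sigma; 2, 3, 1): 2*3 + 3*1 weights plus 3 + 1 biases = 13 parameters.
Ws = [np.ones((3, 2)), np.ones((1, 3))]
bs = [np.ones(3), np.ones(1)]
print(size_of(Ws, bs))        # 13
Ws[0][0, 1] = 0.0             # imposing sparsity on one entry ...
print(size_of(Ws, bs))        # ... reduces the size to 12
```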
2.3 Activation functions
Activation functions are a crucial part of neural networks, as they introduce nonlinearity into the
model. If an affine activation function were used, the resulting neural network function would also
be affine and hence very restricted in what it can represent.
The choice of activation function can have a significant impact on the performance, but there
does not seem to be a universally optimal one. We next discuss a few important activation functions
and highlight some common issues associated with them.

Figure 2.2: Different activation functions. (a) Sigmoid. (b) ReLU and SiLU. (c) Leaky ReLU, shown for a = 0.05, 0.1, 0.2.

Sigmoid: The sigmoid activation function is given by

σsig (x) = 1/(1 + e^{−x})   for x ∈ R,
and depicted in Figure 2.2 (a). Its output ranges between zero and one, making it interpretable
as a probability. The sigmoid is a smooth function, which allows the application of gradient-based
training.
It has the disadvantage that its derivative becomes very small if |x| → ∞. This can affect
learning due to the so-called vanishing gradient problem. Consider the simple neural network
Φn (x) = σ ◦ · · · ◦ σ(x + b) defined with n ∈ N compositions of σ, and where b ∈ R is a bias. Its
derivative with respect to b is

(d/db) Φ_n(x) = σ′(Φ_{n−1}(x)) · (d/db) Φ_{n−1}(x).

If sup_{x∈R} |σ′(x)| ≤ 1 − δ, then by induction, |(d/db) Φ_n(x)| ≤ (1 − δ)^n. The opposite effect happens
for activation functions with derivatives uniformly larger than one. This argument shows that
the derivative of Φn (x, b) with respect to b can become exponentially small or exponentially large
when propagated through the layers. This effect, known as the vanishing- or exploding gradient
effect, also occurs for activation functions which do not admit the uniform bounds assumed above.
However, since the sigmoid activation function exhibits areas with extremely small gradients, the
vanishing gradient effect can be strongly exacerbated.
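The shrinking of the gradient can be observed numerically. The following sketch, assuming NumPy, differentiates Φn(x) = σ ◦ · · · ◦ σ(x + b) with respect to b by accumulating the product of σ′ factors from the chain rule above; the values x = 0.5 and b = 0 are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_phi_db(x, b, n):
    """d/db of Phi_n(x) = sigma(... sigma(x + b) ...), accumulated via the chain rule."""
    pre = x + b                                        # pre-activation of the first sigma
    grad = 1.0
    for _ in range(n):
        grad *= sigmoid(pre) * (1.0 - sigmoid(pre))    # sigma'(z) = sigma(z)(1 - sigma(z)) <= 1/4
        pre = sigmoid(pre)
    return grad

for n in [1, 5, 10, 20]:
    print(n, d_phi_db(x=0.5, b=0.0, n=n))              # shrinks roughly like 4^(-n)
```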
ReLU (Rectified Linear Unit): The ReLU is defined as

σReLU (x) = max{x, 0} for x ∈ R,

and depicted in Figure 2.2 (b). It is piecewise linear, and due to its simplicity its evaluation is
computationally very efficient. It is one of the most popular activation functions in practice. Since
its derivative is always zero or one, it does not suffer from the vanishing gradient problem to the
same extent as the sigmoid function. However, ReLU can suffer from the so-called dead neurons
problem. Consider the neural network

Φ(x) = σReLU (b − σReLU (x)) for x ∈ R

depending on the bias b ∈ R. If b < 0, then Φ(x) = 0 for all x ∈ R. The neuron corresponding to the second application of σReLU thus produces a constant signal. Moreover, if b < 0, then (d/db) Φ(x) = 0 for all x ∈ R. As a result, every negative value of b yields a stationary point of the empirical risk.
A gradient-based method will not be able to further train the parameter b. We thus refer to this
neuron as a dead neuron.
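The stationarity can also be checked with automatic differentiation. A brief PyTorch sketch of the two-ReLU example above (the grid of inputs is arbitrary):

```python
import torch

def phi(x, b):
    return torch.relu(b - torch.relu(x))      # Phi(x) = sigma_ReLU(b - sigma_ReLU(x))

x = torch.linspace(-3.0, 3.0, 7)
for b_val in [-0.5, 0.5]:
    b = torch.tensor(b_val, requires_grad=True)
    phi(x, b).sum().backward()
    print(b_val, b.grad)                       # zero gradient for b = -0.5 (dead neuron)
```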
SiLU (Sigmoid Linear Unit): An important difference between the ReLU and the Sigmoid is
that the ReLU is not differentiable at 0. The SiLU activation function (also referred to as “swish”)
can be interpreted as a smooth approximation to the ReLU. It is defined as
σSiLU (x) := x σsig (x) = x/(1 + e^{−x})   for x ∈ R,
and is depicted in Figure 2.2 (b). There exist various other smooth activation functions that
mimic the ReLU, including the Softplus x ↦ log(1 + exp(x)), the GELU (Gaussian Error Linear Unit) x ↦ xF (x), where F (x) denotes the cumulative distribution function of the standard normal distribution, and the Mish x ↦ x tanh(log(1 + exp(x))).
Parametric ReLU or Leaky ReLU: This variant of the ReLU addresses the dead neuron
problem. For some a ∈ (0, 1), the parametric ReLU is defined as

σa (x) = max{x, ax} for x ∈ R,

and is depicted in Figure 2.2 (c) for three different values of a. Since the output of σa does not
have flat regions like the ReLU, the dying ReLU problem is mitigated. If a is not chosen too small,
then there is less of a vanishing gradient problem than for the Sigmoid. In practice, the additional
parameter a has to be fine-tuned depending on the application. Like the ReLU, the parametric
ReLU is not differentiable at 0.
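For reference, the activation functions discussed in this section can be written in a few lines of NumPy; the default a = 0.1 for the parametric ReLU is an arbitrary choice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def silu(x):
    return x * sigmoid(x)                  # x / (1 + e^(-x))

def parametric_relu(x, a=0.1):
    return np.maximum(x, a * x)            # leaky ReLU, a in (0, 1)

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(xs), silu(xs), parametric_relu(xs), sep="\n")
```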

Bibliography and further reading


The concept of neural networks was first introduced by McCulloch and Pitts in [142]. Later
Rosenblatt [192] introduced the perceptron, an artificial neuron with adjustable weights that forms
the basis of the multilayer perceptron (a fully connected feedforward neural network). The vanishing
gradient problem shortly addressed in Section 2.3 was discussed by Hochreiter in his diploma thesis
[91] and later in [17, 93].

Exercises
Exercise 2.5. Prove Proposition 2.3.

Exercise 2.6. In this exercise, we show that ReLU and parametric ReLU create similar sets of
neural network functions. Fix a > 0.

(i) Find a set of weight matrices and bias vectors such that the associated neural network Φ1 , with the ReLU activation function σReLU , satisfies Φ1 (x) = σa (x) for all x ∈ R.

(ii) Find a set of weight matrices and bias vectors, such that the associated neural network Φ2 ,
with the parametric ReLU activation function σa satisfies Φ2 (x) = σReLU (x) for all x ∈ R.

(iii) Conclude that every ReLU neural network can be expressed as a leaky ReLU neural network
and vice versa.

Exercise 2.7. Let d ∈ N, and let Φ1 be a neural network with the ReLU as activation function,
input dimension d, and output dimension 1. Moreover, let Φ2 be a neural network with the sigmoid
activation function, input dimension d, and output dimension 1. Show that, if Φ1 = Φ2 , then Φ1 is
a constant function.

Exercise 2.8. In this exercise, we show that for the sigmoid activation function, dead-neuron-like
behavior is very rare. Let Φ be a neural network with the sigmoid activation function. Assume
that Φ is a constant function. Show that for every ε > 0 there is a non-constant neural network Φ̃
with the same architecture as Φ such that for all ℓ = 0, . . . , L,

∥W(ℓ) − W̃(ℓ)∥ ≤ ε and ∥b(ℓ) − b̃(ℓ)∥ ≤ ε,

where W(ℓ), b(ℓ) are the weights and biases of Φ and W̃(ℓ), b̃(ℓ) are the weights and biases of Φ̃.
Show that such a statement does not hold for ReLU neural networks. What about leaky ReLU?

Chapter 3

Universal approximation

After introducing neural networks in Chapter 2, it is natural to inquire about their capabilities.
Specifically, we might wonder if there exist inherent limitations to the type of functions a neural
network can represent. Could there be a class of functions that neural networks cannot approx-
imate? If so, it would suggest that neural networks are specialized tools, similar to how linear
regression is suited for linear relationships, but not for data with nonlinear relationships.
In this chapter, we will show that this is not the case, and neural networks are indeed a universal
tool. More precisely, given sufficiently large and complex architectures, they can approximate
almost every sensible input-output relationship. We will formalize and prove this claim in the
subsequent sections.

3.1 A universal approximation theorem


To analyze what kind of functions can be approximated with neural networks, we start by consid-
ering the uniform approximation of continuous functions f : Rd → R on compact sets. To this end,
we first introduce the notion of compact convergence.

Definition 3.1. Let d ∈ N. A sequence of functions fn : Rd → R, n ∈ N, is said to converge
compactly to a function f : Rd → R, if for every compact K ⊆ Rd it holds that
limn→∞ supx∈K |fn (x) − f (x)| = 0. In this case we write fn →^cc f .

Throughout what follows, we always consider C 0 (Rd ) equipped with the topology of Defini-
tion 3.1 (also see Exercise 3.22), and every subset such as C 0 (D) with the subspace topology:
for example, if D ⊆ Rd is bounded, then convergence in C 0 (D) refers to uniform convergence
limn→∞ supx∈D |fn (x) − f (x)| = 0.

3.1.1 Universal approximators


As stated before, we want to show that deep neural networks can approximate every continuous
function in the sense of Definition 3.1. We call sets of functions that satisfy this property universal
approximators.

Definition 3.2. Let d ∈ N. A set of functions H from Rd to R is a universal approximator (of
C 0 (Rd )), if for every ε > 0, every compact K ⊆ Rd , and every f ∈ C 0 (Rd ), there exists g ∈ H such
that supx∈K |f (x) − g(x)| < ε.

For a set of (not necessarily continuous) functions H mapping between Rd and R, we denote by
H^cc its closure with respect to compact convergence.
The relationship between a universal approximator and the closure with respect to compact
convergence is established in the proposition below.

Proposition 3.3. Let d ∈ N and H be a set of functions from Rd to R. Then, H is a universal
approximator of C 0 (Rd ) if and only if C 0 (Rd ) ⊆ H^cc .

Proof. Suppose that H is a universal approximator and fix f ∈ C 0 (Rd ). For n ∈ N, define Kn :=
[−n, n]d ⊆ Rd . Then for every n ∈ N there exists fn ∈ H such that supx∈Kn |fn (x) − f (x)| < 1/n.
Since for every compact K ⊆ Rd there exists n0 such that K ⊆ Kn for all n ≥ n0 , it holds fn →^cc f .
The “only if” part of the assertion is trivial.
A key tool to show that a set is a universal approximator is the Stone-Weierstrass theorem, see
for instance [196, Sec. 5.7].

Theorem 3.4 (Stone-Weierstrass). Let d ∈ N, let K ⊆ Rd be compact, and let H ⊆ C 0 (K, R)


satisfy that

(a) for all x ∈ K there exists f ∈ H such that f (x) ̸= 0,

(b) for all x ̸= y ∈ K there exists f ∈ H such that f (x) ̸= f (y),

(c) H is an algebra of functions, i.e., H is closed under addition, multiplication and scalar mul-
tiplication.

Then H is dense in C 0 (K).

Example 3.5 (Polynomials are dense in C 0 (Rd )). For a multiindex α = (α1 , . . . , αd ) ∈ Nd0 and a
vector x = (x1 , . . . , xd ) ∈ Rd denote x^α := ∏_{j=1}^d xj^{αj}. In the following, with |α| := Σ_{j=1}^d αj, we
write

Pn := span{x^α | α ∈ Nd0 , |α| ≤ n},

i.e., Pn is the space of polynomials of degree at most n (with real coefficients). It is easy to check
that P := ∪_{n∈N} Pn (Rd ) satisfies the assumptions of Theorem 3.4 on every compact set K ⊆ Rd .
Thus the space of polynomials P is a universal approximator of C 0 (Rd ), and by Proposition 3.3,
P is dense in C 0 (Rd ). In case we wish to emphasize the dimension of the underlying space, in the
following we will also write Pn (Rd ) or P(Rd ) to denote Pn , P respectively.

3.1.2 Shallow neural networks
With the necessary formalism established, we can now show that shallow neural networks of ar-
bitrary width form a universal approximator under certain (mild) conditions on the activation
function. The results in this section are based on [132], and for the proofs we follow the arguments
in that paper.
We first introduce notation for the set of all functions realized by certain architectures.

Definition 3.6. Let d, m, L, n ∈ N and σ : R → R. The set of all functions realized by neural
networks with d-dimensional input, m-dimensional output, depth at most L, width at most n, and
activation function σ is denoted by

Ndm (σ; L, n) := {Φ : Rd → Rm | Φ as in Def. 2.1, depth(Φ) ≤ L, width(Φ) ≤ n}.

Furthermore,
[
Ndm (σ; L) := Ndm (σ; L, n).
n∈N

In the sequel, we require the activation function σ to belong to the set of piecewise continuous
and locally bounded functions

M := { σ ∈ L∞loc (R) | there exist intervals I1 , . . . , IM partitioning R,
s.t. σ ∈ C 0 (Ij ) for all j = 1, . . . , M }.   (3.1.1)

Here, M ∈ N is finite, and the intervals Ij are understood to have positive (possibly infinite)
Lebesgue measure, i.e. Ij is e.g. not allowed to be empty or a single point. Hence, σ is a piecewise
continuous function with at most finitely many points of discontinuity.

Example 3.7. Activation functions belonging to M include, in particular, all continuous non-
polynomial functions, which in turn includes all practically relevant activation functions such as
the ReLU, the SiLU, and the Sigmoid discussed in Section 2.3. In these cases, we can choose M = 1
and I1 = R. Discontinuous functions include for example the Heaviside function x 7→ 1x>0 (also
called a “perceptron” in this context) but also x 7→ 1x>0 sin(1/x): Both belong to M with M = 2,
I1 = (−∞, 0] and I2 = (0, ∞). We exclude for example the function x 7→ 1/x, which is not locally
bounded.

The rest of this subsection is dedicated to proving the following theorem that has now already
been announced repeatedly.

Theorem 3.8. Let d ∈ N and σ ∈ M. Then Nd1 (σ; 1) is a universal approximator of C 0 (Rd ) if
and only if σ is not a polynomial.

Remark 3.9. We will see in Exercise 3.26 and Corollary 3.18 that neural networks can also arbitrarily
well approximate non-continuous functions with respect to suitable norms.
The universal approximation theorem by Leshno, Lin, Pinkus and Schocken [132]—of which
Theorem 3.8 is a special case—is even formulated for a much larger set M, which allows for
activation functions that have discontinuities at a (possibly non-finite) set of Lebesgue measure
zero. Instead of proving the theorem in this generality, we resort to the simpler case stated above.
This allows us to avoid some technicalities, but the main ideas remain the same. The proof strategy
is to verify the following three claims:
(i) if C 0 (R1 ) ⊆ N11 (σ; 1)^cc then C 0 (Rd ) ⊆ Nd1 (σ; 1)^cc ,

(ii) if σ ∈ C ∞ (R) is not a polynomial then C 0 (R1 ) ⊆ N11 (σ; 1)^cc ,

(iii) if σ ∈ M is not a polynomial then there exists σ̃ ∈ C ∞ (R) ∩ N11 (σ; 1)^cc which is not a
polynomial.

Upon observing that σ̃ ∈ N11 (σ; 1)^cc implies N11 (σ̃; 1)^cc ⊆ N11 (σ; 1)^cc , it is easy to see that these
statements together with Proposition 3.3 establish the implication “⇐” asserted in Theorem 3.8.
The reverse direction is straightforward to check and will be the content of Exercise 3.23.
We start with a more general version of (i) and reduce the problem to the one dimensional case.

Lemma 3.10. Assume that H is a universal approximator of C 0 (R). Then for every d ∈ N

span{x 7→ g(w · x) | w ∈ Rd , g ∈ H}

is a universal approximator of C 0 (Rd ).

Proof. For k ∈ N0 , denote by Hk the space of all k-homogeneous polynomials, that is

Hk := span{ Rd ∋ x 7→ x^α | α ∈ Nd0 , |α| = k }.

We claim that

Hk ⊆ span{Rd ∋ x 7→ g(w · x) | w ∈ Rd , g ∈ H}^cc =: X   (3.1.2)

for all k ∈ N0 . This implies that all multivariate polynomials belong to X. An application of the
Stone-Weierstrass theorem (cp. Example 3.5) and Proposition 3.3 then conclude the Q proof.
For every α, β ∈ Nd0 with |α| = |β| = k, it holds D^β x^α = δβ,α α!, where α! := ∏_{j=1}^d αj ! and
δβ,α = 1 if β = α and δβ,α = 0 otherwise. Hence, since {x 7→ xα | |α| = k} is a basis of Hk , the
set {Dα | |α| = k} is a basis of its topological dual H′k . Thus each linear functional l ∈ H′k allows
the representation l = p(D) for some p ∈ Hk (here D stands for the differential).
By the multinomial formula

(w · x)^k = ( Σ_{j=1}^d wj xj )^k = Σ_{α∈Nd0, |α|=k} (k!/α!) w^α x^α .

Therefore, we have that (x 7→ (w · x)k ) ∈ Hk . Moreover, for every l = p(D) ∈ H′k and all w ∈ Rd
we have that

l(x 7→ (w · x)k ) = k!p(w).

Hence, if l(x 7→ (w · x)k ) = p(D)(x 7→ (w · x)k ) = 0 for all w ∈ Rd , then p ≡ 0 and thus l ≡ 0.
This implies span{x 7→ (w · x)k | w ∈ Rd } = Hk . Indeed, if there exists h ∈ Hk which is not
in span{x 7→ (w · x)k | w ∈ Rd }, then by the theorem of Hahn-Banach (see Theorem B.8), there
exists a non-zero functional in H′k vanishing on span{x 7→ (w · x)k | w ∈ Rd }. This contradicts the
previous observation.
By the universality of H it is not hard to see that x 7→ (w · x)k ∈ X for all w ∈ Rd . Therefore,
we have Hk ⊆ X for all k ∈ N0 .

By the above lemma, in order to verify that Nd1 (σ; 1) is a universal approximator, it suffices to
show that N11 (σ; 1) is a universal approximator. We first show that this is the case for sigmoidal
activations.

Definition 3.11. An activation function σ : R → R is called sigmoidal, if σ ∈ C 0 (R),


limx→∞ σ(x) = 1 and limx→−∞ σ(x) = 0.

For sigmoidal activation functions we can now conclude the universality in the univariate case.

Lemma 3.12. Let σ : R → R be monotonically increasing and sigmoidal. Then C 0 (R) ⊆ N11 (σ; 1)^cc .

We prove Lemma 3.12 in Exercise 3.24. Lemma 3.10 and Lemma 3.12 show Theorem 3.8 in
the special case where σ is monotonically increasing and sigmoidal. For the general case, let us
continue with (ii) and consider C ∞ activations.

Lemma 3.13. If σ ∈ C ∞ (R) and σ is not a polynomial, then N11 (σ; 1) is dense in C 0 (R).

Proof. Denote X := N11 (σ; 1)^cc . We show again that all polynomials belong to X. An application
of the Stone-Weierstrass theorem then gives the statement.
Fix b ∈ R and denote fx (w) := σ(wx + b) for all x, w ∈ R. By Taylor’s theorem, for h ̸= 0

(σ((w + h)x + b) − σ(wx + b))/h = (fx (w + h) − fx (w))/h
  = fx′ (w) + (h/2) fx′′ (ξ)
  = fx′ (w) + (h/2) x^2 σ′′ (ξx + b)   (3.1.3)

for some ξ = ξ(h) between w and w + h. Note that the left-hand side belongs to N11 (σ; 1) as a
function of x. Since σ ′′ ∈ C 0 (R), for every compact set K ⊆ R

sup_{x∈K} sup_{|h|≤1} |x^2 σ′′ (ξ(h)x + b)| ≤ sup_{x∈K} sup_{η∈[w−1,w+1]} |x^2 σ′′ (ηx + b)| < ∞.

Letting h → 0, as a function of x the term in (3.1.3) thus converges uniformly towards K ∋


x 7→ fx′ (w). Since K was arbitrary, x 7→ fx′ (w) belongs to X. Inductively applying the same
argument to fx^(k−1) (w), we find that x 7→ fx^(k) (w) belongs to X for all k ∈ N, w ∈ R. Observe that
fx^(k) (w) = x^k σ^(k) (wx + b). Since σ is not a polynomial, for each k ∈ N there exists bk ∈ R such that
σ (k) (bk ) ̸= 0. Choosing w = 0, we obtain that x 7→ xk belongs to X.

Finally, we come to the proof of (iii)—the claim that there exists at least one non-polynomial
C ∞ (R) function in the closure of N11 (σ; 1). The argument is split into two lemmata. Denote in the
following by Cc∞ (R) the set of compactly supported C ∞ (R) functions.

Lemma 3.14. Let σ ∈ M. Then for each φ ∈ Cc∞ (R) it holds σ ∗ φ ∈ N11 (σ; 1)^cc .

Proof. Fix φ ∈ Cc∞ (R) and let a > 0 such that supp φ ⊆ [−a, a]. We have

σ ∗ φ(x) = ∫_R σ(x − y)φ(y) dy.

Denote yj := −a + 2aj/n for j = 0, . . . , n and define for x ∈ R

fn (x) := (2a/n) Σ_{j=0}^{n−1} σ(x − yj )φ(yj ).

Clearly, fn ∈ N11 (σ; 1). We will show fn →^cc σ ∗ φ as n → ∞. To do so we verify uniform convergence
of fn towards σ ∗ φ on the interval [−b, b] with b > 0 arbitrary but fixed.
For x ∈ [−b, b]

|σ ∗ φ(x) − fn (x)| ≤ Σ_{j=0}^{n−1} ∫_{yj}^{yj+1} |σ(x − y)φ(y) − σ(x − yj )φ(yj )| dy.   (3.1.4)

Fix ε ∈ (0, 1). Since σ ∈ M, there exist z1 , . . . , zM ∈ R such that σ is continuous on R\{z1 , . . . , zM }
(cp. (3.1.1)). With Dε := ∪_{j=1}^M (zj − ε, zj + ε), observe that σ is uniformly continuous on the compact
set Kε := [−a − b, a + b] ∩ Dεc . Now let Jc ∪ Jd = {0, . . . , n − 1} be a partition (depending on x),
such that j ∈ Jc if and only if [x − yj+1 , x − yj ] ⊆ Kε . Hence, j ∈ Jd implies the existence of
i ∈ {1, . . . , M } such that the distance of zi to [x − yj+1 , x − yj ] is at most ε. Due to the interval

[x − yj+1 , x − yj ] having length 2a/n, we can bound

Σ_{j∈Jd} (yj+1 − yj ) = |∪_{j∈Jd} [x − yj+1 , x − yj ]|
  ≤ |∪_{i=1}^M [zi − ε − 2a/n, zi + ε + 2a/n]|
  ≤ M · (2ε + 4a/n).
Next, because of the local boundedness of σ and the fact that φ ∈ Cc∞ , it holds sup|y|≤a+b |σ(y)| +
sup|y|≤a |φ(y)| =: γ < ∞. Hence

|σ ∗ φ(x) − fn (x)|
  ≤ Σ_{j∈Jc∪Jd} ∫_{yj}^{yj+1} |σ(x − y)φ(y) − σ(x − yj )φ(yj )| dy
  ≤ 2γ^2 M · (2ε + 4a/n)
  + 2a sup_{j∈Jc} max_{y∈[yj ,yj+1]} |σ(x − y)φ(y) − σ(x − yj )φ(yj )|.   (3.1.5)

We can bound the term in the last maximum by

|σ(x − y)φ(y) − σ(x − yj )φ(yj )|
  ≤ |σ(x − y) − σ(x − yj )||φ(y)| + |σ(x − yj )||φ(y) − φ(yj )|
  ≤ γ · ( sup_{z1,z2∈Kε, |z1−z2|≤2a/n} |σ(z1 ) − σ(z2 )| + sup_{z1,z2∈[−a,a], |z1−z2|≤2a/n} |φ(z1 ) − φ(z2 )| ).

Finally, uniform continuity of σ on Kε and φ on [−a, a] imply that the last term tends to 0 as
n → ∞ uniformly for all x ∈ [−b, b]. This shows that there exist C < ∞ (independent of ε and x)
and nε ∈ N (independent of x) such that the term in (3.1.5) is bounded by Cε for all n ≥ nε . Since
ε was arbitrary, this yields the claim.
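The Riemann-sum construction from this proof can be tried out numerically. The sketch below (our own illustration; the choices of σ as the ReLU, of the bump function φ, and of the grid sizes are arbitrary) builds the shallow network fn(x) = (2a/n) Σ_j σ(x − yj)φ(yj) and compares it with a quadrature approximation of σ ∗ φ.

import numpy as np

def sigma(x):                      # any activation in M works; here the ReLU
    return np.maximum(x, 0.0)

def phi(y):                        # smooth bump with support in [-1, 1]
    out = np.zeros_like(y)
    inside = np.abs(y) < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - y[inside] ** 2))
    return out

a, n = 1.0, 2000
y = -a + 2.0 * a * np.arange(n) / n                    # y_j = -a + 2aj/n

def f_n(x):
    # shallow network: Riemann sum of shifted activations
    return (2.0 * a / n) * np.sum(sigma(x[:, None] - y[None, :]) * phi(y)[None, :], axis=1)

def conv(x, m=20000):                                  # reference values of (sigma * phi)(x)
    t = np.linspace(-a, a, m)
    return np.trapz(sigma(x[:, None] - t[None, :]) * phi(t)[None, :], t, axis=1)

x = np.linspace(-2.0, 2.0, 9)
print(np.max(np.abs(f_n(x) - conv(x))))                # small uniform error on [-2, 2]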

Lemma 3.15. If σ ∈ M and σ ∗ φ is a polynomial for all φ ∈ Cc∞ (R), then σ is a polynomial.

Proof. Fix −∞ < a < b < ∞ and consider Cc∞ (a, b) := {φ ∈ C ∞ (R) | supp φ ⊆ [a, b]}. Define a
metric ρ on Cc∞ (a, b) via
ρ(φ, ψ) := Σ_{j∈N0} 2^{−j} |φ − ψ|C^j(a,b) / (1 + |φ − ψ|C^j(a,b)),

where

|φ|C^j(a,b) := sup_{x∈[a,b]} |φ^(j) (x)|.

Since the space of j times differentiable functions on [a, b] is complete with respect to the norm
Σ_{i=0}^j | · |C^i(a,b), see for instance [89, Satz 104.3], the space Cc∞ (a, b) is complete with the metric ρ.
For k ∈ N set

Vk := {φ ∈ Cc∞ (a, b) | σ ∗ φ ∈ Pk },

where Pk := span{R ∋ x 7→ xj | 0 ≤ j ≤ k} denotes the space of polynomials of degree at most


k. Then Vk is closed with respect to the metric ρ. To see this, we only need to observe that for
a converging sequence φj → φ∗ with respect to ρ and φj ∈ Vk , it follows that Dk+1 (σ ∗ φ∗ ) = 0
and hence σ ∗ φ∗ is a polynomial. Since Dk+1 (σ ∗ φj ) = 0 we compute with the linearity of the
convolution and the fact that Dk+1 (f ∗ g) = f ∗ Dk+1 (g) for differentiable g and if both sides are
well-defined that

sup_{x∈[a,b]} |D^{k+1} (σ ∗ φ∗ )(x)|
  = sup_{x∈[a,b]} |σ ∗ D^{k+1} (φ∗ − φj )(x)|
  ≤ |b − a| sup_{z∈[a−b,b−a]} |σ(z)| · sup_{x∈[a,b]} |D^{k+1} (φj − φ∗ )(x)|

and since σ is locally bounded, the right hand-side converges to 0.


By assumption we have
∪_{k∈N} Vk = Cc∞ (a, b).

Baire’s category theorem implies the existence of k0 ∈ N (depending on a, b) such that Vk0 contains
an open subset of Cc∞ (a, b). Since Vk0 is a vector space, it must hold Vk0 = Cc∞ (a, b).
We now show that φ ∗ σ ∈ Pk0 for every φ ∈ Cc∞ (R); in other words, k0 = k0 (a, b) can be chosen
independent of a and b. First consider a shift s ∈ R and let ã := a + s and b̃ := b + s. Then with
S(x) := x + s, for any φ ∈ Cc∞ (ã, b̃) holds φ ◦ S ∈ Cc∞ (a, b), and thus (φ ◦ S) ∗ σ ∈ Pk0 . Since
(φ ◦ S) ∗ σ(x) = φ ∗ σ(x + s), we conclude that φ ∗ σ ∈ Pk0 . Next let −∞ < ã < b̃ < ∞ be arbitrary.
Then, for an integer n > (b̃ − ã)/(b − a) we can cover (ã, b̃) with n ∈ N overlapping open intervals
(a1 , b1 ), . . . , (an , bn ), each of length b − a. Any φ ∈ Cc∞ (ã, b̃) can be written as φ = Σ_{j=1}^n φj where
φj ∈ Cc∞ (aj , bj ). Then φ ∗ σ = Σ_{j=1}^n φj ∗ σ ∈ Pk0 , and thus φ ∗ σ ∈ Pk0 for every φ ∈ Cc∞ (R).
Finally, Exercise 3.25 implies σ ∈ Pk0 .

Now we can put everything together to show Theorem 3.8.

Proof of Theorem 3.8. By Exercise 3.23 we have the implication “⇒”.


For the other direction we assume that σ ∈ M is not a polynomial. Then by Lemma 3.15
there exists φ ∈ Cc∞ (R) such that σ ∗ φ is not a polynomial. According to Lemma 3.14 we have
cc
σ ∗ φ ∈ N11 (σ; 1) . We conclude with Lemma 3.13 that N11 (σ; 1) is a universal approximator of
C 0 (R).
Finally, by Lemma 3.10, Nd1 (σ; 1) is a universal approximator of C 0 (Rd ).

3.1.3 Deep neural networks
Theorem 3.8 shows the universal approximation capability of single-hidden-layer neural networks
with activation functions σ ∈ M\P: they can approximate every continuous function on every
compact set to arbitrary precision, given sufficient width. This result directly extends to neural
networks of any fixed depth L ≥ 1. The idea is to use the fact that the identity function can be
approximated with a shallow neural network. Composing a shallow neural network approximation of
the target function f with (multiple) shallow neural networks approximating the identity function,
gives a deep neural network approximation of f .
Instead of directly applying Theorem 3.8, we first establish the following proposition regarding
the approximation of the identity function. Rather than σ ∈ M\P, it requires a different (mild)
assumption on the activation function. This allows for a constructive proof, yielding explicit bounds
on the neural network size, which will prove useful later in the book.

Proposition 3.16. Let d, L ∈ N, let K ⊆ Rd be compact, and let σ : R → R be such that there
exists an open set on which σ is differentiable and not constant. Then, for every ε > 0, there exists
a neural network Φ ∈ Ndd (σ; L, d) such that

∥Φ(x) − x∥∞ < ε for all x ∈ K.

Proof. The proof uses the same idea as in Lemma 3.13, where we approximate the derivative of the
activation function by a simple neural network. Let us first assume d ∈ N and L = 1.
Let x∗ ∈ R be such that σ is differentiable on a neighborhood of x∗ and σ ′ (x∗ ) = θ ̸= 0.
Moreover, let x∗ = (x∗ , . . . , x∗ ) ∈ Rd . Then, for λ > 0 we define

Φλ (x) := (λ/θ) σ(x/λ + x∗ ) − (λ/θ) σ(x∗ ).

Then, we have, for all x ∈ K,

Φλ (x) − x = λ (σ(x/λ + x∗ ) − σ(x∗ ))/θ − x.   (3.1.6)

If xi = 0 for i ∈ {1, . . . , d}, then (3.1.6) shows that (Φλ (x) − x)i = 0. Otherwise

|(Φλ (x) − x)i | = (|xi |/|θ|) |(σ(xi /λ + x∗ ) − σ(x∗ ))/(xi /λ) − θ|.

By the definition of the derivative, we have that |(Φλ (x) − x)i | → 0 for λ → ∞ uniformly for all
x ∈ K and i ∈ {1, . . . , d}. Therefore, |Φλ (x) − x| → 0 for λ → ∞ uniformly for all x ∈ K.
The extension to L > 1 is straightforward and is the content of Exercise 3.27.
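The scaling construction in this proof is easily checked numerically. Below is a small sketch (our own illustration, using the sigmoid with x∗ = 0, so that θ = σ′(0) = 1/4) confirming that the error of Φλ on a compact set decays as λ grows.

import numpy as np

def sigma(x):                                 # sigmoid activation
    return 1.0 / (1.0 + np.exp(-x))

x_star, theta = 0.0, 0.25                     # sigma'(x_star) = 1/4 =: theta != 0

def phi_lambda(x, lam):
    # coordinatewise approximation of the identity from the proof
    return (lam / theta) * (sigma(x / lam + x_star) - sigma(x_star))

x = np.linspace(-2.0, 2.0, 401)               # compact set K = [-2, 2]
for lam in [1.0, 10.0, 100.0, 1000.0]:
    print(lam, np.max(np.abs(phi_lambda(x, lam) - x)))   # error tends to 0 as lam grows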

Using the aforementioned generalization of Proposition 3.16 to arbitrary non-polynomial acti-


vation functions σ ∈ M, we obtain the following extension of Theorem 3.8.

Corollary 3.17. Let d ∈ N, L ∈ N and σ ∈ M. Then Nd1 (σ; L) is a universal approximator of
C 0 (Rd ) if and only if σ is not a polynomial.

Proof. We only show the implication “⇐”. The other direction is again left as an exercise, see
Exercise 3.23.
Assume σ ∈ M is not a polynomial, let K ⊆ Rd be compact, and let f ∈ C 0 (Rd ). Fix ε ∈ (0, 1).
We need to show that there exists a neural network Φ ∈ Nd1 (σ; L) such that supx∈K |f (x)−Φ(x)| <
ε. The case L = 1 holds by Theorem 3.8, so let L > 1.
By Theorem 3.8, there exists Φshallow ∈ Nd1 (σ; 1) such that

sup_{x∈K} |f (x) − Φshallow (x)| < ε/2.   (3.1.7)

Compactness of {f (x) | x ∈ K} implies that we can find n > 0 such that

{Φshallow (x) | x ∈ K} ⊆ [−n, n].   (3.1.8)

Let Φid ∈ N11 (σ; L − 1) be an approximation to the identity such that

sup_{x∈[−n,n]} |x − Φid (x)| < ε/2,   (3.1.9)

which is possible by the extension of Proposition 3.16 to general non-polynomial activation functions
σ ∈ M.
Denote Φ := Φid ◦ Φshallow . According to Proposition 2.3 (iv), Φ ∈ Nd1 (σ; L) as desired.
Moreover, (3.1.7), (3.1.8), and (3.1.9) imply

sup_{x∈K} |f (x) − Φ(x)| = sup_{x∈K} |f (x) − Φid (Φshallow (x))|
  ≤ sup_{x∈K} ( |f (x) − Φshallow (x)| + |Φshallow (x) − Φid (Φshallow (x))| )
  ≤ ε/2 + ε/2 = ε.
This concludes the proof.

3.1.4 Other norms


Additional to the continuous functions, universal approximation theorems can be shown for various
other function classes and topologies, which may also allow for the approximation of functions
exhibiting discontinuities or singularities. To give but one example, we next state such a result for
Lebesgue spaces on compact sets. The proof is left to the reader, see Exercise 3.26.

Corollary 3.18. Let d ∈ N, L ∈ N, p ∈ [1, ∞), and let σ ∈ M not be a polynomial. Then for
every ε > 0, every compact K ⊆ Rd , and every f ∈ Lp (K) there exists Φf,ε ∈ Nd1 (σ; L) such that
( ∫_K |f (x) − Φf,ε (x)|^p dx )^{1/p} ≤ ε.

3.2 Superexpressive activations and Kolmogorov’s superposition
theorem
In the previous section, we saw that a large class of activation functions allow for universal approx-
imation. However, these results did not provide any insights into the necessary neural network size
for achieving a specific accuracy.
Before exploring this topic further in the following chapters, we next present a remarkable result
that shows how the required neural network size is significantly influenced by the choice of activation
function. The result asserts that, with the appropriate activation function, every f ∈ C 0 (K) on a
compact set K ⊆ Rd can be approximated to every desired accuracy ε > 0 using a neural network
of size O(d2 ); in particular the neural network size is independent of ε > 0, K, and f . We will first
discuss the one-dimensional case.

Proposition 3.19. There exists a continuous activation function σ : R → R such that for every
compact K ⊆ R, every ε > 0 and every f ∈ C 0 (K) there exists Φ(x) = σ(wx + b) ∈ N11 (σ; 1, 1)
such that

sup_{x∈K} |f (x) − Φ(x)| < ε.

Proof. Denote by P̃n all polynomials p(x) = Σ_{j=0}^n qj x^j with rational coefficients, i.e. such that
qj ∈ Q for all j = 0, . . . , n. Then P̃n can be identified with the n-fold cartesian product Q × · · · × Q,
and thus P̃n is a countable set. Consequently also the set P̃ := ∪_{n∈N} P̃n of all polynomials with
rational coefficients is countable. Let (pi )i∈Z be an enumeration of these polynomials, and set

σ(x) := pi (x − 2i)  if x ∈ [2i, 2i + 1],  and
σ(x) := pi (1)(2i + 2 − x) + pi+1 (0)(x − 2i − 1)  if x ∈ (2i + 1, 2i + 2).
In words, σ equals pi on even intervals [2i, 2i + 1] and is linear on odd intervals [2i + 1, 2i + 2],
resulting in a continuous function overall.
We first assume K = [0, 1]. By Example 3.5, for every ε > 0 there exists p(x) = Σ_{j=0}^n rj x^j such
that supx∈[0,1] |p(x) − f (x)| < ε/2. Now choose qj ∈ Q so close to rj that p̃(x) := Σ_{j=0}^n qj x^j
satisfies supx∈[0,1] |p̃(x) − p(x)| < ε/2. Let i ∈ Z such that p̃(x) = pi (x), i.e., pi (x) = σ(2i + x) for
all x ∈ [0, 1]. Then supx∈[0,1] |f (x) − σ(x + 2i)| < ε.
For general compact K assume that K ⊆ [a, b]. By Tietze’s extension theorem, f allows a
continuous extension to [a, b], so without loss of generality K = [a, b]. By the first case we can find
i ∈ Z such that with y = (x − a)/(b − a) (i.e. y ∈ [0, 1] if x ∈ [a, b])
 
sup_{x∈[a,b]} |f (x) − σ((x − a)/(b − a) + 2i)| = sup_{y∈[0,1]} |f (y · (b − a) + a) − σ(y + 2i)| < ε,

which gives the statement with w = 1/(b − a) and b = −a/(b − a) + 2i.

To extend this result to arbitrary dimension, we will use Kolmogorov’s superposition theorem.
It states that every continuous function of d variables can be expressed as a composition of functions
that each depend only on one variable. We omit the technical proof, which can be found in [120].

Theorem 3.20 (Kolmogorov). For every d ∈ N there exist 2d2 + d monotonically increasing
functions φi,j ∈ C 0 (R), i = 1, . . . , d, j = 1, . . . , 2d + 1, such that for every f ∈ C 0 ([0, 1]d ) there
exist functions fj ∈ C 0 (R), j = 1, . . . , 2d + 1 satisfying
f (x) = Σ_{j=1}^{2d+1} fj ( Σ_{i=1}^d φi,j (xi ) ) for all x ∈ [0, 1]d .

Corollary 3.21. Let d ∈ N. With the activation function σ : R → R from Proposition 3.19, for
every compact K ⊆ Rd , every ε > 0 and every f ∈ C 0 (K) there exists Φ ∈ Nd1 (σ; 2, 2d2 + d) (i.e.
width(Φ) = 2d2 + d and depth(Φ) = 2) such that

sup_{x∈K} |f (x) − Φ(x)| < ε.

Proof. Without loss of generality we can assume K = [0, 1]d : the extension to the general case then
follows by Tietze’s extension theorem and a scaling argument as in the proof of Proposition 3.19.
Let fj , φi,j , i = 1, . . . , d, j = 1, . . . , 2d + 1 be as in Theorem 3.20. Fix ε > 0. Let a > 0 be so
large that

sup_{i,j} sup_{x∈[0,1]} |φi,j (x)| ≤ a.

Since each fj is uniformly continuous on the compact set [−da, da], we can find δ > 0 such that
sup_j sup_{|y−ỹ|<δ, |y|,|ỹ|≤da} |fj (y) − fj (ỹ)| < ε/(2(2d + 1)).   (3.2.1)

By Proposition 3.19 there exist wi,j , bi,j ∈ R such that

sup_{i,j} sup_{x∈[0,1]} |φi,j (x) − φ̃i,j (x)| < δ/d, where φ̃i,j (x) := σ(wi,j x + bi,j ),   (3.2.2)

and wj , bj ∈ R such that

sup_j sup_{|y|≤a+δ} |fj (y) − f̃j (y)| < ε/(2(2d + 1)), where f̃j (y) := σ(wj y + bj ).   (3.2.3)

Then for all x ∈ [0, 1]d , by (3.2.2),


| Σ_{i=1}^d φi,j (xi ) − Σ_{i=1}^d φ̃i,j (xi ) | < d · (δ/d) = δ.

Thus with

yj := Σ_{i=1}^d φi,j (xi ),   ỹj := Σ_{i=1}^d φ̃i,j (xi )

it holds |yj − ỹj | < δ. Using (3.2.1) and (3.2.3) we conclude

| f (x) − Σ_{j=1}^{2d+1} σ( wj · Σ_{i=1}^d σ(wi,j xi + bi,j ) + bj ) | = | Σ_{j=1}^{2d+1} (fj (yj ) − f̃j (ỹj )) |
  ≤ Σ_{j=1}^{2d+1} ( |fj (yj ) − fj (ỹj )| + |fj (ỹj ) − f̃j (ỹj )| )
  ≤ Σ_{j=1}^{2d+1} ( ε/(2(2d + 1)) + ε/(2(2d + 1)) ) ≤ ε.

This concludes the proof.

Kolmogorov’s superposition theorem is intriguing as it shows that approximating d-dimensional


functions can be reduced to the (generally much simpler) one-dimensional case through composi-
tions. Neural networks, by nature, are well suited to approximate functions with compositional
structures. However, the functions fj in Theorem 3.20, even though only one-dimensional, could
become very complex and challenging to approximate themselves if d is large.
Similarly, the “magic” activation function in Proposition 3.19 encodes the information of all
rational polynomials on the unit interval, which is why a neural network of size O(1) suffices to
approximate every function to arbitrary accuracy. Naturally, no practical algorithm can efficiently
identify appropriate neural network weights and biases for this architecture. As such, the results
presented in Section 3.2 should be taken with a pinch of salt as their practical relevance is highly
limited. Nevertheless, they highlight that while universal approximation is a fundamental and
important property of neural networks, it leaves many aspects unexplored. To gain further insight
into practically relevant architectures, in the following chapters, we investigate neural networks
with activation functions such as the ReLU.

Bibliography and further reading


The foundation of universal approximation theorems goes back to the late 1980s with seminal works
by Cybenko [44], Hornik et al. [95, 94], Funahashi [63] and Carroll and Dickinson [33]. These results
were subsequently extended to a wider range of activation functions and architectures. The present
analysis in Section 3.1 closely follows the arguments in [132], where it was essentially shown that
universal approximation can be achieved if the activation function is not polynomial.
Kolmogorov’s superposition theorem stated in Theorem 3.20 was originally proven in 1957
[120]. For a more recent and constructive proof see for instance [26]. Kolmogorov’s theorem
and its obvious connections to neural networks have inspired various research in this field, e.g.
[162, 124, 151, 205, 104], with its practical relevance being debated [68, 123]. The idea for the
“magic” activation function in Section 3.2 comes from [140] where it is shown that such an activation
function can even be chosen monotonically increasing.

Exercises
Exercise 3.22. Write down a generator of a (minimal) topology on C 0 (Rd ) such that fn → f ∈
C 0 (Rd ) if and only if fn →^cc f , and show this equivalence. This topology is referred to as the
topology of compact convergence.

Exercise 3.23. Show the implication “⇒” of Theorem 3.8 and Corollary 3.17.

Exercise 3.24. Prove Lemma 3.12. Hint: Consider σ(nx) for large n ∈ N.

Exercise 3.25. Let k ∈ N, σ ∈ M and assume that σ ∗ φ ∈ Pk for all φ ∈ Cc∞ (R). Show that
σ ∈ Pk .
Hint: Consider ψ ∈ Cc∞ (R) such that ψ ≥ 0 and ∫_R ψ(x) dx = 1, and set ψε (x) := ψ(x/ε)/ε.

Use that away from the discontinuities of σ it holds ψε ∗ σ(x) → σ(x) as ε → 0. Conclude that σ
is piecewise in Pk , and finally show that σ ∈ C k (R).

Exercise 3.26. Prove Corollary 3.18 with the use of Corollary 3.17.

Exercise 3.27. Complete the proof of Proposition 3.16 for L > 1.

Chapter 4

Splines

In Chapter 3, we saw that sufficiently large neural networks can approximate every continuous
function to arbitrary accuracy. However, these results did not further specify the meaning of
“sufficiently large” or what constitutes a suitable architecture. Ideally, given a function f , and a
desired accuracy ε > 0, we would like to have a (possibly sharp) bound on the required size, depth,
and width guaranteeing the existence of a neural network approximating f up to error ε.
The field of approximation theory establishes such trade-offs between properties of the function f
(e.g., its smoothness), the approximation accuracy, and the number of parameters needed to achieve
this accuracy. For example, given k, d ∈ N, how many parameters are required to approximate a
function f : [0, 1]d → R with ∥f ∥C k ([0,1]d ) ≤ 1 up to uniform error ε? Splines are known to achieve
this approximation accuracy with a superposition of O(ε−d/k ) simple (piecewise polynomial) basis
functions. In this chapter, following [146], we show that certain sigmoidal neural networks can
match this performance in terms of the neural network size. In fact, from an approximation
theoretical viewpoint we show that the considered neural networks are at least as expressive as
superpositions of splines.

4.1 B-splines and smooth functions


We introduce a simple type of spline and its approximation properties below.

Definition 4.1. For n ∈ N, the univariate cardinal B-spline of order n is given by

Sn (x) := (1/(n − 1)!) Σ_{ℓ=0}^n (−1)^ℓ (n choose ℓ) σReLU (x − ℓ)^{n−1} for x ∈ R,   (4.1.1)

where 0^0 := 0 and σReLU denotes the ReLU activation function.

By shifting and dilating the cardinal B-spline, we obtain a system of univariate splines. Taking
tensor products of these univariate splines yields a set of higher-dimensional functions known as
the multivariate B-splines.

Definition 4.2. For t ∈ R and n, ℓ ∈ N we define Sℓ,t,n := Sn (2^ℓ (· − t)). Additionally, for d ∈ N,
t ∈ Rd , and n, ℓ ∈ N, we define the multivariate B-spline S^d_{ℓ,t,n} as

S^d_{ℓ,t,n} (x) := ∏_{i=1}^d Sℓ,ti,n (xi ) for x = (x1 , . . . , xd ) ∈ Rd ,

and

B n := { S^d_{ℓ,t,n} | ℓ ∈ N, t ∈ Rd }

is the dictionary of B-splines of order n.
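A direct implementation of (4.1.1) and of the tensor-product construction may look as follows (our own sketch; NumPy and math.comb are assumed, and the evaluation points are arbitrary choices). The second print line checks the partition-of-unity property of cardinal B-splines as a sanity check.

import numpy as np
from math import comb, factorial

def relu(x):
    return np.maximum(x, 0.0)

def cardinal_bspline(x, n):
    # S_n(x) = (1/(n-1)!) sum_{l=0}^{n} (-1)^l binom(n, l) relu(x - l)^(n-1), cf. (4.1.1), n >= 2
    s = sum((-1) ** l * comb(n, l) * relu(x - l) ** (n - 1) for l in range(n + 1))
    return s / factorial(n - 1)

def multivariate_bspline(x, n, level, t):
    # S^d_{level, t, n}(x) = prod_i S_n(2^level (x_i - t_i)), cf. Definition 4.2
    return np.prod([cardinal_bspline(2.0 ** level * (xi - ti), n) for xi, ti in zip(x, t)])

print(cardinal_bspline(np.linspace(0.0, 4.0, 9), 4))             # supported in [0, 4]
print(sum(cardinal_bspline(1.3 - k, 4) for k in range(-4, 5)))   # integer shifts sum to 1
print(multivariate_bspline([0.3, 0.6], 4, 1, [0.0, 0.0]))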

Having introduced the system B n , we would like to understand how well we can represent each
smooth function by superpositions of elements of B n . The following theorem is adapted from the
more general result [168, Theorem 7]; also see [141, Theorem D.3] for a presentation closer to the
present formulation.

Theorem 4.3. Let d, n, k ∈ N such that 0 < k ≤ n. Then there exists C such that for every
f ∈ C k ([0, 1]d ) and every N ∈ N, there exist ci ∈ R with |ci | ≤ C∥f ∥L∞ ([0,1]d ) and Bi ∈ B n for
i = 1, . . . , N , such that
∥ f − Σ_{i=1}^N ci Bi ∥_{L∞([0,1]d)} ≤ C N^{−k/d} ∥f ∥_{C^k([0,1]d)} .

Remark 4.4. There are a couple of critical concepts in Theorem 4.3 that will reappear throughout
this book. The number of parameters N determines the approximation accuracy N −k/d . This im-
plies that achieving accuracy ε > 0 requires O(ε−d/k ) parameters (according to this upper bound),
which grows exponentially in d. This exponential dependence on d is referred to as the “curse of
dimension” and will be discussed again in the subsequent chapters. The smoothness parameter
k has the opposite effect of d, and improves the convergence rate. Thus, smoother functions can
be approximated with fewer B-splines than rougher functions. This more efficient approximation
requires the use of B-splines of order n with n ≥ k. We will see in the following, that the order of
the B-spline is closely linked to the concept of depth in neural networks.
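For example, with smoothness k = 2 and target accuracy ε = 10^{−2}, the bound N = O(ε^{−d/k}) becomes O(10^d): on the order of 10^2 basis functions for d = 2, 10^4 for d = 4, and already 10^10 for d = 10.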

4.2 Reapproximation of B-splines with sigmoidal activations


We now show that the approximation rates of B-splines can be transferred to certain neural networks.
The following argument is based on [144].

Definition 4.5. A function σ : R → R is called sigmoidal of order q ∈ N, if σ ∈ C^{q−1} (R) and
there exists C > 0 such that

σ(x)/x^q → 0 as x → −∞,
σ(x)/x^q → 1 as x → ∞,
|σ(x)| ≤ C · (1 + |x|)^q for all x ∈ R.

Example 4.6. The rectified power unit x 7→ σReLU (x)q is sigmoidal of order q.
Our goal in the following is to show that neural networks can approximate a linear combination
of N B-splines with a number of parameters that is proportional to N . As an immediate conse-
quence of Theorem 4.3, we then obtain a convergence rate for neural networks. Let us start by
approximating a single univariate B-spline with a neural network of fixed size.

Proposition 4.7. Let n ∈ N, n ≥ 2, K > 0, and let σ : R → R be sigmoidal of order q ≥ 2. There


exists a constant C > 0 such that for every ε > 0 there is a neural network ΦSn with activation
function σ, ⌈logq (n − 1)⌉ layers, and size C, such that

∥Sn − ΦSn ∥_{L∞([−K,K])} ≤ ε.

Proof. By definition (4.1.1), Sn is a linear combination of n + 1 shifts of σReLU^{n−1}. We start by
approximating σReLU^{n−1}. It is not hard to see (Exercise 4.10) that, for every K ′ > 0 and every t ∈ N

| a^{−q^t} (σ ◦ σ ◦ · · · ◦ σ)(ax) − σReLU (x)^{q^t} | → 0 as a → ∞,   (4.2.1)

where σ is composed t times, uniformly for all x ∈ [−K ′ , K ′ ].


Set t := ⌈logq (n − 1)⌉. Then t ≥ 1 since n ≥ 2, and q^t ≥ n − 1. Thus, for every K ′ > 0 and
ε > 0 there exists a neural network Φε^{q^t} with ⌈logq (n − 1)⌉ layers satisfying

| Φε^{q^t} (x) − σReLU (x)^{q^t} | ≤ ε for all x ∈ [−K ′ , K ′ ].   (4.2.2)

This shows that we can approximate the ReLU to the power of q^t ≥ n − 1. However, our goal is to
obtain an approximation of the ReLU raised to the power n − 1, which could be smaller than q^t .
To reduce the order, we emulate approximate derivatives of Φε^{q^t}. Concretely, we show the following
claim: For all 1 ≤ p ≤ q^t , every K ′ > 0, and every ε > 0 there exists a neural network Φε^p having
⌈logq (n − 1)⌉ layers and satisfying

| Φε^p (x) − σReLU (x)^p | ≤ ε for all x ∈ [−K ′ , K ′ ].   (4.2.3)

The claim holds for p = q^t . We now proceed by induction over p = q^t , q^t − 1, . . . Assume (4.2.3)
holds for some p ∈ {2, . . . , q^t }. Fix δ > 0. Then

| (Φ_{δ^2}^p (x + δ) − Φ_{δ^2}^p (x))/(pδ) − σReLU (x)^{p−1} |
  ≤ 2δ/p + | (σReLU (x + δ)^p − σReLU (x)^p)/(pδ) − σReLU (x)^{p−1} |.

Hence, by the binomial theorem it follows that there exists δ∗ > 0 such that

| (Φ_{δ∗^2}^p (x + δ∗ ) − Φ_{δ∗^2}^p (x))/(pδ∗ ) − σReLU (x)^{p−1} | ≤ ε,

for all x ∈ [−K ′ , K ′ ]. By Proposition 2.3, (Φ_{δ∗^2}^p (x + δ∗ ) − Φ_{δ∗^2}^p )/(pδ∗ ) is a neural network with
⌈logq (n − 1)⌉ layers and size independent of ε. Calling this neural network Φε^{p−1} shows that
(4.2.3) holds for p − 1, which concludes the induction argument and proves the claim.
For every neural network Φ, every spatial translation Φ(· − t) is a neural network of the same
architecture. Hence, every term in the sum (4.1.1) can be approximated to arbitrary accuracy by
a neural network of a fixed size. Since by Proposition 2.3, sums of neural networks of the same
depth are again neural networks of the same depth, the result follows.
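The limit (4.2.1) can be observed numerically. In the sketch below (our own illustration) we take σ(x) = x² σsig(x), which is a smooth function that is sigmoidal of order q = 2, and t = 2 compositions, so the rescaled composition should approach σReLU(x)^4 on a compact set as a grows.

import numpy as np
from scipy.special import expit      # numerically stable sigmoid

relu  = lambda x: np.maximum(x, 0.0)
sigma = lambda x: x ** 2 * expit(x)  # sigmoidal of order q = 2
q, t = 2, 2

def scaled_composition(x, a):
    # a^{-q^t} * (sigma o sigma)(a x), cf. (4.2.1) with t compositions
    y = a * x
    for _ in range(t):
        y = sigma(y)
    return a ** (-q ** t) * y

x = np.linspace(-2.0, 2.0, 801)
for a in [10.0, 100.0, 1000.0]:
    print(a, np.max(np.abs(scaled_composition(x, a) - relu(x) ** (q ** t))))   # decreasing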
Next, we extend Proposition 4.7 to the multivariate splines S^d_{ℓ,t,n} for arbitrary ℓ, d ∈ N, t ∈ Rd .

Proposition 4.8. Let n, d ∈ N, n ≥ 2, K > 0, and let σ : R → R be sigmoidal of order q ≥ 2.
Further let ℓ ∈ N and t ∈ Rd .
Then, there exists a constant C > 0 such that for every ε > 0 there is a neural network Φ^{S^d_{ℓ,t,n}}
with activation function σ, ⌈log2 (d)⌉ + ⌈logq (k − 1)⌉ layers, and size C, such that

∥ S^d_{ℓ,t,n} − Φ^{S^d_{ℓ,t,n}} ∥_{L∞([−K,K]^d)} ≤ ε.

Proof. By definition S^d_{ℓ,t,n} (x) = ∏_{i=1}^d Sℓ,ti,n (xi ) where

Sℓ,ti,n (xi ) = Sn (2^ℓ (xi − ti )).

By Proposition 4.7 there exists a constant C ′ > 0 such that for each i = 1, . . . , d and all ε > 0, there
is a neural network ΦSℓ,ti,n with size C ′ and ⌈logq (n − 1)⌉ layers such that

∥ Sℓ,ti,n − ΦSℓ,ti,n ∥_{L∞([−K,K]^d)} ≤ ε.

If d = 1, this shows the statement. For general d, it remains to show that the product of the ΦSℓ,ti ,n
for i = 1, . . . , d can be approximated.
We first prove the following claim by induction: For every d ∈ N, d ≥ 2, there exists a constant
C ′′ > 0, such that for all K ′ ≥ 1 and all ε > 0 there exists a neural network Φmult,ε,d with size

C ′′ , ⌈log2 (d)⌉ layers, and activation function σ such that for all x1 , . . . , xd with |xi | ≤ K ′ for all
i = 1, . . . , d,
| Φmult,ε,d (x1 , . . . , xd ) − ∏_{i=1}^d xi | < ε.   (4.2.4)

For the base case, let d = 2. Similar to the proof of Proposition 4.7, one can show that there exists
C ′′′ > 0 such that for every ε > 0 and K ′ > 0 there exists a neural network Φsquare,ε with one
hidden layer and size C ′′′ such that

|Φsquare,ε (x) − σReLU (x)^2 | ≤ ε for all |x| ≤ K ′ .

For every x = (x1 , x2 ) ∈ R^2

x1 x2 = (1/2)( (x1 + x2 )^2 − x1^2 − x2^2 )
  = (1/2)( σReLU (x1 + x2 )^2 + σReLU (−x1 − x2 )^2 − σReLU (x1 )^2
  − σReLU (−x1 )^2 − σReLU (x2 )^2 − σReLU (−x2 )^2 ).   (4.2.5)

Each term on the right-hand side can be approximated up to uniform error ε/6 with a network of
size C ′′′ and one hidden layer. By Proposition 2.3, we conclude that there exists a neural network
Φmult,ε,2 satisfying (4.2.4) for d = 2.
Assume the induction hypothesis (4.2.4) holds for d − 1 ≥ 1, and let ε > 0 and K ′ ≥ 1. We
have
∏_{i=1}^d xi = ∏_{i=1}^{⌊d/2⌋} xi · ∏_{i=⌊d/2⌋+1}^d xi .   (4.2.6)

We will now approximate each of the terms in the product on the right-hand side of (4.2.6) by a
neural network using the induction assumption.
For simplicity assume in the following that ⌈log2 (⌊d/2⌋)⌉ = ⌈log2 (d − ⌊d/2⌋)⌉. The general
case can be addressed via Proposition 3.16. By the induction assumption there then exist neural
networks Φmult,1 and Φmult,2 both with ⌈log2 (⌊d/2⌋)⌉ layers, such that for all xi with |xi | ≤ K ′ for
i = 1, . . . , d
| Φmult,1 (x1 , . . . , x⌊d/2⌋ ) − ∏_{i=1}^{⌊d/2⌋} xi | < ε/(4((K ′ )^{⌊d/2⌋} + ε)),
| Φmult,2 (x⌊d/2⌋+1 , . . . , xd ) − ∏_{i=⌊d/2⌋+1}^d xi | < ε/(4((K ′ )^{⌊d/2⌋} + ε)).

By Proposition 2.3, Φmult,ε,d := Φmult,ε/2,2 ◦(Φmult,1 , Φmult,2 ) is a neural network with 1+⌈log2 (⌊d/2⌋)⌉ =
⌈log2 (d)⌉ layers. By construction, the size of Φmult,ε,d does not depend on K ′ or ε. Thus, to complete
the induction, it only remains to show (4.2.4).
For all a, b, c, d ∈ R holds

|ab − cd| ≤ |a||b − d| + |d||a − c|.

Hence, for x1 , . . . , xd with |xi | ≤ K ′ for all i = 1, . . . , d, we have that
| ∏_{i=1}^d xi − Φmult,ε,d (x1 , . . . , xd ) |
  ≤ ε/2 + | ∏_{i=1}^{⌊d/2⌋} xi · ∏_{i=⌊d/2⌋+1}^d xi − Φmult,1 (x1 , . . . , x⌊d/2⌋ ) Φmult,2 (x⌊d/2⌋+1 , . . . , xd ) |
  ≤ ε/2 + |K ′ |^{⌊d/2⌋} · ε/(4((K ′ )^{⌊d/2⌋} + ε)) + (|K ′ |^{⌈d/2⌉} + ε) · ε/(4((K ′ )^{⌊d/2⌋} + ε)) < ε.
This completes the proof of (4.2.4).
The overall result follows by using Proposition 2.3 to show that the multiplication network can
be composed with a neural network comprised of the ΦSℓ,ti ,n for i = 1, . . . , d. Since in no step above
the size of the individual networks was dependent on the approximation accuracy, this is also true
for the final network.
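The two elementary mechanisms of this proof, the polarization identity (4.2.5) and the pairwise multiplication tree behind (4.2.6), are illustrated by the following sketch (our own code; for simplicity the squares are computed exactly rather than through the networks Φsquare,ε).

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def product_from_squares(x1, x2):
    # polarization identity (4.2.5): x1 * x2 from six squared ReLUs
    return 0.5 * (relu(x1 + x2) ** 2 + relu(-x1 - x2) ** 2
                  - relu(x1) ** 2 - relu(-x1) ** 2
                  - relu(x2) ** 2 - relu(-x2) ** 2)

def product_tree(xs):
    # pairwise products as in (4.2.6); the recursion depth is ceil(log2(d))
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return product_from_squares(product_tree(xs[:mid]), product_tree(xs[mid:]))

print(product_from_squares(-1.5, 2.0))        # -3.0
print(product_tree([0.5, -2.0, 3.0, 1.5]))    # -4.5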

Proposition 4.8 shows that we can approximate a single multivariate B-spline with a neural
network with a size that is independent of the accuracy. Combining this observation with Theorem
4.3 leads to the following result.

Theorem 4.9. Let d, n, k ∈ N such that 0 < k ≤ n and n ≥ 2. Let q ≥ 2, and let σ be sigmoidal
of order q.
Then there exists C such that for every f ∈ C k ([0, 1]d ) and every N ∈ N there exists a neural
network ΦN with activation function σ, ⌈log2 (d)⌉ + ⌈logq (k − 1)⌉ layers, and size bounded by CN ,
such that
∥ f − ΦN ∥_{L∞([0,1]d)} ≤ C N^{−k/d} ∥f ∥_{C^k([0,1]d)} .

Proof. Fix N ∈ N. By Theorem 4.3, there exist coefficients |ci | ≤ C∥f ∥L∞ ([0,1]d ) and Bi ∈ B n for
i = 1, . . . , N , such that
∥ f − Σ_{i=1}^N ci Bi ∥_{L∞([0,1]d)} ≤ C N^{−k/d} ∥f ∥_{C^k([0,1]d)} .

Moreover, by Proposition 4.8, for each i = 1, . . . , N there exists a neural network ΦBi with ⌈log2 (d)⌉ +
⌈logq (k − 1)⌉ layers, and a fixed size, which approximates Bi on [−1, 1]d ⊇ [0, 1]d up to error of
ε := N^{−k/d} /N . The size of ΦBi is independent of i and N .
By Proposition 2.3, there exists a neural network ΦN that uniformly approximates Σ_{i=1}^N ci Bi
up to error ε on [0, 1]d , and has ⌈log2 (d)⌉ + ⌈logq (k − 1)⌉ layers. The size of this network is linear
in N (see Exercise 4.11). This concludes the proof.

Theorem 4.9 shows that neural networks with higher-order sigmoidal functions can approximate
smooth functions with the same accuracy as spline approximations while having a comparable
number of parameters. The network depth is required to behave like O(log(k)) in terms of the
smoothness parameter k, cp. Remark 4.4.

Bibliography and further reading
The argument of linking sigmoidal activation functions with spline based approximation was first
introduced in [146, 144]. For further details on spline approximation, see [168] or the book [207].
The general strategy of approximating basis functions by neural networks, and then lifting ap-
proximation results for those bases has been employed widely in the literature, and will also reappear
again in this book. While the following chapters primarily focus on ReLU activation, we highlight
a few notable approaches with non-ReLU activations based on the outlined strategy: To approx-
imate analytic functions, [145] emulates a monomial basis. To approximate periodic functions, a
basis of trigonometric polynomials is recreated in [147]. Wavelet bases have been emulated in [171].
Moreover, neural networks have been studied through the representation system of ridgelets [30]
and ridge functions [103]. A general framework describing the emulation of representation systems
to transfer approximation results was presented in [21].

Exercises
Exercise 4.10. Show that (4.2.1) holds.

Exercise 4.11. Let L ∈ N, σ : R → R, and let Φ1 , Φ2 be two neural networks with architectures
(σ; d0 , d_1^(1) , . . . , d_L^(1) , dL+1 ) and (σ; d0 , d_1^(2) , . . . , d_L^(2) , dL+1 ). Show that Φ1 + Φ2 is a neural network
with size(Φ1 + Φ2 ) ≤ size(Φ1 ) + size(Φ2 ).
Exercise 4.12. Show that, for σ = σReLU^2 and k ≤ 2, for all f ∈ C k ([0, 1]d ) all weights of the approx-
imating neural network of Theorem 4.9 can be bounded in absolute value by O(max{2, ∥f ∥C k ([0,1]d ) }).

Chapter 5

ReLU neural networks

In this chapter, we discuss feedforward neural networks using the ReLU activation function σReLU
introduced in Section 2.3. We refer to these functions as ReLU neural networks. Due to its simplicity
and the fact that it reduces the vanishing and exploding gradients phenomena, the ReLU is one of
the most widely used activation functions in practice.
A key component of the proofs in the previous chapters was the approximation of derivatives of
the activation function to emulate polynomials. Since the ReLU is piecewise linear, this trick is not
applicable. This makes the analysis fundamentally different from the case of smoother activation
functions. Nonetheless, we will see that even this extremely simple activation function yields a very
rich class of functions possessing remarkable approximation capabilities.
To formalize these results, we begin this chapter by adopting a framework from [174]. This
framework enables the tracking of the number of network parameters for basic manipulations such
as adding up or composing two neural networks. This will allow to bound the network complexity,
when constructing more elaborate networks from simpler ones. With these preliminaries at hand,
the rest of the chapter is dedicated to the exploration of links between ReLU neural networks and
the class of “continuous piecewise linear functions.” In Section 5.2, we will see that every such
function can be exactly represented by a ReLU neural network. Afterwards, in Section 5.3 we will
give a more detailed analysis of the required network complexity. Finally, we will use these results
to prove a first approximation theorem for ReLU neural networks in Section 5.4. The argument is
similar in spirit to Chapter 4, in that we transfer established approximation theory for piecewise
linear functions to the class of ReLU neural networks of a certain architecture.

5.1 Basic ReLU calculus


The goal of this section is to formalize how to combine and manipulate ReLU neural networks.
We have seen an instance of such a result already in Proposition 2.3. Now we want to make this
result more precise under the assumption that the activation function is the ReLU. We sharpen
Proposition 2.3 by adding bounds on the number of weights that the resulting neural networks
have. The following four operations form the basis of all constructions in the sequel.

• Reproducing an identity: We have seen in Proposition 3.16 that for most activation functions,
an approximation to the identity can be built by neural networks. For ReLUs, we can have
an even stronger result and reproduce the identity exactly. This identity will play a crucial

role in order to extend certain neural networks to deeper neural networks, and to facilitate
an efficient composition operation.
• Composition: We saw in Proposition 2.3 that we can produce a composition of two neural
networks and the resulting function is a neural network as well. There we did not study the
size of the resulting neural networks. For ReLU activation functions, this composition can be
done in a very efficient way, leading to a neural network whose number of weights is, up to a
constant factor, at most the combined number of weights of the two initial neural networks.
• Parallelization: Also the parallelization of two neural networks was discussed in Proposition
2.3. We will refine this notion and make precise the size of the resulting neural networks.
• Linear combinations: Similarly, for the sum of two neural networks, we will give precise
bounds on the size of the resulting neural network.

5.1.1 Identity
We start with expressing the identity on Rd as a neural network of depth L ∈ N.

Lemma 5.1 (Identity). Let L ∈ N. Then, there exists a ReLU neural network Φid_L such that
Φid_L (x) = x for all x ∈ Rd . Moreover, depth(Φid_L ) = L, width(Φid_L ) = 2d, and size(Φid_L ) = 2d · (L + 1).

Proof. Writing I_d ∈ R^{d×d} for the identity matrix, we choose the weights

(W (0) , b(0) ), . . . , (W (L) , b(L) ) := ( ((I_d ; −I_d ), 0), (I_{2d} , 0), . . . , (I_{2d} , 0), ((I_d , −I_d ), 0) ),

where (I_d ; −I_d ) ∈ R^{2d×d} stacks I_d on top of −I_d and the block (I_{2d} , 0) is repeated L − 1 times.
Using that x = σReLU (x) − σReLU (−x) for all x ∈ R and σReLU (x) = x for all x ≥ 0, it is obvious
that the neural network Φid_L associated to the weights above satisfies the assertion of the lemma.
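The weights from this proof can be written down explicitly. The sketch below (our own minimal representation of a ReLU network as a list of (W, b) tuples) constructs Φid_L for given d and L and checks that it reproduces its input.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def realize(weights, x):
    # evaluate a feedforward ReLU network given as a list of (W, b) tuples
    for W, b in weights[:-1]:
        x = relu(W @ x + b)
    W, b = weights[-1]
    return W @ x + b

def identity_network(d, L):
    # the weights of Phi^id_L from the proof of Lemma 5.1
    I = np.eye(d)
    first = (np.vstack([I, -I]), np.zeros(2 * d))
    middle = [(np.eye(2 * d), np.zeros(2 * d)) for _ in range(L - 1)]
    last = (np.hstack([I, -I]), np.zeros(d))
    return [first] + middle + [last]

x = np.array([1.5, -2.0, 0.0])
print(realize(identity_network(d=3, L=4), x))    # [ 1.5 -2.   0. ]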

We will see in Exercise 5.23 that the property to exactly represent the identity is not shared by
sigmoidal activation functions. It does hold for polynomial activation functions, though.

5.1.2 Composition
Assume we have two neural networks Φ1 , Φ2 with architectures (σReLU ; d_0^1 , . . . , d_{L1+1}^1 ) and (σReLU ; d_0^2 , . . . , d_{L2+1}^2 ),
respectively. Moreover, we assume that they have weights and biases given by

(W_1^(0) , b_1^(0) ), . . . , (W_1^(L1) , b_1^(L1) ), and (W_2^(0) , b_2^(0) ), . . . , (W_2^(L2) , b_2^(L2) ),

respectively. If the output dimension d_{L1+1}^1 of Φ1 equals the input dimension d_0^2 of Φ2 , we can
define two types of concatenations: First, Φ2 ◦ Φ1 is the neural network with weights and biases
given by

(W_1^(0) , b_1^(0) ), . . . , (W_1^(L1−1) , b_1^(L1−1) ), (W_2^(0) W_1^(L1) , W_2^(0) b_1^(L1) + b_2^(0) ),
(W_2^(1) , b_2^(1) ), . . . , (W_2^(L2) , b_2^(L2) ).

Second, Φ2 • Φ1 is the neural network defined as Φ2 ◦ Φid_1 ◦ Φ1 . In terms of weights and biases,
Φ2 • Φ1 is given as

(W_1^(0) , b_1^(0) ), . . . , (W_1^(L1−1) , b_1^(L1−1) ), ( (W_1^(L1) ; −W_1^(L1) ), (b_1^(L1) ; −b_1^(L1) ) ),
( (W_2^(0) , −W_2^(0) ), b_2^(0) ), (W_2^(1) , b_2^(1) ), . . . , (W_2^(L2) , b_2^(L2) ),

where (A; −A) denotes the vertical stacking of A on top of −A and (W_2^(0) , −W_2^(0) ) the horizontal concatenation.

The following lemma collects the properties of the construction above.

Lemma 5.2 (Composition). Let Φ1 , Φ2 be neural networks with architectures (σReLU ; d_0^1 , . . . , d_{L1+1}^1 )
and (σReLU ; d_0^2 , . . . , d_{L2+1}^2 ). Assume d_{L1+1}^1 = d_0^2 . Then Φ2 ◦ Φ1 (x) = Φ2 • Φ1 (x) = Φ2 (Φ1 (x)) for
all x ∈ R^{d_0^1} . Moreover,

width(Φ2 ◦ Φ1 ) ≤ max{width(Φ1 ), width(Φ2 )},


depth(Φ2 ◦ Φ1 ) = depth(Φ1 ) + depth(Φ2 ),
size(Φ2 ◦ Φ1 ) ≤ size(Φ1 ) + size(Φ2 ) + (d_{L1}^1 + 1) d_1^2 ,

and

width(Φ2 • Φ1 ) ≤ 2 max{width(Φ1 ), width(Φ2 )},


depth(Φ2 • Φ1 ) = depth(Φ1 ) + depth(Φ2 ) + 1,
size(Φ2 • Φ1 ) ≤ 2(size(Φ1 ) + size(Φ2 )).

Proof. The fact that Φ2 ◦ Φ1 (x) = Φ2 • Φ1 (x) = Φ2 (Φ1 (x)) for all x ∈ R^{d_0^1} follows immediately
from the construction. The same can be said for the width and depth bounds. To confirm the size
bound, we note that W_2^(0) W_1^(L1) ∈ R^{d_1^2 × d_{L1}^1} and hence W_2^(0) W_1^(L1) has not more than d_1^2 × d_{L1}^1
(nonzero) entries. Moreover, W_2^(0) b_1^(L1) + b_2^(0) ∈ R^{d_1^2} . Thus, the L1 -th layer of Φ2 ◦ Φ1 has at
most d_1^2 × (1 + d_{L1}^1 ) entries. The rest is obvious from the construction.

Interpreting linear transformations as neural networks of depth 0, the previous lemma is also
valid in case Φ1 or Φ2 is a linear mapping.

5.1.3 Parallelization
Let (Φi )_{i=1}^m be neural networks with architectures (σReLU ; d_0^i , . . . , d_{Li+1}^i ), respectively. We proceed
to build a neural network (Φ1 , . . . , Φm ) realizing the function

(Φ1 , . . . , Φm ) : R^{Σ_{j=1}^m d_0^j} → R^{Σ_{j=1}^m d_{Lj+1}^j} ,   (5.1.1)
(x1 , . . . , xm ) 7→ (Φ1 (x1 ), . . . , Φm (xm )).

To do so we first assume L1 = · · · = Lm = L, and define (Φ1 , . . . , Φm ) via the following sequence
of weight-bias tuples:

( diag(W_1^(0) , . . . , W_m^(0) ), (b_1^(0) , . . . , b_m^(0) ) ), . . . , ( diag(W_1^(L) , . . . , W_m^(L) ), (b_1^(L) , . . . , b_m^(L) ) ),   (5.1.2)

where these matrices are understood as block-diagonal filled up with zeros and the bias vectors are
stacked on top of each other. For the general case where the Φj might have different depths, let
Lmax := max_{1≤i≤m} Li and I := {1 ≤ i ≤ m | Li < Lmax }. For j ∈ I^c set Φ̃j := Φj , and for each j ∈ I

Φ̃j := Φid_{Lmax−Lj} ◦ Φj .   (5.1.3)

Finally,

(Φ1 , . . . , Φm ) := (Φ̃1 , . . . , Φ̃m ).   (5.1.4)

We collect the properties of the parallelization in the lemma below.

Lemma 5.3 (Parallelization). Let m ∈ N and (Φi )_{i=1}^m be neural networks with architectures
(σReLU ; d_0^i , . . . , d_{Li+1}^i ), respectively. Then the neural network (Φ1 , . . . , Φm ) satisfies

(Φ1 , . . . , Φm )(x) = (Φ1 (x1 ), . . . , Φm (xm )) for all x ∈ R^{Σ_{j=1}^m d_0^j} .

Moreover, with Lmax := maxj≤m Lj it holds that


width((Φ1 , . . . , Φm )) ≤ 2 Σ_{j=1}^m width(Φj ),   (5.1.5a)
depth((Φ1 , . . . , Φm )) = max_{j≤m} depth(Φj ),   (5.1.5b)
size((Φ1 , . . . , Φm )) ≤ 2 Σ_{j=1}^m size(Φj ) + 2 Σ_{j=1}^m (Lmax − Lj ) d_{Lj+1}^j .   (5.1.5c)

Proof. All statements except for the bound on the size follow immediately from the construction.
To obtain the bound on the size, we note that by construction the sizes of the (Φ̃i )_{i=1}^m in (5.1.3)
will simply be added. The size of each Φ̃i can be bounded with Lemma 5.2.

If all input dimensions d_0^1 = · · · = d_0^m =: d0 are the same, we will also use parallelization with
shared inputs to realize the function x 7→ (Φ1 (x), . . . , Φm (x)) from R^{d0} → R^{d_{L1+1}^1 + · · · + d_{Lm+1}^m} .
In terms of the construction (5.1.2), the only required change is that the block-diagonal matrix
diag(W_1^(0) , . . . , W_m^(0) ) becomes the matrix in R^{(Σ_{j=1}^m d_1^j) × d0} which stacks W_1^(0) , . . . , W_m^(0) on top of
each other. Similarly, we will allow Φj to only take some of the entries of x as input. For par-
allelization with shared inputs we will use the same notation (Φj )_{j=1}^m as before, where the precise

5.1.4 Linear combinations
Let m ∈ N and let (Φi )_{i=1}^m be ReLU neural networks that have architectures (σReLU ; d_0^i , . . . , d_{Li+1}^i ),
respectively. Assume that d_{L1+1}^1 = · · · = d_{Lm+1}^m , i.e., all Φ1 , . . . , Φm have the same output dimen-
sion. For scalars αj ∈ R, we wish to construct a ReLU neural network Σ_{j=1}^m αj Φj realizing the
function

R^{Σ_{j=1}^m d_0^j} → R^{d_{L1+1}^1} ,
(x1 , . . . , xm ) 7→ Σ_{j=1}^m αj Φj (xj ).

This corresponds to the parallelization (Φ1 , . . . , Φm ) composed with the linear transformation
(z1 , . . . , zm ) 7→ Σ_{j=1}^m αj zj . The following result holds.

Lemma 5.4 (Linear combinations). Let m ∈ N and (Φi )_{i=1}^m be neural networks with architec-
tures (σReLU ; d_0^i , . . . , d_{Li+1}^i ), respectively. Assume that d_{L1+1}^1 = · · · = d_{Lm+1}^m , let α ∈ R^m and set
Lmax := max_{j≤m} Lj . Then, there exists a neural network Σ_{j=1}^m αj Φj such that (Σ_{j=1}^m αj Φj )(x) =
Σ_{j=1}^m αj Φj (xj ) for all x = (xj )_{j=1}^m ∈ R^{Σ_{j=1}^m d_0^j} . Moreover,

width( Σ_{j=1}^m αj Φj ) ≤ 2 Σ_{j=1}^m width(Φj ),   (5.1.6a)
depth( Σ_{j=1}^m αj Φj ) = max_{j≤m} depth(Φj ),   (5.1.6b)
size( Σ_{j=1}^m αj Φj ) ≤ 2 Σ_{j=1}^m size(Φj ) + 2 Σ_{j=1}^m (Lmax − Lj ) d_{Lj+1}^j .   (5.1.6c)

Proof. The construction of Σ_{j=1}^m αj Φj is analogous to that of (Φ1 , . . . , Φm ), i.e., we first define the
linear combination of neural networks with the same depth. Then the weights are chosen as in
(5.1.2), but with the last linear transformation replaced by

( (α1 W_1^(L) · · · αm W_m^(L) ), Σ_{j=1}^m αj b_j^(L) ).

For general depths, we define the sum of the neural networks to be the sum of the extended
neural networks Φ̃i as in (5.1.3). All statements of the lemma follow immediately from this con-
struction.

In case d_0^1 = · · · = d_0^m =: d0 (all neural networks have the same input dimension), we will also
consider linear combinations with shared inputs, i.e., a neural network realizing

x 7→ Σ_{j=1}^m αj Φj (x) for x ∈ R^{d0} .

This requires the same minor adjustment as discussed at the end of Section 5.1.3. Lemma 5.4
remains valid in this case and again we do not distinguish in notation for linear combinations with
or without shared inputs.

5.2 Continuous piecewise linear functions


In this section, we will relate ReLU neural networks to a large class of functions. We first formally
introduce the set of continuous piecewise linear functions from a set Ω ⊆ Rd to R. Note that we
admit in particular Ω = Rd in the following definition.

Definition 5.5. Let Ω ⊆ Rd , d ∈ N. We call a function f : Ω → R continuous, piecewise linear


(cpwl) if f ∈ C 0 (Ω) and there exist n ∈ N affine functions gj : Rd → R, gj (x) = wj⊤ x + bj , such
that for each x ∈ Ω it holds that f (x) = gj (x) for at least one j ∈ {1, . . . , n}. For m > 1 we call
f : Ω → Rm cpwl if and only if each component of f is cpwl.

Remark 5.6. A “continuous piecewise linear function” as in Definition 5.5 is actually piecewise
affine. To maintain consistency with the literature, we use the terminology cpwl.
In the following, we will refer to the connected domains on which f is equal to one of the
functions gj , also as regions or pieces. If f is cpwl with q ∈ N regions, then with n ∈ N denoting
the number of affine functions it holds n ≤ q.
Note that the mapping x 7→ σReLU (w⊤ x + b), which is a ReLU neural network with a single
neuron, is cpwl (with two regions). Consequently, every ReLU neural network is a repeated compo-
sition of linear combinations of cpwl functions. It is not hard to see that the set of cpwl functions
is closed under compositions and linear combinations. Hence, every ReLU neural network is a cpwl
function. Interestingly, the reverse direction of this statement is also true, meaning that every cpwl
function can be represented by a ReLU neural network as we shall demonstrate below. Therefore,
we can identify the class of functions realized by arbitrary ReLU neural networks as the class of
cpwl functions.

Theorem 5.7. Let d ∈ N, let Ω ⊆ Rd be convex, and let f : Ω → R be cpwl with n ∈ N as in


Definition 5.5. Then, there exists a ReLU neural network Φf such that Φf (x) = f (x) for all x ∈ Ω
and
size(Φf ) = O(dn2n ), width(Φf ) = O(dn2n ), depth(Φf ) = O(n).

A statement similar to Theorem 5.7 can be found in [4, 85]. There, the authors give a con-
struction with a depth that behaves logarithmic in d and is independent of n, but with significantly
larger bounds on the size. As we shall see, the proof of Theorem 5.7 is a simple consequence of the
following well-known result from [225]; also see [169, 237]. It states that every cpwl function can
be expressed as a finite maximum of a finite minimum of certain affine functions.

52
Proposition 5.8. Let d ∈ N, Ω ⊆ Rd be convex, and let f : Ω → R be cpwl with n ∈ N affine
functions as in Definition 5.5. Then there exists m ∈ N and sets sj ⊆ {1, . . . , n} for j ∈ {1, . . . , m},
such that

f (x) = max min(gi (x)) for all x ∈ Ω. (5.2.1)


1≤j≤m i∈sj

Proof. Step 1. We start with d = 1, i.e., Ω ⊆ R is a (possibly unbounded) interval and for each
x ∈ Ω there exists j ∈ {1, . . . , n} such that with gj (x) := wj x + bj it holds that f (x) = gj (x).
Without loss of generality, we can assume that gi ̸= gj for all i ̸= j. Since the graphs of the gj are
lines, they intersect at (at most) finitely many points in Ω.
Since f is continuous, we conclude that there exist finitely many intervals covering Ω, such that
f coincides with one of the gj on each interval. For each x ∈ Ω let

sx := {1 ≤ j ≤ n | gj (x) ≥ f (x)}

and

fx (y) := min gj (y) for all y ∈ Ω.


j∈sx

Clearly, fx (x) = f (x). We claim that, additionally,

fx (y) ≤ f (y) for all y ∈ Ω. (5.2.2)

This then shows that

f (y) = max fx (y) = max min gj (y) for all y ∈ R.


x∈Ω x∈Ω j∈sx

Since there exist only finitely many possibilities to choose a subset of {1, . . . , n}, we conclude that
(5.2.1) holds for d = 1.
It remains to verify the claim (5.2.2). Fix y ̸= x ∈ Ω. Without loss of generality, let x < y
and let x = x0 < · · · < xk = y be such that f |[xi−1 ,xi ] equals some gj for each i ∈ {1, . . . , k}. In
order to show (5.2.2), it suffices to prove that there exists at least one j such that gj (x0 ) ≥ f (x0 )
and gj (xk ) ≤ f (xk ). The claim is trivial for k = 1. We proceed by induction. Suppose the
claim holds for k − 1, and consider the partition x0 < · · · < xk . Let r ∈ {1, . . . , n} be such
that f |[x0 ,x1 ] = gr |[x0 ,x1 ] . Applying the induction hypothesis to the interval [x1 , xk ], we can find
j ∈ {1, . . . , n} such that gj (x1 ) ≥ f (x1 ) and gj (xk ) ≤ f (xk ). If gj (x0 ) ≥ f (x0 ), then gj is the desired
function. Otherwise, gj (x0 ) < f (x0 ). Then gr (x0 ) = f (x0 ) > gj (x0 ) and gr (x1 ) = f (x1 ) ≤ gj (x1 ).
Therefore gr (x) ≤ gj (x) for all x ≥ x1 , and in particular gr (xk ) ≤ gj (xk ). Thus gr is the desired
function.
Step 2. For general d ∈ N, let gj (x) := w⊤ j x + bj for j = 1, . . . , n. For each x ∈ Ω, let

sx := {1 ≤ j ≤ n | gj (x) ≥ f (x)}

53
and for all y ∈ Ω, let

fx (y) := min gj (y).


j∈sx

For an arbitrary 1-dimensional affine subspace S ⊆ Rd passing through x consider the line
(segment) I := S ∩ Ω, which is connected since Ω is convex. By Step 1, it holds

f (y) = max fx (y) = max min gj (y)


x∈Ω x∈Ω j∈sx

on all of I. Since I was arbitrary the formula is valid for all y ∈ Ω. This again implies (5.2.1) as
in Step 1.

Remark 5.9. Using min(a, b) = − max(−a, −b), there exists m̃ ∈ N and sets s̃j ⊆ {1, . . . , n} for
j = 1, . . . , m̃, such that for all x ∈ R

f (x) = −(−f (x)) = − min max(−w⊤


i x − bi )
1≤j≤m̃ i∈s̃j

= max (− max(−w⊤
i x − bi ))
1≤j≤m̃ i∈s̃j

= max (min(w⊤
i x + bi )).
1≤j≤m̃ i∈s̃j

To prove Theorem 5.7, it therefore suffices to show that the minimum and the maximum are
expressible by ReLU neural networks.

Lemma 5.10. For every x, y ∈ R it holds that

min{x, y} = σReLU (y) − σReLU (−y) − σReLU (y − x) ∈ N21 (σReLU ; 1, 3)

and

max{x, y} = σReLU (y) − σReLU (−y) + σReLU (x − y) ∈ N21 (σReLU ; 1, 3).

Proof. We have
(
0 if y > x
max{x, y} = y +
x−y if x ≥ y
= y + σReLU (x − y).

Using y = σReLU (y) − σReLU (−y), the claim for the maximum follows. For the minimum observe
that min{x, y} = − max{−x, −y}.

The minimum of n ≥ 2 inputs can be computed by repeatedly applying the construction of


Lemma 5.10. The resulting neural network is described in the next lemma.

54
x
min{x, y}
y

Figure 5.1: Sketch of the neural network in Lemma 5.10. Only edges with non-zero weights are
drawn.

Lemma 5.11. For every n ≥ 2 there exists a neural network Φmin


n : Rn → R with

size(Φmin
n ) ≤ 16n, width(Φmin
n ) ≤ 3n, depth(Φmin
n ) ≤ ⌈log2 (n)⌉

such that Φmin max : Rn → R


n (x1 , . . . , xn ) = min1≤j≤n xj . Similarly, there exists a neural network Φn
realizing the maximum and satisfying the same complexity bounds.

Proof. Throughout denote by Φmin 2 : R2 → R the neural network from Lemma 5.10. It is of depth
1 and size 7 (since all biases are zero, it suffices to count the number of connections in Figure 5.1).
Step 1. Consider first the case where n = 2k for some k ∈ N. We proceed by induction of k.
For k = 1 the claim is proven. For k ≥ 2 set

Φmin
2k
:= Φmin
2 ◦ (Φmin min
2k−1 , Φ2k−1 ). (5.2.3)

By Lemma 5.2 and Lemma 5.3 we have

depth(Φmin min min


2k ) ≤ depth(Φ2 ) + depth(Φ2k−1 ) ≤ · · · ≤ k.

Next, we bound the size of the neural network. Note that all biases in this neural network are set to
0, since the Φmin
2 neural network in Lemma 5.10 has no biases. Thus, the size of the neural network
min
Φ2k corresponds to the number of connections in the graph (the number of nonzero weights).
Careful inspection of the neural network architecture, see Figure 5.2, reveals that
k−2
X
size(Φmin
2k ) = 4 · 2
k−1
+ 12 · 2j + 3
j=0

= 2n + 12 · (2k−1 − 1) + 3 = 2n + 6n − 9 ≤ 8n,

and that width(Φmin


2k
) ≤ (3/2)2k . This concludes the proof for the case n = 2k .
Step 2. For the general case, we first let

Φmin
1 (x) := x for all x ∈ R

be the identity on R, i.e. a linear transformation and thus formally a depth 0 neural network. Then,
for all n ≥ 2 
(Φid ◦ Φmin min ) if n ∈ {2k + 1 | k ∈ N}
min min 1 ⌊n ⌋ , Φ⌈ n ⌉
Φn := Φ2 ◦ min
2
min
2 (5.2.4)
(Φ⌊ n ⌋ , Φ⌈ n ⌉ ) otherwise.
2 2

55
This definition extends (5.2.3) to arbitrary n ≥ 2, since the first case in (5.2.4) never occurs if n ≥ 2
is a power of two.
To analyze (5.2.4), we start with the depth and claim that

depth(Φmin
n )=k for all 2k−1 < n ≤ 2k

and all k ∈ N. We proceed by induction over k. The case k = 1 is clear. For the induction step,
assume the statement holds for some fixed k ∈ N and fix an integer n with 2k < n ≤ 2k+1 . Then
lnm
∈ (2k−1 , 2k ] ∩ N
2
and (
jnk {2k−1 } if n = 2k + 1

2 (2k−1 , 2k ] ∩ N otherwise.
Using the induction assumption, (5.2.4) and Lemmas 5.1 and 5.2, this shows

depth(Φmin min
n ) = depth(Φ2 ) + k = 1 + k,

and proves the claim.


For the size and width bounds, we only sketch the argument: Fix n ∈ N such that 2k−1 < n ≤ 2k .
Then Φmin
n is constructed from at most as many subnetworks as Φmin2k
, but with some Φmin
2 : R2 → R
blocks replaced by Φid id min
1 : R → R, see Figure 5.3. Since Φ1 has the same depth as Φ2 , but is smaller
min
in width and number of connections, the width and size of Φn is bounded by the width and size
of Φmin
2k
. Due to 2k ≤ 2n, the bounds from Step 1 give the bounds stated in the lemma.
Step 3. For the maximum, define

Φmax min
n (x1 , . . . , xn ) := −Φn (−x1 , . . . , −xn ).

of Theorem 5.7. By Proposition 5.8 the neural network



Φ := Φmax min m m
m • (Φ|sj | )j=1 • ((w i x + bi )i∈sj )j=1

realizes the function f .


Since the number of possibilities to choose subsets of {1, . . . , n} equals 2n we have m ≤ 2n .
Since each sj is a subset of {1, . . . , n}, the cardinality |sj | of sj is bounded by n. By Lemma 5.2,
Lemma 5.3, and Lemma 5.11

depth(Φ) ≤ 2 + depth(Φmax min


m ) + max depth(Φ|sj | )
1≤j≤n
n
≤ 1 + ⌈log2 (2 )⌉ + ⌈log2 (n)⌉ = O(n)

and
n m
X m
X o

width(Φ) ≤ 2 max width(Φmax
m ), width(Φmin
|sj | ), width((w i x + b )
i i∈sj ))
j=1 j=1
n
≤ 2 max{3m, 3mn, mdn} = O(dn2 )

56
x1
x2

x3
x4
min{x1 , . . . , x8 }
x5
x6

x7
x8

nr of connections
between layers: 2k−1 · 4 2k−2 · 12 2k−3 · 12 3

Figure 5.2: Architecture of the Φmin


2k
neural network in Step 1 of the proof of Lemma 5.11 and the
number of connections in each layer for k = 3. Each grey box corresponds to 12 connections in the
graph.

x1 x2 x3 x4 x5 x1 x2 x3 x4 x5 x6 x1 x2 x3 x4 x5 x6 x7 x8

Φmin
2 Φid min
1 Φ2 Φid min
1 Φ2 Φid min
1 Φ2 Φmin
2 Φmin
2 Φmin
2 Φmin
2

Φid
1 Φmin
2 Φmin
2 Φmin
2 Φmin
2 Φmin
2

Φmin
2 Φmin
2 Φmin
2

min{x1 , . . . , x5 } min{x1 , . . . , x6 } min{x1 , . . . , x8 }

Figure 5.3: Construction of Φmin


n for general n in Step 2 of the proof of Lemma 5.11.

57
and
 

size(Φ) ≤ 4 size(Φmax
m ) + size((Φ min m
)
|sj | j=1 ) + size((w i x + b ) )m
i i∈sj j=1 )
 
X m
≤ 4 16m + 2 (16|sj | + 2⌈log2 (n)⌉) + nm(d + 1) = O(dn2n ).
j=1

This concludes the proof.

5.3 Simplicial pieces


This section studies the case, were we do not have arbitrary cpwl functions, but where the regions
on which f is affine are simplices. Under this condition, we can construct neural networks that scale
merely linearly in the number of such regions, which is a serious improvement from the exponential
dependence of the size on the number of regions that was found in Theorem 5.7.

5.3.1 Triangulations of Ω
For the ensuing discussion, we will consider Ω ⊆ Rd to be partitioned into simplices. This parti-
tioning will be termed a triangulation of Ω. Other notions prevalent in the literature include a
tessellation of Ω, or a simplicial mesh on Ω. To give a precise definition, let us first recall some
terminology. For a set S ⊆ Rd we denote the convex hull of S by
 
X n Xn 
co(S) := αj xj n ∈ N, xj ∈ S, αj ≥ 0, αj = 1 . (5.3.1)
 
j=1 j=1

An n-simplex is the convex hull of n ∈ N points that are independent in a specific sense. This
is made precise in the following definition.

Definition 5.12. Let n ∈ N0 , d ∈ N and n ≤ d. We call x0 , . . . , xn ∈ Rd affinely independent


if and only if either n = 0 or n ≥ 1 and the vectors x1 − x0 , . . . , xn − x0 are linearly independent.
In this case, we call co(x0 , . . . , xn ) := co({x0 , . . . , xn }) an n-simplex.

As mentioned before, a triangulation refers to a partition of a space into simplices. We give a


formal definition below.

Definition 5.13. Let d ∈ N, and Ω ⊆ Rd be compact. Let T be a finite set of d-simplices, and
for each τ ∈ T let V (τ ) ⊆ Ω have cardinality d + 1 such that τ = co(V (τ )). We call T a regular
triangulation of Ω, if and only if
S
(i) τ ∈T τ = Ω,
(ii) for all τ , τ ′ ∈ T it holds that τ ∩ τ ′ = co(V (τ ) ∩ V (τ ′ )).
S
We call η ∈ V := τ ∈T V (τ ) a node (or vertex) and τ ∈ T an element of the triangulation.

58
η2 η2 η2

η3 η1 η3 η1 η3 η1
η5 η5

η4 η4 η4

τ1 = co(η 1 , η 2 , η 5 ) τ1 = co(η 2 , η 3 , η 4 ) τ1 = co(η 2 , η 3 , η 4 )


τ2 = co(η 2 , η 3 , η 5 ) τ2 = co(η 2 , η 5 , η 1 ) τ2 = co(η 1 , η 2 , η 3 )
τ3 = co(η 3 , η 4 , η 5 )

Figure 5.4: The first is a regular triangulation, while the second and the third are not.

For a regular triangulation T with nodes V we also introduce the constant


kT := max |{τ ∈ T | η ∈ τ }| (5.3.2)
η∈V

corresponding to the maximal number of elements shared by a single node.

5.3.2 Size bounds for regular triangulations


Throughout this subsection, let T be a regular triangulation of Ω, and we adhere to the notation
of Definition 5.13. We will say that f : Ω → R is cpwl with respect to T if f is cpwl and f |τ is
affine for each τ ∈ T . The rest of this subsection is dedicated to proving the following result. It
was first shown in [136] with a more technical argument, and extends an earlier statement from
[85] to general triangulations (also see Section 5.3.3).

Theorem 5.14. Let d ∈ N, Ω ⊆ Rd be a bounded domain, and let T be a regular triangulation


of Ω. Let f : Ω → R be cpwl with respect to T and f |∂Ω = 0. Then there exists a ReLU neural
network Φ : Ω → R realizing f , and it holds

size(Φ) = O(|T |), width(Φ) = O(|T |), depth(Φ) = O(1), (5.3.3)

where the constants in the Landau notation depend on d and kT in (5.3.2).

We will split the proof into several lemmata. The strategy is to introduce a basis of the space
of cpwl functions on T the elements of which vanish on the boundary of Ω. We will then show
that there exist O(|T |) basis functions, each of which can be represented with a neural network the
size of which depends only on kT and d. To construct this basis, we first point out that an affine
function on a simplex is uniquely defined by its values at the nodes.

Lemma 5.15. Let d ∈ N. Let τ := co(η 0 , . . . , η d ) be a d-simplex. For every y0 , . . . , yd ∈ R, there


exists a unique g ∈ P1 (Rd ) such that g(η i ) = yi , i = 0, . . . , d.

59
η6 η1
ω(η) co(V (τ1 )\{η}) = co({η 1 , η 2 })
τ6
τ5 τ1
η5 η2
τ4 η τ2
τ3

η4 η3

Figure 5.5: Visualization of Lemma 5.16 in two dimensions. The patch ω(η) consists of the union
of all 2-simplices τi containing η. Its boundary consists of the union of all 1-simplices made up by
the nodes of each τi without the center node, i.e., the convex hulls of V (τi )\{η}.

Proof. Since η 1 −η 0 , . . . , η d −η 0 is a basis of Rd , there is a unique w ∈ Rd such that w⊤ (η i −η 0 ) =


yi − y0 for i = 1,P. . . , d. Then P g(x) := w⊤ x + (y0 − w⊤P η 0 ) is as desired. Moreover, for every g ∈ P1
it holds that g( i=0 αi η i ) = i=0 αi g(η i ) whenever di=0 αi = 1 (this is in general not true if the
d d

coefficients do not sum to 1). Hence, g is uniquely determined by its values at the nodes.

Since Ω is the union of the simplices τ ∈ T , every cpwl function with respect to T is thus
uniquely defined through its values at the nodes. Hence, the desired basis consists of cpwl functions
φη : Ω → R with respect to T such that

φη (µ) = δηµ for all µ ∈ V, (5.3.4)

where δηµ denotes the Kronecker delta. Assuming φη to be well-defined for the moment, we can
then represent every cpwl function f : Ω → R that vanishes on the boundary ∂Ω as
X
f (x) = f (η)φη (x) for all x ∈ Ω.
η∈V∩Ω̊

Note that it suffices to sum over the set of interior nodes V ∩ Ω̊, since f (η) = 0 whenever η ∈
∂Ω. To formally verify existence and well-definedness of φη , we first need a lemma characterizing
the boundary of so-called “patches” of the triangulation: For each η ∈ V, we introduce the patch
ω(η) of the node η as the union of all elements containing η, i.e.,
[
ω(η) := τ. (5.3.5)
{τ ∈T | η∈τ }

Lemma 5.16. Let η ∈ V ∩ Ω̊ be an interior node. Then,


[
∂ω(η) = co(V (τ )\{η}).
{τ ∈T | η∈τ }

60
We refer to Figure 5.5 for a visualization of Lemma 5.16. The proof of Lemma 5.16 is quite
technical but nonetheless elementary. We therefore only outline the general argument but leave
the details to the reader in Excercise 5.27: The boundary of ω(η) must be contained in the union
of the boundaries of all τ in the patch ω(η). Since η is an interior point of Ω, it must also be
an interior point of ω(η). This can be used to show that for every S := {η i0 , . . . , η ik } ⊆ V (τ ) of
cardinality k + 1 ≤ d, the interior of (the k-dimensional manifold) co(S) belongs to the interior
of ω(η) whenever η ∈ S. Using Exercise 5.27, it then only remains to check that co(S) ⊆ ∂ω(η)
whenever η ∈ / S, which yields the claimed formula. We are now in position to show well-definedness
of the basis functions in (5.3.4).

Lemma 5.17. For each interior node η ∈ V ∩ Ω̊ there exists a unique cpwl function φη : Ω → R
satisfying (5.3.4). Moreover, φη can be expressed by a ReLU neural network with size, width, and
depth bounds that only depend on d and kT .

Proof. By Lemma 5.15, on each τ ∈ T , the affine function φη |τ is uniquely defined through the
values at the nodes of τ . This defines a continuous function φη : Ω → R. Indeed, whenever
τ ∩ τ ′ ̸= ∅, then τ ∩ τ ′ is a subsimplex of both τ and τ ′ in the sense of Definition 5.13 (ii). Thus,
applying Lemma 5.15 again, the affine functions on τ and τ ′ coincide on τ ∩ τ ′ .
Using Lemma 5.15, Lemma 5.16 and the fact that φη (µ) = 0 whenever µ ̸= η, we find that
φη vanishes on the boundary of the patch ω(η) ⊆ Ω. Thus, φη vanishes on the boundary of Ω.
Extending by zero, it becomes a cpwl function φη : Rd → R. This function is nonzero only on
elements τ for which η ∈ τ . Hence, it is a cpwl function with at most n := kT + 1 affine functions.
By Theorem 5.7, φη can be expressed as a ReLU neural network with the claimed size, width and
depth bounds.

Finally, Theorem 5.14 is now an easy consequence of the above lemmata.

of Theorem 5.14. With


X
Φ(x) := f (η)φη (x) for x ∈ Ω, (5.3.6)
η∈V∩Ω̊

it holds that Φ : Ω → R satisfies Φ(η) = f (η) for all η ∈ V. By Lemma 5.15 this implies that
f equals Φ on each τ , and thus f equals Φ on all of Ω. Since each element τ is the convex hull
of d + 1 nodes η ∈ V, the cardinality of V is bounded by the cardinality of T times d + 1. Thus,
the summation in (5.3.6) is over O(|T |) terms. Using Lemma 5.4 and Lemma 5.17 we obtain the
claimed bounds on size, width, and depth of the neural network.

5.3.3 Size bounds for locally convex triangulations


Assuming local convexity of the triangulation, in this section we make the dependence of the
constants in Theorem 5.14 explicit in the dimension d and in the maximal number of simplices
kT touching a node, see (5.3.2). As such the improvement over Theorem 5.14 is modest, and the
reader may choose to skip this section on a first pass. Nonetheless, the proof, originally from [85],
is entirely constructive and gives some further insight on how ReLU networks express functions.
Let us start by stating the required convexity constraint.

61
Definition 5.18. A regular triangulation T is called locally convex if and only if ω(η) is convex
for all interior nodes η ∈ V ∩ Ω̊.

The following theorem is a variant of [85, Theorem 3.1].

Theorem 5.19. Let d ∈ N, and let Ω ⊆ Rd be a bounded domain. Let T be a locally convex regular
triangulation of Ω. Let f : Ω → R be cpwl with respect to T and f |∂Ω = 0. Then, there exists a
constant C > 0 (independent of d, f and T ) and there exists a neural network Φf : Ω → R such
that Φf = f ,

size(Φf ) ≤ C · (1 + d2 kT |T |),
width(Φf ) ≤ C · (1 + d log(kT )|T |),
depth(Φf ) ≤ C · (1 + log2 (kT )).

Assume in the following that T is a locally convex triangulation. We will split the proof of the
theorem again into a few lemmata. First, we will show that a convex patch can be written as an
intersection of finitely many half-spaces. Specifically, with the affine hull of a set S defined as
 
Xn Xn 
aff(S) := αj xj n ∈ N, xj ∈ S, αj ∈ R, αj = 1 (5.3.7)
 
j=1 j=1

let in the following for τ ∈ T and η ∈ V (τ )

H0 (τ, η) := aff(V (τ )\{η})

be the affine hyperplane passing through all nodes in V (τ )\{η}, and let further

H+ (τ, η) := {x ∈ Rd | x is on the same side of H0 (τ, η) as η} ∪ H0 (τ, η).

Lemma 5.20. Let η be an interior node. Then a patch ω(η) is convex if and only if
\
ω(η) = H+ (τ, η). (5.3.8)
{τ ∈T | η∈T }

Proof. The right-hand side is a finite intersection of (convex) half-spaces, and thus itself convex. It
remains to show that if ω(η) is convex, then (5.3.8) holds. We start with “⊃”. Suppose x ∈ / ω(η).
Then the straight line co({x, η}) must pass through ∂ω(η), and by Lemma 5.16 this implies that
there exists τ ∈ T with η ∈ τ such that co({x, η}) passes through aff(V (τ )\{η}) = H0 (τ, η).

62
Hence η and x lie on different sides of this affine hyperplane, which shows “⊇”. Now we show “⊆”.
Let τ ∈ T be such that η ∈ τ and fix x in the complement of H+ (τ, η). Suppose that x ∈ ω(η). By
convexity, we then have co({x} ∪ τ ) ⊆ ω(η). This implies that there exists a point in co(V (τ )\{η})
belonging to the interior of ω(η). This contradicts Lemma 5.16. Thus, x ∈ / ω(η).

The above lemma allows us to explicitly construct the basis functions φη in (5.3.4). To see this,
denote in the following for τ ∈ T and η ∈ V (τ ) by gτ,η ∈ P1 (Rd ) the affine function such that
(
1 if η = µ
gτ,η (µ) = for all µ ∈ V (τ ).
0 if η ̸= µ

This function exists and is unique by Lemma 5.15. Observe that φη (x) = gτ,η (x) for all x ∈ τ .

Lemma 5.21. Let η ∈ V ∩ Ω̊ be an interior node and let ω(η) be a convex patch. Then
 
φη (x) = max 0, min gτ,η (x) for all x ∈ Rd . (5.3.9)
{τ ∈T | η∈τ }

Proof. First let x ∈


/ ω(η). By Lemma 5.20 there exists τ ∈ V (η) such that x is in the complement
of H+ (τ, η). Observe that

gτ,η |H+ (τ,η) ≥ 0 and gτ,η |H+ (τ,η)c < 0. (5.3.10)

Thus

min gτ,η (x) < 0 for all x ∈ ω(η)c ,


{τ ∈T | η∈τ }

i.e., (5.3.9) holds for all x ∈ R\ω(η). Next, let τ , τ ′ ∈ T such that η ∈ τ and η ∈ τ ′ . We wish to
show that gτ,η (x) ≤ gτ ′ ,η (x) for all x ∈ τ . Since gτ,η (x) = φη (x) for all x ∈ τ , this then concludes
the proof of (5.3.9). By Lemma 5.20 it holds

µ ∈ H+ (τ ′ , η) for all µ ∈ V (τ ).

Hence, by (5.3.10)

gτ ′ ,η (µ) ≥ 0 = gτ,η (µ) for all µ ∈ V (τ )\{η}.

Moreover, gτ,η (η) = gτ ′ ,η (η) = 1. Thus, gτ,η (µ) ≥ gτ ′ ,η (µ) for all µ ∈ V (τ ′ ) and therefore

gτ ′ ,η (x) ≥ gτ,η (x) for all x ∈ co(V (τ ′ )) = τ ′ .

of Theorem 5.19. For every interior node η ∈ V ∩ Ω̊, the cpwl basis function φη in (5.3.4) can be
expressed as in (5.3.9), i.e.,

φη (x) = σ • Φmin
|{τ ∈T | η∈τ }| • (gτ,η (x)){τ ∈T | η∈τ } ,

63
where (gτ,η (x)){τ ∈T | η∈τ } denotes the parallelization with shared inputs of the functions gτ,η (x) for
all τ ∈ T such that η ∈ τ .
For this neural network, with |{τ ∈ T | η ∈ τ }| ≤ kT , we have by Lemma 5.2

size(φη ) ≤ 4 size(σ) + size(Φmin



|{τ ∈T | η∈τ }| ) + size((gτ,η ){τ ∈T | η∈τ } )
≤ 4(2 + 16kT + kT d) (5.3.11)

and similarly

depth(φη ) ≤ 4 + ⌈log2 (kT )⌉, width(φη ) ≤ max{1, 3kT , d}. (5.3.12)

Since for every interior node, the number of simplices touching the node must be larger or equal
to d, we can assume max{kT , d} = kT in the following (otherwise there exist no interior nodes, and
the function f is constant 0). As in the proof of Theorem 5.14, the neural network

X
Φ(x) := f (η)φη (x)
η∈V∩Ω̊

realizes the function f on all of Ω. Since the number of nodes |V| is bounded by (d + 1)|T |, an
application of Lemma 5.4 yields the desired bounds.

5.4 Convergence rates for Hölder continuous functions


Theorem 5.14 immediately implies convergence rates for certain classes of (low regularity) functions.
Recall for example the space of Hölder continuous functions: for s ∈ (0, 1] and a bounded
domain Ω ⊆ Rd we define
|f (x) − f (y)|
∥f ∥C 0,s (Ω) := sup |f (x)| + sup . (5.4.1)
x∈Ω x̸=y∈Ω ∥x − y∥s2

Then, C 0,s (Ω) is the set of functions f ∈ C 0 (Ω) for which ∥f ∥C 0,s (Ω) < ∞.
Hölder continuous functions can be approximated well by certain cpwl functions. Therefore, we
obtain the following result.

Theorem 5.22. Let d ∈ N. There exists a constant C = C(d) such that for every f ∈ C 0,s ([0, 1]d )
and every N there exists a ReLU neural network ΦfN with

size(ΦfN ) ≤ CN, width(ΦfN ) ≤ CN, depth(ΦfN ) = C

and
s
sup f (x) − ΦfN (x) ≤ C∥f ∥C 0,s ([0,1]d ) N − d .
x∈[0,1]d

64
Proof. For M ≥ 2, consider the set of nodes {ν/M | ν ∈ {−1, . . . , M + 1}d } where ν/M =
(ν1 /M, . . . , νd /M ). These nodes suggest a partition of [−1/M, 1 + 1/M ]d into (2 + M )d sub-
hypercubes. Each such sub-hypercube can be partitioned into d! simplices, such that we obtain a
regular triangulation T with d!(2+M )d elements on [0, 1]d . According to Theorem 5.14 there exists a
neural network Φ that is cpwl with respect to T and Φ(ν/M ) = f (ν/M ) whenever ν ∈ {0, . . . , M }d
and Φ(ν/M ) = 0 for all other (boundary) nodes. It holds

size(Φ) ≤ C|T | = Cd!(2 + M )d ,


width(Φ) ≤ C|T | = Cd!(2 + M )d , (5.4.2)
depth(Φ) ≤ C
for a constant C that only depends on d (since for our regular triangulation T , kT in (5.3.2) is a
fixed d-dependent constant).
Let us bound the error. Fix a point x ∈ [0, 1]d . Then x belongs to one of the interior simplices
τ of the triangulation. Two nodes of the simplex have distance at most
2 1/2 √
 
d 

X 1  = d =: ε.
M M
j=1

Since Φ|τ is the linear interpolant of f at the nodes V (τ ) of the simplex τ , Φ(x) is a convex
combination of the (f (η))η∈V (τ ) . Fix an arbitrary node η 0 ∈ V (τ ). Then ∥x − η 0 ∥2 ≤ ε and

|Φ(x) − Φ(η 0 )| ≤ max |f (η) − f (µ)| ≤ sup |f (x) − f (y)|


η,µ∈V (τ ) x,y∈[0,1]d
∥x−y∥2 ≤ε

≤ ∥f ∥C 0,s ([0,1]d ) εs .
Hence, using f (η 0 ) = Φ(η 0 ),

|f (x) − Φ(x)| ≤ |f (x) − f (η 0 )| + |Φ(x) − Φ(η 0 )|


≤ 2∥f ∥C 0,s ([0,1]d ) εs
s
= 2∥f ∥C 0,s ([0,1]d ) d 2 M −s
s s
= 2d 2 ∥f ∥C 0,s ([0,1]d ) N − d (5.4.3)

where N := M d . The statement follows by (5.4.2) and (5.4.3).

The principle behind Theorem 5.22 can be applied in even more generality. Since we can
represent every cpwl function on a regular triangulation with a neural network of size O(N ), where
N denotes the number of elements, all of classical (e.g. finite element) approximation theory for
cpwl functions can be lifted to generate statements about ReLU approximation. For instance, it is
well-known, that functions in the Sobolev space H 2 ([0, 1]d ) can be approximated by cpwl functions
on a regular triangulation in terms of L2 ([0, 1]d ) with the rate 2/d. Similar as in the proof of
Theorem 5.22, for every f ∈ H 2 ([0, 1]d ) and every N ∈ N there then exists a ReLU neural network
ΦN such that size(ΦN ) = O(N ) and
2
∥f − ΦN ∥L2 ([0,1]d ) ≤ C∥f ∥H 2 ([0,1]d ) N − d .

65
Finally, we can wonder how to approximate even smoother functions, i.e., those that have many
continuous derivatives. Since more smoothness is a restrictive assumption on the set of functions
to approximate, we would hope that this will allow us to have smaller neural networks. Essentially,
we desire a result similar to Theorem 4.9, but with the ReLU activation function.
However, we will see in the following chapter, that the emulation of piecewise affine functions
on regular triangulations cannot yield the approximation rates of Theorem 4.9. To harness the
smoothness, it will be necessary to build ReLU neural networks that emulate polynomials. Sur-
prisingly, we will see in Chapter 7 that polynomials can be very efficiently approximated by deep
ReLU neural networks.

Bibliography and further reading


The ReLU calculus introduced in Section 5.1 was similarly given in [174]. The fact that every
cpwl function can be expressed as a maximum over a minimum of linear functions goes back to the
papers [226, 225]; also see [169, 237].
The main result of Section 5.2, which shows that every cpwl function can be expressed by a
ReLU network, is then a straightforward consequence. This was first observed in [4], which also
provided bounds on the network size. These bounds were significantly improved in [85] for cwpl
functions on triangular meshes that satisfy a local convexity condition. Under this assumption, it
was shown that the network size essentially only grows linearly with the number of pieces. The
paper [136] showed that the convexity assumption is not necessary for this statement to hold. We
give a similar result in Section 5.3.2, using a simpler argument than [136]. The locally convex case
from [85] is separately discussed in Section 5.3.3, as it allows for further improvements in some
constants.
The implications for the approximation of Hölder continuous functions discussed in Section 5.4,
follows by standard approximation theory for cpwl functions. For a general reference on splines
and piecewise polynomial approximation see for instance [207]. Finally we mention that similar
convergence results can also be shown for other activation functions, see, e.g., [144].

66
Exercises
Exercise 5.23. Let p : R → R be a polynomial of degree n ≥ 1 (with leading coefficient nonzero)
and let s : R → R be a continuous sigmoidal activation function. Show that the identity map
x 7→ x : R → R belongs to N11 (p; 1, n + 1) but not to N11 (s; L) for any L ∈ N.

Exercise 5.24. Consider cpwl functions f : R → R with n ∈ N0 breakpoints (points where the
function is not C 1 ). Determine the minimal size required to exactly express every such f with a
depth-1 ReLU neural network.

Exercise 5.25. Show that, the notion of affine independence is invariant under permutations of
the points.

= co(x0 , . . . , xd ) be a d-simplex. Show that the coefficients αi ≥ 0 such that


Exercise 5.26. Let τ P
Pd d
i=0 αi = 1 and x = i=0 αi xi are unique for every x ∈ τ .

Exercise
Sd 5.27. Let τ = co(η 0 , . . . , η d ) be a d-simplex. Show that the boundary of τ is given by
i=0 co({η 0 , . . . , η d }\{η i }).

67
Chapter 6

Affine pieces for ReLU neural


networks

In the previous chapters, we observed some remarkable approximation results of shallow ReLU
neural networks. In practice, however, deeper architectures are more common. To understand why,
we in this chapter we discuss some potential shortcomings of shallow ReLU networks compared to
deep ReLU networks.
Traditionally, an insightful approach to study limitations of ReLU neural networks has been to
analyze the number of linear regions these functions can generate.

Definition 6.1. Let d ∈ N, Ω ⊆ Rd , and let f : Ω → R be cpwl (see Definition 5.5). We say
that f has p ∈ N pieces S (or linear regions), if p is the smallest number of connected open
sets (Ωi )pi=1 such that pi=1 Ωi = Ω, and f |Ωi is an affine function for all i = 1, . . . , p. We denote
Pieces(f, Ω) := p.
For d = 1 we call every point where f is not differentiable a break point of f .

To get an accurate cpwl approximation of a function, the approximating function needs to have
many pieces. The next theorem, corresponding to [62, Theorem 2], quantifies this statement.

Theorem 6.2. Let −∞ < a < b < ∞ and f ∈ C 3 ([a, b]) so that f is not affine. Then there exists
Rbp
a constant c > 0 depending only on a |f ′′ (x)| dx so that

∥g − f ∥L∞ ([a,b]) > cp−2

for all cpwl g with at most p ∈ N pieces.

The proof of the theorem is left to the reader, see Exercise 6.12.
Theorem 6.2 implies that for ReLU neural networks we need architectures allowing for many
pieces, if we want to approximate non-linear functions to high accuracy. But how many pieces can

68
we create for a fixed depth and width? We will establish a simple theoretical upper bound in Section
6.1. Subsequently, we will investigate under which conditions these upper bounds are attainable
in Section 6.2. This will reveal that certain functions necessitate very large shallow networks for
approximation, whereas relatively small deep networks can also approximate them. These findings
are presented in Section 6.3.
Finally, we will question the practical relevance of this analysis by examining how many pieces
typical neural networks possess. Surprisingly, in Section 6.4 we will find that randomly initialized
deep neural networks on average do not have a number of pieces that is anywhere close to the
theoretical upper bound.

6.1 Upper bounds


Neural networks are based on the composition and addition of neurons. These two operations
increase the possible number of pieces in a very specific way. Figure 6.1 depicts the two operations
and their effect. They can be described as follows:
• Summation: Let Ω ⊆ R. The sum of two cpwl functions f1 , f2 : Ω → R satisfies

Pieces(f1 + f2 , Ω) ≤ Pieces(f1 , Ω) + Pieces(f2 , Ω) − 1. (6.1.1)

This holds because the sum is affine in every point where both f1 and f2 are affine. Therefore,
the sum has at most as many break points as f1 and f2 combined. Moreover, the number of
pieces of a univariate function equals the number of its break points plus one.

• Composition: Let again Ω ⊆ R. The composition of two functions f1 : Rd → R and f2 : Ω →


Rd satisfies

Pieces(f1 ◦ f2 , Ω) ≤ Pieces(f1 , Rd ) · Pieces(f2 , Ω). (6.1.2)

This is because for each of the affine pieces of f2 —let us call one of those pieces A ⊆ R—we
have that f2 is either constant or injective on A. If it is constant, then f1 ◦ f2 is constant. If
it is injective, then Pieces(f1 ◦ f2 , A) = Pieces(f1 , f2 (A)) ≤ Pieces(f1 , Rd ). Since this holds
for all pieces of f2 we get (6.1.2).
These considerations give the following result, which follows the argument of [227, Lemma 2.1].
We state it for general cpwl activation functions. The ReLU activation function corresponds to
p = 2.

Theorem 6.3. Let L ∈ N. Let σ be cpwl with p pieces. Then, every neural network with architecture
(σ; 1, d1 , . . . , dL , 1) has at most (p · width(Φ))L pieces.

Proof. The proof is via induction over the depth L. Let L = 1, and let Φ : R → R be a neural
network of architecture (σ; 1, d1 , 1). Then
d1
(1) (0) (0)
X
Φ(x) = wk σ(wk x + bk ) + b(1) for x ∈ R,
k=1

69
Figure 6.1: Top: Composition of two cpwl functions f1 ◦ f2 can create a piece whenever the value
of f2 crosses a level that is associated to a break point of f1 . Bottom: Addition of two cpwl
functions f1 + f2 produces a cpwl function that can have break points at positions where either f1
or f2 has a break point.

for certain w(0) , w(1) , b(0) ∈ Rd1 and b(1) ∈ R. By (6.1.1), Pieces(Φ) ≤ p · width(Φ).
For the induction step, assume the statement holds for L ∈ N, and let Φ : R → R be a neural
network of architecture (σ; 1, d1 , . . . , dL+1 , 1). Then, we can write
dL+1
X
Φ(x) = wj σ(hj (x)) + b for x ∈ R,
j=1

for some w ∈ RdL+1 , b ∈ R, and where each hj is a neural network of architecture (σ; 1, d1 , . . . , dL , 1).
Using the induction hypothesis, each σ ◦ hℓ has at most p · (p · width(Φ))L affine pieces. Hence
Φ has at most width(Φ) · p · (p · width(Φ))L = (p · width(Φ))L+1 affine pieces. This completes the
proof.

Theorem 6.3 shows that there are limits to how many pieces can be created with a certain
architecture. It is noteworthy that the effects of the depth and the width of a neural network
are vastly different. While increasing the width can polynomially increase the number of pieces,
increasing the depth can result in exponential increase. This is a first indication of the prowess of
depth of neural networks.
To understand the effect of this on the approximation problem, we apply the bound of Theorem
6.3 to Theorem 6.2.

Theorem 6.4. Let d0 ∈ N and f R∈ C 3 d0 d0


p ([0, 1] ). Assume there exists a line segment s ⊆ [0, 1] of
′′
positive length such that 0 < c := s |f (x)| dx. Then, there exists C > 0 solely depending on c,
such that for all ReLU neural networks Φ : Rd0 → R with L layers

∥f − Φ∥L∞ ([0,1]d0 ) ≥ c · (2width(Φ))−2L .

Theorem 6.4 gives a lower bound on achievable approximation rates in dependence of the depth
L. As target functions become smoother, we expect that we can achieve faster convergence rates

70
(cp. Chapter 4). However, without increasing the depth, it seems to be impossible to leverage such
additional smoothness.
This observation strongly indicates that deeper architectures can be superior. Before we can
make such statements, we first explore whether the upper bounds of Theorem 6.3 are even achiev-
able.

6.2 Tightness of upper bounds


To construct a ReLU neural network, that realizes the upper bound of Theorem 6.3, we first let
h1 : [0, 1] → R be the hat function
(
2x if x ∈ [0, 12 ]
h1 (x) :=
2 − 2x if x ∈ [ 12 , 1].

This function can be expressed by a ReLU neural network of depth one and with two nodes

h1 (x) = σReLU (2x) − σReLU (4x − 2) for all x ∈ [0, 1]. (6.2.1)

We recursively set hn := hn−1 ◦ h1 for all n ≥ 2, i.e., hn = h1 ◦ · · · ◦ h1 is the n-fold composition of


h1 . Since h1 : [0, 1] → [0, 1], we have hn : [0, 1] → [0, 1] and

hn ∈ N11 (σReLU ; n, 2).

It turns out that this function has a rather interesting behavior. It is a “sawtooth” function with
2n−1 spikes, see Figure 6.2.

Lemma 6.5. Let n ∈ N. It holds for all x ∈ [0, 1]


(
2n (x − i2−n ) if i ≥ 0 is even and x ∈ [i2−n , (i + 1)2−n ]
hn (x) =
2n ((i + 1)2−n − x) if i ≥ 1 is odd and x ∈ [i2−n , (i + 1)2−n ].

Proof. The case n = 1 holds by definition. We proceed by induction, and assume the statement
holds for n. Let x ∈ [0, 1/2] and i ≥ 0 even such that x ∈ [i2−(n+1) , (i + 1)2−(n+1) ]. Then
2x ∈ [i2−n , (i + 1)2−n ]. Thus

hn (h1 (x)) = hn (2x) = 2n (2x − i2−n ) = 2n+1 (x − i2−n+1 ).

Similarly, if x ∈ [0, 1/2] and i ≥ 1 odd such that x ∈ [i2−(n+1) , (i + 1)2−(n+1) ], then h1 (x) = 2x ∈
[i2−n , (i + 1)2−n ] and

hn (h1 (x)) = hn (2x) = 2n (2x − (i + 1)2−n ) = 2n+1 (x − (i + 1)2−n+1 ).

The case x ∈ [1/2, 1] follows by observing that hn+1 is symmetric around 1/2.

The neural network hn has size O(n) and is piecewise linear on at least 2n pieces. This shows
that the number of pieces can indeed increase exponentially in the neural network size, also see the
upper bound in Theorem 6.3.

71
h1 h2 h3
1 1 1

0 1 0 1 0 1

Figure 6.2: The functions hn in Lemma 6.5.

6.3 Depth separation


Now that we have established how increasing the depth can lead to exponentially more pieces than
increasing the width, we can deduce a so-called “depth-separation” result shown by Telgarsky in
[227, 228]. Such statements verify the existence of functions that can easily be approximated by
deep neural networks, but require much larger size when approximated by shallow neural networks.
The following theorem, along with its proof, is presented similarly in Telgarsky’s lecture notes [229].

Theorem 6.6. For every n ∈ N there exists a neural network f ∈ N11 (σReLU ; n2 + 3, 2) such that
for any g ∈ N11 (σReLU ; n, 2n−1 ) holds
Z 1
1
|f (x) − g(x)| dx ≥ .
0 32

The neural network f may have quadratically more layers than g, but width(g) = 2n−1 and
width(f ) = 2. Hence the size of g may be exponentially larger than the size of f , but nonetheless no
such g can approximate f . Thus even exponential increase in width cannot necessarily compensate
for increase in depth. The proof is based on the following observations stated in [228]:

(i) Functions with few oscillations poorly approximate functions with many oscillations,

(ii) neural networks with few layers have few oscillations,

(iii) neural networks with many layers can have many oscillations.

Proof of Theorem 6.6. Fix n ∈ N. Let f := hn2 +3 ∈ N11 (σReLU ; n2 + 3, 2). For arbitrary g ∈
2
N11 (σReLU ; n, 2n−1 ), by Theorem 6.3, g is piecewise linear with at most (2 · 2n−1 )n = 2n break
2
points. The function f is the sawtooth function with 2n +2 spikes. The number of triangles formed
2 2
by the graph of f and the constant line at 1/2 equals 2n +3 − 1, each with area 2−(n +5) , see
Figure 6.3. For the m triangles in between two break points of g, the graph of g does not cross at

72
1

1
2

0 1 0 1

Figure 6.3: Left: The functions hn form 2n − 1 triangles with the line at 1/2, each with area
2−(n+2) . Right: For an affine function with m (in this sketch m = 5) triangles in between two
break points, the function can cross at most ⌈m/2⌉ + 1 ≤ m/2 + 2 of them. Figure adapted from
[229, Section 5].

least m − (m/2 + 2) = m/2 − 2 of them. Thus we can bound


 
Z 1 1 
2 +3 2 2 2
−(n +5)
|f (x) − g(x)| dx ≥  ( 2n − 1 − 2n )− 2n}
2| ·{z |2 {z }
 
2

0
| {z } 
triangles on an interval ≥2·(pieces of g) area of a triangle

without break point of g
| {z }
≥missed triangles
2 +2 2 2 +5)
≥ (2n − 3 · 2n ) · 2−(n
2 2 1
≥ 2n · 2−(n +5) = ,
32
which concludes the proof.

6.4 Number of pieces in practice


We have seen in Theorem 6.3 that deep neural networks can have many more pieces than their
shallow counterparts. This begs the question if deep neural networks tend to generate more pieces
in practice. More formally: If we randomly initialize the weights of a neural network, what is the
expected number of linear regions? Will this number scale exponentially with the depth? This
question was analyzed in [82], and surprisingly, it was found that the number of pieces of randomly
initialized neural networks typically does not depend exponentially on the depth. In Figure 6.4, we
depict two neural networks, one shallow and one deep, that were randomly initialized according to
He initialization [86]. Both neural networks have essentially the same number of pieces (114 and
110) and there is no clear indication that one has a deeper architecture than the other.
In the following, we will give a simplified version of the main result of [82] to show why random
deep neural networks often behave like shallow neural networks.
We recall from Figure 6.1 that pieces are generated through composition of two functions f1
and f2 , if the values of f2 cross a level that is associated to a break point of f1 . In the case of a
simple neuron of the form

x 7→ σReLU (⟨a, h(x)⟩ + b)

73
Figure 6.4: Two randomly initialized neural networks Φ1 and Φ2 with architectures
(σReLU ; 1, 10, 10, 1) and (σReLU ; 1, 5, 5, 5, 5, 5, 1). The initialization scheme was He initialization
[86]. The number of linear regions equals 114 and 110, respectively.

where h is a cpwl function, a is a vector, and b is a scalar, many pieces can be generated if ⟨a, h(x)⟩
crosses the −b level often.
If a, b are random variables, and we know that h does not oscillate too much, then we can
quantify the probability of ⟨a, h(x)⟩ crossing the −b level often. The following lemma from [115,
Lemma 3.1] provides the details.

Lemma 6.7. Let c > 0 and let h : [0, c] → R be a cpwl function on [0, c]. Let t ∈ N, let A ⊆ R be
a Lebesgue measurable set, and assume that for every y ∈ A it holds that

|{x ∈ [0, c] | h(x) = y}| ≥ t.

Then, c∥h′ ∥L∞ ≥ ∥h′ ∥L1 ≥ |A| · t, where |A| is the Lebesgue measure of A.
In particular, if h has at most P ∈ N pieces and ∥h′ ∥L1 is finite, then it holds for all δ > 0 that
for all t ≤ P

∥h′ ∥L1
P [|{x ∈ [0, c] | h(x) = U }| ≥ t] ≤ ,
δt
P [|{x ∈ [0, c] | h(x) = U }| > P ] = 0,

where U is a uniformly distributed variable on [−δ/2, δ/2].

Proof. We will assume c = 1. The general case then follows by considering h̃(x) = h(x/c).
Let for (ci )Pi=1
+1
⊆ [0, 1] with c1 = 0, cP +1 = 1 and ci ≤ ci+1 for all i = 1, . . . , P + 1 the pieces of
h be given by ((ci , ci+1 ))Pi=1 . We denote

V1 := [0, c2 ], Vi := (ci , ci+1 ] for i = 1, . . . , P

74
and for j = i, . . . , P
i−1
[
Vei := Vj .
j=1

We define, for n ∈ N ∪ {∞}


n o
Ti,n := h(Vi ) ∩ y ∈ A |{x ∈ Vei | h(x) = y}| = n − 1 .

In words, Ti,n contains the values of A that are hit on Vi for the nth time. Since h is cpwl, we
observe that for all i = 1, . . . , P
(i) Ti,n1 ∩ Ti,n2 = ∅ for all n1 , n2 ∈ N ∪ {∞}, n1 ̸= n2 ,

(ii) Ti,∞ ∪ ∞
S
n=1 Ti,n = h(Vi ) ∩ A,

(iii) Ti,n = ∅ for all P < n < ∞,

(iv) |Ti,∞ | = 0.
Note that, since h is affine on Vi it holds that h′ = |h(Vi )|/|Vi | on Vi . Hence, for t ≤ P
P
X P
X

∥h ∥L1 ≥ |h(Vi )| ≥ |h(Vi ) ∩ A|
i=1 i=1
P ∞
!
X X
= |Ti,n | + |Ti,∞ |
i=1 n=1

P X
X
= |Ti,n |
i=1 n=1
Xt X P
≥ |Ti,n |,
n=1 i=1

where the first equality follows by (i), (ii), the second by (iv), and the last inequality by (iii).
Note that, by assumption for all n ≤ t every y ∈ A is an element of Ti,n or Ti,∞ for some i ≤ P .
Therefore, by (iv)
X P
|Ti,n | ≥ |A|,
i=1
which completes the proof.

Lemma 6.7 applied to neural networks essentially states that, in a single neuron, if the bias
term is chosen uniformly randomly on an interval of length δ, then the probability of generating at
least t pieces by composition scales reciprocal to t.
Next, we will analyze how Lemma 6.7 implies an upper bound on the number of pieces generated
in a randomly initialized neural network. For simplicity, we only consider random biases in the
following, but mention that similar results hold if both the biases and weights are random variables
[82].

75
Definition 6.8. Let L ∈ N, (d0 , d1 , . . . , dL , 1) ∈ NL+2 and W (ℓ) ∈ Rdℓ+1 ×dℓ for ℓ = 0, . . . , L. Fur-
thermore, let δ > 0 and let the bias vectors b(ℓ) ∈ Rdℓ+1 , for ℓ = 0, . . . , L, be random variables such
that each entry of each b(ℓ) is independently and uniformly distributed on the interval [−δ/2, δ/2].
We call the associated ReLU neural network a random-bias neural network.

To apply Lemma 6.7 to a single neuron with random biases, we also need some bound on the
derivative of the input to the neuron.

Definition 6.9. Let L ∈ N, (d0 , d1 , . . . , dL , 1) ∈ NL+2 , and W (ℓ) ∈ Rdℓ+1 ×dℓ and b(ℓ) ∈ Rdℓ+1 for
ℓ = 0, . . . , L. Moreover let δ > 0.
For ℓ = 1, . . . , L + 1, i = 1, . . . , dℓ introduce the functions

ηℓ,i (x; (W (j) , b(j) )ℓ−1


j=0 ) = (W
(ℓ−1) (ℓ−1)
x )i for x ∈ Rd0 ,

where x(ℓ−1) is as in (2.1.1). We call


(
 

ν (W (ℓ) )L ℓ=1 , δ := max ηℓ,i ( · ; (W (j) , b(j) )ℓ−1
j=0 )
2

L
)
Y
(b(j) )L
j=0 ∈ dj+1
[−δ/2, δ/2] , ℓ = 1, . . . , L, i = 1, . . . , dℓ
j=0

the maximal internal derivative of Φ.

We can now formulate the main result of this section.

Theorem 6.10. Let L ∈ N and let (d0 , d1 , . .. , dL , 1) ∈ NL+2 . Let δ ∈ (0, 1]. Let W (ℓ) ∈ Rdℓ+1 ×dℓ ,
for ℓ = 0, . . . , L, be such that ν (W (ℓ) )L
ℓ=0 , δ ≤ Cν for a Cν > 0.
For an associated random-bias neural network Φ, we have that for a line segment s ⊆ Rd0 of
length 1
L
Cν X
E[Pieces(Φ, s)] ≤ 1 + d1 + (1 + (L − 1) ln(2width(Φ))) dj . (6.4.1)
δ
j=2

Proof. Let W (ℓ) ∈ Rdℓ+1 ×dℓ for ℓ = 0, . . . , L. Moreover, let b(ℓ) ∈ [−δ/2, δ/2]dℓ+1 for ℓ = 0, . . . , L
be uniformly distributed random variables. We denote
θℓ : s → Rdℓ
dℓ
x 7→ (ηℓ,i (x; (W (j) , b(j) )ℓ−1
j=0 ))i=1 .

76
Let κ : s → [0, 1] be an isomorphism. Since each coordinate of θℓ is cpwl, there are points
x0 , x1 , . . . , xqℓ ∈ s with κ(xj ) < κ(xj+1 ) for j = 0, . . . , qℓ − 1, such that θℓ is affine (as a function
into Rdℓ ) on [κ(xj ), κ(xj+1 )] for all j = 0, . . . , qℓ − 1 as well as on [0, κ(x0 )] and [κ(xqℓ ), 1].
We will now inductively find an upper bound on the qℓ .
Let ℓ = 2, then
θ2 (x) = W (1) σReLU (W (0) x + b(0) ).
Since W (1) · +b(1) is an affine function, it follows that θ2 can only be non-affine in points where
σReLU (W (0) · +b(0) ) is not affine. Therefore, θ2 is only non-affine if one coordinate of W (0) · +b(0)
intersects 0 nontrivially. This can happen at most d1 times. We conclude that we can choose
q2 = d 1 .
Next, let us find an upper bound on qℓ+1 from qℓ . Note that

θℓ+1 (x) = W (ℓ) σReLU (θℓ (x) + b(ℓ−1) ).

Now θℓ+1 is affine in every point x ∈ s where θℓ is affine and (θℓ (x) + b(ℓ−1) )i ̸= 0 for all coordinates
i = 1, . . . , dℓ . As a result, we have that we can choose qℓ+1 such that

qℓ+1 ≤ qℓ + {x ∈ s | (θℓ (x) + b(ℓ−1) )i = 0 for at least one i = 1, . . . , dℓ } .

Therefore, for ℓ ≥ 2

X
qℓ+1 ≤ d1 + {x ∈ s | (θj (x) + b(j) )i = 0 for at least one i = 1, . . . , dj }
j=3
dj
ℓ X
(j)
X
≤ d1 + {x ∈ s | ηj,i (x) = −bi } .
j=2 i=1

By Theorem 6.3, we have that


 
Pieces ηℓ,i ( · ; (W (j) , b(j) )ℓ−1
j=0 ), s ≤ (2width(Φ))ℓ−1 .

We define for k ∈ N ∪ {∞}


h i
(ℓ)
pk,ℓ,i := P {x ∈ s | ηℓ,i (x) = −bi } ≥ k

Then by Lemma 6.7



pk,ℓ,i ≤
δk
and for k > (2width(Φ))ℓ−1

pk,ℓ,i = 0.

77
It holds
 
dj n
L X o
(j)
X
E x ∈ s ηj,i (x) = −bi 
j=2 i=1
dj ∞
L X h n o i
(j)
X X
≤ k·P x ∈ s ηj,i (x) = −bi =k
j=2 i=1 k=1
dj ∞
L X
X X
≤ k · (pk,j,i − pk+1,j,i ).
j=2 i=1 k=1

The inner sum can be bounded by



X ∞
X ∞
X
k · (pk,j,i − pk+1,j,i ) = k · pk,j,i − k · pk+1,j,i
k=1 k=1 k=1
X∞ X∞
= k · pk,j,i − (k − 1) · pk,j,i
k=1 k=2

X
= p1,j,i + pk,j,i
k=2

X
= pk,j,i
k=1
(2width(Φ))L−1
−1
X 1
≤ Cν δ
k
k=1
!
Z (2width(Φ))L−1
1
≤ Cν δ −1 1+ dx
1 x
≤ Cν δ −1 (1 + (L − 1) ln((2width(Φ)))).

We conclude that, in expectation, we can bound qL+1 by


L
X
d1 + Cν δ −1 (1 + (L − 1) ln(2width(Φ))) dj .
j=2

Finally, since θL = ΦL+1 |s , it follows that

Pieces(Φ, s) ≤ qL+1 + 1

which yields the result.

Remark 6.11. We make the following observations about Theorem 6.10:


• Non-exponential dependence on depth: If we consider (6.4.1), we see that the number of pieces
scales in expectation essentially like O(LN ), where N is the total number of neurons of the
architecture. This shows that in expectation, the number of pieces is linear in the number of
layers, as opposed to the exponential upper bound of Theorem 6.3.

78
• Maximal internal derivative: Theorem 6.10 requires the weights to be chosen such that the
maximal internal derivative is bounded by a certain number. However, if they are randomly
initialized in such a way that with high probability the maximal internal derivative is bounded
by a small number, then similar results can be shown. In practice, weights in the ℓth layer
p are
often initialized according to a centered normal distribution with standard deviation 2/dℓ ,
[86]. Due to the anti-proportionality of the variance to the width of the layers it is achieved
that the internal derivatives remain bounded with high probability, independent of the width
of the neural networks. This explaines the observation from Figure 6.4.

Bibliography and further reading


Establishing bounds on the number of linear regions of a ReLU network has been a popular tool
to investigate the complexity of ReLU neural networks, see [152, 185, 4, 210, 82]. The bound
presented in Section 6.1, is based on [227]. In addition to this bound, the paper also presents the
depth separation result discussed in Section 6.3. The proof techniques employed there have inspired
numerous subsequent works in the field.
Together with the lower bound on the number of required linear regions given in [62], this
analysis shows how depth can be a limiting factor in terms of achievable convergence rates, as
stated in Theorem 6.4.
For the construction of the sawtooth function in Section 6.2, and the depth separation result in
Section 6.3 follow the arguments in [227, 228, 229]. Beyond Telgarsky’s work, other notable depth
separation results include [60, 199, 4]. Moreover, closely related to such statements is the 1987
thesis by Håstad [101], which considers the limitations of logic circuits in terms of depth.
Finally, the analysis of the number of pieces deep neural networks attained with random intial-
ization (Section 6.4) is based on [82] and [115].

79
Exercises
Exercise 6.12. Let −∞ < a < b < ∞ and let f ∈ C 3 ([a, b])\P1 . Denote by p(ε) ∈ N the minimal
number of intervals partitioning [a, b], such that a (not necessarily continuous) piecewise linear
function on p(ε) intervals can approximate f on [a, b] uniformly up to error ε > 0. In this exercise,
we wish to show

lim inf p(ε) ε > 0. (6.4.2)
ε↘0

Therefore, we can find a constant C > 0 such that ε ≥ Cp(ε)−2 for all ε > 0. This shows a variant
of Theorem 6.2. Proceed as follows to prove (6.4.2):

(i) Fix ε > 0 and let a = x0 < x1 · · · < xp(ε) = b be a partitioning into p(ε) pieces. For
i = 0, . . . , p(ε) − 1 and x ∈ [xi , xi+1 ] let
 
f (xi+1 ) − f (xi )
ei (x) := f (x) − f (xi ) + (x − xi ) .
xi+1 − xi

Show that |ei (x)| ≤ 2ε for all x ∈ [xi , xi+1 ].

(ii) With hi := xi+1 − xi and mi := (xi + xi+1 )/2 show that

h2i ′′
max |ei (x)| = |f (mi )| + O(h3i ).
x∈[xi ,xi+1 ] 8

(iii) Assuming that c := inf x∈[a,b] |f ′′ (x)| > 0 show that


Z bp
√ 1
lim inf p(ε) ε ≥ |f ′′ (x)| dx.
ε↘0 4 a

(iv) Conclude that (6.4.2) holds for general non-linear f ∈ C 3 ([a, b]).

Exercise 6.13. Show that, for L = 1, Theorem 6.3 holds for piecewise smooth functions, when
replacing the number of affine pieces by the number of smooth pieces. These are defined by replacing
“affine” by “smooth” (meaning C ∞ ) in Definition 6.1.

Exercise 6.14. Show that, for L > 1, Theorem 6.3 does notx hold for piecewise smooth functions,
when replacing the number of affine pieces by the number of smooth pieces.
(p)
Exercise 6.15. For p ∈ N, p > 2 and n ∈ N, construct a function hn similar to hn of (6.5), such
(p) (p)
that hn ∈ N11 (σReLU ; n, p) and such that hn has pn pieces and size O(p2 n).

80
Chapter 7

Deep ReLU neural networks

In the previous chapter, we observed that many layers are a necessary prerequisite for ReLU neural
networks to approximate smooth functions with high rates. We now analyze which depth is sufficient
to achieve good approximation rates for smooth functions.
To approximate smooth functions efficiently, one of the main tools in Chapter 4 was to rebuild
polynomial-based functions, such as higher-order B-splines. For smooth activation functions, we
were able to reproduce polynomials by using the nonlinearity of the activation functions. This
argument certainly cannot be repeated for the piecewise linear ReLU. On the other hand, up until
now, we have seen that deep ReLU neural networks are extremely efficient at producing the strongly
oscillating sawtooth functions discussed in Lemma 6.5.
The main observation this chapter is that the efficient representation of sawtooth functions
is intimately linked to the approximation of the square function and hence allows very efficient
approximations of polynomial functions. This observation was first made by Dmitry Yarotsky [245]
in 2016, and the present chapter is primarily based on this paper.
First, in Section 7.1, we will give an efficient neural network approximation of the squaring func-
tion. Second, in Section 7.2, we will demonstrate how the squaring neural network can be modified
to yield a neural network that approximates the function that multiplies its inputs. Using these
two tools, we conclude in Section 7.3 that deep ReLU neural networks can efficiently approximate
k-times continuously differentiable functions with Hölder continuous derivatives.

7.1 The square function


In this section, we will show that the square function x 7→ x2 can be approximated very efficiently
by a deep neural network.

Proposition 7.1. Let n ∈ N. Then


n
X hj (x)
sn (x) := x −
22j
j=1

is a piecewise linear function on [0, 1] with break points xn,j = j2−n , j = 0, . . . , 2n . Moreover,
sn (xn,k ) = x2n,k for all k = 0, . . . , 2n , i.e. sn is the piecewise linear interpolant of x2 on [0, 1].

81
Proof. The statement holds for n = 1. We proceed by induction. Assume the statement holds for
sn and let k ∈ {0, . . . , 2n+1 }. By Lemma 6.5, hn+1 (xn+1,k ) = 0 whenever k is even. Hence for even
k ∈ {0, . . . , 2n+1 }
n+1
X hj (xn+1,k )
sn+1 (xn+1,k ) = xn+1,k −
22j
j=1
hn+1 (xn+1,k )
= sn (xn+1,k ) − 2(n+1)
= sn (xn+1,k ) = x2n+1,k ,
2

where we used the induction assumption sn (xn+1,k ) = x2n+1,k for xn+1,k = k2−(n+1) = k2 2−n =
xn,k/2 .
Now let k ∈ {1, . . . , 2n+1 − 1} be odd. Then by Lemma 6.5, hn+1 (xn+1,k ) = 1. Moreover,
since sn is linear on [xn,(k−1)/2 , xn,(k+1)/2 ] = [xn+1,k−1 , xn+1,k+1 ] and xn+1,k is the midpoint of this
interval,

hn+1 (xn+1,k )
sn+1 (xn+1,k ) = sn (xn+1,k ) −
22(n+1)
1 1
= (x2n+1,k−1 + x2n+1,k+1 ) − 2(n+1)
2 2
(k − 1)2 (k + 1)2 2
= 2(n+1)+1 + 2(n+1)+1 − 2(n+1)+1
2 2 2
1 2k 2 k2
= = 2(n+1) = x2n+1,k .
2 22(n+1) 2
This completes the proof.

Lemma 7.2. For n ∈ N, it holds

sup |x2 − sn (x)| ≤ 2−2n−1 .


x∈[0,1]

Moreover sn ∈ N11 (σReLU ; n, 3), and size(sn ) ≤ 7n and depth(sn ) = n.

Proof. Set en (x) := x2 − sn (x). Let x be in the interval [xn,k , xn,k+1 ] = [k2−n , (k + 1)2−n ] of length
2−n . Since sn is the linear interpolant of x2 on this interval, we have

x2n,k+1 − x2n,k 2k + 1 1
|e′n (x)| = 2x − = 2x − ≤ n.
2−n 2n 2

Thus en : [0, 1] → R has Lipschitz constant 2−n . Since en (xn,k ) = 0 for all k = 0, . . . , 2n , and the
length of the interval [xn,k , xn,k+1 ] equals 2−n we get

1
sup |en (x)| ≤ 2−n 2−n = 2−2n−1 .
x∈[0,1] 2

82
x s1 (x) s2 (x) sn−1 (x)

x h1 (x) x ... sn (x)

h1 (x) h2 (x) h3 (x) hn (x)

Figure 7.1: The neural networks h1 (x) = σReLU (2x) − σReLU (4x − 2) and sn (x) = σReLU (sn−1 (x)) −
hn (x)/22n where hn = h1 ◦ hn−1 .

Finally, to see that sn can be represented by a neural network of the claimed architecture, note
that for n ≥ 2
n
X hj (x) hn (x) h1 ◦ hn−1 (x)
sn (x) = x − = sn−1 (x) − = σReLU ◦ sn−1 (x) − .
22j 2 2n 22n
j=1

Here we used that sn−1 is the piecewise linear interpolant of x2 , so that sn−1 (x) ≥ 0 and thus
sn−1 (x) = σReLU (sn−1 (x)) for all x ∈ [0, 1]. Hence sn is of depth n and width 3, see Figure 7.1.

In conclusion, we have shown that sn : [0, 1] → [0, 1] approximates the square function uniformly
on [0, 1] with exponentially decreasing error in the neural network size. Note that due to Theorem
6.4, this would not be possible with a shallow neural network, which can at best interpolate x2 on
a partition of [0, 1] with polynomially many (w.r.t. the neural network size) pieces.

7.2 Multiplication
According to Lemma 7.2, depth can help in the approximation of x 7→ x2 , which, on first sight,
seems like a rather specific example. However, as we shall discuss in the following, this opens
up a path towards fast approximation of functions with high regularity, e.g., C k ([0, 1]d ) for some
k > 1. The crucial observation is that, via the polarization identity we can write the product of
two numbers as a sum of squares

(x + y)2 − (x − y)2
x·y = (7.2.1)
4
for all x, y ∈ R. Efficient approximation of the operation of multiplication allows efficient ap-
proximation of polynomials. Those in turn are well-known to be good approximators for functions
exhibiting k ∈ N derivatives. Before exploring this idea further in the next section, we first make
precise the observation that neural networks can efficiently approximate the multiplication of real
numbers.
We start with the multiplication of two numbers, in which case neural networks of logarithmic
size in the desired accuracy are sufficient.

83
Lemma 7.3. For every ε > 0 there exists a ReLU neural network Φ× 2
ε : [−1, 1] → [−1, 1] such that

sup |x · y − Φ×
ε (x, y)| ≤ ε,
x,y∈[−1,1]

and it holds size(Φ× ×


ε ) ≤ C · (1 + | log(ε)|) and depth(Φε ) ≤ C · (1 + | log(ε)|) for a constant C > 0
×
independent of ε. Moreover, Φε (x, y) = 0 if x = 0 or y = 0.

Proof. With n = ⌈| log4 (ε)|⌉, define the neural network


 
× σReLU (x + y) + σReLU (−x − y)
Φε (x, y) :=sn
2
 
σReLU (x − y) + σReLU (y − x)
− sn . (7.2.2)
2

Since |a| = σReLU (a) + σReLU (−a), by (7.2.1) we have for all x, y ∈ [−1, 1]

(x + y)2 − (x − y)2
    
× |x + y| |x − y|
x · y − Φε (x, y) = − sn − sn
4 2 2
4( x+y 2 x−y 2
2 ) − 4( 2 ) 4sn ( |x+y| |x−y|
2 ) − 4sn ( 2 )
= −
4 4
4(2−2n−1 + 2−2n−1 )
≤ = 4−n ≤ ε,
4
where we used |x+y|/2, |x−y|/2 ∈ [0, 1]. We have depth(Φ× ε ) = 1+depth(sn ) = 1+n ≤ 1+⌈log4 (ε)⌉
and size(Φ× ε ) ≤ C + 2size(s n ) ≤ Cn ≤ C · (1 − log(ε)) for some constant C > 0.
The fact that Φ× ε maps from [−1, 1]2 → [−1, 1] follows by (7.2.2) and because s : [0, 1] → [0, 1].
n
Finally, if x = 0, then Φ× ε (x, y) = sn (|x + y|) − sn (|x − y|) = sn (|y|) − sn (|y|) = 0. If y = 0 the same
argument can be made.

In a similar way as in Proposition 4.8 and Lemma 5.11, we can apply operations with two inputs
in the form of a binary tree to extend them to an operation on arbitrary many inputs.

Proposition 7.4. For every n ≥ 2 and ε > 0 there exists a ReLU neural network Φ× n
n,ε : [−1, 1] →
[−1, 1] such that

n
Y
sup xj − Φ×
n,ε (x1 , . . . , xn ) ≤ ε,
xj ∈[−1,1] j=1

and it holds size(Φ× ×


n,ε ) ≤ Cn · (1 + | log(ε/n)|) and depth(Φn,ε ) ≤ C log(n)(1 + | log(ε/n)|) for a
constant C > 0 independent of ε and n.

84
Proof. We begin with the case n = 2k . For k = 1 let Φ̃× ×
2,δ := Φδ . If k ≥ 2 let
 
Φ̃×k
2 ,δ
:= Φ×
δ ◦ Φ̃ ×
2k−1 ,δ
, Φ̃ ×
2k−1 ,δ
.

Using Lemma 7.3, we find that this neural network has depth bounded by
 
depth Φ̃× k
2 ,δ
≤ kdepth(Φ×δ ) ≤ Ck · (1 + | log(δ)|) ≤ C log(n)(1 + | log(δ)|).

Observing that the number of occurences of Φ×


Pk−1 j ×
δ equals j=0 2 ≤ n, the size of Φ̃2k ,δ can bounded
by Cnsize(Φ×δ ) ≤ Cn · (1 + | log(δ)|).
k
To estimate the approximation error, denote with x = (xj )2j=1

Y
ek := sup xj − Φ̃×
2k ,δ
(x) .
xj ∈[−1,1]
j≤2k

Then, using short notation of the type x≤2k−1 := (x1 , . . . , x2k−1 ),


2 k
Y  
ek = sup xj − Φ×
δ Φ̃×
2k−1 ,δ
(x≤2k−1 ), Φ̃ ×
2k−1 ,δ
(x>2k−1 )
xj ∈[−1,1] j=1
 
Y
≤δ+ sup  xj ek−1 + Φ̃×
2k−1 ,δ
(x>2k−1 ) ek−1 
xj ∈[−1,1]
j≤2k−1
k−2
X
≤ δ + 2ek−1 ≤ δ + 2(δ + 2ek−2 ) ≤ · · · ≤ δ 2j + 2k−1 e1
j=0
k
≤ 2 δ = nδ = ε.
k−1
Here we used e1 ≤ δ, and that Φ̃× 2k ,δ
maps [−1, 1]2 to [−1, 1], which is a consequence of Lemma
7.3.
The case for general n ≥ 2 (not necessarily n = 2k ) is treated similar as in Lemma 5.11, by
replacing some Φ× δ neural networks with identity neural networks.
×
Finally, setting δ := ε/n and Φ×n,ε := Φ̃n,δ concludes the proof.

7.3 C k,s functions


We will now discuss the implications of our observations in the previous sections for the approxi-
mation of functions in the class C k,s .

Definition 7.5. Let k ∈ N0 , s ∈ [0, 1] and Ω ⊆ Rd . Then

∥f ∥C k,s (Ω) := sup max |Dα f (x)|


x∈Ω {α∈Nd0 | |α|≤k}
|Dα f (x) − Dα f (y)| (7.3.1)
+ sup max ,
x̸=y∈Ω {α∈Nd0 | |α|=k} ∥x − y∥s2

85
and we denote by C k,s (Ω) the set of functions f ∈ C k (Ω) for which ∥f ∥C k,s (Ω) < ∞.

Note that these spaces are ordered according to

C k (Ω) ⊇ C k,s (Ω) ⊇ C k,t (Ω) ⊇ C k+1 (Ω)

for all 0 < s ≤ t ≤ 1.


In order to state our main result, we first recall a version of Taylor’s remainder formula for
k,s
C (Ω) functions.

Lemma 7.6. Let d ∈ N, k ∈ N, s ∈ [0, 1], Ω = [0, 1]d and f ∈ C k,s (Ω). Then for all a, x ∈ Ω
X Dα f (a)
f (x) = (x − a)α + Rk (x) (7.3.2)
α!
{α∈Nd0 | 0≤|α|≤k}

k+1/2
where with h := maxi≤d |ai − xi | we have |Rk (x)| ≤ hk+s d k! ∥f ∥C k,s (Ω) .

Proof. First, for a function g ∈ C k (R) and a, t ∈ R


k−1 (j)
X g (a) g (k) (ξ)
g(t) = (t − a)j + (t − a)k
j! k!
j=0
k
X g (j) (a) g (k) (ξ) − g (k) (a)
= (t − a)j + (t − a)k ,
j! k!
j=0

for some ξ between a and t. Now let f ∈ C k,s (Rd ) and a, x ∈ Rd . Thus with g(t) := f (a+t·(x−a))
holds for f (x) = g(1)
k−1 (j)
X g (0) g (k) (ξ)
f (x) = + .
j! k!
j=0

By the chain rule


 
X j
g (j) (t) = Dα f (a + t · (x − a))(x − a)α ,
α
{α∈Nd0 | |α|=j}

j j! j! Qd
and (x − a)α = − aj )αj .

where we use the multivariate notations α = α! = Qd j=1 (xj
j=1 αj !

86
Hence
X Dα f (a)
f (x) = (x − a)α
α!
{α∈Nd0 | 0≤|α|≤k}
| {z }
∈Pk
X Dα f (a + ξ · (x − a)) − Dα f (a)
+ (x − a)α ,
α!
|α|=k
| {z }
=:Rk

for some ξ ∈ [0, 1]. Using the definition of h, the remainder term can be bounded by
 
k α α 1 X k
|Rk | ≤ h max sup |D f (a + t · (x − a)) − D f (a)|
|α|=k x∈Ω k! d
α
t∈[0,1] {α∈N0 | |α|=k}

k+ 12
d
≤ hk+s ∥f ∥C k,s (Ω) ,
k!
√ k
= (1 + · · · + 1)k = dk by the
P 
where we used (7.3.1), ∥x − a∥2 ≤ dh and {α∈Nd0 | |α|=k} α
multinomial formula.

We now come to the main statement of this section. Up to logarithmic terms, it shows the
convergence rate (k + s)/d for approximating functions in C k,s ([0, 1]d ).

Theorem 7.7. Let d ∈ N, k ∈ N0 , s ∈ [0, 1], and Ω = [0, 1]d . Then, there exists a constant C > 0
such that for every f ∈ C k,s (Ω) and every N ≥ 2 there exists a ReLU neural network ΦfN such that
k+s
sup |f (x) − ΦfN (x)| ≤ CN − d ∥f ∥C k,s (Ω) , (7.3.3)
x∈Ω

size(ΦfN ) ≤ CN log(N ) and depth(ΦfN ) ≤ C log(N ).

Proof. The idea of the proof is to use the so-called “partition of unity method”: First we will
construct a partition of unity (φν )ν , such that for an appropriately chosen M ∈ N each φν has
support on a O(1/M ) neighborhood of a point η ∈ Ω. On each of these neighborhoods P we will use
the local Taylor polynomial pν of f around η to approximate the function. Then ν φν pν gives
an
P approximation to f on Ω. This approximation can be emulated by a neural network of the type
Φ × (φ , p̂ ), where p̂ is an neural network approximation to the polynomial p .
ν ε ν ν ν ν
It suffices to show the theorem in the case where
( )
dk+1/2
max , exp(d) ∥f ∥C k,s (Ω) ≤ 1.
k!

The general case can then be immediately deduced by a scaling argument.

87
Step 1. We construct the neural network. Define
k+s
M := ⌈N 1/d ⌉ and ε := N − d . (7.3.4)

Consider a uniform simplicial mesh with nodes {ν/M | ν ≤ M } where ν/M := (ν1 /M, . . . , νd /M ),
and where “ν ≤ M ” is short for {ν ∈ Nd0 | νi ≤ M for all i ≤ d}. We denote by φν the cpwl basis
function on this mesh such that φν (ν/M ) = 1 and φν (µ/M ) = 0 whenever µ ̸= ν. As shown in
Chapter 5, φν is a neural network of size O(1). Then
X
φν ≡ 1 on Ω, (7.3.5)
ν≤M

is a partition of unity. Moreover, observe that


 
ν 1
supp(φν ) ⊆ x ∈ Ω x− ≤ , (7.3.6)
M ∞ M

where ∥x∥∞ = maxi≤d |xi |.


For each ν ≤ M define the multivariate polynomial
X Dα f ν 

ν α
pν (x) := M
x− ∈ Pk ,
α! M
|α|≤k

and the approximation


X Dα f ν
  νiα,1 νiα,k 
p̂ν (x) := M
Φ×
|α|,ε x i α,1 − , . . . , x i α,k
− ,
α! M M
|α|≤k

where (iα,1 , . . . , iα,k ) ∈ {0, . . . , d}k is arbitrary but fixed such that |{j | iα,j = r}| = αr for all
r = 1, . . . , d. Finally, define

ΦfN :=
X
Φ×
ε (φν , p̂ν ). (7.3.7)
ν≤M

Step 2. We bound the approximation error. First, for each x ∈ Ω, using (7.3.5) and (7.3.6)

X X
f (x) − φν (x)pν (x) ≤ |φν (x)||pν (x) − f (x)|
ν≤M ν≤M

≤ max sup |f (y) − pν (y)|.


ν≤M {y∈Ω | ∥ ν −y∥ ≤ 1 }
M ∞ M

By Lemma 7.6 we obtain

k+ 21
−(k+s) d
X
sup f (x) − φν (x)pν (x) ≤ M ∥f ∥C k,s (Ω) ≤ M −(k+s) . (7.3.8)
x∈Ω k!
ν≤M

88
Next, fix ν ≤ M and y ∈ Ω such that ∥ν/M − y∥∞ ≤ 1/M ≤ 1. Then by Proposition 7.4
 k 
X Dα f ν Y νiα,j 
M
|pν (y) − p̂ν (y)| ≤ yiα,j −
α! M
|α|≤k j=1
 
× νiα,1 iα,k
− Φ|α|,ε yiα,1 − , . . . , yiα,k −
M M
X Dα f ( ν )
M
≤ε ≤ ε exp(d)∥f ∥C k,s (Ω) ≤ ε, (7.3.9)
α!
|α|≤k

where we used |Dα f (ν/M )| ≤ ∥f ∥C k,s (Ω) and

k k
X dj X dj ∞
X 1 X1 X j!
= = ≤ = exp(d).
α! j! α! j! j!
{α∈Nd0 | |α|≤k} j=0 {α∈Nd0 | |α|=j} j=0 j=0

Similarly, one shows that

|p̂ν (x)| ≤ exp(d)∥f ∥C k,s (Ω) ≤ 1 for all x ∈ Ω.

Fix x ∈ Ω. Then x belongs to a simplex of the mesh, and thus x can be in the support of at
most d + 1 (the number of nodes of a simplex) functions φν . Moreover, Lemma 7.3 implies that
supp Φ×
ε (φν (·), p̂ν (·)) ⊆ supp φν . Hence, using Lemma 7.3 and (7.3.9)

X X
φν (x)pν (x) − Φ×
ε (φν (x), p̂ν (x))
ν≤M ν≤M
X
≤ (|φν (x)pν (x) − φν (x)p̂ν (x)|
{ν≤M | x∈supp φν }

+ |φν (x)p̂ν (x) − Φ×



ε (φν (x), p̂ν (x))|
≤ ε + (d + 1)ε = (d + 2)ε.

In total, together with (7.3.8)

sup |f (x) − ΦfN (x)| ≤ M −(k+s) + ε · (d + 2).


x∈Ω

With our choices in (7.3.4) this yields the error bound (7.3.3).
Step 3. It remains to bound the size and depth of the neural network in (7.3.7).
By Lemma 5.17, for each 0 ≤ ν ≤ M we have

size(φν ) ≤ C · (1 + kT ), depth(φν ) ≤ C · (1 + log(kT )), (7.3.10)

where kT is the maximal number of simplices attached to a node in the mesh. Note that kT is
independent of M , so that the size and depth of φν are bounded by a constant Cφ independent of
M.

89
Lemma 7.3 and Proposition 7.4 thus imply with our choice of ε = N −(k+s)/d
depth(ΦfN ) = depth(Φ×
ε ) + max depth(φη ) + max depth(p̂ν )
ν≤M ν≤M
≤ C · (1 + | log(ε)| + Cφ ) + depth(Φ×
k,ε )
≤ C · (1 + | log(ε)| + Cφ )
≤ C · (1 + log(N ))
for some constant C > 0 depending on k and d (we use “C” to denote a generic constant that can
change its value in each line).
To bound the size, we first observe with Lemma 5.4 that
 
X  
size(p̂ν ) ≤ C · 1 + size Φ×
|α|,ε
 ≤ C · (1 + | log(ε)|)
|α|≤k

for some C depending on k. Thus, for the size of ΦfN we obtain with M = ⌈N 1/d ⌉
 

size(ΦfN ) ≤ C · 1 +
X
size(Φ×

ε ) + size(φν ) + size(p̂ν )

ν≤M

≤ C · (1 + M )d (1 + | log(ε)| + Cφ )
≤ C · (1 + N 1/d )d (1 + Cφ + log(N ))
≤ CN log(N ),
which concludes the proof.
Theorem 7.7 shows the convergence rate (k+s)/d for approximating a C k,s -function f : [0, 1]d →
R. As long as k is large, in principle we can achieve arbitrarily large (and d-independent if k ≥ d)
k+s
convergence rates. Crucially, and in contrast to Theorem 5.22, achieving error N − d requires the
neural networks to be of size O(N log(N )) and depth O(log(N )), i.e. to get more and more accurate
approximations, the neural network depth is required to increase.
Remark 7.8. Under the stronger assumption that f is an analytic function (in particular such an f is
in C ∞ ), one can show exponential convergence rates for ReLU networks of the type exp(−βN 1/(d+1) )
for some fixed β > 0 and where N corresponds again to the neural network size (up to logarithmic
terms), see [58, 166].
Remark 7.9. Let L : x 7→ Ax + b : Rd → Rd be a bijective affine transformation and set Ω :=
L([0, 1]d ) ⊆ Rd . Then for a function f ∈ C k,s (Ω), by Theorem 7.7 there exists a neural network
ΦfN such that
sup |f (x) − ΦfN (L−1 (x))| = sup |f (L(x)) − ΦfN (x)|
x∈Ω x∈[0,1]d
k+s
≤ C∥f ◦ L∥C k,s ([0,1]d ) N − d .
Since for x ∈ [0, 1]d holds |f (L(x))| ≤ supy∈Ω |f (y)| and if 0 ̸= α ∈ Nd0 is a multiindex |Dα (f (L(x))| ≤
|α|
∥A∥2 supy∈Ω |Dα f (y)|, we have ∥f ◦ L∥C k,s ([0,1]d ) ≤ (1 + ∥A∥k+s
2 )∥f ∥C k,s (Ω) . Thus the convergence
k+s
rate N − d is achieved on every set of the type L([0, 1]d ) for an affine map L, and in particular on
every hypercube ×dj=1 [aj , bj ].

90
Bibliography and further reading
This chapter is based on the seminal 2017 paper by Yarotsky [245], where the construction of
approximating the square function, the multiplication, and polynomials (discussed in Sections 7.1
and 7.2) was first introduced and analyzed. The construction relies on the sawtooth function
discussed in Section 6.2 and originally introduced by Telgarsky in [227]. Yarotsky’s work has
since sparked a large body of research, as it allows to lift polynomial approximation theory to
neural network classes. Convergence results based on this type of argument include for example
[174, 59, 150, 58, 166].
The approximation result derived in Section 7.3 for Hölder continuous functions follows by
standard approximation theory for piecewise polynomial functions. We point out that similar
results for the approximation of functions in C k or functions that are analytic can also be shown for
other activation function than ReLU; see in particular the works of Mhaskar [144, 145] and Section
6 in Pinkus’ Acta Numerica article [176] for sigmoidal and smooth activations. Additionally, the
more recent paper [48] specifically addresses the hyperbolic tangent activation. Finally, [81] studies
general activation functions that allow for the construction of approximate partitions of unity.

91
Chapter 8

High-dimensional approximation

In the previous chapters we established convergence rates for the approximation of a function f :
[0, 1]d → R by a neural network. For example, Theorem 7.7 provides the error bound O(N −(k+s)/d )
in terms of the network size N (up to logarithmic terms), where k and s describe the smoothness
of f . Achieving an accuracy of ε > 0, therefore, necessitates a network size N = O(ε−d/(k+s) )
(according to this bound). Hence, the size of the network needs to increase exponentially in d.
This exponential dependence on the dimension d is referred to as the curse of dimensionality
[16]. For classical smoothness spaces, such exponential d dependence cannot be avoided [16, 52, 164].
However, functions f that are of interest in practice may have additional properties, which allow
for better convergence rates.
In this chapter, we discuss three scenarios under which the curse of dimensionality can be
mitigated. First, we examine an assumption limiting the behavior of functions in their Fourier
domain. This assumption allows for slow but dimension independent approximation rates. Second,
we consider functions with a specific compositional structure. Concretely, these functions are
constructed by compositions and linear combinations of simple low-dimensional subfunctions. In
this case, the curse of dimension is present but only through the input dimension of the subfunctions.
Finally, we study the situation, where we still approximate high-dimensional functions, but only care
about the approximation accuracy on a lower dimensional submanifold. Here, the approximation
rate is goverened by the smoothness and the dimension of the manifold.

8.1 The Barron class


In [10], Barron introduced a set of functions that can be approximated by neural networks without
a curse of dimensionality. This set, known as the Barron class, is characterized by a specific type
of bounded variation. To define it, for f ∈ L1 (Rd ) denote by
Z

ˆ
f (w) := f (x)e−2π i w x dx
Rd

its Fourier transform. Then, for C > 0 the Barron class is defined as
 Z 
1 d ˆ
ΓC := f ∈ L (R ) ∥f ∥L1 (Rd ) < ∞, ˆ
|2πξ||f (ξ)| dξ < C .
Rd

92
We point out that the definition of ΓC in [10] is more general, but our assumption will simplify
some of the arguments. Nonetheless, the following proof is very close to the original result, and the
presentation is similar to [175, Section 5]. Theorem 1 in [10] reads as follows.

Theorem 8.1. Let σ : R → R be sigmoidal (see Definition 3.11) and let f ∈ ΓC for some C > 0.
Denote by B1d := {x ∈ Rd | ∥x∥ ≤ 1} the unit ball. Then, for every c > 4C 2 and every N ∈ N there
exists a neural network Φf with architecture (σ; d, N, 1) such that
Z
1 2 c
d
f (x) − Φf (x) dx ≤ , (8.1.1)
|B1 | B1d N

where |B1d | is the Lebesgue measure of B1d .

Remark 8.2. The approximation rate on (8.1.1) can be slightly improved under some assumptions
on the activation function such as powers of the ReLU, [213].
Importantly, the dimension d does not enter on the right-hand side of (8.1.1), in particular the
convergence rate is not directly affected by the dimension, which is in stark contrast to the results
of the previous chapters. However, it should be noted, that the constant Cf may still have some
inherent d-dependence, see Exercise 8.10.
The proof of Theorem 8.1 is based on a peculiar property of high-dimensional convex sets, which
is described by the (approximate) Caratheodory theorem, the original version of which was given
in [31]. The more general version stated in the following lemma follows [236, Theorem 0.0.2] and
[10, 177]. For its statement recall that co(G) denotes the the closure of the convex hull of G.

Lemma 8.3. Let H be a Hilbert space, and let G ⊆ H be such that for some B > 0 it holds that
∥g∥H ≤ B for all g ∈ G. Let f ∈ co(G). Then, for every N ∈ N and every c > B 2 there exist
(gi )N
i=1 ⊆ G such that

N 2
1 X c
f− gi ≤ . (8.1.2)
N N
i=1 H

Proof. Fix ε > 0 and N ∈ N. Since f ∈ co(G), there exist coefficients α1 , . . . , αm ∈ [0, 1] summing
to 1, and linearly independent elements h1 , . . . , hm ∈ G such that
m
X
f ∗ := αj hj
j=1

satisfies ∥f − f ∗ ∥H < ε. We claim that there exists g1 , . . . , gN , each in {h1 , . . . , hm }, such that
2
N
∗ 1 X B2
f − gj ≤ . (8.1.3)
N N
j=1
H

93
Since ε > 0 was arbitrary, this then concludes the proof. Since there exists an isometric isomorphism
from span{h1 , . . . , hm } to Rm , there is no loss of generality in assuming H = Rm in the following.
Let Xi , i = 1, . . . , N , be i.i.d. Rm -valued random variables with

P[Xi = hj ] = αj for all i = 1, . . . , m.


Pm
In particular E[Xi ] = j=1 αj hj = f ∗ for each i. Moreover,
 2  2
N N
∗ 1 X  = E 1
X
E f −
 Xj (f ∗ − Xj ) 
N N
j=1 j=1
" N #
1 X ∗ X
= 2 ∥f − Xj ∥2 + ⟨f ∗ − Xi , f ∗ − Xj ⟩
N
j=1 i̸=j
1
= E[∥f ∗ − X1 ∥2 ]
N
1
= E[∥f ∗ ∥ − 2 ⟨f ∗ , X1 ⟩ + ∥X1 ∥2 ]
N
1 B2
= E[∥X1 ∥2 − ∥f ∗ ∥2 ] ≤ . (8.1.4)
N N
Here we used that the (Xi )N ∗ ∗ ∗
i=1 are independent, the fact that E[Xi ] = f , as well as E⟨f − Xi , f − Xj ⟩ =
0 if i ̸= j. Since the expectation in (8.1.4) is bounded by B 2 /N , there must exist at least one real-
ization of the random variables Xi ∈ {h1 , . . . , hm }, denoted as gi , for which (8.1.3) holds.

Lemma 8.3 provides a powerful tool: If we want to approximate a function f with a superposition
of N elements in a set G, then it is sufficient to show that f can be represented as an arbitrary
(infinite) convex combination of elements of G.
Lemma 8.3 suggests that we can prove Theorem 8.1 by showing that each function in ΓC belongs
to the convex hull of neural networks with just a single neuron. We make a small detour before
proving this result. We first show that each function f ∈ ΓC is in the convex hull of affine transforms
of Heaviside functions. We define the set of affine transforms of Heaviside functions GC as
n o
GC := B1d ∋ x 7→ γ · 1R+ (⟨a, x⟩ + b) a ∈ Rd , b ∈ R, |γ| ≤ 2C .

The following lemma, corresponding to [175, Lemma 5.12], provides a link between ΓC and GC .

Lemma 8.4. Let d ∈ N, C > 0 and f ∈ ΓC . Then f |B d − f (0) ∈ co(GC ), where the closure is
1
taken with respect to the norm
Z !1/2
1 2
∥g∥L2,⋄ (B d ) := |g(x)| dx .
1 |B1d | B1d

94
Proof. Since f ∈ ΓC , we have that f , fˆ ∈ L1 (Rd ). Hence, we can apply the inverse Fourier
transform and get the following computation:
Z  
f (x) − f (0) = fˆ(ξ) e2πi⟨x,ξ⟩ − 1 dξ
d
ZR  
= fˆ(ξ) e2πi⟨x,ξ⟩+iκ(ξ) − eiκ(ξ) dξ
d
ZR
= fˆ(ξ) (cos(2π⟨x, ξ⟩ + κ(ξ)) − cos(κ(ξ))) dξ,
Rd

where κ(ξ) is the phase of fˆ(ξ) and the last inequality follows since f is real-valued.
To use the fact that f has a bounded Fourier moment, we reformulate the integral as
Z
fˆ(ξ) (cos(2π⟨x, ξ⟩ + κ(ξ)) − cos(κ(ξ))) dξ
Rd
(cos(2π⟨x, ξ⟩ + κ(ξ)) − cos(κ(ξ)))
Z
= |2πξ| fˆ(ξ) dξ.
Rd |2πξ|
We define a new measure Λ with density
1
dΛ(ξ) := |2πξ||fˆ(ξ)| dξ.
C
Since f ∈ ΓC , it follows that Λ is a probability measure on Rd . Now we have that
(cos(2π⟨x, ξ⟩ + κ(ξ)) − cos(κ(ξ)))
Z
f (x) − f (0) = C dΛ(ξ). (8.1.5)
Rd |2πξ|
Next, we would like to replace the integral of (8.1.5) by an appropriate finite sum.
The cosine function is 1-Lipschitz. Hence, we note that ξ 7→ qx (ξ) := (cos(2π⟨x, ξ⟩ + κ(ξ)) −
cos(κ(ξ)))/|2πξ| is bounded by 1. In addition, it is easy to see that qx is well-defined and continuous
even in the origin.
Therefore, the integral (8.1.5) can be approximated by a Riemann sum, i.e.,

Z X
C qx (ξ) dΛ(ξ) − C qx (θ) · Λ(Iθ ) → 0, (8.1.6)
Rd 1 d
θ∈ n Z

where Iθ := [0, 1/n)d + θ.


Since f (x)−f (0) is continuous and thus bounded on B1d , we have by the dominated convergence
theorem that
2
Z
1 X
f (x) − f (0) − C qx (θ) · Λ(Iθ ) dx → 0. (8.1.7)
|B1d | B1d 1 d
θ∈ n Z

Since θ∈ 1 Zd Λ(Iθ ) = Λ(Rd ) = 1, we conclude that f (x) − f (0) is in the L2,⋄ (B1d ) closure of
P
n
convex combinations of functions of the form

x 7→ gθ (x) := αθ qx (θ),

95
for θ ∈ Rd and 0 ≤ αθ ≤ C.
Now we only need to prove that each gθ is in co(GC ). By setting z = ⟨x, θ/|θ|⟩, we observe that
the result follows if the map

cos(2π|θ|z + κ(θ)) − cos(κ(θ))


[−1, 1] ∋ z 7→ αθ =: g̃θ (z),
|2πθ|

can be approximated arbitrarily well by convex combinations of functions of the form

[−1, 1] ∋ z 7→ γ1R+ a′ z + b′ ,

(8.1.8)

where a′ , b′ ∈ R and |γ| ≤ 2C.


We define, for T ∈ N,
T i
− g̃θ i−1
         
X g̃θ T T i i−1 i
gT,+ := 2Csign g̃θ − g̃θ 1R+ x − ,
2C T T T
i=1
T
− Ti − g̃θ 1−i
         
X g̃θ T i 1−i i
gT,− := 2Csign g̃θ − − g̃θ 1R+ −x + .
2C T T T
i=1

Per construction, gT,− + gT,+ converges to g̃θ for T → ∞. Moreover, ∥g̃θ′ ∥L∞ (R) ≤ C and hence

T T
X |g̃θ (i/T ) − g̃θ ((i − 1)/T )| X |g̃θ (−i/T ) − g̃θ ((1 − i)/T )|
+
2C 2C
i=1 i=1
T
2 X ′
≤ ∥g̃θ ∥L∞ (R) ≤ 1.
2CT
i=1

We conclude that gT,− + gT,+ is a convex combination of functions of the form (8.1.8). Hence,
g̃θ can be arbitrarily well approximated by convex combinations of the form (8.1.8). Therefore
gθ ∈ co(GC ). Finally, (8.1.7) yields that f − f (0) ∈ co(GC ).

We now have all tools to complete the proof of Theorem 8.1.

of Theorem 8.1. Let f ∈ ΓC . By Lemma 8.4

f |B d − f (0) ∈ co(GC ).
1

It is not hard to see that for every g ∈ GC holds ∥g∥L2,⋄ (B d ) ≤ 2C. Applying Lemma 8.3 with the
1
Hilbert space L2,⋄ (B1d ), we get that for every N ∈ N there exist |γi | ≤ 2C, ai ∈ Rd , bi ∈ R, for
i = 1, . . . , N , so that

N 2
4C 2
Z
1 X
f (x) − f (0) − γi 1R+ (⟨ai , x⟩ + bi ) dx ≤ .
|B1d | B1d i=1
N

By Exercise 3.24, it holds that σ(λ·) → 1R+ for λ → ∞ almost everywhere. Thus, for every δ > 0
there exist ãi , b̃i , i = 1, . . . , N , so that

96
N 2
4C 2
Z
1 X  
f (x) − f (0) − γi σ ⟨ãi , x⟩ + b̃i dx ≤ + δ.
|B1d | B1d i=1
N
The result follows by observing that
N
X  
γi σ ⟨ãi , x⟩ + b̃i + f (0)
i=1

is a neural network with architecture (σ; d, N, 1).

The dimension-independent approximation rate of Theorem 8.1 may seem surprising, especially
in comparison to the results in Chapters 4 and 5. However, this can be explained by recognizing
that the assumption of a finite Fourier moment is effectively a dimension-dependent regularity
assumption. Indeed, the condition becomes more restrictive in higher dimensions and hence the
complexity of ΓC does not grow with the dimension.
To further explain this, let us relate the Barron class to classical function spaces. In [10, Section
II] it was observed that a sufficient condition is that all derivatives of order up to ⌊d/2⌋ + 2 are
square-integrable. In other words, if f belongs to the Sobolev space H ⌊d/2⌋+2 (Rd ), then f is a
Barron function. Importantly, the functions must become smoother, as the dimension increases.
This assumption would also imply an approximation rate of N −1/2 in the L2 norm by sums of
at most N B-splines, see [168, 52]. However, in such estimates some constants may still depend
exponentially on d, whereas all constants in Theorem 8.1 are controlled independently of d.
Another notable aspect of the approximation of Barron functions is that the absolut values of
the weights other than the output weights are not bounded by a constant. To see this, we refer
to (8.1.6), where arbitrarily large θ need to be used. While ΓC is a compact set, the set of neural
networks of the specified architecture for a fixed N ∈ N is not parameterized with a compact
parameter set. In a certain sense, this is reminiscent of Proposition 3.19 and Theorem 3.20, where
arbitrarily strong approximation rates where achieved by using a very complex activation function
and a non-compact parameter space.

8.2 Functions with compositionality structure


As a next instance of types of functions for which the curse of dimensionality can be overcome, we
study functions with compositional structure. In words, this means that we study high-dimensional
functions that are constructed by composing many low-dimensional functions. This point of view
was proposed in [178]. Note that this can be a realistic assumption in many cases, such as for
sensor networks, where local information is first aggregated in smaller clusters of sensors before
some information is sent to a processing unit for further evaluation.
We introduce a model for compositional functions next. Consider a directed acyclic graph G
with M vertices η1 , . . . , ηM such that

• exactly d vertices, η1 , . . . , ηd , have no ingoing edge,

• each vertex has at most m ∈ N ingoing edges,

• exactly one vertex, ηM , has no outgoing edge.

97
With each vertex ηj for j > d we associate a function fj : Rdj → R. Here dj denotes the
cardinality of the set Sj , which is defined as the set of indices i corresponding to vertices ηi for
which we have an edge from ηi to ηj . Without loss of generality, we assume that m ≥ dj = |Sj | ≥ 1
for all j > d. Finally, we let

Fj := xj for all j≤d (8.2.1a)

and1

Fj := fj ((Fi )i∈Sj ) for all j > d. (8.2.1b)

Then FM (x1 , . . . , xd ) is a function from Rd → R. Assuming

∥fj ∥C k,s (Rdj ) ≤ 1 for all j = d + 1, . . . , M, (8.2.2)

we denote the set of all functions of the type FM by F k,s (m, d, M ). Figure 8.1 shows possible
graphs of such functions.
Clearly, for s = 0, F k,0 (m, d, M ) ⊆ C k (Rd ) since the composition of functions in C k belongs
again to C k . A direct application of Theorem 7.7 allows to approximate FM ∈ F k (m, d, M ) with a
k
neural network of size O(N log(N )) and error O(N − d ). Since each fj depends only on m variables,
k
intuitively we expect an error convergence of type O(N − m ) with the constant somehow depending
on the number M of vertices. To show that this is actually possible, in the following we associate
with each node ηj a depth lj ≥ 0, such that lj is the maximum number of edges connecting ηj to
one of the nodes {η1 , . . . , ηd }.

Figure 8.1: Three types of graphs that could be the basis of compositional functions. The associated
functions are composed of two or three-dimensional functions only.

Proposition 8.5. Let k, m, d, M ∈ N and s > 0. Let FM ∈ F k,s (m, d, M ). Then there exists a
constant C = C(m, k + s, M ) such that for every N ∈ N there exists a ReLU neural network ΦFM

1
The ordering of the inputs (Fi )i∈Sj in (8.2.1b) is arbitrary but considered fixed throughout.

98
such that

size(ΦFM ) ≤ CN log(N ), depth(ΦFM ) ≤ C log(N )

and
k+s
sup |FM (x) − ΦFM (x)| ≤ N − m .
x∈[0,1]d

Proof. Throughout this proof we assume without loss of generality that the indices follow a topo-
logical ordering, i.e., they are ordered such that Sj ⊆ {1, . . . , j − 1} for all j (i.e. the inputs of
vertex ηj can only be vertices ηi with i < j).
Step 1. First assume that fˆj are functions such that

|fj (x) − fˆj (x)| ≤ δj := ε · (2m)−(M +1−j) for all x ∈ [−2, 2]dj . (8.2.3)

Let F̂j be defined as in (8.2.1), but with all fj in (8.2.1b) replaced by fˆj . We now check the error
of the approximation F̂M to FM . To do so we proceed by induction over j and show that for all
x ∈ [−1, 1]d

|Fj (x) − F̂j (x)| ≤ (2m)−(M −j) ε. (8.2.4)

Note that due to ∥fj ∥C k ≤ 1 we have |Fj (x)| ≤ 1 and thus (8.2.4) implies in particular that
F̂j (x) ∈ [−2, 2].
For j = 1 it holds F1 (x1 ) = F̂1 (x1 ) = x1 , and thus (8.2.4) is valid for all x1 ∈ [−1, 1]. For the
induction step, for all x ∈ [−1, 1]d by (8.2.3) and the induction hypothesis

|Fj (x) − F̂j (x)| = |fj ((Fi )i∈Sj ) − fˆj ((F̂i )i∈Sj )|
= |fj ((Fi )i∈Sj ) − fj ((F̂i )i∈Sj )| + |fj ((F̂i )i∈Sj ) − fˆj ((F̂i )i∈Sj )|
X
≤ |Fi − F̂i | + δj
i∈Sj

≤ m · (2m)−(M −(j−1)) ε + (2m)−(M +1−j) ε


≤ (2m)−(M −j) ε.

Here we used that | dxdr fj ((xi )i∈Sj )| ≤ 1 for all r ∈ Sj so that


X
|fj ((xi )i∈Sj ) − fj ((yi )i∈Sj )| ≤ |f ((xi )i∈Sj , (yi )i∈Sj ) − f ((xi )i∈Sj , (yi )i∈Sj )|
r∈Sj i≤r i>r i<r i≥r
X
≤ |xr − yr |.
r∈Sj

This shows that (8.2.4) holds, and thus for all x ∈ [−1, 1]d

|FM (x) − F̂M (x)| ≤ ε.

99
Step 2. We sketch a construction, of how to write F̂M from Step 1 as a neural network ΦFM
of the claimed size and depth bounds. Fix N ∈ N and let
m
Nj := ⌈N (2m) k+s (M +1−j) ⌉.

By Theorem 7.7, since dj ≤ m, we can find a neural network Φfj satisfying

− k+s k+s
sup |fj (x) − Φfj (x)| ≤ Nj m
≤ N− m (2m)−(M +1−j) (8.2.5)
x∈[−2,2]dj

and
 
fj
m(M +1−j) m(M + 1 − j)
size(Φ ) ≤ CNj log(Nj ) ≤ CN (2m) k+s log(N ) + log(2m)
k+s

as well as
 
fj m(M + 1 − j)
depth(Φ ) ≤ C · log(N ) + log(2m) .
k+s

Then
n M M  j
X X m(M +1−j) X m
size(Φfj ) ≤ 2CN log(N ) (2m) k+s ≤ 2CN log(N ) (2m) k+s
j=1 j=1 j=1
m(M +1)
≤ 2CN log(N )(2m) k+s .
PM j
R M +1 1 M +1 .
Here we used j=1 a ≤ 1 exp(log(a)x) dx ≤ log(a) a
− k+s
The function F̂M from Step 1 then will yield error N by (8.2.3) and (8.2.5). We observe that
m

F̂M can be constructed inductively as a neural network ΦFM by propagating all values ΦF1 , . . . , Φ̂Fj
to all consecutive layers using identity neural networks and then using the outputs of (ΦFi )i∈Sj+1
as input to Φfj+1 . The depth of this neural network is bounded by
M
X
depth(Φfj ) = O(M log(N )).
j=1

We have at most M
P
j=1 |Sj | ≤ mM values which need to be propagated through these O(M log(N ))
layers, amounting to an overhead O(mM 2 log(N )) = O(log(N )) for the identity neural networks.
In all the neural network size is thus O(N log(N )).

Remark 8.6. From the proof we observe that the constant C in Proposition 8.5 behaves like
m(M +1)
O((2m) k+s ).

8.3 Functions on manifolds


Another instance in which the curse of dimension can be mitigated, is if the input to the network
belongs to Rd , but stems from an m-dimensional manifold M ⊆ Rd . If we only measure the

100
M

Figure 8.2: One-dimensional sub-manifold of three-dimensional space. At the orange point, we


depict a ball and the tangent space of the manifold.

approximation error on M, then we can again show that it is m rather than d that determines the
rate of convergence.
To explain the idea, we assume in the following that M is a smooth, compact m-dimensional
manifold in Rd . Moreover, we suppose that there exists δ > 0 and finitely many points x1 , . . . , xM ∈
M such that the δ-balls Bδ/2 (xi ) := {y ∈ Rd | ∥y − x∥2 < δ/2} for j = 1, . . . , M cover M (for
every δ > 0 such xi exist since M is compact). Moreover, denoting by Tx M ≃ Rm the tangential
space of M at x, we assume δ > 0 to be so small that the orthogonal projection

πj : Bδ (xj ) ∩ M → Txj M (8.3.1)

is injective, the set πj (Bδ (xj ) ∩ M) ⊆ Txj M has C ∞ boundary, and the inverse projection

πj−1 : πj (Bδ (xj ) ∩ M) → M (8.3.2)

is C ∞ (this is possible because M is a smooth manifold). A visualization of this assumption is


shown in Figure 8.2.
Note that πj in (8.3.1) is a linear map, whereas πj−1 in (8.3.2) is in general non-linear.
For a function f : M → R and x ∈ Bδ (xj ) ∩ M we can then write

f (x) = f (πj−1 (πj (x))) = fj (πj (x))

where

fj := f ◦ πj−1 : πj (Bδ (xj ) ∩ M) → R.

101
In the following, for f : M → R, k ∈ N0 , and s ∈ [0, 1) we let

∥f ∥C k,s (M) := sup ∥fj ∥C k,s (πj (Bδ (xj )∩M)) .


j=1,...,M

We now state the main result of this section.

Proposition 8.7. Let d, k ∈ N, s ≥ 0, and let M be a smooth, compact m-dimensional manifold


in Rd . Then there exists a constant C > 0 such that for all f ∈ C k,s (M) and every N ∈ N there
exists a ReLU neural network ΦfN such that size(ΦfN ) ≤ CN log(N ), depth(ΦfN ) ≤ C log(N ) and
k+s
sup |f (x) − ΦfN (x)| ≤ C∥f ∥C k,s (M) N − m .
x∈M

Proof. Since M is compact there exists A > 0 such that M ⊆ [−A, A]d . Similar as in the proof of
Theorem 7.7, we consider a uniform mesh with nodes {−A + 2A νnP| ν ≤ n}, and the corresponding
d
piecewise linear basis functions forming the partition of unity ν≤ φν ≡ 1 on [−A, A] where
supp φν ≤ {y ∈ Rd | ∥ νn − y∥∞ ≤ A n }. Let δ > 0 be such as in the beginning of this section.
Since M is covered by the balls (Bδ/2 (xj ))M
j=1 , fixing n ∈ N large enough, for each ν such that
supp φν ∩ M = ̸ ∅ there exists j(ν) ∈ {1, . . . , M } such that supp φν ⊆ Bδ (xj(ν) ) and we set
Ij := {ν ≤ M | j = j(ν)}. Then we have for all x ∈ M

X M X
X
f (x) = φν (x)fj (πj (x)) = φν (x)fj (πj (x)). (8.3.3)
ν≤n j=1 ν∈Ij

Next, we approximate the functions fj . Let Cj be the smallest (m-dimensional) cube in Txj M ≃
Rm such that πj (Bδ (xj ) ∩ M) ⊆ Cj . The function fˆj can be extended to a function on Cj (we will
use the same notation for this extension) such that

∥f ∥C k,s (Cj ) ≤ C∥f ∥C k,s (πj (Bδ (xj )∩M)) ,

for some constant depending on πj (Bδ (xj ) ∩ M) but independent of f . Such an extension result
can, for example, be found in [216, Chapter VI]. By Theorem 7.7 (also see Remark 7.9), there exists
a neural network fˆj : Cj → R such that
k+s
sup |fj (x) − fˆj (x)| ≤ CN − m (8.3.4)
x∈Cj

and

size(fˆj ) ≤ CN log(N ), depth(fˆj ) ≤ C log(N ).


k+s
To approximate f in (8.3.3) we now let with ε := N − d

M X
X
ΦN := Φ× ˆ
ε (φν , fi ◦ πj ),
j=1 ν∈Ij

102
where we note that πj is linear and thus fˆj ◦ πj can be expressed by a neural network. First let us
estimate the error of this approximation. For x ∈ M
M X
X
|f (x) − ΦN (x)| ≤ |φν (x)fj (πj (x)) − Φ× ˆ
ε (φν (x), fj (πj (x)))|
j=1 ν∈Ij
M X
X
≤ (|φν (x)fj (πj (x)) − φν (x)fj (πj (x))|
j=1 ν∈Ij

+|φν (x)fj (πj (x)) − Φ× ˆ
ε (φν (x), fj (πj (x)))|
M X
X M
X X
≤ sup ∥fi − fˆi ∥L∞ (Ci ) |φν (x)| + ε
i≤M j=1 ν∈Ij j=1 {ν∈Ij | x∈supp φν }
k+s k+s
≤ CN − m + dε ≤ CN − m ,

where we used that x can be in the support of at most d of the φν , and where C is a constant
depending on d and M.
Finally, let us bound the size and depth of this approximation. Using size(φν ) ≤ C, depth(φν ) ≤
C (see (5.3.12)) and size(Φ× ×
ε ) ≤ C log(ε) ≤ C log(N ) and depth(Φε ) ≤ Cdepth(ε) ≤ C log(N ) (see
Lemma 7.3) we find
M X 
X  XM X
size(Φ×
ε ) + size(φ ν ) + size(fˆi ◦ π j ) ≤ C log(N ) + C + CN log(N )
j=1 ν∈Ij j=1 ν∈Ij

= O(N log(N )),

which implies the bound on size(ΦN ). Moreover,


n o
depth(ΦN ) ≤ depth(Φ× ˆ
ε ) + max depth(φν , fj )

≤ C log(N ) + log(N ) = O(log(N )).

This completes the proof.

Bibliography and further reading


The ideas of Section 8.1 were originally developed in [10], with an extension to L∞ approximation
provided in [9]. These arguments can be extended to yield dimension-independent approximation
rates for high-dimensional discontinuous functions, provided the discontinuity follows a Barron
function, as shown in [175]. The Barron class has been generalized in various ways, as discussed in
[138, 137, 239, 240, 11].
The compositionality assumption of Section 8.2 was discussed in the form presented in [178].
An alternative approach, known as the hierarchical composition/interaction model, was studied in
[119].
The manifold assumption discussed in Section 8.3 is frequently found in the literature, with
notable examples including [211, 40, 35, 203, 156, 118].

103
Another prominent direction, omitted in this chapter, pertains to scientific machine learn-
ing. High-dimensional functions often arise from (parametric) PDEs, which have a rich literature
describing their properties and structure. Various results have shown that neural networks can
leverage the inherent low-dimensionality known to exist in such problems. Efficient approximation
of certain classes of high-dimensional (or even infinite-dimensional) analytic functions, ubiquitous
in parametric PDEs, has been verified in [208, 209]. Further general analyses for high-dimensional
parametric problems can be found in [167, 122], and results exploiting specific structural conditions
of the underlying PDEs, e.g., in [125, 198]. Additionally, [58, 150, 166] provide results regarding
fast convergence for certain smooth functions in potentially high but finite dimensions.
For high-dimensional PDEs, elliptic problems have been addressed in [78], linear and semilin-
ear parabolic evolution equations have been explored in [79, 71, 100], and stochastic differential
equations in [109, 80].

104
Exercises
Exercise 8.8. Let C > 0 and d ∈ N. Show that, if g ∈ ΓC , then

a−d g (a(· − b)) ∈ ΓC ,

for every a ∈ R+ , b ∈ Rd .

Exercise 8.9. Let C > 0 and d ∈ N. Show that, for gi ∈ ΓC , i = 1, . . . , m and c = (ci )m
i=1 it holds
that
Xm
ci gi ∈ Γ∥c∥1 C .
i=1

2 d
Exercise 8.10. √ For every d ∈ N the function f (x) := exp(−∥x∥2 /2), x ∈ R , belongs to Γd . It
holds Cf = O( d), for d → ∞.

Exercise 8.11. Let d ∈ N, and let f (x) = ∞ d


P
i=1 ci σReLU (⟨ai , x⟩ + bi ) for x ∈ R with ∥ai ∥ =
1, |bi | ≤ 1 for all i ∈ N. Show that for every N ∈ N, there exists a ReLU neural network with N
neurons and one layer such that
3∥c∥1
∥f − fN ∥L2 (B d ) ≤ √ .
1
N
Hence, every infinite ReLU neural network can be approximated at a rate O(N 1/2 ) by finite ReLU
neural networks of width N .

Exercise 8.12. Let C > 0 prove that every f ∈ ΓC is continuously differentiable.

105
Chapter 9

Interpolation

The learning problem associated to minimizing the empirical risk of (1.2.3) is based on minimizing
an error that results from evaluating a neural network on a finite set of (training) points. In
contrast, all previous approximation results focused on achieving uniformly small errors across the
entire domain. Finding neural networks that achieve a small training error appears to be much
simpler, since, instead of ∥f − Φn ∥∞ → 0 for a sequence of neural networks Φn , it suffices to have
Φn (xi ) → f (xi ) for all xi in the training set.
In this chapter, we study the extreme case of the aforementioned approximation problem. We
analyze under which conditions it is possible to find a neural network that coincides with the target
function f at all training points. This is referred to as interpolation. To make this notion more
precise, we state the following definition.

Definition 9.1 (Interpolation). Let d, m ∈ N, and let Ω ⊆ Rd . We say that a set of functions
H ⊆ {h : Ω → R} interpolates m points in Ω, if for every S = (xi , yi )m i=1 ⊆ Ω × R, such that
xi ̸= xj for i ̸= j, there exists a function h ∈ H such that h(xi ) = yi for all i = 1, . . . , m.

Knowing the interpolation properties of an architecture represents extremely valuable informa-


tion for two reasons:
• Consider an architecture that interpolates m points and let the number of training samples
be bounded by m. Then (1.2.3) always has a solution.
• Consider again an architecture that interpolates m points and assume that the number of
training samples is less than m. Then for every point x̃ not in the training set and every y ∈ R
there exists a minimizer h of (1.2.3) that satisfies h(x̃) = y. As a consequence, without further
restrictions (many of which we will discuss below), such an architecture cannot generalize to
unseen data.
The existence of solutions to the interpolation problem does not follow trivially from the approxi-
mation results provided in the previous chapters (even though we will later see that there is a close
connection). We also remark that the question of how many points neural networks with a given
architecture can interpolate is closely related to the so-called VC dimension, which we will study
in Chapter 14.

106
We start our analysis of the interpolation properties of neural networks by presenting a result
similar to the universal approximation theorem but for interpolation in the following section. In
the subsequent section, we then look at interpolation with desirable properties.

9.1 Universal interpolation


Under what conditions on the activation function and architecture can a set of neural networks
interpolate m ∈ N points? According to Chapter 3, particularly Theorem 3.8, we know that shallow
neural networks can approximate every continuous function with arbitrary accuracy, provided the
neural network width is large enough. As the neural network’s width and/or depth increases, the
architectures become increasingly powerful, leading us to expect that at some point, they should
be able to interpolate m points. However, this intuition may not be correct:
Example 9.2. Let H := {f ∈ C 0 ([0, 1]) | f (0) ∈ Q}. Then H is dense in C 0 ([0, 1]), but H does not
even interpolate one point in [0, 1].
Moreover, Theorem 3.8 is an asymptotic result that only states that a given function can be
approximated for sufficiently large neural network architectures, but it does not state how large
the architecture needs to be.
Surprisingly, Theorem 3.8 can nonetheless be used to give a guarantee that a fixed-size archi-
tecture yields sets of neural networks that allow the interpolation of m points. This result is due
to [176]; for a more detailed discussion of previous results see the bibliography section. Due to its
similarity to the universal approximation theorem and the fact that it uses the same assumptions,
we call the following theorem the “Universal Interpolation Theorem”. For its statement recall the
definition of the set of allowed activation functions M in (3.1.1) and the class Nd1 (σ, 1, n) of shallow
neural networks of width n introduced in Definition 3.6.

Theorem 9.3 (Universal Interpolation Theorem). Let d, n ∈ N and let σ ∈ M not be a polynomial.
Then Nd1 (σ, 1, n) interpolates n + 1 points in Rd .

Proof. Fix (xi )n+1 d n+1


i=1 ⊆ R arbitrary. We will show that for any (yi )i=1 ⊆ R there exist weights and
n d n n
biases (wj )j=1 ⊆ R , (bj )j=1 , (vj )j=1 ⊆ R, c ∈ R such that
n
X
Φ(xi ) := vj σ(w⊤
j xi + bj ) + c = yi for all i = 1, . . . , n + 1. (9.1.1)
j=1

Since Φ ∈ Nd1 (σ, 1, n) this then concludes the proof.


Denote
1 σ(w⊤ σ(w⊤
 
1 x1 + b1 ) ··· n x1 + bn )
A :=  ... .. .. ..
∈R
(n+1)×(n+1)
. (9.1.2)
 
. . .
⊤ ⊤
1 σ(w1 xn+1 + b1 ) · · · σ(wn xn+1 + bn )

Then A being regular implies that for each (yi )n+1 n


i=1 exist c and (vj )j=1 such that (9.1.1) holds.
Hence, it suffices to find (wj )nj=1 and (bj )nj=1 such that A is regular.

107
To do so, we proceed by induction over k = 0, . . . , n, to show that there exist (wj )kj=1 and
(bj )kj=1 such that the first k + 1 columns of A are linearly independent. The case k = 0 is trivial.
Next let 0 < k < n and assume that the first k columns of A are linearly independent. We wish to
find wk , bk such that the first k + 1 columns are linearly independent. Suppose such wk , bk do not
exist and denote by Yk ⊆ Rn+1 the space spanned by the first k columns of A. Then for all w ∈ Rn ,
b ∈ R the vector (σ(w⊤ xi + b))n+1 i=1 ∈ R
n+1 must belong to Y . Fix y = (y )n+1 ∈ Rn+1 \Y . Then
k i i=1 k
n+1
XX N 2
inf ∥(Φ̃(xi ))n+1 2
i=1 − y∥2 = inf vj σ(w⊤
j xi + bj ) + c − yi
Φ̃∈Nd1 (σ,1) N,wj ,bj ,vj ,c
i=1 j=1

≥ inf ∥ỹ − y∥22 > 0.


ỹ∈Yk

Since we can find a continuous function f : Rd → R such that f (xi ) = yi for all i = 1, . . . , n + 1,
this contradicts Theorem 3.8.

9.2 Optimal interpolation and reconstruction


Consider a bounded domain Ω ⊆ Rd , a function f : Ω → R, distinct points x1 , . . . , xm ∈ Ω, and
corresponding function values yi := f (xi ). Our objective is to approximate f based solely on the
data pairs (xi , yi ), i = 1, . . . , m. In this section, we will show that, under certain assumptions on
f , ReLU neural networks can express an “optimal” reconstruction which also turns out to be an
interpolant of the data.

9.2.1 Motivation
In the previous section, we observed that neural networks with m − 1 ∈ N hidden neurons can
interpolate m points for every reasonable activation function. However, not all interpolants are
equally suitable for a given application. For instance, consider Figure 9.1 for a comparison between
polynomial and piecewise affine interpolation on the unit interval.
The two interpolants exhibit rather different behaviors. In general, there is no way of deter-
mining which constitutes a better approximation to f . In particular, given our limited information
about f , we cannot accurately reconstruct any additional features that may exist between inter-
polation points x1 , . . . , xm . In accordance with Occam’s razor, it thus seems reasonable to assume
that f does not exhibit extreme oscillations or behave erratically between interpolation points.
As such, the piecewise interpolant appears preferable in this scenario. One way to formalize the
assumption that f does not “exhibit extreme oscillations” is to assume that the Lipschitz constant
|f (x) − f (y)|
Lip(f ) := sup (9.2.1)
x̸=y ∥x − y∥
of f is bounded by a fixed value M ∈ R. Here ∥ · ∥ denotes an arbitrary fixed norm on Rd .
How should we choose M ? For every function f : Ω → R satisfying
f (xi ) = yi for all i = 1, . . . , m, (9.2.2)
we have
|f (x) − f (y)| |yi − yj |
Lip(f ) = sup ≥ sup =: M̃ . (9.2.3)
x̸=y∈Ω ∥x − y∥ i̸=j ∥xi − xj ∥

108
Figure 9.1: Interpolation of eight points by a polynomial of degree seven and by a piecewise affine
spline. The polynomial interpolation has a significantly larger derivative or Lipschitz constant than
the piecewise affine interpolator.

Because of this, we fix M as a real number greater than or equal to M̃ for the remainder of our
analysis.

9.2.2 Optimal reconstruction for Lipschitz continuous functions


The above considerations raise the following question: Given only the information that the function
has Lipschitz constant at most M , what is the best reconstruction of f based on the data? We
consider here the “best reconstruction” to be a function that minimizes the L∞ -error in the worst
case. Specifically, with

LipM (Ω) := {f : Ω → R | Lip(f ) ≤ M }, (9.2.4)

denoting the set of all functions with Lipschitz constant at most M , we want to solve the following
problem:

Problem 9.4. We wish to find an element

Φ ∈ argminh:Ω→R sup sup |f (x) − h(x)|. (9.2.5)


f ∈LipM (Ω) x∈Ω
f satisfies (9.2.2)

The next theorem shows that a function Φ as in (9.2.5) indeed exists. This Φ not only allows
for an explicit formula, it also belongs to LipM (Ω) and additionally interpolates the data. Hence,
it is not just an optimal reconstruction, it is also an optimal interpolant. This theorem goes back
to [13], which, in turn, is based on [219].

109
Theorem 9.5. Let m, d ∈ N, Ω ⊆ Rd , f : Ω → R, and let x1 , . . . , xm ∈ Ω, y1 , . . . , ym ∈ R satisfy
(9.2.2) and (9.2.3) with M̃ > 0. Further, let M ≥ M̃ .
Then, Problem 9.4 has at least one solution given by
1
Φ(x) := (fupper (x) + flower (x)) for x ∈ Ω, (9.2.6)
2
where

fupper (x) := min (yk + M ∥x − xk ∥)


k=1,...,m

flower (x) := max (yk − M ∥x − xk ∥).


k=1,...,m

Moreover, Φ ∈ LipM (Ω) and Φ interpolates the data (i.e. satisfies (9.2.2)).

Proof. First we claim that for all h1 , h2 ∈ LipM (Ω) holds max{h1 , h2 } ∈ LipM (Ω) as well as
min{h1 , h2 } ∈ LipM (Ω). Since min{h1 , h2 } = − max{−h1 , −h2 }, it suffices to show the claim for
the maximum. We need to check that
| max{h1 (x), h2 (x)} − max{h1 (y), h2 (y)}|
≤M (9.2.7)
∥x − y∥
for all x ̸= y ∈ Ω. Fix x ̸= y. Without loss of generality we assume that
max{h1 (x), h2 (x)} ≥ max{h1 (y), h2 (y)} and max{h1 (x), h2 (x)} = h1 (x).
If max{h1 (y), h2 (y)} = h1 (y) then the numerator in (9.2.7) equals h1 (x) − h1 (y) which is bounded
by M ∥x − y∥. If max{h1 (y), h2 (y)} = h2 (y), then the numerator equals h1 (x) − h2 (y) which is
bounded by h1 (x) − h1 (y) ≤ M ∥x − y∥. In either case (9.2.7) holds.
Clearly, x 7→ yk −M ∥x−xk ∥ ∈ LipM (Ω) for each k = 1, . . . , m and thus fupper , flower ∈ LipM (Ω)
as well as Φ ∈ LipM (Ω).
Next we claim that for all f ∈ LipM (Ω) satisfying (9.2.2) holds
flower (x) ≤ f (x) ≤ fupper (x) for all x ∈ Ω. (9.2.8)
This is true since for every k ∈ {1, . . . , m} and x ∈ Ω
|yk − f (x)| = |f (xk ) − f (x)| ≤ M ∥x − xk ∥
so that for all x ∈ Ω
f (x) ≤ min (yk + M ∥x − xk ∥), f (x) ≥ max (yk − M ∥x − xk ∥).
k=1,...,m k=1,...,m

Since fupper , flower ∈ LipM (Ω) satisfy (9.2.2), we conclude that for every h : Ω → R holds
sup sup |f (x) − h(x)| ≥ sup max{|flower (x) − h(x)|, |fupper (x) − h(x)|}
f ∈LipM (Ω) x∈Ω x∈Ω
f satisfies (9.2.2)
|flower (x) − fupper (x)|
≥ sup . (9.2.9)
x∈Ω 2

110
Moreover, using (9.2.8),

sup sup |f (x) − Φ(x)| ≤ sup max{|flower (x) − Φ(x)|, |fupper (x) − Φ(x)|}
f ∈LipM (Ω) x∈Ω x∈Ω
f satisfies (9.2.2)
|flower (x) − fupper (x)|
= sup . (9.2.10)
x∈Ω 2

Finally, (9.2.9) and (9.2.10) imply that Φ is a solution of Problem 9.4.

Figure 9.2 depicts fupper , flower , and Φ for the interpolation problem shown in Figure 9.1, while
Figure 9.3 provides a two-dimensional example.

Figure 9.2: Interpolation of the points from Figure 9.1 with the optimal Lipschitz interpolant.

9.2.3 Optimal ReLU reconstructions


So far everything was valid withPan arbitrary norm on Rd . For the next theorem, we will restrict
ourselves to the 1-norm ∥x∥1 = dj=1 |xj |. Using the explicit formula of Theorem 9.5, we will now
show the remarkable result that ReLU neural networks can exactly express an optimal reconstruc-
tion (in the sense of Problem 9.4) with a neural network whose size scales linearly in the product
of the dimension d and the number of data points m. Additionally, the proof is constructive,
thus allowing in principle for an explicit construction on the neural network without the need for
training.

Theorem 9.6 (Optimal Lipschitz Reconstruction). Let m, d ∈ N, Ω ⊆ Rd , f : Ω → R, and let


x1 , . . . , xm ∈ Ω, y1 , . . . , ym ∈ R satisfy (9.2.2) and (9.2.3) with M̃ > 0. Further, let M ≥ M̃ and
let ∥ · ∥ = ∥ · ∥1 in (9.2.3) and (9.2.4).

111
Then, there exists a ReLU neural network Φ ∈ LipM (Ω) that interpolates the data (i.e. satisfies
(9.2.2)) and satisfies

Φ ∈ argminh:Ω→R sup sup |f (x) − h(x)|.


f ∈LipM (Ω) x∈Ω
f satisfies (9.2.2)

Moreover, depth(Φ) = O(log(m)), width(Φ) = O(dm) and all weights of Φ are bounded in absolute
value by max{M, ∥y∥∞ }.

Proof. To prove the result, we simply need to show that the function in (9.2.6) can be expressed as
a ReLU neural network with the size bounds described in the theorem. First we notice, that there
is a simple ReLU neural network that implements the 1-norm. It holds for all x ∈ Rd that
d
X
∥x∥1 = (σ(xi ) + σ(−xi )) .
i=1

Thus, there exists a ReLU neural network Φ∥·∥1 such that for all x ∈ Rd

width(Φ∥·∥1 ) = 2d, depth(Φ∥·∥1 ) = 1, Φ∥·∥1 (x) = ∥x∥1

As a result, there exist ReLU neural networks Φk : Rd → R, k = 1, . . . , m, such that

width(Φk ) = 2d, depth(Φk ) = 1, Φk (x) = yk + M ∥x − xk ∥1

for all x ∈ Rd . Using the parallelization of neural networks introduced in Section 5.1.3, there exists
a ReLU neural network Φall := (Φ1 , . . . , Φm ) : Rd → Rm such that

width(Φall ) = 4md, depth(Φall ) = 1

and

Φall (x) = (yk + M ∥x − xk ∥1 )m


k=1 for all x ∈ Rd .

Using Lemma 5.11, we can now find a ReLU neural network Φupper such that Φupper = fupper (x)
for all x ∈ Ω, width(Φupper ) ≤ max{16m, 4md}, and depth(Φupper ) ≤ 1 + log(m).
Essentially the same construction yields a ReLU neural network Φlower with the respective
properties. Lemma 5.4 then completes the proof.

Bibliography and further reading


The universal interpolation theorem stated in this chapter is due to [176, Theorem 5.1]. There exist
several earlier interpolation results, which were shown under stronger assumptions: In [200], the
interpolation property is already linked with a rank condition on the matrix (9.1.2). However, no
general conditions on the activation functions were formulated. In [105], the interpolation theorem
is established under the assumption that the activation function σ is continuous and nondecreasing,

112
limx→−∞ σ(x) = 0, and limx→∞ σ(x) = 1. This result was improved in [97], which dropped the
nondecreasing assumption on σ.
The main idea of the optimal Lipschitz interpolation theorem in Section 9.2 is due to [13]. A
neural network construction of Lipschitz interpolants, which however is not the optimal interpolant
in the sense of Problem 9.4, is given in [108, Theorem 2.27].

113
Exercises
Exercise 9.7. Under the assumptions of Theorem 9.5, we define for x ∈ Ω the set of nearest
neighbors by
Ix := argmini=1,...,m ∥xi − x∥.
The one-nearest-neighbor classifier f1NN is defined by
1
f1NN (x) = (min yi + max yi ).
2 i∈Ix i∈Ix

Let ΦM be the function in (9.2.6). Show that for all x ∈ Ω

ΦM (x) → f1NN (x) as M → ∞.

Exercise 9.8. Extend Theorem 9.6 to the ∥ · ∥∞ -norm. Hint: The resulting neural network will
need to be deeper than the one of Theorem 9.6.

114
Figure 9.3: Two-dimensional example of the interpolation method of (9.2.6). From top left to
bottom we see fupper , flower , and Φ. The interpolation points (xi , yi )6i=1 are marked with red
crosses.

115
Chapter 10

Training of neural networks

Up to this point, we have discussed the representation and approximation of certain function classes
using neural networks. The second pillar of deep learning concerns the question of how to fit a
neural network to given data, i.e., having fixed an architecture, how to find suitable weights and
biases. This task amounts to minimizing a so-called objective function such as the empirical risk
R̂S in (1.2.3). Throughout this chapter we denote the objective function by
f : Rn → R,
and interpret it as a function of all neural network weights and biases collected in a vector in Rn .
The goal is to (approximately) determine a minimizer, i.e., some w∗ ∈ Rn satisfying
f (w∗ ) ≤ f (w) for all w ∈ Rn .
Standard approaches include, in particular, variants of (stochastic) gradient descent. These are
the topic of this chapter, in which we present basic ideas and results in convex optimization using
gradient-based methods.

10.1 Gradient descent


The general idea of gradient descent is to start with some w0 ∈ Rn , and then apply sequential
updates by moving in the direction of steepest descent of the objective function. Assume for the
moment that f ∈ C 2 (Rn ), and denote the kth iterate by wk . Then
f (wk + v) = f (wk ) + v ⊤ ∇f (wk ) + O(∥v∥2 ) for ∥v∥2 → 0. (10.1.1)
This shows that the change in f around wk is locally described by the gradient ∇f (wk ). For
small v the contribution of the second order term is negligible, and the direction v along which the
decrease of the risk is maximized equals the negative gradient −∇f (wk ). Thus, −∇f (wk ) is also
called the direction of steepest descent. This leads to an update of the form
wk+1 := wk − hk ∇f (wk ), (10.1.2)
where hk > 0 is referred to as the step size or learning rate. We refer to this iterative algorithm
as gradient descent.
In practice tuning the learning rate can be a subtle issue as it should strike a balance between
the following dissenting requirements:

116
Figure 10.1: Two examples of gradient descent as defined in (10.1.2). The red points represent the
wk .

(i) hk needs to be sufficiently small so that with v = −hk ∇f (wk ), the second-order term in
(10.1.1) is not dominating. This ensures that the update (10.1.2) decreases the objective
function.

(ii) hk should be large enough to ensure significant decrease of the objective function, which
facilitates faster convergence of the algorithm.

A learning rate that is too high might overshoot the minimum, while a rate that is too low results
in slow convergence. Common strategies include, in particular, constant learning rates (hk = h
for all k ∈ N0 ), learning rate schedules such as decaying learning rates (hk ↘ 0 as k → ∞), and
adaptive methods. For adaptive methods the algorithm dynamically adjust hk based on the values
of f (wj ) or ∇f (wj ) for j ≤ k.
Remark 10.1. It is instructive to interpret (10.1.2) as an Euler discretization of the “gradient flow”

w(0) = w0 , w′ (t) = −∇f (w(t)) for t ∈ [0, ∞). (10.1.3)

This ODE describes the movement of a particle w(t), whose velocity at time t ≥ 0 equals −∇f (w(t))—
the vector of steepest descent. Note that

df (w(t))
= ∇f (w(t)), w′ (t) = −∥∇f (w(t))∥2 ,
dt
and thus the dynamics (10.1.3) necessarily decreases the value of the objective function along its
path as long as ∇f (w(t)) ̸= 0.
Throughout the rest of Section 10.1 we assume that w0 ∈ Rn is arbitrary, and the sequence
(wk )k∈N0 is generated by (10.1.2). We will analyze the convergence of this algorithm under suitable
assumptions on f and the hk . The proofs primarily follow the arguments in [159, Chapter 2]. We
also refer to that book for a much more detailed discussion of gradient descent, and further reading
on convex optimization.

117
10.1.1 L-smoothness
A key assumption to analyze convergence of (10.1.2) is Lipschitz continuity of ∇f .

Definition 10.2. Let n ∈ N, and L > 0. The function f : Rn → R is called L-smooth if


f ∈ C 1 (Rn ) and

∥∇f (w) − ∇f (v)∥ ≤ L∥w − v∥ for all w, v ∈ Rn .

For fixed w, L-smoothness implies the linear growth bound

∥∇f (w + v)∥ ≤ ∥∇f (w)∥ + L∥v∥

for ∇f . Integrating the gradient along lines in Rn then shows that f is bounded from above by a
quadratic function touching the graph of f at w, as stated in the next lemma; also see Figure 10.2.

Lemma 10.3. Let n ∈ N and L > 0. Let f : Rn → R be L-smooth. Then


L
f (v) ≤ f (w) + ⟨∇f (w), v − w⟩ + ∥w − v∥2 for all w, v ∈ Rn . (10.1.4)
2

Proof. We have for all w, v ∈ Rn


Z 1
f (v) = f (w) + ⟨∇f (w + t(v − w)), v − w⟩ dt
0
Z 1
= f (w) + ⟨∇f (w), v − w⟩ + ⟨∇f (w + t(v − w)) − ∇f (w), v − w⟩ dt.
0

Thus
Z 1
L
f (v) − f (w) − ⟨∇f (w), v − w⟩ ≤ L∥t(v − w)∥∥v − w∥ dt = ∥v − w∥2 ,
0 2

which shows (10.1.4).

Remark 10.4. The argument in the proof of Lemma 10.3 also gives the lower bound
L
f (v) ≥ f (w) + ⟨∇f (w), v − w⟩ − ∥w − v∥2 for all w, v ∈ Rn . (10.1.5)
2
The previous lemma allows us to show a decay property for the gradient descent iterates.
Specifically, the values of f necessarily decrease in each iteration as long as the step size hk is small
enough, and ∇f (wk ) ̸= 0.

Lemma 10.5. Let n ∈ N and L > 0. Let f : Rn → R be L-smooth. Further, let (hk )∞k=0 be positive numbers and let (wk )∞k=0 ⊆ Rn be defined by (10.1.2).
Then, for all k ∈ N0
f (wk+1 ) ≤ f (wk ) − (hk − Lh2k /2) ∥∇f (wk )∥2 . (10.1.6)

Proof. Lemma 10.3 with v = wk+1 and w = wk gives
f (wk+1 ) ≤ f (wk ) + ⟨∇f (wk ), −hk ∇f (wk )⟩ + (L/2)∥hk ∇f (wk )∥2 ,
which corresponds to (10.1.6).
Remark 10.6. The right-hand side in (10.1.6) is minimized for step size hk = 1/L, in which case (10.1.6) reads
f (wk+1 ) ≤ f (wk ) − (1/(2L))∥∇f (wk )∥2 .
Next, let us discuss the behavior of the gradients for constant step sizes.

Proposition 10.7. Let n ∈ N and L > 0. Let f : Rn → R be L-smooth. Further, let hk = h ∈ (0, 2/L) for all k ∈ N, and let (wk )∞k=0 ⊆ Rn be defined by (10.1.2).
Then, for all k ∈ N
(1/(k + 1)) Σ_{j=0}^{k} ∥∇f (wj )∥2 ≤ (1/(k + 1)) · (2/(2h − Lh2 )) · (f (w0 ) − f (wk+1 )). (10.1.7)

Proof. Set c := h − (Lh2 )/2 = (2h − Lh2 )/2 > 0. By (10.1.6) for j ≥ 0
f (wj ) − f (wj+1 ) ≥ c∥∇f (wj )∥2 .
Hence
Σ_{j=0}^{k} ∥∇f (wj )∥2 ≤ (1/c) Σ_{j=0}^{k} (f (wj ) − f (wj+1 )) = (1/c) (f (w0 ) − f (wk+1 )).

Dividing by k + 1 concludes the proof.


Suppose that f is bounded from below, i.e. inf w∈Rn f (w) > −∞. In this case, the right-hand
side in (10.1.7) behaves like O(k −1 ) as k → ∞, and (10.1.7) implies
min_{j=1,...,k} ∥∇f (wj )∥ = O(k −1/2 ).

Thus, lower boundedness of the objective function together with L-smoothness already suffice to
obtain some form of convergence of the gradients to 0. We emphasize that this does not imply
convergence of wk towards some w∗ with ∇f (w∗ ) = 0 as the example f (w) = arctan(w), w ∈ R,
shows.
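This behavior is easy to observe numerically. The following sketch (an illustration only, not part of the analysis) runs gradient descent on f (w) = arctan(w): the gradient tends to zero, while the iterates diverge.

w, h = 0.0, 1.0                      # h < 2/L, since L is roughly 0.65 for arctan
for k in range(10000):
    w = w - h / (1.0 + w * w)        # grad f(w) = 1/(1 + w^2)
print(w, 1.0 / (1.0 + w * w))        # w is large and negative, the gradient is tiny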

10.1.2 Convexity
While L-smoothness entails some interesting properties of gradient descent, it does not have any
direct implications on the existence or uniqueness of minimizers. To show convergence of f (wk )
towards minw f (w) for k → ∞ (assuming this minimum exists), we will assume that f is a convex
function.

Definition 10.8. Let n ∈ N. A function f : Rn → R is called convex if and only if

f (λw + (1 − λ)v) ≤ λf (w) + (1 − λ)f (v), (10.1.8)

for all w, v ∈ Rn , λ ∈ (0, 1).

Let n ∈ N. If f ∈ C 1 (Rn ), then f is convex if and only if

f (w) + ⟨∇f (w), v − w⟩ ≤ f (v) for all w, v ∈ Rn , (10.1.9)

as shown in Exercise 10.27. Thus, f ∈ C 1 (Rn ) is convex if and only if the graph of f lies above
each of its tangents, see Figure 10.2.
For convex f , a minimizer neither needs to exist (e.g., f (w) = w for w ∈ R) nor be unique
(e.g., f (w) = 0 for w ∈ Rn ). However, if w∗ and v ∗ are two minimizers, then every convex
combination λw∗ + (1 − λ)v ∗ , λ ∈ [0, 1], is also a minimizer due to (10.1.8). Thus, the set of all
minimizers is convex. In particular, a convex objective function has either zero, one, or infinitely
many minimizers. Moreover, if f ∈ C 1 (Rn ) then ∇f (w) = 0 implies

f (w) = f (w) + ⟨∇f (w), v − w⟩ ≤ f (v) for all v ∈ Rn .

Thus, w is a minimizer of f if and only if ∇f (w) = 0.


By Lemma 10.5, smallness of the step sizes and L-smoothness suffice to show a decay property
for the objective function f . Under the additional assumption of convexity, we also get a decay
property for the distance of wk to any minimizer w∗ .

Lemma 10.9. Let n ∈ N and L > 0. Let f : Rn → R be L-smooth and convex. Further, let hk ∈ (0, 2/L) for all k ∈ N0 , and let (wk )∞k=0 ⊆ Rn be defined by (10.1.2). Suppose that w∗ is a minimizer of f .
Then, for all k ∈ N0
∥wk+1 − w∗ ∥2 ≤ ∥wk − w∗ ∥2 − hk · (2/L − hk ) ∥∇f (wk )∥2 .

To prove the lemma, we will require the following inequality [159, Theorem 2.1.5].

Lemma 10.10. Let n ∈ N and L > 0. Let f : Rn → R be L-smooth and convex.
Then,
(1/L)∥∇f (w) − ∇f (v)∥2 ≤ ⟨∇f (w) − ∇f (v), w − v⟩ for all w, v ∈ Rn .

Proof. Fix w ∈ Rn and set Ψ(u) := f (u) − ⟨∇f (w), u⟩ for all u ∈ Rn . Then ∇Ψ(u) = ∇f (u) −
∇f (w) and thus Ψ is L-smooth. Moreover, convexity of f , specifically (10.1.9), yields Ψ(u) ≥
f (w) − ⟨∇f (w), w⟩ = Ψ(w) for all u ∈ Rn , and thus w is a minimizer of Ψ. Using (10.1.4) on Ψ
we get for every v ∈ Rn
Ψ(w) = min_{u∈Rn} Ψ(u) ≤ min_{u∈Rn} ( Ψ(v) + ⟨∇Ψ(v), u − v⟩ + (L/2)∥u − v∥2 )
= min_{t≥0} ( Ψ(v) − t∥∇Ψ(v)∥2 + (L/2) t2 ∥∇Ψ(v)∥2 )
= Ψ(v) − (1/(2L))∥∇Ψ(v)∥2
since the minimum of t ↦ t2 L/2 − t is attained at t = L−1 . This implies
f (w) − f (v) + (1/(2L))∥∇f (w) − ∇f (v)∥2 ≤ ⟨∇f (w), w − v⟩ .
Adding the same inequality with the roles of w and v switched gives the result.

Proof of Lemma 10.9. It holds


∥wk+1 − w∗ ∥2 = ∥wk − w∗ ∥2 − 2hk ⟨∇f (wk ), wk − w∗ ⟩ + h2k ∥∇f (wk )∥2 .
Since ∇f (w∗ ) = 0, Lemma 10.10 gives
− ⟨∇f (wk ), wk − w∗ ⟩ ≤ −(1/L)∥∇f (wk )∥2 ,
which implies the claim.

These preparations allow us to show that for constant step size h < 2/L, we obtain convergence
of f (wk ) towards f (w∗ ) with rate O(k −1 ), as stated in the next theorem which corresponds to
[159, Theorem 2.1.14].

Theorem 10.11. Let n ∈ N and L > 0. Let f : Rn → R be L-smooth and convex. Further, let hk = h ∈ (0, 2/L) for all k ∈ N0 , and let (wk )∞k=0 ⊆ Rn be defined by (10.1.2). Suppose that w∗ is a minimizer of f .
Then, f (wk ) − f (w∗ ) = O(k −1 ) for k → ∞, and for the specific choice h = 1/L
f (wk ) − f (w∗ ) ≤ (2L/(4 + k)) ∥w0 − w∗ ∥2 for all k ∈ N0 . (10.1.10)

Proof. The case w0 = w∗ is trivial and throughout we assume w0 ̸= w∗ .
Step 1. Let j ∈ N0 . Using convexity (10.1.9)

f (wj ) − f (w∗ ) ≤ − ⟨∇f (wj ), w∗ − wj ⟩ ≤ ∥∇f (wj )∥∥w∗ − wj ∥. (10.1.11)

By Lemma 10.9 and since w0 ̸= w∗ it holds ∥w∗ − wj ∥ ≤ ∥w∗ − w0 ∥ ̸= 0, so that we obtain a lower bound on the gradient
∥∇f (wj )∥2 ≥ (f (wj ) − f (w∗ ))2 / ∥w∗ − w0 ∥2 .
Lemma 10.5 then yields
f (wj+1 ) − f (w∗ ) ≤ f (wj ) − f (w∗ ) − (h − Lh2 /2) ∥∇f (wj )∥2
≤ f (wj ) − f (w∗ ) − (h − Lh2 /2) (f (wj ) − f (w∗ ))2 / ∥w0 − w∗ ∥2 .
With ej := f (wj ) − f (w∗ ) and ω := (h − Lh2 /2)/∥w0 − w∗ ∥2 this reads

ej+1 ≤ ej − ωe2j = ej · (1 − ωej ), (10.1.12)

which is valid for all j ∈ N0 .


Step 2. By L-smoothness (10.1.4) and ∇f (w∗ ) = 0 it holds
f (w0 ) − f (w∗ ) ≤ (L/2)∥w0 − w∗ ∥2 , (10.1.13)
which implies (10.1.10) for k = 0. It remains to show the bound for k ∈ N.
Fix k ∈ N. We may assume ek > 0, since otherwise (10.1.10) is trivial. Then ej > 0 for all
j = 0, . . . , k − 1 since ej = 0 implies ei = 0 for all i > j, contradicting ek > 0. Moreover, ωej < 1
for all j = 0, . . . , k − 1, since ωej ≥ 1 implies ej+1 ≤ 0 by (10.1.12), contradicting ej+1 > 0.
Using that 1/(1 − x) ≥ 1 + x for all x ∈ [0, 1), (10.1.12) thus gives
1/ej+1 ≥ (1/ej )(1 + ωej ) = 1/ej + ω for all j = 0, . . . , k − 1.
Hence
1/ek − 1/e0 = Σ_{j=0}^{k−1} (1/ej+1 − 1/ej ) ≥ kω
and
f (wk ) − f (w∗ ) = ek ≤ 1/(1/e0 + kω) = 1/( 1/(f (w0 ) − f (w∗ )) + k (h − Lh2 /2)/∥w0 − w∗ ∥2 ).

Using (10.1.13) we get
f (wk ) − f (w∗ ) ≤ ∥w0 − w∗ ∥2 / (2/L + kh · (1 − Lh/2)) = O(k −1 ). (10.1.14)
Finally, (10.1.10) follows by plugging in h = 1/L.

Figure 10.2: The graph of L-smooth functions lies between two quadratic functions at each point, see (10.1.4) and (10.1.5); the graph of convex functions lies above the tangent at each point, see (10.1.9); and the graph of µ-strongly convex functions lies above a quadratic function at each point, see (10.1.15).

Remark 10.12. The step size h = 1/L is again such that the upper bound in (10.1.14) is minimized.
We emphasize that while under the assumptions of Theorem 10.11 it holds f (wk ) → f (w∗ ),
in general it is not true that wk → w∗ as k → ∞. To show the convergence of the wk , we need to
introduce stronger assumptions that guarantee the existence of a unique minimizer.

10.1.3 Strong convexity


To obtain faster convergence and guarantee the existence of unique minimizers, we next introduce
the notion of strong convexity. As the terminology suggests, strong convexity implies convexity;
specifically, while convexity requires f to be lower bounded by the linearization around each point,
strongly convex functions are lower bounded by the linearization plus a positive quadratic term.

Definition 10.13. Let n ∈ N and µ > 0. A function f ∈ C 1 (Rn ) is called µ-strongly convex if
f (v) ≥ f (w) + ⟨∇f (w), v − w⟩ + (µ/2)∥v − w∥2 for all w, v ∈ Rn . (10.1.15)

Note that (10.1.15) is the opposite of the bound (10.1.4) implied by L-smoothness. We depict
the three notions of L-smoothness, convexity, and µ-strong convexity in Figure 10.2.
Every µ-strongly convex function has a unique minimizer. To see this note first that (10.1.15)
implies f to be lower bounded by a convex quadratic function, so that there exists at least one
minimizer w∗ , and ∇f (w∗ ) = 0. By (10.1.15) we then have f (v) > f (w∗ ) for every v ̸= w∗ .
The next theorem shows that the gradient descent iterates converge linearly towards the unique
minimizer for L-smooth and µ-strongly convex functions. Recall that a sequence ek is said to
converge linearly to 0, if and only if there exist constants C > 0 and c ∈ [0, 1) such that

ek ≤ Cck for all k ∈ N0 .

The constant c is also referred to as the rate of convergence. Before giving the statement, we first
note that comparing (10.1.4) and (10.1.15) it necessarily holds L ≥ µ and therefore κ := L/µ ≥ 1.
This term is known as the condition number of f . It crucially influences the rate of convergence.

Theorem 10.14. Let n ∈ N and L ≥ µ > 0. Let f : Rn → R be L-smooth and µ-strongly convex. Further, let hk = h ∈ (0, 1/L] for all k ∈ N0 , let (wk )∞k=0 ⊆ Rn be defined by (10.1.2), and let w∗ be the unique minimizer of f .
Then, f (wk ) → f (w∗ ) and wk → w∗ converge linearly for k → ∞. For the specific choice h = 1/L
∥wk − w∗ ∥2 ≤ (1 − µ/L)k ∥w0 − w∗ ∥2 , (10.1.16a)
f (wk ) − f (w∗ ) ≤ (L/2)(1 − µ/L)k ∥w0 − w∗ ∥2 . (10.1.16b)

Proof. It suffices to show (10.1.16a) since (10.1.16b) follows directly by Lemma 10.3 and because
∇f (w∗ ) = 0. The case k = 0 is trivial, so let k ∈ N.
Expanding wk = wk−1 − h∇f (wk−1 ) and using µ-strong convexity (10.1.15)

∥wk − w∗ ∥2 = ∥wk−1 − w∗ ∥2 − 2h ⟨∇f (wk−1 ), wk−1 − w∗ ⟩ + h2 ∥∇f (wk−1 )∥2


≤ (1 − µh)∥wk−1 − w∗ ∥2 − 2h · (f (wk−1 ) − f (w∗ )) + h2 ∥∇f (wk−1 )∥2 .

Moreover, the descent property in Lemma 10.5 gives

− 2h · (f (wk−1 ) − f (w∗ )) + h2 ∥∇f (wk−1 )∥2
≤ −2h · (f (wk−1 ) − f (w∗ )) + (h2 /(h · (1 − Lh/2))) (f (wk−1 ) − f (wk )). (10.1.17)

The descent property also implies f (wk−1 ) − f (w∗ ) ≥ f (wk−1 ) − f (wk ). Thus the right-hand side
of (10.1.17) is less or equal to zero as long as 2h ≥ h/(1 − Lh/2), which is equivalent to h ≤ 1/L.
Hence

∥wk − w∗ ∥2 ≤ (1 − µh)∥wk−1 − w∗ ∥2 ≤ · · · ≤ (1 − µh)k ∥w0 − w∗ ∥2 .

This concludes the proof.

Remark 10.15. With a more refined argument, see [159, Theorem 2.1.15], the constraint on the
step size can be relaxed to h ≤ 2/(µ + L). For h = 2/(µ + L) one then obtains (10.1.16) with
1 − µ/L = 1 − κ−1 replaced by
((L/µ − 1)/(L/µ + 1))2 = ((κ − 1)/(κ + 1))2 ∈ [0, 1). (10.1.18)

We have
((κ − 1)/(κ + 1))2 = 1 − 4κ−1 + O(κ−2 )
as κ → ∞. Thus, (10.1.18) gives a slightly better, but conceptually similar, rate of convergence
than 1 − κ−1 shown in Theorem 10.14.
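The linear rate of Theorem 10.14 is easy to verify numerically. The following sketch (illustrative values only) runs gradient descent with h = 1/L on the quadratic f (w) = w⊤Dw/2 with D = diag(L, µ), which is L-smooth and µ-strongly convex, and compares the observed squared error with the bound (10.1.16a).

import numpy as np

L_, mu, k = 10.0, 1.0, 50
D = np.diag([L_, mu])
w0 = np.array([1.0, 1.0])
w = w0.copy()
for _ in range(k):
    w = w - (1.0 / L_) * (D @ w)                 # gradient of f is D w
print(np.sum(w**2), (1 - mu / L_)**k * np.sum(w0**2))   # observed error vs. bound (10.1.16a)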

10.1.4 PL-inequality
Linear convergence for gradient descent can also be shown under a weaker assumption known as
the Polyak-Lojasiewicz-inequality, or PL-inequality for short.

Lemma 10.16. Let n ∈ N and µ > 0. Let f : Rn → R be µ-strongly convex and denote its unique
minimizer by w∗ . Then f satisfies the PL-inequality
µ · (f (w) − f (w∗ )) ≤ (1/2)∥∇f (w)∥2 for all w ∈ Rn . (10.1.19)

Proof. By µ-strong convexity we have


f (v) ≥ f (w) + ⟨∇f (w), v − w⟩ + (µ/2)∥v − w∥2 for all v, w ∈ Rn . (10.1.20)
The gradient of the right-hand side with respect to v equals ∇f (w) + µ · (v − w). This implies
that the minimum of this expression is attained at v = w − ∇f (w)/µ. Minimizing both sides of
(10.1.20) in v we thus find
f (w∗ ) ≥ f (w) − (1/µ)∥∇f (w)∥2 + (1/(2µ))∥∇f (w)∥2 = f (w) − (1/(2µ))∥∇f (w)∥2 .
Rearranging the terms gives (10.1.19).

As the lemma states, the PL-inequality is implied by strong convexity. Moreover, it is indeed
weaker than strong convexity, and does not even imply convexity, see Exercise 10.28. The next
theorem, which corresponds to [220, Theorem 1], gives a convergence result for L-smooth functions
satisfying the PL-inequality. It therefore does not require convexity. The proof is left as an exercise.
We only note that the PL-inequality bounds the distance to the minimal value of the objective
function by the squared norm of the gradient. It is thus precisely the type of bound required to
show convergence of gradient descent.

Theorem 10.17. Let n ∈ N and L > 0. Let f : Rn → R be L-smooth. Further, let hk = 1/L for all k ∈ N0 , let (wk )∞k=0 ⊆ Rn be defined by (10.1.2), and let w∗ be a (not necessarily unique) minimizer of f , so that the PL-inequality (10.1.19) holds.
Then, it holds for all k ∈ N0 that
f (wk ) − f (w∗ ) ≤ (1 − µ/L)k (f (w0 ) − f (w∗ )).

10.2 Stochastic gradient descent (SGD)


We next discuss a stochastic variant of gradient descent. The idea, which originally goes back to
Robbins and Monro [191], is to replace the gradient ∇f (wk ) in (10.1.2) by a random variable that

we denote by Gk . We interpret Gk as an approximation to ∇f (wk ); specifically, throughout we
will assume that (given wk ) Gk is an unbiased estimator, i.e.

E[Gk |wk ] = ∇f (wk ). (10.2.1)

After choosing some initial value w0 ∈ Rn , the update rule becomes

wk+1 := wk − hk Gk , (10.2.2)

where hk > 0 denotes again the step size, and unlike in Section 10.1, we focus here on the case of hk
depending on k. The iteration (10.2.2) creates a Markov chain (w0 , w1 , . . . ), meaning that wk is
a random variable, and its state only depends1 on wk−1 . The main reason for replacing the actual
gradient by an estimator, is not to improve the accuracy or convergence rate, but rather to decrease
the computational cost and storage requirements of the algorithm. The underlying assumption is
that Gk−1 can be computed at a fraction of the cost required for the computation of ∇f (wk−1 ).
The next example illustrates this in the standard setting.

Example 10.18 (Empirical risk minimization). Suppose we have some data S := (xj , yj )_{j=1}^{m} , where yj ∈ R is the label corresponding to the data point xj ∈ Rd . Using the square loss, we wish to fit a neural network Φ(·, w) : Rd → R depending on parameters (i.e. weights and biases) w ∈ Rn , such that the empirical risk
f (w) := R̂S (w) = (1/(2m)) Σ_{j=1}^{m} (Φ(xj , w) − yj )2 ,

is minimized. Performing one step of gradient descent requires the computation of


∇f (w) = (1/m) Σ_{j=1}^{m} (Φ(xj , w) − yj )∇w Φ(xj , w), (10.2.3)

and thus the computation of m gradients of the neural network Φ. For large m (in practice m can
be in the millions or even larger), this computation might be infeasible. To decrease computational
complexity, we replace the full gradient (10.2.3) by

G := (Φ(xj , w) − yj )∇w Φ(xj , w)

where j ∼ uniform(1, . . . , m) is a random variable with uniform distribution on the discrete set
{1, . . . , m}. Then
E[G] = (1/m) Σ_{j=1}^{m} (Φ(xj , w) − yj )∇w Φ(xj , w) = ∇f (w),

but an evaluation of G merely requires the computation of a single gradient of the neural network. More generally, one can choose a mini-batch size mb (where mb ≪ m) and let G = (1/mb ) Σ_{j∈J} (Φ(xj , w) − yj )∇w Φ(xj , w), where J is a random subset of {1, . . . , m} of cardinality mb .
1
More precisely, given wk−1 , the state of wk is conditionally independent of w1 , . . . , wk−2 . See Appendix A.3.3.

Remark 10.19. In practice, the following variant is also common: Let mb · k = m for mb , k, m ∈ N, i.e. the number of data points m is a k-fold multiple of the mini-batch size mb . In each epoch, first a random partition ∪̇_{i=1}^{k} Ji = {1, . . . , m} is determined. Then for each i = 1, . . . , k, the weights are updated with the gradient estimate
(1/mb ) Σ_{j∈Ji} (Φ(xj , w) − yj )∇w Φ(xj , w).

Hence, in one epoch (which corresponds to k updates of the neural network weights), the algorithm
sweeps through the whole dataset.
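The following sketch illustrates the construction of an unbiased mini-batch gradient estimator as in Example 10.18; for simplicity the network Φ(x, w) is replaced by a linear model x⊤w, and all sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
m, d, mb = 1000, 5, 32                     # data set size, dimension, mini-batch size
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

def full_gradient(w):
    # gradient of f(w) = (1/(2m)) * sum_j (x_j^T w - y_j)^2
    return X.T @ (X @ w - y) / m

def minibatch_gradient(w):
    # unbiased estimator: the same expression averaged over a random subset J
    J = rng.choice(m, size=mb, replace=False)
    return X[J].T @ (X[J] @ w - y[J]) / mb

w = np.zeros(d)
print(np.linalg.norm(full_gradient(w) - minibatch_gradient(w)))  # small but nonzero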
SGD can be analyzed in various settings. To give the general idea, we concentrate on the case
of L-smooth and µ-strongly convex objective functions. Let us start by looking at a property akin
to the (descent) Lemma 10.5. Using Lemma 10.3
f (wk+1 ) ≤ f (wk ) − hk ⟨∇f (wk ), Gk ⟩ + (L/2) h2k ∥Gk ∥2 .
In contrast to gradient descent, we cannot say anything about the sign of the term in the middle of
the right-hand side. Thus, (10.2.2) need not necessarily decrease the value of the objective function
in every step. The key insight is that in expectation the value is still decreased under certain
assumptions, namely

E[f (wk+1 )|wk ] ≤ f (wk ) − hk E[⟨∇f (wk ), Gk ⟩ |wk ] + (L/2) h2k E[∥Gk ∥2 |wk ]
= f (wk ) − hk ∥∇f (wk )∥2 + (L/2) h2k E[∥Gk ∥2 |wk ]
= f (wk ) − hk (∥∇f (wk )∥2 − (L/2) hk E[∥Gk ∥2 |wk ]),
where we used (10.2.1).
Assuming, for some fixed γ > 0, the uniform bound

E[∥Gk ∥2 |wk ] ≤ γ

and that ∥∇f (wk )∥ > 0 (which is true unless wk is the minimizer), upon choosing
0 < hk < 2∥∇f (wk )∥2 /(Lγ),
the expectation of the objective function decreases. Since ∇f (wk ) tends to 0 as we approach the
minimum, this also indicates that we should choose step sizes hk that tend to 0 for k → ∞. For
our analysis we will work with the specific choice
hk := (1/µ) · ((k + 1)2 − k 2 )/(k + 1)2 for all k ∈ N0 , (10.2.4)
as, e.g., in [76]. Note that
hk = (2k + 1)/(µ(k + 1)2 ) = 2/(µ(k + 1)) + O(k −2 ) = O(k −1 ).

Since wk is a random variable by construction, a convergence statement can only be stochastic,
e.g., in expectation or with high probability. We concentrate here on the former, but emphasize
that also the latter can be shown.

Theorem 10.20. Let n ∈ N and L, µ, γ > 0. Let f : Rn → R be L-smooth and µ-strongly convex. Let (hk )∞k=0 satisfy (10.2.4) and let (Gk )∞k=0 , (wk )∞k=0 be sequences of random variables satisfying (10.2.1) and (10.2.2). Assume that E[∥Gk ∥2 |wk ] ≤ γ for all k ∈ N0 .
Then
E[∥wk − w∗ ∥2 ] ≤ 4γ/(µ2 k) = O(k −1 ),
E[f (wk )] − f (w∗ ) ≤ 4Lγ/(2µ2 k) = O(k −1 )
for k → ∞.

Proof. We proceed similar as in the proof of Theorem 10.14. It holds for k ≥ 1

E[∥wk − w∗ ∥2 |wk−1 ]
= ∥wk−1 − w∗ ∥2 − 2hk−1 E[⟨Gk−1 , wk−1 − w∗ ⟩ |wk−1 ] + h2k−1 E[∥Gk−1 ∥2 |wk−1 ]
= ∥wk−1 − w∗ ∥2 − 2hk−1 ⟨∇f (wk−1 ), wk−1 − w∗ ⟩ + h2k−1 E[∥Gk−1 ∥2 |wk−1 ].

By µ-strong convexity (10.1.15)

−2hk−1 ⟨∇f (wk−1 ), wk−1 − w∗ ⟩ ≤ −µhk−1 ∥wk−1 − w∗ ∥2 − 2hk−1 · (f (wk−1 ) − f (w∗ ))


≤ −µhk−1 ∥wk−1 − w∗ ∥2 .

Thus

E[∥wk − w∗ ∥2 |wk−1 ] ≤ (1 − µhk−1 )∥wk−1 − w∗ ∥2 + h2k−1 γ.

Using the Markov property, we have

E[∥wk − w∗ ∥2 |wk−1 , wk−2 ] = E[∥wk − w∗ ∥2 |wk−1 ]

so that

E[∥wk − w∗ ∥2 |wk−1 ] ≤ (1 − µhk−1 )E[∥wk−1 − w∗ ∥2 |wk−2 ] + h2k−1 γ.

With e0 := ∥w0 − w∗ ∥2 and ek := E[∥wk − w∗ ∥2 |wk−1 ] for k ≥ 1 we have found

ek ≤ (1 − µhk−1 )ek−1 + h2k−1 γ


≤ (1 − µhk−1 )((1 − µhk−2 )ek−2 + h2k−2 γ) + h2k−1 γ
≤ · · · ≤ e0 Π_{j=0}^{k−1} (1 − µhj ) + γ Σ_{j=0}^{k−1} h2j Π_{i=j+1}^{k−1} (1 − µhi ).

By choice of the hi ,
Π_{i=j}^{k−1} (1 − µhi ) = Π_{i=j}^{k−1} i2 /(i + 1)2 = j 2 /k 2

and thus
ek ≤ (γ/µ2 ) Σ_{j=0}^{k−1} ( ((j + 1)2 − j 2 )/(j + 1)2 )2 · ((j + 1)2 /k 2 )
≤ (γ/µ2 ) (1/k 2 ) Σ_{j=0}^{k−1} (2j + 1)2 /(j + 1)2
≤ (γ/µ2 ) (4k/k 2 )
≤ 4γ/(µ2 k),
where we used that (2j + 1)2 /(j + 1)2 ≤ 4.
Since E[∥wk − w∗ ∥2 ] is the expectation of E[∥wk − w∗ ∥2 |wk−1 ] with respect to the random variable wk−1 , and 4γ/(µ2 k) is a constant independent of wk−1 , we obtain
E[∥wk − w∗ ∥2 ] ≤ 4γ/(µ2 k).
Finally, using L-smoothness
f (wk ) − f (w∗ ) ≤ ⟨∇f (w∗ ), wk − w∗ ⟩ + (L/2)∥wk − w∗ ∥2 = (L/2)∥wk − w∗ ∥2 ,
and taking the expectation concludes the proof.

The specific choice of hk in (10.2.4) simplifies the calculations in the proof, but it is not necessary
in order for the asymptotic convergence to hold. One can show similar convergence results with
hk = c1 /(c2 + k) under certain assumptions on c1 , c2 , e.g. [23, Theorem 4.7].
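As a numerical illustration (not a substitute for the analysis above), the following sketch runs SGD with the step size (10.2.4) on the strongly convex objective f (w) = (µ/2)∥w∥2 using noisy gradients; the noise level is illustrative, and the uniform bound on E[∥Gk∥2] holds here only along the trajectory.

import numpy as np

rng = np.random.default_rng(1)
mu, w = 1.0, np.array([5.0, -3.0])
k_max = 10000
for k in range(k_max):
    h = ((k + 1)**2 - k**2) / (mu * (k + 1)**2)    # step size (10.2.4)
    G = mu * w + 0.1 * rng.normal(size=2)          # unbiased estimate of grad f(w) = mu * w
    w = w - h * G
print(np.sum(w**2))                                # roughly of the order gamma/(mu^2 k_max)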

10.3 Backpropagation
We now explain how to apply gradient-based methods to the training of neural networks. Let
Φ ∈ N_{d0}^{dL+1} (σ; L, n) (see Definition 3.6) and assume that the activation function satisfies σ ∈ C 1 (R).
As earlier, we denote the neural network parameters by

w = ((W (0) , b(0) ), . . . , (W (L) , b(L) )) (10.3.1)

with weight matrices W (ℓ) ∈ Rdℓ+1 ×dℓ and bias vectors b(ℓ) ∈ Rdℓ+1 . Additionally, we fix a differ-
entiable loss function L : RdL+1 × RdL+1 → R, e.g., L(w, w̃) = ∥w − w̃∥2 /2, and assume given data
(xj , y j )_{j=1}^{m} ⊆ Rd0 × RdL+1 . The goal is to minimize an empirical risk of the form
f (w) := (1/m) Σ_{j=1}^{m} L(Φ(xj , w), y j )

as a function of the neural network parameters w. An application of the gradient step (10.1.2) to
update the parameters requires the computation of
∇f (w) = (1/m) Σ_{j=1}^{m} ∇w L(Φ(xj , w), y j ).

For stochastic methods, as explained in Example 10.18, we only compute the average over a (ran-
dom) subbatch of the dataset. In either case, we need an algorithm to determine ∇w L(Φ(x, w), y),
i.e. the gradients

∇b(ℓ) L(Φ(x, w), y) ∈ Rdℓ+1 , ∇W (ℓ) L(Φ(x, w), y) ∈ Rdℓ+1 ×dℓ (10.3.2)

for all ℓ = 0, . . . , L.
The backpropagation algorithm [197] provides an efficient way to do so. To explain it, for fixed
input x ∈ Rd0 introduce the notation

x̄(1) := W (0) x + b(0) (10.3.3a)


(ℓ+1) (ℓ) (ℓ) (ℓ)
x̄ := W σ(x̄ ) + b for ℓ ∈ {1, . . . , L}, (10.3.3b)

where the application of σ : R → R to a vector is, as always, understood componentwise. With the
notation of Definition 2.1, x(ℓ) = σ(x̄(ℓ) ) ∈ Rdℓ for ℓ = 1, . . . , L and x̄(L+1) = x(L+1) = Φ(x, w) ∈
RdL+1 is the output of the neural network. Therefore, the x̄(ℓ) for ℓ = 1, . . . , L are sometimes also
referred to as the preactivations.
In the following, we additionally fix y ∈ RdL+1 and write

L := L(Φ(x, w), y) = L(x̄(L+1) , y).

Note that x̄(k) depends on (W (ℓ) , b(ℓ) ) only if k > ℓ. Since x̄(ℓ+1) is a function of x̄(ℓ) for each ℓ,
by repeated application of the chain rule
∂L/∂Wij(ℓ) = (∂L/∂ x̄(L+1) ) · (∂ x̄(L+1) /∂ x̄(L) ) · · · (∂ x̄(ℓ+2) /∂ x̄(ℓ+1) ) · (∂ x̄(ℓ+1) /∂Wij(ℓ) ), (10.3.4)
where the factors have sizes R1×dL+1 , RdL+1 ×dL , . . . , Rdℓ+2 ×dℓ+1 , and Rdℓ+1 ×1 , respectively.

An analogous calculation holds for ∂L/∂bj(ℓ) . Since all terms in (10.3.4) are easy to compute (see
(10.3.3)), in principle we could use this formula to determine the gradients in (10.3.2). To avoid
unnecessary computations, the main idea of backpropagation is to introduce

α(ℓ) := ∇x̄(ℓ) L ∈ Rdℓ for all ℓ = 1, . . . , L + 1

and observe that


∂L/∂Wij(ℓ) = (α(ℓ+1) )⊤ ∂ x̄(ℓ+1) /∂Wij(ℓ) .

As the following lemma shows, the α(ℓ) can be computed recursively for ℓ = L + 1, . . . , 1. This
explains the name “backpropagation”. In the following, ⊙ denotes the componentwise (Hadamard)
product, i.e. a ⊙ b = (ai bi )di=1 for every a, b ∈ Rd .

Lemma 10.21. It holds

α(L+1) = ∇x̄(L+1) L(x̄(L+1) , y) (10.3.5)

and

α(ℓ) = σ ′ (x̄(ℓ) ) ⊙ (W (ℓ) )⊤ α(ℓ+1) for all ℓ = L, . . . , 1.

Proof. Equation (10.3.5) holds by definition. For ℓ ∈ {1, . . . , L} by the chain rule
α(ℓ) = ∂L/∂ x̄(ℓ) = (∂ x̄(ℓ+1) /∂ x̄(ℓ) )⊤ ∂L/∂ x̄(ℓ+1) = (∂ x̄(ℓ+1) /∂ x̄(ℓ) )⊤ α(ℓ+1) ,
where (∂ x̄(ℓ+1) /∂ x̄(ℓ) )⊤ ∈ Rdℓ ×dℓ+1 and α(ℓ+1) ∈ Rdℓ+1 ×1 .

By (10.3.3b) for i ∈ {1, . . . , dℓ+1 }, j ∈ {1, . . . , dℓ }


(∂ x̄(ℓ+1) /∂ x̄(ℓ) )ij = ∂ x̄i(ℓ+1) /∂ x̄j(ℓ) = Wij(ℓ) σ ′ (x̄j(ℓ) ).
Thus the claim follows.

Putting everything together, we obtain explicit formulas for (10.3.2).

Proposition 10.22. It holds

∇b(ℓ) L = α(ℓ+1) ∈ Rdℓ+1 for ℓ = 0, . . . , L

and

∇W (0) L = α(1) x⊤ ∈ Rd1 ×d0

and

∇W (ℓ) L = α(ℓ+1) σ(x̄(ℓ) )⊤ ∈ Rdℓ+1 ×dℓ for ℓ = 1, . . . , L.

Proof. By (10.3.3a) for i, k ∈ {1, . . . , d1 }, and j ∈ {1, . . . , d0 }


∂ x̄k(1) /∂bi(0) = δki and ∂ x̄k(1) /∂Wij(0) = δki xj ,
and by (10.3.3b) for ℓ ∈ {1, . . . , L} and i, k ∈ {1, . . . , dℓ+1 }, and j ∈ {1, . . . , dℓ }
∂ x̄k(ℓ+1) /∂bi(ℓ) = δki and ∂ x̄k(ℓ+1) /∂Wij(ℓ) = δki σ(x̄j(ℓ) ).

Thus, with ei = (δki )_{k=1}^{dℓ+1} ,
∂L/∂bi(ℓ) = (∂ x̄(ℓ+1) /∂bi(ℓ) )⊤ ∂L/∂ x̄(ℓ+1) = ei⊤ α(ℓ+1) = αi(ℓ+1) for ℓ ∈ {0, . . . , L}

and similarly
∂L/∂Wij(0) = (∂ x̄(1) /∂Wij(0) )⊤ α(1) = xj ei⊤ α(1) = xj αi(1)
and
∂L/∂Wij(ℓ) = σ(x̄j(ℓ) ) αi(ℓ+1) for ℓ ∈ {1, . . . , L}.

This concludes the proof.

Lemma 10.21 and Proposition 10.22 motivate Algorithm 1, in which a forward pass computing
x̄(ℓ) , ℓ = 1, . . . , L + 1, is followed by a backward pass to determine the α(ℓ) , ℓ = L + 1, . . . , 1,
and the gradients of L with respect to the neural network parameters. This shows how to use
gradient-based optimizers from the previous sections for the training of neural networks.
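For concreteness, the following NumPy sketch implements Algorithm 1 for the square loss L(w, w̃) = ∥w − w̃∥2 /2 and a tanh activation; the function names, the choice of activation, and the small architecture in the usage example are illustrative.

import numpy as np

def sigma(z):  return np.tanh(z)            # a C^1 activation (illustrative choice)
def dsigma(z): return 1.0 - np.tanh(z)**2

def backprop(x, y, weights, biases):
    # weights = [W0, ..., WL], biases = [b0, ..., bL], with W_l of shape (d_{l+1}, d_l)
    L = len(weights) - 1
    # forward pass: store the preactivations xbar^(1), ..., xbar^(L+1)
    xbar = [weights[0] @ x + biases[0]]
    for l in range(1, L + 1):
        xbar.append(weights[l] @ sigma(xbar[-1]) + biases[l])
    # backward pass
    alpha = xbar[-1] - y                    # alpha^(L+1) for the square loss
    grad_W = [None] * (L + 1)
    grad_b = [None] * (L + 1)
    for l in range(L, 0, -1):
        grad_b[l] = alpha                                   # grad wrt b^(l)
        grad_W[l] = np.outer(alpha, sigma(xbar[l - 1]))     # grad wrt W^(l)
        alpha = dsigma(xbar[l - 1]) * (weights[l].T @ alpha)
    grad_b[0] = alpha
    grad_W[0] = np.outer(alpha, x)
    return grad_W, grad_b

# usage: a network with layer widths 3 -> 4 -> 2
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [rng.normal(size=4), rng.normal(size=2)]
gW, gb = backprop(rng.normal(size=3), rng.normal(size=2), Ws, bs)
print([g.shape for g in gW])                # [(4, 3), (2, 4)]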
Two important remarks are in order. First, the objective function associated to neural networks
is typically not convex as a function of the neural network weights and biases. Thus, the analysis
of the previous sections will in general not be directly applicable. It may still give some insight
about the convergence behavior locally around the minimizer however. Second, to derive the back-
propagation algorithm we assumed the activation function to be continuously differentiable, which
does not hold for ReLU. Using the concept of subgradients, gradient-based algorithms and their
analysis may be generalized to some extent to also accommodate non-differentiable loss functions,
see Exercises 10.31–10.33.

10.4 Acceleration
Acceleration is an important tool for the training of neural networks [221]. The idea was first
introduced by Polyak in 1964 under the name “heavy ball method” [180]. It is inspired by the
dynamics of a heavy ball rolling down the valley of the loss landscape. Since then other types of
acceleration have been proposed and analyzed, with Nesterov acceleration being the most prominent
example [160]. In this section, we first give some intuition by discussing the heavy ball method for
a simple quadratic loss. Afterwards we turn to Nesterov acceleration and give a convergence proof
for L-smooth and µ-strongly convex objective functions that improves upon the bounds obtained
for gradient descent.

10.4.1 Heavy ball method


We proceed similarly as in [70, 181, 183] to motivate the idea. Consider the quadratic objective function in two dimensions
f (w) := (1/2) w⊤ Dw where D = diag(λ1 , λ2 ) (10.4.1)
Algorithm 1 Backpropagation
Input: Network input x, target output y, neural network parameters
(0) (0) (L) (L)
((W , b ), . . . , (W , b ))
Output: Gradients of the loss function L with respect to neural network parameters

Forward pass
x̄(1) ← W (0) x + b(0)
for ℓ = 1, . . . , L do
x̄(ℓ+1) ← W (ℓ) σ(x̄(ℓ) ) + b(ℓ)
end for

Backward pass
α(L+1) ← ∇x̄(L+1) L(x̄(L+1) , y)
for ℓ = L, . . . , 1 do
∇b(ℓ) L ← α(ℓ+1)
∇W (ℓ) L ← α(ℓ+1) σ(x̄(ℓ) )⊤
α(ℓ) ← σ ′ (x̄(ℓ) ) ⊙ (W (ℓ) )⊤ α(ℓ+1)
end for
∇b(0) L ← α(1)
∇W (0) L ← α(1) x⊤

with λ1 ≥ λ2 > 0. Clearly, f has a unique minimizer at w∗ = 0 ∈ R2 . Starting at some w0 ∈ R2 ,


gradient descent with constant step size h > 0 computes the iterates
wk+1 = wk − hDwk = diag(1 − hλ1 , 1 − hλ2 ) wk = diag((1 − hλ1 )k+1 , (1 − hλ2 )k+1 ) w0 .
The method converges for arbitrary initialization w0 if and only if
|1 − hλ1 | < 1 and |1 − hλ2 | < 1.
The optimal step size balancing the speed of convergence in both coordinates is
h∗ = argminh>0 max{|1 − hλ1 |, |1 − hλ2 |} = 2/(λ1 + λ2 ). (10.4.2)
With κ = λ1 /λ2 we then obtain the convergence rate
|1 − h∗ λ1 | = |1 − h∗ λ2 | = (λ1 − λ2 )/(λ1 + λ2 ) = (κ − 1)/(κ + 1) ∈ [0, 1). (10.4.3)
If λ1 ≫ λ2 , this term is close to 1, and thus the convergence will be slow. This is consistent with
our analysis for strongly convex objective functions; by Exercise 10.34 the condition number of f
equals κ = λ1 /λ2 ≫ 1. Hence, the upper bounds in Theorem 10.14 and Remark 10.15 converge
only slowly. Similar considerations hold for general quadratic objective functions in Rn such as
f̃ (w) = (1/2) w⊤ Aw + b⊤ w + c (10.4.4)
with A ∈ Rn×n symmetric positive definite, b ∈ Rn and c ∈ R, see Exercise 10.35.

Figure 10.3: 20 steps of gradient descent and the heavy ball method on the objective function (10.4.1) with λ1 = 12 ≫ 1 = λ2 , step size h = α = h∗ as in (10.4.2), and β = 1/3.

Remark 10.23. Interpreting (10.4.4) as a second-order Taylor expansion of some objective function
f˜ around its minimizer w∗ , we note that the described effects also occur for general objective
functions with ill-conditioned Hessians at the minimizer.
Figure 10.3 gives further insight into the poor performance of gradient descent for (10.4.1) with
λ1 ≫ λ2 . The loss-landscape looks like a ravine (the derivative is much larger in one direction than
the other), and away from the floor, ∇f mainly points to the opposite side. Therefore the iterates
oscillate back and forth in the first coordinate, and make little progress in the direction of the valley
along the second coordinate axis. To address this problem, the heavy ball method introduces a
“momentum” term which can mitigate this effect to some extent. The idea is to choose the update
not just according to the gradient at the current location, but to add information from the previous
steps. After initializing w0 and, e.g., w1 = w0 − α∇f (w0 ), let for k ∈ N

wk+1 = wk − α∇f (wk ) + β(wk − wk−1 ). (10.4.5)

This is known as Polyak’s heavy ball method [180]. Here α > 0 and β ∈ (0, 1) are hyperparameters
(that could also depend on k) and in practice need to be carefully tuned to balance the strength of
the gradient and the momentum term. Iteratively expanding (10.4.5) with the given initialization,
observe that for k ≥ 0
wk+1 = wk − α Σ_{j=0}^{k} β j ∇f (wk−j ). (10.4.6)

Thus, wk is updated using an exponentially weighted average of all past gradients. Choosing the
momentum parameter β in the interval (0, 1) ensures that the influence of previous gradients on the
update decays exponentially. The concrete value of β determines the balance between the impact
of recent and past gradients.
Intuitively, this (exponentially weighted) linear combination of the past gradients averages out
some of the oscillation observed for gradient descent in Figure 10.3 in the first coordinate, and thus
“smoothes” the path. The partial derivative in the second coordinate, along which the objective

function is very flat, does not change much from one iterate to the next. Thus, its proportion in
the update is strengthened through the addition of momentum. This is observed in Figure 10.3.
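The effect can be reproduced with a few lines of code. The following sketch compares 20 steps of gradient descent and of the heavy ball update (10.4.5) on (10.4.1) with the parameters of Figure 10.3 (λ1 = 12, λ2 = 1, h = α = h∗, β = 1/3); the initialization is an illustrative choice.

import numpy as np

lam1, lam2 = 12.0, 1.0
D = np.diag([lam1, lam2])
h = 2.0 / (lam1 + lam2)                    # h* from (10.4.2)
beta = 1.0 / 3.0

w_gd = np.array([-1.0, 1.0])
w_hb, w_hb_prev = w_gd.copy(), w_gd.copy()
for k in range(20):
    w_gd = w_gd - h * (D @ w_gd)                                           # (10.1.2)
    w_hb, w_hb_prev = w_hb - h * (D @ w_hb) + beta * (w_hb - w_hb_prev), w_hb   # (10.4.5)
print(np.linalg.norm(w_gd), np.linalg.norm(w_hb))   # the heavy ball iterate is much closer to 0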
As mentioned earlier, the heavy ball method can be interpreted as a discretization of the dy-
namics of a ball rolling down the valley of the loss landscape. If the ball has positive mass, i.e. is
“heavy”, its momentum prevents the ball from bouncing back and forth too strongly. The following
remark further elucidates this connection.
Remark 10.24. As pointed out, e.g., in [181, 183], for suitable choices of α and β, (10.4.5) can be
interpreted as a discretization of the second-order ODE

mw′′ (t) = −∇f (w(t)) − rw′ (t). (10.4.7)

This equation describes the movement of a point mass m under influence of the force field −∇f (w(t));
the term −w′ (t), which points in the negative direction of the current velocity, corresponds to fric-
tion, and r > 0 is the friction coefficient. The discretization
m (wk+1 − 2wk + wk−1 )/h2 = −∇f (wk ) − r (wk+1 − wk )/h
then leads to
wk+1 = wk − (h2 /(m − rh)) ∇f (wk ) + (m/(m − rh)) (wk − wk−1 ), (10.4.8)
with α = h2 /(m − rh) and β = m/(m − rh), and thus to (10.4.5), [183].


Letting m = 0 in (10.4.8), we recover the gradient descent update (10.1.2). Hence, the positive
mass corresponds to the momentum term. Similarly, letting m = 0 in the continuous dynamics
(10.4.7), we obtain the gradient flow (10.1.3). The key difference between these equations is that
−∇f (w(t)) represents the velocity of w(t) in (10.1.3), whereas in (10.4.7), up to the friction term,
it corresponds to an acceleration.
Let us sketch an argument to show that (10.4.5) improves the convergence over plain gradient
descent for the objective function (10.4.1). Denoting wk = (wk,1 , wk,2 )⊤ ∈ R2 , we obtain from
(10.4.5) and the definition of f in (10.4.1)
(wk+1,j , wk,j )⊤ = [ 1 + β − αλj , −β ; 1 , 0 ] (wk,j , wk−1,j )⊤ (10.4.9)

for j ∈ {1, 2} and k ≥ 1. The smaller the modulus of the eigenvalues of the matrix in (10.4.9),
the faster the convergence towards the minimizer w∗,j = 0 ∈ R for arbitrary initialization. Hence,
the goal is to choose α > 0 and β ∈ (0, 1) such that the maximal modulus of the eigenvalues
of the matrix for j ∈ {1, 2} is possibly small. We omit the details of this calculation (also see
[181, 165, 70]), but mention that this is obtained for
α = (2/(√λ1 + √λ2 ))2 and β = ((√λ1 − √λ2 )/(√λ1 + √λ2 ))2 .
With these choices, the modulus of the maximal eigenvalue is bounded by
√β = (√κ − 1)/(√κ + 1) ∈ [0, 1),

where again κ = λ1 /λ2 . Due to (10.4.9), this expression gives a rate of convergence for (10.4.5).
Contrary to gradient descent, see (10.4.3), for this problem the heavy ball method achieves a
convergence rate that only depends on the square root of the condition number κ. This explains
the improved performance observed in Figure 10.3.

10.4.2 Nesterov acceleration


Nesterov’s accelerated gradient method (NAG) [160, 159], is a refinement of the heavy ball method.
After initializing v 0 , w0 ∈ Rn , the update is formulated as the two-step process
v k+1 = wk − α∇f (wk ) (10.4.10a)
wk+1 = v k+1 + β(v k+1 − v k ), (10.4.10b)
where again α > 0 and β ∈ (0, 1) are hyperparameters. Substituting the second line into the first
we get
v k+1 = v k − α∇f (wk ) + β(v k − v k−1 ).
Comparing with the heavy ball method (10.4.5), the key difference is that the gradient is not
evaluated at the current position v k , but instead at the point wk = v k + β(v k − v k−1 ), which can
be interpreted as an estimate of the position at the next iteration.
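A direct implementation of the two-step update (10.4.10) reads as follows; the quadratic objective and the choices α = 1/L, β = (1 − τ)/(1 + τ) with τ = √(µ/L) are made for illustration.

import numpy as np

L_, mu = 100.0, 1.0
D = np.diag([L_, mu])
grad = lambda w: D @ w                    # gradient of f(w) = 0.5 * w^T D w
tau = np.sqrt(mu / L_)
alpha, beta = 1.0 / L_, (1 - tau) / (1 + tau)

v = w = np.array([1.0, 1.0])
for k in range(100):
    v_new = w - alpha * grad(w)           # (10.4.10a)
    w = v_new + beta * (v_new - v)        # (10.4.10b)
    v = v_new
print(np.linalg.norm(v))                  # much smaller than for plain gradient descent with h = 1/L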
We next discuss the convergence for L-smooth and µ-strongly convex objective functions f . It
turns out that these conditions are not sufficient in order for the heavy ball method (10.4.5) to converge, and one can construct counterexamples [133]. This is in contrast to NAG, as the next theorem shows. To give the analysis, it is convenient to first rewrite (10.4.10) as a three-sequence update: Let τ = √(µ/L), α = 1/L, and β = (1 − τ )/(1 + τ ). After initializing w0 , v 0 ∈ Rn , (10.4.10)
can also be written as u0 = ((1 + τ )w0 − v 0 )/τ and for k ∈ N0
wk = (τ /(1 + τ )) uk + (1/(1 + τ )) v k (10.4.11a)
v k+1 = wk − (1/L) ∇f (wk ) (10.4.11b)
uk+1 = uk + τ · (wk − uk ) − (τ /µ) ∇f (wk ), (10.4.11c)
see Exercise 10.36.
The proof of the following theorem proceeds along the lines of [231, 241].

Theorem 10.25. Let n ∈ N and L, µ > 0. Let f : Rn → R be L-smooth and µ-strongly convex. Further, let v 0 , w0 ∈ Rn and let τ = √(µ/L). Let (wk , v k+1 , uk+1 )∞k=0 ⊆ Rn be defined by (10.4.11), and let w∗ be the unique minimizer of f .
Then, for all k ∈ N0 , it holds that
∥uk − w∗ ∥2 ≤ (2/µ) (1 − √(µ/L))k ( f (v 0 ) − f (w∗ ) + (µ/2)∥u0 − w∗ ∥2 ), (10.4.12a)
f (v k ) − f (w∗ ) ≤ (1 − √(µ/L))k ( f (v 0 ) − f (w∗ ) + (µ/2)∥u0 − w∗ ∥2 ). (10.4.12b)

Proof. Define
ek := f (v k ) − f (w∗ ) + (µ/2)∥uk − w∗ ∥2 . (10.4.13)
To show (10.4.12), it suffices to prove with c = 1 − τ that ek+1 ≤ cek for all k ∈ N0 .
We start with the last term in (10.4.13). By (10.4.11c)
(µ/2)∥uk+1 − w∗ ∥2 − (µ/2)∥uk − w∗ ∥2
= (µ/2)∥uk+1 − uk + uk − w∗ ∥2 − (µ/2)∥uk − w∗ ∥2
= (µ/2)∥uk+1 − uk ∥2 + (µ/2) · 2 ⟨τ · (wk − uk ) − (τ /µ)∇f (wk ), uk − w∗ ⟩
= (µ/2)∥uk+1 − uk ∥2 + τ ⟨∇f (wk ), w∗ − uk ⟩ − τ µ ⟨wk − uk , w∗ − uk ⟩ . (10.4.14)
From (10.4.11a) we have τ uk = (1 + τ )wk − v k so that

τ · (wk − uk ) = τ wk − (1 + τ )wk + v k = v k − wk (10.4.15)

and using µ-strong convexity (10.1.15), we get
τ ⟨∇f (wk ), w∗ − uk ⟩ = τ ⟨∇f (wk ), wk − uk ⟩ + τ ⟨∇f (wk ), w∗ − wk ⟩
≤ ⟨∇f (wk ), v k − wk ⟩ − τ · (f (wk ) − f (w∗ )) − (τ µ/2)∥wk − w∗ ∥2 .
Moreover,
− (τ µ/2)∥wk − w∗ ∥2 − τ µ ⟨wk − uk , w∗ − uk ⟩
= −(τ µ/2) ( ∥wk − w∗ ∥2 − 2 ⟨wk − uk , wk − w∗ ⟩ + 2 ⟨wk − uk , wk − uk ⟩ )
= −(τ µ/2) ( ∥uk − w∗ ∥2 + ∥wk − uk ∥2 ).
Thus, (10.4.14) is bounded by
(µ/2)∥uk+1 − uk ∥2 + ⟨∇f (wk ), v k − wk ⟩ − τ · (f (wk ) − f (w∗ )) − (τ µ/2)∥uk − w∗ ∥2 − (τ µ/2)∥wk − uk ∥2 ,
which gives with c = 1 − τ
(µ/2)∥uk+1 − w∗ ∥2 ≤ c (µ/2)∥uk − w∗ ∥2 + (µ/2)∥uk+1 − uk ∥2
+ ⟨∇f (wk ), v k − wk ⟩ − τ · (f (wk ) − f (w∗ )) − (τ µ/2)∥wk − uk ∥2 . (10.4.16)
To bound the first term in (10.4.13), we use L-smoothness (10.1.4) and (10.4.11b):
f (v k+1 ) − f (wk ) ≤ ⟨∇f (wk ), v k+1 − wk ⟩ + (L/2)∥v k+1 − wk ∥2 = −(1/(2L))∥∇f (wk )∥2 ,
so that
f (v k+1 ) − f (w∗ ) − τ · (f (wk ) − f (w∗ )) ≤ (1 − τ )(f (wk ) − f (w∗ )) − (1/(2L))∥∇f (wk )∥2
= c · (f (v k ) − f (w∗ )) + c · (f (wk ) − f (v k )) − (1/(2L))∥∇f (wk )∥2 . (10.4.17)
Now, (10.4.16) and (10.4.17) imply
ek+1 ≤ cek + c · (f (wk ) − f (v k )) − (1/(2L))∥∇f (wk )∥2 + (µ/2)∥uk+1 − uk ∥2
+ ⟨∇f (wk ), v k − wk ⟩ − (τ µ/2)∥wk − uk ∥2 .
Since we wish to bound ek+1 by cek , we now show that all terms except cek on the right-hand side
of the inequality above sum up to a non-positive value. By (10.4.11c) and (10.4.15)

µ µ τ2
∥uk+1 − uk ∥2 = ∥v k − wk ∥2 − τ ⟨∇f (wk ), v k − wk ⟩ + ∥∇f (wk )∥2 .
2 2 2µ
Moreover, using µ-strong convexity

⟨∇f (wk ), v k − wk ⟩
 µ 
≤ τ ⟨∇f (wk ), v k − wk ⟩ + (1 − τ ) f (v k ) − f (wk ) − ∥v k − wk ∥2 .
2
Thus, we arrive at
ek+1 ≤ cek + c · (f (wk ) − f (v k )) − (1/(2L))∥∇f (wk )∥2 + (µ/2)∥v k − wk ∥2
− τ ⟨∇f (wk ), v k − wk ⟩ + (τ 2 /(2µ))∥∇f (wk )∥2 + τ ⟨∇f (wk ), v k − wk ⟩
+ c · (f (v k ) − f (wk )) − c (µ/2)∥v k − wk ∥2 − (τ µ/2)∥wk − uk ∥2
= cek + (τ 2 /(2µ) − 1/(2L))∥∇f (wk )∥2 + (µ/2)(τ − 1/τ )∥wk − v k ∥2
≤ cek ,
where we used once more (10.4.15), and the fact that τ 2 /(2µ) − 1/(2L) = 0 and τ − 1/τ ≤ 0 since τ = √(µ/L) ∈ (0, 1].

Comparing the result for gradient descent (10.1.16) with NAG (10.4.12), the improvement lies
in the convergence rate, which is 1−κ−1 for gradient descent (also see Remark 10.15), and 1−κ−1/2
for NAG, where κ = L/µ. In contrast to gradient descent, for NAG the convergence depends only
on the square root of the condition number κ. For ill-conditioned problems where κ is large, we
therefore expect much better performance for accelerated methods.
Finally, we mention that NAG also achieves faster convergence in the case of L-smooth and
convex objective functions. While the error decays like O(k −1 ) for gradient descent, see Theorem
10.11, for NAG one obtains convergence O(k −2 ), see [160, 158, 241].

10.5 Other methods
In recent years, a multitude of first order (gradient descent) methods has been proposed and studied for the training of neural networks. They typically employ (a subset of) three critical strategies: mini-batches, acceleration, and adaptive step sizes. The concepts of mini-batches and acceleration have been covered in the previous sections, and we will touch upon adaptive learning rates in the present one. Specifically, we present three algorithms—AdaGrad, RMSProp, and Adam—which have been among the most influential in the field, and serve to explore the main ideas. An intuitive overview of first order methods can also be found in [194], which discusses additional variants that are omitted here. Moreover, in practice, various other techniques and heuristics such as batch normalization, gradient clipping, data augmentation, regularization and dropout, early stopping, specific weight initializations etc. are used. We do not discuss them here, and refer to [22] or to [67, Chapter 11] for a practitioner’s guide.
After initializing m0 = 0 ∈ Rn , v 0 = 0 ∈ Rn , and w0 ∈ Rn , all methods discussed below are
special cases of the update

mk+1 = β1 mk + β2 ∇f (wk ) (10.5.1a)
v k+1 = γ1 v k + γ2 ∇f (wk ) ⊙ ∇f (wk ) (10.5.1b)
wk+1 = wk − αk mk+1 ⊘ √(v k+1 + ε) (10.5.1c)
for k ∈ N0 , and certain hyperparameters αk , β1 , β2 , γ1 , γ2 , and ε. Here ⊙ and ⊘ denote the componentwise multiplication and division, respectively, and √(v k+1 + ε) is understood as the vector (√(vk+1,i + ε))i . We will give some default values for those hyperparameters in the following, but
mention that careful problem dependent tuning can enhance performance. Equation (10.5.1a)
corresponds to heavy ball momentum if β1 > 0. If β1 = 0, then mk+1 is simply a multiple of
the current gradient. Equation (10.5.1b) defines a weight vector v k+1 that is used to set the
componentwise learning rate in the update of the parameter in (10.5.1c). These types of methods
are often applied using mini-batches, see Section 10.2. For simplicity we present them with the full
gradients.
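In code, the generic update (10.5.1) can be written as follows; this is a sketch, and the parameter values in the usage lines are the AdaGrad defaults discussed below.

import numpy as np

def generic_step(w, m, v, grad, alpha, beta1, beta2, gamma1, gamma2, eps):
    # one step of (10.5.1); all products and divisions are componentwise
    m = beta1 * m + beta2 * grad                 # (10.5.1a)
    v = gamma1 * v + gamma2 * grad * grad        # (10.5.1b)
    w = w - alpha * m / np.sqrt(v + eps)         # (10.5.1c)
    return w, m, v

# AdaGrad corresponds to beta1 = 0, beta2 = gamma1 = gamma2 = 1 and constant alpha
w, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for _ in range(100):
    w, m, v = generic_step(w, m, v, 2 * w, 0.01, 0.0, 1.0, 1.0, 1.0, 1e-8)   # grad of ||w||^2
print(w)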

10.5.1 AdaGrad
In Section 10.2 we argued, that for stochastic methods the learning rate should decrease in order
to get convergence. The choice of how to decrease the learning rate can have significant impact
in practice. AdaGrad [57], which stands for adaptive gradient algorithm, provides a method to
dynamically adjust learning rates during optimization. Moreover, it does so by using individual
learning rates for each component.
AdaGrad corresponds to (10.5.1) with

β1 = 0, γ1 = β2 = γ2 = 1, αk = α for all k ∈ N0 .

This leaves the hyperparameters ε > 0 and α > 0. The constant ε > 0 is chosen small but positive
to avoid division by zero in (10.5.1c). Possible default values are α = 0.01 and ε = 10−8 . The
AdaGrad update then reads

v k+1 = v k + ∇f (wk ) ⊙ ∇f (wk )
wk+1 = wk − α∇f (wk ) ⊘ √(v k+1 + ε).

Due to
v k+1 = Σ_{j=0}^{k} ∇f (wj ) ⊙ ∇f (wj ), (10.5.2)

the algorithm scales the gradient ∇f (wk ) in the update component-wise by the inverse square root
of the sum over all past squared gradients plus ε. Note that the scaling factor (vk+1,i + ε)−1/2 for
component i will be large, if the previous gradients for that component were small, and vice versa.
In the words of the authors of [57]: “our procedures give frequently occurring features very low
learning rates and infrequent features high learning rates.”
Remark 10.26. A benefit of the componentwise scaling can be observed for the ill-conditioned
objective function in (10.4.1). Since in this case ∇f (wj ) = (λ1 wj,1 , λ2 wj,2 )⊤ for each j = 1, . . . , k,
setting ε = 0 AdaGrad performs the update
wk+1 = wk − α ( wk,1 (Σ_{j=0}^{k} w2j,1 )−1/2 , wk,2 (Σ_{j=0}^{k} w2j,2 )−1/2 )⊤ .


Note how the λ1 and λ2 factors in the update have vanished due to the division by √v k+1 . This
makes the method invariant to a componentwise rescaling of the gradient, and results in a more
direct path towards the minimizer.

10.5.2 RMSProp
The sum of past squared gradients can increase rapidly, leading to a significant reduction in learning
rates when training neural networks with AdaGrad. This often results in slow convergence, see
for example [242]. RMSProp [90] seeks to rectify this by adjusting the learning rates using an
exponentially weighted average of past gradients.
RMSProp corresponds to (10.5.1) with

β1 = 0, β2 = 1, γ2 = 1 − γ1 ∈ (0, 1), αk = α for all k ∈ N0 ,

effectively leaving the hyperparameters ε > 0, γ1 ∈ (0, 1) and α > 0. Typically, recommended
default values are ε = 10−8 , α = 0.01 and γ1 = 0.9. The algorithm is given through

v k+1 = γ1 v k + (1 − γ1 )∇f (wk ) ⊙ ∇f (wk ) (10.5.3a)
wk+1 = wk − α∇f (wk ) ⊘ √(v k+1 + ε). (10.5.3b)

Note that
v k+1 = (1 − γ1 ) Σ_{j=0}^{k} γ1j ∇f (wk−j ) ⊙ ∇f (wk−j ),

so that, contrary to AdaGrad (10.5.2), the influence of gradient ∇f (wk−j ) on the weight v k+1
decays exponentially in j.

10.5.3 Adam
Adam [116], short for adaptive moment estimation, combines adaptive learning rates based on
exponentially weighted averages as in RMSProp, with heavy ball momentum. Contrary to AdaGrad
and RMSProp it thus uses a value β1 > 0.
More precisely, Adam corresponds to (10.5.1) with
β2 = 1 − β1 ∈ (0, 1), γ2 = 1 − γ1 ∈ (0, 1), αk = α √(1 − γ1k+1 )/(1 − β1k+1 )

for all k ∈ N0 , for some α > 0. The default values for the remaining parameters recommended in
[116] are ε = 10−8 , α = 0.001, β1 = 0.9 and γ1 = 0.999. The update can be formulated as
mk+1 = β1 mk + (1 − β1 )∇f (wk ), m̂k+1 = mk+1 /(1 − β1k+1 ) (10.5.4a)
v k+1 = γ1 v k + (1 − γ1 )∇f (wk ) ⊙ ∇f (wk ), v̂ k+1 = v k+1 /(1 − γ1k+1 ) (10.5.4b)
wk+1 = wk − α m̂k+1 ⊘ √(v̂ k+1 + ε). (10.5.4c)

Note that mk+1 equals
mk+1 = (1 − β1 ) Σ_{j=0}^{k} β1j ∇f (wk−j )

and thus corresponds to heavy ball style momentum with momentum parameter β = β1 , see (10.4.6).
The normalized version m̂k+1 is introduced to account for the bias towards 0, stemming from the
initialization m0 = 0. The weight-vector v k+1 in (10.5.4b) is analogous to the exponentially
weighted average of RMSProp in (10.5.3a), and the normalization again serves to counter the bias
from v 0 = 0.
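A direct transcription of (10.5.4) into code is given below; the default values match those stated above, and the function name is ours.

import numpy as np

def adam_step(w, m, v, grad, k, alpha=0.001, beta1=0.9, gamma1=0.999, eps=1e-8):
    # one Adam update (10.5.4); k is the iteration index starting at 0
    m = beta1 * m + (1 - beta1) * grad
    v = gamma1 * v + (1 - gamma1) * grad * grad
    m_hat = m / (1 - beta1 ** (k + 1))           # bias-corrected first moment
    v_hat = v / (1 - gamma1 ** (k + 1))          # bias-corrected second moment
    w = w - alpha * m_hat / np.sqrt(v_hat + eps)
    return w, m, v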
It should be noted that there exist examples of convex functions for which Adam does not
converge to a minimizer, see [190]. The authors of [190] propose a modification termed AMSGrad,
which avoids this issue and their analysis also applies to RMSProp. Nonetheless, Adam remains a
highly popular and successful algorithm for the training of neural networks. We also mention that
the proof of convergence in the stochastic setting requires k-dependent decreasing learning rates
such as α = O(k −1/2 ) in (10.5.3b) and (10.5.4c).

Bibliography and further reading


Section 10.1 on gradient descent is based on standard textbooks such as [20, 25, 163] and especially
[159]. These are also good references for further reading on convex optimization. In particular
Theorem 10.11 and the Lemmas leading up to it closely follow Nesterov’s arguments in [159]. Con-
vergence proofs under the PL inequality can be found in [114]. Stochastic gradient descent discussed
in Section 10.2 originally dates back to Robbins and Monro [191]. The first non-asymptotic con-
vergence analysis for strongly convex objective functions was given in [154]. The proof presented
here is similar to [76] and in particular uses their choice of step size. A good overview of proofs
for (stochastic) gradient descent algorithms together with detailed references can be found in [65],

and for a textbook specifically on stochastic optimization also see [126]. The backpropagation al-
gorithm discussed in Section 10.3 was popularized by Rumelhart, Hinton and Williams [197]; for
further details on the historical development we refer to [202, Section 5.5], and for a more in-depth
discussion of the algorithm, see for instance [84]. The heavy ball method in Section 10.4 goes back
to Polyak [180]. To motivate the algorithm we proceed similar as in [70, 181, 183]. For the analysis
of Nesterov acceleration [160], we follow the Lyapunov type proofs given in [231, 241]. Finally,
for Section 10.5 on other algorithms, we refer to the original works that introduced AdaGrad [57],
RMSProp [90] and Adam [116]. A good overview of gradient descent methods popular for deep
learning can be found in [194]. Regarding the analysis of RMSProp and Adam, we refer to [190]
which gave an example of a convex function for which Adam does not converge, and provides a provably convergent modification of the algorithm. Convergence proofs (for variations of) AdaGrad and
Adam can also be found in [49].
For a general discussion and analysis of optimization algorithms in machine learning see [23].
Details on implementations in Python can for example be found in [67], and for recommendations
and tricks regarding the implementation we also refer to [22, 129].

Exercises
Exercise 10.27. Let f ∈ C 1 (Rn ). Show that f is convex in the sense of Definition 10.8 if and only
if

f (w) + ⟨∇f (w), v − w⟩ ≤ f (v) for all w, v ∈ Rn .

Exercise 10.28. Find a function f : R → R that is L-smooth, satisfies the PL-inequality (10.1.19)
for some µ > 0, has a unique minimizer w∗ ∈ R, but is not convex and thus also not strongly
convex.

Exercise 10.29. Prove Theorem 10.17, i.e. show that L-smoothness and the PL-inequality (10.1.19)
yield linear convergence of f (wk ) → f (w∗ ) as k → ∞.

Definition 10.30. For convex f : Rn → R, g ∈ Rn is called a subgradient (or subdifferential) of


f at v if and only if
f (w) ≥ f (v) + ⟨g, w − v⟩ for all w ∈ Rn . (10.5.5)
The set of all subgradients of f at v is denoted by ∂f (v).

A subgradient always exists, i.e. ∂f (v) is necessarily nonempty. This statement is also known
under the name “Hyperplane separation theorem”. Subgradients generalize the notion of gradients
for convex functions, since for any convex continuously differentiable f , (10.5.5) is satisfied with
g = ∇f (v).

Exercise 10.31. Let f : Rn → R be convex and Lip(f ) ≤ L. Show that for any g ∈ ∂f (v) holds
∥g∥ ≤ L.

Exercise 10.32. Let f : Rn → R be convex, Lip(f ) ≤ L and suppose that w∗ is a minimizer of f .


Fix w0 ∈ Rd , and for k ∈ N0 define the subgradient descent update

wk+1 := wk − hk g k ,

where g k is an arbitrary fixed element of ∂f (wk ). Show that


min_{i≤k} f (wi ) − f (w∗ ) ≤ ( ∥w0 − w∗ ∥2 + L2 Σ_{i=1}^{k} h2i ) / ( 2 Σ_{i=1}^{k} hi ).

Hint: Start by recursively expanding ∥wk − w∗ ∥2 = · · · , and then apply the property of the
subgradient.

Exercise 10.33. Consider the setting of Exercise 10.32. Determine step sizes h1 , . . . , hk (which
may depend on k, i.e. hk,1 , . . . , hk,k ) such that for any arbitrarily small δ > 0

min_{i≤k} f (wi ) − f (w∗ ) = O(k −1/2+δ ) as k → ∞.

Exercise 10.34. Let A ∈ Rn×n be symmetric positive semidefinite, b ∈ Rn and c ∈ R. Denote
the eigenvalues of A by λ1 ≥ · · · ≥ λn ≥ 0. Show that the objective function
f (w) := (1/2) w⊤ Aw + b⊤ w + c (10.5.6)
is convex and λ1 -smooth. Moreover, if λn > 0, then f is λn -strongly convex. Show that these
values are optimal in the sense that f is neither L-smooth nor µ-strongly convex if L < λ1 and
µ > λn .
Hint: Note that L-smoothness and µ-strong convexity are invariant under shifts and the addition
of constants. That is, for every α ∈ R and β ∈ Rn , f˜(w) := α + f (w + β) is L-smooth or µ-strongly
convex if and only if f is. It thus suffices to consider w⊤ Aw/2.

Exercise 10.35. Let f be as in Exercise 10.34. Show that gradient descent converges for arbitrary
initialization w0 ∈ Rn , if and only if

max_{j=1,...,n} |1 − hλj | < 1.

Show that argminh>0 maxj=1,...,n |1 − hλj | = 2/(λ1 + λn ) and conclude that the convergence will be
slow if f is ill-conditioned, i.e. if λ1 /λn ≫ 1.
Hint: Assume first that b = 0 ∈ Rn and c = 0 ∈ R in (10.5.6), and use the singular value
decomposition A = U ⊤ diag(λ1 , . . . , λn )U .
Exercise 10.36. Show that (10.4.10) can equivalently be written as (10.4.11) with τ = √(µ/L), α = 1/L, β = (1 − τ )/(1 + τ ) and the initialization u0 = ((1 + τ )w0 − v 0 )/τ .

Chapter 11

Wide neural networks and the neural tangent kernel

In this chapter we explore the dynamics of training neural networks of large width. Throughout
we focus on the situation where we have data pairs

(xi , yi ) ∈ Rd × R i ∈ {1, . . . , m}, (11.0.1a)

and wish to train a neural network Φ(x, w) depending on the input x ∈ Rd and the parameters
w ∈ Rn , by minimizing the square loss objective defined as
f (w) := Σ_{i=1}^{m} (Φ(xi , w) − yi )2 , (11.0.1b)

which is a multiple of the empirical risk R̂S (Φ) in (1.2.3) for the sample S = (xi , yi )_{i=1}^{m} and the
square-loss. We exclusively focus on gradient descent with a constant step size h, which yields a
sequence of parameters (wk )k∈N . We aim to understand the evolution of Φ(x, wk ) as k progresses.
For linear mappings w 7→ Φ(x, w), the objective function (11.0.1b) is convex. As established in
the previous chapter, gradient descent then finds a global minimizer. For typical neural network
architectures, w 7→ Φ(x, w) is not linear, and such a statement is in general not true.
Recent research has shown that neural network behavior tends to linearize in the parameters
as network width increases [106]. This allows to transfer some of the results and techniques from
the linear case to the training of neural networks. We start this chapter in Sections 11.1 and 11.2
by recalling (kernel) least-squares methods, which describe linear (in w) models. Following [131],
the subsequent sections explore why in the infinite width limit neural networks exhibit linear-like
behavior. In Section 11.5.2 we formally introduce the linearization of w 7→ Φ(x, w). Section 11.4
presents an abstract result showing convergence of gradient descent, under the condition that Φ
does not deviate too much from its linearization. In Sections 11.5 and 11.6, we then detail the
implications for wide neural networks for two (slightly) different architectures. In particular, we
will prove that gradient descent can find global minimizers when applied to (11.0.1b) for networks
of very large width. We emphasize that this analysis treats the case of strong overparametrization,
specifically the case of increasing the network width while keeping the number of data points m
fixed.

11.1 Linear least-squares
Arguably one of the simplest machine learning algorithms is linear least-squares regression. Given
data (11.0.1a), linear regression tries to fit a linear function Φ(x, w) := x⊤ w in terms of w by
minimizing f (w) in (11.0.1b). With
A = (x1 , . . . , xm )⊤ ∈ Rm×d and y = (y1 , . . . , ym )⊤ ∈ Rm (11.1.1)
it holds
f (w) = ∥Aw − y∥2 . (11.1.2)
Remark 11.1. More generally, the ansatz Φ(x, (w, b)) := w⊤ x + b corresponds to
Φ(x, (w, b)) = (1, x⊤ ) (b, w⊤ )⊤ .
Therefore, additionally allowing for a bias can be treated analogously.
The model Φ(x, w) = x⊤ w is linear in both x and w. In particular, w 7→ f (w) is a convex
function by Exercise 10.34, and we may apply the convergence results of Chapter 10 when using
gradient based algorithms. If A is invertible, then f has a unique minimizer given by w∗ = A−1 y. If
rank(A) = d, then f is strongly convex by Exercise 10.34, and there still exists a unique minimizer.
If however rank(A) < d, then ker(A) ̸= {0} and there exist infinitely many minimizers of f . To
ensure uniqueness, we look for the minimum norm solution (or minimum 2-norm solution)
w∗ := argmin{w∈Rd | f (w)≤f (v) ∀v∈Rd } ∥w∥. (11.1.3)
The following proposition establishes the uniqueness of w∗ and demonstrates that it can be repre-
sented as a superposition of the (xi )m
i=1 .

Proposition 11.2. Let A ∈ Rm×d and y ∈ Rm be as in (11.1.1). There exists a unique minimum
2-norm solution of (11.1.2). Denoting H̃ := span{x1 , . . . , xm } ⊆ Rd , it is the unique element

w∗ = argminw̃∈H̃ f (w̃) ∈ H̃. (11.1.4)

Proof. We start with existence and uniqueness. Let C ⊆ Rm be the space spanned by the columns
of A. Then C is closed and convex, and therefore y ∗ = argminỹ∈C ∥y − ỹ∥ exists and is unique
(this is a fundamental property of Hilbert spaces, see Theorem B.14). In particular, the set M =
{w ∈ Rd | Aw = y ∗ } ⊆ Rd of minimizers of f is not empty. Clearly M is also closed and convex.
By the same argument as before, w∗ = argminw∗ ∈M ∥w∗ ∥ exists and is unique.
It remains to show (11.1.4). Denote by w∗ the minimum norm solution and decompose w∗ =
w̃ + ŵ with w̃ ∈ H̃ and ŵ ∈ H̃ ⊥ . We have Aw∗ = Aw̃ and ∥w∗ ∥2 = ∥w̃∥2 + ∥ŵ∥2 . Since w∗
is the minimal norm solution it must hold ŵ = 0. Thus w∗ ∈ H̃. Finally assume there exists a
minimizer v of f in H̃ different from w∗ . Then 0 ̸= w∗ − v ∈ H̃, and since H̃ is spanned by the
rows of A we have A(w∗ − v) ̸= 0. Thus y ∗ = Aw∗ ̸= Av, which contradicts that v minimizes
f.

The condition of minimizing the 2-norm is a form of regularization. Interestingly, gradient
descent converges to the minimum norm solution for the quadratic objective (11.1.2), as long as w0
is initialized within H̃ = span{x1 , . . . , xm } (e.g. w0 = 0). Therefore, it does not find an “arbitrary”
minimizer but implicitly regularizes the problem in this sense. In the following smax (A) denotes
the maximal singular value of A.

Theorem 11.3. Let A ∈ Rm×d be as in (11.1.1), let w0 = w̃0 + ŵ0 where w̃0 ∈ H̃ and ŵ0 ∈ H̃ ⊥ .
Fix h ∈ (0, 1/(2smax (A)2 )) and set

wk+1 := wk − h∇f (wk ) for all k ∈ N (11.1.5)

with f in (11.1.2). Then


lim wk = w∗ + ŵ0 .
k→∞

We sketch the argument in case w0 ∈ H̃, and leave the full proof to the reader, see Exercise
11.32. Note that H̃ is the space spanned by the rows of A (or the columns of A⊤ ). The gradient
of the objective function equals
∇f (w) = 2A⊤ (Aw − y).
Therefore, if w0 ∈ H̃, then the iterates of gradient descent never leave the subspace H̃. By Exercise
10.34 and Theorem 10.11, for small enough step size, it holds f (wk ) → 0. By Proposition 11.2
there only exists one minimizer in H̃, corresponding to the minimum norm solution. Thus wk
converges to the minimal norm solution.
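The statement can be checked numerically. In the following sketch A has rank 2 < d = 3 (so f has infinitely many minimizers), and gradient descent started at w0 = 0 ∈ H̃ recovers the minimum norm solution, which for this A equals the pseudoinverse solution; the matrix and data are illustrative.

import numpy as np

A = np.array([[1.0, 0.0, 1.0], [0.0, 2.0, 1.0]])    # rank 2, d = 3
y = np.array([1.0, 2.0])
smax2 = np.linalg.norm(A, 2) ** 2
w, h = np.zeros(3), 0.4 / smax2                      # h < 1/(2 smax(A)^2)
for _ in range(20000):
    w = w - h * 2 * A.T @ (A @ w - y)                # gradient of ||Aw - y||^2
print(np.allclose(w, np.linalg.pinv(A) @ y))         # True: the minimum norm solution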

11.2 Kernel least-squares


Let again (xj , yj ) ∈ Rd × R, j = 1, . . . , m. In many applications linear models are too simplistic,
and are not able to capture the true relation between x and y. Kernel methods allow to overcome
this problem by introducing nonlinearity in x, but retaining linearity in the parameter w.
Let H be a Hilbert space with inner product ⟨·, ·⟩H , that is also referred to as the feature
space. For a (typically nonlinear) feature map ϕ : Rd → H, consider the model
Φ(x, w) = ⟨ϕ(x), w⟩H (11.2.1)
with w ∈ H. If H = Rn , the components of ϕ are referred to as features. With the objective
function
m
X 2
f (w) := ⟨ϕ(xj ), w⟩H − yj w ∈ H, (11.2.2)
j=1
we wish to determine a minimizer of f . To ensure uniqueness and regularize the problem, we again
consider the minimum H-norm solution
w∗ := argmin{w∈H | f (w)≤f (v) ∀v∈H} ∥w∥H .

As we will see below, w∗ is well-defined. We will call Φ(x, w∗ ) = ⟨ϕ(x), w∗ ⟩H the kernel least
squares estimator. The nonlinearity of the feature map allows for more expressive models x 7→
Φ(x, w) capable of capturing more complicated structures beyond linearity in the data.

147
Remark 11.4 (Gradient descent). Let H = Rn be equipped with the Euclidean inner product. Con-
sider the sequence (wk )k∈N0 ⊆ Rn generated by gradient descent to minimize (11.2.2). Assuming
sufficiently small step size, by Theorem 11.3 for x ∈ Rd

lim Φ(x, wk ) = ⟨ϕ(x), w∗ ⟩ + ⟨ϕ(x), ŵ0 ⟩ . (11.2.3)


k→∞

Here, ŵ0 ∈ Rn denotes the orthogonal projection of w0 ∈ Rn onto H̃ ⊥ where H̃ := span{ϕ(x1 ), . . . , ϕ(xm )}.
Gradient descent thus yields the kernel least squares estimator plus ⟨ϕ(x), ŵ0 ⟩. Notably, on the
set
{x ∈ Rd | ϕ(x) ∈ span{ϕ(x1 ), . . . , ϕ(xm )}}, (11.2.4)
(11.2.3) thus coincides with the kernel least squares estimator independent of the initialization w0 .

11.2.1 Examples
To motivate the concept of feature maps consider the following example from [155].

Example 11.5. Let xi ∈ R2 with associated labels yi ∈ {−1, 1} for i = 1, . . . , m. The goal is to
find some model Φ(·, w) : R2 → R, for which

sign(Φ(x, w)) (11.2.5)

predicts the label y of x. For a linear (in x) model

Φ(x, (w, b)) = x⊤ w + b,

the decision boundary of (11.2.5) equals {x ∈ R2 | x⊤ w + b = 0} in R2 . Hence, by adjusting w


and b, (11.2.5) can separate data by affine hyperplanes in R2 . Consider two datasets represented
by light blue squares for +1 and red circles for −1 labels:
x2 x2

x1 x1

dataset 1 dataset 2

The first dataset is separable by an affine hyperplane as depicted by the dashed line. Thus a linear
model is capable of correctly classifying all datapoints. For the second dataset this is not possible.
To enhance model expressivity, introduce a feature map ϕ : R2 → R6 via

ϕ(x) = (1, x1 , x2 , x1 x2 , x21 , x22 )⊤ ∈ R6 for all x ∈ R2 . (11.2.6)

For w ∈ R6 , this allows Φ(x) = w⊤ ϕ(x) to represent arbitrary polynomials of degree 2. With this
kernel approach, the decision boundary of (11.2.5) becomes the set of all hyperplanes in the feature
space passing through 0 ∈ R6 . Visualizing the last two features of the second dataset, we obtain

148
x22

x21

features 5 and 6 of dataset 2

Note how in the feature space R6 , the datapoints are again separated by such a hyperplane. Thus,
with the feature map in (11.2.6), the predictor (11.2.5) can perfectly classify all points also for the
second dataset.
In the above example we chose the feature space H = R6 . It is also possible to work with
infinite dimensional feature spaces as the next example demonstrates.
Example 11.6. Let H = ℓ2 (N) be the space of square summable sequences and ϕ : Rd → ℓ2 (N)
some map. Fitting the corresponding model
X
Φ(x, w) = ⟨ϕ(x), w⟩ℓ2 = ϕi (x)wi
i∈N

to data (xi , yi )m
i=1 requires to minimize

m
!2
X X 
f (w) = ϕi (xj )wi − yj w ∈ ℓ2 (N).
j=1 i∈N

Hence we have to determine an infinite sequence of parameters (wi )i∈N .

11.2.2 Kernel trick


At first glance, computing a (minimal H-norm) minimizer w in the possibly infinite-dimensional
Hilbert space H seems infeasible. The so-called kernel trick allows to do this computation. To
explain it, we first revisit the foundational representer theorem.

Theorem 11.7 (Representer theorem). There is a unique minimum H-norm solution w∗ ∈ H of


(11.2.2). With H̃ := span{ϕ(x1 ), . . . , ϕ(xm )} it equals the unique element

w∗ = argminw̃∈H̃ f (w̃) ∈ H̃. (11.2.7)

Proof. Let w̃1 , . . . , w̃n be a basis of H̃. If H̃ = {0} the statement is trivial, so we assume
P 1 ≤ n ≤ m.
Let A = (⟨ϕ(xi ), w̃j ⟩)ij ∈ Rm×n . Every w̃ ∈ H̃ has a unique representation w̃ = nj=1 αj w̃j for
some α ∈ Rn . With this ansatz
m
X m X
X n 2
f (w̃) = (⟨ϕ(xi ), w̃⟩ − yi )2 = ⟨ϕ(xi ), w̃j ⟩ αj − yi = ∥Aα − y∥2 . (11.2.8)
i=1 i=1 j=1

149
Note that A : Rn → Rm is injective since for every α ∈ Rn \{0} holds nj=1 αj w̃j ∈ H̃ \ {0} and
P
D E
hence Aα = ( ϕ(xi ), nj=1 αj w̃j )m n
P
i=1 ̸= 0. Therefore, there exists a unique minimizer α ∈ R of
the right-hand side of (11.2.8), and thus there exists a unique minimizer w∗ ∈ H̃ in (11.2.7).
For arbitrary w ∈ H we wish to show f (w) ≥ f (w∗ ), so that w∗ minimizes f in H. Decompose
w = w̃ + ŵ with w̃ ∈ H̃ and ŵ ∈ H̃ ⊥ , i.e. ⟨ϕ(xj ), ŵ⟩H = 0 for all j = 1, . . . , m. Then, using that
w∗ minimizes f in H̃,
m
X m
X
f (w) = (⟨ϕ(xj ), w⟩H − yj )2 = (⟨ϕ(xj ), w̃⟩H − yj )2 = f (w̃) ≥ f (w∗ ).
j=1 j=1

Finally, let w ∈ H be any minimizer of f in H different from w∗ . It remains to show ∥w∥H >
∥w∗ ∥H . Decompose again w = w̃ + ŵ with w̃ ∈ H̃ and ŵ ∈ H̃ ⊥ . As above f (w) = f (w̃) and
thus w̃ is a minimizer of f . Uniqueness of w∗ in (11.2.7) implies w̃ = w∗ . Therefore ŵ ̸= 0 and
∥w∗ ∥2H < ∥w̃∥2H + ∥ŵ∥2H = ∥w∥2H .

Instead of looking for the minimum norm minimizer w∗ in the Hilbert space H, by Proposition
11.2 it suffices to determine the unique minimizer in the at most m-dimensional subspace H̃ spanned
by ϕ(x1 ), . . . , ϕ(xm ). This significantly simplifies the problem. To do so we first introduce the
notion of kernels.

Definition 11.8. A symmetric function K : Rd ×Rd → R is called a kernel if for any x1 , . . . , xm ∈


Rd the kernel matrix G = (K(xi , xj ))m
i,j=1 ∈ R
m×m is symmetric positive semidefinite.

Given a feature map ϕ : Rd → H, it is easy to check that

K(x, x′ ) := ϕ(x), ϕ(x′ ) H


for all x, x′ ∈ Rd ,

defines a kernel. The corresponding kernel matrix G ∈ Rm×m is given by

Gij = ⟨ϕ(xi ), ϕ(xj )⟩H = K(xi , xj ).


Pm
With the ansatz w = j=1 αj ϕ(xj ), minimizing the objective (11.2.2) in H̃ is equivalent to mini-
mizing
∥Gα − y∥2 , (11.2.9)
in α = (α1 , . . . , αm ) ∈ Rm .

Pm
Proposition 11.9. Let α ∈ Rm be any minimizer of (11.2.9). Then w∗ = j=1 αj ϕ(xj ) is the
unique minimum H-norm solution of (11.2.2).

Proposition 11.9, the proof of which is left as an exercise, suggests the following algorithm to
compute the kernel least squares estimator:

150
(i) compute the kernel matrix G = (K(xi , xj ))m
i,j=1 ,

(ii) determine a minimizer α ∈ Rm of ∥Gα − y∥,

(iii) evaluate Φ(x, w∗ ) via


m m
* +
X X
Φ(x, w∗ ) = ϕ(x), αj ϕ(xj ) = αj K(x, xj ). (11.2.10)
j=1 H j=1

Thus, minimizing (11.2.2) and expressing the kernel least squares estimator does neither require
explicit knowledge of the feature map ϕ nor of the minimum norm solution w∗ ∈ H. It is sufficient
to choose a kernel map K : Rd × Rd → R; this is known as the kernel trick. Given a kernel K, we
will therefore also refer to (11.2.10) as the kernel least squares estimator without specifying H or
ϕ.

Example 11.10. Common examples of kernels include the polynomial kernel

K(x, x′ ) = (x⊤ x′ + c)r c ≥ 0, r ∈ N,

the radial basis function (RBF) kernel

K(x, x′ ) = exp(−c∥x − x′ ∥2 ) c > 0,

and the Laplace kernel


K(x, x′ ) = exp(−c∥x − x′ ∥) c > 0.

Remark 11.11. If Ω ⊆ Rd is compact and K : Ω × Ω → R is a continuous kernel, then Mercer’s


theorem implies existence of a Hilbert space H and a feature map ϕ : Rd → H such that

K(x, x′ ) = ϕ(x), ϕ(x′ ) H


for all x, x′ ∈ Ω,

i.e. K is the corresponding kernel. See for instance [217, Thm. 4.49].

11.3 Tangent kernel


Consider again a general model Φ(x, w) with input x ∈ Rd and parameters w ∈ Rn . The goal
remains to minimize the square loss objective (11.0.1b) given the data (11.0.1a). If w 7→ Φ(x, w)
is not linear, then unlike in Sections 11.1 and 11.2, the objective function (11.0.1b) is in general
not convex, and most results on first order methods in Chapter 10 are not directly applicable.
We now simplify the situation by linearizing the model in w ∈ Rn around the initialization:
Fixing w0 ∈ Rn , let

Φlin (x, w) := Φ(x, w0 ) + ∇w Φ(x, w0 )⊤ (w − w0 ) for all w ∈ Rn , (11.3.1)

which is the first order Taylor approximation of Φ around the initial parameter w0 . Introduce the
notation
δi := Φ(xi , w0 ) − ∇w Φ(xi , w0 )⊤ w0 − yi for all i = 1, . . . , m. (11.3.2)

151
The square loss for the linearized model then reads
m
X
f lin (w) := (Φlin (xi , w) − yi )2
j=1
m
X
= (⟨∇w Φ(xi , w0 ), w⟩ + δi )2 , (11.3.3)
j=1

where ⟨·, ·⟩ stands for the Euclidean inner product in Rn . Comparing with (11.2.2), minimizing f lin
corresponds to a kernel least squares regression with feature map

ϕ(x) = ∇w Φ(x, w0 ) ∈ Rn .

The corresponding kernel is

K̂n (x, x′ ) = ∇w Φ(x, w0 ), ∇w Φ(x′ , w0 ) . (11.3.4)

We refer to K̂n as the empirical tangent kernel, as it arises from the first order Taylor approxima-
tion (the tangent) of the original model Φ around initialization w0 . Note that the kernel depends
on the choice of w0 . As explained in Remark 11.4, training Φlin with gradient descent yields the
kernel least-squares estimator with kernel K̂n plus an additional term depending on w0 .
Of course the linearized model Φlin only captures the behaviour of Φ for parameters w that are
close to w0 . If we assume for the moment that during training of Φ, the parameters remain close to
initialization, then we can expect similar behaviour and performance of Φ and Φlin . Under certain
assumptions, we will see in the next sections that this is precisely what happens, when the width
of a neural network increases. Before we make this precise, in Section 11.4 we investigate whether
gradient descent applied to f (w) will find a global minimizer, under the assumption that Φlin is a
good approximation of Φ.

11.4 Convergence to global minimizers


Intuitively, if w 7→ Φ(x, w) is not linear but “close enough to its linearization” Φlin defined in
(11.3.1), we expect that the objective function is close to a convex function and gradient descent
can still find global minimizers of (11.0.1b). To motivate this, consider Figures 11.1 and 11.2
where we chose the number of training data m = 1 and the number of parameters n = 1. As
we can see, essentially we require the difference of Φ and Φlin and of their derivatives to be small
in a neighbourhood of w0 . The size of the neighbourhood crucially depends on the initial error
d
Φ(x1 , w0 ) − y1 , and on the size of the derivative dw Φ(x1 , w0 ).
For general m and n, we now make the required assumptions on Φ precise.
Assumption 11.12. Let Φ ∈ C 1 (Rd × Rn ) and w0 ∈ Rn . There exist constants r > 0, U , L < ∞
and 0 < λmin ≤ λmax < ∞ such that
(a) the kernel matrix of the empirical tangent kernel
m
(K̂n (xi , xj ))m
i,j=1 = ⟨∇w Φ(xi , w 0 ), ∇w Φ(xj , w 0 )⟩ i,j=1
∈ Rm×m (11.4.1)

is regular and its eigenvalues belong to [λmin , λmax ],

152
(Φ(x1 , w) − y1 )2

(Φlin (x1 , w) − y1 )2
y1
w0 Φlin (x1 , w) w0
Φ(x1 , w)

Figure 11.1: Graph of w 7→ Φ(x1 , w) and the linearization w 7→ Φlin (x1 , w) at the initial parameter
d
w0 , s.t. dw Φ(x1 , w0 ) ̸= 0. If Φ and Φlin are close, then there exists w s.t. Φ(x1 , w) = y1 (left). If
the derivatives are also close, the loss (Φ(x1 , w) − y1 )2 is nearly convex in w, and gradient descent
finds a global minimizer (right).

Φ(x1 , w) (Φ(x1 , w) − y1 )2
(Φlin (x1 , w) − y1 )2
y1
w0 Φlin (x1 , w) w0

Figure 11.2: Same as Figure 11.1. If Φ and Φlin are not close, there need not exist w such that
Φ(x1 , w) = y1 , and gradient descent need not converge to a global minimizer.

(b) for all i ∈ {1, . . . , m} holds

∥∇w Φ(xi , w)∥ ≤ U for all w ∈ Br (w0 )


(11.4.2)
∥∇w Φ(xi , w) − ∇w Φ(xi , v)∥ ≤ L∥w − v∥ for all w, v ∈ Br (w0 ),

(c) and √
λ2min 2 mU p
L≤ p and r= f (w0 ). (11.4.3)
12m3/2 U 2 f (w0 ) λmin

The regularity of the kernel matrix in Assumption 11.12 (a) is equivalent to (∇w Φ(xi , w0 )⊤ )m i=1 ∈
R m×n having full rank m ≤ n (in particular we have at least as many parameters n as training
d
data m). In the context of Figure 11.1, this means that dw Φ(x1 , w0 ) ̸= 0 and thus Φlin is a not a
constant function. This condition guarantees that there exists w such that Φlin (xi , w) = yi for all
i = 1, . . . , m. In other words, already the linearized model Φlin is sufficiently expressive to interpo-
late the data. Assumption 11.12 (b) formalizes the closeness condition of Φ and Φlin . Apart from
giving an upper bound on ∇w Φ(xi , w), it assumes w 7→ Φ(xi , w) to be L-smooth in a ball of radius
r > 0 around w0 , for all i = 1, . . . , m. This allows to control how far Φ(xi , w) and Φlin (xi , w) and
their derivatives may deviate from each other for w in this ball. Finally Assumption 11.12 (c) ties
together all constants, ensuring the full model to be sufficiently close to its linearization in a large
enough neighbourhood of w0 .
We are now ready to state the following theorem, which is a variant of [131, Thm. G.1]. In
Section 11.5 we will see that its main requirement—Assumption 11.12—is satisfied with high prob-
ability for certain (wide) neural networks.

153
Theorem 11.13. Let Assumption 11.12 be satisfied and fix a positive learning rate
1
h≤ . (11.4.4)
λmin + λmax
Set for all k ∈ N
wk+1 = wk − h∇f (wk ). (11.4.5)
It then holds for all k ∈ N

2 mU p
∥wk − w0 ∥ ≤ f (w0 ) (11.4.6a)
λmin
f (wk ) ≤ (1 − hλmin )2k f (w0 ). (11.4.6b)

Proof. In the following denote the error in prediction by

E(w) := (Φ(xi , w) − yi )m
i=1 ∈ R
m

such that
∇E(w) = (∇w Φ(xi , w))m
i=1 ∈ R
m×n

and with the empirical tangent kernel K̂n in Assumption 11.12

∇E(w)∇E(w)⊤ = (K̂n (xi , xj ))m


i,j=1 ∈ R
m×m
. (11.4.7)

Moreover, (11.4.2) gives


m
X
2
∥∇E(w)∥ ≤ ∥∇E(w)∥2F = ∥∇Φ(xi , w)∥2 ≤ mU 2 for all w ∈ Br (w0 ), (11.4.8a)
i=1

and similarly
m
X
∥∇E(w) − ∇E(v)∥2 ≤ ∥∇w Φ(xi , w) − ∇w Φ(xi , v)∥2
i=1
≤ mL2 ∥w − v∥2 for all w, v ∈ Br (w0 ). (11.4.8b)

Denote c := 1 − hλmin ∈ (0, 1). We use induction over k to prove


k−1 k−1
X √ X
∥wj+1 − wj ∥ ≤ h2 mU ∥E(w0 )∥ cj , (11.4.9a)
j=0 j=0

∥E(wk )∥2 ≤ ∥E(w0 )∥2 c2k , (11.4.9b)

for all k ∈ N0 and where an empty sum is understood as zero. Since ∞ j −1 = (hλ −1
P
j=0 c = (1−c) min )
2
and f (wk ) = ∥E(wk )∥ , these inequalities directly imply (11.4.6).
The case k = 0 is trivial. For the induction step, assume (11.4.9) holds for some k ∈ N0 .

154
Step 1. We show (11.4.9a) for k + 1. The induction assumption and (11.4.3) give
∞ √
√ X
j 2 mU p
∥wk − w0 ∥ ≤ 2h mU ∥E(w0 )∥ c = f (w0 ) = r, (11.4.10)
λmin
j=0

and thus wk ∈ Br (w0 ). Next


∇f (wk ) = ∇(E(wk )⊤ E(wk )) = 2∇E(wk )⊤ E(wk ). (11.4.11)
Using the iteration rule (11.4.5), the bound (11.4.8a), and (11.4.9b)
∥wk+1 − wk ∥ = 2h∥∇E(wk )⊤ E(wk )∥

≤ 2h mU ∥E(wk )∥

≤ 2h mU ∥E(w0 )∥ck .
This shows (11.4.9a) for k + 1. In particular, as in (11.4.10) we conclude
wk+1 , wk ∈ Br (w0 ). (11.4.12)
Step 2. We show (11.4.9b) for k + 1. Since E is continuously differentiable, there exists w̃k in
the convex hull of wk and wk+1 such that
E(wk+1 ) = E(wk ) + ∇E(w̃k )(wk+1 − wk ) = E(wk ) − h∇E(w̃k )∇f (wk ),
and thus by (11.4.11)
E(wk+1 ) = E(wk ) − 2h∇E(w̃k )∇E(wk )⊤ E(wk )
= I m − 2h∇E(w̃k )∇E(wk )⊤ E(wk ),


where I m ∈ Rm×m is the identity matrix. We wish to show that


∥I m − 2h∇E(w̃k )∇E(wk )⊤ ∥ ≤ c, (11.4.13)
which then implies (11.4.9b) for k + 1 and concludes the proof.
Using (11.4.8) and the fact that wk , w̃k ∈ Br (w0 ) by (11.4.12),
∥∇E(w̃k )∇E(wk )⊤ − ∇E(w0 )∇E(w0 )⊤ ∥
≤ ∥∇E(w̃k )∇E(wk )⊤ − ∇E(wk )∇E(wk )⊤ ∥
+ ∥∇E(w̃k )∇E(wk )⊤ − ∇E(wk )∇E(w0 )⊤ ∥
+ ∥∇E(w̃k )∇E(w0 )⊤ − ∇E(w0 )∇E(w0 )⊤ ∥
≤ 3mU Lr.

Since the eigenvalues of ∇E(w0 )∇E(w0 )⊤ belong to [λmin , λmax ] by (11.4.7) and Assumption 11.12
(a), as long as h ≤ (λmin + λmax )−1 we have
∥I m − 2h∇E(w̃k )∇E(wk )⊤ ∥ ≤ ∥I m − 2h∇E(w0 )∇E(w0 )⊤ ∥ + 6hmU Lr
≤ 1 − 2hλmin + 6hmU Lr
≤ 1 − 2h(λmin − 3mU Lr)
≤ 1 − hλmin = c,
where we have used the equality for r and the upper bound for L in (11.4.3).

155
Let us emphasize the main statement of Theorem 11.13. By (11.4.6b), full batch gradient
descent (11.4.5) achieves zero loss in the limit, i.e. the data is interpolated by the limiting model. In
particular, this yields convergence for the (possibly nonconvex) optimization problem of minimizing
f (w).

11.5 Training dynamics for LeCun initialization


In this and the next section we discuss the implications of Theorem 11.13 for wide neural networks.
For ease of presentation we focus on shallow networks with only one hidden layer, but stress that
similar considerations also hold for deep networks, see the bibliography section.

11.5.1 Architecture
Let Φ : Rd → R be a neural network of depth one and width n ∈ N of type

Φ(x, w) = v ⊤ σ(U x + b) + c. (11.5.1)

Here x ∈ Rd is the input, and U ∈ Rn×d , v ∈ Rn , b ∈ Rn and c ∈ R are the parameters which we
collect in the vector w = (U , b, v, c) ∈ Rn(d+2)+1 (with U suitably reshaped). For future reference
we note that
∇U Φ(x, w) = (v ⊙ σ ′ (U x + b))x⊤ ∈ Rn×d
∇b Φ(x, w) = v ⊙ σ ′ (U x + b) ∈ Rn
(11.5.2)
∇v Φ(x, w) = σ(U x + b) ∈ Rn
∇c Φ(x, w) = 1 ∈ R,
where ⊙ denotes the Hadamard product. We also write ∇w Φ(x, w) ∈ Rn(d+2)+1 to denote the full
gradient with respect to all parameters.
In practice, it is common to initialize the weights randomly, and in this section we consider so-
called LeCun initialization. The following condition on the distribution used for this initialization
will be assumed throughout the rest of Section 11.5.
Assumption 11.14. The distribution W on R has expectation zero, variance one, and finite
moments up to order eight.
To explicitly indicate the expectation and variance in the notation, we also write W(0, 1) instead
of W, and for µ ∈ R and ς > 0 we use W(µ, ς 2 ) to denote the corresponding scaled and shifted
measure with expectation µ and variance ς 2 ; thus, if X ∼ W(0, 1) then µ + ςX ∼ W(µ, ς 2 ). LeCun
initialization [129] sets the variance of the weights in each layer to be reciprocal to the input
dimension of the layer, thereby normalizing the output variance across all network nodes. The
initial parameters
w0 = (U 0 , b0 , v 0 , c0 )
are thus randomly initialized with components
 1  1
iid iid
U0;ij ∼ W 0, , v0;i ∼ W 0, , b0;i , c0 = 0, (11.5.3)
d n
independently for all i = 1, . . . , n, j = 1, . . . , d. For a fixed ς > 0 one might choose variances ς 2 /d
and ς 2 /n in (11.5.3), which would require only minor modifications in the rest of this section. Biases

156
are set to zero for simplicity, with nonzero initialization discussed in the exercises. All expectations
and probabilities in Section 11.5 are understood with respect to this random initialization.
Example 11.15. Typical √ examples
√ for W(0, 1) are the standard normal distribution on R or the
uniform distribution on [− 3, 3].

11.5.2 Neural tangent kernel


We begin our analysis by investigating the empirical tangent kernel

K̂n (x, z) = ⟨∇w Φ(x, w0 ), ∇w Φ(z, w0 )⟩

of the shallow network (11.5.1). Scaled properly, it converges in the infinite width limit n → ∞
towards a specific kernel known as the neural tangent kernel (NTK). Its precise formula depends
on the architecture and initialization. For the LeCun initialization (11.5.3) we denote it by K LC .

Theorem 11.16. Let R < ∞ such that |σ(x)| ≤ R · (1 + |x|) and |σ ′ (x)| ≤ R · (1 + |x|) for all
iid
x ∈ R. For any x, z ∈ Rd and ui ∼ W(0, 1/d), i = 1, . . . , d, it then holds
1
lim K̂n (x, z) = E[σ(u⊤ x)σ(u⊤ z)] =: K LC (x, z)
n→∞ n

almost surely.
Moreover, for every δ, ε > 0 there exists n0 (δ, ε, R) ∈ N such that for all n ≥ n0 and all x,
z ∈ Rd with ∥x∥, ∥z∥ ≤ R
h 1 i
P K̂n (x, z) − K LC (x, z) < ε ≥ 1 − δ.
n

Proof. Denote x(1) = U 0 x + b0 ∈ Rn and z (1) = U 0 z + b0 ∈ Rn . Due to the initialization (11.5.3)


and our assumptions on W(0, 1), the components
d
(1)
X
xi = U0;ij xj ∼ u⊤ x i = 1, . . . , n
j=1

are i.i.d. with finite pth moment (independent of n) for all 1 ≤ p ≤ 8. Due to the linear growth bound
(1) (1) (1)
on σ and σ ′ , the same holds for the (σ(xi ))ni=1 and the (σ ′ (xi ))ni=1 . Similarly, the (σ(zi ))ni=1
(1)
and (σ ′ (zi ))ni=1 are collections of i.i.d. random variables with finite pth moment for all 1 ≤ p ≤ 8.
√ iid
Denote ṽi = nv0;i such that ṽi ∼ W(0, 1). By (11.5.2)
n n
1 ⊤ 1 X 2 ′ (1) ′ (1) 1X (1) (1) 1
K̂n (x, z) = (1 + x z) 2 ṽi σ (xi )σ (zi ) + σ(xi )σ(zi ) + .
n n n n
i=1 i=1

Since
n
1 X 2 ′ (1) ′ (1)
ṽi σ (xi )σ (zi ) (11.5.4)
n
i=1

157
is an average over i.i.d. random variables with finite variance, the law of large numbers implies
almost sure convergence of this expression towards
(1) (1)  (1) (1)
E ṽi2 σ ′ (xi )σ ′ (zi ) = E[ṽi2 ]E[σ ′ (xi )σ ′ (zi )]


= E[σ ′ (u⊤ x)σ ′ (u⊤ z)],

(1) (1)
where we used that ṽi2 is independent of σ ′ (xi )σ ′ (zi ). By the same argument
n
1X (1) (1)
σ(xi )σ(zi ) → E[σ(u⊤ x)σ(u⊤ z)]
n
i=1

almost surely as n → ∞. This shows the first statement.


The existence of n0 follows similarly by an application of Theorem A.23.

Example 11.17 (K LC for ReLU). Let σ(x) = max{0, x} and let W(0, 1) be the standard normal
distribution. For x, z ∈ Rd denote by
 ⊤ 
x z
θ = arccos
∥x∥∥z∥
iid
the angle between these vectors. Then according to [37, Appendix A], it holds with ui ∼ W(0, 1),
i = 1, . . . , d,

∥x∥∥z∥
K LC (x, z) = E[σ(u⊤ x)σ(u⊤ z)] = (sin(θ) + (π − θ) cos(θ)).
2πd

11.5.3 Gradient descent


We now proceed similar as in [131, Appendix G], to show that Theorem 11.13 is applicable to the
wide neural network (11.5.1) with high probability under random initialization (11.5.3). This will
imply that gradient descent can find global minimizers when training wide neural networks. We
work under the following assumptions on the activation function and training data.

Assumption 11.18. There exist R < ∞ and 0 < λLC LC


min ≤ λmax < ∞ such that

(a) for the activation function σ : R → R holds |σ(0)|, Lip(σ), Lip(σ ′ ) ≤ R,

(b) ∥xi ∥, |yi | ≤ R for all training data (xi , yi ) ∈ Rd × R, i = 1, . . . , m,

(c) the kernel matrix of the neural tangent kernel

(K LC (xi , xj ))m
i,j=1 ∈ R
m×m

is regular and its eigenvalues belong to [λLC LC


min , λmax ].

We start by showing Assumption 11.12 (a) for the present setting. More precisely, we give
bounds for the eigenvalues of the empirical tangent kernel.

158
Lemma 11.19. Let Assumption 11.18 be satisfied. Then for every δ > 0 there exists
n0 (δ, λLC
min , m, R) ∈ R such that for all n ≥ n0 with probability at least 1 − δ all eigenvalues of
m
(K̂n (xi , xj ))m
i,j=1 = ⟨∇w Φ(xi , w 0 ), ∇w Φ(xj , w 0 )⟩ i,j=1 ∈ R
m×m

belong to [nλLC LC
min /2, 2nλmax ].

LC :=
Proof. Denote Ĝn := (K̂n (xi , xj ))m
i,j=1 and G (K LC (xi , xj ))m
i,j=1 . By Theorem 11.16, there
exists n0 such that for all n ≥ n0 holds with probability at least 1 − δ that

1 λLC
GLC − Ĝn ≤ min .
n 2

Assuming this bound to hold

1 1 λLC λLC λLC


∥Ĝn ∥ = sup ∥Ĝn a∥ ≥ infm ∥GLC a∥ − min ≥ λLC
min − min
≥ min
,
n a∈Rm n a∈R 2 2 2
∥a∥=1 ∥a∥=1

where we have used that λLC min is the smallest eigenvalue, and thus singular value, of the symmetric
positive definite matrix GLC . This shows that the smallest eigenvalue of Ĝn is larger or equal to
λLC LC LC
min /2. Similarly, we conclude that the largest eigenvalue is bounded from above by λmax +λmin /2 ≤
LC
λmax . This concludes the proof.

Next we check Assumption 11.12 (b). To this end we first bound the norm of a random matrix.

iid
Lemma 11.20. Let W(0, 1) be as in Assumption 11.14, and let W ∈ Rn×d with Wij ∼ W(0, 1).
Denote the fourth moment of W(0, 1) by µ4 . Then
h p i dµ4
P ∥W ∥ ≤ n(d + 1) ≥ 1 − .
n

Proof. It holds
n X
X d 1/2
∥W ∥ ≤ ∥W ∥F = Wij2 .
i=1 j=1
Pd
The αi := j=1 Wij2 , i = 1, . . . , n, are i.i.d. distributed with expectation d and finite variance dC,
where C ≤ µ4 is the variance of W11 2 . By Theorem A.23

h i n
h1 X i h 1Xn i dµ
4
p
P ∥W ∥ > n(d + 1) ≤ P αi > d + 1 ≤ P αi − d > 1 ≤ ,
n n n
i=1 i=1

which concludes the proof.

159
Lemma 11.21. Let Assumption 11.18 (a) be satisfied with some constant R. Then there exists
M (R), and for all c, δ > 0 there exists n0 (c, d, δ, R) ∈ N such that for all n ≥ n0 it holds with
probability at least 1 − δ

∥∇w Φ(x, w)∥ ≤ M n for all w ∈ Bcn−1/2 (w0 )

∥∇w Φ(x, w) − ∇w Φ(x, v)∥ ≤ M n∥w − v∥ for all w, v ∈ Bcn−1/2 (w0 )

for all x ∈ Rd with ∥x∥ ≤ R.

Proof. Due to the initialization (11.5.3), by Lemma 11.20 we can find n0 (δ, d) such that for all
n ≥ n0 holds with probability at least 1 − δ that

∥v 0 ∥ ≤ 2 and ∥U 0 ∥ ≤ 2 n. (11.5.5)

For the rest of this proof we fix arbitrary x ∈ Rd and n ≥ n0 ≥ c2 such that

∥x∥ ≤ R and n−1/2 c ≤ 1.

We need to show that the claimed inequalities hold as long as (11.5.5) is satisfied. We will several
times use that for all p, q ∈ Rn

∥p ⊙ q∥ ≤ ∥p∥∥q∥ and ∥σ(p)∥ ≤ R n + R∥p∥

since |σ(x)| ≤ R · (1 + |x|). The same holds for σ ′ .


Step 1. We show the bound on the gradient. Fix

w = (U , b, v, c) s.t. ∥w − w0 ∥ ≤ cn−1/2 .

Using formula (11.5.2) for ∇b Φ and the above inequalities

∥∇b Φ(x, w)∥ ≤ ∥∇b Φ(x, w0 )∥ + ∥∇b Φ(x, w) − ∇b Φ(x, w0 )∥


= ∥v 0 ⊙ σ ′ (U 0 x)∥ + ∥v ⊙ σ ′ (U x + b) − v 0 ⊙ σ ′ (U 0 x)∥
√ √
≤ 2(R n + 2R2 n) + ∥v ⊙ σ ′ (U x + b) − v 0 ⊙ σ ′ (U 0 x)∥. (11.5.6)

Due to √ √
∥U ∥ ≤ ∥U 0 ∥ + ∥U 0 − U ∥F ≤ 2 n + cn−1/2 ≤ 3 n, (11.5.7)
the last norm in (11.5.6) is bounded by

∥(v − v 0 ) ⊙ σ ′ (U x + b)∥ + ∥v 0 ⊙ (σ ′ (U x + b) − σ ′ (U 0 x))∥



≤ cn−1/2 (R n + R · (∥U ∥∥x∥ + ∥b∥)) + 2R · (∥U − U 0 ∥∥x∥ + ∥b∥)
√ √
≤ R n + 3 nR2 + cn−1/2 R + 2R · (cn−1/2 R + cn−1/2 )

≤ n(4R + 5R2 )

and therefore √
∥∇b Φ(x, w)∥ ≤ n(6R + 9R2 ).

160
For the gradient with respect to U we use ∇U Φ(x, w) = ∇b Φ(x, w)x⊤ , so that

∥∇U Φ(x, w)∥F = ∥∇b Φ(x, w)x⊤ ∥F = ∥∇b Φ(x, w)∥∥x∥ ≤ n(6R2 + 9R3 ).

Next

∥∇v Φ(x, w)∥ = ∥σ(U x + b)∥



≤ R n + R∥U x + b∥
√ √
≤ R n + R · (3 nR + cn−1/2 )

≤ n(2R + 3R2 ),

and finally ∇c Φ(x, w) = 1. In all, with M1 (R) := (1 + 8R + 12R2 )



∥∇w Φ(x, w̃)∥ ≤ nM1 (R).

Step 2. We show Lipschitz continuity. Fix

w = (U , b, v, c) and w̃ = (Ũ , b̃, ṽ, c̃)

such that ∥w − w0 ∥, ∥w̃ − w0 ∥ ≤ cn−1/2 . Then

∥∇b Φ(x, w) − ∇b Φ(x, w̃)∥ = ∥v ⊙ σ ′ (U x + b) − ṽ ⊙ σ ′ (Ũ x + b̃)∥.

Using ∥ṽ∥ ≤ ∥v 0 ∥ + cn−1/2 ≤ 3 and (11.5.7), this term is bounded by

∥(v − ṽ) ⊙ σ ′ (U x + b)∥ + ∥ṽ ⊙ (σ ′ (U x + b) − σ ′ (Ũ x + b̃))∥



≤ ∥v − ṽ∥(R n + R · (∥U ∥∥x∥ + ∥b∥)) + 3R · (∥x∥∥U − Ũ ∥ + ∥b − b̃∥)

≤ ∥w − w̃∥ n(5R + 6R2 ).

For ∇U Φ(x, w) we obtain similar as in Step 1

∥∇U Φ(x, w) − ∇U Φ(x, w̃)∥F = ∥x∥∥∇b Φ(x, w) − ∇b Φ(x, w̃)∥



≤ ∥w − w̃∥ n(5R2 + 6R3 ).

Next

∥∇v Φ(x, w) − ∇v Φ(x, w̃)∥ = ∥σ(U x + b) − σ(Ũ x − b̃)∥


≤ R · (∥U − Ũ ∥∥x∥ + ∥b − b̃∥)
≤ ∥w − w̃∥(R2 + R)

and finally ∇c Φ(x, w) = 1 is constant. With M2 (R) := R + 6R2 + 6R3 this shows

∥∇w Φ(x, w) − ∇w Φ(x, w̃)∥ ≤ nM2 (R)∥w − w̃∥.

In all, this concludes the proof with M (R) := max{M1 (R), M2 (R)}.

Before coming to the main result of this section, we first show that the initial error f (w0 )
remains bounded with high probability.

161
Lemma 11.22. Let Assumption 11.18 (a), (b) be satisfied. Then for every δ > 0 exists
R0 (δ, m, R) > 0 such that for all n ∈ N

P[f (w0 ) ≤ R0 ] ≥ 1 − δ.

√ iid
Proof. Let i ∈ {1, . . . , m}, and set α := U 0 xi and ṽj := nv0;j for j = 1, . . . , n, so that ṽj ∼
W(0, 1). Then
n
1 X
Φ(xi , w0 ) = √ ṽj σ(αj ).
n
j=1

By Assumption 11.14 and (11.5.3), the ṽj σ(αj ), j = 1, . . . , n, are i.i.d. centered random variables
with finite variance bounded by a constant C(R) independent of n. Thus the variance of Φ(xi , w0 )
is also bounded by C(R). By Chebyshev’s inequality, see Lemma A.22, for every k > 0
√ 1
P[|Φ(xi , w0 )| ≥ k C] ≤ 2 .
k
p
Setting k = m/δ
m
hX √ i m
X h √ i
2 2
P |Φ(xi , w0 ) − yi | ≥ m(k C + R) ≤ P |Φ(xi , w0 ) − yi | ≥ k C + R
i=1 i=1
m
X h √ i
≤ P |Φ(xi , w0 )| ≥ k C ≤ δ,
i=1
p
which shows the claim with R0 = m · ( Cm/δ + R)2 .

The next theorem is the main result of this section. It states that in the present setting gradient
descent converges to a global minimizer and the limiting network achieves zero loss, i.e. interpolates
the data. Moreover, during training the network weights remain close to initialization if the network
width n is large.

Theorem 11.23. Let Assumption 11.18 be satisfied, and let the parameters w0 of the neural
network Φ in (11.5.1) be initialized according to (11.5.3). Fix a learning rate
2 1
h<
λLC
min
LC
+ 4λmax n

and with the objective function (11.0.1b) let for all k ∈ N

wk+1 = wk − h∇f (wk ).

162
Then for every δ > 0 there exist C > 0, n0 ∈ N such that for all n ≥ n0 holds with probability
at least 1 − δ that for all k ∈ N
C
∥wk − w0 ∥ ≤ √
n
 hn 2k
f (wk ) ≤ C 1 − LC .
2λmin

Proof. We wish to apply Theorem 11.13, which requires Assumption 11.12 to be satisfied. By
Lemma 11.19, 11.21 and 11.22, for pevery c >√0 we can find n0 such that for all n ≥ n0 with
probability at least 1 − δ we have f (w0 ) ≤ R0 and Assumption 11.12 (a), (b) holds with the
values
√ √ nλLC
L = M n, U = M n, r = cn−1/2 , λmin = min
, λmax = 2nλLCmax .
2
For Assumption 11.12 (c), it suffices that

√ n2 (λLC
min /2)
2
−1/2 2mM n p
M n≤ √ and cn ≥ R0 .
12m3/2 M 2 n R0 n
Choosing c > 0 and n large enough, the inequalities hold. The statement is now a direct consequence
of Theorem 11.13.

11.5.4 Proximity to linearized model


The analysis thus far was based on the linearization Φlin describing the behaviour of the full network
Φ well in a neighbourhood of the initial parameters w0 . Moreover, Theorem 11.23 states that the
parameters remain in an O(n−1/2 ) neighbourhood of w0 during training. This suggests that the
trained full model limk→∞ Φ(x, wk ) yields predictions similar to the trained linearized model.
To describe this phenomenon, we adopt again the notations Φlin : Rd × Rn → R and f lin from
(11.3.1) and (11.3.3). Initializing w0 according to (11.5.3) and setting p0 = w0 , gradient descent
computes the parameter updates

wk+1 = wk − h∇w f (wk ), pk+1 = pk − h∇w f lin (pk )

for the full and linearized models, respectively. Let us consider the dynamics of the prediction of
the network on the training data. Writing

Φ(X, w) := (Φ(xi , w))m


i=1 ∈ R
m
such that ∇w Φ(X, w) ∈ Rm×n

it holds
∇w f (w) = ∇w ∥Φ(X, w) − y∥2 = 2∇w Φ(X, w)⊤ (Φ(X, w) − y).
Thus for the full model

Φ(X, wk+1 ) = Φ(X, wk ) + ∇w Φ(X, w̃k )(wk+1 − wk )


= Φ(X, wk ) − 2h∇w Φ(X, w̃k )∇w Φ(X, wk )⊤ (Φ(X, wk ) − y), (11.5.8)

163
where w̃k is in the convex hull of wk and wk+1 .
Similarly, for the linearized model with (cp. (11.3.1))

Φlin (X, w) := (Φlin (xi , w))m


i=1 ∈ R
m
and ∇p Φlin (X, p) = ∇w Φ(X, w0 ) ∈ Rm×n

such that
∇p f lin (p) = ∇p ∥Φlin (X, p) − y∥2 = 2∇w Φ(X, w0 )⊤ (Φlin (X, p) − y)
and

Φlin (X, pk+1 ) = Φlin (X, pk ) + ∇p Φlin (X, p0 )(pk+1 − pk )


= Φlin (X, pk ) − 2h∇w Φ(X, w0 )∇w Φ(X, w0 )⊤ (Φlin (X, pk ) − y). (11.5.9)

Remark 11.24. From (11.5.9) it is easy to see that with A := 2h∇w Φ(X, w0 )∇w Φ(X, w0 )⊤ and
B := I m − A holds the explicit formula
k−1
X
lin k lin
Φ (X, pk ) = B Φ (X, p0 ) + B k Ay
j=0

for the prediction of the linear model in step k. Note that


P∞ if Ak is regular and h is small enough,
k −1
then B converges to the zero matrix as k → ∞ and j=0 B = A since this is a Neumann
series.
Comparing the two dynamics (11.5.8) and (11.5.9), the difference only lies in the two Rm×m
matrices

2h∇w Φ(X, w̃k )∇w Φ(X, wk )⊤ and 2h∇w Φ(X, w0 )∇w Φ(X, w0 )⊤ .

Recall that the step size h in Theorem 11.23 scales like 1/n.

Proposition 11.25. Consider the setting of Theorem 11.23. Then there exists C < ∞, and for
every δ > 0 there exists n0 such that for all n ≥ n0 holds with probability at least 1 − δ that for all
k∈N
1
∥∇w Φ(X, w̃k )∇w Φ(X, wk )⊤ − ∇p Φ(X, p0 )∇p Φ(X, p0 )⊤ ∥ ≤ Cn−1/2 .
n

Proof. Consider the setting of the proof of Theorem 11.23. Then for every k ∈ N holds ∥wk −w0 ∥ ≤
r and thus also ∥w̃k − w0 ∥ ≤ r, where r = cn−1/2 . Thus Lemma 11.21 implies the norm to be
bounded by
1
∥∇w Φ(X, w̃k ) − ∇p Φ(X, p0 )∥∥∇w Φ(X, wk )⊤ ∥+
n
1
∥∇p Φ(X, p0 )∥∥∇w Φ(X, wk )⊤ − ∇p Φ(X, p0 )⊤ ∥
n
≤ mM (∥w̃k − p0 ∥ + ∥wk − p0 ∥) ≤ cmM n−1/2

which gives the statement.

164
By Proposition 11.25 the two matrices driving the dynamics (11.5.8) and (11.5.9) remain in
an O(n−1/2 ) neighbourhood of each other throughout training. This allows to show the following
proposition, which states that the prediction function learned by the network gets arbitrarily close
to the one learned by the linearized version in the limit n → ∞. The proof, which we omit, is based
on Grönwall’s inequality. See [106, 131].

Proposition 11.26. Consider the setting of Theorem 11.23. Then there exists C < ∞, and for
every δ > 0 there exists n0 such that for all n ≥ n0 holds with probability at least 1 − δ that for all
∥x∥ ≤ 1
sup |Φ(x, wk ) − Φlin (x, pk )| ≤ Cn−1/2 .
k∈N

11.5.5 Connection to Gaussian processes


In the previous section, we established that for large widths, the trained neural network mirrors the
behaviour of the trained linearized model, which itself is closely connected to kernel least-squares
with the neural tangent kernel. Yet, as pointed out in Remark 11.4, the obtained model still
strongly depends on the choice of random initialization w0 ∈ Rn . We should thus understand both
the model at initialization x 7→ Φ(x, w0 ) and the model after training x 7→ Φ(x, wk ), as random
draws of a certain distribution over functions. To make this precise, let us introduce Gaussian
processes.

Definition 11.27. Let (Ω, P) be a probability space, and let g : Rd ×Ω → R. We call g a Gaussian
process with mean function m : Rd → R and covariance function c : Rd × Rd → R if

(a) for each x ∈ Rd holds ω 7→ g(x, ω) is a random variable,

(b) for all k ∈ N and all x1 , . . . , xk ∈ Rd the random variables g(x1 , ·), . . . , g(xk , ·) have a joint
Gaussian distribution such that
 
(g(x1 , ω), . . . , g(xk , ω)) ∼ N m(xi )ki=1 , (c(xi , xj ))ki,j=1 .

In words, g is a Gaussian process, if ω 7→ g(x, ω) defines a collection of random variables indexed


over x ∈ Rd , such that the joint distribution of (g(x1 , ·))nj=1 is a Gaussian whose mean and variance
are determined by m and c respectively. Fixing ω ∈ Ω, we can then interpret x 7→ g(x, ω) as a
random draw from a distribution over functions.
As first observed in [157], certain neural networks at initialization tend to Gaussian processes
in the infinite width limit.

165
Proposition 11.28. Consider depth-n networks Φn as in (11.5.1) with initialization (11.5.3), and
iid
define with ui ∼ W(0, 1/d), i = 1, . . . , d,

c(x, z) := E[σ(u⊤ x)σ(u⊤ z)] for all x, z ∈ Rd .

Then for all distinct x1 , . . . , xk ∈ Rd it holds that

lim (Φn (x1 , w0 ), . . . , Φn (xk , w0 )) ∼ N(0, (c(xi , xj ))ki,j=1 )


n→∞

with weak convergence.

√ iid
Proof. Set ṽi := nv0,i and ũi = (U0,i1 , . . . , U0,id ) ∈ Rd , so that ṽi ∼ W(0, 1), and the ũi ∈ Rd are
also i.i.d., with each component distributed according to W(0, 1/d).
Then for any x1 , . . . , xk

ṽi σ(ũ⊤
 
i x1 )
.. k
Z i :=  ∈R i = 1, . . . , n,
 
.
ṽi σ(ũ⊤
i xk )

defines n centered i.i.d. vectors in Rk . By the central limit theorem, see Theorem A.25,
 
Φ(x1 , w0 ) n
.. 1 X
= √ Zi
 
 . 
n
Φ(xk , w0 ) j=1

converges weakly to N(0, C), where

Cij = E[ṽ12 σ(ũ⊤ ⊤ ⊤ ⊤


1 xi )σ(ũ1 xj )] = E[σ(ũ1 xi )σ(ũ1 xj )].

This concludes the proof.

In the sense of Proposition 11.28, the network Φ(x, w0 ) converges to a Gaussian process as the
width n tends to infinity. Using the explicit dynamics of the linearized network outlined in Remark
11.24, one can show that the linearized network after training also corresponds to a Gaussian
process (for some mean and covariance function depending on the data, the architecture, and the
initialization). As the full and linearized models converge in the infinite width limit, we can infer
that wide networks post-training resemble draws from a Gaussian process, see [131, Sec. 2.3.1] and
[46].
Rather than delving into the technical details of such statements, in Figure 11.3 we plot 80
different realizations of a neural network before and after training, i.e. the functions

x 7→ Φ(x, w0 ) and x 7→ Φ(x, wk ). (11.5.10)

We chose the architecture as (11.5.1) with activation function σ = arctan(x), width n = 250 and
initialization  3  3
iid iid iid
U0;ij ∼ N 0, , v0;i ∼ N 0, , b0;i , c0 ∼ N(0, 2). (11.5.11)
d n

166
The network was trained on a dataset of size m = 3 with k = 1000 steps of gradient descent
and constant step size h = 1/n. Before training, the network’s outputs resemble random draws
from a Gaussian process with a constant zero mean function. Post-training, the outputs show
minimal variance at the data points, since they essentially interpolate the data, cp. Remark 11.4
and (11.2.4). They exhibit increased variance further from these points, with the precise amount
depending on the initialization variance chosen in (11.5.11).

2 2

1 1

0 0

1 1

2 2
3 2 1 0 1 2 3 3 2 1 0 1 2 3
(a) before training (b) after training

Figure 11.3: 80 realizations of a neural network at initialization (a) and after training on the red
data points (b). The blue dashed line shows the mean. Figure based on [131, Fig. 2].

11.6 Normalized initialization


Consider the gradient ∇w Φ(x, w0 ) as in (11.5.2) with LeCun initialization. Since the components
iid
of v behave like vi ∼ W(0, 1/n), it is easy to check that in terms of the width n

E[∥∇U Φ(x, w0 )∥] = E[∥(v ⊙ σ ′ (U x + b))x⊤ ∥] = O(1)



E[∥∇b Φ(x, w0 )∥] = E[∥v ⊙ σ (U x + b)∥] = O(1)
E[∥∇v Φ(x, w0 )∥] = E[∥σ(U x + b)∥] = O(n)
E[∥∇c Φ(x, w0 )∥] = E[|1|] = O(1).

As a result of this different scaling, gradient descent with step width O(n−1 ) as in Theorem 11.23,
will primarily train the weigths v in the output layer, and will barely move the remaining parameters
U , b, and c. This is also reflected in the expression for the obtained kernel K LC computed in
Theorem 11.16, which corresponds to the contribution of the term ⟨∇v Φ, ∇v Φ⟩.
Remark 11.29. For optimization methods such as ADAM, which scale each component of the
gradient individually, the same does not hold in general.
LeCun initialization aims to normalize the variance of the output of all nodes at initialization
(the forward dynamics). To also normalize the variance of the gradients (the backward dynamics),
in this section we shortly dicuss a different architecture and initialization, consistent with the one
used in the original NTK paper [106].

167
11.6.1 Architecture
Let Φ : Rd → R be a depth-one neural network
1 ⊤  1 
Φ(x, w) = √ v σ √ U x + b + c, (11.6.1)
n d

with input x ∈ Rd and parameters U ∈ Rn×d , v ∈ Rn , b ∈ Rn and c ∈ R. We initialize the weights


randomly according to w0 = (U 0 , b0 , v 0 , c0 ) with parameters
iid iid
U0;ij ∼ W(0, 1), v0;i ∼ W(0, 1), b0;i , c0 = 0. (11.6.2)

At initialization, (11.6.1), (11.6.2) is equivalent to (11.5.1), (11.5.3). However, for the gradient we
obtain  
∇U Φ(x, w) = n−1/2 v ⊙ σ ′ (d−1/2 U x + b) d−1/2 x⊤ ∈ Rn×d
 
∇b Φ(x, w) = n−1/2 v ⊙ σ ′ d−1/2 U x + b) ∈ Rn
(11.6.3)
−1/2 −1/2 n
∇v Φ(x, w) = n σ(d U x + b) ∈ R
∇c Φ(x, w) = 1 ∈ R.
Contrary to (11.5.2), the three gradients with O(n) entries are all scaled by the factor n−1/2 . This
leads to a different training dynamics.

11.6.2 Neural tangent kernel


We compute again the neural tangent kernel. Unlike for LeCun initialization, there is no 1/n scaling
required to obtain convergence of

K̂n (x, z) = ⟨∇w Φ(x, w0 ), ∇w Φ(z, w0 )⟩

as n → ∞. Here and in the following we consider the setting (11.6.1)–(11.6.2) for Φ and w0 .
Since this is also referred to as the NTK initialization, we denote the kernel by K NTK . Due to the
different training dynamics, we obtain additional terms in the NTK compared to Theorem 11.23.

Theorem 11.30. Let R < ∞ such that |σ(x)| ≤ R · (1 + |x|) and |σ ′ (x)| ≤ R · (1 + |x|) for all
iid
x ∈ R, and let W satisfy Assumption 11.14. For any x, z ∈ Rd and ui ∼ W(0, 1/d), i = 1, . . . , d,
it then holds
 x⊤ z 
lim K̂n (x, z) = 1 + E[σ ′ (u⊤ x)⊤ σ ′ (u⊤ z)] + E[σ(u⊤ x)⊤ σ(u⊤ z)] + 1
n→∞ d
=: K NTK (x, z)

almost surely.

168
Proof. Denote x(1) = U 0 x + b0 ∈ Rn and z (1) = U 0 z + b0 ∈ Rn . Due to the initialization (11.6.2)
and our assumptions on W(0, 1), the components
d
(1)
X
xi = U0;ij xj ∼ u⊤ x i = 1, . . . , n
j=1

are i.i.d. with finite pth moment (independent of n) for all 1 ≤ p ≤ 8, and the same holds for the
(1) (1) (1) (1)
(σ(xi ))ni=1 , (σ ′ (xi ))ni=1 , (σ(zi ))ni=1 , and (σ ′ (zi ))ni=1 .
Then
n n
 x⊤ z  1 X 2 ′ (1) ′ (1) 1X (1) (1)
K̂n (x, z) = 1 + vi σ (xi )σ (zi ) + σ(xi )σ(zi ) + 1.
d n n
i=1 i=1

By the law of large numbers and because E[vi2 ] = 1, this converges almost surely to K NTK (x, z).
The existence of n0 follows similarly by an application of Theorem A.23.

Example 11.31 (K NTK for ReLU). Let σ(x) = max{0, x} and let W(0, 1/d) be the centered
normal distribution on R with variance 1/d. For x, z ∈ Rd holds by [37, Appendix A] (also see
iid
Exercise 11.36), that with ui ∼ W(0, 1/d), i = 1, . . . , d,
 
x⊤ z
π − arccos ∥x∥∥z∥
E[σ ′ (u⊤ x)σ ′ (u⊤ z)] = .

Together with Example 11.17, this yields an explizit formula for K NTK in Theorem 11.30.

For this network architecture and under suitable assumptions on W, similar arguments as
in Section 11.5 can be used to show convergence of gradient descent to a global minimizer and
proximity of the full to the linearized model. We refer to the literature in the bibliography section.

Bibliography and further reading


The discussion on linear and kernel regression in Sections 11.1 and 11.2 is quite standard, and can
similarly be found in many textbooks. For more details on kernel methods we refer for instance
to [42, 206]. The neural tangent kernel and its connection to the training dynamics was first
investigated in [106] using an architecture similar to the one in Section 11.6. Since then, many
works have extended this idea and presented differing perspectives on the topic, see for instance [2,
56, 5, 36]. Our presentation in Sections 11.4, 11.5, and 11.6 primarily follows [131] who also discussed
the case of LeCun initialization. Especially for the main results in Theorem 11.13 and Theorem
11.23, we largely follow the arguments in this paper. The above references additionally treat the
case of deep networks, which we have omitted here for simplicity. The explicit formula for the NTK
of ReLU networks as presented in Examples 11.17 and 11.31 was given in [37]. The observation
that neural networks at initialization behave like Gaussian processes discussed in Section 11.5.5 was
first made in [157]. For a general reference on Gaussian processes see the textbook [188]. When
only training the last layer of a network (in which the network is affine linear), there are strong
links to random feature methods [186]. Recent developements on this topic can also be found in
the literature under the name “Neural network Gaussian processes”, or NNGPs for short [130, 47].

169
Exercises
Exercise 11.32. Prove Theorem 11.3.
Hint: Assume first that w0 ∈ ker(A)⊥ (i.e. w0 ∈ H̃). For rank(A) < d, using wk = wk−1 −
h∇f (wk−1 ) and the singular value
P decomposition of A, write down an explicit formula for wk .
Observe that due to 1/(1 − x) = k∈N0 xk for all x ∈ (0, 1) it holds wk → A† y as k → ∞, where
A† is the Moore-Penrose pseudoinverse of A.

Exercise 11.33. Let xi ∈ Rd , i = 1, . . . , m. Show that there exists a “feature map” ϕ : Rd → Rm ,


such that for any configuration of labels yi ∈ {−1, 1}, there always exists a hyperplane in Rm
separating the two sets {ϕ(xi ) | yi = 1} and {ϕ(xi ) | yi = −1}.

Exercise 11.34. Consider the RBF kernel K : R × R → R, K(x, x′ ) := exp(−(x − x′ )2 ). Find a


Hilbert space H and a feature map ϕ : R → H such that K(x, x′ ) = ⟨ϕ(x), ϕ(x′ )⟩H .

Exercise 11.35. Let n ∈ N and consider the polynomial kernel K : Rd × Rd → R, K(x, x′ ) =


(1 + x⊤ x′ )r . Find a Hilbert space H and a feature map ϕ : Rd → H, such that K(x, x′ ) =
⟨ϕ(x), ϕ(x′ )⟩H .
Hint: Use the multinomial formula.
iid
Exercise 11.36. Let ui ∼ N(0, 1) be i.i.d. standard Gaussian distributed random variables for
i = 1, . . . , d. Show that for all nonzero x, z ∈ Rd

π−θ  xz ⊤ 
E[1[0,∞) (u⊤ x)1[0,∞) (u⊤ z)] = , θ = arccos .
2π ∥x∥∥z∥

This shows the formula for the ReLU NTK with Gaussian initialization as discussed in Example
11.31.
Hint: Consider the following sketch

θ x

Exercise 11.37. Consider the network (11.5.1) with LeCun initialization as in (11.5.3), but with
the biases instead initialized as
iid
c, bi ∼ W(0, 1) for all i = 1, . . . , n. (11.6.4)

Compute the corresponding NTK as in Theorem 11.23. Moreover, compute the NTK also for the
normalized network (11.6.1) with initialization (11.6.2) as in Theorem 11.30, but replace again the
bias initialization with that given in (11.6.4).

170
Chapter 12

Loss landscape analysis

In Chapter 10, we saw how the weights of neural networks get adapted during training, using, e.g.,
variants of gradient descent. For certain cases, including the wide networks considered in Chapter
11, the corresponding iterative scheme converges to a global minimizer. In general, this is not
guaranteed, and gradient descent can for instance get stuck in non-global minima or saddle points.
To get a better understanding of these situations, in this chapter we discuss the so-called loss
landscape. This term refers to the graph of the empirical risk as a function of the weights. We
give a more rigorous definition below, and first introduce notation for neural networks and their
realizations for a fixed architecture.

Definition 12.1. Let A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 , let σ : R → R be an activation function, and
let B > 0. We denote the set of neural networks Φ with L layers, layer widths d0 , d1 , . . . , dL+1 , all
weights bounded in modulus by B, and using the activation function σ by N (σ; A, B). Additionally,
we define
L
×
 
PN (A, B) := [−B, B]dℓ+1 ×dℓ × [−B, B]dℓ+1 ,
ℓ=0

and the realization map

Rσ : PN (A, B) → N (σ; A, B)
(12.0.1)
(W (ℓ) , b(ℓ) )L
ℓ=0 7→ Φ,

where Φ is the neural network with weights and biases given by (W (ℓ) , b(ℓ) )L
ℓ=0 .

PL
Throughout, we will identify PN (A, B) with the cube [−B, B]nA , where nA := ℓ=0 dℓ+1 (dℓ +
1). Now we can introduce the loss landscape of a neural network architecture.

Definition 12.2. Let A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 , let σ : R → R. Let m ∈ N, and S =


(xi , y i )m d0
i=1 ∈ (R × R
dL+1 )m be a sample and let L be a loss function. Then, the loss landscape

171
l minima
ca
lo

le points
dd
sa

sh
arp
Hig l min minimum
k ba ima
he
mpirical ris
o
gl

Figure 12.1: Two-dimensional section of a loss landscape. The loss landscape shows a spurious
valley with local minima, global minima, as well as a region where saddle points appear. Moreover,
a sharp minimum is shown.

is the graph of the function ΛA,σ,S,L defined as

ΛA,σ,S,L : PN (A; ∞) → R
θ 7→ R
b S (Rσ (θ)).

with R
b S in (1.2.3) and Rσ in (12.0.1).

Identifying PN (A, ∞) with RnA , we can consider ΛA,σ,S,L as a map on RnA and the loss
landscape is a subset of RnA × R. The loss landscape is a high-dimensional surface, with hills and
valleys. For visualization a two-dimensional section of a loss landscape is shown in Figure 12.1.
Questions of interest regarding the loss landscape include for example: How likely is it that we
find local instead of global minima? Are these local minima typically sharp, having small volume,
or are they part of large flat valleys that are difficult to escape? How bad is it to end up in a local
minimum? Are most local minima as deep as the global minimum, or can they be significantly
higher? How rough is the surface generally, and how do these characteristics depend on the network
architecture? While providing complete answers to these questions is hard in general, in the rest
of this chapter we give some intuition and mathematical insights for specific cases.

172
12.1 Visualization of loss landscapes
Visualizing loss landscapes can provide valuable insights into the effects of neural network depth,
width, and activation functions. However, we can only visualize an at most two-dimensional surface
embedded into three-dimensional space, whereas the loss landscape is a very high-dimensional
object (unless the neural networks have only very few weights and biases).
To make the loss landscape accessible, we need to reduce its dimensionality. This can be achieved
by evaluating the function ΛA,σ,S,L on a two-dimensional subspace of PN (A, ∞). Specifically, we
choose three-parameters µ, θ1 , θ2 and examine the function

R2 ∋ (α1 , α2 ) 7→ ΛA,σ,S,L (µ + α1 θ1 + α2 θ2 ). (12.1.1)

There are various natural choices for µ, θ1 , θ2 :


• Random directions: This was, for example used in [74, 102]. Here θ1 , θ2 are chosen randomly,
while µ is either a minimum of ΛA,σ,S,L or also chosen randomly. This simple approach can
offer a quick insight into how rough the surface can be. However, as was pointed out in
[134], random directions will very likely be orthogonal to the trajectory of the optimization
procedure. Hence, they will likely miss the most relevant features.

• Principal components of learning trajectory: To address the shortcomings of random direc-


tions, another possibility is to determine µ, θ1 , θ2 , which best capture some given learning
trajectory; For example, if θ(1) , θ(2) , . . . , θ(N ) are the parameters resulting from the training
by SGD, we may determine µ, θ1 , θ2 such that the hyperplane {µ + α1 θ1 + α2 θ2 | α1 , α2 ∈ R}
minimizes the mean squared distance to the θ(j) for j ∈ {1, . . . , N }. This is the approach of
[134], and can be achieved by a principal component analysis.

• Based on critical points: For a more global perspective, µ, θ1 , θ2 can be chosen to ensure the
observation of multiple critical points. One way to achieve this is by running the optimization
procedure three times with final parameters θ(1) , θ(2) , θ(3) . If the procedures have converged,
then each of these parameters is close to a critical point of ΛA,σ,S,L . We can now set µ = θ(1) ,
θ1 = θ(2) − µ, θ2 = θ(3) − µ. This then guarantees that (12.1.1) passes through or at least
comes very close to three critical points (at (α1 , α2 ) = (0, 0), (0, 1), (1, 0)). We present six
visualizations of this form in Figure 12.2.
Figure 12.2 gives some interesting insight into the effect of depth and width on the shape of the
loss landscape. For very wide and shallow neural networks, we have the widest minima, which, in
the case of the tanh activation function also seem to belong to the same valley. With increasing
depth and smaller width the minima get steeper and more disconnected.

12.2 Spurious valleys


From the perspective of optimization, the ideal loss landscape has one global minimum in the center
of a large valley, so that gradient descent converges towards the minimum irrespective of the chosen
initialization.
This situation is not realistic for deep neural networks. Indeed, for a simple shallow neural
network
Rd ∋ x 7→ Φ(x) = W (1) σ(W (0) x + b(0) ) + b(1) ,

173
it is clear that for every permutation matrix P

Φ(x) = W (1) P T σ(P W (0) x + P b(0) ) + b(1) for all x ∈ Rd .

Hence, in general there exist multiple parameterizations realizing the same output function. More-
over, if at least one global minimum with non-permutation-invariant weights exists, then there are
more than one global minima of the loss landscape.
This is not problematic; in fact, having many global minima is beneficial. The larger issue is the
existence of non-global minima. Following [235], we start by generalizing the notion of non-global
minima to spurious valleys.

Definition 12.3. Let A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 and σ : R → R. Let m ∈ N, and S =


(xi , y i )m d0
i=1 ∈ (R × R
dL+1 )m be a sample and let L be a loss function. For c ∈ R, we define the

sub-level set of ΛA,σ,S,L as

ΩΛ (c) := {θ ∈ PN (A, ∞) | ΛA,σ,S,L (θ) ≤ c}.

A path-connected component of ΩΛ (c), which does not contain a global minimum of ΛA,σ,S,L is
called a spurious valley.

The next proposition shows that spurious local minima do not exist for shallow overparameter-
ized neural networks, i.e., for neural networks that have at least as many parameters in the hidden
layer as there are training samples.

Proposition 12.4. Let A = (d0 , d1 , 1) ∈ N3 and let S = (xi , yi )m d0 m


i=1 ∈ (R × R) be a sample such
that m ≤ d1 . Furthermore, let σ ∈ M be not a polynomial, and let L be a convex loss function.
Further assume that ΛA,σ,S,L has at least one global minimum. Then, ΛA,σ,S,L , has no spurious
valleys.

Proof. Let θa , θb ∈ PN (A, ∞) with ΛA,σ,S,L (θa ) > ΛA,σ,S,L (θb ). Then we will show below that
there is another parameter θc such that

• ΛA,σ,S,L (θb ) = ΛA,σ,S,L (θc )

• there is a continuous path α : [0, 1] → PN (A, ∞) such that α(0) = θa , α(1) = θc , and
ΛA,σ,S,L (α) is monotonically decreasing.

By Exercise 12.7, the construction above rules out the existence of spurious valleys by choosing θa
an element of a spurious valley and θb a global minimum.
Next, we present the construction: Let us denote
 1 
(ℓ) (ℓ)
θo = W o , bo for o ∈ {a, b, c}.
ℓ=0

174
Moreover, for j = 1, . . . , d1 , we introduce v jo ∈ Rm defined as
  
(v jo )i = σ W (0)o xi + bo
(0)
for i = 1, . . . , m.
j

Notice that, if we set V o = ((v jo )⊤ )dj=1


1
, then
 m
W (1)
o V o = R (θ
σ o )(x i ) − b (1)
o , (12.2.1)
i=1

where the right-hand side is considered a row-vector.


We will now distinguish between two cases. For the first the result is trivial and the second can
be transformed into the first one.
Case 1: Assume that V a has rank m. In this case, it is obvious from (12.2.1), that there
exists W
f such that  m
f V a = Rσ (θb )(xi ) − b(1)
W .
a
i=1
(0) (0) (1) f , b(1)
We can thus set α(t) = −
((W a , ba ), ((1 t)W a
+ tW a )).
Note that by construction α(0) = θa and ΛA,σ,S,L (α(1)) = ΛA,σ,S,L (θb ). Moreover, t 7→
(Rσ (α(t))(xi ))mi=1 describes a straight path in R
m and hence, by the convexity of L it is clear

that t 7→ ΛA,σ,S,L (α(t)) is monotonically decreasing.


Case 2: Assume that Va has rank less than m. In this case, we show that we find a continuous
path from θa to another neural network parameter with higher rank. The path will be such that
ΛA,σ,S,L is monotonically decreasing.
Under the assumptions, we have that one v ja can be written as a linear combination of the
remaining v ia , i ̸= j. Without loss of generality, we assume j = 1. Then, there exist (αi )m
i=2 such
that
m
X
v 1a = αi v ia . (12.2.2)
i=2

Next, we observe that there exists v ∗ ∈ Rm which is linearly independent from all (v ja )m
i=1 and
∗ ∗ ⊤ ∗ ∗ d ∗
can be written as (v )i = σ((w ) xi + b ) for some w ∈ R , b ∈ R. Indeed, if we assume that
0

such v ∗ does not exist, then it follows that span{(σ(w⊤ xi + b))m d0


i=1 | w ∈ R , b ∈ R} is an m − 1
dimensional subspace of Rm which yields a contradiction to Theorem 9.3.
Now, we define two paths: First,

α1 (t) = ((W (0) (0) (1) (1)


a , ba ), (W a (t), ba )), for t ∈ [0, 1/2]

where

(W (1) (1)
a (t))1 = (1 − 2t)(W a )1 and (W (1) (1) (1)
a (t))i = (W a )i + 2tαi (W a )1

for i = 2, . . . , d1 , for t ∈ [0, 1/2]. Second,

α2 (t) = ((W (0) (0) (1) (1)


a (t), ba (t)), (W a (1/2), ba )), for t ∈ (1/2, 1],

where

(W (0) (0)
a (t))1 = 2(t − 1/2)(W a )1 + (2t − 1)w and (W (0) (0)
a (t))i = (W a )i

175
(0) (0) (0) (0)
for i = 2, . . . , d1 , (ba (t))1 = 2(t − 1/2)(ba )1 + (2t − 1)b∗ , and (ba (t))i = (ba )i for i = 2, . . . , d1 .
It is clear by (12.2.2) that (Rσ (α1 )(xi ))m i=1 is constant. Moreover, Rσ (α2 )(x) is constant for all
x ∈ Rd0 . In addition, by construction for
   m
j (0) (0)
v̄ := σ W a (1)xi + ba (1)
j i=1

it holds that ((v̄ j )⊤ )dj=1


1
has
rank larger than that of V a . Concatenating α1 and α2 now yields a
continuous path from θa to another neural network parameter with higher associated rank such
that ΛA,σ,S,L is monotonically decreasing along the path. Iterating this construction, we can find
a path to a neural network parameter where the associated matrix has full rank. This reduces the
problem to Case 1.

12.3 Saddle points


Saddle points are critical points of the loss landscape at which the loss decreases in one direction.
In this sense, saddle points are not as problematic as local minima or spurious valleys if the updates
in the learning iteration have some stochasticity. Eventually, a random step in the right direction
could be taken and the saddle point can be escaped.
If most of the critical points are saddle points, then, even though the loss landscape is challenging
for optimization, one still has a good chance of eventually reaching the global minimum. Saddle
points of the loss landscape were studied in [45, 172] and we will review some of the findings in a
simplified way below. The main observation in [172] is that, under some quite strong assumptions,
it holds that critical points in the loss landscape associated to a large loss are typically saddle points,
whereas those associated to small loss correspond to minima. This situation is encouraging for the
prospects of optimization in deep learning, since, even if we get stuck in a local minimum, it will
very likely be such that the loss is close to optimal.
The results of [172] use random matrix theory, which we do not recall here. Moreover, it is hard
to gauge if the assumptions made are satisfied for a specific problem. Nonetheless, we recall the
main idea, which provides some intuition to support the above claim.
Let $A = (d_0, d_1, 1) \in \mathbb{N}^3$. Then, for a neural network parameter $\theta \in PN(A, \infty)$ and activation function $\sigma$, we set $\Phi_\theta := R_\sigma(\theta)$ and define for a sample $S = (x_i, y_i)_{i=1}^m$ the errors
$$
e_i = \Phi_\theta(x_i) - y_i \qquad \text{for } i = 1, \dots, m.
$$
If we use the square loss, then
$$
\widehat{R}_S(\Phi_\theta) = \frac{1}{m}\sum_{i=1}^m e_i^2. \qquad (12.3.1)
$$
Next, we study the Hessian of $\widehat{R}_S(\Phi_\theta)$.

Proposition 12.5. Let $A = (d_0, d_1, 1)$ and $\sigma : \mathbb{R} \to \mathbb{R}$. Then, for every $\theta \in PN(A, \infty)$ where $\widehat{R}_S(\Phi_\theta)$ in (12.3.1) is twice continuously differentiable with respect to the weights, it holds that
$$
H(\theta) = H_0(\theta) + H_1(\theta),
$$
where $H(\theta)$ is the Hessian of $\widehat{R}_S(\Phi_\theta)$ at $\theta$, $H_0(\theta)$ is a positive semi-definite matrix which is independent of $(y_i)_{i=1}^m$, and $H_1(\theta)$ is a symmetric matrix that for fixed $\theta$ and $(x_i)_{i=1}^m$ depends linearly on $(e_i)_{i=1}^m$.

Proof. Using the identification introduced after Definition 12.2, we can consider $\theta$ a vector in $\mathbb{R}^{n_A}$. For $k = 1, \dots, n_A$, we have that
$$
\frac{\partial \widehat{R}_S(\Phi_\theta)}{\partial \theta_k} = \frac{2}{m}\sum_{i=1}^m e_i \frac{\partial \Phi_\theta(x_i)}{\partial \theta_k}.
$$
Therefore, for $j = 1, \dots, n_A$, we have, by the Leibniz rule, that
$$
\frac{\partial^2 \widehat{R}_S(\Phi_\theta)}{\partial \theta_j \partial \theta_k}
= \frac{2}{m}\sum_{i=1}^m \frac{\partial \Phi_\theta(x_i)}{\partial \theta_j}\frac{\partial \Phi_\theta(x_i)}{\partial \theta_k}
+ \frac{2}{m}\sum_{i=1}^m e_i \frac{\partial^2 \Phi_\theta(x_i)}{\partial \theta_j \partial \theta_k}
=: H_0(\theta) + H_1(\theta). \qquad (12.3.2)
$$
It remains to show that $H_0(\theta)$ and $H_1(\theta)$ have the asserted properties. Note that, setting
$$
J_{i,\theta} = \left(\frac{\partial \Phi_\theta(x_i)}{\partial \theta_1}, \dots, \frac{\partial \Phi_\theta(x_i)}{\partial \theta_{n_A}}\right)^\top \in \mathbb{R}^{n_A},
$$
we have that $H_0(\theta) = \frac{2}{m}\sum_{i=1}^m J_{i,\theta} J_{i,\theta}^\top$, and hence $H_0(\theta)$ is a sum of positive semi-definite matrices, which shows that $H_0(\theta)$ is positive semi-definite.
The symmetry of $H_1(\theta)$ follows directly from the symmetry of second derivatives, which holds since we assumed twice continuous differentiability at $\theta$. The linearity of $H_1(\theta)$ in $(e_i)_{i=1}^m$ is clear from (12.3.2).

How does Proposition 12.5 imply the claimed relationship between the size of the loss and the
prevalence of saddle points?
Let $\theta$ correspond to a critical point. If $H(\theta)$ has at least one negative eigenvalue, then $\theta$ cannot be a minimum, but instead must be either a saddle point or a maximum. While we do not know anything about $H_1(\theta)$ other than that it is symmetric, it is not unreasonable to assume that it has a negative eigenvalue, especially if $n_A$ is very large. With this consideration, let us consider the following model:
Fix a parameter $\theta$. Let $S^0 = (x_i, y_i^0)_{i=1}^m$ be a sample and $(e_i^0)_{i=1}^m$ be the associated errors. Further, let $H^0(\theta)$, $H_0^0(\theta)$, $H_1^0(\theta)$ be the matrices according to Proposition 12.5.
Moreover, for $\lambda > 0$ let $S^\lambda = (x_i, y_i^\lambda)_{i=1}^m$ be such that the associated errors are $(e_i^\lambda)_{i=1}^m = \lambda(e_i^0)_{i=1}^m$. The Hessian of $\widehat{R}_{S^\lambda}(\Phi_\theta)$ at $\theta$ is then $H^\lambda(\theta)$ satisfying
$$
H^\lambda(\theta) = H_0^0(\theta) + \lambda H_1^0(\theta).
$$


Hence, if λ is large, then H λ (θ) is perturbation of an amplified version of H 01 (θ). Clearly, if v is
an eigenvector of H 1 (θ) with negative eigenvalue −µ, then
v ⊤ H λ (θ)v ≤ (∥H 00 (θ)∥ − λµ)∥v∥2 ,

177
which we can expect to be negative for large λ. Thus, H λ (θ) has a negative eigenvalue for large λ.
On the other hand, if λ is small, then H λ (θ) is merely a perturbation of H 00 (θ) and we can
expect its spectrum to resemble that of H 00 more and more.
What we see is that, the same parameter, is more likely to be a saddle point for a sample that
produces a high empirical risk than for a sample with small risk. Note that, since H 00 (θ) was only
shown to be semi -definite the argument above does not rule out saddle points even for very small
λ. But it does show that for small λ, every negative eigenvalue would be very small.
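The following is a minimal numerical sketch of Proposition 12.5 and the $\lambda$-scaling argument above; it is not part of the original text. It assumes $\sigma = \tanh$, architecture $A = (1, 3, 1)$, random parameters and errors, and approximates all derivatives by finite differences; for large $\lambda$ the smallest eigenvalue of $H^\lambda(\theta)$ typically becomes negative.

```python
import numpy as np

# Minimal numerical sketch of Proposition 12.5 and the lambda-scaling argument.
# Assumptions (not from the text): sigma = tanh, architecture A = (1, 3, 1), random
# parameters and errors; all derivatives are approximated by finite differences.
rng = np.random.default_rng(1)
sigma = np.tanh
x = np.linspace(-1.0, 1.0, 5)          # m = 5 samples in dimension d_0 = 1
m = x.size
theta = rng.normal(size=10)            # n_A = 3 + 3 + 3 + 1 parameters

def net(t, x):                         # Phi_theta evaluated at all samples
    W0, b0, W1, b1 = t[:3, None], t[3:6], t[6:9], t[9]
    return W1 @ sigma(W0 * x[None, :] + b0[:, None]) + b1

def grad(f, t, h=1e-5):                # central-difference gradient of f at t
    return np.array([(f(t + h * e) - f(t - h * e)) / (2 * h) for e in np.eye(t.size)])

def hess(f, t, h=1e-3):                # finite-difference Hessian (symmetrized)
    H = np.array([grad(lambda s: grad(f, s, h)[k], t, h) for k in range(t.size)])
    return (H + H.T) / 2

e0 = rng.normal(size=m)                # reference errors e_i^0 for the sample S^0

# Gauss-Newton part H_0 (label-independent) and error-dependent part H_1^0 as in (12.3.2)
J = np.stack([grad(lambda t: net(t, x)[i], theta) for i in range(m)])
H0 = 2.0 / m * J.T @ J
H1 = 2.0 / m * sum(e0[i] * hess(lambda t: net(t, x)[i], theta) for i in range(m))

print("min eig of H_0 (>= 0 up to finite-difference error):", np.linalg.eigvalsh(H0).min())
for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    # larger errors (larger lambda) tend to push the smallest eigenvalue of H^lambda negative
    print(lam, np.linalg.eigvalsh(H0 + lam * H1).min())
```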
A more refined analysis where we compare different parameters but for the same sample and
quantify the likelihood of local minima versus saddle points requires the introduction of a probability
distribution on the weights. We refer to [172] for the details.

Bibliography and further reading


The results on visualization of the loss landscape are inspired by [134, 74, 102]. Results on the
non-existence of spurious valleys can be found in [235] with similar results in [184]. In [39] the
loss landscape was studied by linking it to so-called spin-glass models. There it was found that
under strong assumptions critical points associated to lower losses are more likely to be minima
than saddle points. In [172], random matrix theory is used to provide similar results, that go
beyond those established in Section 12.3. On the topic of saddle points, [45] identifies the existence
of saddle points as more problematic than that of local minima, and an alternative saddle-point
aware optimization algorithm is introduced.
Two essential topics associated to the loss landscape that have not been discussed in this chapter are mode connectivity and the sharpness of minima. Mode connectivity, roughly speaking, describes the phenomenon that local minima found by SGD over deep neural networks are often connected by simple curves of equally low loss [64, 54]. Moreover, the sharpness of minima has been analyzed and linked to the generalization capabilities of neural networks, with the idea being that wide minima are easier to find and also yield robust neural networks [92, 34, 247]. However, this does not appear to exclude sharp minima from generalizing well [53].

Exercises
Exercise 12.6. In view of Definition 12.3, show that a local minimum of a differentiable function that is not a global minimum is contained in a spurious valley.

Exercise 12.7. Show that if there exists a continuous path $\alpha$ between a parameter $\theta_1$ and a global minimum $\theta_2$ such that $t \mapsto \Lambda_{A,\sigma,S,L}(\alpha(t))$ is monotonically decreasing, then $\theta_1$ cannot be an element of a spurious valley.

Exercise 12.8. Find an example of a spurious valley for a simple architecture.


Hint: Use a single neuron ReLU neural network and observe that, for two networks one with
positive and one with negative slope, every continuous path in parameter space that connects the
two has to pass through a parameter corresponding to a constant function.

Figure 12.2: A collection of loss landscapes. The left column shows neural networks with the ReLU activation function, the right column shows loss landscapes of neural networks with the hyperbolic tangent activation function. All neural networks have five-dimensional input and one-dimensional output. Moreover, from top to bottom the hidden layers have sizes 1000, 20, 10, and the numbers of layers are 1, 4, 7.
Chapter 13

Shape of neural network spaces

As we have seen in the previous chapter, the loss landscape of neural networks can be quite intricate
and is typically not convex. In some sense, the reason for this is that we view the empirical risk as a function of the parameters of the neural network. Let us consider a convex loss function $L : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ and a sample $S = (x_i, y_i)_{i=1}^m \in (\mathbb{R}^d \times \mathbb{R})^m$.
Then, for two neural networks $\Phi_1, \Phi_2$ and for $\alpha \in (0,1)$ it holds that
$$
\begin{aligned}
\widehat{R}_S(\alpha \Phi_1 + (1-\alpha)\Phi_2) &= \frac{1}{m}\sum_{i=1}^m L\big(\alpha \Phi_1(x_i) + (1-\alpha)\Phi_2(x_i), y_i\big) \\
&\le \frac{1}{m}\sum_{i=1}^m \Big(\alpha L(\Phi_1(x_i), y_i) + (1-\alpha) L(\Phi_2(x_i), y_i)\Big) \\
&= \alpha \widehat{R}_S(\Phi_1) + (1-\alpha)\widehat{R}_S(\Phi_2).
\end{aligned}
$$

Hence, the empirical risk is convex when considered as a map depending on the neural network functions rather than on the neural network parameters. A convex function does not have spurious minima or saddle points. As a result, the issues from the previous chapter are avoided if we take the perspective of neural network sets.
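The following minimal sketch (not part of the text) illustrates this point numerically: along the segment between two network functions the empirical risk obeys the convexity bound above, whereas along the segment between the two corresponding parameter vectors it need not. It assumes $\sigma = \tanh$, architecture $(1, 8, 1)$, the square loss, and random data and parameters.

```python
import numpy as np

# Minimal numerical sketch (not from the text): the empirical risk is convex along the
# segment between two functions, but typically not along the segment between the two
# corresponding parameter vectors. Assumptions: sigma = tanh, architecture (1, 8, 1),
# square loss, random data and random parameters theta_1, theta_2.
rng = np.random.default_rng(0)
sigma = np.tanh
x = np.linspace(-1, 1, 30); y = np.sin(3 * x)

def unpack(t):
    return t[:8, None], t[8:16], t[16:24], t[24]

def net(t, x):
    W0, b0, W1, b1 = unpack(t)
    return W1 @ sigma(W0 * x[None, :] + b0[:, None]) + b1

def risk(pred):
    return np.mean((pred - y) ** 2)

t1, t2 = rng.normal(size=25), rng.normal(size=25)
for a in np.linspace(0, 1, 6):
    risk_param = risk(net((1 - a) * t1 + a * t2, x))          # along the parameter segment
    risk_func  = risk((1 - a) * net(t1, x) + a * net(t2, x))  # along the function segment
    print(f"alpha={a:.1f}  parameter path: {risk_param:.3f}  function path: {risk_func:.3f}")
# The function-path values never exceed the convex combination of the endpoint risks;
# the parameter-path values need not satisfy such a bound.
```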
So why do we not optimize over the sets of neural networks instead of the parameters? To
understand this, we will now study the set of neural networks associated with a fixed architecture
as a subset of other function spaces.
We start by investigating the realization map $R_\sigma$ introduced in Definition 12.1. Concretely, we show in Section 13.1 that if $\sigma$ is Lipschitz, then the set of neural networks is the image of $PN(A, \infty)$ under a locally Lipschitz map. We will use this fact to show in Section 13.2 that sets of neural networks are typically non-convex, and even have arbitrarily large holes. Finally, in Section 13.3, we study the extent to which there exist best approximations to arbitrary functions in the set of neural networks. We will demonstrate that the lack of best approximations causes the weights of neural networks to grow unboundedly during training.

13.1 Lipschitz parameterizations


In this section, we study the realization map Rσ . The main result is the following simplified version
of [173, Proposition 4].

Proposition 13.1. Let $A = (d_0, d_1, \dots, d_{L+1}) \in \mathbb{N}^{L+2}$, let $\sigma : \mathbb{R} \to \mathbb{R}$ be $C_\sigma$-Lipschitz continuous with $C_\sigma \ge 1$, let $|\sigma(x)| \le C_\sigma|x|$ for all $x \in \mathbb{R}$, and let $B \ge 1$.
Then, for all $\theta, \theta' \in PN(A, B)$,
$$
\|R_\sigma(\theta) - R_\sigma(\theta')\|_{L^\infty([-1,1]^{d_0})} \le (2C_\sigma B d_{\max})^L\, n_A\, \|\theta - \theta'\|_\infty,
$$
where $d_{\max} = \max_{\ell=0,\dots,L+1} d_\ell$ and $n_A = \sum_{\ell=0}^{L} d_{\ell+1}(d_\ell + 1)$.

Proof. Let $\theta, \theta' \in PN(A, B)$ and define $\delta := \|\theta - \theta'\|_\infty$. Repeatedly using the triangle inequality, we find a sequence $(\theta_j)_{j=0}^{n_A}$ such that $\theta_0 = \theta$, $\theta_{n_A} = \theta'$, $\|\theta_j - \theta_{j+1}\|_\infty \le \delta$, and $\theta_j$ and $\theta_{j+1}$ differ in one entry only for all $j = 0, \dots, n_A - 1$. We conclude that for all $x \in [-1,1]^{d_0}$
$$
\|R_\sigma(\theta)(x) - R_\sigma(\theta')(x)\|_\infty \le \sum_{j=0}^{n_A-1} \|R_\sigma(\theta_j)(x) - R_\sigma(\theta_{j+1})(x)\|_\infty. \qquad (13.1.1)
$$

To upper bound (13.1.1), we now only need to understand the effect of changing one weight in a neural network by $\delta$.
Before we can complete the proof, we need two auxiliary lemmas. The first of them holds under slightly weaker assumptions than Proposition 13.1.

Lemma 13.2. Under the assumptions of Proposition 13.1, but with $B$ allowed to be an arbitrary positive number, it holds for all $\Phi \in \mathcal{N}(\sigma; A, B)$ that
$$
\|\Phi(x) - \Phi(x')\|_\infty \le C_\sigma^L \cdot (B d_{\max})^{L+1} \|x - x'\|_\infty \qquad (13.1.2)
$$
for all $x, x' \in \mathbb{R}^{d_0}$.

Proof. We start with the case $L = 1$. Then, for $(d_0, d_1, d_2) = A$, we have that
$$
\Phi(x) = W^{(1)}\sigma(W^{(0)}x + b^{(0)}) + b^{(1)},
$$
for certain $W^{(0)}, W^{(1)}, b^{(0)}, b^{(1)}$ with all entries bounded by $B$. As a consequence, we can estimate
$$
\begin{aligned}
\|\Phi(x) - \Phi(x')\|_\infty &= \big\|W^{(1)}\big(\sigma(W^{(0)}x + b^{(0)}) - \sigma(W^{(0)}x' + b^{(0)})\big)\big\|_\infty \\
&\le d_1 B\,\big\|\sigma(W^{(0)}x + b^{(0)}) - \sigma(W^{(0)}x' + b^{(0)})\big\|_\infty \\
&\le d_1 B C_\sigma \big\|W^{(0)}(x - x')\big\|_\infty \\
&\le d_1 d_0 B^2 C_\sigma \|x - x'\|_\infty \le C_\sigma \cdot (d_{\max}B)^2 \|x - x'\|_\infty,
\end{aligned}
$$
where we used the Lipschitz property of $\sigma$ and the fact that $\|Ax\|_\infty \le n\max_{i,j}|A_{ij}|\,\|x\|_\infty$ for every matrix $A = (A_{ij})_{i=1,j=1}^{m,n} \in \mathbb{R}^{m \times n}$.
The induction step from $L$ to $L+1$ follows similarly. This concludes the proof of the lemma.

Lemma 13.3. Under the assumptions of Proposition 13.1, it holds that
$$
\|x^{(\ell)}\|_\infty \le (2C_\sigma B d_{\max})^\ell \qquad \text{for all } x \in [-1,1]^{d_0}. \qquad (13.1.3)
$$

Proof. Per Definitions (2.1.1b) and (2.1.1c), we have for $\ell = 1, \dots, L+1$ that
$$
\|x^{(\ell)}\|_\infty \le C_\sigma\big\|W^{(\ell-1)}x^{(\ell-1)} + b^{(\ell-1)}\big\|_\infty \le C_\sigma B d_{\max}\|x^{(\ell-1)}\|_\infty + B C_\sigma,
$$
where we used the triangle inequality and the estimate $\|Ax\|_\infty \le n\max_{i,j}|A_{ij}|\,\|x\|_\infty$, which holds for every matrix $A \in \mathbb{R}^{m\times n}$. We obtain that
$$
\|x^{(\ell)}\|_\infty \le C_\sigma B d_{\max}\cdot\big(1 + \|x^{(\ell-1)}\|_\infty\big) \le 2C_\sigma B d_{\max}\cdot\max\{1, \|x^{(\ell-1)}\|_\infty\}.
$$
Resolving this recursive estimate, we conclude that
$$
\|x^{(\ell)}\|_\infty \le (2C_\sigma B d_{\max})^\ell\max\{1, \|x^{(0)}\|_\infty\} = (2C_\sigma B d_{\max})^\ell.
$$
This concludes the proof of the lemma.

We can now proceed with the proof of Proposition 13.1. Assume that $\theta_{j+1}$ and $\theta_j$ differ in one entry only, and assume this entry to be in the $\ell$th layer. We start with the case $\ell < L$. It holds that
$$
|R_\sigma(\theta_j)(x) - R_\sigma(\theta_{j+1})(x)| = \big|\Phi_\ell\big(\sigma(W^{(\ell)}x^{(\ell)} + b^{(\ell)})\big) - \Phi_\ell\big(\sigma(\overline{W}^{(\ell)}x^{(\ell)} + \overline{b}^{(\ell)})\big)\big|,
$$
where $\Phi_\ell \in \mathcal{N}(\sigma; A_\ell, B)$ for $A_\ell = (d_{\ell+1}, \dots, d_{L+1})$, and $(W^{(\ell)}, b^{(\ell)})$, $(\overline{W}^{(\ell)}, \overline{b}^{(\ell)})$ differ in one entry only.
Using the Lipschitz continuity of $\Phi_\ell$ from Lemma 13.2, we have
$$
\begin{aligned}
|R_\sigma(\theta_j)(x) - R_\sigma(\theta_{j+1})(x)|
&\le C_\sigma^{L-\ell-1}(B d_{\max})^{L-\ell}\big|\sigma(W^{(\ell)}x^{(\ell)} + b^{(\ell)}) - \sigma(\overline{W}^{(\ell)}x^{(\ell)} + \overline{b}^{(\ell)})\big| \\
&\le C_\sigma^{L-\ell}(B d_{\max})^{L-\ell}\big\|W^{(\ell)}x^{(\ell)} + b^{(\ell)} - \overline{W}^{(\ell)}x^{(\ell)} - \overline{b}^{(\ell)}\big\|_\infty \\
&\le C_\sigma^{L-\ell}(B d_{\max})^{L-\ell}\,\delta\,\max\{1, \|x^{(\ell)}\|_\infty\},
\end{aligned}
$$
with $\delta = \|\theta - \theta'\|_\infty$ as above. Invoking Lemma 13.3, we conclude that
$$
|R_\sigma(\theta_j)(x) - R_\sigma(\theta_{j+1})(x)| \le (2C_\sigma B d_{\max})^\ell\, C_\sigma^{L-\ell}(B d_{\max})^{L-\ell}\,\delta \le (2C_\sigma B d_{\max})^L\,\|\theta - \theta'\|_\infty.
$$
For the case $\ell = L$, a similar estimate can be shown. Combining this with (13.1.1) yields the result.

Using Proposition 13.1, we can now consider the set of neural networks with a fixed architecture $\mathcal{N}(\sigma; A, \infty)$ as a subset of $L^\infty([-1,1]^{d_0})$. What is more, $\mathcal{N}(\sigma; A, \infty)$ is the image of $PN(A, \infty)$ under a locally Lipschitz map.
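As a sanity check, the following minimal sketch (not part of the text) evaluates the bound of Proposition 13.1 numerically for a small random example; it assumes $\sigma = \tanh$ (so $C_\sigma = 1$), the architecture $A = (1, 4, 4, 1)$, and approximates the sup-norm on a grid.

```python
import numpy as np

# Minimal numerical sketch of Proposition 13.1 (assumptions: sigma = tanh so C_sigma = 1,
# architecture A = (1, 4, 4, 1), i.e. L = 2; the sup-norm is approximated on a grid).
rng = np.random.default_rng(2)
sigma = np.tanh
dims = [1, 4, 4, 1]                                  # A = (d_0, ..., d_{L+1}), L = 2
B, L = 1.0, len(dims) - 2
d_max = max(dims)
n_A = sum(dims[l + 1] * (dims[l] + 1) for l in range(L + 1))

def unpack(theta):
    params, k = [], 0
    for l in range(L + 1):
        W = theta[k:k + dims[l + 1] * dims[l]].reshape(dims[l + 1], dims[l]); k += W.size
        b = theta[k:k + dims[l + 1]]; k += b.size
        params.append((W, b))
    return params

def realize(theta, X):                               # R_sigma(theta) on inputs X (one column per point)
    Z = X
    for l, (W, b) in enumerate(unpack(theta)):
        Z = W @ Z + b[:, None]
        if l < L:                                    # no activation on the output layer
            Z = sigma(Z)
    return Z

X = np.linspace(-1, 1, 401)[None, :]                 # grid in [-1, 1]^{d_0}
theta = rng.uniform(-B, B, size=n_A)
theta_p = np.clip(theta + rng.uniform(-1e-3, 1e-3, size=n_A), -B, B)   # small perturbation
lhs = np.max(np.abs(realize(theta, X) - realize(theta_p, X)))
rhs = (2 * 1.0 * B * d_max) ** L * n_A * np.max(np.abs(theta - theta_p))
print(f"sup-norm difference: {lhs:.2e}  <=  bound: {rhs:.2e}")
```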

13.2 Convexity of neural network spaces
As a first step towards understanding N (σ; A, ∞) as a subset of L∞ ([−1, 1]d0 ), we notice that it is
star-shaped with few centers. Let us first introduce the necessary terminology.

Definition 13.4. Let Z be a subset of a linear space. A point x ∈ Z is called a center of Z if,
for every y ∈ Z it holds that
{tx + (1 − t)y | t ∈ [0, 1]} ⊆ Z.
A set is called star-shaped if it has at least one center.

The following proposition follows directly from the definition of a neural network and is the
content of Exercise 13.15.

Proposition 13.5. Let $L \in \mathbb{N}$, $A = (d_0, d_1, \dots, d_{L+1}) \in \mathbb{N}^{L+2}$, and $\sigma : \mathbb{R} \to \mathbb{R}$. Then $\mathcal{N}(\sigma; A, \infty)$ is scaling invariant, i.e., for every $\lambda \in \mathbb{R}$ it holds that $\lambda f \in \mathcal{N}(\sigma; A, \infty)$ whenever $f \in \mathcal{N}(\sigma; A, \infty)$, and hence $0 \in \mathcal{N}(\sigma; A, \infty)$ is a center of $\mathcal{N}(\sigma; A, \infty)$.

Knowing that N (σ; A, B) is star-shaped with center 0, we can also ask ourselves if N (σ; A, B)
has more than this one center. It is not hard to see that also every constant function is a center.
The following theorem, which corresponds to [173, Proposition C.4], yields an upper bound on the
number of linearly independent centers.

Theorem 13.6. Let $L \in \mathbb{N}$, let $A = (d_0, d_1, \dots, d_{L+1}) \in \mathbb{N}^{L+2}$, and let $\sigma : \mathbb{R} \to \mathbb{R}$ be Lipschitz continuous. Then, $\mathcal{N}(\sigma; A, \infty)$ contains at most $n_A = \sum_{\ell=0}^{L}(d_\ell+1)d_{\ell+1}$ linearly independent centers.

Proof. Assume by contradiction that there are functions $(g_i)_{i=1}^{n_A+1} \subseteq \mathcal{N}(\sigma; A, \infty) \subseteq L^\infty([-1,1]^{d_0})$ that are linearly independent and centers of $\mathcal{N}(\sigma; A, \infty)$.
By the Hahn-Banach theorem, there exist $(g_i')_{i=1}^{n_A+1} \subseteq (L^\infty([-1,1]^{d_0}))'$ such that $g_i'(g_j) = \delta_{ij}$ for all $i, j \in \{1, \dots, n_A+1\}$. We define
$$
T : L^\infty([-1,1]^{d_0}) \to \mathbb{R}^{n_A+1}, \qquad g \mapsto \big(g_1'(g), g_2'(g), \dots, g_{n_A+1}'(g)\big)^\top.
$$
Since $T$ is continuous and linear, $T \circ R_\sigma$ is locally Lipschitz continuous by Proposition 13.1. Moreover, since the $(g_i)_{i=1}^{n_A+1}$ are linearly independent, we have that $T(\operatorname{span}((g_i)_{i=1}^{n_A+1})) = \mathbb{R}^{n_A+1}$. We denote $V := \operatorname{span}((g_i)_{i=1}^{n_A+1})$.
Next, we would like to establish that $\mathcal{N}(\sigma; A, \infty) \supseteq V$. Let $g \in V$; then
$$
g = \sum_{\ell=1}^{n_A+1} a_\ell g_\ell,
$$
for some $a_1, \dots, a_{n_A+1} \in \mathbb{R}$. We show by induction that $\widetilde{g}^{(m)} := \sum_{\ell=1}^{m} a_\ell g_\ell \in \mathcal{N}(\sigma; A, \infty)$ for every $m \le n_A+1$. This is obviously true for $m = 1$. Moreover, we have that $\widetilde{g}^{(m+1)} = a_{m+1}g_{m+1} + \widetilde{g}^{(m)}$. Hence, the induction step holds true if $a_{m+1} = 0$. If $a_{m+1} \neq 0$, then we have that
$$
\widetilde{g}^{(m+1)} = 2a_{m+1}\cdot\left(\frac{1}{2}g_{m+1} + \frac{1}{2a_{m+1}}\widetilde{g}^{(m)}\right). \qquad (13.2.1)
$$
By the induction assumption $\widetilde{g}^{(m)} \in \mathcal{N}(\sigma; A, \infty)$, and hence by Proposition 13.5 also $\widetilde{g}^{(m)}/a_{m+1} \in \mathcal{N}(\sigma; A, \infty)$. Additionally, since $g_{m+1}$ is a center of $\mathcal{N}(\sigma; A, \infty)$, we have that $\frac{1}{2}g_{m+1} + \frac{1}{2a_{m+1}}\widetilde{g}^{(m)} \in \mathcal{N}(\sigma; A, \infty)$. By Proposition 13.5, we conclude that $\widetilde{g}^{(m+1)} \in \mathcal{N}(\sigma; A, \infty)$.
The induction shows that $g \in \mathcal{N}(\sigma; A, \infty)$ and thus $V \subseteq \mathcal{N}(\sigma; A, \infty)$. As a consequence,
$$
T \circ R_\sigma(PN(A, \infty)) \supseteq T(V) = \mathbb{R}^{n_A+1}.
$$
It is a well-known fact of basic analysis that for every $n \in \mathbb{N}$ there does not exist a surjective and locally Lipschitz continuous map from $\mathbb{R}^n$ to $\mathbb{R}^{n+1}$. We recall that $n_A = \dim(PN(A, \infty))$. This yields the contradiction.

For a convex set $X$, the line segment between any two points of $X$ is a subset of $X$. Hence, every point of a convex set is a center. This yields the following corollary.

Corollary 13.7. Let $A = (d_0, d_1, \dots, d_{L+1}) \in \mathbb{N}^{L+2}$ and let $\sigma : \mathbb{R} \to \mathbb{R}$ be Lipschitz continuous. If $\mathcal{N}(\sigma; A, \infty)$ contains more than $n_A = \sum_{\ell=0}^{L}(d_\ell+1)d_{\ell+1}$ linearly independent functions, then $\mathcal{N}(\sigma; A, \infty)$ is not convex.

Corollary 13.7 tells us that we cannot expect convex sets of neural networks if the set of neural networks has many linearly independent elements. Sets of neural networks contain, for each $f \in \mathcal{N}(\sigma; A, \infty)$, also all shifts of this function, i.e., $f(\cdot + b)$ for $b \in \mathbb{R}^{d_0}$ is again an element of $\mathcal{N}(\sigma; A, \infty)$. For a set of functions, being shift invariant and having only finitely many linearly independent functions at the same time is a very restrictive condition. Indeed, it was shown in [173, Proposition C.6] that if $\mathcal{N}(\sigma; A, \infty)$ has only finitely many linearly independent functions and $\sigma$ is differentiable in at least one point with non-zero derivative there, then $\sigma$ is necessarily a polynomial.
We conclude that the set of neural networks is in general non-convex and star-shaped, with $0$ and the constant functions being centers. One could visualize this set in 3D as in Figure 13.1.
The fact that the neural network space is not convex could also mean that it merely fails to be convex at one point. For example, $\mathbb{R}^2 \setminus \{0\}$ is not convex, but for an optimization algorithm this would likely not pose a problem.
We will next observe that $\mathcal{N}(\sigma; A, \infty)$ does not have such a benign non-convexity and, in fact, has arbitrarily large holes.
To make this claim mathematically precise, we first introduce the notion of $\varepsilon$-convexity.
Figure 13.1: Sketch of the space of neural networks in 3D. The vertical axis corresponds to the
constant neural network functions, each of which is a center. The set of neural networks consists
of many low-dimensional linear subspaces spanned by certain neural networks (Φ1 , . . . , Φ6 in this
sketch) and linear functions. Between these low-dimensional subspaces, there is not always a
straight-line connection by Corollary 13.7 and Theorem 13.9.

Definition 13.8. For ε > 0, we say that a subset A of a normed vector space X is ε-convex if

co(A) ⊆ A + Bε (0),

where co(A) denotes the convex hull of A and Bε (0) is an ε ball around 0 with respect to the norm
of X.

Intuitively speaking, a set that is convex when one fills up all holes smaller than ε is ε-convex.
Now we show that there is no ε > 0 such that N (σ; A, ∞) is ε-convex.

Theorem 13.9. Let $L \in \mathbb{N}$ and $A = (d_0, d_1, \dots, d_L, 1) \in \mathbb{N}^{L+2}$. Let $K \subseteq \mathbb{R}^{d_0}$ be compact, let $\sigma \in \mathcal{M}$ with $\mathcal{M}$ as in (3.1.1), and assume that $\sigma$ is not a polynomial. Moreover, assume that there exists an open set on which $\sigma$ is differentiable and not constant.
If there exists an $\varepsilon > 0$ such that $\mathcal{N}(\sigma; A, \infty)$ is $\varepsilon$-convex, then $\mathcal{N}(\sigma; A, \infty)$ is dense in $C(K)$.

Proof. Step 1. We show that $\varepsilon$-convexity implies that $\overline{\mathcal{N}(\sigma; A, \infty)}$ is convex. By Proposition 13.5, $\mathcal{N}(\sigma; A, \infty)$ is scaling invariant. This implies that $\operatorname{co}(\mathcal{N}(\sigma; A, \infty))$ is scaling invariant as well. Hence, if there exists $\varepsilon > 0$ such that $\mathcal{N}(\sigma; A, \infty)$ is $\varepsilon$-convex, then for every $\varepsilon' > 0$
$$
\operatorname{co}(\mathcal{N}(\sigma; A, \infty)) = \frac{\varepsilon'}{\varepsilon}\operatorname{co}(\mathcal{N}(\sigma; A, \infty)) \subseteq \frac{\varepsilon'}{\varepsilon}\big(\mathcal{N}(\sigma; A, \infty) + B_\varepsilon(0)\big) = \mathcal{N}(\sigma; A, \infty) + B_{\varepsilon'}(0).
$$
This yields that $\mathcal{N}(\sigma; A, \infty)$ is $\varepsilon'$-convex. Since $\varepsilon'$ was arbitrary, we have that $\mathcal{N}(\sigma; A, \infty)$ is $\varepsilon$-convex for all $\varepsilon > 0$.
As a consequence, we have that
$$
\operatorname{co}(\mathcal{N}(\sigma; A, \infty)) \subseteq \bigcap_{\varepsilon > 0}\big(\mathcal{N}(\sigma; A, \infty) + B_\varepsilon(0)\big) \subseteq \bigcap_{\varepsilon > 0}\big(\overline{\mathcal{N}(\sigma; A, \infty)} + B_\varepsilon(0)\big) = \overline{\mathcal{N}(\sigma; A, \infty)}.
$$
Hence, $\operatorname{co}(\mathcal{N}(\sigma; A, \infty)) \subseteq \overline{\mathcal{N}(\sigma; A, \infty)}$ and, by the well-known fact that in every metric vector space $\operatorname{co}(\overline{A}) \subseteq \overline{\operatorname{co}(A)}$, we conclude that $\overline{\mathcal{N}(\sigma; A, \infty)}$ is convex.
Step 2. We show that $\mathcal{N}_{d_1}(\sigma; 1) \subseteq \overline{\mathcal{N}(\sigma; A, \infty)}$. If $\mathcal{N}(\sigma; A, \infty)$ is $\varepsilon$-convex, then by Step 1 $\overline{\mathcal{N}(\sigma; A, \infty)}$ is convex. The scaling invariance of $\mathcal{N}(\sigma; A, \infty)$ then shows that $\overline{\mathcal{N}(\sigma; A, \infty)}$ is a closed linear subspace of $C(K)$.
Note that, by Proposition 3.16, for every $w \in \mathbb{R}^{d_0}$ and $b \in \mathbb{R}$ there exists a function $f \in \mathcal{N}(\sigma; A, \infty)$ such that
$$
f(x) = \sigma(w^\top x + b) \qquad \text{for all } x \in K. \qquad (13.2.2)
$$
By definition, every constant function is an element of $\mathcal{N}(\sigma; A, \infty)$, and hence all constant functions also belong to $\overline{\mathcal{N}(\sigma; A, \infty)}$.
Since $\overline{\mathcal{N}(\sigma; A, \infty)}$ is a closed vector space, this implies that for all $n \in \mathbb{N}$ and all $w_1^{(1)}, \dots, w_n^{(1)} \in \mathbb{R}^{d_0}$, $w_1^{(2)}, \dots, w_n^{(2)} \in \mathbb{R}$, $b_1^{(1)}, \dots, b_n^{(1)} \in \mathbb{R}$, $b^{(2)} \in \mathbb{R}$,
$$
x \mapsto \sum_{i=1}^n w_i^{(2)}\sigma\big((w_i^{(1)})^\top x + b_i^{(1)}\big) + b^{(2)} \;\in\; \overline{\mathcal{N}(\sigma; A, \infty)}. \qquad (13.2.3)
$$
Step 3. From (13.2.3), we conclude that $\mathcal{N}_{d_1}(\sigma; 1) \subseteq \overline{\mathcal{N}(\sigma; A, \infty)}$. In words, the whole set of shallow neural networks of arbitrary width is contained in the closure of the set of neural networks with a fixed architecture. By Theorem 3.8, $\mathcal{N}_{d_1}(\sigma; 1)$ is dense in $C(K)$, which yields the result.

For any activation function of practical relevance, a set of neural networks with fixed architecture is not dense in $C(K)$. This is only the case for very strange activation functions such as the one discussed in Section 3.2. Hence, Theorem 13.9 shows that, in general, sets of neural networks of fixed architecture have arbitrarily large holes.

13.3 Closedness and best-approximation property


The non-convexity of the set of neural networks can have some serious consequences for the way
we think of the approximation or learning problem by neural networks.

Consider $A = (d_0, \dots, d_{L+1}) \in \mathbb{N}^{L+2}$ and an activation function $\sigma$. Let $H$ be a normed function space on $[-1,1]^{d_0}$ such that $\mathcal{N}(\sigma; A, \infty) \subseteq H$. For $h \in H$ we would like to find a neural network that best approximates $h$, i.e., to find $\Phi \in \mathcal{N}(\sigma; A, \infty)$ such that
$$
\|\Phi - h\|_H = \inf_{\Phi^* \in \mathcal{N}(\sigma; A, \infty)} \|\Phi^* - h\|_H. \qquad (13.3.1)
$$

We say that $\mathcal{N}(\sigma; A, \infty) \subseteq H$ has

• the best approximation property, if for all $h \in H$ there exists at least one $\Phi \in \mathcal{N}(\sigma; A, \infty)$ such that (13.3.1) holds,

• the unique best approximation property, if for all $h \in H$ there exists exactly one $\Phi \in \mathcal{N}(\sigma; A, \infty)$ such that (13.3.1) holds,

• the continuous selection property, if there exists a continuous function $\phi : H \to \mathcal{N}(\sigma; A, \infty)$ such that $\Phi = \phi(h)$ satisfies (13.3.1) for all $h \in H$.

We will see in the sequel that, in the absence of the best approximation property, we can prove that the learning problem necessarily requires the weights of the neural networks to tend to infinity, which may or may not be desirable in applications.
Moreover, having a continuous selection procedure is desirable as it implies the existence of a stable selection algorithm, that is, an algorithm which for similar problems yields similar neural networks satisfying (13.3.1).
Below, we will study the properties above for $L^p$ spaces, $p \in [1, \infty)$. As we will see, neural network classes typically satisfy neither the continuous selection nor the best approximation property.

13.3.1 Continuous selection


As shown in [111], neural network spaces essentially never admit the continuous selection property.
To give the argument, we first recall the following result from [111, Theorem 3.4] without proof.

Theorem 13.10. Let p ∈ (1, ∞). Every subset of Lp ([−1, 1]d0 ) with the unique best approximation
property is convex.

This allows to show the next proposition.

Proposition 13.11. Let L ∈ N, A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 , let σ : R → R be Lipschitz


continuous and not a polynomial, and let p ∈ (1, ∞).
Then, N (σ; A, ∞) ⊆ Lp ([−1, 1]d0 ) does not have the continuous selection property.

Proof. We observe from Theorem 13.6 and the discussion following it that, under the assumptions of this result, $\mathcal{N}(\sigma; A, \infty)$ is not convex.
By Theorem 13.10, we conclude that $\mathcal{N}(\sigma; A, \infty)$ does not have the unique best approximation property. Moreover, if the set $\mathcal{N}(\sigma; A, \infty)$ does not have the best approximation property, then it is obvious that it cannot have the continuous selection property. Thus, we can assume without loss of generality that $\mathcal{N}(\sigma; A, \infty)$ has the best approximation property and that there exist a point $h \in L^p([-1,1]^{d_0})$ and two different $\Phi_1, \Phi_2$ such that
$$
\|\Phi_1 - h\|_{L^p} = \|\Phi_2 - h\|_{L^p} = \inf_{\Phi^* \in \mathcal{N}(\sigma; A, \infty)} \|\Phi^* - h\|_{L^p}. \qquad (13.3.2)
$$
Note that (13.3.2) implies that $h \notin \mathcal{N}(\sigma; A, \infty)$.
Let us consider the following function:
$$
[-1,1] \ni \lambda \mapsto P(\lambda) = \begin{cases} (1+\lambda)h - \lambda\Phi_1 & \text{for } \lambda \le 0, \\ (1-\lambda)h + \lambda\Phi_2 & \text{for } \lambda \ge 0. \end{cases}
$$
It is clear that $P(\lambda)$ is a continuous path in $L^p$. Moreover, for $\lambda \in (-1, 0)$
$$
\|\Phi_1 - P(\lambda)\|_{L^p} = (1+\lambda)\|\Phi_1 - h\|_{L^p}.
$$
Assume towards a contradiction that there exists $\Phi^* \neq \Phi_1$ such that for some $\lambda \in (-1, 0)$
$$
\|\Phi^* - P(\lambda)\|_{L^p} \le \|\Phi_1 - P(\lambda)\|_{L^p}.
$$
Then
$$
\begin{aligned}
\|\Phi^* - h\|_{L^p} &\le \|\Phi^* - P(\lambda)\|_{L^p} + \|P(\lambda) - h\|_{L^p} \\
&\le \|\Phi_1 - P(\lambda)\|_{L^p} + \|P(\lambda) - h\|_{L^p} \\
&= (1+\lambda)\|\Phi_1 - h\|_{L^p} + |\lambda|\,\|\Phi_1 - h\|_{L^p} = \|\Phi_1 - h\|_{L^p}. \qquad (13.3.3)
\end{aligned}
$$
Since $\Phi_1$ is a best approximation to $h$, this implies that every inequality in the estimate above is an equality. Hence, we have that
$$
\|\Phi^* - h\|_{L^p} = \|\Phi^* - P(\lambda)\|_{L^p} + \|P(\lambda) - h\|_{L^p}.
$$
However, in a strictly convex space like $L^p([-1,1]^{d_0})$ for $p > 1$, this implies that
$$
\Phi^* - P(\lambda) = c\cdot(P(\lambda) - h)
$$
for a constant $c \neq 0$. This yields that
$$
\Phi^* = h + (c+1)\lambda\cdot(h - \Phi_1),
$$
and plugging into (13.3.3) yields $|(c+1)\lambda| = 1$. If $(c+1)\lambda = -1$, then we have $\Phi^* = \Phi_1$, which produces a contradiction. If $(c+1)\lambda = 1$, then
$$
\|\Phi^* - P(\lambda)\|_{L^p} = \|2h - \Phi_1 - (1+\lambda)h + \lambda\Phi_1\|_{L^p} = \|(1-\lambda)h - (1-\lambda)\Phi_1\|_{L^p} > \|P(\lambda) - \Phi_1\|_{L^p},
$$
which is another contradiction.
Hence, for every $\lambda < 0$ we have that $\Phi_1$ is the unique best approximation to $P(\lambda)$ in $\mathcal{N}(\sigma; A, \infty)$. The same argument holds for $\lambda > 0$ and $\Phi_2$. We conclude that for every selection function $\phi : L^p([-1,1]^{d_0}) \to \mathcal{N}(\sigma; A, \infty)$ such that $\Phi = \phi(h)$ satisfies (13.3.1) for all $h \in L^p([-1,1]^{d_0})$, it holds that
$$
\lim_{\lambda \downarrow 0}\phi(P(\lambda)) = \Phi_2 \neq \Phi_1 = \lim_{\lambda \uparrow 0}\phi(P(\lambda)).
$$
As a consequence, $\phi$ is not continuous, which shows the result.

13.3.2 Existence of best approximations
We have seen in Proposition 13.11 that under very mild assumptions, the continuous selection prop-
erty cannot hold. Moreover, the next result shows that in many cases, also the best approximation
property fails to be satisfied. We provide below a simplified version of [173, Theorem 3.1]. We also
refer to [69] for earlier work on this problem.

Proposition 13.12. Let $A = (1, 2, 1)$ and let $\sigma : \mathbb{R} \to \mathbb{R}$ be Lipschitz continuous. Additionally, assume that there exist $r > 0$ and $\alpha' \neq \alpha$ such that $\sigma$ is differentiable for all $|x| > r$, and $\sigma'(x) \to \alpha$ for $x \to \infty$, $\sigma'(x) \to \alpha'$ for $x \to -\infty$.
Then, there exists a sequence in $\mathcal{N}(\sigma; A, \infty)$ which converges in $L^p([-1,1])$ for every $p \in (1, \infty)$, and the limit of this sequence is discontinuous. In particular, the limit of the sequence does not lie in $\mathcal{N}(\sigma; A', \infty)$ for any $A'$.

Proof. For all $n \in \mathbb{N}$ let
$$
f_n(x) = \sigma(nx + 1) - \sigma(nx) \qquad \text{for all } x \in \mathbb{R}.
$$
Then $f_n$ can be written as a neural network with activation $\sigma$ and architecture $A = (1, 2, 1)$. Moreover, for $x > 0$ and $n$ large enough (such that $nx > r$), we observe with the fundamental theorem of calculus and integration by substitution that
$$
f_n(x) = \int_x^{x+1/n} n\sigma'(nz)\,dz = \int_{nx}^{nx+1}\sigma'(z)\,dz. \qquad (13.3.4)
$$
It is not hard to see that the right-hand side of (13.3.4) converges to $\alpha$ as $n \to \infty$.
Similarly, for $x < 0$, we observe that $f_n(x)$ converges to $\alpha'$ as $n \to \infty$. We conclude that
$$
f_n \to \alpha\mathbb{1}_{\mathbb{R}_+} + \alpha'\mathbb{1}_{\mathbb{R}_-}
$$
almost everywhere as $n \to \infty$. Since $\sigma$ is Lipschitz continuous, the sequence $(f_n)_{n\in\mathbb{N}}$ is uniformly bounded. Therefore, $f_n \to \alpha\mathbb{1}_{\mathbb{R}_+} + \alpha'\mathbb{1}_{\mathbb{R}_-}$ in $L^p([-1,1])$ for all $p \in [1, \infty)$ by the dominated convergence theorem.
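The following is a minimal numerical sketch of this construction (not part of the text). It assumes $\sigma$ is the softplus function $\sigma(x) = \log(1 + e^x)$, for which $\alpha = 1$ and $\alpha' = 0$, so that $f_n$ converges in $L^2([-1,1])$ to the discontinuous step function $\mathbb{1}_{x > 0}$ while the inner weight $n$ grows without bound, foreshadowing the exploding weights phenomenon of Section 13.3.3.

```python
import numpy as np

# Minimal numerical sketch of Proposition 13.12. Assumption (not from the text): sigma is
# softplus, sigma(x) = log(1 + exp(x)), so sigma'(x) -> 1 as x -> +inf and -> 0 as x -> -inf;
# the L^p-limit of f_n on [-1, 1] is then the discontinuous step function 1_{x > 0}.
softplus = lambda x: np.logaddexp(0.0, x)
x = np.linspace(-1.0, 1.0, 10001)
step = (x > 0).astype(float)                      # the limit alpha*1_{R+} + alpha'*1_{R-}

for n in [1, 10, 100, 1000]:
    f_n = softplus(n * x + 1) - softplus(n * x)   # a network from N(sigma; (1,2,1), infinity)
    l2_err = np.sqrt(np.mean((f_n - step) ** 2) * 2.0)   # Riemann-sum approximation on [-1,1]
    # the L^2([-1,1]) error tends to zero while the inner weight n blows up
    print(f"n = {n:5d}   ||f_n - step||_L2 ~ {l2_err:.4f}   largest weight = {n}")
```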

There is a straightforward extension of Proposition 13.12 to arbitrary architectures, which is the content of Exercises 13.16 and 13.17.
Remark 13.13. The proof of Theorem 13.12 does not extend to the L∞ norm. This, of course, does
not mean that generally N (σ; A, ∞) is a closed set in L∞ ([−1, 1]d0 ). In fact, almost all activation
functions used in practice still give rise to non-closed neural network sets, see [173, Theorem 3.3].
However, there is one notable exception. For the ReLU activation function, it can be shown that
N (σReLU ; A, ∞) is a closed set in L∞ ([−1, 1]d0 ) if A has only one hidden layer. The closedness of
deep ReLU spaces in L∞ is an open problem.

13.3.3 Exploding weights phenomenon

Finally, we discuss one of the consequences of the non-existence of best approximations described in Proposition 13.12.
Consider a regression problem, where we aim to learn a function $f$ using neural networks with a fixed architecture $\mathcal{N}(\sigma; A, \infty)$. As discussed in Chapters 10 and 11, we wish to produce a sequence of neural networks $(\Phi_n)_{n=1}^\infty$ such that the risk defined in (1.2.4) converges to 0. If the loss $L$ is the square loss, $\mu$ is a probability measure on $[-1,1]^{d_0}$, and the data is given by $(x, f(x))$ for $x \sim \mu$, then
$$
\mathcal{R}(\Phi_n) = \|\Phi_n - f\|_{L^2([-1,1]^{d_0},\mu)}^2 = \int_{[-1,1]^{d_0}} |\Phi_n(x) - f(x)|^2\,d\mu(x) \to 0 \qquad \text{for } n \to \infty. \qquad (13.3.5)
$$

According to Proposition 13.12, for a given $A$ and an activation function $\sigma$, it is possible that (13.3.5) holds but $f \notin \mathcal{N}(\sigma; A, \infty)$. The following result shows that in this situation the weights of $\Phi_n$ diverge.

Proposition 13.14. Let $A = (d_0, d_1, \dots, d_{L+1}) \in \mathbb{N}^{L+2}$, let $\sigma : \mathbb{R} \to \mathbb{R}$ be $C_\sigma$-Lipschitz continuous with $C_\sigma \ge 1$ and $|\sigma(x)| \le C_\sigma|x|$ for all $x \in \mathbb{R}$, and let $\mu$ be a measure on $[-1,1]^{d_0}$.
Assume that there exist a sequence $\Phi_n \in \mathcal{N}(\sigma; A, \infty)$ and $f \in L^2([-1,1]^{d_0},\mu)\setminus\mathcal{N}(\sigma; A, \infty)$ such that
$$
\|\Phi_n - f\|_{L^2([-1,1]^{d_0},\mu)}^2 \to 0. \qquad (13.3.6)
$$
Then
$$
\limsup_{n\to\infty}\max\Big\{\|W_n^{(\ell)}\|_\infty,\ \|b_n^{(\ell)}\|_\infty \;\Big|\; \ell = 0, \dots, L\Big\} = \infty. \qquad (13.3.7)
$$

Proof. We assume towards a contradiction that the left-hand side of (13.3.7) is finite. As a result,
there exists C > 0 such that Φn ∈ N (σ; A, C) for all n ∈ N.
By Proposition 13.1, we conclude that N (σ; A, C) is the image of a compact set under a continu-
ous map and hence is itself a compact set in L2 ([−1, 1]d0 , µ). In particular, we have that N (σ; A, C)
is closed. Hence, (13.3.6) implies f ∈ N (σ; A, C). This gives a contradiction.

Proposition 13.14 can be extended to all f for which there is no best approximation in N (σ; A, ∞),
see Exercise 13.18. The results imply that for functions we wish to learn that lack a best approxima-
tion within a neural network set, we must expect the weights of the approximating neural networks
to grow to infinity. This can be undesirable because, as we will see in the following sections on
generalization, a bounded parameter space facilitates many generalization bounds.

Bibliography and further reading


The properties of neural network sets were first studied with a focus on the continuous approxima-
tion property in [111, 113, 112] and [69]. The results in [111, 112, 113] already use the non-convexity of sets of shallow neural networks. The results on convexity and closedness presented in this chapter follow mostly the arguments of [173]. Similar results were also derived for other norms in [139].

Exercises
Exercise 13.15. Prove Proposition 13.5.

Exercise 13.16. Extend Proposition 13.12 to A = (d0 , d1 , 1) for arbitrary d0 , d1 ∈ N, d1 ≥ 2.

Exercise 13.17. Use Proposition 3.16, to extend Proposition 13.12 to arbitrary depth.

Exercise 13.18. Extend Proposition 13.14 to functions $f$ for which there is no best approximation in $\mathcal{N}(\sigma; A, \infty)$. To do this, replace (13.3.6) by
$$
\|\Phi_n - f\|_{L^2}^2 \to \inf_{\Phi \in \mathcal{N}(\sigma; A, \infty)} \|\Phi - f\|_{L^2}^2.
$$

Chapter 14

Generalization properties of deep


neural networks

As discussed in the introduction in Section 1.2, we generally learn based on a finite data set. For
example, given data (xi , yi )m
i=1 , we try to find a network Φ that satisfies Φ(xi ) = yi for i = 1, . . . , m.
The field of generalization is concerned with how well such Φ performs on unseen data, which refers
to any x outside of training data {x1 , . . . , xm }. In this chapter we discuss generalization through
the use of covering numbers.
In Sections 14.1 and 14.2 we revisit and formalize the general setup of learning and empirical risk
minimization in a general context. Although some notions introduced in these sections have already
appeared in the previous chapters, we reintroduce them here for a more coherent presentation. In
Sections 14.3-14.5, we first discuss the concepts of generalization bounds and covering numbers,
and then apply these arguments specifically to neural networks. In Section 14.6 we explore the
so-called “approximation-complexity trade-off”, and finally in Sections 14.7-14.8 we introduce the
“VC dimension” and give some implications for classes of neural networks.

14.1 Learning setup


A general learning problem [148, 212, 43] requires a feature space $X$ and a label space $Y$, which we assume throughout to be measurable spaces. We observe joint data pairs $(x_i, y_i)_{i=1}^m \subseteq X \times Y$ and aim to identify a connection between the $x$ and $y$ variables. Specifically, we assume a relationship between features $x$ and labels $y$ modeled by a probability distribution $D$ over $X \times Y$, which generated the observed data $(x_i, y_i)_{i=1}^m$. While this distribution is unknown, our goal is to extract information from it, so that we can make good predictions of $y$ for a given $x$. Importantly, the relationship between $x$ and $y$ need not be deterministic.
To make these concepts more concrete, we next present an example that will serve as the running
example throughout this chapter. This example is of high relevance for many mathematicians,
as ensuring a steady supply of high-quality coffee is essential for maximizing the output of our
mathematical work.
Example 14.1 (Coffee Quality). Our goal is to determine the quality of different coffees. To this
end we model the quality as a number in
$$
Y = \Big\{\frac{0}{10}, \frac{1}{10}, \dots, \frac{10}{10}\Big\},
$$

Figure 14.1: Collection of coffee data. The last row lacks a “Quality” label. Our aim is to predict
the label without the need for an (expensive) taste test.

with higher numbers indicating better quality. Let us assume that our subjective assessment of
quality of coffee is related to six features: “Acidity”, “Caffeine content”, “Price”, “Aftertaste”,
“Roast level”, and “Origin”. The feature space X thus corresponds to the set of six-tuples describing
these attributes, which can be either numeric or categorical (see Figure 14.1).
We aim to understand the relationship between elements of X and elements of Y , but we can
neither afford, nor do we have the time to taste all the coffees in the world. Instead, we can sample
some coffees, taste them, and grow our database accordingly as depicted in Figure 14.1. This way
we obtain samples of pairs in X × Y . The distribution D from which they are drawn depends on
various external factors. For instance, we might have avoided particularly cheap coffees, believing
them to be inferior. As a result they do not occur in our database. Moreover, if a colleague
contributes to our database, he might have tried the same brand and arrived at a different rating.
In this case, the quality label is not deterministic anymore.
Based on our database, we wish to predict the quality of an untasted coffee. Before proceeding,
we first formalize what it means to be a “good” prediction.
Characterizing how good a predictor is requires a notion of discrepancy in the label space. This
is the purpose of the so-called loss function, which is a measurable mapping L : Y × Y → R+ .

Definition 14.2. Let L : Y × Y → R+ be a loss function and let D be a distribution on X × Y .


For a measurable function h : X → Y we call

R(h) = E(x,y)∼D [L(h(x), y)]

the (population) risk of h.

Based on the risk, we can now formalize what we consider a good predictor. The best predictor is one whose risk is as close as possible to the smallest risk that any function can achieve. More precisely, we would like a risk that is close to the so-called Bayes risk
$$
R^* := \inf_{h : X \to Y} R(h), \qquad (14.1.1)
$$
where the infimum is taken over all measurable $h : X \to Y$.

Example 14.3 (Loss functions). The choice of a loss function $L$ usually depends on the application. For a regression problem, i.e., a learning problem where $Y$ is a non-discrete subset of a Euclidean space, a common choice is the square loss $L_2(y, y') = \|y - y'\|^2$.
For binary classification problems, i.e., when $Y$ is a discrete set of cardinality two, the "0-1 loss"
$$
L_{0\text{-}1}(y, y') = \begin{cases} 1 & y \neq y', \\ 0 & y = y', \end{cases}
$$
is a common choice.
Another frequently used loss for binary classification, especially when we want to predict probabilities (i.e., if $Y = [0,1]$ but all labels are binary), is the binary cross-entropy loss
$$
L_{\mathrm{ce}}(y, y') = -\big(y\log(y') + (1-y)\log(1-y')\big).
$$
In contrast to the 0-1 loss, the cross-entropy loss is differentiable, which is desirable in deep learning as we saw in Chapter 10.
In the coffee quality prediction problem, the quality is given as a fraction of the form $k/10$ for $k = 0, \dots, 10$. While this is a discrete set, it makes sense to penalize predictions that are wrong by a larger amount more heavily. For example, predicting $4/10$ instead of $8/10$ should produce a higher loss than predicting $7/10$. Hence, we would not use the 0-1 loss but, for example, the square loss.
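The following minimal sketch (not part of the text) implements the three loss functions and illustrates, for the coffee example, why the square loss is preferable to the 0-1 loss.

```python
import numpy as np

# Minimal sketch of the loss functions from Example 14.3 (vectorized with NumPy).
def square_loss(y, y_pred):
    return (y - y_pred) ** 2

def zero_one_loss(y, y_pred):
    return (y != y_pred).astype(float)

def binary_cross_entropy(y, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)          # avoid log(0)
    return -(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

# Coffee example: larger mistakes should cost more, which the square loss reflects
y_true = np.array([0.8, 0.8])
y_pred = np.array([0.4, 0.7])
print(square_loss(y_true, y_pred))     # [0.16 0.01] -- predicting 4/10 is penalized more
print(zero_one_loss(y_true, y_pred))   # [1. 1.]     -- the 0-1 loss cannot distinguish them
```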
How do we find a function h : X → Y with a risk that is as close as possible to the Bayes risk?
We will introduce a procedure to tackle this task in the next section.

14.2 Empirical risk minimization


Finding a minimizer of the risk constitutes a considerable challenge. First, we cannot search through
all measurable functions. Therefore, we need to restrict ourselves to a specific set H ⊆ {h : X → Y }
called the hypothesis set. In the following, this set will be some set of neural networks. Second,
we are faced with the problem that we cannot evaluate R(h) for non-trivial loss functions, because
the distribution D is unknown. To approximate the risk, we will assume access to an i.i.d. sample
of m observations drawn from D. This is precisely the situation described in the coffee quality
example of Figure 14.1, where m = 6 coffees were sampled.1 So for a given hypothesis h we can
check how well it performs on our sampled data. We call the error on the sample the empirical
risk.

Definition 14.4. Let $m \in \mathbb{N}$, let $L : Y \times Y \to \mathbb{R}$ be a loss function, and let $S = (x_i, y_i)_{i=1}^m \in (X \times Y)^m$ be a sample. For $h : X \to Y$, we call
$$
\widehat{R}_S(h) = \frac{1}{m}\sum_{i=1}^m L(h(x_i), y_i)
$$
the empirical risk of $h$.

¹In practice, the assumption of independence of the samples is often unclear and typically not satisfied. For instance, the selection of the six previously tested coffees might be influenced by external factors such as personal preferences or availability at the local store, which introduce bias into the dataset.

If the sample $S$ is drawn i.i.d. according to $D$, then we immediately see from the linearity of the expected value that $\widehat{R}_S(h)$ is an unbiased estimator of $R(h)$, i.e., $\mathbb{E}_{S\sim D^m}[\widehat{R}_S(h)] = R(h)$. Moreover, the weak law of large numbers states that the sample mean of an i.i.d. sequence of integrable random variables converges to the expected value in probability. Hence, there is some hope that, at least for large $m \in \mathbb{N}$, minimizing the empirical risk instead of the population risk might lead to a good hypothesis. We formalize this approach in the next definition.

Definition 14.5. Let $H \subseteq \{h : X \to Y\}$ be a hypothesis set. Let $m \in \mathbb{N}$, let $L : Y \times Y \to \mathbb{R}$ be a loss function, and let $S = (x_i, y_i)_{i=1}^m \in (X \times Y)^m$ be a sample. We call a function $h_S \in H$ such that
$$
\widehat{R}_S(h_S) = \inf_{h \in H} \widehat{R}_S(h) \qquad (14.2.1)
$$
an empirical risk minimizer.

From a generalization perspective, deep learning is empirical risk minimization over sets of
neural networks. The question we want to address next is how effective this approach is at producing
hypotheses that achieve a risk close to the Bayes risk.
Let $H$ be some hypothesis set such that an empirical risk minimizer $h_S$ exists for all $S \in (X \times Y)^m$; see Exercise 14.25 for an explanation of why this is a reasonable assumption. Moreover, let $h^* \in H$ be arbitrary. Then
$$
\begin{aligned}
R(h_S) - R^* &= R(h_S) - \widehat{R}_S(h_S) + \widehat{R}_S(h_S) - R^* \\
&\le |R(h_S) - \widehat{R}_S(h_S)| + \widehat{R}_S(h^*) - R^* \\
&\le 2\sup_{h\in H}|R(h) - \widehat{R}_S(h)| + R(h^*) - R^*, \qquad (14.2.2)
\end{aligned}
$$
where in the first inequality we used that $h_S$ is the empirical risk minimizer. By taking the infimum over all $h^* \in H$, we conclude that
$$
R(h_S) - R^* \le 2\sup_{h\in H}|R(h) - \widehat{R}_S(h)| + \inf_{h\in H}R(h) - R^* =: 2\varepsilon_{\mathrm{gen}} + \varepsilon_{\mathrm{approx}}. \qquad (14.2.3)
$$
Similarly, considering only (14.2.2), we obtain that
$$
R(h_S) \le \sup_{h\in H}|R(h) - \widehat{R}_S(h)| + \inf_{h\in H}\widehat{R}_S(h) =: \varepsilon_{\mathrm{gen}} + \varepsilon_{\mathrm{int}}. \qquad (14.2.4)
$$
How to choose $H$ to reduce the approximation error $\varepsilon_{\mathrm{approx}}$ or the interpolation error $\varepsilon_{\mathrm{int}}$ was discussed at length in the previous chapters. The final piece is to figure out how to bound the generalization error $\sup_{h\in H}|R(h) - \widehat{R}_S(h)|$. This will be discussed in the sections below.

14.3 Generalization bounds


We have seen that one aspect of successful learning is to bound the generalization error εgen in
(14.2.3). Let us first formally describe this problem.

Definition 14.6 (Generalization bound). Let $H \subseteq \{h : X \to Y\}$ be a hypothesis set, and let $L : Y \times Y \to \mathbb{R}$ be a loss function. Let $\kappa : (0,1)\times\mathbb{N} \to \mathbb{R}_+$ be such that for every $\delta \in (0,1)$ it holds that $\kappa(\delta, m) \to 0$ as $m \to \infty$. We call $\kappa$ a generalization bound for $H$ if for every distribution $D$ on $X \times Y$, every $m \in \mathbb{N}$, and every $\delta \in (0,1)$, it holds with probability at least $1-\delta$ over the random sample $S \sim D^m$ that
$$
\sup_{h\in H}|R(h) - \widehat{R}_S(h)| \le \kappa(\delta, m).
$$

Remark 14.7. For a generalization bound $\kappa$ it holds that
$$
\mathbb{P}\Big[R(h_S) - \widehat{R}_S(h_S) \le \varepsilon\Big] \ge 1 - \delta
$$
as soon as $m$ is so large that $\kappa(\delta, m) \le \varepsilon$. If there exists an empirical risk minimizer $h_S$ such that $\widehat{R}_S(h_S) = 0$, then with high probability the empirical risk minimizer will also have a small risk $R(h_S)$. Empirical risk minimization is often referred to as a "PAC" algorithm, which stands for probably ($\delta$) approximately correct ($\varepsilon$).
Definition 14.6 requires the upper bound κ on the discrepancy between the empirical risk and
the risk to be independent from the distribution D. Why should this be possible? After all, we could
have an underlying distribution that is not uniform and hence, certain data points could appear
very rarely in the sample. As a result, it should be very hard to produce a correct prediction
for such points. At first sight, this suggests that non-uniform distributions should be much more
challenging than uniform distributions. This intuition is incorrect, as the following argument based
on Example 14.1 demonstrates.

Example 14.8 (Generalization in the coffee quality problem). In Example 14.1, the underlying
distribution describes both our process of choosing coffees and the relation between the attributes
and the quality. Suppose we do not enjoy drinking coffee that costs less than 1€. Consequently, we do not have a single sample of such coffee in the dataset, and therefore we have no chance of learning the quality of cheap coffees.
However, the absence of coffee samples costing less than 1€ in our dataset is due to our general
avoidance of such coffee. As a result, we run a low risk of incorrectly classifying the quality of a
coffee that is cheaper than 1€, since it is unlikely that we will choose such a coffee in the future.

To establish generalization bounds, we use stochastic tools that guarantee that the empirical
risk converges to the true risk as the sample size increases. This is typically achieved through
concentration inequalities. One of the simplest and most well-known is Hoeffding’s inequality, see
Theorem A.24. We will now apply Hoeffding’s inequality to obtain a first generalization bound.
This generalization bound is well-known and can be found in many textbooks on machine learning,
e.g., [148, 212]. Although the result does not yet encompass neural networks, it forms the basis for
a similar result applicable to neural networks, as we discuss subsequently.

Proposition 14.9 (Finite hypothesis set). Let $H \subseteq \{h : X \to Y\}$ be a finite hypothesis set. Let $L : Y \times Y \to \mathbb{R}$ be such that $L(Y \times Y) \subseteq [c_1, c_2]$ with $c_2 - c_1 = C > 0$.
Then, for every $m \in \mathbb{N}$ and every distribution $D$ on $X \times Y$, it holds with probability at least $1-\delta$ over the sample $S \sim D^m$ that
$$
\sup_{h\in H}|R(h) - \widehat{R}_S(h)| \le C\sqrt{\frac{\log(|H|) + \log(2/\delta)}{2m}}.
$$

Proof. Let $H = \{h_1, \dots, h_n\}$. Then it holds by a union bound that
$$
\mathbb{P}\Big[\exists h_i \in H : |R(h_i) - \widehat{R}_S(h_i)| > \varepsilon\Big] \le \sum_{i=1}^n \mathbb{P}\Big[|R(h_i) - \widehat{R}_S(h_i)| > \varepsilon\Big].
$$
Note that $\widehat{R}_S(h_i)$ is the mean of independent random variables which take their values almost surely in the interval $[c_1, c_2]$ of length $C$. Additionally, $R(h_i)$ is the expectation of $\widehat{R}_S(h_i)$. The proof can therefore be finished by applying Theorem A.24. This will be addressed in Exercise 14.26.
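The following minimal Monte Carlo sketch (not part of the text) illustrates Proposition 14.9 for a concrete finite hypothesis set of threshold classifiers on $X = [0,1]$ with the 0-1 loss (so $C = 1$); the fraction of samples on which the uniform deviation exceeds the bound stays below $\delta$.

```python
import numpy as np

# Minimal Monte Carlo sketch of Proposition 14.9 (assumptions: X = [0,1], Y = {0,1},
# a finite hypothesis set of threshold classifiers h_t(x) = 1_{x > t}, 0-1 loss, C = 1).
rng = np.random.default_rng(0)
thresholds = np.linspace(0.05, 0.95, 19)            # |H| = 19 hypotheses
m, delta, trials = 500, 0.1, 2000

def risk_true(t):                                    # D: x ~ U(0,1), y = 1_{x > 0.5}
    return abs(t - 0.5)                              # P[h_t(x) != y] = |t - 0.5|

bound = 1.0 * np.sqrt((np.log(len(thresholds)) + np.log(2 / delta)) / (2 * m))
exceed = 0
for _ in range(trials):
    x = rng.uniform(size=m)
    y = (x > 0.5).astype(float)
    emp = np.array([np.mean((x > t).astype(float) != y) for t in thresholds])
    true = np.array([risk_true(t) for t in thresholds])
    exceed += np.max(np.abs(true - emp)) > bound
print(f"bound = {bound:.3f}, empirical exceedance rate = {exceed / trials:.3f} (<= delta = {delta})")
```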

Consider now a non-finite set of neural networks $H$, and assume that it can be covered by a finite set of (small) balls. Applying Proposition 14.9 to the centers of these balls then allows us to derive a similar bound for $H$. This intuitive argument will be made rigorous in the following section.

14.4 Generalization bounds from covering numbers


To derive a generalization bound for classes of neural networks, we start by introducing the notion of covering numbers.

Definition 14.10. Let $A$ be a relatively compact subset of a metric space $(X, d)$. For $\varepsilon > 0$, we call
$$
\mathcal{G}(A, \varepsilon, (X,d)) := \min\Big\{m \in \mathbb{N} \;\Big|\; \exists (x_i)_{i=1}^m \subseteq X \text{ s.t. } \bigcup_{i=1}^m B_\varepsilon(x_i) \supseteq A\Big\},
$$
where $B_\varepsilon(x) = \{z \in X \mid d(z, x) \le \varepsilon\}$, the $\varepsilon$-covering number of $A$ in $X$. In case $X$ or $d$ are clear from context, we also write $\mathcal{G}(A, \varepsilon, d)$ or $\mathcal{G}(A, \varepsilon, X)$ instead of $\mathcal{G}(A, \varepsilon, (X,d))$.

A visualization of Definition 14.10 is given in Figure 14.2. As we will see, it is possible to upper bound the $\varepsilon$-covering numbers of neural networks as a subset of $L^\infty([0,1]^d)$, assuming the weights are confined to a fixed bounded set. The precise estimates are postponed to Section 14.5. Before that, let us show how a finite covering number facilitates a generalization bound. We only consider Euclidean feature spaces $X$ in the following result. A more general version could easily be derived.
Figure 14.2: Illustration of the concept of covering numbers of Definition 14.10. The shaded set $A \subseteq \mathbb{R}^2$ is covered by sixteen Euclidean balls of radius $\varepsilon$. Therefore, $\mathcal{G}(A, \varepsilon, \mathbb{R}^2) \le 16$.

Theorem 14.11. Let $C_Y, C_L > 0$ and $\alpha > 0$. Let $Y \subseteq [-C_Y, C_Y]$, $X \subseteq \mathbb{R}^d$ for some $d \in \mathbb{N}$, and $H \subseteq \{h : X \to Y\}$. Further, let $L : Y \times Y \to \mathbb{R}$ be $C_L$-Lipschitz.
Then, for every distribution $D$ on $X \times Y$ and every $m \in \mathbb{N}$, it holds with probability at least $1-\delta$ over the sample $S \sim D^m$ that for all $h \in H$
$$
|R(h) - \widehat{R}_S(h)| \le 4C_Y C_L\sqrt{\frac{\log(\mathcal{G}(H, m^{-\alpha}, L^\infty(X))) + \log(2/\delta)}{m}} + \frac{2C_L}{m^\alpha}.
$$

Proof. Let
$$
M = \mathcal{G}(H, m^{-\alpha}, L^\infty(X)) \qquad (14.4.1)
$$
and let $H_M = (h_i)_{i=1}^M \subseteq H$ be such that for every $h \in H$ there exists $h_i \in H_M$ with $\|h - h_i\|_{L^\infty(X)} \le 1/m^\alpha$. The existence of $H_M$ follows from Definition 14.10.
Fix for the moment such $h \in H$ and $h_i \in H_M$. By the reverse and ordinary triangle inequalities, we have
$$
|R(h) - \widehat{R}_S(h)| - |R(h_i) - \widehat{R}_S(h_i)| \le |R(h) - R(h_i)| + |\widehat{R}_S(h) - \widehat{R}_S(h_i)|.
$$
Moreover, from the monotonicity of the expected value and the Lipschitz property of $L$ it follows that
$$
|R(h) - R(h_i)| \le \mathbb{E}|L(h(x), y) - L(h_i(x), y)| \le C_L\,\mathbb{E}|h(x) - h_i(x)| \le \frac{C_L}{m^\alpha}.
$$
A similar estimate yields $|\widehat{R}_S(h) - \widehat{R}_S(h_i)| \le C_L/m^\alpha$.
We thus conclude that for every $\varepsilon > 0$
$$
\mathbb{P}_{S\sim D^m}\Big[\exists h \in H : |R(h) - \widehat{R}_S(h)| \ge \varepsilon\Big]
\le \mathbb{P}_{S\sim D^m}\Big[\exists h_i \in H_M : |R(h_i) - \widehat{R}_S(h_i)| \ge \varepsilon - \frac{2C_L}{m^\alpha}\Big]. \qquad (14.4.2)
$$
From Proposition 14.9, we know that for $\varepsilon > 0$ and $\delta \in (0,1)$
$$
\mathbb{P}_{S\sim D^m}\Big[\exists h_i \in H_M : |R(h_i) - \widehat{R}_S(h_i)| \ge \varepsilon - \frac{2C_L}{m^\alpha}\Big] \le \delta \qquad (14.4.3)
$$
as long as
$$
\varepsilon - \frac{2C_L}{m^\alpha} > C\sqrt{\frac{\log(M) + \log(2/\delta)}{2m}},
$$
where $C$ is such that $L(Y\times Y) \subseteq [c_1, c_2]$ with $c_2 - c_1 \le C$. By the Lipschitz property of $L$ we can choose $C = 2\sqrt{2}\,C_L C_Y$.
Therefore, the definition of $M$ in (14.4.1) together with (14.4.2) and (14.4.3) gives that with probability at least $1-\delta$ it holds for all $h \in H$ that
$$
|R(h) - \widehat{R}_S(h)| \le 2\sqrt{2}\,C_L C_Y\sqrt{\frac{\log(\mathcal{G}(H, m^{-\alpha}, L^\infty)) + \log(2/\delta)}{2m}} + \frac{2C_L}{m^\alpha}.
$$
This concludes the proof.

14.5 Covering numbers of deep neural networks


As we have seen in Theorem 14.11, estimating $L^\infty$-covering numbers is crucial for understanding the generalization error. How can we determine these covering numbers? The set of neural networks of a fixed architecture can be a quite complex set (see Chapter 13), so it is not immediately clear how to cover it with balls, let alone know the number of required balls. The following lemma suggests a simpler approach.

Lemma 14.12. Let $X_1, X_2$ be two metric spaces and let $f : X_1 \to X_2$ be Lipschitz continuous with Lipschitz constant $C_{\mathrm{Lip}}$. For every relatively compact $A \subseteq X_1$ and every $\varepsilon > 0$ it holds that
$$
\mathcal{G}(f(A), C_{\mathrm{Lip}}\varepsilon, X_2) \le \mathcal{G}(A, \varepsilon, X_1).
$$

The proof of Lemma 14.12 is left as an exercise. If we can represent the set of neural networks as the image, under a Lipschitz map, of another set with known covering numbers, then Lemma 14.12 gives a direct way to bound the covering number of the neural network class.
Conveniently, we have already observed in Proposition 13.1 that the set of neural networks is the image of $PN(A, B)$ as in Definition 12.1 under the Lipschitz continuous realization map $R_\sigma$. It thus suffices to establish the $\varepsilon$-covering number of $PN(A, B)$ or, equivalently, of $[-B,B]^{n_A}$. Then, using the Lipschitz property of $R_\sigma$ that holds by Proposition 13.1, we can apply Lemma 14.12 to find the covering numbers of $\mathcal{N}(\sigma; A, B)$. This idea is depicted in Figure 14.3.

Figure 14.3: Illustration of the main idea to deduce covering numbers of neural network spaces.
Points θ ∈ R2 in parameter space in the left figure correspond to functions Rσ (θ) in the right figure
(with matching colors). By Lemma 14.12, a covering of the parameter space on the left translates
to a covering of the function space on the right.

Proposition 14.13. Let $B, \varepsilon > 0$ and $q \in \mathbb{N}$. Then
$$
\mathcal{G}\big([-B,B]^q, \varepsilon, (\mathbb{R}^q, \|\cdot\|_\infty)\big) \le \lceil B/\varepsilon\rceil^q.
$$

Proof. We start with the one-dimensional case $q = 1$. We choose $k = \lceil B/\varepsilon\rceil$ and
$$
x_0 = -B + \varepsilon \qquad\text{and}\qquad x_j = x_{j-1} + 2\varepsilon \quad\text{for } j = 1, \dots, k-1.
$$
It is clear that all points between $-B$ and $x_{k-1}$ have distance at most $\varepsilon$ to one of the $x_j$. Also, $x_{k-1} = -B + \varepsilon + 2(k-1)\varepsilon \ge B - \varepsilon$. We conclude that $\mathcal{G}([-B,B], \varepsilon, \mathbb{R}) \le \lceil B/\varepsilon\rceil$. Set $X_k := \{x_0, \dots, x_{k-1}\}$.
For arbitrary $q$, we observe that for every $x \in [-B,B]^q$ there is an element in $X_k^q = \times_{j=1}^q X_k$ with $\|\cdot\|_\infty$-distance at most $\varepsilon$. Clearly, $|X_k^q| = \lceil B/\varepsilon\rceil^q$, which completes the proof.
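The following minimal sketch (not part of the text) builds the covering from the proof for a small example and checks empirically that every point of $[-B,B]^q$ lies within $\varepsilon$ of one of the $\lceil B/\varepsilon\rceil^q$ centers.

```python
import numpy as np
from itertools import product

# Minimal sketch of the covering construction in the proof of Proposition 14.13:
# a grid of ceil(B/eps)^q points whose sup-norm balls of radius eps cover [-B, B]^q.
B, eps, q = 1.0, 0.4, 2
k = int(np.ceil(B / eps))
X_k = -B + eps + 2 * eps * np.arange(k)            # x_0, ..., x_{k-1} as in the proof
centers = np.array(list(product(X_k, repeat=q)))   # |X_k^q| = k^q covering centers
print("number of centers:", len(centers), "=", k, "^", q)

# empirical check: random points of [-B, B]^q are within eps of some center (sup-norm)
pts = np.random.default_rng(0).uniform(-B, B, size=(10000, q))
dist = np.max(np.abs(pts[:, None, :] - centers[None, :, :]), axis=2).min(axis=1)
print("largest distance to nearest center:", dist.max(), "<= eps =", eps)
```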

Having established a covering number for $[-B,B]^{n_A}$ and hence for $PN(A, B)$, we can now estimate the covering numbers of deep neural networks by combining Lemma 14.12 with Propositions 13.1 and 14.13.

Theorem 14.14. Let $A = (d_0, d_1, \dots, d_{L+1}) \in \mathbb{N}^{L+2}$, let $\sigma : \mathbb{R} \to \mathbb{R}$ be $C_\sigma$-Lipschitz continuous with $C_\sigma \ge 1$, let $|\sigma(x)| \le C_\sigma|x|$ for all $x \in \mathbb{R}$, and let $B \ge 1$. Then
$$
\mathcal{G}\big(\mathcal{N}(\sigma; A, B), \varepsilon, L^\infty([0,1]^{d_0})\big) \le \mathcal{G}\big([-B,B]^{n_A}, \varepsilon/(2C_\sigma B d_{\max})^L, (\mathbb{R}^{n_A}, \|\cdot\|_\infty)\big) \le \lceil n_A/\varepsilon\rceil^{n_A}\,\lceil 2C_\sigma B d_{\max}\rceil^{n_A L}.
$$

We end this section by applying the previous theorem to the generalization bound of Theorem 14.11 with $\alpha = 1/2$. To simplify the analysis, we restrict the discussion to neural networks with range $[-1,1]$. To this end, denote
$$
\mathcal{N}^*(\sigma; A, B) := \big\{\Phi \in \mathcal{N}(\sigma; A, B) \;\big|\; \Phi(x) \in [-1,1] \text{ for all } x \in [0,1]^{d_0}\big\}. \qquad (14.5.1)
$$
Since $\mathcal{N}^*(\sigma; A, B) \subseteq \mathcal{N}(\sigma; A, B)$, we can bound the covering numbers of $\mathcal{N}^*(\sigma; A, B)$ by those of $\mathcal{N}(\sigma; A, B)$. This yields the following result.

Theorem 14.15. Let $C_L > 0$ and let $L : [-1,1]\times[-1,1] \to \mathbb{R}$ be $C_L$-Lipschitz continuous. Further, let $A = (d_0, d_1, \dots, d_{L+1}) \in \mathbb{N}^{L+2}$, let $\sigma : \mathbb{R} \to \mathbb{R}$ be $C_\sigma$-Lipschitz continuous with $C_\sigma \ge 1$ and $|\sigma(x)| \le C_\sigma|x|$ for all $x \in \mathbb{R}$, and let $B \ge 1$.
Then, for every $m \in \mathbb{N}$ and every distribution $D$ on $X \times [-1,1]$, it holds with probability at least $1-\delta$ over $S \sim D^m$ that for all $\Phi \in \mathcal{N}^*(\sigma; A, B)$
$$
|R(\Phi) - \widehat{R}_S(\Phi)| \le 4C_L\sqrt{\frac{n_A\log(\lceil n_A\sqrt{m}\rceil) + L n_A\log(\lceil 2C_\sigma B d_{\max}\rceil) + \log(2/\delta)}{m}} + \frac{2C_L}{\sqrt{m}}.
$$
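The following minimal sketch (not part of the text) evaluates the right-hand side of this bound for a concrete architecture, assuming $C_L = C_\sigma = B = 1$ and $\delta = 0.05$; it illustrates how many samples are needed before the bound becomes non-trivial.

```python
import numpy as np

# Minimal sketch: evaluating the generalization bound of Theorem 14.15 for a concrete
# architecture (assumptions: C_L = 1, C_sigma = 1 as for ReLU/tanh, B = 1, delta = 0.05).
def n_A(dims):                                   # number of parameters for A = dims
    return sum(dims[l + 1] * (dims[l] + 1) for l in range(len(dims) - 1))

def gen_bound(dims, m, C_L=1.0, C_sigma=1.0, B=1.0, delta=0.05):
    L = len(dims) - 2
    nA, d_max = n_A(dims), max(dims)
    log_cover = nA * np.log(np.ceil(nA * np.sqrt(m))) \
                + L * nA * np.log(np.ceil(2 * C_sigma * B * d_max))
    return 4 * C_L * np.sqrt((log_cover + np.log(2 / delta)) / m) + 2 * C_L / np.sqrt(m)

A = (10, 50, 50, 1)                              # d_0 = 10, two hidden layers of width 50
for m in [10**3, 10**5, 10**7, 10**9]:
    print(f"m = {m:>10d}   bound = {gen_bound(A, m):.3f}")
# The bound only becomes non-trivial (well below 1) once m greatly exceeds n_A log(n_A).
```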

14.6 The approximation-complexity trade-off


We recall the decomposition of the error in (14.2.3),
$$
R(h_S) - R^* \le 2\varepsilon_{\mathrm{gen}} + \varepsilon_{\mathrm{approx}},
$$
where $R^*$ is the Bayes risk defined in (14.1.1). We make the following observations about the approximation error $\varepsilon_{\mathrm{approx}}$ and the generalization error $\varepsilon_{\mathrm{gen}}$ in the context of neural network based learning:

• Scaling of the generalization error: By Theorem 14.15, for a hypothesis class $H$ of neural networks with $n_A$ weights and $L$ layers, and for a sample of size $m \in \mathbb{N}$, the generalization error $\varepsilon_{\mathrm{gen}}$ essentially scales like
$$
\varepsilon_{\mathrm{gen}} = O\Big(\sqrt{\big(n_A\log(n_A m) + L n_A\log(n_A)\big)/m}\Big) \qquad \text{as } m \to \infty.
$$

• Scaling of the approximation error: Assume there exists $h^*$ such that $R(h^*) = R^*$, and let the loss function $L$ be Lipschitz continuous in the first coordinate. Then
$$
\varepsilon_{\mathrm{approx}} = \inf_{h\in H}R(h) - R(h^*) = \inf_{h\in H}\mathbb{E}_{(x,y)\sim D}\big[L(h(x), y) - L(h^*(x), y)\big] \le C\inf_{h\in H}\|h - h^*\|_{L^\infty},
$$
for some constant $C > 0$. We have seen in Chapters 5 and 7 that if we choose $H$ as a set of neural networks with size $n_A$ and $L$ layers, then, for appropriate activation functions, $\inf_{h\in H}\|h - h^*\|_{L^\infty}$ behaves like $n_A^{-r}$ if, e.g., $h^*$ is a $d$-dimensional $s$-Hölder regular function and $r = s/d$ (Theorem 5.22), or $h^* \in C^{k,s}([0,1]^d)$ and $r < (k+s)/d$ (Theorem 7.7).
By these considerations, we conclude that for an empirical risk minimizer $\Phi_S$ from a set of neural networks with $n_A$ weights and $L$ layers, it holds that
$$
R(\Phi_S) - R^* \le O\Big(\sqrt{\big(n_A\log(m) + L n_A\log(n_A)\big)/m}\Big) + O\big(n_A^{-r}\big), \qquad (14.6.1)
$$
for $m \to \infty$ and for some $r$ depending on the regularity of $h^*$. Note that enlarging the neural network set, i.e., increasing $n_A$, has two effects: the term associated with approximation decreases, and the term associated with generalization increases. This trade-off is known as the approximation-complexity trade-off. The situation is depicted in Figure 14.4, and a small numerical sketch follows the list below. The figure and (14.6.1) suggest that the perfect model achieves the optimal trade-off between approximation and generalization error. Using this notion, we can also separate all models into three classes:

• Underfitting: if the approximation error decays faster than the estimation error increases.

• Optimal: if the sum of approximation error and generalization error is at a minimum.

• Overfitting: if the approximation error decays slower than the estimation error increases.
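The following minimal sketch (not part of the text) evaluates the two terms in (14.6.1) as functions of $n_A$, with $r = 1$, $m = 10{,}000$, $L = 1$, and all implicit constants set to one, as in Figure 14.4.

```python
import numpy as np

# Minimal numerical sketch of the trade-off in (14.6.1): generalization term growing in n_A
# versus approximation term decaying like n_A^{-r}. Assumptions (as in Figure 14.4): r = 1,
# m = 10000, L = 1, and all implicit constants set to 1.
m, r, L = 10_000, 1.0, 1
n_A = np.arange(2, 5000, dtype=float)

gen = np.sqrt((n_A * np.log(m) + L * n_A * np.log(n_A)) / m)   # generalization term
approx = n_A ** (-r)                                           # approximation term
total = gen + approx

best = n_A[np.argmin(total)]
print(f"optimal trade-off at roughly n_A ~ {best:.0f}, total error ~ {total.min():.3f}")
# Smaller n_A: the approximation term dominates (underfitting).
# Larger n_A:  the generalization term dominates (overfitting, in the sense of this bound).
```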
In Chapter 15, we will see that deep learning often operates in the regime where the number of parameters $n_A$ exceeds the optimal trade-off point. For certain architectures used in practice, $n_A$ can be so large that the theory of the approximation-complexity trade-off suggests that learning should be impossible. However, we emphasize that the present analysis only provides upper bounds. It does not prove that learning is impossible or even impractical in the overparameterized regime. Moreover, in Chapter 11 we have already seen indications that learning in the overparameterized regime need not necessarily lead to large generalization errors.

14.7 PAC learning from VC dimension


In addition to covering numbers, there are several other tools to analyze the generalization capacity
of hypothesis sets. In the context of classification problems, one of the most important is the so-
called Vapnik–Chervonenkis (VC) dimension.

14.7.1 Definition and examples


Let $H$ be a hypothesis set of functions mapping from $\mathbb{R}^d$ to $\{0,1\}$. A set $S = \{x_1, \dots, x_n\} \subseteq \mathbb{R}^d$ is said to be shattered by $H$ if for every $(y_1, \dots, y_n) \in \{0,1\}^n$ there exists $h \in H$ such that $h(x_j) = y_j$ for all $j = 1, \dots, n$.
The VC dimension quantifies the complexity of a function class via the number of points that
can in principle be shattered.

Definition 14.16. The VC dimension of H is the cardinality of the largest set S ⊆ Rd that is
shattered by H. We denote the VC dimension by VCdim(H).

Figure 14.4: Illustration of the approximation-complexity trade-off of Equation (14.6.1). The regions to the left and to the right of the optimal trade-off point correspond to underfitting and overfitting, respectively. Here we chose r = 1 and m = 10,000, and all implicit constants are assumed to be equal to 1.

Example 14.17 (Intervals). Let $H = \{\mathbb{1}_{[a,b]} \mid a, b \in \mathbb{R}\}$. It is clear that $\mathrm{VCdim}(H) \ge 2$, since for $x_1 < x_2$ the functions
$$
\mathbb{1}_{[x_1-2,\,x_1-1]}, \quad \mathbb{1}_{[x_1-1,\,x_1]}, \quad \mathbb{1}_{[x_1,\,x_2]}, \quad \mathbb{1}_{[x_2,\,x_2+1]},
$$
restricted to $S = \{x_1, x_2\}$, realize the four label patterns $(0,0)$, $(1,0)$, $(1,1)$, and $(0,1)$, so $S$ is shattered.
On the other hand, if $x_1 < x_2 < x_3$, then, since $h^{-1}(\{1\})$ is an interval for all $h \in H$, we have that $h(x_1) = 1 = h(x_3)$ implies $h(x_2) = 1$. Hence, no set of three elements can be shattered. Therefore, $\mathrm{VCdim}(H) = 2$. The situation is depicted in Figure 14.5.

Figure 14.5: Different ways to classify two or three points. The colored-blocks correspond to
intervals that produce different classifications of the points.
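The following minimal sketch (not part of the text) verifies Example 14.17 by brute force: interval classifiers realize all four label patterns on two points, but only seven of the eight patterns on three points.

```python
import numpy as np
from itertools import product

# Minimal sketch of Example 14.17: brute-force check of which label patterns interval
# classifiers 1_[a,b] can realize on two and on three points.
def patterns(points, grid):
    realized = set()
    for a, b in product(grid, repeat=2):
        if a <= b:
            realized.add(tuple(int(a <= x <= b) for x in points))
    return realized

grid = np.linspace(-3, 3, 61)                       # candidate interval endpoints
print(len(patterns([0.0, 1.0], grid)))              # 4 = 2^2: two points are shattered
print(len(patterns([0.0, 1.0, 2.0], grid)))         # 7 < 2^3: the pattern (1, 0, 1) is never realized
```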

Example 14.18 (Half-spaces). Let $H_2 = \{\mathbb{1}_{[0,\infty)}(\langle w, \cdot\rangle + b) \mid w \in \mathbb{R}^2, b \in \mathbb{R}\}$ be the hypothesis set of rotated and shifted two-dimensional half-spaces. In Figure 14.6 we see that $H_2$ shatters a set of three points. More generally, for $d \ge 2$ and
$$
H_d := \{x \mapsto \mathbb{1}_{[0,\infty)}(w^\top x + b) \mid w \in \mathbb{R}^d, b \in \mathbb{R}\},
$$
the VC dimension of $H_d$ equals $d + 1$.

Figure 14.6: Different ways to classify three points by a half-space.

In the example above, the VC dimension coincides with the number of parameters. However,
this is not true in general as the following example shows.
Example 14.19 (Infinite VC dimension). Let for x ∈ R

H := {x 7→ 1[0,∞) (sin(wx)) | w ∈ R}.

Then the VC dimension of H is infinite (Exercise 14.29).

14.7.2 Generalization based on VC dimension


In the following, we consider a classification problem. Denote by D the data-generating distribution
on Rd × {0, 1}. Moreover, we let H be a set of functions from Rd → {0, 1}.
In the binary classification setup, the natural choice of a loss function is the 0-1 loss $L_{0\text{-}1}(y, y') = \mathbb{1}_{y\neq y'}$. Thus, given a sample $S$, the empirical risk of a function $h \in H$ is
$$
\widehat{R}_S(h) = \frac{1}{m}\sum_{i=1}^m \mathbb{1}_{h(x_i)\neq y_i}.
$$
Moreover, the risk can be written as
$$
R(h) = \mathbb{P}_{(x,y)\sim D}[h(x) \neq y],
$$
i.e., the probability under $(x,y) \sim D$ that $h$ misclassifies the label $y$ of $x$.


We can now give a generalization bound in terms of the VC dimension of H, see, e.g., [148,
Corollary 3.19]:

Theorem 14.20. Let $d, k \in \mathbb{N}$ and let $H \subseteq \{h : \mathbb{R}^d \to \{0,1\}\}$ have VC dimension $k$. Let $D$ be a distribution on $\mathbb{R}^d\times\{0,1\}$. Then, for every $\delta > 0$ and $m \in \mathbb{N}$, it holds with probability at least $1-\delta$ over a sample $S \sim D^m$ that for every $h \in H$
$$
|R(h) - \widehat{R}_S(h)| \le \sqrt{\frac{2k\log(em/k)}{m}} + \sqrt{\frac{\log(1/\delta)}{2m}}. \qquad (14.7.1)
$$

In words, Theorem 14.20 tells us that if a hypothesis class has finite VC dimension, then a
hypothesis with a small empirical risk will have a small risk if the number of samples is large. This
shows that empirical risk minimization is a viable strategy in this scenario. Will this approach also
work if the VC dimension is not bounded? No, in fact, in that case, no learning algorithm will
succeed in reliably producing a hypothesis for which the risk is close to the best possible. We omit
the technical proof of the following theorem from [148, Theorem 3.23].

Theorem 14.21. Let k ∈ N and let H ⊆ {h : X → {0, 1}} be a hypothesis set with VC dimension
k. Then, for every m ∈ N and every learning algorithm A : (X × {0, 1})m → H there exists a
distribution D on X × {0, 1} such that
" r #
k 1
PS∼Dm R(A(S)) − inf R(h) > ≥ .
h∈H 320m 64

Theorem 14.21 immediately implies the following statement for the generalization bound.

Corollary 14.22. Let k ∈ N and let H ⊆ {h : X → {0, 1}} be a hypothesis set with VC dimension
k. Then, for every m ∈ N there exists a distribution D on X × {0, 1} such that
" r #
k 1
PS∼Dm sup |R(h) − R b S (h)| > ≥ .
h∈H 1280m 64

Proof. For a sample S, let hS ∈ H be an empirical risk minimizer, i.e., R


b S (hS ) = minh∈H R
b S (h).
Let D be the distribution of Theorem 14.21. Moreover, for δ > 0, let hδ ∈ H be such that

R(hδ ) − inf R(h) < δ.


h∈H

207
Then, applying Theorem 14.21 with A(S) = hS it holds that
2 sup |R(h) − R
b S (h)| ≥ |R(hS ) − R
b S (hS )| + |R(hδ ) − R
b S (hδ )|
h∈H

≥ R(hS ) − R
b S (hS ) + R
b S (hδ ) − R(hδ )
≥ R(hS ) − R(hδ )
> R(hS ) − inf R(h) − δ,
h∈H

where we used the definition of hS in the third inequality. The proof is completed by applying
Theorem 14.21 and using that δ was arbitrary.

We have seen now, that we have a generalization bound scaling like O(1/ m) for m → ∞ if
and only if the VC dimension of a hypothesis class is finite. In more quantitative terms, we require
the VC dimension of a neural network to be smaller than m.
What does this imply for neural network functions? For ReLU neural networks there holds the
following [3, Theorem 8.8].

Theorem 14.23. Let A ∈ NL+2 , L ∈ N and set

H := {1[0,∞) ◦ Φ | Φ ∈ N (σReLU ; A, ∞)}.

Then, there exists a constant C > 0 independent of L and A such that

VCdim(H) ≤ C · (nA L log(nA ) + nA L2 ).

The bound (14.7.1) is meaningful if m ≫ k. For ReLU neural networks as in Theorem 14.23,
this means m ≫ nA L log(nA ) + nA L2 . Fixing L = 1 this amounts to m ≫ nA log(nA ) for a
shallow neural network with nA parameters. This condition is contrary to what we assumed in
Chapter 11, where it was crucial that nA ≫ m. If the VC dimension of the neural network sets
scale like O(nA log(nA )), then Theorem 14.21 and Corollary 14.22 indicate that, at least for certain
distributions, generalization should not be possible in this regime. We will discuss the resolution
of this potential paradox in Chapter 15.

14.8 Lower bounds on achievable approximation rates


We conclude this chapter on the complexities and generalization bounds of neural networks by using
the established VC dimension bound of Theorem 14.23 to deduce limitations to the approximation
capacity of neural networks. The result described below was first given in [245].

Theorem 14.24. Let k, d ∈ N. Assume that for every ε > 0 there exists Lε ∈ N and Aε with Lε
layers and input dimension d such that
ε
sup inf ∥f − Φ∥C 0 ([0,1]d ) < .
∥f ∥ k d ≤1
Φ∈N (σ ReLU ;A,∞) 2
C ([0,1] )

208
Then there exists C > 0 solely depending on k and d, such that for all ε ∈ (0, 1)
d
nAε Lε log(nAε ) + nAε L2ε ≥ Cε− k .

Proof. For x ∈ Rd consider the “bump function”


(  
1
exp 1 − 2 if ∥x∥2 < 1
f˜(x) := 1−∥x∥2
0 otherwise,

and its scaled version


 
˜ −1/k
fε (x) := εf 2ε x ,

for ε ∈ (0, 1). Then


h ε1/k ε1/k id
˜
supp(fε ) ⊆ − ,
2 2
and
∥f˜ε ∥C k ≤ 2k ∥f˜∥C k =: τk > 0.
Consider the equispaced point set {x1 , . . . , xN (ε) } = ε1/k Zd ∩ [0, 1]d . The cardinality of this set
is N (ε) ≃ ε−d/k . Given y ∈ {0, 1}N (ε) , let for x ∈ Rd
N (ε)
X
fy (x) := τk−1 yj f˜ε (x − xj ). (14.8.1)
j=1

Then fy (xj ) = τk−1 εyj for all j = 1, . . . , N (ε) and ∥fy ∥C k ≤ 1.


For every y ∈ {0, 1}N (ε) let Φy ∈ N (σReLU ; Aτ −1 ε , ∞) be such that
k

ε
sup |fy (x) − Φy (x)| < .
x∈[0,1]d 2τk

Then  ε 
1[0,∞) Φy (xj ) − = yj for all j = 1, . . . , N (ε).
2τk
Hence, the VC dimension of N (σReLU ; Aτ −1 ε , ∞) is larger or equal to N (ε). Theorem 14.23 thus
k
implies
d
 
N (ε) ≃ ε− k ≤ C · nA −1 Lτ −1 ε log(nA −1 ) + nA −1 L2τ −1 ε
τ ε k τ ε τ ε k
k k k

or equivalently
d d
 
τkk ε− k ≤ C · nA Lε log(nA ) + nA L2ε .
τ −1 ε τ −1 ε τ −1 ε
k k k

This completes the proof.

209
Figure 14.7: Illustration of fy from Equation (14.8.1) on [0, 1]2 .

To interpret Theorem 14.24, we consider two situations:


• In case the depth is allowed to increase at most logarithmically in ε, then reaching uniform
error ε for all f ∈ C k ([0, 1]d ) with ∥f ∥C k ([0,1]d ) ≤ 1 requires
d
nAε log(nAε ) log(ε) + nAε log(ε)2 ≥ Cε− k .

In terms of the neural network size, this (necessary) condition becomes nAε ≥ Cε−d/k / log(ε)2 .
As we have shown in Chapter 7, in particular Theorem 7.7, up to log terms this condition is
also sufficient. Hence, while the constructive proof of Theorem 7.7 might have seemed rather
specific, under the assumption of the depth increasing at most logarithmically (which the
construction in Chapter 7 satisfies), it was essentially optimal! The neural networks in this
proof are shown to have size O(ε−d/k ) up to log terms.

• If we allow the depth Lε to increase faster than logarithmically in ε, then the lower bound on
the required neural network size improves. Fixing for example Aε with Lε layers such that
nAε ≤ W Lε for some fixed ε independent W ∈ N, the (necessary) condition on the depth
becomes
d
W log(W Lε )L2ε + W L3ε ≥ Cε− k

and hence Lε ≳ ε−d/(3k) .


We add that, for arbitrary depth the upper bound on the VC dimension of Theorem 14.23
can be improved to n2A , [3, Theorem 8.6], and using this, would improve the just established
lower bound to Lε ≳ ε−d/(2k) .
For fixed width, this corresponds to neural networks of size O(ε−d/(2k) ), which would mean
twice the convergence rate proven in Theorem 7.7. Indeed, it turns out that neural networks
can achieve this rate in terms of the neural network size [246].
To sum up, in order to get error ε uniformly for all ∥f ∥C k ([0,1]d ) ≤ 1, the size of a ReLU neural
network is required to increase at least like O(ε−d/(2k) ) as ε → 0, i.e. the best possible attainable
convergence rate is 2k/d. It has been proven, that this rate is also achievable, and thus the bound
is sharp. Achieving this rate requires neural network architectures that grow faster in depth than
in width.

210
Bibliography and further reading
Classical statistical learning theory is based on the foundational work of Vapnik and Chervonenkis
[233]. This led to the formulation of the probably approximately correct (PAC) learning model
in [232], which is primarily utilized in this chapter. A streamlined mathematical introduction to
statistical learning theory can be found in [43].
Since statistical learning theory is well-established, there exists a substantial amount of excellent
expository work describing this theory. Some highly recommended books on the topic are [148,
212, 3]. The specific approach of characterizing learning via covering numbers has been discussed
extensively in [3, Chapter 14]. Specific results for ReLU activation used in this chapter were derived
in [204, 18]. The results of Section 14.8 describe some of the findings in [245, 246], and we also refer
to [51] for general lower bounds (also applicable to neural networks) when approximating classes
of Sobolev functions.

211
Exercises
Exercise 14.25. Let H be a set of neural networks with fixed architecture, where the weights are
taken from a compact set. Moreover, assume that the activation function is continuous. Show that
for every sample S there always exists an empirical risk minimizer hS .

Exercise 14.26. Complete the proof of Proposition 14.9.

Exercise 14.27. Prove Lemma 14.12.

Exercise 14.28. Show that, the VC dimension of H of Example 14.18 is indeed 3, by demonstrating
that no set of four points can be shattered by H.

Exercise 14.29. Show that the VC dimension of

H := {x 7→ 1[0,∞) (sin(wx)) | w ∈ R}

is infinite.

212
Chapter 15

Generalization in the
overparameterized regime

In the previous chapter, we discussed the theory of generalization for deep neural networks trained
by minimizing the empirical risk. A key conclusion was that good generalization is possible as long
as we choose an architecture that has a moderate number of neural network parameters relative to
the number of training samples. Moreover, we saw in Section 14.6 that the best performance can be
expected when the neural network size is chosen to balance the generalization and approximation
errors, by minimizing their sum.

Architectures On ImageNet

Figure 15.1: ImageNet Classification Competition: Final score on the test set in the Top 1 cat-
egory vs. Parameters-to-Training-Samples Ratio. Note that all architectures have more parame-
ters than training samples. Architectures include AlexNet [121], VGG16 [215], GoogLeNet [222],
ResNet50/ResNet152 [87], DenseNet121 [96], ViT-G/14 [248], EfficientNetB0 [224], and Amoe-
baNet [189].

Surprisingly, successful neural network architectures do not necessarily follow these theoretical
observations. Consider the neural network architectures in Figure 15.1. They represent some

213
of the most renowned image classification models, and all of them participated in the ImageNet
Classification Competition [50]. The training set consisted of 1.2 million images. The x-axis shows
the model performance, and the y-axis displays the ratio of the number of parameters to the size of
the training set; notably, all architectures have a ratio larger than one, i.e. have more parameters
than training samples. For the largest model, there are by a factor 1000 more neural network
parameters than training samples.
Given that the practical application of deep learning appears to operate in a regime significantly
different from the one analyzed in Chapter 14, we must ask: Why do these methods still work
effectively?

15.1 The double descent phenomenon


The success of deep learning in a regime not covered by traditional statistical learning theory
puzzled researchers for some time. In [14], an intriguing set of experiments was performed. These
experiments indicate that while the risk follows the upper bound from Section 14.6 for neural
network architectures that do not interpolate the data, the curve does not expand to infinity in the
way that Figure 14.4 suggests. Instead, after surpassing the so-called “interpolation threshold”,
the risk starts to decrease again. This behavior, known as double descent, is illustrated in Figure
15.2.

classical regime modern regime

R(h)

R
b S (h)
underfitting overfitting
Interpolation threshold

Expressivity of H
Figure 15.2: Illustration of the double descent phenomenon.

15.1.1 Least-squares regression revisited


To gain further insight, we consider least-squares (kernel) regression as introduced in Section 11.2.
Consider a data sample (xj , yj )m d
j=1 ⊆ R × R generated by some ground-truth function f , i.e.

yj = f (xj ) for j = 1, . . . , m. (15.1.1)

Let ϕjP: Rd → R, j ∈ N, be a sequence of ansatz functions. For n ∈ N, we wish to fit a function


x 7→ ni=1 wi ϕi (x) to the data using linear least-squares. To this end, we introduce the feature
map
Rd ∋ x 7→ ϕ(x) := (ϕ1 (x), . . . , ϕn (x))⊤ ∈ Rn .

214
The goal is to determine coefficients w ∈ Rn minimizing the empirical risk
m n m
b S (w) = 1 1 X
XX 2
R wi ϕi (xj ) − yj = (⟨ϕ(xj ), w⟩ − yj )2 .
m m
j=1 i=1 j=1

With
ϕ(x1 )⊤
   
ϕ1 (x1 ) . . . ϕn (x1 )
 .. .. .. .. m×n
An :=  . = ∈R (15.1.2)
  
. . .
ϕ1 (xm ) . . . ϕn (xm ) ϕ(xm )⊤
and y = (y1 , . . . , ym )⊤ it holds
b S (w) = 1 ∥An w − y∥2 .
R (15.1.3)
m
As discussed in Sections 11.1-11.2, a unique minimizer of (15.1.3) only exists if An has rank n.
For a minimizer wn , the fitted function reads
n
X
fn (x) := wn,j ϕj (x). (15.1.4)
j=1

We are interested in the behavior of the fn as a function of n (the number of ansatz func-
tions/parameters of our model), and distinguish between two cases:

• Underparameterized : If n < m we have fewer parameters n than training points m. For


the least squares problem of minimizing R
b S , this means that there are more conditions m
than free parameters n. Thus, in general, we cannot interpolate the data, and we have
minw∈Rn R b S (w) > 0.

• Overparameterized : If n ≥ m, then we have at least as many parameters n as training points


m. If the xj and the ϕj are such that An ∈ Rm×n has full rank m, then there exists w
such that R b S (w) = 0. If n > m, then An necessarily has a nontrivial kernel, and there exist
infinitely many parameters choices w that yield zero empirical risk R b S . Some of them lead
to better, and some lead to worse prediction functions fn in (15.1.4).

In the overparameterized case, there exist many minimizers of R b S . The training algorithm we
use to compute a minimizer determines the type of prediction function fn we obtain. To observe
double descent, i.e. to achieve good generalization for large n, we need to choose the minimizer
carefully. In the following, we consider the unique minimal 2-norm minimizer, which is defined as
 
wn,∗ = argmin{w∈Rn | Rb S (w)≤Rb S (v) ∀v∈Rn } ∥w∥ ∈ Rn . (15.1.5)

15.1.2 An example
Now let us consider a concrete example. In Figure 15.3 we plot a set of 40 ansatz functions
ϕ1 , . . . , ϕ40 , which are drawn from a Gaussian process. Additionally, the figure shows a plot of the
Runge function f , and m = 18 equispaced points which are used as the training data points. We
then fit a function in span{ϕ1 , . . . , ϕn } via (15.1.5) and (15.1.4). The result is displayed in Figure
15.4:

215
3 1.0 f
2 0.8 Data points
1
0.6
0
j

1 0.4
2 0.2
3
0.0
1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00
(a) ansatz functions ϕj (b) Runge function f and data points

Figure 15.3: Ansatz functions ϕ1 , . . . , ϕ40 drawn from a Gaussian process, along with the Runge
function and 18 equispaced data points.

• n = 2: The model can only represent functions in span{ϕ1 , ϕ2 }. It is not yet expressive
enough to give a meaningful approximation of f .

• n = 15: The model has sufficient expressivity to capture the main characteristics of f . Since
n = 15 < 18 = m, it is not yet able to interpolate the data. Thus it allows to strike a
good balanced between the approximation and generalization error, which corresponds to the
scenario discussed in Chapter 14.

• n = 18: We are at the interpolation threshold. The model is capable of interpolating the data,
and there is a unique w such that R b S (w) = 0. Yet, in between data points the behavior of the
predictor f18 seems erratic, and displays strong oscillations. This is referred to as overfitting,
and is to be expected due to our analysis in Chapter 14; while the approximation error at the
data points has improved compared to the case n = 15, the generalization error has gotten
worse.

• n = 40: This is the overparameterized regime, where we have significantly more parameters
than data points. Our prediction f40 interpolates the data and appears to be the best overall
approximation to f so far, due to a “good” choice of minimizer of R b S , namely (15.1.5).
We also note that, while quite good, the fit is not perfect. We cannot expect significant
improvement in performance by further increasing n, since at this point the main limiting
factor is the amount of available data. Also see Figure 15.5 (a).

Figure 15.5 (a) displays the error ∥f − fn ∥L2 ([−1,1]) over n. We observe the characteristic double
descent curve, where the error initially decreases, after peaking at the interpolation threshold,
which is marked by the dashed red line. Afterwards, in the overparameterized regime, it starts to
decrease again. Figure 15.5 (b) displays ∥wn,∗ ∥. Note how the Euclidean norm of the coefficient
vector also peaks at the interpolation threshold.
We emphasize that the precise nature of the convergence curves depends strongly on various
factors, such as the distribution and number of training points m, the ground truth f , and the
choice of ansatz functions ϕj (e.g., the specific kernel used to generate the ϕj in Figure 15.3 (a)).
In the present setting we achieve a good approximation of f for n = 15 < 18 = m corresponding to
the regime where the approximation and interpolation errors are balanced. However, as Figure 15.5

216
1.0 f 1.0 f
0.8 f2 f15
0.8
0.6 Data points Data points
0.4 0.6
0.2 0.4
0.0
0.2 0.2
0.4 0.0
1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00
(a) n = 2 (underparameterization) (b) n = 15 (balance of appr. and gen. error)
1.0 f 1.0 f
0.8 f18 0.8 f40
Data points Data points
0.6 0.6
0.4
0.4
0.2
0.2
0.0
0.0
1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00
(c) n = 18 (interpolation threshold) (d) n = 40 (overparameterization)

Figure 15.4: Fit of the m = 18 red data points using the ansatz functions ϕ1 , . . . , ϕn from Figure
15.3, employing equations (15.1.5) and (15.1.4) for different numbers of ansatz functions n.

(a) shows, it can be difficult to determine a suitable value of n < m a priori, and the acceptable
range of n values can be quite narrow. For overparametrization (n ≫ m), the precise choice of n is
less critical, potentially making the algorithm more stable in this regime. We encourage the reader
to conduct similar experiments and explore different settings to get a better feeling for the double
descent phenomenon.

15.2 Size of weights


In Figure 15.5, we observed that the norm of the coefficients ∥wn,∗ ∥ exhibits similar behavior to
the L2 -error, peaking at the interpolation threshold n = 18. In machine learning, large weights
are usually undesirable, as they are associated with large derivatives or oscillatory behavior. This
is evident in the example shown in Figure 15.4 for n = 18. Assuming that the data in (15.1.1)
was generated by a “smooth” function f , e.g. a function with moderate Lipschitz constant, these
large derivatives of the prediction function could lead to poor generalization. Such a smoothness
assumption about f may or may not be satisfied. However, if f is not smooth, there is little hope
of accurately recovering f from limited data (see the discussion in Section 9.2).
The next result gives an explanation for the observed behavior of ∥wn,∗ ∥.

217
n = 18 n = 18

10 1

100

10 2
10 20 30 40 10 20 30 40
n n
(a) ∥f − fn ∥L2 ([−1,1]) (b) ∥wn,∗ ∥

Figure 15.5: The L2 -error for the fitted functions in Figure 15.4, and the ℓ2 -norm of the corre-
sponding coefficient vector wn,∗ defined in (15.1.5).

Proposition 15.1. Assume that x1 , . . . , xm and the (ϕj )j∈N are such that An in (15.1.2) has full
rank n for all n ≤ m. Given y ∈ Rm , denote by wn,∗ (y) the vector in (15.1.5). Then
(
increasing for n < m,
n 7→ sup ∥wn,∗ (y)∥ is monotonically
∥y∥=1 decreasing for n ≥ m.

Proof. We start with the case n ≥ m. By assumption Am has full rank m, and thus An has rank
m for all n ≥ m, see (15.1.2). In particular, there exists wn ∈ Rn such that An wn = y. Now fix
y ∈ Rm and let wn be any such vector. Then wn+1 := (wn , 0) ∈ Rn+1 satisfies An+1 wn+1 = y
and ∥wn+1 ∥ = ∥wn ∥. Thus necessarily ∥wn+1,∗ ∥ ≤ ∥wn,∗ ∥ for the minimal norm solutions defined
in (15.1.5). Since this holds for every y, we obtain the statement for n ≥ m.
Now let n < m. Recall that the minimal norm solution can be written through the pseudo
inverse
wn,∗ (y) = A†n y,
see for instance Exercise 11.32. Here,
 −1 
σn,1 0

An = V n 
 .. ..  ⊤
 Un ∈ R
n×m
. .
−1
σn,n 0

218
where An = U n Σ n V ⊤
n is the singular value decomposition of An , and
 
σn,1
 .. 
 . 
 
 σ n,n 
 m×n
Σn =  ∈R

 0 
 .. 
 . 
0
contains the singular values σn,1 ≥ · · · ≥ σn,n > 0 of An ∈ Rm×n ordered by decreasing size. Since
V n ∈ Rn×n and U n ∈ Rm×m are orthogonal matrices, we have
sup ∥wn,∗ (y)∥ = sup ∥A†n y∥ = σn,n
−1
.
∥y∥=1 ∥y∥=1

Finally, since the minimal singular value σn,n of An can be written as


σn,n = infn ∥An x∥ ≥ inf ∥An+1 x∥ = σn+1,n+1 ,
x∈R x∈Rn+1
∥x∥=1 ∥x∥=1

we observe that n 7→ σn,n is monotonically decreasing for n ≤ m. This concludes the proof.

15.3 Theoretical justification


Let us now examine one possible explanation of the double descent phenomenon for neural networks.
While there are many alternative arguments available in the literature (see the bibliography section),
the explanation presented here is based on a simplification of the ideas in [12].
The key assumption underlying our analysis is that large overparameterized neural networks
tend to be Lipschitz continuous with a Lipschitz constant independent of the size. This is a
consequence of neural networks typically having relatively small weights. To motivate this, let us
consider the class of neural networks N (σ; A, B) for an architecture A of depth d ∈ N and width
L ∈ N. If σ is Cσ -Lipschitz continuous such that B ≤ cB · (dCσ )−1 for some cB > 0, then by Lemma
13.2
N (σ; A, B) ⊆ LipcL (Rd0 ), (15.3.1)
B

An assumption of the type B ≤ cB · (dCσ )−1 , i.e. a scaling of the weights by the reciprocal 1/d of
the width, is not unreasonable in practice: Standard initialization schemes such as LeCun [129] or
He [86] initialization, use random weights with variance scaled inverse proportional to the input
dimension of each layer. Moreover, as we saw in Chapter 11, for very wide neural networks, the
weights do not move significantly from their initialization during training. Additionally, many train-
ing routines use regularization terms on the weights, thereby encouraging them the optimization
routine to find small weights.
We study the generalization capacity of Lipschitz functions through the covering-number-based
learning results of Chapter 14. The set of C-Lipschitz functions on a compact d-dimensional
Euclidean domain LipC (Ω) has covering numbers bounded according to
 d
∞ C
log(G(LipC (Ω), ε, L )) ≤ Ccov · for all ε > 0 (15.3.2)
ε

219
for some constant Ccov independent of ε > 0. A proof can be found in [75, Lemma 7], see also [230].
As a result of these considerations, we can identify two regimes:

• Standard regime: For small neural network size nA , we consider neural networks as a set
parameterized by nA parameters. As we have seen before, this yields a bound on the gen-
eralization error that scales linearly with nA . As long as nA is small in comparison to the
number of samples, we can expect good generalization by Theorem 14.15.

• Overparameterized regime: For large neural network size nA and small weights, we consider
neural networks as a subset of LipC (Ω) for a constant C > 0. This set has a covering number
bound that is independent of the number of parameters nA .

Choosing the better of the two generalization bounds for each regime yields the following result.
Recall that N ∗ (σ; A, B) denotes all neural networks in N (σ; A, B) with a range contained in [−1, 1]
(see (14.5.1)).

Theorem 15.2. Let C, CL > 0 and let L : [−1, 1] × [−1, 1] → R be CL -Lipschitz. Further, let
A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 , let σ : R → R be Cσ -Lipschitz continuous with Cσ ≥ 1, and
|σ(x)| ≤ Cσ |x| for all x ∈ R, and let B > 0.
Then, there exist c1 , c2 > 0, such that for every m ∈ N and every distribution D on
[−1, 1]d0 × [−1, 1] it holds with probability at least 1 − δ over S ∼ Dm that for all Φ ∈
N ∗ (σ; A, B) ∩ LipC ([−1, 1]d0 )
r
log(4/δ)
|R(Φ) − RS (Φ)| ≤ g(A, Cσ , B, m) + 4CL
b , (15.3.3)
m
where
( r √ )
nA log(nA ⌈ m⌉) + LnA log(dmax ) 1
− 2+d
g(A, Cσ , B, m) = min c1 , c2 m 0 .
m

Proof. Applying Theorem 14.11 with α = 1/(2 + d0 ) and (15.3.2), we obtain that with probability
at least 1 − δ/2 it holds for all Φ ∈ LipC ([−1, 1]d0 )
r
α d0
|R(Φ) − R b S (Φ)| ≤ 4CL Ccov (m C) + log(4/δ) + 2CL
m mα r
2C log(4/δ)
q
d d /(d +2)−1 L
≤ 4CL Ccov C 0 (m 0 0 ) + α + 4CL
m r m
2CL log(4/δ)
q
= 4CL Ccov C d0 (m−2/(d0 +2) ) + α + 4CL
m
r m
p
d
(4CL Ccov C + 2CL )
0 log(4/δ)
= + 4CL ,
mα m
√ √ √
where we used in the second inequality that x + y ≤ x + y for all x, y ≥ 0.

220
In addition, Theorem 14.15 yields that with probability at least 1 − δ/2 it holds for all Φ ∈
N ∗ (σ; A, B)
r √
nA log(⌈nA m⌉) + LnA log(⌈2Cσ Bdmax ⌉) + log(4/δ)
|R(Φ) − R
b S (Φ)| ≤ 4CL
m
2CL
+√
m
r √
nA log(⌈nA m⌉) + LnA log(⌈2Cσ Bdmax ⌉)
≤ 6CL
r m
log(4/δ))
+ 4CL .
m
Then, for Φ ∈ N ∗ (σ; A, B) ∩ LipC ([−1, 1]d0 ) the minimum of both upper bounds holds with
probability at least 1 − δ.

The two regimes in Theorem 15.2 correspond to the two terms comprising the minimum in the
definition of g(A, Cσ , B, m). The first term increases with nA while the second is constant. In the
first regime, where the first term is smaller, the generalization gap |R(Φ) − R b S (Φ)| increases with
nA .
In the second regime, where the second term is smaller, the generalization gap is constant with
nA . Moreover, it is reasonable to assume that the empirical risk R b S will decrease with increasing
number of parameters nA .
By (15.3.3) we can bound the risk by
r
R(Φ) ≤ R b S + g(A, Cσ , B, m) + 4CL log(4/δ) .
m
In the second regime, this upper bound is monotonically decreasing. In the first regime it may
both decrease and increase. In some cases, this behavior can lead to an upper bound on the risk
resembling the curve of Figure 15.2. The following section describes a specific scenario where this
is the case.
Remark 15.3. Theorem 15.2 assumes C-Lipschitz continuity of the neural networks. As we saw in
Sections 15.1.2 and 15.2, this assumption may not hold near the interpolation threshold. Hence,
Theorem 15.2 likely gives a too optimistic upper bound near the interpolation threshold.

15.4 Double descent for neural network learning


Now let us understand the double descent phenomenon in the context of Theorem 15.2. We make
a couple of simplifying assumptions to obtain a formula for an upper bound on the risk. First, we
assume that the data S = (xi , yi )m d0
i=1 ∈ R × R stem from a CM -Lipschitz continuous function.
In addition, we fix a depth L ∈ N and consider, for d ∈ N, architectures of the form (σReLU ; Ad ),
where
Ad = (d0 , d, . . . , d, 1).
For this architecture the number of parameters is bounded by

nAd = (d0 + 1)d + (L − 1)(d + 1)d + d + 1.

221
To derive an upper bound on the risk, we start by upper bounding the empirical risk and then
applying Theorem 15.2 to establish an upper bound on the generalization gap. In combination,
these estimates provide an upper bound on the risk. We will then observe that this upper bound
follows the double descent curve in Figure 15.2.

15.4.1 Upper bound on empirical risk


We establish an upper bound on R b S (Φ) for Φ ∈ N ∗ (σReLU ; Ad , B)∩LipC ([−1, 1]d0 ). For B ≥ CM ,
M
we can apply Theorem 9.6, and conclude that with a neural network of sufficient depth we can
interpolate m points from a CM -Lipschitz function with a neural network in LipCM ([−1, 1]d0 ), if
nA ≥ cint log(m)d0 m. To simplify the exposition, we assume cint = 1 in the following. Thus,
Rb S (Φ) = 0 as soon as nA ≥ log(m)d0 m.
In addition, depending on smoothness properties of the data, the interpolation error may decay
with some rate, by one of the results in Chapters 5, 7, or 8. For simplicity, we choose that R b S (Φ) =
−1
O(nA ) for nA significantly smaller than log(m)d0 m. If we combine these two assumptions, we can
make the following Ansatz for the empirical risk of ΦAd ∈ N ∗ (σReLU ; Ad , B) ∩ LipCM ([−1, 1]d0 ):
n o
R e S (ΦA ) := Capprox max 0, n−1 − (log(m)d0 m)−1
b S (ΦA ) ≤ R (15.4.1)
d d Ad

for a constant Capprox > 0. Note that, we can interpolate the sample S already with d0 m parameters
by Theorem 9.3. However, it is not guaranteed that this can be done using CM -Lipschitz neural
networks.

15.4.2 Upper bound on generalization gap


We complement the bound on the empirical risk by an upper bound on the risk. Invoking the
notation of Theorem 15.2, we have that,

g(Ad , CσReLU , B, m) = min {κNN (Ad , m; c1 ), κLip (Ad , m; c2 )} ,

where
r √
nAd log(⌈nA m⌉) + LnAd log(d)
κNN (Ad , m; c1 ) := c1 ,
m (15.4.2)
1
− 2+d
κLip (Ad , m; c2 ) := c2 m 0

for some constants c1 , c2 > 0.

15.4.3 Upper bound on risk


Next, we combine (15.4.1) and (15.4.2) to obtain an upper bound on the risk R(ΦAd ). Specifically,
we define

R(Φ
e A ) :=R
d
e S (ΦA ) + min {κNN (Ad , m; c1 ), κLip (Ad , m; c2 )}
d
(15.4.3)
r
log(4/δ)
+ 4CL .
m

222
We depict in Figure 15.6 the upper bound on the risk given by (15.4.3) (excluding the terms
that do not depend on the architecture). The upper bound clearly resembles the double descent
phenomenon of Figure 15.2. Note that the Lipschitz interpolation point is slightly behind this
threshold, which is when we assume our empirical risk to be 0. To produce the plot, we chose
L = 5, c1 = 1.2 · 10−4 , c2 = 6.5 · 10−3 , m = 10.000, d0 = 6, Capprox = 30. We mention that the
double descent phenomenon is not visible for all choices of parameters. Moreover, in our model,
the fact that the peak coincides with the interpolation threshold is due to the choice of constants
and does not emerge from the model. Other models of double descent explain the location of the
peak more accurately [143, 83]. We note that, as observed in Remark 15.3, the peak close to the
interpolation threshold that we see in Figure 15.6 would likely be more pronounced in practical
scenarios.

Figure 15.6: Upper bound on R(ΦAd ) derived in (15.4.3). For better visibility the part correspond-
ing to y-values between 0.0012 and 0.0022 is not shown. The vertical dashed line indicates the
interpolation threshold according to Theorem 9.3.

Bibliography and further reading


The discussion on kernel regression and the effect of the number of parameters on the norm of the
weights was already given in [14]. Similar analyses, with more complex ansatz systems and more
precise asymptotic estimates, are found in [143, 83]. Our results in Section 15.3 are inspired by
[12]; see also [161].
For a detailed account of further arguments justifying the surprisingly good generalization

223
capabilities of overparameterized neural networks, we refer to [19, Section 2]. Here, we only briefly
mention two additional directions of inquiry. First, if the learning algorithm introduces a form of
robustness, this can be leveraged to yield generalization bounds [6, 244, 24, 179]. Second, for very
overparameterized neural networks, it was stipulated in [106] that neural networks become linear
kernel interpolators based on the neural tangent kernel of Section 11.5.2. Thus, for large neural
networks, generalization can be studied through kernel regression [106, 131, 15, 135].

224
Exercises
Exercise 15.4. Let f : [−1, 1] → R be a continuous function, and let −1 ≤ x1 < · · · < xm ≤ 1 for
some fixed m ∈ N. As in Section 15.1.2, we wish to approximate f by a least squares approximation.
To this end we use the Fourier ansatz functions
(
1 sin(⌈ 2j ⌉πx) j ≥ 1 is odd
b0 (x) := and bj (x) := (15.4.4)
2 cos(⌈ 2j ⌉πx) j ≥ 1 is even.

With the empirical risk


m n
b S (w) = 1
XX 2
R wi bi (xj ) − yj ,
m
j=1 i=0

denote by wn∗ ∈ Rn+1 the minimal norm minimizer of R b S , and set fn (x) := Pn wn bi (x).
i=0 ∗,i
Show that in this case generalization fails in the overparametrized regime: for sufficiently large
n ≫ m, fn is not necessarily a good approximation to f . What does fn converge to as n → ∞?

Exercise 15.5. Consider the setting of Exercise 15.4. We adapt the ansatz functions in (15.4.4)
by rescaling them via
b̃j := cj bj .
Choose real numbers cj ∈ R, such that the corresponding minimal norm least squares solution
avoids the phenomenon encountered in Exercise 15.4.
Hint: Should ansatz functions corresponding to large frequencies be scaled by large or small
numbers to avoid overfitting?

Exercise 15.6. Prove (15.3.2) for d = 1.

225
Chapter 16

Robustness and adversarial examples

How sensitive is the output of a neural network to small changes in its input? Real-world obser-
vations of trained neural networks often reveal that even barely noticeable modifications of the
input can lead to drastic variations in the network’s predictions. This intriguing behavior was first
documented in the context of image classification in [223].
Figure 16.1 illustrates this concept. The left panel shows a picture of a panda that the neural
network correctly classifies as a panda. By adding an almost imperceptible amount of noise to the
image, we obtain the modified image in the right panel. To a human, there is no visible difference,
but the neural network classifies the perturbed image as a wombat. This phenomenon, where
a correctly classified image is misclassified after a slight perturbation, is termed an adversarial
example.
In practice, such behavior is highly undesirable. It indicates that our learning algorithm might
not be very reliable and poses a potential security risk, as malicious actors could exploit it to trick
the algorithm. In this chapter, we describe the basic mathematical principles behind adversarial
examples and investigate simple conditions under which they might or might not occur. For sim-
plicity, we restrict ourselves to a binary classification problem but note that the main ideas remain
valid in more general situations.

+ 0.01x =
Human: Panda Barely visible noise Still a panda

NN classifier: Panda (high confidence) Flamingo (low confidence) Wombat (high confidence)

Figure 16.1: Sketch of an adversarial example.

226
16.1 Adversarial examples
Let us start by formalizing the notion of an adversarial example. We consider the problem of
assigning a label y ∈ {−1, 1} to a vector x ∈ Rd . It is assumed that the relation between x and y
is described by a distribution D on Rd × {−1, 1}. In particular, for a given x, both values −1 and
1 could have positive probability, i.e. the label is not necessarily deterministic. Additionally, we let
Dx := {x ∈ Rd | ∃y s.t. (x, y) ∈ supp(D)}, (16.1.1)
and refer to Dx as the feature support.
Throughout this chapter we denote by
g : Rd → {−1, 0, 1}
a fixed so-called ground-truth classifier, satisfying1
P[y = g(x)|x] ≥ P[y = −g(x)|x] for all x ∈ Dx . (16.1.2)
Note that we allow g to take the value 0, which is to be understood as an additional label corre-
sponding to nonrelevant or nonsensical input data x. We will refer to g −1 (0) as the nonrelevant
class. The ground truth g is interpreted as how a human would classify the data, as the following
example illustrates.
Example 16.1. We wish to classify whether an image shows a panda (y = 1) or a wombat (y = −1).
Consider again Figure 16.1, and denote the three images by x1 , x2 , x3 . The first image x1 is a
photograph of a panda. Together with a label y, it can be interpreted as a draw (x1 , y) from D,
i.e. x1 ∈ Dx and g(x1 ) = 1. The second image x2 displays noise and corresponds to nonrelevant
data as it shows neither a panda nor a wombat. In particular, x2 ∈ Dxc and g(x2 ) = 0. The third
(perturbed) image x3 also belongs to Dxc , as it is not a photograph but a noise corrupted version
of x1 . Nonetheless, it is not nonrelevant, as a human would classify it as a panda. Thus g(x3 ) = 1.
Additional to the ground truth g, we denote by
h : Rd → {−1, 1}
some trained classifier.

Definition 16.2. Let g : Rd → {−1, 0, 1} be the ground-truth classifier, let h : Rd → {−1, 1} be a


classifier, and let ∥ · ∥∗ be a norm on Rd . For x ∈ Rd and δ > 0, we call x′ ∈ Rd an adversarial
example to x ∈ Rd with perturbation δ, if and only if

(i) ∥x′ − x∥∗ ≤ δ,

(ii) g(x)g(x′ ) > 0,

(iii) h(x) = g(x) and h(x′ ) ̸= g(x′ ).

1
To be more precise, the conditional distribution of y|x is only well-defined almost everywhere w.r.t. the marginal
distribution of x. Thus (16.1.2) can only be assumed to hold for almost every x ∈ Dx w.r.t. to the marginal
distribution of x.

227
In words, x′ is an adversarial example to x with perturbation δ, if (i) the distance of x and x′
is at most δ, (ii) x and x′ belong to the same (not nonrelevant) class according to the ground truth
classifier, and (iii) the classifier h correctly classifies x but misclassifies x′ .
Remark 16.3. We emphasize that the concept of a ground-truth classifier g differs from a minimizer
of the Bayes risk (14.1.1) for two reasons. First, we allow for an additional label 0 corresponding to
the nonrelevant class, which does not exist for the data generating distribution D. Second, g should
correctly classify points outside of Dx ; small perturbations of images as we find them in adversarial
examples, are not regular images in Dx . Nonetheless, a human classifier can still classify these
images, and g models this property of human classification.

16.2 Bayes classifier


At first sight, an adversarial example seems to be no more than a misclassified sample. Naturally,
these exist if the model does not generalize well. In this section we present a more nuanced view
from [218].
To avoid edge cases, we assume in the following that for all x ∈ Dx
either P[y = 1|x] > P[y = −1|x] or P[y = 1|x] < P[y = −1|x] (16.2.1)
so that (16.1.2) uniquely defines g(x) for x ∈ Dx . We say that the distribution exhausts the
domain if Dx ∪ g −1 (0) = Rd . This means that every point is either in the feature support Dx or
it belongs to the nonrelevant class. Moreover, we say that h is a Bayes classifier if
P[h(x)|x] ≥ P[−h(x)|x] for all x ∈ Dx .
By (16.1.2), the ground truth g is a Bayes classifier, and (16.2.1) ensures that h coincides with g
on Dx if h is a Bayes classifier. It is easy to see that a Bayes classifier minimizes the Bayes risk.
With these two notions, we now distinguish between four cases.
(i) Bayes classifier/exhaustive distribution: If h is a Bayes classifier and the data exhausts the
domain, then there are no adversarial examples. This is because every x ∈ Rd either belongs
to the nonrelevant class or is classified the same by h and g.
(ii) Bayes classifier/non-exhaustive distribution: If h is a Bayes classifier and the distribution
does not exhaust the domain, then adversarial examples can exist. Even though the learned
classifier h coincides with the ground truth g on the feature support, adversarial examples
can be constructed for data points on the complement of Dx ∪ g −1 (0), which is not empty.
(iii) Not a Bayes classifier/exhaustive distribution: The set Dx can be covered by the four sub-
domains
C1 = h−1 (1) ∩ g −1 (1), F1 = h−1 (−1) ∩ g −1 (1),
(16.2.2)
C−1 = h−1 (−1) ∩ g −1 (−1), F−1 = h−1 (1) ∩ g −1 (−1).
If dist(C1 ∩ Dx , F1 ∩ Dx ) or dist(C−1 ∩ Dx , F−1 ∩ Dx ) is smaller than δ, then there exist
points x, x′ ∈ Dx such that x′ is an adversarial example to x with perturbation δ. Hence,
adversarial examples in the feature support can exist. This is, however, not guaranteed to
happen. For example, Dx does not need to be connected if g −1 (0) ̸= ∅, see Exercise 16.18.
Hence, even for classifiers that have incorrect predictions on the data, adversarial examples
do not need to exist.

228
(iv) Not a Bayes classifier/non-exhaustive distribution: In this case everything is possible. Data
points and their associated adversarial examples can appear in the feature support of the
distribution and adversarial examples to elements in the feature support of the distribution
can be created by leaving the feature support of the distribution. We will see examples in
the following section.

16.3 Affine classifiers


For linear classifiers, a simple argument outlined in [223] and [73] showcases that the high-dimensionality
of the input, common in image classification problems, is a potential cause for the existence of ad-
versarial examples.
A linear classifier is a map of the form
x 7→ sign(w⊤ x) where w, x ∈ Rd .
Let
sign(w⊤ x)sign(w)
x′ := x − 2|w⊤ x|
∥w∥1
where sign(w) is understood coordinate-wise. Then ∥x − x′ ∥∞ ≤ 2|w⊤ x|/∥w∥1 and it is not hard
to see that sign(w⊤ x′ ) ̸= sign(w⊤ x).
For high-dimensional vectors w, x chosen at random but possibly dependent such that w is
uniformly distributed on a d − 1 dimensional sphere, it holds with high probability that
|w⊤ x| ∥x∥∥w∥
≤ ≪ ∥x∥.
∥w∥1 ∥w∥1
This can be seen by noting that for every c > 0
µ({w ∈ Rd | ∥w∥1 > c, ∥w∥ ≤ 1}) → 1 for d → ∞, (16.3.1)
where µ is the uniform probability measure on the d-dimensional Euclidean unit ball, see Exercise
16.17. Thus, if x has a moderate Euclidean norm, the perturbation of x′ is likely small for large
dimensions.
Below we give a sufficient condition for the existence of adversarial examples, in case both h
and the ground truth g are linear classifiers.

Theorem 16.4. Let w, w ∈ Rd be nonzero. For x ∈ Rd , let h(x) = sign(w⊤ x) be a classifier and
let g(x) = sign(w⊤ x) be the ground-truth classifier.
For every x ∈ Rd with h(x)g(x) > 0 and all ε ∈ (0, |w⊤ x|) such that
|w⊤ x| ε + |w⊤ x| |w⊤ w|
> (16.3.2)
∥w∥ ∥w∥ ∥w∥∥w∥
it holds that
ε + |w⊤ x|
x′ = x − h(x) w (16.3.3)
∥w∥2
is an adversarial example to x with perturbation δ = (ε + |w⊤ x|)/∥w∥.

229
Before we present the proof, we give some interpretation of this result. First, note that {x ∈
Rd | w ⊤ x= 0} is the decision boundary of h, meaning that points lying on opposite sides of this
hyperplane, are classified differently by h. Due to |w⊤ w| ≤ ∥w∥∥w∥, (16.3.2) implies that an
adversarial example always exists whenever
|w⊤ x| |w⊤ x|
> . (16.3.4)
∥w∥ ∥w∥
The left term is the decision margin of x for g, i.e. the distance of x to the decision boundary
of g. Similarly, the term on the right is the decision margin of x for h. Thus we conclude that
adversarial examples exist if the decision margin of x for the ground truth g is larger than that for
the classifier h.
Second, the term (w⊤ w)/(∥w∥∥w∥) describes the alignment of the two classifiers. If the clas-
sifiers are not aligned, i.e., w and w have a large angle between them, then adversarial examples
exist even if the margin of the classifier is larger than that of the ground-truth classifier.
Finally, adversarial examples with small perturbation are possible if |w⊤ x| ≪ ∥w∥. The ex-
treme case w⊤ x = 0 means that x lies on the decision boundary of h, and if |w⊤ x| ≪ ∥w∥ then
x is close to the decision boundary of h.

of Theorem 16.4. We verify that x′ in (16.3.3) satisfies the conditions of an adversarial example in
Definition 16.2. In the following we will use that due to h(x)g(x) > 0
g(x) = sign(w⊤ x) = sign(w⊤ x) = h(x) ̸= 0. (16.3.5)
First, it holds
ε + |w⊤ x| ε + |w⊤ x|
∥x − x′ ∥ = w = = δ.
∥w∥2 ∥w∥
Next we show g(x)g(x′ ) > 0, i.e. that (w⊤ x)(w⊤ x′ ) is positive. Plugging in the definition of
x′ , this term reads
ε + |w⊤ x| ⊤ ε + |w⊤ x| ⊤
 
w⊤ x w⊤ x − h(x) 2
w w = |w⊤ x|2 − |w⊤ x| w w
∥w∥ ∥w∥2
ε + |w⊤ x| ⊤
≥ |w⊤ x|2 − |w⊤ x| |w w|, (16.3.6)
∥w∥2
where the equality holds because h(x) = g(x) = sign(w⊤ x) by (16.3.5). Dividing the right-hand
side of (16.3.6) by |w⊤ x|∥w∥, which is positive by (16.3.5), we obtain
|w⊤ x| ε + |w⊤ x| |w⊤ w|
− . (16.3.7)
∥w∥ ∥w∥ ∥w∥∥w∥
The term (16.3.7) is positive thanks to (16.3.2).
Finally, we check that 0 ̸=h(x′ ) ̸= h(x), i.e. (w⊤ x)(w⊤ x′ ) < 0. We have that
ε + |w⊤ x| ⊤
(w⊤ x)(w⊤ x′ ) = |w⊤ x|2 − w⊤ xh(x) w w
∥w∥2
= |w⊤ x|2 − |w⊤ x|(ε + |w⊤ x|) < 0,

where we used that h(x) = sign(w⊤ x). This completes the proof.

230
Theorem 16.4 readily implies the following proposition for affine classifiers.

Proposition 16.5. Let w, w ∈ Rd and b, b ∈ R. For x ∈ Rd let h(x) = sign(w⊤ x + b) be a


classifier and let g(x) = sign(w⊤ x + b) be the ground-truth classifier.
For every x ∈ Rd with w⊤ x ̸= 0, h(x)g(x) > 0, and all ε ∈ (0, |w⊤ x + b|) such that

|w⊤ x + b|2 (ε + |w⊤ x + b|)2 (w⊤ w + bb)2


> 2
∥w∥2 + b2 ∥w∥2 + b2 (∥w∥2 + b2 )(∥w∥2 + b )

it holds that
ε + |w⊤ x + b|
x′ = x − h(x) w
∥w∥2
is an adversarial example with perturbation δ = (ε + |w⊤ x + b|)/∥w∥ to x.

The proof is left to the reader, see Exercise 16.19.


Let us now study two cases of linear classifiers, which allow for different types of adversarial
examples. In the following two examples, the ground-truth classifier g : Rd → {−1, 1} is given by
g(x) = sign(w⊤ x) for w ∈ Rd with ∥w∥ = 1.
For the first example, we construct a Bayes classifier h admitting adversarial examples in the
complement of the feature support. This corresponds to case (ii) in Section 16.2.

Example 16.6. Let D be the uniform distribution on

{(λw, g(λw)) | λ ∈ [−1, 1] \ {0}} ⊆ Rd × {−1, 1}.

The feature support equals

Dx = {λw | λ ∈ [−1, 1] \ {0}} ⊆ span{w}.

Next fix α ∈ (0, 1) and set w := αw + (1 − α)v for some v ∈ w⊥ with ∥v∥ = 1, so that ∥w∥ = 1.
We let h(x) := sign(w⊤ x). We now show that every x ∈ Dx satisfies the assumptions of Theorem
16.4, and therefore admits an adversarial example.
Note that h(x) = g(x) for every x ∈ Dx . Hence h is a Bayes classifier. Now fix x ∈ Dx . Then
|w⊤ x| ≤ α|w⊤ x|, so that (16.3.2) is satisfied. Furthermore, for every ε > 0 it holds that

ε + |w⊤ x|
δ := ≤ ε + α.
∥w∥

Hence, for ε < |w⊤ x| it holds by Theorem 16.4 that there exists an adversarial example with
perturbation less than ε + α. For small α, the situation is depicted in the upper panel of Figure
16.2.

For the second example, we construct a distribution with global feature support and a classifier
which is not a Bayes classifier. This corresponds to case (iv) in Section 16.2.

231
A)
DBg

DBh x′

B)

DBg

DBh
x′
x

Figure 16.2: Illustration of the two types of adversarial examples in Examples 16.6 and 16.7. In
panel A) the feature support Dx corresponds to the dashed line. We depict the two decision
boundaries DBh = {x | w⊤ x = 0} of h(x) = sign(w⊤ x) and DBg = {x | w⊤ x = 0} g(x) =
sign(w⊤ x). Both h and g perfectly classify every data point in Dx . One data point x is shifted
outside of the support of the distribution in a way to change its label according to h. This creates
an adversarial example x′ . In panel B) the data distribution is globally supported. However, h
and g do not coincide. Thus the decision boundaries DBh and DBg do not coincide. Moving data
points across DBh can create adversarial examples, as depicted by x and x′ .

232
Example 16.7. Let Dx be a distribution on Rd with positive Lebesgue density everywhere outside
the decision boundary DBg = {x | w⊤ x = 0} of g. We define D to be the distribution of (X, g(X))
for X ∼ Dx . In addition, let w ∈ / {±w}, ∥w∥ = 1 and h(x) = sign(w⊤ x). We exclude w = −w
because, in this case, every prediction of h is wrong. Thus no adversarial examples are possible.
By construction the feature support is given by Dx = Rd . Moreover, h−1 ({±1}) and g −1 ({±1})
are half spaces, which implies that, in the notation of (16.2.2) that

dist(C±1 ∩ Dx , F±1 ∩ Dx ) = dist(C±1 , F±1 ) = 0.

Hence, for every δ > 0 there is a positive probability of observing x to which an adversarial example
with perturbation δ exists.
The situation is depicted in the lower panel of Figure 16.2.

16.4 ReLU neural networks


So far we discussed classification by affine classifiers. A binary classifier based on a ReLU neural
network is a function Rd ∋ x 7→ sign(Φ(x)), where Φ is a ReLU neural network. As noted in [223],
the arguments for affine classifiers, see Proposition 16.5, can be applied to the affine pieces of Φ, to
show existence of adversarial examples.
Consider a ground-truth classifier g : Rd → {−1, 0, 1}. For each x ∈ Rd we define the geometric
margin of g at x as

µg (x) := dist(x, g −1 ({g(x)})c ), (16.4.1)

i.e., as the distance of x to the closest element that is classified differently from x or the infimum
over all distances to elements from other classes if no closest element exists. Additionally, we denote
the distance of x to the closest adjacent affine piece by

νΦ (x) := dist(x, AcΦ,x ), (16.4.2)

where AΦ,x is the largest connected region on which Φ is affine and which contains x. We have the
following theorem.

Theorem 16.8. Let Φ : Rd → R and for x ∈ Rd let h(x) = sign(Φ(x)). Denote by g : Rd →


{−1, 0, 1} the ground-truth classifier. Let x ∈ Rd and ε > 0 be such that νΦ (x) > 0, g(x) ̸= 0,
∇Φ(x) ̸= 0 and

ε + |Φ(x)|
µg (x), νΦ (x) > .
∥∇Φ(x)∥

Then
ε + |Φ(x)|
x′ := x − h(x) ∇Φ(x)
∥∇Φ(x)∥2

is an adversarial example to x with perturbation δ = (ε + |Φ(x)|)/∥∇Φ(x)∥.

233
Proof. We show that x′ satisfies the properties in Definition 16.2.
By construction ∥x − x′ ∥ ≤ δ. Since µg (x) > δ it follows that g(x) = g(x′ ). Moreover, by
assumption g(x) ̸= 0, and thus g(x)g(x′ ) > 0.
It only remains to show that h(x′ ) ̸= h(x). Since δ < νΦ (x), we have that Φ(x) = ∇Φ(x)⊤ x + b
and Φ(x′ ) = ∇Φ(x)⊤ x′ + b for some b ∈ R. Therefore,
 
ε + |Φ(x)|
Φ(x) − Φ(x′ ) = ∇Φ(x)⊤ (x − x′ ) = ∇Φ(x)⊤ h(x) ∇Φ(x)
∥∇Φ(x)∥2
= h(x)(ε + |Φ(x)|).

Since h(x)|Φ(x)| = Φ(x) it follows that Φ(x′ ) = −h(x)ε. Hence, h(x′ ) = −h(x), which completes
the proof.

Remark 16.9. We look at the key parameters in Theorem 16.8 to understand which factors facilitate
adversarial examples.

• The geometric margin of the ground-truth classifier µg (x): To make the construction possible,
we need to be sufficiently far away from points that belong to a different class than x or to
the nonrelevant class.

• The distance to the next affine piece νΦ (x): Since we are looking for an adversarial example
within the same affine piece as x, we need this piece to be sufficiently large.

• The perturbation δ: The perturbation is given by (ε + |Φ(x)|)/∥∇Φ(x)∥, which depends on


the classification margin |Φ(x)| of the ReLU classifier and its sensitivity to inputs ∥∇Φ(x)∥.
For adversarial examples to be possible, we either want a small classification margin of Φ or
a high sensitivity of Φ to its inputs.

16.5 Robustness
Having established that adversarial examples can arise in various ways under mild assumptions, we
now turn our attention to conditions that prevent their existence.

16.5.1 Global Lipschitz regularity


We have repeatedly observed in the previous sections that a large value of ∥w∥ for linear classifiers
sign(w⊤ x), or ∥∇Φ(x)∥ for ReLU classifiers sign(Φ(x)), facilitates the occurrence of adversarial ex-
amples. Naturally, both these values are upper bounded by the Lipschitz constant of the classifier’s
inner functions x 7→ w⊤ x and x 7→ Φ(x). Consequently, it was stipulated early on that bound-
ing the Lipschitz constant of the inner functions could be an effective measure against adversarial
examples [223].
We have the following result for general classifiers of the form x 7→ sign(Φ(x)).

234
Proposition 16.10. Let Φ : Rd → R be CL -Lipschitz with CL > 0, and let s > 0. Let h(x) =
sign(Φ(x)) be a classifier, and let g : Rd → {−1, 0, 1} be a ground-truth classifier. Moreover, let
x ∈ Rd be such that

Φ(x)g(x) ≥ s. (16.5.1)

Then there does not exist an adversarial example to x of perturbation δ < s/CL .

Proof. Let x ∈ Rd satisfy (16.5.1) and assume that ∥x′ − x∥ ≤ δ. The Lipschitz continuity of Φ
implies

|Φ(x′ ) − Φ(x)| < s.

Since |Φ(x)| ≥ s we conclude that Φ(x′ ) has the same sign as Φ(x) which shows that x′ cannot be
an adversarial example to x.

Remark 16.11. As we have seen in Lemma 13.2, we can bound the Lipschitz constant of ReLU
neural networks by restricting the magnitude and number of their weights and the number of
layers.
There has been some criticism to results of this form, see, e.g., [99], since an assumption on the
Lipschitz constant may potentially restrict the capabilities of the neural network too much. We
next present a result that shows under which assumptions on the training set, there exists a neural
network that classifies the training set correctly, but does not allow for adversarial examples within
the training set.

Theorem 16.12. Let m ∈ N, let g : Rd → {−1, 0, 1} be a ground-truth classifier, and let


(xi , g(xi ))m d m
i=1 ∈ (R × {−1, 1}) . Assume that

|g(xi ) − g(xj )|
sup =: M
f > 0.
i̸=j ∥xi − xj ∥

Then there exists a ReLU neural network Φ with depth(Φ) = O(log(m)) and width(Φ) = O(dm)
such that for all i = 1, . . . , m
sign(Φ(xi )) = g(xi )
and there is no adversarial example of perturbation δ = 1/M
f to xi .

Proof. The result follows directly from Theorem 9.6 and Proposition 16.10. The reader is invited
to complete the argument in Exercise 16.20.

16.5.2 Local regularity


One issue with upper bounds involving global Lipschitz constants such as those in Proposition
16.10, is that these bounds may be quite large for deep neural networks. For example, the upper

235
bound given in Lemma 13.2 is

∥Φ(x) − Φ(x′ )∥∞ ≤ CσL · (Bdmax )L+1 ∥x − x′ ∥∞

which grows exponentially with the depth of the neural network. However, in practice this bound
may be pessimistic, and locally the neural network might have significantly smaller gradients than
the global Lipschitz constant.
Because of this, it is reasonable to study results preventing adversarial examples under local
Lipschitz bounds. Such a result together with an algorithm providing bounds on the local Lipschitz
constant was proposed in [88]. We state the theorem adapted to our set-up.

Theorem 16.13. Let h : Rd → {−1, 1} be a classifier of the form h(x) = sign(Φ(x)) and let
g : Rd → {−1, 0, 1} be the ground-truth classifier. Let x ∈ Rd satisfy g(x) ̸= 0, and set
 
 
 . |Φ(y) − Φ(x)| 
α := max min Φ(x)g(x) sup ,R , (16.5.2)
R>0 
 ∥y−x∥∞ ≤R ∥x − y∥∞ 

y̸=x

where the minimum is understood to be R in case the supremum is zero. Then there are no adver-
sarial examples to x with perturbation δ < α.

Proof. Let x ∈ Rd be as in the statement of the theorem. Assume, towards a contradiction, that
for 0 < δ < α satisfying (16.5.2), there exists an adversarial example x′ to x with perturbation δ.
If the supremum in (16.5.2) is zero, then Φ is constant on a ball of radius R around x. In
particular for ∥x′ − x∥ ≤ δ < R holds h(x′ ) = h(x) and x′ cannot be an adversarial example.
Now assume the supremum in (16.5.2) is not zero. It holds by (16.5.2), that
. |Φ(y) − Φ(x)|
δ < Φ(x)g(x) sup . (16.5.3)
∥y−x∥∞ ≤R ∥x − y∥∞
y̸=x

Moreover,

|Φ(y) − Φ(x)|
|Φ(x′ ) − Φ(x)| ≤ sup ∥x − x′ ∥∞
∥y−x∥∞ ≤R ∥x − y∥∞
y̸=x
|Φ(y) − Φ(x)|
≤ sup δ < Φ(x)g(x),
∥y−x∥∞ ≤R ∥x − y∥∞
y̸=x

where we applied (16.5.3) in the last line. It follows that

g(x)Φ(x′ ) = g(x)Φ(x) + g(x)(Φ(x′ ) − Φ(x))


≥ g(x)Φ(x) − |Φ(x′ ) − Φ(x)| > 0.

This rules out x′ as an adversarial example.

236
The supremum in (16.5.2) is bounded by the Lipschitz constant of Φ on BR (x). Thus Theorem
16.13 depends only on the local Lipschitz constant of Φ. One obvious criticism of this result is
that the computation of (16.5.2) is potentially prohibitive. We next show a different result, for
which the assumptions can immediately be checked by applying a simple algorithm that we present
subsequently.
To state the following proposition, for a continuous function Φ : Rd → R and δ > 0 we define
for x ∈ Rd and δ > 0

z δ,max := max{Φ(y) | ∥y − x∥∞ ≤ δ} (16.5.4)


δ,min
z := min{Φ(y) | ∥y − x∥∞ ≤ δ}. (16.5.5)

Proposition 16.14. Let h : Rd → {−1, 1} be a classifier of the form h(x) = sign(Φ(x)) and
g : Rd → {−1, 0, 1}, let x be such that h(x) = g(x). Then x does not have an adversarial example
of perturbation δ if z δ,max z δ,min > 0.

Proof. The proof is immediate, since z δ,max z δ,min > 0 implies that all points in a δ neighborhood
of x are classified the same.

To apply (16.14), we only have to compute z δ,max and z δ,min . It turns out that if Φ is a neural
network, then z δ,max , z δ,min can be approximated by a computation similar to a forward pass of
Φ. Denote by |A| the matrix obtained by taking the absolute value of each entry of the matrix A.
Additionally, we define
A+ = (|A| + A)/2 and A− = (|A| − A)/2.
The idea behind the Algorithm 2 is common in the area of neural network verification, see, e.g.,
[66, 61, 7, 238].
Remark 16.15. Up to constants, Algorithm 2 has the same computational complexity as a forward
pass, also see Algorithm 1. In addition, in contrast to upper bounds based on estimating the global
Lipschitz constant of Φ via its weights, the upper bounds found via Algorithm 2 include the effect of
the activation function σ. For example, if σ is the ReLU, then we may often end up in a situation,
where δ (ℓ),up or δ (ℓ),low can have many entries that are 0. If an entry of W (ℓ) x(ℓ) +b(ℓ) is nonpositive,
then it is guaranteed that the associated entry in δ (ℓ),low will be zero. Similarly, if W (ℓ) has only
few positive entries, then most of the entries of δ (ℓ),up are not propagated to δ (ℓ+1),up .
Next, we prove that Algorithm 2 indeed produces sensible output.

Proposition 16.16. Let Φ be a neural network with weight matrices W (ℓ) ∈ Rdℓ+1 ×dℓ and bias
vectors b(ℓ) ∈ Rdℓ+1 for ℓ = 0, . . . , L, and a monotonically increasing activation function σ.
Let x ∈ Rd0 and δ > 0. Then the output of Algorithm 2 satisfies

x(L+1) + δ (L+1),up ≥ z δ,max    and    x(L+1) − δ (L+1),low ≤ z δ,min .
Algorithm 2 Compute Φ(x), z δ,max and z δ,min for a given neural network
Input: weight matrices W (ℓ) ∈ Rdℓ+1 ×dℓ and bias vectors b(ℓ) ∈ Rdℓ+1 for ℓ = 0, . . . , L with
dL+1 = 1, monotonically increasing activation function σ, input vector x ∈ Rd0 , neighborhood size δ > 0
Output: Bounds for z δ,max and z δ,min

x(0) = x
δ (0),up = δ1 ∈ Rd0
δ (0),low = δ1 ∈ Rd0
for ℓ = 0, . . . , L − 1 do
x(ℓ+1) = σ(W (ℓ) x(ℓ) + b(ℓ) )
δ (ℓ+1),up = σ(W (ℓ) x(ℓ) + (W (ℓ) )+ δ (ℓ),up + (W (ℓ) )− δ (ℓ),low + b(ℓ) ) − x(ℓ+1)
δ (ℓ+1),low = x(ℓ+1) − σ(W (ℓ) x(ℓ) − (W (ℓ) )+ δ (ℓ),low − (W (ℓ) )− δ (ℓ),up + b(ℓ) )
end for
x(L+1) = W (L) x(L) + b(L)
δ (L+1),up = (W (L) )+ δ (L),up + (W (L) )− δ (L),low
δ (L+1),low = (W (L) )+ δ (L),low + (W (L) )− δ (L),up
return x(L+1) , x(L+1) + δ (L+1),up , x(L+1) − δ (L+1),low

Proof. Fix y, x ∈ Rd with ∥y − x∥∞ ≤ δ and let y (ℓ) , x(ℓ) for ℓ = 0, . . . , L + 1 be as in Algorithm
2 applied to y, x, respectively. Moreover, let δ ℓ,up , δ ℓ,low for ℓ = 0, . . . , L + 1 be as in Algorithm 2
applied to x. We will prove by induction over ℓ = 0, . . . , L + 1 that

y (ℓ) − x(ℓ) ≤ δ ℓ,up and x(ℓ) − y (ℓ) ≤ δ ℓ,low , (16.5.6)

where the inequalities are understood entry-wise for vectors. Since y was arbitrary this then proves
the result.
The case ℓ = 0 follows immediately from ∥y − x∥∞ ≤ δ. Assume now, that the statement was
shown for ℓ < L. We have that

y (ℓ+1) − x(ℓ+1) − δ (ℓ+1),up = σ(W (ℓ) y (ℓ) + b(ℓ) ) − σ(W (ℓ) x(ℓ) + (W (ℓ) )+ δ (ℓ),up + (W (ℓ) )− δ (ℓ),low + b(ℓ) ).


The monotonicity of σ implies that

y (ℓ+1) − x(ℓ+1) ≤ δ ℓ+1,up

if

W (ℓ) y (ℓ) ≤ W (ℓ) x(ℓ) + (W (ℓ) )+ δ (ℓ),up + (W (ℓ) )− δ (ℓ),low . (16.5.7)

To prove (16.5.7), we observe that

W (ℓ) (y (ℓ) − x(ℓ) ) = (W (ℓ) )+ (y (ℓ) − x(ℓ) ) − (W (ℓ) )− (y (ℓ) − x(ℓ) )


= (W (ℓ) )+ (y (ℓ) − x(ℓ) ) + (W (ℓ) )− (x(ℓ) − y (ℓ) )
≤ (W (ℓ) )+ δ (ℓ),up + (W (ℓ) )− δ (ℓ),low ,

where we used the induction assumption in the last line. This shows the first estimate in (16.5.6).
Similarly,

x(ℓ+1) − y (ℓ+1) − δ ℓ+1,low


= σ(W (ℓ) x(ℓ) − (W (ℓ) )+ δ (ℓ),low − (W (ℓ) )− δ (ℓ),up + b(ℓ) ) − σ(W (ℓ) y (ℓ) + b(ℓ) ).

Hence, x(ℓ+1) − y (ℓ+1) ≤ δ ℓ+1,low if

W (ℓ) y (ℓ) ≥ W (ℓ) x(ℓ) − (W (ℓ) )+ δ (ℓ),low − (W (ℓ) )− δ (ℓ),up . (16.5.8)

To prove (16.5.8), we observe that

W (ℓ) (x(ℓ) − y (ℓ) ) = (W (ℓ) )+ (x(ℓ) − y (ℓ) ) − (W (ℓ) )− (x(ℓ) − y (ℓ) )


= (W (ℓ) )+ (x(ℓ) − y (ℓ) ) + (W (ℓ) )− (y (ℓ) − x(ℓ) )
≤ (W (ℓ) )+ δ (ℓ),low + (W (ℓ) )− δ (ℓ),up ,

where we used the induction assumption in the last line. This completes the proof of (16.5.6) for
all ℓ ≤ L.
The case ℓ = L + 1 follows by the same argument, but replacing σ by the identity.
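
To illustrate Algorithm 2 and its use in combination with Proposition 16.14, the following NumPy
sketch (our own transcription; the function and variable names are not from the text) performs the
modified forward pass and returns Φ(x) together with the two bounds.

import numpy as np

# Transcription of Algorithm 2: given weights Ws[l], biases bs[l] and a monotonically
# increasing activation sigma, propagate the input x together with interval radii.
def bound_propagation(Ws, bs, sigma, x, delta):
    x_l = x
    d_up = delta * np.ones_like(x)
    d_low = delta * np.ones_like(x)
    for W, b in zip(Ws[:-1], bs[:-1]):
        Wp, Wm = (np.abs(W) + W) / 2, (np.abs(W) - W) / 2
        x_next = sigma(W @ x_l + b)
        up_next = sigma(W @ x_l + Wp @ d_up + Wm @ d_low + b) - x_next
        low_next = x_next - sigma(W @ x_l - Wp @ d_low - Wm @ d_up + b)
        x_l, d_up, d_low = x_next, up_next, low_next
    W, b = Ws[-1], bs[-1]
    Wp, Wm = (np.abs(W) + W) / 2, (np.abs(W) - W) / 2
    out = W @ x_l + b
    return out, out + Wp @ d_up + Wm @ d_low, out - (Wp @ d_low + Wm @ d_up)

relu = lambda t: np.maximum(t, 0)
rng = np.random.default_rng(1)
Ws = [rng.normal(size=(20, 5)), rng.normal(size=(20, 20)), rng.normal(size=(1, 20))]
bs = [rng.normal(size=20), rng.normal(size=20), rng.normal(size=1)]
x = rng.normal(size=5)
phi_x, upper, lower = bound_propagation(Ws, bs, relu, x, delta=1e-3)
# By Proposition 16.16, lower <= z^{delta,min} and upper >= z^{delta,max}; if
# lower * upper > 0, Proposition 16.14 certifies that x has no adversarial example
# with perturbation delta.
print(phi_x, lower, upper, (lower * upper > 0).item())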

Bibliography and further reading


This chapter begins with the foundational paper [223], but it should be remarked that adversarial
examples for non-deep-learning models in machine learning were studied earlier in [98].
The results in this chapter are inspired by various results in the literature, though they may
not be found in precisely the same form. The overall setup is inspired by [223]. The explanation
based on the high-dimensionality of the data given in Section 16.3 was first formulated in [223] and
[73]. The formalism reviewed in Section 16.2 is inspired by [218]. The results on robustness via
local Lipschitz properties are due to [88]. Algorithm 2 is covered by results in the area of network
verifiability [66, 61, 7, 238]. For a more comprehensive overview of modern approaches, we refer to
the survey article [193].
Important directions not discussed in this chapter are the transferability of adversarial ex-
amples, defense mechanisms, and alternative adversarial operations. Transferability refers to the
phenomenon that adversarial examples for one model often also fool other models, [170, 153]. De-
fense mechanisms, i.e., techniques for specifically training a neural network to prevent adversarial
examples, include for example the Fast Gradient Sign Method of [73], and more sophisticated recent
approaches such as [32]. Finally, adversarial examples can be generated not only through additive
perturbations, but also through smooth transformations of images, as demonstrated in [1, 243].

Exercises
Exercise 16.17. Prove (16.3.1) by comparing the volume of the d-dimensional Euclidean unit ball
with the volume of the d-dimensional 1-ball of radius c for a given c > 0.

Exercise 16.18. Fix δ > 0. For a pair of classifiers h and g such that C1 ∪ C−1 = ∅ in (16.2.2), there
trivially cannot exist any adversarial examples. Construct an example of h, g, D such that C1 ,
C−1 ̸= ∅, h is not a Bayes classifier, and g is such that no adversarial examples with perturbation
δ exist.
Is this also possible if g −1 (0) = ∅?

Exercise 16.19. Prove Proposition 16.5.


Hint: Repeat the proof of Theorem 16.4. In the first part set x(ext) = (x, 1), w(ext) = (w, b)
and w(ext) = (w, b). Then show that h(x′ ) ̸= h(x) by plugging in the definition of x′ .

Exercise 16.20. Complete the proof of Theorem 16.12.

Appendix A

Probability theory

This appendix provides some basic notions and results in probability theory required in the main
text. It is intended as a revision for a reader already familiar with these concepts. For more details
and proofs, we refer for example to the standard textbook [117].

A.1 Sigma-algebras, topologies, and measures


Let Ω be a set, and denote by 2Ω the powerset of Ω.

Definition A.1. A subset A ⊆ 2Ω is called a sigma-algebra on Ω if it satisfies

(i) Ω ∈ A,

(ii) Ac ∈ A whenever A ∈ A,
(iii) ∪_{i∈N} Ai ∈ A whenever Ai ∈ A for all i ∈ N.

For a sigma-algebra A on Ω, the tuple (Ω, A) is also referred to as a measurable space. For a
measurable space, a subset A ⊆ Ω is called measurable, if A ∈ A. Measurable sets are also called
events.
Another key system of subsets of Ω is that of a topology.

Definition A.2. A subset T ⊆ 2Ω is called a topology on Ω if it satisfies

(i) ∅, Ω ∈ T,

(ii) ∩_{j=1}^n Oj ∈ T whenever n ∈ N and O1 , . . . , On ∈ T,

(iii) ∪_{i∈I} Oi ∈ T whenever Oi ∈ T for all i in an index set I.
If T is a topology on Ω, we call (Ω, T) a topological space, and a set O ⊆ Ω is called open if and
only if O ∈ T.

Remark A.3. The two notions differ in that a topology allows for unions of arbitrary (possibly un-
countably many) sets, but only for finite intersection, whereas a sigma-algebra allows for countable
unions and intersections.
Example A.4. Let d ∈ N and denote by Bε (x) = {y ∈ Rd | ∥y − x∥ < ε} the set of points
whose Euclidean distance to x is less than ε. Then for every A ⊆ Rd , the smallest topology on A
containing A ∩ Bε (x) for all ε > 0, x ∈ Rd , is called the Euclidean topology on A.
If (Ω, T) is a topological space, then the Borel sigma-algebra refers to the smallest sigma-
algebra on Ω containing all open sets, i.e. all elements of T. Throughout this book, subsets of Rd
are always understood to be equipped with the Euclidean topology and the Borel sigma-algebra.
The Borel sigma-algebra on Rd is denoted by Bd .
We can now introduce measures.

Definition A.5. Let (Ω, A) be a measurable space. A mapping µ : A → [0, ∞] is called a measure
if it satisfies

(i) µ(∅) = 0,

(ii) for every sequence (Ai )i∈N ⊆ A such that Ai ∩ Aj = ∅ whenever i ̸= j, it holds
µ( ∪_{i∈N} Ai ) = Σ_{i∈N} µ(Ai ).

We say that the measure is finite if µ(Ω) < ∞, and it is sigma-finite if there exists a sequence
(Ai )i∈N ⊆ A such that Ω = ∪_{i∈N} Ai and µ(Ai ) < ∞ for all i ∈ N. In case µ(Ω) = 1, the measure is
called a probability measure.

Example A.6. One can show that there exists a unique measure λ on (Rd , Bd ), such that for all
sets of the type ×_{i=1}^d [ai , bi ) with −∞ < ai ≤ bi < ∞ it holds

λ( ×_{i=1}^d [ai , bi ) ) = ∏_{i=1}^d (bi − ai ).
This measure is called the Lebesgue measure.
If µ is a measure on the measurable space (Ω, A), then the triplet (Ω, A, µ) is called a measure
space. In case µ is a probability measure, it is called a probability space.
Let (Ω, A, µ) be a measure space. A subset N ⊆ Ω is called a null-set, if N is measurable and
µ(N ) = 0. Moreover, an equality or inequality is said to hold µ-almost everywhere or µ-almost
surely, if it is satisfied on the complement of a null-set. In case µ is clear from context, we simply
write “almost everywhere” or “almost surely” instead. Usually this refers to the Lebesgue measure.

A.2 Random variables


A.2.1 Measurability of functions
To define random variables, we first need to recall the measurability of functions.

Definition A.7. Let (Ω1 , A1 ) and (Ω2 , A2 ) be two measurable spaces. A function f : Ω1 → Ω2 is
called measurable if

f −1 (A2 ) := {ω ∈ Ω1 | f (ω) ∈ A2 } ∈ A1 for all A2 ∈ A2 .

A mapping X : Ω1 → Ω2 is called a Ω2 -valued random variable if it is measurable.

Remark A.8. We again point out the parallels to topological spaces: A function f : Ω1 → Ω2
between two topological spaces (Ω1 , T1 ) and (Ω2 , T2 ) is called continuous if f −1 (O2 ) ∈ T1 for all
O2 ∈ T2 .
Let Ω1 be a set and let (Ω2 , A2 ) be a measurable space. For X : Ω1 → Ω2 , we can ask for
the smallest sigma-algebra AX on Ω1 , such that X is measurable as a mapping from (Ω1 , AX ) to
(Ω2 , A2 ). Clearly, for every sigma-algebra A1 on Ω1 , X is measurable as a mapping from (Ω1 , A1 )
to (Ω2 , A2 ) if and only if every A ∈ AX belongs to A1 ; or in other words, AX is a sub sigma-algebra
of A1 . It is easy to check that AX is given through the following definition.

Definition A.9. Let X : Ω1 → Ω2 be a random variable. Then

AX := {X −1 (A2 ) | A2 ∈ A2 } ⊆ 2Ω1

is the sigma-algebra induced by X on Ω1 .

A.2.2 Distribution and expectation


Now let (Ω1 , A1 , P) be a probability space, let (Ω2 , A2 ) be a measurable space, and let X : Ω1 → Ω2
be a random variable. Then X naturally induces a measure on (Ω2 , A2 ) via

PX [A2 ] := P[X −1 (A2 )] for all A2 ∈ A2 .

Note that due to the measurability of X it holds X −1 (A2 ) ∈ A1 , so that PX is well-defined.

Definition A.10. The measure PX is called the distribution of X. If (Ω2 , A2 ) = (Rd , Bd ), and
there exists a function fX : Rd → R such that
PX [A] = ∫_A fX (x) dx    for all A ∈ Bd ,

then fX is called the (Lebesgue) density of X.

Remark A.11. The term distribution is often used without specifying an underlying probability
space and random variable. In this case, “distribution” stands interchangeably for “probability

measure”. For example, µ is a distribution on Ω2 states that µ is a probability measure on the
measurable space (Ω2 , A2 ). In this case, there always exists a probability space (Ω1 , A1 , P) and a
random variable X : Ω1 → Ω2 such that PX = µ; namely (Ω1 , A1 , P) = (Ω2 , A2 , µ) and X(ω) = ω.
Example A.12. Some important distributions include the following.
• Bernoulli distribution: A random variable X : Ω → {0, 1} is Bernoulli distributed if there
exists p ∈ [0, 1] such that P[X = 1] = p and P[X = 0] = 1 − p.
• Uniform distribution: A random variable X : Ω → Rd is uniformly distributed on a
measurable set A ∈ Bd , if its density equals
fX (x) = 1A (x)/|A|,

where 0 < |A| < ∞ is the Lebesgue measure of A.
• Gaussian distribution: A random variable X : Ω → Rd is Gaussian distributed with mean
m ∈ Rd and regular (i.e., invertible) covariance matrix C ∈ Rd×d , if its density equals

fX (x) = (2π)^{−d/2} det(C)^{−1/2} exp( −(1/2) (x − m)⊤ C −1 (x − m) ).

We denote this distribution by N(m, C).
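
As a quick numerical sanity check of the Gaussian density formula (our own illustration, using a
crude grid-based integration for d = 2; all names are ours), one can verify that the density integrates
to approximately one:

import numpy as np

# Grid-based check that the d = 2 Gaussian density integrates to roughly 1.
m = np.array([1.0, -0.5])
C = np.array([[2.0, 0.3], [0.3, 0.5]])
C_inv, det_C = np.linalg.inv(C), np.linalg.det(C)

def density(x):
    z = x - m
    return np.exp(-0.5 * z @ C_inv @ z) / ((2 * np.pi) * np.sqrt(det_C))  # (2*pi)^{d/2} with d = 2

t = np.linspace(-10.0, 10.0, 300)
h = t[1] - t[0]
total = sum(density(np.array([a, b])) for a in t for b in t) * h**2
print(total)   # close to 1.0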
Let (Ω, A, P) be a probability space, let X : Ω → Rd be an Rd -valued random variable. We then
call the Lebesgue integral
E[X] := ∫_Ω X(ω) dP(ω) = ∫_{Rd} x dPX (x)        (A.2.1)

the expectation of X. Moreover, for k ∈ N we say that X has finite k-th moment if E[∥X∥k ] <
∞. Similarly, for a probability measure µ on Rd and k ∈ N, we say that µ has finite k-th moment
if
∫_{Rd} ∥x∥^k dµ(x) < ∞.
Furthermore, the matrix
∫_Ω (X(ω) − E[X])(X(ω) − E[X])⊤ dP(ω) ∈ Rd×d

is the covariance of X : Ω → Rd . For d = 1, it is called the variance of X and denoted by V[X].


Finally, we recall different variants of convergence for random variables.

Definition A.13. Let (Ω, A, P) be a probability space, and let Xj : Ω → Rd , j ∈ N, be a sequence


of random variables and let X : Ω → Rd also be a random variable. The sequence is said to

(i) converge almost surely to X, if


 
P[ {ω ∈ Ω | lim_{j→∞} Xj (ω) = X(ω)} ] = 1,

(ii) converge in probability to X, if

for all ε > 0 :   lim_{j→∞} P[ {ω ∈ Ω | ∥Xj (ω) − X(ω)∥ > ε} ] = 0,

(iii) converge weakly to X, if for all bounded continuous functions f : Rd → R holds

lim_{j→∞} E[f ◦ Xj ] = E[f ◦ X].

The notions in Definition A.13 are ordered by decreasing strength, i.e. almost sure convergence
implies convergence in probability, and convergence in probability implies weak convergence, see
for example [117, Chapter 13]. Since E[f ◦ X] = ∫_{Rd} f (x) dPX (x), the notion of weak convergence
only depends on the distribution PX of X. We thus also say that a sequence of random variables
converges weakly towards a measure µ.

A.3 Conditionals, marginals, and independence


In this section, we concentrate on Rd -valued random variables, although the following concepts can
be extended to more general spaces.

A.3.1 Joint and marginal distribution


Let again (Ω, A, P) be a probability space, and let X : Ω → RdX , Y : Ω → RdY be two random
variables. Then
Z := (X, Y ) : Ω → RdX +dY
is also a random variable. Its distribution PZ is a measure on the measurable space (RdX +dY , BdX +dY ),
and PZ is referred to as the joint distribution of X and Y . On the other hand, PX , PY are called
the marginal distributions of X, Y . Note that

PX [A] = PZ [A × RdY ] for all A ∈ BdX ,

and similarly for PY . Thus the marginals PX , PY , can be constructed from the joint distribution
PZ . In turn, knowledge of the marginals is not sufficient to construct the joint distribution.

A.3.2 Independence
The concept of independence serves to formalize the situation, where knowledge of one random
variable provides no information about another random variable. We first give the formal definition,
and afterwards discuss the roll of a die as a simple example.

Definition A.14. Let (Ω, A, P) be a probability space. Then two events A, B ∈ A are called
independent if
P[A ∩ B] = P[A]P[B].
Two random variables X : Ω → RdX and Y : Ω → RdY are called independent, if

A, B are independent for all A ∈ AX , B ∈ AY .

Two random variables are thus independent, if and only if all events in their induced sigma-
algebras are independent. This turns out to be equivalent to the joint distribution P(X,Y ) being
equal to the product measure PX ⊗ PY ; the latter is characterized as the unique measure µ on
RdX +dY satisfying µ(A × B) = PX [A]PY [B] for all A ∈ BdX , B ∈ BdY .
Example A.15. Let Ω = {1, . . . , 6} represent the outcomes of rolling a fair die, let A = 2Ω be the
sigma-algebra, and let P[ω] = 1/6 for all ω ∈ Ω. Consider the three random variables

X1 (ω) = 0 if ω is odd and 1 if ω is even,    X2 (ω) = 0 if ω ≤ 3 and 1 if ω ≥ 4,
X3 (ω) = 0 if ω ∈ {1, 2},  1 if ω ∈ {3, 4},  and 2 if ω ∈ {5, 6}.

These random variables can be interpreted as follows:


• X1 indicates whether the roll yields an odd or even number.
• X2 indicates whether the roll yields a number at most 3 or at least 4.
• X3 categorizes the roll into one of the groups {1, 2}, {3, 4} or {5, 6}.
The induced sigma-algebras are
AX1 = {∅, Ω, {1, 3, 5}, {2, 4, 6}}
AX2 = {∅, Ω, {1, 2, 3}, {4, 5, 6}}
AX3 = {∅, Ω, {1, 2}, {3, 4}, {5, 6}, {1, 2, 3, 4}, {1, 2, 5, 6}, {3, 4, 5, 6}}.
We leave it to the reader to formally check that X1 and X2 are not independent, but X1 and X3
are independent. This reflects the fact that, for example, knowing the outcome to be odd, makes
it more likely that the number belongs to {1, 2, 3} rather than {4, 5, 6}. However, this knowledge
provides no information on the three categories {1, 2}, {3, 4}, and {5, 6}.
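
The claim left to the reader can also be checked by brute force; the following short script (ours, with
hypothetical names) enumerates Ω and compares joint and product probabilities for all values.

from itertools import product

# Exhaustive independence check on Omega = {1,...,6} with the uniform measure.
Omega = range(1, 7)
X1 = lambda w: 0 if w % 2 == 1 else 1
X2 = lambda w: 0 if w <= 3 else 1
X3 = lambda w: (w - 1) // 2      # 0 on {1,2}, 1 on {3,4}, 2 on {5,6}

def independent(X, Y):
    P = lambda event: sum(1 for w in Omega if event(w)) / 6
    values_X = {X(w) for w in Omega}
    values_Y = {Y(w) for w in Omega}
    return all(
        abs(P(lambda w: X(w) == a and Y(w) == b) - P(lambda w: X(w) == a) * P(lambda w: Y(w) == b)) < 1e-12
        for a, b in product(values_X, values_Y)
    )

print(independent(X1, X2))   # False
print(independent(X1, X3))   # True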
If X : Ω → R, Y : Ω → R are two independent random variables, then, due to P(X,Y ) = PX ⊗ PY ,

E[XY ] = ∫_Ω X(ω)Y (ω) dP(ω) = ∫_{R2} xy dP(X,Y ) (x, y) = ( ∫_R x dPX (x) )( ∫_R y dPY (y) ) = E[X]E[Y ].

Using this observation, it is easy to see that for a sequence of independent R-valued random variables
(Xi )_{i=1}^n with bounded second moments, there holds Bienaymé’s identity

V[ Σ_{i=1}^n Xi ] = Σ_{i=1}^n V[Xi ].        (A.3.1)

A.3.3 Conditional distributions


Let (Ω, A, P) be a probability space, and let A, B ∈ A be two events. In case P[B] > 0, we define
P[A|B] := P[A ∩ B] / P[B],        (A.3.2)
and call P[A|B] the conditional probability of A given B.
Example A.16. Consider the setting of Example A.15. Let A = {ω ∈ Ω | X1 (ω) = 0} be the event
that the outcome of the die roll was an odd number and let B = {ω ∈ Ω | X2 (ω) = 0} be the event
that the outcome yielded a number at most 3. Then P[B] = 1/2, and P[A ∩ B] = 1/3. Thus
P[A|B] = P[A ∩ B] / P[B] = (1/3) / (1/2) = 2/3.
This reflects that, given we know the outcome to be at most 3, the probability of the number being
odd, i.e. in {1, 3}, is larger than the probability of the number being even, i.e. equal to 2.
The conditional probability in (A.3.2) is only well-defined if P[B] > 0. In practice, we often
encounter the case where we would like to condition on an event of probability zero.
Example A.17. Consider the following procedure: We first draw a random number p ∈ [0, 1]
according to a uniform distribution on [0, 1]. Afterwards we draw a random number X ∈ {0, 1}
according to a p-Bernoulli distribution, i.e. P[X = 1] = p and P[X = 0] = 1 − p. Then (p, X) is
a joint random variable taking values in [0, 1] × {0, 1}. What is P[X = 1|p = 0.5] in this case?
Intuitively, it should be 1/2, but note that P[p = 0.5] = 0, so that (A.3.2) is not meaningful here.
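
A small simulation (our own sketch) makes this intuition precise in an approximate sense: conditioning
on p lying in a shrinking window around 0.5 yields relative frequencies of X = 1 close to 1/2.

import numpy as np

# Approximate the conditional probability P[X = 1 | p close to 0.5] by binning.
rng = np.random.default_rng(0)
N = 1_000_000
p = rng.uniform(0.0, 1.0, size=N)
X = (rng.uniform(0.0, 1.0, size=N) < p).astype(int)   # X | p is Bernoulli(p)

for width in [0.1, 0.01, 0.001]:
    mask = np.abs(p - 0.5) < width
    print(width, X[mask].mean())   # approaches 1/2 as the window shrinks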

Definition A.18 (regular conditional distribution). Let (Ω, A, P) be a probability space, and let
X : Ω → RdX and Y : Ω → RdY be two random variables. Let τX|Y : BdX × RdY → [0, 1] satisfy

(i) y 7→ τX|Y (A, y) : RdY → [0, 1] is measurable for every fixed A ∈ BdX ,

(ii) A 7→ τX|Y (A, y) is a probability measure on (RdX , BdX ) for every y ∈ Y (Ω),

(iii) for all A ∈ BdX and all B ∈ BdY holds


P[X ∈ A, Y ∈ B] = ∫_B τX|Y (A, y) dPY (y).

Then τ is called a regular (version of the) conditional distribution of X given Y . In this


case, we denote
P[X ∈ A|Y = y] := τX|Y (A, y),
and refer to this measure as the conditional distribution of X|Y = y.

Definition A.18 provides a mathematically rigorous way of assigning a distribution to a random
variable conditioned on an event that may have probability zero, as in Example A.17. Existence
and uniqueness of these conditional distributions hold in the following sense, see for example [117,
Chapter 8] or [201, Chapter 3] for the specific statement given here.

Theorem A.19. Let (Ω, A, P) be a probability space, and let X : Ω → RdX , Y : Ω → RdY be two
random variables. Then there exists a regular version of the conditional distribution τ1 .
Let τ2 be another regular version of the conditional distribution. Then there exists a PY -null
set N ⊆ RdY , such that for all y ∈ N c ∩ Y (Ω), the two probability measures τ1 (·, y) and τ2 (·, y)
coincide.

In particular, conditional distributions are only well-defined in a PY -almost everywhere sense.

Definition A.20. Let (Ω, A, P) be a probability space, and let X : Ω → RdX , Y : Ω → RdY ,
Z : Ω → RdZ be three random variables. We say that X and Z are conditionally independent
given Y , if the two distributions X|Y = y and Z|Y = y are independent for PY -almost every
y ∈ Y (Ω).

A.4 Concentration inequalities


Let Xi : Ω → R, i ∈ N, be a sequence of random variables with finite first moments. The centered
average over the first n terms
Sn := (1/n) Σ_{i=1}^n (Xi − E[Xi ])        (A.4.1)
is another random variable, and by linearity of the expectation it holds E[Sn ] = 0. The sequence
is said to satisfy the strong law of large numbers if
P[ lim sup_{n→∞} |Sn | = 0 ] = 1.

This is for example the case if the Xi are independent and there exists C < ∞ such that V[Xi ] ≤ C for all i ∈ N. Concentration
inequalities provide bounds on the rate of this convergence.
We start with Markov’s inequality.

Lemma A.21 (Markov’s inequality). Let X : Ω → R be a random variable, and let φ : [0, ∞) →
[0, ∞) be monotonically increasing. Then for all ε > 0

P[|X| ≥ ε] ≤ E[φ(|X|)] / φ(ε).

Proof. We have
P[|X| ≥ ε] = ∫_{{|X|≥ε}} 1 dP(ω) ≤ ∫_Ω ( φ(|X(ω)|) / φ(ε) ) dP(ω) = E[φ(|X|)] / φ(ε),

which gives the claim.

Applying Markov’s inequality with φ(x) := x2 to the random variable X − E[X] directly gives
Chebyshev’s inequality.

Lemma A.22 (Chebyshev’s inequality). Let X : Ω → R be a random variable with finite variance.
Then for all ε > 0
P[|X − E[X]| ≥ ε] ≤ V[X] / ε² .

From Chebyshev’s inequality we obtain the next result, which is a quite general concentration
inequality for random variables with finite variances.

Theorem A.23. Let X1 , . . . , Xn be n ∈ N independent real-valued random variables such that for
some ς > 0 holds E[|Xi − µ|2 ] ≤ ς 2 for all i = 1, . . . , n. Denote
µ := E[ (1/n) Σ_{j=1}^n Xj ].        (A.4.2)

Then for all ε > 0


P[ | (1/n) Σ_{i=1}^n Xi − µ | ≥ ε ] ≤ ς² / (ε² n).

Proof. Let Sn = (1/n) Σ_{i=1}^n (Xi − E[Xi ]) = (1/n) Σ_{i=1}^n Xi − µ. By Bienaymé’s identity (A.3.1), it
holds that

V[Sn ] = (1/n²) Σ_{i=1}^n E[(Xi − E[Xi ])²] ≤ (1/n²) Σ_{i=1}^n E[(Xi − µ)²] ≤ ς²/n,

where we used that z 7→ E[(Xi − z)²] is minimized at z = E[Xi ]. Since E[Sn ] = 0, Chebyshev’s
inequality applied to Sn gives the statement.

If we have additional information about the random variables, then we can derive sharper
bounds. In case of uniformly bounded random variables (rather than just bounded variance),
Hoeffding’s inequality, which we recall next, shows an exponential rate of concentration around the
mean.

Theorem A.24 (Hoeffding’s inequality). Let a, b ∈ R. Let X1 , . . . , Xn be n ∈ N independent
random real-valued variables such that a ≤ Xi ≤ b almost surely for all i = 1, . . . , n, and let µ be
as in (A.4.2). Then, for every ε > 0
 
P[ | (1/n) Σ_{j=1}^n Xj − µ | > ε ] ≤ 2 exp( −2nε² / (b − a)² ).

A proof can, for example, be found in [212, Section B.4], where this version is also taken from.
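
The following small simulation (our own, for i.i.d. uniform variables on [0, 1]; all names are ours)
compares the empirical probability of a deviation of the sample mean with the bound of Theorem A.23
and with Hoeffding's bound.

import numpy as np

# Monte Carlo comparison of the Chebyshev-type bound (Theorem A.23) and
# Hoeffding's bound (Theorem A.24) for averages of U[0,1] variables.
rng = np.random.default_rng(0)
n, eps, trials = 200, 0.1, 20_000
a, b, sigma2 = 0.0, 1.0, 1 / 12                 # variance of U[0, 1]

samples = rng.uniform(a, b, size=(trials, n))
deviation = np.abs(samples.mean(axis=1) - 0.5)
empirical = np.mean(deviation > eps)

chebyshev = sigma2 / (eps**2 * n)                        # Theorem A.23
hoeffding = 2 * np.exp(-2 * n * eps**2 / (b - a) ** 2)   # Theorem A.24
print(empirical, chebyshev, hoeffding)   # here: empirical <= hoeffding <= chebyshev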
Finally, we recall the central limit theorem, in its multivariate formulation. We say that (Xj )j∈N
is an i.i.d. sequence of random variables, if the random variables are (pairwise) independent
and identically distributed. For a proof see [117, Theorem 15.58].

Theorem A.25 (Multivariate central limit theorem). Let (Xn )n∈N be an i.i.d. sequence of Rd -
valued random variables, such that E[Xn ] = 0 ∈ Rd and E[Xn,i Xn,j ] = Cij for all i, j = 1, . . . , d.
Let
Yn := (X1 + · · · + Xn ) / √n .
Then Yn converges weakly to N(0, C) as n → ∞.
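
A simple simulation (ours, for d = 1 and uniform summands with unit variance) illustrates the
statement:

import numpy as np

# Rescaled sums of centered uniform variables behave approximately like N(0, 1).
rng = np.random.default_rng(0)
n, trials = 100, 20_000
X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(trials, n))   # mean 0, variance 1
Y_n = X.sum(axis=1) / np.sqrt(n)

print(Y_n.mean(), Y_n.var())   # approximately 0 and 1
print(np.mean(Y_n <= 1.0))     # approximately Phi(1) = 0.8413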

Appendix B

Functional analysis

This appendix provides some basic notions and results in functional analysis required in the main
text. It is intended as a revision for a reader already familiar with these concepts. For more details
and proofs, we refer for example to the standard textbooks [195, 196, 41, 77].

B.1 Vector spaces

Definition B.1. Let K ∈ {R, C}. A vector space (over K) is a set X such that the following
holds:

(i) Properties of addition: For every x, y ∈ X there exists x + y ∈ X such that for all z ∈ X

x+y =y+x and x + (y + z) = (x + y) + z.

Moreover, there exists a unique element 0 ∈ X such that x + 0 = x for all x ∈ X and for each
x ∈ X there exists a unique −x ∈ X such that x + (−x) = 0.

(ii) Properties of scalar multiplication: There exists a map (α, x) 7→ αx from K × X to X called
scalar multiplication. It satisfies 1x = x and (αβ)x = α(βx) for all x ∈ X.

We call the elements of a vector space vectors.

If the field is clear from context, we simply refer to X as a vector space. We will primarily consider
the case K = R, and in this case we also say that X is a real vector space.
To introduce a notion of convergence on a vector space X, it needs to be equipped with a
topology, see Definition A.2. A topological vector space is a vector space which is also a
topological space, and in which addition and scalar multiplication are continuous maps. We next
discuss the most important instances of topological vector spaces.

B.1.1 Metric spaces


An important class of topological vector spaces consists of vector spaces that are also metric spaces.

Definition B.2. For a set X, we call a map dX : X × X → R+ a metric, if

(i) 0 ≤ dX (x, y) < ∞ for all x, y ∈ X,

(ii) dX (x, y) = 0 if and only if x = y,

(iii) dX (x, y) = dX (y, x) for all x, y ∈ X,

(iv) dX (x, z) ≤ dX (x, y) + dX (y, z) for all x, y, z ∈ X.

We call (X, dX ) a metric space.

In a metric space (X, dX ), we denote the open ball with center x and radius r > 0 by

Br (x) := {y ∈ X | dX (x, y) < r}. (B.1.1)

Every metric space is naturally equipped with a topology: A set A ⊆ X is open if and only if for
every x ∈ A exists ε > 0 such that Bε (x) ⊆ A. Therefore every metric vector space is a topological
vector space.

Definition B.3. A metric space (X, dX ) is called complete, if every Cauchy sequence with respect
to d converges to an element in X.

For complete metric spaces, an immensely powerful tool is Baire’s category theorem. To state
it, we require the notion of density of sets. Let A, B ⊆ X for a topological space X. Then A is
dense in B if the closure of A, denoted by A̅, satisfies A̅ ⊇ B.

Theorem B.4 (Baire’s category theorem). Let X be a complete metric space. Then the intersection
of every countable collection of dense open subsets of X is dense in X.

Theorem B.4 implies that if X = ∪_{i=1}^∞ Vi for a sequence of closed sets Vi , then at least one of the Vi
has to contain an open set. Indeed, assuming all Vi ’s have empty interior implies that Vi^c = X \ Vi
is open and dense for all i ∈ N. By De Morgan’s laws, it then holds that ∅ = ∩_{i=1}^∞ Vi^c , which
contradicts Theorem B.4.

B.1.2 Normed spaces


A norm is a way of assigning a length to a vector. A normed space is a vector space with a norm.

Definition B.5. Let X be a vector space over a field K ∈ {R, C}. A map ∥ · ∥X : X → [0, ∞) is
called a norm if the following hold for all x, y ∈ X and all α ∈ K:

(i) triangle inequality: ∥x + y∥X ≤ ∥x∥X + ∥y∥X ,

(ii) absolute homogeneity: ∥αx∥X = |α|∥x∥X ,

(iii) positive definiteness: ∥x∥X = 0 if and only if x = 0.

We call (X, ∥ · ∥X ) a normed space and omit ∥ · ∥X from the notation if it is clear from the
context.

Every norm induces a metric dX and hence a topology via dX (x, y) := ∥x − y∥X . In particular,
every normed vector space is a topological vector space with respect to this topology.

B.1.3 Banach spaces

Definition B.6. A normed vector space is called a Banach space if and only if it is complete.

Before presenting the main results on Banach spaces, we collect a couple of important examples.
• Euclidean spaces: Let d ∈ N. Then (Rd , ∥ · ∥) is a Banach space.

• Continuous functions: Let d ∈ N and let K ⊆ Rd be compact. The set of continuous functions
from K to R is denoted by C(K). For α, β ∈ R and f , g ∈ C(K), we define addition and
scalar multiplication by (αf + βg)(x) = αf (x) + βg(x) for all x ∈ K. The vector space C(K)
equipped with the supremum norm

∥f ∥∞ := sup_{x∈K} |f (x)|,

is a Banach space.

• Lebesgue spaces: Let (Ω, A, µ) be a measure space and let 1 ≤ p < ∞. Then the Lebesgue
space Lp (Ω, µ) is defined as the vector space of all equivalence classes of measurable functions
f : Ω → R that coincide µ-almost everywhere and satisfy
∥f ∥Lp (Ω,µ) := ( ∫_Ω |f (x)|^p dµ(x) )^{1/p} < ∞.        (B.1.2)

The integral is independent of the choice of representative of the equivalence class of f .


Addition and scalar multiplication are defined pointwise as for C(K). It then holds that
Lp (Ω, µ) is a Banach space. If Ω is a measurable subset of Rd for d ∈ N, and µ is the
Lebesgue measure, we typically omit µ from the notation and simply write Lp (Ω). If Ω = N
and the measure is the counting measure, we denote these spaces by ℓp (N) or simply ℓp .

The definition can be extended to complex or Rd -valued functions. In the latter case the
integrand in (B.1.2) is replaced by ∥f (x)∥p . We denote these spaces again by Lp (Ω, µ) with
the precise meaning being clear from context.
• Essentially bounded functions: Let (Ω, A, µ) be a measure space. The Lp spaces can be
extended to p = ∞ by defining the L∞ -norm
∥f ∥L∞ (Ω,µ) := inf{C ≥ 0 | µ({|f | > C}) = 0}.
This is indeed a norm on the space of equivalence classes of measurable functions from Ω → R
that coincide µ-almost everywhere. Moreover, with this norm, L∞ (Ω, µ) is a Banach space. If
Ω = N and µ is the counting measure, we denote the resulting space by ℓ∞ (N) or simply ℓ∞ .
As in the case p < ∞, it is straightforward to extend the definition to complex or Rd -valued
functions, for which the same notation will be used.
We continue by introducing the concept of dual spaces.

Definition B.7. Let (X, ∥ · ∥X ) be a normed vector space over K ∈ {R, C}. Linear maps from
X → K are called linear functionals. The vector space of all continuous linear functionals on X
is called the (topological) dual space of X and is denoted by X ′ .

Together with the natural addition and scalar multiplication (for all h, g ∈ X ′ , α ∈ K and
x ∈ X)
(h + g)(x) := h(x) + g(x) and (αh)(x) := α(h(x)),
X ′ is a vector space. We equip X ′ with the norm
∥f ∥X ′ := sup_{x∈X, ∥x∥X =1} |f (x)|.

The space (X ′ , ∥ · ∥X ′ ) is always a Banach space, even if (X, ∥ · ∥X ) is not complete [196, Theorem
4.1].
The dual space can often be used to characterize the original Banach space. One way in which
the dual space X ′ captures certain algebraic and geometric properties of the Banach space X is
through the Hahn-Banach theorem. In this book, we use one specific variant of this theorem and
its implication for the existence of dual bases, see for instance [196, Theorem 3.5].

Theorem B.8 (Geometric Hahn-Banach, subspace version). Let M be a subspace of a Banach


space X and let x0 ∈ X. If x0 is not in the closure of M , then there exists f ∈ X ′ such that
f (x0 ) = 1 and f (x) = 0 for every x ∈ M .

An immediate consequence of Theorem B.8 that will be used throughout this book is the
existence of a dual basis. Let X be a Banach space and let (xi )i∈N ⊆ X be such that for all i ∈ N,
xi does not belong to the closure of span{xj | j ∈ N, j ̸= i}.
Then, for every i ∈ N, there exists fi ∈ X ′ such that fi (xj ) = 0 if i ̸= j and fi (xi ) = 1.

B.1.4 Hilbert spaces
Often, we require more structure than that provided by normed spaces. An inner product offers
additional tools to compare vectors by introducing notions of angle and orthogonality. For simplicity
we restrict ourselves to real vector spaces in the following.

Definition B.9. Let X be a real vector space. A map ⟨·, ·⟩X : X × X → R is called an inner
product on X if the following hold for all x, y, z ∈ X and all α, β ∈ R:

(i) linearity: ⟨αx + βy, z⟩X = α⟨x, z⟩X + β⟨y, z⟩X ,

(ii) symmetry: ⟨x, y⟩X = ⟨y, x⟩X ,

(iii) positive definiteness: ⟨x, x⟩X > 0 for all x ̸= 0.

On inner product spaces the so-called Cauchy-Schwarz inequality holds.

Theorem B.10 (Cauchy-Schwarz inequality). Let X be a vector space with inner product ⟨·, ·⟩X .
Then it holds for all x, y ∈ X
|⟨x, y⟩X | ≤ √( ⟨x, x⟩X ⟨y, y⟩X ) .

Moreover, equality holds if and only if x and y are linearly dependent.

Proof. Let x, y ∈ X. If y = 0 then ⟨x, y⟩X = 0 and thus the statement is trivial. Assume in the
following y ̸= 0, so that ⟨y, y⟩X > 0. Using the linearity and symmetry properties it holds for all
α∈R
0 ≤ ⟨x − αy, x − αy⟩X = ⟨x, x⟩X − 2α ⟨x, y⟩X + α2 ⟨y, y⟩X .
Letting α := ⟨x, y⟩X / ⟨y, y⟩X we get

0 ≤ ⟨x, x⟩X − 2 ⟨x, y⟩²X /⟨y, y⟩X + ⟨x, y⟩²X /⟨y, y⟩X = ⟨x, x⟩X − ⟨x, y⟩²X /⟨y, y⟩X .
Rearranging terms gives the claim.

Every inner product ⟨·, ·⟩X induces a norm via


∥x∥X := √⟨x, x⟩X    for all x ∈ X.        (B.1.3)
The properties of the inner product immediately yield the polar identity
∥x + y∥2X = ∥x∥2X + 2⟨x, y⟩X + ∥y∥2X . (B.1.4)
The fact that (B.1.3) indeed defines a norm follows by an application of the Cauchy-Schwarz
inequality to (B.1.4), which yields that ∥ · ∥X satisfies the triangle inequality. This gives rise to the
definition of a Hilbert space.

Definition B.11. Let H be a real vector space with inner product ⟨·, ·⟩H . Then (H, ⟨·, ·⟩H ) is
called a Hilbert space if and only if H is complete with respect to the norm ∥ · ∥H induced by
the inner product.

A standard example of a Hilbert space is L2 : Let (Ω, A, µ) be a measure space. Then


⟨f, g⟩L2 (Ω,µ) = ∫_Ω f (x)g(x) dµ(x)    for all f, g ∈ L2 (Ω, µ),

defines an inner product on L2 (Ω, µ) compatible with the L2 (Ω, µ)-norm.


In a Hilbert space, we can compare vectors not only via their distance, measured by the norm,
but also by using the inner product, which corresponds to their relative orientation. This leads to
the concept of orthogonality.

Definition B.12. Let (H, ⟨·, ·⟩H ) be a Hilbert space and let f , g ∈ H. We say that f and g are
orthogonal if ⟨f, g⟩H = 0, denoted by f ⊥ g. Moreover, for F , G ⊆ H we write F ⊥ G if f ⊥ g
for all f ∈ F , g ∈ G.

For orthogonal vectors, the polar identity immediately implies the Pythagorean theorem.

Theorem B.13 (Pythagorean theorem). Let (H, ⟨·, ·⟩H ) be a Hilbert space, n ∈ N, and let
f1 , . . . , fn ∈ H be pairwise orthogonal vectors. Then,
∥ Σ_{i=1}^n fi ∥²_H = Σ_{i=1}^n ∥fi ∥²_H .

A final property of Hilbert spaces that we encounter in this book is the existence of unique
projections onto convex sets. For a proof, see for instance [195, Thm. 4.10].

Theorem B.14. Let (H, ⟨·, ·⟩H ) be a Hilbert space and let K ̸= ∅ be a closed convex subset of H.
Then for every h ∈ H there exists a unique k0 ∈ K such that

∥h − k0 ∥H = inf{∥h − k∥H | k ∈ K}.
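
For example, in H = Rd with the Euclidean inner product and K the closed unit ball, the projection
of h onto K is h/max(1, ∥h∥). A short numerical check (our own sketch, with names of our choosing):

import numpy as np

# Projection onto the closed unit ball in R^d and a sanity check of its minimality.
rng = np.random.default_rng(0)
d = 5
h = 3.0 * rng.normal(size=d)
k0 = h / max(1.0, np.linalg.norm(h))          # candidate projection

dists = []
for _ in range(100_000):
    k = rng.normal(size=d)
    k /= max(1.0, np.linalg.norm(k))          # random point of the unit ball
    dists.append(np.linalg.norm(h - k))
print(np.linalg.norm(h - k0), min(dists))     # k0 is never farther than any sampled k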

B.2 Fourier transform
The Fourier transform is a powerful tool in analysis. It allows one to represent functions as a
superposition of frequencies.

Definition B.15. Let d ∈ N. The Fourier transform of a function f ∈ L1 (Rd ) is defined by


F(f )(ω) := f̂(ω) := ∫_{Rd} f (x) e^{−2πi x⊤ω} dx    for all ω ∈ Rd ,

and the inverse Fourier transform by


F −1 (f )(x) := f̌(x) := f̂(−x) = ∫_{Rd} f (ω) e^{2πi x⊤ω} dω    for all x ∈ Rd .

It is immediately clear from the definition that ∥f̂∥L∞ (Rd ) ≤ ∥f ∥L1 (Rd ) . As a result, the operator
F : f 7→ fˆ is a bounded linear map from L1 (Rd ) to L∞ (Rd ). We point out that fˆ can take complex
values and the definition is also meaningful for complex-valued functions f .
If fˆ ∈ L1 (Rd ), then we can reverse the process of taking the Fourier transform by taking the
inverse Fourier transform, see [195, Theorem 9.11].

Theorem B.16. If f , fˆ ∈ L1 (Rd ) then F −1 (fˆ) = f almost everywhere.
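
As a numerical illustration of Definition B.15 and Theorem B.16 (our own sketch, using a simple
Riemann-sum quadrature), one can check that f(x) = exp(−πx²) is essentially its own Fourier
transform under the above convention, and that the inverse transform recovers f(0) = 1:

import numpy as np

# Riemann-sum approximation of the Fourier transform from Definition B.15 (d = 1).
def fourier(values, grid, omega, sign=-1.0):
    return np.sum(values * np.exp(sign * 2j * np.pi * grid * omega)) * (grid[1] - grid[0])

x = np.linspace(-10.0, 10.0, 4001)
f = np.exp(-np.pi * x**2)

omegas = np.linspace(-2.0, 2.0, 9)
f_hat = np.array([fourier(f, x, w) for w in omegas])
print(np.max(np.abs(f_hat - np.exp(-np.pi * omegas**2))))    # close to 0

# Inverse transform (sign = +1) of f_hat evaluated at x = 0 recovers f(0) = 1.
w_grid = np.linspace(-10.0, 10.0, 4001)
print(fourier(np.exp(-np.pi * w_grid**2), w_grid, 0.0, sign=+1.0))   # close to 1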

Bibliography

[1] R. Alaifari, G. S. Alberti, and T. Gauksson. Adef: an iterative algorithm to construct


adversarial deformations. arXiv preprint arXiv:1804.07729, 2018.

[2] Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural
networks, going beyond two layers. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-
Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems,
volume 32. Curran Associates, Inc., 2019.

[3] M. Anthony and P. L. Bartlett. Neural network learning: theoretical foundations. Cambridge
University Press, Cambridge, 1999.

[4] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee. Understanding deep neural networks with
rectified linear units. In International Conference on Learning Representations, 2018.

[5] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang. On exact computation
with an infinitely wide neural net. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-
Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems,
volume 32. Curran Associates, Inc., 2019.

[6] S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger generalization bounds for deep
nets via a compression approach. In International Conference on Machine Learning, pages
254–263. PMLR, 2018.

[7] M. Baader, M. Mirman, and M. Vechev. Universal approximation with certified networks.
arXiv preprint arXiv:1909.13846, 2019.

[8] S. Barocas, M. Hardt, and A. Narayanan. Fairness and Machine Learning. fairmlbook.org,
2019. http://www.fairmlbook.org.

[9] A. R. Barron. Neural net approximation. In Proc. 7th Yale workshop on adaptive and learning
systems, volume 1, pages 69–72, 1992.

[10] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function.


IEEE Trans. Inform. Theory, 39(3):930–945, 1993.

[11] A. R. Barron and J. M. Klusowski. Approximation and estimation for high-dimensional deep
learning networks. arXiv preprint arXiv:1809.03090, 2018.

[12] P. Bartlett. For valid generalization the size of the weights is more important than the size
of the network. Advances in neural information processing systems, 9, 1996.

[13] G. Beliakov. Interpolation of lipschitz functions. Journal of Computational and Applied
Mathematics, 196(1):20–44, 2006.
[14] M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning practice
and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences,
116(32):15849–15854, 2019.
[15] M. Belkin, S. Ma, and S. Mandal. To understand deep learning we need to understand kernel
learning. In International Conference on Machine Learning, pages 541–549. PMLR, 2018.
[16] R. Bellman. On the theory of dynamic programming. Proceedings of the national Academy
of Sciences, 38(8):716–719, 1952.
[17] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent
is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[18] J. Berner, P. Grohs, and A. Jentzen. Analysis of the generalization error: Empirical risk
minimization over deep artificial neural networks overcomes the curse of dimensionality in
the numerical approximation of black–scholes partial differential equations. SIAM Journal
on Mathematics of Data Science, 2(3):631–657, 2020.
[19] J. Berner, P. Grohs, G. Kutyniok, and P. Petersen. The modern mathematics of deep learning,
2021.
[20] D. P. Bertsekas. Nonlinear programming. Athena Scientific Optimization and Computation
Series. Athena Scientific, Belmont, MA, third edition, 2016.
[21] H. Bolcskei, P. Grohs, G. Kutyniok, and P. Petersen. Optimal approximation with sparsely
connected deep neural networks. SIAM Journal on Mathematics of Data Science, 1(1):8–45,
2019.
[22] L. Bottou. Stochastic Gradient Descent Tricks, pages 421–436. Springer Berlin Heidelberg,
Berlin, Heidelberg, 2012.
[23] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine
learning. SIAM Review, 60(2):223–311, 2018.
[24] O. Bousquet and A. Elisseeff. Stability and generalization. The Journal of Machine Learning
Research, 2:499–526, 2002.
[25] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, Cambridge,
2004.
[26] J. Braun and M. Griebel. On a constructive proof of kolmogorov’s superposition theorem.
Constructive Approximation, 30(3):653–675, Dec 2009.
[27] M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković. Geometric deep learning: Grids,
groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.
[28] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in
neural information processing systems, 33:1877–1901, 2020.

[29] O. Calin. Deep learning architectures. Springer, 2020.

[30] E. J. Candes. Ridgelets: theory and applications. Stanford University, 1998.

[31] C. Carathéodory. Über den variabilitätsbereich der fourier’schen konstanten von posi-
tiven harmonischen funktionen. Rendiconti del Circolo Matematico di Palermo (1884-1940),
32:193–217, 1911.

[32] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In 2017
ieee symposium on security and privacy (sp), pages 39–57. Ieee, 2017.

[33] S. M. Carroll and B. W. Dickinson. Construction of neural nets using the radon transform.
International 1989 Joint Conference on Neural Networks, pages 607–611 vol.1, 1989.

[34] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes,


L. Sagun, and R. Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. Journal
of Statistical Mechanics: Theory and Experiment, 2019(12):124018, 2019.

[35] M. Chen, H. Jiang, W. Liao, and T. Zhao. Efficient approximation of deep relu networks for
functions on low dimensional manifolds. Advances in neural information processing systems,
32, 2019.

[36] L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable programming. In


H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,
Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.,
2019.

[37] Y. Cho and L. Saul. Kernel methods for deep learning. In Y. Bengio, D. Schuurmans,
J. Lafferty, C. Williams, and A. Culotta, editors, Advances in Neural Information Processing
Systems, volume 22. Curran Associates, Inc., 2009.

[38] F. Chollet. Deep learning with Python. Simon and Schuster, 2021.

[39] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of
multilayer networks. In Artificial intelligence and statistics, pages 192–204. PMLR, 2015.

[40] C. K. Chui and H. N. Mhaskar. Deep nets for local manifold learning. Frontiers in Applied
Mathematics and Statistics, 4:12, 2018.

[41] J. B. Conway. A course in functional analysis, volume 96. Springer, 2019.

[42] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other
Kernel-based Learning Methods. Cambridge University Press, 1 edition, 2000.

[43] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the
American mathematical society, 39(1):1–49, 2002.

[44] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of


Control, Signals and Systems, 2(4):303–314, 1989.

[45] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and
attacking the saddle point problem in high-dimensional non-convex optimization. Advances
in neural information processing systems, 27, 2014.

[46] A. G. de G. Matthews. Sample-then-optimize posterior sampling for bayesian linear models.


2017.

[47] A. G. de G. Matthews, J. Hron, M. Rowland, R. E. Turner, and Z. Ghahramani. Gaussian


process behaviour in wide deep neural networks. In International Conference on Learning
Representations, 2018.

[48] T. De Ryck, S. Lanthaler, and S. Mishra. On the approximation of functions by tanh neural
networks. Neural Networks, 143:732–750, 2021.

[49] A. Défossez, L. Bottou, F. R. Bach, and N. Usunier. A simple convergence proof of adam
and adagrad. Trans. Mach. Learn. Res., 2022, 2022.

[50] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale
hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern
Recognition, pages 248–255, 2009.

[51] R. A. DeVore, R. Howard, and C. Micchelli. Optimal nonlinear approximation. Manuscripta
mathematica, 63(4):469–478, 1989.

[52] R. A. DeVore. Nonlinear approximation. Acta numerica, 7:51–150, 1998.

[53] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio. Sharp minima can generalize for deep nets.
In International Conference on Machine Learning, pages 1019–1028. PMLR, 2017.

[54] F. Draxler, K. Veschgini, M. Salmhofer, and F. Hamprecht. Essentially no barriers in neural


network energy landscape. In International conference on machine learning, pages 1309–1318.
PMLR, 2018.

[55] M. Du, F. Yang, N. Zou, and X. Hu. Fairness in deep learning: A computational perspective.
IEEE Intelligent Systems, 36(4):25–34, 2021.

[56] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep
neural networks. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th
International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning
Research, pages 1675–1685. PMLR, 09–15 Jun 2019.

[57] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[58] W. E and Q. Wang. Exponential convergence of the deep neural network approximation for
analytic functions. Sci. China Math., 61(10):1733–1740, 2018.

[59] K. Eckle and J. Schmidt-Hieber. A comparison of deep networks with relu activation function
and linear spline-type methods. Neural Networks, 110:232–242, 2019.

[60] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In V. Feldman,
A. Rakhlin, and O. Shamir, editors, 29th Annual Conference on Learning Theory, volume 49
of Proceedings of Machine Learning Research, pages 907–940, Columbia University, New York,
New York, USA, 23–26 Jun 2016. PMLR.

[61] M. Fischer, M. Balunovic, D. Drachsler-Cohen, T. Gehr, C. Zhang, and M. Vechev. Dl2:


training and querying neural networks with logic. In International Conference on Machine
Learning, pages 1931–1941. PMLR, 2019.

[62] C. L. Frenzen, T. Sasao, and J. T. Butler. On the number of segments needed in a piecewise
linear approximation. Journal of Computational and Applied mathematics, 234(2):437–446,
2010.

[63] K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks.
Neural Networks, 2(3):183–192, 1989.

[64] T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, and A. G. Wilson. Loss surfaces,


mode connectivity, and fast ensembling of dnns. Advances in neural information processing
systems, 31, 2018.

[65] G. Garrigos and R. M. Gower. Handbook of convergence theorems for (stochastic) gradient
methods, 2023.

[66] T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov, S. Chaudhuri, and M. Vechev. Ai2:


Safety and robustness certification of neural networks with abstract interpretation. In 2018
IEEE symposium on security and privacy (SP), pages 3–18. IEEE, 2018.

[67] A. Géron. Hands-on machine learning with Scikit-Learn and TensorFlow : concepts, tools,
and techniques to build intelligent systems. O’Reilly Media, Sebastopol, CA, 2017.

[68] F. Girosi and T. Poggio. Representation properties of networks: Kolmogorov’s theorem is


irrelevant. Neural Computation, 1(4):465–469, 1989.

[69] F. Girosi and T. Poggio. Networks and the best approximation property. Biological cyber-
netics, 63(3):169–176, 1990.

[70] G. Goh. Why momentum really works. Distill, 2017.

[71] L. Gonon and C. Schwab. Deep relu network expression rates for option prices in high-
dimensional, exponential lévy models. Finance and Stochastics, 25(4):615–657, 2021.

[72] I. J. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, Cambridge, MA,
USA, 2016. http://www.deeplearningbook.org.

[73] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples.
In International Conference on Learning Representations (ICLR), 2015.

[74] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network


optimization problems. arXiv preprint arXiv:1412.6544, 2014.

[75] L.-A. Gottlieb, A. Kontorovich, and R. Krauthgamer. Efficient regression in metric spaces
via approximate lipschitz extension. IEEE Transactions on Information Theory, 63(8):4838–
4849, 2017.

[76] R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtárik. SGD: General
analysis and improved rates. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of
the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine
Learning Research, pages 5200–5209. PMLR, 09–15 Jun 2019.

[77] K. Gröchenig. Foundations of time-frequency analysis. Springer Science & Business Media,
2013.

[78] P. Grohs and L. Herrmann. Deep neural network approximation for high-dimensional elliptic
pdes with boundary conditions. IMA Journal of Numerical Analysis, 42(3):2055–2082, 2022.

[79] P. Grohs, F. Hornung, A. Jentzen, and P. Von Wurstemberger. A proof that artificial neural
networks overcome the curse of dimensionality in the numerical approximation of Black–
Scholes partial differential equations, volume 284. American Mathematical Society, 2023.

[80] P. Grohs, F. Hornung, A. Jentzen, and P. von Wurstemberger. A proof that artificial neural
networks overcome the curse of dimensionality in the numerical approximation of Black-
Scholes partial differential equations. Mem. Amer. Math. Soc., 284(1410):v+93, 2023.

[81] I. Gühring and M. Raslan. Approximation rates for neural networks with encodable weights
in smoothness spaces. Neural Networks, 134:107–130, 2021.

[82] B. Hanin and D. Rolnick. Complexity of linear regions in deep networks. In International
Conference on Machine Learning, pages 2596–2604. PMLR, 2019.

[83] T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani. Surprises in high-dimensional


ridgeless least squares interpolation. The Annals of Statistics, 50(2):949–986, 2022.

[84] S. S. Haykin. Neural networks and learning machines. Pearson Education, Upper Saddle
River, NJ, third edition, 2009.

[85] J. He, L. Li, J. Xu, and C. Zheng. Relu deep neural networks and linear finite elements. J.
Comput. Math., 38(3):502–527, 2020.

[86] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level
performance on imagenet classification. Proceedings of the IEEE international conference on
computer vision, 2015.

[87] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–
778, 2016.

[88] M. Hein and M. Andriushchenko. Formal guarantees on the robustness of a classifier against
adversarial manipulation. Advances in neural information processing systems, 30, 2017.

[89] H. Heuser. Lehrbuch der Analysis. Teil 1. Vieweg + Teubner, Wiesbaden, revised edition,
2009.

[90] G. Hinton. Divide the gradient by a running average of its recent magnitude. https://www.cs.toronto.edu/~hinton/coursera/lecture6/lec6e.mp4, 2012. Lecture 6e.

[91] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut


für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991.

[92] S. Hochreiter and J. Schmidhuber. Flat minima. Neural computation, 9(1):1–42, 1997.

[93] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–
1780, 1997.

[94] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks,


4(2):251–257, 1991.

[95] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal
approximators. Neural Networks, 2(5):359–366, 1989.

[96] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected con-
volutional networks. Proceedings of the IEEE conference on computer vision and pattern
recognition, 1(2):3, 2017.

[97] G.-B. Huang and H. A. Babri. Upper bounds on the number of hidden neurons in feedforward
networks with arbitrary bounded nonlinear activation functions. IEEE transactions on neural
networks, 9(1):224–229, 1998.

[98] L. Huang, A. D. Joseph, B. Nelson, B. I. Rubinstein, and J. D. Tygar. Adversarial machine


learning. In Proceedings of the 4th ACM workshop on Security and artificial intelligence,
pages 43–58, 2011.

[99] T. Huster, C.-Y. J. Chiang, and R. Chadha. Limitations of the lipschitz constant as a defense
against adversarial examples. In ECML PKDD 2018 Workshops: Nemesis 2018, UrbReas
2018, SoGood 2018, IWAISe 2018, and Green Data Mining 2018, Dublin, Ireland, September
10-14, 2018, Proceedings 18, pages 16–29. Springer, 2019.

[100] M. Hutzenthaler, A. Jentzen, T. Kruse, and T. A. Nguyen. A proof that rectified deep neural
networks overcome the curse of dimensionality in the numerical approximation of semilinear
heat equations. SN partial differential equations and applications, 1(2):10, 2020.

[101] J. Håstad. Computational limitations of small depth circuits. PhD thesis, Massachusetts
Institute of Technology, 1987. Ph.D. Thesis, Department of Mathematics.

[102] D. J. Im, M. Tao, and K. Branson. An empirical analysis of deep network loss surfaces. 2016.

[103] V. E. Ismailov. Ridge functions and applications in neural networks, volume 263. American
Mathematical Society, 2021.

[104] V. E. Ismailov. A three layer neural network can represent any multivariate function. Journal
of Mathematical Analysis and Applications, 523(1):127096, 2023.

[105] Y. Ito and K. Saito. Superposition of linearly independent functions and finite mappings by
neural networks. The Mathematical Scientist, 21(1):27, 1996.

[106] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization
in neural networks. Advances in neural information processing systems, 31, 2018.
[107] A. Jentzen, B. Kuckuck, and P. von Wurstemberger. Mathematical introduction to deep
learning: methods, implementations, and theory. arXiv preprint arXiv:2310.20360, 2023.
[108] A. Jentzen and A. Riekert. On the existence of global minima and convergence analy-
ses for gradient descent methods in the training of deep neural networks. arXiv preprint
arXiv:2112.09684, 2021.
[109] A. Jentzen, D. Salimova, and T. Welti. A proof that deep artificial neural networks overcome
the curse of dimensionality in the numerical approximation of Kolmogorov partial differen-
tial equations with constant diffusion and nonlinear drift coefficients. Commun. Math. Sci.,
19(5):1167–1205, 2021.
[110] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvu-
nakool, R. Bates, A. Žı́dek, A. Potapenko, et al. Highly accurate protein structure prediction
with alphafold. Nature, 596(7873):583–589, 2021.
[111] P. C. Kainen, V. Kurkova, and A. Vogt. Approximation by neural networks is not continuous.
Neurocomputing, 29(1-3):47–56, 1999.
[112] P. C. Kainen, V. Kurkova, and A. Vogt. Continuity of approximation by neural networks in
l p spaces. Annals of Operations Research, 101:143–147, 2001.
[113] P. C. Kainen, V. Kurkova, and A. Vogt. Best approximation by linear combinations of
characteristic functions of half-spaces. Journal of Approximation Theory, 122(2):151–159,
2003.
[114] H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient
methods under the polyak-lojasiewicz condition. In P. Frasconi, N. Landwehr, G. Manco,
and J. Vreeken, editors, Machine Learning and Knowledge Discovery in Databases, pages
795–811, Cham, 2016. Springer International Publishing.
[115] C. Karner, V. Kazeev, and P. C. Petersen. Limitations of gradient descent due to numerical
instability of backpropagation. arXiv preprint arXiv:2210.00805, 2022.
[116] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd Interna-
tional Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings.
International Conference on Learning Representations, ICLR, 2015.
[117] A. Klenke. Wahrscheinlichkeitstheorie. Springer, 2006.
[118] M. Kohler, A. Krzyżak, and S. Langer. Estimation of a function of low local dimensionality
by deep neural networks. IEEE transactions on information theory, 68(6):4032–4042, 2022.
[119] M. Kohler and S. Langer. On the rate of convergence of fully connected deep neural network
regression estimates. The Annals of Statistics, 49(4):2231–2249, 2021.
[120] A. N. Kolmogorov. On the representation of continuous functions of many variables by
superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR,
114:953–956, 1957.

[121] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional
neural networks. In Advances in neural information processing systems, pages 1097–1105,
2012.
[122] G. Kutyniok, P. Petersen, M. Raslan, and R. Schneider. A theoretical analysis of deep neural
networks and parametric pdes. Constructive Approximation, 55(1):73–125, 2022.
[123] V. Kůrková. Kolmogorov’s theorem is relevant. Neural Computation, 3(4):617–622, 1991.
[124] V. Kůrková. Kolmogorov’s theorem and multilayer neural networks. Neural Networks,
5(3):501–506, 1992.
[125] F. Laakmann and P. Petersen. Efficient approximation of solutions of parametric linear
transport equations by relu dnns. Advances in Computational Mathematics, 47(1):11, 2021.
[126] G. Lan. First-order and Stochastic Optimization Methods for Machine Learning. Springer
Series in the Data Sciences. Springer International Publishing, Cham, 1st ed. 2020. edition,
2020.
[127] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
[128] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D.
Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation,
1(4):541–551, 1989.
[129] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient BackProp, pages 9–48.
Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[130] J. Lee, J. Sohl-dickstein, J. Pennington, R. Novak, S. Schoenholz, and Y. Bahri. Deep neural
networks as gaussian processes. In International Conference on Learning Representations,
2018.
[131] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide
neural networks of any depth evolve as linear models under gradient descent. In H. Wallach,
H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in
Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[132] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a
nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–
867, 1993.
[133] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via
integral quadratic constraints. SIAM J. Optim., 26(1):57–95, 2016.
[134] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the loss landscape of neural
nets. Advances in neural information processing systems, 31, 2018.
[135] W. Li. Generalization error of minimum weighted norm and kernel interpolation. SIAM
Journal on Mathematics of Data Science, 3(1):414–438, 2021.
[136] M. Longo, J. A. Opschoor, N. Disch, C. Schwab, and J. Zech. De Rham compatible deep neural network FEM. Neural Networks, 165:721–739, 2023.

[137] C. Ma, S. Wojtowytsch, L. Wu, et al. Towards a mathematical understanding of neu-
ral network-based machine learning: what we know and what we don’t. arXiv preprint
arXiv:2009.10713, 2020.

[138] C. Ma, L. Wu, et al. A priori estimates of the population risk for two-layer neural networks.
arXiv preprint arXiv:1810.06397, 2018.

[139] S. Mahan, E. J. King, and A. Cloninger. Nonclosedness of sets of neural networks in Sobolev spaces. Neural Networks, 137:85–96, 2021.

[140] V. Maiorov and A. Pinkus. Lower bounds for approximation by MLP neural networks. Neurocomputing, 25(1):81–91, 1999.

[141] Y. Marzouk, Z. Ren, S. Wang, and J. Zech. Distribution learning via neural differential
equations: a nonparametric statistical perspective. Journal of Machine Learning Research
(accepted), 2024.

[142] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity.
The bulletin of mathematical biophysics, 5:115–133, 1943.

[143] S. Mei and A. Montanari. The generalization error of random features regression: Precise
asymptotics and the double descent curve. Communications on Pure and Applied Mathemat-
ics, 75(4):667–766, 2022.

[144] H. N. Mhaskar. Approximation properties of a multilayered feedforward artificial neural network. Adv. Comput. Math., 1(1):61–80, 1993.

[145] H. N. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions.
Neural computation, 8(1):164–177, 1996.

[146] H. N. Mhaskar and C. A. Micchelli. Approximation by superposition of sigmoidal and radial basis functions. Adv. in Appl. Math., 13(3):350–373, 1992.

[147] H. N. Mhaskar and C. A. Micchelli. Degree of approximation by neural and translation networks with a single hidden layer. Advances in Applied Mathematics, 16(2):151–183, 1995.

[148] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT press,
2018.

[149] C. Molnar. Interpretable machine learning. Lulu.com, 2020.

[150] H. Montanelli and Q. Du. New error bounds for deep ReLU networks using sparse grids. SIAM Journal on Mathematics of Data Science, 1(1):78–92, 2019.

[151] H. Montanelli and H. Yang. Error bounds for deep ReLU networks using the Kolmogorov–Arnold superposition theorem. Neural Networks, 129:1–6, 2020.

[152] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep
neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger,
editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates,
Inc., 2014.

[153] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial pertur-
bations. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 1765–1773, 2017.

[154] E. Moulines and F. Bach. Non-asymptotic analysis of stochastic approximation algorithms for
machine learning. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger,
editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates,
Inc., 2011.

[155] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201, 2001.

[156] R. Nakada and M. Imaizumi. Adaptive approximation and generalization of deep neural
network with intrinsic dimensionality. Journal of Machine Learning Research, 21(174):1–38,
2020.

[157] R. M. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.

[158] Y. Nesterov. Introductory lectures on convex optimization, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA, 2004. A basic course.

[159] Y. Nesterov. Lectures on convex optimization, volume 137 of Springer Optimization and Its
Applications. Springer, Cham, second edition, 2018.

[160] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate $O(1/k^2)$. Dokl. Akad. Nauk SSSR, 269(3):543–547, 1983.

[161] B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks.
In Conference on learning theory, pages 1376–1401. PMLR, 2015.

[162] R. Hecht-Nielsen. Kolmogorov’s mapping neural network existence theorem. In Proceedings of the IEEE First International Conference on Neural Networks (San Diego, CA), volume III, pages 11–13. Piscataway, NJ: IEEE, 1987.

[163] J. Nocedal and S. J. Wright. Numerical optimization. Springer Series in Operations Research
and Financial Engineering. Springer, New York, second edition, 2006.

[164] E. Novak and H. Woźniakowski. Approximation of infinitely differentiable multivariate functions is intractable. Journal of Complexity, 25(4):398–404, 2009.

[165] B. O’Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Found.
Comput. Math., 15(3):715–732, 2015.

[166] J. A. A. Opschoor, C. Schwab, and J. Zech. Exponential ReLU DNN expression of holomor-
phic maps in high dimension. Constructive Approximation, 2021.

[167] J. A. A. Opschoor, C. Schwab, and J. Zech. Deep learning in high dimension: ReLU neural
network expression for Bayesian PDE inversion. In Optimization and control for partial
differential equations—uncertainty quantification, open and closed-loop control, and shape
optimization, volume 29 of Radon Ser. Comput. Appl. Math., pages 419–462. De Gruyter,
Berlin, 2022.

[168] P. Oswald. On the degree of nonlinear spline approximation in Besov-Sobolev spaces. J.
Approx. Theory, 61(2):131–157, 1990.

[169] S. Ovchinnikov. Max-min representation of piecewise linear functions. Beiträge Algebra


Geom., 43(1):297–302, 2002.

[170] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-
box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference
on computer and communications security, pages 506–519, 2017.

[171] Y. C. Pati and P. S. Krishnaprasad. Analysis and synthesis of feedforward neural networks us-
ing discrete affine wavelet transformations. IEEE Transactions on Neural Networks, 4(1):73–
85, 1993.

[172] J. Pennington and Y. Bahri. Geometry of neural network loss surfaces via random matrix
theory. In International Conference on Machine Learning, pages 2798–2806. PMLR, 2017.

[173] P. Petersen, M. Raslan, and F. Voigtlaender. Topological properties of the set of func-
tions generated by neural networks of fixed size. Foundations of computational mathematics,
21:375–444, 2021.

[174] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Networks, 108:296–330, 2018.

[175] P. C. Petersen. Neural Network Theory. 2020. http://www.pc-petersen.eu/Neural_Network_Theory.pdf, Lecture notes.

[176] A. Pinkus. Approximation theory of the MLP model in neural networks. In Acta numerica,
1999, volume 8 of Acta Numer., pages 143–195. Cambridge Univ. Press, Cambridge, 1999.

[177] G. Pisier. Remarques sur un résultat non publié de B. Maurey. Séminaire Analyse fonctionnelle (dit "Maurey-Schwartz"), 1980-1981.

[178] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao. Why and when can deep-but
not shallow-networks avoid the curse of dimensionality: a review. Int. J. Autom. Comput.,
14(5):503–519, 2017.

[179] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for predictivity in
learning theory. Nature, 428(6981):419–422, 2004.

[180] B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

[181] B. T. Polyak. Introduction to optimization. Translations Series in Mathematics and Engineering. Optimization Software, Inc., Publications Division, New York, 1987. Translated from the Russian, with a foreword by Dimitri P. Bertsekas.

[182] S. J. Prince. Understanding Deep Learning. MIT Press, 2023.

[183] N. Qian. On the momentum term in gradient descent learning algorithms. Neural Networks,
12(1):145–151, 1999.

[184] Q. Nguyen, M. C. Mukkamala, and M. Hein. On the loss landscape of a class of deep neural networks with no bad local valleys. In International Conference on Learning Representations (ICLR), 2019.

[185] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein. On the expressive power of deep neural networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2847–2854. PMLR, 06–11 Aug 2017.

[186] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In J. Platt,
D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing
Systems, volume 20. Curran Associates, Inc., 2007.

[187] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.

[188] C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. Adaptive
computation and machine learning. MIT Press, 2006.

[189] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, Q. Le, and A. Kurakin. Regularized
evolution for image classifier architecture search. Proceedings of the AAAI Conference on
Artificial Intelligence, 33:4780–4789, 2019.

[190] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of adam and beyond. In 6th
International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada,
April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.

[191] H. Robbins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical
Statistics, 22(3):400 – 407, 1951.

[192] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organiza-
tion in the brain. Psychological Review, 65(6):386–408, 1958.

[193] W. Ruan, X. Yi, and X. Huang. Adversarial robustness of deep learning: Theory, algorithms,
and applications. In Proceedings of the 30th ACM international conference on information
& knowledge management, pages 4866–4869, 2021.

[194] S. Ruder. An overview of gradient descent optimization algorithms, 2016.

[195] W. Rudin. Real and complex analysis. McGraw-Hill Book Co., New York, third edition, 1987.

[196] W. Rudin. Functional analysis. International Series in Pure and Applied Mathematics.
McGraw-Hill, Inc., New York, second edition, 1991.

[197] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

[198] T. De Ryck and S. Mishra. Error analysis for deep neural network approximations of parametric hyperbolic conservation laws. Mathematics of Computation, 2023. Article electronically published on December 15, 2023.

[199] I. Safran and O. Shamir. Depth separation in ReLU networks for approximating smooth non-linear functions. arXiv preprint arXiv:1610.09887, 2016.

[200] M. A. Sartori and P. J. Antsaklis. A simple method to derive bounds on the size and to train multilayer neural networks. IEEE Transactions on Neural Networks, 2(4):467–471, 1991.

[201] R. Scheichl and J. Zech. Numerical methods for Bayesian inverse problems, 2021. Lecture Notes.

[202] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117,
2015.

[203] J. Schmidt-Hieber. Deep ReLU network approximation of functions on a manifold. arXiv preprint arXiv:1908.00695, 2019.

[204] J. Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. 2020.

[205] J. Schmidt-Hieber. The Kolmogorov–Arnold representation theorem revisited. Neural Networks, 137:119–126, 2021.

[206] B. Schölkopf and A. J. Smola. Learning with kernels : support vector machines, regularization,
optimization, and beyond. Adaptive computation and machine learning. MIT Press, 2002.

[207] L. Schumaker. Spline Functions: Basic Theory. Cambridge Mathematical Library. Cambridge
University Press, 3 edition, 2007.

[208] C. Schwab and J. Zech. Deep learning in high dimension: neural network expression rates for
generalized polynomial chaos expansions in UQ. Anal. Appl. (Singap.), 17(1):19–55, 2019.

[209] C. Schwab and J. Zech. Deep learning in high dimension: neural network expression rates for analytic functions in $L^2(\mathbb{R}^d,\gamma_d)$. SIAM/ASA J. Uncertain. Quantif., 11(1):199–234, 2023.

[210] T. Serra, C. Tjandraatmadja, and S. Ramalingam. Bounding and counting linear regions of
deep neural networks, 2018.

[211] U. Shaham, A. Cloninger, and R. R. Coifman. Provable approximation properties for deep
neural networks. Applied and Computational Harmonic Analysis, 44(3):537–557, 2018.

[212] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning - From Theory to Algorithms. Cambridge University Press, 2014.

[213] J. W. Siegel and J. Xu. High-order approximation rates for shallow neural networks with cosine and ReLU$^k$ activation functions. Applied and Computational Harmonic Analysis, 58:1–26, 2022.

[214] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[215] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. In ICLR, 2014.

[216] E. M. Stein. Singular integrals and differentiability properties of functions. Princeton Math-
ematical Series, No. 30. Princeton University Press, Princeton, N.J., 1970.

[217] I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.

[218] D. Stutz, M. Hein, and B. Schiele. Disentangling adversarial robustness and generalization.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 6976–6987, 2019.

[219] A. Sukharev. Optimal method of constructing best uniform approximations for functions of
a certain class. USSR Computational Mathematics and Mathematical Physics, 18(2):21–31,
1978.

[220] T. Sun, L. Qiao, and D. Li. Nonergodic complexity of proximal inertial gradient descents.
IEEE Trans. Neural Netw. Learn. Syst., 32(10):4613–4626, 2021.

[221] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and
momentum in deep learning. In S. Dasgupta and D. McAllester, editors, Proceedings of the
30th International Conference on Machine Learning, volume 28 of Proceedings of Machine
Learning Research, pages 1139–1147, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.

[222] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.

[223] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.

[224] M. Tan and Q. V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, pages 6105–6114, 2019.

[225] J. Tarela and M. Martínez. Region configurations for realizability of lattice piecewise-linear models. Mathematical and Computer Modelling, 30(11):17–27, 1999.

[226] J. M. Tarela, E. Alonso, and M. V. Martínez. A representation method for PWL functions oriented to parallel processing. Math. Comput. Modelling, 13(10):75–83, 1990.

[227] M. Telgarsky. Representation benefits of deep feedforward networks, 2015.

[228] M. Telgarsky. benefits of depth in neural networks. In V. Feldman, A. Rakhlin, and O. Shamir,
editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine
Learning Research, pages 1517–1539, Columbia University, New York, New York, USA, 23–26
Jun 2016. PMLR.

[229] M. Telgarsky. Deep learning theory lecture notes. https://mjt.cs.illinois.edu/dlt/, 2021. Version: 2021-10-27 v0.0-e7150f2d (alpha).

[230] V. M. Tikhomirov. ε-entropy and ε-capacity of sets in functional spaces. Selected Works of A. N. Kolmogorov: Volume III: Information Theory and the Theory of Algorithms, pages 86–170, 1993.

[231] S. Tu, S. Venkataraman, A. C. Wilson, A. Gittens, M. I. Jordan, and B. Recht. Breaking locality accelerates block Gauss-Seidel. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3482–3491. PMLR, 06–11 Aug 2017.

[232] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

[233] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity: Festschrift for Alexey Chervonenkis, pages 11–30. Springer, 2015.

[234] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

[235] L. Venturi, A. S. Bandeira, and J. Bruna. Spurious valleys in one-hidden-layer neural network
optimization landscapes. Journal of Machine Learning Research, 20:133, 2019.

[236] R. Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.

[237] S. Wang and X. Sun. Generalization of hinging hyperplanes. IEEE Transactions on Infor-
mation Theory, 51(12):4425–4431, 2005.

[238] Z. Wang, A. Albarghouthi, G. Prakriya, and S. Jha. Interval universal approximation for
neural networks. Proceedings of the ACM on Programming Languages, 6(POPL):1–29, 2022.

[239] E. Weinan, C. Ma, and L. Wu. Barron spaces and the compositional function spaces for
neural network models. arXiv preprint arXiv:1906.08039, 2019.

[240] E. Weinan and S. Wojtowytsch. Representation formulas and pointwise properties for Barron functions. Calculus of Variations and Partial Differential Equations, 61(2):46, 2022.

[241] A. C. Wilson, B. Recht, and M. I. Jordan. A Lyapunov analysis of accelerated methods in optimization. Journal of Machine Learning Research, 22(113):1–34, 2021.

[242] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive
gradient methods in machine learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wal-
lach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information
Processing Systems, volume 30. Curran Associates, Inc., 2017.

[243] C. Xiao, J.-Y. Zhu, B. Li, W. He, M. Liu, and D. Song. Spatially transformed adversarial
examples. arXiv preprint arXiv:1801.02612, 2018.

[244] H. Xu and S. Mannor. Robustness and generalization. Machine learning, 86:391–423, 2012.

[245] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw.,
94:103–114, 2017.

[246] D. Yarotsky and A. Zhevnerchuk. The phase diagram of approximation rates for deep neural
networks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors,
Advances in Neural Information Processing Systems, volume 33, pages 13005–13015. Curran
Associates, Inc., 2020.

[247] Y. Jiang, B. Neyshabur, H. Mobahi, D. Krishnan, and S. Bengio. Fantastic generalization measures and where to find them. In International Conference on Learning Representations (ICLR), 2020.

[248] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer. Scaling vision transformers. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–
12113, 2022.
