
arXiv:2407.18384v3 [cs.LG] 7 Apr 2025

Mathematical theory of deep learning

Philipp Petersen¹ and Jakob Zech²

¹ Universität Wien, Fakultät für Mathematik, 1090 Wien, Austria, [email protected]
² Universität Heidelberg, Interdisziplinäres Zentrum für Wissenschaftliches Rechnen, 69120 Heidelberg, Germany, [email protected]

April 8, 2025
Contents

1 Introduction
1.1 Mathematics of deep learning
1.2 High-level overview of deep learning
1.3 Why does it work?
1.4 Outline and philosophy
1.5 Material not covered in this book

2 Feedforward neural networks
2.1 Formal definition
2.2 Notion of size
2.3 Activation functions

3 Universal approximation
3.1 A universal approximation theorem
3.2 Superexpressive activations and Kolmogorov’s superposition theorem

4 Splines
4.1 B-splines and smooth functions
4.2 Reapproximation of B-splines with sigmoidal activations

5 ReLU neural networks
5.1 Basic ReLU calculus
5.2 Continuous piecewise linear functions
5.3 Simplicial pieces
5.4 Convergence rates for Hölder continuous functions

6 Affine pieces for ReLU neural networks
6.1 Upper bounds
6.2 Tightness of upper bounds
6.3 Number of pieces in practice

7 Deep ReLU neural networks
7.1 The square function
7.2 Multiplication
7.3 Polynomials and depth separation
7.4 C^{k,s} functions

8 High-dimensional approximation
8.1 The Barron class
8.2 Functions with compositionality structure
8.3 Functions on manifolds

9 Interpolation
9.1 Universal interpolation
9.2 Optimal interpolation and reconstruction

10 Training of neural networks
10.1 Gradient descent
10.2 Stochastic gradient descent
10.3 Acceleration
10.4 Adaptive and coordinate-wise learning rates
10.5 Backpropagation

11 Wide neural networks and the neural tangent kernel
11.1 Linear least-squares regression
11.2 Feature methods and kernel least-squares regression
11.3 Tangent kernel
11.4 Global minimizers
11.5 Proximity to trained linearized model
11.6 Training dynamics for shallow neural networks

12 Loss landscape analysis
12.1 Visualization of loss landscapes
12.2 Spurious valleys
12.3 Saddle points

13 Shape of neural network spaces
13.1 Lipschitz parameterizations
13.2 Convexity of neural network spaces
13.3 Closedness and best-approximation property

14 Generalization properties of deep neural networks
14.1 Learning setup
14.2 Empirical risk minimization
14.3 Generalization bounds
14.4 Generalization bounds from covering numbers
14.5 Covering numbers of deep neural networks
14.6 The approximation-complexity trade-off
14.7 PAC learning from VC dimension
14.8 Lower bounds on achievable approximation rates

15 Generalization in the overparameterized regime
15.1 The double descent phenomenon
15.2 Size of weights
15.3 Theoretical justification

16 Robustness and adversarial examples
16.1 Adversarial examples
16.2 Bayes classifier
16.3 Affine classifiers
16.4 ReLU neural networks
16.5 Robustness

A Probability theory
A.1 Sigma-algebras, topologies, and measures
A.2 Random variables
A.3 Conditionals, marginals, and independence
A.4 Concentration inequalities

B Linear algebra and functional analysis
B.1 Singular value decomposition and pseudoinverse
B.2 Vector spaces
Preface

This book serves as an introduction to the key ideas in the mathematical analysis of deep learning.
It is designed to help students and researchers to quickly familiarize themselves with the area and to
provide a foundation for the development of university courses on the mathematics of deep learning.
Our main goal in the composition of this book was to present various rigorous, but easy to grasp,
results that help to build an understanding of fundamental mathematical concepts in deep learning.
To achieve this, we prioritize simplicity over generality.
As a mathematical introduction to deep learning, this book does not aim to give an exhaustive
survey of the entire (and rapidly growing) field, and some important research directions are missing.
In particular, we have favored mathematical results over empirical research, even though an accurate
account of the theory of deep learning requires both.
The book is intended for students and researchers in mathematics and related areas. While we
believe that every diligent researcher or student will be able to work through this manuscript, it
should be emphasized that a familiarity with analysis, linear algebra, probability theory, and basic
functional analysis is recommended for an optimal reading experience. To assist readers, a review
of key concepts in probability theory and functional analysis is provided in the appendix.
The material is structured around the three main pillars of deep learning theory: Approximation
theory, Optimization theory, and Statistical Learning theory. This structure, which corresponds
to the three error terms typically occurring in the theoretical analysis of deep learning models, is
inspired by other recent texts on the topic following the same outline [213, 271, 132]. More specifically,
Chapter 1 provides an overview and introduces key questions for understanding deep learning.
Chapters 2 - 9 explore results in approximation theory, Chapters 10 - 13 discuss optimization
theory for deep learning, and the remaining Chapters 14 - 16 address the statistical aspects of deep
learning.
This book is the result of a series of lectures given by the authors. Parts of the material were
presented by P.P. in a lecture titled “Neural Network Theory” at the University of Vienna, and by
J.Z. in a lecture titled “Theory of Deep Learning” at Heidelberg University. The lecture notes of
these courses formed the basis of this book. We are grateful to the many colleagues and students
who contributed to this book through insightful discussions and valuable suggestions. We would
like to offer special thanks to the following individuals:
Jonathan Garcia Rebellon, Jakob Lanser, Andrés Felipe Lerma Pineda, Marvin Koß, Martin
Mauser, Davide Modesto, Martina Neuman, Bruno Perreaux, Johannes Asmus Petersen, Milutin
Popovic, Tuan Quach, Tim Rakowski, Lorenz Riess, Jakob Fabian Rohner, Jonas Schuhmann,
Peter Školník, Matej Vedak, Simon Weissmann, Josephine Westermann, Ashia Wilson.

Notation

A : vector of layer widths (Definition 12.1)
A : a sigma-algebra (Definition A.1)
aff(S) : affine hull of S (5.3.7)
B^d : the Borel sigma-algebra on R^d (Section A.1)
B^n : B-splines of order n (Definition 4.2)
B_r(x) : ball of radius r ≥ 0 around x in a metric space X (B.2.1)
B_r^d : ball of radius r ≥ 0 around 0 in R^d
C^k(Ω) : k-times continuously differentiable functions from Ω → R
C_c^∞(Ω) : infinitely differentiable functions from Ω → R with compact support in Ω
C^{0,s}(Ω) : s-Hölder continuous functions from Ω → R (Definition 5.22)
C^{k,s}(Ω) : C^k(Ω) functions f for which f^{(k)} ∈ C^{0,s}(Ω) (Definition 7.8)
f_n →cc f : compact convergence of f_n to f (Definition 3.1)
co(S) : convex hull of a set S (5.3.1)
f ∗ g : convolution of f and g (3.1.4)
D : data distribution ((1.2.4), Section 14.1)
D^α : partial derivative
depth(Φ) : depth of Φ (Definition 2.1)
ε_approx : approximation error (14.2.3)
ε_gen : generalization error (14.2.3)
ε_int : interpolation error (14.2.4)
E[X] : expectation of random variable X (A.2.1)
E[X|Y] : conditional expectation of random variable X (Subsection A.3.3)
G(S, ε, X) : ε-covering number of a set S ⊆ X (Definition 14.10)
Γ_C : Barron space with constant C (Section 8.1)
∇_x f : gradient of f w.r.t. x
⊘ : componentwise (Hadamard) division
⊗ : componentwise (Hadamard) product
h_S : empirical risk minimizer for a sample S (Definition 14.5)
Φ_L^id : identity ReLU neural network (Lemma 5.1)
1_S : indicator function of the set S
⟨·, ·⟩ : Euclidean inner product on R^d
⟨·, ·⟩_H : inner product on a vector space H (Definition B.11)
k_T : maximal number of elements shared by a single node of a triangulation (5.3.2)
K̂_n(x, x′) : empirical tangent kernel (11.3.4)
Λ_{A,σ,S,L} : loss landscape defining function (Definition 12.2)
Lip(f) : Lipschitz constant of a function f (9.2.1)
Lip_M(Ω) : M-Lipschitz continuous functions on Ω (9.2.4)
L : general loss function (Section 14.1)
L_{0-1} : 0-1 loss (Section 14.1)
L_ce : binary cross entropy loss (Section 14.1)
L_2 : square loss (Section 14.1)
ℓ^p(N) : space of p-summable sequences indexed over N (Section B.2.3)
L^p(Ω) : Lebesgue space over Ω (Section B.2.3)
M : piecewise continuous and locally bounded functions (Definition 3.1.1)
N_d^m(σ; L, n) : set of multilayer perceptrons with d-dim input, m-dim output, activation function σ, depth L, and width n (Definition 3.6)
N_d^m(σ; L) : union of N_d^m(σ; L, n) for all n ∈ N (Definition 3.6)
N(σ; A, B) : set of neural networks with architecture A, activation function σ, and all weights bounded in modulus by B (Definition 12.1)
N*(σ, A, B) : neural networks in N(σ; A, B) with range in [−1, 1] (14.5.1)
N : positive natural numbers
N_0 : natural numbers including 0
N(m, C) : multivariate normal distribution with mean m ∈ R^d and covariance C ∈ R^{d×d}
n_A : number of parameters of a neural network with layer widths described by A (Definition 12.1)
∥·∥ : Euclidean norm for vectors in R^d and spectral norm for matrices in R^{n×d}
∥·∥_F : Frobenius norm for matrices
∥·∥_∞ : ∞-norm on R^d or supremum norm for functions
∥·∥_p : p-norm on R^d
∥·∥_X : norm on a vector space X
0 : zero vector or zero matrix
O(·) : Landau notation
ω(η) : patch of the node η (5.3.5)
Ω_Λ(c) : sublevel set of loss landscape (Definition 12.3)
∂f(x) : set of subgradients of f at x (Definition 10.19)
P_n(R^d) or P_n : space of multivariate polynomials of degree n on R^d (Example 3.5)
P(R^d) or P : space of multivariate polynomials of arbitrary degree on R^d (Example 3.5)
P_X : distribution of random variable X (Definition A.10)
P[A] : probability of event A (Definition A.5)
P[A|B] : conditional probability of event A given B (Definition A.3.2)
PN(A, B) : parameter set of neural networks with architecture A and all weights bounded in modulus by B (Definition 12.1)
Pieces(f, Ω) : number of pieces of f on Ω (Definition 6.1)
Φ(x) : model (e.g. neural network) in terms of input x (parameter dependence suppressed)
Φ(x, w) : model (e.g. neural network) in terms of input x and parameters w
Φ_lin : linearization around initialization (11.3.1)
Φ_n^min : minimum neural network (Lemma 5.11)
Φ_ε^× : multiplication neural network (Lemma 7.3)
Φ_{n,ε}^× : multiplication of n numbers neural network (Proposition 7.4)
Φ_2 ◦ Φ_1 : composition of neural networks (Lemma 5.2)
Φ_2 • Φ_1 : sparse composition of neural networks (Lemma 5.2)
(Φ_1, . . . , Φ_m) : parallelization of neural networks (5.1.1)
A^† : pseudoinverse of a matrix A
Q : rational numbers
R : real numbers
R_− : non-positive real numbers
R_+ : non-negative real numbers
R_σ : realization map (Definition 12.1)
R* : Bayes risk (14.1.1)
R(h) : risk of hypothesis h (Definition 14.2)
R̂_S(h) : empirical risk of h for sample S ((1.2.3), Definition 14.4)
S_n : cardinal B-spline (Definition 4.1)
S_{ℓ,t,n}^d : multivariate cardinal B-spline (Definition 4.2)
|S| : cardinality of an arbitrary set S, or Lebesgue measure of S ⊆ R^d
S̊ : interior of a set S
S̄ : closure of a set S
∂S : boundary of a set S
S^c : complement of a set S
S^⊥ : orthogonal complement of a set S (Definition B.15)
σ : general activation function
σ_a : parametric ReLU activation function (Section 2.3)
σ_ReLU : ReLU activation function (Section 2.3)
sign : sign function
s_max(A) : maximal singular value of a matrix A
s_min(A) : minimal (positive) singular value of a matrix A
size(Φ) : number of free network parameters in Φ (Definition 2.4)
span(S) : linear hull or span of S
T : triangulation (Definition 5.13)
V : set of nodes in a triangulation (Definition 5.13)
V[X] : variance of random variable X (Section A.2.2)
VCdim(H) : VC dimension of a set of functions H (Definition 14.16)
W : distribution of weight initialization (Section 11.6.1)
W^{(ℓ)}, b^{(ℓ)} : weights and biases in layer ℓ of a neural network (Definition 2.1)
width(Φ) : width of Φ (Definition 2.1)
x^{(ℓ)} : output of ℓ-th layer of a neural network (Definition 2.1)
x̄^{(ℓ)} : preactivations (10.5.3)
X′ : dual space to a normed space X (Definition B.9)

Chapter 1

Introduction

1.1 Mathematics of deep learning


In 2012, a deep learning architecture revolutionized the field of computer vision by achieving un-
precedented performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
[148]. The deep learning architecture, known as AlexNet, significantly outperformed all competing
approaches. A few years later, in March 2016, a deep learning-based architecture called AlphaGo
defeated the best Go player at the time, Lee Sedol, in a five-game match [254]. Go is a highly
complex board game with a vast number of possible moves, making it a challenging problem for
artificial intelligence. Because of this complexity, many researchers believed that defeating a top
human Go player was a feat that would only be achieved decades later.
These breakthroughs along with many others, including DeepMind’s AlphaFold [135], which
revolutionized protein structure prediction in 2020, the unprecedented language capabilities of
large language models like GPT-3 (and later versions) [278, 40], and the emergence of generative
AI models like Stable Diffusion, Midjourney, and DALL-E, have sparked interest among scientists
across (almost) all disciplines. Likewise, while mathematical research on neural networks has a
long history, these groundbreaking developments revived interest in the theoretical underpinnings
of deep learning among mathematicians. However, initially, there was a clear consensus in the
mathematics community: We do not understand why this technology works so well! In fact, there
are many mathematical reasons that, at least superficially, should prevent the observed success.
Over the past decade the field has matured, and mathematicians have gained a more profound
understanding of deep learning, although many open questions remain. Recent years have brought
various new explanations and insights into the inner workings of these models. Before discussing
them in detail in the following chapters, we first give a high-level introduction to deep learning,
with a focus on the supervised learning framework for function approximation, which is the central
theme of this book.

1.2 High-level overview of deep learning


Deep learning refers to the application of deep neural networks trained by gradient-based methods,
to identify unknown input-output relationships. This approach has three key ingredients: deep
neural networks, gradient-based training, and prediction. We now explain each of these ingredients
separately.

Figure 1.1: Illustration of a single neuron ν. The neuron receives six inputs (x_1, . . . , x_6) = x,
computes their weighted sum ∑_{j=1}^{6} x_j w_j, adds a bias b, and finally applies the activation function
σ to produce the output ν(x).

Deep Neural Networks Deep neural networks are formed by a combination of neurons. A
neuron is a function of the form

R^d ∋ x ↦ ν(x) = σ(w^⊤ x + b),    (1.2.1)

where w ∈ Rd is a weight vector, b ∈ R is called bias, and the function σ is referred to as an


activation function. This concept is due to McCulloch and Pitts [172] and is a mathematical
model for biological neurons. If we consider σ to be the Heaviside function, i.e., σ = 1R+ with
R+ := [0, ∞), then the neuron “fires” if the weighted sum of the inputs x surpasses the threshold
−b. We depict a neuron in Figure 1.1. Note that if we fix d and σ, then the set of neurons can be
naturally parameterized by the d + 1 real values w1 , . . . , wd , b ∈ R.
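As a small numerical illustration (a toy sketch of ours, not part of the book), the neuron (1.2.1) with the Heaviside activation can be written in a few lines of NumPy; the chosen inputs, weights, and bias are arbitrary:

```python
import numpy as np

def neuron(x, w, b, sigma):
    # nu(x) = sigma(w^T x + b), cf. (1.2.1)
    return sigma(np.dot(w, x) + b)

def heaviside(z):
    # sigma = 1_{R+} with R+ = [0, infinity): the neuron "fires" iff w^T x >= -b
    return 1.0 if z >= 0 else 0.0

x = np.array([0.2, -0.5, 1.0])
w = np.array([1.0, 2.0, -0.5])  # weight vector, d = 3
b = -0.1                        # bias
print(neuron(x, w, b, heaviside))
```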
Neural networks are functions formed by connecting neurons, where the output of one neuron
becomes the input to another. One simple but very common type of neural network is the so-called
feedforward neural network. This structure distinguishes itself by having the neurons grouped in
layers, and the inputs to neurons in the (ℓ + 1)-st layer are exclusively neurons from the ℓ-th layer.
We start by defining a shallow feedforward neural network as an affine transformation
applied to the output of a set of neurons that share the same input and the same activation
function. Here, an affine transformation is a map T : Rp → Rq such that T (x) = W x + b for
some W ∈ Rq×p , b ∈ Rq where p, q ∈ N.
Formally, a shallow feedforward neural network is, therefore, a map Φ of the form

R^d ∋ x ↦ Φ(x) = T_1 ∘ σ ∘ T_0(x)

where T_0, T_1 are affine transformations and the application of σ is understood to be componentwise,
i.e., in each component of T_0(x). A visualization of a shallow neural network is given in Figure 1.2.
A deep feedforward neural network is constructed by compositions of shallow neural net-
works. This yields a map of the type

R^d ∋ x ↦ Φ(x) = T_{L+1} ∘ σ ∘ · · · ∘ T_1 ∘ σ ∘ T_0(x),

where L ∈ N and (T_j)_{j=0}^{L+1} are affine transformations. The number of compositions L is referred to
as the number of layers of the deep neural network. Similar to a single neuron, (deep) neural
networks can be viewed as a parameterized function class, with the parameters being the entries
of the matrices and vectors determining the affine transformations (T_j)_{j=0}^{L+1}.

Figure 1.2: Illustration of a shallow neural network. The affine transformation T_0 is of the form
(x_1, . . . , x_6) = x ↦ W x + b, where the rows of W are the weight vectors w_1, w_2, w_3 for each
respective neuron.

Gradient-based training After defining the structure or architecture of the neural network,
e.g., the activation function and the number of layers, the second step of deep learning consists
of determining suitable values for its parameters. In practice this is achieved by minimizing an
objective function. In supervised learning, which will be our focus, this objective depends
on a collection of input-output pairs, commonly known as training data or simply as a sample.
Concretely, let S = (x_i, y_i)_{i=1}^m be a sample, where x_i ∈ R^d represents the inputs and y_i ∈ R^k the
corresponding outputs with d, k ∈ N. Our goal is to find a deep neural network Φ such that

Φ(x_i) ≈ y_i for all i = 1, . . . , m    (1.2.2)

in a meaningful sense. For example, we could interpret “≈” to mean closeness with respect to
the Euclidean norm, or more generally, that L(Φ(xi ), y i ) is small for a function L measuring the
dissimilarity between its inputs. Such a function L is called a loss function. A standard way of
achieving (1.2.2) is by minimizing the so-called empirical risk of Φ with respect to the sample S
defined as
R̂_S(Φ) = (1/m) ∑_{i=1}^{m} L(Φ(x_i), y_i).    (1.2.3)

This quantity serves as a measure of how well Φ predicts y i at the training points x1 , . . . , xm .
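To make (1.2.3) concrete, here is a hypothetical toy sketch (ours, not from the book) that evaluates the empirical risk of an arbitrary model under the squared Euclidean loss; the sample and the stand-in model are made up for illustration:

```python
import numpy as np

def empirical_risk(model, xs, ys, loss):
    # \hat{R}_S(model) = (1/m) * sum_i loss(model(x_i), y_i), cf. (1.2.3)
    return np.mean([loss(model(x), y) for x, y in zip(xs, ys)])

def squared_loss(pred, y):
    return float(np.sum((pred - y) ** 2))

# toy sample S = (x_i, y_i)_{i=1}^3 with input dimension d = 2 and output dimension k = 1
xs = [np.array([0.0, 1.0]), np.array([1.0, 0.0]), np.array([1.0, 1.0])]
ys = [np.array([1.0]), np.array([0.0]), np.array([1.0])]

# a stand-in model Phi; in deep learning this would be a neural network
model = lambda x: np.array([x @ np.array([0.3, 0.7])])

print(empirical_risk(model, xs, ys, squared_loss))
```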
If L is differentiable, and for all xi the output Φ(xi ) depends differentiably on the parameters
of the neural network, then the gradient of the empirical risk R̂_S(Φ) with respect to the parameters
is well-defined. This gradient can be efficiently computed using a technique called backpropagation.
This makes it possible to minimize (1.2.3) with optimization algorithms such as (stochastic) gradient
descent. These produce a sequence of neural network parameters, and corresponding neural network
functions Φ_1, Φ_2, . . . , for which the empirical risk is expected to decrease. Figure 1.3 illustrates
a possible behavior of this sequence.
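To illustrate the principle on the simplest possible model (a one-parameter linear model rather than a neural network; a sketch of ours, not from the book), plain gradient descent on the empirical risk under the square loss looks as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=6)
ys = np.sin(3.0 * xs)                 # toy sample S = (x_i, y_i)_{i=1}^6

def risk(w):
    # empirical risk of the model x -> w * x under the square loss
    return np.mean((w * xs - ys) ** 2)

def risk_gradient(w):
    return np.mean(2.0 * (w * xs - ys) * xs)

w, lr = 0.0, 0.5                      # initialization and learning rate
for step in range(1, 6):
    w = w - lr * risk_gradient(w)     # produces parameters w_1, w_2, ...
    print(step, w, risk(w))           # the empirical risk decreases
```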

Prediction The final part of deep learning concerns the question of whether we have actually
learned something by the procedure above. Suppose that our optimization routine has either
converged or has been terminated, yielding a neural network Φ∗ . While the optimization aimed
to minimize the empirical risk on the training sample S, our ultimate interest is not in how well
Φ∗ performs on S. Rather, we are interested in its performance on new data points (xnew , y new )
outside of S.
To make meaningful statements about this, we assume existence of a data distribution D on
the input-output space—in our case, this is Rd × Rk —such that both the elements of S and all
other data points are drawn from this distribution. In other words, we treat S as an i.i.d. draw
from D, and (xnew , y new ) also as sampled independently from D. If we want Φ∗ to perform well on
average, then this amounts to controlling the following expression

R(Φ*) = E_{(x_new, y_new)∼D}[L(Φ*(x_new), y_new)],    (1.2.4)

which is called the risk of Φ∗ . If the risk is not much larger than the empirical risk, then we say
that the neural network Φ∗ has a small generalization error. On the other hand, if the risk is
much larger than the empirical risk, then we say that Φ∗ overfits the training data, meaning that
Φ∗ has memorized the training samples, but does not generalize well to data outside of the training
set.
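Since D is unknown, the risk (1.2.4) is in practice estimated on held-out data. The following toy sketch (our illustration; the distribution, sample sizes, and stand-in model are artificial) compares the empirical risk with a Monte Carlo estimate of the risk:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_from_D(m):
    # artificial data distribution D on R x R: y = sin(x) + noise
    x = rng.uniform(-3.0, 3.0, size=m)
    y = np.sin(x) + 0.1 * rng.normal(size=m)
    return x, y

loss = lambda pred, y: (pred - y) ** 2
model = lambda x: 0.9 * np.sin(x)     # stand-in for a trained network Phi*

x_train, y_train = draw_from_D(20)
x_new, y_new = draw_from_D(100_000)   # large held-out set: Monte Carlo estimate of the risk

print("empirical risk:", np.mean(loss(model(x_train), y_train)))
print("estimated risk:", np.mean(loss(model(x_new), y_new)))
# a small gap between the two indicates a small generalization error
```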

Figure 1.3: A sequence of one-dimensional neural networks Φ_1, . . . , Φ_4 that successively minimizes
the empirical risk for the sample S = (x_i, y_i)_{i=1}^6.

1.3 Why does it work?
It is natural to wonder why the deep learning pipeline, as outlined in the previous subsection,
ultimately succeeds in learning, i.e., achieving a small risk. Is it true that for a given sample
(xi , y i )m
i=1 there exist a neural network such that Φ(xi ) ≈ y i for all i = 1, . . . m? Does the
optimization routine produce a meaningful result? Can we control the risk, knowing only that the
empirical risk is small?
While most of these questions can be answered affirmatively under certain assumptions, these
assumptions often do not apply to deep learning in practice. We next explore some potential
explanations and show that they lead to even more questions.

Approximation A fundamental result in the study of neural networks is the so-called universal
approximation theorem, which will be discussed in Chapter 3. This result states that every con-
tinuous function on a compact domain can be approximated arbitrarily well (in a uniform sense)
by a shallow neural network.
This result, however, does not address the practically relevant question of efficiency. For exam-
ple, if we aim for computational efficiency, then we may be interested in identifying the smallest
neural network that fits the data. This naturally raises the question: What is the role of the
architecture for the expressive capabilities of neural networks? Furthermore, viewing empirical
risk minimization as an approximation problem, we are confronted with a central challenge in ap-
proximation theory: the curse of dimensionality. Function approximation in high dimensions is
notoriously difficult and becomes exponentially harder as the dimensionality increases. Yet, many
successful deep learning architectures operate in this high-dimensional regime. Why do these neural
networks appear to overcome this so-called curse?

Optimization While gradient descent can sometimes be proven to converge to a global minimum,
as we will discuss in Chapter 10, this typically requires the objective function to be at least convex.
However, there is no reason to believe that, for example, the empirical risk is a convex function of
the network parameters. In fact, due to the repeatedly occurring compositions with the nonlinear
activation function in the network, the empirical risk is typically highly non-linear and not convex.
Therefore, there is generally no guarantee that the optimization routine will converge to a global
minimum, and it may get stuck in a local (and non-global) minimum or a saddle point. Why is the
output of the optimization nonetheless often meaningful in practice?

Generalization In traditional statistical learning theory, which we will review in Chapter 14,
the extent to which the risk exceeds the empirical risk can be bounded a priori; such bounds are
often expressed in terms of a notion of complexity of the set of admissible functions (the class of
neural networks) divided by the number of training samples. For the class of neural networks of a
fixed architecture, the complexity roughly amounts to the number of neural network parameters.
In practice, neural networks with more parameters than training samples are typically used. This
is dubbed the overparameterized regime. In this regime, the classical estimates described above are
void.
Why is it that, nonetheless, deep overparameterized architectures are capable of making accu-
rate predictions on unseen data? Furthermore, while deep architectures often generalize well, they
sometimes fail spectacularly on specific, carefully crafted examples. In image classification tasks,

these examples may differ only slightly from correctly classified images in a way that is not per-
ceptible to the human eye. Such examples are known as adversarial examples, and their existence
poses a great challenge for applications of deep learning.

1.4 Outline and philosophy


This book addresses the questions raised in the previous section, providing answers that are mathe-
matically rigorous and accessible. Our focus will be on provable statements, presented in a manner
that prioritizes simplicity and clarity over generality. We will sometimes illustrate key ideas only
in special cases, or under strong assumptions, both to avoid an overly technical exposition, and
because definitive answers are often not yet available. In the following, we summarize the content
of each chapter and highlight parts pertaining to the questions stated in the previous section.
Chapter 2: Feedforward neural networks. In this chapter, we introduce the main object
of study of this book—the feedforward neural network.
Chapter 3: Universal approximation. We present the classical view of function approx-
imation by neural networks, and give two instances of so-called universal approximation results.
Such statements describe the ability of neural networks to approximate every function of a given
class to arbitrary accuracy, given that the network size is sufficiently large. The first result, which
holds under very broad assumptions on the activation function, is on uniform approximation of
continuous functions on compact domains. The second result shows that for a very specific activa-
tion function, the network size can be chosen independently of the desired accuracy, highlighting
that universal approximation needs to be interpreted with caution.
Chapter 4: Splines. Going beyond universal approximation, this chapter starts to explore
approximation rates of neural networks. Specifically, we examine how well certain functions can be
approximated relative to the number of parameters in the network. For so-called sigmoidal activa-
tion functions, we establish a link between neural-network- and spline-approximation. This reveals
that smoother functions require fewer network parameters. However, achieving this increased effi-
ciency necessitates the use of deeper neural networks. This observation offers a first glimpse into
the importance of depth in deep learning.
Chapter 5: ReLU neural networks. This chapter focuses on one of the most popular ac-
tivation functions in practice—the ReLU. We prove that the class of ReLU networks is equal to
the set of continuous piecewise linear functions, thus providing a theoretical foundation for their
expressive power. Furthermore, given a continuous piecewise linear function, we investigate the
necessary width and depth of a ReLU network to represent it. Finally, we leverage approxima-
tion theory for piecewise linear functions to derive convergence rates for approximating Hölder
continuous functions.
Chapter 6: Affine pieces for ReLU neural networks. Having gained some intuition about
ReLU neural networks, in this chapter, we address some potential limitations. We analyze ReLU
neural networks by counting the number of affine regions that they generate. The key insight of
this chapter is that deep neural networks can generate exponentially more regions than shallow
ones. This observation provides further evidence for the potential advantages of depth in neural
network architectures.
Chapter 7: Deep ReLU neural networks. Having identified the ability of deep ReLU
neural networks to generate a large number of affine regions, we investigate whether this translates
into an actual advantage in function approximation. Indeed, for approximating smooth functions,

we prove substantially better approximation rates than we obtained for shallow neural networks.
This adds again to our understanding of depth and its connections to expressive power of neural
network architectures.
Chapter 8: High-dimensional approximation. The convergence rates established in the
previous chapters deteriorate significantly in high-dimensional settings. This chapter examines
three scenarios under which neural networks can provably overcome the curse of dimensionality.
Chapter 9: Interpolation. In this chapter we shift our perspective from approximation to
exact interpolation of the training data. We analyze conditions under which exact interpolation is
possible, and discuss the implications for empirical risk minimization. Furthermore, we present a
constructive proof showing that ReLU networks can express an optimal interpolant of the data (in
a specific sense).
Chapter 10: Training of neural networks. We start to examine the training process
of deep learning. First, we study the fundamentals of (stochastic) gradient descent and convex
optimization. Additionally, we examine accelerated methods and highlight the key principles behind
popular training algorithms such as Adam. Finally, we discuss how the backpropagation algorithm
can be used to implement these optimization algorithms for training neural networks.
Chapter 11: Wide neural networks and the neural tangent kernel. This chapter
introduces the neural tangent kernel as a tool for analyzing the training behavior of neural networks.
We begin by revisiting linear and kernel regression for the approximation of functions based on
data. Additionally we discuss the effect of adding a regularization term to the objective function.
Afterwards, we show for certain architectures of sufficient width, that the training dynamics of
gradient descent resemble those of kernel regression and converge to a global minimum. This
analysis provides insights into why, under certain conditions, we can train neural networks without
getting stuck in (bad) local minima, despite the non-convexity of the objective function. Finally, we
discuss a well-known link between neural networks and Gaussian processes, giving some indication
why overparameterized networks do not necessarily overfit in practice.
Chapter 12: Loss landscape analysis. In this chapter, we present an alternative view on the
optimization problem, by analyzing the loss landscape—the empirical risk as a function of the neural
network parameters. We give theoretical arguments showing that increasing overparameterization
leads to greater connectivity between the valleys and basins of the loss landscape. Consequently,
overparameterized architectures make it easier to reach a region where all minima are global minima.
Additionally, we observe that most stationary points associated with non-global minima are saddle
points. This sheds further light on the empirically observed fact that deep architectures can often
be optimized without getting stuck in non-global minima.
Chapter 13: Shape of neural network spaces. While Chapters 11 and 12 highlight
potential reasons for the success of neural network training, in this chapter, we show that the set
of neural networks of a fixed architecture has some undesirable properties from an optimization
perspective. Specifically, we show that this set is typically non-convex. Moreover, in general it does
not possess the best-approximation property, meaning that there might not exist a neural network
within the set yielding the best approximation for a given function.
Chapter 14: Generalization properties of deep neural networks. To understand
why deep neural networks successfully generalize to unseen data points (outside of the training
set), we study classical statistical learning theory, with a focus on neural network functions as the
hypothesis class. We then show how to establish generalization bounds for deep learning, providing
theoretical insights into the performance on unseen data.

Chapter 15: Generalization in the overparameterized regime. The generalization
bounds of the previous chapter are not meaningful when the number of parameters of a neural net-
work surpasses the number of training samples. However, this overparameterized regime is where
many successful network architectures operate. To gain a deeper understanding of generalization
in this regime, we describe the phenomenon of double descent and present a potential explana-
tion. This addresses the question of why deep neural networks perform well despite being highly
overparameterized.
Chapter 16: Robustness and adversarial examples. In the final chapter, we explore
the existence of adversarial examples—inputs designed to deceive neural networks. We provide
some theoretical explanations of why adversarial examples arise, and discuss potential strategies to
prevent them.

1.5 Material not covered in this book


This book studies some central topics of deep learning but leaves out even more. Interesting
questions associated with the field that were omitted, as well as some pointers to related works are
listed below:
Advanced architectures: The (deep) feedforward neural network is far from the only type of
neural network. In practice, architectures must be adapted to the type of data. For example, images
exhibit strong spatial dependencies in the sense that adjacent pixels often have similar values.
Convolutional neural networks [155] are particularly well suited for this type of input, as they
employ convolutional filters that aggregate information from neighboring pixels, thus capturing the
data structure better than a fully connected feedforward network. Similarly, graph neural networks
[39] are a natural choice for graph-based data. For sequential data, such as natural language,
architectures with some form of memory component are used, including Long Short-Term Memory
(LSTM) networks [117] and attention-based architectures like transformers [278].
Unsupervised and Reinforcement Learning: While this book focuses on supervised learn-
ing, where each data point xi has a label yi , there is a vast field of machine learning called unsuper-
vised learning, where labels are absent. Classical unsupervised learning problems include clustering
and dimensionality reduction [250, Chapters 22/23].
A popular area in deep learning, where no labels are used, is physics-informed neural networks
[223]. Here, a neural network is trained to satisfy a partial differential equation (PDE), with the
loss function quantifying the deviation from this PDE.
Finally, reinforcement learning is a technique where an agent can interact with an environment
and receives feedback based on its actions. The actions are guided by a so-called policy, which is
to be learned, [178, Chapter 17]. In deep reinforcement learning, this policy is modeled by a deep
neural network. Reinforcement learning is the basis of the aforementioned AlphaGo.
Interpretability/Explainability and Fairness: The use of deep neural networks in critical
decision-making processes, such as allocating scarce resources (e.g., organ transplants in medicine,
financial credit approval, hiring decisions) or engineering (e.g., optimizing bridge structures, au-
tonomous vehicle navigation, predictive maintenance), necessitates an understanding of their decision-
making process. This is crucial for both practical and ethical reasons.
Practically, understanding how a model arrives at a decision can help us improve its performance
and mitigate problems. It allows us to ensure that the model performs according to our intentions
and does not produce undesirable outcomes. For example, in bridge design, understanding why a

model suggests or rejects a particular configuration can help engineers identify potential vulnerabil-
ities, ultimately leading to safer and more efficient designs. Ethically, transparent decision-making
is crucial, especially when the outcomes have significant consequences for individuals or society; bi-
ases present in the data or model design can lead to discriminatory outcomes, making explainability
essential.
However, explaining the predictions of deep neural networks is not straightforward. Despite
knowledge of the network weights and biases, the repeated and complex interplay of linear trans-
formations and non-linear activation functions often renders these models black boxes. A compre-
hensive overview of various techniques for interpretability, not only for deep neural networks, can
be found in [179]. Regarding the topic of fairness, we refer for instance to [72, 11].
Implementation: While this book focuses on provable theoretical results, the field of deep
learning is strongly driven by applications, and a thorough understanding of deep learning cannot
be achieved without practical experience. For this, there exist numerous resources with excellent
explanations. We recommend [87, 51, 218] as well as the countless online tutorials that are just a
Google (or alternative) search away.
Many more: The field is evolving rapidly, and new ideas are constantly being generated
and tested. This book cannot give a complete overview. However, we hope that it provides the
reader with a solid foundation in the fundamental knowledge and principles to quickly grasp and
understand new developments in the field.

Bibliography and further reading


Throughout this book, we will end each chapter with a short overview of related work and the
references used in the chapter.
In this introductory chapter, we highlight several other recent textbooks and works on deep
learning. For a historical survey on neural networks see [239] and also [154]. For general textbooks
on neural networks and deep learning, we refer to the recent monographs [109, 94, 218].
More mathematical introductions to the topic are given, for example, in [6, 132, 42]. For the
implementation of neural networks we refer for example to [87, 51].

Chapter 2

Feedforward neural networks

Feedforward neural networks, henceforth simply referred to as neural networks (NNs), constitute
the central object of study of this book. In this chapter, we provide a formal definition of neural
networks, discuss the size of a neural network, and give a brief overview of common activation
functions.

2.1 Formal definition


We previously defined a single neuron ν in (1.2.1) and Figure 1.1. A neural network is constructed
by connecting multiple neurons. Let us now make precise this connection procedure.

Definition 2.1. Let L ∈ N, d0 , . . . , dL+1 ∈ N, and let σ : R → R. A function Φ : Rd0 → RdL+1


is called a neural network if there exist matrices W (ℓ) ∈ Rdℓ+1 ×dℓ and vectors b(ℓ) ∈ Rdℓ+1 ,
ℓ = 0, . . . , L, such that with

x^{(0)} := x    (2.1.1a)
x^{(ℓ)} := σ(W^{(ℓ−1)} x^{(ℓ−1)} + b^{(ℓ−1)})   for ℓ ∈ {1, . . . , L}    (2.1.1b)
x^{(L+1)} := W^{(L)} x^{(L)} + b^{(L)}    (2.1.1c)

holds

Φ(x) = x(L+1) for all x ∈ Rd0 .

We call L the depth, d_max = max_{ℓ=1,...,L} d_ℓ the width, σ the activation function, and
(σ; d_0, . . . , d_{L+1}) the architecture of the neural network Φ. Moreover, W^{(ℓ)} ∈ R^{d_{ℓ+1}×d_ℓ} are the
weight matrices and b^{(ℓ)} ∈ R^{d_{ℓ+1}} the bias vectors of Φ for ℓ = 0, . . . , L.
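Definition 2.1 translates directly into code. The following NumPy sketch (our illustration, not part of the book) evaluates Φ from the weight matrices W^{(0)}, . . . , W^{(L)} and bias vectors b^{(0)}, . . . , b^{(L)}; as in (2.1.1), σ is applied componentwise in every layer except the last, and the architecture below is chosen arbitrarily:

```python
import numpy as np

def forward(x, weights, biases, sigma):
    # weights = [W0, ..., WL], biases = [b0, ..., bL]; evaluates Phi(x) as in (2.1.1)
    for W, b in zip(weights[:-1], biases[:-1]):
        x = sigma(W @ x + b)                 # hidden layers, (2.1.1b)
    return weights[-1] @ x + biases[-1]      # output layer, no activation, (2.1.1c)

relu = lambda z: np.maximum(z, 0.0)

# architecture (sigma; d0, d1, d2) = (ReLU; 3, 4, 2), i.e. depth L = 1 and width 4
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [rng.normal(size=4), rng.normal(size=2)]

print(forward(np.array([1.0, -2.0, 0.5]), weights, biases, relu))
```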

Remark 2.2. Typically, there exist different choices of architectures, weights, and biases yielding
the same function Φ : Rd0 → RdL+1 . For this reason we cannot associate a unique meaning to these
notions solely based on the function realized by Φ. In the following, when we refer to the properties

of a neural network Φ, it is always understood to mean that there exists at least one construction
as in Definition 2.1, which realizes the function Φ and uses parameters that satisfy those properties.
The architecture of a neural network is often depicted as a connected graph, as illustrated in
Figure 2.1. The nodes in such graphs represent (the output of) the neurons. They are arranged in
layers, with x(ℓ) in Definition 2.1 corresponding to the neurons in layer ℓ. We also refer to x(0) in
(2.1.1a) as the input layer and to x(L+1) in (2.1.1c) as the output layer. All layers in between
are referred to as the hidden layers and their output is given by (2.1.1b). The number of hidden
layers corresponds to the depth. For the correct interpretation of such graphs, we note that by our
conventions in Definition 2.1, the activation function is applied after each affine transformation,
except in the final layer.
Neural networks of depth one are called shallow; if the depth is larger than one, they are called
deep. The notion of deep neural networks is not used entirely consistently in the literature, and
some authors use the word deep only in case the depth is much larger than one, where the precise
meaning of “much larger” depends on the application.
Throughout, we only consider neural networks in the sense of Definition 2.1. We emphasize
however, that this is just one (simple but very common) type of neural network. Many adjustments
to this construction are possible and also widely used. For example:

• We may use different activation functions σℓ in each layer ℓ or we may even use a different
activation function for each node.

• Residual neural networks allow “skip connections” [112]. This means that information is
allowed to skip layers in the sense that the nodes in layer ℓ may have x(0) , . . . , x(ℓ−1) as their
input (and not just x(ℓ−1) ), cf. (2.1.1).

• In contrast to feedforward neural networks, recurrent neural networks allow information to


flow backward, in the sense that x(ℓ−1) , . . . , x(L+1) may serve as input for the nodes in layer ℓ
(and not just x(ℓ−1) ). This creates loops in the flow of information, and one has to introduce
a time index t ∈ N, as the output of a node in time step t might be different from the output
in time step t + 1.

Let us clarify some further common terminology used in the context of neural networks:

• parameters: The parameters of a neural network refer to the set of all entries of the weight
matrices and bias vectors. These are often collected in a single vector

w = ((W (0) , b(0) ), . . . , (W (L) , b(L) )). (2.1.2)

These parameters are adjustable and are learned during the training process, determining the
specific function realized by the network.

• hyperparameters: Hyperparameters are settings that define the network’s architecture (and
training process), but are not directly learned during training. Examples include the depth,
the number of neurons in each layer, and the choice of activation function. They are typically
set before training begins.

• weights: The term “weights” is often used broadly to refer to all parameters of a neural
network, including both the weight matrices and bias vectors.

Figure 2.1: Sketch of a neural network with three hidden layers (input layer 0, hidden layers 1 to 3,
output layer 4), and d_0 = 3, d_1 = 4, d_2 = 3, d_3 = 4, d_4 = 2. The neural network has depth three
and width four.

• model: For a fixed architecture, every choice of network parameters w as in (2.1.2) defines
a specific function x ↦ Φ_w(x). In deep learning this function is often referred to as a model.
More generally, “model” can be used to describe any function parameterized by a set of
parameters w ∈ R^n, n ∈ N.

2.1.1 Basic operations on neural networks


There are various ways in which neural networks can be combined with one another. The next propo-
sition addresses this for linear combinations, compositions, and parallelization. The formal proof,
which is a good exercise to familiarize oneself with neural networks, is left as Exercise 2.5.

Proposition 2.3. For two neural networks Φ_1, Φ_2, with architectures

(σ; d_0^1, d_1^1, . . . , d_{L_1+1}^1)   and   (σ; d_0^2, d_1^2, . . . , d_{L_2+1}^2)

respectively, it holds that

(i) for all α ∈ R there exists a neural network Φ_α with architecture (σ; d_0^1, d_1^1, . . . , d_{L_1+1}^1) such that
    Φ_α(x) = αΦ_1(x) for all x ∈ R^{d_0^1},

(ii) if d_0^1 = d_0^2 =: d_0 and L_1 = L_2 =: L, then there exists a neural network Φ_parallel with architecture
    (σ; d_0, d_1^1 + d_1^2, . . . , d_{L+1}^1 + d_{L+1}^2) such that
    Φ_parallel(x) = (Φ_1(x), Φ_2(x)) for all x ∈ R^{d_0},

(iii) if d_0^1 = d_0^2 =: d_0, L_1 = L_2 =: L, and d_{L+1}^1 = d_{L+1}^2 =: d_{L+1}, then there exists a neural network
    Φ_sum with architecture (σ; d_0, d_1^1 + d_1^2, . . . , d_L^1 + d_L^2, d_{L+1}) such that
    Φ_sum(x) = Φ_1(x) + Φ_2(x) for all x ∈ R^{d_0},

(iv) if d_{L_1+1}^1 = d_0^2, then there exists a neural network Φ_comp with architecture
    (σ; d_0^1, d_1^1, . . . , d_{L_1}^1, d_1^2, . . . , d_{L_2+1}^2) such that
    Φ_comp(x) = Φ_2 ◦ Φ_1(x) for all x ∈ R^{d_0^1}.
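One possible construction for parallelization as in (ii) stacks the first-layer weights of Φ_1 and Φ_2 vertically and all later layers block-diagonally, so the two networks run side by side without interacting. The following self-contained sketch (ours, not the book's proof, and only a partial hint towards Exercise 2.5) checks this numerically with a forward pass as in Definition 2.1 and randomly chosen toy architectures:

```python
import numpy as np

def forward(x, weights, biases, sigma):
    # evaluates a network as in Definition 2.1
    for W, b in zip(weights[:-1], biases[:-1]):
        x = sigma(W @ x + b)
    return weights[-1] @ x + biases[-1]

def parallelize(Ws1, bs1, Ws2, bs2):
    # Proposition 2.3 (ii): assumes equal input dimension and equal depth
    Ws = [np.vstack([Ws1[0], Ws2[0]])]            # both blocks read the same input x
    bs = [np.concatenate([bs1[0], bs2[0]])]
    for W1, W2, b1, b2 in zip(Ws1[1:], Ws2[1:], bs1[1:], bs2[1:]):
        Ws.append(np.block([[W1, np.zeros((W1.shape[0], W2.shape[1]))],
                            [np.zeros((W2.shape[0], W1.shape[1])), W2]]))
        bs.append(np.concatenate([b1, b2]))
    return Ws, bs

relu = lambda z: np.maximum(z, 0.0)
rng = np.random.default_rng(0)
Ws1, bs1 = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))], [rng.normal(size=4), rng.normal(size=2)]
Ws2, bs2 = [rng.normal(size=(5, 3)), rng.normal(size=(1, 5))], [rng.normal(size=5), rng.normal(size=1)]

Wp, bp = parallelize(Ws1, bs1, Ws2, bs2)
x = np.array([0.3, -1.0, 2.0])
print(np.allclose(forward(x, Wp, bp, relu),
                  np.concatenate([forward(x, Ws1, bs1, relu),
                                  forward(x, Ws2, bs2, relu)])))  # True
```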

2.2 Notion of size


Neural networks provide a framework to parametrize functions. Ultimately, our goal is to find
a neural network that fits some underlying input-output relation. As mentioned above, the ar-
chitecture (depth, width and activation function) is typically chosen a priori and considered fixed.
During training of the neural network, its parameters (weights and biases) are suitably adapted by
some algorithm. Depending on the application, on top of the stated architecture choices, further
restrictions on the weights and biases can be desirable. For example, the following two appear
frequently:

• weight sharing: This is a technique where specific entries of the weight matrices (or bias
vectors) are constrained to be equal. Formally, this means imposing conditions of the form
W^{(i)}_{k,l} = W^{(j)}_{s,t}, i.e. the entry (k, l) of the i-th weight matrix is equal to the entry at position
(s, t) of weight matrix j. We denote this assumption by (i, k, l) ∼ (j, s, t), paying tribute
to the trivial fact that “∼” is an equivalence relation. During training, shared weights are
updated jointly, meaning that any change to one weight is simultaneously applied to all other
weights of this class. Weight sharing can also be applied to the entries of bias vectors.

• sparsity: This refers to imposing a sparsity structure on the weight matrices (or bias vectors).
Specifically, we set W^{(i)}_{k,l} = 0 a priori for certain (k, l, i), i.e. we impose entry (k, l) of the i-th
weight matrix to be 0. These zero-valued entries are considered fixed, and are not adjusted
during training. The condition W^{(i)}_{k,l} = 0 corresponds to node l of layer i − 1 not serving as an
input to node k in layer i. If we represent the neural network as a graph, this is indicated by
not connecting the corresponding nodes. Sparsity can also be imposed on the bias vectors.

Both of these restrictions decrease the number of learnable parameters in the neural network. The
number of parameters can be seen as a measure of the complexity of the represented function class.
For this reason, we introduce size(Φ) as a notion for the number of learnable parameters. Formally
(with |S| denoting the cardinality of a set S):

Definition 2.4. Let Φ be as in Definition 2.1. Then the size of Φ is

size(Φ) := |({(i, k, l) | W^{(i)}_{k,l} ≠ 0} ∪ {(i, k) | b^{(i)}_k ≠ 0}) / ∼|.    (2.2.1)
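In the absence of weight sharing, the equivalence classes in (2.2.1) are singletons, so size(Φ) is simply the number of nonzero weights and biases. A minimal sketch (ours), under that assumption and with a made-up sparse example:

```python
import numpy as np

def size(weights, biases):
    # number of nonzero entries of all W^(i) and b^(i); with weight sharing,
    # each class of shared entries would instead be counted only once, cf. (2.2.1)
    return (sum(int(np.count_nonzero(W)) for W in weights)
            + sum(int(np.count_nonzero(b)) for b in biases))

# sparse two-layer example: the zero entries are fixed and do not count
weights = [np.array([[1.0, 0.0], [0.5, 2.0]]), np.array([[0.0, 3.0]])]
biases = [np.array([0.0, 1.0]), np.array([4.0])]
print(size(weights, biases))  # 3 + 1 + 1 + 1 = 6
```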

2.3 Activation functions
Activation functions are a crucial part of neural networks, as they introduce nonlinearity into the
model. If an affine activation function were used, the resulting neural network function would also
be affine and hence very restricted in what it can represent.
The choice of activation function can have a significant impact on the performance, but there
does not seem to be a universally optimal one. We next discuss a few important activation functions
and highlight some common issues associated with them.

Figure 2.2: Different activation functions: (a) Sigmoid, (b) ReLU and SiLU, (c) Leaky ReLU for a ∈ {0.05, 0.1, 0.2}.

Sigmoid: The sigmoid activation function is given by

σ_sig(x) = 1 / (1 + e^{−x})   for x ∈ R,
and depicted in Figure 2.2 (a). Its output ranges between zero and one, making it interpretable
as a probability. The sigmoid is a smooth function, which allows the application of gradient-based
training.
It has the disadvantage that its derivative becomes very small if |x| → ∞. This can affect
learning due to the so-called vanishing gradient problem. Consider the simple neural network
Φn (x) = σ ◦ · · · ◦ σ(x + b) defined with n ∈ N compositions of σ, and where b ∈ R is a bias. Its
derivative with respect to b is
(d/db) Φ_n(x) = σ′(Φ_{n−1}(x)) · (d/db) Φ_{n−1}(x).

If sup_{x∈R} |σ′(x)| ≤ 1 − δ, then by induction, |(d/db) Φ_n(x)| ≤ (1 − δ)^n. The opposite effect happens
for activation functions with derivatives uniformly larger than one. This argument shows that
the derivative of Φn (x, b) with respect to b can become exponentially small or exponentially large
when propagated through the layers. This effect, known as the vanishing- or exploding gradient
effect, also occurs for activation functions which do not admit the uniform bounds assumed above.
However, since the sigmoid activation function exhibits areas with extremely small gradients, the
vanishing gradient effect can be strongly exacerbated.
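This decay is easy to observe numerically. The sketch below (our illustration, with arbitrary values of x and b) evaluates the recursion above for the sigmoid, whose derivative is at most 1/4, so the bound applies with δ = 3/4:

```python
import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))
dsig = lambda z: sig(z) * (1.0 - sig(z))   # sup_x |sigma'(x)| = 1/4

def d_phi_db(x, b, n):
    # d/db Phi_n(x) for Phi_n(x) = sig(...sig(x + b)...), n compositions of sig
    z = x + b
    grad = dsig(z)          # derivative of the innermost composition w.r.t. b
    for _ in range(n - 1):
        z = sig(z)          # z = Phi_k(x)
        grad *= dsig(z)     # chain rule: multiply by sigma'(Phi_k(x))
    return grad

for n in [1, 5, 10, 20]:
    print(n, d_phi_db(x=0.5, b=0.1, n=n))   # decays at least like (1/4)^n
```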
ReLU (Rectified Linear Unit): The ReLU is defined as

σReLU (x) = max{x, 0} for x ∈ R,

and depicted in Figure 2.2 (b). It is piecewise linear, and due to its simplicity its evaluation is
computationally very efficient. It is one of the most popular activation functions in practice. Since
its derivative is always zero or one, it does not suffer from the vanishing gradient problem to the
same extent as the sigmoid function. However, ReLU can suffer from the so-called dead neurons
problem. Consider the neural network

Φ(x) = σReLU (b − σReLU (x)) for x ∈ R

depending on the bias b ∈ R. If b < 0, then Φ(x) = 0 for all x ∈ R. The neuron corresponding to
the second application of σ_ReLU thus produces a constant signal. Moreover, if b < 0, then (d/db) Φ(x) = 0
for all x ∈ R. As a result, every negative value of b yields a stationary point of the empirical risk.
A gradient-based method will not be able to further train the parameter b. We thus refer to this
neuron as a dead neuron.
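A quick numerical check (our sketch, with an arbitrary negative bias) confirms the dead neuron: for b < 0 the network output and its finite-difference derivative with respect to b vanish identically:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
phi = lambda x, b: relu(b - relu(x))     # the network from the text

b, eps = -0.5, 1e-6
xs = np.linspace(-3.0, 3.0, 7)
print(phi(xs, b))                                          # identically zero
print((phi(xs, b + eps) - phi(xs, b - eps)) / (2 * eps))   # d/db Phi(x) = 0 as well
```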
SiLU (Sigmoid Linear Unit): An important difference between the ReLU and the Sigmoid is
that the ReLU is not differentiable at 0. The SiLU activation function (also referred to as “swish”)
can be interpreted as a smooth approximation to the ReLU. It is defined as
σ_SiLU(x) := x σ_sig(x) = x / (1 + e^{−x})   for x ∈ R,
and is depicted in Figure 2.2 (b). There exist various other smooth activation functions that
mimic the ReLU, including the Softplus x ↦ log(1 + exp(x)), the GELU (Gaussian Error Linear
Unit) x ↦ xF(x), where F(x) denotes the cumulative distribution function of the standard normal
distribution, and the Mish x ↦ x tanh(log(1 + exp(x))).
Parametric ReLU or Leaky ReLU: This variant of the ReLU addresses the dead neuron
problem. For some a ∈ (0, 1), the parametric ReLU is defined as

σa (x) = max{x, ax} for x ∈ R,

and is depicted in Figure 2.2 (c) for three different values of a. Since σ_a, unlike the ReLU, has no flat regions, the dead neuron problem is mitigated. If a is not chosen too small,
then there is less of a vanishing gradient problem than for the Sigmoid. In practice, the additional
parameter a has to be fine-tuned depending on the application. Like the ReLU, the parametric
ReLU is not differentiable at 0.

Bibliography and further reading


The concept of neural networks was first introduced by McCulloch and Pitts in [172]. Later
Rosenblatt [229] introduced the perceptron, an artificial neuron with adjustable weights that forms
the basis of the multilayer perceptron (a fully connected feedforward neural network). The vanishing
gradient problem briefly addressed in Section 2.3 was discussed by Hochreiter in his diploma thesis
[115] and later in [22, 117].

Exercises
Exercise 2.5. Prove Proposition 2.3.

Exercise 2.6. In this exercise, we show that ReLU and parametric ReLU create similar sets of
neural network functions. Fix a > 0.

(i) Find a set of weight matrices and bias vectors, such that the associated neural network Φ1 ,
with the ReLU activation function σReLU satisfies Φ1 (x) = σa (x) for all x ∈ R.

(ii) Find a set of weight matrices and bias vectors, such that the associated neural network Φ2 ,
with the parametric ReLU activation function σa satisfies Φ2 (x) = σReLU (x) for all x ∈ R.

(iii) Conclude that every ReLU neural network can be expressed as a leaky ReLU neural network
and vice versa.

Exercise 2.7. Let d ∈ N, and let Φ1 be a neural network with the ReLU as activation function,
input dimension d, and output dimension 1. Moreover, let Φ2 be a neural network with the sigmoid
activation function, input dimension d, and output dimension 1. Show that, if Φ1 = Φ2 , then Φ1 is
a constant function.

Exercise 2.8. In this exercise, we show that for the sigmoid activation functions, dead-neuron-like
behavior is very rare. Let Φ be a neural network with the sigmoid activation function. Assume
that Φ is a constant function. Show that for every ε > 0 there is a non-constant neural network Φ̃ with the same architecture as Φ such that for all ℓ = 0, . . . , L,

∥W^{(ℓ)} − W̃^{(ℓ)}∥ ≤ ε   and   ∥b^{(ℓ)} − b̃^{(ℓ)}∥ ≤ ε,

where W^{(ℓ)}, b^{(ℓ)} are the weights and biases of Φ and W̃^{(ℓ)}, b̃^{(ℓ)} those of Φ̃.
Show that such a statement does not hold for ReLU neural networks. What about leaky ReLU?

Chapter 3

Universal approximation

After introducing neural networks in Chapter 2, it is natural to inquire about their capabilities.
Specifically, we might wonder if there exist inherent limitations to the type of functions a neural
network can represent. Could there be a class of functions that neural networks cannot approx-
imate? If so, it would suggest that neural networks are specialized tools, similar to how linear
regression is suited for linear relationships, but not for data with nonlinear relationships.
In this chapter, primarily following [159], we will show that this is not the case, and neural
networks are indeed a universal tool. More precisely, given sufficiently large and complex archi-
tectures, they can approximate almost every sensible input-output relationship. We will formalize
and prove this claim in the subsequent sections.

3.1 A universal approximation theorem


To analyze what kind of functions can be approximated with neural networks, we start by consid-
ering the uniform approximation of continuous functions f : Rd → R on compact sets. To this end,
we first introduce the notion of compact convergence.

Definition 3.1. Let d ∈ N. A sequence of functions f_n : R^d → R, n ∈ N, is said to converge compactly to a function f : R^d → R, if for every compact K ⊆ R^d it holds that lim_{n→∞} sup_{x∈K} |f_n(x) − f(x)| = 0. In this case we write f_n →cc f.

Throughout what follows, we always consider C 0 (Rd ) equipped with the topology of Defini-
tion 3.1 (also see Exercise 3.22), and every subset such as C 0 (D) with the subspace topology:
for example, if D ⊆ Rd is bounded, then convergence in C 0 (D) refers to uniform convergence
limn→∞ supx∈D |fn (x) − f (x)| = 0.

3.1.1 Universal approximators


As stated before, we want to show that deep neural networks can approximate every continuous
function in the sense of Definition 3.1. We call sets of functions that satisfy this property universal
approximators.

Definition 3.2. Let d ∈ N. A set of functions H from Rd to R is a universal approximator (of
C 0 (Rd )), if for every ε > 0, every compact K ⊆ Rd , and every f ∈ C 0 (Rd ), there exists g ∈ H such
that supx∈K |f (x) − g(x)| < ε.

For a set of (not necessarily continuous) functions H mapping between R^d and R, we denote by cl_cc(H) its closure with respect to compact convergence.
The relationship between a universal approximator and the closure with respect to compact
convergence is established in the proposition below.

Proposition 3.3. Let d ∈ N and H be a set of functions from Rd to R. Then, H is a universal


approximator of C^0(R^d) if and only if C^0(R^d) ⊆ cl_cc(H).

Proof. Suppose that H is a universal approximator and fix f ∈ C 0 (Rd ). For n ∈ N, define Kn :=
[−n, n]d ⊆ Rd . Then for every n ∈ N there exists fn ∈ H such that supx∈Kn |fn (x) − f (x)| < 1/n.
Since for every compact K ⊆ R^d there exists n_0 such that K ⊆ K_n for all n ≥ n_0, it holds f_n →cc f.
The “only if” part of the assertion is trivial.
A key tool to show that a set is a universal approximator is the Stone-Weierstrass theorem, see
for instance [233, Sec. 5.7].

Theorem 3.4 (Stone-Weierstrass). Let d ∈ N, let K ⊆ Rd be compact, and let H ⊆ C 0 (K, R)


satisfy that

(a) for all x ∈ K there exists f ∈ H such that f (x) ̸= 0,

(b) for all x ̸= y ∈ K there exists f ∈ H such that f (x) ̸= f (y),

(c) H is an algebra of functions, i.e., H is closed under addition, multiplication and scalar mul-
tiplication.

Then H is dense in C 0 (K).

Example 3.5 (Polynomials are a universal approximator). For a multiindex α = (α_1, . . . , α_d) ∈ N_0^d and a vector x = (x_1, . . . , x_d) ∈ R^d denote x^α := ∏_{j=1}^d x_j^{α_j}. In the following, with |α| := Σ_{j=1}^d α_j, we write

P_n := span{x^α | α ∈ N_0^d, |α| ≤ n},

i.e., P_n is the space of polynomials of degree at most n (with real coefficients). It is easy to check that P := ∪_{n∈N} P_n satisfies the assumptions of Theorem 3.4 on every compact set K ⊆ R^d.
Thus the space of polynomials P is a universal approximator of C 0 (Rd ), and by Proposition 3.3,
P is dense in C 0 (Rd ). In case we wish to emphasize the dimension of the underlying space, in the
following we will also write Pn (Rd ) or P(Rd ) to denote Pn , P respectively. ♢

3.1.2 Shallow neural networks
With the necessary formalism established, we can now show that shallow neural networks of ar-
bitrary width form a universal approximator under certain (mild) conditions on the activation
function. The results in this section are based on [159], and for the proofs we follow the arguments
in that paper.
We first introduce notation for the set of all functions realized by certain architectures.

Definition 3.6. Let d, m, L, n ∈ N and σ : R → R. The set of all functions realized by neural
networks with d-dimensional input, m-dimensional output, depth at most L, width at most n, and
activation function σ is denoted by

Ndm (σ; L, n) := {Φ : Rd → Rm | Φ as in Def. 2.1, depth(Φ) ≤ L, width(Φ) ≤ n}.

Furthermore,
N_d^m(σ; L) := ∪_{n∈N} N_d^m(σ; L, n).

In the sequel, we require the activation function σ to belong to the set of piecewise continuous
and locally bounded functions

M := { σ ∈ L^∞_loc(R) | there exist intervals I_1, . . . , I_M partitioning R, s.t. σ ∈ C^0(I_j) for all j = 1, . . . , M }.      (3.1.1)

Here, M ∈ N is finite, and the intervals Ij are understood to have positive (possibly infinite)
Lebesgue measure, i.e. Ij is not allowed to be empty or a single point. Hence, σ is a piecewise
continuous function, and it has discontinuities at at most finitely many points.

Example 3.7. Activation functions belonging to M include, in particular, all continuous non-
polynomial functions, which in turn includes all practically relevant activation functions such as
the ReLU, the SiLU, and the Sigmoid discussed in Section 2.3. In these cases, we can choose M = 1
and I1 = R. Discontinuous functions include for example the Heaviside function x 7→ 1x>0 (also
called a “perceptron” in this context) but also x 7→ 1x>0 sin(1/x): Both belong to M with M = 2,
I1 = (−∞, 0] and I2 = (0, ∞). We exclude for example the function x 7→ 1/x, which is not locally
bounded. ♢

The rest of this subsection is dedicated to proving the following theorem that has now already
been announced repeatedly.

Theorem 3.8. Let d ∈ N and σ ∈ M. Then Nd1 (σ; 1) is a universal approximator of C 0 (Rd ) if
and only if σ is not a polynomial.

Remark 3.9. We will see in Corollary 3.18 and Exercise 3.26 that neural networks can also arbitrarily
well approximate non-continuous functions with respect to suitable norms.
The universal approximation theorem by Leshno, Lin, Pinkus and Schocken [159]—of which
Theorem 3.8 is a special case—is even formulated for a much larger set M, which allows for
activation functions that have discontinuities at a (possibly non-finite) set of Lebesgue measure
zero. Instead of proving the theorem in this generality, we resort to the simpler case stated above.
This allows to avoid some technicalities, but the main ideas remain the same. The proof strategy
is to verify the following three claims:
(i) if C^0(R^1) ⊆ cl_cc(N_1^1(σ; 1)) then C^0(R^d) ⊆ cl_cc(N_d^1(σ; 1)),

(ii) if σ ∈ C^∞(R) is not a polynomial then C^0(R^1) ⊆ cl_cc(N_1^1(σ; 1)),

(iii) if σ ∈ M is not a polynomial then there exists σ̃ ∈ C^∞(R) ∩ cl_cc(N_1^1(σ; 1)) which is not a polynomial.

Upon observing that σ̃ ∈ cl_cc(N_1^1(σ; 1)) implies cl_cc(N_1^1(σ̃; 1)) ⊆ cl_cc(N_1^1(σ; 1)), it is easy to see that these
statements together with Proposition 3.3 establish the implication “⇐” asserted in Theorem 3.8.
The reverse direction is straightforward to check and will be the content of Exercise 3.23.
We start with a more general version of (i) and reduce the problem to the one dimensional case
following [165, Theorem 2.1].

Lemma 3.10. Assume that H is a universal approximator of C 0 (R). Then for every d ∈ N

span{x 7→ g(w · x) | w ∈ Rd , g ∈ H}

is a universal approximator of C 0 (Rd ).

Proof. For k ∈ N0 , denote by Hk the space of all k-homogenous polynomials, that is


H_k := span{ R^d ∋ x 7→ x^α | α ∈ N_0^d, |α| = k }.

We claim that
H_k ⊆ cl_cc( span{R^d ∋ x 7→ g(w · x) | w ∈ R^d, g ∈ H} ) =: X      (3.1.2)
for all k ∈ N0 . This implies that all multivariate polynomials belong to X. An application of the
Stone-Weierstrass theorem (cp. Example 3.5) and Proposition 3.3 then conclude the Q proof.
For every α, β ∈ N_0^d with |α| = |β| = k, it holds D^β x^α = δ_{β,α} α!, where α! := ∏_{j=1}^d α_j! and
δβ,α = 1 if β = α and δβ,α = 0 otherwise. Hence, since {x 7→ xα | |α| = k} is a basis of Hk , the
set {Dα | |α| = k} is a basis of its topological dual H′k . Thus each linear functional l ∈ H′k allows
the representation l = p(D) for some p ∈ Hk (here D stands for the differential).
By the multinomial formula

(w · x)^k = ( Σ_{j=1}^d w_j x_j )^k = Σ_{α∈N_0^d, |α|=k} (k!/α!) w^α x^α.

Therefore, we have that (x 7→ (w · x)k ) ∈ Hk . Moreover, for every l = p(D) ∈ H′k and all w ∈ Rd
we have that

l(x 7→ (w · x)k ) = k!p(w).

Hence, if l(x 7→ (w · x)k ) = p(D)(x 7→ (w · x)k ) = 0 for all w ∈ Rd , then p ≡ 0 and thus l ≡ 0.
This implies span{x 7→ (w · x)k | w ∈ Rd } = Hk . Indeed, if there exists h ∈ Hk which is not
in span{x 7→ (w · x)k | w ∈ Rd }, then by the theorem of Hahn-Banach (see Theorem B.10), there
exists a non-zero functional in H′k vanishing on span{x 7→ (w · x)k | w ∈ Rd }. This contradicts the
previous observation.
By the universality of H it is not hard to see that x 7→ (w · x)k ∈ X for all w ∈ Rd . Therefore,
we have Hk ⊆ X for all k ∈ N0 .

By the above lemma, in order to verify that Nd1 (σ; 1) is a universal approximator, it suffices to
show that N11 (σ; 1) is a universal approximator. We first show that this is the case for sigmoidal
activations.

Definition 3.11. An activation function σ : R → R is called sigmoidal, if σ ∈ C 0 (R),


limx→∞ σ(x) = 1 and limx→−∞ σ(x) = 0.

For sigmoidal activation functions we can now conclude the universality in the univariate case.

Lemma 3.12. Let σ : R → R be monotonically increasing and sigmoidal. Then C^0(R) ⊆ cl_cc(N_1^1(σ; 1)).

We prove Lemma 3.12 in Exercise 3.24. Lemma 3.10 and Lemma 3.12 show Theorem 3.8 in
the special case where σ is monotonically increasing and sigmoidal. For the general case, let us
continue with (ii) and consider C ∞ activations.

Lemma 3.13. If σ ∈ C ∞ (R) and σ is not a polynomial, then N11 (σ; 1) is dense in C 0 (R).

Proof. Denote X := cl_cc(N_1^1(σ; 1)). We show again that all polynomials belong to X. An application
of the Stone-Weierstrass theorem then gives the statement.
Fix b ∈ R and denote fx (w) := σ(wx + b) for all x, w ∈ R. By Taylor’s theorem, for h ̸= 0

(σ((w + h)x + b) − σ(wx + b))/h = (f_x(w + h) − f_x(w))/h
                                = f_x′(w) + (h/2) f_x′′(ξ)
                                = f_x′(w) + (h/2) x² σ′′(ξx + b)      (3.1.3)

for some ξ = ξ(h) between w and w + h. Note that the left-hand side belongs to N11 (σ; 1) as a
function of x. Since σ ′′ ∈ C 0 (R), for every compact set K ⊆ R

sup_{x∈K} sup_{|h|≤1} |x² σ′′(ξ(h)x + b)| ≤ sup_{x∈K} sup_{η∈[w−1,w+1]} |x² σ′′(ηx + b)| < ∞.

Letting h → 0, as a function of x the term in (3.1.3) thus converges uniformly towards K ∋


x 7→ fx′ (w). Since K was arbitrary, x 7→ fx′ (w) belongs to X. Inductively applying the same
argument to f_x^{(k−1)}(w), we find that x 7→ f_x^{(k)}(w) belongs to X for all k ∈ N, w ∈ R. Observe that f_x^{(k)}(w) = x^k σ^{(k)}(wx + b). Since σ is not a polynomial, for each k ∈ N there exists b_k ∈ R such that
σ (k) (bk ) ̸= 0. Choosing w = 0, we obtain that x 7→ σ (k) (bk )xk belongs to X, and thus also x 7→ xk
belongs to X.

Finally, we come to the proof of (iii)—the claim that there exists at least one non-polynomial
C ∞ (R) function in the closure of N11 (σ; 1). The argument is split into two lemmata. Denote in
the following by Cc∞ (R) the set of compactly supported C ∞ (R) functions, and for two functions f ,
g : R → R let

f ∗ g(x) := ∫_R f(x − y) g(y) dy   for all x ∈ R      (3.1.4)
be the convolution of f and g.

Lemma 3.14. Let σ ∈ M. Then for each φ ∈ C_c^∞(R) it holds σ ∗ φ ∈ cl_cc(N_1^1(σ; 1)).

Proof. Fix φ ∈ Cc∞ (R) and let a > 0 such that supp φ ⊆ [−a, a]. Denote yj := −a + 2aj/n for
j = 0, . . . , n and define for x ∈ R
f_n(x) := (2a/n) Σ_{j=0}^{n−1} σ(x − y_j) φ(y_j).

Clearly, f_n ∈ N_1^1(σ; 1). We will show f_n →cc σ ∗ φ as n → ∞. To do so we verify uniform convergence
of fn towards σ ∗ φ on the interval [−b, b] with b > 0 arbitrary but fixed.
For x ∈ [−b, b]
|σ ∗ φ(x) − f_n(x)| ≤ Σ_{j=0}^{n−1} | ∫_{y_j}^{y_{j+1}} ( σ(x − y)φ(y) − σ(x − y_j)φ(y_j) ) dy |.      (3.1.5)

Fix ε ∈ (0, 1). Since σ ∈ M, there exist z_1, . . . , z_M ∈ R such that σ is continuous on R\{z_1, . . . , z_M} (cp. (3.1.1)). With D_ε := ∪_{j=1}^M (z_j − ε, z_j + ε), observe that σ is uniformly continuous on the compact
set Kε := [−a − b, a + b] ∩ Dεc . Now let Jc ∪ Jd = {0, . . . , n − 1} be a partition (depending on x),
such that j ∈ Jc if and only if [x − yj+1 , x − yj ] ⊆ Kε . Hence, j ∈ Jd implies the existence of
i ∈ {1, . . . , M } such that the distance of zi to [x − yj+1 , x − yj ] is at most ε. Due to the interval

[x − yj+1 , x − yj ] having length 2a/n, we can bound

Σ_{j∈J_d} (y_{j+1} − y_j) = | ∪_{j∈J_d} [x − y_{j+1}, x − y_j] |
                          ≤ | ∪_{i=1}^M [z_i − ε − 2a/n, z_i + ε + 2a/n] |
                          ≤ M · (2ε + 4a/n),
where |A| denotes the Lebesgue measure of a measurable set A ⊆ R. Next, because of the local
boundedness of σ and the fact that φ ∈ Cc∞ , it holds sup|y|≤a+b |σ(y)| + sup|y|≤a |φ(y)| =: γ < ∞.
Hence

|σ ∗ φ(x) − f_n(x)|
   ≤ Σ_{j∈J_c ∪ J_d} | ∫_{y_j}^{y_{j+1}} ( σ(x − y)φ(y) − σ(x − y_j)φ(y_j) ) dy |
   ≤ 2γ² M · (2ε + 4a/n) + 2a sup_{j∈J_c} max_{y∈[y_j, y_{j+1}]} |σ(x − y)φ(y) − σ(x − y_j)φ(y_j)|.      (3.1.6)

We can bound the term in the last maximum by

|σ(x − y)φ(y) − σ(x − yj )φ(yj )|


≤ |σ(x − y) − σ(x − y_j)| |φ(y)| + |σ(x − y_j)| |φ(y) − φ(y_j)|
≤ γ · ( sup_{z_1,z_2∈K_ε, |z_1−z_2|≤2a/n} |σ(z_1) − σ(z_2)| + sup_{z_1,z_2∈[−a,a], |z_1−z_2|≤2a/n} |φ(z_1) − φ(z_2)| ).

Finally, uniform continuity of σ on Kε and φ on [−a, a] imply that the last term tends to 0 as
n → ∞ uniformly for all x ∈ [−b, b]. This shows that there exist C < ∞ (independent of ε and x)
and nε ∈ N (independent of x) such that the term in (3.1.6) is bounded by Cε for all n ≥ nε . Since
ε was arbitrary, this yields the claim.

Lemma 3.15. If σ ∈ M and σ ∗ φ is a polynomial for all φ ∈ Cc∞ (R), then σ is a polynomial.

Proof. Fix −∞ < a < b < ∞ and consider Cc∞ (a, b) := {φ ∈ C ∞ (R) | supp φ ⊆ [a, b]}. Define a
metric ρ on Cc∞ (a, b) via
ρ(φ, ψ) := Σ_{j∈N_0} 2^{−j} |φ − ψ|_{C^j(a,b)} / (1 + |φ − ψ|_{C^j(a,b)}),

where

|φ|_{C^j(a,b)} := sup_{x∈[a,b]} |φ^{(j)}(x)|.

Since the space of j times differentiable functions on [a, b] is complete with respect to the norm Σ_{i=0}^j | · |_{C^i(a,b)}, see for instance [114, Satz 104.3], the space C_c^∞(a, b) is complete with the metric ρ.
For k ∈ N set

Vk := {φ ∈ Cc∞ (a, b) | σ ∗ φ ∈ Pk },

where Pk := span{R ∋ x 7→ xj | 0 ≤ j ≤ k} denotes the space of polynomials of degree at most k.


Then Vk is closed with respect to the metric ρ. To see this, we need to show that for a converging
sequence φj → φ∗ with respect to ρ and φj ∈ Vk , it follows that Dk+1 (σ ∗ φ∗ ) = 0 and hence σ ∗ φ∗
is a polynomial: Using Dk+1 (σ ∗ φj ) = 0 if φj ∈ Vk , the linearity of the convolution, and the fact
that Dk+1 (σ ∗ g) = σ ∗ Dk+1 (g) for differentiable g and if both sides are well-defined, we get

sup_{x∈[a,b]} |D^{k+1}(σ ∗ φ^*)(x)| = sup_{x∈[a,b]} |σ ∗ D^{k+1}(φ^* − φ_j)(x)|
                                    ≤ |b − a| sup_{z∈[a−b,b−a]} |σ(z)| · sup_{x∈[a,b]} |D^{k+1}(φ_j − φ^*)(x)|.

Since σ is locally bounded, the right hand-side converges to 0 as j → ∞.


By assumption we have
∪_{k∈N} V_k = C_c^∞(a, b).

Baire’s category theorem (Theorem B.6) implies the existence of k0 ∈ N (depending on a, b)


such that Vk0 contains an open subset of Cc∞ (a, b). Since Vk0 is a vector space, it must hold
Vk0 = Cc∞ (a, b).
We now show that φ ∗ σ ∈ Pk0 for every φ ∈ Cc∞ (R); in other words, k0 = k0 (a, b) can be chosen
independent of a and b. First consider a shift s ∈ R and let ã := a + s and b̃ := b + s. Then with
S(x) := x + s, for any φ ∈ Cc∞ (ã, b̃) holds φ ◦ S ∈ Cc∞ (a, b), and thus (φ ◦ S) ∗ σ ∈ Pk0 . Since
(φ ◦ S) ∗ σ(x) = φ ∗ σ(x + s), we conclude that φ ∗ σ ∈ Pk0 . Next let −∞ < ã < b̃ < ∞ be arbitrary.
Then, for any integer n > (b̃ − ã)/(b − a) we can cover (ã, b̃) with n ∈ N overlapping open intervals (a_1, b_1), . . . , (a_n, b_n), each of length b − a. Any φ ∈ C_c^∞(ã, b̃) can be written as φ = Σ_{j=1}^n φ_j where φ_j ∈ C_c^∞(a_j, b_j). Then φ ∗ σ = Σ_{j=1}^n φ_j ∗ σ ∈ P_{k_0}, and thus φ ∗ σ ∈ P_{k_0} for every φ ∈ C_c^∞(R). Finally, Exercise 3.25 implies σ ∈ P_{k_0}.

Now we can put everything together to show Theorem 3.8.

Proof (of Theorem 3.8). By Exercise 3.23 we have the implication “⇒”.
For the other direction we assume that σ ∈ M is not a polynomial. Then by Lemma 3.15
there exists φ ∈ Cc∞ (R) such that σ ∗ φ is not a polynomial. According to Lemma 3.14 we have
cc
σ ∗ φ ∈ N11 (σ; 1) . We conclude with Lemma 3.13 that N11 (σ; 1) is a universal approximator of
C 0 (R).
Finally, by Lemma 3.10, Nd1 (σ; 1) is a universal approximator of C 0 (Rd ).

3.1.3 Deep neural networks
Theorem 3.8 shows the universal approximation capability of single-hidden-layer neural networks
with activation functions σ ∈ M\P: they can approximate every continuous function on every
compact set to arbitrary precision, given sufficient width. This result directly extends to neural
networks of any fixed depth L ≥ 1. The idea is to use the fact that the identity function can be
approximated with a shallow neural network. Composing a shallow neural network approximation of
the target function f with (multiple) shallow neural networks approximating the identity function,
gives a deep neural network approximation of f .
Instead of directly applying Theorem 3.8, we first establish the following proposition regarding
the approximation of the identity function. Rather than σ ∈ M\P, it requires a different (mild)
assumption on the activation function. This allows for a constructive proof, yielding explicit bounds
on the neural network size, which will prove useful later in the book.

Proposition 3.16. Let d, L ∈ N, let K ⊆ Rd be compact, and let σ : R → R be such that there
exists an open set on which σ is differentiable and not constant. Then, for every ε > 0, there exists
a neural network Φ ∈ Ndd (σ; L, d) such that

∥Φ(x) − x∥∞ < ε for all x ∈ K.

Proof. The proof uses the same idea as in Lemma 3.13, where we approximate the derivative of the
activation function by a simple neural network. Let us first assume d ∈ N and L = 1.
Let x∗ ∈ R be such that σ is differentiable on a neighborhood of x∗ and σ ′ (x∗ ) = θ ̸= 0.
Moreover, let x∗ = (x∗ , . . . , x∗ ) ∈ Rd . Then, for λ > 0 we define

Φ_λ(x) := (λ/θ) ( σ(x/λ + x*) − σ(x*) ).
Then, we have, for all x ∈ K,

Φ_λ(x) − x = λ (σ(x/λ + x*) − σ(x*))/θ − x.      (3.1.7)
If xi = 0 for i ∈ {1, . . . , d}, then (3.1.7) shows that (Φλ (x) − x)i = 0. Otherwise

|(Φ_λ(x) − x)_i| = (|x_i|/|θ|) · | (σ(x_i/λ + x*) − σ(x*))/(x_i/λ) − θ |.

By the definition of the derivative, we have that |(Φλ (x) − x)i | → 0 for λ → ∞ uniformly for all
x ∈ K and i ∈ {1, . . . , d}. Therefore, |Φλ (x) − x| → 0 for λ → ∞ uniformly for all x ∈ K.
The extension to L > 1 is straightforward and is the content of Exercise 3.27.
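A quick numerical check (ours, not part of the text) of this construction, here for the sigmoid activation with x* = 0, so θ = σ′(0) = 1/4:

```python
# Sketch: Phi_lambda(x) = (lambda/theta)(sigma(x/lambda + x*) - sigma(x*))
# approximates the identity on a compact set as lambda grows.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def phi_lambda(x, lam, x_star=0.0):
    theta = sigmoid(x_star) * (1.0 - sigmoid(x_star))  # sigma'(x*)
    return (lam / theta) * (sigmoid(x / lam + x_star) - sigmoid(x_star))

x = np.linspace(-2.0, 2.0, 401)        # compact set K = [-2, 2]
for lam in [1.0, 10.0, 100.0, 1000.0]:
    err = np.max(np.abs(phi_lambda(x, lam) - x))
    print(lam, err)
# The error sup_{x in K} |Phi_lambda(x) - x| tends to 0 as lambda grows, since
# Phi_lambda is a rescaled difference quotient of sigma at x*.
```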

Using the aforementioned generalization of Proposition 3.16 to arbitrary non-polynomial acti-


vation functions σ ∈ M, we obtain the following extension of Theorem 3.8.

Corollary 3.17. Let d ∈ N, L ∈ N and σ ∈ M. Then Nd1 (σ; L) is a universal approximator of
C 0 (Rd ) if and only if σ is not a polynomial.

Proof. We only show the implication “⇐”. The other direction is again left as an exercise, see
Exercise 3.23.
Assume σ ∈ M is not a polynomial, let K ⊆ Rd be compact, and let f ∈ C 0 (Rd ). Fix ε ∈ (0, 1).
We need to show that there exists a neural network Φ ∈ Nd1 (σ; L) such that supx∈K |f (x)−Φ(x)| <
ε. The case L = 1 holds by Theorem 3.8, so let L > 1.
By Theorem 3.8, there exist Φshallow ∈ Nd1 (σ; 1) such that
ε
sup |f (x) − Φshallow (x)| < . (3.1.8)
x∈K 2
Compactness of {f (x) | x ∈ K} implies that we can find n > 0 such that
{Φshallow (x) | x ∈ K} ⊆ [−n, n]. (3.1.9)
Let Φid ∈ N11 (σ; L − 1) be an approximation to the identity such that
ε
sup |x − Φid (x)| < , (3.1.10)
x∈[−n,n] 2
which is possible by the extension of Proposition 3.16 to general non-polynomial activation functions
σ ∈ M.
Denote Φ := Φ_id ◦ Φ_shallow. According to Proposition 2.3 (iv), it holds that Φ ∈ N_d^1(σ; L), as desired.
Moreover (3.1.8), (3.1.9), (3.1.10) imply
sup |f (x) − Φ(x)| = sup |f (x) − Φid (Φshallow (x))|
x∈K x∈K

≤ sup |f (x) − Φshallow (x)| + |Φshallow (x) − Φid (Φshallow (x))|
x∈K
ε ε
≤ + = ε.
2 2
This concludes the proof.

3.1.4 Other norms


In addition to the case of continuous functions, universal approximation theorems can be shown
for various other function classes and topologies, which may also allow for the approximation of
functions exhibiting discontinuities or singularities. To give but one example, we next state such a
result for Lebesgue spaces on compact sets. The proof is left to the reader, see Exercise 3.26.

Corollary 3.18. Let d ∈ N, L ∈ N, p ∈ [1, ∞), and let σ ∈ M not be a polynomial. Then for
every ε > 0, every compact K ⊆ Rd , and every f ∈ Lp (K) there exists Φf,ε ∈ Nd1 (σ; L) such that
Z 1/p
p
|f (x) − Φ(x)| dx ≤ ε.
K

3.2 Superexpressive activations and Kolmogorov’s superposition
theorem
In the previous section, we saw that a large class of activation functions allow for universal approx-
imation. However, these results did not provide any insights into the necessary neural network size
for achieving a specific accuracy.
Before exploring this topic further in the following chapters, we next present a remarkable result
that shows how the required neural network size is significantly influenced by the choice of activation
function. The result asserts that, with the appropriate activation function, every f ∈ C 0 (K) on a
compact set K ⊆ Rd can be approximated to every desired accuracy ε > 0 using a neural network
of size O(d2 ); in particular the neural network size is independent of ε > 0, K, and f . We will first
discuss the one-dimensional case.

Proposition 3.19. There exists a continuous activation function σ : R → R such that for every
compact K ⊆ R, every ε > 0 and every f ∈ C 0 (K) there exists Φ(x) = σ(wx + b) ∈ N11 (σ; 1, 1)
such that

sup |f (x) − Φ(x)| < ε.


x∈K

Proof. Denote by P̃_n all polynomials p(x) = Σ_{j=0}^n q_j x^j with rational coefficients, i.e. such that q_j ∈ Q for all j = 0, . . . , n. Then P̃_n can be identified with the (n + 1)-fold cartesian product Q × · · · × Q, and thus P̃_n is a countable set. Consequently also the set P̃ := ∪_{n∈N} P̃_n of all polynomials with
rational coefficients is countable. Let (pi )i∈Z be an enumeration of these polynomials, and set
σ(x) := p_i(x − 2i)   if x ∈ [2i, 2i + 1],   and   σ(x) := p_i(1)(2i + 2 − x) + p_{i+1}(0)(x − 2i − 1)   if x ∈ (2i + 1, 2i + 2).
In words, σ equals pi on even intervals [2i, 2i + 1] and is linear on odd intervals [2i + 1, 2i + 2],
resulting in a continuous function overall.
We first assume K = [0, 1]. By Example 3.5, for every ε > 0 there exists p(x) = Σ_{j=0}^n r_j x^j such that sup_{x∈[0,1]} |p(x) − f(x)| < ε/2. Now choose q_j ∈ Q so close to r_j that p̃(x) := Σ_{j=0}^n q_j x^j satisfies sup_{x∈[0,1]} |p̃(x) − p(x)| < ε/2. Let i ∈ Z be such that p̃ = p_i, i.e., p_i(x) = σ(2i + x) for
all x ∈ [0, 1]. Then supx∈[0,1] |f (x) − σ(x + 2i)| < ε.
For general compact K assume that K ⊆ [a, b]. By Tietze’s extension theorem, f allows a
continuous extension to [a, b], so without loss of generality K = [a, b]. By the first case we can find
i ∈ Z such that with y = (x − a)/(b − a) (i.e. y ∈ [0, 1] if x ∈ [a, b])
 
x−a
sup f (x) − σ + 2i = sup |f (y · (b − a) + a) − σ(y + 2i)| < ε,
x∈[a,b] b−a y∈[0,1]

which gives the statement with w = 1/(b − a) and b = −a · (b − a) + 2i.

To extend this result to arbitrary dimension, we will use Kolmogorov’s superposition theorem.
It states that every continuous function of d variables can be expressed as a composition of functions
that each depend only on one variable. We omit the technical proof, which can be found in [147].

Theorem 3.20 (Kolmogorov). For every d ∈ N there exist 2d2 + d monotonically increasing
functions φi,j ∈ C 0 (R), i = 1, . . . , d, j = 1, . . . , 2d + 1, such that for every f ∈ C 0 ([0, 1]d ) there
exist functions fj ∈ C 0 (R), j = 1, . . . , 2d + 1 satisfying
f(x) = Σ_{j=1}^{2d+1} f_j ( Σ_{i=1}^d φ_{i,j}(x_i) )   for all x ∈ [0, 1]^d.

Corollary 3.21. Let d ∈ N. With the activation function σ : R → R from Proposition 3.19, for
every compact K ⊆ Rd , every ε > 0 and every f ∈ C 0 (K) there exists Φ ∈ Nd1 (σ; 2, 2d2 + d) (i.e.
width(Φ) = 2d2 + d and depth(Φ) = 2) such that

sup |f (x) − Φ(x)| < ε.


x∈K

Proof. Without loss of generality we can assume K = [0, 1]d : the extension to the general case then
follows by Tietze’s extension theorem and a scaling argument as in the proof of Proposition 3.19.
Let fj , φi,j , i = 1, . . . , d, j = 1, . . . , 2d + 1 be as in Theorem 3.20. Fix ε > 0. Let a > 0 be so
large that

sup_{i,j} sup_{x∈[0,1]} |φ_{i,j}(x)| ≤ a.

Since each fj is uniformly continuous on the compact set [−da, da], we can find δ > 0 such that
sup_j sup_{|y−ỹ|<δ, |y|,|ỹ|≤da} |f_j(y) − f_j(ỹ)| < ε/(2(2d + 1)).      (3.2.1)

By Proposition 3.19 there exist wi,j , bi,j ∈ R such that


sup_{i,j} sup_{x∈[0,1]} |φ_{i,j}(x) − φ̃_{i,j}(x)| < δ/d,   where φ̃_{i,j}(x) := σ(w_{i,j} x + b_{i,j}),      (3.2.2)

and wj , bj ∈ R such that


sup_j sup_{|y|≤a+δ} |f_j(y) − f̃_j(y)| < ε/(2(2d + 1)),   where f̃_j(y) := σ(w_j y + b_j).      (3.2.3)

Then for all x ∈ [0, 1]d by (3.2.2)


| Σ_{i=1}^d φ_{i,j}(x_i) − Σ_{i=1}^d φ̃_{i,j}(x_i) | < d · (δ/d) = δ.

Thus with
y_j := Σ_{i=1}^d φ_{i,j}(x_i),   ỹ_j := Σ_{i=1}^d φ̃_{i,j}(x_i)

it holds |yj − ỹj | < δ. Using (3.2.1) and (3.2.3) we conclude

| f(x) − Σ_{j=1}^{2d+1} σ( w_j Σ_{i=1}^d σ(w_{i,j} x_i + b_{i,j}) + b_j ) | = | Σ_{j=1}^{2d+1} ( f_j(y_j) − f̃_j(ỹ_j) ) |
   ≤ Σ_{j=1}^{2d+1} ( |f_j(y_j) − f_j(ỹ_j)| + |f_j(ỹ_j) − f̃_j(ỹ_j)| )
   ≤ Σ_{j=1}^{2d+1} ( ε/(2(2d + 1)) + ε/(2(2d + 1)) ) ≤ ε.

This concludes the proof.

Kolmogorov’s superposition theorem is intriguing as it shows that approximating d-dimensional


functions can be reduced to the (generally much simpler) one-dimensional case through composi-
tions. Neural networks, by nature, are well suited to approximate functions with compositional
structures. However, the functions fj in Theorem 3.20, even though only one-dimensional, could
become very complex and challenging to approximate themselves if d is large.
Similarly, the “magic” activation function in Proposition 3.19 encodes the information of all
rational polynomials on the unit interval, which is why a neural network of size O(1) suffices to
approximate every function to arbitrary accuracy. Naturally, no practical algorithm can efficiently
identify appropriate neural network weights and biases for this architecture. As such, the results
presented in Section 3.2 should be taken with a pinch of salt as their practical relevance is highly
limited. Nevertheless, they highlight that while universal approximation is a fundamental and
important property of neural networks, it leaves many aspects unexplored. To gain further insight
into practically relevant architectures, in the following chapters, we investigate neural networks
with activation functions such as the ReLU.

Bibliography and further reading


The foundation of universal approximation theorems goes back to the late 1980s with seminal
works by Cybenko [59], Hornik et al. [120, 119], Funahashi [83] and Carroll and Dickinson [46].
These results were subsequently extended to a wider range of activation functions and architectures.
The present analysis in Section 3.1 closely follows the arguments in [159], where it was essentially
shown that universal approximation can be achieved if the activation function is not polynomial.
The proof of Lemma 3.10 is from [165, Theorem 2.1], with earlier results of this type being due to
[281].
Kolmogorov’s superposition theorem stated in Theorem 3.20 was originally proven in 1957
[147]. For a more recent and constructive proof see for instance [37]. Kolmogorov’s theorem
and its obvious connections to neural networks have inspired various research in this field, e.g.
[193, 151, 181, 242, 129], with its practical relevance being debated [88, 150]. The idea for the

“magic” activation function in Section 3.2 comes from [170] where it is shown that such an activation
function can even be chosen monotonically increasing.

Exercises
Exercise 3.22. Write down a generator of a (minimal) topology on C 0 (Rd ) such that fn → f ∈
cc
C 0 (Rd ) if and only if fn −→ f , and show this equivalence. This topology is referred to as the
topology of compact convergence.

Exercise 3.23. Show the implication “⇒” of Theorem 3.8 and Corollary 3.17.

Exercise 3.24. Prove Lemma 3.12. Hint: Consider σ(nx) for large n ∈ N.

Exercise 3.25. Let k ∈ N, σ ∈ M and assume that σ ∗ φ ∈ Pk for all φ ∈ Cc∞ (R). Show that
σ ∈ Pk .
Hint: Consider ψ ∈ C_c^∞(R) such that ψ ≥ 0 and ∫_R ψ(x) dx = 1, and set ψ_ε(x) := ψ(x/ε)/ε.

Use that away from the discontinuities of σ it holds ψε ∗ σ(x) → σ(x) as ε → 0. Conclude that σ
is piecewise in Pk , and finally show that σ ∈ C k (R).

Exercise 3.26. Prove Corollary 3.18 with the use of Corollary 3.17.

Exercise 3.27. Complete the proof of Proposition 3.16 for L > 1.

Chapter 4

Splines

In Chapter 3, we saw that sufficiently large neural networks can approximate every continuous
function to arbitrary accuracy. However, these results did not further specify the meaning of
“sufficiently large” or what constitutes a suitable architecture. Ideally, given a function f , and a
desired accuracy ε > 0, we would like to have a (possibly sharp) bound on the required size, depth,
and width guaranteeing the existence of a neural network approximating f up to error ε.
The field of approximation theory establishes such trade-offs between properties of the function f
(e.g., its smoothness), the approximation accuracy, and the number of parameters needed to achieve
this accuracy. For example, given k, d ∈ N, how many parameters are required to approximate a
function f : [0, 1]d → R with ∥f ∥C k ([0,1]d ) ≤ 1 up to uniform error ε? Splines are known to achieve
this approximation accuracy with a superposition of O(ε−d/k ) simple (piecewise polynomial) basis
functions. In this chapter, following [176], we show that certain sigmoidal neural networks can
match this performance in terms of the neural network size. In fact, from an approximation
theoretical viewpoint we show that the considered neural networks are at least as expressive as
superpositions of splines.

4.1 B-splines and smooth functions


We introduce a simple type of spline and its approximation properties below.

Definition 4.1. For n ∈ N, the univariate cardinal B-spline of order n is given by

S_n(x) := (1/(n − 1)!) Σ_{ℓ=0}^n (−1)^ℓ (n choose ℓ) σReLU(x − ℓ)^{n−1}   for x ∈ R,      (4.1.1)

where 00 := 0 and σReLU denotes the ReLU activation function.

By shifting and dilating the cardinal B-spline, we obtain a system of univariate splines. Taking
tensor products of these univariate splines yields a set of higher-dimensional functions known as
the multivariate B-splines.

Definition 4.2. For t ∈ R and n, ℓ ∈ N we define S_{ℓ,t,n} := S_n(2^ℓ(· − t)). Additionally, for d ∈ N, t ∈ R^d, and n, ℓ ∈ N, we define the multivariate B-spline S^d_{ℓ,t,n} as

S^d_{ℓ,t,n}(x) := ∏_{i=1}^d S_{ℓ,t_i,n}(x_i)   for x = (x_1, . . . , x_d) ∈ R^d,

and

B^n := { S^d_{ℓ,t,n} | ℓ ∈ N, t ∈ R^d }

is the dictionary of B-splines of order n.
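The following Python sketch (ours, not part of the text; the function names are placeholders) evaluates the cardinal B-spline directly from formula (4.1.1) and the multivariate B-spline from Definition 4.2.

```python
# Direct evaluation of the B-splines defined above.
import numpy as np
from math import comb, factorial

def relu(x):
    return np.maximum(x, 0.0)

def cardinal_bspline(x, n):
    """S_n(x) = 1/(n-1)! * sum_{l=0}^{n} (-1)^l C(n,l) relu(x - l)^(n-1), n >= 2."""
    s = sum((-1) ** l * comb(n, l) * relu(x - l) ** (n - 1) for l in range(n + 1))
    return s / factorial(n - 1)

def multivariate_bspline(x, n, level, t):
    """S^d_{level,t,n}(x) = prod_i S_n(2^level * (x_i - t_i))."""
    x, t = np.atleast_1d(x), np.atleast_1d(t)
    return np.prod([cardinal_bspline(2.0 ** level * (xi - ti), n)
                    for xi, ti in zip(x, t)])

x = np.linspace(-1, 5, 7)
print(np.round(cardinal_bspline(x, n=3), 4))           # piecewise quadratic, supported on [0, 3]
print(multivariate_bspline([0.6, 0.7], n=2, level=1, t=[0.0, 0.0]))
```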

Having introduced the system B n , we would like to understand how well we can represent each
smooth function by superpositions of elements of B n . The following theorem is adapted from the
more general result [202, Theorem 7]; also see [171, Theorem D.3] for a presentation closer to the
present formulation.

Theorem 4.3. Let d, n, k ∈ N such that 0 < k ≤ n. Then there exists C such that for every
f ∈ C k ([0, 1]d ) and every N ∈ N, there exist ci ∈ R with |ci | ≤ C∥f ∥L∞ ([0,1]d ) and Bi ∈ B n for
i = 1, . . . , N , such that
∥ f − Σ_{i=1}^N c_i B_i ∥_{L^∞([0,1]^d)} ≤ C N^{−k/d} ∥f∥_{C^k([0,1]^d)}.

Remark 4.4. There are a couple of critical concepts in Theorem 4.3 that will reappear throughout
this book. The number of parameters N determines the approximation accuracy N −k/d . This im-
plies that achieving accuracy ε > 0 requires O(ε−d/k ) parameters (according to this upper bound),
which grows exponentially in d. This exponential dependence on d is referred to as the “curse of
dimension” and will be discussed again in the subsequent chapters. The smoothness parameter
k has the opposite effect of d, and improves the convergence rate. Thus, smoother functions can
be approximated with fewer B-splines than rougher functions. This more efficient approximation
requires the use of B-splines of order n with n ≥ k. We will see in the following, that the order of
the B-spline is closely linked to the concept of depth in neural networks.

4.2 Reapproximation of B-splines with sigmoidal activations


We now show that the approximation rates of B-splines can be transfered to certain neural networks.
The following argument is based on [174].

Definition 4.5. A function σ : R → R is called sigmoidal of order q ∈ N, if σ ∈ C q−1 (R) and
there exists C > 0 such that
σ(x)/x^q → 0   as x → −∞,
σ(x)/x^q → 1   as x → ∞,
|σ(x)| ≤ C · (1 + |x|)^q   for all x ∈ R.

Example 4.6. The rectified power unit x 7→ σReLU (x)q is sigmoidal of order q. ♢
Our goal in the following is to show that neural networks can approximate a linear combination
of N B-splines with a number of parameters that is proportional to N . As an immediate conse-
quence of Theorem 4.3, we then obtain a convergence rate for neural networks. Let us start by
approximating a single univariate B-spline with a neural network of fixed size.

Proposition 4.7. Let n ∈ N, n ≥ 2, K > 0, and let σ : R → R be sigmoidal of order q ≥ 2. There


exists a constant C > 0 such that for every ε > 0 there is a neural network ΦSn with activation
function σ, ⌈logq (n − 1)⌉ layers, and size C, such that

∥ S_n − Φ^{S_n} ∥_{L^∞([−K,K])} ≤ ε.

Proof. By definition (4.1.1), S_n is a linear combination of n + 1 shifts of σReLU^{n−1}. We start by approximating σReLU^{n−1}. It is not hard to see (Exercise 4.10) that, for every K′ > 0 and every t ∈ N,

| a^{−q^t} (σ ◦ σ ◦ · · · ◦ σ)(ax) − σReLU(x)^{q^t} | → 0   as a → ∞   (σ composed t times),      (4.2.1)

uniformly for all x ∈ [−K ′ , K ′ ].


Set t := ⌈log_q(n − 1)⌉. Then t ≥ 1 since n ≥ 2, and q^t ≥ n − 1. Thus, for every K′ > 0 and ε > 0 there exists a neural network Φ_ε^{q^t} with ⌈log_q(n − 1)⌉ layers satisfying

| Φ_ε^{q^t}(x) − σReLU(x)^{q^t} | ≤ ε   for all x ∈ [−K′, K′].      (4.2.2)

This shows that we can approximate the ReLU to the power of q t ≥ n − 1. However, our goal is to
obtain an approximation of the ReLU raised to the power n − 1, which could be smaller than q t .
To reduce the order, we emulate approximate derivatives of Φ_ε^{q^t}. Concretely, we show the following claim: For all 1 ≤ p ≤ q^t, every K′ > 0 and every ε > 0 there exists a neural network Φ_ε^p having ⌈log_q(n − 1)⌉ layers and satisfying

| Φ_ε^p(x) − σReLU(x)^p | ≤ ε   for all x ∈ [−K′, K′].      (4.2.3)

The claim holds for p = q^t. We now proceed by induction over p = q^t, q^t − 1, . . . . Assume (4.2.3) holds for some p ∈ {2, . . . , q^t}. Fix δ > 0. Then

| (Φ^p_{δ²}(x + δ) − Φ^p_{δ²}(x))/(pδ) − σReLU(x)^{p−1} |
   ≤ 2δ/p + | (σReLU(x + δ)^p − σReLU(x)^p)/(pδ) − σReLU(x)^{p−1} |.

Hence, by the binomial theorem it follows that there exists δ∗ > 0 such that

| (Φ^p_{δ_*²}(x + δ_*) − Φ^p_{δ_*²}(x))/(pδ_*) − σReLU(x)^{p−1} | ≤ ε,

for all x ∈ [−K′, K′]. By Proposition 2.3, (Φ^p_{δ_*²}(· + δ_*) − Φ^p_{δ_*²})/(pδ_*) is a neural network with ⌈log_q(n − 1)⌉ layers and size independent of ε. Calling this neural network Φ_ε^{p−1} shows that
(4.2.3) holds for p − 1, which concludes the induction argument and proves the claim.
For every neural network Φ, every spatial translation Φ(· − t) is a neural network of the same
architecture. Hence, every term in the sum (4.1.1) can be approximated to arbitrary accuracy by
a neural network of a fixed size. Since by Proposition 2.3, sums of neural networks of the same
depth are again neural networks of the same depth, the result follows.
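The limit (4.2.1) can also be observed numerically. The sketch below (ours) uses σ(x) = x² σ_sig(x), which one can check is sigmoidal of order q = 2 in the sense of Definition 4.5, and t = 2 compositions, so q^t = 4.

```python
# Sketch of (4.2.1): a^(-q^t) * sigma(sigma(a*x)) approximates relu(x)^(q^t).
import numpy as np
from scipy.special import expit   # numerically stable sigmoid

def sigma(x):                      # sigmoidal of order q = 2 (assumption checked above)
    return x ** 2 * expit(x)

def relu(x):
    return np.maximum(x, 0.0)

q, t = 2, 2
x = np.linspace(-2.0, 2.0, 9)
for a in [10.0, 100.0, 1000.0]:
    approx = a ** (-q ** t) * sigma(sigma(a * x))
    err = np.max(np.abs(approx - relu(x) ** (q ** t)))
    print(a, err)
# The error decreases as a grows, illustrating that the rescaled t-fold
# composition converges uniformly to relu(x)^(q^t) on [-2, 2].
```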
Next, we extend Proposition 4.7 to the multivariate splines S^d_{ℓ,t,n} for arbitrary ℓ, d ∈ N, t ∈ R^d.

Proposition 4.8. Let n, d ∈ N, n ≥ 2, K > 0, and let σ : R → R be sigmoidal of order q ≥ 2.


Further let ℓ ∈ N and t ∈ Rd .
Then, there exists a constant C > 0 such that for every ε > 0 there is a neural network Φ^{S^d_{ℓ,t,n}} with activation function σ, ⌈log_2(d)⌉ + ⌈log_q(n − 1)⌉ layers, and size C, such that

∥ S^d_{ℓ,t,n} − Φ^{S^d_{ℓ,t,n}} ∥_{L^∞([−K,K]^d)} ≤ ε.

Proof. By definition S^d_{ℓ,t,n}(x) = ∏_{i=1}^d S_{ℓ,t_i,n}(x_i) where

Sℓ,ti ,n (xi ) = Sn (2ℓ (xi − ti )).

By Proposition 4.7 there exists a constant C′ > 0 such that for each i = 1, . . . , d and all ε > 0, there is a neural network Φ^{S_{ℓ,t_i,n}} with size C′ and ⌈log_q(n − 1)⌉ layers such that

∥ S_{ℓ,t_i,n} − Φ^{S_{ℓ,t_i,n}} ∥_{L^∞([−K,K])} ≤ ε.

If d = 1, this shows the statement. For general d, it remains to show that the product of the ΦSℓ,ti ,n
for i = 1, . . . , d can be approximated.
We first prove the following claim by induction: For every d ∈ N, d ≥ 2, there exists a constant
C ′′ > 0, such that for all K ′ ≥ 1 and all ε > 0 there exists a neural network Φmult,ε,d with size

C ′′ , ⌈log2 (d)⌉ layers, and activation function σ such that for all x1 , . . . , xd with |xi | ≤ K ′ for all
i = 1, . . . , d,
| Φ_{mult,ε,d}(x_1, . . . , x_d) − ∏_{i=1}^d x_i | < ε.      (4.2.4)

For the base case, let d = 2. Similar to the proof of Proposition 4.7, one can show that there exists
C ′′′ > 0 such that for every ε > 0 and K ′ > 0 there exists a neural network Φsquare,ε with one
hidden layer and size C ′′′ such that

|Φ_{square,ε}(x) − σReLU(x)²| ≤ ε   for all |x| ≤ K′.

For every x = (x1 , x2 ) ∈ R2


x_1 x_2 = (1/2) ( (x_1 + x_2)² − x_1² − x_2² )
        = (1/2) ( σReLU(x_1 + x_2)² + σReLU(−x_1 − x_2)² − σReLU(x_1)² − σReLU(−x_1)² − σReLU(x_2)² − σReLU(−x_2)² ).      (4.2.5)

Each term on the right-hand side can be approximated up to uniform error ε/6 with a network of
size C ′′′ and one hidden layer. By Proposition 2.3, we conclude that there exists a neural network
Φmult,ε,2 satisfying (4.2.4) for d = 2.
Assume the induction hypothesis (4.2.4) holds for d − 1 ≥ 1, and let ε > 0 and K ′ ≥ 1. We
have
∏_{i=1}^d x_i = ∏_{i=1}^{⌊d/2⌋} x_i · ∏_{i=⌊d/2⌋+1}^d x_i.      (4.2.6)

We will now approximate each of the terms in the product on the right-hand side of (4.2.6) by a
neural network using the induction assumption.
For simplicity assume in the following that ⌈log2 (⌊d/2⌋)⌉ = ⌈log2 (d − ⌊d/2⌋)⌉. The general
case can be addressed via Proposition 3.16. By the induction assumption there then exist neural
networks Φmult,1 and Φmult,2 both with ⌈log2 (⌊d/2⌋)⌉ layers, such that for all xi with |xi | ≤ K ′ for
i = 1, . . . , d
| Φ_{mult,1}(x_1, . . . , x_{⌊d/2⌋}) − ∏_{i=1}^{⌊d/2⌋} x_i | < ε / (4((K′)^{⌊d/2⌋} + ε)),
| Φ_{mult,2}(x_{⌊d/2⌋+1}, . . . , x_d) − ∏_{i=⌊d/2⌋+1}^d x_i | < ε / (4((K′)^{⌊d/2⌋} + ε)).

By Proposition 2.3, Φmult,ε,d := Φmult,ε/2,2 ◦(Φmult,1 , Φmult,2 ) is a neural network with 1+⌈log2 (⌊d/2⌋)⌉ =
⌈log2 (d)⌉ layers. By construction, the size of Φmult,ε,d does not depend on K ′ or ε. Thus, to complete
the induction, it only remains to show (4.2.4).
For all a, b, c, d ∈ R it holds that

|ab − cd| ≤ |a||b − d| + |d||a − c|.

Hence, for x1 , . . . , xd with |xi | ≤ K ′ for all i = 1, . . . , d, we have that
| ∏_{i=1}^d x_i − Φ_{mult,ε,d}(x_1, . . . , x_d) |
   ≤ ε/2 + | ∏_{i=1}^{⌊d/2⌋} x_i · ∏_{i=⌊d/2⌋+1}^d x_i − Φ_{mult,1}(x_1, . . . , x_{⌊d/2⌋}) Φ_{mult,2}(x_{⌊d/2⌋+1}, . . . , x_d) |
   ≤ ε/2 + |K′|^{⌊d/2⌋} · ε/(4((K′)^{⌊d/2⌋} + ε)) + (|K′|^{⌈d/2⌉} + ε) · ε/(4((K′)^{⌊d/2⌋} + ε)) < ε.
This completes the proof of (4.2.4).
The overall result follows by using Proposition 2.3 to show that the multiplication network can
be composed with a neural network comprised of the ΦSℓ,ti ,n for i = 1, . . . , d. Since in no step above
the size of the individual networks was dependent on the approximation accuracy, this is also true
for the final network.
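The core of this construction is the product network. The following sketch (ours) implements the polarization identity (4.2.5) with exact squares of ReLUs in place of the approximating networks Φ_{square,ε}, and the pairwise splitting (4.2.6).

```python
# Sketch: products from squared ReLUs (4.2.5) and a binary product tree (4.2.6).
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mult2(x1, x2):
    # Polarization identity (4.2.5) built from squared ReLUs.
    return 0.5 * (relu(x1 + x2) ** 2 + relu(-x1 - x2) ** 2
                  - relu(x1) ** 2 - relu(-x1) ** 2
                  - relu(x2) ** 2 - relu(-x2) ** 2)

def mult(xs):
    # Binary-tree multiplication: split the factors as in (4.2.6).
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return mult2(mult(xs[:mid]), mult(xs[mid:]))

print(mult2(3.0, -2.0))             # -6.0
print(mult([1.5, -2.0, 0.5, 4.0]))  # -6.0
# The recursion depth is of order log2(d), matching the layer count in the proof.
```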

Proposition 4.8 shows that we can approximate a single multivariate B-spline with a neural
network with a size that is independent of the accuracy. Combining this observation with Theorem
4.3 leads to the following result.

Theorem 4.9. Let d, n, k ∈ N such that 0 < k ≤ n and n ≥ 2. Let q ≥ 2, and let σ be sigmoidal
of order q.
Then there exists C such that for every f ∈ C k ([0, 1]d ) and every N ∈ N there exists a neural
network ΦN with activation function σ, ⌈log2 (d)⌉ + ⌈logq (k − 1)⌉ layers, and size bounded by CN ,
such that
∥ f − Φ_N ∥_{L^∞([0,1]^d)} ≤ C N^{−k/d} ∥f∥_{C^k([0,1]^d)}.

Proof. Fix N ∈ N. By Theorem 4.3, there exist coefficients |ci | ≤ C∥f ∥L∞ ([0,1]d ) and Bi ∈ B n for
i = 1, . . . , N , such that
∥ f − Σ_{i=1}^N c_i B_i ∥_{L^∞([0,1]^d)} ≤ C N^{−k/d} ∥f∥_{C^k([0,1]^d)}.

Moreover, by Proposition 4.8, for each i = 1, . . . , N exists a neural network ΦBi with ⌈log2 (d)⌉ +
⌈logq (k − 1)⌉ layers, and a fixed size, which approximates Bi on [−1, 1]d ⊇ [0, 1]d up to error of
ε := N −k/d /N . The size of ΦBi is independent of i and N .
By Proposition 2.3, there exists a neural network Φ_N that uniformly approximates Σ_{i=1}^N c_i B_i up to error ε on [0, 1]^d, and has ⌈log_2(d)⌉ + ⌈log_q(k − 1)⌉ layers. The size of this network is linear
in N (see Exercise 4.11). This concludes the proof.

Theorem 4.9 shows that neural networks with higher-order sigmoidal functions can approximate
smooth functions with the same accuracy as spline approximations while having a comparable
number of parameters. The network depth is required to behave like O(log(k)) in terms of the
smoothness parameter k, cp. Remark 4.4.

Bibliography and further reading
The argument of linking sigmoidal activation functions with spline based approximation was first
introduced in [176, 174]. For further details on spline approximation, see [202] or the book [245].
The general strategy of approximating basis functions by neural networks, and then lifting ap-
proximation results for those bases has been employed widely in the literature, and will also reappear
again in this book. While the following chapters primarily focus on ReLU activation, we highlight
a few notable approaches with non-ReLU activations based on the outlined strategy: To approx-
imate analytic functions, [175] emulates a monomial basis. To approximate periodic functions, a
basis of trigonometric polynomials is recreated in [177]. Wavelet bases have been emulated in [205].
Moreover, neural networks have been studied through the representation system of ridgelets [43]
and ridge functions [128]. A general framework describing the emulation of representation systems
to transfer approximation results was presented in [30].

Exercises
Exercise 4.10. Show that (4.2.1) holds.

Exercise 4.11. Let L ∈ N, σ : R → R, and let Φ1 , Φ2 be two neural networks with architecture
(σ; d_0, d_1^{(1)}, . . . , d_L^{(1)}, d_{L+1}) and (σ; d_0, d_1^{(2)}, . . . , d_L^{(2)}, d_{L+1}). Show that Φ1 + Φ2 is a neural network
with size(Φ1 + Φ2 ) ≤ size(Φ1 ) + size(Φ2 ).
Exercise 4.12. Show that, for σ = σReLU² and k ≤ 2, for all f ∈ C^k([0, 1]^d) all weights of the approx-
imating neural network of Theorem 4.9 can be bounded in absolute value by O(max{2, ∥f ∥C k ([0,1]d ) }).

Chapter 5

ReLU neural networks

In this chapter, we discuss feedforward neural networks using the ReLU activation function σReLU
introduced in Section 2.3. We refer to these functions as ReLU neural networks. Due to its simplicity
and the fact that it reduces the vanishing and exploding gradients phenomena, the ReLU is one of
the most widely used activation functions in practice.
A key component of the proofs in the previous chapters was the approximation of derivatives of
the activation function to emulate polynomials. Since the ReLU is piecewise linear, this trick is not
applicable. This makes the analysis fundamentally different from the case of smoother activation
functions. Nonetheless, we will see that even this extremely simple activation function yields a very
rich class of functions possessing remarkable approximation capabilities.
To formalize these results, we begin this chapter by adopting a framework from [208], which
enables the tracking of the number of network parameters for basic manipulations such as adding
up or composing two neural networks. This will allow to bound the network complexity, when
constructing more elaborate networks from simpler ones. With these preliminaries at hand, the
rest of the chapter is dedicated to the exploration of links between ReLU neural networks and the
class of “continuous piecewise linear functions.” In Section 5.2, we will see that every such function
can be exactly represented by a ReLU neural network. Afterwards, in Section 5.3 we will give a
more detailed analysis of the required network complexity. Finally, we will use these results to
prove a first approximation theorem for ReLU neural networks in Section 5.4. The argument is
similar in spirit to Chapter 4, in that we transfer established approximation theory for piecewise
linear functions to the class of ReLU neural networks of a certain architecture.

5.1 Basic ReLU calculus


The goal of this section is to formalize how to combine and manipulate ReLU neural networks.
We have seen an instance of such a result already in Proposition 2.3. Now we want to make this
result more precise under the assumption that the activation function is the ReLU. We sharpen
Proposition 2.3 by adding bounds on the number of weights that the resulting neural networks
have. The following four operations form the basis of all constructions in the sequel.

• Reproducing an identity: We have seen in Proposition 3.16 that for most activation functions,
an approximation to the identity can be built by neural networks. For ReLUs, we can have
an even stronger result and reproduce the identity exactly. This identity will play a crucial

role in order to extend certain neural networks to deeper neural networks, and to facilitate
an efficient composition operation.

• Composition: We saw in Proposition 2.3 that we can produce a composition of two neural
networks and the resulting function is a neural network as well. There we did not study the
size of the resulting neural networks. For ReLU activation functions, this composition can be
done in a very efficient way leading to a neural network that has up to a constant not more
than the number of weights of the two initial neural networks.

• Parallelization: Also the parallelization of two neural networks was discussed in Proposition
2.3. We will refine this notion and make precise the size of the resulting neural networks.

• Linear combinations: Similarly, for the sum of two neural networks, we will give precise
bounds on the size of the resulting neural network.

5.1.1 Identity
We start with expressing the identity on Rd as a neural network of depth L ∈ N.

Lemma 5.1 (Identity). Let L ∈ N. Then, there exists a ReLU neural network Φ^id_L such that Φ^id_L(x) = x for all x ∈ R^d. Moreover, depth(Φ^id_L) = L, width(Φ^id_L) = 2d, and size(Φ^id_L) = 2d · (L + 1).

Proof. Writing I d ∈ Rd×d for the identity matrix, we choose the weights

(W^{(0)}, b^{(0)}), . . . , (W^{(L)}, b^{(L)}) := ( [I_d; −I_d], 0 ),  (I_{2d}, 0), . . . , (I_{2d}, 0),  ( (I_d, −I_d), 0 ),

where the tuple (I_{2d}, 0) is repeated L − 1 times, [I_d; −I_d] ∈ R^{2d×d} stacks I_d on top of −I_d, and (I_d, −I_d) ∈ R^{d×2d} places them side by side.

Using that x = σReLU (x) − σReLU (−x) for all x ∈ R and σReLU (x) = x for all x ≥ 0 it is obvious
that the neural network Φ^id_L associated to the weights above satisfies the assertion of the lemma.
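The weight choice can be checked directly; the following Python sketch (ours, with placeholder names) builds the identity network for arbitrary d and L.

```python
# Sketch of the identity ReLU network from the proof of Lemma 5.1.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def identity_network(x, L):
    d = x.shape[0]
    I = np.eye(d)
    W0 = np.vstack([I, -I])              # first layer: (relu(x), relu(-x))
    z = relu(W0 @ x)
    for _ in range(L - 1):               # L-1 middle layers: identity on nonnegative values
        z = relu(np.eye(2 * d) @ z)
    WL = np.hstack([I, -I])              # last affine map: relu(x) - relu(-x) = x
    return WL @ z

x = np.array([1.7, -3.2])
print(identity_network(x, L=3))          # recovers x exactly
```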

We will see in Exercise 5.24 that the property to exactly represent the identity is not shared
by sigmoidal activation functions. It does hold for polynomial activation functions though; also see
Proposition 3.16.

5.1.2 Composition
Assume we have two neural networks Φ1, Φ2 with architectures (σReLU; d^1_0, . . . , d^1_{L1+1}) and (σReLU; d^2_0, . . . , d^2_{L2+1}), respectively. Moreover, we assume that they have weights and biases given by

(W^{(0)}_1, b^{(0)}_1), . . . , (W^{(L1)}_1, b^{(L1)}_1),   and   (W^{(0)}_2, b^{(0)}_2), . . . , (W^{(L2)}_2, b^{(L2)}_2),

respectively. If the output dimension d1L1 +1 of Φ1 equals the input dimension d20 of Φ2 , we can
define two types of concatenations: First Φ2 ◦ Φ1 is the neural network with weights and biases

given by

(W^{(0)}_1, b^{(0)}_1), . . . , (W^{(L1−1)}_1, b^{(L1−1)}_1),  ( W^{(0)}_2 W^{(L1)}_1, W^{(0)}_2 b^{(L1)}_1 + b^{(0)}_2 ),  (W^{(1)}_2, b^{(1)}_2), . . . , (W^{(L2)}_2, b^{(L2)}_2).

Second, Φ2 • Φ1 is the neural network defined as Φ2 ◦ Φ^id_1 ◦ Φ1. In terms of weights and biases, Φ2 • Φ1 is given as

(W^{(0)}_1, b^{(0)}_1), . . . , (W^{(L1−1)}_1, b^{(L1−1)}_1),  ( [W^{(L1)}_1; −W^{(L1)}_1], [b^{(L1)}_1; −b^{(L1)}_1] ),  ( (W^{(0)}_2, −W^{(0)}_2), b^{(0)}_2 ),  (W^{(1)}_2, b^{(1)}_2), . . . , (W^{(L2)}_2, b^{(L2)}_2),

where [A; B] denotes vertical stacking and (A, B) horizontal concatenation of the blocks.

The following lemma collects the properties of the constructions above.

Lemma 5.2 (Composition). Let Φ1 , Φ2 be neural networks with architectures (σReLU ; d10 , . . . , d1L1 +1 )
and (σReLU ; d20 , . . . , d2L2 +1 ). Assume d1L1 +1 = d20 . Then Φ2 ◦ Φ1 (x) = Φ2 • Φ1 (x) = Φ2 (Φ1 (x)) for
all x ∈ R^{d^1_0}. Moreover,

width(Φ2 ◦ Φ1 ) ≤ max{width(Φ1 ), width(Φ2 )},


depth(Φ2 ◦ Φ1 ) = depth(Φ1 ) + depth(Φ2 ),
size(Φ2 ◦ Φ1) ≤ size(Φ1) + size(Φ2) + (d^1_{L1} + 1) d^2_1,

and

width(Φ2 • Φ1 ) ≤ 2 max{width(Φ1 ), width(Φ2 )},


depth(Φ2 • Φ1 ) = depth(Φ1 ) + depth(Φ2 ) + 1,
size(Φ2 • Φ1 ) ≤ 2(size(Φ1 ) + size(Φ2 )).

1
Proof. The fact that Φ2 ◦ Φ1 (x) = Φ2 • Φ1 (x) = Φ2 (Φ1 (x)) for all x ∈ Rd0 follows immediately
from the construction. The same can be said for the width and depth bounds. To confirm the size
bound, we note that W^{(0)}_2 W^{(L1)}_1 ∈ R^{d^2_1 × d^1_{L1}} and hence W^{(0)}_2 W^{(L1)}_1 has not more than d^2_1 · d^1_{L1} (nonzero) entries. Moreover, W^{(0)}_2 b^{(L1)}_1 + b^{(0)}_2 ∈ R^{d^2_1}. Thus, the L1-th layer of Φ2 ◦ Φ1 has at most d^2_1 · (1 + d^1_{L1}) entries. The rest is obvious from the construction.

Interpreting linear transformations as neural networks of depth 0, the previous lemma is also
valid in case Φ1 or Φ2 is a linear mapping.
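The composition Φ2 ◦ Φ1 operates purely on the weight lists: the last affine map of Φ1 is merged with the first affine map of Φ2, so no extra activation is introduced. A minimal Python sketch (ours, with placeholder names):

```python
# Sketch of the efficient composition of two ReLU networks via merged affine maps.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(params, x):
    *hidden, (W_last, b_last) = params
    for W, b in hidden:
        x = relu(W @ x + b)
    return W_last @ x + b_last

def compose(params2, params1):
    """Weight list of Phi2 ∘ Phi1 from the weight lists of Phi1 and Phi2."""
    W1_last, b1_last = params1[-1]
    W2_first, b2_first = params2[0]
    merged = (W2_first @ W1_last, W2_first @ b1_last + b2_first)
    return params1[:-1] + [merged] + params2[1:]

rng = np.random.default_rng(0)
p1 = [(rng.standard_normal((3, 2)), rng.standard_normal(3)),
      (rng.standard_normal((2, 3)), rng.standard_normal(2))]   # Phi1: R^2 -> R^2
p2 = [(rng.standard_normal((4, 2)), rng.standard_normal(4)),
      (rng.standard_normal((1, 4)), rng.standard_normal(1))]   # Phi2: R^2 -> R^1
x = rng.standard_normal(2)
print(forward(p2, forward(p1, x)), forward(compose(p2, p1), x))  # identical values
```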

5.1.3 Parallelization
Let (Φ_i)_{i=1}^m be neural networks with architectures (σReLU; d^i_0, . . . , d^i_{Li+1}), respectively. We proceed to build a neural network (Φ1, . . . , Φm) realizing the function

(Φ1, . . . , Φm) : R^{Σ_{j=1}^m d^j_0} → R^{Σ_{j=1}^m d^j_{Lj+1}},   (x_1, . . . , x_m) 7→ (Φ1(x_1), . . . , Φm(x_m)).      (5.1.1)

To do so we first assume L1 = · · · = Lm = L, and define (Φ1 , . . . , Φm ) via the following sequence


of weight-bias tuples:

( diag(W^{(0)}_1, . . . , W^{(0)}_m), [b^{(0)}_1; . . . ; b^{(0)}_m] ), . . . , ( diag(W^{(L)}_1, . . . , W^{(L)}_m), [b^{(L)}_1; . . . ; b^{(L)}_m] ),      (5.1.2)

where these matrices are understood as block-diagonal filled up with zeros. For the general case
where the Φj might have different depths, let Lmax := max1≤i≤m Li and I := {1 ≤ i ≤ m | Li <
Lmax}. For j ∈ I^c set Φ̃_j := Φ_j, and for each j ∈ I

Φ̃_j := Φ^id_{Lmax−Lj} ◦ Φ_j.      (5.1.3)

Finally,

(Φ1, . . . , Φm) := (Φ̃_1, . . . , Φ̃_m).      (5.1.4)

We collect the properties of the parallelization in the lemma below.

Lemma 5.3 (Parallelization). Let m ∈ N and (Φi )m i=1 be neural networks with architectures
(σReLU ; di0 , . . . , diLi +1 ), respectively. Then the neural network (Φ1 , . . . , Φm ) satisfies

(Φ1, . . . , Φm)(x) = (Φ1(x_1), . . . , Φm(x_m))   for all x = (x_1, . . . , x_m) ∈ R^{Σ_{j=1}^m d^j_0}.

Moreover, with Lmax := maxj≤m Lj it holds that


width((Φ1, . . . , Φm)) ≤ 2 Σ_{j=1}^m width(Φj),      (5.1.5a)

depth((Φ1, . . . , Φm)) = max_{j≤m} depth(Φj),      (5.1.5b)

size((Φ1, . . . , Φm)) ≤ 2 Σ_{j=1}^m size(Φj) + 2 Σ_{j=1}^m (Lmax − Lj) d^j_{Lj+1}.      (5.1.5c)

Proof. All statements except for the bound on the size follow immediately from the construction.
To obtain the bound on the size, we note that by construction the sizes of the (Φ̃_i)_{i=1}^m in (5.1.3) will simply be added. The size of each Φ̃_i can be bounded with Lemma 5.2.

If all input dimensions d^1_0 = · · · = d^m_0 =: d_0 are the same, we will also use parallelization with shared inputs to realize the function x 7→ (Φ1(x), . . . , Φm(x)) from R^{d_0} to R^{d^1_{L1+1}+···+d^m_{Lm+1}}. In terms of the construction (5.1.2), the only required change is that the block-diagonal matrix diag(W^{(0)}_1, . . . , W^{(0)}_m) becomes the matrix in R^{(Σ_{j=1}^m d^j_1) × d_0} which stacks W^{(0)}_1, . . . , W^{(0)}_m on top of each other. Similarly, we will allow Φ_j to only take some of the entries of x as input. For parallelization with shared inputs we will use the same notation (Φ_j)_{j=1}^m as before, where the precise meaning will always be clear from context. Note that Lemma 5.3 remains valid in this case.
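A minimal Python sketch (ours) of the parallelization (5.1.2) for two networks of equal depth, using block-diagonal weights and stacked biases:

```python
# Sketch: parallelizing two ReLU networks via block-diagonal weight matrices.
import numpy as np
from scipy.linalg import block_diag

def relu(x):
    return np.maximum(x, 0.0)

def forward(params, x):
    *hidden, (W_last, b_last) = params
    for W, b in hidden:
        x = relu(W @ x + b)
    return W_last @ x + b_last

def parallelize(params1, params2):
    return [(block_diag(W1, W2), np.concatenate([b1, b2]))
            for (W1, b1), (W2, b2) in zip(params1, params2)]

rng = np.random.default_rng(1)
p1 = [(rng.standard_normal((3, 2)), rng.standard_normal(3)),
      (rng.standard_normal((1, 3)), rng.standard_normal(1))]
p2 = [(rng.standard_normal((4, 2)), rng.standard_normal(4)),
      (rng.standard_normal((2, 4)), rng.standard_normal(2))]
x1, x2 = rng.standard_normal(2), rng.standard_normal(2)
out = forward(parallelize(p1, p2), np.concatenate([x1, x2]))
print(out, forward(p1, x1), forward(p2, x2))   # out stacks the two results
```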

5.1.4 Linear combinations


Let m ∈ N and let (Φ_i)_{i=1}^m be ReLU neural networks with architectures (σReLU; d^i_0, . . . , d^i_{Li+1}), respectively. Assume that d^1_{L1+1} = · · · = d^m_{Lm+1}, i.e., all Φ1, . . . , Φm have the same output dimension. For scalars α_j ∈ R, we wish to construct a ReLU neural network Σ_{j=1}^m α_j Φ_j realizing the function

R^{Σ_{j=1}^m d^j_0} → R^{d^1_{L1+1}},   (x_1, . . . , x_m) 7→ Σ_{j=1}^m α_j Φ_j(x_j).

This corresponds to the parallelization (Φ1, . . . , Φm) composed with the linear transformation (z_1, . . . , z_m) 7→ Σ_{j=1}^m α_j z_j. The following result holds.

Lemma 5.4 (Linear combinations). Let m ∈ N and (Φi )m i=1 be neural networks with architec-
tures (σReLU; d^i_0, . . . , d^i_{Li+1}), respectively. Assume that d^1_{L1+1} = · · · = d^m_{Lm+1}, let α ∈ R^m and set Lmax := max_{j≤m} Lj. Then, there exists a neural network Σ_{j=1}^m α_j Φ_j such that (Σ_{j=1}^m α_j Φ_j)(x) = Σ_{j=1}^m α_j Φ_j(x_j) for all x = (x_j)_{j=1}^m ∈ R^{Σ_{j=1}^m d^j_0}. Moreover,

width( Σ_{j=1}^m α_j Φ_j ) ≤ 2 Σ_{j=1}^m width(Φ_j),      (5.1.6a)

depth( Σ_{j=1}^m α_j Φ_j ) = max_{j≤m} depth(Φ_j),      (5.1.6b)

size( Σ_{j=1}^m α_j Φ_j ) ≤ 2 Σ_{j=1}^m size(Φ_j) + 2 Σ_{j=1}^m (Lmax − Lj) d^j_{Lj+1}.      (5.1.6c)

Proof. The construction of Σ_{j=1}^m α_j Φ_j is analogous to that of (Φ1, . . . , Φm), i.e., we first define the linear combination of neural networks with the same depth. Then the weights are chosen as in (5.1.2), but with the last linear transformation replaced by

( (α_1 W^{(L)}_1 · · · α_m W^{(L)}_m), Σ_{j=1}^m α_j b^{(L)}_j ).

For general depths, we define the sum of the neural networks to be the sum of the extended
neural networks Φ̃_i as in (5.1.3). All statements of the lemma follow immediately from this con-
struction.
In case d^1_0 = · · · = d^m_0 =: d_0 (all neural networks have the same input dimension), we will also consider linear combinations with shared inputs, i.e., a neural network realizing

x 7→ Σ_{j=1}^m α_j Φ_j(x)   for x ∈ R^{d_0}.

This requires the same minor adjustment as discussed at the end of Section 5.1.3. Lemma 5.4
remains valid in this case and again we do not distinguish in notation for linear combinations with
or without shared inputs.

5.2 Continuous piecewise linear functions


In this section, we will relate ReLU neural networks to a large class of functions. We first formally
introduce the set of continuous piecewise linear functions from a set Ω ⊆ Rd to R. Note that we
admit in particular Ω = Rd in the following definition.

Definition 5.5. Let Ω ⊆ Rd , d ∈ N. We call a function f : Ω → R continuous, piecewise linear


(cpwl) if f ∈ C 0 (Ω) and there exist n ∈ N affine functions gj : Rd → R, gj (x) = w⊤ j x + bj such
that for each x ∈ Ω it holds that f (x) = gj (x) for at least one j ∈ {1, . . . , n}. For m > 1 we call
f : Ω → Rm cpwl if and only if each component of f is cpwl.

Remark 5.6. A “continuous piecewise linear function” as in Definition 5.5 is actually piecewise
affine. To maintain consistency with the literature, we use the terminology cpwl.
In the following, we will refer to the connected domains on which f is equal to one of the
functions gj , also as regions or pieces. If f is cpwl with q ∈ N regions, then with n ∈ N denoting
the number of affine functions it holds n ≤ q.
Note that, the mapping x 7→ σReLU (w⊤ x + b), which is a ReLU neural network with a single
neuron, is cpwl (with two regions). Consequently, every ReLU neural network is a repeated compo-
sition of linear combinations of cpwl functions. It is not hard to see that the set of cpwl functions
is closed under compositions and linear combinations. Hence, every ReLU neural network is a cpwl
function. Interestingly, the reverse direction of this statement is also true, meaning that every cpwl
function can be represented by a ReLU neural network as we shall demonstrate below. Therefore,
we can identify the class of functions realized by arbitrary ReLU neural networks as the class of
cpwl functions.

Theorem 5.7. Let d ∈ N, let Ω ⊆ Rd be convex, and let f : Ω → R be cpwl with n ∈ N as in


Definition 5.5. Then, there exists a ReLU neural network Φf such that Φf (x) = f (x) for all x ∈ Ω
and
size(Φf ) = O(dn2n ), width(Φf ) = O(dn2n ), depth(Φf ) = O(n).

53
A statement similar to Theorem 5.7 can be found in [7, 110]. There, the authors give a con-
struction with a depth that behaves logarithmic in d and is independent of n, but with significantly
larger bounds on the size. As we shall see, the proof of Theorem 5.7 is a simple consequence of the
following well-known result from [266]; also see [203], and for sharper bounds [282]. It states that
every cpwl function can be expressed as a finite maximum of a finite minimum of certain affine
functions.

Proposition 5.8. Let d ∈ N, Ω ⊆ Rd be convex, and let f : Ω → R be cpwl with n ∈ N affine


functions as in Definition 5.5. Then there exists m ∈ N and sets sj ⊆ {1, . . . , n} for j ∈ {1, . . . , m},
such that

f (x) = max min(gi (x)) for all x ∈ Ω. (5.2.1)


1≤j≤m i∈sj

Proof. Step 1. We start with d = 1, i.e., Ω ⊆ R is a (possibly unbounded) interval and for each
x ∈ Ω there exists j ∈ {1, . . . , n} such that with gj (x) := wj x + bj it holds that f (x) = gj (x).
Without loss of generality, we can assume that gi ̸= gj for all i ̸= j. Since the graphs of the gj are
lines, they intersect at (at most) finitely many points in Ω.
Since f is continuous, we conclude that there exist finitely many intervals covering Ω, such that
f coincides with one of the gj on each interval. For each x ∈ Ω let

sx := {1 ≤ j ≤ n | gj (x) ≥ f (x)}

and

fx (y) := min gj (y) for all y ∈ Ω.


j∈sx

Clearly, fx (x) = f (x). We claim that, additionally,

fx (y) ≤ f (y) for all y ∈ Ω. (5.2.2)

This then shows that

f (y) = max fx (y) = max min gj (y) for all y ∈ R.


x∈Ω x∈Ω j∈sx

Since there exist only finitely many possibilities to choose a subset of {1, . . . , n}, we conclude that
(5.2.1) holds for d = 1.
It remains to verify the claim (5.2.2). Fix y ̸= x ∈ Ω. Without loss of generality, let x < y
and let x = x0 < · · · < xk = y be such that f |[xi−1 ,xi ] equals some gj for each i ∈ {1, . . . , k}. In
order to show (5.2.2), it suffices to prove that there exists at least one j such that gj (x0 ) ≥ f (x0 )
and gj (xk ) ≤ f (xk ). The claim is trivial for k = 1. We proceed by induction. Suppose the
claim holds for k − 1, and consider the partition x0 < · · · < xk . Let r ∈ {1, . . . , n} be such
that f |[x0 ,x1 ] = gr |[x0 ,x1 ] . Applying the induction hypothesis to the interval [x1 , xk ], we can find
j ∈ {1, . . . , n} such that gj (x1 ) ≥ f (x1 ) and gj (xk ) ≤ f (xk ). If gj (x0 ) ≥ f (x0 ), then gj is the desired
function. Otherwise, gj (x0 ) < f (x0 ). Then gr (x0 ) = f (x0 ) > gj (x0 ) and gr (x1 ) = f (x1 ) ≤ gj (x1 ).

54
Therefore gr (x) ≤ gj (x) for all x ≥ x1 , and in particular gr (xk ) ≤ gj (xk ). Thus gr is the desired
function.
Step 2. For general d ∈ N, let gj (x) := w⊤ j x + bj for j = 1, . . . , n. For each x ∈ Ω, let

sx := {1 ≤ j ≤ n | gj (x) ≥ f (x)}

and for all y ∈ Ω, let

fx (y) := min gj (y).


j∈sx

For an arbitrary 1-dimensional affine subspace S ⊆ Rd passing through x consider the line
(segment) I := S ∩ Ω, which is connected since Ω is convex. By Step 1, it holds

f (y) = max fx (y) = max min gj (y)


x∈Ω x∈Ω j∈sx

on all of I. Since I was arbitrary the formula is valid for all y ∈ Ω. This again implies (5.2.1) as
in Step 1.

Remark 5.9. For any a1 , . . . , ak ∈ R holds min{−a1 , . . . , −ak } = − max{a1 , . . . , ak }. Thus, in the
setting of Proposition 5.8, there exists m̃ ∈ N and sets s̃j ⊆ {1, . . . , n} for j = 1, . . . , m̃, such that
for all x ∈ Ω

f (x) = −(−f (x)) = − max min(−gi (x))


1≤j≤m̃ i∈s̃j

= − max (− max(gi (x)))


1≤j≤m̃ i∈s̃j

= min (max(gi (x))).


1≤j≤m̃ i∈s̃j

To prove Theorem 5.7, it therefore suffices to show that the minimum and the maximum are
expressible by ReLU neural networks.

Lemma 5.10. For every x, y ∈ R it holds that

min{x, y} = σReLU (y) − σReLU (−y) − σReLU (y − x) ∈ N21 (σReLU ; 1, 3)

and

max{x, y} = σReLU (y) − σReLU (−y) + σReLU (x − y) ∈ N21 (σReLU ; 1, 3).

Proof. We have
(
0 if y > x
max{x, y} = y +
x−y if x ≥ y
= y + σReLU (x − y).

Using y = σReLU (y) − σReLU (−y), the claim for the maximum follows. For the minimum observe
that min{x, y} = − max{−x, −y}.

55
x
min{x, y}
y

Figure 5.1: Sketch of the neural network in Lemma 5.10. Only edges with non-zero weights are
drawn.

The minimum of n ≥ 2 inputs can be computed by repeatedly applying the construction of


Lemma 5.10. The resulting neural network is described in the next lemma.

Lemma 5.11. For every n ≥ 2 there exists a neural network Φmin


n : Rn → R with

size(Φmin
n ) ≤ 16n, width(Φmin
n ) ≤ 3n, depth(Φmin
n ) ≤ ⌈log2 (n)⌉

such that Φmin max : Rn → R


n (x1 , . . . , xn ) = min1≤j≤n xj . Similarly, there exists a neural network Φn
realizing the maximum and satisfying the same complexity bounds.

Proof. Throughout denote by Φmin 2 : R2 → R the neural network from Lemma 5.10. It is of depth
1 and size 7 (since all biases are zero, it suffices to count the number of connections in Figure 5.1).
Step 1. Consider first the case where n = 2k for some k ∈ N. We proceed by induction of k.
For k = 1 the claim is proven. For k ≥ 2 set

Φmin
2k
:= Φmin
2 ◦ (Φmin min
2k−1 , Φ2k−1 ). (5.2.3)

By Lemma 5.2 and Lemma 5.3 we have

depth(Φmin min min


2k ) ≤ depth(Φ2 ) + depth(Φ2k−1 ) ≤ · · · ≤ k.

Next, we bound the size of the neural network. Note that all biases in this neural network are set to
0, since the Φmin
2 neural network in Lemma 5.10 has no biases. Thus, the size of the neural network
min
Φ2k corresponds to the number of connections in the graph (the number of nonzero weights).
Careful inspection of the neural network architecture, see Figure 5.2, reveals that
k−2
X
size(Φmin
2k ) =4·2 k−1
+ 12 · 2j + 3
j=0

= 2n + 12 · (2k−1 − 1) + 3 = 2n + 6n − 9 ≤ 8n,

and that width(Φmin


2k
) ≤ (3/2)2k . This concludes the proof for the case n = 2k .
Step 2. For the general case, we first let

Φmin
1 (x) := x for all x ∈ R

56
be the identity on R, i.e. a linear transformation and thus formally a depth 0 neural network. Then,
for all n ≥ 2 
(Φid ◦ Φmin min ) if n ∈ {2k + 1 | k ∈ N}
min min 1 ⌊n ⌋ , Φ⌈ n ⌉
Φn := Φ2 ◦ min
2
min
2 (5.2.4)
(Φ⌊ n ⌋ , Φ⌈ n ⌉ ) otherwise.
2 2

This definition extends (5.2.3) to arbitrary n ≥ 2, since the first case in (5.2.4) never occurs if n ≥ 2
is a power of two.
To analyze (5.2.4), we start with the depth and claim that

depth(Φmin
n )=k for all 2k−1 < n ≤ 2k

and all k ∈ N. We proceed by induction over k. The case k = 1 is clear. For the induction step,
assume the statement holds for some fixed k ∈ N and fix an integer n with 2k < n ≤ 2k+1 . Then
lnm
∈ (2k−1 , 2k ] ∩ N
2
and (
jnk {2k−1 } if n = 2k + 1

2 (2k−1 , 2k ] ∩ N otherwise.
Using the induction assumption, (5.2.4) and Lemmas 5.1 and 5.2, this shows

depth(Φmin min
n ) = depth(Φ2 ) + k = 1 + k,

and proves the claim.


For the size and width bounds, we only sketch the argument: Fix n ∈ N such that 2k−1 < n ≤ 2k .
Then Φmin
n is constructed from at most as many subnetworks as Φmin2k
, but with some Φmin
2 : R2 → R
blocks replaced by Φid id min
1 : R → R, see Figure 5.3. Since Φ1 has the same depth as Φ2 , but is smaller
min
in width and number of connections, the width and size of Φn is bounded by the width and size
of Φmin
2k
. Due to 2k ≤ 2n, the bounds from Step 1 give the bounds stated in the lemma.
Step 3. For the maximum, define

Φmax min
n (x1 , . . . , xn ) := −Φn (−x1 , . . . , −xn ).

Proof (of Theorem 5.7). By Proposition 5.8 the neural network



Φ := Φmax min m m
m • (Φ|sj | )j=1 • ((w i x + bi )i∈sj )j=1

realizes the function f .


Since the number of possibilities to choose subsets of {1, . . . , n} equals 2n we have m ≤ 2n .
Since each sj is a subset of {1, . . . , n}, the cardinality |sj | of sj is bounded by n. By Lemma 5.2,
Lemma 5.3, and Lemma 5.11

depth(Φ) ≤ 2 + depth(Φmax min


m ) + max depth(Φ|sj | )
1≤j≤n
n
≤ 1 + ⌈log2 (2 )⌉ + ⌈log2 (n)⌉ = O(n)

57
x1
x2

x3
x4
min{x1 , . . . , x8 }
x5
x6

x7
x8

nr of connections
between layers: 2k−1 · 4 2k−2 · 12 2k−3 · 12 3

Figure 5.2: Architecture of the Φmin


2k
neural network in Step 1 of the proof of Lemma 5.11 and the
number of connections in each layer for k = 3. Each grey box corresponds to 12 connections in the
graph.

x1 x2 x3 x4 x5 x1 x2 x3 x4 x5 x6 x1 x2 x3 x4 x5 x6 x7 x8

Φmin
2 Φid min
1 Φ2 Φid min
1 Φ2 Φid min
1 Φ2 Φmin
2 Φmin
2 Φmin
2 Φmin
2

Φid
1 Φmin
2 Φmin
2 Φmin
2 Φmin
2 Φmin
2

Φmin
2 Φmin
2 Φmin
2

min{x1 , . . . , x5 } min{x1 , . . . , x6 } min{x1 , . . . , x8 }

Figure 5.3: Construction of Φmin


n for general n in Step 2 of the proof of Lemma 5.11.

58
and
n m
X m
X o
max
width(Φ) ≤ 2 max width(Φm ), min
width(Φ|sj | ), width((w⊤
i x + bi )i∈sj ))
j=1 j=1
n
≤ 2 max{3m, 3mn, mdn} = O(dn2 )

and
 

size(Φ) ≤ 4 size(Φmax
m ) + size((Φ min m
)
|sj | j=1 ) + size((w i x + b ) )m
i i∈sj j=1 )
 
X m
≤ 4 16m + 2 (16|sj | + 2⌈log2 (n)⌉) + nm(d + 1) = O(dn2n ).
j=1

This concludes the proof.

5.3 Simplicial pieces


This section studies the case, where we do not have arbitrary cpwl functions, but where the regions
on which f is affine are simplices. Under this condition, we can construct neural networks that scale
merely linearly in the number of such regions, which is a serious improvement from the exponential
dependence of the size on the number of regions that was found in Theorem 5.7.

5.3.1 Triangulations of Ω
For the ensuing discussion, we will consider Ω ⊆ Rd to be partitioned into simplices. This parti-
tioning will be termed a triangulation of Ω. Other notions prevalent in the literature include a
tessellation of Ω, or a simplicial mesh on Ω. To give a precise definition, let us first recall some
terminology. For a set S ⊆ Rd we denote the convex hull of S by
 
X n Xn 
co(S) := αj xj n ∈ N, xj ∈ S, αj ≥ 0, αj = 1 . (5.3.1)
 
j=1 j=1

An n-simplex is the convex hull of n ∈ N points that are independent in a specific sense. This
is made precise in the following definition.

Definition 5.12. Let n ∈ N0 , d ∈ N and n ≤ d. We call x0 , . . . , xn ∈ Rd affinely independent


if and only if either n = 0 or n ≥ 1 and the vectors x1 − x0 , . . . , xn − x0 are linearly independent.
In this case, we call co(x0 , . . . , xn ) := co({x0 , . . . , xn }) an n-simplex.

As mentioned before, a triangulation refers to a partition of a space into simplices. We give a


formal definition below.

59
η2 η2 η2

η3 η1 η3 η1 η3 η1
η5 η5

η4 η4 η4

τ1 = co(η 1 , η 2 , η 5 ) τ1 = co(η 2 , η 3 , η 4 ) τ1 = co(η 2 , η 3 , η 4 )


τ2 = co(η 2 , η 3 , η 5 ) τ2 = co(η 2 , η 5 , η 1 ) τ2 = co(η 1 , η 2 , η 3 )
τ3 = co(η 3 , η 4 , η 5 )

Figure 5.4: The first is a regular triangulation, while the second and the third are not.

Definition 5.13. Let d ∈ N, and Ω ⊆ Rd be compact. Let T be a finite set of d-simplices, and
for each τ ∈ T let V (τ ) ⊆ Ω have cardinality d + 1 such that τ = co(V (τ )). We call T a regular
triangulation of Ω, if and only if
S
(i) τ ∈T τ = Ω,

(ii) for all τ , τ ′ ∈ T it holds that τ ∩ τ ′ = co(V (τ ) ∩ V (τ ′ )).


S
We call η ∈ V := τ ∈T V (τ ) a node (or vertex) and τ ∈ T an element of the triangulation.

For a regular triangulation T with nodes V we also introduce the constant


kT := max |{τ ∈ T | η ∈ τ }| (5.3.2)
η∈V

corresponding to the maximal number of elements shared by a single node.

5.3.2 Size bounds for regular triangulations


Throughout this subsection, let T be a regular triangulation of Ω, and we adhere to the notation
of Definition 5.13. We will say that f : Ω → R is cpwl with respect to T if f is cpwl and f |τ is
affine for each τ ∈ T . The rest of this subsection is dedicated to proving the following result. It
was first shown in [166] with a more technical argument, and extends an earlier statement from
[110] to general triangulations (also see Section 5.3.3).

Theorem 5.14. Let d ∈ N, Ω ⊆ Rd be a bounded domain, and let T be a regular triangulation


of Ω. Let f : Ω → R be cpwl with respect to T and f |∂Ω = 0. Then there exists a ReLU neural
network Φ : Ω → R realizing f , and it holds
size(Φ) = O(|T |), width(Φ) = O(|T |), depth(Φ) = O(1), (5.3.3)
where the constants in the Landau notation depend on d and kT in (5.3.2).

60
We will split the proof into several lemmata. The strategy is to introduce a basis of the space
of cpwl functions on T the elements of which vanish on the boundary of Ω. We will then show
that there exist O(|T |) basis functions, each of which can be represented with a neural network the
size of which depends only on kT and d. To construct this basis, we first point out that an affine
function on a simplex is uniquely defined by its values at the nodes.

Lemma 5.15. Let d ∈ N. Let τ := co(η 0 , . . . , η d ) be a d-simplex. For every y0 , . . . , yd ∈ R, there


exists a unique g ∈ P1 (Rd ) such that g(η i ) = yi , i = 0, . . . , d.

Proof. Since η 1 −η 0 , . . . , η d −η 0 is a basis of Rd , there is a unique w ∈ Rd such that w⊤ (η i −η 0 ) =


yi − y0 for i = 1,P. . . , d. Then P g(x) := w⊤ x + (y0 − w⊤P η 0 ) is as desired. Moreover, for every g ∈ P1
it holds that g( i=0 αi η i ) = di=0 αi g(η i ) whenever di=0 αi = 1 (this is in general not true if the
d

coefficients do not sum to 1). Hence, g is uniquely determined by its values at the nodes.

Since Ω is the union of the simplices τ ∈ T , every cpwl function with respect to T is thus
uniquely defined through its values at the nodes. Hence, the desired basis consists of cpwl functions
φη : Ω → R with respect to T such that

φη (µ) = δηµ for all µ ∈ V, (5.3.4)

where δηµ denotes the Kronecker delta. Assuming φη to be well-defined for the moment, we can
then represent every cpwl function f : Ω → R that vanishes on the boundary ∂Ω as
X
f (x) = f (η)φη (x) for all x ∈ Ω.
η∈V∩Ω̊

Note that it suffices to sum over the set of interior nodes V ∩ Ω̊, since f (η) = 0 whenever η ∈
∂Ω. To formally verify existence and well-definedness of φη , we first need a lemma characterizing
the boundary of so-called “patches” of the triangulation: For each η ∈ V, we introduce the patch
ω(η) of the node η as the union of all elements containing η, i.e.,
[
ω(η) := τ. (5.3.5)
{τ ∈T | η∈τ }

Lemma 5.16. Let η ∈ V ∩ Ω̊ be an interior node. Then,


[
∂ω(η) = co(V (τ )\{η}).
{τ ∈T | η∈τ }

We refer to Figure 5.5 for a visualization of Lemma 5.16. The proof of Lemma 5.16 is quite
technical but nonetheless elementary. We therefore only outline the general argument but leave
the details to the reader in Excercise 5.28: The boundary of ω(η) must be contained in the union

61
η6 η1
ω(η) co(V (τ1 )\{η}) = co({η 1 , η 2 })
τ6
τ5 τ1
η5 η2
τ4 η τ2
τ3

η4 η3

Figure 5.5: Visualization of Lemma 5.16 in two dimensions. The patch ω(η) consists of the union
of all 2-simplices τi containing η. Its boundary consists of the union of all 1-simplices made up by
the nodes of each τi without the center node, i.e., the convex hulls of V (τi )\{η}.

of the boundaries of all τ in the patch ω(η). Since η is an interior point of Ω, it must also be
an interior point of ω(η). This can be used to show that for every S := {η i0 , . . . , η ik } ⊆ V (τ ) of
cardinality k + 1 ≤ d, the interior of (the k-dimensional manifold) co(S) belongs to the interior
of ω(η) whenever η ∈ S. Using Exercise 5.28, it then only remains to check that co(S) ⊆ ∂ω(η)
whenever η ∈ / S, which yields the claimed formula. We are now in position to show well-definedness
of the basis functions in (5.3.4).

Lemma 5.17. For each interior node η ∈ V ∩ Ω̊ there exists a unique cpwl function φη : Ω → R
satisfying (5.3.4). Moreover, φη can be expressed by a ReLU neural network with size, width, and
depth bounds that only depend on d and kT .

Proof. By Lemma 5.15, on each τ ∈ T , the affine function φη |τ is uniquely defined through the
values at the nodes of τ . This defines a continuous function φη : Ω → R. Indeed, whenever
τ ∩ τ ′ ̸= ∅, then τ ∩ τ ′ is a subsimplex of both τ and τ ′ in the sense of Definition 5.13 (ii). Thus,
applying Lemma 5.15 again, the affine functions on τ and τ ′ coincide on τ ∩ τ ′ .
Using Lemma 5.15, Lemma 5.16 and the fact that φη (µ) = 0 whenever µ ̸= η, we find that
φη vanishes on the boundary of the patch ω(η) ⊆ Ω. Thus, φη vanishes on the boundary of Ω.
Extending by zero, it becomes a cpwl function φη : Rd → R. This function is nonzero only on
elements τ for which η ∈ τ . Hence, it is a cpwl function with at most n := kT + 1 affine functions.
By Theorem 5.7, φη can be expressed as a ReLU neural network with the claimed size, width and
depth bounds; to apply Theorem 5.7 we used that (the extension of) φη is defined on the convex
domain Rd .

Finally, Theorem 5.14 is now an easy consequence of the above lemmata.

Proof (of Theorem 5.14). With


X
Φ(x) := f (η)φη (x) for x ∈ Ω, (5.3.6)
η∈V∩Ω̊

it holds that Φ : Ω → R satisfies Φ(η) = f (η) for all η ∈ V. By Lemma 5.15 this implies that
f equals Φ on each τ , and thus f equals Φ on all of Ω. Since each element τ is the convex hull

62
of d + 1 nodes η ∈ V, the cardinality of V is bounded by the cardinality of T times d + 1. Thus,
the summation in (5.3.6) is over O(|T |) terms. Using Lemma 5.4 and Lemma 5.17 we obtain the
claimed bounds on size, width, and depth of the neural network.

5.3.3 Size bounds for locally convex triangulations


Assuming local convexity of the triangulation, in this section we make the dependence of the
constants in Theorem 5.14 explicit in the dimension d and in the maximal number of simplices
kT touching a node, see (5.3.2). As such the improvement over Theorem 5.14 is modest, and the
reader may choose to skip this section on a first pass. Nonetheless, the proof, originally from [110],
is entirely constructive and gives some further insight on how ReLU networks express functions.
Let us start by stating the required convexity constraint.

Definition 5.18. A regular triangulation T is called locally convex if and only if ω(η) is convex
for all interior nodes η ∈ V ∩ Ω̊.

The following theorem is a variant of [110, Theorem 3.1].

Theorem 5.19. Let d ∈ N, and let Ω ⊆ Rd be a bounded domain. Let T be a locally convex regular
triangulation of Ω. Let f : Ω → R be cpwl with respect to T and f |∂Ω = 0. Then, there exists a
constant C > 0 (independent of d, f and T ) and there exists a neural network Φf : Ω → R such
that Φf = f ,

size(Φf ) ≤ C · (1 + d2 kT |T |),
width(Φf ) ≤ C · (1 + d log(kT )|T |),
depth(Φf ) ≤ C · (1 + log2 (kT )).

Assume in the following that T is a locally convex triangulation. We will split the proof of the
theorem again into a few lemmata. First, we will show that a convex patch can be written as an
intersection of finitely many half-spaces. Specifically, with the affine hull of a set S defined as
 
Xn Xn 
aff(S) := αj xj n ∈ N, xj ∈ S, αj ∈ R, αj = 1 (5.3.7)
 
j=1 j=1

let in the following for τ ∈ T and η ∈ V (τ )

H0 (τ, η) := aff(V (τ )\{η})

be the affine hyperplane passing through all nodes in V (τ )\{η}, and let further

H+ (τ, η) := {x ∈ Rd | x is on the same side of H0 (τ, η) as η} ∪ H0 (τ, η).

63
Lemma 5.20. Let η be an interior node. Then a patch ω(η) is convex if and only if
\
ω(η) = H+ (τ, η). (5.3.8)
{τ ∈T | η∈τ }

Proof. The right-hand side is a finite intersection of (convex) half-spaces, and thus itself convex. It
remains to show that if ω(η) is convex, then (5.3.8) holds. We start with “⊃”. Suppose x ∈ / ω(η).
Then the straight line co({x, η}) must pass through ∂ω(η), and by Lemma 5.16 this implies that
there exists τ ∈ T with η ∈ τ such that co({x, η}) passes through aff(V (τ )\{η}) = H0 (τ, η).
Hence η and x lie on different sides of this affine hyperplane, which shows “⊇”. Now we show “⊆”.
Let τ ∈ T be such that η ∈ τ and fix x in the complement of H+ (τ, η). Suppose that x ∈ ω(η). By
convexity, we then have co({x} ∪ τ ) ⊆ ω(η). This implies that there exists a point in co(V (τ )\{η})
belonging to the interior of ω(η). This contradicts Lemma 5.16. Thus, x ∈ / ω(η).

The above lemma allows us to explicitly construct the basis functions φη in (5.3.4). To see this,
denote in the following for τ ∈ T and η ∈ V (τ ) by gτ,η ∈ P1 (Rd ) the affine function such that
(
1 if η = µ
gτ,η (µ) = for all µ ∈ V (τ ).
0 if η ̸= µ

This function exists and is unique by Lemma 5.15. Observe that φη (x) = gτ,η (x) for all x ∈ τ .

Lemma 5.21. Let η ∈ V ∩ Ω̊ be an interior node and let ω(η) be a convex patch. Then
 
φη (x) = max 0, min gτ,η (x) for all x ∈ Rd . (5.3.9)
{τ ∈T | η∈τ }

Proof. First let x ∈


/ ω(η). By Lemma 5.20 there exists τ ∈ T with η ∈ τ such that x is in the
complement of H+ (τ, η). Observe that

gτ,η |H+ (τ,η) ≥ 0 and gτ,η |H+ (τ,η)c < 0. (5.3.10)

Thus

min gτ,η (x) < 0 for all x ∈ ω(η)c ,


{τ ∈T | η∈τ }

i.e., (5.3.9) holds for all x ∈ R\ω(η). Next, let τ , τ ′ ∈ T such that η ∈ τ and η ∈ τ ′ . We wish to
show that gτ,η (x) ≤ gτ ′ ,η (x) for all x ∈ τ . Since gτ,η (x) = φη (x) for all x ∈ τ , this then concludes
the proof of (5.3.9). By Lemma 5.20 it holds

µ ∈ H+ (τ ′ , η) for all µ ∈ V (τ ).

64
Hence, by (5.3.10)
gτ ′ ,η (µ) ≥ 0 = gτ,η (µ) for all µ ∈ V (τ )\{η}.
Moreover, gτ,η (η) = gτ ′ ,η (η) = 1. Thus, gτ,η (µ) ≥ gτ ′ ,η (µ) for all µ ∈ V (τ ′ ) and therefore
gτ ′ ,η (x) ≥ gτ,η (x) for all x ∈ co(V (τ ′ )) = τ ′ .

Proof (of Theorem 5.19). For every interior node η ∈ V ∩ Ω̊, the cpwl basis function φη in (5.3.4)
can be expressed as in (5.3.9), i.e.,
φη (x) = σ • Φmin
|{τ ∈T | η∈τ }| • (gτ,η (x)){τ ∈T | η∈τ } ,

where (gτ,η (x)){τ ∈T | η∈τ } denotes the parallelization with shared inputs of the functions gτ,η (x) for
all τ ∈ T such that η ∈ τ .
For this neural network, with |{τ ∈ T | η ∈ τ }| ≤ kT , we have by Lemma 5.2
size(φη ) ≤ 4 size(σ) + size(Φmin

|{τ ∈T | η∈τ }| ) + size((gτ,η ){τ ∈T | η∈τ } )
≤ 4(2 + 16kT + kT d) (5.3.11)
and similarly
depth(φη ) ≤ 4 + ⌈log2 (kT )⌉, width(φη ) ≤ max{1, 3kT , d}. (5.3.12)
Since for every interior node, the number of simplices touching the node must be larger or equal
to d, we can assume max{kT , d} = kT in the following (otherwise there exist no interior nodes, and
the function f is constant 0). As in the proof of Theorem 5.14, the neural network

X
Φ(x) := f (η)φη (x)
η∈V∩Ω̊

realizes the function f on all of Ω. Since the number of nodes |V| is bounded by (d + 1)|T |, an
application of Lemma 5.4 yields the desired bounds.

5.4 Convergence rates for Hölder continuous functions


Theorem 5.14 immediately implies convergence rates for certain classes of (low regularity) functions.
Recall for example the space C 0,s of Hölder continuous functions.

Definition 5.22. Let s ∈ (0, 1] and Ω ⊆ Rd . Then for f : Ω → R

|f (x) − f (y)|
∥f ∥C 0,s (Ω) := sup |f (x)| + sup , (5.4.1)
x∈Ω x̸=y∈Ω ∥x − y∥s2

and we denote by C 0,s (Ω) the set of functions f ∈ C 0 (Ω) for which ∥f ∥C 0,s (Ω) < ∞.

65
Hölder continuous functions can be approximated well by cpwl functions. This leads to the
following result.

Theorem 5.23. Let d ∈ N. There exists a constant C = C(d) such that for every f ∈ C 0,s ([0, 1]d )
and every N there exists a ReLU neural network ΦfN with

size(ΦfN ) ≤ CN, width(ΦfN ) ≤ CN, depth(ΦfN ) = C

and
s
sup f (x) − ΦfN (x) ≤ C∥f ∥C 0,s ([0,1]d ) N − d .
x∈[0,1]d

Proof. For M ≥ 2, consider the set of nodes {ν/M | ν ∈ {−1, . . . , M + 1}d } where ν/M =
(ν1 /M, . . . , νd /M ). These nodes suggest a partition of [−1/M, 1 + 1/M ]d into (2 + M )d sub-
hypercubes. Each such sub-hypercube can be partitioned into d! simplices, such that we obtain a
regular triangulation T with d!(2+M )d elements on [0, 1]d . According to Theorem 5.14 there exists a
neural network Φ that is cpwl with respect to T and Φ(ν/M ) = f (ν/M ) whenever ν ∈ {0, . . . , M }d
and Φ(ν/M ) = 0 for all other (boundary) nodes. It holds
size(Φ) ≤ C|T | = Cd!(2 + M )d ,
width(Φ) ≤ C|T | = Cd!(2 + M )d , (5.4.2)
depth(Φ) ≤ C
for a constant C that only depends on d (since for our regular triangulation T , kT in (5.3.2) is a
fixed d-dependent constant).
Let us bound the error. Fix a point x ∈ [0, 1]d . Then x belongs to one of the interior simplices
τ of the triangulation. Two nodes of the simplex have distance at most
2 1/2 √
 
d 

X 1  = d =: ε.
M M
j=1

Since Φ|τ is the linear interpolant of f at the nodes V (τ ) of the simplex τ , Φ(x) is a convex
combination of the (f (η))η∈V (τ ) . Fix an arbitrary node η 0 ∈ V (τ ). Then ∥x − η 0 ∥2 ≤ ε and
|Φ(x) − Φ(η 0 )| ≤ max |f (η) − f (µ)| ≤ sup |f (x) − f (y)|
η,µ∈V (τ ) x,y∈[0,1]d
∥x−y∥2 ≤ε

≤ ∥f ∥C 0,s ([0,1]d ) εs .
Hence, using f (η 0 ) = Φ(η 0 ),
|f (x) − Φ(x)| ≤ |f (x) − f (η 0 )| + |Φ(x) − Φ(η 0 )|
≤ 2∥f ∥C 0,s ([0,1]d ) εs
s
= 2∥f ∥C 0,s ([0,1]d ) d 2 M −s
s s
= 2d 2 ∥f ∥C 0,s ([0,1]d ) N − d (5.4.3)

66
where N := M d . The statement follows by (5.4.2) and (5.4.3).

The principle behind Theorem 5.23 can be applied in even more generality. Since we can
represent every cpwl function on a regular triangulation with a neural network of size O(N ), where
N denotes the number of elements, most classical (e.g. finite element) approximation theory for
cpwl functions can be lifted to generate statements about ReLU approximation. For instance, it is
well-known, that functions in the Sobolev space H 2 ([0, 1]d ) can be approximated by cpwl functions
on a regular triangulation in terms of L2 ([0, 1]d ) with the rate 2/d, e.g., [80, Chapter 22]. Similar
as in the proof of Theorem 5.23, for every f ∈ H 2 ([0, 1]d ) and every N ∈ N there then exists a
ReLU neural network ΦN such that size(ΦN ) = O(N ) and
2
∥f − ΦN ∥L2 ([0,1]d ) ≤ C∥f ∥H 2 ([0,1]d ) N − d .

Finally, we may consider how to approximate smoother functions such as f ∈ C k ([0, 1]d ), k > 1,
with ReLU neural networks. As discussed in Chapter 4 for sigmoidal activation functions, larger k
can lead to faster convergence. However, we will see in the following chapter, that the emulation of
piecewise affine functions on regular triangulations will not yield improved approximation rates as
k increases. To leverage such smoothness with ReLU networks, in Chapter 7 we will first build net-
works that emulate polynomials. Surprisingly, it turns out that polynomials can be approximated
very efficiently by deep ReLU neural networks.

Bibliography and further reading


The ReLU calculus introduced in Section 5.1 was similarly given in [208]. The fact that every
cpwl function can be expressed as a maximum over a minimum of linear functions goes back to
the papers [267, 266]; see also [203] for an accessible presentation of this result. Additionally, [282]
provides sharper bounds on the number of required nestings in such representations.
The main result of Section 5.2, which shows that every cpwl function can be expressed by a
ReLU network, is then a straightforward consequence. This was first observed in [7], which also
provided bounds on the network size. These bounds were significantly improved in [110] for cpwl
functions on triangular meshes that satisfy a local convexity condition. Under this assumption, it
was shown that the network size essentially only grows linearly with the number of pieces. The
paper [166] showed that the convexity assumption is not necessary for this statement to hold. We
give a similar result in Section 5.3.2, using a simpler argument than [166]. The locally convex case
from [110] is separately discussed in Section 5.3.3, as it allows for further improvements in some
constants.
The implications for the approximation of Hölder continuous functions discussed in Section
5.4, follows by standard approximation theory for cpwl functions; see for example [68] or the
finite element literature such as [54, 38, 80], which focus on approximation in Sobolev spaces.
Additionally, [293] provide a stronger result, where it is shown that ReLU networks can essentially
achieve twice the rate proven in Theorem 5.23, and this is sharp. For a general reference on splines
and piecewise polynomial approximation see for instance [245]. Finally we mention that similar
convergence results can also be shown for other activation functions, see, e.g., [174].

67
Exercises
Exercise 5.24. Let p : R → R be a polynomial of degree n ≥ 1 (with leading coefficient nonzero)
and let s : R → R be a continuous sigmoidal activation function. Show that the identity map
x 7→ x : R → R belongs to N11 (p; 1, n + 1) but not to N11 (s; L) for any L ∈ N.

Exercise 5.25. Consider cpwl functions f : R → R with n ∈ N0 breakpoints (points where the
function is not C 1 ). Determine the minimal size required to exactly express every such f with a
depth-1 ReLU neural network.

Exercise 5.26. Show that, the notion of affine independence is invariant under permutations of
the points.

= co(x0 , . . . , xd ) be a d-simplex. Show that the coefficients αi ≥ 0 such that


Exercise 5.27. Let τ P
Pd d
i=0 αi = 1 and x = i=0 αi xi are unique for every x ∈ τ .

Exercise
Sd 5.28. Let τ = co(η 0 , . . . , η d ) be a d-simplex. Show that the boundary of τ is given by
i=0 co({η 0 , . . . , η d }\{η i }).

68
Chapter 6

Affine pieces for ReLU neural


networks

In the previous chapters, we observed some remarkable approximation results of shallow ReLU
neural networks. In practice, however, deeper architectures are more common. To understand why,
in this chapter we discuss some potential shortcomings of shallow ReLU networks compared to deep
ReLU networks.
Traditionally, an insightful approach to study limitations of ReLU neural networks has been to
analyze the number of linear regions these functions can generate.

Definition 6.1. Let d ∈ N, Ω ⊆ Rd , and let f : Ω → R be cpwl (see Definition 5.5). We say
that f has p ∈ N pieces S (or linear regions), if p is the smallest number of connected open
sets (Ωi )pi=1 such that pi=1 Ωi = Ω, and f |Ωi is an affine function for all i = 1, . . . , p. We denote
Pieces(f, Ω) := p.
For d = 1 we call every point where f is not differentiable a break point of f .

To get an accurate cpwl approximation of a function, the approximating function needs to have
many pieces. The next theorem, corresponding to [82, Theorem 2], quantifies this statement.

Theorem 6.2. Let −∞ < a < b < ∞ and f ∈ C 3 ([a, b]) so that f is not affine. Then there exists
Rbp
a constant C > 0 depending only on a |f ′′ (x)| dx so that

∥g − f ∥L∞ ([a,b]) > Cp−2

for all cpwl g with at most p ∈ N pieces.

The proof of the theorem is left to the reader, see Exercise 6.11.
Theorem 6.2 implies that for ReLU neural networks we need architectures allowing for many
pieces, if we want to approximate non-linear functions to high accuracy. How many pieces can we

69
create for a fixed depth and width? We establish a simple theoretical upper bound in Section 6.1.
Subsequently, we investigate under which conditions these upper bounds are attainable in Section
6.2. Lastly, in Section 6.3, we will discuss the practical relevance of this analysis by examining how
many pieces “typical” neural networks possess. Surprisingly, it turns out that randomly initialized
deep neural networks on average do not have a number of pieces that is anywhere close to the
theoretically achievable maximum.

6.1 Upper bounds


Neural networks are based on the composition and addition of neurons. These two operations
increase the possible number of pieces in a very specific way. Figure 6.1 depicts the two operations
and their effect. They can be described as follows:
• Summation: Let Ω ⊆ R. The sum of two cpwl functions f1 , f2 : Ω → R satisfies

Pieces(f1 + f2 , Ω) ≤ Pieces(f1 , Ω) + Pieces(f2 , Ω) − 1. (6.1.1)

This holds because the sum is affine in every point where both f1 and f2 are affine. Therefore,
the sum has at most as many break points as f1 and f2 combined. Moreover, the number of
pieces of a univariate function equals the number of its break points plus one.

• Composition: Let again Ω ⊆ R. The composition of two functions f1 : Rd → R and f2 : Ω →


Rd satisfies

Pieces(f1 ◦ f2 , Ω) ≤ Pieces(f1 , Rd ) · Pieces(f2 , Ω). (6.1.2)

This is because for each of the affine pieces of f2 —let us call one of those pieces A ⊆ R—we
have that f2 is either constant or injective on A. If it is constant, then f1 ◦ f2 is constant. If
it is injective, then Pieces(f1 ◦ f2 , A) = Pieces(f1 , f2 (A)) ≤ Pieces(f1 , Rd ). Since this holds
for all pieces of f2 we get (6.1.2).

Figure 6.1: Top: Composition of two cpwl functions f1 ◦ f2 can create a piece whenever the value
of f2 crosses a level that is associated to a break point of f1 . Bottom: Addition of two cpwl
functions f1 + f2 produces a cpwl function that can have break points at positions where either f1
or f2 has a break point.

These considerations give the following result, which follows the argument of [268, Lemma 2.1].
We state it for general cpwl activation functions. The ReLU activation function corresponds to

70
p = 2. Recall that the notation (σ; d0 , . . . , dL+1 ) denotes the architecture of a feedforward neural
network, see Definition 2.1.

Theorem 6.3. Let L ∈ N. Let σ be cpwl with p pieces. Then, every neural network with architecture
(σ; 1, d1 , . . . , dL , 1) has at most (p · width(Φ))L pieces.

Proof. The proof is via induction over the depth L. Let L = 1, and let Φ : R → R be a neural
network of architecture (σ; 1, d1 , 1). Then
d1
(1) (0) (0)
X
Φ(x) = wk σ(wk x + bk ) + b(1) for x ∈ R,
k=1

for certain w(0) , w(1) , b(0) ∈ Rd1 and b(1) ∈ R. By (6.1.1), Pieces(Φ) ≤ p · width(Φ).
For the induction step, assume the statement holds for L ∈ N, and let Φ : R → R be a neural
network of architecture (σ; 1, d1 , . . . , dL+1 , 1). Then, we can write
dL+1
X
Φ(x) = wj σ(hj (x)) + b for x ∈ R,
j=1

for some w ∈ RdL+1 , b ∈ R, and where each hj is a neural network of architecture (σ; 1, d1 , . . . , dL , 1).
Using the induction hypothesis, each σ ◦ hℓ has at most p · (p · width(Φ))L affine pieces. Hence
Φ has at most width(Φ) · p · (p · width(Φ))L = (p · width(Φ))L+1 affine pieces. This completes the
proof.
Theorem 6.3 shows that there are limits to how many pieces can be created with a certain
architecture. It is noteworthy that the effects of the depth and the width of a neural network
are vastly different. While increasing the width can polynomially increase the number of pieces,
increasing the depth can result in exponential increase. This is a first indication of the prowess of
depth of neural networks.
To understand the effect of this on the approximation problem, we apply the bound of Theorem
6.3 to Theorem 6.2.

Theorem 6.4. Let d0 ∈ N and f R∈ C 3 d0 d0


p ([0, 1] ). Assume there exists a line segment s ⊆ [0, 1] of
positive length such that 0 < c := s |f ′′ (x)| dx. Then, there exists C > 0 solely depending on c,
such that for all ReLU neural networks Φ : Rd0 → R with L hidden layers

∥f − Φ∥L∞ ([0,1]d0 ) ≥ C · (2width(Φ))−2L .

Theorem 6.4 gives a lower bound on achievable approximation rates in dependence of the depth
L. As target functions become smoother, we expect that we can achieve faster convergence rates
(cp. Chapter 4). However, without increasing the depth, it seems to be impossible to leverage such
additional smoothness.
This observation strongly indicates that deeper architectures can be superior. Before making
this more concrete, we first explore whether the upper bounds of Theorem 6.3 are also achievable.

71
6.2 Tightness of upper bounds
We follow [268] to construct a ReLU neural network, that realizes the upper bound of Theorem
6.3. First let h1 : [0, 1] → R be the hat function
(
2x if x ∈ [0, 12 ]
h1 (x) :=
2 − 2x if x ∈ [ 12 , 1].

This function can be expressed by a ReLU neural network of depth one and with two nodes

h1 (x) = σReLU (2x) − σReLU (4x − 2) for all x ∈ [0, 1]. (6.2.1a)

We recursively set

hn := hn−1 ◦ h1 for all n ≥ 2, (6.2.1b)

i.e., hn = h1 ◦ · · · ◦ h1 is the n-fold composition of h1 . Since h1 : [0, 1] → [0, 1], we have hn : [0, 1] →
[0, 1] and

hn ∈ N11 (σReLU ; n, 2).

It turns out that this function has a rather interesting behavior. It is a “sawtooth” function with
2n−1 spikes, see Figure 6.2.

Lemma 6.5. Let n ∈ N. It holds for all x ∈ [0, 1]


(
2n (x − i2−n ) if i ≥ 0 is even and x ∈ [i2−n , (i + 1)2−n ]
hn (x) =
2n ((i + 1)2−n − x) if i ≥ 1 is odd and x ∈ [i2−n , (i + 1)2−n ].

Proof. The case n = 1 holds by definition. We proceed by induction, and assume the statement
holds for n. Let x ∈ [0, 1/2] and i ≥ 0 even such that x ∈ [i2−(n+1) , (i + 1)2−(n+1) ]. Then
2x ∈ [i2−n , (i + 1)2−n ]. Thus

hn (h1 (x)) = hn (2x) = 2n (2x − i2−n ) = 2n+1 (x − i2−n+1 ).

Similarly, if x ∈ [0, 1/2] and i ≥ 1 odd such that x ∈ [i2−(n+1) , (i + 1)2−(n+1) ], then h1 (x) = 2x ∈
[i2−n , (i + 1)2−n ] and

hn (h1 (x)) = hn (2x) = 2n (2x − (i + 1)2−n ) = 2n+1 (x − (i + 1)2−n+1 ).

The case x ∈ [1/2, 1] follows by observing that hn+1 is symmetric around 1/2.

The neural network hn has size O(n) and is piecewise linear on at least 2n pieces. This shows
that the number of pieces can indeed increase exponentially in the neural network size, also see the
upper bound in Theorem 6.3.

72
h1 h2 h3
1 1 1

0 1 0 1 0 1

Figure 6.2: The functions hn in Lemma 6.5.

Figure 6.3: Two randomly initialized neural networks Φ1 and Φ2 with architectures
(σReLU ; 2, 10, 10, 1) and (σReLU ; 2, 5, 5, 5, 5, 5, 1). The initialization scheme was He initialization
[111]. The number of linear regions equals 114 and 110, respectively.

6.3 Number of pieces in practice


We have seen in Theorem 6.3 that deep neural networks can have many more pieces than their
shallow counterparts. This begs the question if deep neural networks tend to generate more pieces
in practice. More formally: If we randomly initialize the weights of a neural network, what is
the expected number of linear regions? Will this number scale exponentially with the depth? This
question was analyzed in [104], and surprisingly, it was found that the number of pieces of randomly
initialized neural networks typically does not depend exponentially on the depth. In Figure 6.3, we
depict two neural networks, one shallow and one deep, that were randomly initialized according to
He initialization [111]. Both neural networks have essentially the same number of pieces (114 and
110) and there is no clear indication that one has a deeper architecture than the other.
In the following, we will give a simplified version of the main result of [104] to show why random
deep neural networks often behave like shallow neural networks.
We recall from Figure 6.1 that pieces are generated through composition of two functions f1
and f2 , if the values of f2 cross a level that is associated to a break point of f1 . In the case of a
simple neuron of the form

x 7→ σReLU (⟨a, h(x)⟩ + b)

where h is a cpwl function, a is a vector, and b is a scalar, many pieces can be generated if ⟨a, h(x)⟩

73
crosses the −b level often.
If a, b are random variables, and we know that h does not oscillate too much, then we can
quantify the probability of ⟨a, h(x)⟩ crossing the −b level often. The following lemma from [140,
Lemma 3.1] provides the details.

Lemma 6.6. Let c > 0 and let h : [0, c] → R be a cpwl function on [0, c]. Let t ∈ N, let A ⊆ R be
a Lebesgue measurable set, and assume that for every y ∈ A

|{x ∈ [0, c] | h(x) = y}| ≥ t.

Then, c∥h′ ∥L∞ ≥ ∥h′ ∥L1 ≥ |A| · t, where |A| is the Lebesgue measure of A. In particular, if h
has at most P ∈ N pieces and ∥h′ ∥L1 < ∞, then for all δ > 0, t ≤ P

∥h′ ∥L1
P [|{x ∈ [0, c] | h(x) = U }| ≥ t] ≤ ,
δt
P [|{x ∈ [0, c] | h(x) = U }| > P ] = 0,

where U is a uniformly distributed variable on [−δ/2, δ/2].

Proof. We will assume c = 1. The general case then follows by considering h̃(x) = h(x/c).
Let for (ci )Pi=1
+1
⊆ [0, 1] with c1 = 0, cP +1 = 1 and ci ≤ ci+1 for all i = 1, . . . , P + 1 the pieces of
h be given by ((ci , ci+1 ))Pi=1 . We denote

V1 := [0, c2 ], Vi := (ci , ci+1 ] for i = 1, . . . , P

and for i = 1, . . . , P + 1
i−1
[
Vei := Vj .
j=1

We define, for n ∈ N ∪ {∞}


n o
Ti,n := h(Vi ) ∩ y ∈ A |{x ∈ Vei | h(x) = y}| = n − 1 .

In words, Ti,n contains the values of A that are hit on Vi for the nth time. Since h is cpwl, we
observe that for all i = 1, . . . , P

(i) Ti,n1 ∩ Ti,n2 = ∅ for all n1 , n2 ∈ N ∪ {∞}, n1 ̸= n2 ,

(ii) Ti,∞ ∪ ∞
S
n=1 Ti,n = h(Vi ) ∩ A,

(iii) Ti,n = ∅ for all P < n < ∞,

(iv) |Ti,∞ | = 0.

74
Note that, since h is affine on Vi it holds that h′ = |h(Vi )|/|Vi | on Vi . Hence, for t ≤ P
P
X P
X

∥h ∥L1 ≥ |h(Vi )| ≥ |h(Vi ) ∩ A|
i=1 i=1
P ∞
!
X X
= |Ti,n | + |Ti,∞ |
i=1 n=1
P
XX ∞
= |Ti,n |
i=1 n=1
Xt X P
≥ |Ti,n |,
n=1 i=1

where the first equality follows by (i), (ii), the second by (iv), and the last inequality by (iii).
Note that, by assumption for all n ≤ t every y ∈ A is an element of Ti,n or Ti,∞ for some i ≤ P .
Therefore, by (iv)
X P
|Ti,n | ≥ |A|,
i=1
which completes the proof.

Lemma 6.6 applied to neural networks essentially states that, in a single neuron, if the bias
term is chosen uniformly randomly on an interval of length δ, then the probability of generating at
least t pieces by composition scales reciprocal to t.
Next, we will analyze how Lemma 6.6 implies an upper bound on the number of pieces generated
in a randomly initialized neural network. For simplicity, we only consider random biases in the
following, but mention that similar results hold if both the biases and weights are random variables
[104].

Definition 6.7. Let L ∈ N, (d0 , d1 , . . . , dL , 1) ∈ NL+2 and W (ℓ) ∈ Rdℓ+1 ×dℓ for ℓ = 0, . . . , L. Fur-
thermore, let δ > 0 and let the bias vectors b(ℓ) ∈ Rdℓ+1 , for ℓ = 0, . . . , L, be random variables such
that each entry of each b(ℓ) is independently and uniformly distributed on the interval [−δ/2, δ/2].
We call the associated ReLU neural network a random-bias neural network.

To apply Lemma 6.6 to a single neuron with random biases, we also need some bound on the
derivative of the input to the neuron.

Definition 6.8. Let L ∈ N, (d0 , d1 , . . . , dL , 1) ∈ NL+2 , and W (ℓ) ∈ Rdℓ+1 ×dℓ and b(ℓ) ∈ Rdℓ+1 for
ℓ = 0, . . . , L. Moreover let δ > 0.
For ℓ = 1, . . . , L + 1, i = 1, . . . , dℓ introduce the functions

ηℓ,i (x; (W (j) , b(j) )ℓ−1


j=0 ) = (W
(ℓ−1) (ℓ−1)
x )i for x ∈ Rd0 ,

75
where x(ℓ−1) is as in (2.1.1). We call
(
 

ν (W (ℓ) )L ℓ=1 , δ := max ηℓ,i ( · ; (W (j) , b(j) )ℓ−1
j=0 )
2

L
)
Y
(b(j) )L
j=0 ∈ dj+1
[−δ/2, δ/2] , ℓ = 1, . . . , L, i = 1, . . . , dℓ
j=0

the maximal internal derivative of Φ.

We can now formulate the main result of this section.

Theorem 6.9. Let L ∈ N and let (d0 , d1 , . . ., dL , 1) ∈ NL+2 . Let δ ∈ (0, 1]. Let W (ℓ) ∈ Rdℓ+1 ×dℓ ,
for ℓ = 0, . . . , L, be such that ν (W (ℓ) )L
ℓ=0 , δ ≤ Cν for a Cν > 0.
For an associated random-bias neural network Φ, we have that for a line segment s ⊆ Rd0 of
length 1
L
Cν X
E[Pieces(Φ, s)] ≤ 1 + d1 + (1 + (L − 1) ln(2width(Φ))) dj . (6.3.1)
δ
j=2

Proof. Let W (ℓ) ∈ Rdℓ+1 ×dℓ for ℓ = 0, . . . , L. Moreover, let b(ℓ) ∈ [−δ/2, δ/2]dℓ+1 for ℓ = 0, . . . , L
be uniformly distributed random variables. We denote

θℓ : s → Rdℓ
dℓ
x 7→ (ηℓ,i (x; (W (j) , b(j) )ℓ−1
j=0 ))i=1 .

Let κ : s → [0, 1] be an isomorphism. Since each coordinate of θℓ is cpwl, there are points
x0 , x1 , . . . , xqℓ ∈ s with κ(xj ) < κ(xj+1 ) for j = 0, . . . , qℓ − 1, such that θℓ is affine (as a function
into Rdℓ ) on [κ(xj ), κ(xj+1 )] for all j = 0, . . . , qℓ − 1 as well as on [0, κ(x0 )] and [κ(xqℓ ), 1].
We will now inductively find an upper bound on the qℓ .
Let ℓ = 2, then
θ2 (x) = W (1) σReLU (W (0) x + b(0) ).
Since W (1) · +b(1) is an affine function, it follows that θ2 can only be non-affine in points where
σReLU (W (0) · +b(0) ) is not affine. Therefore, θ2 is only non-affine if one coordinate of W (0) · +b(0)
intersects 0 nontrivially. This can happen at most d1 times. We conclude that we can choose
q2 = d1 .
Next, let us find an upper bound on qℓ+1 from qℓ . Note that

θℓ+1 (x) = W (ℓ) σReLU (θℓ (x) + b(ℓ−1) ).

76
Now θℓ+1 is affine in every point x ∈ s where θℓ is affine and (θℓ (x) + b(ℓ−1) )i ̸= 0 for all coordinates
i = 1, . . . , dℓ . As a result, we have that we can choose qℓ+1 such that

qℓ+1 ≤ qℓ + {x ∈ s | (θℓ (x) + b(ℓ−1) )i = 0 for at least one i = 1, . . . , dℓ } .

Therefore, for ℓ ≥ 2

X
qℓ+1 ≤ d1 + {x ∈ s | (θj (x) + b(j) )i = 0 for at least one i = 1, . . . , dj }
j=3
dj
ℓ X
(j)
X
≤ d1 + {x ∈ s | ηj,i (x) = −bi } .
j=2 i=1

By Theorem 6.3, we have that


 
Pieces ηℓ,i ( · ; (W (j) , b(j) )ℓ−1
j=0 ), s ≤ (2width(Φ))ℓ−1 .

We define for k ∈ N ∪ {∞}


h i
(ℓ)
pk,ℓ,i := P {x ∈ s | ηℓ,i (x) = −bi } ≥ k

Then by Lemma 6.6



pk,ℓ,i ≤
δk
and for k > (2width(Φ))ℓ−1

pk,ℓ,i = 0.

It holds
 
dj n
L X o
(j)
X
E x ∈ s ηj,i (x) = −bi 
j=2 i=1
dj ∞
L X h n o i
(j)
X X
≤ k·P x ∈ s ηj,i (x) = −bi =k
j=2 i=1 k=1
dj ∞
L X
X X
≤ k · (pk,j,i − pk+1,j,i ).
j=2 i=1 k=1

77
The inner sum can be bounded by

X ∞
X ∞
X
k · (pk,j,i − pk+1,j,i ) = k · pk,j,i − k · pk+1,j,i
k=1 k=1 k=1

X ∞
X
= k · pk,j,i − (k − 1) · pk,j,i
k=1 k=2

X
= p1,j,i + pk,j,i
k=2

X
= pk,j,i
k=1
(2width(Φ))L−1
−1
X 1
≤ Cν δ
k
k=1
!
Z (2width(Φ))L−1
−1 1
≤ Cν δ 1+ dx
1 x
−1
≤ Cν δ (1 + (L − 1) ln((2width(Φ)))).

We conclude that, in expectation, we can bound qL+1 by


L
X
d1 + Cν δ −1 (1 + (L − 1) ln(2width(Φ))) dj .
j=2

Finally, since θL = ΦL+1 |s , it follows that

Pieces(Φ, s) ≤ qL+1 + 1

which yields the result.

Remark 6.10. We make the following observations about Theorem 6.9:

• Non-exponential dependence on depth: If we consider (6.3.1), we see that the number of pieces
scales in expectation essentially like O(LN ), where N is the total number of neurons of the
architecture. This shows that in expectation, the number of pieces is linear in the number of
layers, as opposed to the exponential upper bound of Theorem 6.3.

• Maximal internal derivative: Theorem 6.9 requires the weights to be chosen such that the
maximal internal derivative is bounded by a certain number. However, if they are randomly
initialized in such a way that with high probability the maximal internal derivative is bounded
by a small number, then similar results can be shown. In practice, weights in the ℓth layer
p are
often initialized according to a centered normal distribution with standard deviation 2/dℓ ,
[111]. Due to the anti-proportionality of the variance to the width of the layers it is achieved
that the internal derivatives remain bounded with high probability, independent of the width
of the neural networks. This explaines the observation from Figure 6.3.

78
Bibliography and further reading
Establishing bounds on the number of linear regions of a ReLU network has been a popular tool
to investigate the complexity of ReLU neural networks, see [182, 221, 7, 248, 104]. The bound
presented in Section 6.1, is based on [268]. For the construction of the sawtooth function in Section
6.2, we follow the arguments in [268, 269]. Together with the lower bound on the number of
required linear regions given in [82], this analysis shows how depth can be a limiting factor in terms
of achievable convergence rates, as stated in Theorem 6.4. Finally, the analysis of the number of
pieces deep neural networks attained with random intialization (Section 6.3) is based on [104] and
[140].

79
Exercises
Exercise 6.11. Let −∞ < a < b < ∞ and let f ∈ C 3 ([a, b])\P1 . Denote by p(ε) ∈ N the minimal
number of intervals partitioning [a, b], such that a (not necessarily continuous) piecewise linear
function on p(ε) intervals can approximate f on [a, b] uniformly up to error ε > 0. In this exercise,
we wish to show

lim inf p(ε) ε > 0. (6.3.2)
ε↘0

Therefore, we can find a constant C > 0 such that ε ≥ Cp(ε)−2 for all ε > 0. This shows a variant
of Theorem 6.2. Proceed as follows to prove (6.3.2):

(i) Fix ε > 0 and let a = x0 < x1 · · · < xp(ε) = b be a partitioning into p(ε) pieces. For
i = 0, . . . , p(ε) − 1 and x ∈ [xi , xi+1 ] let
 
f (xi+1 ) − f (xi )
ei (x) := f (x) − f (xi ) + (x − xi ) .
xi+1 − xi

Show that |ei (x)| ≤ 2ε for all x ∈ [xi , xi+1 ].

(ii) With hi := xi+1 − xi and mi := (xi + xi+1 )/2 show that

h2i ′′
max |ei (x)| = |f (mi )| + O(h3i ).
x∈[xi ,xi+1 ] 8

(iii) Assuming that c := inf x∈[a,b] |f ′′ (x)| > 0 show that


Z bp
√ 1
lim inf p(ε) ε ≥ |f ′′ (x)| dx.
ε↘0 4 a

(iv) Conclude that (6.3.2) holds for general non-linear f ∈ C 3 ([a, b]).

Exercise 6.12. Show that, for L = 1, Theorem 6.3 holds for piecewise smooth functions, when
replacing the number of affine pieces by the number of smooth pieces. These are defined by replacing
“affine” by “smooth” (meaning C ∞ ) in Definition 6.1.

Exercise 6.13. Show that, for L > 1, Theorem 6.3 does not hold for piecewise smooth functions,
when replacing the number of affine pieces by the number of smooth pieces.
(p)
Exercise 6.14. For p ∈ N, p > 2 and n ∈ N, construct a function hn similar to hn of (6.5), such
(p) (p)
that hn ∈ N11 (σReLU ; n, p) and such that hn has pn pieces and size O(p2 n).

80
Chapter 7

Deep ReLU neural networks

In the previous chapter, we observed that many layers are a necessary prerequisite for ReLU neural
networks to approximate smooth functions with high rates. We now analyze which depth is sufficient
to achieve good approximation rates for smooth functions.
To approximate smooth functions efficiently, one of the main tools in Chapter 4 was to rebuild
polynomial-based functions, such as higher-order B-splines. For smooth activation functions, we
were able to reproduce polynomials by using the nonlinearity of the activation functions. This
argument certainly cannot be repeated for the piecewise linear ReLU. On the other hand, up until
now, we have seen that deep ReLU neural networks are extremely efficient at producing the strongly
oscillating sawtooth functions discussed in Lemma 6.5. The main observation in this chapter is
that the sawtooth functions are intimately linked to the squaring function, which again leads to
polynomials. This observation was first made by Dmitry Yarotsky [292] in 2016, and the present
chapter is primarily based on this paper.
In Sections 7.1 and 7.2, we give Yarotsky’s approximation of the squaring and multiplication
functions. As a direct consequence, we show in Section 7.3 that deep ReLU neural networks
can be significantly more efficient than shallow ones in approximating analytic functions such as
polynomials and (certain) trigonometric functions. Using these tools, we conclude in Section 7.4
that deep ReLU neural networks can efficiently approximate k-times continuously differentiable
functions with Hölder continuous derivatives.

7.1 The square function


We start with the approximation of the map x 7→ x2 . The construction, first given in [292], is based
on the sawtooth functions hn defined in (6.2.1) and originally introduced in [268], see Figure 6.2.
The proof idea is visualized in Figure 7.1.

Proposition 7.1. Let n ∈ N. Then


n
X hj (x)
sn (x) := x −
22j
j=1

is a piecewise linear function on [0, 1] with break points xn,j = j2−n , j = 0, . . . , 2n . Moreover,
sn (xn,k ) = x2n,k for all k = 0, . . . , 2n , i.e. sn is the piecewise linear interpolant of x2 on [0, 1].

81
1 1 1
4 16 64

1 1 1
h1 (x) h1 (x) h2 (x)
x − x2 x − x2 − 4 x − x2 − 4 − 16
h1 (x) h2 (x) h3 (x)
4 16 64

Figure 7.1: Construction of sn in Proposition 7.1.

Proof. The statement holds for n = 1. We proceed by induction. Assume the statement holds for
sn and let k ∈ {0, . . . , 2n+1 }. By Lemma 6.5, hn+1 (xn+1,k ) = 0 whenever k is even. Hence for even
k ∈ {0, . . . , 2n+1 }
n+1
X hj (xn+1,k )
sn+1 (xn+1,k ) = xn+1,k −
22j
j=1
hn+1 (xn+1,k )
= sn (xn+1,k ) − = sn (xn+1,k ) = x2n+1,k ,
22(n+1)

where we used the induction assumption sn (xn+1,k ) = x2n+1,k for xn+1,k = k2−(n+1) = k2 2−n =
xn,k/2 .
Now let k ∈ {1, . . . , 2n+1 − 1} be odd. Then by Lemma 6.5, hn+1 (xn+1,k ) = 1. Moreover,
since sn is linear on [xn,(k−1)/2 , xn,(k+1)/2 ] = [xn+1,k−1 , xn+1,k+1 ] and xn+1,k is the midpoint of this
interval,

hn+1 (xn+1,k )
sn+1 (xn+1,k ) = sn (xn+1,k ) −
22(n+1)
1 1
= (x2n+1,k−1 + x2n+1,k+1 ) − 2(n+1)
2 2
(k − 1)2 (k + 1) 2 2
= 2(n+1)+1 + 2(n+1)+1 − 2(n+1)+1
2 2 2
1 2k 2 k2
= = = x2n+1,k .
2 22(n+1) 22(n+1)
This completes the proof.

As a consequence there holds the following, [292, Proposition 2].

82
x s1 (x) s2 (x) sn−1 (x)

x h1 (x) x ... sn (x)

h1 (x) h2 (x) h3 (x) hn (x)

Figure 7.2: The neural networks h1 (x) = σReLU (2x) − σReLU (4x − 2) and sn (x) = σReLU (sn−1 (x)) −
hn (x)/22n where hn = h1 ◦ hn−1 . Figure based on [292, Fig. 2c] and [246, Fig. 1a].

Lemma 7.2. For n ∈ N, it holds

sup |x2 − sn (x)| ≤ 2−2n−1 .


x∈[0,1]

Moreover sn ∈ N11 (σReLU ; n, 3), and size(sn ) ≤ 7n and depth(sn ) = n.

Proof. Set en (x) := x2 − sn (x). Let x be in the interval [xn,k , xn,k+1 ] = [k2−n , (k + 1)2−n ] of length
2−n . Since sn is the linear interpolant of x2 on this interval, we have

x2n,k+1 − x2n,k 2k + 1 1
|e′n (x)| = 2x − = 2x − ≤ n.
2−n 2n 2

Thus en : [0, 1] → R has Lipschitz constant 2−n . Since en (xn,k ) = 0 for all k = 0, . . . , 2n , and the
length of the interval [xn,k , xn,k+1 ] equals 2−n we get

1
sup |en (x)| ≤ 2−n 2−n = 2−2n−1 .
x∈[0,1] 2

Finally, to see that sn can be represented by a neural network of the claimed architecture, note
that for n ≥ 2
n
X hj (x) hn (x) h1 ◦ hn−1 (x)
sn (x) = x − = sn−1 (x) − = σReLU ◦ sn−1 (x) − .
22j 2 2n 22n
j=1

Here we used that sn−1 is the piecewise linear interpolant of x2 , so that sn−1 (x) ≥ 0 and thus
sn−1 (x) = σReLU (sn−1 (x)) for all x ∈ [0, 1]. Hence sn is of depth n and width 3, see Figure 7.2.

In conclusion, we have shown that sn : [0, 1] → [0, 1] approximates the square function uniformly
on [0, 1] with exponentially decreasing error in the neural network size. Note that due to Theorem
6.4, this would not be possible with a shallow neural network, which can at best interpolate x2 on
a partition of [0, 1] with polynomially many (w.r.t. the neural network size) pieces.

83
7.2 Multiplication
According to Lemma 7.2, depth can help in the approximation of x 7→ x2 , which, on first sight,
seems like a rather specific example. However, as we shall discuss in the following, this opens
up a path towards fast approximation of functions with high regularity, e.g., C k ([0, 1]d ) for some
k > 1. The crucial observation is that, via the polarization identity we can write the product of
two numbers as a sum of squares

(x + y)2 − (x − y)2
x·y = (7.2.1)
4
for all x, y ∈ R. Efficient approximation of the operation of multiplication allows efficient ap-
proximation of polynomials. Those in turn are well-known to be good approximators for functions
exhibiting k ∈ N derivatives. Before exploring this idea further in the next section, we first make
precise the observation that neural networks can efficiently approximate the multiplication of real
numbers.
We start with the multiplication of two numbers, in which case neural networks of logarithmic
size in the desired accuracy are sufficient, [292, Proposition 3].

Lemma 7.3. For every ε > 0 there exists a ReLU neural network Φ× 2
ε : [−1, 1] → [−1, 1] such that

sup |x · y − Φ×
ε (x, y)| ≤ ε,
x,y∈[−1,1]

and it holds size(Φ× ×


ε ) ≤ C · (1 + | log(ε)|) and depth(Φε ) ≤ C · (1 + | log(ε)|) for a constant C > 0
×
independent of ε. Moreover, Φε (x, y) = 0 if x = 0 or y = 0.

Proof. With n = ⌈| log4 (ε)|⌉, define the neural network


 
× σReLU (x + y) + σReLU (−x − y)
Φε (x, y) :=sn
2
 
σReLU (x − y) + σReLU (y − x)
− sn . (7.2.2)
2

Since |a| = σReLU (a) + σReLU (−a), by (7.2.1) we have for all x, y ∈ [−1, 1]

(x + y)2 − (x − y)2
    
× |x + y| |x − y|
x · y − Φε (x, y) = − sn − sn
4 2 2
4( x+y 2 x−y 2
2 ) − 4( 2 ) 4sn ( |x+y| |x−y|
2 ) − 4sn ( 2 )
= −
4 4
4(2−2n−1 + 2−2n−1 )
≤ = 4−n ≤ ε,
4
where we used |x+y|/2, |x−y|/2 ∈ [0, 1]. We have depth(Φ× ε ) = 1+depth(sn ) = 1+n ≤ 1+⌈log4 (ε)⌉
and size(Φ×
ε ) ≤ C + 2size(s n ) ≤ Cn ≤ C · (1 − log(ε)) for some constant C > 0.

84
The fact that Φ× 2
ε maps from [−1, 1] → [−1, 1] follows by (7.2.2) and because sn : [0, 1] → [0, 1].
Finally, if x = 0, then Φ×
ε (x, y) = sn (|x + y|) − sn (|x − y|) = sn (|y|) − sn (|y|) = 0. If y = 0 the same
argument can be made.

In a similar way as in Proposition 4.8 and Lemma 5.11, we can apply operations with two inputs
in the form of a binary tree to extend them to an operation on arbitrary many inputs; see again
[292], and [246, Proposition 3.3] for the specific argument considered here.

Proposition 7.4. For every n ≥ 2 and ε > 0 there exists a ReLU neural network Φ× n
n,ε : [−1, 1] →
[−1, 1] such that

n
Y
sup xj − Φ×
n,ε (x1 , . . . , xn ) ≤ ε,
xj ∈[−1,1] j=1

and it holds size(Φ× ×


n,ε ) ≤ Cn · (1 + | log(ε/n)|) and depth(Φn,ε ) ≤ C log(n)(1 + | log(ε/n)|) for a
constant C > 0 independent of ε and n.

Proof. We begin with the case n = 2k . For k = 1 let Φ̃× ×


2,δ := Φδ . If k ≥ 2 let
 
Φ̃×
2k ,δ
:= Φ×
δ ◦ Φ̃ ×
2k−1 ,δ
, Φ̃ ×
2k−1 ,δ
.

Using Lemma 7.3, we find that this neural network has depth bounded by
 
depth Φ̃× k
2 ,δ
≤ kdepth(Φ×δ ) ≤ Ck · (1 + | log(δ)|) ≤ C log(n)(1 + | log(δ)|).

Observing that the number of occurences of Φ×


Pk−1 j ×
δ equals j=0 2 ≤ n, the size of Φ̃2k ,δ can bounded
by Cnsize(Φ×δ ) ≤ Cn · (1 + | log(δ)|).
k
To estimate the approximation error, denote with x = (xj )2j=1

Y
ek := sup xj − Φ̃×
2k ,δ
(x) .
xj ∈[−1,1]
j≤2k

Then, using short notation of the type x≤2k−1 := (x1 , . . . , x2k−1 ),


2 k
Y  
ek = sup xj − Φ×
δ Φ̃×
2k−1 ,δ
(x≤2k−1 ), Φ̃ ×
2k−1 ,δ
(x>2k−1 )
xj ∈[−1,1] j=1
 
Y
≤δ+ sup  xj ek−1 + Φ̃×
2k−1 ,δ
(x>2k−1 ) ek−1 
xj ∈[−1,1]
j≤2k−1
k−2
X
≤ δ + 2ek−1 ≤ δ + 2(δ + 2ek−2 ) ≤ · · · ≤ δ 2j + 2k−1 e1
j=0

≤ 2k δ = nδ.

85
k−1
Here we used e1 ≤ δ, and that Φ̃× 2k ,δ
maps [−1, 1]2 to [−1, 1], which is a consequence of Lemma
7.3.
The case for general n ≥ 2 (not necessarily n = 2k ) is treated similar as in Lemma 5.11, by
replacing some Φ× δ neural networks with identity neural networks.
×
Finally, setting δ := ε/n and Φ×n,ε := Φ̃n,δ concludes the proof.

7.3 Polynomials and depth separation


As a first consequence of the above observations, we consider approximating the polynomial
n
X
p(x) = cj xj . (7.3.1)
j=0

One possibility to approximate p is via the Horner scheme and the approximate multiplication Φ×
ε
from Lemma 7.3, yielding

p(x) = c0 + x · (c1 + x · (· · · + x · cn ) . . . )
≃ c0 + Φ× × ×
ε (x, c1 + Φε (x, c2 · · · + Φε (x, cn )) . . . ).

This scheme requires depth O(n) due to the nested multiplications. An alternative is to approximate
all monomials 1, x, . . . , xn with a binary tree using approximate multiplications Φ×
ε , and combing
them in the output layer, see Figure 7.3. This idea leads to a network of size O(n log(n)) and depth
O(log(n)). The following lemma formalizes this, see [208, Lemma A.5], [77, Proposition III.5], and
in particular [199, Lemma 4.3]. The proof is left as Exercise 7.13.

Lemma 7.5. There exists a constant C > 0, such that for any ε ∈ (0, 1) and any polynomial p of
degree n ≥ 2 as in (7.3.1), there exists a neural network Φpε such that
n
X
sup |p(x) − Φpε (x)| ≤ Cε |cj |
x∈[−1,1] j=0

and size(Φpε ) ≤ Cn log(n/ε) and depth(Φpε ) ≤ C log(n/ε).

Lemma 7.5 shows that deep ReLU networks can approximate polynomials efficiently. This
leads to an interesting implication regarding the superiority of deep architectures. Recall that
f : [−1, 1] → R is analytic if its Taylor series around any point x ∈ [−1, 1] converges to f in a
neighbourhood of x. For instance all polynomials, sin, cos, exp etc. are analytic. We now show that
these functions (except linear ones) can be approximated much more efficiently with deep ReLU
networks than by shallow ones: for fixed-depth networks, the number of parameters must grow
faster than any polynomial compared to the required size of deep architectures. More precisely
there holds the following.

86
1 x

1 x x2

1 x x2 x3 x4

Figure 7.3: Monomials 1, . . . , xn with n = 2k can be generated in a binary tree of depth k. Each
node represents the product of its inputs, with single-input nodes interpreted as squares.

Proposition 7.6. Let L ∈ N and let f : [−1, 1] → R be analytic but not linear. Then there exist
constants C, β > 0 such that for every ε > 0, there exists a ReLU neural network Φdeep satisfying
 q 
sup |f (x) − Φdeep (x)| ≤ C exp − β size(Φdeep ) ≤ ε, (7.3.2)
x∈[−1,1]

but for any ReLU neural network Φshallow of depth at most L holds

sup |f (x) − Φshallow (x)| ≥ C −1 size(Φshallow )−2L . (7.3.3)


x∈[−1,1]

Proof. The lower bound on (7.3.3) holds by Theorem 6.4.


Let us show the upper bound on the deep neural network. Assume first that the convergence
radius of the Taylor series of f around 0 is r > 1. Then for all x ∈ [−1, 1]
X f (j) (0)
f (x) = cj xj where cj = and |cj | ≤ Cr r−j ,
j!
j∈N0
Pn j
for all j ∈ N0 and some Cr > 0. Hence pn (x) := j=0 cj x satisfies
X X Cr r−n
sup |f (x) − pn (x)| ≤ |cj | ≤ Cr r−j ≤ .
x∈[−1,1] 1 − r−1
j>n j>n

Let now Φpε n be the network in Lemma 7.5. Then


 r−n 
sup |f (x) − Φpε n (x)| ≤ C̃ · ε + ,
x∈[−1,1] 1 − r−1

for some C̃ = C̃(r, Cr ). Choosing n = n(ε) = ⌈log(ε)/ log(r)⌉, with the bounds from Lemma 7.5
we find that
sup |f (x) − Φpε n (x)| ≤ 2C̃ε
x∈[−1,1]

and for another constant Ĉ = Ĉ(r)


size(Φpε n ) ≤ Ĉ · (1 + log(ε)2 ) and depth(Φpε n ) ≤ Ĉ · (1 + log(ε)).

87
This implies the existence of C, β > 0 and Φdeep as in (7.3.2).
The general case, where the Taylor expansions of f converges only locally is left as Exercise
7.14.

The proposition shows that the approximation of certain (highly relevant) functions requires
significantly more parameters when using shallow instead of deep architectures. Such statements
are known as depth separation results. We refer for instance to [268, 269, 271], where such a result
was shown by Telgarsky based on the sawtooth function constructed in Section 6.2. Lower bounds
on the approximation in the spirit of Proposition 7.6 were also given in [163] and [292].
Remark 7.7. Proposition
√ 7.6 shows in particular that for analytic f : [−1, 1] → R, holds the error
bound exp(−β N ) in terms of the network size N . This can be generalized to multivariate analytic
functions f : [−1, 1]d → R, in which case the bound reads exp(−βN 1/(1+d) ), see [75, 200].

7.4 C k,s functions


We will now discuss the implications of our observations in the previous sections for the approxi-
mation of functions in the class C k,s .

Definition 7.8. Let k ∈ N0 , s ∈ [0, 1] and Ω ⊆ Rd . Then for f : Ω → R

∥f ∥C k,s (Ω) := sup max |Dα f (x)|


x∈Ω {α∈Nd0 | |α|≤k}
|Dα f (x) − Dα f (y)| (7.4.1)
+ sup max ,
x̸=y∈Ω {α∈Nd0 | |α|=k} ∥x − y∥s

and we denote by C k,s (Ω) the set of functions f ∈ C k (Ω) for which ∥f ∥C k,s (Ω) < ∞.

Note that these spaces are ordered according to

C k (Ω) ⊇ C k,s (Ω) ⊇ C k,t (Ω) ⊇ C k+1 (Ω)

for all 0 < s ≤ t ≤ 1.


In order to state our main result, we first recall a version of Taylor’s remainder formula for
C k,s (Ω) functions.

Lemma 7.9. Let d ∈ N, k ∈ N, s ∈ [0, 1], Ω = [0, 1]d and f ∈ C k,s (Ω). Then for all a, x ∈ Ω
X Dα f (a)
f (x) = (x − a)α + Rk (x) (7.4.2)
α!
{α∈Nd0 | 0≤|α|≤k}

k+1/2
where with h := maxi≤d |ai − xi | we have |Rk (x)| ≤ hk+s d k! ∥f ∥C k,s (Ω) .

88
Proof. First, for a function g ∈ C k (R) and a, t ∈ R
k−1 (j)
X g (a) g (k) (ξ)
g(t) = (t − a)j + (t − a)k
j! k!
j=0
k
X g (j) (a) g (k) (ξ) − g (k) (a)
= (t − a)j + (t − a)k ,
j! k!
j=0

for some ξ between a and t. Now let f ∈ C k,s (Rd ) and a, x ∈ Rd . Thus with g(t) := f (a+t·(x−a))
holds for f (x) = g(1)
k−1 (j)
X g (0) g (k) (ξ)
f (x) = + .
j! k!
j=0

By the chain rule


 
(j)
X j
g (t) = Dα f (a + t · (x − a))(x − a)α ,
α
{α∈Nd0 | |α|=j}

j j! j! Qd
and (x − a)α = − aj )αj .

where we use the multivariate notations α = α! = Qd j=1 (xj
j=1 αj !
Hence
X Dα f (a)
f (x) = (x − a)α
α!
{α∈Nd0 | 0≤|α|≤k}
| {z }
∈Pk
X Dα f (a + ξ · (x − a)) − Dα f (a)
+ (x − a)α ,
α!
|α|=k
| {z }
=:Rk

for some ξ ∈ [0, 1]. Using the definition of h, the remainder term can be bounded by
 
k α α 1 X k
|Rk | ≤ h max sup |D f (a + t · (x − a)) − D f (a)|
|α|=k x∈Ω k! d
α
t∈[0,1] {α∈N0 | |α|=k}
k+ 2s
d
≤ hk+s ∥f ∥C k,s (Ω) ,
k!
√ k
= (1 + · · · + 1)k = dk by the
P 
where we used (7.4.1), ∥x − a∥ ≤ dh, and {α∈Nd0 | |α|=k} α
multinomial formula.

We now come to the main statement of this section. Up to logarithmic terms, it shows the
convergence rate (k + s)/d for approximating functions in C k,s ([0, 1]d ).

89
Theorem 7.10. Let d ∈ N, k ∈ N0 , s ∈ [0, 1], and Ω = [0, 1]d . Then, there exists a constant C > 0
such that for every f ∈ C k,s (Ω) and every N ≥ 2 there exists a ReLU neural network ΦfN such that
k+s
sup |f (x) − ΦfN (x)| ≤ C∥f ∥C k,s (Ω) N − d , (7.4.3)
x∈Ω

size(ΦfN ) ≤ CN log(N ) and depth(ΦfN ) ≤ C log(N ).

Proof. The idea of the proof is to use the so-called “partition of unity method”: First we will
construct a partition of unity (φν )ν , such that for an appropriately chosen M ∈ N each φν has
support on a O(1/M ) neighborhood of a point η ∈ Ω. On each of these neighborhoods P we will use
the local Taylor polynomial pν of f around η to approximate the function. Then ν φν pν gives
an
P approximation to f on Ω. This approximation can be emulated by a neural network of the type
×
ν Φε (φν , p̂ν ), where p̂ν is an neural network approximation to the polynomial pν .
It suffices to show the theorem in the case where
( )
dk+1/2
max , exp(d) ∥f ∥C k,s (Ω) ≤ 1.
k!

The general case can then be immediately deduced by a scaling argument.


Step 1. We construct the neural network. Define
k+s
M := ⌈N 1/d ⌉ and ε := N − d . (7.4.4)
Consider a uniform simplicial mesh with nodes {ν/M | ν ≤ M } where ν/M := (ν1 /M, . . . , νd /M ),
and where “ν ≤ M ” is short for {ν ∈ Nd0 | νi ≤ M for all i ≤ d}. We denote by φν the cpwl basis
function on this mesh such that φν (ν/M ) = 1 and φν (µ/M ) = 0 whenever µ ̸= ν. As shown in
Chapter 5, φν is a neural network of size O(1). Then
X
φν ≡ 1 on Ω, (7.4.5)
ν≤M

is a partition of unity. Moreover, observe that


 
ν 1
supp(φν ) ⊆ x ∈ Ω x− ≤ , (7.4.6)
M ∞ M
where ∥x∥∞ = maxi≤d |xi |.
For each ν ≤ M define the multivariate polynomial
X Dα f ν 

M ν α
pν (x) := x− ∈ Pk ,
α! M
|α|≤k

and the approximation


X Dα f ν
  νiα,1 νiα,k 
p̂ν (x) := M
Φ×
|α|,ε x iα,1 − , . . . , x iα,k − ,
α! M M
|α|≤k

90
where (iα,1 , . . . , iα,k ) ∈ {0, . . . , d}k is arbitrary but fixed such that |{j | iα,j = r}| = αr for all
r = 1, . . . , d. Finally, define
ΦfN :=
X
Φ×
ε (φν , p̂ν ). (7.4.7)
ν≤M

Step 2. We bound the approximation error. First, for each x ∈ Ω, using (7.4.5) and (7.4.6)

X X
f (x) − φν (x)pν (x) ≤ |φν (x)||pν (x) − f (x)|
ν≤M ν≤M

≤ max sup |f (y) − pν (y)|.


ν≤M {y∈Ω | ∥ ν −y∥ ≤ 1 }
M ∞ M

By Lemma 7.9 we obtain


1
X dk+ 2
sup f (x) − φν (x)pν (x) ≤ M −(k+s) ∥f ∥C k,s (Ω) ≤ M −(k+s) . (7.4.8)
x∈Ω k!
ν≤M

Next, fix ν ≤ M and y ∈ Ω such that ∥ν/M − y∥∞ ≤ 1/M ≤ 1. Then by Proposition 7.4
 k 
X Dα f ν Y νi 
M
|pν (y) − p̂ν (y)| ≤ yiα,j − α,j
α! M
|α|≤k j=1
 
× νiα,1 iα,k
− Φ|α|,ε yiα,1 − , . . . , yiα,k −
M M
X Dα f ( ν )
M
≤ε ≤ ε exp(d)∥f ∥C k,s (Ω) ≤ ε, (7.4.9)
α!
|α|≤k

where we used |Dα f (ν/M )| ≤ ∥f ∥C k,s (Ω) and


k k
X dj X dj ∞
X 1 X1 X j!
= = ≤ = exp(d).
α! j! α! j! j!
{α∈Nd0 | |α|≤k} j=0 {α∈Nd0 | |α|=j} j=0 j=0

Similarly, one shows that


|p̂ν (x)| ≤ exp(d)∥f ∥C k,s (Ω) ≤ 1 for all x ∈ Ω.
Fix x ∈ Ω. Then x belongs to a simplex of the mesh, and thus x can be in the support of at
most d + 1 (the number of nodes of a simplex) functions φν . Moreover, Lemma 7.3 implies that
supp Φ×
ε (φν (·), p̂ν (·)) ⊆ supp φν . Hence, using Lemma 7.3 and (7.4.9)

X X
φν (x)pν (x) − Φ×
ε (φν (x), p̂ν (x))
ν≤M ν≤M
X
≤ (|φν (x)pν (x) − φν (x)p̂ν (x)|
{ν≤M | x∈supp φν }

+ |φν (x)p̂ν (x) − Φ×



ε (φν (x), p̂ν (x))|
≤ ε + (d + 1)ε = (d + 2)ε.

91
In total, together with (7.4.8)

sup |f (x) − ΦfN (x)| ≤ M −(k+s) + ε · (d + 2).


x∈Ω

With our choices in (7.4.4) this yields the error bound (7.4.3).
Step 3. It remains to bound the size and depth of the neural network in (7.4.7).
By Lemma 5.17, for each 0 ≤ ν ≤ M we have

size(φν ) ≤ C · (1 + kT ), depth(φν ) ≤ C · (1 + log(kT )), (7.4.10)

where kT is the maximal number of simplices attached to a node in the mesh. Note that kT is
independent of M , so that the size and depth of φν are bounded by a constant Cφ independent of
M.
Lemma 7.3 and Proposition 7.4 thus imply with our choice of ε = N −(k+s)/d

depth(ΦfN ) = depth(Φ×
ε ) + max depth(φη ) + max depth(p̂ν )
ν≤M ν≤M
≤ C · (1 + | log(ε)| + Cφ ) + depth(Φ×
k,ε )
≤ C · (1 + | log(ε)| + Cφ )
≤ C · (1 + log(N ))

for some constant C > 0 depending on k and d (we use “C” to denote a generic constant that can
change its value in each line).
To bound the size, we first observe with Lemma 5.4 that
 
X  
size(p̂ν ) ≤ C · 1 + size Φ×
|α|,ε
 ≤ C · (1 + | log(ε)|)
|α|≤k

for some C depending on k. Thus, for the size of ΦfN we obtain with M = ⌈N 1/d ⌉
 

size(ΦfN ) ≤ C · 1 +
X
size(Φ×

ε ) + size(φν ) + size(p̂ν )

ν≤M

≤ C · (1 + M )d (1 + | log(ε)| + Cφ )
≤ C · (1 + N 1/d )d (1 + Cφ + log(N ))
≤ CN log(N ),

which concludes the proof.

Theorem 7.10 is similar in spirit to [292, Section 3.2]; the main differences are that [292] considers
the class C k ([0, 1]d ) instead of C k,s ([0, 1]d ), and uses an approximate partition of unity, while we
use the exact partition of unity constructed in Chapter 5. Up to logarithmic terms, the theorem
shows the convergence rate (k + s)/d. As long as k is large, in principle we can achieve arbitrarily
large (and d-independent if k ≥ d) convergence rates. In contrast to Theorem 5.23, achieving error
k+s
N − d requires depth O(log(N )), i.e. the neural network depth is required to increase. This can
be avoided however, and networks of depth O(k/d) suffice to attain these convergence rates [208].

92
Remark 7.11. Let L : x 7→ Ax + b : Rd → Rd be a bijective affine transformation and set
Ω := L([0, 1]d ) ⊆ Rd . Then for a function f ∈ C k,s (Ω), by Theorem 7.10 there exists a neural
network ΦfN such that

sup |f (x) − ΦfN (L−1 (x))| = sup |f (L(x)) − ΦfN (x)|


x∈Ω x∈[0,1]d
k+s
≤ C∥f ◦ L∥C k,s ([0,1]d ) N − d .

Since for x ∈ [0, 1]d holds |f (L(x))| ≤ supy∈Ω |f (y)| and if 0 ̸= α ∈ Nd0 is a multiindex |Dα (f (L(x))| ≤
∥A∥|α| supy∈Ω |Dα f (y)|, we have ∥f ◦ L∥C k,s ([0,1]d ) ≤ (1 + ∥A∥k+s )∥f ∥C k,s (Ω) . Thus the convergence
k+s
rate N − d is achieved on every set of the type L([0, 1]d ) for an affine map L, and in particular on
every hypercube ×dj=1 [aj , bj ].

Bibliography and further reading


This chapter is based on the seminal 2017 paper by Yarotsky [292], where the construction of
approximating the square function, the multiplication, and polynomials (discussed in Sections 7.1,
7.2, 7.3) was first introduced and analyzed. The construction relies on the sawtooth function
discussed in Section 6.2 and originally constructed by Telgarsky in [268]. Similar results were
obtained around the same time by Liang and Srikant via a bit extraction technique using both the
ReLU and the Heaviside function as activation functions [163]. These works have since sparked a
large body of research, as they allow to lift polynomial approximation theory to neural network
classes. Convergence results based on this type of argument include for example [208, 76, 180, 75,
200]. We also refer to [270] for related results on rational approximation.
The depth separation result in Section 7.3 is based on the exponential convergence rates obtained
for analytic functions in [75, 200], also see [77, Lemma III.7]. For the approximation of polynomials
with ReLU neural networks stated in Lemma 7.5, see, e.g., [208, 77, 199], and also [197, 198] for
constructions based on Chebyshev polynomials, which can be more efficient. For further depth
separation results, we refer to [268, 269, 78, 236, 7]. Moreover, closely related to such statements is
the 1987 thesis by Håstad [126], which considers the limitations of logic circuits in terms of depth.
The approximation result derived in Section 7.4 for C k,s functions follows by standard approx-
imation theory for piecewise polynomial functions, and is similar as in [292]. We point out that
such statements can also be shown for other activation functions than ReLU; see in particular the
works of Mhaskar [174, 175] and Section 6 in Pinkus’ Acta Numerica article [211] for sigmoidal and
smooth activations. Additionally, the more recent paper [63] specifically addresses the hyperbolic
tangent activation. Finally, [103] studies general activation functions that allow for the construction
of approximate partitions of unity.

93
Exercises
Exercise 7.12. We show another type of depth separation result: Let d ≥ 2. Prove that there
exist ReLU NNs Φ : Rd → R of depth two, which cannot be represented exactly by ReLU NNs
Φ : Rd → R of depth one.
Hint: Show that nonzero ReLU NNs of depth one necessarily have unbounded support.

Exercise 7.13. Prove Lemma 7.5.


Hint: Proceed by induction over the iteration depth in Figure 7.3.

Exercise 7.14. Show Proposition 7.6 in the general case where the Taylor series of f only converges
locally (see proof of Proposition 7.6).
Hint: Use the partition of unity method from the proof of Theorem 7.10.

94
Chapter 8

High-dimensional approximation

In the previous chapters we established convergence rates for the approximation of a function f :
[0, 1]d → R by a neural network. For example, Theorem 7.10 provides the error bound O(N −(k+s)/d )
in terms of the network size N (up to logarithmic terms), where k and s describe the smoothness
of f . Achieving an accuracy of ε > 0, therefore, necessitates a network size N = O(ε−d/(k+s) )
(according to this bound). Hence, the size of the network needs to increase exponentially in d.
This exponential dependence on the dimension d is referred to as the curse of dimensionality
[20]. For classical smoothness spaces, such exponential d dependence cannot be avoided [20, 69, 195].
However, functions f that are of interest in practice may have additional properties, which allow
for better convergence rates.
In this chapter, we discuss three scenarios under which the curse of dimensionality can be
mitigated. First, we examine an assumption limiting the behavior of functions in their Fourier
domain. This assumption allows for slow but dimension independent approximation rates. Second,
we consider functions with a specific compositional structure. Concretely, these functions are
constructed by compositions and linear combinations of simple low-dimensional subfunctions. In
this case, the curse of dimension is present but only through the input dimension of the subfunctions.
Finally, we study the situation, where we still approximate high-dimensional functions, but only care
about the approximation accuracy on a lower dimensional submanifold. Here, the approximation
rate is goverened by the smoothness and the dimension of the manifold.

8.1 The Barron class


In [13], Barron introduced a set of functions that can be approximated by neural networks without
a curse of dimensionality. This set, known as the Barron class, is characterized by a specific type
of bounded variation. To define it, for g ∈ L1 (Rd ) we denote by
Z

ǧ(w) := g(x)ei w x dx
Rd

its inverse Fourier transform. Then, for C > 0 the Barron class is defined as
 Z 
d 1 d
ΓC := f ∈ C(R ) ∃g ∈ L (R ), |ξ||g(ξ)| dξ ≤ C and f = ǧ .
Rd

95
We say that a function f ∈ ΓC has a finite Fourier moment, even though technically the Fourier
transform of f may not be well-defined, since f does not need to be integrable. By the Riemann-
Lebesgue Lemma, [99, Lemma 1.1.1], the condition f ∈ C(Rd ) in the definition of ΓC is automati-
cally satisfied if g ∈ L1 (Rd ) as in the definition exists.
The following proof approximation result for functions in ΓC is due to [13]. The presentation
of the proof is similar to [209, Section 5].

Theorem 8.1. Let σ : R → R be sigmoidal (see Definition 3.11) and let f ∈ ΓC for some C > 0.
Denote by B1d := {x ∈ Rd | ∥x∥ ≤ 1} the unit ball. Then, for every c > 4C 2 and every N ∈ N there
exists a neural network Φf with architecture (σ; d, N, 1) such that
Z
1 2 c
d
f (x) − Φf (x) dx ≤ , (8.1.1)
|B1 | B1d N

where |B1d | is the Lebesgue measure of B1d .

Remark 8.2. The approximation rate in (8.1.1) can be slightly improved under some assumptions
on the activation function such as powers of the ReLU, [253].
Importantly, the dimension d does not enter on the right-hand side of (8.1.1), in particular the
convergence rate is not directly affected by the dimension, which is in stark contrast to the results
of the previous chapters. However, it should be noted, that the constant C may still have some
inherent d-dependence, see Exercise 8.10.
The proof of Theorem 8.1 is based on a peculiar property of high-dimensional convex sets, which
is described by the (approximate) Carathéodory theorem, the original version of which was given
in [44]. The more general version stated in the following lemma follows [280, Theorem 0.0.2] and
[13, 212]. For its statement recall that co(G) denotes the the closure of the convex hull of G.

Lemma 8.3. Let H be a Hilbert space, and let G ⊆ H be such that for some B > 0 it holds that
∥g∥H ≤ B for all g ∈ G. Let f ∈ co(G). Then, for every N ∈ N and every c > B 2 there exist
(gi )N
i=1 ⊆ G such that

N 2
1 X c
f− gi ≤ . (8.1.2)
N N
i=1 H

Proof. Fix ε > 0 and N ∈ N. Since f ∈ co(G), there exist coefficients α1 , . . . , αm ∈ [0, 1] summing
to 1, and linearly independent elements h1 , . . . , hm ∈ G such that
m
X
f ∗ := αj hj
j=1

96
satisfies ∥f − f ∗ ∥H < ε. We claim that there exists g1 , . . . , gN , each in {h1 , . . . , hm }, such that
2
N
1 X B2
f∗ − gj ≤ . (8.1.3)
N N
j=1
H

Since ε > 0 was arbitrary, this then concludes the proof. Since there exists an isometric isomorphism
from span{h1 , . . . , hm } to Rm , there is no loss of generality in assuming H = Rm in the following.
Let Xi , i = 1, . . . , N , be i.i.d. Rm -valued random variables with

P[Xi = hj ] = αj for all i = 1, . . . , m.


Pm
In particular E[Xi ] = j=1 αj hj = f ∗ for each i. Moreover,
 2   2 
N N
∗ 1 X  = E 1
X
E f −
 Xj (f ∗ − Xj ) 
N N
j=1 j=1
H H
" N #
1 X X
= 2 ∥f ∗ − Xj ∥2H + ⟨f ∗ − Xi , f ∗ − Xj ⟩H
N
j=1 i̸=j
1
= E[∥f ∗ − X1 ∥2H ]
N
1
= E[∥f ∗ ∥H − 2 ⟨f ∗ , X1 ⟩H + ∥X1 ∥2H ]
N
1 B2
= E[∥X1 ∥2H − ∥f ∗ ∥2H ] ≤ . (8.1.4)
N N
Here we used that the (Xi )N ∗ ∗ ∗
i=1 are i.i.d., the fact that E[Xi ] = f , as well as E⟨f − Xi , f − Xj ⟩ = 0
if i ̸= j. Since the expectation in (8.1.4) is bounded by B 2 /N , there must exist at least one
realization of the random variables Xi ∈ {h1 , . . . , hm }, denoted as gi , for which (8.1.3) holds.

Lemma 8.3 provides a powerful tool: If we want to approximate a function f with a superposition
of N elements in a set G, then it is sufficient to show that f can be represented as an arbitrary
(infinite) convex combination of elements of G.
Lemma 8.3 suggests that we can prove Theorem 8.1 by showing that each function in ΓC belongs
to the closure of the convex hull of all neural networks with a single neuron, i.e. the set of all affine
transforms of the sigmoidal activation function σ. We make a small detour before proving this
result. We first show that each function f ∈ ΓC is in the closure of the convex hull of the set of
affine transforms of Heaviside functions, i.e. the set
n o
GC := B1d ∋ x 7→ γ · 1R+ (⟨a, x⟩ + b) a ∈ Rd , b ∈ R, |γ| ≤ 2C .

The following lemma, corresponding to [13, Theorem 2] and [209, Lemma 5.12], provides a link
between ΓC and GC .

97
Lemma 8.4. Let d ∈ N, C > 0 and f ∈ ΓC . Then f |B d − f (0) ∈ co(GC ), where the closure is
1
taken with respect to the norm
Z !1/2
1 2
∥g∥L2,⋄ (B d ) := |g(x)| dx . (8.1.5)
1 |B1d | B1d

Proof. Step 1. We express f (x) via an integral.


Since f ∈ ΓC , we have that there exist g ∈ L1 (Rd ) such that for all x ∈ Rd
Z  
f (x) − f (0) = g(ξ) ei⟨x,ξ⟩ − 1 dξ
d
ZR  
= |g(ξ)| ei(⟨x,ξ⟩+κ(ξ)) − ei κ(ξ) dξ
d
ZR

= |g(ξ)| cos(⟨x, ξ⟩ + κ(ξ)) − cos(κ(ξ)) dξ, (8.1.6)
Rd

where κ(ξ) is the phase of g(ξ), i.e. g(ξ) = |g(ξ)|ei κ(ξ) , and the last equality follows since f is
real-valued. Define a measure µ on Rd via its Lebesgue density
1
dµ(ξ) := |ξ||g(ξ)| dξ,
C′
where C ′ :=
R
|ξ||g(ξ)| dξ ≤ C; this is possible since f ∈ ΓC . Then (8.1.6) leads to
cos(⟨x, ξ⟩ + κ(ξ)) − cos(κ(ξ))
Z

f (x) − f (0) = C dµ(ξ). (8.1.7)
Rd |ξ|
Step 2. We show that x 7→ f (x) − f (0) is in the L2,⋄ (B1d ) closure of convex combinations of
the functions x 7→ qx (θ), where θ ∈ Rd , and
qx :B1d → R
cos(⟨x, ξ⟩ + κ(ξ)) − cos(κ(ξ)) (8.1.8)
ξ 7→ C ′ .
|ξ|
The cosine function is 1-Lipschitz. Hence for any ξ ∈ Rd the map (8.1.8) is bounded by one. In
addition, it is easy to see that qx is well-defined and continuous even in the origin. Therefore, for
x ∈ B1d , the integral (8.1.7) can be approximated by a Riemann sum, i.e.,
Z X

C qx (ξ) dµ(ξ) − C ′ qx (θ) · µ(Iθ ) → 0 as n → ∞ (8.1.9)
Rd 1 d
θ∈ n Z

where Iθ := [0, 1/n)d + θ. Since x 7→ f (x) − f (0) is continuous and thus bounded on B1d , we have
by the dominated convergence theorem that
2
Z
1 X
f (x) − f (0) − C ′ qx (θ) · µ(Iθ ) dx → 0. (8.1.10)
|B1d | B1d 1 d
θ∈ n Z

98
µ(Iθ ) = µ(Rd ) = 1, the claim holds.
P
Since 1 d
θ∈ n Z
Step 3. We prove that x 7→ qx (θ) is in the L2,⋄ (B1d ) closure of convex combinations of GC for
every θ ∈ Rd . Together with Step 2, this then concludes the proof.
Setting z = ⟨x, θ/|θ|⟩, the result follows if the maps

hθ :[−1, 1] → R
cos(|θ|z + κ(θ)) − cos(κ(θ)) (8.1.11)
z 7→ C ′
|θ|

can be approximated arbitrarily well by convex combinations of functions of the form

[−1, 1] ∋ z 7→ γ1R+ a′ z + b′ ,

(8.1.12)

where a′ , b′ ∈ R and |γ| ≤ 2C. To show this define for T ∈ N


T i
− hθ i−1
         
X hθ T T i i−1 i
gT,+ := 2Csign hθ − hθ 1R+ x − ,
2C T T T
i=1
T
− Ti − hθ 1−i
         
X hθ T i 1−i i
gT,− := 2Csign hθ − − hθ 1R+ −x + .
2C T T T
i=1

By construction, gT,− + gT,+ is a piecewise constant approximation to hθ that interpolates hθ at


i/T for i = 1, . . . , T . Since hθ is continuous, we have that gT,− + gT,+ → hθ uniformly as T → ∞.
Moreover, ∥h′θ ∥L∞ (R) ≤ C and hence

T T
X |hθ (i/T ) − hθ ((i − 1)/T )| X |hθ (−i/T ) − hθ ((1 − i)/T )|
+
2C 2C
i=1 i=1
T
2 X ′
≤ ∥hθ ∥L∞ (R) ≤ 1,
2CT
i=1

where we used C ′ ≤ C for the last inequality. We conclude that gT,− + gT,+ is a convex combina-
tion of functions of the form (8.1.12). Hence, hθ can be arbitrarily well approximated by convex
combinations of the form (8.1.12). This concludes the proof.

We now have all tools to complete the proof of Theorem 8.1.

Proof (of Theorem 8.1). Let f ∈ ΓC . By Lemma 8.4

f |B d − f (0) ∈ co(GC ),
1

where the closure is understood with respect to the norm (8.1.5). It is not hard to see that for
every g ∈ GC it holds that ∥g∥L2,⋄ (B d ) ≤ 2C. Applying Lemma 8.3 with the Hilbert space L2,⋄ (B1d ),
1
we get that for every N ∈ N there exist |γi | ≤ 2C, ai ∈ Rd , bi ∈ R, for i = 1, . . . , N , so that

N 2
4C 2
Z
1 1 X
f (x) − f (0) − γi 1R+ (⟨ai , x⟩ + bi ) dx ≤ .
|B1d | B1d N
i=1
N

99
By Exercise 3.24, it holds that σ(λ·) → 1R+ for λ → ∞ almost everywhere. Thus, for every δ > 0
there exist ãi , b̃i , i = 1, . . . , N , so that

N 2
4C 2
Z
1 1 X  
f (x) − f (0) − γi σ ⟨ãi , x⟩ + b̃i dx ≤ + δ.
|B1d | B1d N
i=1
N
The result follows by observing that
N
1 X  
γi σ ⟨ãi , x⟩ + b̃i + f (0)
N
i=1

is a neural network with architecture (σ; d, N, 1).

The dimension-independent approximation rate of Theorem 8.1 may seem surprising, especially
when comparing to the results in Chapters 4 and 5. However, this can be explained by recognizing
that the assumption of a finite Fourier moment is effectively a dimension-dependent regularity
assumption. Indeed, the condition becomes more restrictive in higher dimensions and hence the
complexity of ΓC does not grow with the dimension.
To further explain this, let us relate the Barron class to classical function spaces. In [13, Section
II] it was observed that a sufficient condition is that all derivatives of order up to ⌊d/2⌋ + 2 are
square-integrable. In other words, if f belongs to the Sobolev space H ⌊d/2⌋+2 (Rd ), then f is a
Barron function. Importantly, the functions must become smoother, as the dimension increases.
This assumption would also imply an approximation rate of N −1/2 in the L2 norm by sums of
at most N B-splines, see [202, 69]. However, in such estimates some constants may still depend
exponentially on d, whereas all constants in Theorem 8.1 are controlled independently of d.
Another notable aspect of the approximation of Barron functions is that the absolute values
of the weights other than the output weights are not bounded by a constant. To see this, we
refer to (8.1.9), where arbitrarily large θ need to be used. While ΓC is a compact set, the set of
neural networks of the specified architecture for a fixed N ∈ N is not parameterized with a compact
parameter set. In a certain sense, this is reminiscent of Proposition 3.19 and Theorem 3.20, where
arbitrarily strong approximation rates where achieved by using a very complex activation function
and a non-compact parameter space.

8.2 Functions with compositionality structure


As a next instance of types of functions for which the curse of dimensionality can be overcome, we
study functions with compositional structure. In words, this means that we study high-dimensional
functions that are constructed by composing many low-dimensional functions. This point of view
was proposed in [214]. Note that this can be a realistic assumption in many cases, such as for
sensor networks, where local information is first aggregated in smaller clusters of sensors before
some information is sent to a processing unit for further evaluation.
We introduce a model for compositional functions next. Consider a directed acyclic graph G
with M vertices η1 , . . . , ηM such that
• exactly d vertices, η1 , . . . , ηd , have no ingoing edge,

• each vertex has at most m ∈ N ingoing edges,

100
• exactly one vertex, ηM , has no outgoing edge.

With each vertex ηj for j > d we associate a function fj : Rdj → R. Here dj denotes the
cardinality of the set Sj , which is defined as the set of indices i corresponding to vertices ηi for
which we have an edge from ηi to ηj . Without loss of generality, we assume that m ≥ dj = |Sj | ≥ 1
for all j > d. Finally, we let

Fj := xj for all j≤d (8.2.1a)

and1

Fj := fj ((Fi )i∈Sj ) for all j > d. (8.2.1b)

Then FM (x1 , . . . , xd ) is a function from Rd → R. Assuming

∥fj ∥C k,s (Rdj ) ≤ 1 for all j = d + 1, . . . , M, (8.2.2)

we denote the set of all functions of the type FM by F k,s (m, d, M ). Figure 8.1 shows possible
graphs of such functions.
Clearly, for s = 0, F k,0 (m, d, M ) ⊆ C k (Rd ) since the composition of functions in C k belongs
again to C k . A direct application of Theorem 7.10 allows to approximate FM ∈ F k (m, d, M ) with a
k
neural network of size O(N log(N )) and error O(N − d ). Since each fj depends only on m variables,
k
intuitively we expect an error convergence of type O(N − m ) with the constant somehow depending
on the number M of vertices. To show that this is actually possible, in the following we associate
with each node ηj a depth lj ≥ 0, such that lj is the maximum number of edges connecting ηj to
one of the nodes {η1 , . . . , ηd }.

Figure 8.1: Three types of graphs that could be the basis of compositional functions. The associated
functions are composed of two or three-dimensional functions only.

1
The ordering of the inputs (Fi )i∈Sj in (8.2.1b) is arbitrary but considered fixed throughout.

101
Proposition 8.5. Let k, m, d, M ∈ N and s > 0. Let FM ∈ F k,s (m, d, M ). Then there exists a
constant C = C(m, k + s, M ) such that for every N ∈ N there exists a ReLU neural network ΦFM
such that

size(ΦFM ) ≤ CN log(N ), depth(ΦFM ) ≤ C log(N )

and
k+s
sup |FM (x) − ΦFM (x)| ≤ N − m .
x∈[0,1]d

Proof. Throughout this proof we assume without loss of generality that the indices follow a topo-
logical ordering, i.e., they are ordered such that Sj ⊆ {1, . . . , j − 1} for all j (i.e. the inputs of
vertex ηj can only be vertices ηi with i < j).
Step 1. First assume that fˆj are functions such that with 0 < ε ≤ 1

|fj (x) − fˆj (x)| ≤ δj := ε · (2m)−(M +1−j) for all x ∈ [−2, 2]dj . (8.2.3)

Let F̂j be defined as in (8.2.1), but with all fj in (8.2.1b) replaced by fˆj . We now check the error
of the approximation F̂M to FM . To do so we proceed by induction over j and show that for all
x ∈ [−1, 1]d

|Fj (x) − F̂j (x)| ≤ (2m)−(M −j) ε. (8.2.4)

Note that due to ∥fj ∥C k ≤ 1 we have |Fj (x)| ≤ 1 and thus (8.2.4) implies in particular that
F̂j (x) ∈ [−2, 2].
For j = 1 it holds F1 (x1 ) = F̂1 (x1 ) = x1 , and thus (8.2.4) is valid for all x1 ∈ [−1, 1]. For the
induction step, for all x ∈ [−1, 1]d by (8.2.3) and the induction hypothesis

|Fj (x) − F̂j (x)| = |fj ((Fi )i∈Sj ) − fˆj ((F̂i )i∈Sj )|
= |fj ((Fi )i∈Sj ) − fj ((F̂i )i∈Sj )| + |fj ((F̂i )i∈Sj ) − fˆj ((F̂i )i∈Sj )|
X
≤ |Fi − F̂i | + δj
i∈Sj

≤ m · (2m)−(M −(j−1)) ε + (2m)−(M +1−j) ε


≤ (2m)−(M −j) ε.

Here we used that | dxdr fj ((xi )i∈Sj )| ≤ 1 for all r ∈ Sj so that by the triangle inequality and the
mean value theorem
X
|fj ((xi )i∈Sj ) − fj ((yi )i∈Sj )| ≤ |f ((xi )i∈Sj , (yi )i∈Sj ) − f ((xi )i∈Sj , (yi )i∈Sj )|
r∈Sj i≤r i>r i<r i≥r
X
≤ |xr − yr |.
r∈Sj

102
This shows that (8.2.4) holds, and thus for all x ∈ [−1, 1]d

|FM (x) − F̂M (x)| ≤ ε.

Step 2. We sketch a construction, of how to write F̂M from Step 1 as a neural network ΦFM
of the asserted size and depth bounds. Fix N ∈ N and let
m
Nj := ⌈N (2m) k+s (M +1−j) ⌉.

By Theorem 7.10, since dj ≤ m, we can find a neural network Φfj satisfying


− k+s k+s
sup |fj (x) − Φfj (x)| ≤ Nj m
≤ N− m (2m)−(M +1−j) (8.2.5)
dj
x∈[−2,2]

and
 
fj
m(M +1−j) m(M + 1 − j)
size(Φ ) ≤ CNj log(Nj ) ≤ CN (2m) k+s log(N ) + log(2m)
k+s
as well as
 
fj m(M + 1 − j)
depth(Φ ) ≤ C · log(N ) + log(2m) .
k+s
Then
M M
X X m(M +1−j)
size(Φfj ) ≤ 2CN log(N ) (2m) k+s

j=1 j=1
M  j
X m
≤ 2CN log(N ) (2m) k+s
j=1
m(M +1)
≤ 2CN log(N )(2m) k+s .
PM j
R M +1 1 M +1 .
Here we used j=1 a ≤ 1 exp(log(a)x) dx ≤ log(a) a
− k+s
The function F̂M from Step 1 then will yield error N by (8.2.3) and (8.2.5). We observe that
m

F̂M can be constructed inductively as a neural network ΦFM by propagating all values ΦF1 , . . . , Φ̂Fj
to all consecutive layers using identity neural networks and then using the outputs of (ΦFi )i∈Sj+1
as input to Φfj+1 . The depth of this neural network is bounded by
M
X
depth(Φfj ) = O(M log(N )).
j=1

We have at most M
P
j=1 |Sj | ≤ mM values which need to be propagated through these O(M log(N ))
layers, amounting to an overhead O(mM 2 log(N )) = O(log(N )) for the identity neural networks.
In all the neural network size is thus O(N log(N )).

Remark 8.6. From the proof we observe that the constant C in Proposition 8.5 behaves like
m(M +1)
O((2m) k+s ).

103
M

Figure 8.2: One-dimensional sub-manifold of three-dimensional space. At the orange point, we


depict a ball and the tangent space of the manifold.

8.3 Functions on manifolds


Another instance in which the curse of dimension can be mitigated, is if the input to the network
belongs to Rd , but stems from an m-dimensional manifold M ⊆ Rd . If we only measure the
approximation error on M, then we can again show that it is m rather than d that determines the
rate of convergence.
To explain the idea, we assume in the following that M is a smooth, compact m-dimensional
manifold in Rd . Moreover, we suppose that there exists δ > 0 and finitely many points x1 , . . . , xM ∈
M such that the δ-balls Bδ/2 (xi ) := {y ∈ Rd | ∥y − x∥2 < δ/2} for j = 1, . . . , M cover M (for
every δ > 0 such xi exist since M is compact). Moreover, denoting by Tx M ≃ Rm the tangential
space of M at x, we assume δ > 0 to be so small that the orthogonal projection
πj : Bδ (xj ) ∩ M → Txj M (8.3.1)
is injective, the set πj (Bδ (xj ) ∩ M) ⊆ Txj M has C ∞ boundary, and the inverse of πj , i.e.

πj−1 : πj (Bδ (xj ) ∩ M) → M (8.3.2)


is C ∞ (this is possible because M is a smooth manifold). A visualization of this assumption is
shown in Figure 8.2.
Note that πj in (8.3.1) is a linear map, whereas πj−1 in (8.3.2) is in general non-linear.
For a function f : M → R we can then write
f (x) = f (πj−1 (πj (x))) = fj (πj (x)) for all x ∈ Bδ (xj ) ∩ M (8.3.3)

104
where

fj := f ◦ πj−1 : πj (Bδ (xj ) ∩ M) → R.

In the following, for f : M → R, k ∈ N0 , and s ∈ [0, 1) we let

∥f ∥C k,s (M) := sup ∥fj ∥C k,s (πj (Bδ (xj )∩M)) .


j=1,...,M

We now state the main result of this section.

Proposition 8.7. Let d, k ∈ N, s ≥ 0, and let M be a smooth, compact m-dimensional manifold


in Rd . Then there exists a constant C > 0 such that for all f ∈ C k,s (M) and every N ∈ N there
exists a ReLU neural network ΦfN such that size(ΦfN ) ≤ CN log(N ), depth(ΦfN ) ≤ C log(N ) and
k+s
sup |f (x) − ΦfN (x)| ≤ C∥f ∥C k,s (M) N − m .
x∈M

Proof. Since M is compact there exists A > 0 such that M ⊆ [−A, A]d . Similar as in the proof of
ν
Theorem 7.10, we consider a uniform mesh with nodes {−A + 2AP n | ν ≤ n}, and the corresponding
piecewise linear basis functions forming the partition of unity ν≤n φν ≡ 1 on [−A, A]d where
supp φν ⊆ {y ∈ Rd | ∥ νn − y∥∞ ≤ A n }. Let δ > 0 be as in the beginning of this section. Since M is
covered by the balls (Bδ/2 (xj ))M j=1 fixing n ∈ N large enough, for each ν such that supp φν ∩M =
, ̸ ∅
there exists j(ν) ∈ {1, . . . , M } such that supp φν ⊆ Bδ (xj(ν) ) and we set Ij := {ν ≤ n | j = j(ν)}.
Using (8.3.3) we then have for all x ∈ M

X M X
X
f (x) = φν (x)f (x) = φν (x)fj (πj (x)). (8.3.4)
ν≤n j=1 ν∈Ij

Next, we approximate the functions fj . Let Cj be the smallest (m-dimensional) cube in Txj M ≃
Rm such that πj (Bδ (xj ) ∩ M) ⊆ Cj . The function fj can be extended to a function on Cj (we will
use the same notation for this extension) such that

∥f ∥C k,s (Cj ) ≤ C∥f ∥C k,s (πj (Bδ (xj )∩M)) ,

for some constant depending on πj (Bδ (xj ) ∩ M) but independent of f . Such an extension result
can, for example, be found in [257, Chapter VI]. By Theorem 7.10 (also see Remark 7.11), there
exists a neural network fˆj : Cj → R such that
k+s
sup |fj (x) − fˆj (x)| ≤ CN − m (8.3.5)
x∈Cj

and

size(fˆj ) ≤ CN log(N ), depth(fˆj ) ≤ C log(N ).

105
k+s
To approximate f in (8.3.4) we now let with ε := N − m

M X
X
ΦN := Φ× ˆ
ε (φν , fi ◦ πj ),
j=1 ν∈Ij

where we note that πj is linear and thus fˆj ◦ πj can be expressed by a neural network. First let us
estimate the error of this approximation. For x ∈ M
M X
X
|f (x) − ΦN (x)| ≤ |φν (x)fj (πj (x)) − Φ× ˆ
ε (φν (x), fj (πj (x)))|
j=1 ν∈Ij
M X 
X
≤ |φν (x)fj (πj (x)) − φν (x)fˆj (πj (x))|
j=1 ν∈Ij

+|φν (x)fˆj (πj (x)) − Φ×
ε (φν (x), ˆj (πj (x)))|
f
M X
X M
X X
≤ sup ∥fi − fˆi ∥L∞ (Ci ) |φν (x)| + ε
i≤M j=1 ν∈Ij j=1 {ν∈Ij | x∈supp φν }
k+s k+s
≤ CN − m + dε ≤ CN − m ,

where we used that x can be in the support of at most d of the φν , and where C is a constant
depending on d and M.
Finally, let us bound the size and depth of this approximation. Using size(φν ) ≤ C, depth(φν ) ≤
C (see (5.3.12)) and size(Φ× ×
ε ) ≤ C log(ε) ≤ C log(N ) and depth(Φε ) ≤ Cdepth(ε) ≤ C log(N ) (see
Lemma 7.3) we find
M X 
X  XM X
size(Φ×
ε ) + size(φ ν ) + size(fˆi ◦ π j ) ≤ C log(N ) + C + CN log(N )
j=1 ν∈Ij j=1 ν∈Ij

= O(N log(N )),

which implies the bound on size(ΦN ). Moreover,


n o
depth(ΦN ) ≤ depth(Φ×
ε ) + max depth(φ ,
ν j
ˆ
f )
≤ C log(N ) + log(N ) = O(log(N )).

This completes the proof.

Bibliography and further reading


The ideas of Section 8.1 were originally developed in [13], with an extension to L∞ approximation
provided in [12]. These arguments can be extended to yield dimension-independent approximation
rates for high-dimensional discontinuous functions, provided the discontinuity follows a Barron
function, as shown in [209]. The Barron class has been generalized in various ways, as discussed in
[168, 167, 284, 285, 14].

106
The compositionality assumption of Section 8.2 was discussed in the form presented in [214].
An alternative approach, known as the hierarchical composition/interaction model, was studied in
[146].
The manifold assumption discussed in Section 8.3 is frequently found in the literature, with
notable examples including [249, 53, 48, 240, 185, 145].
Another prominent direction, omitted in this chapter, pertains to scientific machine learn-
ing. High-dimensional functions often arise from (parametric) PDEs, which have a rich literature
describing their properties and structure. Various results have shown that neural networks can
leverage the inherent low-dimensionality known to exist in such problems. Efficient approximation
of certain classes of high-dimensional (or even infinite-dimensional) analytic functions, ubiquitous
in parametric PDEs, has been verified in [246, 247]. Further general analyses for high-dimensional
parametric problems can be found in [201, 149], and results exploiting specific structural conditions
of the underlying PDEs, e.g., in [152, 235]. Additionally, [75, 180, 200] provide results regarding
fast convergence for certain smooth functions in potentially high but finite dimensions.
For high-dimensional PDEs, elliptic problems have been addressed in [100], linear and semilin-
ear parabolic evolution equations have been explored in [101, 93, 125], and stochastic differential
equations in [134, 102].

107
Exercises
Exercise 8.8. Let C > 0 and d ∈ N. Show that, if g ∈ ΓC , then

a−d g (a(· − b)) ∈ ΓC ,

for every a ∈ R+ , b ∈ Rd .

Exercise 8.9. Let C > 0 and d ∈ N. Show that, for gi ∈ ΓC , i = 1, . . . , m and c = (ci )m
i=1 it holds
that
Xm
ci gi ∈ Γ∥c∥1 C .
i=1

Exercise 8.10. Show that √ for every d ∈ N the function f (x) := exp(−∥x∥22 /2), x ∈ Rd , belongs
to Γd , and it holds Cf = O( d), for d → ∞.

Exercise 8.11. Let d ∈ N, and let f (x) = ∞ d


P
i=1 ci σReLU (⟨ai , x⟩ + bi ) for x ∈ R with ∥ai ∥ =
1, |bi | ≤ 1 for all i ∈ N. Show that for every N ∈ N, there exists a ReLU neural network with N
neurons and one layer such that
3∥c∥1
∥f − fN ∥L2 (B d ) ≤ √ .
1
N
Hence, every infinite ReLU neural network can be approximated at a rate O(N 1/2 ) by finite ReLU
neural networks of width N .

Exercise 8.12. Let C > 0 prove that every f ∈ ΓC is continuously differentiable.

108
Chapter 9

Interpolation

The learning problem associated to minimizing the empirical risk of (1.2.3) is based on minimizing
an error that results from evaluating a neural network on a finite set of (training) points. In
contrast, all previous approximation results focused on achieving uniformly small errors across the
entire domain. Finding neural networks that achieve a small training error appears to be much
simpler, since, instead of ∥f − Φn ∥∞ → 0 for a sequence of neural networks Φn , it suffices to have
Φn (xi ) → f (xi ) for all xi in the training set.
In this chapter, we study the extreme case of the aforementioned approximation problem. We
analyze under which conditions it is possible to find a neural network that coincides with the target
function f at all training points. This is referred to as interpolation. To make this notion more
precise, we state the following definition.

Definition 9.1 (Interpolation). Let d, m ∈ N, and let Ω ⊆ Rd . We say that a set of functions
H ⊆ {h : Ω → R} interpolates m points in Ω, if for every S = (xi , yi )m i=1 ⊆ Ω × R, such that
xi ̸= xj for i ̸= j, there exists a function h ∈ H such that h(xi ) = yi for all i = 1, . . . , m.

Knowing the interpolation properties of an architecture represents extremely valuable informa-


tion for two reasons:
• Consider an architecture that interpolates m points and let the number of training samples
be bounded by m. Then (1.2.3) always has a solution.
• Consider again an architecture that interpolates m points and assume that the number of
training samples is less than m. Then for every point x̃ not in the training set and every y ∈ R
there exists a minimizer h of (1.2.3) that satisfies h(x̃) = y. As a consequence, without further
restrictions (many of which we will discuss below), such an architecture cannot generalize to
unseen data.
The existence of solutions to the interpolation problem does not follow trivially from the approxi-
mation results provided in the previous chapters (even though we will later see that there is a close
connection). We also remark that the question of how many points neural networks with a given
architecture can interpolate is closely related to the so-called VC dimension, which we will study
in Chapter 14.

109
We start our analysis of the interpolation properties of neural networks by presenting a result
similar to the universal approximation theorem but for interpolation in the following section. In
the subsequent section, we then look at interpolation with desirable properties.

9.1 Universal interpolation


Under what conditions on the activation function and architecture can a set of neural networks
interpolate m ∈ N points? According to Chapter 3, particularly Theorem 3.8, we know that shallow
neural networks can approximate every continuous function with arbitrary accuracy, provided the
neural network width is large enough. As the neural network’s width and/or depth increases, the
architectures become increasingly powerful, leading us to expect that at some point, they should
be able to interpolate m points. However, this intuition may not be correct:

Example 9.2. Let H := {f ∈ C 0 ([0, 1]) | f (0) ∈ Q}. Then H is dense in C 0 ([0, 1]), but H does not
even interpolate one point in [0, 1].

Moreover, Theorem 3.8 is an asymptotic result that only states that a given function can be
approximated for sufficiently large neural network architectures, but it does not state how large
the architecture needs to be.
Surprisingly, Theorem 3.8 can nonetheless be used to give a guarantee that a fixed-size archi-
tecture yields sets of neural networks that allow the interpolation of m points. This result is due
to [211]; for a more detailed discussion of previous results see the bibliography section. Due to its
similarity to the universal approximation theorem and the fact that it uses the same assumptions,
we call the following theorem the “Universal Interpolation Theorem”. For its statement recall the
definition of the set of allowed activation functions M in (3.1.1) and the class Nd1 (σ, 1, n) of shallow
neural networks of width n introduced in Definition 3.6.

Theorem 9.3 (Universal Interpolation Theorem). Let d, n ∈ N and let σ ∈ M not be a polynomial.
Then Nd1 (σ, 1, n) interpolates n + 1 points in Rd .

Proof. Fix (xi )n+1 d n+1


i=1 ⊆ R arbitrary. We will show that for any (yi )i=1 ⊆ R there exist weights and
biases (wj )nj=1 ⊆ Rd , (bj )nj=1 , (vj )nj=1 ⊆ R, c ∈ R such that
n
X
Φ(xi ) := vj σ(w⊤
j xi + bj ) + c = yi for all i = 1, . . . , n + 1. (9.1.1)
j=1

Since Φ ∈ Nd1 (σ, 1, n) this then concludes the proof.


Denote
1 σ(w⊤ σ(w⊤
 
1 x1 + b1 ) ··· n x1 + bn )
A :=  ... .. .. ..
∈R
(n+1)×(n+1)
. (9.1.2)
 
. . .
⊤ ⊤
1 σ(w1 xn+1 + b1 ) · · · σ(wn xn+1 + bn )

110
Then A being regular implies that for each (yi )n+1 n
i=1 exist c and (vj )j=1 such that (9.1.1) holds.
Hence, it suffices to find (wj )nj=1 and (bj )nj=1 such that A is regular.
To do so, we proceed by induction over k = 0, . . . , n, to show that there exist (wj )kj=1 and
(bj )kj=1 such that the first k + 1 columns of A are linearly independent. The case k = 0 is trivial.
Next let 0 < k < n and assume that the first k columns of A are linearly independent. We wish to
find wk , bk such that the first k + 1 columns are linearly independent. Suppose such wk , bk do not
exist and denote by Yk ⊆ Rn+1 the space spanned by the first k columns of A. Then for all w ∈ Rn ,
b ∈ R the vector (σ(w⊤ xi + b))n+1 i=1 ∈ R
n+1 must belong to Y . Fix y = (y )n+1 ∈ Rn+1 \Y . Then
k i i=1 k

n+1
XX N 2
inf ∥(Φ̃(xi ))n+1
i=1 − y∥22 = inf vj σ(w⊤
j xi + bj ) + c − yi
Φ̃∈Nd1 (σ,1) N,wj ,bj ,vj ,c
i=1 j=1

≥ inf ∥ỹ − y∥22 > 0.


ỹ∈Yk

Since we can find a continuous function f : Rd → R such that f (xi ) = yi for all i = 1, . . . , n + 1,
this contradicts Theorem 3.8.

9.2 Optimal interpolation and reconstruction


Consider a bounded domain Ω ⊆ Rd , a function f : Ω → R, distinct points x1 , . . . , xm ∈ Ω, and
corresponding function values yi := f (xi ). Our objective is to approximate f based solely on the
data pairs (xi , yi ), i = 1, . . . , m. In this section, we will show that, under certain assumptions on
f , ReLU neural networks can express an “optimal” reconstruction which also turns out to be an
interpolant of the data.

9.2.1 Motivation
In the previous section, we observed that neural networks with m − 1 ∈ N hidden neurons can
interpolate m points for every reasonable activation function. However, not all interpolants are
equally suitable for a given application. For instance, consider Figure 9.1 for a comparison between
polynomial and piecewise affine interpolation on the unit interval.
The two interpolants exhibit rather different behaviors. In general, there is no way of deter-
mining which constitutes a better approximation to f . In particular, given our limited information
about f , we cannot accurately reconstruct any additional features that may exist between inter-
polation points x1 , . . . , xm . In accordance with Occam’s razor, it thus seems reasonable to assume
that f does not exhibit extreme oscillations or behave erratically between interpolation points.
As such, the piecewise interpolant appears preferable in this scenario. One way to formalize the
assumption that f does not “exhibit extreme oscillations” is to assume that the Lipschitz constant

|f (x) − f (y)|
Lip(f ) := sup (9.2.1)
x̸=y ∥x − y∥

of f is bounded by a fixed value M ∈ R. Here ∥ · ∥ denotes an arbitrary fixed norm on Rd .


How should we choose M ? For every function f : Ω → R satisfying

f (xi ) = yi for all i = 1, . . . , m, (9.2.2)

111
Figure 9.1: Interpolation of eight points by a polynomial of degree seven and by a piecewise affine
spline. The polynomial interpolation has a significantly larger derivative or Lipschitz constant than
the piecewise affine interpolator.

we have
|f (x) − f (y)| |yi − yj |
Lip(f ) = sup ≥ sup =: M̃ . (9.2.3)
x̸=y∈Ω ∥x − y∥ i̸=j ∥xi − xj ∥

Because of this, we fix M as a real number greater than or equal to M̃ for the remainder of our
analysis.

9.2.2 Optimal reconstruction for Lipschitz continuous functions


The above considerations raise the following question: Given only the information that the function
has Lipschitz constant at most M , what is the best reconstruction of f based on the data? We
consider here the “best reconstruction” to be a function that minimizes the L∞ -error in the worst
case. Specifically, with

LipM (Ω) := {f : Ω → R | Lip(f ) ≤ M }, (9.2.4)

denoting the set of all functions with Lipschitz constant at most M , we want to solve the following
problem:
Problem 9.4. We wish to find an element

Φ ∈ argminh:Ω→R sup sup |f (x) − h(x)|. (9.2.5)


f ∈LipM (Ω) x∈Ω
f satisfies (9.2.2)

The next theorem shows that a function Φ as in (9.2.5) indeed exists. This Φ not only allows
for an explicit formula, it also belongs to LipM (Ω) and additionally interpolates the data. Hence,
it is not just an optimal reconstruction, it is also an optimal interpolant. This theorem goes back
to [17], which, in turn, is based on [261].

112
Theorem 9.5. Let m, d ∈ N, Ω ⊆ Rd , f : Ω → R, and let x1 , . . . , xm ∈ Ω, y1 , . . . , ym ∈ R satisfy
(9.2.2) and (9.2.3) with M̃ ≥ 0. Further, let M ≥ M̃ .
Then, Problem 9.4 has at least one solution given by
1
Φ(x) := (fupper (x) + flower (x)) for x ∈ Ω, (9.2.6)
2
where

fupper (x) := min (yk + M ∥x − xk ∥)


k=1,...,m

flower (x) := max (yk − M ∥x − xk ∥).


k=1,...,m

Moreover, Φ ∈ LipM (Ω) and Φ interpolates the data (i.e. satisfies (9.2.2)).

Proof. First we claim that for all h1 , h2 ∈ LipM (Ω) holds max{h1 , h2 } ∈ LipM (Ω) as well as
min{h1 , h2 } ∈ LipM (Ω). Since min{h1 , h2 } = − max{−h1 , −h2 }, it suffices to show the claim for
the maximum. We need to check that
| max{h1 (x), h2 (x)} − max{h1 (y), h2 (y)}|
≤M (9.2.7)
∥x − y∥
for all x ̸= y ∈ Ω. Fix x ̸= y. Without loss of generality we assume that
max{h1 (x), h2 (x)} ≥ max{h1 (y), h2 (y)} and max{h1 (x), h2 (x)} = h1 (x).
If max{h1 (y), h2 (y)} = h1 (y) then the numerator in (9.2.7) equals h1 (x) − h1 (y) which is bounded
by M ∥x − y∥. If max{h1 (y), h2 (y)} = h2 (y), then the numerator equals h1 (x) − h2 (y) which is
bounded by h1 (x) − h1 (y) ≤ M ∥x − y∥. In either case (9.2.7) holds.
Clearly, x 7→ yk −M ∥x−xk ∥ ∈ LipM (Ω) for each k = 1, . . . , m and thus fupper , flower ∈ LipM (Ω)
as well as Φ ∈ LipM (Ω).
Next we claim that for all f ∈ LipM (Ω) satisfying (9.2.2) holds
flower (x) ≤ f (x) ≤ fupper (x) for all x ∈ Ω. (9.2.8)
This is true since for every k ∈ {1, . . . , m} and x ∈ Ω
|yk − f (x)| = |f (xk ) − f (x)| ≤ M ∥x − xk ∥
so that for all x ∈ Ω
f (x) ≤ min (yk + M ∥x − xk ∥), f (x) ≥ max (yk − M ∥x − xk ∥).
k=1,...,m k=1,...,m

Since fupper , flower ∈ LipM (Ω) satisfy (9.2.2), we conclude that for every h : Ω → R holds
sup sup |f (x) − h(x)| ≥ sup max{|flower (x) − h(x)|, |fupper (x) − h(x)|}
f ∈LipM (Ω) x∈Ω x∈Ω
f satisfies (9.2.2)
|flower (x) − fupper (x)|
≥ sup . (9.2.9)
x∈Ω 2

113
Moreover, using (9.2.8),

sup sup |f (x) − Φ(x)| ≤ sup max{|flower (x) − Φ(x)|, |fupper (x) − Φ(x)|}
f ∈LipM (Ω) x∈Ω x∈Ω
f satisfies (9.2.2)
|flower (x) − fupper (x)|
= sup . (9.2.10)
x∈Ω 2

Finally, (9.2.9) and (9.2.10) imply that Φ is a solution of Problem 9.4.

Figure 9.2 depicts fupper , flower , and Φ for the interpolation problem shown in Figure 9.1, while
Figure 9.3 provides a two-dimensional example.

Figure 9.2: Interpolation of the points from Figure 9.1 with the optimal Lipschitz interpolant.

9.2.3 Optimal ReLU reconstructions


So far everything was valid withPan arbitrary norm on Rd . For the next theorem, we will restrict
ourselves to the 1-norm ∥x∥1 = dj=1 |xj |. Using the explicit formula of Theorem 9.5, we will now
show the remarkable result that ReLU neural networks can exactly express an optimal reconstruc-
tion (in the sense of Problem 9.4) with a neural network whose size scales linearly in the product
of the dimension d and the number of data points m. Additionally, the proof is constructive, thus
allowing in principle for an explicit construction of the neural network without the need for training.

Theorem 9.6 (Optimal Lipschitz Reconstruction). Let m, d ∈ N, Ω ⊆ Rd , f : Ω → R, and let


x1 , . . . , xm ∈ Ω, y1 , . . . , ym ∈ R satisfy (9.2.2) and (9.2.3) with M̃ > 0. Further, let M ≥ M̃ and
let ∥ · ∥ = ∥ · ∥1 in (9.2.3) and (9.2.4).

114
Figure 9.3: Two-dimensional example of the interpolation method of (9.2.6). From top left to
bottom we see fupper , flower , and Φ. The interpolation points (xi , yi )6i=1 are marked with red
crosses.

115
Then, there exists a ReLU neural network Φ ∈ LipM (Ω) that interpolates the data (i.e. satisfies
(9.2.2)) and satisfies

Φ ∈ argminh:Ω→R sup sup |f (x) − h(x)|.


f ∈LipM (Ω) x∈Ω
f satisfies (9.2.2)

Moreover, depth(Φ) = O(log(m)), width(Φ) = O(dm) and all weights of Φ are bounded in absolute
value by max{M, ∥y∥∞ }.

Proof. To prove the result, we simply need to show that the function in (9.2.6) can be expressed as
a ReLU neural network with the size bounds described in the theorem. First we notice, that there
is a simple ReLU neural network that implements the 1-norm. It holds for all x ∈ Rd that
d
X
∥x∥1 = (σ(xi ) + σ(−xi )) .
i=1

Thus, there exists a ReLU neural network Φ∥·∥1 such that for all x ∈ Rd

width(Φ∥·∥1 ) = 2d, depth(Φ∥·∥1 ) = 1, Φ∥·∥1 (x) = ∥x∥1

As a result, there exist ReLU neural networks Φk : Rd → R, k = 1, . . . , m, such that

width(Φk ) = 2d, depth(Φk ) = 1, Φk (x) = yk + M ∥x − xk ∥1

for all x ∈ Rd . Using the parallelization of neural networks introduced in Section 5.1.3, there exists
a ReLU neural network Φall := (Φ1 , . . . , Φm ) : Rd → Rm such that

width(Φall ) = 4md, depth(Φall ) = 1

and

Φall (x) = (yk + M ∥x − xk ∥1 )m


k=1 for all x ∈ Rd .

Using Lemma 5.11, we can now find a ReLU neural network Φupper such that Φupper = fupper (x)
for all x ∈ Ω, width(Φupper ) ≤ max{16m, 4md}, and depth(Φupper ) ≤ 1 + log(m).
Essentially the same construction yields a ReLU neural network Φlower with the respective
properties. Lemma 5.4 then completes the proof.

Bibliography and further reading


The universal interpolation theorem stated in this chapter is due to [211, Theorem 5.1]. There exist
several earlier interpolation results, which were shown under stronger assumptions: In [237], the
interpolation property is already linked with a rank condition on the matrix (9.1.2). However, no
general conditions on the activation functions were formulated. In [130], the interpolation theorem
is established under the assumption that the activation function σ is continuous and nondecreasing,

limx→−∞ σ(x) = 0, and limx→∞ σ(x) = 1. This result was improved in [122], which dropped the
nondecreasing assumption on σ.
The main idea of the optimal Lipschitz interpolation theorem in Section 9.2 is due to [17]. A
neural network construction of Lipschitz interpolants, which however is not the optimal interpolant
in the sense of Problem 9.4, is given in [133, Theorem 2.27].

Exercises
Exercise 9.7. Under the assumptions of Theorem 9.5, we define for x ∈ Ω the set of nearest
neighbors by
Ix := argmini=1,...,m ∥xi − x∥.
The one-nearest-neighbor classifier f1NN is defined by
\[
f_{\mathrm{1NN}}(x) = \frac{1}{2}\Big(\min_{i \in I_x} y_i + \max_{i \in I_x} y_i\Big).
\]

Let ΦM be the function in (9.2.6). Show that for all x ∈ Ω

ΦM (x) → f1NN (x) as M → ∞.

Exercise 9.8. Extend Theorem 9.6 to the ∥ · ∥∞ -norm.


Hint: The resulting neural network will need to be deeper than the one of Theorem 9.6.

Chapter 10

Training of neural networks

Up to this point, we have discussed the representation and approximation of certain function classes
using neural networks. The second pillar of deep learning concerns the question of how to fit a
neural network to given data, i.e., having fixed an architecture, how to find suitable weights and
biases. This task amounts to minimizing a so-called objective function such as the empirical risk
R̂S in (1.2.3). Throughout this chapter we denote the objective function by

f : Rn → R,

and interpret it as a function of all neural network weights and biases collected in a vector in Rn .
The goal1 is to (approximately) determine a minimizer, i.e., some w∗ ∈ Rn satisfying

f (w∗ ) ≤ f (w) for all w ∈ Rn .

Standard approaches primarily include variants of (stochastic) gradient descent. These are the
focus of the present chapter, in which we discuss basic ideas and results in convex optimization
using gradient-based algorithms. In Sections 10.1, 10.2, and 10.3, we explore gradient descent,
stochastic gradient descent, and accelerated gradient descent, and provide convergence proofs for
smooth and strongly convex objectives. Section 10.4 discusses adaptive step sizes and explains
the core principles behind popular algorithms such as Adam. Finally, Section 10.5 introduces the
backpropagation algorithm, which enables the efficient application of gradient-based methods to
neural network training.

10.1 Gradient descent


The general idea of gradient descent is to start with some w0 ∈ Rn , and then apply sequential
updates by moving in the direction of steepest descent of the objective function. Assume for the
moment that f ∈ C 2 (Rn ), and denote the kth iterate by wk . Then

f (wk + v) = f (wk ) + v ⊤ ∇f (wk ) + O(∥v∥2 ) for ∥v∥2 → 0. (10.1.1)


1
In reality, the goal is more nuanced: rather than merely minimizing the objective function, we want to find a
parameter that yields a well-generalizing model. However, in this chapter we adopt the perspective of minimizing an
objective function.

This shows that the change in f around wk is locally described by the gradient ∇f (wk ). For
small v the contribution of the second order term is negligible, and the direction v along which the
decrease of the risk is maximized equals the negative gradient −∇f (wk ).
Thus, −∇f (wk ) is also called the direction of steepest descent. This leads to an update of the
form

wk+1 := wk − hk ∇f (wk ), (10.1.2)

where hk > 0 is referred to as the step size or learning rate. We refer to this iterative algorithm
as gradient descent.
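
In code, the update (10.1.2) with a constant learning rate takes only a few lines. The following NumPy sketch is purely illustrative (the objective, step size, and iteration count are made up):

```python
import numpy as np

def gradient_descent(grad_f, w0, h, num_steps):
    """Run the update w_{k+1} = w_k - h * grad_f(w_k) with a constant step size h."""
    w = np.array(w0, dtype=float)
    for _ in range(num_steps):
        w = w - h * grad_f(w)
    return w

# example: f(w) = ||w||^2 / 2, so grad_f(w) = w and the minimizer is 0
w_final = gradient_descent(grad_f=lambda w: w, w0=[1.0, -2.0], h=0.1, num_steps=100)
print(w_final)   # close to [0, 0]
```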

Figure 10.1: Two examples of gradient descent as defined in (10.1.2). The red points represent the
wk .

By (10.1.1) and (10.1.2) it holds (also see [26, Section 1.2])

f (wk+1 ) = f (wk ) − hk ∥∇f (wk )∥2 + O(h2k ), (10.1.3)

so that if ∇f (wk ) ̸= 0, a small enough step size hk ensures that the algorithm decreases the value
of the objective function. In practice, tuning the learning rate hk can be a subtle issue as it should
strike a balance between the following competing requirements:

(i) hk needs to be sufficiently small so that the second-order term in (10.1.3) is not dominating,
and the update (10.1.2) decreases the objective function.

(ii) hk should be large enough to ensure significant decrease of the objective function, which
facilitates faster convergence of the algorithm.

A learning rate that is too high might overshoot the minimum, while a rate that is too low results
in slow convergence. Common strategies include, in particular, constant learning rates (hk = h
for all k ∈ N0 ), learning rate schedules such as decaying learning rates (hk ↘ 0 as k → ∞), and
adaptive methods. For adaptive methods the algorithm dynamically adjusts hk based on the values
of f (wj ) or ∇f (wj ) for j ≤ k.

(Figure 10.2, panels left to right: smooth, convex, strongly convex)

Figure 10.2: The graph of L-smooth functions lies between two quadratic functions at each point,
see (10.1.4), the graph of convex functions lies above the tangent at each point, see (10.1.8), and
the graph of µ-strongly convex functions lies above a convex quadratic function at each point, see
(10.1.9).

10.1.1 Structural conditions and existence of minimizers


We start our analysis by discussing three key notions for analyzing gradient descent algorithms, be-
ginning with an intuitive (but loose) geometric description. A continuously differentiable objective
function f : Rn → R will be called

(i) smooth if, at each w ∈ Rn , f is bounded above and below by a quadratic function that
touches its graph at w,

(ii) convex if, at each w ∈ Rn , f lies above its tangent at w,

(iii) strongly convex if, at each w ∈ Rn , f lies above its tangent at w plus a convex quadratic
term.

These concepts are illustrated in Figure 10.2. We next give the precise mathematical definitions.

Definition 10.1. Let n ∈ N and L > 0. A function f : Rn → R is called L-smooth if f ∈ C 1 (Rn )


and
\[
f(v) \le f(w) + \langle \nabla f(w), v - w\rangle + \frac{L}{2}\|w - v\|^2 \qquad \text{for all } w, v \in \mathbb{R}^n, \tag{10.1.4a}
\]
\[
f(v) \ge f(w) + \langle \nabla f(w), v - w\rangle - \frac{L}{2}\|w - v\|^2 \qquad \text{for all } w, v \in \mathbb{R}^n. \tag{10.1.4b}
\]

By definition, L-smooth functions satisfy the geometric property (i). In the literature, L-
smoothness is often instead defined as Lipschitz continuity of the gradient

∥∇f (w) − ∇f (v)∥ ≤ L∥w − v∥ for all w, v ∈ Rn . (10.1.5)

Lemma 10.2. Let L > 0. Then f ∈ C 1 (Rn ) is L-smooth if and only if (10.1.5) holds.

Proof. We show that (10.1.4) implies (10.1.5). To this end assume first that f ∈ C 2 (Rn ), and that
(10.1.5) does not hold. Then we can find w ̸= v with
\[
\|w - v\| \sup_{\|e\|=1} \int_0^1 e^\top \nabla^2 f(v + t(w - v))\, \frac{w - v}{\|w - v\|}\, dt = \|\nabla f(w) - \nabla f(v)\| > L\|w - v\|,
\]

where ∇2 f ∈ Rn×n denotes the Hessian. Since the Hessian is symmetric, this implies existence of
u, e ∈ Rn with ∥e∥ = 1 and |e⊤ ∇2 f (u)e| > L. Without loss of generality

e⊤ ∇2 f (u)e > L. (10.1.6)

Then for h > 0 by Taylor’s formula


\[
f(u + he) = f(u) + h\langle \nabla f(u), e\rangle + \int_0^h e^\top \nabla^2 f(u + te)\, e\,(h - t)\, dt.
\]

Continuity of t 7→ e⊤ ∇2 f (u + te)e and (10.1.6) implies that for h > 0 small enough
\[
f(u + he) > f(u) + h\langle \nabla f(u), e\rangle + L \int_0^h (h - t)\, dt
= f(u) + \langle \nabla f(u), he\rangle + \frac{L}{2}\|he\|^2.
\]
This contradicts (10.1.4a).
Now let f ∈ C 1 (Rn ) and assume again that (10.1.5) does not hold. Then for every ε > 0 and
every compact set K ⊆ Rn there exists fε,K ∈ C 2 (Rn ) such that ∥f − fε,K ∥C 1 (K) < ε. By choosing
ε > 0 sufficiently small and K sufficiently large, it follows that fε,K violates (10.1.5). Consequently,
by the previous argument, fε,K must also violate (10.1.4), which in turn implies that f does not
satisfy (10.1.4) either.
Finally, the fact that (10.1.5) implies (10.1.4) is left as Exercise 10.17.

Definition 10.3. Let n ∈ N. A function f : Rn → R is called convex if and only if

f (λw + (1 − λ)v) ≤ λf (w) + (1 − λ)f (v) (10.1.7)

for all w, v ∈ Rn , λ ∈ (0, 1).

In case f is continuously differentiable, this is equivalent to the geometric property (ii) as the
next lemma shows. The proof is left as Exercise 10.18.

Lemma 10.4. Let f ∈ C 1 (Rn ). Then f is convex if and only if

f (v) ≥ f (w) + ⟨∇f (w), v − w⟩ for all w, v ∈ Rn . (10.1.8)

The concept of convexity is strengthened by so-called strong-convexity, which requires an addi-
tional positive quadratic term on the right-hand side of (10.1.8), and thus corresponds to geometric
property (iii) by definition.

Definition 10.5. Let n ∈ N and µ > 0. A function f : Rn → R is called µ-strongly convex if


f ∈ C 1 (Rn ) and
\[
f(v) \ge f(w) + \langle \nabla f(w), v - w\rangle + \frac{\mu}{2}\|v - w\|^2 \qquad \text{for all } w, v \in \mathbb{R}^n. \tag{10.1.9}
\]

A convex function need not be bounded from below (e.g. w 7→ w) and thus need not have any
(global) minimizers. And even if it is bounded from below, there need not exist minimizers (e.g.
w 7→ exp(w)). However we have the following statement.

Lemma 10.6. Let f : Rn → R. If f is

(i) convex, then the set of minimizers of f is convex and has cardinality 0, 1, or ∞,

(ii) µ-strongly convex, then f has exactly one minimizer.

Proof. Let f be convex, and assume that w∗ and v ∗ are two minimizers of f . Then every convex
combination λw∗ + (1 − λ)v ∗ , λ ∈ [0, 1], is also a minimizer due to (10.1.7). This shows the first
claim.
Now let f be µ-strongly convex. Then (10.1.9) implies f to be lower bounded by a convex
quadratic function. Hence there exists at least one minimizer w∗ , and ∇f (w∗ ) = 0. By (10.1.9)
we then have f (v) > f (w∗ ) for every v ̸= w∗ .

10.1.2 Convergence of gradient descent


As announced before, to analyze convergence, we focus on µ-strongly convex and L-smooth objec-
tives only; we refer to the bibliography section for further reading under weaker assumptions. The
following theorem, which establishes linear convergence of gradient descent, is a standard result
(see, e.g., [189, 41, 153]). The proof presented here is taken from [85, Theorem 3.6].
Recall that a sequence ek is said to converge linearly to 0, if and only if there exist constants
C > 0 and c ∈ [0, 1) such that

ek ≤ Cck for all k ∈ N0 .

The constant c is also referred to as the rate of convergence. Before giving the statement, we first
note that comparing (10.1.4a) and (10.1.9) it necessarily holds L ≥ µ and therefore κ := L/µ ≥ 1.
This term, known as the condition number of f , determines the rate of convergence.

Theorem 10.7. Let n ∈ N and L ≥ µ > 0. Let f : Rn → R be L-smooth and µ-strongly convex.
Further, let $h_k = h \in (0, 1/L]$ for all $k \in \mathbb{N}_0$, let $(w_k)_{k=0}^{\infty} \subseteq \mathbb{R}^n$ be defined by (10.1.2), and let $w_*$
be the unique minimizer of $f$.
Then, $f(w_k) \to f(w_*)$ and $w_k \to w_*$ converge linearly for $k \to \infty$. For the specific choice
$h = 1/L$ it holds for all $k \in \mathbb{N}$
\[
\|w_k - w_*\|^2 \le \Big(1 - \frac{\mu}{L}\Big)^{k} \|w_0 - w_*\|^2, \tag{10.1.10a}
\]
\[
f(w_k) - f(w_*) \le \frac{L}{2}\Big(1 - \frac{\mu}{L}\Big)^{k} \|w_0 - w_*\|^2. \tag{10.1.10b}
\]

Proof. It suffices to show (10.1.10a), since (10.1.10b) follows directly by (10.1.10a) and (10.1.4a)
because ∇f (w∗ ) = 0. The case k = 0 is trivial, so let k ∈ N.
Expanding wk = wk−1 − h∇f (wk−1 ) and using µ-strong convexity (10.1.9)

∥wk − w∗ ∥2 = ∥wk−1 − w∗ ∥2 − 2h ⟨∇f (wk−1 ), wk−1 − w∗ ⟩ + h2 ∥∇f (wk−1 )∥2


≤ (1 − µh)∥wk−1 − w∗ ∥2 − 2h · (f (wk−1 ) − f (w∗ )) + h2 ∥∇f (wk−1 )∥2 . (10.1.11)

To bound the sum of the last two terms, we first use (10.1.4a) to get

\[
f(w_k) \le f(w_{k-1}) + \langle \nabla f(w_{k-1}), -h\nabla f(w_{k-1})\rangle + \frac{L}{2}\|h\nabla f(w_{k-1})\|^2
= f(w_{k-1}) + \Big(\frac{L}{2} - \frac{1}{h}\Big) h^2 \|\nabla f(w_{k-1})\|^2,
\]

so that for h < 2/L


\[
h^2 \|\nabla f(w_{k-1})\|^2 \le \frac{1}{1/h - L/2}\big(f(w_{k-1}) - f(w_k)\big) \le \frac{1}{1/h - L/2}\big(f(w_{k-1}) - f(w_*)\big),
\]

and therefore

\[
-2h\big(f(w_{k-1}) - f(w_*)\big) + h^2 \|\nabla f(w_{k-1})\|^2 \le \Big(-2h + \frac{1}{1/h - L/2}\Big)\big(f(w_{k-1}) - f(w_*)\big).
\]

If 2h ≥ 1/(1/h − L/2), which is equivalent to h ≤ 1/L, then the last term is less or equal to zero.
Hence (10.1.11) implies for h ≤ 1/L

∥wk − w∗ ∥2 ≤ (1 − µh)∥wk−1 − w∗ ∥2 ≤ · · · ≤ (1 − µh)k ∥w0 − w∗ ∥2 .

This concludes the proof.
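
The bound (10.1.10a) is easy to check numerically on a simple quadratic. The sketch below is illustrative (made-up constants): $f(w) = \frac{1}{2} w^\top \mathrm{diag}(L,\mu)\, w$ is $L$-smooth and $\mu$-strongly convex with minimizer $w_* = 0$, and the iterates with $h = 1/L$ indeed stay below the bound.

```python
import numpy as np

L_, mu = 10.0, 1.0                    # smoothness and strong convexity constants
D = np.diag([L_, mu])                 # f(w) = 0.5 * w^T D w, minimizer w_* = 0
grad = lambda w: D @ w

w = np.array([1.0, 1.0])
w0_norm2 = np.sum(w**2)
h = 1.0 / L_
for k in range(1, 21):
    w = w - h * grad(w)
    bound = (1.0 - mu / L_) ** k * w0_norm2     # right-hand side of (10.1.10a)
    assert np.sum(w**2) <= bound + 1e-12
print("bound (10.1.10a) holds for the first 20 iterations")
```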

Remark 10.8 (Convex objective functions). Let f : Rn → R be a convex and L-smooth function
with a minimizer at w∗ . As shown in Lemma 10.6, the minimizer need not be unique, so we cannot
expect wk → w∗ in general. However, the objective values still converge. Specifically, under these
assumptions, the following holds [189, Theorem 2.1.14, Corollary 2.1.2]: If hk = h ∈ (0, 2/L) for all
$k \in \mathbb{N}_0$ and $(w_k)_{k=0}^{\infty} \subseteq \mathbb{R}^n$ is generated by (10.1.2), then

f (wk ) − f (w∗ ) = O(k −1 ) as k → ∞.

10.2 Stochastic gradient descent


We next discuss a stochastic variant of gradient descent. The idea, which originally goes back to
Robbins and Monro [228], is to replace the gradient ∇f (wk ) in (10.1.2) by a random variable that
we denote by Gk . We interpret Gk as an approximation to ∇f (wk ). More precisely, throughout
we assume that given wk , Gk is an unbiased estimator of ∇f (wk ) conditionally independent of
w0 , . . . , wk−1 (see Appendix A.3.3), so that

E[Gk |wk ] = E[Gk |wk , . . . , w0 ] = ∇f (wk ). (10.2.1)

After choosing some initial value w0 ∈ Rn , the update rule becomes

wk+1 := wk − hk Gk , (10.2.2)

where hk > 0 denotes again the step size. Unlike in Section 10.1, we focus here on k-dependent
step sizes hk .

10.2.1 Motivation and decreasing learning rates


The next example motivates the algorithm in the standard setting, e.g. [94, Chapter 8].

Example 10.9 (Empirical risk minimization). Suppose we have a data sample $S := (x_j, y_j)_{j=1}^m$,
where $y_j \in \mathbb{R}$ is the label corresponding to the data point $x_j \in \mathbb{R}^d$. Using the square loss, we wish
to fit a neural network Φ(·, w) : Rd → R depending on parameters (i.e. weights and biases) w ∈ Rn ,
such that the empirical risk in (1.2.3)
\[
f(w) := \widehat{R}_S(w) = \frac{1}{m}\sum_{j=1}^{m} \big(\Phi(x_j, w) - y_j\big)^2,
\]

is minimized. Performing one step of gradient descent requires the computation of


\[
\nabla f(w) = \frac{2}{m}\sum_{j=1}^{m} \big(\Phi(x_j, w) - y_j\big)\nabla_w \Phi(x_j, w), \tag{10.2.3}
\]

and thus the computation of m gradients of the neural network Φ. For large m (in practice m can
be in the millions or even larger), this computation might be infeasible.
To reduce computational cost, we replace the full gradient (10.2.3) by the stochastic gradient

G := 2(Φ(xj , w) − yj )∇w Φ(xj , w)

where j ∼ uniform(1, . . . , m) is a random variable with uniform distribution on the discrete set
{1, . . . , m}. Then
\[
\mathbb{E}[G] = \frac{2}{m}\sum_{j=1}^{m} \big(\Phi(x_j, w) - y_j\big)\nabla_w \Phi(x_j, w) = \nabla f(w),
\]

meaning that G is an unbiased estimator of ∇f (w). Importantly, computing (a realization of) G


merely requires a single gradient evaluation of the neural network.
More generally, one can choose mini-batches2 of size mb (where mb ≪ m) and let
\[
G = \frac{2}{m_b}\sum_{j \in J} \big(\Phi(x_j, w) - y_j\big)\nabla_w \Phi(x_j, w),
\]

where J is a random subset of {1, . . . , m} of cardinality mb . A larger mini-batch size reduces


the variance of G (thus giving a more robust estimate of the true gradient) but increases the
computational cost.
A related common variant is the following: Let mb k = m for mb , k, m ∈ N, i.e. the number
of data points m is a k-fold multiple of the mini-batch size mb . In each epoch, first a random
partition $\dot{\bigcup}_{i=1}^{k} J_i = \{1, \dots, m\}$ with $|J_i| = m_b$ for each $i$, is determined. Then for each $i = 1, \dots, k$,
the weights are updated with the gradient estimate
\[
\frac{2}{m_b}\sum_{j \in J_i} \big(\Phi(x_j, w) - y_j\big)\nabla_w \Phi(x_j, w).
\]

Hence, in one epoch (corresponding to k updates of the weights), the algorithm sweeps through the
whole dataset, and each data point is used exactly once. ♢
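
A minimal sketch of the mini-batch gradient estimator follows (illustrative only; a linear stand-in model is used for Φ so that $\nabla_w \Phi$ is available in closed form, and all data are randomly generated):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data and a linear stand-in model Phi(x, w) = x @ w
m, d = 1000, 5
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

def full_gradient(w):
    r = X @ w - y
    return 2.0 / m * X.T @ r                    # (10.2.3) for the linear model

def minibatch_gradient(w, batch_size):
    J = rng.choice(m, size=batch_size, replace=False)   # random mini-batch
    r = X[J] @ w - y[J]
    return 2.0 / batch_size * X[J].T @ r        # unbiased estimate of the full gradient

w = np.zeros(d)
est = np.mean([minibatch_gradient(w, 32) for _ in range(2000)], axis=0)
print(np.linalg.norm(est - full_gradient(w)))   # small: the estimator is unbiased
```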
Let wk be generated by (10.2.2). Using L-smoothness (10.1.5) we then have [33, Lemma 4.2]
\[
\mathbb{E}[f(w_{k+1})\,|\,w_k] - f(w_k) \le \mathbb{E}[\langle \nabla f(w_k), w_{k+1} - w_k\rangle\,|\,w_k] + \frac{L}{2}\mathbb{E}[\|w_{k+1} - w_k\|^2\,|\,w_k]
= -h_k\|\nabla f(w_k)\|^2 + \frac{L}{2}\mathbb{E}[\|h_k G_k\|^2\,|\,w_k].
\]
For the objective function to decrease in expectation, the first term hk ∥∇f (wk )∥2 should dominate
the second term L2 E[∥hk Gk ∥2 |wk ]. As wk approaches the minimum, we have ∥∇f (wk )∥ → 0,
which suggests that E[∥hk Gk ∥2 |wk ] should also decrease over time.
This is illustrated in Figure 10.3 (a), which shows the progression of stochastic gradient descent
(SGD) with a constant learning rate, hk = h, on a quadratic objective function and using artificially
added gradient noise, such that E[∥Gk ∥2 |wk ] does not tend to zero. The stochastic updates in
(10.2.2) cause fluctuations in the optimization path. Since these fluctuations do not decrease as the
algorithm approaches the minimum, the iterates will not converge. Instead they stabilize within
a neighborhood of the minimum, and oscillate around it, e.g. [85, Theorem 9.8]. In practice, this
might yield a good enough approximation to w∗ . To achieve convergence in the limit, the variance
of the update vector, −hk Gk , must decrease over time however. This can be achieved either by
reducing the variance of Gk , for example through larger mini-batches (cf. Example 10.9), or more
commonly, by decreasing the step size hk as k progresses. Figure 10.3 (b) shows this for hk ∼ 1/k;
also see Figure 10.4.
2
In contrast to using the full batch, which corresponds to standard gradient descent.

(a) constant learning rate for SGD (b) decreasing learning rate for SGD

Figure 10.3: 20 steps of gradient descent (GD) and stochastic gradient descent (SGD) for a strongly
convex quadratic objective function. GD was computed with a constant learning rate, while SGD
was computed with either a constant learning rate (hk = h) or a decreasing learning rate (hk ∼ 1/k).

(a) wk far from w∗ (b) wk close to w∗

Figure 10.4: The update vector −hk Gk (black) is a draw from a random variable with expectation
−hk ∇f (wk ) (blue). In order to get convergence, the variance of the update vector should decrease
as wk approaches the minimizer w∗ . Otherwise, convergence will in general not hold, see Figure
10.3 (a).

10.2.2 Convergence of stochastic gradient descent
Since wk in (10.2.2) is a random variable by construction, a convergence statement can only be
stochastic, e.g., in expectation or with high probability. The next theorem, which is based on
[98, Theorem 3.2] and [33, Theorem 4.7], concentrates on the former. The result is stated under
assumption (10.2.6), which bounds the second moments of the stochastic gradients Gk and ensures
that they grow at most linearly with ∥∇f (wk )∥2 . Moreover, the analysis relies on the following
decreasing step sizes
\[
h_k := \min\Big\{\frac{\mu}{L^2\gamma},\ \frac{1}{\mu}\,\frac{(k+1)^2 - k^2}{(k+1)^2}\Big\} \qquad \text{for all } k \in \mathbb{N}_0, \tag{10.2.4}
\]

from [98]. Note that hk = O(k −1 ) as k → ∞, since

\[
\frac{(k+1)^2 - k^2}{(k+1)^2} = \frac{2k+1}{(k+1)^2} = \frac{2}{k+1} + O(k^{-2}). \tag{10.2.5}
\]

This learning rate decay will allow us to establish a convergence rate. However, in practice, a less
aggressive decay, such as hk = O(k −1/2 ), or heuristic methods that decrease the learning rate based
on the observed convergence behaviour may be preferred.

Theorem 10.10. Let n ∈ N and L, µ, γ > 0. Let f : Rn → R be L-smooth and µ-strongly convex.
Fix $w_0 \in \mathbb{R}^n$, let $(h_k)_{k=0}^{\infty}$ be as in (10.2.4) and let $(G_k)_{k=0}^{\infty}$, $(w_k)_{k=1}^{\infty}$ be sequences of random
variables as in (10.2.1) and (10.2.2). Assume that, for some fixed γ > 0, and all k ∈ N

E[∥Gk ∥2 |wk ] ≤ γ(1 + ∥∇f (wk )∥2 ). (10.2.6)

Then there exists a constant C = C(γ, µ, L) such that for all k ∈ N

\[
\mathbb{E}[\|w_k - w_*\|^2] \le \frac{C}{k}, \qquad \mathbb{E}[f(w_k)] - f(w_*) \le \frac{C}{k}.
\]

Proof. Using (10.2.1) and (10.2.6) it holds for k ≥ 1

E[∥wk − w∗ ∥2 |wk−1 ]
= ∥wk−1 − w∗ ∥2 − 2hk−1 E[⟨Gk−1 , wk−1 − w∗ ⟩ |wk−1 ] + h2k−1 E[∥Gk−1 ∥2 |wk−1 ]
≤ ∥wk−1 − w∗ ∥2 − 2hk−1 ⟨∇f (wk−1 ), wk−1 − w∗ ⟩ + h2k−1 γ(1 + ∥∇f (wk−1 )∥2 ).

By µ-strong convexity (10.1.9)

− 2hk−1 ⟨∇f (wk−1 ), wk−1 − w∗ ⟩


≤ −µhk−1 ∥wk−1 − w∗ ∥2 − 2hk−1 · (f (wk−1 ) − f (w∗ )).

Moreover, L-smoothness, µ-strong convexity and ∇f (w∗ ) = 0 imply

\[
\|\nabla f(w_{k-1})\|^2 \le L^2\|w_{k-1} - w_*\|^2 \le \frac{2L^2}{\mu}\big(f(w_{k-1}) - f(w_*)\big).
\]
Combining the previous estimates we arrive at

\[
\mathbb{E}[\|w_k - w_*\|^2\,|\,w_{k-1}] \le (1 - \mu h_{k-1})\|w_{k-1} - w_*\|^2 + h_{k-1}^2\gamma
+ 2h_{k-1}\Big(h_{k-1}\frac{L^2\gamma}{\mu} - 1\Big)\big(f(w_{k-1}) - f(w_*)\big).
\]

The choice of hk−1 ≤ µ/(L2 γ) and the fact that (cf. (10.2.1))

E[∥wk − w∗ ∥2 |wk−1 , wk−2 ] = E[∥wk − w∗ ∥2 |wk−1 ]

thus give

E[∥wk − w∗ ∥2 |wk−1 ] ≤ (1 − µhk−1 )E[∥wk−1 − w∗ ∥2 |wk−2 ] + h2k−1 γ.

With e0 := ∥w0 − w∗ ∥2 and ek := E[∥wk − w∗ ∥2 |wk−1 ] for k ≥ 1 we have found

\[
e_k \le (1 - \mu h_{k-1}) e_{k-1} + h_{k-1}^2\gamma
\le (1 - \mu h_{k-1})\big((1 - \mu h_{k-2}) e_{k-2} + h_{k-2}^2\gamma\big) + h_{k-1}^2\gamma
\le \cdots \le e_0 \prod_{j=0}^{k-1}(1 - \mu h_j) + \gamma \sum_{j=0}^{k-1} h_j^2 \prod_{i=j+1}^{k-1}(1 - \mu h_i).
\]

Note that there exists k0 ∈ N such that by (10.2.4) and (10.2.5)

\[
h_i = \frac{1}{\mu}\,\frac{(i+1)^2 - i^2}{(i+1)^2} \qquad \text{for all } i \ge k_0.
\]

Hence there exists C̃ depending on γ, µ, L (but independent of k) such that


\[
\prod_{i=j}^{k-1}(1 - \mu h_i) \le \tilde{C} \prod_{i=j}^{k-1} \frac{i^2}{(i+1)^2} = \tilde{C}\,\frac{j^2}{k^2} \qquad \text{for all } 0 \le j \le k,
\]

and thus
\[
e_k \le \tilde{C}\,\frac{\gamma}{\mu^2}\sum_{j=0}^{k-1}\Big(\frac{(j+1)^2 - j^2}{(j+1)^2}\Big)^2 \frac{(j+1)^2}{k^2}
\le \frac{\tilde{C}\gamma}{\mu^2}\,\frac{1}{k^2}\sum_{j=0}^{k-1}\underbrace{\frac{(2j+1)^2}{(j+1)^2}}_{\le 4}
\le \frac{\tilde{C}\gamma}{\mu^2}\,\frac{4k}{k^2} = \frac{C}{k},
\]

with C := 4C̃γ/µ2 .
Since E[∥wk − w∗ ∥2 ] is the expectation of ek = E[∥wk − w∗ ∥2 |wk−1 ] with respect to the random
variable wk−1 , and C/k is a constant independent of wk−1 , we obtain
\[
\mathbb{E}[\|w_k - w_*\|^2] \le \frac{C}{k}.
\]
Finally, using L-smoothness
\[
f(w_k) - f(w_*) \le \langle \nabla f(w_*), w_k - w_*\rangle + \frac{L}{2}\|w_k - w_*\|^2 = \frac{L}{2}\|w_k - w_*\|^2,
\]
and taking the expectation concludes the proof.

The specific choice of hk in (10.2.4) simplifies the calculations in the proof, but it is not necessary
in order for the asymptotic convergence to hold; see for instance [33, Theorem 4.7] or [27, Chapter
4] for more general statements.
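
For illustration, the following sketch runs the update (10.2.2) with the step sizes (10.2.4) on a made-up strongly convex quadratic, using artificial additive gradient noise so that (10.2.6) holds; the constants are chosen only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
L_, mu, gamma = 10.0, 1.0, 5.0
D = np.diag([L_, mu])                          # f(w) = 0.5 w^T D w, minimizer w_* = 0

def h(k):                                      # step sizes (10.2.4)
    return min(mu / (L_**2 * gamma), ((k + 1)**2 - k**2) / (mu * (k + 1)**2))

w = np.array([1.0, 1.0])
for k in range(10_000):
    G = D @ w + rng.normal(size=2)             # unbiased gradient estimate with bounded variance
    w = w - h(k) * G
print(np.sum(w**2))                            # small; Theorem 10.10 guarantees E[||w_k||^2] <= C/k
```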

10.3 Acceleration
Acceleration is an important tool for the training of neural networks [262]. The idea was first
introduced by Polyak in 1964 under the name “heavy ball method” [216]. It is inspired by the
dynamics of a heavy ball rolling down the valley of the loss landscape. Since then other types of
acceleration have been proposed and analyzed, with Nesterov acceleration being the most prominent
example [190]. In this section, we first give some intuition by discussing the heavy ball method for
a simple quadratic loss. Afterwards we turn to Nesterov acceleration and give a convergence proof
for L-smooth and µ-strongly convex objective functions that improves upon the bounds obtained
for gradient descent.

10.3.1 Heavy ball method


We follow [91, 217, 219] to motivate the idea. Consider the quadratic objective function in two
dimensions
 
\[
f(w) := \frac{1}{2} w^\top D w \qquad \text{where} \qquad D = \begin{pmatrix} \zeta_1 & 0 \\ 0 & \zeta_2 \end{pmatrix} \tag{10.3.1}
\]

with ζ1 ≥ ζ2 > 0. Clearly, f has a unique minimizer at w∗ = 0 ∈ R2 . Starting at some w0 ∈ R2 ,


gradient descent with constant step size h > 0 computes the iterates

\[
w_{k+1} = w_k - hD w_k = \begin{pmatrix} 1 - h\zeta_1 & 0 \\ 0 & 1 - h\zeta_2 \end{pmatrix} w_k
= \begin{pmatrix} (1 - h\zeta_1)^{k+1} & 0 \\ 0 & (1 - h\zeta_2)^{k+1} \end{pmatrix} w_0.
\]

The method converges for arbitrary initialization w0 if and only if

|1 − hζ1 | < 1 and |1 − hζ2 | < 1.

The optimal step size balancing the rate of convergence in both coordinates is
\[
h^* = \operatorname*{argmin}_{h>0} \max\{|1 - h\zeta_1|, |1 - h\zeta_2|\} = \frac{2}{\zeta_1 + \zeta_2}. \tag{10.3.2}
\]


Figure 10.5: 20 steps of gradient descent (GD) and the heavy ball method (HB) on the objective
function (10.3.1) with ζ1 = 12 ≫ 1 = ζ2 , step size h = α = h∗ as in (10.3.2), and β = 1/3. Figure
based on [217, Fig. 6].

With κ = ζ1 /ζ2 we then obtain the convergence rate


\[
|1 - h^*\zeta_1| = |1 - h^*\zeta_2| = \frac{\zeta_1 - \zeta_2}{\zeta_1 + \zeta_2} = \frac{\kappa - 1}{\kappa + 1} \in [0, 1). \tag{10.3.3}
\]
If ζ1 ≫ ζ2 , this term is close to 1, and thus the convergence will be slow. This is consistent with
our analysis for strongly convex objective functions; by Exercise 10.23 the condition number of f
equals κ = ζ1 /ζ2 ≫ 1. Hence, the upper bound in Theorem 10.7 converges only slowly. Similar
considerations hold for general quadratic objective functions in Rn such as
\[
\tilde{f}(w) = \frac{1}{2} w^\top A w + b^\top w + c \tag{10.3.4}
\]
with A ∈ Rn×n symmetric positive definite, b ∈ Rn and c ∈ R, see Exercise 10.24.
Remark 10.11. Interpreting (10.3.4) as a second-order Taylor approximation of some objective
function around its minimizer, we note that the described effects also occur for general objective
functions with ill-conditioned Hessians at the minimizer.
Figure 10.5 gives further insight into the poor performance of gradient descent for (10.3.1) with
ζ1 ≫ ζ2 . The loss-landscape looks like a ravine (the derivative is much larger in one direction than
the other), and away from the floor, ∇f mainly points to the opposite side. Therefore the iterates
oscillate back and forth in the first coordinate, and make little progress in the direction of the valley
along the second coordinate axis. To address this problem, the heavy ball method introduces a
“momentum” term which can mitigate this effect to some extent. The idea is to choose the update
not just according to the gradient at the current location, but to add information from the previous
steps. After initializing w0 and, e.g., w1 = w0 − α∇f (w0 ), let for k ∈ N

wk+1 = wk − α∇f (wk ) + β(wk − wk−1 ). (10.3.5)

This is known as Polyak’s heavy ball method [216, 217]. Here α > 0 and β ∈ (0, 1) are hyperpa-
rameters (that could also depend on k) and in practice need to be carefully tuned to balance the

strength of the gradient and the momentum term. Iteratively expanding (10.3.5) with the given
initialization, observe that for k ≥ 0
\[
w_{k+1} = w_k - \alpha \Big( \sum_{j=0}^{k} \beta^j \nabla f(w_{k-j}) \Big). \tag{10.3.6}
\]

Thus, wk is updated using an exponentially weighted moving average of all past gradients. Choosing
the momentum parameter β in the interval (0, 1) ensures that the influence of previous gradients
on the update decays exponentially. The concrete value of β determines the balance between the
impact of recent and past gradients.
Intuitively, this linear combination of the past gradients averages out some of the oscillation
observed for gradient descent in Figure 10.5; the update vector is strengthened in directions where
past gradients are aligned (the second coordinate axis), while it is dampened in directions where
the gradients’ signs alternate (the first coordinate axis). Similarly, when using stochastic gradients,
it can help to reduce some of the variance.
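
For concreteness, here is a minimal NumPy sketch of the update (10.3.5) on the quadratic (10.3.1), using the parameter values quoted for Figure 10.5 ($\alpha = h^*$ from (10.3.2) and $\beta = 1/3$); the initialization is made up.

```python
import numpy as np

zeta1, zeta2 = 12.0, 1.0
D = np.diag([zeta1, zeta2])                  # objective (10.3.1)
alpha = 2.0 / (zeta1 + zeta2)                # optimal gradient descent step size (10.3.2)
beta = 1.0 / 3.0                             # momentum parameter, as in Figure 10.5

w_prev = w = np.array([-1.0, 1.0])
for _ in range(20):
    w_next = w - alpha * (D @ w) + beta * (w - w_prev)   # heavy ball update (10.3.5)
    w_prev, w = w, w_next
print(np.linalg.norm(w))   # the momentum term damps the oscillations seen for plain GD (cf. Figure 10.5)
```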
As mentioned earlier, the heavy ball method can be interpreted as a discretization of the dy-
namics of a ball rolling down the valley of the loss landscape. If the ball has positive mass, i.e. is
“heavy”, its momentum prevents the ball from bouncing back and forth too strongly. The following
remark elucidates this connection.
Remark 10.12. As pointed out, e.g., in [217, 219], for suitable choices of α and β, (10.3.5) can be
interpreted as a discretization of the second-order ODE

mw′′ (t) = −∇f (w(t)) − rw′ (t). (10.3.7)

This equation describes the movement of a point mass m under influence of the force field −∇f (w(t));
the term −w′ (t), which points in the negative direction of the current velocity, corresponds to fric-
tion, and r > 0 is the friction coefficient. The discretization
\[
m\,\frac{w_{k+1} - 2w_k + w_{k-1}}{h^2} = -\nabla f(w_k) - \frac{w_{k+1} - w_k}{h}
\]
then leads to
h2 m
wk+1 = wk − ∇f (wk ) + (wk − wk−1 ), (10.3.8)
m − rh m − rh
| {z } | {z }
=α =β

and thus to (10.3.5), [219].


Letting m = 0 in (10.3.8), we recover the gradient descent update (10.1.2). Hence, positive
mass m > 0 corresponds to the momentum term. The gradient descent update in turn can be
interpreted as an Euler discretization of the gradient flow

w′ (t) = −∇f (w(t)). (10.3.9)

Note that −∇f (w(t)) represents the velocity of w(t) in (10.3.9), whereas in (10.3.7), up to the
friction term, it corresponds to an acceleration.

10.3.2 Nesterov acceleration
Nesterov’s accelerated gradient method (NAG) [190, 189] builds on the heavy ball method. After
initializing w0 , v 0 ∈ Rn , the update is formulated for k ≥ 0 as the two-step process

wk+1 = v k − α∇f (v k ) (10.3.10a)


v k+1 = wk+1 + β(wk+1 − wk ) (10.3.10b)

where again α > 0 and β ∈ (0, 1) are hyperparameters. Substituting the second line into the first
we get for k ≥ 1

wk+1 = wk − α∇f (v k ) + β(wk − wk−1 ).

Comparing with the heavy ball method (10.3.5), the key difference is that the gradient is not
evaluated at the current position wk , but instead at the point v k = wk + β(wk − wk−1 ), which
can be interpreted as an estimate of the position at the next iteration. This improves stability and
robustness with respect to hyperparameter settings, see [160, Sections 4 and 5].
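
As a minimal sketch (illustrative; the quadratic test objective and its constants are made up, while α and β are chosen as in the analysis below), the two-step update (10.3.10) can be implemented as follows:

```python
import numpy as np

L_, mu = 10.0, 1.0
D = np.diag([L_, mu])                        # f(w) = 0.5 w^T D w: L-smooth, mu-strongly convex
grad = lambda w: D @ w

tau = np.sqrt(mu / L_)
alpha, beta = 1.0 / L_, (1.0 - tau) / (1.0 + tau)

w = v = np.array([1.0, 1.0])
for _ in range(50):
    w_next = v - alpha * grad(v)             # (10.3.10a): gradient step at the look-ahead point
    v = w_next + beta * (w_next - w)         # (10.3.10b): extrapolation
    w = w_next
print(np.linalg.norm(w))                     # decays roughly at the accelerated rate of Theorem 10.13
```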
We now discuss the convergence of NAG for $L$-smooth and $\mu$-strongly convex objective functions $f$.
To give the analysis, it is convenient to first rewrite (10.3.10) as a three-sequence update: Let
$\tau = \sqrt{\mu/L}$, $\alpha = 1/L$, and $\beta = (1 - \tau)/(1 + \tau)$. After initializing $w_0, v_0 \in \mathbb{R}^n$, (10.3.10) can also be
written as $u_0 = ((1 + \tau)v_0 - w_0)/\tau$ and for $k \ge 0$
\[
v_k = \frac{\tau}{1+\tau}\, u_k + \frac{1}{1+\tau}\, w_k \tag{10.3.11a}
\]
\[
w_{k+1} = v_k - \frac{1}{L}\nabla f(v_k) \tag{10.3.11b}
\]
\[
u_{k+1} = u_k + \tau(v_k - u_k) - \frac{\tau}{\mu}\nabla f(v_k), \tag{10.3.11c}
\]
see Exercise 10.25. The proof of the next theorem proceeds along the lines of [275, Theorem A.3.1],
[287, Proposition 10]; also see [286, Proposition 20] who present a similar proof of a related result
based on the same references.

Theorem 10.13. Let $n \in \mathbb{N}$, $0 < \mu \le L$, and let $f : \mathbb{R}^n \to \mathbb{R}$ be $L$-smooth and $\mu$-strongly convex.
Further, let $w_0, v_0 \in \mathbb{R}^n$ and let $\tau = \sqrt{\mu/L}$. Let $(v_k, w_{k+1}, u_{k+1})_{k=0}^{\infty} \subseteq \mathbb{R}^n$ be defined by (10.3.11a),
k=0 ⊆ R be defined by (10.3.11a),
and let w∗ be the unique minimizer of f .
Then, for all k ∈ N0 , it holds that
\[
\|u_k - w_*\|^2 \le \frac{2}{\mu}\Big(1 - \sqrt{\tfrac{\mu}{L}}\Big)^{k}\Big(f(w_0) - f(w_*) + \frac{\mu}{2}\|u_0 - w_*\|^2\Big), \tag{10.3.12a}
\]
\[
f(w_k) - f(w_*) \le \Big(1 - \sqrt{\tfrac{\mu}{L}}\Big)^{k}\Big(f(w_0) - f(w_*) + \frac{\mu}{2}\|u_0 - w_*\|^2\Big). \tag{10.3.12b}
\]

Proof. Define
\[
e_k := \big(f(w_k) - f(w_*)\big) + \frac{\mu}{2}\|u_k - w_*\|^2. \tag{10.3.13}
\]

To show (10.3.12), it suffices to prove with c := 1 − τ that ek+1 ≤ cek for all k ∈ N0 .
Step 1. We bound the first term in ek+1 defined in (10.3.13). Using L-smoothness (10.1.4a)
and (10.3.11b)
\[
f(w_{k+1}) - f(v_k) \le \langle \nabla f(v_k), w_{k+1} - v_k\rangle + \frac{L}{2}\|w_{k+1} - v_k\|^2 = -\frac{1}{2L}\|\nabla f(v_k)\|^2.
\]
Thus, since c + τ = 1,
\[
f(w_{k+1}) - f(w_*) \le \big(f(v_k) - f(w_*)\big) - \frac{1}{2L}\|\nabla f(v_k)\|^2
= c\big(f(w_k) - f(w_*)\big) + c\big(f(v_k) - f(w_k)\big) + \tau\big(f(v_k) - f(w_*)\big) - \frac{1}{2L}\|\nabla f(v_k)\|^2. \tag{10.3.14}
\]
Step 2. We bound the second term in ek+1 defined in (10.3.13). By (10.3.11c)
\[
\frac{\mu}{2}\|u_{k+1} - w_*\|^2 - \frac{\mu}{2}\|u_k - w_*\|^2
= \frac{\mu}{2}\|u_{k+1} - u_k + u_k - w_*\|^2 - \frac{\mu}{2}\|u_k - w_*\|^2
\]
\[
= \frac{\mu}{2}\|u_{k+1} - u_k\|^2 + \mu\Big\langle \tau(v_k - u_k) - \frac{\tau}{\mu}\nabla f(v_k),\, u_k - w_*\Big\rangle
= \frac{\mu}{2}\|u_{k+1} - u_k\|^2 + \tau\langle \nabla f(v_k), w_* - u_k\rangle - \tau\mu\langle v_k - u_k, w_* - u_k\rangle. \tag{10.3.15}
\]
Using µ-strong convexity (10.1.9), we get

\[
\tau\langle \nabla f(v_k), w_* - u_k\rangle = \tau\langle \nabla f(v_k), v_k - u_k\rangle + \tau\langle \nabla f(v_k), w_* - v_k\rangle
\le \tau\langle \nabla f(v_k), v_k - u_k\rangle - \tau\big(f(v_k) - f(w_*)\big) - \frac{\tau\mu}{2}\|v_k - w_*\|^2.
\]
Moreover,
\[
-\frac{\tau\mu}{2}\|v_k - w_*\|^2 - \tau\mu\langle v_k - u_k, w_* - u_k\rangle
= -\frac{\tau\mu}{2}\Big(\|v_k - w_*\|^2 - 2\langle v_k - u_k, v_k - w_*\rangle + 2\langle v_k - u_k, v_k - u_k\rangle\Big)
= -\frac{\tau\mu}{2}\big(\|u_k - w_*\|^2 + \|v_k - u_k\|^2\big).
\]
Thus, (10.3.15) is bounded by
\[
\frac{\mu}{2}\|u_{k+1} - u_k\|^2 + \tau\langle \nabla f(v_k), v_k - u_k\rangle - \tau\big(f(v_k) - f(w_*)\big)
- \frac{\tau\mu}{2}\|u_k - w_*\|^2 - \frac{\tau\mu}{2}\|v_k - u_k\|^2.
\]
From (10.3.11a) we have τ · (v k − uk ) = wk − v k , so that with c = 1 − τ we arrive at
\[
\frac{\mu}{2}\|u_{k+1} - w_*\|^2 \le c\,\frac{\mu}{2}\|u_k - w_*\|^2 + \frac{\mu}{2}\|u_{k+1} - u_k\|^2
+ \langle \nabla f(v_k), w_k - v_k\rangle - \tau\big(f(v_k) - f(w_*)\big) - \frac{\mu}{2\tau}\|w_k - v_k\|^2. \tag{10.3.16}
\]

Step 3. We show ek+1 ≤ cek . Adding (10.3.14) and (10.3.16) gives
\[
e_{k+1} \le c\,e_k + c\big(f(v_k) - f(w_k)\big) - \frac{1}{2L}\|\nabla f(v_k)\|^2 + \frac{\mu}{2}\|u_{k+1} - u_k\|^2
+ \langle \nabla f(v_k), w_k - v_k\rangle - \frac{\mu}{2\tau}\|w_k - v_k\|^2.
\]

Using (10.3.11a), (10.3.11c) we expand
\[
\frac{\mu}{2}\|u_{k+1} - u_k\|^2 = \frac{\mu}{2}\Big\|w_k - v_k - \frac{\tau}{\mu}\nabla f(v_k)\Big\|^2
= \frac{\mu}{2}\|w_k - v_k\|^2 - \tau\langle \nabla f(v_k), w_k - v_k\rangle + \frac{\tau^2}{2\mu}\|\nabla f(v_k)\|^2,
\]
to obtain
\[
e_{k+1} \le c\,e_k + \Big(\frac{\tau^2}{2\mu} - \frac{1}{2L}\Big)\|\nabla f(v_k)\|^2 - \frac{\mu}{2\tau}\|w_k - v_k\|^2
+ c\big(f(v_k) - f(w_k) + \langle \nabla f(v_k), w_k - v_k\rangle\big) + \frac{\mu}{2}\|w_k - v_k\|^2.
\]
The last line can be bounded using µ-strong convexity (10.1.9) and µ ≤ L
\[
c\big(f(v_k) - f(w_k) + \langle \nabla f(v_k), w_k - v_k\rangle\big) + \frac{\mu}{2}\|v_k - w_k\|^2
\le -(1 - \tau)\frac{\mu}{2}\|v_k - w_k\|^2 + \frac{\mu}{2}\|v_k - w_k\|^2 \le \frac{\tau L}{2}\|v_k - w_k\|^2.
\]
In all
\[
e_{k+1} \le c\,e_k + \Big(\frac{\tau^2}{2\mu} - \frac{1}{2L}\Big)\|\nabla f(v_k)\|^2 + \Big(\frac{\tau L}{2} - \frac{\mu}{2\tau}\Big)\|w_k - v_k\|^2 = c\,e_k,
\]
where the terms in brackets vanished since $\tau = \sqrt{\mu/L}$. This concludes the proof.

Comparing the result for gradient descent (10.1.10) with NAG (10.3.12), the improvement for
strongly convex objectives lies in the convergence rate, which is 1 − κ−1 for gradient descent3 , and
1 − κ−1/2 for NAG, where κ = L/µ. For NAG the convergence rate depends only on the square
root of the condition number κ. For ill-conditioned problems where κ is large, we therefore expect
much better performance for accelerated methods.

10.4 Adaptive and coordinate-wise learning rates


In recent years, a multitude of first order (gradient descent) methods has been proposed and studied
for the training of neural networks. Many of them incorporate some or all of the following key
strategies: stochastic gradients, acceleration, and adaptive step sizes. The concept of stochastic
gradients and acceleration have been covered in the Sections 10.2 and 10.3, and we will touch
upon adaptive learning rates in the present one. Specifically, following the original papers [74, 295,
272, 143] and in particular the overviews [94, Section 8.5], [231], and [87, Chapter 11], we explain
3
Also see [189, Theorem 2.1.15] for the sharper rate (1 − κ−1 )2 /(1 + κ−1 )2 = 1 − 4κ−1 + O(κ−2 ).

the main ideas behind AdaGrad, RMSProp, and Adam. The paper [231] provides an intuitive
general overview of first order methods and discusses several additional variants that are omitted
here. Moreover, in practice, various other techniques and heuristics such as batch normalization,
gradient clipping, regularization and dropout, early stopping, specific weight initializations etc. are
used. We do not discuss them here, and refer for example to [32], [94], or [87, Chapter 11].

10.4.1 Coordinate-wise scaling


In Section 10.3.1, we saw why plain gradient descent can be inefficient for ill-conditioned objective
functions. This issue can be particularly pronounced in high-dimensional optimization problems,
such as when training neural networks, where certain parameters influence the network output much
more than others. As a result, a single learning rate may be suboptimal; directions in parameter
space with small gradients are updated too slowly, while in directions with large gradients the
algorithm might overshoot. To address this, one approach is to precondition the gradient by
multiplying it with a matrix that accounts for the geometry of the parameter space, e.g. [5, 194].
A simpler and computationally efficient alternative is to scale each component of the gradient
individually, corresponding to a diagonal preconditioning matrix. This allows different learning
rates for different coordinates. The key question is how to set these learning rates. The main idea,
first proposed in [74], is to scale each component inverse proportional to the magnitude of past
gradients. In the words of the authors of [74]: “Our procedures give frequently occurring features
very low learning rates and infrequent features high learning rates.”
After initializing u0 = 0 ∈ Rn , s0 = 0 ∈ Rn , and w0 ∈ Rn , all methods discussed below are
special cases of
\[
u_{k+1} = \beta_1 u_k + \beta_2 \nabla f(w_k) \tag{10.4.1a}
\]
\[
s_{k+1} = \gamma_1 s_k + \gamma_2 \nabla f(w_k) \odot \nabla f(w_k) \tag{10.4.1b}
\]
\[
w_{k+1} = w_k - \alpha_k\, u_{k+1} \oslash \sqrt{s_{k+1} + \varepsilon} \tag{10.4.1c}
\]
for k ∈ N0 , and certain hyperparameters αk , β1 , β2 , γ1 , γ2 , and ε. Here ⊙ and ⊘ denote the

componentwise (Hadamard) multiplication and division, respectively, and $\sqrt{s_{k+1} + \varepsilon}$ is understood
as the vector $(\sqrt{s_{k+1,i} + \varepsilon})_i$. Equation (10.4.1a) defines an update vector and corresponds to heavy
ball momentum if β1 ∈ (0, 1). If β1 = 0, then uk+1 is simply a multiple of the current gradient.
Equation (10.4.1b) defines a scaling vector sk+1 that is used to set a coordinate-wise learning rate
of the update vector in (10.4.1c). The constant ε > 0 is chosen small but positive to avoid division
by zero in (10.4.1c). These type of methods are often applied using mini-batches, see Section 10.2.
For simplicity we present them with the full gradients.
Example 10.14. Consider an objective function f : Rn → R, and its rescaled version
fζ (w) := f (w ⊙ ζ) with gradient ∇fζ (w) = ζ ⊙ ∇f (w ⊙ ζ),
for some ζ ∈ (0, ∞)n . Gradient descent (10.1.2) applied to fζ performs the update
wk+1 = wk − hk ζ ⊙ ∇f (wk ⊙ ζ).
Setting ε = 0, (10.4.1) on the other hand performs the update
\[
w_{k+1} = w_k - \alpha_k \Big(\beta_2 \sum_{j=0}^{k} \beta_1^j \nabla f(w_{k-j} \odot \zeta)\Big) \oslash \sqrt{\gamma_2 \sum_{j=0}^{k} \gamma_1^j \nabla f(w_{k-j} \odot \zeta) \odot \nabla f(w_{k-j} \odot \zeta)}.
\]

Note how the outer scaling factor ζ has vanished due to the division, in this sense making the
update invariant to a componentwise rescaling of the objective. ♢

10.4.2 Algorithms
AdaGrad
AdaGrad [74], which stands for Adaptive Gradient Algorithm, corresponds to (10.4.1) with

β1 = 0, γ1 = β2 = γ2 = 1, αk = α for all k ∈ N0 .

This leaves the hyperparameters ε > 0 and α > 0. Here α > 0 can be understood as a “global”
learning rate. The default values in tensorflow [1] are α = 0.001 and ε = 10−7 . The AdaGrad
update then reads

\[
s_{k+1} = s_k + \nabla f(w_k) \odot \nabla f(w_k), \qquad
w_{k+1} = w_k - \alpha\, \nabla f(w_k) \oslash \sqrt{s_{k+1} + \varepsilon}.
\]

Due to
\[
s_{k+1} = \sum_{j=0}^{k} \nabla f(w_j) \odot \nabla f(w_j), \tag{10.4.2}
\]

the algorithm therefore scales the gradient ∇f (wk ) in the update componentwise by the inverse
square root of the sum over all past squared gradients plus ε. Note that the scaling factor (sk+1,i +
ε)−1/2 for component i will be large, if the previous gradients for that component were small, and
vice versa.

RMSProp
RMSProp, which stands for Root Mean Squared Propagation, was introduced by Tieleman and
Hinton [272]. It corresponds to (10.4.1) with

β1 = 0, β2 = 1, γ2 = 1 − γ1 ∈ (0, 1), αk = α for all k ∈ N0 ,

effectively leaving the hyperparameters ε > 0, γ1 ∈ (0, 1) and α > 0. The default values in
tensorflow [1] are ε = 10−7 , α = 0.001 and γ1 = 0.9. The algorithm is thus given through

\[
s_{k+1} = \gamma_1 s_k + (1 - \gamma_1)\nabla f(w_k) \odot \nabla f(w_k) \tag{10.4.3a}
\]
\[
w_{k+1} = w_k - \alpha\, \nabla f(w_k) \oslash \sqrt{s_{k+1} + \varepsilon}. \tag{10.4.3b}
\]

The scaling vector can be expressed as


\[
s_{k+1} = (1 - \gamma_1) \sum_{j=0}^{k} \gamma_1^j\, \nabla f(w_{k-j}) \odot \nabla f(w_{k-j}),
\]

and corresponds to an exponentially weighted moving average over the past squared gradients.
Unlike for AdaGrad (10.4.2), where past gradients accumulate indefinitely, RMSprop exponentially

downweights older gradients, giving more weight to recent updates. This prevents the overly rapid
decay of learning rates and slow convergence sometimes observed in AdaGrad, e.g. [288, 87]. For
the same reason, the authors of Adadelta [295] proposed to use as a scaling vector the average
over a moving window of the past m squared gradients, for some fixed m ∈ N. For more details
on Adadelta, see [295, 231]. The standard RMSProp algorithm does not incorporate momentum,
however this possibility is already mentioned in [272], also see [262].

Adam
Adam [143], which stands for Adaptive Moment Estimation, corresponds to (10.4.1) with
\[
\beta_2 = 1 - \beta_1 \in (0, 1), \qquad \gamma_2 = 1 - \gamma_1 \in (0, 1), \qquad \alpha_k = \alpha\,\frac{\sqrt{1 - \gamma_1^{k+1}}}{1 - \beta_1^{k+1}}
\]
for all k ∈ N0 , for some α > 0. The default values for the remaining parameters recommended in
[143] are ε = 10−8 , α = 0.001, β1 = 0.9 and γ1 = 0.999. The update can be formulated as
\[
u_{k+1} = \beta_1 u_k + (1 - \beta_1)\nabla f(w_k), \qquad \hat{u}_{k+1} = \frac{u_{k+1}}{1 - \beta_1^{k+1}} \tag{10.4.4a}
\]
\[
s_{k+1} = \gamma_1 s_k + (1 - \gamma_1)\nabla f(w_k) \odot \nabla f(w_k), \qquad \hat{s}_{k+1} = \frac{s_{k+1}}{1 - \gamma_1^{k+1}} \tag{10.4.4b}
\]
\[
w_{k+1} = w_k - \alpha\, \hat{u}_{k+1} \oslash \sqrt{\hat{s}_{k+1} + \varepsilon}. \tag{10.4.4c}
\]

Compared to RMSProp, Adam introduces two modifications. First, due to β1 > 0,


\[
u_{k+1} = (1 - \beta_1)\sum_{j=0}^{k} \beta_1^j\, \nabla f(w_{k-j}),
\]

which corresponds to heavy ball momentum (cf. (10.3.6)). Second, to counteract the initialization
bias from u0 = 0 and s0 = 0, Adam applies a bias correction via
\[
\hat{u}_k = \frac{u_k}{1 - \beta_1^k}, \qquad \hat{s}_k = \frac{s_k}{1 - \gamma_1^k}.
\]
It should be noted that there exist specific settings and convex optimization problems for which
Adam (and RMSProp and Adadelta) does not necessarily converge to a minimizer, see [227]. The
authors of [227] propose a modification termed AMSGrad, which avoids this issue. Nonetheless,
Adam remains a highly popular algorithm for the training of neural networks. We also note that,
in the stochastic optimization setting, convergence proofs of such algorithms in general still require
k-dependent decrease of the “global” learning rate such as α = O(k −1/2 ) in (10.4.3b) and (10.4.4c).
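
A compact sketch of the update (10.4.4), using the default hyperparameters from [143] quoted above, follows; the quadratic test objective is made up and any gradient oracle could be plugged in instead.

```python
import numpy as np

def adam(grad_f, w0, num_steps, alpha=0.001, beta1=0.9, gamma1=0.999, eps=1e-8):
    """Adam update (10.4.4) with bias correction."""
    w = np.array(w0, dtype=float)
    u = np.zeros_like(w)          # first moment estimate
    s = np.zeros_like(w)          # second moment estimate
    for k in range(1, num_steps + 1):
        g = grad_f(w)
        u = beta1 * u + (1.0 - beta1) * g
        s = gamma1 * s + (1.0 - gamma1) * g * g
        u_hat = u / (1.0 - beta1**k)      # bias-corrected first moment
        s_hat = s / (1.0 - gamma1**k)     # bias-corrected second moment
        w = w - alpha * u_hat / np.sqrt(s_hat + eps)
    return w

# example: quadratic objective f(w) = 0.5 * ||w||^2
print(adam(grad_f=lambda w: w, w0=[1.0, -1.0], num_steps=5000))
```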

10.5 Backpropagation
We now explain how to apply gradient-based methods to the training of neural networks. Let
$\Phi \in \mathcal{N}^{d_{L+1}}_{d_0}(\sigma; L, n)$ (see Definition 3.6) and assume that the activation function satisfies $\sigma \in C^1(\mathbb{R})$.
As earlier, we denote the neural network parameters by

w = ((W (0) , b(0) ), . . . , (W (L) , b(L) )) (10.5.1)

with weight matrices W (ℓ) ∈ Rdℓ+1 ×dℓ and bias vectors b(ℓ) ∈ Rdℓ+1 . Additionally, we fix a differ-
entiable loss function L : RdL+1 × RdL+1 → R, e.g., L(w, w̃) = ∥w − w̃∥2 /2, and assume given data
$(x_j, y_j)_{j=1}^m \subseteq \mathbb{R}^{d_0} \times \mathbb{R}^{d_{L+1}}$. The goal is to minimize an empirical risk of the form

\[
f(w) := \frac{1}{m}\sum_{j=1}^{m} L(\Phi(x_j, w), y_j)
\]

as a function of the neural network parameters w. An application of the gradient step (10.1.2) to
update the parameters requires the computation of
\[
\nabla f(w) = \frac{1}{m}\sum_{j=1}^{m} \nabla_w L(\Phi(x_j, w), y_j).
\]

For stochastic methods, as explained in Example 10.9, we only compute the average over a (random)
subbatch of the dataset. In either case, we need an algorithm to determine ∇w L(Φ(x, w), y), i.e.
the gradients

∇b(ℓ) L(Φ(x, w), y) ∈ Rdℓ+1 , ∇W (ℓ) L(Φ(x, w), y) ∈ Rdℓ+1 ×dℓ (10.5.2)

for all ℓ = 0, . . . , L.
The backpropagation algorithm [234] provides an efficient way to do so, by storing intermediate
values in the calculation. To explain it, for fixed input x ∈ Rd0 introduce the notation

\[
\bar{x}^{(1)} := W^{(0)} x + b^{(0)} \tag{10.5.3a}
\]
\[
\bar{x}^{(\ell+1)} := W^{(\ell)} \sigma(\bar{x}^{(\ell)}) + b^{(\ell)} \qquad \text{for } \ell \in \{1, \dots, L\}, \tag{10.5.3b}
\]

where the application of σ : R → R to a vector is, as always, understood componentwise.


With the notation of Definition 2.1, x(ℓ) = σ(x̄(ℓ) ) ∈ Rdℓ for ℓ = 1, . . . , L and x̄(L+1) =
x(L+1) = Φ(x, w) ∈ RdL+1 is the output of the neural network. Therefore, the x̄(ℓ) for ℓ = 1, . . . , L
are sometimes also referred to as the preactivations.
In the following, we additionally fix y ∈ RdL+1 and write

L := L(Φ(x, w), y) = L(x̄(L+1) , y).

Note that x̄(k) depends on (W (ℓ) , b(ℓ) ) only if k > ℓ. Since x̄(ℓ+1) is a function of x̄(ℓ) for each ℓ,
by repeated application of the chain rule

\[
\frac{\partial L}{\partial W^{(\ell)}_{ij}}
= \underbrace{\frac{\partial L}{\partial \bar{x}^{(L+1)}}}_{\in \mathbb{R}^{1\times d_{L+1}}}
\underbrace{\frac{\partial \bar{x}^{(L+1)}}{\partial \bar{x}^{(L)}}}_{\in \mathbb{R}^{d_{L+1}\times d_L}}
\cdots
\underbrace{\frac{\partial \bar{x}^{(\ell+2)}}{\partial \bar{x}^{(\ell+1)}}}_{\in \mathbb{R}^{d_{\ell+2}\times d_{\ell+1}}}
\underbrace{\frac{\partial \bar{x}^{(\ell+1)}}{\partial W^{(\ell)}_{ij}}}_{\in \mathbb{R}^{d_{\ell+1}\times 1}}. \tag{10.5.4}
\]

An analogous calculation holds for $\partial L/\partial b^{(\ell)}_j$. Since all terms in (10.5.4) are easy to compute (see
(10.5.3)), in principle we could use this formula to determine the gradients in (10.5.2). To avoid
unnecessary computations, the main idea of backpropagation is to introduce

α(ℓ) := ∇x̄(ℓ) L ∈ Rdℓ for all ℓ = 1, . . . , L + 1

and observe that

\[
\frac{\partial L}{\partial W^{(\ell)}_{ij}} = (\alpha^{(\ell+1)})^\top \frac{\partial \bar{x}^{(\ell+1)}}{\partial W^{(\ell)}_{ij}}.
\]

As the following lemma shows, the α(ℓ) can be computed recursively for ℓ = L + 1, . . . , 1. This
explains the name “backpropagation”. As before, ⊙ denotes the componentwise product.

Lemma 10.15. It holds

α(L+1) = ∇x̄(L+1) L(x̄(L+1) , y) (10.5.5)

and

α(ℓ) = σ ′ (x̄(ℓ) ) ⊙ (W (ℓ) )⊤ α(ℓ+1) for all ℓ = L, . . . , 1.

Proof. Equation (10.5.5) holds by definition. For ℓ ∈ {1, . . . , L} by the chain rule

\[
\alpha^{(\ell)} = \frac{\partial L}{\partial \bar{x}^{(\ell)}}
= \underbrace{\Big(\frac{\partial \bar{x}^{(\ell+1)}}{\partial \bar{x}^{(\ell)}}\Big)^{\!\top}}_{\in \mathbb{R}^{d_\ell \times d_{\ell+1}}}
\underbrace{\frac{\partial L}{\partial \bar{x}^{(\ell+1)}}}_{\in \mathbb{R}^{d_{\ell+1}\times 1}}
= \Big(\frac{\partial \bar{x}^{(\ell+1)}}{\partial \bar{x}^{(\ell)}}\Big)^{\!\top} \alpha^{(\ell+1)}.
\]

By (10.5.3b) for i ∈ {1, . . . , dℓ+1 }, j ∈ {1, . . . , dℓ }


\[
\Big(\frac{\partial \bar{x}^{(\ell+1)}}{\partial \bar{x}^{(\ell)}}\Big)_{ij} = \frac{\partial \bar{x}^{(\ell+1)}_i}{\partial \bar{x}^{(\ell)}_j} = W^{(\ell)}_{ij}\, \sigma'(\bar{x}^{(\ell)}_j).
\]

Thus the claim follows.

Putting everything together, we obtain explicit formulas for (10.5.2).

Proposition 10.16. It holds

∇b(ℓ) L = α(ℓ+1) ∈ Rdℓ+1 for ℓ = 0, . . . , L

and

∇W (0) L = α(1) x⊤ ∈ Rd1 ×d0

and

∇W (ℓ) L = α(ℓ+1) σ(x̄(ℓ) )⊤ ∈ Rdℓ+1 ×dℓ for ℓ = 1, . . . , L.

Proof. By (10.5.3a) for i, k ∈ {1, . . . , d1 }, and j ∈ {1, . . . , d0 }
\[
\frac{\partial \bar{x}^{(1)}_k}{\partial b^{(0)}_i} = \delta_{ki} \qquad \text{and} \qquad \frac{\partial \bar{x}^{(1)}_k}{\partial W^{(0)}_{ij}} = \delta_{ki}\, x_j,
\]

and by (10.5.3b) for ℓ ∈ {1, . . . , L} and i, k ∈ {1, . . . , dℓ+1 }, and j ∈ {1, . . . , dℓ }


\[
\frac{\partial \bar{x}^{(\ell+1)}_k}{\partial b^{(\ell)}_i} = \delta_{ki} \qquad \text{and} \qquad \frac{\partial \bar{x}^{(\ell+1)}_k}{\partial W^{(\ell)}_{ij}} = \delta_{ki}\, \sigma(\bar{x}^{(\ell)}_j).
\]

Thus, with $e_i = (\delta_{ki})_{k=1}^{d_{\ell+1}}$,
\[
\frac{\partial L}{\partial b^{(\ell)}_i} = \Big(\frac{\partial \bar{x}^{(\ell+1)}}{\partial b^{(\ell)}_i}\Big)^{\!\top} \frac{\partial L}{\partial \bar{x}^{(\ell+1)}} = e_i^\top \alpha^{(\ell+1)} = \alpha^{(\ell+1)}_i \qquad \text{for } \ell \in \{0, \dots, L\}
\]

and similarly

\[
\frac{\partial L}{\partial W^{(0)}_{ij}} = \Big(\frac{\partial \bar{x}^{(1)}}{\partial W^{(0)}_{ij}}\Big)^{\!\top} \alpha^{(1)} = x_j\, e_i^\top \alpha^{(1)} = x_j\, \alpha^{(1)}_i
\]

and
\[
\frac{\partial L}{\partial W^{(\ell)}_{ij}} = \sigma(\bar{x}^{(\ell)}_j)\, \alpha^{(\ell+1)}_i \qquad \text{for } \ell \in \{1, \dots, L\}.
\]

This concludes the proof.

Lemma 10.15 and Proposition 10.16 motivate Algorithm 1, in which a forward pass computing
x̄(ℓ) , ℓ = 1, . . . , L + 1, is followed by a backward pass to determine the α(ℓ) , ℓ = L + 1, . . . , 1,
and the gradients of L with respect to the neural network parameters. This shows how to use
gradient-based optimizers from the previous sections for the training of neural networks.
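
As an illustration of Lemma 10.15, Proposition 10.16, and Algorithm 1, the following NumPy sketch performs one forward and one backward pass for the square loss L(w, w̃) = ∥w − w̃∥²/2; the architecture, data, and activation (tanh) are made up for the example.

```python
import numpy as np

def backprop(x, y, weights, biases, sigma, sigma_prime):
    """Forward and backward pass of Algorithm 1 for the loss L(a, y) = 0.5*||a - y||^2.

    weights[l], biases[l] are W^(l), b^(l) for l = 0, ..., L.
    Returns lists of gradients with respect to all weights and biases.
    """
    L = len(weights) - 1
    # forward pass: store the preactivations x_bar^(1), ..., x_bar^(L+1)
    x_bar = [weights[0] @ x + biases[0]]
    for l in range(1, L + 1):
        x_bar.append(weights[l] @ sigma(x_bar[-1]) + biases[l])

    # backward pass (Lemma 10.15 and Proposition 10.16)
    alpha = x_bar[-1] - y                        # alpha^(L+1) for the square loss
    grad_W = [None] * (L + 1)
    grad_b = [None] * (L + 1)
    for l in range(L, 0, -1):
        grad_b[l] = alpha                        # gradient w.r.t. b^(l)
        grad_W[l] = np.outer(alpha, sigma(x_bar[l - 1]))   # alpha^(l+1) sigma(x_bar^(l))^T
        alpha = sigma_prime(x_bar[l - 1]) * (weights[l].T @ alpha)   # alpha^(l)
    grad_b[0] = alpha
    grad_W[0] = np.outer(alpha, x)               # alpha^(1) x^T
    return grad_W, grad_b

# tiny example with a smooth activation (hypothetical sizes)
rng = np.random.default_rng(0)
dims = [3, 4, 4, 1]                              # d_0, d_1, d_2, d_3
Ws = [rng.normal(size=(dims[l + 1], dims[l])) for l in range(3)]
bs = [rng.normal(size=dims[l + 1]) for l in range(3)]
gW, gb = backprop(rng.normal(size=3), np.array([0.5]), Ws, bs, np.tanh,
                  lambda t: 1.0 - np.tanh(t) ** 2)
print([g.shape for g in gW])                     # matches the shapes of W^(0), W^(1), W^(2)
```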
Two important remarks are in order. First, the objective function associated to neural networks
is typically not convex as a function of the neural network weights and biases. Thus, the analysis
of the previous sections will in general not be directly applicable. It may still give some insight
about the convergence behavior locally around a (local) minimizer however. Second,
we assumed the activation function to be continuously differentiable, which does not hold for
ReLU. Using the concept of subgradients, gradient-based algorithms and their analysis may be
generalized to some extent to also accommodate non-differentiable loss functions, see Exercises
10.20–10.22.

Bibliography and further reading


The convergence proof of gradient descent for smooth and strongly convex functions presented
in Section 10.1 follows [85], which provides a collection of simple proofs for various (stochastic)
gradient descent methods together with detailed references. For standard textbooks on gradient
descent and convex optimization, see [26, 188, 189, 35, 194, 153, 41]. These references also include

Algorithm 1 Backpropagation
Input: Network input x, target output y, neural network parameters
(0) (0) (L) (L)
((W , b ), . . . , (W , b ))
Output: Gradients of the loss function L with respect to neural network parameters

Forward pass
x̄(1) ← W (0) x + b(0)
for ℓ = 1, . . . , L do
x̄(ℓ+1) ← W (ℓ) σ(x̄(ℓ) ) + b(ℓ)
end for

Backward pass
α(L+1) ← ∇x̄(L+1) L(x̄(L+1) , y)
for ℓ = L, . . . , 1 do
∇b(ℓ) L ← α(ℓ+1)
∇W (ℓ) L ← α(ℓ+1) σ(x̄(ℓ) )⊤
α(ℓ) ← σ ′ (x̄(ℓ) ) ⊙ (W (ℓ) )⊤ α(ℓ+1)
end for
∇b(0) L ← α(1)
∇W (0) L ← α(1) x⊤

convergence proofs under weaker assumptions than those considered here. For convergence results
assuming for example the Polyak-Lojasiewicz inequality, which does not require convexity, see, e.g.,
[139].
Stochastic gradient descent (SGD) discussed in Section 10.2 originally dates back to Robbins and
Monro [228]. The proof presented here for strongly convex objective functions is based on [98, 33]
and in particular uses the step size from [98]; also see [184, 224, 187, 251]. For insights into the
potential benefits of SGD in terms of generalization properties, see, e.g., [289, 105, 297, 141, 256].
The heavy ball method in Section 10.3 goes back to Polyak [216]. To motivate the algorithm
we proceed as in [91, 217, 219], and also refer to [259, 196]. The analysis of Nesterov acceleration
[190] follows the arguments in [275, 287], with a similar proof also given in [286].
For Section 10.4 on adaptive learning rates, we follow the overviews [94, Section 8.5], [231], and
[87, Chapter 11] and the original works that introduced AdaGrad [74], Adadelta [295], RMSProp
[272] and Adam [143]. Regarding the analysis of RMSProp and Adam, we refer to [227] which
give an example of non-convergence, and provide a modification of the algorithm together with a
convergence analysis. Convergence proofs (for variations of) AdaGrad and Adam can also be found
in [64].
The backpropagation algorithm discussed in Section 10.5 was popularized by Rumelhart, Hinton
and Williams [234]; for further details on the historical developement we refer to [239, Section 5.5],
and for a more in-depth discussion of the algorithm, see for instance [109, 28, 192].
Similar discussions of gradient descent algorithms in the context of deep learning as given
here were presented in [271] and [132]: [271, Chapter 7] provides accessible convergence proofs of
(stochastic) gradient descent and gradient flow under different smoothness and convexity assump-
tions, and [132, Part III] gives a broader overview of optimization techniques in deep learning, but

142
restricts part of the analysis to quadratic objective functions. As in [33], our analysis of gradient de-
scent, stochastic gradient descent, and Nesterov acceleration, exclusively focused on strongly convex
objective functions. We also refer to this paper for a more detailed general treatment and analysis
of optimization algorithms in machine learning, covering various methods that are omitted here.
Details on implementations in Python can for example be found in [87], and for recommendations
and tricks regarding the implementation we also refer to [156, 32].

143
Exercises
Exercise 10.17. Let L > 0 and let f : Rn → R be continuously differentiable. Show that (10.1.5)
implies (10.1.4).

Exercise 10.18. Let f ∈ C 1 (Rn ). Show that f is convex in the sense of Definition 10.3 if and only
if

f (w) + ⟨∇f (w), v − w⟩ ≤ f (v) for all w, v ∈ Rn .

Definition 10.19. For convex f : Rn → R, g ∈ Rn is called a subgradient (or subdifferential) of


f at v if and only if
f (w) ≥ f (v) + ⟨g, w − v⟩ for all w ∈ Rn . (10.5.6)
The set of all subgradients of f at v is denoted by ∂f (v).

For convex functions f , a subgradient always exists, i.e. ∂f (v) is necessarily nonempty, e.g.,
[41, Section 1.2]. Subgradients generalize the notion of gradients for convex functions, since for
any convex continuously differentiable f , (10.5.6) is satisfied with g = ∇f (v). The following three
exercises on subgradients are based on the lecture notes [36]. Also see, e.g., [252, 41, 153] for more
details on subgradient descent.

Exercise 10.20. Let f : Rn → R be convex and Lip(f ) ≤ L. Show that for any g ∈ ∂f (v) holds
∥g∥ ≤ L.

Exercise 10.21. Let f : Rn → R be convex, Lip(f ) ≤ L and suppose that w∗ is a minimizer of f .


Fix w0 ∈ Rn , and for k ∈ N0 define the subgradient descent update

wk+1 := wk − hk g k ,

where g k is an arbitrary fixed element of ∂f (wk ). Show that


\[
\min_{i \le k} f(w_i) - f(w_*) \le \frac{\|w_0 - w_*\|^2 + L^2 \sum_{i=1}^{k} h_i^2}{2\sum_{i=1}^{k} h_i}.
\]

Hint: Start by recursively expanding ∥wk − w∗ ∥2 = · · · , and then apply the property of the
subgradient.

Exercise 10.22. Consider the setting of Exercise 10.21. Determine step sizes h1 , . . . , hk (which
may depend on k, i.e. hk,1 , . . . , hk,k ) such that for any arbitrarily small δ > 0

\[
\min_{i \le k} f(w_i) - f(w_*) = O(k^{-1/2+\delta}) \qquad \text{as } k \to \infty.
\]

144
Exercise 10.23. Let A ∈ Rn×n be symmetric positive semidefinite, b ∈ Rn and c ∈ R. Denote
the eigenvalues of A by ζ1 ≥ · · · ≥ ζn ≥ 0. Show that the objective function
\[
f(w) := \frac{1}{2} w^\top A w + b^\top w + c \tag{10.5.7}
\]
is convex and ζ1 -smooth. Moreover, if ζn > 0, then f is ζn -strongly convex. Show that these values
are optimal in the sense that f is neither L-smooth nor µ-strongly convex if L < ζ1 and µ > ζn .
Hint: Note that L-smoothness and µ-strong convexity are invariant under shifts and the addition
of constants. That is, for every α ∈ R and β ∈ Rn , f˜(w) := α + f (w + β) is L-smooth or µ-strongly
convex if and only if f is. It thus suffices to consider w⊤ Aw/2.

Exercise 10.24. Let f be as in Exercise 10.23. Show that gradient descent converges for arbitrary
initialization w0 ∈ Rn , if and only if

\[
\max_{j=1,\dots,n} |1 - h\zeta_j| < 1.
\]

Show that argminh>0 maxj=1,...,n |1 − hζj | = 2/(ζ1 + ζn ) and conclude that the convergence will be
slow if f is ill-conditioned, i.e. if ζ1 /ζn ≫ 1.
Hint: Assume first that b = 0 ∈ Rn and c = 0 ∈ R in (10.5.7), and use the singular value
decomposition A = U ⊤ diag(ζ1 , . . . , ζn )U .
Exercise 10.25. Show that (10.3.10) can equivalently be written as (10.3.11) with $\tau = \sqrt{\mu/L}$,
$\alpha = 1/L$, $\beta = (1 - \tau)/(1 + \tau)$ and the initialization $u_0 = ((1 + \tau)w_0 - s_0)/\tau$.

Chapter 11

Wide neural networks and the neural


tangent kernel

In this chapter we explore the dynamics of training (shallow) neural networks of large width.
Throughout assume given data pairs

(xi , yi ) ∈ Rd × R i ∈ {1, . . . , m}, (11.0.1a)

for distinct xi . We wish to train a model (e.g. a neural network) Φ(x, w) depending on the input
x ∈ Rd and the parameters w ∈ Rn . To this end we consider either minimization of the ridgeless
(unregularized) objective
\[
f(w) := \sum_{i=1}^{m} \big(\Phi(x_i, w) - y_i\big)^2, \tag{11.0.1b}
\]
or, for some regularization parameter λ ≥ 0, of the ridge regularized objective

fλ (w) := f (w) + λ∥w∥2 . (11.0.1c)

The adjectives ridge and ridgeless thus indicate the presence or absence of the regularization term
∥w∥2 .
In the ridgeless case, the objective is a multiple of the empirical risk $\widehat{R}_S(\Phi)$ in (1.2.3) for the
sample $S = (x_i, y_i)_{i=1}^m$ and the square-loss. Regularization is a common tool in machine learning
to improve model generalization and stability. The goal of this chapter is to get some insight into
the dynamics of Φ(x, wk ) as the parameter vector wk progresses during training. Additionally, we
want to gain some intuition about the influence of regularization, and the behaviour of the trained
model x 7→ Φ(x, wk ) for large k. We do so through the lens of so-called kernel methods. As a
training algorithm we exclusively focus on gradient descent with constant step size.
If Φ(x, w) depends linearly on the parameters w, the objective function (11.0.1c) is convex. As
established in the previous chapter (cf. Remark 10.8), gradient descent then finds a global minimizer.
For typical neural network architectures, w 7→ Φ(x, w) is not linear, and such a statement is in
general not true. Recent research has shown that neural network behavior tends to linearize in w
as network width increases [131]. This allows to transfer some of the results and techniques from
the linear case to the training of neural networks.
We start this chapter in Sections 11.1 and 11.2 by recalling (kernel) least-squares methods,
which describe linear (in w) models. Following [158], the subsequent sections examine why neural

networks exhibit linear-like behavior in the infinite-width limit. In Section 11.3 we introduce the so-
called tangent kernel. Section 11.4 presents abstract results showing, under suitable assumptions,
convergence towards a global minimizer when training the model. Section 11.5 builds on this
analysis and discusses connections to kernel regression with the tangent kernel. In Section 11.6
we then detail the implications for wide neural networks. A similar treatment of these results was
previously given by Telgarsky in [271, Chapter 8] for gradient flow (rather than gradient descent),
based on [49].

11.1 Linear least-squares regression


Arguably one of the simplest machine learning algorithms is linear least-squares regression, e.g.,
[65, 29, 108, 92]. Given data (11.0.1a), we fit a linear function x 7→ Φ(x, w) := x⊤ w by minimizing
f or fλ in (11.0.1). With
 ⊤  
x1 y1
 ..  m×d  .. 
A= . ∈R and y =  .  ∈ Rm (11.1.1)
x⊤
m ym
it holds
f (w) = ∥Aw − y∥2 and fλ (w) = f (w) + λ∥w∥2 . (11.1.2)
The x1 , . . . , xm are referred to as the training points (or design points), and throughout the rest
of Section 11.1, we denote their span by

H̃ := span{x1 , . . . , xm } ⊆ Rd . (11.1.3)

This is the subspace spanned by the rows of A.


Remark 11.1. More generally, the ansatz Φ(x, (w, b)) := w⊤ x + b corresponds to
 
\[
\Phi(x, (w, b)) = (1, x^\top) \begin{pmatrix} b \\ w \end{pmatrix}.
\]
Therefore, additionally allowing for a bias can be treated similarly.

11.1.1 Existence of minimizers


We start with the ridgeless case λ = 0. The model Φ(x, w) = x⊤ w is linear in both x and w.
In particular, w 7→ f (w) is a convex function by Exercise 10.23. If A is invertible, then f has
the unique minimizer w∗ = A−1 y. If rank(A) = d, then f is strongly convex by Exercise 10.23,
and there still exists a unique minimizer. If however rank(A) < d, then ker(A) ̸= {0} and there
exist infinitely many minimizers of f . To guarantee uniqueness, we consider the minimum norm
solution

w∗ := argminw∈M ∥w∥, M := {w ∈ Rd | f (w) ≤ f (v) ∀v ∈ Rd }. (11.1.4)

It’s a standard result that w∗ is well-defined and belongs to the span H̃ of the training points,
e.g., [29, 65, 92]. While one way to prove this is through the pseudoinverse (see Theorem B.2), we
provide an alternative argument here, which can be directly extended to the infinite-dimensional
case as discussed in Section 11.2 ahead.

Theorem 11.2. There is a unique minimum norm solution w∗ ∈ Rd in (11.1.4). It lies in the
subspace H̃, and is the unique minimizer of f in H̃, i.e.

w∗ = argminw̃∈H̃ f (w̃). (11.1.5)

Proof. We start with existence and uniqueness of w∗ ∈ Rd in (11.1.4). Let


n o
C := span Aw w ∈ Rd ⊆ Rm .

Then C is a finite dimensional space, and as such it is closed and convex. Therefore y ∗ =
argminỹ∈C ∥ỹ − y∥ exists and is unique (this is a fundamental property of Hilbert spaces, see
Theorem B.17). In particular, the set M = {w ∈ Rd | Aw = y ∗ } ⊆ Rd of minimizers of f is not
empty. Clearly M ⊆ Rd is closed and convex. As before, w∗ = argminw∈M ∥w∥ exists and is
unique.
It remains to show (11.1.5). Decompose w∗ = w̃ + ŵ with w̃ ∈ H̃ and ŵ ∈ H̃ ⊥ (see Definition
B.15). By definition of A it holds Aw∗ = Aw̃ and f (w∗ ) = f (w̃). Moreover ∥w∗ ∥2 = ∥w̃∥2 +∥ŵ∥2 .
Since w∗ is the minimum norm solution, w∗ = w̃ ∈ H̃. To conclude the proof, we need to show
that w∗ is the only minimizer of f in H̃. Assume there exists a minimizer v of f in H̃ different
from w∗ . Then 0 ̸= w∗ − v ∈ H̃. Thus A(w∗ − v) ̸= 0 and y ∗ = Aw∗ ̸= Av, which contradicts
that v minimizes f .

Next let λ > 0 in (11.1.2). Then minimizing fλ is referred to as ridge regression or Tikhonov
regularized least squares [274, 118, 79, 108]. The next proposition shows that there exists a unique
minimizer of fλ , which is closely connected to the minimum norm solution, e.g. [79, Theorem 5.2].

Theorem 11.3. Let λ > 0. Then, with fλ in (11.1.2), there exists a unique minimizer

w∗,λ := argminw∈Rd fλ (w).

It holds w∗,λ ∈ H̃, and


lim w∗,λ = w∗ . (11.1.6)
λ→0

Proof. According to Exercise 11.29, w 7→ fλ (w) is strongly convex on Rd , and thus also on the
subspace H̃ ⊆ Rd . Therefore, there exists a unique minimizer of fλ in H̃, which we denote by
w∗,λ ∈ H̃. To show that there exists no other minimizer of fλ in Rd , fix w ∈ Rd \H̃ and decompose
w = w̃ + ŵ with w̃ ∈ H̃ and 0 ̸= ŵ ∈ H̃ ⊥ . Then

f (w) = ∥Aw − y∥2 = ∥Aw̃ − y∥2 = f (w̃)

and
∥w∥2 = ∥w̃∥2 + ∥ŵ∥2 > ∥w̃∥2 .

148
Thus fλ (w) > fλ (w̃) ≥ fλ (w∗,λ ).
It remains to show (11.1.6). We have
fλ (w) = (Aw − y)⊤ (Aw − y) + λw⊤ w
= w⊤ (A⊤ A + λI d )w − 2w⊤ A⊤ y,
where I d ∈ Rd×d is the identity matrix. The minimizer is reached at ∇fλ (w) = 0, which yields
w∗,λ = (A⊤ A + λI d )−1 A⊤ y.
Let A = U Σ V ⊤ be the singular value decomposition of A, where Σ ∈ Rm×d contains the nonzero
singular values s1 ≥ · · · ≥ sr > 0, and U ∈ Rm×m , V ∈ Rd×d are orthogonal. Then
w∗,λ = (V (Σ ⊤ Σ + λI d )V ⊤ )−1 V Σ ⊤ U ⊤ y
 s 
1
2
 s1 +λ .. 

=V  . 0 U ⊤ y,

 sr 
 s2r +λ 
0 0
| {z }
∈Rd×m

where 0 stands for a zero block of suitable size. As λ → 0, this converges towards A† y, where A†
denotes the pseudoinverse of A, see (B.1.3). By Theorem B.2, A† y = w∗ .

11.1.2 Gradient descent


Consider gradient descent to minimize the objective fλ in (11.1.2). Starting with w0 ∈ Rd , the
iterative update with constant step size h > 0 reads
wk+1 = wk − 2hA⊤ (Awk − y) − 2hλwk for all k ∈ N0 . (11.1.7)
Let again first λ = 0, i.e. fλ = f . Since f is convex and quadratic, by Remark 10.8 for sufficiently
small step size h > 0 it holds f (wk ) → f (w∗ ) as k → ∞. Is it also true that wk converges to the
minimal norm solution w∗ ∈ H̃? Recall that H̃ is spanned by the columns of A⊤ . Thus, if w0 ∈ H̃,
then by (11.1.7), the iterates wk never leave the subspace H̃. Since there is only one minimizer in
H̃, it follows that wk → w∗ as k → ∞.
This shows that gradient descent, does not find an arbitrary optimum when minimizing f , but
converges towards the minimum norm solution as long as w0 ∈ H̃ (e.g. w0 = 0). It is well-known
[21, Theorem 16], that iterations of type (11.1.7) lead to minimal norm solutions as made more
precise by the next proposition. To state it, we write in the following smax (A) for the maximal
singular value of A, and smin (A) for the minimal positive singular value, with the convention
smin (A) := ∞ in case A = 0. The full proof is left as Exercise 11.28.

Proposition 11.4. Let λ = 0 and fix h ∈ (0, smax (A)−2 ). Let w0 = w̃0 + ŵ0 where w̃0 ∈ H̃ and
ŵ0 ∈ H̃ ⊥ , and let (wk )k∈N be defined by (11.1.7). Then

lim wk = w∗ + ŵ0 .
k→∞

149
Next we consider ridge regression, where λ > 0 in (11.1.2), (11.1.7). The condition on the step
size in the next proposition can be weakened to h ∈ (λ + smax (A)2 )−1 , but we omit doing so for
simplicity.

Proposition 11.5. Let λ > 0, and fix h ∈ (0, (2λ + 2smax (A)2 )−1 ). Let w0 ∈ Rn and let (wk )k∈N
be defined by (11.1.7). Then
lim wk = w∗,λ
k→∞

and
λ
∥w∗ − w∗,λ ∥ ≤ ∥y∥ = O(λ) as λ → 0.
smin (A)3 + smin (A)λ

Proof. By Exercise 10.23, fλ is (2λ + 2smax (A)2 )-smooth, and by Exercise 11.29, fλ is strongly
convex. Thus Theorem 10.7 implies convergence of gradient descent towards the unique minimizer
w∗,λ .
For the bound on the distance to w∗ , assume A ̸= 0 (the case A = 0 is trivial). Expressing w∗
via the pseudoinverse of A (see Appendix B.1) we get
1 
s1
 .. 
† . 0
w∗ = A y = V   U ⊤ y,

 1 
sr
0 0
where A = U Σ V ⊤ is the singular value decomposition of A, and s1 ≥ · · · ≥ sr > 0 denote the
singular values of A. The explicit formula for w∗,λ obtained in the proof of Theorem 11.3 then
yields
si 1
∥w∗ − w∗,λ ∥ ≤ max 2 − ∥y∥.
i≤r si + λ si
This gives the claimed bound.
By Proposition 11.5, if we use ridge regression with a small regularization parameter λ > 0,
then gradient descent converges to a vector w∗,λ which is O(λ) close to the minimal norm solution
w∗ , regardless of the initialization w0 .

11.2 Feature methods and kernel least-squares regression


Linear models are often too simplistic to capture the true relationship between x and y. Feature-
and kernel-based methods (e.g., [57, 244, 108]) address this by replacing x 7→ ⟨x, w⟩ with x 7→
⟨ϕ(x), w⟩ where ϕ : Rd → Rn is a (typically nonlinear) map. This introduces nonlinearity in x
while retaining linearity in the parameter w ∈ Rn .
Example 11.6. Let data (xi , yi )m
i=1 ⊆ R × R be given, and define for x ∈ R

ϕ(x) := (1, x, . . . , xn−1 )⊤ ∈ Rn .


Pn−1
For w ∈ Rn , the model x 7→ ⟨ϕ(x), w⟩ = j=0 wj xj can represent any polynomial of degree n − 1.

150
Let us formalize this idea. For reasons that will become apparent later (see Remark 11.10), it
is useful to allow for the case n = ∞. To this end, let (H, ⟨·, ·⟩H ) be a Hilbert space (see Appendix
B.2.4), referred to as the feature space, and let ϕ : Rd → H denote the feature map. The model
is defined as
Φ(x, w) := ⟨ϕ(x), w⟩H (11.2.1)
with w ∈ H. We may think of H in the following either as Rn for some n ∈ N, or as ℓ2 (N) (see
Example B.12); in this case the components of ϕ are referred to as features. For some λ ≥ 0, the
goal is to minimize the objective
m
X 2
f (w) := ⟨ϕ(xj ), w⟩H − yj or fλ (w) := f (w) + λ∥w∥2H . (11.2.2)
j=1

In analogy to (11.1.3), throughout the rest of Section 11.2 denote by

H̃ := span{ϕ(x1 ), . . . , ϕ(xm )} ⊆ H

the space spanned by the feature vectors at the training points.

11.2.1 Existence of minimizers


We start with the ridgeless case λ = 0 in (11.2.2). To guarantee uniqueness and regularize the
problem, we again consider the minimum norm solution

w∗ := argminw∈M ∥w∥H , M := {w ∈ H | f (w) ≤ f (v) ∀v ∈ H}. (11.2.3)

Theorem 11.7. There is a unique minimum norm solution w∗ ∈ H in (11.2.3). It lies in the
subspace H̃, and is the unique minimizer of f in H̃, i.e.

w∗ = argminw̃∈H̃ f (w̃). (11.2.4)

The proof of Theorem 11.2 is formulated such that it extends verbatim to Theorem 11.7, upon
replacing Rd with H and the matrix A ∈ Rm×d with the linear map

A :H → Rm
w 7→ (⟨ϕ(xi ), w⟩H )m
i=1 .

Similarly, Theorem 11.3 extends to the current setting with small modifications. The key obser-
vation is that by Theorem 11.7, the minimizer is attained in the finite-dimensional subspace H̃.
Selecting a basis for H̃, the proof then proceeds analogously. We leave it to the reader to check
this, see Exercise 11.30. This leads to the following statement.

151
Theorem 11.8. Let λ > 0. Then, with fλ in (11.2.2), there exists a unique minimizer

w∗,λ := argminw∈H fλ (w). (11.2.5)

It holds w∗,λ ∈ H̃, and


lim w∗,λ = w∗ .
λ→0

Statements as in Theorems 11.7 and 11.8, which yield that the minimizer is attained in the
finite dimensional subspace H̃, are known in the literature as representer theorems, [142, 243].

11.2.2 The kernel trick


We now explain the connection to “kernels”. At first glance, minimizing (11.2.2) in the potentially
infinite-dimensional Hilbert space H seems infeasible. However, the so-called kernel trick enables
this computation [31].

Definition 11.9. A symmetric function K : Rd ×Rd → R is called a kernel, if for any x1 , . . . , xk ∈


Rd , k ∈ N, the kernel matrix G = (K(xi , xj ))ki,j=1 ∈ Rk×k is symmetric positive semidefinite.

Given a feature map ϕ : Rd → H, it is easy to check that

K(x, z) := ⟨ϕ(x), ϕ(z)⟩H for all x, z ∈ Rd , (11.2.6)

defines a kernel. The corresponding kernel matrix G ∈ Rm×m is

Gij = ⟨ϕ(xi ), ϕ(xj )⟩H = K(xi , xj ).


Pm
The ansatz w∗ = j=1 αj ϕ(xj ) then turns the optimization problem (11.2.3) into

argminα∈Rm ∥Gα − y∥2 + λα⊤ Gα. (11.2.7)

Such a minimizing α need not bePunique (if G is not regular), however, any such α yields a
minimizer in H̃, and thus w∗,λ = m j=1 αj ϕ(xj ) for any λ ≥ 0 by Theorems 11.7 and 11.8. This
suggests the following algorithm:
Given the well-definedness of w∗,0 := w∗ and w∗,λ for λ ≥ 0, we refer to

x 7→ Φ(x, w∗,λ ) = ⟨ϕ(x), w∗,λ ⟩H

as the (ridge or ridgeless) kernel least-squares estimator. By the above considerations,


its computation neither requires explicit knowledge of the feature map ϕ nor of w∗,λ ∈ H. It is
sufficient to choose a kernel K : Rd × Rd → R and perform all computations in finite dimensional
spaces. This is known as the kernel trick. While Algorithm 2 will not play a role in the rest of
the chapter, we present it here to give a more complete picture.

152
Algorithm 2 Kernel least-squares regression
Input: Data (xi , yi )m d d d
i=1 ∈ R × R, kernel K : R × R → R, regularization parameter λ ≥ 0,
evaluation point x ∈ R d

Output: (Ridge or ridgeless) kernel least squares estimator at x

compute the kernel matrix G = (K(xi , xj ))m


i,j=1
determine a minimizer α ∈ Rm of ∥Gα − y∥2 + λα⊤ Gα
evaluate Φ(x, w∗,λ ) via
m m
* +
X X
Φ(x, w∗,λ ) = ϕ(x), αj ϕ(xj ) = αj K(x, xj )
j=1 H j=1

Remark 11.10. If Ω ⊆ Rd is compact and K : Ω × Ω → R is a continuous kernel, then Mercer’s


theorem implies existence of a Hilbert space H and a feature map ϕ : Rd → H such that

K(x, z) = ⟨ϕ(x), ϕ(z)⟩H for all x, z ∈ Ω,

i.e. K is the corresponding kernel. See for instance [23, Sec. 3.2] or [258, Thm. 4.49].

11.2.3 Gradient descent


In practice we may either minimize fλ in (11.2.2) (in the Hilbert space H) or the objective in
(11.2.7) (in Rm ). We now focus on the former, as this will allow to draw connections to neural
network training in the subsequent sections. In order to use gradient descent, we assume H = Rn
equipped with the Euclidean inner product. Initializing w0 ∈ Rn , gradient descent with constant
step size h > 0 to minimize fλ reads

wk+1 = wk − 2hA⊤ (Awk − y) − 2hλwk for all k ∈ N0 ,

where now
ϕ(x1 )⊤
 

A= ..
.
 
.
ϕ(xm )⊤
This corresponds to the situation discussed in Section 11.1.2.
Let λ = 0. For sufficiently small step size, by Proposition 11.4 for x ∈ Rd

lim Φ(x, wk ) = ⟨ϕ(x), w∗ ⟩ + ⟨ϕ(x), ŵ0 ⟩ , (11.2.8)


k→∞

where
w0 = w̃0 + ŵ0
with w̃0 ∈ H̃ = span{ϕ(x0 ), . . . , ϕ(xm )} ⊆ Rm , and ŵ0 ∈ H̃ ⊥ . For λ = 0, gradient descent thus
yields the ridgeless kernel least squares estimator plus an additional term ⟨ϕ(x), ŵ0 ⟩ depending on
initialization. Notably, on the set

{x ∈ Rd | ϕ(x) ∈ span{ϕ(x1 ), . . . , ϕ(xm )}}, (11.2.9)

153
(11.2.8) always coincides with the ridgeless least squares estimator.
Now let λ > 0. For sufficiently small step size, by Proposition 11.5 for x ∈ Rd

lim Φ(x, wk ) = ⟨ϕ(x), w∗,λ ⟩ = ⟨ϕ(x), w∗ ⟩ + O(λ) as λ → 0.


k→∞

Thus, for λ > 0 gradient descent determines the ridge kernel least-squares estimator regardless of
the initialization. Moreover, for fixed x, the limiting model is O(λ) close to the ridgeless kernel
least-squares estimator.

11.3 Tangent kernel


Consider a general model Φ(x, w) with input x ∈ Rd and parameters w ∈ Rn . The goal is to
minimize the square loss objective (11.0.1b) given data (11.0.1a). Our analysis in this and the
following two sections focuses on the ridgeless case. We will revisit ridge regression in Section
11.6.4, where we consider a simple test example of training a neural network with and without
regularization.
If w 7→ Φ(x, w) is not linear, then unlike in Sections 11.1 and 11.2, the objective function
(11.0.1b) is in general not convex, and most results on first order methods in Chapter 10 are not
directly applicable. We thus simplify the situation by linearizing the model in the parameter w ∈ Rn
around initialization: Fixing w0 ∈ Rn , let

Φlin (x, p) := Φ(x, w0 ) + ∇w Φ(x, w0 )⊤ (p − w0 ) for all w ∈ Rn , (11.3.1)

which is the first order Taylor approximation of Φ around the initial parameter w0 . The parameters
of the linearized model will always be denoted by p ∈ Rn to distinguish them from the parameters
w of the full model. Introduce

δj := yj − Φ(xi , w0 ) + ∇w Φ(xj , w0 )⊤ w0 for all j = 1, . . . , m. (11.3.2)

The square loss objective for the linearized model then reads
m m
lin
X
lin 2
X 2
f (p) := (Φ (xj , p) − yj ) = ⟨∇w Φ(xj , w0 ), p⟩ − δj (11.3.3)
j=1 j=1

where ⟨·, ·⟩ stands for the Euclidean inner product in Rn . Comparing with (11.2.2), minimizing f lin
corresponds to kernel least squares regression with feature map

ϕ(x) = ∇w Φ(x, w0 ) ∈ Rn .

By (11.2.6) the corresponding kernel is

K̂n (x, z) = ⟨∇w Φ(x, w0 ), ∇w Φ(z, w0 )⟩ . (11.3.4)

We refer to K̂n as the empirical tangent kernel, as it arises from the first order Taylor approxi-
mation (the tangent) of the original model Φ around initialization w0 . Note that K̂n depends on
the choice of w0 . For later reference we point out that as explained in Section 11.2.3, minimizing

154
(Φ(x1 , w) − y1 )2

(Φlin (x1 , w) − y1 )2
y1

w0 w0
Φlin (x1 , w)
Φ(x1 , w)

Figure 11.1: Graph of w 7→ Φ(x1 , w) and the linearization w 7→ Φlin (x1 , w) at the initial parameter

w0 , s.t. ∂w Φ(x1 , w0 ) ̸= 0. If Φ and Φlin are close, then there exists w s.t. Φ(x1 , w) = y1 (left). If
the derivatives are also close, the loss (Φ(x1 , w) − y1 )2 is nearly convex in w, and gradient descent
finds a global minimizer (right).

f lin with gradient descent, sufficiently small step size, no regularization, yields a sequence (pk )k∈N0
satisfying

lim Φlin (x, pk ) = ⟨ϕ(x), p∗ ⟩ + ⟨ϕ(x), p̂0 ⟩ , (11.3.5)


k→∞ | {z } | {z }
ridgeless kernel least-squares term depending on initialization,
estimator with kernel K̂n which vanishes at training points

where p̂0 is the projection of the initialization p0 onto span{ϕ(x1 ), . . . , ϕ(xn )}⊥ . In particular, the
second term vanishes at the training points.

11.4 Global minimizers


Consider a general model Φ : Rd × Rn → R, data as in (11.0.1a), and the ridgeless square loss
m
X
f (w) = (Φ(xj , w) − yj )2 .
j=1

In this section we discuss sufficient conditions under which gradient descent converges to a global
minimizer.
The idea is as follows: if w 7→ Φ(x, w) is nonlinear but sufficiently close to its linearization Φlin
in (11.3.1) within some region, the objective function behaves almost like a convex function there.
If the region is large enough to contain both the intial value w0 and a global minimum, then we
expect gradient descent to never leave this (almost convex) basin during training and find a global
minimizer.
To illustrate this, consider Figures 11.1 and 11.2 where we set the number of training samples
to m = 1 and the number of parameters to n = 1. For the above reasoning to hold, the difference
between Φ and Φlin , as well as the difference in their derivatives, must remain small within a
neighborhood of w0 . The neighbourhood should be large enough to contain the global minimizer,
and thus depends critically on two factors: the initial error Φ(x1 , w0 ) − y1 , and the magnitude of

the derivative ∂w Φ(x1 , w0 ).
For general m and n, we now make the required assumptions on Φ precise.

155
Φ(x1 , w) (Φ(x1 , w) − y1 )2
(Φlin (x1 , w) − y1 )2
y1

w0 w0
Φlin (x1 , w)

Figure 11.2: Same as Figure 11.1. If Φ and Φlin are not close, there need not exist w such that
Φ(x1 , w) = y1 , and gradient descent need not converge to a global minimizer.

Assumption 11.11. Let Φ ∈ C 1 (Rd × Rn ) and w0 ∈ Rn . There exist constants r, R, U, L > 0 and
0 < θmin ≤ θmax < ∞ such that ∥xi ∥ ≤ R for all i = 1, . . . , m, and it holds that

(a) the kernel matrix of the empirical tangent kernel


m
(K̂n (xi , xj ))m
i,j=1 = ⟨∇w Φ(xi , w 0 ), ∇w Φ(xj , w 0 )⟩ i,j=1
∈ Rm×m (11.4.1)

is regular and its eigenvalues belong to [θmin , θmax ],

(b) for all x ∈ Rd with ∥x∥ ≤ R

∥∇w Φ(x, w)∥ ≤ U for all w ∈ Br (w0 ) (11.4.2a)


∥∇w Φ(x, w) − ∇w Φ(x, v)∥ ≤ L∥w − v∥ for all w, v ∈ Br (w0 ), (11.4.2b)

(c)
2
√ p
θmin 2 mU f (w0 )
L≤ p and r= . (11.4.3)
12m3/2 U 2 f (w0 ) θmin

Let us give more intuitive explanations of these technical assumptions: (a) implies in particular
that (∇w Φ(xi , w0 )⊤ )mi=1 ∈ R
m×n has full rank m ≤ n (thus we have at least as many parameters

n as training data m). In the context of Figure 11.1, this means that ∂w Φ(x1 , w0 ) ̸= 0 and thus
Φ is a not a constant function. This guarantees existence of p such that Φlin (xi , p) = yi for all
lin

i = 1, . . . , m. Next, (b) formalizes in particular the required closeness of Φ and its linearization
Φlin . For example, since Φlin is the first order Taylor approximation of Φ at w0 ,

|Φ(x, w) − Φlin (x, w)| = |(∇w Φ(x, w̃) − ∇w Φ(x, w0 ))⊤ (w − w0 )| ≤ L∥w − w0 ∥2 ,

for some w̃ in the convex hull of w and w0 . Finally, (c) ties together all constants, ensuring the
full model to be sufficiently close to its linearizationpin a large enough ball of radius r around w0 .
Notably, r may be smaller for smaller initial error f (w0 ) and for larger θmin , which aligns with
our intuition from Figure 11.1.
We are now ready to state the following theorem, which is a variant of [158, Thm. G.1]. The
proof closely follows the arguments given there. In Section 11.6 we will see that the theorem’s
main requirement—Assumption 11.11—is satisfied with high probability for certain (wide) neural
networks.

156
Theorem 11.12. Let Assumption 11.11 hold. Fix a positive learning rate
1
h≤ . (11.4.4)
θmin + θmax
Let (wk )k∈N be generated by gradient descent, i.e. for all k ∈ N0

wk+1 = wk − h∇f (wk ). (11.4.5)

It then holds for all k ∈ N

∥wk − w0 ∥ ≤ r (11.4.6a)
2k
f (wk ) ≤ (1 − hθmin ) f (w0 ). (11.4.6b)

Proof. Denote the model prediction error at the data points by


∇w Φ(x1 , w)⊤
   
Φ(x1 , w) − y1
.. m .. m×n
e(w) :=  ∈R s.t. ∇e(w) =  ∈R
   
. .
Φ(xm , w) − ym ∇w Φ(xm , w)⊤

and with the empirical tangent kernel K̂n in Assumption 11.11 (a)
∇e(w0 )∇e(w0 )⊤ = (K̂n (xi , xj ))m
i,j=1 ∈ R
m×m
. (11.4.7)
By (11.4.2a)
m
X
∥∇e(w)∥2 ≤ ∥∇e(w)∥2F = ∥∇w Φ(xj , w)∥2 ≤ mU 2 . (11.4.8a)
j=1

Similarly, using (11.4.2b)


m
X
2
∥∇e(w) − ∇e(v)∥ ≤ ∥∇w Φ(xj , w) − ∇w Φ(xj , v)∥2
j=1

≤ mL2 ∥w − v∥2 for all w, v ∈ Br (w0 ). (11.4.8b)


Step 1. Denote c := 1 − hθmin ∈ (0, 1). In the remainder of the proof we use induction over k
to show
k−1 k−1
X √ X
∥wj+1 − wj ∥ ≤ 2h mU ∥e(w0 )∥ cj , (11.4.9a)
j=0 j=0

∥e(wk )∥2 ≤ ∥e(w0 )∥2 c2k , (11.4.9b)


P∞
for all k ∈ N0 and where an empty sum is understood as zero. Since, j=0 c
j = (1 − c)−1 , and
p
∥e(w)∥ = f (w), using (11.4.3) we have

√ X √ p 1
2h mU ∥e(w0 )∥ cj = 2h mU f (w0 ) ≤ r, (11.4.10)
hθmin
j=0

157
these inequalities directly imply (11.4.6).
The case k = 0 is trivial. For the induction step, assume (11.4.9) holds for some k ∈ N0 .
Step 2. We show (11.4.9a) for k + 1. The induction assumption (11.4.9a) and (11.4.10) give
wk ∈ Br (w0 ). Next
∇f (wk ) = ∇(e(wk )⊤ e(wk )) = 2∇e(wk )⊤ e(wk ). (11.4.11)
Using the iteration rule (11.4.5) and the bounds (11.4.8a) and (11.4.9b)
∥wk+1 − wk ∥ = 2h∥∇e(wk )⊤ e(wk )∥

≤ 2h mU ∥e(wk )∥

≤ 2h mU ∥e(w0 )∥ck .
This shows (11.4.9a) for k + 1. In particular by (11.4.10)
wk+1 , wk ∈ Br (w0 ). (11.4.12)
Step 3. We show (11.4.9b) for k + 1. Since e : Rn → Rm is continuously differentiable, there
exists w̃k in the convex hull of wk and wk+1 such that
e(wk+1 ) = e(wk ) + ∇e(w̃k )(wk+1 − wk ) = e(wk ) − h∇e(w̃k )∇f (wk ),
and thus by (11.4.11)
e(wk+1 ) = e(wk ) − 2h∇e(w̃k )∇e(wk )⊤ e(wk )
= I m − 2h∇e(w̃k )∇e(wk )⊤ e(wk ),


where I m ∈ Rm×m is the identity matrix. We wish to show that


∥I m − 2h∇e(w̃k )∇e(wk )⊤ ∥ ≤ c, (11.4.13)
which then implies (11.4.9b) for k + 1 and concludes the proof.
Using (11.4.8) and the fact that wk , w̃k ∈ Br (w0 ) by (11.4.12),
∥∇e(w̃k )∇e(wk )⊤ − ∇e(w0 )∇e(w0 )⊤ ∥
≤ ∥∇e(w̃k )∇e(wk )⊤ − ∇e(wk )∇e(wk )⊤ ∥
+ ∥∇e(wk )∇e(wk )⊤ − ∇e(wk )∇e(w0 )⊤ ∥
+ ∥∇e(wk )∇e(w0 )⊤ − ∇e(w0 )∇e(w0 )⊤ ∥
≤ 3mU Lr.
Since the eigenvalues of ∇e(w0 )∇e(w0 )⊤ belong to [θmin , θmax ] by (11.4.7) and Assumption 11.11
(a), as long as h ≤ (θmin + θmax )−1 we have
∥I m − 2h∇e(w̃k )∇e(wk )⊤ ∥ ≤ ∥I m − 2h∇e(w0 )∇e(w0 )⊤ ∥ + 6hmU Lr
≤ 1 − 2hθmin + 6hmU Lr.
Due to (11.4.3)
2
√ p
θmin 2 mU f (w0 )
1 − 2hθmin + 6hmU Lr ≤ 1 − 2hθmin + 6hmU p
12m3/2 U 2 f (w0 ) θmin
= 1 − hθmin = c,
which concludes the proof.

158
Let us emphasize that (11.4.6b) implies that gradient descent (11.4.5) achieves zero loss in the
limit. Consequently, the limiting model interpolates the training data. This shows in particular
convergence to a global minimizer for the (generally nonconvex) optimization problem of minimizing
f (w).

11.5 Proximity to trained linearized model


The analysis in Section 11.4 was based on the observation that the linearization Φlin closely mimics
the behaviour of the full model Φ for parameters with distance of at most r (cf. Assumption 11.11)
to the initial parameter w0 . Theorem 11.12 states that the parameters remain within this range
throughout training. This suggests that the predictions of the trained full model limk→∞ Φ(x, wk ),
are similar to those of the trained linear model limk→∞ Φlin (x, pk ). In this section we formalize
this statement.

11.5.1 Evolution of model predictions


We adopt again the notation Φlin : Rd × Rn → R from (11.3.1) to represent the linearization of
Φ : Rd × Rn → R around w0 . The parameters of the linearized model are represented by p ∈ Rn ,
and the corresponding loss function is written as f lin (p), as in (11.3.3). Additionally, we define
X := (x1 , . . . , xm ) and

Φ(X, w) := (Φ(xi , w))m


i=1 ∈ R
m

Φlin (X, p) := (Φlin (xi , p))m


i=1 ∈ R
m

to denote the predicted values at the training points x1 , . . . , xm for given parameter choices w,
p ∈ Rn . Moreover
∇w Φ(x1 , w)⊤
 

∇w Φ(X, w) =  .. m×n
∈R
 
.
∇w Φ(xm , w)⊤
and similarly for ∇w Φlin (X, w). Given x ∈ Rd , the model predictions at x and X evolve under
gradient descent as follows:

• full model: Initialize w0 ∈ Rn , and set for step size h > 0 and all k ∈ N0

wk+1 = wk − h∇w f (wk ). (11.5.1)

Then
∇w f (w) = ∇w ∥Φ(X, w) − y∥2 = 2∇w Φ(X, w)⊤ (Φ(X, w) − y).
Thus

Φ(x, wk+1 ) = Φ(x, wk ) + (∇w Φ(x, w̃k ))⊤ (wk+1 − wk )


= Φ(x, wk ) − 2h∇w Φ(x, w̃k )⊤ ∇w Φ(X, wk )⊤ (Φ(X, wk ) − y),

where w̃k is in the convex hull of wk and wk+1 . Introducing

Gk (x, X) := ∇w Φ(x, w̃k )⊤ ∇w Φ(X, wk )⊤ ∈ R1×m


(11.5.2)
Gk (X, X) := ∇w Φ(X, w̃k )∇w Φ(X, wk )⊤ ∈ Rm×m

159
this yields

Φ(x, wk+1 ) = Φ(x, wk ) − 2hGk (x, X)(Φ(X, wk ) − y), (11.5.3a)


Φ(X, wk+1 ) = Φ(X, wk ) − 2hGk (X, X)(Φ(X, wk ) − y). (11.5.3b)

• linearized model: Initialize p0 := w0 ∈ Rn , and set for step size h > 0 and all k ∈ N0

pk+1 = pk − h∇p f lin (pk ). (11.5.4)

Then, since ∇p Φlin (x, p) = ∇w Φ(x, w0 ) for any p ∈ Rn ,

∇p f lin (p) = ∇p ∥Φlin (X, p) − y∥2 = 2∇w Φ(X, w0 )⊤ (Φlin (X, p) − y)

and

Φlin (x, pk+1 ) = Φlin (x, pk ) + ∇w Φ(x, w0 )(pk+1 − pk )


= Φlin (x, pk ) − 2h∇w Φ(x, w0 )⊤ ∇w Φ(X, w0 )⊤ (Φlin (X, pk ) − y).

Introducing (cf. (11.4.1))

Glin (x, X) := ∇w Φ(x, w0 )⊤ ∇w Φ(X, w0 )⊤ ∈ R1×m ,


(11.5.5)
Glin (X, X) := ∇w Φ(X, w0 )∇w Φ(X, w0 )⊤ = (K̂n (xi , xj ))m
i,j=1 ∈ R
m×m

this yields

Φlin (x, pk+1 ) = Φlin (x, pk ) − 2hGlin (x, X)(Φlin (X, pk ) − y) (11.5.6a)
lin lin lin lin
Φ (X, pk+1 ) = Φ (X, pk ) − 2hG (X, X)(Φ (X, pk ) − y). (11.5.6b)

The full dynamics (11.5.3) are governed by the k-dependent kernel matrices Gk . In contrast, the
linear model’s dynamics are entirely determined by the initial kernel matrix Glin . The following
corollary gives an upper bound on how much these matrices may deviate during training, [158,
Thm. G.1].

Corollary 11.13. Let w0 = p0 ∈ Rn , and let Assumption 11.11 be satisfied for some
r, R, U, L, θmin , θmax > 0. Let (wk )k∈N , (pk )k∈N be generated by gradient descent (11.5.1), (11.5.4)
with a positive step size
1
h< .
θmin + θmax
Then for all x ∈ Rd with ∥x∥ ≤ R

sup ∥Gk (x, X) − Glin (x, X)∥ ≤ 2 mU Lr, (11.5.7a)
k∈N
sup ∥Gk (X, X) − Glin (X, X)∥ ≤ 2mU Lr. (11.5.7b)
k∈N

160
Proof. By Theorem 11.12 it holds wk ∈ Br (w0 ) for all k ∈ N, and thus also w̃k ∈ Br (w0 ) for w̃k
in the convex hull of wk and wk+1 as in (11.5.2). Using Assumption 11.11 (b), the definitions of
Gk and Glin give

∥Gk (x, X) − Glin (x, X)∥ ≤ ∥∇w Φ(x, w̃k )∥∥∇w Φ(X, wk ) − ∇w Φ(X, w0 )∥
+ ∥∇w Φ(X, w0 )∥∥∇w Φ(x, w̃k ) − ∇w Φ(x, w0 )∥

≤ ( m + 1)U Lr.

The proof for the second inequality is similar.

11.5.2 Limiting model predictions


We begin by stating the main result of this section, which is based on and follows the arguments
in [158, Thm. H.1]. It gives an upper bound on the discrepancy between the full and linearized
models at each training step, and thus in the limit.

Theorem 11.14. Consider the setting of Corollary 11.13, in particular let r, R, θmin , θmax be as
in Assumption 11.11. Then for all x ∈ Rd with ∥x∥ ≤ R

mU 2
 
lin 4 mU Lr p
sup ∥Φ(x, wk ) − Φ (x, pk )∥ ≤ 1+ 2
f (w0 ).
k∈N θmin (hθmin ) (θmin + θmax )

To prove the theorem, we first examine the difference between the full and linearized models on
the training data.

Proposition 11.15. Consider the setting of Corollary 11.13 and set


2mU Lr p
α := f (w0 ).
hθmin (θmin + θmax )

Then for all k ∈ N


∥Φ(X, wk ) − Φlin (X, pk )∥ ≤ αk(1 − hθmin )k−1 .

Proof. Throughout this proof we write for short

Gk = Gk (X, X) and Glin = Glin (X, X),

and set for k ∈ N


ek := Φ(X, wk ) − Φlin (X, pk ).
Subtracting (11.5.6b) from (11.5.3b) we get for k ≥ 0

ek+1 = ek − 2hGk (Φ(X, wk ) − y) + 2hGlin (Φlin (X, pk ) − y)


= (I m − 2hGlin )ek − 2h(Gk − Glin )(Φ(X, wk ) − y)

161
where I m ∈ Rm×m is the identity. Set c := 1 − hθmin . Then by (11.5.7b), (11.4.6b), and because
h < (θmin + θmax )−1 , we can bound the second term by
mU Lr p
∥2h(Gk − Glin )(Φ(X, wk ) − y)∥ ≤ 2 f (w0 ) ck .
θmin + θmax
| {z }
=:α̃

Moreover, assumption 11.11 (a) and h < (θmin + θmax )−1 yield
lin
∥I m − 2hG ∥ ≤ 1 − 2hθmin ≤ c.
P∞
Hence, using j=0 cj = (hθmin )−1
k
X α̃
∥ek+1 ∥ ≤ c∥ek ∥ + α̃ck ≤ · · · ≤ ck ∥e0 ∥ + ck−j α̃cj ≤ ck ∥e0 ∥ + (k + 1)ck .
hθmin
j=0

Since w0 = p0 it holds Φlin (X, p0 ) = Φ(X, w0 ) (cf. (11.3.1)). Thus ∥e0 ∥ = 0 which gives the
statement.
We are now in position to prove the theorem.
Proof (of Theorem 11.14). Throughout this proof we write for short
Gk = Gk (x, X) ∈ R1×m and Glin = Glin (x, X) ∈ R1×m ,
and set for k ∈ N
ek := Φ(x, wk ) − Φlin (x, pk ).
Subtracting (11.5.6a) from (11.5.3a)
ek+1 = ek − 2hGk (Φ(X, wk ) − y) + 2hGlin (Φlin (X, pk ) − y)
= ek − 2h(Gk − Glin )(Φ(X, wk ) − y) + 2hGlin (Φlin (X, pk ) − Φ(X, wk )).
Denote c := 1 − hθmin . By (11.5.7a) and (11.4.6b)

2h∥Gk − Glin ∥ ≤ 4h mU Lr
p
∥Φ(X, wk ) − y∥ ≤ ck f (w0 )
and by (11.4.2a) (cf. (11.5.5)) and Proposition 11.15

2h∥Glin ∥ ≤ 2h mU 2
∥Φ(X, wk ) − Φlin (X, pk )∥ ≤ αkck−1 .
Hence for k ≥ 0
√ p √
|ek+1 | ≤ |ek | + 4h mU Lr f (w0 ) ck + 2h mU 2 α kck−1 .
| {z } | {z }
=:β1 =:β2
j = (1 − c)−1 = (hθmin )−1 and j−1
P P
Repeatedly applying this bound and using j≥0 c j≥0 jc =
(1 − c)−2 = (hθmin )−2
k k
X
j
X β1 β2
|ek+1 | ≤ |e0 | + β1 c + β2 jcj−1 ≤ + .
hθmin (hθmin )2
j=0 j=0

Here we used that due to w0 = p0 it holds Φ(x, w0 ) = Φlin (x, p0 ) so that e0 = 0.

162
11.6 Training dynamics for shallow neural networks
In this section, following [158], we discuss the implications of Theorems 11.12 and 11.14 for wide
neural networks. As in [271], for ease of presentation we focus on a shallow architecture with
only one hidden layer, but stress that similar considerations also hold for deep networks, see the
bibliography section.

11.6.1 Architecture
Let Φ : Rd → R be a neural network of depth one and width n ∈ N of type

Φ(x, w) = v ⊤ σ(U x + b) + c. (11.6.1)

Here x ∈ Rd is the input, and U ∈ Rn×d , v ∈ Rn , b ∈ Rn and c ∈ R are the parameters which we
collect in the vector w = (U , b, v, c) ∈ Rn(d+2)+1 (with U suitably reshaped). For future reference
we note that
∇U Φ(x, w) = (v ⊙ σ ′ (U x + b))x⊤ ∈ Rn×d
∇b Φ(x, w) = v ⊙ σ ′ (U x + b) ∈ Rn
(11.6.2)
∇v Φ(x, w) = σ(U x + b) ∈ Rn
∇c Φ(x, w) = 1 ∈ R,
where ⊙ denotes the Hadamard product. We also write ∇w Φ(x, w) ∈ Rn(d+2)+1 to denote the full
gradient with respect to all parameters.
In practice, it is common to initialize the weights randomly, and in this section we consider
so-called LeCun initialization [156]. The following condition on the distribution W used for this
initialization will be assumed throughout the rest of Section 11.6.
Assumption 11.16. The distribution W on R has expectation zero, variance one, and finite
moments up to order eight.
To explicitly indicate the expectation and variance in the notation, we also write W(0, 1) instead
of W, and for µ ∈ R and ς > 0 we use W(µ, ς 2 ) to denote the corresponding scaled and shifted
measure with expectation µ and variance ς 2 ; thus, if X ∼ W(0, 1) then µ + ςX ∼ W(µ, ς 2 ). LeCun
initialization sets the variance of the weights in each layer to be reciprocal to the input dimension of
the layer: the idea is to normalize the output variance of all network nodes. The initial parameters

w0 = (U 0 , b0 , v 0 , c0 )

are thus randomly initialized with components


 1  1
iid iid
U0;ij ∼ W 0, , v0;i ∼ W 0, , b0;i , c0 = 0, (11.6.3)
d n
independently for all i = 1, . . . , n, j = 1, . . . , d. For a fixed ς > 0 one might choose variances ς 2 /d
and ς 2 /n in (11.6.3), which would require only minor modifications in the rest of this section. Biases
are set to zero for simplicity, with nonzero initialization discussed in the exercises. All expectations
and probabilities in Section 11.6 are understood with respect to this random initialization.
Example 11.17. Typical √ examples
√ for W(0, 1) are the standard normal distribution on R or the
uniform distribution on [− 3, 3]. ♢

163
11.6.2 Neural tangent kernel
We begin our analysis by investigating the empirical tangent kernel
K̂n (x, z) = ⟨∇w Φ(x, w0 ), ∇w Φ(z, w0 )⟩
of the shallow network (11.6.1) with initialization 11.6.3. Scaled properly, it converges in the infinite
width limit n → ∞ towards a specific kernel known as the neural tangent kernel (NTK) [131].
This kernel depends on both the architecture and the initialization scheme. Since we focus on the
specific setting introduced in Section 11.6.1 in the following, we simply denote it by K NTK .

Theorem 11.18. Let R < ∞ such that |σ(x)| ≤ R · (1 + |x|) and |σ ′ (x)| ≤ R · (1 + |x|) for all
iid
x ∈ R. For any x, z ∈ Rd and ui ∼ W(0, 1/d), i = 1, . . . , d, it then holds
1
lim K̂n (x, z) = E[σ(u⊤ x)σ(u⊤ z)] =: K NTK (x, z)
n→∞ n
almost surely.
Moreover, for every δ, ε > 0 there exists n0 (δ, ε, R) ∈ N such that for all n ≥ n0 and all x,
z ∈ Rd with ∥x∥, ∥z∥ ≤ R
" #
1
P K̂n (x, z) − K NTK (x, z) < ε ≥ 1 − δ.
n

Proof. Denote x(1) = U 0 x + b0 ∈ Rn and z (1) = U 0 z + b0 ∈ Rn . Due to the initialization (11.6.3)


and our assumptions on W(0, 1), the components
d
(1)
X
xi = U0;ij xj ∼ u⊤ x i = 1, . . . , n
j=1

are i.i.d. with finite pth moment (independent of n) for all 1 ≤ p ≤ 8. Due to the linear growth bound
(1) (1) (1)
on σ and σ ′ , the same holds for the (σ(xi ))ni=1 and the (σ ′ (xi ))ni=1 . Similarly, the (σ(zi ))ni=1
(1)
and (σ ′ (zi ))ni=1 are collections of i.i.d. random variables with finite pth moment for all 1 ≤ p ≤ 8.
√ iid
Denote ṽi = nv0;i such that ṽi ∼ W(0, 1). By (11.6.2)
n n
1 ⊤ 1 X 2 ′ (1) ′ (1) 1X (1) (1) 1
K̂n (x, z) = (1 + x z) 2 ṽi σ (xi )σ (zi ) + σ(xi )σ(zi ) + .
n n n n
i=1 i=1
Since
n
1 X 2 ′ (1) ′ (1)
ṽi σ (xi )σ (zi ) (11.6.4)
n
i=1
is an average over i.i.d. random variables with finite variance, the law of large numbers implies
almost sure convergence of this expression towards
(1) (1)  (1) (1)
E ṽi2 σ ′ (xi )σ ′ (zi ) = E[ṽi2 ]E[σ ′ (xi )σ ′ (zi )]


= E[σ ′ (u⊤ x)σ ′ (u⊤ z)],

164
(1) (1)
where we used that ṽi2 is independent of σ ′ (xi )σ ′ (zi ). By the same argument
n
1X (1) (1)
σ(xi )σ(zi ) → E[σ(u⊤ x)σ(u⊤ z)]
n
i=1

almost surely as n → ∞. This shows the first statement.


The existence of n0 follows similarly by an application of Theorem A.23.
Example 11.19 (K NTK for ReLU). Let σ(x) = max{0, x} and let W(0, 1) be the standard normal
distribution. For x, z ∈ Rd denote by
 ⊤ 
x z
ϑ = arccos
∥x∥∥z∥
iid
the angle between these vectors. Then according to [50, Appendix A], it holds with ui ∼ W(0, 1),
i = 1, . . . , d,
∥x∥∥z∥
K NTK (x, z) = E[σ(u⊤ x)σ(u⊤ z)] = (sin(ϑ) + (π − ϑ) cos(ϑ)).
2πd

11.6.3 Training dynamics and model predictions


We now proceed as in [158, Appendix G], to show that the analysis in Sections 11.4-11.5 is ap-
plicable to the wide neural network (11.6.1) with high probability under random initialization
(11.6.3). We work under the following assumptions on the activation function and training data
[158, Assumptions 1-4].
NTK ≤ θ NTK < ∞ such that
Assumption 11.20. There exist 1 ≤ R < ∞ and 0 < θmin max

(a) σ : R → R belongs to C 1 (R) and |σ(0)|, Lip(σ), Lip(σ ′ ) ≤ R,


(b) ∥xi ∥, |yi | ≤ R for all training data (xi , yi ) ∈ Rd × R, i = 1, . . . , m,
(c) the kernel matrix of the neural tangent kernel
(K NTK (xi , xj ))m
i,j=1 ∈ R
m×m

NTK , θ NTK ].
is regular and its eigenvalues belong to [θmin max

We start by showing Assumption 11.11 (a) for the present setting. More precisely, we give
bounds for the eigenvalues of the empirical tangent kernel.

Lemma 11.21. Let Assumption 11.20 be satisfied. Then for every δ > 0 there exists
NTK , m, R) ∈ R such that for all n ≥ n it holds with probability at least 1 − δ that all
n0 (δ, θmin 0
eigenvalues of
m
(K̂n (xi , xj ))m
i,j=1 = ⟨∇w Φ(xi , w 0 ), ∇w Φ(xj , w 0 )⟩ i,j=1 ∈ R
m×m

NTK /2, 2nθ NTK ].


belong to [nθmin max

165
NTK :=
Proof. Denote Ĝn := (K̂n (xi , xj ))m
i,j=1 and G (K NTK (xi , xj ))m
i,j=1 . By Theorem 11.18,
there exists n0 such that for all n ≥ n0 holds with probability at least 1 − δ that
1 θNTK
GNTK − Ĝn ≤ min .
n 2
Assuming this bound to hold
1 1 θNTK θNTK θNTK
∥Ĝn ∥ = sup ∥Ĝn a∥ ≥ infm ∥GNTK a∥ − min ≥ θmin
NTK
− min ≥ min ,
n a∈Rm n a∈R 2 2 2
∥a∥=1 ∥a∥=1

where we have used that θminNTK is the smallest eigenvalue, and thus singular value, of the symmetric
NTK
positive definite matrix G . This shows that (with probability at least δ) the smallest eigenvalue
NTK
of Ĝn is larger or equal to nθmin /2. Similarly, we conclude that the largest eigenvalue is bounded
from above by n(θmaxNTK + θ NTK /2) ≤ 2nθ NTK . This concludes the proof.
min max

Next we check Assumption 11.11 (b). To this end we first bound the norm of a random matrix.

iid
Lemma 11.22. Let W(0, 1) be as in Assumption 11.16, and let W ∈ Rn×d with Wij ∼ W(0, 1).
Denote the fourth moment of W(0, 1) by µ4 . Then
h p i dµ4
P ∥W ∥ ≤ n(d + 1) ≥ 1 − .
n

Proof. It holds
n X
X d 1/2
∥W ∥ ≤ ∥W ∥F = Wij2 .
i=1 j=1

The αi := dj=1 Wij2 , i = 1, . . . , n, are i.i.d. distributed with expectation d and finite variance dC,
P

where C ≤ µ4 is the variance of W11 2 . By Theorem A.23

h i n
h1 X i h 1Xn i dµ
4
p
P ∥W ∥ > n(d + 1) ≤ P αi > d + 1 ≤ P αi − d > 1 ≤ ,
n n n
i=1 i=1

which concludes the proof.

Lemma 11.23. Let Assumption 11.20 (a) be satisfied with some constant R. Then there exists
M (R), and for all c, δ > 0 there exists n0 (c, d, δ) ∈ N such that for all n ≥ n0 it holds with
probability at least 1 − δ that

∥∇w Φ(x, w)∥ ≤ M n for all w ∈ Bcn−1/2 (w0 )

∥∇w Φ(x, w) − ∇w Φ(x, v)∥ ≤ M n∥w − v∥ for all w, v ∈ Bcn−1/2 (w0 )

for all x ∈ Rd with ∥x∥ ≤ R.

166
Proof. Due to the initialization (11.6.3), by Lemma 11.22 we can find ñ0 (δ, d) such that for all
n ≥ ñ0 holds with probability at least 1 − δ that

∥v 0 ∥ ≤ 2 and ∥U 0 ∥ ≤ 2 n. (11.6.5)

For the rest of this proof we let x ∈ Rd arbitrary with ∥x∥ ≤ R, we set

n0 := max{c2 , ñ0 (δ, d)}

and we fix n ≥ n0 so that n−1/2 c ≤ 1. To prove the lemma we need to show that the claimed
inequalities hold as long as (11.6.5) is satisfied. We will several times use that for all p, q ∈ Rn

∥p ⊙ q∥ ≤ ∥p∥∥q∥ and ∥σ(p)∥ ≤ R n + R∥p∥

since |σ(x)| ≤ R · (1 + |x|). The same holds for σ ′ .


Step 1. We show the bound on the gradient. Fix

w = (U , b, v, c) s.t. ∥w − w0 ∥ ≤ cn−1/2 .

Using formula (11.6.2) for ∇b Φ, the fact that b0 = 0 by (11.6.3), and the above inequalities

∥∇b Φ(x, w)∥ ≤ ∥∇b Φ(x, w0 )∥ + ∥∇b Φ(x, w) − ∇b Φ(x, w0 )∥


= ∥v 0 ⊙ σ ′ (U 0 x)∥ + ∥v ⊙ σ ′ (U x + b) − v 0 ⊙ σ ′ (U 0 x)∥
√ √
≤ 2(R n + 2R2 n) + ∥v ⊙ σ ′ (U x + b) − v 0 ⊙ σ ′ (U 0 x)∥. (11.6.6)

Due to √ √
∥U ∥ ≤ ∥U 0 ∥ + ∥U 0 − U ∥F ≤ 2 n + cn−1/2 ≤ 3 n, (11.6.7)
and using the fact that σ ′ has Lipschitz constant R, the last norm in (11.6.6) is bounded by

∥(v − v 0 ) ⊙ σ ′ (U x + b)∥ + ∥v 0 ⊙ (σ ′ (U x + b) − σ ′ (U 0 x))∥



≤ cn−1/2 (R n + R · (∥U ∥∥x∥ + ∥b∥)) + 2R · (∥U − U 0 ∥∥x∥ + ∥b∥)
√ √
≤ R n + 3 nR2 + cn−1/2 R + 2R · (cn−1/2 R + cn−1/2 )

≤ n(4R + 5R2 )

and therefore √
∥∇b Φ(x, w)∥ ≤ n(6R + 9R2 ).
For the gradient with respect to U we use ∇U Φ(x, w) = ∇b Φ(x, w)x⊤ , so that

∥∇U Φ(x, w)∥F = ∥∇b Φ(x, w)x⊤ ∥F = ∥∇b Φ(x, w)∥∥x∥ ≤ n(6R2 + 9R3 ).

Next

∥∇v Φ(x, w)∥ = ∥σ(U x + b)∥



≤ R n + R∥U x + b∥
√ √
≤ R n + R · (3 nR + cn−1/2 )

≤ n(2R + 3R2 ),

167
and finally ∇c Φ(x, w) = 1. In all, with M1 (R) := (1 + 8R + 12R2 )

∥∇w Φ(x, w̃)∥ ≤ nM1 (R).

Step 2. We show Lipschitz continuity. Fix

w = (U , b, v, c) and w̃ = (Ũ , b̃, ṽ, c̃)

such that ∥w − w0 ∥, ∥w̃ − w0 ∥ ≤ cn−1/2 . Then

∥∇b Φ(x, w) − ∇b Φ(x, w̃)∥ = ∥v ⊙ σ ′ (U x + b) − ṽ ⊙ σ ′ (Ũ x + b̃)∥.

Using ∥ṽ∥ ≤ ∥v 0 ∥ + cn−1/2 ≤ 3 and (11.6.7), this term is bounded by

∥(v − ṽ) ⊙ σ ′ (U x + b)∥ + ∥ṽ ⊙ (σ ′ (U x + b) − σ ′ (Ũ x + b̃))∥



≤ ∥v − ṽ∥(R n + R · (∥U ∥∥x∥ + ∥b∥)) + 3R · (∥x∥∥U − Ũ ∥ + ∥b − b̃∥)

≤ ∥w − w̃∥ n(5R + 6R2 ).

For ∇U Φ(x, w) we obtain similar as in Step 1

∥∇U Φ(x, w) − ∇U Φ(x, w̃)∥F = ∥x∥∥∇b Φ(x, w) − ∇b Φ(x, w̃)∥



≤ ∥w − w̃∥ n(5R2 + 6R3 ).

Next

∥∇v Φ(x, w) − ∇v Φ(x, w̃)∥ = ∥σ(U x + b) − σ(Ũ x − b̃)∥


≤ R · (∥U − Ũ ∥∥x∥ + ∥b − b̃∥)
≤ ∥w − w̃∥(R2 + R)

and finally ∇c Φ(x, w) = 1 is constant. With M2 (R) := R + 6R2 + 6R3 this shows

∥∇w Φ(x, w) − ∇w Φ(x, w̃)∥ ≤ nM2 (R)∥w − w̃∥.

In all, this concludes the proof with M (R) := max{M1 (R), M2 (R)}.

Next, we show that the initial error f (w0 ) remains bounded with high probability.

Lemma 11.24. Let Assumption 11.20 (a), (b) be satisfied. Then for every δ > 0 exists
R0 (δ, m, R) > 0 such that for all n ∈ N

P[f (w0 ) ≤ R0 ] ≥ 1 − δ.

√ iid
Proof. Let i ∈ {1, . . . , m}, and set α := U 0 xi and ṽj := nv0;j for j = 1, . . . , n, so that ṽj ∼
W(0, 1). Then
n
1 X
Φ(xi , w0 ) = √ ṽj σ(αj ).
n
j=1

168
By Assumption 11.16 and (11.6.3), the ṽj σ(αj ), j = 1, . . . , n, are i.i.d. centered random variables
with finite variance bounded by a constant C(R) independent of n. Thus the variance of Φ(xi , w0 )
is also bounded by C(R). By Chebyshev’s inequality, see Lemma A.22, for every k > 0
√ 1
P[|Φ(xi , w0 )| ≥ k C] ≤ 2 .
k
p
Setting k = m/δ
m
hX √ i Xm h √ i
P |Φ(xi , w0 ) − yi |2 ≥ m(k C + R)2 ≤ P |Φ(xi , w0 ) − yi | ≥ k C + R
i=1 i=1
m
X h √ i
≤ P |Φ(xi , w0 )| ≥ k C ≤ δ,
i=1
p
which shows the claim with R0 = m · ( Cm/δ + R)2 .

The next theorem, which corresponds to [158, Thms. G.1 and H.1], is the main result of this
section. It summarizes our findings in the present setting for a shallow network of width n: with high
probability, gradient descent converges to a global minimizer and the limiting network interpolates
the data. During training the network weights remain close to initialization. The trained network
gives predictions that are O(n−1/2 ) close to the predictions of the trained linearized model. In the
statement of the theorem we denote again by Φlin the linearization of Φ defined in (11.3.1), and by
f lin , f , the corresponding square loss objectives defined in (11.0.1b), (11.3.3), respectively.

Theorem 11.25. Let Assumption 11.20 be satisfied, and let the parameters w0 of the width-n
neural network Φ in (11.6.1) be initialized according to (11.6.3). Fix the learning rate
1 1
h= NTK NTK
,
θmin + 4θmax n

set p0 := w0 and let for all k ∈ N0

wk+1 = wk − h∇f (wk ) and pk+1 = pk − h∇f lin (pk ).

Then for every δ > 0 there exist C > 0, n0 ∈ N such that for all n ≥ n0 it holds with probability
at least 1 − δ that for all k ∈ N and all x ∈ Rd with ∥x∥ ≤ R

C
∥wk − w0 ∥ ≤ √ (11.6.8a)
n
 hn 2k
f (wk ) ≤ C 1 − NTK
(11.6.8b)
2θmin
C
∥Φ(x, wk ) − Φlin (x, pk )∥ ≤ √ . (11.6.8c)
n

Proof. We wish to apply Theorems 11.12 and 11.14. This requires Assumption 11.11 to be satisfied.

169
p Fix δ > 0 and let R0 (δ) be as in Lemma 11.24, so that with probability at least 1 − δ/2 it holds
f (w0 ) ≤ R0 . Next, let M = M (R) be as in Lemma 11.23, and fix n0,1 ∈ N and c > 0 so large
that for all n ≥ n0,1
√ √
√ n2 (θmin
NTK /2)2
−1/2 2 mM n p
M n≤ √ and cn = NTK
R0 . (11.6.9)
12m3/2 M 2 n R0 nθmin
By Lemma 11.21 and 11.23, we can then find n0,2 such that for all n ≥ n0,2 with probability at
least 1 − δ/2 we have that Assumption 11.11 (a), (b) holds with the values
√ √ NTK
nθmin
L = M n, U = M n, r = cn−1/2 , θmin = , NTK
θmax = 2nθmax . (11.6.10)
2
Together with (11.6.9), this shows that Assumption 11.11 holds with probability at least 1 − δ as
long as n ≥ n0 := max{n0,1 , n0,2 }.
Inequalities (11.6.8a), (11.6.8b) are then a direct consequence of Theorem 11.12. For (11.6.8c),
we plug the values of (11.6.10) into the bound in Theorem 11.14 to obtain for k ∈ N

mU 2
 
lin 4 mU Lr p
∥Φ(x, wk ) − Φ (x, pk )∥ ≤ 1+ f (w0 )
θmin (hθmin )2 (θmin + θmax )
C1 p
≤ √ (1 + C2 ) f (w0 ),
n
LC , θ LC but independent of n.
for some constants C1 , C2 depending on m, M , c, θmin max

Note that the convergence rate in (11.6.10) does not improve as n grows, since h is bounded by
a constant times 1/n.

11.6.4 Connection to kernel least-squares and Gaussian processes


Theorem 11.25 establishes that the trained neural network mirrors the behaviour of the trained
linearized model. As pointed out in Section 11.3, the prediction of the trained linearized model
corresponds to the ridgeless least squares estimator plus a term depending on the choice of random
initialization w0 ∈ Rn . We should thus understand both the model at initialization x 7→ Φ(x, w0 )
and the model after training x 7→ Φ(x, wk ), as random draws of a certain distribution over func-
tions. To explain this further, we introduce Gaussian processes.

Definition 11.26. Let (Ω, A, P) be a probability space (see Section A.1), and let g : Rd × Ω →
R. We call g a Gaussian process with mean function µ : Rd → R and covariance function
c : Rd × Rd → R if

(a) for each x ∈ Rd it holds that ω 7→ g(x, ω) is a random variable,

(b) for all k ∈ N and all x1 , . . . , xk ∈ Rd the random variables g(x1 , ·), . . . , g(xk , ·) are jointly
Gaussian distributed with
 
(g(x1 , ω), . . . , g(xk , ω)) ∼ N µ(xi )ki=1 , (c(xi , xj ))ki,j=1 .

170
In words, g is a Gaussian process, if ω 7→ g(x, ω) defines a collection of random variables indexed
over x ∈ Rd , and the joint distribution of (g(x1 , ·))kj=1 is a Gaussian whose mean and variance are
determined by µ and c respectively. Fixing ω ∈ Ω, we can then interpret x 7→ g(x, ω) as a random
draw from a distribution over functions.
As first observed in [186], certain neural networks at initialization tend to Gaussian processes
in the infinite width limit.

Proposition 11.27. Let |σ(x)| ≤ R(1 + |x|)4 for all x ∈ R. Consider depth-n networks Φ as in
(11.6.1) with initialization (11.6.3). Let K NTK : Rd × Rd be as in Theorem 11.18.
Then for all distinct x1 , . . . , xk ∈ Rd it holds that

lim (Φ(x1 , w0 ), . . . , Φ(xk , w0 )) ∼ N(0, (K NTK (xi , xj ))ki,j=1 )


n→∞

with convergence in distribution.

√ iid
Proof. Set ṽi := nv0,i and ũi = (U0,i1 , . . . , U0,id ) ∈ Rd , so that ṽi ∼ W(0, 1), and the ũi ∈ Rd are
also i.i.d., with each component distributed according to W(0, 1/d).
Then for any x1 , . . . , xk

ṽi σ(ũ⊤
 
i x1 )
.. k
Z i :=  ∈R i = 1, . . . , n,
 
.
ṽi σ(ũ⊤
i xk )

defines n centered i.i.d. vectors in Rk with finite second moments (here we use the assumption on
σ and the fact that W(0, 1) has finite moments of order 8 by Assumption 11.16). By the central
limit theorem, see Theorem A.25,
 
Φ(x1 , w0 ) n
.. 1 X
= √ Zi
 
 . n
Φ(xk , w0 ) j=1

converges in distribution to N(0, C), where

Cij = E[ṽ12 σ(ũ⊤ ⊤ ⊤ ⊤


1 xi )σ(ũ1 xj )] = E[σ(ũ1 xi )σ(ũ1 xj )] = K
NTK
(xi , xj ).

This concludes the proof.

In the sense of Proposition 11.27, the network Φ(x, w0 ) converges to a Gaussian process as
the width n tends to infinity. It can also be shown that the linearized network after training
corresponds to a Gaussian process, with a mean and covariance function that depend on the data,
architecture, and initialization. Since the full and linearized models coincide in the infinite width
limit (see Theorem 11.25) we can infer that wide networks post-training resemble draws from a
Gaussian process, see [158, Section 2.3.1] and [61].

171
To motivate the mean function of this Gaussian process, we informally take an expectation
(over random initialization) of (11.3.5), yielding
h i
E lim Φlin (x, pk ) = ⟨ϕ(x), p∗ ⟩ + E[⟨ϕ(x), ŵ0 ⟩] .
k→∞ | {z } | {z }
ridgeless kernel least-squares =0
estimator with kernel K̂n

Here the second term vanishes, because ŵ0 is the orthogonal projection of the centered random
variable w0 onto a subspace, so that E[ŵ0 ] = 0. This suggests, that after training for a long time,
the mean of the trained linearized model resembles the ridgeless kernel least-squares estimator with
kernel K̂n . Since K̂n /n converges to K NTK by Theorem 11.18, and a scaling of the kernel by 1/n
does not affect the corresponding kernel least-squares estimator, we expect that for large widths n
and large k
h i h i
E Φ(x, wk ) ≃ E Φlin (x, pk ) ≃ ridgeless kernel least-squares estimator
with kernel K NTK evaluated at x
. (11.6.11)

In words, after sufficient training, the mean (over random initializations) of the trained neural
network x 7→ Φ(x, wk ) resembles the kernel least-squares estimator with kernel K NTK . Thus, under
these assumptions, we obtain an explicit characterization of what the network prediction looks like
after training with gradient descent. For more details and a characterization of the corresponding
covariance function, we refer again to [158, Section 2.3.1].
Let us now consider a numerical experiment to visualize this observation. In Figure 11.3 we
plot 80 different realizations of a neural network before and after training, i.e. the functions
x 7→ Φ(x, w0 ) and x 7→ Φ(x, wk ). (11.6.12)
The architecture was chosen as in (11.6.1) with activation function σ = arctan(x), width n = 250
and initialization
 3  3
iid iid iid
U0;ij ∼ N 0, , v0;i ∼ N 0, , b0;i , c0 ∼ N(0, 2). (11.6.13)
d n
The network was trained on the ridgeless square loss
m
X
f (w) = (Φ(xj , w) − yj )2 ,
j=1

and a dataset of size m = 3 with k = 5000 steps of gradient descent and constant step size h = 1/n.
Before training, the network’s outputs resemble random draws from a Gaussian process with a
constant zero mean function. Post-training, the outputs show minimal variance at the training
points, since they essentially interpolate the data, as can be expected due to Theorem 11.25—
specifically (11.6.8b). Outside of the training points, we observe increased variance stemming from
the second term in (11.3.5). The mean should be close to the ridgeless kernel least squares estimator
with kernel K NTK by (11.6.11).
Figure 11.4 shows realizations of the network trained with ridge regularization, i.e. using the
loss function (11.0.1c). Initialization and all hyperparameters match those in Figure 11.3, with the
regularization parameter λ set to 0.01. For a linear model, the prediction after training with ridge
regularization is expected to exhibit reduced randomness, as the trained model is O(λ) close to the
ridgeless kernel least-squares estimator (see Section 11.2.3). We emphasize that Theorem 11.14,
showing closeness of the trained linearized and full model, does not encompass ridge regularization,
however in this example we observe a similar effect.

172
2 2

1 1

0 0

1 1

2 2
3 2 1 0 1 2 3 3 2 1 0 1 2 3
(a) before training (b) after training without regularization

Figure 11.3: 80 realizations of a neural network at initialization (a) and after training without
regularization on the red data points (b). The dashed line shows the mean. Figure based on [131,
Fig. 2], [158, Fig. 2].

11.6.5 Role of initialization


iid
Consider the gradient ∇w Φ(x, w0 ) as in (11.6.2) with LeCun initialization (11.6.3), so that v0;i ∼
iid
W(0, 1/n) and U0;ij ∼ W(0, 1/d). For the gradient norms in terms of the width n we obtain

E[∥∇U Φ(x, w0 )∥2 ] = E[∥(v 0 ⊙ σ ′ (U 0 x))x⊤ ∥2 ] = O(1)


E[∥∇b Φ(x, w0 )∥2 ] = E[∥v 0 ⊙ σ ′ (U 0 x)∥2 ] = O(1)
2 2
E[∥∇v Φ(x, w0 )∥ ] = E[∥σ(U 0 x)∥ ] = O(n)
E[∥∇c Φ(x, w0 )∥2 ] = E[|1|] = O(1).

Due to this different scaling, gradient descent with step size O(n−1 ) as in Theorem 11.25, will
primarily adjust the weigths v in the output layer, while only slightly modifying the remaining
parameters U , b, and c. This is also reflected in the expression for the obtained kernel K NTK
computed in Theorem 11.18, which corresponds only to the contribution of the term ⟨∇v Φ, ∇v Φ⟩.
LeCun initialization [156], sets the variance of the weight initialization inversely proportional to
the input dimension of each layer, so that the variance of all node outputs remains stable and does
not blow up as the width increases; also see [111]. However, it does not normalize the backward
dynamics, i.e., it does not ensure that the gradients with respect to the parameters have similar
variance. To balance the normalization of both the forward and backward dynamics, Glorot and
Bengio proposed a normalized initialization, where the variance is chosen inversely proportional to
the sum of the input and output dimensions of each layer [90]. We emphasize that the choice of
initialization strongly affects the neural tangent kernel (NTK) and, consequently, the predictions
of the trained network. For an initialization that explicitly normalizes the backward dynamics, we
refer to the original NTK paper [131].

173
2

2
3 2 1 0 1 2 3

Figure 11.4: 80 realizations of the neural network in Figure 11.3 after training on the red data
points with added ridge regularization. The dashed line shows the mean.

Bibliography and further reading


The discussion on linear and kernel least-squares in Sections 11.1 and 11.2 is mostly standard,
and can similarly be found in many textbooks, e.g., [108, 250, 178]; moreover, for more details on
least-squares problems and algorithms see [92, 29, 35, 194], for iterative algorithms to compute the
pseudoinverse, e.g., [21, 210], and for regularization of ill-posed problems, e.g., [79]. The representer
theorem was originally introduced in [142], with a more general version presented in [243]. For an
easily accessible formulation, see, e.g., [250, Theorem 16.1]. The kernel trick is commonly attributed
to [31], see also [2, 56]. For more details on kernel methods we refer to [57, 244, 298]. For recent
works regarding in particular generalization properties of kernel ridgeless regression see for instance
[164, 106, 16].
The neural tangent kernel and its connection to the training dynamics was first investigated
in [131]. Since then, many works have extended this idea and presented differing perspectives
on the topic, see for instance [4, 73, 8, 49]. Our presentation in Sections 11.3–11.6 is based on
and closely follows [158], especially for the main results in these sections, where we adhere to the
arguments in this paper. Moreover, a similar treatment of these results for gradient flow (rather
than gradient descent) was given in [271] based on [49]; in particular, as in [271], we only consider
shallow networks and first provide an abstract analysis valid for arbitrary function parametrizations
before specifying to neural networks. The paper [158] and some of the other references cited above
also address the case of deep architectures, which are omitted here. The explicit formula for the
NTK of ReLU networks as presented in Example 11.19 was given in [50].
The observation that neural networks at initialization behave like Gaussian processes discussed
in Section 11.6.4 was first made in [186]. For a general reference on Gaussian processes see the
textbook [225]. When only training the last layer of a network (in which the network is affine
linear), there are strong links to random feature methods [222]. Recent developements on this
topic can also be found in the literature under the name “Neural network Gaussian processes”, or
NNGPs for short [157, 62].

174
Exercises
Exercise 11.28. Prove Proposition 11.4.
Hint: Assume first that w0 ∈ ker(A)⊥ (i.e. w0 ∈ H̃). For rank(A) < d, using wk = wk−1 −
h∇f (wk−1 ) and the singular value
P decomposition of A, write down an explicit formula for wk .
Observe that due to 1/(1 − x) = k∈N0 xk for all x ∈ (0, 1) it holds wk → A† y as k → ∞, where
A† is the Moore-Penrose pseudoinverse of A.

Exercise 11.29. Let A ∈ Rd×d be symmetric positive semidefinite, b ∈ Rd , and c ∈ R. Let for
λ>0
f (w) := w⊤ Aw + b⊤ w + c and fλ (w) := f (w) + λ∥w∥2 .
Show that f is 2λ-strongly convex.
Hint: Use Exercise 10.23.

Exercise 11.30. Let (H, ⟨·, ·⟩H ) be a Hilbert space, and ϕ : Rd → H a mapping. Given (xj , yj )m
j=1 ∈
Rd × R, for λ > 0 denote
m
X 2
fλ (w) := ⟨ϕ(xj ), w⟩H − yj + λ∥w∥2H for all w ∈ H.
j=1

Prove that fλ has a unique minimizer w∗,λ ∈ H, that w∗,λ ∈ H̃ := span{ϕ(x1 ), . . . , ϕ(xm )}, and
that limλ→0 w∗,λ = w∗ , where w∗ is as in (11.2.3).
Hint: Assuming existence of w∗,λ , first show that w∗,λ belongs to the finite dimensional space
H̃. Now express w∗,λ in terms of an orthonormal basis of H̃, and prove that w∗,λ → w∗ .

Exercise 11.31. Let xi ∈ Rd , i = 1, . . . , m. Show that there exists a “feature map” ϕ : Rd → Rm ,


such that for any configuration of labels yi ∈ {−1, 1}, there always exists a hyperplane in Rm
separating the two sets {ϕ(xi ) | yi = 1} and {ϕ(xi ) | yi = −1}.

Exercise 11.32. Let n ∈ N and consider the polynomial kernel K : Rd × Rd → R, K(x, z) = (1 +


x⊤ z)n . Find a Hilbert space H and a feature map ϕ : Rd → H, such that K(x, z) = ⟨ϕ(x), ϕ(z)⟩H .
Hint: Use the multinomial formula.

Exercise 11.33. Consider the radial basis function (RBF) kernel K : R × R → R, K(x, z) :=
exp(−(x − z)2 ). Find a Hilbert space H and a feature map ϕ : R → H such that K(x, x′ ) =
⟨ϕ(x), ϕ(z)⟩H .

Exercise 11.34. Consider the network (11.6.1) with LeCun initialization as in (11.6.3), but with
the biases instead initialized as
iid
c, bi ∼ W(0, 1) for all i = 1, . . . , n. (11.6.14)

Compute the corresponding NTK as in Theorem 11.25.

175
Chapter 12

Loss landscape analysis

In Chapter 10, we saw how the weights of neural networks get adapted during training, using, e.g.,
variants of gradient descent. For certain cases, including the wide networks considered in Chapter
11, the corresponding iterative scheme converges to a global minimizer. In general, this is not
guaranteed, and gradient descent can for instance get stuck in non-global minima or saddle points.
To get a better understanding of these situations, in this chapter we discuss the so-called loss
landscape. This term refers to the graph of the empirical risk as a function of the weights. We
give a more rigorous definition below, and first introduce notation for neural networks and their
realizations for a fixed architecture.

Definition 12.1. Let A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 , let σ : R → R be an activation function, and
let B > 0. We denote the set of neural networks Φ with L layers, layer widths d0 , d1 , . . . , dL+1 , all
weights bounded in modulus by B, and using the activation function σ by N (σ; A, B). Additionally,
we define
L
×
 
PN (A, B) := [−B, B]dℓ+1 ×dℓ × [−B, B]dℓ+1 ,
ℓ=0

and the realization map

Rσ : PN (A, B) → N (σ; A, B)
(12.0.1)
(W (ℓ) , b(ℓ) )L
ℓ=0 7→ Φ,

where Φ is the neural network with weights and biases given by (W (ℓ) , b(ℓ) )L
ℓ=0 .

PL
Throughout, we will identify PN (A, B) with the cube [−B, B]nA , where nA := ℓ=0 dℓ+1 (dℓ +
1). Now we can introduce the loss landscape of a neural network architecture.

Definition 12.2. Let A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 , let σ : R → R. Let m ∈ N, and S =


(xi , y i )m d0
i=1 ∈ (R × R
dL+1 )m be a sample and let L be a loss function. Then, the loss landscape

176
l minima
ca
lo

le points
dd
sa

sh
arp
Hig l min minimum
k ba ima
he
mpirical ris
o
gl

Figure 12.1: Two-dimensional section of a loss landscape. The loss landscape shows a spurious
valley with local minima, global minima, as well as a region where saddle points appear. Moreover,
a sharp minimum is shown.

is the graph of the function Λ_{A,σ,S,L} defined as

    Λ_{A,σ,S,L} : PN(A, ∞) → R,   θ ↦ R̂_S(Rσ(θ)),

with R̂_S as in (1.2.3) and Rσ as in (12.0.1).

Identifying PN (A, ∞) with RnA , we can consider ΛA,σ,S,L as a map on RnA and the loss
landscape is a subset of RnA × R. The loss landscape is a high-dimensional surface, with hills and
valleys. For visualization a two-dimensional section of a loss landscape is shown in Figure 12.1.
Questions of interest regarding the loss landscape include for example: How likely is it that we
find local instead of global minima? Are these local minima typically sharp, having small volume,
or are they part of large flat valleys that are difficult to escape? How bad is it to end up in a local
minimum? Are most local minima as deep as the global minimum, or can they be significantly
higher? How rough is the surface generally, and how do these characteristics depend on the network
architecture? While providing complete answers to these questions is hard in general, in the rest
of this chapter we give some intuition and mathematical insights for specific cases.

12.1 Visualization of loss landscapes
Visualizing loss landscapes can provide valuable insights into the effects of neural network depth,
width, and activation functions. However, we can only visualize an at most two-dimensional surface
embedded into three-dimensional space, whereas the loss landscape is a very high-dimensional
object (unless the neural networks have only very few weights and biases).
To make the loss landscape accessible, we need to reduce its dimensionality. This can be achieved
by evaluating the function ΛA,σ,S,L on a two-dimensional subspace of PN (A, ∞). Specifically, we
choose three parameters µ, θ1, θ2 and examine the function

R2 ∋ (α1 , α2 ) 7→ ΛA,σ,S,L (µ + α1 θ1 + α2 θ2 ). (12.1.1)
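As a concrete illustration of (12.1.1), the following minimal sketch (our own, not taken from the references below; the tiny tanh network, the synthetic data, and the random choice of µ, θ1, θ2 are placeholder assumptions) evaluates the empirical risk on a grid of (α1, α2) values, which can then be rendered as a contour or surface plot.

import numpy as np

# toy data and a one-hidden-layer tanh network whose parameters
# theta = (W0 in R^{2x1}, b0 in R^2, W1 in R^{1x2}, b1 in R) are flattened into R^7
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * x)

def realize(theta, x):
    W0, b0 = theta[:2].reshape(2, 1), theta[2:4]
    W1, b1 = theta[4:6].reshape(1, 2), theta[6]
    return np.tanh(x @ W0.T + b0) @ W1.T + b1

def empirical_risk(theta):
    return np.mean((realize(theta, x) - y) ** 2)

mu = rng.normal(size=7)                              # anchor point (e.g. a trained parameter)
th1, th2 = rng.normal(size=7), rng.normal(size=7)    # two directions in parameter space

alphas = np.linspace(-1, 1, 41)
landscape = np.array([[empirical_risk(mu + a1 * th1 + a2 * th2)
                       for a2 in alphas] for a1 in alphas])
print(landscape.shape)                               # (41, 41) grid of loss values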

There are various natural choices for µ, θ1 , θ2 :


• Random directions: This was used, for example, in [96, 127]. Here θ1 , θ2 are chosen randomly,
while µ is either a minimum of ΛA,σ,S,L or also chosen randomly. This simple approach can
offer a quick insight into how rough the surface can be. However, as was pointed out in
[161], random directions will very likely be orthogonal to the trajectory of the optimization
procedure. Hence, they will likely miss the most relevant features.

• Principal components of learning trajectory: To address the shortcomings of random direc-


tions, another possibility is to determine µ, θ1 , θ2 that best capture a given learning trajectory. For example, if θ(1) , θ(2) , . . . , θ(N ) are the parameters resulting from the training
by SGD, we may determine µ, θ1 , θ2 such that the hyperplane {µ + α1 θ1 + α2 θ2 | α1 , α2 ∈ R}
minimizes the mean squared distance to the θ(j) for j ∈ {1, . . . , N }. This is the approach of
[161], and can be achieved by a principal component analysis.

• Based on critical points: For a more global perspective, µ, θ1 , θ2 can be chosen to ensure the
observation of multiple critical points. One way to achieve this is by running the optimization
procedure three times with final parameters θ(1) , θ(2) , θ(3) . If the procedures have converged,
then each of these parameters is close to a critical point of ΛA,σ,S,L . We can now set µ = θ(1) ,
θ1 = θ(2) − µ, θ2 = θ(3) − µ. This then guarantees that (12.1.1) passes through or at least
comes very close to three critical points (at (α1 , α2 ) = (0, 0), (0, 1), (1, 0)). We present six
visualizations of this form in Figure 12.2.
Figure 12.2 gives some interesting insight into the effect of depth and width on the shape of the
loss landscape. For very wide and shallow neural networks, we have the widest minima, which, in the case of the tanh activation function, also seem to belong to the same valley. With increasing depth and smaller width, the minima get steeper and more disconnected.

12.2 Spurious valleys


From the perspective of optimization, the ideal loss landscape has one global minimum in the center
of a large valley, so that gradient descent converges towards the minimum irrespective of the chosen
initialization.
This situation is not realistic for deep neural networks. Indeed, for a simple shallow neural
network
Rd ∋ x 7→ Φ(x) = W (1) σ(W (0) x + b(0) ) + b(1) ,

it is clear that for every permutation matrix P

    Φ(x) = W^(1) P^⊤ σ(P W^(0) x + P b^(0)) + b^(1)    for all x ∈ R^d.

Hence, in general there exist multiple parameterizations realizing the same output function. Moreover, if at least one global minimum with non-permutation-invariant weights exists, then there is more than one global minimum of the loss landscape.
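This symmetry is easy to check numerically. The following sketch (our own illustration with arbitrary random weights) verifies that permuting the hidden neurons of a shallow tanh network leaves the realized function unchanged.

import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 5                                    # input dimension, hidden width
W0, b0 = rng.normal(size=(n, d)), rng.normal(size=n)
W1, b1 = rng.normal(size=(1, n)), rng.normal(size=1)
sigma = np.tanh

P = np.eye(n)[rng.permutation(n)]              # random permutation matrix
x = rng.normal(size=d)

out      = W1 @ sigma(W0 @ x + b0) + b1                    # original parameters
out_perm = (W1 @ P.T) @ sigma(P @ W0 @ x + P @ b0) + b1    # permuted parameters
print(np.allclose(out, out_perm))              # True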
This is not problematic; in fact, having many global minima is beneficial. The larger issue is the
existence of non-global minima. Following [279], we start by generalizing the notion of non-global
minima to spurious valleys.

Definition 12.3. Let A = (d_0, d_1, . . . , d_{L+1}) ∈ N^{L+2} and σ : R → R. Let m ∈ N, let S = (x_i, y_i)_{i=1}^m ∈ (R^{d_0} × R^{d_{L+1}})^m be a sample, and let L be a loss function. For c ∈ R, we define the sub-level set of Λ_{A,σ,S,L} as

ΩΛ (c) := {θ ∈ PN (A, ∞) | ΛA,σ,S,L (θ) ≤ c}.

A path-connected component of ΩΛ (c), which does not contain a global minimum of ΛA,σ,S,L is
called a spurious valley.

The next proposition shows that spurious valleys do not exist for shallow overparameterized neural networks, i.e., for neural networks that have at least as many neurons in the hidden layer as there are training samples.

Proposition 12.4. Let A = (d_0, d_1, 1) ∈ N^3 and let S = (x_i, y_i)_{i=1}^m ∈ (R^{d_0} × R)^m be a sample such that m ≤ d_1. Furthermore, let σ ∈ M not be a polynomial, and let L be a convex loss function. Further assume that Λ_{A,σ,S,L} has at least one global minimum. Then, Λ_{A,σ,S,L} has no spurious valleys.

Proof. Let θa , θb ∈ PN (A, ∞) with ΛA,σ,S,L (θa ) > ΛA,σ,S,L (θb ). Then we will show below that
there is another parameter θc such that

• ΛA,σ,S,L (θb ) ≥ ΛA,σ,S,L (θc )

• there is a continuous path α : [0, 1] → PN (A, ∞) such that α(0) = θa , α(1) = θc , and
ΛA,σ,S,L (α) is monotonically decreasing.

By Exercise 12.7, the construction above rules out the existence of spurious valleys by choosing θa
an element of a spurious valley and θb a global minimum.
Next, we present the construction: Let us denote
    θ_o = (W_o^(ℓ), b_o^(ℓ))_{ℓ=0}^1    for o ∈ {a, b, c}.

Moreover, for j = 1, . . . , d_1, we introduce v_o^j ∈ R^m defined as

    (v_o^j)_i = σ(W_o^(0) x_i + b_o^(0))_j    for i = 1, . . . , m.

Notice that, if we set V_o = ((v_o^j)^⊤)_{j=1}^{d_1}, then

    W_o^(1) V_o = (Rσ(θ_o)(x_i) − b_o^(1))_{i=1}^m,    (12.2.1)

where the right-hand side is considered a row-vector.


We will now distinguish between two cases. For the first, the result is trivial, and the second can be transformed into the first one.
Case 1: Assume that V_a has rank m. In this case, it is obvious from (12.2.1) that there exists W̃ such that

    W̃ V_a = (Rσ(θ_b)(x_i) − b_a^(1))_{i=1}^m.

We can thus set α(t) = ((W_a^(0), b_a^(0)), ((1 − t)W_a^(1) + tW̃, b_a^(1))).
Note that by construction α(0) = θ_a and Λ_{A,σ,S,L}(α(1)) = Λ_{A,σ,S,L}(θ_b). Moreover, t ↦ (Rσ(α(t))(x_i))_{i=1}^m describes a straight path in R^m and hence, by the convexity of L, it is clear that t ↦ Λ_{A,σ,S,L}(α(t)) attains a minimum at some t∗ ∈ [0, 1] with Λ_{A,σ,S,L}(α(t∗)) ≤ Λ_{A,σ,S,L}(θ_b). Moreover, t ↦ Λ_{A,σ,S,L}(α(t)) is monotonically decreasing on [0, t∗]. Setting θ_c = α(t∗) completes this case.
Case 2: Assume that Va has rank less than m. In this case, we show that we find a continuous
path from θa to another neural network parameter with higher rank. The path will be such that
ΛA,σ,S,L is monotonically decreasing.
Under the assumptions, we have that one v_a^j can be written as a linear combination of the remaining v_a^i, i ≠ j. Without loss of generality, we assume j = 1. Then, there exist (α_i)_{i=2}^m such that

    v_a^1 = Σ_{i=2}^m α_i v_a^i.    (12.2.2)

Next, we observe that there exists v∗ ∈ R^m which is linearly independent from all of the v_a^j and can be written as (v∗)_i = σ((w∗)^⊤ x_i + b∗) for some w∗ ∈ R^{d_0}, b∗ ∈ R. Indeed, if we assume that such a v∗ does not exist, then for all w ∈ R^{d_0}, b ∈ R the vector (σ(w^⊤ x_i + b))_{i=1}^m belongs to the same (m − 1)-dimensional subspace. It would follow that span{(σ(w^⊤ x_i + b))_{i=1}^m | w ∈ R^{d_0}, b ∈ R} is an (m − 1)-dimensional subspace of R^m, which yields a contradiction to Theorem 9.3.
Now, we define two paths. First,

    α_1(t) = ((W_a^(0), b_a^(0)), (W_a^(1)(t), b_a^(1))),    for t ∈ [0, 1/2],

where

    (W_a^(1)(t))_1 = (1 − 2t)(W_a^(1))_1   and   (W_a^(1)(t))_i = (W_a^(1))_i + 2tα_i (W_a^(1))_1

for i = 2, . . . , d_1 and t ∈ [0, 1/2]. Second,

    α_2(t) = ((W_a^(0)(t), b_a^(0)(t)), (W_a^(1)(1/2), b_a^(1))),    for t ∈ (1/2, 1],

where

    (W_a^(0)(t))_1 = 2(t − 1/2)(W_a^(0))_1 + (2t − 1)w∗   and   (W_a^(0)(t))_i = (W_a^(0))_i

for i = 2, . . . , d_1, (b_a^(0)(t))_1 = 2(t − 1/2)(b_a^(0))_1 + (2t − 1)b∗, and (b_a^(0)(t))_i = (b_a^(0))_i for i = 2, . . . , d_1.
It is clear by (12.2.2) that t ↦ (Rσ(α_1(t))(x_i))_{i=1}^m is constant. Moreover, t ↦ Rσ(α_2(t))(x) is constant for all x ∈ R^{d_0}. In addition, by construction, for

    v̄^j := ( σ(W_a^(0)(1) x_i + b_a^(0)(1))_j )_{i=1}^m

it holds that ((v̄^j)^⊤)_{j=1}^{d_1}
has rank larger than that of V a . Concatenating α1 and α2 now yields a
continuous path from θa to another neural network parameter with higher associated rank such
that ΛA,σ,S,L is monotonically decreasing along the path. Iterating this construction, we can find
a path to a neural network parameter where the associated matrix has full rank. This reduces the
problem to Case 1.

12.3 Saddle points


Saddle points are critical points of the loss landscape at which the loss decreases in one direction.
In this sense, saddle points are not as problematic as local minima or spurious valleys if the updates
in the learning iteration have some stochasticity. Eventually, a random step in the right direction
could be taken and the saddle point can be escaped.
If most of the critical points are saddle points, then, even though the loss landscape is challenging
for optimization, one still has a good chance of eventually reaching the global minimum. Saddle
points of the loss landscape were studied in [60, 206] and we will review some of the findings in a
simplified way below. The main observation in [206] is that, under some quite strong assumptions,
it holds that critical points in the loss landscape associated to a large loss are typically saddle points,
whereas those associated to small loss correspond to minima. This situation is encouraging for the
prospects of optimization in deep learning, since, even if we get stuck in a local minimum, it will
very likely be such that the loss is close to optimal.
The results of [206] use random matrix theory, which we do not recall here. Moreover, it is hard
to gauge if the assumptions made are satisfied for a specific problem. Nonetheless, we recall the
main idea, which provides some intuition to support the above claim.
Let A = (d0 , d1 , 1) ∈ N3 . Then, for a neural network parameter θ ∈ PN (A, ∞) and activation
function σ, we set Φθ := Rσ (θ) and define for a sample S = (xi , yi )m i=1 the errors

ei = Φθ (xi ) − yi for i = 1, . . . , m.

If we use the square loss, then


    R̂_S(Φ_θ) = (1/m) Σ_{i=1}^m e_i².    (12.3.1)

Next, we study the Hessian of R̂_S(Φ_θ).

Proposition 12.5. Let A = (d_0, d_1, 1) and σ : R → R. Then, for every θ ∈ PN(A, ∞) where R̂_S(Φ_θ) in (12.3.1) is twice continuously differentiable with respect to the weights, it holds that

    H(θ) = H_0(θ) + H_1(θ),

where H(θ) is the Hessian of R̂_S(Φ_θ) at θ, H_0(θ) is a positive semi-definite matrix which is independent of (y_i)_{i=1}^m, and H_1(θ) is a symmetric matrix that, for fixed θ and (x_i)_{i=1}^m, depends linearly on (e_i)_{i=1}^m.

Proof. Using the identification introduced after Definition 12.2, we can consider θ a vector in RnA .
For k = 1, . . . , nA , we have that
    ∂R̂_S(Φ_θ)/∂θ_k = (2/m) Σ_{i=1}^m e_i ∂Φ_θ(x_i)/∂θ_k.

Therefore, for j = 1, . . . , n_A, we have, by the Leibniz rule, that

    ∂²R̂_S(Φ_θ)/(∂θ_j ∂θ_k) = (2/m) Σ_{i=1}^m (∂Φ_θ(x_i)/∂θ_j)(∂Φ_θ(x_i)/∂θ_k) + (2/m) Σ_{i=1}^m e_i ∂²Φ_θ(x_i)/(∂θ_j ∂θ_k)    (12.3.2)
                          =: H_0(θ) + H_1(θ).

It remains to show that H_0(θ) and H_1(θ) have the asserted properties. Note that, setting

    J_{i,θ} = ( ∂Φ_θ(x_i)/∂θ_1, . . . , ∂Φ_θ(x_i)/∂θ_{n_A} )^⊤ ∈ R^{n_A},

we have that H_0(θ) = (2/m) Σ_{i=1}^m J_{i,θ} J_{i,θ}^⊤ and hence H_0(θ) is a sum of positive semi-definite matrices, which shows that H_0(θ) is positive semi-definite.
The symmetry of H_1(θ) follows directly from the symmetry of second derivatives, which holds since we assumed twice continuous differentiability at θ. The linearity of H_1(θ) in (e_i)_{i=1}^m is clear from (12.3.2).
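As a numerical illustration of this decomposition (a sketch of our own, using finite-difference derivatives of a small shallow network; the data and weights are arbitrary), one can assemble H_0(θ) from the gradient vectors J_{i,θ} and confirm that it is positive semi-definite.

import numpy as np

rng = np.random.default_rng(2)
m, d, n = 8, 2, 3                    # samples, input dimension, hidden width
X = rng.normal(size=(m, d))

def phi(theta, x):                   # shallow tanh network R^d -> R
    W0 = theta[:n * d].reshape(n, d)
    b0 = theta[n * d:n * d + n]
    W1 = theta[n * d + n:n * d + 2 * n]
    b1 = theta[-1]
    return W1 @ np.tanh(W0 @ x + b0) + b1

p = n * d + 2 * n + 1                # number of parameters n_A
theta = rng.normal(size=p)
eps = 1e-6

# J[i, k] approximates the derivative of Phi_theta(x_i) with respect to theta_k
J = np.zeros((m, p))
for k in range(p):
    e = np.zeros(p); e[k] = eps
    J[:, k] = [(phi(theta + e, x) - phi(theta - e, x)) / (2 * eps) for x in X]

H0 = (2 / m) * J.T @ J               # the matrix H_0(theta) from Proposition 12.5
print(np.linalg.eigvalsh(H0).min() >= -1e-8)   # positive semi-definite (up to round-off)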

How does Proposition 12.5 imply the claimed relationship between the size of the loss and the
prevalence of saddle points?
Let θ correspond to a critical point. If H(θ) has at least one negative eigenvalue, then θ cannot be a minimum, but instead must be either a saddle point or a maximum. While we do not know anything about H_1(θ) other than that it is symmetric, it is not unreasonable to assume that it has a negative eigenvalue, especially if n_A is very large. With this in mind, let us consider the following model:
Fix a parameter θ. Let S^0 = (x_i, y_i^0)_{i=1}^m be a sample and (e_i^0)_{i=1}^m be the associated errors. Further let H^0(θ), H_0^0(θ), H_1^0(θ) be the matrices according to Proposition 12.5.

Further, for λ > 0, let S^λ = (x_i, y_i^λ)_{i=1}^m be such that the associated errors are (e_i^λ)_{i=1}^m = λ(e_i^0)_{i=1}^m. The Hessian of R̂_{S^λ}(Φ_θ) at θ is then H^λ(θ) satisfying

    H^λ(θ) = H_0^0(θ) + λH_1^0(θ).

Hence, if λ is large, then H^λ(θ) is a perturbation of an amplified version of H_1^0(θ). Clearly, if v is an eigenvector of H_1^0(θ) with negative eigenvalue −µ, then

    v^⊤ H^λ(θ) v ≤ (∥H_0^0(θ)∥ − λµ)∥v∥²,

which we can expect to be negative for large λ. Thus, H^λ(θ) has a negative eigenvalue for large λ.
On the other hand, if λ is small, then H^λ(θ) is merely a perturbation of H_0^0(θ) and we can expect its spectrum to resemble that of H_0^0(θ) more and more.
What we see is that the same parameter is more likely to be a saddle point for a sample that produces a high empirical risk than for a sample with small risk. Note that, since H_0^0(θ) was only shown to be positive semi-definite, the argument above does not rule out saddle points even for very small λ. But it does show that for small λ, every negative eigenvalue would be very small.
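This effect is easy to reproduce numerically (a sketch of our own under the model above, with a randomly generated positive semi-definite H_0^0 and a random symmetric H_1^0): as λ grows, the smallest eigenvalue of H_0^0 + λH_1^0 eventually becomes negative.

import numpy as np

rng = np.random.default_rng(3)
p = 20
A = rng.normal(size=(p, p))
H00 = A @ A.T / p                    # positive semi-definite part H_0^0
B = rng.normal(size=(p, p))
H10 = (B + B.T) / 2                  # symmetric part H_1^0, indefinite in general

for lam in [0.0, 0.1, 1.0, 10.0]:
    smallest = np.linalg.eigvalsh(H00 + lam * H10).min()
    print(f"lambda = {lam:5.1f}, smallest eigenvalue = {smallest:8.3f}")
# for small lambda the spectrum resembles that of H_0^0 (nonnegative up to a small
# perturbation); for large lambda a clearly negative eigenvalue appears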
A more refined analysis where we compare different parameters but for the same sample and
quantify the likelihood of local minima versus saddle points requires the introduction of a probability
distribution on the weights. We refer to [206] for the details.

Bibliography and further reading


The results on visualization of the loss landscape are inspired by [161, 96, 127]. Results on the
non-existence of spurious valleys can be found in [279] with similar results in [220]. In [52] the
loss landscape was studied by linking it to so-called spin-glass models. There it was found that
under strong assumptions critical points associated to lower losses are more likely to be minima
than saddle points. In [206], random matrix theory is used to provide similar results, that go
beyond those established in Section 12.3. On the topic of saddle points, [60] identifies the existence
of saddle points as more problematic than that of local minima, and an alternative saddle-point
aware optimization algorithm is introduced.
Two essential topics associated to the loss landscape that have not been discussed in this chapter are mode connectivity and the sharpness of minima. Mode connectivity, roughly speaking, describes the phenomenon that local minima found by SGD over deep neural networks are often connected by simple curves of equally low loss [84, 71]. Moreover, the sharpness of minima has been analyzed and linked to generalization capabilities of neural networks, with the idea being that wide minima are easier to find and also yield robust neural networks [116, 47, 294]. However, this does not appear to prevent sharp minima from generalizing well [70].

Exercises
Exercise 12.6. In view of Definition 12.3, show that a strict local minimum of a differentiable function that is not a global minimum is contained in a spurious valley.

Exercise 12.7. Show that if there exists a continuous path α between a parameter θ1 and a global
minimum θ2 such that ΛA,σ,S,L (α) is monotonically decreasing, then θ1 cannot be an element of a
spurious valley.

Exercise 12.8. Find an example of a spurious valley for a simple architecture.


Hint: Use a single neuron ReLU neural network and observe that, for two networks one with
positive and one with negative slope, every continuous path in parameter space that connects the
two has to pass through a parameter corresponding to a constant function.

Figure 12.2: A collection of loss landscapes. In the left column are neural networks with ReLU
activation function, the right column shows loss landscapes of neural networks with the hyperbolic
tangent activation function. All neural networks have five dimensional input, and one dimensional
output. Moreover, from top to bottom the hidden layers have widths 1000, 20, 10, and the number
of hidden layers are 1, 4, 7.
Chapter 13

Shape of neural network spaces

As we have seen in the previous chapter, the loss landscape of neural networks can be quite intricate
and is typically not convex. In some sense, the reason for this is that we take the point of view
of a map from the parameterization of a neural network. Let us consider a convex loss function
L : R × R → R and a sample S = (x_i, y_i)_{i=1}^m ∈ (R^d × R)^m.

Then, for two neural networks Φ1 , Φ2 and for α ∈ (0, 1) it holds that
    R̂_S(αΦ_1 + (1 − α)Φ_2) = (1/m) Σ_{i=1}^m L(αΦ_1(x_i) + (1 − α)Φ_2(x_i), y_i)
                            ≤ (1/m) Σ_{i=1}^m [ αL(Φ_1(x_i), y_i) + (1 − α)L(Φ_2(x_i), y_i) ]
                            = αR̂_S(Φ_1) + (1 − α)R̂_S(Φ_2).

Hence, the empirical risk is convex when considered as a map depending on the neural network
functions rather than the neural network parameters. A convex function does not have spurious
minima or saddle points. As a result, the issues from the previous section are avoided if we take
the perspective of neural network sets.
So why do we not optimize over the sets of neural networks instead of the parameters? To
understand this, we will now study the set of neural networks associated with a fixed architecture
as a subset of other function spaces.
We start by investigating the realization map Rσ introduced in Definition 12.1. Concretely,
we show in Section 13.1, that if σ is Lipschitz, then the set of neural networks is the image of
PN (A, ∞) under a locally Lipschitz map. We will use this fact to show in Section 13.2 that sets of
neural networks are typically non-convex, and even have arbitrarily large holes. Finally, in Section
13.3, we study the extent to which there exist best approximations to arbitrary functions, in the set
of neural networks. We will demonstrate that the lack of best approximations causes the weights
of neural networks to grow infinitely during training.

13.1 Lipschitz parameterizations


In this section, we study the realization map Rσ . The main result is the following simplified version
of [207, Proposition 4].

Proposition 13.1. Let A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 , let σ : R → R be Cσ -Lipschitz continuous
with Cσ ≥ 1, let |σ(x)| ≤ Cσ |x| for all x ∈ R, and let B ≥ 1.
Then, for all θ, θ′ ∈ PN (A, B),

∥Rσ (θ) − Rσ (θ′ )∥L∞ ([−1,1]d0 ) ≤ (2Cσ Bdmax )L nA ∥θ − θ′ ∥∞ ,


where d_max = max_{ℓ=0,...,L+1} d_ℓ and n_A = Σ_{ℓ=0}^L d_{ℓ+1}(d_ℓ + 1).

Proof. Let θ, θ′ ∈ PN(A, B) and define δ := ∥θ − θ′∥_∞. Repeatedly using the triangle inequality, we find a sequence (θ_j)_{j=0}^{n_A} such that θ_0 = θ, θ_{n_A} = θ′, ∥θ_j − θ_{j+1}∥_∞ ≤ δ, and θ_j and θ_{j+1} differ in one entry only for all j = 0, . . . , n_A − 1. We conclude that for all x ∈ [−1, 1]^{d_0}

    ∥Rσ(θ)(x) − Rσ(θ′)(x)∥_∞ ≤ Σ_{j=0}^{n_A−1} ∥Rσ(θ_j)(x) − Rσ(θ_{j+1})(x)∥_∞.    (13.1.1)

To upper bound (13.1.1), we now only need to understand the effect of changing one weight in a
neural network by δ.
Before we can complete the proof, we need two auxiliary lemmas. The first of these holds under slightly weaker assumptions than Proposition 13.1.

Lemma 13.2. Under the assumptions of Proposition 13.1, but with B allowed to be an arbitrary positive number, it holds for all Φ ∈ N(σ; A, B)

∥Φ(x) − Φ(x′ )∥∞ ≤ CσL · (Bdmax )L+1 ∥x − x′ ∥∞ (13.1.2)

for all x, x′ ∈ Rd0 .

Proof. We start with the case L = 1. Then, for (d_0, d_1, d_2) = A, we have that

    Φ(x) = W^(1) σ(W^(0) x + b^(0)) + b^(1),

for certain W^(0), W^(1), b^(0), b^(1) with all entries bounded by B. As a consequence, we can estimate

    ∥Φ(x) − Φ(x′)∥_∞ = ∥W^(1) (σ(W^(0) x + b^(0)) − σ(W^(0) x′ + b^(0)))∥_∞
                     ≤ d_1 B ∥σ(W^(0) x + b^(0)) − σ(W^(0) x′ + b^(0))∥_∞
                     ≤ d_1 B Cσ ∥W^(0) (x − x′)∥_∞
                     ≤ d_1 d_0 B² Cσ ∥x − x′∥_∞
                     ≤ Cσ · (d_max B)² ∥x − x′∥_∞,

where we used the Lipschitz property of σ and the fact that ∥Ax∥_∞ ≤ n max_{i,j} |A_{ij}| ∥x∥_∞ for every matrix A = (A_{ij})_{i=1,j=1}^{m,n} ∈ R^{m×n}.

The induction step from L to L+1 follows similarly. This concludes the proof of the lemma.

Lemma 13.3. Under the assumptions of Proposition 13.1 it holds that

∥x(ℓ) ∥∞ ≤ (2Cσ Bdmax )ℓ for all x ∈ [−1, 1]d0 . (13.1.3)

Proof. Per Definitions (2.1.1b) and (2.1.1c), we have that for ℓ = 1, . . . , L + 1

∥x(ℓ) ∥∞ ≤ Cσ W(ℓ−1) x(ℓ−1) + b(ℓ−1)



(ℓ−1)
≤ Cσ Bdmax ∥x ∥∞ + BCσ ,
where we used the triangle inequality and the estimate ∥Ax∥∞ ≤ n maxi,j |Aij |∥x∥∞ , which holds
for every matrix A ∈ Rm×n . We obtain that
∥x(ℓ) ∥∞ ≤ Cσ Bdmax · (1 + ∥x(ℓ−1) ∥∞ )
≤ 2Cσ Bdmax · (max{1, ∥x(ℓ−1) ∥∞ }).

Resolving the recursive estimate of ∥x(ℓ) ∥∞ by 2Cσ Bdmax (max{1, ∥x(ℓ−1) ∥∞ }), we conclude that
∥x(ℓ) ∥∞ ≤ (2Cσ Bdmax )ℓ max{1, ∥x(0) ∥∞ } = (2Cσ Bdmax )ℓ .
This concludes the proof of the lemma.

We can now proceed with the proof of Proposition 13.1. Assume that θ_{j+1} and θ_j differ only in one entry. We assume this entry to be in the ℓth layer, and we start with the case ℓ < L. It holds that

    |Rσ(θ_j)(x) − Rσ(θ_{j+1})(x)| = |Φ_ℓ(σ(W^(ℓ) x^(ℓ) + b^(ℓ))) − Φ_ℓ(σ(W̄^(ℓ) x^(ℓ) + b̄^(ℓ)))|,

where Φ_ℓ ∈ N(σ; A_ℓ, B) for A_ℓ = (d_{ℓ+1}, . . . , d_{L+1}), and (W^(ℓ), b^(ℓ)), (W̄^(ℓ), b̄^(ℓ)) differ in one entry only.
Using the Lipschitz continuity of Φ_ℓ from Lemma 13.2, we have

    |Rσ(θ_j)(x) − Rσ(θ_{j+1})(x)| ≤ Cσ^{L−ℓ−1} (B d_max)^{L−ℓ} |σ(W^(ℓ) x^(ℓ) + b^(ℓ)) − σ(W̄^(ℓ) x^(ℓ) + b̄^(ℓ))|
                                 ≤ Cσ^{L−ℓ} (B d_max)^{L−ℓ} ∥W^(ℓ) x^(ℓ) + b^(ℓ) − W̄^(ℓ) x^(ℓ) − b̄^(ℓ)∥_∞
                                 ≤ 2 Cσ^{L−ℓ} (B d_max)^{L−ℓ} δ max{1, ∥x^(ℓ)∥_∞},

where δ := ∥θ − θ′∥_∞. Invoking Lemma 13.3, we conclude that

    |Rσ(θ_j)(x) − Rσ(θ_{j+1})(x)| ≤ 2 (2Cσ B d_max)^ℓ Cσ^{L−ℓ} (B d_max)^{L−ℓ} δ ≤ (2Cσ B d_max)^L ∥θ − θ′∥_∞.
For the case ℓ = L, a similar estimate can be shown. Combining this with (13.1.1) yields the
result.

Using Proposition 13.1, we can now consider the set of neural networks with a fixed architecture N(σ; A, ∞) as a subset of L^∞([−1, 1]^{d_0}). What is more, N(σ; A, ∞) is the image of PN(A, ∞) under a locally Lipschitz map.
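The local Lipschitz property can also be observed empirically. The following sketch (our own, for a small random architecture with the tanh activation, which is 1-Lipschitz and satisfies |σ(x)| ≤ |x|) compares the change of the realized function on sample points with the change of the parameters.

import numpy as np

rng = np.random.default_rng(4)
A = (2, 10, 10, 1)                     # architecture (d0, d1, d2, d3)
sigma = np.tanh

def unpack(theta):
    params, pos = [], 0
    for l in range(len(A) - 1):
        W = theta[pos:pos + A[l + 1] * A[l]].reshape(A[l + 1], A[l]); pos += W.size
        b = theta[pos:pos + A[l + 1]]; pos += A[l + 1]
        params.append((W, b))
    return params

def realize(theta, x):
    for l, (W, b) in enumerate(unpack(theta)):
        x = W @ x + b
        if l < len(A) - 2:             # no activation after the last layer
            x = sigma(x)
    return x

n_A = sum(A[l + 1] * (A[l] + 1) for l in range(len(A) - 1))
theta = rng.uniform(-1, 1, size=n_A)
theta_pert = theta + 1e-3 * rng.uniform(-1, 1, size=n_A)

xs = rng.uniform(-1, 1, size=(200, A[0]))
func_diff = max(abs(realize(theta, x) - realize(theta_pert, x)).max() for x in xs)
print(func_diff, np.abs(theta - theta_pert).max())   # small parameter changes give small function changes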

13.2 Convexity of neural network spaces
As a first step towards understanding N (σ; A, ∞) as a subset of L∞ ([−1, 1]d0 ), we notice that it is
star-shaped with few centers. Let us first introduce the necessary terminology.

Definition 13.4. Let Z be a subset of a linear space. A point x ∈ Z is called a center of Z if,
for every y ∈ Z it holds that
{tx + (1 − t)y | t ∈ [0, 1]} ⊆ Z.
A set is called star-shaped if it has at least one center.

The following proposition follows directly from the definition of a neural network and is the
content of Exercise 13.15.

Proposition 13.5. Let L ∈ N and A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 and σ : R → R. Then


N (σ; A, ∞) is scaling invariant, i.e. for every λ ∈ R it holds that λf ∈ N (σ; A, ∞) if
f ∈ N (σ; A, ∞), and hence 0 ∈ N (σ; A, ∞) is a center of N (σ; A, ∞).

Knowing that N (σ; A, B) is star-shaped with center 0, we can also ask ourselves if N (σ; A, B)
has more than this one center. It is not hard to see that also every constant function is a center.
The following theorem, which corresponds to [207, Proposition C.4], yields an upper bound on the
number of linearly independent centers.

Theorem 13.6. Let L ∈ N and A = (d_0, d_1, . . . , d_{L+1}) ∈ N^{L+2}, and let σ : R → R be Lipschitz continuous. Then, N(σ; A, ∞) contains at most n_A = Σ_{ℓ=0}^L (d_ℓ + 1)d_{ℓ+1} linearly independent centers.

Proof. Assume by contradiction that there are functions (g_i)_{i=1}^{n_A+1} ⊆ N(σ; A, ∞) ⊆ L^∞([−1, 1]^{d_0}) that are linearly independent and centers of N(σ; A, ∞).
By the Hahn-Banach theorem, there exist (g_i′)_{i=1}^{n_A+1} ⊆ (L^∞([−1, 1]^{d_0}))′ such that g_i′(g_j) = δ_{ij} for all i, j ∈ {1, . . . , n_A + 1}. We define

    T : L^∞([−1, 1]^{d_0}) → R^{n_A+1},   g ↦ (g_1′(g), g_2′(g), . . . , g_{n_A+1}′(g))^⊤.

Since T is continuous and linear, we have that T ∘ Rσ is locally Lipschitz continuous by Proposition 13.1. Moreover, since the (g_i)_{i=1}^{n_A+1} are linearly independent, we have that T(span((g_i)_{i=1}^{n_A+1})) = R^{n_A+1}. We denote V := span((g_i)_{i=1}^{n_A+1}).

Next, we would like to establish that N(σ; A, ∞) ⊃ V. Let g ∈ V. Then

    g = Σ_{ℓ=1}^{n_A+1} a_ℓ g_ℓ,

for some a_1, . . . , a_{n_A+1} ∈ R. We show by induction that g̃^(m) := Σ_{ℓ=1}^m a_ℓ g_ℓ ∈ N(σ; A, ∞) for every m ≤ n_A + 1. This is obviously true for m = 1. Moreover, we have that g̃^(m+1) = a_{m+1} g_{m+1} + g̃^(m). Hence, the induction step holds true if a_{m+1} = 0. If a_{m+1} ≠ 0, then we have that

    g̃^(m+1) = 2a_{m+1} · ( (1/2) g_{m+1} + (1/(2a_{m+1})) g̃^(m) ).    (13.2.1)

By the induction assumption g̃^(m) ∈ N(σ; A, ∞) and hence, by Proposition 13.5, g̃^(m)/a_{m+1} ∈ N(σ; A, ∞). Additionally, since g_{m+1} is a center of N(σ; A, ∞), we have that (1/2) g_{m+1} + (1/(2a_{m+1})) g̃^(m) ∈
N(σ; A, ∞). By Proposition 13.5, we conclude that g̃^(m+1) ∈ N(σ; A, ∞).
The induction shows that g ∈ N (σ; A, ∞) and thus V ⊆ N (σ; A, ∞). As a consequence,
T ◦ Rσ (PN (A, ∞)) ⊇ T (V ) = RnA +1 .
It is a well known fact of basic analysis that for every n ∈ N there does not exist a surjective
and locally Lipschitz continuous map from Rn to Rn+1 . We recall that nA = dim(PN (A, ∞)).
This yields the contradiction.

For a convex set X, the line segment between any two points of X is a subset of X. Hence, every point of a convex set is a center. This yields the following corollary.

Corollary 13.7. Let L ∈ N, let A = (d_0, d_1, . . . , d_{L+1}) ∈ N^{L+2}, and let σ : R → R be Lipschitz continuous. If N(σ; A, ∞) contains more than n_A = Σ_{ℓ=0}^L (d_ℓ + 1)d_{ℓ+1} linearly independent functions, then N(σ; A, ∞) is not convex.

Corollary 13.7 tells us that we cannot expect convex sets of neural networks if the set of neural networks has many linearly independent elements. Sets of neural networks contain for each f ∈ N(σ; A, ∞) also all shifts of this function, i.e., f(· + b) for b ∈ R^{d_0} are elements of N(σ; A, ∞). For a set of functions, being shift invariant and having only finitely many linearly independent functions at the same time is a very restrictive condition. Indeed, it was shown in
[207, Proposition C.6] that if N (σ; A, ∞) has only finitely many linearly independent functions and
σ is differentiable in at least one point and has non-zero derivative there, then σ is necessarily a
polynomial.
We conclude that the set of neural networks is in general non-convex and star-shaped with 0
and constant functions being centers. One could visualize this set in 3D as in Figure 13.1.
The fact that the neural network space is not convex could also mean that it merely fails to be convex at one point. For example, R² \ {0} is not convex, but for an optimization algorithm this would likely not pose a problem.
We will next observe that N (σ; A, ∞) does not have such a benign non-convexity and in fact,
has arbitrarily large holes.
To make this claim mathematically precise, we first introduce the notion of ε-convexity.

Figure 13.1: Sketch of the space of neural networks in 3D. The vertical axis corresponds to the
constant neural network functions, each of which is a center. The set of neural networks consists
of many low-dimensional linear subspaces spanned by certain neural networks (Φ1 , . . . , Φ6 in this
sketch) and linear functions. Between these low-dimensional subspaces, there is not always a
straight-line connection by Corollary 13.7 and Theorem 13.9.

Definition 13.8. For ε > 0, we say that a subset A of a normed vector space X is ε-convex if

co(A) ⊆ A + Bε (0),

where co(A) denotes the convex hull of A and Bε (0) is an ε ball around 0 with respect to the norm
of X.

Intuitively speaking, a set that is convex when one fills up all holes smaller than ε is ε-convex.
Now we show that there is no ε > 0 such that N (σ; A, ∞) is ε-convex.

Theorem 13.9. Let L ∈ N and A = (d0 , d1 , . . . , dL , 1) ∈ NL+2 . Let K ⊆ Rd0 be compact and let
σ ∈ M, with M as in (3.1.1) and assume that σ is not a polynomial. Moreover, assume that there
exists an open set, where σ is differentiable and not constant.
If there exists an ε > 0 such that N (σ; A, ∞) is ε-convex, then N (σ; A, ∞) is dense in C(K).

Proof. Step 1. We show that ε-convexity implies that the closure N̄(σ; A, ∞) is convex. By Proposition 13.5, we have that N(σ; A, ∞) is scaling invariant. This implies that co(N(σ; A, ∞)) is scaling invariant as well. Hence, if there exists ε > 0 such that N(σ; A, ∞) is ε-convex, then for every ε′ > 0

    co(N(σ; A, ∞)) = (ε′/ε) co(N(σ; A, ∞)) ⊆ (ε′/ε) (N(σ; A, ∞) + B_ε(0)) = N(σ; A, ∞) + B_{ε′}(0).

This yields that N(σ; A, ∞) is ε′-convex. Since ε′ was arbitrary, we have that N(σ; A, ∞) is ε-convex for all ε > 0.
As a consequence, we have that

    co(N(σ; A, ∞)) ⊆ ⋂_{ε>0} (N(σ; A, ∞) + B_ε(0)) ⊆ ⋂_{ε>0} (N̄(σ; A, ∞) + B_ε(0)) = N̄(σ; A, ∞).

Hence, co(N(σ; A, ∞)) ⊆ N̄(σ; A, ∞) and, by the well-known fact that in every metric vector space the convex hull of the closure of a set is contained in the closure of its convex hull, we conclude that N̄(σ; A, ∞) is convex.
Step 2. We show that Nd1(σ; 1) ⊆ N̄(σ; A, ∞). If N(σ; A, ∞) is ε-convex, then by Step 1 N̄(σ; A, ∞) is convex. The scaling invariance of N(σ; A, ∞) then shows that N̄(σ; A, ∞) is a closed linear subspace of C(K).
Note that, by Proposition 3.16, for every w ∈ R^{d_0} and b ∈ R there exists a function f ∈ N(σ; A, ∞) such that

    f(x) = σ(w^⊤ x + b)    for all x ∈ K.    (13.2.2)

By definition, every constant function is an element of N(σ; A, ∞).

Since N̄(σ; A, ∞) is a closed vector space, this implies that for all n ∈ N and all w_1^(1), . . . , w_n^(1) ∈ R^{d_0}, w_1^(2), . . . , w_n^(2) ∈ R, b_1^(1), . . . , b_n^(1) ∈ R, b^(2) ∈ R

    x ↦ Σ_{i=1}^n w_i^(2) σ((w_i^(1))^⊤ x + b_i^(1)) + b^(2) ∈ N̄(σ; A, ∞).    (13.2.3)

Step 3. From (13.2.3), we conclude that Nd1(σ; 1) ⊆ N̄(σ; A, ∞). In words, the whole set of shallow neural networks of arbitrary width is contained in the closure of the set of neural networks with a fixed architecture. By Theorem 3.8, we have that Nd1(σ; 1) is dense in C(K), which yields the result.

For any activation function of practical relevance, a set of neural networks with fixed architecture
is not dense in C(K). This is only the case for very strange activation functions such as the one
discussed in Subsection 3.2. Hence, Theorem 13.9 shows that in general, sets of neural networks of
fixed architectures have arbitrarily large holes.

13.3 Closedness and best-approximation property


The non-convexity of the set of neural networks can have some serious consequences for the way
we think of the approximation or learning problem by neural networks.
Consider A = (d0 , . . . , dL+1 ) ∈ NL+2 and an activation function σ. Let H be a normed function
space on [−1, 1]d0 such that N (σ; A, ∞) ⊆ H. For h ∈ H we would like to find a neural network
that best approximates h, i.e. to find Φ ∈ N (σ; A, ∞) such that

    ∥Φ − h∥_H = inf_{Φ∗ ∈ N(σ;A,∞)} ∥Φ∗ − h∥_H.    (13.3.1)

We say that N (σ; A, ∞) ⊆ H has

• the best approximation property, if for all h ∈ H there exists at least one Φ ∈ N (σ; A, ∞)
such that (13.3.1) holds,

• the unique best approximation property, if for all h ∈ H there exists exactly one
Φ ∈ N (σ; A, ∞) such that (13.3.1) holds,

• the continuous selection property, if there exists a continuous function ϕ : H → N (σ; A, ∞)


such that Φ = ϕ(h) satisfies (13.3.1) for all h ∈ H.

We will see in the sequel that, in the absence of the best approximation property, we will be able
to prove that the learning problem necessarily requires the weights of the neural networks to tend
to infinity, which may or may not be desirable in applications.
Moreover, having a continuous selection procedure is desirable as it implies the existence of a
stable selection algorithm; that is, an algorithm which, for similar problems yields similar neural
networks satisfying (13.3.1).
Below, we will study the properties above for Lp spaces, p ∈ [1, ∞). As we will see, neu-
ral network classes typically neither satisfy the continuous selection nor the best approximation
property.

13.3.1 Continuous selection
As shown in [136], neural network spaces essentially never admit the continuous selection property.
To give the argument, we first recall the following result from [136, Theorem 3.4] without proof.

Theorem 13.10. Let p ∈ (1, ∞). Every subset of Lp ([−1, 1]d0 ) with the unique best approximation
property is convex.

This allows to show the next proposition.

Proposition 13.11. Let L ∈ N, A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 , let σ : R → R be Lipschitz


continuous and not a polynomial, and let p ∈ (1, ∞).
Then, N (σ; A, ∞) ⊆ Lp ([−1, 1]d0 ) does not have the continuous selection property.

Proof. We observe from Theorem 13.6 and the discussion following it that, under the assumptions of this proposition, N(σ; A, ∞) is not convex.
We conclude from Theorem 13.10 that N(σ; A, ∞) does not have the unique best approximation property. Moreover, if the set N(σ; A, ∞) does not have the best approximation property, then it is obvious that it cannot have the continuous selection property. Thus, we can assume without loss of generality that N(σ; A, ∞) has the best approximation property and that there exist a point h ∈ L^p([−1, 1]^{d_0}) and two distinct Φ1, Φ2 ∈ N(σ; A, ∞) such that

    ∥Φ1 − h∥_{L^p} = ∥Φ2 − h∥_{L^p} = inf_{Φ∗ ∈ N(σ;A,∞)} ∥Φ∗ − h∥_{L^p}.    (13.3.2)

Note that (13.3.2) implies that h ̸∈ N (σ; A, ∞).


Let us consider the following function:

    [−1, 1] ∋ λ ↦ P(λ) := (1 + λ)h − λΦ1  for λ ≤ 0,   and   P(λ) := (1 − λ)h + λΦ2  for λ ≥ 0.

It is clear that P (λ) is a continuous path in Lp . Moreover, for λ ∈ (−1, 0)

∥Φ1 − P (λ)∥Lp = (1 + λ)∥Φ1 − h∥Lp .

Assume towards a contradiction that there exists Φ∗ ∈ N(σ; A, ∞) with Φ∗ ≠ Φ1 such that for some λ ∈ (−1, 0)

∥Φ∗ − P (λ)∥Lp ≤ ∥Φ1 − P (λ)∥Lp .

Then

∥Φ∗ − h∥Lp ≤ ∥Φ∗ − P (λ)∥Lp + ∥P (λ) − h∥Lp


≤ ∥Φ1 − P (λ)∥Lp + ∥P (λ) − h∥Lp
= (1 + λ)∥Φ1 − h∥Lp + |λ|∥Φ1 − h∥Lp = ∥Φ1 − h∥Lp . (13.3.3)

Since Φ1 is a best approximation to h this implies that every inequality in the estimate above is an
equality. Hence, we have that

∥Φ∗ − h∥Lp = ∥Φ∗ − P (λ)∥Lp + ∥P (λ) − h∥Lp .

However, in a strictly convex space like Lp ([−1, 1]d0 ) for p > 1 this implies that

Φ∗ − P (λ) = c · (P (λ) − h)

for a constant c ̸= 0. This yields that

Φ∗ = h + (c + 1)λ · (h − Φ1 )

and plugging into (13.3.3) yields |(c + 1)λ| = 1. If (c + 1)λ = −1, then we have Φ∗ = Φ1 which
produces a contradiction. If (c + 1)λ = 1, then

∥Φ∗ − P (λ)∥Lp = ∥2h − Φ1 − (1 + λ)h + λΦ1 ∥Lp


= ∥(1 − λ)h − (1 − λ)Φ1 ∥Lp > ∥P (λ) − Φ1 ∥Lp ,

which is another contradiction.


Hence, for every λ < 0 we have that Φ1 is the unique minimizer to P (λ) in N (σ; A, ∞). The same
argument holds for λ > 0 and Φ2 . We conclude that for every selection function ϕ : Lp ([−1, 1]d0 ) →
N (σ; A, ∞) such that Φ = ϕ(h) satisfies (13.3.1) for all h ∈ Lp ([−1, 1]d0 ) it holds that

    lim_{λ↓0} ϕ(P(λ)) = Φ2 ≠ Φ1 = lim_{λ↑0} ϕ(P(λ)).

As a consequence, ϕ is not continuous, which shows the result.

13.3.2 Existence of best approximations


We have seen in Proposition 13.11 that under very mild assumptions, the continuous selection prop-
erty cannot hold. Moreover, the next result shows that in many cases, also the best approximation
property fails to be satisfied. We provide below a simplified version of [207, Theorem 3.1]. We also
refer to [89] for earlier work on this problem.

Proposition 13.12. Let A = (1, 2, 1) and let σ : R → R be Lipschitz continuous. Additionally


assume that there exist r > 0 and α′ ̸= α such that σ is differentiable for all |x| > r and σ ′ (x) → α
for x → ∞, σ ′ (x) → α′ for x → −∞.
Then, there exists a sequence in N (σ; A, ∞) which converges in Lp ([−1, 1]), for every p ∈ (1, ∞),
and the limit of this sequence is discontinuous. In particular, the limit of the sequence does not lie
in N (σ; A′ , ∞) for any A′ .

Proof. For all n ∈ N let

fn (x) = σ(nx + 1) − σ(nx) for all x ∈ R.

Then fn can be written as a neural network with architecture (σ; 1, 2, 1), i.e., A = (1, 2, 1). More-
over, for x > 0 we observe with the fundamental theorem of calculus and using integration by
substitution that
    f_n(x) = ∫_x^{x+1/n} n σ′(nz) dz = ∫_{nx}^{nx+1} σ′(z) dz.    (13.3.4)

It is not hard to see that the right hand side of (13.3.4) converges to α for n → ∞.
Similarly, for x < 0, we observe that fn (x) converges to α′ for n → ∞. We conclude that

fn → α1R+ + α′ 1R−

almost everywhere as n → ∞. Since σ is Lipschitz continuous, we have that fn is bounded.


Therefore, we conclude that fn → α1R+ + α′ 1R− in Lp for all p ∈ [1, ∞) by the dominated
convergence theorem.
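A concrete instance of this construction can be visualized numerically (a sketch of our own; the softplus activation σ(x) = log(1 + e^x) is Lipschitz continuous and satisfies the assumptions with α = 1 and α′ = 0).

import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)          # log(1 + exp(x)), numerically stable

def f_n(x, n):                           # the two-neuron network from the proof
    return softplus(n * x + 1) - softplus(n * x)

xs = np.linspace(-1, 1, 9)
for n in [1, 10, 100, 1000]:
    print(n, np.round(f_n(xs, n), 3))
# the values approach 1 for x > 0 and 0 for x < 0, i.e. the L^p-limit of the
# sequence is a discontinuous step function, which is not itself a network of this architecture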

There is a straightforward extension of Proposition 13.12 to arbitrary architectures, which will be the content of Exercises 13.16 and 13.17.
Remark 13.13. The proof of Proposition 13.12 does not extend to the L^∞ norm. This, of course, does not mean that N(σ; A, ∞) is in general a closed set in L^∞([−1, 1]^{d_0}). In fact, almost all activation
functions used in practice still give rise to non-closed neural network sets, see [207, Theorem 3.3].
However, there is one notable exception. For the ReLU activation function, it can be shown that
N (σReLU ; A, ∞) is a closed set in L∞ ([−1, 1]d0 ) if A has only one hidden layer. The closedness of
deep ReLU spaces in L∞ is an open problem.

13.3.3 Exploding weights phenomenon


Finally, we discuss one of the consequences of the non-existence of best approximations established in Proposition 13.12.
Consider a regression problem, where we aim to learn a function f using neural networks with a fixed architecture N(σ; A, ∞). As discussed in Chapters 10 and 11, we wish to produce a sequence of neural networks (Φ_n)_{n=1}^∞ such that the risk defined in (1.2.4) converges to 0. If the loss
L is the squared loss, µ is a probability measure on [−1, 1]d0 , and the data is given by (x, f (x)) for
x ∼ µ, then

    R(Φ_n) = ∥Φ_n − f∥²_{L²([−1,1]^{d_0},µ)} = ∫_{[−1,1]^{d_0}} |Φ_n(x) − f(x)|² dµ(x) → 0    for n → ∞.    (13.3.5)

According to Proposition 13.12, for a given A, and an activation function σ, it is possible that
(13.3.5) holds, but f ̸∈ N (σ; A, ∞). The following result shows that in this situation, the weights
of Φn diverge.

Proposition 13.14. Let A = (d_0, d_1, . . . , d_{L+1}) ∈ N^{L+2}, let σ : R → R be Cσ-Lipschitz continuous with Cσ ≥ 1 and |σ(x)| ≤ Cσ|x| for all x ∈ R, and let µ be a measure on [−1, 1]^{d_0}.

Assume that there exists a sequence Φn ∈ N (σ; A, ∞) and f ∈ L2 ([−1, 1]d0 , µ) \ N (σ; A, ∞)
such that

∥Φn − f ∥2L2 ([−1,1]d0 ,µ) → 0. (13.3.6)

Then
    lim sup_{n→∞} max{ ∥W_n^(ℓ)∥_∞, ∥b_n^(ℓ)∥_∞ | ℓ = 0, . . . , L } = ∞,    (13.3.7)

where (W_n^(ℓ), b_n^(ℓ))_{ℓ=0}^L denote the weights and biases of Φ_n.

Proof. We assume towards a contradiction that the left-hand side of (13.3.7) is finite. As a result,
there exists C > 0 such that Φn ∈ N (σ; A, C) for all n ∈ N.
By Proposition 13.1, we conclude that N (σ; A, C) is the image of a compact set under a continu-
ous map and hence is itself a compact set in L2 ([−1, 1]d0 , µ). In particular, we have that N (σ; A, C)
is closed. Hence, (13.3.6) implies f ∈ N (σ; A, C). This gives a contradiction.

Proposition 13.14 can be extended to all f for which there is no best approximation in N (σ; A, ∞),
see Exercise 13.18. The results imply that for functions we wish to learn that lack a best approxima-
tion within a neural network set, we must expect the weights of the approximating neural networks
to grow to infinity. This can be undesirable because, as we will see in the following sections on
generalization, a bounded parameter space facilitates many generalization bounds.

Bibliography and further reading


The properties of neural network sets were first studied with a focus on the continuous approxima-
tion property in [136, 138, 137] and [89]. The results in [136, 137, 138] already use the non-convexity
of sets of shallow neural networks. The results on convexity and closedness presented in this chapter
follow mostly the arguments of [207]. Similar results were also derived for other norms in [169].

Exercises
Exercise 13.15. Prove Proposition 13.5.

Exercise 13.16. Extend Proposition 13.12 to A = (d0 , d1 , 1) for arbitrary d0 , d1 ∈ N, d1 ≥ 2.

Exercise 13.17. Use Proposition 3.16, to extend Proposition 13.12 to arbitrary depth.

Exercise 13.18. Extend Proposition 13.14 to functions f for which there is no best-approximation
in N (σ; A, ∞). To do this, replace (13.3.6) by

    ∥Φ_n − f∥²_{L²} → inf_{Φ ∈ N(σ;A,∞)} ∥Φ − f∥²_{L²}.

Chapter 14

Generalization properties of deep


neural networks

As discussed in the introduction in Section 1.2, we generally learn based on a finite data set. For
example, given data (xi , yi )m
i=1 , we try to find a network Φ that satisfies Φ(xi ) = yi for i = 1, . . . , m.
The field of generalization is concerned with how well such Φ performs on unseen data, which refers
to any x outside of training data {x1 , . . . , xm }. In this chapter we discuss generalization through
the use of covering numbers.
In Sections 14.1 and 14.2 we revisit and formalize the general setup of learning and empirical risk
minimization in a general context. Although some notions introduced in these sections have already
appeared in the previous chapters, we reintroduce them here for a more coherent presentation. In
Sections 14.3-14.5, we first discuss the concepts of generalization bounds and covering numbers,
and then apply these arguments specifically to neural networks. In Section 14.6 we explore the
so-called “approximation-complexity trade-off”, and finally in Sections 14.7-14.8 we introduce the
“VC dimension” and give some implications for classes of neural networks.

14.1 Learning setup


A general learning problem [178, 250, 58] requires a feature space X and a label space Y , which
we assume throughout to be measurable spaces. We observe joint data pairs (xi , yi )m i=1 ⊆ X ×Y , and
aim to identify a connection between the x and y variables. Specifically, we assume a relationship
between features x and labels y modeled by a probability distribution D over X ×Y , that generated
the observed data (xi , yi )m
i=1 . While this distribution is unknown, our goal is to extract information
from it, so that we can make possibly good predictions of y for a given x. Importantly, the
relationship between x and y need not be deterministic.
To make these concepts more concrete, we next present an example that will serve to explain
ideas throughout this chapter. This example is of high relevance for many mathematicians, as
ensuring a steady supply of high-quality coffee is essential for maximizing the output of our math-
ematical work.
Example 14.1 (Coffee Quality). Our goal is to determine the quality of different coffees. To this
end we model the quality as a number in
    Y = { 0/10, 1/10, . . . , 10/10 },

Figure 14.1: Collection of coffee data. The last row lacks a “Quality” label. Our aim is to predict
the label without the need for an (expensive) taste test.

with higher numbers indicating better quality. Let us assume that our subjective assessment of
quality of coffee is related to six features: “Acidity”, “Caffeine content”, “Price”, “Aftertaste”,
“Roast level”, and “Origin”. The feature space X thus corresponds to the set of six-tuples describing
these attributes, which can be either numeric or categorical (see Figure 14.1).
We aim to understand the relationship between elements of X and elements of Y , but we can
neither afford, nor do we have the time to taste all the coffees in the world. Instead, we can sample
some coffees, taste them, and grow our database accordingly as depicted in Figure 14.1. This way
we obtain samples of pairs in X × Y . The distribution D from which they are drawn depends on
various external factors. For instance, we might have avoided particularly cheap coffees, believing
them to be inferior. As a result they do not occur in our database. Moreover, if a colleague
contributes to our database, he might have tried the same brand and arrived at a different rating.
In this case, the quality label is not deterministic anymore.
Based on our database, we wish to predict the quality of an untasted coffee. Before proceeding,
we first formalize what it means to be a “good” prediction. ♢
Characterizing how good a predictor is requires a notion of discrepancy in the label space. This
is the purpose of the so-called loss function, which is a measurable mapping L : Y × Y → R+ .

Definition 14.2. Let L : Y × Y → R+ be a loss function and let D be a distribution on X × Y .


For a measurable function h : X → Y we call

R(h) = E(x,y)∼D [L(h(x), y)]

the (population) risk of h.

Based on the risk, we can now formalize what we consider a good predictor. The best predictor
is one such that its risk is as close as possible to the smallest that any function can achieve. More
precisely, we would like a risk that is close to the so-called Bayes risk

    R∗ := inf_{h : X→Y} R(h),    (14.1.1)

where the infimum is taken over all measurable h : X → Y .

Example 14.3 (Loss functions). The choice of a loss function L usually depends on the application.
For a regression problem, i.e., a learning problem where Y is a non-discrete subset of a Euclidean
space, a common choice is the square loss L2 (y, y ′ ) = ∥y − y ′ ∥2 .
For binary classification problems, i.e., when Y is a discrete set of cardinality two, the “0 − 1 loss”

    L_{0−1}(y, y′) = 1 if y ≠ y′,   and   L_{0−1}(y, y′) = 0 if y = y′,
seems more natural.
Another frequently used loss for binary classification, especially when we want to predict prob-
abilities (i.e., if Y = [0, 1] but all labels are binary), is the binary cross-entropy loss
Lce (y, y ′ ) = −(y log(y ′ ) + (1 − y) log(1 − y ′ )).
In contrast to the 0 − 1 loss, the cross-entropy loss is differentiable, which is desirable in deep
learning as we saw in Chapter 10.
In the coffee quality prediction problem, the quality is given as a fraction of the form k/10
for k = 0, . . . , 10. While this is a discrete set, it makes sense to more heavily penalize predictions
that are wrong by a larger amount. For example, predicting 4/10 instead of 8/10 should produce
a higher loss than predicting 7/10. Hence, we would not use the 0 − 1 loss but, for example, the
square loss. ♢
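For concreteness, these loss functions can be written down in a few lines (a small sketch of our own; the particular predictions and labels are made up for illustration).

import numpy as np

def square_loss(y, y_pred):
    return (y - y_pred) ** 2

def zero_one_loss(y, y_pred):
    return float(y != y_pred)

def binary_cross_entropy(y, y_pred):
    return -(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

# quality prediction: being off by a lot is penalized more by the square loss ...
print(square_loss(0.8, 0.4), square_loss(0.8, 0.7))      # 0.16 vs 0.01
# ... while both errors count the same for the 0-1 loss
print(zero_one_loss(0.8, 0.4), zero_one_loss(0.8, 0.7))  # 1.0 and 1.0
# cross-entropy for a binary label y = 1 predicted with probability 0.9
print(binary_cross_entropy(1.0, 0.9))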
How do we find a function h : X → Y with a risk that is as close as possible to the Bayes risk?
We will introduce a procedure to tackle this task in the next section.

14.2 Empirical risk minimization


Finding a minimizer of the risk constitutes a considerable challenge. First, we cannot search through
all measurable functions. Therefore, we need to restrict ourselves to a specific set H ⊆ {h : X → Y }
called the hypothesis set. In the following, this set will be some set of neural networks. Second,
we are faced with the problem that we cannot evaluate R(h) for non-trivial loss functions, because
the distribution D is typically unknown so that expectations with respect to D cannot be computed.
To approximate the risk, we will assume access to an i.i.d. sample of m observations drawn from D.
This is precisely the situation described in the coffee quality example of Figure 14.1, where m = 6
coffees were sampled.1 For a given hypothesis h we can then check how well it performs on our
sampled data.

Definition 14.4. Let m ∈ N, let L : Y × Y → R be a loss function, and let S = (x_i, y_i)_{i=1}^m ∈ (X × Y)^m be a sample. For h : X → Y , we call

    R̂_S(h) = (1/m) Σ_{i=1}^m L(h(x_i), y_i)

the empirical risk of h.

1
In practice, the assumption of independence of the samples is often unclear and typically not satisfied. For
instance, the selection of the six previously tested coffees might be influenced by external factors such as personal
preferences or availability at the local store, which introduce bias into the dataset.

If the sample S is drawn i.i.d. according to D, then we immediately see from the linearity of the expected value that R̂_S(h) is an unbiased estimator of R(h), i.e., E_{S∼D^m}[R̂_S(h)] = R(h).
Moreover, the weak law of large numbers states that the sample mean of an i.i.d. sequence of
integrable random variables converges to the expected value in probability. Hence, there is some
hope that, at least for large m ∈ N, minimizing the empirical risk instead of the population risk
might lead to a good hypothesis. We formalize this approach in the next definition.
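This convergence can be observed directly in a small simulation (our own sketch, for one fixed hypothesis h and a simple synthetic distribution; the Monte Carlo value below stands in for the exact population risk).

import numpy as np

rng = np.random.default_rng(5)
h = lambda x: 0.5 * x                    # a fixed hypothesis
f = lambda x: np.sin(3 * x)              # labels y = f(x) with x ~ U([-1, 1])

def empirical_risk(m):                   # empirical risk of h on a fresh sample of size m
    x = rng.uniform(-1, 1, size=m)
    return np.mean((h(x) - f(x)) ** 2)

x_ref = rng.uniform(-1, 1, size=10**6)   # Monte Carlo proxy for the population risk R(h)
R = np.mean((h(x_ref) - f(x_ref)) ** 2)

for m in [10, 100, 1000, 10000]:
    print(m, abs(empirical_risk(m) - R))
# the deviation from R(h) shrinks with m, roughly at rate 1/sqrt(m)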

Definition 14.5. Let H ⊆ {h : X → Y} be a hypothesis set. Let m ∈ N, let L : Y × Y → R be a loss function, and let S = (x_i, y_i)_{i=1}^m ∈ (X × Y)^m be a sample. We call a function h_S such that

    R̂_S(h_S) = inf_{h∈H} R̂_S(h)    (14.2.1)

an empirical risk minimizer.

From a generalization perspective, supervised deep learning is empirical risk minimization over
sets of neural networks. The question we want to address next is how effective this approach is at
producing hypotheses that achieve a risk close to the Bayes risk.
Let H be some hypothesis set, such that an empirical risk minimizer hS exists for all S ∈
(X × Y )m ; see Exercise 14.25 for an explanation of why this is a reasonable assumption. Moreover,
let g ∈ H be arbitrary. Then
    R(h_S) − R∗ = R(h_S) − R̂_S(h_S) + R̂_S(h_S) − R∗
                ≤ |R(h_S) − R̂_S(h_S)| + R̂_S(g) − R∗    (14.2.2)
                ≤ 2 sup_{h∈H} |R(h) − R̂_S(h)| + R(g) − R∗,

where in the first inequality we used that h_S is the empirical risk minimizer. By taking the infimum over all g, we conclude that

    R(h_S) − R∗ ≤ 2 sup_{h∈H} |R(h) − R̂_S(h)| + inf_{g∈H} R(g) − R∗ =: 2ε_gen + ε_approx.    (14.2.3)

Similarly, considering only (14.2.2) yields that

    R(h_S) ≤ sup_{h∈H} |R(h) − R̂_S(h)| + inf_{g∈H} R̂_S(g) =: ε_gen + ε_int.    (14.2.4)

How to choose H to reduce the approximation error ε_approx or the interpolation error ε_int was discussed at length in the previous chapters. The final piece is to figure out how to bound the generalization error sup_{h∈H} |R(h) − R̂_S(h)|. This will be discussed in the sections below.

14.3 Generalization bounds


We have seen that one aspect of successful learning is to bound the generalization error εgen in
(14.2.3). Let us first formally describe this problem.

Definition 14.6 (Generalization bound). Let H ⊆ {h : X → Y} be a hypothesis set, and let L : Y × Y → R be a loss function. Let κ : (0, 1) × N → R_+ be such that for every δ ∈ (0, 1) it holds that κ(δ, m) → 0 for m → ∞. We call κ a generalization bound for H if for every distribution D on X × Y , every m ∈ N, and every δ ∈ (0, 1), it holds with probability at least 1 − δ over the random sample S ∼ D^m that

    sup_{h∈H} |R(h) − R̂_S(h)| ≤ κ(δ, m).

Remark 14.7. For a generalization bound κ it holds that

    P[ R(h_S) − R̂_S(h_S) ≤ ε ] ≥ 1 − δ

as soon as m is so large that κ(δ, m) ≤ ε. If there exists an empirical risk minimizer h_S such that R̂_S(h_S) = 0, then with high probability the empirical risk minimizer will also have a small risk
R(hS ). Empirical risk minimization is often referred to as a “PAC” algorithm, which stands for
probably (δ) approximately correct (ε).
Definition 14.6 requires the upper bound κ on the discrepancy between the empirical risk and
the risk to be independent from the distribution D. Why should this be possible? After all, we could
have an underlying distribution that is not uniform and hence, certain data points could appear
very rarely in the sample. As a result, it should be very hard to produce a correct prediction
for such points. At first sight, this suggests that non-uniform distributions should be much more
challenging than uniform distributions. This intuition is incorrect, as the following argument based
on Example 14.1 demonstrates.

Example 14.8 (Generalization in the coffee quality problem). In Example 14.1, the underlying
distribution describes both our process of choosing coffees and the relation between the attributes
and the quality. Suppose we do not enjoy drinking coffee that costs less than 1€. Consequently,
we do not have a single sample of such coffee in the dataset, and therefore we have no chance of
learning the quality of cheap coffees.
However, the absence of coffee samples costing less than 1€ in our dataset is due to our general
avoidance of such coffee. As a result, we run a low risk of incorrectly classifying the quality of a
coffee that is cheaper than 1€, since it is unlikely that we will choose such a coffee in the future. ♢

To establish generalization bounds, we use stochastic tools that guarantee that the empirical
risk converges to the true risk as the sample size increases. This is typically achieved through
concentration inequalities. One of the simplest and most well-known is Hoeffding’s inequality, see
Theorem A.24. We will now apply Hoeffding’s inequality to obtain a first generalization bound.
This generalization bound is well-known and can be found in many textbooks on machine learning,
e.g., [178, 250]. Although the result does not yet encompass neural networks, it forms the basis for
a similar result applicable to neural networks, as we discuss subsequently.

Proposition 14.9 (Finite hypothesis set). Let H ⊆ {h : X → Y} be a finite hypothesis set. Let L : Y × Y → R be such that L(Y × Y) ⊆ [c_1, c_2] with c_2 − c_1 = C > 0.
Then, for every m ∈ N, every δ ∈ (0, 1), and every distribution D on X × Y it holds with probability at least 1 − δ over the sample S ∼ D^m that

    sup_{h∈H} |R(h) − R̂_S(h)| ≤ C √( (log(|H|) + log(2/δ)) / (2m) ).

Proof. Let H = {h_1, . . . , h_n}. Then it holds by a union bound that

    P[ ∃h_i ∈ H : |R(h_i) − R̂_S(h_i)| > ε ] ≤ Σ_{i=1}^n P[ |R(h_i) − R̂_S(h_i)| > ε ].

Note that R̂_S(h_i) is the mean of independent random variables which take their values almost surely in [c_1, c_2]. Additionally, R(h_i) is the expectation of R̂_S(h_i). The proof can therefore be finished by applying Theorem A.24. This will be addressed in Exercise 14.26.
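Proposition 14.9 can be turned into a rough sample-size estimate (our own back-of-the-envelope computation): solving C √((log|H| + log(2/δ))/(2m)) ≤ ε for m shows that m ≥ C²(log|H| + log(2/δ))/(2ε²) samples suffice.

import numpy as np

def required_samples(n_hypotheses, delta, eps, C=1.0):
    # smallest m with C * sqrt((log|H| + log(2/delta)) / (2m)) <= eps
    return int(np.ceil(C**2 * (np.log(n_hypotheses) + np.log(2 / delta)) / (2 * eps**2)))

print(required_samples(n_hypotheses=10**6, delta=0.01, eps=0.05))
# the required sample size grows only logarithmically in the number of hypotheses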

Consider now a non-finite set of neural networks H, and assume that it can be covered by a finite set of (small) balls. Applying Proposition 14.9 to the centers of these balls then allows us to derive a bound for H similar to the one in the proposition. This intuitive argument will be made rigorous in the following section.

14.4 Generalization bounds from covering numbers


To derive a generalization bound for classes of neural networks, we start by introducing the notion
of covering numbers.

Definition 14.10. Let A be a relatively compact subset of a metric space (X, d). For ε > 0, we call

    G(A, ε, (X, d)) := min{ n ∈ N | ∃ (x_i)_{i=1}^n ⊆ X such that ⋃_{i=1}^n B_ε(x_i) ⊃ A },

where B_ε(x) = {z ∈ X | d(z, x) ≤ ε}, the ε-covering number of A in X. In case X or d are clear from context, we also write G(A, ε, d) or G(A, ε, X) instead of G(A, ε, (X, d)).

A visualization of Definition 14.10 is given in Figure 14.2. As we will see, it is possible to upper
bound the ε-covering numbers of neural networks as a subset of L∞ ([0, 1]d ), assuming the weights
are confined to a fixed bounded set. The precise estimates are postponed to Section 14.5. Before
that, let us show how a finite covering number facilitates a generalization bound. We only consider
Euclidean feature spaces X in the following result. A more general version could be easily derived.


Figure 14.2: Illustration of the concept of covering numbers of Definition 14.10. The shaded set
A ⊆ R2 is covered by sixteen Euclidean balls of radius ε. Therefore, G(A, ε, R2 ) ≤ 16.

Theorem 14.11. Let C_Y, C_L > 0 and α > 0. Let Y ⊆ [−C_Y, C_Y], X ⊆ R^d for some d ∈ N, and H ⊆ {h : X → Y}. Further, let L : Y × Y → R be C_L-Lipschitz.
Then, for every distribution D on X × Y , every m ∈ N, and every δ ∈ (0, 1), it holds with probability at least 1 − δ over the sample S ∼ D^m that for all h ∈ H

    |R(h) − R̂_S(h)| ≤ 4 C_Y C_L √( (log(G(H, m^{−α}, L^∞(X))) + log(2/δ)) / m ) + 2C_L/m^α.

Proof. Let

    M = G(H, m^{−α}, L^∞(X))    (14.4.1)

and let H_M = (h_i)_{i=1}^M ⊆ H be such that for every h ∈ H there exists h_i ∈ H_M with ∥h − h_i∥_{L^∞(X)} ≤ 1/m^α. The existence of H_M follows by Definition 14.10.
Fix for the moment such h ∈ H and hi ∈ HM . By the reverse and normal triangle inequalities,
we have

    |R(h) − R̂_S(h)| − |R(h_i) − R̂_S(h_i)| ≤ |R(h) − R(h_i)| + |R̂_S(h) − R̂_S(h_i)|.

Moreover, from the monotonicity of the expected value and the Lipschitz property of L it follows that

    |R(h) − R(h_i)| ≤ E|L(h(x), y) − L(h_i(x), y)| ≤ C_L E|h(x) − h_i(x)| ≤ C_L/m^α.

A similar estimate yields |R̂_S(h) − R̂_S(h_i)| ≤ C_L/m^α.

205
We thus conclude that for every ε > 0
\[
P_{S\sim D^m}\Big[\exists h \in H \colon |R(h) - \widehat{R}_S(h)| \ge \varepsilon\Big] \le P_{S\sim D^m}\Big[\exists h_i \in H_M \colon |R(h_i) - \widehat{R}_S(h_i)| \ge \varepsilon - \frac{2C_L}{m^\alpha}\Big]. \tag{14.4.2}
\]
From Proposition 14.9, we know that for ε > 0 and δ ∈ (0, 1)
\[
P_{S\sim D^m}\Big[\exists h_i \in H_M \colon |R(h_i) - \widehat{R}_S(h_i)| \ge \varepsilon - \frac{2C_L}{m^\alpha}\Big] \le \delta \tag{14.4.3}
\]
as long as
\[
\varepsilon - \frac{2C_L}{m^\alpha} > C\sqrt{\frac{\log(M) + \log(2/\delta)}{2m}},
\]
where C is such that L(Y × Y) ⊆ [c₁, c₂] with c₂ − c₁ ≤ C. By the Lipschitz property of L we can
choose $C = 2\sqrt{2}\,C_L C_Y$.
Therefore, the definition of M in (14.4.1) together with (14.4.2) and (14.4.3) gives that with
probability at least 1 − δ it holds for all h ∈ H
\[
|R(h) - \widehat{R}_S(h)| \le 2\sqrt{2}\,C_L C_Y \sqrt{\frac{\log\big(G(H, m^{-\alpha}, L^\infty)\big) + \log(2/\delta)}{2m}} + \frac{2C_L}{m^\alpha}.
\]
This concludes the proof.

14.5 Covering numbers of deep neural networks


As we have seen in Theorem 14.11, estimating L∞-covering numbers is crucial for understanding the
generalization error. How can we determine these covering numbers? The set of neural networks of
a fixed architecture can be a quite complex set (see Chapter 13), so it is not immediately clear how
to cover it with balls, let alone know the number of required balls. The following lemma suggests a
simpler approach.

Lemma 14.12. Let $X_1, X_2$ be two metric spaces and let $f : X_1 \to X_2$ be Lipschitz continuous with
Lipschitz constant $C_{\mathrm{Lip}}$. For every relatively compact $A \subseteq X_1$ it holds that for all ε > 0
\[
G(f(A), C_{\mathrm{Lip}}\,\varepsilon, X_2) \le G(A, \varepsilon, X_1).
\]

The proof of Lemma 14.12 is left as an exercise. If we can represent the set of neural networks
as the image under the Lipschitz map of another set with known covering numbers, then Lemma
14.12 gives a direct way to bound the covering number of the neural network class.
Conveniently, we have already observed in Proposition 13.1, that the set of neural networks is
the image of PN (A, B) as in Definition 12.1 under the Lipschitz continuous realization map Rσ . It
thus suffices to establish the ε-covering number of PN (A, B) or equivalently of [−B, B]nA . Then,
using the Lipschitz property of Rσ that holds by Proposition 13.1, we can apply Lemma 14.12 to
find the covering numbers of N (σ; A, B). This idea is depicted in Figure 14.3.


Figure 14.3: Illustration of the main idea to deduce covering numbers of neural network spaces.
Points θ ∈ R2 in parameter space in the left figure correspond to functions Rσ (θ) in the right figure
(with matching colors). By Lemma 14.12, a covering of the parameter space on the left translates
to a covering of the function space on the right.

Proposition 14.13. Let B, ε > 0 and q ∈ N. Then
\[
G\big([-B,B]^q, \varepsilon, (\mathbb{R}^q, \|\cdot\|_\infty)\big) \le \lceil B/\varepsilon\rceil^q.
\]

Proof. We start with the one-dimensional case q = 1. We choose $k = \lceil B/\varepsilon\rceil$ and
\[
x_0 = -B + \varepsilon \quad\text{and}\quad x_j = x_{j-1} + 2\varepsilon \quad\text{for } j = 1, \dots, k-1.
\]
All points between −B and $x_{k-1}$ have distance at most ε to one of the $x_j$, and $x_{k-1} = -B + \varepsilon + 2(k-1)\varepsilon \ge B - \varepsilon$ since $k \ge B/\varepsilon$. We conclude that $G([-B,B], \varepsilon, \mathbb{R}) \le \lceil B/\varepsilon\rceil$. Set $X_k := \{x_0, \dots, x_{k-1}\}$.
For arbitrary q, we observe that for every $x \in [-B,B]^q$ there is an element of the Cartesian product $X_k^q = \times_{j=1}^q X_k$ with $\|\cdot\|_\infty$-distance at most ε. Clearly, $|X_k^q| = \lceil B/\varepsilon\rceil^q$, which completes the proof.

Having established a covering number for $[-B,B]^{n_A}$ and hence $PN(A, B)$, we can now estimate
the covering numbers of deep neural networks by combining Lemma 14.12 and Propositions 13.1
and 14.13.

Theorem 14.14. Let $A = (d_0, d_1, \dots, d_{L+1}) \in \mathbb{N}^{L+2}$, let $\sigma : \mathbb{R}\to\mathbb{R}$ be $C_\sigma$-Lipschitz continuous
with $C_\sigma \ge 1$, let $|\sigma(x)| \le C_\sigma|x|$ for all $x \in \mathbb{R}$, and let $B \ge 1$. Then
\[
G\big(\mathcal{N}(\sigma; A, B), \varepsilon, L^\infty([0,1]^{d_0})\big)
\le G\big([-B,B]^{n_A}, \varepsilon/(n_A (2 C_\sigma B d_{\max})^L), (\mathbb{R}^{n_A}, \|\cdot\|_\infty)\big)
\le \lceil n_A/\varepsilon \rceil^{n_A}\, \lceil 2 C_\sigma B d_{\max} \rceil^{n_A L}.
\]

We end this section by applying the previous theorem to the generalization bound of Theorem
14.11 with α = 1/2. To simplify the analysis, we restrict the discussion to neural networks with
range [−1, 1]. To this end, denote
\[
\mathcal{N}^*(\sigma; A, B) := \big\{\Phi \in \mathcal{N}(\sigma; A, B) \;\big|\; \Phi(x) \in [-1,1] \text{ for all } x \in [0,1]^{d_0}\big\}. \tag{14.5.1}
\]

Since N ∗ (σ; A, B) ⊆ N (σ; A, B) we can bound the covering numbers of N ∗ (σ; A, B) by those of
N (σ; A, B). This yields the following result.

Theorem 14.15. Let CL > 0 and let L : [−1, 1]×[−1, 1] → R be CL -Lipschitz continuous. Further,
let A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 , let σ : R → R be Cσ -Lipschitz continuous with Cσ ≥ 1, and
|σ(x)| ≤ Cσ |x| for all x ∈ R, and let B ≥ 1.
Then, for every m ∈ N, and every distribution D on X × [−1, 1] it holds with probability at least
1 − δ over S ∼ Dm that for all Φ ∈ N ∗ (σ; A, B)
\[
|R(\Phi) - \widehat{R}_S(\Phi)| \le 4C_L\sqrt{\frac{n_A\log(\lceil n_A\sqrt{m}\rceil) + L n_A \log(\lceil 2C_\sigma B d_{\max}\rceil) + \log(2/\delta)}{m}} + \frac{2C_L}{\sqrt{m}}.
\]

14.6 The approximation-complexity trade-off


We recall the decomposition of the error in (14.2.3)

R(hS ) − R∗ ≤ 2εgen + εapprox ,

where R∗ is the Bayes risk defined in (14.1.1). We make the following observations about the
approximation error εapprox and generalization error εgen in the context of neural network based
learning:

• Scaling of generalization error: By Theorem 14.15, for a hypothesis class H of neural networks
with $n_A$ weights and L layers, and for a sample of size $m \in \mathbb{N}$, the generalization error $\varepsilon_{\mathrm{gen}}$
essentially scales like
\[
\varepsilon_{\mathrm{gen}} = O\Big(\sqrt{\big(n_A\log(n_A m) + L n_A\log(n_A)\big)/m}\Big) \quad\text{as } m\to\infty.
\]

• Scaling of approximation error: Assume there exists h∗ such that R(h∗ ) = R∗ , and let the
loss function L be Lipschitz continuous in the first coordinate. Then

εapprox = inf R(h) − R(h∗ ) = inf E(x,y)∼D [L(h(x), y) − L(h∗ (x), y)]
h∈H h∈H
≤ C inf ∥h − h∗ ∥L∞ ,
h∈H

for some constant C > 0. We have seen in Chapters 5 and 7 that if we choose H as a
set of neural networks with size $n_A$ and L layers, then, for appropriate activation functions,
$\inf_{h\in H}\|h - h^*\|_{L^\infty}$ behaves like $n_A^{-r}$ if, e.g., $h^*$ is a d-dimensional s-Hölder regular function
and r = s/d (Theorem 5.23), or $h^* \in C^{k,s}([0,1]^d)$ and r < (k + s)/d (Theorem 7.10).
By these considerations, we conclude that for an empirical risk minimizer $\Phi_S$ from a set of neural
networks with $n_A$ weights and L layers, it holds that
\[
R(\Phi_S) - R^* \le O\Big(\sqrt{\big(n_A\log(m) + L n_A\log(n_A)\big)/m}\Big) + O\big(n_A^{-r}\big), \tag{14.6.1}
\]
for m → ∞ and for some r depending on the regularity of $h^*$. Note that enlarging the neural
network set, i.e., increasing $n_A$, has two effects: the term associated with approximation decreases,
and the term associated with generalization increases. This trade-off is known as the approximation-
complexity trade-off. The situation is depicted in Figure 14.4. The figure and (14.6.1) suggest
that the best model achieves the optimal trade-off between approximation and generalization
error. Using this notion, we can also separate all models into three classes:
• Underfitting: If the approximation error decays faster than the estimation error increases.
• Optimal : If the sum of approximation error and generalization error is at a minimum.
• Overfitting: If the approximation error decays slower than the estimation error increases.
In Chapter 15, we will see that deep learning often operates in the regime where the number of
parameters $n_A$ exceeds the optimal trade-off point. For certain architectures used in practice, $n_A$
can be so large that the theory of the approximation-complexity trade-off suggests that learning
should be impossible. However, we emphasize that the present analysis only provides upper bounds.
It does not prove that learning is impossible or even impractical in the overparameterized regime.
Moreover, in Chapter 11 we have already seen indications that learning in the overparametrized
regime need not necessarily lead to large generalization errors.
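To make the trade-off concrete, the following Python sketch evaluates the two terms of the upper bound (14.6.1) for illustrative values r = 1, m = 10000, L = 5, with all implicit constants set to one (the same convention as in Figure 14.4). It is an illustration of the bound only, not of an actual training run.

```python
import numpy as np

m, r, L = 10_000, 1.0, 5            # sample size, approximation rate, depth (illustrative)
n_A = np.arange(10, 5000, 10)       # number of network parameters

gen = np.sqrt((n_A * np.log(n_A * m) + L * n_A * np.log(n_A)) / m)  # generalization term
approx = n_A ** (-r)                                                 # approximation term
bound = gen + approx                                                 # right-hand side of (14.6.1)

n_opt = n_A[np.argmin(bound)]
print("optimal trade-off at roughly n_A =", int(n_opt))
```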

14.7 PAC learning from VC dimension


In addition to covering numbers, there are several other tools to analyze the generalization capacity
of hypothesis sets. In the context of classification problems, one of the most important is the so-
called Vapnik–Chervonenkis (VC) dimension.

14.7.1 Definition and examples


Let H be a hypothesis set of functions mapping from Rᵈ to {0, 1}. A set S = {x₁, . . . , xₙ} ⊆ Rᵈ
is said to be shattered by H if for every (y₁, . . . , yₙ) ∈ {0, 1}ⁿ there exists h ∈ H such that
h(xⱼ) = yⱼ for all j = 1, . . . , n.
The VC dimension quantifies the complexity of a function class via the number of points that
can in principle be shattered.

Definition 14.16. The VC dimension of H is the cardinality of the largest set S ⊆ Rd that is
shattered by H. We denote the VC dimension by VCdim(H).

Figure 14.4: Illustration of the approximation-complexity trade-off of Equation (14.6.1), showing the underfitting regime, the optimal trade-off, and the overfitting regime. Here we chose r = 1 and m = 10000, and all implicit constants are assumed to be equal to 1.

Example 14.17 (Intervals). Let $H = \{\mathbb{1}_{[a,b]} \mid a, b \in \mathbb{R}\}$. It is clear that VCdim(H) ≥ 2, since for $x_1 < x_2$ the functions
\[
\mathbb{1}_{[x_1-2,\,x_1-1]}, \quad \mathbb{1}_{[x_1-1,\,x_1]}, \quad \mathbb{1}_{[x_1,\,x_2]}, \quad \mathbb{1}_{[x_2,\,x_2+1]}
\]
realize all four possible label patterns when restricted to $S = \{x_1, x_2\}$, so S is shattered.
On the other hand, if $x_1 < x_2 < x_3$ then, since $h^{-1}(\{1\})$ is an interval for all $h \in H$, $h(x_1) = 1 = h(x_3)$ implies $h(x_2) = 1$. Hence, no set of three elements can be shattered. Therefore, VCdim(H) = 2. The situation is depicted in Figure 14.5. ♢
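This example is easy to check by brute force. The Python sketch below (illustrative only) enumerates interval classifiers with endpoints on a fine grid and confirms that two points can be shattered while three cannot.

```python
import itertools
import numpy as np

def shattered_by_intervals(points, grid=np.linspace(-5, 5, 201)):
    # Collect all 0/1 labelings of `points` realized by some indicator 1_[a,b] with a <= b on the grid.
    realized = {tuple(int(a <= x <= b) for x in points)
                for a, b in itertools.combinations_with_replacement(grid, 2)}
    realized.add(tuple(0 for _ in points))            # labeling produced by an "empty" interval
    return len(realized) == 2 ** len(points)

print(shattered_by_intervals([0.0, 1.0]))             # True:  two points are shattered
print(shattered_by_intervals([0.0, 1.0, 2.0]))        # False: the pattern (1, 0, 1) is not realizable
```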

Figure 14.5: Different ways to classify two or three points. The colored blocks correspond to intervals that produce different classifications of the points.

Example 14.18 (Half-spaces). Let $H_2 = \{\mathbb{1}_{[0,\infty)}(\langle w, \cdot\rangle + b) \mid w \in \mathbb{R}^2,\ b \in \mathbb{R}\}$ be a hypothesis set of rotated and shifted two-dimensional half-spaces. In Figure 14.6 we see that $H_2$ shatters a set of three points. More generally, for d ≥ 2 with
\[
H_d := \{x \mapsto \mathbb{1}_{[0,\infty)}(w^\top x + b) \mid w \in \mathbb{R}^d,\ b \in \mathbb{R}\}
\]
the VC dimension of $H_d$ equals d + 1. ♢

Figure 14.6: Different ways to classify three points by a half-space, [244, Figure 1.4].

In the example above, the VC dimension coincides with the number of parameters. However,
this is not true in general as the following example shows.
Example 14.19 (Infinite VC dimension). Let
\[
H := \{x \mapsto \mathbb{1}_{[0,\infty)}(\sin(wx)) \mid w \in \mathbb{R}\}.
\]
Then the VC dimension of H is infinite (Exercise 14.29). ♢

14.7.2 Generalization based on VC dimension


In the following, we consider a classification problem. Denote by D the data-generating distribution
on Rd × {0, 1}. Moreover, we let H be a set of functions from Rd → {0, 1}.
In the binary classification setup, the natural choice of loss function is the 0–1 loss
$L_{0-1}(y, y') = \mathbb{1}_{y\ne y'}$. Thus, given a sample S, the empirical risk of a function h ∈ H is
\[
\widehat{R}_S(h) = \frac{1}{m}\sum_{i=1}^m \mathbb{1}_{h(x_i)\ne y_i}.
\]

Moreover, the risk can be written as

R(h) = P(x,y)∼D [h(x) ̸= y],

i.e., the probability under (x, y) ∼ D of h misclassifying the label y of x.


We can now give a generalization bound in terms of the VC dimension of H, see, e.g., [178,
Corollary 3.19]:

Theorem 14.20. Let d, k ∈ N and H ⊆ {h : Rd → {0, 1}} have VC dimension k. Let D be a
distribution on Rd × {0, 1}. Then, for every δ > 0 and m ∈ N, it holds with probability at least 1 − δ
over a sample S ∼ Dm that for every h ∈ H
\[
|R(h) - \widehat{R}_S(h)| \le \sqrt{\frac{2k\log(em/k)}{m}} + \sqrt{\frac{\log(1/\delta)}{2m}}. \tag{14.7.1}
\]

In words, Theorem 14.20 tells us that if a hypothesis class has finite VC dimension, then a
hypothesis with a small empirical risk will have a small risk if the number of samples is large. This
shows that empirical risk minimization is a viable strategy in this scenario. Will this approach also
work if the VC dimension is not bounded? No, in fact, in that case, no learning algorithm will
succeed in reliably producing a hypothesis for which the risk is close to the best possible. We omit
the technical proof of the following theorem from [178, Theorem 3.23].

Theorem 14.21. Let k ∈ N and let H ⊆ {h : X → {0, 1}} be a hypothesis set with VC dimension
k. Then, for every m ∈ N and every learning algorithm A : (X × {0, 1})ᵐ → H there exists a
distribution D on X × {0, 1} such that
\[
P_{S\sim D^m}\left[R(A(S)) - \inf_{h\in H} R(h) > \sqrt{\frac{k}{320\,m}}\right] \ge \frac{1}{64}.
\]

Theorem 14.21 immediately implies the following statement for the generalization bound.

Corollary 14.22. Let k ∈ N and let H ⊆ {h : X → {0, 1}} be a hypothesis set with VC dimension
k. Then, for every m ∈ N there exists a distribution D on X × {0, 1} such that
\[
P_{S\sim D^m}\left[\sup_{h\in H}|R(h) - \widehat{R}_S(h)| > \sqrt{\frac{k}{1280\,m}}\right] \ge \frac{1}{64}.
\]

Proof. For a sample S, let $h_S \in H$ be an empirical risk minimizer, i.e., $\widehat{R}_S(h_S) = \min_{h\in H}\widehat{R}_S(h)$.
Let D be the distribution of Theorem 14.21. Moreover, for δ > 0, let $h_\delta \in H$ be such that
\[
R(h_\delta) - \inf_{h\in H} R(h) < \delta.
\]

Then, applying Theorem 14.21 with A(S) = $h_S$, it holds that
\[
\begin{aligned}
2\sup_{h\in H}|R(h) - \widehat{R}_S(h)| &\ge |R(h_S) - \widehat{R}_S(h_S)| + |R(h_\delta) - \widehat{R}_S(h_\delta)|\\
&\ge R(h_S) - \widehat{R}_S(h_S) + \widehat{R}_S(h_\delta) - R(h_\delta)\\
&\ge R(h_S) - R(h_\delta)\\
&> R(h_S) - \inf_{h\in H} R(h) - \delta,
\end{aligned}
\]
where we used the definition of $h_S$ in the third inequality. The proof is completed by applying
Theorem 14.21 and using that δ was arbitrary.

We have now seen that we have a generalization bound scaling like $O(1/\sqrt{m})$ for m → ∞ if
and only if the VC dimension of the hypothesis class is finite. In more quantitative terms, we require
the VC dimension of a neural network class to be smaller than m.
What does this imply for neural network functions? For ReLU neural networks the following holds [6, Theorem 8.8].

Theorem 14.23. Let $L \in \mathbb{N}$, $A \in \mathbb{N}^{L+2}$, and set
\[
H := \{\mathbb{1}_{[0,\infty)} \circ \Phi \mid \Phi \in \mathcal{N}(\sigma_{\mathrm{ReLU}}; A, \infty)\}.
\]
Then, there exists a constant C > 0 independent of L and A such that
\[
\mathrm{VCdim}(H) \le C\cdot\big(n_A L\log(n_A) + n_A L^2\big).
\]

The bound (14.7.1) is meaningful if m ≫ k. For ReLU neural networks as in Theorem 14.23,
this means m ≫ nA L log(nA ) + nA L2 . Fixing L = 1 this amounts to m ≫ nA log(nA ) for a
shallow neural network with nA parameters. This condition is contrary to what we assumed in
Chapter 11, where it was crucial that nA ≫ m. If the VC dimension of the neural network sets
scales like O(nA log(nA )), then Theorem 14.21 and Corollary 14.22 indicate that, at least for certain
distributions, generalization should not be possible in this regime. We will discuss the resolution
of this potential paradox in Chapter 15.

14.8 Lower bounds on achievable approximation rates


We conclude this chapter on the complexities and generalization bounds of neural networks by using
the established VC dimension bound of Theorem 14.23 to deduce limitations to the approximation
capacity of neural networks. The result described below was first given in [292].

Theorem 14.24. Let k, d ∈ N. Assume that for every ε > 0 there exist $L_\varepsilon \in \mathbb{N}$ and an architecture $A_\varepsilon$ with $L_\varepsilon$
layers and input dimension d such that
\[
\sup_{\|f\|_{C^k([0,1]^d)}\le 1}\; \inf_{\Phi\in\mathcal{N}(\sigma_{\mathrm{ReLU}};A_\varepsilon,\infty)} \|f - \Phi\|_{C^0([0,1]^d)} < \frac{\varepsilon}{2}.
\]

Then there exists C > 0 solely depending on k and d, such that for all ε ∈ (0, 1)
\[
n_{A_\varepsilon} L_\varepsilon \log(n_{A_\varepsilon}) + n_{A_\varepsilon} L_\varepsilon^2 \ge C\varepsilon^{-\frac{d}{k}}.
\]

Proof. For $x \in \mathbb{R}^d$ consider the "bump function"
\[
\tilde{f}(x) := \begin{cases} \exp\Big(1 - \dfrac{1}{1-\|x\|_2^2}\Big) & \text{if } \|x\|_2 < 1,\\[1ex] 0 & \text{otherwise,}\end{cases}
\]
and its scaled version
\[
\tilde{f}_\varepsilon(x) := \varepsilon\,\tilde{f}\big(2\varepsilon^{-1/k}x\big)
\]
for ε ∈ (0, 1). Then
\[
\mathrm{supp}(\tilde{f}_\varepsilon) \subseteq \Big[-\frac{\varepsilon^{1/k}}{2}, \frac{\varepsilon^{1/k}}{2}\Big]^d
\quad\text{and}\quad
\|\tilde{f}_\varepsilon\|_{C^k} \le 2^k\|\tilde{f}\|_{C^k} =: \tau_k > 0.
\]
Consider the equispaced point set $\{x_1, \dots, x_{N(\varepsilon)}\} = \varepsilon^{1/k}\mathbb{Z}^d \cap [0,1]^d$. The cardinality of this set
is $N(\varepsilon) \simeq \varepsilon^{-d/k}$. Given $y \in \{0,1\}^{N(\varepsilon)}$, let for $x \in \mathbb{R}^d$
\[
f_y(x) := \tau_k^{-1}\sum_{j=1}^{N(\varepsilon)} y_j\,\tilde{f}_\varepsilon(x - x_j). \tag{14.8.1}
\]
Then $f_y(x_j) = \tau_k^{-1}\varepsilon\,y_j$ for all $j = 1, \dots, N(\varepsilon)$ and $\|f_y\|_{C^k} \le 1$.


For every $y \in \{0,1\}^{N(\varepsilon)}$ let $\Phi_y \in \mathcal{N}(\sigma_{\mathrm{ReLU}}; A_{\tau_k^{-1}\varepsilon}, \infty)$ be such that
\[
\sup_{x\in[0,1]^d} |f_y(x) - \Phi_y(x)| < \frac{\varepsilon}{2\tau_k}.
\]
Then
\[
\mathbb{1}_{[0,\infty)}\Big(\Phi_y(x_j) - \frac{\varepsilon}{2\tau_k}\Big) = y_j \quad\text{for all } j = 1,\dots,N(\varepsilon).
\]
Hence, the VC dimension of $\mathcal{N}(\sigma_{\mathrm{ReLU}}; A_{\tau_k^{-1}\varepsilon}, \infty)$ is larger or equal to $N(\varepsilon)$. Theorem 14.23 thus
implies
\[
N(\varepsilon) \simeq \varepsilon^{-\frac{d}{k}} \le C\cdot\Big(n_{A_{\tau_k^{-1}\varepsilon}}\, L_{\tau_k^{-1}\varepsilon}\, \log\big(n_{A_{\tau_k^{-1}\varepsilon}}\big) + n_{A_{\tau_k^{-1}\varepsilon}}\, L^2_{\tau_k^{-1}\varepsilon}\Big)
\]
or equivalently
\[
\tau_k^{-\frac{d}{k}}\,\varepsilon^{-\frac{d}{k}} \le C\cdot\Big(n_{A_\varepsilon} L_\varepsilon \log(n_{A_\varepsilon}) + n_{A_\varepsilon} L_\varepsilon^2\Big).
\]
This completes the proof.

Figure 14.7: Illustration of fy from Equation (14.8.1) on [0, 1]2 .

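The function $f_y$ in (14.8.1) is easy to visualize numerically. The following Python sketch (an illustration only; we take k = 1, d = 2 and drop the normalization constant $\tau_k$) evaluates a random $f_y$ on a grid over [0, 1]², producing a picture of isolated bumps as in Figure 14.7.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, eps = 1, 2, 0.05                       # smoothness, dimension, accuracy (illustrative)

def bump(x):
    # Smooth bump supported in the Euclidean unit ball, evaluated along the last axis.
    r2 = np.sum(x ** 2, axis=-1)
    out = np.zeros_like(r2)
    inside = r2 < 1.0
    out[inside] = np.exp(1.0 - 1.0 / (1.0 - r2[inside]))
    return out

h = eps ** (1.0 / k)                         # grid spacing eps^{1/k}
centers = np.array([(i * h, j * h) for i in range(int(1 / h) + 1)
                                   for j in range(int(1 / h) + 1)])
y = rng.integers(0, 2, len(centers))         # random labels in {0, 1}

def f_y(x):                                  # (14.8.1) without the factor tau_k^{-1}
    return sum(yj * eps * bump(2 * eps ** (-1.0 / k) * (x - c))
               for yj, c in zip(y, centers))

xs = np.linspace(0, 1, 200)
X, Y = np.meshgrid(xs, xs)
Z = f_y(np.stack([X, Y], axis=-1))           # one small bump per center with label 1
print(Z.shape, float(Z.max()))
```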
To interpret Theorem 14.24, we consider two situations:


• In case the depth is allowed to increase at most logarithmically in ε, then reaching uniform
error ε for all $f \in C^k([0,1]^d)$ with $\|f\|_{C^k([0,1]^d)} \le 1$ requires
\[
n_{A_\varepsilon}\log(n_{A_\varepsilon})\,|\log(\varepsilon)| + n_{A_\varepsilon}\log(\varepsilon)^2 \ge C\varepsilon^{-\frac{d}{k}}.
\]
In terms of the neural network size, this (necessary) condition becomes $n_{A_\varepsilon} \ge C\varepsilon^{-d/k}/\log(\varepsilon)^2$.
As we have shown in Chapter 7, in particular Theorem 7.10, up to log terms this condition
is also sufficient. Hence, while the constructive proof of Theorem 7.10 might have seemed
rather specific, under the assumption of the depth increasing at most logarithmically (which
the construction in Chapter 7 satisfies), it was essentially optimal! The neural networks in
this proof are shown to have size $O(\varepsilon^{-d/k})$ up to log terms.

• If we allow the depth $L_\varepsilon$ to increase faster than logarithmically in ε, then the lower bound on
the required neural network size improves. Fixing for example $A_\varepsilon$ with $L_\varepsilon$ layers such that
$n_{A_\varepsilon} \le W L_\varepsilon$ for some fixed ε-independent $W \in \mathbb{N}$, the (necessary) condition on the depth
becomes
\[
W\log(W L_\varepsilon)\,L_\varepsilon^2 + W L_\varepsilon^3 \ge C\varepsilon^{-\frac{d}{k}}
\]
and hence $L_\varepsilon \gtrsim \varepsilon^{-d/(3k)}$.
We add that, for arbitrary depth, the upper bound on the VC dimension of Theorem 14.23
can be improved to $n_A^2$ [6, Theorem 8.6], and using this would improve the just established
lower bound to $L_\varepsilon \gtrsim \varepsilon^{-d/(2k)}$.
For fixed width, this corresponds to neural networks of size $O(\varepsilon^{-d/(2k)})$, which would mean
twice the convergence rate proven in Theorem 7.10. Indeed, it turns out that neural networks
can achieve this rate in terms of the neural network size [293].
To sum up, in order to get error ε uniformly for all ∥f ∥C k ([0,1]d ) ≤ 1, the size of a ReLU neural
network is required to increase at least like O(ε−d/(2k) ) as ε → 0, i.e. the best possible attainable
convergence rate is 2k/d. It has been proven that this rate is also achievable, and thus the bound
is sharp. Achieving this rate requires neural network architectures that grow faster in depth than
in width.

Bibliography and further reading
Classical statistical learning theory is based on the foundational work of Vapnik and Chervonenkis
[277]. This led to the formulation of the probably approximately correct (PAC) learning model
in [276], which is primarily utilized in this chapter. A streamlined mathematical introduction to
statistical learning theory can be found in [58].
Since statistical learning theory is well-established, there exists a substantial amount of excellent
expository work describing this theory. Some highly recommended books on the topic are [178,
250, 6]. The specific approach of characterizing learning via covering numbers has been discussed
extensively in [6, Chapter 14]. Specific results for ReLU activation used in this chapter were derived
in [241, 24]. The results of Section 14.8 describe some of the findings in [292, 293]. Other scenarios
in which the tightness of the upper bounds was shown are, for example, when quantization of the weights
is assumed [30, 77, 208], or when some form of continuity of the approximation scheme is assumed;
see [67] for general lower bounds (also applicable to neural networks).

Exercises
Exercise 14.25. Let H be a set of neural networks with fixed architecture, where the weights are
taken from a compact set. Moreover, assume that the activation function is continuous. Show that
for every sample S there always exists an empirical risk minimizer hS .

Exercise 14.26. Complete the proof of Proposition 14.9.

Exercise 14.27. Prove Lemma 14.12.

Exercise 14.28. Show that the VC dimension of $H_2$ of Example 14.18 is indeed 3, by demonstrating
that no set of four points can be shattered by $H_2$.

Exercise 14.29. Show that the VC dimension of

H := {x 7→ 1[0,∞) (sin(wx)) | w ∈ R}

is infinite.

Chapter 15

Generalization in the
overparameterized regime

In the previous chapter, we discussed the theory of generalization for deep neural networks trained
by minimizing the empirical risk. A key conclusion was that good generalization is possible as long
as we choose an architecture that has a moderate number of neural network parameters relative to
the number of training samples. Moreover, we saw in Section 14.6 that the best performance can be
expected when the neural network size is chosen to balance the generalization and approximation
errors, by minimizing their sum.


Figure 15.1: ImageNet Classification Competition: final score on the test set in the Top 1 category vs. parameters-to-training-samples ratio. Note that all architectures have more parameters than training samples. Architectures include AlexNet [148], VGG16 [255], GoogLeNet [263], ResNet50/ResNet152 [112], DenseNet121 [121], ViT-G/14 [296], EfficientNetB0 [265], and AmoebaNet [226].

Surprisingly, successful neural network architectures do not necessarily follow these theoretical
observations. Consider the neural network architectures in Figure 15.1. They represent some

of the most renowned image classification models, and all of them participated in the ImageNet
Classification Competition [66]. The training set consisted of 1.2 million images. The x-axis shows
the model performance, and the y-axis displays the ratio of the number of parameters to the size of
the training set; notably, all architectures have a ratio larger than one, i.e. have more parameters
than training samples. For the largest model, there are more neural network parameters than training samples by a factor of 1000.
Given that the practical application of deep learning appears to operate in a regime significantly
different from the one analyzed in Chapter 14, we must ask: Why do these methods still work
effectively?

15.1 The double descent phenomenon


The success of deep learning in a regime not covered by traditional statistical learning theory
puzzled researchers for some time. In [18], an intriguing set of experiments was performed. These
experiments indicate that while the risk follows the upper bound from Section 14.6 for neural
network architectures that do not interpolate the data, the curve does not expand to infinity in the
way that Figure 14.4 suggests. Instead, after surpassing the so-called “interpolation threshold”,
the risk starts to decrease again. This behavior, known as double descent, is illustrated in Figure
15.2.

Figure 15.2: Illustration of the double descent phenomenon. The risk R(h) and the empirical risk $\widehat{R}_S(h)$ are plotted against the expressivity of H. The classical regime (underfitting and overfitting) lies to the left of the interpolation threshold, the modern regime to its right.

15.1.1 Least-squares regression revisited


To gain further insight, we consider ridgeless kernel least-squares regression as introduced in Section
11.2. Consider a data sample $(x_j, y_j)_{j=1}^m \subseteq \mathbb{R}^d\times\mathbb{R}$ generated by some ground-truth function f, i.e.
\[
y_j = f(x_j) \quad\text{for } j = 1,\dots,m. \tag{15.1.1}
\]

Let $\phi_j : \mathbb{R}^d \to \mathbb{R}$, $j \in \mathbb{N}$, be a sequence of ansatz functions. For $n \in \mathbb{N}$, we wish to fit a function
$x \mapsto \sum_{i=1}^n w_i\phi_i(x)$ to the data using linear least-squares. To this end, we introduce the feature
map
\[
\mathbb{R}^d \ni x \mapsto \phi(x) := (\phi_1(x), \dots, \phi_n(x))^\top \in \mathbb{R}^n.
\]

The goal is to determine coefficients $w \in \mathbb{R}^n$ minimizing the empirical risk
\[
\widehat{R}_S(w) = \frac{1}{m}\sum_{j=1}^m \Big(\sum_{i=1}^n w_i\phi_i(x_j) - y_j\Big)^2 = \frac{1}{m}\sum_{j=1}^m \big(\langle\phi(x_j), w\rangle - y_j\big)^2.
\]
With
\[
A_n := \begin{pmatrix} \phi_1(x_1) & \dots & \phi_n(x_1)\\ \vdots & \ddots & \vdots\\ \phi_1(x_m) & \dots & \phi_n(x_m)\end{pmatrix} = \begin{pmatrix}\phi(x_1)^\top\\ \vdots\\ \phi(x_m)^\top\end{pmatrix} \in \mathbb{R}^{m\times n} \tag{15.1.2}
\]
and $y = (y_1, \dots, y_m)^\top$ it holds
\[
\widehat{R}_S(w) = \frac{1}{m}\|A_n w - y\|^2. \tag{15.1.3}
\]
As discussed in Sections 11.1–11.2, a unique minimizer of (15.1.3) only exists if $A_n$ has rank n.
For a minimizer $w_n$, the fitted function reads
\[
f_n(x) := \sum_{j=1}^n w_{n,j}\,\phi_j(x). \tag{15.1.4}
\]

We are interested in the behavior of the $f_n$ as a function of n (the number of ansatz functions/parameters of our model), and distinguish between two cases:
• Underparameterized: If n < m we have fewer parameters n than training points m. For
the least squares problem of minimizing $\widehat{R}_S$, this means that there are more conditions m
than free parameters n. Thus, in general, we cannot interpolate the data, and we have
$\min_{w\in\mathbb{R}^n}\widehat{R}_S(w) > 0$.

• Overparameterized: If n ≥ m, then we have at least as many parameters n as training points
m. If the $x_j$ and the $\phi_j$ are such that $A_n \in \mathbb{R}^{m\times n}$ has full rank m, then there exists w
such that $\widehat{R}_S(w) = 0$. If n > m, then $A_n$ necessarily has a nontrivial kernel, and there exist
infinitely many parameter choices w that yield zero empirical risk $\widehat{R}_S$. Some of them lead
to better, and some lead to worse, prediction functions $f_n$ in (15.1.4).
In the overparameterized case, there exist many minimizers of $\widehat{R}_S$. The training algorithm we
use to compute a minimizer determines the type of prediction function $f_n$ we obtain. We argued in
Chapter 11 that, for suitable initialization, gradient descent converges towards the minimal norm
minimizer¹
\[
w_{n,*} = \operatorname{argmin}_{w\in M}\|w\| \in \mathbb{R}^n, \qquad M = \{w\in\mathbb{R}^n \mid \widehat{R}_S(w) \le \widehat{R}_S(v)\ \ \forall v\in\mathbb{R}^n\}. \tag{15.1.5}
\]

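Numerically, the minimal norm minimizer (15.1.5) can be computed with the Moore–Penrose pseudoinverse. The short Python sketch below (illustrative, with random Gaussian features) verifies that np.linalg.pinv and np.linalg.lstsq both return this solution in the overparameterized case.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 25                                   # overparameterized: more columns than data points
A = rng.standard_normal((m, n))                 # plays the role of A_n in (15.1.2)
y = rng.standard_normal(m)

w_pinv = np.linalg.pinv(A) @ y                  # minimal norm solution A_n^+ y
w_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]  # lstsq also returns the minimal norm solution

print(np.allclose(A @ w_pinv, y))               # True: zero empirical risk
print(np.allclose(w_pinv, w_lstsq))             # True: both equal w_{n,*}
```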
15.1.2 An example
We consider a concrete example. In Figure 15.3 we plot a set of 40 ansatz functions ϕ1 , . . . , ϕ40 ,
which are drawn from a Gaussian process. Additionally, the figure shows a plot of the Runge
function f , and m = 18 equispaced points which are used as the training data points. We then fit
a function in span{ϕ1 , . . . , ϕn } via (15.1.5) and (15.1.4). The result is displayed in Figure 15.4:
1
Here, the index n emphasizes the dimension of wn,∗ ∈ Rn . This notation should not be confused with the
ridge-regularized minimizer wλ,∗ introduced in Chapter 11.

220
3 1.0 f
2 0.8 Data points
1
0.6
0
j

1 0.4
2 0.2
3
0.0
1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00
(a) ansatz functions ϕj (b) Runge function f and data points

Figure 15.3: Ansatz functions ϕ1 , . . . , ϕ40 drawn from a Gaussian process, along with the Runge
function and 18 equispaced data points.

• n = 2: The model can only represent functions in span{ϕ1 , ϕ2 }. It is not yet expressive
enough to give a meaningful approximation of f .

• n = 15: The model has sufficient expressivity to capture the main characteristics of f. Since
n = 15 < 18 = m, it is not yet able to interpolate the data. Thus it allows to strike a
good balance between the approximation and generalization error, which corresponds to the
scenario discussed in Chapter 14.

• n = 18: We are at the interpolation threshold. The model is capable of interpolating the data,
and there is a unique w such that R b S (w) = 0. Yet, in between data points the behavior of the
predictor f18 seems erratic, and displays strong oscillations. This is referred to as overfitting,
and is to be expected due to our analysis in Chapter 14; while the approximation error at the
data points has improved compared to the case n = 15, the generalization error has gotten
worse.

• n = 40: This is the overparameterized regime, where we have significantly more parameters
than data points. Our prediction f40 interpolates the data and appears to be the best overall
approximation to f so far, due to a “good” choice of minimizer of R b S , namely (15.1.5).
We also note that, while quite good, the fit is not perfect. We cannot expect significant
improvement in performance by further increasing n, since at this point the main limiting
factor is the amount of available data. Also see Figure 15.5 (a).

Figure 15.5 (a) displays the error ∥f − fn ∥L2 ([−1,1]) over n. We observe the characteristic double
descent curve, where the error initially decreases and then peaks at the interpolation threshold,
which is marked by the dashed red line. Afterwards, in the overparameterized regime, it starts to
decrease again. Figure 15.5 (b) displays ∥wn,∗ ∥. Note how the Euclidean norm of the coefficient
vector also peaks at the interpolation threshold.
We emphasize that the precise nature of the convergence curves depends strongly on various
factors, such as the distribution and number of training points m, the ground truth f , and the
choice of ansatz functions ϕj (e.g., the specific kernel used to generate the ϕj in Figure 15.3 (a)).
In the present setting we achieve a good approximation of f for n = 15 < 18 = m, corresponding to
the regime where the approximation and generalization errors are balanced. However, as Figure 15.5

(a) n = 2 (underparameterization). (b) n = 15 (balance of approximation and generalization error). (c) n = 18 (interpolation threshold). (d) n = 40 (overparameterization).

Figure 15.4: Fit of the m = 18 red data points using the ansatz functions ϕ1 , . . . , ϕn from Figure
15.3, employing equations (15.1.5) and (15.1.4) for different numbers of ansatz functions n.

(a) $\|f - f_n\|_{L^2([-1,1])}$. (b) $\|w_{n,*}\|$.

Figure 15.5: The L2 -error for the fitted functions in Figure 15.4, and the Euclidean norm of the
corresponding coefficient vector wn,∗ defined in (15.1.5).

(a) shows, it can be difficult to determine a suitable value of n < m a priori, and the acceptable
range of n values can be quite narrow. For overparametrization (n ≫ m), the precise choice of n is
less critical, potentially making the algorithm more stable in this regime. We encourage the reader
to conduct similar experiments and explore different settings to get a better feeling for the double
descent phenomenon.
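A minimal version of such an experiment is sketched below in Python. The setup mirrors the one above only loosely (kernel bandwidth, random seed, and evaluation grid are our own illustrative choices): ansatz functions are drawn from a Gaussian process with squared-exponential kernel, the data are m = 18 equispaced samples of the Runge function, and for each n the minimal norm least-squares fit (15.1.5) is computed via the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_max = 18, 40
x_train = np.linspace(-1, 1, m)
x_test = np.linspace(-1, 1, 500)
runge = lambda x: 1.0 / (1.0 + 25.0 * x ** 2)

# Draw n_max functions from a Gaussian process with squared-exponential kernel.
x_all = np.concatenate([x_train, x_test])
K = np.exp(-(x_all[:, None] - x_all[None, :]) ** 2 / (2 * 0.2 ** 2))
evals, evecs = np.linalg.eigh(K)
phi_all = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ rng.standard_normal((len(x_all), n_max))
phi_train, phi_test = phi_all[:m], phi_all[m:]

errors = []
for n in range(1, n_max + 1):
    A_n = phi_train[:, :n]                        # feature matrix (15.1.2)
    w = np.linalg.pinv(A_n) @ runge(x_train)      # minimal norm minimizer (15.1.5)
    f_n = phi_test[:, :n] @ w                     # fitted function (15.1.4) on the test grid
    errors.append(np.sqrt(np.mean((f_n - runge(x_test)) ** 2)))

print("error near the interpolation threshold:", errors[m - 1])
print("error in the overparameterized regime: ", errors[-1])
```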

15.2 Size of weights


In Figure 15.5, we observed that the norm of the coefficients ∥wn,∗ ∥ exhibits similar behavior to
the L2 -error, peaking at the interpolation threshold n = 18. In machine learning, large weights
are usually undesirable, as they are associated with large derivatives or oscillatory behavior. This
is evident in the example shown in Figure 15.4 for n = 18. Assuming that the data in (15.1.1)
was generated by a “smooth” function f , e.g. a function with moderate Lipschitz constant, these
large derivatives of the prediction function could lead to poor generalization. Such a smoothness
assumption about f may or may not be satisfied. However, if f is not smooth, there is little hope
of accurately recovering f from limited data (see the discussion in Section 9.2).
The next result gives an explanation for the observed behavior of ∥wn,∗ ∥.

Proposition 15.1. Assume that $x_1, \dots, x_m$ and the $(\phi_j)_{j\in\mathbb{N}}$ are such that $A_n$ in (15.1.2) has full
rank n for all n ≤ m. Given $y \in \mathbb{R}^m$, denote by $w_{n,*}(y)$ the vector in (15.1.5). Then
\[
n \mapsto \sup_{\|y\|=1}\|w_{n,*}(y)\| \quad\text{is monotonically}\quad \begin{cases}\text{increasing} & \text{for } n < m,\\ \text{decreasing} & \text{for } n \ge m.\end{cases}
\]

Proof. We start with the case n ≥ m. By assumption Am has full rank m, and thus An has rank
m for all n ≥ m, see (15.1.2). In particular, there exists wn ∈ Rn such that An wn = y. Now fix

y ∈ Rm and let wn be any such vector. Then wn+1 := (wn , 0) ∈ Rn+1 satisfies An+1 wn+1 = y
and ∥wn+1 ∥ = ∥wn ∥. Thus necessarily ∥wn+1,∗ ∥ ≤ ∥wn,∗ ∥ for the minimal norm solutions defined
in (15.1.5). Since this holds for every y, we obtain the statement for n ≥ m.
Now let n < m. Recall that the minimal norm solution can be written through the pseudoinverse
\[
w_{n,*}(y) = A_n^\dagger y,
\]
see Appendix B.1. That is,
\[
A_n^\dagger = V_n \begin{pmatrix} \mathrm{diag}\big(s_{n,1}^{-1},\dots,s_{n,n}^{-1}\big) & 0\end{pmatrix} U_n^\top \in \mathbb{R}^{n\times m},
\]
where $A_n = U_n \Sigma_n V_n^\top$ is the singular value decomposition of $A_n$, and
\[
\Sigma_n = \begin{pmatrix} \mathrm{diag}\big(s_{n,1},\dots,s_{n,n}\big)\\ 0\end{pmatrix} \in \mathbb{R}^{m\times n}
\]
contains the singular values $s_{n,1} \ge \dots \ge s_{n,n} > 0$ of $A_n \in \mathbb{R}^{m\times n}$ ordered by decreasing size. Since
$V_n \in \mathbb{R}^{n\times n}$ and $U_n \in \mathbb{R}^{m\times m}$ are orthogonal matrices, we have
\[
\sup_{\|y\|=1}\|w_{n,*}(y)\| = \sup_{\|y\|=1}\|A_n^\dagger y\| = s_{n,n}^{-1}.
\]

Finally, since the minimal singular value $s_{n,n}$ of $A_n$ can be written as
\[
s_{n,n} = \inf_{\substack{x\in\mathbb{R}^n\\ \|x\|=1}}\|A_n x\| \ge \inf_{\substack{x\in\mathbb{R}^{n+1}\\ \|x\|=1}}\|A_{n+1} x\| = s_{n+1,n+1},
\]
we observe that $n \mapsto s_{n,n}$ is monotonically decreasing for n ≤ m. This concludes the proof.
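The quantity $\sup_{\|y\|=1}\|w_{n,*}(y)\| = 1/s_{n,n}$ appearing in the proof is easy to track numerically. The Python sketch below (illustrative, with random Gaussian features) typically shows the increase up to n = m and the decrease afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_max = 10, 30
A_full = rng.standard_normal((m, n_max))        # columns play the role of the ansatz functions

norms = []
for n in range(1, n_max + 1):
    s = np.linalg.svd(A_full[:, :n], compute_uv=False)
    # sup_{||y||=1} ||A_n^+ y|| is the reciprocal of the smallest nonzero singular value of A_n.
    norms.append(1.0 / s[min(n, m) - 1])

print("peak at n =", int(np.argmax(norms)) + 1)  # typically n = m, the interpolation threshold
```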

15.3 Theoretical justification


Let us now examine one possible explanation of the double descent phenomenon for neural networks.
While there are many alternative arguments available in the literature (see the bibliography section),
the explanation presented here is based on a simplification of the ideas in [15].
The key assumption underlying our analysis is that large overparameterized neural networks
tend to be Lipschitz continuous with a Lipschitz constant independent of the size. This is a
consequence of neural networks typically having relatively small weights. To motivate this, let us
consider the class of neural networks N(σ; A, B) for an architecture A of width d ∈ N and depth
L ∈ N. If σ is Cσ-Lipschitz continuous with Cσ ≥ 1, and if B ≤ cB · (dCσ)⁻¹ for some cB > 0,
then by Lemma 13.2
\[
\mathcal{N}(\sigma; A, B) \subseteq \mathrm{Lip}_{c_B^{L+1}}(\mathbb{R}^{d_0}). \tag{15.3.1}
\]

An assumption of the type B ≤ cB · (dCσ )−1 , i.e. a scaling of the weights by the reciprocal 1/d of
the width, is not unreasonable in practice: Standard initialization schemes such as LeCun [156] or
He [111] initialization, use random weights with variance scaled inversely proportional to the input
dimension of each layer. Moreover, as we saw in Chapter 11, for very wide neural networks, the
weights do not move significantly from their initialization during training. Additionally, many
training routines use regularization terms on the weights, thereby encouraging the optimization
routine to find small weights.
We study the generalization capacity of Lipschitz functions through the covering-number-
based learning results of Chapter 14. The set LipC (Ω) of C-Lipschitz functions on a compact
d-dimensional Euclidean domain Ω has covering numbers bounded according to
\[
\log\big(G(\mathrm{Lip}_C(\Omega), \varepsilon, L^\infty)\big) \le C_{\mathrm{cov}}\cdot\Big(\frac{C}{\varepsilon}\Big)^d \quad\text{for all } \varepsilon > 0 \tag{15.3.2}
\]

for some constant Ccov independent of ε > 0. A proof can be found in [97, Lemma 7], see also [273].
As a result of these considerations, we can identify two regimes:

• Standard regime: For small neural network size nA , we consider neural networks as a set
parameterized by nA parameters. As we have seen before, this yields a bound on the gen-
eralization error that scales linearly with nA . As long as nA is small in comparison to the
number of samples, we can expect good generalization by Theorem 14.15.

• Overparameterized regime: For large neural network size nA and small weights, we consider
neural networks as a subset of LipC (Ω) for a constant C > 0. This set has a covering number
bound that is independent of the number of parameters nA .

Choosing the better of the two generalization bounds for each regime yields the following result.
Recall that N ∗ (σ; A, B) denotes all neural networks in N (σ; A, B) with a range contained in [−1, 1]
(see (14.5.1)).

Theorem 15.2. Let C, CL > 0 and let L : [−1, 1] × [−1, 1] → R be CL -Lipschitz. Further, let
A = (d0 , d1 , . . . , dL+1 ) ∈ NL+2 , let σ : R → R be Cσ -Lipschitz continuous with Cσ ≥ 1, and
|σ(x)| ≤ Cσ |x| for all x ∈ R, and let B > 0.
Then, there exist c1 , c2 > 0, such that for every m ∈ N and every distribution D on
[−1, 1]d0 × [−1, 1] it holds with probability at least 1 − δ over S ∼ Dm that for all Φ ∈
N ∗ (σ; A, B) ∩ LipC ([−1, 1]d0 )
\[
|R(\Phi) - \widehat{R}_S(\Phi)| \le g(A, C_\sigma, B, m) + 4C_L\sqrt{\frac{\log(4/\delta)}{m}}, \tag{15.3.3}
\]
where
\[
g(A, C_\sigma, B, m) = \min\left\{ c_1\sqrt{\frac{n_A\log(n_A\lceil\sqrt{m}\rceil) + L n_A\log(d_{\max})}{m}},\; c_2\, m^{-\frac{1}{2+d_0}}\right\}.
\]

Proof. Applying Theorem 14.11 with $\alpha = 1/(2 + d_0)$ and (15.3.2), we obtain that with probability
at least 1 − δ/2 it holds for all $\Phi \in \mathrm{Lip}_C([-1,1]^{d_0})$
\[
\begin{aligned}
|R(\Phi) - \widehat{R}_S(\Phi)| &\le 4C_L\sqrt{\frac{C_{\mathrm{cov}}(m^\alpha C)^{d_0} + \log(4/\delta)}{m}} + \frac{2C_L}{m^\alpha}\\
&\le 4C_L\sqrt{C_{\mathrm{cov}} C^{d_0}\, m^{d_0/(d_0+2)-1}} + \frac{2C_L}{m^\alpha} + 4C_L\sqrt{\frac{\log(4/\delta)}{m}}\\
&= 4C_L\sqrt{C_{\mathrm{cov}} C^{d_0}\, m^{-2/(d_0+2)}} + \frac{2C_L}{m^\alpha} + 4C_L\sqrt{\frac{\log(4/\delta)}{m}}\\
&= \frac{4C_L\sqrt{C_{\mathrm{cov}} C^{d_0}} + 2C_L}{m^\alpha} + 4C_L\sqrt{\frac{\log(4/\delta)}{m}},
\end{aligned}
\]
where we used in the second inequality that $\sqrt{x+y} \le \sqrt{x} + \sqrt{y}$ for all x, y ≥ 0.
In addition, Theorem 14.15 yields that with probability at least 1 − δ/2 it holds for all $\Phi \in \mathcal{N}(\sigma; A, B)$
\[
\begin{aligned}
|R(\Phi) - \widehat{R}_S(\Phi)| &\le 4C_L\sqrt{\frac{n_A\log(\lceil n_A\sqrt{m}\rceil) + Ln_A\log(\lceil 2C_\sigma B d_{\max}\rceil) + \log(4/\delta)}{m}} + \frac{2C_L}{\sqrt{m}}\\
&\le 6C_L\sqrt{\frac{n_A\log(\lceil n_A\sqrt{m}\rceil) + Ln_A\log(\lceil 2C_\sigma B d_{\max}\rceil)}{m}} + 4C_L\sqrt{\frac{\log(4/\delta)}{m}}.
\end{aligned}
\]
Then, for $\Phi \in \mathcal{N}^*(\sigma; A, B) \cap \mathrm{Lip}_C([-1,1]^{d_0})$ the minimum of both upper bounds holds with
probability at least 1 − δ.

The two regimes in Theorem 15.2 correspond to the two terms comprising the minimum in the
definition of g(A, Cσ , B, m). The first term increases with nA while the second is constant. In the
first regime, where the first term is smaller, the generalization gap |R(Φ) − R b S (Φ)| increases with
nA .
In the second regime, where the second term is smaller, the generalization gap is constant with
nA . Moreover, it is reasonable to assume that the empirical risk R b S will decrease with increasing
number of parameters nA .
By (15.3.3) we can bound the risk by
\[
R(\Phi) \le \widehat{R}_S(\Phi) + g(A, C_\sigma, B, m) + 4C_L\sqrt{\frac{\log(4/\delta)}{m}}.
\]
In the second regime, this upper bound is monotonically decreasing. In the first regime it may
both decrease and increase. In some cases, this behavior can lead to an upper bound on the risk
resembling the curve of Figure 15.2.
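The shape of this upper bound is easy to visualize. The Python sketch below (all constants set to one and a surrogate, monotonically decreasing empirical risk; purely illustrative) evaluates the minimum of the two terms in g and the resulting risk bound, which qualitatively resembles Figure 15.2.

```python
import numpy as np

m, L, d0, dmax = 10_000, 5, 10, 100     # sample size, depth, input dim, width (illustrative)
n_A = np.arange(10, 200_000, 100)

term_param = np.sqrt((n_A * np.log(n_A * np.ceil(np.sqrt(m))) + L * n_A * np.log(dmax)) / m)
term_lip = np.full_like(term_param, m ** (-1.0 / (2 + d0)))   # size-independent Lipschitz bound
g = np.minimum(term_param, term_lip)

emp_risk = 1.0 / np.sqrt(n_A)           # surrogate for a decreasing empirical risk
risk_bound = emp_risk + g               # upper bound on the risk as in the text

crossover = n_A[np.argmax(term_param > term_lip)]
print("Lipschitz-based bound takes over from n_A =", int(crossover))
```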
Remark 15.3. Theorem 15.2 assumes C-Lipschitz continuity of the neural networks. As we saw in
Sections 15.1.2 and 15.2, this assumption may not hold near the interpolation threshold. Hence,
Theorem 15.2 likely gives a too optimistic upper bound near the interpolation threshold.

Bibliography and further reading
The discussion on kernel regression and the effect of the number of parameters on the norm of the
weights was already given in [18]. Similar analyses, with more complex ansatz systems and more
precise asymptotic estimates, are found in [173, 107]. Our results in Section 15.3 are inspired by
[15]; see also [191].
For a detailed account of further arguments justifying the surprisingly good generalization
capabilities of overparameterized neural networks, we refer to [25, Section 2]. Here, we only briefly
mention two additional directions of inquiry. First, if the learning algorithm introduces a form of
robustness, this can be leveraged to yield generalization bounds [9, 291, 34, 215]. Second, for very
overparameterized neural networks, it was stipulated in [131] that neural networks become linear
kernel interpolators as discussed in Chapter 11. Thus, for large neural networks, generalization can
be studied through kernel regression [131, 158, 19, 162].

Exercises
Exercise 15.4. Let f : [−1, 1] → R be a continuous function, and let −1 ≤ x1 < · · · < xm ≤ 1 for
some fixed m ∈ N. As in Section 15.1.2, we wish to approximate f by a least squares approximation.
To this end we use the Fourier ansatz functions
\[
b_0(x) := \frac{1}{2} \qquad\text{and}\qquad b_j(x) := \begin{cases}\sin(\lceil j/2\rceil\,\pi x) & j \ge 1 \text{ odd},\\ \cos(\lceil j/2\rceil\,\pi x) & j \ge 1 \text{ even.}\end{cases} \tag{15.3.4}
\]

With the empirical risk
\[
\widehat{R}_S(w) = \frac{1}{m}\sum_{j=1}^m\Big(\sum_{i=0}^n w_i b_i(x_j) - y_j\Big)^2,
\]
denote by $w_n^* \in \mathbb{R}^{n+1}$ the minimal norm minimizer of $\widehat{R}_S$, and set $f_n(x) := \sum_{i=0}^n w^*_{n,i}\, b_i(x)$.
Show that in this case generalization fails in the overparametrized regime: for sufficiently large
n ≫ m, fn is not necessarily a good approximation to f . What does fn converge to as n → ∞?

Exercise 15.5. Consider the setting of Exercise 15.4. We adapt the ansatz functions in (15.3.4)
by rescaling them via
b̃j := cj bj .
Choose real numbers cj ∈ R, such that the corresponding minimal norm least squares solution
avoids the phenomenon encountered in Exercise 15.4.
Hint: Should ansatz functions corresponding to large frequencies be scaled by large or small
numbers to avoid overfitting?

Exercise 15.6. Prove (15.3.2) for d = 1.

Chapter 16

Robustness and adversarial examples

How sensitive is the output of a neural network to small changes in its input? Real-world obser-
vations of trained neural networks often reveal that even barely noticeable modifications of the
input can lead to drastic variations in the network’s predictions. This intriguing behavior was first
documented in the context of image classification in [264].
Figure 16.1 illustrates this concept. The left panel shows a picture of a panda that the neural
network correctly classifies as a panda. By adding an almost imperceptible amount of noise to the
image, we obtain the modified image in the right panel. To a human, there is no visible difference,
but the neural network classifies the perturbed image as a wombat. This phenomenon, where
a correctly classified image is misclassified after a slight perturbation, is termed an adversarial
example.
In practice, such behavior is highly undesirable. It indicates that our learning algorithm might
not be very reliable and poses a potential security risk, as malicious actors could exploit it to trick
the algorithm. In this chapter, we describe the basic mathematical principles behind adversarial
examples and investigate simple conditions under which they might or might not occur. For sim-
plicity, we restrict ourselves to a binary classification problem but note that the main ideas remain
valid in more general situations.

Figure 16.1: Sketch of an adversarial example. A panda image plus 0.01 times a barely visible noise image yields an image that a human still recognizes as a panda; the neural network classifier, however, changes its prediction from panda (high confidence) to wombat (high confidence), while the noise image alone is classified as a flamingo (low confidence).

16.1 Adversarial examples
Let us start by formalizing the notion of an adversarial example. We consider the problem of
assigning a label y ∈ {−1, 1} to a vector x ∈ Rd . It is assumed that the relation between x and y
is described by a distribution D on Rd × {−1, 1}. In particular, for a given x, both values −1 and
1 could have positive probability, i.e. the label is not necessarily deterministic. Additionally, we let
Dx := {x ∈ Rd | ∃y s.t. (x, y) ∈ supp(D)}, (16.1.1)
and refer to Dx as the feature support.
Throughout this chapter we denote by
g : Rd → {−1, 0, 1}
a fixed so-called ground-truth classifier, satisfying1
P[y = g(x)|x] ≥ P[y = −g(x)|x] for all x ∈ Dx . (16.1.2)
Note that we allow g to take the value 0, which is to be understood as an additional label corre-
sponding to nonrelevant or nonsensical input data x. We will refer to g −1 (0) as the nonrelevant
class. The ground truth g is interpreted as how a human would classify the data, as the following
example illustrates.
Example 16.1. We wish to classify whether an image shows a panda (y = 1) or a wombat (y = −1).
Consider again Figure 16.1, and denote the three images by x1 , x2 , x3 . The first image x1 is a
photograph of a panda. Together with a label y, it can be interpreted as a draw (x1 , y) from a
distribution of images D, i.e. x1 ∈ Dx and g(x1 ) = 1. The second image x2 displays noise and
corresponds to nonrelevant data as it shows neither a panda nor a wombat. In particular, x2 ∈ Dxc
and g(x2 ) = 0. The third (perturbed) image x3 also belongs to Dxc , as it is not a photograph but
a noise corrupted version of x1 . Nonetheless, it is not nonrelevant, as a human would classify it as
a panda. Thus g(x3 ) = 1. ♢
Additional to the ground truth g, we denote by
h : Rd → {−1, 1}
some trained classifier.

Definition 16.2. Let g : Rd → {−1, 0, 1} be the ground-truth classifier, let h : Rd → {−1, 1} be a


classifier, and let ∥ · ∥∗ be a norm on Rd . For x ∈ Rd and δ > 0, we call x′ ∈ Rd an adversarial
example to x ∈ Rd with perturbation δ, if and only if

(i) ∥x′ − x∥∗ ≤ δ,

(ii) g(x)g(x′ ) > 0,

(iii) h(x) = g(x) and h(x′ ) ̸= g(x′ ).

1
To be more precise, the conditional distribution of y|x is only well-defined almost everywhere w.r.t. the marginal
distribution of x. Thus (16.1.2) can only be assumed to hold for almost every x ∈ Dx w.r.t. the marginal
distribution of x.

In words, x′ is an adversarial example to x with perturbation δ, if (i) the distance of x and x′
is at most δ, (ii) x and x′ belong to the same (not nonrelevant) class according to the ground truth
classifier, and (iii) the classifier h correctly classifies x but misclassifies x′ .
Remark 16.3. We emphasize that the concept of a ground-truth classifier g differs from a minimizer
of the Bayes risk (14.1.1) for two reasons. First, we allow for an additional label 0 corresponding to
the nonrelevant class, which does not exist for the data generating distribution D. Second, g should
correctly classify points outside of Dx ; small perturbations of images as we find them in adversarial
examples, are not regular images in Dx . Nonetheless, a human classifier can still classify these
images, and g models this property of human classification.

16.2 Bayes classifier


At first sight, an adversarial example seems to be no more than a misclassified sample. Naturally,
these exist if the model does not generalize well. In this section we present the more nuanced view
of [260].
To avoid edge cases, we assume in the following that for all x ∈ Dx
either P[y = 1|x] > P[y = −1|x] or P[y = 1|x] < P[y = −1|x] (16.2.1)
so that (16.1.2) uniquely defines g(x) for x ∈ Dx . We say that the distribution exhausts the
domain if Dx ∪ g −1 (0) = Rd . This means that every point is either in the feature support Dx or
it belongs to the nonrelevant class. Moreover, we say that h is a Bayes classifier if
P[h(x)|x] ≥ P[−h(x)|x] for all x ∈ Dx .
By (16.1.2), the ground truth g is a Bayes classifier, and (16.2.1) ensures that h coincides with g
on Dx if h is a Bayes classifier. It is easy to see that a Bayes classifier minimizes the Bayes risk.
With these two notions, we now distinguish between four cases.
(i) Bayes classifier/exhaustive distribution: If h is a Bayes classifier and the data exhausts the
domain, then there are no adversarial examples. This is because every x ∈ Rd either belongs
to the nonrelevant class or is classified the same by h and g.
(ii) Bayes classifier/non-exhaustive distribution: If h is a Bayes classifier and the distribution
does not exhaust the domain, then adversarial examples can exist. Even though the learned
classifier h coincides with the ground truth g on the feature support, adversarial examples
can be constructed for data points on the complement of Dx ∪ g −1 (0), which is not empty.
(iii) Not a Bayes classifier/exhaustive distribution: The set Dx can be covered by the four sub-
domains
C1 = h−1 (1) ∩ g −1 (1), F1 = h−1 (−1) ∩ g −1 (1),
(16.2.2)
C−1 = h−1 (−1) ∩ g −1 (−1), F−1 = h−1 (1) ∩ g −1 (−1).
If dist(C1 ∩ Dx , F1 ∩ Dx ) or dist(C−1 ∩ Dx , F−1 ∩ Dx ) is smaller than δ, then there exist
points x, x′ ∈ Dx such that x′ is an adversarial example to x with perturbation δ. Hence,
adversarial examples in the feature support can exist. This is, however, not guaranteed to
happen. For example, Dx does not need to be connected if g −1 (0) ̸= ∅, see Exercise 16.18.
Hence, even for classifiers that have incorrect predictions on the data, adversarial examples
do not need to exist.

(iv) Not a Bayes classifier/non-exhaustive distribution: In this case everything is possible. Data
points and their associated adversarial examples can appear in the feature support of the
distribution and adversarial examples to elements in the feature support of the distribution
can be created by leaving the feature support of the distribution. We will see examples in
the following section.

16.3 Affine classifiers


For linear classifiers, a simple argument outlined in [264] and [95] showcases that the high-dimensionality
of the input, common in image classification problems, is a potential cause for the existence of ad-
versarial examples.
A linear classifier is a map of the form $x \mapsto \mathrm{sign}(w^\top x)$ where $w, x \in \mathbb{R}^d$. Let
\[
x' := x - 2|w^\top x|\,\frac{\mathrm{sign}(w^\top x)\,\mathrm{sign}(w)}{\|w\|_1},
\]
where sign(w) is understood coordinate-wise. Then $\|x - x'\|_\infty \le 2|w^\top x|/\|w\|_1$ and it is not hard
to see that $\mathrm{sign}(w^\top x') \ne \mathrm{sign}(w^\top x)$.
For high-dimensional vectors w, x chosen at random, but possibly dependent, such that w is
uniformly distributed on a (d − 1)-dimensional sphere, it holds with high probability that
\[
\frac{|w^\top x|}{\|w\|_1} \le \frac{\|x\|\,\|w\|}{\|w\|_1} \ll \|x\|.
\]
This can be seen by noting that for every c > 0
\[
\mu\big(\{w \in \mathbb{R}^d \mid \|w\|_1 > c,\ \|w\| \le 1\}\big) \to 1 \quad\text{for } d \to \infty, \tag{16.3.1}
\]
where µ is the uniform probability measure on the d-dimensional Euclidean unit ball, see Exercise
16.17. Thus, if x has a moderate Euclidean norm, the perturbation of x' is likely small for large
dimensions.
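This dimension dependence can be checked empirically. The short Python sketch below (illustrative) draws w uniformly from the unit sphere, forms x' as above, and reports the size of the resulting sup-norm perturbation relative to ∥x∥ for growing d.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [10, 100, 1000, 10000]:
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)                 # uniform random direction on the sphere
    x = rng.standard_normal(d)
    x_adv = x - 2 * abs(w @ x) * np.sign(w @ x) * np.sign(w) / np.linalg.norm(w, 1)
    flipped = np.sign(w @ x_adv) != np.sign(w @ x)
    rel_pert = np.max(np.abs(x_adv - x)) / np.linalg.norm(x)
    print(d, bool(flipped), f"{rel_pert:.1e}")   # the relative perturbation shrinks with d
```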
Below we give a sufficient condition for the existence of adversarial examples, in case both h
and the ground truth g are linear classifiers.

Theorem 16.4. Let $w, \overline{w} \in \mathbb{R}^d$ be nonzero. For $x \in \mathbb{R}^d$, let $h(x) = \mathrm{sign}(w^\top x)$ be a classifier and
let $g(x) = \mathrm{sign}(\overline{w}^\top x)$ be the ground-truth classifier.
For every $x \in \mathbb{R}^d$ with h(x)g(x) > 0 and all $\varepsilon \in (0, |\overline{w}^\top x|)$ such that
\[
\frac{|\overline{w}^\top x|}{\|\overline{w}\|} > \frac{\varepsilon + |w^\top x|}{\|w\|}\,\frac{|w^\top\overline{w}|}{\|w\|\,\|\overline{w}\|} \tag{16.3.2}
\]
it holds that
\[
x' = x - h(x)\,\frac{\varepsilon + |w^\top x|}{\|w\|^2}\,w \tag{16.3.3}
\]
is an adversarial example to x with perturbation $\delta = (\varepsilon + |w^\top x|)/\|w\|$.

Before we present the proof, we give some interpretation of this result. First, note that $\{x \in \mathbb{R}^d \mid w^\top x = 0\}$ is the decision boundary of h, meaning that points lying on opposite sides of this
hyperplane are classified differently by h. Due to $|\overline{w}^\top w| \le \|w\|\,\|\overline{w}\|$, (16.3.2) implies that an
adversarial example always exists whenever
\[
\frac{|\overline{w}^\top x|}{\|\overline{w}\|} > \frac{|w^\top x|}{\|w\|}. \tag{16.3.4}
\]
The left term is the decision margin of x for g, i.e. the distance of x to the decision boundary
of g. Similarly, the term on the right is the decision margin of x for h. Thus we conclude that
adversarial examples exist if the decision margin of x for the ground truth g is larger than that for
the classifier h.
Second, the term $(\overline{w}^\top w)/(\|w\|\,\|\overline{w}\|)$ describes the alignment of the two classifiers. If the classifiers are not aligned, i.e., w and $\overline{w}$ have a large angle between them, then adversarial examples
exist even if the margin of the classifier is larger than that of the ground-truth classifier.
Finally, adversarial examples with small perturbation are possible if $|w^\top x| \ll \|w\|$. The extreme case $w^\top x = 0$ means that x lies on the decision boundary of h, and if $|w^\top x| \ll \|w\|$ then
x is close to the decision boundary of h.

Proof (of Theorem 16.4). We verify that x' in (16.3.3) satisfies the conditions of an adversarial
example in Definition 16.2. In the following we will use that due to h(x)g(x) > 0
\[
g(x) = \mathrm{sign}(\overline{w}^\top x) = \mathrm{sign}(w^\top x) = h(x) \ne 0. \tag{16.3.5}
\]
First, it holds
\[
\|x - x'\| = \Big\|\frac{\varepsilon + |w^\top x|}{\|w\|^2}\,w\Big\| = \frac{\varepsilon + |w^\top x|}{\|w\|} = \delta.
\]
Next we show g(x)g(x') > 0, i.e. that $(\overline{w}^\top x)(\overline{w}^\top x')$ is positive. Plugging in the definition of
x', this term reads
\[
\begin{aligned}
\overline{w}^\top x\,\Big(\overline{w}^\top x - h(x)\,\frac{\varepsilon + |w^\top x|}{\|w\|^2}\,\overline{w}^\top w\Big)
&= |\overline{w}^\top x|^2 - |\overline{w}^\top x|\,\frac{\varepsilon + |w^\top x|}{\|w\|^2}\,\overline{w}^\top w\\
&\ge |\overline{w}^\top x|^2 - |\overline{w}^\top x|\,\frac{\varepsilon + |w^\top x|}{\|w\|^2}\,|\overline{w}^\top w|,
\end{aligned} \tag{16.3.6}
\]
where the equality holds because $h(x) = g(x) = \mathrm{sign}(\overline{w}^\top x)$ by (16.3.5). Dividing the right-hand
side of (16.3.6) by $|\overline{w}^\top x|\,\|\overline{w}\|$, which is positive by (16.3.5), we obtain
\[
\frac{|\overline{w}^\top x|}{\|\overline{w}\|} - \frac{\varepsilon + |w^\top x|}{\|w\|}\,\frac{|\overline{w}^\top w|}{\|w\|\,\|\overline{w}\|}. \tag{16.3.7}
\]
The term (16.3.7) is positive thanks to (16.3.2).
Finally, we check that $0 \ne h(x') \ne h(x)$, i.e. $(w^\top x)(w^\top x') < 0$. We have that
\[
(w^\top x)(w^\top x') = |w^\top x|^2 - w^\top x\, h(x)\,\frac{\varepsilon + |w^\top x|}{\|w\|^2}\,w^\top w = |w^\top x|^2 - |w^\top x|\,(\varepsilon + |w^\top x|) < 0,
\]
where we used that $h(x) = \mathrm{sign}(w^\top x)$. This completes the proof.

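The construction (16.3.3) is easy to test numerically. The following Python sketch (with arbitrarily chosen vectors; purely illustrative) checks condition (16.3.2) and, if it holds, verifies that the perturbed point flips the prediction of h while keeping the ground-truth label of g.

```python
import numpy as np

rng = np.random.default_rng(3)
d, eps = 50, 0.01
w_bar = rng.standard_normal(d)                  # ground truth g(x) = sign(w_bar @ x)
w = w_bar + 0.5 * rng.standard_normal(d)        # imperfectly aligned learned classifier h
x = rng.standard_normal(d)
while np.sign(w @ x) != np.sign(w_bar @ x):     # the theorem requires h(x) = g(x)
    x = rng.standard_normal(d)

lhs = abs(w_bar @ x) / np.linalg.norm(w_bar)
rhs = ((eps + abs(w @ x)) / np.linalg.norm(w)
       * abs(w @ w_bar) / (np.linalg.norm(w) * np.linalg.norm(w_bar)))

if lhs > rhs:                                   # condition (16.3.2)
    x_adv = x - np.sign(w @ x) * (eps + abs(w @ x)) / np.linalg.norm(w) ** 2 * w
    print("h flips:     ", np.sign(w @ x_adv) != np.sign(w @ x))           # True
    print("g unchanged: ", np.sign(w_bar @ x_adv) == np.sign(w_bar @ x))   # True
    print("perturbation:", np.linalg.norm(x_adv - x))                      # (eps + |w@x|)/||w||
else:
    print("condition (16.3.2) not satisfied for this draw; try another seed")
```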
Theorem 16.4 readily implies the following proposition for affine classifiers.

Proposition 16.5. Let $w, \overline{w} \in \mathbb{R}^d$ and $b, \overline{b} \in \mathbb{R}$. For $x \in \mathbb{R}^d$ let $h(x) = \mathrm{sign}(w^\top x + b)$ be a
classifier and let $g(x) = \mathrm{sign}(\overline{w}^\top x + \overline{b})$ be the ground-truth classifier.
For every $x \in \mathbb{R}^d$ with $w^\top x \ne 0$, h(x)g(x) > 0, and all $\varepsilon \in (0, |\overline{w}^\top x + \overline{b}|)$ such that
\[
\frac{|\overline{w}^\top x + \overline{b}|^2}{\|\overline{w}\|^2 + \overline{b}^2} > \frac{(\varepsilon + |w^\top x + b|)^2}{\|w\|^2 + b^2}\,\frac{(w^\top\overline{w} + b\overline{b})^2}{(\|w\|^2 + b^2)(\|\overline{w}\|^2 + \overline{b}^2)}
\]
it holds that
\[
x' = x - h(x)\,\frac{\varepsilon + |w^\top x + b|}{\|w\|^2}\,w
\]
is an adversarial example with perturbation $\delta = (\varepsilon + |w^\top x + b|)/\|w\|$ to x.

The proof is left to the reader, see Exercise 16.19.


Let us now study two cases of linear classifiers, which allow for different types of adversarial
examples. In the following two examples, the ground-truth classifier $g : \mathbb{R}^d \to \{-1,1\}$ is given by
$g(x) = \mathrm{sign}(\overline{w}^\top x)$ for $\overline{w} \in \mathbb{R}^d$ with $\|\overline{w}\| = 1$.
For the first example, we construct a Bayes classifier h admitting adversarial examples in the
complement of the feature support. This corresponds to case (ii) in Section 16.2.

Example 16.6. Let D be the uniform distribution on
\[
\{(\lambda\overline{w}, g(\lambda\overline{w})) \mid \lambda \in [-1,1]\setminus\{0\}\} \subseteq \mathbb{R}^d\times\{-1,1\}.
\]
The feature support equals
\[
D_x = \{\lambda\overline{w} \mid \lambda\in[-1,1]\setminus\{0\}\} \subseteq \mathrm{span}\{\overline{w}\}.
\]
Next fix α ∈ (0, 1) and set $w := \alpha\overline{w} + \sqrt{1-\alpha^2}\,v$ for some $v \in \overline{w}^\perp$ with ∥v∥ = 1, so that ∥w∥ = 1.
We let $h(x) := \mathrm{sign}(w^\top x)$. We now show that every $x \in D_x$ satisfies the assumptions of Theorem
16.4, and therefore admits an adversarial example.
Note that h(x) = g(x) for every $x \in D_x$. Hence h is a Bayes classifier. Now fix $x \in D_x$. Then
$|w^\top x| \le \alpha|\overline{w}^\top x|$, so that (16.3.2) is satisfied. Furthermore, for every ε > 0 it holds that
\[
\delta := \frac{\varepsilon + |w^\top x|}{\|w\|} \le \varepsilon + \alpha.
\]
Hence, for $\varepsilon < |\overline{w}^\top x|$ it holds by Theorem 16.4 that there exists an adversarial example with
perturbation less than ε + α. For small α, the situation is depicted in the upper panel of Figure
16.2. ♢

For the second example, we construct a distribution with global feature support and a classifier
which is not a Bayes classifier. This corresponds to case (iv) in Section 16.2.

Example 16.7. Let $D_x$ be a distribution on $\mathbb{R}^d$ with positive Lebesgue density everywhere outside
the decision boundary $\mathrm{DB}_g = \{x \mid \overline{w}^\top x = 0\}$ of g. We define D to be the distribution of (X, g(X))
for $X \sim D_x$. In addition, let $w \notin \{\pm\overline{w}\}$, ∥w∥ = 1 and $h(x) = \mathrm{sign}(w^\top x)$. We exclude $w = -\overline{w}$
because, in this case, every prediction of h is wrong, and thus no adversarial examples are possible.
By construction the feature support is given by $D_x = \mathbb{R}^d$. Moreover, $h^{-1}(\{-1\})$, $h^{-1}(\{1\})$ and
$g^{-1}(\{-1\})$, $g^{-1}(\{1\})$ are half spaces, which implies, in the notation of (16.2.2), that
\[
\mathrm{dist}(C_{\pm 1}\cap D_x, F_{\pm 1}\cap D_x) = \mathrm{dist}(C_{\pm 1}, F_{\pm 1}) = 0.
\]
Hence, for every δ > 0 there is a positive probability of observing x to which an adversarial example
with perturbation δ exists.
The situation is depicted in the lower panel of Figure 16.2. ♢

16.4 ReLU neural networks


So far we discussed classification by affine classifiers. A binary classifier based on a ReLU neural
network is a function Rd ∋ x 7→ sign(Φ(x)), where Φ is a ReLU neural network. As noted in [264],
the arguments for affine classifiers, see Proposition 16.5, can be applied to the affine pieces of Φ, to
show existence of adversarial examples.
Consider a ground-truth classifier g : Rd → {−1, 0, 1}. For each x ∈ Rd we define the geometric
margin of g at x as

µg (x) := dist(x, g −1 ({g(x)})c ), (16.4.1)

i.e., as the distance of x to the closest element that is classified differently from x or the infimum
over all distances to elements from other classes if no closest element exists. Additionally, we denote
the distance of x to the closest adjacent affine piece by

νΦ (x) := dist(x, AcΦ,x ), (16.4.2)

where AΦ,x is the largest connected region on which Φ is affine and which contains x. We have the
following theorem.

Theorem 16.8. Let $\Phi : \mathbb{R}^d \to \mathbb{R}$ and for $x \in \mathbb{R}^d$ let $h(x) = \mathrm{sign}(\Phi(x))$. Denote by $g : \mathbb{R}^d \to \{-1,0,1\}$ the ground-truth classifier. Let $x \in \mathbb{R}^d$ and ε > 0 be such that $\nu_\Phi(x) > 0$, $g(x) \ne 0$,
$\nabla\Phi(x) \ne 0$ and
\[
\mu_g(x),\ \nu_\Phi(x) > \frac{\varepsilon + |\Phi(x)|}{\|\nabla\Phi(x)\|}.
\]
Then
\[
x' := x - h(x)\,\frac{\varepsilon + |\Phi(x)|}{\|\nabla\Phi(x)\|^2}\,\nabla\Phi(x)
\]
is an adversarial example to x with perturbation $\delta = (\varepsilon + |\Phi(x)|)/\|\nabla\Phi(x)\|$.

Proof. We show that x′ satisfies the properties in Definition 16.2.
By construction ∥x − x′ ∥ ≤ δ. Since µg (x) > δ it follows that g(x) = g(x′ ). Moreover, by
assumption g(x) ̸= 0, and thus g(x)g(x′ ) > 0.
It only remains to show that h(x′ ) ̸= h(x). Since δ < νΦ (x), we have that Φ(x) = ∇Φ(x)⊤ x + b
and $\Phi(x') = \nabla\Phi(x)^\top x' + b$ for some $b \in \mathbb{R}$. Therefore,
\[
\Phi(x) - \Phi(x') = \nabla\Phi(x)^\top(x - x') = \nabla\Phi(x)^\top\Big(h(x)\,\frac{\varepsilon + |\Phi(x)|}{\|\nabla\Phi(x)\|^2}\,\nabla\Phi(x)\Big) = h(x)\big(\varepsilon + |\Phi(x)|\big).
\]

Since h(x)|Φ(x)| = Φ(x) it follows that Φ(x′ ) = −h(x)ε. Hence, h(x′ ) = −h(x), which completes
the proof.

Remark 16.9. We look at the key parameters in Theorem 16.8 to understand which factors facilitate
adversarial examples.

• The geometric margin of the ground-truth classifier µg (x): To make the construction possible,
we need to be sufficiently far away from points that belong to a different class than x or to
the nonrelevant class.

• The distance to the next affine piece νΦ (x): Since we are looking for an adversarial example
within the same affine piece as x, we need this piece to be sufficiently large.

• The perturbation δ: The perturbation is given by (ε + |Φ(x)|)/∥∇Φ(x)∥, which depends on


the classification margin |Φ(x)| of the ReLU classifier and its sensitivity to inputs ∥∇Φ(x)∥.
For adversarial examples to be possible, we either want a small classification margin of Φ or
a high sensitivity of Φ to its inputs.

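The construction of Theorem 16.8 can be carried out explicitly for a small ReLU network. The Python sketch below (a randomly initialized one-hidden-layer network; all choices are illustrative) computes Φ(x), the gradient of the affine piece containing x, and the perturbed point x'; the sign of Φ is guaranteed to flip whenever x' stays on the same affine piece, i.e. whenever ν_Φ(x) exceeds the perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, width = 10, 32
W1, b1 = rng.standard_normal((width, d)), rng.standard_normal(width)
w2, b2 = rng.standard_normal(width), rng.standard_normal()

def phi(x):
    return w2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def grad_phi(x):
    # On the affine piece containing x only the active ReLU neurons contribute.
    active = (W1 @ x + b1 > 0).astype(float)
    return (w2 * active) @ W1

x = rng.standard_normal(d)
eps = 1e-3
g = grad_phi(x)
x_adv = x - np.sign(phi(x)) * (eps + abs(phi(x))) / np.linalg.norm(g) ** 2 * g

same_piece = np.array_equal(W1 @ x + b1 > 0, W1 @ x_adv + b1 > 0)
print("same affine piece (nu_Phi(x) > delta):", same_piece)
print("sign before:", np.sign(phi(x)), " sign after:", np.sign(phi(x_adv)))
print("perturbation:", np.linalg.norm(x_adv - x))   # equals (eps + |phi(x)|)/||grad phi(x)||
```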
16.5 Robustness
Having established that adversarial examples can arise in various ways under mild assumptions, we
now turn our attention to conditions that prevent their existence.

16.5.1 Global Lipschitz regularity


We have repeatedly observed in the previous sections that a large value of ∥w∥ for linear classifiers
sign(w⊤ x), or ∥∇Φ(x)∥ for ReLU classifiers sign(Φ(x)), facilitates the occurrence of adversarial ex-
amples. Naturally, both these values are upper bounded by the Lipschitz constant of the classifier’s
inner functions x 7→ w⊤ x and x 7→ Φ(x). Consequently, it was stipulated early on that bound-
ing the Lipschitz constant of the inner functions could be an effective measure against adversarial
examples [264].
We have the following result for general classifiers of the form x 7→ sign(Φ(x)).

Proposition 16.10. Let Φ : Rd → R be CL -Lipschitz with CL > 0, and let s > 0. Let h(x) =
sign(Φ(x)) be a classifier, and let g : Rd → {−1, 0, 1} be a ground-truth classifier. Moreover, let
x ∈ Rd be such that

Φ(x)g(x) ≥ s. (16.5.1)

Then there does not exist an adversarial example to x of perturbation δ < s/CL .

Proof. Let x ∈ Rd satisfy (16.5.1), let δ < s/CL , and let x′ ∈ Rd be such that ∥x′ − x∥ ≤ δ. The
Lipschitz continuity of Φ implies

    |Φ(x′) − Φ(x)| ≤ CL ∥x′ − x∥ ≤ CL δ < s.

Since |Φ(x)| ≥ s by (16.5.1), we conclude that Φ(x′) has the same sign as Φ(x), which shows that
x′ cannot be an adversarial example to x.
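Proposition 16.10 yields a simple robustness certificate whenever an upper bound on the Lipschitz constant of Φ is available. The sketch below uses the standard (often pessimistic) estimate of the Lipschitz constant by the product of the spectral norms of the weight matrices, which is valid since the ReLU is 1-Lipschitz; this is not the bound of Lemma 13.2, and the network weights and the input are arbitrary. The point x is assumed to be correctly classified with margin s.

import numpy as np

rng = np.random.default_rng(1)
# Hypothetical ReLU network Phi with two hidden layers and scalar output.
weights = [rng.standard_normal((32, 8)), rng.standard_normal((32, 32)), rng.standard_normal((1, 32))]
biases = [rng.standard_normal(32), rng.standard_normal(32), rng.standard_normal(1)]

def phi(x):
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(W @ x + b, 0.0)
    return (weights[-1] @ x + biases[-1])[0]

# Since the ReLU is 1-Lipschitz, the product of the spectral norms of the weight
# matrices is an upper bound C_L on the Lipschitz constant of Phi (Euclidean norm).
C_L = np.prod([np.linalg.norm(W, 2) for W in weights])

x = rng.standard_normal(8)
s = abs(phi(x))   # margin; assuming g(x)Phi(x) >= s, i.e. x is correctly classified,
print(s / C_L)    # no adversarial example exists with perturbation below s / C_L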

Remark 16.11. As we have seen in Lemma 13.2, we can bound the Lipschitz constant of ReLU
neural networks by restricting the magnitude and number of their weights and the number of
layers.
There has been some criticism of results of this form, see, e.g., [124], since an assumption on
the Lipschitz constant may restrict the capabilities of the neural network too much. We
next present a result that shows under which assumptions on the training set, there exists a neural
network that classifies the training set correctly, but does not allow for adversarial examples within
the training set.

Theorem 16.12. Let m ∈ N, let g : Rd → {−1, 0, 1} be a ground-truth classifier, and let
(xi, g(xi))_{i=1}^m ∈ (Rd × {−1, 1})^m. Assume that

    M̃ := sup_{i̸=j} |g(xi) − g(xj)| / ∥xi − xj∥ > 0.

Then there exists a ReLU neural network Φ with depth(Φ) = O(log(m)) and width(Φ) = O(dm)
such that for all i = 1, . . . , m

    sign(Φ(xi)) = g(xi),

and there is no adversarial example of perturbation δ = 1/M̃ to xi.

Proof. The result follows directly from Theorem 9.6 and Proposition 16.10. The reader is invited
to complete the argument in Exercise 16.20.

16.5.2 Local regularity


One issue with upper bounds involving global Lipschitz constants such as those in Proposition
16.10 is that these bounds may be quite large for deep neural networks. For example, the upper
bound given in Lemma 13.2 is

    ∥Φ(x) − Φ(x′)∥_∞ ≤ C_σ^L · (B d_max)^{L+1} ∥x − x′∥_∞,

which grows exponentially with the depth of the neural network. However, in practice this bound
may be pessimistic, and locally the neural network might have significantly smaller gradients than
the global Lipschitz constant.
Because of this, it is reasonable to study results preventing adversarial examples under local
Lipschitz bounds. Such a result together with an algorithm providing bounds on the local Lipschitz
constant was proposed in [113]. We state the theorem adapted to our set-up.

Theorem 16.13. Let h : Rd → {−1, 1} be a classifier of the form h(x) = sign(Φ(x)) and let
g : Rd → {−1, 0, 1} be the ground-truth classifier. Let x ∈ Rd satisfy g(x) ̸= 0, and set

    α := max_{R>0} min{ Φ(x)g(x) / sup_{∥y−x∥_∞ ≤ R, y̸=x} ( |Φ(y) − Φ(x)| / ∥x − y∥_∞ ), R },    (16.5.2)

where the minimum is understood to be R in case the supremum is zero. Then there are no adver-
sarial examples to x with perturbation δ < α.

Proof. Let x ∈ Rd be as in the statement of the theorem. Assume, towards a contradiction, that
for 0 < δ < α satisfying (16.5.2), there exists an adversarial example x′ to x with perturbation δ.
If the supremum in (16.5.2) is zero, then Φ is constant on a ball of radius R around x. In
particular for ∥x′ − x∥ ≤ δ < R it holds that h(x′ ) = h(x) and x′ cannot be an adversarial
example.
Now assume the supremum in (16.5.2) is not zero. It holds by (16.5.2) for δ < R that

    δ < Φ(x)g(x) / sup_{∥y−x∥_∞ ≤ R, y̸=x} ( |Φ(y) − Φ(x)| / ∥x − y∥_∞ ).    (16.5.3)

Moreover,

    |Φ(x′) − Φ(x)| ≤ ( sup_{∥y−x∥_∞ ≤ R, y̸=x} |Φ(y) − Φ(x)| / ∥x − y∥_∞ ) ∥x − x′∥_∞
                  ≤ ( sup_{∥y−x∥_∞ ≤ R, y̸=x} |Φ(y) − Φ(x)| / ∥x − y∥_∞ ) δ < Φ(x)g(x),

where we applied (16.5.3) in the last line. It follows that

    g(x)Φ(x′) = g(x)Φ(x) + g(x)(Φ(x′) − Φ(x)) ≥ g(x)Φ(x) − |Φ(x′) − Φ(x)| > 0.

This rules out x′ as an adversarial example.

The supremum in (16.5.2) is bounded by the Lipschitz constant of Φ on BR (x). Thus Theorem
16.13 depends only on the local Lipschitz constant of Φ. One obvious criticism of this result is
that the computation of (16.5.2) is potentially prohibitive. We next show a different result, for
which the assumptions can immediately be checked by applying a simple algorithm that we present
subsequently.
To state the following proposition, for a continuous function Φ : Rd → R, x ∈ Rd, and δ > 0 we
define

    z^{δ,max} := max{Φ(y) | ∥y − x∥_∞ ≤ δ},    (16.5.4)
    z^{δ,min} := min{Φ(y) | ∥y − x∥_∞ ≤ δ}.    (16.5.5)

Proposition 16.14. Let h : Rd → {−1, 1} be a classifier of the form h(x) = sign(Φ(x)) and
g : Rd → {−1, 0, 1}, and let x be such that h(x) = g(x). Then x does not have an adversarial
example of perturbation δ if z^{δ,max} z^{δ,min} > 0.

Proof. The proof is immediate, since z^{δ,max} z^{δ,min} > 0 implies that all points in a δ-neighborhood
of x are classified the same.

To apply Proposition 16.14, we only have to compute z^{δ,max} and z^{δ,min}. It turns out that if Φ is a neural
network, then z^{δ,max}, z^{δ,min} can be approximated by a computation similar to a forward pass of
Φ. Denote by |A| the matrix obtained by taking the absolute value of each entry of the matrix A.
Additionally, we define

    A^+ := (|A| + A)/2    and    A^− := (|A| − A)/2.

The idea behind Algorithm 3 is common in the area of neural network verification, see, e.g.,
[86, 81, 10, 283].
Remark 16.15. Up to constants, Algorithm 3 has the same computational complexity as a forward
pass, see also Algorithm 1. In addition, in contrast to upper bounds based on estimating the global
Lipschitz constant of Φ via its weights, the upper bounds found via Algorithm 3 include the effect of
the activation function σ. For example, if σ is the ReLU, then we may often end up in a situation
where δ^{(ℓ),up} or δ^{(ℓ),low} can have many entries that are 0. If an entry of W^{(ℓ)}x^{(ℓ)} + b^{(ℓ)} is nonpositive,
then it is guaranteed that the associated entry of δ^{(ℓ+1),low} will be zero. Similarly, if W^{(ℓ)} has only
few positive entries, then most of the entries of δ^{(ℓ),up} are not propagated to δ^{(ℓ+1),up}.
Next, we prove that Algorithm 3 indeed produces sensible output.

Proposition 16.16. Let Φ be a neural network with weight matrices W^{(ℓ)} ∈ R^{d_{ℓ+1}×d_ℓ} and bias
vectors b^{(ℓ)} ∈ R^{d_{ℓ+1}} for ℓ = 0, . . . , L, and a monotonically increasing activation function σ.
Let x ∈ Rd. Then the output of Algorithm 3 satisfies

    x^{(L+1)} + δ^{(L+1),up} ≥ z^{δ,max}    and    x^{(L+1)} − δ^{(L+1),low} ≤ z^{δ,min}.

Algorithm 3 Compute Φ(x), z^{δ,max} and z^{δ,min} for a given neural network
Input: weight matrices W^{(ℓ)} ∈ R^{d_{ℓ+1}×d_ℓ} and bias vectors b^{(ℓ)} ∈ R^{d_{ℓ+1}} for ℓ = 0, . . . , L with
d_{L+1} = 1, monotonically increasing activation function σ, input vector x ∈ R^{d_0}, neighborhood size δ > 0
Output: Bounds for z^{δ,max} and z^{δ,min}

x^{(0)} = x
δ^{(0),up} = δ1 ∈ R^{d_0}
δ^{(0),low} = δ1 ∈ R^{d_0}
for ℓ = 0, . . . , L − 1 do
    x^{(ℓ+1)} = σ(W^{(ℓ)}x^{(ℓ)} + b^{(ℓ)})
    δ^{(ℓ+1),up} = σ(W^{(ℓ)}x^{(ℓ)} + (W^{(ℓ)})^+ δ^{(ℓ),up} + (W^{(ℓ)})^− δ^{(ℓ),low} + b^{(ℓ)}) − x^{(ℓ+1)}
    δ^{(ℓ+1),low} = x^{(ℓ+1)} − σ(W^{(ℓ)}x^{(ℓ)} − (W^{(ℓ)})^+ δ^{(ℓ),low} − (W^{(ℓ)})^− δ^{(ℓ),up} + b^{(ℓ)})
end for
x^{(L+1)} = W^{(L)}x^{(L)} + b^{(L)}
δ^{(L+1),up} = (W^{(L)})^+ δ^{(L),up} + (W^{(L)})^− δ^{(L),low}
δ^{(L+1),low} = (W^{(L)})^+ δ^{(L),low} + (W^{(L)})^− δ^{(L),up}
return x^{(L+1)}, x^{(L+1)} + δ^{(L+1),up}, x^{(L+1)} − δ^{(L+1),low}

Proof. Fix y, x ∈ Rd with ∥y − x∥∞ ≤ δ and let y^{(ℓ)}, x^{(ℓ)} for ℓ = 0, . . . , L + 1 be as in Algorithm
3 applied to y, x, respectively. Moreover, let δ^{(ℓ),up}, δ^{(ℓ),low} for ℓ = 0, . . . , L + 1 be as in Algorithm 3
applied to x. We will prove by induction over ℓ = 0, . . . , L + 1 that

    y^{(ℓ)} − x^{(ℓ)} ≤ δ^{(ℓ),up}    and    x^{(ℓ)} − y^{(ℓ)} ≤ δ^{(ℓ),low},    (16.5.6)

where the inequalities are understood entry-wise for vectors. Since y was arbitrary this then proves
the result.
The case ℓ = 0 follows immediately from ∥y − x∥∞ ≤ δ. Assume now that the statement was
shown for ℓ < L. We have that

    y^{(ℓ+1)} − x^{(ℓ+1)} − δ^{(ℓ+1),up} = σ(W^{(ℓ)}y^{(ℓ)} + b^{(ℓ)}) − σ(W^{(ℓ)}x^{(ℓ)} + (W^{(ℓ)})^+ δ^{(ℓ),up} + (W^{(ℓ)})^− δ^{(ℓ),low} + b^{(ℓ)}).

The monotonicity of σ implies that

    y^{(ℓ+1)} − x^{(ℓ+1)} ≤ δ^{(ℓ+1),up}

if

    W^{(ℓ)}y^{(ℓ)} ≤ W^{(ℓ)}x^{(ℓ)} + (W^{(ℓ)})^+ δ^{(ℓ),up} + (W^{(ℓ)})^− δ^{(ℓ),low}.    (16.5.7)

To prove (16.5.7), we observe that

    W^{(ℓ)}(y^{(ℓ)} − x^{(ℓ)}) = (W^{(ℓ)})^+ (y^{(ℓ)} − x^{(ℓ)}) − (W^{(ℓ)})^− (y^{(ℓ)} − x^{(ℓ)})
                              = (W^{(ℓ)})^+ (y^{(ℓ)} − x^{(ℓ)}) + (W^{(ℓ)})^− (x^{(ℓ)} − y^{(ℓ)})
                              ≤ (W^{(ℓ)})^+ δ^{(ℓ),up} + (W^{(ℓ)})^− δ^{(ℓ),low},

where we used the induction assumption in the last line. This shows the first estimate in (16.5.6).
Similarly,

    x^{(ℓ+1)} − y^{(ℓ+1)} − δ^{(ℓ+1),low} = σ(W^{(ℓ)}x^{(ℓ)} − (W^{(ℓ)})^+ δ^{(ℓ),low} − (W^{(ℓ)})^− δ^{(ℓ),up} + b^{(ℓ)}) − σ(W^{(ℓ)}y^{(ℓ)} + b^{(ℓ)}).

Hence, x^{(ℓ+1)} − y^{(ℓ+1)} ≤ δ^{(ℓ+1),low} if

    W^{(ℓ)}y^{(ℓ)} ≥ W^{(ℓ)}x^{(ℓ)} − (W^{(ℓ)})^+ δ^{(ℓ),low} − (W^{(ℓ)})^− δ^{(ℓ),up}.    (16.5.8)

To prove (16.5.8), we observe that

    W^{(ℓ)}(x^{(ℓ)} − y^{(ℓ)}) = (W^{(ℓ)})^+ (x^{(ℓ)} − y^{(ℓ)}) − (W^{(ℓ)})^− (x^{(ℓ)} − y^{(ℓ)})
                              = (W^{(ℓ)})^+ (x^{(ℓ)} − y^{(ℓ)}) + (W^{(ℓ)})^− (y^{(ℓ)} − x^{(ℓ)})
                              ≤ (W^{(ℓ)})^+ δ^{(ℓ),low} + (W^{(ℓ)})^− δ^{(ℓ),up},

where we used the induction assumption in the last line. This completes the proof of (16.5.6) for
all ℓ ≤ L.
The case ℓ = L + 1 follows by the same argument, but replacing σ by the identity.
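A direct transcription of Algorithm 3 into Python reads as follows (a minimal sketch; the ReLU network below is randomly initialized with a single hidden layer, and all numbers are arbitrary). The returned triple corresponds to Φ(x) and the upper and lower bounds of Proposition 16.16.

import numpy as np

def bound_propagation(weights, biases, sigma, x, delta):
    # Sketch of Algorithm 3: propagate the box [x - delta*1, x + delta*1] through the layers.
    x = np.asarray(x, dtype=float)
    d_up = np.full(x.shape, float(delta))
    d_low = np.full(x.shape, float(delta))
    for W, b in zip(weights[:-1], biases[:-1]):
        Wp, Wm = np.maximum(W, 0.0), np.maximum(-W, 0.0)   # W^+ and W^-
        pre = W @ x + b
        x_new = sigma(pre)
        new_up = sigma(pre + Wp @ d_up + Wm @ d_low) - x_new
        new_low = x_new - sigma(pre - Wp @ d_low - Wm @ d_up)
        x, d_up, d_low = x_new, new_up, new_low
    W, b = weights[-1], biases[-1]
    Wp, Wm = np.maximum(W, 0.0), np.maximum(-W, 0.0)
    out = W @ x + b
    return out, out + Wp @ d_up + Wm @ d_low, out - (Wp @ d_low + Wm @ d_up)

# Hypothetical ReLU network with one hidden layer and scalar output.
rng = np.random.default_rng(2)
weights = [rng.standard_normal((16, 4)), rng.standard_normal((1, 16))]
biases = [rng.standard_normal(16), rng.standard_normal(1)]
relu = lambda t: np.maximum(t, 0.0)

val, upper, lower = bound_propagation(weights, biases, relu, rng.standard_normal(4), delta=0.05)
# If lower and upper have the same sign, Proposition 16.14 certifies that the input has
# no adversarial example of sup-norm perturbation 0.05.
print(val, lower, upper)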

Bibliography and further reading


This chapter begins with the foundational paper [264], but it should be remarked that adversarial
examples for non-deep-learning models in machine learning were studied earlier in [123].
The results in this chapter are inspired by various results in the literature, though they may
not be found in precisely the same form. The overall setup is inspired by [264]. The explanation
based on the high-dimensionality of the data given in Section 16.3 was first formulated in [264] and
[95]. The formalism reviewed in Section 16.2 is inspired by [260]. The results on robustness via
local Lipschitz properties are due to [113]. Algorithm 3 is covered by results in the area of network
verifiability [86, 81, 10, 283]. For a more comprehensive overview of modern approaches, we refer
to the survey article [230].
Important directions not discussed in this chapter are the transferability of adversarial ex-
amples, defense mechanisms, and alternative adversarial operations. Transferability refers to the
phenomenon that adversarial examples for one model often also fool other models, [204, 183]. De-
fense mechanisms, i.e., techniques for specifically training a neural network to prevent adversarial
examples, include for example the Fast Gradient Sign Method of [95], and more sophisticated recent
approaches such as [45]. Finally, adversarial examples can be generated not only through additive
perturbations, but also through smooth transformations of images, as demonstrated in [3, 290].

Exercises
Exercise 16.17. Prove (16.3.1) by comparing the volume of the d-dimensional Euclidean unit ball
with the volume of the d-dimensional 1-ball of radius c for a given c > 0.

Exercise 16.18. Fix δ > 0. For a pair of classifiers h and g such that C1 ∪C−1 = ∅ in (16.2.2), there
trivially cannot exist any adversarial examples. Construct an example of h, g, D such that C1 ,
C−1 ̸= ∅, h is not a Bayes classifier, and g is such that no adversarial examples with a perturbation
δ exist.
Is this also possible if g −1 (0) = ∅?

Exercise 16.19. Prove Proposition 16.5.


Hint: Repeat the proof of Theorem 16.4. In the first part set x^{(ext)} = (x, 1), w^{(ext)} = (w, b),
and w̃^{(ext)} = (w̃, b̃). Then show that h(x′) ̸= h(x) by plugging in the definition of x′.

Exercise 16.20. Complete the proof of Theorem 16.12.

Figure 16.2: Illustration of the two types of adversarial examples in Examples 16.6 and 16.7. In
panel A) the feature support Dx corresponds to the dashed line. We depict the two decision
boundaries DBh = {x | w⊤x = 0} of h(x) = sign(w⊤x) and DBg = {x | w̃⊤x = 0} of g(x) =
sign(w̃⊤x). Both h and g perfectly classify every data point in Dx. One data point x is shifted
outside of the support of the distribution in a way that changes its label according to h. This creates
an adversarial example x′ . In panel B) the data distribution is globally supported. However, h
and g do not coincide. Thus the decision boundaries DBh and DBg do not coincide. Moving data
points across DBh can create adversarial examples, as depicted by x and x′ .

Appendix A

Probability theory

This appendix provides some basic notions and results in probability theory required in the main
text. It is intended as a revision for a reader already familiar with these concepts. For more details
and further proofs, we refer for example to the standard textbook [144].

A.1 Sigma-algebras, topologies, and measures


Let Ω be a set, and denote by 2Ω the powerset of Ω.

Definition A.1. A subset A ⊆ 2Ω is called a sigma-algebra on Ω if it satisfies

(i) Ω ∈ A,

(ii) Ac ∈ A whenever A ∈ A,

(iii) ⋃_{i∈N} Ai ∈ A whenever Ai ∈ A for all i ∈ N.

For a sigma-algebra A on Ω, the tuple (Ω, A) is also referred to as a measurable space. For a
measurable space, a subset A ⊆ Ω is called measurable, if A ∈ A. Measurable sets are also called
events.
Another key system of subsets of Ω is that of a topology.

Definition A.2. A subset T ⊆ 2Ω is called a topology on Ω if it satisfies

(i) ∅, Ω ∈ T,

(ii) ⋂_{j=1}^n Oj ∈ T whenever n ∈ N and O1, . . . , On ∈ T,

(iii) ⋃_{i∈I} Oi ∈ T whenever Oi ∈ T for all i in some index set I.
If T is a topology on Ω, we call (Ω, T) a topological space, and a set O ⊆ Ω is called open if and
only if O ∈ T.

Remark A.3. The two notions differ in that a topology allows for unions of arbitrary (possibly un-
countably many) sets, but only for finite intersection, whereas a sigma-algebra allows for countable
unions and intersections.
Example A.4. Let d ∈ N and denote by Bε (x) = {y ∈ Rd | ∥y − x∥ < ε} the set of points
whose Euclidean distance to x is less than ε. Then for every A ⊆ Rd , the smallest topology on A
containing A ∩ Bε (x) for all ε > 0, x ∈ Rd , is called the Euclidean topology on A. ♢
If (Ω, T) is a topological space, then the Borel sigma-algebra refers to the smallest sigma-
algebra on Ω containing all open sets, i.e. all elements of T. Throughout this book, subsets of Rd
are always understood to be equipped with the Euclidean topology and the Borel sigma-algebra.
The Borel sigma-algebra on Rd is denoted by Bd .
We can now introduce measures.

Definition A.5. Let (Ω, A) be a measurable space. A mapping µ : A → [0, ∞] is called a measure
if it satisfies

(i) µ(∅) = 0,

(ii) for every sequence (Ai)_{i∈N} ⊆ A such that Ai ∩ Aj = ∅ whenever i ̸= j, it holds

    µ(⋃_{i∈N} Ai) = ∑_{i∈N} µ(Ai).

We say that the measure µ is finite if µ(Ω) < ∞, and it is sigma-finite if there exists a sequence
(Ai)_{i∈N} ⊆ A such that Ω = ⋃_{i∈N} Ai and µ(Ai) < ∞ for all i ∈ N. In case µ(Ω) = 1, the measure is
called a probability measure.

Example A.6. One can show that there exists a unique measure λ on (Rd, Bd) such that for all
sets of the type ×_{i=1}^d [ai, bi) with −∞ < ai ≤ bi < ∞ it holds

    λ(×_{i=1}^d [ai, bi)) = ∏_{i=1}^d (bi − ai).
This measure is called the Lebesgue measure. ♢
If µ is a measure on the measurable space (Ω, A), then the triplet (Ω, A, µ) is called a measure
space. In case µ is a probability measure, it is called a probability space.
Let (Ω, A, µ) be a measure space. A subset N ⊆ Ω is called a null-set, if N is measurable and
µ(N ) = 0. Moreover, an equality or inequality is said to hold µ-almost everywhere or µ-almost
surely, if it is satisfied on the complement of a null-set. In case µ is clear from context, we simply
write “almost everywhere” or “almost surely” instead. Usually this refers to the Lebesgue measure.

A.2 Random variables


A.2.1 Measurability of functions
To define random variables, we first need to recall the measurability of functions.

Definition A.7. Let (Ω1 , A1 ) and (Ω2 , A2 ) be two measurable spaces. A function f : Ω1 → Ω2 is
called measurable if

f −1 (A2 ) := {ω ∈ Ω1 | f (ω) ∈ A2 } ∈ A1 for all A2 ∈ A2 .

A mapping X : Ω1 → Ω2 is called an Ω2-valued random variable if it is measurable.

Remark A.8. We again point out the parallels to topological spaces: A function f : Ω1 → Ω2
between two topological spaces (Ω1 , T1 ) and (Ω2 , T2 ) is called continuous if f −1 (O2 ) ∈ T1 for all
O2 ∈ T2 .
Let Ω1 be a set and let (Ω2 , A2 ) be a measurable space. For X : Ω1 → Ω2 , we can ask for
the smallest sigma-algebra AX on Ω1 , such that X is measurable as a mapping from (Ω1 , AX ) to
(Ω2 , A2 ). Clearly, for every sigma-algebra A1 on Ω1 , X is measurable as a mapping from (Ω1 , A1 )
to (Ω2 , A2 ) if and only if every A ∈ AX belongs to A1 ; or in other words, AX is a sub sigma-algebra
of A1 . It is easy to check that AX is given through the following definition.

Definition A.9. Let X : Ω1 → Ω2 be a random variable. Then

AX := {X −1 (A2 ) | A2 ∈ A2 } ⊆ 2Ω1

is the sigma-algebra induced by X on Ω1 .

A.2.2 Distribution and expectation


Now let (Ω1 , A1 , P) be a probability space, let (Ω2 , A2 ) be a measurable space, and let X : Ω1 → Ω2 be a random variable. Then X naturally
induces a measure on (Ω2 , A2 ) via

PX [A2 ] := P[X −1 (A2 )] for all A2 ∈ A2 .

Note that due to the measurability of X it holds X −1 (A2 ) ∈ A1 , so that PX is well-defined.

Definition A.10. The measure PX is called the distribution of X. If (Ω2 , A2 ) = (Rd , Bd ), and
there exists a function fX : Rd → R such that
    PX[A] = ∫_A fX(x) dx    for all A ∈ Bd,

then fX is called the (Lebesgue) density of X.

Remark A.11. The term distribution is often used without specifying an underlying probability
space and random variable. In this case, “distribution” stands interchangeably for “probability

measure”. For example, µ is a distribution on Ω2 states that µ is a probability measure on the
measurable space (Ω2 , A2 ). In this case, there always exists a probability space (Ω1 , A1 , P) and a
random variable X : Ω1 → Ω2 such that PX = µ; namely (Ω1 , A1 , P) = (Ω2 , A2 , µ) and X(ω) = ω.

Example A.12. Some important distributions include the following.

• Bernoulli distribution: A random variable X : Ω → {0, 1} is Bernoulli distributed if there


exists p ∈ [0, 1] such that P[X = 1] = p and P[X = 0] = 1 − p.

• Uniform distribution: A random variable X : Ω → Rd is uniformly distributed on a


measurable set A ∈ Bd , if its density equals
    fX(x) = (1/|A|) 1A(x),

where |A| < ∞ is the Lebesgue measure of A.

• Gaussian distribution: A random variable X : Ω → Rd is Gaussian distributed with mean
m ∈ Rd and regular covariance matrix C ∈ Rd×d, if its density equals

    fX(x) = (1 / ((2π)^d det(C))^{1/2}) exp( −(1/2)(x − m)⊤C^{−1}(x − m) ).

We denote this distribution by N(m, C).
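As a quick numerical sanity check of this density formula, one can compare a direct evaluation with a library implementation (the sketch below assumes that SciPy is available; the mean, covariance, and evaluation point are arbitrary).

import numpy as np
from scipy.stats import multivariate_normal

d = 3
rng = np.random.default_rng(0)
m = rng.standard_normal(d)
A = rng.standard_normal((d, d))
C = A @ A.T + np.eye(d)          # a regular (positive definite) covariance matrix
x = rng.standard_normal(d)

# Density of N(m, C) at x, following the formula above.
quad = (x - m) @ np.linalg.solve(C, x - m)
f = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(C))

print(f, multivariate_normal(mean=m, cov=C).pdf(x))   # the two values agree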

Let (Ω, A, P) be a probability space, let X : Ω → Rd be an Rd -valued random variable. We then


call the Lebesgue integral
    E[X] := ∫_Ω X(ω) dP(ω) = ∫_{Rd} x dPX(x)    (A.2.1)

the expectation of X. Moreover, for k ∈ N we say that X has finite k-th moment if E[∥X∥k ] <
∞. Similarly, for a probability measure µ on Rd and k ∈ N, we say that µ has finite k-th moment
if

    ∫_{Rd} ∥x∥^k dµ(x) < ∞.
Furthermore, the matrix
    ∫_Ω (X(ω) − E[X])(X(ω) − E[X])⊤ dP(ω) ∈ Rd×d

is the covariance of X : Ω → Rd . For d = 1, it is called the variance of X and denoted by V[X].


Finally, we recall different variants of convergence for random variables.

Definition A.13. Let (Ω, A, P) be a probability space, let Xj : Ω → Rd , j ∈ N, be a sequence of


random variables, and let X : Ω → Rd also be a random variable. The sequence is said to

(i) converge almost surely to X, if

    P[{ω ∈ Ω | lim_{j→∞} Xj(ω) = X(ω)}] = 1,

(ii) converge in probability to X, if

    for all ε > 0 :  lim_{j→∞} P[{ω ∈ Ω | |Xj(ω) − X(ω)| > ε}] = 0,

(iii) converge in distribution to X, if for all bounded continuous functions f : Rd → R

    lim_{j→∞} E[f ◦ Xj] = E[f ◦ X].

The notions in Definition A.13 are ordered by decreasing strength, i.e., almost sure convergence
implies convergence in probability, and convergence in probability implies convergence in
distribution, see for example [144, Chapter 13]. Since E[f ◦ X] = ∫_{Rd} f(x) dPX(x), the notion of
convergence in distribution only depends on the distribution PX of X. We thus also say that a
sequence of random variables converges in distribution towards a measure µ.

A.3 Conditionals, marginals, and independence


In this section, we concentrate on Rd -valued random variables, although the following concepts can
be extended to more general spaces.

A.3.1 Joint and marginal distribution


Let again (Ω, A, P) be a probability space, and let X : Ω → RdX , Y : Ω → RdY be two random
variables. Then
Z := (X, Y ) : Ω → RdX +dY
is also a random variable. Its distribution PZ is a measure on the measurable space (RdX +dY , BdX +dY ),
and PZ is referred to as the joint distribution of X and Y . On the other hand, PX , PY are called
the marginal distributions of X, Y . Note that

PX [A] = PZ [A × RdY ] for all A ∈ BdX ,

and similarly for PY . Thus the marginals PX , PY , can be constructed from the joint distribution
PZ . In turn, knowledge of the marginals is not sufficient to construct the joint distribution.

A.3.2 Independence
The concept of independence serves to formalize the situation, where knowledge of one random
variable provides no information about another random variable. We first give the formal definition,
and afterwards discuss the roll of a die as a simple example.

Definition A.14. Let (Ω, A, P) be a probability space. Then two events A, B ∈ A are called
independent if
P[A ∩ B] = P[A]P[B].
Two random variables X : Ω → RdX and Y : Ω → RdY are called independent, if

A, B are independent for all A ∈ AX , B ∈ AY .

Two random variables are thus independent, if and only if all events in their induced sigma-
algebras are independent. This turns out to be equivalent to the joint distribution P(X,Y ) being
equal to the product measure PX ⊗ PY ; the latter is characterized as the unique measure µ on
RdX +dY satisfying µ(A × B) = PX [A]PY [B] for all A ∈ Bdx , B ∈ BdY .
Example A.15. Let Ω = {1, . . . , 6} represent the outcomes of rolling a fair die, let A = 2Ω be the
sigma-algebra, and let P[ω] = 1/6 for all ω ∈ Ω. Consider the three random variables

    X1(ω) = 0 if ω is odd, and X1(ω) = 1 if ω is even;
    X2(ω) = 0 if ω ≤ 3, and X2(ω) = 1 if ω ≥ 4;
    X3(ω) = 0 if ω ∈ {1, 2}, X3(ω) = 1 if ω ∈ {3, 4}, and X3(ω) = 2 if ω ∈ {5, 6}.

These random variables can be interpreted as follows:


• X1 indicates whether the roll yields an odd or even number.
• X2 indicates whether the roll yields a number at most 3 or at least 4.
• X3 categorizes the roll into one of the groups {1, 2}, {3, 4} or {5, 6}.
The induced sigma-algebras are
AX1 = {∅, Ω, {1, 3, 5}, {2, 4, 6}}
AX2 = {∅, Ω, {1, 2, 3}, {4, 5, 6}}
AX3 = {∅, Ω, {1, 2}, {3, 4}, {5, 6}, {1, 2, 3, 4}, {1, 2, 5, 6}, {3, 4, 5, 6}}.
We leave it to the reader to formally check that X1 and X2 are not independent, but X1 and X3
are independent. This reflects the fact that, for example, knowing the outcome to be odd, makes
it more likely that the number belongs to {1, 2, 3} rather than {4, 5, 6}. However, this knowledge
provides no information on the three categories {1, 2}, {3, 4}, and {5, 6}. ♢
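Since Ω is finite, the independence claims of Example A.15 can be verified mechanically: it suffices to check the product rule P[A ∩ B] = P[A]P[B] on the events {Xi = v}, which generate the induced sigma-algebras. A small sketch:

from itertools import product
from fractions import Fraction

omega = range(1, 7)
P = lambda A: Fraction(len(A), 6)          # uniform probability on {1, ..., 6}

X1 = lambda w: 0 if w % 2 == 1 else 1
X2 = lambda w: 0 if w <= 3 else 1
X3 = lambda w: (w - 1) // 2

def preimages(X):
    # the events {X = v} generate the sigma-algebra induced by X on the finite space
    return [frozenset(w for w in omega if X(w) == v) for v in set(map(X, omega))]

def independent(X, Y):
    # X, Y independent iff P[A ∩ B] = P[A] P[B] for all generating events A, B
    return all(P(A & B) == P(A) * P(B) for A, B in product(preimages(X), preimages(Y)))

print(independent(X1, X2))   # False
print(independent(X1, X3))   # True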
If X : Ω → R, Y : Ω → R are two independent random variables, then, due to P(X,Y) = PX ⊗ PY,

    E[XY] = ∫_Ω X(ω)Y(ω) dP(ω) = ∫_{R²} xy dP(X,Y)(x, y) = ∫_R x dPX(x) ∫_R y dPY(y) = E[X]E[Y].

Using this observation, it is easy to see that for a sequence of independent R-valued random variables
(Xi)_{i=1}^n with bounded second moments, Bienaymé's identity holds:

    V[∑_{i=1}^n Xi] = ∑_{i=1}^n V[Xi].    (A.3.1)

A.3.3 Conditional distributions


Let (Ω, A, P) be a probability space, and let A, B ∈ A be two events. In case P[B] > 0, we define
    P[A|B] := P[A ∩ B] / P[B],    (A.3.2)
and call P[A|B] the conditional probability of A given B.
Example A.16. Consider the setting of Example A.15. Let A = {ω ∈ Ω | X1 (ω) = 0} be the event
that the outcome of the die roll was an odd number and let B = {ω ∈ Ω | X2 (ω) = 0} be the event
that the outcome yielded a number at most 3. Then P[B] = 1/2, and P[A ∩ B] = 1/3. Thus
    P[A|B] = P[A ∩ B] / P[B] = (1/3) / (1/2) = 2/3.
This reflects that, given we know the outcome to be at most 3, the probability of the number being
odd, i.e. in {1, 3}, is larger than the probability of the number being even, i.e. equal to 2. ♢
The conditional probability in (A.3.2) is only well-defined if P[B] > 0. In practice, we often
encounter the case where we would like to condition on an event of probability zero.
Example A.17. Consider the following procedure: We first draw a random number p ∈ [0, 1]
according to a uniform distribution on [0, 1]. Afterwards we draw a random number X ∈ {0, 1}
according to a p-Bernoulli distribution, i.e. P[X = 1] = p and P[X = 0] = 1 − p. Then (p, X) is
a joint random variable taking values in [0, 1] × {0, 1}. What is P[X = 1|p = 0.5] in this case?
Intuitively, it should be 1/2, but note that P[p = 0.5] = 0, so that (A.3.2) is not meaningful here.
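One way to make this intuition quantitative is to condition on the positive-probability event {|p − 0.5| < h} and let h shrink. The following Monte Carlo sketch (purely illustrative; the sample sizes are arbitrary) supports the value 1/2.

import numpy as np

rng = np.random.default_rng(0)
n = 10**6
p = rng.uniform(0.0, 1.0, size=n)          # p ~ Uniform([0, 1])
X = rng.uniform(0.0, 1.0, size=n) < p      # X | p ~ Bernoulli(p)

for h in [0.1, 0.01, 0.001]:
    near = np.abs(p - 0.5) < h
    print(h, X[near].mean())               # approaches 1/2 as h -> 0 (for large n)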

Definition A.18 (regular conditional distribution). Let (Ω, A, P) be a probability space, and let
X : Ω → RdX and Y : Ω → RdY be two random variables. Let τX|Y : BdX × RdY → [0, 1] satisfy
(i) y 7→ τX|Y (A, y) : RdY → [0, 1] is measurable for every fixed A ∈ BdX ,

(ii) A 7→ τX|Y (A, y) is a probability measure on (RdX , BdX ) for every y ∈ Y (Ω),
(iii) for all A ∈ BdX and all B ∈ BdY holds
    P[X ∈ A, Y ∈ B] = ∫_B τX|Y(A, y) dPY(y).

Then τ is called a regular (version of the) conditional distribution of X given Y . In this


case, we denote
P[X ∈ A|Y = y] := τX|Y (A, y),
and refer to this measure as the conditional distribution of X|Y = y.

Definition A.18 provides a mathematically rigorous way of assigning a distribution to a random
variable conditioned on an event that may have probability zero, as in Example A.17. Existence
and uniqueness of these conditional distributions hold in the following sense, see for example [144,
Chapter 8] or [238, Chapter 3] for the specific statement given here.

Theorem A.19. Let (Ω, A, P) be a probability space, and let X : Ω → RdX , Y : Ω → RdY be two
random variables. Then there exists a regular version of the conditional distribution τ1 .
Let τ2 be another regular version of the conditional distribution. Then there exists a PY -null
set N ⊆ RdY , such that for all y ∈ N c ∩ Y (Ω), the two probability measures τ1 (·, y) and τ2 (·, y)
coincide.

In particular, conditional distributions are only well-defined in a PY -almost everywhere sense.

Definition A.20. Let (Ω, A, P) be a probability space, and let X : Ω → RdX , Y : Ω → RdY ,
Z : Ω → RdZ be three random variables. We say that X and Z are conditionally independent
given Y , if the two distributions X|Y = y and Z|Y = y are independent for PY -almost every
y ∈ Y (Ω).

A.4 Concentration inequalities


Let Xi : Ω → R, i ∈ N, be a sequence of random variables with finite first moments. The centered
average over the first n terms
n
1X
Sn := (Xi − E[Xi ]) (A.4.1)
n
i=1
is another random variable, and by linearity of the expectation it holds E[Sn ] = 0. The sequence
is said to satisfy the strong law of large numbers if
h i
P lim sup |Sn | = 0 = 1.
n→∞

This is for example the case if there exists C < ∞ such that V[Xi ] ≤ C for all i ∈ N. Concentration
inequalities provide bounds on the rate of this convergence.
We start with Markov’s inequality.

Lemma A.21 (Markov’s inequality). Let X : Ω → R be a random variable, and let φ : [0, ∞) →
[0, ∞) be monotonically increasing. Then for all ε > 0

    P[|X| ≥ ε] ≤ E[φ(|X|)] / φ(ε).

Proof. We have
    P[|X| ≥ ε] = ∫_{X^{−1}([ε,∞))} 1 dP(ω) ≤ ∫_Ω φ(|X(ω)|)/φ(ε) dP(ω) = E[φ(|X|)]/φ(ε),

which gives the claim.

Applying Markov’s inequality with φ(x) := x2 to the random variable X − E[X] directly gives
Chebyshev’s inequality.

Lemma A.22 (Chebyshev’s inequality). Let X : Ω → R be a random variable with finite variance.
Then for all ε > 0
    P[|X − E[X]| ≥ ε] ≤ V[X] / ε².

From Chebyshev’s inequality we obtain the next result, which is a quite general concentration
inequality for random variables with finite variances.

Theorem A.23. Let X1, . . . , Xn be n ∈ N independent real-valued random variables such that for
some ς > 0 it holds that E[|Xi − µ|²] ≤ ς² for all i = 1, . . . , n, where

    µ := E[(1/n) ∑_{j=1}^n Xj].    (A.4.2)

Then for all ε > 0

    P[ |(1/n) ∑_{j=1}^n Xj − µ| ≥ ε ] ≤ ς² / (ε²n).

Proof. Let Sn = (1/n) ∑_{j=1}^n (Xj − E[Xj]) = (1/n) ∑_{j=1}^n Xj − µ. By Bienaymé's identity (A.3.1),
it holds that

    V[Sn] = (1/n²) ∑_{j=1}^n E[(Xj − E[Xj])²] ≤ ς²/n.

Since E[Sn] = 0, Chebyshev's inequality applied to Sn gives the statement.

If we have additional information about the random variables, then we can derive sharper
bounds. In case of uniformly bounded random variables (rather than just bounded variance),
Hoeffding’s inequality, which we recall next, shows an exponential rate of concentration around the
mean.

Theorem A.24 (Hoeffding’s inequality). Let a, b ∈ R. Let X1 , . . . , Xn be n ∈ N independent
real-valued random variables such that a ≤ Xi ≤ b almost surely for all i = 1, . . . , n, and let µ be
as in (A.4.2). Then, for every ε > 0
 
    P[ |(1/n) ∑_{j=1}^n Xj − µ| > ε ] ≤ 2 exp(−2nε² / (b − a)²).

A proof can, for example, be found in [250, Section B.4], where this version is also taken from.
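The exponential rate in Hoeffding's inequality can be contrasted with the polynomial rate of Theorem A.23 in a small simulation (a minimal sketch with Bernoulli(1/2) variables, so a = 0, b = 1 and the variance is 1/4; the sample sizes are arbitrary).

import numpy as np

rng = np.random.default_rng(0)
n, trials, eps = 1000, 10_000, 0.1
X = rng.integers(0, 2, size=(trials, n), dtype=np.int8)   # i.i.d. Bernoulli(1/2): mu = 1/2
deviation = np.abs(X.mean(axis=1) - 0.5)

empirical = (deviation > eps).mean()     # empirical frequency of a large deviation
hoeffding = 2 * np.exp(-2 * n * eps**2)  # Hoeffding bound, roughly 4e-9 here
chebyshev = 0.25 / (eps**2 * n)          # bound of Theorem A.23 with varsigma^2 = 1/4
print(empirical, hoeffding, chebyshev)   # Hoeffding is exponentially sharper for large n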
Finally, we recall the central limit theorem, in its multivariate formulation. We say that (Xj )j∈N
is an i.i.d. sequence of random variables, if the random variables are (pairwise) independent
and identically distributed. For a proof see [144, Theorem 15.58].

Theorem A.25 (Multivariate central limit theorem). Let (X n )n∈N be an i.i.d. sequence of Rd -
valued random variables, such that E[X n ] = 0 ∈ Rd and E[Xn,i Xn,j ] = Cij for all i, j = 1, . . . , d.
Let
    Y_n := (X_1 + · · · + X_n) / √n ∈ Rd.
Then Y n converges in distribution to N(0, C) as n → ∞.
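The statement can be illustrated numerically: summing n i.i.d. centered vectors with covariance C and rescaling by √n produces samples whose empirical covariance is close to C and whose coordinates look approximately Gaussian. A minimal sketch (the choice of C and of the underlying uniform distribution is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 2, 500, 5000
C = np.array([[1.0, 0.3], [0.3, 0.5]])
L = np.linalg.cholesky(C)

# i.i.d. centered vectors with covariance C: uniform coordinates with unit variance, mixed by L.
U = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(trials, n, d))
X = U @ L.T
Y = X.sum(axis=1) / np.sqrt(n)

print(np.cov(Y.T))   # approximately C; histograms of the coordinates of Y look Gaussian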

Appendix B

Linear algebra and functional analysis

This appendix provides some basic notions and results in linear algebra and functional analysis
required in the main text. It is intended as a revision for a reader already familiar with these
concepts. For more details and further proofs, we refer for example to the standard textbooks
[29, 232, 233, 55, 99].

B.1 Singular value decomposition and pseudoinverse


Let A ∈ Rm×n, m, n ∈ N. Then the square roots of the positive eigenvalues of A⊤A (or equivalently
of AA⊤) are referred to as the singular values of A. We denote them in the following by
s1 ≥ s2 ≥ · · · ≥ sr > 0, where r := rank(A), so that r ≤ min{m, n}. Every matrix allows for a
singular value decomposition (SVD) as stated in the next theorem, e.g. [29, Theorem 1.2.1].
Recall that a matrix V ∈ Rn×n is called orthogonal, if V ⊤ V is the identity.

Theorem B.1 (Singular value decomposition). Let A ∈ Rm×n . Then there exist orthogonal ma-
trices U ∈ Rm×m , V ∈ Rn×n such that with
 
    Σ := [ diag(s1, . . . , sr)   0
           0                     0 ]  ∈ Rm×n

it holds that A = U Σ V ⊤ , where 0 stands for a zero block of suitable size.

Given y ∈ Rm , consider the linear system

Aw = y. (B.1.1)

If A is not a regular square matrix, then in general there need not be a unique solution w ∈ Rn to
(B.1.1). However, there exists a unique minimal norm solution

    w∗ = argmin_{w∈M} ∥w∥,    M = {w ∈ Rn | ∥Aw − y∥ ≤ ∥Av − y∥ ∀v ∈ Rn}.    (B.1.2)

The minimal norm solution can be expressed via the Moore-Penrose pseudoinverse A† ∈ Rn×m
of A; given an (arbitrary) SVD A = U ΣV ⊤ , it is defined as
 −1 
s1
 .. 
A† := V Σ † U ⊤ where Σ † := 
 . 0
 ∈ Rn×m . (B.1.3)
 −1
sr 
0 0
The following theorem makes this precise, e.g., [29, Theorem 1.2.10].

Theorem B.2. Let A ∈ Rm×n . Then there exists a unique minimum norm solution w∗ ∈ Rn in
(B.1.2) and it holds w∗ = A† y.

Proof. Denote by Σr ∈ Rr×r the upper left quadrant of Σ. Since U ∈ Rm×m is orthogonal,

    ∥Aw − y∥ = ∥ [ Σr 0 ; 0 0 ] V⊤w − U⊤y ∥.

We can thus write M in (B.1.2) as

    M = { w ∈ Rn | (Σr  0) V⊤w = (U⊤y)_{i=1}^r }
      = { w ∈ Rn | (V⊤w)_{i=1}^r = Σr^{−1} (U⊤y)_{i=1}^r }
      = { Vz | z ∈ Rn, (z)_{i=1}^r = Σr^{−1} (U⊤y)_{i=1}^r },

where (a)_{i=1}^r denotes the first r entries of a vector a, and for the last equality we used orthogonality
of V ∈ Rn×n. Since ∥Vz∥ = ∥z∥, the unique minimal norm solution is obtained by setting
components r + 1, . . . , n of z to zero, which yields

    w∗ = V [ Σr^{−1}(U⊤y)_{i=1}^r ; 0 ] = V Σ† U⊤ y = A† y

as claimed.
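In numerical practice, the minimal norm solution of Theorem B.2 is exactly what SVD-based least squares routines return. A short sketch (with an arbitrary rank-deficient matrix) comparing numpy.linalg.pinv with numpy.linalg.lstsq:

import numpy as np

rng = np.random.default_rng(0)
# A rank-deficient 5 x 4 matrix: its fourth column repeats the first.
B = rng.standard_normal((5, 3))
A = np.hstack([B, B[:, :1]])
y = rng.standard_normal(5)

w_pinv = np.linalg.pinv(A) @ y                   # w* = A^dagger y
w_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)  # lstsq also returns the minimal norm least squares solution

print(np.allclose(w_pinv, w_lstsq))              # True
# Adding the kernel element (1, 0, 0, -1) gives another least squares solution, but with larger norm.
print(np.linalg.norm(w_pinv), np.linalg.norm(w_pinv + np.array([1.0, 0.0, 0.0, -1.0])))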

B.2 Vector spaces

Definition B.3. Let K ∈ {R, C}. A vector space (over K) is a set X such that the following
holds:

(i) Properties of addition: For every x, y ∈ X there exists x + y ∈ X such that for all z ∈ X

x+y =y+x and x + (y + z) = (x + y) + z.

Moreover, there exists a unique element 0 ∈ X such that x + 0 = x for all x ∈ X and for each
x ∈ X there exists a unique −x ∈ X such that x + (−x) = 0.

(ii) Properties of scalar multiplication: There exists a map (α, x) 7→ αx from K × X to X called
scalar multiplication. It satisfies 1x = x and (αβ)x = α(βx) for all x ∈ X.

We call the elements of a vector space vectors.

If the field is clear from context, we simply refer to X as a vector space. We will primarily consider
the case K = R, and in this case we also say that X is a real vector space.
To introduce a notion of convergence on a vector space X, it needs to be equipped with a
topology, see Definition A.2. A topological vector space is a vector space which is also a
topological space, and in which addition and scalar multiplication are continuous maps. We next
discuss the most important instances of topological vector spaces.

B.2.1 Metric spaces


An important class of topological vector spaces consists of vector spaces that are also metric spaces.

Definition B.4. For a set X, we call a map dX : X × X → [0, ∞) a metric, if

(i) dX (x, y) = 0 if and only if x = y,

(ii) dX (x, y) = dX (y, x) for all x, y ∈ X,

(iii) dX (x, z) ≤ dX (x, y) + dX (y, z) for all x, y, z ∈ X.

We call (X, dX ) a metric space.

In a metric space (X, dX ), we denote the open ball with center x and radius r > 0 by

Br (x) := {y ∈ X | dX (x, y) < r}. (B.2.1)

Every metric space is naturally equipped with a topology: A set A ⊆ X is open if and only if for
every x ∈ A exists ε > 0 such that Bε (x) ⊆ A. Therefore every metric vector space is a topological
vector space.

Definition B.5. A metric space (X, dX ) is called complete, if every Cauchy sequence with respect
to d converges to an element in X.

For complete metric spaces, an immensely powerful tool is Baire’s category theorem. To state
it, we require the notion of density of sets. Let A, B ⊆ X for a topological space X. Then A is
dense in B if the closure of A, denoted by Ā, satisfies Ā ⊇ B.

Theorem B.6 (Baire’s category theorem). Let X be a complete metric space. Then the intersection
of every countable collection of dense open subsets of X is dense in X.

Theorem B.6 implies that if X = ⋃_{i=1}^∞ Vi for a sequence of sets Vi, then at least one of the Vi
has to contain an open set. Indeed, assuming all Vi's have empty interior implies that Vi^c = X \ Vi
is dense for all i ∈ N. By De Morgan's laws, it then holds that ∅ = ⋂_{i=1}^∞ Vi^c, which contradicts
Theorem B.6.

B.2.2 Normed spaces


A norm is a way of assigning a length to a vector. A normed space is a vector space with a norm.

Definition B.7. Let X be a vector space over a field K ∈ {R, C}. A map ∥ · ∥X : X → [0, ∞) is
called a norm if the following hold for all x, y ∈ X and all α ∈ K:

(i) triangle inequality: ∥x + y∥X ≤ ∥x∥X + ∥y∥X ,

(ii) absolute homogeneity: ∥αx∥X = |α|∥x∥X ,

(iii) positive definiteness: ∥x∥X = 0 if and only if x = 0.

We call (X, ∥ · ∥X ) a normed space and omit ∥ · ∥X from the notation if it is clear from the
context.

Every norm induces a metric dX and hence a topology via dX (x, y) := ∥x − y∥X . In particular,
every normed vector space is a topological vector space with respect to this topology.

B.2.3 Banach spaces

Definition B.8. A normed vector space is called a Banach space if and only if it is complete.

Before presenting the main results on Banach spaces, we collect a couple of important examples.
• Euclidean spaces: Let d ∈ N. Then (Rd , ∥ · ∥) is a Banach space.
• Continuous functions: Let d ∈ N and let K ⊆ Rd be compact. The set of continuous functions
from K to R is denoted by C(K). For α, β ∈ R and f , g ∈ C(K), we define addition and
scalar multiplication by (αf + βg)(x) = αf (x) + βg(x) for all x ∈ K. The vector space C(K)
equipped with the supremum norm
    ∥f∥∞ := sup_{x∈K} |f(x)|,

is a Banach space.

• Lebesgue spaces: Let (Ω, A, µ) be a measure space and let 1 ≤ p < ∞. Then the Lebesgue
space Lp (Ω, µ) is defined as the vector space of all equivalence classes of measurable functions
f : Ω → R that coincide µ-almost everywhere and satisfy
    ∥f∥_{Lp(Ω,µ)} := ( ∫_Ω |f(x)|^p dµ(x) )^{1/p} < ∞.    (B.2.2)

The integral is independent of the choice of representative of the equivalence class of f .


Addition and scalar multiplication are defined pointwise as for C(K). It then holds that
Lp (Ω, µ) is a Banach space. If Ω is a measurable subset of Rd for d ∈ N, and µ is the
Lebesgue measure, we typically omit µ from the notation and simply write Lp (Ω). If Ω = N
and the measure is the counting measure, we denote these spaces by ℓp (N) or simply ℓp .
The definition can be extended to complex or Rd -valued functions. In the latter case the
integrand in (B.2.2) is replaced by ∥f (x)∥p . We denote these spaces again by Lp (Ω, µ) with
the precise meaning being clear from context.
• Essentially bounded functions: Let (Ω, A, µ) be a measure space. The Lp spaces can be
extended to p = ∞ by defining the L∞ -norm
    ∥f∥_{L∞(Ω,µ)} := inf{C ≥ 0 | µ({|f| > C}) = 0}.
This is indeed a norm on the space of equivalence classes of measurable functions from Ω → R
that coincide µ-almost everywhere. Moreover, with this norm, L∞ (Ω, µ) is a Banach space. If
Ω = N and µ is the counting measure, we denote the resulting space by ℓ∞ (N) or simply ℓ∞ .
As in the case p < ∞, it is straightforward to extend the definition to complex or Rd -valued
functions, for which the same notation will be used.
We continue by introducing the concept of dual spaces.

Definition B.9. Let (X, ∥ · ∥X ) be a normed vector space over K ∈ {R, C}. Linear maps from
X → K are called linear functionals. The vector space of all continuous linear functionals on X
is called the (topological) dual space of X and is denoted by X ′ .

Together with the natural addition and scalar multiplication (for all h, g ∈ X ′ , α ∈ K and
x ∈ X)
(h + g)(x) := h(x) + g(x) and (αh)(x) := α(h(x)),
X ′ is a vector space. We equip X ′ with the norm
    ∥f∥_{X′} := sup_{x∈X, ∥x∥_X=1} |f(x)|.

The space (X ′ , ∥ · ∥X ′ ) is always a Banach space, even if (X, ∥ · ∥X ) is not complete [233, Theorem
4.1].
The dual space can often be used to characterize the original Banach space. One way in which
the dual space X ′ captures certain algebraic and geometric properties of the Banach space X is
through the Hahn-Banach theorem. In this book, we use one specific variant of this theorem and
its implication for the existence of dual bases, see for instance [233, Theorem 3.5].

Theorem B.10 (Geometric Hahn-Banach, subspace version). Let M be a subspace of a Banach
space X and let x0 ∈ X. If x0 is not in the closure of M , then there exists f ∈ X ′ such that
f (x0 ) = 1 and f (x) = 0 for every x ∈ M .

An immediate consequence of Theorem B.10 that will be used throughout this book is the
existence of a dual basis. Let X be a Banach space and let (xi )i∈N ⊆ X be such that for all i ∈ N

xi ̸∈ span{xj | j ∈ N, j ̸= i}.

Then, for every i ∈ N, there exists fi ∈ X ′ such that fi (xj ) = 0 if i ̸= j and fi (xi ) = 1.

B.2.4 Hilbert spaces


Often, we require more structure than that provided by normed spaces. An inner product offers
additional tools to compare vectors by introducing notions of angle and orthogonality. For simplicity
we restrict ourselves to real vector spaces in the following.

Definition B.11. Let X be a real vector space. A map ⟨·, ·⟩X : X × X → R is called an inner
product on X if the following hold for all x, y, z ∈ X and all α, β ∈ R:

(i) linearity: ⟨αx + βy, z⟩X = α⟨x, z⟩X + β⟨y, z⟩X ,

(ii) symmetry: ⟨x, y⟩X = ⟨y, x⟩X ,

(iii) positive definiteness: ⟨x, x⟩X > 0 for all x ̸= 0.

Example B.12. For p = 2, the Lebesgue spaces L2 (Ω) and ℓ2 (N) are Hilbert spaces with inner
products
    ⟨f, g⟩_{L2(Ω)} = ∫_Ω f(x)g(x) dx    for all f, g ∈ L2(Ω),

and

    ⟨x, y⟩_{ℓ2(N)} = ∑_{j∈N} xj yj    for all x = (xj)_{j∈N}, y = (yj)_{j∈N} ∈ ℓ2(N).

On inner product spaces the so-called Cauchy-Schwarz inequality holds.

Theorem B.13 (Cauchy-Schwarz inequality). Let X be a vector space with inner product ⟨·, ·⟩X .
Then it holds for all x, y ∈ X
    |⟨x, y⟩X| ≤ √(⟨x, x⟩X ⟨y, y⟩X).

Moreover, equality holds if and only if x and y are linearly dependent.

Proof. Let x, y ∈ X. If y = 0 then ⟨x, y⟩X = 0 and thus the statement is trivial. Assume in the
following y ̸= 0, so that ⟨y, y⟩X > 0. Using the linearity and symmetry properties it holds for all
α∈R
0 ≤ ⟨x − αy, x − αy⟩X = ⟨x, x⟩X − 2α ⟨x, y⟩X + α2 ⟨y, y⟩X .
Letting α := ⟨x, y⟩X / ⟨y, y⟩X we get

    0 ≤ ⟨x, x⟩X − 2⟨x, y⟩X²/⟨y, y⟩X + ⟨x, y⟩X²/⟨y, y⟩X = ⟨x, x⟩X − ⟨x, y⟩X²/⟨y, y⟩X.
Rearranging terms gives the claim.

Every inner product ⟨·, ·⟩X induces a norm via


    ∥x∥X := √⟨x, x⟩X    for all x ∈ X.    (B.2.3)

The properties of the inner product immediately yield the polar identity

∥x + y∥2X = ∥x∥2X + 2⟨x, y⟩X + ∥y∥2X . (B.2.4)

The fact that (B.2.3) indeed defines a norm follows by an application of the Cauchy-Schwarz
inequality to (B.2.4), which yields that ∥ · ∥X satisfies the triangle inequality. This gives rise to the
definition of a Hilbert space.

Definition B.14. Let H be a real vector space with inner product ⟨·, ·⟩H . Then (H, ⟨·, ·⟩H ) is
called a Hilbert space if and only if H is complete with respect to the norm ∥ · ∥H induced by
the inner product.

A standard example of a Hilbert space is L2 : Let (Ω, A, µ) be a measure space. Then


    ⟨f, g⟩_{L2(Ω,µ)} = ∫_Ω f(x)g(x) dµ(x)    for all f, g ∈ L2(Ω, µ),

defines an inner product on L2 (Ω, µ) compatible with the L2 (Ω, µ)-norm.


In a Hilbert space, we can compare vectors not only via their distance, measured by the norm,
but also by using the inner product, which corresponds to their relative orientation. This leads to
the concept of orthogonality.

Definition B.15. Let (H, ⟨·, ·⟩H ) be a Hilbert space and let f , g ∈ H. We say that f and g are
orthogonal if ⟨f, g⟩H = 0, denoted by f ⊥ g. For F , G ⊆ H we write F ⊥ G if f ⊥ g for all
f ∈ F , g ∈ G. Finally, for F ⊆ H, the set F ⊥ = {g ∈ H | g ⊥ f ∀f ∈ F } is called the orthogonal
complement of F in H.

For orthogonal vectors, the polar identity immediately implies the Pythagorean theorem.

Theorem B.16 (Pythagorean theorem). Let (H, ⟨·, ·⟩H ) be a Hilbert space, n ∈ N, and let
f1 , . . . , fn ∈ H be pairwise orthogonal vectors. Then,
    ∥ ∑_{i=1}^n fi ∥²_H = ∑_{i=1}^n ∥fi∥²_H.

A final property of Hilbert spaces that we encounter in this book is the existence of unique
projections onto convex sets. For a proof, see for instance [232, Thm. 4.10].

Theorem B.17. Let (H, ⟨·, ·⟩H ) be a Hilbert space and let K ̸= ∅ be a closed convex subset of H.
Then for all h ∈ H exists a unique k0 ∈ K such that

∥h − k0 ∥H = inf{∥h − k∥H | k ∈ K}.

Bibliography

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis,


J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia,
R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Mur-
ray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Van-
houcke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu,
and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
Software available from tensorflow.org.
[2] M. A. Aizerman, E. A. Braverman, and L. Rozonoer. Theoretical foundations of the potential
function method in pattern recognition learning. Automation and Remote Control, 25:821–837,
1964.
[3] R. Alaifari, G. S. Alberti, and T. Gauksson. Adef: an iterative algorithm to construct
adversarial deformations. arXiv preprint arXiv:1804.07729, 2018.
[4] Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural
networks, going beyond two layers. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-
Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems,
volume 32. Curran Associates, Inc., 2019.
[5] S.-i. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–
276, 02 1998.
[6] M. Anthony and P. L. Bartlett. Neural network learning: theoretical foundations. Cambridge
University Press, Cambridge, 1999.
[7] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee. Understanding deep neural networks with
rectified linear units. In International Conference on Learning Representations, 2018.
[8] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang. On exact computation
with an infinitely wide neural net. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-
Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems,
volume 32. Curran Associates, Inc., 2019.
[9] S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger generalization bounds for deep
nets via a compression approach. In International Conference on Machine Learning, pages
254–263. PMLR, 2018.
[10] M. Baader, M. Mirman, and M. Vechev. Universal approximation with certified networks.
arXiv preprint arXiv:1909.13846, 2019.

[11] S. Barocas, M. Hardt, and A. Narayanan. Fairness and Machine Learning. fairmlbook.org,
2019. http://www.fairmlbook.org.

[12] A. R. Barron. Neural net approximation. In Proc. 7th Yale workshop on adaptive and learning
systems, volume 1, pages 69–72, 1992.

[13] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function.


IEEE Trans. Inform. Theory, 39(3):930–945, 1993.

[14] A. R. Barron and J. M. Klusowski. Approximation and estimation for high-dimensional deep
learning networks. arXiv preprint arXiv:1809.03090, 2018.

[15] P. Bartlett. For valid generalization the size of the weights is more important than the size
of the network. Advances in neural information processing systems, 9, 1996.

[16] D. Beaglehole, M. Belkin, and P. Pandit. On the inconsistency of kernel ridgeless regression
in fixed dimensions. SIAM J. Math. Data Sci., 5(4):854–872, 2023.

[17] G. Beliakov. Interpolation of lipschitz functions. Journal of Computational and Applied


Mathematics, 196(1):20–44, 2006.

[18] M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning practice
and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences,
116(32):15849–15854, 2019.

[19] M. Belkin, S. Ma, and S. Mandal. To understand deep learning we need to understand kernel
learning. In International Conference on Machine Learning, pages 541–549. PMLR, 2018.

[20] R. Bellman. On the theory of dynamic programming. Proceedings of the national Academy
of Sciences, 38(8):716–719, 1952.

[21] A. Ben-Israel and A. Charnes. Contributions to the theory of generalized inverses. Journal
of the Society for Industrial and Applied Mathematics, 11(3):667–699, 1963.

[22] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent
is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

[23] A. Berlinet and C. Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and
statistics. Kluwer Academic Publishers, Boston, MA, 2004. With a preface by Persi Diaconis.

[24] J. Berner, P. Grohs, and A. Jentzen. Analysis of the generalization error: Empirical risk
minimization over deep artificial neural networks overcomes the curse of dimensionality in
the numerical approximation of black–scholes partial differential equations. SIAM Journal
on Mathematics of Data Science, 2(3):631–657, 2020.

[25] J. Berner, P. Grohs, G. Kutyniok, and P. Petersen. The modern mathematics of deep learning,
2021.

[26] D. P. Bertsekas. Nonlinear programming. Athena Scientific Optimization and Computation


Series. Athena Scientific, Belmont, MA, third edition, 2016.

[27] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming., volume 3 of Optimization
and neural computation series. Athena Scientific, 1996.

[28] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statis-
tics). Springer, 1 edition, 2007.

[29] Å. Björck. Numerical Methods for Least Squares Problems. Society for Industrial and Applied
Mathematics, 1996.

[30] H. Bolcskei, P. Grohs, G. Kutyniok, and P. Petersen. Optimal approximation with sparsely
connected deep neural networks. SIAM Journal on Mathematics of Data Science, 1(1):8–45,
2019.

[31] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin
classifiers. In D. Haussler, editor, Proceedings of the 5th Annual Workshop on Computational
Learning Theory (COLT’92), pages 144–152, Pittsburgh, PA, USA, July 1992. ACM Press.

[32] L. Bottou. Stochastic Gradient Descent Tricks, pages 421–436. Springer Berlin Heidelberg,
Berlin, Heidelberg, 2012.

[33] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine
learning. SIAM Review, 60(2):223–311, 2018.

[34] O. Bousquet and A. Elisseeff. Stability and generalization. The Journal of Machine Learning
Research, 2:499–526, 2002.

[35] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, Cambridge,
2004.

[36] S. Boyd, L. Xiao, and A. Mutapcic. Subgradient methods. https://web.stanford.edu/


class/ee392o/subgrad_method.pdf, 2003. Lecture Notes, Stanford University.

[37] J. Braun and M. Griebel. On a constructive proof of kolmogorov’s superposition theorem.


Constructive Approximation, 30(3):653–675, Dec 2009.

[38] S. Brenner and R. Scott. The Mathematical Theory of Finite Element Methods. Texts in
Applied Mathematics. Springer New York, 2007.

[39] M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković. Geometric deep learning: Grids,
groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.

[40] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,


P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in
neural information processing systems, 33:1877–1901, 2020.

[41] S. Bubeck. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn.,
8:231–357, 2014.

[42] O. Calin. Deep learning architectures. Springer, 2020.

[43] E. J. Candes. Ridgelets: theory and applications. Stanford University, 1998.

[44] C. Carathéodory. Über den variabilitätsbereich der fourier’schen konstanten von posi-
tiven harmonischen funktionen. Rendiconti del Circolo Matematico di Palermo (1884-1940),
32:193–217, 1911.

[45] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In 2017
ieee symposium on security and privacy (sp), pages 39–57. Ieee, 2017.

[46] S. M. Carroll and B. W. Dickinson. Construction of neural nets using the radon transform.
International 1989 Joint Conference on Neural Networks, pages 607–611 vol.1, 1989.

[47] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes,


L. Sagun, and R. Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. Journal
of Statistical Mechanics: Theory and Experiment, 2019(12):124018, 2019.

[48] M. Chen, H. Jiang, W. Liao, and T. Zhao. Efficient approximation of deep relu networks for
functions on low dimensional manifolds. Advances in neural information processing systems,
32, 2019.

[49] L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable programming. In


H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,
Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.,
2019.

[50] Y. Cho and L. Saul. Kernel methods for deep learning. In Y. Bengio, D. Schuurmans,
J. Lafferty, C. Williams, and A. Culotta, editors, Advances in Neural Information Processing
Systems, volume 22. Curran Associates, Inc., 2009.

[51] F. Chollet. Deep learning with Python. Simon and Schuster, 2021.

[52] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of
multilayer networks. In Artificial intelligence and statistics, pages 192–204. PMLR, 2015.

[53] C. K. Chui and H. N. Mhaskar. Deep nets for local manifold learning. Frontiers in Applied
Mathematics and Statistics, 4:12, 2018.

[54] P. Ciarlet. The Finite Element Method for Elliptic Problems. Studies in Mathematics and its
Applications. North Holland, 1978.

[55] J. B. Conway. A course in functional analysis, volume 96. Springer, 2019.

[56] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.

[57] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other
Kernel-based Learning Methods. Cambridge University Press, 1 edition, 2000.

[58] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the
American mathematical society, 39(1):1–49, 2002.

[59] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of


Control, Signals and Systems, 2(4):303–314, 1989.

[60] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and
attacking the saddle point problem in high-dimensional non-convex optimization. Advances
in neural information processing systems, 27, 2014.

[61] A. G. de G. Matthews. Sample-then-optimize posterior sampling for bayesian linear models.


2017.

[62] A. G. de G. Matthews, J. Hron, M. Rowland, R. E. Turner, and Z. Ghahramani. Gaussian


process behaviour in wide deep neural networks. In International Conference on Learning
Representations, 2018.

[63] T. De Ryck, S. Lanthaler, and S. Mishra. On the approximation of functions by tanh neural
networks. Neural Networks, 143:732–750, 2021.

[64] A. Défossez, L. Bottou, F. R. Bach, and N. Usunier. A simple convergence proof of adam
and adagrad. Trans. Mach. Learn. Res., 2022, 2022.

[65] J. W. Demmel. Applied Numerical Linear Algebra. SIAM, Jan. 1997.

[66] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale
hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern
Recognition, pages 248–255, 2009.

[67] R. A. DeVore, R. Howard, and C. Micchelli. Optimal nonlinear approximation. Manuscripta
Mathematica, 63(4):469–478, 1989.

[68] R. DeVore and G. Lorentz. Constructive Approximation. Grundlehren der mathematischen


Wissenschaften. Springer Berlin Heidelberg, 1993.

[69] R. A. DeVore. Nonlinear approximation. Acta numerica, 7:51–150, 1998.

[70] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio. Sharp minima can generalize for deep nets.
In International Conference on Machine Learning, pages 1019–1028. PMLR, 2017.

[71] F. Draxler, K. Veschgini, M. Salmhofer, and F. Hamprecht. Essentially no barriers in neural


network energy landscape. In International conference on machine learning, pages 1309–1318.
PMLR, 2018.

[72] M. Du, F. Yang, N. Zou, and X. Hu. Fairness in deep learning: A computational perspective.
IEEE Intelligent Systems, 36(4):25–34, 2021.

[73] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep
neural networks. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th
International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning
Research, pages 1675–1685. PMLR, 09–15 Jun 2019.

[74] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[75] W. E and Q. Wang. Exponential convergence of the deep neural network approximation for
analytic functions. Sci. China Math., 61(10):1733–1740, 2018.

[76] K. Eckle and J. Schmidt-Hieber. A comparison of deep networks with relu activation function
and linear spline-type methods. Neural Networks, 110:232–242, 2019.

[77] D. Elbrächter, D. Perekrestenko, P. Grohs, and H. Bölcskei. Deep neural network approxi-
mation theory. IEEE Transactions on Information Theory, 67(5):2581–2623, 2021.

[78] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In V. Feldman,
A. Rakhlin, and O. Shamir, editors, 29th Annual Conference on Learning Theory, volume 49
of Proceedings of Machine Learning Research, pages 907–940, Columbia University, New York,
New York, USA, 23–26 Jun 2016. PMLR.

[79] H. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Mathematics and
Its Applications. Springer Netherlands, 2000.

[80] A. Ern and J. Guermond. Finite Elements I: Approximation and Interpolation. Texts in
Applied Mathematics. Springer International Publishing, 2021.

[81] M. Fischer, M. Balunovic, D. Drachsler-Cohen, T. Gehr, C. Zhang, and M. Vechev. Dl2:


training and querying neural networks with logic. In International Conference on Machine
Learning, pages 1931–1941. PMLR, 2019.

[82] C. L. Frenzen, T. Sasao, and J. T. Butler. On the number of segments needed in a piecewise
linear approximation. Journal of Computational and Applied mathematics, 234(2):437–446,
2010.

[83] K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks.
Neural Networks, 2(3):183–192, 1989.

[84] T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, and A. G. Wilson. Loss surfaces,
mode connectivity, and fast ensembling of dnns. Advances in neural information processing
systems, 31, 2018.

[85] G. Garrigos and R. M. Gower. Handbook of convergence theorems for (stochastic) gradient
methods, 2023.

[86] T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov, S. Chaudhuri, and M. Vechev. Ai2:
Safety and robustness certification of neural networks with abstract interpretation. In 2018
IEEE symposium on security and privacy (SP), pages 3–18. IEEE, 2018.

[87] A. Géron. Hands-on machine learning with Scikit-Learn and TensorFlow : concepts, tools,
and techniques to build intelligent systems. O’Reilly Media, Sebastopol, CA, 2017.

[88] F. Girosi and T. Poggio. Representation properties of networks: Kolmogorov’s theorem is
irrelevant. Neural Computation, 1(4):465–469, 1989.

[89] F. Girosi and T. Poggio. Networks and the best approximation property. Biological cyber-
netics, 63(3):169–176, 1990.

[90] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural
networks. In Y. W. Teh and M. Titterington, editors, Proceedings of the Thirteenth In-
ternational Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of
Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May
2010. PMLR.

[91] G. Goh. Why momentum really works. Distill, 2017. http://distill.pub/2017/momentum.

[92] G. H. Golub and C. F. Van Loan. Matrix Computations - 4th Edition. Johns Hopkins
University Press, Philadelphia, PA, 2013.

[93] L. Gonon and C. Schwab. Deep relu network expression rates for option prices in high-
dimensional, exponential lévy models. Finance and Stochastics, 25(4):615–657, 2021.

[94] I. J. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, Cambridge, MA,
USA, 2016. http://www.deeplearningbook.org.

[95] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples.
In International Conference on Learning Representations (ICLR), 2015.

[96] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network
optimization problems. arXiv preprint arXiv:1412.6544, 2014.

[97] L.-A. Gottlieb, A. Kontorovich, and R. Krauthgamer. Efficient regression in metric spaces
via approximate lipschitz extension. IEEE Transactions on Information Theory, 63(8):4838–
4849, 2017.

[98] R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtárik. SGD: General
analysis and improved rates. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of
the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine
Learning Research, pages 5200–5209. PMLR, 09–15 Jun 2019.

[99] K. Gröchenig. Foundations of time-frequency analysis. Springer Science & Business Media,
2013.

[100] P. Grohs and L. Herrmann. Deep neural network approximation for high-dimensional elliptic
pdes with boundary conditions. IMA Journal of Numerical Analysis, 42(3):2055–2082, 2022.

[101] P. Grohs, F. Hornung, A. Jentzen, and P. Von Wurstemberger. A proof that artificial neural
networks overcome the curse of dimensionality in the numerical approximation of Black–
Scholes partial differential equations, volume 284. American Mathematical Society, 2023.

[102] P. Grohs, F. Hornung, A. Jentzen, and P. von Wurstemberger. A proof that artificial neural
networks overcome the curse of dimensionality in the numerical approximation of Black-
Scholes partial differential equations. Mem. Amer. Math. Soc., 284(1410):v+93, 2023.

[103] I. Gühring and M. Raslan. Approximation rates for neural networks with encodable weights
in smoothness spaces. Neural Networks, 134:107–130, 2021.

[104] B. Hanin and D. Rolnick. Complexity of linear regions in deep networks. In International
Conference on Machine Learning, pages 2596–2604. PMLR, 2019.

[105] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic
gradient descent. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd
International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning
Research, pages 1225–1234, New York, New York, USA, 20–22 Jun 2016. PMLR.

[106] T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani. Surprises in high-dimensional
ridgeless least squares interpolation. The Annals of Statistics, 50(2):949–986, 2022.

[107] T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani. Surprises in high-dimensional
ridgeless least squares interpolation. The Annals of Statistics, 50(2):949–986, 2022.

[108] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining,
inference and prediction. Springer, 2 edition, 2009.

[109] S. S. Haykin. Neural networks and learning machines. Pearson Education, Upper Saddle
River, NJ, third edition, 2009.

[110] J. He, L. Li, J. Xu, and C. Zheng. Relu deep neural networks and linear finite elements. J.
Comput. Math., 38(3):502–527, 2020.

[111] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level
performance on imagenet classification. Proceedings of the IEEE international conference on
computer vision, 2015.

[112] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–
778, 2016.

[113] M. Hein and M. Andriushchenko. Formal guarantees on the robustness of a classifier against
adversarial manipulation. Advances in neural information processing systems, 30, 2017.

[114] H. Heuser. Lehrbuch der Analysis. Teil 1. Vieweg + Teubner, Wiesbaden, revised edition,
2009.

[115] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut
für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991.

[116] S. Hochreiter and J. Schmidhuber. Flat minima. Neural computation, 9(1):1–42, 1997.

[117] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–
1780, 1997.

[118] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal
problems. Technometrics, 12:55–67, 1970.

[119] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks,
4(2):251–257, 1991.

[120] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal
approximators. Neural Networks, 2(5):359–366, 1989.

[121] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected con-
volutional networks. Proceedings of the IEEE conference on computer vision and pattern
recognition, 1(2):3, 2017.

[122] G.-B. Huang and H. A. Babri. Upper bounds on the number of hidden neurons in feedforward
networks with arbitrary bounded nonlinear activation functions. IEEE transactions on neural
networks, 9(1):224–229, 1998.

[123] L. Huang, A. D. Joseph, B. Nelson, B. I. Rubinstein, and J. D. Tygar. Adversarial machine
learning. In Proceedings of the 4th ACM workshop on Security and artificial intelligence,
pages 43–58, 2011.

[124] T. Huster, C.-Y. J. Chiang, and R. Chadha. Limitations of the lipschitz constant as a defense
against adversarial examples. In ECML PKDD 2018 Workshops: Nemesis 2018, UrbReas
2018, SoGood 2018, IWAISe 2018, and Green Data Mining 2018, Dublin, Ireland, September
10-14, 2018, Proceedings 18, pages 16–29. Springer, 2019.

[125] M. Hutzenthaler, A. Jentzen, T. Kruse, and T. A. Nguyen. A proof that rectified deep neural
networks overcome the curse of dimensionality in the numerical approximation of semilinear
heat equations. SN partial differential equations and applications, 1(2):10, 2020.

[126] J. Håstad. Computational limitations of small depth circuits. PhD thesis, Department of
Mathematics, Massachusetts Institute of Technology, 1987.

[127] D. J. Im, M. Tao, and K. Branson. An empirical analysis of deep network loss surfaces. 2016.

[128] V. E. Ismailov. Ridge functions and applications in neural networks, volume 263. American
Mathematical Society, 2021.

[129] V. E. Ismailov. A three layer neural network can represent any multivariate function. Journal
of Mathematical Analysis and Applications, 523(1):127096, 2023.

[130] Y. Ito and K. Saito. Superposition of linearly independent functions and finite mappings by
neural networks. The Mathematical Scientist, 21(1):27, 1996.

[131] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization
in neural networks. Advances in neural information processing systems, 31, 2018.

[132] A. Jentzen, B. Kuckuck, and P. von Wurstemberger. Mathematical introduction to deep
learning: methods, implementations, and theory. arXiv preprint arXiv:2310.20360, 2023.

[133] A. Jentzen and A. Riekert. On the existence of global minima and convergence analy-
ses for gradient descent methods in the training of deep neural networks. arXiv preprint
arXiv:2112.09684, 2021.

[134] A. Jentzen, D. Salimova, and T. Welti. A proof that deep artificial neural networks overcome
the curse of dimensionality in the numerical approximation of Kolmogorov partial differen-
tial equations with constant diffusion and nonlinear drift coefficients. Commun. Math. Sci.,
19(5):1167–1205, 2021.

[135] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvu-
nakool, R. Bates, A. Žídek, A. Potapenko, et al. Highly accurate protein structure prediction
with alphafold. Nature, 596(7873):583–589, 2021.

[136] P. C. Kainen, V. Kurkova, and A. Vogt. Approximation by neural networks is not continuous.
Neurocomputing, 29(1-3):47–56, 1999.

[137] P. C. Kainen, V. Kurkova, and A. Vogt. Continuity of approximation by neural networks in
Lp spaces. Annals of Operations Research, 101:143–147, 2001.

[138] P. C. Kainen, V. Kurkova, and A. Vogt. Best approximation by linear combinations of
characteristic functions of half-spaces. Journal of Approximation Theory, 122(2):151–159,
2003.

[139] H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient
methods under the polyak-lojasiewicz condition. In P. Frasconi, N. Landwehr, G. Manco,
and J. Vreeken, editors, Machine Learning and Knowledge Discovery in Databases, pages
795–811, Cham, 2016. Springer International Publishing.

[140] C. Karner, V. Kazeev, and P. C. Petersen. Limitations of gradient descent due to numerical
instability of backpropagation. arXiv preprint arXiv:2210.00805, 2022.

[141] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch
training for deep learning: Generalization gap and sharp minima. CoRR, abs/1609.04836,
2016.

[142] G. S. Kimeldorf and G. Wahba. A correspondence between bayesian estimation on stochastic
processes and smoothing by splines. The Annals of Mathematical Statistics, 41(2):495–502,
1970.

[143] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd Interna-
tional Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings.
International Conference on Learning Representations, ICLR, 2015.

[144] A. Klenke. Wahrscheinlichkeitstheorie. Springer, 2006.

[145] M. Kohler, A. Krzyżak, and S. Langer. Estimation of a function of low local dimensionality
by deep neural networks. IEEE transactions on information theory, 68(6):4032–4042, 2022.

[146] M. Kohler and S. Langer. On the rate of convergence of fully connected deep neural network
regression estimates. The Annals of Statistics, 49(4):2231–2249, 2021.

[147] A. N. Kolmogorov. On the representation of continuous functions of many variables by
superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR,
114:953–956, 1957.

[148] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional
neural networks. In Advances in neural information processing systems, pages 1097–1105,
2012.

[149] G. Kutyniok, P. Petersen, M. Raslan, and R. Schneider. A theoretical analysis of deep neural
networks and parametric pdes. Constructive Approximation, 55(1):73–125, 2022.
[150] V. Kůrková. Kolmogorov’s theorem is relevant. Neural Computation, 3(4):617–622, 1991.
[151] V. Kůrková. Kolmogorov’s theorem and multilayer neural networks. Neural Networks,
5(3):501–506, 1992.
[152] F. Laakmann and P. Petersen. Efficient approximation of solutions of parametric linear
transport equations by relu dnns. Advances in Computational Mathematics, 47(1):11, 2021.
[153] G. Lan. First-order and Stochastic Optimization Methods for Machine Learning. Springer
Series in the Data Sciences. Springer International Publishing, Cham, first edition, 2020.
[154] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
[155] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D.
Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation,
1(4):541–551, 1989.
[156] Y. LeCun, L. Bottou, G. Orr, and K. Müller. Efficient backprop. In Neural Networks: Tricks
of the Trade, Lecture Notes in Computer Science, chapter 2, page 546. Springer Berlin /
Heidelberg, 1998.
[157] J. Lee, J. Sohl-dickstein, J. Pennington, R. Novak, S. Schoenholz, and Y. Bahri. Deep neural
networks as gaussian processes. In International Conference on Learning Representations,
2018.
[158] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide
neural networks of any depth evolve as linear models under gradient descent. In H. Wallach,
H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in
Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[159] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a
nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–
867, 1993.
[160] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via
integral quadratic constraints. SIAM J. Optim., 26(1):57–95, 2016.
[161] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the loss landscape of neural
nets. Advances in neural information processing systems, 31, 2018.
[162] W. Li. Generalization error of minimum weighted norm and kernel interpolation. SIAM
Journal on Mathematics of Data Science, 3(1):414–438, 2021.
[163] S. Liang and R. Srikant. Why deep neural networks for function approximation? In Proc. of
ICLR 2017, pages 1 – 17, 2017.
[164] T. Liang and A. Rakhlin. Just interpolate: kernel “ridgeless” regression can generalize. Ann.
Statist., 48(3):1329–1347, 2020.

[165] V. Lin and A. Pinkus. Fundamentality of ridge functions. Journal of Approximation Theory,
75(3):295–311, 1993.

[166] M. Longo, J. A. Opschoor, N. Disch, C. Schwab, and J. Zech. De rham compatible deep
neural network fem. Neural Networks, 165:721–739, 2023.

[167] C. Ma, S. Wojtowytsch, L. Wu, et al. Towards a mathematical understanding of neu-
ral network-based machine learning: what we know and what we don’t. arXiv preprint
arXiv:2009.10713, 2020.

[168] C. Ma, L. Wu, et al. A priori estimates of the population risk for two-layer neural networks.
arXiv preprint arXiv:1810.06397, 2018.

[169] S. Mahan, E. J. King, and A. Cloninger. Nonclosedness of sets of neural networks in sobolev
spaces. Neural Networks, 137:85–96, 2021.

[170] V. Maiorov and A. Pinkus. Lower bounds for approximation by mlp neural networks. Neu-
rocomputing, 25(1):81–91, 1999.

[171] Y. Marzouk, Z. Ren, S. Wang, and J. Zech. Distribution learning via neural differential
equations: a nonparametric statistical perspective. Journal of Machine Learning Research
(accepted), 2024.

[172] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity.
The bulletin of mathematical biophysics, 5:115–133, 1943.

[173] S. Mei and A. Montanari. The generalization error of random features regression: Precise
asymptotics and the double descent curve. Communications on Pure and Applied Mathemat-
ics, 75(4):667–766, 2022.

[174] H. N. Mhaskar. Approximation properties of a multilayered feedforward artificial neural
network. Adv. Comput. Math., 1(1):61–80, 1993.

[175] H. N. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions.
Neural computation, 8(1):164–177, 1996.

[176] H. N. Mhaskar and C. A. Micchelli. Approximation by superposition of sigmoidal and radial
basis functions. Adv. in Appl. Math., 13(3):350–373, 1992.

[177] H. N. Mhaskar and C. A. Micchelli. Degree of approximation by neural and translation
networks with a single hidden layer. Advances in applied mathematics, 16(2):151–183, 1995.

[178] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT press,
2018.

[179] C. Molnar. Interpretable machine learning. Lulu.com, 2020.

[180] H. Montanelli and Q. Du. New error bounds for deep relu networks using sparse grids. SIAM
Journal on Mathematics of Data Science, 1(1):78–92, 2019.

[181] H. Montanelli and H. Yang. Error bounds for deep relu networks using the kolmogorov–arnold
superposition theorem. Neural Networks, 129:1–6, 2020.

[182] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep
neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger,
editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates,
Inc., 2014.

[183] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial pertur-
bations. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 1765–1773, 2017.

[184] E. Moulines and F. Bach. Non-asymptotic analysis of stochastic approximation algorithms for
machine learning. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger,
editors, Advances in Neural Information Processing Systems, volume 24. Curran Associates,
Inc., 2011.

[185] R. Nakada and M. Imaizumi. Adaptive approximation and generalization of deep neural
network with intrinsic dimensionality. Journal of Machine Learning Research, 21(174):1–38,
2020.

[186] R. M. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.

[187] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation ap-
proach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[188] Y. Nesterov. Introductory lectures on convex optimization, volume 87 of Applied Optimization.
Kluwer Academic Publishers, Boston, MA, 2004. A basic course.

[189] Y. Nesterov. Lectures on convex optimization, volume 137 of Springer Optimization and Its
Applications. Springer, Cham, second edition, 2018.

[190] Y. E. Nesterov. A method for solving the convex programming problem with convergence
rate O(1/k^2). Dokl. Akad. Nauk SSSR, 269(3):543–547, 1983.

[191] B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks.
In Conference on learning theory, pages 1376–1401. PMLR, 2015.

[192] M. A. Nielsen. Neural networks and deep learning, 2018.

[193] R. H. Nielsen. Kolmogorov’s mapping neural network existence theorem. In Proceedings of
the IEEE First International Conference on Neural Networks (San Diego, CA), volume III,
pages 11–13. Piscataway, NJ: IEEE, 1987.

[194] J. Nocedal and S. J. Wright. Numerical optimization. Springer Series in Operations Research
and Financial Engineering. Springer, New York, second edition, 2006.

[195] E. Novak and H. Woźniakowski. Approximation of infinitely differentiable multivariate func-
tions is intractable. Journal of Complexity, 25(4):398–404, 2009.

[196] B. O’Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Found.
Comput. Math., 15(3):715–732, 2015.

[197] J. A. Opschoor. Constructive deep neural network approximations of weighted analytic so-
lutions to partial differential equations in polygons, 2023. Dissertation 29278, ETH Zürich,
https://doi.org/10.3929/ethz-b-000614671.

[198] J. A. Opschoor and C. Schwab. Deep relu networks and high-order finite element methods
ii: Chebyšev emulation. Computers & Mathematics with Applications, 169:142–162, 2024.

[199] J. A. A. Opschoor, P. C. Petersen, and C. Schwab. Deep relu networks and high-order finite
element methods. Analysis and Applications, 18(05):715–770, 2020.

[200] J. A. A. Opschoor, C. Schwab, and J. Zech. Exponential ReLU DNN expression of holomor-
phic maps in high dimension. Constructive Approximation, 2021.

[201] J. A. A. Opschoor, C. Schwab, and J. Zech. Deep learning in high dimension: ReLU neural
network expression for Bayesian PDE inversion. In Optimization and control for partial
differential equations—uncertainty quantification, open and closed-loop control, and shape
optimization, volume 29 of Radon Ser. Comput. Appl. Math., pages 419–462. De Gruyter,
Berlin, 2022.

[202] P. Oswald. On the degree of nonlinear spline approximation in Besov-Sobolev spaces. J.
Approx. Theory, 61(2):131–157, 1990.

[203] S. Ovchinnikov. Max-min representation of piecewise linear functions. Beiträge Algebra
Geom., 43(1):297–302, 2002.

[204] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-
box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference
on computer and communications security, pages 506–519, 2017.

[205] Y. C. Pati and P. S. Krishnaprasad. Analysis and synthesis of feedforward neural networks us-
ing discrete affine wavelet transformations. IEEE Transactions on Neural Networks, 4(1):73–
85, 1993.

[206] J. Pennington and Y. Bahri. Geometry of neural network loss surfaces via random matrix
theory. In International Conference on Machine Learning, pages 2798–2806. PMLR, 2017.

[207] P. Petersen, M. Raslan, and F. Voigtlaender. Topological properties of the set of func-
tions generated by neural networks of fixed size. Foundations of computational mathematics,
21:375–444, 2021.

[208] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using
deep relu neural networks. Neural Networks, 108:296–330, 2018.

[209] P. C. Petersen. Neural Network Theory. 2020.
http://www.pc-petersen.eu/Neural_Network_Theory.pdf, Lecture notes.

[210] M. D. Petković and P. S. Stanimirović. Iterative method for computing the moore–penrose
inverse based on penrose equations. Journal of Computational and Applied Mathematics,
235(6):1604–1613, 2011.

[211] A. Pinkus. Approximation theory of the MLP model in neural networks. In Acta numerica,
1999, volume 8 of Acta Numer., pages 143–195. Cambridge Univ. Press, Cambridge, 1999.

[212] G. Pisier. Remarques sur un résultat non publié de B. Maurey. Séminaire Analyse fonction-
nelle (dit “Maurey-Schwartz”), 1980-1981.

[213] T. Poggio, A. Banburski, and Q. Liao. Theoretical issues in deep networks. Proceedings of
the National Academy of Sciences, 117(48):30039–30045, 2020.

[214] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao. Why and when can deep-but
not shallow-networks avoid the curse of dimensionality: a review. Int. J. Autom. Comput.,
14(5):503–519, 2017.

[215] T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for predictivity in
learning theory. Nature, 428(6981):419–422, 2004.

[216] B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

[217] B. T. Polyak. Introduction to optimization. Translations Series in Mathematics and Engineer-
ing. Optimization Software, Inc., Publications Division, New York, 1987. Translated from
the Russian, With a foreword by Dimitri P. Bertsekas.

[218] S. J. Prince. Understanding Deep Learning. MIT Press, 2023.

[219] N. Qian. On the momentum term in gradient descent learning algorithms. Neural Networks,
12(1):145–151, 1999.

[220] Q. Nguyen, M. C. Mukkamala, and M. Hein. On the loss landscape of a class of deep neural
networks with no bad local valleys. In International Conference on Learning Representations
(ICLR), 2018.

[221] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein. On the expressive
power of deep neural networks. In D. Precup and Y. W. Teh, editors, Proceedings of the
34th International Conference on Machine Learning, volume 70 of Proceedings of Machine
Learning Research, pages 2847–2854. PMLR, 06–11 Aug 2017.

[222] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In J. Platt,
D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing
Systems, volume 20. Curran Associates, Inc., 2007.

[223] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep
learning framework for solving forward and inverse problems involving nonlinear partial dif-
ferential equations. Journal of Computational Physics, 378:686–707, 2019.

[224] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly
convex stochastic optimization. In Proceedings of the 29th International Coference on Inter-
national Conference on Machine Learning, ICML’12, page 1571–1578, Madison, WI, USA,
2012. Omnipress.

[225] C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. Adaptive
computation and machine learning. MIT Press, 2006.

[226] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, Q. Le, and A. Kurakin. Regularized
evolution for image classifier architecture search. Proceedings of the AAAI Conference on
Artificial Intelligence, 33:4780–4789, 2019.

[227] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of adam and beyond. In 6th
International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada,
April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.

[228] H. Robbins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical
Statistics, 22(3):400 – 407, 1951.

[229] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organiza-
tion in the brain. Psychological Review, 65(6):386–408, 1958.

[230] W. Ruan, X. Yi, and X. Huang. Adversarial robustness of deep learning: Theory, algorithms,
and applications. In Proceedings of the 30th ACM international conference on information
& knowledge management, pages 4866–4869, 2021.

[231] S. Ruder. An overview of gradient descent optimization algorithms, 2016.

[232] W. Rudin. Real and complex analysis. McGraw-Hill Book Co., New York, third edition, 1987.

[233] W. Rudin. Functional analysis. International Series in Pure and Applied Mathematics.
McGraw-Hill, Inc., New York, second edition, 1991.

[234] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-
propagating errors. Nature, 323(6088):533–536, 1986.

[235] T. D. Ryck and S. Mishra. Error analysis for deep neural network approximations of paramet-
ric hyperbolic conservation laws. Mathematics of Computation, 2023. Article electronically
published on December 15, 2023.

[236] I. Safran and O. Shamir. Depth separation in relu networks for approximating smooth non-
linear functions. ArXiv, abs/1610.09887, 2016.

[237] M. A. Sartori and P. J. Antsaklis. A simple method to derive bounds on the size and to train
multilayer neural networks. IEEE transactions on neural networks, 2(4):467–471, 1991.

[238] R. Scheichl and J. Zech. Numerical methods for bayesian inverse problems, 2021. Lecture
Notes.

[239] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117,
2015.

[240] J. Schmidt-Hieber. Deep relu network approximation of functions on a manifold. arXiv
preprint arXiv:1908.00695, 2019.

[241] J. Schmidt-Hieber. Nonparametric regression using deep neural networks with relu activation
function. The Annals of Statistics, 48(4):1875–1897, 2020.

[242] J. Schmidt-Hieber. The kolmogorov–arnold representation theorem revisited. Neural Net-
works, 137:119–126, 2021.
[243] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings
of the Annual Conference on Learning Theory, 2001.
[244] B. Schölkopf and A. J. Smola. Learning with kernels : support vector machines, regularization,
optimization, and beyond. Adaptive computation and machine learning. MIT Press, 2002.
[245] L. Schumaker. Spline Functions: Basic Theory. Cambridge Mathematical Library. Cambridge
University Press, 3 edition, 2007.
[246] C. Schwab and J. Zech. Deep learning in high dimension: neural network expression rates for
generalized polynomial chaos expansions in UQ. Anal. Appl. (Singap.), 17(1):19–55, 2019.
[247] C. Schwab and J. Zech. Deep learning in high dimension: neural network expression rates for
analytic functions in L2(Rd, γd). SIAM/ASA J. Uncertain. Quantif., 11(1):199–234, 2023.
[248] T. Serra, C. Tjandraatmadja, and S. Ramalingam. Bounding and counting linear regions of
deep neural networks, 2018.
[249] U. Shaham, A. Cloninger, and R. R. Coifman. Provable approximation properties for deep
neural networks. Applied and Computational Harmonic Analysis, 44(3):537–557, 2018.
[250] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning - From Theory to
Algorithms. Cambridge University Press, 2014.
[251] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Con-
vergence results and optimal averaging schemes. In S. Dasgupta and D. McAllester, editors,
Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceed-
ings of Machine Learning Research, pages 71–79, Atlanta, Georgia, USA, 17–19 Jun 2013.
PMLR.
[252] N. Z. Shor. Minimization Methods for Non-Differentiable Functions, volume 3 of Springer
Series in Computational Mathematics. Springer-Verlag, Berlin, Heidelberg, 1985.
[253] J. W. Siegel and J. Xu. High-order approximation rates for shallow neural networks with
cosine and relu^k activation functions. Applied and Computational Harmonic Analysis, 58:1–
26, 2022.
[254] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser,
I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep
neural networks and tree search. Nature, 529(7587):484–489, 2016.
[255] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. In ICLR, 2014.
[256] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient
descent on separable data. Journal of Machine Learning Research, 19(70):1–57, 2018.
[257] E. M. Stein. Singular integrals and differentiability properties of functions. Princeton Math-
ematical Series, No. 30. Princeton University Press, Princeton, N.J., 1970.

[258] I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.

[259] G. Strang. Lecture 23: Accelerating gradient descent (use momentum). MIT OpenCourseWare:
Matrix Methods in Data Analysis, Signal Processing, And Machine Learning, 2018.
https://ocw.mit.edu/courses/18-065-matrix-methods-in-data-analysis-signal-processing-and-machine-learning-spring-2018/resources/lecture-23-accelerating-gradient-descent-use-momentum/.

[260] D. Stutz, M. Hein, and B. Schiele. Disentangling adversarial robustness and generalization.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pages 6976–6987, 2019.

[261] A. Sukharev. Optimal method of constructing best uniform approximations for functions of
a certain class. USSR Computational Mathematics and Mathematical Physics, 18(2):21–31,
1978.

[262] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and
momentum in deep learning. In S. Dasgupta and D. McAllester, editors, Proceedings of the
30th International Conference on Machine Learning, volume 28 of Proceedings of Machine
Learning Research, pages 1139–1147, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.

[263] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages 1–9, 2015.

[264] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus.
Intriguing properties of neural networks. In International Conference on Learning Represen-
tations (ICLR), 2014.

[265] M. Tan and Q. V. Le. Efficientnet: Rethinking model scaling for convolutional neural net-
works. In Proceedings of the 36th International Conference on Machine Learning, pages
6105–6114, 2019.

[266] J. Tarela and M. Martı́nez. Region configurations for realizability of lattice piecewise-linear
models. Mathematical and Computer Modelling, 30(11):17–27, 1999.

[267] J. M. Tarela, E. Alonso, and M. V. Martı́nez. A representation method for PWL functions
oriented to parallel processing. Math. Comput. Modelling, 13(10):75–83, 1990.

[268] M. Telgarsky. Representation benefits of deep feedforward networks, 2015.

[269] M. Telgarsky. Benefits of depth in neural networks. In V. Feldman, A. Rakhlin, and O. Shamir,
editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine
Learning Research, pages 1517–1539, Columbia University, New York, New York, USA, 23–26
Jun 2016. PMLR.

[270] M. Telgarsky. Neural networks and rational functions. In D. Precup and Y. W. Teh, edi-
tors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of
Proceedings of Machine Learning Research, pages 3387–3393. PMLR, 06–11 Aug 2017.

[271] M. Telgarsky. Deep learning theory lecture notes. https://mjt.cs.illinois.edu/dlt/,
2021. Version: 2021-10-27 v0.0-e7150f2d (alpha).

[272] T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running
average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.

[273] V. M. Tikhomirov. ε-entropy and ε-capacity of sets in functional spaces. Selected Works
of AN Kolmogorov: Volume III: Information Theory and the Theory of Algorithms, pages
86–170, 1993.

[274] A. N. Tikhonov. Regularization of incorrectly posed problems. Soviet Mathematics Doklady,
4(6):1624–1627, 1963.

[275] S. Tu, S. Venkataraman, A. C. Wilson, A. Gittens, M. I. Jordan, and B. Recht. Breaking
locality accelerates block Gauss-Seidel. In D. Precup and Y. W. Teh, editors, Proceedings of
the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine
Learning Research, pages 3482–3491. PMLR, 06–11 Aug 2017.

[276] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142,
1984.

[277] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of
events to their probabilities. In Measures of complexity: festschrift for alexey chervonenkis,
pages 11–30. Springer, 2015.

[278] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and
I. Polosukhin. Attention is all you need. Advances in neural information processing systems,
30, 2017.

[279] L. Venturi, A. S. Bandeira, and J. Bruna. Spurious valleys in one-hidden-layer neural network
optimization landscapes. Journal of Machine Learning Research, 20:133, 2019.

[280] R. Vershynin. High-dimensional probability: An introduction with applications in data science,
volume 47. Cambridge University Press, 2018.

[281] B. A. Vostrecov and M. A. Kreĭnes. Approximation of continuous functions by superpositions
of plane waves. Dokl. Akad. Nauk SSSR, 140:1237–1240, 1961.

[282] S. Wang and X. Sun. Generalization of hinging hyperplanes. IEEE Transactions on Infor-
mation Theory, 51(12):4425–4431, 2005.

[283] Z. Wang, A. Albarghouthi, G. Prakriya, and S. Jha. Interval universal approximation for
neural networks. Proceedings of the ACM on Programming Languages, 6(POPL):1–29, 2022.

[284] E. Weinan, C. Ma, and L. Wu. Barron spaces and the compositional function spaces for
neural network models. arXiv preprint arXiv:1906.08039, 2019.

[285] E. Weinan and S. Wojtowytsch. Representation formulas and pointwise properties for barron
functions. Calculus of Variations and Partial Differential Equations, 61(2):46, 2022.

[286] S. Weissmann, A. Wilson, and J. Zech. Multilevel optimization for inverse problems. In P.-L.
Loh and M. Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory,
volume 178 of Proceedings of Machine Learning Research, pages 5489–5524. PMLR, 02–05
Jul 2022.

[287] A. C. Wilson, B. Recht, and M. I. Jordan. A lyapunov analysis of accelerated methods in
optimization. Journal of Machine Learning Research, 22(113):1–34, 2021.

[288] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive
gradient methods in machine learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wal-
lach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information
Processing Systems, volume 30. Curran Associates, Inc., 2017.

[289] D. R. Wilson and T. R. Martinez. The general inefficiency of batch training for gradient
descent learning. Neural Netw., 16(10):1429–1451, Dec. 2003.

[290] C. Xiao, J.-Y. Zhu, B. Li, W. He, M. Liu, and D. Song. Spatially transformed adversarial
examples. arXiv preprint arXiv:1801.02612, 2018.

[291] H. Xu and S. Mannor. Robustness and generalization. Machine learning, 86:391–423, 2012.

[292] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw.,
94:103–114, 2017.

[293] D. Yarotsky and A. Zhevnerchuk. The phase diagram of approximation rates for deep neural
networks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors,
Advances in Neural Information Processing Systems, volume 33, pages 13005–13015. Curran
Associates, Inc., 2020.

[294] Y. Jiang, B. Neyshabur, H. Mobahi, D. Krishnan, and S. Bengio. Fantastic generalization
measures and where to find them. In International Conference on Learning Representations
(ICLR), 2019.

[295] M. D. Zeiler. Adadelta: An adaptive learning rate method. CoRR, abs/1212.5701, 2012.

[296] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer. Scaling vision transformers. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–
12113, 2022.

[297] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning
requires rethinking generalization, 2016.

[298] M. A. Álvarez, L. Rosasco, and N. D. Lawrence. Kernels for vector-valued functions: A
review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.
