
Philipp Petersen and Jakob Zech

Mathematical theory of deep learning


– Monograph –

June 26, 2024

Springer Nature

Contents

1 Introduction
  1.1 Mathematics of deep learning
  1.2 High-level overview of deep learning
  1.3 Why does it work?
  1.4 Course philosophy and outline
  1.5 Material not covered in this course

2 Feed-forward neural networks
  2.1 Formal definition
  2.2 Notion of size
  2.3 Activation functions

3 Universal Approximation
  3.1 A universal approximation theorem
  3.2 Superexpressive activations and Kolmogorov's superposition theorem

4 Splines
  4.1 B-splines and smooth functions
  4.2 Reapproximation of B-splines with sigmoidal activations

5 ReLU neural networks
  5.1 Basic ReLU calculus
  5.2 Continuous piecewise linear functions
  5.3 Simplicial pieces
  5.4 Convergence rates for Hölder continuous functions

6 Affine pieces for ReLU neural networks
  6.1 Upper bounds
  6.2 Tightness of upper bounds
  6.3 Depth separation
  6.4 Number of pieces in practice

7 Deep ReLU neural networks
  7.1 The square function
  7.2 Multiplication
  7.3 𝐶^{𝑘,𝑠} functions

8 High-dimensional approximation
  8.1 The Barron class
  8.2 Functions with compositionality structure
  8.3 Functions on manifolds

9 Interpolation
  9.1 Universal interpolation
  9.2 Optimal interpolation and reconstruction

10 Training of neural networks
  10.1 Gradient descent
  10.2 Stochastic gradient descent (SGD)
  10.3 Backpropagation
  10.4 Acceleration
  10.5 Other methods

11 Wide neural networks
  11.1 Linear least-squares
  11.2 Kernel least-squares
  11.3 Tangent kernel
  11.4 Convergence to global minimizers
  11.5 Training dynamics for LeCun initialization
  11.6 Normalized initialization

12 Loss landscape analysis
  12.1 Visualization of loss landscapes
  12.2 Spurious minima
  12.3 Saddle points

13 Shape of neural network spaces
  13.1 Lipschitz parameterizations
  13.2 Convexity of neural network spaces
  13.3 Closedness and best-approximation property

14 Generalization properties of deep neural networks
  14.1 Learning setup
  14.2 Empirical risk minimization
  14.3 Generalization bounds
  14.4 Generalization bounds from covering numbers
  14.5 Covering numbers of deep neural networks
  14.6 The approximation-complexity trade-off
  14.7 PAC learning from VC dimension
  14.8 Lower bounds on achievable approximation rates

15 Generalization in the overparameterized regime
  15.1 The double descent phenomenon
  15.2 Size of weights
  15.3 Theoretical justification
  15.4 Double descent for neural network learning

16 Robustness and adversarial examples
  16.1 Adversarial examples
  16.2 Bayes classifier
  16.3 Affine classifiers
  16.4 ReLU neural networks
  16.5 Robustness

A Probability theory
  A.1 Sigma-algebras and measures
  A.2 Random variables
  A.3 Conditionals, marginals, and independence
  A.4 Concentration inequalities

B Functional analysis
  B.1 Vector spaces
  B.2 Fourier transform

References

Preface

This book serves as an introduction to the key ideas in the mathematical analysis of deep learning. It is designed
to help researchers quickly familiarize themselves with the area and to provide a foundation for the development of
university courses on the mathematics of deep learning. Our main goal in the composition of this book was to present
various rigorous, but easy to grasp, results that help build an understanding of why certain phenomena arise in deep
learning. To achieve this, we prioritize simplicity over generality.
Since we envision this material as the basis for a mathematical introduction to deep learning, it is certainly not an
entirely accurate representation of the entire field and some important research directions are missing. In particular,
we have favored mathematical results over empirical research, even though an accurate account of the theory of deep
learning requires both.
The book is intended for students and researchers in mathematics and related areas. While we believe that every
diligent researcher or student will be able to work through this manuscript, it should be emphasized that a familiarity
with linear algebra, probability theory, and basic functional analysis will enhance the reader’s experience. A review of
basic probability theory and functional analysis is provided in the appendix of this book.
This book is the result of a series of lectures that the authors gave. P.P. presented parts of this material in a lecture
titled “Neural Network Theory” at the University of Vienna. J.Z. taught “Theory of Deep Learning” at the University
of Heidelberg. The lecture notes of these courses formed the basis of this book.
The material in this book is structured according to the three main mathematical branches underlying deep learning
theory: approximation theory, optimization theory, and statistical learning theory. Chapter 1 presents the research area as
a whole and lists a set of questions relevant to understanding deep learning. Chapters 2–9 discuss various approximation-theoretical
aspects, Chapters 10–13 explain the optimization theory underlying deep learning, and the remaining
Chapters 14–16 address the statistical aspects of deep learning.
Successfully turning the material of this book into a lecture depends on multiple parameters: the overall length of
the lecture, the technical depth with which material will be presented, the background of the students, and the speed of
the lecturer. For reference, we list three ways to turn the material into a standard 2 × 90 minutes per week lecture over a
semester of 14 weeks. The fast pace is mostly applicable for an audience with very advanced students, the regular pace
has been successful for a class with mathematics master’s students, and the comfortable pace would be appropriate for
an audience with a broader background.
Many of our colleagues or students have contributed in various forms to this book through discussions or suggestions
for improvement. We would like to thank the following colleagues: Andrés Felipe Lerma Pineda, Jonathan Garcia
Rebellon, Martina Neuman, Matej Vedak, Martin Mauser, Davide Modesto, Tuan Quach ...
Finally, we thank the publication team of Springer Birkhäuser for their support in the development of this book.

Chapter | Fast     | Regular                          | Comfortable
1       | complete | complete                         | complete
2       | complete | complete                         | complete
3       | complete | complete                         | complete
4       | complete | sketch main ideas                | omit
5       | complete | complete                         | complete
6       | complete | complete                         | skip Section 6.4
7       | complete | sketch proof of Theorem 7.7      | sketch proof of Theorem 7.7
8       | complete | present only one of the sections | present only one of the sections
9       | complete | complete                         | skip Section 9.2
10      | complete | omit Sections 10.4 and 10.5      | omit Sections 10.4 and 10.5
11      | complete | complete                         | omit Section 11.6
12      | complete | complete                         | only state Definition 12.1
13      | complete | complete                         | only show Proposition 13.1
14      | complete | complete                         | omit Section 14.8
15      | complete | complete                         | complete
16      | complete | omit                             | omit

Three ways of turning the material into a 2 × 90 minutes per week lecture over 14 weeks. The presentation speed can be
fast, regular, or comfortable, and can be chosen depending on the audience or other circumstances.

Notation

In this section, we provide a summary of the notations used throughout the manuscript for the reader’s convenience.

Symbol | Description | Definition
R(ℎ) | risk of hypothesis ℎ | Definition 14.2
R̂_𝑆(ℎ) | empirical risk of ℎ for sample 𝑆 | Definition 14.4 / Equation (1.2.3)
L | general loss function |
𝜎 | general activation function |
𝜎sig | sigmoid activation function | Section 2.3
𝜎ReLU | ReLU activation function | Section 2.3
𝜎𝑎 | parametric ReLU activation function | Section 2.3
N_𝑑^𝑚(𝜎; 𝐿, 𝑛) | set of multilayer perceptrons with 𝑑-dimensional input, 𝑚-dimensional output, activation function 𝜎, depth 𝐿, and width 𝑛 | Definition 3.6
N_𝑑^𝑚(𝜎; 𝐿) | union of N_𝑑^𝑚(𝜎; 𝐿, 𝑛) over all 𝑛 ∈ N | Definition 3.6
N(𝜎; A, 𝐵) | set of neural networks with architecture A, activation function 𝜎, and all weights bounded in modulus by 𝐵 | Definition 12.1
PN(A, 𝐵) | parameter set of neural networks with architecture A and all weights bounded in modulus by 𝐵 | Definition 12.1
𝐷^𝜶 | partial derivative | ∂^{|𝜶|} / (∂𝑥_1^{𝛼_1} ⋯ ∂𝑥_𝑑^{𝛼_𝑑})
𝑅_𝜎 | realization map | Definition 12.1
size(Φ) | number of free network parameters in Φ | (2.2.1)
width(Φ) | width of Φ | Definition 2.1
depth(Φ) | depth of Φ | Definition 2.1
P𝑛 | short for P𝑛(R𝑑) |
P𝑛(R𝑑) | space of multivariate polynomials of degree 𝑛 in R𝑑 | Example 3.5
P | short for P(R𝑑) |
P(R𝑑) | space of multivariate polynomials in R𝑑 | Example 3.5
R | real numbers |
N | positive natural numbers |
R− | negative real numbers | (−∞, 0)
R−_0 | non-positive real numbers | (−∞, 0]
R+ | positive real numbers | (0, ∞)
R+_0 | non-negative real numbers | [0, ∞)
N0 | natural numbers including 0 |
𝐶^𝑘(Ω) | 𝑘-times differentiable functions from Ω → R |
𝐶^{𝑘,𝑠}(Ω) | 𝐶^𝑘(Ω) functions 𝑓 for which 𝑓^(𝑘) ∈ 𝐶^{0,𝑠}(Ω) |
𝐶^{0,𝑠}(Ω) | 𝑠-Hölder continuous functions from Ω → R |
1_𝐴 | indicator function of the set 𝐴 |
|𝐴| | cardinality of the set 𝐴 if 𝐴 is finite, otherwise Lebesgue measure of 𝐴 |
co(𝐴) | convex hull of a set 𝐴 |
∥·∥ | Euclidean norm for vectors and spectral norm for matrices |
∥·∥_𝑋 | norm on a vector space 𝑋 |
∥·∥_𝑝 | 𝑝-norm on R𝑑 |
∥·∥_∞ | ∞-norm on R𝑑 or supremum norm for functions |
∥𝑓∥_{𝐶^𝑘(𝐴)} | | sup_{𝑥∈𝐴} sup_{|𝜶|≤𝑘} ∥𝐷^𝜶 𝑓(𝑥)∥
⊗ | componentwise (Hadamard) product |
⊘ | componentwise (Hadamard) division |
𝑂(·) | Landau notation |
Lip(𝑓) | Lipschitz constant of a function 𝑓 | sup_{𝒙≠𝒚} ∥𝑓(𝒙) − 𝑓(𝒚)∥ / ∥𝒙 − 𝒚∥
𝐵_𝑟(𝒙) | ball of radius 𝑟 ≥ 0 around 𝒙 | {𝒚 : ∥𝒙 − 𝒚∥ ≤ 𝑟}
G(𝑀, 𝜀, 𝑋) | 𝜀-covering number of a set 𝑀 ⊂ 𝑋 | Definition 14.10
VCdim(H) | VC dimension of a set of functions H | Definition 14.16

Terminology | Description | Definition
activation function | todo | todo
biases | todo | todo
convex | todo | todo
loss function | todo | todo
𝐶-Lipschitz | a function 𝑓 : 𝑋 → 𝑌 is 𝐶-Lipschitz if it has Lipschitz constant at most 𝐶 | ∥𝑓(𝑥) − 𝑓(𝑦)∥_𝑌 ≤ 𝐶 ∥𝑥 − 𝑦∥_𝑋
model | a function Φ(𝒙, 𝒘) depending on parameters 𝒘 ∈ R𝑛, used to predict 𝑦 from 𝒙 ∈ R𝑑 | todo
𝜇-strongly convex | todo | todo
objective function | todo | todo
ReLU | todo | todo
parameters | adjustable variables 𝒘 used to tune a model Φ(·, 𝒘) | Definition 14.2
risk | todo | todo
𝐿-smooth | todo | todo
weights | todo | todo

Chapter 1
Introduction

1.1 Mathematics of deep learning

In 2012, a deep learning architecture revolutionized the field of computer vision by achieving unprecedented perfor-
mance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [106]. The deep learning architecture,
known as AlexNet, significantly outperformed all competing technologies. A few years later, in March 2016, a deep
learning-based architecture called AlphaGo defeated the best Go player at the time, Lee Sedol, in a five-game match
[190]. Go is a highly complex board game with a vast number of possible moves, making it a challenging problem for
artificial intelligence. Because of this complexity, many researchers believed that defeating a top human Go player was
a feat that would only be achieved decades later.
These breakthroughs, along with several others, sparked a surge of interest among scientists across all disciplines.
Likewise, a broad range of mathematicians started talking about deep learning. However, initially, there was a clear
consensus in the mathematics community: We do not understand why this technology works so well! In fact, there are
many mathematical reasons that, at least superficially, should prevent the observed success.
Over the past decade, the field has matured, and mathematicians can now claim that they understand more about
deep learning, even though many open questions remain. Recent years have seen significant progress in providing
explanations and insights into the inner workings of deep learning models. Before making these claims more precise,
we will give a high-level introduction to deep learning. This course will concentrate on deep learning in the so-called
supervised learning framework.

1.2 High-level overview of deep learning

Deep learning describes the act of training deep neural networks with gradient-based methods to identify an unknown
input-output relation. This approach has three key ingredients: deep neural networks, gradient-based training, and
prediction. We will now explain each of these ingredients separately.
• Neural networks: Deep neural networks are based on combining multiple neurons. A neuron is a function of the
form

R𝑑 ∋ 𝒙 ↦→ 𝜈(𝒙) = 𝜎(𝒘 ⊤ 𝒙 + 𝑏), (1.2.1)

where 𝒘 ∈ R𝑑 is a weight vector, 𝑏 ∈ R is called bias, and the function 𝜎 is a so-called activation function.
This concept is due to McCulloch and Pitts [125] and is a model for biological neurons. If one thinks of 𝜎 as
the Heaviside function, i.e., 𝜎 = 1R+ , then the neuron "fires" if the weighted sum of the inputs 𝒙 surpasses the
threshold −𝑏. We depict a neuron in Figure 1.1. Note that if we fix 𝑑 and 𝜎, then the set of neurons can be naturally
parameterized by 𝑑 + 1 real values.


Fig. 1.1: Illustration of a single neuron 𝜈. This neuron has six inputs (𝑥1, . . . , 𝑥6) = 𝒙. A weighted sum of these inputs
is taken, where the weights are given by 𝒘. Afterward, a bias term 𝑏 is added, and an activation function 𝜎 is
applied.

Neural networks are functions that result from taking the outputs of certain neurons as inputs for other neurons. One
type of neural network which is computationally very convenient is the feedforward neural network. This structure
distinguishes itself by having the neurons grouped in layers, such that the inputs to neurons in the (ℓ + 1)-st layer are
exclusively outputs of neurons from the ℓ-th layer.
We start by defining a shallow feedforward neural network as an affine transformation applied to the output of
a set of neurons that share the same input and the same activation function. Here, an affine transformation is a
map ℎ from R 𝑝 to R𝑞 where 𝑝, 𝑞 ∈ N such that ℎ(𝒙) = 𝑾𝒙 + 𝒃, where 𝑾 is a 𝑞 × 𝑝 matrix and 𝒃 ∈ R𝑞 .
Formally, a shallow feedforward neural network is, therefore, a map Φ of the form

R𝑑 ∋ 𝒙 ↦→ Φ(𝒙) = 𝑇2 𝜎(𝑇1 (𝒙)),

where 𝑇1 , 𝑇2 are affine transformations and the application of 𝜎 is understood to be in each coordinate of 𝑇1 (𝒙). A
visualization of a shallow neural network is given in Figure 1.2.
A deep feedforward neural network is constructed by taking the composition of many shallow neural networks.
One then calls the number of compositions of shallow neural networks that were performed to build the deep neural
network the number of layers.
Since we observed at the beginning that the set of neurons can be viewed as a parameterized set of functions, we
can conclude the same for deep feedforward neural networks if all dimensions of the affine transformations within
the network are fixed.
• Gradient-based training: The second step of deep learning consists in identifying one specific neural network from
a given set of neural networks. This selection is carried out by minimizing an objective function. In supervised
learning, which is what we will focus on in this lecture, the objective is given based on a collection of input-output
pairs that we call a sample. Concretely, let us say we are given 𝑆 = (𝒙𝑖 , 𝒚𝑖 )_{𝑖=1}^{𝑚}, where 𝒙𝑖 ∈ R𝑑 and 𝒚𝑖 ∈ R𝑘 for
𝑑, 𝑘 ∈ N. We attempt to find a deep neural network Φ such that for all 𝑖 = 1, . . . , 𝑚

Φ(𝒙𝑖 ) ≈ 𝒚 𝑖 (1.2.2)

in a reasonable sense. For example, we could require "≈" to mean close with respect to the Euclidean norm, or
more generally, that L (Φ(𝒙𝑖 ), 𝒚 𝑖 ) is small for a function L such that L (𝒚, 𝒚) = 0 for all 𝒚 in the domain of L. We
call such a function L a loss function. The standard way of modeling the requirement (1.2.2) is by minimizing the
so-called empirical risk of Φ defined by

Fig. 1.2: Illustration of a shallow neural network. Here it is assumed that the affine transformation 𝑇1 is of the form
(𝑥1 , . . . , 𝑥 6 ) = 𝒙 ↦→ 𝑾𝒙 + 𝒃, where the rows of 𝑾 are 𝒘 (1) , 𝒘 (2) , 𝒘 (3) . Comparing with Figure 1.1, we see that a
shallow neural network is formed by applying an affine transformation to a collection of neurons with the same input
and with the same activation function.

R̂_𝑆 (Φ) = (1/𝑚) ∑_{𝑖=1}^{𝑚} L(Φ(𝒙𝑖 ), 𝒚𝑖 ).    (1.2.3)

If L is differentiable, and for all 𝒙 𝑖 , it holds that Φ(𝒙 𝑖 ) depends in a differentiable (or at least subdifferentiable)
way on the parameters of the neural network, then there exists a (sub-)gradient of R̂_𝑆 (Φ) for each parameter
value of the neural network Φ. In fact, this gradient (or at least an unbiased estimator of this gradient) can be
efficiently computed using a technique called backpropagation. As a result, we can attempt to minimize (1.2.3)
by (stochastic) gradient descent, which will produce a sequence of neural networks Φ1 , Φ2 , . . . for which we
hope that the empirical risk will decay. A sketch of how the resulting sequence of neural networks could behave is
depicted in Figure 1.3.
• Prediction: The final part of deep learning concerns the question of whether we have actually learned something by the
procedure above. Assume that the optimization routine has either converged or we have stopped the iteration at
some point, so that we end up with a neural network Φ∗ .
Even though this was the objective of the optimization, we do not actually care how well Φ∗ performs on 𝑆, because
we already know 𝑆. Our interest lies in the performance of Φ∗ on a new data point (𝒙new , 𝒚new ). To make any
kind of meaningful statement about the performance of Φ∗ outside of 𝑆, we need to assume that there is some
relationship between 𝑆 and other data points. The standard way of approaching this is to assume that there is a
data distribution D on the input-output space—in our case, this is R𝑑 × R 𝑘 —such that both the elements of 𝑆
and all other considered data points follow D.
In other words, 𝑆 is considered a random variable whose entries are drawn i.i.d. from D, and (𝒙new , 𝒚new ) ∼ D
independently of 𝑆. If we would like Φ∗ to perform well on average, then this amounts to controlling the following
expression

R (Φ∗ ) = E ( 𝒙new ,𝒚new )∼D (L (Φ∗ (𝒙 new ), 𝒚 new )). (1.2.4)

The expression R (Φ∗ ) is called the risk of Φ∗ . If the risk is not much larger than the empirical risk, then we say
that the neural network Φ∗ has a small generalization error. On the other hand, if the risk is much larger than the
empirical risk, then we say that Φ∗ overfits.
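The following Python sketch (our own illustration, not part of the book; the network sizes, the toy sample, and the finite-difference gradient are arbitrary simplifications) ties the three ingredients above together: it sets up a small shallow network, evaluates the empirical risk (1.2.3) with the squared Euclidean loss on a toy sample, and performs a single gradient-descent step. In practice the gradient is computed by backpropagation (Chapter 10) rather than by finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def shallow_net(x, W1, b1, W2, b2):
    # shallow network T2(sigma(T1(x))) with affine maps T1, T2 and sigmoid activation
    return W2 @ sigmoid(W1 @ x + b1) + b2

def empirical_risk(S, params):
    # (1/m) * sum_i L(Phi(x_i), y_i) with L(y', y) = ||y' - y||^2, cf. (1.2.3)
    return np.mean([np.sum((shallow_net(x, *params) - y) ** 2) for x, y in S])

# toy sample S = (x_i, y_i) with x_i in R^2 and y_i in R
S = [(rng.normal(size=2), rng.normal(size=1)) for _ in range(20)]

# a width-5 shallow network with randomly initialized parameters
W1, b1 = rng.normal(size=(5, 2)), np.zeros(5)
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)
params = [W1, b1, W2, b2]

# one gradient-descent step; the gradient is approximated entrywise by central differences
lr, eps = 0.1, 1e-6
grads = []
for p in params:
    g = np.zeros_like(p)
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + eps
        risk_plus = empirical_risk(S, params)
        p[idx] = old - eps
        risk_minus = empirical_risk(S, params)
        p[idx] = old
        g[idx] = (risk_plus - risk_minus) / (2 * eps)
    grads.append(g)

print("empirical risk before the step:", empirical_risk(S, params))
for p, g in zip(params, grads):
    p -= lr * g
print("empirical risk after the step: ", empirical_risk(S, params))
```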

Fig. 1.3: A sequence of one dimensional neural networks Φ1 , . . . , Φ4 is shown that successively minimizes the empirical
risk for the sample 𝑆 = (𝑥𝑖 , 𝑦𝑖 )_{𝑖=1}^{6}.

1.3 Why does it work?

It is natural to wonder why the deep learning pipeline, as outlined in the previous subsection, ultimately succeeds in
learning, i.e., achieving a small risk. Is it true that for a given sample (𝒙𝑖 , 𝒚𝑖 )_{𝑖=1}^{𝑚} there exists a neural network Φ such that
Φ(𝒙𝑖 ) ≈ 𝒚𝑖 for all 𝑖 = 1, . . . , 𝑚? Does the optimization routine produce anything meaningful? Is there any reason to
assume that we can control the risk, knowing only that the empirical risk is small?
While most of these questions can be answered affirmatively in certain regimes, these regimes often do not apply to
deep learning in practice. Below we list some explanations and show that they lead to even more questions:
• Approximation: One of the most fundamental results on neural networks is the so-called universal approximation
theorem, which will be discussed in Chapter 3. This result states that every continuous function on a compact
domain can be arbitrarily well uniformly approximated by a shallow neural network. For example, for a continuous
loss function L, this implies that we can get the empirical risk arbitrarily small for every conceivable sample if we
only choose a neural network that is sufficiently large.
This result, however, does not answer a lot of questions that are more specific to deep learning, such as the question
of efficiency. For example, if we want to build a computationally efficient tool, then we may be interested in the
smallest neural network to fit the data. In this context, it is, of course, reasonable to ask: What is the role of
the architecture and specifically the depth in this point-fitting problem? Also, if we consider reducing the
empirical risk an approximation problem, we are confronted with one of the main issues of approximation theory,
which is the curse of dimensionality. Function approximation in high dimensions is extremely hard and gets
exponentially harder with increasing dimensions. Why do deep neural networks not seem to suffer from this
curse?
• Optimization: While gradient descent can sometimes be shown to converge, as we will discuss in Chapter 10, it
typically requires the objective function to be at least convex. There is no reason to assume that the relationship
between the parameterization of a neural network and the output of a neural network on a sample should be a
convex map. In fact, due to the many compositions, we would expect this function to be highly non-linear and
not convex. Therefore, if the optimization routine converges, there is no guarantee that it has not gotten stuck in a
local (and non-global) minimum or even just a saddle point. Why is the output of the optimization nonetheless
often very meaningful in practice?

• Generalization: In traditional statistical learning theory, which we will review in Chapter 14, it is established
that the square of the generalization error, i.e., the extent to which the risk exceeds the empirical risk, can be
bounded roughly by the complexity of the set of admissible functions for the learning procedure divided by the
number of training samples. In our case, this complexity roughly corresponds to the number of parameters of the
neural networks. Much more precise statements will be shown in Chapter 14. However, in practical applications
of deep learning, it is typically the case that neural networks with more parameters than training samples are
used, which is dubbed the overparameterized regime. In this regime, the classical estimates are void. Why is it
that, nonetheless, deep overparameterized architectures consistently make extremely accurate predictions
on unseen data? In addition, while deep architectures often generalize well, they sometimes fail spectacularly
on specially chosen examples. Especially in image classification tasks, these data points differ from perfectly
classified data points only in ways that are not or just barely visible to the human eye. Such objects are referred to
as adversarial examples and it is puzzling why deep learning gives rise to this phenomenon.
In this course, we will present answers to all the questions above. How, where, and in what form will be specified
in the following subsection.

1.4 Course philosophy and outline

In this course, we give answers to the questions raised in the previous subsection. We hope that these answers are such
that they convince mathematicians. This means that we will focus on provable statements. Moreover, the explanations
will be as simple as possible. We refrain from stating results in the most general way but typically focus on illustrating
special cases.
We describe the structure of the course below. We highlight the answers to the open questions from the previous
section in boldface.
• Chapter 2: Feedforward neural networks. In the first chapter, we will introduce the main object of study of this
work—the feedforward neural network.
• Chapter 3: Universal approximation. We discuss the classical view of the approximation of functions by neural
networks. We discuss two instances of universal approximation, i.e., the ability of neural networks to approximate
every continuous function as well as possible. First, under very mild assumptions on the activation function, the
universal approximation theorem is shown, which states that for a given function on a compact domain and an
accuracy 𝜖 > 0, there exists a neural network that approximates the function up to a uniform error of 𝜖. In this
result, we do not know how the size of the neural network depends on 𝜖. Second, we show that, under very strong
assumptions on the activation function, we can even represent every continuous function 𝑓 on a compact domain
exactly by a neural network, and the size of the network is independent of 𝑓 .
• Chapter 4: Splines. Next, we study approximation rates, i.e., the extent to which certain functions can be approxi-
mated by neural networks relative to the number of free parameters that the neural networks have. We will see that,
for so-called sigmoidal activation functions, we can study this question by showing a link between neural-network-
and spline approximation. Here we will see that the smoother a function, the fewer parameters a neural network
needs to approximate it. However, to unlock this efficient approximation, we require deeper neural networks. This
observation offers a first glimpse into the role of depth in deep learning.
• Chapter 5: ReLU neural networks. In this part, we focus on one of the most frequently used activation functions
in practice—the ReLU. We will establish a link between reasonably shallow neural networks with that activation
function and continuous piecewise linear functions. We will then show how efficient these functions are in
approximating Hölder continuous functions.
• Chapter 6: Affine pieces for ReLU neural networks. Having gained some intuition about what ReLU neural networks
can do, we next identify certain limits of their capacity. One tool to achieve this is to count the number of affine
pieces that these neural networks generate. The main insight of this section is that shallow neural networks cannot
generate a lot of pieces, while deep networks may be able to. This observation is a further explanation of the
role of depth in deep learning.

• Chapter 7: Deep ReLU neural networks. Having found a potential advantage of deep neural networks, we next study
if this advantage materializes. Indeed, we will see that deep neural networks achieve substantially better approxi-
mation rates than shallow neural networks for smooth functions. This observation adds to the understanding of
the role of depth in deep learning.
• Chapter 8: High-dimensional approximation. In this chapter, we consider three assumptions on the data that allow
deep learning to overcome the curse of dimensionality.
• Chapter 9: Interpolation. Before we proceed to study the training of deep neural networks, we take a look at the
objective in training, which is the empirical error. To reduce this error, a neural network does not necessarily need
to approximate a target function well but only needs to interpolate the training data. We analyze under which
conditions this is possible. In this context, we find an alternative learning algorithm for ReLU neural networks that
satisfies an intriguing optimality condition.
• Chapter 10: Training. Next, we study the training of deep neural networks. First, we discuss the basics of (stochastic)
gradient descent, and then we discuss more specific neural network-related issues, such as how to compute the
gradient using the backpropagation rule.
• Chapter 11: Wide networks and the neural tangent kernel. We present one tool to analyze the training behavior
of neural networks—the neural tangent kernel. Here we show that if neural networks are sufficiently overparam-
eterized, then the optimization procedure will succeed in finding a global minimum of the empirical risk. This
answers why we can optimize neural networks despite a non-linear and non-convex objective function.
• Chapter 12: Loss landscape analysis. In this chapter, we present an alternative view on the optimization problem
by analyzing the loss landscape, i.e., the empirical risk as a function of the weights and biases of a neural network
architecture. Analyzing this landscape will allow us to understand whether there are non-global minima in which
the optimization routine can get trapped. We will see theoretical arguments showing that the more overparameterized
a network architecture becomes, the more connected the valleys or basins of the landscape get. In other words, the
more overparameterized a network architecture, the easier it is to reach a region in which the only minima are global
minima. In addition, we will observe that most stationary points associated with non-globally minimal empirical risk
values are saddle points instead of minima. This sheds further light on the empirically observed fact that deep
architectures can often be optimized without getting stuck in non-global minima.
• Chapter 13: Shape of neural network spaces. Next, we will provide a counterpoint to Chapters 11 and 12, by
showing that the set of neural networks with fixed architecture has a couple of undesirable properties. We will see
that in many cases, this is a non-convex set. Also it does not possess the best-approximation property, i.e., for a
given function, there is not necessarily a neural network that approximates this function best.
• Chapter 14 : Generalization properties of deep neural networks. To understand the question of why deep neural
networks successfully predict unseen data points, we first study classical statistical learning theory. This is done
specifically for neural networks, and we show how to establish generalization bounds for deep learning.
• Chapter 15: Generalization in the overparameterized regime. The generalization bounds of the previous chapter
are not meaningful when the number of parameters of a neural network becomes larger than the number of samples.
However, this is the case in most practical applications. We will describe the phenomenon of double descent and
demonstrate one explanation for it. This then addresses the question of why deep neural networks perform
well despite being highly overparameterized.
• Chapter 16: Robustness and adversarial examples. In the last chapter, we study the existence of adversarial
examples. We show theoretical explanations of why adversarial examples arise, as well as some strategies to
prevent them.

1.5 Material not covered in this course

This course studies some central topics of deep learning but leaves out even more. Interesting questions associated with
the field that were omitted, as well as some pointers to related works are listed below:

• Advanced architectures: The deep feedforward neural network is by far not the only type of neural network that
people consider. In practice, architectures are adapted to the type of data. For example, images exhibit strong
spatial dependencies in the sense that adjacent pixels very often have very close values. If each pixel is considered
an input into a neural network, then a feedforward neural network interprets each pixel as an independent variable.
On the other hand, a convolutional neural network [112] will apply a series of convolutional filters that aggregate
information from neighboring pixels. This fits the data structure better. Similarly, if the data is graph-based, then
so-called graph neural networks [27] are used. If the data is sequential, such as natural language, then architectures
with some form of memory are used, e.g., LSTMs [83], or attention-based architectures, e.g., transformers [208].
• Interpretability/Explainability: Considering a neural network Φ that produces an output Φ(𝒙), it seems very
desirable to be able to ask "Why?" and get an understandable answer. For example, if Φ predicts a runner’s
expected 100m time based on their training history, it would be very valuable to know the reason for predicting a
slow (or fast) time. Then, the athlete could adapt their training accordingly. The reader can certainly think of many
examples where receiving such an explanation would not only be desirable but necessary.
For deep neural networks, producing a humanly understandable explanation of its prediction is typically not that
easy. This is because, even though all weights are known, the repeated application of large matrices and activation
functions is so complex that these systems are typically considered black boxes. A comprehensive overview of
various techniques, not only for deep neural networks, can be found in [131].
• Unsupervised or Reinforcement Learning: As mentioned before, we consider the case where every data point has
a label. This is not the case in many applications. Indeed, there is a vast field of machine learning where
no labels are involved. This is called unsupervised learning. Classical fields in this regime are clustering and
dimensionality reduction [188, Chapters 22/23].
A popular area in deep learning, where no labels are used, is physics-informed neural networks [165]. Here, a
neural network is tasked to satisfy a partial differential equation, and the loss measures how well it does.
Finally, reinforcement learning is a technique where an agent can interact with an environment and will receive
feedback based on its actions. The actions are governed by a so-called policy, which is to be learned, [130, Chapter
17]. In deep reinforcement learning, this policy is modeled by a deep neural network. Reinforcement learning is
the basis of the aforementioned AlphaGo.
• Implementation: Deep learning is a field driven mainly by applications. While this course focuses on provable
theoretical results, a complete understanding of deep learning cannot be achieved without practical experience.
For this, there are countless resources with excellent datasets and explanations available, which are better than any
introduction in this course could ever be. We recommend [36, 161] as well as countless online tutorials that are
just a Google (or alternative) search away.
• Many more: The field is evolving rapidly, and new ideas are constantly being generated and tested. This course
cannot give a complete overview. However, we hope that it supplies the reader with enough fundamental knowledge
and principles that they have the ability to find and understand newer works quickly.

Chapter 2
Feed-forward neural networks

Feedforward neural networks, henceforth simply referred to as neural networks (NNs), constitute the central object
of study of this book. In this chapter we provide a formal definition of neural networks, discuss the size of a neural
network, and give a brief overview of common activation functions.

2.1 Formal definition

In this section, we will refine and extend the definition given in the introduction. We have already seen the definition
of a neuron 𝜈 in (1.2.1) and Figure 1.1. A neural network is formed by the concatenation of such neurons. Let us now
make precise this concatenation procedure.
Definition 2.1 Let 𝐿 ∈ N, 𝑑0 , . . . , 𝑑 𝐿+1 ∈ N, and let 𝜎 : R → R. A function Φ : R𝑑0 → R𝑑𝐿+1 is called a neural
network if there exist matrices 𝑾 (ℓ ) ∈ R𝑑ℓ+1 ×𝑑ℓ and vectors 𝒃 (ℓ ) ∈ R𝑑ℓ+1 , ℓ = 0, . . . , 𝐿, such that with

𝒙^(0) := 𝒙                                                          (2.1.1a)
𝒙^(ℓ) := 𝜎(𝑾^(ℓ−1) 𝒙^(ℓ−1) + 𝒃^(ℓ−1))    for ℓ ∈ {1, . . . , 𝐿}    (2.1.1b)
𝒙^(𝐿+1) := 𝑾^(𝐿) 𝒙^(𝐿) + 𝒃^(𝐿)                                      (2.1.1c)

it holds that

Φ(𝒙) = 𝒙^(𝐿+1)    for all 𝒙 ∈ R𝑑0 .

We call 𝐿 the depth, 𝑑max = maxℓ=1,...,𝐿 𝑑ℓ the width, 𝜎 the activation function, and (𝜎; 𝑑0 , . . . , 𝑑 𝐿+1 ) the
architecture of the neural network Φ. Moreover, 𝑾 (ℓ ) ∈ R𝑑ℓ+1 ×𝑑ℓ are the weight matrices and 𝒃 (ℓ ) ∈ R𝑑ℓ+1 the bias
vectors of Φ for ℓ = 0, . . . 𝐿.
Remark 2.2 Typically, there exist different choices of architectures, weights and biases yielding the same function
Φ : R𝑑0 → R𝑑𝐿+1 . For this reason we cannot associate a unique meaning to these notions solely based on the function
realized by Φ. In the following, when we refer to the properties of a neural network Φ, it is always understood to mean
that there exists at least one construction as in Definition 2.1, which realizes the function Φ and uses parameters that
satisfy those properties.
The architecture of a neural network is often depicted as a connected graph, as illustrated in Figure 2.1. The nodes in
such graphs represent (the output of) the neurons. They are arranged in layers, with 𝒙 (ℓ ) in Definition 2.1 corresponding
to the neurons in layer ℓ. We also refer to 𝒙 (0) in (2.1.1a) as the input layer and to 𝒙 (𝐿+1) in (2.1.1c) as the output
layer. All layers in between are referred to as the hidden layers and their output is given by (2.1.1b). The number of
hidden layers corresponds to the depth. For the correct interpretation of such graphs, we note that by our conventions
in Definition 2.1, the activation function is applied after each affine transformation, except in the final layer.
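To make the recursion (2.1.1) concrete, the following Python sketch (illustrative and not part of the book; the helper name realize and the use of NumPy are our own choices) evaluates a neural network from given weight matrices and bias vectors. The activation is applied componentwise after every affine map except the last one, matching the convention just described.

```python
import numpy as np

def realize(weights, biases, sigma, x):
    """Evaluate the neural network of Definition 2.1.

    weights: list [W^(0), ..., W^(L)], with W^(l) of shape (d_{l+1}, d_l)
    biases:  list [b^(0), ..., b^(L)], with b^(l) of shape (d_{l+1},)
    sigma:   activation function, applied componentwise
    x:       input vector of shape (d_0,)
    """
    for W, b in zip(weights[:-1], biases[:-1]):  # hidden layers, cf. (2.1.1b)
        x = sigma(W @ x + b)
    return weights[-1] @ x + biases[-1]          # output layer, cf. (2.1.1c): no activation

# example: architecture (sigma; 3, 4, 3, 4, 2) as in Figure 2.1, i.e. depth three and width four
rng = np.random.default_rng(1)
dims = [3, 4, 3, 4, 2]
weights = [rng.normal(size=(dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
biases = [rng.normal(size=dims[l + 1]) for l in range(len(dims) - 1)]

relu = lambda z: np.maximum(z, 0)
print(realize(weights, biases, relu, rng.normal(size=3)))  # a point in R^2
```

Note that the activation is applied exactly depth-many times, in line with the convention that the depth equals the number of applications of the activation function.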

Throughout, we only consider neural networks in the sense of Definition 2.1. We emphasize however, that this is just
one (simple but very common) type of network. Many adjustments to this construction are possible and also widely
used. For example:
• We may use different activation functions 𝜎ℓ in each layer ℓ or we may even use a different activation function
for each node.
• Residual neural networks allow “skip connections”. This means that information is allowed to skip layers in the
sense that the nodes in layer ℓ may have 𝒙 (0) , . . . , 𝒙 (ℓ −1) as their input (and not just 𝒙 (ℓ −1) ), cf. (2.1.1).
• In contrast to feedforward neural networks, recurrent neural networks allow information to flow backward, in the
sense that 𝒙 (ℓ −1) , . . . , 𝒙 (𝐿+1) may serve as input for the nodes in layer ℓ (and not just 𝒙 (ℓ −1) ). This creates loops in
the flow of information, and one has to introduce a time index 𝑡 ∈ N, as the output of a node in time step 𝑡 might
be different from the output in time step 𝑡 + 1.
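As a small illustration of the residual idea mentioned in the list above, the following sketch (our own, not the book's definition of a residual network) adds a skip connection whenever the dimensions of consecutive hidden layers match; it only modifies the recursion (2.1.1b).

```python
import numpy as np

def realize_residual(weights, biases, sigma, x):
    # like the plain forward pass, but each hidden state is added back when shapes allow
    for W, b in zip(weights[:-1], biases[:-1]):
        z = sigma(W @ x + b)
        x = z + x if z.shape == x.shape else z  # skip connection
    return weights[-1] @ x + biases[-1]

# tiny example with architecture (sigma; 3, 4, 4, 2)
relu = lambda z: np.maximum(z, 0)
Ws = [np.eye(4, 3), np.eye(4), np.eye(2, 4)]
bs = [np.zeros(4), np.zeros(4), np.zeros(2)]
print(realize_residual(Ws, bs, relu, np.ones(3)))
```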

Fig. 2.1: Sketch of a neural network with three hidden layers, and 𝑑0 = 3, 𝑑1 = 4, 𝑑2 = 3, 𝑑3 = 4, 𝑑4 = 2. The network
has depth three and width four.

With our convention, the depth of the network corresponds to the number of applications of the activation function.
Throughout we only consider networks of depth at least one. Neural networks of depth one are called shallow; if the
depth is larger than one, they are called deep. The notion of deep neural networks is not used entirely consistently in
the literature, and some authors use the word deep only in case the depth is much larger than one, where the precise
meaning of “much larger” depends on the application.
There are various ways in which neural networks can be combined with one another. The next proposition addresses this
for linear combinations, compositions, and parallelization. The formal proof, which is a good exercise to familiarize
oneself with neural networks, is left as Exercise 2.5.
Proposition 2.3 For two neural networks Φ1 , Φ2 with architectures

(𝜎; 𝑑^1_0 , 𝑑^1_1 , . . . , 𝑑^1_{𝐿1+1})   and   (𝜎; 𝑑^2_0 , 𝑑^2_1 , . . . , 𝑑^2_{𝐿2+1})

respectively, it holds that

(i) for all 𝛼 ∈ R there exists a neural network Φ𝛼 with architecture (𝜎; 𝑑^1_0 , 𝑑^1_1 , . . . , 𝑑^1_{𝐿1+1}) such that

Φ𝛼 (𝒙) = 𝛼Φ1 (𝒙)    for all 𝒙 ∈ R^{𝑑^1_0},

(ii) if 𝑑^1_0 = 𝑑^2_0 =: 𝑑0 and 𝐿1 = 𝐿2 =: 𝐿, then there exists a neural network Φparallel with architecture (𝜎; 𝑑0 , 𝑑^1_1 + 𝑑^2_1 , . . . , 𝑑^1_{𝐿+1} + 𝑑^2_{𝐿+1}) such that

Φparallel (𝒙) = (Φ1 (𝒙), Φ2 (𝒙))    for all 𝒙 ∈ R^{𝑑0},

(iii) if 𝑑^1_0 = 𝑑^2_0 =: 𝑑0 , 𝐿1 = 𝐿2 =: 𝐿, and 𝑑^1_{𝐿+1} = 𝑑^2_{𝐿+1} =: 𝑑_{𝐿+1}, then there exists a neural network Φsum with architecture (𝜎; 𝑑0 , 𝑑^1_1 + 𝑑^2_1 , . . . , 𝑑^1_𝐿 + 𝑑^2_𝐿 , 𝑑_{𝐿+1}) such that

Φsum (𝒙) = Φ1 (𝒙) + Φ2 (𝒙)    for all 𝒙 ∈ R^{𝑑0},

(iv) if 𝑑^1_{𝐿1+1} = 𝑑^2_0, then there exists a neural network Φcomp with architecture (𝜎; 𝑑^1_0 , 𝑑^1_1 , . . . , 𝑑^1_{𝐿1}, 𝑑^2_1 , . . . , 𝑑^2_{𝐿2+1}) such that

Φcomp (𝒙) = Φ2 ◦ Φ1 (𝒙)    for all 𝒙 ∈ R^{𝑑^1_0}.
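The constructions behind Proposition 2.3 are instructive. As an illustration of part (ii), the following sketch (our own, under the assumption that both networks share the same input dimension and depth) stacks the first-layer weights and continues with block-diagonal weight matrices and concatenated biases; combined with the realize function from the earlier sketch, one can check numerically that the resulting network computes (Φ1 (𝒙), Φ2 (𝒙)).

```python
import numpy as np

def block_diag(A, B):
    # place A and B on the diagonal of a larger zero matrix
    out = np.zeros((A.shape[0] + B.shape[0], A.shape[1] + B.shape[1]))
    out[:A.shape[0], :A.shape[1]] = A
    out[A.shape[0]:, A.shape[1]:] = B
    return out

def parallelize(weights1, biases1, weights2, biases2):
    """Weights and biases of a network computing x -> (Phi1(x), Phi2(x)),
    assuming equal input dimension d_0 and equal depth L."""
    W0 = np.vstack([weights1[0], weights2[0]])      # both first layers act on the same input x
    b0 = np.concatenate([biases1[0], biases2[0]])
    weights = [W0] + [block_diag(W1, W2) for W1, W2 in zip(weights1[1:], weights2[1:])]
    biases = [b0] + [np.concatenate([b1, b2]) for b1, b2 in zip(biases1[1:], biases2[1:])]
    return weights, biases
```

The block-diagonal structure ensures that the two sub-networks never exchange information, which is exactly why the parallel network has the summed widths stated in part (ii).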

2.2 Notion of size

Neural networks provide a framework to parametrize functions. Ultimately, our goal is to find a network Φ that fits
some underlying input-output relation. Typically, the network architecture (i.e., depth, width and activation function) is
chosen a priori and considered fixed. During “training” of the network, the entries of the weight matrices and bias vectors
are suitably adapted by some training algorithm. Depending on the application, on top of the stated architecture choices,
further restrictions on the weights and biases can be desirable. For example, the following two appear frequently:
• weight sharing: This is an assumption of the type 𝑊^(𝑖)_{𝑘,𝑙} = 𝑊^(𝑗)_{𝑠,𝑡}, i.e., we impose a priori that the entry (𝑘, 𝑙) of
the 𝑖-th weight matrix is equal to the entry at position (𝑠, 𝑡) of the 𝑗-th weight matrix. We denote this assumption by
(𝑖, 𝑘, 𝑙) ∼ (𝑗, 𝑠, 𝑡), paying tribute to the trivial fact that “∼” is an equivalence relation. Such conditions can also be
imposed on the bias entries.
• sparsity: This is an assumption of the type 𝑊^(𝑖)_{𝑘,𝑙} = 0 for certain (𝑘, 𝑙, 𝑖), i.e., we impose a priori that entry (𝑘, 𝑙)
of the 𝑖-th weight matrix is 0. Similarly, such a condition can be imposed on the entries of the bias vectors. The
condition 𝑊^(𝑖)_{𝑘,𝑙} = 0 corresponds to node 𝑙 of layer 𝑖 − 1 not serving as an input to node 𝑘 in layer 𝑖. If we sketch the
network graph, this is indicated by not connecting these two nodes.
Both of these restrictions decrease the number of learnable parameters in the network. The number of parameters can
be seen as a measure of the complexity of the represented function class. For this reason, we introduce size(Φ) as a
notion for the number of learnable parameters. Formally (with |𝑆| denoting the cardinality of a set 𝑆):

Definition 2.4 Let Φ be as in Definition 2.1. Then the size of Φ is


 
size(Φ) := |({(𝑖, 𝑘, 𝑙) : 𝑊^(𝑖)_{𝑘,𝑙} ≠ 0} ∪ {(𝑖, 𝑘) : 𝑏^(𝑖)_𝑘 ≠ 0}) / ∼|.    (2.2.1)
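For networks without weight sharing, every equivalence class of “∼” is a singleton, and (2.2.1) reduces to counting the nonzero entries of the weight matrices and bias vectors. A minimal sketch under this assumption (our own; the helper name and the example architecture are illustrative):

```python
import numpy as np

def size_of(weights, biases):
    """Number of free parameters as in (2.2.1), assuming no weight sharing,
    so that every equivalence class of "~" is a singleton."""
    return sum(int(np.count_nonzero(W)) for W in weights) + \
           sum(int(np.count_nonzero(b)) for b in biases)

# a fully connected network with architecture (sigma; 3, 4, 4, 2) and no zero entries
dims = [3, 4, 4, 2]
rng = np.random.default_rng(2)
weights = [rng.uniform(0.1, 1.0, size=(dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
biases = [rng.uniform(0.1, 1.0, size=dims[l + 1]) for l in range(len(dims) - 1)]
print(size_of(weights, biases))  # 3*4 + 4 + 4*4 + 4 + 4*2 + 2 = 46
```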

2.3 Activation functions

Activation functions are a crucial part of deep neural networks, as they introduce nonlinearity into the model. If an
affine function were used as the activation function, the resulting neural network would also be affine and hence very
restricted in what it can represent.
The choice of activation function can have a significant impact on the performance, but there does not seem to be a
universally optimal choice. Below we discuss a few important activation functions and highlight some common issues
associated with them.
Sigmoid: The sigmoid activation function is given by
𝜎sig (𝑥) = 1 / (1 + 𝑒^{−𝑥})    for 𝑥 ∈ R,

Fig. 2.2: Different activation functions: (a) the sigmoid, (b) the ReLU and the SiLU, (c) the leaky ReLU for 𝑎 ∈ {0.05, 0.1, 0.2}.

and depicted in Figure 2.2 (a). Its output ranges between zero and one, making it interpretable as a probability. The
sigmoid is a smooth function, which allows the application of gradient-based training.
It has the disadvantage that its derivative becomes very small if |𝑥| → ∞. This can affect learning due to the
so-called vanishing gradient problem. Consider the simple network Φ𝑛 (𝑥) = 𝜎 ◦ · · · ◦ 𝜎(𝑥 + 𝑏) defined with 𝑛 ∈ N
compositions of 𝜎, and where 𝑏 ∈ R is a bias. Its derivative with respect to 𝑏 is

(d/d𝑏) Φ𝑛 (𝑥) = 𝜎′(Φ𝑛−1 (𝑥)) · (d/d𝑏) Φ𝑛−1 (𝑥).

If sup_{𝑥∈R} |𝜎′(𝑥)| ≤ 1 − 𝛿, then by induction, |(d/d𝑏) Φ𝑛 (𝑥)| ≤ (1 − 𝛿)^𝑛. The opposite effect happens for activation
functions with derivatives uniformly larger than one. This argument shows that the derivative of Φ𝑛 (𝑥) with respect
to 𝑏 can become exponentially small or exponentially large when propagated through the layers. This effect, known
as the vanishing or exploding gradient effect, also occurs for activation functions which do not admit the uniform
bounds assumed above. However, since the sigmoid activation function exhibits areas with extremely small gradients,
the vanishing gradient effect can be strongly exacerbated.
ReLU (Rectified Linear Unit): The ReLU is defined as

𝜎ReLU (𝑥) = max{𝑥, 0} for 𝑥 ∈ R,

and depicted in Figure 2.2 (b). It is piecewise linear, and due to its simplicity its evaluation is computationally very
efficient. It is one of the most popular activation functions in practice. Since its derivative is always zero or one, it does
not suffer from the vanishing gradient problem to the same extent as the sigmoid function. However, ReLU can suffer
from the so-called dead neurons problem. Consider the neural network

Φ(𝑥) = 𝜎ReLU (𝑏 − 𝜎ReLU (𝑥)) for 𝑥 ∈ R

depending on the bias 𝑏 ∈ R. If 𝑏 < 0, then Φ(𝑥) = 0 for all 𝑥 ∈ R. The neuron corresponding to the second application
d
of 𝜎ReLU thus produces a constant signal. Moreover, if 𝑏 < 0, d𝑏 Φ(𝑥) = 0 for all 𝑥 ∈ R. As a result, every negative
value of 𝑏 yields a stationary point of the empirical risk. A gradient-based method will not be able to further train the
parameter 𝑏. We thus refer to this neuron as a dead neuron.
SiLU (Sigmoid Linear Unit): An important difference between the ReLU and the Sigmoid is that the ReLU is
not differentiable at 0. The SiLU activation function (also referred to as “swish”) can be interpreted as a smooth
approximation to the ReLU. It is defined as
𝑥
𝜎SiLU (𝑥) := 𝑥𝜎sig (𝑥) = for 𝑥 ∈ R,
1 + 𝑒−𝑥

14
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
and is depicted in Figure 2.2 (b). There exist various other smooth activation functions that mimic the ReLU, including
the Softplus 𝑥 ↦→ log(1 + exp(𝑥)), the GELU (Gaussian Error Linear Unit) 𝑥 ↦→ 𝑥𝐹 (𝑥) where 𝐹 (𝑥) denotes the
cumulative distribution function of the standard normal distribution, and the Mish 𝑥 ↦→ 𝑥 tanh(log(1 + exp(𝑥))).
Parametric ReLU or Leaky ReLU: This variant of the ReLU addresses the dead neuron problem. For some
𝑎 ∈ (0, 1), the parametric ReLU is defined as

𝜎𝑎 (𝑥) = max{𝑥, 𝑎𝑥} for 𝑥 ∈ R,

and is depicted in Figure 2.2 (c) for three different values of 𝑎. Since the output of 𝜎 does not have flat regions like
the ReLU, the dying ReLU problem is mitigated. If 𝑎 is not chosen too small, then there is less of a vanishing gradient
problem than for the Sigmoid. In practice the additional additional parameter 𝑎 has to be fine-tuned depending on the
application. Like the ReLU, the parametric ReLU is not differentiable at 0.

Bibliography and further reading

The concept of neural networks was first introduced by McCulloch and Pitts in [125]. Later Rosenblatt [170] introduced
the perceptron, an artificial neuron with adjustable weights that forms the basis of the multilayer perceptron (a fully
connected feedforward neural network). The vanishing gradient problem shortly addressed in Section 2.3 was discussed
by Hochreiter in his diploma thesis [81] and later in [17, 83]. For a historical survey on neural networks see [179] and
also [111]. For general textbooks on neural networks we refer to [74, 4], with the latter focusing on theoretical aspects.
Also see [64] for a more recent monograph. For the implementation of neural networks we refer for example to [60, 36].

15
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Exercises

Exercise 2.5 Prove Proposition 2.3.

Exercise 2.6 In this exercise we show that ReLU and parametric ReLU create similar sets of neural network functions.
Fix 𝑎 > 0.
(i) Find a set of weight matrices and biases vectors, such that the associated neural network Φ1 , with the ReLU
activation function 𝜎ReLU satisfies Φ1 (𝑥) = 𝜎𝑎 (𝑥) for all 𝑥 ∈ R.
(ii) Find a set of weight matrices and biases vectors, such that the associated neural network Φ2 , with the parametric
ReLU activation function 𝜎𝑎 satisfies Φ2 (𝑥) = 𝜎ReLU (𝑥) for all 𝑥 ∈ R.
(iii) Conclude that every ReLU network can be expressed as a leaky ReLU network and vice versa.

Exercise 2.7 Let 𝑑 ∈ N, and let Φ1 be a neural network with the ReLU as activation function, input dimension 𝑑, and
output dimension 1. Moreover, let Φ2 be a neural network with the sigmoid activation function, input dimension 𝑑, and
output dimension 1. Show that, if Φ1 = Φ2 , then Φ1 is a constant function.

Exercise 2.8 In this exercise we show that for the sigmoid activation functions, dead-neuron-like behavior is very rare.
Let Φ be a neural network with the sigmoid activation function. Assume that Φ is a constant function. Show that for
every 𝜀 > 0 there is a non-constant neural network Φ e with the same architecture as Φ such that for all ℓ = 0, . . . 𝐿,

∥𝑾 (ℓ ) − 𝑾
e (ℓ ) ∥ ≤ 𝜀 and ∥𝒃 (ℓ ) − e
𝒃 (ℓ ) ∥ ≤ 𝜀

where 𝑾 (ℓ ) , 𝒃 (ℓ ) are the weights and biases of Φ and 𝑾


e (ℓ ) , e
𝒃 (ℓ ) are the biases of Φ.
e
Show that such a statement does not hold for ReLU networks. What about leaky ReLU?

16
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Chapter 3
Universal Approximation

After introducing neural networks in Chapter 2, it is natural to inquire about their capabilites. Specifically, we might
wonder if there exist inherent limitations to the type of functions a neural network can represent. Could there be a class
of functions that neural networks cannot approximate? If so, it would suggest neural networks are specialized tools,
similar to how linear regression is suited for linear relationships, but not for data with nonlinear relationships.
In this chapter, we will show that this is not the case, and neural networks are indeed a universal tool. More precisely,
given sufficiently large and complex architectures, they can approximate almost any sensible input-output relationship.
We will formalize and prove this claim in the subsequent sections.

3.1 A universal approximation theorem

To analyze what kind of functions can be approximated with neural networks, we start by considering the uniform
approximation of continuous functions 𝑓 : R𝑑 → R on compact sets. To this end we first introduce the notion of
compact convergence.
Definition 3.1 Let 𝑑 ∈ N. A sequence of functions 𝑓𝑛 : R𝑑 → R, 𝑛 ∈ N, is said to converge compactly to a function
𝑓 : R𝑑 → R, if for every compact 𝐾 ⊆ R𝑑 it holds that lim𝑛→∞ sup 𝑥 ∈𝐾 | 𝑓𝑛 (𝑥) − 𝑓 (𝑥)| = 0. In this case we write
cc
𝑓𝑛 −→ 𝑓 .
Throughout what follows, we always consider 𝐶 0 (R𝑑 ) equipped with the topology of Definition 3.1 (also see
Exercise 3.23), and every subset such as 𝐶 0 (𝐷) with the subspace topology: for example, if 𝐷 ⊆ R𝑑 is bounded, then
convergence in 𝐶 0 (𝐷) refers to uniform convergence lim𝑛→∞ sup 𝑥 ∈𝐷 | 𝑓𝑛 (𝑥) − 𝑓 (𝑥)| = 0.

3.1.1 Universal approximators

As stated before, we want to show that deep neural networks can approximate every continuous function in the sense
of Definition 3.1. Before we show this, we introduce sets of functions that satisfy this approximation property. We call
such sets universal approximators.
Definition 3.2 Let 𝑑 ∈ N. A set of functions H from R𝑑 to R is a universal approximator (of 𝐶 0 (R𝑑 )), if for every
𝜀 > 0, every compact 𝐾 ⊆ R𝑑 , and every 𝑓 ∈ 𝐶 0 (R𝑑 ), there exists 𝑔 ∈ H such that sup 𝒙∈𝐾 | 𝑓 (𝒙) − 𝑔(𝒙)| < 𝜀.
cc
For a set of (not necessarily continuous) functions H mapping between R𝑑 and R, we denote by H its closure
with respect to compact convergence.
The relationship between a universal approximator and the closure with respect to compact convergence is established
in the proposition below.

17
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Proposition 3.3 Let 𝑑 ∈ N and H be a set of functions from R𝑑 to R. Then, H is a universal approximator of 𝐶 0 (R𝑑 )
cc
if and only if 𝐶 0 (R𝑑 ) ⊆ H .
Proof Suppose that H is a universal approximator and fix 𝑓 ∈ 𝐶 0 (R𝑑 ). For 𝑛 ∈ N, define 𝐾𝑛 := [−𝑛, 𝑛] 𝑑 ⊆ R𝑑 . Then
for every 𝑛 ∈ N there exists 𝑓𝑛 ∈ H such that sup 𝒙∈𝐾𝑛 | 𝑓𝑛 (𝒙) − 𝑓 (𝒙)| < 1/𝑛. Since for every compact 𝐾 ⊆ R𝑑 there
cc
exists 𝑛0 such that 𝐾 ⊆ 𝐾𝑛 for all 𝑛 ≥ 𝑛0 , it holds 𝑓𝑛 −→ 𝑓 . The “only if” part of the assertion is trivial. □
A key tool to show that a set is a universal approximator is the Stone-Weierstrass theorem, see for instance [174,
Sec. 5.7].
Theorem 3.4 (Stone-Weierstrass) Let 𝑑 ∈ N, let 𝐾 ⊆ R𝑑 be compact, and let H ⊆ 𝐶 0 (𝐾, R) satisfy that
(a) for all 𝒙 ∈ 𝐾 there exists 𝑓 ∈ H such that 𝑓 (𝒙) ≠ 0,
(b) for all 𝒙 ≠ 𝒚 ∈ 𝐾 there exists 𝑓 ∈ H such that 𝑓 (𝒙) ≠ 𝑓 (𝒚),
(c) H is an algebra of functions, i.e., H is closed under addition, multiplication and scalar multiplication.
Then H is dense in 𝐶 0 (𝐾).
Example 3.5 (Polynomials are dense in 𝐶 0 (R𝑑 )) For a multiindex 𝜶 = (𝛼1 , . . . , 𝛼𝑑 ) ∈ N0𝑑 and a vector 𝒙 =
𝛼
(𝑥1 , . . . , 𝑥 𝑑 ) ∈ R𝑑 denote 𝒙 𝜶 := 𝑑𝑗=1 𝑥 𝑗 𝑗 . In the following, with |𝜶| := 𝑑𝑗=1 𝛼 𝑗 , we write
Î Í

P𝑛 := span{𝒙 𝜶 | 𝜶 ∈ N0𝑑 , |𝜶| ≤ 𝑛}


Ð
i.e., P𝑛 is the space of polynomials of degree at most 𝑛 (with real coefficients). It is easy to check that P := 𝑛∈N P𝑛 (R𝑑 )
satisfies the assumptions of Theorem 3.4 on every compact set 𝐾 ⊆ R𝑑 . Thus the space of polynomials P is a universal
approximator of 𝐶 0 (R𝑑 ), and by Proposition 3.3, P is dense in 𝐶 0 (R𝑑 ). In case we wish to emphasize the dimension
of the underlying space, in the following we will also write P𝑛 (R𝑑 ) or P(R𝑑 ) to denote P𝑛 , P respectively.

3.1.2 Shallow networks

With the necessary formalism established in the previous subsection, we can now demonstrate that shallow networks
of arbitrary width form a universal approximator under certain (mild) conditions on the activation function. The results
in this section are based on [116], and for the proofs we follow the arguments in that paper.
We first introduce notation for the set of all functions realized by certain architectures.
Definition 3.6 Let 𝑑, 𝑚, 𝐿, 𝑛 ∈ N and 𝜎 : R → R. The set of all functions realized by neural networks with 𝑑-
dimensional input, 𝑚-dimensional output, depth at most 𝐿, width at most 𝑛, and activation function 𝜎 is denoted
by

N𝑑𝑚 (𝜎; 𝐿, 𝑛) := {Φ : R𝑑 → R𝑚 | Φ as in Def. 2.1, depth(Φ) ≤ 𝐿, width(Φ) ≤ 𝑛}.

Furthermore,
Ø
N𝑑𝑚 (𝜎; 𝐿) := N𝑑𝑚 (𝜎; 𝐿, 𝑛).
𝑛∈N

In the sequel, we require the activation function 𝜎 to belong to the set of piecewise continuous and locally bounded
functions


M := 𝜎 ∈ 𝐿 loc (R) there exist intervals 𝐼1 , . . . , 𝐼 𝑀 partitioning R,
(3.1.1)
s.t. 𝜎 ∈ 𝐶 0 (𝐼 𝑗 ) for all 𝑗 = 1, . . . , 𝑀 .
Here 𝑀 ∈ N is finite, and the intervals 𝐼 𝑗 are understood to have positive (possibly infinite) Lebesgue measure, i.e. 𝐼 𝑗 is
e.g. not allowed to be empty or a single point. Hence, 𝜎 is a piecewise continuous function, and it has discontinuities
at most finitely many points.

18
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Example 3.7 Activation functions belonging to M include for instance the ReLU, the SiLU and the Sigmoid, see
Section 2.3. In these cases, we can choose 𝑀 = 1 and 𝐼1 = R. Discontinuous functions include for example the
Heaviside function 𝑥 ↦→ 1 𝑥>0 (also called a “perceptron” in this context) but also 𝑥 ↦→ 1 𝑥>0 sin(1/𝑥): Both belong to
M with 𝑀 = 2, 𝐼1 = (−∞, 0] and 𝐼2 = (0, ∞). We exclude for example the function 𝑥 ↦→ 1/𝑥, which is not locally
bounded.
The rest of this subsection is dedicated to proving the following theorem that has now already been anounced
repeatedly.
Theorem 3.8 Let 𝑑 ∈ N and 𝜎 ∈ M. Then N𝑑1 (𝜎; 1) is a universal approximator of 𝐶 0 (R𝑑 ) if and only if 𝜎 is not a
polynomial.
Remark 3.9 We will see in Exercise 3.27 and Corollary 3.18 that neural networks can also arbitrarily well approximate
non-continuous functions with respect to suitable norms.
The universal approximation theorem by Leshno, Lin, Pinkus and Schocken [116]—of which Theorem 3.8 is a special
case—is even formulated for a much larger set M, which allows for activation functions that have discontinuities at
a (possibly non-finite) set of Lebesgue measure zero. Instead of proving the theorem in this generality, we resort to
the simpler case stated above. This allows to avoid some technicalities, but the main ideas remain the same. The proof
strategy is to verify the following three claims:
cc cc
(i) if 𝐶 0 (R1 ) ⊆ N11 (𝜎; 1) then 𝐶 0 (R𝑑 ) ⊆ N𝑑1 (𝜎; 1) ,
cc
(ii) if 𝜎 ∈ 𝐶 ∞ (R) is not a polynomial then 𝐶 0 (R1 ) ⊆ N11 (𝜎; 1) ,
cc
˜ ∈ 𝐶 ∞ (R) ∩ N11 (𝜎; 1) which is not a polynomial.
(iii) if 𝜎 ∈ M is not a polynomial then there exists 𝜎
cc cc cc
Upon observing that 𝜎 ˜ ∈ N11 (𝜎; 1) implies N11 ( 𝜎,
˜ 1) ⊆ N11 (𝜎; 1) , it is easy to see that these statements together
with Proposition 3.3 establish the implication “⇐” asserted in Theorem 3.8. The reverse direction is straightforward
to check and will be the content of Exercise 3.24.
We start with a more general version of (i) and reduce the problem to the one dimensional case.
Lemma 3.10 Assume that H is a universal approximator of 𝐶 0 (R). Then for every 𝑑 ∈ N

span{𝒙 ↦→ 𝑔(𝒘 · 𝒙) | 𝒘 ∈ R𝑑 , 𝑔 ∈ H }

is a universal approximator of 𝐶 0 (R𝑑 ).


Proof For 𝑘 ∈ N0 , denote by H 𝑘 the space of all 𝑘-homogenous polynomials, that is

H 𝑘 := span R𝑑 ∋ 𝒙 ↦→ 𝒙 𝜶 𝜶 ∈ N0𝑑 , |𝜶| = 𝑘 .




We claim that
cc
H 𝑘 ⊆ span{R𝑑 ∋ 𝒙 ↦→ 𝑔(𝒘 · 𝒙) | 𝒘 ∈ R𝑑 , 𝑔 ∈ H } =: 𝑋 (3.1.2)

for all 𝑘 ∈ N0 . This implies that all multivariate polynomials belong to 𝑋. An application of the Stone-Weierstrass
theorem (cp. Example 3.5) and Proposition 3.3 then conclude the proof.
For every 𝜶, 𝜷 ∈ N0𝑑 with |𝜶| = | 𝜷| = 𝑘, it holds 𝐷 𝜷 𝒙 𝜶 = 𝛿 𝜷,𝜶 𝜶!, where 𝜶! := 𝑑𝑗=1 𝛼 𝑗 ! and 𝛿 𝜷,𝜶 = 1 if 𝜷 = 𝜶 and
Î
𝛿 𝜷,𝜶 = 0 otherwise. Hence, since {𝒙 ↦→ 𝒙 𝜶 | |𝜶| = 𝑘 } is a basis of H 𝑘 , the set {𝐷 𝜶 | |𝜶| = 𝑘 } is a basis of its topological
dual H′𝑘 . Thus each linear functional 𝑙 ∈ H′𝑘 allows the representation 𝑙 = 𝑝(𝐷) for some 𝑝 ∈ H 𝑘 (here 𝐷 stands for
the differential).
By the multinomial formula
𝑘
𝑑
𝑘 ©∑︁ ∑︁ 𝑘! 𝜶 𝜶
(𝒘 · 𝒙) = ­ 𝑤 𝑗 𝑥 𝑗 ® = 𝒘 𝒙 .
ª
𝜶!
« 𝑗=1 ¬ {𝜶∈N0𝑑 | |𝜶 |=𝑘 }

19
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Therefore, we have that (𝒙 ↦→ (𝒘 · 𝒙) 𝑘 ) ∈ H 𝑘 . Moreover, for every 𝑙 = 𝑝(𝐷) ∈ H′𝑘 and all 𝒘 ∈ R𝑑 we have that

𝑙 (𝒙 ↦→ (𝒘 · 𝒙) 𝑘 ) = 𝑘!𝑝(𝒘).

Hence, if 𝑙 (𝒙 ↦→ (𝒘 · 𝒙) 𝑘 ) = 𝑝(𝐷) (𝒙 ↦→ (𝒘 · 𝒙) 𝑘 ) = 0 for all 𝒘 ∈ R𝑑 , then 𝑝 ≡ 0 and thus 𝑙 ≡ 0.


This implies span{𝒙 ↦→ (𝒘 · 𝒙) 𝑘 | 𝒘 ∈ R𝑑 } = H 𝑘 . Indeed, if there exists ℎ ∈ H 𝑘 which is not in span{𝒙 ↦→
(𝒘 · 𝒙) 𝑘 | 𝒘 ∈ R𝑑 }, then by the theorem of Hahn-Banach (see Theorem B.8), there exists a non-zero functional in H′𝑘
vanishing on span{𝒙 ↦→ (𝒘 · 𝒙) 𝑘 | 𝒘 ∈ R𝑑 }. This contradicts the previous observation.
By the universality of H it is not hard to see that 𝒙 ↦→ (𝒘 · 𝒙) 𝑘 ∈ 𝑋 for all 𝒘 ∈ R𝑑 . Therefore, we have H 𝑘 ⊆ 𝑋 for
all 𝑘 ∈ N0 . □

By the above lemma, in order to verify that N𝑑1 (𝜎; 1) is a universal approximator, it suffices to show that N11 (𝜎; 1)
is a universal approximator. We first show that this is the case for sigmoidal activations.

Definition 3.11 (sigmoidal activation) An activation function 𝜎 : R → R is called sigmoidal, if 𝜎 ∈ 𝐶 0 (R),


lim 𝑥→∞ 𝜎(𝑥) = 1 and lim 𝑥→−∞ 𝜎(𝑥) = 0.

For sigmoidal activation functions we can now conclude the universality in the univariate case.
cc
Lemma 3.12 Let 𝜎 : R → R be monotonically increasing and sigmoidal. Then 𝐶 0 (R) ⊆ N11 (𝜎; 1) .

We prove Lemma 3.12 in Exercise 3.25. Lemma 3.10 and Lemma 3.12 show Theorem 3.8 in the special case where
𝜎 is monotonically increasing and sigmoidal. For the general case, let us continue with (ii) and consider 𝐶 ∞ activations.

Lemma 3.13 If 𝜎 ∈ C∞ (R) and 𝜎 is not a polynomial, then N11 (𝜎; 1) is dense in 𝐶 0 (R).
cc
Proof Denote 𝑋 := N11 (𝜎; 1) . We show again that all polynomials belong to 𝑋. An application of the Stone-
Weierstrass theorem then gives the statement.
Fix 𝑏 ∈ R and denote 𝑓 𝑥 (𝑤) := 𝜎(𝑤𝑥 + 𝑏) for all 𝑥, 𝑤 ∈ R. By Taylor’s theorem, for ℎ ≠ 0

𝜎((𝑤 + ℎ)𝑥 + 𝑏) − 𝜎(𝑤𝑥 + 𝑏) 𝑓 𝑥 (𝑤 + ℎ) − 𝑓 𝑥 (𝑤)


=
ℎ ℎ

= 𝑓 𝑥′ (𝑤) + 𝑓 𝑥′′ (𝜉)
2


= 𝑓 𝑥 (𝑤) + 𝑥 2 𝜎 ′′ (𝜉𝑥 + 𝑏) (3.1.3)
2
for some 𝜉 = 𝜉 (ℎ) between 𝑤 and 𝑤 + ℎ. Note that the left-hand side belongs to N11 (𝜎; 1) as a function of 𝑥. Since
𝜎 ′′ ∈ 𝐶 0 (R), for every compact set 𝐾 ⊆ R

sup sup |𝑥 2 𝜎 ′′ (𝜉 (ℎ)𝑥 + 𝑏)| ≤ sup sup |𝑥 2 𝜎 ′′ (𝜂𝑥 + 𝑏)| < ∞.


𝑥 ∈𝐾 | ℎ| ≤1 𝑥 ∈𝐾 𝜂 ∈ [𝑤−1,𝑤+1]

Letting ℎ → 0, as a function of 𝑥 the term in (3.1.3) thus converges uniformly towards 𝐾 ∋ 𝑥 ↦→ 𝑓 𝑥′ (𝑤). Since 𝐾 was
arbitrary, 𝑥 ↦→ 𝑓 𝑥′ (𝑤) belongs to 𝑋. Inductively applying the same argument to 𝑓 𝑥(𝑘−1) (𝑤), we find that 𝑥 ↦→ 𝑓 𝑥(𝑘 ) (𝑤)
belongs to 𝑋 for all 𝑘 ∈ N, 𝑤 ∈ R. Observe that 𝑓 𝑥(𝑘 ) (𝑤) = 𝑥 𝑘 𝜎 (𝑘 ) (𝑤𝑥 + 𝑏). Since 𝜎 is not a polynomial, for each
𝑘 ∈ N there exists 𝑏 𝑘 ∈ R such that 𝜎 (𝑘 ) (𝑏 𝑘 ) ≠ 0. Choosing 𝑤 = 0, we obtain that 𝑥 ↦→ 𝑥 𝑘 belongs to 𝑋. □
Finally, we come to the proof of (iii)—the claim that there exists at least one non-polynomial 𝐶 ∞ (R) function in the
closure of N11 (𝜎; 1). The argument is split into two lemmata. Denote in the following by 𝐶𝑐∞ (R) the set of compactly
supported 𝐶 ∞ (R) functions.
cc
Lemma 3.14 Let 𝜎 ∈ M. Then for each 𝜑 ∈ 𝐶𝑐∞ (R) it holds 𝜎 ∗ 𝜑 ∈ N11 (𝜎; 1) .

20
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Proof Fix 𝜑 ∈ 𝐶𝑐∞ (R) and let 𝑎 > 0 such that supp 𝜑 ⊆ [−𝑎, 𝑎]. We have

𝜎 ∗ 𝜑(𝑥) = 𝜎(𝑥 − 𝑦)𝜑(𝑦) d𝑦.
R

Denote 𝑦 𝑗 := −𝑎 + 2𝑎 𝑗/𝑛 for 𝑗 = 0, . . . , 𝑛 and define for 𝑥 ∈ R


𝑛−1
2𝑎 ∑︁
𝑓𝑛 (𝑥) B 𝜎(𝑥 − 𝑦 𝑗 )𝜑(𝑦 𝑗 ).
𝑛 𝑗=0

cc
Clearly, 𝑓𝑛 ∈ N11 (𝜎; 1). We will show 𝑓𝑛 −→ 𝜎 ∗ 𝜑 as 𝑛 → ∞. To do so we verify uniform convergence of 𝑓𝑛 towards
𝜎 ∗ 𝜑 on the interval [−𝑏, 𝑏] with 𝑏 > 0 arbitrary but fixed.
For 𝑥 ∈ [−𝑏, 𝑏]
𝑛−1 ∫
∑︁ 𝑦 𝑗+1
|𝜎 ∗ 𝜑(𝑥) − 𝑓𝑛 (𝑥)| ≤ 𝜎(𝑥 − 𝑦)𝜑(𝑦) − 𝜎(𝑥 − 𝑦 𝑗 )𝜑(𝑦 𝑗 ) d𝑦 . (3.1.4)
𝑗=0 𝑦𝑗

Fix 𝜀 ∈ (0, 1). Since 𝜎 ∈ M, there exist 𝑧1 , . . . , 𝑧 𝑀 ∈ R such that 𝜎 is continuous on R\{𝑧1 , . . . , 𝑧 𝑀 } (cp. (3.1.1)).
Ð
With 𝐷 𝜀 := 𝑀 𝑗=1 (𝑧 𝑗 − 𝜀, 𝑧 𝑗 + 𝜀), observe that 𝜎 is uniformly continuous on the compact set 𝐾 𝜀 := [−𝑎 − 𝑏, 𝑎 + 𝑏] ∩ 𝐷 𝜀 .
𝑐

Now let 𝐽𝑐 ∪ 𝐽𝑑 = {0, . . . , 𝑛 − 1} be a partition (depending on 𝑥), such that 𝑗 ∈ 𝐽𝑐 if and only if [𝑥 − 𝑦 𝑗+1 , 𝑥 − 𝑦 𝑗 ] ⊆ 𝐾 𝜀 .
Hence, 𝑗 ∈ 𝐽𝑑 implies the existence of 𝑖 ∈ {1, . . . , 𝑀 } such that the distance of 𝑧𝑖 to [𝑥 − 𝑦 𝑗+1 , 𝑥 − 𝑦 𝑗 ] is at most 𝜀.
Due to the interval [𝑥 − 𝑦 𝑗+1 , 𝑥 − 𝑦 𝑗 ] having length 2𝑎/𝑛, we can bound

∑︁ Ø
𝑦 𝑗+1 − 𝑦 𝑗 = [𝑥 − 𝑦 𝑗+1 , 𝑥 − 𝑦 𝑗 ]
𝑗 ∈ 𝐽𝑑 𝑗 ∈ 𝐽𝑑
𝑀 h
Ø 2𝑎 2𝑎 i
≤ 𝑧𝑖 − 𝜀 −
, 𝑧𝑖 + 𝜀 +
𝑖=1
𝑛 𝑛
 4𝑎 
≤ 𝑀 · 2𝜀 + .
𝑛
Next, because of the local boundedness of 𝜎 and the fact that 𝜑 ∈ 𝐶𝑐∞ , it holds sup | 𝑦 | ≤𝑎+𝑏 |𝜎(𝑦)| + sup | 𝑦 | ≤𝑎 |𝜑(𝑦)| =:
𝛾 < ∞. Hence

|𝜎 ∗ 𝜑(𝑥) − 𝑓𝑛 (𝑥)|
∑︁ ∫ 𝑦 𝑗+1
≤ 𝜎(𝑥 − 𝑦)𝜑(𝑦) − 𝜎(𝑥 − 𝑦 𝑗 )𝜑(𝑦 𝑗 ) d𝑧
𝑗 ∈ 𝐽𝑐 ∪𝐽𝑑 𝑦𝑗
 
2 4𝑎
≤ 2𝛾 𝑀 · 2𝜀 +
𝑛
+ 2𝑎 sup max |𝜎(𝑥 − 𝑦)𝜑(𝑦) − 𝜎(𝑥 − 𝑦 𝑗 )𝜑(𝑦 𝑗 )|. (3.1.5)
𝑗 ∈ 𝐽𝑐 𝑦 ∈ [ 𝑦 𝑗 ,𝑦 𝑗+1 ]

We can bound the term in the last maximum by

21
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
|𝜎(𝑥 − 𝑦)𝜑(𝑦) − 𝜎(𝑥 − 𝑦 𝑗 )𝜑(𝑦 𝑗 )|
≤ |𝜎(𝑥 − 𝑦) − 𝜎(𝑥 − 𝑦 𝑗 )||𝜑(𝑦)| + |𝜎(𝑥 − 𝑦 𝑗 )||𝜑(𝑦) − 𝜑(𝑦 𝑗 )|

© ª
≤ 𝛾 · ­ sup |𝜎(𝑧 1 ) − 𝜎(𝑧2 )| + sup |𝜑(𝑧 1 ) − 𝜑(𝑧2 )| ® .
­ ®
­ 𝑧1 ,𝑧2 ∈𝐾 𝜀 𝑧1 ,𝑧2 ∈ [−𝑎,𝑎] ®
2𝑎
« | 𝑧1 −𝑧2 | ≤ 𝑛 | 𝑧1 −𝑧2 | ≤ 2𝑎
𝑛 ¬
Finally, uniform continuity of 𝜎 on 𝐾 𝜀 and 𝜑 on [−𝑎, 𝑎] imply that the last term tends to 0 as 𝑛 → ∞ uniformly for all
𝑥 ∈ [−𝑏, 𝑏]. This shows that there exist 𝐶 < ∞ (independent of 𝜀 and 𝑥) and 𝑛 𝜀 ∈ N (independent of 𝑥) such that the
term in (3.1.5) is bounded by 𝐶𝜀 for all 𝑛 ≥ 𝑛 𝜀 . Since 𝜀 was arbitrary, this yields the claim. □

Lemma 3.15 If 𝜎 ∈ M and 𝜎 ∗ 𝜑 is a polynomial for all 𝜑 ∈ 𝐶𝑐∞ (R), then 𝜎 is a polynomial.

Proof Fix −∞ < 𝑎 < 𝑏 < ∞ and consider 𝐶𝑐∞ (𝑎, 𝑏) := {𝜑 ∈ 𝐶 ∞ (R) | supp 𝜑 ⊆ [𝑎, 𝑏]}. Define a metric 𝜌 on 𝐶𝑐∞ (𝑎, 𝑏)
via
∑︁ |𝜑 − 𝜓| 𝐶 𝑗 (𝑎,𝑏)
𝜌(𝜑, 𝜓) := 2− 𝑗 ,
𝑗 ∈N0
1 + |𝜑 − 𝜓| 𝐶 𝑗 (𝑎,𝑏)

where

|𝜑| 𝐶 𝑗 (𝑎,𝑏) := sup |𝜑 ( 𝑗 ) (𝑥)|.


𝑥 ∈ [𝑎,𝑏]

Í𝑗
Since the space of 𝑗 times differentiable functions on [𝑎, 𝑏] is complete with respect to the norm 𝑖=0 | · | 𝐶 𝑖 (𝑎,𝑏) , see
for instance [79, Satz 104.3], the space 𝐶𝑐∞ (𝑎, 𝑏) is complete with the metric 𝜌. For 𝑘 ∈ N set

𝑉𝑘 := {𝜑 ∈ 𝐶𝑐∞ (𝑎, 𝑏) | 𝜎 ∗ 𝜑 ∈ P 𝑘 },

where P 𝑘 := span{R ∋ 𝑥 ↦→ 𝑥 𝑗 | 0 ≤ 𝑗 ≤ 𝑘 } denotes the space of polynomials of degree at most 𝑘. Then 𝑉𝑘 is closed
with respect to the metric 𝜌. To see this, we only need to observe that for a converging sequence 𝜑 𝑗 → 𝜑∗ with respect
to 𝜌 and 𝜑 𝑗 ∈ 𝑉𝑘 , it follows that 𝐷 𝑘+1 (𝜎 ∗ 𝜑∗ ) = 0 and hence 𝜎 ∗ 𝜑∗ is a polynomial. Since 𝐷 𝑘+1 (𝜎 ∗ 𝜑 𝑗 ) = 0 we
compute with the linearity of the convolution and the fact that 𝐷 𝑘+1 ( 𝑓 ∗ 𝑔) = 𝑓 ∗ 𝐷 𝑘+1 (𝑔) for differentiable 𝑔 and if
both sides are well-defined that

sup |𝐷 𝑘+1 (𝜎 ∗ 𝜑∗ ) (𝑥)|


𝑥 ∈ [𝑎,𝑏]

= sup |𝜎 ∗ 𝐷 𝑘+1 (𝜑∗ − 𝜑 𝑗 ) (𝑥)|


𝑥 ∈ [𝑎,𝑏]

≤ |𝑏 − 𝑎| sup |𝜎(𝑧)| · sup |𝐷 𝑘+1 (𝜑 𝑗 − 𝜑∗ ) (𝑥)|


𝑧 ∈ [𝑎−𝑏,𝑏−𝑎] 𝑥 ∈ [𝑎,𝑏]

and since 𝜎 is locally bounded, the right hand-side converges to 0.


By assumption we have
Ø
𝑉𝑘 = 𝐶𝑐∞ (𝑎, 𝑏).
𝑘 ∈N

Baire’s category theorem implies the existence of 𝑘 0 ∈ N (depending on 𝑎, 𝑏) such that 𝑉𝑘0 contains an open subset of
𝐶𝑐∞ (𝑎, 𝑏). Since 𝑉𝑘0 is a vector space, it must hold 𝑉𝑘0 = 𝐶𝑐∞ (𝑎, 𝑏).
We now show that 𝜑 ∗ 𝜎 ∈ P 𝑘0 for every 𝜑 ∈ 𝐶𝑐∞ (R); in other words, 𝑘 0 = 𝑘 0 (𝑎, 𝑏) can be chosen independent of 𝑎
and 𝑏. First consider a shift 𝑠 ∈ R and let 𝑎˜ := 𝑎 + 𝑠 and 𝑏˜ := 𝑏 + 𝑠. Then with 𝑆(𝑥) := 𝑥 + 𝑠, for any 𝜑 ∈ 𝐶𝑐∞ ( 𝑎, ˜ holds
˜ 𝑏)

𝜑 ◦ 𝑆 ∈ 𝐶𝑐 (𝑎, 𝑏), and thus (𝜑 ◦ 𝑆) ∗ 𝜎 ∈ P 𝑘0 . Since (𝜑 ◦ 𝑆) ∗ 𝜎(𝑥) = 𝜑 ∗ 𝜎(𝑥 + 𝑠), we conclude that 𝜑 ∗ 𝜎 ∈ P 𝑘0 . Next
let −∞ < 𝑎˜ < 𝑏˜ < ∞ be arbitrary. Then, for an integer 𝑛 > ( 𝑏˜ − 𝑎) ˜ (𝑏 − 𝑎) we can cover ( 𝑎, ˜ with 𝑛 ∈ N overlapping
˜ 𝑏)

22
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
𝑏 𝑛 ), each of length 𝑏 − 𝑎. Any 𝜑 ∈ 𝐶𝑐∞ ( 𝑎,
open intervals (𝑎 1 , 𝑏 1 ), . . . , (𝑎 𝑛 ,Í ˜ can be written as 𝜑 = Í𝑛 𝜑 𝑗 where
˜ 𝑏) 𝑗=1
𝜑 𝑗 ∈ 𝐶𝑐∞ (𝑎 𝑗 , 𝑏 𝑗 ). Then 𝜑 ∗ 𝜎 = 𝑛𝑗=1 𝜑 𝑗 ∗ 𝜎 ∈ P 𝑘0 , and thus 𝜑 ∗ 𝜎 ∈ P 𝑘0 for every 𝜑 ∈ 𝐶𝑐∞ (R).
Finally, Exercise 3.26 implies 𝜎 ∈ P 𝑘0 . □
Now we can put everything together to show Theorem 3.8.
Proof (of Theorem 3.8) By Exercise 3.24 we have the implication “⇒”.
For the other direction we assume that 𝜎 ∈ M is not a polynomial. Then by Lemma 3.15 there exists 𝜑 ∈ 𝐶𝑐∞ (R)
cc
such that 𝜎 ∗ 𝜑 is not a polynomial. According to Lemma 3.14 we have 𝜎 ∗ 𝜑 ∈ N11 (𝜎; 1) . We conclude with Lemma
3.13 that N11 (𝜎; 1) is a universal approximator of 𝐶 0 (R).
Finally, by Lemma 3.10, N𝑑1 (𝜎; 1) is a universal approximator of 𝐶 0 (R𝑑 ). □

3.1.3 Deep networks

Theorem 3.8 discusses the universality of neural networks of depth one, in the sense that they can approximate any
continuous function 𝑓 on compact sets arbitrarily well as long as the width is large enough. It is easy to see that an
analogous statement is valid for any fixed depth 𝐿 ≥ 1. The idea is to first use that the identity can be approximated
arbitrarily well by Theorem 3.8. Taking the composition of a shallow network approximation to 𝑓 with such approximate
identity networks then yields a deep network approximation to 𝑓 .

Proposition 3.16 Let 𝑑, 𝐿 ∈ N, let 𝐾 ⊆ R𝑑 be compact, and let 𝜎 ∈ M not be a polynomial. Then, for every 𝜀 > 0,
there exists a neural network Φ ∈ N𝑑𝑑 (𝜎; 𝐿) such that

∥Φ(𝒙) − 𝒙∥ ∞ < 𝜀 for all 𝒙 ∈ 𝐾.

Proof Let 𝑛 > 0 be so large that 𝐾 ⊆ [−𝑛, 𝑛] 𝑑 .


We start with the case 𝑑 = 1. By Theorem 3.8, there exist Φid ∈ N11 (𝜎; 1) such that
𝜀
sup |𝑥 − Φid (𝑥)| < . (3.1.6)
𝑥 ∈ [ − (𝑛+1),𝑛+1] 𝐿

𝑗
Denote the 𝑗-fold composition Φid ◦ · · · ◦ Φid by Φid . According to Exercise 2.5, Φ := Φid
𝐿 is a network of depth 𝐿, i.e.
1
Φ ∈ N1 (𝜎; 𝐿).
We first claim that for every 𝒙 ∈ [−𝑛, 𝑛] 𝑑 and every 𝑗 ∈ {1, . . . , 𝐿 − 1} holds
𝑗
sup |Φid (𝑥)| ≤ 𝑛 + 1. (3.1.7)
𝑥 ∈ [−𝑛,𝑛] 𝑑

For 𝑗 = 1 by (3.1.6)
𝑗 𝜀
sup |Φid (𝒙)| ≤ sup (|𝑥| + |𝑥 − Φid (𝑥)|) ≤ 𝑛 + .
𝑥 ∈ [ −𝑛,𝑛] 𝑥 ∈ [−𝑛,𝑛] 𝐿
Similarly for 𝑗 ∈ {2, . . . , 𝐿 − 1} by (3.1.6) and induction
𝑗 𝜀 𝜀 𝜀
sup |Φid (𝑥)| ≤ 𝑛 + ( 𝑗 − 1) + ≤ 𝑛+𝐿 ≤ 𝑛+1
𝑥 ∈ [ −𝑛,𝑛] 2𝐿 2𝐿 2𝐿

which gives the claim.


Next we bound the error. By (3.1.7) we may repeatedly use (3.1.6) to obtain

23
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
𝐿
∑︁
𝐿 𝑗 𝑗 −1
sup |𝑥 − Φid (𝑥)| ≤ sup (|𝑥 − Φid (𝑥)| + sup |Φid (𝑦) − Φid (𝑦)|
𝑥 ∈ [ −𝑛,𝑛] 𝑥 ∈ [ −𝑛,𝑛] 𝑦 ∈ [− (𝑛+1),𝑛+1] 𝑗=2
𝜀
≤𝐿 = 𝜀.
𝐿
This proves the theorem for 𝑑 = 1 with Φ := Φid 𝐿.

If 𝑑 > 1, then by Proposition 2.3 (ii) the network Φ := (Φid


𝐿 , . . . , Φ 𝐿 ) has the desired properties.
id □

Corollary 3.17 Let 𝑑 ∈ N, 𝐿 ∈ N and 𝜎 ∈ M. Then N𝑑1 (𝜎; 𝐿) is a universal approximator of 𝐶 0 (R𝑑 ) if and only if 𝜎
is not a polynomial.

Proof We only show the implication “⇐”. The other direction is again left as an exercise, see Exercise 3.24.
Assume 𝜎 ∈ M is not a polynomial, let 𝐾 ⊆ R𝑑 be compact, and let 𝑓 ∈ 𝐶 0 (R𝑑 ). Fix 𝜀 ∈ (0, 1). We need to
show that there exists a neural network Φ ∈ N𝑑1 (𝜎; 𝐿) such that sup 𝒙∈𝐾 | 𝑓 (𝒙) − Φ(𝒙)| < 𝜀. The case 𝐿 = 1 holds by
Theorem 3.8, so let 𝐿 > 1.
By Theorem 3.8, there exist Φshallow ∈ N𝑑1 (𝜎; 1) such that
𝜀
sup | 𝑓 (𝒙) − Φshallow (𝒙)| < . (3.1.8)
𝒙∈𝐾 2

Compactness of { 𝑓 (𝒙) | 𝒙 ∈ 𝐾 } implies that we can find 𝑛 > 0 such that

{Φshallow (𝒙) | 𝒙 ∈ 𝐾 } ⊆ [−𝑛, 𝑛]. (3.1.9)

Let Φid ∈ N11 (𝜎; 𝐿 − 1) be an approximation to the identity such that


𝜀
sup |𝑥 − Φid (𝑥)| < , (3.1.10)
𝑥 ∈ [−𝑛,𝑛] 2

which is possibly by Proposition 3.19.


Denote Φ := Φid ◦ Φshallow . According to Proposition 2.3 (iv) holds Φ ∈ N𝑑1 (𝜎; 𝐿) as desired. Moreover (3.1.8),
(3.1.9), (3.1.10) imply

sup | 𝑓 (𝒙) − Φ(𝒙)| = sup | 𝑓 (𝒙) − Φid (Φshallow (𝒙))|


𝒙∈𝐾 𝒙∈𝐾

≤ sup | 𝑓 (𝒙) − Φshallow (𝒙)| + |Φshallow (𝒙) − Φid (Φshallow (𝒙))|
𝒙∈𝐾
𝜀 𝜀
≤ + = 𝜀.
2 2
This concludes the proof. □

3.1.4 Other norms

Additional to the continuous functions, universal approximation theorems can be shown for various other function
classes and topologies, which may also allow for the approximation of functions exhibiting discontinuities or singular-
ities. To give but one example, we next state such a result for Lebesgue spaces on compact sets. The proof is left to the
reader, see Exercise 3.27.

Corollary 3.18 Let 𝑑 ∈ N, 𝐿 ∈ N, 𝑝 ∈ [1, ∞), and let 𝜎 ∈ M not be a polynomial. Then for every 𝜀 > 0, every
compact 𝐾 ⊆ R𝑑 , and every 𝑓 ∈ 𝐿 𝑝 (𝐾) there exists Φ 𝑓 , 𝜀 ∈ N𝑑1 (𝜎; 𝐿) such that

24
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
∫  1/ 𝑝
𝑝
| 𝑓 (𝒙) − Φ(𝒙)| d𝒙 ≤ 𝜀.
𝐾

3.1.5 Deep neural networks

Theorem 3.8 discusses the universality of shallow neural networks. It is, however, easy to see that under very weak
assumptions on the activation function every shallow neural network can be extended to a deep neural network which
approximates the shallow neural network arbitrarily well. By consequence, also deep neural networks satisfy universal
approximation. The essential building block is the following proposition, which gives an approximation of the identity.

Proposition 3.19 Let 𝑑, 𝐿 ∈ N, let 𝐾 ⊆ R𝑑 compact, and let 𝜎 : R → R be such that there exists an open set on which
𝜎 is differentiable and not constant. Then, for every 𝜀 > 0, there exists a neural network Φ ∈ N𝑑𝑑 (𝜎; 𝐿, 𝑑) such that

|Φ(𝒙) − 𝒙| < 𝜀 for all 𝒙 ∈ 𝐾.

Proof The proof utilizes the same idea as in Lemma 3.13, where we approximate the derivative of the activation
function by a simple neural network.
Let us first assume 𝑑 ∈ N and 𝐿 = 1.
Let 𝑥 ∗ ∈ R be such that 𝜎 is differentiable on a neighborhood of 𝑥 ∗ and 𝜎 ′ (𝑥 ∗ ) = 𝜃 ≠ 0. Moreover, let 𝒙 ∗ =
(𝑥 , . . . , 𝑥 ∗ ) ∈ R𝑑 . Then, for 𝜆 > 0 we define

𝜆 𝒙  𝜆
Φ𝜆 (𝒙) B 𝜎 + 𝒙 ∗ − 𝜎(𝒙 ∗ ),
𝜃 𝜆 𝜃
which is a neural network with one hidden layer and width 𝑑. Then, we have, for all 𝒙 ∈ 𝐾,

𝜎(𝒙/𝜆 + 𝒙 ∗ ) − 𝜎(𝒙 ∗ )
Φ𝜆 (𝒙) − 𝒙 = 𝜆 − 𝒙. (3.1.11)
𝜃
If 𝑥𝑖 = 0 for 𝑖 ∈ {1, . . . , 𝑑}, then (3.1.11) shows that (Φ𝜆 (𝒙) − 𝒙)𝑖 = 0. Otherwise

|𝒙𝑖 | 𝜎(𝒙𝑖 /𝜆 + 𝑥 ∗ ) − 𝜎(𝑥 ∗ )


|(Φ𝜆 (𝒙) − 𝒙)𝑖 | = −𝜃 .
|𝜃| 𝒙𝑖 /𝜆

By the definition of the derivative, we have that |(Φ𝜆 (𝒙) − 𝒙)𝑖 | → 0 for 𝜆 → ∞ uniformly for all 𝒙 ∈ 𝐾 and
𝑖 ∈ {1, . . . , 𝑑}. Therefore, |Φ𝜆 (𝒙) − 𝒙| → 0 for 𝜆 → ∞ uniformly for all 𝒙 ∈ 𝐾.
The extension to 𝐿 > 1 is straight forward and is the content of Exercise ??. □
With Proposition 3.19 it is clear that the universal approximation theorem for shallow neural networks holds for
deep neural networks as well. Indeed, let 𝐿 > 1 and 𝜎 be such that the assumptions of Proposition 3.19 and of Theorem
3.8 are satisfied. Then, for every function 𝑓 ∈ 𝐶 0 (R𝑑 ), every compact 𝐾 ⊆ R𝑑 and every 𝜀 > 0 it holds by Theorem
3.8 that there exists Φ1 ∈ N𝑑1 (𝜎; 1) such that
𝜀
|Φ1 (𝒙) − 𝑓 (𝒙)| ≤ for all 𝒙 ∈ 𝐾.
2
Since 𝑓 is continuous, we have that 𝑓 (𝐾) is compact and hence, there exists 𝐾 ′ ⊆ R compact such that Φ1 (𝐾) ⊆
𝑓 (𝐾) + 𝐵 𝜀/2 (0) ⊆ 𝐾 ′ .
Moreover, by Proposition 3.19 there exists a neural network Φ2 ∈ N11 (𝜎; 𝐿 − 1) such that
𝜀
|Φ2 (𝑥) − 𝑥| ≤ for all 𝑥 ∈ 𝐾 ′ .
2
Then Φ := Φ2 ◦ Φ1 satisfies

25
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
| 𝑓 (𝒙) − Φ(𝒙)| ≤ | 𝑓 (𝒙) − Φ1 (𝒙)| + |Φ1 (𝒙) − Φ2 ◦ Φ1 (𝒙)| ≤ 𝜀.

By Proposition 2.3 and Exercise 2.5, Φ2 ◦ Φ1 is a neural network with 𝐿 hidden layers in the sense of Definition 2.1.
This shows that 𝑁 𝑑1 (𝜎; 𝐿) is universal for all 𝐿 ∈ N.

3.2 Superexpressive activations and Kolmogorov’s superposition theorem

In the previous section, we saw that a large class of activation functions allow for universal approximation. However,
these results did not provide any insights into the necessary network size for achieving a specific accuracy.
Before exploring this topic further in the following chapters, we next present a remarkable result that shows how
the required network size is significantly influenced by the choice of activation function. The result asserts that, with
the appropriate activation function, every 𝑓 ∈ 𝐶 0 (𝐾) on a compact set 𝐾 ⊆ R𝑑 can be approximated to every desired
accuracy 𝜀 > 0 using a network of size 𝑂 (𝑑 2 ); in particular the network size is independent of 𝜀 > 0, 𝐾, and 𝑓 . We
will first discuss the one-dimensional case.

Proposition 3.20 There exists a continuous activation function 𝜎 : R → R such that for every compact 𝐾 ⊆ R, every
𝜀 > 0 and every 𝑓 ∈ 𝐶 0 (𝐾) there exists Φ(𝑥) = 𝜎(𝑤𝑥 + 𝑏) ∈ N11 (𝜎; 1, 1) such that

sup | 𝑓 (𝑥) − Φ(𝑥)| < 𝜀.


𝑥 ∈𝐾
Í
Proof Denote by P̃𝑛 all polynomials 𝑝(𝑥) = 𝑛𝑗=0 𝑞 𝑗 𝑥 𝑗 with rational coefficients, i.e. such that 𝑞 𝑗 ∈ Q for all
𝑗 = 0, . . . , 𝑛. Then P̃𝑛 can be identified
Ð with the 𝑛-fold cartesian product Q × · · · × Q, and thus P̃𝑛 is a countable set.
Consequently also the set P̃ := 𝑛∈N P̃𝑛 of all polynomials with rational coefficients is countable. Let ( 𝑝 𝑖 )𝑖 ∈Z be an
enumeration of these polynomials, and set
(
𝑝 𝑖 (𝑥 − 2𝑖) if 𝑥 ∈ [2𝑖, 2𝑖 + 1]
𝜎(𝑥) :=
𝑝 𝑖 (1) (2𝑖 + 2 − 𝑥) + 𝑝 𝑖+1 (0) (𝑥 − 2𝑖 − 1) if 𝑥 ∈ (2𝑖 + 1, 2𝑖 + 2).

In words, 𝜎 equals 𝑝 𝑖 on even intervals [2𝑖, 2𝑖 + 1] and is linear on odd intervals [2𝑖 + 1, 2𝑖 + 2], resulting in a continuous
function overall.
Í
We first assume 𝐾 = [0, 1]. By Example 3.5, for every 𝜀 > 0 exists 𝑝(𝑥) = 𝑛𝑗=1 𝑟 𝑗 𝑥 𝑗 such that sup 𝑥 ∈ [0,1] | 𝑝(𝑥) −
Í
𝑓 (𝑥)| < 𝜀/2. Now choose 𝑞 𝑗 ∈ Q so close to 𝑟 𝑗 such that 𝑝(𝑥) ˜ := 𝑛𝑗=1 𝑞 𝑗 𝑥 𝑗 satisfies sup 𝑥 ∈ [0,1] | 𝑝(𝑥)
˜ − 𝑝(𝑥)| < 𝜀/2.
Let 𝑖 ∈ Z such that 𝑝(𝑥)
˜ = 𝑝 𝑖 (𝑥), i.e., 𝑝 𝑖 (𝑥) = 𝜎(2𝑖 + 𝑥) for all 𝑥 ∈ [0, 1]. Then sup 𝑥 ∈ [0,1] | 𝑓 (𝑥) − 𝜎(𝑥 + 2𝑖)| < 𝜀.
For general compact 𝐾 assume that 𝐾 ⊆ [𝑎, 𝑏]. By Tietze’s extension theorem, 𝑓 allows a continuous extension to
[𝑎, 𝑏], so without loss of generality 𝐾 = [𝑎, 𝑏]. By the first case we can find 𝑖 ∈ Z such that with 𝑦 = (𝑥 − 𝑎)/(𝑏 − 𝑎)
(i.e. 𝑦 ∈ [0, 1] if 𝑥 ∈ [𝑎, 𝑏])
𝑥 − 𝑎 
sup 𝑓 (𝑥) − 𝜎 + 2𝑖 = sup | 𝑓 (𝑦 · (𝑏 − 𝑎) + 𝑎) − 𝜎(𝑦 − 2𝑖)| < 𝜀,
𝑥 ∈ [𝑎,𝑏] 𝑏−𝑎 𝑦 ∈ [0,1]

which gives the statement with 𝑤 = 1/(𝑏 − 𝑎) and 𝑏 = −𝑎 · (𝑏 − 𝑎) + 2𝑖. □


To extend this result to arbitrary dimension, we will use Kolmogorov’s superposition theorem. It states that every
continuous function of 𝑑 variables can be expressed as a composition of functions that each depend only on one
variable. We omit the technical proof, which can be found in [105].

Theorem 3.21 (Kolmogorov) For every 𝑑 ∈ N exist 2𝑑 2 + 𝑑 monotonically increasing functions 𝜑𝑖, 𝑗 ∈ 𝐶 0 (R),
𝑖 = 1, . . . , 𝑑, 𝑗 = 1, . . . , 2𝑑 + 1, such that for every 𝑓 ∈ 𝐶 0 ( [0, 1] 𝑑 ) there exist functions 𝑓 𝑗 ∈ 𝐶 0 (R), 𝑗 = 1, . . . , 2𝑑 + 1
satisfying

26
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
2𝑑+1 𝑑
!
∑︁ ∑︁
𝑓 (𝒙) = 𝑓𝑗 𝜑𝑖, 𝑗 (𝑥𝑖 ) for all 𝒙 ∈ [0, 1] 𝑑 .
𝑗=1 𝑖=1

Corollary 3.22 Let 𝑑 ∈ N. With the activation function 𝜎 : R → R from Proposition 3.20, for every compact 𝐾 ⊆ R𝑑 ,
every 𝜀 > 0 and every 𝑓 ∈ 𝐶 0 (𝐾) there exists Φ ∈ N𝑑1 (𝜎; 2, 2𝑑 2 + 𝑑) (i.e. width(Φ) = 2𝑑 2 + 𝑑 and depth(Φ) = 2)
such that

sup | 𝑓 (𝒙) − Φ(𝒙)| < 𝜀.


𝒙∈𝐾

Proof Without loss of generality we can assume 𝐾 = [0, 1] 𝑑 : the extension to the general case then follows by Tietze’s
extension theorem and a scaling argument as in the proof of Proposition 3.20.
Let 𝑓 𝑗 , 𝜑𝑖, 𝑗 , 𝑖 = 1, . . . , 𝑑, 𝑗 = 1, . . . , 2𝑑 + 1 be as in Theorem 3.21. Fix 𝜀 > 0. Let 𝑎 > 0 be so large that

sup sup |𝜑𝑖, 𝑗 (𝑥)| ≤ 𝑎.


𝑖, 𝑗 𝑥 ∈ [0,1]

Since each 𝑓 𝑗 is uniformly continuous on the compact set [−𝑑𝑎, 𝑑𝑎], we can find 𝛿 > 0 such that
𝜀
sup sup | 𝑓 𝑗 (𝑦) − 𝑓 𝑗 ( 𝑦˜ )| < . (3.2.1)
𝑗 | 𝑦− 𝑦˜ |< 𝛿 2(2𝑑 + 1)
| 𝑦 |, | 𝑦˜ | ≤𝑑𝑎

By Proposition 3.20 there exist 𝑤𝑖, 𝑗 , 𝑏 𝑖, 𝑗 ∈ R such that

𝛿
sup sup |𝜑𝑖, 𝑗 (𝑥) − 𝜎(𝑤𝑖, 𝑗 𝑥 + 𝑏 𝑖, 𝑗 ) | < (3.2.2)
𝑖, 𝑗 𝑥 ∈ [0,1] | {z } 𝑑
=: 𝜑˜ 𝑖, 𝑗 ( 𝑥 )

and 𝑤 𝑗 , 𝑏 𝑗 ∈ R such that


𝜀
sup sup | 𝑓 𝑗 (𝑦) − 𝜎(𝑤 𝑗 𝑦 + 𝑏 𝑗 ) | < . (3.2.3)
𝑗 | 𝑦 | ≤𝑎+ 𝛿 | {z } 2(2𝑑 + 1)
=: 𝑓˜𝑗 ( 𝑥 )

Then for all 𝒙 ∈ [0, 1] 𝑑 by (3.2.2)


𝑑 𝑑
∑︁ ∑︁ 𝛿
𝜑𝑖, 𝑗 (𝑥𝑖 ) − 𝜑˜ 𝑖, 𝑗 (𝑥𝑖 ) < 𝑑 = 𝛿.
𝑖=1 𝑖=1
𝑑

Thus with
𝑑
∑︁ 𝑑
∑︁
𝑦 𝑗 := 𝜑𝑖, 𝑗 (𝑥𝑖 ), 𝑦˜ 𝑗 := 𝜑˜ 𝑖, 𝑗 (𝑥𝑖 )
𝑗=1 𝑗=1

it holds |𝑦 𝑗 − 𝑦˜ 𝑗 | < 𝛿. Using (3.2.1) and (3.2.3) we conclude

27
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
2𝑑+1 𝑑
! ! 2𝑑+1
∑︁ ∑︁ ∑︁
𝑓 (𝒙) − 𝜎 𝑤𝑗 · 𝜎(𝑤𝑖, 𝑗 𝑥𝑖 + 𝑏 𝑖, 𝑗 ) + 𝑏 𝑗 = ( 𝑓 𝑗 (𝑦 𝑗 ) − 𝑓˜𝑗 ( 𝑦˜ 𝑗 ))
𝑗=1 𝑖=1 𝑗=1
2𝑑+1
∑︁
| 𝑓 𝑗 (𝑦 𝑗 ) − 𝑓 𝑗 ( 𝑦˜ 𝑗 )| + | 𝑓 𝑗 ( 𝑦˜ 𝑗 ) − 𝑓˜𝑗 ( 𝑦˜ 𝑗 )|


𝑗=1
2𝑑+1  
∑︁ 𝜀 𝜀
≤ + ≤ 𝜀.
𝑗=1
2(2𝑑 + 1) 2(2𝑑 + 1)

This concludes the proof. □


Kolmogorov’s superposition theorem is intriguing as it shows that approximating 𝑑-dimensional functions can be
reduced to the (generally much simpler) one-dimensional case through compositions. Neural networks, by nature, are
well suited to approximate functions with compositional structures. However, the functions 𝑓 𝑗 in Theorem 3.21, even
though only one-dimensional, could become very complex and challenging to approximate themselves if 𝑑 is large.
Similarly, the “magic” activation function in Proposition 3.20 encodes the information of all rational polynomials
on the unit interval, which is why a network of size 𝑂 (1) suffices to approximate every function to arbitrary accuracy.
Naturally, no practical algorithm can efficiently identify appropriate network weights and biases for this architecture.
As such, the results presented in Section 3.2 should be taken with a pinch of salt as their practical relevance is highly
limited. Nevertheless, they highlight that while universal approximation is a fundamental and important property of
neural networks, it leaves many aspects unexplored. To gain further insight into practically relevant architectures, in
the following chapters, we investigate networks with activation functions such as the ReLU.

Bibliography and further reading

The foundation of universal approximation theorems goes back to the late 1980s with seminal works by Cybenko [42],
Hornik et al. [85, 84], Funahashi [56] and Carroll and Dickinson [30]. These results were subsequently extended to a
wider range activation functions and architectures. The present analysis in Section 3.1 closely follows the arguments
in [116], where it was essentially shown that universal approximation can be achieved if the activation function is not
polynomial.
Kolmogorov’s superposition theorem stated in Theorem 3.21 was originally proven in 1957 [105]. For a more recent
and constructive proof see for instance [26]. Kolmogorov’s theorem and its obvious connections to neural networks
have inspired both theoretical works, e.g. [108, 182], and more recently also the design of neural network architectures
[120]. The idea for the “magic” activation function in Section 3.2 comes from [].

28
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Exercises

Exercise 3.23 Write down a generator of a (minimal) topology on 𝐶 0 (R𝑑 ) such that 𝑓𝑛 → 𝑓 ∈ 𝐶 0 (R𝑑 ) if and only if
cc
𝑓𝑛 −→ 𝑓 , and show this equivalence. This topology is referred to as the topology of compact convergence.

Exercise 3.24 Show the implication “⇒” of Theorem 3.8 and Corollary 3.17.

Exercise 3.25 Prove Lemma 3.12. Hint: Consider 𝜎(𝑛𝑥) for large 𝑛 ∈ N.

Exercise 3.26 Let 𝑘 ∈ N, 𝜎 ∈ M and assume that 𝜎 ∞


∫ ∗ 𝜑 ∈ P 𝑘 for all 𝜑 ∈ 𝐶𝑐 (R). Show that 𝜎 ∈ P 𝑘 .

Hint: Consider 𝜓 ∈ 𝐶𝑐 (R) such that 𝜓 ≥ 0 and R 𝜓(𝑥) d𝑥 = 1 and set 𝜓 𝜀 (𝑥) := 𝜓(𝑥/𝜀)/𝜀. Use that away from the
discontinuities of 𝜎 it holds 𝜓 𝜀 ∗ 𝜎(𝑥) → 𝜎(𝑥) as 𝜀 → 0. Conclude that 𝜎 is piecewise in P 𝑘 , and finally show that
𝜎 ∈ 𝐶 𝑘 (R).

Exercise 3.27 Prove Corollary 3.18 to Theorem 3.8.

29
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Chapter 4
Splines

In the universal approximation theorem (Theorem 3.8), we have seen that sufficiently large neural networks can
approximate every continuous function on a compact domain to an arbitrary accuracy. However, given a specific
function that we may want to approximate up to a certain error, we cannot conclude from this theorem how to choose
the architecture of our network.
What would be immensely more helpful would be a way to quantify how many parameters a neural network needs
to be able to achieve a specific error. Mathematically, we desire a statement of the following kind:
Let 𝑓 : R𝑑 → R. There exists a function ℎ : (0, 1) → R such that for every 𝜀 ∈ (0, 1) there exists a neural network
𝑓 𝑓 𝑓
Φ 𝜀 such that ∥ 𝑓 − Φ 𝜀 ∥ 𝐿 ∞ ( [0,1] 𝑑 ) ≤ 𝜀 and size(Φ 𝜀 ) ≤ ℎ(𝜀). Naturally, in the statement above, the function ℎ will
depend on 𝑓 or some properties of 𝑓 .
Statements providing a trade-off between certain properties of a function 𝑓 , an approximation accuracy, and the
number of parameters that an approximation with respect to a certain approximation scheme needs to have are
established in the field of approximation theory.
One of the most classical approximation scenarios is that we are attempting to approximate functions of a specified
smoothness. For example, for 𝑘, 𝑑 ∈ N, we could specify that 𝑓 is a 𝑑-dimensional function, has 𝑘 continuous
derivatives, and ∥ 𝑓 ∥ 𝐶 𝑘 ( [0,1] 𝑑 ) ≤ 1. How many parameters are now necessary to approximate 𝑓 up to a uniform error of
𝜀 > 0? We will see in the following section that if we are approximating 𝑓 by superpositions of simple basis functions,
so-called splines, then, up to a multiplicative constant, 𝜀 −𝑑/𝑘 basis functions suffice to achieve an error of 𝜀.
How do deep neural networks compare to the approximation by superpositions of splines? Interestingly, we will
observe in Section 4.2 that we can transfer the approximation performance of splines to deep neural networks. In other
words, from an approximation theoretical point of view, whatever is possible with superpositions of splines is possible
with deep neural networks!

4.1 B-splines and smooth functions

We introduce a simple type of spline and its approximation properties below.

Definition 4.1 For 𝑘 ∈ N, the univariate cardinal B-spline on [0, 𝑘] of order 𝑘 ∈ N is given by

𝑘  
1 ∑︁
ℓ 𝑘
N𝑘 (𝑥) B (−1) 𝜎ReLU (𝑥 − ℓ) 𝑘−1 , for 𝑥 ∈ R, (4.1.1)
(𝑘 − 1)! ℓ=0 ℓ

where we adopt the convention that 00 = 0 and 𝜎ReLU is the previously encountered ReLU function.

31
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
If we shift and dilate the cardinal B-spline, then we produce a system of univariate splines. If we furthermore take
tensor products of these functions, then we construct a set of high-dimensional functions. These are the multivariate
B-splines.
Definition 4.2 For 𝑡 ∈ R and 𝑘, ℓ ∈ N, we define Nℓ,𝑡 ,𝑘 B N𝑘 (2ℓ (· − 𝑡)). Additionally, we define for 𝑑 ∈ N, ℓ ∈ N,
𝒕 ∈ R𝑑 the multivariate B-splines Nℓ,𝒕,𝑘
𝑑
by

𝑑
Ö
𝑑
Nℓ,𝒕,𝑘 (𝒙) B Nℓ,𝑡𝑖 ,𝑘 (𝑥𝑖 ), for 𝒙 = (𝑥1 , . . . 𝑥 𝑑 ) ∈ R𝑑 .
𝑖=1

For a given 𝑘 ∈ N we can define the dictionary of B-splines of order 𝑘 as


n o
B 𝑘 B Nℓ,𝒕,𝑘
𝑑
: ℓ ∈ N, 𝒕 ∈ R𝑑 .

Having introduced the system B 𝑘 , we would like to understand how well we can represent each smooth function by
superpositions of elements of B 𝑘 . One result, that fits our setup perfectly is the following. 1
Theorem 4.3 ([147, Theorem 7]) Let 𝑑, 𝑘 ∈ N, 𝑝 ∈ (0, ∞], 0 < 𝑠 ≤ 𝑘. Then there exist 𝐶, 𝐶 ′ > 0 such that, for every
𝑓 ∈ 𝐶 𝑠 ( [0, 1] 𝑑 ), we have that, for every 𝛿 > 0 and every 𝑁 ∈ N, there exists 𝑐 𝑖 ∈ R with |𝑐 𝑖 | ≤ 𝐶 ∥ 𝑓 ∥ ∞ and 𝐵𝑖 ∈ B 𝑘
for 𝑖 = 1, . . . , 𝑁 such that
𝑁
∑︁ 𝛿−𝑠
𝑓− 𝑐 𝑖 𝐵𝑖 ≤ 𝐶 ′ 𝑁 𝑑 ∥ 𝑓 ∥ 𝐶 𝑠 [0,1] 𝑑 .
𝑖=1 𝐿 𝑝 [0,1] 𝑑

Remark 4.4 There are a couple of critical concepts in Theorem 4.3 that will reappear throughout this course. The
number of parameters that the approximation uses is 𝑁 and is linked to the approximation accuracy, which is 𝑁 ( 𝛿−𝑠)/𝑑 .
This implies that for a given accuracy 𝜀 > 0 we need of the order of 𝜀 −𝑑/( 𝛿−𝑠) parameters. This number of parameters
grows very quickly if the dimension 𝑑 of the function increases. This influence of 𝑑 is referred to as the curse of
dimension and will be discussed extensively in the later chapters of this course. On the other hand, the smoothness of 𝑓
has the opposite effect. For the same accuracy, smoother functions can be approximated with much fewer B-splines than
rough functions. However, since 𝑘 ≥ 𝑠 is required, this more efficient approximation is only possible if the underlying
system is comprised of more complex B-splines. We will later see, that the degree of the B-splines that unlocks a higher
approximation rate correspond to the depth when we approximate with neural networks.

4.2 Reapproximation of B-splines with sigmoidal activations

Now, we will demonstrate that the approximation rates of B-splines can be transfered to certain deep neural networks.
The idea that we follow below is due to [127]. First we need an assumption on the activation functions. In the following,
we use higher-order sigmoidal activation functions.
Definition 4.5 A function 𝜎 : R → R is called sigmoidal of order 𝑞 ∈ N, if 𝜎 ∈ 𝐶 𝑞−1 (R) and there exists 𝐶 > 0 such
that
𝜎(𝑥) 𝜎(𝑥)
→ 0, for 𝑥 → −∞, → 1, for 𝑥 → ∞, and
𝑥𝑞 𝑥𝑞
|𝜎(𝑥)| ≤ 𝐶 · (1 + |𝑥|) 𝑞 , for all 𝑥 ∈ R.

Remark 4.6 Examples of activation functions that are sigmoidal of order 𝑞 are, for example, powers of the ReLU, i.e.,
𝑥 ↦→ 𝜎ReLU (𝑥) 𝑞 .
1 In [147, Theorem 7] this statement is formulated for more general spaces. We formulate it in a simplified setting to not introduce overly
complex smoothness spaces.

32
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
To show that neural networks with higher-order sigmoidal activation functions can approximate sums of 𝑁 B-
splines with a number of parameters that is proportional to 𝑁, we first demonstrate that the cardinal B-spline can be
approximated to arbitrary accuracy with a neural network of fixed architecture.

Proposition 4.7 Let 𝑘, 𝑑 ∈ N, 𝑘 ≥ 2, 𝐾 > 0, and 𝜎 : R → R be sigmoidal of order 𝑞 ≥ 2. There exists a constant
𝐶 > 0 such that for every 𝜀 > 0 there is a neural network Φ N𝑘 with activation function 𝜎, ⌈log𝑞 (𝑘 − 1)⌉ layers and
size 𝐶, such that
N𝑘 − Φ N𝑘 𝐿 ∞ ( [−𝐾 ,𝐾 ] 𝑑 ) ≤ 𝜀.

Proof To build a neural network approximating N𝑘 , we consider (4.1.1). Note that, (4.1.1) is a sum of 𝑘 shifts of
𝑘−1 . Therefore, we first want to approximate the term 𝜎 𝑘−1 arbitrarily well by a neural network. It is not hard to see
𝜎ReLU ReLU
(Exercise 4.11) that, for every 𝐾 ′ > 0, and for 𝑡 ∈ N,

𝑎 −𝑞 𝜎 ◦ 𝜎 ◦ · · · ◦ 𝜎(𝑎𝑥) −𝜎ReLU (𝑥) 𝑞 → 0 for 𝑎 → ∞


𝑡 𝑡
(4.2.1)
| {z }
𝑡 − times

uniformly for all 𝑥 ∈ [−𝐾 ′ , 𝐾 ′ ].


We choose 𝑡 B ⌈log𝑞 (𝑘 − 1)⌉. Note that, 𝑡 ≥ 1 by the assumption that 𝑘 ≥ 2. We have that, 𝑞 𝑡 ≥ 𝑘 − 1. We conclude
𝑡
that, for every 𝐾 ′ > 0 and 𝜀 > 0 there exists a neural network Φ𝑞𝜀 having ⌈log𝑞 (𝑘 − 1)⌉ layers and satisfying
𝑡 𝑡
Φ𝑞𝜀 (𝑥) − 𝜎ReLU (𝑥) 𝑞 ≤ 𝜀, (4.2.2)

for all 𝑥 ∈ [−𝐾 ′ , 𝐾 ′ ].


With (4.2.2), we have successfully reproduced a power of the ReLU activation function. However, the exponent 𝑞 𝑡
𝑡
could be larger than 𝑘 − 1. To reduce the order down to 𝑘 − 1, we implement an approximate derivative of Φ𝑞𝜀 .
We will next prove the following claim: For all 1 ≤ 𝑝 ≤ 𝑞 𝑡 for every 𝐾 ′ > 0 and 𝜀 > 0 there exists a neural network
𝑝
Φ 𝜀 having ⌈log𝑞 (𝑘 − 1)⌉ layers and satisfying

Φ 𝜀𝑝 (𝑥) − 𝜎ReLU (𝑥) 𝑝 ≤ 𝜀, (4.2.3)

for all 𝑥 ∈ [−𝐾 ′ , 𝐾 ′ ].


We have shown the claim for 𝑝 = 𝑞 𝑡 already, so (4.2.3) will follow by induction by the following argument. Let
𝛿 ≥ 0, then we observe that

Φ 𝑝𝛿2 (𝑥 + 𝛿) − Φ 𝑝𝛿2 (𝑥)


− 𝜎ReLU (𝑥) 𝑝−1 (4.2.4)
𝑝𝛿
𝛿 𝜎ReLU (𝑥 + 𝛿) 𝑝 − 𝜎ReLU (𝑥) 𝑝
≤2 + − 𝜎ReLU (𝑥) 𝑝−1 . (4.2.5)
𝑝 𝑝𝛿

Hence, by the binomial theorem it follows that there exists 𝛿∗ > 0 such that

Φ 𝑝𝛿2 (𝑥 + 𝛿∗ ) − Φ 𝑝𝛿2 (𝑥)


∗ ∗
− 𝜎ReLU (𝑥) 𝑝−1 ≤ 𝜀,
𝑝𝛿∗

for all 𝑥 ∈ [−𝐾 ′ , 𝐾 ′ ]. By Proposition 2.3, (Φ 𝑝𝛿2 (𝑥 + 𝛿∗ ) − Φ 𝑝𝛿2 )/( 𝑝𝛿∗ ) is a neural network with ⌈log𝑞 (𝑘 − 1)⌉ layers
∗ ∗
and size independent from 𝜀. We can call this neural network Φ 𝜀𝑝−1 and conclude that (4.2.3) holds for all 1 ≤ 𝑝 ≤ 𝑞 𝑡 .
By the definition of neural networks, it is clear that for every neural network also every spatial translation of it is
a neural network with the same architecture. Hence, considering (4.1.1), we observe that N𝑘 can be approximated to

33
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
arbitrary accuracy by a sum of neural networks of a fixed size. Since by Proposition 2.3, sums of neural networks with
the same depth are again neural networks with the same depth, the result follows. □
With Proposition 4.7, we can approximate the cardinal B-spline to arbitrary accuracy by a fixed size neural network.
Next, we extend this result to the multivariate splines Nℓ,𝒕,𝑘
𝑑
for arbitrary ℓ, 𝑑 ∈ N, 𝒕 ∈ R𝑑 .

Proposition 4.8 Let 𝑘, 𝑑 ∈ N, 𝑘 ≥ 2, 𝐾 > 0, and 𝜎 : R → R be sigmoidal of order 𝑞 ≥ 2. Further let ℓ ∈ N and
𝒕 ∈ R𝑑 .
Then, there exists a constant 𝐶 > 0 such that for every 𝜀 > 0 there is a neural network Φ Nℓ,𝒕 ,𝑘 with activation
𝑑

function 𝜎, ⌈log2 (𝑑)⌉ + ⌈log𝑞 (𝑘 − 1)⌉ layers and size 𝐶, such that

− Φ Nℓ,𝒕 ,𝑘
𝑑
𝑑
Nℓ,𝒕,𝑘 ≤ 𝜀.
𝐿 ∞ ( [−𝐾 ,𝐾 ] 𝑑 )

Proof The function Nℓ,𝒕,𝑘 𝑑


is a product of the univariate functions Nℓ,𝑡𝑖 ,𝑘 = N𝑘 (2ℓ (· − 𝑡)). We observe that, Nℓ,𝑡𝑖 ,𝑘 is
just N𝑘 composed with an affine transformation and hence, by Proposition 4.7, there exist a constant 𝐶 ′ > 0 such that
for each 𝑖 = 1, . . . , 𝑑 and all 𝜀 > 0, there is a neural network Φ Nℓ,𝑡𝑖 ,𝑘 with size 𝐶 ′ and ⌈log𝑞 (𝑘 − 1)⌉ layers such that

Nℓ,𝑡𝑖 ,𝑘 − Φ Nℓ,𝑡𝑖 ,𝑘 𝐿 ∞ ( [−𝐾 ,𝐾 ] 𝑑 )


≤ 𝜀.

This completes the proof for 𝑑 = 1. For general 𝑑, we need to show that the product of the Φ Nℓ,𝑡𝑖 ,𝑘 for 𝑖 = 1, . . . , 𝑑 can
be approximated well.
As an intermediate step, we prove the following claim by induction: For 𝑑 ∈ N, 𝑑 ≥ 2 there exists a constant 𝐶 ′′ > 0,
such that for all 𝐵 ≥ 1 and 𝜀 > 0 there exists a neural network Φmult with size 𝐶 ′′ , ⌈log2 (𝑑)⌉ layers, and activation
function 𝜎 such that for all 𝑥1 , . . . , 𝑥 𝑑 with |𝑥𝑖 | ≤ 𝐵 for all 𝑖 = 1, . . . , 𝑑,
𝑑
Ö
Φmult, 𝜀,𝑑 (𝑥1 , . . . , 𝑥 𝑑 ) − 𝑥𝑖 < 𝜀. (4.2.6)
𝑖=1

We first show the claim for 𝑑 = 2. We use a similar argument as in the proof of Proposition 4.7. Note that, we
have that there exists a neural network of only one hidden layer which arbitrarily well approximates the function
𝑥 ↦→ 𝜎ReLU (𝑥) 𝑞 . Moreover, since 𝑞 ≥ 2 the argument implying (4.2.3) yields that there exists 𝐶 ′′′ > 0 such that for
every 𝜀 and 𝐵 ≥ 1 there exists a neural network Φsquare, 𝜀 with only one hidden layer and size 𝐶 ′′′ such that

|Φsquare, 𝜀 − 𝜎ReLU (𝑥) 2 | ≤ 𝜀 for all |𝑥| ≤ 𝐵.

Note that, for every 𝑥 = (𝑥1 , 𝑥2 ) ∈ R2


1  
𝑥1 𝑥2 = (𝑥1 + 𝑥 2 ) 2 − 𝑥 12 − 𝑥 22
2
1 
= 𝜎ReLU (𝑥 1 + 𝑥2 ) 2 + 𝜎ReLU (−𝑥 1 − 𝑥2 ) 2 − 𝜎ReLU (𝑥 1 ) 2
2 
− 𝜎ReLU (−𝑥1 ) 2 − 𝜎ReLU (𝑥2 ) 2 − 𝜎ReLU (−𝑥 2 ) 2 . (4.2.7)

As we have seen before, we can approximate for a given 𝜀 > 0 each of the terms on the right-hand side of (4.2.7) up to
an accuracy 𝜀/6 by a neural network with one hidden layer and size independent of 𝜀. By Proposition 2.3, we conclude
that there exists a neural network Φmult, 𝜀,2 satisfying the claim of the induction (4.2.6).
For arbitrary 𝑑 > 2, and arbitrary 𝜀 > 0 and 𝐵 ≥ 1, we observe that
𝑑
Ö ⌊𝑑/2⌋
Ö 𝑑
Ö
𝑥𝑖 = 𝑥𝑖 · 𝑥𝑖 . (4.2.8)
𝑖=1 𝑖=1 𝑖=⌊𝑑/2⌋+1

34
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
We will now approximate each of the terms in the product on the right-hand side of (4.2.8) by a neural network using the
induction assumption. To keep the technicalities under control, we assume that ⌈log2 ( ⌊𝑑/2⌋)⌉ = ⌈log2 (𝑑 − ⌊𝑑/2⌋)⌉ =
⌈log2 ( ⌈𝑑/2⌉)⌉ so that both networks have the same depth. The general case, can be addressed via Proposition 3.19.
We have by the induction assumption, that there exist neural networks Φmult,1 and Φmult,2 both with ⌈log2 ( ⌊𝑑/2⌋)⌉
layers, such that for all 𝑥𝑖 with |𝑥 𝑖 | ≤ 𝐵, for 𝑖 = 1, . . . , 𝑑, it holds that
⌊𝑑/2⌋
Ö
Φmult,1 (𝑥1 , . . . , 𝑥 ⌊𝑑/2⌋ ) − 𝑥𝑖 < 𝜀/(4(𝐵 ⌊𝑑/2⌋ + 𝜀)),
𝑖=1
𝑑
Ö
Φmult,2 (𝑥 ⌊𝑑/2⌋+1 , . . . , 𝑥 𝑑 ) − 𝑥𝑖 < 𝜀/(4(𝐵 ⌊𝑑/2⌋ + 𝜀)).
𝑖=⌊𝑑/2⌋+1

By Proposition 2.3, we now have that Φmult, 𝜀,𝑑 B Φmult, 𝜀/2,2 ◦ (Φmult,1 , Φmult,2 ) is a neural network with 1 +
⌈log2 ( ⌊𝑑/2⌋)⌉ = ⌈log2 (𝑑)⌉ layers. Here we chose the Φmult, 𝜀/2,2 with 𝐵 = 𝐵 ⌈𝑑/2⌉ + 𝜀. It is evident from the construction
that the size of Φmult, 𝜀,𝑑 does not depend on 𝐵 or 𝜀. Thus, to complete the induction, we only need to show (4.2.6).
Note that by the triangle inequality, for all 𝑎, 𝑏, 𝑐, 𝑑 ∈ R, it holds that

|𝑎𝑏 − 𝑐𝑑| ≤ |𝑎||𝑏 − 𝑑| + |𝑑||𝑎 − 𝑐|.

Hence, for 𝑥1 , . . . , 𝑥 𝑑 with |𝑥𝑖 | ≤ 𝐵 for all 𝑖 = 1, . . . , 𝑑, we have that


𝑑
Ö
𝑥 𝑖 − Φmult, 𝜀,𝑑 (𝑥1 , . . . , 𝑥 𝑑 )
𝑖=1
⌊𝑑/2⌋ 𝑑
𝜀 Ö Ö
≤ + 𝑥𝑖 · 𝑥𝑖 − Φmult,1 (𝑥1 , . . . , 𝑥 ⌊𝑑/2⌋ )Φmult,2 (𝑥 ⌊𝑑/2⌋+1 , . . . , 𝑥 𝑑 )
2 𝑖=1 𝑖=⌊𝑑/2⌋+1
𝜀 𝜀 𝜀
≤ + |𝐵| ⌊𝑑/2⌋ ⌊𝑑/2⌋
+ (|𝐵| ⌈𝑑/2⌉ + 𝜀) ⌊𝑑/2⌋
< 𝜀.
2 4(𝐵 + 𝜀) 4(𝐵 + 𝜀)
This completes the proof of (4.2.6).
The overall result follows by using Proposition 2.3 to show that the multiplication network can be composed with
a neural network comprised of the Φ Nℓ,𝑡𝑖 ,𝑘 for 𝑖 = 1, . . . , 𝑑. Since in no step above the size of the individual networks
was dependent on the approximation accuracy, this is also true for the final network. □
Proposition 4.8 shows that we can approximate a single multivariate B-spline with a neural network with a size
that is independent from the accuracy. Now, we need to combine this observation with Theorem 4.3 to show how well
neural networks can approximate smooth functions.
Theorem 4.9 Let 𝑑, 𝑘 ∈ N, 𝑝 ∈ (0, ∞], 0 < 𝑠 ≤ 𝑘 and 𝑘 ≥ 2. Further let 𝛿 > 0. For 𝑞 ≥ 2, let 𝜎 be sigmoidal of order
𝑞.
Then there exists 𝐶 > 0 such that, for every 𝑓 ∈ 𝐶 𝑠 ( [0, 1] 𝑑 ) and every 𝜀 > 0 there exists a neural network network
Φ with activation function 𝜎, ⌈log2 (𝑑)⌉ + ⌈log𝑞 (𝑘 − 1)⌉ layers and at most size 𝐶𝜀 −𝑑/( 𝛿−𝑠) , such that
𝑓

𝑓 − Φ𝑓 𝐿 𝑝 ( [0,1] 𝑑 )
≤ 𝜀 · (1 + ∥ 𝑓 ∥ 𝐶 𝑠 ( [0,1] 𝑑 ) ).

Proof We first observe that, due to the compact domain [0, 1] 𝑑 , it suffices to show the estimate for 𝑝 = ∞.
Choose 𝜀 > 0, then invoking Theorem 4.3, we have that for 𝑁 ≥ (𝜀/𝐶 ′ ) 𝑑/( 𝛿−𝑠) there exist for 𝑖 = 1, . . . , 𝑁,
coefficients |𝑐 𝑖 | ≤ 𝐶 ∥ 𝑓 ∥ ∞ , and 𝐵𝑖 ∈ B 𝑘 such that
𝑁
∑︁
𝑓− 𝑐 𝑖 𝐵𝑖 ≤ 𝜀∥ 𝑓 ∥ 𝐶 𝑠 .
𝑖=1 𝐿 ∞ ( [0,1] 𝑑 )

35
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Moreover, by Proposition 4.8, there exists for each 𝑖 = 1, . . . , 𝑁 a neural network Φ 𝐵𝑖 with ⌈log2 (𝑑)⌉ + ⌈log𝑞 (𝑘 − 1)⌉
layers, a fixed size, approximating 𝐵𝑖 on [−1, 1] 𝑑 ⊃ [0, 1] 𝑑 up to an error of 𝜀/(𝐶𝑁). The size of Φ 𝐵𝑖 is independent
of 𝑖, 𝜀, and 𝐶.
Í𝑁
By Proposition 2.3, it is clear that there exists a neural network, which we call Φ 𝑓 , that approximates 𝑖=1 𝑐 𝑖 𝐵𝑖 to
an error of 𝜀, and has ⌈log2 (𝑑)⌉ + ⌈log𝑞 (𝑘 − 1)⌉ layers. The size of this network is linear in 𝑁 (see Exercise 4.12).
The final result now follows directly with the triangle inequality. □

Remark 4.10 Theorem 4.9 shows that neural networks with higher-order sigmoidal functions can approximate smooth
functions with the same accuracy as spline approximations while having a comparable number of parameters. We see
that, since 𝑠 ≤ 𝑘, this result requires deeper neural networks if the functions to be approximated are smoother. On the
other hand, smoother functions can be approximated more efficiently in terms of the size of the neural network.

Bibliography and further reading

The idea of first approximating basis functions by neural networks and then lifting approximation results that exist for
those bases to neural networks has been employed extensively. This concept will also appear repeatedly in the rest of this
book. Since the rest of this manuscript focuses mostly on the ReLU activation function, we use this section to highlight
a couple of approaches with non-ReLU activation functions next: To approximate analytic functions, [128] emulates
a monomial basis. To approximate periodic functions, a basis of trigonometric polynomials is recreated in [129].
Wavelets bases have been emulated in [150]. Moreover, neural networks are studied through the representation system
of ridgelets in [28] or generally ridge functions [92]. A general framework describing emulation of representation
systems to carry over approximation results was described in [21].

36
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Exercises

Exercise 4.11 Show that (4.2.1) holds.

Exercise 4.12 Let 𝐿 ∈ N, and Φ1 , Φ2 be two neural networks with architecture (𝑑0 , 𝑑1(1) , . . . , 𝑑 𝐿(1) , 𝑑 𝐿+1 ) and
(𝑑0 , 𝑑1(2) , . . . , 𝑑 𝐿(2) , 𝑑 𝐿+1 ). Show that Φ1 + Φ2 is a neural network with size(Φ1 + Φ2 ) ≤ size(Φ1 ) + size(Φ2 ).
2
Exercise 4.13 Show that, for 𝜎 = 𝜎ReLU and 𝑠 ≤ 2, for all 𝑓 ∈ 𝐶 𝑠 ( [0, 1] 𝑑 ) all weights of the approximating neural
network of Theorem 4.9 can be bounded in absolute value by 𝑂 (max{2, ∥ 𝑓 ∥ 𝐶 𝑠 ( [0,1] 𝑑 ) }).

37
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Chapter 5
ReLU neural networks

In this section, we discuss feedforward neural networks using the ReLU activation function 𝜎ReLU introduced in Section
2.3. We refer to these functions as ReLU neural networks. Due to its simplicity and the fact that it reduces the vanishing
and exploding gradients phenomena, it is one of the most widely used in practice.
A key component of the proofs in the previous chapters was the approximation of derivatives of the activation
function to emulate polynomials. Since the ReLU is piecewise linear, this trick is not applicable. This makes the
analysis fundamentally different from the case of smoother activation functions. Nonetheless, we will see that even
this extremely simple activation function yields a very rich class of functions possessing remarkable approximation
capabilities.
To formalize these results, we begin this chapter by adopting a framework from [153]. This framework enables the
tracking of the number of network parameters for basic manipulations such as adding up or composing two neural
networks. This will allow to bound the network complexity, when constructing more elaborate networks from simpler
ones. With these preliminaries at hand, the rest of the chapter is dedicated to the exploration of links between ReLU
neural networks and the class of “continuous piecewise linear functions.” In Section 5.2, we will see that every such
function can be exactly represented by a ReLU neural network. Afterwards, in Section 5.3 we will give a more detailed
analysis of the required network complexity. Finally, we will use these results to prove a first approximation theorem
for ReLU neural networks in Section 5.4. The argument is similar in spirit to Chapter 4, in that we transfer established
approximation theory for piecewise linear functions to the class of ReLU neural networks of a certain architecture.

5.1 Basic ReLU calculus

The goal of this section is to formalize how to combine and manipulate ReLU neural networks. We have seen an
instance of such a result already in Proposition 2.3. Now we want to make this result more precise under the assumption
that the activation function is the ReLU. We sharpen Proposition 2.3 by adding bounds on the number of weights that
the resulting neural networks have. The following four operations form the basis of all constructions in the sequel.
• Reproducing an identity: We have seen in Proposition 3.19 that for most activation functions, an approximation to
the identity can be built by neural networks. For ReLUs, we can have an even stronger result and reproduce the
identity exactly. This identity will play a crucial role in order to extend certain neural networks to deeper neural
networks, and to facilitate an efficient composition operation.
• Composition: We saw in Proposition 2.3 that we can produce a composition of two neural networks and the
resulting function is a neural network as well. There we did not study the size of the resulting neural networks. For
ReLU activation functions, this composition can be done in a very efficient way leading to a neural network that
has up to a constant not more than the number of weights of the two initial neural networks.
• Parallelization: Also the parallelization of two neural networks was discussed in Proposition 2.3. We will refine
this notion and make precise the size of the resulting neural networks.

39
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
• Linear combinations: Similarly, for the sum of two neural networks, we will give precise bounds on the size of the
resulting neural network.

5.1.1 Identity

We start with expressing the identity on R𝑑 as a neural network of depth 𝐿 ∈ N.

Lemma 5.1 (Identity) Let 𝐿 ∈ N. Then, there exists a ReLU neural network Φid id
𝐿 such that Φ 𝐿 (𝒙) = 𝒙 for all 𝒙 ∈ R .
𝑑
id id id
Moreover, depth(Φ 𝐿 ) = 𝐿, width(Φ 𝐿 ) = 2𝑑, and size(Φ 𝐿 ) = 2𝑑 · (𝐿 + 1).

Proof Writing 𝑰 𝑑 ∈ R𝑑×𝑑 for the identity matrix, we choose the weights

(𝑾 (0) , 𝒃 (0) ), . . . , (𝑾 (𝐿) , 𝒃 (𝐿) )


  
𝑰𝑑
:= , 0 , ( 𝑰2𝑑 , 0), . . . , ( 𝑰2𝑑 , 0) , (( 𝑰 𝑑 , −𝑰 𝑑 ), 0).
−𝑰 𝑑 | {z }
𝐿−1 times

Using that 𝑥 = 𝜎ReLU (𝑥) − 𝜎ReLU (−𝑥) for all 𝑥 ∈ R and 𝜎ReLU (𝑥) = 𝑥 for all 𝑥 ≥ 0 it is obvious that the neural
network Φid
𝐿 associated to the weights above satisfies the assertion of the lemma. □

We will see in Exercise 5.23 that the property to exactly represent the identity is not shared by sigmoidal activation
functions. It does hold for polynomial activation functions, though.

5.1.2 Composition

Assume we have two neural networks Φ1 , Φ2 with architectures (𝜎ReLU ; 𝑑01 , . . . , 𝑑 1𝐿1 +1 ) and (𝜎ReLU ; 𝑑02 , . . . , 𝑑 2𝐿1 +1 )
respectively. Moreover, we assume that they have weights and biases given by

(𝑾1(0) , 𝒃 1(0) ), . . . , (𝑾1(𝐿1 ) , 𝒃 1(𝐿1 ) ), and (𝑾2(0) , 𝒃 2(0) ), . . . , (𝑾2(𝐿2 ) , 𝒃 2(𝐿2 ) ),

respectively. If the output dimension 𝑑 1𝐿1 +1 of Φ1 equals the input dimension 𝑑02 of Φ2 , we can define two types of
concatenations: First Φ2 ◦ Φ1 is the neural network with weights and biases given by
     
𝑾1(0) , 𝒃 1(0) , . . . , 𝑾1(𝐿1 −1) , 𝒃 1(𝐿1 −1) , 𝑾2(0) 𝑾1(𝐿1 ) , 𝑾2(0) 𝒃 1(𝐿1 ) + 𝒃 2(0) ,
   
𝑾2(1) , 𝒃 2(1) , . . . , 𝑾2(𝐿2 ) , 𝒃 2(𝐿2 ) .

Second, Φ2 • Φ1 is the neural network defined as Φ2 ◦ Φid


1 ◦ Φ1 . In terms of weighs and biases, Φ2 • Φ1 is given as
! !!
   𝑾1( 𝐿1 )
 𝒃 1(𝐿1 )
𝑾1(0) , 𝒃 1(0)
,..., 𝑾1(𝐿1 −1) , 𝒃 1(𝐿2 −1)
, , ,
−𝑾1(𝐿1 ) −𝒃 1(𝐿1 )
      
𝑾2(0) , −𝑾2(0) , 𝒃 2(0) , 𝑾2(1) , 𝒃 2(1) , . . . , 𝑾2(𝐿2 ) , 𝒃 2(𝐿2 ) .

The following lemma collects the properties of the construction above.

Lemma 5.2 (Composition) Let Φ1 , Φ2 be neural networks with architectures (𝜎ReLU ; 𝑑01 , . . . , 𝑑 1𝐿1 +1 ) and (𝜎ReLU ; 𝑑02 , . . . , 𝑑 2𝐿1 +1 ).
0
Assume 𝑑 1𝐿1 +1 = 𝑑02 . Then Φ2 ◦ Φ1 (𝒙) = Φ2 • Φ1 (𝒙) = Φ2 (Φ1 (𝒙)) for all 𝒙 ∈ R𝑑1 . Moreover,

40
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
width(Φ2 ◦ Φ1 ) ≤ max{width(Φ1 ), width(Φ2 )},
depth(Φ2 ◦ Φ1 ) = depth(Φ1 ) + depth(Φ2 ),
size(Φ2 ◦ Φ1 ) ≤ size(Φ1 ) + size(Φ2 ) + (𝑑1𝐿1 + 1)𝑑21 ,

and

width(Φ2 • Φ1 ) ≤ 2 max{width(Φ1 ), width(Φ2 )},


depth(Φ2 • Φ1 ) = depth(Φ1 ) + depth(Φ2 ) + 1,
size(Φ2 • Φ1 ) ≤ 2(size(Φ1 ) + size(Φ2 )).
0
Proof The fact that Φ2 ◦ Φ1 (𝒙) = Φ2 • Φ1 (𝒙) = Φ2 (Φ1 (𝒙)) for all 𝒙 ∈ R𝑑1 follows immediately from the construction.
𝑑 2 ×𝑑 1
The same can be said for the width and depth bounds. To confirm the size bound, we note that 𝑾2(0) 𝑾1(𝐿1 ) ∈ R 1 𝐿1
2
and hence 𝑾2(0) 𝑾1(𝐿1 ) has not more than 𝑑12 × 𝑑 1𝐿1 (nonzero) entries. Moreover, 𝑾2(0) 𝒃 1(𝐿1 ) + 𝒃 2(0) ∈ R𝑑1 . Thus, the
𝐿 1 -th layer of Φ2 ◦ Φ1 (𝒙) has at most 𝑑12 × (1 + 𝑑 1𝐿1 ) entries. The rest is obvious from the construction. □
Interpreting linear transformations as neural networks of depth 0, the previous lemma is also valid in case Φ1 or Φ2
is a linear mapping.

5.1.3 Parallelization

Next, we wish to put neural networks in parallel. Let (Φ𝑖 )𝑖=1


𝑚 be neural networks with architectures (𝜎
ReLU ; 𝑑 0 , . . . , 𝑑 𝐿𝑖 +1 ),
𝑖 𝑖

respectively. We proceed to build a neural network (Φ1 , . . . , Φ𝑚 ) such that


Í𝑚 𝑗
Í𝑚 𝑗
𝑑0 𝑑𝐿
(Φ1 , . . . , Φ𝑚 ) : R 𝑗=1 →R 𝑗=1 𝑗 +1 (5.1.1)
(𝒙 1 , . . . , 𝒙 𝑚 ) ↦→ (Φ1 (𝒙 1 ), . . . , Φ𝑚 (𝒙 𝑚 )).

To do so we first assume 𝐿 1 = · · · = 𝐿 𝑚 = 𝐿, and define (Φ1 , . . . , Φ𝑚 ) via the following sequence of weight-bias
tuples:
(0) (0) (𝐿) (𝐿)
©©𝑾1 ª © 𝒃 1 ªª ©©𝑾1 ª © 𝒃 1 ªª
.. ® ­ .. ®® .. ® ­ .. ®®
(5.1.2)
­­ ­­
­­ . ® , ­ . ®® , . . . , ­­ . ® , ­ . ®®
­­ ® ­ ®® ­­ ® ­ ®®
(0) (0) (𝐿) (𝐿)
«« 𝑾𝑚 ¬ « 𝒃 𝑚 ¬¬ «« 𝑾 𝑚 ¬ « 𝒃 𝑚 ¬¬
where these matrices are understood as block-diagonal filled up with zeros. For the general case where the Φ 𝑗 might
have different depths, let 𝐿 max := max1≤𝑖 ≤𝑚 𝐿 𝑖 and 𝐼 := {1 ≤ 𝑖 ≤ 𝑚 | 𝐿 𝑖 < 𝐿 max }. For 𝑗 ∈ 𝐼 𝑐 set Φ
e 𝑗 := Φ 𝑗 , and for
each 𝑗 ∈ 𝐼
e 𝑗 := Φid
Φ 𝐿max −𝐿 𝑗 ◦ Φ 𝑗 . (5.1.3)

Finally,

(Φ1 , . . . , Φ𝑚 ) := ( Φ e 𝑚 ).
e1, . . . , Φ (5.1.4)

We collect the properties of the parallelization in the lemma below.


Lemma 5.3 (Parallelization) Let 𝑚 ∈ N and (Φ𝑖 )𝑖=1 𝑚 be neural networks with architectures (𝜎
ReLU ; 𝑑 0 , . . . , 𝑑 𝐿𝑖 +1 ),
𝑖 𝑖

respectively. Then the neural network (Φ1 , . . . , Φ𝑚 ) satisfies


Í𝑚 𝑗
𝑑0
(Φ1 , . . . , Φ𝑚 ) (𝒙) = (Φ1 (𝒙 1 ), . . . , Φ𝑚 (𝒙 𝑚 )) for all 𝒙 ∈ R 𝑗=1 .

41
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Moreover, with 𝐿 max := max 𝑗 ≤𝑚 𝐿 𝑗 it holds that
𝑚
∑︁
width((Φ1 , . . . , Φ𝑚 )) ≤ 2 width(Φ 𝑗 ), (5.1.5a)
𝑗=1

depth((Φ1 , . . . , Φ𝑚 )) = max depth(Φ 𝑗 ), (5.1.5b)


𝑗 ≤𝑚
𝑚
∑︁ 𝑚
∑︁
𝑗
size((Φ1 , . . . , Φ𝑚 )) ≤ 2 size(Φ 𝑗 ) + 2 (𝐿 max − 𝐿 𝑗 )𝑑 𝐿 𝑗 +1 . (5.1.5c)
𝑗=1 𝑗=1

Proof All statements except for the bound on the size follow immediately from the construction. To obtain the bound
on the size, we note that by construction the sizes of the ( Φ
e 𝑖 ) 𝑚 in (5.1.3) will simply be added. The size of each Φ
𝑖=1
e𝑖
can be bounded with Lemma 5.2. □

If all input dimensions 𝑑01 = · · · = 𝑑0𝑚 C 𝑑0 are the same, we will also use parallelization with shared inputs to
𝑑1 +···+𝑑 𝑚
realize the function 𝒙 ↦→ (Φ1 (𝒙), . . . , Φ𝑚 (𝒙)) from R𝑑0 → R 𝐿1 +1 𝐿𝑚 +1
. In terms of the construction (5.1.2), the
Í𝑚 𝑗 1
(0)
only required change is that the block-diagonal matrix diag(𝑾1 , . . . , 𝑾𝑚(0) ) becomes the matrix in R 𝑗=1 𝑑1 ×𝑑0 which
stacks 𝑾1(0) , . . . , 𝑾𝑚(0) on top of each other. Similarly, we will allow Φ 𝑗 to only take some of the entries of 𝒙 as input.
For parallelization with shared inputs we will use the same notation (Φ 𝑗 ) 𝑚 𝑗=1 as before, where the precise meaning will
always be clear from context. Note that Lemma 5.3 remains valid in this case.

5.1.4 Linear combinations

Let 𝑚 ∈ N and let (Φ𝑖 )𝑖=1 𝑚 be ReLU neural networks that have architectures (𝜎
ReLU ; 𝑑 0 , . . . , 𝑑 𝐿𝑖 +1 ), respectively.
𝑖 𝑖
1
Assume that 𝑑 𝐿1 +1 = · · · = 𝑑 𝐿𝑚 +1 , i.e., all Φ1 , . . . , Φ𝑚 have the same output dimension. For scalars 𝛼 𝑗 ∈ R, we wish
𝑚
Í
to construct a ReLU neural network 𝑚 𝑗=1 𝛼 𝑗 Φ 𝑗 realizing the function

𝑑1
( Í𝑚 𝑗
R 𝑗=1 𝑑0 → R 𝐿1 +1
Í
(𝒙1 , . . . , 𝒙 𝑚 ) ↦→ 𝑚𝑗=1 𝛼 𝑗 Φ 𝑗 (𝒙 𝑗 ).

Í𝑚This corresponds to the parallelization (Φ1 , . . . , Φ𝑚 ) composed with the linear transformation (𝒛1 , . . . , 𝒛 𝑚 ) ↦→
𝑗=1 𝛼 𝑗 𝒛 𝑗 . The following result holds.

Lemma 5.4 (Linear combinations) Let 𝑚 ∈ N and (Φ𝑖 )𝑖=1 𝑚 be neural networks with architectures (𝜎
ReLU ; 𝑑 0 , . . . , 𝑑 𝐿𝑖 +1 ),
𝑖 𝑖

respectively. Assume that 𝑑 1𝐿1 +1 = · · · = 𝑑 𝐿𝑚𝑚 +1 , let 𝛼 ∈ R𝑚 and set 𝐿 max := max 𝑗 ≤𝑚 𝐿 𝑗 . Then, there exists a neural
Í𝑚 𝑗
𝑗=1 𝑑0 . Moreover,
Í Í𝑚 Í𝑚
network 𝑚 𝑗=1 𝛼 𝑗 Φ 𝑗 such that ( 𝑗=1 𝛼 𝑗 Φ 𝑗 ) (𝒙) = 𝑗=1 𝛼 𝑗 Φ 𝑗 (𝒙 𝑗 ) for all 𝒙 = (𝒙 𝑗 ) 𝑗=1 ∈ R
𝑚

𝑚 𝑚
©∑︁ ∑︁
width ­ 𝛼 𝑗 Φ 𝑗 ® ≤ 2 width(Φ 𝑗 ), (5.1.6a)
ª

« 𝑗=1 ¬ 𝑗=1

𝑚
∑︁
depth ­ 𝛼 𝑗 Φ 𝑗 ® = max depth(Φ 𝑗 ), (5.1.6b)
© ª
𝑗 ≤𝑚
« 𝑗=1 ¬
𝑚 𝑚 𝑚
©∑︁ ∑︁ ∑︁
𝑗
size ­ 𝛼 𝑗 Φ 𝑗 ® ≤ 2 size(Φ 𝑗 ) + 2 (𝐿 max − 𝐿 𝑗 )𝑑 𝐿 𝑗 +1 . (5.1.6c)
ª

« 𝑗=1 ¬ 𝑗=1 𝑗=1

42
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Í
Proof The construction of 𝑚 𝑗=1 𝛼 𝑗 Φ 𝑗 is analogous to that of (Φ1 , . . . , Φ𝑚 ), i.e., we first define the linear combination
of neural networks with the same depth. Then the weights are chosen as in (5.1.2), but with the last linear transformation
replaced by

∑︁ 𝑚
(𝐿) (𝐿)
­ (𝛼1 𝑾1 · · · 𝛼𝑚 𝑾𝑚 ), 𝛼 𝑗 𝒃 (𝐿)
© ª
𝑗 ®.
« 𝑗=1 ¬

For general depths, we define the sum of the neural networks to be the sum of the extended neural networks Φ
e 𝑖 as
of (5.1.3). All statements of the lemma follow immediately from this construction. □
In case 𝑑01 = · · · = 𝑑0𝑚 C 𝑑0 (all neural networks have the same input dimension), we will also consider linear
combinations with shared inputs, i.e., a neural network realizing
𝑚
∑︁
𝒙 ↦→ 𝛼 𝑗 Φ 𝑗 (𝒙) for 𝒙 ∈ R𝑑0 .
𝑗=1

This requires the same minor adjustment as discussed at the end of Section 5.1.3. Lemma 5.4 remains valid in this case
and again we do not distinguish in notation for linear combinations with or without shared inputs.

5.2 Continuous piecewise linear functions

In this section, we will relate ReLU neural networks to a large class of functions. We first formally introduce the set of
continuous piecewise linear functions from a set Ω ⊆ R𝑑 to R. Note that we admit in particular Ω = R𝑑 in the following
definition.

Definition 5.5 Let Ω ⊆ R𝑑 , 𝑑 ∈ N. We call a function 𝑓 : Ω → R continuous, piecewise linear (cpwl) if 𝑓 ∈ 𝐶 0 (Ω)
and there exist 𝑛 ∈ N affine functions 𝑔 𝑗 : R𝑑 → R, 𝑔 𝑗 (𝒙) = 𝒘 ⊤𝑗 𝒙 + 𝑏 𝑗 such that for each 𝒙 ∈ Ω it holds that
𝑓 (𝒙) = 𝑔 𝑗 (𝒙) for at least one 𝑗 ∈ {1, . . . , 𝑛}. For 𝑚 > 1 we call 𝑓 : Ω → R𝑚 cpwl if and only if each component of 𝑓
is cpwl.

Remark 5.6 A “continuous piecewise linear function” as in Definition 5.5 is actually piecewise affine. To maintain
consistency with the literature, we use the terminology cpwl.

In the following, we will refer to the connected domains on which 𝑓 is equal to one of the functions 𝑔 𝑗 , also as
regions or pieces. If 𝑓 is cpwl with 𝑞 ∈ N regions, then with 𝑛 ∈ N denoting the number of affine functions it holds
𝑛 ≤ 𝑞.
Note that, the mapping 𝒙 ↦→ 𝜎ReLU (𝒘 ⊤ 𝒙 + 𝑏), which is a ReLU neural network with a single neuron, is cpwl (with
two regions). Consequently, every ReLU neural network is a repeated composition of linear combinations of cpwl
functions. It is not hard to see that the set of cpwl functions is closed under compositions and linear combinations.
Hence, every ReLU neural network is a cpwl function. Interestingly, the reverse direction of this statement is also
true, meaning that every cpwl function can be represented by a ReLU neural network as we shall demonstrate below.
Therefore, we can identify the class of functions realized by arbitrary ReLU neural networks as the class of cpwl
functions.

Theorem 5.7 Let 𝑑 ∈ N, Ω ⊆ R𝑑 be convex, and let 𝑓 : Ω → R be cpwl with 𝑛 ∈ N as in Definition 5.5. Then, there
exists a ReLU neural network Φ 𝑓 such that Φ 𝑓 (𝒙) = 𝑓 (𝒙) for all 𝒙 ∈ Ω and

size(Φ 𝑓 ) = 𝑂 (𝑑𝑛2𝑛 ), width(Φ 𝑓 ) = 𝑂 (𝑑𝑛2𝑛 ), depth(Φ 𝑓 ) = 𝑂 (𝑛).

A statement similar to Theorem 5.7 can be found in [5, 75]. They give a construction with a depth that behaves
logarithmic in 𝑑 and is independent of 𝑛, but with significantly larger bounds on the size. As we shall see, the proof of

43
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Theorem 5.7 is a simple consequence of the following well-known result from [201]; also see [148, 211]. It states that
every cpwl function can be expressed as a finite maximum of a finite minimum of certain affine functions.

Proposition 5.8 Let 𝑑 ∈ N, Ω ⊆ R𝑑 be convex, and let 𝑓 : Ω → R be cpwl with 𝑛 ∈ N affine functions as in Definition
5.5. Then there exists 𝑚 ∈ N and sets 𝑠 𝑗 ⊆ {1, . . . , 𝑛} for 𝑗 ∈ {1, . . . , 𝑚}, such that

𝑓 (𝒙) = max min (𝑔𝑖 (𝒙)) for all 𝒙 ∈ Ω. (5.2.1)


1≤ 𝑗 ≤𝑚 𝑖 ∈𝑠 𝑗

Proof Step 1. We start with 𝑑 = 1, i.e., Ω ⊆ R is a (possibly unbounded) interval and for each 𝑥 ∈ Ω there exists
𝑗 ∈ {1, . . . , 𝑛} such that with 𝑔 𝑗 (𝑥) B 𝑤 𝑗 𝑥 + 𝑏 𝑗 it holds that 𝑓 (𝑥) = 𝑔 𝑗 (𝑥). Without loss of generality, we can assume
that 𝑔𝑖 ≠ 𝑔 𝑗 for all 𝑖 ≠ 𝑗. Since the graphs of the 𝑔 𝑗 are lines, they intersect at (at most) finitely many points in Ω.
Since 𝑓 is continuous, we conclude that there exist finitely many intervals covering Ω, such that 𝑓 coincides with
one of the 𝑔 𝑗 on each interval. For each 𝑥 ∈ Ω let

𝑠 𝑥 := {1 ≤ 𝑗 ≤ 𝑛 | 𝑔 𝑗 (𝑥) ≥ 𝑓 (𝑥)}

and

𝑓 𝑥 (𝑦) := min 𝑔 𝑗 (𝑦) for all 𝑦 ∈ Ω.


𝑗 ∈𝑠𝑥

Clearly, 𝑓 𝑥 (𝑥) = 𝑓 (𝑥). We claim that, additionally,

𝑓 𝑥 (𝑦) ≤ 𝑓 (𝑦) for all 𝑦 ∈ Ω. (5.2.2)

This then shows that

𝑓 (𝑦) = max 𝑓 𝑥 (𝑦) = max min 𝑔 𝑗 (𝑦) for all 𝑦 ∈ R.


𝑥 ∈Ω 𝑥 ∈Ω 𝑗 ∈𝑠𝑥

Since there exist only finitely many possibilities to choose a subset of {1, . . . , 𝑛}, we conclude that (5.2.1) holds for
𝑑 = 1.
It remains to verify the claim (5.2.2). Fix 𝑦 ≠ 𝑥 ∈ Ω. Without loss of generality, let 𝑥 < 𝑦 and let 𝑥 = 𝑥0 < · · · < 𝑥 𝑘 = 𝑦
be such that 𝑓 | [ 𝑥𝑖−1 , 𝑥𝑖 ] equals some 𝑔 𝑗 for each 𝑖 ∈ {1, . . . , 𝑘 }. In order to show (5.2.2), it suffices to prove that there
exists at least one 𝑗 such that 𝑔 𝑗 (𝑥0 ) ≥ 𝑓 (𝑥0 ) and 𝑔 𝑗 (𝑥 𝑘 ) ≤ 𝑓 (𝑥 𝑘 ). The claim is trivial for 𝑘 = 1. We proceed by
induction. Suppose the claim holds for 𝑘 − 1, and consider the partition 𝑥0 < · · · < 𝑥 𝑘 . Let 𝑟 ∈ {1, . . . , 𝑛} be such that
𝑓 | [ 𝑥0 , 𝑥1 ] = 𝑔𝑟 | [ 𝑥0 , 𝑥1 ] . Applying the induction hypothesis to the interval [𝑥 1 , 𝑥 𝑘 ], we can find 𝑗 ∈ {1, . . . , 𝑛} such that
𝑔 𝑗 (𝑥 1 ) ≥ 𝑓 (𝑥1 ) and 𝑔 𝑗 (𝑥 𝑘 ) ≤ 𝑓 (𝑥 𝑘 ). If 𝑔 𝑗 (𝑥0 ) ≥ 𝑓 (𝑥0 ), then 𝑔 𝑗 is the desired function. Otherwise, 𝑔 𝑗 (𝑥0 ) < 𝑓 (𝑥0 ).
Then 𝑔𝑟 (𝑥 0 ) = 𝑓 (𝑥0 ) > 𝑔 𝑗 (𝑥0 ) and 𝑔𝑟 (𝑥1 ) = 𝑓 (𝑥1 ) ≤ 𝑔 𝑗 (𝑥1 ). Therefore 𝑔𝑟 (𝑥) ≤ 𝑔 𝑗 (𝑥) for all 𝑥 ≥ 𝑥 1 , and in particular
𝑔𝑟 (𝑥 𝑘 ) ≤ 𝑔 𝑗 (𝑥 𝑘 ). Thus 𝑔𝑟 is the desired function.
Step 2. For general 𝑑 ∈ N, let 𝑔 𝑗 (𝒙) := 𝒘 ⊤𝑗 𝒙 + 𝑏 𝑗 for 𝑗 = 1, . . . , 𝑛. For each 𝒙 ∈ Ω, let

𝑠 𝒙 := {1 ≤ 𝑗 ≤ 𝑛 | 𝑔 𝑗 (𝒙) ≥ 𝑓 (𝒙)}

and for all 𝒚 ∈ Ω, let

𝑓 𝒙 ( 𝒚) := min 𝑔 𝑗 ( 𝒚).
𝑗 ∈𝑠 𝒙

For an arbitrary 1-dimensional affine subspace 𝑆 ⊆ R𝑑 passing through 𝒙 consider the line (segment) 𝐼 := 𝑆 ∩ Ω,
which is connected since Ω is convex. By Step 1, it holds

𝑓 (𝒚) = max 𝑓 𝒙 ( 𝒚) = max min 𝑔 𝑗 ( 𝒚)


𝒙∈Ω 𝒙∈Ω 𝑗 ∈𝑠 𝒙

on all of 𝐼. Since 𝐼 was arbitrary the formula is valid for all 𝒚 ∈ Ω. This again implies (5.2.1) as in Step 1. □

44
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Remark 5.9 Using min(𝑎, 𝑏) = − max(−𝑎, −𝑏), there exists 𝑚˜ ∈ N and sets 𝑠˜ 𝑗 ⊆ {1, . . . , 𝑛} for 𝑗 = 1, . . . , 𝑚,
˜ such
that for all 𝒙 ∈ R

𝑓 (𝒙) = −(− 𝑓 (𝒙)) = − min max (−𝒘𝑖⊤ 𝒙 − 𝑏 𝑖 )


˜ 𝑖 ∈ 𝑠˜ 𝑗
1≤ 𝑗 ≤ 𝑚

= max (− max (−𝒘𝑖⊤ 𝒙 − 𝑏 𝑖 ))


1≤ 𝑗 ≤ 𝑚
˜ 𝑖 ∈ 𝑠˜ 𝑗

= max (min (𝒘𝑖⊤ 𝒙 + 𝑏 𝑖 )).


˜ 𝑖 ∈ 𝑠˜ 𝑗
1≤ 𝑗 ≤ 𝑚

To prove Theorem 5.7, it therefore suffices to show that the minimum and the maximum are expressible by ReLU
neural networks.

Lemma 5.10 For every 𝑥, 𝑦 ∈ R it holds that

min{𝑥, 𝑦} = 𝜎ReLU (𝑦) − 𝜎ReLU (−𝑦) − 𝜎ReLU (𝑦 − 𝑥) ∈ N21 (𝜎ReLU ; 1, 3)

and

max{𝑥, 𝑦} = 𝜎ReLU (𝑦) − 𝜎ReLU (−𝑦) + 𝜎ReLU (𝑥 − 𝑦) ∈ N21 (𝜎ReLU ; 1, 3).

Proof We have
(
0 if 𝑦 > 𝑥
max{𝑥, 𝑦} = 𝑦 +
𝑥−𝑦 if 𝑥 ≥ 𝑦
= 𝑦 + 𝜎ReLU (𝑥 − 𝑦).

Using 𝑦 = 𝜎ReLU (𝑦) − 𝜎ReLU (−𝑦), the claim for the maximum follows. For the minimum observe that min{𝑥, 𝑦} =
− max{−𝑥, −𝑦}. □

𝑥
min{𝑥, 𝑦}
𝑦

Fig. 5.1: Sketch of the neural network in Lemma 5.10. Only edges with non-zero weights are drawn.

The minimum of 𝑛 ≥ 2 inputs can be computed by repeatedly applying the construction of Lemma 5.10. The
resulting neural network is described in the next lemma.

Lemma 5.11 For every 𝑛 ≥ 2 there exists a neural network Φmin


𝑛 : R → R with
𝑛

size(Φmin
𝑛 ) ≤ 16𝑛, width(Φmin
𝑛 ) ≤ 3𝑛, depth(Φmin
𝑛 ) ≤ ⌈log2 (𝑛)⌉

such that Φmin max : R𝑛 → R realizing the


𝑛 (𝑥 1 , . . . , 𝑥 𝑛 ) = min1≤ 𝑗 ≤𝑛 𝑥 𝑗 . Similarly, there exists a neural network Φ𝑛
maximum and satisfying the same complexity bounds.

Proof Throughout denote by Φmin 2 : R2 → R the neural network from Lemma 5.10. It is of depth 1 and size 7 (since
all biases are zero, it suffices to count the number of connections in Figure 5.1).
Step 1. Consider first the case where 𝑛 = 2 𝑘 for some 𝑘 ∈ N. We proceed by induction of 𝑘. For 𝑘 = 1 the claim is
proven. For 𝑘 ≥ 2 set

45
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Φmin
2𝑘
:= Φmin min min
2 ◦ (Φ2 𝑘−1 , Φ2 𝑘−1 ). (5.2.3)
By Lemma 5.2 and Lemma 5.3 we have

depth(Φmin
2𝑘
) ≤ depth(Φmin min
2 ) + depth(Φ2 𝑘−1 ) ≤ · · · ≤ 𝑘.

Next, we bound the size of the neural network. Note that all biases in this neural network are set to 0, since the Φmin
2
neural network in Lemma 5.10 has no biases. Thus, the size of the neural network Φmin 2𝑘
corresponds to the number of
connections in the graph (the number of nonzero weights). Careful inspection of the neural network architecture, see
Figure 5.2, reveals that
𝑘−2
∑︁
size(Φmin
2𝑘
) = 4 · 2 𝑘−1 + 12 · 2 𝑗 + 3
𝑗=0

= 2𝑛 + 12 · (2 𝑘−1 − 1) + 3 = 2𝑛 + 6𝑛 − 9 ≤ 8𝑛,

and that width(Φmin


2𝑘
) ≤ (3/2)2 𝑘 . This concludes the proof for the case 𝑛 = 2 𝑘 .
Step 2. For the general case, we first let

Φmin
1 (𝑥) := 𝑥 for all 𝑥 ∈ R

be the identity on R, i.e. a linear transformation and thus formally a depth 0 neural network. Then, for all 𝑛 ≥ 2
(
(Φid min min
1 ◦ Φ⌊ 𝑛 ⌋ , Φ⌈ 𝑛 ⌉ ) if 𝑛 ∈ {2 𝑘 + 1 | 𝑘 ∈ N}
Φmin := Φmin
2 ◦ 2 2 (5.2.4)
(Φmin , Φmin
𝑛
⌊𝑛⌋ ⌈𝑛⌉
) otherwise.
2 2

This definition extends (5.2.3) to arbitrary 𝑛 ≥ 2, since the first case in (5.2.4) never occurs if 𝑛 ≥ 2 is a power of two.
To analyze (5.2.4), we start with the depth and claim that

depth(Φmin
𝑛 ) = 𝑘 for all 2 𝑘−1 < 𝑛 ≤ 2 𝑘

and all 𝑘 ∈ N. We proceed by induction over 𝑘. The case 𝑘 = 1 is clear. For the induction step, assume the statement
holds for some fixed 𝑘 ∈ N and fix an integer 𝑛 with 2 𝑘 < 𝑛 ≤ 2 𝑘+1 . Then
l𝑛m
∈ (2 𝑘−1 , 2 𝑘 ] ∩ N
2
and (
j𝑛k {2 𝑘−1 } if 𝑛 = 2 𝑘 + 1

2 (2 , 2 ] ∩ N otherwise.
𝑘−1 𝑘

Using the induction assumption, (5.2.4) and Lemmas 5.1 and 5.2, this shows

depth(Φmin min
𝑛 ) = depth(Φ2 ) + 𝑘 = 1 + 𝑘,

and proves the claim.


For the size and width bounds, we only sketch the argument: Fix 𝑛 ∈ N such that 2 𝑘−1 < 𝑛 ≤ 2 𝑘 . Then Φmin𝑛 is
constructed from at most as many subnetworks as Φmin
2 𝑘 , but with some Φmin : R2 → R blocks replaced by Φid : R → R,
2 1
see Figure 5.3. Since Φid min
1 has the same depth as Φ2 , but is smaller in width and number of connections, the width and
size of Φmin min
𝑛 is bounded by the width and size of Φ2 𝑘 . Due to 2 ≤ 2𝑛, the bounds from Step 1 give the bounds stated
𝑘

in the lemma.
Step 3. For the maximum, define

Φmax min
𝑛 (𝑥 1 , . . . , 𝑥 𝑛 ) := −Φ𝑛 (−𝑥 1 , . . . , −𝑥 𝑛 ).

46
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
𝑥1
𝑥2

𝑥3
𝑥4
min{𝑥1 , . . . , 𝑥 8 }
𝑥5
𝑥6

𝑥7
𝑥8

nr of connections
between layers: 2 𝑘−1 · 4 2 𝑘−2 · 12 2 𝑘−3 · 12 3

Fig. 5.2: Architecture of the Φmin


2𝑘
neural network in Step 1 of the proof of Lemma 5.11 and the number of connections
in each layer for 𝑘 = 3. Each grey box corresponds to 12 connections in the graph.

𝑥1 𝑥2 𝑥3 𝑥4 𝑥5 𝑥1 𝑥2 𝑥3 𝑥4 𝑥5 𝑥6 𝑥1 𝑥2 𝑥3 𝑥4 𝑥5 𝑥6 𝑥7 𝑥8

Φmin
2 Φid
1 Φmin
2 Φid
1 Φmin
2 Φid
1 Φmin
2 Φmin
2 Φmin
2 Φmin
2 Φmin
2

Φid
1 Φmin
2 Φmin
2 Φmin
2 Φmin
2 Φmin
2

Φmin
2 Φmin
2 Φmin
2

min{ 𝑥1 , . . . , 𝑥5 } min{ 𝑥1 , . . . , 𝑥6 } min{ 𝑥1 , . . . , 𝑥8 }

Fig. 5.3: Construction of Φmin


𝑛 for general 𝑛 in Step 2 of the proof of Lemma 5.11.

Proof (of Theorem 5.7) By Proposition 5.8 the neural network



Φ := Φmax min 𝑚 𝑚
𝑚 • (Φ |𝑠 𝑗 | ) 𝑗=1 • ((𝒘𝑖 𝒙 + 𝑏 𝑖 )𝑖 ∈𝑠 𝑗 ) 𝑗=1

realizes the function 𝑓 .


Since the number of possibilities to choose subsets of {1, . . . , 𝑛} equals 2𝑛 we have 𝑚 ≤ 2𝑛 . Since each 𝑠 𝑗 is a
subset of {1, . . . , 𝑛}, the cardinality |𝑠 𝑗 | of 𝑠 𝑗 is bounded by 𝑛. By Lemma 5.2, Lemma 5.3, and Lemma 5.11

depth(Φ) ≤ 2 + depth(Φmax min


𝑚 ) + max depth(Φ |𝑠 𝑗 | )
1≤ 𝑗 ≤𝑛
𝑛
≤ 1 + ⌈log2 (2 )⌉ + ⌈log2 (𝑛)⌉ = 𝑂 (𝑛)

and

47
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
n 𝑚
∑︁ 𝑚
∑︁ o
width(Φ) ≤ 2 max width(Φmax
𝑚 ), width(Φmin
|𝑠 𝑗 | ), width((𝒘𝑖⊤ 𝒙 + 𝑏 𝑖 )𝑖 ∈𝑠 𝑗 ))
𝑗=1 𝑗=1
𝑛
≤ 2 max{3𝑚, 3𝑚𝑛, 𝑚𝑑𝑛} = 𝑂 (𝑑𝑛2 )

and
 

size(Φ) ≤ 4 size(Φmax
𝑚 ) + size((Φ min 𝑚
)
|𝑠 𝑗 | 𝑗=1 ) + size((𝒘 𝑖 𝒙 + 𝑏 ) ) 𝑚
𝑖 𝑖 ∈𝑠 𝑗 𝑗=1 )

∑︁ 𝑚
≤ 4 ­16𝑚 + 2 (16|𝑠 𝑗 | + 2⌈log2 (𝑛)⌉) + 𝑛𝑚(𝑑 + 1) ® = 𝑂 (𝑑𝑛2𝑛 ).
© ª

« 𝑗=1 ¬
This concludes the proof. □

5.3 Simplicial pieces

This section studies the case, were we do not have arbitrary cpwl functions, but where the regions on which 𝑓 is affine
are simplices. Under this condition, we can construct neural networks that scale merely linearly in the number of such
regions, which is a serious improvement from the exponential dependence of the size on the number of regions that
was found in Theorem 5.7.

5.3.1 Triangulations of 𝛀

For the ensuing discussion, we will consider Ω ⊆ R𝑑 to be partitioned into simplices. This partitioning will be termed
a triangulation of Ω. Other notions prevalent in the literature include a tessellation of Ω, or a simplicial mesh on Ω.
To give a precise definition, let us first recall some terminology. For a set 𝑆 ⊆ R𝑑 we denote the convex hull of 𝑆 by

 𝑛
 ∑︁
 𝑛
∑︁ 


co(𝑆) := 𝛼 𝑗 𝒙 𝑗 𝑛 ∈ N, 𝒙 𝑗 ∈ 𝑆, 𝛼 𝑗 ≥ 0, 𝛼𝑗 = 1 .
 
 𝑗=1 𝑗=1 

An 𝑛-simplex is the convex hull of 𝑛 ∈ N points that are independent in a specific sense. This is made precise in the
following definition.

Definition 5.12 Let 𝑛 ∈ N0 , 𝑑 ∈ N and 𝑛 ≤ 𝑑. We call 𝒙0 , . . . , 𝒙 𝑛 ∈ R𝑑 affinely independent if and only if either
𝑛 = 0 or 𝑛 ≥ 1 and the vectors 𝒙1 − 𝒙0 , . . . , 𝒙 𝑛 − 𝒙0 are linearly independent. In this case, we call co(𝒙0 , . . . , 𝒙 𝑛 ) :=
co({𝒙0 , . . . , 𝒙 𝑛 }) an 𝑛-simplex.

As mentioned before, a triangulation refers to a partition of a space into simplices. We give a formal definition
below.

Definition 5.13 Let 𝑑 ∈ N, and Ω ⊆ R𝑑 be compact. Let T be a finite set of 𝑑-simplices, and for each 𝜏 ∈ T let
𝑉 (𝜏) ⊆ Ω have cardinality 𝑑 + 1 such that 𝜏 = co(𝑉 (𝜏)). We call T a regular triangulation of Ω, if and only if
Ð
(i) 𝜏 ∈ T 𝜏 = Ω,
(ii) for all 𝜏, 𝜏 ′ ∈ T it holds that 𝜏 ∩ 𝜏 ′ = co(𝑉 (𝜏) ∩ 𝑉 (𝜏 ′ )).
Ð
We call 𝜼 ∈ V := 𝜏 ∈ T 𝑉 (𝜏) a node (or vertex) and 𝜏 ∈ T an element of the triangulation.

For a regular triangulation T with nodes V we also introduce the constant

48
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
𝜼2 𝜼2 𝜼2

𝜼3 𝜼1 𝜼3 𝜼1 𝜼3 𝜼1
𝜼5 𝜼5

𝜼4 𝜼4 𝜼4

𝜏1 = co(𝜼1 , 𝜼2 , 𝜼5 ) 𝜏1 = co(𝜼2 , 𝜼3 , 𝜼4 ) 𝜏1 = co(𝜼2 , 𝜼3 , 𝜼4 )


𝜏2 = co(𝜼2 , 𝜼3 , 𝜼5 ) 𝜏2 = co(𝜼2 , 𝜼5 , 𝜼1 ) 𝜏2 = co(𝜼1 , 𝜼2 , 𝜼3 )
𝜏3 = co(𝜼3 , 𝜼4 , 𝜼5 )

Fig. 5.4: The first is a regular triangulation, while the second and the third are not.

𝑘 T := max |{𝜏 ∈ T | 𝜼 ∈ 𝜏}| (5.3.1)


𝜼∈ V

corresponding to the maximal number of elements shared by a single node.

5.3.2 Size bounds for regular triangulations

Throughout this subsection, let T be a regular triangulation of Ω, and we adhere to the notation of Definition 5.13.
We will say that 𝑓 : Ω → R is cpwl with respect to T if 𝑓 is cpwl and 𝑓 | 𝜏 is affine for each 𝜏 ∈ T . The rest of this
subsection is dedicated to proving the following result. It was first shown in [121] with a more technical argument, and
extends an earlier statement from [75] to general triangulations (also see Section 5.3.3).

Theorem 5.14 Let 𝑑 ∈ N, Ω ⊆ R𝑑 be a bounded domain, and let T be a regular triangulation of Ω. Let 𝑓 : Ω → R be
cpwl with respect to T and 𝑓 | 𝜕Ω = 0. Then there exists a ReLU neural network Φ : Ω → R realizing 𝑓 , and it holds

size(Φ) = 𝑂 (|T |), width(Φ) = 𝑂 (|T |), depth(Φ) = 𝑂 (1), (5.3.2)

where the constants in the Landau notation depend on 𝑑 and 𝑘 T in (5.3.1).

We will split the proof into several lemmata. The strategy is to introduce a basis of the space of cpwl functions on
T the elements of which vanish on the boundary of Ω. We will then show that there exist 𝑂 (|T |) basis functions, each
of which can be represented with a neural network the size of which depends only on 𝑘 T and 𝑑. To construct this basis,
we first point out that an affine function on a simplex is uniquely defined by its values at the nodes.

Lemma 5.15 Let 𝑑 ∈ N. Let 𝜏 := co(𝜼0 , . . . , 𝜼 𝑑 ) be a 𝑑-simplex. For every 𝑦 0 , . . . , 𝑦 𝑑 ∈ R, there exists a unique
𝑔 ∈ P1 (R𝑑 ) such that 𝑔(𝜼𝑖 ) = 𝑦 𝑖 , 𝑖 = 0, . . . , 𝑑.

Proof Since 𝜼1 −𝜼0 , . . . , 𝜼 𝑑 −𝜼0 is a basis of R𝑑 , there is a unique 𝒘 ∈ R𝑑 such that 𝒘 ⊤ (𝜼𝑖 −𝜼0Í
) = 𝑦 𝑖 −𝑦 0 forÍ
𝑖 = 1, . . . , 𝑑.
Then 𝑔(𝒙) := 𝒘 ⊤ 𝒙 + (𝑦 0 − 𝒘 ⊤ 𝜼0 ) is as desired. Moreover, for every 𝑔 ∈ P1 it holds that 𝑔( 𝑖=0 𝑑
𝛼𝑖 𝜼𝑖 ) = 𝑖=0𝑑
𝛼𝑖 𝑔(𝜼𝑖 )
Í𝑑
whenever 𝑖=0 𝛼𝑖 = 1 (this is in general not true if the coefficients do not sum to 1). Hence, 𝑔 is uniquely determined
by its values at the nodes. □
Since Ω is the union of the simplices 𝜏 ∈ T , every cpwl function with respect to T is thus uniquely defined through
its values at the nodes. Hence, the desired basis consists of cpwl functions 𝜑𝜼 : Ω → R with respect to T such that

𝜑𝜼 ( 𝝁) = 𝛿𝜼𝝁 for all 𝝁 ∈ V, (5.3.3)

49
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
where 𝛿𝜼𝝁 denotes the Kronecker delta. Assuming 𝜑𝜼 to be well-defined for the moment, we can then represent every
cpwl function 𝑓 : Ω → R that vanishes on the boundary 𝜕Ω as
∑︁
𝑓 (𝒙) = 𝑓 (𝜼)𝜑𝜼 (𝒙) for all 𝒙 ∈ Ω.
𝜼∈ V∩Ω̊

Note that it suffices to sum over the set of interior nodes V ∩ Ω̊, since 𝑓 (𝜼) = 0 whenever 𝜼 ∈ 𝜕Ω. To formally
verify existence and well-definedness of 𝜑𝜼 , we first need a lemma characterizing the boundary of so-called “patches”
of the triangulation: For each 𝜼 ∈ V, we introduce the patch 𝜔(𝜼) of the node 𝜼 as the union of all elements containing
𝜼, i.e.,
Ø
𝜔(𝜼) := 𝜏.
{ 𝜏 ∈ T | 𝜼∈ 𝜏 }

Lemma 5.16 Let 𝜼 ∈ V ∩ Ω̊ be an interior node. Then,


Ø
𝜕𝜔(𝜼) = co(𝑉 (𝜏)\{𝜼}).
{ 𝜏 ∈ T | 𝜼∈ 𝜏 }

𝜼6 𝜼1
𝜔 (𝜼) co(𝑉 ( 𝜏1 )\{𝜼} ) = co( {𝜼1 , 𝜼2 } )
𝜏6
𝜏5 𝜏1
𝜼5 𝜼2
𝜏4 𝜼 𝜏2
𝜏3

𝜼4 𝜼3

Fig. 5.5: Visualization of Lemma 5.16 in two dimensions. The patch 𝜔(𝜼) consists of the union of all 2-simplices 𝜏𝑖
containing 𝜼. Its boundary consists of the union of all 1-simplices made up by the nodes of each 𝜏𝑖 without the center
node, i.e., the convex hulls of 𝑉 (𝜏𝑖 )\{𝜼}.

We refer to Figure 5.5 for a visualization of Lemma 5.16. The proof of Lemma 5.16 is quite technical but nonetheless
elementary. We therefore only outline the general argument but leave the details to the reader in Excercise 5.27: The
boundary of 𝜔(𝜼) must be contained in the union of the boundaries of all 𝜏 in the patch 𝜔(𝜼). Since 𝜼 is an interior
point of Ω, it must also be an interior point of 𝜔(𝜼). This can be used to show that for every 𝑆 := {𝜼𝑖0 , . . . , 𝜼𝑖𝑘 } ⊆ 𝑉 (𝜏)
of cardinality 𝑘 + 1 ≤ 𝑑, the interior of (the 𝑘-dimensional manifold) co(𝑆) belongs to the interior of 𝜔(𝜼) whenever
𝜼 ∈ 𝑆. Using Exercise 5.27, it then only remains to check that co(𝑆) ⊆ 𝜕𝜔(𝜼) whenever 𝜼 ∉ 𝑆, which yields the
claimed formula. We are now in position to show well-definenedness of the basis functions in (5.3.3).

Lemma 5.17 For each interior node 𝜼 ∈ V ∩ Ω̊ there exists a unique cpwl function 𝜑𝜼 : Ω → R satisfying (5.3.3).
Moreover, 𝜑𝜼 can be expressed by a ReLU neural network with size, width, and depth bounds that only depend on 𝑑
and 𝑘 T .

Proof By Lemma 5.15, on each 𝜏 ∈ T , the affine function 𝜑𝜼 | 𝜏 is uniquely defined through the values at the nodes of
𝜏. This defines a continuous function 𝜑𝜼 : Ω → R. Indeed, whenever 𝜏 ∩ 𝜏 ′ ≠ ∅, then 𝜏 ∩ 𝜏 ′ is a subsimplex of both 𝜏
and 𝜏 ′ in the sense of Definition 5.13 (ii). Thus, applying Lemma 5.15 again, the affine functions on 𝜏 and 𝜏 ′ coincide
on 𝜏 ∩ 𝜏 ′ .
Using Lemma 5.15, Lemma 5.16 and the fact that 𝜑𝜼 ( 𝝁) = 0 whenever 𝝁 ≠ 𝜼, we find that 𝜑𝜼 vanishes on the
boundary of the patch 𝜔(𝜼) ⊆ Ω. Thus, 𝜑𝜼 vanishes on the boundary of Ω. Extending by zero, it becomes a cpwl

50
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
function 𝜑𝜼 : R𝑑 → R. This function is nonzero only on elements 𝜏 for which 𝜼 ∈ 𝜏. Hence, it is a cpwl function with
at most 𝑛 := 𝑘 T + 1 affine functions. By Theorem 5.7, 𝜑𝜼 can be expressed as a ReLU neural network with the claimed
size, width and depth bounds. □
Finally, Theorem 5.14 is now an easy consequence of the above lemmata.
Proof (of Theorem 5.14) With
∑︁
Φ(𝒙) := 𝑓 (𝜼)𝜑𝜼 (𝒙) for 𝒙 ∈ Ω, (5.3.4)
𝜼∈ V∩Ω̊

it holds that Φ : Ω → R satisfies Φ(𝜼) = 𝑓 (𝜼) for all 𝜼 ∈ V. By Lemma 5.15 this implies that 𝑓 equals Φ on each 𝜏,
and thus 𝑓 equals Φ on all of Ω. Since each element 𝜏 is the convex hull of 𝑑 + 1 nodes 𝜼 ∈ V, the cardinality of V is
bounded by the cardinality of T times 𝑑 + 1. Thus, the summation in (5.3.4) is over 𝑂 (|T |) terms. Using Lemma 5.4
and Lemma 5.17 we obtain the claimed bounds on size, width, and depth of the neural network. □

5.3.3 Size bounds for locally convex triangulations

Assuming local convexity of the triangulation, in this section we make the dependence of the constants in Theorem
5.14 explicit in the dimension 𝑑 and in the maximal number of simplices 𝑘 T touching a node, see (5.3.1). As such the
improvement over Theorem 5.14 is modest, and the reader may choose to skip this section on a first pass. Nonetheless,
the proof, originally from [75], is entirely constructive and gives some further insight on how ReLU networks express
functions. Let us start by stating the required convexity constraint.

Definition 5.18 A regular triangulation T is called locally convex if and only if 𝜔(𝜼) is convex for all interior nodes
𝜼 ∈ V ∩ Ω̊.

The following theorem is a variant of [75, Theorem 3.1].

Theorem 5.19 Let 𝑑 ∈ N, and let Ω ⊆ R𝑑 be a bounded domain. Let T be a locally convex regular triangulation of
Ω. Let 𝑓 : Ω → R be cpwl with respect to T and 𝑓 | 𝜕Ω = 0. Then, there exists a constant 𝐶 > 0 (independent of 𝑑, 𝑓
and T ) and there exists a neural network Φ 𝑓 : Ω → R such that Φ 𝑓 = 𝑓 ,

size(Φ 𝑓 ) ≤ 𝐶 · (1 + 𝑑 2 𝑘 T |T |),
width(Φ 𝑓 ) ≤ 𝐶 · (1 + 𝑑 log(𝑘 T )|T |), and
depth(Φ 𝑓 ) ≤ 𝐶 · (1 + log2 (𝑘 T )).

Assume in the following that T is a locally convex triangulation. We will split the proof of the theorem again into
a few lemmata. First, we will show that a convex patch can be written as an intersection of finitely many half-spaces.
Specifically, with the affine hull of a set 𝑆 defined as

 𝑛
 ∑︁
 𝑛
∑︁ 


aff (𝑆) := 𝛼 𝑗 𝒙 𝑗 𝑛 ∈ N, 𝒙 𝑗 ∈ 𝑆, 𝛼 𝑗 ∈ R, 𝛼𝑗 = 1
 
 𝑗=1 𝑗=1 

let in the following for 𝜏 ∈ T and 𝜼 ∈ 𝑉 (𝜏)

𝐻0 (𝜏, 𝜼) := aff(𝑉 (𝜏)\{𝜼})

be the affine hyperplane passing through all nodes in 𝑉 (𝜏)\{𝜼}, and let further

𝐻+ (𝜏, 𝜼) := {𝒙 ∈ R𝑑 | 𝒙 is on the same side of 𝐻0 (𝜏, 𝜼) as 𝜼} ∪ 𝐻0 (𝜏, 𝜼).

51
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Lemma 5.20 Let 𝜼 be an interior node. Then a patch 𝜔(𝜼) is convex if and only if
Ù
𝜔(𝜼) = 𝐻+ (𝜏, 𝜼). (5.3.5)
{ 𝜏 ∈ T | 𝜼∈ T }

Proof The right-hand side is a finite intersection of (convex) half-spaces, and thus itself convex. It remains to show
that if 𝜔(𝜼) is convex, then (5.3.5) holds. We start with “⊃”. Suppose 𝒙 ∉ 𝜔(𝜼). Then the straight line co({𝒙, 𝜼}) must
pass through 𝜕𝜔(𝜼), and by Lemma 5.16 this implies that there exists 𝜏 ∈ T with 𝜼 ∈ 𝜏 such that co({𝒙, 𝜼}) passes
through aff(𝑉 (𝜏)\{𝜼}) = 𝐻0 (𝜏, 𝜼). Hence 𝜼 and 𝒙 lie on different sides of this affine hyperplane, which shows “⊇”.
Now we show “⊆”. Let 𝜏 ∈ T be such that 𝜼 ∈ 𝜏 and fix 𝒙 in the complement of 𝐻+ (𝜏, 𝜼). Suppose that 𝒙 ∈ 𝜔(𝜼). By
convexity, we then have co({𝒙} ∪ 𝜏) ⊆ 𝜔(𝜼). This implies that there exists a point in co(𝑉 (𝜏)\{𝜼}) belonging to the
interior of 𝜔(𝜼). This contradicts Lemma 5.16. Thus, 𝒙 ∉ 𝜔(𝜼). □
The above lemma allows us to explicitly construct the basis functions 𝜑𝜼 in (5.3.3). To see this, denote in the
following for 𝜏 ∈ T and 𝜼 ∈ 𝑉 (𝜏) by 𝑔 𝜏,𝜼 ∈ P1 (R𝑑 ) the affine function such that
(
1 if 𝜼 = 𝝁
𝑔 𝜏,𝜼 ( 𝝁) = for all 𝝁 ∈ 𝑉 (𝜏).
0 if 𝜼 ≠ 𝝁

This function exists and is unique by Lemma 5.15. Observe that 𝜑𝜼 (𝒙) = 𝑔 𝜏,𝜼 (𝒙) for all 𝒙 ∈ 𝜏.

Lemma 5.21 Let 𝜼 ∈ V ∩ Ω̊ be an interior node and let 𝜔(𝜼) be a convex patch. Then
 
𝜑𝜼 (𝒙) = max 0, min 𝑔 𝜏,𝜼 (𝒙) for all 𝒙 ∈ R𝑑 . (5.3.6)
{ 𝜏 ∈ T | 𝜼∈ 𝜏 }

Proof First let 𝒙 ∉ 𝜔(𝜼). By Lemma 5.20 there exists 𝜏 ∈ 𝑉 (𝜼) such that 𝒙 is in the complement of 𝐻+ (𝜏, 𝜼). Observe
that

𝑔 𝜏,𝜼 | 𝐻+ ( 𝜏,𝜼) ≥ 0 and 𝑔 𝜏,𝜼 | 𝐻+ ( 𝜏,𝜼) 𝑐 < 0. (5.3.7)

Thus

min 𝑔 𝜏,𝜼 (𝒙) < 0 for all 𝒙 ∈ 𝜔(𝜼) 𝑐 ,


{ 𝜏 ∈ T | 𝜼∈ 𝜏 }

i.e., (5.3.6) holds for all 𝒙 ∈ R\𝜔(𝜼). Next, let 𝜏, 𝜏 ′ ∈ T such that 𝜼 ∈ 𝜏 and 𝜼 ∈ 𝜏 ′ . We wish to show that
𝑔 𝜏,𝜼 (𝒙) ≤ 𝑔 𝜏 ′ ,𝜼 (𝒙) for all 𝒙 ∈ 𝜏. Since 𝑔 𝜏,𝜼 (𝒙) = 𝜑𝜼 (𝒙) for all 𝒙 ∈ 𝜏, this then concludes the proof of (5.3.6). By
Lemma 5.20 it holds

𝝁 ∈ 𝐻+ (𝜏 ′ , 𝜼) for all 𝝁 ∈ 𝑉 (𝜏).

Hence, by (5.3.7)

𝑔 𝜏 ′ ,𝜼 ( 𝝁) ≥ 0 = 𝑔 𝜏,𝜼 ( 𝝁) for all 𝝁 ∈ 𝑉 (𝜏)\{𝜼}.

Moreover, 𝑔 𝜏,𝜼 (𝜼) = 𝑔 𝜏 ′ ,𝜼 (𝜼) = 1. Thus, 𝑔 𝜏,𝜼 ( 𝝁) ≥ 𝑔 𝜏 ′ ,𝜼 ( 𝝁) for all 𝝁 ∈ 𝑉 (𝜏 ′ ) and therefore

𝑔 𝜏 ′ ,𝜼 (𝒙) ≥ 𝑔 𝜏,𝜼 (𝒙) for all 𝒙 ∈ co(𝑉 (𝜏 ′ )) = 𝜏 ′ .

Proof (of Theorem 5.19) For every interior node 𝜼 ∈ V ∩ Ω̊, the cpwl basis function 𝜑𝜼 in (5.3.3) can be expressed
as in (5.3.6), i.e.,

𝜑𝜼 (𝒙) = 𝜎 • Φmin
| { 𝜏 ∈ T | 𝜼∈ 𝜏 } | • (𝑔 𝜏,𝜼 (𝒙)) { 𝜏 ∈ T | 𝜼∈ 𝜏 } ,

52
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
where (𝑔 𝜏,𝜼 (𝒙)) { 𝜏 ∈ T | 𝜼∈ 𝜏 } denotes the parallelization with shared inputs of the functions 𝑔 𝜏,𝜼 (𝒙) for all 𝜏 ∈ T such
that 𝜼 ∈ 𝜏.
For this neural network, with |{𝜏 ∈ T | 𝜼 ∈ 𝜏}| ≤ 𝑘 T , we have by Lemma 5.2

size(𝜑𝜼 ) ≤ 4 size(𝜎) + size(Φmin



| { 𝜏 ∈ T | 𝜼∈ 𝜏 } | ) + size((𝑔 𝜏,𝜼 ) { 𝜏 ∈ T | 𝜼∈ 𝜏 } )
≤ 4(2 + 16𝑘 T + 𝑘 T 𝑑) (5.3.8)

and similarly

depth(𝜑𝜼 ) ≤ 4 + ⌈log2 (𝑘 T )⌉, width(𝜑𝜼 ) ≤ max{1, 3𝑘 T , 𝑑}. (5.3.9)

Since for every interior node, the number of simplices touching the node must be larger or equal to 𝑑, we can assume
max{𝑘 T , 𝑑} = 𝑘 T in the following (otherwise there exist no interior nodes, and the function 𝑓 is constant 0). As in the
proof of Theorem 5.14, the neural network

∑︁
Φ(𝒙) := 𝑓 (𝜼)𝜑𝜼 (𝒙)
𝜼∈ V∩Ω̊

realizes the function 𝑓 on all of Ω. Since the number of nodes |V | is bounded by (𝑑 + 1)|T |, an application of Lemma
5.4 yields the desired bounds. □

5.4 Convergence rates for Hölder continuous functions

Theorem 5.14 immediately implies convergence rates for certain classes of (low regularity) functions. Recall for
example the space of Hölder continuous functions: for 𝑠 ∈ (0, 1] and a bounded domain Ω ⊆ R𝑑 we define

| 𝑓 (𝒙) − 𝑓 ( 𝒚)|
∥ 𝑓 ∥ 𝐶 0,𝑠 (Ω) := sup | 𝑓 (𝒙)| + sup . (5.4.1)
𝒙∈Ω 𝒙≠𝒚 ∈Ω ∥𝒙 − 𝒚∥ 2𝑠

Then, 𝐶 0,𝑠 (Ω) is the set of functions 𝑓 ∈ 𝐶 0 (Ω) for which ∥ 𝑓 ∥ 𝐶 0,𝑠 (Ω) < ∞.
Hölder continuous continuous functions can be approximated well by certain cpwl functions. Therefore, we obtain
the following result.

Theorem 5.22 Let 𝑑 ∈ N. There exists a constant 𝐶 = 𝐶 (𝑑) such that for every 𝑓 ∈ 𝐶 0,𝑠 ( [0, 1] 𝑑 ) and every 𝑁 there
𝑓
exists a ReLU neural network Φ 𝑁 with
𝑓 𝑓 𝑓
size(Φ 𝑁 ) ≤ 𝐶𝑁, width(Φ 𝑁 ) ≤ 𝐶𝑁, depth(Φ 𝑁 ) = 𝐶

and
𝑠
𝑓 (𝒙) − Φ 𝑁 (𝒙) ≤ 𝐶 ∥ 𝑓 ∥ 𝐶 0,𝑠 ( [0,1] 𝑑 ) 𝑁 − 𝑑 .
𝑓
sup
𝒙∈ [0,1] 𝑑

Proof For 𝑀 ≥ 2, consider the set of nodes {𝝂/𝑀 | 𝝂 ∈ {−1, . . . , 𝑀 + 1} 𝑑 } where 𝝂/𝑀 = (𝜈1 /𝑀, . . . , 𝜈 𝑑 /𝑀).
These nodes suggest a partition of [−1/𝑀, 1 + 1/𝑀] 𝑑 into (2 + 𝑀) 𝑑 sub-hypercubes. Each such sub-hypercube can
be partitioned into 𝑑! simplices, such that we obtain a regular triangulation T with 𝑑!(2 + 𝑀) 𝑑 elements on [0, 1] 𝑑 .
According to Theorem 5.14 there exists a neural network Φ that is cpwl with respect to T and Φ(𝝂/𝑀) = 𝑓 (𝝂/𝑀)
whenever 𝝂 ∈ {0, . . . , 𝑀 } 𝑑 and Φ(𝝂/𝑀) = 0 for all other (boundary) nodes. It holds

53
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
size(Φ) ≤ 𝐶 |T | = 𝐶𝑑!(2 + 𝑀) 𝑑 ,
width(Φ) ≤ 𝐶 |T | = 𝐶𝑑!(2 + 𝑀) 𝑑 , (5.4.2)
depth(Φ) ≤ 𝐶

for a constant 𝐶 that only depends on 𝑑 (since for our regular triangulation T , 𝑘 T in (5.3.1) is a fixed 𝑑-dependent
constant).
Let us bound the error. Fix a point 𝒙 ∈ [0, 1] 𝑑 . Then 𝒙 belongs to one of the interior simplices 𝜏 of the triangulation.
Two nodes of the simplex have distance at most

𝑑  2 1/2 √
©∑︁ 1 ª 𝑑
­ ® = =: 𝜀.
𝑀 𝑀
« 𝑗=1 ¬
Since Φ| 𝜏 is the linear interpolant of 𝑓 at the nodes 𝑉 (𝜏) of the simplex 𝜏, Φ(𝒙) is a convex combination of the
( 𝑓 (𝜼))𝜼∈𝑉 ( 𝜏 ) . Fix an arbitrary node 𝜼0 ∈ 𝑉 (𝜏). Then ∥𝒙 − 𝜼0 ∥ 2 ≤ 𝜀 and

|Φ(𝒙) − Φ(𝜼0 )| ≤ max | 𝑓 (𝜼) − 𝑓 ( 𝝁)| ≤ sup | 𝑓 (𝒙) − 𝑓 ( 𝒚)|


𝜼,𝝁∈𝑉 ( 𝜏 ) 𝒙,𝒚 ∈ [0,1] 𝑑
∥ 𝒙−𝒚 ∥ 2 ≤ 𝜀

≤ ∥ 𝑓 ∥ 𝐶 0,𝑠 ( [0,1] 𝑑 ) 𝜀 𝑠 .

Hence, using 𝑓 (𝜼0 ) = Φ(𝜼0 ),

| 𝑓 (𝒙) − Φ(𝒙)| ≤ | 𝑓 (𝒙) − 𝑓 (𝜼0 )| + |Φ(𝒙) − Φ(𝜼0 )|


≤ 2∥ 𝑓 ∥ 𝐶 0,𝑠 ( [0,1] 𝑑 ) 𝜀 𝑠
𝑠
= 2∥ 𝑓 ∥ 𝐶 0,𝑠 ( [0,1] 𝑑 ) 𝑑 2 𝑀 −𝑠
𝑠 𝑠
= 2𝑑 2 ∥ 𝑓 ∥ 𝐶 0,𝑠 ( [0,1] 𝑑 ) 𝑁 − 𝑑 (5.4.3)

where 𝑁 := 𝑀 𝑑 . The statement follows by (5.4.2) and (5.4.3). □


The principle behind Theorem 5.22 can be applied in even more generality. Since we can represent every cpwl
function on a regular triangulation with a neural network of size 𝑂 (𝑁), where 𝑁 denotes the number of elements, all of
classical (e.g. finite element) approximation theory for cpwl functions can be lifted to generate statements about ReLU
approximation. For instance, it is well-known, that functions in the Sobolev space 𝐻 2 ( [0, 1] 𝑑 ) can be approximated
by cpwl functions on a regular triangulation in terms of 𝐿 2 ( [0, 1] 𝑑 ) with the rate 2/𝑑. Similar as in the proof of
Theorem 5.22, for every 𝑓 ∈ 𝐻 2 ( [0, 1] 𝑑 ) and every 𝑁 ∈ N there then exists a ReLU neural network Φ 𝑁 such that
size(Φ 𝑁 ) = 𝑂 (𝑁) and
2
∥ 𝑓 − Φ 𝑁 ∥ 𝐿 2 ( [0,1] 𝑑 ) ≤ 𝐶 ∥ 𝑓 ∥ 𝐻 2 ( [0,1] 𝑑 ) 𝑁 − 𝑑 .

Finally, we can wonder how to approximate even smoother functions, i.e., those that have many continuous deriva-
tives. Since more smoothness is a restrictive assumption on the set of functions to approximate, we would hope that
this will allow us to have smaller neural networks. Essentially, we desire a result similar to Theorem 4.9, but with the
ReLU activation function.
However, we will see in the following chapter, that the emulation of piecewise affine functions on regular triangu-
lations cannot yield the approximation rates of Theorem 4.9. To harness the smoothness, it will be necessary to build
ReLU neural networks that emulate polynomials. Surprisingly, we will see in Chapter 7 that polynomials can be very
efficiently approximated by deep ReLU neural networks.

54
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Bibliography and further reading

The ReLU calculus introduced in Section 5.1 was similarly given in [153]. The fact that every cpwl function can be
expressed as a maximum over a minimum of linear functions goes back to the papers [202, 201]; also see [148, 211].
The main result of Section 5.2, which shows that every cpwl function can be expressed by a ReLU network, is
then a straightforward consequence. This was first observed in [5], which also provided bounds on the network size.
These bounds were significantly improved in [75] for cwpl functions on triangular meshes that satisfy a local convexity
condition. Under this assumption, it was shown that the network size essentially only grows linearly with the number
of pieces. The paper [121] showed that the convexity assumption is not necessary for this statement to hold. We give
a similar result in Section 5.3.2, using a simpler argument than [121]. The locally convex case from [75] is separately
discussed in Section 5.3.3, as it allows for further improvements in some constants.
Lastly, the implications for the approximation of Hölder continuous functions discussed in Section 5.4, follows
by standard approximation theory for cpwl functions. For a general reference on splines and piecewise polynomial
approximation see for instance [184].

55
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Exercises

Exercise 5.23 Let 𝑝 : R → R be a polynomial of degree 𝑛 ≥ 1 (with leading coefficient nonzero) and let 𝑠 : R → R
be a continuous sigmoidal activation function. Show that the identity map 𝑥 ↦→ 𝑥 : R → R belongs to N11 ( 𝑝; 1, 𝑛 + 1)
but not to N11 (𝑠; 𝐿) for any 𝐿 ∈ N.

Exercise 5.24 Consider cpwl functions 𝑓 : R → R with 𝑛 ∈ N0 breakpoints (points where the function is not 𝐶 1 ).
Determine the minimal size required to exactly express every such 𝑓 with a depth-1 ReLU neural network.

Exercise 5.25 Show that, the notion of affine independence is invariant under permutations of the points.
Í𝑑
Exercise 5.26 Let 𝜏 = co(𝒙0 , . . . , 𝒙 𝑑 ) be a 𝑑-simplex. Show that the coefficients 𝛼𝑖 ≥ 0 such that 𝑖=0 𝛼𝑖 = 1 and
Í𝑑
𝒙 = 𝑖=0 𝛼𝑖 𝒙𝑖 are unique for every 𝒙 ∈ 𝜏.
Ð𝑑
Exercise 5.27 Let 𝜏 = co(𝜼0 , . . . , 𝜼 𝑑 ) be a 𝑑-simplex. Show that the boundary of 𝜏 is given by 𝑖=0 co({𝜼0 , . . . , 𝜼 𝑑 }\{𝜼𝑖 }).

56
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Chapter 6
Affine pieces for ReLU neural networks

In the previous chapters, we observed some remarkable approximation results of shallow ReLU neural networks. In
practice, however, people use deeper architectures. To understand this fact, we would like to understand the shortcomings
of shallow architectures before introducing deeper ones.
Traditionally, an insightful approach to study limitations of ReLU neural networks has been to analyze the number
of pieces these functions can generate. Before we proceed, let us first make precise what we mean by number of pieces.

Definition 6.1 Let 𝑑 ∈ N and 𝑓 : Ω → R where Ω ⊆ R𝑑 , be continuous, we say that 𝑓 is piecewise affine Ð 𝑝 with 𝑝 ∈ N
𝑝
pieces, if 𝑝 is the smallest natural number for which there are connected open sets (Ω𝑖 )𝑖=1 such that 𝑖=1 Ω𝑖 = Ω, and
𝑓Ω𝑖 is an affine function for all 𝑖 = 1, . . . , 𝑝. We denote Pieces( 𝑓 , Ω) B 𝑝.
If 𝑑 = 1, then we call all points where a piecewise affine function 𝑓 is not affine the break points of 𝑓 .

It is important to be able to generate a lot of pieces with neural networks if we want to be able to approximate smooth
functions at a fast rate. Indeed, we have the following theorem. The proof is left to the reader, see Exercise 6.12.

Theorem 6.2 ([55, Theorem 2]) Let −∞ < 𝑎 < 𝑏 < ∞ and 𝑓 ∈ 𝐶 3 ( [𝑎, 𝑏]) so that 𝑓 is not affine linear. Then there
∫ 𝑏 √︁
exists a constant 𝑐 > 0 depending only on 𝑎 | 𝑓 ′′ (𝑥)|𝑑𝑥 so that, for every 𝑝 ∈ N,

∥𝑔 − 𝑓 ∥ ∞ > 𝑐 𝑝 −2 ,

for all 𝑔 which are piecewise affine linear with at most 𝑝 pieces.

Theorem 6.2 implies that we need to produce neural networks with many pieces, if we want to approximate non-
linear functions to high accuracy. How many pieces can we create with neural networks of a fixed depth and width?
We will give a simple theoretical upper bound in the following Section 6.1. Thereafter, we will study under which
conditions these upper bounds are achievable in Section 6.2. This then implies that there are functions that require huge
shallow neural networks to approximate, while relatively small deep neural networks can approximate them well. The
result is collected in Section 6.3. Finally, we ask ourselves how pertinent this analysis is in practice. More precisely,
we wish to understand how many pieces typical neural networks have. Surprisingly, we will, in Section 6.4, find that
randomly initialized deep neural networks on average do not have a number of pieces that is anywhere close to the
upper bound. Indeed, we will see that in expectation an upper bound on the number of pieces for randomly generated
neural networks depends only linearly on the number of layers.

6.1 Upper bounds

Neural networks are based on composition and addition of neurons. These two operations if applied to general piecewise
affine functions have the capability of increasing the number of pieces in a specific way. We depict the two operations
and their effect on the number of pieces in Figure 6.1. Formally, the two operations act as follows:

57
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
• Summation: The sum of two univariate piecewise affine functions 𝑓1 , 𝑓2 satisfies

Pieces( 𝑓1 + 𝑓2 , Ω) ≤ Pieces( 𝑓1 , Ω) + Pieces( 𝑓2 , Ω) − 1 (6.1.1)

for every Ω ⊆ R. This can be seen since the sum is affine in every point where both 𝑓1 and 𝑓2 are affine. Therefore,
the sum has at most as many break points as 𝑓1 and 𝑓2 combined. Moreover, for univariate functions it holds that
the number of pieces is one more than the number of break points.
• Composition: The composition of two functions 𝑓1 : R𝑑 → R and 𝑓2 : R → R𝑑 satisfies

Pieces( 𝑓1 ◦ 𝑓2 , Ω) ≤ Pieces( 𝑓1 , R𝑑 ) · Pieces( 𝑓2 , Ω) (6.1.2)

for every Ω ⊆ R. This is because for each of the affine pieces of 𝑓2 —let us call one of those pieces 𝐴 ⊆ R—we
have that 𝑓2 is either constant or injective on 𝐴. If it is constant then, 𝑓1 ◦ 𝑓2 is constant. If it is injective, then
Pieces( 𝑓1 ◦ 𝑓2 , 𝐴) = Pieces( 𝑓1 , 𝑓2 ( 𝐴)) ≤ Pieces( 𝑓1 , R𝑑 ). Since the estimate holds for all pieces of 𝑓2 we conclude
(6.1.2).

Fig. 6.1: Sketch of two ways of creating pieces. Top: Composition of two piecewise affine functions 𝑓1 ◦ 𝑓2 can create
a piece whenever the value of 𝑓2 crosses a level that is associated to a break point of 𝑓1 . Bottom: Addition of two
piecewise affine functions 𝑓1 + 𝑓2 produces a piecewise affine function that can have break points at positions, where
either 𝑓1 , or 𝑓2 are not affine.

Based on the considerations above, we conclude the following result, that follows the argument of [203, Lemma
2.1]. We state it for general piecewise affine activation functions. Most importantly, it holds for 𝜎 = 𝜎ReLU with 𝑝 = 2.

Theorem 6.3 Let 𝐿 ∈ N. Let 𝜎 be piecewise affine with 𝑝 pieces. Then, for every neural network Φ with architecture
(1, 𝑑1 , . . . , 𝑑 𝐿 , 1) ∈ N 𝐿+2 , we have that Φ has at most ( 𝑝 · width(Φ)) 𝐿 pieces.

Proof The proof is given via induction over 𝐿. For 𝐿 = 1, we have that for 𝑥 ∈ R
𝑑1
∑︁
Φ(𝑥) = 𝑐 𝑘 𝜎(𝑎 𝑘 𝑥 + 𝑏 𝑖 ) + 𝑑,
𝑘=1

where 𝑐 𝑘 , 𝑎 𝑘 , 𝑏 𝑖 , 𝑑 ∈ R. By (6.1.1), we conclude that Φ has at most 𝑝 · width(Φ) affine linear pieces.
Assume the statement holds for 𝐿 ∈ N. Let Φ 𝐿+1 be a neural network with 𝐿 + 1 layers. Let

𝑾 (0) , 𝒃 (0) , . . . , 𝑾 (𝐿+1) , 𝒃 (𝐿+1) ,

be the weights and biases of Φ 𝐿+1 . It is clear that for all 𝑥 ∈ R

Φ 𝐿+1 (𝑥) = 𝑾 (𝐿+1) [𝜎(ℎ1 (𝑥)), . . . , 𝜎(ℎ 𝑑𝐿+1 (𝑥))] 𝑇 + 𝒃 (𝐿+1) ,

58
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
where for ℓ = 1, . . . , 𝑑 𝐿+1 each ℎℓ is the realization of a neural network with input and output dimension 1, 𝐿 layers,
and at most width(Φ 𝐿+1 ) neurons in each layer.
By this observation and the induction hypothesis, we conclude that for ℓ = 1, . . . , 𝑑 𝐿+1 , 𝜎 ◦ ℎℓ has at most
𝑝 · ( 𝑝 · width(Φ 𝐿+1 )) 𝐿 affine linear pieces. Hence,
𝑑∑︁
𝐿+1

Φ 𝐿+1 (𝑥) = (𝑾 (𝐿+1) ) 𝑘 𝜎(ℎ 𝑘 (𝑥)) + 𝒃 (𝐿+1)


𝑘=1

has at most width(Φ 𝐿+1 ) · 𝑝 · ( 𝑝 · width(Φ 𝐿+1 )) 𝐿 = ( 𝑝 · width(Φ 𝐿+1 )) 𝐿+1 many affine linear pieces. This completes
the proof. □
Theorem 6.3 shows that there are limits to how many pieces can be created with a certain architecture. It is noteworthy
that the effects of the depth and the width of a neural network are vastly different. While increasing the width will have
a polynomial effect on the upper bound, increasing the depth blows up the upper bound exponentially. This is a first
indication of the prowess of depth of neural networks.
To understand the effect of this on the approximation problem, we apply the bound of Theorem 6.3 to Theorem 6.2.
∫ √︁
Theorem 6.4 Let 𝑑0 ∈ N, and 𝑓 ∈ 𝐶 3 ( [0, 1] 𝑑0 ) be such that along a line 𝔰 ⊆ [0, 1] 𝑑0 it holds that 𝔰 | 𝑓 ′′ (𝑥)|𝑑𝑥 > 𝑐,
˜
then there exists 𝑐 > 0 depending on 𝑐˜ only, such that for all ReLU neural networks Φ with 𝐿 layers and input dimension
𝑑0 it holds that
∥Φ − 𝑓 ∥ ∞ ≥ 𝑐 · (2width(Φ)) −2𝐿 .

Theorem 6.4 shows that there are some approximation rates that are not achievable with shallow ReLU neural
networks. In fact, if the target function gets increasingly smooth, then we would expect it to become increasingly easier
to approximate it. However, without increasing the depth, it seems to be impossible to leverage additional smoothness.
This observation strongly indicates that deeper architectures are superior for the approximation of smooth functions.
However, before we can make such a statement with certainty, we should study whether the upper bounds are even
achievable.

6.2 Tightness of upper bounds

To construct a ReLU neural network, that realizes the upper bound of Theorem 6.3, we first let ℎ1 : [0, 1] → R be the
hat function
(
2𝑥 if 𝑥 ∈ [0, 21 ]
ℎ1 (𝑥) :=
2 − 2𝑥 if 𝑥 ∈ [ 12 , 1],

which can be expressed by a neural network of depth one and with two nodes:

ℎ1 (𝑥) = 𝜎ReLU (2𝑥) − 𝜎ReLU (4𝑥 − 2) for all 𝑥 ∈ [0, 1]. (6.2.1)

We recursively set ℎ 𝑛 := ℎ 𝑛−1 ◦ ℎ1 for all 𝑛 ≥ 2, i.e., ℎ 𝑛 = ℎ1 ◦ · · · ◦ ℎ1 is the 𝑛-fold composition of ℎ1 . Since
ℎ1 : [0, 1] → [0, 1], we have ℎ 𝑛 : [0, 1] → [0, 1] and

ℎ 𝑛 ∈ N11 (𝜎ReLU , 𝑛, 2).

It turns out that this function has a rather interesting behavior. It is a “sawtooth” function with 2𝑛−1 spikes, see
Figure 6.2.
Lemma 6.5 Let 𝑛 ∈ N. It holds for all 𝑥 ∈ [0, 1]

59
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
(
2𝑛 (𝑥 − 𝑖2−𝑛 ) if 𝑖 ≥ 0 is even and 𝑥 ∈ [𝑖2−𝑛 , (𝑖 + 1)2−𝑛 ]
ℎ 𝑛 (𝑥) =
2𝑛 ((𝑖 + 1)2−𝑛 − 𝑥) if 𝑖 ≥ 1 is odd and 𝑥 ∈ [𝑖2−𝑛 , (𝑖 + 1)2−𝑛 ].

Proof The case 𝑛 = 1 holds by definition. We proceed by induction, and assume the statement holds for 𝑛. Let
𝑥 ∈ [0, 1/2] and 𝑖 ≥ 0 even such that 𝑥 ∈ [𝑖2− (𝑛+1) , (𝑖 + 1)2− (𝑛+1) ]. Then 2𝑥 ∈ [𝑖2−𝑛 , (𝑖 + 1)2−𝑛 ]. Thus

ℎ 𝑛 (ℎ1 (𝑥)) = ℎ 𝑛 (2𝑥) = 2𝑛 (2𝑥 − 𝑖2−𝑛 ) = 2𝑛+1 (𝑥 − 𝑖2−𝑛+1 ).

Similarly, if 𝑥 ∈ [0, 1/2] and 𝑖 ≥ 1 odd such that 𝑥 ∈ [𝑖2− (𝑛+1) , (𝑖 + 1)2− (𝑛+1) ], then ℎ1 (𝑥) = 2𝑥 ∈ [𝑖2−𝑛 , (𝑖 + 1)2−𝑛 ]
and

ℎ 𝑛 (ℎ1 (𝑥)) = ℎ 𝑛 (2𝑥) = 2𝑛 (2𝑥 − (𝑖 + 1)2−𝑛 ) = 2𝑛+1 (𝑥 − (𝑖 + 1)2−𝑛+1 ).

The case 𝑥 ∈ [1/2, 1] follows by observing that ℎ 𝑛+1 is symmetric around 1/2. □

ℎ1 ℎ2 ℎ3
1 1 1

0 1 0 1 0 1

Fig. 6.2: The functions ℎ 𝑛 in Lemma 6.5.

The neural network ℎ 𝑛 has size 𝑂 (𝑛) but is piecewise linear on 𝑂 (2𝑛 ) pieces. This shows that the upper bound of
Theorem 6.3 is tight.

6.3 Depth separation

Now that we have established how increasing the depth can lead to exponentially more pieces than increasing the width,
we can deduce a so-called “depth-separation” result [203]. Such statements verify the existence of functions that can
easily be approximated by deep neural networks, but require much larger size when approximated by shallow neural
networks. The theorem from [203] reads as follows.

Theorem 6.6 For every 𝑛 ∈ N there exists a neural network 𝑓 ∈ N11 (𝜎ReLU , 𝑛2 + 4, 2) such that for any 𝑔 ∈
N11 (𝜎ReLU , 𝑛, 2𝑛−1 ) holds
∫ 1
1
| 𝑓 (𝑥) − 𝑔(𝑥)| d𝑥 ≥ .
0 64

The function 𝑓 has quadratically more layers than 𝑔, but width(𝑔) = 2𝑛−1 and width( 𝑓 ) = 2. Hence the size of 𝑔
may be exponentially larger than the size of 𝑓 , but nonetheless no such 𝑔 can approximate 𝑓 ! Thus even exponential
increase in width cannot necessarily compensate for increase in depth. The result is based on the following observations
from [203]:
(i) Functions with few oscillations poorly approximate functions with many oscillations,
(ii) functions expressed by neural networks with few layers have few oscillations,
(iii) functions expressed by neural networks with many layers can have many oscillations.

60
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Proof (Proof of Theorem 6.6) Fix 𝑛 ∈ N. Let 𝑓 := ℎ 𝑛2 +4 ∈ N11 (𝜎ReLU , 𝑛2 +4, 2). For arbitrary 𝑔 ∈ N11 (𝜎ReLU , 𝑛, 2𝑛−1 ),
2
by Theorem 6.3, 𝑔 is piecewise linear with at most (2 · 2𝑛−1 ) 𝑛 = 2𝑛 break points. The function 𝑓 is the sawtooth
2
function with 2𝑛 +3 spikes. The number of triangles formed by the graph of 𝑓 and the constant line at 1/2 equals
2 2
2𝑛 +4 − 1, each with area 2− (𝑛 +6) , see Figure 6.3. For the 𝑚 triangles in between two break points of 𝑔, the graph of 𝑔
crosses at most 𝑚/2 − 1 of them. Thus we can bound

© ª
∫ 1 ­1 ®
𝑛2 +4 𝑛2 𝑛2 2
| 𝑓 (𝑥) − 𝑔(𝑥)| d𝑥 ≥ ­ ( 2 −1−2 )− 2 ® 2− (𝑛 +6)
­ ®
0 ­2 | {z } |{z} ® | {z }
triangles on an interval ≥pieces of 𝑔 area of a triangle
­ ®

« without break point of 𝑔 ¬
| {z }
≥missed triangles
2 2 +6) 1
≥ 2𝑛 · 2− (𝑛 = ,
64
which concludes the proof. □

1
2

0 1 0 1

Fig. 6.3: Left: The functions ℎ 𝑛 form 2𝑛 − 1 triangles with the line at 1/2, each with area (1/2)2− (𝑛+1) = 2− (𝑛+2) .
Right: For an affine function with 𝑚 (in this sketch 𝑚 = 5) triangles in between two break points, the function can
cross at most 𝑚/2 − 1 of them.

6.4 Number of pieces in practice

We have seen in Theorem 6.3 that deep neural networks can have many more pieces than their shallow counterparts.
This begs the question if deep neural networks are more prone to generating many pieces in practice? A more formal
question would be: if we were to randomly choose weights of a neural network, how many pieces will it have in
expectation? Will this number scale exponentially with the depth? This question was analyzed in [72] and it was found
that, surprisingly, the number of pieces of randomly initialized neural networks typically does not depend exponentially
on the depth. In Figure 6.4, we depict two neural networks, one shallow and one deep, that were randomly initialized
according to He initialization [76]. Both neural networks have essentially the same number of pieces (114 and 110)
and there is no clear indication that one has a deeper architecture than the other.
In the following, we will give a simplified version of the main result of [72] to show why random deep neural
networks often behave like shallow neural networks.
We recall from Figure 6.1 that pieces are generated through composition of two functions 𝑓1 and 𝑓2 , if the values of
𝑓2 cross a level that is associated to a break point of 𝑓1 . In the case of a simple neuron of the form

𝒙 ↦→ 𝜎ReLU (⟨𝒂, ℎ(𝒙)⟩ + 𝑏),

61
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Fig. 6.4: It was observed in [72] that the number of pieces of randomly initialized neural networks does not behave
like the upper bound of Theorem 6.3. In particular, the expected number of pieces does not scale exponentially with
the number of layers. Here we depict two neural networks Φ1 and Φ2 with architectures (𝜎ReLU ; 1, 10, 10, 1) and
(𝜎ReLU ; 1, 5, 5, 5, 5, 5, 1), respectively. For both architectures the weights were randomly initialized.

where ℎ is a piecewise affine function, 𝒂 is a vector of the same shape as ℎ(𝒙), 𝑏 is a scalar, we have that a lot of pieces
can be generated if ⟨𝒂, ℎ(𝒙)⟩ crosses the −𝑏 level often.
If 𝒂, 𝑏 are random variables, and we know that ℎ does not oscillate too much, then we can quantify the probability
of ⟨𝒂, ℎ(𝒙)⟩ crossing the −𝑏 level often. The following lemma from [100] provides the details.

Lemma 6.7 ([100, Lemma 3.1]) Let 𝑐 > 0 and ℎ : [0, 𝑐] → R be a piecewise affine function on [0, 𝑐]. Let 𝑡 ∈ N and
𝐴 be a Lebesgue measurable set and assume that for every 𝑦 ∈ 𝐴 it holds that

#{𝑥 ∈ [0, 𝑐] : ℎ(𝑥) = 𝑦} ≥ 𝑡.

Then, 𝑐∥ℎ′ ∥ ∞ ≥ ∥ℎ′ ∥ 1 ≥ 𝜆( 𝐴) · 𝑡, where 𝜆 is the Lebesgue measure. In particular, for ℎ with at most 𝑃 ∈ N pieces
and with ∥ℎ′ ∥ 1 finite it holds for all 𝛿 > 0 that for all 𝑡 ≤ 𝑃

∥ℎ′ ∥ 1
P (#{𝑥 ∈ [0, 𝑐] : ℎ(𝑥) = 𝑈} ≥ 𝑡) ≤ ,
𝛿𝑡
P (#{𝑥 ∈ [0, 𝑐] : ℎ(𝑥) = 𝑈} > 𝑃) = 0,

where 𝑈 is a uniformly distributed variable on [−𝛿/2, 𝛿/2].

Proof We will assume 𝑐 = 1. The general case, then follows by considering ℎ˜ = ℎ(·/𝑐).
Let for (𝑐 𝑖 )𝑖=1 𝑃+1 ⊆ [0, 1] with 𝑐 = 0, 𝑐
1 𝑃+1 = 1 and 𝑐 𝑖 ≤ 𝑐 𝑖+1 for all 𝑖 = 1, . . . , 𝑃 + 1 the pieces of ℎ be given by
((𝑐 𝑖 , 𝑐 𝑖+1 ))𝑖=1 . We denote
𝑃

𝑉1 B [0, 𝑐 2 ], 𝑉𝑖 B (𝑐 𝑖 , 𝑐 𝑖+1 ] for 𝑖 = 1, . . . , 𝑃

and for 𝑗 = 𝑖, . . . , 𝑃
𝑖−1
Ø
e𝑖 B
𝑉 𝑉𝑗 .
𝑗=1

We define, for 𝑛 ∈ N ∪ {∞}

62
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
n n o o
𝑇𝑖,𝑛 B ℎ(𝑉𝑖 ) ∩ 𝑦 ∈ 𝐴 : # 𝑥 ∈ 𝑉
e𝑖 : ℎ(𝑥) = 𝑦 = 𝑛 − 1 .

In words, 𝑇𝑖,𝑛 contains the values of 𝐴 that are hit on 𝑉𝑖 for the 𝑛-th time. Since ℎ is piecewise affine, we observe that
for all 𝑖 = 1, . . . , 𝑃
(i) 𝑇𝑖,𝑛1 ∩ Ð
𝑇𝑖,𝑛2 = ∅ for all 𝑛1 , 𝑛2 ∈ N ∪ {∞}, 𝑛1 ≠ 𝑛2 ,
(ii) 𝑇𝑖,∞ ∪ ∞ 𝑛=1 𝑇𝑖,𝑛 = ℎ(𝑉𝑖 ) ∩ 𝐴,
(iii) 𝑇𝑖,𝑛 = ∅ for all 𝑃 < 𝑛 < ∞,
(iv) 𝜆 𝑇𝑖,∞ = 0.
Note that, since ℎ is affine on 𝑉𝑖 it holds that ℎ′ = 𝜆(ℎ(𝑉𝑖 ))/𝜆(𝑉𝑖 ) on 𝑉𝑖 . Hence,
𝑃
∑︁ 𝑃
∑︁
∥ℎ′ ∥ 1 ≥ 𝜆(ℎ(𝑉𝑖 )) ≥ 𝜆 (ℎ(𝑉𝑖 ) ∩ 𝐴)
𝑖=1 𝑖=1

𝑃 ∑︁
!
∑︁  
= 𝜆 𝑇𝑖,𝑛 + 𝜆 𝑇𝑖,∞
(i), (ii)
𝑖=1 𝑛=1

𝑃 ∑︁
∑︁ 
= 𝜆 𝑇𝑖,𝑛
(iv)
𝑖=1 𝑛=1
𝑡 ∑︁
∑︁ 𝑃

≥ 𝜆 𝑇𝑖,𝑛 ,
(iii)
𝑛=1 𝑖=1

where 𝑡 ≤ 𝑃 and we were allowed to change the order of summation in the last inequality because we only sum over a
finite number of terms. Note that, by assumption for all 𝑛 ≤ 𝑡 every 𝑦 ∈ 𝐴 is an element of 𝑇𝑖,𝑛 or 𝑇𝑖,∞ for some 𝑖 ≤ 𝑃.
Therefore, by (iv)
𝑃
∑︁ 
𝜆 𝑇𝑖,𝑛 ≥ 𝜆( 𝐴),
𝑖=1

which completes the proof. □


Lemma 6.7 applied to neural networks essentially states that, in a single neuron, if the bias term is chosen uniformly
randomly on an interval of length 𝛿, then the probability of generating 𝑡 pieces by composition scales like the reciprocal
of 𝑡.
Next, we will analyze how Lemma 6.7 implies an upper bound on the number of pieces generated in a randomly
initialized neural network. To make this more precise, we need to rigorously define a randomly initialized neural
network.
In our arguments below, we will only consider the biases to be random. However, the results remain the same if both
weights and biases are considered random, as long as they are independent of each other. In that case, the following
results simply hold for every instantiation of the random weights. Moreover, even the independence assumption can be
weakened. This is analyzed in [72].

Definition 6.8 Let 𝐿 ∈ N and (𝑑0 , 𝑑1 , . . . , 𝑑 𝐿 , 1) ∈ N 𝐿+2 . Let 𝛿 > 0. Let 𝑾 (ℓ ) ∈ R𝑑ℓ+1 ×𝑑ℓ for ℓ = 0, . . . , 𝐿. Further let
𝒃 (ℓ ) ∈ R𝑑ℓ+1 for ℓ = 0, . . . , 𝐿 be random variables satisfying all entries of 𝒃 (ℓ ) are independently uniformly distributed
on [−𝛿/2, 𝛿/2]. Then we call the associated ReLU neural network Φ a random-bias neural network.

To apply Lemma 6.7 to a single neuron with random biases, we also need some kind of bound on the derivative of
the input to the neuron. For this, we define preactivations and the maximal internal derivative.

Definition 6.9 Let 𝐿 ∈ N and (𝑑0 , 𝑑1 , . . . , 𝑑 𝐿 , 1) ∈ N 𝐿+2 . Let 𝑾 (ℓ ) ∈ R𝑑ℓ+1 ×𝑑ℓ and 𝒃 (ℓ ) ∈ R𝑑ℓ+1 for ℓ = 0, . . . , 𝐿.
For ℓ = 1, . . . , 𝐿 + 1, 𝑖 = 1, . . . , 𝑑ℓ we call the functions

𝜂ℓ,𝑖 (𝒙; (𝑾 ( 𝑗 ) , 𝒃 ( 𝑗 ) ) ℓ𝑗=0


−1
) = (𝑾 (ℓ −1) 𝒙 (ℓ −1) )𝑖 , for 𝒙 ∈ R𝑑0 ,

63
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
the preactivations of the ReLU neural network Φ with weights and biases given by (𝑾 (ℓ ) , 𝒃 (ℓ ) )ℓ=0
𝐿 .
(ℓ ) ×𝑑
For 𝛿 > 0, and 𝑾 ∈ R ℓ+1 ℓ for ℓ = 0, . . . , 𝐿, we call
𝑑

  n
𝜈 (𝑾 (ℓ ) )ℓ=1
𝐿 ′
, 𝛿 B max 𝜂ℓ,𝑖 ( · ; (𝑾 ( 𝑗 ) , 𝒃 ( 𝑗 ) ) ℓ𝑗=0
−1
) :
1
𝐿
Ö 


(𝒃 ( 𝑗 ) ) 𝐿𝑗=0 ∈ [−𝛿/2, 𝛿/2] 𝑑 𝑗+1 , ℓ = 1, . . . , 𝐿, 𝑖 = 1, . . . , 𝑑ℓ

𝑗=0 
the maximal internal derivative of Φ.

We can now formulate the main result of this section.

Theorem 6.10 Let 𝐿 ∈ N and let (𝑑0 , 𝑑1 , . . . , 𝑑 𝐿 , 1) ∈ N 𝐿+2 . Let 𝛿 ∈ (0, 1]. Let 𝑾 (ℓ ) ∈ R𝑑ℓ+1 ×𝑑ℓ , for ℓ = 0, . . . , 𝐿, be
such that 𝜈 (𝑾 (ℓ ) )ℓ=0
𝐿 , 𝛿 ≤ 𝐶 for a 𝐶 > 0.
𝜈 𝜈
For an associated random-bias neural network Φ, we have that for a line segment 𝔰 ⊆ R𝑑0 of length 1
𝐿
𝐶𝜈 ∑︁
E(Pieces(Φ, 𝔰)) ≤ 1 + 𝑑1 + (1 + (𝐿 − 1) ln(2width(Φ))) 𝑑𝑗. (6.4.1)
𝛿 𝑗=2

Proof Let 𝑾 (ℓ ) ∈ R𝑑ℓ+1 ×𝑑ℓ for ℓ = 0, . . . , 𝐿. Moreover, let 𝒃 (ℓ ) ∈ [−𝛿/2, 𝛿/2] 𝑑ℓ+1 for ℓ = 0, . . . , 𝐿 be uniformly
distributed random variables, then the vector of preactivations is denoted by

𝜃 ℓ (𝒙) : 𝔰 → R𝑑ℓ
𝒙 ↦→ (𝜂ℓ,𝑖 (𝒙; (𝑾 ( 𝑗 ) , 𝒃 ( 𝑗 ) ) ℓ𝑗=0
−1 𝑑ℓ
))𝑖=1 .

Let 𝜅 : 𝔰 → [0, 1] be an isomorphism. Note that since each coordinate of 𝜃 ℓ is piecewise affine, we also have that
there are points 𝒙0 , 𝒙 1 , . . . , 𝒙 𝑞ℓ ∈ 𝔰 with 𝜅(𝒙 𝑗 ) < 𝜅(𝒙 𝑗+1 ) for 𝑗 = 0, . . . , 𝑞 ℓ − 1, such that 𝜃 ℓ is affine (as a function into
R𝑑ℓ ) on [𝜅(𝒙 𝑗 ), 𝜅(𝒙 𝑗+1 )] for all 𝑗 = 0, . . . , 𝑞 ℓ − 1 as well as on [0, 𝜅(𝒙0 )] and [𝜅(𝒙 𝑞ℓ ), 1].
We will now inductively find an upper bound on the 𝑞 ℓ .
Let ℓ = 2, then
𝜃 2 (𝒙) = 𝑾 (1) 𝜎ReLU (𝑾 (0) 𝒙 + 𝒃 (0) ).
Since 𝑾 (1) · +𝑏 (1) is an affine function, it follows that 𝜃 2 can only be non-affine in points where 𝜎ReLU (𝑾 (0) · +𝒃 (0) ) is
not affine. Therefore, 𝜃 2 is only non-affine if one coordinate of 𝑾 (0) · +𝒃 (0) intersects 0 nontrivially. This can happen
at most 𝑑1 times. We conclude that we can choose 𝑞 2 = 𝑑1 .
Next, let us find an upper bound on 𝑞 ℓ+1 from 𝑞 ℓ . Note that

𝜃 ℓ+1 (𝒙) = 𝑾 (ℓ ) 𝜎ReLU (𝜃 ℓ (𝒙) + 𝒃 (ℓ −1) ).

Now 𝜃 ℓ+1 is affine in every point 𝒙 ∈ 𝔰 where 𝜃 ℓ is affine and (𝜃 ℓ (𝒙) + 𝒃 (ℓ −1) )𝑖 ≠ 0 for all coordinates 𝑖 = 1, . . . , 𝑑ℓ .
As a result, we have that we can choose 𝑞 ℓ+1 such that
n o
𝑞 ℓ+1 ≤ 𝑞 ℓ + # 𝒙 ∈ 𝔰 : (𝜃 ℓ (𝒙) + 𝒃 (ℓ −1) )𝑖 = 0 for at least one 𝑖 = 1, . . . , 𝑑ℓ .

Therefore, for ℓ ≥ 2

∑︁ n o
𝑞 ℓ+1 ≤ 𝑑1 + # 𝒙 ∈ 𝔰 : (𝜃 𝑗 (𝒙) + 𝒃 ( 𝑗 ) )𝑖 = 0 for at least one 𝑖 = 1, . . . , 𝑑 𝑗
𝑗=3

∑︁ 𝑑𝑗
ℓ ∑︁ n o
( 𝑗)
≤ 𝑑1 + # 𝒙 ∈ 𝔰 : 𝜂 𝑗,𝑖 (𝒙) = −𝒃 𝑖 .
𝑗=2 𝑖=1

64
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
By Theorem 6.3, we have that
 
Pieces 𝜂ℓ,𝑖 ( · ; (𝑾 ( 𝑗 ) , 𝒃 ( 𝑗 ) ) ℓ𝑗=0
−1
, 𝔰) ≤ (2width(Φ)) ℓ −1 .

We define for 𝑘 ∈ N ∪ {∞}


 
𝑝 𝑘,ℓ,𝑖 B P #{𝒙 ∈ 𝔰 : 𝜂ℓ,𝑖 (𝒙) = −𝒃 𝑖(ℓ ) } ≥ 𝑘

Then, by Lemma 6.7

𝑝 𝑘,ℓ,𝑖 ≤ 𝐶𝜈 /(𝛿𝑘)

and for 𝑘 > (2width(Φ)) ℓ −1

𝑝 𝑘,ℓ,𝑖 = 0.

Then, we compute that

𝐿 𝑑𝑗 n o
©∑︁ ∑︁ ( 𝑗) ª
E­ # 𝒙 ∈ 𝔰 : 𝜂 𝑗,𝑖 (𝒙) = −𝒃 𝑖 ®
« 𝑗=2 𝑖=1 ¬
𝐿
∑︁ ∑︁ ∞
𝑑 𝑗 ∑︁  n o 
( 𝑗)
≤ 𝑘 · P # 𝒙 ∈ 𝔰 : 𝜂 𝑗,𝑖 (𝒙) = −𝒃 𝑖 =𝑘
𝑗=2 𝑖=1 𝑘=1
𝐿 ∑︁
∑︁ ∞
𝑑 𝑗 ∑︁
≤ 𝑘 · ( 𝑝 𝑘, 𝑗,𝑖 − 𝑝 𝑘+1, 𝑗,𝑖 ).
𝑗=2 𝑖=1 𝑘=1

We make the following technical computation to estimate the inner sum:



∑︁ ∞
∑︁ ∞
∑︁
𝑘 · ( 𝑝 𝑘, 𝑗,𝑖 − 𝑝 𝑘+1, 𝑗,𝑖 ) = 𝑘 · 𝑝 𝑘, 𝑗,𝑖 − 𝑘 · 𝑝 𝑘+1, 𝑗,𝑖
𝑘=1 𝑘=1 𝑘=1

∑︁ ∞
∑︁
= 𝑘 · 𝑝 𝑘, 𝑗,𝑖 − (𝑘 − 1) · 𝑝 𝑘, 𝑗,𝑖
𝑘=1 𝑘=2

∑︁
= 𝑝 1, 𝑗,𝑖 + 𝑝 𝑘, 𝑗,𝑖
𝑘=2

∑︁
= 𝑝 𝑘, 𝑗,𝑖
𝑘=1
(2width(Φ)
∑︁ )
𝐿−1

−1 1
≤ 𝐶𝜈 𝛿
𝑘=1
𝑘
!
∫ (2width(Φ) ) 𝐿−1
−1 1
≤ 𝐶𝜈 𝛿 1+ 𝑑𝑥
1 𝑥
−1
≤ 𝐶𝜈 𝛿 (1 + (𝐿 − 1) ln((2width(Φ)))).

We conclude that, in expectation, we can bound 𝑞 𝐿+1 by

65
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
𝐿
∑︁
𝑑1 + 𝐶𝜈 𝛿 −1 (1 + (𝐿 − 1) ln(2width(Φ))) 𝑑𝑗.
𝑗=2

Finally, since 𝜃 𝐿 = (Φ 𝐿+1 ) |𝔰 , it follows that

Pieces(Φ, 𝔰) ≤ 𝑞 𝐿+1 + 1

which yields the result. □

Remark 6.11 We make the following observations about Theorem 6.10:


• Non-exponential dependence on depth: If we consider (6.4.1), we see that the number of pieces scales in expectation
essentially like O (𝐿𝑁), where 𝑁 is the total number of neurons of the architecture. This shows that in expectation,
the number of pieces is linear in the number of layers, as opposed to the exponential upper bound of Theorem 6.3.
• Maximal internal derivative: Theorem 6.10 requires the weights to be chosen such that the maximal internal
derivative is bounded by a certain number. However, if they are randomly initialized with a scheme, such that
almost surely or with very high probability, the maximal internal derivative is bounded by a small number, then
similar results can be shown. In practice, weights
√︁ in the ℓ-th layer are often initialized according to a centered
normal distribution with standard deviation 2/𝑑ℓ , [76]. Due to the anti-proportionality of the variance to the
width of the layers it is achieved that the internal derivatives remain bounded, independent of the width of the
neural networks. This explaines the observation from Figure 6.4.

Bibliography and further reading

66
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Exercises

Exercise 6.12 Let −∞ < 𝑎 < 𝑏 < ∞ and let 𝑓 ∈ 𝐶 3 ( [𝑎, 𝑏])\P1 . Denote by 𝑝(𝜀) ∈ N the minimal number of intervals
partitioning [𝑎, 𝑏], such that a (not necessarily continuous) piecewise linear function on 𝑝(𝜀) intervals can approximate
𝑓 on [𝑎, 𝑏] uniformly up to error 𝜀 > 0. In this exercise, we wish to show

lim inf 𝑝(𝜀) 𝜀 > 0. (6.4.2)
𝜀↘0

Therefore, we can find a constant 𝐶 > 0 such that 𝜀 ≥ 𝐶 𝑝(𝜀) −2 for all 𝜀 > 0. This shows a variant of Theorem 6.2.
Proceed as follows to prove (6.4.2):

(i) Fix 𝜀 > 0 and let 𝑎 = 𝑥0 < 𝑥 1 · · · < 𝑥 𝑝 ( 𝜀) = 𝑏 be a partitioning into 𝑝(𝜀) pieces. For 𝑖 = 0, . . . , 𝑝(𝜀) − 1 and
𝑥 ∈ [𝑥 𝑖 , 𝑥𝑖+1 ] let
 
𝑓 (𝑥 𝑖+1 ) − 𝑓 (𝑥𝑖 )
𝑒 𝑖 (𝑥) := 𝑓 (𝑥) − 𝑓 (𝑥𝑖 ) + (𝑥 − 𝑥𝑖 ) .
𝑥𝑖+1 − 𝑥 𝑖

Show that |𝑒 𝑖 (𝑥)| ≤ 2𝜀 for all 𝑥 ∈ [𝑥𝑖 , 𝑥𝑖+1 ].


(ii) With ℎ𝑖 := 𝑥 𝑖+1 − 𝑥𝑖 and 𝑚 𝑖 := (𝑥𝑖 + 𝑥𝑖+1 )/2 show that

ℎ2𝑖 ′′
max |𝑒 𝑖 (𝑥)| = | 𝑓 (𝑚 𝑖 )| + 𝑂 (ℎ3𝑖 ).
𝑥 ∈ [ 𝑥𝑖 , 𝑥𝑖+1 ] 8

(iii) Assuming that 𝑐 := inf 𝑥 ∈ [𝑎,𝑏] | 𝑓 ′′ (𝑥)| > 0 show that


∫ 𝑏
1 √︁
lim inf 𝑝(𝜀) 𝜀 ≥ | 𝑓 ′′ (𝑥)| d𝑥.
𝜀↘0 4 𝑎

(iv) Conclude that (6.4.2) holds for general non-linear 𝑓 ∈ 𝐶 3 ( [𝑎, 𝑏]).

Exercise 6.13 Show that, for 𝐿 = 1, Theorem 6.3 holds for piecewise smooth functions, when replacing the number of
affine pieces by the number of smooth pieces. These are defined by replacing "affine" by "smooth" in Definition 6.1.

Exercise 6.14 Show that, for 𝐿 > 1, Theorem 6.3 does not hold for piecewise smooth functions, when replacing the
number of affine pieces by the number of smooth pieces. These are defined by replacing "affine" by "smooth" in
Definition 6.1.

Exercise 6.15 For 𝑝 ∈ N, 𝑝 > 2 and n ∈ N, Construct a function ℎ 𝑛( 𝑝) , similarly to ℎ 𝑛 of (6.5) such that ℎ 𝑛( 𝑝) ∈
N11 (𝜎ReLU , 𝑛, 𝑝) such that ℎ 𝑛( 𝑝) has 𝑝 𝑛 pieces and size 𝑂 ( 𝑝 2 𝑛).

67
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Chapter 7
Deep ReLU neural networks

In the previous chapter, we observed that many layers are a necessary prerequisite for ReLU neural networks to
approximate smooth functions with high rates. To complement this observation, we now analyze which depth is
sufficient to achieve good approximation rates for smooth functions?
To approximate smooth functions efficiently, one of the main tools in Chapter 4 was to rebuild polynomial-based
functions, such as higher-order B-splines. For smooth activation functions, we were able to reproduce polynomials by
using the nonlinearity of the activation functions. This argument certainly cannot be repeated for the ReLU. On the
other hand, up until now, we have seen that deep ReLU neural networks are extremely efficient at producing sawtooth
functions, Lemma 6.5.
The main observation in the next section is that the efficient representation of sawtooth functions is intimately linked
to the approximation of the square function and hence allows very efficient approximations of polynomial functions.
We will show the following in this chapter: First, in Section 7.1, we will link sawtooth functions and the square
function and then show how this yields a neural network approximating the squaring function. Second, in Section 7.2,
we will demonstrate how the squaring neural network can be modified to yield a neural network that approximates
the function that multiplies its inputs. Using these two tools, we will observe in Section 7.3 that deep ReLU neural
networks can approximate 𝑘-times continuously differentiable functions with Hölder continuous derivatives, with the
optimal approximation rate.

7.1 The square function

In this section, we will show that the square function 𝑥 ↦→ 𝑥 2 can be approximated very efficiently by a deep neural
network.

Proposition 7.1 Let 𝑛 ∈ N. Then


𝑛
∑︁ ℎ 𝑗 (𝑥)
𝑠 𝑛 (𝑥) := 𝑥 −
𝑗=1
22 𝑗

is a piecewise linear function on [0, 1] with break points 𝑥 𝑛, 𝑗 = 𝑗2−𝑛 , 𝑗 = 0, . . . , 2𝑛 . Moreover, 𝑠 𝑛 (𝑥 𝑛,𝑘 ) = 𝑥 𝑛,𝑘
2 for all
2
𝑘 = 0, . . . , 2 , i.e. 𝑠 𝑛 is the piecewise linear interpolant of 𝑥 on [0, 1].
𝑛

Proof The statement holds for 𝑛 = 1. We proceed by induction. Assume the statement holds for 𝑠 𝑛 and let 𝑘 ∈
{0, . . . , 2𝑛+1 }. By Lemma 6.5, ℎ 𝑛+1 (𝑥 𝑛+1,𝑘 ) = 0 whenever 𝑘 is even. Hence for even 𝑘 ∈ {0, . . . , 2𝑛+1 }

69
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
𝑛+1
∑︁ ℎ 𝑗 (𝑥 𝑛+1,𝑘 )
𝑠 𝑛+1 (𝑥 𝑛+1,𝑘 ) = 𝑥 −
𝑗=1
22 𝑗
ℎ 𝑛+1 (𝑥 𝑛+1,𝑘 ) 2
= 𝑠 𝑛 (𝑥 𝑛+1,𝑘 ) − = 𝑠 𝑛 (𝑥 𝑛+1,𝑘 ) = 𝑥 𝑛+1,𝑘 ,
22(𝑛+1)
where we used the induction assumption 𝑠 𝑛 (𝑥 𝑛+1,𝑘 ) = 𝑥 𝑛+1,𝑘 2 for 𝑥 𝑛+1,𝑘 = 𝑘2− (𝑛+1) = 𝑘2 2−𝑛 = 𝑥 𝑛,𝑘/2 .
Now let 𝑘 ∈ {1, . . . , 2𝑛+1 − 1} be odd. Then by Lemma 6.5, ℎ 𝑛+1 (𝑥 𝑛+1,𝑘 ) = 1. Moreover, since 𝑠 𝑛 is linear on
[𝑥 𝑛, (𝑘−1)/2 , 𝑥 𝑛, (𝑘+1)/2 ] = [𝑥 𝑛+1,𝑘−1 , 𝑥 𝑛+1,𝑘+1 ] and 𝑥 𝑛+1,𝑘 is the midpoint of this interval,

ℎ 𝑛+1 (𝑥 𝑛+1,𝑘 )
𝑠 𝑛+1 (𝑥 𝑛+1,𝑘 ) = 𝑠 𝑛 (𝑥 𝑛+1,𝑘 ) −
22(𝑛+1)
1 2 2 1
= (𝑥 + 𝑥 𝑛+1,𝑘+1 ) − 2(𝑛+1)
2 𝑛+1,𝑘−1 2
(𝑘 − 1) 2 (𝑘 + 1) 2 2
= 2(𝑛+1)+1 + 2(𝑛+1)+1 − 2(𝑛+1)+1
2 2 2
1 2𝑘 2 𝑘2 2
= = = 𝑥 𝑛+1,𝑘 .
2 22(𝑛+1) 22(𝑛+1)
This completes the proof. □

Lemma 7.2 For 𝑛 ∈ N, it holds

sup |𝑥 2 − 𝑠 𝑛 (𝑥)| ≤ 2−2𝑛−1 .


𝑥 ∈ [0,1]

Moreover 𝑠 𝑛 ∈ N𝑎1 (𝜎ReLU , 𝑛, 3), and size(𝑠 𝑛 ) ≤ 7𝑛 and depth(𝑠 𝑛 ) = 𝑛.

Proof Set 𝑒 𝑛 (𝑥) := 𝑥 2 − 𝑠 𝑛 (𝑥). Let 𝑥 be in the interval [𝑥 𝑛,𝑘 , 𝑥 𝑛,𝑘+1 ] = [𝑘2−𝑛 , (𝑘 + 1)2−𝑛 ] of length 2−𝑛 . Since 𝑠 𝑛 is
the linear interpolant of 𝑥 2 on this interval, we have
2 2
− 𝑥 𝑛,𝑘
𝑥 𝑛,𝑘+1 2𝑘 + 1 1
|𝑒 ′𝑛 (𝑥)| = 2𝑥 − = 2𝑥 − ≤ 𝑛.
2−𝑛 2𝑛 2

Thus 𝑒 𝑛 : [0, 1] → R has Lipschitz constant 2−𝑛 . Since 𝑒 𝑛 (𝑥 𝑛,𝑘 ) = 0 for all 𝑘 = 0, . . . , 2𝑛 , and the length of the
interval [𝑥 𝑛,𝑘 , 𝑥 𝑛,𝑘+1 ] equals 2−𝑛 we get

1 −𝑛 −𝑛
sup |𝑒 𝑛 (𝑥)| ≤ 2 2 = 2−2𝑛−1 .
𝑥 ∈ [0,1] 2

Finally, to see that 𝑠 𝑛 can be represented by a neural network of the claimed architecture, note that for 𝑛 ≥ 2
𝑛
∑︁ ℎ 𝑗 (𝑥) ℎ 𝑛 (𝑥) ℎ1 ◦ ℎ 𝑛−1 (𝑥)
𝑠 𝑛 (𝑥) = 𝑥 − 2𝑗
= 𝑠 𝑛−1 (𝑥) − 2𝑛 = 𝜎ReLU ◦ 𝑠 𝑛−1 (𝑥) − .
𝑗=1
2 2 22𝑛

Here we used that 𝑠 𝑛−1 is the piecewise linear interpolant of 𝑥 2 , so that 𝑠 𝑛−1 (𝑥) ≥ 0 and thus 𝑠 𝑛−1 (𝑥) = 𝜎ReLU (𝑠 𝑛−1 (𝑥))
for all 𝑥 ∈ [0, 1]. Hence 𝑠 𝑛 is of depth 𝑛 and width 3, see Figure 7.1. □
In conclusion, we have shown that 𝑠 𝑛 : [0, 1] → [0, 1] approximates the square function uniformly on [0, 1] with
exponentially decreasing error in the neural network size. Note that due to Theorem 6.4, this would not be possible
with a shallow neural network, which can at best interpolate 𝑥 2 on a partition of [0, 1] with polynomially many (w.r.t.
the neural network size) pieces.

70
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
𝑥 𝑠1 ( 𝑥 ) 𝑠2 ( 𝑥 ) 𝑠𝑛−1 ( 𝑥 )

𝑥 ℎ1 ( 𝑥 ) 𝑥 ... 𝑠𝑛 ( 𝑥 )

ℎ1 ( 𝑥 ) ℎ2 ( 𝑥 ) ℎ3 ( 𝑥 ) ℎ𝑛 ( 𝑥 )

Fig. 7.1: The neural networks ℎ1 (𝑥) = 𝜎ReLU (2𝑥) − 𝜎ReLU (4𝑥 − 2) and 𝑠 𝑛 (𝑥) = 𝜎ReLU (𝑠 𝑛−1 (𝑥)) − ℎ 𝑛 (𝑥)/22𝑛 where
ℎ 𝑛 = ℎ1 ◦ ℎ 𝑛−1 .

7.2 Multiplication

According to Lemma 7.2, depth can help in the approximation of 𝑥 ↦→ 𝑥 2 , which, on first sight, seems like a rather
specific example. However, as we shall discuss in the following, this opens up a path towards fast approximation of
functions with high regularity, e.g., 𝐶 𝑘 ( [0, 1] 𝑑 ) for some 𝑘 > 1. The crucial observation is that, via the polarization
identity we can write the product of two numbers as a sum of squares

(𝑥 + 𝑦) 2 − (𝑥 − 𝑦) 2
𝑥·𝑦= (7.2.1)
4
for all 𝑥, 𝑦 ∈ R. Efficient approximation of the operation of multiplication allows efficient approximation of polynomials.
Those in turn are well-known to be good approximators for functions exhibiting 𝑘 ∈ N derivatives. Before exploring this
idea further in the next section, we first make precise the observation that neural networks can efficiently approximate
the multiplication of real numbers.
We start with the multiplication of two numbers, in which case neural networks of logarithmic size in the desired
accuracy are sufficient.

Lemma 7.3 For every 𝜀 > 0 there exists a ReLU neural network Φ×𝜀 : [−1, 1] 2 → [−1, 1] such that

sup |𝑥 · 𝑦 − Φ×𝜀 (𝑥, 𝑦)| ≤ 𝜀,


𝑥,𝑦 ∈ [−1,1]

and it holds size(Φ×𝜀 ) ≤ 𝐶 · (1 + | log(𝜀)|) and depth(Φ×𝜀 ) ≤ 𝐶 · (1 + | log(𝜀)|) for a constant 𝐶 > 0 independent of 𝜀.
Moreover Φ×𝜀 (𝑥, 𝑦) = 0 if 𝑥 = 0 or 𝑦 = 0.

Proof With 𝑛 = ⌈| log4 (𝜀)|⌉, define the neural network


 
× 𝜎ReLU (𝑥 + 𝑦) + 𝜎ReLU (−𝑥 − 𝑦)
Φ 𝜀 (𝑥, 𝑦) :=𝑠 𝑛
2
 
𝜎ReLU (𝑥 − 𝑦) + 𝜎ReLU (𝑦 − 𝑥)
− 𝑠𝑛 . (7.2.2)
2

Since |𝑎| = 𝜎ReLU (𝑎) + 𝜎ReLU (−𝑎), by (7.2.1) we have for all 𝑥, 𝑦 ∈ [−1, 1]

(𝑥 + 𝑦) 2 − (𝑥 − 𝑦) 2
    
× |𝑥 + 𝑦| |𝑥 − 𝑦|
𝑥 · 𝑦 − Φ 𝜀 (𝑥, 𝑦) = − 𝑠𝑛 − 𝑠𝑛
4 2 2
4( 𝑥+𝑦 2 𝑥−𝑦 2
2 ) − 4( 2 ) 4𝑠 𝑛 ( | 𝑥+𝑦 | | 𝑥−𝑦 |
2 ) − 4𝑠 𝑛 ( 2 )
= −
4 4
4(2−2𝑛−1 + 2−2𝑛−1 )
≤ = 4−𝑛 ≤ 𝜀,
4

71
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
where we used |𝑥 + 𝑦|, |𝑥 − 𝑦| ∈ [0, 1]. We have depth(Φ×𝜀 ) = 1 + depth(𝑠 𝑛 ) = 1 + 𝑛 ≤ 1 + ⌈log4 (𝜀)⌉ and size(Φ×𝜀 ) ≤
𝐶 + 2size(𝑠 𝑛 ) ≤ 𝐶𝑛 ≤ 𝐶 · (1 − log(𝜀)) for some constant 𝐶 > 0.
The fact that Φ×𝜀 maps from [−1, 1] 2 → [−1, 1] follows by (7.2.2) and because 𝑠 𝑛 : [0, 1] → [0, 1]. Finally, if
𝑥 = 0, then Φ×𝜀 (𝑥, 𝑦) = 𝑠 𝑛 (|𝑥 + 𝑦|) − 𝑠 𝑛 (|𝑥 − 𝑦|) = 𝑠 𝑛 (|𝑦|) − 𝑠 𝑛 (|𝑦|) = 0. If 𝑦 = 0 the same argument can be made. □
In a similar way as in Proposition 4.8 and Lemma 5.11, we can apply operations with two inputs in the form of a
binary tree to extend them to an operation on arbitrary many inputs.
× : [−1, 1] 𝑛 → [−1, 1] such that
Proposition 7.4 For every 𝑛 ≥ 2 and 𝜀 > 0 there exists a ReLU neural network Φ𝑛, 𝜀

𝑛
Ö
×
sup 𝑥 𝑗 − Φ𝑛, 𝜀 (𝑥 1 , . . . , 𝑥 𝑛 ) ≤ 𝜀,
𝑥 𝑗 ∈ [ −1,1] 𝑗=1

× ) ≤ 𝐶𝑛 · (1 + | log(𝜀/𝑛)|) and depth(Φ × ) ≤ 𝐶 log(𝑛) (1 + | log(𝜀/𝑛)|) for a constant 𝐶 > 0


and it holds size(Φ𝑛, 𝜀 𝑛, 𝜀
independent of 𝜀 and 𝑛.
× := Φ × . If 𝑘 ≥ 2 let
Proof We begin with the case 𝑛 = 2 𝑘 . For 𝑘 = 1 let Φ̃2, 𝛿 𝛿
 
Φ̃2×𝑘 , 𝛿 := Φ×𝛿 ◦ Φ̃2×𝑘−1 , 𝛿 , Φ̃2×𝑘−1 , 𝛿 .

Using Lemma 7.3, we find that this neural network has depth bounded by
 
depth Φ̃2×𝑘 , 𝛿 ≤ 𝑘depth(Φ×𝛿 ) ≤ 𝐶 𝑘 · (1 + | log(𝛿)|) ≤ 𝐶 log(𝑛) (1 + | log(𝛿)|).

Observing that the number of occurences of Φ×𝛿 equals 𝑘−1 × ×


Í
𝑗=0 2 ≤ 𝑛, the size of Φ̃2 𝑘 , 𝛿 can bounded by 𝐶𝑛size(Φ 𝛿 ) ≤
𝑗

𝐶𝑛 · (1 + | log(𝛿)|).
𝑘
To estimate the approximation error, denote with 𝒙 = (𝑥 𝑗 ) 2𝑗=1

Ö
𝑒 𝑘 := sup 𝑥 𝑗 − Φ̃2×𝑘 , 𝛿 (𝒙) .
𝑥 𝑗 ∈ [−1,1] 𝑗 ≤2 𝑘

Then, using short notation of the type 𝒙 ≤2𝑘−1 := (𝑥1 , . . . , 𝑥 2𝑘−1 ),

2𝑘
Ö  
𝑒𝑘 = sup 𝑥 𝑗 − Φ×𝛿 Φ̃2×𝑘−1 , 𝛿 (𝒙 ≤2𝑘−1 ), Φ̃2×𝑘−1 , 𝛿 (𝒙 >2𝑘−1 )
𝑥 𝑗 ∈ [ −1,1] 𝑗=1

© Ö
≤𝛿+ sup 𝑥 𝑗 𝑒 𝑘−1 + Φ̃2×𝑘−1 , 𝛿 (𝒙 >2𝑘−1 ) 𝑒 𝑘−1 ®
ª
­
𝑥 ∈ [ −1,1] 𝑗 ≤2 𝑘−1
« ¬
𝑘−2
∑︁
≤ 𝛿 + 2𝑒 𝑘−1 ≤ 𝛿 + 2(𝛿 + 2𝑒 𝑘−2 ) ≤ · · · ≤ 𝛿 2 𝑗 + 2 𝑘−1 𝑒 1
𝑗=0
𝑘
≤ 2 𝛿 = 𝑛𝛿 = 𝜀.

Here we used 𝑒 1 ≤ 𝛿, and that Φ̃2×𝑘 , 𝛿 maps [−1, 1] 2


𝑘−1
to [−1, 1], which is a consequence of Lemma 7.3.
The case for general 𝑛 ≥ 2 (not necessarily 𝑛 = is treated similar as in Lemma 5.11, by replacing some Φ×𝛿
2𝑘 )
neural networks with identity neural networks.
× := Φ̃ × concludes the proof.
Finally, setting 𝛿 := 𝜀/𝑛 and Φ𝑛, □
𝜀 𝑛, 𝛿

72
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
7.3 𝑪 𝒌,𝒔 functions

We will now discuss the implications of our observations in the previous sections for the approximation of functions
in the class 𝐶 𝑘,𝑠 .

Definition 7.5 Let 𝑘 ∈ N0 , 𝑠 ∈ [0, 1] and Ω ⊆ R𝑑 . Then

∥ 𝑓 ∥ 𝐶 𝑘,𝑠 (Ω) := sup max |𝐷 𝜶 𝑓 (𝒙)|


𝒙∈Ω {𝜶∈N0𝑑 | |𝜶 | ≤ 𝑘 }
|𝐷 𝜶 𝑓 (𝒙) − 𝐷 𝜶 𝑓 ( 𝒚)| (7.3.1)
+ sup max ,
𝒙≠𝒚 ∈Ω {𝜶∈N0𝑑 | |𝜶 |=𝑘 } ∥𝒙 − 𝒚∥ 2𝑠

and we denote by 𝐶 𝑘,𝑠 (Ω) the set of functions 𝑓 ∈ 𝐶 𝑘 (Ω) for which ∥ 𝑓 ∥ 𝐶 𝑘,𝑠 (Ω) < ∞.

Note that these spaces are ordered according to

𝐶 𝑘 (Ω) ⊇ 𝐶 𝑘,𝑠 (Ω) ⊇ 𝐶 𝑘,𝑡 (Ω) ⊇ 𝐶 𝑘+1 (Ω)

for all 0 < 𝑠 ≤ 𝑡 ≤ 1.


In order to state our main result, we first recall a version of Taylor’s remainder formula for 𝐶 𝑘,𝑠 (Ω) functions.

Lemma 7.6 Let 𝑑 ∈ N, 𝑘 ∈ N, 𝑠 ∈ [0, 1], Ω = [0, 1] 𝑑 and 𝑓 ∈ 𝐶 𝑘,𝑠 (Ω). Then for all 𝒂, 𝒙 ∈ Ω
∑︁ 𝐷 𝜶 𝑓 (𝒂)
𝑓 (𝒙) = (𝒙 − 𝒂) 𝜶 + 𝑅 𝑘 (𝒙) (7.3.2)
𝜶!
{𝜶∈N0𝑑 | 0≤ |𝜶 | ≤ 𝑘 }

𝑘+1/2
where with ℎ := max𝑖 ≤𝑑 |𝑎 𝑖 − 𝑥 𝑖 | we have |𝑅 𝑘 (𝒙)| ≤ ℎ 𝑘+𝑠 𝑑 𝑘! ∥ 𝑓 ∥ 𝐶 𝑘,𝑠 (Ω) .

Proof First, for a function 𝑔 ∈ 𝐶 𝑘 (R) and 𝑎, 𝑡 ∈ R


𝑘−1 ( 𝑗 )
∑︁ 𝑔 (𝑎) 𝑔 (𝑘 ) (𝜉)
𝑔(𝑡) = (𝑡 − 𝑎) 𝑗 + (𝑡 − 𝑎) 𝑘
𝑗=0
𝑗! 𝑘!

𝑔 ( 𝑗 ) (𝑎) 𝑔 (𝑘 ) (𝜉) − 𝑔 (𝑘 ) (𝑎)


𝑘
∑︁
= (𝑡 − 𝑎) 𝑗 + (𝑡 − 𝑎) 𝑘 ,
𝑗=0
𝑗! 𝑘!

for some 𝜉 between 𝑎 and 𝑡. Now let 𝑓 ∈ 𝐶 𝑘,𝑠 (R𝑑 ) and 𝒂, 𝒙 ∈ R𝑑 . Thus with 𝑔(𝑡) := 𝑓 (𝒂 + 𝑡 · (𝒙 − 𝒂)) holds for
𝑓 (𝒙) = 𝑔(1)
𝑘−1 ( 𝑗 )
∑︁ 𝑔 (0) 𝑔 (𝑘 ) (𝜉)
𝑓 (𝒙) = + .
𝑗=0
𝑗! 𝑘!

By the chain rule


 
∑︁ 𝑗 𝜶
𝑔 ( 𝑗 ) (𝑡) = 𝐷 𝑓 (𝒂 + 𝑡 · (𝒙 − 𝒂)) (𝒙 − 𝒂) 𝜶 ,
𝜶
{𝜶∈N0𝑑 | |𝜶 |= 𝑗 }

𝑗 𝑗! 𝑗!
and (𝒙 − 𝒂) 𝜶 =
Î𝑑
where we use the multivariate notations 𝜶 = 𝜶! = Î𝑑
𝛼𝑗 ! 𝑗=1 (𝑥 𝑗 − 𝑎 𝑗 ) 𝛼 𝑗 . Hence
𝑗=1

73
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
∑︁ 𝐷 𝜶 𝑓 (𝒂)
𝑓 (𝒙) = (𝒙 − 𝒂) 𝜶
𝜶!
{𝜶∈N0𝑑 | 0≤ |𝜶 | ≤ 𝑘 }
| {z }
∈P 𝑘
∑︁ 𝜶
𝐷 𝑓 (𝒂 + 𝜉 · (𝒙 − 𝒂)) − 𝐷 𝜶 𝑓 (𝒂)
+ (𝒙 − 𝒂) 𝜶 ,
𝜶!
|𝜶 |=𝑘
| {z }
=:𝑅 𝑘

for some 𝜉 ∈ [0, 1]. Using the definition of ℎ, the remainder term can be bounded by
 
𝑘 𝜶 1 𝜶
∑︁ 𝑘
|𝑅 𝑘 | ≤ ℎ max sup |𝐷 𝑓 (𝒂 + 𝑡 · (𝒙 − 𝒂)) − 𝐷 𝑓 (𝒂)|
|𝜶 |=𝑘 𝒙∈Ω 𝑘! 𝜶
𝑡 ∈ [0,1] {𝜶∈N0𝑑 | |𝜶 |=𝑘 }

𝑑 𝑘+ 12
≤ ℎ 𝑘+𝑠 ∥ 𝑓 ∥ 𝐶 𝑘,𝑠 ( [0,1] 𝑑 ) ,
𝑘!
√ Í 𝑘
where we used (7.3.1), ∥𝒙 − 𝒂∥ 2 ≤ 𝑑ℎ and {𝜶∈N𝑑 | |𝜶 |=𝑘 } 𝜶 = (1 + · · · + 1) 𝑘 = 𝑑 𝑘 by the multinomial formula. □
0

We now come to the main statement of this section. Up to logarithmic terms, it shows the convergence rate (𝑘 + 𝑠)/𝑑
for approximating functions in 𝐶 𝑘,𝑠 ( [0, 1] 𝑑 ).

Theorem 7.7 Let 𝑑 ∈ N, 𝑘 ∈ N0 , 𝑠 ∈ [0, 1], and Ω = [0, 1] 𝑑 . Then, there exists a constant 𝐶 > 0 such that for every
𝑓
𝑓 ∈ 𝐶 𝑘,𝑠 (Ω) and every 𝑁 ≥ 2 there exists a ReLU neural network Φ 𝑁 such that
𝑘+𝑠
sup | 𝑓 (𝒙) − Φ 𝑁 (𝒙)| ≤ 𝐶𝑁 −
𝑓 𝑑 ∥ 𝑓 ∥ 𝐶 𝑘,𝑠 (Ω) , (7.3.3)
𝒙∈Ω

𝑓 𝑓
size(Φ 𝑁 ) ≤ 𝐶𝑁 log(𝑁) and depth(Φ 𝑁 ) ≤ 𝐶 log(𝑁).

Proof The idea of the proof is to use the so-called “partition of unity method”: First we will construct a partition
of unity (𝜑𝝂 )𝝂 , such that each 𝜑𝝂 has support on a 𝑂 (1/𝑀) neighborhood of a point 𝜼 ∈ Ω. On each Í of these
neighborhoods we will use the local Taylor polynomial 𝑝𝝂 of 𝑓 around 𝜼 to approximate the function.ÍThen 𝝂 𝜑𝝂 𝑝𝝂
gives an approximation to 𝑓 on Ω. This approximation can be emulated by a neural network of the type 𝝂 Φ×𝜀 (𝜑𝝂 , 𝑝ˆ𝝂 ),
where 𝑝ˆ𝝂 is an neural network approximation to the polynomial 𝑝𝝂 .
It suffices to show the theorem in the case where
 𝑘+1/2 
𝑑
max , exp(𝑑) ∥ 𝑓 ∥ 𝐶 𝑘,𝑠 (Ω) ≤ 1.
𝑘!

The general case can then be immediately deduced by a scaling argument.


Step 1. We construct the neural network. Define
𝑘+𝑠
𝑀 := ⌈𝑁 1/𝑑 ⌉ and 𝜀 := 𝑁 − 𝑑 . (7.3.4)

Consider a uniform simplicial mesh with nodes {𝝂/𝑀 | 𝝂 ≤ 𝑀 } where 𝝂/𝑀 := (𝜈1 /𝑀, . . . , 𝜈 𝑑 /𝑀), and where
“𝝂 ≤ 𝑀” is short for {𝝂 ∈ N0𝑑 | 𝜈𝑖 ≤ 𝑀 for all 𝑖 ≤ 𝑑}. We denote by 𝜑𝝂 the cpwl basis function on this mesh such that
𝜑𝝂 (𝝂/𝑀) = 1 and 𝜑𝝂 (𝜼 𝝁 ) = 1 whenever 𝝁 ≠ 𝝂. As shown in Chapter 5, 𝜑𝝂 is a neural network of size 𝑂 (1). Then
∑︁
𝜑𝝂 ≡ 1 on Ω, (7.3.5)
𝝂≤𝑀

is a partition of unity. Moreover, observe that

74
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
 
𝝂 1
supp(𝜑𝝂 ) ⊆ 𝒙 ∈ Ω 𝒙− ≤ , (7.3.6)
𝑀 ∞ 𝑀

where ∥𝒙∥ ∞ = max𝑖 ≤𝑑 |𝑥𝑖 |.


For each 𝝂 ≤ 𝑀 define the multivariate polynomial
∑︁ 𝐷 𝜶 𝑓 𝝂 

𝑀 𝝂 𝜶
𝑝𝝂 (𝒙) := 𝒙− ∈ P𝑘 ,
𝜶! 𝑀
|𝜶 | ≤ 𝑘

and the approximation


∑︁ 𝐷 𝜶 𝑓 𝝂
  𝜈𝑖𝜶,1 𝜈𝑖𝜶,𝑘 
𝑝ˆ𝝂 (𝒙) := 𝑀
Φ×|𝜶 |, 𝜀 𝑥𝑖𝜶,1 − , . . . , 𝑥 𝑖𝜶,𝑘 − ,
𝜶! 𝑀 𝑀
|𝜶 | ≤ 𝑘

where (𝑖𝜶,1 , . . . , 𝑖𝜶,𝑘 ) ∈ {0, . . . , 𝑑} 𝑘 is arbitrary but fixed such that |{ 𝑗 | 𝑖𝜶, 𝑗 = 𝑟 }| = 𝛼𝑟 for all 𝑟 = 1, . . . , 𝑑.
Finally, define
∑︁
Φ×𝜀 (𝜑𝝂 , 𝑝ˆ𝝂 ),
𝑓
Φ 𝑁 := (7.3.7)
𝝂≤𝑀

where the precise values of 𝜀 > 0 and 𝑀 ∈ N will be chosen at the end of Step 2.
Step 2. We bound the approximation error. First, for each 𝒙 ∈ Ω, using (7.3.5) and (7.3.6)

∑︁ ∑︁
𝑓 (𝒙) − 𝜑𝝂 (𝒙) 𝑝𝝂 (𝒙) ≤ |𝜑𝝂 (𝒙)|| 𝑝𝝂 (𝒙) − 𝑓 (𝒙)|
𝝂≤𝑀 𝝂≤𝑀
≤ max sup | 𝑓 ( 𝒚) − 𝑝𝝂 ( 𝒚)|.
𝝂≤𝑀 𝝂 1
{𝒚 ∈Ω | ∥ 𝑀 −𝒚 ∥ ∞ ≤ 𝑀 }

By Lemma 7.6 we obtain


1
∑︁ 𝑑 𝑘+ 2
sup 𝑓 (𝒙) − 𝜑𝝂 (𝒙) 𝑝𝝂 (𝒙) ≤ 𝑀 − (𝑘+𝑠) ∥ 𝑓 ∥ 𝐶 𝑘,𝑠 (Ω) ≤ 𝑀 − (𝑘+𝑠) . (7.3.8)
𝒙∈Ω 𝝂≤𝑀
𝑘!

Next, fix 𝝂 ≤ 𝑀 and 𝒚 ∈ Ω such that ∥𝝂/𝑀 − 𝒚∥ ∞ ≤ 1/𝑀 ≤ 1. Then by Proposition 7.4

∑︁ 𝐷 𝜶 𝑓 𝝂 Ö
 𝑘 
𝑀
𝜈𝑖𝜶, 𝑗 
| 𝑝𝝂 ( 𝒚) − 𝑝ˆ𝝂 ( 𝒚)| ≤ 𝑦 𝑖𝜶, 𝑗 −
𝜶! 𝑀
|𝜶 | ≤ 𝑘 𝑗=1
 
×
𝜈𝑖𝜶,1 𝑖𝜶,𝑘
− Φ |𝜶 |, 𝜀 𝑦 𝑖𝜶,1 − , . . . , 𝑦 𝑖𝜶,𝑘 −
𝑀 𝑀
∑︁ 𝐷 𝜶 𝑓 ( 𝝂 )
𝑀
≤𝜀 ≤ 𝜀 exp(𝑑) ∥ 𝑓 ∥ 𝐶 𝑘,𝑠 (Ω) ≤ 𝜀, (7.3.9)
𝜶!
|𝜶 | ≤ 𝑘

where we used |𝐷 𝜶 𝑓 (𝝂/𝑀)| ≤ ∥ 𝑓 ∥ 𝐶 𝑘,𝑠 (Ω) and

𝑘 𝑘 ∞
∑︁ 1 ∑︁ 1 ∑︁ 𝑗! ∑︁ 𝑑 𝑗 ∑︁ 𝑑 𝑗
= = ≤ = exp(𝑑).
𝜶! 𝑗=0 𝑗! 𝜶! 𝑗=0 𝑗! 𝑗!
{𝜶∈N0𝑑 | |𝜶 | ≤ 𝑘 } {𝜶∈N0𝑑 | |𝜶 |= 𝑗 } 𝑗=0

Similarly, one shows that

75
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
| 𝑝ˆ𝝂 (𝒙)| ≤ exp(𝑑) ∥ 𝑓 ∥ 𝐶 𝑘,𝑠 (Ω) ≤ 1 for all 𝒙 ∈ Ω.

Fix 𝒙 ∈ Ω. Then 𝒙 belongs to a simplex of the mesh, and thus 𝒙 can be in the support of at most 𝑑 + 1 (the number of
nodes of a simplex) functions 𝜑𝝂 . Moreover, Lemma 7.3 implies that supp Φ×𝜀 (𝜑𝝂 (·), 𝑝ˆ𝝂 (·)) ⊆ supp 𝜑𝝂 . Hence, using
Lemma 7.3 and (7.3.9)

∑︁ ∑︁
𝜑𝝂 (𝒙) 𝑝𝝂 (𝒙) − Φ×𝜀 (𝜑𝝂 (𝒙), 𝑝ˆ𝝂 (𝒙))
𝝂≤𝑀 𝝂≤𝑀
∑︁
≤ (|𝜑𝝂 (𝒙) 𝑝𝝂 (𝒙) − 𝜑𝝂 (𝒙) 𝑝ˆ𝝂 (𝒙)|
{𝝂 ≤ 𝑀 | 𝒙∈supp 𝜑𝝂 }

+ |𝜑𝝂 (𝒙) 𝑝ˆ𝝂 (𝒙) − Φ×𝜀 (𝜑𝝂 (𝒙), 𝑝ˆ𝝂 (𝒙))|




≤ 𝜀 + (𝑑 + 1)𝜀 = (𝑑 + 2)𝜀.

In total, together with (7.3.8)

sup | 𝑓 (𝒙) − Φ 𝑁 (𝒙)| ≤ 𝑀 − (𝑘+𝑠) + 𝜀 · (𝑑 + 2).


𝑓
𝒙∈Ω

With our choices in (7.3.4) this yields the error bound (7.3.3).
Step 3. It remains to bound the size and depth of the neural network in (7.3.7).
By Lemma 5.17, for each 0 ≤ 𝝂 ≤ 𝑀 we have

size(𝜑𝝂 ) ≤ 𝐶 · (1 + 𝑘 T ), depth(𝜑𝝂 ) ≤ 𝐶 · (1 + log(𝑘 T )), (7.3.10)

where 𝑘 T is the maximal number of simplices attached to a node in the mesh. Note that 𝑘 T is independent of 𝑀, so
that the size and depth of 𝜑𝝂 are bounded by a constant 𝐶 𝜑 independent of 𝑀.
Lemma 7.3 and Proposition 7.4 thus imply with our choice of 𝜀 = 𝑁 − (𝑘+𝑠)/𝑑

depth(Φ 𝑁 ) = depth(Φ×𝜀 ) + max depth(𝜑𝜼 ) + max depth( 𝑝ˆ𝝂 )


𝑓
𝝂≤𝑀 𝝂≤𝑀
≤ 𝐶 · (1 + | log(𝜀)| + 𝐶 𝜑 ) + depth(Φ×𝑘, 𝜀 )
≤ 𝐶 · (1 + | log(𝜀)| + 𝐶 𝜑 )
≤ 𝐶 · (1 + log(𝑁))

for some constant 𝐶 > 0 depending on 𝑘 and 𝑑 (we use “𝐶” to denote a generic constant that can change its value in
each line).
To bound the size, we first observe with Lemma 5.4 that

∑︁  
size( 𝑝ˆ𝝂 ) ≤ 𝐶 · ­1 + size Φ×|𝜶 |, 𝜀 ® ≤ 𝐶 · (1 + | log(𝜀)|)
© ª

« |𝜶 | ≤ 𝑘 ¬

for some 𝐶 depending on 𝑘. Thus, for the size of Φ 𝑁 we obtain with 𝑀 = ⌈𝑁 1/𝑑 ⌉
𝑓

!
∑︁
size(Φ×𝜀 )
𝑓 
size(Φ 𝑁 ) ≤ 𝐶 · 1+ + size(𝜑𝝂 ) + size( 𝑝ˆ𝝂 )
𝝂≤𝑀
𝑑
≤ 𝐶 · (1 + 𝑀) (1 + | log(𝜀)| + 𝐶 𝜑 )
≤ 𝐶 · (1 + 𝑁 1/𝑑 ) 𝑑 (1 + 𝐶 𝜑 + log(𝑁))
≤ 𝐶𝑁 log(𝑁),

76
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
which concludes the proof. □
Theorem 7.7 shows the convergence rate (𝑘 + 𝑠)/𝑑 for approximating a 𝐶 𝑘,𝑠 -function 𝑓 : [0, 1] 𝑑 → R. As long as
𝑘 is large, in principle we can achieve arbitrarily large (and 𝑑-independent if 𝑘 ≥ 𝑑) convergence rates. Crucially, and
𝑘+𝑠
in contrast to Theorem 5.22, achieving error 𝑁 − 𝑑 requires the neural networks to be of size 𝑂 (𝑁 log(𝑁)) and depth
𝑂 (log(𝑁)), i.e. to get more and more accurate approximations, the neural network depth is required to increase.

Remark 7.8 Under the stronger assumption that 𝑓 is an analytic function (in particular such an 𝑓 is in 𝐶 ∞ ), one can
show exponential convergence rates of the type exp(−𝛽𝑁 1/(𝑑+1) ) for some fixed 𝛽 > 0 and where 𝑁 corresponds again
to the neural network size (up to logarithmic terms), see [53, 145].

Remark 7.9 Let 𝐿 : 𝒙 ↦→ 𝑨𝒙 + 𝒃 : R𝑑 → R𝑑 be a bijective affine transformation and set Ω := 𝐿( [0, 1] 𝑑 ) ⊆ R𝑑 . Then
𝑓
for a function 𝑓 ∈ 𝐶 𝑘,𝑠 (Ω), by Theorem 7.7 there exists a neural network Φ 𝑁 such that

sup | 𝑓 (𝒙) − Φ 𝑁 (𝐿 −1 (𝒙))| = | 𝑓 (𝐿(𝒙)) − Φ 𝑁 (𝐿 −1 (𝒙))|


𝑓 𝑓
sup
𝒙∈Ω 𝒙∈ [0,1] 𝑑
𝑘+𝑠
≤ 𝐶 ∥ 𝑓 ◦ 𝐿 ∥ 𝐶 𝑘,𝑠 ( [0,1] 𝑑 ) 𝑁 − 𝑑 .

Since for 𝒙 ∈ [0, 1] 𝑑 holds | 𝑓 (𝐿(𝒙))| ≤ sup𝒚 ∈Ω | 𝑓 ( 𝒚)| and if 0 ≠ 𝜶 ∈ N0𝑑 is a multiindex |𝐷 𝜶 ( 𝑓 (𝐿(𝒙))| ≤
∥ 𝐴∥ 2|𝜶 | sup𝒚 ∈Ω |𝐷 𝜶 𝑓 ( 𝒚)|, we have ∥ 𝑓 ◦ 𝐿 ∥ 𝐶 𝑘,𝑠 ( [0,1] 𝑑 ) ≤ (1 + ∥ 𝐴∥ 2𝑘+𝑠 ) ∥ 𝑓 ∥ 𝐶 𝑘,𝑠 (Ω) . Thus the convergence rate 𝑁 − 𝑑 is
𝑘+𝑠

achieved on every set of the type 𝐿 ( [0, 1] 𝑑 ) for an affine map 𝐿, and in particular on every hypercube ×𝑑𝑗=1 [𝑎 𝑗 , 𝑏 𝑗 ].

Bibliography and further reading

Pinkus Section 6 Mhaskar 1996

77
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Chapter 8
High-dimensional approximation

The results in the previous sections provide convergence rates when approximating a function 𝑓 : [0, 1] 𝑑 → R by a
neural network in terms of the neural network’s size. For example, in Theorem 7.7, we observed an approximation rate
of the form O (𝑁 − (𝑘+𝑠)/𝑑 ) for the number of neurons 𝑁 growing to infinity. Here 𝑘 and 𝑠 describe the smoothness of
𝑓 . Note that, for 𝑑 > 𝑘 + 𝑠 the approximation error decays potentially very slowly.
If we look at this phenomenon from a different angle, we see that, achieving an error of size 𝜀 requires a neural
network with 𝑂 (𝜀 −𝑑/(𝑠+𝑘 ) ) parameters. Hence, the number of parameters required to achieve a certain error, depends
exponentially on the dimension 𝑑.
This exponential dependence on the dimension 𝑑 is a phenomenon that appears throughout approximation theory
and is referred to as the curse of dimensionality. For classical smoothness spaces, such exponential 𝑑 dependence
cannot be avoided, [16, 48, 143]. However, functions 𝑓 that are of interest in practice, may have additional properties,
which allow them to be approximated at a better convergence rate.
In this chapter, we discuss three assumptions under which the curse of dimensionality can be avoided: First, we
discuss an assumption on the underlying space of functions that restricts their behavior in Fourier space. We will
observe that this assumption will allow for slow but dimension independent approximation rates. Next, we will assume
that we are attempting to approximate functions that have a specific compositional structure. This means that these
are compositions and sums of sub-functions. Here we will see that the overall approximation problem will suffer
only through the highest input dimension that appears in the sub-functions. Finally, we study the situation, where we
still approximate high-dimensional functions, but only care about the approximation accuracy on a lower dimensional
manifold. In this case, we will observe that the smoothness on the low dimensional manifold and the dimension of that
manifold determine the approximation rate.

8.1 The Barron class

In [10], a set of functions with bounded variation was introduced. The elements of this set can be approximated by
neural networks without a curse of dimensionality. We introduce this set, coined Barron class, below. Let 𝑓 ∈ 𝐿 1 (R𝑑 )
and let

𝑓ˆ(𝒘) := exp(−2𝜋 i 𝒘 ⊤ 𝒙) 𝑓 (𝒙) d𝒙
R𝑑

be its Fourier transform. Then, for 𝐶 > 0 the Barron class is defined as
 ∫ 
Γ𝐶 B 𝑓 ∈ 𝐿 1 (R𝑑 ) : ∥ 𝑓ˆ∥ 1 < ∞, |2𝜋𝜉 | 𝑓ˆ(𝜉) 𝑑𝜉 < 𝐶 .
R𝑑

79
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
We point out that the definition of Γ𝐶 in [10] is more general, but out assumption will simplify some of the arguments.
The proof below is nonetheless very close to the original result. This version is akin to [154, Section 5].

Theorem 8.1 ([10, Theorem 1]) Let 𝐶 > 0, 𝑑 ∈ N, 𝑓 ∈ Γ𝐶 , 𝜎 : R → R be sigmoidal (Definition 3.11) and
𝑁 ∈ N. Then, for every 𝑐 > 4𝐶 2 , there exists a neural network Φ 𝑓 with architecture (𝑑, 𝑁, 1; 𝜎), depth(Φ 𝑓 ) = 1,
size(Φ 𝑓 ) ≤ 𝑁 · (𝑑 + 2) + 1, and

1 2 𝑐
𝑓 (𝒙) − Φ 𝑓 (𝒙) 𝑑𝒙 ≤ , (8.1.1)
|𝐵1 | 𝐵1
𝑑 𝑑 𝑁

where |𝐵1𝑑 | denotes the Lebesgue measure of 𝐵1𝑑 .

Remark 8.2 The approximation rate on (8.1.1) can be slightly improved under some assumptions on the activation
function such as powers of the ReLU, [189].

Importantly, the dimension 𝑑 does not enter on the right-hand side of (8.1.1), in particular the convergence rate is
not directly affected by the dimension, which is in stark contrast to our earlier results. However, it should be noted, that
the constant 𝐶 𝑓 may still have some inherent 𝑑-dependence, see Exercise 8.12.
The proof of Theorem 8.1 is based on a peculiar property of high-dimensional convex sets, which is described by
the so-called approximate Caratheodory theorem. We state it as a lemma below.

Lemma 8.3 ([210, Theorem 0.0.2], [10, 156]) Let 𝐺 be a subset of a Hilbert space and let 𝐺 be such that the norm of
each element of 𝐺 is bounded by 𝐵 > 0. Let 𝑓 ∈ co(𝐺). Then, for every 𝑁 ∈ N and 𝑐 ′ > 𝐵2 there exist (𝑔𝑖 )𝑖=1
𝑁
⊆𝐺
Í 𝑁
and (𝑐 𝑖 )𝑖=1 ⊆ [0, 1] with 𝑖=1 𝑐 𝑖 = 1 such that
𝑁

2
𝑐′
𝑁
∑︁
𝑓− 𝑐 𝑖 𝑔𝑖 ≤ . (8.1.2)
𝑖=1
𝑁

Proof Let 𝑓 ∈ co(𝐺). Naturally, there exists for every 𝛿 > 0 a function 𝑓 ∗ ∈ co(𝐺) so that ∥ 𝑓 − 𝑓 ∗ ∥ ≤ 𝛿. By definition
𝑓 ∗ is a convex combination of elements from 𝐺, which means that there exists 𝑚 ∈ N so that
𝑚
∑︁
𝑓∗ = 𝑐 ′𝑖 𝑔𝑖′
𝑖=1

for some (𝑔𝑖′ )𝑖=1


𝑚 ⊆ 𝐺, (𝑐 ′ ) 𝑚 ⊆ [0, 1], with ′
Í𝑚
𝑖 𝑖=1 𝑖=1 𝑐 𝑖 = 1.
We will next apply some tools from probability theory. Since most readers will be more used to random variables
taking real values, we translate the problem of approximation 𝑓 ∗ by few elements 𝑔𝑖 to Euclidean space first.
Note that, there exists an at most 𝑚-dimensional linear space 𝐿 𝑚 such that (𝑔𝑖′ )𝑖=1
𝑚 ⊆ 𝐿 . Hence, there is an isometric
𝑚
′ ∗
isomorphism between 𝐿 𝑚 and R . Hence, we can think of 𝑔𝑖 , and 𝑓 as elements of R𝑚 in the sequel.
𝑚

Let 𝜌 be the discrete probability distribution with on {1, . . . , 𝑚} with P𝜌 (𝑘) = 𝑐 ′𝑘 for 𝑘 ∈ {1, . . . , 𝑚} and let (𝑔 𝑗 ) ∞
𝑗=1
random variables defined by 𝑔 𝑗 = 𝑔𝑖′ 𝑗 , where 𝑖 𝑗 ∼ 𝜌 are i.i.d random variables for 𝑗 ∈ N.
Everything is set up in such a way that we have that
𝑚
∑︁
E(𝑔 𝑗 ) = E(𝑔1 ) = 𝑐 ′𝑖 𝑔𝑖′ = 𝑓 ∗ .
𝑖=1

We observe that (𝑋 𝑗 ) ∞ ∗
𝑗=1 B 𝑔𝑖 𝑗 − 𝑓 are i.i.d random variables and E(𝑋 𝑗 ) = 0. Since the 𝑋 𝑗 are independent, we can
make the following computation

80
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
2
𝑁 𝑁
© 1 ∑︁
®= 1
∑︁  
2
ª
E ­­ 𝑋𝑗 ® 𝑁2 E ∥ 𝑋 𝑗 ∥
𝑁 𝑗=1 𝑗=1
« ¬
𝑁
1 ∑︁  
= 2 E𝑖∼𝜌 ∥𝑔𝑖 ∥ 2 − 2 ⟨𝑔𝑖 , 𝑓 ∗ ⟩ + ∥ 𝑓 ∗ ∥ 2
𝑁 𝑗=1
𝑁
1 ∑︁   𝐵2
= 2
E𝑖∼𝜌 ∥𝑔 𝑖 ∥ 2
− ∥ 𝑓 ∗ ∥2 ≤ . (8.1.3)
𝑁 𝑗=1 𝑁

The first identity above follows from Bienaymé’s identity, the second identity uses the polar identity that holds in every
Hilbert space (B.1.3), and the third identity used the linearity of the expected value. From (8.1.3) we conclude that
there exists at least one event 𝜔 such that
2
𝑁
1 ∑︁ 𝐵2
𝑋 𝑗 (𝜔) ≤
𝑁 𝑗=1 𝑁

and hence
2
𝑁
1 ∑︁ 𝐵
𝑔𝑖 ( 𝜔) − 𝑓 ∗ ≤ √ .
𝑁 𝑗=1 𝑗 𝑁

By the triangle inequality, we conclude that

𝑁 2  2
1 ∑︁ 𝐵
𝑔𝑖 ( 𝜔) − 𝑓 ≤ √ +𝛿 .
𝑁 𝑖=1 𝑗 𝑁

Since 𝛿 was arbitrary this yields the result. □

Remark 8.4 Lemma 8.3 gives us the following powerful tool: If we√ want to approximate a function 𝑓 with a superposition
of 𝑁 elements in a set 𝐺 and we want an approximation rate of 1/ 𝑁, then it is sufficient to show that 𝑓 can be represented
as an arbitrary (infinite) convex combination of elements of 𝐺.

Lemma 8.3 suggests that we can prove Theorem 8.1 by showing that each function in Γ𝐶 is in the convex hull
of neural networks with just a single neuron. We make a small detour before proving this result. We first show that
each function 𝑓 ∈ Γ𝐶 is in the convex hull of affine transforms of Heaviside functions. We define the set of affine
transforms of Heaviside functions 𝐺 𝐶 as follows

𝐺 𝐶 B 𝐵1𝑑 ∋ 𝒙 ↦→ 𝛾 · 1R+ (⟨𝒂, 𝒙⟩ + 𝑏) : 𝒂 ∈ R𝑑 , 𝑏 ∈ R, |𝛾| ≤ 2𝐶 .

The following lemma now provides the link between Γ𝐶 and 𝐺 𝐶 .

Lemma 8.5 ([154, Lemma 5.12]) Let 𝑑 ∈ N, 𝐶 > 0 and 𝑓 ∈ Γ𝐶 . Then ( 𝑓 | 𝐵𝑑 − 𝑓 (0)) ∈ co(𝐺 𝐶 ). Here the closure is
1
taken with respect to the norm ∥ · ∥ 𝐿 2,⋄ ( 𝐵𝑑 ) , defined by
1

∫ ! 1/2
1 2
∥𝑔∥ 𝐿 2,⋄ (𝐵𝑑 ) B |𝑔(𝒙)| 𝑑𝒙 .
1 |𝐵1𝑑 | 𝐵1𝑑

Proof Since 𝑓 ∈ Γ𝐶 , we have that 𝑓 , 𝑓ˆ ∈ 𝐿 1 (R𝑑 ). Hence, we can apply the inverse Fourier transform and get the
following computation:

81
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
∫  
𝑓 (𝒙) − 𝑓 (0) = 𝑓ˆ(𝜉) 𝑒 2 𝜋𝑖⟨ 𝒙, 𝜉 ⟩ − 1 𝑑𝜉
𝑑
∫R  
= 𝑓ˆ(𝜉) 𝑒 2 𝜋𝑖⟨ 𝒙, 𝜉 ⟩+𝑖𝜅 ( 𝜉 ) − 𝑒 𝑖𝜅 ( 𝜉 ) 𝑑𝜉
𝑑
∫R
= 𝑓ˆ(𝜉) (cos(2𝜋⟨𝒙, 𝜉⟩ + 𝜅(𝜉)) − cos(𝜅(𝜉))) 𝑑𝜉,
R𝑑

where 𝜅(𝜉) is the phase of 𝑓ˆ(𝜉) and the last inequality follows since 𝑓 is real-valued.
To use the fact that 𝑓 has a bounded Fourier moment, we reformulate the integral above as follows:

𝑓ˆ(𝜉) (cos(2𝜋⟨𝑥, 𝜉⟩ + 𝜅(𝜉)) − cos(𝜅(𝜉))) 𝑑𝜉
R𝑑

(cos(2𝜋⟨𝑥, 𝜉⟩ + 𝜅(𝜉)) − cos(𝜅(𝜉)))
= |2𝜋𝜉 | 𝑓ˆ(𝜉) 𝑑𝜉.
R 𝑑 |2𝜋𝜉 |
We define a new measure Λ such that
1
𝑑Λ(𝜉) B |2𝜋𝜉 || 𝑓ˆ(𝜉)|𝑑𝜉.
𝐶
Since 𝑓 ∈ Γ𝐶 , it follows that Λ is a probability measure on R𝑑 . Now we have that

(cos(2𝜋⟨𝒙, 𝜉⟩ + 𝜅(𝜉)) − cos(𝜅(𝜉)))
𝑓 (𝒙) − 𝑓 (0) = 𝐶 𝑑Λ(𝜉). (8.1.4)
R𝑑 |2𝜋𝜉 |
Next, we would like to replace the integral of (8.1.4) by an appropriate finite sum.
The cosine function is 1-Lipschitz. Hence, we note that 𝜉 ↦→ 𝑞 𝒙 (𝜉) B (cos(2𝜋⟨𝒙, 𝜉⟩ + 𝜅(𝜉)) − cos(𝜅(𝜉)))/|2𝜋𝜉 | is
bounded by 1. In addition, it is easy to see that 𝑞 𝒙 is well-defined and continuous even in the origin.
Therefore, the integral (8.1.4) can be approximated by a Riemann sum, i.e.,

∫ ∑︁
𝐶 𝑞 𝒙 (𝜉)𝑑Λ(𝜉) − 𝐶 𝑞 𝒙 (𝜃) · Λ(𝐼 𝜃 ) → 0, (8.1.5)
R𝑑 1 𝑑
𝜃∈ 𝑛Z

where 𝐼 𝜃 B [0, 1/𝑛) 𝑑 + 𝜃.


Since 𝑓 (𝒙) − 𝑓 (0) is continuous and thus bounded on 𝐵1𝑑 , we have by the dominated convergence theorem that
2

1 ∑︁
𝑓 (𝒙) − 𝑓 (0) − 𝐶 𝑞 𝒙 (𝜃) · Λ(𝐼 𝜃 ) 𝑑𝒙 → 0. (8.1.6)
|𝐵1𝑑 | 𝐵1𝑑
𝜃 ∈ 𝑛1 Z𝑑

Since 𝜃 ∈ 1 Z𝑑 Λ(𝐼 𝜃 ) = Λ(R𝑑 ) = 1, we conclude that 𝑓 (𝒙) − 𝑓 (0) is in the 𝐿 2,⋄ (𝐵1𝑑 ) closure of convex combinations
Í
𝑛
of functions of the form
𝑥 ↦→ 𝑔 𝜃 (𝒙) B 𝛼 𝜃 𝑞 𝒙 (𝜃),
for 𝜃 ∈ R𝑑 and 0 ≤ 𝛼 𝜃 ≤ 𝐶.
Now we only need to prove that each 𝑔 𝜃 is in co(𝐺 𝐶 ). By setting 𝑧 = ⟨𝒙, 𝜃/|𝜃|⟩, we observe that the result follows
if the map
cos(2𝜋|𝜃|𝑧 + 𝜅(𝜃)) − cos(𝜅(𝜃))
[−1, 1] ∋ 𝑧 ↦→ 𝛼 𝜃 C 𝑔˜ 𝜃 (𝑧),
|2𝜋𝜃|
can be approximated arbitrarily well by convex combinations of functions of the form

[−1, 1] ∋ 𝑧 ↦→ 𝛾1R+ (𝑎 ′ 𝑧 + 𝑏 ′ ) , (8.1.7)

82
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
where 𝑎 ′ , 𝑏 ′ ∈ R and |𝛾| ≤ 2𝐶.
We define, for 𝑇 ∈ N,
  
𝑖 𝑖−1
𝑇 𝑔˜ 𝜃
∑︁ 𝑇 − 𝑔˜ 𝜃 𝑇
   
𝑖

𝑖−1
 
𝑖

𝑔𝑇 ,+ B 2𝐶sign 𝑔˜ 𝜃 − 𝑔˜ 𝜃 1R+ 𝑥 − ,
𝑖=1
2𝐶 𝑇 𝑇 𝑇
 
− 𝑇𝑖 − 𝑔˜ 𝜃 1−𝑖

𝑇 𝑔˜ 𝜃        
∑︁ 𝑇 𝑖 1−𝑖 𝑖
𝑔𝑇 ,− B 2𝐶sign 𝑔˜ 𝜃 − − 𝑔˜ 𝜃 1R −𝑥 +
+ .
𝑖=1
2𝐶 𝑇 𝑇 𝑇

Per construction, (𝑔𝑇 ,− ) + (𝑔𝑇 ,+ ) converges to 𝑔˜ 𝜃 for 𝑇 → ∞. Moreover, ∥ 𝑔˜ ′𝜃 ∥ ≤ 𝐶 and hence

𝑇 𝑇
∑︁ | 𝑔˜ 𝜃 (𝑖/𝑇) − 𝑔˜ 𝜃 ((𝑖 − 1)/𝑇)| ∑︁ | 𝑔˜ 𝜃 (−𝑖/𝑇) − 𝑔˜ 𝜃 ((1 − 𝑖)/𝑇)|
+
𝑖=1
2𝐶 𝑖=1
2𝐶
𝑇
2 ∑︁ ′
≤ ∥ 𝑔˜ ∥ ∞ ≤ 1.
2𝐶𝑇 𝑖=1 𝜃

We conclude that (𝑔𝑇 ,− ) + (𝑔𝑇 ,+ ) is a convex combination of functions of the form (8.1.7). Hence, we have that 𝑔˜ 𝜃 can
be arbitrarily well approximated by convex combinations of the form (8.1.7). Therefore, we have that 𝑔 𝜃 ∈ co(𝐺 𝐶 ).
Finally, (8.1.6) yields that 𝑓 − 𝑓 (0) ∈ co(𝐺 𝐶 ). □
We now have all tools to complete the proof of Theorem 8.1.
Proof (of Theorem 8.1) Let 𝑓 ∈ Γ𝐶 , then, invoking Lemma 8.5, we conclude that

𝑓 | 𝐵𝑑 − 𝑓 (0) ∈ co(𝐺 𝐶 ).
1

It is not hard to see that, for every element 𝑔 ∈ 𝐺 𝐶 we have that ∥𝑔∥ 𝐿 2,⋄ (𝐵𝑑 ) ≤ 2𝐶. Applying now Lemma 8.3 with the
1
Hilbert space 𝐿 2,⋄ (𝐵1𝑑 ), we get that for every 𝑁 ∈ N, there exist |𝛾𝑖 | ≤ 2𝐶, 𝒂 𝑖 ∈ R𝑑 , 𝑏 𝑖 ∈ R, for 𝑖 = 1, . . . , 𝑁, so that

𝑁 2
4𝐶 2

1 ∑︁
𝑓 | 𝐵𝑑 (𝒙) − 𝑓 (0) − 𝛾𝑖 1R+ (⟨𝒂 𝑖 , 𝑥⟩ + 𝑏 𝑖 ) 𝑑𝒙 ≤ .
|𝐵1𝑑 | 𝐵1𝑑 1
𝑖=1
𝑁

By Exercise 3.25, it holds that 𝜎(𝜆·) → 1R+ for 𝜆 → ∞ almost everywhere. Thus, it is clear that, for every 𝛿 > 0, there
exist 𝒂˜ 𝑖 , 𝑏˜ 𝑖 , 𝑖 = 1, . . . , 𝑁, so that

𝑁 2
4𝐶 2

1 ∑︁
𝛾𝑖 𝜎 ⟨ 𝒂˜ 𝑖 , 𝒙⟩ + 𝑏˜ 𝑖

𝑓 | 𝐵𝑑 (𝒙) − 𝑓 (0) − 𝑑𝒙 ≤ + 𝛿.
|𝐵1𝑑 | 𝐵1𝑑 1
𝑖=1
𝑁

The result follows by observing that


𝑁
∑︁
𝛾𝑖 𝜎 ⟨ 𝒂˜ 𝑖 , 𝒙⟩ + 𝑏˜ 𝑖 + 𝑓 (0)

𝑖=1

is a neural network with architecture (𝑑, 𝑁, 1; 𝜎), which is clear by Proposition 2.3. □
Remark 8.6 The approximation rate of Theorem 8.1 is independent of the dimension and this is quite surprising in
view of the approximation results of the Chapters 4 and 5. However, there is a more intuitive reason for this result. In
fact, the assumption of having a finite Fourier moment is akin to a dimension-dependent regularity assumption. Indeed,
the condition becomes more restrictive in higher dimensions and hence the complexity of Γ𝐶 does not grow with the
dimension.
This can be seen by relating the Barron class to classical function spaces. In [10, Section II] it was observed that
a sufficient condition is that all derivatives of order up to ⌊𝑑/2⌋ + 2 are square-integrable. Put differently, a sufficient

83
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
condition for a function 𝑓 to be a Barron function is that 𝑓 ∈ 𝑊 ⌊𝑑/2⌋+2,2 (R𝑑 ). Notably the functions need to become
smoother, the higher the dimension is. This assumption would also imply an approximation rate of 𝑁 −1/2 in the 𝐿 2
norm by sums of at most 𝑁 B-splines, see [147, 48].
Another remarkable feature of the approximation of Barron functions is that the weights other than the output
weights are not bounded based on 𝑁. To see this, we refer to (8.1.5), where arbitrarily large 𝜃 need to be used. While
Γ𝐶 is a compact set, the set of neural networks of the specified architecture for a fixed 𝑁 ∈ N is not parameterized with
a compact parameter set. In a certain sense, this is reminiscent of Proposition 3.20 and Theorem 3.21, where arbitrarily
strong approximation rates where achieved by using a very complex activation function and a non-compact parameter
space.

8.2 Functions with compositionality structure

As a next instance of types of functions for which the curse of dimensionality can be overcome, we study functions
with compositional structure. In words, this means that we study high-dimensional functions that are constructed by
composing many low-dimensional functions. This point of view was proposed in [157]. Note that this can be a realistic
assumption in many cases, such as for sensor networks, where local information is first aggregated in smaller clusters
of sensors before some information is sent to a processing unit for further evaluation.
We introduce a model for compositional functions next. Consider a directed acyclic graph G with 𝑀 vertices
𝜂1 , . . . , 𝜂 𝑀 such that
• exactly 𝑑 vertices, 𝜂1 , . . . , 𝜂 𝑑 , have no ingoing edge,
• each vertex has at most 𝑚 ∈ N ingoing edges,
• exactly one vertex, 𝜂 𝑀 , has no outgoing edge.
With each vertex 𝜂 𝑗 for 𝑗 > 𝑑 we associate a function 𝑓 𝑗 : R𝑑 𝑗 → R. Here 𝑑 𝑗 denotes the cardinality of the set 𝑆 𝑗 ,
which is defined as the set of indices 𝑖 corresponding to vertices 𝜂𝑖 for which we have an edge from 𝜂𝑖 to 𝜂 𝑗 . Without
loss of generality, we assume that 𝑚 ≥ 𝑑 𝑗 = |𝑆 𝑗 | ≥ 1 for all 𝑗 > 𝑑. Finally, we let

𝐹 𝑗 := 𝑥 𝑗 for all 𝑗 ≤𝑑 (8.2.1a)

and1

𝐹 𝑗 := 𝑓 𝑗 ((𝐹𝑖 )𝑖 ∈𝑆 𝑗 ) for all 𝑗 > 𝑑. (8.2.1b)

Then 𝐹𝑀 (𝑥1 , . . . , 𝑥 𝑑 ) is a function from R𝑑 → R. Assuming

∥ 𝑓 𝑗 ∥ 𝐶 𝑘,𝑠 (R𝑑 𝑗 ) ≤ 1 for all 𝑗 = 𝑑 + 1, . . . , 𝑀, (8.2.2)

we denote the set of all functions of the type 𝐹𝑀 by F 𝑘,𝑠 (𝑚, 𝑑, 𝑀). Figure 8.1 shows possible graphs of such functions.
Clearly, for 𝑠 = 0, F 𝑘,0 (𝑚, 𝑑, 𝑀) ⊆ 𝐶 𝑘 (R𝑑 ) since the composition of functions in 𝐶 𝑘 belongs again to 𝐶 𝑘 . A direct
application of Theorem 7.7 allows to approximate 𝐹𝑀 ∈ F 𝑘 (𝑚, 𝑑, 𝑀) with a neural network of size 𝑂 (𝑁 log(𝑁))
𝑘
and error 𝑂 (𝑁 − 𝑑 ). Since each 𝑓 𝑗 depends only on 𝑚 variables, intuitively we expect an error convergence of type
𝑘
𝑂 (𝑁 − 𝑚 ) with the constant somehow depending on the number 𝑀 of vertices. To show that this is actually possible, in
the following we associate with each node 𝜂 𝑗 a depth 𝑙 𝑗 ≥ 0, such that 𝑙 𝑗 is the maximum number of edges connecting
𝜂 𝑗 to one of the nodes {𝜂1 , . . . , 𝜂 𝑑 }.

Proposition 8.7 Let 𝑘, 𝑚, 𝑑, 𝑀 ∈ N and 𝑠 > 0. Let 𝐹𝑀 ∈ F 𝑘,𝑠 (𝑚, 𝑑, 𝑀). Then there exists a constant 𝐶 =
𝐶 (𝑚, 𝑘 + 𝑠, 𝑀) such that for every 𝑁 ∈ N there exists a ReLU neural network 𝐹ˆ𝑀 such that

size( 𝐹ˆ𝑀 ) ≤ 𝐶𝑁 log(𝑁), depth( 𝐹ˆ𝑀 ) ≤ 𝐶 log(𝑁)

1 The ordering of the inputs (𝐹𝑖 ) 𝑖 ∈𝑆 𝑗 in (8.2.1b) is arbitrary but considered fixed throughout.

84
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Fig. 8.1: Three types of graphs that could be the basis of compositional functions. The associated functions are
composed of two or three-dimensional functions only.

and
𝑘+𝑠
sup |𝐹𝑀 (𝒙) − 𝐹ˆ𝑀 (𝒙)| ≤ 𝑁 − 𝑚 .
𝒙∈ [0,1] 𝑑

Proof Throughout this proof we assume without loss of generality that the indices follow a topological ordering, i.e.,
they are ordered such that 𝑆 𝑗 ⊆ {1, . . . , 𝑗 − 1} for all 𝑗 (i.e. the inputs of vertex 𝜂 𝑗 can only be vertices 𝜂𝑖 with 𝑖 < 𝑗).
Step 1. First assume that there exists functions 𝑓ˆ𝑗 such that

| 𝑓 𝑗 (𝒙) − 𝑓ˆ𝑗 (𝒙)| ≤ 𝛿 𝑗 := 𝜀 · (2𝑚) − ( 𝑀+1− 𝑗 ) for all 𝒙 ∈ [−2, 2] 𝑑 𝑗 . (8.2.3)

Let 𝐹ˆ 𝑗 be defined as in (8.2.1), but with all 𝑓 𝑗 in (8.2.1b) replaced by 𝑓ˆ𝑗 . We now check the error of the approximation
𝐹ˆ𝑀 to 𝐹𝑀 . To do so we proceed by induction over 𝑗 and show that for all 𝒙 ∈ [−1, 1] 𝑑

|𝐹 𝑗 (𝒙) − 𝐹ˆ 𝑗 (𝒙)| ≤ (2𝑚) − ( 𝑀 − 𝑗 ) 𝜀. (8.2.4)

Note that due to ∥ 𝑓 𝑗 ∥ 𝐶 𝑘 ≤ 1 we have |𝐹 𝑗 (𝒙)| ≤ 1 and thus (8.2.4) implies in particular that 𝐹ˆ 𝑗 (𝒙) ∈ [−2, 2].
For 𝑗 = 1 it holds 𝐹1 (𝑥1 ) = 𝐹ˆ1 (𝑥1 ) = 𝑥1 , and thus (8.2.4) is valid for all 𝑥1 ∈ [−1, 1]. For the induction step, for all
𝒙 ∈ [−1, 1] 𝑑 by (8.2.3) and the induction hypothesis

|𝐹 𝑗 (𝒙) − 𝐹ˆ 𝑗 (𝒙)| = | 𝑓 𝑗 ((𝐹𝑖 )𝑖 ∈𝑆 𝑗 ) − 𝑓ˆ𝑗 (( 𝐹ˆ𝑖 )𝑖 ∈𝑆 𝑗 )|


= | 𝑓 𝑗 ((𝐹𝑖 )𝑖 ∈𝑆 𝑗 ) − 𝑓 𝑗 (( 𝐹ˆ𝑖 )𝑖 ∈𝑆 𝑗 )| + | 𝑓 𝑗 (( 𝐹ˆ𝑖 )𝑖 ∈𝑆 𝑗 ) − 𝑓ˆ𝑗 (( 𝐹ˆ𝑖 )𝑖 ∈𝑆 𝑗 )|
∑︁
≤ |𝐹𝑖 − 𝐹ˆ𝑖 | + 𝛿 𝑗
𝑖 ∈𝑆 𝑗

≤ 𝑚 · (2𝑚) − ( 𝑀 − ( 𝑗 −1) ) 𝜀 + (2𝑚) − ( 𝑀+1− 𝑗 ) 𝜀


≤ (2𝑚) − ( 𝑀 − 𝑗 ) 𝜀.

Here we used that | 𝑑𝑑𝑥𝑟 𝑓 𝑗 ((𝑥𝑖 )𝑖 ∈𝑆 𝑗 )| ≤ 1 for all 𝑟 ∈ 𝑆 𝑗 so that

85
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
∑︁
| 𝑓 𝑗 ((𝑥𝑖 )𝑖 ∈𝑆 𝑗 ) − 𝑓 𝑗 ((𝑦 𝑖 )𝑖 ∈𝑆 𝑗 )| ≤ | 𝑓 ((𝑥𝑖 )𝑖 ∈𝑆 𝑗 , (𝑦 𝑖 )𝑖 ∈𝑆 𝑗 ) − 𝑓 ((𝑥𝑖 )𝑖 ∈𝑆 𝑗 , (𝑦 𝑖 )𝑖 ∈𝑆 𝑗 )|
𝑟 ∈𝑆 𝑗 𝑖 ≤𝑟 𝑖>𝑟 𝑖<𝑟 𝑖 ≥𝑟
∑︁
≤ |𝑥𝑟 − 𝑦 𝑟 |.
𝑟 ∈𝑆 𝑗

This shows that (8.2.4) holds, and thus for all 𝒙 ∈ [−1, 1] 𝑑

|𝐹𝑀 (𝒙) − 𝐹ˆ𝑀 (𝒙)| ≤ 𝜀.

Step 2. We sketch a construction, of how to write 𝐹ˆ𝑀 from Step 1 as a neural network of the claimed size and depth
bounds. Fix 𝑁 ∈ N and let
𝑚
𝑁 𝑗 := ⌈𝑁 (2𝑚) 𝑘+𝑠 ( 𝑀+1− 𝑗 ) ⌉.

By Theorem 7.7, since 𝑑 𝑗 ≤ 𝑚, we can find a neural network 𝑓ˆ𝑗 satisfying


𝑘+𝑠
− 𝑘+𝑠
sup | 𝑓 𝑗 (𝒙) − 𝑓ˆ𝑗 (𝒙)| ≤ 𝑁 𝑗 𝑚
≤ 𝑁− 𝑚 (2𝑚) − ( 𝑀+1− 𝑗 ) (8.2.5)
𝑑𝑗
𝒙∈ [ −2,2]

and
 
𝑚( 𝑀+1− 𝑗) 𝑚(𝑀 + 1 − 𝑗)
size( 𝑓ˆ𝑗 ) ≤ 𝐶𝑁 𝑗 log(𝑁 𝑗 ) ≤ 𝐶𝑁 (2𝑚) 𝑘+𝑠 log(𝑁) + log(2𝑚)
𝑘+𝑠

as well as
 
ˆ 𝑚(𝑀 + 1 − 𝑗)
depth( 𝑓 𝑗 ) ≤ 𝐶 · log(𝑁) + log(2𝑚) .
𝑘+𝑠

Then
𝑛
∑︁ 𝑀
∑︁ 𝑀 
∑︁ 𝑗
𝑚( 𝑀+1− 𝑗) 𝑚
size( 𝑓ˆ𝑗 ) ≤ 2𝐶𝑁 log(𝑁) (2𝑚) 𝑘+𝑠 ≤ 2𝐶𝑁 log(𝑁) (2𝑚) 𝑘+𝑠
𝑗=1 𝑗=1 𝑗=1
𝑚( 𝑀+1)
≤ 2𝐶𝑁 log(𝑁) (2𝑚) 𝑘+𝑠 .
Í𝑀 ∫ 𝑀+1 1
Here we used 𝑗=1 𝑎𝑗 ≤ 1
exp(log(𝑎)𝑥) d𝑥 ≤ log(𝑎) 𝑎
𝑀+1 .

− 𝑘+𝑠
The function 𝐹ˆ𝑀 from Step 1 then will yield error 𝑁 by (8.2.3) and (8.2.5). We observe that 𝐹ˆ𝑀 can be
𝑚

constructed as a neural network by propagating all values 𝐹1 , . . . , 𝐹ˆ 𝑗 to all consecutive layers using identity neural
ˆ
networks and then using the values ( 𝐹ˆ𝑖 )𝑖 ∈𝑆 𝑗+1 as input to 𝑓ˆ𝑗+1 . The depth of this neural network is bounded by

𝑀
∑︁
depth( 𝑓ˆ𝑗 ) = 𝑂 (𝑀 log(𝑁)).
𝑗=1

Í
We have at most 𝑀 𝑗=1 |𝑆 𝑗 | ≤ 𝑚𝑀 values which need to be propagated through these 𝑂 (𝑀 log(𝑁)) layers, amounting
to an overhead 𝑂 (𝑚𝑀 2 log(𝑁)) = 𝑂 (log(𝑁)) for the identity neural networks. In all the neural network size is thus
𝑂 (𝑁 log(𝑁)). □
𝑚( 𝑀+1)
Remark 8.8 From the proof we observe that the constant 𝐶 in Proposition 8.7 behaves like 𝑂 ((2𝑚) 𝑘+𝑠 ).

86
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
8.3 Functions on manifolds

Another instance in which the curse of dimension can be mitigated, is if the input to the network belongs to R𝑑 , but
stems from an 𝑚-dimensional manifold M ⊆ R𝑑 . If we only measure the approximation error on M, then we can again
show that it is 𝑚 rather than 𝑑 that determines the rate of convergence.

Fig. 8.2: One-dimensional sub-manifold of three-dimensional space. At the orange point, we depict a ball and the
tangent space of the manifold.

To explain the idea, we assume in the following that M is a smooth, compact 𝑚-dimensional manifold in R𝑑 .
Moreover, we suppose that there exists 𝛿 > 0 and finitely many points 𝒙 1 , . . . , 𝒙 𝑀 ∈ M such that the 𝛿-balls
𝐵 𝛿/2 (𝒙𝑖 ) := {𝒚 ∈ R𝑑 | ∥ 𝒚 − 𝒙∥ 2 < 𝛿/2} for 𝑗 = 1, . . . , 𝑀 cover M (for every 𝛿 > 0 such 𝒙𝑖 exist since M is compact).
Moreover, denoting by 𝑇𝒙 M ≃ R𝑚 the tangential space of M at 𝒙, we assume 𝛿 > 0 to be so small that the orthogonal
projection

𝜋 𝑗 : 𝐵 𝛿 (𝒙 𝑗 ) ∩ M → 𝑇𝒙 𝑗 M (8.3.1)

is injective, the set 𝜋 𝑗 (𝐵 𝛿 (𝒙 𝑗 ) ∩ M) ⊆ 𝑇𝒙 𝑗 M has 𝐶 ∞ boundary, and the inverse projection

𝜋 −1
𝑗 : 𝜋 𝑗 (𝐵 𝛿 (𝒙 𝑗 ) ∩ M) → M (8.3.2)

is 𝐶 ∞ (this is possible because M is a smooth manifold). A visualization of this assumption is shown in Figure 8.2
Note that 𝜋 𝑗 in (8.3.1) is a linear map, whereas 𝜋 −1
𝑗 in (8.3.2) is in general non-linear.
For a function 𝑓 : M → R and 𝒙 ∈ 𝐵 𝛿 (𝒙 𝑗 ) ∩ M we can then write

𝑓 (𝒙) = 𝑓 (𝜋 −1
𝑗 (𝜋 𝑗 (𝒙))) = 𝑓 𝑗 (𝜋 𝑗 (𝒙))

where

87
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
𝑓 𝑗 := 𝑓 ◦ 𝜋 −1
𝑗 : 𝜋 𝑗 (𝐵 𝛿 (𝒙 𝑗 ) ∩ M) → R.

In the following, for 𝑓 : M → R, 𝑘 ∈ N0 , and 𝑠 ∈ [0, 1) we let

∥ 𝑓 ∥ 𝐶 𝑘,𝑠 ( M ) := sup ∥ 𝑓 𝑗 ∥ 𝐶 𝑘,𝑠 ( 𝜋 𝑗 (𝐵 𝛿 ( 𝒙 𝑗 )∩M ) ) .


𝑗=1,...,𝑀

We now state the main result of this section.

Proposition 8.9 Let 𝑑, 𝑘 ∈ N, 𝑠 ≥ 0, and let M be a smooth, compact 𝑚-dimensional manifold in R𝑑 . Then there exists
𝑓
a constant 𝐶 > 0 such that for all 𝑓 ∈ 𝐶 𝑘,𝑠 (M) and every 𝑁 ∈ N there exists a ReLU neural network Φ 𝑁 such that
𝑓 𝑓
size(Φ 𝑁 ) ≤ 𝐶𝑁 log(𝑁), depth(Φ 𝑁 ) ≤ 𝐶 log(𝑁) and
𝑘+𝑠
sup | 𝑓 (𝒙) − Φ 𝑁 (𝒙)| ≤ 𝐶 ∥ 𝑓 ∥ 𝐶 𝑘,𝑠 ( M ) 𝑁 −
𝑓 𝑚 .
𝒙∈ M

Proof Since M is compact there exists 𝐴 > 0 such that M ⊆ [−𝐴, 𝐴] 𝑑 . Similar as in the proof of Theorem 7.7,
we consider a uniform mesh with nodes {−𝐴 + 2𝐴 𝝂𝑛 | 𝝂 ≤ 𝑛}, and the corresponding piecewise linear basis functions
forming the partition of unity 𝝂 ≤ 𝜑𝝂 ≡ 1 on [−𝐴, 𝐴] 𝑑 where supp 𝜑𝝂 ≤ {𝒚 ∈ R𝑑 | ∥ 𝝂𝑛 − 𝒚∥ ∞ ≤ 𝑛𝐴 }. Let 𝛿 > 0 be
Í
such as in the beginning of this section. Since M is covered by the balls (𝐵 𝛿/2 (𝒙 𝑗 )) 𝑀
𝑗=1 , fixing 𝑛 ∈ N large enough,
for each 𝝂 such that supp 𝜑𝝂 ∩ M ≠ ∅ there exists 𝑗 (𝝂) ∈ {1, . . . , 𝑀 } such that supp 𝜑𝝂 ⊆ 𝐵 𝛿 (𝒙 𝑗 (𝝂 ) ) and we set
𝐼 𝑗 := {𝝂 ≤ 𝑀 | 𝑗 = 𝑗 (𝝂)}. Then we have for all 𝒙 ∈ M

∑︁ 𝑀 ∑︁
∑︁
𝑓 (𝒙) = 𝜑𝝂 (𝒙) 𝑓 𝑗 (𝜋 𝑗 (𝒙)) = 𝜑𝝂 (𝒙) 𝑓 𝑗 (𝜋 𝑗 (𝒙)). (8.3.3)
𝝂 ≤𝑛 𝑗=1 𝝂 ∈𝐼 𝑗

Next, we approximate the functions 𝑓 𝑗 . Let 𝐶 𝑗 be the smallest (𝑚-dimensional) cube in 𝑇𝒙 𝑗 M ≃ R𝑚 such that
𝜋 𝑗 (𝐵 𝛿 (𝒙 𝑗 ) ∩ M) ⊆ 𝐶 𝑗 . The function 𝑓ˆ𝑗 can be extended to a function on 𝐶 𝑗 (we will use the same notation for this
extension) such that

∥ 𝑓 ∥ 𝐶 𝑘,𝑠 (𝐶 𝑗 ) ≤ 𝐶 ∥ 𝑓 ∥ 𝐶 𝑘,𝑠 ( 𝜋 𝑗 (𝐵 𝛿 ( 𝒙 𝑗 )∩M ) ) ,

for some constant depending on 𝜋 𝑗 (𝐵 𝛿 (𝒙 𝑗 ) ∩ M) but independent of 𝑓 . Such an extension result can, for example, be
found in [192, Chapter VI]. By Theorem 7.7 (also see Remark 7.9), there exists a neural network 𝑓ˆ𝑗 : 𝐶 𝑗 → R such
that
𝑘+𝑠
sup | 𝑓 𝑗 (𝒙) − 𝑓ˆ𝑗 (𝒙)| ≤ 𝐶𝑁 − 𝑚 (8.3.4)
𝒙∈𝐶 𝑗

and

size( 𝑓ˆ𝑗 ) ≤ 𝐶𝑁 log(𝑁), depth( 𝑓ˆ𝑗 ) ≤ 𝐶 log(𝑁).


𝑘+𝑠
To approximate 𝑓 in (8.3.3) we now let with 𝜀 := 𝑁 − 𝑑

𝑀 ∑︁
∑︁
Φ 𝑁 := Φ×𝜀 (𝜑𝝂 , 𝑓ˆ𝑖 ◦ 𝜋 𝑗 ),
𝑗=1 𝝂 ∈𝐼 𝑗

where we note that 𝜋 𝑗 is linear and thus 𝑓ˆ𝑗 ◦ 𝜋 𝑗 can be expressed by a neural network. First let us estimate the error of
this approximation. For 𝒙 ∈ M

88
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
𝑀 ∑︁
∑︁
| 𝑓 (𝒙) − Φ 𝑁 (𝒙)| ≤ |𝜑𝝂 (𝒙) 𝑓 𝑗 (𝜋 𝑗 (𝒙)) − Φ×𝜀 (𝜑𝝂 (𝒙), 𝑓ˆ𝑗 (𝜋 𝑗 (𝒙)))|
𝑗=1 𝝂 ∈𝐼 𝑗
𝑀 ∑︁
∑︁
≤ |𝜑𝝂 (𝒙) 𝑓 𝑗 (𝜋 𝑗 (𝒙)) − 𝜑𝝂 (𝒙) 𝑓 𝑗 (𝜋 𝑗 (𝒙))|
𝑗=1 𝝂 ∈𝐼 𝑗

+|𝜑𝝂 (𝒙) 𝑓 𝑗 (𝜋 𝑗 (𝒙)) − Φ×𝜀 (𝜑𝝂 (𝒙), 𝑓ˆ𝑗 (𝜋 𝑗 (𝒙)))|
𝑀 ∑︁
∑︁ 𝑀
∑︁ ∑︁
≤ sup ∥ 𝑓𝑖 − 𝑓ˆ𝑖 ∥ 𝐿 ∞ (𝐶𝑖 ) |𝜑𝝂 (𝒙)| + 𝜀
𝑖≤𝑀 𝑗=1 𝝂 ∈𝐼 𝑗 𝑗=1 {𝝂 ∈𝐼 𝑗 | 𝒙∈supp 𝜑𝝂 }
𝑘+𝑠 𝑘+𝑠
≤ 𝐶𝑁 − 𝑚 + 𝑑𝜀 ≤ 𝐶𝑁 − 𝑚 ,

where we used that 𝒙 can be in the support of at most 𝑑 of the 𝜑𝝂 , and where 𝐶 is a constant depending on 𝑑 and M.
Finally, let us bound the size and depth of this approximation. Using size(𝜑𝝂 ) ≤ 𝐶, depth(𝜑𝝂 ) ≤ 𝐶 (see (5.3.9)) and
size(Φ×𝜀 ) ≤ 𝐶 log(𝜀) ≤ 𝐶 log(𝑁) and depth(Φ×𝜀 ) ≤ 𝐶depth(𝜀) ≤ 𝐶 log(𝑁) (see Lemma 7.3) we find
𝑀 ∑︁ 
∑︁ 𝑀 ∑︁
 ∑︁
size(Φ×𝜀 ) + size(𝜑𝝂 ) + size( 𝑓ˆ𝑖 ◦ 𝜋 𝑗 ) ≤ 𝐶 log(𝑁) + 𝐶 + 𝐶𝑁 log(𝑁)
𝑗=1 𝝂 ∈ 𝐼 𝑗 𝑗=1 𝝂 ∈𝐼 𝑗

= 𝑂 (𝑁 log(𝑁)),

which implies the bound on size(Φ 𝑁 ). Moreover,

depth(Φ 𝑁 ) ≤ depth(Φ×𝜀 ) + max depth(𝜑𝝂 , 𝑓ˆ𝑗 )




≤ 𝐶 log(𝑁) + log(𝑁) = 𝑂 (log(𝑁)).

This completes the proof. □

Bibliography and further reading

The ideas of Section8.1 were developed in [10], with an extension to 𝐿 ∞ approximation derived in [9]. These arguments
can be extended to yield dimension-independent approximation rates for high-dimensional discontinuous functions,
provided the discontinuity follows a Barron function, as shown in [154]. The Barron class has been generalized in
various ways, as discussed in [123, 122, 213, 214, 11].
The compositionality assumption of Section 8.2 was discussed in the form presented in [157]. An alternative
approach, the so-called hierarchical composition/interaction model, was studied in [104].
The manifold assumption of Section 8.3 is frequently found in the literature, with notable examples including
[187, 38, 32, 180, 136, 103].
Another prominent direction, which was omitted in this chapter, pertains to scientific machine learning. In particular,
in scientific machine learning, high-dimensional functions often arise from (parametric) PDEs, which have a rich
literature describing their properties and structure. Various results have shown that neural networks can leverage the
inherent low-dimensionality known to exist in such problems. For instance, very efficient approximation of certain
analytic functions has been demonstrated in [185, 186], and the approximation of solutions of (linear and semilinear)
parabolic evolution equations has been explored in [71, 63, 90]. Additionally, stationary elliptic PDEs have been
addressed in [70].
Moreover, very high dimensionality is often encountered in parametric problems. The associated approximation
problems have been studied in [132, 146, 107, 109].

89
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Exercises

Exercise 8.10 Let 𝐶 > 0 and 𝑑 ∈ N. Show that, if 𝑔 ∈ Γ𝐶 , then

𝑎 −𝑑 𝑔 (𝑎(· − 𝒃)) ∈ Γ𝐶 ,

for every 𝑎 ∈ R+ , 𝒃 ∈ R𝑑 .

Exercise 8.11 Let 𝐶 > 0 and 𝑑 ∈ N. Show that, for 𝑔𝑖 ∈ Γ𝐶 , 𝑖 = 1, . . . , 𝑚 and 𝑐 = (𝑐 𝑖 )𝑖=1
𝑚 it holds that

𝑚
∑︁
𝑐 𝑖 𝑔𝑖 ∈ Γ∥𝑐 ∥ 1 𝐶 .
𝑖=1

Exercise 8.12 For every 𝑑 ∈ N the function 𝑓 (𝒙) := exp(−∥𝒙∥ 22 /2), 𝒙 ∈ R𝑑 , belongs to Γ𝑑 . It holds 𝐶 𝑓 = 𝑂 ( 𝑑), for
𝑑 → ∞.

Exercise 8.13 Let 𝑑 ∈ N, and let 𝑓 (𝒙) = ∞


Í
𝑖=1 𝑐 𝑖 𝜎ReLU (⟨𝒂 𝑖 , 𝒙⟩ + 𝑏 𝑖 ) for 𝒙 ∈ R with ∥ 𝒂 𝑖 ∥ = 1, |𝑏 𝑖 | ≤ 1 for all 𝑖 ∈ N.
𝑑

Show that for every 𝑁 ∈ N, there exists a ReLU neural network with 𝑁 neurons and one layer such that

3∥𝑐∥ 1
∥ 𝑓 − 𝑓 𝑁 ∥ 𝐿 2 (𝐵𝑑 ) ≤ √ .
1
𝑁

Hence, every infinite ReLU neural network can be approximated at a rate 𝑂 (𝑁 1/2 ) by finite ReLU neural networks of
width 𝑁.

Exercise 8.14 Let 𝐶 > 0 prove that every 𝑓 ∈ Γ𝐶 is continuously differentiable.

90
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Chapter 9
Interpolation

The learning problem associated to minimizing the empirical risk of (1.2.3) is based on minimizing an error that results
from evaluating a neural network on a finite set of (training) points. In contrast, all previous approximation results
focused on achieving uniformly small errors across the entire domain. Finding neural networks that achieve a small
training error appears to be much simpler, since, instead of ∥ 𝑓 − Φ𝑛 ∥ ∞ → 0 for a sequence of neural networks Φ𝑛 , it
suffices to have Φ𝑛 (𝒙𝑖 ) → 𝑓 (𝒙𝑖 ) for all 𝒙𝑖 in the training set.
In this chapter, we study the extreme case of the aforementioned approximation problem. We analyze under which
conditions it is possible to find a neural network that coincides with the target function 𝑓 at all training points. This is
referred to as interpolation. To make this notion more precise, we state the following definition.

Definition 9.1 (Interpolation) Let 𝑑, 𝑚 ∈ N, and let Ω ⊆ R𝑑 . We say that a set of functions H ⊆ {ℎ : Ω → R}
𝑚 ⊆ Ω × R, such that 𝒙 ≠ 𝒙 for 𝑖 ≠ 𝑗, there exists a function
interpolates 𝑚 points in Ω, if for every 𝑆 = (𝒙𝑖 , 𝑦 𝑖 )𝑖=1 𝑖 𝑗
ℎ ∈ H such that ℎ(𝒙𝑖 ) = 𝑦 𝑖 for all 𝑖 = 1, . . . , 𝑚.

Knowing the interpolation properties of an architecture represents extremely valuable information for two reasons:
• Consider an architecture that interpolates 𝑚 points and let the number of training samples be bounded by 𝑚. Then
(1.2.3) always has a solution.
• Consider again an architecture that interpolates 𝑚 points and assume that the number of training samples is less
than 𝑚. Then for every point 𝒙˜ not in the training set and every 𝑦 ∈ R there exists a minimizer ℎ of (1.2.3) that
satisfies ℎ( 𝒙)
˜ = 𝑦. As a consequence, without further restrictions (many of which we will discuss below), such an
architecture cannot generalize to unseen data.
The existence of solutions to the interpolation problem does not follow trivially from the approximation results provided
in the previous chapters (even though we will later see that there is a close connection). We also remark that the question
of how many points neural networks with a given architecture can interpolate is closely related to the so-called VC
dimension, which we will study in Chapter 14.
We start our analysis of the interpolation properties of neural networks by presenting a result similar to the universal
approximation theorem but for interpolation in the following section. In the subsequent section, we then look at
interpolation with desirable properties.

9.1 Universal interpolation

Under what conditions on the activation function and architecture can a set of neural networks interpolate 𝑚 ∈ N
points? According to Chapter 3, particularly Theorem 3.8, we know that shallow neural networks can approximate
every continuous function with arbitrary accuracy, provided the neural network width is large enough. As the neural
network’s width and/or depth increases, the architectures become increasingly powerful, leading us to expect that at
some point, they should be able to interpolate 𝑚 points. However, this intuition may not be correct:

91
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Example 9.2 Let H := { 𝑓 ∈ 𝐶 0 ( [0, 1]) | 𝑓 (0) ∈ Q}. Then H is dense in 𝐶 0 ( [0, 1]), but H does not even interpolate
one point in [0, 1].
Moreover, Theorem 3.8 is an asymptotic result that only states that a given function can be approximated for sufficiently
large neural network architectures, but it does not state how large the architecture needs to be.
Surprisingly, Theorem 3.8 can nonetheless be used to give a guarantee that a fixed-size architecture yields sets of
neural networks that allow the interpolation of 𝑚 points. This result is due to [155]. A more refined discussion of
previous results is given in the bibliography section of this chapter. Due to its similarity to the universal approximation
theorem and the fact that it uses the same assumptions, we call the following theorem the “Universal Interpolation
Theorem”. For its statement recall the definition of the set of allowed activation functions M in (3.1.1) and the class
N𝑑1 (𝜎, 1, 𝑛) of shallow neural networks of width 𝑛 introduced in Definition 3.6.
Theorem 9.3 (Universal Interpolation Theorem) Let 𝑑, 𝑛 ∈ N and let 𝜎 ∈ M not be a polynomial. Then N𝑑1 (𝜎, 1, 𝑛)
interpolates 𝑛 + 1 points in R𝑑 .
Proof Fix (𝒙𝑖 )𝑖=1 𝑛+1 ⊆ R𝑑 arbitrary. We will show that for any (𝑦 ) 𝑛+1 ⊆ R there exist weights and biases (𝒘 ) 𝑛 ⊆ R𝑑 ,
𝑖 𝑖=1 𝑗 𝑗=1
(𝑏 𝑗 ) 𝑛𝑗=1 , (𝑣 𝑗 ) 𝑛𝑗=1 ⊆ R, 𝑐 ∈ R such that

𝑛
∑︁
Φ(𝒙 𝑖 ) := 𝑣 𝑗 𝜎(𝒘 ⊤𝑗 𝒙𝑖 + 𝑏 𝑗 ) + 𝑐 = 𝑦 𝑖 for all 𝑖 = 1, . . . , 𝑛 + 1. (9.1.1)
𝑗=1

Since Φ ∈ N𝑑1 (𝜎, 1, 𝑛) this then concludes the proof.


Denote
1 𝜎(𝒘1⊤ 𝒙 1 + 𝑏 1 ) · · · 𝜎(𝒘𝑚 ⊤𝒙 + 𝑏 )
1 𝑛
® ∈ R (𝑛+1) × (𝑛+1) .
©. . . . ª
𝑨 := ­ .
­ . .
. . . .
. ® (9.1.2)
1 𝜎(𝒘 ⊤𝒙 + 𝑏 ) · · · 𝜎(𝒘 ⊤𝒙 + 𝑏 )
« 1 𝑛+1 1 𝑛 𝑛+1 𝑛 ¬

Then 𝑨 being regular implies that for each (𝑦 𝑖 )𝑖=1𝑛+1 exist 𝑐 and (𝑣 ) 𝑛 such that (9.1.1) holds. Hence, it suffices to find
𝑗 𝑗=1
(𝒘 𝑗 ) 𝑛𝑗=1 and (𝑏 𝑗 ) 𝑛𝑗=1 such that 𝑨 is regular.
To do so, we proceed by induction over 𝑘 = 0, . . . , 𝑛, to show that there exist (𝒘 𝑗 ) 𝑘𝑗=1 and (𝑏 𝑗 ) 𝑘𝑗=1 such that the
first 𝑘 + 1 columns of 𝑨 are linearly independent. The case 𝑘 = 0 is trivial. Next let 0 < 𝑘 < 𝑛 and assume that the
first 𝑘 columns of 𝑨 are linearly independent. We wish to find 𝒘 𝑘 , 𝑏 𝑘 such that the first 𝑘 + 1 columns are linearly
independent. Suppose such 𝒘 𝑘 , 𝑏 𝑘 do not exist and denote by 𝑌𝑘 ⊆ R𝑛+1 the space spanned by the first 𝑘 columns of 𝑨.
Then for all 𝒘 ∈ R𝑛 , 𝑏 ∈ R the vector (𝜎(𝒘 ⊤ 𝒙 𝑖 + 𝑏))𝑖=1𝑛+1 ∈ R𝑛+1 must belong to 𝑌 . Fix 𝒚 = (𝑦 ) 𝑛+1 ∈ R𝑛+1 \𝑌 . Then
𝑘 𝑖 𝑖=1 𝑘

𝑛+1  ∑︁
∑︁ 𝑁 2
inf 𝑛+1
∥ ( Φ̃(𝒙𝑖 ))𝑖=1 − 𝒚∥ 22 = inf 𝑣 𝑗 𝜎(𝒘 ⊤𝑗 𝒙𝑖 + 𝑏 𝑗 ) + 𝑐 − 𝑦 𝑖
Φ̃∈ N𝑑1 ( 𝜎,1) 𝑁 ,𝒘 𝑗 ,𝑏 𝑗 ,𝑣 𝑗 ,𝑐
𝑖=1 𝑗=1

≥ inf ∥ 𝒚˜ − 𝒚∥ 22 > 0.
𝒚˜ ∈𝑌𝑘

Since we can find a continuous function 𝑓 : R𝑑 → R such that 𝑓 (𝒙𝑖 ) = 𝑦 𝑖 for all 𝑖 = 1, . . . , 𝑛 + 1, this contradicts
Theorem 3.8. □

9.2 Optimal interpolation and reconstruction

Consider a bounded domain Ω ⊆ R𝑑 , a function 𝑓 : Ω → R, distinct points 𝒙1 , . . . , 𝒙 𝑚 ⊆ Ω, and corresponding


function values 𝑦 𝑖 := 𝑓 (𝒙𝑖 ). Our objective is to approximate 𝑓 based solely on the data pairs (𝒙𝑖 , 𝑦 𝑖 ), 𝑖 = 1, . . . , 𝑚.
In this section, we will show that, under certain assumptions on 𝑓 , ReLU neural networks can express an “optimal”
reconstruction which also turns out to be an interpolant of the data.

92
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
9.2.1 Motivation

In the previous section, we observed that neural networks with 𝑚 − 1 ∈ N hidden neurons can interpolate 𝑚 points
for every reasonable activation function. However, not all interpolants are equally suitable for a given application.
For instance, consider Figure 9.1 for a comparison between polynomial and piecewise affine interpolation on the unit
interval.

Fig. 9.1: Interpolation of eight points by a polynomial of degree seven and by a piecewise affine spline. The polynomial
interpolation has a significantly larger derivative or Lipschitz constant than the piecewise affine interpolator.

The two interpolants exhibit rather different behaviors. In general, there is no way of determining which constitutes
a better approximation to 𝑓 . In particular, given our limited information about 𝑓 , we cannot accurately reconstruct any
additional features that may exist between interpolation points 𝒙1 , . . . , 𝒙 𝑚 . In accordance with Occam’s razor, it thus
seems reasonable to assume that 𝑓 does not exhibit extreme oscillations or behave erratically between interpolation
points. As such, the piecewise interpolant appears preferable in this scenario. One way to formalize the assumption
that 𝑓 does not “exhibit extreme oscillations” is to assume that the Lipschitz constant

| 𝑓 (𝒙) − 𝑓 ( 𝒚)|
Lip( 𝑓 ) := sup
𝒙≠𝒚 ∥𝒙 − 𝒚∥

of 𝑓 is bounded by a fixed value 𝑀 ∈ R. Here ∥ · ∥ denotes an arbitrary fixed norm on R𝑑 .


How should we choose 𝑀? For every function 𝑓 : Ω → R satisfying

𝑓 (𝒙𝑖 ) = 𝑦 𝑖 for all 𝑖 = 1, . . . , 𝑚, (9.2.1)

we have
| 𝑓 (𝒙) − 𝑓 ( 𝒚)| |𝑦 𝑖 − 𝑦 𝑗 |
Lip( 𝑓 ) = sup ≥ sup ˜
C 𝑀. (9.2.2)
𝒙≠𝒚 ∈Ω ∥𝒙 − 𝒚∥ 𝑖≠ 𝑗 ∥𝒙 𝑖 − 𝒙 𝑗 ∥

Because of this, we fix 𝑀 as a real number greater than or equal to 𝑀˜ for the remainder of our analysis.

93
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
9.2.2 Optimal reconstruction for Lipschitz continuous functions

The above considerations raise the following question: Given only the information that the function has Lipschitz
constant at most 𝑀, what is the best reconstruction of 𝑓 based on the data? We consider here the “best reconstruction”
to be a function that minimizes the 𝐿 ∞ -error in the worst case. Specifically, with

Lip 𝑀 (Ω) := { 𝑓 : Ω → R | Lip( 𝑓 ) ≤ 𝑀 }, (9.2.3)

denoting the set of all functions with Lipschitz constant at most 𝑀, we want to solve the following problem:

Problem 9.4 We wish to find an element

Φ ∈ arg min sup sup | 𝑓 (𝒙) − ℎ(𝒙)|. (9.2.4)


ℎ:Ω→R 𝑓 ∈Lip 𝑀 (Ω) 𝒙∈Ω
𝑓 satisfies (9.2.1)

The next theorem shows that a function Φ as in (9.2.4) indeed exists. This Φ not only allows for an explicit formula,
it also belongs to Lip 𝑀 (Ω) and additionally interpolates the data. Hence, it is not just an optimal reconstruction, it is
also an optimal interpolant. This theorem goes back to [13], which, in turn, is based on [195].

Theorem 9.5 Let 𝑚, 𝑑 ∈ N, Ω ⊆ R𝑑 , 𝑓 : Ω → R, and let 𝒙1 , . . . , 𝒙 𝑚 ∈ Ω, 𝑦 1 , . . . , 𝑦 𝑚 ∈ R satisfy (9.2.1) and (9.2.2)


with 𝑀˜ > 0. Further, let 𝑀 ≥ 𝑀.
˜
Then, Problem 9.4 has at least one solution given by
1
Φ(𝒙) := ( 𝑓upper (𝒙) + 𝑓lower (𝒙)) for 𝒙 ∈ Ω, (9.2.5)
2
where

𝑓upper (𝒙) := min (𝑦 𝑘 + 𝑀 ∥𝒙 − 𝒙 𝑘 ∥ )


𝑘=1,...,𝑚
𝑓lower (𝒙) := max (𝑦 𝑘 − 𝑀 ∥𝒙 − 𝒙 𝑘 ∥ ).
𝑘=1,...,𝑚

Moreover, Φ ∈ Lip 𝑀 (Ω) and Φ interpolates the data (i.e. satisfies (9.2.1)).

Proof First we claim that for all ℎ1 , ℎ2 ∈ Lip 𝑀 (Ω) holds max{ℎ1 , ℎ2 } ∈ Lip 𝑀 (Ω) as well as min{ℎ1 , ℎ2 } ∈ Lip 𝑀 (Ω).
Since min{ℎ1 , ℎ2 } = − max{−ℎ1 , −ℎ2 }, it suffices to show the claim for the maximum. We need to check that

| max{ℎ1 (𝒙), ℎ2 (𝒙)} − max{ℎ1 ( 𝒚), ℎ2 ( 𝒚)}|


≤𝑀 (9.2.6)
∥𝒙 − 𝒚∥
for all 𝒙 ≠ 𝒚 ∈ Ω. Fix 𝒙 ≠ 𝒚. Without loss of generality we assume that

max{ℎ1 (𝒙), ℎ2 (𝒙)} ≥ max{ℎ1 ( 𝒚), ℎ2 ( 𝒚)} and max{ℎ1 (𝒙), ℎ2 (𝒙)} = ℎ1 (𝒙).

If max{ℎ1 ( 𝒚), ℎ2 (𝒚)} = ℎ1 ( 𝒚) then the numerator in (9.2.6) equals ℎ1 (𝒙) − ℎ1 ( 𝒚) which is bounded by 𝑀 ∥𝒙 − 𝒚∥. If
max{ℎ1 ( 𝒚), ℎ2 ( 𝒚)} = ℎ2 ( 𝒚), then the numerator equals ℎ1 (𝒙) − ℎ2 ( 𝒚) which is bounded by ℎ1 (𝒙) − ℎ1 ( 𝒚) ≤ 𝑀 ∥𝒙 − 𝒚∥.
In either case (9.2.6) holds.
Clearly, 𝒙 ↦→ 𝑦 𝑘 − 𝑀 ∥𝒙 − 𝒙 𝑘 ∥ ∈ Lip 𝑀 (Ω) for each 𝑘 = 1, . . . , 𝑚 and thus 𝑓upper , 𝑓lower ∈ Lip 𝑀 (Ω) as well as
Φ ∈ Lip 𝑀 (Ω).
Next we claim that for all 𝑓 ∈ Lip 𝑀 (Ω) satisfying (9.2.1) holds

𝑓lower (𝒙) ≤ 𝑓 (𝒙) ≤ 𝑓upper (𝒙) for all 𝒙 ∈ Ω. (9.2.7)

This is true since for every 𝑘 ∈ {1, . . . , 𝑚} and 𝒙 ∈ Ω

94
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
|𝑦 𝑘 − 𝑓 (𝒙)| = | 𝑓 (𝒙 𝑘 ) − 𝑓 (𝒙)| ≤ 𝑀 ∥𝒙 − 𝒙 𝑘 ∥

so that for all 𝒙 ∈ Ω

𝑓 (𝒙) ≤ min (𝑦 𝑘 + 𝑀 ∥𝒙 − 𝒙 𝑘 ∥ ), 𝑓 (𝒙) ≥ max (𝑦 𝑘 − 𝑀 ∥𝒙 − 𝒙 𝑘 ∥ ).


𝑘=1,...,𝑚 𝑘=1,...,𝑚

Since 𝑓upper , 𝑓lower ∈ Lip 𝑀 (Ω) satisfy (9.2.1), we conclude that for every ℎ : Ω → R holds

sup sup | 𝑓 (𝒙) − ℎ(𝒙)| ≥ sup max{| 𝑓lower (𝒙) − ℎ(𝒙)|, | 𝑓upper (𝒙) − ℎ(𝒙)|}
𝑓 ∈Lip 𝑀 (Ω) 𝒙∈Ω 𝒙∈Ω
𝑓 satisfies (9.2.1)
| 𝑓lower (𝒙) − 𝑓upper (𝒙)|
≥ sup . (9.2.8)
𝒙∈Ω 2

Moreover, using (9.2.7),

sup sup | 𝑓 (𝒙) − Φ(𝒙)| ≤ sup max{| 𝑓lower (𝒙) − Φ(𝒙)|, | 𝑓upper (𝒙) − Φ(𝒙)|}
𝑓 ∈Lip 𝑀 (Ω) 𝒙∈Ω 𝒙∈Ω
𝑓 satisfies (9.2.1)
| 𝑓lower (𝒙) − 𝑓upper (𝒙)|
= sup . (9.2.9)
𝒙∈Ω 2

Finally, (9.2.8) and (9.2.9) imply that Φ is a solution of Problem 9.4. □


Figure 9.2 depicts 𝑓upper , 𝑓lower , and Φ for the interpolation problem shown in Figure 9.1, while Figure 9.3 provides
a two-dimensional example.

Fig. 9.2: Interpolation of the points from Figure 9.1 with the optimal Lipschitz interpolant.

95
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
9.2.3 Optimal ReLU reconstructions

So far everything
Í𝑑 was valid with an arbitrary norm on R𝑑 . For the next theorem, we will restrict ourselves to the 1-norm
∥𝒙∥ 1 = 𝑗=1 |𝑥 𝑗 |. Using the explicit formula of Theorem 9.5, we will now show the remarkable result that ReLU
neural networks can exactly express an optimal reconstruction (in the sense of Problem 9.4) with a neural network
whose size scales linearly in the product of the dimension 𝑑 and the number of data points 𝑚. Additionally, the proof is
constructive, thus allowing in principle for an explicit construction on the neural network without the need for training.

Fig. 9.3: Two-dimensional example of the interpolation method of (9.2.5). From top left to bottom we see 𝑓upper , 𝑓lower ,
6 are marked with red crosses.
and Φ. The interpolation points (𝒙𝑖 , 𝑦 𝑖 )𝑖=1

Theorem 9.6 (Optimal Lipschitz Reconstruction) Let 𝑚, 𝑑 ∈ N, Ω ⊆ R𝑑 , 𝑓 : Ω → R, and let 𝒙1 , . . . , 𝒙 𝑚 ∈ Ω,


𝑦 1 , . . . , 𝑦 𝑚 ∈ R satisfy (9.2.1) and (9.2.2) with 𝑀˜ > 0. Further, let 𝑀 ≥ 𝑀˜ and let ∥ · ∥ = ∥ · ∥ 1 in (9.2.2) and (9.2.3).
Then, there exists a ReLU neural network Φ ∈ Lip 𝑀 (Ω) that interpolates the data (i.e. satisfies (9.2.1)) and satisfies

96
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Φ ∈ arg min sup sup | 𝑓 (𝒙) − Φ(𝒙)|.
ℎ:Ω→R 𝑓 ∈Lip 𝑀 (Ω) 𝒙∈Ω
𝑓 satisfies (9.2.1)

Moreover, depth(Φ) = 𝑂 (log(𝑚)), width(Φ) = 𝑂 (𝑑𝑚) and all weights of Φ are bounded in absolute value by
max{𝑀, max{|𝑦 𝑖 | | 𝑖 = 1, . . . , 𝑚}}.

Proof To prove the result, we simply need to show that the function in (9.2.5) can be expressed as a ReLU neural
network with the size bounds described in the theorem. First we notice, that there is a simple ReLU neural network that
implements the 1-norm. It holds for all 𝒙 ∈ R𝑑 that
𝑑
∑︁
∥𝒙∥ 1 = (𝜎(𝑥𝑖 ) + 𝜎(−𝑥𝑖 )) .
𝑖=1

Thus, there exists a ReLU neural network Φ ∥ · ∥ 1 such that for all 𝒙 ∈ R𝑑

width(Φ ∥ · ∥ 1 ) = 2𝑑, depth(Φ ∥ · ∥ 1 ) = 1, Φ ∥ · ∥ 1 (𝒙) = ∥𝒙∥ 1

As a result, there exist ReLU neural networks Φ 𝑘 : R𝑑 → R, 𝑘 = 1, . . . , 𝑚, such that

width(Φ 𝑘 ) = 2𝑑, depth(Φ 𝑘 ) = 1, Φ 𝑘 (𝒙) = 𝑦 𝑘 + 𝑀 ∥𝒙 − 𝒙 𝑘 ∥ 1

for all 𝒙 ∈ R𝑑 . Using the parallelization of neural networks introduced in Section 5.1.3, there exists a ReLU neural
network Φall := (Φ1 , . . . , Φ𝑚 ) : R𝑑 → R𝑚 such that

width(Φall ) = 4𝑚𝑑, depth(Φall ) = 2, and


𝑚
Φall (𝒙) = (𝑦 𝑘 + 𝑀 ∥𝒙 − 𝒙 𝑘 ∥ 1 ) 𝑘=1 for all 𝒙 ∈ R𝑑 .

Using Lemma 5.11, we can now find a ReLU neural network Φupper such that Φupper = 𝑓upper (𝒙) for all 𝒙 ∈ Ω,
width(Φupper ) ≤ max{16𝑚, 4𝑚𝑑}, and depth(Φupper ) ≤ 1 + log(𝑚).
Essentially the same construction yields a ReLU neural network Φlower with the respective properties. Lemma 5.4
then completes the proof. □

Bibliography and further reading

The universal interpolation theorem stated in this chapter is due to [155, Theorem 5.1]. Before this result there were
some interpolation results with stronger conditions. In [177], the interpolation property is already linked with a rank
condition on the matrix (9.1.2). However, no general conditions on the activation functions that guarantee this were
formulated. In [93], the interpolation theorem is established under the assumption that the activation function 𝜎 is
continuous and nondecreasing, lim 𝑥→−∞ 𝜎(𝑥) = 0, and lim 𝑥→∞ 𝜎(𝑥) = 1. This result was improved in [87] to only
require 𝜎 to be continuous, nonlinear, and taking limits at ±∞.
Concerning the optimal Lipschitz interpolation theorem, we already mentioned that the main idea is due to [13]. A
neural network construction of Lipschitz interpolants, which, however, is not the optimal interpolant in the sense of
Problem 9.4, is given in [95, Theorem 2.27].

97
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Chapter 10
Training of neural networks

Up to this point, we have discussed the representation and approximation of certain function classes using neural
networks. The second pillar of deep learning concerns the question of how to fit a neural network to given data, i.e.,
having fixed an architecture, how to find suitable weights and biases. This task amounts to minimizing a so-called
objective function such as the empirical risk in (1.2.3). Throughout this chapter we denote the objective function by

f : R𝑛 → R,

and interpret it as a function of all neural network weights and biases collected in a vector in R𝑛 . The goal is to
(approximately) determine a minimizer, i.e., some 𝒘∗ ∈ R𝑛 satisfying

f (𝒘∗ ) ≤ f (𝒘) for all 𝒘 ∈ R𝑛 .

Standard approaches include, in particular, variants of (stochastic) gradient descent. These are the topic of this chapter,
in which we present basic ideas and results in convex optimization using gradient-based methods.

10.1 Gradient descent

The general idea of gradient descent is to start with some 𝒘0 ∈ R𝑛 , and then apply sequential updates by moving in
the direction of steepest descent of the objective function. Assume for the moment that f ∈ 𝐶 2 (R𝑛 ), and denote the 𝑘th
iterate by 𝒘 𝑘 . Then

f (𝒘 𝑘 + 𝒗) = f (𝒘 𝑘 ) + 𝒗 ⊤ ∇f (𝒘 𝑘 ) + 𝑂 (∥𝒗∥ 2 ) for ∥𝒗∥ 2 → 0. (10.1.1)

This shows that the change in f around 𝒘 𝑘 is locally described by the gradient ∇f (𝒘 𝑘 ). For small 𝒗 the contribution of
the second order term is negligible, and the direction 𝒗 along which the decrease of the risk is maximized equals the
negative gradient −∇f (𝒘 𝑘 ). Thus, −∇f (𝒘 𝑘 ) is also called the direction of steepest descent. This leads to an update of
the form

𝒘 𝑘+1 := 𝒘 𝑘 − ℎ 𝑘 ∇f (𝒘 𝑘 ), (10.1.2)

where ℎ 𝑘 > 0 is referred to as the step size or learning rate. We refer to this iterative algorithm as gradient descent.
In practice tuning the learning rate can be a subtle issue as it should strike a balance between the following dissenting
requirements:
(i) ℎ 𝑘 needs to be sufficiently small so that with 𝒗 = −ℎ 𝑘 ∇f (𝒘 𝑘 ), the second-order term in (10.1.1) is not dominating.
This ensures that the update (10.1.2) decreases the objective function.

99
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Fig. 10.1: Two examples of gradient descent as defined in (10.1.2). The red points represent the 𝒘 𝑘 .

(ii) ℎ 𝑘 should be large enough to ensure significant decrease of the objective function, which facilitates faster conver-
gence of the algorithm.
A learning rate that is too high might overshoot the minimum, while a rate that is too low results in slow convergence.
Common strategies include, in particular, constant learning rates (ℎ 𝑘 = ℎ for all 𝑘 ∈ N0 ), learning rate schedules such
as decaying learning rates (ℎ 𝑘 ↘ 0 as 𝑘 → ∞), and adaptive methods. For adaptive methods the algorithm dynamically
adjust ℎ 𝑘 based on the values of f (𝒘 𝑗 ) or ∇f (𝒘 𝑗 ) for 𝑗 ≤ 𝑘.

Remark 10.1 It is instructive to interpret (10.1.2) as an Euler discretization of the “gradient flow”

𝒘(0) = 𝒘0 , 𝒘 ′ (𝑡) = −∇f (𝒘(𝑡)) for 𝑡 ∈ [0, ∞). (10.1.3)

This ODE describes the movement of a particle 𝒘(𝑡), whose velocity at time 𝑡 ≥ 0 equals −∇f (𝒘(𝑡))—the vector of
steepest descent. Note that

df (𝒘(𝑡))
= ⟨∇f (𝒘(𝑡)), 𝒘 ′ (𝑡)⟩ = −∥∇f (𝒘(𝑡)) ∥ 2 ,
d𝑡
and thus the dynamics (10.1.3) necessarily decreases the value of the objective function along its path as long as
∇f (𝒘(𝑡)) ≠ 0.

Throughout the rest of Section 10.1 we assume that 𝒘0 ∈ R𝑛 is arbitrary, and the sequence (𝒘 𝑘 ) 𝑘 ∈N0 is generated
by (10.1.2). We will analyze the convergence of this algorithm under suitable assumptions on f and the ℎ 𝑘 . The proofs
primarily follow the arguments in [139, Chapter 2]. We also refer to that book for a much more detailed discussion of
gradient descent, and further reading on convex optimization.

10.1.1 𝑳 -smoothness

A key assumption to analyze convergence of (10.1.2) is Lipschitz continuity of ∇f.

Definition 10.2 Let 𝑛 ∈ N, and 𝐿 > 0. The function f : R𝑛 → R is called 𝐿-smooth if f ∈ 𝐶 1 (R𝑛 ) and

∥∇f (𝒘) − ∇f (𝒗) ∥ ≤ 𝐿 ∥𝒘 − 𝒗∥ for all 𝒘, 𝒗 ∈ R𝑛 .

100
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
For fixed 𝒘, 𝐿-smoothness implies the linear growth bound

∥∇f (𝒘 + 𝒗) ∥ ≤ ∥∇f (𝒘) ∥ + 𝐿∥𝒗∥

for ∇f. Integrating the gradient along lines in R𝑛 then shows that f is bounded from above by a quadratic function
touching the graph of f at 𝒘, as stated in the next lemma; also see Figure 10.2.
Lemma 10.3 Let 𝑛 ∈ N and 𝐿 > 0. Let f : R𝑛 → R be 𝐿-smooth. Then
𝐿
f (𝒗) ≤ f (𝒘) + ⟨∇f (𝒘), 𝒗 − 𝒘⟩ + ∥𝒘 − 𝒗∥ 2 for all 𝒘, 𝒗 ∈ R𝑛 . (10.1.4)
2
Proof We have for all 𝒘, 𝒗 ∈ R𝑛
∫ 1
f (𝒗) = f (𝒘) + ⟨∇f (𝒘 + 𝑡 (𝒗 − 𝒘)), 𝒗 − 𝒘⟩ d𝑡
0
∫ 1
= f (𝒘) + ⟨∇f (𝒘), 𝒗 − 𝒘⟩ + ⟨∇f (𝒘 + 𝑡 (𝒗 − 𝒘)) − ∇f (𝒘), 𝒗 − 𝒘⟩ d𝑡.
0

Thus
∫ 1
𝐿
f (𝒗) − f (𝒘) − ⟨∇f (𝒘), 𝒗 − 𝒘⟩ ≤ 𝐿 ∥𝑡 (𝒗 − 𝒘) ∥ ∥𝒗 − 𝒘∥ d𝑡 = ∥𝒗 − 𝒘∥ 2 ,
0 2
which shows (10.1.4). □
Remark 10.4 The argument in the proof of Lemma 10.3 also gives the lower bound
𝐿
f (𝒗) ≥ f (𝒘) + ⟨∇f (𝒘), 𝒗 − 𝒘⟩ − ∥𝒘 − 𝒗∥ 2 for all 𝒘, 𝒗 ∈ R𝑛 . (10.1.5)
2
The previous lemma allows us to show a decay property for the gradient descent iterates. Specifically, the values of
f necessarily decrease in each iteration as long as the step size ℎ 𝑘 is small enough, and ∇f (𝒘 𝑘 ) ≠ 0.
Lemma 10.5 Let 𝑛 ∈ N and 𝐿 > 0. Let f : R𝑛 → R be 𝐿-smooth. Further, let (ℎ 𝑘 ) ∞
𝑘=1 be positive numbers and let
(𝒘 𝑘 ) ∞
𝑘=0 ⊆ R 𝑛 be defined by (10.1.2).

Then, for all 𝑘 ∈ N


 𝐿ℎ2𝑘 
f (𝒘 𝑘+1 ) ≤ f (𝒘 𝑘 ) − ℎ 𝑘 − ∥∇f (𝒘 𝑘 ) ∥ 2 . (10.1.6)
2
Proof Lemma 10.3 with 𝒗 = 𝒘 𝑘+1 and 𝒘 = 𝒘 𝑘 gives
𝐿
f (𝒘 𝑘+1 ) ≤ f (𝒘 𝑘 ) + ⟨∇f (𝒘 𝑘 ), −ℎ 𝑘 ∇f (𝒘 𝑘 )⟩ + ∥ℎ 𝑘 ∇f (𝒘 𝑘 ) ∥ 2 ,
2
which corresponds to (10.1.6). □
Remark 10.6 The right-hand side in (10.1.6) is minimized for step size ℎ 𝑘 = 1/𝐿, in which case (10.1.6) reads
1
f (𝒘 𝑘+1 ) ≤ f (𝒘 𝑘 ) − ∥∇f (𝒘 𝑘 ) ∥ 2 .
2𝐿
Next, let us discuss the behavior of the gradients for constant step sizes.
Proposition 10.7 Let 𝑛 ∈ N and 𝐿 > 0. Let f : R𝑛 → R be 𝐿-smooth. Further, let ℎ 𝑘 = ℎ ∈ (0, 2/𝐿) for all 𝑘 ∈ N, and
(𝒘 𝑘 ) ∞
𝑘=0 ⊆ R be defined by (10.1.2).
𝑛

Then, for all 𝑘 ∈ N

101
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
𝑘
1 ∑︁ 1 2
∥∇f (𝒘 𝑗 ) ∥ 2 ≤ (f (𝒘0 ) − f (𝒘 𝑘+1 )). (10.1.7)
𝑘 + 1 𝑗=0 𝑘 + 1 2ℎ − 𝐿ℎ2

Proof Set 𝑐 := ℎ − (𝐿ℎ2 )/2 = (2ℎ − 𝐿ℎ2 )/2 > 0. By (10.1.6) for 𝑗 ≥ 0

f (𝒘 𝑗 ) − f (𝒘 𝑗+1 ) ≥ 𝑐∥∇f (𝒘 𝑗 ) ∥ 2 .

Hence
𝑘 𝑘
∑︁ 1 ∑︁
∥∇f (𝒘 𝑗 ) ∥ 2 ≤ f (𝒘 𝑗 ) − f (𝒘 𝑗+1 ) = f (𝒘0 ) − f (𝒘 𝑘+1 ).
𝑗=0
𝑐 𝑗=0

Dividing by 𝑘 + 1 concludes the proof. □


Suppose that f is bounded from below, i.e. inf 𝒘∈R𝑛 f (𝒘) > −∞. In this case, the right-hand side in (10.1.7) behaves
like 𝑂 (𝑘 −1 ) as 𝑘 → ∞, and (10.1.7) implies

min ∥∇f (𝒘 𝑗 ) ∥ = 𝑂 (𝑘 −1/2 ).


𝑗=1,...,𝑘

Thus, lower boundedness of the objective function together with 𝐿-smoothness already suffice to obtain some form of
convergence of the gradients to 0. We emphasize that this does not imply convergence of 𝒘 𝑘 towards some 𝒘∗ with
∇f (𝒘∗ ) = 0 as the example f (𝑤) = arctan(𝑤), 𝑤 ∈ R, shows.

10.1.2 Convexity

While 𝐿-smoothness entails some interesting properties of gradient descent, it does not have any direct implications on
the existence or uniqueness of minimizers. To show convergence of f (𝒘 𝑘 ) towards min𝒘 f (𝒘) for 𝑘 → ∞ (assuming
this minimum exists), we will assume that f is a convex function.
Definition 10.8 Let 𝑛 ∈ N. A function f : R𝑛 → R is called convex if and only if

f (𝜆𝒘 + (1 − 𝜆)𝒗) ≤ 𝜆f (𝒘) + (1 − 𝜆)f (𝒗), (10.1.8)

for all 𝒘, 𝒗 ∈ R𝑛 , 𝜆 ∈ (0, 1).


Let 𝑛 ∈ N. If f ∈ 𝐶 1 (R𝑛 ), then f is convex if and only if

f (𝒘) + ⟨∇f (𝒘), 𝒗 − 𝒘⟩ ≤ f (𝒗) for all 𝒘, 𝒗 ∈ R𝑛 , (10.1.9)

as shown in Exercise 10.27. Thus, f ∈ 𝐶 1 (R𝑛 ) is convex if and only if the graph of f lies above each of its tangents, see
Figure 10.2.
For convex f, a minimizer neither needs to exist (e.g., f (𝑤) = 𝑤 for 𝑤 ∈ R) nor be unique (e.g., f (𝒘) = 0 for
𝒘 ∈ R𝑛 ). However, if 𝒘∗ and 𝒗∗ are two minimizers, then every convex combination 𝜆𝒘∗ + (1 − 𝜆)𝒗∗ , 𝜆 ∈ [0, 1], is also
a minimizer due to (10.1.8). Thus, the set of all minimizers is convex. In particular, a convex objective function has
either zero, one, or infinitely many minimizers. Moreover, if f ∈ 𝐶 1 (R𝑛 ) then ∇f (𝒘) = 0 implies

f (𝒘) = f (𝒘) + ⟨∇f (𝒘), 𝒗 − 𝒘⟩ ≤ f (𝒗) for all 𝒗 ∈ R𝑛 .

Thus, 𝒘 is a minimizer of f if and only if ∇f (𝒘) = 0.


By Lemma 10.5, smallness of the step sizes and 𝐿-smoothness suffice to show a decay property for the objective
function f. Under the additional assumption of convexity, we also get a decay property for the distance of 𝒘 𝑘 to any
minimizer 𝒘∗ .

102
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Lemma 10.9 Let 𝑛 ∈ N and 𝐿 > 0. Let f : R𝑛 → R be 𝐿-smooth and convex. Further, let ℎ 𝑘 ∈ (0, 2/𝐿) for all 𝑘 ∈ N0 ,
and (𝒘 𝑘 ) ∞
𝑘=0 ⊆ R be defined by (10.1.2). Suppose that 𝒘∗ is a minimizer of f.
𝑛

Then, for all 𝑘 ∈ N0


2 
∥𝒘 𝑘+1 − 𝒘∗ ∥ 2 ≤ ∥𝒘 𝑘 − 𝒘∗ ∥ 2 − ℎ 𝑘 · − ℎ 𝑘 ∥∇f (𝒘 𝑘 ) ∥ 2 .
𝐿
To prove the lemma, we will require the following inequality [139, Theorem 2.1.5].
Lemma 10.10 Let 𝑛 ∈ N and 𝐿 > 0. Let f : R𝑛 → R be 𝐿-smooth and convex.
Then,
1
∥∇f (𝒘) − ∇f (𝒗) ∥ 2 ≤ ⟨∇f (𝒘) − ∇f (𝒗), 𝒘 − 𝒗⟩ for all 𝒘, 𝒗 ∈ R𝑛 .
𝐿
Proof Fix 𝒘 ∈ R𝑛 and set Ψ(𝒖) := f (𝒖) − ⟨∇f (𝒘), 𝒖⟩ for all 𝒖 ∈ R𝑛 . Then ∇Ψ(𝒖) = ∇f (𝒖) − ∇f (𝒘) and thus Ψ is
𝐿-smooth. Moreover, convexity of f, specifically (10.1.9), yields Ψ(𝒖) ≥ f (𝒘) − ⟨∇f (𝒘), 𝒘⟩ = Ψ(𝒘) for all 𝒖 ∈ R𝑛 ,
and thus 𝒘 is a minimizer of Ψ. Using (10.1.4) on Ψ we get for every 𝒗 ∈ R𝑛
 𝐿 
Ψ(𝒘) = min𝑛 Ψ(𝒖) ≤ min𝑛 Ψ(𝒗) + ⟨∇Ψ(𝒗), 𝒖 − 𝒗⟩ + ∥𝒖 − 𝒗∥ 2
𝒖 ∈R 𝒖 ∈R 2
𝐿
= min Ψ(𝒗) − 𝑡 ∥∇Ψ(𝒗) ∥ 2 + 𝑡 2 ∥∇Ψ(𝒗) ∥ 2
𝑡 ≥0 2
1
= Ψ(𝒗) − ∥∇Ψ(𝒗) ∥ 2
2𝐿
since the minimum of 𝑡 ↦→ 𝑡 2 𝐿/2 − 𝑡 is attained at 𝑡 = 𝐿 −1 . This implies
1
f (𝒘) − f (𝒗) + ∥∇f (𝒘) − ∇f (𝒗) ∥ 2 ≤ ⟨∇f (𝒘), 𝒘 − 𝒗⟩ .
2𝐿
Adding the same inequality with the roles of 𝒘 and 𝒗 switched gives the result. □
Proof (of Lemma 10.9) It holds

∥𝒘 𝑘+1 − 𝒘∗ ∥ 2 = ∥𝒘 𝑘 − 𝒘∗ ∥ 2 − 2ℎ 𝑘 ⟨∇f (𝒘 𝑘 ), 𝒘 𝑘 − 𝒘∗ ⟩ + ℎ2𝑘 ∥∇f (𝒘 𝑘 ) ∥ 2 .

Since ∇f (𝒘∗ ) = 0, Lemma 10.10 gives


1
− ⟨∇f (𝒘 𝑘 ), 𝒘 𝑘 − 𝒘∗ ⟩ ≤ − ∥∇f (𝒘 𝑘 ) ∥ 2
𝐿
which implies the claim. □
These preparations allow us to show that for constant step size ℎ < 2/𝐿, we obtain convergence of f (𝒘 𝑘 ) towards
f (𝒘∗ ) with rate 𝑂 (𝑘 −1 ), as stated in the next theorem.
Theorem 10.11 ([139, Theorem 2.1.14]) Let 𝑛 ∈ N and 𝐿 > 0. Let f : R𝑛 → R be 𝐿-smooth and convex. Further, let
ℎ 𝑘 = ℎ ∈ (0, 2/𝐿) for all 𝑘 ∈ N0 , and let (𝒘 𝑘 ) ∞
𝑘=0 ⊆ R be defined by (10.1.2). Suppose that 𝒘∗ is a minimizer of f.
𝑛
−1
Then, f (𝒘 𝑘 ) − f (𝒘∗ ) = 𝑂 (𝑘 ) for 𝑘 → ∞, and for the specific choice ℎ = 1/𝐿
2𝐿
f (𝒘 𝑘 ) − f (𝒘∗ ) ≤ ∥𝒘0 − 𝒘∗ ∥ 2 for all 𝑘 ∈ N0 . (10.1.10)
4+𝑘
Proof The case 𝒘0 = 𝒘∗ is trivial and throughout we assume 𝒘0 ≠ 𝒘∗ .
Step 1. Let 𝑗 ∈ N0 . Using convexity (10.1.9)

f (𝒘 𝑗 ) − f (𝒘∗ ) ≤ − ∇f (𝒘 𝑗 ), 𝒘∗ − 𝒘 𝑗 ≤ ∥∇f (𝒘 𝑗 ) ∥ ∥𝒘∗ − 𝒘 𝑗 ∥ . (10.1.11)

103
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
By Lemma 10.9 and since 𝒘0 ≠ 𝒘∗ it holds ∥𝒘∗ − 𝒘 𝑗 ∥ ≤ ∥𝒘∗ − 𝒘0 ∥ ≠ 0, so that we obtain a lower bound on the
gradient

(f (𝒘 𝑗 ) − f (𝒘∗ )) 2
∥∇f (𝒘 𝑗 ) ∥ 2 ≥ .
∥𝒘∗ − 𝒘0 ∥ 2

Lemma 10.5 then yields


 𝐿ℎ2 
f (𝒘 𝑗+1 ) − f (𝒘∗ ) ≤ f (𝒘 𝑗 ) − f (𝒘∗ ) − ℎ − ∥∇f (𝒘 𝑗 ) ∥ 2
2
 𝐿ℎ2  (f (𝒘 𝑗 ) − f (𝒘∗ )) 2
≤ f (𝒘 𝑗 ) − f (𝒘∗ ) − ℎ − .
2 ∥𝒘0 − 𝒘∗ ∥ 2

With 𝑒 𝑗 := f (𝒘 𝑗 ) − f (𝒘∗ ) and 𝜔 := (ℎ − 𝐿ℎ2 /2)/∥𝒘0 − 𝒘∗ ∥ 2 this reads

𝑒 𝑗+1 ≤ 𝑒 𝑗 − 𝜔𝑒 2𝑗 = 𝑒 𝑗 · (1 − 𝜔𝑒 𝑗 ), (10.1.12)

which is valid for all 𝑗 ∈ N0 .


Step 2. By 𝐿-smoothness (10.1.4) and ∇f (𝒘∗ ) = 0 it holds
𝐿
f (𝒘0 ) − f (𝒘∗ ) ≤ ∥𝒘0 − 𝒘∗ ∥ 2 , (10.1.13)
2
which implies (10.1.10) for 𝑘 = 0. It remains to show the bound for 𝑘 ∈ N.
Fix 𝑘 ∈ N. We may assume 𝑒 𝑘 > 0, since otherwise (10.1.10) is trivial. Then 𝑒 𝑗 > 0 for all 𝑗 = 0, . . . , 𝑘 − 1 since
𝑒 𝑗 = 0 implies 𝑒 𝑖 = 0 for all 𝑖 > 𝑗, contradicting 𝑒 𝑘 = 0. Moreover, 𝜔𝑒 𝑗 < 1 for all 𝑗 = 0, . . . , 𝑘 − 1, since 𝜔𝑒 𝑗 ≥ 1
implies 𝑒 𝑗+1 ≤ 0 by (10.1.12), contradicting 𝑒 𝑗+1 > 0.
Using that 1/(1 − 𝑥) ≥ 1 + 𝑥 for all 𝑥 ∈ [0, 1), (10.1.12) thus gives
1 1 1
≥ (1 + 𝜔𝑒 𝑗 ) = +𝜔 for all 𝑗 = 0, . . . , 𝑘 − 1.
𝑒 𝑗+1 𝑒𝑗 𝑒𝑗

Hence
𝑘−1
1 1 ∑︁  1 1
− = − ≥ 𝑘𝜔
𝑒 𝑘 𝑒 0 𝑗=0 𝑒 𝑗+1 𝑒 𝑗

and
1 1
f (𝒘 𝑘 ) − f (𝒘∗ ) = 𝑒 𝑘 ≤ 1
= 2
.
𝑒0 + 𝑘𝜔 1
f (𝒘0 ) −f (𝒘∗ ) + 𝑘 (ℎ−𝐿ℎ /2)
∥𝒘 −𝒘 ∥ 2
0 ∗

Using (10.1.13) we get

∥𝒘0 − 𝒘∗ ∥ 2
f (𝒘 𝑘 ) − f (𝒘∗ ) ≤ 2 𝐿ℎ
= 𝑂 (𝑘 −1 ). (10.1.14)
𝐿 + 𝑘 ℎ · (1 − 2 )

Finally, (10.1.10) follows by plugging in ℎ = 1/𝐿. □

Remark 10.12 The step size ℎ = 1/𝐿 is again such that the upper bound in (10.1.14) is minimized.

It is important to note that while under the assumptions of Theorem 10.11 it holds f (𝒘 𝑘 ) → f (𝒘∗ ), in general it is
not true that 𝒘 𝑘 → 𝒘∗ as 𝑘 → ∞. To show the convergence of the 𝒘 𝑘 , we need to introduce stronger assumptions that
guarantee the existence of a unique minimizer.

104
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
10.1.3 Strong convexity

To obtain faster convergence and guarantee the existence of unique minimizers, we next introduce the notion of strong
convexity. As the terminology suggests, strong convexity implies convexity; specifically, while convexity requires f
to be lower bounded by the linearization around each point, strongly convex functions are lower bounded by the
linearization plus a positive quadratic term.

Definition 10.13 Let 𝑛 ∈ N and 𝜇 > 0. A function f ∈ 𝐶 1 (R𝑛 ) is called 𝜇-strongly convex if
𝜇
f (𝒗) ≥ f (𝒘) + ⟨∇f (𝒘), 𝒗 − 𝒘⟩ + ∥𝒗 − 𝒘∥ 2 for all 𝒘, 𝒗 ∈ R𝑛 . (10.1.15)
2

𝐿-smooth convex 𝜇-strongly convex

Fig. 10.2: The graph of 𝐿-smooth functions lies between two quadratic functions at each point, see (10.1.4) and (10.1.5),
the graph of convex functions lies above the tangent at each point, see (10.1.9), and the graph of 𝜇-strongly convex
functions lies above a quadratic function at each point, see (10.1.15).

Note that (10.1.15) is the opposite of the bound (10.1.4) implied by 𝐿-smoothness. We depict the three notions of
𝐿-smoothness, convexity, and 𝜇-strong convexity in Figure 10.2.
Every 𝜇-strongly convex function has a unique minimizer. To see this note first that (10.1.15) implies f to be lower
bounded by a convex quadratic function, so that there exists at least one minimizer 𝒘∗ , and ∇f (𝒘∗ ) = 0. By (10.1.15)
we then have f (𝒗) > f (𝒘∗ ) for every 𝒗 ≠ 𝒘∗ .
The next theorem shows that the gradient descent iterates converge linearly towards the unique minimizer for 𝐿-
smooth and 𝜇-strongly convex functions. Recall that a sequence 𝑒 𝑘 is said to converge linearly to 0, if and only if there
exist constants 𝐶 > 0 and 𝑐 ∈ [0, 1) such that

𝑒 𝑘 ≤ 𝐶𝑐 𝑘 for all 𝑘 ∈ N0 .

The constant 𝑐 is also referred to as the rate of convergence. Before giving the statement, we first note that comparing
(10.1.4) and (10.1.15) it necessarily holds 𝐿 ≥ 𝜇 and therefore 𝜅 := 𝐿/𝜇 ≥ 1. This term is known as the condition
number of f. It crucially influences the rate of convergence.

Theorem 10.14 Let 𝑛 ∈ N and 𝐿 ≥ 𝜇 > 0. Let f : R𝑛 → R be 𝐿-smooth and 𝜇-strongly convex. Further, let
ℎ 𝑘 = ℎ ∈ (0, 1/𝐿] for all 𝑘 ∈ N0 , let (𝒘 𝑘 ) ∞
𝑘=0 ⊆ R be defined by (10.1.2), and let 𝒘∗ be the unique minimizer of f.
𝑛

Then, f (𝒘 𝑘 ) → f (𝒘∗ ) and 𝒘 𝑘 → 𝒘∗ converge linearly for 𝑘 → ∞. For the specific choice ℎ = 1/𝐿
 𝜇 𝑘
∥𝒘 𝑘 − 𝒘∗ ∥ 2 ≤ 1 − ∥𝒘0 − 𝒘∗ ∥ 2 (10.1.16a)
𝐿
𝐿 𝜇 𝑘
f (𝒘 𝑘 ) − f (𝒘∗ ) ≤ 1− ∥𝒘0 − 𝒘∗ ∥ 2 . (10.1.16b)
2 𝐿
Proof It suffices to show (10.1.16a) since (10.1.16b) follows directly by Lemma 10.3 and because ∇f (𝒘∗ ) = 0. The
case 𝑘 = 0 is trivial, so let 𝑘 ∈ N.
Expanding 𝒘 𝑘 = 𝒘 𝑘−1 − ℎ∇f (𝒘 𝑘−1 ) and using 𝜇-strong convexity (10.1.15)

105
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
∥𝒘 𝑘 − 𝒘∗ ∥ 2 = ∥𝒘 𝑘−1 − 𝒘∗ ∥ 2 − 2ℎ ⟨∇f (𝒘 𝑘−1 ), 𝒘 𝑘−1 − 𝒘∗ ⟩ + ℎ2 ∥∇f (𝒘 𝑘−1 ) ∥ 2
≤ (1 − 𝜇ℎ) ∥𝒘 𝑘−1 − 𝒘∗ ∥ 2 − 2ℎ · (f (𝒘 𝑘−1 ) − f (𝒘∗ )) + ℎ2 ∥∇f (𝒘 𝑘−1 ) ∥ 2 .

Moreover, the descent property in Lemma 10.5 gives

− 2ℎ · (f (𝒘 𝑘−1 ) − f (𝒘∗ )) + ℎ2 ∥∇f (𝒘 𝑘−1 ) ∥ 2


ℎ2
≤ −2ℎ · (f (𝒘 𝑘−1 ) − f (𝒘∗ )) + (f (𝒘 𝑘−1 ) − f (𝒘 𝑘 )). (10.1.17)
ℎ · (1 − 𝐿ℎ/2)

The descent property also implies f (𝒘 𝑘−1 ) − f (𝒘∗ ) ≥ f (𝒘 𝑘−1 ) − f (𝒘 𝑘 ). Thus the right-hand side of (10.1.17) is less or
equal to zero as long as 2ℎ ≥ ℎ/(1 − 𝐿ℎ/2), which is equivalent to ℎ ≤ 1/𝐿. Hence

∥𝒘 𝑘 − 𝒘∗ ∥ 2 ≤ (1 − 𝜇ℎ) ∥𝒘 𝑘−1 − 𝒘∗ ∥ 2 ≤ · · · ≤ (1 − 𝜇ℎ) 𝑘 ∥𝒘0 − 𝒘∗ ∥ 2 .

This concludes the proof. □

Remark 10.15 With a more refined argument, see [139, Theorem 2.1.15], the constraint on the step size can be relaxed
to ℎ ≤ 2/(𝜇 + 𝐿). For ℎ = 2/(𝜇 + 𝐿) one then obtains (10.1.16) with 1 − 𝜇/𝐿 = 1 − 𝜅 −1 replaced by
 𝐿/𝜇 − 1  2  𝜅 − 1 2
= ∈ [0, 1). (10.1.18)
𝐿/𝜇 + 1 𝜅+1
We have
 𝜅 − 1 2
= 1 − 4𝜅 −1 + 𝑂 (𝜅 −2 )
𝜅+1
as 𝜅 → ∞. Thus, (10.1.18) gives a slightly better, but conceptually similar, rate of convergence than 1 − 𝜅 −1 shown in
Theorem 10.14.

10.1.4 PL-inequality

Linear convergence for gradient descent can also be shown under a weaker assumption known as the Polyak-Łojasiewicz-
inequality, or PL-inequality for short.

Lemma 10.16 Let 𝑛 ∈ N and 𝜇 > 0. Let f : R𝑛 → R be 𝜇-strongly convex and denote its unique minimizer by 𝒘∗ . Then
f satisfies the PL-inequality
1
𝜇 · (f (𝒘) − f (𝒘∗ )) ≤ ∥∇f (𝒘) ∥ 2 for all 𝒘 ∈ R𝑛 . (10.1.19)
2
Proof By 𝜇-strong convexity we have
𝜇
f (𝒗) ≥ f (𝒘) + ⟨∇f (𝒘), 𝒗 − 𝒘⟩ + ∥𝒗 − 𝒘∥ 2 for all 𝒗, 𝒘 ∈ R𝑛 . (10.1.20)
2
The gradient of the right-hand side with respect to 𝒗 equals ∇f (𝒘) + 𝜇 · (𝒗 − 𝒘). This implies that the minimum of this
expression is attained at 𝒗 = 𝒘 − ∇f (𝒘)/𝜇. Minimizing both sides of (10.1.20) in 𝒗 we thus find
1 1 1
f (𝒘∗ ) ≥ f (𝒘) − ∥∇f (𝒘) ∥ 2 + ∥∇f (𝒘) ∥ 2 = f (𝒘) − ∥∇f (𝒘) ∥ 2 .
𝜇 2𝜇 2𝜇
Rearranging the terms gives (10.1.19). □

106
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
As the lemma states, the PL-inequality is implied by strong convexity. Moreover, it is indeed weaker than strong
convexity, and does not even imply convexity, see Exercise 10.28. The next theorem, which corresponds to [196,
Theorem 1], gives a convergence result for 𝐿-smooth functions satisfying the PL-inequality. It therefore does not
require convexity. The proof is left as an exercise. We only note that the PL-inequality bounds the distance to the
minimal value of the objective function by the squared norm of the gradient. It is thus precisely the type of bound
required to show convergence of gradient descent.

Theorem 10.17 Let 𝑛 ∈ N and 𝐿 > 0. Let f : R𝑛 → R be 𝐿-smooth. Further, let ℎ 𝑘 = 1/𝐿 for all 𝑘 ∈ N0 , and let
(𝒘 𝑘 ) ∞
𝑘=0 ⊆ R be defined by (10.1.2), and let 𝒘∗ be a (not necessarily unique) minimizer of f, so that the PL-inequality
𝑛

(10.1.19) holds.
Then, it holds for all 𝑘 ∈ N0 that
 𝜇 𝑘
f (𝒘 𝑘 ) − f (𝒘∗ ) ≤ 1 − (f (𝒘0 ) − f (𝒘∗ )).
𝐿

10.2 Stochastic gradient descent (SGD)

We next discuss a stochastic variant of gradient descent. The idea, which originally goes back to Robbins and Monro
[169], is to replace the gradient ∇f (𝒘 𝑘 ) in (10.1.2) by a random variable that we denote by 𝑮 𝑘 . We interpret 𝑮 𝑘 as an
approximation to ∇f (𝒘 𝑘 ); specifically, throughout we will assume that (given 𝒘 𝑘 ) 𝑮 𝑘 is an unbiased estimator, i.e.

E[𝑮 𝑘 |𝒘 𝑘 ] = ∇f (𝒘 𝑘 ). (10.2.1)

After choosing some initial value 𝒘0 ∈ R𝑛 , the update rule becomes

𝒘 𝑘+1 := 𝒘 𝑘 − ℎ 𝑘 𝑮 𝑘 , (10.2.2)

where ℎ 𝑘 > 0 denotes again the step size, and unlike in Section 10.1, we focus here on the case of ℎ 𝑘 depending on 𝑘.
The iteration (10.2.2) creates a Markov chain (𝒘0 , 𝒘1 , . . . ), meaning that 𝒘 𝑘 is a random variable, and its state only
depends1 on 𝒘 𝑘−1 . The main reason for replacing the actual gradient by an estimator, is not to improve the accuracy
or convergence rate, but rather to decrease the computational cost and storage requirements of the algorithm. The
underlying assumption is that 𝑮 𝑘−1 can be computed at a fraction of the cost required for the computation of ∇f (𝒘 𝑘−1 ).
The next example illustrates this in the standard setting.

Example 10.18 (Empirical risk minimization) Suppose we have some data (𝒙 𝑗 , 𝑦 𝑗 ) 𝑚


𝑗=1 , where 𝑦 𝑗 ∈ R can be understood
as the label corresponding to the data point 𝒙 𝑗 ∈ R . Using the square loss, we wish to fit a neural network
𝑑

Φ(·; 𝒘) : R𝑑 → R depending on parameters (i.e. weights and biases) 𝒘 ∈ R𝑛 , such that the empirical risk
𝑚
1 ∑︁
f (𝒘) := (Φ(𝒙 𝑗 ; 𝒘) − 𝑦 𝑗 ) 2 ,
2𝑚 𝑗=1

is minimized. Performing one step of gradient descent requires the computation of


𝑚
1 ∑︁
∇f (𝒘) = (Φ(𝒙 𝑗 ; 𝒘) − 𝑦 𝑗 )∇𝒘 Φ(𝒙 𝑗 ; 𝒘), (10.2.3)
𝑚 𝑗=1

and thus the computation of 𝑚 gradients of the neural network Φ. For large 𝑚 (in practice 𝑚 can be in the millions or
even larger), this computation might be infeasible. To decrease computational complexity, we replace the full gradient
(10.2.3) by

1 More precisely, given 𝒘 𝑘−1 , the state of 𝒘 𝑘 is conditionally independent of 𝒘1 , . . . , 𝒘 𝑘−2 . See Appendix A.3.3.

107
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
𝑮 := (Φ(𝒙 𝑗 ; 𝒘) − 𝑦 𝑗 )∇𝒘 Φ(𝒙 𝑗 ; 𝒘)

where 𝑗 ∼ uniform(1, . . . , 𝑚) is a random variable with uniform distribution on the discrete set {1, . . . , 𝑚}. Then
𝑚
1 ∑︁
E[𝑮] = (Φ(𝒙 𝑗 ; 𝒘) − 𝑦 𝑗 )∇𝒘 Φ(𝒙 𝑗 ; 𝒘) = ∇f (𝒘),
𝑚 𝑗=1

but an evaluation of 𝑮 merely requires the computation of a single gradient of the neural network. More general, one
can choose a mini-batch size 𝑚 𝑏 (where 𝑚 𝑏 ≪ 𝑚) and let 𝑮 = 𝑚1𝑏 𝑗 ∈ 𝐽 Φ(𝒙 𝑗 − 𝑦 𝑗 ; 𝒘)∇𝒘 Φ(𝒙 𝑗 ; 𝒘), where 𝐽 is a
Í
random subset of {1, . . . , 𝑚} of cardinality 𝑚 𝑏 .

Remark 10.19 In practice, the following variant is also common: Let 𝑚 𝑏 𝑘 = 𝑚 for 𝑚 𝑏 , 𝑘, 𝑚 ∈ N, i.e. the number of data
points 𝑚 is a 𝑘-fold multiple of the mini-batch size 𝑚 𝑏 . In each epoch, first a random partition ¤ 𝑖=1 𝐽𝑖 = {1, . . . , 𝑚} is
Ð𝑘
determined. Then for each 𝑖 = 1, . . . , 𝑘, the weights are updated with the gradient estimate
1 ∑︁
Φ(𝒙 𝑗 − 𝑦 𝑗 ; 𝒘)∇𝒘 Φ(𝒙 𝑗 ; 𝒘).
𝑚𝑏 𝑗 ∈ 𝐽
𝑖

Hence, in one epoch (which corresponds to 𝑘 updates of the neural network weights), the algorithm sweeps through
the whole dataset.

SGD can be analyzed in various settings. To give the general idea, we concentrate on the case of 𝐿-smooth and
𝜇-strongly convex objective functions. Let us start by looking at a property akin to the (descent) Lemma 10.5. Using
Lemma 10.3
𝐿
f (𝒘 𝑘+1 ) ≤ f (𝒘 𝑘 ) − ℎ 𝑘 ⟨∇f (𝒘 𝑘 ), 𝑮 𝑘 ⟩ + ℎ2𝑘 ∥𝑮 𝑘 ∥ 2 .
2
In contrast to gradient descent, we cannot say anything about the sign of the term in the middle of the right-hand side.
Thus, (10.2.2) need not necessarily decrease the value of the objective function in every step. The key insight is that in
expectation the value is still decreased under certain assumptions, namely

𝐿 
E[f (𝒘 𝑘+1 )|𝒘 𝑘 ] ≤ f (𝒘 𝑘 ) − ℎ 𝑘 E[⟨∇f (𝒘 𝑘 ), 𝑮 𝑘 ⟩|𝒘 𝑘 ] + ℎ2𝑘 E ∥𝑮 𝑘 ∥ 2 𝒘 𝑘

2
2𝐿
2
= f (𝒘 𝑘 ) − ℎ 𝑘 ∥∇f (𝒘 𝑘 ) ∥ + ℎ 𝑘 E ∥𝑮 𝑘 ∥ 2 𝒘 𝑘
 
 2 
2 𝐿 2
= f (𝒘 𝑘 ) − ℎ 𝑘 ∥∇f (𝒘 𝑘 ) ∥ − ℎ 𝑘 E[∥𝑮 𝑘 ∥ |𝒘 𝑘 ]
2

where we used (10.2.1).


Assuming, for some fixed 𝛾 > 0, the uniform bound

E[∥𝑮 𝑘 ∥ 2 |𝒘 𝑘 ] ≤ 𝛾

and that ∥∇f (𝒘 𝑘 ) ∥ > 0 (which is true unless 𝒘 𝑘 is the minimizer), upon choosing

2∥∇f (𝒘 𝑘 ) ∥ 2
0 < ℎ𝑘 < ,
𝐿𝛾

the expectation of the objective function decreases. Since ∇f (𝒘 𝑘 ) tends to 0 as we approach the minimum, this also
indicates that we should choose step sizes ℎ 𝑘 that tend to 0 for 𝑘 → ∞. For our analysis we will work with the specific
choice

108
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
1 (𝑘 + 1) 2 − 𝑘 2
ℎ 𝑘 := for all 𝑘 ∈ N0 , (10.2.4)
𝜇 (𝑘 + 1) 2

as, e.g., in [68]. Note that


2𝑘 + 1 2
ℎ𝑘 = = + 𝑂 (𝑘 −2 ) = 𝑂 (𝑘 −1 ).
𝜇(𝑘 + 1) 2 𝜇(𝑘 + 1)
Since 𝒘 𝑘 is a random variable by construction, a convergence statement can only be stochastic, e.g., in expectation
or with high probability. We concentrate here on the former, but emphasize that also the latter can be shown.

Theorem 10.20 (SGD) Let 𝑛 ∈ N and 𝐿, 𝜇, 𝛾 > 0. Let f : R𝑛 → R be 𝐿-smooth and 𝜇-strongly convex. Let (ℎ 𝑘 ) ∞ 𝑘=0
satisfy (10.2.4) and let (𝑮 𝑘 ) ∞ ∞
𝑘=0 , (𝒘 𝑘 ) 𝑘=0 be sequences of random variables satisfying (10.2.1) and (10.2.2). Assume
that E[∥𝑮 𝑘 ∥ 2 |𝒘 𝑘 ] ≤ 𝛾 for all 𝑘 ∈ N0 .
Then
4𝛾
E[∥𝒘 𝑘 − 𝒘∗ ∥ 2 ] ≤ = 𝑂 (𝑘 −1 ),
𝜇2 𝑘
4𝐿𝛾
E[f (𝒘 𝑘 )] − f (𝒘∗ ) ≤ = 𝑂 (𝑘 −1 )
2𝜇2 𝑘
for 𝑘 → ∞.

Proof We proceed similar as in the proof of Theorem 10.14. It holds for 𝑘 ≥ 1

E[∥𝒘 𝑘 − 𝒘∗ ∥ 2 |𝒘 𝑘−1 ]
= ∥𝒘 𝑘−1 − 𝒘∗ ∥ 2 − 2ℎ 𝑘−1 E[⟨𝑮 𝑘−1 , 𝒘 𝑘−1 − 𝒘∗ ⟩|𝒘 𝑘−1 ] + ℎ2𝑘−1 E[∥𝑮 𝑘−1 ∥ 2 |𝒘 𝑘−1 ]
= ∥𝒘 𝑘−1 − 𝒘∗ ∥ 2 − 2ℎ 𝑘−1 ⟨∇f (𝒘 𝑘−1 ), 𝒘 𝑘−1 − 𝒘∗ ⟩ + ℎ2𝑘−1 E[∥𝑮 𝑘−1 ∥ 2 |𝒘 𝑘−1 ].

By 𝜇-strong convexity (10.1.15)

−2ℎ 𝑘−1 ⟨∇f (𝒘 𝑘−1 ), 𝒘 𝑘−1 − 𝒘∗ ⟩ ≤ −𝜇ℎ 𝑘−1 ∥𝒘 𝑘−1 − 𝒘∗ ∥ 2 − 2ℎ 𝑘−1 · (f (𝒘 𝑘−1 ) − f (𝒘∗ ))
≤ −𝜇ℎ 𝑘−1 ∥𝒘 𝑘−1 − 𝒘∗ ∥ 2 .

Thus

E[∥𝒘 𝑘 − 𝒘∗ ∥ 2 |𝒘 𝑘−1 ] ≤ (1 − 𝜇ℎ 𝑘−1 ) ∥𝒘 𝑘−1 − 𝒘∗ ∥ 2 + ℎ2𝑘−1 𝛾.

Using the Markov property, we have

E[∥𝒘 𝑘 − 𝒘∗ ∥ 2 |𝒘 𝑘−1 , 𝒘 𝑘−2 ] = E[∥𝒘 𝑘 − 𝒘∗ ∥ 2 |𝒘 𝑘−1 ]

so that

E[∥𝒘 𝑘 − 𝒘∗ ∥ 2 |𝒘 𝑘−1 ] ≤ (1 − 𝜇ℎ 𝑘−1 )E[∥𝒘 𝑘−1 − 𝒘∗ ∥ 2 |𝒘 𝑘−2 ] + ℎ2𝑘−1 𝛾.

With 𝑒 0 := ∥𝒘0 − 𝒘∗ ∥ 2 and 𝑒 𝑘 := E[∥𝒘 𝑘 − 𝒘∗ ∥ 2 |𝒘 𝑘−1 ] for 𝑘 ≥ 1 we have found

𝑒 𝑘 ≤ (1 − 𝜇ℎ 𝑘−1 )𝑒 𝑘−1 + ℎ2𝑘−1 𝛾


≤ (1 − 𝜇ℎ 𝑘−1 ) ((1 − 𝜇ℎ 𝑘−2 )𝑒 𝑘−2 + ℎ2𝑘−2 𝛾) + ℎ2𝑘−1 𝛾
𝑘−1
Ö 𝑘−1
∑︁ 𝑘−1
Ö
≤ · · · ≤ 𝑒0 (1 − 𝜇ℎ 𝑗 ) + 𝛾 ℎ2𝑗 (1 − 𝜇ℎ𝑖 ).
𝑗=0 𝑗=0 𝑖= 𝑗+1

109
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
By choice of ℎ𝑖
𝑘−1 𝑘−1
Ö Ö 𝑖2 𝑗2
(1 − 𝜇ℎ𝑖 ) = =
𝑖= 𝑗 𝑖= 𝑗
(𝑖 + 1) 2 𝑘 2

and thus
𝑘−1  2
𝛾 ∑︁ ( 𝑗 + 1) 2 − 𝑗 2 ( 𝑗 + 1) 2
𝑒𝑘 ≤ 2
𝜇 𝑗=0 ( 𝑗 + 1) 2 𝑘2
𝑘−1
𝛾 1 ∑︁ (2 𝑗 + 1) 2

𝜇2 𝑘 2 𝑗=0 ( 𝑗 + 1) 2
| {z }
≤4
𝛾 4𝑘

𝜇2 𝑘 2
4𝛾
≤ 2 .
𝜇 𝑘

Since E[∥𝒘 𝑘 − 𝒘∗ ∥ 2 ] is the expectation of E[∥𝒘 𝑘 − 𝒘∗ ∥ 2 |𝒘 𝑘−1 ] with respect to the random variable 𝒘 𝑘−1 , and
𝑒 0 /𝑘 2 + 4𝛾/(𝜇2 𝑘) is a constant independent of 𝒘 𝑘−1 , we obtain
4𝛾
E[∥𝒘 𝑘 − 𝒘∗ ∥ 2 ] ≤ + .
𝜇2 𝑘
Finally, using 𝐿-smoothness
𝐿 𝐿
f (𝒘 𝑘 ) − f (𝒘∗ ) ≤ ⟨∇f (𝒘∗ ), 𝒘 𝑘 − 𝒘∗ ⟩ + ∥𝒘 𝑘 − 𝒘∗ ∥ 2 = ∥𝒘 𝑘 − 𝒘∗ ∥ 2 ,
2 2
and taking the expectation concludes the proof. □

The specific choice of ℎ 𝑘 in (10.2.4) simplifies the calculations in the proof, but it is not necessary in order for
the asymptotic convergence to hold. One can show similar convergence results with ℎ 𝑘 = 𝑐 1 /(𝑐 2 + 𝑘) under certain
assumptions on 𝑐 1 , 𝑐 2 , e.g. [23, Theorem 4.7].

10.3 Backpropagation

We now explain how to apply gradient-based methods to the training of neural networks. Let Φ ∈ N𝑑𝑑0𝐿+1 (𝜎, 𝐿, 𝑛) (see
Definition 3.6) and assume that the activation function satisfies 𝜎 ∈ 𝐶 1 (R). As earlier, we denote the neural network
parameters by

𝒘 = ((𝑾 (0) , 𝒃 (0) ), . . . , (𝑾 (𝐿) , 𝒃 (𝐿) )) (10.3.1)

with weight matrices 𝑾 (ℓ ) ∈ R𝑑ℓ+1 ×𝑑ℓ and bias vectors 𝒃 (ℓ ) ∈ R𝑑ℓ+1 . Additionally, we fix a differentiable loss function
L : R𝑑𝐿+1 × R𝑑𝐿+1 → R, e.g., L (𝒘, 𝒘) ˜ = ∥𝒘 − 𝒘∥ ˜ 2 /2, and assume given data (𝒙 𝑗 , 𝒚 𝑗 ) 𝑚
𝑗=1 ⊆ R × R
𝑑0 𝑑𝐿+1 . The goal is

to minimize an empirical risk of the form


𝑚
1 ∑︁
f (𝒘) := L (Φ(𝒙 𝑗 ; 𝒘), 𝒚 𝑗 )
𝑚 𝑗=1

110
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
as a function of the neural network parameters. An application of the gradient step (10.1.2) to update the parameters
requires
𝑚
1 ∑︁
∇f (𝒘) = ∇𝒘 L (Φ(𝒙 𝑗 ; 𝒘), 𝒚 𝑗 ).
𝑚 𝑗=1

For stochastic methods, as explained in Example 10.18, we only compute this sum over a (random) subbatch of the
dataset. In either case, we need an algorithm to determine ∇𝒘 L (Φ(𝒙; 𝒘), 𝒚), i.e. the gradients

∇𝒃 (ℓ) L (Φ(𝒙; 𝒘), 𝒚) ∈ R𝑑ℓ+1 , ∇𝑾 (ℓ) L (Φ(𝒙; 𝒘), 𝒚) ∈ R𝑑ℓ+1 ×𝑑ℓ (10.3.2)

for all ℓ = 0, . . . , 𝐿.
The backpropagation algorithm [176] provides an efficient way to do so. To explain it, for fixed 𝒙 ∈ R𝑑0 introduce
the notation
𝒙 (0) := 𝒙 and

𝒙 (ℓ+1) := 𝑾 (ℓ ) 𝜎(𝒙 (ℓ ) ) + 𝒃 (ℓ ) for all 𝑗 ∈ {0, . . . , 𝐿}. (10.3.3)

Here the application of 𝜎 : R → R to the vector 𝒘 (ℓ ) ∈ R𝑑ℓ is, as always, understood componentwise, and by definition
𝒙 (𝐿+1) = Φ(𝒙; 𝒘) is the output of the neural network. Observe that 𝒙 (𝑘 ) depends on (𝑾 (ℓ ) , 𝒃 (ℓ ) ) only if 𝑘 > ℓ. In the
following, we also fix 𝒚 ∈ R𝑑𝐿+1 and write

L := L (Φ(𝒙; 𝒘), 𝒚) = L (𝒙 (𝐿+1) , 𝒚).

Since 𝒙 (ℓ+1) is a function of 𝒙 (ℓ ) for each ℓ, by repeated application of the chain rule

𝜕L 𝜕L 𝜕𝒙 (𝐿+1) 𝜕𝒙 (ℓ+2) 𝜕𝒙 (ℓ+1)


= · · · . (10.3.4)
𝜕𝑊𝑖(ℓ𝑗 ) 𝜕𝒙 (𝐿+1) 𝜕𝒙 (𝐿)
| {z } | {z }
𝜕𝒙 (ℓ+1)
| {z } 𝜕𝑊𝑖 𝑗
(ℓ )

1×𝑑 𝐿+1 𝑑 𝐿+1 ×𝑑 𝐿 𝑑ℓ+2 ×𝑑ℓ+1


| {z }
∈R ∈R ∈R
∈R𝑑ℓ+1 ×1

An analogous calculation holds for 𝜕L/𝜕𝑏 (ℓ )


𝑗 . Since all terms in (10.3.4) are easy to compute (see (10.3.3)), in principle
we could use this formula to determine the gradients in (10.3.2). To avoid unnecessary computations, the main idea of
backpropagation is to introduce

𝜶 (ℓ ) := ∇ 𝒙 (ℓ) L ∈ R𝑑ℓ for all ℓ = 1, . . . , 𝐿 + 1

and observe that


𝜕L 𝜕𝒙 (ℓ+1)
= (𝜶 (ℓ+1) ) ⊤ .
𝜕𝑊𝑖(ℓ𝑗 ) 𝜕𝑊𝑖(ℓ𝑗 )

As the following lemma shows, the 𝜶 (ℓ ) can be computed recursively from ℓ = 𝐿 + 1, . . . , 1. This explains the name
“backpropagation”.
Lemma 10.21 Under the set-up of this section, it holds

𝜶 (𝐿+1) = ∇ 𝒙 (𝐿+1) L (𝒙 (𝐿+1) , 𝒚) (10.3.5)

and with ⊙ denoting the componentwise (Hadamard) product of two vectors

𝜶 (ℓ ) = 𝜎 ′ (𝒙 (ℓ ) ) ⊙ (𝑾 (ℓ+1) ) ⊤ 𝜶 (ℓ+1) for all ℓ = 𝐿, . . . , 1.

Proof Equation (10.3.5) holds by definition. For ℓ ≤ 𝐿 by the chain rule

111
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
𝜕L  𝜕𝒙 (ℓ+1)  ⊤ 𝜕L  𝜕𝒙 (ℓ+1)  ⊤
𝜶 (ℓ ) = = = 𝜶 (ℓ ) .
𝜕𝒙 (ℓ ) 𝜕𝒙 (ℓ ) 𝜕𝒙 (ℓ+1) 𝜕𝒙 (ℓ )
| {z } | {z }
∈R𝑑ℓ ×𝑑ℓ+1 ∈R𝑑ℓ+1 ×1

By (10.3.3) for 𝑖 ∈ {1, . . . , 𝑑ℓ+1 }, 𝑗 ∈ {1, . . . , 𝑑ℓ }

 𝜕𝒙 (ℓ+1)  𝜕𝑥 𝑖(ℓ+1)
= = 𝑊𝑖(ℓ𝑗 ) 𝜎 ′ (𝑥 (ℓ )
𝑗 ).
𝜕𝒙 (ℓ ) 𝑖𝑗 𝜕𝑥 (ℓ )
𝑗

Thus, the claim follows. □


Putting everything together, we obtain explicit formulas for (10.3.2).

Proposition 10.22 Under the set-up of this subsection, it holds, for ℓ = 0, . . . , 𝐿

∇𝒃 (ℓ) L = 𝜶 (ℓ+1) ∈ R𝑑ℓ+1

and

∇𝑾 (ℓ) L = 𝜶 (ℓ+1) · 𝜎(𝒙 (ℓ ) ) ⊤ ∈ R𝑑ℓ+1 ×𝑑ℓ .

Proof By (10.3.3) for 𝑖, 𝑘 ∈ {1, . . . , 𝑑ℓ+1 }, and 𝑗 ∈ {1, . . . , 𝑑ℓ }

𝜕𝑥 𝑘(ℓ+1) 𝜕𝑥 𝑘(ℓ+1)
= 𝛿 𝑘𝑖 and = 𝛿 𝑘𝑖 𝜎(𝑥 (ℓ )
𝑗 ).
𝜕𝑏 𝑖(ℓ ) 𝜕𝑊𝑖(ℓ𝑗 )

𝑑ℓ+1
Thus, with 𝒆 𝑖 = (𝛿 𝑘𝑖 ) 𝑘=1

𝜕L  𝜕𝒙 (ℓ+1)  ⊤ 𝜕L
= = 𝒆⊤
𝑖 𝜶
(ℓ+1)
= 𝛼𝑖(ℓ+1)
𝜕𝑏 𝑖(ℓ ) 𝜕𝑏 𝑖(ℓ ) 𝜕𝒙 (ℓ+1)

and similarly

𝜕L  𝜕𝒙 (ℓ+1)  ⊤
= 𝜶 (ℓ+1) = 𝜎(𝑥 (ℓ ) ⊤ (ℓ+1)
𝑗 )𝒆 𝑖 𝜶 = 𝜎(𝑥 (ℓ )
𝑗 )𝛼𝑖
(ℓ+1)
.
𝜕𝑊𝑖(ℓ𝑗 ) 𝜕𝑊𝑖(ℓ𝑗 )

This concludes the proof. □


Lemma 10.21 and Proposition 10.22 motivate Algorithm 1, in which a forward pass computing 𝒙 (ℓ ) , ℓ = 1, . . . , 𝐿 +1,
is followed by a backward pass to determine the 𝜶 (ℓ ) , ℓ = 𝐿 + 1, . . . , 1, and the gradients of L with respect to the neural
network parameters. This shows how to use gradient-based optimizers from the previous sections for the training of
neural networks.
Two important remarks are in order. First, the objective function associated to neural networks is typically not convex
as a function of the neural network weights and biases. Thus, the analysis of the previous sections will in general not be
directly applicable. It may still give some insight about the convergence behavior locally around the minimizer however.
Second, to derive the backpropagation algorithm we assumed the activation function to be continuously differentiable,
which does not hold for ReLU. Using the concept of subgradients, gradient-based algorithms and their analysis may
be generalized to some extent to also accommodate non-differentiable loss functions, see Exercises 10.31–10.33.

112
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Algorithm 1 Backpropagation
Input: Network input 𝒙, target output 𝒚, neural network parameters ( (𝑾 (0) , 𝒃 (0) ) , . . . , (𝑾 (𝐿) , 𝒃 (𝐿) ) )
Output: Gradients of the loss function L with respect to neural network parameters

Forward pass
𝒙 (0) ← 𝒙
for ℓ = 0, . . . , 𝐿 do
𝒙 (ℓ+1) ← 𝑾 (ℓ) 𝜎 ( 𝒙 (ℓ) ) + 𝒃 (ℓ)
end for

Backward pass
𝜶 ( 𝐿+1) ← ∇ 𝒙 ( 𝐿+1) L ( 𝒙 ( 𝐿+1) , 𝒚 )
for ℓ = 𝐿, . . . , 1 do
∇𝒃 (ℓ) L ← 𝜶 (ℓ+1)
∇𝑾 (ℓ) L ← 𝜶 (ℓ+1) · 𝜎 ( 𝒙 (ℓ) ) ⊤
𝜶 (ℓ) = 𝜎 ′ ( 𝒙 (ℓ) ) ⊙ (𝑾 (ℓ+1) ) ⊤ 𝜶 (ℓ+1)
end for
∇𝒃 (0) L ← 𝜶 (1)
∇𝑾 (0) L ← 𝜶 (1) ( 𝒙 (0) ) ⊤

10.4 Acceleration

Acceleration is an important tool for the training of neural networks [197]. The idea was first introduced by Polyak
in 1964 under the name “heavy ball method” [159]. It is inspired by the dynamics of a heavy ball rolling down the
valley of the loss landscape. Since then other types of acceleration have been proposed and analyzed, with Nesterov
acceleration being the most prominent example [140]. In this section, we first give some intuition by discussing the
heavy ball method for a simple quadratic loss. Afterwards we turn to Nesterov acceleration and give a convergence
proof for 𝐿-smooth and 𝜇-strongly convex objective functions that improves upon the bounds obtained for gradient
descent.

10.4.1 Heavy ball method

We proceed similar as in [62, 160, 162] to motivate the idea. Consider the quadratic objective function in two dimensions
 
1 ⊤ 𝜆1 0
f (𝒘) := 𝒘 𝑫𝒘 where 𝑫= (10.4.1)
2 0 𝜆2

with 𝜆1 ≥ 𝜆2 > 0. Clearly, f has a unique minimizer at 𝒘∗ = 0 ∈ R2 . Starting at some 𝒘0 ∈ R2 , gradient descent with
constant step size ℎ > 0 computes the iterates
   
1 − ℎ𝜆1 0 (1 − ℎ𝜆1 ) 𝑘+1 0
𝒘 𝑘+1 = 𝒘 𝑘 − ℎ𝑫𝒘 𝑘 = 𝒘 = 𝒘 .
0 1 − ℎ𝜆 2 𝑘 0 (1 − ℎ𝜆2 ) 𝑘+1 0

The method converges for arbitrary initialization 𝒘0 if and only if

|1 − ℎ𝜆1 | < 1 and |1 − ℎ𝜆2 | < 1.

The optimal step size balancing out the speed of convergence in both coordinates is
2
ℎ∗ = arg min max{|1 − ℎ𝜆1 |, |1 − ℎ𝜆2 |} = . (10.4.2)
ℎ>0 𝜆1 + 𝜆2

113
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
With 𝜅 = 𝜆1 /𝜆2 we then obtain the convergence rate
𝜆1 − 𝜆2 𝜅 − 1
|1 − ℎ∗ 𝜆1 | = |1 − ℎ∗ 𝜆2 | = = ∈ [0, 1). (10.4.3)
𝜆1 + 𝜆2 𝜅+1
If 𝜆1 ≫ 𝜆2 , this term is close to 1, and thus the convergence will be slow. This is consistent with our analysis for
strongly convex objective functions; by Exercise 10.34 the condition number of f equals 𝜅 = 𝜆1 /𝜆2 ≫ 1. Hence,
the upper bounds in Theorem 10.14 and Remark 10.15 converge only slowly. Similar considerations hold for general
quadratic objective functions in R𝑛 such as
1 ⊤
f̃ (𝒘) = 𝒘 𝑨𝒘 + 𝒃 ⊤ 𝒘 + 𝑐 (10.4.4)
2
with 𝑨 ∈ R𝑛×𝑛 symmetric positive definite, 𝒃 ∈ R𝑛 and 𝑐 ∈ R, see Exercise 10.35.

Remark 10.23 Interpreting (10.4.4) as a second-order Taylor expansion of some objective function f̃ around its min-
imizer 𝒘∗ , we note that the above described effects also occur for general objective functions with ill-conditioned
Hessians at the minimizer.

1.0
Gradient Descent
Heavy Ball
0.8

0.6

0.4

0.2

0.0

1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00

Fig. 10.3: 20 steps of gradient descent and the heavy ball method on the objective function (10.4.1) with 𝜆1 = 12 ≫
1 = 𝜆2 , step size ℎ = 𝛼 = ℎ∗ as in (10.4.2), and 𝛽 = 1/3.

Figure 10.3 gives further insight into the poor performance of gradient descent for (10.4.1) with 𝜆1 ≫ 𝜆2 . The
loss-landscape looks like a ravine (the derivative is much larger in one direction than the other), and away from the
floor, ∇f mainly points to the opposite side. Therefore the iterates oscillate back and forth in the first coordinate, and
make little progress in the direction of the valley along the second coordinate axis. To address this problem, the heavy
ball method introduces a “momentum” term which can mitigate this effect to some extent. The idea is, to choose the
update not just according to the gradient at the current location, but to add information from the previous steps. After
initializing 𝒘0 and, e.g., 𝒘1 = 𝒘0 − 𝛼∇f (𝒘0 ), let for 𝑘 ∈ N

𝒘 𝑘+1 = 𝒘 𝑘 − 𝛼∇f (𝒘 𝑘 ) + 𝛽(𝒘 𝑘 − 𝒘 𝑘−1 ). (10.4.5)

This is known as Polyak’s heavy ball method [159]. Here 𝛼 > 0 and 𝛽 ∈ (0, 1) are hyperparameters (that could also
depend on 𝑘) and in practice need to be carefully tuned to balance the strength of the gradient and the momentum term.
Iteratively expanding (10.4.5) with the given initialization, observe that for 𝑘 ≥ 0

114
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
𝑘
!
∑︁
𝑗
𝒘 𝑘+1 = 𝒘 𝑘 − 𝛼 𝛽 ∇f (𝒘 𝑘− 𝑗 ) . (10.4.6)
𝑗=0

Thus, 𝒘 𝑘 is updated using an exponentially weighted average of all past gradients. Choosing the momentum parameter
𝛽 in the interval (0, 1) ensures that the influence of previous gradients on the update decays exponentially. The concrete
value of 𝛽 determines the balance between the impact of recent and past gradients.
Intuitively, this (exponentially weighted) linear combination of the past gradients averages out some of the oscillation
observed for gradient descent in Figure 10.3 in the 𝑥1 coordinate, and thus “smoothes” the path. The partial derivative
in the 𝑥2 coordinate, along which the objective function is very flat, does not change much from one iterate to the next.
Thus, its proportion in the update is strengthened through the addition of momentum. This is observed in Figure 10.3.
As mentioned earlier, the heavy ball method can be interpreted as a discretization of the dynamics of a ball rolling
down the valley of the loss landscape. If the ball has positive mass, i.e. is “heavy”, its momentum prevents the ball
from bouncing back and forth too strongly. The following remark further elucidates this connection.

Remark 10.24 As pointed out, e.g., in [160, 162], for suitable choices of 𝛼 and 𝛽, (10.4.5) can be interpreted as a
discretization of the second-order ODE

𝑚𝒘 ′′ (𝑡) = −∇f (𝒘(𝑡)) − 𝑟𝒘 ′ (𝑡). (10.4.7)

This equation describes the movement of a point mass 𝑚 under influence of the force field −∇f (𝒘(𝑡)); the term
−𝒘 ′ (𝑡), which points in the negative direction of the current velocity, corresponds to friction, and 𝑟 > 0 is the friction
coefficient. The discretization
𝒘 𝑘+1 − 2𝒘 𝑘 + 𝒘 𝑘−1 𝒘 𝑘+1 − 𝒘 𝑘
𝑚 = −∇f (𝒘 𝑘 ) −
ℎ2 ℎ
then leads to
ℎ2 𝑚
𝒘 𝑘+1 = 𝒘 𝑘 − ∇f (𝒘 𝑘 ) + (𝒘 𝑘 − 𝒘 𝑘−1 ), (10.4.8)
𝑚 − 𝑟ℎ 𝑚 − 𝑟ℎ
| {z } | {z }
=𝛼 =𝛽

and thus to (10.4.5), [162].


Letting 𝑚 = 0 in (10.4.8), we recover the gradient descent update (10.1.2). Hence, the positive mass corresponds to
the momentum term. Similarly, letting 𝑚 = 0 in the continuous dynamics (10.4.7), we obtain the gradient flow (10.1.3).
The key difference between these equations is that −∇f (𝒘(𝑡)) represents the velocity of 𝒘(𝑡) in (10.1.3), whereas in
(10.4.7), up to the friction term, it corresponds to an acceleration.

Let us sketch an argument to show that (10.4.5) improves the convergence over plain gradient descent for the
objective function (10.4.1). Denoting 𝒘 𝑘 = (𝑤 𝑘,1 , 𝑤 𝑘,2 ) ⊤ ∈ R2 , we obtain from (10.4.5) and the definition of f in
(10.4.1)
    
𝑤 𝑘+1, 𝑗 1 + 𝛽 − 𝛼𝜆 𝑗 − 𝛽 𝑤 𝑘, 𝑗
= (10.4.9)
𝑤 𝑘, 𝑗 1 0 𝑤 𝑘−1, 𝑗

for 𝑗 ∈ {1, 2} and 𝑘 ≥ 1. The smaller the modulus of the eigenvalues of the matrix in (10.4.9), the faster the convergence
towards the minimizer 𝑤∗, 𝑗 = 0 ∈ R for arbitrary initialization. Hence, the goal is to choose 𝛼 > 0 and 𝛽 ∈ (0, 1) such
that the maximal modulus of the eigenvalues of the matrix for 𝑗 ∈ {1, 2} is possibly small. We omit the details of this
calculation (also see [160, 144, 62]), but mention that this is obtained for
 2 2  √𝜆 − √𝜆  2
1 2
𝛼= √ √ and 𝛽= √ √ .
𝜆1 + 𝜆2 𝜆1 + 𝜆2
With these choices, the modulus of the maximal eigenvalue is bounded by

115
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]

√︁ 𝜅−1
𝛽= √ ∈ [0, 1),
𝜅+1

where again 𝜅 = 𝜆1 /𝜆2 . Due to (10.4.9), this expression gives a rate of convergence for (10.4.5). Contrary to gradient
descent, see (10.4.3), for this problem the heavy ball method achieves a convergence rate that only depends on the
square root of the condition number 𝜅. This explains the improved performance observed in Figure 10.3.

10.4.2 Nesterov acceleration

Nesterov’s accelerated gradient method (NAG) [140, 139], is a refinement of the heavy ball method. After initializing
𝒗0 , 𝒘0 ∈ R𝑛 , the update is formulated as the two-step process

𝒗 𝑘+1 = 𝒘 𝑘 − 𝛼∇f (𝒘 𝑘 ) (10.4.10a)


𝒘 𝑘+1 = 𝒗 𝑘+1 + 𝛽(𝒗 𝑘+1 − 𝒗 𝑘 ), (10.4.10b)

where again 𝛼 > 0 and 𝛽 ∈ (0, 1) are hyperparameters. Substituting the second line into the first we get

𝒗 𝑘+1 = 𝒗 𝑘 − 𝛼∇f (𝒘 𝑘 ) + 𝛽(𝒗 𝑘 − 𝒗 𝑘−1 ).

Comparing with the heavy ball method (10.4.5), the key difference is that the gradient is not evaluated at the current
position 𝒗 𝑘 , but instead at the point 𝒘 𝑘 = 𝒗 𝑘 + 𝛽(𝒗 𝑘 − 𝒗 𝑘−1 ), which can be interpreted as an estimate of the position at
the next iteration.
We next discuss the convergence for 𝐿-smooth and 𝜇-strongly convex objective functions f. It turns out, that
these conditions are not sufficient in order for the heavy ball method (10.4.5) to converge, and one can construct
counterexamples [117]. This is in contrast to NAG, as the next √︁ theorem shows. To give the analysis, it is convenient to
first rewrite (10.4.10) as a three sequence update: Let 𝜏 = 𝜇/𝐿, 𝛼 = 1/𝐿, and 𝛽 = (1 − 𝜏)/(1 + 𝜏). After initializing
𝒘0 , 𝒗0 ∈ R𝑛 , (10.4.10) can also be written as 𝒖 0 = ((1 + 𝜏)𝒘0 − 𝒗0 )/𝜏 and for 𝑘 ∈ N0
𝜏 1
𝒘𝑘 = 𝒖𝑘 + 𝒗𝑘 (10.4.11a)
1+𝜏 1+𝜏
1
𝒗 𝑘+1 = 𝒘 𝑘 − ∇f (𝒘 𝑘 ) (10.4.11b)
𝐿
𝜏
𝒖 𝑘+1 = 𝒖 𝑘 + 𝜏 · (𝒘 𝑘 − 𝒖 𝑘 ) − ∇f (𝒘 𝑘 ), (10.4.11c)
𝜇
see Exercise 10.36.
The proof of the following theorem proceeds along the lines of [205, 215].

√︁ Let 𝑛 ∈ N and 𝐿, 𝜇 > 0. Let f :𝑛 R → R be 𝐿-smooth and 𝜇-strongly convex. Further, let 𝒗0 , 𝒘0 ∈ R
Theorem 10.25 𝑛 𝑛

and let 𝜏 = 𝜇/𝐿. Let (𝒘 𝑘 , 𝒗 𝑘+1 , 𝒖 𝑘+1 ) ∞𝑘=0 ⊆ R be defined by (10.4.11a), and let 𝒘 ∗ be the unique minimizer of f.
Then, for all 𝑘 ∈ N0 , it holds that
√︂  
2 2 𝜇 𝑘 𝜇 
∥𝒖 𝑘 − 𝒘∗ ∥ ≤ 1− Φ(𝒗0 ) − Φ(𝒘∗ ) + ∥𝒖 0 − 𝒘∗ ∥ 2 , (10.4.12a)
𝜇 𝐿 2
√︂
 𝜇 𝑘  𝜇 
f (𝒗 𝑘 ) − f (𝒘∗ ) ≤ 1 − Φ(𝒗0 ) − Φ(𝒘∗ ) + ∥𝒖 0 − 𝒘∗ ∥ 2 . (10.4.12b)
𝐿 2
Proof Define
𝜇
𝑒 𝑘 := Φ(𝒗 𝑘 ) − Φ(𝒘∗ ) + ∥𝒖 𝑘 − 𝒘∗ ∥ 2 . (10.4.13)
2

116
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
To show (10.4.12), it suffices to prove with 𝑐 = 1 − 𝜏 that 𝑒 𝑘+1 ≤ 𝑐𝑒 𝑘 for all 𝑘 ∈ N0 .
We start with the last term in (10.4.13). By (10.4.11c)
𝜇 𝜇
∥𝒖 𝑘+1 − 𝒘∗ ∥ 2 − ∥𝒖 𝑘 − 𝒘∗ ∥ 2
2 2
𝜇 𝜇
= ∥𝒖 𝑘+1 − 𝒖 𝑘 + 𝒖 𝑘 − 𝒘∗ ∥ 2 − ∥𝒖 𝑘 − 𝒘∗ ∥ 2
2 2
 !
𝜇 𝜇 𝜏
= ∥𝒖 𝑘+1 − 𝒖 𝑘 ∥ 2 + · 2 𝜏 · (𝒘 𝑘 − 𝒖 𝑘 ) − ∇f (𝒘 𝑘 ), 𝒖 𝑘 − 𝒘∗
2 2 𝜇
𝜇
= ∥𝒖 𝑘+1 − 𝒖 𝑘 ∥ 2 + 𝜏 ⟨∇f (𝒘 𝑘 ), 𝒘∗ − 𝒖 𝑘 ⟩ − 𝜏𝜇 ⟨𝒘 𝑘 − 𝒖 𝑘 , 𝒘∗ − 𝒖 𝑘 ⟩ . (10.4.14)
2
From (10.4.11a) we have 𝜏𝒖 𝑘 = (1 + 𝜏)𝒘 𝑘 − 𝒗 𝑘 so that

𝜏 · (𝒘 𝑘 − 𝒖 𝑘 ) = 𝜏𝒘 𝑘 − (1 + 𝜏)𝒘 𝑘 + 𝒗 𝑘 = 𝒗 𝑘 − 𝒘 𝑘 (10.4.15)

and using 𝜇-strong convexity (10.1.15), we get

𝜏 ⟨∇f (𝒘 𝑘 ), 𝒘∗ − 𝒖 𝑘 ⟩ = 𝜏 ⟨∇f (𝒘 𝑘 ), 𝒘 𝑘 − 𝒖 𝑘 ⟩ + 𝜏 ⟨∇f (𝒘 𝑘 ), 𝒘∗ − 𝒘 𝑘 ⟩


𝜏𝜇
≤ ⟨∇f (𝒘 𝑘 ), 𝒗 𝑘 − 𝒘 𝑘 ⟩ − 𝜏 · (Φ(𝒘 𝑘 ) − Φ(𝒘∗ )) − ∥𝒘 𝑘 − 𝒘∗ ∥ 2 .
2
Moreover,
𝜏𝜇
− ∥𝒘 𝑘 − 𝒘∗ ∥ 2 − 𝜏𝜇 ⟨𝒘 𝑘 − 𝒖 𝑘 , 𝒘∗ − 𝒖 𝑘 ⟩
2
𝜏𝜇  
=− ∥𝒘 𝑘 − 𝒘∗ ∥ 2 − 2 ⟨𝒘 𝑘 − 𝒖 𝑘 , 𝒘 𝑘 − 𝒘∗ ⟩ + 2 ⟨𝒘 𝑘 − 𝒖 𝑘 , 𝒘 𝑘 − 𝒖 𝑘 ⟩
2
𝜏𝜇
= − (∥𝒖 𝑘 − 𝒘∗ ∥ 2 + ∥𝒘 𝑘 − 𝒖 𝑘 ∥ 2 ).
2
Thus, (10.4.14) is bounded by
𝜇
∥𝒖 𝑘+1 − 𝒖 𝑘 ∥ 2 + ⟨∇f (𝒘 𝑘 ), 𝒗 𝑘 − 𝒘 𝑘 ⟩ − 𝜏 · (Φ(𝒘 𝑘 ) − Φ(𝒘∗ ))
2
𝜏𝜇 𝜏𝜇
− ∥𝒖 𝑘 − 𝒘∗ ∥ 2 − ∥𝒘 𝑘 − 𝒖 𝑘 ∥ 2
2 2
which gives with 𝑐 = 1 − 𝜏
𝜇 𝜇 𝜇
∥𝒖 𝑘+1 − 𝒘∗ ∥ 2 ≤ 𝑐 ∥𝒖 𝑘 − 𝒘∗ ∥ 2 + ∥𝒖 𝑘+1 − 𝒖 𝑘 ∥ 2
2 2 2
𝜏𝜇
+ ⟨∇f (𝒘 𝑘 ), 𝒗 𝑘 − 𝒘 𝑘 ⟩ − 𝜏 · (Φ(𝒘 𝑘 ) − Φ(𝒘∗ )) − ∥𝒘 𝑘 − 𝒖 𝑘 ∥ 2 . (10.4.16)
2
To bound the first term in (10.4.13), we use 𝐿-smoothness (10.1.4) and (10.4.11b)
𝐿 1
Φ(𝒗 𝑘+1 ) − Φ(𝒘 𝑘 ) ≤ ⟨∇Φ(𝒘 𝑘 ), 𝒗 𝑘+1 − 𝒘 𝑘 ⟩ + ∥𝒗 𝑘+1 − 𝒘 𝑘 ∥ 2 = − ∥∇Φ(𝒘 𝑘 ) ∥ 2 ,
2 2𝐿
so that
1
Φ(𝒗 𝑘+1 ) − Φ(𝒘∗ ) − 𝜏 · (Φ(𝒘 𝑘 ) − Φ(𝒘∗ )) ≤ (1 − 𝜏) (Φ(𝒘 𝑘 ) − Φ(𝒘∗ )) − ∥∇Φ(𝒘 𝑘 ) ∥ 2
2𝐿
1
= 𝑐 · (Φ(𝒗 𝑘 ) − Φ(𝒘∗ )) + 𝑐 · (Φ(𝒘 𝑘 ) − Φ(𝒗 𝑘 )) − ∥∇Φ(𝒘 𝑘 ) ∥ 2 . (10.4.17)
2𝐿
Now, (10.4.16) and (10.4.17) imply

117
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
𝑒 𝑘+1 ≤ 𝑐𝑒 𝑘 + 𝑐 · (Φ(𝒘 𝑘 ) − Φ(𝒗 𝑘 )) − (1/(2𝐿)) ∥∇Φ(𝒘 𝑘 ) ∥² + (𝜇/2) ∥𝒖 𝑘+1 − 𝒖 𝑘 ∥²
+ ⟨∇f (𝒘 𝑘 ), 𝒗 𝑘 − 𝒘 𝑘 ⟩ − (𝜏𝜇/2) ∥𝒘 𝑘 − 𝒖 𝑘 ∥² .
Since we wish to bound 𝑒 𝑘+1 by 𝑐𝑒 𝑘 , we now show that all terms except 𝑐𝑒 𝑘 on the right-hand side of the inequality
above sum up to a non-positive value. By (10.4.11c) and (10.4.15)

(𝜇/2) ∥𝒖 𝑘+1 − 𝒖 𝑘 ∥² = (𝜇/2) ∥𝒗 𝑘 − 𝒘 𝑘 ∥² − 𝜏 ⟨∇f (𝒘 𝑘 ), 𝒗 𝑘 − 𝒘 𝑘 ⟩ + (𝜏²/(2𝜇)) ∥∇f (𝒘 𝑘 ) ∥² .
Moreover, using 𝜇-strong convexity

⟨∇f (𝒘 𝑘 ), 𝒗 𝑘 − 𝒘 𝑘 ⟩ ≤ 𝜏 ⟨∇f (𝒘 𝑘 ), 𝒗 𝑘 − 𝒘 𝑘 ⟩ + (1 − 𝜏) (Φ(𝒗 𝑘 ) − Φ(𝒘 𝑘 ) − (𝜇/2) ∥𝒗 𝑘 − 𝒘 𝑘 ∥²) .
Thus, we arrive at
𝑒 𝑘+1 ≤ 𝑐𝑒 𝑘 + 𝑐 · (Φ(𝒘 𝑘 ) − Φ(𝒗 𝑘 )) − (1/(2𝐿)) ∥∇Φ(𝒘 𝑘 ) ∥² + (𝜇/2) ∥𝒗 𝑘 − 𝒘 𝑘 ∥²
− 𝜏 ⟨∇f (𝒘 𝑘 ), 𝒗 𝑘 − 𝒘 𝑘 ⟩ + (𝜏²/(2𝜇)) ∥∇f (𝒘 𝑘 ) ∥² + 𝜏 ⟨∇f (𝒘 𝑘 ), 𝒗 𝑘 − 𝒘 𝑘 ⟩
+ 𝑐 · (Φ(𝒗 𝑘 ) − Φ(𝒘 𝑘 )) − 𝑐 (𝜇/2) ∥𝒗 𝑘 − 𝒘 𝑘 ∥² − (𝜏𝜇/2) ∥𝒘 𝑘 − 𝒖 𝑘 ∥²
= 𝑐𝑒 𝑘 + (𝜏²/(2𝜇) − 1/(2𝐿)) ∥∇f (𝒘 𝑘 ) ∥² + (𝜇/2) (𝜏 − 1/𝜏) ∥𝒘 𝑘 − 𝒗 𝑘 ∥²
≤ 𝑐𝑒 𝑘 ,
where we used once more (10.4.15), and the fact that 𝜏²/(2𝜇) − 1/(2𝐿) = 0 and 𝜏 − 1/𝜏 ≤ 0 since 𝜏 = √(𝜇/𝐿) ∈ (0, 1]. □
Comparing the result for gradient descent (10.1.16) with NAG (10.4.12), the improvement lies in the convergence
rate, which is 1 − 𝜅 −1 for gradient descent (also see Remark 10.15), and 1 − 𝜅 −1/2 for NAG, where 𝜅 = 𝐿/𝜇. In
contrast to gradient descent, for NAG the convergence depends only on the square root of the condition number 𝜅. For
ill-conditioned problems where 𝜅 is large, we therefore expect much better performance for accelerated methods.
Finally, we mention that NAG also achieves faster convergence in the case of 𝐿-smooth and convex objective
functions. While the error decays like 𝑂 (𝑘 −1 ) for gradient descent, see Theorem 10.11, for NAG one obtains convergence
𝑂 (𝑘 −2 ), see [140, 138, 215].
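As an illustration of the two-sequence form (10.4.10) of NAG, the following minimal NumPy sketch applies it to a made-up strongly convex quadratic; the objective, its eigenvalues, the number of iterations and all variable names are illustrative assumptions and not part of the text.

    import numpy as np

    # Illustrative L-smooth, mu-strongly convex objective f(w) = 0.5 * w^T diag(lam) w
    lam = np.array([100.0, 1.0])                 # so L = 100 and mu = 1
    grad = lambda w: lam * w

    L, mu = lam.max(), lam.min()
    tau = np.sqrt(mu / L)
    alpha, beta = 1.0 / L, (1.0 - tau) / (1.0 + tau)

    v_prev = v = np.array([1.0, 1.0])            # initialization v_{-1} = v_0
    for k in range(200):
        w = v + beta * (v - v_prev)              # look-ahead point w_k
        v_prev, v = v, w - alpha * grad(w)       # gradient step at w_k gives v_{k+1}

    print(0.5 * np.sum(lam * v**2))              # f(v_k) decays roughly like (1 - sqrt(mu/L))^k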

10.5 Other methods

In recent years, a multitude of first order (gradient descent) methods has been proposed and studied for the training
of neural networks. They typically employ (a subset of) three critical strategies: mini-batches, acceleration, and
adaptive step sizes. The concepts of mini-batches and acceleration have been covered in the previous sections, and
we will touch upon adaptive learning rates in the present one. Specifically, we present three algorithms—AdaGrad,
RMSProp, and Adam—which have been among the most influential in the field, and serve to explore the main ideas. An
intuitive overview of first order methods can also be found in [172], which discusses additional variants that are omitted
here. Moreover, in practice, various other techniques and heuristics such as batch normalization, gradient clipping,
data augmentation, regularization and dropout, early stopping, specific weight initializations etc. are used. We do not
discuss them here, and refer to [22] or to [60, Chapter 11] for a practitioner's guide.

After initializing 𝒎 0 = 0 ∈ R𝑛 , 𝒗0 = 0 ∈ R𝑛 , and 𝒘0 ∈ R𝑛 , all methods discussed below are special cases of the
update

𝒎 𝑘+1 = 𝛽1 𝒎 𝑘 + 𝛽2 ∇f (𝒘 𝑘 ) (10.5.1a)
𝒗 𝑘+1 = 𝛾1 𝒗 𝑘 + 𝛾2 ∇f (𝒘 𝑘 ) ⊙ ∇f (𝒘 𝑘 ) (10.5.1b)
𝒘 𝑘+1 = 𝒘 𝑘 − 𝛼 𝑘 𝒎 𝑘+1 ⊘ (√𝒗 𝑘+1 + 𝜀) (10.5.1c)

for 𝑘 ∈ N0 , and certain hyperparameters 𝛼 𝑘 , 𝛽1 , 𝛽2 , 𝛾1 , 𝛾2 , and 𝜀. Here ⊙ and ⊘ denote the componentwise multiplication
and division, respectively, and √𝒗 𝑘+1 + 𝜀 is understood as the vector (√𝑣 𝑘+1,𝑖 + 𝜀)𝑖 . We will give some default values for
those hyperparameters in the following, but mention that careful problem dependent tuning can significantly improve
the performance. Equation (10.5.1a) corresponds to heavy ball momentum if 𝛽1 > 0. If 𝛽1 = 0, then 𝒎 𝑘+1 is simply a
multiple of the current gradient. Equation (10.5.1b) defines a weight vector 𝒗 𝑘+1 that is used to set the componentwise
learning rate in the update of the parameter in (10.5.1c). These type of methods are often applied using mini-batches,
see Section 10.2. For simplicity we present them with the full gradients.
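For concreteness, the generic scheme (10.5.1) can be condensed into a single NumPy routine; AdaGrad, RMSProp and Adam then correspond to particular hyperparameter choices. This is only a sketch, and the function name and signature are our own.

    import numpy as np

    def generic_step(w, m, v, g, alpha, beta1, beta2, gamma1, gamma2, eps):
        """One step of (10.5.1); g is grad f(w_k) or a mini-batch estimate of it."""
        m = beta1 * m + beta2 * g                  # (10.5.1a)
        v = gamma1 * v + gamma2 * g * g            # (10.5.1b)
        w = w - alpha * m / (np.sqrt(v) + eps)     # (10.5.1c)
        return w, m, v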

10.5.1 AdaGrad

In Section 10.2 we argued that for stochastic methods the learning rate should decrease in order to get convergence.
The choice of how to decrease the learning rate can significantly impact performance. AdaGrad [52], which stands for
adaptive gradient algorithm, provides a method to dynamically adjust learning rates during optimization. Moreover, it
does so by using individual learning rates for each component.
AdaGrad corresponds to (10.5.1) with

𝛽1 = 0, 𝛾1 = 𝛽2 = 𝛾2 = 1, 𝛼𝑘 = 𝛼 for all 𝑘 ∈ N0 .

This leaves the hyperparameters 𝜀 > 0 and 𝛼 > 0. The constant 𝜀 > 0 is chosen small but positive to avoid division by
zero in (10.5.1c). Possible default values are 𝛼 = 0.01 and 𝜀 = 10−8 . The AdaGrad update then reads

𝒗 𝑘+1 = 𝒗 𝑘 + ∇f (𝒘 𝑘 ) ⊙ ∇f (𝒘 𝑘 )
𝒘 𝑘+1 = 𝒘 𝑘 − 𝛼∇f (𝒘 𝑘 ) ⊘ (√𝒗 𝑘+1 + 𝜀).

Due to
𝒗 𝑘+1 = ∑_{𝑗=0}^{𝑘} ∇f (𝒘 𝑗 ) ⊙ ∇f (𝒘 𝑗 ), (10.5.2)

the algorithm scales the gradient ∇f (𝒘 𝑘 ) in the update component-wise by the inverse square root of the sum over all
past squared gradients plus 𝜀. Note that the scaling factor (𝑣 𝑘+1,𝑖 + 𝜀) −1/2 for component 𝑖 will be large, if the previous
gradients for that component were small, and vice versa. In the words of the authors of [52]: “our procedures give
frequently occurring features very low learning rates and infrequent features high learning rates.”

Remark 10.26 A benefit of the componentwise scaling can be observed for the ill-conditioned objective function in
(10.4.1). Since in this case ∇f (𝒘 𝑗 ) = (𝜆1 𝑤 𝑗,1 , 𝜆2 𝑤 𝑗,2 ) ⊤ for each 𝑗 = 0, . . . , 𝑘, setting 𝜀 = 0 AdaGrad performs the update

𝒘 𝑘+1 = 𝒘 𝑘 − 𝛼 (𝑤 𝑘,1 (∑_{𝑗=0}^{𝑘} 𝑤²_{𝑗,1})^{−1/2} , 𝑤 𝑘,2 (∑_{𝑗=0}^{𝑘} 𝑤²_{𝑗,2})^{−1/2}) ⊤ .


Note how the 𝜆 1 and 𝜆2 factors in the update have vanished due to the division by 𝒗 𝑘+1 . This makes the method
invariant to a componentwise rescaling of the gradient, and results in a more direct path towards the minimizer.
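The effect described in Remark 10.26 can be checked numerically with the routine generic_step sketched above; the eigenvalues 𝜆1 = 100, 𝜆2 = 1 and all other values below are illustrative assumptions.

    lam = np.array([100.0, 1.0])            # assumed eigenvalues of the quadratic objective (10.4.1)
    grad = lambda w: lam * w

    w = np.array([1.0, 1.0])
    m, v = np.zeros(2), np.zeros(2)
    for k in range(500):                     # AdaGrad: beta1 = 0, beta2 = gamma1 = gamma2 = 1
        w, m, v = generic_step(w, m, v, grad(w), alpha=0.01,
                               beta1=0.0, beta2=1.0, gamma1=1.0, gamma2=1.0, eps=1e-8)
    print(w)                                 # both components shrink at a comparable rate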

10.5.2 RMSProp

The sum of past squared gradients can increase rapidly, leading to a significant reduction in learning rates when training
neural networks with AdaGrad. This often results in slow convergence, see for example [216]. RMSProp [80] seeks to
rectify this by adjusting the learning rates using an exponentially weighted average of past gradients.
RMSProp corresponds to (10.5.1) with

𝛽1 = 0, 𝛽2 = 1, 𝛾2 = 1 − 𝛾1 ∈ (0, 1), 𝛼𝑘 = 𝛼 for all 𝑘 ∈ N0 ,

effectively leaving the hyperparameters 𝜀 > 0, 𝛾1 ∈ (0, 1) and 𝛼 > 0. Typically, recommended default values are
𝜀 = 10−8 , 𝛼 = 0.01 and 𝛾1 = 0.9. The algorithm is given through

𝒗 𝑘+1 = 𝛾1 𝒗 𝑘 + (1 − 𝛾1 )∇f (𝒘 𝑘 ) ⊙ ∇f (𝒘 𝑘 ) (10.5.3a)
𝒘 𝑘+1 = 𝒘 𝑘 − 𝛼∇f (𝒘 𝑘 ) ⊘ (√𝒗 𝑘+1 + 𝜀). (10.5.3b)

Note that
𝒗 𝑘+1 = (1 − 𝛾1 ) ∑_{𝑗=0}^{𝑘} 𝛾1^𝑗 ∇f (𝒘 𝑘−𝑗 ) ⊙ ∇f (𝒘 𝑘−𝑗 ),

so that, contrary to AdaGrad (10.5.2), the influence of gradient ∇f (𝒘 𝑘− 𝑗 ) on the weight 𝒗 𝑘+1 decays exponentially in 𝑗.

10.5.3 Adam

Adam [101], short for adaptive moment estimation, combines adaptive learning rates based on exponentially weighted
averages as in RMSProp, with heavy ball momentum. Contrary to AdaGrad and RMSProp, it thus uses a value 𝛽1 > 0.
More precisely, Adam corresponds to (10.5.1) with
𝛽2 = 1 − 𝛽1 ∈ (0, 1), 𝛾2 = 1 − 𝛾1 ∈ (0, 1), 𝛼 𝑘 = 𝛼 √(1 − 𝛾1^{𝑘+1}) / (1 − 𝛽1^{𝑘+1})

for all 𝑘 ∈ N0 , for some 𝛼 > 0. The default values for the remaining parameters recommended in [101] are 𝜀 = 10−8 ,
𝛼 = 0.001, 𝛽1 = 0.9 and 𝛾1 = 0.999. The update can be formulated as
𝒎 𝑘+1 = 𝛽1 𝒎 𝑘 + (1 − 𝛽1 )∇f (𝒘 𝑘 ), 𝒎ˆ 𝑘+1 = 𝒎 𝑘+1 /(1 − 𝛽1^{𝑘+1}) (10.5.4a)
𝒗 𝑘+1 = 𝛾1 𝒗 𝑘 + (1 − 𝛾1 )∇f (𝒘 𝑘 ) ⊙ ∇f (𝒘 𝑘 ), 𝒗ˆ 𝑘+1 = 𝒗 𝑘+1 /(1 − 𝛾1^{𝑘+1}) (10.5.4b)
𝒘 𝑘+1 = 𝒘 𝑘 − 𝛼 𝒎ˆ 𝑘+1 ⊘ (√𝒗ˆ 𝑘+1 + 𝜀). (10.5.4c)

Note that 𝒎 𝑘+1 equals


𝒎 𝑘+1 = (1 − 𝛽1 ) ∑_{𝑗=0}^{𝑘} 𝛽1^𝑗 ∇f (𝒘 𝑘−𝑗 )

and thus corresponds to heavy ball style momentum with momentum parameter 𝛽 = 𝛽1 , see (10.4.6). The normalized
version 𝒎ˆ 𝑘+1 is introduced to account for the bias towards 0, stemming from the initialization 𝒎 0 = 0. The weight-vector

𝒗 𝑘+1 in (10.5.4b) is analogous to the exponentially weighted average of RMSProp in (10.5.3a), and the normalization
again serves to counter the bias from 𝒗0 = 0.
It should be noted that there exist examples of convex functions for which Adam does not converge to a minimizer,
see [168]. The authors of [168] propose a modification termed AMSGrad, which avoids this issue and their analysis also
applies to RMSProp. Nonetheless, Adam remains a highly popular and successful algorithm for the training of neural
networks. We also mention that the proof of convergence in the stochastic setting requires 𝑘-dependent decreasing
learning rates such as 𝛼 = 𝑂 (𝑘 −1/2 ) in (10.5.3b) and (10.5.4c).
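A direct NumPy transcription of the Adam update (10.5.4) with the default hyperparameters could read as follows; this is a sketch using full gradients, and in practice grad would typically return a mini-batch gradient.

    import numpy as np

    def adam(grad, w0, alpha=0.001, beta1=0.9, gamma1=0.999, eps=1e-8, steps=1000):
        """Adam as in (10.5.4); grad maps w to (an estimate of) the gradient of f at w."""
        w = np.array(w0, dtype=float)
        m = np.zeros_like(w)
        v = np.zeros_like(w)
        for k in range(steps):
            g = grad(w)
            m = beta1 * m + (1.0 - beta1) * g               # (10.5.4a)
            v = gamma1 * v + (1.0 - gamma1) * g * g         # (10.5.4b)
            m_hat = m / (1.0 - beta1 ** (k + 1))            # corrects the bias from m_0 = 0
            v_hat = v / (1.0 - gamma1 ** (k + 1))           # corrects the bias from v_0 = 0
            w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)  # (10.5.4c)
        return w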

Bibliography and further reading

Section 10.1 on gradient descent is based on standard textbooks such as [20, 25, 142] and especially [139]. These are
also good references for further reading on convex optimization. In particular Theorem 10.11 and the Lemmas leading
up to it closely follow Nesterov’s arguments in [139]. Convergence proofs under the PL inequality can be found in
[99]. Stochastic gradient descent discussed in Section 10.2 originally dates back to Robbins and Monro [169]. The first
non-asymptotic convergence analysis for strongly convex objective functions was given in [134]. The proof presented
here is similar to [68] and in particular uses their choice of step size. A good overview of proofs for (stochastic)
gradient descent algorithms together with detailed references can be found in [58], and for a textbook specifically on
stochastic optimization also see [110]. The backpropagation algorithm discussed in Section 10.3 was first introduced
by Rumelhart, Hinton and Williams in [176]; for a more detailed discussion, see for instance [74]. The heavy ball
method in Section 10.4 goes back to Polyak [159]. To motivate the algorithm we proceed similar as in [62, 160, 162].
For the analysis of Nesterov acceleration [140], we follow the Lyapunov type proofs given in [205, 215]. Finally, for
Section 10.5 on other algorithms, we refer to the original works that introduced AdaGrad [52], RMSProp [80] and
Adam [101]. A good overview of gradient descent methods popular for deep learning can be found in [172]. Regarding
the analysis of RMSProp and Adam, we refer to [168] which gave an example of a convex function for which Adam
does not converge, and provide a provably convergent modification of the algorithm. Convergence proofs (for variations
of) AdaGrad and Adam can also be found in [46].
For a general discussion and analysis of optimization algorithms in machine learning see [23]. Details on implemen-
tations in Python can for example be found in [60], and for recommendations and tricks regarding the implementation
we also refer to [22, 113].

Exercises

Exercise 10.27 Let f ∈ 𝐶 1 (R𝑛 ). Show that f is convex in the sense of Definition 10.8 if and only if

f (𝒘) + ⟨∇f (𝒘), 𝒗 − 𝒘⟩ ≤ f (𝒗) for all 𝒘, 𝒗 ∈ R𝑛 .

Exercise 10.28 Find a function f : R → R that is 𝐿-smooth, satisfies the PL-inequality (10.1.19) for some 𝜇 > 0, has
a unique minimizer 𝑤∗ ∈ R, but is not convex and thus also not strongly convex.

Exercise 10.29 Prove Theorem 10.17, i.e. show that 𝐿-smoothness and the PL-inequality (10.1.19) yield linear con-
vergence of f (𝒘 𝑘 ) → f (𝒘∗ ) as 𝑘 → ∞.

Definition 10.30 For convex f : R𝑛 → R, 𝒈 ∈ R𝑛 is called a subgradient (or subdifferential) of f at 𝒗 if and only if

f (𝒘) ≥ f (𝒗) + ⟨𝒈, 𝒘 − 𝒗⟩ for all 𝒘 ∈ R𝑛 . (10.5.5)

The set of all subgradients of f at 𝒗 is denoted by 𝜕f (𝒗).

A subgradient always exists, i.e. 𝜕f (𝒗) is necessarily nonempty. This statement is also known under the name “Hyper-
plane separation theorem”. Subgradients generalize the notion of gradients for convex functions, since for any convex
continuously differentiable f, (10.5.5) is satisfied with 𝒈 = ∇f (𝒗).

Exercise 10.31 Let f : R𝑛 → R be convex and Lip(f) ≤ 𝐿. Show that for any 𝒈 ∈ 𝜕f (𝒗) holds ∥ 𝒈∥ ≤ 𝐿.

Exercise 10.32 Let f : R𝑛 → R be convex, Lip(f) ≤ 𝐿 and suppose that 𝒘∗ is a minimizer of f. Fix 𝒘0 ∈ R𝑛 , and for
𝑘 ∈ N0 define the subgradient descent update

𝒘 𝑘+1 := 𝒘 𝑘 − ℎ 𝑘 𝒈 𝑘 ,

where 𝒈 𝑘 is an arbitrary fixed element of 𝜕f (𝒘 𝑘 ). Show that


min_{𝑖≤𝑘} f (𝒘𝑖 ) − f (𝒘∗ ) ≤ (∥𝒘0 − 𝒘∗ ∥² + 𝐿² ∑_{𝑖=1}^{𝑘} ℎ²𝑖) / (2 ∑_{𝑖=1}^{𝑘} ℎ𝑖).

Hint: Start by recursively expanding ∥𝒘 𝑘 − 𝒘∗ ∥ 2 = · · · , and then apply the property of the subgradient.

Exercise 10.33 Consider the setting of Exercise 10.32. Determine step sizes ℎ1 , . . . , ℎ 𝑘 (which may depend on 𝑘, i.e.
ℎ 𝑘,1 , . . . , ℎ 𝑘,𝑘 ) such that for any arbitrarily small 𝛿 > 0

min_{𝑖≤𝑘} f (𝒘𝑖 ) − f (𝒘∗ ) = 𝑂 (𝑘^{−1/2+𝛿} ) as 𝑘 → ∞.

Exercise 10.34 Let 𝑨 ∈ R𝑛×𝑛 be symmetric positive semidefinite, 𝒃 ∈ R𝑛 and 𝑐 ∈ R. Denote the eigenvalues of 𝑨 by
𝜆 1 ≥ · · · ≥ 𝜆 𝑛 ≥ 0. Show that the objective function
f (𝒘) := (1/2) 𝒘 ⊤ 𝑨𝒘 + 𝒃 ⊤ 𝒘 + 𝑐 (10.5.6)
is convex and 𝜆1 -smooth. Moreover, if 𝜆 𝑛 > 0, then f is 𝜆 𝑛 -strongly convex. Show that these values are optimal in the
sense that f is neither 𝐿-smooth nor 𝜇-strongly convex if 𝐿 < 𝜆1 and 𝜇 > 𝜆 𝑛 .
Hint: Note that 𝐿-smoothness and 𝜇-strong convexity are invariant under shifts and the addition of constants. That
is, for every 𝛼 ∈ R and 𝜷 ∈ R𝑛 , f̃ (𝒘) := 𝛼 + f (𝒘 + 𝜷) is 𝐿-smooth or 𝜇-strongly convex if and only if f is. It thus
suffices to consider 𝒘 ⊤ 𝑨𝒘/2.

Exercise 10.35 Let f be as in Exercise 10.34. Show that gradient descent converges for arbitrary initialization 𝒘0 ∈ R𝑛 ,
if and only if

max |1 − ℎ𝜆 𝑗 | < 1.
𝑗=1,...,𝑛

Show that arg minℎ>0 max 𝑗=1,...,𝑛 |1 − ℎ𝜆 𝑗 | = 2/(𝜆1 + 𝜆 𝑛 ) and conclude that the convergence will be slow if f is
ill-conditioned, i.e. if 𝜆1 /𝜆 𝑛 ≫ 1.
Hint: Assume first that 𝒃 = 0 ∈ R𝑛 and 𝑐 = 0 ∈ R in (10.5.6), and use the singular value decomposition
𝑨 = 𝑼 ⊤ diag(𝜆1 , . . . , 𝜆 𝑛 )𝑼.
Exercise 10.36 Show that (10.4.10) can equivalently be written as (10.4.11) with 𝜏 = √(𝜇/𝐿), 𝛼 = 1/𝐿, 𝛽 = (1 − 𝜏)/(1 + 𝜏) and the initialization 𝒖 0 = ((1 + 𝜏)𝒘0 − 𝒗0 )/𝜏.

Chapter 11
Wide neural networks

In this chapter we explore the dynamics of training neural networks of large width. Throughout we focus on the situation
where we have data pairs
(𝒙𝑖 , 𝑦 𝑖 ) ∈ R𝑑 × R 𝑖 ∈ {1, . . . , 𝑚}, (11.0.1a)
and wish to train a neural network Φ(𝒙, 𝒘) depending on the input 𝒙 ∈ R𝑑 and the parameters 𝒘 ∈ R𝑛 , by minimizing
the square loss objective defined as
f (𝒘) := ∑_{𝑖=1}^{𝑚} (Φ(𝒙𝑖 , 𝒘) − 𝑦 𝑖 )² , (11.0.1b)

which is a multiple of the empirical risk R̂ 𝑆 (Φ) of (1.2.3) for the sample 𝑆 = ((𝒙𝑖 , 𝑦 𝑖 ))_{𝑖=1}^{𝑚} and the square loss. We
refer to the map from a parameter set to a set of neural networks with fixed architecture as a model. We exclusively focus on
gradient descent with a constant step size ℎ, which yields a sequence of parameters (𝒘 𝑘 ) 𝑘 ∈N . We aim to understand the
evolution of Φ(𝒙, 𝒘 𝑘 ) as 𝑘 progresses. For linear mappings 𝒘 ↦→ Φ(𝒙, 𝒘), the objective function (11.0.1b) is convex.
As established in the previous chapter, gradient descent then finds a global minimizer. For typical neural network
architectures, 𝒘 ↦→ Φ(𝒙, 𝒘) is not linear, and such a statement is in general not true.
Recent research has highlighted that neural network behavior tends to linearize in the parameters as network width
increases [94]. This allows to transfer some of the results and techniques from the linear case to the training of neural
networks. We start this chapter in Sections 11.1 and 11.2 by recalling (kernel) least-squares methods, which describe
linear (in 𝒘) models. Following [115], the subsequent sections explore why in the infinite width limit neural networks
exhibit linear-like behavior. In Section 11.3 we formally introduce the linearization of 𝒘 ↦→ Φ(𝒙, 𝒘). Section 11.4
presents an abstract result showing convergence of gradient descent, under the condition that Φ does not deviate too
much from its linearization. In Sections 11.5 and 11.6, we then detail the implications for wide neural networks for
two (slightly) different architectures. In particular, we will prove that gradient descent can find global minimizers
when applied to (11.0.1b) for networks of very large width. We emphasize that this analysis treats the case of strong
overparametrization, specifically when network width increases while keeping the number of data points 𝑚 fixed.

11.1 Linear least-squares

Arguably one of the simplest machine learning algorithms is linear least-squares regression. Given data (11.0.1a),
linear regression tries to fit a linear function Φ(𝒙, 𝒘) := 𝒙 ⊤ 𝒘 in terms of 𝒘 by minimizing f (𝒘) in (11.0.1b). With

𝑨 = (𝒙1 , . . . , 𝒙 𝑚 ) ⊤ ∈ R𝑚×𝑑 , i.e. the matrix with rows 𝒙1⊤ , . . . , 𝒙 𝑚⊤ , and 𝒚 = (𝑦 1 , . . . , 𝑦 𝑚 ) ⊤ ∈ R𝑚 (11.1.1)
it holds

f (𝒘) = ∥ 𝑨𝒘 − 𝒚∥ 2 . (11.1.2)
Remark 11.1 More generally, the ansatz Φ(𝒙, (𝒘, 𝑏)) := 𝒘 ⊤ 𝒙 + 𝑏 corresponds to

Φ(𝒙, (𝒘, 𝑏)) = (1, 𝒙 ⊤ ) (𝑏, 𝒘 ⊤ ) ⊤ .

Therefore, additionally allowing for a bias can be treated analogously.


The model Φ(𝒙, 𝒘) = 𝒙 ⊤ 𝒘 is linear in both 𝒙 and 𝒘. In particular, 𝒘 ↦→ f (𝒘) is a convex function by Exercise 10.34,
and we may apply the convergence results of Chapter 10 when using gradient based algorithms. If 𝑨 is invertible, then
f has a unique minimizer given by 𝒘∗ = 𝑨 −1 𝒚. If rank( 𝑨) = 𝑑, then f is strongly convex by Exercise 10.34, and there
still exists a unique minimizer. If however rank( 𝑨) < 𝑑, then ker( 𝑨) ≠ {0} and there exist infinitely many minimizers
of f. To ensure uniqueness, we look for the minimum norm solution (or minimum 2-norm solution)

𝒘∗ := arg min {∥𝒘∥ : 𝒘 ∈ R𝑑 , f (𝒘) ≤ f (𝒗) for all 𝒗 ∈ R𝑑 }. (11.1.3)

The following proposition establishes the uniqueness of 𝒘∗ and demonstrates that it can be represented as a superposition
of the data points 𝒙1 , . . . , 𝒙 𝑚 .

Proposition 11.2 Let 𝑨 ∈ R𝑚×𝑑 and 𝒚 ∈ R𝑚 be as in (11.1.1). There exists a unique minimum 2-norm solution of
(11.1.2). Denoting 𝐻˜ := span{𝒙1 , . . . , 𝒙 𝑚 } ⊆ R𝑑 , it is the unique element

𝒘∗ = arg min_{𝒘˜ ∈ 𝐻˜ } f (𝒘˜ ) ∈ 𝐻˜ . (11.1.4)

Proof We start with existence and uniqueness. Let 𝐶 ⊆ R𝑚 be the space spanned by the columns of 𝑨. Then 𝐶 is
closed and convex, and therefore 𝒚 ∗ = arg min𝒚˜ ∈𝐶 ∥ 𝒚 − 𝒚˜ ∥ exists and is unique (this is a fundamental property of
Hilbert spaces, see, e.g. [173, Thm. 4.10]). In particular, the set 𝑀 = {𝒘 ∈ R𝑑 | 𝑨𝒘 = 𝒚 ∗ } ⊆ R𝑑 of minimizers of f is
not empty. Clearly 𝑀 is also closed and convex. By the same argument as before, 𝒘∗ = arg min𝒘∗ ∈ 𝑀 ∥𝒘∗ ∥ exists and
is unique.
It remains to show (11.1.4). Denote by 𝒘∗ the minimum norm solution and decompose 𝒘∗ = 𝒘˜ + 𝒘ˆ with 𝒘˜ ∈ 𝐻˜ and
𝒘ˆ ∈ 𝐻˜ ⊥ . We have 𝑨𝒘∗ = 𝑨𝒘˜ and ∥𝒘∗ ∥ 2 = ∥ 𝒘∥˜ 2 + ∥ 𝒘∥
ˆ 2 . Since 𝒘∗ is the minimal norm solution it must hold 𝒘ˆ = 0.
Thus 𝒘∗ ∈ 𝐻. ˜ Finally assume there exists a minimizer 𝒗 of f in 𝐻˜ different from 𝒘∗ . Then 0 ≠ 𝒘∗ − 𝒗 ∈ 𝐻,˜ and since
𝐻˜ is spanned by the rows of 𝑨 we have 𝑨(𝒘∗ − 𝒗) ≠ 0. Thus 𝒚 ∗ = 𝑨𝒘∗ ≠ 𝑨𝒗, which contradicts that 𝒗 minimizes f. □
The condition of minimizing the 2-norm is a form of regularization. Interestingly, gradient descent converges to the
minimum norm solution for the quadratic objective (11.1.2), as long as 𝒘0 is initialized within 𝐻˜ = span{𝒙1 , . . . , 𝒙 𝑚 }
(e.g. 𝒘0 = 0). Therefore, it does not find an “arbitrary” minimizer but implicitly regularizes the problem in this sense.
In the following 𝑠max ( 𝑨) denotes the maximal singular value of 𝑨.
Theorem 11.3 Let 𝑨 ∈ R𝑚×𝑑 be as in (11.1.1), let 𝒘0 = 𝒘˜ 0 + 𝒘ˆ 0 where 𝒘˜ 0 ∈ 𝐻˜ and 𝒘ˆ 0 ∈ 𝐻˜ ⊥ . Fix
ℎ ∈ (0, 1/(2𝑠max ( 𝑨) 2 )) and set
𝒘 𝑘+1 := 𝒘 𝑘 − ℎ∇f (𝒘 𝑘 ) for all 𝑘 ∈ N (11.1.5)
with f in (11.1.2). Then
lim_{𝑘→∞} 𝒘 𝑘 = 𝒘∗ + 𝒘ˆ 0 .

We sketch the argument in case 𝒘0 ∈ 𝐻, ˜ and leave the full proof to the reader, see Exercise 11.29. Note that 𝐻˜ is
the space spanned by the rows of 𝑨 (or the columns of 𝑨⊤ ). The gradient of the objective function equals

∇f (𝒘) = 2𝑨⊤ ( 𝑨𝒘 − 𝒚).


˜ then the iterates of gradient descent never leave the subspace 𝐻.
Therefore, if 𝒘0 ∈ 𝐻, ˜ By Exercise 10.34 and Theorem
˜
10.11, for small enough step size, it holds f (𝒘 𝑘 ) → 0. By Proposition 11.2 there only exists one minimizer in 𝐻,
corresponding to the minimum norm solution. Thus 𝒘 𝑘 converges to the minimal norm solution.
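The statement of Theorem 11.3 is easy to verify numerically. The following sketch (with made-up data and 𝑚 < 𝑑, so that f has infinitely many minimizers) runs gradient descent from 𝒘0 = 0 and compares the result with the minimum norm solution computed via the pseudoinverse.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 20))            # m = 5 data points in d = 20 dimensions, rank(A) < d
    y = rng.standard_normal(5)

    w = np.zeros(20)                            # w0 = 0 lies in span{x_1, ..., x_m}
    h = 0.9 / (2 * np.linalg.norm(A, 2) ** 2)   # step size below 1/(2 s_max(A)^2)
    for _ in range(5000):
        w = w - h * 2 * A.T @ (A @ w - y)       # gradient of f(w) = ||Aw - y||^2

    w_star = np.linalg.pinv(A) @ y              # minimum 2-norm solution
    print(np.linalg.norm(w - w_star))           # close to zero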

11.2 Kernel least-squares

Let again (𝒙 𝑗 , 𝑦 𝑗 ) ∈ R𝑑 × R, 𝑗 = 1, . . . , 𝑚. In many applications linear models are too simplistic, and are not able to
capture the true relation between 𝒙 and 𝑦. Kernel methods allow to overcome this problem by introducing nonlinearity
in 𝒙, but retaining linearity in the parameter 𝒘.
Let 𝐻 be a Hilbert space with inner product ⟨·, ·⟩ 𝐻 , that is also referred to as the feature space. For a (typically
nonlinear) feature map 𝜙 : R𝑑 → 𝐻, consider the model

Φ(𝒙, 𝒘) = ⟨𝜙(𝒙), 𝒘⟩ 𝐻 (11.2.1)

with 𝒘 ∈ 𝐻. If 𝐻 = R𝑛 , the components of 𝜙 are referred to as features. With the objective function
f (𝒘) := ∑_{𝑗=1}^{𝑚} (⟨𝜙(𝒙 𝑗 ), 𝒘⟩ 𝐻 − 𝑦 𝑗 )² , 𝒘 ∈ 𝐻, (11.2.2)

we wish to determine a minimizer of f. To ensure uniqueness and regularize the problem, we again consider the
minimum 𝐻-norm solution
𝒘∗ := arg min {∥𝒘∥ 𝐻 : 𝒘 ∈ 𝐻, f (𝒘) ≤ f (𝒗) for all 𝒗 ∈ 𝐻 }.

As we will see below, 𝒘∗ is well-defined. We will call Φ(𝒙, 𝒘∗ ) = ⟨𝜙(𝒙), 𝒘∗ ⟩ 𝐻 the kernel least squares estimator.
The nonlinearity of the feature map allows for more expressive models 𝒙 ↦→ Φ(𝒙, 𝒘) capable of capturing more
complicated structures beyond linearity in the data.

Remark 11.4 (Gradient descent) Let 𝐻 = R𝑛 be equipped with the Euclidean inner product. Consider the sequence
(𝒘 𝑘 ) 𝑘 ∈N0 ⊆ R𝑛 generated by gradient descent to minimize (11.2.2). Assuming sufficiently small step size, by Theorem
11.3 for 𝒙 ∈ R𝑑
lim_{𝑘→∞} Φ(𝒙, 𝒘 𝑘 ) = ⟨𝜙(𝒙), 𝒘∗ ⟩ + ⟨𝜙(𝒙), 𝒘ˆ 0 ⟩ . (11.2.3)

Here, 𝒘ˆ 0 ∈ R𝑛 denotes the orthogonal projection of 𝒘0 ∈ R𝑛 onto 𝐻˜ ⊥ where 𝐻˜ := span{𝜙(𝒙1 ), . . . , 𝜙(𝒙 𝑚 )}. Gradient
descent thus yields the kernel least squares estimator plus ⟨𝜙(𝒙), 𝒘ˆ 0 ⟩. Notably, on the set

{𝒙 ∈ R𝑑 | 𝜙(𝒙) ∈ span{𝜙(𝒙1 ), . . . , 𝜙(𝒙 𝑚 )}}, (11.2.4)

(11.2.3) thus coincides with the kernel least squares estimator independent of the initialization 𝒘0 .

11.2.1 Examples

To motivate the concept of feature maps consider the following example from [135].

Example 11.5 Let 𝒙𝑖 ∈ R2 with associated labels 𝑦 𝑖 ∈ {−1, 1} for 𝑖 = 1, . . . , 𝑚. The goal is to find some model
Φ(·, 𝒘) : R2 → R, for which
sign(Φ(𝒙, 𝒘)) (11.2.5)
predicts the label 𝑦 of 𝒙. For a linear (in 𝒙) model

Φ(𝒙, (𝒘, 𝑏)) = 𝒙 ⊤ 𝒘 + 𝑏,

the decision boundary of (11.2.5) equals {𝒙 ∈ R2 | 𝒙 ⊤ 𝒘 + 𝑏 = 0} in R2 . Hence, by adjusting 𝒘 and 𝑏, (11.2.5) can
separate data by affine hyperplanes in R2 . Consider two datasets represented by light blue squares for +1 and red circles
for −1 labels:

[Figure: scatter plots of dataset 1 (left) and dataset 2 (right) in the (𝑥1 , 𝑥2 )-plane.]

The first dataset is separable by an affine hyperplane as depicted by the dashed line. Thus a linear model is capable of
correctly classifying all datapoints. For the second dataset this is not possible.
To enhance model expressivity, introduce a feature map 𝜙 : R2 → R6 via

𝜙(𝒙) = (1, 𝑥1 , 𝑥2 , 𝑥1 𝑥2 , 𝑥12 , 𝑥22 ) ⊤ ∈ R6 for all 𝒙 ∈ R2 . (11.2.6)

For 𝒘 ∈ R6 , this allows Φ(𝒙) = 𝒘 ⊤ 𝜙(𝒙) to represent arbitrary polynomials of degree 2. With this kernel approach,
the decision boundary of (11.2.5) becomes the set of all hyperplanes in the feature space passing through 0 ∈ R6 .
Visualizing the last two features of the second dataset, we obtain
[Figure: features 5 and 6, i.e. (𝑥1² , 𝑥2² ), of dataset 2.]

Note how in the feature space R6 , the datapoints are again separated by such a hyperplane. Thus, with the feature map
in (11.2.6), the predictor (11.2.5) can perfectly classify all points also for the second dataset.
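A brief sketch of the feature map (11.2.6) on a made-up version of the second dataset (points inside versus outside a circle); fitting a least-squares separator in feature space typically classifies all points correctly, while no affine hyperplane in R² can. All data and names below are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, size=(200, 2))
    y = np.where(np.sum(X**2, axis=1) < 0.5, 1.0, -1.0)    # labels depend on x1^2 + x2^2

    def phi(X):                                             # feature map (11.2.6)
        x1, x2 = X[:, 0], X[:, 1]
        return np.stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2], axis=1)

    F = phi(X)
    w = np.linalg.lstsq(F, y, rcond=None)[0]                # least-squares fit in feature space
    print(np.mean(np.sign(F @ w) == y))                     # typically 1.0, i.e. perfect separation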
In the above example we chose the feature space 𝐻 = R6 . It is also possible to work with infinite dimensional feature
spaces as the next example demonstrates.
Example 11.6 Let 𝐻 = ℓ 2 (N) be the space of square summable sequences and 𝜙 : R𝑑 → ℓ 2 (N) some map. Fitting the
corresponding model

Φ(𝒙, 𝒘) = ⟨𝜙(𝒙), 𝒘⟩ℓ 2 = ∑_{𝑖∈N} 𝜙𝑖 (𝒙)𝑤𝑖

to data (𝒙 𝑖 , 𝑦 𝑖 )_{𝑖=1}^{𝑚} requires minimizing

f (𝒘) = ∑_{𝑗=1}^{𝑚} (∑_{𝑖∈N} 𝜙𝑖 (𝒙 𝑗 )𝑤𝑖 − 𝑦 𝑗 )² , 𝒘 ∈ ℓ 2 (N).

Hence we have to determine an infinite sequence of parameters (𝑤𝑖 )𝑖 ∈N .

11.2.2 Kernel trick

At first glance, computing a (minimal 𝐻-norm) minimizer 𝒘 in the possibly infinite-dimensional Hilbert space 𝐻
seems infeasible. The so-called kernel trick allows to do this computation. To explain it, we first revisit the foundational
representer theorem.

Theorem 11.7 (Representer theorem) There is a unique minimum 𝐻-norm solution 𝒘∗ ∈ 𝐻 of (11.2.2). With
𝐻˜ := span{𝜙(𝒙1 ), . . . , 𝜙(𝒙 𝑚 )} it equals the unique element

𝒘∗ = arg min_{𝒘˜ ∈ 𝐻˜ } f (𝒘˜ ) ∈ 𝐻˜ . (11.2.7)

Proof Let 𝒘˜ 1 , . . . , 𝒘˜ 𝑛 be a basis of 𝐻˜ . If 𝐻˜ = {0} the statement is trivial, so we assume 1 ≤ 𝑛 ≤ 𝑚. Let 𝑨 =
(⟨𝜙(𝒙 𝑖 ), 𝒘˜ 𝑗 ⟩)𝑖 𝑗 ∈ R𝑚×𝑛 . Then it is clear that for every 𝜶 ∈ R𝑛 \ {0} it holds that ∑_{𝑗=1}^{𝑛} 𝛼 𝑗 𝒘˜ 𝑗 ∈ 𝐻˜ \ {0} and hence
𝑨𝜶 = (⟨𝜙(𝒙𝑖 ), ∑_{𝑗=1}^{𝑛} 𝛼 𝑗 𝒘˜ 𝑗 ⟩)_{𝑖=1}^{𝑚} ≠ 0. Therefore, 𝑨 is injective. Every 𝒘˜ ∈ 𝐻˜ has a unique representation 𝒘˜ = ∑_{𝑗=1}^{𝑛} 𝛼 𝑗 𝒘˜ 𝑗
for some 𝜶 ∈ R𝑛 . With this ansatz


f (𝒘˜ ) = ∑_{𝑖=1}^{𝑚} (⟨𝜙(𝒙 𝑖 ), 𝒘˜ ⟩ − 𝑦𝑖 )² = ∑_{𝑖=1}^{𝑚} (∑_{𝑗=1}^{𝑛} ⟨𝜙(𝒙𝑖 ), 𝒘˜ 𝑗 ⟩𝛼 𝑗 − 𝑦 𝑖 )² = ∥ 𝑨𝜶 − 𝒚∥² .

Since 𝑨 is injective, there exists a unique minimizer 𝜶 ∈ R𝑛 of the right-hand side, and thus there exists a unique
minimizer 𝒘∗ ∈ 𝐻˜ in (11.2.7).
For arbitrary 𝒘 ∈ 𝐻 we wish to show f (𝒘) ≥ f (𝒘∗ ), so that 𝒘∗ minimizes f in 𝐻. Decompose 𝒘 = 𝒘˜ + 𝒘ˆ with
𝒘˜ ∈ 𝐻˜ and 𝒘ˆ ∈ 𝐻˜ ⊥ , i.e. ⟨𝜙(𝒙 𝑗 ), 𝒘ˆ ⟩ 𝐻 = 0 for all 𝑗 = 1, . . . , 𝑚. Then, using that 𝒘∗ minimizes f in 𝐻˜ ,

f (𝒘) = ∑_{𝑗=1}^{𝑚} (⟨𝜙(𝒙 𝑗 ), 𝒘⟩ 𝐻 − 𝑦 𝑗 )² = ∑_{𝑗=1}^{𝑚} (⟨𝜙(𝒙 𝑗 ), 𝒘˜ ⟩ 𝐻 − 𝑦 𝑗 )² = f (𝒘˜ ) ≥ f (𝒘∗ ).

Finally, let 𝒘 ∈ 𝐻 be any minimizer of f in 𝐻 different from 𝒘∗ . It remains to show ∥𝒘∥ 𝐻 > ∥𝒘∗ ∥ 𝐻 . Decompose
again 𝒘 = 𝒘˜ + 𝒘ˆ with 𝒘˜ ∈ 𝐻˜ and 𝒘ˆ ∈ 𝐻˜ ⊥ . As above f (𝒘) = f (𝒘˜ ) and thus 𝒘˜ is a minimizer of f. Uniqueness of 𝒘∗ in
(11.2.7) implies 𝒘˜ = 𝒘∗ . Therefore 𝒘ˆ ≠ 0 and ∥𝒘∗ ∥²_𝐻 < ∥𝒘˜ ∥²_𝐻 + ∥𝒘ˆ ∥²_𝐻 = ∥𝒘∥²_𝐻 . □
Instead of looking for the minimum norm minimizer 𝒘∗ in the Hilbert space 𝐻, by Theorem 11.7 it suffices
to determine the unique minimizer in the at most 𝑚-dimensional subspace 𝐻˜ spanned by 𝜙(𝒙1 ), . . . , 𝜙(𝒙 𝑚 ). This
significantly simplifies the problem. To do so we first introduce the notion of kernels.
Definition 11.8 A symmetric function 𝐾 : R𝑑 × R𝑑 → R is called a kernel if for any 𝒙1 , . . . , 𝒙 𝑚 ∈ R𝑑 the kernel
matrix 𝑮 = (𝐾 (𝒙 𝑖 , 𝒙 𝑗 ))𝑖,𝑚𝑗=1 ∈ R𝑚×𝑚 is symmetric positive semidefinite.

Given a feature map 𝜙 : R𝑑 → 𝐻, it is easy to check that

𝐾 (𝒙, 𝒙 ′ ) := ⟨𝜙(𝒙), 𝜙(𝒙 ′ )⟩ 𝐻 for all 𝒙, 𝒙 ′ ∈ R𝑑 ,

defines a kernel. The corresponding kernel matrix 𝑮 ∈ R𝑚×𝑚 is given by

𝐺 𝑖 𝑗 = ⟨𝜙(𝒙 𝑖 ), 𝜙(𝒙 𝑗 )⟩ 𝐻 = 𝐾 (𝒙𝑖 , 𝒙 𝑗 ).

𝛼 𝑗 𝜙(𝒙 𝑗 ), minimizing the objective (11.2.2) in 𝐻˜ is equivalent to minimizing


Í𝑚
With the ansatz 𝒘 = 𝑗=1

∥𝑮𝜶 − 𝒚∥ 2 , (11.2.8)

in 𝜶 = (𝛼1 , . . . , 𝛼𝑚 ) ∈ R𝑚 .
Í𝑚
Proposition 11.9 Let 𝜶 ∈ R𝑚 be any minimizer of (11.2.8). Then 𝒘∗ = 𝑗=1 𝛼 𝑗 𝜙(𝒙 𝑗 ) is the (unique) minimum 𝐻-norm
solution of (11.2.2).

Proposition 11.9, the proof of which is left as an exercise, suggests the following algorithm to compute the kernel
least squares estimator:
(i) compute the kernel matrix 𝑮 = (𝐾 (𝒙𝑖 , 𝒙 𝑗 ))𝑖,𝑚𝑗=1 ,
(ii) determine a minimizer 𝜶 ∈ R𝑚 of ∥𝑮𝜶 − 𝒚∥,

(iii) evaluate Φ(𝒙, 𝒘∗ ) via

Φ(𝒙, 𝒘∗ ) = ⟨𝜙(𝒙), ∑_{𝑗=1}^{𝑚} 𝛼 𝑗 𝜙(𝒙 𝑗 )⟩ 𝐻 = ∑_{𝑗=1}^{𝑚} 𝛼 𝑗 𝐾 (𝒙, 𝒙 𝑗 ). (11.2.9)

Thus, minimizing (11.2.2) and expressing the kernel least squares estimator does neither require explicit knowledge of
the feature map 𝜙 nor of the minimum norm solution 𝒘∗ ∈ 𝐻. It is sufficient to choose a kernel map 𝐾 : R𝑑 × R𝑑 → R;
this is known as the kernel trick. Given a kernel 𝐾, we will therefore also refer to (11.2.9) as the kernel least squares
estimator without specifying 𝐻 or 𝜙.

Example 11.10 Common examples of kernels include the polynomial kernel

𝐾 (𝒙, 𝒙 ′ ) = (𝒙 ⊤ 𝒙 ′ + 𝑐) 𝑟 𝑐 ≥ 0, 𝑟 ∈ N,

the radial basis function (RBF) kernel

𝐾 (𝒙, 𝒙 ′ ) = exp(−𝑐∥𝒙 − 𝒙 ′ ∥ 2 ) 𝑐 > 0,

and the Laplace kernel


𝐾 (𝒙, 𝒙 ′ ) = exp(−𝑐∥𝒙 − 𝒙 ′ ∥ ) 𝑐 > 0.
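The three steps (i)–(iii) above can be condensed into a few lines; here is a sketch with the RBF kernel from Example 11.10 on synthetic one-dimensional data. All data and parameter values are illustrative assumptions.

    import numpy as np

    def rbf(X, Z, c=10.0):
        """Kernel matrix with entries exp(-c ||x_i - z_j||^2)."""
        d2 = np.sum(X**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * X @ Z.T
        return np.exp(-c * d2)

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(30, 1))                  # training inputs x_1, ..., x_m
    y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(30)

    G = rbf(X, X)                                         # (i) kernel matrix
    alpha = np.linalg.lstsq(G, y, rcond=None)[0]          # (ii) minimize ||G alpha - y||
    X_test = np.linspace(-1, 1, 200)[:, None]
    pred = rbf(X_test, X) @ alpha                         # (iii) estimator sum_j alpha_j K(x, x_j)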

Remark 11.11 If Ω ⊆ R𝑑 is compact and 𝐾 : Ω × Ω → R is a continuous kernel, then Mercer’s theorem implies
existence of a Hilbert space 𝐻 and a feature map 𝜙 : R𝑑 → 𝐻 such that

𝐾 (𝒙, 𝒙 ′ ) = ⟨𝜙(𝒙), 𝜙(𝒙 ′ )⟩ 𝐻 for all 𝒙, 𝒙 ′ ∈ Ω,

i.e. 𝐾 is the corresponding kernel. See for instance [193, Thm. 4.49].

11.3 Tangent kernel

Consider again a general model Φ(𝒙, 𝒘) with input 𝒙 ∈ R𝑑 and parameters 𝒘 ∈ R𝑛 . The goal remains to minimize the
square loss objective (11.0.1b) given the data (11.0.1a). If 𝒘 ↦→ Φ(𝒙, 𝒘) is not linear, then unlike in Sections 11.1 and
11.2, the objective function (11.0.1b) is in general not convex, and most results on first order methods in Chapter 10
are not directly applicable.
We now simplify the situation by linearizing the model in 𝒘 ∈ R𝑛 around the initialization: Fixing 𝒘0 ∈ R𝑛 , let

Φlin (𝒙, 𝒘) := Φ(𝒙, 𝒘0 ) + ∇𝒘 Φ(𝒙, 𝒘0 ) ⊤ (𝒘 − 𝒘0 ) for all 𝒘 ∈ R𝑛 , (11.3.1)

which is the first order Taylor approximation of Φ around the initial parameter 𝒘0 . Introduce the notation

𝛿𝑖 := Φ(𝒙𝑖 , 𝒘0 ) − ∇𝒘 Φ(𝒙𝑖 , 𝒘0 ) ⊤ 𝒘0 − 𝑦 𝑖 for all 𝑖 = 1, . . . , 𝑚. (11.3.2)

The square loss for the linearized model then reads


f lin (𝒘) := ∑_{𝑖=1}^{𝑚} (Φlin (𝒙 𝑖 , 𝒘) − 𝑦 𝑖 )² = ∑_{𝑖=1}^{𝑚} (⟨∇𝒘 Φ(𝒙𝑖 , 𝒘0 ), 𝒘⟩ − 𝛿𝑖 )² , (11.3.3)

where ⟨·, ·⟩ stands for the Euclidean inner product in R𝑛 . Comparing with (11.2.2), minimizing f lin corresponds to a
kernel least squares regression with feature map

𝜙(𝒙) = ∇𝒘 Φ(𝒙, 𝒘0 ) ∈ R𝑛 .

The corresponding kernel is


𝐾ˆ 𝑛 (𝒙, 𝒙 ′ ) = ⟨∇𝒘 Φ(𝒙, 𝒘0 ), ∇𝒘 Φ(𝒙 ′ , 𝒘0 )⟩ . (11.3.4)
We refer to 𝐾ˆ 𝑛 as the empirical tangent kernel, as it arises from the first order Taylor approximation (the tangent) of the
original model Φ around initialization 𝒘0 . Note that the kernel depends on the choice of 𝒘0 . As explained in Remark
11.4, training Φlin with gradient descent yields the kernel least-squares estimator with kernel 𝐾ˆ 𝑛 plus an additional
term depending on 𝒘0 .
Of course the linearized model Φlin only captures the behaviour of Φ for parameters 𝒘 that are close to 𝒘0 . If we
assume for the moment that during training of Φ, the parameters remain close to initialization, then we can expect
similar behaviour and performance of Φ and Φlin . Under certain assumptions, we will see in the next sections that this
is precisely what happens, when the width of a neural network increases. Before we make this precise, in Section 11.4
we investigate whether gradient descent applied to f (𝒘) will find a global minimizer, under the assumption that Φlin is
a good approximation of Φ.

11.4 Convergence to global minimizers

Intuitively, if 𝒘 ↦→ Φ(𝒙, 𝒘) is not linear but “close enough to its linearization” Φlin defined in (11.3.1), we expect that
the objective function is close to a convex function and gradient descent can still find global minimizers of (11.0.1b).
To motivate this, consider Figures 11.1 and 11.2 where we chose the number of training data 𝑚 = 1 and the number
of parameters 𝑛 = 1. As we can see, essentially we require the difference of Φ and Φlin and of their derivatives to be
small in a neighbourhood of 𝑤0 . The size of the neighbourhood crucially depends on the initial error Φ(𝒙1 , 𝑤0 ) − 𝑦 1 ,
and on the size of the derivative (𝑑/𝑑𝑤)Φ(𝒙1 , 𝑤0 ).

Fig. 11.1: Graph of a model 𝑤 ↦→ Φ(𝒙1 , 𝑤) and its linearization 𝑤 ↦→ Φlin (𝒙1 , 𝑤) at the initial parameter 𝑤0 , s.t.
(𝑑/𝑑𝑤)Φ(𝒙1 , 𝑤0 ) ≠ 0. If Φ and Φlin are close, then there exists 𝑤 s.t. Φ(𝒙1 , 𝑤) = 𝑦 1 (left). If the derivatives are also close,
the loss (Φ(𝒙1 , 𝑤) − 𝑦 1 )² is nearly convex in 𝑤, and gradient descent finds a global minimizer (right).

Fig. 11.2: Same as Figure 11.1. If Φ and Φlin are not close, there need not exist 𝑤 such that Φ(𝒙1 , 𝑤) = 𝑦 1 , and gradient
descent need not converge to a global minimizer.

For general 𝑚 and 𝑛, we now make the required assumptions on Φ precise.
Assumption 1 Let Φ ∈ 𝐶 1 (R𝑑 × R𝑛 ) and 𝒘0 ∈ R𝑛 . There exist constants 𝑟 > 0, 𝑈, 𝐿 < ∞ and 0 < 𝜆 min ≤ 𝜆max < ∞
such that
(a) the kernel matrix of the empirical tangent kernel

( 𝐾ˆ 𝑛 (𝒙𝑖 , 𝒙 𝑗 ))_{𝑖, 𝑗=1}^{𝑚} = (⟨∇𝒘 Φ(𝒙𝑖 , 𝒘0 ), ∇𝒘 Φ(𝒙 𝑗 , 𝒘0 )⟩)_{𝑖, 𝑗=1}^{𝑚} ∈ R𝑚×𝑚 (11.4.1)

is regular and its eigenvalues belong to [𝜆min , 𝜆max ],


(b) for all 𝑖 ∈ {1, . . . , 𝑚} holds

∥∇𝒘 Φ(𝒙𝑖 , 𝒘) ∥ ≤ 𝑈 for all 𝒘 ∈ 𝐵𝑟 (𝒘0 )


(11.4.2)
∥∇𝒘 Φ(𝒙𝑖 , 𝒘) − ∇𝒘 Φ(𝒙 𝑖 , 𝒗) ∥ ≤ 𝐿∥𝒘 − 𝒗∥ for all 𝒘, 𝒗 ∈ 𝐵𝑟 (𝒘0 ),

(c) and

𝐿 ≤ 𝜆²min / (12𝑚^{3/2} 𝑈² √(f (𝒘0 ))) and 𝑟 = (2√𝑚 𝑈/𝜆min ) √(f (𝒘0 )). (11.4.3)

The regularity of the kernel matrix in Assumption 1 (a) is equivalent to (∇𝒘 Φ(𝒙𝑖 , 𝒘0 ) ⊤ )𝑖=1 𝑚 ∈ R𝑚×𝑛 having full rank

𝑚 ≤ 𝑛 (in particular we have at least as many parameters 𝑛 as training data 𝑚). In the context of Figure 11.1, this means
that 𝑑𝑤𝑑
Φ(𝒙1 , 𝑤0 ) ≠ 0 and thus Φlin is a not a constant function. This condition guarantees that there exists 𝒘 such
that Φ (𝒘, 𝒙 𝑖 ) = 𝑦 𝑖 for all 𝑖 = 1, . . . , 𝑚. In other words, already the linearized model Φlin is sufficiently expressive
lin

to interpolate the data. Assumption 1 (b) formalizes the closeness condition of Φ and Φlin . Apart from giving an
upper bound on ∇𝒘 Φ(𝒙 𝑖 , 𝒘), it assumes 𝒘 ↦→ Φ(𝒙𝑖 , 𝒘) to be 𝐿-smooth in a ball of radius 𝑟 > 0 around 𝒘0 , for all
𝑖 = 1, . . . , 𝑚. This allows to control how far Φ(𝒙𝑖 , 𝒘) and Φlin (𝒙 𝑖 , 𝒘) and their derivatives may deviate from each other
for 𝒘 in this ball. Finally Assumption 1 (c) ties together all constants, ensuring the full model to be sufficiently close
to its linearization in a large enough neighbourhood of 𝒘0 .
We are now ready to state the following theorem, which is a variant of [115, Thm. G.1]. In Section 11.5 we will see
that its main requirement—Assumption 1—is satisfied with high probability for certain (wide) neural networks.

Theorem 11.12 Let Assumption 1 be satisfied and fix a positive learning rate
ℎ ≤ 1/(𝜆min + 𝜆max ). (11.4.4)
Set for all 𝑘 ∈ N
𝒘 𝑘+1 = 𝒘 𝑘 − ℎ∇f (𝒘 𝑘 ). (11.4.5)
It then holds for all 𝑘 ∈ N

∥𝒘 𝑘 − 𝒘0 ∥ ≤ (2√𝑚 𝑈/𝜆min ) √(f (𝒘0 )) (11.4.6a)
f (𝒘 𝑘 ) ≤ (1 − ℎ𝜆min )^{2𝑘} f (𝒘0 ). (11.4.6b)

Proof In the following denote the error in prediction by


𝑚
𝐸 (𝒘) := (Φ(𝒙𝑖 , 𝒘) − 𝑦 𝑖 )𝑖=1 ∈ R𝑚

such that
𝑚
∇𝐸 (𝒘) = (∇𝒘 Φ(𝒙𝑖 , 𝒘))𝑖=1 ∈ R𝑚×𝑛
and with the empirical tangent kernel 𝐾ˆ 𝑛 in Assumption 1

∇𝐸 (𝒘)∇𝐸 (𝒘) ⊤ = ( 𝐾ˆ 𝑛 (𝒙 𝑖 , 𝒙 𝑗 ))𝑖,𝑚𝑗=1 ∈ R𝑚×𝑚 . (11.4.7)

Moreover, (11.4.2) gives
𝑚
∑︁
∥∇𝐸 (𝒘) ∥ 2 ≤ ∥∇𝐸 (𝒘) ∥ 2𝐹 = ∥∇Φ(𝒙𝑖 , 𝒘) ∥ 2 ≤ 𝑚𝑈 2 for all 𝒘 ∈ 𝐵𝑟 (𝒘0 ), (11.4.8a)
𝑖=1

and similarly
𝑚
∑︁
∥∇𝐸 (𝒘) − ∇𝐸 (𝒗) ∥ 2 ≤ ∥∇𝒘 Φ(𝒙𝑖 , 𝒘) − ∇𝒘 Φ(𝒙𝑖 , 𝒗) ∥ 2
𝑖=1
≤ 𝑚𝐿 2 ∥𝒘 − 𝒗∥ 2 for all 𝒘, 𝒗 ∈ 𝐵𝑟 (𝒘0 ). (11.4.8b)

Denote 𝑐 := 1 − ℎ𝜆min ∈ (0, 1). We use induction over 𝑘 to prove


𝑘−1 𝑘−1
∑︁ √ ∑︁
∥𝒘 𝑗+1 − 𝒘 𝑗 ∥ ≤ ℎ2 𝑚𝑈 ∥𝐸 (𝒘0 ) ∥ 𝑐 𝑗, (11.4.9a)
𝑗=0 𝑗=0

∥𝐸 (𝒘 𝑘 ) ∥ 2 ≤ ∥𝐸 (𝒘0 ) ∥ 2 𝑐2𝑘 , (11.4.9b)

for all 𝑘 ∈ N0 and where an empty sum is understood as zero. Since ∞ −1 = (ℎ𝜆 −1 and
Í
𝑗=0 𝑐 = (1 − 𝑐) min )
𝑗
2
f (𝒘 𝑘 ) = ∥𝐸 (𝒘 𝑘 ) ∥ , these inequalities directly imply (11.4.6).
The case 𝑘 = 0 is trivial. For the induction step, assume (11.4.9) holds for some 𝑘 ∈ N0 .
Step 1. We show (11.4.9a) for 𝑘 + 1. The induction assumption and (11.4.3) give
∞ √
√ ∑︁ 2 𝑚𝑈 √︁
∥𝒘 𝑘 − 𝒘0 ∥ ≤ 2ℎ 𝑚𝑈 ∥𝐸 (𝒘0 ) ∥ 𝑐𝑗 = f (𝒘0 ) = 𝑟, (11.4.10)
𝑗=0
𝜆min

and thus 𝒘 𝑘 ∈ 𝐵𝑟 (𝒘0 ). Next


∇f (𝒘 𝑘 ) = ∇(𝐸 (𝒘 𝑘 ) ⊤ 𝐸 (𝒘 𝑘 )) = 2∇𝐸 (𝒘 𝑘 ) ⊤ 𝐸 (𝒘 𝑘 ). (11.4.11)
Using the iteration rule (11.4.5), the bound (11.4.8a), and (11.4.9b)

∥𝒘 𝑘+1 − 𝒘 𝑘 ∥ = 2ℎ∥∇𝐸 (𝒘 𝑘 ) ⊤ 𝐸 (𝒘 𝑘 ) ∥

≤ 2ℎ 𝑚𝑈 ∥𝐸 (𝒘 𝑘 ) ∥

≤ 2ℎ 𝑚𝑈 ∥𝐸 (𝒘0 ) ∥ 𝑐 𝑘 .

This shows (11.4.9a) for 𝑘 + 1. In particular, as in (11.4.10) we conclude

𝒘 𝑘+1 , 𝒘 𝑘 ∈ 𝐵𝑟 (𝒘0 ). (11.4.12)

Step 2. We show (11.4.9b) for 𝑘 + 1. Since 𝐸 is continuously differentiable, there exists 𝒘˜ 𝑘 in the convex hull of 𝒘 𝑘
and 𝒘 𝑘+1 such that

𝐸 (𝒘 𝑘+1 ) = 𝐸 (𝒘 𝑘 ) + ∇𝐸 ( 𝒘˜ 𝑘 ) (𝒘 𝑘+1 − 𝒘 𝑘 ) = 𝐸 (𝒘 𝑘 ) − ℎ∇𝐸 ( 𝒘˜ 𝑘 )∇f (𝒘 𝑘 ),

and thus by (11.4.11)

𝐸 (𝒘 𝑘+1 ) = 𝐸 (𝒘 𝑘 ) − 2ℎ∇𝐸 ( 𝒘˜ 𝑘 )∇𝐸 (𝒘 𝑘 ) ⊤ 𝐸 (𝒘 𝑘 )


= 𝑰𝑚 − 2ℎ∇𝐸 ( 𝒘˜ 𝑘 )∇𝐸 (𝒘 𝑘 ) ⊤ 𝐸 (𝒘 𝑘 ),


where 𝑰𝑚 ∈ R𝑚×𝑚 is the identity matrix. We wish to show that

∥ 𝑰𝑚 − 2ℎ∇𝐸 ( 𝒘˜ 𝑘 )∇𝐸 (𝒘 𝑘 ) ⊤ ∥ ≤ 𝑐, (11.4.13)

which then implies (11.4.9b) for 𝑘 + 1 and concludes the proof.
Using (11.4.8) and the fact that 𝒘 𝑘 , 𝒘˜ 𝑘 ∈ 𝐵𝑟 (𝒘0 ) by (11.4.12),

∥∇𝐸 ( 𝒘˜ 𝑘 )∇𝐸 (𝒘 𝑘 ) ⊤ − ∇𝐸 (𝒘0 )∇𝐸 (𝒘0 ) ⊤ ∥


≤ ∥∇𝐸 ( 𝒘˜ 𝑘 )∇𝐸 (𝒘 𝑘 ) ⊤ − ∇𝐸 (𝒘 𝑘 )∇𝐸 (𝒘 𝑘 ) ⊤ ∥
+ ∥∇𝐸 ( 𝒘˜ 𝑘 )∇𝐸 (𝒘 𝑘 ) ⊤ − ∇𝐸 (𝒘 𝑘 )∇𝐸 (𝒘0 ) ⊤ ∥
+ ∥∇𝐸 ( 𝒘˜ 𝑘 )∇𝐸 (𝒘0 ) ⊤ − ∇𝐸 (𝒘0 )∇𝐸 (𝒘0 ) ⊤ ∥
≤ 3𝑚𝑈 𝐿𝑟.

Since the eigenvalues of ∇𝐸 (𝒘0 )∇𝐸 (𝒘0 ) ⊤ belong to [𝜆min , 𝜆max ] by (11.4.7) and Assumption 1 (a), as long as
ℎ ≤ (𝜆min + 𝜆max ) −1 we have

∥ 𝑰𝑚 − 2ℎ∇𝐸 ( 𝒘˜ 𝑘 )∇𝐸 (𝒘 𝑘 ) ⊤ ∥ ≤ ∥ 𝑰𝑚 − 2ℎ∇𝐸 (𝒘0 )∇𝐸 (𝒘0 ) ⊤ ∥ + 6ℎ𝑚𝑈 𝐿𝑟


≤ 1 − 2ℎ𝜆min + 6ℎ𝑚𝑈 𝐿𝑟
≤ 1 − 2ℎ(𝜆min − 3𝑚𝑈 𝐿𝑟)
≤ 1 − ℎ𝜆min = 𝑐,

where we have used the equality for 𝑟 and the upper bound for 𝐿 in (11.4.3). □
Let us emphasize the main statement of Theorem 11.12. By (11.4.6b), full batch gradient descent (11.4.5) achieves
zero loss in the limit, i.e. the data is interpolated by the limiting model. In particular, this yields convergence for the
(possibly nonconvex) optimization problem of minimizing f (𝒘).

11.5 Training dynamics for LeCun initialization

In this and the next section we discuss the implications of Theorem 11.12 for wide neural networks. For ease of
presentation we focus on shallow networks with only one hidden layer, but stress that similar considerations also hold
for deep networks, see the bibliography section.

11.5.1 Architecture

Let Φ : R𝑑 → R be a neural network of depth one and width 𝑛 ∈ N of type

Φ(𝒙, 𝒘) = 𝒗 ⊤ 𝜎(𝑼𝒙 + 𝒃) + 𝑐. (11.5.1)

Here 𝒙 ∈ R𝑑 is the input, and 𝑼 ∈ R𝑛×𝑑 , 𝒗 ∈ R𝑛 , 𝒃 ∈ R𝑛 and 𝑐 ∈ R are the parameters which we collect in the vector
𝒘 = (𝑼, 𝒃, 𝒗, 𝑐) ∈ R𝑛(𝑑+2)+1 (with 𝑼 suitably reshaped). For future reference we note that

∇𝑼 Φ(𝒙, 𝒘) = (𝒗 ⊙ 𝜎 ′ (𝑼𝒙 + 𝒃))𝒙 ⊤ ∈ R𝑛×𝑑


∇𝒃 Φ(𝒙, 𝒘) = 𝒗 ⊙ 𝜎 ′ (𝑼𝒙 + 𝒃) ∈ R𝑛
(11.5.2)
∇𝒗 Φ(𝒙, 𝒘) = 𝜎(𝑼𝒙 + 𝒃) ∈ R𝑛
∇𝑐 Φ(𝒙, 𝒘) = 1 ∈ R,

where ⊙ denotes the Hadamard product. We also write ∇𝒘 Φ(𝒙, 𝒘) ∈ R𝑛(𝑑+2)+1 to denote the full gradient with respect
to all parameters.

In practice, it is common to initialize the weights randomly, and in this section we consider so-called LeCun
initialization. The following condition on the distribution used for this initialization will be assumed throughout the
rest of Section 11.5.
Assumption 2 The distribution D on R has expectation zero, variance one, and finite moments up to order eight. □
To explicitly indicate the expectation and variance in the notation, we also write D (0, 1) instead of D, and for 𝜇 ∈ R
and 𝜍 > 0 we use D (𝜇, 𝜍 2 ) to denote the corresponding scaled and shifted measure with expectation 𝜇 and variance
𝜍 2 ; thus, if 𝑋 ∼ D (0, 1) then 𝜇 + 𝜍 𝑋 ∼ D (𝜇, 𝜍 2 ). LeCun initialization [113] sets the variance of the weights in each
layer to be reciprocal to the input dimension of the layer, thereby normalizing the output variance across all network
nodes. The initial parameters
𝒘0 = (𝑼0 , 𝒃 0 , 𝒗0 , 𝑐 0 )
are thus randomly initialized with components
𝑈0;𝑖 𝑗 ∼ D (0, 1/𝑑) and 𝑣0;𝑖 ∼ D (0, 1/𝑛) i.i.d., 𝑏 0;𝑖 , 𝑐 0 = 0, (11.5.3)
independently for all 𝑖 = 1, . . . , 𝑛, 𝑗 = 1, . . . , 𝑑. For a fixed 𝜍 > 0 one might choose variances 𝜍 2 /𝑑 and 𝜍 2 /𝑛 in
(11.5.3), which would require only minor modifications in the rest of this section. Biases are set to zero for simplicity,
with nonzero initialization discussed in the exercises. All expectations and probabilities in Section 11.5 are understood
with respect to this random initialization.

Example 11.13 Typical examples for D (0, 1) are the standard normal distribution on R or the uniform distribution on [−√3, √3].
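To make the preceding definitions concrete, here is a sketch in NumPy (with a Gaussian D(0, 1), activation tanh, and illustrative sizes) of the architecture (11.5.1), the LeCun initialization (11.5.3), the gradient formulas (11.5.2), and the scaled empirical tangent kernel 𝐾ˆ 𝑛 (𝒙, 𝒛)/𝑛 studied in Section 11.5.2 below. All function names are our own.

    import numpy as np

    def init_lecun(d, n, rng):
        """LeCun initialization (11.5.3) with D(0,1) the standard normal distribution."""
        U = rng.standard_normal((n, d)) / np.sqrt(d)
        v = rng.standard_normal(n) / np.sqrt(n)
        return U, np.zeros(n), v, 0.0

    def grad_params(x, params, sigma=np.tanh, dsigma=lambda t: 1 - np.tanh(t)**2):
        """Full gradient of Phi(x, w) with respect to all parameters, cf. (11.5.2)."""
        U, b, v, c = params
        pre = U @ x + b
        gb = v * dsigma(pre)                                  # nabla_b Phi
        gU = np.outer(gb, x)                                  # nabla_U Phi
        gv = sigma(pre)                                       # nabla_v Phi
        return np.concatenate([gU.ravel(), gb, gv, [1.0]])    # nabla_c Phi = 1

    def empirical_kernel(x, z, params):
        """K_hat_n(x, z)/n = <grad_w Phi(x, w0), grad_w Phi(z, w0)> / n."""
        n = params[1].size
        return grad_params(x, params) @ grad_params(z, params) / n

    rng = np.random.default_rng(0)
    d, n = 2, 5000
    params0 = init_lecun(d, n, rng)
    x, z = np.array([1.0, 0.0]), np.array([0.5, 0.5])
    print(empirical_kernel(x, z, params0))   # stabilizes as n grows, cf. Theorem 11.14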

11.5.2 Neural tangent kernel

We begin our analysis by investigating the empirical tangent kernel

𝐾ˆ 𝑛 (𝒙, 𝒛) = ⟨∇𝒘 Φ(𝒙, 𝒘0 ), ∇𝒘 Φ(𝒛, 𝒘0 )⟩

of the shallow network (11.5.1). Scaled properly, it converges in the infinite width limit 𝑛 → ∞ towards a specific
kernel known as the neural tangent kernel (NTK). Its precise formula depends on the architecture and initialization.
For the LeCun initialization (11.5.3) we denote it by 𝐾 LC .

Theorem 11.14 Let 𝑅 < ∞ such that |𝜎(𝑥)| ≤ 𝑅 · (1 + |𝑥|) and |𝜎 ′ (𝑥)| ≤ 𝑅 · (1 + |𝑥|) for all 𝑥 ∈ R. For any 𝒙, 𝒛 ∈ R𝑑
and 𝑢 𝑖 ∼ D (0, 1/𝑑) i.i.d., 𝑖 = 1, . . . , 𝑑, it then holds

lim_{𝑛→∞} (1/𝑛) 𝐾ˆ 𝑛 (𝒙, 𝒛) = E[𝜎(𝒖 ⊤ 𝒙)𝜎(𝒖 ⊤ 𝒛)] =: 𝐾 LC (𝒙, 𝒛)

almost surely.
Moreover, for every 𝛿, 𝜀 > 0 there exists 𝑛0 (𝛿, 𝜀, 𝑅) ∈ N such that for all 𝑛 ≥ 𝑛0 and all 𝒙, 𝒛 ∈ R𝑑 with ∥𝒙∥, ∥𝒛∥ ≤ 𝑅

P[ |(1/𝑛) 𝐾ˆ 𝑛 (𝒙, 𝒛) − 𝐾 LC (𝒙, 𝒛)| < 𝜀 ] ≥ 1 − 𝛿.

Proof Denote 𝒙 (1) = 𝑼0 𝒙 + 𝒃 0 ∈ R𝑛 and 𝒛 (1) = 𝑼0 𝒛 + 𝒃 0 ∈ R𝑛 . Due to the initialization (11.5.3) and our assumptions
on D (0, 1), the components
∑︁𝑑
𝑥𝑖(1) = 𝑈0;𝑖 𝑗 𝑥 𝑗 ∼ 𝒖 ⊤ 𝒙 𝑖 = 1, . . . , 𝑛
𝑗=1

are i.i.d. with finite 𝑝th moment (independent of 𝑛) for all 1 ≤ 𝑝 ≤ 8. Due to the linear growth bound on 𝜎 and 𝜎 ′ , the
same holds for the (𝜎(𝑥𝑖(1) ))𝑖=1
𝑛 and the (𝜎 ′ (𝑥 (1) )) 𝑛 . Similarly, the (𝜎(𝑧 (1) )) 𝑛 and (𝜎 ′ (𝑧 (1) )) 𝑛 are collections of
𝑖 𝑖=1 𝑖 𝑖=1 𝑖 𝑖=1
i.i.d. random variables with finite 𝑝th moment for all 1 ≤ 𝑝 ≤ 8.
Denote 𝑣˜𝑖 = √𝑛 𝑣0;𝑖 such that 𝑣˜𝑖 ∼ D (0, 1) i.i.d. By (11.5.2)

(1/𝑛) 𝐾ˆ 𝑛 (𝒙, 𝒛) = (1 + 𝒙 ⊤ 𝒛) (1/𝑛²) ∑_{𝑖=1}^{𝑛} 𝑣˜𝑖² 𝜎 ′ (𝑥 𝑖^{(1)} )𝜎 ′ (𝑧 𝑖^{(1)} ) + (1/𝑛) ∑_{𝑖=1}^{𝑛} 𝜎(𝑥𝑖^{(1)} )𝜎(𝑧 𝑖^{(1)} ) + 1/𝑛 .

Since

(1/𝑛) ∑_{𝑖=1}^{𝑛} 𝑣˜𝑖² 𝜎 ′ (𝑥𝑖^{(1)} )𝜎 ′ (𝑧 𝑖^{(1)} ) (11.5.4)
is an average over i.i.d. random variables with finite variance, the law of large numbers implies almost sure convergence
of this expression towards

E[𝑣˜𝑖² 𝜎 ′ (𝑥𝑖^{(1)} )𝜎 ′ (𝑧𝑖^{(1)} )] = E[𝑣˜𝑖² ] E[𝜎 ′ (𝑥𝑖^{(1)} )𝜎 ′ (𝑧𝑖^{(1)} )] = E[𝜎 ′ (𝒖 ⊤ 𝒙)𝜎 ′ (𝒖 ⊤ 𝒛)],

where we used that 𝑣˜𝑖2 is independent of 𝜎 ′ (𝑥𝑖(1) )𝜎 ′ (𝑧 𝑖(1) ). By the same argument
𝑛
1 ∑︁
𝜎(𝑥 𝑖(1) )𝜎(𝑧𝑖(1) ) → E[𝜎(𝒖 ⊤ 𝒙)𝜎(𝒖 ⊤ 𝒛)]
𝑛 𝑖=1

almost surely as 𝑛 → ∞. Since the first sum in the above expression for 𝐾ˆ 𝑛 (𝒙, 𝒛)/𝑛 carries an additional factor 1/𝑛 and (11.5.4) converges, it vanishes in the limit, as does the last term 1/𝑛. This shows the first statement.


The existence of 𝑛0 follows similarly by an application of Theorem A.22. □
Example 11.15 (𝐾 LC for ReLU) Let 𝜎(𝑥) = max{0, 𝑥} and let D (0, 1) be the standard normal distribution. For 𝒙,
𝒛 ∈ R𝑑 denote by 𝜃 = arccos(𝒙 ⊤ 𝒛/(∥𝒙∥ ∥𝒛∥)) the angle between these vectors. Then according to [35, Appendix A], it holds with 𝑢 𝑖 ∼ D (0, 1/𝑑) i.i.d., 𝑖 = 1, . . . , 𝑑,

𝐾 LC (𝒙, 𝒛) = E[𝜎(𝒖 ⊤ 𝒙)𝜎(𝒖 ⊤ 𝒛)] = (∥𝒙∥ ∥𝒛∥/(2𝜋𝑑)) (sin(𝜃) + (𝜋 − 𝜃) cos(𝜃)).
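As a quick sanity check of the formula in Example 11.15, one can compare a Monte Carlo estimate of E[𝜎(𝒖 ⊤ 𝒙)𝜎(𝒖 ⊤ 𝒛)] for Gaussian 𝒖 with the closed-form expression. The vectors x, z and the sample size below are made up.

    import numpy as np

    def K_LC_relu(x, z):
        """Closed-form expression from Example 11.15."""
        d = x.size
        nx, nz = np.linalg.norm(x), np.linalg.norm(z)
        theta = np.arccos(np.clip(x @ z / (nx * nz), -1.0, 1.0))
        return nx * nz / (2 * np.pi * d) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

    rng = np.random.default_rng(0)
    x, z = np.array([1.0, 0.0, 2.0]), np.array([0.5, -1.0, 1.0])
    u = rng.standard_normal((200000, 3)) / np.sqrt(3)          # u_i ~ N(0, 1/d) with d = 3
    mc = np.mean(np.maximum(u @ x, 0) * np.maximum(u @ z, 0))  # Monte Carlo estimate
    print(mc, K_LC_relu(x, z))                                 # agree up to sampling error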

11.5.3 Gradient descent

We now proceed similar as in [115, App. G], to show that Theorem 11.12 is applicable to the wide neural network
(11.5.1) with high probability under random initialization (11.5.3). This will imply that gradient descent can find global
minimizers when training wide neural networks. We work under the following assumptions on the activation function
and training data.
Assumption 3 There exist 𝑅 < ∞ and 0 < 𝜆LC
min
≤ 𝜆LC
max < ∞ such that

(a) for the activation function 𝜎 : R → R holds |𝜎(0)|, Lip(𝜎), Lip(𝜎 ′ ) ≤ 𝑅,


(b) ∥𝒙𝑖 ∥ , |𝑦 𝑖 | ≤ 𝑅 for all training data (𝒙 𝑖 , 𝑦 𝑖 ) ∈ R𝑑 × R, 𝑖 = 1, . . . , 𝑚,
(c) the kernel matrix of the neural tangent kernel

(𝐾 LC (𝒙𝑖 , 𝒙 𝑗 ))𝑖,𝑚𝑗=1 ∈ R𝑚×𝑚

is regular and its eigenvalues belong to [𝜆LC , 𝜆LC ].


min max

We start by showing Assumption 1 (a) for the present setting. More precisely, we give bounds for the eigenvalues of
the empirical tangent kernel.

Lemma 11.16 Let Assumption 3 be satisfied. Then for every 𝛿 > 0 there exists 𝑛0 (𝛿, 𝜆LC min
, 𝑚, 𝑅) ∈ R such that for all
𝑛 ≥ 𝑛0 with probability at least 1 − 𝛿 all eigenvalues of

( 𝐾ˆ 𝑛 (𝒙𝑖 , 𝒙 𝑗 ))𝑖,𝑚𝑗=1 = ∇𝒘 Φ(𝒙𝑖 , 𝒘0 ), ∇𝒘 Φ(𝒙 𝑗 , 𝒘0 ) 𝑖, 𝑗=1 ∈ R𝑚×𝑚


𝑚

belong to [𝑛𝜆LC
min
/2, 2𝑛𝜆LC
max ].

Proof Denote 𝑮ˆ 𝑛 := ( 𝐾ˆ 𝑛 (𝒙𝑖 , 𝒙 𝑗 ))𝑖,𝑚𝑗=1 and 𝑮 LC := (𝐾 LC (𝒙 𝑖 , 𝒙 𝑗 ))𝑖,𝑚𝑗=1 . By Theorem 11.14, there exists 𝑛0 such that for
all 𝑛 ≥ 𝑛0 holds with probability at least 1 − 𝛿 that

1 𝜆LC
𝑮 LC − 𝑮ˆ 𝑛 ≤ min .
𝑛 2

Assuming this bound to hold

1 ˆ 1 𝜆LC 𝜆LC 𝜆LC


∥ 𝑮 𝑛 ∥ = sup ∥ 𝑮ˆ 𝑛 𝒂∥ ≥ inf𝑚 ∥𝑮 LC 𝒂∥ − min ≥ 𝜆LC
min − min
≥ min
,
𝑛 𝒂∈R𝑚 𝑛 𝒂∈R 2 2 2
∥𝒂 ∥ =1 ∥𝒂 ∥ =1

where we have used that 𝜆LC min


is the smallest eigenvalue, and thus singular value, of the symmetric positive definite
matrix 𝑮 LC . This shows that the smallest eigenvalue of 𝑮ˆ 𝑛 is larger or equal to 𝜆LC
min
/2. Similarly, we conclude that the
largest eigenvalue is bounded from above by 𝜆LC max + 𝜆 LC /2 ≤ 𝜆 LC . This concludes the proof.
min max □
Next we check Assumption 1 (b). To this end we first bound the norm of a random matrix.
iid
Lemma 11.17 Let D (0, 1) be as in Assumption 2, and let 𝑾 ∈ R𝑛×𝑑 with 𝑊𝑖 𝑗 ∼ D (0, 1). Denote the fourth moment
of D (0, 1) by 𝜇4 . Then
h √︁ i 𝑑𝜇4
P ∥𝑾 ∥ ≤ 𝑛(𝑑 + 1) ≥ 1 − .
𝑛
Proof It holds
𝑛 ∑︁
 ∑︁ 𝑑  1/2
∥𝑾 ∥ ≤ ∥𝑾 ∥ 𝐹 = 𝑊𝑖2𝑗 .
𝑖=1 𝑗=1

The 𝛼𝑖 := 𝑑𝑗=1 𝑊𝑖2𝑗 , 𝑖 = 1, . . . , 𝑛, are i.i.d. distributed with expectation 𝑑 and finite variance 𝑑𝐶, where 𝐶 ≤ 𝜇4 is the
Í
2 . By Theorem A.22
variance of 𝑊11
h i 𝑛
h 1 ∑︁ i 𝑛
h 1 ∑︁ i 𝑑𝜇
4
√︁
P ∥𝑾 ∥ > 𝑛(𝑑 + 1) ≤ P 𝛼𝑖 > 𝑑 + 1 ≤ P 𝛼𝑖 − 𝑑 > 1 ≤ ,
𝑛 𝑖=1 𝑛 𝑖=1 𝑛

which concludes the proof. □

Lemma 11.18 Let Assumption 3 (a) be satisfied with some constant 𝑅. Then there exists 𝑀 (𝑅), and for all 𝑐, 𝛿 > 0
there exists 𝑛0 (𝑐, 𝑑, 𝛿, 𝑅) ∈ N such that for all 𝑛 ≥ 𝑛0 it holds with probability at least 1 − 𝛿

∥∇𝒘 Φ(𝒙, 𝒘) ∥ ≤ 𝑀 𝑛 for all 𝒘 ∈ 𝐵𝑐𝑛−1/2 (𝒘0 )

∥∇𝒘 Φ(𝒙, 𝒘) − ∇𝒘 Φ(𝒙, 𝒗) ∥ ≤ 𝑀 𝑛∥𝒘 − 𝒗∥ for all 𝒘, 𝒗 ∈ 𝐵𝑐𝑛−1/2 (𝒘0 )

for all 𝒙 ∈ R𝑑 with ∥𝒙∥ ≤ 𝑅.

Proof Due to the initialization (11.5.3), by Lemma 11.17 we can find 𝑛0 (𝛿, 𝑑) such that for all 𝑛 ≥ 𝑛0 holds with
probability at least 1 − 𝛿 that √
∥𝒗0 ∥ ≤ 2 and ∥𝑼0 ∥ ≤ 2 𝑛. (11.5.5)
For the rest of this proof we fix arbitrary 𝒙 ∈ R𝑑 and 𝑛 ≥ 𝑛0 ≥ 𝑐2 such that

∥𝒙∥ ≤ 𝑅 and 𝑛 −1/2 𝑐 ≤ 1.

We need to show that the claimed inequalities hold as long as (11.5.5) is satisfied. We will several times use that for all
𝒑, 𝒒 ∈ R𝑛 √
∥ 𝒑 ⊙ 𝒒∥ ≤ ∥ 𝒑∥ ∥𝒒∥ and ∥𝜎( 𝒑) ∥ ≤ 𝑅 𝑛 + 𝑅∥ 𝒑∥
since |𝜎(𝑥)| ≤ 𝑅 · (1 + |𝑥|). The same holds for 𝜎 ′ .
Step 1. We show the bound on the gradient. Fix

𝒘 = (𝑼, 𝒃, 𝒗, 𝑐) s.t. ∥𝒘 − 𝒘0 ∥ ≤ 𝑐𝑛 −1/2 .

Using formula (11.5.2) for ∇𝒃 Φ and the above inequalities

∥∇𝒃 Φ(𝒙, 𝒘) ∥ ≤ ∥∇𝒃 Φ(𝒙, 𝒘0 ) ∥ + ∥∇𝒃 Φ(𝒙, 𝒘) − ∇𝒃 Φ(𝒙, 𝒘0 ) ∥


= ∥𝒗0 ⊙ 𝜎 ′ (𝑼0 𝒙) ∥ + ∥𝒗 ⊙ 𝜎 ′ (𝑼𝒙 + 𝒃) − 𝒗0 ⊙ 𝜎 ′ (𝑼0 𝒙) ∥
√ √
≤ 2(𝑅 𝑛 + 2𝑅 2 𝑛) + ∥𝒗 ⊙ 𝜎 ′ (𝑼𝒙 + 𝒃) − 𝒗0 ⊙ 𝜎 ′ (𝑼0 𝒙) ∥ . (11.5.6)

Due to √ √
∥𝑼∥ ≤ ∥𝑼0 ∥ + ∥𝑼0 − 𝑼∥ 𝐹 ≤ 2 𝑛 + 𝑐𝑛 −1/2 ≤ 3 𝑛, (11.5.7)
the last norm in (11.5.6) is bounded by

∥ (𝒗 − 𝒗0 ) ⊙ 𝜎 ′ (𝑼𝒙 + 𝒃) ∥ + ∥𝒗0 ⊙ (𝜎 ′ (𝑼𝒙 + 𝒃) − 𝜎 ′ (𝑼0 𝒙)) ∥



≤ 𝑐𝑛 −1/2 (𝑅 𝑛 + 𝑅 · (∥𝑼∥ ∥𝒙∥ + ∥𝒃∥ )) + 2𝑅 · (∥𝑼 − 𝑼0 ∥ ∥𝒙∥ + ∥𝒃∥ )
√ √
≤ 𝑅 𝑛 + 3 𝑛𝑅 2 + 𝑐𝑛 −1/2 𝑅 + 2𝑅 · (𝑐𝑛 −1/2 𝑅 + 𝑐𝑛 −1/2 )

≤ 𝑛(4𝑅 + 5𝑅 2 )

and therefore √
∥∇𝒃 Φ(𝒙, 𝒘) ∥ ≤ 𝑛(6𝑅 + 9𝑅 2 ).
For the gradient with respect to 𝑼 we use ∇𝑼 Φ(𝒙, 𝒘) = ∇𝒃 Φ(𝒙, 𝒘)𝒙 ⊤ , so that

∥∇𝑼 Φ(𝒙, 𝒘) ∥ 𝐹 = ∥∇𝒃 Φ(𝒙, 𝒘)𝒙 ⊤ ∥ 𝐹 = ∥∇𝒃 Φ(𝒙, 𝒘) ∥ ∥𝒙∥ ≤ 𝑛(6𝑅 2 + 9𝑅 3 ).

Next

∥∇𝒗 Φ(𝒙, 𝒘) ∥ = ∥𝜎(𝑼𝒙 + 𝒃) ∥



≤ 𝑅 𝑛 + 𝑅∥𝑼𝒙 + 𝒃∥
√ √
≤ 𝑅 𝑛 + 𝑅 · (3 𝑛𝑅 + 𝑐𝑛 −1/2 )

≤ 𝑛(2𝑅 + 3𝑅 2 ),

and finally ∇𝑐 Φ(𝒙, 𝒘) = 1. In all, with 𝑀1 (𝑅) := (1 + 8𝑅 + 12𝑅 2 )



∥∇𝒘 Φ(𝒙, 𝒘)˜ ∥ ≤ 𝑛𝑀1 (𝑅).

Step 2. We show Lipschitz continuity. Fix

𝒘 = (𝑼, 𝒃, 𝒗, 𝑐) and 𝒘˜ = (𝑼, ˜ 𝒗˜ , 𝑐)
˜ 𝒃, ˜

such that ∥𝒘 − 𝒘0 ∥ , ∥ 𝒘˜ − 𝒘0 ∥ ≤ 𝑐𝑛 −1/2 . Then

˜ ∥ = ∥𝒗 ⊙ 𝜎 ′ (𝑼𝒙 + 𝒃) − 𝒗˜ ⊙ 𝜎 ′ (𝑼𝒙
∥∇𝒃 Φ(𝒙, 𝒘) − ∇𝒃 Φ(𝒙, 𝒘) ˜ ∥.
˜ + 𝒃)

Using ∥ 𝒗˜ ∥ ≤ ∥𝒗0 ∥ + 𝑐𝑛 −1/2 ≤ 3 and (11.5.7), this term is bounded by

∥ (𝒗 − 𝒗˜ ) ⊙ 𝜎 ′ (𝑼𝒙 + 𝒃) ∥ + ∥ 𝒗˜ ⊙ (𝜎 ′ (𝑼𝒙 + 𝒃) − 𝜎 ′ (𝑼𝒙 ˜ ∥


˜ + 𝒃))

≤ ∥𝒗 − 𝒗˜ ∥ (𝑅 𝑛 + 𝑅 · (∥𝑼∥ ∥𝒙∥ + ∥𝒃∥ )) + 3𝑅 · (∥𝒙∥ ∥𝑼 − 𝑼∥ ˜ )
˜ + ∥𝒃 − 𝒃∥

≤ ∥𝒘 − 𝒘∥ ˜ 𝑛(5𝑅 + 6𝑅 2 ).

For ∇𝑼 Φ(𝒙, 𝒘) we obtain similar as in Step 1

∥∇𝑼 Φ(𝒙, 𝒘) − ∇𝑼 Φ(𝒙, 𝒘)


˜ ∥ 𝐹 = ∥𝒙∥ ∥∇𝒃 Φ(𝒙, 𝒘) − ∇𝒃 Φ(𝒙, 𝒘)
˜ ∥
√ 2 3
≤ ∥𝒘 − 𝒘∥
˜ 𝑛(5𝑅 + 6𝑅 ).

Next

∥∇𝒗 Φ(𝒙, 𝒘) − ∇𝒗 Φ(𝒙, 𝒘)


˜ ∥ = ∥𝜎(𝑼𝒙 + 𝒃) − 𝜎(𝑼𝒙 ˜ ∥
˜ − 𝒃)
˜ )
˜ ∥𝒙∥ + ∥𝒃 − 𝒃∥
≤ 𝑅 · (∥𝑼 − 𝑼∥
˜ (𝑅 2 + 𝑅)
≤ ∥𝒘 − 𝒘∥

and finally ∇𝑐 Φ(𝒙, 𝒘) = 1 is constant. With 𝑀2 (𝑅) := 𝑅 + 6𝑅 2 + 6𝑅 3 this shows



∥∇𝒘 Φ(𝒙, 𝒘) − ∇𝒘 Φ(𝒙, 𝒘)˜ ∥ ≤ 𝑛𝑀2 (𝑅) ∥𝒘 − 𝒘∥ ˜ .

In all, this concludes the proof with 𝑀 (𝑅) := max{𝑀1 (𝑅), 𝑀2 (𝑅)}. □

Before coming to the main result of this section, we first show that the initial error f (𝒘0 ) remains bounded with high
probability.

Lemma 11.19 Let Assumption 3 (a), (b) be satisfied. Then for every 𝛿 > 0 exists 𝑅0 (𝛿, 𝑚, 𝑅) > 0 such that for all
𝑛∈N
P[f (𝒘0 ) ≥ 𝑅0 ] ≤ 1 − 𝛿.
√ iid
Proof Let 𝑖 ∈ {1, . . . , 𝑚}, and set 𝜶 := 𝑼0 𝒙𝑖 and 𝑣˜ 𝑗 := 𝑛𝑣0; 𝑗 for 𝑗 = 1, . . . , 𝑛, so that 𝑣˜ 𝑗 ∼ D (0, 1). Then
𝑛
1 ∑︁
Φ(𝒙𝑖 , 𝒘0 ) = √ 𝑣˜ 𝑗 𝜎(𝛼 𝑗 ).
𝑛 𝑗=1

By Assumption 2 and (11.5.3), the 𝑣˜ 𝑗 𝜎(𝛼 𝑗 ), 𝑗 = 1, . . . , 𝑛, are i.i.d. centered random variables with finite vari-
ance bounded by a constant 𝐶 (𝑅) independent of 𝑛. Thus the variance of Φ(𝒙𝑖 , 𝒘0 ) is also bounded by 𝐶 (𝑅). By
Chebyscheff’s inequality, see Theorem ??, for every 𝑘 > 0
√ 1
P[|Φ(𝒙𝑖 , 𝒘0 )| ≥ 𝑘 𝐶] ≤ 2 .
𝑘
√︁
Setting 𝑘 = 𝑚/𝛿

𝑚
h ∑︁ √ 𝑚
i ∑︁ h √ i
P |Φ(𝒙𝑖 , 𝒘0 ) − 𝑦 𝑖 | 2 ≥ 𝑚(𝑘 𝐶 + 𝑅) 2 ≤ P |Φ(𝒙𝑖 , 𝒘0 ) − 𝑦 𝑖 | ≥ 𝑘 𝐶 + 𝑅
𝑖=1 𝑖=1
𝑚
∑︁ h √ i
≤ P |Φ(𝒙𝑖 , 𝒘0 )| ≥ 𝑘 𝐶 ≤ 𝛿,
𝑖=1
√︁
which shows the claim with 𝑅0 = 𝑚 · ( 𝐶𝑚/𝛿 + 𝑅) 2 . □
The next theorem is the main result of this section. It states that in the present setting gradient descent converges to
a global minimizer and the limiting network achieves zero loss, i.e. interpolates the data. Moreover, during training the
network weights remain close to initialization if the network width 𝑛 is large.

Theorem 11.20 Let Assumption 3 be satisfied, and let the parameters 𝒘0 of the neural network Φ in (11.5.1) be
initialized according to (11.5.3). Fix a learning rate
2 1
ℎ<
𝜆LC
min
LC
+ 4𝜆max 𝑛

and with the objective function (11.0.1b) let for all 𝑘 ∈ N

𝒘 𝑘+1 = 𝒘 𝑘 − ℎ∇f (𝒘 𝑘 ).

Then for every 𝛿 > 0 there exist 𝐶 > 0, 𝑛0 ∈ N such that for all 𝑛 ≥ 𝑛0 holds with probability at least 1 − 𝛿 that for
all 𝑘 ∈ N
𝐶
∥𝒘 𝑘 − 𝒘0 ∥ ≤ √
𝑛
 ℎ𝑛  2𝑘
f (𝒘 𝑘 ) ≤ 𝐶 1 − LC .
2𝜆min

Proof We wish to apply Theorem 11.12, which requires Assumption 1 to be satisfied. By Lemma 11.16, √︁ 11.18√and
11.19, for every 𝑐 > 0 we can find 𝑛0 such that for all 𝑛 ≥ 𝑛0 with probability at least 1 − 𝛿 we have f (𝒘0 ) ≤ 𝑅0
and Assumption 1 (a), (b) holds with the values

√ √ 𝑛𝜆LC
min
𝐿 = 𝑀 𝑛, 𝑈 = 𝑀 𝑛, 𝑟 = 𝑐𝑛 −1/2 , 𝜆min = , 𝜆max = 2𝑛𝜆LC
max .
2
For Assumption 1 (c), it suffices that

√ 𝑛2 (𝜆LC
min
/2) 2 −1/2 2𝑚𝑀 𝑛 √︁
𝑀 𝑛≤ √ and 𝑐𝑛 ≥ 𝑅0 .
12𝑚 3/2 𝑀 2 𝑛 𝑅0 𝑛

Choosing 𝑐 > 0 and 𝑛 large enough, the inequalities hold. The statement is now a direct consequence of Theorem
11.12. □
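The behaviour guaranteed by Theorem 11.20 can be observed numerically. The following self-contained sketch trains a wide shallow tanh network by plain gradient descent with step size ℎ = 1/𝑛 on three made-up data points, and monitors the loss and the distance of the parameters from their initialization; all concrete values are illustrative, and how close the loss is to zero after a fixed number of steps depends on the data.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, m = 1, 2000, 3
    X = np.array([[-1.0], [0.0], [1.0]]); y = np.array([0.5, -0.2, 0.3])   # made-up data

    U = rng.standard_normal((n, d)) / np.sqrt(d)      # LeCun initialization (11.5.3)
    v = rng.standard_normal(n) / np.sqrt(n)
    b = np.zeros(n); c = 0.0
    U0, v0 = U.copy(), v.copy()

    sigma, dsigma = np.tanh, lambda t: 1 - np.tanh(t)**2
    h = 1.0 / n                                       # step size of order 1/n

    for k in range(1000):
        pre = X @ U.T + b                             # shape (m, n)
        res = sigma(pre) @ v + c - y                  # residuals Phi(x_i, w) - y_i
        a = dsigma(pre) * v                           # entries v_j * sigma'(...)
        gv = 2 * sigma(pre).T @ res                   # gradients of f, cf. (11.5.2)
        gb = 2 * a.T @ res
        gU = (2 * a * res[:, None]).T @ X
        gc = 2 * np.sum(res)
        U -= h * gU; b -= h * gb; v -= h * gv; c -= h * gc

    print("loss:", np.sum((sigma(X @ U.T + b) @ v + c - y) ** 2))          # small, decreases further
    print("||w_k - w_0||:", np.sqrt(np.sum((U - U0)**2) + np.sum((v - v0)**2) + b @ b + c**2))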

11.5.4 Proximity to linearized model

The analysis thus far was based on the linearization Φlin describing the behaviour of the full network Φ well in a
neighbourhood of the initial parameters 𝒘0 . Moreover, Theorem 11.20 states that the parameters remain in an 𝑂 (𝑛 −1/2 )
neighbourhood of 𝒘0 during training. This suggests that the trained full model lim 𝑘→∞ Φ(𝒙, 𝒘 𝑘 ) yields predictions
similar to the trained linearized model.

To describe this phenomenon, we adopt again the notations Φlin : R𝑑 × R𝑛 → R and f lin from (11.3.1) and (11.3.3).
Initializing 𝒘0 according to (11.5.3) and setting 𝒑 0 = 𝒘0 , gradient descent computes the parameter updates

𝒘 𝑘+1 = 𝒘 𝑘 − ℎ∇𝒘 f (𝒘 𝑘 ), 𝒑 𝑘+1 = 𝒑 𝑘 − ℎ∇𝒘 f lin ( 𝒑 𝑘 )

for the full and linearized models, respectively. Let us consider the dynamics of the prediction of the network on the
training data. Writing
𝑚
Φ( 𝑿, 𝒘) := (Φ(𝒙𝑖 , 𝒘))𝑖=1 ∈ R𝑚 such that ∇𝒘 Φ( 𝑿, 𝒘) ∈ R𝑚×𝑛

it holds
∇𝒘 f (𝒘) = ∇𝒘 ∥Φ( 𝑿, 𝒘) − 𝒚∥ 2 = 2∇𝒘 Φ( 𝑿, 𝒘) ⊤ (Φ( 𝑿, 𝒘) − 𝒚).
Thus for the full model

Φ( 𝑿, 𝒘 𝑘+1 ) = Φ( 𝑿, 𝒘 𝑘 ) + ∇𝒘 Φ( 𝑿, 𝒘˜ 𝑘 ) (𝒘 𝑘+1 − 𝒘 𝑘 )
= Φ( 𝑿, 𝒘 𝑘 ) − 2ℎ∇𝒘 Φ( 𝑿, 𝒘˜ 𝑘 )∇𝒘 Φ( 𝑿, 𝒘 𝑘 ) ⊤ (Φ( 𝑿, 𝒘 𝑘 ) − 𝒚), (11.5.8)

where 𝒘˜ 𝑘 is in the convex hull of 𝒘 𝑘 and 𝒘 𝑘+1 .


Similarly, for the linearized model with (cp. (11.3.1))

Φlin ( 𝑿, 𝒘) := (Φlin (𝒙𝑖 , 𝒘))𝑖=1


𝑚
∈ R𝑚 and ∇ 𝒑 Φlin ( 𝑿, 𝒑) = ∇𝒘 Φ( 𝑿, 𝒘0 ) ∈ R𝑚×𝑛

such that
∇ 𝒑 f lin ( 𝒑) = ∇ 𝒑 ∥Φlin ( 𝑿, 𝒑) − 𝒚∥ 2 = 2∇𝒘 Φ( 𝑿, 𝒘0 ) ⊤ (Φlin ( 𝑿, 𝒑) − 𝒚)
and

Φlin ( 𝑿, 𝒑 𝑘+1 ) = Φlin ( 𝑿, 𝒑 𝑘 ) + ∇ 𝒑 Φlin ( 𝑿, 𝒑 0 ) ( 𝒑 𝑘+1 − 𝒑 𝑘 )


= Φlin ( 𝑿, 𝒑 𝑘 ) − 2ℎ∇𝒘 Φ( 𝑿, 𝒘0 )∇𝒘 Φ( 𝑿, 𝒘0 ) ⊤ (Φlin ( 𝑿, 𝒑 𝑘 ) − 𝒚). (11.5.9)

Remark 11.21 From (11.5.9) it is easy to see that with 𝑨 := 2ℎ∇𝒘 Φ( 𝑿, 𝒘0 )∇𝒘 Φ( 𝑿, 𝒘0 ) ⊤ and 𝑩 := 𝑰𝑚 − 𝑨 holds the
explicit formula
𝑘−1
∑︁
Φlin ( 𝑿, 𝒑 𝑘 ) = 𝑩 𝑘 Φlin ( 𝑿, 𝒑 0 ) + 𝑩 𝑘 𝑨𝒚
𝑗=0

for the prediction of the linear model in step 𝑘. Note that if 𝑨 is regular and ℎ is small enough, then 𝑩 𝑘 converges to
the zero matrix as 𝑘 → ∞ and ∞ −1 since this is a Neumann series.
Í
𝑗=0 𝑩 = 𝑨
𝑘

Comparing the two dynamics (11.5.8) and (11.5.9), the difference only lies in the two R𝑚×𝑚 matrices

2ℎ∇𝒘 Φ( 𝑿, 𝒘˜ 𝑘 )∇𝒘 Φ( 𝑿, 𝒘 𝑘 ) ⊤ and 2ℎ∇𝒘 Φ( 𝑿, 𝒘0 )∇𝒘 Φ( 𝑿, 𝒘0 ) ⊤ .

Recall that the step size ℎ in Theorem 11.20 scales like 1/𝑛.
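The dynamics (11.5.9) of the linearized model on the training data only involve the 𝑚 × 𝑚 matrix 𝑨 = 2ℎ∇𝒘 Φ( 𝑿, 𝒘0 )∇𝒘 Φ( 𝑿, 𝒘0 ) ⊤ . The following sketch, in which a generic random matrix J stands in for the Jacobian ∇𝒘 Φ( 𝑿, 𝒘0 ) and all values are made up, compares the iteration with the closed-form expression of Remark 11.21 (with 𝑩^𝑗 inside the sum).

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 3, 50
    J = rng.standard_normal((m, n))        # stand-in for the Jacobian grad_w Phi(X, w0)
    y = rng.standard_normal(m)             # labels
    pred0 = rng.standard_normal(m)         # Phi_lin(X, p_0) = Phi(X, w_0)
    h = 0.1 / np.linalg.norm(J @ J.T, 2)   # small step size

    A = 2 * h * J @ J.T
    B = np.eye(m) - A

    pred = pred0.copy()                    # iterate (11.5.9) on the training predictions
    K = 200
    for k in range(K):
        pred = pred - A @ (pred - y)

    closed = np.linalg.matrix_power(B, K) @ pred0 \
             + sum(np.linalg.matrix_power(B, j) @ (A @ y) for j in range(K))
    print(np.max(np.abs(pred - closed)))   # numerically zero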

Proposition 11.22 Consider the setting of Theorem 11.20. Then there exists 𝐶 < ∞, and for every 𝛿 > 0 there exists
𝑛0 such that for all 𝑛 ≥ 𝑛0 holds with probability at least 1 − 𝛿 that for all 𝑘 ∈ N
1
∥∇𝒘 Φ( 𝑿, 𝒘˜ 𝑘 )∇𝒘 Φ( 𝑿, 𝒘 𝑘 ) ⊤ − ∇ 𝒑 Φ( 𝑿, 𝒑 0 )∇ 𝒑 Φ( 𝑿, 𝒑 0 ) ⊤ ∥ ≤ 𝐶𝑛 −1/2 .
𝑛
Proof Consider the setting of the proof of Theorem 11.20. Then for every 𝑘 ∈ N holds ∥𝒘 𝑘 − 𝒘0 ∥ ≤ 𝑟 and thus also
∥ 𝒘˜ 𝑘 − 𝒘0 ∥ ≤ 𝑟, where 𝑟 = 𝑐𝑛 −1/2 . Thus Lemma 11.18 implies the norm to be bounded by

1
∥∇𝒘 Φ( 𝑿, 𝒘˜ 𝑘 ) − ∇ 𝒑 Φ( 𝑿, 𝒑 0 ) ∥ ∥∇𝒘 Φ( 𝑿, 𝒘 𝑘 ) ⊤ ∥ +
𝑛
1
∥∇ 𝒑 Φ( 𝑿, 𝒑 0 ) ∥ ∥∇𝒘 Φ( 𝑿, 𝒘 𝑘 ) ⊤ − ∇ 𝒑 Φ( 𝑿, 𝒑 0 ) ⊤ ∥
𝑛
≤ 𝑚𝑀 (∥ 𝒘˜ 𝑘 − 𝒑 0 ∥ + ∥𝒘 𝑘 − 𝒑 0 ∥ ) ≤ 𝑐𝑚𝑀𝑛 −1/2

which gives the statement. □


By Proposition 11.22 the two matrices driving the dynamics (11.5.8) and (11.5.9) remain in an 𝑂 (𝑛 −1/2 ) neighbour-
hood of each other throughout training. This allows to show the following proposition, which states that the prediction
function learned by the network gets arbitrarily close to the one learned by the linearized version in the limit 𝑛 → ∞.
The proof, which we omit, is based on Grönwall’s inequality. See [94, 115].

Proposition 11.23 Consider the setting of Theorem 11.20. Then there exists 𝐶 < ∞, and for every 𝛿 > 0 there exists
𝑛0 such that for all 𝑛 ≥ 𝑛0 holds with probability at least 1 − 𝛿 that for all ∥𝒙∥ ≤ 1

sup |Φ(𝒙, 𝒘 𝑘 ) − Φlin (𝒙, 𝒑 𝑘 )| ≤ 𝐶𝑛 −1/2 .


𝑘 ∈N

11.5.5 Connection to Gaussian processes

In the previous section, we established that for large widths, the trained neural network mirrors the behaviour of the
trained linearized model, which itself is closely connected to kernel least-squares with the neural tangent kernel. Yet, as
pointed out in Remark 11.4, the obtained model still strongly depends on the choice of random initialziation 𝒘0 ∈ R𝑛 .
We should thus understand both the model at initialization 𝒙 ↦→ Φ(𝒙, 𝒘0 ) and the model after training 𝒙 ↦→ Φ(𝒙, 𝒘 𝑘 ),
as random draws of a certain distribution over functions. To make this precise, let us introduce Gaussian processes.

Definition 11.24 Let (Ω, P) be a probability space, and let 𝑔 : R𝑑 × Ω → R. We call 𝑔 a Gaussian process with mean
function 𝑚 : R𝑑 → R and covariance function 𝑐 : R𝑑 × R𝑑 → R if
(a) for each 𝒙 ∈ R𝑑 holds 𝜔 ↦→ 𝑔(𝒙, 𝜔) is a random variable,
(b) for all 𝑘 ∈ N and all 𝒙1 , . . . , 𝒙 𝑘 ∈ R𝑑 the random variables 𝑔(𝒙1 , ·), . . . , 𝑔(𝒙 𝑘 , ·) have a joint Gaussian distribution
such that  
𝑘
(𝑔(𝒙 1 , 𝜔), . . . , 𝑔(𝒙 𝑘 , 𝜔)) ∼ N 𝑚(𝒙𝑖 )𝑖=1 , (𝑐(𝒙𝑖 , 𝒙 𝑗 ))𝑖,𝑘 𝑗=1 .

In words, 𝑔 is a Gaussian process, if 𝜔 ↦→ 𝑔(𝒙, ·) defines a collection of random variables indexed over 𝒙 ∈ R𝑑 ,
such that the joint distribution of (𝑔(𝒙1 , ·)) 𝑛𝑗=1 is a Gaussian whose mean and variance are determined by 𝑚 and 𝑐
respectively. Fixing 𝜔 ∈ Ω, we can then interpret 𝒙 ↦→ 𝑔(𝒙, 𝜔) as a random draw from a distribution over functions.
As first observed in [137], certain neural networks at initialization tend to Gaussian processes in the infinite width
limit.
iid
Proposition 11.25 Consider depth-𝑛 networks Φ𝑛 as in (11.5.1) with initialization (11.5.3), and define with 𝑢 𝑖 ∼
D (0, 1/𝑑), 𝑖 = 1, . . . , 𝑑,
𝑐(𝒙, 𝒛) := E[𝜎(𝒖 ⊤ 𝒙)𝜎(𝒖 ⊤ 𝒛)] for all 𝒙, 𝒛 ∈ R𝑑 .
Then for all distinct 𝒙1 , . . . , 𝒙 𝑘 ∈ R𝑑 it holds that

lim (Φ𝑛 (𝒙1 , 𝒘0 ), . . . , Φ𝑛 (𝒙 𝑘 , 𝒘0 )) ∼ N(0, (𝑐(𝒙 𝑖 , 𝒙 𝑗 ))𝑖,𝑘 𝑗=1 )


𝑛→∞

with weak convergence.


√ iid
Proof Set 𝑣˜𝑖 := 𝑛𝑣0,𝑖 and 𝒖˜ 𝑖 = (𝑈0,𝑖1 , . . . , 𝑈0,𝑖𝑑 ) ∈ R𝑑 , so that 𝑣˜𝑖 ∼ D (0, 1), and the 𝒖˜ 𝑖 ∈ R𝑑 are also i.i.d., with
each component distributed according to D (0, 1/𝑑).

142
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Then for any 𝒙1 , . . . , 𝒙 𝑘
𝑣˜𝑖 𝜎( 𝒖˜ ⊤𝑖 𝒙1 )
©
𝒁𝑖 := ­
­ .
..
ª
® ∈ R𝑘 𝑖 = 1, . . . , 𝑛,
®

«𝑣˜𝑖 𝜎( 𝒖˜ 𝑖 𝒙 𝑘 ) ¬
defines 𝑛 centered i.i.d. vectors in R 𝑘 . By the central limit theorem, see Theorem A.24,

Φ(𝒙 1 , 𝒘0 ) 𝑛
® = √1
∑︁
©
­ .. ª
𝒁𝑖 ∗
­ . ® 𝑛 𝑗=1
«Φ(𝒙 𝑘 , 𝒘0 ) ¬
converges weakly to N(0, 𝑪), where

𝐶𝑖 𝑗 = E[˜𝑣12 𝜎( 𝒖˜ ⊤ ˜⊤
1 𝒙 𝑖 )𝜎( 𝒖 ˜⊤
1 𝒙 𝑗 )] = E[𝜎( 𝒖 ˜⊤
1 𝒙 𝑖 )𝜎( 𝒖 1 𝒙 𝑗 )].

This concludes the proof. □


In the sense of Proposition 11.25, the network Φ(𝒙, 𝒘0 ) converges to a Gaussian process as the width 𝑛 tends
to infinity. Using the explicit dynamics of the linearized network outlined in Remark 11.21, one can show that the
linearized network after training also corresponds to a Gaussian process (for some mean and covariance function
depending the data, the architecture and the initialization). As the full and linearized models converge in the infinite
width limit, we can infer that wide networks post-training resemble draws from a Gaussian process, see [115, Sec.
2.3.1] and [44].
Rather than delving into the technical details of such statements, in Figure 11.3 we plot 80 different realizations of
a neural network before and after training, i.e.

𝒙 ↦→ Φ(𝒙, 𝒘0 ) and 𝒙 ↦→ Φ(𝒙, 𝒘 𝑘 ). (11.5.10)

We chose the architecture as (11.5.1) with activation function 𝜎 = arctan(𝑥), width 𝑛 = 250 and initialization
 3  2
iid iid iid
𝑈0;𝑖 𝑗 ∼ N 0, , 𝑣0;𝑖 ∼ N 0, , 𝑏 0;𝑖 , 𝑐 0 ∼ N(0, 3). (11.5.11)
𝑑 𝑛
The network was trained on a dataset of size 𝑚 = 3 with 𝑘 = 1000 steps of gradient descent and constant step size
ℎ = 1/𝑛. Before training, the network’s outputs resemble random draws from a Gaussian process with a constant zero
mean function. Post-training, the outputs show minimal variance at the data points, since they essentially interpolate
the data, cp. Remark 11.4 and (11.2.4). They exhibit increased variance further from these points, with the precise
amount depending on the initialization variance chosen in (11.5.11).

11.6 Normalized initialization

Consider the gradient ∇𝒘 Φ(𝒙, 𝒘0 ) as in (11.5.2) with LeCun initialization. Since the components of 𝒗 behave like
iid
𝑣𝑖 ∼ D (0, 1/𝑛), it is easy to check that in terms of the width 𝑛

E[∥∇𝑼 Φ(𝒙, 𝒘0 ) ∥ ] = E[∥ (𝒗 ⊙ 𝜎 ′ (𝑼𝒙 + 𝒃))𝒙 ⊤ ∥ ] = 𝑂 (1)



E[∥∇𝒃 Φ(𝒙, 𝒘0 ) ∥ ] = E[∥𝒗 ⊙ 𝜎 (𝑼𝒙 + 𝒃) ∥ ] = 𝑂 (1)
E[∥∇𝒗 Φ(𝒙, 𝒘0 ) ∥ ] = E[∥𝜎(𝑼𝒙 + 𝒃) ∥ ] = 𝑂 (𝑛)
E[∥∇𝑐 Φ(𝒙, 𝒘0 ) ∥ ] = E[|1|] = 𝑂 (1).

143
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
2 2

1 1

0 0

1 1

2 2

3 2 1 0 1 2 3 3 2 1 0 1 2 3

Fig. 11.3: 80 realizations of a neural network at initialization (left) and after training on the blue data points (right).
The red dashed line shows the mean. Plot based on [115, Fig. 2].

As a result of this different scaling, gradient descent with step width 𝑂 (𝑛 −1 ) as in Theorem 11.20, will primarily train
the weigths 𝒗 in the output layer, and will barely move the remaining parameters 𝑼, 𝒃, and 𝑐. This is also reflected in
the expression for the obtained kernel 𝐾 LC computed in Theorem 11.14, which corresponds to the contribution of the
term ⟨∇𝒗 Φ, ∇𝒗 Φ⟩.
Remark 11.26 For optimization methods such as ADAM, which scale each component of the gradient individually, the
same does not hold in general.
LeCun initialization aims to normalize the variance of the output of all nodes at initialization (the forward dynamics).
To also normalize the variance of the gradients (the backward dynamics), in this section we shortly dicuss a different
architecture and initialization, consistent with the one used in the original NTK paper [94].

11.6.1 Architecture

Let Φ : R𝑑 → R be a depth-one neural network


1  1 
Φ(𝒙, 𝒘) = √ 𝒗 ⊤ 𝜎 √ 𝑼𝒙 + 𝒃 + 𝑐, (11.6.1)
𝑛 𝑑

with input 𝒙 ∈ R𝑑 and parameters 𝑼 ∈ R𝑛×𝑑 , 𝒗 ∈ R𝑛 , 𝒃 ∈ R𝑛 and 𝑐 ∈ R. We initialize the weights randomly according
to 𝒘0 = (𝑼0 , 𝒃 0 , 𝒗0 , 𝑐 0 ) with parameters
iid iid
𝑈0;𝑖 𝑗 ∼ D (0, 1), 𝑣0;𝑖 ∼ D (0, 1), 𝑏 0;𝑖 , 𝑐 0 = 0. (11.6.2)

At initialization, (11.6.1), (11.6.2) is equivalent to (11.5.1), (11.5.3). However, for the gradient we obtain

144
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
 
∇𝑼 Φ(𝒙, 𝒘) = 𝑛 −1/2 𝒗 ⊙ 𝜎 ′ (𝑑 −1/2𝑼𝒙 + 𝒃) 𝑑 −1/2 𝒙 ⊤ ∈ R𝑛×𝑑
 
∇𝒃 Φ(𝒙, 𝒘) = 𝑛 −1/2 𝒗 ⊙ 𝜎 ′ 𝑑 −1/2𝑼𝒙 + 𝒃) ∈ R𝑛 (11.6.3)
−1/2 −1/2 𝑛
∇𝒗 Φ(𝒙, 𝒘) = 𝑛 𝜎(𝑑 𝑼𝒙 + 𝒃) ∈ R
∇𝑐 Φ(𝒙, 𝒘) = 1 ∈ R.

Contrary to (11.5.2), the three gradients with 𝑂 (𝑛) entries are all scaled by the factor 𝑛 −1/2 . This leads to a different
training dynamics.

11.6.2 Neural tangent kernel

We compute again the neural tangent kernel. Unlike for LeCun initialization, there is no 1/𝑛 scaling required to obtain
convergence of
𝐾ˆ 𝑛 (𝒙, 𝒛) = ⟨∇𝒘 Φ(𝒙, 𝒘0 ), ∇𝒘 Φ(𝒛, 𝒘0 )⟩
as 𝑛 → ∞. Here and in the following we consider the setting (11.6.1)–(11.6.2) for Φ and 𝒘0 . This is also referred to
as the NTK initialization, we denote the kernel by 𝐾 NTK . Due to the different training dynamics, we obtain additional
terms in the NTK compared to Theorem 11.20.

Theorem 11.27 Let 𝑅 < ∞ such that |𝜎(𝑥)| ≤ 𝑅 · (1 + |𝑥|) and |𝜎 ′ (𝑥)| ≤ 𝑅 · (1 + |𝑥|) for all 𝑥 ∈ R, and let D satisfy
iid
Assumption 2. For any 𝒙, 𝒛 ∈ R𝑑 and 𝑢 𝑖 ∼ D (0, 1/𝑑), 𝑖 = 1, . . . , 𝑑, it then holds
 𝒙⊤ 𝒛 
lim 𝐾ˆ 𝑛 (𝒙, 𝒛) = 1 + E[𝜎 ′ (𝒖 ⊤ 𝒙) ⊤ 𝜎 ′ (𝒖 ⊤ 𝒛)] + E[𝜎(𝒖 ⊤ 𝒙) ⊤ 𝜎(𝒖 ⊤ 𝒛)] + 1
𝑛→∞ 𝑑
=: 𝐾 NTK (𝒙, 𝒛)

almost surely.

Proof Denote 𝒙 (1) = 𝑼0 𝒙 + 𝒃 0 ∈ R𝑛 and 𝒛 (1) = 𝑼0 𝒛 + 𝒃 0 ∈ R𝑛 . Due to the initialization (11.6.2) and our assumptions
on D (0, 1), the components
∑︁𝑑
𝑥𝑖(1) = 𝑈0;𝑖 𝑗 𝑥 𝑗 ∼ 𝒖 ⊤ 𝒙 𝑖 = 1, . . . , 𝑛
𝑗=1

are i.i.d. with finite 𝑝th moment (independent of 𝑛) for all 1 ≤ 𝑝 ≤ 8, and the same holds for the (𝜎(𝑥𝑖(1) ))𝑖=1
𝑛 ,
(1) (1) (1)
(𝜎 ′ (𝑥𝑖 ))𝑖=1
𝑛 , (𝜎(𝑧 ′
𝑖 ))𝑖=1 , and (𝜎 (𝑧 𝑖 ))𝑖=1 .
𝑛 𝑛

Then
𝒙 ⊤ 𝒛  1 ∑︁ 2 ′ (1) ′ (1)
𝑛 𝑛
 1 ∑︁
𝐾ˆ 𝑛 (𝒙, 𝒛) = 1 + 𝑣𝑖 𝜎 (𝑥𝑖 )𝜎 (𝑧𝑖 ) + 𝜎(𝑥𝑖(1) )𝜎(𝑧𝑖(1) ) + 1.
𝑑 𝑛 𝑖=1 𝑛 𝑖=1

By the law of large numbers and because E[𝑣𝑖2 ] = 1, this converges almost surely to 𝐾 NTK (𝒙, 𝒛).
The existence of 𝑛0 follows similarly by an application of Theorem A.22. □

Example 11.28 (𝐾 NTK for ReLU) Let 𝜎(𝑥) = max{0, 𝑥} and let D (0, 1/𝑑) be the centered normal distribution on R
iid
with variance 1/𝑑. For 𝒙, 𝒛 ∈ R𝑑 holds by [35, Appendix A] (also see Exercise 11.33), that with 𝑢 𝑖 ∼ D (0, 1/𝑑),
𝑖 = 1, . . . , 𝑑,  ⊤ 
𝒙 𝒛
𝜋 − arccos ∥ 𝒙∥ ∥𝒛 ∥
′ ⊤ ′ ⊤
E[𝜎 (𝒖 𝒙)𝜎 (𝒖 𝒛)] = .
2𝜋
Together with Example 11.15, this yields an explizit formula for 𝐾 NTK in Theorem 11.27.

145
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
For this network architecture and under suitable assumptions on D, similar arguments as in Section 11.5 can be
used to show convergence of gradient descent to a global minimizer and proximity of the full to the linearized model.
We refer to the literature in the bibliography section.

Bibliography and further reading

The discussion on linear and kernel regression in Sections 11.1 and 11.2 is quite standard, and can similarly be found
in many textbooks. For more details on kernel methods we refer for instance to [40, 183]. The neural tangent and its
connection to the training dynamics was first investigated in [94] using an architecture similar to the one in Section
11.6. Since then, many works have extended this idea and presented differing perspectives on the topic, see for instance
[2, 51, 6, 33]. Our presentation in Sections 11.4, 11.5, and 11.6 primarily follows [115] who also discussed the case
of LeCun initialization. Especially for the main results in Theorem 11.12 and Theorem 11.20, we largely follow the
arguments in this paper. The above references additionally treat the case of deep networks, which we have omitted
here for simplicity. The explicit formula for the NTK of ReLU networks as presented in Examples 11.15 and 11.28
was given in [35]. The observation that neural networks at initialization behave like Gaussian processes presented in
Section 11.5.5 was first made in [137]. For a general reference on Gaussian processes see the textbook [166]. When
only training the last layer of a network (in which the network is affine linear), there are strong links to random feature
methods [164]. Recent developements on this topic can also be found in the literature under the name “Neural network
Gaussian processes”, or NNGPs for short [114, 45].

146
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Exercises

Exercise 11.29 Prove Theorem 11.3.


˜ For rank( 𝑨) < 𝑑, using 𝒘 𝑘 = 𝒘 𝑘−1 − ℎ∇f (𝒘 𝑘−1 ) and the
Hint: Assume first that 𝒘0 ∈ ker( 𝑨) ⊥ (i.e. 𝒘0 ∈ 𝐻).
Í
singular value decomposition of 𝑨, write down an explicit formula for 𝒘 𝑘 . Observe that due to 1/(1 − 𝑥) = 𝑘 ∈N0 𝑥 𝑘
for all 𝑥 ∈ (0, 1) it holds 𝒘 𝑘 → 𝑨† 𝒚 as 𝑘 → ∞, where 𝑨† is the Moore-Penrose pseudoinverse of 𝑨.

Exercise 11.30 Let 𝒙𝑖 ∈ R𝑑 , 𝑖 = 1, . . . , 𝑚. Show that there exists a “feature map” 𝜙 : R𝑑 → R𝑚 , such that for any
configuration of labels 𝑦 𝑖 ∈ {−1, 1}, there always exists a hyperplane in R𝑚 separating the two sets {𝜙(𝒙 𝑖 ) | 𝑦 𝑖 = 1}
and {𝜙(𝒙𝑖 ) | 𝑦 𝑖 = −1}.

Exercise 11.31 Consider the RBF kernel 𝐾 : R × R → R, 𝐾 (𝑥, 𝑥 ′ ) := exp(−(𝑥 − 𝑥 ′ ) 2 ). Find a Hilbert space 𝐻 and a
feature map 𝜙 : R → 𝐻 such that 𝐾 (𝑥, 𝑥 ′ ) = ⟨𝜙(𝒙), 𝜙(𝒙 ′ )⟩ 𝐻 .

Exercise 11.32 Let 𝑛 ∈ N and consider the polynomial kernel 𝐾 : R𝑑 × R𝑑 → R, 𝐾 (𝒙, 𝒙 ′ ) = (1 + 𝒙 ⊤ 𝒙 ′ ) 𝑟 . Find a
Hilbert space 𝐻 and a feature map 𝜙 : R𝑑 → 𝐻, such that 𝐾 (𝒙, 𝒙 ′ ) = ⟨𝜙(𝒙), 𝜙(𝒙 ′ )⟩ 𝐻 .
Hint: Use the multinomial formula.
iid
Exercise 11.33 Let 𝑢 𝑖 ∼ N(0, 1) be i.i.d. standard Gaussian distributed random variables for 𝑖 = 1, . . . , 𝑑. Show that
for all nonzero 𝒙, 𝒛 ∈ R𝑑
𝜋−𝜃  𝒙𝒛⊤ 
E[1 [0,∞) (𝒖 ⊤ 𝒙)1 [0,∞) (𝒖 ⊤ 𝒛)] = , 𝜃 = arccos .
2𝜋 ∥𝒙∥ ∥𝒛∥
This shows the formula for the ReLU NTK with Gaussian initialization as discussed in Example 11.28.
Hint: Consider the following sketch

𝜃 𝒙

Exercise 11.34 Consider the network (11.5.1) with LeCun initialization as in (11.5.3), but with the biases instead
initialized as
iid
𝑐, 𝑏 𝑖 ∼ D (0, 1) for all 𝑖 = 1, . . . , 𝑛. (11.6.4)
Compute the corresponding NTK as in Theorem 11.20. Moreover, compute the NTK also for the normalized network
(11.6.1) with initialization (11.6.2) as in Theorem 11.27, but replace again the bias initialization with that given in
(11.6.4).

147
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Chapter 12
Loss landscape analysis

In Chapter 10, we saw how the weights of neural networks get adapted during training, using, e.g., variants of gradient
descent. For certain cases, including the wide networks considered in Chapter 11, the corresponding iterative scheme
converges to a global minimizer. In general, this is not guaranteed, and gradient descent can for instance get stuck in
non-global minima or saddle points.
To get a better understanding of these situations, in this chapter we discuss the so-called loss landscape. This term
refers to the graph of the empirical risk as a function of the weights. We give a more rigorous definition below, and
first introduce notation for neural networks and their realizations for a fixed architecture.

Definition 12.1 Let A = (𝑑0 , 𝑑1 , . . . , 𝑑 𝐿+1 ) ∈ N 𝐿+2 , let 𝜎 : R → R be an activation function, and let 𝐵 > 0. We denote
the set of neural networks Φ with 𝐿 layers, layer widths 𝑑0 , 𝑑1 , . . . , 𝑑 𝐿+1 , all weights bounded in modulus by 𝐵, and
using the activation function 𝜎 by N (𝜎; A, 𝐵). Additionally, we define

?
𝐿  
PN (A, 𝐵) B [−𝐵, 𝐵] 𝑑ℓ+1 ×𝑑ℓ × [−𝐵, 𝐵] 𝑑ℓ+1 ,
ℓ=0

and the realization map

𝑅 𝜎 : PN (A, 𝐵) → N (𝜎; A, 𝐵)
(12.0.1)
(𝑾 (ℓ ) , 𝒃 (ℓ ) )ℓ=0
𝐿
↦→ Φ,

where Φ is the neural network with weights and biases given by (𝑾 (ℓ ) , 𝒃 (ℓ ) )ℓ=0
𝐿 .

Í𝐿
Throughout, we will identify PN (A, 𝐵) with the cube [−𝐵, 𝐵] 𝑛A , where 𝑛 A B ℓ=0 𝑑ℓ+1 (𝑑ℓ + 1). Now we can
introduce the loss landscape of a neural network architecture.

Definition 12.2 Let A = (𝑑0 , 𝑑1 , . . . , 𝑑 𝐿+1 ) ∈ N 𝐿+2 , let 𝜎 : R → R. Let 𝑚 ∈ N, and 𝑆 = (𝒙 𝑖 , 𝒚 𝑖 )𝑖=1
𝑚 ∈ (R𝑑0 × R𝑑𝐿+1 ) 𝑚

be a sample and let L be a loss function. Then, the loss landscape is the graph of the function Λ A, 𝜎,𝑆, L defined as

Λ A, 𝜎,𝑆, L : PN (A; ∞) → R
𝜃 ↦→ R
b𝑆 (𝑅 𝜎 (𝜃)).

with R
b𝑆 in (1.2.3) and 𝑅 𝜎 in (12.0.1).

Identifying PN (A, ∞) with R𝑛A , we can consider Λ A, 𝜎,𝑆, L as a map on R𝑛A and the loss landscape is a subset of
R 𝑛A × R. The loss landscape is a high-dimensional surface, with hills and valleys. For visualization a two-dimensional
section of a loss landscape is shown in Figure 12.1.
Questions of interest regarding the loss landscape include for example: How likely is it that we find local instead
of global minima? Are these local minima typically sharp, having small volume, or are they part of large flat valleys

149
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
l minima
ca

lo
le points
dd

sa
sh
arp
Hig l min minimum
k ba ima
he
mpirical ris

o
gl

Fig. 12.1: Two-dimensional section of a loss landscape. The loss landscape shows a spurious valley with local minima,
global minima, as well as a region where saddle points appear. Moreover, a sharp minimum is shown.

that are difficult to escape? How bad is it to end up in a local minimum? Are most local minima as deep as the global
minimum, or can they be significantly higher? How rough is the surface generally, and how do these characteristics
depend on the network architecture? While providing complete answers to these questions is hard in general, in the rest
of this chapter we give some intuition and mathematical insights for specific cases.

12.1 Visualization of loss landscapes

Visualizing loss landscapes can provide valuable insights into the effects of neural network depth, width, and activation
functions. However, we can only visualize an at most two-dimensional surface embedded into three-dimensional space,
whereas the loss landscape is a very high-dimensional object (unless the neural networks have only very few weights
and biases).
To make the loss landscape accessible, we need to reduce its dimensionality. This can be achieved by evaluating the
function Λ A, 𝜎,𝑆, L on a two-dimensional subspace of PN (A, ∞). Specifically, we choose three-parameters 𝜇, 𝜃 1 , 𝜃 2
and examine the function

R2 ∋ (𝛼1 , 𝛼2 ) ↦→ Λ A, 𝜎,𝑆, L (𝜇 + 𝛼1 𝜃 1 + 𝛼2 𝜃 2 ). (12.1.1)

There are various natural choices for 𝜇, 𝜃 1 , 𝜃 2 :


• Random directions: This was, for example used in [66, 91]. Here 𝜃 1 , 𝜃 2 are chosen randomly, while 𝜇 is either a
minimum of Λ A, 𝜎,𝑆, L or also chosen randomly. This simple approach can offer a quick insight into how rough
the surface can be. However, as was pointed out in [118], random directions will very likely be orthogonal to the
trajectory of the optimization procedure. Hence, they will likely miss the most relevant features.
• Principal components of learning trajectory: To address the shortcomings of random directions, another possibility
is to determine 𝜇, 𝜃 1 , 𝜃 2 , which best capture some given learning trajectory; For example, if 𝜃 (1) , 𝜃 (2) , . . . , 𝜃 ( 𝑁 )
are the parameters resulting from the training by SGD, we may determine 𝜇, 𝜃 1 , 𝜃 2 such that the hyperplane
{𝜇 + 𝛼1 𝜃 1 + 𝛼2 𝜃 2 | 𝛼1 , 𝛼2 ∈ R} minimizes the mean squared distance to the 𝜃 ( 𝑗 ) for 𝑗 ∈ {1, . . . , 𝑁 }. This is the
approach of [118], and can be achieved by a principal component analysis.

150
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
• Based on critical points: For a more global perspective, 𝜇, 𝜃 1 , 𝜃 2 can be chosen to ensure the observation of
multiple critical points. One way to achieve this is by running the optimization procedure three times with final
parameters 𝜃 (1) , 𝜃 (2) , 𝜃 (3) . If the procedures have converged, then each of these parameters is close to a critical
point of Λ A, 𝜎,𝑆, L . We can now set 𝜇 = 𝜃 (1) , 𝜃 1 = 𝜃 (2) − 𝜇, 𝜃 2 = 𝜃 (3) − 𝜇. This then guarantees that (12.1.1)
passes through or at least comes very close to three critical points (at (𝛼1 , 𝛼2 ) = (0, 0), (0, 1), (1, 0)). We present
six visualizations of this form in Figure 12.2.
Figure 12.2 gives some interesting insight into the effect of depth and width on the shape of the loss landscape. For
very wide and shallow neural networks, we have the widest minima, which, in the case of the tanh activation function
also seem to belong to the same valley. With increasing depth and smaller width the minima get steeper and more
disconnected.

12.2 Spurious minima

From the perspective of optimization, the ideal loss landscape has one global minimum in the center of a large valley,
so that gradient descent converges towards the minimum irrespective of the chosen initialization.
This situation is not realistic for deep neural networks. Indeed, for a simple shallow neural network

R𝑑 ∋ 𝒙 ↦→ Φ(𝒙) = 𝑾 (1) 𝜎(𝑾 (0) 𝒙 + 𝒃 (0) ) + 𝒃 (1) ,

it is clear that for every permutation matrix 𝑷

Φ(𝒙) = 𝑾 (1) 𝑷𝑇 𝜎(𝑷𝑾 (0) 𝒙 + 𝑷𝒃 (0) ) + 𝒃 (1) for all 𝒙 ∈ R𝑑 .

Hence, in general there exist multiple parameterizations realizing the same output function. Moreover, if least one
global minimum with non-permutation-invariant weights exists, then there are more than one global minima of the loss
landscape.
This is not problematic; in fact, having many global minima is beneficial. The larger issue is the existence of
non-global minima. Following [209], we start by generalizing the notion of non-global minima to spurious valleys.

Definition 12.3 Let A = (𝑑0 , 𝑑1 , . . . , 𝑑 𝐿+1 ) ∈ N 𝐿+2 and 𝜎 : R → R. Let 𝑚 ∈ N, and 𝑆 = (𝒙𝑖 , 𝒚 𝑖 )𝑖=1
𝑚 ∈ (R𝑑0 × R𝑑𝐿+1 ) 𝑚

be a sample and let L be a loss function. For 𝑐 ∈ R, we define the sub-level set of Λ A, 𝜎,𝑆, L as

ΩΛ (𝑐) B {𝜃 ∈ PN (A, ∞) | Λ A, 𝜎,𝑆, L (𝜃) ≤ 𝑐}.

A path-connected component of ΩΛ (𝑐), which does not contain a global minimum of Λ A, 𝜎,𝑆, L is called a spurious
valley.

The next proposition shows that spurious local minima do not exist for shallow overparameterized neural networks,
i.e., for neural networks that have at least as many parameters in the hidden layer as there are training samples.

Proposition 12.4 Let A = (𝑑0 , 𝑑1 , 1) ∈ N3 and let 𝑆 = (𝒙𝑖 , 𝑦 𝑖 )𝑖=1


𝑚 ∈ (R𝑑0 × R) 𝑚 be a sample such that 𝑚 ≤ 𝑑 .
1
Furthermore, let 𝜎 ∈ M and L be a convex loss function. Further assume that Λ A, 𝜎,𝑆, L has at least one global
minimum. Then, Λ A, 𝜎,𝑆, L , has no spurious valleys.

Proof Let 𝜃 𝑎 , 𝜃 𝑏 ∈ PN (A, ∞) with Λ A, 𝜎,𝑆, L (𝜃 𝑎 ) > Λ A, 𝜎,𝑆, L (𝜃 𝑏 ). Then we will show below that there is another
parameter 𝜃 𝑐 such that
• Λ A, 𝜎,𝑆, L (𝜃 𝑏 ) = Λ A, 𝜎,𝑆, L (𝜃 𝑐 )
• there is a continuous path 𝛼 : [0, 1] → PN (A, ∞) such that 𝛼(0) = 𝜃 𝑎 , 𝛼(1) = 𝜃 𝑐 , and Λ A, 𝜎,𝑆, L (𝛼) is
monotonically decreasing.

151
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
By Exercise 12.7, the construction above rules out the existence of spurious valleys by choosing 𝜃 𝑎 an element of a
spurious valley and 𝜃 𝑏 a global minimum.
Next, we present the construction: Let us denote
 1 
(ℓ ) (ℓ )
𝜃 𝑜 = 𝑾𝑜 , 𝒃 𝑜 for 𝑜 ∈ {𝑎, 𝑏, 𝑐}.
ℓ=0

𝑗
Moreover, for 𝑗 = 1, . . . , 𝑑1 , we introduce 𝒗𝑜 ∈ R𝑚 defined as
  
(𝒗𝑜 )𝑖 = 𝜎 𝑾𝑜(0) 𝒙𝑖 + 𝒃 𝑜(0)
𝑗
for 𝑖 = 1, . . . , 𝑚.
𝑗

Notice that, if we set 𝑽𝑜 = ((𝒗𝑜 ) ⊤ ) 𝑑𝑗=1


𝑗 1
, then
 𝑚
𝑾𝑜(1) 𝑽𝑜 = 𝑅 𝜎 (𝜃 𝑜 ) (𝒙𝑖 ) − 𝒃 𝑜(1) , (12.2.1)
𝑖=1

where the right-hand side is considered a row-vector.


We will now distinguish between two cases. For the first the result is trivial and the second can be transformed into
the first one.
Case 1: Assume that 𝑽𝑎 has rank 𝑚. In this case, it is obvious from (12.2.1), that there exists 𝑾e such that
 𝑚
e 𝑎 = 𝑅 𝜎 (𝜃 𝑏 ) (𝒙𝑖 ) − 𝒃 𝑎(1)
𝑾𝑽 .
𝑖=1

We can thus set 𝛼(𝑡) = ((𝑾𝑎(0) , 𝒃 𝑎(0) ), ((1 − 𝑡)𝑾𝑎(1) + 𝑡 𝑾,


e 𝒃 𝑎(1) ).
Note that by construction 𝛼(0) = 𝜃 𝑎 and Λ A, 𝜎,𝑆, L (𝛼(1)) = Λ A, 𝜎,𝑆, L (𝜃 𝑏 ). Moreover, 𝑡 ↦→ (𝑅 𝜎 (𝛼(𝑡)) (𝒙 𝑖 ))𝑖=1
𝑚

describes a straight path in R and hence, by the convexity of L it is clear that 𝑡 ↦→ Λ A, 𝜎,𝑆, L (𝛼(𝑡)) is monotonically
𝑚

decreasing.
Case 2: Assume that 𝑉𝑎 has rank less than 𝑚. In this case, we show that we find a continuous path from 𝜃 𝑎 to
another neural network parameter with higher rank. The path will be such that Λ A, 𝜎,𝑆, L is monotonically decreasing.
𝑗
Under the assumptions, we have that one 𝒗𝑎 can be written as a linear combination of the remaining 𝒗𝑎𝑖 , 𝑖 ≠ 𝑗.
Without loss of generality, we assume 𝑗 = 1. Then, there exist (𝛼𝑖 )𝑖=2 𝑚 such that

𝑚
∑︁
𝒗𝑎1 = 𝛼𝑖 𝒗𝑎𝑖 . (12.2.2)
𝑖=2

Next, we observe that here exists 𝒗 ∗ ∈ R𝑚 which is linearly independent from all (𝒗𝑎 )𝑖=1
𝑗
𝑚 and can be written as

(𝒗 ∗ )𝑖 = 𝜎((𝒘 ∗ ) ⊤ 𝒙𝑖 + 𝑏 ∗ ) for some 𝒘 ∗ ∈ R𝑑0 , 𝑏 ∗ ∈ R. Indeed, if we assume that such 𝒗 ∗ does not exist, then it follows
that span{(𝜎(𝒘 ⊤ 𝒙 𝑖 + 𝑏))𝑖=1 𝑚 | 𝒘 ∈ R𝑑0 , 𝑏 ∈ R} is an 𝑚 − 1 dimensional subspace of R𝑚 which yields a contradiction to

Theorem 9.3.
Now, we define two paths: First,

𝛼1 (𝑡) = ((𝑾𝑎(0) , 𝒃 𝑎(0) ), (𝑾𝑎(1) (𝑡), 𝒃 𝑎(1) )), for 𝑡 ∈ [0, 1/2]

where
(𝑾𝑎(1) (𝑡))1 = (1 − 2𝑡) (𝑾𝑎(1) )1 and (𝑾𝑎(1) (𝑡))𝑖 = (𝑾𝑎(1) )𝑖 + 2𝑡𝛼𝑖 (𝑾𝑎(1) )1
for 𝑖 = 2, . . . , 𝑑1 , for 𝑡 ∈ [0, 1/2]. Second,

𝛼2 (𝑡) = ((𝑾𝑎(0) (𝑡), 𝒃 𝑎(0) (𝑡)), (𝑾𝑎(1) (1/2), 𝒃 𝑎(1) )), for 𝑡 ∈ (1/2, 1],

where
(𝑾𝑎(0) (𝑡))1 = 2(𝑡 − 1/2) (𝑾𝑎(0) )1 + (2𝑡 − 1)𝒘 ∗ and (𝑾𝑎(0) (𝑡))𝑖 = (𝑾𝑎(0) )𝑖

152
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
for 𝑖 = 2, . . . , 𝑑1 , (𝒃 𝑎(0) (𝑡))1 = 2(𝑡 − 1/2) (𝒃 𝑎(0) )1 + (2𝑡 − 1)𝑏 ∗ , and (𝒃 𝑎(0) (𝑡))𝑖 = (𝒃 𝑎(0) )𝑖 for 𝑖 = 2, . . . , 𝑑1 .
Now it is clear by (12.2.2) that (𝑅 𝜎 (𝛼1 ) (𝒙𝑖 ))𝑖=1 𝑚 is constant. Moreover, 𝑅 (𝛼 ) (𝒙) is constant for all 𝒙 ∈ R𝑑0 . In
𝜎 2
addition, by construction for    𝑚
𝑗 (0) (0)
𝒗¯ B 𝜎 𝑾𝑎 (1)𝒙𝑖 + 𝒃 𝑎 (1)
𝑗 𝑖=1

it holds that (( 𝒗¯ 𝑗 ) ⊤ ) 𝑑𝑗=1


1
has rank larger than that of 𝑽𝑎 . Concatenating 𝛼1 and 𝛼2 now yields a continuous path from
𝜃 𝑎 to another neural network parameter with higher associated rank such that Λ A, 𝜎,𝑆, L is monotonically decreasing
along the path. Iterating this construction, we can find a path to a neural network parameter where the associated matrix
has full rank. This reduces the problem to Case 1. □

12.3 Saddle points

Saddle points are critical points of the loss landscape at which the loss decreases in one direction. In this sense, saddle
points are not as problematic as local minima or spurious valleys if the updates in the learning iteration have some
stochasticity. Eventually, a random step in the right direction could be taken and the saddle point can be escaped.
If most of the critical points are saddle points, then, even though the loss landscape is challenging for optimization,
one still has a good chance of eventually reaching the global minimum. Saddle points of the loss landscape were studied
in [43, 151] and we will review some of the findings in a simplified way below. The main observation in [151] is that,
under some quite strong assumptions, it holds that critical points in the loss landscape associated to a large loss are
typically saddle points, whereas those associated to small loss correspond to minima. This situation is encouraging for
the prospects of optimization in deep learning, since, even if we get stuck in a local minimum, it will very likely be
such that the loss is close to optimal.
The results of [151] use random matrix theory, which we do not recall here. Moreover, it is hard to gauge if the
assumptions made are satisfied for a specific problem. Nonetheless, we recall the main idea, which provides some
intuition to support the above claim.
Let A = (𝑑0 , 𝑑1 , 1) ∈ N3 . Then, for a neural network parameter 𝜃 ∈ PN (A, ∞) and activation function 𝜎, we set
Φ 𝜃 B 𝑅 𝜎 (𝜃) and define for a sample 𝑆 = (𝒙𝑖 , 𝑦 𝑖 )𝑖=1
𝑚 the errors

𝑒 𝑖 = Φ 𝜃 (𝒙𝑖 ) − 𝑦 𝑖 for 𝑖 = 1, . . . , 𝑚.

If we use the square loss, then


𝑚
c𝑆 (Φ 𝜃 ) = 1
∑︁
R 𝑒2 . (12.3.1)
𝑚 𝑖=1 𝑖

Next, we study the Hessian of R


b𝑆 (Φ 𝜃 ).

Proposition 12.5 Let A = (𝑑0 , 𝑑1 , 1) and 𝜎 : R → R. Then, for every 𝜃 ∈ PN (A, ∞) where R
b𝑆 (Φ 𝜃 ) in (12.3.1) is
twice continuously differentiable with respect to the weights, it holds that

𝑯(𝜃) = 𝑯0 (𝜃) + 𝑯1 (𝜃),

where 𝑯(𝜃) is the Hessian of R


b𝑆 (Φ 𝜃 ) at 𝜃, 𝑯0 (𝜃) is a positive semi-definite matrix which is independent from (𝑦 𝑖 ) 𝑚 ,
𝑖=1
and 𝑯1 (𝜃) is a symmetric matrix that for fixed 𝜃 and (𝒙𝑖 )𝑖=1
𝑚 depends linearly on (𝑒 ) 𝑚 .
𝑖 𝑖=1

Proof Using the identification introduced after Definition 12.2, we can consider 𝜃 a vector in R𝑛A . For 𝑘 = 1, . . . , 𝑛 A ,
we have that
𝑚
𝜕R
b𝑆 (Φ 𝜃 ) 2 ∑︁ 𝜕Φ 𝜃 (𝒙𝑖 )
= 𝑒𝑖 .
𝜕𝜃 𝑘 𝑚 𝑖=1 𝜕𝜃 𝑘

153
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Therefore, for 𝑗 = 1, . . . , 𝑛 A , we have, by the Leibniz rule, that
𝑚  𝑚
!
𝜕2R 2 ∑︁ 𝜕 2 Φ 𝜃 (𝒙𝑖 )

b𝑆 (Φ 𝜃 ) 2 ∑︁ 𝜕Φ 𝜃 (𝒙𝑖 ) 𝜕Φ 𝜃 (𝒙𝑖 )
= + 𝑒𝑖 (12.3.2)
𝜕𝜃 𝑗 𝜕𝜃 𝑘 𝑚 𝑖=1 𝜕𝜃 𝑗 𝜕𝜃 𝑘 𝑚 𝑖=1 𝜕𝜃 𝑗 𝜕𝜃 𝑘
C 𝑯0 (𝜃) + 𝑯1 (𝜃).

It remains to show that 𝑯0 (𝜃) and 𝑯1 (𝜃) have the asserted properties. Note that, setting
𝜕Φ 𝜃 ( 𝒙𝑖 )
© 𝜕𝜃1 ª
­ . ®
𝐽𝑖, 𝜃 = ­ .. ® ∈ R𝑛A ,
­ 𝜕Φ ( 𝒙 ) ®
𝜃 𝑖
« 𝜕𝜃𝑛A ¬

we have that 𝑯0 (𝜃) = 𝑚2 𝑖=1 𝐽𝑖, 𝜃 𝐽𝑖,⊤𝜃 and hence 𝑯0 (𝜃) is a sum of positive semi-definite matrices, which shows that
Í𝑚
𝑯0 (𝜃) is positive semi-definite.
The symmetry of 𝑯1 (𝜃) follows directly from the symmetry of second derivatives which holds since we assumed
twice continuous differentiability at 𝜃. The linearity of 𝑯1 (𝜃) in (𝑒 𝑖 )𝑖=1
𝑚 is clear from (12.3.2). □
How does Proposition 12.5 imply the claimed relationship between the size of the loss and the prevalence of saddle
points?
Let 𝜃 correspond to a critical point. If 𝑯(𝜃) has at least one negative eigenvalue, then 𝜃 cannot be a minimum, but
instead must be either a saddle point or a maximum. While we do not know anything about 𝑯1 (𝜃) other than that it is
symmetric, it is not unreasonable to assume that it has a negative eigenvalue especially if 𝑛 A is very large. With this
consideration, let us consider the following model:
Fix a parameter 𝜃. Let 𝑆 0 = (𝒙𝑖 , 𝑦 0𝑖 )𝑖=1 𝑚 be a sample and (𝑒 0 ) 𝑚 be the associated errors. Further let
𝑖 𝑖=1
0 0 0
𝑯 (𝜃), 𝑯0 (𝜃), 𝑯1 (𝜃) be the matrices according to Proposition 12.5.
𝑚 be such that the associated errors are (𝑒 ) 𝑚 = 𝜆(𝑒 0 ) 𝑚 . The Hessian of
Further let for 𝜆 > 0, 𝑆 𝜆 = (𝒙𝑖 , 𝑦 𝜆𝑖 )𝑖=1 𝑖 𝑖=1 𝑖 𝑖=1
b𝑆 𝜆 (Φ 𝜃 ) at 𝜃 is then 𝑯 (𝜃) satisfying
R 𝜆

𝑯 𝜆 (𝜃) = 𝑯00 (𝜃) + 𝜆𝑯10 (𝜃).


Hence, if 𝜆 is large, then 𝑯 𝜆 (𝜃) is perturbation of an amplified version of 𝑯10 (𝜃). Clearly, if 𝒗 is an eigenvector of
𝑯1 (𝜃) with negative eigenvalue −𝜇, then

𝒗 ⊤ 𝑯 𝜆 (𝜃)𝒗 ≤ (∥𝑯00 (𝜃) ∥ − 𝜆𝜇) ∥𝒗∥ 2 ,

which we can expect to be negative for large 𝜆. Thus, 𝑯 𝜆 (𝜃) has a negative eigenvalue for large 𝜆.
On the other hand, if 𝜆 is small, then 𝑯 𝜆 (𝜃) is merely a perturbation of 𝑯00 (𝜃) and we can expect its spectrum to
resemble that of 𝑯00 more and more.
What we see is that, the same parameter, is more likely to be a saddle point for a sample that produces a high empirical
risk than for a sample with small risk. Note that, since 𝑯00 (𝜃) was only shown to be semi-definite the argument above
does not rule out saddle points even for very small 𝜆. But it does show that for small 𝜆, every negative eigenvalue would
be very small.
A more refined analysis where we compare different parameters but for the same sample and quantify the likelihood
of local minima versus saddle points requires the introduction of a probability distribution on the weights. We refer to
[151] for the details.

Bibliography and further reading

The results on visualization of the loss landscape are inspired by [118, 66, 91]. Results on the non-existence of spurious
valleys can be found in [209] with similar results in [163]. In [37] the loss landscape was studied by linking it to

154
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
so-called spin-glass models. There it was found that under strong assumptions critical points associated to lower losses
are more likely to be minima than saddle points. In [151], random matrix theory is used to provide similar results,
that go beyond those established in Section 12.3. On the topic of saddle points, [43] identifies the existence of saddle
points as more problematic than that of local minima, and an alternative saddle-point aware optimization algorithm is
introduced.
Two essential topics associated to the loss landscape that have not been discussed in this chapter are mode connectivity
and the sharpness of minima. Mode connectivity, roughly speaking describes the phenomenon, that local minima found
by SGD over deep neural networks are often connected by simple curves of equally low loss [57, 50]. Moreover, the
sharpness of minima has been analyzed and linked to generalization capabilities of neural networks, with the idea being
that wide neural networks are easier to find and also yield robust neural networks [82, 31, 221]. However, this does not
appear to exclude sharp minima from generalizing well [49].

155
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Exercises

Exercise 12.6 In view of Definition 12.3, show that a local minimum of a differentiable function is contained in a
spurious valley.

Exercise 12.7 Show that if there exists a continuous path 𝛼 between a parameter 𝜃 1 and a global minimum 𝜃 2 such that
Λ A, 𝜎,𝑆, L (𝛼) is monotonically decreasing, then 𝜃 1 cannot be an element of a spurious valley.

Exercise 12.8 Find an example of a spurious valley for a simple architecture. (Hint: Use a single neuron ReLU neural
network and observe that, for two networks one with positive and one with negative slope, every continuous path in
parameter space that connects the two has to pass through a parameter corresponding to a constant function.)

156
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Fig. 12.2: A collection of loss landscapes. In the left column are neural networks with ReLU activation function, the
right column shows loss landscapes of neural networks with the hyperbolic tangent activation function. All neural
networks have five dimensional input, and one dimensional output. Moreover, from top to bottom the hidden layers
have sizes 1000, 20, 10, and the number of layers are 1, 4, 7.

157
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Chapter 13
Shape of neural network spaces

As we have seen in the previous chapter, the loss landscape of neural networks can be quite intricate and is typically
not convex. In some sense, the reason for this is that we take the point of view of a map from the parameterization of a
neural network. Let us consider a convex loss function L : R × R → R and a sample 𝑆 = (𝒙𝑖 , 𝑦 𝑖 )𝑖=1
𝑚 ∈ (R𝑑 × R) 𝑚 .

Then, for two neural networks Φ1 , Φ2 and for 𝛼 ∈ (0, 1) it holds that
𝑚
b𝑆 (𝛼Φ1 + (1 − 𝛼)Φ2 ) = 1
∑︁
R L (𝛼Φ1 (𝒙𝑖 ) + (1 − 𝛼)Φ2 (𝒙𝑖 ), 𝑦 𝑖 )
𝑚 𝑖=1
𝑚
1 ∑︁
≤ 𝛼L (Φ1 (𝒙𝑖 ), 𝑦 𝑖 ) + (1 − 𝛼)L (Φ2 (𝒙𝑖 ), 𝑦 𝑖 )
𝑚 𝑖=1
= 𝛼R
b𝑆 (Φ1 ) + (1 − 𝛼) R
b𝑆 (Φ2 ).

Hence, the empirical risk is convex when considered as a map depending on the neural network functions rather then
the neural network parameters. A convex function does not have spurious minima or saddle points. As a result, the
issues from the previous section are avoided if we take the perspective of neural network sets.
So why do we not optimize over the sets of neural networks instead of the parameters? To understand this, we will
now study the set of neural networks associated to a fixed architecture as a subset of other function spaces.
We start by investigating the realization map 𝑅 𝜎 introduced in Definition 12.1. Concretely, we show in Section 13.1,
that if 𝜎 is Lipschitz, then the set of neural networks is the image of PN (A, ∞) under a locally Lipschitz map. We will
use this fact to show in Section 13.2 that sets of neural networks are typically non-convex, and even have arbitrarily
large holes. Finally, in Section 13.3, we study the extent to which there exist best approximations to arbitrary functions,
in the set of neural networks. We will demonstrate that the lack of best approximations causes the weights of neural
networks to grow infinitely during training.

13.1 Lipschitz parameterizations

In this section, we study the realization map 𝑅 𝜎 . The main result is the following simplified version of [152, Proposition
4].

Proposition 13.1 Let A = (𝑑0 , 𝑑1 , . . . , 𝑑 𝐿+1 ) ∈ N 𝐿+2 , let 𝜎 : R → R be 𝐶 𝜎 -Lipschitz continuous with 𝐶 𝜎 ≥ 1, let
|𝜎(𝑥)| ≤ 𝐶 𝜎 |𝑥| for all 𝑥 ∈ R, and let 𝐵 ≥ 1.
Then, for all 𝜃, 𝜃 ′ ∈ PN (A, 𝐵),

∥𝑅 𝜎 (𝜃) − 𝑅 𝜎 (𝜃 ′ ) ∥ 𝐿 ∞ ( [−1,1] 𝑑0 ) ≤ (2𝐶 𝜎 𝐵𝑑max ) 𝐿 𝑛 A ∥𝜃 − 𝜃 ′ ∥ ∞ ,


Í𝐿
where 𝑑max = maxℓ=0,...,𝐿+1 𝑑ℓ and 𝑛 A = ℓ=0 𝑑ℓ+1 (𝑑ℓ + 1).

159
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Proof Let 𝜃, 𝜃 ′ ∈ PN (A, 𝐵) and define 𝛿 B ∥𝜃 − 𝜃 ′ ∥ ∞ . Repeatedly using the triangle inequality we find a sequence
(𝜃 𝑗 ) 𝑛𝑗=0
A
such that 𝜃 0 = 𝜃, 𝜃 𝑛A = 𝜃 ′ , ∥𝜃 𝑗 − 𝜃 𝑗+1 ∥ ∞ ≤ 𝛿, and 𝜃 𝑗 and 𝜃 𝑗+1 differ in one entry only for all 𝑗 = 0, . . . 𝑛 A − 1.
We conclude that for all 𝒙 ∈ [−1, 1] 𝑑0

A −1
𝑛∑︁

|𝑅 𝜎 (𝜃) (𝒙) − 𝑅 𝜎 (𝜃 ) (𝒙)| ≤ |𝑅 𝜎 (𝜃 𝑗 ) (𝒙) − 𝑅 𝜎 (𝜃 𝑗+1 ) (𝒙)|. (13.1.1)
𝑗=0

To upper bound (13.1.1), we now only need to understand the effect of changing one weight in a neural network by 𝛿.
Before we can complete the proof we need two auxiliary lemmas. The first of which holds under slightly weaker
assumptions of Proposition 13.1.

Lemma 13.2 Under the assumptions of Proposition 13.1, but with 𝐵 being allowed to be arbitrary positive, it holds
for all Φ ∈ N (𝜎; A, 𝐵)

∥Φ(𝒙) − Φ(𝒙 ′ ) ∥ ∞ ≤ 𝐶 𝜎
𝐿
· (𝐵𝑑max ) 𝐿+1 ∥𝒙 − 𝒙 ′ ∥ ∞ (13.1.2)

for all 𝒙, 𝒙 ′ ∈ R𝑑0 . □

Proof We start with the case, where 𝐿 = 1. Then, for (𝑑0 , 𝑑1 , 𝑑2 ) = A, we have that

Φ(𝒙) = W (1) 𝜎(W (0) 𝒙 + b (0) ) + b (1) ,

for certain W (0) , W (1) , b (0) , b (1) with all entries bounded by 𝐵. As a consequence, we can estimate
 
∥Φ(𝒙) − Φ(𝒙 ′ ) ∥ ∞ = W (1) 𝜎(W (0) 𝒙 + b (0) ) − 𝜎(W (0) 𝒙 ′ + b (0) )

(0) (0) (0) ′ (0)
≤ 𝑑1 𝐵 𝜎(W 𝒙+b ) − 𝜎(W 𝒙 +b )

≤ 𝑑1 𝐵𝐶 𝜎 W (0) (𝒙 − 𝒙 ′ )

≤ 𝑑1 𝑑0 𝐵2 𝐶 𝜎 ∥𝒙 − 𝒙 ′ ∥ ∞ ≤ 𝐶 𝜎 · (𝑑max 𝐵) 2 ∥𝒙 − 𝒙 ′ ∥ ∞ ,

where we used the Lipschitz property of 𝜎 and the fact that ∥A𝒙∥ ∞ ≤ 𝑛 max𝑖, 𝑗 | 𝐴𝑖 𝑗 |∥𝒙∥ ∞ for every matrix 𝑨 =
𝑚,𝑛
( 𝐴𝑖 𝑗 )𝑖=1, 𝑗=1 ∈ R
𝑚×𝑛 .

The induction step from 𝐿 to 𝐿 + 1 follows similarly. This concludes the proof of the lemma. □

Lemma 13.3 Under the assumptions of Proposition 13.1 it holds that

∥𝒙 (ℓ ) ∥ ∞ ≤ (2𝐶 𝜎 𝐵𝑑max ) ℓ for all 𝒙 ∈ [−1, 1] 𝑑0 . (13.1.3)

Proof Per Definitions (2.1.1b) and (2.1.1c), we have that for ℓ = 1, . . . , 𝐿 + 1

∥𝒙 (ℓ ) ∥ ∞ ≤ 𝐶 𝜎 W (ℓ −1) 𝒙 (ℓ −1) + b (ℓ −1)



(ℓ −1)
≤ 𝐶 𝜎 𝐵𝑑max ∥𝒙 ∥ ∞ + 𝐵𝐶 𝜎 ,

where we used the triangle inequality and the estimate ∥A𝒙∥ ∞ ≤ 𝑛 max𝑖, 𝑗 | 𝐴𝑖 𝑗 |∥𝒙∥ ∞ , which holds for every matrix
A ∈ R𝑚×𝑛 . We obtain that

∥𝒙 (ℓ ) ∥ ∞ ≤ 𝐶 𝜎 𝐵𝑑max · (1 + ∥𝒙 (ℓ −1) ∥ ∞ )
≤ 2𝐶 𝜎 𝐵𝑑max · (max{1, ∥𝒙 (ℓ −1) ∥ ∞ }).

Resolving the recursive estimate of ∥𝒙 (ℓ ) ∥ ∞ by 2𝐶 𝜎 𝐵𝑑max (max{1, ∥𝒙 (ℓ −1) ∥ ∞ }), we conclude that

160
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
∥𝒙 (ℓ ) ∥ ∞ ≤ (2𝐶 𝜎 𝐵𝑑max ) ℓ max{1, ∥𝒙 (0) ∥ ∞ } = (2𝐶 𝜎 𝐵𝑑max ) ℓ .

This concludes the proof of the lemma. □


We can now proceed with the proof of Proposition 13.1. Assume that 𝜃 𝑗+1 and 𝜃 𝑗 differ only in one entry. We assume
this entry to be in the ℓth layer, and we start with the case ℓ < 𝐿. It holds

|𝑅 𝜎 (𝜃 𝑗 ) (𝒙) − 𝑅 𝜎 (𝜃 𝑗+1 ) (𝒙)| = |Φℓ (𝜎(W (ℓ ) 𝒙 (ℓ ) + b (ℓ ) )) − Φℓ (𝜎(W (ℓ ) 𝒙 (ℓ ) + b (ℓ ) ))|,


(ℓ ) (ℓ )
where Φℓ ∈ N (𝜎; A ℓ , 𝐵) for A ℓ = (𝑑ℓ+1 , . . . , 𝑑 𝐿+1 ) and (W (ℓ ) , b (ℓ ) ), (W ,b ) differ in one entry only.
Using the Lipschitz continuity of Φℓ of Lemma 13.2, we have

|𝑅 𝜎 (𝜃 𝑗 ) (𝒙) − 𝑅 𝜎 (𝜃 𝑗+1 ) (𝒙)|


𝐿−ℓ −1
≤ 𝐶𝜎 (𝐵𝑑max ) 𝐿−ℓ |𝜎(W (ℓ ) 𝒙 (ℓ ) + b (ℓ ) ) − 𝜎(W (ℓ ) 𝒙 (ℓ ) + b (ℓ ) )|
𝐿−ℓ
≤ 𝐶𝜎 (𝐵𝑑max ) 𝐿−ℓ ∥W (ℓ ) 𝒙 (ℓ ) + b (ℓ ) − W (ℓ ) 𝒙 (ℓ ) + b (ℓ ) ∥ ∞
𝐿−ℓ
≤ 𝐶𝜎 (𝐵𝑑max ) 𝐿−ℓ 𝛿 max{1, ∥𝒙 (ℓ ) ∥ ∞ },

where 𝛿 B ∥𝜃 − 𝜃 ′ ∥ max . Invoking (13.3), we conclude that

|𝑅 𝜎 (𝜃 𝑗 ) (𝒙) − 𝑅 𝜎 (𝜃 𝑗+1 ) (𝒙)| ≤ (2𝐶 𝜎 𝐵𝑑max ) ℓ 𝐶 𝜎


𝐿−ℓ
· (𝐵𝑑max ) 𝐿−ℓ 𝛿
≤ (2𝐶 𝜎 𝐵𝑑max ) 𝐿 ∥𝜃 − 𝜃 ′ ∥ max .

For the case ℓ = 𝐿, a similar estimate can be shown. Combining this with (13.1.1) yields the result. □
Using Proposition 13.1, we can now consider the set of neural networks with a fixed architecture N (𝜎; A, ∞) as
a subset of 𝐿 ∞ ( [−1, 1] 𝑑0 ). What is more, is that N (𝜎; A, ∞) is the image of PN (A, ∞) under a locally Lipschitz
map.

13.2 Convexity of neural network spaces

As a first step towards understanding N (𝜎; A, ∞) as a subset of 𝐿 ∞ ( [−1, 1] 𝑑0 ), we notice that it is star-shaped with
few centers. Let us first introduce the necessary terminology.
Definition 13.4 Let 𝑍 be a subset of a linear space. A point 𝑥 ∈ 𝑍 is called a center of 𝑍 if, for every 𝑦 ∈ 𝑍 it holds
that
{𝑡𝑥 + (1 − 𝑡)𝑦 | 𝑡 ∈ [0, 1]} ⊆ 𝑍.
A set is called star-shaped if it has at least one center.
The following proposition follows directly from the definition of a neural network and is the content of Exercise
13.15.
Proposition 13.5 Let 𝐿 ∈ N and A = (𝑑0 , 𝑑1 , . . . , 𝑑 𝐿+1 ) ∈ N 𝐿+2 and 𝜎 : R → R. Then N (𝜎; A, ∞) is scaling
invariant, i.e. for every 𝜆 ∈ R it holds that 𝜆 𝑓 ∈ N (𝜎; A, ∞) if 𝑓 ∈ N (𝜎; A, ∞), and hence 0 ∈ N (𝜎; A, ∞) is a
center of N (𝜎; A, ∞).
Knowing that N (𝜎; A, 𝐵) is star-shaped with center 0, we can also ask ourselves if N (𝜎; A, 𝐵) has more than this
one center. It is not hard to see that also every constant function is a center. The following theorem yields an upper
bound on the number of linearly independent centers.
Theorem 13.6 ([152, Proposition C.4]) Let 𝐿 ∈ N and A = Í (𝑑0 , 𝑑1 , . . . , 𝑑 𝐿+1 ) ∈ N 𝐿+2 , and let 𝜎 : R → R be
𝐿
Lipschitz continuous. Then, N (𝜎; A, ∞) contains at most 𝑛 A = ℓ=0 (𝑑ℓ + 1)𝑑ℓ+1 linearly independent centers.

161
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
𝑛A +1
Proof Assume by contradiction, that there are functions (𝑔𝑖 )𝑖=1 ⊆ N (𝜎; A, ∞) ⊆ 𝐿 ∞ ( [−1, 1] 𝑑0 ) that are linearly
independent and centers of N (𝜎; A, ∞).
𝑛A +1
By the Theorem of Hahn-Banach, there exist (𝑔𝑖′ )𝑖=1 ⊆ (𝐿 ∞ ( [−1, 1] 𝑑0 )) ′ such that 𝑔𝑖′ (𝑔 𝑗 ) = 𝛿𝑖 𝑗 , for all 𝑖, 𝑗 ∈
{1, . . . , 𝐿 + 1}. We define

𝑔 ′ (𝑔)
© 𝑔1′ (𝑔) ª
­ 2
𝑇 : 𝐿 ∞ ( [−1, 1] 𝑑0 ) → R𝑛A +1 ,
®
𝑔 ↦→ ­­ .. ®.
®
­ . ®
𝑔 ′ (𝑔)
« 𝑛A +1 ¬
Since 𝑇 is continuous and linear, we have that 𝑇 ◦ 𝑅 𝜎 is locally Lipschitz continuous by Proposition 13.1. Moreover,
𝑛A +1 𝑛A +1 𝑛A +1
since the (𝑔𝑖 )𝑖=1 are linearly independent, we have that 𝑇 (span((𝑔𝑖 )𝑖=1 )) = R𝑛A +1 . We denote 𝑉 B span((𝑔𝑖 )𝑖=1 ).
Next, we would like to establish that N (𝜎; A, ∞) ⊃ 𝑉. Let 𝑔 ∈ 𝑉 then

A +1
𝑛∑︁
𝑔= 𝑎 ℓ 𝑔ℓ ,
ℓ=1

for some 𝑎 1 , . . . , 𝑎 𝑛A +1 ∈ R. We show by induction that 𝑔˜ (𝑚) B ℓ=1


Í𝑚
𝑎 ℓ 𝑔ℓ ∈ N (𝜎; A, ∞) for every 𝑚 ≤ 𝑛 A + 1. This
is obviously true for 𝑚 = 1. Moreover, we have that 𝑔˜ (𝑚+1) = 𝑎 𝑚+1 𝑔𝑚+1 + 𝑔˜ (𝑚) . Hence, the induction step holds true if
𝑎 𝑚+1 = 0. If 𝑎 𝑚+1 ≠ 0, then we have that
 
1 1
𝑔 (𝑚+1) = 2𝑎 𝑚+1 · 𝑔𝑚+1 + 𝑔 (𝑚) .
e (13.2.1)
2 2𝑎 𝑚+1

By the induction assumption e 𝑔 (𝑚) ∈ N (𝜎; A, ∞) and hence by Proposition 13.5 e 𝑔 (𝑚) /(𝑎 𝑚+1 ) ∈ N (𝜎; A, ∞).
1 1 (𝑚)
Additionally, since 𝑔𝑚+1 is a center of N (𝜎; A, ∞), we have that 2 𝑔𝑚+1 + 2𝑎𝑚+1 e 𝑔 ∈ N (𝜎; A, ∞). By Proposition
13.5, we conclude that e𝑔 (𝑚+1) ∈ N (𝜎; A, ∞).
The induction shows that 𝑔 ∈ N (𝜎; A, ∞) and thus 𝑉 ⊆ N (𝜎; A, ∞). As a consequence, 𝑇 ◦ 𝑅 𝜎 (PN (A, ∞)) ⊇
𝑇 (𝑉) = R𝑛A +1 .
It is a well known fact of basic analysis that for every for 𝑛 ∈ N there does not exist a surjective and locally Lipschitz
continuous map from R𝑛 to R𝑛+1 . We recall that 𝑛 A = dim(PN (A, ∞)). This yields the contradiction. □

For a convex set 𝑋, the line between all two points of 𝑋 is a subset of 𝑋. Hence, every point of a convex set is a
center. This yields the following corollary.

Corollary 13.7 Let A = (𝑑0 , 𝑑1 , . . . , 𝑑 𝐿+1 ), let, and let 𝜎 : R → R be Lipschitz continuous. If N (𝜎; A, ∞) contains
Í𝐿
more than 𝑛 A = ℓ=0 (𝑑ℓ + 1)𝑑ℓ+1 linearly independent functions, then N (𝜎; A, ∞) is not convex.

Corollary 13.7 tells us that we cannot expect convex sets of neural networks, if the set of neural networks has many
linearly independent elements. Sets of neural networks contain for each 𝑓 ∈ N (𝜎; A, ∞) also all shifts of this function,
i.e., 𝑓 (· + 𝒃) for a 𝒃 ∈ R𝑑 are elements of 𝑓 ∈ N (𝜎; A, ∞). For a set of functions, being shift invariant and having only
finitely many linearly independent functions at the same time, is a very restrictive condition. Indeed, it was shown in
[152, Proposition C.6] that if N (𝜎; A, ∞) has only finitely many linearly independent functions and 𝜎 is differentiable
in at least one point and has non-zero derivative there, then 𝜎 is necessarily a polynomial.
We conclude that the set of neural networks is in general non-convex and star-shaped with 0 and constant functions
being centers. One could visualize this set in 3D as in Figure 13.1.
The fact, that the neural network space is not convex, could also mean that it merely fails to be convex at one point.
For example R2 \ {0} is not convex, but for an optimization algorithm this would likely not pose a problem.
We will next observe that N (𝜎; A, ∞) does not have such a benign non-convexity and in fact, has arbitrarily large
holes.
To make this claim mathematically precise, we first introduce the notion of 𝜀-convexity.

162
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Fig. 13.1: Sketch of the space of neural networks in 3D. The vertical axis corresponds to the constant neural network
functions, each of which is a center. The set of neural networks consists of many low-dimensional linear subspaces
spanned by certain neural networks (Φ1 , . . . , Φ6 in this sketch) and linear functions. Between these low-dimensional
subspaces, there is not always a straight-line connection by Corollary 13.7 and Theorem 13.9.

Definition 13.8 For 𝜀 > 0, we say that a subset 𝐴 of a normed vector space 𝑋 is 𝜀-convex if

co( 𝐴) ⊆ 𝐴 + 𝐵 𝜀 (0),

where co( 𝐴) denotes the convex hull of 𝐴 and 𝐵 𝜀 (0) is an 𝜀 ball around 0 with respect to the norm of 𝑋.

Intuitively speaking, a set that is convex when one fills up all holes smaller than 𝜀 is 𝜀-convex. Now we show that
there is no 𝜀 > 0 such that N (𝜎; A, ∞) is 𝜀-convex.

Theorem 13.9 Let 𝐿 ∈ N and A = (𝑑0 , 𝑑1 , . . . , 𝑑 𝐿 , 1) ∈ N 𝐿+2 . Let 𝐾 ⊆ R𝑑0 be compact and let 𝜎 ∈ M, with M as in
(3.1.1) and assume that 𝜎 is not a polynomial. Moreover, assume that there exists an open set, where 𝜎 is differentiable
and not constant.
If there exists an 𝜀 > 0 such that N (𝜎; A, ∞) is 𝜀-convex, then N (𝜎; A, ∞) is dense in 𝐶 (𝐾).

Proof Step 1. We show that 𝜀-convexity implies N (𝜎; A, ∞) to be convex. By Proposition 13.5, we have that
N (𝜎; A, ∞) is scaling invariant. This implies that co(N (𝜎; A, ∞)) is scaling invariant as well. Hence, if there exists
𝜀 > 0 such that N (𝜎; A, ∞) is 𝜀-convex, then for every 𝜀 ′ > 0

𝜀′ 𝜀′
co(N (𝜎; A, ∞)) = co(N (𝜎; A, ∞)) ⊆ (N (𝜎; A, ∞) + 𝐵 𝜀 (0))
𝜀 𝜀
= N (𝜎; A, ∞) + 𝐵 𝜀 ′ (0).

This yields that N (𝜎; A, ∞) is 𝜀 ′ -convex. Since 𝜀 ′ was arbitrary, we have that N (𝜎; A, ∞) is 𝜀-convex for all 𝜀 > 0.
As a consequence, we have that

163
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Ù
co(N (𝜎; A, ∞)) ⊆ (N (𝜎; A, ∞) + 𝐵 𝜀 (0))
𝜀>0
Ù
⊆ (N (𝜎; A, ∞) + 𝐵 𝜀 (0)) = N (𝜎; A, ∞).
𝜀>0

Hence, co(N (𝜎; A, ∞)) ⊆ N (𝜎; A, ∞) and, by the well-known fact that in every metric vector space co( 𝐴) ⊆ co( 𝐴),
we conclude that N (𝜎; A, ∞) is convex.
Step 2. We show that N𝑑1 (𝜎; 1) ⊆ N (𝜎; A, ∞). If N (𝜎; A, ∞) is 𝜀-convex, then by Step 1 N (𝜎; A, ∞) is convex.
The scaling invariance of N (𝜎; A, ∞) then shows that N (𝜎; A, ∞) is a closed linear subspace of 𝐶 (𝐾).
Note that, by Proposition 3.19 for every 𝒘 ∈ R𝑑0 and 𝑏 ∈ R there exists a function 𝑓 ∈ N (𝜎; A, ∞) such that

𝑓 (𝒙) = 𝜎(𝒘 ⊤ 𝒙 + 𝑏) for all 𝒙 ∈ 𝐾. (13.2.2)

One consequence of the definition of neural networks is that a constant function is an element of N (𝜎; A, ∞). Since
N (𝜎; A, ∞) is a subspace, this implies that all constant functions are in N (𝜎; A, ∞).
Since N (𝜎; A, ∞) is a closed vector space, this implies that for all 𝑛 ∈ N and all 𝒘1(1) , . . . , 𝒘𝑛(1) ∈ R𝑑0 ,
𝑤1(2) , . . . , 𝑤𝑛(2) ∈ R, 𝑏 1(1) , . . . , 𝑏 𝑛(1) ∈ R, 𝑏 (2) ∈ R
𝑛
∑︁
𝒙 ↦→ 𝑤𝑖(2) 𝜎((𝒘𝑖(1) ) ⊤ 𝒙 + 𝑏 𝑖(1) ) + 𝑏 (2) ∈ N (𝜎; A, ∞). (13.2.3)
𝑖=1

Step 3. From (13.2.3), we conclude that N𝑑1 (𝜎; 1) ⊆ N (𝜎; A, ∞). In words, the whole set of shallow neural
networks of arbitrary width is contained in the closure of the set of neural networks with a fixed architecture. By
Theorem 3.8, we have that N𝑑1 (𝜎; 1) is dense in 𝐶 (𝐾), which yields the result. □
For any activation function of practical relevance, a set of neural networks with fixed architecture is not dense in
𝐶 (𝐾). This is only the case for very strange activation functions such as the one discussed in Subsection 3.2. Hence,
Theorem 13.9 shows that in general, sets of neural networks of fixed architectures have arbitrarily large holes.

13.3 Closedness and best-approximation property

The non-convexity of the set of neural networks can have some serious consequences for the way we think of the
approximation or learning problem by neural networks.
Consider A = (𝑑0 , . . . , 𝑑 𝐿+1 ) ∈ N 𝐿+2 and an activation function 𝜎. Let 𝐻 be a normed function space on [−1, 1] 𝑑0
such that N (𝜎; A, ∞) ⊆ 𝐻. For ℎ ∈ 𝐻 we would like to find a neural network that best approximates ℎ, i.e. to find
Φ ∈ N (𝜎; A, ∞) such that

∥Φ − ℎ∥ 𝐻 = inf ∥Φ∗ − ℎ∥ 𝐻 . (13.3.1)


Φ∗ ∈ N ( 𝜎;A,∞)

We say that N (𝜎; A, ∞) ⊆ 𝐻 has


• the best approximation property, if for all ℎ ∈ 𝐻 there exists at least one Φ ∈ N (𝜎; A, ∞) such that (13.3.1)
holds,
• the unique best approximation property, if for all ℎ ∈ 𝐻 there exists exactly one Φ ∈ N (𝜎; A, ∞) such that
(13.3.1) holds,
• the continuous selection property, if there exists a continuous function 𝜙 : 𝐻 → N (𝜎; A, ∞) such that Φ = 𝜙(ℎ)
satisfies (13.3.1) for all ℎ ∈ 𝐻.

164
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
We will see in the sequel, that, in the absence of the best approximation property, we will be able to prove that the
learning problem necessarily requires the weights of the neural networks to tend to infinity, which may or may not be
desirable in applications.
Moreover, having a continuous selection procedure is desirable as it implies the existence of a stable selection
algorithm; that is, an algorithm which, for similar problems yields similar neural networks satisfying (13.3.1).
Below, we will study the properties above for 𝐿 𝑝 spaces, 𝑝 ∈ [1, ∞). As we will see, neural network classes typically
neither satisfy the continuous selection nor the best approximation property.

13.3.1 Continuous selection

As shown in [96], neural network spaces essentially never admit the continuous selection property. To give the argument,
we first recall the following result from [96, Theorem 3.4] without proof.
Theorem 13.10 Let 𝑝 ∈ (1, ∞). Every subset of 𝐿 𝑝 ( [−1, 1] 𝑑0 ) with the unique best approximation property is convex.
This allows to show the next proposition.

Proposition 13.11 Let 𝐿 ∈ N, A = (𝑑0 , 𝑑1 , . . . , 𝑑 𝐿+1 ) ∈ N 𝐿+2 , let 𝜎 : R → R be Lipschitz continuous and not a
polynomial, and let 𝑝 ∈ (1, ∞).
Then, N (𝜎; A, ∞) ⊆ 𝐿 𝑝 ( [−1, 1] 𝑑0 ) does not have the continuous selection property.

Proof We observe from Theorem 13.6 and the discussion below, that under the assumptions of this result, N (𝜎; A, ∞)
is not convex.
We conclude that N (𝜎; A, ∞) does not have the unique best approximation property. Moreover, if the set
N (𝜎; A, ∞) does not have the best approximation property, then it is obvious that it cannot have continuous se-
lection. Thus, we can assume without loss of generality, that N (𝜎; A, ∞) has the best approximation property and
there exists a point ℎ ∈ 𝐿 𝑝 ( [−1, 1] 𝑑0 ) and two different Φ1 ,Φ2 such that

∥Φ1 − ℎ∥ 𝐿 𝑝 = ∥Φ2 − ℎ∥ 𝐿 𝑝 = inf ∥Φ∗ − ℎ∥ 𝐿 𝑝 . (13.3.2)


Φ∗ ∈ N ( 𝜎;A,∞)

Note that (13.3.2) implies that ℎ ∉ N (𝜎; A, ∞).


Let us consider the following function:

(1 + 𝜆)ℎ − 𝜆Φ1 for 𝜆 ≤ 0,
[−1, 1] ∋ 𝜆 ↦→ 𝑃(𝜆) =
(1 − 𝜆)ℎ + 𝜆Φ2 for 𝜆 ≥ 0.

It is clear that 𝑃(𝜆) is a continuous path in 𝐿 𝑝 . Moreover, for 𝜆 ∈ (−1, 0)

∥Φ1 − 𝑃(𝜆) ∥ 𝐿 𝑝 = (1 + 𝜆) ∥Φ1 − ℎ∥ 𝐿 𝑝 .

Assume towards a contradiction, that there exists Φ∗ ≠ Φ1 such that for 𝜆 ∈ (−1, 0)

∥Φ∗ − 𝑃(𝜆) ∥ 𝐿 𝑝 ≤ ∥Φ1 − 𝑃(𝜆) ∥ 𝐿 𝑝 .

Then

∥Φ∗ − ℎ∥ 𝐿 𝑝 ≤ ∥Φ∗ − 𝑃(𝜆) ∥ 𝐿 𝑝 + ∥𝑃(𝜆) − ℎ∥ 𝐿 𝑝


≤ ∥Φ1 − 𝑃(𝜆) ∥ 𝐿 𝑝 + ∥𝑃(𝜆) − ℎ∥ 𝐿 𝑝
= (1 + 𝜆) ∥Φ1 − ℎ∥ 𝐿 𝑝 + |𝜆|∥Φ1 − ℎ∥ 𝐿 𝑝 = ∥Φ1 − ℎ∥ 𝐿 𝑝 . (13.3.3)

Since Φ1 is a best approximation to ℎ this implies that every inequality in the estimate above is an equality. Hence, we
have that

165
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
∥Φ∗ − ℎ∥ 𝐿 𝑝 = ∥Φ∗ − 𝑃(𝜆) ∥ 𝐿 𝑝 + ∥𝑃(𝜆) − ℎ∥ 𝐿 𝑝 .

However, in a strictly convex space like 𝐿 𝑝 ( [−1, 1] 𝑑0 ) for 𝑝 > 1 this implies that

Φ∗ − 𝑃(𝜆) = 𝑐 · (𝑃(𝜆) − ℎ)

for a constant 𝑐 ≠ 0. This yields that


Φ∗ = ℎ + (𝑐 + 1)𝜆 · (ℎ − Φ1 )
and plugging into (13.3.3) yields |(𝑐 +1)𝜆| = 1. If (𝑐 +1)𝜆 = −1, then we have Φ∗ = Φ1 which produces a contradiction.
If (𝑐 + 1)𝜆 = 1, then

∥Φ∗ − 𝑃(𝜆) ∥ 𝐿 𝑝 = ∥2ℎ − Φ1 − (1 + 𝜆)ℎ + 𝜆Φ1 ∥ 𝐿 𝑝


= ∥ (1 − 𝜆)ℎ − (1 − 𝜆)Φ1 ∥ 𝐿 𝑝 > ∥𝑃(𝜆) − Φ1 ∥ 𝐿 𝑝 ,

which is another contradiction.


Hence, for every 𝜆 < 0 we have that Φ1 is the unique minimizer to 𝑃(𝜆) in N (𝜎; A, ∞). The same argument holds
for 𝜆 > 0 and Φ2 . We conclude that for every selection function 𝜙 : 𝐿 𝑝 ( [−1, 1] 𝑑0 ) → N (𝜎; A, ∞) such that Φ = 𝜙(ℎ)
satisfies (13.3.1) for all ℎ ∈ 𝐿 𝑝 ( [−1, 1] 𝑑0 ) it holds that

lim 𝜙(𝑃(𝜆)) = Φ2 ≠ Φ1 = lim 𝜙(𝑃(𝜆)).


𝜆↓0 𝜆↑0

As a consequence, 𝜙 is not continuous, which shows the result. □

13.3.2 Existence of best approximations

We have seen in Proposition 13.11 that under very mild assumptions, the continuous selection property cannot hold.
Moreover, the next result shows that in many cases, also the best approximation property fails to be satisfied. We
provide below a simplified version of [152, Theorem 3.1]. We also refer to [61] for earlier work on this problem.

Proposition 13.12 Let A = (1, 2, 1) and let 𝜎 : R → R be Lipschitz continuous. Additionally assume that there exist
𝑟 > 0 and 𝛼′ ≠ 𝛼 such that 𝜎 is differentiable for all |𝑥| > 𝑟 and 𝜎 ′ (𝑥) → 𝛼 for 𝑥 → ∞, 𝜎 ′ (𝑥) → 𝛼′ for 𝑥 → −∞.
Then, there exists a sequence in N (𝜎; A, ∞) which converges in 𝐿 𝑝 ( [−1, 1] 𝑑0 ), for every 𝑝 ∈ (1, ∞), and the limit
of this sequence is discontinuous. In particular, the limit of the sequence does not lie in N (𝜎; A ′ , ∞) for any A ′ .

Proof For all 𝑛 ∈ N let

𝑓𝑛 (𝑥) = 𝜎(𝑛𝑥 + 1) − 𝜎(𝑛𝑥) for all 𝑥 ∈ R.

Then 𝑓𝑛 can be written as a neural network with architecture A = (1, 2, 1). Moreover, for 𝑥 > 0 we observe with the
fundamental theorem of calculus and using integration by substitution that
∫ 𝑥+1/𝑛 ∫ 𝑛𝑥+1
𝑓𝑛 (𝑥) = 𝑛𝜎 ′ (𝑛𝑧)𝑑𝑧 = 𝜎 ′ (𝑧)𝑑𝑧. (13.3.4)
𝑥 𝑛𝑥

It is not hard to see that the right hand side of (13.3.4) converges to 𝛼 for 𝑛 → ∞.
Similarly, for 𝑥 < 0, we observe that 𝑓𝑛 (𝑥) converges to 𝛼′ for 𝑛 → ∞. We conclude that

𝑓𝑛 → 𝛼1R+ + 𝛼′ 1R−

almost everywhere. Since 𝜎 is Lipschitz continuous, we have that 𝑓𝑛 is bounded. Therefore, we conclude that 𝑓𝑛 →
𝛼1R+ + 𝛼′ 1R− in 𝐿 𝑝 for all 𝑝 ∈ [1, ∞) by the dominated convergence theorem. □

166
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
There is a straight-forward extension of Proposition 13.12 to arbitrary architectures, that will be the content of
Exercises 13.16 and 13.17.

Remark 13.13 The proof of Theorem 13.12 does not extend to the 𝐿 ∞ norm. This, of course, does not mean that
generally N (𝜎; A, ∞) is a closed set in 𝐿 ∞ ( [−1, 1] 𝑑0 ). In fact, almost all activation functions used in practice still
give rise to non-closed neural network sets, see [152, Theorem 3.3]. However, there is one notable exception. For the
ReLU activation function, it can be shown that N (𝜎ReLU ; A, ∞) is a closed set in 𝐿 ∞ ( [−1, 1] 𝑑0 ) if A has only one
hidden layer. The closedness of deep ReLU spaces in 𝐿 ∞ is an open problem.

13.3.3 Exploding weights phenomenon

Finally, we discuss one of the consequences of the non-existence of best approximations of Proposition 13.12.
Consider a regression problem, where we aim to learn a function 𝑓 using neural networks with a fixed architecture
N (A; 𝜎, ∞). As discussed in the Chapters 10 and 11, we wish to produce a sequence of neural networks (Φ𝑛 ) 𝑛=1 ∞ such

that the risk defined in (1.2.4) converges to 0. If the loss L is the squared loss, 𝜇 is a probability measure on [−1, 1] 𝑑0 ,
and the data is given by (𝒙, 𝑓 (𝒙)) for 𝒙 ∼ 𝜇, then

R (Φ𝑛 ) = ∥Φ𝑛 − 𝑓 ∥ 2𝐿 2 ( [−1,1] 𝑑0 ,𝜇)


∫ (13.3.5)
= |Φ𝑛 (𝒙) − 𝑓 (𝒙)| 2 𝑑𝜇(𝒙) → 0 for 𝑛 → ∞.
[ −1,1] 𝑑0

According to Proposition 13.12, for a given A, and an activation function 𝜎, it is possible that (13.3.5) holds, but
𝑓 ∉ N (𝜎; A, ∞). The following result shows that in this situation, the weights of Φ𝑛 diverge.

Proposition 13.14 Let A = (𝑑0 , 𝑑1 , . . . , 𝑑 𝐿+1 ) ∈ N 𝐿+2 , let 𝜎 : R → R be Lipschitz continuous with 𝐶 𝜎 ≥ 1, and
|𝜎(𝑥)| ≤ 𝐶 𝜎 |𝑥| for all 𝑥 ∈ R, and let 𝐵 > 0 and let 𝜇 be a measure on [−1, 1] 𝑑0 .
Assume that there exists a sequence Φ𝑛 ∈ N (𝜎; A, ∞) and 𝑓 ∈ 𝐿 2 ( [−1, 1] 𝑑0 , 𝜇) \ N (𝜎; A, ∞) such that

∥Φ𝑛 − 𝑓 ∥ 2𝐿 2 ( [−1,1] 𝑑0 ,𝜇) → 0. (13.3.6)

Then
n o
lim sup max ∥𝑾𝑛(ℓ ) ∥ ∞ , ∥𝒃 𝑛(ℓ ) ∥ ∞ ℓ = 0, . . . 𝐿 = ∞. (13.3.7)
𝑛→∞

Proof We assume towards a contradiction that the left-hand side of (13.3.7) is finite. As a result, there exists 𝐶 > 0
such that Φ𝑛 ∈ N (𝜎; A, 𝐶) for all 𝑛 ∈ N.
By Proposition 13.1, we conclude that N (𝜎; A, 𝐶) is the image of a compact set under a continuous map and hence
is itself a compact set in 𝐿 2 ( [−1, 1] 𝑑0 , 𝜇). In particular, we have that N (𝜎; A, 𝐶) is closed. Hence, (13.3.6) implies
𝑓 ∈ N (𝜎; A, 𝐶). This gives a contradiction. □
Proposition 13.14 can be extended to all 𝑓 for which there is no best approximation in N (𝜎; A, ∞), see Exercise
13.18. The results imply that for functions we wish to learn that lack a best approximation within a neural network set,
we must expect the weights of the approximating neural networks to grow to infinity. This can be undesirable because,
as we will see in the following sections on generalization, a bounded parameter space facilitates many generalization
bounds.

167
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Bibliography and further reading

The properties of neural network sets were first studied with a focus on the continuous approximation property in
[96, 98, 97] and [61]. The results in [96, 97, 98] already use the non-convexity of sets of shallow neural networks.
The results on convexity and closedness presented in this chapter follow mostly the arguments of [152]. Similar results
were also derived for other norms [124].

168
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Exercises

Exercise 13.15 Prove Proposition 13.5.

Exercise 13.16 Extend Proposition 13.12 to architectures A = (𝑑0 , 𝑑1 , 1) for arbitrary 𝑑0 , 𝑑1 ∈ N, 𝑑1 ≥ 2.

Exercise 13.17 Use Proposition 3.19, to extend Proposition 13.12 to arbitrary depth.

Exercise 13.18 Extend Proposition 13.14 to functions 𝑓 for which there is no best-approximation in N (𝜎; A, ∞). To
do this, replace (13.3.6) by
∥Φ𝑛 − 𝑓 ∥ 2𝐿 2 → inf ∥Φ − 𝑓 ∥ 2𝐿 2 .
Φ∈ N ( 𝜎;A,∞)

169
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
[Draft of June 26, 2024. Not for dissemination. Only to be used in Philipp Petersen’s lecture]
Chapter 14
Generalization properties of deep neural networks

As discussed in the introduction in Section 1.2, we generally learn based on a finite data set. For example, given data $(x_i, y_i)_{i=1}^m$, we try to find a network $\Phi$ that satisfies $\Phi(x_i) = y_i$ for $i = 1, \ldots, m$. The field of generalization is concerned with how well such a $\Phi$ performs on unseen data, that is, on any $x$ outside of the training data $\{x_1, \ldots, x_m\}$. In this chapter we discuss generalization through the use of covering numbers.
In Sections 14.1 and 14.2 we revisit and formalize the general setup of learning and empirical risk minimization in
a general context. Although some notions introduced in these sections have already appeared in the previous chapters,
we reintroduce them here for a more coherent presentation. In Sections 14.3-14.5, we first discuss the concepts of
generalization bounds and covering numbers, and then apply these arguments specifically to neural networks. In Section
14.6 we explore the so-called “approximation-complexity trade-off”, and finally in Sections 14.7-14.8 we introduce the
“VC dimension” and give some implications for classes of neural networks.

14.1 Learning setup

A general learning problem [130, 188, 41] requires a feature space $X$ and a label space $Y$, which we assume throughout to be measurable spaces. We observe joint data pairs $(x_i, y_i)_{i=1}^m \subseteq X \times Y$, and aim to identify a connection between the $x$ and $y$ variables. Specifically, we assume a relationship between features $x$ and labels $y$ modeled by a probability distribution $\mathcal{D}$ over $X \times Y$ that generated the observed data $(x_i, y_i)_{i=1}^m$. While this distribution is unknown, our goal is to extract information about it from the data, so that we can make predictions of $y$ for a given $x$ that are as good as possible. Importantly, the relationship between $x$ and $y$ need not be deterministic.
To make these concepts more concrete, we next present an example that will serve as the running example throughout
this chapter. This example is of high relevance for many mathematicians, as ensuring a steady supply of high-quality
coffee is essential for maximizing the output of our mathematical work.
Example 14.1 (Coffee Quality) Our goal is to determine the quality of different coffees. To this end we model the
quality as a number in
$Y = \big\{ \tfrac{0}{10}, \tfrac{1}{10}, \ldots, \tfrac{10}{10} \big\},$
with higher numbers indicating better quality. Let us assume that our subjective assessment of quality of coffee is
related to six features: “Acidity”, “Caffeine content”, “Price”, “Aftertaste”, “Roast level”, and “Origin”. The feature
space 𝑋 thus corresponds to the set of six-tuples describing these attributes, which can be either numeric or categorical
(see Figure 14.1).
We aim to understand the relationship between elements of 𝑋 and elements of 𝑌 , but we can neither afford, nor
do we have the time to taste all the coffees in the world. Instead, we can sample some coffees, taste them, and grow
our database accordingly as depicted in Figure 14.1. This way we obtain samples of pairs in 𝑋 × 𝑌 . The distribution
D from which they are drawn depends on various external factors. For instance, we might have avoided particularly
cheap coffees, believing them to be inferior. As a result they do not occur in our database. Moreover, if a colleague

Fig. 14.1: Collection of coffee data. The last row lacks a “Quality” label. Our aim is to predict the label without the
need for an (expensive) taste test.

contributes to our database, he might have tried the same brand and arrived at a different rating. In this case, the quality
label is not deterministic anymore.
Based on our database, we wish to predict the quality of an untasted coffee. Before proceeding, we first formalize
what it means to be a “good” prediction.

Characterizing how good a predictor is requires a notion of discrepancy in the label space. This is the purpose of
the so-called loss function, which is a measurable mapping L : 𝑌 × 𝑌 → R+ .
Definition 14.2 Let L : 𝑌 × 𝑌 → R+ be a loss function and let D be a distribution on 𝑋 × 𝑌 . For a measurable function
ℎ : 𝑋 → 𝑌 we call

R (ℎ) = E ( 𝑥,𝑦)∼D [L (ℎ(𝑥), 𝑦)]

the risk of ℎ.
Based on the risk, we can now formalize what we consider a good predictor. The best predictor is one such that its risk
is as close as possible to the smallest that any function can achieve. More precisely, we would like a risk that is close
to the so-called Bayes risk

$R^* := \inf_{h : X \to Y} \mathcal{R}(h),$   (14.1.1)

where the infimum is taken over all measurable $h : X \to Y$.

Example 14.3 (Loss functions) The choice of a loss function L usually depends on the application. For a regression
problem, i.e., a learning problem where 𝑌 is a non-discrete subset of a Euclidean space, a common choice is the square
loss L2 (𝒚, 𝒚 ′ ) = ∥ 𝒚 − 𝒚 ′ ∥ 2 .
For binary classification problems, i.e. when 𝑌 is a discrete set of cardinality two, the “0 − 1 loss”
$\mathcal{L}_{0-1}(y, y') = \begin{cases} 1 & \text{if } y \neq y' \\ 0 & \text{if } y = y' \end{cases}$

is a common choice.
Another frequently used loss for binary classification, especially when we want to predict probabilities (i.e., if
𝑌 = [0, 1] but all labels are binary), is the binary cross-entropy loss

Lce (𝑦, 𝑦 ′ ) = −(𝑦 log(𝑦 ′ ) + (1 − 𝑦) log(1 − 𝑦 ′ )).

In contrast to the 0 − 1 loss, the cross-entropy loss is differentiable, which is desirable in deep learning as we saw in
Chapter 10.

In the coffee quality prediction problem, the quality is given as a fraction of the form 𝑘/10 for 𝑘 = 0, . . . , 10.
While this is a discrete set, it makes sense to more heavily penalize predictions that are wrong by a larger amount. For
example, predicting 4/10 instead of 8/10 should produce a higher loss than predicting 7/10. Hence, we would not use
the 0 − 1 loss but, for example, the square loss.
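To make the loss functions of Example 14.3 concrete, the following sketch implements them in NumPy. This is only an illustrative sketch; in particular, the clipping constant eps in the cross-entropy is our own choice to avoid evaluating log(0).

import numpy as np

def square_loss(y, y_pred):
    # L2(y, y') = ||y - y'||^2
    return np.sum((np.atleast_1d(y) - np.atleast_1d(y_pred)) ** 2)

def zero_one_loss(y, y_pred):
    # L_{0-1}(y, y') = 1 if y != y', else 0
    return float(y != y_pred)

def binary_cross_entropy(y, y_pred, eps=1e-12):
    # L_ce(y, y') = -(y log(y') + (1 - y) log(1 - y'))
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # eps is our own choice to avoid log(0)
    return -(y * np.log(y_pred) + (1.0 - y) * np.log(1.0 - y_pred))

print(square_loss(0.8, 0.4))   # predicting 4/10 instead of 8/10 ...
print(square_loss(0.8, 0.7))   # ... is penalized more than predicting 7/10
print(zero_one_loss(1, 0), binary_cross_entropy(1.0, 0.9))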
How do we find a function ℎ : 𝑋 → 𝑌 with a risk that is as close as possible to the Bayes risk? We will introduce a
procedure to tackle this task in the next section.

14.2 Empirical risk minimization

Finding a minimizer of the risk constitutes a considerable challenge. First, we cannot search through all measurable
functions. Therefore, we need to restrict ourselves to a specific set H ⊆ {ℎ : 𝑋 → 𝑌 } called the hypothesis set. In the
following, this set will be some set of neural networks. Second, we are faced with the problem that we cannot evaluate
R (ℎ) for non-trivial loss functions, because the distribution D is unknown. To approximate the risk, we will assume
access to an i.i.d. sample of 𝑚 observations drawn from D. This is precisely the situation described in the coffee quality
example of Figure 14.1, where 𝑚 = 6 coffees were sampled.1 So for a given hypothesis ℎ we can check how well it
performs on our sampled data. We call the error on the sample the empirical risk.
Definition 14.4 Let $m \in \mathbb{N}$, let $\mathcal{L} : Y \times Y \to \mathbb{R}$ be a loss function and let $S = (x_i, y_i)_{i=1}^m \in (X \times Y)^m$ be a sample. For $h : X \to Y$, we call

$\widehat{\mathcal{R}}_S(h) = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(h(x_i), y_i)$

the empirical risk of $h$.


If the sample $S$ is drawn i.i.d. according to $\mathcal{D}$, then we immediately see from the linearity of the expected value that $\widehat{\mathcal{R}}_S(h)$ is an unbiased estimator of $\mathcal{R}(h)$, i.e., $\mathbb{E}_{S \sim \mathcal{D}^m}[\widehat{\mathcal{R}}_S(h)] = \mathcal{R}(h)$. Moreover, the weak law of large numbers states that the sample mean of an i.i.d. sequence of integrable random variables converges to the expected value in probability. Hence, there is some hope that, at least for large $m \in \mathbb{N}$, minimizing the empirical risk instead of the actual risk might lead to a good hypothesis. We formalize this approach in the next definition.
Definition 14.5 Let $\mathcal{H} \subseteq \{h : X \to Y\}$ be a hypothesis set. Let $m \in \mathbb{N}$, let $\mathcal{L} : Y \times Y \to \mathbb{R}$ be a loss function and let $S = (x_i, y_i)_{i=1}^m \in (X \times Y)^m$ be a sample. We call a function $h_S$ such that

$\widehat{\mathcal{R}}_S(h_S) = \inf_{h \in \mathcal{H}} \widehat{\mathcal{R}}_S(h)$   (14.2.1)

an empirical risk minimizer.


From a generalization perspective, deep learning is empirical risk minimization over sets of neural networks. The
question we want to address next is how effective this approach is at producing hypotheses that achieve a risk close to
the Bayes risk.
Let H be some hypothesis set, such that an empirical risk minimizer ℎ 𝑆 exists for all 𝑆 ∈ (𝑋 × 𝑌 ) 𝑚 ; see Exercise
14.25 for an explanation of why this is a reasonable assumption. Moreover, let ℎ∗ ∈ H be arbitrary. Then

$\mathcal{R}(h_S) - R^* = \mathcal{R}(h_S) - \widehat{\mathcal{R}}_S(h_S) + \widehat{\mathcal{R}}_S(h_S) - R^* \leq |\mathcal{R}(h_S) - \widehat{\mathcal{R}}_S(h_S)| + \widehat{\mathcal{R}}_S(h^*) - R^* \leq 2 \sup_{h \in \mathcal{H}} |\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| + \mathcal{R}(h^*) - R^*,$   (14.2.2)

1 In practice, the assumption of independence of the samples is often unclear and typically not satisfied. For instance, the selection of the
six previously tested coffees might be influenced by external factors such as personal preferences or availability at the local store, which
introduce bias into the dataset.

where in the first inequality we used that $h_S$ is an empirical risk minimizer. By taking the infimum over all $h^*$, we conclude that

$\mathcal{R}(h_S) - R^* \leq 2 \sup_{h \in \mathcal{H}} |\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| + \inf_{h \in \mathcal{H}} \mathcal{R}(h) - R^* =: 2\varepsilon_{\mathrm{gen}} + \varepsilon_{\mathrm{approx}}.$   (14.2.3)

Similarly, considering only (14.2.2) yields that

$\mathcal{R}(h_S) \leq \sup_{h \in \mathcal{H}} |\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| + \inf_{h \in \mathcal{H}} \widehat{\mathcal{R}}_S(h) =: \varepsilon_{\mathrm{gen}} + \varepsilon_{\mathrm{int}}.$   (14.2.4)

How to choose $\mathcal{H}$ to reduce the approximation error $\varepsilon_{\mathrm{approx}}$ or the interpolation error $\varepsilon_{\mathrm{int}}$ was discussed at length in the previous chapters. The final piece is to bound the generalization error $\sup_{h \in \mathcal{H}} |\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)|$. This will be discussed in the sections below.

14.3 Generalization bounds

We have seen that one aspect of successful learning is to bound the generalization error 𝜀 gen in (14.2.3). Let us first
formally describe this problem.
Definition 14.6 (Generalization bound) Let $\mathcal{H} \subseteq \{h : X \to Y\}$ be a hypothesis set, and let $\mathcal{L} : Y \times Y \to \mathbb{R}$ be a loss function. Let $\kappa : (0,1) \times \mathbb{N} \to \mathbb{R}_+$ be such that for every $\delta \in (0,1)$ it holds that $\kappa(\delta, m) \to 0$ for $m \to \infty$. We call $\kappa$ a generalization bound for $\mathcal{H}$ if for every distribution $\mathcal{D}$ on $X \times Y$, every $m \in \mathbb{N}$ and every $\delta \in (0,1)$, it holds with probability at least $1 - \delta$ over the random sample $S \sim \mathcal{D}^m$ that

$\sup_{h \in \mathcal{H}} |\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| \leq \kappa(\delta, m).$

Remark 14.7 For a generalization bound $\kappa$ it holds that

$\mathbb{P}\big[\mathcal{R}(h_S) - \widehat{\mathcal{R}}_S(h_S) \leq \varepsilon\big] \geq 1 - \delta$

as soon as $m$ is so large that $\kappa(\delta, m) \leq \varepsilon$. If there exists an empirical risk minimizer $h_S$ such that $\widehat{\mathcal{R}}_S(h_S) = 0$, then with high probability the empirical risk minimizer will also have a small risk $\mathcal{R}(h_S)$. Empirical risk minimization is often referred to as a "PAC" algorithm, which stands for probably ($\delta$) approximately correct ($\varepsilon$).

Definition 14.6 requires the upper bound 𝜅 on the discrepancy between the empirical risk and the risk to be
independent of the distribution D. Why should this be possible? After all, we could have an underlying distribution
that is not uniform and hence, certain data points could appear very rarely in the sample. As a result, it should be very
hard to produce a correct prediction for such points. At first sight, this suggests that non-uniform distributions should
be much more challenging than uniform distributions. This intuition is incorrect, as the following argument based on
Example 14.1 demonstrates.

Example 14.8 (Generalization in the coffee quality problem) In Example 14.1, the underlying distribution describes
both our process of choosing coffees and the relation between the attributes and the quality. Suppose we do not enjoy
drinking coffee that costs less than 1€. Consequently, we do not have a single sample of such coffee in the dataset, and therefore we have no chance of learning the quality of cheap coffees.
However, the absence of coffee samples costing less than 1€ in our dataset is due to our general avoidance of such coffee. As a result, we run a low risk of incorrectly classifying the quality of a coffee that is cheaper than 1€, since it
is unlikely that we will choose such a coffee in the future.

To establish generalization bounds, we use stochastic tools that guarantee that the empirical risk converges to the
true risk as the sample size increases. This is typically achieved through concentration inequalities. One of the simplest
and most well-known is Hoeffding’s inequality, see Theorem A.23. We will now apply Hoeffding’s inequality to obtain
a first generalization bound. This generalization bound is well-known and can be found in many textbooks on machine
learning, e.g., [130, 188]. Although the result does not yet encompass neural networks, it forms the basis for a similar
result applicable to neural networks, as we discuss subsequently.

Proposition 14.9 (Finite hypothesis set) Let $\mathcal{H} \subseteq \{h : X \to Y\}$ be a finite hypothesis set. Let $\mathcal{L} : Y \times Y \to \mathbb{R}$ be such that $\mathcal{L}(Y \times Y) \subseteq [c_1, c_2]$ with $c_2 - c_1 = C > 0$.
Then, for every $m \in \mathbb{N}$ and every distribution $\mathcal{D}$ on $X \times Y$ it holds with probability at least $1 - \delta$ over the sample $S \sim \mathcal{D}^m$ that

$\sup_{h \in \mathcal{H}} |\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| \leq C \sqrt{\frac{\log(|\mathcal{H}|) + \log(2/\delta)}{2m}}.$

Proof Let $\mathcal{H} = \{h_1, \ldots, h_n\}$. Then it holds by a union bound that

$\mathbb{P}\big[\exists h_i \in \mathcal{H} : |\mathcal{R}(h_i) - \widehat{\mathcal{R}}_S(h_i)| > \varepsilon\big] \leq \sum_{i=1}^n \mathbb{P}\big[|\mathcal{R}(h_i) - \widehat{\mathcal{R}}_S(h_i)| > \varepsilon\big].$

Note that $\widehat{\mathcal{R}}_S(h_i)$ is the mean of independent random variables which take their values almost surely in $[c_1, c_2]$. Additionally, $\mathcal{R}(h_i)$ is the expectation of $\widehat{\mathcal{R}}_S(h_i)$. The proof can therefore be finished by applying Theorem A.23. This will be addressed in Exercise 14.26. □
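Proposition 14.9 can be illustrated numerically. The sketch below uses a small finite set of threshold classifiers and the 0-1 loss (so C = 1); the data distribution and the hypothesis set are our own choices for illustration, and the experiment merely checks that the observed generalization gap exceeds the bound with frequency at most δ.

import numpy as np

rng = np.random.default_rng(0)
m, delta = 200, 0.05
thresholds = np.linspace(-1, 1, 21)   # finite hypothesis set h_t(x) = 1_{x >= t}, |H| = 21
C = 1.0                               # the 0-1 loss takes values in an interval of length 1
bound = C * np.sqrt((np.log(len(thresholds)) + np.log(2 / delta)) / (2 * m))

def risk(t):
    # true risk of h_t under x ~ U[-1,1] and label y = 1_{x >= 0}
    return abs(t) / 2.0

violations = 0
for _ in range(1000):
    x = rng.uniform(-1, 1, size=m)
    y = (x >= 0).astype(float)
    emp = np.array([np.mean((x >= t).astype(float) != y) for t in thresholds])
    gap = np.max(np.abs(emp - np.array([risk(t) for t in thresholds])))
    violations += gap > bound

print(f"bound = {bound:.3f}, violation frequency = {violations / 1000:.3f} (should be <= {delta})")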
Consider now a non-finite set of neural networks $\mathcal{H}$, and assume that it can be covered by a finite set of (small) balls. Applying Proposition 14.9 to the centers of these balls then allows us to derive a bound for $\mathcal{H}$ similar to the one in the proposition. This intuitive argument will be made rigorous in the following section.

14.4 Generalization bounds from covering numbers

To derive a generalization bound for classes of neural networks, we start by introducing the notion of covering numbers.

Definition 14.10 Let $A$ be a relatively compact subset of a metric space $(X, d)$. For $\varepsilon > 0$, we call

$\mathcal{G}(A, \varepsilon, (X,d)) := \min\Big\{ m \in \mathbb{N} \,\Big|\, \exists\, (x_i)_{i=1}^m \subseteq X \text{ s.t. } \bigcup_{i=1}^m B_\varepsilon(x_i) \supseteq A \Big\},$

where $B_\varepsilon(x) = \{ z \in X \,|\, d(z,x) \leq \varepsilon \}$, the $\varepsilon$-covering number of $A$ in $X$. In case $X$ or $d$ are clear from context, we also write $\mathcal{G}(A, \varepsilon, d)$ or $\mathcal{G}(A, \varepsilon, X)$ instead of $\mathcal{G}(A, \varepsilon, (X,d))$.
A visualization of Definition 14.10 is given in Figure 14.2. As we will see, it is possible to upper bound the 𝜀-covering
numbers of neural networks as a subset of 𝐿 ∞ ( [0, 1] 𝑑 ), assuming the weights are confined to a fixed bounded set.
The precise estimates are postponed to Section 14.5. Before that, let us show how a finite covering number facilitates
a generalization bound. We only consider Euclidean feature spaces 𝑋 in the following result. A more general version
could be easily derived.

Theorem 14.11 Let 𝐶𝑌 , 𝐶Lip > 0. Let 𝛼 > 0. Let 𝑌 ⊆ [−𝐶𝑌 , 𝐶𝑌 ], 𝑋 ⊆ R𝑑 for some 𝑑 ∈ N, and H ⊆ {ℎ : 𝑋 → 𝑌 }.
Further, let L : 𝑌 × 𝑌 → R be 𝐶Lip -Lipschitz.
Then, for every distribution D on 𝑋 × 𝑌 and every 𝑚 ∈ N it holds with probability at least 1 − 𝛿 over the sample
𝑆 ∼ D 𝑚 that for all ℎ ∈ H


Fig. 14.2: Illustration of the concept of covering numbers of Definition 14.10. The shaded set 𝐴 ⊆ R2 is covered by
sixteen Euclidean balls of radius 𝜀. Therefore, G( 𝐴, 𝜀, R2 ) ≤ 16.

$|\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| \leq 4 C_Y C_{\mathrm{Lip}} \sqrt{\frac{\log(\mathcal{G}(\mathcal{H}, m^{-\alpha}, L^\infty(X))) + \log(2/\delta)}{m}} + \frac{2 C_{\mathrm{Lip}}}{m^\alpha}.$
Proof Let

$M = \mathcal{G}(\mathcal{H}, m^{-\alpha}, L^\infty(X))$   (14.4.1)

and let $\mathcal{H}_M = (h_i)_{i=1}^M \subseteq \mathcal{H}$ be such that for every $h \in \mathcal{H}$ there exists $h_i \in \mathcal{H}_M$ with $\|h - h_i\|_{L^\infty(X)} \leq 1/m^\alpha$. The existence of $\mathcal{H}_M$ follows by Definition 14.10.


Fix for the moment such ℎ ∈ H and ℎ𝑖 ∈ H𝑀 . By the reverse and normal triangle inequalities, we have

$\big| |\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| - |\mathcal{R}(h_i) - \widehat{\mathcal{R}}_S(h_i)| \big| \leq |\mathcal{R}(h) - \mathcal{R}(h_i)| + |\widehat{\mathcal{R}}_S(h) - \widehat{\mathcal{R}}_S(h_i)|.$

Moreover, from the monotonicity of the expected value and the Lipschitz property of $\mathcal{L}$ it follows that

$|\mathcal{R}(h) - \mathcal{R}(h_i)| \leq \mathbb{E}|\mathcal{L}(h(x), y) - \mathcal{L}(h_i(x), y)| \leq C_{\mathrm{Lip}} \mathbb{E}|h(x) - h_i(x)| \leq \frac{C_{\mathrm{Lip}}}{m^\alpha}.$

A similar estimate yields $|\widehat{\mathcal{R}}_S(h) - \widehat{\mathcal{R}}_S(h_i)| \leq C_{\mathrm{Lip}}/m^\alpha$.
We thus conclude that for every $\varepsilon > 0$

$\mathbb{P}_{S \sim \mathcal{D}^m}\big[\exists h \in \mathcal{H} : |\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| \geq \varepsilon\big] \leq \mathbb{P}_{S \sim \mathcal{D}^m}\Big[\exists h_i \in \mathcal{H}_M : |\mathcal{R}(h_i) - \widehat{\mathcal{R}}_S(h_i)| \geq \varepsilon - \frac{2 C_{\mathrm{Lip}}}{m^\alpha}\Big].$   (14.4.2)

From Proposition 14.9, we know that for $\varepsilon > 0$ and $\delta \in (0,1)$

$\mathbb{P}_{S \sim \mathcal{D}^m}\Big[\exists h_i \in \mathcal{H}_M : |\mathcal{R}(h_i) - \widehat{\mathcal{R}}_S(h_i)| \geq \varepsilon - \frac{2 C_{\mathrm{Lip}}}{m^\alpha}\Big] \leq \delta$   (14.4.3)

as long as

$\varepsilon - \frac{2 C_{\mathrm{Lip}}}{m^\alpha} > C \sqrt{\frac{\log(M) + \log(2/\delta)}{2m}},$

where $C$ is such that $\mathcal{L}(Y \times Y) \subseteq [c_1, c_2]$ with $c_2 - c_1 \leq C$. By the Lipschitz property of $\mathcal{L}$ we can choose $C = 2\sqrt{2}\, C_{\mathrm{Lip}} C_Y$.
Therefore, the definition of $M$ in (14.4.1) together with (14.4.2) and (14.4.3) gives that with probability at least $1 - \delta$ it holds for all $h \in \mathcal{H}$

$|\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| \leq 2\sqrt{2}\, C_{\mathrm{Lip}} C_Y \sqrt{\frac{\log(\mathcal{G}(\mathcal{H}, m^{-\alpha}, L^\infty)) + \log(2/\delta)}{2m}} + \frac{2 C_{\mathrm{Lip}}}{m^\alpha}.$
This concludes the proof. □

14.5 Covering numbers of deep neural networks

We have seen in Theorem 14.11 that estimating $L^\infty$-covering numbers is crucial for understanding the generalization error. How can we determine these covering numbers? The set of neural networks of a fixed architecture can be a quite complex set (see Chapter 13), so it is not immediately clear how to cover it with balls, let alone count the number of required balls. The following lemma suggests a simpler approach.
Lemma 14.12 Let 𝑋1 , 𝑋2 be two metric spaces and let 𝑓 : 𝑋1 → 𝑋2 be Lipschitz continuous with Lipschitz constant
𝐶Lip . For every relatively compact 𝐴 ⊆ 𝑋1 it holds that for all 𝜀 > 0

G( 𝑓 ( 𝐴), 𝐶Lip 𝜀, 𝑋2 ) ≤ G( 𝐴, 𝜀, 𝑋1 ).

The proof of Lemma 14.12 is left as an exercise. If we can represent the set of neural networks as the image under
the Lipschitz map of another set with known covering numbers, then Lemma 14.12 gives a direct way to bound the
covering number of the neural network class.
Conveniently, we have already observed in Proposition 13.1, that the set of neural networks is the image of PN (A, 𝐵)
as in Definition 12.1 under the Lipschitz continuous realization map 𝑅 𝜎 . It thus suffices to establish the 𝜀-covering
number of PN (A, 𝐵) or equivalently of [−𝐵, 𝐵] 𝑛A . Then, using the Lipschitz property of 𝑅 𝜎 that holds by Proposition
13.1, we can apply Lemma 14.12 to find the covering numbers of N (𝜎; A, 𝐵). This idea is depicted in Figure 14.3.


Fig. 14.3: Illustration of the main idea to deduce covering numbers of neural network spaces. Points 𝜃 ∈ R2 in parameter
space in the left figure correspond to functions 𝑅 𝜎 (𝜃) in the right figure (with matching colors). By Lemma 14.12, a
covering of the parameter space on the left translates to a covering of the function space on the right.

Proposition 14.13 Let 𝐵, 𝜀 > 0 and 𝑞 ∈ N. Then

G( [−𝐵, 𝐵] 𝑞 , 𝜀, (R𝑞 , ∥ · ∥ ∞ )) ≤ ⌈𝐵/𝜀⌉ 𝑞 .

Proof We start with the one-dimensional case $q = 1$. We choose $k = \lceil B/\varepsilon \rceil$ and set

$x_0 = -B + \varepsilon \quad \text{and} \quad x_j = x_{j-1} + 2\varepsilon \ \text{ for } j = 1, \ldots, k-1.$

It is clear that all points between $-B$ and $x_{k-1}$ have distance at most $\varepsilon$ to one of the $x_j$. Also, $x_{k-1} = -B + \varepsilon + 2(k-1)\varepsilon \geq B - \varepsilon$. We conclude that $\mathcal{G}([-B,B], \varepsilon, \mathbb{R}) \leq \lceil B/\varepsilon \rceil$. Set $X_k := \{x_0, \ldots, x_{k-1}\}$.
For arbitrary $q$, we observe that for every $x \in [-B,B]^q$ there is an element in $X_k^q = \times_{j=1}^q X_k$ with $\|\cdot\|_\infty$-distance at most $\varepsilon$. Clearly, $|X_k^q| = \lceil B/\varepsilon \rceil^q$, which completes the proof. □
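The construction in the proof of Proposition 14.13 can be written out directly; the sketch below (an illustration for one choice of B, ε and q, which are our own example values) builds the one-dimensional centers and takes their q-fold Cartesian product.

import itertools
import numpy as np

def covering_centers(B, eps, q):
    # centers x_0 = -B + eps, x_j = x_{j-1} + 2*eps for j = 1, ..., k-1, with k = ceil(B/eps)
    k = int(np.ceil(B / eps))
    one_dim = -B + eps + 2 * eps * np.arange(k)
    return np.array(list(itertools.product(one_dim, repeat=q)))

B, eps, q = 1.0, 0.25, 2
centers = covering_centers(B, eps, q)
print(len(centers), int(np.ceil(B / eps)) ** q)   # number of balls vs. ceil(B/eps)^q

# sanity check: every point of a fine grid in [-B,B]^q lies within eps of a center in the sup-norm
grid = np.array(list(itertools.product(np.linspace(-B, B, 41), repeat=q)))
dists = np.max(np.abs(grid[:, None, :] - centers[None, :, :]), axis=2).min(axis=1)
print(bool(dists.max() <= eps + 1e-12))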
Having established a covering number for [−𝐵, 𝐵] 𝑛A and hence PN (A, 𝐵), we can now estimate the covering
numbers of deep neural networks by combining Lemma 14.12 and Propositions 13.1 and 14.13 .
Theorem 14.14 Let A = (𝑑0 , 𝑑1 , . . . , 𝑑 𝐿+1 ) ∈ N 𝐿+2 , let 𝜎 : R → R be 𝐶 𝜎 -Lipschitz continuous with 𝐶 𝜎 ≥ 1, let
|𝜎(𝑥)| ≤ 𝐶 𝜎 |𝑥| for all 𝑥 ∈ R, and let 𝐵 ≥ 1. Then

$\mathcal{G}(\mathcal{N}(\sigma; \mathcal{A}, B), \varepsilon, L^\infty([0,1]^{d_0})) \leq \mathcal{G}\big([-B,B]^{n_{\mathcal{A}}}, \varepsilon/(2 C_\sigma B d_{\max})^L, (\mathbb{R}^{n_{\mathcal{A}}}, \|\cdot\|_\infty)\big) \leq \lceil B/\varepsilon \rceil^{n_{\mathcal{A}}} \lceil 2 C_\sigma B d_{\max} \rceil^{n_{\mathcal{A}} L}.$

We end this section by applying the previous theorem to the generalization bound of Theorem 14.11 with $\alpha = 1/2$. To make our life slightly simpler, we will consider only neural networks that map to $[-1,1]$. We denote

$\mathcal{N}^*(\sigma; \mathcal{A}, B) = \big\{ \Phi \in \mathcal{N}(\sigma; \mathcal{A}, B) \,\big|\, \Phi(x) \in [-1,1] \text{ for all } x \in [0,1]^{d_0} \big\}.$

Since $\mathcal{N}^*(\sigma; \mathcal{A}, B) \subseteq \mathcal{N}(\sigma; \mathcal{A}, B)$ we can bound the covering numbers of $\mathcal{N}^*(\sigma; \mathcal{A}, B)$ by those of $\mathcal{N}(\sigma; \mathcal{A}, B)$. This yields the following result.
Theorem 14.15 Let 𝐶Lip > 0 and let L : [−1, 1] × [−1, 1] → R be 𝐶Lip -Lipschitz continuous. Further, let A =
(𝑑0 , 𝑑1 , . . . , 𝑑 𝐿+1 ) ∈ N 𝐿+2 , let 𝜎 : R → R be 𝐶 𝜎 -Lipschitz continuous with 𝐶 𝜎 ≥ 1, and |𝜎(𝑥)| ≤ 𝐶 𝜎 |𝑥| for all 𝑥 ∈ R,
and let 𝐵 ≥ 1.
Then, for every 𝑚 ∈ N, and every distribution D on 𝑋 × [−1, 1] it holds with probability at least 1 − 𝛿 over 𝑆 ∼ D 𝑚
that for all Φ ∈ N ∗ (𝜎; A, 𝐵)

$|\mathcal{R}(\Phi) - \widehat{\mathcal{R}}_S(\Phi)| \leq 4 C_{\mathrm{Lip}} \sqrt{\frac{n_{\mathcal{A}} \log(\lceil B\sqrt{m} \rceil) + L n_{\mathcal{A}} \log(\lceil 2 C_\sigma B d_{\max} \rceil) + \log(2/\delta)}{m}}.$
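For concreteness, the right-hand side of Theorem 14.15 can be evaluated for a given architecture. The following sketch does so; the architecture, sample sizes and constants are arbitrary example values, and the number of parameters is computed as the total count of weights and biases.

import numpy as np

def generalization_bound(arch, m, B=1.0, C_sigma=1.0, C_lip=1.0, delta=0.05):
    # bound of Theorem 14.15 for the architecture arch = (d_0, d_1, ..., d_{L+1})
    L = len(arch) - 2
    d_max = max(arch)
    n_A = sum((arch[i] + 1) * arch[i + 1] for i in range(len(arch) - 1))  # weights and biases
    log_cov = n_A * np.log(np.ceil(B * np.sqrt(m))) + L * n_A * np.log(np.ceil(2 * C_sigma * B * d_max))
    return 4 * C_lip * np.sqrt((log_cov + np.log(2 / delta)) / m)

# the bound is vacuous (larger than 1) unless m is much larger than n_A
for m in [10**4, 10**6, 10**8]:
    print(m, round(generalization_bound((6, 50, 50, 1), m), 3))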

14.6 The approximation-complexity trade-off

We recall the decomposition of the error in (14.2.3)

R (ℎ 𝑆 ) − 𝑅 ∗ ≤ 2𝜀 gen + 𝜀 approx ,

where 𝑅∗ is the Bayes risk defined in (14.1.1). We make the following observations about the approximation error
𝜀 approx and generalization error 𝜀 gen in the context of neural network based learning:
• Scaling of generalization error: By Theorem 14.15, for a hypothesis class $\mathcal{H}$ of neural networks with $n_{\mathcal{A}}$ weights and $L$ layers, and for a sample of size $m \in \mathbb{N}$, the generalization error $\varepsilon_{\mathrm{gen}}$ essentially scales like
$\varepsilon_{\mathrm{gen}} = O\big(\sqrt{(n_{\mathcal{A}} \log(m) + L n_{\mathcal{A}} \log(n_{\mathcal{A}}))/m}\big)$ as $m \to \infty$.

• Scaling of approximation error: Assume there exists ℎ∗ such that R (ℎ∗ ) = 𝑅 ∗ , and let the loss function L be
Lipschitz continuous in the first coordinate. Then

$\varepsilon_{\mathrm{approx}} = \inf_{h \in \mathcal{H}} \mathcal{R}(h) - \mathcal{R}(h^*) = \inf_{h \in \mathcal{H}} \mathbb{E}_{(x,y)\sim\mathcal{D}}[\mathcal{L}(h(x), y) - \mathcal{L}(h^*(x), y)] \lesssim \inf_{h \in \mathcal{H}} \|h - h^*\|_{L^\infty}.$

We have seen in Chapters 5 and 7 that if we choose H as a set of neural networks with size 𝑛 A and 𝐿 layers, then,
for appropriate activation functions, inf ℎ∈ H ∥ℎ − ℎ∗ ∥ 𝐿 ∞ behaves like 𝑛 A −𝑟 if, e.g., ℎ∗ is a 𝑑-dimensional 𝑠-Hölder
regular function and 𝑟 = 𝑠/𝑑 (Theorem 5.22), or ℎ∗ ∈ 𝐶 𝑘,𝑠 ( [0, 1] 𝑑 ) and 𝑟 < (𝑘 + 𝑠)/𝑑 (Theorem 7.7).
By these considerations, we conclude that for an empirical risk minimizer $\Phi_S$ from a set of neural networks with $n_{\mathcal{A}}$ weights and $L$ layers, it holds that

$\mathcal{R}(\Phi_S) - R^* \leq O\big(\sqrt{(n_{\mathcal{A}} \log(m) + L n_{\mathcal{A}} \log(n_{\mathcal{A}}))/m}\big) + O(n_{\mathcal{A}}^{-r}),$   (14.6.1)

for $m \to \infty$ and for some $r$ depending on the regularity of $h^*$. Note that enlarging the neural network set, i.e., increasing $n_{\mathcal{A}}$, has two effects: the term associated with approximation decreases, and the term associated with generalization increases. This trade-off is known as the approximation-complexity trade-off. The situation is depicted in Figure 14.4. The figure and (14.6.1) suggest that the perfect model achieves the optimal trade-off between approximation and generalization error. Using this notion, we can also separate all models into three classes:
• Underfitting: If the approximation error decays faster than the estimation error increases.
• Optimal: If the sum of approximation error and generalization error is at a minimum.
• Overfitting: If the approximation error decays slower than the estimation error increases.
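The trade-off in (14.6.1) can be made tangible by evaluating both terms numerically. The sketch below uses the values of m and r from Figure 14.4, a depth L of our own choosing and unit implicit constants (all assumptions), and locates the network size at which the sum of the two terms is minimal.

import numpy as np

m, L, r = 10_000, 5, 1.0           # sample size, depth, approximation rate (example values)
n_A = np.arange(1, 2000)            # candidate numbers of network parameters

gen_err = np.sqrt((n_A * np.log(m) + L * n_A * np.log(n_A)) / m)   # generalization term
approx_err = n_A ** (-r)                                           # approximation term
total = 2 * gen_err + approx_err

n_opt = n_A[np.argmin(total)]
print(f"optimal n_A ~ {n_opt}, value of the bound {total.min():.3f}")
# below n_opt the approximation term dominates (underfitting),
# above n_opt the generalization term dominates (overfitting)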

Fig. 14.4: Illustration of the approximation-complexity trade-off of Equation (14.6.1), with the underfitting and overfitting regions separated by the optimal trade-off point. Here we chose r = 1 and m = 10,000; all implicit constants are assumed to be equal to 1.

In Chapter 15, we will see that deep learning often operates in a regime where the number of parameters $n_{\mathcal{A}}$ exceeds the optimal trade-off point. For certain architectures used in practice, $n_{\mathcal{A}}$ can be so large that the theory of the approximation-complexity trade-off suggests that learning should be impossible. However, we emphasize that the present analysis only provides upper bounds. It does not prove that learning is impossible or even impractical in the overparameterized regime. Moreover, in Chapter 11 we have already seen indications that learning in the overparameterized regime need not necessarily lead to large generalization errors.

14.7 PAC learning from VC dimension

In addition to covering numbers, there are several other tools to analyze the generalization capacity of hypothesis
sets. In the context of classification problems, one of the most important is the so-called Vapnik–Chervonenkis (VC)
dimension.

14.7.1 Definition and examples

Let H be a hypothesis set of functions mapping from R𝑑 to {0, 1}. A set 𝑆 = {𝒙1 , . . . , 𝒙 𝑛 } ⊆ R𝑑 is said to be shattered by H if for every (𝑦 1 , . . . , 𝑦 𝑛 ) ∈ {0, 1} 𝑛 there exists ℎ ∈ H such that ℎ(𝒙 𝑗 ) = 𝑦 𝑗 for all 𝑗 = 1, . . . , 𝑛.
The VC dimension quantifies the complexity of a function class via the number of points that can in principle be
shattered.

Definition 14.16 The VC dimension of H is the cardinality of the largest set 𝑆 ⊆ R𝑑 that is shattered by H . We denote
the VC dimension by VCdim(H ).

Example 14.17 (Intervals) Let H = {1 [𝑎,𝑏] | 𝑎, 𝑏 ∈ R}. It is clear that VCdim(H ) ≥ 2 since for 𝑥1 < 𝑥2 the functions

1 [ 𝑥1 −2, 𝑥1 −1] , 1 [ 𝑥1 −1,𝑥1 ] , 1 [ 𝑥1 ,𝑥2 ] , 1 [ 𝑥2 ,𝑥2 +1] ,

are all different when restricted to 𝑆 = {𝑥1 , 𝑥2 }.


On the other hand, if 𝑥1 < 𝑥2 < 𝑥3 , then, since ℎ −1 ({1}) is an interval for every ℎ ∈ H, ℎ(𝑥1 ) = 1 = ℎ(𝑥 3 ) implies ℎ(𝑥2 ) = 1. Hence, no set of three elements can be shattered. Therefore, VCdim(H ) = 2. The situation is depicted in Figure 14.5.
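The shattering argument of Example 14.17 can also be verified by brute force. The sketch below tests, for a given point set, whether every labeling is realized by the indicator of an interval; restricting the candidate intervals to those with endpoints in the point set (plus one interval missing all points) suffices for this check.

import itertools

def shattered_by_intervals(points):
    # H = { 1_[a,b] : a, b in R }; check whether 'points' is shattered by H
    pts = sorted(points)
    candidates = [(a, b) for a in pts for b in pts if a <= b] + [(pts[0] - 2, pts[0] - 1)]
    for labels in itertools.product([0, 1], repeat=len(pts)):
        if not any(all((a <= p <= b) == bool(l) for p, l in zip(pts, labels))
                   for a, b in candidates):
            return False
    return True

print(shattered_by_intervals([0.0, 1.0]))        # True: two points can be shattered
print(shattered_by_intervals([0.0, 1.0, 2.0]))   # False: no three points can be shattered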

Fig. 14.5: Different ways to classify two or three points. The colored blocks correspond to intervals that produce different classifications of the points.

Example 14.18 (Two-dimensional half-spaces) Let H = {1 [0,∞) (⟨𝒘, ·⟩ + 𝑏) | 𝒘 ∈ R2 , 𝑏 ∈ R} be a hypothesis set of


rotated and shifted two-dimensional half-spaces. In Figure 14.6 we see that H shatters a set of three points. More generally, with

H𝑑 := {𝒙 ↦→ 1 [0,∞) (𝒘 ⊤ 𝒙 + 𝑏) | 𝒘 ∈ R𝑑 , 𝑏 ∈ R}

the VC dimension of H𝑑 equals 𝑑 + 1.

Fig. 14.6: Different ways to classify three points by a half-space.

In the example above, the VC dimension coincides with the number of parameters. However, this is not true in
general as the following example shows.
Example 14.19 (Infinite VC dimension) Let for 𝑥 ∈ R

H := {𝑥 ↦→ 1 [0,∞) (sin(𝑤𝑥)) | 𝑤 ∈ R}.

Then the VC dimension of H is infinite (Exercise 14.29).

14.7.2 Generalization based on VC dimension

In the following, we consider a classification problem. Denote by D the data-generating distribution on R𝑑 × {0, 1}.
Moreover, we let H be a set of functions from R𝑑 → {0, 1}.
In the binary classification set-up, the natural choice of a loss function is the 0 − 1 loss L0−1 (𝑦, 𝑦 ′ ) = 1 𝑦≠𝑦 ′ . Thus,
given a sample 𝑆, the empirical risk of a function ℎ ∈ H is
$\widehat{\mathcal{R}}_S(h) = \frac{1}{m} \sum_{i=1}^m \mathbf{1}_{h(x_i) \neq y_i}.$

Moreover, the risk can be written as

R (ℎ) = P ( 𝒙,𝑦)∼D [ℎ(𝒙) ≠ 𝑦],

i.e., the probability under (𝒙, 𝑦) ∼ D of ℎ misclassifying the label 𝑦 of 𝒙.


We can now give a generalization bound in terms of the VC dimension of H , see, e.g., [130, Corollary 3.19]:
Theorem 14.20 Let 𝑑, 𝑘 ∈ N and H ⊆ {ℎ : R𝑑 → {0, 1}} have VC dimension 𝑘. Let D be a distribution on R𝑑 × {0, 1}.
Then, for every 𝛿 > 0 and 𝑚 ∈ N, it holds with probability at least 1 − 𝛿 over a sample 𝑆 ∼ D 𝑚 that for every ℎ ∈ H
$|\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| \leq \sqrt{\frac{2k \log(em/k)}{m}} + \sqrt{\frac{\log(1/\delta)}{2m}}.$   (14.7.1)
In words, Theorem 14.20 tells us that if a hypothesis class has finite VC dimension, then a hypothesis with a small
empirical risk will have a small risk if the number of samples is large. This shows that empirical risk minimization

is a viable strategy in this scenario. Will this approach also work if the VC dimension is not bounded? No; in fact, in that case, no learning algorithm will succeed in reliably producing a hypothesis for which the risk is close to the best possible. We omit the proof of the following theorem, because it is technical and not of much relevance for the following discussion.
Theorem 14.21 ([130, Theorem 3.23]) Let 𝑘 ∈ N and let H ⊆ {ℎ : 𝑋 → {0, 1}} be a hypothesis set with VC
dimension 𝑘. Then, for every 𝑚 ∈ N and every learning algorithm A : (𝑋 × {0, 1}) 𝑚 → H there exists a distribution
D on 𝑋 × {0, 1} such that
" √︂ #
𝑘 1
P𝑆∼D 𝑚 R (A(𝑆)) − inf R (ℎ) > ≥ .
ℎ∈ H 320𝑚 64

Theorem 14.21 immediately implies the following statement for the generalization bound.
Corollary 14.22 Let 𝑘 ∈ N and let H ⊆ {ℎ : 𝑋 → {0, 1}} be a hypothesis set with VC dimension 𝑘. Then, for every
𝑚 ∈ N there exists a distribution D on 𝑋 × {0, 1} such that
" √︂ #
𝑘 1
P𝑆∼D 𝑚 sup |R (ℎ) − R b𝑆 (ℎ)| > ≥ .
ℎ∈ H 1280𝑚 64

Proof For a sample $S$, let $h_S \in \mathcal{H}$ be an empirical risk minimizer, i.e., $\widehat{\mathcal{R}}_S(h_S) = \min_{h \in \mathcal{H}} \widehat{\mathcal{R}}_S(h)$. We define $A(S) = h_S$. Let $\mathcal{D}$ be the distribution of Theorem 14.21. Moreover, for $\delta > 0$, let $h_\delta \in \mathcal{H}$ be such that

$\mathcal{R}(h_\delta) - \inf_{h \in \mathcal{H}} \mathcal{R}(h) < \delta.$

Then, it holds that

$2 \sup_{h \in \mathcal{H}} |\mathcal{R}(h) - \widehat{\mathcal{R}}_S(h)| \geq |\mathcal{R}(h_S) - \widehat{\mathcal{R}}_S(h_S)| + |\mathcal{R}(h_\delta) - \widehat{\mathcal{R}}_S(h_\delta)| \geq \mathcal{R}(h_S) - \widehat{\mathcal{R}}_S(h_S) + \widehat{\mathcal{R}}_S(h_\delta) - \mathcal{R}(h_\delta) \geq \mathcal{R}(h_S) - \mathcal{R}(h_\delta) > \mathcal{R}(h_S) - \inf_{h \in \mathcal{H}} \mathcal{R}(h) - \delta,$

where we used the definition of $h_S$ in the third inequality. The proof is completed by applying Theorem 14.21 and using that $\delta$ was arbitrary. □

We have seen now, that we have a generalization bound scaling like O (1/ 𝑚) for 𝑚 → ∞ if and only if the VC
dimension of a hypothesis class is finite. In more quantitative terms, we require the VC dimension of a neural network
to be smaller than 𝑚.
What does this imply for neural network functions? For ReLU neural networks the following holds.

Theorem 14.23 ([4, Theorem 8.8]) Let A be an architecture of depth 𝐿 ∈ N and set

H := {1 [0,∞) ◦ Φ | Φ ∈ N (𝜎ReLU ; A, ∞)}.

There exists a constant 𝐶 independent of 𝐿 and A such that

VCdim(H ) ≤ 𝐶 · (𝑛 A 𝐿 log(𝑛 A ) + 𝑛 A 𝐿 2 ).

The bound (14.7.1) is meaningful if $m \gg k$. For ReLU neural networks as in Theorem 14.23, this means $m \gg n_{\mathcal{A}} L \log(n_{\mathcal{A}}) + n_{\mathcal{A}} L^2$. Fixing $L = 1$, this amounts to $m \gg n_{\mathcal{A}} \log(n_{\mathcal{A}})$ for a shallow neural network with $n_{\mathcal{A}}$ parameters. This condition is contrary to what we assumed in Chapter 11, where it was crucial that $n_{\mathcal{A}} \gg m$. If the VC dimension of the neural network sets scales like $O(n_{\mathcal{A}} \log(n_{\mathcal{A}}))$, then Theorem 14.21 and Corollary 14.22 indicate that,

at least for certain distributions, generalization should not be possible in this regime. We will discuss the resolution of
this potential paradox in Chapter 15.

14.8 Lower bounds on achievable approximation rates

We conclude this chapter on the complexities and generalization bounds of neural networks by using the established
VC dimension bound of Theorem 14.23 to deduce limitations to the approximation capacity of neural networks. The
result described below was first given in [219].

Theorem 14.24 Let $k, d \in \mathbb{N}$. Assume that for every $\varepsilon > 0$ there exist $L_\varepsilon \in \mathbb{N}$ and an architecture $\mathcal{A}_\varepsilon$ with $L_\varepsilon$ layers and input dimension $d$ such that

$\sup_{\|f\|_{C^k([0,1]^d)} \leq 1} \ \inf_{\Phi \in \mathcal{N}(\sigma_{\mathrm{ReLU}}; \mathcal{A}_\varepsilon, \infty)} \|f - \Phi\|_{C^0([0,1]^d)} < \frac{\varepsilon}{2}.$

Then there exists $C > 0$ solely depending on $k$ and $d$, such that for all $\varepsilon \in (0,1)$

$n_{\mathcal{A}_\varepsilon} L_\varepsilon \log(n_{\mathcal{A}_\varepsilon}) + n_{\mathcal{A}_\varepsilon} L_\varepsilon^2 \geq C \varepsilon^{-\frac{d}{k}}.$

Proof For $x \in \mathbb{R}^d$ consider the "bump function"

$\tilde{f}(x) := \begin{cases} \exp\big(1 - \frac{1}{1 - \|x\|_2^2}\big) & \text{if } \|x\|_2 < 1 \\ 0 & \text{otherwise}, \end{cases}$

and its scaled version

$\tilde{f}_\varepsilon(x) := \varepsilon \tilde{f}\big(2 \varepsilon^{-1/k} x\big),$

for $\varepsilon \in (0,1)$. Then

$\mathrm{supp}(\tilde{f}_\varepsilon) \subseteq \Big[-\frac{\varepsilon^{1/k}}{2}, \frac{\varepsilon^{1/k}}{2}\Big]^d$

and

$\|\tilde{f}_\varepsilon\|_{C^k} \leq 2^k \|\tilde{f}\|_{C^k} =: \tau_k > 0.$

Consider the equispaced point set $\{x_1, \ldots, x_{N(\varepsilon)}\} = \varepsilon^{1/k} \mathbb{Z}^d \cap [0,1]^d$. The cardinality of this set is $N(\varepsilon) \simeq \varepsilon^{-d/k}$. Given $y \in \{0,1\}^{N(\varepsilon)}$, let for $x \in \mathbb{R}^d$

$f_y(x) := \tau_k^{-1} \sum_{j=1}^{N(\varepsilon)} y_j \tilde{f}_\varepsilon(x - x_j).$   (14.8.1)

Then $f_y(x_j) = \tau_k^{-1} \varepsilon y_j$ for all $j = 1, \ldots, N(\varepsilon)$ and $\|f_y\|_{C^k} \leq 1$.


For every $y \in \{0,1\}^{N(\varepsilon)}$ let $\Phi_y \in \mathcal{N}(\sigma_{\mathrm{ReLU}}; \mathcal{A}_{\tau_k^{-1}\varepsilon}, \infty)$ be such that

$\sup_{x \in [0,1]^d} |f_y(x) - \Phi_y(x)| < \frac{\varepsilon}{2 \tau_k}.$

Then

$\mathbf{1}_{[0,\infty)}\Big(\Phi_y(x_j) - \frac{\varepsilon}{2\tau_k}\Big) = y_j \quad \text{for all } j = 1, \ldots, N(\varepsilon).$

Hence, the VC dimension of $\mathcal{N}(\sigma_{\mathrm{ReLU}}; \mathcal{A}_{\tau_k^{-1}\varepsilon}, \infty)$ is larger or equal to $N(\varepsilon)$. Theorem 14.23 thus implies

$N(\varepsilon) \simeq \varepsilon^{-\frac{d}{k}} \leq C \cdot \big( n_{\mathcal{A}_{\tau_k^{-1}\varepsilon}} L_{\tau_k^{-1}\varepsilon} \log(n_{\mathcal{A}_{\tau_k^{-1}\varepsilon}}) + n_{\mathcal{A}_{\tau_k^{-1}\varepsilon}} L_{\tau_k^{-1}\varepsilon}^2 \big),$

or equivalently, replacing $\varepsilon$ by $\tau_k \varepsilon$,

$\tau_k^{-\frac{d}{k}} \varepsilon^{-\frac{d}{k}} \leq C \cdot \big( n_{\mathcal{A}_\varepsilon} L_\varepsilon \log(n_{\mathcal{A}_\varepsilon}) + n_{\mathcal{A}_\varepsilon} L_\varepsilon^2 \big).$

Since $\tau_k$ depends only on $k$ and $d$, this completes the proof. □

Fig. 14.7: Illustration of 𝑓𝒚 from Equation (14.8.1) on [0, 1] 2 .
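The construction used in the proof can be sketched numerically. Below we assemble f_y from (14.8.1) on [0,1]^d for d = 2; the normalization τ_k is replaced by a crude proxy of our own choosing, so this only illustrates the construction of the bump functions, not the precise constants.

import numpy as np

def bump(x):
    # smooth bump supported in the unit ball: exp(1 - 1/(1 - ||x||^2)) for ||x|| < 1
    r2 = np.sum(x ** 2, axis=-1)
    out = np.zeros_like(r2)
    inside = r2 < 1
    out[inside] = np.exp(1.0 - 1.0 / (1.0 - r2[inside]))
    return out

d, k, eps = 2, 1, 0.05
tau_k = 2.0 ** k                      # crude stand-in for 2^k * ||f~||_{C^k} (illustration only)
h = eps ** (1.0 / k)                  # grid spacing eps^{1/k}
centers = np.array([(i * h, j * h) for i in range(int(1 / h) + 1) for j in range(int(1 / h) + 1)])
y = np.random.default_rng(1).integers(0, 2, size=len(centers))   # arbitrary labels in {0,1}

def f_y(x):
    # f_y(x) = tau_k^{-1} * sum_j y_j * eps * bump(2 eps^{-1/k} (x - x_j)), cf. (14.8.1)
    scaled = (2.0 / h) * (x[None, :] - centers)
    return np.sum(y * eps * bump(scaled)) / tau_k

# at each grid point x_j the function takes the value tau_k^{-1} * eps * y_j
print([round(f_y(c) * tau_k / eps) for c in centers[:10]], y[:10].tolist())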

To interpret Theorem 14.24, we consider two situations:


• In case the depth is allowed to increase at most logarithmically in $\varepsilon$, reaching uniform error $\varepsilon$ for all $f \in C^k([0,1]^d)$ with $\|f\|_{C^k([0,1]^d)} \leq 1$ requires
$n_{\mathcal{A}_\varepsilon} \log(n_{\mathcal{A}_\varepsilon}) |\log(\varepsilon)| + n_{\mathcal{A}_\varepsilon} \log(\varepsilon)^2 \geq C \varepsilon^{-\frac{d}{k}}.$
In terms of the neural network size, this (necessary) condition becomes $n_{\mathcal{A}_\varepsilon} \geq C \varepsilon^{-d/k} / \log(\varepsilon)^2$. As we have shown in Chapter 7, in particular Theorem 7.7, up to log terms this condition is also sufficient. Hence, while the constructive proof of Theorem 7.7 might have seemed rather specific, under the assumption of the depth increasing at most logarithmically (which the construction in Chapter 7 satisfies), it was essentially optimal! The neural networks in this proof are shown to have size $O(\varepsilon^{-d/k})$ up to log terms.
• If we allow the depth $L_\varepsilon$ to increase faster than logarithmically in $\varepsilon$, then the lower bound on the required neural network size improves. Fixing for example $\mathcal{A}_\varepsilon$ with $L_\varepsilon$ layers such that $n_{\mathcal{A}_\varepsilon} \leq W L_\varepsilon$ for some fixed $\varepsilon$-independent $W \in \mathbb{N}$, the (necessary) condition on the depth becomes
$W \log(W L_\varepsilon) L_\varepsilon^2 + W L_\varepsilon^3 \geq C \varepsilon^{-\frac{d}{k}}$
and hence $L_\varepsilon \gtrsim \varepsilon^{-d/(3k)}$.


We add that for arbitrary depth, the upper bound on the VC dimension of Theorem 14.23 can be improved to $n_{\mathcal{A}}^2$ [4, Theorem 8.6]; using this would improve the lower bound just established to $L_\varepsilon \gtrsim \varepsilon^{-d/(2k)}$.
For fixed width, this corresponds to neural networks of size $O(\varepsilon^{-d/(2k)})$, which would mean twice the convergence rate proven in Theorem 7.7. Indeed, it turns out that neural networks can achieve this rate in terms of the neural network size [220].
To sum up, in order to get error $\varepsilon$ uniformly for all $\|f\|_{C^k([0,1]^d)} \leq 1$, the size of a ReLU neural network is required to increase at least like $O(\varepsilon^{-d/(2k)})$ as $\varepsilon \to 0$, i.e., the best possible attainable convergence rate is $2k/d$. It has been

proven that this rate is also achievable, and thus the bound is sharp. Achieving this rate requires neural network architectures that grow faster in depth than in width.

Bibliography and further reading

Classical statistical learning theory is based on the foundational work of Vapnik and Chervonenkis [207]. This led to
the formulation of the probably approximately correct (PAC) learning model in [206], which is primarily utilized in
this chapter. A streamlined mathematical introduction to statistical learning theory can be found in [41].
Since statistical learning theory is well-established, there exists a substantial amount of excellent expository work
describing this theory. Some highly recommended books on the topic are [130, 188, 4]. The specific approach of
characterizing learning via covering numbers has been discussed extensively in [4, Chapter 14]. Specific results for
ReLU activation used in this chapter were derived in [181, 18]. The results of Section 14.8 describe some of the findings
in [219, 220].

Exercises

Exercise 14.25 Let H be a set of neural networks with fixed architecture, where the weights are taken from a compact
set. Moreover, assume that the activation function is continuous. Show that for every sample 𝑆 there always exists an
empirical risk minimizer ℎ 𝑆 .

Exercise 14.26 Complete the proof of Proposition 14.9.

Exercise 14.27 Prove Lemma 14.12.

Exercise 14.28 Show that the VC dimension of H from Example 14.18 is indeed 3, by demonstrating that no set of four points can be shattered by H .

Exercise 14.29 Show that the VC dimension of

H := {𝑥 ↦→ 1 [0,∞) (sin(𝑤𝑥)) | 𝑤 ∈ R}

is infinite.

Chapter 15
Generalization in the overparameterized regime

In the previous chapter, we discussed the theory of generalization for deep neural networks trained by minimizing the
empirical risk. A key conclusion was that good generalization is possible as long as we choose an architecture that has
a moderate number of network parameters relative to the number of training samples. Moreover, we saw in Section
14.6 that the best performance can be expected when the neural network size is chosen to balance the generalization
and approximation errors, by minimizing their sum.


Fig. 15.1: ImageNet Classification Competition: Final score on the test set in the Top 1 category vs. Parameters-to-
Training-Samples Ratio. Note that all architectures have more parameters than training samples. Architectures include
AlexNet [106], VGG16 [191], GoogLeNet [198], ResNet50/ResNet152 [77], DenseNet121 [86], ViT-G/14 [222],
EfficientNetB0 [200], and AmoebaNet [167].

Surprisingly, successful network architectures do not necessarily follow these theoretical observations. Consider the
neural network architectures in Figure 15.1. They represent some of the most renowned image classification models,
and all of them participated in the ImageNet Classification Competition [47]. The training set consisted of 1.2 million
images. The 𝑥-axis shows the model performance, and the 𝑦-axis displays the ratio of the number of parameters to the
size of the training set; notably, all architectures have a ratio larger than one, i.e. have more parameters than training
samples. For the largest model, the number of network parameters exceeds the number of training samples by a factor of 1000.

Given that the practical application of deep learning appears to operate in a regime significantly different from the
one analyzed in Chapter 14, we must ask: Why do these methods still work effectively?

15.1 The double descent phenomenon

The success of deep learning in a regime not covered by traditional statistical learning theory puzzled researchers for
some time. In [14], an intriguing set of experiments was performed. These experiments indicate that while the risk
follows the upper bound from Section 14.6 for neural network architectures that do not interpolate the data, the curve
does not expand to infinity in the way that Figure 14.4 suggests. Instead, after surpassing the so-called “interpolation
threshold”, the risk starts to decrease again. This behavior, known as double descent, is illustrated in Figure 15.2.

Fig. 15.2: Illustration of the double descent phenomenon: risk $\mathcal{R}(h)$ and empirical risk $\widehat{\mathcal{R}}_S(h)$ as functions of the expressivity of $\mathcal{H}$. In the classical regime, to the left of the interpolation threshold, we observe the familiar underfitting and overfitting behavior; in the modern regime, to the right of the interpolation threshold, the risk decreases again.

15.1.1 Least-squares regression revisited

To gain further insight, we consider least-squares (kernel) regression as introduced in Section 11.2. Consider a data
sample $(x_j, y_j)_{j=1}^m \subseteq \mathbb{R}^d \times \mathbb{R}$ generated by some ground-truth function $f$, i.e.

𝑦 𝑗 = 𝑓 (𝒙 𝑗 ) for 𝑗 = 1, . . . , 𝑚. (15.1.1)
Let $\phi_j : \mathbb{R}^d \to \mathbb{R}$, $j \in \mathbb{N}$, be a sequence of ansatz functions. For $n \in \mathbb{N}$, we wish to fit a function $x \mapsto \sum_{i=1}^n w_i \phi_i(x)$ to
the data using linear least-squares. To this end, we introduce the feature map

R𝑑 ∋ 𝒙 ↦→ 𝜙(𝒙) := (𝜙1 (𝒙), . . . , 𝜙 𝑛 (𝒙)) ⊤ ∈ R𝑛 .

The goal is to determine coefficients $w \in \mathbb{R}^n$ minimizing the empirical risk

$\widehat{\mathcal{R}}_S(w) = \frac{1}{m} \sum_{j=1}^m \Big( \sum_{i=1}^n w_i \phi_i(x_j) - y_j \Big)^2 = \frac{1}{m} \sum_{j=1}^m \big(\langle \phi(x_j), w \rangle - y_j\big)^2.$

With

$A_n := \begin{pmatrix} \phi_1(x_1) & \cdots & \phi_n(x_1) \\ \vdots & \ddots & \vdots \\ \phi_1(x_m) & \cdots & \phi_n(x_m) \end{pmatrix} = \begin{pmatrix} \phi(x_1)^\top \\ \vdots \\ \phi(x_m)^\top \end{pmatrix} \in \mathbb{R}^{m \times n}$   (15.1.2)

and $y = (y_1, \ldots, y_m)^\top$ it holds

$\widehat{\mathcal{R}}_S(w) = \frac{1}{m} \|A_n w - y\|^2.$   (15.1.3)
As discussed in Sections 11.1-11.2, a unique minimizer of (15.1.3) only exists if 𝑨𝑛 has rank 𝑛. For a minimizer
𝒘𝑛 , the fitted function reads
$f_n(x) := \sum_{j=1}^n w_{n,j} \phi_j(x).$   (15.1.4)

We are interested in the behavior of the 𝑓𝑛 as a function of 𝑛 (the number of ansatz functions/parameters of our model),
and distinguish between two cases:
• Underparameterized: If 𝑛 < 𝑚 we have fewer parameters 𝑛 than training points 𝑚. For the least squares problem
of minimizing R b𝑆 , this means that there are more conditions 𝑚 than free parameters 𝑛. Thus, in general, we cannot
interpolate the data, and we have min𝒘∈R𝑛 R b𝑆 (𝒘) > 0.
• Overparameterized: If 𝑛 ≥ 𝑚, then we have at least as many parameters 𝑛 as training points 𝑚. If the 𝒙 𝑗 and the 𝜙 𝑗
are such that 𝑨𝑛 ∈ R𝑚×𝑛 has full rank 𝑚, then there exists 𝒘 such that R b𝑆 (𝒘) = 0. If 𝑛 > 𝑚, then 𝑨𝑛 necessarily
has a nontrivial kernel, and there exist infinitely many parameters choices 𝒘 that yield zero empirical risk R b𝑆 .
Some of them lead to better, and some lead to worse prediction functions 𝑓𝑛 in (15.1.4).

(a) ansatz functions 𝜙 𝑗 (b) Runge function 𝑓 and data points

Fig. 15.3: Ansatz functions 𝜙1 , . . . , 𝜙40 drawn from a Gaussian process, along with the Runge function and 18
equispaced data points.

In the overparameterized case, there exist many minimizers of R b𝑆 . The training algorithm we use to compute a
minimizer determines the type of prediction function 𝑓𝑛 we obtain. To observe double descent, i.e. to achieve good
generalization for large 𝑛, we need to choose the minimizer carefully. In the following, we consider the unique minimal
2-norm minimizer, which is defined as
 
$w_{n,*} = \operatorname*{arg\,min}_{\{w \in \mathbb{R}^n \,|\, \widehat{\mathcal{R}}_S(w) \leq \widehat{\mathcal{R}}_S(v) \ \forall v \in \mathbb{R}^n\}} \|w\| \ \in \mathbb{R}^n.$   (15.1.5)
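The experiment described in the following subsection can be reproduced with a few lines of NumPy. The sketch below uses random cosine features as ansatz functions (our own choice, standing in for the Gaussian-process draws of Figure 15.3) and computes the minimal 2-norm least-squares solution (15.1.5) via the pseudoinverse. Whether the resulting error curve exhibits the full double descent depends on the choice of ansatz functions; the peak of the coefficient norm near n = m is, however, consistent with Proposition 15.1 below.

import numpy as np

rng = np.random.default_rng(0)
m = 18
x_train = np.linspace(-1, 1, m)
f = lambda x: 1.0 / (1.0 + 25.0 * x ** 2)            # Runge function
y_train = f(x_train)

n_max = 40
freqs = rng.normal(scale=4.0, size=n_max)             # random features (illustrative choice)
shifts = rng.uniform(0, 2 * np.pi, size=n_max)
phi = lambda x, n: np.cos(np.outer(x, freqs[:n]) + shifts[:n])   # feature map (phi_1, ..., phi_n)

x_test = np.linspace(-1, 1, 1000)
for n in [2, 15, 18, 40]:
    A_n = phi(x_train, n)                              # matrix from (15.1.2)
    w_star = np.linalg.pinv(A_n) @ y_train             # minimal 2-norm minimizer (15.1.5)
    err = np.sqrt(np.mean((phi(x_test, n) @ w_star - f(x_test)) ** 2))
    print(f"n = {n:2d}:  ||w|| = {np.linalg.norm(w_star):8.2f},  L2 error ~ {err:.3f}")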

15.1.2 An example

Now let us consider a concrete example. In Figure 15.3 we plot a set of 40 ansatz functions 𝜙1 , . . . , 𝜙40 , which are
drawn from a Gaussian process. Additionally, the figure shows a plot of the Runge function 𝑓 , and 𝑚 = 18 equispaced
points which are used as the training data points. We then fit a function in span{𝜙1 , . . . , 𝜙 𝑛 } via (15.1.5) and (15.1.4).
The result is displayed in Figure 15.4:
• 𝑛 = 2: The model can only represent functions in span{𝜙1 , 𝜙2 }. It is not yet expressive enough to give a meaningful
approximation of 𝑓 .
• 𝑛 = 15: The model has sufficient expressivity to capture the main characteristics of 𝑓 . Since 𝑛 = 15 < 18 = 𝑚, it is not yet able to interpolate the data. Thus it allows us to strike a good balance between the approximation and generalization error, which corresponds to the scenario discussed in Chapter 14.
• 𝑛 = 18: We are at the interpolation threshold. The model is capable of interpolating the data, and there is a unique
𝒘 such that R b𝑆 (𝒘) = 0. Yet, in between data points the behavior of the predictor 𝑓18 seems erratic, and displays
strong oscillations. This is referred to as overfitting, and is to be expected due to our analysis in Chapter 14; while
the approximation error at the data points has improved compared to the case 𝑛 = 15, the generalization error has
gotten worse.
• 𝑛 = 40: This is the overparameterized regime, where we have significantly more parameters than data points. Our
prediction 𝑓40 interpolates the data and appears to be the best overall approximation to 𝑓 so far, due to a “good”
choice of minimizer of R b𝑆 , namely (15.1.5). We also note that, while quite good, the fit is not perfect. We cannot
expect significant improvement in performance by further increasing 𝑛, since at this point the main limiting factor
is the amount of available data. Also see Figure 15.5 (a).

Figure 15.5 (a) displays the error $\|f - f_n\|_{L^2([-1,1])}$ over $n$. We observe the characteristic double descent curve: the error initially decreases, peaks at the interpolation threshold (marked by the dashed red line), and afterwards, in the overparameterized regime, starts to decrease again. Figure 15.5 (b) displays $\|w_{n,*}\|$. Note how the
Euclidean norm of the coefficient vector also peaks at the interpolation threshold.
We emphasize that the precise nature of the convergence curves depends strongly on various factors, such as the
distribution and number of training points 𝑚, the ground truth 𝑓 , and the choice of ansatz functions 𝜙 𝑗 (e.g., the specific
kernel used to generate the 𝜙 𝑗 in Figure 15.3 (a)). In the present setting we achieve a good approximation of 𝑓 for
𝑛 = 15 < 18 = 𝑚 corresponding to the regime where the approximation and interpolation errors are balanced. However,
as Figure 15.5 (a) shows, it can be difficult to determine a suitable value of 𝑛 < 𝑚 a priori, and the acceptable range
of 𝑛 values can be quite narrow. For overparametrization (𝑛 ≫ 𝑚), the precise choice of 𝑛 is less critical, potentially
making the algorithm more stable in this regime. We encourage the reader to conduct similar experiments and explore
different settings to get a better feeling for the double descent phenomenon.

15.2 Size of weights

In Figure 15.5, we observed that the norm of the coefficients ∥𝒘𝑛,∗ ∥ exhibits similar behavior to the 𝐿 2 -error, peaking
at the interpolation threshold 𝑛 = 18. In machine learning, large weights are usually undesirable, as they are associated
with large derivatives or oscillatory behavior. This is evident in the example shown in Figure 15.4 for 𝑛 = 18. Assuming
that the data in (15.1.1) was generated by a “smooth” function 𝑓 , e.g. a function with moderate Lipschitz constant,
these large derivatives of the prediction function may lead to poor generalization. It is important to note that such
a smoothness assumption about 𝑓 may or may not be satisfied. However, if 𝑓 is not smooth, there is little hope of
accurately recovering 𝑓 from limited data (see the discussion in Section 9.2).
The next result gives an explanation for the observed behavior of ∥𝒘𝑛,∗ ∥.

Proposition 15.1 Assume that 𝒙 1 , . . . , 𝒙 𝑚 and the (𝜙 𝑗 ) 𝑗 ∈N are such that 𝑨𝑛 in (15.1.2) has full rank 𝑛 for all 𝑛 ≤ 𝑚.
Given 𝒚 ∈ R𝑚 , denote by 𝒘𝑛,∗ (𝒚) the vector in (15.1.5). Then


(a) 𝑛 = 2 (underparameterization) (b) 𝑛 = 15 (balance of appr. and gen. error)


(c) 𝑛 = 18 (interpolation threshold) (d) 𝑛 = 40 (overparameterization)

Fig. 15.4: Fit of the 𝑚 = 18 red data points using the ansatz functions 𝜙1 , . . . , 𝜙 𝑛 from Figure 15.3, employing equations
(15.1.5) and (15.1.4) for different numbers of ansatz functions 𝑛.

(a) ∥ 𝑓 − 𝑓 𝑛 ∥ 𝐿 2 ( [−1,1]) (b) ∥𝒘𝑛,∗ ∥

Fig. 15.5: The 𝐿 2 -error for the fitted functions in Figure 15.4, and the ℓ 2 -norm of the corresponding coefficient vector
𝒘𝑛,∗ defined in (15.1.5).

$n \mapsto \sup_{\|y\|=1} \|w_{n,*}(y)\|$ is monotonically increasing for $n < m$ and monotonically decreasing for $n \geq m$.

Proof We start with the case 𝑛 ≥ 𝑚. By assumption 𝑨𝑚 has full rank 𝑚, and thus 𝑨𝑛 has rank 𝑚 for all 𝑛 ≥ 𝑚,
see (15.1.2). In particular, there exists 𝒘𝑛 ∈ R𝑛 such that 𝑨𝑛 𝒘𝑛 = 𝒚. Now fix 𝒚 ∈ R𝑚 and let 𝒘𝑛 be any such vector.
Then 𝒘𝑛+1 := (𝒘𝑛 , 0) ∈ R𝑛+1 satisfies 𝑨𝑛+1 𝒘𝑛+1 = 𝒚 and ∥𝒘𝑛+1 ∥ = ∥𝒘𝑛 ∥. Thus necessarily ∥𝒘𝑛+1,∗ ∥ ≤ ∥𝒘𝑛,∗ ∥ for the
minimal norm solutions defined in (15.1.5). Since this holds for every 𝒚, we obtain the statement for 𝑛 ≥ 𝑚.
Now let 𝑛 < 𝑚. Recall that the minimal norm solution can be written through the pseudo inverse

𝒘𝑛,∗ ( 𝒚) = 𝑨†𝑛 𝒚,

see for instance Exercise 11.29. Here,


$A_n^\dagger = V_n \begin{pmatrix} \sigma_{n,1}^{-1} & & & 0 \\ & \ddots & & \vdots \\ & & \sigma_{n,n}^{-1} & 0 \end{pmatrix} U_n^\top \in \mathbb{R}^{n \times m},$

where $A_n = U_n \Sigma_n V_n^\top$ is the singular value decomposition of $A_n$, and

$\Sigma_n = \begin{pmatrix} \sigma_{n,1} & & \\ & \ddots & \\ & & \sigma_{n,n} \\ & 0 & \end{pmatrix} \in \mathbb{R}^{m \times n}$

contains the singular values $\sigma_{n,1} \geq \cdots \geq \sigma_{n,n} > 0$ of $A_n \in \mathbb{R}^{m \times n}$ ordered by decreasing size. Since $V_n \in \mathbb{R}^{n \times n}$ and $U_n \in \mathbb{R}^{m \times m}$ are orthogonal matrices, we have

$\sup_{\|y\|=1} \|w_{n,*}(y)\| = \sup_{\|y\|=1} \|A_n^\dagger y\| = \sigma_{n,n}^{-1}.$

Finally, since the minimal singular value $\sigma_{n,n}$ of $A_n$ can be written as

$\sigma_{n,n} = \inf_{x \in \mathbb{R}^n, \|x\|=1} \|A_n x\| \geq \inf_{x \in \mathbb{R}^{n+1}, \|x\|=1} \|A_{n+1} x\| = \sigma_{n+1,n+1},$

we observe that $n \mapsto \sigma_{n,n}$ is monotonically decreasing for $n \leq m$. This concludes the proof. □
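The key identity in the proof, sup over unit y of ‖A_n^† y‖ equal to the reciprocal of the smallest singular value, can be checked numerically with a random matrix (illustrative sketch):

import numpy as np

rng = np.random.default_rng(0)
m, n = 18, 12
A_n = rng.normal(size=(m, n))

sigma_min = np.linalg.svd(A_n, compute_uv=False).min()
op_norm_pinv = np.linalg.norm(np.linalg.pinv(A_n), ord=2)   # = sup_{||y||=1} ||A_n^+ y||
print(np.isclose(op_norm_pinv, 1.0 / sigma_min))            # True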

15.3 Theoretical justification

Let us now discuss one explanation of the double descent phenomenon for neural networks. Before we proceed, it
has to be mentioned that there are many alternative explanations of the double descent phenomenon in the literature
(described in the bibliography section at the end of this chapter). The explanation that we give below is essentially
based on the ideas in [12], but is strongly simplified.
The key assumption that we will make is that large overparameterized neural networks are typically Lipschitz
continuous with a Lipschitz constant independent of the size. This is a consequence of the neural networks having
relatively small weights. Indeed, let us consider neural networks from the space N (𝜎; A, 𝐵) for a 𝐶 𝜎 -Lipschitz
activation function such that 𝐵 ≤ 𝑐 𝐵 · (𝑑max 𝐶 𝜎 ) −1 for a 𝑐 𝐵 > 0. Then by Lemma 13.2 we have that

$\mathcal{N}(\sigma; \mathcal{A}, B) \subseteq \mathrm{Lip}(c_B^L).$   (15.3.1)

For large architectures trained by stochastic gradient descent, the assumption that 𝐵 ≤ 𝑐 𝐵 · (𝑑max 𝐶 𝜎 ) −1 is not unrealistic.
Indeed, the weights of the neural network do not move much from their initialization, see Chapter 11, and results in
[34, 3]. Moreover, the weights are also typically initialized in a way so that the modulus of the weights is proportional
to the reciprocal of the number of neurons in the associated layer, which, for relatively balanced neural networks,
corresponds to the assumption 𝐵 ≤ 𝑐 𝐵 · (𝑑max 𝐶 𝜎 ) −1 . An alternative argument justifying the small weight assumption
is that many training routines use regularization terms on the weights, thereby encouraging them the optimization
routine to find small weights.
We can study the generalization capacity of Lipschitz functions through our covering-number-based learning results. The set $\mathrm{Lip}(C)$ of $C$-Lipschitz functions on a compact $d$-dimensional Euclidean domain has covering numbers satisfying

log(G(Lip(𝐶), 𝜖, 𝐿 ∞ )) = O ((𝐶/𝜖) 𝑑 ) for all 𝜖 > 0. (15.3.2)

A proof can be found in [67, Lemma 7], see also [204].


As a result of these considerations, we can identify two regimes:
• Standard regime: For small architectures A, we can consider neural networks as a set parameterized by 𝑛 A
parameters. As we have seen before, this yields a covering number that scales linearly with 𝑛 A . As long as 𝑛 A is
small in comparison with the number of samples, we can expect good generalization by Theorem 14.15.
• Overparameterized regime: For large architectures A, but small weights as described before, we can consider
neural networks as a subset of Lip(𝐶) for a constant 𝐶 > 0. This set has a covering number bound that is
independent of the number of parameters $n_{\mathcal{A}}$.
Choosing the better of the two generalization bounds for each regime yields the following result.

Theorem 15.2 Let 𝐶, 𝐶 L > 0 and let L : [−1, 1] × [−1, 1] → R be 𝐶 L -Lipschitz. Further, let A = (𝑑0 , 𝑑1 , . . . , 𝑑 𝐿+1 ) ∈
N 𝐿+2 , let 𝜎 : R → R be 𝐶 𝜎 -Lipschitz continuous with 𝐶 𝜎 ≥ 1, and |𝜎(𝑥)| ≤ 𝐶 𝜎 |𝑥| for all 𝑥 ∈ R, and let 𝐵 > 0.
Then, there exist 𝑐 1 , 𝑐 2 > 0, such that for every 𝑚 ∈ N and every distribution D on [−1, 1] 𝑑0 × [−1, 1] it holds with
probability at least 1 − 𝛿 over the choice of a sample 𝑆 ∼ D 𝑚 that
$|\mathcal{R}(\Phi) - \widehat{\mathcal{R}}_S(\Phi)| \leq g(\mathcal{A}, C_\sigma, B, m) + 4 C_{\mathcal{L}} \sqrt{\frac{\log(4/\delta)}{m}},$   (15.3.3)

where

$g(\mathcal{A}, C_\sigma, B, m) = \min\Bigg\{ c_1 \sqrt{\frac{n_{\mathcal{A}} \log(\lceil \sqrt{m} \rceil) + L n_{\mathcal{A}} \log(d_{\max})}{m}},\ c_2\, m^{-\frac{1}{2+d_0}} \Bigg\},$

for all $\Phi \in \mathcal{N}^*(\sigma; \mathcal{A}, B) \cap \mathrm{Lip}(C)$.

Proof Invoking Theorem 14.11 with $\alpha = 1/(2 + d_0)$ and the covering number bound (15.3.2), we obtain that with probability $1 - \delta/2$ the following bound holds. Here, $\tilde{c}_0 > 0$ is the implicit constant of (15.3.2) and we apply the concavity of the square root in the second inequality:

$|\mathcal{R}(\Phi) - \widehat{\mathcal{R}}_S(\Phi)| \leq 4 C_{\mathcal{L}} \sqrt{\frac{\tilde{c}_0 (m^\alpha C)^{d_0} + \log(4/\delta)}{m}} + \frac{2 C_{\mathcal{L}}}{m^\alpha}$

$\leq 4 C_{\mathcal{L}} \sqrt{\tilde{c}_0 C^{d_0} m^{d_0/(d_0+2) - 1}} + \frac{2 C_{\mathcal{L}}}{m^\alpha} + 4 C_{\mathcal{L}} \sqrt{\frac{\log(4/\delta)}{m}}$

$= 4 C_{\mathcal{L}} \sqrt{\tilde{c}_0 C^{d_0} m^{-2/(d_0+2)}} + \frac{2 C_{\mathcal{L}}}{m^\alpha} + 4 C_{\mathcal{L}} \sqrt{\frac{\log(4/\delta)}{m}}$

$= \frac{4 C_{\mathcal{L}} \sqrt{\tilde{c}_0 C^{d_0}} + 2 C_{\mathcal{L}}}{m^\alpha} + 4 C_{\mathcal{L}} \sqrt{\frac{\log(4/\delta)}{m}}$

for all $\Phi \in \mathrm{Lip}(C)$.
In addition, Theorem 14.15 yields that with probability $1 - \delta/2$

$|\mathcal{R}(\Phi) - \widehat{\mathcal{R}}_S(\Phi)| \leq 4 C_{\mathcal{L}} \sqrt{\frac{n_{\mathcal{A}} \log(\lceil B \sqrt{m} \rceil) + L n_{\mathcal{A}} \log(\lceil 2 C_\sigma B d_{\max} \rceil) + \log(4/\delta)}{m}} + \frac{2 C_{\mathcal{L}}}{\sqrt{m}}$

$\leq 6 C_{\mathcal{L}} \sqrt{\frac{n_{\mathcal{A}} \log(\lceil B \sqrt{m} \rceil) + L n_{\mathcal{A}} \log(\lceil 2 C_\sigma B d_{\max} \rceil)}{m}} + 4 C_{\mathcal{L}} \sqrt{\frac{\log(4/\delta)}{m}}$

for all $\Phi \in \mathcal{N}^*(\sigma; \mathcal{A}, B)$.
Then, for $\Phi \in \mathcal{N}^*(\sigma; \mathcal{A}, B) \cap \mathrm{Lip}(C)$, the minimum of both upper bounds holds with probability at least $1 - \delta$. □

Remark 15.3 Theorem 15.2 describes two regimes that govern the generalization of neural-network-based learning. The regimes correspond to the two terms that appear in the minimum in the definition of $g(\mathcal{A}, C_\sigma, B, m)$. The first term increases with $n_{\mathcal{A}}$ and the second is constant in $n_{\mathcal{A}}$. In the first regime, where the first term is smaller, the generalization gap grows with $n_{\mathcal{A}}$. In the second regime, i.e., when the second term is smaller, the generalization gap is constant in $n_{\mathcal{A}}$. Moreover, we can assume that the empirical risk $\widehat{\mathcal{R}}_S$ will decrease with an increasing number of parameters $n_{\mathcal{A}}$.
As a result, the risk can be upper bounded by

$\mathcal{R}(\Phi) \leq \widehat{\mathcal{R}}_S(\Phi) + g(\mathcal{A}, C_\sigma, B, m) + 4 C_{\mathcal{L}} \sqrt{\frac{\log(4/\delta)}{m}},$

and the right-hand side of this upper bound is always monotonically decreasing in the second regime, but can both decrease and increase in the first. In some cases, this behavior leads to an upper bound on the risk that resembles the curve of Figure 15.2. We will describe a set-up where this is the case in the next section.

Remark 15.4 We add that Theorem 15.2 stipulates Lipschitz neural networks for all architectures. As we saw in Sections 15.1.2 and 15.2, this assumption is likely not valid at the interpolation threshold. Hence, Theorem 15.2 likely gives an overly optimistic upper bound for practical scenarios near the interpolation threshold.

15.4 Double descent for neural network learning

Now let us understand the double descent phenomenon in the context of Theorem 15.2. We make a couple of simplifying
assumptions to get a formula for an upper bound on the risk. First, we assume that the data $S = (x_i, y_i)_{i=1}^m \in (\mathbb{R}^{d_0} \times \mathbb{R})^m$ stem from a $C_M$-Lipschitz continuous function. In addition, we fix a depth $L \in \mathbb{N}$ and consider, for $d \in \mathbb{N}$, architectures of the form

$\mathcal{A}_d = (d_0, d, \ldots, d, 1).$

We note that, for this choice of architecture, we have $n_{\mathcal{A}_d} = (d_0+1)d + (L-1)(d+1)d + d + 1$. Further, we restrict our analysis to the ReLU activation function $\sigma_{\mathrm{ReLU}}$.
Under these assumptions we will now derive upper bounds on the risk. We start by finding an upper bound on the
empirical risk and then applying Theorem 15.2 to establish an upper bound on the generalization gap. In combination,
these estimates provide an upper bound on the risk. We will then observe that this upper bound follows the shape of
the curve in Figure 15.2.

15.4.1 Upper bound on empirical risk

We are interested in establishing an upper bound on R b𝑆 (Φ) when assuming Φ ∈ N ∗ (𝜎ReLU ; A 𝑑 , 𝐵) ∩ Lip(𝐶 𝑀 ). For
𝐵 ≥ 𝐶 𝑀 , we can apply Theorem 9.6, and conclude that we can interpolate 𝑚 points from a 𝐶 𝑀 -Lipschitz function with
a neural network in Lip(𝐶 𝑀 ), if 𝑛 A ≥ log(𝑚)𝑑0 𝑚. Thus, R b𝑆 (Φ) = 0 as soon as 𝑛 A ≥ log(𝑚)𝑑0 𝑚.
In addition, depending on smoothness properties of the data, the interpolation error may decay with some rate, by
one of the results in Chapters 5, 7, or 8. For simplicity, we assume that R̂_𝑆(Φ) = 𝑂(𝑛_A^{−1}) for 𝑛_A significantly smaller
than log(𝑚)𝑑_0𝑚. If we combine these two assumptions, we can make the following Ansatz for the empirical risk of
Φ_{A_𝑑} ∈ N^∗(𝜎_ReLU; A_𝑑, 𝐵) ∩ Lip(𝐶_𝑀):
R̂_𝑆(Φ_{A_𝑑}) ≤ R̃_𝑆(Φ_{A_𝑑}) := 𝐶_approx max{0, 𝑛_{A_𝑑}^{−1} − (log(𝑚)𝑑_0𝑚)^{−1}}     (15.4.1)

for a constant 𝐶_approx > 0. Note that we can interpolate the sample 𝑆 already with 𝑑_0𝑚 parameters by Theorem 9.3.
However, it is not guaranteed that this can be done using 𝐶 𝑀 -Lipschitz neural networks.

15.4.2 Upper bound on generalization gap

We complement the bound on the empirical risk by an upper bound on the generalization gap. Invoking the notation of Theorem 15.2,
we have

𝑔(A_𝑑, 𝐶_{𝜎_ReLU}, 𝐵, 𝑚) = min{𝜅_NN(A_𝑑, 𝑚; 𝑐_1), 𝜅_Lip(A_𝑑, 𝑚; 𝑐_2)},

where

𝜅_NN(A_𝑑, 𝑚; 𝑐_1) = 𝑐_1 √( (𝑛_{A_𝑑} log(⌈√𝑚⌉) + 𝐿𝑛_{A_𝑑} log(𝑑)) / 𝑚 ),
𝜅_Lip(A_𝑑, 𝑚; 𝑐_2) = 𝑐_2 𝑚^{−1/(2+𝑑_0)}     (15.4.2)
for some constants 𝑐 1 , 𝑐 2 > 0.

15.4.3 Upper bound on risk

Next, we combine (15.4.1) and (15.4.2) to obtain an upper bound on the risk R (Φ A 𝑑 ). Specifically, we define


R̃(Φ_{A_𝑑}) := R̃_𝑆(Φ_{A_𝑑}) + min{𝜅_NN(A_𝑑, 𝑚; 𝑐_1), 𝜅_Lip(A_𝑑, 𝑚; 𝑐_2)} + 4𝐶_L √(log(4/𝛿)/𝑚).     (15.4.3)
We depict in Figure 15.6 the upper bound on the risk given by (15.4.3) (excluding the terms that do not depend
on the architecture). The upper bound clearly resembles the double descent phenomenon of Figure 15.2. Note that the
Lipschitz interpolation point, i.e., the point from which on we assume the empirical risk to be 0, lies slightly beyond this
threshold. To produce the plot, we chose 𝐿 = 5, 𝑐_1 = 1.2 · 10^{−4}, 𝑐_2 = 6.5 · 10^{−3}, 𝑚 = 10,000, 𝑑_0 = 6, 𝐶_approx = 30. We mention
that the double descent phenomenon is not visible for all choices of parameters. Moreover, in our model, the fact that
the "peak" coincides with the interpolation threshold is due to the choice of constants and does not emerge from the
model itself. Other models of double descent explain the location of the peak more accurately [126, 73]. However, as we have
seen before in Sections 15.1.2 and 15.2, the assumption that the neural networks are Lipschitz need not be valid close to
the interpolation threshold, which means that for practical scenarios our upper bound on the risk is too optimistic there.
In other words, the peak that is observed in Figure 15.2 is likely more pronounced than the one obtained in Figure 15.6 below.
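The following minimal Python sketch (an illustration of the model above, not part of the derivation) evaluates the architecture-dependent part of (15.4.3) for the constants just listed; the confidence term 4𝐶_L √(log(4/𝛿)/𝑚) is omitted since it does not depend on 𝑑, and all variable names are ours.

```python
import numpy as np

# Constants reported in the text: L = 5, c_1 = 1.2e-4, c_2 = 6.5e-3, m = 10,000, d_0 = 6, C_approx = 30.
L, m, d0 = 5, 10_000, 6
c1, c2, C_approx = 1.2e-4, 6.5e-3, 30.0

def n_params(d):
    # Number of parameters of the architecture A_d = (d_0, d, ..., d, 1).
    return (d0 + 1) * d + (L - 1) * (d + 1) * d + d + 1

def risk_bound(d):
    n = n_params(d)
    # Ansatz (15.4.1) for the empirical risk.
    emp = C_approx * max(0.0, 1.0 / n - 1.0 / (np.log(m) * d0 * m))
    # Generalization-gap terms (15.4.2).
    kappa_nn = c1 * np.sqrt((n * np.log(np.ceil(np.sqrt(m))) + L * n * np.log(d)) / m)
    kappa_lip = c2 * m ** (-1.0 / (2 + d0))
    return emp + min(kappa_nn, kappa_lip)

widths = np.arange(2, 500)
values = [risk_bound(d) for d in widths]
# Plotting values against n_params(d) reproduces the qualitative shape of Figure 15.6.
```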

Fig. 15.6: Upper bound on R (Φ A 𝑑 ) in the context of Theorem 15.2 as derived in (15.4.3). For better visibility the part
corresponding to 𝑦-values between 0.0012 and 0.0022 is not shown. We denote the interpolation threshold according
to Theorem 9.3 with a vertical dashed black line.

Bibliography and further reading

The discussion on kernel regression and the effect of the number of parameters on the norm of the weights was already
given in [14]. Similar analyses, with more complex ansatz systems and more precise asymptotic estimates, are found
in [126, 73]. Our results in Section 15.3 are inspired by [12]; see also [141].
For a detailed account of further arguments justifying the surprisingly good generalization capabilities of overpa-
rameterized neural networks, we refer to [19, Section 2]. Here, we only briefly mention two additional directions of
inquiry. First, if the learning algorithm introduces a form of robustness, this can be leveraged to yield generalization
bounds [7, 218, 24, 158]. Second, for very overparameterized neural networks, it was stipulated in [94] that neural
networks become linear kernel interpolators based on the neural tangent kernel of Section 11.5.2. Thus, for large neural
networks, generalization can be studied through kernel regression [94, 115, 15, 119].

Chapter 16
Robustness and adversarial examples

How sensitive is the output of a neural network to small changes in its input? Real-world observations of trained
neural networks often reveal that even barely noticeable modifications of the input can lead to drastic variations in the
network’s predictions. This intriguing behavior was first documented in the context of image classification in [199].
Figure 16.1 illustrates this concept. The left panel shows a picture of a panda that the neural network correctly
classifies as a panda. By adding an almost imperceptible amount of noise to the image, we obtain the modified image in
the right panel. To a human, there is no visible difference, but the neural network classifies the perturbed image as a
wombat. This phenomenon, where a correctly classified image is misclassified after a slight perturbation, is termed an
adversarial example.
In practice, such behavior is highly undesirable. It indicates that our learning algorithm might not be very reliable
and poses a potential security risk, as malicious actors could exploit it to trick the algorithm. In this chapter, we
describe the basic mathematical principles behind adversarial examples and investigate simple conditions under which
they might or might not occur. For simplicity, we restrict ourselves to a binary classification problem but note that the
main ideas remain valid in more general situations.

16.1 Adversarial examples

Let us start by formalizing the notion of an adversarial example. We consider the problem of assigning a label
𝑦 ∈ {−1, 1} to a vector 𝒙 ∈ R𝑑 . It is assumed that the relation between 𝒙 and 𝑦 is described by a distribution D on
R𝑑 × {−1, 1}. In particular, for a given 𝒙, both values −1 and 1 could have positive probability, i.e. the label is not

[Figure 16.1 shows three panels: an image of a panda, plus 0.01 times a barely visible noise image, equals a perturbed image that a human still recognizes as a panda. The neural network classifier labels the three images as panda (high confidence), flamingo (low confidence), and wombat (high confidence), respectively.]

Fig. 16.1: Sketch of the phenomenon of an adversarial example.

necessarily deterministic. Additionally, we let

𝐷 𝒙 := {𝒙 ∈ R𝑑 | ∃𝑦 s.t. (𝒙, 𝑦) ∈ supp(D)},

and refer to 𝐷 𝒙 as the feature support.


Throughout this chapter we denote by
𝑔 : R𝑑 → {−1, 0, 1}
a fixed so-called ground-truth classifier, satisfying1

P[𝑦 = 𝑔(𝒙)|𝒙] ≥ P[𝑦 = −𝑔(𝒙)|𝒙] for all 𝒙 ∈ 𝐷 𝒙 . (16.1.1)

Note that we allow 𝑔 to take the value 0, which is to be understood as an additional label corresponding to nonrelevant
or nonsensical input data 𝒙. We will refer to 𝑔 −1 (0) as the nonrelevant class. The ground truth 𝑔 is interpreted as how
a human would classify the data, as the following example illustrates.
Example 16.1 We wish to classify whether an image shows a panda (𝑦 = 1) or a wombat (𝑦 = −1). Consider again
Figure 16.1, and denote the three images by 𝒙1 , 𝒙2 , 𝒙 3 . The first image 𝒙1 is a photograph of a panda. Together with a
label 𝑦, it can be interpreted as a draw (𝒙1 , 𝑦) from D, i.e. 𝒙1 ∈ 𝐷 𝒙 and 𝑔(𝒙1 ) = 1. The second image 𝒙2 displays noise
and corresponds to nonrelevant data as it shows neither a panda nor a wombat. In particular, 𝒙2 ∈ 𝐷 𝑐𝒙 and 𝑔(𝒙2 ) = 0.
The third (perturbed) image 𝒙3 also belongs to 𝐷 𝑐𝒙 , as it is not a photograph but a noise corrupted version of 𝒙1 .
Nonetheless, it is not nonrelevant, as a human would classify it as a panda. Thus 𝑔(𝒙3 ) = 1.
Additional to the ground truth 𝑔, we denote by

ℎ : R𝑑 → {−1, 1}

some trained classifier.

Definition 16.2 Let 𝑔 : R𝑑 → {−1, 0, 1} be the ground-truth classifier, let ℎ : R𝑑 → {−1, 1} be a classifier, and let
∥ · ∥ ∗ be a norm on R𝑑 . For 𝒙 ∈ R𝑑 and 𝛿 > 0, we call 𝒙 ′ ∈ R𝑑 an adversarial example to 𝒙 ∈ R𝑑 with perturbation
𝛿, if and only if
(i) ∥𝒙 ′ − 𝒙∥ ∗ ≤ 𝛿,
(ii) 𝑔(𝒙)𝑔(𝒙 ′ ) > 0,
(iii) ℎ(𝒙) = 𝑔(𝒙) and ℎ(𝒙 ′ ) ≠ 𝑔(𝒙 ′ ).

In words, 𝒙 ′ is an adversarial example to 𝒙 with perturbation 𝛿, if (i) the distance of 𝒙 and 𝒙 ′ is at most 𝛿, (ii) 𝒙 and
𝒙 ′ belong to the same (not nonrelevant) class according to the ground truth classifier, and (iii) the classifier ℎ correctly
classifies 𝒙 but misclassifies 𝒙 ′ .

Remark 16.3 We emphasize that the concept of a ground-truth classifier 𝑔 differs from a minimizer of the Bayes
risk (14.1.1) for two reasons. First, we allow for an additional label 0 corresponding to the nonrelevant class, which
does not exist for the data generating distribution D. Second, 𝑔 should correctly classify points outside of 𝐷 𝒙 ; small
perturbations of images as we find them in adversarial examples, are not regular images in 𝐷 𝒙 . Nonetheless, a human
classifier can still classify these images, and 𝑔 models this property of human classification.

16.2 Bayes classifier

At first sight, an adversarial example seems to be no more than a misclassified sample. Naturally, these exist if the
model does not generalize well. In this section we present a more nuanced view from [194].

1 To be more precise, the conditional distribution of 𝑦 | 𝒙 is only well-defined almost everywhere w.r.t. the marginal distribution of 𝒙. Thus
(16.1.1) can only be assumed to hold for almost every 𝒙 ∈ 𝐷_𝒙 w.r.t. the marginal distribution of 𝒙.

To avoid edge cases, we assume in the following that for all 𝒙 ∈ 𝐷 𝒙

either P[𝑦 = 1|𝒙] > P[𝑦 = −1|𝒙] or P[𝑦 = 1|𝒙] < P[𝑦 = −1|𝒙] (16.2.1)

so that (16.1.1) uniquely defines 𝑔(𝒙) for 𝒙 ∈ 𝐷 𝒙 . We say that the distribution exhausts the domain if 𝐷 𝒙 ∪𝑔 −1 (0) = R𝑑 .
This means that every point is either in the feature support 𝐷 𝒙 or it belongs to the nonrelevant class. Moreover, we say
that ℎ is a Bayes classifier if
P[ℎ(𝒙)|𝒙] ≥ P[−ℎ(𝒙)|𝒙] for all 𝒙 ∈ 𝐷 𝒙 .
By (16.1.1), the ground truth 𝑔 is a Bayes classifier, and (16.2.1) ensures that ℎ coincides with 𝑔 on 𝐷 𝒙 if ℎ is a Bayes
classifier. It is easy to see that a Bayes classifier minimizes the Bayes risk.
With these two notions, we now distinguish between four cases.
(i) Bayes classifier/exhaustive distribution: If ℎ is a Bayes classifier and the data exhausts the domain, then there are
no adversarial examples. This is because every 𝒙 ∈ R𝑑 either belongs to the nonrelevant class or is classified the
same by ℎ and 𝑔.
(ii) Bayes classifier/non-exhaustive distribution: If ℎ is a Bayes classifier and the distribution does not exhaust the
domain, then adversarial examples can exist. Even though the learned classifier ℎ coincides with the ground truth 𝑔
on the feature support, adversarial examples can be constructed for data points on the complement of 𝐷 𝒙 ∪ 𝑔 −1 (0),
which is not empty.
(iii) Not a Bayes classifier/exhaustive distribution: The set 𝐷 𝒙 can be covered by the four subdomains

𝐶_1 = ℎ^{−1}(1) ∩ 𝑔^{−1}(1),  𝐹_1 = ℎ^{−1}(−1) ∩ 𝑔^{−1}(1),
𝐶_{−1} = ℎ^{−1}(−1) ∩ 𝑔^{−1}(−1),  𝐹_{−1} = ℎ^{−1}(1) ∩ 𝑔^{−1}(−1).     (16.2.2)

If dist(𝐶1 ∩ 𝐷 𝒙 , 𝐹1 ∩ 𝐷 𝒙 ) or dist(𝐶−1 ∩ 𝐷 𝒙 , 𝐹−1 ∩ 𝐷 𝒙 ) is smaller than 𝛿, then there exist points 𝒙, 𝒙 ′ ∈ 𝐷 𝒙 such
that 𝒙 ′ is an adversarial example to 𝑥 with perturbation 𝛿. Hence, adversarial examples in the feature support can
exist. This is, however, not guaranteed to happen. For example, 𝐷 𝒙 does not need to be connected if 𝑔 −1 (0) ≠ ∅,
see Exercise 16.18. Hence, even for classifiers that have incorrect predictions on the data, adversarial examples do
not need to exist.
(iv) Not a Bayes classifier/non-exhaustive distribution: In this case everything is possible. Data points and their
associated adversarial examples can appear in the feature support of the distribution and adversarial examples to
elements in the feature support of the distribution can be created by leaving the feature support of the distribution.
We will see examples in the following section.

16.3 Affine classifiers

For linear classifiers, a simple argument outlined in [199] and [65] showcases that the high-dimensionality of the input,
common in image classification problems, is a potential cause for the existence of adversarial examples.
A linear classifier is a map of the form

𝒙 ↦→ sign(𝒘 ⊤ 𝒙) where 𝒘, 𝒙 ∈ R𝑑 .

Let
𝒙′ := 𝒙 − 2|𝒘^⊤𝒙| sign(𝒘^⊤𝒙) sign(𝒘) / ∥𝒘∥_1,
where sign(𝒘) is understood coordinate-wise. Then ∥𝒙−𝒙 ′ ∥ ∞ ≤ 2|𝒘 ⊤ 𝒙|/∥𝒘∥ 1 and it is not hard to see that sign(𝒘 ⊤ 𝒙 ′ ) ≠
sign(𝒘 ⊤ 𝒙).
For high-dimensional vectors 𝒘, 𝒙 which are chosen at random, possibly dependent, but so that 𝒘 is uniformly
distributed on a 𝑑 − 1 dimensional sphere, it holds with high probability that

|𝒘^⊤𝒙| / ∥𝒘∥_1 ≤ ∥𝒙∥ ∥𝒘∥ / ∥𝒘∥_1 ≪ ∥𝒙∥.     (16.3.1)
One way to observe (16.3.1) is to see that for every 𝑐 > 0 it holds that

𝜇({𝒘 ∈ R𝑑 | ∥𝒘∥ 1 > 𝑐, ∥𝒘∥ ≤ 1}) → 1 for 𝑑 → ∞, (16.3.2)

where 𝜇 is the uniform probability measure on the 𝑑-dimensional Euclidean unit ball, see Exercise 16.17. Thus, if 𝒙
has a moderate Euclidean norm, the perturbation ∥𝒙 − 𝒙′∥_∞ is likely small in high dimensions.
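The following minimal Python sketch illustrates this effect numerically: it draws 𝒘 uniformly from the Euclidean unit sphere, fixes ∥𝒙∥ = 10, and estimates the size 2|𝒘^⊤𝒙|/∥𝒘∥_1 of the sup-norm perturbation for increasing dimension. Here 𝒘 and 𝒙 are drawn independently, which is a simplifying assumption not imposed in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_perturbation(d, trials=2000):
    # w uniform on the (d-1)-sphere, x a random input rescaled to ||x|| = 10.
    w = rng.normal(size=(trials, d))
    w /= np.linalg.norm(w, axis=1, keepdims=True)
    x = rng.normal(size=(trials, d))
    x *= 10.0 / np.linalg.norm(x, axis=1, keepdims=True)
    # Size of the sup-norm perturbation that flips sign(w^T x).
    return np.mean(2 * np.abs(np.sum(w * x, axis=1)) / np.linalg.norm(w, ord=1, axis=1))

for d in [10, 100, 1000, 10_000]:
    print(d, mean_perturbation(d))   # decays roughly like 1/d for independent w and x
```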
Below we give a sufficient condition for the existence of adversarial examples, in case both ℎ and the ground truth
𝑔 are linear classifiers.

Theorem 16.4 Let 𝒘, 𝒘̄ ∈ R^𝑑 be nonzero. For 𝒙 ∈ R^𝑑, let ℎ(𝒙) = sign(𝒘^⊤𝒙) be a classifier and let 𝑔(𝒙) = sign(𝒘̄^⊤𝒙)
be the ground-truth classifier.
For every 𝒙 ∈ R^𝑑 with ℎ(𝒙)𝑔(𝒙) > 0 and all 𝜀 ∈ (0, |𝒘^⊤𝒙|) such that

|𝒘̄^⊤𝒙| / ∥𝒘̄∥ > (𝜀 + |𝒘^⊤𝒙|) / ∥𝒘∥ · |𝒘^⊤𝒘̄| / (∥𝒘∥ ∥𝒘̄∥)     (16.3.3)

it holds that

𝒙′ = 𝒙 − ℎ(𝒙) (𝜀 + |𝒘^⊤𝒙|) / ∥𝒘∥² · 𝒘     (16.3.4)

is an adversarial example to 𝒙 with perturbation 𝛿 = (𝜀 + |𝒘^⊤𝒙|)/∥𝒘∥.

Before we present the proof, we give some interpretation of this result. First, note that {𝒙 ∈ R^𝑑 | 𝒘^⊤𝒙 = 0} is the
decision boundary of ℎ, meaning that points lying on opposite sides of this hyperplane are classified differently by ℎ.
Due to |𝒘^⊤𝒘̄| ≤ ∥𝒘∥ ∥𝒘̄∥, (16.3.3) implies that an adversarial example always exists whenever

|𝒘̄^⊤𝒙| / ∥𝒘̄∥ > |𝒘^⊤𝒙| / ∥𝒘∥.     (16.3.5)

The left term is the decision margin of 𝒙 for 𝑔, i.e. the distance of 𝒙 to the decision boundary of 𝑔. Similarly, the term
on the right is the decision margin of 𝒙 for ℎ. Thus we conclude that adversarial examples exist if the decision margin
of 𝒙 for the ground truth 𝑔 is larger than that for the classifier ℎ.
Second, the term (𝒘^⊤𝒘̄)/(∥𝒘∥ ∥𝒘̄∥) describes the alignment of the two classifiers. If the classifiers are not aligned,
i.e., 𝒘 and 𝒘̄ have a large angle between them, then adversarial examples exist even if the margin of the classifier is
larger than that of the ground-truth classifier.
Finally, adversarial examples with small perturbation are possible if |𝒘^⊤𝒙| ≪ ∥𝒘∥. The extreme case 𝒘^⊤𝒙 = 0
means that 𝒙 lies on the decision boundary of ℎ, and if |𝒘^⊤𝒙| ≪ ∥𝒘∥ then 𝒙 is close to the decision boundary of ℎ.
Proof (of Theorem 16.4) We verify that 𝒙′ in (16.3.4) satisfies the conditions of an adversarial example in Definition
16.2. In the following we will use that due to ℎ(𝒙)𝑔(𝒙) > 0

𝑔(𝒙) = sign(𝒘̄^⊤𝒙) = sign(𝒘^⊤𝒙) = ℎ(𝒙) ≠ 0.     (16.3.6)

First, it holds

∥𝒙 − 𝒙′∥ = ∥ (𝜀 + |𝒘^⊤𝒙|)/∥𝒘∥² · 𝒘 ∥ = (𝜀 + |𝒘^⊤𝒙|)/∥𝒘∥ = 𝛿.

Next we show 𝑔(𝒙)𝑔(𝒙′) > 0, i.e. that (𝒘̄^⊤𝒙)(𝒘̄^⊤𝒙′) is positive. Plugging in the definition of 𝒙′, this term reads
𝒘̄^⊤𝒙 (𝒘̄^⊤𝒙 − ℎ(𝒙)(𝜀 + |𝒘^⊤𝒙|)/∥𝒘∥² · 𝒘̄^⊤𝒘) = |𝒘̄^⊤𝒙|² − |𝒘̄^⊤𝒙| (𝜀 + |𝒘^⊤𝒙|)/∥𝒘∥² · 𝒘̄^⊤𝒘
≥ |𝒘̄^⊤𝒙|² − |𝒘̄^⊤𝒙| (𝜀 + |𝒘^⊤𝒙|)/∥𝒘∥² · |𝒘̄^⊤𝒘|,     (16.3.7)

where the equality holds because ℎ(𝒙) = 𝑔(𝒙) = sign(𝒘̄^⊤𝒙) by (16.3.6). Dividing the right-hand side of (16.3.7) by
|𝒘̄^⊤𝒙| ∥𝒘̄∥, which is positive by (16.3.6), we obtain

|𝒘̄^⊤𝒙| / ∥𝒘̄∥ − (𝜀 + |𝒘^⊤𝒙|) / ∥𝒘∥ · |𝒘^⊤𝒘̄| / (∥𝒘∥ ∥𝒘̄∥).     (16.3.8)
The term (16.3.8) is positive thanks to (16.3.3).
Finally, we check that 0 ≠ ℎ(𝒙′) ≠ ℎ(𝒙), i.e. (𝒘^⊤𝒙)(𝒘^⊤𝒙′) < 0. We have that

(𝒘^⊤𝒙)(𝒘^⊤𝒙′) = |𝒘^⊤𝒙|² − 𝒘^⊤𝒙 ℎ(𝒙) (𝜀 + |𝒘^⊤𝒙|)/∥𝒘∥² · 𝒘^⊤𝒘 = |𝒘^⊤𝒙|² − |𝒘^⊤𝒙|(𝜀 + |𝒘^⊤𝒙|) < 0,

where we used that ℎ(𝒙) = sign(𝒘^⊤𝒙) and 𝒘^⊤𝒘 = ∥𝒘∥². This completes the proof. □
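As a small numerical sanity check of the construction, the following Python sketch instantiates 𝒘, 𝒘̄ and 𝒙 at random, verifies condition (16.3.3), and confirms that 𝒙′ from (16.3.4) is misclassified by ℎ but not by 𝑔. All specific numbers are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
w_bar = rng.normal(size=d); w_bar /= np.linalg.norm(w_bar)     # ground truth g
w = w_bar + 0.3 * rng.normal(size=d); w /= np.linalg.norm(w)   # learned classifier h

x = rng.normal(size=d)
assert np.sign(w @ x) == np.sign(w_bar @ x), "need h(x) = g(x); redraw x otherwise"

eps = 1e-3
# Condition (16.3.3).
lhs = abs(w_bar @ x) / np.linalg.norm(w_bar)
rhs = (eps + abs(w @ x)) / np.linalg.norm(w) \
      * abs(w @ w_bar) / (np.linalg.norm(w) * np.linalg.norm(w_bar))
if lhs > rhs:
    # Adversarial example (16.3.4).
    x_adv = x - np.sign(w @ x) * (eps + abs(w @ x)) / np.linalg.norm(w) ** 2 * w
    print("h flips:", np.sign(w @ x) != np.sign(w @ x_adv))                # True
    print("g unchanged:", np.sign(w_bar @ x) == np.sign(w_bar @ x_adv))    # True
    print("perturbation:", np.linalg.norm(x - x_adv))                      # equals delta
```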
Theorem 16.4 readily implies the following proposition for affine classifiers.
Proposition 16.5 Let 𝒘, 𝒘̄ ∈ R^𝑑 and 𝑏, 𝑏̄ ∈ R. For 𝒙 ∈ R^𝑑 let ℎ(𝒙) = sign(𝒘^⊤𝒙 + 𝑏) be a classifier and let
𝑔(𝒙) = sign(𝒘̄^⊤𝒙 + 𝑏̄) be the ground-truth classifier.
For every 𝒙 ∈ R^𝑑 with 𝒘^⊤𝒙 ≠ 0, ℎ(𝒙)𝑔(𝒙) > 0, and all 𝜀 ∈ (0, |𝒘^⊤𝒙 + 𝑏|) such that

|𝒘̄^⊤𝒙 + 𝑏̄|² / (∥𝒘̄∥² + 𝑏̄²) > (𝜀 + |𝒘^⊤𝒙 + 𝑏|)² / (∥𝒘∥² + 𝑏²) · (𝒘^⊤𝒘̄ + 𝑏𝑏̄)² / ((∥𝒘∥² + 𝑏²)(∥𝒘̄∥² + 𝑏̄²))

it holds that

𝒙′ = 𝒙 − ℎ(𝒙) (𝜀 + |𝒘^⊤𝒙 + 𝑏|) / ∥𝒘∥² · 𝒘

is an adversarial example with perturbation 𝛿 = (𝜀 + |𝒘^⊤𝒙 + 𝑏|)/∥𝒘∥ to 𝒙.
The proof is left to the reader, see Exercise 16.19.
Let us now study two cases of linear classifiers, which allow for different types of adversarial examples. In the
following two examples, the ground-truth classifier 𝑔 : R^𝑑 → {−1, 1} is given by 𝑔(𝒙) = sign(𝒘̄^⊤𝒙) for 𝒘̄ ∈ R^𝑑 with
∥𝒘̄∥ = 1.
For the first example, we construct a Bayes classifier ℎ admitting adversarial examples in the complement of the
feature support. This corresponds to case (ii) in Section 16.2.
Example 16.6 Let D be the uniform distribution on

{(𝜆𝒘̄, 𝑔(𝜆𝒘̄)) | 𝜆 ∈ [−1, 1] \ {0}} ⊆ R^𝑑 × {−1, 1}.

The feature support equals

𝐷_𝒙 = {𝜆𝒘̄ | 𝜆 ∈ [−1, 1] \ {0}} ⊆ span{𝒘̄}.

Next fix 𝛼 ∈ (0, 1) and set 𝒘 := 𝛼𝒘̄ + √(1 − 𝛼²)𝒗 for some 𝒗 ∈ 𝒘̄^⊥ with ∥𝒗∥ = 1, so that ∥𝒘∥ = 1. We let ℎ(𝒙) := sign(𝒘^⊤𝒙).
We now show that every 𝒙 ∈ 𝐷_𝒙 satisfies the assumptions of Theorem 16.4, and therefore admits an adversarial example.
Note that ℎ(𝒙) = 𝑔(𝒙) for every 𝒙 ∈ 𝐷_𝒙. Hence ℎ is a Bayes classifier. Now fix 𝒙 ∈ 𝐷_𝒙. Then |𝒘^⊤𝒙| ≤ 𝛼|𝒘̄^⊤𝒙|, so
that (16.3.3) is satisfied. Furthermore, for every 𝜀 > 0 it holds that

𝛿 := (𝜀 + |𝒘^⊤𝒙|) / ∥𝒘∥ ≤ 𝜀 + 𝛼.


Fig. 16.2: Illustration of the two types of adversarial examples in Examples 16.6 and 16.7. In panel A) the feature support
𝐷_𝒙 corresponds to the dashed line. We depict the two decision boundaries DB_ℎ = {𝒙 | 𝒘^⊤𝒙 = 0} of ℎ(𝒙) = sign(𝒘^⊤𝒙)
and DB_𝑔 = {𝒙 | 𝒘̄^⊤𝒙 = 0} of 𝑔(𝒙) = sign(𝒘̄^⊤𝒙). Both ℎ and 𝑔 perfectly classify every data point in 𝐷_𝒙. One data point
𝒙 is shifted outside of the support of the distribution in a way that changes its label according to ℎ. This creates an
adversarial example 𝒙′. In panel B) the data distribution is globally supported. However, ℎ and 𝑔 do not coincide.
Thus the decision boundaries DB_ℎ and DB_𝑔 do not coincide. Moving data points across DB_ℎ can create adversarial
examples, as depicted by 𝒙 and 𝒙′.

Hence, for 𝜖 < |𝒘 ⊤ 𝒙| it holds by Theorem 16.4 that there exists an adversarial example with perturbation less than
𝜀 + 𝛼. For small 𝛼, the situation is depicted in the upper panel of Figure 16.2.
For the second example, we construct a distribution with global feature support and a classifier which is not a Bayes
classifier. This corresponds to case (iv) in Section 16.2.
Example 16.7 Let D_𝒙 be a distribution on R^𝑑 with positive Lebesgue density everywhere outside the decision boundary
DB_𝑔 = {𝒙 | 𝒘̄^⊤𝒙 = 0} of 𝑔. We define D to be the distribution of (𝑋, 𝑔(𝑋)) for 𝑋 ∼ D_𝒙. In addition, let 𝒘 ∉ {±𝒘̄},
∥𝒘∥ = 1 and ℎ(𝒙) = sign(𝒘^⊤𝒙). We exclude 𝒘 = −𝒘̄ because, in this case, every prediction of ℎ is wrong. Thus no
adversarial examples are possible.

By construction the feature support is given by 𝐷_𝒙 = R^𝑑. Moreover, ℎ^{−1}({±1}) and 𝑔^{−1}({±1}) are half spaces,
which implies, in the notation of (16.2.2), that

dist(𝐶±1 ∩ 𝐷 𝒙 , 𝐹±1 ∩ 𝐷 𝒙 ) = dist(𝐶±1 , 𝐹±1 ) = 0.

Hence, for every 𝛿 > 0 there is a positive probability of observing 𝒙 to which an adversarial example with perturbation
𝛿 exists.
The situation is depicted in the lower panel of Figure 16.2.

16.4 ReLU neural networks

So far we discussed classification by affine classifiers. A binary classifier based on a ReLU neural network is a function
R𝑑 ∋ 𝒙 ↦→ sign(Φ(𝒙)), where Φ is a ReLU neural network. As noted in [199], the arguments for affine classifiers, see
Proposition 16.5, can be applied to the affine pieces of Φ, to show existence of adversarial examples.
Consider a ground-truth classifier 𝑔 : R𝑑 → {−1, 0, 1}. For each 𝒙 ∈ R𝑑 we define the geometric margin of 𝑔 at 𝒙 as

𝜇𝑔 (𝒙) := dist(𝒙, 𝑔 −1 ({𝑔(𝒙)}) 𝑐 ),

i.e., as the distance of 𝒙 to the closest element that is classified differently from 𝒙 or the infimum over all distances
to elements from other classes if no closest element exists. Additionally, we denote the distance of 𝒙 to the closest
adjacent affine piece by
𝜈_Φ(𝒙) := dist(𝒙, 𝐴_{Φ,𝒙}^𝑐),

where 𝐴Φ, 𝒙 is the largest connected region on which Φ is affine and which contains 𝒙. We have the following theorem.

Theorem 16.8 Let Φ : R𝑑 → R and for 𝒙 ∈ R𝑑 let ℎ(𝒙) = sign(Φ(𝒙)). Denote by 𝑔 : R𝑑 → {−1, 0, 1} the ground-truth
classifier. Let 𝒙 ∈ R𝑑 and 𝜀 > 0 be such that 𝜈Φ (𝒙) > 0, 𝑔(𝒙) ≠ 0, ∇Φ(𝒙) ≠ 0 and

𝜇_𝑔(𝒙), 𝜈_Φ(𝒙) > (𝜀 + |Φ(𝒙)|) / ∥∇Φ(𝒙)∥.
Then
𝒙′ := 𝒙 − ℎ(𝒙) (𝜀 + |Φ(𝒙)|) / ∥∇Φ(𝒙)∥² · ∇Φ(𝒙)

is an adversarial example to 𝒙 with perturbation 𝛿 = (𝜀 + |Φ(𝒙)|)/∥∇Φ(𝒙) ∥.

Proof We show that 𝒙 ′ satisfies the properties in Definition 16.2.


By construction ∥𝒙 − 𝒙 ′ ∥ ≤ 𝛿. Since 𝜇𝑔 (𝒙) > 𝛿 it follows that 𝑔(𝒙) = 𝑔(𝒙 ′ ). Moreover, by assumption 𝑔(𝒙) ≠ 0,
and thus 𝑔(𝒙)𝑔(𝒙 ′ ) > 0.
It only remains to show that ℎ(𝒙 ′ ) ≠ ℎ(𝒙). Since 𝛿 < 𝜈Φ (𝒙), we have that Φ(𝒙) = ∇Φ(𝒙) ⊤ 𝒙 + 𝑏 and Φ(𝒙 ′ ) =
∇Φ(𝒙) ⊤ 𝒙 ′ + 𝑏 for some 𝑏 ∈ R. Therefore,
 
Φ(𝒙) − Φ(𝒙′) = ∇Φ(𝒙)^⊤(𝒙 − 𝒙′) = ∇Φ(𝒙)^⊤ ( ℎ(𝒙) (𝜀 + |Φ(𝒙)|)/∥∇Φ(𝒙)∥² · ∇Φ(𝒙) ) = ℎ(𝒙)(𝜀 + |Φ(𝒙)|).

Since ℎ(𝒙)|Φ(𝒙)| = Φ(𝒙) it follows that Φ(𝒙 ′ ) = −ℎ(𝒙)𝜀. Hence, ℎ(𝒙 ′ ) = −ℎ(𝒙), which completes the proof. □
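To make the construction concrete, the following Python sketch applies it to a small random one-hidden-layer ReLU network. The weights are arbitrary placeholders, and, as in the theorem, the sign flip is only guaranteed if 𝒙′ stays in the same affine piece as 𝒙 (i.e., 𝜈_Φ(𝒙) > 𝛿) and ∇Φ(𝒙) ≠ 0.

```python
import numpy as np

rng = np.random.default_rng(2)

# A small random ReLU network Phi: R^d -> R (placeholder weights, purely illustrative).
d, width = 10, 32
W0, b0 = rng.normal(size=(width, d)) / np.sqrt(d), rng.normal(size=width)
W1, b1 = rng.normal(size=(1, width)) / np.sqrt(width), rng.normal(size=1)

def phi_and_grad(x):
    z = W0 @ x + b0
    a = np.maximum(z, 0.0)
    out = (W1 @ a + b1).item()
    # On the affine piece containing x, Phi(y) = grad^T y + const.
    grad = ((W1 * (z > 0)) @ W0).ravel()
    return out, grad

x = rng.normal(size=d)
out, grad = phi_and_grad(x)
eps = 1e-3
h = np.sign(out)
x_adv = x - h * (eps + abs(out)) / np.linalg.norm(grad) ** 2 * grad
# The two printed signs agree whenever x_adv lies on the same affine piece as x.
print(np.sign(phi_and_grad(x_adv)[0]), -h)
```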

Remark 16.9 We look at the key parameters in Theorem 16.8 to understand which factors facilitate adversarial examples.

• The geometric margin of the ground-truth classifier 𝜇𝑔 (𝒙): To make the construction possible, we need to be
sufficiently far away from points that belong to a different class than 𝒙 or to the nonrelevant class.
• The distance to the next affine piece 𝜈Φ (𝒙): Since we are looking for an adversarial example within the same affine
piece as 𝒙, we need this piece to be sufficiently large.
• The perturbation 𝛿: The perturbation is given by (𝜀 + |Φ(𝒙)|)/∥∇Φ(𝒙) ∥, which depends on the classification
margin |Φ(𝒙)| of the ReLU classifier and its sensitivity to inputs ∥∇Φ(𝒙) ∥. For adversarial examples to be
possible, we either want a small classification margin of Φ or a high sensitivity of Φ to its inputs.

16.5 Robustness

Having established that adversarial examples can arise in various ways under mild assumptions, we now turn our
attention to conditions that prevent their existence.

16.5.1 Global Lipschitz regularity

We have repeatedly observed in the previous sections that a large value of ∥𝒘∥ for linear classifiers sign(𝒘 ⊤ 𝒙), or
∥∇Φ(𝒙) ∥ for ReLU classifiers sign(Φ(𝒙)), facilitates the occurrence of adversarial examples. Naturally, both these
values are upper bounded by the Lipschitz constant of the classifier’s inner functions 𝒙 ↦→ 𝒘 ⊤ 𝒙 and 𝒙 ↦→ Φ(𝒙).
Consequently, it was stipulated early on that bounding the Lipschitz constant of the inner functions could be an
effective measure against adversarial examples [199].
We have the following result for general classifiers of the form 𝒙 ↦→ sign(Φ(𝒙)).
Proposition 16.10 Let Φ : R𝑑 → R be 𝐶 𝐿 -Lipschitz with 𝐶 𝐿 > 0, and let 𝑠 > 0. Let ℎ(𝒙) = sign(Φ(𝒙)) be a classifier,
and let 𝑔 : R𝑑 → {−1, 0, 1} be a ground-truth classifier. Moreover, let 𝒙 ∈ R𝑑 be such that

Φ(𝒙)𝑔(𝒙) ≥ 𝑠. (16.5.1)

Then there does not exist an adversarial example to 𝒙 of perturbation 𝛿 < 𝑠/𝐶 𝐿 .
Proof Let 𝒙 ∈ R^𝑑 satisfy (16.5.1) and assume that ∥𝒙′ − 𝒙∥ ≤ 𝛿 < 𝑠/𝐶_𝐿. The Lipschitz continuity of Φ implies

|Φ(𝒙′) − Φ(𝒙)| ≤ 𝐶_𝐿𝛿 < 𝑠.

Since |Φ(𝒙)| ≥ 𝑠 we conclude that Φ(𝒙 ′ ) has the same sign as Φ(𝒙) which shows that 𝒙 ′ cannot be an adversarial
example to 𝒙. □
Remark 16.11 As we have seen in Lemma 13.2, we can bound the Lipschitz constant of ReLU neural networks by
restricting the magnitude and number of their weights and the number of layers.
There has been some criticism of results of this form, see, e.g., [89], since an assumption on the Lipschitz constant
may restrict the capabilities of the neural network too much. We next present a result showing under which assumptions
on the training set there exists a neural network that classifies the training set correctly but does not admit adversarial
examples within the training set.
Theorem 16.12 Let 𝑚 ∈ N, let 𝑔 : R^𝑑 → {−1, 0, 1} be a ground-truth classifier, and let (𝒙_𝑖, 𝑔(𝒙_𝑖))_{𝑖=1}^{𝑚} ∈ (R^𝑑 ×
{−1, 1})^𝑚. Assume that

sup_{𝑖≠𝑗} |𝑔(𝒙_𝑖) − 𝑔(𝒙_𝑗)| / ∥𝒙_𝑖 − 𝒙_𝑗∥ =: 𝑀̃ > 0.
Then there exists a ReLU neural network Φ with depth(Φ) = 𝑂 (log(𝑚)) and width(Φ) = 𝑂 (𝑑𝑚) such that for all
𝑖 = 1, . . . , 𝑚

sign(Φ(𝒙_𝑖)) = 𝑔(𝒙_𝑖)
and there is no adversarial example of perturbation 𝛿 = 1/𝑀̃ to 𝒙_𝑖.
Proof The result follows directly from Theorem 9.6 and Proposition 16.10. The reader is invited to complete the
argument in Exercise 16.20. □

16.5.2 Local regularity

One issue with upper bounds involving global Lipschitz constants such as those in Proposition 16.10, is that these
bounds may be quite large for deep neural networks. For example, the upper bound given in Lemma 13.2 is

∥Φ(𝒙) − Φ(𝒙′)∥_∞ ≤ 𝐶_𝜎^𝐿 · (𝐵𝑑_max)^{𝐿+1} ∥𝒙 − 𝒙′∥_∞,

which grows exponentially with the depth of the neural network. However, in practice this bound may be pessimistic,
and locally the neural network might have significantly smaller gradients than the global Lipschitz constant.
Because of this, it is reasonable to study results preventing adversarial examples under local Lipschitz bounds. Such
a result together with an algorithm providing bounds on the local Lipschitz constant was proposed in [78]. We state the
theorem adapted to our set-up.
Theorem 16.13 Let ℎ : R𝑑 → {−1, 1} be a classifier of the form ℎ(𝒙) = sign(Φ(𝒙)) and let 𝑔 : R𝑑 → {−1, 0, 1} be
the ground-truth classifier. Let 𝒙 ∈ R𝑑 satisfy 𝑔(𝒙) ≠ 0, and set

 

𝛼 := max_{𝑅>0} min{ Φ(𝒙)𝑔(𝒙) / sup_{∥𝒚−𝒙∥_∞ ≤ 𝑅, 𝒚≠𝒙} ( |Φ(𝒚) − Φ(𝒙)| / ∥𝒙 − 𝒚∥_∞ ), 𝑅 },     (16.5.2)

where the minimum is understood to be 𝑅 in case the supremum is zero. Then there are no adversarial examples to 𝒙
with perturbation 𝛿 < 𝛼.
Proof Let 𝒙 ∈ R𝑑 be as in the statement of the theorem. Assume, towards a contradiction, that for 0 < 𝛿 < 𝛼 satisfying
(16.5.2), there exists an adversarial example 𝒙 ′ to 𝒙 with perturbation 𝛿.
If the supremum in (16.5.2) is zero, then Φ is constant on a ball of radius 𝑅 around 𝒙. In particular for ∥𝒙 ′ −𝒙∥ ≤ 𝛿 < 𝑅
holds ℎ(𝒙 ′ ) = ℎ(𝒙) and 𝒙 ′ cannot be an adversarial example.
Now assume the supremum in (16.5.2) is not zero. It holds by (16.5.2), that
𝛿 < Φ(𝒙)𝑔(𝒙) / sup_{∥𝒚−𝒙∥_∞ ≤ 𝑅, 𝒚≠𝒙} ( |Φ(𝒚) − Φ(𝒙)| / ∥𝒙 − 𝒚∥_∞ ).     (16.5.3)

Moreover,

|Φ(𝒙′) − Φ(𝒙)| ≤ sup_{∥𝒚−𝒙∥_∞ ≤ 𝑅, 𝒚≠𝒙} ( |Φ(𝒚) − Φ(𝒙)| / ∥𝒙 − 𝒚∥_∞ ) · ∥𝒙 − 𝒙′∥_∞
≤ sup_{∥𝒚−𝒙∥_∞ ≤ 𝑅, 𝒚≠𝒙} ( |Φ(𝒚) − Φ(𝒙)| / ∥𝒙 − 𝒚∥_∞ ) · 𝛿 < Φ(𝒙)𝑔(𝒙),

where we applied (16.5.3) in the last line. It follows that

𝑔(𝒙)Φ(𝒙 ′ ) = 𝑔(𝒙)Φ(𝒙) + 𝑔(𝒙) (Φ(𝒙 ′ ) − Φ(𝒙))


≥ 𝑔(𝒙)Φ(𝒙) − |Φ(𝒙 ′ ) − Φ(𝒙)| > 0.

This rules out 𝒙 ′ as an adversarial example. □
The supremum in (16.5.2) is bounded by the Lipschitz constant of Φ on 𝐵 𝑅 (𝒙). Thus Theorem 16.13 depends
only on the local Lipschitz constant of Φ. One obvious criticism of this result is that the computation of (16.5.2) is
potentially prohibitive. We next show a different result, for which the assumptions can immediately be checked by
applying a simple algorithm that we present subsequently.
To state the following proposition, for a continuous function Φ : R^𝑑 → R, 𝒙 ∈ R^𝑑, and 𝛿 > 0 we define

𝑧^{𝛿,max} := max{Φ(𝒚) | ∥𝒚 − 𝒙∥_∞ ≤ 𝛿},     (16.5.4)
𝑧^{𝛿,min} := min{Φ(𝒚) | ∥𝒚 − 𝒙∥_∞ ≤ 𝛿}.     (16.5.5)

Proposition 16.14 Let ℎ : R^𝑑 → {−1, 1} be a classifier of the form ℎ(𝒙) = sign(Φ(𝒙)), let 𝑔 : R^𝑑 → {−1, 0, 1} be the
ground-truth classifier, and let 𝒙 be such that ℎ(𝒙) = 𝑔(𝒙). Then 𝒙 does not have an adversarial example of perturbation 𝛿 if 𝑧^{𝛿,max} 𝑧^{𝛿,min} > 0.
Proof The proof is immediate, since 𝑧 𝛿,max 𝑧 𝛿,min > 0 implies that all points in a 𝛿 neighborhood of 𝒙 are classified
the same. □
To apply Proposition 16.14, we only have to compute 𝑧^{𝛿,max} and 𝑧^{𝛿,min}. It turns out that if Φ is a neural network, then 𝑧^{𝛿,max},
𝑧^{𝛿,min} can be bounded by a computation similar to a forward pass of Φ. Denote by |𝑨| the matrix obtained by
taking the absolute value of each entry of the matrix 𝑨. Additionally, we define

𝑨+ = (| 𝑨| + 𝑨)/2 and 𝑨 − = (| 𝑨| − 𝑨)/2.

The idea behind Algorithm 2 is common in the area of neural network verification, see, e.g., [59, 54, 8, 212].

Algorithm 2 Compute Φ(𝒙), 𝑧 𝛿,max and 𝑧 𝛿,min for a given neural network.
Input: weight matrices 𝑾 (ℓ) ∈ R𝑑ℓ+1 ×𝑑ℓ and bias vectors 𝒃 (ℓ) ∈ R𝑑ℓ+1 for ℓ = 0, . . . , 𝐿 with 𝑑𝐿+1 = 1, monotonous activation function
𝜎, input vector 𝒙 ∈ R𝑑0 , neighborhood size 𝛿 > 0
Output: Bounds for 𝑧 𝛿,max and 𝑧 𝛿,min

𝒙 (0) = 𝒙
𝛿 (0) ,up = 𝛿1 ∈ R𝑑0
𝛿 (0) ,low = 𝛿1 ∈ R𝑑0
for ℓ : 0 to 𝐿 − 1 do
𝒙 (ℓ+1) = 𝜎 (𝑾 (ℓ) 𝒙 (ℓ) + 𝒃 (ℓ) )
𝛿 (ℓ+1) ,up = 𝜎 (𝑾 (ℓ) 𝒙 (ℓ) + (𝑾 (ℓ) ) + 𝛿 (ℓ) ,up + (𝑾 (ℓ) ) − 𝛿 (ℓ) ,low + 𝒃 (ℓ) ) − 𝒙 (ℓ+1)
𝛿 (ℓ+1) ,low = 𝒙 (ℓ+1) − 𝜎 (𝑾 (ℓ) 𝒙 (ℓ) − (𝑾 (ℓ) ) + 𝛿 (ℓ) ,low − (𝑾 (ℓ) ) − 𝛿 (ℓ) ,up + 𝒃 (ℓ) )
end for
𝒙 ( 𝐿+1) = 𝑾 ( 𝐿) 𝒙 ( 𝐿) + 𝒃 ( 𝐿)
𝛿 ( 𝐿+1) ,up = (𝑾 ( 𝐿) ) + 𝛿 ( 𝐿) ,up + (𝑾 ( 𝐿) ) − 𝛿 (𝐿) ,low
𝛿 ( 𝐿+1) ,low = (𝑾 (𝐿) ) + 𝛿 ( 𝐿) ,low + (𝑾 ( 𝐿) ) − 𝛿 (𝐿) ,up
return 𝒙 ( 𝐿+1) , 𝒙 ( 𝐿+1) + 𝛿 ( 𝐿+1) ,up , 𝒙 ( 𝐿+1) − 𝛿 (𝐿+1) ,low
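A direct transcription of Algorithm 2 into Python could look as follows. This is only a sketch under the assumption that the weight matrices and bias vectors are given as numpy arrays and that 𝜎 is applied componentwise and is monotonically increasing; all function and variable names are ours.

```python
import numpy as np

def interval_bounds(weights, biases, sigma, x, delta):
    """Transcription of Algorithm 2: returns Phi(x), an upper bound on z^{delta,max},
    and a lower bound on z^{delta,min}."""
    up = np.full_like(x, delta, dtype=float)
    low = np.full_like(x, delta, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        Wp, Wm = np.maximum(W, 0.0), np.maximum(-W, 0.0)   # W^+ = (|W|+W)/2, W^- = (|W|-W)/2
        x_new = sigma(W @ x + b)
        up_new = sigma(W @ x + Wp @ up + Wm @ low + b) - x_new
        low_new = x_new - sigma(W @ x - Wp @ low - Wm @ up + b)
        x, up, low = x_new, up_new, low_new
    W, b = weights[-1], biases[-1]
    Wp, Wm = np.maximum(W, 0.0), np.maximum(-W, 0.0)
    out = W @ x + b
    # If out - (Wp @ low + Wm @ up) > 0 or out + Wp @ up + Wm @ low < 0, then by
    # Propositions 16.14 and 16.16 there is no adversarial example of perturbation delta.
    return out, out + Wp @ up + Wm @ low, out - (Wp @ low + Wm @ up)
```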

Remark 16.15 Up to constants, Algorithm 2 has the same computational complexity as a forward pass, also see
Algorithm 1. In addition, in contrast to upper bounds based on estimating the global Lipschitz constant of Φ via its
weights, the upper bounds found via Algorithm 2 include the effect of the activation function 𝜎. For example, if 𝜎 is
the ReLU, then we may often end up in a situation, where 𝛿 (ℓ ),up or 𝛿 (ℓ ),low can have many entries that are 0. If an
entry of 𝑾 (ℓ ) 𝒙 (ℓ ) + 𝒃 (ℓ ) is nonpositive, then it is guaranteed that the associated entry in 𝛿 (ℓ ),low will be zero. Similarly,
if 𝑾 (ℓ ) has only few positive entries, then most of the entries of 𝛿 (ℓ ),up are not propagated to 𝛿 (ℓ+1),up .
Next, we prove that Algorithm 2 indeed produces sensible output.
Proposition 16.16 Let Φ be a neural network with weight matrices 𝑾 (ℓ ) ∈ R𝑑ℓ+1 ×𝑑ℓ and bias vectors 𝒃 (ℓ ) ∈ R𝑑ℓ+1 for
ℓ = 0, . . . , 𝐿, and a monotonically increasing activation function 𝜎.

Let 𝒙 ∈ R𝑑 . Then the output of Algorithm 2 satisfies

𝒙 𝐿+1 + 𝛿 (𝐿+1),up > 𝑧 𝛿,max and 𝒙 𝐿+1 − 𝛿 (𝐿+1),low < 𝑧 𝛿,min .

Proof Fix 𝒚, 𝒙 ∈ R𝑑 with ∥ 𝒚 − 𝒙∥ ∞ ≤ 𝛿 and let 𝒚 (ℓ ) , 𝒙 (ℓ ) for ℓ = 0, . . . , 𝐿 + 1 be as in Algorithm 2 applied to 𝒚,


𝒙, respectively. Moreover, let 𝛿ℓ,up , 𝛿ℓ,low for ℓ = 0, . . . , 𝐿 + 1 be as in Algorithm 2 applied to 𝒙. We will prove by
induction over ℓ = 0, . . . , 𝐿 + 1 that

𝒚 (ℓ ) − 𝒙 (ℓ ) ≤ 𝛿ℓ,up and 𝒙 (ℓ ) − 𝒚 (ℓ ) ≤ 𝛿ℓ,low , (16.5.6)

where the inequalities are understood entry-wise for vectors. Since 𝒚 was arbitrary this then proves the result.
The case ℓ = 0 follows immediately from ∥ 𝒚 − 𝒙∥ ∞ ≤ 𝛿. Assume now, that the statement was shown for ℓ < 𝐿. We
have that

𝒚 (ℓ+1) − 𝒙 (ℓ+1) − 𝛿ℓ+1,up =𝜎(𝑾 (ℓ ) 𝒚 (ℓ ) + 𝒃 (ℓ ) )


− 𝜎 𝑾 (ℓ ) 𝒙 (ℓ ) + (𝑾 (ℓ ) ) + 𝛿 (ℓ ),up + (𝑾 (ℓ ) ) − 𝛿 (ℓ ),low + 𝒃 (ℓ ) .


The monotonicity of 𝜎 implies that


𝒚 (ℓ+1) − 𝒙 (ℓ+1) ≤ 𝛿ℓ+1,up
if

𝑾 (ℓ ) 𝒚 (ℓ ) ≤ 𝑾 (ℓ ) 𝒙 (ℓ ) + (𝑾 (ℓ ) ) + 𝛿 (ℓ ),up + (𝑾 (ℓ ) ) − 𝛿 (ℓ ),low . (16.5.7)

To prove (16.5.7), we observe that

𝑾 (ℓ ) ( 𝒚 (ℓ ) − 𝒙 (ℓ ) ) = (𝑾 (ℓ ) ) + ( 𝒚 (ℓ ) − 𝒙 (ℓ ) ) − (𝑾 (ℓ ) ) − ( 𝒚 (ℓ ) − 𝒙 (ℓ ) )
= (𝑾 (ℓ ) ) + ( 𝒚 (ℓ ) − 𝒙 (ℓ ) ) + (𝑾 (ℓ ) ) − (𝒙 (ℓ ) − 𝒚 (ℓ ) )
≤ (𝑾 (ℓ ) ) + 𝛿 (ℓ ),up + (𝑾 (ℓ ) ) − 𝛿 (ℓ ),low ,

where we used the induction assumption in the last line. This shows the first estimate in (16.5.6). Similarly,

𝒙 (ℓ+1) − 𝒚 (ℓ+1) − 𝛿ℓ+1,low


= 𝜎(𝑾 (ℓ ) 𝒙 (ℓ ) − (𝑾 (ℓ ) ) + 𝛿 (ℓ ) ,low − (𝑾 (ℓ ) ) − 𝛿 (ℓ ),up + 𝒃 (ℓ ) ) − 𝜎(𝑾 (ℓ ) 𝒚 (ℓ ) + 𝒃 (ℓ ) ).

Hence, 𝒙 (ℓ+1) − 𝒚 (ℓ+1) ≤ 𝛿ℓ+1,low if

𝑾 (ℓ ) 𝒚 (ℓ ) ≥ 𝑾 (ℓ ) 𝒙 (ℓ ) − (𝑾 (ℓ ) ) + 𝛿 (ℓ ),low − (𝑾 (ℓ ) ) − 𝛿 (ℓ ),up . (16.5.8)

To prove (16.5.8), we observe that

𝑾 (ℓ ) (𝒙 (ℓ ) − 𝒚 (ℓ ) ) = (𝑾 (ℓ ) ) + (𝒙 (ℓ ) − 𝒚 (ℓ ) ) − (𝑾 (ℓ ) ) − (𝒙 (ℓ ) − 𝒚 (ℓ ) )
= (𝑾 (ℓ ) ) + (𝒙 (ℓ ) − 𝒚 (ℓ ) ) + (𝑾 (ℓ ) ) − (𝒚 (ℓ ) − 𝒙 (ℓ ) )
≤ (𝑾 (ℓ ) ) + 𝛿 (ℓ ),low + (𝑾 (ℓ ) ) − 𝛿 (ℓ ),up ,

where we used the induction assumption in the last line. This completes the proof of (16.5.6) for all ℓ ≤ 𝐿.
The case ℓ = 𝐿 + 1 follows by the same argument, but replacing 𝜎 by the identity. □

Exercises

Exercise 16.17 Prove (16.3.2) by comparing the volume of the 𝑑-dimensional Euclidean unit ball with the volume of
the 𝑑-dimensional 1-ball of radius 𝑐 for a given 𝑐 > 0.

Exercise 16.18 Fix 𝛿 > 0. For a pair of classifiers ℎ and 𝑔 such that 𝐶1 ∪ 𝐶−1 = ∅ in (16.2.2), there trivially cannot
exist any adversarial examples. Construct an example, of ℎ, 𝑔, D such that 𝐶1 , 𝐶−1 ≠ ∅, ℎ is not a Bayes classifier, and
𝑔 is such that no adversarial examples with a perturbation 𝛿 exist.
Is this also possible if 𝑔 −1 (0) = ∅?

Exercise 16.19 Prove Proposition 16.5. Hint: Repeat the proof of Theorem 16.4. In the first part set 𝒙^{(ext)} = (𝒙, 1),
𝒘^{(ext)} = (𝒘, 𝑏) and 𝒘̄^{(ext)} = (𝒘̄, 𝑏̄). Then show that ℎ(𝒙′) ≠ ℎ(𝒙) by plugging in the definition of 𝒙′.

Exercise 16.20 Complete the proof of Theorem 16.12.

Bibliography and further reading

This chapter starts with the foundational paper [199], but it should be remarked that adversarial examples have been
studied for non-deep-learning models in machine learning before [88].
The results in this chapter are inspired by results in the literature, even though they may not be found in precisely
the form stated here. The setup is inspired by [199] and the explanation via high-dimensionality of the data given in
Section 16.3 was first formulated in [199] and [65]. The formalism reviewed in Section 16.2 is inspired by [194]. The
results on robustness via local Lipschitz properties are due to [78]. Algorithm 2 is covered by results in the area of
network verifiability [59, 54, 8, 212].
For a more comprehensive overview of modern approaches, we refer to the survey article [171].
Important directions not discussed in detail in this chapter are the transferability of adversarial examples, defense
mechanisms, and alternative adversarial operations. Transferability refers to the phenomenon that adversarial examples
for one model often work as well for different models, [149] [133]. Various defense mechanisms, i.e., ways of specifically
training a neural network to prevent adversarial examples, have been introduced. Examples include the Fast Gradient
Sign Method of [65]. However, also more sophisticated approaches have been recently developed, e.g., [29]. Finally,
adding a perturbation is not the only way to produce adversarial examples: in [1, 217], it was shown that images can
also be smoothly transformed to fool classifiers.

Appendix A
Probability theory

This chapter provides some basic notions and results in probability theory required in the main text. It is intended as
a revision for a reader already familiar with these concepts. For more details and proofs, we refer for example to the
standard textbook [102].

A.1 Sigma-algebras and measures

Let Ω be a set, and denote by 2Ω the powerset of Ω.

Definition A.1 A subset 𝔄 ⊆ 2Ω is called a sigma-algebra1 on Ω if it satisfies


(i) Ω ∈ 𝔄,
(ii) 𝐴^𝑐 ∈ 𝔄 whenever 𝐴 ∈ 𝔄,
(iii) ∪_{𝑖∈N} 𝐴_𝑖 ∈ 𝔄 whenever 𝐴_𝑖 ∈ 𝔄 for all 𝑖 ∈ N.

For a sigma-algebra 𝔄 on Ω, the tuple (Ω, 𝔄) is also referred to as a measurable space. For a measurable space, a
subset 𝐴 ⊆ Ω is called measurable, if 𝐴 ∈ 𝔄. Measurable sets are also called events.
Recall that another key system of subsets of Ω is that of a topology. A topology 𝔗 ⊆ 2Ω is a subset of the powerset
satisfying:
(i) ∅, Ω ∈ 𝔗,
(ii) ∩_{𝑗=1}^{𝑛} 𝑂_𝑗 ∈ 𝔗 whenever 𝑛 ∈ N and 𝑂_1, . . . , 𝑂_𝑛 ∈ 𝔗,
(iii) ∪_{𝑖∈𝐼} 𝑂_𝑖 ∈ 𝔗 whenever 𝑂_𝑖 ∈ 𝔗 for all 𝑖 in an index set 𝐼.
For a topological space (Ω, 𝔗), a set 𝑂 ⊆ Ω is called open if and only if 𝑂 ∈ 𝔗.

Remark A.2 The two notions differ in that a topology allows for unions of arbitrary (possibly uncountably many) sets,
but only for finite intersection, whereas a sigma-algebra allows for countable unions and intersections.

Example A.3 Let 𝑑 ∈ N and denote by 𝐵 𝜀 (𝒙) = {𝒚 ∈ R𝑑 | ∥ 𝒚 − 𝒙∥ < 𝜀} the set of points whose Euclidean distance to
𝒙 is less than 𝜀. Then for every 𝐴 ⊆ R𝑑 , the smallest topology on 𝐴 containing 𝐴 ∩ 𝐵 𝜀 (𝒙) for all 𝜀 > 0, 𝒙 ∈ R𝑑 , is
called the Euclidean topology on 𝐴.

If (Ω, 𝔗) is a topological space, then the Borel sigma-algebra refers to the smallest sigma-algebra on Ω containing
all open sets, i.e. all elements of 𝔗. Throughout this book, subsets of R𝑑 are always understood to be equipped with
the Euclidean topology and the Borel sigma-algebra. The Borel sigma-algebra on R𝑑 is denoted by 𝔅𝑑 .
We can now introduce measures.

1 We use this notation instead of the more common “𝜎-algebra” to avoid confusion with the activation function 𝜎.

Definition A.4 Let (Ω, 𝔄) be a measurable space. A mapping 𝜇 : 𝔄 → [0, ∞] is called a measure if it satisfies
(i) 𝜇(∅) = 0,
(ii) for every sequence (𝐴_𝑖)_{𝑖∈N} ⊆ 𝔄 such that 𝐴_𝑖 ∩ 𝐴_𝑗 = ∅ whenever 𝑖 ≠ 𝑗, it holds
𝜇(∪_{𝑖∈N} 𝐴_𝑖) = ∑_{𝑖∈N} 𝜇(𝐴_𝑖).

We say that the measure is finite if 𝜇(Ω) < ∞, and it is sigma-finite if there exists a sequence (𝐴_𝑖)_{𝑖∈N} ⊆ 𝔄 such that
Ω = ∪_{𝑖∈N} 𝐴_𝑖 and 𝜇(𝐴_𝑖) < ∞ for all 𝑖 ∈ N. In case 𝜇(Ω) = 1, the measure is called a probability measure.

Example A.5 One can show that there exists a unique measure 𝜆 on (R𝑑 , 𝔅𝑑 ), such that for all sets of the type
×_{𝑖=1}^{𝑑} [𝑎_𝑖, 𝑏_𝑖) with −∞ < 𝑎_𝑖 ≤ 𝑏_𝑖 < ∞ holds

𝜆(×_{𝑖=1}^{𝑑} [𝑎_𝑖, 𝑏_𝑖)) = ∏_{𝑖=1}^{𝑑} (𝑏_𝑖 − 𝑎_𝑖).

This measure is called the Lebesgue measure.

If 𝜇 is a measure on the measurable space (Ω, 𝔄), then the triplet (Ω, 𝔄, 𝜇) is called a measure space. In case 𝜇 is
a probability measure, it is called a probability space.
Let (Ω, 𝔄, 𝜇) be a measure space. A subset 𝑁 ⊆ Ω is called a null-set, if 𝑁 is measurable and 𝜇(𝑁) = 0. Moreover,
an equality or inequality is said to hold 𝜇-almost everywhere or 𝜇-almost surely, if it is satisfied on the complement
of a null-set. In case 𝜇 is clear from context, we simply write “almost everywhere” or “almost surely” instead.

A.2 Random variables

A.2.1 Measurability of functions

To define random variables, we first need to recall the measurability of functions.


Definition A.6 Let (Ω1 , 𝔄1 ) and (Ω2 , 𝔄2 ) be two measurable spaces. A function 𝑓 : Ω1 → Ω2 is called measurable
if
𝑓 −1 ( 𝐴2 ) := {𝜔 ∈ Ω1 | 𝑓 (𝜔) ∈ 𝐴2 } ∈ 𝔄1 for all 𝐴2 ∈ 𝔄2 .
A mapping 𝑋 : Ω1 → Ω2 is called a Ω2 -valued random variable if it is measurable.

Remark A.7 We again point out the parallels to topological spaces: A function 𝑓 : Ω1 → Ω2 between two topological
spaces (Ω1 , 𝔗 1 ) and (Ω2 , 𝔗 2 ) is called continuous if 𝑓 −1 (𝑂 2 ) ∈ 𝔗 1 for all 𝑂 2 ∈ 𝔗 2 .

Let Ω1 be a set and let (Ω2 , 𝔄2 ) be a measurable space. For 𝑋 : Ω1 → Ω2 , we can ask for the smallest sigma-algebra
𝔄 𝑋 on Ω1 , such that 𝑋 is measurable as a mapping from (Ω1 , 𝔄 𝑋 ) to (Ω2 , 𝔄2 ). Clearly, for every sigma-algebra 𝔄1 on
Ω1 , 𝑋 is measurable as a mapping from (Ω1 , 𝔄1 ) to (Ω2 , 𝔄2 ) if and only if every 𝐴 ∈ 𝔄 𝑋 belongs to 𝔄1 ; or in other
words, 𝔄 𝑋 is a sub sigma-algebra of 𝔄1 . It is easy to check that 𝔄 𝑋 is given through the following definition.

Definition A.8 Let 𝑋 : Ω1 → Ω2 be a random variable. Then

𝔄 𝑋 := {𝑋 −1 ( 𝐴2 ) | 𝐴2 ∈ 𝔄2 } ⊆ 2Ω1

is the sigma-algebra induced by 𝑋 on Ω1 .

A.2.2 Distribution and expectation

Now let (Ω_1, 𝔄_1, P) be a probability space, let (Ω_2, 𝔄_2) be a measurable space, and let 𝑋 : Ω_1 → Ω_2 be a random variable. Then 𝑋 naturally induces a measure on (Ω_2, 𝔄_2) via
P_𝑋(𝐴_2) := P[𝑋^{−1}(𝐴_2)] for all 𝐴_2 ∈ 𝔄_2.
Note that due to the measurability of 𝑋 it holds 𝑋 −1 ( 𝐴2 ) ∈ 𝔄1 , so that P𝑋 is well-defined.
Definition A.9 The measure P_𝑋 is called the distribution of 𝑋. If (Ω_2, 𝔄_2) = (R^𝑑, 𝔅_𝑑), and there exists a function
𝑓_𝑋 : R^𝑑 → R such that
P_𝑋[𝐴] = ∫_𝐴 𝑓_𝑋(𝒙) d𝒙 for all 𝐴 ∈ 𝔅_𝑑,
then 𝑓_𝑋 is called the (Lebesgue) density of 𝑋.

Remark A.10 The term distribution is often used without specifying an underlying probability space and random
variable. In this case, “distribution” stands interchangeably for “probability measure”. For example, 𝜇 is a distribution
on Ω2 states that 𝜇 is a probability measure on the measurable space (Ω2 , 𝔄2 ). In this case, there always exists a
probability space (Ω1 , 𝔄1 , P) and a random variable 𝑋 : Ω1 → Ω2 such that P𝑋 = 𝜇; namely (Ω1 , 𝔄1 , P) = (Ω2 , 𝔄2 , 𝜇)
and 𝑋 (𝜔) = 𝜔.

Example A.11 Some important distributions include the following.


• Bernoulli distribution: A random variable 𝑋 : Ω → {0, 1} is Bernoulli distributed if there exists 𝑝 ∈ [0, 1] such
that P[𝑋 = 1] = 𝑝 and P[𝑋 = 0] = 1 − 𝑝,
• uniform distribution: A random variable 𝑋 : Ω → R𝑑 is uniformly distributed on a measurable set 𝐴 ∈ 𝔅𝑑 , if its
density equals
𝑓_𝑋(𝒙) = 1_𝐴(𝒙) / |𝐴|,
where | 𝐴| < ∞ is the Lebesgue measure of 𝐴.
• Gaussian distribution: A random variable 𝑋 : Ω → R^𝑑 is Gaussian distributed with mean 𝒎 ∈ R^𝑑 and regular
covariance matrix 𝑪 ∈ R^{𝑑×𝑑}, if its density equals

𝑓_𝑋(𝒙) = (2𝜋)^{−𝑑/2} det(𝑪)^{−1/2} exp( −(1/2)(𝒙 − 𝒎)^⊤𝑪^{−1}(𝒙 − 𝒎) ).

We denote this distribution by N(𝒎, 𝑪).

Let (Ω, 𝔄, P) be a probability space, and let 𝑋 : Ω → R𝑑 be an R𝑑 -valued random variable. We then call the
Lebesgue integral ∫ ∫
E[𝑋] := 𝑋 (𝜔) dP(𝜔) = 𝒙 dP𝑋 (𝒙)
Ω R𝑑

the expectation of 𝑋. Moreover, for 𝑘 ∈ N we say that 𝑋 has finite 𝑘-th moment if E[∥ 𝑋 ∥ 𝑘 ] < ∞. Similarly, for a
probability measure 𝜇 on R^𝑑 and 𝑘 ∈ N, we say that 𝜇 has finite 𝑘-th moment if
∫_{R^𝑑} ∥𝒙∥^𝑘 d𝜇(𝒙) < ∞.

Furthermore, the matrix ∫


(𝑋 (𝜔) − E[𝑋]) (𝑋 (𝜔) − E[𝑋]) ⊤ dP(𝜔) ∈ R𝑑×𝑑
Ω

is the covariance of 𝑋 : Ω → R𝑑 . For 𝑑 = 1, it is called the variance of 𝑋 and denoted by V[𝑋].


Finally, we recall different variants of convergence for random variables.

Definition A.12 Let (Ω, 𝔄, P) be a probability space, and let 𝑋 𝑗 : Ω → R𝑑 , 𝑗 ∈ N, be a sequence of random variables
and let 𝑋 : Ω → R𝑑 also be a random variable. The sequence is said to
(i) converge almost surely to 𝑋, if
P[{𝜔 ∈ Ω | lim_{𝑗→∞} 𝑋_𝑗(𝜔) = 𝑋(𝜔)}] = 1,
(ii) converge in probability to 𝑋, if for all 𝜀 > 0
lim_{𝑗→∞} P[{𝜔 ∈ Ω | |𝑋_𝑗(𝜔) − 𝑋(𝜔)| > 𝜀}] = 0,

(iii) converge weakly to 𝑋, if for all bounded continuous functions 𝑓 : R𝑑 → R holds

lim E[ 𝑓 ◦ 𝑋 𝑗 ] = E[ 𝑓 ◦ 𝑋].
𝑗→∞

The notions in Definition A.12 are ordered by decreasing strength, i.e. almost sure convergence implies convergence
in probability, and convergence in probability implies weak convergence. Since E[𝑓 ◦ 𝑋] = ∫_{R^𝑑} 𝑓(𝑥) dP_𝑋(𝑥), the notion
of weak convergence only depends on the distribution P_𝑋 of 𝑋. We thus also say that a sequence of random variables
converges weakly towards a measure 𝜇.

A.3 Conditionals, marginals, and independence

In this section, we concentrate on R𝑑 -valued random variables, although the following concepts can be extended to
more general spaces.

A.3.1 Joint and marginal distribution

Let again (Ω, 𝔄, P) be a probability space, and let 𝑋 : Ω → R𝑑𝑋 , 𝑌 : Ω → R𝑑𝑌 be two random variables. Then

𝑍 := (𝑋, 𝑌 ) : Ω → R𝑑𝑋 +𝑑𝑌

is also a random variable. Its distribution P 𝑍 is a measure on the measurable space (R𝑑𝑋 +𝑑𝑌 , 𝔅𝑑𝑋 +𝑑𝑌 ), and P 𝑍 is
referred to as the joint distribution of 𝑋 and 𝑌 . On the other hand, P𝑋 , P𝑌 are called the marginal distributions of
𝑋, 𝑌 . It is important to note that

P_𝑋[𝐴] = P_𝑍[𝐴 × R^{𝑑_𝑌}] for all 𝐴 ∈ 𝔅_{𝑑_𝑋},

and similarly for P_𝑌. Thus the marginals P_𝑋, P_𝑌 can be constructed from the joint distribution P_𝑍. In turn, knowledge
of the marginals is not sufficient to construct the joint distribution.

A.3.2 Independence

The concept of independence serves to formalize the situation, where knowledge of one random variable provides no
information about another random variable. We first give the formal definition, and afterwards discuss the roll of a die
as a simple example.

Definition A.13 Let (Ω, 𝔄, P) be a probability space. Then two events 𝐴, 𝐵 ∈ 𝔄 are called independent if

P[ 𝐴 ∩ 𝐵] = P[ 𝐴]P[𝐵].

Two random variables 𝑋 : Ω → R𝑑𝑋 and 𝑌 : Ω → R𝑑𝑌 are called independent, if

𝐴, 𝐵 are independent for all 𝐴 ∈ 𝔄 𝑋 , 𝐵 ∈ 𝔄𝑌 .

Two random variables are thus independent, if and only if all events in their induced sigma-algebras are independent.
This turns out to be equivalent to the joint distribution P (𝑋,𝑌 ) being equal to the product measure P𝑋 ⊗ P𝑌 ; the latter is
characterized as the unique measure 𝜇 on R𝑑𝑋 +𝑑𝑌 satisfying 𝜇( 𝐴 × 𝐵) = P𝑋 [ 𝐴]P𝑌 [𝐵] for all 𝐴 ∈ 𝔅𝑑𝑥 , 𝐵 ∈ 𝔅𝑑𝑌 .

Example A.14 Let Ω = {1, . . . , 6} represent the outcomes of rolling a fair die, let 𝔄 = 2Ω be the sigma-algebra, and let
P[𝜔] = 1/6 for all 𝜔 ∈ Ω. Consider the three random variables

𝑋_1(𝜔) = 0 if 𝜔 is odd and 𝑋_1(𝜔) = 1 if 𝜔 is even,
𝑋_2(𝜔) = 0 if 𝜔 ≤ 3 and 𝑋_2(𝜔) = 1 if 𝜔 ≥ 4,
𝑋_3(𝜔) = 0 if 𝜔 ∈ {1, 2}, 𝑋_3(𝜔) = 1 if 𝜔 ∈ {3, 4}, and 𝑋_3(𝜔) = 2 if 𝜔 ∈ {5, 6}.


These random variables can be interpreted as follows:
• 𝑋1 indicates whether the roll yields an odd or even number.
• 𝑋2 indicates whether the roll yields a number at most 3 or at least 4.
• 𝑋3 categorizes the roll into one of the groups {1, 2}, {3, 4} or {5, 6}.
The induced sigma-algebras are

𝔄 𝑋1 = {∅, Ω, {1, 3, 5}, {2, 4, 6}}


𝔄 𝑋2 = {∅, Ω, {1, 2, 3}, {4, 5, 6}}
𝔄 𝑋3 = {∅, Ω, {1, 2}, {3, 4}, {5, 6}, {1, 2, 3, 4}, {1, 2, 5, 6}, {3, 4, 5, 6}}.

We leave it to the reader to formally check that 𝑋1 and 𝑋2 are not independent, but 𝑋1 and 𝑋3 are independent. This
reflects the fact that, for example, knowing the outcome to be odd, makes it more likely that the number belongs to
{1, 2, 3} rather than {4, 5, 6}. However, this knowledge provides no information on the three categories {1, 2}, {3, 4},
and {5, 6}.
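Since Ω is finite, this check can also be automated. The following short Python sketch (purely illustrative) enumerates all events of the form {𝑋 = 𝑎, 𝑌 = 𝑏} and confirms the two claims.

```python
from fractions import Fraction

Omega = range(1, 7)
P = {w: Fraction(1, 6) for w in Omega}
X1 = lambda w: 0 if w % 2 == 1 else 1
X2 = lambda w: 0 if w <= 3 else 1
X3 = lambda w: {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}[w]

def independent(X, Y):
    # For discrete random variables, independence is equivalent to
    # P[X = a, Y = b] = P[X = a] P[Y = b] for all values a, b.
    vals = lambda Z: {Z(w) for w in Omega}
    prob = lambda pred: sum(P[w] for w in Omega if pred(w))
    return all(
        prob(lambda w: X(w) == a and Y(w) == b) == prob(lambda w: X(w) == a) * prob(lambda w: Y(w) == b)
        for a in vals(X) for b in vals(Y)
    )

print(independent(X1, X2))  # False
print(independent(X1, X3))  # True
```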

If 𝑋 : Ω → R, 𝑌 : Ω → R are two independent random variables, then, due to P_{(𝑋,𝑌)} = P_𝑋 ⊗ P_𝑌,

E[𝑋𝑌] = ∫_Ω 𝑋(𝜔)𝑌(𝜔) dP(𝜔) = ∫_{R²} 𝑥𝑦 dP_{(𝑋,𝑌)}(𝑥, 𝑦) = ∫_R 𝑥 dP_𝑋(𝑥) ∫_R 𝑦 dP_𝑌(𝑦) = E[𝑋]E[𝑌].

Using this observation, it is easy to see that for a sequence (𝑋_𝑖)_{𝑖=1}^{𝑛} of independent R-valued random variables with
bounded second moments, there holds Bienaymé's identity

V[ ∑_{𝑖=1}^{𝑛} 𝑋_𝑖 ] = ∑_{𝑖=1}^{𝑛} V[𝑋_𝑖].     (A.3.1)
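As a quick numerical illustration of (A.3.1), the following Python sketch compares both sides of the identity by Monte Carlo sampling for three independent random variables; the distributions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
# Three independent random variables with different (arbitrary) distributions.
samples = np.stack([rng.exponential(2.0, 100_000),
                    rng.normal(1.0, 3.0, 100_000),
                    rng.uniform(-1.0, 1.0, 100_000)])
lhs = samples.sum(axis=0).var()   # empirical V[X_1 + X_2 + X_3]
rhs = samples.var(axis=1).sum()   # empirical V[X_1] + V[X_2] + V[X_3]
print(lhs, rhs)                   # approximately equal
```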

A.3.3 Conditional distributions

Let (Ω, 𝔄, P) be a probability space, and let 𝐴, 𝐵 ∈ 𝔄 be two events. In case P[𝐵] > 0, we define

P[𝐴|𝐵] := P[𝐴 ∩ 𝐵] / P[𝐵],     (A.3.2)

and call P[ 𝐴|𝐵] the conditional probability of 𝐴 given 𝐵.


Example A.15 Consider the setting of Example A.14. Let 𝐴 = {𝜔 ∈ Ω | 𝑋1 (𝜔) = 0} be the event that the outcome of
the die roll was an odd number and let 𝐵 = {𝜔 ∈ Ω | 𝑋2 (𝜔) = 0} be the event that the outcome yielded a number at
most 3. Then P[𝐵] = 1/2, and P[ 𝐴 ∩ 𝐵] = 1/3. Thus

P[𝐴|𝐵] = P[𝐴 ∩ 𝐵] / P[𝐵] = (1/3)/(1/2) = 2/3.

This reflects that, given we know the outcome to be at most 3, the probability of the number being odd, i.e. in {1, 3},
is larger than the probability of the number being even, i.e. equal to 2.
The conditional probability in (A.3.2) is only well-defined if P[𝐵] > 0. In practice, we often encounter the case
where we would like to condition on an event of probability zero.
Example A.16 Consider the following procedure: We first draw a random number 𝑝 ∈ [0, 1] according to a uniform
distribution on [0, 1]. Afterwards we draw a random number 𝑋 ∈ {0, 1} according to a 𝑝-Bernoulli distribution, i.e.
P[𝑋 = 1] = 𝑝 and P[𝑋 = 0] = 1− 𝑝. Then ( 𝑝, 𝑋) is a joint random variable on [0, 1] ×{0, 1}. What is P[𝑋 = 1| 𝑝 = 0.5]
in this case? Intuitively, it should be 1/2, but note that P[ 𝑝 = 0.5] = 0, so that (A.3.2) is not meaningful here.
Definition A.17 (regular conditional distribution) Let (Ω, 𝔄, P) be a probability space, and let 𝑋 : Ω → R𝑑𝑋 and
𝑌 : Ω → R𝑑𝑌 be two random variables. Let 𝜏𝑋 |𝑌 : 𝔅𝑑𝑋 × R𝑑𝑌 → [0, 1] satisfy
(i) 𝑦 ↦→ 𝜏𝑋 |𝑌 ( 𝐴, 𝑦) : R𝑑𝑌 → [0, 1] is measurable for every fixed 𝐴 ∈ 𝔅𝑑𝑋 ,
(ii) 𝐴 ↦→ 𝜏_{𝑋|𝑌}(𝐴, 𝑦) is a probability measure on (R^{𝑑_𝑋}, 𝔅_{𝑑_𝑋}) for every 𝑦 ∈ 𝑌(Ω),
(iii) for all 𝐴 ∈ 𝔅𝑑𝑋 and all 𝐵 ∈ 𝔅𝑑𝑌 holds

P[𝑋 ∈ 𝐴, 𝑌 ∈ 𝐵] = ∫_𝐵 𝜏_{𝑋|𝑌}(𝐴, 𝑦) dP_𝑌(𝑦).

Then 𝜏 is called a regular (version of the) conditional distribution of 𝑋 given 𝑌 . In this case, we denote

P[𝑋 ∈ 𝐴|𝑌 = 𝑦] := 𝜏𝑋 |𝑌 ( 𝐴, 𝑦),

and refer to this measure as the conditional distribution of 𝑋 |𝑌 = 𝑦.


Definition A.17 provides a mathematically rigorous way of assigning a distribution to a random variable conditioned
on an event that may have probability zero, as in Example A.16. Existence and uniqueness of these conditional
distributions hold in the following sense, see for example [102, Chapter 8] or [178, Chapter 3] for the specific statement
given here.
Theorem A.18 Let (Ω, 𝔄, P) be a probability space, and let 𝑋 : Ω → R𝑑𝑋 , 𝑌 : Ω → R𝑑𝑌 be two random variables.
Then there exists a regular version of the conditional distribution 𝜏1 .
Let 𝜏2 be another regular version of the conditional distribution. Then there exists a P𝑌 -null set 𝑁 ⊆ R𝑑𝑌 , such that
for all 𝑦 ∈ 𝑁 𝑐 ∩ 𝑌 (Ω), the two probability measures 𝜏1 (·, 𝑦) = 𝜏2 (·, 𝑦) coincide.
In particular, conditional distributions are only well-defined in a P𝑌 -almost everywhere sense.
Definition A.19 Let (Ω, 𝔄, P) be a probability space, and let 𝑋 : Ω → R𝑑𝑋 , 𝑌 : Ω → R𝑑𝑌 , 𝑍 : Ω → R𝑑𝑍 be three
random variables. We say that 𝑋 and 𝑍 are conditionally independent given 𝑌 , if the two distributions 𝑋 |𝑌 = 𝑦 and
𝑍 |𝑌 = 𝑦 are independent for P𝑌 -almost every 𝑦 ∈ 𝑌 (Ω).

A.4 Concentration inequalities

Let 𝑋𝑖 : Ω → R, 𝑖 ∈ N, be a sequence of random variables with finite first moments. The centered average over the
first 𝑛 terms
𝑆_𝑛 := (1/𝑛) ∑_{𝑖=1}^{𝑛} (𝑋_𝑖 − E[𝑋_𝑖])     (A.4.1)
is another random variable, and by linearity of the expectation it holds E[𝑆_𝑛] = 0. The sequence is said to satisfy the
strong law of large numbers if
P[ lim sup_{𝑛→∞} |𝑆_𝑛| = 0 ] = 1.

This is for example the case if there exists 𝐶 < ∞ such that V[𝑋𝑖 ] ≤ 𝐶 for all 𝑖 ∈ N. Concentration inequalities provide
bounds on the rate of this convergence.
We start with Markov’s inequality.
Lemma A.20 Let 𝑋 : Ω → R be a random variable, and let 𝜑 : [0, ∞) → [0, ∞) be monotonically increasing. Then
for all 𝜀 > 0
P[|𝑋| ≥ 𝜀] ≤ E[𝜑(|𝑋|)] / 𝜑(𝜀).
Proof We have
P[|𝑋| ≥ 𝜀] = ∫_{|𝑋|^{−1}([𝜀,∞))} 1 dP(𝜔) ≤ ∫_Ω 𝜑(|𝑋(𝜔)|)/𝜑(𝜀) dP(𝜔) = E[𝜑(|𝑋|)] / 𝜑(𝜀). □
Applying Markov’s inequality with 𝜑(𝑥) := 𝑥 2 to the random variable 𝑋 − E[𝑋] directly gives Chebyshev’s
inequality.
Lemma A.21 Let 𝑋 : Ω → R be a random variable with finite variance. Then for all 𝜀 > 0

P[|𝑋 − E[𝑋]| ≥ 𝜀] ≤ V[𝑋] / 𝜀².
From Chebyshev’s inequality we obtain the next result, which is a quite general concentration inequality for random
variables with finite variances.
Theorem A.22 Let 𝑋1 , . . . , 𝑋𝑛 be 𝑛 ∈ N independent real valued random variables such that for some 𝜍 > 0 holds
E[|𝑋_𝑖 − 𝜇|²] ≤ 𝜍² for all 𝑖 = 1, . . . , 𝑛. Denote

𝜇 := E[ (1/𝑛) ∑_{𝑗=1}^{𝑛} 𝑋_𝑗 ].     (A.4.2)

Then for all 𝜀 > 0

P[ |(1/𝑛) ∑_{𝑗=1}^{𝑛} 𝑋_𝑗 − 𝜇| ≥ 𝜀 ] ≤ 𝜍² / (𝜀²𝑛).
Proof Let 𝑆_𝑛 = (1/𝑛) ∑_{𝑗=1}^{𝑛} (𝑋_𝑗 − E[𝑋_𝑗]) = (1/𝑛) ∑_{𝑗=1}^{𝑛} 𝑋_𝑗 − 𝜇. By Bienaymé's identity (A.3.1), it holds that

V[𝑆_𝑛] = (1/𝑛²) ∑_{𝑗=1}^{𝑛} E[(𝑋_𝑗 − E[𝑋_𝑗])²] ≤ 𝜍²/𝑛.

Since E[𝑆 𝑛 ] = 0, Chebyshev’s inequality applied to 𝑆 𝑛 gives the statement. □


If we have additional information about the random variables, then we can derive sharper bounds. In case of
uniformly bounded random variables (rather than just bounded variance), Hoeffding’s inequality, which we recall next,
shows an exponential rate of concentration around the mean.

Theorem A.23 Let 𝑎, 𝑏 ∈ R. Let 𝑋1 , . . . , 𝑋𝑛 be 𝑛 ∈ N independent random variables such that 𝑎 ≤ 𝑋𝑖 ≤ 𝑏 almost
surely for all 𝑖 = 1, . . . , 𝑛, and let 𝜇 be as in (A.4.2). Then, for every 𝜀 > 0

P[ |(1/𝑛) ∑_{𝑗=1}^{𝑛} 𝑋_𝑗 − 𝜇| > 𝜀 ] ≤ 2e^{−2𝑛𝜀²/(𝑏−𝑎)²}.
 
A proof can, for example, be found in [188, Section B.4], where this version is also taken from.
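As an illustration of how much sharper Hoeffding's inequality can be compared to Theorem A.22, the following Python sketch estimates the deviation probability for Bernoulli(1/2) variables by Monte Carlo sampling and compares it with both bounds; the parameters are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, eps, trials = 200, 0.1, 50_000
# Bernoulli(1/2) variables take values in [a, b] = [0, 1], have mean 1/2 and variance 1/4.
X = rng.integers(0, 2, size=(trials, n))
deviation = np.abs(X.mean(axis=1) - 0.5)
empirical = np.mean(deviation > eps)
hoeffding = 2 * np.exp(-2 * n * eps**2)   # Theorem A.23
chebyshev = 0.25 / (eps**2 * n)           # Theorem A.22 with sigma^2 = 1/4
print(empirical, hoeffding, chebyshev)    # empirical <= Hoeffding <= Chebyshev
```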
Finally, we recall the central limit theorem, in its multivariate formulation. We say that (𝑋 𝑗 ) 𝑗 ∈N is an i.i.d. sequence
of random variables, if the random variables are (pairwise) independent and identically distributed. For a proof see
[102, Theorem 15.58].

Theorem A.24 (Multivariate central limit theorem) Let (𝑋_𝑗)_{𝑗∈N} be an i.i.d. sequence of R^𝑑-valued random variables
such that E[𝑋_1] = 0 ∈ R^𝑑 and E[𝑋_{1,𝑖}𝑋_{1,𝑗}] = 𝐶_{𝑖𝑗} for 𝑖, 𝑗 = 1, . . . , 𝑑. Let 𝑌_𝑛 := (𝑋_1 + · · · + 𝑋_𝑛)/√𝑛. Then 𝑌_𝑛 converges
weakly to N(0, 𝑪) as 𝑛 → ∞.

Appendix B
Functional analysis

This chapter provides some basic notions and results in functional analysis required in the main text. It is intended as a
revision for a reader already familiar with these concepts. For more details and proofs, we refer for example to the standard
textbooks [173, 175, 39, 69], where versions of all results below can be found.

B.1 Vector spaces

Definition B.1 Let K ∈ {R, C}. A vector space (over K) is a set 𝑋 such that the following holds:
(i) Properties of addition: For every 𝑥, 𝑦 ∈ 𝑋 there exists 𝑥 + 𝑦 ∈ 𝑋 such that for all 𝑧 ∈ 𝑋

𝑥 + 𝑦 = 𝑦 + 𝑥 and 𝑥 + (𝑦 + 𝑧) = (𝑥 + 𝑦) + 𝑧.

Moreover, there exists a unique element 0 ∈ 𝑋 such that 𝑥 + 0 = 𝑥 for all 𝑥 ∈ 𝑋 and for each 𝑥 ∈ 𝑋 there exists a
unique −𝑥 ∈ 𝑋 such that 𝑥 + (−𝑥) = 0.
(ii) Properties of scalar multiplication: There exists a map (𝜆, 𝑥) ↦→ 𝜆𝑥 from K × 𝑋 to 𝑋 called scalar multiplication.
It satisfies 1𝑥 = 𝑥 and (𝜆𝜇)𝑥 = 𝜆(𝜇𝑥) for all 𝑥 ∈ 𝑋.
We call the elements of a vector space vectors.

If the field is clear from context, we simply refer to 𝑋 as a vector space. We will primarily consider the case K = R,
and in this case we also say that 𝑋 is a real vector space.
To introduce a notion of convergence on a vector space 𝑋, it needs to be equipped with a topology. A topological
vector space is a vector space which is also a topological space, and in which addition and scalar multiplication are
continuous maps. We next discuss the most important instances of topological vector spaces.

B.1.1 Metric spaces

A metric space is a topological vector space equipped with a metric.

Definition B.2 For a set 𝑋, we call a map 𝑑 𝑋 : 𝑋 × 𝑋 → R+ a metric, if


(i) 0 ≤ 𝑑_𝑋(𝑥, 𝑦) < ∞ for all 𝑥, 𝑦 ∈ 𝑋,
(ii) 𝑑_𝑋(𝑥, 𝑦) = 0 if and only if 𝑥 = 𝑦,
(iii) 𝑑_𝑋(𝑥, 𝑦) = 𝑑_𝑋(𝑦, 𝑥) for all 𝑥, 𝑦 ∈ 𝑋,
(iv) 𝑑 𝑋 (𝑥, 𝑧) ≤ 𝑑 𝑋 (𝑥, 𝑦) + 𝑑 𝑋 (𝑦, 𝑧) for all 𝑥, 𝑦, 𝑧 ∈ 𝑋.
We call (𝑋, 𝑑 𝑋 ) a metric space.

In a metric space (𝑋, 𝑑 𝑋 ), we denote the open ball with center 𝑥 and radius 𝑟 > 0 by

𝐵𝑟 (𝑥) := {𝑦 ∈ 𝑋 | 𝑑 𝑋 (𝑥, 𝑦) < 𝑟}.

Every metric space is naturally equipped with a topology: a set 𝐴 ⊆ 𝑋 is open if and only if for every 𝑥 ∈ 𝐴 there exists
𝜀 > 0 such that 𝐵_𝜀(𝑥) ⊆ 𝐴.

Definition B.3 A metric space (𝑋, 𝑑 𝑋 ) is called complete, if every Cauchy sequence with respect to 𝑑 converges.

For complete metric spaces, an immensely powerful tool is Baire’s category theorem. To state it, we require the
notion of density of sets. Let 𝐴, 𝐵 ⊆ 𝑋 for a topological space 𝑋. Then 𝐴 is dense in 𝐵 if the closure of 𝐴, denoted by
𝐴, satisfies 𝐴 ⊇ 𝐵.

Theorem B.4 (Baire’s category theorem) Let 𝑋 be a complete metric space. Then the intersection of every countable
collection of dense open subsets of 𝑋 is dense in 𝑋.

Theorem B.4 implies that if 𝑋 = ∪_{𝑖=1}^{∞} 𝑉_𝑖 for a sequence of closed sets 𝑉_𝑖, then at least one of the 𝑉_𝑖 has to contain an open set.
Indeed, assuming that all 𝑉_𝑖 have empty interior implies that 𝑉_𝑖^𝑐 = 𝑋 \ 𝑉_𝑖 is open and dense for all 𝑖 ∈ N. By De Morgan's laws, it
then holds that ∅ = ∩_{𝑖=1}^{∞} 𝑉_𝑖^𝑐, which contradicts Theorem B.4.

B.1.2 Normed spaces

A norm is a way of assigning a length to a vector. A normed space is a vector space with a norm.

Definition B.5 For a vector space 𝑋, we call a map ∥ · ∥ 𝑋 from 𝑋 to R+ a norm if the following properties are satisfied:
(i) triangle inequality: ∥𝑥 + 𝑦∥ 𝑋 ≤ ∥𝑥∥ 𝑋 + ∥𝑦∥ 𝑋 for all 𝑥, 𝑦 ∈ 𝑋,
(ii) absolute homogeneity: ∥𝜆𝑥∥ 𝑋 = |𝜆|∥𝑥∥ 𝑋 for all 𝜆 ∈ R, 𝑥 ∈ 𝑋,
(iii) positive definiteness: ∥𝑥∥ 𝑋 > 0 for all 𝑥 ∈ 𝑋, 𝑥 ≠ 0.
We call (𝑋, ∥ · ∥ 𝑋 ) a normed space and omit ∥ · ∥ 𝑋 from the notation if it is clear from the context.
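Standard examples on 𝑋 = R𝑑 are the 𝑝-norms ∥𝒙∥ 𝑝 := (∑_{𝑖=1}^𝑑 |𝑥𝑖 |^𝑝 )^{1/𝑝} for 1 ≤ 𝑝 < ∞ and ∥𝒙∥ ∞ := max_{𝑖=1,...,𝑑} |𝑥𝑖 |; the case 𝑝 = 2 is the Euclidean norm.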

Every norm induces a metric 𝑑 𝑋 , and hence a topology, via 𝑑 𝑋 (𝑥, 𝑦) := ∥𝑥 − 𝑦∥ 𝑋 . Moreover, every normed space is a
topological vector space with respect to this topology.

B.1.3 Banach spaces

Definition B.6 A normed vector space is called a Banach space if it is complete.

Before presenting the main results on Banach spaces, we collect a couple of important examples.
(i) Euclidean spaces: Let 𝑑 ∈ N. Then R𝑑 , equipped with the Euclidean norm ∥ · ∥, is a Banach space.
(ii) Continuous functions: Let 𝑑 ∈ N and let 𝐾 ⊆ R𝑑 be compact. Then the set of continuous functions from 𝐾
to R is denoted by 𝐶 (𝐾). For 𝛼, 𝛽 ∈ R and 𝑓 , 𝑔 ∈ 𝐶 (𝐾), we define addition and scalar multiplication by
(𝛼 𝑓 + 𝛽𝑔) (𝒙) = 𝛼 𝑓 (𝒙) + 𝛽𝑔(𝒙) for all 𝒙 ∈ 𝐾. The vector space 𝐶 (𝐾) equipped with the supremum norm

∥ 𝑓 ∥ ∞ := sup_{𝒙∈𝐾} | 𝑓 (𝒙)|,

is a Banach space.

(iii) Lebesgue spaces 𝐿 𝑝 : Let (Ω, 𝔄, 𝜇) be a measure space and let 1 ≤ 𝑝 < ∞. Then the Lebesgue space 𝐿 𝑝 (Ω, 𝜇) is
defined as the vector space of all equivalence classes of functions 𝑓 : Ω → R that coincide 𝜇-almost everywhere
and satisfy
∥ 𝑓 ∥ 𝐿 𝑝 (Ω,𝜇) := ( ∫_Ω | 𝑓 (𝑥)|^𝑝 d𝜇(𝑥) )^{1/𝑝} < ∞. (B.1.1)
The integral is independent of the choice of representative of the equivalence class of 𝑓 . Addition and scalar
multiplication are defined as for 𝐶 (𝐾). It holds that 𝐿 𝑝 (Ω, 𝜇) is a Banach space. If Ω is a measurable subset of
R𝑑 for 𝑑 ∈ N, and 𝜇 is the Lebesgue measure, we typically omit 𝜇 from the notation and simply write 𝐿 𝑝 (Ω). If
Ω = N and the measure is the counting measure, we denote these spaces by ℓ 𝑝 (N) or simply ℓ 𝑝 .
The definition can be extended to complex or R𝑑 valued functions. In the latter case the integrand in (B.1.1)
is replaced by ∥ 𝑓 (𝑥) ∥ 𝑝 . We denote these spaces again by 𝐿 𝑝 (Ω, 𝜇) with the precise meaning being clear from
context.
(iv) Essentially bounded functions 𝐿 ∞ : Let (Ω, 𝔄, 𝜇) be a measure space. The 𝐿 𝑝 spaces can be extended to 𝑝 = ∞ by
defining the 𝐿 ∞ -norm

∥ 𝑓 ∥ 𝐿 ∞ (Ω,𝜇) := inf{𝐶 ≥ 0 | 𝜇({| 𝑓 | > 𝐶}) = 0}.

This is indeed a norm on the space of equivalence classes of measurable functions that coincide 𝜇-almost
everywhere. Moreover, with this norm, 𝐿 ∞ (Ω, 𝜇) is a Banach space. For the special case where Ω = N and 𝜇 is
the counting measure, we denote the resulting space by ℓ ∞ (N) or ℓ ∞ . As in the case 𝑝 < ∞, it is straightforward
to extend the definition to complex or R𝑑 valued functions, for which the same notation will be used.
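To illustrate the sequence spaces in (iii) and (iv), consider 𝑓 = (1/𝑛)𝑛∈N . Since ∑_{𝑛=1}^∞ 𝑛^{−𝑝} < ∞ precisely for 𝑝 > 1, it holds that 𝑓 ∈ ℓ^𝑝 for every 𝑝 > 1 and 𝑓 ∈ ℓ^∞ , but 𝑓 ∉ ℓ^1 .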
We often use dual spaces in our analysis. These are defined below.
Definition B.7 Let (𝑋, ∥ · ∥ 𝑋 ) be a normed space. Linear maps from 𝑋 to R are called linear functionals. The vector
space of all continuous linear functionals on 𝑋 is called the dual space of 𝑋 and is denoted by 𝑋 ′ .
Together with the natural addition and scalar multiplication

(ℎ + 𝑔) (𝑥) B ℎ(𝑥) + 𝑔(𝑥) and (𝜆ℎ) (𝑥) B 𝜆(ℎ(𝑥)) for all 𝜆 ∈ R, 𝑥 ∈ 𝑋,

𝑋 ′ is a vector space. Moreover, we equip it with the norm

∥ 𝑓 ∥ 𝑋′ := sup_{𝑥 ∈𝑋, ∥ 𝑥 ∥ 𝑋 =1} | 𝑓 (𝑥)|.

The space (𝑋 ′ , ∥ · ∥ 𝑋′ ) is always a Banach space, even if (𝑋, ∥ · ∥ 𝑋 ) is not complete.
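A classical example: for 1 < 𝑝 < ∞ and 𝑞 defined by 1/𝑝 + 1/𝑞 = 1, every 𝑔 ∈ 𝐿^𝑞 (Ω, 𝜇) induces a continuous linear functional 𝑓 ↦→ ∫_Ω 𝑓 𝑔 d𝜇 on 𝐿^𝑝 (Ω, 𝜇); in fact, by the Riesz representation theorem, every element of the dual space of 𝐿^𝑝 (Ω, 𝜇) is of this form.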


The dual space can often be used to characterize the original Banach space (called primal space). One way in
which the dual space can capture certain algebraic or geometric properties of the primal space is through the so-called
Hahn-Banach theorems. In this book we only use one specific variant and its implication for the existence of so-called
dual bases.

Theorem B.8 (Geometric Hahn-Banach, subspace version) Let 𝑀 be a subspace of a Banach space 𝑋 and let
𝑥0 ∈ 𝑋. If 𝑥0 is not in the closure of 𝑀, then there exists 𝑓 ∈ 𝑋 ′ such that 𝑓 (𝑥 0 ) = 1 and 𝑓 (𝑥) = 0 for every 𝑥 ∈ 𝑀.

One direct consequence of Theorem B.8 that will be used throughout this manuscript is the existence of a dual basis.
Let (𝑥𝑖 )𝑖 ∈N ⊆ 𝑋 be such that for all 𝑖 ∈ N the vector 𝑥𝑖 does not belong to the closure of

span{𝑥 𝑗 | 𝑗 ∈ N, 𝑗 ≠ 𝑖}.

Then for every 𝑖 ∈ N there exists 𝑓𝑖 ∈ 𝑋 ′ such that 𝑓𝑖 (𝑥 𝑗 ) = 0 if 𝑖 ≠ 𝑗 and 𝑓𝑖 (𝑥𝑖 ) = 1.
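For instance, if 𝑋 is a Hilbert space (see Section B.1.4) and (𝑥𝑖 )𝑖 ∈N is an orthonormal system, i.e. ⟨𝑥𝑖 , 𝑥 𝑗 ⟩ 𝑋 = 0 for 𝑖 ≠ 𝑗 and ⟨𝑥𝑖 , 𝑥𝑖 ⟩ 𝑋 = 1, then the functionals 𝑓𝑖 := ⟨·, 𝑥𝑖 ⟩ 𝑋 provide such a dual basis.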

B.1.4 Hilbert spaces

Often, we require more structure than that provided by normed spaces. An inner product offers an additional way to
compare two vectors.

Definition B.9 Let 𝑋 be a real vector space. An inner product on 𝑋 is a function ⟨·, ·⟩𝑋 : 𝑋 × 𝑋 → R such that for all
𝛼, 𝛽 ∈ R and 𝑥, 𝑦, 𝑧 ∈ 𝑋 the following holds:
(i) ⟨𝛼𝑥 + 𝛽𝑦, 𝑧⟩𝑋 = 𝛼⟨𝑥, 𝑧⟩𝑋 + 𝛽⟨𝑦, 𝑧⟩𝑋 ,
(ii) ⟨𝑥, 𝑦⟩𝑋 = ⟨𝑦, 𝑥⟩𝑋 ,
(iii) ⟨𝑥, 𝑥⟩𝑋 > 0 for all 𝑥 ≠ 0.

On inner product spaces the so-called Cauchy-Schwarz inequality holds.


Theorem B.10 (Cauchy-Schwarz inequality) Let 𝑋 be a vector space with inner product ⟨·, ·⟩𝑋 . Then it holds for all
𝑥, 𝑦 ∈ 𝑋
|⟨𝑥, 𝑦⟩𝑋 | ≤ √(⟨𝑥, 𝑥⟩ 𝑋 ⟨𝑦, 𝑦⟩ 𝑋 ).

Moreover, equality holds if and only if 𝑥 and 𝑦 are linearly dependent.


Every inner product ⟨·, ·⟩𝑋 induces a norm via
∥𝑥∥ 𝑋 := √(⟨𝑥, 𝑥⟩ 𝑋 ) for all 𝑥 ∈ 𝑋. (B.1.2)

The properties of the inner product immediately yield the polar identity

∥𝑥 + 𝑦∥ 2𝑋 = ∥𝑥∥ 2𝑋 + 2⟨𝑥, 𝑦⟩𝑋 + ∥𝑦∥ 2𝑋 . (B.1.3)

The fact that (B.1.2) is indeed a norm follows by applying the Cauchy-Schwarz inequality to (B.1.3), which yields
∥𝑥 + 𝑦∥^2_𝑋 ≤ ∥𝑥∥^2_𝑋 + 2∥𝑥∥ 𝑋 ∥𝑦∥ 𝑋 + ∥𝑦∥^2_𝑋 = (∥𝑥∥ 𝑋 + ∥𝑦∥ 𝑋 )^2 , so that ∥ · ∥ 𝑋 satisfies the triangle inequality. This gives rise to the definition of a Hilbert space.

Definition B.11 Let 𝐻 be a real vector space with inner product ⟨·, ·⟩ 𝐻 . Then (𝐻, ⟨·, ·⟩ 𝐻 ) is called a Hilbert space if
and only if 𝐻 is complete with respect to the norm ∥ · ∥ 𝐻 induced by the inner product.

A standard example of a Hilbert space is 𝐿 2 : Let (Ω, 𝔄, 𝜇) be a measure space. Then



⟨ 𝑓 , 𝑔⟩ 𝐿 2 (Ω,𝜇) = ∫_Ω 𝑓 (𝑥)𝑔(𝑥) d𝜇(𝑥) for all 𝑓 , 𝑔 ∈ 𝐿 2 (Ω, 𝜇),

defines an inner product on 𝐿 2 (Ω, 𝜇) compatible with the 𝐿 2 (Ω, 𝜇)-norm.


In a Hilbert space, we can compare vectors not only via their distance, measured by the norm, but also by using the
inner product, which corresponds to their relative orientation. This leads to the concept of orthogonality.
Definition B.12 Let 𝐻 be a Hilbert space and let 𝑓 , 𝑔 ∈ 𝐻. We say that 𝑓 and 𝑔 are orthogonal if ⟨ 𝑓 , 𝑔⟩ = 0, denoted
by 𝑓 ⊥ 𝑔. Moreover, for 𝐹, 𝐺 ⊆ 𝐻 we write 𝐹 ⊥ 𝐺 if 𝑓 ⊥ 𝑔 for all 𝑓 ∈ 𝐹, 𝑔 ∈ 𝐺.
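For example, in 𝐿 2 ( [0, 2𝜋]) the functions sin and cos are orthogonal, since ∫_0^{2𝜋} sin(𝑥) cos(𝑥) d𝑥 = (1/2) ∫_0^{2𝜋} sin(2𝑥) d𝑥 = 0.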
For orthogonal vectors, the polar identity immediately implies the Pythagorean theorem.

Theorem B.13 Let 𝐻 be a Hilbert space, 𝑛 ∈ N, and let 𝑓1 , . . . , 𝑓𝑛 ∈ 𝐻 be pairwise orthogonal vectors. Then,

∥ ∑_{𝑖=1}^𝑛 𝑓𝑖 ∥^2_𝐻 = ∑_{𝑖=1}^𝑛 ∥ 𝑓𝑖 ∥^2_𝐻 .

A final property of Hilbert spaces that we encounter in this book is the existence of unique projections onto convex
sets.

Theorem B.14 Let 𝐻 be a Hilbert space and let 𝐾 ≠ ∅ be a closed convex subset of 𝐻. Then for every ℎ ∈ 𝐻 there exists
a unique 𝑘 0 ∈ 𝐾 such that

∥ℎ − 𝑘 0 ∥ 𝐻 = inf{∥ℎ − 𝑘 ∥ | 𝑘 ∈ 𝐾 }.
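For example, if 𝐾 = {𝑘 ∈ 𝐻 | ∥ 𝑘 ∥ 𝐻 ≤ 1} is the closed unit ball, then 𝑘 0 = ℎ whenever ∥ℎ∥ 𝐻 ≤ 1, and 𝑘 0 = ℎ/∥ℎ∥ 𝐻 otherwise.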

B.2 Fourier transform

The Fourier transform is a powerful tool in analysis. It allows one to represent functions as superpositions of frequencies.

Definition B.15 Let 𝑑 ∈ N. The Fourier transform of a function 𝑓 ∈ 𝐿 1 (R𝑑 ) is defined by




F ( 𝑓 ) (𝝎) := 𝑓ˆ(𝝎) := ∫_{R𝑑} 𝑓 (𝒙) 𝑒^{−2𝜋𝑖 𝒙^⊤ 𝝎} d𝒙 for all 𝝎 ∈ R𝑑 ,

and the inverse Fourier transform by



F^{−1} ( 𝑓 ) (𝒙) := 𝑓ˇ(𝒙) := 𝑓ˆ(−𝒙) = ∫_{R𝑑} 𝑓 (𝝎) 𝑒^{2𝜋𝑖 𝒙^⊤ 𝝎} d𝝎 for all 𝒙 ∈ R𝑑 .

It is immediately clear from the definition that ∥ 𝑓ˆ∥ 𝐿 ∞ (R𝑑 ) ≤ ∥ 𝑓 ∥ 𝐿 1 (R𝑑 ) . As a result, the operator F : 𝑓 ↦→ 𝑓ˆ is a
bounded linear map from 𝐿 1 (R𝑑 ) to 𝐿 ∞ (R𝑑 ). We point out that 𝑓ˆ can take complex values, and the definition is also
meaningful for complex-valued functions 𝑓 .
If 𝑓ˆ ∈ 𝐿 1 (R𝑑 ), then we can reverse the process of taking the Fourier transform by taking the inverse Fourier
transform.

Theorem B.16 If 𝑓 , 𝑓ˆ ∈ 𝐿 1 (R𝑑 ), then F^{−1} ( 𝑓ˆ) = 𝑓 .
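A standard example is the Gaussian 𝑓 (𝒙) = 𝑒^{−𝜋 ∥ 𝒙 ∥^2} on R𝑑 , which belongs to 𝐿 1 (R𝑑 ) together with its Fourier transform and satisfies 𝑓ˆ = 𝑓 with the normalization of Definition B.15; in particular F^{−1} ( 𝑓ˆ) = 𝑓 , in accordance with Theorem B.16.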

References

1. R. Alaifari, G. S. Alberti, and T. Gauksson. Adef: an iterative algorithm to construct adversarial deformations. arXiv preprint
arXiv:1804.07729, 2018.
2. Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers.
In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information
Processing Systems, volume 32. Curran Associates, Inc., 2019.
3. Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. In International Conference on
Machine Learning, pages 242–252. PMLR, 2019.
4. M. Anthony and P. L. Bartlett. Neural network learning: theoretical foundations. Cambridge University Press, Cambridge, 1999.
5. R. Arora, A. Basu, P. Mianjy, and A. Mukherjee. Understanding deep neural networks with rectified linear units. In International
Conference on Learning Representations, 2018.
6. S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net.
In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information
Processing Systems, volume 32. Curran Associates, Inc., 2019.
7. S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger generalization bounds for deep nets via a compression approach. In
International Conference on Machine Learning, pages 254–263. PMLR, 2018.
8. M. Baader, M. Mirman, and M. Vechev. Universal approximation with certified networks. arXiv preprint arXiv:1909.13846, 2019.
9. A. R. Barron. Neural net approximation. In Proc. 7th Yale workshop on adaptive and learning systems, volume 1, pages 69–72, 1992.
10. A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory, 39(3):930–945,
1993.
11. A. R. Barron and J. M. Klusowski. Approximation and estimation for high-dimensional deep learning networks. arXiv preprint
arXiv:1809.03090, 2018.
12. P. Bartlett. For valid generalization the size of the weights is more important than the size of the network. Advances in neural
information processing systems, 9, 1996.
13. G. Beliakov. Interpolation of lipschitz functions. Journal of Computational and Applied Mathematics, 196(1):20–44, 2006.
14. M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off.
Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
15. M. Belkin, S. Ma, and S. Mandal. To understand deep learning we need to understand kernel learning. In International Conference
on Machine Learning, pages 541–549. PMLR, 2018.
16. R. Bellman. On the theory of dynamic programming. Proceedings of the national Academy of Sciences, 38(8):716–719, 1952.
17. Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on
Neural Networks, 5(2):157–166, 1994.
18. J. Berner, P. Grohs, and A. Jentzen. Analysis of the generalization error: Empirical risk minimization over deep artificial neural
networks overcomes the curse of dimensionality in the numerical approximation of black–scholes partial differential equations. SIAM
Journal on Mathematics of Data Science, 2(3):631–657, 2020.
19. J. Berner, P. Grohs, G. Kutyniok, and P. Petersen. The modern mathematics of deep learning, 2021.
20. D. P. Bertsekas. Nonlinear programming. Athena Scientific Optimization and Computation Series. Athena Scientific, Belmont, MA,
third edition, 2016.
21. H. Bolcskei, P. Grohs, G. Kutyniok, and P. Petersen. Optimal approximation with sparsely connected deep neural networks. SIAM
Journal on Mathematics of Data Science, 1(1):8–45, 2019.
22. L. Bottou. Stochastic Gradient Descent Tricks, pages 421–436. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
23. L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
24. O. Bousquet and A. Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.
25. S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, Cambridge, 2004.
26. J. Braun and M. Griebel. On a constructive proof of kolmogorov’s superposition theorem. Constructive Approximation, 30(3):653–675,
Dec 2009.

27. M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges.
arXiv preprint arXiv:2104.13478, 2021.
28. E. J. Candes. Ridgelets: theory and applications. Stanford University, 1998.
29. N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy
(sp), pages 39–57. Ieee, 2017.
30. S. M. Carroll and B. W. Dickinson. Construction of neural nets using the radon transform. International 1989 Joint Conference on
Neural Networks, pages 607–611 vol.1, 1989.
31. P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. Entropy-sgd:
Biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124018, 2019.
32. M. Chen, H. Jiang, W. Liao, and T. Zhao. Efficient approximation of deep relu networks for functions on low dimensional manifolds.
Advances in neural information processing systems, 32, 2019.
33. L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable programming. In H. Wallach, H. Larochelle, A. Beygelzimer,
F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates,
Inc., 2019.
34. L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable programming. Advances in neural information processing
systems, 32, 2019.
35. Y. Cho and L. Saul. Kernel methods for deep learning. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, editors,
Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc., 2009.
36. F. Chollet. Deep learning with Python. Simon and Schuster, 2021.
37. A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In Artificial
intelligence and statistics, pages 192–204. PMLR, 2015.
38. C. K. Chui and H. N. Mhaskar. Deep nets for local manifold learning. Frontiers in Applied Mathematics and Statistics, 4:12, 2018.
39. J. B. Conway. A course in functional analysis, volume 96. Springer, 2019.
40. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge
University Press, 1 edition, 2000.
41. F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American mathematical society, 39(1):1–49,
2002.
42. G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314,
1989.
43. Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in
high-dimensional non-convex optimization. Advances in neural information processing systems, 27, 2014.
44. A. G. de G. Matthews. Sample-then-optimize posterior sampling for bayesian linear models. 2017.
45. A. G. de G. Matthews, J. Hron, M. Rowland, R. E. Turner, and Z. Ghahramani. Gaussian process behaviour in wide deep neural
networks. In International Conference on Learning Representations, 2018.
46. A. Défossez, L. Bottou, F. R. Bach, and N. Usunier. A simple convergence proof of adam and adagrad. Trans. Mach. Learn. Res.,
2022, 2022.
47. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE
Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
48. R. A. DeVore. Nonlinear approximation. Acta numerica, 7:51–150, 1998.
49. L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio. Sharp minima can generalize for deep nets. In International Conference on Machine
Learning, pages 1019–1028. PMLR, 2017.
50. F. Draxler, K. Veschgini, M. Salmhofer, and F. Hamprecht. Essentially no barriers in neural network energy landscape. In International
conference on machine learning, pages 1309–1318. PMLR, 2018.
51. S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. In K. Chaudhuri and
R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of
Machine Learning Research, pages 1675–1685. PMLR, 09–15 Jun 2019.
52. J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine
Learning Research, 12(Jul):2121–2159, 2011.
53. W. E and Q. Wang. Exponential convergence of the deep neural network approximation for analytic functions. Sci. China Math.,
61(10):1733–1740, 2018.
54. M. Fischer, M. Balunovic, D. Drachsler-Cohen, T. Gehr, C. Zhang, and M. Vechev. Dl2: training and querying neural networks with
logic. In International Conference on Machine Learning, pages 1931–1941. PMLR, 2019.
55. C. L. Frenzen, T. Sasao, and J. T. Butler. On the number of segments needed in a piecewise linear approximation. Journal of
Computational and Applied mathematics, 234(2):437–446, 2010.
56. K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183–192, 1989.
57. T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, and A. G. Wilson. Loss surfaces, mode connectivity, and fast ensembling of
dnns. Advances in neural information processing systems, 31, 2018.
58. G. Garrigos and R. M. Gower. Handbook of convergence theorems for (stochastic) gradient methods, 2023.
59. T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov, S. Chaudhuri, and M. Vechev. Ai2: Safety and robustness certification of neural
networks with abstract interpretation. In 2018 IEEE symposium on security and privacy (SP), pages 3–18. IEEE, 2018.
60. A. Géron. Hands-on machine learning with Scikit-Learn and TensorFlow : concepts, tools, and techniques to build intelligent systems.
O’Reilly Media, Sebastopol, CA, 2017.

61. F. Girosi and T. Poggio. Networks and the best approximation property. Biological cybernetics, 63(3):169–176, 1990.
62. G. Goh. Why momentum really works. Distill, 2017.
63. L. Gonon and C. Schwab. Deep relu network expression rates for option prices in high-dimensional, exponential lévy models. Finance
and Stochastics, 25(4):615–657, 2021.
64. I. J. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, Cambridge, MA, USA, 2016. http://www.deeplearningbook.org.
65. I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning
Representations (ICLR), 2015.
66. I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimization problems. arXiv preprint
arXiv:1412.6544, 2014.
67. L.-A. Gottlieb, A. Kontorovich, and R. Krauthgamer. Efficient regression in metric spaces via approximate lipschitz extension. IEEE
Transactions on Information Theory, 63(8):4838–4849, 2017.
68. R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtárik. SGD: General analysis and improved rates. In
K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of
Proceedings of Machine Learning Research, pages 5200–5209. PMLR, 09–15 Jun 2019.
69. K. Gröchenig. Foundations of time-frequency analysis. Springer Science & Business Media, 2013.
70. P. Grohs and L. Herrmann. Deep neural network approximation for high-dimensional elliptic pdes with boundary conditions. IMA
Journal of Numerical Analysis, 42(3):2055–2082, 2022.
71. P. Grohs, F. Hornung, A. Jentzen, and P. Von Wurstemberger. A proof that artificial neural networks overcome the curse of
dimensionality in the numerical approximation of Black–Scholes partial differential equations, volume 284. American Mathematical
Society, 2023.
72. B. Hanin and D. Rolnick. Complexity of linear regions in deep networks. In International Conference on Machine Learning, pages
2596–2604. PMLR, 2019.
73. T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. The
Annals of Statistics, 50(2):949–986, 2022.
74. S. S. Haykin. Neural networks and learning machines. Pearson Education, Upper Saddle River, NJ, third edition, 2009.
75. J. He, L. Li, J. Xu, and C. Zheng. Relu deep neural networks and linear finite elements. J. Comput. Math., 38(3):502–527, 2020.
76. K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.
Proceedings of the IEEE international conference on computer vision, 2015.
77. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 770–778, 2016.
78. M. Hein and M. Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. Advances in
neural information processing systems, 30, 2017.
79. H. Heuser. Lehrbuch der Analysis. Teil 1. Vieweg + Teubner, Wiesbaden, revised edition, 2009.
80. G. Hinton. Divide the gradient by a running average of its recent magnitude. https://www.cs.toronto.edu/~hinton/coursera/lecture6/lec6e.mp4, 2012. Lecture 6e.
81. S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer,
Technische Universität München, 1991.
82. S. Hochreiter and J. Schmidhuber. Flat minima. Neural computation, 9(1):1–42, 1997.
83. S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
84. K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
85. K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–
366, 1989.
86. G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. Proceedings of the IEEE
conference on computer vision and pattern recognition, 1(2):3, 2017.
87. G.-B. Huang and H. A. Babri. Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded
nonlinear activation functions. IEEE transactions on neural networks, 9(1):224–229, 1998.
88. L. Huang, A. D. Joseph, B. Nelson, B. I. Rubinstein, and J. D. Tygar. Adversarial machine learning. In Proceedings of the 4th ACM
workshop on Security and artificial intelligence, pages 43–58, 2011.
89. T. Huster, C.-Y. J. Chiang, and R. Chadha. Limitations of the lipschitz constant as a defense against adversarial examples. In ECML
PKDD 2018 Workshops: Nemesis 2018, UrbReas 2018, SoGood 2018, IWAISe 2018, and Green Data Mining 2018, Dublin, Ireland,
September 10-14, 2018, Proceedings 18, pages 16–29. Springer, 2019.
90. M. Hutzenthaler, A. Jentzen, T. Kruse, and T. A. Nguyen. A proof that rectified deep neural networks overcome the curse of
dimensionality in the numerical approximation of semilinear heat equations. SN partial differential equations and applications,
1(2):10, 2020.
91. D. J. Im, M. Tao, and K. Branson. An empirical analysis of deep network loss surfaces. 2016.
92. V. E. Ismailov. Ridge functions and applications in neural networks, volume 263. American Mathematical Society, 2021.
93. Y. Ito and K. Saito. Superposition of linearly independent functions and finite mappings by neural networks. The Mathematical
Scientist, 21(1):27, 1996.
94. A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural
information processing systems, 31, 2018.
95. A. Jentzen and A. Riekert. On the existence of global minima and convergence analyses for gradient descent methods in the training
of deep neural networks. arXiv preprint arXiv:2112.09684, 2021.

96. P. C. Kainen, V. Kurkova, and A. Vogt. Approximation by neural networks is not continuous. Neurocomputing, 29(1-3):47–56, 1999.
97. P. C. Kainen, V. Kurkova, and A. Vogt. Continuity of approximation by neural networks in l p spaces. Annals of Operations Research,
101:143–147, 2001.
98. P. C. Kainen, V. Kurkova, and A. Vogt. Best approximation by linear combinations of characteristic functions of half-spaces. Journal
of Approximation Theory, 122(2):151–159, 2003.
99. H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz
condition. In P. Frasconi, N. Landwehr, G. Manco, and J. Vreeken, editors, Machine Learning and Knowledge Discovery in Databases,
pages 795–811, Cham, 2016. Springer International Publishing.
100. C. Karner, V. Kazeev, and P. C. Petersen. Limitations of gradient descent due to numerical instability of backpropagation. arXiv
preprint arXiv:2210.00805, 2022.
101. D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations,
ICLR 2015 - Conference Track Proceedings. International Conference on Learning Representations, ICLR, 2015.
102. A. Klenke. Wahrscheinlichkeitstheorie. Springer, 2006.
103. M. Kohler, A. Krzyżak, and S. Langer. Estimation of a function of low local dimensionality by deep neural networks. IEEE
transactions on information theory, 68(6):4032–4042, 2022.
104. M. Kohler and S. Langer. On the rate of convergence of fully connected deep neural network regression estimates. The Annals of
Statistics, 49(4):2231–2249, 2021.
105. A. N. Kolmogorov. On the representation of continuous functions of many variables by superposition of continuous functions of one
variable and addition. Dokl. Akad. Nauk SSSR, 114:953–956, 1957.
106. A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural
information processing systems, pages 1097–1105, 2012.
107. G. Kutyniok, P. Petersen, M. Raslan, and R. Schneider. A theoretical analysis of deep neural networks and parametric pdes.
Constructive Approximation, 55(1):73–125, 2022.
108. V. Kůrková. Kolmogorov’s theorem and multilayer neural networks. Neural Networks, 5(3):501–506, 1992.
109. F. Laakmann and P. Petersen. Efficient approximation of solutions of parametric linear transport equations by relu dnns. Advances in
Computational Mathematics, 47(1):11, 2021.
110. G. Lan. First-order and Stochastic Optimization Methods for Machine Learning. Springer Series in the Data Sciences. Springer
International Publishing, Cham, 1st ed. 2020. edition, 2020.
111. Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.
112. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten
zip code recognition. Neural Computation, 1(4):541–551, 1989.
113. Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient BackProp, pages 9–48. Springer Berlin Heidelberg, Berlin, Heidelberg,
2012.
114. J. Lee, J. Sohl-dickstein, J. Pennington, R. Novak, S. Schoenholz, and Y. Bahri. Deep neural networks as gaussian processes. In
International Conference on Learning Representations, 2018.
115. J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as
linear models under gradient descent. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,
Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
116. M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a nonpolynomial activation function can
approximate any function. Neural Networks, 6(6):861–867, 1993.
117. L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J.
Optim., 26(1):57–95, 2016.
118. H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the loss landscape of neural nets. Advances in neural information
processing systems, 31, 2018.
119. W. Li. Generalization error of minimum weighted norm and kernel interpolation. SIAM Journal on Mathematics of Data Science,
3(1):414–438, 2021.
120. Z. Liu, Y. Wang, S. Vaidya, F. Ruehle, J. Halverson, M. Soljačić, T. Y. Hou, and M. Tegmark. Kan: Kolmogorov-arnold networks,
2024.
121. M. Longo, J. A. Opschoor, N. Disch, C. Schwab, and J. Zech. De rham compatible deep neural network fem. Neural Networks,
165:721–739, 2023.
122. C. Ma, S. Wojtowytsch, L. Wu, et al. Towards a mathematical understanding of neural network-based machine learning: what we
know and what we don’t. arXiv preprint arXiv:2009.10713, 2020.
123. C. Ma, L. Wu, et al. A priori estimates of the population risk for two-layer neural networks. arXiv preprint arXiv:1810.06397, 2018.
124. S. Mahan, E. J. King, and A. Cloninger. Nonclosedness of sets of neural networks in sobolev spaces. Neural Networks, 137:85–96,
2021.
125. W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics,
5:115–133, 1943.
126. S. Mei and A. Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve.
Communications on Pure and Applied Mathematics, 75(4):667–766, 2022.
127. H. N. Mhaskar. Approximation properties of a multilayered feedforward artificial neural network. Adv. Comput. Math., 1(1):61–80,
1993.
128. H. N. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural computation, 8(1):164–177,
1996.

129. H. N. Mhaskar and C. A. Micchelli. Degree of approximation by neural and translation networks with a single hidden layer. Advances
in applied mathematics, 16(2):151–183, 1995.
130. M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT press, 2018.
131. C. Molnar. Interpretable machine learning. Lulu. com, 2020.
132. H. Montanelli and Q. Du. New error bounds for deep relu networks using sparse grids. SIAM Journal on Mathematics of Data
Science, 1(1):78–92, 2019.
133. S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 1765–1773, 2017.
134. E. Moulines and F. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In J. Shawe-Taylor,
R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24. Curran
Associates, Inc., 2011.
135. K.-R. Muller, S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf. An introduction to kernel-based learning algorithms. IEEE Transactions
on Neural Networks, 12(2):181–201, 2001.
136. R. Nakada and M. Imaizumi. Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. Journal
of Machine Learning Research, 21(174):1–38, 2020.
137. R. M. Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.
138. Y. Nesterov. Introductory lectures on convex optimization, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston,
MA, 2004. A basic course.
139. Y. Nesterov. Lectures on convex optimization, volume 137 of Springer Optimization and Its Applications. Springer, Cham, second
edition, 2018.
140. Y. E. Nesterov. A method for solving the convex programming problem with convergence rate 𝑂(1/𝑘^2). Dokl. Akad. Nauk SSSR,
269(3):543–547, 1983.
141. B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks. In Conference on learning theory, pages
1376–1401. PMLR, 2015.
142. J. Nocedal and S. J. Wright. Numerical optimization. Springer Series in Operations Research and Financial Engineering. Springer,
New York, second edition, 2006.
143. E. Novak and H. Woźniakowski. Approximation of infinitely differentiable multivariate functions is intractable. Journal of Complexity,
25(4):398–404, 2009.
144. B. O’Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Found. Comput. Math., 15(3):715–732, 2015.
145. J. A. A. Opschoor, C. Schwab, and J. Zech. Exponential ReLU DNN expression of holomorphic maps in high dimension. Constructive
Approximation, 2021.
146. J. A. A. Opschoor, C. Schwab, and J. Zech. Deep learning in high dimension: ReLU neural network expression for Bayesian PDE
inversion. In Optimization and control for partial differential equations—uncertainty quantification, open and closed-loop control,
and shape optimization, volume 29 of Radon Ser. Comput. Appl. Math., pages 419–462. De Gruyter, Berlin, 2022.
147. P. Oswald. On the degree of nonlinear spline approximation in Besov-Sobolev spaces. J. Approx. Theory, 61(2):131–157, 1990.
148. S. Ovchinnikov. Max-min representation of piecewise linear functions. Beiträge Algebra Geom., 43(1):297–302, 2002.
149. N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box attacks against machine learning. In
Proceedings of the 2017 ACM on Asia conference on computer and communications security, pages 506–519, 2017.
150. Y. C. Pati and P. S. Krishnaprasad. Analysis and synthesis of feedforward neural networks using discrete affine wavelet transformations.
IEEE Transactions on Neural Networks, 4(1):73–85, 1993.
151. J. Pennington and Y. Bahri. Geometry of neural network loss surfaces via random matrix theory. In International Conference on
Machine Learning, pages 2798–2806. PMLR, 2017.
152. P. Petersen, M. Raslan, and F. Voigtlaender. Topological properties of the set of functions generated by neural networks of fixed size.
Foundations of computational mathematics, 21:375–444, 2021.
153. P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep relu neural networks. Neural
Networks, 108:296–330, 2018.
154. P. C. Petersen. Neural Network Theory. 2020. http://www.pc-petersen.eu/Neural_Network_Theory.pdf, Lecture notes.
155. A. Pinkus. Approximation theory of the MLP model in neural networks. In Acta numerica, 1999, volume 8 of Acta Numer., pages
143–195. Cambridge Univ. Press, Cambridge, 1999.
156. G. Pisier. Remarques sur un résultat non publié de B. Maurey. Séminaire Analyse fonctionnelle (dit "Maurey-Schwartz"), 1980-1981.
157. T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao. Why and when can deep-but not shallow-networks avoid the curse of
dimensionality: a review. Int. J. Autom. Comput., 14(5):503–519, 2017.
158. T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for predictivity in learning theory. Nature, 428(6981):419–422,
2004.
159. B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical
Physics, 4(5):1–17, 1964.
160. B. T. Polyak. Introduction to optimization. Translations Series in Mathematics and Engineering. Optimization Software, Inc.,
Publications Division, New York, 1987. Translated from the Russian, With a foreword by Dimitri P. Bertsekas.
161. S. J. Prince. Understanding Deep Learning. MIT Press, 2023.
162. N. Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.
163. Q. Nguyen, M. C. Mukkamala, and M. Hein. On the loss landscape of a class of deep neural networks with no bad local
valleys. In International Conference on Learning Representations (ICLR), 2018.

164. A. Rahimi and B. Recht. Random features for large-scale kernel machines. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors,
Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.
165. M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward
and inverse problems involving nonlinear partial differential equations. Journal of Computational physics, 378:686–707, 2019.
166. C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. Adaptive computation and machine learning. MIT
Press, 2006.
167. E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, Q. Le, and A. Kurakin. Regularized evolution for image classifier architecture
search. Proceedings of the AAAI Conference on Artificial Intelligence, 33:4780–4789, 2019.
168. S. J. Reddi, S. Kale, and S. Kumar. On the convergence of adam and beyond. In 6th International Conference on Learning
Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018.
169. H. Robbins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400 – 407, 1951.
170. F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review,
65(6):386–408, 1958.
171. W. Ruan, X. Yi, and X. Huang. Adversarial robustness of deep learning: Theory, algorithms, and applications. In Proceedings of the
30th ACM international conference on information & knowledge management, pages 4866–4869, 2021.
172. S. Ruder. An overview of gradient descent optimization algorithms, 2016.
173. W. Rudin. Real and complex analysis. McGraw-Hill Book Co., New York, third edition, 1987.
174. W. Rudin. Functional analysis. International Series in Pure and Applied Mathematics. McGraw-Hill, Inc., New York, second edition,
1991.
175. W. Rudin. Functional analysis. International Series in Pure and Applied Mathematics. McGraw-Hill, Inc., New York, second edition, 1991.
176. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536,
1986.
177. M. A. Sartori and P. J. Antsaklis. A simple method to derive bounds on the size and to train multilayer neural networks. IEEE
transactions on neural networks, 2(4):467–471, 1991.
178. R. Scheichl and J. Zech. Numerical methods for bayesian inverse problems, 2021. Lecture Notes.
179. J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
180. J. Schmidt-Hieber. Deep relu network approximation of functions on a manifold. arXiv preprint arXiv:1908.00695, 2019.
181. J. Schmidt-Hieber. Nonparametric regression using deep neural networks with relu activation function. 2020.
182. J. Schmidt-Hieber. The kolmogorov–arnold representation theorem revisited. Neural Networks, 137:119–126, 2021.
183. B. Schölkopf and A. J. Smola. Learning with kernels : support vector machines, regularization, optimization, and beyond. Adaptive
computation and machine learning. MIT Press, 2002.
184. L. Schumaker. Spline Functions: Basic Theory. Cambridge Mathematical Library. Cambridge University Press, 3 edition, 2007.
185. C. Schwab and J. Zech. Deep learning in high dimension: neural network expression rates for generalized polynomial chaos expansions
in UQ. Anal. Appl. (Singap.), 17(1):19–55, 2019.
186. C. Schwab and J. Zech. Deep learning in high dimension: neural network expression rates for analytic functions in
𝐿^2 (R^𝑑 , 𝛾_𝑑 ). SIAM/ASA J. Uncertain. Quantif., 11(1):199–234, 2023.
187. U. Shaham, A. Cloninger, and R. R. Coifman. Provable approximation properties for deep neural networks. Applied and Computational
Harmonic Analysis, 44(3):537–557, 2018.
188. S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning - From Theory to Algorithms. Cambridge University Press,
2014.
189. J. W. Siegel and J. Xu. High-order approximation rates for shallow neural networks with cosine and reluk activation functions. Applied
and Computational Harmonic Analysis, 58:1–26, 2022.
190. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam,
M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
191. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2014.
192. E. M. Stein. Singular integrals and differentiability properties of functions. Princeton Mathematical Series, No. 30. Princeton
University Press, Princeton, N.J., 1970.
193. I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.
194. D. Stutz, M. Hein, and B. Schiele. Disentangling adversarial robustness and generalization. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 6976–6987, 2019.
195. A. Sukharev. Optimal method of constructing best uniform approximations for functions of a certain class. USSR Computational
Mathematics and Mathematical Physics, 18(2):21–31, 1978.
196. T. Sun, L. Qiao, and D. Li. Nonergodic complexity of proximal inertial gradient descents. IEEE Trans. Neural Netw. Learn. Syst.,
32(10):4613–4626, 2021.
197. I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In S. Dasgupta
and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of
Machine Learning Research, pages 1139–1147, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
198. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with
convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
199. C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In
International Conference on Learning Representations (ICLR), 2014.
200. M. Tan and Q. V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International
Conference on Machine Learning, pages 6105–6114, 2019.

201. J. Tarela and M. Martínez. Region configurations for realizability of lattice piecewise-linear models. Mathematical and Computer
Modelling, 30(11):17–27, 1999.
202. J. M. Tarela, E. Alonso, and M. V. Martínez. A representation method for PWL functions oriented to parallel processing. Math.
Comput. Modelling, 13(10):75–83, 1990.
203. M. Telgarsky. Representation benefits of deep feedforward networks, 2015.
204. V. M. Tikhomirov. 𝜀-entropy and 𝜀-capacity of sets in functional spaces. Selected Works of AN Kolmogorov: Volume III: Information
Theory and the Theory of Algorithms, pages 86–170, 1993.
205. S. Tu, S. Venkataraman, A. C. Wilson, A. Gittens, M. I. Jordan, and B. Recht. Breaking locality accelerates block Gauss-Seidel. In
D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings
of Machine Learning Research, pages 3482–3491. PMLR, 06–11 Aug 2017.
206. L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
207. V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures
of complexity: festschrift for alexey chervonenkis, pages 11–30. Springer, 2015.
208. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need.
Advances in neural information processing systems, 30, 2017.
209. L. Venturi, A. S. Bandeira, and J. Bruna. Spurious valleys in one-hidden-layer neural network optimization landscapes. Journal of
Machine Learning Research, 20:133, 2019.
210. R. Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University
Press, 2018.
211. S. Wang and X. Sun. Generalization of hinging hyperplanes. IEEE Transactions on Information Theory, 51(12):4425–4431, 2005.
212. Z. Wang, A. Albarghouthi, G. Prakriya, and S. Jha. Interval universal approximation for neural networks. Proceedings of the ACM on
Programming Languages, 6(POPL):1–29, 2022.
213. E. Weinan, C. Ma, and L. Wu. Barron spaces and the compositional function spaces for neural network models. arXiv preprint
arXiv:1906.08039, 2019.
214. E. Weinan and S. Wojtowytsch. Representation formulas and pointwise properties for barron functions. Calculus of Variations and
Partial Differential Equations, 61(2):46, 2022.
215. A. C. Wilson, B. Recht, and M. I. Jordan. A lyapunov analysis of accelerated methods in optimization. Journal of Machine Learning
Research, 22(113):1–34, 2021.
216. A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In
I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information
Processing Systems, volume 30. Curran Associates, Inc., 2017.
217. C. Xiao, J.-Y. Zhu, B. Li, W. He, M. Liu, and D. Song. Spatially transformed adversarial examples. arXiv preprint arXiv:1801.02612,
2018.
218. H. Xu and S. Mannor. Robustness and generalization. Machine learning, 86:391–423, 2012.
219. D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw., 94:103–114, 2017.
220. D. Yarotsky and A. Zhevnerchuk. The phase diagram of approximation rates for deep neural networks. In H. Larochelle, M. Ranzato,
R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 13005–13015.
Curran Associates, Inc., 2020.
221. Y. Jiang, B. Neyshabur, H. Mobahi, D. Krishnan, and S. Bengio. Fantastic generalization measures and where to find them. In International
Conference on Learning Representations (ICLR), 2019.
222. X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 12104–12113, 2022.
