
On the Geometry of Deep Learning

Randall Balestriero*, Ahmed Imtiaz Humayun†, Richard G. Baraniuk‡

arXiv:2408.04809v1 [cs.LG] 9 Aug 2024

* Randall Balestriero is an Assistant Professor at Brown University. His email address is [email protected].
† Ahmed Imtiaz Humayun is a PhD student at Rice University. His email address is [email protected].
‡ Richard Baraniuk is the C. S. Burrus Professor at Rice University. His email address is [email protected].
*, †: Equal contributions.

Introduction

Machine learning has significantly advanced our ability to address a wide range of difficult computational problems and is the engine driving progress in modern artificial intelligence (AI). Today's machine learning landscape is dominated by deep (neural) networks, which are compositions of a large number of simple parameterized linear and nonlinear operators. An all-too-familiar story of the past decade is that of plugging a deep network into an engineering or scientific application as a black box, learning its parameter values using copious training data, and then significantly improving performance over classical task-specific approaches based on erudite practitioner expertise or mathematical elegance.

Despite this exciting empirical progress, however, the precise mechanisms by which deep learning works so well remain relatively poorly understood, adding an air of mystery to the entire field. Ongoing attempts to build a rigorous mathematical framework have been stymied by the fact that, while deep networks are locally simple, they are globally complicated. Hence, they have primarily been studied as "black boxes" and mainly empirically. This approach greatly complicates the analysis needed to understand both the success and failure modes of deep networks. It also greatly complicates deep learning system design, which today proceeds alchemistically rather than from rigorous design principles. And it greatly complicates addressing higher-level issues like trustworthiness (can we trust a black box?), sustainability (ever-growing computations lead to a growing environmental footprint), and social responsibility (fairness, bias, and beyond).

In this paper, we overview one promising avenue of progress at the mathematical foundation of deep learning: the connection between deep networks and function approximation by affine splines (continuous piecewise linear functions in multiple dimensions). In particular, we will overview work over the past decade on understanding certain geometrical properties of a deep network's affine spline mapping, in particular how it tessellates its input space. As we will see, the affine spline connection and geometrical viewpoint provide a powerful portal through which to view, analyze, and improve the inner workings of a deep network.

There are a host of interesting open mathematical problems in machine learning in general and deep learning in particular that are surprisingly accessible once one gets past the jargon. Indeed, as we will see, the core ideas can be understood by anyone knowing some linear algebra and calculus. Hence, we will pose numerous open questions as they arise in our exposition in the hopes that they entice more mathematicians to join the deep learning community.

The state of the art in deep learning is a rapidly moving target, and so we focus on the bedrock of modern deep networks, so-called feedforward neural networks employing piecewise linear activation functions. While our analysis does not fully cover some very recent methods, most notably transformer networks, the networks we study are employed therein as key building blocks. Moreover, since we focus on the affine spline viewpoint, we will not have the opportunity to discuss other interesting geometric work in deep learning, including tropical geometry [ZNL18] and beyond. Finally, to spin a consistent story line, we will focus primarily on work from our group; we will, however, review several key results developed by others. Our bibliography is concise, and so the interested reader is invited to explore the extensive works cited in the papers we reference.
Deep Learning

Machine learning in 200 words or less. In supervised machine learning, we are given a collection of n training data pairs {(x_i, y_i)}_{i=1}^n; x_i is termed the data and y_i the label. Without loss of generality, we will take x_i ∈ R^D, y_i ∈ R^C to be column vectors, but in practice they are often tensors.

We seek a predictor or model f with two basic properties. First, the predictor should fit the training data: f(x_i) ≈ y_i. When the predictor fits (near) perfectly, we say that it has interpolated the data. Second, the predictor should generalize to unseen data: f(x′) ≈ y′, where (x′, y′) is test data that does not appear in the training set. When we fit the training data but do not generalize, we say that we have overfit.

One solves the prediction problem by first designing a parameterized model f_Θ with parameters Θ and then learning or training by optimizing Θ to make f_Θ(x_i) as close as possible to y_i on average in terms of some distance or loss function L, which is often called the training error.

Deep networks. A deep network is a predictor or model constructed from the composition of L intermediate mappings called layers [GBCB16]

f_Θ(x) = (f^(L)_{θ^(L)} ∘ · · · ∘ f^(1)_{θ^(1)})(x).   (1)

Here Θ is the collection of parameters from each layer, θ^(ℓ), ℓ = 1, . . . , L. We will omit the parameters Θ or θ^(ℓ) from our notation except where they are critical, since they are ever-present in the discussion below.

The ℓ-th deep network layer f^(ℓ) takes as input the vector z^(ℓ−1) and outputs the vector z^(ℓ) by combining two simple operations

z^(ℓ) = f^(ℓ)(z^(ℓ−1)) = σ(W^(ℓ) z^(ℓ−1) + b^(ℓ)),   (2)

where z^(0) = x and z^(L) = ŷ = f(x). First the layer applies an affine transformation to its input. Second, in a standard abuse of notation, it applies a scalar nonlinear transformation — called the activation function σ — to each entry in the result. The entries of z^(ℓ) are called the layer-ℓ neurons or units, and the width of the layer is the dimensionality of z^(ℓ). When layers of the form (2) are used in (1), deep learners refer to the network as a multilayer perceptron (MLP).

The parameters θ^(ℓ) of the layer are the elements of the weight matrix W^(ℓ) and the bias vector b^(ℓ). Special network structures have been developed to reduce the generally quadratic cost of multiplying by the W^(ℓ). One notable class of networks constrains W^(ℓ) to be a circulant matrix, so that W^(ℓ) z^(ℓ−1) corresponds to a convolution, giving rise to the term ConvNet for such models. Even with this simplification, it is common these days to work with networks with billions of parameters.

The most widespread activation function in modern deep networks is the rectified linear unit (ReLU)

σ(u) = max{u, 0} =: ReLU(u).   (3)

Throughout this paper, we focus on networks that use this activation, although the results hold for any continuous piecewise linear nonlinearity (e.g., absolute value σ(u) = |u|). Special activations are often employed at the last layer f^(L), from the linear activation σ(u) = u to the softmax that converts a vector to a probability histogram. These activations do not affect our analysis below. It is worth pointing out, but beyond the scope of this paper, that it is possible to generalize the results we review below to a much larger class of smooth activation functions (e.g., sigmoid gated linear units, Swish activation) by adopting a probabilistic viewpoint [BB18].

The term "network" is used in deep learning because compositions of the form (1) are often depicted as such; see Figure 1.
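To make (1)-(3) concrete, here is a minimal NumPy sketch of the forward pass of an MLP with ReLU activations and a linear activation at the last layer; the layer widths and random parameters below are placeholders chosen only for illustration.

import numpy as np

def relu(u):
    # ReLU activation (3): elementwise max{u, 0}
    return np.maximum(u, 0.0)

def mlp_forward(x, weights, biases):
    """Compute f_Theta(x) in (1) by composing layers of the form (2)."""
    z = x
    for ell, (W, b) in enumerate(zip(weights, biases), start=1):
        pre = W @ z + b                                  # affine part of layer ell
        z = relu(pre) if ell < len(weights) else pre     # linear activation at the last layer
    return z

# Toy instantiation: D = 3 inputs, two hidden layers of width 4, C = 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))]
biases = [rng.normal(size=4), rng.normal(size=4), rng.normal(size=2)]
x = rng.normal(size=3)
y_hat = mlp_forward(x, weights, biases)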
Figure 1: A 6-layer deep network. Purple, blue, and yellow nodes represent the input, neurons, and output, respectively. The width of layer 2 is 5, for example. The links between the nodes represent the elements of the weight matrices W^(ℓ). The sum with the bias b^(ℓ) and subsequent activation σ(·) are implicitly performed at each neuron.

Learning. To learn to fit the training data with a deep network, we tune the parameters W^(ℓ), b^(ℓ), ℓ = 1, . . . , L such that, on average, when training datum x_i is input to the network, the output ŷ_i = f(x_i) is close to y_i as measured by some loss function L. Two loss functions are ubiquitous in deep learning. The first is the classical squared error based on the two-norm

L(Θ) := (1/n) Σ_{i=1}^n ∥y_i − f_Θ(x_i)∥_2^2.   (4)

The other is the cross-entropy, which is oft-used in classification tasks.

Standard learning practice is to use some flavor of gradient (steepest) descent that iteratively reduces L by updating the parameters W^(ℓ), b^(ℓ) by subtracting a small scalar multiple of the partial derivatives of L with respect to those parameters.

In practice, since the number of training data pairs n can be enormous, one calculates the gradient of L for each iteration using only a subset of data points and labels called a minibatch.
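The following sketch (ours, not a production recipe) puts these pieces together for a two-layer ReLU network: it draws a random minibatch, evaluates the squared-error loss (4), computes the gradients by hand, and takes a gradient descent step. In practice one would rely on automatic differentiation rather than the hand-coded gradients below; the data, sizes, learning rate, and iteration count are arbitrary.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic training pairs {(x_i, y_i)}: x_i in R^D, y_i in R^C.
n, D, C, H = 256, 5, 2, 32
X = rng.normal(size=(n, D))
Y = rng.normal(size=(n, C))

# Two-layer ReLU network parameters (a small instance of (1)-(2)).
W1, b1 = 0.1 * rng.normal(size=(H, D)), np.zeros(H)
W2, b2 = 0.1 * rng.normal(size=(C, H)), np.zeros(C)

lr, batch_size = 0.01, 32
for step in range(2000):
    # Minibatch: a random subset of the training pairs.
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, Yb = X[idx], Y[idx]

    # Forward pass.
    pre = Xb @ W1.T + b1              # pre-activations of the hidden layer
    Hid = np.maximum(pre, 0.0)        # ReLU (3)
    Yhat = Hid @ W2.T + b2

    # Squared-error loss (4) and its gradients (manual backpropagation).
    E = Yhat - Yb
    loss = np.mean(np.sum(E**2, axis=1))
    dYhat = 2.0 * E / batch_size
    dW2, db2 = dYhat.T @ Hid, dYhat.sum(axis=0)
    dHid = dYhat @ W2
    dpre = dHid * (pre > 0)           # derivative of the ReLU
    dW1, db1 = dpre.T @ Xb, dpre.sum(axis=0)

    # Gradient (steepest) descent: subtract a small multiple of the gradient.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2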
Note that even a nice loss function like (4) will have a multitude of local minima due to the nonlinear activation at each layer coupled with the composition of multiple such layers. Consequently, numerous heuristics have been developed to help navigate to high-performing local minima. In modern deep networks, the number of neurons is usually so enormous that, by suitably optimizing the parameters, one can nearly interpolate the training data. (We often drop the "nearly" below for brevity.) What distinguishes the performance of one deep network architecture from another, then, is what it does away from the training points, i.e., how well it generalizes to unseen data.

Deep nets break out. Despite neural networks existing in some form for over 80 years, their success was limited in practice until the AI boom of 2012. This sudden growth was enabled by three converging factors: i) going deep with many layers, ii) training on enormous data sets, and iii) new computing architectures based on graphics processing units (GPUs). The spark that ignited the AI boom was the Imagenet Challenge 2012, where teams competed to best classify a set of input images x into one of 1000 categories. The Imagenet training data was about n = 1.3 million 150,000-pixel color digital images human-labeled into 1000 classes, such as 'bird,' 'bus,' 'sofa.' 2012 was the first time a deep network won the Challenge; AlexNet, a ConvNet with 62 million parameters in five convolutional layers followed by three fully connected layers, achieved an accuracy of 60%. Subsequent competitions featured only deep networks, and, by the final competition in 2017, they had reached 81% accuracy, which is arguably better than most humans can achieve.

Black boxes. Deep networks with dozens of layers and millions of parameters are powerful for fitting and mimicking training data, but also inscrutable. Deep networks are created from such simple transformations (e.g., affine transform and thresholding) that it is maddening that the composition of several of them so complicates analysis and defies deep understanding. Consequently, deep learning practitioners tend to treat them as black boxes and proceed empirically using an alchemical development process that focuses primarily on the inputs x and outputs f(x) of the network. To truly understand deep networks we need to be able to see inside the black box as a deep network is learning and predicting. In the sequel, we will discuss one promising line of work in this vein that leverages the fact that deep networks are affine spline mappings.
Affine Splines

As we will now explain, deep networks are tractable multidimensional extensions of the familiar one-dimensional continuous piecewise linear functions depicted on the left in Figure 2. When such continuous piecewise linear functions are fit to training data, we refer to them as affine splines for short.

Figure 2: At left, a one-dimensional continuous piecewise linear function that we refer to as an affine spline. At right, the ReLU activation function (3) at the heart of many of today's deep networks.

Deep networks implement one particular extension of the affine spline concept to a multidimensional domain and range. As we will see in the next section, a deep network generalizes the intervals of the independent variable over which a piecewise affine function is simply affine (recall Figure 2) to an irregular tessellation (tiling) of the network's D-dimensional input space into convex polytopes. Let Ω denote the tessellation and ω ∈ Ω an individual tile. (The jargon for the polytope tiles is "linear region" [MPCB14].)

Generalizing the straight lines defining the function on each interval in Figure 2, a deep network creates an affine transformation on each tile such that the overall collection is continuous. Figure 3 depicts an example for a toy deep network with a two-dimensional input space; here the tiles are polygons. This all can be written as [BB21]

f(x) = Σ_{ω∈Ω} (A_ω x + c_ω) 1_{x∈ω},   (5)

where the matrix A_ω and vector c_ω define the affine transformation from tile ω to the output.

Both the tessellation Ω and the A_ω, c_ω of the affine transformations are functions of the deep network weights W^(ℓ) and biases b^(ℓ). Geometrically, envision Figure 3 with a cloud of n training data points (x_i, y_i);¹ learning uses optimization to adjust the weights and biases to create a tessellation and affine transformations such that the affine spline predictions ŷ_i come as close as possible to the true labels y_i as measured by the squared error loss (4), for example.

¹ We remove the boldface from the labels in this example because they are scalars.

Figure 3: Input space tessellation Ω of the two-dimensional input space (below) and affine spline mapping f(x) (above) for a toy deep network of depth L = 4 and width 20. Also depicted is a training data pair (x_i, y_i) and the prediction ŷ_i.
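As a sanity check on (5), the following sketch composes each layer's affine part, masked by the ReLU on/off pattern at a given input x, to recover the A_ω and c_ω of the tile containing x; the toy network and its random parameters are ours and purely illustrative.

import numpy as np

def local_affine_map(x, weights, biases):
    """Return (A_omega, c_omega, f(x)) such that, on the tile containing x,
    the ReLU network computes f(x) = A_omega @ x + c_omega as in (5)."""
    A, c = np.eye(x.size), np.zeros(x.size)
    z = x
    for ell, (W, b) in enumerate(zip(weights, biases), start=1):
        A, c = W @ A, W @ c + b                 # compose the layer's affine part
        pre = W @ z + b
        if ell < len(weights):                  # ReLU: zero out the inactive rows
            mask = (pre > 0).astype(float)
            A, c = mask[:, None] * A, mask * c
            z = np.maximum(pre, 0.0)
        else:
            z = pre
    return A, c, z

rng = np.random.default_rng(1)
weights = [rng.normal(size=(20, 2)), rng.normal(size=(20, 20)), rng.normal(size=(1, 20))]
biases = [rng.normal(size=20), rng.normal(size=20), rng.normal(size=1)]
x = rng.normal(size=2)
A, c, fx = local_affine_map(x, weights, biases)
assert np.allclose(A @ x + c, fx)   # (5) holds exactly on the tile containing x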
Deep Network Tessellation

As promised, let us now see how a deep network creates its input space tessellation [BB21]. Without loss of generality, we start with the first layer f^(1) whose input is x and output is z^(1). The k-th entry in z^(1) (the value of the k-th neuron) is calculated simply as

z_k^(1) = σ(w_k^(1) · x + b_k^(1)),   (6)

where the dot denotes the inner product, w_k^(1) is the k-th row of the weight matrix W^(1), and σ is the ReLU activation function (3). The quantity inside the activation function is the equation of a (D − 1)-dimensional hyperplane in the input space R^D that is perpendicular to w_k^(1) and offset from the origin by b_k^(1) / ∥w_k^(1)∥_2. This hyperplane bisects the input space into two half-spaces: one where z_k^(1) > 0 and one where z_k^(1) = 0.

The collection of hyperplanes corresponding to each neuron in z^(1) creates a hyperplane arrangement. It is precisely the intersections of the half-spaces of the hyperplane arrangement that tessellate the input space into convex polytope tiles (see Figure 4).

Figure 4: A deep network layer tessellates its input space into convex polytopal tiles via a hyperplane arrangement, with each hyperplane corresponding to one neuron at the output of the layer. In this two-dimensional example assuming ReLU activation, the red line indicates the one-dimensional hyperplane corresponding to the k-th neuron in the first layer.
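A quick numerical illustration of this arrangement (with sizes chosen arbitrarily): sample a window of a two-dimensional input space on a grid and record, for each point, which side of each neuron's hyperplane it falls on; the number of distinct sign patterns is the number of tiles visible in the window.

import numpy as np

rng = np.random.default_rng(2)
K, D = 6, 2                        # six layer-1 neurons, two-dimensional input space
W1, b1 = rng.normal(size=(K, D)), rng.normal(size=K)

# Sample the window [-3, 3]^2 on a fine grid and record, for every point,
# on which side of each neuron's hyperplane w_k . x + b_k = 0 it falls.
g = np.linspace(-3, 3, 400)
xx, yy = np.meshgrid(g, g)
pts = np.stack([xx.ravel(), yy.ravel()], axis=1)
signs = (pts @ W1.T + b1 > 0)      # boolean half-space code per grid point

# Distinct sign vectors = distinct tiles of the layer-1 arrangement visible in the window.
codes = np.unique(signs, axis=0)
print(f"{len(codes)} tiles of the K = {K} hyperplane arrangement intersect the window")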
The weights and biases of the first layer determine not only the tessellation of the input space but also an affine transformation on each tile to implement (5). Explicit formulas for A_ω, c_ω are available in [BCAB19]. It should be clear that, since all of the transformations in (6) are continuous, so must be the affine spline (5) corresponding to the first layer.

The tessellation corresponding to the composition of two or more layers follows an interesting subdivision process akin to a "tessellation of tessellations" [BCAB19]. For example, the second layer creates a hyperplane arrangement in its input space, which happens to be the output space of layer one. Thus, these hyperplanes can be pulled back through layer one to its input space by performing the same process as above but on a tile-by-tile basis relative to the layer-one tessellation and its associated affine transforms. The effect on the layer-two hyperplanes is that they are folded each time they cross a hyperplane created by layer one. Careful inspection of the tessellation in Figure 3 reveals many examples of such hyperplane folding. Similarly, the hyperplanes created by layer three will be folded every time they encounter a hyperplane in the input space from layers one or two.

Much can be said about this folding process, including a formula for the dihedral angle of a folded hyperplane as a function of the network's weights and biases. However, the formulae for the angles and affine transformations unfortunately become unwieldy for more than two layers. Finding simplifications for these attributes is an interesting open problem, as are the connections to other subdivision processes like wavelets and fractals.
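The folding is easy to observe numerically. In the sketch below (a toy two-layer network with parameters of our own choosing), a layer-2 neuron's pre-activation is evaluated over a grid of inputs; its zero set is the folded hyperplane, and the folds occur exactly where that set crosses a layer-1 hyperplane.

import numpy as np

rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)   # layer 1: four hyperplanes in R^2
w2, b2 = rng.normal(size=4), rng.normal()              # one layer-2 neuron

g = np.linspace(-3, 3, 600)
xx, yy = np.meshgrid(g, g)
pts = np.stack([xx.ravel(), yy.ravel()], axis=1)

# Pre-activation of the layer-2 neuron as a function of the network input.
pre2 = np.maximum(pts @ W1.T + b1, 0.0) @ w2 + b2

# Within each layer-1 tile the map x -> pre2(x) is affine, so its zero set is a line
# segment there; across tiles the segments join continuously, i.e., the layer-2
# "hyperplane" is folded wherever it crosses a layer-1 hyperplane.
sign = (pre2.reshape(xx.shape) > 0)
folds = np.logical_xor(sign[:, 1:], sign[:, :-1])      # sign flips between horizontal neighbors
print(f"{folds.sum()} horizontal grid edges cross the folded layer-2 hyperplane")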
The theory of hyperplane arrangements is rich and tells us that, generally speaking, the number of tiles grows rapidly with the number of neurons in each layer. Hence, we can expect even modestly sized deep networks to have an enormous number of tiles in their input space, each with a corresponding affine transformation from input to output space. Importantly, though, the affine transformations are highly coupled because the overall mapping (5) must remain continuous. This means that the class of functions that can be represented using a deep network is considerably smaller than if the mapping could be uncoupled and/or discontinuous. Understanding what deep learning practitioners call the network's "implicit bias" remains an important open problem.

Visualizing the Tessellation

The toy, low-dimensional examples in Figures 3 and 4 are useful for building intuition, but how can we gain insight into the tessellation of a deep network with thousands or more of input and output dimensions? One way to proceed is to compute summary statistics about the tessellation, such as how the number of tiles scales as we increase the width or depth of a network (e.g., [MPCB14]); more on this below. An alternative is to gain insight via direct visualization.

SplineCam is an exact method for computing and visualizing a deep network's spline tessellation over a specified low-dimensional region of the input space, typically a bounded two-dimensional planar slice [HBBB23]. SplineCam uses an efficient graph data structure to encode the intersections of the hyperplanes (from the various layers) that pass through the slice and then uses a fast heuristic breadth-first search algorithm to identify tiles from the graph. All of the computations besides the search can be vectorized and computed on GPUs to enable the visualization of even industrial-scale deep networks.
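SplineCam itself is exact and graph-based; as a crude stand-in that conveys the idea, one can sample a two-dimensional slice through the input space and count the distinct ReLU activation patterns encountered, each of which corresponds to a tile. The network, slice, and sizes below are arbitrary placeholders, and the count is only a lower bound on the number of tiles crossing the slice.

import numpy as np

def activation_pattern(x, weights, biases):
    """Concatenated ReLU on/off pattern over all hidden layers; one pattern per tile."""
    z, bits = x, []
    for W, b in zip(weights[:-1], biases[:-1]):
        pre = W @ z + b
        bits.append(pre > 0)
        z = np.maximum(pre, 0.0)
    return np.concatenate(bits)

rng = np.random.default_rng(4)
D = 64                                                   # stand-in for a high-dimensional input space
weights = [rng.normal(size=(32, D)) / np.sqrt(D),
           rng.normal(size=(32, 32)) / np.sqrt(32),
           rng.normal(size=(2, 32)) / np.sqrt(32)]
biases = [0.1 * rng.normal(size=32), 0.1 * rng.normal(size=32), np.zeros(2)]

# Two-dimensional planar slice spanned by three anchor points (e.g., three training images).
x0, x1, x2 = rng.normal(size=(3, D))
grid = np.linspace(0.0, 1.0, 60)
patterns = set()
for s in grid:
    for t in grid:
        x = x0 + s * (x1 - x0) + t * (x2 - x0)
        patterns.add(activation_pattern(x, weights, biases).tobytes())
print(f"at least {len(patterns)} distinct tiles intersect the sampled slice")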
Figure 5: SplineCam visualization of a two-dimensional slice through the affine spline tessellation of the 4096-dimensional input space of a 5-layer ConvNet of average width 160 trained to classify 64×64 digital photos of cats (Egyptian cats versus Tabby cats). The stars denote the three training images that define the plane and the red lines the decision boundaries between the two classes. (Adapted from [HBBB23].)

Figure 5 depicts a SplineCam slice along the plane defined by three training images for a 5-layer ConvNet trained to classify between Egyptian and Tabby cat photos. The first thing we notice is the extraordinarily large number of tiles in just this small region of the 4096-dimensional input space. It can be shown that the decision boundary separating Egyptian and Tabby cats corresponds to a single hyperplane in the final layer that is folded extensively from being pulled back through the previous four layers [BCAB19]. Photos falling in the lower left of the slice are classified as Tabbies, while photos falling in the lower right are classified as Egyptians. The density of tiles is also not uniform and varies across the input space.

An interesting avenue for future research involves the efficient extension of SplineCam to higher-dimensional slices, both for visualization and the computation of summary statistics.

The main goal of this paper is to demonstrate the broad range of insights that can be garnered into the inner workings of a deep network through a focused study of the geometry of its input space tessellation. To this end, we now tour five examples relating to deep network approximation, optimization, and data synthesis. But we would be remiss if we did not point to the significant progress that has been made leveraging other important aspects of the spline view of deep learning, such as understanding how affine splines emerge naturally from the regularization typically used in deep network optimization [Uns19] and what types of functions are learned by deep networks [PN22].

The Self-Similar Geometry of the Tessellation

It has been known since the late 1980s that even a two-layer neural network is a universal approximator, meaning that, as the number of neurons grows, one can approximate an arbitrary continuous function over a Borel measurable set to arbitrary precision [Cyb89]. But, unfortunately, while two-layer networks are easily capable of interpolating a set of training data, in practice they do a poor job generalizing to data outside of the training set. In contrast, deep networks with L ≫ 2 layers have proved over the past 15 years that they are capable of both interpolating and generalizing well.

Several groups have investigated the connections between a network's depth and its tessellation's capacity to better approximate.
[MPCB14] was the first to quantify the advantage of depth by counting the number of tiles and showing that deep networks create more tiles (and hence are more expressive) than shallow networks.

Further work has sought to link the self-similar nature of the tessellation to good approximation. Using self-similarity, one can construct new function spaces for which deeper networks provide better approximation rates (see [DHP21, DDF+22] and the references therein). The benefits of depth stem from the fact that the model is able to replicate a part of the function it is trying to approximate in many different places in the input space and with different scalings or orientations. Extending these results, which currently hold only for one-dimensional input and output spaces, to multidimensional signals is an interesting open research avenue.

Geometry of the Loss Function

Frankly, it seems an apparent miracle that deep network learning even works. Because of the composition of nonlinear layers and the myriad local minima of the loss function, deep network optimization remains an active area of empirical research. Here we look at one analytical angle that exploits the affine spline nature of deep networks.

Over the past decade, a menagerie of different deep network architectures has emerged that innovate in different ways on the basic architecture (1), (2). A natural question for the practitioner is: Which architecture should be preferred for a given task? Approximation capability does not offer a point of differentiation because, as we just discussed, as their size (number of parameters) grows, most deep networks attain a universal approximation capability.

Practitioners know that deep networks with skip connections

z^(ℓ) = σ(W^(ℓ) z^(ℓ−1) + b^(ℓ)) + z^(ℓ−1),   (7)

such as so-called ResNets, are much preferred over ConvNets, because empirically their gradient descent learning converges faster and more stably to a better minimum. In other words, it is not what a deep network can approximate that matters, but rather how it learns to approximate. Empirical studies have indicated that this is because the so-called loss landscape of the loss function L(Θ) navigated by gradient descent as it optimizes the deep network parameters is much smoother for ResNets as compared to ConvNets (see Figure 6). However, to date there has been no analytical work in this direction.

Figure 6: Loss landscape L(Θ) of a ConvNet and a ResNet as a function of the parameters W^(ℓ), b^(ℓ) (from [LXT+18]). The loss is a piecewise quadratic function.

Using the affine spline viewpoint, it is possible to analytically characterize the local properties of the deep network loss landscape and quantitatively compare different deep network architectures. The key is that, for the deep networks under our consideration trained by minimizing the squared error (4), the loss landscape L as a function of the deep network parameters W^(ℓ), b^(ℓ) is a continuous piecewise quadratic function [RBB23, SPD+20] that is amenable to analysis (see Figure 6).
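A small numerical illustration of this piecewise quadratic structure, simplified so that each piece is exactly quadratic: we trace the squared-error loss of a toy one-hidden-layer network along a random line that perturbs only the first layer's parameters, holding the rest fixed. Everything below (data, sizes, seed) is an arbitrary stand-in.

import numpy as np

rng = np.random.default_rng(5)
X, Y = rng.normal(size=(64, 2)), rng.normal(size=(64, 1))       # fixed training data
W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)
W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)

def squared_loss(W1_, b1_):
    """Squared-error loss (4) of a one-hidden-layer ReLU network, W2 and b2 held fixed."""
    H = np.maximum(X @ W1_.T + b1_, 0.0)
    return np.mean(np.sum((Y - (H @ W2.T + b2)) ** 2, axis=1))

# Trace the loss along a random line in the first layer's parameters. Within each
# fixed ReLU on/off pattern the network output is affine in t, so each piece of the
# curve below is exactly quadratic; pieces change whenever a ReLU flips on some x_i.
dW, db = rng.normal(size=W1.shape), rng.normal(size=b1.shape)
ts = np.linspace(-1.0, 1.0, 801)
slice_values = np.array([squared_loss(W1 + t * dW, b1 + t * db) for t in ts])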
The optimization of quadratic loss surfaces is well understood. In particular, the eccentricity of a quadratic loss landscape is governed by the singular values of the Hessian matrix containing the second-order quadratic terms. Less eccentric (more bowl shaped) losses are easier for gradient descent to quickly navigate to the bottom. Similarly, the local eccentricity of a continuous piecewise quadratic loss function and the width of each local minimum basin are governed by the singular values of a "local Hessian matrix" that is a function of not only the deep network parameters but also the deep network architecture. This enables us to quantitatively compare different deep network architectures in terms of their singular values.

In particular, we can make a fair, quantitative comparison between the loss landscapes of the ConvNet and ResNet architectures by comparing their singular values.
The key finding is that the condition number of a ResNet (the ratio of the largest to smallest singular value) is bounded, while that of the ConvNet is not [RBB23]. This means that the local loss landscape of a ResNet with skip connections is provably better conditioned than that of a ConvNet and thus less erratic, less eccentric, and with local minima that are more accommodating to gradient-based optimization.

Beyond analysis, one interesting future research avenue in this direction is converting this analytical understanding into new optimization algorithms that are more efficient than today's gradient descent approaches.

The Geometry of Initialization

As we just discussed, even for the prosaic squared error loss function (4), the loss landscape as a function of the parameters is highly nonconvex with myriad local minima. Since gradient descent basically descends to the bottom of the first basin it can find, where it starts (the initialization) really matters. Over the years, many techniques have been developed to improve the initialization and/or help gradient descent find better minima; here we look at one of them that is particularly geometric in nature.

With batch normalization, we modify the definition of the neural computation from (6) to

z_k^(1) = σ((w_k^(1) · x − μ_k^(1)) / ν_k^(1)),   (8)

where μ_k^(1) and ν_k^(1) are not learned by gradient descent but instead are directly computed as the mean and standard deviation of w_k^(1) · x_i over the training data inputs involved in each gradient step in the optimization. Importantly, this includes the very first step, and so batch normalization directly impacts the initialization from which we start iterating on the loss landscape.²

² As implemented in practice, batch normalization has two additional parameters that are learned as part of the gradient descent; however, [BB22] shows that these parameters have no effect on the optimization initialization and only a limited effect during learning as compared to μ and ν.
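The following toy sketch (our construction, not the full batch normalization recipe) contrasts (6) and (8) at initialization: with the data-driven μ_k and ν_k of (8), every neuron's hyperplane is recentered onto the batch, whereas with random biases some hyperplanes miss the data entirely. The data offset, sizes, and seed are arbitrary.

import numpy as np

rng = np.random.default_rng(6)
n, D, K = 512, 2, 64
X = rng.normal(size=(n, D)) + np.array([2.0, -1.0])    # training inputs, off-center on purpose

W1 = rng.normal(size=(K, D))                           # random initialization of layer 1
b1 = rng.normal(size=K)

# Standard layer (6): the hyperplanes sit wherever the random biases put them.
pre_standard = X @ W1.T + b1

# Batch-normalized layer (8): mu_k and nu_k are the mean and standard deviation of
# w_k . x_i over the batch, so every hyperplane w_k . x = mu_k passes through the
# bulk of the data before the first gradient step is ever taken.
proj = X @ W1.T                                        # w_k . x_i for every (i, k)
mu, nu = proj.mean(axis=0), proj.std(axis=0)
pre_batchnorm = (proj - mu) / nu

# Fraction of neurons whose hyperplane splits the batch (both signs present).
split = lambda P: np.mean(np.logical_and((P > 0).any(axis=0), (P <= 0).any(axis=0)))
print(f"hyperplanes cutting through the data: {split(pre_standard):.2f} (standard) "
      f"vs {split(pre_batchnorm):.2f} (batch norm)")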
Astute readers might see a connection to the standard data preprocessing step of data normalization and centering; the main difference is that this processing is performed before each and every gradient learning step. Batch normalization often greatly aids the optimization of a wide variety of deep networks, helping it to find a better (lower) minimum quicker. But the reasons for its efficacy are poorly understood.

We can make progress on understanding batch normalization by again leaning on the affine spline viewpoint. Let's focus on the effect of batch normalization at initialization, just before gradient learning begins; the effect is pronounced, and it is then easy to extrapolate regarding what happens at subsequent gradient steps. Prior to learning, a deep network's weights are initialized with random numbers. This means that the initial hyperplane arrangement is also random. The key finding of [BB22] is that batch normalization adapts the geometry of a deep network's spline tessellation to focus the network's attention on the training data x_i. It does this by adjusting the angles and offsets of the hyperplanes that form the boundaries of the polytopal tiles to increase their density in regions of the input space inhabited by the training data, thereby enabling finer approximation there.

More precisely, batch normalization directly adapts each layer's input space tessellation to minimize the total least squares distance between the tile boundaries and the training data. The resulting data-adaptive initialization aligns the spline tessellation with the data not just at initialization but before every gradient step to give the learning algorithm a much better chance of finding a quality set of weights and biases. See Figure 7 for a visualization.

Figure 7: Visualization of a set of two-dimensional data points x_i (black dots) and the input-space spline tessellation of a 4-layer toy deep network with random weights W^(ℓ), without batch normalization (left) and with batch normalization (right). The grey lines correspond to (folded) hyperplanes from the first three layers. The blue lines correspond to folded hyperplanes from the fourth layer. (Adapted from [BB22].)

Figure 8 provides clear evidence of batch normalization's adaptive prowess. We initialize an 11-layer deep network with a two-dimensional input space three different ways to train on data with a star-shaped distribution. We plot the density of the hyperplanes (basically, the number of hyperplanes passing through local regions of the input space) created by layers 3, 7, and 11 for three different layer configurations: i) the standard layer (6) with bias b^(ℓ) = 0; ii) the standard layer (6) with random bias b^(ℓ); iii) the batch normalization layer (8). In all three cases, the weights W^(ℓ) were initialized randomly.

Figure 8: Densities of the hyperplanes created by layers 3, 7, and 11 in the two-dimensional input space of an 11-layer deep network of width 1024 (rows: zero bias, random bias, batch norm; columns: data, ℓ = 3, ℓ = 7, ℓ = 11). The training data consists of 50 samples from a star-shaped distribution. (Adapted from [BB22].)

We can make several observations. First, constraining the bias to be zero forces the network into a central hyperplane arrangement tessellation that is not amenable to aligning with the data. Second, randomizing both the weights and biases splays the tiles over the entire input space, including many places where the training data is not. Third, batch normalization focuses the hyperplanes from all three of the layers onto the regions where the training data lives.

One interesting avenue for future research in this direction is developing new normalization schemes that replace the total least squares optimization to enforce a specific kind of adaptivity of the tessellation to the data and task at hand.

The Dynamic Geometry of Learning

Current deep learning practice treats a deep network as a black box and optimizes its internal parameters (weights and biases) to minimize some end-to-end training error like the squared loss in (4). While this approach has proved mightily successful empirically, it provides no insight into how learning is going on inside the network nor how to improve it. Clearly, as we adjust the parameters to decrease the loss function using gradient descent, the tessellation will change dynamically. Can we use this insight to learn something new about what goes on inside a deep network during learning?

Consider a deep network learning to classify photos of handwritten digits 0–9. Figure 9 deploys SplineCam to visualize a portion of a 2D slice of the input space of the network defined by three data points in the MNIST handwritten digit training dataset [HBB24]. At left, we see that the tessellation at initialization (before we start learning) is in disarray due to the random weights and biases and nonuse of batch normalization (more on this later). The tessellation is random, and the training error is large.

Figure 9: SplineCam visualization of a slice of the input space defined by three MNIST digits being classified by a 4-layer MLP of width 200; panels, left to right: initialization (LC = 4.91), interpolation (LC = 2.96), grokking (LC = 0.142). The false color map (viridis) encodes the 2-norm of the A_ω matrix defined on each tile according to purple (low), green (medium), yellow (high). The decision boundary is depicted in red. (Adapted from [HBB24].)

In the middle, we see the tessellation after convergence to near-zero training error, when most of the digits are on the correct side of their respective decision boundaries. Not shown by the figure is the fact that the network also generalizes well to unseen test data at this juncture.
High density suggests that even a continuous piecewise affine function can be quite rugged around these points [BPB20]. Indeed, the false coloring indicates that the 2-norms of the A_ω matrices have increased around the training images, meaning that their "slopes" have increased. As a consequence, the overall spline mapping f(x) is now likely more rugged and more sensitive to changes in the input x as measured by a local (per-tile) Lipschitz constant. In summary, at (near) interpolation, the gradient learning iterations have in some sense accomplished their task (near zero training error) but with elevated sensitivity of f(x) to changes in x around the training data points as compared to the random initialization.

Interpolation is the point that would typically be recommended to stop training and fix the network for use in an application. But let's see what happens if we continue training about 37 times longer. At right in Figure 9, we see that, while the training error has not changed after continued training (it is still near zero, meaning correct classification of nearly all the training data), the tessellation has metamorphosed. There are now only half as many tiles in this region, and they have all migrated to define the decision boundary, where presumably they are being used to create sharp decisions. Around the training data, we now have a very low density of tiles with low 2-norm of their A_ω matrices, and thus presumably a much smoother mapping f(x). Hence, the sensitivity of f(x) as measured by a local Lipschitz constant will be much lower than just after interpolation.

We designate this state of affairs delayed robustness; it is one facet of the general phenomenon of grokking that has been discovered only recently [PBE+22]. A dirty secret of deep networks is that f(x) can be quite unstable to small changes in x (which seems expected given the high degree of nonlinearity). This instability makes deep networks less robust and more prone to attacks like causing a 'barn' image to be classified as a 'pig' by adding a nearly undetectable but carefully designed attack signal to the picture of a barn. Continuing learning to achieve grokking and delayed robustness is a new approach to mitigating such attacks in particular and making deep learning more stable and predictable in general.

Can we translate the visualization of Figure 9 into a metric that can be put into practice to compare or improve deep networks? This is an open research question, but here are some first steps [HBB24]. Define the local complexity (LC) as the number of tiles in a neighborhood V around a point x in the input space. While exact computation of the LC is combinatorially complex, an upper bound can be obtained in terms of the number of hyperplanes that intersect V according to Zaslavsky's Theorem, with the assumption that V is small enough that the hyperplanes are not folded inside V. Therefore, we can use the number of hyperplanes intersecting V as a proxy for the number of tiles in V.
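Here is a rough Monte-Carlo stand-in for this proxy (not the exact procedure of [HBB24]): sample points in a small ball V around x and count the neurons, across all hidden layers, whose pre-activation changes sign within V. The toy network, radius, and sample count are arbitrary choices of ours.

import numpy as np

def local_complexity(x, weights, biases, radius=0.1, n_samples=200, rng=None):
    """Crude LC proxy: number of neurons (over all hidden layers) whose
    pre-activation changes sign somewhere in a ball V of the given radius around x."""
    rng = rng or np.random.default_rng(0)
    pts = x + radius * rng.normal(size=(n_samples, x.size)) / np.sqrt(x.size)
    signs, Z = [], pts
    for W, b in zip(weights[:-1], biases[:-1]):
        pre = Z @ W.T + b
        signs.append(pre > 0)
        Z = np.maximum(pre, 0.0)
    S = np.concatenate(signs, axis=1)                 # (n_samples, total hidden neurons)
    crossing = np.logical_and(S.any(axis=0), np.logical_not(S).any(axis=0))
    return int(crossing.sum())                        # estimated hyperplanes intersecting V

rng = np.random.default_rng(7)
weights = [rng.normal(size=(200, 10)), rng.normal(size=(200, 200)) / np.sqrt(200),
           rng.normal(size=(10, 200))]
biases = [0.1 * rng.normal(size=200), 0.1 * rng.normal(size=200), np.zeros(10)]
x = rng.normal(size=10)
print("LC estimate around x:", local_complexity(x, weights, biases, rng=rng))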
For the experiment reported in Figure 9, we computed the LC in the neighborhood of each training data point in the entire training dataset and then averaged those values. From the above discussion, high LC around a point x in the input space implies small, dense tiles in that region and a potentially unsmooth and unstable mapping f(x) around x. The values reported in Figure 9 confirm that the LC does indeed capture the intuition that we garnered visually. One interesting potential application of the LC is as a new progress measure that serves as a proxy for a deep network's expressivity; LC is task-agnostic yet informative of the training dynamics.

Open research questions regarding the dynamics of deep network learning abound. At a high level, it is clear from Figure 9 that the classification function being learned has its curvature concentrated at the decision boundary. Approximation theory would suggest that a free-form spline should indeed concentrate its tiles around the decision boundary to minimize the approximation error. However, it is not clear why that migration occurs so late in the training process.

Another interesting research direction is the interplay between grokking and batch normalization, which we discussed above. Batch normalization provably concentrates the tessellation near the data samples, but to grok we need the tiles to move away from the samples. Hence, it is clear that batch normalization and grokking compete with each other. How to get the best of both worlds at both ends of the gradient learning timeline is an open question.
The Geometry of Generative Models

A generative model aims to learn the underlying patterns in the training data in order to generate new, similar data. The current crop of deep generative models includes transformer networks that power large language models for text synthesis and chatbots and diffusion models for image synthesis. Here we investigate the geometry of models that until recently were state-of-the-art, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), which are often based on ReLU and other piecewise linear activation functions.

Deep generative models map from a typically low-dimensional Euclidean input space (called the parameter space) to a manifold M of roughly the same dimension in a high-dimensional output space. Each point x in the parameter space synthesizes a corresponding output point ŷ = f(x) on the manifold (e.g., a picture of a bedroom). Training on a large number of images y_i learns an approximation to the mapping f from the parameter space to the manifold. It is beyond the scope of this review, but learning the parameters of a deep generative model is usually more involved than simple gradient descent [GBCB16]. It is useful for both training and synthesis to view the points x from the parameter space as governed by some probability distribution, e.g., uniform over a bounded region of the input space.

In the case of a GAN based on ReLU or similar activation functions, the manifold M is a continuous piecewise affine manifold;³ see Figure 10. Points on the manifold are given by (5) as x sweeps through the input space.

³ We allow M to intersect itself transversally in this setting.

Figure 10: A ReLU-based deep generative network manifold M is continuous and piecewise affine. Each affine spline tile ω in the input space is mapped by an affine transformation to a corresponding tile M(ω) on the manifold.

A major issue with deep generative models is that, if the training data is not carefully sourced and curated, then they can produce biased outputs. A deep generative model like a GAN or VAE is trained to approximate both the structure of the true data manifold from which the training data was sampled and the data distribution on that manifold. However, all too often in practice, training data are obtained based on preferences, costs, or convenience factors that produce artifacts in the training data distribution on the manifold. Indeed, it is common in practice for there to be more training data points in one part of the manifold than another. For example, a large fraction of the faces in the CelebA dataset are smiling, and a large fraction of those in the FFHQ dataset are female with dark hair. When one samples uniformly from a model trained with such biased data, the biases will be reproduced when sampling from the trained model, which has far-reaching implications for algorithmic fairness and beyond.
We can both understand and ameliorate sampling biases in deep generative models by again leveraging their affine spline nature. The key insight for the bias issue is that the tessellation of the input space is carried over onto the manifold. Each convex tile ω in the input space is mapped to a convex tile M(ω) on the manifold using the affine transform

M(ω) = {A_ω x + c_ω : x ∈ ω},   (9)

and the manifold M is the union of the M(ω). This straightforward construction enables us to analytically characterize many properties of M via (5).

In particular, it is easy to show that the mapping (9) from the input space to the manifold warps the tiles in the input space tessellation by A_ω, causing their volume to expand or contract by

vol(M(ω)) / vol(ω) = √(det(A_ω^⊤ A_ω)).   (10)

Knowing this, we can take any trained and fixed generative model and determine a nonuniform sampling of the input space according to (10) such that the sampling on the manifold is provably uniform and free from bias. The bonus is that this procedure, which we call MAximum entropy Generative NETwork (MaGNET) [HBB22a], is simply a post-processing procedure that does not require any retraining of the network.
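The sketch below illustrates the idea on a toy ReLU generator of our own construction: it computes the per-tile matrix A_ω at sampled latent points, weights each sample by the volume change (10), and resamples the latents with those weights so that the corresponding outputs are spread more evenly over the generated manifold. It is only a cartoon of MaGNET, which operates on large pre-trained generators.

import numpy as np

def generator_jacobian(z, weights, biases):
    """Exact A_omega of a ReLU generator on the tile containing z (cf. (5) and (9))."""
    A, h = np.eye(z.size), z
    for ell, (W, b) in enumerate(zip(weights, biases), start=1):
        pre = W @ h + b
        A = W @ A
        if ell < len(weights):
            mask = (pre > 0).astype(float)
            A = mask[:, None] * A
            h = np.maximum(pre, 0.0)
        else:
            h = pre
    return A

def generator_forward(z, weights, biases):
    h = z
    for ell, (W, b) in enumerate(zip(weights, biases), start=1):
        h = W @ h + b
        if ell < len(weights):
            h = np.maximum(h, 0.0)
    return h

rng = np.random.default_rng(8)
d, P = 2, 16                                  # latent (parameter-space) and output dimensions
weights = [rng.normal(size=(32, d)), rng.normal(size=(32, 32)) / np.sqrt(32), rng.normal(size=(P, 32))]
biases = [rng.normal(size=32), rng.normal(size=32), np.zeros(P)]

# Draw latent samples and compute the per-tile volume change (10) for each.
Z = rng.uniform(-1.0, 1.0, size=(2000, d))
vol = np.empty(len(Z))
for i, z in enumerate(Z):
    A = generator_jacobian(z, weights, biases)
    vol[i] = np.sqrt(np.linalg.det(A.T @ A))

# MaGNET-style reweighting: resample latents with probability proportional to (10) so
# that the synthesized outputs are spread (approximately) uniformly over the manifold.
probs = vol / vol.sum()
picked = rng.choice(len(Z), size=500, p=probs, replace=True)
uniform_on_manifold = np.array([generator_forward(Z[i], weights, biases) for i in picked])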
Figure 11 demonstrates MaGNET's debiasing abilities. On the left are 18 faces synthesized by the StyleGAN2 generative model trained on the FFHQ face dataset. On the right are 18 faces synthesized by the same StyleGAN2 generative model but using a nonuniform sampling distribution on the input space based on (10). MaGNET sampling yields a better gender, age, and hair color balance as well as more diverse backgrounds and accessories. In fact, MaGNET sampling produces 41% more male faces (as determined by the Microsoft Cognitive API) to balance out the gender distribution.

Figure 11: Images synthesized by sampling uniformly from the input space of a StyleGAN2 deep generative model trained on the FFHQ face data set (left) and nonuniformly according to (10) using MaGNET (right). (Adapted from [HBB22a].)

We can turn the volumetric deformation (10) into a tool to efficiently explore the data distribution on a deep generative model's manifold. By following the MaGNET sampling approach but using an input sampling distribution based on det(A_ω^⊤ A_ω)^ρ, we can synthesize images in the modes (high probability regions of the manifold that are more "typical and high quality") using ρ → −∞ or in the anti-modes (low probability regions of the manifold that are more "diverse and exploratory") using ρ → ∞ [HBB22b]. Setting ρ = 0 returns the model to uniform sampling.

Like MaGNET, this polarity sampling approach applies to any pre-trained generative network and so has broad applicability. See Figure 12 for an illustrative toy example and [HBB22b] for numerous examples with large-scale generative models, including using polarity sampling to boost the performance of existing generative models to state-of-the-art.

Figure 12: Polarity-guided synthesis of points in the plane by a Wasserstein GAN generative model. When the polarity parameter ρ = 0, the model produces a data distribution closely resembling the training data. When the polarity parameter ρ ≪ 0 (ρ ≫ 0), the WGAN produces a data distribution focusing on the modes (anti-modes), the high (low) probability regions of the training data. (From [HBB22b].)

There are many interesting open research questions around affine splines and deep generative networks. One related to the MaGNET sampling strategy is that it assumes that the trained generative network actually learned a good enough approximation of the true underlying data manifold. One could envision exploring how MaGNET could be used to test such an assumption.
Discussion and Outlook

While there are several ways to envision extending the concept of a one-dimensional affine spline (recall Figure 2) to high-dimensional functions and operators, progress has been made only along the direction of forcing the tessellation of the domain to hew to some kind of grid (e.g., uniform or multiscale uniform for spline wavelets). Such constructions are ill-suited for machine learning problems in high dimensions due to the so-called curse of dimensionality that renders approximation intractable.

We can view deep networks as a tractable mechanism for emulating those most powerful of splines, the free-knot splines (splines like those in Figure 2 where the intervals partitioning the real line are arbitrary), in high dimensions. A deep network uses the power of a hyperplane arrangement to tractably create a myriad of flexible convex polytopal tiles that tessellate its input space, plus affine transformations on each, that result in quite powerful approximation capabilities in theory [DHP21] and in practice. There is much work to do in studying these approximations (e.g., developing realistic function approximation classes and proving approximation rates) as well as developing new deep network architectures that attain improved rates and robustness.

An additional timely research direction involves extending the ideas discussed here to deep networks like transformers that employ at least some nonlinearities that are not piecewise linear. The promising news is that the bulk of the learnable parameters in state-of-the-art transformers lie in readily analyzable affine spline layers within each transformer block of the network. Hence, we can apply many of the above ideas, including local complexity (LC) estimation, to study the smoothness, expressivity, and sensitivity characteristics of even monstrously large language models like the GPT, Gemini, and Llama series.

We hope that we have convinced you that viewing deep networks as affine splines provides a powerful geometric toolbox to better understand how they learn, how they operate, and how they can be improved in a principled fashion. But splines are just one interesting research direction in the mathematics of deep learning. These are early days, and there are many more open than closed research questions.

Acknowledgments

Thanks to T. Mitchell Roddenberry and Ali Siahkoohi for their comments on the manuscript. AIH and RGB were supported by NSF grants CCF-1911094, IIS-1838177, and IIS-1730574; ONR grants N00014-18-1-2571, N00014-20-1-2534, N00014-23-1-2714, and MURI N00014-20-1-2787; AFOSR grant FA9550-22-1-0060; DOI grant 140D0423C0076; and a Vannevar Bush Faculty Fellowship, ONR grant N00014-18-1-2047.

References

[BB18] Randall Balestriero and Richard Baraniuk, From hard to soft: Understanding deep network nonlinearities via vector quantization and statistical inference, International Conference on Learning Representations (ICLR) (2018).
[BB21] ———, Mad Max: Affine spline insights into deep learning, Proceedings of the IEEE 109 (2021), no. 5, 704–727.
[BB22] Randall Balestriero and Richard G. Baraniuk, Batch normalization explained, arXiv preprint arXiv:2209.14778 (2022).
[BCAB19] Randall Balestriero, Romain Cosentino, Behnaam Aazhang, and Richard Baraniuk, The geometry of deep networks: Power diagram subdivision, Advances in Neural Information Processing Systems (NIPS) 32 (2019).
[BPB20] Randall Balestriero, Sebastien Paris, and Richard Baraniuk, Analytical probability distributions and exact expectation-maximization for deep generative networks, Advances in Neural Information Processing Systems (NeurIPS) (2020).
[Cyb89] George Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals, and Systems 2 (1989), no. 4, 303–314.
[DDF+22] Ingrid Daubechies, Ronald DeVore, Simon Foucart, Boris Hanin, and Guergana Petrova, Nonlinear approximation and (deep) ReLU networks, Constructive Approximation 55 (2022), no. 1, 127–172.
[DHP21] Ronald DeVore, Boris Hanin, and Guergana Petrova, Neural network approximation, Acta Numerica 30 (2021), 327–444.
[GBCB16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016.
[HBB22a] Ahmed Imtiaz Humayun, Randall Balestriero, and Richard Baraniuk, MaGNET: Uniform sampling from deep generative network manifolds without retraining, International Conference on Learning Representations (ICLR) (2022).
[HBB22b] ———, Polarity sampling: Quality and diversity control of pre-trained generative networks via singular values, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022).
[HBB24] ———, Deep networks always grok and here is why, International Conference on Machine Learning (ICML) (2024).
[HBBB23] Ahmed Imtiaz Humayun, Randall Balestriero, Guha Balakrishnan, and Richard Baraniuk, SplineCam: Exact visualization and characterization of deep network geometry and decision boundaries, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023).
[LXT+18] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein, Visualizing the loss landscape of neural nets, Advances in Neural Information Processing Systems (NIPS) 31 (2018).
[MPCB14] Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio, On the number of linear regions of deep neural networks, Advances in Neural Information Processing Systems (NIPS) (2014).
[PBE+22] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra, Grokking: Generalization beyond overfitting on small algorithmic datasets, arXiv preprint arXiv:2201.02177 (2022).
[PN22] Rahul Parhi and Robert Nowak, What kinds of functions do deep neural networks learn? Insights from variational spline theory, SIAM Journal on Mathematics of Data Science 4 (2022), no. 2, 464–489.
[RBB23] Rudolf Riedi, Randall Balestriero, and Richard Baraniuk, Singular value perturbation and deep network optimization, Constructive Approximation 57 (2023), no. 2.
[SPD+20] Justin Sahs, Ryan Pyle, Aneel Damaraju, Josue Ortega Caro, Onur Tavaslioglu, Andy Lu, and Ankit Patel, Shallow univariate ReLU networks as splines: Initialization, loss surface, Hessian, & gradient flow dynamics, arXiv preprint arXiv:2008.01772 (2020).
[Uns19] Michael Unser, A representer theorem for deep neural networks, Journal of Machine Learning Research 20 (2019), no. 110, 1–30.
[ZNL18] Liwen Zhang, Gregory Naitzat, and Lek-Heng Lim, Tropical geometry of deep neural networks, International Conference on Machine Learning (ICML) (2018).
