On The Geometry of Deep Learning

Randall Balestriero, Ahmed Imtiaz Humayun, Richard G. Baraniuk
deep learning, including tropical geometry [ZNL18] and beyond. Finally, to spin a consistent story line, we will focus primarily on work from our group; we will however review several key results developed by others. Our bibliography is concise, and so the interested reader is invited to explore the extensive works cited in the papers we reference.

Deep Learning

Machine learning in 200 words or less. In supervised machine learning, we are given a collection of n training data pairs {(x_i, y_i)}_{i=1}^n; x_i is termed the data and y_i the label. Without loss of generality, we will take x_i ∈ R^D, y_i ∈ R^C to be column vectors, but in practice they are often tensors.

We seek a predictor or model f with two basic properties. First, the predictor should fit the training data: f(x_i) ≈ y_i. When the predictor fits (near) perfectly, we say that it has interpolated the data. Second, the predictor should generalize to unseen data: f(x′) ≈ y′, where (x′, y′) is test data that does not appear in the training set. When we fit the training data but do not generalize, we say that we have overfit.

One solves the prediction problem by first designing a parameterized model f_Θ with parameters Θ and then learning or training by optimizing Θ to make f_Θ(x_i) as close as possible to y_i on average in terms of some distance or loss function L, which is often called the training error.

Deep networks. A deep network is a predictor or model constructed from the composition of L intermediate mappings called layers [GBCB16]

f_Θ(x) = ( f_{θ^(L)}^{(L)} ∘ · · · ∘ f_{θ^(1)}^{(1)} )(x).   (1)

Here Θ is the collection of parameters from each layer, θ^(ℓ), ℓ = 1, . . . , L. We will omit the parameters Θ or θ^(ℓ) from our notation except where they are critical, since they are ever-present in the discussion below.

The ℓ-th deep network layer f^(ℓ) takes as input the vector z^(ℓ−1) and outputs the vector z^(ℓ) by combining two simple operations

z^(ℓ) = f^(ℓ)(z^(ℓ−1)) = σ( W^(ℓ) z^(ℓ−1) + b^(ℓ) ),   (2)

where z^(0) = x and z^(L) = ŷ = f(x). First the layer applies an affine transformation to its input. Second, in a standard abuse of notation, it applies a scalar nonlinear transformation — called the activation function σ — to each entry in the result. The entries of z^(ℓ) are called the layer-ℓ neurons or units, and the width of the layer is the dimensionality of z^(ℓ). When layers of the form (2) are used in (1), deep learners refer to the network as a multilayer perceptron (MLP).

The parameters θ^(ℓ) of the layer are the elements of the weight matrix W^(ℓ) and the bias vector b^(ℓ). Special network structures have been developed to reduce the generally quadratic cost of multiplying by the W^(ℓ). One notable class of networks constrains W^(ℓ) to be a circulant matrix, so that W^(ℓ) z^(ℓ) corresponds to a convolution, giving rise to the term ConvNet for such models. Even with this simplification, it is common these days to work with networks with billions of parameters.

The most widespread activation function in modern deep networks is the rectified linear unit (ReLU)

σ(u) = max{u, 0} =: ReLU(u).   (3)

Throughout this paper, we focus on networks that use this activation, although the results hold for any continuous piecewise linear nonlinearity (e.g., the absolute value σ(u) = |u|). Special activations are often employed at the last layer f^(L), from the linear activation σ(u) = u to the softmax that converts a vector into a probability histogram. These activations do not affect our analysis below. It is worth pointing out, but beyond the scope of this paper, that it is possible to generalize the results we review below to a much larger class of smooth activation functions (e.g., sigmoid gated linear units, the Swish activation) by adopting a probabilistic viewpoint [BB18].
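To make the notation concrete, here is a minimal NumPy sketch of the forward pass (1)-(3); the layer widths and random parameters are illustrative placeholders rather than anything prescribed by the paper.

```python
import numpy as np

def relu(u):
    # Eq. (3): elementwise max{u, 0}
    return np.maximum(u, 0.0)

def mlp_forward(x, weights, biases):
    """Evaluate f_Theta(x) = f^(L) o ... o f^(1)(x) for an MLP, Eqs. (1)-(2)."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = relu(W @ z + b)              # Eq. (2): affine map, then scalar nonlinearity
    return weights[-1] @ z + biases[-1]  # last layer with a linear activation

# Illustrative sizes: D = 4 inputs, two hidden layers of width 8, C = 3 outputs.
rng = np.random.default_rng(0)
dims = [4, 8, 8, 3]
weights = [rng.standard_normal((m, n)) for n, m in zip(dims[:-1], dims[1:])]
biases = [rng.standard_normal(m) for m in dims[1:]]
y_hat = mlp_forward(rng.standard_normal(4), weights, biases)
```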
The term "network" is used in deep learning because compositions of the form (1) are often depicted as such; see Figure 1.

Learning. To learn to fit the training data with a deep network, we tune the parameters W^(ℓ), b^(ℓ), ℓ = 1, . . . , L such that, on average, when training datum x_i is input to the network, the output ŷ_i = f(x_i) is close to y_i as measured by some loss function L. Two loss functions are ubiquitous in deep learning;
the first is the classical squared error based on the two-norm

L(Θ) := (1/n) Σ_{i=1}^{n} ∥ y_i − f_Θ(x_i) ∥_2^2.   (4)

The other is the cross-entropy, which is oft-used in classification tasks.

Figure 1: A 6-layer deep network. Purple, blue, and yellow nodes represent the input, neurons, and output, respectively. The width of layer 2 is 5, for example. The links between the nodes represent the elements of the weight matrices W^(ℓ). The sum with the bias b^(ℓ) and subsequent activation σ(·) are implicitly performed at each neuron.

Standard learning practice is to use some flavor of gradient (steepest) descent that iteratively reduces L by updating the parameters W^(ℓ), b^(ℓ) by subtracting a small scalar multiple of the partial derivatives of L with respect to those parameters.

In practice, since the number of training data pairs n can be enormous, one calculates the gradient of L for each iteration using only a subset of data points and labels called a minibatch.
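The loop just described can be sketched as follows; this is our illustration, and `grad_loss` is a hypothetical stand-in for whatever computes the gradients of the loss (4) on a minibatch (backpropagation or an autodiff library in practice).

```python
import numpy as np

def sgd_epoch(params, X, Y, grad_loss, lr=1e-2, batch_size=32, rng=None):
    """One pass over the training set with minibatch gradient descent.

    `params` is a list of arrays (the W^(l) and b^(l)); `grad_loss(params, Xb, Yb)`
    is assumed to return one gradient array per parameter array."""
    rng = rng or np.random.default_rng()
    order = rng.permutation(len(X))               # visit the data in random order
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]     # one minibatch of data and labels
        grads = grad_loss(params, X[idx], Y[idx])
        for p, g in zip(params, grads):
            p -= lr * g                           # subtract a small multiple of the gradient
    return params
```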
Note that even a nice loss function like (4) will have a multitude of local minima due to the nonlinear activation at each layer coupled with the composition of multiple such layers. Consequently, numerous heuristics have been developed to help navigate to high-performing local minima. In modern deep networks, the number of neurons is usually so enormous that, by suitably optimizing the parameters, one can nearly interpolate the training data. (We often drop the "nearly" below for brevity.) What distinguishes the performance of one deep network architecture from another, then, is what it does away from the training points, i.e., how well it generalizes to unseen data.

Deep nets break out. Despite neural networks existing in some form for over 80 years, their success was limited in practice until the AI boom of 2012. This sudden growth was enabled by three converging factors: i) going deep with many layers, ii) training on enormous data sets, and iii) new computing architectures based on graphics processing units (GPUs).

The spark that ignited the AI boom was the Imagenet Challenge 2012, where teams competed to best classify a set of input images x into one of 1000 categories. The Imagenet training data was about n = 1.3 million, 150,000-pixel color digital images human-labeled into 1000 classes, such as 'bird,' 'bus,' 'sofa.' 2012 was the first time a deep network won the Challenge; AlexNet, a ConvNet with 62 million parameters in five convolutional layers followed by three fully connected layers, achieved an accuracy of 60%. Subsequent competitions featured only deep networks, and, by the final competition in 2017, they had reached 81% accuracy, which is arguably better than most humans can achieve.

Black boxes. Deep networks with dozens of layers and millions of parameters are powerful for fitting and mimicking training data, but also inscrutable. Deep networks are created from such simple transformations (e.g., affine transform and thresholding) that it is maddening that the composition of several of them so complicates analysis and defies deep understanding. Consequently, deep learning practitioners tend to treat them as black boxes and proceed empirically using an alchemical development process that focuses primarily on the inputs x and outputs f(x) of the network. To truly understand deep networks we need to be able to see inside the black box as a deep network is learning and predicting. In the sequel, we will discuss one promising line of work in this vein that leverages the fact that deep networks are affine spline mappings.
where the dot denotes the inner product, w_k^(1) is the k-th row of the weight matrix W^(1), and σ is the ReLU activation function (3). The quantity inside the activation function is the equation of a (D − 1)-dimensional hyperplane in the input space R^D that is perpendicular to w_k^(1) and offset from the origin by b_k^(1)/∥w_k^(1)∥_2. This hyperplane bisects the input space into two half-spaces: one where z_k^(1) > 0 and one where z_k^(1) = 0.

The collection of hyperplanes corresponding to each neuron in z^(1) creates a hyperplane arrangement. It is precisely the intersections of the half-spaces of the hyperplane arrangement that tessellate the input space into convex polytope tiles (see Figure 4).
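As a small illustration of ours (not an excerpt from the paper), the tile containing an input can be read off from the sign pattern of its first-layer pre-activations: two inputs with the same pattern lie in the same tile of the layer-one tessellation.

```python
import numpy as np

def tile_code(X, W1, b1):
    """For each row of X, record the side of every layer-1 hyperplane
    w_k . x + b_k = 0 it falls on; equal codes mean same tile."""
    return X @ W1.T + b1 > 0

# Toy example: 6 random hyperplanes tessellating the plane.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((6, 2)), rng.standard_normal(6)
X = rng.uniform(-3, 3, size=(10000, 2))        # random probe points
codes = tile_code(X, W1, b1)
print(len(np.unique(codes, axis=0)))           # number of distinct tiles hit by the probes
```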
Figure 4: A deep network layer tessellates its input space into convex polytopal tiles via a hyperplane arrangement, with each hyperplane corresponding to one neuron at the output of the layer. In this two-dimensional example assuming ReLU activation, the red line indicates the one-dimensional hyperplane corresponding to the k-th neuron in the first layer.

The weights and biases of the first layer determine not only the tessellation of the input space but also an affine transformation on each tile to implement (5). Explicit formulas for A_ω, c_ω are available in [BCAB19]. It should be clear that, since all of the transformations in (6) are continuous, so must be the affine spline (5) corresponding to the first layer.
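For ReLU layers of the form (2), the per-tile parameters for the tile containing a particular input can be computed by freezing each layer's on/off pattern and composing the affine maps. The sketch below is our own notation and construction, not the exact formulas of [BCAB19].

```python
import numpy as np

def tile_affine_map(x, weights, biases):
    """Return (A, c) with f(x') = A x' + c for every x' in the tile containing x,
    for a ReLU network with a final linear layer."""
    A, c = np.eye(len(x)), np.zeros(len(x))
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        pre = W @ z + b
        q = (pre > 0).astype(float)     # this layer's on/off pattern on the tile of x
        A = (W * q[:, None]) @ A        # compose diag(q) W with the running affine map
        c = q * (W @ c + b)
        z = np.maximum(pre, 0.0)
    return weights[-1] @ A, weights[-1] @ c + biases[-1]
```

For example, np.linalg.norm(A, 2) gives the 2-norm of the per-tile slope A_ω that is false-colored in Figure 9 later in the paper.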
The tessellation corresponding to the composition of two or more layers follows an interesting subdivision process akin to a "tessellation of tessellations" [BCAB19]. For example, the second layer creates a hyperplane arrangement in its input space, which happens to be the output space of layer one. Thus, these hyperplanes can be pulled back through layer one to its input space by performing the same process as above but on a tile-by-tile basis relative to the layer-one tessellation and its associated affine transforms. The effect on the layer-two hyperplanes is that they are folded each time they cross a hyperplane created by layer 1. Careful inspection of the tessellation in Figure 3 reveals many examples of such hyperplane folding. Similarly, the hyperplanes created by layer three will be folded every time they encounter a hyperplane in the input space from layers one or two.

Much can be said about this folding process, including a formula for the dihedral angle of a folded hyperplane as a function of the network's weights and biases. However, the formulae for the angles and affine transformations unfortunately become unwieldy for more than two layers. Finding simplifications for these attributes is an interesting open problem, as are the connections to other subdivision processes like wavelets and fractals.

The theory of hyperplane arrangements is rich and tells us that, generally speaking, the number of tiles grows rapidly with the number of neurons in each layer. Hence, we can expect even modestly sized deep networks to have an enormous number of tiles in their input space, each with a corresponding affine transformation from input to output space. Importantly, though, the affine transformations are highly coupled because the overall mapping (5) must remain continuous. This means that the class of functions that can be represented using a deep network is considerably smaller than if the mapping could be uncoupled and/or discontinuous. Understanding what deep learning practitioners call the network's "implicit bias" remains an important open problem.

Visualizing the Tessellation

The toy, low-dimensional examples in Figures 3 and 4 are useful for building intuition, but how can we gain insight into the tessellation of a deep network with thousands or more of input and output dimensions? One way to proceed is to compute summary statistics
about the tessellation, such as how the number of tiles scales as we increase the width or depth of a network (e.g., [MPCB14]); more on this below. An alternative is to gain insight via direct visualization.

SplineCam is an exact method for computing and visualizing a deep network's spline tessellation over a specified low-dimensional region of the input space, typically a bounded two-dimensional planar slice [HBBB23]. SplineCam uses an efficient graph data structure to encode the intersections of the hyperplanes (from the various layers) that pass through the slice and then uses a fast heuristic breadth-first search algorithm to identify tiles from the graph. All of the computations besides the search can be vectorized and computed on GPUs to enable the visualization of even industrial-scale deep networks.
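To give a flavor of the first step, here is a sketch of ours covering only the first layer (SplineCam also handles the folded hyperplanes from deeper layers): restricted to a two-dimensional slice x = x0 + P u, each hyperplane w_k · x + b_k = 0 becomes a line in the slice coordinates u.

```python
import numpy as np

def hyperplanes_on_slice(x0, P, W1, b1):
    """Restrict the layer-1 hyperplanes w_k . x + b_k = 0 to the slice x = x0 + P u,
    where P is D x 2 with orthonormal columns. Returns line coefficients (a, c)
    such that a_k . u + c_k = 0 in slice coordinates."""
    return W1 @ P, W1 @ x0 + b1

# A slice through three anchor images x_A, x_B, x_C: x0 = x_A and P spans
# x_B - x_A and x_C - x_A (orthonormalized, e.g., with np.linalg.qr).
```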
Figure 5 depicts a SplineCam slice along the plane defined by three training images for a 5-layer ConvNet trained to classify between Egyptian and Tabby cat photos. The first thing we notice is the extraordinarily large number of tiles in just this small region of the 4096-dimensional input space. It can be shown that the decision boundary separating Egyptian and Tabby cats corresponds to a single hyperplane in the final layer that is folded extensively from being pulled back through the previous four layers [BCAB19]. Photos falling in the lower left of the slice are classified as Tabbies, while photos falling in the lower right are classified as Egyptians. The density of tiles is also not uniform and varies across the input space.

Figure 5: SplineCam visualization of a two-dimensional slice through the affine spline tessellation of the 4096-dimensional input space of a 5-layer ConvNet of average width 160 trained to classify 64×64 digital photos of cats. The stars denote the three training images that define the plane and the red lines the decision boundaries between the two classes. (Adapted from [HBBB23].)

An interesting avenue for future research involves the efficient extension of SplineCam to higher-dimensional slices both for visualization and the computation of summary statistics.

The main goal of this paper is to demonstrate the broad range of insights that can be garnered into the inner workings of a deep network through a focused study of the geometry of its input space tessellation. To this end, we now tour five examples relating to deep network approximation, optimization, and data synthesis. But we would be remiss if we did not point to the significant progress that has been made leveraging other important aspects of the spline view of deep learning, such as understanding how affine splines emerge naturally from the regularization typically used in deep network optimization [Uns19] and what types of functions are learned by deep networks [PN22].

The Self-Similar Geometry of the Tessellation

It has been known since the late 1980s that even a two-layer neural network is a universal approximator, meaning that, as the number of neurons grows, one can approximate an arbitrary continuous function over a Borel measurable set to arbitrary precision [Cyb89]. But, unfortunately, while two-layer networks are easily capable of interpolating a set of training data, in practice they do a poor job generalizing to data outside of the training set. In contrast, deep networks with L ≫ 2 layers have proved over the past 15 years that they are capable of both interpolating and generalizing well.

Several groups have investigated the connections between a network's depth and its tessellation's capacity to better approximate. [MPCB14] was the
first to quantify the advantage of depth by counting the number of tiles and showing that deep networks create more tiles (and hence are more expressive) than shallow networks.
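A simple way to probe this empirically (a sketch of ours, not the counting argument of [MPCB14]) is to walk along a segment in the input space and count how many distinct activation patterns, and hence tiles, are encountered.

```python
import numpy as np

def count_tiles_on_segment(x_start, x_end, weights, biases, n_steps=10000):
    """Lower-bound the number of tiles crossed by the segment from x_start to x_end
    by counting changes in the network's activation pattern along a dense sampling."""
    prev, changes = None, 0
    for t in np.linspace(0.0, 1.0, n_steps):
        z, bits = (1 - t) * x_start + t * x_end, []
        for W, b in zip(weights[:-1], biases[:-1]):
            pre = W @ z + b
            bits.append(pre > 0)            # on/off pattern of this layer at the point
            z = np.maximum(pre, 0.0)
        pattern = np.concatenate(bits)
        if prev is not None and not np.array_equal(pattern, prev):
            changes += 1
        prev = pattern
    return changes + 1                      # tiles visited = pattern changes + 1
```

Comparing the counts for a deep and a shallow network with the same total number of neurons illustrates the expressivity gap.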
Further work has sought to link the self-similar nature of the tessellation to good approximation. Using self-similarity, one can construct new function spaces for which deeper networks provide better approximation rates (see [DHP21, DDF+22] and the references therein).

network can approximate that matters, but rather how it learns to approximate. Empirical studies have indicated that this is because the so-called loss landscape of the loss function L(Θ) navigated by gradient descent as it optimizes the deep network parameters is much smoother for ResNets as compared to ConvNets (see Figure 6). However, to date there has been no analytical work in this direction.
and ResNet architectures by comparing their singular values. The key finding is that the condition number of a ResNet (the ratio of the largest to smallest singular value) is bounded, while that of the ConvNet is not [RBB23]. This means that the local loss landscape of a ResNet with skip connections is provably better conditioned than that of a ConvNet and thus less erratic, less eccentric, and with local minima that are more accommodating to gradient-based optimization.

Beyond analysis, one interesting future research avenue in this direction is converting this analytical understanding into new optimization algorithms that are more efficient than today's gradient descent approaches.

The Geometry of Initialization

As we just discussed, even for the prosaic squared error loss function (4), the loss landscape as a function of the parameters is highly nonconvex with myriad local minima. Since gradient descent basically descends to the bottom of the first basin it can find, where it starts (the initialization) really matters. Over the years, many techniques have been developed to improve the initialization and/or help gradient descent find better minima; here we look at one of them that is particularly geometric in nature.

With batch normalization, we modify the definition of the neural computation from (6) to

z_k^(1) = σ( ( w_k^(1) · x − µ_k^(1) ) / ν_k^(1) ),   (8)

where µ_k^(1) and ν_k^(1) are not learned by gradient descent but instead are directly computed as the mean and standard deviation of w_k^(1) · x_i over the training data inputs involved in each gradient step in the optimization. Importantly, this includes the very first step, and so batch normalization directly impacts the initialization from which we start iterating on the loss landscape.²

² As implemented in practice, batch normalization has two additional parameters that are learned as part of the gradient descent; however, [BB22] shows that these parameters have no effect on the optimization initialization and only a limited effect during learning as compared to µ and ν.
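A minimal sketch of the first-layer computation (8) on one minibatch (ours; the small eps guard against division by zero is an implementation detail we added):

```python
import numpy as np

def batchnorm_relu_layer1(X_batch, W1, eps=1e-5):
    """Eq. (8): center and scale each neuron's pre-activations w_k . x_i by the
    minibatch mean and standard deviation, then apply the ReLU."""
    pre = X_batch @ W1.T        # w_k . x_i for every neuron k and minibatch datum i
    mu = pre.mean(axis=0)       # mu_k, recomputed from the current minibatch
    nu = pre.std(axis=0)        # nu_k, recomputed from the current minibatch
    return np.maximum((pre - mu) / (nu + eps), 0.0)
```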
Astute readers might see a connection to the standard data preprocessing step of data normalization and centering; the main difference is that this processing is performed before each and every gradient learning step. Batch normalization often greatly aids the optimization of a wide variety of deep networks, helping it to find a better (lower) minimum more quickly. But the reasons for its efficacy are poorly understood.

We can make progress on understanding batch normalization by again leaning on the affine spline viewpoint. Let's focus on the effect of batch normalization at initialization, just before gradient learning begins; the effect is pronounced, and it is then easy to extrapolate regarding what happens at subsequent gradient steps. Prior to learning, a deep network's weights are initialized with random numbers. This means that the initial hyperplane arrangement is also random.

The key finding of [BB22] is that batch normalization adapts the geometry of a deep network's spline tessellation to focus the network's attention on the training data x_i. It does this by adjusting the angles and offsets of the hyperplanes that form the boundaries of the polytopal tiles to increase their density in regions of the input space inhabited by the training data, thereby enabling finer approximation there.

More precisely, batch normalization directly adapts each layer's input space tessellation to minimize the total least squares distance between the tile boundaries and the training data. The resulting data-adaptive initialization aligns the spline tessellation with the data not just at initialization but before every gradient step to give the learning algorithm a much better chance of finding a quality set of weights and biases. See Figure 7 for a visualization.

Figure 8 provides clear evidence of batch normalization's adaptive prowess. We initialize an 11-layer deep network with a two-dimensional input space three different ways to train on data with a star-shaped distribution. We plot the density of the hyperplanes (basically, the number of hyperplanes passing through local regions of the input space) created by layers 3, 7, and 11 for three different layer configurations: i) the standard layer (6) with bias
Figure 9: SplineCam visualization of a slice of the input space defined by three MNIST digits being classified by a 4-layer MLP of width 200. The false color map (viridis) encodes the 2-norm of the A_ω matrix defined on each tile according to purple (low), green (medium), yellow (high). The decision boundary is depicted in red. Panels, left to right: initialization (LC = 4.91), interpolation (LC = 2.96), grokking (LC = 0.142). (Adapted from [HBB24].)

quite rugged around these points [BPB20]. Indeed, the false coloring indicates that the 2-norms of the A_ω matrices have increased around the training images, meaning that their "slopes" have increased. As a consequence, the overall spline mapping f(x) is now likely more rugged and more sensitive to changes in the input x as measured by a local (per-tile) Lipschitz constant. In summary, at (near) interpolation, the gradient learning iterations have in some sense accomplished their task (near zero training error) but with elevated sensitivity of f(x) to changes in x around the training data points as compared to the random initialization.

Interpolation is the point that would typically be recommended to stop training and fix the network for use in an application. But let's see what happens if we continue training about 37 times longer. At right in Figure 9, we see that, while the training error has not changed after continued training (it is still near zero, meaning correct classification of nearly all the training data), the tessellation has metamorphosed. There are now only half as many tiles in this region, and they have all migrated to define the decision boundary, where presumably they are being used to create sharp decisions. Around the training data, we now have a very low density of tiles with low 2-norm of their A_ω matrices, and thus presumably a much smoother mapping f(x). Hence, the sensitivity of f(x) as measured by a local Lipschitz constant will be much lower than just after interpolation.

We designate this state of affairs delayed robustness; it is one facet of the general phenomenon of grokking that has been discovered only recently [PBE+22]. A dirty secret of deep networks is that f(x) can be quite unstable to small changes in x (which seems expected given the high degree of nonlinearity). This instability makes deep networks less robust and more prone to attacks like causing a 'barn' image to be classified as a 'pig' by adding a nearly undetectable but carefully designed attack signal to the picture of a barn. Continuing learning to achieve grokking and delayed robustness is a new approach to mitigating such attacks in particular and making deep learning more stable and predictable in general.

Can we translate the visualization of Figure 9 into a metric that can be put into practice to compare or improve deep networks? This is an open research question, but here are some first steps [HBB24]. Define the local complexity (LC) as the number of tiles in a neighborhood V around a point x in the input space. While exact computation of the LC is combinatorially complex, an upper bound can be obtained in terms of the number of hyperplanes that intersect V according to Zaslavsky's Theorem, with the assumption that V is small enough that the hyperplanes are not folded inside V. Therefore, we can use the number of hyperplanes intersecting V as a proxy for the number of tiles in V.
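Here is a sketch of that proxy (our illustrative implementation, not the exact procedure of [HBB24]): each neuron is linearized on the tile containing x, and we count how many of the resulting hyperplanes pass within a radius r of x; the no-folding assumption above is what makes this a sensible count.

```python
import numpy as np

def local_complexity(x0, weights, biases, radius):
    """Count neuron hyperplanes (linearized on the tile of x0) that intersect the
    ball V of the given radius around x0 -- the proxy for the number of tiles in V."""
    A, c = np.eye(len(x0)), np.zeros(len(x0))
    count = 0
    for W, b in zip(weights[:-1], biases[:-1]):
        Wl, bl = W @ A, W @ c + b                        # pre-activations as affine maps of the input
        pre = Wl @ x0 + bl
        dist = np.abs(pre) / (np.linalg.norm(Wl, axis=1) + 1e-12)
        count += int(np.sum(dist <= radius))             # hyperplanes entering V
        q = (pre > 0).astype(float)                      # freeze this layer's on/off pattern
        A, c = Wl * q[:, None], q * bl
    return count
```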
For the experiment reported in Figure 9, we computed the LC in the neighborhood of each training data point in the entire training dataset and then averaged those values. From the above discussion, high LC around a point x in the input space implies small, dense tiles in that region and a potentially unsmooth and unstable mapping f(x) around x. The values reported in Figure 9 confirm that the LC does indeed capture the intuition that we garnered visually. One interesting potential application of the LC is as a new progress measure that serves as a proxy for a deep network's expressivity; LC is task-agnostic yet informative of the training dynamics.

Open research questions regarding the dynamics of deep network learning abound. At a high level, it is clear from Figure 9 that the classification function being learned has its curvature concentrated at the
decision boundary. Approximation theory would suggest that a free-form spline should indeed concentrate its tiles around the decision boundary to minimize the approximation error. However, it is not clear why that migration occurs so late in the training process.

Another interesting research direction is the interplay between grokking and batch normalization, which we discussed above. Batch normalization provably concentrates the tessellation near the data samples, but to grok we need the tiles to move away from the samples. Hence, it is clear that batch normalization and grokking compete with each other. How to get the best of both worlds at both ends of the gradient learning timeline is an open question.

The Geometry of Generative Models

A generative model aims to learn the underlying patterns in the training data in order to generate new, similar data. The current crop of deep generative models includes transformer networks that power large language models for text synthesis and chatbots and diffusion models for image synthesis. Here we investigate the geometry of models that until recently were state-of-the-art, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), which are often based on ReLU and other piecewise linear activation functions.

Deep generative models map from a typically low-dimensional Euclidean input space (called the parameter space) to a manifold M of roughly the same dimension in a high-dimensional output space. Each point x in the parameter space synthesizes a corresponding output point ŷ = f(x) on the manifold (e.g., a picture of a bedroom). Training on a large number of images y_i learns an approximation to the mapping f from the parameter space to the manifold. It is beyond the scope of this review, but learning the parameters of a deep generative model is usually more involved than simple gradient descent [GBCB16]. It is useful for both training and synthesis to view the points x from the parameter space as governed by some probability distribution, e.g., uniform over a bounded region of the input space.

In the case of a GAN based on ReLU or similar activation functions, the manifold M is a continuous piecewise affine manifold;³ see Figure 10. Points on the manifold are given by (5) as x sweeps through the input space.

³ We allow M to intersect itself transversally in this setting.

Figure 10: A ReLU-based deep generative network manifold M is continuous and piecewise affine. Each affine spline tile ω in the input space is mapped by an affine transformation to a corresponding tile M(ω) on the manifold.

A major issue with deep generative models is that, if the training data is not carefully sourced and curated, then they can produce biased outputs. A deep generative model like a GAN or VAE is trained to approximate both the structure of the true data manifold from which the training data was sampled and the data distribution on that manifold. However, all too often in practice, training data are obtained based on preferences, costs, or convenience factors that produce artifacts in the training data distribution on the manifold. Indeed, it is common in practice for there to be more training data points in one part of the manifold than another. For example, a large fraction of the faces in the CelebA dataset are smiling, and a large fraction of those in the FFHQ dataset are female with dark hair. When one samples uniformly from a model trained with such biased data, the biases will be reproduced in its outputs, which has far-reaching implications for algorithmic fairness and beyond.

We can both understand and ameliorate sampling biases in deep generative models by again leveraging their affine spline nature. The key insight for the bias issue is that the tessellation of the input space is carried over onto the manifold. Each convex tile ω in the input space is mapped to a convex tile M(ω)
on the manifold using the affine transform

M(ω) = { A_ω x + c_ω , x ∈ ω },   (9)

and the manifold M is the union of the M(ω). This straightforward construction enables us to analytically characterize many properties of M via (5).

In particular, it is easy to show that the mapping (9) from the input space to the manifold warps the tiles in the input space tessellation by A_ω, causing their volume to expand or contract by

vol(M(ω)) / vol(ω) = √( det(A_ω^⊤ A_ω) ).   (10)

Knowing this, we can take any trained and fixed generative model and determine a nonuniform sampling of the input space according to (10) such that the sampling on the manifold is provably uniform and free from bias. The bonus is that this procedure, which we call MAximum entropy Generative NETwork (MaGNET) [HBB22a], is simply a post-processing procedure that does not require any retraining of the network.
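As a toy sketch of the idea (ours, not the MaGNET implementation), one can importance-resample candidate input points by the volume-change factor (10), reusing the tile_affine_map helper sketched earlier; candidates whose tiles the generator expands the most are sampled more often, which evens out the density on the manifold.

```python
import numpy as np

def volume_change(A):
    """Eq. (10): factor by which the generator expands the volume of a tile."""
    return np.sqrt(np.linalg.det(A.T @ A))

def magnet_resample(X_candidates, weights, biases, n_samples, rng=None):
    """Resample uniform input-space candidates so that their images on the
    generator's manifold are (approximately) uniformly distributed."""
    rng = rng or np.random.default_rng()
    X_candidates = np.asarray(X_candidates)
    w = np.array([volume_change(tile_affine_map(x, weights, biases)[0])
                  for x in X_candidates])
    p = w / w.sum()                          # favor tiles that the generator expands
    idx = rng.choice(len(X_candidates), size=n_samples, p=p)
    return X_candidates[idx]
```

Replacing the square root by a power ρ of det(A_ω^⊤ A_ω) yields the polarity-sampling variant discussed next.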
Figure 11 demonstrates MaGNET's debiasing abilities. On the left are 18 faces synthesized by the StyleGAN2 generative model trained on the FFHQ face dataset. On the right are 18 faces synthesized by the same StyleGAN2 generative model but using a nonuniform sampling distribution on the input space based on (10). MaGNET sampling yields a better gender, age, and hair color balance as well as more diverse backgrounds and accessories. In fact, MaGNET sampling produces 41% more male faces (as determined by the Microsoft Cognitive API) to balance out the gender distribution.

Figure 11: Images synthesized by sampling uniformly from the input space of a StyleGAN2 deep generative model trained on the FFHQ face data set and nonuniformly according to (10) using MaGNET. (Adapted from [HBB22a].)

We can turn the volumetric deformation (10) into a tool to efficiently explore the data distribution on a deep generative model's manifold. By following the MaGNET sampling approach but using an input sampling distribution based on det(A_ω^⊤ A_ω)^ρ, we can synthesize images in the modes (high probability regions of the manifold that are more "typical and high quality") using ρ → −∞ or in the anti-modes (low probability regions of the manifold that are more "diverse and exploratory") using ρ → ∞ [HBB22b]. Setting ρ = 0 returns the model to uniform sampling.

Like MaGNET, this polarity sampling approach applies to any pre-trained generative network and so has broad applicability. See Figure 12 for an illustrative toy example and [HBB22b] for numerous examples with large-scale generative models, including using polarity sampling to boost the performance of existing generative models to state-of-the-art.

Figure 12: Polarity-guided synthesis of points in the plane by a Wasserstein GAN generative model. When the polarity parameter ρ = 0, the model produces a data distribution closely resembling the training data. When the polarity parameter ρ ≪ 0 (ρ ≫ 0), the WGAN produces a data distribution focusing on the modes (anti-modes), the high (low) probability regions of the training data. (From [HBB22b].)

There are many interesting open research questions around affine splines and deep generative networks. One, related to the MaGNET sampling strategy, is that it assumes that the trained generative network actually learned a good enough approximation of the true underlying data manifold. One could envision exploring how MaGNET could be used to test such an assumption.

Discussion and Outlook

While there are several ways to envision extending the concept of a one-dimensional affine spline (recall
Figure 2) to high-dimensional functions and operators, progress has been made only along the direction of forcing the tessellation of the domain to hew to some kind of grid (e.g., uniform or multiscale uniform for spline wavelets). Such constructions are ill-suited for machine learning problems in high dimensions due to the so-called curse of dimensionality that renders approximation intractable.

We can view deep networks as a tractable mechanism for emulating those most powerful of splines, the free-knot splines (splines like those in Figure 2 where the intervals partitioning the real line are arbitrary) in high dimensions. A deep network uses the power of a hyperplane arrangement to tractably create a myriad of flexible convex polytopal tiles that tessellate its input space plus affine transformations on each that result in quite powerful approximation capabilities in theory [DHP21] and in practice. There is much work to do in studying these approximations (e.g., developing realistic function approximation classes and proving approximation rates) as well as developing new deep network architectures that attain improved rates and robustness.

An additional timely research direction involves extending the ideas discussed here to deep networks like transformers that employ at least some nonlinearities that are not piecewise linear. The promising news is that the bulk of the learnable parameters in state-of-the-art transformers lie in readily analyzable affine spline layers within each transformer block of the network. Hence, we can apply many of the above ideas, including local complexity (LC) estimation, to study the smoothness, expressivity, and sensitivity characteristics of even monstrously large language models like the GPT, Gemini, and Llama series.

We hope that we have convinced you that viewing deep networks as affine splines provides a powerful geometric toolbox to better understand how they learn, how they operate, and how they can be improved in a principled fashion. But splines are just one interesting research direction in the mathematics of deep learning. These are early days, and there are many more open than closed research questions.

Acknowledgments

Thanks to T. Mitchell Roddenberry and Ali Siahkoohi for their comments on the manuscript. AIH and RGB were supported by NSF grants CCF-1911094, IIS-1838177, and IIS-1730574; ONR grants N00014-18-1-2571, N00014-20-1-2534, N00014-23-1-2714, and MURI N00014-20-1-2787; AFOSR grant FA9550-22-1-0060; DOI grant 140D0423C0076; and a Vannevar Bush Faculty Fellowship, ONR grant N00014-18-1-2047.

References

[BB18] Randall Balestriero and Richard Baraniuk, From hard to soft: Understanding deep network nonlinearities via vector quantization and statistical inference, International Conference on Learning Representations (ICLR) (2018).

[BB21] Randall Balestriero and Richard Baraniuk, Mad Max: Affine spline insights into deep learning, Proceedings of the IEEE 109 (2021), no. 5, 704–727.

[BB22] Randall Balestriero and Richard G. Baraniuk, Batch normalization, explained, arXiv preprint arXiv:2209.14778 (2022).

[BCAB19] Randall Balestriero, Romain Cosentino, Behnaam Aazhang, and Richard Baraniuk, The geometry of deep networks: Power diagram subdivision, Advances in Neural Information Processing Systems (NIPS) 32 (2019).
[BPB20] Randall Balestriero, Sebastien Paris, and Richard Baraniuk, Analytical probability distributions and exact expectation-maximization for deep generative networks, Advances in Neural Information Processing Systems (NeurIPS) (2020).

[Cyb89] George Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals, and Systems 2 (1989), no. 4, 303–314.

[DDF+22] Ingrid Daubechies, Ronald DeVore, Simon Foucart, Boris Hanin, and Guergana Petrova, Nonlinear approximation and (deep) ReLU networks, Constructive Approximation 55 (2022), no. 1, 127–172.

[DHP21] Ronald DeVore, Boris Hanin, and Guergana Petrova, Neural network approximation, Acta Numerica 30 (2021), 327–444.

[GBCB16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016.

[HBB22a] Ahmed Imtiaz Humayun, Randall Balestriero, and Richard Baraniuk, MaGNET: Uniform sampling from deep generative network manifolds without retraining, International Conference on Learning Representations (ICLR) (2022).

[HBB22b] Ahmed Imtiaz Humayun, Randall Balestriero, and Richard Baraniuk, Polarity sampling: Quality and diversity control of pre-trained generative networks via singular values, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022).

[HBB24] Ahmed Imtiaz Humayun, Randall Balestriero, and Richard Baraniuk, Deep networks always grok and here is why, International Conference on Machine Learning (ICML) (2024).

[HBBB23] Ahmed Imtiaz Humayun, Randall Balestriero, Guha Balakrishnan, and Richard Baraniuk, SplineCam: Exact visualization and characterization of deep network geometry and decision boundaries, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023).

[LXT+18] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein, Visualizing the loss landscape of neural nets, Advances in Neural Information Processing Systems (NIPS) 31 (2018).

[MPCB14] Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio, On the number of linear regions of deep neural networks, Advances in Neural Information Processing Systems (NIPS) (2014).

[PBE+22] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra, Grokking: Generalization beyond overfitting on small algorithmic datasets, arXiv preprint arXiv:2201.02177 (2022).

[PN22] Rahul Parhi and Robert Nowak, What kinds of functions do deep neural networks learn? Insights from variational spline theory, SIAM Journal on Mathematics of Data Science 4 (2022), no. 2, 464–489.

[RBB23] Rudolf Riedi, Randall Balestriero, and Richard Baraniuk, Singular value perturbation and deep network optimization, Constructive Approximation 57 (2023), no. 2.

[SPD+20] Justin Sahs, Ryan Pyle, Aneel Damaraju, Josue Ortega Caro, Onur Tavaslioglu, Andy Lu, and Ankit Patel, Shallow univariate ReLU networks as splines: Initialization, loss surface, Hessian, & gradient flow dynamics, arXiv preprint arXiv:2008.01772 (2020).

[Uns19] Michael Unser, A representer theorem for deep neural networks, Journal of Machine Learning Research 20 (2019), no. 110, 1–30.

[ZNL18] Liwen Zhang, Gregory Naitzat, and Lek-Heng Lim, Tropical geometry of deep neural networks, International Conference on Machine Learning (ICML) (2018).