Is deep learning a useful tool for the working mathematician?
GEORDIE WILLIAMSON
1. Introduction
Over the last decade, deep learning has found countless applications throughout
industry and science. However, its impact on pure mathematics has been modest.
This is perhaps surprising, as some of the tasks at which deep learning excels—like
playing the board-game Go or finding patterns in complicated structures—appear
to present similar difficulties to problems encountered in research mathematics. On
the other hand, the ability to reason—probably the single most important defining
characteristic of mathematical enquiry—remains a central unsolved problem in artificial intelligence. Thus, mathematics can be seen as an important litmus test for what modern artificial intelligence can and cannot do.
There is great potential for interaction between mathematics and machine learn-
ing.1 However, there is also a lot of hype, and it is easy for the mathematician
to be put off. In my experience, it remains hard to use deep learning to aid my
mathematical research. However, it is possible. One also has the sense that the
potential, once the right tools have been uncovered, is significant.
This is a very informal survey of what a working mathematician might expect
when using the tools of deep learning on mathematics problems. I outline some
of the beautiful ideas behind deep learning. I also give some practical hints for
using these tools. I finish with some examples where deep learning has been used
productively in pure mathematics research. (I hope it goes without saying that the
impact of deep learning on applied mathematics has been enormous.)
Finally, in my experience, the more one uses the tools of deep learning, the
more difficult it becomes not to ask oneself foundational questions about why they
work. This raises an entirely different set of questions. Although fascinating, the
mathematical theory of deep learning is not the focus here.
Remark 1.1. The elephant in the room in any discussion today of deep learning is
the recent success of ChatGPT and other large language models. The internet is
full of examples of ChatGPT doing both very well and very poorly on reasoning
and mathematics problems. It seems likely that large language models will be
able to interact well with proof assistants in the near future (see e.g. [HRW` 21,
JWZ` 23]). It is also likely that a greater role will be played in mathematics research
1
In 1948, Turing [Tur48, §6] identifies games, mathematics, cryptography and language transla-
tion and acquisition as five “suitable branches of thought” in which experimentation with machine
intelligence might be fruitful.
2. What is a neural network?
Neurons "fire" by emitting electrical charge along their axon. We may encode the charge arriving along each incoming connection by a real number, in which case the charge emitted by a neuron is given by

[Figure: a single neuron with incoming charges x_1, ..., x_5 and emitted charge z]

z = f\Bigl(\sum x_i\Bigr)

for some fixed function f. Not all incoming connections are equally strong, however: weighting the charge arriving along each connection by a real number w_i, the emitted charge becomes

(1)\qquad z = f\Bigl(\sum w_i x_i\Bigr).
Thus positive and negative weights correspond to excitatory and inhibitory connec-
tions respectively.
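In code, the model (1) of a single neuron is a one-liner. A minimal sketch in NumPy (function names are mine), using for f the ReLU function that is introduced formally below:

```python
import numpy as np

def relu(t):
    # A common choice of activation function f (see below).
    return np.maximum(t, 0.0)

def neuron(x, w, f=relu):
    """Charge emitted by a single neuron, as in (1): z = f(sum_i w_i * x_i)."""
    return f(np.dot(w, x))

# Five incoming charges; positive weights are excitatory, negative inhibitory.
x = np.array([0.5, -1.2, 3.0, 0.0, 2.2])
w = np.array([1.0, 0.3, -2.0, 0.7, 0.1])
print(neuron(x, w))
```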
Having settled on a crude mathematical model of a single neuron, we may then
assemble them together to form a neural network:

[Figure: a neural network drawn as a graph: five input neurons feed through intermediate layers into two output neurons, with an edge for each connection]
Implicit in this picture is the assignment of a weight to each edge. Thus our neural
network yields a function which takes real valued inputs (5 in the above picture),
and outputs real values (2 above), via repeated application of (1) at each node.
This is a good picture for the layperson to have in mind. It is useful to visualize
the complex interconnectedness present in artificial neural networks, as well as the
locality of the computation taking place. However, for the mathematician, one can explain things a little differently. The configuration

[Figure: the same network, with its neurons arranged into successive layers]

may be described layer by layer, as follows.
For the purposes of this article, a vanilla neural network^5 is a gadget of the form
\mathbb{R}^{d_1} \xrightarrow{A_1} \mathbb{R}^{d_2} \xrightarrow{f} \mathbb{R}^{d_2} \xrightarrow{A_2} \mathbb{R}^{d_3} \xrightarrow{f} \mathbb{R}^{d_3} \xrightarrow{A_3} \cdots \xrightarrow{f} \mathbb{R}^{d_{\ell-1}} \xrightarrow{A_{\ell-1}} \mathbb{R}^{d_\ell}
where the A_i are affine linear maps. We refer to R^{d_1}, R^{d_2}, ..., R^{d_ℓ} as the layers of the network. In order to simplify the discussion, we always assume that our activation function f is given by ReLU (the "rectified linear unit"), that is

f\Bigl(\sum \lambda_i e_i\Bigr) = \sum \max(\lambda_i, 0)\, e_i
^5 One often encounters the term "Multi Layer Perceptron (MLP)" in the literature.
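In code, a vanilla network is literally the composition displayed above: a list of affine maps A_i(x) = W_i x + b_i with ReLU applied between them. A sketch in NumPy (class and variable names are mine; the initialisation scheme is an arbitrary choice):

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

class VanillaNet:
    """A vanilla neural network: alternating affine maps and ReLU."""
    def __init__(self, dims, rng=np.random.default_rng(0)):
        # dims = (d_1, ..., d_l); one affine map A_i : R^{d_i} -> R^{d_{i+1}} per step.
        self.weights = [rng.normal(0, dims[i] ** -0.5, (dims[i + 1], dims[i]))
                        for i in range(len(dims) - 1)]
        self.biases = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]

    def __call__(self, x):
        # ReLU after every affine map except the last, as in the diagram above.
        for i, (W, b) in enumerate(zip(self.weights, self.biases)):
            x = W @ x + b
            if i < len(self.weights) - 1:
                x = relu(x)
        return x

net = VanillaNet((5, 8, 8, 2))
print(net(np.ones(5)))  # a function from R^5 to R^2, as in the earlier picture
```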
[Figure: 28×28-pixel grayscale images of handwritten digits with their labels, e.g. ↦ 6, ↦ 2; and, for the "is it a 6?" task, ↦ "yes", ↦ "no"]
We assume that we have “training data” consisting of images labelled by “6” or “not
6”. As a first attempt we might consider a network having a single linear layer:
\mathbb{R}^{28\times 28} \xrightarrow{A} \mathbb{R} \xrightarrow{\;1/(1+e^{-x})\;} \mathbb{R}.
Here A is affine linear, and the second function (the "logistic function"^7) is a conve-
nient way of converting an arbitrary real number into a probability. Thus, positive
values of A mean that we think our image is a 6, and negative values of A mean we
think it is not.
We will be successful if we can find a hyperplane separating all vectors corre-
sponding to 6’s (red dots) from those that do not represent 6’s (blue dots):
[Figure: red and blue dots (vectors in R^{28×28}) separated by a hyperplane, with the 6's on the positive (+) side and the non-6's on the negative (−) side]
^6 I can convince myself that my brain produces a probability distribution and not a yes/no answer by recalling my efforts to decipher my grandmother's letters when I was a child.
^7 a.k.a. sigmoid in the machine learning literature
If the data is linearly separable (i.e. some hyperplane separates the points labelled "yes" and "no"), the optimal separating hyperplane may be found easily, by a variant of linear regression.^8 Often, however, no such hyperplane exists.
This is where deep learning can come into its own. The idea is that the successive
layers of the neural net “learn” more and more complicated features of the input,
eventually leading to an easy decision problem.^9
In the standard setting of supervised learning, we assume the existence of a function

φ : \mathbb{R}^n \to \mathbb{R}^m

and know a (usually large) number of its values. The task is to find a reasonable approximation of φ, given these known values. (The reader should keep in mind the motivating problem of the previous section, where one wants to learn a function

φ : \mathbb{R}^{28\times 28} \to \mathbb{R}^{10}

giving the probabilities that a certain 28×28-pixel grayscale image represents one of the 10 digits 0, 1, ..., 9.)
We fix a network architecture, which in our simple setting of a vanilla neural net means that we fix the number of layers ℓ and the layer dimensions n_2, ..., n_{ℓ-1}.
We then build a neural net (see §2) which serves as our function approximator:
(2)\qquad φ≈ : \mathbb{R}^n \xrightarrow{A_1} \mathbb{R}^{n_2} \xrightarrow{f} \mathbb{R}^{n_2} \xrightarrow{A_2} \mathbb{R}^{n_3} \xrightarrow{f} \mathbb{R}^{n_3} \xrightarrow{A_3} \cdots \xrightarrow{f} \mathbb{R}^{n_{\ell-1}} \xrightarrow{A_{\ell-1}} \mathbb{R}^m
To begin with, the affine linear maps A_i are initialised via some (usually random) initialisation scheme, and hence the function φ≈ output by our neural network will be random and bear no relation to our target function φ. We then measure the distance between our function φ≈ and φ via some loss function L. (For example, L might be the mean squared distance between the values of φ and φ≈.^10) A crucial assumption is that this loss function is differentiable as a function of the weights of our neural network. Finally, we perform gradient descent with respect to the loss
function, in order to update the parameters in (2) to (hopefully) better and better approximate φ.

^8 For a striking mathematical example of support vector machines see [HK22], where SVMs are trained to distinguish simple and non-simple finite groups by inspection of their multiplication tables.
^9 This idea seems to have been present in the machine learning literature for decades, see e.g. [LBBH98]. It is well explained in [GBC16, §6]. For illustrations of this, as well as the connection to fundamental questions in topology, see the work of Olah [Ola14].
^10 There are many subtleties here, and a good choice of loss function is one of them. In my limited experience, neural networks do a lot better learning probability distributions than general functions. When learning probability distributions, cross entropy [GBC16, §3.13] is the loss function of choice.
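A minimal training loop, sketched in NumPy with all details (target function, layer sizes, learning rate) invented for illustration; real experiments would use a framework such as PyTorch or JAX, which differentiates the loss automatically:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy target phi : R^2 -> R, which we pretend to know only through samples.
phi = lambda x: np.sin(x[:, 0]) + np.cos(x[:, 1])
X = rng.uniform(-3, 3, (1024, 2))   # known inputs
Y = phi(X)                          # known values of phi

# A vanilla net R^2 -> R^32 -> R with ReLU, i.e. (2) with l = 3.
W1, b1 = rng.normal(0, 0.5, (2, 32)), np.zeros(32)
W2, b2 = rng.normal(0, 0.5, (32, 1)), np.zeros(1)

lr = 1e-2
for step in range(2000):
    # Forward pass: affine, ReLU, affine.
    H = X @ W1 + b1
    A = np.maximum(H, 0.0)
    pred = (A @ W2 + b2).ravel()

    # Loss: mean squared distance between phi and its approximation.
    err = pred - Y
    loss = np.mean(err ** 2)

    # Backward pass: gradients of the loss with respect to all weights.
    g_pred = 2 * err[:, None] / len(X)
    gW2, gb2 = A.T @ g_pred, g_pred.sum(0)
    gH = (g_pred @ W2.T) * (H > 0)
    gW1, gb1 = X.T @ gH, gH.sum(0)

    # One step of gradient descent.
    for p, g in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        p -= lr * g

print(loss)  # should be much smaller than at initialisation
```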
In order to get an intuitive picture of what is happening during training, let us assume that m = 1 (so we are trying to learn a scalar function), and that our activation functions are ReLU. Thus φ≈ is the composition of affine linear and piecewise linear functions, and hence is piecewise linear. As with any piecewise linear function, we obtain a decomposition of R^n into polytopal regions

(3)\qquad [Figure: a decomposition of the plane into polytopal regions]

such that φ≈ is affine linear on each region. As training progresses, the affine linear functions move in a way similar to the learning of a line of best fit. The complexity comes from the fact that now the regions themselves may also move, disappear or spawn new regions.
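For a network with a single hidden layer the regions of (3) can be enumerated directly: each region corresponds to an activation pattern, the set of hidden neurons receiving positive input, and this pattern is constant on each region. A sketch (dimensions and names arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# The hidden layer of a random, untrained network R^2 -> R^16 -> R.
W1, b1 = rng.normal(size=(2, 16)), rng.normal(size=16)

# Sample a grid of inputs and record which hidden neurons are "on" at each.
xs = np.stack(np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200)), -1)
patterns = (xs.reshape(-1, 2) @ W1 + b1) > 0   # activation pattern per grid point

# Points with the same pattern lie in the same polytopal region, on which the
# network is affine linear. Count the regions visible in this window:
print(len(np.unique(patterns, axis=0)))
```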
Some practical remarks on training are in order:

(1) Typically, one splits the known values of φ into two disjoint sets, consisting
of training data and test data. Steps of gradient descent are only performed
using the training data. This allows us to periodically check whether our
model is also making reasonable predictions at points not present in the
training data (“test error”).
(2) In most applications of machine learning, the training data is enormous, and feeding it all through the neural network (2) in order to compute the loss function is unduly expensive. Thus one usually employs stochastic gradient descent: at every step the gradient of the loss function is evaluated on a small random subset of the training data (see the sketch after this list).
(3) A traditional paradigm in statistics tries to use a small number of param-
eters to describe a complex data set. A simple example of this paradigm is
a line of best fit. Here the model choice prevents overfitting in which the
model simply “memorizes” all the training data. Deep learning is different,
in that often there are enough parameters to allow overfitting. What is
surprising is that often neural nets generalize well (i.e. don’t overfit) even
though they could in principle.
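Remarks (1) and (2) in code (continuing the toy data of the earlier sketch; the split ratio and batch size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (1024, 2))       # known inputs, as before
Y = np.sin(X[:, 0]) + np.cos(X[:, 1])   # known values of phi

# Remark (1): split the known values into disjoint training and test data.
perm = rng.permutation(len(X))
cut = int(0.8 * len(X))                  # an 80/20 split
X_train, Y_train = X[perm[:cut]], Y[perm[:cut]]
X_test, Y_test = X[perm[cut:]], Y[perm[cut:]]

# Remark (2): stochastic gradient descent evaluates the loss (and its
# gradient) on a small random subset -- a "minibatch" -- at each step.
batch = rng.choice(len(X_train), size=64, replace=False)
X_batch, Y_batch = X_train[batch], Y_train[batch]
print(X_batch.shape, X_test.shape)
```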
5.2. Learning descent sets. Consider the symmetric group Σ_n consisting of all permutations of 1, 2, ..., n. Given a permutation x we can consider its left and right descent sets:

(4)\qquad L(x) = \{\, 1 \le i < n \mid x^{-1}(i) > x^{-1}(i+1) \,\},
(5)\qquad R(x) = \{\, 1 \le i < n \mid x(i) > x(i+1) \,\}.

Obviously, L(x^{-1}) = R(x) and R(x^{-1}) = L(x). The left and right descent sets are important invariants of a permutation.
It is interesting to see whether a neural network can be trained to learn the left and right descent sets. In other words, we would like to train a neural network

φ≈ : \mathbb{R}^n \to \mathbb{R}^{n-1}

which, given the vector (x(1), x(2), ..., x(n)), returns a sequence of n−1 probabilities indicating whether or not each 1 ≤ i < n belongs to the left (resp. right) descent set.
This example is interesting in that (5) implies that the right descent set can be predicted perfectly with a single linear layer. More precisely, if we consider

γ : \mathbb{R}^n \to \mathbb{R}^{n-1},\qquad (v_1, \dots, v_n) \mapsto (v_1 - v_2, v_2 - v_3, \dots, v_{n-1} - v_n),

then the i-th coordinate of γ evaluated on a permutation (x(1), ..., x(n)) is positive if and only if i ∈ R(x). On the other hand, it seems much harder to hand-craft a neural network which extracts the left descent set from (x(1), ..., x(n)).
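These definitions are easy to check in code. A sketch (function names mine) computing (4), (5), and the linear test via γ:

```python
import numpy as np

def right_descents(x):
    """R(x) = { i | x(i) > x(i+1) }, with x a permutation of 1..n as a tuple."""
    return {i + 1 for i in range(len(x) - 1) if x[i] > x[i + 1]}

def left_descents(x):
    """L(x) = R(x^{-1})."""
    inv = [0] * len(x)
    for pos, val in enumerate(x):
        inv[val - 1] = pos + 1
    return right_descents(inv)

def gamma(v):
    """The single linear layer (v_1,...,v_n) -> (v_1 - v_2, ..., v_{n-1} - v_n)."""
    v = np.asarray(v, dtype=float)
    return v[:-1] - v[1:]

x = (3, 1, 4, 5, 2)
print(right_descents(x))   # {1, 4}
print(left_descents(x))    # {2}, the descents of x^{-1}
# The sign test via gamma recovers the right descent set exactly:
print({i + 1 for i, t in enumerate(gamma(x)) if t > 0})  # {1, 4}
```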
This might lead us to guess that a neural network will have a much easier time learning the right descent set than the left descent set. This turns out to be the case, and the difference is dramatic: a vanilla neural network with two hidden layers of dimensions 500 and 100 learns to predict right descent sets for n = 35 with high accuracy after a few seconds, whereas the same network struggles to get even a single correct answer for the left descent set, even after significant training!^13 It is striking that using permutation matrices as inputs, rather than the vectors (x(1), ..., x(n)), gives perfect symmetry in training between left and right.^14 The issue here is the representation: how the model receives its input can have a dramatic effect on model performance.

^13 Decreasing n and allowing longer training suggests that the network can learn the left descent set; however, it is much harder.
^14 For a colab containing all of these experiments, see [GGW22, Classifying descent sets in S_n].
5.3. Transformers and linear algebra. Our final example is much more sophisti-
cated, and illustrates how important the choice of training data can be. It also
shows how surprising the results of training large neural networks can be.
A transformer is a neural network architecture which first emerged in machine translation [VSP+17]. We will not go into any detail about the transformer architecture here, except to say that it is well-suited to tasks where the input and output are sequences of tokens ("sequence to sequence" tasks):

[Figure: a transformer maps an input sequence of tokens x y z to an output sequence of tokens a b c]
Charton considers real symmetric matrices, all of whose entries are signed floating point numbers with three significant figures and exponent lying between −100 and 100.^15 The transformer obtains impressive accuracy on most linear algebra tasks. What is remarkable is that for the transformer the entries of the matrix (e.g. 3.14, −27.8, 0.000132, ...) are simply tokens—the transformer doesn't "know" that 3.14 is close to 3.13, or that both are positive; it doesn't even "know" that its tokens represent numbers!
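To make "numbers as tokens" concrete, here is one plausible encoding in the spirit of those Charton describes (the actual schemes are in [Cha21]; the token format below is my own invention for illustration): a float becomes a sign token, a three-digit mantissa token, and an exponent token.

```python
def encode(x):
    """Encode a float as (sign, three-digit mantissa, exponent) tokens.

    Illustrative only -- Charton [Cha21] studies several such encodings.
    """
    sign = "+" if x >= 0 else "-"
    mantissa, exponent = f"{abs(x):.2e}".split("e")   # e.g. "3.14", "+00"
    return [sign, "N" + mantissa.replace(".", ""), "E" + str(int(exponent) - 2)]

print(encode(3.14))      # ['+', 'N314', 'E-2']  :  3.14  = +314 * 10^-2
print(encode(-27.8))     # ['-', 'N278', 'E-1']  : -27.8  = -278 * 10^-1
print(encode(0.000132))  # ['+', 'N132', 'E-6']
```

To the model, 'N314' and 'N313' are simply distinct symbols; any notion that they are numerically close has to be learned from the training data.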
Another remarkable aspect of this work concerns generalization. A model trained on Wigner matrices (e.g. entries sampled uniformly from [−10, 10]) does not generalize well at all to matrices with positive eigenvalues. On the other hand, a model trained on matrices with eigenvalues sampled from a Laplace distribution (which has heavy tails) does generalize to matrices whose eigenvalues are all positive, even though it has not seen a single such matrix during training! The interested reader is referred to Charton's paper [Cha21] (in particular Table 12) and his lecture on YouTube [Cha22].
^15 Charton considers various encodings of these numbers via sequences of tokens of various lengths, see [Cha21].
6.1. Counter-examples in combinatorics. One can dream that deep learning might
one day provide a mathematician’s “bicycle for the mind”: an easy to use and
flexible framework for exploring possibilities and potential counter-examples. (I
have certainly lost many days trying to prove a statement that turned out to be
false, with the counter-example lying just beyond my mental horizon.)
We are certainly not there yet, but the closest we have come to witnessing such
a framework is provided in the work of Adam Wagner [Wag21]. He focuses on conjectures of the form: over all combinatorial structures X, an associated numerical quantity Z is bounded by B. He considers situations where there is some simple recipe for generating the objects in X, and where the numerical quantity Z is efficiently computable.
For example, a conjecture in graph theory states that for any connected graph G on n ≥ 3 vertices, with largest eigenvalue λ and matching number μ, we have

(6)\qquad \lambda + \mu - \sqrt{n-1} - 1 \ge 0.

(It is not important for this discussion to know what the matching number or largest eigenvalue are!)
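The quantity in (6) is cheap to compute, which is what makes the game described below feasible. A sketch using networkx (λ is the largest eigenvalue of the adjacency matrix; μ is the matching number; the function name is mine):

```python
import networkx as nx
import numpy as np

def conjecture_lhs(G):
    """Left-hand side of (6): lambda + mu - sqrt(n - 1) - 1."""
    n = G.number_of_nodes()
    lam = max(np.linalg.eigvalsh(nx.to_numpy_array(G)))       # largest eigenvalue
    mu = len(nx.max_weight_matching(G, maxcardinality=True))  # matching number
    return lam + mu - np.sqrt(n - 1) - 1

G = nx.petersen_graph()
print(conjecture_lhs(G))  # nonnegative: the Petersen graph is no counter-example
```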
Wagner fixes an enumeration e_1, e_2, ... of the edges E of a complete graph on n vertices. Graphs are generated by playing a single-player game: the player is offered e_1, e_2, etc. and decides at each point whether to accept or reject the edge, the goal being to minimize the left-hand side of (6). A move in the game is given by a 0–1 vector indicating the edges that have been taken so far, together with a vector indicating which edge is under consideration. For example, when n = 4 the pair ((1, 0, 1, 1, 0, 0), (0, 0, 0, 0, 1, 0)) indicates that edge number 5 is under consideration, and that edges 1, 3 and 4 have already been selected, and 2 rejected. Moves are sampled according to a neural
network

(7)\qquad \mu : \mathbb{R}^E \oplus \mathbb{R}^E \to \mathbb{R},

which (after application of sigmoid) gives the probability that we should take the edge under consideration.
Wagner then employs the cross-entropy method to gradually train the neural network. A fixed (and large) number of graphs are sampled according to the neural network (7). Then a fixed percentage (say 10%) of the games resulting in the smallest values of the LHS of (6) are used as training data to update the neural network (7). (That is, we tweak the weights of the neural network to make decisions that result in graphs that are as close as possible to providing a counter-example to (6).) We then repeat. This method eventually finds a counter-example to (6) on 19 vertices. The evolution of graphs sampled from the neural network is shown in Figure 6.1—note how the neural network learns quickly that tree-like graphs do best. Exactly the same method works to discover counter-examples to several other conjectures in combinatorics, see [Wag21].
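The loop itself is simple. The sketch below is a bare-bones cross-entropy method: to keep it short I replace Wagner's neural network policy with a single probability per edge, and sample all edges at once rather than playing the sequential game, so it illustrates the sample/select/update cycle rather than Wagner's full setup (and should not be expected to find his counter-example):

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
n = 19
edges = [(i, j) for i in range(n) for j in range(i + 1, n)]

def score(mask):
    """LHS of (6) for the graph with the selected edges (big penalty if disconnected)."""
    G = nx.Graph([e for e, keep in zip(edges, mask) if keep])
    G.add_nodes_from(range(n))
    if not nx.is_connected(G):
        return 1e9
    lam = max(np.linalg.eigvalsh(nx.to_numpy_array(G)))
    mu = len(nx.max_weight_matching(G, maxcardinality=True))
    return lam + mu - np.sqrt(n - 1) - 1

p = np.full(len(edges), 0.5)                 # the "policy": one probability per edge
for step in range(50):
    masks = rng.random((200, len(edges))) < p     # sample 200 graphs from the policy
    scores = np.array([score(m) for m in masks])
    elite = masks[np.argsort(scores)[:20]]        # keep the 10% with smallest LHS
    p = 0.9 * p + 0.1 * elite.mean(axis=0)        # nudge the policy toward the elite
    print(step, scores.min())                     # watch the LHS shrink
```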
6.2. Combinatorial invariance. [Figure: a Bruhat graph ↔ its Kazhdan–Lusztig polynomial 1 + 3q + q^2]

The combinatorial invariance conjecture predicts that the Kazhdan–Lusztig polynomial of a pair of permutations may be recovered from their Bruhat graph. One interesting aspect of this conjecture is that it is (to the best of my knowledge) a conjecture born of pure empiricism.
For the Bruhat graph, the definition is simple, but the resulting graph is complicated. On the other hand, the definition of the Kazhdan–Lusztig polynomial is complicated, but the resulting polynomial is simple. Thus, there is at least a
passing resemblance to traditional applications of machine learning, where a sim-
ple judgement (e.g. “it’s a cat”) is made from complicated input (e.g. an array of
pixels).
It is natural to use neural networks as a testing ground for this conjecture: if a
neural network can easily predict the Kazhdan-Lusztig polynomial from the Bruhat
graph, perhaps we can too! We trained a neural network to predict Kazhdan-Lusztig
polynomials from the Bruhat graph. We used a neural network architecture known
as a graph neural network, and trained the neural network to predict a probability
distribution on the coefficients of q, q^2, q^3 and q^4.^16 The neural network was trained on ≈ 20,000 Bruhat graphs, and achieved very high accuracy (≈ 98%) after less than a day's training. This provides reasonable evidence that there is some way of
reliably guessing the Kazhdan-Lusztig polynomial from the Bruhat graph.
It is notoriously difficult to go from a trained neural network to some kind of human understanding. One technique for doing so is known as saliency analysis. Recall that neural networks often learn a piecewise linear function, and hence one can take derivatives of the learned function to try to discover which inputs have the most influence on a given output.^17 In our example, saliency analysis provided subgraphs of the original Bruhat graph which appeared to have remarkable "hypercube-like" structure (see Figure 3 and [DVB+21, Figure 5a]). After considerable work this eventually led to a conjecture [BBD+22] which would settle the combinatorial invariance conjecture for symmetric groups if proven, and has stimulated research on this problem from pure mathematicians [GW23, BG23b, BG23a, BM23].
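In its simplest ("vanilla gradient") form, saliency analysis is a few lines in any framework with automatic differentiation. A sketch in PyTorch, with a stand-in model (in [DVB+21] the model is a trained graph neural network; here an untrained MLP serves only to show the mechanics):

```python
import torch

# A stand-in for a trained model mapping input features to predictions.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 5)
)

x = torch.randn(64, requires_grad=True)  # one input (e.g. edge features)
output = model(x)[3]                     # the prediction we want to explain

# Vanilla gradient saliency: d(output)/d(input). Where the learned function
# is piecewise linear, this derivative exists almost everywhere.
output.backward()
saliency = x.grad.abs()

# The inputs with the largest saliency influence the prediction the most;
# in [DVB+21] the highly salient edges traced out hypercube-like subgraphs.
print(saliency.topk(5).indices)
```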
In a parallel development, Davies, Juhász, Lackenby and Tomasev were able to
use saliency analysis to discover a new relationship between the signature and hy-
perbolic invariants of knots [DJLT22]. The machine learning background of both works is explained in [DVB+21]. It would be very interesting to find further examples where saliency leads to new conjectures and theorems.

^16 The coefficient of q^0 is known to always equal 1. In our training sets no coefficients of q^5 or higher occur.
^17 This technique is often called "vanilla gradient" in the literature. Apparently it is very brittle in real-world applications.
6.3. Guiding calculation. Another area where deep learning promises to impact mathematics is in the guiding of calculation. In many settings a computation can be done in many ways. Any choice will lead to a correct outcome, but the choices made may drastically affect the length of the computation. It is interesting to apply deep learning in these settings, as false steps (which deep learning models are bound to make) affect efficiency but not correctness.
Over the last three years there have been several examples of such applications.
In [PSHL20], the authors use a machine learning algorithm to guide selection strate-
gies in Buchberger’s algorithm, which is a central algorithm in the theory of Gröbner
bases in polynomial rings. In [Sim21], Simpson uses deep neural networks to sim-
plify proofs in the classification of nilpotent semi-groups. In [HKS22], the authors
use a deep neural network to predict computation times of period matrices, and
use it to more efficiently compute the periods of certain hypersurfaces in projective
space.
6.4. Prediction. Due to limitations of space, we cannot begin to survey all the
work done in this infant subject. In particular, there has been much work (see
e.g. [BHH+21, BCDL20]) training neural networks to predict difficult quantities in mathematics (e.g. volumes of polytopes, line bundle cohomology, ...).
7. Conclusion
The use of deep learning in pure mathematics is in its infancy. The tools of
machine learning are flexible and powerful, but need expertise and experience to
use. One should not expect things to work “out of the box”. Deep learning has
found applications in several branches of pure mathematics including combinatorics,
representation theory, topology and algebraic geometry. Applications so far support
the thesis that deep learning most usefully aids the more intuitive (“system 1”) parts
of the mathematical process: spotting patterns, deciding where counter-examples
might lie, choosing which part of a calculation to do next. However, the possibilities
do seem endless, and only time will tell.
References
[BBD+22] C. Blundell, L. Buesing, A. Davies, P. Veličković, and G. Williamson. Towards combinatorial invariance for Kazhdan–Lusztig polynomials. Represent. Theory, 26:1145–1191, 2022.
[BCDL20] C. R. Brodie, A. Constantin, R. Deen, and A. Lukas. Machine learning line bundle cohomology. Fortschritte der Physik, 68(1):1900087, 2020.
[BG23a] G. Barkley and C. Gaetz. Combinatorial invariance for elementary intervals. arXiv preprint arXiv:2303.15577, 2023.
[BG23b] G. Barkley and C. Gaetz. Combinatorial invariance for lower intervals using hypercube decompositions. Preprint, 2023.
[BHA+21] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[BHH+21] J. Bao, Y.-H. He, E. Hirst, J. Hofscheier, A. Kasprzyk, and S. Majumder. Polytopes and machine learning. arXiv preprint arXiv:2109.09602, 2021.
[BM23] F. Brenti and M. Marietti. Kazhdan–Lusztig R-polynomials, combinatorial invariance, and hypercube decompositions. Preprint, 2023.
[Tur48] A. M. Turing. Intelligent machinery. In The Essential Turing, pages 395–432. Oxford University Press (reprinted 2004), 1948.
[VSP+17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[Wag21] A. Z. Wagner. Constructions in combinatorics via neural networks. arXiv preprint arXiv:2104.14516, 2021.
University of Sydney, Australia.
Email address: [email protected]