
IS DEEP LEARNING A USEFUL TOOL FOR THE PURE MATHEMATICIAN?

GEORDIE WILLIAMSON

arXiv:2304.12602v1 [math.RT] 25 Apr 2023

Abstract. A personal and informal account of what a pure mathematician might expect when using tools from deep learning in their research.

1. Introduction
Over the last decade, deep learning has found countless applications throughout
industry and science. However, its impact on pure mathematics has been modest.
This is perhaps surprising, as some of the tasks at which deep learning excels—like
playing the board-game Go or finding patterns in complicated structures—appear
to present similar difficulties to problems encountered in research mathematics. On
the other hand, the ability to reason—probably the single most important defining
characteristic of mathematical enquiry—remains a central unsolved problem in artificial intelligence. Thus, mathematics can be seen as an important litmus test of what modern artificial intelligence can and cannot do.
There is great potential for interaction between mathematics and machine learn-
ing.1 However, there is also a lot of hype, and it is easy for the mathematician
to be put off. In my experience, it remains hard to use deep learning to aid my
mathematical research. However it is possible. One also has the sense that the
potential, once the right tools have been uncovered, is significant.
This is a very informal survey of what a working mathematician might expect
when using the tools of deep learning on mathematics problems. I outline some
of the beautiful ideas behind deep learning. I also give some practical hints for
using these tools. I finish with some examples where deep learning has been used
productively in pure mathematics research. (I hope it goes without saying that the
impact of deep learning on applied mathematics has been enormous.)
Finally, in my experience, the more one uses the tools of deep learning, the
more difficult it becomes not to ask oneself foundational questions about why they
work. This raises an entirely different set of questions. Although fascinating, the
mathematical theory of deep learning is not the focus here.
Remark 1.1. The elephant in the room in any discussion today of deep learning is
the recent success of ChatGPT and other large language models. The internet is
full of examples of ChatGPT doing both very well and very poorly on reasoning
and mathematics problems. It seems likely that large language models will be
able to interact well with proof assistants in the near future (see e.g. [HRW+21, JWZ+23]). It is also likely that a greater role will be played in mathematics research
¹In 1948, Turing [Tur48, §6] identifies games, mathematics, cryptography and language translation and acquisition as five “suitable branches of thought” in which experimentation with machine intelligence might be fruitful.

by very large models, possibly with emergent capabilities (“foundation models” in the language of the excellent [BHA+21]). The impacts of such developments on
mathematics are difficult to predict. In this article I will ignore these questions
entirely. Thus I will restrict myself to situations in which deep learning can be used
by mathematicians without access to these large models.
1.1. About the author. I am a pure mathematician, working mostly in geometric representation theory and related fields. I began an ongoing collaboration with DeepMind in 2020, on possible interactions of machine learning and mathematics, and have been fascinated by the subject ever since.²

2. What is a neural network?


Artificial neural networks emulate the biological neural networks present in the
brains of humans and other animals. Typically this emulation takes place on a
computer. The idea of doing so is very natural. See [MP43, Tur48] for remarkable
early accounts.
A cartoon picture of a neuron imagines it as a unit with several inputs and a
single output, which may then be connected to other neurons:

Neurons “fire” by emitting electrical charge along their axon. We may encode the
charges arriving along each node by a real number, in which case the charge emitted
by a neuron is given by

[diagram: a neuron with inputs x_1, …, x_5 and a single output z]

    z = f(∑_i x_i)

where f is a (typically monotone increasing and non-linear) activation function.


Soon we will assume that our activation function is fixed³, however at this level of precision the reader is encouraged to imagine something like f(x) = tanh(x). The activation function is meant to model the non-linear response curves of neurons to stimuli. For example, some neurons may not fire until a certain charge is reached at their source.⁴
²My understanding of this landscape has benefitted enormously from discussions with Charles Blundell, Lars Buesing, Alex Davies, Joel Gibson, Georg Gottwald, Camilo Libedinsky, Sébastien Racaniere, Carlos Simpson, Grzegorz Swirszcz, Petar Veličković, Adam Wagner, Théophane Weber and Greg Yang. Without their help this journey would have been slower and much more painful. This article is based on a lecture given at the Fields Institute Symposium on the future of mathematical research, which was held at the instigation of Akshay Venkatesh.
³and equal to “ReLU”: f(x) = max(0, x)
⁴In biological neural nets there is typically large variation in the responses of neurons to stimuli, depending on where they are in the brain (see e.g. [HW62]). This is one of the many features of biological neural nets that is usually ignored when building artificial neural networks.

Another important feature of neurons is that their firing may be excitatory or inhibitory of downstream neurons to varying degrees. In order to account for this, one allows modification of the input charges via weights (the w_i):

[diagram: the neuron above, now with a weight w_i attached to each input]

(1)    z = f(∑_i w_i x_i)

Thus positive and negative weights correspond to excitatory and inhibitory connec-
tions respectively.
Having settled on a crude mathematical model of a single neuron, we may then assemble them together to form a neural network:

[diagram: a network with 5 input neurons, a layer of 4 neurons and 2 output neurons, with an edge from each neuron to every neuron in the next layer]

Implicit in this picture is the assignment of a weight to each edge. Thus our neural
network yields a function which takes real valued inputs (5 in the above picture),
and outputs real values (2 above), via repeated application of (1) at each node.
This is a good picture for the layperson to have in mind. It is useful to visualize
the complex interconnectedness present in artificial neural networks, as well as the
locality of the computation taking place. However for the mathematician, one can
explain things a little differently. The configuration

[diagram: the bipartite graph of edges between the 5 input neurons and the 4 neurons of the next layer]

is simply a complicated way of drawing a 5 × 4 matrix. In other words, we can rewrite our neural network above economically in the form

    R^5 →^{W_1} R^4 →^f R^4 →^{W_2} R^2 →^f R^2

where the W_i are linear maps determined by matrices of weights, and f is shorthand for the coordinatewise application of our activation function f.
For the purposes of this article, a vanilla neural network⁵ is a gadget of the form

    R^{d_1} →^{A_1} R^{d_2} →^f R^{d_2} →^{A_2} R^{d_3} →^f R^{d_3} →^{A_3} … → R^{d_{ℓ−1}} →^{A_{ℓ−1}} R^{d_ℓ}

where the A_i are affine linear maps. We refer to R^{d_1}, R^{d_2}, …, R^{d_ℓ} as the layers of the network. In order to simplify the discussion, we always assume that our activation function f is given by ReLU (the “rectified linear unit”), that is

    f(∑ λ_i e_i) = ∑ max(λ_i, 0) e_i

where the e_i are standard basis vectors.
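Such a gadget fits in a few lines of NumPy. The sketch below uses the 5 → 4 → 2 network drawn in §2; the 0.1 weight scale and the all-ones input are arbitrary illustrative choices:

```python
import numpy as np

def relu(x):
    # f(sum λ_i e_i) = sum max(λ_i, 0) e_i, applied coordinatewise
    return np.maximum(x, 0.0)

def init_vanilla_net(dims, rng):
    # One affine map A_i(v) = W_i v + c_i per pair of consecutive layers.
    return [(0.1 * rng.standard_normal((m, n)), np.zeros(m))
            for n, m in zip(dims[:-1], dims[1:])]

def forward(params, v):
    # Apply A_1, f, A_2, f, ..., with no activation after the final map.
    for i, (W, c) in enumerate(params):
        v = W @ v + c
        if i < len(params) - 1:
            v = relu(v)
    return v

rng = np.random.default_rng(0)
params = init_vanilla_net([5, 4, 2], rng)   # the 5 -> 4 -> 2 network above
out = forward(params, np.ones(5))
print(out.shape)  # (2,)
```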


Remark 2.1. We make the following remarks:
(1) The attentive reader might have observed a sleight of hand above, where we suddenly allowed affine linear maps in our definition of a vanilla neural net. This can be justified as follows: in biological neural nets both the charge triggering a neuron to fire, as well as the charge emitted, varies across the neural network. This suggests that each activation function should have parameters, i.e. be given by x ↦ f(x + a) + b for varying a, b ∈ R at each node. Things just got a lot more complicated! Affine linear maps circumvent this issue: by allowing affine linear maps one gets the same degree of expressivity, with a much simpler setup.
(2) We only consider ReLU activation functions below. This is one of the standard choices, and provides a useful simplification. However one shouldn’t forget that it is possible to vary activation functions.
(3) We have tried to motivate the above discussion of neural networks as some
imitation of neural activity. It is important to keep in mind that this is a
very loose metaphor at best. However I do find it useful in understanding
and motivating basic concepts. For an excellent account along these lines
by an excellent mathematician, the reader is referred to [Mum20].
(4) The alert reader will notice that we have implicitly assumed above that our
graphs representing neural networks do not have any cycles or loops. This
is again a simplification, and it is desirable in certain situations (e.g. in
recurrent neural networks) to allow loops.
Vanilla neural networks are often referred to as fully-connected because each
neuron is connected to every neuron in the next layer. This is almost opposite to
the situation encountered in the brain, where remarkably sparse neural networks
are found. The connection pattern of neurons is referred to as the architecture of the neural network. As well as vanilla neural networks, important artificial neural
network architectures include convolutional neural networks, graph neural networks
and transformers. Constraints of length prohibit us from discussing these architec-
tures in any depth.
Remark 2.2. More generally, nowadays the term “neural network” is often used to
refer to any program in which the output depends in a smooth way on the input
(and thus the program can be updated via some form of gradient descent). We
ignore this extra generality here.

⁵One often encounters the term “Multi-Layer Perceptron (MLP)” in the literature.

3. Motivation for deep learning


In order to understand deep learning, it is useful to keep in mind the tasks
at which it first excelled. One of the most important such examples is image
classification. For example, we might want to classify hand-written digits:

[image of a hand-written 6] ↦ 6    [image of a hand-written 2] ↦ 2

Here each digit is given as (say) a 28 × 28 matrix of grayscale values between 0 and 255. This is a task which is effortless for us, but is traditionally difficult for computers.
We can imagine that our brain contains a function which sees a hand-written digit and produces a probability distribution on {0, 1, …, 9}, i.e. “what digit we think it is”.⁶ We might attempt to imitate this function with a neural network.
Let us consider a simpler problem in which we try to decide whether a hand-
written digit is a 6 or not:

[image of a 6] ↦ “yes”    [image of another digit] ↦ “no”

We assume that we have “training data” consisting of images labelled by “6” or “not
6”. As a first attempt we might consider a network having a single linear layer:

    R^{28×28} →^A R →^{1/(1+e^{−x})} R.

Here A is affine linear, and the second function (the “logistic function”⁷) is a convenient way of converting an arbitrary real number into a probability. Thus, positive
values of A mean that we think our image is a 6, and negative values of A mean we
think it is not.
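This single-layer classifier is a one-liner in NumPy; in the sketch below the weights are untrained random stand-ins, and the input is a random array standing in for a digit image:

```python
import numpy as np

def sigmoid(z):
    # the logistic function 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def prob_is_six(image, w, b):
    # A(image) = <w, image> + b, then squash the score into (0, 1).
    return sigmoid(np.dot(w, image.ravel()) + b)

rng = np.random.default_rng(0)
w = 0.01 * rng.standard_normal(28 * 28)   # untrained weights
image = rng.random((28, 28))              # stand-in for a grayscale digit
p = prob_is_six(image, w, b=0.0)
print(p)  # a probability strictly between 0 and 1
```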
We will be successful if we can find a hyperplane separating all vectors corre-
sponding to 6’s (red dots) from those that do not represent 6’s (blue dots):

[diagram: red and blue dots in R^{28×28} separated by a hyperplane, with + and − sides marked]

Of course, it may not be possible to find such a hyperplane. Also, even if we


find a hyperplane separating red and blue dots, it is not clear that such a rule
would generalize, to correctly predict whether an unseen image (i.e. image not in
our training data) represents a 6 or not. Remarkably, this technique—known in
machine learning as a Support Vector Machine (SVM)—does work in many simple
learning scenarios. Given training data (e.g. a large set of vectors labelled with

⁶I can convince myself that my brain produces a probability distribution and not a yes/no answer by recalling my efforts to decipher my grandmother’s letters when I was a child.
⁷a.k.a. sigmoid in the machine learning literature

“yes” and “no”) the optimal separating hyperplane may be found easily, by a variant of linear regression.⁸

4. What is deep learning?


Many classification problems are only solvable via a hyperplane after application
of some non-linear function which allows linear separation of the two classes:

[diagram: two interleaved point classes that become linearly separable after a non-linear transformation]

This is where deep learning can come into its own. The idea is that the successive layers of the neural net “learn” more and more complicated features of the input, eventually leading to an easy decision problem.⁹
In the standard setting of supervised learning, we assume the existence of a
function

    φ : R^n → R^m

and know a (usually large) number of its values. The task is to find a reasonable approximation of φ, given these known values. (The reader should keep in mind the motivating problem of the previous section, where one wants to learn a function

    φ : R^{28×28} → R^{10}

giving the probabilities that a certain 28 × 28-pixel grayscale image represents one of the 10 digits 0, 1, …, 9.)
We fix a network architecture, which in our simple setting of a vanilla neural net, means that we fix the number of layers ℓ and layer dimensions n_2, …, n_{ℓ−1}. We then build a neural net (see §2) which serves as our function approximator:

(2)    φ_≈ : R^n →^{A_1} R^{n_2} →^f R^{n_2} →^{A_2} R^{n_3} →^f R^{n_3} →^{A_3} … → R^{n_{ℓ−1}} →^{A_{ℓ−1}} R^m

To begin with, the affine linear maps A_i are initialised via some (usually random) initialization scheme, and hence the function φ_≈ output by our neural network will be random and have no relation to our target function φ. We then measure the distance between our function φ_≈ and φ via some loss function L. (For example, L might be the mean squared distance between the values of φ and φ_≈.¹⁰) A crucial assumption is that this loss function is differentiable in terms of the weights of our neural network. Finally, we perform gradient descent with respect to the loss
⁸For a striking mathematical example of support vector machines see [HK22], where SVMs are trained to distinguish simple and non-simple finite groups, by inspection of their multiplication table.
⁹This idea seems to have been present in the machine learning literature for decades, see e.g. [LBBH98]. It is well explained in [GBC16, §6]. For illustrations of this as well as the connection to fundamental questions in topology, see the work of Olah [Ola14].
¹⁰There are many subtleties here, and a good choice of loss function is one of them. In my limited experience, neural networks do a lot better learning probability distributions than general functions. When learning probability distributions, cross entropy [GBC16, §3.13] is the loss function of choice.

function, in order to update the parameters in (2) to (hopefully) better and better
approximate φ.
In order to get an intuitive picture of what is happening during training, let us assume that m = 1 (so we are trying to learn a scalar function), and that our activation functions are ReLU. Thus φ_≈ is the composition of affine linear and piecewise linear functions, and hence is piecewise linear. As with any piecewise linear function, we obtain a decomposition of R^n into polytopal regions

(3)    [diagram: a decomposition of the plane into polytopal regions]

such that φ_≈ is affine linear on each region. As training progresses, the affine linear functions move in a way similar to the learning of a line of best fit. The complexity comes from the fact that now the regions themselves may also move, disappear or spawn new regions.

Remark 4.1. For an excellent interactive animation of a simple neural network learning a classification task, the reader is urged to experiment with the TensorFlow Playground [SC]. Karpathy’s convolutional neural network demo [Kar] is also illustrative to play with.

Remark 4.2. Some remarks:

(1) Typically, one splits the known values of φ into two disjoint sets, consisting
of training data and test data. Steps of gradient descent are only performed
using the training data. This allows us to periodically check whether our
model is also making reasonable predictions at points not present in the
training data (“test error”).
(2) In most applications of machine learning, the training data is enormous and
feeding it all through the neural network (2) in order to compute the loss
function is unduly expensive. Thus one usually employs stochastic gradient
descent: at every step the gradient of the loss function is evaluated at a
small random subset of the training data.
(3) A traditional paradigm in statistics tries to use a small number of param-
eters to describe a complex data set. A simple example of this paradigm is
a line of best fit. Here the model choice prevents overfitting in which the
model simply “memorizes” all the training data. Deep learning is different,
in that often there are enough parameters to allow overfitting. What is
surprising is that often neural nets generalize well (i.e. don’t overfit) even
though they could in principle.
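Points (1) and (2) of the remark above can be sketched concretely; the synthetic data, the 80/20 split and the batch size 32 below are conventional but arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 known values of phi, split 80/20 into training and test data.
X = rng.standard_normal((1000, 4))
y = X.sum(axis=1)
perm = rng.permutation(len(X))
train_idx, test_idx = perm[:800], perm[800:]

def minibatches(indices, batch_size, rng):
    # Stochastic gradient descent: evaluate the loss gradient on small
    # random subsets of the training data rather than on all of it.
    shuffled = rng.permutation(indices)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

batches = list(minibatches(train_idx, 32, rng))
print(len(batches))   # 25 batches of 32 training points
# Test points are held out: they never contribute a gradient step.
assert set(test_idx).isdisjoint(set(np.concatenate(batches)))
```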

5. Simple examples from pure mathematics


It is important to keep in mind that the main motivating questions for deep
learning research are very different from typical questions arising in pure mathe-
matics. For example, the “recognize a hand-written digit” function considered in
the previous two sections is rather different to the Riemann zeta function!
This means that the mathematician wanting to use machine learning should
keep in mind that they are using tools designed for a very different purpose. The
hype that “neural nets can learn anything” also doesn’t help. The following rules
of thumb are useful to keep in mind when selecting a problem for deep learning:
(1) Noise stable. Functions involved in image and speech recognition motivated much research in machine learning. These functions typically have very high-dimensional input (e.g. R^{100×100} for a square 100 × 100 grayscale image) and are noise stable. For example, we can usually recognise an image or understand speech after the introduction of a lot of noise. Neural nets typically do poorly on functions which are very noise-sensitive.¹¹
(2) High dimensional. If one thinks of a neural network as a function approx-
imator, it is a function approximator that comes into its own on high-
dimensional input. These are the settings in which traditional techniques
like Fourier series break down, due to the curse of dimensionality. Deep
learning should be considered when the difficulty comes from the dimen-
sionality, rather than from the inherent complexity of the function.
(3) Unit cube. Returning to our (unreliable) analogy with biological neural nets,
one expects all charges occurring in the brain to belong to some fixed small
interval. The same is true of artificial neural networks: they perform best
when all real numbers encountered throughout the network from input to
output belong to some bounded interval. Deep learning packages are often
written assuming that the inputs belong to the unit cube r0, 1sn Ă Rn .
(4) Details matter. Design choices like network architecture and size, initial-
ization scheme, choice of learning rate (i.e. step size of gradient descent),
choice of optimizer etc. matter enormously. It is also important how the
inputs to the neural network are encoded as vectors in Rn (the representa-
tion).12
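The unit-cube point is easy to honour in practice with min-max rescaling; a minimal sketch, with made-up input values:

```python
import numpy as np

def to_unit_cube(X):
    # Min-max rescaling of each input coordinate into [0, 1].
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

# Inputs on very different scales (made-up values).
X = np.array([[0.0, 255.0], [128.0, 0.0], [255.0, 64.0]])
Z = to_unit_cube(X)
print(Z.min(), Z.max())  # 0.0 1.0
```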
With these rules of thumb in mind we will now discuss three examples in pure
mathematics.
5.1. Learning the parity bit. Consider the parity bit function

    σ : {0, 1}^m → {0, 1}
    (x_i) ↦ ∑_{i=1}^{m} x_i mod 2.

We might be tempted to use a neural network to try to learn a function

    σ_≈ : R^m → R
¹¹This point should be read with some caution. For example, evaluation of board positions in Go is not a particularly noise-stable problem.
¹²It seems silly to have to write that details matter in any technical subject. However many people I have spoken to are under the false impression that one model works for everything, and that training happens “out of the box” and is easy. For an excellent and honest summary by an expert of the difficulties encountered when training large models, see [Kar19].

which agrees with σ under the natural embedding {0, 1}^m ⊂ R^m.

This is a classic problem in machine learning [MP17, I §3.1]. It generalizes the problem of learning the XOR function (the case m = 2), which is one of the simplest problems which cannot be learned without non-linearities. There exist elegant neural networks extending σ to the unit cube, and given a large proportion (e.g. 50%) of the set {0, 1}^m a neural network can be trained to express σ [RHW85, pp. 14-16]. However, given only a small proportion of the values of σ (e.g. 10% for m = 10) a vanilla neural network will not reliably generalize to all values of σ (for experiments, see [GGW22, ‘Playing with parity’]).
The issue here is that σ is highly noise sensitive. (Indeed, σ is precisely the
checksum of signal processing!) This is an important example to keep in mind, as
many simple functions in pure mathematics resemble σ. For example, see [GGW22,
Week 2] where we attempt (without much luck!) to train a neural network to learn
the Möbius function from number theory.
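The noise-sensitivity of σ is easy to demonstrate: flipping any single input bit always flips the output, the exact opposite of the stability enjoyed by image recognition:

```python
import numpy as np

def parity(bits):
    # σ((x_i)) = Σ x_i mod 2
    return int(np.sum(bits)) % 2

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=10)
flipped = x.copy()
flipped[3] ^= 1               # corrupt a single input bit

# Maximal noise-sensitivity: one flipped bit always flips the output.
print(parity(x), parity(flipped))
assert parity(flipped) == 1 - parity(x)
```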

5.2. Learning descent sets. Consider the symmetric group Σ_n consisting of all permutations of 1, 2, …, n. Given a permutation we can consider its left and right descent sets:

(4)    L(x) = {1 ≤ i < n | x^{−1}(i) > x^{−1}(i+1)},
(5)    R(x) = {1 ≤ i < n | x(i) > x(i+1)}.

Obviously, L(x^{−1}) = R(x) and R(x^{−1}) = L(x). The left and right descent sets are important invariants of a permutation.
It is interesting to see whether a neural network can be trained to learn the left and right descent sets. In other words, we would like to train a neural network

    φ_≈ : R^n → R^{n−1}

which given the vector (x(1), x(2), …, x(n)) returns a sequence of n−1 probabilities giving whether or not 1 ≤ i < n belongs to the left (resp. right) descent set.

This example is interesting in that (5) implies that the right descent set can be predicted perfectly with a single linear layer. More precisely, if we consider

    γ : R^n → R^{n−1}
    (v_1, …, v_n) ↦ (v_1 − v_2, v_2 − v_3, …, v_{n−1} − v_n)

then the i-th coordinate of γ evaluated on a permutation (x(1), …, x(n)) is positive if and only if i ∈ R(x). On the other hand, it seems much harder to hand craft a neural network which extracts the left descent set from (x(1), …, x(n)).
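The definitions (4) and (5) and the single linear layer γ can be checked in a few lines; the use of one-line notation for permutations below is an illustrative choice:

```python
import numpy as np

def right_descents(x):
    # R(x) = {1 <= i < n | x(i) > x(i+1)}, with x in one-line notation.
    return {i + 1 for i in range(len(x) - 1) if x[i] > x[i + 1]}

def left_descents(x):
    # L(x) = R(x^{-1})
    inverse = list(np.argsort(np.asarray(x) - 1) + 1)
    return right_descents(inverse)

def gamma(v):
    # γ(v_1, ..., v_n) = (v_1 - v_2, ..., v_{n-1} - v_n): one linear layer.
    return [v[i] - v[i + 1] for i in range(len(v) - 1)]

x = [3, 1, 4, 2]   # the permutation 1 ↦ 3, 2 ↦ 1, 3 ↦ 4, 4 ↦ 2
print(right_descents(x), left_descents(x))  # {1, 3} {2}
# The i-th coordinate of γ is positive exactly when i is a right descent.
assert {i + 1 for i, g in enumerate(gamma(x)) if g > 0} == right_descents(x)
```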
This might lead us to guess that a neural network will have a much easier time
learning the right descent set than the left descent set. This turns out to be the
case, and the difference is dramatic: a vanilla neural network with two hidden layers of dimensions 500 and 100 learns to predict right descent sets for n = 35 with high accuracy after a few seconds. By contrast, the same network struggles to get even a single correct answer for the left descent set, after significant training!¹³ It is striking that using permutation matrices as inputs rather than the vectors (x(1), …, x(n)) gives perfect symmetry in training between left and right.¹⁴ The issue here is the

¹³Decreasing n and allowing longer training suggests that the network can learn the left descent set, however it is much harder.
¹⁴For a colab containing all of these experiments, see [GGW22, Classifying descent sets in S_n].

representation: how the model receives its input can have a dramatic effect on
model performance.

5.3. Transformers and linear algebra. Our final example is much more sophisti-
cated, and illustrates how important the choice of training data can be. It also
shows how surprising the results of training large neural networks can be.
A transformer is a neural network architecture which first emerged in machine translation [VSP+17]. We will not go into any detail about the transformer architecture here, except to say that it is well-suited to tasks where the input and output are sequences of tokens (“sequence to sequence” tasks):

    “xyz” → [transformer] → “abc”

More precisely, the input sequence (“xyz”) determines a probability distribution


over all tokens. We then sample from this distribution to obtain the first token
(“a”). Now the input together with the sequence sampled so far (“xyz” + “a”) provides a new distribution over tokens, from which we sample our second token (“b”), etc.
In a recent work [Cha21] Charton trains a transformer to perform various tasks
in linear algebra: matrix transposition, matrix addition, matrix multiplication,
determination of eigenvalues, determination of eigenvectors etc. For example, the
eigenvalue task is regarded as the “translation”:

    real 5 × 5 symmetric matrix → list of eigenvalues
    M = (m_11, m_12, m_13, …, m_55) → [transformer] → λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_5.

Charton considers real symmetric matrices, all of whose entries are signed floating point numbers with three significant figures and exponent lying between −100 and 100.¹⁵ The transformer obtains impressive accuracy on most linear algebra tasks.
What is remarkable is that for the transformer the entries of the matrix (e.g. 3.14,
-27.8, 0.000132, . . . ) are simply tokens—the transformer doesn’t “know” that 3.14
is close to 3.13, or that both are positive; it doesn’t even “know” that its tokens
represent numbers!
Another remarkable aspect of this work concerns generalization. A model trained on Wigner matrices (e.g. entries sampled uniformly from [−10, 10]) does not generalize well at all to matrices with positive eigenvalues. On the other hand, a model
trained on matrices with eigenvalues sampled from a Laplace distribution (which
has heavy tails) does generalize to matrices whose eigenvalues are all positive, even
though it has not seen a single such matrix during training! The interested reader
is referred to Charton’s paper [Cha21] (in particular Table 12) and his lecture on
youtube [Cha22].
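A toy version of such an encoding makes the point concrete. The sketch below is illustrative only, not Charton's actual scheme: each number becomes a sign token, a three-digit mantissa token and an exponent token, and nothing about the tokens themselves tells the model that they encode numbers:

```python
import math

def tokenize(v):
    # Toy sign / mantissa / exponent encoding -- an illustrative sketch,
    # NOT Charton's exact scheme: three significant figures, one token
    # each for the sign, the mantissa and the exponent.
    if v == 0:
        return ["+", "000", "E0"]
    sign = "+" if v > 0 else "-"
    e = math.floor(math.log10(abs(v)))
    mantissa = round(abs(v) / 10 ** e * 100)   # an integer in 100..999
    return [sign, str(mantissa), f"E{e}"]

print(tokenize(3.14), tokenize(-27.8), tokenize(0.000132))
# To the transformer these are opaque tokens: nothing marks "314" as a
# number, or as being close to "313".
```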

6. Examples from research mathematics


We now turn to some examples where deep learning has been used in pure
mathematics research.

¹⁵Charton considers various encodings of these numbers via sequences of tokens of various lengths, see [Cha21].

Figure 1. The evolution of graphs towards Wagner’s counter-example, from [Wag21].

6.1. Counter-examples in combinatorics. One can dream that deep learning might
one day provide a mathematician’s “bicycle for the mind”: an easy to use and
flexible framework for exploring possibilities and potential counter-examples. (I
have certainly lost many days trying to prove a statement that turned out to be
false, with the counter-example lying just beyond my mental horizon.)
We are certainly not there yet, but the closest we have come to witnessing such
a framework is provided in the work of Adam Wagner [Wag21]. He focuses on con-
jectures of the form: over all combinatorial structures X, an associated numerical
quantity Z is bounded by B. He considers situations where there is some simple recipe for generating objects in X, and where the numerical quantity Z is efficiently computable.
For example, a conjecture in graph theory states that for any connected graph G on n ≥ 3 vertices, with largest eigenvalue λ and matching number µ we have

(6)    λ + µ − √(n − 1) − 1 ≥ 0.

(It is not important for this discussion to know what the matching number or largest eigenvalue are!)

Wagner fixes an enumeration e_1, e_2, … of the edges E in a complete graph on n vertices. Graphs are generated by playing a single player game: the player is offered
e_1, e_2 etc. and decides at each point whether to accept or reject the edge, the goal being to minimize (6). A move in the game is given by a 01-vector indicating edges that have been taken so far, together with a vector indicating which edge is under consideration. For example, when n = 4 the pair ((1, 0, 1, 1, 0, 0), (0, 0, 0, 0, 1, 0)) indicates that edge number 5 is under consideration, and that edges 1, 3 and 4 have already been selected, and 2 rejected. Moves are sampled according to a neural network

(7)    µ : R^E ⊕ R^E → R,
which (after application of sigmoid) gives the probability that we should take the
edge under consideration.
Wagner then employs the cross entropy method to gradually train the neural
network. A fixed (and large) number of graphs are sampled according to the neural
network (7). Then a fixed percentage (say 10%) of the games resulting in the
smallest values of the LHS of (6) are used as training data to update the neural
network (7). (That is, we tweak the weights of the neural network to make decisions that result in graphs that are as close as possible to providing a counter-example to (6).) We then repeat. This method eventually finds a counter-example to (6) on 19 vertices. The evolution of graphs sampled from the neural network is shown in Figure 1—note how the neural network learns quickly that tree-like graphs do best. Exactly the same method works to discover counter-examples to several other conjectures in combinatorics, see [Wag21].
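What makes the game playable at scale is that the quantity being minimized is cheap to evaluate. A sketch of the left-hand side of (6) for small graphs follows; the brute-force matching number is an illustrative shortcut, not Wagner's implementation:

```python
import itertools
import numpy as np

def conjecture_lhs(n, edges):
    # Left-hand side of (6): λ + μ − sqrt(n − 1) − 1 for a graph on
    # vertices 0, ..., n-1.
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    lam = np.linalg.eigvalsh(A)[-1]        # largest eigenvalue
    # Matching number μ by brute force -- fine for tiny graphs, and an
    # illustrative shortcut rather than Wagner's implementation.
    mu = 0
    for k in range(len(edges), 0, -1):
        if any(len({w for e in sub for w in e}) == 2 * k
               for sub in itertools.combinations(edges, k)):
            mu = k
            break
    return lam + mu - np.sqrt(n - 1) - 1

# A path on 4 vertices: λ = (1 + sqrt(5)) / 2, μ = 2.
val = conjecture_lhs(4, [(0, 1), (1, 2), (2, 3)])
print(val)  # positive: no counter-example here
```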

6.2. Conjecture generation. The combinatorial invariance conjecture is a conjecture in representation theory which was proposed by Lusztig and Dyer in the early 1980s [Bre04]. To any pair of permutations x, y ∈ Σ_n in the symmetric group one may associate two objects: the Bruhat graph (a directed graph); and the Kazhdan-Lusztig polynomial (a polynomial in q), see Figure 2 for an example of both. The conjecture states that an isomorphism between Bruhat graphs implies equality between Kazhdan-Lusztig polynomials. A more optimistic version of this conjecture asks for a recipe which computes the Kazhdan-Lusztig polynomial from the Bruhat

[diagram: a Bruhat graph] ↔ 1 + 3q + q^2

Figure 2. Bruhat interval and Kazhdan-Lusztig polynomial for the pair of permutations x = (1, 3, 2, 5, 4, 6) and y = (3, 4, 5, 6, 1, 2) in Σ_6, from [BBD+22].

Figure 3. Bruhat interval pre and post saliency analysis.

graph. One interesting aspect of this conjecture is that it is (to the best of my
knowledge) a conjecture born of pure empiricism.
For the Bruhat graph, the definition is simple, but the resulting graph is com-
plicated. On the other hand, the definition of the Kazhdan-Lusztig polynomial is
complicated, however the resulting polynomial is simple. Thus, there is at least a
passing resemblance to traditional applications of machine learning, where a sim-
ple judgement (e.g. “it’s a cat”) is made from complicated input (e.g. an array of
pixels).
It is natural to use neural networks as a testing ground for this conjecture: if a
neural network can easily predict the Kazhdan-Lusztig polynomial from the Bruhat
graph, perhaps we can too! We trained a neural network to predict Kazhdan-Lusztig
polynomials from the Bruhat graph. We used a neural network architecture known as a graph neural network, and trained the neural network to predict a probability distribution on the coefficients of q, q^2, q^3 and q^4.¹⁶ The neural network was trained on ≈ 20,000 Bruhat graphs, and achieved very high accuracy (≈ 98%) after less than a day’s training. This provides reasonable evidence that there is some way of reliably guessing the Kazhdan-Lusztig polynomial from the Bruhat graph.
It is notoriously difficult to go from a trained neural network to some kind of
human understanding. One technique to do so is known as saliency analysis. Re-
call that neural networks often learn a piecewise linear function, and hence one can
take derivatives of the learned function to try to determine which inputs have the
most influence on a given output.^17 In our example, saliency analysis provided subgraphs
of the original Bruhat graph which appeared to have remarkable "hypercube"-like
structure (see Figure 3 and [DVB+21, Figure 5a]). After considerable work this
eventually led to a conjecture [BBD+22] which would settle the combinatorial
invariance conjecture for symmetric groups if proven, and has stimulated research
on this problem from pure mathematicians [GW23, BG23b, BG23a, BM23].
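As a toy illustration of this "vanilla gradient" technique, consider a one-hidden-layer ReLU network (the network and input below are random stand-ins, not the model from the experiments). Since such a network is piecewise linear, its input gradient can be written down exactly on each linear region, and the coordinates with the largest gradients are the most "salient":

```python
import numpy as np

def forward(x, W1, b1, w2):
    # one hidden ReLU layer: the output is piecewise linear in x
    return w2 @ np.maximum(W1 @ x + b1, 0.0)

def vanilla_gradient(x, W1, b1, w2):
    # on the current linear region, d(output)/dx = (w2 * active) @ W1,
    # where `active` marks hidden units with positive pre-activation
    active = (W1 @ x + b1 > 0).astype(float)
    return (w2 * active) @ W1

rng = np.random.default_rng(1)
n_in, n_hidden = 8, 16
W1 = rng.standard_normal((n_hidden, n_in))
b1 = rng.standard_normal(n_hidden)
w2 = rng.standard_normal(n_hidden)
x = rng.standard_normal(n_in)

g = vanilla_gradient(x, W1, b1, w2)
salient = np.argsort(-np.abs(g))       # inputs ranked by influence on the output
```

In the Bruhat graph setting the inputs are (roughly) edges of the graph, and keeping only the most salient edges produced the hypercube-like subgraphs mentioned above.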
In a parallel development, Davies, Juhász, Lackenby and Tomasev were able to
use saliency analysis to discover a new relationship between the signature and hy-
perbolic invariants of knots [DJLT22]. The machine learning background of both
^16 The coefficient of q^0 is known to always equal 1. In our training sets no coefficients of q^5
or higher occur.
^17 This technique is often called "vanilla gradient" in the literature. Apparently it is very brittle
in real-world applications.

works is explained in [DVB+21]. It would be very interesting to find further
examples where saliency leads to new conjectures and theorems.

6.3. Guiding calculation. Another area where deep learning shows promise is in
guiding calculation. In many settings a computation can be done in many ways.
Any choice will lead to a correct outcome, but the choice made may drastically
affect the length of the computation. It is interesting to apply deep learning in
these settings, as false steps (which deep learning models are bound to make)
affect efficiency but not accuracy.
Over the last three years there have been several examples of such applications.
In [PSHL20], the authors use a machine learning algorithm to guide selection strate-
gies in Buchberger’s algorithm, which is a central algorithm in the theory of Gröbner
bases in polynomial rings. In [Sim21], Simpson uses deep neural networks to
simplify proofs in the classification of nilpotent semigroups. In [HKS22], the authors
use a deep neural network to predict computation times of period matrices, and
use it to more efficiently compute the periods of certain hypersurfaces in projective
space.
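The flavour of these applications can be seen in a toy analogue of my own (not taken from any of the papers above): reducing a list of integers to their gcd by repeated Euclidean steps. As with pair selection in Buchberger's algorithm, every strategy for choosing the next pair reaches the same answer; a good strategy, learned or hand-crafted, simply gets there in fewer steps.

```python
import math
from functools import reduce

def gcd_all(nums, choose):
    """Reduce a list of positive integers to their gcd.

    `choose` decides which pair of indices to reduce next.  Any
    strategy yields the same answer; only the step count differs,
    mirroring selection strategies in Buchberger's algorithm.
    """
    nums = list(nums)
    steps = 0
    while len(nums) > 1:
        i, j = choose(nums)
        a, b = max(nums[i], nums[j]), min(nums[i], nums[j])
        rest = [x for k, x in enumerate(nums) if k not in (i, j)]
        r = a % b                      # one Euclidean step
        nums = rest + [b] + ([r] if r else [])
        steps += 1
    return nums[0], steps

def first_pair(nums):
    # naive strategy: always reduce the first two entries
    return 0, 1

def min_max(nums):
    # heuristic strategy: reduce the largest entry by the smallest
    j = min(range(len(nums)), key=nums.__getitem__)
    i = max((k for k in range(len(nums)) if k != j), key=nums.__getitem__)
    return i, j

data = [1071, 462, 1155, 3003]
g1, s1 = gcd_all(data, first_pair)
g2, s2 = gcd_all(data, min_max)
assert g1 == g2 == reduce(math.gcd, data)   # correctness is strategy-independent
print(s1, s2)                               # the step counts differ
```

Replacing a hand-crafted `choose` with a learned scoring model is, in spirit, what [PSHL20] does for S-pair selection: a poor model wastes time but never produces a wrong answer.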

6.4. Prediction. Due to limitations of space, we cannot begin to survey all the
work done in this infant subject. In particular, there has been much work (see
e.g. [BHH+21, BCDL20]) training neural networks to predict difficult quantities in
mathematics (e.g. volumes of polytopes, line bundle cohomology, ...).

7. Conclusion
The use of deep learning in pure mathematics is in its infancy. The tools of
machine learning are flexible and powerful, but need expertise and experience to
use. One should not expect things to work “out of the box”. Deep learning has
found applications in several branches of pure mathematics including combinatorics,
representation theory, topology and algebraic geometry. Applications so far support
the thesis that deep learning most usefully aids the more intuitive (“system 1”) parts
of the mathematical process: spotting patterns, deciding where counter-examples
might lie, choosing which part of a calculation to do next. However, the possibilities
do seem endless, and only time will tell.

References
[BBD+22] C. Blundell, L. Buesing, A. Davies, P. Veličković, and G. Williamson. Towards
combinatorial invariance for Kazhdan-Lusztig polynomials. Represent. Theory,
26:1145–1191, 2022.
[BCDL20] C. R. Brodie, A. Constantin, R. Deen, and A. Lukas. Machine learning line bundle
cohomology. Fortschritte der Physik, 68(1):1900087, 2020.
[BG23a] G. Barkley and C. Gaetz. Combinatorial invariance for elementary intervals. arXiv
preprint arXiv:2303.15577, 2023.
[BG23b] G. Barkley and C. Gaetz. Combinatorial invariance for lower intervals using hypercube
decompositions, 2023.
[BHA+21] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S.
Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. On the opportunities and risks of
foundation models. arXiv preprint arXiv:2108.07258, 2021.
[BHH+21] J. Bao, Y.-H. He, E. Hirst, J. Hofscheier, A. Kasprzyk, and S. Majumder. Polytopes
and machine learning. arXiv preprint arXiv:2109.09602, 2021.
[BM23] F. Brenti and M. Marietti. Kazhdan–Lusztig R-polynomials, combinatorial invariance,
and hypercube decompositions. Preprint, 2023.
[Bre04] F. Brenti. Kazhdan-Lusztig polynomials: history, problems, and combinatorial
invariance. Sém. Lothar. Combin., 49:Art. B49b, 30, 2002/04.
[Cha21] F. Charton. Linear algebra with transformers. CoRR, abs/2112.01898, 2021.
[Cha22] F. Charton. Math with Transformers. https://www.youtube.com/watch?v=81o-Uiop5CA,
October 2022. Accessed on 20 March, 2023.
[DJLT22] A. Davies, A. Juhász, M. Lackenby, and N. Tomasev. The signature and cusp geometry
of hyperbolic knots. Geometry and Topology, 2022.
[DVB+21] A. Davies, P. Veličković, L. Buesing, S. Blackwell, D. Zheng, N. Tomašev, R. Tanburn,
P. Battaglia, C. Blundell, A. Juhász, M. Lackenby, G. Williamson, D. Hassabis,
and P. Kohli. Advancing mathematics by guiding human intuition with AI. Nature,
600(7887):70–74, 2021.
[GBC16] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT Press, 2016.
[GGW22] J. Gibson, G. Gottwald, and G. Williamson. Machine Learning for the Working
Mathematician. https://sites.google.com/view/mlwm-seminar-2022, June 2022.
Accessed on 20 March, 2023.
[GW23] M. Gurevich and C. Wang. Parabolic recursions for Kazhdan-Lusztig polynomials and
the hypercube decomposition. arXiv preprint arXiv:2303.09251, 2023.
[HK22] Y.-H. He and M. Kim. Learning algebraic structures: preliminary investigations.
International Journal of Data Science in the Mathematical Sciences, pages 1–20, 2022.
[HKS22] K. Heal, A. Kulkarni, and E. C. Sertöz. Deep learning Gauss–Manin connections.
Advances in Applied Clifford Algebras, 32(2):24, 2022.
[HRW+21] J. M. Han, J. Rute, Y. Wu, E. W. Ayers, and S. Polu. Proof artifact co-training for
theorem proving with language models. arXiv preprint arXiv:2102.06203, 2021.
[HW62] D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional
architecture in the cat's visual cortex. The Journal of Physiology, 160(1):106, 1962.
[JWZ+23] A. Q. Jiang, S. Welleck, J. P. Zhou, W. Li, J. Liu, M. Jamnik, T. Lacroix, Y. Wu, and
G. Lample. Draft, sketch, and prove: Guiding formal theorem provers with informal
proofs, 2023.
[Kar] A. Karpathy. Convnet Javascript Demo.
https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html. Accessed
on 18 March, 2023.
[Kar19] A. Karpathy. A Recipe for Training Neural Networks.
http://karpathy.github.io/2019/04/25/recipe/, April 25, 2019. Accessed on 20
March, 2023.
[LBBH98] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[MP43] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous
activity. The Bulletin of Mathematical Biophysics, 5:115–133, 1943.
[MP17] M. Minsky and S. A. Papert. Perceptrons: An Introduction to Computational
Geometry. Reissue of the 1988 expanded edition with a new foreword by Léon Bottou.
MIT Press, 2017.
[Mum20] D. Mumford. The Astonishing Convergence of AI and the Human Brain, October 1,
2020. Accessed on 13 March, 2023.
[Ola14] C. Olah. Neural Networks, Manifolds, and Topology.
https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/, April 6, 2014.
Accessed on 18 March, 2023.
[PSHL20] D. Peifer, M. Stillman, and D. Halpern-Leistner. Learning selection strategies in
Buchberger's algorithm. In International Conference on Machine Learning, pages
7575–7585. PMLR, 2020.
[RHW85] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations
by error propagation. Technical report, California Univ San Diego La Jolla Inst for
Cognitive Science, 1985.
[SC] D. Smilkov and S. Carter. The Tensorflow Playground. https://playground.tensorflow.org.
Accessed on 18 March, 2023.
[Sim21] C. Simpson. Learning proofs for the classification of nilpotent semigroups. arXiv
preprint arXiv:2106.03015, 2021.
[Tur48] A. M. Turing. The Essential Turing, chapter Intelligent machinery, pages 395–432.
Oxford University Press (reprinted 2004), 1948.
[VSP+17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser,
and I. Polosukhin. Attention is all you need. Advances in Neural Information
Processing Systems, 30, 2017.
[Wag21] A. Z. Wagner. Constructions in combinatorics via neural networks. arXiv preprint
arXiv:2104.14516, 2021.

University of Sydney, Australia.
Email address: [email protected]
