Neural Algorithmic Reasoning (2021)
DeepMind
Abstract
Algorithms have been fundamental to recent global technological advances
and, in particular, they have been the cornerstone of technical advances
in one field rapidly being applied to another. We argue that algorithms
possess fundamentally different qualities to deep learning methods, and
this strongly suggests that, were deep learning methods better able to
mimic algorithms, generalisation of the sort seen with algorithms would
become possible with deep learning—something far out of the reach of
current machine learning methods. Furthermore, by representing elements
in a continuous space of learnt algorithms, neural networks are able to
adapt known algorithms more closely to real-world problems, potentially
finding more efficient and pragmatic solutions than those proposed by
human computer scientists.
Here we present neural algorithmic reasoning—the art of building neural networks that are able to execute algorithmic computation—and provide our opinion on its transformative potential for running classical algorithms on inputs previously considered inaccessible to them.
concern is: will it work in a new situation? In other words, given the training data, will the deep learning method generalise to the new situation? Under certain assumptions such guarantees can be given, but so far only in simple cases.
Algorithms, on the other hand, typically come with strong general guarantees. The invariances of an algorithm can be stated as a precondition and a postcondition, together with how its time and space complexity scale with input size. The precondition states what the algorithm assumes to be true about its inputs, and the postcondition states what the algorithm can then guarantee about its outputs after execution. For example, the precondition of a sorting algorithm may specify what kind of input it expects (e.g., a finite list of integers allocated in memory it can modify), and the postcondition might state that, after execution, the input memory location contains the same integers, but in ascending order.
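For illustration, this sorting contract can be written as executable checks; a minimal sketch (the helper names `check_pre` and `check_post` are ours, not standard):

```python
def check_pre(xs):
    # Precondition: the input is a finite list of integers.
    return isinstance(xs, list) and all(isinstance(x, int) for x in xs)

def check_post(xs, ys):
    # Postcondition: the output holds the same integers, in ascending order.
    return ys == sorted(xs) and all(a <= b for a, b in zip(ys, ys[1:]))

def guaranteed_sort(xs):
    assert check_pre(xs), "precondition violated"
    ys = sorted(xs)  # any correct sorting algorithm satisfies the contract
    assert check_post(xs, ys), "postcondition violated"
    return ys
```

For instance, `guaranteed_sort([3, -1, 2])` returns `[-1, 2, 3]`, and the contract holds for every valid input, of any size—precisely the kind of guarantee a learned model cannot currently offer.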
Even with something as elementary as sorting, neural networks cannot provide guarantees of this kind: neural networks can be demonstrated to work on certain problem instances, and to generalise to certain larger instances than those seen in the training data. Unlike good sorting algorithms, there is no guarantee they will work for all problem sizes, nor even on all inputs of a given size.
Algorithms and the predictions or decisions learnt by deep learning have very
different properties—the former provide strong guarantees but are inflexible to
the problem being tackled, whilst the latter provide few guarantees but can
adapt to a wide range of problems. Understandably, work has considered how
to get the best of both. Induction of algorithms from data would have significant implications in computer science: better approximations to intractable problems, previously intractable problems shown to be tractable in practice, and algorithms optimised directly for the hardware executing them, with little or no human intervention.
Already several approaches have been explored for combining deep learning
and algorithms. Inspired by deep reinforcement learning, deep learning methods can be trained to use existing, known algorithms as fixed external tools
[Reed and De Freitas, 2015, Li et al., 2020]. This very promising approach
works well—when the existing known algorithms fit the problem at hand. This
is somewhat reminiscent of the software engineer wiring together a collection of
known algorithms. An alternative approach is to teach deep neural networks to
imitate the workings of an existing algorithm, by producing the same output,
and in the strongest case by replicating the same intermediate steps [Graves
et al., 2014, Kaiser and Sutskever, 2015, Kurach et al., 2015, Veličković et al.,
2020]. In this form, the algorithm itself is encoded directly into the neural
network before it is executed. This more fluid representation of the algorithm
allows learning to adapt the internal mechanisms of the algorithm itself via feed-
back from data. Furthermore, a single network may be taught multiple known
algorithms and abstract commonalities among them [Veličković et al., 2019],
allowing novel algorithms to be derived. Both of these approaches build atop
known algorithms. In the former case, new combinations of existing algorithms
can be learnt. Excitingly, in the latter case, new variants or adaptations of
algorithms can be learnt, as the deep neural network is more malleable than the
original algorithm.
At present, in computer science, a real-world problem is solved by first fitting the problem to a known class of problems (such as sorting all numbers), and then choosing an appropriate algorithm for this known problem class. This known problem class may actually be larger than that exhibited by the real-world problem, and so the chosen algorithm may be suboptimal in practice (for example, the known problem class may be NP-hard, but all real-world examples may actually lie in P, and so can be solved in polynomial time). Instead, by combining deep learning and algorithms, an algorithm can be fit directly to the real-world problem, without the need for the intermediate proxy problem.
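As a toy illustration of this point (our example, not one from the text): generic comparison sorting pays O(n log n) for the broad problem class "sort any list", but if every real-world input happens to contain small non-negative integers below a known bound k, a counting sort fitted to that narrower class runs in O(n + k):

```python
def counting_sort(xs, k):
    # Relies on the precondition of the *narrower* problem class:
    # every element satisfies 0 <= x < k. Under that assumption this
    # runs in O(n + k), beating the O(n log n) bound for general sorting.
    counts = [0] * k
    for x in xs:
        counts[x] += 1
    out = []
    for value, count in enumerate(counts):
        out.extend([value] * count)
    return out
```

The narrower the class an algorithm is fitted to, the cheaper it can be—which is exactly the specialisation a learned algorithm could discover automatically from the real data distribution.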
tested mathematical model or formula that includes all of the variations and imponderables that must be weighed.* Even when the individual has been closely associated with the particular territory he is evaluating, the final answer, however accurate, is largely one of judgment and experience.”
[Figure 1 diagram: abstract inputs x̄ pass through encoder f, processor P, and decoder g to produce A(x̄); natural inputs x pass through encoder f̃, the same processor P, and decoder g̃ to produce y.]
Figure 1: The blueprint of neural algorithmic reasoning. We assume that our real-world problem requires learning a mapping from natural inputs, x, to natural outputs, y—for example, fastest-time routing based on real-time traffic information. Note that natural inputs are often high-dimensional, noisy, and prone to changing rapidly—as is often the case in the traffic example. Further, we assume that solving our problem would benefit from applying an algorithm, A—however, A only operates over abstract inputs, x̄. In this case, A could be Dijkstra’s algorithm for shortest paths [Dijkstra et al., 1959], which operates over weighted graphs with exactly one scalar weight per edge, producing the shortest path tree. First, an algorithmic reasoner is trained to imitate A, learning a function g(P(f(x̄))), optimising it to be close to ground-truth abstract outputs, A(x̄). P is a processor network operating in a high-dimensional latent space, which, if trained correctly, will be able to imitate the individual steps of A. f and g are encoder and decoder networks, respectively, designed to carry abstract data to and from P’s latent input space. Once trained, we can replace f and g with f̃ and g̃—encoders and decoders designed to process natural inputs into the latent space of P and decode P’s representations into natural outputs, respectively. Keeping P’s parameters fixed, we can then learn a function g̃(P(f̃(x))), giving us an end-to-end differentiable function from x to y without any low-dimensional bottlenecks—hence it is a great target for neural network optimisation.
The blueprint of neural algorithmic reasoning. Having motivated the use of neural algorithmic executors, we can now demonstrate an elegant neural end-to-end pipeline which goes straight from raw inputs to general outputs, while emulating an algorithm internally. The general procedure for applying an algorithm A (which admits abstract inputs x̄) to raw inputs x is as follows (following Figure 1):
1. Learn an algorithmic reasoner for A, by learning to execute it on synthetically generated inputs, x̄. This yields functions f, P, g such that g(P(f(x̄))) ≈ A(x̄). f and g are encoder/decoder functions, designed to carry data to and from the latent space of P (the processor network).
2. Set up appropriate encoder and decoder neural networks, f̃ and g̃, to process raw data and produce desirable outputs. The encoder should produce embeddings that correspond to the input dimension of P, while the decoder should operate over input embeddings that correspond to the output dimension of P.
3. Swap out f and g for f̃ and g̃, and learn their parameters by gradient descent on any differentiable loss function that compares g̃(P(f̃(x))) to ground-truth outputs, y. The parameters of P should be kept frozen.
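The three steps above can be sketched structurally; a minimal illustration with plain linear maps standing in for the neural modules (all dimensions and names are our assumptions, and the training loops are elided to keep the sketch short):

```python
import numpy as np

rng = np.random.default_rng(0)
D_ABS, D_LAT, D_NAT = 4, 16, 32  # abstract, latent, natural dimensions

# Step 1: f, P, g are fit on synthetic abstract inputs x_bar so that
# g(P(f(x_bar))) approximates A(x_bar). Here they are just random
# matrices, standing in for trained networks.
f = rng.normal(size=(D_ABS, D_LAT))  # abstract encoder
P = rng.normal(size=(D_LAT, D_LAT))  # processor (latent steps of A)
g = rng.normal(size=(D_LAT, D_ABS))  # abstract decoder

# Step 2: new encoder/decoder matched to P's input/output dimensions.
f_tilde = rng.normal(size=(D_NAT, D_LAT))
g_tilde = rng.normal(size=(D_LAT, D_NAT))

def reasoner(x_bar):
    # g(P(f(x_bar))): the algorithmic reasoner over abstract inputs.
    return x_bar @ f @ P @ g

def natural_pipeline(x):
    # Step 3: g~(P(f~(x))). Only f_tilde and g_tilde would be trained
    # on (x, y) pairs; P is shared from step 1 and kept frozen.
    return x @ f_tilde @ P @ g_tilde
```

The key structural point is that P appears, unchanged, in both compositions: the latent computation learned from the abstract algorithm is reused verbatim on natural data, with only the encoders and decoders adapted.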
Through this pipeline, neural algorithmic reasoning offers a strong approach to applying algorithms on natural inputs. The raw encoder function, f̃, has the potential to replace the human feature engineer, as it learns how to map raw inputs onto the algorithmic input space for P, purely by backpropagation.
One area where this blueprint has already proved useful is reinforcement learning (RL). A very popular algorithm in this space is Value Iteration (VI): it is able to solve the RL problem perfectly, assuming access to environment-related inputs that are usually hidden. Hence it would be highly attractive to be able to apply VI over such environments, and, given the partial observability of the inputs necessary to apply VI, it is a prime target for our reasoning blueprint. Specifically, the XLVIN architecture [Deac et al., 2020] is an exact instance of our blueprint for the VI algorithm. Besides improved data efficiency over more traditional approaches to RL, it also compared favourably against ATreeC [Farquhar et al., 2017], which attempts to apply VI directly in a neural pipeline, thus encountering the algorithmic bottleneck problem in low-data regimes.
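For reference, the VI algorithm itself is short; a minimal tabular sketch on a toy MDP (our example; XLVIN imitates VI's steps in latent space rather than running this explicit form):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    # P[a, s, s_next]: probability of moving s -> s_next under action a.
    # R[a, s]: reward for taking action a in state s.
    V = np.zeros(P.shape[1])
    while True:
        # Bellman optimality backup:
        # Q(a, s) = R(a, s) + gamma * sum_s' P(a, s, s') * V(s')
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=0)
        if np.abs(V_new - V).max() < tol:
            return V_new
        V = V_new

# Toy deterministic 2-state MDP: action 0 stays put, action 1 switches
# state; staying in state 1 yields reward 1, everything else yields 0.
P_toy = np.zeros((2, 2, 2))
P_toy[0, 0, 0] = P_toy[0, 1, 1] = 1.0  # action 0: stay
P_toy[1, 0, 1] = P_toy[1, 1, 0] = 1.0  # action 1: switch
R_toy = np.zeros((2, 2))
R_toy[0, 1] = 1.0
V = value_iteration(P_toy, R_toy)  # converges to V ≈ [9, 10]
```

Note what VI demands: the full transition tensor P and reward table R, exactly the "environment-related inputs that are usually hidden" in real RL—which is why imitating VI latently, per the blueprint, is attractive.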
References
Quentin Cappart, Didier Chételat, Elias Khalil, Andrea Lodi, Christopher Morris, and Petar Veličković. Combinatorial optimization and reasoning with graph neural networks. arXiv preprint arXiv:2102.09544, 2021.
Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein.
Introduction to Algorithms. MIT press, 2009.