How To Transfer Algorithmic Reasoning Knowledge To Learn New Algorithms?
Abstract
Learning to execute algorithms is a fundamental problem that has been widely
studied. Prior work [1] has shown that to enable systematic generalisation on
graph algorithms it is critical to have access to the intermediate steps of the
program/algorithm. In many reasoning tasks where algorithmic-style reasoning is
important, we only have access to input and output examples. Thus, inspired
by the success of pre-training on similar tasks or data in Natural Language
Processing (NLP) and Computer Vision, we set out to study how we can transfer
algorithmic reasoning knowledge. Specifically, we investigate how we can use
algorithms for which we have access to the execution trace to learn to solve similar
tasks for which we do not. We investigate two major classes of graph algorithms:
parallel algorithms such as breadth-first search and Bellman-Ford, and sequential
greedy algorithms such as Prim and Dijkstra. Due to the fundamental differences
between algorithmic reasoning knowledge and feature extractors such as those used in
Computer Vision or NLP, we hypothesise that standard transfer techniques will not
be sufficient to achieve systematic generalisation. To investigate this empirically,
we create a dataset including 9 algorithms and 3 different graph types. We validate
this hypothesis and show how multi-task learning can instead be used to achieve
the transfer of algorithmic reasoning knowledge.
1 Introduction
Transfer learning [2] has been responsible for significant successes in multiple areas of machine
learning, including Natural Language Processing (NLP) [3, 4] and Computer Vision (CV) [5]. Pre-
training and reusing the learned weights, by freezing them as feature extractors, or fine-tuning from
them as an initialisation, are common approaches to transfer in these domains. This has enabled
successful learning on problems where data is limited in some form.
Algorithmic reasoning on graphs [1, 6] is a fundamental problem that has been studied under the
assumption that we have access to the execution traces of the algorithms we want to learn. In practice,
such as in many real-world reasoning tasks, this may not be true. Due to the limited data, we look to
transfer learning to enable us to solve algorithmic tasks without intermediate steps. Specifically, we
∗ Corresponding author
2 Related Work
Neural Execution: Many papers have studied how to execute algorithms before [16, 17, 18, 19, 20].
With the rise of graph representation learning [21, 22, 23] and graph neural networks [11, 12, 13, 14],
recent work has looked at executing graph algorithms [1, 6]. This line of work assumes access to the
intermediate steps of the algorithm. Further, prior work has looked at teaching transformers to learn
explicitly denoted subroutines and then learn how to combine them [24]. The key difference between
our work and prior work is that we look at the setting of learning graph algorithms where we do not
have access to the execution trace and we do not denote subroutines explicitly. Instead we propose
to use similar algorithms as an inductive bias and look at how to enable this transfer of algorithmic
knowledge.
Transfer learning: Transfer learning was originally proposed in the 1970s [25, 26]. Today, it is a
common tool in NLP and Computer Vision to use pre-trained models and fine-tune them on the target
task [2, 3, 5]. This is helpful when the target task is similar and the available data is limited in some
way. Our work differs in two ways. Firstly, we focus on algorithmic reasoning knowledge, which we
hypothesise will require different techniques than reusing weights. Secondly, the data on the target
task is partially missing: specifically, the execution trace of the algorithm is missing.
3 Background
This section gives background on the algorithms and graphs used for the experiments. To be precise,
we consider a graph to be a tuple G = (V, E) with a set of vertices V and a set of edges E ⊆ V × V.
We first explain the algorithms studied in this paper in more detail. After that, we give a brief
introduction to graph neural networks, the standard architecture paradigm for graph inputs.
3.1 Algorithms
We broadly study two classes of algorithmic reasoning: parallel and sequential reasoning. Specifically,
the parallel algorithms (shown in Algorithm 1) studied in this paper exchange messages with their
neighbours until an equilibrium is reached. Sequential algorithms, presented in Algorithm 2, greedily
remove elements from a priority queue and update the neighbouring nodes' keys.
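To make the parallel class concrete, the following is a minimal sketch of a synchronous Bellman-Ford-style execution, in which every node reads its neighbours' values from the previous step and the loop stops once nothing changes. It is an illustration only and not a reproduction of Algorithm 1: the function name, the dictionary-based graph encoding, and the convention that every node starts as its own predecessor are assumptions.

```python
import math


def parallel_shortest_paths(adj, source):
    """Synchronous Bellman-Ford: all nodes update at once until equilibrium.

    adj maps each node to a dict {neighbour: edge_weight}; undirected graphs
    list every edge in both directions (an assumed encoding).
    """
    dist = {v: math.inf for v in adj}
    pred = {v: v for v in adj}  # assumption: each node starts as its own predecessor
    dist[source] = 0.0
    steps = 0
    while True:
        new_dist, new_pred = dict(dist), dict(pred)
        for v in adj:
            for u, w in adj[v].items():
                # Message from u to v, computed from the previous step only.
                if dist[u] + w < new_dist[v]:
                    new_dist[v] = dist[u] + w
                    new_pred[v] = u
        if new_dist == dist:  # equilibrium: no node changed its value
            return dist, pred, steps
        dist, pred = new_dist, new_pred
        steps += 1
```

The number of synchronous steps is bounded by the number of nodes, which is one way to see why executions of the parallel class are short compared with sequential ones.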
3.2 Graph Neural Networks
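A general framework to describe several graph neural network architectures is the message passing
framework introduced by Gilmer et al. [13]. Such graph neural networks (GNNs) consist of three
parts: a message function M, an update function U, and an aggregator ⊕. M and U are arbitrary
neural networks. The high-level idea is that every node receives a message from each of its
neighbours, aggregates these messages, and updates its own representation:
$h_i' = U\big(h_i, \bigoplus_{j \in N(i)} M(h_j, h_i, e_{ji})\big)$,
where $h_i$ is the representation of node i, $N(i)$ is its set of neighbours, and $e_{ji}$ is the
encoding of the edge from j to i.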
The message function in this paper takes as input the sending and receiving nodes’ representation as
well as an encoding of the edge feature along which the message is sent. The aggregator chosen is
the max aggregator applied element-wise as it was determined to work best by Veličković et al. [1]
and provides algorithmic alignment [15] between the architecture and the tasks. The update function
takes as input the node’s previous embedding and aggregated messages.
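As a rough illustration of this framework, the sketch below implements one round of message passing with an element-wise max aggregator in plain Python. The message and update functions are passed in as arbitrary callables. The two hand-written functions at the end are not learned and are purely illustrative assumptions: they store negated distances so that taking the max over messages corresponds to one shortest-path relaxation step.

```python
def message_passing_step(h, edges, message_fn, update_fn):
    """One round of message passing with an element-wise max aggregator.

    h:     dict node -> list of floats (node representation)
    edges: dict (sender, receiver) -> list of floats (edge feature encoding)
    """
    new_h = {}
    for v in h:
        messages = [message_fn(h[u], h[v], e)
                    for (u, receiver), e in edges.items() if receiver == v]
        if messages:
            # Element-wise max over all incoming messages.
            aggregated = [max(column) for column in zip(*messages)]
            new_h[v] = update_fn(h[v], aggregated)
        else:
            new_h[v] = list(h[v])  # no incoming edges: keep the old state
    return new_h


def relax_message(h_sender, h_receiver, e):
    # Hand-written stand-in for M: negated (distance of sender + edge weight).
    return [h_sender[0] - e[0]]


def keep_best(h_prev, aggregated):
    # Hand-written stand-in for U: keep the larger (i.e. shorter) negated distance.
    return [max(h_prev[0], aggregated[0])]
```

With these two functions, repeated calls to `message_passing_step` on states initialised to 0 for the source and negative infinity elsewhere perform Bellman-Ford relaxations, which is the kind of correspondence that algorithmic alignment refers to.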
We study learning graph algorithms that take in a graph G, a weight function w : E → R that assigns
a weight to each edge in the graph, and node features X : V → R^k. Algorithms compute an output
Y : V → F^k and a predecessor P : V → V at each time step. This encompasses a large class of
graph algorithms solving tasks such as reachability or shortest-path.
For instance, for a sequential algorithm the ith node's features X_t[i] would be the key value and the
state variable. Given X_t, the graph G, and the edge weights, the graph algorithm at each iteration
returns the node features Y_t and the predecessor P_t of each node (pred variable in pseudocode § 3.1). Y_T
and P_T at the final step are considered the output of the algorithm. The intermediate steps refer to
X_t, Y_t, P_t ∀t ∈ {1, . . . , T − 1} after each iteration.
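The distinction between intermediate steps and final output can be sketched with a sequential algorithm that records its state after every iteration. The code below runs Dijkstra and stores, per iteration, the node picked next, the keys, the state variables, and the predecessors. The function name, the dictionary layout of each snapshot, and the linear scan standing in for the priority queue are assumptions made for illustration.

```python
import math


def dijkstra_trace(adj, source):
    """Run Dijkstra and record keys, states and predecessors after each iteration.

    adj maps each node to {neighbour: edge_weight}. State 1 means the node has
    been removed from the priority queue, 0 that it has not.
    """
    key = {v: math.inf for v in adj}
    state = {v: 0 for v in adj}
    pred = {v: v for v in adj}
    key[source] = 0.0
    trace = []
    while True:
        queue = [v for v in adj if state[v] == 0 and key[v] < math.inf]
        if not queue:  # no reachable node left in the queue: terminate
            break
        u = min(queue, key=lambda v: key[v])  # greedy choice of the next node
        state[u] = 1
        for v, w in adj[u].items():  # update the neighbouring nodes' keys
            if state[v] == 0 and key[u] + w < key[v]:
                key[v] = key[u] + w
                pred[v] = u
        trace.append({"next": u, "key": dict(key),
                      "state": dict(state), "pred": dict(pred)})
    return trace
```

In the supervised setting the whole list `trace` is available as a training signal, whereas for a target task without an execution trace only the last entry, `trace[-1]`, would be observed.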
4 Methods
In this section, we first briefly present two architectures we will use for training and then discuss how
existing algorithmic knowledge can be used to solve tasks when no execution trace is given.
4.1 Architecture
4.1.1 NeuralExecutor
The NeuralExecutor (NE) [1] uses an encoder-processor-decoder architecture. Let X ∈ R^{n×k} be
node states, where n, k are the number of nodes and features, respectively. Each edge (u, v) has a
weight w(u, v) ∈ R. The architecture keeps a hidden state for each node H ∈ R^{n×l} with l features,
which is initialised to all zeros. The encoder E consists of a linear layer and computes a hidden
embedding E(X_i, H_i) = Z_i, where i indicates the ith step in the computation. The processor P is a
message passing neural network (MPNN) with a max aggregator and linear message and update
functions. The processor computes the new hidden state for each node P(H_i, A, w) = H_{i+1}. Then,
we have the decoder D(Z_i, H_{i+1}) = Y_{i+1} and predecessor predictor S(Z_i, H_{i+1}) = S_i. Finally, a
termination network σ(T(H_{i+1})) decides whether we should terminate or not.
for more accumulation of errors. As long as we have algorithmic alignment [15] between the architecture and the
algorithm in question there is no additional challenge to NP-hard problems except the length of their sequence
and hence more opportunities to introduce errors and propagate them.
Figure 1: x0, x1 are the node states of the two algorithms, respectively. w are the edge weights. p0, p1
are the predecessor predictions for the two algorithms, respectively, and y0, y1 are the next node
state predictions. The h_i are the previous hidden states kept by the network and e(ij) is the computed
edge weight embedding. Superscript indices indicate a particular node. (a) Shows the original
Neural Executor architecture when doing multi-task learning. (b) Shows the more expressive Neural
Executor++ when doing multi-task learning. The key difference is that (b) forces a common way to
operate on a hidden embedding space, while (a) focuses on achieving both at the same time.
We make minor changes to allow for better algorithmic alignment: we remove the ReLU activations
from the processor and replace the termination layer with a processor and linear module.⁴
4.1.2 NeuralExecutor++
Veličković et al. [1] showed positive transfer when learning algorithms in a multi-task setup with
intermediate steps. They did so by concatenating together the node features and encoding them
together into the same h embedding (see Fig. 1). This has the advantage of giving strong guidance
to the secondary algorithm learned in this multi-task set-up. Ideally, however, we would use the
base algorithm only during training: also requiring it at inference time introduces another
failure mode. Thus, we propose to change the architecture as follows:
Since the node encoder is unable to learn the individual subroutines operating on edges, we replace
it with a latent encoder for each task that operates on edges and for an edge (i, j) takes in
[h(i), x(i), e(ij), h(j), x(j)] (Fig. 1). The important difference is that each task has its own encoder
rather than all tasks sharing one encoder. The goal is to force the model to learn the shared subroutines
in the processor only and the specialised subroutines in the latent encoder only (see Fig. 1).
Further, we change the latent encoder to consist of a linear and non-linear encoder in parallel that
are added together. This should allow for algorithmic alignment with a much larger array of tasks
(specifically WIDEST PATH), but comes at the cost that overfitting is more likely, which may hurt
generalisation to larger graphs. This last problem has been avoided in prior work [1, 15] by having
the neural network only learn linear components, which will not be able to overfit easily.⁵
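A minimal sketch of such a latent encoder is given below, assuming plain Python lists as vectors. Only the overall shape follows the text: the five inputs of an edge (i, j) are concatenated and passed through a linear branch and a non-linear branch whose outputs are added. The parameter names and the choice of a single hidden ReLU layer for the non-linear branch are assumptions, not details taken from the paper.

```python
def linear(weights, bias, x):
    """Affine map; weights is a list of rows, one row per output feature."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def latent_encoder(params, h_i, x_i, e_ij, h_j, x_j):
    """Encode edge (i, j) as the sum of a linear and a non-linear branch."""
    z = h_i + x_i + e_ij + h_j + x_j  # list concatenation of the five inputs
    linear_branch = linear(params["W_lin"], params["b_lin"], z)
    # Assumed non-linear branch: one hidden ReLU layer followed by a linear map.
    hidden = relu(linear(params["W_hid"], params["b_hid"], z))
    nonlinear_branch = linear(params["W_out"], params["b_out"], hidden)
    return [a + b for a, b in zip(linear_branch, nonlinear_branch)]
```

In a multi-task set-up one such parameter dictionary would be kept per task, for example in a mapping from task name to `params`, while the processor that consumes the encodings is shared.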
Sequential algorithms (Seq): We used softmax for the prediction of the next node to be removed
from the queue and softmax for the predecessor prediction, where we masked out all nodes except
the neighbours and the node itself. We used a smooth l1 loss with β = 0.001 for the prediction of the
key of the selected node. Finally, we used binary cross entropy for termination prediction. We
only update the node state of the chosen node, which helps with drift of the node state at test time.
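The masking used for the predecessor prediction can be sketched as a softmax restricted to the neighbours of a node and the node itself. The helper names and the assumption that nodes are sortable identifiers stored in an adjacency dictionary are illustrative choices.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over the entries where mask is True; masked entries get probability 0."""
    kept = [l for l, m in zip(logits, mask) if m]
    if not kept:
        raise ValueError("mask must keep at least one entry")
    top = max(kept)  # subtract the maximum for numerical stability
    exps = [math.exp(l - top) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def predecessor_mask(adj, node):
    """Allow only the node itself and its neighbours as predecessor candidates."""
    return [v == node or v in adj[node] for v in sorted(adj)]
```

Because masked entries receive exactly zero probability, the network can never predict a predecessor that is not connected to the node in question.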
Parallel algorithms (Par): We used a smooth l1 loss with β = 0.001 for the prediction of the key
except for BFS where we used binary cross-entropy as the node state is either 0 or 1. In this setting,
during teacher-forcing, we masked out nodes that are unreachable at a given time step.
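For reference, the sketch below writes out the smooth l1 loss for a single prediction, assuming the common definition (the one used, for example, by PyTorch's `SmoothL1Loss`): quadratic for absolute errors below β and linear above. That the paper uses exactly this definition is an assumption.

```python
def smooth_l1(prediction, target, beta=0.001):
    """Smooth l1 loss for one scalar prediction (assumed standard definition)."""
    diff = abs(prediction - target)
    if diff < beta:
        return 0.5 * diff * diff / beta  # quadratic region near zero
    return diff - 0.5 * beta             # linear region elsewhere
```

With β = 0.001 the quadratic region is very narrow, so under this definition the loss behaves like an absolute error for almost all prediction errors while remaining differentiable at zero.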
⁴ We remove the ReLU because it limits the ability of the max aggregator to minimise values by using negative
inputs. The termination condition depends on whether the remaining nodes are still reachable; hence an MPNN
is more appropriate than a linear layer.
⁵ See the Supplementary for a more formal description.
Teacher forcing (TF): We train the network to predict the next step given the ground truth inputs. At
test time the network's predictions are used instead of the ground truth.
No algorithm (NA): We train the network using only the loss on the final outputs. This means
we change the softmax for predicting the next node to a binary cross entropy for sequential
algorithms. The next inputs are those predicted by the network, using a Gumbel softmax to predict
the next node at each stage. The number of steps is given to the network in this scenario. Inspired
by [27], we sample 10 trajectories and use the best one for back-propagation when using a Gumbel
softmax. The idea is that only the best loss is of interest at evaluation time and not the average loss.
We show in the supplementary that this helps stabilise and improve training compared to taking the
mean of the trajectories.
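The sampling scheme can be sketched as follows: a Gumbel softmax turns logits into a randomly perturbed, relaxed choice of the next node, and out of k sampled trajectories only the one with the lowest loss is kept. The function names, the temperature parameter, and the `rollout_loss` callable (which stands for running one full sampled trajectory and returning its final loss) are hypothetical.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Relaxed sample from a categorical distribution via Gumbel noise."""
    # Guard against log(0) by clamping the uniform sample away from zero.
    gumbels = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    scores = [(l + g) / temperature for l, g in zip(logits, gumbels)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_of_k_loss(rollout_loss, k=10):
    """Evaluate k sampled trajectories and keep only the lowest loss."""
    losses = [rollout_loss() for _ in range(k)]
    return min(losses), losses
```

In a training loop only the returned minimum would be back-propagated, whereas taking the mean of `losses` corresponds to the alternative the text compares against.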
A stronger inductive bias is to train the base algorithm B, for which we have access to intermediate
steps, together with the target algorithm T , for which we do not. This multi-task set-up highlights
the difference between the original Neural Executor architecture proposed in Veličković et al. [1]
and Neural Executor++ (see Fig. 1). The former simply concatenates the node features of B and T
and forces them to share the hidden embedding. The latter allows different encoder-decoders and
only shares the weights of the processor, i.e. each algorithm has its own hidden embedding. This
second approach encourages the network to execute the shared subroutine in the processor and the
individual subroutines in the encoder-decoder. The overall idea is that the base algorithm B serves as
an inductive bias as to how to structure the latent space and teach the processor how to evolve T .
4.5 Graphs
Erdős-Rényi graphs are random graphs where each possible edge has probability p of being added
to the graph. Barabási-Albert graphs are power-law graphs with a few highly connected nodes and
many dangling nodes. 2d-grid graphs are very regular graphs in an arbitrary 2d-grid shape. We chose
these 3 classes of graphs because they represent some of the major possible differences between
graphs. 2d-grid graphs are very regular and Veličković et al. [1] note that very regular graphs tend to
transfer poorly from or to random graphs. Erdős-Rényi graphs tend to be sparse, but highly likely to
be connected. Barabási-Albert graphs tend to be quite dense graphs with shorter average path-length
than random graphs. As such, these graph classes differ significantly from each other.
Table 1: Teacher forcing (seq.). NE++ performs worse on DIJKSTRA, showing the downside of
higher model capacity.
Model | #Nodes | DIJKSTRA Next node | DIJKSTRA Key | DIJKSTRA Predecessor | MOST RELIABLE Next node | MOST RELIABLE Key | MOST RELIABLE Predecessor
NE | 20 | 0.018 ± 0.004 | 0.0367 ± 0.01 | 0.005 ± 0.002 | 0.238 ± 0.05 | 0.0358 ± 0.02 | 0.053 ± 0.01
NE | 50 | 0.089 ± 0.009 | 0.569 ± 0.7 | 0.02 ± 0.02 | 0.557 ± 0.08 | 0.0697 ± 0.05 | 0.099 ± 0.01
NE | 100 | 0.341 ± 0.02 | 4.79 ± 6 | 0.064 ± 0.04 | 0.763 ± 0.07 | 0.0924 ± 0.06 | 0.167 ± 0.007
NE++ | 20 | 0.008 ± 0.003 | 0.0108 ± 0.004 | 0.003 ± 0.0008 | 0.174 ± 0.07 | 0.0264 ± 0.02 | 0.047 ± 0.03
NE++ | 50 | 0.35 ± 0.03 | 445 ± 600 | 0.211 ± 0.02 | 0.492 ± 0.2 | 0.0676 ± 0.05 | 0.112 ± 0.05
NE++ | 100 | 0.729 ± 0.01 | 1.62e10 ± 2e10 | 0.522 ± 0.07 | 0.699 ± 0.2 | 0.0906 ± 0.06 | 0.171 ± 0.06
For all graphs, we generate edge weights sampled uniformly from [0.2, 1.0]; this range prevents
key values such as shortest path from becoming too extreme.⁶
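As a rough sketch of the data generation, the code below builds an Erdős-Rényi graph and a 2d-grid graph with edge weights drawn uniformly from [0.2, 1.0]. It is not the released data pipeline: the function names, the adjacency-dictionary encoding, and the value of p are assumptions, and Barabási-Albert graphs are omitted for brevity.

```python
import random


def add_weighted_edge(adj, u, v, rng):
    """Add an undirected edge with a weight drawn uniformly from [0.2, 1.0]."""
    w = rng.uniform(0.2, 1.0)
    adj[u][v] = w
    adj[v][u] = w


def erdos_renyi(n, p, rng):
    """Each of the n*(n-1)/2 possible edges is included with probability p."""
    adj = {v: {} for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                add_weighted_edge(adj, u, v, rng)
    return adj


def grid_2d(rows, cols, rng):
    """Regular grid; node (r, c) is stored under the integer id r * cols + c."""
    adj = {r * cols + c: {} for r in range(rows) for c in range(cols)}
    for r in range(rows):
        for c in range(cols):
            u = r * cols + c
            if c + 1 < cols:
                add_weighted_edge(adj, u, u + 1, rng)     # right neighbour
            if r + 1 < rows:
                add_weighted_edge(adj, u, u + cols, rng)  # neighbour below
    return adj
```

For example, `grid_2d(4, 5, random.Random(0))` yields one possible 20-node grid; other factorisations of 20 give the other grid shapes.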
5 Experiments
For all experiments we use 5,000 graphs of each type (Erdős-Rényi (ER), Barabási-Albert (BA),
2d-Grids (2d-G)) with 20 nodes each. We train using ADAM [28] with a learning rate of 0.0005, a
batch size of 64, and use early stopping with a patience of 10 to prevent overfitting. We test on graphs
of size 20, 50, and 100 nodes. The hidden embedding size is set to 32 except for NE++ for multi-task
experiments, where it is 16 to account for the additional expressivity of having several encoders.
Each experiment was executed on a V100 GPU in less than 5 hours for the longest experiment.
We measure the average performance over all 3 graph types at evaluation separately and present the
average with standard deviation in the main paper. Large standard deviations may arise due to the
extreme difference between random graphs of type ER or BA versus 2d-G graphs.⁷
5.1 Metrics
Sequential algorithms: Predecessor (Pred.) error rate is the most important measure as to whether a
task has been successfully completed, as it gives us the path predicted by the network. Next node (Next)
error rate measures whether the next node is the correct one to pick. Key accuracy measures whether
the key of the picked node is correct, measured in mean squared error. Next node is indicative of
whether the correct algorithm is being executed, while Pred. and Key primarily serve to indicate the
correctness of the solutions found. Lower is always better.
Parallel algorithms: Key accuracy measures the node features mean squared error for all algorithms
except BFS, where it is measured in accuracy as the node feature is a binary choice between 0 and 1.
Predecessor (Pred.) accuracy measures the accuracy of predicting the predecessor node.
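The two kinds of metric can be written down in a few lines. The helper names and the per-node list inputs are assumptions; an accuracy is simply one minus the corresponding error rate.

```python
def error_rate(predicted, target):
    """Fraction of nodes whose predicted label (e.g. predecessor) is wrong."""
    wrong = sum(1 for p, t in zip(predicted, target) if p != t)
    return wrong / len(target)


def mean_squared_error(predicted, target):
    """Mean squared error between predicted and true keys."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
```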
For the sequential algorithms we study transfer from PRIM to DIJKSTRA and from WIDEST PATH to
MOST RELIABLE PATH; for parallel algorithms we study transfer from BFS to BELLMAN-FORD
and from WIDEST PATH to MOST RELIABLE PATH.
⁶ All code to generate data and train models will be released upon acceptance with an MIT license.
⁷ See the Supplementary for a more detailed explanation.
Table 3: Transfer to no algorithm (seq.). Pre-trained on PRIM and WIDEST, respectively. Classic
transfer learning fails to provide size-generalisation and often performs worse than no transfer.
Model | #Nodes | DIJKSTRA Key | DIJKSTRA Predecessor | MOST RELIABLE Key | MOST RELIABLE Predecessor
NE Freeze | 20 | 0.014 ± 0.005 | 0.081 ± 0.03 | 0.0235 ± 0.003 | 0.248 ± 0.06
NE Freeze | 50 | 122 ± 200 | 0.597 ± 0.09 | 3.91e5 ± 3e4 | 0.864 ± 0.009
NE Freeze | 100 | 3.18e6 ± 4e6 | 0.607 ± 0.1 | 5.06e15 ± 7e14 | 0.761 ± 0.04
NE Fine-tune | 20 | 0.0021 ± 0.0009 | 0.05 ± 0.02 | 0.036 ± 0.005 | 0.227 ± 0.07
NE Fine-tune | 50 | 1.11e3 ± 1e3 | 0.241 ± 0.09 | 2e3 ± 200 | 0.636 ± 0.1
NE Fine-tune | 100 | 5.93e7 ± 8e7 | 0.388 ± 0.1 | 1.06e9 ± 1e8 | 0.709 ± 0.02
NE 2-Processor | 20 | 0.00136 ± 0.0005 | 0.06 ± 0.03 | 0.0163 ± 0.003 | 0.231 ± 0.08
NE 2-Processor | 50 | 14.2 ± 20 | 0.162 ± 0.06 | 205 ± 30 | 0.749 ± 0.04
NE 2-Processor | 100 | 1.04e4 ± 1e4 | 0.305 ± 0.07 | 1.22e10 ± 4e9 | 0.815 ± 0.03
NE++ Freeze | 20 | 0.00136 ± 0.001 | 0.063 ± 0.02 | 0.00687 ± 0.0008 | 0.199 ± 0.09
NE++ Freeze | 50 | 42.6 ± 50 | 0.841 ± 0.04 | 465 ± 100 | 0.58 ± 0.1
NE++ Freeze | 100 | 262 ± 400 | 0.895 ± 0.07 | 8.06e7 ± 3e7 | 0.672 ± 0.08
NE++ Fine-tune | 20 | 0.000414 ± 0.0003 | 0.034 ± 0.02 | 0.00669 ± 0.002 | 0.191 ± 0.07
NE++ Fine-tune | 50 | 13.5 ± 20 | 0.962 ± 0.03 | 1.79e5 ± 2e5 | 0.757 ± 0.05
NE++ Fine-tune | 100 | 2.51e4 ± 4e4 | 0.962 ± 0.04 | 8.17e14 ± 5e14 | 0.774 ± 0.05
NE++ 2-Processor | 20 | 0.00443 ± 0.002 | 0.06 ± 0.04 | 0.0022 ± 0.0003 | 0.154 ± 0.05
NE++ 2-Processor | 50 | 16.6 ± 20 | 0.356 ± 0.1 | 306 ± 70 | 0.644 ± 0.03
NE++ 2-Processor | 100 | 4.66e3 ± 6e3 | 0.779 ± 0.02 | 3.58e6 ± 9e5 | 0.791 ± 0.02
Table 4: Multi-task (seq.): Using PRIM and WIDEST as inductive bias, respectively. Multi-task
learning shows good generalisation, especially on the Key metric for DIJKSTRA and on Predecessor
for MOST RELIABLE.
Model | #Nodes | DIJKSTRA Key | DIJKSTRA Predecessor | MOST RELIABLE Key | MOST RELIABLE Predecessor
NE | 20 | 0.00362 ± 0.0005 | 0.042 ± 0.01 | 0.207 ± 0.03 | 0.452 ± 0.09
NE | 50 | 11.5 ± 2 | 0.134 ± 0.1 | 1.85 ± 0.5 | 0.501 ± 0.06
NE | 100 | 126 ± 30 | 0.303 ± 0.3 | 6.47 ± 3 | 0.597 ± 0.01
NE++ | 20 | 0.000178 ± 8e-05 | 0.019 ± 0.009 | 0.00279 ± 0.0004 | 0.166 ± 0.07
NE++ | 50 | 0.413 ± 0.4 | 0.161 ± 0.1 | 0.199 ± 0.3 | 0.185 ± 0.009
NE++ | 100 | 2.91 ± 3 | 0.282 ± 0.2 | 0.843 ± 1 | 0.267 ± 0.1
6.1 Sequential
The first experiments establish baselines in terms of achievable performance given the intermediate
steps and trained with teacher-forcing (Tab. 1). We run each algorithm separately. Next, we establish
the performance in the no-algorithm setting (Tab. 2), i.e. what is achievable without intermediate
supervision.
Expressivity can harm systematic generalisation: Firstly, we note that the additional expressivity
of NE++ (§ 4.1.2) seems to hurt systematic generalisation even with the large amount of data
available (Tab. 1), as we can see on DIJKSTRA. On MOST RELIABLE both do equally well on Key
and Pred., but looking at Next node we can see that NE++ does better in simulating the algorithm.
MOST RELIABLE is a non-linear task, so a non-linear encoder is expected to help. We also note that
as the graphs grow in size, the number of reachable nodes in the priority queue increases, making it
more likely that we pick the wrong node without affecting the correctness of the prediction.
Secondly, we note that in the NA setting (Tab. 2) the NE is able to solve the Pred. prediction quite
well up to 100 nodes, but clearly found an alternative way of reasoning, as the key prediction is hugely
wrong for larger graphs. Note that the largest shortest distances will be found in 2d-grid graphs, where
they are upper bounded by 51. Also, for MOST RELIABLE the drop in Pred. performance at 50
nodes is significantly more severe for the model with less representational power.
Transfer yields little improvement: The two key experiments are the transfer setting (§ 4.3) and
the multi-task setting (§ 4.4). We hypothesised that the standard transfer experiments (fine-tune
and freeze) would not help systematic generalisation. None of the transfer methods (Tab. 3) help
generalise either task significantly. In fact, they harm systematic generalisation in terms of Pred.
prediction in all cases. The only benefit that can be observed is better generalisation on Key accuracy
indicating that the network outputs are less extreme. The best transfer method is 2-Processor as we
hypothesised in § 4.3, which improves Key prediction at the cost of harming Pred. accuracy.
Multi-task helps systematic generalisation: In the multi-task set-up (Tab. 4), several things occur:
the Key prediction generalises even better and is predicting in a reasonable range given the longest
Table 5: 2-Proc. transfer pre-trained on PRIM, DIJKSTRA, & DFS for sequential and BFS &
BELLMAN-FORD for parallel. Pre-training on several tasks does not improve classic transfer.
Model | #Nodes | MOST RELIABLE (SEQ) Key | MOST RELIABLE (SEQ) Predecessor | MOST RELIABLE (PAR) Key | MOST RELIABLE (PAR) Predecessor
NE++ 2-Proc. | 20 | 0.00237 ± 0.0003 | 0.163 ± 0.06 | 0.0408 ± 0.009 | 0.227 ± 0.07
NE++ 2-Proc. | 50 | 62.9 ± 20 | 0.606 ± 0.05 | 0.161 ± 0.1 | 0.363 ± 0.04
NE++ 2-Proc. | 100 | 1.76e6 ± 4e5 | 0.758 ± 0.07 | 2.68 ± 4 | 0.448 ± 0.08
Table 6: No algorithm (par.): For transfer we report the results of the best method (§ 4.3). Pre-trained
on BFS and WIDEST, respectively. Reliance on intermediate steps is lower for this class of problems,
but multi-task transfer of knowledge is still beneficial in terms of size-generalisation.
Model | #Nodes | BELLMAN-FORD Key | BELLMAN-FORD Predecessor | MOST RELIABLE PATH Key | MOST RELIABLE PATH Predecessor
NE (NA) | 20 | 0.0182 ± 0.02 | 0.057 ± 0.02 | 0.018 ± 0.002 | 0.226 ± 0.06
NE (NA) | 50 | 59 ± 80 | 0.164 ± 0.1 | 0.147 ± 0.2 | 0.327 ± 0.02
NE (NA) | 100 | 1.98e6 ± 3e6 | 0.261 ± 0.2 | 10.4 ± 10 | 0.435 ± 0.05
NE++ (NA) | 20 | 0.00253 ± 0.002 | 0.028 ± 0.01 | 0.00957 ± 0.004 | 0.145 ± 0.06
NE++ (NA) | 50 | 0.226 ± 0.3 | 0.057 ± 0.01 | 0.0367 ± 0.04 | 0.171 ± 0.03
NE++ (NA) | 100 | 196 ± 300 | 0.095 ± 0.04 | 120 ± 200 | 0.217 ± 0.02
NE (Transfer Fine-tune) | 20 | 0.0386 ± 0.02 | 0.072 ± 0.03 | 0.0221 ± 0.006 | 0.237 ± 0.06
NE (Transfer Fine-tune) | 50 | 25 ± 40 | 0.162 ± 0.05 | 0.332 ± 0.4 | 0.331 ± 0.02
NE (Transfer Fine-tune) | 100 | 1.72e5 ± 2e5 | 0.242 ± 0.1 | 230 ± 300 | 0.402 ± 0.03
NE++ (Transfer 2-Proc.) | 20 | 0.0223 ± 0.02 | 0.062 ± 0.03 | 0.0131 ± 0.003 | 0.196 ± 0.07
NE++ (Transfer 2-Proc.) | 50 | 0.666 ± 0.7 | 0.105 ± 0.005 | 3.04 ± 4 | 0.313 ± 0.05
NE++ (Transfer 2-Proc.) | 100 | 10.8 ± 10 | 0.168 ± 0.05 | 579 ± 800 | 0.411 ± 0.1
NE (Multi-task) | 20 | 0.0154 ± 0.02 | 0.034 ± 0.01 | 0.173 ± 0.1 | 0.346 ± 0.02
NE (Multi-task) | 50 | 6.22 ± 9 | 0.051 ± 0.004 | 0.407 ± 0.4 | 0.362 ± 0.03
NE (Multi-task) | 100 | 1.53e3 ± 1e3 | 0.096 ± 0.02 | 0.615 ± 0.6 | 0.376 ± 0.05
NE++ (Multi-task) | 20 | 0.00353 ± 0.004 | 0.023 ± 0.01 | 0.00672 ± 0.0005 | 0.153 ± 0.06
NE++ (Multi-task) | 50 | 0.0141 ± 0.02 | 0.03 ± 0.006 | 0.00805 ± 0.002 | 0.182 ± 0.01
NE++ (Multi-task) | 100 | 8.84 ± 10 | 0.13 ± 0.1 | 0.00971 ± 0.002 | 0.212 ± 0.02
shortest path in graphs of size 100. Further, NE in this setting has similar Pred. accuracy on DIJKSTRA
compared to NA, while NE++ benefitted from the inductive bias in terms of its Pred. accuracy on graphs
of size 100 for DIJKSTRA. The results on MOST RELIABLE are significantly improved and NE++
achieves good levels of systematic generalisation in solving the task. NE interestingly worsens in its
performance on 20 nodes, but maintains a stable Pred. accuracy on larger graphs, demonstrating that
the inductive bias from WIDEST prevents overfitting in distribution and improves the performance on
larger graphs. Overall, the results validate our initial hypothesis that multi-task learning is the correct
approach to transfer knowledge.
Trying to extract shared subroutines does not help transfer: Finally, we study to what extent
the models are able to separate the common shared subroutines and the subroutines individual to
each algorithm by training multiple algorithms with TF in a multi-task set-up together (Tab. 5). If
the processor successfully captures only the shared subroutines, then we might expect the transfer
results to improve. We can see in Tab. 5 that multi-task pre-training does not significantly improve
results and that multi-task learning with the target algorithm is still the best approach. However, one
alternative explanation is that, given a good processor, the encoder struggles to learn the encoding
expected by the processor and thus performs poorly.
6.2 Parallel
Parallel algorithms are significantly easier than sequential ones due to their much shorter length
and the lack of a central data-structure that needs to be learned to execute. This can be observed in
the much higher performance in the NA setting (Tab. 6). Interestingly, it seems that in this setting
expressivity was helpful for systematic generalisation, even in the BELLMAN-FORD setting.
Transfer harms performance: In Tab. 6 we show only the best transfer result, but as we can see
this actually harms Pred. accuracy for NE++ for both algorithms, while producing roughly the same
result for NE. In both cases, the results suggest that random initialisations are better than transfer
ones. We think this may be because the shared algorithmic knowledge of parallel algorithms is already
inherently captured by GNNs, as they apply message functions in parallel to each edge and
then only need to learn the relax_edge function. Pre-training on several algorithms did not help (Tab. 5).
Multi-task only helps Key accuracy: Similarly to sequential reasoning, multi-task vastly outperforms
transfer techniques and significantly improves Key prediction compared to NA, while keeping
Pred. similar. Contrary to sequential reasoning, the Pred. prediction is comparable between NA and
multi-task. We think this is due to the shorter execution length providing less of an inductive bias for
the target algorithm and the strong inductive bias of GNN architectures towards parallel algorithms.
However, the access to more stable gradients due to the multi-task learning approach seems to help
learning to some extent, as shown by the improved Key predictions. Further, we observe that when the
model has less capacity (NE on MOST RELIABLE PATH) multi-task is still able to improve systematic
generalisation on Pred. at the cost of slightly worse in-distribution (20 nodes) performance.
Transfer via freezing and/or fine-tuning clearly does not work, as demonstrated by the results in Tab. 3.
The fact that having two processors, one frozen and one to fine-tune, also does not help transfer is
telling, because neither the loss of information during fine-tuning nor the rigidity of the network
can be at fault. In other words, fine-tuning has the disadvantage that we lose the original weights
and hence potentially lose information. Freezing weights significantly limits the weights that can
be changed and thus makes it harder to fit the data. However, the 2-processor approach suffers
from neither problem and yet still does not work.⁸ Thus, we hypothesise that the reason why transfer
fails to work is that the initial weights of a similar algorithm are not near a good (as in generalising)
minimum for the target algorithm; in fact, the minimum is often worse than the minimum found from
randomly initialised weights (see Tab. 2).
Multi-task, on the other hand, does not rely on the weights being near a good minimum, but instead
enforces them to be the same for the processor. This is a very different way to use the base algorithm
as an inductive bias. This inductive bias is successful because the final weights are from a minimum
that systematically generalises (on at least one of the algorithms) with the additional constraint that it
performs well on the second target algorithm. For transfer, although the initial weights might systematically
generalise on the original task, there is no guarantee that the final weights stem from a minimum that
systematically generalises.
7 Conclusion
We set out to investigate how systematic generalisation could be improved on algorithmic tasks when
the intermediate steps of the algorithm are not available. Inspired by the success of transfer learning
in domains such as CV and NLP, we investigated its applicability to learning graph algorithms in this
setting. We showed that standard transfer learning is inadequate to leverage algorithmic knowledge
learned from intermediate steps for new algorithmic tasks. Further, we showed how multi-task learning
can enable the successful transfer of inductive biases learned from other algorithms when intermediate
steps are available, significantly improving systematic generalisation. The results are especially
strong in the more difficult sequential reasoning domain. Moreover, we conclude that expressivity
can hurt systematic generalisation if the task is too easy and intermediate supervision is available.
This should be taken into account when choosing the model. These disadvantages disappeared when
trying to learn algorithmic reasoning without intermediate steps in our multi-task set-up; in this
setting NE++ always outperforms the simpler architecture. Both architectures can achieve systematic
generalisation. Limitations of our work are that the results are specific to algorithms on static graphs.
Furthermore, as the number of execution steps increases faster than linearly in the number of nodes,
results are likely to worsen significantly.
This paper's contributions are fundamental in nature and thus the societal impact of this paper is
low and there are no associated ethical risks. Any benefits or risks stem from further advances in
reasoning systems that may in some form be based on this work.
⁸ Experiment 1 (in the Supplementary material) shows that the information from a processor can be used and
recovered.
Acknowledgments and Disclosure of Funding
We would like to thank Meng Qu, Zhaocheng Zhu, and Zuobai Zhang for proof reading the manuscript
prior to submission.
This project is supported by the Natural Sciences and Engineering Research Council (NSERC)
Discovery Grant, the Canada CIFAR AI Chair Program, collaboration grants between Microsoft
Research and Mila, Samsung Electronics Co., Ltd., Amazon Faculty Research Award, Tencent AI
Lab Rhino-Bird Gift Fund and a NRC Collaborative R&D Project (AI4D-CORE-06). This project
was also partially funded by IVADO Fundamental Research Project grant PRF-2019-3583139727.
Petar Veličković is a Research Scientist at DeepMind.
References
[1] Petar Veličković, Rex Ying, Matilde Padovano, Raia Hadsell, and Charles Blundell. Neural
execution of graph algorithms. arXiv preprint arXiv:1910.10593, 2019.
[2] Dumitru Erhan, Aaron Courville, Yoshua Bengio, and Pascal Vincent. Why does unsupervised
pre-training help deep learning? In Proceedings of the thirteenth international conference
on artificial intelligence and statistics, pages 201–208. JMLR Workshop and Conference
Proceedings, 2010.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
2018.
[4] TB Brown, B Mann, N Ryder, M Subbiah, J Kaplan, P Dhariwal, A Neelakantan, P Shyam,
G Sastry, A Askell, et al. Language models are few-shot learners. arXiv preprint
arXiv:2005.14165, 2020.
[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly,
Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image
recognition at scale, 2020.
[6] Petar Veličković, Lars Buesing, Matthew C Overlan, Razvan Pascanu, Oriol Vinyals, and
Charles Blundell. Pointer graph networks. arXiv preprint arXiv:2006.06380, 2020.
[7] Vojtěch Jarník. On a certain problem of minimization. Práce moravskè přírodovědecké
společnosti 6, fasc. 4, pages 57–63, 1930. URL http://hdl.handle.net/10338.dmlcz/500726.
[8] R.C. Prim. Shortest connection networks and some generalizations. Bell System Technical
Journal, pages 1389–1401, 1957. URL https://archive.org/details/bstj36-6-1389.
[9] E.W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik,
pages 269–271, 1959. URL https://doi.org/10.1007/BF01386390.
[10] Andreea Deac, Petar Veličković, Ognjen Milinković, Pierre-Luc Bacon, Jian Tang, and Mladen
Nikolić. Xlvin: executed latent value iteration nets, 2020.
[11] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural
networks. arXiv preprint arXiv:1511.05493, 2015.
[12] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional
networks. arXiv preprint arXiv:1609.02907, 2016.
[13] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural
message passing for quantum chemistry. In International Conference on Machine Learning,
pages 1263–1272. PMLR, 2017.
[14] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua
Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[15] Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S. Du, Ken ichi Kawarabayashi, and Stefanie
Jegelka. What can neural networks reason about?, 2020.
[16] Wojciech Zaremba and Ilya Sutskever. Learning to execute, 2015.
[17] Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms, 2016.
[18] Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines,
2016.
[19] Scott Reed and Nando de Freitas. Neural programmer-interpreters, 2016.
[20] Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Theophane Weber,
Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap. Relational recurrent
neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi,
and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31.
Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/
file/e2eabaf96372e20a9e3d4b5f83723a61-Paper.pdf.
[21] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst.
Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine,
34(4):18–42, Jul 2017. ISSN 1558-0792. doi: 10.1109/msp.2017.2693418. URL
http://dx.doi.org/10.1109/MSP.2017.2693418.
[22] William L. Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods
and applications, 2018.
[23] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zam-
baldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner,
Caglar Gulcehre, Francis Song, Andrew Ballard, Justin Gilmer, George Dahl, Ashish Vaswani,
Kelsey Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra,
Pushmeet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational
inductive biases, deep learning, and graph networks, 2018.
[24] Yujun Yan, Kevin Swersky, Danai Koutra, Parthasarathy Ranganathan, and Milad Hashemi.
Neural execution engines: Learning to execute subroutines, 2020.
[25] S Bozinovski and A Fulgosi. The influence of pattern similarity and transfer learning upon
training of a base perceptron b2. In Proceedings of Symposium Informatica, pages 3–121, 1976.
[26] S Bozinovski. Teaching space: A representation concept for adaptive pattern classification.
Technical report, COINS Technical Report, University of Massachusetts at Amherst, 1981.
[27] Brenden K Petersen, Mikel Landajuela Larma, Terrell N. Mundhenk, Claudio Prata Santiago,
Soo Kyung Kim, and Joanne Taery Kim. Deep symbolic regression: Recovering mathematical
expressions from data via risk-seeking policy gradients. In International Conference on Learning
Representations, 2021. URL https://openreview.net/forum?id=m5Qsh0kBQG.
[28] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
A Appendix
A.1 Pseudo-code
In this section we give the pseudo-code for the initialise_nodes and relax_edge functions for
all the algorithms. Algorithms marked Parallel use the framework in Alg. 1, and algorithms marked
Sequential use the framework in Alg. 2.
Algorithm 7 WIDEST PATH (Parallel)
function INITIALISE_NODES(G.vertices, i)
    for v ∈ G.vertices do
        if v = i then
            v.key ← ∞
            v.pred ← v
        else
            v.key ← 0
            v.pred ← ⊥
        end if
    end for
end function
function RELAX_EDGE(u, v, w)
    if v.key < min(u.key, w(u, v)) then
        v.key ← min(u.key, w(u, v))
        v.pred ← u
    end if
end function

Algorithm 8 WIDEST PATH (Sequential)
function INITIALISE_NODES(G.vertices, i)
    for v ∈ G.vertices do
        if v = i then
            v.key ← ∞
            v.pred ← v
        else
            v.key ← 0
            v.pred ← ⊥
        end if
    end for
end function
function RELAX_EDGE(u, v, w)
    if v.key < min(u.key, w(u, v)) then
        v.key ← min(u.key, w(u, v))
        v.pred ← u
    end if
end function
Algorithm 9 MOST RELIABLE PATH (Parallel)
function INITIALISE_NODES(G.vertices, i)
    for v ∈ G.vertices do
        if v = i then
            v.key ← 1
            v.pred ← v
        else
            v.key ← 0
            v.pred ← ⊥
        end if
    end for
end function
function RELAX_EDGE(u, v, w)
    if v.key < u.key × w(u, v) then
        v.key ← u.key × w(u, v)
        v.pred ← u
    end if
end function

Algorithm 10 MOST RELIABLE PATH (Sequential)
function INITIALISE_NODES(G.vertices, i)
    for v ∈ G.vertices do
        if v = i then
            v.key ← 1
            v.pred ← v
        else
            v.key ← 0
            v.pred ← ⊥
        end if
    end for
end function
function RELAX_EDGE(u, v, w)
    if v.key < u.key × w(u, v) then
        v.key ← u.key × w(u, v)
        v.pred ← u
    end if
end function
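The two pairs of functions above differ only in the source key and in how a candidate key is formed from u.key and w(u, v). The following Python sketch illustrates this under stated assumptions: the parallel framework of Alg. 1 is not reproduced in this appendix, so `run_parallel` assumes a Bellman-Ford-style loop in which every edge is relaxed at each step using the keys of the previous step. The names `run_parallel`, `combine`, and `source_key`, and the edge-dictionary graph representation, are hypothetical and chosen only for illustration.

```python
import math


def initialise_nodes(vertices, source, source_key):
    """Source gets source_key and points to itself; all other keys start at 0."""
    key = {v: 0.0 for v in vertices}
    pred = {v: None for v in vertices}  # None plays the role of ⊥
    key[source] = source_key
    pred[source] = source
    return key, pred


def relax_edge(u, v, w, old_key, key, pred, combine):
    """Update v if the candidate built from u's previous key is strictly larger."""
    candidate = combine(old_key[u], w)
    if key[v] < candidate:
        key[v] = candidate
        pred[v] = u
        return True
    return False


def run_parallel(vertices, edges, source, source_key, combine):
    """Assumed stand-in for Alg. 1: relax every edge per step until nothing changes."""
    key, pred = initialise_nodes(vertices, source, source_key)
    for _ in range(len(vertices)):
        old_key = dict(key)  # all relaxations in a step read the previous step's keys
        changed = False
        for (u, v), w in edges.items():
            changed |= relax_edge(u, v, w, old_key, key, pred, combine)
        if not changed:
            break
    return key, pred


def widest_path(vertices, edges, source):
    # Algorithm 7: source key is infinity, candidate is min(u.key, w(u, v)).
    return run_parallel(vertices, edges, source, math.inf, min)


def most_reliable_path(vertices, edges, source):
    # Algorithm 9: source key is 1, candidate is u.key * w(u, v).
    return run_parallel(vertices, edges, source, 1.0, lambda k, w: k * w)


# Small directed example graph (illustrative values only).
VERTICES = [0, 1, 2, 3]
EDGES = {(0, 1): 0.5, (1, 2): 0.9, (0, 2): 0.3, (2, 3): 0.8}
```

Calling `widest_path(VERTICES, EDGES, 0)` or `most_reliable_path(VERTICES, EDGES, 0)` returns the key and predecessor dictionaries; on this graph both algorithms reach node 2 through node 1 rather than through the direct edge of weight 0.3.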
Algorithm 11 DEPTH-FIRST SEARCH (DFS) (Sequential)
function INITIALISE_NODES(G.vertices, i)
    for v ∈ G.vertices do
        if v = i then
            v.key ← |G.vertices|
            v.pred ← v
        else
            v.key ← ∞
            v.pred ← ⊥
        end if
    end for
end function
function RELAX_EDGE(u, v, w)
    if v.key = ∞ then
        v.key ← u.key − 1
        v.pred ← u
    end if
end function
In this section we verify the choice of taking the maximum of several trajectories, as described
in § 4.2, by taking the mean instead. The results are in Tab. 7 and confirm our hypothesis.
Table 7: No algorithm (seq.). We take the mean of 10 trajectories instead of the max.
                  DIJKSTRA                             MOST RELIABLE
Model   #Nodes    Key                 Predecessor      Key                  Predecessor
NE      20        6.22e-5 ± 5e-5      0.014 ± 0.01     0.00279 ± 0.0004     0.158 ± 0.06
        50        1.13e6 ± 2e6        0.121 ± 0.08     11 ± 5               0.615 ± 0.1
        100       1.39e17 ± 2e17      0.237 ± 0.2      3.31e5 ± 2e5         0.658 ± 0.03
NE++    20        0.00221 ± 0.002     0.026 ± 0.01     0.000879 ± 8e-5      0.121 ± 0.05
        50        13.3 ± 20           0.257 ± 0.06     0.232 ± 0.1          0.503 ± 0.01
        100       1.38e6 ± 1e6        0.629 ± 0.1      0.378 ± 0.2          0.576 ± 0.02
In this section we give the results for each graph type separately for the sequential setting, and
we additionally report termination accuracy for the teacher forcing setting.
Termination accuracy: We measure termination accuracy according to the following formula:
$$\mathrm{Term} = 1 - \frac{|T_{\mathrm{pred}} - T_{\mathrm{true}}|}{T_{\mathrm{true}}}, \qquad (1)$$
where $T_{\mathrm{true}}$ is the correct integer last step and $T_{\mathrm{pred}}$ the predicted last step. The
theoretically possible range is $(-\infty, 1]$: we reach 1 only when predicting the correct step, and the
value decreases as the prediction moves further from the correct last step, possibly becoming
negative (this can only happen by terminating much later than is correct). In practice, the range is
limited because we run the network for a maximum number of steps equal to the number of nodes in
the graph.
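As a minimal sketch of Eq. (1), the function below computes the termination accuracy for a single graph. The function name and the input validation are illustrative assumptions, not part of the original evaluation code.

```python
def termination_accuracy(t_pred: int, t_true: int) -> float:
    """Eq. (1): 1 - |T_pred - T_true| / T_true, for a positive true last step."""
    if t_true <= 0:
        raise ValueError("t_true must be a positive step index")
    return 1.0 - abs(t_pred - t_true) / t_true


# Exact prediction gives 1; stopping halfway gives 0.5; running for more than
# twice the correct number of steps gives a negative value.
exact = termination_accuracy(10, 10)
early = termination_accuracy(5, 10)
late = termination_accuracy(25, 10)
```

The negative case requires |T_pred − T_true| > T_true, which for positive step counts means T_pred > 2·T_true; this is why only late termination can push the score below zero.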
Table 8: Teacher forcing (seq.).
DIJKSTRA
20 0.014 0.016 0.023 0.0492 0.0358 0.0252 0.003 0.007 0.004 0.997 0.993 0.996
NE 50 0.101 0.085 0.079 1.49 0.127 0.0903 0.042 0.008 0.01 0.962 0.993 0.994
100 0.311 0.367 0.344 13.4 0.556 0.373 0.124 0.041 0.026 0.921 0.966 0.984
20 0.004 0.01 0.01 0.00692 0.0159 0.00955 0.002 0.004 0.003 0.998 0.996 0.997
NE++ 50 0.311 0.37 0.369 1.33e3 0.919 0.657 0.238 0.202 0.194 0.881 0.937 0.952
100 0.75 0.72 0.718 4.87e10 3.61e4 1.86e4 0.621 0.489 0.456 0.67 0.922 0.936
MOST RELIABLE
20 0.079 0.189 0.253 0.00376 0.0252 0.0502 0.015 0.049 0.078 0.985 0.951 0.922
NE 50 0.279 0.561 0.635 0.00605 0.0704 0.126 0.038 0.133 0.165 0.964 0.865 0.833
100 0.485 0.786 0.826 0.00549 0.111 0.155 0.083 0.206 0.225 0.924 0.789 0.773
20 0.173 0.258 0.284 0.0116 0.0337 0.0621 0.037 0.058 0.064 0.963 0.942 0.936
NE++ 50 0.45 0.612 0.609 0.0114 0.0762 0.122 0.08 0.107 0.111 0.926 0.894 0.889
100 0.661 0.817 0.81 0.0163 0.109 0.152 0.177 0.165 0.16 0.829 0.833 0.842
20 0.000208 5.14e-5 5.29e-5 0.01 0.018 0.04 0.00923 0.00912 0.00651 0.1 0.209 0.253
NE 50 2.91e6 5.85 6.33 0.289 0.03 0.044 12.3 18.4 29.4 0.788 0.948 0.955
100 1e17 9700 9360 0.485 0.039 0.058 1.24e6 912000 1.8e6 0.694 0.789 0.82
20 1.07e-5 4.32e-6 4.2e-6 0.001 0.004 0.009 0.000431 0.000556 0.000409 0.042 0.149 0.188
NE++ 50 4.83 0.369 0.236 0.094 0.123 0.229 227 75.2 74.1 0.495 0.56 0.643
100 24900 468 221 0.34 0.365 0.459 8.28e13 7.51e11 4.5e11 0.692 0.806 0.781
Table 10: Multi-task (seq.) Trained with PRIM and WIDEST respectively.
DIJKSTRA MOST RELIABLE
20 0.00436 0.00319 0.00332 0.028 0.038 0.06 0.254 0.18 0.188 0.342 0.444 0.571
NE 50 14.4 10.2 10 0.303 0.045 0.053 1.19 2.3 2.05 0.422 0.528 0.552
100 88.6 141 147 0.733 0.102 0.073 2.24 9.33 7.83 0.609 0.598 0.584
20 0.000267 0.000195 7.37e-5 0.01 0.016 0.031 0.00255 0.00246 0.00337 0.076 0.178 0.244
NE++ 50 0.372 0.000154 0.867 0.294 0.027 0.163 0.592 0.00203 0.00151 0.177 0.18 0.198
100 6.94 0.301 1.48 0.493 0.109 0.245 2.53 0.00148 0.00233 0.432 0.184 0.184
20 0.0204 0.0119 0.00982 0.048 0.073 0.122 0.021 0.0212 0.0282 0.171 0.25 0.322
NE Freeze 50 341 16.8 7.72 0.466 0.639 0.687 3.47e5 4.07e5 4.2e5 0.876 0.862 0.856
100 9.49e6 4e4 2.03e4 0.793 0.533 0.494 4.05e15 5.36e15 5.77e15 0.82 0.728 0.734
20 0.00341 0.00159 0.0013 0.022 0.049 0.08 0.0299 0.0363 0.0418 0.143 0.23 0.307
NE Fine-tune 50 2.91e3 245 174 0.366 0.157 0.2 1.72e3 2.11e3 2.17e3 0.47 0.689 0.749
100 1.77e8 6.98e5 4.34e5 0.557 0.289 0.317 9.01e8 1.15e9 1.13e9 0.716 0.688 0.724
20 0.00203 0.00114 0.000919 0.023 0.063 0.093 0.0124 0.0157 0.0207 0.137 0.226 0.331
NE 2-Proc. 50 36.2 3.86 2.51 0.096 0.156 0.235 175 240 198 0.786 0.698 0.763
100 3.07e4 225 339 0.205 0.356 0.355 7.27e9 1.34e10 1.59e10 0.769 0.824 0.851
20 0.00269 0.000876 0.000516 0.035 0.066 0.089 0.00641 0.00619 0.008 0.091 0.202 0.303
NE++ Freeze 50 115 6.53 5.9 0.802 0.894 0.828 290 549 557 0.415 0.649 0.676
100 782 2.87 1.41 0.8 0.941 0.946 3.25e7 1.03e8 1.06e8 0.562 0.729 0.725
20 0.000788 0.000284 0.00017 0.011 0.034 0.057 0.00823 0.0041 0.00772 0.103 0.187 0.284
NE++ Fine-tune 50 38 1.42 0.955 0.927 0.98 0.98 397000 6.90e4 7.12e4 0.694 0.773 0.803
100 7.54e4 1.6 1.26 0.908 0.99 0.99 1.49e15 4.83e14 4.76e14 0.705 0.824 0.794
20 0.00694 0.00375 0.0026 0.021 0.05 0.109 0.00243 0.00235 0.00183 0.088 0.172 0.201
NE++ 2-Proc. 50 47.8 0.546 1.49 0.508 0.256 0.306 404 290 223 0.61 0.64 0.681
100 1.35e4 119 383 0.764 0.808 0.764 2.4e6 4.44e6 3.88e6 0.771 0.808 0.795
A.4 Parallel: Breakdown tables
In this section we give the results for each graph type separately for the parallel setting, and
we additionally report termination accuracy for the teacher forcing setting.
MOST RELIABLE
20 0.0145 0.0382 0.00181 0.03 0.061 0.079 0.0204 0.0187 0.0147 0.147 0.233 0.297
NE 50 177 0.184 0.00364 0.304 0.096 0.091 0.4 0.0201 0.0212 0.298 0.331 0.352
100 5.93e6 1.21 0.00474 0.54 0.126 0.116 31 0.0262 0.0295 0.5 0.403 0.403
20 0.00528 0.00152 0.000793 0.012 0.026 0.046 0.0147 0.00904 0.00498 0.064 0.157 0.214
NE++ 50 0.671 0.00489 0.00155 0.072 0.045 0.055 0.0955 0.00768 0.00675 0.133 0.186 0.193
100 589 0.0192 0.00215 0.148 0.068 0.068 361 0.0112 0.0108 0.24 0.21 0.202
20 0.253 0.246 0.00821 0.078 0.103 0.151 0.577 0.0559 0.0599 0.229 0.305 0.431
NE Freeze 50 8.28e5 1.18 0.0176 0.319 0.169 0.187 2.76e4 0.294 0.197 0.378 0.484 0.547
100 1.53e17 5.36 0.0272 0.434 0.221 0.222 6.73e10 1.61 1.15 0.508 0.573 0.611
20 0.0635 0.0475 0.00463 0.038 0.078 0.101 0.029 0.0224 0.0148 0.164 0.237 0.311
NE Fine-tune 50 74.8 0.271 0.00829 0.231 0.124 0.13 0.95 0.0206 0.0262 0.318 0.318 0.358
100 515000 0.931 0.013 0.413 0.162 0.151 691 0.034 0.0364 0.437 0.374 0.394
20 0.128 0.226 0.00439 0.051 0.097 0.14 0.0326 0.0243 0.0157 0.138 0.223 0.317
NE 2-Proc. 50 1.68e3 1.02 0.00731 0.615 0.143 0.165 11.2 0.0195 0.0231 0.284 0.327 0.38
100 4.99e6 1.86 0.00816 0.878 0.196 0.194 1.44e6 0.0403 0.0474 0.43 0.428 0.435
20 0.0281 0.0573 0.00251 0.028 0.082 0.095 0.0196 0.0114 0.00888 0.15 0.236 0.362
NE++ Freeze 50 172 0.181 0.00515 0.383 0.123 0.114 1.87 0.0139 0.0153 0.259 0.315 0.385
100 1.45e7 1.71 0.00734 0.609 0.171 0.141 19800 0.0282 0.0233 0.384 0.394 0.406
20 0.258 0.246 0.00939 0.078 0.139 0.203 0.0216 0.012 0.00804 0.132 0.202 0.265
NE++ Fine-tune 50 5.73e3 1.24 0.0223 0.494 0.198 0.224 8.49 0.00868 0.00788 0.227 0.248 0.256
100 3.86e9 7.07 0.0267 0.724 0.288 0.269 12500 0.0115 0.00801 0.394 0.283 0.274
20 0.019 0.0444 0.00353 0.028 0.068 0.09 0.0168 0.0138 0.00868 0.113 0.194 0.281
NE++ 2-Proc. 50 1.69 0.302 0.00598 0.11 0.098 0.106 9.11 0.0107 0.0112 0.374 0.266 0.299
100 30 2.46 0.00767 0.234 0.14 0.13 1740 0.0131 0.0107 0.564 0.324 0.344
Table 15: Multi-task (par.) Trained with BFS and WIDEST respectively.
BELLMAN-FORD MOST RELIABLE
20 0.037 0.00771 0.00163 0.017 0.036 0.05 0.324 0.121 0.0748 0.331 0.327 0.38
NE 50 18.6 0.036 0.0028 0.045 0.054 0.053 0.976 0.15 0.0948 0.402 0.334 0.349
100 3270 1320 0.0043 0.114 0.106 0.067 1.53 0.199 0.114 0.443 0.344 0.34
20 0.00868 0.0015 0.000407 0.012 0.019 0.038 0.00741 0.00632 0.00643 0.067 0.177 0.216
NE++ 50 0.0382 0.00339 0.000544 0.023 0.031 0.036 0.0095 0.00537 0.00927 0.165 0.184 0.197
100 26.5 0.00565 0.000693 0.287 0.053 0.049 0.00929 0.00739 0.0124 0.24 0.199 0.197
In the main paper we did not give the results for all the transfer methods (only the best one). In this
section we give the transfer results in the same style as the main paper as well as the teacher forcing
results.
Table 16: Teacher forcing (par.).
BELLMAN-FORD MOST RELIABLE PATH
The large standard deviation is to be expected given the widely different graph types. Given that all
edge weights have expected value 0.6, the maximum expected shortest distance in a grid graph of size
n is 0.6*(n/2+1), which is orders of magnitude larger than for an Erdős-Rényi or Barabási-Albert
graph, which has a diameter of O(log(n)) and thus a maximum expected shortest distance of
0.6*log(n). This large difference in shortest-path length makes large key errors when generalising
much more likely on a grid graph than on the other graph types. The converse holds for predecessor
prediction: on a grid graph the degree of a node is between 2 and 4 (constant regardless of the size
of the graph), while for Barabási-Albert and Erdős-Rényi graphs it grows with the size of the graph
and is much larger, making errors much more likely. This again yields vastly different error
percentages. See the breakdown across graph types in the tables above.
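To make the comparison concrete, the short sketch below evaluates the two expressions from the paragraph above for the graph sizes used in the tables (20, 50 and 100 nodes). It assumes the natural logarithm and treats the O(log(n)) diameter as exactly log(n), which ignores constant factors; the function names are illustrative.

```python
import math

MEAN_EDGE_WEIGHT = 0.6  # expected edge weight stated in the text


def grid_max_expected_distance(n: int) -> float:
    """0.6 * (n / 2 + 1), the grid-graph expression from the text."""
    return MEAN_EDGE_WEIGHT * (n / 2 + 1)


def random_graph_max_expected_distance(n: int) -> float:
    """0.6 * log(n), assuming natural log and a unit constant in O(log(n))."""
    return MEAN_EDGE_WEIGHT * math.log(n)


for n in (20, 50, 100):
    grid = grid_max_expected_distance(n)
    rand = random_graph_max_expected_distance(n)
    print(f"n={n}: grid={grid:.2f}, random={rand:.2f}, ratio={grid / rand:.1f}")
```

The gap widens with n, which is consistent with key errors on grid graphs growing faster when generalising to larger graphs.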
Table 18: 2-Processor transfer pre-trained on PRIM, DIJKSTRA, & DFS for sequential and BFS &
BELLMAN-FORD for parallel.
MOST RELIABLE (SEQ.) MOST RELIABLE (PAR.)
20 0.00265 0.00244 0.002 0.097 0.161 0.232 0.0518 0.0297 0.041 0.153 0.214 0.313
NE++ 2-Proc. 50 29.4 72.3 86.9 0.535 0.629 0.654 0.341 0.0611 0.0821 0.414 0.311 0.366
100 1.13e6 2.06e6 2.08e6 0.663 0.8 0.812 7.81 0.106 0.105 0.558 0.383 0.403
A.8.1 Experiment 1
Table 19: Pre-train on Dijkstra with teacher forcing, transfer with a frozen processor to see to what
extent the encoder/decoder can be recovered.
DIJKSTRA
Conclusion from Table 20: The results are mostly quite similar to the original results, but slightly
worse, thus indicating that while the re-use of a pre-trained processor is not trivial, it is not the
primary reason for transfer to fail.
A.8.2 Experiment 2
Table 20: Pre-train on Dijkstra with teacher forcing, transfer with a frozen processor to see to what
extent the encoder/decoder can be recovered.
DIJKSTRA
A.9 NeuralExecutor++
Let $X \in \mathbb{R}^{n \times k}$ be node states, where $n$ and $k$ are the number of nodes and features,
respectively. Each edge $(u, v)$ has a weight $w(u, v) \in \mathbb{R}$. The architecture keeps a hidden state
for each node, $H \in \mathbb{R}^{n \times l}$ with $l$ features, which is initialised to all zeros at time step 0.
The encoder $E$ consists of a 2-layer MLP with ReLU activation and is separate for each algorithm.
The encoder is applied on each edge to produce a hidden embedding
$E(X_i^{(t)}, H_i^{(t)}, X_j^{(t)}, H_j^{(t)}, W_{ij}^{(t)}) = Z_{ij}^{(t)}$, where $t$ indicates the $t$th time step in
the computation and $i, j$ refer to the nodes of the edge. This edge embedding is then passed to the
message function of the processor $P$, which is a message passing neural network (MPNN) with a
max aggregator and linear message and update functions (these message and update functions are
always shared between all algorithms). The processor computes the new hidden state for each node,
$P(H_i, A, W) = H_{t+1}$. Then, we have the decoder $D(Z_t, H_{t+1}) = Y_{t+1}$ and the predecessor predictor
$S(Z_t, H_{t+1}) = S_t$. Finally, a termination network $\sigma(T(H_{t+1}))$ decides whether we should terminate
or not.
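The processor step described above (a linear message function on each edge embedding, max aggregation over incoming edges, and a linear update) can be sketched in plain Python as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the choice to send the message for edge (i, j) to node j, the zero vector for nodes without incoming edges, and the concatenation of the previous hidden state with the aggregated message before the update are all assumptions made for this example.

```python
def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def processor_step(hidden, edge_embeddings, msg_w, msg_b, upd_w, upd_b):
    """One max-aggregation message passing step (assumed form).

    hidden: list of per-node hidden vectors (length l each).
    edge_embeddings: dict mapping a directed edge (i, j) to its embedding Z_ij.
    """
    new_hidden = []
    for j in range(len(hidden)):
        # Linear message for every edge pointing into node j.
        incoming = [linear(msg_w, msg_b, z)
                    for (_, dst), z in edge_embeddings.items() if dst == j]
        if incoming:
            # Element-wise max over all incoming messages.
            aggregated = [max(column) for column in zip(*incoming)]
        else:
            aggregated = [0.0] * len(msg_b)  # assumption: no messages means zeros
        # Linear update on the concatenation [H_j, aggregated message].
        new_hidden.append(linear(upd_w, upd_b, hidden[j] + aggregated))
    return new_hidden


# Toy usage with hand-picked weights (identity message, additive update).
hidden = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
edge_embeddings = {(0, 1): [1.0, 5.0], (2, 1): [3.0, 2.0], (1, 0): [4.0, 4.0]}
msg_w, msg_b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
upd_w, upd_b = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]], [0.0, 0.0]
next_hidden = processor_step(hidden, edge_embeddings, msg_w, msg_b, upd_w, upd_b)
```

In the toy usage, node 1 receives two messages and keeps the element-wise maximum of them, which is the behaviour the max aggregator is meant to provide for relaxation-style algorithms.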