How to transfer algorithmic reasoning knowledge to learn new algorithms?

Louis-Pascal A. C. Xhonneux∗ (Université de Montréal, Mila)
Andreea Deac (Université de Montréal, Mila)
Petar Veličković (DeepMind, London UK)
Jian Tang (HEC Montréal, Mila)

arXiv:2110.14056v1 [cs.LG] 26 Oct 2021

Abstract
Learning to execute algorithms is a fundamental problem that has been widely
studied. Prior work [1] has shown that to enable systematic generalisation on
graph algorithms it is critical to have access to the intermediate steps of the
program/algorithm. In many reasoning tasks, where algorithmic-style reasoning is
important, we only have access to the input and output examples. Thus, inspired
by the success of pre-training on similar tasks or data in Natural Language
Processing (NLP) and Computer Vision, we set out to study how we can transfer
algorithmic reasoning knowledge. Specifically, we investigate how we can use
algorithms for which we have access to the execution trace to learn to solve similar
tasks for which we do not. We investigate two major classes of graph algorithms,
parallel algorithms such as breadth-first search and Bellman-Ford and sequential
greedy algorithms such as Prim and Dijkstra. Due to the fundamental differences
between algorithmic reasoning knowledge and feature extractors such as used in
Computer Vision or NLP, we hypothesise that standard transfer techniques will not
be sufficient to achieve systematic generalisation. To investigate this empirically
we create a dataset including 9 algorithms and 3 different graph types. We validate
this empirically and show how instead multi-task learning can be used to achieve
the transfer of algorithmic reasoning knowledge.

1 Introduction
Transfer learning [2] has been responsible for significant successes in multiple areas of machine
learning, including Natural Language Processing (NLP) [3, 4] and Computer Vision (CV) [5]. Pre-
training and reusing the learned weights, by freezing them as feature extractors, or fine-tuning from
them as an initialisation, are common approaches to transfer in these domains. This has enabled
successful learning on problems where data is limited in some form.
Algorithmic reasoning on graphs [1, 6] is a fundamental problem that has been studied under the
assumption that we have access to the execution traces of the algorithms we want to learn. In practice,
for instance in many real-world reasoning tasks, this may not be true. Because data is limited in this
way, we look to transfer learning to enable us to solve algorithmic tasks without intermediate steps.
Specifically, we investigate how to transfer knowledge between similar graph algorithms in the setting
where we have access to the execution traces for some (e.g. PRIM [7, 8]), but not others (e.g.
DIJKSTRA [9]). This has not been explored before, but is important as many reasoning-related fields
care deeply about systematic generalisation, while only having input-output pairs to learn from. One
such example is knowledge graph link prediction. There exists a reasoning process that can answer the
question: what is the tail entity given the head entity and the relation? However, we do not have access
to its execution trace, i.e. the step-by-step deductions of the reasoning process, but we do have
input-output pair examples. We envision that the set-up studied and the direction proposed in this paper
will help to learn neural networks that can solve reasoning-style problems by using other reasoning
knowledge as an inductive bias. A slightly different but related application arises when the data the
graph algorithm needs to operate on is encoded in a high-level space. This is a scenario encountered in
reinforcement learning, and concurrent work by Deac et al. [10] has already started investigating the
process of pre-training an encoder with graph algorithms.
∗ Corresponding author
35th Conference on Neural Information Processing Systems (NeurIPS 2021)
Veličković et al. [1] successfully trained graph neural networks (GNNs) [11, 12, 13, 14] to execute
graph algorithms. Two key ingredients were access to the intermediate steps of the algorithm and
algorithmic alignment [15]. Algorithmic alignment [15] refers to a computation structure (in our
case a neural network) 'aligning' with the structure of an algorithm (in this paper a graph
algorithm). 'Alignment' means there exists a mapping between substructures of the computation
architecture and simple substeps of the algorithm, where the substructures can 'easily' compute the
corresponding substep they are mapped to, for some definition of easy (usually linear).
Extending this work, we enable solving tasks without access to the intermediate steps using other
similar algorithms as an inductive bias. To do this we add several new algorithms. Further, since the
algorithms studied in [1] were all expressible with only linear components in the GNN, we extend
their architecture with a more expressive encoder to enable algorithmic alignment with more tasks.
One assumption we make is that the algorithms are similar (e.g. both PRIM and DIJKSTRA greedily
select nodes from a priority queue). This shared algorithmic knowledge should serve as an inductive
bias to enable a neural network to systematically generalise, even when learning only on input-output
pairs. We hypothesise that due to differences between feature extraction and algorithmic reasoning,
successful approaches to transfer as used in CV and NLP will provide only minimal improvements
when transferring from one base algorithm to a target algorithm. While features across images may be
similar, the features of algorithms differ (e.g. shortest distance in DIJKSTRA and lightest edge to the
tree in PRIM), but are processed in a conceptually similar manner. Our intuition is that this conceptual
relationship may not yield weights that are near each other in the weight space, thus making transfer
difficult. We instead propose to use the base algorithms as inductive biases by training them with the
intermediate steps alongside the target algorithm, for which we do not provide intermediate
supervision. Our second hypothesis is that this will help systematic generalisation. We validate both
hypotheses empirically: transfer does not help, while multi-task learning does.
The contributions of this paper are:
1. presenting a new benchmark for transferring algorithmic reasoning knowledge on graphs;
2. sampling the best trajectory to stabilise training when no execution traces are available;
3. showing that standard transfer techniques fail to significantly improve systematic generalisation;
4. demonstrating how systematic generalisation can instead be improved on algorithmic tasks
with multi-task learning.

2 Related Work
Neural Execution: Many papers have studied how to execute algorithms before [16, 17, 18, 19, 20].
With the rise of graph representation learning [21, 22, 23] and graph neural networks [11, 12, 13, 14],
recent work has looked at executing graph algorithms [1, 6]. This line of work assumes access to the
intermediate steps of the algorithm. Further, prior work has looked at teaching transformers to learn
explicitly denoted subroutines and then to learn how to combine them [24]. The key difference between
our work and prior work is that we look at the setting of learning graph algorithms where we do not
have access to the execution trace and we do not denote subroutines explicitly. Instead we propose
to use similar algorithms as an inductive bias and look at how to enable this transfer of algorithmic
knowledge.

Transfer learning: Transfer learning was originally proposed in the 1970s [25, 26]. Today, it is a
common tool in NLP and Computer Vision to use pre-trained models and fine-tune them on the target
task [2, 3, 5]. This is helpful when the target task is similar and the available data is limited in some
way. Our work differs in two ways. Firstly, we focus on algorithmic reasoning knowledge, which we
hypothesise will require different techniques than reusing weights. Secondly, the data on the target
task is partially missing: specifically, the execution trace of the algorithm is unavailable.

3 Background
This section gives background on the algorithms and graphs used for the experiments. To be precise,
we consider a graph to be a tuple G = (V, E) with a set of vertices V and a set of edges E ⊆ V × V.²
We first explain the algorithms studied in this paper in more detail. After that, we give a brief
introduction to graph neural networks, the standard architecture paradigm for graph inputs.

3.1 Algorithms

We broadly study two classes of algorithmic reasoning: parallel and sequential reasoning. Specifically,
the parallel algorithms (shown in Algorithm 1) studied in this paper exchange messages with their
neighbours until an equilibrium is reached. Sequential algorithms, presented in Algorithm 2, greedily
remove elements from a priority queue and update the neighbouring nodes' keys.

Algorithm 1 Parallel

Input: graph G, weights w, source index i
initialise_nodes(G.vertices, i)
repeat
    for (u, v) ∈ G.edges() do
        relax_edge(u, v, w)
    end for
until none of the nodes change

Algorithm 2 Sequential

Input: graph G, edge weights w, source node index i
initialise_nodes(G.vertices, i)
Q ← PriorityQueue(G.vertices)
repeat
    u ← Q.pop_min()
    for v ∈ G.neighbours(u) do
        relax_edge(u, v, w)
    end for
until Q is empty

Different algorithms implement their own relax_edge and initialise_nodes functions;
however, the overall framework stays the same and can be learned by our GNN architecture (§ 4.1).
For the sequential algorithms each node has a state feature indicating whether it has been removed
from the priority queue, a key feature for the priority queue, and a pointer to the predecessor node.
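As a rough illustration of the two frameworks above, the following Python sketch instantiates Algorithm 1 as Bellman-Ford and Algorithm 2 as Dijkstra. The function names, the adjacency-dictionary graph format, and the use of `heapq` with lazy insertion (instead of a queue pre-filled with all vertices) are assumptions made for this example and are not taken from the paper's implementation.

```python
import heapq
import math

def parallel_bellman_ford(graph, source):
    """Algorithm 1 pattern: relax every edge until no node changes.

    graph: dict mapping node -> {neighbour: edge weight} (assumed format).
    """
    dist = {v: math.inf for v in graph}       # initialise_nodes
    dist[source] = 0.0
    changed = True
    while changed:                            # repeat ... until none of the nodes change
        changed = False
        new_dist = dict(dist)                 # synchronous update, like one parallel step
        for u in graph:
            for v, w in graph[u].items():
                if dist[u] + w < new_dist[v]: # relax_edge(u, v, w)
                    new_dist[v] = dist[u] + w
                    changed = True
        dist = new_dist
    return dist

def sequential_dijkstra(graph, source):
    """Algorithm 2 pattern: pop the minimum key, then relax its neighbours."""
    key = {v: math.inf for v in graph}        # key feature of each node
    pred = {v: v for v in graph}              # pointer to the predecessor node
    key[source] = 0.0
    done = set()                              # state: already removed from the queue
    queue = [(0.0, source)]
    while queue:
        _, u = heapq.heappop(queue)
        if u in done:                         # stale queue entry, skip it
            continue
        done.add(u)
        for v, w in graph[u].items():
            if v not in done and key[u] + w < key[v]:
                key[v] = key[u] + w
                pred[v] = u
                heapq.heappush(queue, (key[v], v))
    return key, pred

# Small undirected example graph with exactly representable weights.
graph = {0: {1: 0.5, 2: 1.0}, 1: {0: 0.5, 2: 0.25}, 2: {0: 1.0, 1: 0.25}}
```

Both functions return the same shortest distances on this graph; only the order in which edges are relaxed differs, which is the distinction between the parallel and sequential classes.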
We study the following algorithms:
PARALLEL
1. BREADTH-FIRST SEARCH (BFS)
2. BELLMAN-FORD
3. WIDEST PATH (PARALLEL)
4. MOST RELIABLE PATH (PARALLEL)

SEQUENTIAL
1. PRIM
2. DIJKSTRA
3. DEPTH-FIRST SEARCH (DFS)
4. WIDEST PATH (SEQUENTIAL)
5. MOST RELIABLE PATH (SEQUENTIAL)

We give the pseudo-code for all algorithms in the supplementary. Note that the relax_edge function
becomes increasingly difficult for the neural network to learn going down the sequential list: PRIM
uses the edge weight as a key, thus the relax_edge function is the identity; for DIJKSTRA and
DEPTH-FIRST SEARCH, addition is necessary to compute the edge update; for WIDEST PATH we need
to take a maximum, which with a ReLU activation can still be done exactly; and for MOST RELIABLE
PATH the neural network needs to learn to approximate multiplication.³
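The four key computations named above can be written as one-line functions. This is only an illustrative sketch: the function names are invented for this example, the exact definitions are in the paper's supplementary (not reproduced here), and the `max` form for WIDEST PATH simply follows the wording of the text.

```python
def prim_key(key_u, w):
    """PRIM: the candidate key is the edge weight itself (identity on w)."""
    return w

def dijkstra_key(key_u, w):
    """DIJKSTRA: the candidate key adds the edge weight to the key of u."""
    return key_u + w

def widest_key(key_u, w):
    """WIDEST PATH: the candidate key takes a maximum (assumed form, see lead-in)."""
    return max(key_u, w)

def most_reliable_key(key_u, w):
    """MOST RELIABLE PATH: the candidate key multiplies key and edge weight."""
    return key_u * w
```

The progression from identity to addition to maximum to multiplication mirrors the increasing difficulty described in the text: the first two are linear, the third is piecewise linear, and the last is not.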
² Throughout the paper we use simple weighted graphs.
³ We note that the tasks are all linear time in the size of the graph. More challenging problems such as NP-hard
problems are primarily more difficult because of their increased run-time and hence longer sequences allowing

3.2 Graph Neural Networks

A general framework to describe several graph neural network architectures is the message passing
framework introduced by Gilmer et al. [13]. Such graph neural networks (GNNs) consist of three
parts: a message function M, an update function U, and an aggregator ⊕. M and U are arbitrary
neural networks. The high-level idea is

1. each node computes a message for its neighbours using M;
2. then each node aggregates the received messages using ⊕;
3. finally, each node updates its embedding using U.

The message function in this paper takes as input the sending and receiving nodes’ representation as
well as an encoding of the edge feature along which the message is sent. The aggregator chosen is
the max aggregator applied element-wise as it was determined to work best by Veličković et al. [1]
and provides algorithmic alignment [15] between the architecture and the tasks. The update function
takes as input the node’s previous embedding and aggregated messages.
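The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: `mpnn_step`, the list-based data layout, and the toy message and update functions are hypothetical, and nodes that receive no message are given a zero aggregate, which the text does not specify.

```python
def mpnn_step(h, edges, edge_feat, message_fn, update_fn):
    """One message-passing step with an element-wise max aggregator.

    h:          list of node embeddings (each a list of floats)
    edges:      list of (sender, receiver) index pairs
    edge_feat:  list of edge encodings, aligned with edges
    message_fn: M(sender embedding, receiver embedding, edge encoding) -> message
    update_fn:  U(previous embedding, aggregated messages) -> new embedding
    """
    n, d = len(h), len(h[0])
    neg_inf = float("-inf")
    agg = [[neg_inf] * d for _ in range(n)]
    for (s, r), e in zip(edges, edge_feat):
        msg = message_fn(h[s], h[r], e)                     # step 1: compute message
        agg[r] = [max(a, m) for a, m in zip(agg[r], msg)]   # step 2: element-wise max
    # Assumption: nodes without incoming messages get a zero aggregate.
    agg = [[0.0 if a == neg_inf else a for a in row] for row in agg]
    return [update_fn(h[i], agg[i]) for i in range(n)]      # step 3: update embedding

# Toy example: the message is sender value plus edge value, the update keeps the aggregate.
h = [[1.0], [2.0], [5.0]]
edges = [(0, 2), (1, 2), (2, 0)]
edge_feat = [[0.5], [0.25], [1.0]]
out = mpnn_step(h, edges, edge_feat,
                message_fn=lambda hs, hr, e: [hs[0] + e[0]],
                update_fn=lambda h_old, a: [a[0]])
```

In the paper's setting M and U are learned networks rather than fixed lambdas; the sketch only shows where the max aggregator sits between them.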

3.3 Problem definition

We study learning graph algorithms that take in a graph G, a weight function w : E → ℝ that assigns
a weight to each edge in the graph, and node features X : V → ℝ^k. Algorithms compute an output
Y : V → F^k and a predecessor P : V → V at each time step. This encompasses a large class of
graph algorithms solving tasks such as reachability or shortest-path.
For instance, for a sequential algorithm the i-th node's features X_t[i] would be the key value and the
state variable. Given X_t, the graph G, and the edge weights, the graph algorithm at each iteration
returns the node features Y_t and the predecessor P_t for each node (pred variable in pseudocode § 3.1).
Y_T and P_T at the final step are considered the output of the algorithm. The intermediate steps refer to
X_t, Y_t, P_t ∀t ∈ {1, . . . , T − 1} after each iteration.

4 Methods
In this section, we first briefly present two architectures we will use for training and then discuss how
existing algorithmic knowledge can be used to solve tasks when no execution trace is given.

4.1 Architecture

As our starting point we choose the encoder-processor-decoder architecture proposed in Veličković
et al. [1]. The high-level idea is that the network encodes the current node state into a hidden
embedding, on which a graph neural network unit, called the processor, is applied using the graph
structure. The results are then decoded by the decoder, which does the prediction of the node features
at the next time step (§ 3.1). The point of this architecture is to allow several algorithms to use the
same processor architecture. A more detailed summary of the architecture is given below:

4.1.1 NeuralExecutor
The NeuralExecutor (NE) [1] uses an encoder-processor-decoder architecture. Let X ∈ ℝ^(n×k) be
node states, where n, k are the number of nodes and features, respectively. Each edge (u, v) has a
weight w(u, v) ∈ ℝ. The architecture keeps a hidden state for each node H ∈ ℝ^(n×l) with l features,
which is initialised to all zeros. The encoder E consists of a linear layer and computes a hidden
embedding E(X_i, H_i) = Z_i, where i indicates the i-th step in the computation. The processor P is a
message passing neural network (MPNN) with a max aggregator and linear message and update
functions. The processor computes the new hidden state for each node P(H_i, A, w) = H_{i+1}. Then,
we have the decoder D(Z_i, H_{i+1}) = Y_{i+1} and predecessor predictor S(Z_i, H_{i+1}) = S_i. Finally, a
termination network σ(T(H_{i+1})) decides whether we should terminate or not.
³ (continued) for more accumulation of errors. As long as we have algorithmic alignment [15] between the architecture and the
algorithm in question there is no additional challenge to NP-hard problems except the length of their sequence
and hence more opportunities to introduce errors and propagate them.

Figure 1: x_0, x_1 are the node states of two algorithms respectively. w are the edge weights. p_0, p_1
are the predecessor predictions for the two algorithms respectively and y_0, y_1 are the next node
state predictions. h^i are the previous hidden states kept by the network and e^(ij) is the computed
edge weight embedding. Superscript indices indicate a particular node. (a) Shows the original
Neural Executor architecture when doing multi-task learning. (b) Shows the more expressive Neural
Executor++ when doing multi-task learning. The key difference is that (b) forces a common way to
operate on a hidden embedding space, while (a) focuses on achieving both at the same time.

We make minor changes to allow for better algorithmic alignment: we remove the ReLU activations
from the processor and replace the termination layer with a processor and linear module.⁴

4.1.2 NeuralExecutor++
Veličković et al. [1] showed positive transfer when learning algorithms in a multi-task setup with
intermediate steps. They did so by concatenating together the node features and encoding them
together into the same h embedding (see Fig. 1). This has the advantage of giving strong guidance
to the secondary algorithm learned in this multi-task set-up. However, ideally we are able to use
the base algorithm only during training, without having to use it at inference time, which would
introduce another failure mode. Thus, we propose to change the architecture as follows:
Since the node encoder is unable to learn the individual subroutines operating on edges, we replace
it with a latent encoder for each task that operates on edges and for an edge (i, j) takes in
[h^(i), x^(i), e^(ij), h^(j), x^(j)] (Fig. 1). The important difference is that each task has its own encoder
rather than all tasks sharing one encoder. The goal is to force the model to learn the shared subroutines
in the processor only and the specialised subroutines in the latent encoder only (see Fig. 1).
Further, we change the latent encoder to consist of a linear and a non-linear encoder in parallel whose
outputs are added together. This should allow for algorithmic alignment with a much larger array of tasks
(specifically WIDEST PATH), but comes at the cost that overfitting is more likely, which may hurt
generalisation to larger graphs. This last problem has been avoided in prior work [1, 15] by having
the neural network only learn linear components, which will not be able to overfit easily.⁵

4.2 Training and Loss functions

Sequential algorithms (Seq): We used softmax for the prediction of the next node to be removed
from the queue and softmax for the predecessor prediction, where we masked out all nodes except
the neighbours and the node itself. We used a smooth l1 loss with β = 0.001 for the prediction of the
key of the selected node. Finally, we used binary cross entropy for termination prediction. We
only update the node state of the chosen node, which helps with drift of the node state at test time.
Parallel algorithms (Par): We used a smooth l1 loss with β = 0.001 for the prediction of the key
except for BFS where we used binary cross-entropy as the node state is either 0 or 1. In this setting,
during teacher-forcing, we masked out nodes that are unreachable at a given time step.
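For reference, a smooth l1 loss with threshold β can be sketched as below. The paper does not spell out the formula, so this assumes the commonly used definition (quadratic for errors below β, linear above it); the function name is hypothetical.

```python
def smooth_l1(pred, target, beta=0.001):
    """Smooth l1 loss for one scalar prediction (assumed standard definition)."""
    diff = abs(pred - target)
    if diff < beta:
        # Quadratic region: keeps the gradient small near zero error.
        return 0.5 * diff * diff / beta
    # Linear region: behaves like an l1 loss shifted by beta / 2.
    return diff - 0.5 * beta
```

With β = 0.001 the quadratic region is very narrow, so under this definition the loss behaves almost like a plain l1 loss for all but the smallest key errors.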
⁴ We remove the ReLU because it limits the ability of the max aggregator to minimise values by using negative
inputs. The termination condition depends on whether the remaining nodes are still reachable, hence an MPNN
is more appropriate than a linear layer.
⁵ See the Supplementary for a more formal description.

Teacher forcing (TF): We train the network to predict the next step given the ground truth inputs. At
test time the network's predictions are used instead of the ground truth.
No algorithm (NA): We train the network using only the loss on the final outputs. This means
we change the softmax for predicting the next node to a binary cross entropy for sequential
algorithms. The next inputs are those predicted by the network, using Gumbel softmax to predict
the next node at each stage. The number of steps is given to the network in this scenario. Inspired
by [27], we sample 10 trajectories and use the best one for back-propagation when using a Gumbel
softmax. The idea is that only the best loss is of interest at evaluation time and not the average loss.
We show in the supplementary that this helps stabilise and improve training compared to taking the
mean of the trajectories.
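The selection step of this best-of-k scheme can be sketched as follows. This is a simplified illustration, not the paper's training code: it draws hard samples with the Gumbel-max trick instead of a differentiable Gumbel softmax, it computes no gradients, and `gumbel_sample`, `best_of_k`, `rollout`, and `final_loss` are hypothetical names with a toy loss.

```python
import math
import random

def gumbel_sample(logits, rng):
    """Draw an index from softmax(logits) using the Gumbel-max trick."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)              # avoid log(0)
        noisy.append(logit - math.log(-math.log(u)))
    return max(range(len(logits)), key=noisy.__getitem__)

def best_of_k(rollout_fn, loss_fn, k=10, seed=0):
    """Sample k trajectories and keep the one with the lowest final loss."""
    rng = random.Random(seed)
    trajectories = [rollout_fn(rng) for _ in range(k)]
    losses = [loss_fn(t) for t in trajectories]
    best = min(range(k), key=losses.__getitem__)
    return trajectories[best], losses[best], sum(losses) / k

# Toy rollout: three steps, each choosing a "next node" out of three candidates.
STEP_LOGITS = ((2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 2.0))
TARGET = [0, 1, 2]

def rollout(rng):
    return [gumbel_sample(step, rng) for step in STEP_LOGITS]

def final_loss(trajectory):
    # Loss on the final output only: number of wrongly chosen nodes.
    return sum(a != b for a, b in zip(trajectory, TARGET))

best_traj, best_loss, mean_loss = best_of_k(rollout, final_loss, k=10)
```

By construction the best loss is never larger than the mean loss over the sampled trajectories; in training, only the trajectory that achieves it would be used for back-propagation.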

4.3 Standard transfer methods

We try three main approaches:

Freeze weights: This is a standard transfer technique, where we use the learned weights of the
processor unit on a base algorithm B for learning a target algorithm T . In this set-up we freeze the
weights of the processor and only the encoder-decoder part of the architecture can be learned. We
assume that the algorithm B was learned with teacher forcing, while T is learned with no-algorithm.
We note that encoder-decoder weights cannot be reused because in general the input/output dimensions
may differ from algorithm to algorithm.
Fine-tune weights: This is the same as the previous technique except that we do not freeze the
processor weights and instead let them be changed by the gradient descent algorithm.
2-Processors: Again we assume we have access to the processor weights learned on a base algorithm
B. This setting uses two processors in parallel whose outputs we sum together. One of the processors
uses the learned processor weights and is frozen, while the other processor is randomly initialised
and free to learn the necessary changes. We suspect this will be the best transfer method as it
does not forget the information given by the base algorithm, but retains the flexibility to adapt the
processor.

4.4 Transfer via multi-task learning

A stronger inductive bias is to train the base algorithm B, for which we have access to intermediate
steps, together with the target algorithm T , for which we do not. This multi-task set-up highlights
the difference between the original Neural Executor architecture proposed in Veličković et al. [1]
and Neural Executor++ (see Fig. 1). The former simply concatenates the node features of B and T
and forces them to share the hidden embedding. The latter allows different encoder-decoders and
only shares the weights of the processor, i.e. each algorithm has its own hidden embedding. This
second approach encourages the network to execute the shared subroutine in the processor and the
individual subroutines in the encoder-decoder. The overall idea is that the base algorithm B serves as
an inductive bias as to how to structure the latent space and teach the processor how to evolve T .

4.5 Graphs

We study three kinds of graphs:

i) Erdos-Renyi with p = min(log₂|V| / |V|, 0.5), ii) Barabasi-Albert, iii) 2d-grid graphs.

Erdos-Renyi graphs are random graphs where each possible edge has probability p of being added
to the graph. Barabasi-Albert graphs are power-law graphs with a few highly connected nodes and
many dangling nodes. 2d-grid graphs are very regular graphs in an arbitrary 2d-grid shape. We chose
these 3 classes of graphs because they represent some of the major possible differences between
graphs. 2d-grid graphs are very regular and Veličković et al. [1] note that very regular graphs tend to
transfer poorly from or to random graphs. Erdos-Renyi graphs tend to be sparse, but highly likely to
be connected. Barabasi-Albert graphs tend to be quite dense graphs with shorter average path-length
than random graphs. As such these graph classes differ significantly from each other.
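As a small worked example of the Erdos-Renyi setting, the sketch below computes p = min(log₂|V| / |V|, 0.5) and samples a weighted graph with edge weights drawn uniformly from [0.2, 1.0], the range used for all graphs in this paper. The function names and the choice of an undirected edge dictionary are assumptions for illustration; this is not the authors' data generator.

```python
import math
import random

def erdos_renyi_p(n):
    """Edge probability p = min(log2(n) / n, 0.5) for a graph with n nodes."""
    return min(math.log2(n) / n, 0.5)

def sample_erdos_renyi(n, seed=0, low=0.2, high=1.0):
    """Sample an undirected Erdos-Renyi graph as {(u, v): weight} with u < v."""
    rng = random.Random(seed)
    p = erdos_renyi_p(n)
    weights = {}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:                  # each possible edge added with probability p
                weights[(u, v)] = rng.uniform(low, high)
    return weights

edges_20 = sample_erdos_renyi(20)
```

For the training size of 20 nodes this gives p ≈ 0.216, while for 100 nodes p ≈ 0.066, so the larger test graphs are considerably sparser.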

Table 1: Teacher forcing (seq.). NE++ performs worse on DIJKSTRA, showing the downside of
higher model capacity.
Model | #Nodes | DIJKSTRA Next node | DIJKSTRA Key | DIJKSTRA Predecessor | MOST RELIABLE Next node | MOST RELIABLE Key | MOST RELIABLE Predecessor
NE | 20 | 0.018 ± 0.004 | 0.0367 ± 0.01 | 0.005 ± 0.002 | 0.238 ± 0.05 | 0.0358 ± 0.02 | 0.053 ± 0.01
NE | 50 | 0.089 ± 0.009 | 0.569 ± 0.7 | 0.02 ± 0.02 | 0.557 ± 0.08 | 0.0697 ± 0.05 | 0.099 ± 0.01
NE | 100 | 0.341 ± 0.02 | 4.79 ± 6 | 0.064 ± 0.04 | 0.763 ± 0.07 | 0.0924 ± 0.06 | 0.167 ± 0.007
NE++ | 20 | 0.008 ± 0.003 | 0.0108 ± 0.004 | 0.003 ± 0.0008 | 0.174 ± 0.07 | 0.0264 ± 0.02 | 0.047 ± 0.03
NE++ | 50 | 0.35 ± 0.03 | 445 ± 600 | 0.211 ± 0.02 | 0.492 ± 0.2 | 0.0676 ± 0.05 | 0.112 ± 0.05
NE++ | 100 | 0.729 ± 0.01 | 1.62e10 ± 2e10 | 0.522 ± 0.07 | 0.699 ± 0.2 | 0.0906 ± 0.06 | 0.171 ± 0.06

Table 2: No algorithm (seq.). Size-generalisation is lacking without intermediate steps, especially on
the more difficult MOST RELIABLE.
Model | #Nodes | DIJKSTRA Key | DIJKSTRA Predecessor | MOST RELIABLE Key | MOST RELIABLE Predecessor
NE | 20 | 0.000104 ± 7e-05 | 0.023 ± 0.01 | 0.00829 ± 0.001 | 0.187 ± 0.06
NE | 50 | 9.71e5 ± 1e06 | 0.121 ± 0.1 | 20 ± 7 | 0.897 ± 0.08
NE | 100 | 3.34e16 ± 5e16 | 0.194 ± 0.2 | 1.32e6 ± 4e5 | 0.768 ± 0.05
NE++ | 20 | 6.4e-06 ± 3e-06 | 0.005 ± 0.003 | 0.000465 ± 6e-05 | 0.126 ± 0.06
NE++ | 50 | 1.81 ± 2 | 0.149 ± 0.06 | 125 ± 70 | 0.566 ± 0.06
NE++ | 100 | 8.54e3 ± 1e4 | 0.388 ± 0.05 | 2.8e13 ± 4e13 | 0.76 ± 0.05

For all graphs, we generate edge weights uniformly at random in [0.2, 1.0]; this range prevents
key values such as shortest path from becoming too extreme.⁶

5 Experiments

For all experiments we use 5,000 graphs of each type (Erdos-Renyi (ER), Barabasi-Albert (BA),
2d-Grids (2d-G)) with 20 nodes each. We train using ADAM [28] with a learning rate of 0.0005 and a
batch size of 64, and use early stopping with a patience of 10 to prevent overfitting. We test on graphs
of size 20, 50, and 100 nodes. The hidden embedding size is set to 32 except for NE++ for multi-task
experiments, where it is 16 to account for the additional expressivity of having several encoders.
Each experiment was executed on a V100 GPU in less than 5 hours for the longest experiment.
We measure the performance on each of the 3 graph types separately at evaluation and present the
average with standard deviation in the main paper. Large standard deviations may arise due to the
extreme difference between random graphs of type ER or BA versus 2d-G graphs.⁷

5.1 Metrics

Sequential algorithms: Predecessor (Pred.) error rate is the most important measure of whether a
task has been successfully completed, as it gives us the path predicted by the network. Next node (Next)
error rate measures whether the next node is the correct one to pick. Key accuracy measures whether
the key of the picked node is correct, measured in mean squared error. Next node is indicative of
whether the correct algorithm is being executed, while Pred. and Key primarily serve to indicate the
correctness of the solutions found. Lower is always better.
Parallel algorithms: Key accuracy measures the node features mean squared error for all algorithms
except BFS, where it is measured in accuracy as the node feature is a binary choice between 0 and 1.
Predecessor (Pred.) accuracy measures the accuracy of predicting the predecessor node.
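The two kinds of quantities reported in the tables, an error rate for discrete predictions and a mean squared error for keys, can be computed as in the following sketch. The function names are hypothetical and the per-graph averaging used in the paper is not reproduced.

```python
def error_rate(predicted, target):
    """Fraction of positions where the discrete prediction is wrong."""
    wrong = sum(p != t for p, t in zip(predicted, target))
    return wrong / len(target)

def mean_squared_error(predicted, target):
    """Mean squared difference between predicted and true keys."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

# Example: predecessors of four nodes, one of them wrong, and their keys.
pred_error = error_rate([0, 0, 1, 2], [0, 0, 1, 1])
key_error = mean_squared_error([0.0, 0.5, 1.0, 2.0], [0.0, 0.5, 0.5, 1.0])
```

Because the key error is unbounded while the error rates lie in [0, 1], a single diverging key prediction on a large graph can dominate the Key column, which is consistent with the very large values seen there.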

6 Results and Discussion

For the sequential algorithms we study transfer from PRIM to DIJKSTRA and from WIDEST PATH to
MOST RELIABLE PATH; for parallel algorithms we study transfer from BFS to BELLMAN-FORD
and from WIDEST PATH to MOST RELIABLE PATH.
⁶ All code to generate data and train models will be released upon acceptance with an MIT license.
⁷ See the Supplementary for a more detailed explanation.

Table 3: Transfer to no algorithm (seq.). Pre-trained on PRIM and WIDEST, respectively. Classic
transfer learning fails to provide size-generalisation and often performs worse than no transfer.
Model | #Nodes | DIJKSTRA Key | DIJKSTRA Predecessor | MOST RELIABLE Key | MOST RELIABLE Predecessor
NE Freeze | 20 | 0.014 ± 0.005 | 0.081 ± 0.03 | 0.0235 ± 0.003 | 0.248 ± 0.06
NE Freeze | 50 | 122 ± 200 | 0.597 ± 0.09 | 3.91e5 ± 3e4 | 0.864 ± 0.009
NE Freeze | 100 | 3.18e6 ± 4e6 | 0.607 ± 0.1 | 5.06e15 ± 7e14 | 0.761 ± 0.04
NE Fine-tune | 20 | 0.0021 ± 0.0009 | 0.05 ± 0.02 | 0.036 ± 0.005 | 0.227 ± 0.07
NE Fine-tune | 50 | 1.11e3 ± 1e3 | 0.241 ± 0.09 | 2e3 ± 200 | 0.636 ± 0.1
NE Fine-tune | 100 | 5.93e7 ± 8e7 | 0.388 ± 0.1 | 1.06e9 ± 1e8 | 0.709 ± 0.02
NE 2-Processor | 20 | 0.00136 ± 0.0005 | 0.06 ± 0.03 | 0.0163 ± 0.003 | 0.231 ± 0.08
NE 2-Processor | 50 | 14.2 ± 20 | 0.162 ± 0.06 | 205 ± 30 | 0.749 ± 0.04
NE 2-Processor | 100 | 1.04e4 ± 1e4 | 0.305 ± 0.07 | 1.22e10 ± 4e9 | 0.815 ± 0.03
NE++ Freeze | 20 | 0.00136 ± 0.001 | 0.063 ± 0.02 | 0.00687 ± 0.0008 | 0.199 ± 0.09
NE++ Freeze | 50 | 42.6 ± 50 | 0.841 ± 0.04 | 465 ± 100 | 0.58 ± 0.1
NE++ Freeze | 100 | 262 ± 400 | 0.895 ± 0.07 | 8.06e7 ± 3e7 | 0.672 ± 0.08
NE++ Fine-tune | 20 | 0.000414 ± 0.0003 | 0.034 ± 0.02 | 0.00669 ± 0.002 | 0.191 ± 0.07
NE++ Fine-tune | 50 | 13.5 ± 20 | 0.962 ± 0.03 | 1.79e5 ± 2e5 | 0.757 ± 0.05
NE++ Fine-tune | 100 | 2.51e4 ± 4e4 | 0.962 ± 0.04 | 8.17e14 ± 5e14 | 0.774 ± 0.05
NE++ 2-Processor | 20 | 0.00443 ± 0.002 | 0.06 ± 0.04 | 0.0022 ± 0.0003 | 0.154 ± 0.05
NE++ 2-Processor | 50 | 16.6 ± 20 | 0.356 ± 0.1 | 306 ± 70 | 0.644 ± 0.03
NE++ 2-Processor | 100 | 4.66e3 ± 6e3 | 0.779 ± 0.02 | 3.58e6 ± 9e5 | 0.791 ± 0.02

Table 4: Multi-task (seq.): Using PRIM and WIDEST as inductive bias, respectively. Multi-task
learning shows good generalisation, especially on the Key metric for DIJKSTRA and on Predecessor
for MOST RELIABLE.
Model | #Nodes | DIJKSTRA Key | DIJKSTRA Predecessor | MOST RELIABLE Key | MOST RELIABLE Predecessor
NE | 20 | 0.00362 ± 0.0005 | 0.042 ± 0.01 | 0.207 ± 0.03 | 0.452 ± 0.09
NE | 50 | 11.5 ± 2 | 0.134 ± 0.1 | 1.85 ± 0.5 | 0.501 ± 0.06
NE | 100 | 126 ± 30 | 0.303 ± 0.3 | 6.47 ± 3 | 0.597 ± 0.01
NE++ | 20 | 0.000178 ± 8e-05 | 0.019 ± 0.009 | 0.00279 ± 0.0004 | 0.166 ± 0.07
NE++ | 50 | 0.413 ± 0.4 | 0.161 ± 0.1 | 0.199 ± 0.3 | 0.185 ± 0.009
NE++ | 100 | 2.91 ± 3 | 0.282 ± 0.2 | 0.843 ± 1 | 0.267 ± 0.1

6.1 Sequential

The first experiments establish baselines in terms of achievable performance given the intermediate
steps and trained with teacher-forcing (Tab. 1). We run each algorithm separately. Next we establish
the performance in the no-algorithm setting (Tab. 2), i.e. what is achievable without intermediate
supervision.
Expressivity can harm systematic generalisation: Firstly, we note that the additional expressivity
of the NE++ (§ 4.1.2) seems to hurt systematic generalisation even with the large amount of data
available (Tab. 1), as we can see on DIJKSTRA. On MOST RELIABLE both do equally well on Key
and Pred., but looking at Next node we can see that NE++ does better in simulating the algorithm.
MOST RELIABLE is a non-linear task, so a non-linear encoder is expected to help. We also note that
as the graphs grow in size, the number of reachable nodes in the priority queue increases, making it
more likely we pick the wrong node without affecting the correctness of prediction.
Secondly, we note that in the NA setting (Tab. 2) the NE is able to solve the Pred. prediction quite
well up to 100 nodes, but clearly found an alternative way of reasoning as the key prediction is hugely
wrong for larger graphs. Note that the largest shortest distances will be found in 2d-grid graphs, where
they will be upper bounded by 51. Also, for MOST RELIABLE the drop in Pred. performance at 50
nodes is significantly more severe with less representation power.
Transfer yields little improvement: The two key experiments are the transfer setting (§ 4.3) and
the multi-task setting (§ 4.4). We hypothesised that the standard transfer experiments (fine-tune
and freeze) would not help systematic generalisation. None of the transfer methods (Tab. 3) help
generalise either task significantly. In fact, they harm systematic generalisation in terms of Pred.
prediction in all cases. The only benefit that can be observed is better generalisation on Key accuracy
indicating that the network outputs are less extreme. The best transfer method is 2-Processor as we
hypothesised in § 4.3, which improves Key prediction at the cost of harming Pred. accuracy.
Multi-task helps systematic generalisation: In the multi-task set-up (Tab. 4), several things occur:
the Key prediction generalises even better and is predicting in a reasonable range given the longest

8
Table 5: 2-Proc. transfer pre-trained on P RIM, D IJKSTRA, & DFS for sequential and BFS &
B ELLMAN -F ORD for parallel. Pre-training on several tasks does not improve classic transfer.
M OST RELIABLE ( SEQ ) M OST RELIABLE ( PAR )
Model #Nodes Key Predecessor Key Predecessor
20 0.00237 ± 0.0003 0.163 ± 0.06 0.0408 ± 0.009 0.227 ± 0.07
NE ++ 2-Proc. 50 62.9 ± 20 0.606 ± 0.05 0.161 ± 0.1 0.363 ± 0.04
100 1.76e6 ± 4e5 0.758 ± 0.07 2.68 ± 4 0.448 ± 0.08

Table 6: No algorithm (par.): For transfer we report the results of the best method (§ 4.3). Pre-trained
on BFS and WIDEST respectively. Reliance on intermediate steps is lower for this class of problems,
but multi-task transfer of knowledge is still beneficial in terms of size-generalisation.
                                   BELLMAN-FORD                        MOST RELIABLE PATH
Model                    #Nodes    Key                Predecessor      Key                 Predecessor
NE (NA)                  20        0.0182 ± 0.02      0.057 ± 0.02     0.018 ± 0.002       0.226 ± 0.06
                         50        59 ± 80            0.164 ± 0.1      0.147 ± 0.2         0.327 ± 0.02
                         100       1.98e6 ± 3e6       0.261 ± 0.2      10.4 ± 10           0.435 ± 0.05
NE++ (NA)                20        0.00253 ± 0.002    0.028 ± 0.01     0.00957 ± 0.004     0.145 ± 0.06
                         50        0.226 ± 0.3        0.057 ± 0.01     0.0367 ± 0.04       0.171 ± 0.03
                         100       196 ± 300          0.095 ± 0.04     120 ± 200           0.217 ± 0.02
NE (Transfer Fine-tune)  20        0.0386 ± 0.02      0.072 ± 0.03     0.0221 ± 0.006      0.237 ± 0.06
                         50        25 ± 40            0.162 ± 0.05     0.332 ± 0.4         0.331 ± 0.02
                         100       1.72e5 ± 2e5       0.242 ± 0.1      230 ± 300           0.402 ± 0.03
NE++ (Transfer 2-Proc.)  20        0.0223 ± 0.02      0.062 ± 0.03     0.0131 ± 0.003      0.196 ± 0.07
                         50        0.666 ± 0.7        0.105 ± 0.005    3.04 ± 4            0.313 ± 0.05
                         100       10.8 ± 10          0.168 ± 0.05     579 ± 800           0.411 ± 0.1
NE (Multi-task)          20        0.0154 ± 0.02      0.034 ± 0.01     0.173 ± 0.1         0.346 ± 0.02
                         50        6.22 ± 9           0.051 ± 0.004    0.407 ± 0.4         0.362 ± 0.03
                         100       1.53e3 ± 1e3       0.096 ± 0.02     0.615 ± 0.6         0.376 ± 0.05
NE++ (Multi-task)        20        0.00353 ± 0.004    0.023 ± 0.01     0.00672 ± 0.0005    0.153 ± 0.06
                         50        0.0141 ± 0.02      0.03 ± 0.006     0.00805 ± 0.002     0.182 ± 0.01
                         100       8.84 ± 10          0.13 ± 0.1       0.00971 ± 0.002     0.212 ± 0.02

shortest path in graphs of size 100. Further, NE in this setting has similar Pred. accuracy on DIJKSTRA
compared to NA, while NE++ benefitted from the inductive bias in terms of its Pred. accuracy on graphs
of size 100 for DIJKSTRA. The results on MOST RELIABLE are significantly improved and NE++
achieves good levels of systematic generalisation in solving the task. Interestingly, NE worsens in its
performance on 20 nodes, but maintains a stable Pred. accuracy on larger graphs, demonstrating that
the inductive bias from WIDEST prevents overfitting in distribution and improves the performance on
larger graphs. Overall, the results validate our initial hypothesis that multi-task learning is the correct
approach to transfer knowledge.
Trying to extract shared subroutines does not help transfer: Finally, we study to what extent
the models are able to separate the common shared subroutines and the subroutines individual to
each algorithm by training multiple algorithms with TF in a multi-task set-up together (Tab. 5). If
the processor successfully captures only the shared subroutines, then we might expect the transfer
results to improve. We can see in Tab. 5 that multi-task pre-training does not significantly improve
results and that multi-task learning with the target algorithm is still the best approach. However, one
alternative explanation is that given a good processor, the encoder struggles to learn the expected
encoding by the processor and thus performs poorly.

6.2 Parallel

Parallel algorithms are significantly easier than sequential ones due to their much shorter length
and the lack of a central data-structure that needs to be learned to execute. This can be observed in
the much higher performance in the NA setting (Tab. 17). Interestingly, it seems that in this setting
expressivity was helpful for systematic generalisation, even in the BELLMAN-FORD setting.
Transfer harms performance: In Tab. 17 we show only the best transfer result, but as we can see,
this actually harms Pred. accuracy for NE++ for both algorithms, while producing roughly the same
result for NE. In both cases, the results suggest that random initialisations are better than transfer
ones. We think this may be because the shared algorithmic knowledge of parallel algorithms is already
inherently captured by GNNs, as they apply message functions in parallel to each edge and thus
only need to learn the relax_edge function. Pre-training on several algorithms did not help (Tab. 5).

Multi-task only helps Key accuracy: Similarly to sequential reasoning, multi-task vastly outperforms
transfer techniques and significantly improves Key prediction compared to NA, while keeping
Pred. similar. Contrary to sequential reasoning, the Pred. prediction is comparable between NA and
multi-task. We think this is due to the shorter execution length providing less of an inductive bias for
the target algorithm and the strong inductive bias of GNN architectures towards parallel algorithms.
However, the access to more stable gradients due to the multi-task learning approach seems to help
learning to some extent, as shown by the improved Key predictions. Further, we observe that when the
model has less capacity (NE on MOST RELIABLE PATH) multi-task is still able to improve systematic
generalisation on Pred. at the cost of slightly worse in-distribution (20 nodes) performance.

6.3 Why does transfer learning fail?

Transfer via freezing and/or fine-tuning clearly does not work, as demonstrated by the results in Tab. 3.
The fact that having two processors, one frozen and one to fine-tune, also does not help transfer is
telling, because it means that neither the loss of information during fine-tuning nor the rigidity of the
network can be at fault. In other words, fine-tuning has the disadvantage that we lose the original
weights and hence potentially lose information, while freezing weights significantly limits the weights
that can be changed and thus makes it harder to fit the data. However, the 2-processor approach suffers
from neither problem and yet still does not work (see footnote 8). Thus, we hypothesise that the reason
why transfer fails to work is that the initial weights of a similar algorithm are not near a good (as in
generalising) minimum for the target algorithm; in fact, the minimum is often worse than the minimum
found from randomly initialised weights (see Tab. 2).

6.4 Why does multi-task fare better?

Multi-task, on the other hand, does not rely on the weights being near a good minimum, but instead
forces the processor weights to be the same for both algorithms. This is a very different way to use the
base algorithm as an inductive bias. This inductive bias is successful because the final weights are from
a minimum that systematically generalises (on at least one of the algorithms) with the additional
constraint that it performs well on the second, target algorithm. For transfer, although the initial
weights might systematically generalise on the original task, there is no guarantee that the final
weights stem from a minimum that systematically generalises.
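
As a minimal sketch of the weight-sharing constraint described here, the toy model below uses one processor matrix for both tasks while each task keeps its own encoder and decoder. All names, shapes, and the single-linear-map modules are illustrative assumptions, not the architecture used in the experiments, and the toy loss does not include supervision on intermediate steps.

```python
import numpy as np

def forward(x, enc, proc, dec):
    """Encoder -> shared processor -> decoder (each a single linear map here)."""
    return np.maximum(x @ enc, 0.0) @ proc @ dec

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def multitask_loss(params, batches):
    """Sum of per-task losses; every task reads the same params['proc']."""
    total = 0.0
    for task, (x, y) in batches.items():
        heads = params["tasks"][task]
        total += mse(forward(x, heads["enc"], params["proc"], heads["dec"]), y)
    return total

rng = np.random.default_rng(0)
dim_in, dim_hid, dim_out = 3, 4, 1  # hypothetical sizes
params = {
    "proc": rng.normal(size=(dim_hid, dim_hid)),  # shared between both tasks
    "tasks": {
        name: {"enc": rng.normal(size=(dim_in, dim_hid)),
               "dec": rng.normal(size=(dim_hid, dim_out))}
        for name in ("base", "target")  # task-specific encoder/decoder
    },
}
batches = {name: (rng.normal(size=(8, dim_in)), rng.normal(size=(8, dim_out)))
           for name in ("base", "target")}  # random stand-in data
print(multitask_loss(params, batches))
```

Because the summed loss depends on a single `proc`, any minimum found for it must fit both tasks at once, whereas in transfer the pre-trained processor only supplies a starting point for optimising the target loss.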

7 Conclusion

We set out to investigate how systematic generalisation could be improved on algorithmic tasks when
the intermediate steps of the algorithm are not available. Inspired by the success of transfer learning
in domains such as CV and NLP, we investigated its applicability to learning graph algorithms in this
setting. We showed that standard transfer learning is inadequate to leverage algorithmic knowledge
learned from intermediate steps to new algorithmic tasks. Further, we showed how multi-task learning
can enable the successful transfer of inductive biases learned from other algorithms when intermediate
steps are available, significantly improving systematic generalisation. The results are especially
strong in the more difficult sequential reasoning domain. Moreover, we conclude that expressivity
can hurt systematic generalisation if the task is too easy and intermediate supervision is available.
This should be taken into account when choosing the model. These disadvantages disappeared when
trying to learn algorithmic reasoning without intermediate steps in our multi-task set-up; in this
setting NE++ always outperforms the simpler architecture. Both architectures can achieve systematic
generalisation. Limitations of our work are that the results are specific to algorithms on static graphs.
Furthermore, when the number of execution steps increases faster than linearly in the number of nodes,
results are likely to worsen significantly.
This paper’s contributions are fundamental in nature and thus the societal impact of this paper is
low and there are no associated ethical risks. Any benefits or risks stem from further advances in
reasoning systems that may in some form be based on this work.

Footnote 8: Experiment 1 (in the Supplementary material) shows that the information from a processor
can be used and recovered.

Acknowledgments and Disclosure of Funding
We would like to thank Meng Qu, Zhaocheng Zhu, and Zuobai Zhang for proofreading the manuscript
prior to submission.
This project is supported by the Natural Sciences and Engineering Research Council (NSERC)
Discovery Grant, the Canada CIFAR AI Chair Program, collaboration grants between Microsoft
Research and Mila, Samsung Electronics Co., Ltd., Amazon Faculty Research Award, Tencent AI
Lab Rhino-Bird Gift Fund and an NRC Collaborative R&D Project (AI4D-CORE-06). This project
was also partially funded by IVADO Fundamental Research Project grant PRF-2019-3583139727.
Petar Veličković is a Research Scientist at DeepMind.

References
[1] Petar Veličković, Rex Ying, Matilde Padovano, Raia Hadsell, and Charles Blundell. Neural
execution of graph algorithms. arXiv preprint arXiv:1910.10593, 2019.
[2] Dumitru Erhan, Aaron Courville, Yoshua Bengio, and Pascal Vincent. Why does unsupervised
pre-training help deep learning? In Proceedings of the thirteenth international conference
on artificial intelligence and statistics, pages 201–208. JMLR Workshop and Conference
Proceedings, 2010.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
2018.
[4] TB Brown, B Mann, N Ryder, M Subbiah, J Kaplan, P Dhariwal, A Neelakantan, P Shyam,
G Sastry, A Askell, et al. Language models are few-shot learners. arXiv preprint
arXiv:2005.14165, 2020.
[5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly,
Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image
recognition at scale, 2020.
[6] Petar Veličković, Lars Buesing, Matthew C Overlan, Razvan Pascanu, Oriol Vinyals, and
Charles Blundell. Pointer graph networks. arXiv preprint arXiv:2006.06380, 2020.
[7] Vojtěch Jarník. On a certain problem of minimization. Práce moravskè přírodovědecké
společnosti 6, fasc. 4, pages 57–63, 1930. URL http://hdl.handle.net/10338.dmlcz/500726.
[8] R.C. Prim. Shortest connection networks and some generalizations. Bell System Technical
Journal, pages 1389–1401, 1957. URL https://archive.org/details/bstj36-6-1389.
[9] E.W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik,
pages 269–271, 1959. URL https://doi.org/10.1007/BF01386390.
[10] Andreea Deac, Petar Veličković, Ognjen Milinković, Pierre-Luc Bacon, Jian Tang, and Mladen
Nikolić. Xlvin: executed latent value iteration nets, 2020.
[11] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural
networks. arXiv preprint arXiv:1511.05493, 2015.
[12] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional
networks. arXiv preprint arXiv:1609.02907, 2016.
[13] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural
message passing for quantum chemistry. In International Conference on Machine Learning,
pages 1263–1272. PMLR, 2017.
[14] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua
Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[15] Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S. Du, Ken ichi Kawarabayashi, and Stefanie
Jegelka. What can neural networks reason about?, 2020.
[16] Wojciech Zaremba and Ilya Sutskever. Learning to execute, 2015.
[17] Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms, 2016.
[18] Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines,
2016.
[19] Scott Reed and Nando de Freitas. Neural programmer-interpreters, 2016.

[20] Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Theophane Weber,
Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap. Relational recurrent
neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi,
and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31.
Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/
file/e2eabaf96372e20a9e3d4b5f83723a61-Paper.pdf.
[21] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst.
Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine,
34(4):18–42, Jul 2017. ISSN 1558-0792. doi: 10.1109/msp.2017.2693418. URL http:
//dx.doi.org/10.1109/MSP.2017.2693418.
[22] William L. Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods
and applications, 2018.
[23] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zam-
baldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner,
Caglar Gulcehre, Francis Song, Andrew Ballard, Justin Gilmer, George Dahl, Ashish Vaswani,
Kelsey Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra,
Pushmeet Kohli, Matt Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational
inductive biases, deep learning, and graph networks, 2018.
[24] Yujun Yan, Kevin Swersky, Danai Koutra, Parthasarathy Ranganathan, and Milad Hashemi.
Neural execution engines: Learning to execute subroutines, 2020.
[25] S Bozinovski and A Fulgosi. The influence of pattern similarity and transfer learning upon
training of a base perceptron b2. In Proceedings of Symposium Informatica, pages 3–121, 1976.
[26] S Bozinovski. Teaching space: A representation concept for adaptive pattern classification.
Technical report, COINS Technical Report, University of Massachusetts at Amherst, 1981.
[27] Brenden K Petersen, Mikel Landajuela Larma, Terrell N. Mundhenk, Claudio Prata Santiago,
Soo Kyung Kim, and Joanne Taery Kim. Deep symbolic regression: Recovering mathematical
expressions from data via risk-seeking policy gradients. In International Conference on Learning
Representations, 2021. URL https://openreview.net/forum?id=m5Qsh0kBQG.
[28] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.

A Appendix

A.1 Pseudo-code

In this section we give the pseudo-code for the initialise_nodes and relax_edge functions for
all the algorithms. Parallel means that they use the framework in Alg. 1, and Sequential means that
they use the framework in Alg. 2.

Algorithm 3 BFS (Parallel)

function INITIALISE_NODES(G.vertices, i)
    for v ∈ G.vertices do
        if v = i then
            v.key ← 1
            v.pred ← v
        else
            v.key ← 0
            v.pred ← ⊥
        end if
    end for
end function
function RELAX_EDGE(u, v, w)
    if v.key = 0 ∧ u.key = 1 then
        v.key ← 1
        v.pred ← u
    end if
end function

Algorithm 4 DIJKSTRA (Sequential)

function INITIALISE_NODES(G.vertices, i)
    for v ∈ G.vertices do
        if v = i then
            v.key ← 0
            v.pred ← v
        else
            v.key ← ∞
            v.pred ← ⊥
        end if
    end for
end function
function RELAX_EDGE(u, v, w)
    if v.key > u.key + w(u, v) then
        v.key ← u.key + w(u, v)
        v.pred ← u
    end if
end function

Algorithm 5 BELLMAN-FORD (Parallel)

function INITIALISE_NODES(G.vertices, i)
    for v ∈ G.vertices do
        if v = i then
            v.key ← 0
            v.pred ← v
        else
            v.key ← ∞
            v.pred ← ⊥
        end if
    end for
end function
function RELAX_EDGE(u, v, w)
    if v.key > u.key + w(u, v) then
        v.key ← u.key + w(u, v)
        v.pred ← u
    end if
end function

Algorithm 6 PRIM (Sequential)

function INITIALISE_NODES(G.vertices, i)
    for v ∈ G.vertices do
        if v = i then
            v.key ← 0
            v.pred ← v
        else
            v.key ← ∞
            v.pred ← ⊥
        end if
    end for
end function
function RELAX_EDGE(u, v, w)
    if v.pred == ⊥ and v.key > w(u, v) then
        v.key ← w(u, v)
        v.pred ← u
    end if
end function

Algorithm 7 WIDEST PATH (Parallel)

function INITIALISE_NODES(G.vertices, i)
    for v ∈ G.vertices do
        if v = i then
            v.key ← ∞
            v.pred ← v
        else
            v.key ← 0
            v.pred ← ⊥
        end if
    end for
end function
function RELAX_EDGE(u, v, w)
    if v.key < min(u.key, w(u, v)) then
        v.key ← min(u.key, w(u, v))
        v.pred ← u
    end if
end function

Algorithm 8 WIDEST PATH (Sequential)

function INITIALISE_NODES(G.vertices, i)
    for v ∈ G.vertices do
        if v = i then
            v.key ← ∞
            v.pred ← v
        else
            v.key ← 0
            v.pred ← ⊥
        end if
    end for
end function
function RELAX_EDGE(u, v, w)
    if v.key < min(u.key, w(u, v)) then
        v.key ← min(u.key, w(u, v))
        v.pred ← u
    end if
end function

Algorithm 9 MOST RELIABLE PATH (Parallel)

function INITIALISE_NODES(G.vertices, i)
    for v ∈ G.vertices do
        if v = i then
            v.key ← 1
            v.pred ← v
        else
            v.key ← 0
            v.pred ← ⊥
        end if
    end for
end function
function RELAX_EDGE(u, v, w)
    if v.key < u.key × w(u, v) then
        v.key ← u.key × w(u, v)
        v.pred ← u
    end if
end function

Algorithm 10 MOST RELIABLE PATH (Sequential)

function INITIALISE_NODES(G.vertices, i)
    for v ∈ G.vertices do
        if v = i then
            v.key ← 1
            v.pred ← v
        else
            v.key ← 0
            v.pred ← ⊥
        end if
    end for
end function
function RELAX_EDGE(u, v, w)
    if v.key < u.key × w(u, v) then
        v.key ← u.key × w(u, v)
        v.pred ← u
    end if
end function

Algorithm 11 DEPTH-FIRST SEARCH (DFS) (Sequential)

function INITIALISE_NODES(G.vertices, i)
    for v ∈ G.vertices do
        if v = i then
            v.key ← |G.vertices|
            v.pred ← v
        else
            v.key ← ∞
            v.pred ← ⊥
        end if
    end for
end function
function RELAX_EDGE(u, v, w)
    if v.key = ∞ then
        v.key ← u.key − 1
        v.pred ← u
    end if
end function
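
To make the relaxation rules above concrete, the following sketch implements the BELLMAN-FORD, WIDEST PATH, and MOST RELIABLE PATH versions of relax_edge in Python. The driver `run_edge_sweeps` is an assumption made for illustration: it sweeps over all directed edges in place until nothing changes, and it is not claimed to match the frameworks in Alg. 1 or Alg. 2. The example graph and all function names are hypothetical.

```python
import math

INF = math.inf

def init_nodes(n, source, source_key, other_key):
    """Return (key, pred); None plays the role of the undefined predecessor."""
    key = [other_key] * n
    pred = [None] * n
    key[source] = source_key
    pred[source] = source
    return key, pred

def relax_bellman_ford(key, pred, u, v, w):
    if key[v] > key[u] + w:
        key[v], pred[v] = key[u] + w, u
        return True
    return False

def relax_widest(key, pred, u, v, w):
    if key[v] < min(key[u], w):
        key[v], pred[v] = min(key[u], w), u
        return True
    return False

def relax_most_reliable(key, pred, u, v, w):
    if key[v] < key[u] * w:
        key[v], pred[v] = key[u] * w, u
        return True
    return False

def run_edge_sweeps(n, edges, source, relax, source_key, other_key):
    """Assumed driver: relax every directed edge, for at most n - 1 sweeps."""
    key, pred = init_nodes(n, source, source_key, other_key)
    for _ in range(n - 1):
        changed = False
        for u, v, w in edges:
            changed = relax(key, pred, u, v, w) or changed
        if not changed:
            break
    return key, pred

# Hypothetical directed graph given as (u, v, weight) triples.
edges = [(0, 1, 0.5), (1, 2, 0.5), (0, 2, 0.125), (2, 3, 0.75)]
print(run_edge_sweeps(4, edges, 0, relax_bellman_ford, 0.0, INF))
print(run_edge_sweeps(4, edges, 0, relax_widest, INF, 0.0))
print(run_edge_sweeps(4, edges, 0, relax_most_reliable, 1.0, 0.0))
```

The three rules differ only in how a candidate key is formed from u.key and the edge weight (sum, minimum, product) and in the direction of the comparison, which is what makes them natural candidates for sharing a processor.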

A.2 Mean of trajectories

In this section we verify the choice of taking the maximum of several trajectories, as described in
§ 4.2. The results are in Tab. 7 and confirm our hypothesis.

Table 7: No algorithm (seq.). We take the mean of 10 trajectories instead of the max.
                  DIJKSTRA                            MOST RELIABLE
Model   #Nodes    Key                Predecessor      Key                 Predecessor
NE      20        6.22e-5 ± 5e-5     0.014 ± 0.01     0.00279 ± 0.0004    0.158 ± 0.06
        50        1.13e6 ± 2e6       0.121 ± 0.08     11 ± 5              0.615 ± 0.1
        100       1.39e17 ± 2e17     0.237 ± 0.2      3.31e5 ± 2e5        0.658 ± 0.03
NE++    20        0.00221 ± 0.002    0.026 ± 0.01     0.000879 ± 8e-5     0.121 ± 0.05
        50        13.3 ± 20          0.257 ± 0.06     0.232 ± 0.1         0.503 ± 0.01
        100       1.38e6 ± 1e6       0.629 ± 0.1      0.378 ± 0.2         0.576 ± 0.02

A.3 Sequential: Breakdown tables

In this section we give the results for each graph type separately for the sequential setting as well as
adding termination accuracy for the teacher forcing setting.
Termination accuracy: We measure termination accuracy according to the following formula:
$$\mathrm{Term} = 1 - \frac{|T_{pred} - T_{true}|}{T_{true}}, \qquad (1)$$
where $T_{true}$ is the correct integer last step and $T_{pred}$ the predicted last step. The theoretically
possible range is $(-\infty, 1]$: we reach 1 only when predicting the correct step, and the value
decreases, possibly becoming negative, as the prediction moves further away from the correct last step
(negative values can only arise by terminating much later than is correct). In practice, the range is
limited because we run the network for a maximum number of steps equal to the number of nodes in
the graph.
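
As a small worked example of Eq. (1), the function below computes the termination accuracy for a predicted and a true last step. The function name and the example step counts are illustrative assumptions.

```python
def termination_accuracy(t_pred: int, t_true: int) -> float:
    """Term = 1 - |T_pred - T_true| / T_true, following Eq. (1)."""
    if t_true <= 0:
        raise ValueError("t_true must be a positive step index")
    return 1.0 - abs(t_pred - t_true) / t_true

print(termination_accuracy(10, 10))  # correct step: 1.0
print(termination_accuracy(5, 10))   # terminating too early: 0.5
print(termination_accuracy(25, 10))  # terminating far too late: -0.5
```

The score becomes negative only when the predicted last step exceeds twice the true last step, which is why negative values correspond to terminating much too late.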

Table 8: Teacher forcing (seq.).
D IJKSTRA

Next Key Pred. Term.

Model #Nodes
2d-G ER BA 2d-G ER BA 2d-G ER BA 2d-G ER BA

20 0.014 0.016 0.023 0.0492 0.0358 0.0252 0.003 0.007 0.004 0.997 0.993 0.996
NE 50 0.101 0.085 0.079 1.49 0.127 0.0903 0.042 0.008 0.01 0.962 0.993 0.994
100 0.311 0.367 0.344 13.4 0.556 0.373 0.124 0.041 0.026 0.921 0.966 0.984
20 0.004 0.01 0.01 0.00692 0.0159 0.00955 0.002 0.004 0.003 0.998 0.996 0.997
NE++ 50 0.311 0.37 0.369 1.33e3 0.919 0.657 0.238 0.202 0.194 0.881 0.937 0.952
100 0.75 0.72 0.718 4.87e10 3.61e4 1.86e4 0.621 0.489 0.456 0.67 0.922 0.936

M OST RELIABLE

Next Key Pred. Term.

Model #Nodes
2d-G ER BA 2d-G ER BA 2d-G ER BA 2d-G ER BA

20 0.079 0.189 0.253 0.00376 0.0252 0.0502 0.015 0.049 0.078 0.985 0.951 0.922
NE 50 0.279 0.561 0.635 0.00605 0.0704 0.126 0.038 0.133 0.165 0.964 0.865 0.833
100 0.485 0.786 0.826 0.00549 0.111 0.155 0.083 0.206 0.225 0.924 0.789 0.773
20 0.173 0.258 0.284 0.0116 0.0337 0.0621 0.037 0.058 0.064 0.963 0.942 0.936
NE++ 50 0.45 0.612 0.609 0.0114 0.0762 0.122 0.08 0.107 0.111 0.926 0.894 0.889
100 0.661 0.817 0.81 0.0163 0.109 0.152 0.177 0.165 0.16 0.829 0.833 0.842

Table 9: No algorithm (seq.).

D IJKSTRA M OST RELIABLE

Key Pred. Key Pred.

Model #Nodes
2d-G ER BA 2d-G ER BA 2d-G ER BA 2d-G ER BA

20 0.000208 5.14e-5 5.29e-5 0.01 0.018 0.04 0.00923 0.00912 0.00651 0.1 0.209 0.253
NE 50 2.91e6 5.85 6.33 0.289 0.03 0.044 12.3 18.4 29.4 0.788 0.948 0.955
100 1e17 9700 9360 0.485 0.039 0.058 1.24e6 912000 1.8e6 0.694 0.789 0.82
20 1.07e-5 4.32e-6 4.2e-6 0.001 0.004 0.009 0.000431 0.000556 0.000409 0.042 0.149 0.188
NE++ 50 4.83 0.369 0.236 0.094 0.123 0.229 227 75.2 74.1 0.495 0.56 0.643
100 24900 468 221 0.34 0.365 0.459 8.28e13 7.51e11 4.5e11 0.692 0.806 0.781

Table 10: Multi-task (seq.) Trained with P RIM and W IDEST respectively.
D IJKSTRA M OST RELIABLE

Key Pred. Key Pred.

Model #Nodes
2d-G ER BA 2d-G ER BA 2d-G ER BA 2d-G ER BA

20 0.00436 0.00319 0.00332 0.028 0.038 0.06 0.254 0.18 0.188 0.342 0.444 0.571
NE 50 14.4 10.2 10 0.303 0.045 0.053 1.19 2.3 2.05 0.422 0.528 0.552
100 88.6 141 147 0.733 0.102 0.073 2.24 9.33 7.83 0.609 0.598 0.584
20 0.000267 0.000195 7.37e-5 0.01 0.016 0.031 0.00255 0.00246 0.00337 0.076 0.178 0.244
NE++ 50 0.372 0.000154 0.867 0.294 0.027 0.163 0.592 0.00203 0.00151 0.177 0.18 0.198
100 6.94 0.301 1.48 0.493 0.109 0.245 2.53 0.00148 0.00233 0.432 0.184 0.184

Table 11: No algorithm (par.) Pre-trained on P RIM and W IDEST respectively.

D IJKSTRA M OST RELIABLE

Key Pred. Key Pred.

Model #Nodes
2d-G ER BA 2d-G ER BA 2d-G ER BA 2d-G ER BA

20 0.0204 0.0119 0.00982 0.048 0.073 0.122 0.021 0.0212 0.0282 0.171 0.25 0.322
NE Freeze 50 341 16.8 7.72 0.466 0.639 0.687 3.47e5 4.07e5 4.2e5 0.876 0.862 0.856
100 9.49e6 4e4 2.03e4 0.793 0.533 0.494 4.05e15 5.36e15 5.77e15 0.82 0.728 0.734
20 0.00341 0.00159 0.0013 0.022 0.049 0.08 0.0299 0.0363 0.0418 0.143 0.23 0.307
NE Fine-tune 50 2.91e3 245 174 0.366 0.157 0.2 1.72e3 2.11e3 2.17e3 0.47 0.689 0.749
100 1.77e8 6.98e5 4.34e5 0.557 0.289 0.317 9.01e8 1.15e9 1.13e9 0.716 0.688 0.724
20 0.00203 0.00114 0.000919 0.023 0.063 0.093 0.0124 0.0157 0.0207 0.137 0.226 0.331
NE 2-Proc. 50 36.2 3.86 2.51 0.096 0.156 0.235 175 240 198 0.786 0.698 0.763
100 3.07e4 225 339 0.205 0.356 0.355 7.27e9 1.34e10 1.59e10 0.769 0.824 0.851

20 0.00269 0.000876 0.000516 0.035 0.066 0.089 0.00641 0.00619 0.008 0.091 0.202 0.303
NE++ Freeze 50 115 6.53 5.9 0.802 0.894 0.828 290 549 557 0.415 0.649 0.676
100 782 2.87 1.41 0.8 0.941 0.946 3.25e7 1.03e8 1.06e8 0.562 0.729 0.725
20 0.000788 0.000284 0.00017 0.011 0.034 0.057 0.00823 0.0041 0.00772 0.103 0.187 0.284
NE++ Fine-tune 50 38 1.42 0.955 0.927 0.98 0.98 397000 6.90e4 7.12e4 0.694 0.773 0.803
100 7.54e4 1.6 1.26 0.908 0.99 0.99 1.49e15 4.83e14 4.76e14 0.705 0.824 0.794
20 0.00694 0.00375 0.0026 0.021 0.05 0.109 0.00243 0.00235 0.00183 0.088 0.172 0.201
NE++ 2-Proc. 50 47.8 0.546 1.49 0.508 0.256 0.306 404 290 223 0.61 0.64 0.681
100 1.35e4 119 383 0.764 0.808 0.764 2.4e6 4.44e6 3.88e6 0.771 0.808 0.795

A.4 Parallel: Breakdown tables

In this section we give the results for each graph type separately for the parallel setting as well as
adding termination accuracy for the teacher forcing setting.

Table 12: Teacher forcing (par.).

B ELLMAN -F ORD

Key Pred. Term.

Model #Nodes
2d-G ER BA 2d-G ER BA 2d-G ER BA

20 0.194 0.0416 0.0131 0.215 0.175 0.167 0.551 0.735 0.733

NE 50 0.829 0.0533 0.0239 0.212 0.232 0.235 0.492 0.708 0.708
100 1.75 0.104 0.0349 0.22 0.247 0.291 0.183 0.644 0.703
20 0.478 0.0883 0.0324 0.312 0.244 0.224 0.603 0.833 0.854
NE++ 50 3.75 0.111 0.0602 0.386 0.348 0.319 0.373 0.850 0.850
100 21.7 0.201 0.0841 0.443 0.391 0.393 0.202 0.783 0.876

M OST RELIABLE

Key Pred. Term.

Model #Nodes
2d-G ER BA 2d-G ER BA 2d-G ER BA

20 0.123 0.0533 0.0251 0.22 0.243 0.306 0.766 0.720 0.781

NE 50 0.311 0.0594 0.034 0.327 0.287 0.31 0.122 0.24 0.818
100 1.05 0.0657 0.0386 0.38 0.321 0.312 −1.68 −1.61 −0.0624
20 0.0851 0.0404 0.0245 0.269 0.264 0.315 0.796 0.721 0.637
NE++ 50 0.185 0.0423 0.0325 0.402 0.308 0.323 0.460 0.835 0.828
100 3.04 0.0465 0.0339 0.477 0.348 0.327 0.309 0.851 0.883

Table 13: No algorithm (par.).

D IJKSTRA M OST RELIABLE

Key Pred. Key Pred.

Model #Nodes
2d-G ER BA 2d-G ER BA 2d-G ER BA 2d-G ER BA

20 0.0145 0.0382 0.00181 0.03 0.061 0.079 0.0204 0.0187 0.0147 0.147 0.233 0.297
NE 50 177 0.184 0.00364 0.304 0.096 0.091 0.4 0.0201 0.0212 0.298 0.331 0.352
100 5.93e6 1.21 0.00474 0.54 0.126 0.116 31 0.0262 0.0295 0.5 0.403 0.403
20 0.00528 0.00152 0.000793 0.012 0.026 0.046 0.0147 0.00904 0.00498 0.064 0.157 0.214
NE++ 50 0.671 0.00489 0.00155 0.072 0.045 0.055 0.0955 0.00768 0.00675 0.133 0.186 0.193
100 589 0.0192 0.00215 0.148 0.068 0.068 361 0.0112 0.0108 0.24 0.21 0.202

Table 14: No algorithm (par.) Pre-trained on BFS and W IDEST respectively.

B ELLMAN -F ORD M OST RELIABLE

Key Pred. Key Pred.

Model #Nodes
2d-G ER BA 2d-G ER BA 2d-G ER BA 2d-G ER BA

20 0.253 0.246 0.00821 0.078 0.103 0.151 0.577 0.0559 0.0599 0.229 0.305 0.431
NE Freeze 50 8.28e5 1.18 0.0176 0.319 0.169 0.187 2.76e4 0.294 0.197 0.378 0.484 0.547
100 1.53e17 5.36 0.0272 0.434 0.221 0.222 6.73e10 1.61 1.15 0.508 0.573 0.611
20 0.0635 0.0475 0.00463 0.038 0.078 0.101 0.029 0.0224 0.0148 0.164 0.237 0.311
NE Fine-tune 50 74.8 0.271 0.00829 0.231 0.124 0.13 0.95 0.0206 0.0262 0.318 0.318 0.358
100 515000 0.931 0.013 0.413 0.162 0.151 691 0.034 0.0364 0.437 0.374 0.394
20 0.128 0.226 0.00439 0.051 0.097 0.14 0.0326 0.0243 0.0157 0.138 0.223 0.317
NE 2-Proc. 50 1.68e3 1.02 0.00731 0.615 0.143 0.165 11.2 0.0195 0.0231 0.284 0.327 0.38
100 4.99e6 1.86 0.00816 0.878 0.196 0.194 1.44e6 0.0403 0.0474 0.43 0.428 0.435
20 0.0281 0.0573 0.00251 0.028 0.082 0.095 0.0196 0.0114 0.00888 0.15 0.236 0.362
NE++ Freeze 50 172 0.181 0.00515 0.383 0.123 0.114 1.87 0.0139 0.0153 0.259 0.315 0.385
100 1.45e7 1.71 0.00734 0.609 0.171 0.141 19800 0.0282 0.0233 0.384 0.394 0.406
20 0.258 0.246 0.00939 0.078 0.139 0.203 0.0216 0.012 0.00804 0.132 0.202 0.265
NE++ Fine-tune 50 5.73e3 1.24 0.0223 0.494 0.198 0.224 8.49 0.00868 0.00788 0.227 0.248 0.256
100 3.86e9 7.07 0.0267 0.724 0.288 0.269 12500 0.0115 0.00801 0.394 0.283 0.274
20 0.019 0.0444 0.00353 0.028 0.068 0.09 0.0168 0.0138 0.00868 0.113 0.194 0.281
NE++ 2-Proc. 50 1.69 0.302 0.00598 0.11 0.098 0.106 9.11 0.0107 0.0112 0.374 0.266 0.299
100 30 2.46 0.00767 0.234 0.14 0.13 1740 0.0131 0.0107 0.564 0.324 0.344

Table 15: Multi-task (par.) Trained with BFS and W IDEST respectively.
B ELLMAN -F ORD M OST RELIABLE

Key Pred. Key Pred.

Model #Nodes
2d-G ER BA 2d-G ER BA 2d-G ER BA 2d-G ER BA

20 0.037 0.00771 0.00163 0.017 0.036 0.05 0.324 0.121 0.0748 0.331 0.327 0.38
NE 50 18.6 0.036 0.0028 0.045 0.054 0.053 0.976 0.15 0.0948 0.402 0.334 0.349
100 3270 1320 0.0043 0.114 0.106 0.067 1.53 0.199 0.114 0.443 0.344 0.34
20 0.00868 0.0015 0.000407 0.012 0.019 0.038 0.00741 0.00632 0.00643 0.067 0.177 0.216
NE++ 50 0.0382 0.00339 0.000544 0.023 0.031 0.036 0.0095 0.00537 0.00927 0.165 0.184 0.197
100 26.5 0.00565 0.000693 0.287 0.053 0.049 0.00929 0.00739 0.0124 0.24 0.199 0.197

A.5 Parallel TF & transfer results

In the main paper we did not give the results for all the transfer methods (only the best one). In this
section we give the transfer results in the same style as the main paper as well as the teacher forcing
results.
Table 16: Teacher forcing (par.).
                  BELLMAN-FORD                     MOST RELIABLE PATH
Model   #Nodes    Key              Pred.           Key               Pred.
NE      20        0.0828 ± 0.08    0.186 ± 0.02    0.067 ± 0.04      0.256 ± 0.04
        50        0.302 ± 0.4      0.226 ± 0.01    0.135 ± 0.1       0.308 ± 0.02
        100       0.629 ± 0.8      0.253 ± 0.03    0.383 ± 0.5       0.337 ± 0.03
NE++    20        0.199 ± 0.2      0.26 ± 0.04     0.05 ± 0.03       0.283 ± 0.02
        50        1.31 ± 2         0.351 ± 0.03    0.0866 ± 0.07     0.344 ± 0.04
        100       7.33 ± 10        0.409 ± 0.02    1.04 ± 1          0.384 ± 0.07

Table 17: No algorithm (par.) Pre-trained on BFS and W IDEST respectively.

B ELLMAN -F ORD M OST RELIABLE PATH

Model #Nodes Key Predecessor Key Predecessor

20 0.169 ± 0.1 0.111 ± 0.03 0.231 ± 0.2 0.322 ± 0.08

NE Freeze 50 2.76e5 ± 4e5 0.225 ± 0.07 9.18e3 ± 1e4 0.47 ± 0.07
100 5.1e16 ± 7e16 0.292 ± 0.1 2.24e10 ± 3e10 0.564 ± 0.04
20 0.0386 ± 0.02 0.072 ± 0.03 0.0221 ± 0.006 0.237 ± 0.06
NE Fine-tune 50 25 ± 40 0.162 ± 0.05 0.332 ± 0.4 0.331 ± 0.02
100 1.72e5 ± 2e5 0.242 ± 0.1 230 ± 300 0.402 ± 0.03
20 0.119 ± 0.09 0.096 ± 0.04 0.0242 ± 0.007 0.226 ± 0.07
NE 2-Proc. 50 562 ± 800 0.308 ± 0.2 3.73 ± 5 0.33 ± 0.04
100 1.66e6 ± 2e6 0.423 ± 0.3 4.8e5 ± 7e5 0.431 ± 0.003
20 0.0293 ± 0.02 0.068 ± 0.03 0.0133 ± 0.005 0.249 ± 0.09
NE++ Freeze 50 57.5 ± 80 0.207 ± 0.1 0.633 ± 0.9 0.32 ± 0.05
100 4.85e6 ± 7e6 0.307 ± 0.2 6.59e3 ± 9e3 0.395 ± 0.009
20 0.171 ± 0.1 0.14 ± 0.05 0.0139 ± 0.006 0.2 ± 0.05
NE++ Fine-tune 50 1.91e3 ± 3e3 0.305 ± 0.1 2.83 ± 4 0.244 ± 0.01
100 1.29e9 ± 2e9 0.427 ± 0.2 4.16e3 ± 6e3 0.317 ± 0.05
20 0.0223 ± 0.02 0.062 ± 0.03 0.0131 ± 0.003 0.196 ± 0.07
NE++ 2-Proc. 50 0.666 ± 0.7 0.105 ± 0.005 3.04 ± 4 0.313 ± 0.05
100 10.8 ± 10 0.168 ± 0.05 579 ± 800 0.411 ± 0.1

A.6 Multi-task pre-training: Full breakdown

A.7 Large standard deviations

The large standard deviation is to be expected given the widely different graph types. Given that all
edge weights have expected value 0.6, the max expected shortest distance in a grid graph of size n is
0.6*(n/2+1), which is much larger than for an Erdős-Rényi or Barabási-Albert graph,
which will have a diameter of O(log(n)) and thus a max expected shortest distance of 0.6*log(n). This
large difference in the shortest path makes large key errors when generalising much more likely on
a grid graph than on the other types of graphs. The converse is true for predecessor prediction: on a
grid graph the degree of a node is between 2 and 4 (constant no matter the size of the graph), while
for a Barabási-Albert or Erdős-Rényi graph it will grow with the size of the graph and be much
larger, making errors much more likely. This again yields vastly different error percentages. See the
breakdown across graph types in the tables above.
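
The two expressions above can be evaluated for the graph sizes used in the experiments. This is only an illustrative calculation: the base of the logarithm is not stated, so the natural logarithm is assumed here, and the function names are hypothetical.

```python
import math

EXPECTED_WEIGHT = 0.6  # expected edge weight stated in the text

def grid_max_expected_distance(n: int) -> float:
    """0.6 * (n / 2 + 1), the grid-graph expression from the text."""
    return EXPECTED_WEIGHT * (n / 2 + 1)

def random_graph_max_expected_distance(n: int) -> float:
    """0.6 * log(n); natural log is an assumption, the text leaves the base open."""
    return EXPECTED_WEIGHT * math.log(n)

for n in (20, 50, 100):
    grid = grid_max_expected_distance(n)
    rand = random_graph_max_expected_distance(n)
    print(f"n={n}: grid {grid:.2f}, Erdős-Rényi/Barabási-Albert {rand:.2f}")
```

Under these assumptions the gap between the two graph families widens with n, since one expression grows linearly and the other logarithmically.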

Table 18: 2-Processor transfer pre-trained on P RIM, D IJKSTRA, & DFS for sequential and BFS &
B ELLMAN -F ORD for parallel.
M OST RELIABLE (S EQ .) M OST RELIABLE (PAR .)

Key Pred. Key Pred.

Model #Nodes
2d-G ER BA 2d-G ER BA 2d-G ER BA 2d-G ER BA

20 0.00265 0.00244 0.002 0.097 0.161 0.232 0.0518 0.0297 0.041 0.153 0.214 0.313
NE++ 2-Proc. 50 29.4 72.3 86.9 0.535 0.629 0.654 0.341 0.0611 0.0821 0.414 0.311 0.366
100 1.13e6 2.06e6 2.08e6 0.663 0.8 0.812 7.81 0.106 0.105 0.558 0.383 0.403

A.8 Further experiments

A.8.1 Experiment 1

Table 19: Pre-train on Dijkstra with teacher forcing, transfer with a frozen processor to see to what
extent the encoder/decoder can be recovered.
                          DIJKSTRA
Model          #Nodes    Next            Key            Predecessor
NE 2-Proc.     20        0.114 ± 0.03    0.226 ± 0.09   0.023 ± 0.002
               50        0.457 ± 0.03    3.51 ± 4       0.083 ± 0.03
               100       0.759 ± 0.02    132 ± 200      0.226 ± 0.1
NE++ 2-Proc.   20        0.11 ± 0.01     0.253 ± 0.1    0.078 ± 0.03
               50        0.619 ± 0.06    4.35 ± 4       0.344 ± 0.02
               100       0.839 ± 0.03    35.1 ± 40      0.557 ± 0.09

Conclusion from Table 20: The results are mostly quite similar to the original results, but slightly
worse, thus indicating that while the re-use of a pre-trained processor is not trivial, it is not the primary
reason for transfer to fail.

A.8.2 Experiment 2

Table 20: Pre-train on Dijkstra with teacher forcing, transfer with a frozen processor to see to what
extent the encoder/decoder can be recovered.
                             DIJKSTRA
Model             #Nodes    Key                   Predecessor
NE 2-Proc.        20        0.00653 ± 0.005       0.092 ± 0.04
                  50        16900 ± 20000         0.401 ± 0.05
                  100       1.11e+12 ± 2e+12      0.771 ± 0.2
NE++ 2-Proc.      20        0.000531 ± 0.0003     0.05 ± 0.03
                  50        12.1 ± 10             0.249 ± 0.1
                  100       74500 ± 100000        0.502 ± 0.07
NE Finetune       20        0.00029 ± 0.0003      0.026 ± 0.01
                  50        794000 ± 1e+06        0.441 ± 0.07
                  100       8.32e+15 ± 1e+16      0.627 ± 0.09
NE++ Finetune     20        0.000192 ± 0.0002     0.027 ± 0.02
                  50        20100 ± 30000         0.67 ± 0.1
                  100       1.25e+14 ± 2e+14      0.716 ± 0.05
NE Freeze         20        0.000623 ± 0.0004     0.041 ± 0.02
                  50        79.2 ± 90             0.37 ± 0.09
                  100       1.23e+06 ± 2e+06      0.862 ± 0.08
NE++ Freeze       20        0.0108 ± 0.007        0.096 ± 0.05
                  50        152 ± 200             0.336 ± 0.01
                  100       3.29e+09 ± 4e+09      0.522 ± 0.09
NE Multi-task     20        0.0994 ± 0.1          0.129 ± 0.08
                  50        0.926 ± 1             0.155 ± 0.1
                  100       3.77 ± 4              0.212 ± 0.1
NE++ Multi-task   20        1.47 ± 2              0.363 ± 0.07
                  50        5.2e+09 ± 5e+09       0.769 ± 0.1
                  100       3.14e+29 ± 3e+29      0.752 ± 0.05

A.9 NeuralExecutor++

Let $X \in \mathbb{R}^{n \times k}$ be the node states, where $n$ and $k$ are the number of nodes and
features, respectively. Each edge $(u, v)$ has a weight $w(u, v) \in \mathbb{R}$. The architecture keeps a
hidden state for each node, $H \in \mathbb{R}^{n \times l}$ with $l$ features, which is initialised to all
zeros at time step 0. The encoder $E$ consists of a 2-layer MLP with ReLU activation and is separate for
each algorithm. The encoder is applied on each edge to produce a hidden embedding
$E(X_i^{(t)}, H_i^{(t)}, X_j^{(t)}, H_j^{(t)}, W_{ij}^{(t)}) = Z_{ij}^{(t)}$, where $t$ indicates the
$t$-th time step in the computation and $i, j$ refer to the nodes of the edge. This edge embedding is then
passed to the message function of the processor $P$, which is a message passing neural network (MPNN) with
a max aggregator and linear message and update functions (these message and update functions are always
shared between all algorithms). The processor computes the new hidden state for each node,
$P(H_i, A, W) = H_{t+1}$. Then, we have the decoder $D(Z_t, H_{t+1}) = Y_{t+1}$ and the predecessor
predictor $S(Z_t, H_{t+1}) = S_t$. Finally, a termination network $\sigma(T(H_{t+1}))$ decides whether we
should terminate or not.
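The data flow described above can be illustrated with a minimal, standard-library Python sketch of a single step. It is not the authors' implementation: the weights are random rather than learned, and all function and parameter names (`init_params`, `ne_step`, `enc1`, and so on) are hypothetical. The text does not specify how the decoder and predecessor predictor combine the edge embeddings with the new hidden state, nor how the termination network pools over nodes, so the sketch assumes max-pooling of incoming edge embeddings for the decoder, an argmax over incoming edges for the predecessor, and mean-pooling for termination.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Random (weights, bias) pair standing in for learned parameters."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(layer, x):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_params(k, l, seed=0):
    rng = random.Random(seed)
    return {
        "enc1": make_linear(2 * k + 2 * l + 1, l, rng),  # encoder E, layer 1
        "enc2": make_linear(l, l, rng),                  # encoder E, layer 2
        "msg": make_linear(l, l, rng),                   # processor message function
        "upd": make_linear(2 * l, l, rng),               # processor update function
        "dec": make_linear(2 * l, 1, rng),               # decoder D
        "pred": make_linear(2 * l, 1, rng),              # predecessor predictor S
        "term": make_linear(l, 1, rng),                  # termination network T
    }


def ne_step(X, H, edges, p):
    """One step; edges maps (i, j) -> weight w(i, j), H holds n hidden vectors."""
    n, l = len(H), len(H[0])
    # Encoder: 2-layer MLP with ReLU applied to every edge, giving Z_ij.
    Z = {}
    for (i, j), w in edges.items():
        edge_input = X[i] + H[i] + X[j] + H[j] + [w]
        Z[(i, j)] = linear(p["enc2"], relu(linear(p["enc1"], edge_input)))
    incoming = {j: [i for (i, jj) in Z if jj == j] for j in range(n)}

    def max_pool(vectors):
        return [max(col) for col in zip(*vectors)] if vectors else [0.0] * l

    # Processor: linear messages, max aggregation, linear update.
    H_next = []
    for j in range(n):
        agg = max_pool([linear(p["msg"], Z[(i, j)]) for i in incoming[j]])
        H_next.append(linear(p["upd"], H[j] + agg))

    # Decoder and predecessor predictor both read Z and H_next (wiring assumed).
    Y, S = [], []
    for j in range(n):
        z_node = max_pool([Z[(i, j)] for i in incoming[j]])
        Y.append(linear(p["dec"], z_node + H_next[j])[0])
        scores = {i: linear(p["pred"], Z[(i, j)] + H_next[j])[0] for i in incoming[j]}
        S.append(max(scores, key=scores.get) if scores else j)

    # Termination: sigmoid of T applied to the mean hidden state (pooling assumed).
    mean_h = [sum(col) / n for col in zip(*H_next)]
    tau = 1.0 / (1.0 + math.exp(-linear(p["term"], mean_h)[0]))
    return H_next, Y, S, tau


# Toy path graph 0 - 1 - 2 with k = 1 input feature and l = 4 hidden features.
X = [[0.0], [1.0], [1.0]]
H0 = [[0.0] * 4 for _ in X]  # hidden state is all zeros at time step 0
edges = {(0, 1): 0.3, (1, 0): 0.3, (1, 2): 0.7, (2, 1): 0.7}
H1, Y, S, tau = ne_step(X, H0, edges, init_params(k=1, l=4))
```

In a full executor this step would be repeated, feeding `H1` back in as the hidden state, until the termination probability `tau` crosses a chosen threshold; only the encoder, decoder, predecessor and termination parameters would differ between algorithms, while `msg` and `upd` would be shared.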


35th Conference on Neural Information Processing Systems (NeurIPS 2021)
