
Evaluating Logical Generalization in Graph Neural Networks

Koustuv Sinha 1 2 3 Shagun Sodhani 1 Joelle Pineau 1 2 3 William L. Hamilton 2 3

arXiv:2003.06560v1 [cs.LG] 14 Mar 2020

Abstract

Recent research has highlighted the role of relational inductive biases in building learning agents that can generalize and reason in a compositional manner. However, while relational learning algorithms such as graph neural networks (GNNs) show promise, we do not understand how effectively these approaches can adapt to new tasks. In this work, we study the task of logical generalization using GNNs by designing a benchmark suite grounded in first-order logic. Our benchmark suite, GraphLog, requires that learning algorithms perform rule induction in different synthetic logics, represented as knowledge graphs. GraphLog consists of relation prediction tasks on 57 distinct logical domains. We use GraphLog to evaluate GNNs in three different setups: single-task supervised learning, multi-task pretraining, and continual learning. Unlike previous benchmarks, our approach allows us to precisely control the logical relationship between the different tasks. We find that the ability for models to generalize and adapt is strongly determined by the diversity of the logical rules they encounter during training, and our results highlight new challenges for the design of GNN models. We publicly release the dataset and the code used to generate and interact with the dataset at https://www.cs.mcgill.ca/~ksinha4/graphlog/.

[Figure 1: GraphLog setup. We define a large set of rules (e.g., r2 ∧ r3 ⟹ r1, r4 ∧ r2 ⟹ r3, r1 ∧ r2 ⟹ r5) that are grounded in propositional logic. We partition the rule set into overlapping subsets, which we use to define the unique worlds Wk. Finally, within each world Wk, we generate several knowledge graphs gik that are governed by the rule set of Wk.]

1. Introduction

Relational reasoning, or the ability to reason about the relationships between objects and entities in the environment, is considered a fundamental aspect of intelligence (Krawczyk et al., 2011; Halford et al., 2010). Relational reasoning is known to play a critical role in the cognitive growth of children (Son et al., 2011; Farrington-Flint et al., 2007; Richland et al., 2010). This ability to infer relations between objects/entities/situations, and to compose relations into higher-order relations, is one of the reasons why humans quickly learn how to solve new tasks (Holyoak & Morrison, 2012; Alexander, 2016).

The perceived importance of relational reasoning for generalization capabilities has fueled the development of several neural network architectures that incorporate relational inductive biases (Battaglia et al., 2016; Santoro et al., 2017; Battaglia et al., 2018). Graph neural networks (GNNs), in particular, have emerged as a dominant computational paradigm within this growing area (Scarselli et al., 2008; Hamilton et al., 2017a; Gilmer et al., 2017; Schlichtkrull et al., 2018; Du et al., 2019). However, despite the growing interest in GNNs and their promise for improving the generalization capabilities of neural networks, we currently lack an understanding of how effectively these models can adapt and generalize across distinct tasks.

In this work, we study the task of logical generalization in the context of relational reasoning using GNNs. In particular, we study how GNNs can induce logical rules and generalize by combining these rules in novel ways after training. We propose a benchmark suite, GraphLog, that is grounded in first-order logic. Figure 1 shows the setup of the benchmark. Given a set of logical rules, we create different logical worlds with overlapping rules. For each world (say Wk), we sample multiple knowledge graphs (say gik).

*Equal contribution. 1 Facebook AI Research, Montreal, Canada. 2 School of Computer Science, McGill University, Montreal, Canada. 3 Montreal Institute of Learning Algorithms (Mila). Correspondence to: Koustuv Sinha <[email protected]>.
The learning agent should learn to induce the logical rules needed for predicting the missing facts in these knowledge graphs. Using our benchmark, we evaluate the generalization capabilities of GNNs in a supervised setting by predicting unseen combinations of known rules within a specific logical world. This task explicitly requires inductive generalization. We further analyze how various GNN architectures perform in the multi-task and the continual learning scenarios, where they have to learn over a set of logical worlds with different underlying logic. Our setup allows us to control the similarity between the different worlds by controlling the overlap in logical rules between different worlds. This enables us to precisely analyze how task similarity impacts performance in the multi-task setting.

Our analysis provides the following useful insights regarding the logical generalization capabilities of GNNs:

• Two architecture choices for GNNs have a strong positive impact on generalization performance: 1) incorporating multi-relational edge features using attention, and 2) explicitly modularizing the GNN architecture to include a parametric representation function, which learns representations for the relations based on the knowledge graph structure.

• In the multi-task setting, training a model on a more diverse set of logical worlds improves generalization and adaptation performance.

• All the evaluated models exhibit catastrophic forgetting in the continual learning setting. This indicates that the models are prone to fitting just the current task at hand rather than learning representations and compositions that transfer across tasks, highlighting the challenge of lifelong learning in the context of logical generalization and GNNs.

2. Background and Related Work

Graph Neural Networks. Several graph neural network (GNN) architectures have been proposed to learn representations of graph inputs (Scarselli et al., 2008; Duvenaud et al., 2015; Defferrard et al., 2016; Kipf & Welling, 2016; Gilmer et al., 2017; Veličković et al., 2017; Hamilton et al., 2017b; Schlichtkrull et al., 2018). Previous works have focused on evaluating graph neural networks in terms of their expressive power (Morris et al., 2019; Xu et al., 2018), usefulness of features (Chen et al., 2019), and explaining the predictions from GNNs (Ying et al., 2019). Complementing these works, we evaluate GNN models on the task of logical generalization.

Knowledge graph completion. Many knowledge graph datasets are available for the task of relation prediction (also known as knowledge base completion). Prominent examples include Freebase15K (Bordes et al., 2013), WordNet (Miller, 1995), NELL (Mitchell & Fredkin, 2014), and YAGO (Suchanek et al., 2007; Hoffart et al., 2011; Mahdisoltani et al., 2013). These datasets are derived from real-world knowledge graphs and are useful for empirical evaluation of relation prediction systems. However, these datasets are generally noisy and incomplete, as many facts are not available in the underlying knowledge bases (West et al., 2014; Paulheim, 2017). Moreover, the logical rules underpinning these systems are often opaque and implicit (Guo et al., 2016). All these shortcomings reduce the usefulness of existing knowledge graph datasets for understanding the logical generalization capability of neural networks. Some of these limitations can be overcome by using synthetic datasets, which provide a high degree of control and flexibility over the data generation process at a low cost. Synthetic datasets are useful for understanding the behavior of different models, especially when the underlying problem can have many factors of variation. We consider synthetic datasets as a means, not an end, to understand the logical generalization capability of GNNs.

Our GraphLog benchmark serves as a synthetic complement to the real-world datasets. Instead of sampling from a real-world knowledge base, we create synthetic knowledge graphs that are governed by a known and inspectable set of logical rules. Moreover, the relations in GraphLog do not require any common-sense knowledge, thus making the tasks self-contained.

Procedurally generated datasets for reasoning. In recent years, several procedurally generated benchmarks have been proposed to study the relational reasoning and compositional generalization properties of neural networks. Some recent and prominent examples are listed in Table 1. These datasets aim to provide a controlled testbed for evaluating the compositional reasoning capabilities of neural networks in isolation. Based on these existing works and their insightful observations, we enumerate the four key desiderata that, we believe, such a benchmark should provide:

1. Interpretable Rules: The rules that are used to procedurally generate the dataset should be human interpretable.

2. Diversity: The benchmark datasets should have enough diversity across different tasks, and the compositional rules used to solve different tasks should be distinct, so that adaptation on a novel task is not trivial. The degree of similarity across the tasks should be configurable to enable evaluating the role of diversity in generalization.

3. Compositional generalization: The benchmark should require compositional generalization, i.e., generalization to unseen combinations of rules.

4. Number of tasks: The benchmark should support creating a large number of tasks. This enables a more fine-grained inspection of the generalization capabilities of the model in different setups, e.g., supervised learning, multitask learning, and continual learning.
Table 1. Features of related datasets that are: 1) designed to test compositional generalization and reasoning, and 2) procedurally generated. We compare the datasets along the following dimensions: Inspectable Rules (IR), Diversity (D), Compositional Generalization (CG), Modality (M), and whether the following training setups are supported: Supervised (S), Meta-learning (Me), Multitask (Mu), and Continual Learning (CL).

Dataset | IR | D | CG | M | S | Me | Mu | CL
CLEVR (Johnson et al., 2017) | ✓ | ✗ | ✗ | Vision | ✓ | ✗ | ✗ | ✗
CoGenT (Johnson et al., 2017) | ✓ | ✗ | ✓ | Vision | ✓ | ✗ | ✗ | ✗
CLUTRR (Sinha et al., 2019) | ✓ | ✗ | ✓ | Text | ✓ | ✗ | ✗ | ✗
SCAN (Lake & Baroni, 2017) | ✓ | ✗ | ✓ | Text | ✓ | ✓ | ✗ | ✗
SQoOP (Bahdanau et al., 2018) | ✓ | ✗ | ✓ | Vision | ✓ | ✗ | ✗ | ✗
TextWorld (Côté et al., 2018) | ✗ | ✓ | ✓ | Text | ✓ | ✓ | ✓ | ✓
GraphLog (Proposed) | ✓ | ✓ | ✓ | Graph | ✓ | ✓ | ✓ | ✓

As shown in Table 1, GraphLog is unique in satisfying all of these desiderata. We highlight that GraphLog is the only dataset specifically designed to test logical generalization capabilities on graph data, whereas previous works have largely focused on the image and text modalities.

3. GraphLog

3.1. Terminology

A graph G = (V_G, E_G) is a collection of a set of nodes V_G and a set of edges E_G between the nodes. In this work, we assume that each pair of nodes has at most one edge between them. A relational graph is a graph where the edge between two nodes (say u and v) is assigned a label, denoted r. The labeled edge is denoted as (u →r v) ∈ E_G. A relation set R is a set of relations {r1, r2, ..., rK}. A rule set ℛ is a set of rules in first-order logic, which we denote in the Datalog format (Evans & Grefenstette, 2017), [ri, rj] ⇒ rk, and which can be expanded as Horn clauses of the form:

    ∃z ∈ V_G : (u →ri z) ∧ (z →rj v) ⇒ (u →rk v)    (1)

where z denotes a variable that can be bound to any entity and ⇒ denotes logical implication. The relations ri, rj form the body, while the relation rk forms the head of the rule. Horn clauses of this form represent a well-defined subset of first-order logic, and they encompass the types of logical rules learned by the vast majority of existing rule induction engines for knowledge graphs (Langley & Simon, 1995).

We use p^{u,v}_G to denote a path from node u to node v in a graph G. We construct graphs according to rules of the form in Equation 1, so that a path between two nodes will always imply a specific relation between these two nodes. In other words, we will always have that

    ∃ri ∈ R : p^{u,v}_G ⇒ (u →ri v).    (2)

Thus, by following the path between two nodes and applying the propositional rules along the edges of the path, we can resolve the relationship between the nodes. Hence, we refer to these paths as resolution paths. The edges of a resolution path are concatenated together to obtain a descriptor. These descriptors are used for quantifying the similarity between different resolution paths, with a higher overlap between the descriptors implying a greater similarity between two resolution paths.
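As a concrete illustration of this resolution process, the following sketch repeatedly rewrites a resolution path using Horn-clause rules of the form [ri, rj] ⇒ rk. The rule names and dictionary layout are illustrative assumptions, not the actual GraphLog implementation.

```python
# Hypothetical sketch: resolving the relation between two nodes by
# repeatedly rewriting a resolution path with rules [r_i, r_j] => r_k.
def resolve_path(path, rules):
    """Collapse a list of edge relations into a single relation.

    path  -- relations along a resolution path, e.g. ["r2", "r3", "r4"]
    rules -- dict mapping a body pair to its head, e.g.
             {("r2", "r3"): "r1", ("r1", "r4"): "r6"}
    """
    relations = list(path)
    while len(relations) > 1:
        # Find any adjacent pair that matches a rule body and rewrite it.
        for i in range(len(relations) - 1):
            body = (relations[i], relations[i + 1])
            if body in rules:
                relations[i:i + 2] = [rules[body]]
                break
        else:
            return None  # no rule applies; the path cannot be resolved
    return relations[0]

rules = {("r2", "r3"): "r1", ("r1", "r4"): "r6"}
# The descriptor of this path is the concatenation of its edges, "r2|r3|r4".
print(resolve_path(["r2", "r3", "r4"], rules))  # -> "r6"
```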
3.2. Problem Setup

We formulate the relational reasoning task as predicting relations between the nodes in a relational graph. Given a query (G, u, v), where u, v ∈ V_G, the learner has to predict the relation r? for the edge u →r? v. Unlike previous work on knowledge graph completion, we emphasize an inductive problem setup, where the graph G in each query is unique. Rather than reasoning on a single static knowledge graph during training and testing, we consider the setting where the model must learn to generalize to unseen graphs during evaluation.

3.3. Dataset Generation

As discussed in Section 2, we want our proposed benchmark to provide four key desiderata: (i) interpretable rules, (ii) diversity, (iii) compositional generalization, and (iv) a large number of tasks. We describe how our dataset generation process ensures all four aspects.

Rule generation. We create a set R of K relations and use it to sample a rule set ℛ. We impose two constraints on ℛ: (i) No two rules in ℛ can have the same body. This ensures consistency between the rules. (ii) Rules cannot have common relations among the head and body. This ensures the absence of cyclic dependencies in rules (Hamilton et al., 2018). Generating the dataset using a consistent and well-defined rule set ensures interpretability in the resulting dataset. The full algorithm for rule generation is given in the Appendix (Algorithm 1).

Graph generation. The graph generation process has two steps. In the first step, we recursively sample and apply rules in ℛ to generate a relational graph called the WorldGraph (as shown in Figure 1). This sampling procedure enables us to create a diverse set of WorldGraphs by considering only certain subsets (of ℛ) during sampling. By controlling the extent of overlap between the subsets of ℛ (in terms of the number of rules that are common across the subsets), we can precisely control the similarity between the different WorldGraphs. The full algorithm for generating the WorldGraph and controlling the similarity between the worlds is given in the Appendix (Algorithm 3 and Section A.2).

In the second step, the WorldGraph GW is used to sample
a set of graphs G^S_W = (g1, · · · , gN) (shown as Step (a) in Figure 2). A graph gi is sampled from GW by sampling a pair of nodes (u, v) from GW and then by sampling a resolution path p^{u,v}_{GW}. The edge u →ri v between the source and sink node of the path provides the target relation for the learning model to predict. To increase the complexity of the sampled graphs gi (beyond being simple paths), we also add nodes to gi by sampling neighbors of the nodes on p^{u,v}_{GW}, such that no other shortest path exists between u and v. Algorithm 4 (in the Appendix) details our graph sampling approach.

3.4. Summary of the GraphLog Dataset

We use the data generation process described in Section 3.3 to instantiate a dataset suite with 57 distinct logical worlds and 5000 graphs per world (Figure 1). The dataset is divided into sets of training, validation, and testing worlds. The graphs within each world are also split into training, validation, and testing sets. The key statistics of the datasets are given in Table 2. Though we instantiate 57 worlds, the GraphLog code can instantiate an arbitrary number of worlds and has been included in the supplementary material.

Table 2. Aggregate statistics of the worlds used in GraphLog. Statistics for each individual world are in the Appendix.

Number of relations | 20
Total number of WorldGraphs | 57
Total number of unique rules | 76
Training graphs per WorldGraph | 5000
Validation graphs per WorldGraph | 1000
Testing graphs per WorldGraph | 1000
Number of rules per WorldGraph | 20
Average number of descriptors | 522
Maximum length of resolution path | 10
Minimum length of resolution path | 2

3.4.1. Setups Supported in GraphLog

GraphLog enables us to investigate the logical relational reasoning performance of models in the following setups:

Supervised learning. In the supervised learning setup, a model is trained on the train split of a single logical world and evaluated on the test split of the same world. The total number of rules grows exponentially with the number of relations K, making it impossible to train on all possible combinations of the relations. However, we expect that a perfectly systematic model generalizes to unseen combinations of relations by training only on a subset of combinations (i.e., via inductive reasoning).

Multi-task learning. GraphLog provides multiple logical worlds, each with its own training and evaluation splits. In the standard multi-task training, the model is trained on the train split of many worlds (W1, · · · , WM) and evaluated on the test split of the same worlds. The complexity of each world and the similarity between the different worlds can be precisely controlled. GraphLog thus enables us to evaluate how model performance varies when the model is trained on similar vs. dissimilar worlds.

GraphLog is also designed to study the effect of pre-training on adaptation. In this setup, the model is first pre-trained on the train split of multiple worlds (W1, · · · , WM) and then adapted (fine-tuned) on the train split of unseen heldout worlds (WM+1, · · · , WN). The model is evaluated on the test split of the novel worlds. Similar to the previous setup, GraphLog provides us an opportunity to investigate the effect of similarity in pre-training. This enables GraphLog to mimic in-distribution and out-of-distribution training and testing scenarios, as well as to precisely categorize the effect of multi-task pre-training on adaptation performance.

Continual learning. GraphLog provides access to a large number of worlds, enabling us to evaluate the logical generalization capability of the models in the continual learning setup. In this setup, the model is trained on a sequence of worlds. Before training on a new world, the model is evaluated on all the worlds that it has trained on so far (a sketch of this protocol is given below). Given the several challenges involved in continual learning (Thrun & Pratt, 2012; Parisi et al., 2019; De Lange et al., 2019; Sodhani et al., 2019), we do not expect the models to be able to remember the knowledge from all the previous tasks. Nonetheless, given that we are evaluating the models for relational reasoning and that our datasets share relations, we would expect the models to retain some knowledge of how to solve the previous tasks. In this sense, the performance on the previous tasks can also be seen as an indicator of whether the models actually learn to solve the relational reasoning tasks or just fit the current dataset distribution.
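A minimal sketch of this evaluation protocol, assuming hypothetical train_on_world and evaluate_on_world helpers, is given below; it captures only the order of training and evaluation, not the released GraphLog training code.

```python
def continual_learning_run(model, worlds, train_on_world, evaluate_on_world):
    """Train on a sequence of worlds; before each new world, evaluate on
    all previously seen worlds to measure retention (forgetting)."""
    history = []
    seen = []
    for world in worlds:
        if seen:
            past_acc = [evaluate_on_world(model, w) for w in seen]
            history.append(sum(past_acc) / len(past_acc))
        train_on_world(model, world)  # converge on the current world
        seen.append(world)
    return history  # mean past accuracy recorded before each new world
```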
4. Representation and Composition

In this section, we describe the graph neural network (GNN) architectures that we evaluate on the GraphLog benchmark. In order to perform well on the benchmark tasks, a model should learn representations that are useful for solving the tasks in the current world while being general enough to be effectively adapted to new worlds. To this end, we structure the GNN models we analyze around two key modules:

• Representation module: This module is represented as a function fr : GW × R → R^d, which maps logical relations within a particular world W to d-dimensional vector representations. Intuitively, this function should learn how to encode the semantics of the various relations within a logical world.

• Composition module: This module is a function
fc : GW × V_{GW} × V_{GW} × R^{d×|R|} → R, which learns how to compose the relation representations learned by fr in order to make predictions about queries over a knowledge graph.

[Figure 2: Overview of the training process. (a): Sampling multiple graphs from GW. (b): Converting the relational graph into the extended graph ĜW; edges of different colors (denoting different types of relations) are replaced by nodes of the corresponding type in ĜW. (c): Learning representations of the relations (r) using fr with the extended graph as input. In the case of Param models, the relation representations are parameterized via an embedding layer and the extended graph is not created. (d, e): The composition function takes as input the query (gi, u, v) and the relation representations r. (f): The composition function predicts the relation between the nodes u and v.]

Note that though we break down the process into two steps, in practice the learner does not have access to the correct representations of relations or to ℛ. The learner has to rely only on the target labels to solve the reasoning task. We hypothesize that this separation of concerns between a representation function and a composition function (Dijkstra, 1982) could provide a useful inductive bias for the model.
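A minimal sketch of this two-module interface is shown below, assuming hypothetical representation_fn and composition_fn callables; the concrete instantiations are described in the following subsections.

```python
import torch.nn as nn

class ModularGNN(nn.Module):
    """Schematic sketch of the two-module decomposition (an assumption
    for illustration, not the exact GraphLog model code)."""

    def __init__(self, representation_fn, composition_fn):
        super().__init__()
        self.representation_fn = representation_fn  # f_r: world -> relation embeddings
        self.composition_fn = composition_fn        # f_c: (g, u, v, rel_embs) -> logits

    def forward(self, world_graph, query_graph, u, v):
        # f_r maps the relations of a world to d-dimensional vectors ...
        rel_embs = self.representation_fn(world_graph)
        # ... and f_c composes them over the query graph to score relations.
        return self.composition_fn(query_graph, u, v, rel_embs)
```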
4.1. Representation modules

We first describe the different approaches for learning the representations ri ∈ R^d for the relations. These representations will be provided as input to the composition function.

Direct parameterization. The simplest approach to defining the representation module is to train unique embeddings for each relation ri. This approach is predominantly used in previous work on GNNs (Gilmer et al., 2017; Veličković et al., 2017), and we term it the Param representation module. A major limitation of this approach is that the relation representations are optimized specifically for each logical world, and there is no inductive bias towards learning representations that can generalize.

Learning representations from the graph structure. In order to define a more powerful and expressive representation function, we consider an approach that learns relation representations as a function of the WorldGraph underlying a logical world. To do so, we consider an "extended" form of the WorldGraph, ĜW, where we introduce new nodes (called edge-nodes) corresponding to each edge in the original WorldGraph GW. For an edge (u →r v) ∈ E_G, the corresponding edge-node (u − r − v) is connected to only those nodes that were incident to it in the original graph (i.e., nodes u and v; see Figure 2, Step (b)). This new graph ĜW has only one type of edge and comprises nodes from both the original graph and the set of edge-nodes.

We learn the relation representations by training a GNN model on the expanded WorldGraph and averaging the edge-node embeddings corresponding to each relation type ri ∈ R (Step (c) in Figure 2). For the GNN model, we consider the Graph Convolutional Network (GCN) (Kipf & Welling, 2016) and the Graph Attention Network (GAT) architectures. Since the nodes do not have any features or attributes, we randomly initialize the embeddings in these GNN message passing layers.

The intuition behind creating the extended graph is that the representation GNN can learn the relation embeddings based on the structure of the complete relational graph GW. We expect this to provide an inductive bias that generalizes more effectively than the simple Param approach. Finally, note that while the representation function is given access to the WorldGraph to learn representations for relations, the composition module is not able to interface with the WorldGraph in order to make predictions about a query.
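The following sketch illustrates the edge-node construction on a toy triple list; the data layout and integer node ids are illustrative assumptions.

```python
def build_extended_graph(edges, num_nodes):
    """Sketch: each labeled edge (u, r, v) of the WorldGraph becomes an
    unlabeled 'edge-node' connected only to u and v. Returns the unlabeled
    edges of G_hat plus, per relation, the ids of its edge-nodes (whose
    embeddings would later be averaged to give the relation embedding)."""
    ext_edges = []
    edge_nodes_by_rel = {}
    next_id = num_nodes  # edge-nodes are appended after the original nodes
    for (u, rel, v) in edges:
        e = next_id
        next_id += 1
        ext_edges += [(u, e), (e, v)]  # the edge-node connects only u and v
        edge_nodes_by_rel.setdefault(rel, []).append(e)
    return ext_edges, edge_nodes_by_rel

ext_edges, by_rel = build_extended_graph([(0, "r1", 1), (1, "r2", 2)], 3)
# A GCN/GAT runs on ext_edges; r1's embedding = mean over by_rel["r1"].
```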
4.2. Composition modules

We now describe the GNNs used for the composition modules. These models take as input the query (gi, u, v) and the relation embeddings ri ∈ R^d (Steps (d) and (e) in Figure 2).

Relational Graph Convolutional Network (RGCN). Given that the input to the composition module is a relational graph, the RGCN model (Schlichtkrull et al., 2018) is a natural choice for a baseline architecture. In this approach,
we iterate a series of message passing operations:

    h_u^{(t)} = ReLU( Σ_{ri ∈ R} Σ_{v ∈ Nri(u)} ri ×1 T ×3 h_v^{(t−1)} ),

where h_u^{(t)} ∈ R^d denotes the representation of node u at the t-th layer of the model, T ∈ R^{dr×d×d} is a learnable tensor, r ∈ R^d is the representation for relation r, and Nri(u) denotes the neighbors of node u under relation ri. We use ×i to denote multiplication across a particular mode of the tensor. This RGCN model learns a relation-specific propagation matrix, specified by the interaction between the relation embedding ri and the shared tensor T. (Note that the shared tensor is equivalent to the basis matrix formulation in Schlichtkrull et al. (2018).)
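As a rough illustration (not the authors' released implementation), the following PyTorch sketch applies this relation-conditioned update using dense per-relation adjacency matrices; the class name, shapes, and initialization scale are assumptions.

```python
import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    def __init__(self, d_rel, d):
        super().__init__()
        # Shared tensor T (d_rel x d x d): a relation embedding r_i selects
        # a relation-specific d x d propagation matrix via r_i x_1 T.
        self.T = nn.Parameter(torch.randn(d_rel, d, d) * 0.01)

    def forward(self, h, adj, rel_embs):
        # h: (n, d) node states; adj: (|R|, n, n), one adjacency per relation;
        # rel_embs: (|R|, d_rel) relation embeddings produced by f_r.
        out = torch.zeros_like(h)
        for i in range(adj.size(0)):
            W_i = torch.einsum("r,rde->de", rel_embs[i], self.T)  # d x d
            out = out + adj[i] @ h @ W_i  # aggregate neighbors under r_i
        return torch.relu(out)
```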
Edge-based Graph Attention Network (Edge-GAT). In addition to the RGCN model, which is considered the de-facto standard architecture for applying GNNs to multi-relational data, we also explore an extension of the Graph Attention Network (GAT) model (Veličković et al., 2017) to handle edge types. Many recent works have highlighted the importance of the attention mechanism, especially in the context of relational reasoning (Vaswani et al., 2017; Santoro et al., 2018; Schlag et al., 2019). Motivated by this observation, we investigate an extended version of the GAT, where we incorporate gating via an LSTM (Hochreiter & Schmidhuber, 1997) and where the attention is conditioned on both the incoming message (from the other nodes) and the relation embedding (of the other nodes):

    m_{N(u)} = Σ_{ri ∈ R} Σ_{v ∈ Nri(u)} α( h_u^{(t−1)}, h_v^{(t−1)}, r )
    h_u^{(t)} = LSTM( m_{N(u)}, h_u^{(t−1)} )

Following the original GAT model, the attention function α is defined using a dense neural network on the concatenation of the input vectors. We refer to this model as the Edge GAT (E-GAT) model.
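The sketch below is a simplified, loop-based rendering of this update, assuming an edge-triple input format and an LSTMCell gate. It illustrates the conditioning of attention on the relation embedding; the per-node softmax normalization follows the standard GAT convention and is our assumption rather than a detail stated above.

```python
import torch
import torch.nn as nn

class EdgeGATLayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        # Attention over [target state; source state; relation embedding].
        self.att = nn.Sequential(nn.Linear(3 * d, 1), nn.LeakyReLU())
        self.cell = nn.LSTMCell(d, d)  # LSTM gate over aggregated messages

    def forward(self, h, c, edges, rel_embs):
        # h, c: (n, d) hidden/cell states; edges: list of (u, rel_index, v).
        msgs = [[] for _ in range(h.size(0))]
        scores = [[] for _ in range(h.size(0))]
        for (u, ri, v) in edges:
            a = self.att(torch.cat([h[u], h[v], rel_embs[ri]]))
            scores[u].append(a)
            msgs[u].append(h[v])
        new_h, new_c = h.clone(), c.clone()
        for u in range(h.size(0)):
            if msgs[u]:
                alpha = torch.softmax(torch.cat(scores[u]), dim=0)
                m = (alpha.unsqueeze(-1) * torch.stack(msgs[u])).sum(dim=0)
                hu, cu = self.cell(m.unsqueeze(0), (h[u:u+1], c[u:u+1]))
                new_h[u], new_c[u] = hu[0], cu[0]
        return new_h, new_c
```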
Query and node representations. We predict the relation for a given query (gi, u, v) by concatenating h_u^{(K)} and h_v^{(K)} (the final-layer query node embeddings, assuming a K-layer GNN) and applying a two-layer dense neural network (Step (f) in Figure 2). The entire model (i.e., the representation function and the composition function) is trained end-to-end using the softmax cross-entropy loss. Since we have no node features, we randomly initialize all the node embeddings in the GNNs (i.e., h_u^{(0)}).

[Figure 3: We categorize the datasets in terms of their relative difficulty (see Appendix). We observe that the models using E-GAT as the composition function consistently work well.]

5. Experiments

We aim to quantify the performance of the different GNN models on the task of logical relational reasoning in three contexts: (i) single-task supervised learning, (ii) multi-task training, and (iii) continual learning. Our experiments use the GraphLog benchmark with 57 distinct worlds or knowledge graph datasets (see Section 3) and 6 different GNN models (see Section 4). In the main paper, we share the key trends and observations that hold across the different combinations of the models and the datasets, along with some representative results. The full set of results is provided in the Appendix. All the models are implemented using PyTorch 1.3.1 (Paszke et al., 2019). The code has been included with the supplemental material.

5.1. Single Task Supervised Learning

In our first setup, we train and evaluate all of the models on all the 57 worlds, one model and one world pair at a time. This experiment provides several important results. Previous works considered only a handful of datasets when evaluating the different models on the task of relational reasoning. As such, it is possible to design a model that can exploit the biases present in the few datasets that the model is being evaluated over. In our case, we consider over 50 datasets with different characteristics (Table 2). It is difficult for one model to outperform the other models on all the datasets just by exploiting some dataset-specific bias, thereby making the conclusions more robust.

In Figure 3, we present the results for the different models. We categorize the worlds into three categories of difficulty, easy, moderate, and difficult, based on the relative test performance of the models on each world. Table 6 (in the Appendix) contains the results for the different models on the individual worlds.
We observe that the models using E-GAT as the composition function always outperform their counterparts using the RGCN models. This confirms our hypothesis about the usefulness of combining relational reasoning and attention for improving performance on relational reasoning tasks. An interesting observation is that the relative ordering among the worlds, in terms of the test accuracy of the different models, is consistent irrespective of the model we use, highlighting the intrinsic difficulty of the different worlds in GraphLog.

Table 3. Multitask evaluation performance when trained on different data distributions. We categorize the training distributions on the basis of the similarity of their rules: Similar (S) contains similar worlds, while (D) contains a mix of similar and dissimilar worlds.

fr | fc | S Accuracy | D Accuracy
GAT | E-GAT | 0.534 ±0.11 | 0.534 ±0.09
GAT | RGCN | 0.474 ±0.11 | 0.502 ±0.09
GCN | E-GAT | 0.522 ±0.1 | 0.533 ±0.09
GCN | RGCN | 0.448 ±0.09 | 0.476 ±0.09
Param | E-GAT | 0.507 ±0.09 | 0.5 ±0.09
Param | RGCN | 0.416 ±0.07 | 0.449 ±0.07

5.2. Multi-Task Training

We now turn to the setting of multi-task learning, where we train the same model on multiple logical worlds.

[Figure 4: We run multitask experiments over an increasing number of worlds to stretch the capacity of the models. Evaluation performance is reported as the average of test set performance across the worlds that the model has trained on so far. All the models reach their optimal performance at 20 worlds, beyond which their performance starts to degrade.]

Basic multi-task training. First, we evaluate how changing the similarity among the training worlds affects the test performance in the multi-task setup, where a model is trained jointly on eight worlds and tested on three distinct worlds. In Table 3, we observe that considering a mix of similar and dissimilar worlds improves the generalization capabilities of all the models when evaluated on the test split. Another important observation is that, just like in the supervised learning setup, the GAT-E-GAT model consistently performs as well as or better than the other models, and the models using E-GAT for the composition function perform better than the ones using the RGCN model. Figure 4 shows how the performance of the various models changes when we perform multi-task training on an increasingly large set of worlds. Interestingly, we see that model performance improves as the number of worlds is increased from 10 to 20 but then begins to decline, indicating capacity saturation in the presence of too many diverse worlds.

[Figure 5: We evaluate the effect of changing the similarity between the training and the evaluation datasets. The colors of the bars depict how similar the two distributions are, while the y-axis shows the mean accuracy of the model on the test split of the evaluation world. We report both the zero-shot adaptation performance and the performance after convergence.]

Multi-task pre-training. In this setup, we pre-train the model on multiple worlds and adapt on a heldout world. We study how the models' adaptation capabilities vary as the similarity between the training and the evaluation distributions changes. Figure 5 considers the cases of zero-shot adaptation and adaptation until convergence. As we move along the x-axis, the zero-shot performance (shown with solid colors) decreases in all the setups. This is expected, as the similarity between the training and the evaluation distributions also decreases. An interesting trend is that the model's performance after adaptation increases as the similarity between the two distributions decreases. This suggests that training over a diverse set of distributions improves adaptation capability. The results for adaptation with 5, 10, ..., 30 steps are provided in the Appendix (Figure 8).

5.3. Continual Learning Setup

In the continual learning setup, we evaluate the knowledge retention capabilities of the GNN models. We train the model on a sequence of overlapping worlds, and after converging on every world, we report the average of the model's
performance on all the previous worlds. In Figure 6, we observe that as the model is trained on different worlds, its performance on the previous worlds degrades rapidly. This highlights that the current reasoning models are not suitable for continual learning.

[Figure 6: We evaluate the performance of all the models in a continual learning setup. The blue curve shows the accuracy on the current world and the orange curve shows the mean accuracy on all the previously seen worlds. As the model trains on new worlds, its performance on the previously seen worlds degrades rapidly. This is the forgetting effect commonly encountered in continual learning setups.]

The role of the representation function. We also investigate the models' performance in a continual learning setup where the model learns only a world-specific representation function or a world-specific composition function, and where the other module is shared across the worlds. In Figure 7, we observe that sharing the representation function reduces the effect of catastrophic forgetting, but sharing the composition function does not have the same effect. This suggests that the representation function learns representations that are useful across the worlds.

[Figure 7: We evaluate the performance in a continual learning setup where we share either the representation function, the composition function, or both. We observe that sharing the representation function reduces the effect of catastrophic forgetting as compared to the other setups.]

6. Discussion & Conclusion

In this work, we propose GraphLog, a benchmark suite for evaluating the logical generalization capabilities of graph neural networks. GraphLog is grounded in first-order logic and provides access to a large number of diverse tasks that require compositional generalization to solve, including single-task supervised learning, multi-task learning, and continual learning. Our results highlight the importance of attention mechanisms and modularity for achieving logical generalization, while also highlighting open challenges related to multi-task and continual learning in the context of GNNs. A natural direction for future work is leveraging GraphLog for studies of fast adaptation and meta-learning in the context of logical reasoning (e.g., via gradient-based meta-learning), as well as integrating state-of-the-art methods (e.g., regularization techniques) to combat catastrophic forgetting in the context of GNNs.
References

Alexander, P. A. Relational thinking and relational reasoning: harnessing the power of patterning. NPJ Science of Learning, 1(1):1–7, 2016.

Bahdanau, D., Murty, S., Noukhovitch, M., Nguyen, T. H., de Vries, H., and Courville, A. Systematic generalization: what is required and can it be learned? arXiv preprint arXiv:1811.12889, 2018.

Battaglia, P., Pascanu, R., Lai, M., Rezende, D. J., et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pp. 4502–4510, 2016.

Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48, 2009.

Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pp. 2787–2795, 2013.

Chen, T., Bian, S., and Sun, Y. Are powerful graph neural nets necessary? A dissection on graph classification. arXiv preprint arXiv:1905.04579, 2019.

Côté, M.-A., Kádár, Á., Yuan, X., Kybartas, B., Barnes, T., Fine, E., Moore, J., Hausknecht, M., El Asri, L., Adada, M., et al. TextWorld: A learning environment for text-based games. In Workshop on Computer Games, pp. 41–75. Springer, 2018.

De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., and Tuytelaars, T. Continual learning: A comparative study on how to defy forgetting in classification tasks. arXiv preprint arXiv:1909.08383, 2019.

Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852, 2016.

Dijkstra, E. W. On the role of scientific thought. In Selected Writings on Computing: A Personal Perspective, pp. 60–66. Springer, 1982.

Du, S. S., Hou, K., Salakhutdinov, R. R., Poczos, B., Wang, R., and Xu, K. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. In Advances in Neural Information Processing Systems, pp. 5724–5734, 2019.

Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224–2232, 2015.

Evans, R. and Grefenstette, E. Learning explanatory rules from noisy data. November 2017.

Farrington-Flint, L., Canobi, K. H., Wood, C., and Faulkner, D. The role of relational reasoning in children's addition concepts. British Journal of Developmental Psychology, 25(2):227–246, 2007.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, pp. 1263–1272. JMLR.org, 2017.

Guo, M., Haque, A., Huang, D.-A., Yeung, S., and Fei-Fei, L. Dynamic task prioritization for multitask learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 270–287, 2018.

Guo, S., Wang, Q., Wang, L., Wang, B., and Guo, L. Jointly embedding knowledge graphs and logical rules. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 192–202, 2016.

Halford, G. S., Wilson, W. H., and Phillips, S. Relational knowledge: the foundation of higher cognition. Trends in Cognitive Sciences, 14(11):497–505, 2010.

Hamilton, W., Ying, R., and Leskovec, J. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 2017a.

Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017b.

Hamilton, W., Bajaj, P., Zitnik, M., Jurafsky, D., and Leskovec, J. Embedding logical queries on knowledge graphs. In Advances in Neural Information Processing Systems 31, pp. 2026–2037, 2018.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Hoffart, J., Suchanek, F. M., Berberich, K., Lewis-Kelham, E., De Melo, G., and Weikum, G. YAGO2: exploring and querying world knowledge in time, space, context, and many languages. In Proceedings of the 20th International Conference Companion on World Wide Web, pp. 229–232, 2011.

Holyoak, K. J. and Morrison, R. G. The Oxford Handbook of Thinking and Reasoning. Oxford University Press, 2012.

Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., and Girshick, R. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910, 2017.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

Krawczyk, D. C., McClelland, M. M., and Donovan, C. M. A hierarchy for relational reasoning in the prefrontal cortex. Cortex, 47(5):588–597, 2011.

Lake, B. M. and Baroni, M. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. arXiv preprint arXiv:1711.00350, 2017.

Langley, P. and Simon, H. A. Applications of machine learning and rule induction. Communications of the ACM, 38(11):54–64, 1995.

Mahdisoltani, F., Biega, J., and Suchanek, F. M. YAGO3: A knowledge base from multilingual Wikipedias. 2013.

Miller, G. A. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

Mitchell, T. and Fredkin, E. Never ending language learning. In Big Data (Big Data), 2014 IEEE International Conference on, pp. 1–1, 2014.

Morris, C., Ritzert, M., Fey, M., Hamilton, W., Lenssen, J., Rattan, G., and Grohe, M. Weisfeiler and Leman go neural: Higher-order graph neural networks. In AAAI, 2019.

Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

Paulheim, H. Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web, 8(3):489–508, 2017.

Richland, L. E., Chan, T.-K., Morrison, R. G., and Au, T. K.-F. Young children's analogical reasoning across cultures: Similarities and differences. Journal of Experimental Child Psychology, 105(1-2):146–153, 2010.

Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P., and Lillicrap, T. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pp. 4967–4976, 2017.

Santoro, A., Faulkner, R., Raposo, D., Rae, J., Chrzanowski, M., Weber, T., Wierstra, D., Vinyals, O., Pascanu, R., and Lillicrap, T. Relational recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 7299–7310, 2018.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.

Schlag, I., Smolensky, P., Fernandez, R., Jojic, N., Schmidhuber, J., and Gao, J. Enhancing the transformer with explicit relational encoding for math problem solving. arXiv preprint arXiv:1910.06611, 2019.

Schlichtkrull, M., Kipf, T. N., Bloem, P., Van Den Berg, R., Titov, I., and Welling, M. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607. Springer, 2018.

Sinha, K., Sodhani, S., Dong, J., Pineau, J., and Hamilton, W. L. CLUTRR: A diagnostic benchmark for inductive reasoning from text. arXiv preprint arXiv:1908.06177, 2019.

Sodhani, S., Chandar, S., and Bengio, Y. On training recurrent neural networks for lifelong learning. 2019.

Son, J. Y., Smith, L. B., and Goldstone, R. L. Connecting instances to promote children's relational reasoning. Journal of Experimental Child Psychology, 108(2):260–277, 2011.

Suchanek, F. M., Kasneci, G., and Weikum, G. YAGO: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pp. 697–706, 2007.

Thrun, S. and Pratt, L. Learning to Learn. Springer Science & Business Media, 2012.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.

West, R., Gabrilovich, E., Murphy, K., Sun, S., Gupta, R., and Lin, D. Knowledge base completion via search-based question answering. In Proceedings of the 23rd International Conference on World Wide Web, pp. 515–526, 2014.

Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.

Ying, Z., Bourgeois, D., You, J., Zitnik, M., and Leskovec, J. GNNExplainer: Generating explanations for graph neural networks. In Advances in Neural Information Processing Systems, pp. 9240–9251, 2019.
Supplemental Materials: Evaluating Logical Generalization in Graph Neural Networks

Koustuv Sinha 1 2 3 Shagun Sodhani 1 Joelle Pineau 1 2 3 William L. Hamilton 2 3

A. GraphLog

A.1. Extended Terminology

In this section, we extend the terminology introduced in Section 3.1. A set of relations is said to be invertible if

    ∀ri ∈ R, ∃rj ∈ R : {∀u, v ∈ VG : (u →ri v) ⇒ (v →rj u)},    (3)

i.e., for every relation in R there exists a relation in R such that, for all node pairs (u, v) in the graph, if there exists an edge u →ri v then there also exists an edge v →rj u. Invertible relations are useful in determining the inverse of a clause, where the directionality of the clause is flipped along with the ordering of the elements in the conjunctive clause. For example, the inverse of Equation 1 is of the form:

    ∃z ∈ VG : (v →r̂j z) ∧ (z →r̂i u) ⇒ (v →r̂k u)    (4)
In the second step (Algorithm 4), the world graph is used
A.2. Dataset Generation

This section follows up on the discussion in Section 3.3. We describe all the steps involved in the dataset generation process.

Rule Generation. In Algorithm 1, we describe the complete process of generating rules in GraphLog. We require the set of K relations, which we use to sample the rule set ℛ. We mark some rules as being invertible (Section A.1). Then, we iterate through all possible combinations of relations in Datalog format to sample candidate rules. We impose two constraints on each candidate rule: (i) no two rules in ℛ can have the same body, which ensures consistency between the rules; and (ii) candidate rules cannot have common relations among the head and body, which ensures the absence of cycles. We also add the inverse rule of each sampled candidate rule and check the same consistencies again. We employ two types of unary Horn clauses to perform the closure of the available rules and to check the consistency of the different rules in ℛ. Using this process, we ensure that all generated rules are sound and consistent with respect to ℛ.

World Sampling. From the set of rules in ℛ, we partition the rules into buckets for different worlds (Algorithm 2). We use a simple policy of bucketing via a sliding window of width w with stride s to assign rules to each world. For example, two consecutive worlds can be generated as ℛt = [ℛi . . . ℛi+w] and ℛt+1 = [ℛi+s . . . ℛi+w+s]. We randomly permute ℛ before bucketing in order, as sketched below.
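A minimal sketch of this sliding-window bucketing, with plain Python lists standing in for the actual rule objects, is:

```python
def partition_rules(rule_set, width, stride):
    """Bucket rules for worlds: world t gets rule_set[t*stride : t*stride+width]."""
    buckets = []
    i = 0
    while i + width <= len(rule_set):
        buckets.append(rule_set[i:i + width])
        i += stride
    return buckets

rules = [f"rule_{k}" for k in range(10)]
worlds = partition_rules(rules, width=4, stride=2)
# worlds[0] and worlds[1] share 2 rules; the overlap shrinks as the stride grows.
```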
relations in DataLog format to sample possible candidate and test splits. We then construct the training, validation and
rules. We impose two constraints on the candidate rule: (i) testing graphs by first adding all edges of an individual Di
No two rules in R can have the same body. This ensures to the corresponding graph gi , and then sampling neighbors
consistency between the rules. (ii) Candidate rules cannot of pgi . Concretely, we use Breadth First Search (BFS) to
have common relations among the head and body. This sample the neighboring subgraph of each node u ∈ pgi with
ensures absence of cycles. We also add the inverse rule of a decaying selection probability γ. This allows us to create
our sampled candidate rule and check the same consisten- diverse input graphs while having precise control over its
cies again. We employ two types of unary Horn clauses to resolution by its descriptor Di . Splitting dataset over these
perform the closure of the available rules and to check the descriptor paths ensures inductive generalization.
consistency of the different rules in R. Using this process,
we ensure that all generated rules are sound and consistent
with respect to R.

Algorithm 1 Rule Generator

Input: Set of K relations {ri}K, K > 0
Define an empty rule set R
Populate invertible rules ri ⟹ r̂i and add them to R
for all ri ∈ {ri}K do
  for all rj ∈ {ri}K do
    for all rk ∈ {ri}K do
      Define candidate rule t : [ri, rj] ⟹ rk
      if cyclical rule, i.e., ri == rk OR rj == rk then
        Reject rule
      end if
      if t[body] ∉ R then
        Add t to R
        Define inverse rule t_inv : [r̂j, r̂i] ⟹ r̂k
        if t_inv[body] ∉ R then
          Add t_inv to R
        else
          Remove the rule having body t_inv[body] from R
        end if
      end if
    end for
  end for
end for
Check and remove any further cyclical rules.

Algorithm 2 Partition rules into overlapping sets

Require: Rule set RS
Require: Number of worlds nw > 0
Require: Number of rules per world w > 0
Require: Overlapping increment stride s > 0
for i = 0; i < |RS| − w; do
  Ri = RS[i; i + w]
  i = i + s
end for

Algorithm 3 World Graph Generator

Require: Set of relations {ri}K, K > 0
Require: Set of rules derived from {ri}K, |R| > 0
Require: Rule selection decay probability γ = 0.8
Require: Maximum number of expansions s ≥ 2
Require: Set of available nodes N, s.t. |N| ≥ 0
Require: Number of cycles of generation c ≥ 0
Set rule selection probability P[R[i]] = 1, ∀i ∈ |R|
Set WorldGraph edge set Gm = ∅
while |N| > 0 or c > 0 do
  Randomly choose an expansion number for this cycle: steps = rand(2, s)
  Set added edges for this cycle Ec = ∅
  for all step in steps do
    if step = 0 then
      With uniform probability, either:
        sample rt from RS[head] and sample u, v ∈ N without replacement, OR
        sample an edge (u, rt, v) from Gm
      Add (u, rt, v) to Ec and Gm
    else
      Sample an edge (u, rt, v) from Ec
    end if
    Sample a rule R[i] from R following P s.t. [ri, rj] ⟹ rt
    P[R[i]] = P[R[i]] ∗ γ
    Sample a new node y ∈ N without replacement
    Add edge (u, ri, y) to Ec and Gm
    Add edge (y, rj, v) to Ec and Gm
  end for
  if all rules in R have been used at least once then
    Increment c by 1
    Reset rule selection probability P[R[i]] = 1, ∀i ∈ |R|
  end if
end while

A.3. Computing Similarity

GraphLog provides precise control for categorizing the similarity between different worlds by computing the overlap of the underlying rules. Concretely, the similarity between two worlds Wi and Wj is defined as Sim(Wi, Wj) = |ℛi ∩ ℛj|, where Wi and Wj are the graph worlds and ℛi and ℛj are the sets of rules associated with them. GraphLog thus enables various training scenarios: training on highly similar worlds, or training on a mix of similar and dissimilar worlds. This fine-grained control allows GraphLog to mimic both in-distribution and out-of-distribution scenarios during training and testing. It also enables us to precisely categorize the effect of multi-task pre-training when the model needs to adapt to novel worlds.
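In code, this measure is simply the size of the rule-set intersection (a sketch with string rule identifiers standing in for rule objects):

```python
def world_similarity(rules_i, rules_j):
    """Sim(W_i, W_j) = number of rules shared by the two worlds."""
    return len(set(rules_i) & set(rules_j))

print(world_similarity({"r2,r3=>r1", "r4,r2=>r3"},
                       {"r4,r2=>r3", "r1,r2=>r5"}))  # -> 1
```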
A.4. Computing Difficulty

Recent research in multitask learning has shown evidence that models prioritize the selection of difficult tasks over easy tasks during learning to boost overall performance (Guo et al., 2018). Thus, GraphLog also provides a method to examine how pretraining on tasks of different difficulty levels affects adaptation performance. Due to the stochastic effect of partitioning the rules, GraphLog consists of datasets with a varying range of difficulty. We use the supervised learning scores (Table 6) as a proxy to determine the relative difficulty of the different datasets. We cluster the datasets such that tasks with prediction accuracy at or above 70% are labeled easy, 50–70% are labeled medium, and below 50% are labeled hard. We find that the labels obtained by this criterion are consistent across the different models (Figure 3).

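The thresholding described above amounts to a simple labeling function (a sketch, using the supervised accuracy as the proxy score):

```python
def difficulty_label(accuracy):
    """Bucket a world by its supervised prediction accuracy."""
    if accuracy >= 0.70:
        return "easy"
    elif accuracy >= 0.50:
        return "medium"
    return "hard"

print([difficulty_label(a) for a in (0.82, 0.61, 0.43)])
# -> ['easy', 'medium', 'hard']
```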
Table 4. Inductive performance on data splits marked by difficulty.

fr | fc | Easy Accuracy | Medium Accuracy | Difficult Accuracy
GAT | E-GAT | 0.729 ±0.05 | 0.586 ±0.05 | 0.414 ±0.07
Param | E-GAT | 0.728 ±0.05 | 0.574 ±0.06 | 0.379 ±0.06
GCN | E-GAT | 0.713 ±0.05 | 0.55 ±0.06 | 0.396 ±0.05
GAT | RGCN | 0.695 ±0.04 | 0.53 ±0.03 | 0.421 ±0.06
Param | RGCN | 0.551 ±0.08 | 0.457 ±0.05 | 0.362 ±0.05
GCN | RGCN | 0.673 ±0.05 | 0.514 ±0.04 | 0.396 ±0.06

Algorithm 4 Graph Sampler

Require: Rule set RS
Require: World graph Gm = (Vm, Em)
Require: Maximum expansion length e > 2
Set descriptor set S = ∅
for all u, v ∈ Em do
  Get all walks Y(u,v) ∈ Gm such that |Y(u,v)| ≤ e
  Get all descriptors DY(u,v) for all walks Y(u,v)
  Add DY(u,v) to S
end for
Set train graph set Gtrain = ∅
Set test graph set Gtest = ∅
Split the descriptors into train and test splits, Strain and Stest
for all Di ∈ Strain or Stest do
  Set source node us = Di[0] and sink node vs = Di[−1]
  Set prediction target t = Em[us][vs]
  Set graph edges gi = ∅
  Add all edges from Di to gi
  for all u, v ∈ Di do
    Sample Breadth-First-Search-connected nodes from u and v with decaying probability γ
    Add the sampled edges to gi
  end for
  Remove edges in gi which create shorter paths between us and vs
  Add (gi, us, vs, t) to either Gtrain or Gtest
end for
0 5 10 15 20 0 5 10 15 20
Gradient updates Gradient updates
B. Supervised learning on GraphLog GAT_E-GAT GCN_E-GAT Param_E-GAT
GAT_RGCN GCN_RGCN Param_RGCN
We perform extensive experiments over all the datasets
available in GraphLog (statistics given in Table 6). We
observe that in general, for the entire set of 57 worlds, the
GAT E-GAT model performs the best. We observe that the Figure 8. We perform fine-grained analysis of few shot adaptation
capabilities in Multitask setting. Group 0.0 and 1.0 corresponds to
relative difficulty (Section A.4) of the tasks are highly corre-
0% and 100% similarity respectively.
lated with the number of descriptors (Section A.1) available
for each task. This shows that for a learner, a dataset with
In the main paper (Section 5.2) we introduce the setup of
enough variety among the resolution paths of the graphs is
performing multitask pre-training on GraphLog datasets
relatively easier to learn compared to the datasets which has
and adaptation on the datasets based on relative similarity.
less variation.
Here, we perform fine-grained analysis of few-shot adapata-
tion capabilities of the models. We analyze the adaptation
C. Multitask Learning performance in two settings - when the adaptation dataset
has complete overlap of rules with the training datasets
C.1. Multitask Learning on different data splits by
(group=1.0) and when the adaptation dataset has zero over-
difficulty
lap with the training datasets (group=0.0). We find RGCN
In Section A.4 we introduced the notion of difficulty among family of models with a graph based representation func-
the tasks available in GraphLog . Here, we consider a set tion has faster adaptation on the dissimilar dataset, with
of experiments where we perform multitask training and GCN-RGCN showing the fastest improvement. However
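The descriptor-based split at the heart of Algorithm 4 can be sketched as follows. This is a rough illustration that assumes the world graph is a networkx MultiDiGraph whose edges carry a "relation" attribute; the function names and the simple fractional split are assumptions, not the released sampler.

```python
# Rough sketch of descriptor collection and splitting in Algorithm 4
# (illustrative; assumes a networkx MultiDiGraph with a "relation"
# attribute on every edge -- not the released GraphLog code).
import networkx as nx


def collect_descriptors(world_graph: nx.MultiDiGraph, max_len: int):
    """For every edge (u, v), record the relation sequences
    ("descriptors") of all alternative walks u -> v of length <= max_len."""
    descriptors = []
    for u, v, data in world_graph.edges(data=True):
        if u == v:  # skip self-loops, if any
            continue
        for path in nx.all_simple_edge_paths(world_graph, u, v, cutoff=max_len):
            rels = tuple(world_graph.edges[e]["relation"] for e in path)
            if len(rels) > 1:  # exclude the direct target edge itself
                descriptors.append((u, v, data["relation"], rels))
    return descriptors


def split_descriptors(descriptors, test_frac=0.2):
    """Hold out a fraction of descriptors so that test graphs must be
    resolved via relation paths never seen during training."""
    n_test = int(len(descriptors) * test_frac)
    return descriptors[n_test:], descriptors[:n_test]  # train, test
```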
C.3. Multitask Pre-training by task difficulty

Using the notion of difficulty introduced in Section A.4, we perform a suite of experiments to evaluate the effect of pre-training on Easy, Medium and Difficult datasets. Interestingly, we find that performance at convergence is better when pre-training on the Medium and Hard datasets than on the Easy dataset (Table 5). This behaviour is also mirrored in the k-shot adaptation performance (Figure 9), where pre-training on the Hard dataset provides faster adaptation for 4 of the 6 models.

fr       fc       Easy Acc.       Medium Acc.     Difficult Acc.
GAT      E-GAT    0.531 ±0.03     0.569 ±0.01     0.555 ±0.04
Param    E-GAT    0.520 ±0.02     0.548 ±0.01     0.540 ±0.01
GCN      E-GAT    0.555 ±0.01     0.561 ±0.02     0.558 ±0.01
GAT      RGCN     0.502 ±0.02     0.532 ±0.01     0.532 ±0.01
Param    RGCN     0.535 ±0.01     0.506 ±0.04     0.539 ±0.04
GCN      RGCN     0.481 ±0.02     0.516 ±0.02     0.520 ±0.01
Mean              0.521           0.540           0.539

Table 5. Convergence performance on 3 held-out datasets when pre-trained on easy, medium and hard training datasets.

[Figure 9: k-shot adaptation accuracy versus gradient updates (0 to 40) for each of the six architectures (GAT-E-GAT, Param-E-GAT, GCN-E-GAT, GAT-RGCN, Param-RGCN, GCN-RGCN); legend: train_world ∈ {easy, medium, hard}.]

Figure 9. We evaluate the effect of k-shot adaptation on held-out datasets when pre-trained on easy, medium and hard training datasets, across the different model architectures. Here, k ranges from 0 to 40.

D. Continual Learning

A natural question arises following our continual learning experiments in Section 5.3: does the order of difficulty of the worlds matter? We therefore perform an experiment following the Curriculum Learning setup (Bengio et al., 2009), where the order in which the worlds are trained is determined by their relative difficulty (as measured by model performance in the supervised learning setup, Table 6); i.e., we order the worlds from easier to harder. We observe that while the current-task accuracy follows the trend of the difficulty of the worlds (Figure 10), the mean past accuracy is significantly worse. This suggests that a curriculum learning strategy might not be optimal for learning graph representations in a continual learning setting. We also performed the same experiment while sharing only the composition function or only the representation function across worlds (Figure 11), and observe similar trends, where sharing the representation function reduces the effect of catastrophic forgetting.

E. Hyperparameters and Experimental Setup

In this section, we provide detailed hyperparameter settings for both the models and the dataset generation, for the purposes of reproducibility. The codebase and dataset used in the experiments are attached with the Supplementary materials, and will be made public on acceptance.
World ID NC ND Split ARL AN AE D M1 M2 M3 M4 M5 M6


rule 0 17 286 train 4.49 15.487 19.295 Hard 0.481 0.500 0.494 0.486 0.462 0.462
rule 1 15 239 train 4.10 11.565 13.615 Hard 0.432 0.411 0.428 0.406 0.400 0.408
rule 2 17 157 train 3.21 9.809 11.165 Hard 0.412 0.357 0.373 0.347 0.347 0.319
rule 3 16 189 train 3.63 11.137 13.273 Hard 0.429 0.404 0.473 0.373 0.401 0.451
rule 4 16 189 train 3.94 12.622 15.501 Medium 0.624 0.606 0.619 0.475 0.481 0.595
rule 5 14 275 train 4.41 14.545 18.872 Hard 0.526 0.539 0.548 0.429 0.461 0.455
rule 6 16 249 train 5.06 16.257 20.164 Hard 0.528 0.514 0.536 0.498 0.495 0.476
rule 7 17 288 train 4.47 13.161 16.333 Medium 0.613 0.558 0.598 0.487 0.486 0.537
rule 8 15 404 train 5.43 15.997 19.134 Medium 0.627 0.643 0.629 0.523 0.563 0.569
rule 9 19 1011 train 7.22 24.151 32.668 Easy 0.758 0.744 0.739 0.683 0.651 0.623
rule 10 18 524 train 5.87 18.011 22.202 Medium 0.656 0.654 0.663 0.596 0.563 0.605
rule 11 17 194 train 4.29 11.459 13.037 Medium 0.552 0.525 0.533 0.445 0.456 0.419
rule 12 15 306 train 4.14 11.238 12.919 Easy 0.771 0.726 0.603 0.511 0.561 0.523
rule 13 16 149 train 3.58 11.238 13.549 Hard 0.453 0.402 0.419 0.347 0.298 0.344
rule 14 16 224 train 4.14 11.371 13.403 Hard 0.448 0.457 0.401 0.314 0.318 0.332
rule 15 14 224 train 3.82 12.661 15.105 Hard 0.494 0.423 0.501 0.402 0.397 0.435
rule 16 16 205 train 3.59 11.345 13.293 Hard 0.318 0.332 0.292 0.328 0.306 0.291
rule 17 17 147 train 3.16 8.163 8.894 Hard 0.347 0.308 0.274 0.164 0.161 0.181
rule 18 18 923 train 6.63 25.035 33.080 Easy 0.700 0.680 0.713 0.650 0.641 0.618
rule 19 16 416 train 6.10 17.180 20.818 Easy 0.790 0.774 0.777 0.731 0.729 0.702
rule 20 20 2024 train 8.63 34.059 45.985 Easy 0.830 0.799 0.854 0.756 0.741 0.750
rule 21 13 272 train 4.58 10.559 11.754 Medium 0.621 0.610 0.632 0.531 0.516 0.580
rule 22 17 422 train 5.21 16.540 20.681 Medium 0.586 0.593 0.628 0.530 0.506 0.573
rule 23 15 383 train 4.97 17.067 21.111 Hard 0.508 0.522 0.493 0.455 0.473 0.476
rule 24 18 879 train 6.33 21.402 26.152 Easy 0.706 0.704 0.743 0.656 0.641 0.638
rule 25 15 278 train 3.84 11.093 12.775 Hard 0.424 0.419 0.382 0.358 0.345 0.412
rule 26 15 352 train 4.71 14.157 17.115 Medium 0.565 0.534 0.532 0.466 0.461 0.499
rule 27 16 393 train 4.98 14.296 16.499 Easy 0.713 0.714 0.722 0.632 0.604 0.647
rule 28 16 391 train 4.82 17.551 21.897 Medium 0.575 0.564 0.571 0.503 0.499 0.552
rule 29 16 144 train 3.87 10.193 11.774 Hard 0.468 0.445 0.475 0.325 0.336 0.389
rule 30 17 177 train 3.51 10.270 11.764 Hard 0.381 0.426 0.382 0.357 0.316 0.336
rule 31 19 916 train 5.90 20.147 26.562 Easy 0.788 0.789 0.770 0.669 0.674 0.641
rule 32 16 287 train 4.66 16.270 20.929 Medium 0.674 0.671 0.700 0.621 0.594 0.615
rule 33 18 312 train 4.50 14.738 18.266 Medium 0.695 0.660 0.709 0.710 0.679 0.668
rule 34 18 504 train 5.00 15.345 18.614 Easy 0.908 0.888 0.906 0.768 0.762 0.811
rule 35 19 979 train 6.23 21.867 28.266 Easy 0.831 0.750 0.782 0.680 0.700 0.662
rule 36 19 252 train 4.66 13.900 16.613 Easy 0.742 0.698 0.698 0.659 0.627 0.651
rule 37 17 260 train 4.00 11.956 14.010 Easy 0.843 0.826 0.826 0.673 0.698 0.716
rule 38 17 568 train 5.21 15.305 20.075 Easy 0.748 0.762 0.733 0.644 0.630 0.719
rule 39 15 182 train 3.98 12.552 14.800 Easy 0.737 0.642 0.635 0.592 0.603 0.587
rule 40 17 181 train 3.69 11.556 14.437 Medium 0.552 0.584 0.575 0.525 0.472 0.479
rule 41 15 113 train 3.58 10.162 11.553 Medium 0.619 0.601 0.626 0.490 0.468 0.470
rule 42 14 95 train 2.96 8.939 9.751 Hard 0.511 0.472 0.483 0.386 0.393 0.395
rule 43 16 162 train 3.36 11.077 13.337 Medium 0.622 0.567 0.579 0.473 0.482 0.437
rule 44 18 705 train 4.75 15.310 18.172 Hard 0.538 0.561 0.603 0.498 0.519 0.450
rule 45 15 151 train 3.39 9.127 10.001 Medium 0.569 0.580 0.592 0.535 0.524 0.524
rule 46 19 2704 train 7.94 31.458 43.489 Easy 0.850 0.820 0.828 0.773 0.762 0.749
rule 47 18 647 train 6.66 22.139 27.789 Easy 0.723 0.667 0.708 0.620 0.649 0.611
rule 48 16 978 train 6.15 17.802 21.674 Easy 0.812 0.798 0.812 0.772 0.763 0.753
rule 49 14 169 train 3.41 9.983 11.177 Easy 0.714 0.734 0.700 0.511 0.491 0.615
rule 50 16 286 train 3.99 12.274 16.117 Medium 0.651 0.653 0.656 0.555 0.583 0.570
rule 51 16 332 valid 4.44 16.384 21.817 Easy 0.746 0.742 0.738 0.667 0.657 0.689
rule 52 17 351 valid 4.81 16.231 20.613 Medium 0.697 0.716 0.754 0.653 0.655 0.670
rule 53 15 165 valid 3.65 10.838 12.378 Hard 0.458 0.464 0.525 0.334 0.364 0.373
rule 54 13 303 test 5.25 13.503 15.567 Medium 0.638 0.623 0.603 0.587 0.586 0.555
rule 55 16 293 test 4.83 16.444 20.944 Medium 0.625 0.582 0.578 0.561 0.528 0.571
rule 56 15 241 test 4.40 14.010 16.702 Medium 0.653 0.681 0.692 0.522 0.513 0.550
AGG 16.33 428.94 4.70 14.89 18.37 0.618 / 26 0.603 / 10 0.611 / 20 0.530 / 1 0.526 / 0 0.539 / 0

Table 6. Results on the single-task supervised setup for all datasets in GraphLog. Abbreviations: NC: Number of Classes; ND: Number of Descriptors; ARL: Average Resolution Length; AN: Average Number of Nodes; AE: Average Number of Edges; D: Difficulty; AGG: Aggregate Statistics. Models considered: M1: GAT-E-GAT, M2: GCN-E-GAT, M3: Param-E-GAT, M4: GAT-RGCN, M5: GCN-RGCN and M6: Param-RGCN. Difficulty is calculated by taking the scores of model M1 and partitioning the worlds according to their accuracy (≥ 0.7 = Easy, ≥ 0.54 and < 0.7 = Medium, < 0.54 = Hard). We report both the mean of the raw accuracy scores for all models and the number of times each model ranks first across all tasks.
[Figure 10: per-world accuracy curves over the sequence of worlds (0 to 56) for each of the six architectures; legend: Current Accuracy, Mean past accuracy.]

Figure 10. Curriculum Learning strategy in the Continual Learning setup of GraphLog.

[Figure 11: per-world accuracy curves over the sequence of worlds for each architecture under three sharing regimes; legend: Shared Composition and Representation, Shared Representation with Unique Composition, Shared Composition with Unique Representation.]

Figure 11. Curriculum Learning strategy in the Continual Learning setup of GraphLog, when either the composition function or the representation function is shared across all worlds.
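For clarity, the curriculum ordering behind Figures 10 and 11 can be sketched as follows; this is a minimal illustration under assumed data structures, not the released training script.

```python
# Minimal sketch of the curriculum ordering for continual learning
# (assumed data structures; not the released GraphLog training code).
DIFFICULTY_RANK = {"Easy": 0, "Medium": 1, "Hard": 2}


def curriculum_order(world_ids, difficulty_labels):
    """Sort worlds from easiest to hardest using their Table 6 labels."""
    return sorted(world_ids, key=lambda w: DIFFICULTY_RANK[difficulty_labels[w]])


# Labels for three worlds, taken from Table 6.
labels = {"rule 0": "Hard", "rule 8": "Medium", "rule 34": "Easy"}
print(curriculum_order(labels.keys(), labels))
# ['rule 34', 'rule 8', 'rule 0']
```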

E.1. Dataset Hyperparams

We generate GraphLog with 20 relations or classes (K), which results in 76 rules in RS after consistency checks. For unary rules, we specify half of the relations to be symmetric and the other half to have invertible relations. To split the rules into individual worlds, we choose the number of rules per world w = 20 and stride s = 1, which yields 57 worlds R0 . . . R56. For each world Ri, we generate 5000 training, 1000 testing and 1000 validation graphs.
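Collected in one place, these generation settings might look like the following (the field names are illustrative, not the released configuration schema):

```python
# Dataset-generation settings from Appendix E.1 gathered into a single
# config (field names are illustrative, not the released schema).
graphlog_config = {
    "num_relations": 20,              # K relation classes
    "num_rules": 76,                  # |RS| after consistency checks
    "rules_per_world": 20,            # w
    "stride": 1,                      # s, yielding 57 worlds R0..R56
    "graphs_per_world": {"train": 5000, "test": 1000, "valid": 1000},
}
```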

E.2. Model Hyperparams

For all models, we perform a hyperparameter sweep (grid search) to find the optimal values based on validation accuracy. For all models, we set both the relation embeddings and the node embeddings to 200 dimensions. We train all models with the Adam optimizer, using a learning rate of 0.001 and a weight decay of 0.0001. In the supervised setting, we train all models for 500 epochs, with a scheduler that decays the learning rate by a factor of 0.8 whenever the validation loss is stagnant for 10 epochs. In the multitask setting, we sample a new task every epoch from the list of available tasks; we run all models for 2000 epochs when the number of tasks is ≤ 10, and for a larger number of tasks (Figure 4) we increase the number of epochs proportionally (2k epochs for 10 tasks, 4k for 20, 6k for 30, 8k for 40, and 10k for 50 tasks). For the continual learning experiments, we train on each task for 100 epochs for all models. No learning rate scheduling is used in either the multitask or the continual learning experiments. Individual model hyperparameters are as follows (a sketch of the optimizer and scheduler configuration is given after the list):

• Representation functions:
  – GAT: number of layers = 2, number of attention heads = 2, dropout = 0.4
  – GCN: number of layers = 2, with symmetric normalization and bias, no dropout
• Composition functions:
  – E-GAT: number of layers = 6, number of attention heads = 2, dropout = 0.4
  – RGCN: number of layers = 2, no dropout, with bias
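As referenced above, the following is a minimal PyTorch sketch of the optimizer and scheduler configuration. The Linear layer is a stand-in for an actual GraphLog model (200-dimensional embeddings, 20 relation classes), and the random validation loss is a placeholder for the real validation pass.

```python
# Sketch of the optimizer/scheduler configuration described above
# (the Linear layer stands in for a GraphLog model; not the released code).
import torch

model = torch.nn.Linear(200, 20)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# Supervised setting: decay the LR by 0.8 when the validation loss has
# been stagnant for 10 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.8, patience=10)

for epoch in range(500):
    # ... run one training epoch, then compute the validation loss ...
    val_loss = torch.rand(1).item()  # placeholder for the real value
    scheduler.step(val_loss)
```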
