Evaluating Logical Generalization in Graph Neural Networks

arXiv:2003.06560v1 [cs.LG] 14 Mar 2020

*Equal contribution. 1 Facebook AI Research, Montreal, Canada. 2 School of Computer Science, McGill University, Montreal, Canada. 3 Montreal Institute of Learning Algorithms (Mila). Correspondence to: Koustuv Sinha <[email protected]>.

Abstract

Recent research has highlighted the role of relational inductive biases in building learning agents that can generalize and reason in a compositional manner. However, while relational learning algorithms such as graph neural networks (GNNs) show promise, we do not understand how effectively these approaches can adapt to new tasks. In this work, we study the task of logical generalization using GNNs by designing a benchmark suite grounded in first-order logic. Our benchmark suite, GraphLog, requires that learning algorithms perform rule induction in different synthetic logics, represented as knowledge graphs. GraphLog consists of relation prediction tasks on 57 distinct logical domains. We use GraphLog to evaluate GNNs in three different setups: single-task supervised learning, multi-task pretraining, and continual learning. Unlike previous benchmarks, our approach allows us to precisely control the logical relationship between the different tasks. We find that the ability of models to generalize and adapt is strongly determined by the diversity of the logical rules they encounter during training, and our results highlight new challenges for the design of GNN models. We publicly release the dataset and code used to generate and interact with the dataset at https://www.cs.mcgill.ca/~ksinha4/graphlog/.

Figure 1. GraphLog setup. We define a large set of rules that are grounded in propositional logic. We partition the rule set into overlapping subsets, which we use to define the unique worlds, Wk. Finally, within each world Wk, we generate several knowledge graphs gik that are governed by the rule set of Wk.

1. Introduction

Relational reasoning, or the ability to reason about the relationship between objects and entities in the environment, is considered a fundamental aspect of intelligence (Krawczyk et al., 2011; Halford et al., 2010). Relational reasoning is known to play a critical role in the cognitive growth of children (Son et al., 2011; Farrington-Flint et al., 2007; Richland et al., 2010). This ability to infer relations between objects/entities/situations, and to compose relations into higher-order relations, is one of the reasons why humans quickly learn how to solve new tasks (Holyoak & Morrison, 2012; Alexander, 2016).

The perceived importance of relational reasoning for generalization capabilities has fueled the development of several neural network architectures that incorporate relational inductive biases (Battaglia et al., 2016; Santoro et al., 2017; Battaglia et al., 2018). Graph neural networks (GNNs), in particular, have emerged as a dominant computational paradigm within this growing area (Scarselli et al., 2008; Hamilton et al., 2017a; Gilmer et al., 2017; Schlichtkrull et al., 2018; Du et al., 2019). However, despite the growing interest in GNNs and their promise for improving the generalization capabilities of neural networks, we currently lack an understanding of how effectively these models can adapt and generalize across distinct tasks.

In this work, we study the task of logical generalization in the context of relational reasoning using GNNs. In particular, we study how GNNs can induce logical rules and generalize by combining these rules in novel ways after training. We propose a benchmark suite, GraphLog, that is grounded in first-order logic. Figure 1 shows the setup of the benchmark. Given a set of logical rules, we create different logical worlds with overlapping rules. For each world
(say Wk), we sample multiple knowledge graphs (say gik). The learning agent should learn to induce the logical rules for predicting the missing facts in these knowledge graphs. Using our benchmark, we evaluate the generalization capabilities of GNNs in a supervised setting by predicting unseen combinations of known rules within a specific logical world. This task explicitly requires inductive generalization. We further analyze how various GNN architectures perform in the multi-task and the continual learning scenarios, where they have to learn over a set of logical worlds with different underlying logic. Our setup allows us to control the similarity between the different worlds by controlling the overlap in logical rules between different worlds. This enables us to precisely analyze how task similarity impacts performance in the multi-task setting.

Our analysis provides the following useful insights regarding the logical generalization capabilities of GNNs:

• Two architecture choices for GNNs have a strong positive impact on the generalization performance: 1) incorporating multi-relational edge features using attention, and 2) explicitly modularizing the GNN architecture to include a parametric representation function, which learns representations for the relations based on the knowledge graph structure.

• In the multi-task setting, training a model on a more diverse set of logical worlds improves generalization and adaptation performance.

• All the evaluated models exhibit catastrophic forgetting in the continual learning setting. This indicates that the models are prone to fitting to just the current task at hand and not learning representations and compositions that can transfer across tasks, highlighting the challenge of lifelong learning in the context of logical generalization and GNNs.

2. Background and Related Work

Graph Neural Networks. Several graph neural network (GNN) architectures have been proposed to learn the representation for the graph input (Scarselli et al., 2008; Duvenaud et al., 2015; Defferrard et al., 2016; Kipf & Welling, 2016; Gilmer et al., 2017; Veličković et al., 2017; Hamilton et al., 2017b; Schlichtkrull et al., 2018). Previous works have focused on evaluating graph neural networks in terms of their expressive power (Morris et al., 2019; Xu et al., 2018), usefulness of features (Chen et al., 2019), and explaining the predictions from GNNs (Ying et al., 2019). Complementing these works, we evaluate GNN models on the task of logical generalization.

Knowledge graph completion. Many knowledge graph datasets are available for the task of relation prediction (also known as knowledge base completion). Prominent examples include Freebase15K (Bordes et al., 2013), WordNet (Miller, 1995), NELL (Mitchell & Fredkin, 2014), and YAGO (Suchanek et al., 2007; Hoffart et al., 2011; Mahdisoltani et al., 2013). These datasets are derived from real-world knowledge graphs and are useful for empirical evaluation of relation prediction systems. However, these datasets are generally noisy and incomplete, as many facts are not available in the underlying knowledge bases (West et al., 2014; Paulheim, 2017). Moreover, the logical rules underpinning these systems are often opaque and implicit (Guo et al., 2016). All these shortcomings reduce the usefulness of existing knowledge graph datasets for understanding the logical generalization capability of neural networks. Some of these limitations can be overcome by using synthetic datasets, which can provide a high degree of control and flexibility over the data generation process at a low cost. Synthetic datasets are useful for understanding the behavior of different models, especially when the underlying problem can have many factors of variation. We consider using synthetic datasets as a means, and not an end, to understand the logical generalization capability of GNNs.

Our GraphLog benchmark serves as a synthetic complement to the real-world datasets. Instead of sampling from a real-world knowledge base, we create synthetic knowledge graphs that are governed by a known and inspectable set of logical rules. Moreover, the relations in GraphLog do not require any common-sense knowledge, making the tasks self-contained.

Procedurally generated datasets for reasoning. In recent years, several procedurally generated benchmarks have been proposed to study the relational reasoning and compositional generalization properties of neural networks. Some recent and prominent examples are listed in Table 1. These datasets aim to provide a controlled testbed for evaluating the compositional reasoning capabilities of neural networks in isolation. Based on these existing works and their insightful observations, we enumerate the four key desiderata that, we believe, such a benchmark should provide:

1. Interpretable rules: The rules that are used to procedurally generate the dataset should be human-interpretable.

2. Diversity: The benchmark datasets should have enough diversity across different tasks, and the compositional rules used to solve different tasks should be distinct, so that adaptation on a novel task is not trivial. The degree of similarity across the tasks should be configurable to enable evaluating the role of diversity in generalization.

3. Compositional generalization: The benchmark should require compositional generalization, i.e., generalization to unseen combinations of rules.

4. Number of tasks: The benchmark should support creating a large number of tasks. This enables a more fine-grained inspection of the generalization capabilities of the model in different setups, e.g., supervised learning, multitask learning, and continual learning.

Dataset                        | IR | D | CG | Modality | S | Me | Mu | CL
CLEVR (Johnson et al., 2017)   | ✓  | ✗ | ✗  | Vision   | ✓ | ✗  | ✗  | ✗
CoGenT (Johnson et al., 2017)  | ✓  | ✗ | ✓  | Vision   | ✓ | ✗  | ✗  | ✗
CLUTRR (Sinha et al., 2019)    | ✓  | ✗ | ✓  | Text     | ✓ | ✗  | ✗  | ✗
SCAN (Lake & Baroni, 2017)     | ✓  | ✗ | ✓  | Text     | ✓ | ✓  | ✗  | ✗
SQoOP (Bahdanau et al., 2018)  | ✓  | ✗ | ✓  | Vision   | ✓ | ✗  | ✗  | ✗
TextWorld (Côté et al., 2018)  | ✗  | ✓ | ✓  | Text     | ✓ | ✓  | ✓  | ✓
GraphLog (Proposed)            | ✓  | ✓ | ✓  | Graph    | ✓ | ✓  | ✓  | ✓

Table 1. Features of related datasets that are: 1) designed to test compositional generalization and reasoning, and 2) procedurally generated. We compare the datasets along the following dimensions: Inspectable Rules (IR), Diversity (D), Compositional Generalization (CG), Modality, and whether the following training setups are supported: Supervised (S), Meta-learning (Me), Multitask (Mu) & Continual learning (CL).

As shown in Table 1, GraphLog is unique in satisfying all of these desiderata. We highlight that GraphLog is the only dataset specifically designed to test logical generalization capabilities on graph data, whereas previous works have largely focused on the image and text modalities.

Thus, following the path between two nodes, and applying the propositional rules along the edges of the path, we can resolve the relationship between the nodes. Hence, we refer to these paths as resolution paths. The edges of a resolution path are concatenated together to obtain a descriptor. These descriptors are used for quantifying the similarity between different resolution paths, with a higher overlap between the descriptors implying a greater similarity between two resolution paths.

3.2. Problem Setup

We formulate the relational reasoning task as predicting relations between the nodes in a relational graph. Given a query (G, u, v), where u, v ∈ VG, the learner has to predict the relation r? for the edge u →r? v. Unlike the previous work on knowledge graph completion, we emphasize an inductive problem setup, where the graph G in each query is unique. Rather than reasoning on a single static knowledge graph during training and testing, we consider the setting where the model must learn to generalize to unseen graphs during evaluation.
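To make the resolution-path mechanism concrete, the sketch below folds a small rule table (in the style of Figure 1) along a path's edge types and builds the path's descriptor. The rule table, function names, and paths are illustrative; this is not the released GraphLog API.

```python
# Illustrative rule table in the style of Figure 1: (a, b) -> c encodes a ∧ b ⟹ c.
RULES = {
    ("r2", "r3"): "r1",
    ("r4", "r2"): "r3",
    ("r4", "r5"): "r7",
    ("r1", "r2"): "r5",
}

def resolve_path(edge_types):
    """Apply the rules left-to-right along a resolution path to infer one relation."""
    relation = edge_types[0]
    for nxt in edge_types[1:]:
        relation = RULES[(relation, nxt)]  # KeyError: path not resolvable under RULES
    return relation

def descriptor(edge_types):
    """Concatenated edge types; descriptor overlap measures path similarity."""
    return "|".join(edge_types)

resolve_path(["r2", "r3", "r2"])  # r2 ∧ r3 ⟹ r1, then r1 ∧ r2 ⟹ r5, giving "r5"
```

A query (G, u, v) is then answerable whenever some path between u and v resolves, under the world's rules, to a single relation.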
Figure 2. Overview of the training process: (a): Sampling multiple graphs from GW. (b): Converting the relational graph into the extended graph ĜW. Note that edges of different color (denoting different types of relations) are replaced by a node of the same type in ĜW. (c): Learning representations of the relations (r) using fr with the extended graph as the input. In the case of Param models, the relation representations are parameterized via an embedding layer and the extended graph is not created. (d, e): The composition function takes as input the query (gi, u, v) and the relational representation r. (f): The composition function predicts the relation between the nodes u and v.
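The conversion in Step (b) can be sketched as follows, assuming the WorldGraph is given as a list of (u, relation, v) triples; the function and node names are illustrative stand-ins, not the paper's released code.

```python
# Replace every typed edge (u, r, v) with an untyped "edge-node" wired to u and v,
# so a GNN over the extended graph only ever sees a single edge type.
def extend_graph(typed_edges):
    """typed_edges: list of (u, relation, v). Returns (untyped_edges, edge_node_types)."""
    untyped_edges, edge_node_types = [], {}
    for i, (u, r, v) in enumerate(typed_edges):
        edge_node = f"e{i}"             # fresh node standing in for the edge u -r-> v
        edge_node_types[edge_node] = r  # remember which relation this node represents
        untyped_edges += [(u, edge_node), (edge_node, v)]
    return untyped_edges, edge_node_types

edges, types = extend_graph([("a", "r1", "b"), ("b", "r2", "c")])
# edges == [("a", "e0"), ("e0", "b"), ("b", "e1"), ("e1", "c")]
# types == {"e0": "r1", "e1": "r2"}
```

The representation of each relation ri is then obtained by averaging the GNN embeddings of all edge-nodes whose type is ri.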
GW × VGW × VGW × R^{d×|R|} → R, which learns how to compose the relation representations learned by fr to make predictions about queries over a knowledge graph.

Note that though we break down the process into two steps, in practice the learner does not have access to the correct representations of relations or to R. The learner has to rely only on the target labels to solve the reasoning task. We hypothesize that this separation of concerns between a representation function and a composition function (Dijkstra, 1982) could provide a useful inductive bias for the model.

4.1. Representation modules

We first describe the different approaches for learning the representation ri ∈ R^d for the relations. These representations will be provided as input to the composition function.

Direct parameterization. The simplest approach to defining the representation module is to train unique embeddings for each relation ri. This approach is predominantly used in previous work on GNNs (Gilmer et al., 2017; Veličković et al., 2017), and we term it the Param representation module. A major limitation of this approach is that the relation representations are optimized specifically for each logical world, and there is no inductive bias towards learning representations that can generalize.

Learning representations from the graph structure. In order to define a more powerful and expressive representation function, we consider an approach that learns relation representations as a function of the WorldGraph underlying a logical world. To do so, we consider an "extended" form of the WorldGraph, ĜW, where we introduce new nodes (called edge-nodes) corresponding to each edge in the original WorldGraph GW. For an edge (u →r v) ∈ EG, the corresponding edge-node (u − r − v) is connected to only those nodes that were incident to it in the original graph (i.e., nodes u and v; see Figure 2, Step (b)). This new graph ĜW has only one type of edge and comprises nodes from both the original graph and the set of edge-nodes.

We learn the relation representations by training a GNN model on the expanded WorldGraph and averaging the edge-node embeddings corresponding to each relation type ri ∈ R (Step (c) in Figure 2). For the GNN model, we consider the Graph Convolutional Network (GCN) (Kipf & Welling, 2016) and the Graph Attention Network (GAT) architectures. Since the nodes do not have any features or attributes, we randomly initialize the embeddings in these GNN message passing layers.

The intuition behind creating the extended graph is that the representation GNN can learn the relation embeddings based on the structure of the complete relational graph GW. We expect this to provide an inductive bias that can generalize more effectively than the simple Param approach. Finally, note that while the representation function is given access to the WorldGraph to learn representations for relations, the composition module is not able to interface with the WorldGraph in order to make predictions about a query.

4.2. Composition modules

We now describe the GNNs used for the composition modules. These models take as input the query (gi, u, v) and the relation embeddings ri ∈ R^d (Steps (d) and (e) in Figure 2).

Relational Graph Convolutional Network (RGCN). Given that the input to the composition module is a relational graph, the RGCN model (Schlichtkrull et al., 2018) is a natural choice for a baseline architecture. In this approach,
the node embeddings are updated as

h_u^{(t)} = σ( Σ_{ri∈R} Σ_{v∈Nri(u)} (T ×_1 ri) h_v^{(t−1)} ),

where h_u^{(t)} ∈ R^d denotes the representation for a node u at the t-th layer of the model, T ∈ R^{dr×d×d} is a learnable tensor, r ∈ R^d is the representation for relation r, and Nri(u) denotes the neighbors of node u by relation ri. We use ×i to denote multiplication across a particular mode of the tensor. This RGCN model learns a relation-specific propagation matrix.¹

¹ Note that the shared tensor is equivalent to the basis matrix formulation in Schlichtkrull et al. (2018).

The RGCN model is the de facto standard architecture for applying GNNs to multi-relational data. We also explore an extension of the Graph Attention Network (GAT) model (Veličković et al., 2017) to handle edge types. Many recent works have highlighted the importance of the attention mechanism, especially in the context of relational reasoning (Vaswani et al., 2017; Santoro et al., 2018; Schlag et al., 2019). Motivated by this observation, we investigate an extended version of the GAT, where we incorporate gating via an LSTM (Hochreiter & Schmidhuber, 1997) and where the attention is conditioned on both the incoming message (from the other nodes) and the relation embedding (of the other nodes):

m_N(u) = Σ_{ri∈R} Σ_{v∈Nri(u)} α(h_u^{(t−1)}, h_v^{(t−1)}, r)

h_u^{(t)} = LSTM(m_N(u), h_u^{(t−1)})

Following the original GAT model, the attention function α is defined using a dense neural network on the concatenation of the input vectors. We refer to this model as the Edge GAT (E-GAT) model.

Query and node representations. We predict the relation for a given query (gi, u, v) by concatenating h_u^{(K)} and h_v^{(K)} (the final-layer query node embeddings, assuming a K-layer GNN) and applying a two-layer dense neural network (Step (f) in Figure 2). The entire model (i.e., the representation function and the composition function) is trained end-to-end using the softmax cross-entropy loss. Since we have no node features, we randomly initialize all the node embeddings in the GNNs (i.e., h_u^{(0)}).

5. Experiments

We aim to quantify the performance of the different GNN models on the task of logical relation reasoning in three contexts: (i) single-task supervised learning, (ii) multi-task training, and (iii) continual learning. Our experiments use the GraphLog benchmark with 57 distinct worlds, or knowledge graph datasets (see Section 3), and 6 different GNN models (see Section 4). In the main paper, we share the key trends and observations that hold across the different combinations of the models and the datasets, along with some representative results. The full set of results is provided in the Appendix. All the models are implemented using PyTorch 1.3.1 (Paszke et al., 2019). The code has been included with the supplemental material.

Figure 3. We categorize the datasets in terms of their relative difficulty (see Appendix). We observe that the models using E-GAT as the composition function consistently work well.

5.1. Single Task Supervised Learning

In our first setup, we train and evaluate all of the models on all the 57 worlds, one model and one world pair at a time. This experiment provides several important results. Previous works considered only a handful of datasets when evaluating the different models on the task of relational reasoning. As such, it is possible to design a model that can exploit the biases present in the few datasets that the model is being evaluated over. In our case, we consider over 50 datasets with different characteristics (Table 2). It is difficult for one model to outperform the other models on all the datasets just by exploiting some dataset-specific bias, thereby making the conclusions more robust.

In Figure 3, we present the results for the different models. We categorize the worlds in three categories of difficulty (easy, moderate and difficult) based on the relative test performance of the models on each world. Table 6 (in the Appendix) contains the results for the different models on the individual worlds.
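As a sketch of the prediction step (Step (f) in Figure 2): the two final-layer query-node embeddings are concatenated and passed through a two-layer dense network that scores every candidate relation. The sizes and random weights below are illustrative placeholders, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_relations = 16, 20  # illustrative sizes

# Weights of the two-layer dense prediction head.
W1, b1 = rng.normal(size=(2 * d, d)), np.zeros(d)
W2, b2 = rng.normal(size=(d, num_relations)), np.zeros(num_relations)

def relation_logits(h_u, h_v):
    """h_u, h_v: final-layer embeddings of the two query nodes, each of shape (d,)."""
    h = np.maximum(np.concatenate([h_u, h_v]) @ W1 + b1, 0.0)  # hidden layer + ReLU
    return h @ W2 + b2  # one unnormalized score per candidate relation

logits = relation_logits(rng.normal(size=d), rng.normal(size=d))
predicted = int(np.argmax(logits))  # training applies softmax cross-entropy instead
```

In the full model, h_u and h_v would come from the composition GNN run over the query graph, with node embeddings randomly initialized as described above.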
Table 3. Multitask evaluation performance when trained on different splits (fragment recovered from extraction: GAT RGCN 0.474 ±0.11 0.502 ±0.09).
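The multi-task setup evaluated in the paper (one shared model trained on queries drawn from several logical worlds) can be sketched as below; `World` and the training callback are illustrative stand-ins, not the released GraphLog API.

```python
import random

class World:
    """Illustrative stand-in for a GraphLog world: a named pool of labeled queries."""
    def __init__(self, name, queries):
        self.name, self.queries = name, queries

    def sample(self):
        return random.choice(self.queries)  # a query is ((graph, u, v), relation)

def multitask_train(model_step, worlds, steps, seed=0):
    """Uniformly mix queries from all worlds into one training stream."""
    random.seed(seed)
    for _ in range(steps):
        query, label = random.choice(worlds).sample()
        model_step(query, label)  # one gradient update on the shared model

seen = []
multitask_train(
    lambda query, label: seen.append(label),
    [World("W1", [(("g1", 0, 1), "r1")]), World("W2", [(("g2", 0, 1), "r5")])],
    steps=4,
)
```

Controlling which worlds appear in `worlds`, and how much their rule sets overlap, is what lets the benchmark measure the effect of task similarity on adaptation.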
References

Alexander, P. A. Relational thinking and relational reasoning: harnessing the power of patterning. NPJ Science of Learning, 1(1):1–7, 2016.

Bahdanau, D., Murty, S., Noukhovitch, M., Nguyen, T. H., de Vries, H., and Courville, A. Systematic generalization: what is required and can it be learned? arXiv preprint arXiv:1811.12889, 2018.

Battaglia, P., Pascanu, R., Lai, M., Rezende, D. J., et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pp. 4502–4510, 2016.

Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48, 2009.

Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pp. 2787–2795, 2013.

Chen, T., Bian, S., and Sun, Y. Are powerful graph neural nets necessary? A dissection on graph classification. arXiv preprint arXiv:1905.04579, 2019.

Côté, M.-A., Kádár, Á., Yuan, X., Kybartas, B., Barnes, T., Fine, E., Moore, J., Hausknecht, M., El Asri, L., Adada, M., et al. TextWorld: A learning environment for text-based games. In Workshop on Computer Games, pp. 41–75. Springer, 2018.

De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., and Tuytelaars, T. Continual learning: A comparative study on how to defy forgetting in classification tasks. arXiv preprint arXiv:1909.08383, 2019.

Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852, 2016.

Dijkstra, E. W. On the role of scientific thought. In Selected Writings on Computing: A Personal Perspective, pp. 60–66. Springer, 1982.

Du, S. S., Hou, K., Salakhutdinov, R. R., Poczos, B., Wang, R., and Xu, K. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. In Advances in Neural Information Processing Systems, pp. 5724–5734, 2019.

Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pp. 2224–2232, 2015.

Evans, R. and Grefenstette, E. Learning explanatory rules from noisy data. November 2017.

Farrington-Flint, L., Canobi, K. H., Wood, C., and Faulkner, D. The role of relational reasoning in children's addition concepts. British Journal of Developmental Psychology, 25(2):227–246, 2007.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, pp. 1263–1272. JMLR.org, 2017.

Guo, M., Haque, A., Huang, D.-A., Yeung, S., and Fei-Fei, L. Dynamic task prioritization for multitask learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 270–287, 2018.

Guo, S., Wang, Q., Wang, L., Wang, B., and Guo, L. Jointly embedding knowledge graphs and logical rules. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 192–202, 2016.

Halford, G. S., Wilson, W. H., and Phillips, S. Relational knowledge: the foundation of higher cognition. Trends in Cognitive Sciences, 14(11):497–505, 2010.

Hamilton, W., Ying, R., and Leskovec, J. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin, 2017a.

Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017b.

Hamilton, W., Bajaj, P., Zitnik, M., Jurafsky, D., and Leskovec, J. Embedding logical queries on knowledge graphs. In Advances in Neural Information Processing Systems 31, pp. 2026–2037, 2018.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Hoffart, J., Suchanek, F. M., Berberich, K., Lewis-Kelham, E., De Melo, G., and Weikum, G. YAGO2: exploring and querying world knowledge in time, space, context, and many languages. In Proceedings of the 20th International Conference Companion on World Wide Web, pp. 229–232, 2011.

Holyoak, K. J. and Morrison, R. G. The Oxford Handbook of Thinking and Reasoning. Oxford University Press, 2012.

Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., and Girshick, R. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910, 2017.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

Krawczyk, D. C., McClelland, M. M., and Donovan, C. M. A hierarchy for relational reasoning in the prefrontal cortex. Cortex, 47(5):588–597, 2011.

Lake, B. M. and Baroni, M. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. arXiv preprint arXiv:1711.00350, 2017.

Langley, P. and Simon, H. A. Applications of machine learning and rule induction. Communications of the ACM, 38(11):54–64, 1995.

Mahdisoltani, F., Biega, J., and Suchanek, F. M. YAGO3: A knowledge base from multilingual Wikipedias. 2013.

Miller, G. A. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

Mitchell, T. and Fredkin, E. Never ending language learning. In Big Data (Big Data), 2014 IEEE International Conference on, pp. 1–1, 2014.

Morris, C., Ritzert, M., Fey, M., Hamilton, W., Lenssen, J., Rattan, G., and Grohe, M. Weisfeiler and Leman go neural: Higher-order graph neural networks. In AAAI, 2019.

Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.

Paulheim, H. Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web, 8(3):489–508, 2017.

Richland, L. E., Chan, T.-K., Morrison, R. G., and Au, T. K.-F. Young children's analogical reasoning across cultures: Similarities and differences. Journal of Experimental Child Psychology, 105(1-2):146–153, 2010.

Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P., and Lillicrap, T. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pp. 4967–4976, 2017.

Santoro, A., Faulkner, R., Raposo, D., Rae, J., Chrzanowski, M., Weber, T., Wierstra, D., Vinyals, O., Pascanu, R., and Lillicrap, T. Relational recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 7299–7310, 2018.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.

Schlag, I., Smolensky, P., Fernandez, R., Jojic, N., Schmidhuber, J., and Gao, J. Enhancing the transformer with explicit relational encoding for math problem solving. arXiv preprint arXiv:1910.06611, 2019.

Schlichtkrull, M., Kipf, T. N., Bloem, P., Van Den Berg, R., Titov, I., and Welling, M. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607. Springer, 2018.

Sinha, K., Sodhani, S., Dong, J., Pineau, J., and Hamilton, W. L. CLUTRR: A diagnostic benchmark for inductive reasoning from text. arXiv preprint arXiv:1908.06177, 2019.

Sodhani, S., Chandar, S., and Bengio, Y. On training recurrent neural networks for lifelong learning. 2019.

Son, J. Y., Smith, L. B., and Goldstone, R. L. Connecting instances to promote children's relational reasoning. Journal of Experimental Child Psychology, 108(2):260–277, 2011.

Suchanek, F. M., Kasneci, G., and Weikum, G. YAGO: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pp. 697–706, 2007.

Ying, Z., Bourgeois, D., You, J., Zitnik, M., and Leskovec, J. GNNExplainer: Generating explanations for graph neural networks. In Advances in Neural Information Processing Systems, pp. 9240–9251, 2019.
Supplemental Materials: Evaluating Logical Generalization in Graph Neural Networks
by this criteria are consistent across the different models (Figure 3).

B. Supervised learning on GraphLog

We perform extensive experiments over all the datasets available in GraphLog (statistics given in Table 6). We observe that, in general, for the entire set of 57 worlds, the GAT-E-GAT model performs the best. We observe that the relative difficulty (Section A.4) of the tasks is highly correlated with the number of descriptors (Section A.1) available for each task. This shows that, for a learner, a dataset with enough variety among the resolution paths of its graphs is relatively easier to learn than a dataset with less variation.

Figure 8. We perform a fine-grained analysis of few-shot adaptation capabilities in the multitask setting. Groups 0.0 and 1.0 correspond to 0% and 100% similarity, respectively. (Models compared: GAT_E-GAT, GCN_E-GAT, Param_E-GAT, GAT_RGCN, GCN_RGCN, Param_RGCN; x-axis: gradient updates.)

In the main paper (Section 5.2), we introduce the setup of performing multitask pre-training on GraphLog datasets and adaptation on the datasets based on relative similarity. Here, we perform a fine-grained analysis of the few-shot adaptation capabilities of the models. We analyze the adaptation performance in two settings: when the adaptation dataset has complete overlap of rules with the training datasets (group=1.0) and when the adaptation dataset has zero overlap with the training datasets (group=0.0). We find that the RGCN family of models with a graph-based representation function adapts faster on the dissimilar dataset, with GCN-RGCN showing the fastest improvement. However, on the similar dataset the models follow the ranking of the supervised learning experiments, with the GAT-E-GAT model adapting comparatively better.

C. Multitask Learning

C.1. Multitask Learning on different data splits by difficulty

In Section A.4 we introduced the notion of difficulty among the tasks available in GraphLog. Here, we consider a set of experiments where we perform multitask training and
D. Continual Learning

A natural question arises following our continual learning experiments in Section 5.3: does the order of difficulty of the worlds matter? Thus, we perform an experiment following the curriculum learning (Bengio et al., 2009) setup, where the order in which the worlds are trained is determined by their relative difficulty (which is determined by the performance of models in the supervised learning setup, Table 6), i.e., we order the worlds from easier worlds to harder worlds. We observe that while the current task accuracy follows the trend of the difficulty of the worlds (Figure 10), the mean of past accuracy is significantly worse. This suggests that a curriculum learning strategy might not be optimal to learn graph representations in a continual learning setting.

We also performed the same experiment with sharing only the composition and representation functions (Figure 11), and observe similar trends, where sharing the representation function reduces the effect of catastrophic forgetting.

Figure 9. We evaluate the effect of k-shot adaptation on held-out datasets when pre-trained on easy, medium and hard training datasets, among the different model architectures. Here, k ranges from 0 to 40.
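The bookkeeping behind the "current task accuracy" and "mean of past accuracy" curves above can be sketched as follows; the accuracy matrix is made-up data, chosen to illustrate forgetting.

```python
def continual_metrics(acc_matrix):
    """acc_matrix[k][j]: accuracy on world j, measured after training on world k."""
    current = [row[k] for k, row in enumerate(acc_matrix)]  # diagonal entries
    mean_past = [sum(row[:k]) / k                           # strictly earlier worlds
                 for k, row in enumerate(acc_matrix) if k > 0]
    return current, mean_past

current, mean_past = continual_metrics([
    [0.8, 0.1, 0.1],
    [0.5, 0.7, 0.2],  # accuracy on world 0 dropped after training on world 1
    [0.4, 0.5, 0.6],
])
# current == [0.8, 0.7, 0.6]; mean_past == [0.5, 0.45]
```

A widening gap between `current` and `mean_past` as training proceeds is the signature of catastrophic forgetting reported in the experiments.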
Table 6. Results on the single-task supervised setup for all datasets in GraphLog. Abbreviations: NC: Number of Classes, ND: Number of Descriptors, ARL: Average Resolution Length, AN: Average number of nodes, AE: Average number of edges, D: Difficulty, AGG: Aggregate Statistics. List of models considered: M1: GAT-E-GAT, M2: GCN-E-GAT, M3: Param-E-GAT, M4: GAT-RGCN, M5: GCN-RGCN and M6: Param-RGCN. Difficulty is calculated by taking the scores of the model (M1) and partitioning the worlds according to their accuracy (≥ 0.7 = Easy, ≥ 0.54 and < 0.7 = Medium, and < 0.54 = Hard). We provide both the mean of the raw accuracy scores for all models, as well as the number of times each model is ranked first across all the tasks.
Figures 10 and 11 (per-world accuracy curves for the GCN-E-GAT, GAT-RGCN, Param-RGCN and GCN-RGCN models; x-axis: worlds, y-axis: accuracy). Figure 10 plots the current accuracy against the mean past accuracy; Figure 11 compares sharing both the composition and representation functions, sharing only the representation function, and sharing only the composition function.
• Representation functions:
  – GAT: Number of layers = 2, Number of attention heads = 2, Dropout = 0.4
  – GCN: Number of layers = 2, with symmetric normalization and bias, no dropout
• Composition functions:
  – E-GAT: Number of layers = 6, Number of attention heads = 2, Dropout = 0.4
  – RGCN: Number of layers = 2, no dropout, with bias