Modeling Relational Data with Graph Convolutional Networks
1 Introduction
this intuition, we develop an encoder model for entities in the relational graph
and apply it to both tasks.
Our entity classification model uses softmax classifiers at each node in the
graph. The classifiers take node representations supplied by a relational graph
convolutional network (R-GCN) and predict the labels. The model, including
R-GCN parameters, is learned by optimizing the cross-entropy loss.
Our link prediction model can be regarded as an autoencoder consisting of
(1) an encoder: an R-GCN producing latent feature representations of entities,
and (2) a decoder: a tensor factorization model exploiting these representations
to predict labeled edges. Though in principle the decoder can rely on any type of
factorization (or generally any scoring function), we use one of the simplest and
most effective factorization methods: DistMult [11]. We observe that our method
achieves significant improvements on the challenging FB15k-237 dataset [12],
as well as competitive performance on FB15k and WN18. Among other base-
lines, our model outperforms direct optimization of the factorization (i.e. vanilla
DistMult). This result demonstrates that explicit modeling of neighborhoods in
R-GCNs is beneficial for recovering missing facts in knowledge bases.
Our main contributions are as follows: To the best of our knowledge, we are
the first to show that the GCN framework can be applied to modeling relational
data, specifically to link prediction and entity classification tasks. Secondly, we
introduce techniques for parameter sharing and to enforce sparsity constraints,
and use them to apply R-GCNs to multigraphs with large numbers of relations.
Lastly, we show that the performance of factorization models, at the example
of DistMult, can be significantly improved by enriching them with an encoder
model that performs multiple steps of information propagation in the relational
graph.
¹ R contains relations both in canonical direction (e.g. born in) and in inverse direction (e.g. born in inv).
where h_i^{(l)} ∈ R^{d^{(l)}} is the hidden state of node v_i in the l-th layer of the neural network, with d^{(l)} being the dimensionality of this layer's representations. Incoming messages of the form g_m(·, ·) are accumulated and passed through an element-wise activation function σ(·), such as the ReLU(·) = max(0, ·).² M_i denotes the set of incoming messages for node v_i and is often chosen to be identical to the set of incoming edges. g_m(·, ·) is typically chosen to be a (message-specific) neural network-like function or simply a linear transformation g_m(h_i, h_j) = W h_j with a weight matrix W such as in [14]. This type of transformation has been shown to be very effective at accumulating and encoding features from local, structured neighborhoods, and has led to significant improvements in areas such as graph classification [13] and graph-based semi-supervised learning [14].
Motivated by these architectures, we define the following simple propagation
model for calculating the forward-pass update of an entity or node denoted by
v_i in a relational (directed and labeled) multi-graph:

h_i^{(l+1)} = σ( Σ_{r∈R} Σ_{j∈N_i^r} 1/c_{i,r} · W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)} ),    (2)
where N_i^r denotes the set of neighbor indices of node i under relation r ∈ R. c_{i,r} is a problem-specific normalization constant that can either be learned or chosen in advance (such as c_{i,r} = |N_i^r|).
Intuitively, (2) accumulates transformed feature vectors of neighboring nodes
through a normalized sum. Choosing linear transformations of the form W h_j
that only depend on the neighboring node has crucial computational benefits:
(1) we do not need to store intermediate edge-based representations which could
require a significant amount of memory, and (2) it allows us to implement Eq. 2 in
vectorized form using efficient sparse-dense O(|E|) matrix multiplications, similar
to [14]. Different from regular GCNs, we introduce relation-specific transforma-
tions, i.e. depending on the type and direction of an edge. To ensure that the
representation of a node at layer l + 1 can also be informed by the corresponding
representation at layer l, we add a single self-connection of a special relation
type to each node in the data.
A neural network layer update consists of evaluating (2) in parallel for every
node in the graph. Multiple layers can be stacked to allow for dependencies across
several relational steps. We refer to this graph encoder model as a relational
graph convolutional network (R-GCN). The computation graph for a single node
update in the R-GCN model is depicted in Fig. 1.
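To make the propagation rule in Eq. 2 concrete, the following is a minimal dense NumPy sketch of a single R-GCN layer update; the function name, the per-relation adjacency representation, and the fallback normalization are illustrative assumptions, not the implementation used in this work.

```python
import numpy as np

def rgcn_layer(H, A_r, W_r, W_0):
    """One forward-pass update as in Eq. 2 (dense sketch).

    H   : (N, d_in) node representations h_i^(l) from the previous layer.
    A_r : list of (N, N) adjacency matrices, one per relation (and direction),
          with A_r[r][i, j] = 1 if j is a neighbor of i under relation r.
    W_r : list of (d_in, d_out) relation-specific weight matrices W_r^(l).
    W_0 : (d_in, d_out) weight matrix for the self-connection.
    """
    out = H @ W_0                                            # self-connection term
    for A, W in zip(A_r, W_r):
        c = np.maximum(A.sum(axis=1, keepdims=True), 1.0)    # c_{i,r} = |N_i^r|
        out += (A @ (H @ W)) / c                             # normalized sum of messages
    return np.maximum(out, 0.0)                              # element-wise ReLU
```

Since the per-relation adjacency matrices of a knowledge graph are sparse, a practical implementation would replace the dense products with the sparse-dense O(|E|) multiplications mentioned above.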
2.2 Regularization
A central issue with applying (2) to highly multi-relational data is the rapid
growth in number of parameters with the number of relations in the graph. In
practice this can easily lead to overfitting on rare relations and to models of very
² Note that this represents a simplification of the message passing neural network proposed in [16] that suffices to include the aforementioned models as special cases.
Fig. 1. Diagram for computing the update of a single graph node/entity (red) in the R-
GCN model. Activations (d-dimensional vectors) from neighboring nodes (dark blue)
are gathered and then transformed for each relation type individually (for both in-
and outgoing edges). The resulting representation (green) is accumulated in a (nor-
malized) sum and passed through an activation function (such as the ReLU). This
per-node update can be computed in parallel with shared parameters across the whole
graph. (b) Depiction of an R-GCN model for entity classification with a per-node loss
function. (c) Link prediction model with an R-GCN encoder (interspersed with fully-
connected/dense layers) and a DistMult decoder. (Color figure online)
large size. Two intuitive strategies to address such issues are to share parameters
between weight matrices, and to enforce sparsity in weight matrices so as to limit
the total number of parameters.
Corresponding to these two strategies, we introduce two separate methods for regularizing the weights of R-GCN layers: basis decomposition and block-diagonal decomposition. With the basis decomposition, each W_r^{(l)} is defined as follows:

W_r^{(l)} = Σ_{b=1}^{B} a_{rb}^{(l)} V_b^{(l)},    (3)

i.e. as a linear combination of basis transformations V_b^{(l)} ∈ R^{d^{(l+1)} × d^{(l)}} with coefficients a_{rb}^{(l)} such that only the coefficients depend on r.
(l)
In the block-diagonal decomposition, we let each Wr be defined through
the direct sum over a set of low-dimensional matrices:
B
(l)
Wr(l) = Qbr . (4)
b=1
Thereby, W_r^{(l)} are block-diagonal matrices:

W_r^{(l)} = diag(Q_{1r}^{(l)}, . . . , Q_{Br}^{(l)}) with Q_{br}^{(l)} ∈ R^{(d^{(l+1)}/B) × (d^{(l)}/B)}.    (5)

Note that for B = d, each Q has dimension 1 and W_r becomes a diagonal matrix. The block-diagonal decomposition is thus a generalization of the diagonal sparsity constraint used in the decoder of e.g. DistMult [11].
The basis function decomposition (3) can be seen as a form of effective weight
sharing between different relation types, while the block decomposition (4) can
be seen as a sparsity constraint on the weight matrices for each relation type. The
block decomposition structure encodes an intuition that latent features can be
grouped into sets of variables which are more tightly coupled within groups than
across groups. Both decompositions reduce the number of parameters needed to
learn for highly multi-relational data (such as realistic knowledge bases).
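To illustrate the two decompositions, the sketch below constructs the relation-specific weights of Eqs. 3 and 4 from shared parameters; the array shapes and function names are assumptions made for the example, not the implementation used in the experiments.

```python
import numpy as np

def basis_weights(a, V):
    """Eq. 3: W_r = sum_b a[r, b] * V[b], so only the coefficients depend on r.

    a : (R, B) relation-specific coefficients a_rb.
    V : (B, d_in, d_out) shared basis transformations V_b.
    Returns an (R, d_in, d_out) stack of relation weights.
    """
    return np.einsum('rb,bio->rio', a, V)

def block_diagonal_weights(Q):
    """Eq. 4: W_r as the direct sum (block-diagonal arrangement) of Q[r, b].

    Q : (R, B, d_in // B, d_out // B) low-dimensional blocks Q_br.
    Returns an (R, d_in, d_out) stack of block-diagonal relation weights.
    """
    R, B, bi, bo = Q.shape
    W = np.zeros((R, B * bi, B * bo))
    for b in range(B):
        W[:, b * bi:(b + 1) * bi, b * bo:(b + 1) * bo] = Q[:, b]
    return W
```

For comparison, R unconstrained matrices require R · d_in · d_out parameters, whereas the basis variant needs B · d_in · d_out + R · B and the block-diagonal variant R · d_in · d_out / B.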
The overall R-GCN model then takes the following form: We stack L layers as
defined in (2) – the output of the previous layer being the input to the next layer.
The input to the first layer can be chosen as a unique one-hot vector for each
node in the graph if no other features are present. For the block representation,
we map this one-hot vector to a dense representation through a single linear
transformation. While in this work we only consider the featureless approach,
we note that GCN-type models can incorporate predefined feature vectors [14].
3 Entity Classification
For (semi-)supervised classification of nodes, we stack R-GCN layers as defined in (2), with a softmax(·) activation (per node) on the output of the last layer, and minimize the following cross-entropy loss over all labeled nodes:

L = − Σ_{i∈Y} Σ_{k=1}^{K} t_{ik} ln h_{ik}^{(L)},    (6)

where Y is the set of node indices that have labels and h_{ik}^{(L)} is the k-th entry of the network output for the i-th labeled node. t_{ik} denotes its respective ground
truth label. In practice, we train the model using (full-batch) gradient descent
techniques. A schematic depiction of the model is given in Fig. 1b.
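As a sketch of the objective in Eq. 6, assuming the last layer's output has already been passed through a per-node softmax; the helper name and the one-hot label layout are illustrative.

```python
import numpy as np

def entity_classification_loss(H_L, t, labeled_idx):
    """Cross-entropy of Eq. 6 over the labeled nodes only.

    H_L         : (N, K) per-node softmax outputs h_i^(L) of the last R-GCN layer.
    t           : (N, K) one-hot ground-truth labels t_ik (unlabeled rows unused).
    labeled_idx : array of node indices Y that carry a label.
    """
    probs = np.sum(H_L[labeled_idx] * t[labeled_idx], axis=1)  # h_ik at the true class
    return -np.sum(np.log(probs + 1e-12))                      # epsilon for numerical stability
```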
4 Link Prediction
Link prediction deals with the prediction of new facts (i.e. triples (subject, relation,
object)). Formally, the knowledge base is represented by a directed, labeled graph
G = (V, E, R). Rather than the full set of edges E, we are given only an incom-
plete subset Ê. The task is to assign scores f (s, r, o) to possible edges (s, r, o) in
order to determine how likely those edges are to belong to E.
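As a concrete picture of the scoring step, the following is a minimal sketch of a DistMult-style decoder applied to encoder outputs; the function signature and batching are assumptions, while the trilinear form itself is the standard DistMult score used as the decoder here.

```python
import numpy as np

def distmult_score(E, R, s, r, o):
    """f(s, r, o) = sum_d E[s, d] * R[r, d] * E[o, d].

    E : (num_entities, d) entity representations, e.g. produced by the R-GCN encoder.
    R : (num_relations, d) diagonal relation parameters of the DistMult decoder.
    s, r, o : integer indices (or index arrays) of subject, relation, object.
    """
    return np.sum(E[s] * R[r] * E[o], axis=-1)
```

The resulting scores can then be ranked across candidate subjects or objects to judge how likely each edge is to belong to E.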
5 Empirical Evaluation
5.1 Entity Classification Experiments
Here, we consider the task of classifying entities in a knowledge base. In order to
infer, for example, the type of an entity (e.g. person or company), a successful
model needs to reason about the relations with other entities that this entity is
involved in.
Relations in these datasets need not necessarily encode directed subject-object relations, but
are also used to encode the presence, or absence, of a specific feature for a given
entity. In each dataset, the targets to be classified are properties of a group of
entities represented as nodes. The exact statistics of the datasets can be found
in Table 1. For a more detailed description of the datasets the reader is referred
to [22]. We remove relations that were used to create entity labels: employs and
affiliation for AIFB, isMutagenic for MUTAG, hasLithogenesis for BGS, and
objectCategory and material for AM.
For the entity classification benchmarks described in our paper, the evalua-
tion process differs subtly between publications. To eliminate these differences,
we re-ran all baselines in a uniform manner, using the canonical train/test
split from [22]. We performed hyperparameter optimization on only the training
set, running a single evaluation on the test set after hyperparameters were chosen
for each baseline. This explains why the numbers we report differ slightly from
those in the original publications (where cross-validation accuracy was reported).
Table 1. Number of entities, relations, edges and classes along with the number of
labeled entities for each of the datasets. Labeled denotes the subset of entities that
have labels and that are to be classified.
⁴ https://github.com/Data2Semantics/mustard.
For the MUTAG task, our preprocessing differs from that used in [23,25], where for
a given target relation (s, r, o) all triples connecting s to o are removed. Since o
is a boolean value in the MUTAG data, one can infer the label after processing
from other boolean relations that are still present. This issue is now mentioned
in the Mustard documentation. In our preprocessing, we remove only the specific
triples encoding the target relation.
Results. All results in Table 2 are reported on the train/test benchmark splits
from [22]. We further set aside 20% of the training set as a validation set for
hyperparameter tuning. For R-GCN, we report performance of a 2-layer model
with 16 hidden units (10 for AM), basis function decomposition (Eq. 3), and
trained with Adam [28] for 50 epochs using a learning rate of 0.01. The normal-
ization constant is chosen as c_{i,r} = |N_i^r|.
Hyperparameters for baselines are chosen according to the best model per-
formance in [23], i.e. WL: 2 (tree depth), 3 (number of iterations); RDF2Vec: 2
(WL tree depth), 4 (WL iterations), 500 (embedding size), 5 (window size), 10
(SkipGram iterations), 25 (number of negative samples). We optimize the SVM
regularization constant C ∈ {0.001, 0.01, 0.1, 1, 10, 100, 1000} based on perfor-
mance on an 80/20 train/validation split (of the original training set).
For R-GCN, we choose an l2 penalty on first layer weights C_{l2} ∈ {0, 5 · 10^{-4}}
and the number of basis functions B ∈ {0, 10, 20, 30, 40} based on validation set
performance, where B = 0 refers to no basis decomposition. Block decomposition
did not improve results. Otherwise, hyperparameters are chosen as follows: 50
(number of epochs), 16 (number of hidden units), and c_{i,r} = |N_i^r| (normalization
constant). We do not use dropout. For AM, we use a reduced number of 10
hidden units for R-GCN to reduce the memory footprint. All entity classification
experiments were run on CPU nodes with 64 GB of memory.
Table 2. Entity classification results in accuracy (average and standard error over 10
runs) for a feature-based baseline (see main text for details), WL [24, 25], RDF2Vec
[23], and R-GCN (this work). Test performance is reported on the train/test set splits
provided by [22].
merely the presence of a certain feature. BGS is a dataset of rock types with
hierarchical feature descriptions which was similarly converted to RDF format,
where relations encode the presence of a certain feature or feature hierarchy.
Labeled entities in MUTAG and BGS are only connected via high-degree hub
nodes that encode a certain feature.
We conjecture that the fixed choice of normalization constant for the aggre-
gation of messages from neighboring nodes is partly to blame for this behavior,
which can be particularly problematic for nodes of high degree. A potentially
promising way to overcome this limitation in future work is to introduce an
attention mechanism, i.e. to replace the normalization constant 1/c_{i,r} with data-dependent attention weights a_{ij,r}, where Σ_{j,r} a_{ij,r} = 1.
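A purely hypothetical sketch of such data-dependent weights, using a softmax over a node's incoming edges; the scoring function and its parametrization are not part of this work and are shown only to make the constraint Σ_{j,r} a_{ij,r} = 1 concrete.

```python
import numpy as np

def attention_coefficients(H, incoming, w):
    """Data-dependent weights a_{ij,r} for one node i, summing to 1 over (j, r).

    H        : (N, d) current node representations.
    incoming : list of (j, r) pairs, the incoming edges of the target node i.
    w        : (R, d) hypothetical per-relation scoring vectors.
    """
    logits = np.array([w[r] @ H[j] for j, r in incoming])
    e = np.exp(logits - logits.max())        # numerically stable softmax
    return e / e.sum()                       # enforces sum_{j,r} a_{ij,r} = 1
```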
Table 3. Number of entities and relation types along with the number of edges per
split for the three datasets.
Table 4. Link prediction results on FB15k-237 (Hits@n reported in the filtered setting).

Model    | MRR (Raw) | MRR (Filtered) | Hits@1 | Hits@3 | Hits@10
LinkFeat |     -     |     0.063      |   -    |   -    |  0.079
DistMult |   0.100   |     0.191      | 0.106  | 0.207  |  0.376
R-GCN    |   0.158   |     0.248      | 0.153  | 0.258  |  0.414
R-GCN+   |   0.156   |     0.249      | 0.151  | 0.264  |  0.417
CP       |   0.080   |     0.182      | 0.101  | 0.197  |  0.357
TransE   |   0.144   |     0.233      | 0.147  | 0.263  |  0.398
HolE     |   0.124   |     0.222      | 0.133  | 0.253  |  0.391
ComplEx  |   0.109   |     0.201      | 0.112  | 0.213  |  0.388
Results. We provide results using two commonly used evaluation metrics: mean
reciprocal rank (MRR) and Hits at n (H@n). Following [29], both metrics can
be computed in a raw and a filtered setting. We report filtered and raw MRR,
and filtered Hits at 1, 3, and 10.
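The sketch below computes both metrics for a single (s, r, ?) query; averaging over all test triples and over subject/object corruption is left to the caller, and the function and argument names are illustrative.

```python
import numpy as np

def rank_metrics(scores, true_obj, other_true_objs=(), ns=(1, 3, 10)):
    """Reciprocal rank and Hits@n for one (s, r, ?) query.

    scores          : (num_entities,) decoder scores for every candidate object.
    true_obj        : index of the ground-truth object.
    other_true_objs : indices of other objects forming known true triples;
                      masking them gives the filtered setting, an empty
                      tuple gives the raw setting.
    """
    scores = scores.astype(float)
    true_score = scores[true_obj]
    scores[np.asarray(other_true_objs, dtype=int)] = -np.inf   # filter out known facts
    rank = 1 + int(np.sum(scores > true_score))                # 1-based rank of the true object
    return 1.0 / rank, {n: float(rank <= n) for n in ns}
```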
We evaluate hyperparameter choices on the respective validation splits. We found a normalization constant defined as c_{i,r} = c_i = Σ_r |N_i^r|, i.e. applied across relation types, to work best. For FB15k and WN18, we report results
using basis decomposition (Eq. 3) with two basis functions, and a single encoding
layer with 200-dimensional embeddings. For FB15k-237, we found block decom-
position (Eq. 4) to perform best, using two layers with block dimension 5 × 5
and 500-dimensional embeddings. We regularize the encoder via edge dropout
applied before normalization, with dropout rate 0.2 for self-loops and 0.4 for
other edges. We apply l2 regularization to the decoder with a penalty of 0.01.
We use the Adam optimizer [28] with a learning rate of 0.01. For the baseline
and the other factorizations, we found the parameters from [20] – apart from
the dimensionality on FB15k-237 – to work best, though to make the systems
6 Related Work
Table 5. Results on the FB15k and WN18 datasets. Results marked (*) taken from
[20]. Results marked (**) taken from [30].
FB15k
Model     | MRR (Raw) | MRR (Filtered) | Hits@1 | Hits@3 | Hits@10
LinkFeat  |     -     |     0.779      |   -    |   -    |  0.804
DistMult  |   0.248   |     0.634      | 0.522  | 0.718  |  0.814
R-GCN     |   0.251   |     0.651      | 0.541  | 0.736  |  0.825
R-GCN+    |   0.262   |     0.696      | 0.601  | 0.760  |  0.842
CP*       |   0.152   |     0.326      | 0.219  | 0.376  |  0.532
TransE*   |   0.221   |     0.380      | 0.231  | 0.472  |  0.641
HolE**    |   0.232   |     0.524      | 0.402  | 0.613  |  0.739
ComplEx*  |   0.242   |     0.692      | 0.599  | 0.759  |  0.840

WN18
Model     | MRR (Raw) | MRR (Filtered) | Hits@1 | Hits@3 | Hits@10
LinkFeat  |     -     |     0.938      |   -    |   -    |  0.939
DistMult  |   0.526   |     0.813      | 0.701  | 0.921  |  0.943
R-GCN     |   0.553   |     0.814      | 0.686  | 0.928  |  0.955
R-GCN+    |   0.561   |     0.819      | 0.697  | 0.929  |  0.964
CP*       |   0.075   |     0.058      | 0.049  | 0.080  |  0.125
TransE*   |   0.335   |     0.454      | 0.089  | 0.823  |  0.934
HolE**    |   0.616   |     0.938      | 0.930  | 0.945  |  0.949
ComplEx*  |   0.587   |     0.941      | 0.936  | 0.945  |  0.947
Our R-GCN encoder model is closely related to a number of works in the area of
neural networks on graphs. It is primarily motivated as an adaptation of previous
work on GCNs [13,14,39,40] for large-scale and highly multi-relational data,
characteristic of realistic knowledge bases.
Early work in this area includes the graph neural network (GNN) [15].
A number of extensions to the original GNN have been proposed, most notably
[41,42], both of which use gating mechanisms to facilitate optimization.
7 Conclusions
References
1. Yao, X., Van Durme, B.: Information extraction over structured data: question
answering with freebase. In: ACL (2014)
2. Bao, J., Duan, N., Zhou, M., Zhao, T.: Knowledge-based question answering as
machine translation. In: ACL (2014)
3. Seyler, D., Yahya, M., Berberich, K.: Generating quiz questions from knowledge
graphs. In: Proceedings of the 24th International Conference on World Wide Web
(2015)
4. Hixon, B., Clark, P., Hajishirzi, H.: Learning knowledge graphs for question answer-
ing through conversational dialog. In: Proceedings of NAACL HLT, pp. 851–861
(2015)
5. Bordes, A., Usunier, N., Chopra, S., Weston, J.: Large-scale simple question
answering with memory networks. arXiv preprint arXiv:1506.02075 (2015)
6. Dong, L., Wei, F., Zhou, M., Xu, K.: Question answering over freebase with multi-
column convolutional neural networks. In: ACL (2015)
7. Kotov, A., Zhai, C.: Tapping into knowledge base for concept feedback: leveraging
conceptnet to improve search results for difficult queries. In: WSDM (2012)
8. Dalton, J., Dietz, L., Allan, J.: Entity query feature expansion using knowledge
base links. In: ACM SIGIR (2014)
9. Xiong, C., Callan, J.: Query expansion with freebase. In: Proceedings of the 2015
International Conference on The Theory of Information Retrieval, pp. 111–120
(2015)
10. Xiong, C., Callan, J.: EsdRank: connecting query and documents through external
semi-structured data. In: CIKM (2015)
11. Yang, B., Yih, W., He, X., Gao, J., Deng, L.: Embedding entities and relations for
learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575 (2014)
12. Toutanova, K., Chen, D.: Observed versus latent features for knowledge base and
text inference. In: Proceedings of the 3rd Workshop on Continuous Vector Space
Models and their Compositionality, pp. 57–66 (2015)
13. Duvenaud, D.K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-
Guzik, A., Adams, R.P.: Convolutional networks on graphs for learning molecular
fingerprints. In: NIPS (2015)
14. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional
networks. In: ICLR (2017)
15. Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph
neural network model. IEEE Trans. Neural Netw. 20(1), 61–80 (2009)
16. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message
passing for quantum chemistry. In: ICML (2017)
17. Socher, R., Chen, D., Manning, C.D., Ng, A.: Reasoning with neural tensor net-
works for knowledge base completion. In: NIPS (2013)
18. Lin, Y., Liu, Z., Luan, H., Sun, M., Rao, S., Liu, S.: Modeling relation paths for
representation learning of knowledge bases. In: EMNLP (2015)
19. Toutanova, K., Lin, V., Yih, W., Poon, H., Quirk, C.: Compositional learning of
embeddings for relation paths in knowledge base and text. In: ACL (2016)
20. Trouillon, T., Welbl, J., Riedel, S., Gaussier, E., Bouchard, G.: Complex embed-
dings for simple link prediction. In: ICML (2016)
21. Kipf, T.N., Welling, M.: Variational graph auto-encoders. arXiv preprint
arXiv:1611.07308 (2016)
22. Ristoski, P., de Vries, G.K.D., Paulheim, H.: A collection of benchmark datasets
for systematic evaluations of machine learning on the semantic web. In: Groth, P.,
Simperl, E., Gray, A., Sabou, M., Krötzsch, M., Lecue, F., Flöck, F., Gil, Y. (eds.)
ISWC 2016. LNCS, vol. 9982, pp. 186–194. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46547-0_20
23. Ristoski, P., Paulheim, H.: RDF2Vec: RDF Graph embeddings for data mining.
In: Groth, P., Simperl, E., Gray, A., Sabou, M., Krötzsch, M., Lecue, F., Flöck, F.,
Gil, Y. (eds.) ISWC 2016. LNCS, vol. 9981, pp. 498–514. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46523-4_30
24. Shervashidze, N., Schweitzer, P., Leeuwen, E.J., Mehlhorn, K., Borgwardt, K.M.:
Weisfeiler-Lehman graph kernels. J. Mach. Learn. Res. 12(Sep), 2539–2561 (2011)
25. de Vries, G.K.D., de Rooij, S.: Substructure counting graph kernels for machine
learning from RDF data. Web Semant. Sci. Serv. Agents World Wide Web 35, 71–84
(2015)
26. Paulheim, H., Fürnkranz, J.: Unsupervised generation of data mining features from
linked open data. In: Proceedings of the 2nd International Conference on Web
Intelligence, Mining And Semantics, p. 31 (2012)
27. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed repre-
sentations of words and phrases and their compositionality. In: NIPS (2013)
28. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
29. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating
embeddings for modeling multi-relational data. In: NIPS (2013)
30. Nickel, M., Rosasco, L., Poggio, T.: Holographic embeddings of knowledge graphs.
In: AAAI (2016)
31. Hitchcock, F.L.: The expression of a tensor or a polyadic as a sum of products.
Stud. Appl. Math. 6(1–4), 164–189 (1927)
32. Nickel, M., Tresp, V., Kriegel, H.P.: A three-way model for collective learning on
multi-relational data. In: ICML (2011)
33. Chang, K.W., Yih, W., Yang, B., Meek, C.: Typed tensor decomposition of knowl-
edge bases for relation extraction. In: EMNLP (2014)
34. Dettmers, T., Minervini, P., Stenetorp, P., Riedel, S.: Convolutional 2D knowledge
graph embeddings. In: AAAI (2018)
35. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev.
51(3), 455–500 (2009)
36. Guu, K., Miller, J., Liang, P.: Traversing knowledge graphs in vector space. In:
EMNLP (2015)
37. Garcia-Duran, A., Bordes, A., Usunier, N.: Composing relationships with transla-
tions. Technical report. CNRS, Heudiasyc (2015)
38. Neelakantan, A., Roth, B., McCallum, A.: Compositional vector space models for
knowledge base completion. In: ACL (2015)
39. Bruna, J., Zaremba, W., Szlam, A., LeCun, Y.: Spectral networks and locally
connected networks on graphs. In: ICLR (2014)
40. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on
graphs with fast localized spectral filtering. In: NIPS (2016)
41. Li, Y., Tarlow, D., Brockschmidt, M., Zemel, R.: Gated graph sequence neural
networks. In: ICLR (2016)
42. Pham, T., Tran, T., Phung, D., Venkatesh, S.: Column networks for collective
classification. In: AAAI (2017)
43. Hamilton, W.L., Ying, R., Leskovec, J.: Inductive representation learning on large
graphs. In: NIPS (2017)
44. Chen, J., Zhu, J.: Stochastic training of graph convolutional networks. arXiv
preprint arXiv:1710.10568 (2017)
45. Chen, J., Ma, T., Xiao, C.: FastGCN: fast learning with graph convolutional net-
works via importance sampling. In: ICLR (2018)