
The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

Convolutional 2D Knowledge Graph Embeddings


Tim Dettmers∗
Università della Svizzera italiana
[email protected]

Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel


University College London
{p.minervini,p.stenetorp,s.riedel}@cs.ucl.ac.uk

Abstract

Link prediction for knowledge graphs is the task of predicting missing relationships between entities. Previous work on link prediction has focused on shallow, fast models which can scale to large knowledge graphs. However, these models learn less expressive features than deep, multi-layer models – which potentially limits performance. In this work we introduce ConvE, a multi-layer convolutional network model for link prediction, and report state-of-the-art results for several established datasets. We also show that the model is highly parameter efficient, yielding the same performance as DistMult and R-GCN with 8x and 17x fewer parameters. Analysis of our model suggests that it is particularly effective at modelling nodes with high indegree – which are common in highly-connected, complex knowledge graphs such as Freebase and YAGO3. In addition, it has been noted that the WN18 and FB15k datasets suffer from test set leakage, due to inverse relations from the training set being present in the test set – however, the extent of this issue has so far not been quantified. We find this problem to be severe: a simple rule-based model can achieve state-of-the-art results on both WN18 and FB15k. To ensure that models are evaluated on datasets where simply exploiting inverse relations cannot yield competitive results, we investigate and validate several commonly used datasets – deriving robust variants where necessary. We then perform experiments on these robust datasets for our own and several previously proposed models, and find that ConvE achieves state-of-the-art Mean Reciprocal Rank across all datasets.

Introduction

Knowledge graphs are graph-structured knowledge bases, where facts are represented in the form of relationships (edges) between entities (nodes). They have important applications in search, analytics, recommendation, and data integration – however, they tend to suffer from incompleteness, that is, missing links in the graph. For example, in Freebase and DBpedia more than 66% of the person entries are missing a birthplace (Dong et al. 2014; Krompaß, Baier, and Tresp 2015). Identifying such missing links is referred to as link prediction. Knowledge graphs can contain millions of facts; as a consequence, link predictors should scale in a manageable way with respect to both the number of parameters and computational costs to be applicable in real-world scenarios.

For solving such scaling problems, link prediction models are often composed of simple operations, like inner products and matrix multiplications over an embedding space, and use a limited number of parameters (Nickel et al. 2016). DistMult (Yang et al. 2015) is such a model, characterised by three-way interactions between embedding parameters, which produce one feature per parameter. Using such simple, fast, shallow models allows one to scale to large knowledge graphs, at the cost of learning less expressive features.

The only way to increase the number of features in shallow models – and thus their expressiveness – is to increase the embedding size. However, doing so does not scale to larger knowledge graphs, since the total number of embedding parameters is proportional to the number of entities and relations in the graph. For example, a shallow model like DistMult with an embedding size of 200, applied to Freebase, will need 33 GB of memory for its parameters. To increase the number of features independently of the embedding size requires the use of multiple layers of features. However, previous multi-layer knowledge graph embedding architectures, which feature fully connected layers, are prone to overfit (Nickel et al. 2016). One way to solve the scaling problem of shallow architectures, and the overfitting problem of fully connected deep architectures, is to use parameter efficient, fast operators which can be composed into deep networks.
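As a rough sanity check of the 33 GB figure, consider the memory taken by the entity embedding table alone. This is a back-of-the-envelope sketch; the entity count used below is an assumption for illustration (Freebase is commonly reported to contain tens of millions of entities) and is not a number given in this paper.

```python
# Back-of-the-envelope check of the 33 GB figure quoted above.
num_entities = 41_000_000   # assumed entity count, for illustration only
embedding_size = 200        # as in the DistMult example above
bytes_per_float = 4         # float32

entity_table_gb = num_entities * embedding_size * bytes_per_float / 1e9
print(f"entity embedding table alone: ~{entity_table_gb:.1f} GB")
# -> roughly 33 GB, before even counting relation embeddings
```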
The convolution operator, commonly used in computer vision, has exactly these properties: it is parameter efficient and fast to compute, due to highly optimised GPU implementations. Furthermore, due to its ubiquitous use, robust methodologies have been established to control overfitting when training multi-layer convolutional networks (Szegedy et al. 2015; Ioffe and Szegedy 2015; Srivastava et al. 2014; Szegedy et al. 2016).

In this paper we introduce ConvE, a model that uses 2D convolutions over embeddings to predict missing links in knowledge graphs. ConvE is the simplest multi-layer convolutional architecture for link prediction: it is defined by a single convolution layer, a projection layer to the embedding dimension, and an inner product layer.

∗This work was conducted during a research visit to University College London.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Specifically, our contributions are as follows:

• Introducing a simple, competitive 2D convolutional link prediction model, ConvE.
• Developing a 1-N scoring procedure that speeds up training three-fold and evaluation by 300x.
• Establishing that our model is highly parameter efficient, achieving better scores than DistMult and R-GCNs on FB15k-237 with 8x and 17x fewer parameters.
• Showing that for increasingly complex knowledge graphs, as measured by indegree and PageRank, the difference in performance between our model and a shallow model increases proportionally to the complexity of the graph.
• Systematically investigating reported inverse relations test set leakage across commonly used link prediction datasets, introducing robust versions of datasets where necessary, so that they cannot be solved using simple rule-based models.
• Evaluating ConvE and several previously proposed models on these robust datasets: our model achieves state-of-the-art Mean Reciprocal Rank across all of them.

Related Work

Several neural link prediction models have been proposed in the literature, such as the Translating Embeddings model (TransE) (Bordes et al. 2013a), the Bilinear Diagonal model (DistMult) (Yang et al. 2015) and its extension in the complex space (ComplEx) (Trouillon et al. 2016); we refer to Nickel et al. (2016) for a recent survey. The model that is most closely related to this work is most likely the Holographic Embeddings model (HolE) (Nickel, Rosasco, and Poggio 2016), which uses cross-correlation – the inverse of circular convolution – for matching entity embeddings; it is inspired by holographic models of associative memory. However, HolE does not learn multiple layers of non-linear features, and it is thus theoretically less expressive than our model.

To the best of our knowledge, our model is the first neural link prediction model to use 2D convolutional layers. Graph Convolutional Networks (GCNs) (Duvenaud et al. 2015; Defferrard, Bresson, and Vandergheynst 2016; Kipf and Welling 2016) are a related line of research, where the convolution operator is generalised to use locality information in graphs. However, the GCN framework is limited to undirected graphs, while knowledge graphs are naturally directed, and suffers from potentially prohibitive memory requirements (Kipf and Welling 2016). Relational GCNs (R-GCNs) (Schlichtkrull et al. 2017) are a generalisation of GCNs developed for dealing with highly multi-relational data such as knowledge graphs – we include them in our experimental evaluations.

Several convolutional models have been proposed in natural language processing (NLP) for solving a variety of tasks, including semantic parsing (Yih et al. 2011), sentence classification (Kim 2014), search query retrieval (Shen et al. 2014), and sentence modelling (Kalchbrenner, Grefenstette, and Blunsom 2014), as well as other NLP tasks (Collobert et al. 2011). However, most work in NLP uses 1D-convolutions, that is, convolutions which operate over a temporal sequence of embeddings, for example a sequence of words in embedding space. In this work, we use 2D-convolutions which operate spatially over embeddings.

Number of Interactions for 1D vs 2D Convolutions

Using 2D rather than 1D convolutions increases the expressiveness of our model through additional points of interaction between embeddings. For example, consider the case where we concatenate two rows of 1D embeddings, a and b, with dimension n = 3:

    ([a a a] ; [b b b]) = [a a a b b b].

A padded 1D convolution with filter size k = 3 will be able to model the interactions between these two embeddings around the concatenation point (with a number of interactions proportional to k).

If we concatenate (i.e. stack) two rows of 2D embeddings with dimension m × n, where m = 2 and n = 3, we obtain the following:

    [a a a]   [b b b]   [a a a]
    [a a a] ; [b b b] = [a a a]
                        [b b b]
                        [b b b]

A padded 2D convolution with filter size 3 × 3 will be able to model the interactions around the entire concatenation line (with a number of interactions proportional to n and k).

We can extend this principle to an alternating pattern, such as the following:

    [a a a]
    [b b b]
    [a a a]
    [b b b]

In this case, a 2D convolution operation is able to model even more interactions between a and b (with a number of interactions proportional to m, n, and k). Thus, 2D convolution is able to extract more feature interactions between two embeddings compared to 1D convolution. The same principle can be extended to higher dimensional convolutions, but we leave this as future work.
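The interaction counts above can be made concrete with a small sketch that is not part of the paper: mark the cells belonging to a and b, run a padded convolution with an all-ones filter over each mask, and count the output positions whose receptive field contains cells of both embeddings.

```python
# Count how many output positions of a padded convolution "see" entries of
# both embeddings a and b, for the 1D concatenation vs. the 2D stacking above.
import torch
import torch.nn.functional as F

n = 3                                                     # cells per row, as in the example
# 1D case: [a a a b b b]
a_mask_1d = torch.tensor([[[1., 1., 1., 0., 0., 0.]]])    # 1 where an a-cell sits
b_mask_1d = torch.tensor([[[0., 0., 0., 1., 1., 1.]]])
ones_k = torch.ones(1, 1, 3)                              # filter size k = 3
sees_a = F.conv1d(a_mask_1d, ones_k, padding=1) > 0
sees_b = F.conv1d(b_mask_1d, ones_k, padding=1) > 0
print("1D mixed positions:", (sees_a & sees_b).sum().item())   # only around the join

# 2D case: rows [a a a], [a a a], [b b b], [b b b]
a_mask_2d = torch.tensor([[[[1.]*n, [1.]*n, [0.]*n, [0.]*n]]])
b_mask_2d = 1.0 - a_mask_2d
ones_kk = torch.ones(1, 1, 3, 3)                          # 3x3 filter
sees_a = F.conv2d(a_mask_2d, ones_kk, padding=1) > 0
sees_b = F.conv2d(b_mask_2d, ones_kk, padding=1) > 0
print("2D mixed positions:", (sees_a & sees_b).sum().item())   # a whole line of interactions

# Alternating rows [a], [b], [a], [b]: every output position mixes a and b
a_alt = torch.tensor([[[[1.]*n, [0.]*n, [1.]*n, [0.]*n]]])
sees_a = F.conv2d(a_alt, ones_kk, padding=1) > 0
sees_b = F.conv2d(1.0 - a_alt, ones_kk, padding=1) > 0
print("alternating mixed positions:", (sees_a & sees_b).sum().item())
```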
Background

A knowledge graph G = {(s, r, o)} ⊆ E × R × E can be formalised as a set of triples (facts), each consisting of a relationship r ∈ R and two entities s, o ∈ E, referred to as the subject and object of the triple. Each triple (s, r, o) denotes a relationship of type r between the entities s and o.

The link prediction problem can be formalised as a pointwise learning to rank problem, where the objective is learning a scoring function ψ : E × R × E → R. Given an input triple x = (s, r, o), its score ψ(x) ∈ R is proportional to the likelihood that the fact encoded by x is true.

Neural Link Predictors

Neural link prediction models (Nickel et al. 2016) can be seen as multi-layer neural networks consisting of an encoding component and a scoring component. Given an input triple (s, r, o), the encoding component maps entities s, o ∈ E to their distributed embedding representations es, eo ∈ R^k. In the scoring component, the two entity embeddings es and eo are scored by a function ψr. The score of a triple (s, r, o) is defined as ψ(s, r, o) = ψr(es, eo) ∈ R.

Table 1: Scoring functions ψr(es, eo) from neural link predictors in the literature, their relation-dependent parameters and space complexity; ne and nr respectively denote the number of entities and relation types, i.e. ne = |E| and nr = |R|.

Model | Scoring Function ψr(es, eo) | Relation Parameters | Space Complexity
SE (Bordes et al. 2014) | ‖W_r^L es − W_r^R eo‖_p | W_r^L, W_r^R ∈ R^(k×k) | O(ne k + nr k²)
TransE (Bordes et al. 2013a) | ‖es + rr − eo‖_p | rr ∈ R^k | O(ne k + nr k)
DistMult (Yang et al. 2015) | ⟨es, rr, eo⟩ | rr ∈ R^k | O(ne k + nr k)
ComplEx (Trouillon et al. 2016) | ⟨es, rr, eo⟩ | rr ∈ C^k | O(ne k + nr k)
ConvE | f(vec(f([ēs; r̄r] ∗ ω)) W) eo | rr ∈ R^(k′) | O(ne k + nr k′)

In Table 1 we summarise the scoring functions of several link prediction models from the literature. The vectors es and eo denote the subject and object embedding, where es, eo ∈ C^k in ComplEx and es, eo ∈ R^k in all other models; ⟨x, y, z⟩ = Σᵢ xᵢ yᵢ zᵢ denotes the tri-linear dot product; ∗ denotes the convolution operator; and f denotes a non-linear function.
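For concreteness, here is how the shallow scoring functions in Table 1 act on a single triple. This is an illustrative sketch, not taken from any of the cited implementations.

```python
# Illustrative sketch of the shallow scoring functions in Table 1 for a single
# triple. TransE is written as a negative distance so that a higher score
# always means "more plausible".
import torch

k = 4
e_s, e_o, r = torch.randn(k), torch.randn(k), torch.randn(k)

transe_score = -torch.norm(e_s + r - e_o, p=1)   # ||e_s + r - e_o||_p, negated
distmult_score = (e_s * r * e_o).sum()           # tri-linear product <e_s, r, e_o>

# ComplEx uses the same tri-linear form with complex embeddings; following
# Trouillon et al. (2016), the object embedding is conjugated and the real
# part of the result is taken.
e_s_c, e_o_c, r_c = (torch.complex(torch.randn(k), torch.randn(k)) for _ in range(3))
complex_score = torch.real((e_s_c * r_c * torch.conj(e_o_c)).sum())
```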
Convolutional 2D Knowledge Graph Embeddings

In this work we propose a neural link prediction model where the interactions between input entities and relationships are modelled by convolutional and fully-connected layers. The main characteristic of our model is that the score is defined by a convolution over 2D shaped embeddings. The architecture is summarised in Figure 1; formally, the scoring function is defined as follows:

    ψr(es, eo) = f(vec(f([ēs; r̄r] ∗ ω)) W) eo,    (1)

where rr ∈ R^(k′) is a relation parameter depending on r, and ēs and r̄r denote a 2D reshaping of es and rr, respectively: if es, rr ∈ R^k, then ēs, r̄r ∈ R^(kw×kh), where k = kw kh.

In the feed-forward pass, the model performs a row-vector look-up operation on two embedding matrices, one for entities, denoted E ∈ R^(|E|×k), and one for relations, denoted R ∈ R^(|R|×k′), where k and k′ are the entity and relation embedding dimensions, and |E| and |R| denote the number of entities and relations. The model then concatenates ēs and r̄r, and uses the result as input for a 2D convolutional layer with filters ω. Such a layer returns a feature map tensor T ∈ R^(c×m×n), where c is the number of 2D feature maps with dimensions m and n. The tensor T is then reshaped into a vector vec(T) ∈ R^(cmn), which is projected into a k-dimensional space using a linear transformation parametrised by the matrix W ∈ R^(cmn×k), and matched with the object embedding eo via an inner product. The parameters of the convolutional filters and the matrix W are independent of the parameters for the entities s and o and the relationship r.
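A minimal sketch of this feed-forward pass in PyTorch is given below. It is not the released implementation: the 10 × 20 reshape, the filter count, and the layer sizes are illustrative assumptions, and the batch normalisation and dropout described below are omitted for brevity. The last line already performs the 1-N scoring discussed in the following section.

```python
# Minimal sketch of the ConvE scoring function in Equation 1 (illustrative
# assumptions throughout; batch norm and dropout omitted for brevity).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEScorer(nn.Module):
    def __init__(self, num_entities, num_relations, k=200, kw=10, kh=20, channels=32):
        super().__init__()
        assert kw * kh == k
        self.kw, self.kh = kw, kh
        self.entity_emb = nn.Embedding(num_entities, k)
        self.relation_emb = nn.Embedding(num_relations, k)
        self.conv = nn.Conv2d(1, channels, kernel_size=3, padding=0)
        # input to the convolution is (2*kw) x kh after stacking subject and relation
        conv_out = channels * (2 * kw - 2) * (kh - 2)
        self.proj = nn.Linear(conv_out, k)                # the matrix W in Equation 1

    def forward(self, s_idx, r_idx):
        # steps 1-2: look up embeddings, reshape to 2D, and stack them
        e_s = self.entity_emb(s_idx).view(-1, 1, self.kw, self.kh)
        r_r = self.relation_emb(r_idx).view(-1, 1, self.kw, self.kh)
        x = torch.cat([e_s, r_r], dim=2)                  # (batch, 1, 2*kw, kh)
        # step 3: 2D convolution and non-linearity
        x = F.relu(self.conv(x))
        # step 4: vectorise the feature maps and project back to k dimensions
        x = F.relu(self.proj(x.flatten(start_dim=1)))     # (batch, k)
        # step 5: 1-N scoring, i.e. an inner product with every object embedding
        return x @ self.entity_emb.weight.t()             # (batch, num_entities)
```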
For training the model parameters, we apply the logistic sigmoid function σ(·) to the scores, that is p = σ(ψr(es, eo)), and minimise the following binary cross-entropy loss:

    L(p, t) = −(1/N) Σᵢ (tᵢ · log(pᵢ) + (1 − tᵢ) · log(1 − pᵢ)),    (2)

where t is the label vector with dimension R^(1×1) for 1-1 scoring or R^(1×N) for 1-N scoring (see the next section for 1-N scoring); the elements of vector t are ones for relationships that exist and zero otherwise.

We use rectified linear units as the non-linearity f for faster training (Krizhevsky, Sutskever, and Hinton 2012), and batch normalisation after each layer to stabilise, regularise and increase the rate of convergence (Ioffe and Szegedy 2015). We regularise our model by using dropout (Srivastava et al. 2014) in several stages. In particular, we use dropout on the embeddings, on the feature maps after the convolution operation, and on the hidden units after the fully connected layer. We use Adam as optimiser (Kingma and Ba 2014), and label smoothing to lessen overfitting due to saturation of output non-linearities at the labels (Szegedy et al. 2016).

Fast Evaluation for Link Prediction Tasks

In our architecture, convolution consumes about 75-90% of the total computation time, thus it is important to minimise the number of convolution operations to speed up computation as much as possible. For link prediction models, the batch size is usually increased to speed up evaluation (Bordes et al. 2013b). However, this is not feasible for convolutional models, since the memory requirements quickly outgrow the GPU memory capacity when increasing the batch size.

Unlike other link prediction models, which take an entity pair and a relation as a triple (s, r, o) and score it (1-1 scoring), we take one (s, r) pair and score it against all entities o ∈ E simultaneously (1-N scoring). If we benchmark 1-1 scoring on a high-end GPU with batch size and embedding size 128, then a training pass and an evaluation with a convolutional model on FB15k – one of the datasets used in the experiments – take 2.4 minutes and 3.34 hours, respectively. Using 1-N scoring, the respective numbers are 45 and 35 seconds – a considerable improvement of over 300x in terms of evaluation time. Additionally, this approach is scalable to large knowledge graphs and increases convergence speed. For a single forward-backward pass with a batch size of 128, going from N = 100,000 to N = 1,000,000 entities only increases the computation time from 64ms to 80ms – in other words, a ten-fold increase in the number of entities only increases the computation time by 25% – which attests to the scalability of the approach.
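The loss in Equation 2 combined with 1-N scoring can be put together in a few lines. The sketch below is self-contained rather than the paper's code: a DistMult-style scorer stands in for ConvE, the dataset sizes are made up, and the label-smoothing formula is one common variant, used here only to illustrate where smoothing enters.

```python
# Sketch of one 1-N training step with the binary cross-entropy loss of
# Equation 2 and label smoothing. A DistMult-style scorer stands in for ConvE
# so the example stays self-contained; the dataset sizes are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_entities, num_relations, k = 10_000, 200, 200
entity_emb = nn.Embedding(num_entities, k)
relation_emb = nn.Embedding(num_relations, k)
params = list(entity_emb.parameters()) + list(relation_emb.parameters())
optimizer = torch.optim.Adam(params, lr=0.001)

def score_all_objects(s_idx, r_idx):
    # 1-N scoring: one (s, r) pair is scored against every entity at once
    return (entity_emb(s_idx) * relation_emb(r_idx)) @ entity_emb.weight.t()

# a toy batch of (s, r) pairs and, for each, one known true object
s_idx = torch.randint(num_entities, (128,))
r_idx = torch.randint(num_relations, (128,))
targets = torch.zeros(128, num_entities)
targets[torch.arange(128), torch.randint(num_entities, (128,))] = 1.0

label_smoothing = 0.1   # one common smoothing variant, for illustration
targets = (1.0 - label_smoothing) * targets + label_smoothing / num_entities

optimizer.zero_grad()
loss = F.binary_cross_entropy_with_logits(score_all_objects(s_idx, r_idx), targets)
loss.backward()
optimizer.step()
```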

Figure 1: In the ConvE model, the entity and relation embeddings are first reshaped and concatenated (steps 1, 2); the resulting
matrix is then used as input to a convolutional layer (step 3); the resulting feature map tensor is vectorised and projected into a
k-dimensional space (step 4) and matched with all candidate object embeddings (step 5).


If instead of 1-N scoring we use 1-(0.1N) scoring – that is, scoring against 10% of the entities – we can compute a forward-backward pass 25% faster. However, we converge roughly 230% slower on the training set. Thus 1-N scoring has an additional effect which is akin to batch normalisation (Ioffe and Szegedy 2015) – we trade some computational performance for greatly increased convergence speed, and also achieve better performance, as shown in the ablation study. Note that this practical trick for speeding up training and evaluation can be applied to any 1-1 scoring model, that is, to the great majority of link prediction models.

Experiments

Knowledge Graph Datasets

For evaluating our proposed model, we use a selection of link prediction datasets from the literature.

WN18 (Bordes et al. 2013a) is a subset of WordNet which consists of 18 relations and 40,943 entities. Most of the 151,442 triples consist of hyponym and hypernym relations and, for such a reason, WN18 tends to follow a strictly hierarchical structure.

FB15k (Bordes et al. 2013a) is a subset of Freebase which contains about 14,951 entities with 1,345 different relations. A large fraction of content in this knowledge graph describes facts about movies, actors, awards, sports, and sport teams.

YAGO3-10 (Mahdisoltani, Biega, and Suchanek 2015) is a subset of YAGO3 which consists of entities which have a minimum of 10 relations each. It has 123,182 entities and 37 relations. Most of the triples deal with descriptive attributes of people, such as citizenship, gender, and profession.

Countries (Bouchard, Singh, and Trouillon 2015) is a benchmark dataset that is useful to evaluate a model's ability to learn long-range dependencies between entities and relations. It consists of three sub-tasks which increase in difficulty in a step-wise fashion, where the minimum path-length to find a solution increases from 2 to 4.

It was first noted by Toutanova and Chen (2015) that WN18 and FB15k suffer from test leakage through inverse relations: a large number of test triples can be obtained simply by inverting triples in the training set. For example, the test set frequently contains triples such as (s, hyponym, o) while the training set contains its inverse (o, hypernym, s). To create a dataset without this property, Toutanova and Chen (2015) introduced FB15k-237 – a subset of FB15k where inverse relations are removed. However, they did not explicitly investigate the severity of this problem, which might explain why research continues to use these datasets for evaluation without addressing this issue (e.g. Trouillon et al. (2016), Nickel, Rosasco, and Poggio (2016), Nguyen et al. (2016), Liu et al. (2016)).

In the following section, we introduce a simple rule-based model which demonstrates the severity of this bias by achieving state-of-the-art results on both WN18 and FB15k. In order to ensure that we evaluate on datasets that do not have inverse relation test leakage, we apply our simple rule-based model to each dataset. Apart from FB15k, which was corrected by FB15k-237, we also find flaws with WN18. We thus create WN18RR to reclaim WN18 as a dataset which cannot easily be completed using a single rule, but instead requires modelling of the complete knowledge graph. WN18RR¹ contains 93,003 triples with 40,943 entities and 11 relations. For future research, we recommend against using FB15k and WN18, and instead recommend FB15k-237, WN18RR, and YAGO3-10.

¹ https://github.com/TimDettmers/ConvE

Experimental Setup

We selected the hyperparameters of our ConvE model via grid search according to the mean reciprocal rank (MRR) on the validation set. Hyperparameter ranges for the grid search were as follows – embedding dropout {0.0, 0.1, 0.2}, feature map dropout {0.0, 0.1, 0.2, 0.3}, projection layer dropout {0.0, 0.1, 0.3, 0.5}, embedding size {100, 200}, batch size {64, 128, 256}, learning rate {0.001, 0.003}, and label smoothing {0.0, 0.1, 0.2, 0.3}.

Besides the grid search, we investigated modifications of the 2D convolution layer for our models. In particular, we experimented with replacing it with fully connected layers and 1D convolution; however, these modifications consistently reduced the predictive accuracy of the model. We also experimented with different filter sizes, and found that we only obtain good results if the first convolutional layer uses small (i.e. 3x3) filters.

We found that the following combination of parameters works well on WN18, YAGO3-10 and FB15k: embedding dropout 0.2, feature map dropout 0.2, projection layer dropout 0.3, embedding size 200, batch size 128, learning rate 0.001, and label smoothing 0.1. For the Countries dataset, we increase embedding dropout to 0.3, hidden dropout to 0.5, and set label smoothing to 0. We use early stopping according to the mean reciprocal rank (WN18, FB15k, YAGO3-10) and AUC-PR (Countries) statistics on the validation set, which we evaluate every three epochs. Unlike the other datasets, the results for Countries have a high variance; as such, we average 10 runs and produce 95% confidence intervals. For our DistMult and ComplEx results with 1-1 training, we use an embedding size of 100, AdaGrad (Duchi, Hazan, and Singer 2011) for optimisation, and we regularise our model by forcing the entity embeddings to have an L2 norm of 1 after each parameter update. As in Bordes et al. (2013a), we use a pairwise margin-based ranking loss.
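For convenience, the best-performing configuration reported above can be written out as a plain config dict; the key names below are illustrative, not identifiers from the released code.

```python
# Best-performing ConvE configuration as reported above (key names are
# illustrative only).
conve_config = {
    "embedding_size": 200,
    "embedding_dropout": 0.2,
    "feature_map_dropout": 0.2,
    "projection_layer_dropout": 0.3,
    "batch_size": 128,
    "learning_rate": 0.001,
    "label_smoothing": 0.1,  # Countries: emb. dropout 0.3, hidden dropout 0.5, smoothing 0.0
}
```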
The code for our model and experiments is made publicly available,² as well as the code for replicating the DistMult results.³

² https://github.com/TimDettmers/ConvE
³ https://github.com/uclmr/inferbeddings
Inverse Model

It has been noted by Toutanova and Chen (2015) that the training datasets of WN18 and FB15k have 94% and 81% test leakage as inverse relations; that is, 94% and 81% of the triples in these datasets have inverse relations which are linked to the test set. For instance, a test triple (feline, hyponym, cat) can easily be mapped to a training triple (cat, hypernym, feline) if it is known that hyponym is the inverse of hypernym. This is highly problematic, because link predictors that do well on these datasets may simply learn which relations are the inverse of others, rather than model the actual knowledge graph.

To gauge the severity of this problem, we construct a simple, rule-based model that solely models inverse relations. We call this model the inverse model. The model extracts inverse relationships automatically from the training set: given a pair of relations r1, r2 ∈ R, we check whether (s, r1, o) implies (o, r2, s), or vice-versa.

We assume that inverse relations are randomly distributed among the training, validation and test sets and, as such, we expect the number of inverse relations to be proportional to the size of the training set compared to the total dataset size. Thus, we detect inverse relations if the presence of (s, r1, o) co-occurs with the presence of (o, r2, s) with a frequency of at least 0.99 − (fv + ft), where fv and ft are the fractions of the validation and test sets compared to the total size of the dataset. Relations matching this criterion are assumed to be the inverse of each other.

At test time, we check if the test triple has inverse matches outside the test set: if k matches are found, we sample a permutation of the top k ranks for these matches; if no match is found, we select a random rank for the test triple.
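A sketch of this detection rule is given below. It is not the code released with the paper, and the data-handling details (lists of (s, r, o) tuples) are assumptions made to keep the example self-contained.

```python
# Sketch of the inverse-relation detection rule described above: two relations
# are flagged as inverses of each other if (s, r1, o) in the training set
# co-occurs with (o, r2, s) at a frequency of at least 0.99 - (f_v + f_t).
from collections import defaultdict
from itertools import product

def detect_inverse_relations(train, valid, test):
    """train/valid/test are lists of (s, r, o) triples."""
    total = len(train) + len(valid) + len(test)
    threshold = 0.99 - (len(valid) + len(test)) / total

    triples = set(train)
    by_relation = defaultdict(list)
    for s, r, o in train:
        by_relation[r].append((s, o))

    inverse_pairs = set()
    relations = list(by_relation)
    for r1, r2 in product(relations, relations):
        pairs = by_relation[r1]
        hits = sum((o, r2, s) in triples for s, o in pairs)
        if hits / len(pairs) >= threshold:
            inverse_pairs.add((r1, r2))   # also catches symmetric relations (r1 == r2)
    return inverse_pairs
```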
Results

Similarly to previous work (Yang et al. 2015; Trouillon et al. 2016; Niepert 2016), we report results using a filtered setting, i.e. we rank test triples against all other candidate triples not appearing in the training, validation, or test set (Bordes et al. 2013a). Candidates are obtained by permuting either the subject or the object of a test triple with all entities in the knowledge graph. Our results on the standard benchmarks FB15k and WN18 are shown in Table 3; results on the datasets with inverse relations removed are shown in Table 4; results on YAGO3-10 and Countries are shown in Table 5.
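The filtered evaluation protocol can be summarised in a short sketch, shown below for the object side only (the subject side is symmetric). This illustrates the protocol rather than reproducing the evaluation code behind the reported numbers; the scoring function is assumed to return one score per candidate entity.

```python
# Sketch of filtered ranking for one test triple (object side): candidate
# objects that form another known true triple are excluded before the rank of
# the correct object is computed. MRR and Hits@10 then average over all test
# triples.
import numpy as np

def filtered_object_rank(s, r, o, scores, all_true_triples):
    """scores: 1D array with one score per entity id; higher is better."""
    scores = scores.copy()
    for cand in range(len(scores)):
        if cand != o and (s, r, cand) in all_true_triples:
            scores[cand] = -np.inf          # filter out other true triples
    return int((scores > scores[o]).sum()) + 1   # rank of the correct object (1 = best)

def mrr_and_hits_at_10(ranks):
    ranks = np.asarray(ranks, dtype=float)
    return (1.0 / ranks).mean(), (ranks <= 10).mean()
```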
Strikingly, the inverse model achieves state-of-the-art results on many different metrics for both FB15k and WN18. However, it fails to pick up on inverse relations for YAGO3-10 and FB15k-237. The procedure used by Toutanova and Chen (2015) to derive FB15k-237 does not remove certain symmetric relationships, for example "similar to". The presence of these relationships explains the good score of our inverse model on WN18RR, which was derived using the same procedure.

Our proposed model, ConvE, achieves state-of-the-art performance for all metrics on YAGO3-10, for some metrics on FB15k, and it does well on WN18. On Countries, it solves the S1 and S2 tasks, and does well on S3, scoring better than other models like DistMult and ComplEx.

For FB15k-237, we could not replicate the basic model results from Toutanova et al. (2015), where the models in general have better performance than what we can achieve. Compared to Schlichtkrull et al. (2017), our results for standard models are slightly better than theirs, and on a par with their R-GCN model.

Table 3: Link prediction results for WN18 and FB15k.

Model | WN18 MR | WN18 MRR | WN18 Hits@10 | WN18 Hits@3 | WN18 Hits@1 | FB15k MR | FB15k MRR | FB15k Hits@10 | FB15k Hits@3 | FB15k Hits@1
DistMult (Yang et al. 2015) | 902 | .822 | .936 | .914 | .728 | 97 | .654 | .824 | .733 | .546
ComplEx (Trouillon et al. 2016) | – | .941 | .947 | .936 | .936 | – | .692 | .840 | .759 | .599
Gaifman (Niepert 2016) | 352 | – | .939 | – | .761 | 75 | – | .842 | – | .692
ANALOGY (Liu, Wu, and Yang 2017) | – | .942 | .947 | .944 | .939 | – | .725 | .854 | .785 | .646
R-GCN (Schlichtkrull et al. 2017) | – | .814 | .964 | .929 | .697 | – | .696 | .842 | .760 | .601
ConvE | 504 | .942 | .955 | .947 | .935 | 64 | .745 | .873 | .801 | .670
Inverse Model | 567 | .861 | .969 | .968 | .764 | 1897 | .706 | .737 | .718 | .689

Table 4: Link prediction results for WN18RR and FB15k-237.

Model | WN18RR MR | WN18RR MRR | WN18RR Hits@10 | WN18RR Hits@3 | WN18RR Hits@1 | FB15k-237 MR | FB15k-237 MRR | FB15k-237 Hits@10 | FB15k-237 Hits@3 | FB15k-237 Hits@1
DistMult (Yang et al. 2015) | 5110 | .43 | .49 | .44 | .39 | 254 | .241 | .419 | .263 | .155
ComplEx (Trouillon et al. 2016) | 5261 | .44 | .51 | .46 | .41 | 339 | .247 | .428 | .275 | .158
R-GCN (Schlichtkrull et al. 2017) | – | – | – | – | – | – | .248 | .417 | .258 | .153
ConvE | 5277 | .46 | .48 | .43 | .39 | 246 | .316 | .491 | .350 | .239
Inverse Model | 13219 | .36 | .36 | .36 | .36 | 7148 | .009 | .012 | .010 | .006

Parameter efficiency of ConvE

Table 2: Parameter scaling of DistMult vs ConvE on FB15k-237.

Model | Param. count | Emb. size | MRR | Hits@10 | Hits@3 | Hits@1
DistMult | 1.89M | 128 | .23 | .41 | .25 | .15
DistMult | 0.95M | 64 | .22 | .39 | .25 | .14
DistMult | 0.23M | 16 | .16 | .31 | .17 | .09
ConvE | 5.05M | 200 | .32 | .49 | .35 | .23
ConvE | 1.89M | 96 | .32 | .49 | .35 | .23
ConvE | 0.95M | 54 | .30 | .46 | .33 | .22
ConvE | 0.46M | 28 | .28 | .43 | .30 | .20
ConvE | 0.23M | 14 | .26 | .40 | .28 | .19

From Table 2 we can see that ConvE for FB15k-237 with 0.23M parameters performs better than DistMult with 1.89M parameters on 3 metrics out of 5. ConvE with 0.46M parameters still achieves state-of-the-art results on FB15k-237 with 0.425 Hits@10, compared to the previous best model, R-GCN (Schlichtkrull et al. 2017), which achieves 0.417 Hits@10 with more than 8M parameters.

Overall, ConvE is more than 17x more parameter efficient than R-GCNs, and 8x more parameter efficient than DistMult. For the entirety of Freebase, the size of these models would be more than 82GB for R-GCNs and 21GB for DistMult, compared to 5.2GB for ConvE.

Analysis

Ablation Study

Table 7 shows the results from our ablation study, where we evaluate different parameter initialisations (n = 2) to calculate confidence intervals. We see that hidden dropout is by far the most important component, which is unsurprising since it is our main regularisation technique. 1-N scoring improves performance, as does input dropout; feature map dropout has a minor effect, while label smoothing seems to be unimportant – good results can be achieved without it.

Analysis of Indegree and PageRank

Our main hypothesis for the good performance of our model on datasets like YAGO3-10 and FB15k-237, compared to WN18RR, is that these datasets contain nodes with very high relation-specific indegree. For example, the node "United States" with edges "was born in" has an indegree of over 10,000. Many of these 10,000 nodes will be very different from each other (actors, writers, academics, politicians, business people), and successful modelling of such high-indegree nodes requires capturing all these differences. Our hypothesis is that deeper models, that is, models that learn multiple layers of features, like ConvE, have an advantage over shallow models, like DistMult, in capturing all these constraints.

However, deeper models are more difficult to optimise, so we hypothesise that for datasets with low average relation-specific indegree (like WN18RR and WN18), a shallow model like DistMult might suffice for accurately representing the structure of the network.
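The relation-specific indegree statistic used in this analysis can be computed in a few lines. The sketch below is illustrative (it assumes triples are available as (s, r, o) tuples) rather than the exact analysis script.

```python
# Sketch of the "relation-specific indegree" statistic discussed above: for a
# node o and a relation r, the number of distinct subjects s with (s, r, o) in
# the graph (e.g. how many people have "was born in" edges into "United States").
from collections import defaultdict

def relation_specific_indegree(triples):
    """triples: iterable of (s, r, o); returns {(r, o): indegree}."""
    subjects = defaultdict(set)
    for s, r, o in triples:
        subjects[(r, o)].add(s)
    return {key: len(subs) for key, subs in subjects.items()}

# e.g. sorted(relation_specific_indegree(train).items(), key=lambda kv: -kv[1])[:10]
```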
To test our two hypotheses, we take two datasets with low (low-WN18) and high (high-FB15k) relation-specific indegree and reverse them into high (high-WN18) and low (low-FB15k) relation-specific indegree datasets by deleting low and high indegree nodes. We hypothesise that, compared to DistMult, ConvE will always do better on the dataset with high relation-specific indegree, and vice-versa.

Indeed, we find that both hypotheses hold: for low-FB15k we have ConvE 0.586 Hits@10 vs DistMult 0.728 Hits@10; for high-WN18 we have ConvE 0.952 Hits@10 vs DistMult 0.938 Hits@10. This supports our hypothesis that deeper models such as ConvE have an advantage in modelling more complex graphs (e.g. FB15k and FB15k-237), but that shallow models such as DistMult have an advantage in modelling less complex graphs (e.g. WN18, WN18RR).

To investigate this further, we look at PageRank, a measure of the centrality of a node. PageRank can also be seen as a measure of the recursive indegree of a node: the PageRank value of a node is proportional to the indegree of this node, its neighbours' indegrees, its neighbours' neighbours' indegrees, and so forth, scaled relative to all other nodes in the network.
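The statistic reported in Table 6 can be approximated with a short sketch like the one below, which treats the knowledge graph as a plain directed graph and averages PageRank over test-set nodes. It uses networkx with its default damping factor; neither detail is specified in the paper, so treat this purely as an illustration.

```python
# Sketch: mean PageRank of test-set nodes, with the knowledge graph viewed as
# a plain directed graph (relation labels ignored).
import networkx as nx

def mean_test_set_pagerank(train_triples, test_triples):
    graph = nx.DiGraph()
    graph.add_edges_from((s, o) for s, _, o in train_triples)
    pagerank = nx.pagerank(graph)                      # recursive indegree measure
    test_nodes = {n for s, _, o in test_triples for n in (s, o)}
    present = [pagerank[n] for n in test_nodes if n in pagerank]
    return sum(present) / len(present)
```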

Table 5: Link prediction results for YAGO3-10 and Countries.

Model | YAGO3-10 MR | YAGO3-10 MRR | YAGO3-10 Hits@10 | YAGO3-10 Hits@3 | YAGO3-10 Hits@1 | Countries S1 AUC-PR | Countries S2 AUC-PR | Countries S3 AUC-PR
DistMult (Yang et al. 2015) | 5926 | .34 | .54 | .38 | .24 | 1.00 ± 0.00 | 0.72 ± 0.12 | 0.52 ± 0.07
ComplEx (Trouillon et al. 2016) | 6351 | .36 | .55 | .40 | .26 | 0.97 ± 0.02 | 0.57 ± 0.10 | 0.43 ± 0.07
ConvE | 2792 | .52 | .66 | .56 | .45 | 1.00 ± 0.00 | 0.99 ± 0.01 | 0.86 ± 0.05
Inverse Model | 60251 | .02 | .02 | .02 | .01 | – | – | –

Table 6: Mean PageRank ×10⁻³ of nodes in the test set vs reduction in error in terms of AUC-PR or Hits@10 of ConvE wrt. DistMult.

Dataset | PageRank | Error Reduction
WN18RR | 0.104 | 0.91
WN18 | 0.125 | 1.28
FB15k | 0.599 | 1.23
FB15k-237 | 0.733 | 1.17
YAGO3-10 | 0.988 | 1.91
Countries S3 | 1.415 | 3.36
Countries S1 | 1.711 | 0.00
Countries S2 | 1.796 | 18.6

Table 7: Ablation study for FB15k-237.

Ablation | Hits@10
Full ConvE | 0.491
Hidden dropout | −0.044 ± 0.003
Input dropout | −0.022 ± 0.000
1-N scoring | −0.019
Feature map dropout | −0.013 ± 0.001
Label smoothing | −0.008 ± 0.000

By this line of reasoning, we also expect ConvE to be better than DistMult on datasets with high average PageRank (high connectivity graphs), and vice-versa.

To test this hypothesis, we calculate the PageRank for each dataset as a measure of centrality. We find that the most central nodes in WN18 have a PageRank value more than one order of magnitude smaller than the most central nodes in YAGO3-10 and Countries, and about 4 times smaller than the most central nodes in FB15k. When we look at the mean PageRank of nodes contained in the test sets, we find that the difference in performance in terms of Hits@10 between DistMult and ConvE is roughly proportional to the mean test set PageRank: that is, the higher the mean PageRank of the test set nodes, the better ConvE does compared to DistMult, and vice-versa. See Table 6 for these statistics. The correlation between mean test set PageRank and relative error reduction of ConvE compared to DistMult is strong, with r = 0.83. This gives additional evidence that deeper models have an advantage when modelling nodes with high (recursive) indegree.

From this evidence we conclude that the increased performance of our model compared to a standard link predictor, DistMult, can be partially explained by its ability to model nodes with high indegree with greater precision – which is possibly related to its depth.

Conclusion and Future Work

We introduced ConvE, a link prediction model that uses 2D convolution over embeddings and multiple layers of non-linear features to model knowledge graphs. ConvE uses fewer parameters; it is fast through 1-N scoring; it is expressive through multiple layers of non-linear features; it is robust to overfitting due to batch normalisation and dropout; and it achieves state-of-the-art results on several datasets, while still scaling to large knowledge graphs. In our analysis, we show that the performance of ConvE compared to a common link predictor, DistMult, can partially be explained by its ability to model nodes with high (recursive) indegree.

Test leakage through inverse relations in WN18 and FB15k was first reported by Toutanova and Chen (2015): we investigate the severity of this problem for commonly used datasets by introducing a simple rule-based model, and find that it can achieve state-of-the-art results on WN18 and FB15k. To ensure that robust versions of all investigated datasets exist, we derive WN18RR.

Our model is still shallow compared to convolutional architectures found in computer vision, and future work might deal with convolutional models of increasing depth. Further work might also look at the interpretation of 2D convolution, or at how to enforce large-scale structure in embedding space so as to increase the number of interactions between embeddings.

Acknowledgments

We would like to thank Johannes Welbl and Peter Hayes for their feedback and helpful discussions related to this work. This work was supported by a Marie Curie Career Integration Award, an Allen Distinguished Investigator Award, a Google Europe Scholarship for Students with Disabilities, and the H2020 project SUMMA.

References

Bordes, A.; Usunier, N.; García-Durán, A.; Weston, J.; and Yakhnenko, O. 2013a. Translating Embeddings for Modeling Multi-relational Data. In Proceedings of NIPS, 2787–2795.

Bordes, A.; Usunier, N.; García-Durán, A.; Weston, J.; and Yakhnenko, O. 2013b. Translating Embeddings for Modeling Multi-relational Data. In Proceedings of NIPS, 2787–2795.

Bordes, A.; Glorot, X.; Weston, J.; and Bengio, Y. 2014. A semantic matching energy function for learning with multi-relational data - application to word-sense disambiguation. Machine Learning 94(2):233–259.

Bouchard, G.; Singh, S.; and Trouillon, T. 2015. On approximate reasoning capabilities of low-rank vector spaces. AAAI Spring Symposium on Knowledge Representation and Reasoning (KRR): Integrating Symbolic and Neural Approaches.

Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. P. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research 12:2493–2537.

Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Proceedings of NIPS, 3837–3845.

Dong, X.; Gabrilovich, E.; Heitz, G.; Horn, W.; Lao, N.; Murphy, K.; Strohmann, T.; Sun, S.; and Zhang, W. 2014. Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion. In Proceedings of KDD 2014, 601–610.

Duchi, J. C.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12:2121–2159.

Duvenaud, D. K.; Maclaurin, D.; Aguilera-Iparraguirre, J.; Bombarell, R.; Hirzel, T.; Aspuru-Guzik, A.; and Adams, R. P. 2015. Convolutional Networks on Graphs for Learning Molecular Fingerprints. In Proceedings of NIPS 2015, 2224–2232.

Ioffe, S., and Szegedy, C. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167.

Kalchbrenner, N.; Grefenstette, E.; and Blunsom, P. 2014. A Convolutional Neural Network for Modelling Sentences. In Proceedings of ACL 2014, Volume 1: Long Papers, 655–665.

Kim, Y. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of EMNLP 2014, 1746–1751.

Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kipf, T. N., and Welling, M. 2016. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of ICLR 2016.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of NIPS 2012, 1097–1105.

Krompaß, D.; Baier, S.; and Tresp, V. 2015. Type-Constrained Representation Learning in Knowledge Graphs. In Proceedings of ISWC 2015, 640–655.

Liu, Q.; Jiang, L.; Han, M.; Liu, Y.; and Qin, Z. 2016. Hierarchical random walk inference in knowledge graphs. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 445–454. ACM.

Liu, H.; Wu, Y.; and Yang, Y. 2017. Analogical Inference for Multi-Relational Embeddings. ArXiv e-prints.

Mahdisoltani, F.; Biega, J.; and Suchanek, F. M. 2015. YAGO3: A Knowledge Base from Multilingual Wikipedias. In Proceedings of CIDR 2015.

Nguyen, D. Q.; Sirts, K.; Qu, L.; and Johnson, M. 2016. STransE: a novel embedding model of entities and relationships in knowledge bases. arXiv preprint arXiv:1606.08140.

Nickel, M.; Murphy, K.; Tresp, V.; and Gabrilovich, E. 2016. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104(1):11–33.

Nickel, M.; Rosasco, L.; and Poggio, T. A. 2016. Holographic Embeddings of Knowledge Graphs. In Proceedings of AAAI, 1955–1961.

Niepert, M. 2016. Discriminative Gaifman Models. In Proceedings of NIPS 2016, 3405–3413.

Schlichtkrull, M.; Kipf, T. N.; Bloem, P.; Berg, R. v. d.; Titov, I.; and Welling, M. 2017. Modeling Relational Data with Graph Convolutional Networks. arXiv preprint arXiv:1703.06103.

Shen, Y.; He, X.; Gao, J.; Deng, L.; and Mesnil, G. 2014. Learning Semantic Representations Using Convolutional Neural Networks for Web Search. In Proceedings of WWW 2014, 373–374.

Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15(1):1929–1958.

Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In Proceedings of IEEE CVPR, 1–9.

Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the Inception Architecture for Computer Vision. In Proceedings of IEEE CVPR, 2818–2826.

Toutanova, K., and Chen, D. 2015. Observed Versus Latent Features for Knowledge Base and Text Inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, 57–66.

Toutanova, K.; Chen, D.; Pantel, P.; Poon, H.; Choudhury, P.; and Gamon, M. 2015. Representing Text for Joint Embedding of Text and Knowledge Bases. In Proceedings of EMNLP 2015, volume 15, 1499–1509.

Trouillon, T.; Welbl, J.; Riedel, S.; Gaussier, É.; and Bouchard, G. 2016. Complex Embeddings for Simple Link Prediction. In Proceedings of ICML 2016, 2071–2080.

Yang, B.; Yih, W.; He, X.; Gao, J.; and Deng, L. 2015. Embedding Entities and Relations for Learning and Inference in Knowledge Bases. In Proceedings of ICLR 2015.

Yih, W.; Toutanova, K.; Platt, J. C.; and Meek, C. 2011. Learning Discriminative Projections for Text Similarity Measures. In Proceedings of CoNLL 2011, 247–256.

