Hypernetwork Knowledge Graph Embeddings
1 Introduction
2 Related Work
Numerous matrix factorization approaches to link prediction have been pro-
posed. An early model, RESCAL [12], tackles the link prediction task by opti-
mizing a scoring function containing a bilinear product between vectors for each
of the subject and object entities and a full rank matrix for each relation. Dist-
Mult [23] can be viewed as a special case of RESCAL with a diagonal matrix per
relation type, which limits the linear transformation performed on entity vectors
to a stretch. ComplEx [22] extends DistMult to the complex domain. TransE [1]
is an affine model that represents a relation as a translation operation between
subject and object entity vectors.
A somewhat separate line of link prediction research introduces Relational
Graph Convolutional Networks (R-GCNs) [15]. R-GCNs use a convolution oper-
ator to capture locality information in graphs. The model closest to our own,
and from which we draw inspiration, is ConvE [3], where a convolution operation is
performed on the subject entity vector and the relation vector, after they are
each reshaped to a matrix and lengthwise concatenated. The obtained feature
maps are flattened, put through a fully connected layer, and the inner product
is taken with all object entity vectors to generate a score for each triple. Ad-
vantages of ConvE over previous approaches include its expressiveness, achieved
by using multiple layers of non-linear features, its scalability to large knowledge
graphs, and its robustness to overfitting. However, it is not intuitive why con-
volving across concatenated and reshaped subject entity and relation vectors
should be effective.
The proposed HypER model does no such reshaping or concatenation and
thus avoids both implying any inherent 2D structure in the embeddings and re-
stricting interaction to the concatenation boundary. Instead, HypER convolves
every dimension of the subject entity embedding with relation-specific convo-
lutional filters generated by the hypernetwork. This way, entity and relation
embeddings are combined in a non-linear (quadratic) manner, unlike the lin-
ear combination (weighted sum) in ConvE. This gives HypER more expressive
power, while also reducing parameters.
Interestingly, we find that the differences in moving from ConvE to HypER
in fact bring the factorization and convolutional approaches together, since the
1D convolution process is equivalent to multiplication by a highly sparse tensor
with tied weights (see Figure 2). The multiplication of this “convolutional tensor”
(defined by the relation embedding and hypernetwork) and other weights gives
an implicit relation matrix, corresponding to those in e.g. RESCAL, DistMult
and ComplEx. Other than the method of deriving these relation matrices, the
key difference to existing factorization approaches is the ReLU non-linearity
applied prior to interaction with the object embedding.
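To illustrate this equivalence, the sketch below (numpy, with illustrative sizes; variable names are ours, not taken from the paper) checks that convolving a single filter over an embedding vector gives the same result as multiplying the embedding by a sparse matrix in which the filter weights are diagonally duplicated and all other entries are zero:

```python
import numpy as np

d_e, l_f = 8, 3            # embedding size, filter length (illustrative only)
l_m = d_e - l_f + 1        # feature map length
e1 = np.random.randn(d_e)  # subject entity embedding
f = np.random.randn(l_f)   # one relation-specific filter

# 1D "valid" cross-correlation, as performed by a convolution layer
conv_out = np.array([e1[i:i + l_f] @ f for i in range(l_m)])

# Equivalent sparse matrix: the filter is diagonally duplicated, zeros elsewhere
F = np.zeros((l_m, d_e))
for i in range(l_m):
    F[i, i:i + l_f] = f

assert np.allclose(conv_out, F @ e1)  # convolution == sparse matrix product
```

The full model stacks nf such filters per relation, giving the sparse "convolutional tensor" of Figure 2.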
3 Link Prediction
In link prediction, the aim is to learn a scoring function φ that assigns a score
s = φ(e1 , r, e2 ) ∈ R to each input triple (e1 , r, e2 ), where e1 , e2 ∈ E are sub-
ject and object entities and r ∈ R a relation. The score indicates the strength
of prediction that the given triple corresponds to a true fact, with positive
scores meaning true and negative scores, false. Link prediction models typically
map entity pair e1 , e2 to their corresponding distributed embedding represen-
tations e1 , e2 ∈ Rde and a score is assigned using a relation-specific function,
s = φr (e1 , e2 ). The majority of link prediction models apply the logistic sigmoid
function σ(·) to the score to give a probabilistically interpretable prediction
p = σ(s) ∈ [0, 1] as to whether the queried fact is true. The scoring functions
for models from across the literature and HypER are summarized in Table 1,
together with the dimensionality of their relation parameters and the significant
terms of their space complexity.
Fig. 2. Interpretation of the HypER model in terms of tensor operations. Each rela-
tion embedding wr generates a set of filters Fr via the hypernetwork H. The act of
convolving Fr over e1 is equivalent to multiplication of e1 by a tensor Fr (in which
Fr is diagonally duplicated and zero elsewhere). The tensor product Fr ⊗yz W gives
a de × de matrix specific to each relation. Axes labels indicate the modes of tensor
interaction (via inner product).
In the feed-forward pass, the model obtains embeddings for the input triple
from the entity and relation embedding matrices E ∈ Rne ×de and R ∈ Rnr ×dr .
The hypernetwork is a fully connected layer H ∈ Rdr ×lf nf (lf denotes filter
length and nf the number of filters per relation, i.e. output channels of the
convolution) that is applied to the relation embedding wr ∈ Rdr . The result
is reshaped to generate a matrix of convolutional filters Fr = vec−1 (wr H) ∈
Rlf ×nf . Whilst the overall dimensionality of the filter set is lf nf , the rank is
restricted to dr to encourage parameter sharing between relations.
The subject entity embedding e1 is convolved with the set of relation-specific
filters Fr to give a 2D feature map Mr ∈ Rlm ×nf , where lm = de − lf + 1 is
the feature map length. The feature map is vectorized to vec(Mr ) ∈ Rlm nf , and
projected to de -dimensional space by the weight matrix W ∈ Rlm nf ×de . After
applying a ReLU activation function, the result is combined by way of inner
product with each and every object entity embedding e2 (i) , where i varies over all
entities in the dataset (of size ne ), to give a vector of scores. The logistic sigmoid
is applied element-wise to the score vector to obtain the predicted probability of
each prospective triple being true pi = σ(φr (e1 , e2 (i) )).
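A minimal sketch of this feed-forward pass is given below (PyTorch; batch normalization, dropout and bias terms used in practice are omitted, and the class and argument names are ours, not from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HypERSketch(nn.Module):
    """Illustrative HypER forward pass following the description above."""
    def __init__(self, n_e, n_r, d_e=200, d_r=200, l_f=9, n_f=32):
        super().__init__()
        self.n_f, self.l_f = n_f, l_f
        l_m = d_e - l_f + 1                               # feature map length
        self.E = nn.Embedding(n_e, d_e)                   # entity embedding matrix
        self.R = nn.Embedding(n_r, d_r)                   # relation embedding matrix
        self.H = nn.Linear(d_r, l_f * n_f, bias=False)    # hypernetwork
        self.W = nn.Linear(l_m * n_f, d_e, bias=False)    # projection back to d_e

    def forward(self, e1_idx, r_idx):
        b = e1_idx.size(0)
        e1 = self.E(e1_idx)                               # (b, d_e)
        w_r = self.R(r_idx)                               # (b, d_r)
        F_r = self.H(w_r).view(b * self.n_f, 1, self.l_f)   # relation-specific filters
        # grouped 1D convolution: example i is convolved only with its own filters
        M_r = F.conv1d(e1.view(1, b, -1), F_r, groups=b)     # (1, b*n_f, l_m)
        h = torch.relu(self.W(M_r.view(b, -1)))             # vectorize, project, ReLU
        scores = h @ self.E.weight.t()                       # inner product with all e2
        return torch.sigmoid(scores)                         # (b, n_e) probabilities
```

The grouped convolution is one way to apply a different filter set to each example in a batch; the authors' actual implementation may organize this differently.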
Following the training procedure introduced by [3], we use 1-N scoring with the
Adam optimizer [8] to minimize the binary cross-entropy loss:
L(p, y) = −(1/ne) Σi (yi log(pi) + (1 − yi) log(1 − pi)),    (2)
where y ∈ Rne is the label vector containing ones for true triples and zeros oth-
erwise, subject to label smoothing. Label smoothing is a widely used technique
shown to improve generalization [20,14]. Label smoothing changes the ground-
truth label distribution by adding a uniform prior to encourage the model to be
less confident, achieving a regularizing effect. 1-N scoring refers to simultane-
ously scoring (e1 , r, E), i.e. for all entities e2 ∈ E, in contrast to 1-1 scoring, the
practice of training individual triples (e1 , r, e2 ) one at a time. As shown by [3],
1-N scoring offers a significant speedup (3x on train and 300x on test time) and
improved accuracy compared to 1-1 scoring. A potential extension of the HypER
model described above would be to apply convolutional filters to both subject
and object entity embeddings. However, since this is not trivially implementable
with 1-N scoring, whose benefits we wish to retain, we leave this to future work.
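The loss of Equation (2), combined with label smoothing and 1-N scoring, can be sketched as follows (PyTorch; the smoothing coefficient and the use of a single true object per (e1, r) pair are simplifying assumptions for illustration, not values taken from the paper):

```python
import torch

def one_to_n_loss(p, e2_idx, n_e, smoothing=0.1):
    """Binary cross-entropy of Eq. (2) over all n_e candidate objects.

    p      : (b, n_e) predicted probabilities sigma(phi_r(e1, e2_i))
    e2_idx : (b,) index of the true object entity for each (e1, r) pair
    """
    y = torch.zeros_like(p)
    y.scatter_(1, e2_idx.unsqueeze(1), 1.0)       # one-hot labels for true triples
    y = (1.0 - smoothing) * y + smoothing / n_e   # label smoothing: add uniform prior
    eps = 1e-12                                   # numerical stability
    return -(y * (p + eps).log() + (1 - y) * (1 - p + eps).log()).mean()
```

In full 1-N training, all known true objects for a given (e1, r) pair could be marked in y rather than a single one.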
Table 2 compares the number of parameters of ConvE and HypER (for the
FB15k-237 dataset, which determines ne and nr ). It can be seen that, overall,
HypER has fewer parameters (4.3M) than ConvE (5.1M) due to the way HypER
directly transforms relations to convolutional filters.
Table 2. Number of parameters of ConvE and HypER (ne and nr as determined by FB15k-237).

Model   E           R           Filters         W
ConvE   ne × de     nr × dr     lf nf           hm wm nf × de
        2.9M        0.1M        0.0M            2.1M
HypER   ne × de     nr × dr     dr × lf nf      lm nf × de
        2.9M        0.1M        0.1M            1.2M
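The figures in Table 2 can be checked approximately from the model sizes. The sketch below assumes the standard FB15k-237 statistics with reciprocal relations added for 1-N scoring, dr = 200, and the usual ConvE reshaping of 200-dimensional embeddings into 10 × 20 matrices; none of these values are restated in this section, so they should be read as assumptions:

```python
# Back-of-the-envelope check of Table 2 (FB15k-237).
n_e, n_r = 14_541, 474           # entities; relations incl. assumed reciprocals
d_e = d_r = 200                  # embedding dimensions
l_f, n_f = 9, 32                 # HypER filter length and number of filters
l_m = d_e - l_f + 1              # HypER feature map length

hyper = {
    "E": n_e * d_e,              # ~2.9M
    "R": n_r * d_r,              # ~0.1M
    "H": d_r * l_f * n_f,        # ~0.1M (hypernetwork producing filters)
    "W": l_m * n_f * d_e,        # ~1.2M
}

h_m, w_m = 18, 18                # ConvE feature map after 3x3 conv on stacked 20x20 input
conve = {
    "E": n_e * d_e,              # ~2.9M
    "R": n_r * d_r,              # ~0.1M
    "filters": n_f * 3 * 3,      # ~0.0M (filters shared across relations)
    "W": h_m * w_m * n_f * d_e,  # ~2.1M
}

print(sum(hyper.values()) / 1e6, sum(conve.values()) / 1e6)  # ~4.3 vs ~5.1
```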
5 Experiments
5.1 Datasets
We evaluate our HypER model on the standard link prediction task using the
following datasets (see Table 3):
FB15k [1] a subset of Freebase, a large database of facts about the real world.
WN18 [1] a subset of WordNet, containing lexical relations between words.
FB15k-237 created by [21], noting that the validation and test sets of FB15k
and WN18 contain the inverse of many relations present in the training set,
making it easy for simple models to do well. FB15k-237 is a subset of FB15k
with the inverse relations removed.
WN18RR [3] a subset of WN18, created by removing the inverse relations.
YAGO3-10 [3] a subset of YAGO3 [10], containing entities which have a
minimum of 10 relations each.
Table 3. Summary of dataset statistics.
5.2 Experimental Setup
We find the following combination of dropout values to work well across all
datasets: input dropout 0.2, feature map dropout 0.2, and hidden dropout 0.3,
apart from FB15k-237, where we set input dropout to 0.3. We select
the learning rate from {0.01, 0.005, 0.003, 0.001, 0.0005, 0.0001} and exponential
learning rate decay from {1., 0.99, 0.995} for each dataset and find the best per-
forming learning rate and learning rate decay to be dataset-specific. We set the
convolution stride to 1, number of feature maps to 32 with the filter size 3 × 3
for ConvE and 1 × 9 for HypER, after testing different numbers of feature maps
nf ∈ {16, 32, 64} and filter sizes lf ∈ {1 × 1, 1 × 2, 1 × 3, 1 × 6, 1 × 9, 1 × 12}
(see Table 9). We train all models using the Adam optimizer with batch size
128. One epoch on FB15k-237 takes approximately 12 seconds on a single GPU
compared to 1 minute for e.g. RESCAL, largely due to 1-N scoring.
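For reference, the dataset-independent settings above can be collected into a single configuration (a sketch only; learning rate and decay are tuned per dataset from the listed grids):

```python
# Shared settings from Section 5.2; learning rate / decay are dataset-specific.
config = {
    "input_dropout": 0.2,        # 0.3 for FB15k-237
    "feature_map_dropout": 0.2,
    "hidden_dropout": 0.3,
    "lr_grid": [0.01, 0.005, 0.003, 0.001, 0.0005, 0.0001],
    "lr_decay_grid": [1.0, 0.99, 0.995],
    "n_filters": 32,
    "filter_size": (1, 9),       # HypER; (3, 3) for ConvE
    "conv_stride": 1,
    "optimizer": "Adam",
    "batch_size": 128,
}
```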
Evaluation. Results are obtained by iterating over all triples in the test set.
A particular triple is evaluated by replacing the object entity e2 with all entities
E while keeping the subject entity e1 fixed and vice versa, obtaining scores for
each combination. These scores are then ranked using the “filtered” setting only,
i.e. we remove all true cases other than the current test triple [1].
We evaluate HypER on five different metrics found throughout the link pre-
diction literature: mean rank (MR), mean reciprocal rank (MRR), hits@10,
hits@3, and hits@1. Mean rank is the average rank assigned to the true triple,
over all test triples. Mean reciprocal rank takes the average of the reciprocal rank
assigned to the true triple. Hits@k measures the percentage of cases in which
the true triple appears in the top k ranked triples. Overall, the aim is to achieve
high mean reciprocal rank and hits@k and low mean rank. For a more extensive
description of how each of these metrics is calculated, we refer to [3].
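A sketch of how filtered ranks and these metrics can be computed from a vector of scores over all candidate entities is given below (numpy; the set of other known true entities for each test query is assumed to be precomputed, and ties are broken optimistically):

```python
import numpy as np

def filtered_rank(scores, true_idx, known_true_idx):
    """Rank of the true entity after masking all other known true entities."""
    s = scores.copy()
    mask = [i for i in known_true_idx if i != true_idx]
    s[mask] = -np.inf                        # "filtered" setting: remove other true cases
    return int((s > s[true_idx]).sum()) + 1  # 1-based rank of the test triple

def metrics(ranks, ks=(1, 3, 10)):
    ranks = np.asarray(ranks, dtype=float)
    out = {"MR": ranks.mean(), "MRR": (1.0 / ranks).mean()}
    for k in ks:
        out[f"hits@{k}"] = (ranks <= k).mean()
    return out
```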
5.3 Results
Link prediction results for all models across the five datasets are shown in Tables
4, 5 and 6. Our key findings are:
– whilst having fewer parameters than the closest comparator ConvE, Hy-
pER consistently outperforms all other models across all datasets, thereby
achieving state-of-the-art results on the link prediction task; and
– our filter dimension study suggests that no benefit is gained by convolving
over reshaped 2D entity embeddings in comparison with 1D entity embed-
ding vectors and that most information can be extracted with very small
convolutional filters (Table 9).
Overall, HypER outperforms all other models on all metrics apart from hits@10
on WN18 and mean rank on WN18RR, FB15k-237, WN18, and
YAGO3-10. Given that mean rank is known to be highly sensitive to outliers
[11], this suggests that HypER correctly ranks many true triples in the top 10,
but makes larger ranking errors elsewhere.
Given that most models in the literature, with the exception of ConvE, were
trained with 100 dimension embeddings and 1-1 scoring, we reimplement previ-
ous models (DistMult, ComplEx and ConvE) with 200 dimension embeddings
and 1-N scoring for fair comparison and report the obtained results on WN18RR
in Table 7. We perform the same hyperparameter search for every model and
present the mean and standard deviation of each result across five runs (differ-
ent random seeds). This improves most previously published results, except for
ConvE where we fail to replicate some values. Notwithstanding, HypER remains
the best performing model overall despite better tuning of the competitors.
Table 4. Link prediction results on WN18RR and FB15k-237. The RotatE [19] results
are reported without their self-adversarial negative sampling (see Appendix H in the
original paper) for fair comparison, given that it is not specific to that model only.
WN18RR FB15k-237
MR MRR H@10 H@3 H@1 MR MRR H@10 H@3 H@1
DistMult [23] 5110 .430 .490 .440 .390 254 .241 .419 .263 .155
ComplEx [22] 5261 .440 .510 .460 .410 339 .247 .428 .275 .158
Neural LP [24] − − − − − − .250 .408 − −
R-GCN [15] − − − − − − .248 .417 .264 .151
MINERVA [2] − − − − − − − .456 − −
ConvE [3] 4187 .430 .520 .440 .400 244 .325 .501 .356 .237
M-Walk [16] − .437 − .445 .414 − − − − −
RotatE [19] − − − − − 185 .297 .480 .328 .205
HypER (ours) 5798 .465 .522 .477 .436 250 .341 .520 .376 .252
Table 5. Link prediction results on WN18 and FB15k.
WN18 FB15k
MR MRR H@10 H@3 H@1 MR MRR H@10 H@3 H@1
TransE [1] 251 − .892 − − 125 − .471 − −
DistMult [23] 902 .822 .936 .914 .728 97 .654 .824 .733 .546
ComplEx [22] − .941 .947 .936 .936 − .692 .840 .759 .599
ANALOGY [9] − .942 .947 .944 .939 − .725 .854 .785 .646
Neural LP [24] − .940 .945 − − − .760 .837 − −
R-GCN [15] − .819 .964 .929 .697 − .696 .842 .760 .601
TorusE [4] − .947 .954 .950 .943 − .733 .832 .771 .674
ConvE [3] 374 .943 .956 .946 .935 51 .657 .831 .723 .558
SimplE [7] − .942 .947 .944 .939 − .727 .838 .773 .660
HypER (ours) 431 .951 .958 .955 .947 44 .790 .885 .829 .734
Table 6. Link prediction results on YAGO3-10.
YAGO3-10
MR MRR H@10 H@3 H@1
DistMult [23] 5926 .340 .540 .380 .240
ComplEx [22] 6351 .360 .550 .400 .260
ConvE [3] 1676 .440 .620 .490 .350
HypER (ours) 2529 .533 .678 .580 .455
To ensure that the difference between reported results for HypER and ConvE
is not simply due to HypER having a reduced number of parameters (implicit
regularization), we trained ConvE with the number of feature maps reduced to 16
instead of 32, giving it a comparable number of parameters to HypER (explicit
regularization).
Table 7. Link prediction results on WN18RR; all models trained with 200 dimension
embeddings and 1-N scoring.
WN18RR
MR MRR H@10 H@3 H@1
DistMult [23] 4911 ± 109 .434 ± .002 .508 ± .002 .447 ± .001 .399 ± .002
ComplEx [22] 5930 ± 125 .446 ± .001 .523 ± .002 .462 ± .001 .409 ± .001
ConvE [3] 4997± 99 .431 ± .001 .504 ± .002 .443 ± .002 .396 ± .001
HypER (ours) 5798 ± 124 .465 ± .002 .522 ± .003 .477 ± .002 .436 ± .003
Table 8. HypER with and without the hypernetwork H, on WN18RR and FB15k-237.
WN18RR FB15k-237
MRR H@10 MRR H@10
HypER .465 ± .002 .522 ± .003 .341 ± .001 .520 ± .002
HypER (no H) .459 ± .002 .511 ± .002 .338 ± .001 .515 ± .001
Table 9. Influence of convolutional filter size on MRR and hits@1 for WN18RR and FB15k-237.
WN18RR FB15k-237
Filter Size MRR H@1 MRR H@1
1×1 .455 .422 .337 .248
1×2 .458 .428 .337 .248
1×3 .457 .427 .339 .250
1×6 .459 .429 .340 .251
1×9 .465 .436 .341 .252
1 × 12 .457 .428 .341 .252
2×2 .456 .429 .340 .250
3×3 .458 .430 .339 .250
5×5 .452 .423 .340 .252
6 Conclusion
In this work, we introduce HypER, a hypernetwork model for link prediction on
knowledge graphs. HypER generates relation-specific convolutional filters and
applies them to subject entity embeddings. The hypernetwork component allows
information to be shared between relation vectors, enabling multi-task learning
across relations. To our knowledge, HypER is the first link prediction model
that creates non-linear interaction between entity and relation embeddings by
convolving relation-specific filters over the entity embeddings.
We show that no benefit is gained from 2D convolutional filters over 1D filters,
dispelling the suggestion, implicit in ConvE, that 2D structure exists in entity
embeddings. We also recast HypER in terms of tensor operations, showing that,
despite the convolution operation, it is closely related to the established family of
tensor factorization models. Our results suggest that convolution provides a good
trade-off between expressiveness and number of parameters compared to a dense
network. HypER is fast, robust to overfitting, has relatively few parameters,
and achieves state-of-the-art results across almost all metrics on multiple link
prediction datasets.
Future work might include expanding the current architecture by applying
convolutional filters to both subject and object entity embeddings. We may
also analyze the influence of label smoothing and explore the interpretability of
convolutional feature maps to gain insight and potentially improve the model.
Acknowledgements
We thank Ivan Titov for helpful discussions on this work. Ivana Balažević and
Carl Allen were supported by the Centre for Doctoral Training in Data Science,
funded by EPSRC (grant EP/L016427/1) and the University of Edinburgh.
References
1. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating
Embeddings for Modeling Multi-relational Data. In: Advances in Neural Informa-
tion Processing Systems (2013)
2. Das, R., Dhuliawala, S., Zaheer, M., Vilnis, L., Durugkar, I., Krishnamurthy, A.,
Smola, A., McCallum, A.: Go for a Walk and Arrive at the Answer: Reasoning
over Paths in Knowledge Bases Using Reinforcement Learning. In: International
Conference on Learning Representations (2018)
3. Dettmers, T., Minervini, P., Stenetorp, P., Riedel, S.: Convolutional 2D Knowledge
Graph Embeddings. In: Association for the Advancement of Artificial Intelligence
(2018)
4. Ebisu, T., Ichise, R.: TorusE: Knowledge Graph Embedding on a Lie Group. In:
Association for the Advancement of Artificial Intelligence (2018)
5. Ha, D., Dai, A., Le, Q.V.: Hypernetworks. In: International Conference on Learning
Representations (2017)
6. Ioffe, S., Szegedy, C.: Batch Normalization: Accelerating Deep Network Training
by Reducing Internal Covariate Shift. In: International Conference on Machine
Learning (2015)
7. Kazemi, S.M., Poole, D.: SimplE Embedding for Link Prediction in Knowledge
Graphs. In: Advances in Neural Information Processing Systems (2018)
8. Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. In: Interna-
tional Conference on Learning Representations (2015)
9. Liu, H., Wu, Y., Yang, Y.: Analogical Inference for Multi-relational Embeddings.
In: International Conference on Machine Learning (2017)
10. Mahdisoltani, F., Biega, J., Suchanek, F.M.: Yago3: A Knowledge Base from Mul-
tilingual Wikipedias. In: Conference on Innovative Data Systems Research (2013)
11. Nickel, M., Rosasco, L., Poggio, T.A.: Holographic Embeddings of Knowledge
Graphs. In: Association for the Advancement of Artificial Intelligence (2016)
12. Nickel, M., Tresp, V., Kriegel, H.P.: A Three-Way Model for Collective Learning on
Multi-Relational Data. In: International Conference on Machine Learning (2011)
13. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z.,
Desmaison, A., Antiga, L., Lerer, A.: Automatic Differentiation in PyTorch. In:
NIPS-W (2017)
14. Pereyra, G., Tucker, G., Chorowski, J., Kaiser, L., Hinton, G.: Regularizing
neural networks by penalizing confident output distributions. arXiv preprint
arXiv:1701.06548 (2017)
15. Schlichtkrull, M., Kipf, T.N., Bloem, P., van den Berg, R., Titov, I., Welling,
M.: Modeling Relational Data with Graph Convolutional Networks. In: European
Semantic Web Conference (2018)
16. Shen, Y., Chen, J., Huang, P.S., Guo, Y., Gao, J.: M-Walk: Learning to Walk
over Graphs using Monte Carlo Tree Search. In: Advances in Neural Information
Processing Systems (2018)
17. Socher, R., Chen, D., Manning, C.D., Ng, A.: Reasoning with Neural Tensor Net-
works for Knowledge Base Completion. In: Advances in Neural Information Pro-
cessing Systems (2013)
18. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal
of Machine Learning Research 15(1), 1929–1958 (2014)
19. Sun, Z., Deng, Z.H., Nie, J.Y., Tang, J.: RotatE: Knowledge Graph Embedding by
Relational Rotation in Complex Space. In: International Conference on Learning
Representations (2019)
20. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the Incep-
tion Architecture for Computer Vision. In: Computer Vision and Pattern Recog-
nition (2016)
21. Toutanova, K., Chen, D., Pantel, P., Poon, H., Choudhury, P., Gamon, M.: Rep-
resenting Text for Joint Embedding of Text and Knowledge Bases. In: Empirical
Methods in Natural Language Processing (2015)
22. Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., Bouchard, G.: Complex Embed-
dings for Simple Link Prediction. In: International Conference on Machine Learning
(2016)
23. Yang, B., Yih, W.t., He, X., Gao, J., Deng, L.: Embedding Entities and Relations
for Learning and Inference in Knowledge Bases. In: International Conference on
Learning Representations (2015)
24. Yang, F., Yang, Z., Cohen, W.W.: Differentiable Learning of Logical Rules for
Knowledge Base Reasoning. In: Advances in Neural Information Processing Sys-
tems (2017)