
Neural Bellman-Ford Networks: A General Graph

Neural Network Framework for Link Prediction

Zhaocheng Zhu1,2 , Zuobai Zhang1,2 , Louis-Pascal Xhonneux1,2 , Jian Tang1,3,4


Mila - Québec AI Institute1 , Université de Montréal2
HEC Montréal3 , CIFAR AI Chair4
{zhaocheng.zhu, zuobai.zhang, louis-pascal.xhonneux}@mila.quebec
arXiv:2106.06935v4 [cs.LG] 21 Jan 2022

[email protected]

Abstract

Link prediction is a very fundamental task on graphs. Inspired by traditional


path-based methods, in this paper we propose a general and flexible representation
learning framework based on paths for link prediction. Specifically, we define the
representation of a pair of nodes as the generalized sum of all path representations
between the nodes, with each path representation as the generalized product of
the edge representations in the path. Motivated by the Bellman-Ford algorithm
for solving the shortest path problem, we show that the proposed path formulation
can be efficiently solved by the generalized Bellman-Ford algorithm. To further
improve the capacity of the path formulation, we propose the Neural Bellman-Ford
Network (NBFNet), a general graph neural network framework that solves the
path formulation with learned operators in the generalized Bellman-Ford algorithm.
The NBFNet parameterizes the generalized Bellman-Ford algorithm with 3 neural
components, namely I NDICATOR, M ESSAGE and AGGREGATE functions, which
corresponds to the boundary condition, multiplication operator, and summation
operator respectively1 . The NBFNet covers many traditional path-based methods,
and can be applied to both homogeneous graphs and multi-relational graphs (e.g.,
knowledge graphs) in both transductive and inductive settings. Experiments on
both homogeneous graphs and knowledge graphs show that the proposed NBFNet
outperforms existing methods by a large margin in both transductive and inductive
settings, achieving new state-of-the-art results².

1 Introduction

Predicting the interactions between nodes (a.k.a. link prediction) is a fundamental task in the field of
graph machine learning. Given the ubiquitous existence of graphs, such a task has many applications,
such as recommender systems [34], knowledge graph completion [41] and drug repurposing [27].
Traditional methods of link prediction usually define different heuristic metrics over the paths between
a pair of nodes. For example, Katz index [30] is defined as a weighted count of paths between two
nodes. Personalized PageRank [42] measures the similarity of two nodes as the random walk
probability from one to the other. Graph distance [37] uses the length of the shortest path between
two nodes to predict their association. These methods can be directly applied to new graphs, i.e.,
inductive setting, enjoy good interpretability and scale up to large graphs. However, they are designed
based on handcrafted metrics and may not be optimal for link prediction on real-world graphs.
¹ Unless stated otherwise, we use summation and multiplication to refer to the generalized operators in the path formulation, rather than the basic operations of arithmetic.
² Code is available at https://github.com/DeepGraphLearning/NBFNet

35th Conference on Neural Information Processing Systems (NeurIPS 2021).


To address these limitations, some link prediction methods adopt graph neural networks (GNNs) [32,
48, 59] to automatically extract important features from local neighborhoods for link prediction.
Thanks to the high expressiveness of GNNs, these methods have shown state-of-the-art performance.
However, these methods can only be applied to predict new links on the training graph, i.e., the transductive
setting, and lack interpretability. While some recent methods [73, 55] extract features from local
subgraphs with GNNs and support the inductive setting, the scalability of these methods is compromised.
Therefore, we wonder if there exists an approach that enjoys the advantages of both traditional
path-based methods and recent approaches based on graph neural networks, i.e., generalization in
the inductive setting, interpretability, high model capacity and scalability.
In this paper, we propose such a solution. Inspired by traditional path-based methods, our goal is
to develop a general and flexible representation learning framework for link prediction based on
the paths between two nodes. Specifically, we define the representation of a pair of nodes as the
generalized sum of all the path representations between them, where each path representation is
defined as the generalized product of the edge representations in the path. Many link prediction
methods, such as Katz index [30], personalized PageRank [42], graph distance [37], as well as graph
theory algorithms like widest path [4] and most reliable path [4], are special instances of this path
formulation with different summation and multiplication operators. Motivated by the polynomial-time
algorithm for the shortest path problem [5], we show that such a formulation can be efficiently solved
via the generalized Bellman-Ford algorithm [4] under mild conditions and scale up to large graphs.
The operators in the generalized Bellman-Ford algorithm—summation and multiplication—are
handcrafted, which have limited flexibility. Therefore, we further propose the Neural Bellman-Ford
Networks (NBFNet), a graph neural network framework that solves the above path formulation with
learned operators in the generalized Bellman-Ford algorithm. Specifically, NBFNet parameterizes the
generalized Bellman-Ford algorithm with three neural components, namely INDICATOR, MESSAGE
and AGGREGATE functions. The INDICATOR function initializes a representation on each node,
which is taken as the boundary condition of the generalized Bellman-Ford algorithm. The MESSAGE
and AGGREGATE functions learn the multiplication and summation operators respectively.
We show that the MESSAGE function can be defined according to the relational operators in knowledge
graph embeddings [6, 68, 58, 31, 52], e.g., as a translation in Euclidean space induced by the relational
operators of TransE [6]. The AGGREGATE function can be defined as a learnable set aggregation
function [71, 65, 9]. With such a parameterization, NBFNet generalizes to the inductive setting,
while achieving one of the lowest time complexities among inductive GNN methods. A comparison
of NBFNet and other GNN frameworks for link prediction is shown in Table 1. With other
instantiations of MESSAGE and AGGREGATE functions, our framework can also recover some
existing works on learning logic rules [69, 46] for link prediction on knowledge graphs (Table 2).
Our NBFNet framework can be applied to several link prediction variants, covering not only single-
relational graphs (e.g., homogeneous graphs) but also multi-relational graphs (e.g., knowledge
graphs). We empirically evaluate the proposed NBFNet for link prediction on homogeneous graphs
and knowledge graphs in both transductive and inductive settings. Experimental results show that
the proposed NBFNet outperforms existing state-of-the-art methods by a large margin in all settings,
with an average relative performance gain of 18% on knowledge graph completion (HITS@1) and
22% on inductive relation prediction (HITS@10). We also show that the proposed NBFNet is indeed
interpretable by visualizing the top-k relevant paths for link prediction on knowledge graphs.
Table 1: Comparison of GNN frameworks for link prediction. The time complexity refers to the
amortized time for predicting a single edge or triplet. |V| is the number of nodes, |E| is the number of
edges, and d is the dimension of representations. The wall time is measured on FB15k-237 test set
with 40 CPU cores and 4 GPUs. We estimate the wall time of GraIL based on a downsampled test set.

Method                    | Inductive³ | Interpretable | Learned Representation | Time Complexity  | Wall Time
VGAE [32] / RGCN [48]     |            |               | ✓                      | O(d)             | 18 secs
NeuralLP [69] / DRUM [46] | ✓          | ✓             |                        | O(|E|d/|V| + d²) | 2.1 mins
SEAL [73] / GraIL [55]    | ✓          |               | ✓                      | O(|E|d²)         | ≈1 month
NBFNet                    | ✓          | ✓             | ✓                      | O(|E|d/|V| + d²) | 4.0 mins

³ We consider the inductive setting where a model can generalize to entirely new graphs without node features.
2 Related Work
Existing work on link prediction can be generally classified into 3 main paradigms: path-based
methods, embedding methods, and graph neural networks.
Path-based Methods. Early methods on homogeneous graphs compute the similarity between two
nodes based on the weighted count of paths (Katz index [30]), random walk probability (personalized
PageRank [42]) or the length of the shortest path (graph distance [37]). SimRank [28] uses advanced
metrics such as the expected meeting distance on homogeneous graphs, which is extended by
PathSim [51] to heterogeneous graphs. On knowledge graphs, Path Ranking [35, 15] directly uses
relational paths as symbolic features for prediction. Rule mining methods, such as NeuralLP [69]
and DRUM [46], learn probabilistic logical rules to weight different paths. Path representation
methods, such as Path-RNN [40] and its successors [11, 62], encode each path with recurrent neural
networks (RNNs), and aggregate paths for prediction. However, these methods need to traverse
an exponential number of paths and are limited to very short paths, e.g., ≤ 3 edges. To scale
up path-based methods, All-Paths [57] proposes to efficiently aggregate all paths with dynamic
programming. However, All-Paths is restricted to bilinear models and has limited model capacity.
Another stream of works [64, 10, 22] learns an agent to collect useful paths for link prediction. While
these methods can produce interpretable paths, they suffer from extremely sparse rewards and require
careful engineering of the reward function [38] or the search strategy [50]. Some other works [8, 44]
adopt variational inference to learn a path finder and a path reasoner for link prediction.
Embedding Methods. Embedding methods learn a distributed representation for each node and edge
by preserving the edge structure of the graph. Representative methods include DeepWalk [43] and
LINE [53] on homogeneous graphs, and TransE [6], DistMult [68] and RotatE [52] on knowledge
graphs. Later works improve embedding methods with new score functions [58, 13, 31, 52, 54, 76]
that capture common semantic patterns of the relations, or search the score function in a general
design space [75]. Embedding methods achieve promising results on link prediction, and can be
scaled to very large graphs using multiple GPUs [78]. However, embedding methods do not explicitly
encode local subgraphs between node pairs and cannot be applied to the inductive setting.
Graph Neural Networks. Graph neural networks (GNNs) [47, 33, 60, 65] are a family of represen-
tation learning models that encode topological structures of graphs. For link prediction, the prevalent
frameworks [32, 48, 12, 59] adopt an auto-encoder formulation, which uses GNNs to encode node
representations, and decodes edges as a function over node pairs. Such frameworks are potentially
inductive if the dataset provides node features, but become transductive when node features are
unavailable. Another stream of frameworks, such as SEAL [73] and GraIL [55], explicitly encodes
the subgraph around each node pair for link prediction. While these frameworks have been proved
more powerful than the auto-encoder formulation [74] and can solve the inductive setting, they
need to materialize a subgraph for each link, which is not scalable to large graphs. By contrast, our
NBFNet explicitly captures the paths between two nodes for link prediction, while achieving a
relatively low time complexity (Table 1). ID-GNN [70] formalizes link prediction as a conditional
node classification task, and augments GNNs with the identity of the source node. While the
architecture of NBFNet shares some spirit with ID-GNN, our model is motivated by the generalized
Bellman-Ford algorithm and has theoretical connections with traditional path-based methods. There
are also some works trying to scale up GNNs for link prediction by dynamically pruning the set
of nodes in message passing [66, 20]. These methods are complementary to NBFNet, and may be
incorporated into our method to further improve scalability.

3 Methodology
In this section, we first define a path formulation for link prediction. Our path formulation generalizes
several traditional methods, and can be efficiently solved by the generalized Bellman-Ford algorithm.
Then we propose Neural Bellman-Ford Networks to learn the path formulation with neural functions.

3.1 Path Formulation for Link Prediction

We consider the link prediction problem on both knowledge graphs and homogeneous graphs. A
knowledge graph is denoted by G = (V, E, R), where V and E represent the set of entities (nodes)
and relations (edges) respectively, and R is the set of relation types. We use N(u) to denote the set of
nodes connected to u, and E(u) to denote the set of edges ending with node u. A homogeneous graph
G = (V, E) can be viewed as a special case of knowledge graphs, with only one relation type for all
edges. Throughout this paper, we use bold terms, wq(e) or hq(u, v), to denote vector representations,
and italic terms, we or wuv, to denote scalars like the weight of edge (u, v) in homogeneous graphs
or triplet (u, r, v) in knowledge graphs. Without loss of generality, we derive our method based on
knowledge graphs, while our method can also be applied to homogeneous graphs.
Path Formulation. Link prediction is aimed at predicting the existence of a query relation q between
a head entity u and a tail entity v. From a representation learning perspective, this requires to learn
a pair representation hq (u, v), which captures the local subgraph structure between u and v w.r.t.
the query relation q. In traditional methods, such a local structure is encoded by counting different
types of random walks from u to v [35, 15]. Inspired by this construction, we formulate the pair
representation as a generalized sum of path representations between u and v with a commutative
summation operator ⊕. Each path representation hq (P ) is defined as a generalized product of the
edge representations in the path with the multiplication operator ⊗.
hq(u, v) = hq(P1) ⊕ hq(P2) ⊕ ... ⊕ hq(P|Puv|) ≜ ⊕_{P∈Puv} hq(P)    (1)

hq(P = (e1, e2, ..., e|P|)) = wq(e1) ⊗ wq(e2) ⊗ ... ⊗ wq(e|P|) ≜ ⊗_{i=1}^{|P|} wq(ei)    (2)

where Puv denotes the set of paths from u to v and wq(ei) is the representation of edge ei. Note the
multiplication operator ⊗ is not required to be commutative (e.g., matrix multiplication); therefore
we define ⊗ to compute the product following the exact order. Intuitively, the path formulation
can be interpreted as a depth-first-search (DFS) algorithm, where one searches all possible paths
from u to v, computes their representations (Equation 2) and aggregates the results (Equation 1).
Such a formulation is capable of modeling several traditional link prediction methods, as well as
graph theory algorithms. Formally, Theorem 1-5 state the corresponding path formulations for 3 link
prediction methods and 2 graph theory algorithms respectively. See Appendix A for proofs.

Theorem 1 Katz index is a path formulation with ⊕ = +, ⊗ = × and wq (e) = βwe .


Theorem 2 Personalized PageRank is a path formulation with ⊕ = +, ⊗ = × and wq(e) = αwuv / Σ_{v'∈N(u)} wuv'.
Theorem 3 Graph distance is a path formulation with ⊕ = min, ⊗ = + and wq (e) = we .
Theorem 4 Widest path is a path formulation with ⊕ = max, ⊗ = min and wq (e) = we .
Theorem 5 Most reliable path is a path formulation with ⊕ = max, ⊗ = × and wq (e) = we .
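To make the formulation concrete, the following Python sketch evaluates Equations 1-2 by brute-force path enumeration on a toy weighted graph. The graph, weights and helper names (enumerate_paths, path_formulation) are illustrative and not from the paper, but the operator choices match Theorems 1, 3 and 4.

from functools import reduce

# adjacency list: node -> list of (neighbor, edge weight); a toy graph for illustration
graph = {
    "u": [("a", 1.0), ("b", 2.0)],
    "a": [("v", 1.0)],
    "b": [("v", 1.0)],
    "v": [],
}

def enumerate_paths(graph, source, target, max_len=4):
    """Depth-first enumeration of all simple paths, each returned as a list of edge weights."""
    paths = []
    def dfs(node, visited, edges):
        if node == target and edges:
            paths.append(list(edges))
            return
        for nxt, w in graph[node]:
            if nxt not in visited and len(edges) < max_len:
                dfs(nxt, visited | {nxt}, edges + [w])
    dfs(source, {source}, [])
    return paths

def path_formulation(paths, add, mul, edge_repr):
    """Generalized sum over paths of the generalized product of edge representations."""
    path_reprs = [reduce(mul, [edge_repr(w) for w in p]) for p in paths]
    return reduce(add, path_reprs)

paths = enumerate_paths(graph, "u", "v")
# Graph distance (Theorem 3): <min, +> with wq(e) = we
print(path_formulation(paths, min, lambda x, y: x + y, lambda w: w))                       # 2.0
# Widest path (Theorem 4): <max, min> with wq(e) = we
print(path_formulation(paths, max, min, lambda w: w))                                      # 1.0
# Katz index (Theorem 1): <+, x> with wq(e) = beta * we, here beta = 0.5
print(path_formulation(paths, lambda x, y: x + y, lambda x, y: x * y, lambda w: 0.5 * w))  # 0.75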

Generalized Bellman-Ford Algorithm. While the above formulation is able to model important
heuristics for link prediction, it is computationally expensive since the number of paths grows
exponentially with the path length. Previous works [40, 11, 62] that directly compute the exponential
number of paths can only afford a maximal path length of 3. A more scalable solution is to use
the generalized Bellman-Ford algorithm [4]. Specifically, assuming the operators ⟨⊕, ⊗⟩ satisfy
a semiring system [21] with summation identity 0q and multiplication identity 1q, we have the
following algorithm.

hq^(0)(u, v) ← 1q(u = v)    (3)

hq^(t)(u, v) ← ( ⊕_{(x,r,v)∈E(v)} hq^(t−1)(u, x) ⊗ wq(x, r, v) ) ⊕ hq^(0)(u, v)    (4)

where 1q(u = v) is the indicator function that outputs 1q if u = v and 0q otherwise. wq(x, r, v)


is the representation for edge e = (x, r, v) and r is the relation type of the edge. Equation 3 is
known as the boundary condition, while Equation 4 is known as the Bellman-Ford iteration. The
high-level idea of the generalized Bellman-Ford algorithm is to compute the pair representation
hq (u, v) for a given entity u, a given query relation q and all v ∈ V in parallel, and reduce the

total computation by the distributive property of multiplication over summation. Since u and q are
fixed in the generalized Bellman-Ford algorithm, we may abbreviate hq^(t)(u, v) as hv^(t) when the
context is clear. When ⊕ = min and ⊗ = +, it recovers the original Bellman-Ford algorithm for the
shortest path problem [5]. See Appendix B for preliminaries and the proof of the above algorithm.

Theorem 6 Katz index, personalized PageRank, graph distance, widest path and most reliable path
can be solved via the generalized Bellman-Ford algorithm.
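The following sketch illustrates the generalized Bellman-Ford iteration (Equations 3-4) with pluggable operators; the function and variable names are ours and the toy edge list is illustrative. Instantiated with the ⟨min, +⟩ semiring and identities ⟨+∞, 0⟩, it reduces to the classic Bellman-Ford shortest-path algorithm, in line with the discussion above.

def generalized_bellman_ford(edges, num_nodes, source, add, mul, zero, one, num_iter):
    # Boundary condition (Equation 3): 1q on the source node, 0q elsewhere.
    boundary = [one if v == source else zero for v in range(num_nodes)]
    h = list(boundary)
    for _ in range(num_iter):
        # Bellman-Ford iteration (Equation 4): aggregate messages over incoming edges,
        # then combine with the boundary condition.
        new_h = list(boundary)
        for x, v, w in edges:                 # edge from x to v with representation w
            new_h[v] = add(new_h[v], mul(h[x], w))
        h = new_h
    return h

edges = [(0, 1, 1.0), (0, 2, 4.0), (1, 2, 1.0), (2, 3, 1.0)]   # toy directed graph
dist = generalized_bellman_ford(edges, num_nodes=4, source=0,
                                add=min, mul=lambda a, b: a + b,
                                zero=float("inf"), one=0.0, num_iter=4)
print(dist)   # [0.0, 1.0, 2.0, 3.0], the shortest-path distances from node 0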

Table 2: Comparison of operators in NBFNet and other methods from the view of path formulation.

Class                        | Method                     | MESSAGE wq(ei) ⊗ wq(ej) | AGGREGATE hq(Pi) ⊕ hq(Pj) | INDICATOR 0q, 1q | Edge Representation wq(e)
Traditional Link Prediction  | Katz Index [30]            | wq(ei) × wq(ej)         | hq(Pi) + hq(Pj)           | 0, 1             | βwe
                             | Personalized PageRank [42] | wq(ei) × wq(ej)         | hq(Pi) + hq(Pj)           | 0, 1             | αwuv / Σ_{v'∈N(u)} wuv'
                             | Graph Distance [37]        | wq(ei) + wq(ej)         | min(hq(Pi), hq(Pj))       | +∞, 0            | we
Graph Theory Algorithms      | Widest Path [4]            | min(wq(ei), wq(ej))     | max(hq(Pi), hq(Pj))       | −∞, +∞           | we
                             | Most Reliable Path [4]     | wq(ei) × wq(ej)         | max(hq(Pi), hq(Pj))       | 0, 1             | we
Logic Rules                  | NeuralLP [69] / DRUM [46]  | wq(ei) × wq(ej)         | hq(Pi) + hq(Pj)           | 0, 1             | Weights learned by LSTM [23]
                             | NBFNet                     | Relational operators of knowledge graph embeddings [6, 68, 52] | Learned set aggregators [9] | Learned indicator functions | Learned relation embeddings

3.2 Neural Bellman-Ford Networks

While the generalized Bellman-Ford algorithm can solve many classical methods (Theorem 6), these
methods instantiate the path formulation with handcrafted operators (Table 2), and may not be optimal
for link prediction. To improve the capacity of path formulation, we propose a general framework,
Neural Bellman-Ford Networks (NBFNet), to learn the operators in the pair representations.
Neural Parameterization. We relax the semiring assumption and parameterize the generalized
Bellman-Ford algorithm (Equation 3 and 4) with 3 neural functions, namely INDICATOR, MESSAGE
and AGGREGATE functions. The INDICATOR function replaces the indicator function 1q(u = v).
The MESSAGE function replaces the binary multiplication operator ⊗. The AGGREGATE function
is a permutation-invariant function over sets that replaces the n-ary summation operator ⊕. Note
that one may alternatively define AGGREGATE as the commutative binary operator ⊕ and apply it
to a sequence of messages. However, this will make the parameterization more complicated.

Algorithm 1 Neural Bellman-Ford Networks
Input: source node u, query relation q, #layers T
Output: pair representations hq(u, v) for all v ∈ V
1: for v ∈ V do                                              ▷ Boundary condition
2:     hv^(0) ← INDICATOR(u, v, q)
3: end for
4: for t ← 1 to T do                                         ▷ Bellman-Ford iteration
5:     for v ∈ V do
6:         Mv^(t) ← {hv^(0)}                                  ▷ Message augmentation
7:         for (x, r, v) ∈ E(v) do
8:             m(x,r,v)^(t) ← MESSAGE^(t)(hx^(t−1), wq(x, r, v))
9:             Mv^(t) ← Mv^(t) ∪ {m(x,r,v)^(t)}
10:        end for
11:        hv^(t) ← AGGREGATE^(t)(Mv^(t))
12:    end for
13: end for
14: return hv^(T) as hq(u, v) for all v ∈ V

Now consider the generalized Bellman-Ford algorithm for a given entity u and relation q. In this
context, we abbreviate hq^(t)(u, v) as hv^(t), i.e., a representation on entity v in the t-th iteration. It
should be stressed that hv^(t) is still a pair representation, rather than a node representation. By
substituting the neural functions into Equation 3 and 4, we get our Neural Bellman-Ford Networks.

hv^(0) ← INDICATOR(u, v, q)    (5)

hv^(t) ← AGGREGATE({MESSAGE(hx^(t−1), wq(x, r, v)) | (x, r, v) ∈ E(v)} ∪ {hv^(0)})    (6)

NBFNet can be interpreted as a novel GNN framework for learning pair representations. Compared
to common GNN frameworks [32, 48] that compute the pair representation as two independent node
representations hq (u) and hq (v), NBFNet initializes a representation on the source node u, and
reads out the pair representation on the target node v. Intuitively, our framework can be viewed as a
source-specific message passing process, where every node learns a representation conditioned on
the source node. The pseudo code of NBFNet is outlined in Algorithm 1.
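As a rough illustration of Algorithm 1, the PyTorch sketch below runs the source-conditioned message passing with a TransE-style additive MESSAGE, a sum AGGREGATE followed by a linear layer and ReLU, and per-layer relation embeddings as a simplified edge representation. All names and shapes are our own simplifications; the released NBFNet implementation differs in many details.

import torch
from torch import nn

class NBFNetSketch(nn.Module):
    def __init__(self, num_relations, dim, num_layers):
        super().__init__()
        self.query = nn.Embedding(num_relations, dim)      # query embedding q, used by INDICATOR
        self.relation = nn.ModuleList(                      # per-layer edge representations
            [nn.Embedding(num_relations, dim) for _ in range(num_layers)])
        self.linear = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, edge_index, edge_type, num_nodes, source, query_rel):
        # INDICATOR: boundary condition, the query embedding on the source node only.
        h = torch.zeros(num_nodes, self.query.embedding_dim)
        h[source] = self.query.weight[query_rel]
        boundary = h.clone()
        x, v = edge_index                                    # directed edges x -> v
        for rel, lin in zip(self.relation, self.linear):
            # MESSAGE: TransE-style translation of the previous representation.
            msg = h[x] + rel(edge_type)
            # AGGREGATE: sum over incoming edges, augmented with the boundary condition.
            agg = boundary.clone()
            agg.index_add_(0, v, msg)
            h = torch.relu(lin(agg))
        return h                                             # h[v] plays the role of hq(u, v)

edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])            # toy graph: 5 nodes, 3 edges
edge_type = torch.tensor([0, 1, 0])
model = NBFNetSketch(num_relations=2, dim=8, num_layers=3)
pair_repr = model(edge_index, edge_type, num_nodes=5, source=0, query_rel=0)
print(pair_repr.shape)                                       # torch.Size([5, 8])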
Design Space. Now we discuss some principled designs for MESSAGE, AGGREGATE and
INDICATOR functions by drawing insights from traditional methods. Note the potential design space
for NBFNet is way larger than what is presented here, as one can always borrow MESSAGE and
AGGREGATE from the arsenal of message-passing GNNs [19, 16, 60, 65].
For the MESSAGE function, traditional methods instantiate it as natural summation, natural
multiplication or min over scalars. Therefore, we may use the vectorized version of summation or
multiplication. Intuitively, summation of hx^(t−1) and wq(x, r, v) can be interpreted as a translation of
hx^(t−1) by wq(x, r, v) in the pair representation space, while multiplication corresponds to scaling.
Such transformations correspond to the relational operators [18, 45] in knowledge graph embeddings
[6, 68, 58, 31, 52]. For example, translation and scaling are the relational operators used in
TransE [6] and DistMult [68] respectively. We also consider the rotation operator in RotatE [52].
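The three operators above can be written as simple vector operations; the sketch below is illustrative (names and shapes are ours), with the rotation treating the two halves of the vector as real and imaginary parts, in the spirit of RotatE.

import torch

def transe_message(h_x, w):        # translation
    return h_x + w

def distmult_message(h_x, w):      # scaling (element-wise product)
    return h_x * w

def rotate_message(h_x, w):        # rotation in the complex plane
    h_re, h_im = h_x.chunk(2, dim=-1)
    w_re, w_im = w.chunk(2, dim=-1)
    return torch.cat([h_re * w_re - h_im * w_im,
                      h_re * w_im + h_im * w_re], dim=-1)

h_x, w = torch.randn(4, 8), torch.randn(4, 8)   # 4 messages of dimension 8
print(transe_message(h_x, w).shape, distmult_message(h_x, w).shape, rotate_message(h_x, w).shape)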
The AGGREGATE function is instantiated as natural summation, max or min in traditional methods,
which are reminiscent of set aggregation functions [71, 65, 9] used in GNNs. Therefore, we specify
the AGGREGATE function to be sum, mean, or max, followed by a linear transformation and a
non-linear activation. We also consider the principal neighborhood aggregation (PNA) proposed in a
recent work [9], which jointly learns the types and scales of the aggregation function.
The INDICATOR function is aimed at providing a non-trivial representation for the source node u as
the boundary condition. Therefore, we learn a query embedding q for 1q and define the INDICATOR
function as 1(u = v) ∗ q. Note it is also possible to additionally learn an embedding for 0q. However,
we find a single query embedding works better in practice.
The edge representations are instantiated as transition probabilities or lengths in traditional methods.
We notice that an edge may have different contributions in answering different query relations.
Therefore, we parameterize the edge representations as a linear function over the query relation, i.e.,
wq(x, r, v) = Wr q + br. For homogeneous graphs or knowledge graphs with very few relations,
we simplify the parameterization to wq(x, r, v) = br to prevent overfitting. Note that one may also
parameterize wq(x, r, v) with learnable entity embeddings x and v, but such a parameterization
cannot solve the inductive setting. Similar to NeuralLP [69] and DRUM [46], we use different edge
representations for different iterations, which is able to distinguish noncommutative edges in paths,
e.g., father's mother vs. mother's father.
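A minimal sketch of this parameterization, wq(x, r, v) = Wr q + br, with one weight matrix and bias per relation type (the module name and shapes are ours):

import torch
from torch import nn

class EdgeRepresentation(nn.Module):
    def __init__(self, num_relations, dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_relations, dim, dim))   # Wr
        self.bias = nn.Parameter(torch.zeros(num_relations, dim))          # br

    def forward(self, query, edge_type):
        # query: (dim,) embedding of the query relation q; edge_type: (E,) relation ids.
        w_r = self.weight[edge_type]                                        # (E, dim, dim)
        b_r = self.bias[edge_type]                                          # (E, dim)
        return torch.einsum("eij,j->ei", w_r, query) + b_r

edge_repr = EdgeRepresentation(num_relations=3, dim=8)
q = torch.randn(8)
print(edge_repr(q, torch.tensor([0, 2, 1])).shape)                          # torch.Size([3, 8])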
Link Prediction. We now show how to apply the learned pair representations hq (u, v) to the
link prediction problem. We predict the conditional likelihood of the tail entity v as p(v|u, q) =
σ(f (hq (u, v))), where σ(·) is the sigmoid function and f (·) is a feed-forward neural network. The
conditional likelihood of the head entity u can be predicted by p(u|v, q −1 ) = σ(f (hq−1 (v, u)))
with the same model. Following previous works [6, 52], we minimize the negative log-likelihood
of positive and negative triplets (Equation 7). The negative samples are generated according to
Partial Completeness Assumption (PCA) [14], which corrupts one of the entities in a positive triplet
to create a negative sample. For undirected graphs, we symmetrize the representations and define
pq (u, v) = σ(f (hq (u, v) + hq (v, u))). Equation 8 shows the loss for homogeneous graphs.
LKG = − log p(u, q, v) − Σ_{i=1}^{n} (1/n) log(1 − p(u'i, q, v'i))    (7)

Lhomo = − log p(u, v) − Σ_{i=1}^{n} (1/n) log(1 − p(u'i, v'i)),    (8)

where n is the number of negative samples per positive sample and (u'i, q, v'i) and (u'i, v'i) are the i-th
negative samples for knowledge graphs and homogeneous graphs, respectively.
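A sketch of the objective in Equation 7, expressed in terms of the logits f(hq(u, v)); the function name is ours, and averaging over the negatives implements the 1/n weighting.

import torch
import torch.nn.functional as F

def negative_log_likelihood(pos_logit, neg_logits):
    # pos_logit: scalar logit of the positive triplet; neg_logits: (n,) logits of negatives.
    pos_loss = F.binary_cross_entropy_with_logits(pos_logit, torch.ones_like(pos_logit))
    neg_loss = F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits))
    return pos_loss + neg_loss       # neg_loss is a mean over the n negatives by default

print(negative_log_likelihood(torch.tensor(2.5), torch.randn(32)).item())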
Time Complexity. One advantage of NBFNet is that it has a relatively low time complexity during
inference⁴. Consider a scenario where a model is required to infer the conditional likelihood of
all possible triplets p(v|u, q). We group triplets with the same condition u, q together, where each
group contains |V| triplets. For each group, we only need to execute Algorithm 1 once to get their
predictions. Since a small constant number of iterations T is enough for NBFNet to converge
(Table 6b), Algorithm 1 has a time complexity of O(|E|d + |V|d²), where d is the dimension of
representations. Therefore, the amortized time complexity for a single triplet is O(|E|d/|V| + d²). For a
detailed derivation of the time complexity of other GNN frameworks, please refer to Appendix C.

⁴ Although the same analysis can be applied to training on a fixed number of samples, we note it is less
instructive since one can trade off samples for performance, and the trade-off varies from method to method.

4 Experiment
4.1 Experiment Setup

We evaluate NBFNet in three settings, knowledge graph completion, homogeneous graph link
prediction and inductive relation prediction. The former two are transductive settings, while the last
is an inductive setting. For knowledge graphs, we use FB15k-237 [56] and WN18RR [13]. We use
the standard transductive splits [56, 13] and inductive splits [55] of these datasets. For homogeneous
graphs, we use Cora, Citeseer and PubMed [49]. Following previous works [32, 12], we split the
edges into train/valid/test with a ratio of 85:5:10. Statistics of datasets can be found in Appendix E.
Additional experiments of NBFNet on OGB [25] datasets can be found in Appendix G.
Implementation Details. Our implementation generally follows the open source codebases of
knowledge graph completion⁵ and homogeneous graph link prediction⁶. For knowledge graphs, we
follow [69, 46] and augment each triplet ⟨u, q, v⟩ with a flipped triplet ⟨v, q⁻¹, u⟩. For homogeneous
graphs, we follow [33, 32] and augment each node u with a self loop ⟨u, u⟩. We instantiate NBFNet
with 6 layers, each with 32 hidden units. The feed-forward network f (·) is set to a 2-layer MLP with
64 hidden units. ReLU is used as the activation function for all hidden layers. We drop out edges that
directly connect query node pairs during training to encourage the model to capture longer paths and
prevent overfitting. Our model is trained on 4 Tesla V100 GPUs for 20 epochs. We select the models
based on their performance on the validation set. See Appendix F for more details.
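For reference, the hyperparameters stated in this paragraph can be collected into a single configuration; the field names below are ours and do not come from the released codebase.

nbfnet_config = {
    "num_layers": 6,                      # NBFNet layers
    "hidden_dim": 32,                     # hidden units per layer
    "mlp_layers": 2,                      # feed-forward network f(.)
    "mlp_hidden_dim": 64,
    "activation": "relu",
    "remove_query_edges": True,           # drop edges directly connecting query node pairs during training
    "augment_inverse_relations": True,    # add <v, q^-1, u> for every triplet (knowledge graphs)
    "add_self_loops": True,               # homogeneous graphs
    "num_epochs": 20,
    "num_gpus": 4,
}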
Evaluation. We follow the filtered ranking protocol [6] for knowledge graph completion. For a test
triplet ⟨u, q, v⟩, we rank it against all negative triplets ⟨u, q, v'⟩ or ⟨u', q, v⟩ that do not appear in the
knowledge graph. We report mean rank (MR), mean reciprocal rank (MRR) and HITS at N (H@N)
for knowledge graph completion. For inductive relation prediction, we follow [55] and draw 50
negative triplets for each positive triplet and use the above filtered ranking. We report HITS@10 for
inductive relation prediction. For homogeneous graph link prediction, we follow [32] and compare
the positive edges against the same number of negative edges. We report area under the receiver
operating characteristic curve (AUROC) and average precision (AP) for homogeneous graphs.
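The ranking metrics used above can be computed from the (filtered) rank of each true entity among its candidates; the sketch below is generic, and the ranks shown are made up.

import numpy as np

def ranking_metrics(ranks, hits_at=(1, 3, 10)):
    ranks = np.asarray(ranks, dtype=float)
    metrics = {"MR": ranks.mean(), "MRR": (1.0 / ranks).mean()}
    for n in hits_at:
        metrics[f"H@{n}"] = (ranks <= n).mean()
    return metrics

print(ranking_metrics([1, 3, 2, 15, 1, 120]))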
Baselines. We compare NBFNet against path-based methods, embedding methods, and GNNs. These
include 11 baselines for knowledge graph completion, 10 baselines for homogeneous graph link
prediction and 4 baselines for inductive relation prediction. Note the inductive setting only includes
path-based methods and GNNs, since existing embedding methods cannot handle this setting.

4.2 Main Results

Table 3 summarizes the results on knowledge graph completion. NBFNet significantly outperforms
existing methods on all metrics and both datasets. NBFNet achieves an average relative gain of
21% in HITS@1 compared to the best path-based method, DRUM [46], on two datasets. Since
DRUM is a special instance of NBFNet with natural summation and multiplication operators, this
indicates the importance of learning M ESSAGE and AGGREGATE functions in NBFNet. NBFNet
also outperforms the best embedding method, LowFER [1], with an average relative performance
gain of 18% in HITS@1 on two datasets. Meanwhile, NBFNet requires far fewer parameters than
embedding methods. NBFNet only uses 3M parameters on FB15k-237, while TransE needs 30M
parameters. See Appendix D for details on the number of parameters.
Table 4 shows the results on homogeneous graph link prediction. NBFNet gets the best results on
Cora and PubMed, while achieving competitive results on CiteSeer. Note CiteSeer is extremely
sparse (Appendix E), which makes it hard to learn good representations with NBFNet. One thing
to note here is that unlike other GNN methods, NBFNet does not use the node features provided by
⁵ https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding. MIT license.
⁶ https://github.com/tkipf/gae. MIT license.

Table 3: Knowledge graph completion results. Results of NeuralLP and DRUM are taken from [46].
Results of RotatE, HAKE and LowFER are taken from their original papers [52, 76, 1]. Results of
the other embedding methods are taken from [52]. Since GraIL has scalability issues in this setting,
we evaluate it with 50 and 100 negative triplets for FB15k-237 and WN18RR respectively and report
MR based on an unbiased estimation.
Class / Method | FB15k-237: MR, MRR, H@1, H@3, H@10 | WN18RR: MR, MRR, H@1, H@3, H@10
Path Ranking [35] 3521 0.174 0.119 0.186 0.285 22438 0.324 0.276 0.360 0.406
Path-based NeuralLP [69] - 0.240 - - 0.362 - 0.435 0.371 0.434 0.566
DRUM [46] - 0.343 0.255 0.378 0.516 - 0.486 0.425 0.513 0.586
TransE [6] 357 0.294 - - 0.465 3384 0.226 - - 0.501
DistMult [68] 254 0.241 0.155 0.263 0.419 5110 0.43 0.39 0.44 0.49
ComplEx [58] 339 0.247 0.158 0.275 0.428 5261 0.44 0.41 0.46 0.51
Embeddings
RotatE [52] 177 0.338 0.241 0.375 0.553 3340 0.476 0.428 0.492 0.571
HAKE [76] - 0.346 0.250 0.381 0.542 - 0.497 0.452 0.516 0.582
LowFER [1] - 0.359 0.266 0.396 0.544 - 0.465 0.434 0.479 0.526
RGCN [48] 221 0.273 0.182 0.303 0.456 2719 0.402 0.345 0.437 0.494
GNNs GraIL [55] 2053 - - - - 2539 - - - -
NBFNet 114 0.415 0.321 0.454 0.599 636 0.551 0.497 0.573 0.666

Table 4: Homogeneous graph link prediction results. Results of VGAE and S-VGAE are taken from
their original papers [32, 12].
Class / Method | Cora: AUROC, AP | Citeseer: AUROC, AP | PubMed: AUROC, AP
Katz Index [30] 0.834 0.889 0.768 0.810 0.757 0.856
Path-based Personalized PageRank [42] 0.845 0.899 0.762 0.814 0.763 0.860
SimRank [28] 0.838 0.888 0.755 0.805 0.743 0.829
DeepWalk [43] 0.831 0.850 0.805 0.836 0.844 0.841
Embeddings LINE [53] 0.844 0.876 0.791 0.826 0.849 0.888
node2vec [17] 0.872 0.879 0.838 0.868 0.891 0.914
VGAE [32] 0.914 0.926 0.908 0.920 0.944 0.947
S-VGAE [12] 0.941 0.941 0.947 0.952 0.960 0.960
GNNs SEAL [73] 0.933 0.942 0.905 0.924 0.978 0.979
TLC-GNN [67] 0.934 0.931 0.909 0.916 0.970 0.968
NBFNet 0.956 0.962 0.923 0.936 0.983 0.982

Table 5: Inductive relation prediction results (HITS@10). v1-v4 correspond to the 4 standard
versions of inductive splits. Results of compared methods are taken from [55].
Class / Method | FB15k-237: v1, v2, v3, v4 | WN18RR: v1, v2, v3, v4
NeuralLP [69] 0.529 0.589 0.529 0.559 0.744 0.689 0.462 0.671
Path-based DRUM [46] 0.529 0.587 0.529 0.559 0.744 0.689 0.462 0.671
RuleN [39] 0.498 0.778 0.877 0.856 0.809 0.782 0.534 0.716
GraIL [55] 0.642 0.818 0.828 0.893 0.825 0.787 0.584 0.734
GNNs
NBFNet 0.834 0.949 0.951 0.960 0.948 0.905 0.893 0.890

the datasets but is still able to outperform most other methods. We leave how to effectively combine
node features and structural representations for link prediction as our future work.
Table 5 summarizes the results on inductive relation prediction. On all inductive splits of two datasets,
NBFNet achieves the best result. NBFNet outperforms the previous best method, GraIL [55], with an
average relative performance gain of 22% in HITS@10. Note that GraIL explicitly encodes the local
subgraph surrounding each node pair and has a high time complexity (Appendix C). Usually, GraIL
can at most encode a 2-hop subgraph, while our NBFNet can efficiently explore longer paths.

4.3 Ablation Study

MESSAGE & AGGREGATE Functions. Table 6a shows the results of different MESSAGE and
AGGREGATE functions. Generally, NBFNet benefits from advanced embedding methods (DistMult,
RotatE > TransE) and aggregation functions (PNA > sum, mean, max). Among simple AGGREGATE
functions (sum, mean, max), combinations of MESSAGE and AGGREGATE functions (TransE & max,
DistMult & sum) that satisfy the semiring assumption⁷ of the generalized Bellman-Ford algorithm
achieve locally optimal performance. PNA significantly improves over simple counterparts, which
highlights the importance of learning more powerful AGGREGATE functions.
Number of GNN Layers. Table 6b compares the results of NBFNet with different numbers of layers.
Although it has been reported that GNNs with deep layers often result in significant performance
drop [36, 77], we observe NBFNet does not have this issue. The performance increases monotonically
with more layers, hitting a saturation after 6 layers. We conjecture the reason is that longer paths
have negligible contribution, and paths not longer than 6 are enough for link prediction.
Performance by Relation Category. We break down the performance of NBFNet by the categories
of query relations: one-to-one, one-to-many, many-to-one and many-to-many8 . Table 6c shows the
prediction results for each category. It is observed that NBFNet not only improves on easy one-to-one
cases, but also on hard cases where there are multiple true answers for the query.
Table 6: Ablation studies of NBFNet on FB15k-237. Due to space constraints, we only report MRR
here. For full results on all metrics, please refer to Appendix H.
(a) Different MESSAGE and AGGREGATE functions.

MESSAGE       | AGGREGATE: Sum    Mean   Max    PNA [9]
TransE [6]    |            0.297  0.310  0.377  0.383
DistMult [68] |            0.388  0.384  0.374  0.415
RotatE [52]   |            0.392  0.376  0.385  0.414

(b) Different numbers of layers.

Method | #Layers (T): 2      4      6      8
NBFNet |              0.345  0.409  0.415  0.416

(c) Performance w.r.t. relation category. The two scores are the rankings over heads and tails respectively.

Method      | 1-to-1      | 1-to-N      | N-to-1      | N-to-N
TransE [6]  | 0.498/0.488 | 0.455/0.071 | 0.079/0.744 | 0.224/0.330
RotatE [52] | 0.487/0.484 | 0.467/0.070 | 0.081/0.747 | 0.234/0.338
NBFNet      | 0.578/0.600 | 0.499/0.122 | 0.165/0.790 | 0.348/0.456

4.4 Path Interpretations of Predictions

One advantage of NBFNet is that we can interpret its predictions through paths, which may be
important for users to understand and debug the model. Intuitively, the interpretations should contain
paths that contribute most to the prediction p(u, q, v). Following local interpretation methods [3, 72],
we approximate the local landscape of NBFNet with a linear model over the set of all paths, i.e.,
1st-order Taylor polynomial. We define the importance of a path as its weight in the linear model,
which can be computed by the partial derivative of the prediction w.r.t. the path. Formally, the top-k
path interpretations for p(u, q, v) are defined as

P1, P2, ..., Pk = top-k_{P∈Puv} ∂p(u, q, v) / ∂P    (9)
Note this formulation generalizes the definition of logical rules [69, 46] to non-linear models. While
directly computing the importance of all paths is intractable, we approximate them with edge
importance. Specifically, the importance of each path is approximated by the sum of the importance
of edges in that path, where edge importance is obtained via auto differentiation. Then the top-k path
interpretations are equivalent to the top-k longest paths on the edge importance graph, which can be
solved by a Bellman-Ford-style beam search. Better approximation is left as a future work.
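The sketch below illustrates the second half of this procedure: given per-edge importance scores (in practice obtained by differentiating the prediction w.r.t. the edges), it finds the top-k paths with the largest summed importance using a Bellman-Ford-style beam search. The graph, scores and function name are illustrative only.

import heapq

def top_k_paths(edge_importance, source, target, k=2, num_steps=4):
    # edge_importance: dict (x, v) -> importance score; keep the k best partial paths per node.
    beams = {source: [(0.0, [source])]}
    complete = []
    for _ in range(num_steps):
        new_beams = {}
        for (x, v), imp in edge_importance.items():
            for score, path in beams.get(x, []):
                new_beams.setdefault(v, []).append((score + imp, path + [v]))
        # prune every node's beam to the k highest-scoring partial paths
        beams = {v: heapq.nlargest(k, cands) for v, cands in new_beams.items()}
        complete.extend(beams.get(target, []))
    return heapq.nlargest(k, complete)

edge_importance = {(0, 1): 0.9, (1, 3): 0.8, (0, 2): 0.4, (2, 3): 0.7, (1, 2): 0.1}
for score, path in top_k_paths(edge_importance, source=0, target=3):
    print(round(score, 2), path)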
Table 7 visualizes path interpretations from FB15k-237 test set. While users may have different
insights towards the visualization, here is our understanding. 1) In the first example, NBFNet learns
soft logical entailment, such as impersonate⁻¹ ∧ nationality ⟹ nationality and ethnicity⁻¹ ∧
distribution ⟹ nationality. 2) In the second example, NBFNet performs analogical reasoning by
leveraging the fact that Florence is similar to Rome. 3) In the last example, NBFNet extracts longer
paths, since there is no obvious connection between Pearl Harbor (film) and Japanese language.

⁷ Here the semiring is discussed under the assumption of linear activation functions. Rigorously, no combination
satisfies a semiring if we consider non-linearity in the model.
⁸ The categories are defined the same as in [63]. We compute the average number of tails per head and the average
number of heads per tail. The category is one if the average number is smaller than 1.5 and many otherwise.

Table 7: Path interpretations of predictions on FB15k-237 test set. For each query triplet, we visualize
the top-2 path interpretations and their weights. Inverse relations are denoted with a superscript ⁻¹.

Query ⟨u, q, v⟩: ⟨O. Hardy, nationality, U.S.⟩
0.243  ⟨O. Hardy, impersonate⁻¹, R. Little⟩ ∧ ⟨R. Little, nationality, U.S.⟩
0.224  ⟨O. Hardy, ethnicity⁻¹, Scottish American⟩ ∧ ⟨Scottish American, distribution, U.S.⟩

Query ⟨u, q, v⟩: ⟨Florence, vacationer, D.C. Henrie⟩
0.251  ⟨Florence, contain⁻¹, Italy⟩ ∧ ⟨Italy, capital, Rome⟩ ∧ ⟨Rome, vacationer, D.C. Henrie⟩
0.183  ⟨Florence, place live⁻¹, G.F. Handel⟩ ∧ ⟨G.F. Handel, place live, Rome⟩ ∧ ⟨Rome, vacationer, D.C. Henrie⟩

Query ⟨u, q, v⟩: ⟨Pearl Harbor (film), language, Japanese⟩
0.211  ⟨Pearl Harbor (film), film actor, C.-H. Tagawa⟩ ∧ ⟨C.-H. Tagawa, nationality, Japan⟩ ∧ ⟨Japan, country of origin, Yu-Gi-Oh!⟩ ∧ ⟨Yu-Gi-Oh!, language, Japanese⟩
0.208  ⟨Pearl Harbor (film), film actor, C.-H. Tagawa⟩ ∧ ⟨C.-H. Tagawa, nationality, Japan⟩ ∧ ⟨Japan, official language, Japanese⟩

5 Discussion and Conclusion


Limitations. There are a few limitations for NBFNet. First, the assumption of the generalized
Bellman-Ford algorithm requires the operators ⟨⊕, ⊗⟩ to satisfy a semiring. Due to the non-linear
activation functions in neural networks, this assumption does not hold for NBFNet, and we do not
have a theoretical guarantee on the loss incurred by this relaxation. Second, NBFNet is only verified
on simple edge prediction, while there are other link prediction variants, e.g., complex logical queries
with conjunctions (∧) and disjunctions (∨) [18, 45]. In the future, we would like to study how well
NBFNet approximates the path formulation, as well as apply NBFNet to other link prediction settings.
Social Impacts. Link prediction has a wide range of beneficial applications, including recommender
systems, knowledge graph completion and drug repurposing. However, there are also some potentially
negative impacts. First, NBFNet may encode the bias present in the training data, which leads to
stereotyped predictions when the model is applied to users on a social or e-commerce platform.
Second, some harmful network activities could be augmented by powerful link prediction models,
e.g., spamming, phishing, and social engineering. We expect future studies will mitigate these issues.
Conclusion. We present a representation learning framework based on paths for link prediction.
Our path formulation generalizes several traditional methods, and can be efficiently solved via the
generalized Bellman-Ford algorithm. To improve the capacity of the path formulation, we propose
NBFNet, which parameterizes the generalized Bellman-Ford algorithm with learned I NDICATOR,
M ESSAGE, AGGREGATE functions. Experiments on knowledge graphs and homogeneous graphs
show that NBFNet outperforms a wide range of methods in both transductive and inductive settings.

Acknowledgements
We would like to thank Komal Teru for discussion on inductive relation prediction, Guyue Huang for
discussion on fused message passing implementation, and Yao Lu for assistance on large-scale GPU
training. We thank Meng Qu, Chence Shi and Minghao Xu for providing feedback on our manuscript.
This project is supported by the Natural Sciences and Engineering Research Council (NSERC)
Discovery Grant, the Canada CIFAR AI Chair Program, collaboration grants between Microsoft
Research and Mila, Samsung Electronics Co., Ltd., Amazon Faculty Research Award, Tencent AI
Lab Rhino-Bird Gift Fund and a NRC Collaborative R&D Project (AI4D-CORE-06). This project
was also partially funded by IVADO Fundamental Research Project grant PRF-2019-3583139727.
The computation resource of this project is supported by Calcul Québec9 and Compute Canada10 .
⁹ https://www.calculquebec.ca/
¹⁰ https://www.computecanada.ca/

References
[1] Saadullah Amin, Stalin Varanasi, Katherine Ann Dunfield, and Günter Neumann. Lowfer: Low-
rank bilinear pooling for link prediction. In International Conference on Machine Learning,
pages 257–268. PMLR, 2020.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint
arXiv:1607.06450, 2016.
[3] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and
Klaus-Robert Müller. How to explain individual classification decisions. The Journal of
Machine Learning Research, 11:1803–1831, 2010.
[4] John S Baras and George Theodorakopoulos. Path problems in networks. Synthesis Lectures on
Communication Networks, 3(1):1–77, 2010.
[5] Richard Bellman. On a routing problem. Quarterly of applied mathematics, 16(1):87–90, 1958.
[6] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko.
Translating embeddings for modeling multi-relational data. In Advances in Neural Information
Processing Systems, pages 1–9, 2013.
[7] Linlin Chao, Jianshan He, Taifeng Wang, and Wei Chu. PairRE: Knowledge graph embeddings
via paired relation vectors. In Proceedings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint Conference on Natural Language
Processing (Volume 1: Long Papers), pages 4360–4369, 2021.
[8] Wenhu Chen, Wenhan Xiong, Xifeng Yan, and William Yang Wang. Variational knowledge
graph reasoning. In Proceedings of the 2018 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long
Papers), pages 1823–1832, 2018.
[9] Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, and Petar Veličković. Principal
neighbourhood aggregation for graph nets. volume 33, 2020.
[10] Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan Durugkar, Akshay
Krishnamurthy, Alex Smola, and Andrew McCallum. Go for a walk and arrive at the answer:
Reasoning over paths in knowledge bases using reinforcement learning. In International
Conference on Learning Representations, 2018.
[11] Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum. Chains of reasoning
over entities, relations, and text using recurrent neural networks. In Proceedings of the 15th
Conference of the European Chapter of the Association for Computational Linguistics: Volume
1, Long Papers, pages 132–141, Valencia, Spain, April 2017. Association for Computational
Linguistics.
[12] Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak. Hyper-
spherical variational auto-encoders. 2018.
[13] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2d
knowledge graph embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence,
volume 32, 2018.
[14] Luis Antonio Galárraga, Christina Teflioudi, Katja Hose, and Fabian Suchanek. Amie: associa-
tion rule mining under incomplete evidence in ontological knowledge bases. In Proceedings of
the 22nd international conference on World Wide Web, pages 413–422, 2013.
[15] Matt Gardner and Tom Mitchell. Efficient and expressive knowledge base completion using
subgraph feature extraction. In Proceedings of the 2015 Conference on Empirical Methods in
Natural Language Processing, pages 1488–1498, 2015.
[16] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural
message passing for quantum chemistry. In International Conference on Machine Learning,
pages 1263–1272. PMLR, 2017.
[17] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In
Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and
data mining, pages 855–864, 2016.

[18] William L Hamilton, Payal Bajaj, Marinka Zitnik, Dan Jurafsky, and Jure Leskovec. Embedding
logical queries on knowledge graphs. In Advances in Neural Information Processing Systems,
pages 2030–2041, 2018.
[19] William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large
graphs. In Advances in Neural Information Processing Systems, pages 1025–1035, 2017.
[20] Zhen Han, Peng Chen, Yunpu Ma, and Volker Tresp. xerte: Explainable reasoning on temporal
knowledge graphs for forecasting future links. 2021.
[21] Udo Hebisch and Hanns Joachim Weinert. Semirings: algebraic theory and applications in
computer science, volume 5. World Scientific, 1998.
[22] Marcel Hildebrandt, Jorge Andres Quintero Serna, Yunpu Ma, Martin Ringsquandl, Mitchell
Joblin, and Volker Tresp. Reasoning on knowledge graphs with debate dynamics. In Proceedings
of the AAAI Conference on Artificial Intelligence, volume 34, pages 4123–4131, 2020.
[23] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
[24] Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. Ogb-
lsc: A large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430,
2021.
[25] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele
Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs.
arXiv preprint arXiv:2005.00687, 2020.
[26] Guyue Huang, Guohao Dai, Yu Wang, and Huazhong Yang. Ge-spmm: General-purpose
sparse matrix-matrix multiplication on gpus for graph neural networks. In SC20: International
Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–12.
IEEE, 2020.
[27] Vassilis N Ioannidis, Da Zheng, and George Karypis. Few-shot link prediction via graph neural
networks for covid-19 drug-repurposing. arXiv preprint arXiv:2007.10261, 2020.
[28] Glen Jeh and Jennifer Widom. Simrank: a measure of structural-context similarity. In Proceed-
ings of the eighth ACM SIGKDD international conference on Knowledge discovery and data
mining, pages 538–543, 2002.
[29] Glen Jeh and Jennifer Widom. Scaling personalized web search. In Proceedings of the 12th
international conference on World Wide Web, pages 271–279, 2003.
[30] Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43,
1953.
[31] Seyed Mehran Kazemi and David Poole. Simple embedding for link prediction in knowledge
graphs. In Advances in Neural Information Processing Systems, pages 4289–4300, 2018.
[32] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint
arXiv:1611.07308, 2016.
[33] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional
networks. In International Conference on Learning Representations, 2017.
[34] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recom-
mender systems. Computer, 42(8):30–37, 2009.
[35] Ni Lao and William W Cohen. Relational retrieval using a combination of path-constrained
random walks. Machine learning, 81(1):53–67, 2010.
[36] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks
for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence,
volume 32, 2018.
[37] David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks.
Journal of the American society for information science and technology, 58(7):1019–1031,
2007.
[38] Xi Victoria Lin, Richard Socher, and Caiming Xiong. Multi-hop knowledge graph reasoning
with reward shaping. In Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing, EMNLP 2018, Brussels, Belgium, October 31-November 4, 2018, 2018.

[39] Christian Meilicke, Manuel Fink, Yanjie Wang, Daniel Ruffinelli, Rainer Gemulla, and Heiner
Stuckenschmidt. Fine-grained evaluation of rule-and embedding-based systems for knowledge
graph completion. In International Semantic Web Conference, pages 3–20. Springer, 2018.
[40] Arvind Neelakantan, Benjamin Roth, and Andrew McCallum. Compositional vector space
models for knowledge base completion. In Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and the 7th International Joint Conference on Natural
Language Processing (Volume 1: Long Papers), pages 156–166, Beijing, China, July 2015.
Association for Computational Linguistics.
[41] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of
relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33,
2015.
[42] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation
ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
[43] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social repre-
sentations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 701–710, 2014.
[44] Meng Qu, Junkun Chen, Louis-Pascal Xhonneux, Yoshua Bengio, and Jian Tang. Rnnlogic:
Learning logic rules for reasoning on knowledge graphs. In International Conference on
Learning Representations, 2021.
[45] H Ren, W Hu, and J Leskovec. Query2box: Reasoning over knowledge graphs in vector space
using box embeddings. In International Conference on Learning Representations, 2020.
[46] Ali Sadeghian, Mohammadreza Armandpour, Patrick Ding, and Daisy Zhe Wang. Drum:
End-to-end differentiable rule mining on knowledge graphs. volume 32, pages 15347–15357,
2019.
[47] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini.
The graph neural network model. IEEE transactions on neural networks, 20(1):61–80, 2008.
[48] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max
Welling. Modeling relational data with graph convolutional networks. In European semantic
web conference, pages 593–607. Springer, 2018.
[49] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-
Rad. Collective classification in network data. AI magazine, 29(3):93–93, 2008.
[50] Yelong Shen, Jianshu Chen, Po-Sen Huang, Yuqing Guo, and Jianfeng Gao. M-walk: learning to
walk over graphs using monte carlo tree search. In Advances in Neural Information Processing
Systems, pages 6787–6798, 2018.
[51] Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S Yu, and Tianyi Wu. Pathsim: Meta path-based
top-k similarity search in heterogeneous information networks. volume 4, pages 992–1003.
VLDB Endowment, 2011.
[52] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. Rotate: Knowledge graph em-
bedding by relational rotation in complex space. In International Conference on Learning
Representations, 2019.
[53] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-
scale information network embedding. In Proceedings of the 24th international conference on
World Wide Web, pages 1067–1077, 2015.
[54] Yun Tang, Jing Huang, Guangtao Wang, Xiaodong He, and Bowen Zhou. Orthogonal relation
transforms with graph context modeling for knowledge graph embedding. In Proceedings of the
58th Annual Meeting of the Association for Computational Linguistics, pages 2713–2722, 2020.
[55] Komal Teru, Etienne Denis, and Will Hamilton. Inductive relation prediction by subgraph
reasoning. In International Conference on Machine Learning, pages 9448–9457. PMLR, 2020.
[56] Kristina Toutanova and Danqi Chen. Observed versus latent features for knowledge base and
text inference. In Proceedings of the 3rd workshop on continuous vector space models and their
compositionality, pages 57–66, 2015.

[57] Kristina Toutanova, Xi Victoria Lin, Wen-tau Yih, Hoifung Poon, and Chris Quirk. Composi-
tional learning of embeddings for relation paths in knowledge base and text. In Proceedings of
the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), pages 1434–1444, 2016.
[58] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard.
Complex embeddings for simple link prediction. In International Conference on Machine
Learning, pages 2071–2080. PMLR, 2016.
[59] Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, and Partha Talukdar. Composition-based
multi-relational graph convolutional networks. In International Conference on Learning Repre-
sentations, 2020.
[60] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua
Bengio. Graph attention networks. In International Conference on Learning Representations,
2018.
[61] Andrew Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding
algorithm. IEEE transactions on Information Theory, 13(2):260–269, 1967.
[62] Hongwei Wang, Hongyu Ren, and Jure Leskovec. Relational message passing for knowledge
graph completion. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge
Discovery & Data Mining, pages 1697–1707, 2021.
[63] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by
translating on hyperplanes. In Proceedings of the AAAI Conference on Artificial Intelligence,
volume 28, 2014.
[64] Wenhan Xiong, Thien Hoang, and William Yang Wang. Deeppath: A reinforcement learning
method for knowledge graph reasoning. In Proceedings of the 2017 Conference on Empirical
Methods in Natural Language Processing (EMNLP 2017), Copenhagen, Denmark, September
2017. ACL.
[65] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural
networks? In International Conference on Learning Representations, 2019.
[66] Xiaoran Xu, Wei Feng, Yunsheng Jiang, Xiaohui Xie, Zhiqing Sun, and Zhi-Hong Deng.
Dynamically pruned message passing networks for large-scale knowledge graph reasoning. In
International Conference on Learning Representations, 2019.
[67] Zuoyu Yan, Tengfei Ma, Liangcai Gao, Zhi Tang, and Chao Chen. Link prediction with
persistent homology: An interactive view. In International Conference on Machine Learning,
pages 11659–11669. PMLR, 2021.
[68] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities
and relations for learning and inference in knowledge bases. In International Conference on
Learning Representations, 2015.
[69] Fan Yang, Zhilin Yang, and William W Cohen. Differentiable learning of logical rules for
knowledge base reasoning. In Advances in Neural Information Processing Systems, pages
2316–2325, 2017.
[70] Jiaxuan You, Jonathan M Gomes-Selman, Rex Ying, and Jure Leskovec. Identity-aware graph
neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35,
pages 10737–10745, 2021.
[71] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov,
and Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems, volume 30, 2017.
[72] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In
European conference on computer vision, pages 818–833. Springer, 2014.
[73] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Advances in
Neural Information Processing Systems, volume 31, pages 5165–5175, 2018.
[74] Muhan Zhang, Pan Li, Yinglong Xia, Kai Wang, and Long Jin. Revisiting graph neural networks
for link prediction. arXiv preprint arXiv:2010.16103, 2020.
[75] Yongqi Zhang, Quanming Yao, Wenyuan Dai, and Lei Chen. Autosf: Searching scoring
functions for knowledge graph embedding. In 2020 IEEE 36th International Conference on
Data Engineering (ICDE), pages 433–444. IEEE, 2020.
[76] Zhanqiu Zhang, Jianyu Cai, Yongdong Zhang, and Jie Wang. Learning hierarchy-aware
knowledge graph embeddings for link prediction. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 34, pages 3065–3072, 2020.
[77] Lingxiao Zhao and Leman Akoglu. Pairnorm: Tackling oversmoothing in gnns. In International
Conference on Learning Representations, 2019.
[78] Zhaocheng Zhu, Shizhen Xu, Meng Qu, and Jian Tang. Graphvite: A high-performance cpu-gpu
hybrid system for node embedding. In The World Wide Web Conference, pages 2494–2504,
2019.
A Path Formulations for Traditional Methods
Here we demonstrate that our path formulation is capable of modeling traditional link prediction methods
such as the Katz index [30], personalized PageRank [42] and graph distance [37], as well as graph theory
algorithms such as widest path [4] and most reliable path [4].
Recall the path formulation is defined as

    h_q(u, v) = h_q(P_1) ⊕ h_q(P_2) ⊕ ... ⊕ h_q(P_{|P_uv|}) |_{P_i ∈ P_uv} ≜ ⊕_{P ∈ P_uv} h_q(P)    (1)

    h_q(P = (e_1, e_2, ..., e_{|P|})) = w_q(e_1) ⊗ w_q(e_2) ⊗ ... ⊗ w_q(e_{|P|}) ≜ ⊗_{i=1}^{|P|} w_q(e_i)    (2)

which can be written in the following compact form

    h_q(u, v) = ⊕_{P ∈ P_uv} ⊗_{i=1}^{|P|} w_q(e_i)    (10)

A.1 Katz Index

The Katz index for a pair of nodes u, v is defined as a weighted count of paths between u and v,
penalized by an attenuation factor β ∈ (0, 1). Formally, it can be written as

    Katz(u, v) = Σ_{t=1}^{∞} β^t e_u^⊤ A^t e_v    (11)

where A denotes the adjacency matrix and e_u, e_v denote the one-hot vectors for nodes u and v respectively.
The term e_u^⊤ A^t e_v counts all paths of length t between u and v, and shorter paths are assigned
larger weights.

Theorem 1 Katz index is a path formulation with ⊕ = +, ⊗ = × and wq (e) = βwe .

Proof. We show that Katz(u, v) can be transformed into a summation over all paths between u and
v, where each path is represented by a product of damped edge weights in the path. Mathematically,
it can be derived as
    Katz(u, v) = Σ_{t=1}^{∞} β^t Σ_{P ∈ P_uv : |P| = t} Π_{e ∈ P} w_e    (12)
               = Σ_{P ∈ P_uv} Π_{e ∈ P} (β·w_e)    (13)

Therefore, the Katz index can be viewed as a path formulation with the summation operator +, the
multiplication operator × and the edge representations β·w_e. □
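As a concrete sanity check of Theorem 1, the sketch below (our own illustration, not code from the paper) compares the truncated power series Σ_{t ≤ L} β^t A^t against an explicit enumeration of walks of length at most L, each weighted by the product of damped edge weights β·w_e. The toy graph, the attenuation factor β and the truncation length L are arbitrary assumptions.

import numpy as np

# Illustrative check of the path-formulation view of the Katz index.
# Both functions compute the truncated Katz index; they agree up to floating-point error.
beta = 0.1
A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])  # unweighted toy graph, so every w_e = 1

def katz_power_series(A, beta, max_len):
    # sum_{t=1}^{max_len} beta^t * A^t
    out, At = np.zeros_like(A), np.eye(len(A))
    for t in range(1, max_len + 1):
        At = At @ A
        out += beta ** t * At
    return out

def katz_path_sum(A, beta, u, v, max_len):
    # sum over walks from u to v of the product of damped edge weights beta * w_e
    total = 0.0
    def dfs(node, length, weight):
        nonlocal total
        if length > 0 and node == v:
            total += weight
        if length == max_len:
            return
        for nxt in range(len(A)):
            if A[node, nxt] > 0:
                dfs(nxt, length + 1, weight * beta * A[node, nxt])
    dfs(u, 0, 1.0)
    return total

print(katz_power_series(A, beta, 5)[0, 2])  # power-series view
print(katz_path_sum(A, beta, 0, 2, 5))      # path-formulation view, same value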

A.2 Personalized PageRank

The personalized PageRank (PPR) for u computes the stationary distribution over nodes generated by
an infinite random walker, where the walker moves to a neighbor node with probability α and returns
to the source node u with probability 1 − α at each step. The probability of a node v from a source
node u has the following closed-form solution [29]

    PPR(u, v) = (1 − α) Σ_{t=1}^{∞} α^t e_u^⊤ (D^{−1} A)^t e_v    (14)

where D is the degree matrix and D^{−1}A is the (random walk) normalized adjacency matrix. Note
that e_u^⊤ (D^{−1}A)^t e_v computes the probability of t-step random walks from u to v.

Theorem 2 Personalized PageRank is a path formulation with ⊕ = +, ⊗ = × and
w_q(e) = α·w_uv / Σ_{v′ ∈ N(u)} w_uv′.
Proof. We omit the coefficient 1 − α, since it is always positive and has no effect on the ranking of
different node pairs. Then we have

X X Y wab
PPR(u, v) ∝ αt P (15)
t=1 b0 ∈N (a) wab0
P ∈Puv :|P |=t (a,b)∈P
X Y αwab
= P (16)
b0 ∈N (a) wab0
P ∈Puv (a,b)∈P

where the summation operator is +, the multiplication operator is × and the edge representations are
random walk probabilities scaled by α. □

A.3 Graph Distance

Graph distance (GD) is defined as the minimum length of all paths between u and v.

Theorem 3 Graph distance is a path formulation with ⊕ = min, ⊗ = + and wq (e) = we .

Proof. Since the length of a path is the sum of edge lengths in the path, we have
    GD(u, v) = min_{P ∈ P_uv} Σ_{e ∈ P} w_e    (17)

Here the summation operator is min, the multiplication operator is + and the edge representations
are the lengths of edges. □

A.4 Widest Path

The widest path (WP), also known as the maximum capacity path, is aimed at finding a path between
two given nodes, such that the path maximizes the minimum edge weight in the path.

Theorem 4 Widest path is a path formulation with ⊕ = max, ⊗ = min and wq (e) = we .

Proof. Given two nodes u and v, we can write the widest path as
    WP(u, v) = max_{P ∈ P_uv} min_{e ∈ P} w_e    (18)

Here the summation operator is max, the multiplication operator is min and the edge representations
are plain edge weights. □

A.5 Most Reliable Path

For a graph with non-negative edge probabilities, the most reliable path (MRP) is the path with
maximal probability from a start node to an end node. This is also the objective solved by the Viterbi
algorithm [61] used in maximum a posteriori (MAP) inference of hidden Markov models (HMMs).

Theorem 5 Most reliable path is a path formulation with ⊕ = max, ⊗ = × and wq (e) = we .

Proof. For a start node u and an end node v, the probability of their most reliable path is

    MRP(u, v) = max_{P ∈ P_uv} Π_{e ∈ P} w_e    (19)

Here the summation operator is max, the multiplication operator is × and the edge representations
are edge probabilities. □

B Generalized Bellman-Ford Algorithm


First, we prove that the path formulation can be efficiently solved by the generalized Bellman-Ford
algorithm when the operators ⟨⊕, ⊗⟩ satisfy a semiring. Then, we show that traditional methods
satisfy the semiring assumption and therefore can be solved by the generalized Bellman-Ford
algorithm.
B.1 Preliminaries on Semirings

Semirings are algebraic structures with two operators, summation ⊕ and multiplication ⊗, that share
similar properties with the natural summation and the natural multiplication defined on integers.
Specifically, ⊕ should be commutative, associative and have an identity element ⓪. ⊗ should be
associative and have an identity element ①. Mathematically, the summation ⊕ satisfies

• Commutative Property. a ⊕ b = b ⊕ a
• Associative Property. (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
• Identity Element. a ⊕ ⓪ = a

The multiplication ⊗ satisfies

• Associative Property. (a ⊗ b) ⊗ c = a ⊗ (b ⊗ c)
• Absorption Property. a ⊗ ⓪ = ⓪ ⊗ a = ⓪
• Identity Element. a ⊗ ① = ① ⊗ a = a

Additionally, ⊗ should be distributive over ⊕.

• Distributive Property (Left). a ⊗ (b ⊕ c) = (a ⊗ b) ⊕ (a ⊗ c)


• Distributive Property (Right). (b ⊕ c) ⊗ a = (b ⊗ a) ⊕ (c ⊗ a)

Note semirings differ from natural arithmetic operators in two aspects. First, the summation operator
⊕ does not need to be invertible, e.g., min or max. Second, the multiplication operator ⊗ does not
need to be commutative nor invertible, e.g., matrix multiplication.
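As an illustration (ours, not part of the original derivation), the snippet below numerically checks these axioms for the ⟨min, +⟩ pair used by the graph distance formulation in Appendix A.3, with ⓪ = +∞ and ① = 0. The sampled values are arbitrary; the other operator pairs in Appendix A can be checked in the same way.

import itertools
import math
import random

# Check the semiring axioms for <min, +> with ZERO = +inf and ONE = 0.
# Integer samples are used so that + is exactly associative.
ZERO, ONE = math.inf, 0
add = min                      # semiring summation ⊕
mul = lambda a, b: a + b       # semiring multiplication ⊗

samples = [random.randint(0, 10) for _ in range(15)] + [ZERO, ONE]
for a, b, c in itertools.product(samples, repeat=3):
    assert add(a, b) == add(b, a)                            # commutativity of ⊕
    assert add(add(a, b), c) == add(a, add(b, c))            # associativity of ⊕
    assert add(a, ZERO) == a                                 # identity of ⊕
    assert mul(mul(a, b), c) == mul(a, mul(b, c))            # associativity of ⊗
    assert mul(a, ZERO) == mul(ZERO, a) == ZERO              # absorption
    assert mul(a, ONE) == mul(ONE, a) == a                   # identity of ⊗
    assert mul(a, add(b, c)) == add(mul(a, b), mul(a, c))    # left distributivity
    assert mul(add(b, c), a) == add(mul(b, a), mul(c, a))    # right distributivity
print("all semiring axioms hold on the sampled values")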

B.2 Generalized Bellman-Ford Algorithm for Path Formulation

Now we prove that the generalized Bellman-Ford algorithm computes the path formulation when
the operators ⟨⊕, ⊗⟩ satisfy a semiring. It should be stressed that the generalized Bellman-Ford
algorithm for path problems has been proved in [4] and is not a contribution of this paper. Here we
apply the proof to our proposed path formulation.
The generalized Bellman-Ford algorithm computes the following iterations for all v ∈ V

    h_q^(0)(u, v) ← 1_q(u = v)    (3)

    h_q^(t)(u, v) ← (⊕_{(x,r,v) ∈ E(v)} h_q^(t−1)(u, x) ⊗ w_q(x, r, v)) ⊕ h_q^(0)(u, v)    (4)

Lemma 1 After t Bellman-Ford iterations, the intermediate representation h_q^(t)(u, v) aggregates all
path representations within a length of t edges for all v. That is

    h_q^(t)(u, v) = ⊕_{P ∈ P_uv : |P| ≤ t} ⊗_{i=1}^{|P|} w_q(e_i)    (20)

Proof. We prove Lemma 1 by induction. For the base case t = 0, there is a single path of length
0 from u to itself and no path to other nodes. Due to the product definition of path representations,
a path of length 0 is equal to the multiplication identity ①_q. Similarly, a summation of
no path is equal to the summation identity ⓪_q. Therefore, we have
h_q^(0)(u, v) = 1_q(u = v) = ⊕_{P ∈ P_uv : |P| = 0} ⊗_{i=1}^{|P|} w_q(e_i).

For the inductive case t > 0, we consider the second-to-last node x in each path if the path has a
length larger than 0. To avoid overuse of brackets, we use the convention that ⊗ and ⨂ have a higher
priority than ⊕ and ⨁.
 
    h_q^(t)(u, v) = (⊕_{(x,r,v) ∈ E(v)} h_q^(t−1)(u, x) ⊗ w_q(x, r, v)) ⊕ h_q^(0)(u, v)    (21)
                  = (⊕_{(x,r,v) ∈ E(v)} (⊕_{P ∈ P_ux : |P| ≤ t−1} ⊗_{i=1}^{|P|} w_q(e_i)) ⊗ w_q(x, r, v)) ⊕ h_q^(0)(u, v)    (22)
                  = (⊕_{(x,r,v) ∈ E(v)} ⊕_{P ∈ P_ux : |P| ≤ t−1} (⊗_{i=1}^{|P|} w_q(e_i) ⊗ w_q(x, r, v))) ⊕ h_q^(0)(u, v)    (23)
                  = (⊕_{P ∈ P_uv : 1 ≤ |P| ≤ t} ⊗_{i=1}^{|P|} w_q(e_i)) ⊕ (⊕_{P ∈ P_uv : |P| = 0} ⊗_{i=1}^{|P|} w_q(e_i))    (24)
                  = ⊕_{P ∈ P_uv : |P| ≤ t} ⊗_{i=1}^{|P|} w_q(e_i),    (25)

where Equation 22 substitutes the inductive assumption for h_q^(t−1)(u, x), and Equation 23 uses the
distributive property of ⊗ over ⊕. □
By comparing Lemma 1 and Equation 10, we can see that the intermediate representation converges to
our path formulation: lim_{t→∞} h_q^(t)(u, v) = h_q(u, v). More specifically, at most |V| iterations are
required if we only consider simple paths, i.e., paths without repeating nodes. In practice, for link
prediction we find it only takes a very small number of iterations (e.g., T = 6) to converge, since
long paths make a negligible contribution to the task.
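For readers who prefer code, the following is a minimal sketch (our illustration, not the released implementation) of the iterations in Equations 3–4 with pluggable ⊕ and ⊗ operators on a single-relational toy graph. Instantiating the operators with ⟨min, +⟩, ⟨max, min⟩ and ⟨max, ×⟩ recovers graph distance, widest path and most reliable path from Appendix A; the edge list and the number of iterations are arbitrary assumptions.

import math

# Generalized Bellman-Ford iteration (Equations 3-4) on a directed toy graph.
# edges: list of (x, v, w) meaning an edge x -> v with scalar edge value w.
def generalized_bellman_ford(edges, num_nodes, source, add, mul, zero, one, T):
    # boundary condition (Equation 3): h^(0)(u, v) = ONE if v == u else ZERO
    boundary = [one if v == source else zero for v in range(num_nodes)]
    h = list(boundary)
    for _ in range(T):
        # Equation 4: aggregate messages over incoming edges, then add the boundary term
        new_h = list(boundary)
        for x, v, w in edges:
            new_h[v] = add(new_h[v], mul(h[x], w))
        h = new_h
    return h

edges = [(0, 1, 1.0), (1, 2, 4.0), (0, 2, 6.0), (2, 3, 1.0)]
n, src, T = 4, 0, 4

# graph distance: <min, +> with ZERO = +inf, ONE = 0
print(generalized_bellman_ford(edges, n, src, min, lambda a, b: a + b, math.inf, 0.0, T))
# widest path: <max, min> with ZERO = -inf, ONE = +inf
print(generalized_bellman_ford(edges, n, src, max, min, -math.inf, math.inf, T))
# most reliable path: <max, x> with ZERO = 0, ONE = 1 (edge values rescaled into [0, 1])
prob_edges = [(x, v, w / 10.0) for x, v, w in edges]
print(generalized_bellman_ford(prob_edges, n, src, max, lambda a, b: a * b, 0.0, 1.0, T))

In NBFNet, these hand-crafted operators are replaced by learned MESSAGE and AGGREGATE functions over vector representations, while the iteration structure above stays the same.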

B.3 Traditional Methods

Theorem 6 Katz index, personalized PageRank, graph distance, widest path and most reliable path
can be solved via the generalized Bellman-Ford algorithm.

Proof. Given that the generalized Bellman-Ford algorithm solves the path formulation when ⟨⊕, ⊗⟩
satisfy a semiring, we only need to show that the operators of the path formulations for traditional
methods satisfy semiring structures.
Katz index (Theorem 1) and personalized PageRank (Theorem 2) use the natural summation + and
the natural multiplication ×, which obviously satisfy a semiring.
Graph distance (Theorem 3) uses min for summation and + for multiplication. The corresponding
identities are ⓪ = +∞ and ① = 0. It is obvious that + satisfies the associative property and has
identity element 0.

• Commutative Property. min(a, b) = min(b, a)


• Associative Property. min(min(a, b), c) = min(a, min(b, c))
• Identity Element. min(a, +∞) = a
• Absorption Property. a + ∞ = ∞ + a = +∞
• Distributive Property (Left). a + min(b, c) = min(a + b, a + c)
• Distributive Property (Right). min(b, c) + a = min(b + a, c + a)

Widest path (Theorem 4) uses max for summation and min for multiplication. The corresponding
identities are ⓪ = −∞ and ① = +∞. We have

• Commutative Property. max(a, b) = max(b, a)


• Associative Property. max(max(a, b), c) = max(a, max(b, c))
• Identity Element. max(a, −∞) = a
• Associative Property. min(min(a, b), c) = min(a, min(b, c))
• Absorption Property. min(a, −∞) = min(−∞, a) = −∞
• Identity Element. min(a, +∞) = min(+∞, a) = a
• Distributive Property (Left). min(a, max(b, c)) = max(min(a, b), min(a, c))
• Distributive Property (Right). min(max(b, c), a) = max(min(b, a), min(c, a))
where the distributive property can be proved by enumerating all possible orders of a, b and c.
Most reliable path (Theorem 5) uses max for summation and × for multiplication. The corresponding
identities are ⓪ = 0 and ① = 1, since all path representations are probabilities in [0, 1]. It is obvious
that × satisfies the associative property, the absorption property and has identity element 1.
• Commutative Property. max(a, b) = max(b, a)
• Associative Property. max(max(a, b), c) = max(a, max(b, c))
• Identity Element. max(a, 0) = a
• Distributive Property (Left). a × max(b, c) = max(a × b, a × c)
• Distributive Property (Right). max(b, c) × a = max(b × a, c × a)
where the identity element and the distributive property hold for non-negative elements. □

C Time Complexity of GNN Frameworks


Here we derive the time complexity of NBFNet and other GNN frameworks.

C.1 NBFNet

Lemma 2 The time complexity of one NBFNet run (Algorithm 1) is O(T(|E|d + |V|d^2)).

Proof. We break down the time complexity by the INDICATOR, MESSAGE and AGGREGATE functions.
INDICATOR is called |V| times, and a single call to INDICATOR takes O(d) time. MESSAGE is
called T(|E| + |V|) times, and a single call to MESSAGE, i.e., a relation operator, takes O(d) time.
AGGREGATE is called T|V| times over a total of T|E| messages with d dimensions. Each call to
AGGREGATE additionally takes O(d^2) time due to the linear transformations in the function.
Therefore, the total complexity sums to O(T(|E|d + |V|d^2)). □
In practice, we find a small constant T works well for link prediction, and Lemma 2 can be reduced
to O(|E|d + |V|d^2) time.
Now consider applying NBFNet to infer the likelihood of all possible triplets. Without loss of
generality, assume we want to predict the tail entity for each head entity and relation p(v|u, q). We
group triplets with the same condition u, q together, where each group contains |V| triplets. For
triplets in a group, we only need to execute Algorithm 1 once to get their predictions. Therefore, the
amortized time for a single triplet is O(|E|d/|V| + d^2).
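As an illustrative plug-in of these bounds (our own back-of-the-envelope estimate, up to constant factors): on FB15k-237 augmented with flipped triplets we have roughly |V| ≈ 1.5 × 10^4, |E| ≈ 5.4 × 10^5 and d = 32, so the amortized cost per triplet is on the order of |E|d/|V| + d^2 ≈ 1.2 × 10^3 + 1.0 × 10^3 elementary operations, i.e., the message passing and the feed-forward scoring contribute comparably.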

C.2 VGAE / RGCN

RGCN is a message-passing GNN applied to multi-relational graphs, with the message function
being a per-relation linear transformation. VGAE can be viewed as a special case of RGCN applied
to single-relational graphs. The time complexity of RGCN is similar to Lemma 2, except that
each call to the message function takes O(d^2) time due to the linear transformation. Therefore, the
total complexity is O(T(|E|d^2 + |V|d^2)), where T refers to the number of layers in RGCN. Since
|V| ≤ |E|, the complexity reduces to O(T|E|d^2)¹¹. In practice, T is a small constant and we get
O(|E|d^2) complexity.
While directly executing RGCN once for each triplet is costly, a smart way to apply RGCN for
inference is to first compute all node representations, and then perform link prediction with the node
representations. The first step runs RGCN once for |V|^2|R| triplets, while the second step takes O(d)
time. Therefore, the amortized time for a single triplet is O(|E|d^2/(|V|^2|R|) + d). For large graphs and
reasonable choices of d, we have |E|d ≤ |V|^2|R| and the amortized time can be reduced to O(d).
¹¹ By moving the linear transformations from the message function to the aggregation function, one can also
get an implementation of RGCN with O(T|V||R|d^2) time, which is better for dense graphs but worse for sparse
graphs. For the knowledge graph datasets used in this paper, the above O(T|E|d^2) implementation is better.
C.3 NeuralLP / DRUM

DRUM can be viewed as a special case of NBFNet with MESSAGE being the Hadamard product and
AGGREGATE being natural summation. NeuralLP is a special case of DRUM where the dimension d
equals 1. Since there is no linear transformation in their AGGREGATE functions, the amortized
time complexity of the message passing part is O(T|E|d/|V|). Both DRUM and NeuralLP additionally
use an LSTM to learn the edge weights for each layer, which costs an additional O(Td^2) time for T
layers. T is small and can be ignored as in the other methods. Therefore, the amortized time of the two
parts sums to O(|E|d/|V| + d^2).

C.4 SEAL / GraIL

GraIL first extracts a local subgraph surrounding the link, and then applies RGCN to the local
subgraph. SEAL can be viewed as a special case of GraIL applied to single-relational graphs.
Therefore, their amortized time is the same as that of one RGCN run, which is O(|E|d^2).
Note that one may still run GraIL on large graphs by restricting the local subgraphs to be very
small, e.g., within the 1-hop neighborhood of the query entities. However, this will severely harm the
performance of link prediction. Moreover, most real-world graphs are small-world networks, and a
moderate radius can easily cover a non-trivial number of nodes and edges, which costs a lot of time
for GraIL.

D Number of Parameters

Table 8: Number of parameters in NBFNet. The number of parameters only grows with the number
of relations |R|, rather than the number of nodes |V| or edges |E|. For FB15k-237 augmented with
flipped triplets, |R| is 474. Our best configuration uses T = 6, d = 32 and hidden dimension
m = 64.
Component    Analytic Formula       #Parameters (FB15k-237)
INDICATOR    |R|·d                  15,168
MESSAGE      T·|R|·d·(d+1)          3,003,264
AGGREGATE    T·d·(13d+3)            80,448
f(·)         m·(2d+1)+m+1           4,225
Total                               3,103,105

One advantage of NBFNet is that it requires far fewer parameters than embedding methods. For
example, on FB15k-237, NBFNet requires 3M parameters while TransE requires 30M parameters.
Table 8 shows a breakdown of the number of parameters in NBFNet. Generally, the number of parameters
in NBFNet scales linearly w.r.t. the number of relations, regardless of the number of entities in the
graph, which makes NBFNet more parameter-efficient for large graphs.
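The analytic formulas in Table 8 can be reproduced directly; the snippet below (added here purely as an arithmetic check) plugs in T = 6, d = 32, m = 64 and |R| = 474 and recovers the reported counts.

# Reproduce the parameter counts in Table 8 for FB15k-237 with flipped triplets:
# T = 6 layers, d = 32 GNN hidden dim, m = 64 MLP hidden dim, |R| = 474 relations.
T, d, m, R = 6, 32, 64, 474

indicator = R * d                    # 15,168
message = T * R * d * (d + 1)        # 3,003,264
aggregate = T * d * (13 * d + 3)     # 80,448
mlp = m * (2 * d + 1) + m + 1        # 4,225

print(indicator, message, aggregate, mlp)
print(indicator + message + aggregate + mlp)  # 3,103,105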

E Statistics of Datasets
Dataset statistics of the two transductive settings, i.e., knowledge graph completion and homogeneous
graph link prediction, are summarized in Tables 9 and 10. Dataset statistics of inductive relation
prediction are summarized in Table 11.
We use the standard transductive splits [56, 13] and inductive splits [55] for knowledge graphs.
For homogeneous graphs, we follow previous works [32, 12] and randomly split the edges into
train/validation/test sets with a ratio of 85:5:10. All the homogeneous graphs used in this paper
are undirected. Note that for inductive relation prediction, the original paper [55] actually uses a
transductive validation set that shares the same set of fact triplets as the training set for hyperparameter
tuning. The inductive test set contains entities, query triplets and fact triplets that never appear in the
training set. The same data split is adopted in this paper for a fair comparison.
Table 9: Dataset statistics for knowledge graph completion.
Dataset          #Entity  #Relation  #Triplet (Train)  #Triplet (Validation)  #Triplet (Test)
FB15k-237 [56]   14,541   237        272,115           17,535                 20,466
WN18RR [13]      40,943   11         86,835            3,034                  3,134

Table 10: Dataset statistics for homogeneous link prediction.
Dataset        #Node   #Edge (Train)  #Edge (Validation)  #Edge (Test)
Cora [49]      2,708   4,614          271                 544
CiteSeer [49]  3,327   4,022          236                 474
PubMed [49]    19,717  37,687         2,216               4,435

Table 11: Dataset statistics for inductive relation prediction. Queries refer to the triplets that are used
as training or test labels, while facts are the triplets used as training or test inputs. In the training sets,
all queries are also provided as facts.
Dataset          Split  #Relation  Train: #Entity / #Query / #Fact   Validation: #Entity / #Query / #Fact   Test: #Entity / #Query / #Fact
FB15k-237 [55]   v1     180        1,594 / 4,245 / 4,245             1,594 / 489 / 4,245                    1,093 / 205 / 1,993
FB15k-237 [55]   v2     200        2,608 / 9,739 / 9,739             2,608 / 1,166 / 9,739                  1,660 / 478 / 4,145
FB15k-237 [55]   v3     215        3,668 / 17,986 / 17,986           3,668 / 2,194 / 17,986                 2,501 / 865 / 7,406
FB15k-237 [55]   v4     219        4,707 / 27,203 / 27,203           4,707 / 3,352 / 27,203                 3,051 / 1,424 / 11,714
WN18RR [55]      v1     9          2,746 / 5,410 / 5,410             2,746 / 630 / 5,410                    922 / 188 / 1,618
WN18RR [55]      v2     10         6,954 / 15,262 / 15,262           6,954 / 1,838 / 15,262                 2,757 / 441 / 4,011
WN18RR [55]      v3     11         12,078 / 25,901 / 25,901          12,078 / 3,097 / 25,901                5,084 / 605 / 6,327
WN18RR [55]      v4     9          3,861 / 7,940 / 7,940             3,861 / 934 / 7,940                    7,084 / 1,429 / 12,334

F Implementation Details

Table 12: Hyperparameter configurations of NBFNet on different datasets. Adv. temperature
corresponds to the temperature in self-adversarial negative sampling [52]. Note for FB15k-237 and
WN18RR, we use the same hyperparameters for their transductive and inductive settings. We find our
model configuration is robust across all datasets, therefore we only tune the learning hyperparameters
for each dataset. All the hyperparameters are chosen by the performance on the validation set.

Group      Hyperparameter            FB15k-237  WN18RR  Cora  CiteSeer  PubMed
GNN        #layer (T)                6          6       6     6         6
GNN        hidden dim.               32         32      32    32        32
MLP        #layer                    2          2       2     2         2
MLP        hidden dim.               64         64      64    64        64
Batch      #positive                 256        128     256   256       64
Batch      #negative/#positive (n)   32         32      1     1         1
Learning   optimizer                 Adam       Adam    Adam  Adam      Adam
Learning   learning rate             5e-3       5e-3    5e-3  5e-3      5e-3
Learning   #epoch                    20         20      20    20        20
Learning   adv. temperature          0.5        1       -     -         -

Our implementation generally follows the open source codebases of knowledge graph completion¹²
and homogeneous graph link prediction¹³. Table 12 lists the hyperparameter configurations for
different datasets. Table 13 shows the wall time of training and inference on different datasets.
Data Augmentation. For knowledge graphs, we follow previous works [69, 46] and augment each
triplet ⟨u, q, v⟩ with a flipped triplet ⟨v, q⁻¹, u⟩. For homogeneous graphs, we follow previous
works [33, 32] and augment each node u with a self loop ⟨u, u⟩.
Architecture Details. We apply Layer Normalization [2] and shortcut connections to accelerate
the training of NBFNet. Layer Normalization is applied after each AGGREGATE function. The
feed-forward network f(·) is instantiated as an MLP. ReLU is used as the activation function for all
hidden layers. For undirected graphs, we symmetrize the pair representation by taking the sum of
h_q(u, v) and h_q(v, u).
¹² https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding. MIT license.
¹³ https://github.com/tkipf/gae. MIT license.
Training Details. We train NBFNet on 4 Tesla V100 GPUs with standard data parallelism. During
training, we drop out edges that directly connect query node pairs to encourage the model to capture
longer paths and prevent overfitting. We select the best checkpoint for each model based on its
performance on the validation set. The selection criterion is MRR for knowledge graphs and AUROC
for homogeneous graphs.
Fused Message Passing. To reduce the memory footprint and better utilize GPU hardware, we follow the
efficient implementation of GNNs [26] and implement customized PyTorch operators that combine the
MESSAGE and AGGREGATE functions into a single operation, without creating all messages explicitly.
This reduces the memory complexity of NBFNet from O(|E|d) to O(|V|d).

Table 13: Wall time of NBFNet on different datasets and in different settings (Tables 3, 4 and 5). For
the inductive setting, the total time over the 4 split versions is reported.

Setting        Dataset     Training   Inference
Transductive   FB15k-237   9.7 hrs    4.0 mins
Transductive   WN18RR      4.4 hrs    2.4 mins
Transductive   Cora        5.5 mins   < 1 sec
Transductive   CiteSeer    5.3 mins   < 1 sec
Transductive   PubMed      5.6 hrs    25 secs
Inductive      FB15k-237   23 mins    6 secs
Inductive      WN18RR      41 mins    20 secs

G Experimental Results on OGB Datasets


To demonstrate the effectiveness of NBFNet on large-scale graphs, we additionally evaluate our
method on two knowledge graph datasets from OGB [25], ogbl-biokg and WikiKG90M. We follow
the standard evaluation protocol of OGB link property prediction, and compute the mean reciprocal
rank (MRR) of the true entity against 1,000 negative entities.

G.1 Results on ogbl-biokg

Ogbl-biokg is a large biomedical knowledge graph that contains 93,773 entities, 51 relations and
5,088,434 triplets. We compare NBFNet with 6 embedding methods on this dataset. Note that at the time
of this work, only embedding methods are available for such large-scale datasets. Table 14 shows
the results on ogbl-biokg. NBFNet achieves the best result among all methods reported on the
official leaderboard (https://ogb.stanford.edu/docs/leader_linkprop/#ogbl-biokg) with far fewer parameters.
Note that the previous best model, AutoSF, is based on architecture search and requires more
computational resources than NBFNet for training.

Table 14: Knowledge graph completion results on ogbl-biokg. Results of compared methods are
taken from the OGB leaderboard.
Class       Method         Test MRR  Validation MRR  #Params
Embeddings  TransE [6]     0.7452    0.7456          187,648,000
Embeddings  DistMult [68]  0.8043    0.8055          187,648,000
Embeddings  ComplEx [58]   0.8095    0.8105          187,648,000
Embeddings  RotatE [52]    0.7989    0.7997          187,597,000
Embeddings  AutoSF [75]    0.8309    0.8317          93,824,000
Embeddings  PairRE [7]     0.8164    0.8172          187,750,000
GNNs        NBFNet         0.8317    0.8318          734,209

G.2 Results on WikiKG90M

WikiKG90M is an extremely large dataset used in the OGB large-scale challenge [24], held at KDD Cup
2021. It is a general-purpose knowledge graph containing 87,143,637 entities, 1,315 relations and
504,220,369 triplets.
To apply NBFNet to such a large scale, we use a bidirectional breadth-first search (BFS) algorithm to
sample a local subgraph for each query. Given a query, we generate a k-hop neighborhood for each
of the head entity and the candidate tail entities, based on a BFS search. The union of all generated
neighborhoods is then collected as the sampled graph. With this sampling algorithm, any path within
a length of 2k between the head entity and any tail candidate is guaranteed to be present in the sampled
graph. See Figure 1 for an illustration. While a standard single BFS algorithm computing the 2k-hop
neighborhood of the head entity has the same guarantee, a bidirectional BFS algorithm significantly
reduces the number of nodes and edges in the sampled graph.

Figure 1: Illustration of bidirectional BFS sampling. (A) Original graph. (B) Bidirectional BFS.
(C) Sampled graph. For a head entity and multiple tail candidates, we use BFS to sample a k-hop
neighborhood around each entity, regardless of the direction of edges. The neighborhood is denoted
by dashed circles. The nodes and edges visited by the BFS algorithm are extracted to generate the
sampled graph. Best viewed in color.
We additionally downsample the neighbors when expanding the neighbors of an entity, to handle
entities with large degrees. For each entity visited during the BFS algorithm, we downsample its
outgoing neighbors and incoming neighbors to at most m entities each.
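The sketch below illustrates the described sampling strategy (k-hop BFS from the head entity and from every candidate tail, ignoring edge direction, with per-node downsampling to at most m neighbors). It is our own simplification with assumed data structures (dictionaries mapping a node to its outgoing/incoming (neighbor, relation) lists), not the code used for the challenge submission.

import random
from collections import deque

def bfs_neighborhood(adj_out, adj_in, start, k, m):
    # k-hop BFS from `start`, ignoring edge direction and keeping at most m
    # outgoing and m incoming neighbors per visited node.
    visited, frontier, edges = {start}, deque([(start, 0)]), set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        out_all = adj_out.get(node, [])
        in_all = adj_in.get(node, [])
        out_nb = random.sample(out_all, min(m, len(out_all)))
        in_nb = random.sample(in_all, min(m, len(in_all)))
        for nxt, rel in out_nb:
            edges.add((node, rel, nxt))
        for nxt, rel in in_nb:
            edges.add((nxt, rel, node))
        for nxt, _ in out_nb + in_nb:
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, depth + 1))
    return visited, edges

def sample_subgraph(adj_out, adj_in, head, tail_candidates, k=2, m=100):
    # union of the k-hop neighborhoods of the head and of every tail candidate;
    # without downsampling, any path of length <= 2k between the head and a
    # candidate tail is contained in the sampled graph
    nodes, edges = bfs_neighborhood(adj_out, adj_in, head, k, m)
    for tail in tail_candidates:
        n, e = bfs_neighborhood(adj_out, adj_in, tail, k, m)
        nodes |= n
        edges |= e
    return nodes, edges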
Table 16 shows the results of NBFNet on the WikiKG90M validation set. Our best single model uses
k = 2 and m = 100. While the validation set requires ranking the true entity against 1,000 negative
entities, in practice it is not mandatory to draw 1,000 negative samples for each positive sample
during training. We find that reducing the number of negative samples from 1,000 to 20 and increasing
the batch size from 4 to 64 provides a better result, although it creates a distribution shift between sampled
graphs in training and validation. We leave further investigation of this distribution shift to future work.

Table 15: Results of different MESSAGE and AGGREGATE functions on FB15k-237.
MESSAGE        AGGREGATE  MR   MRR    H@1    H@3    H@10
TransE [6]     Sum        191  0.297  0.217  0.321  0.453
TransE [6]     Mean       161  0.310  0.218  0.339  0.496
TransE [6]     Max        135  0.377  0.282  0.415  0.565
TransE [6]     PNA [9]    129  0.383  0.288  0.420  0.568
DistMult [68]  Sum        136  0.388  0.294  0.427  0.574
DistMult [68]  Mean       132  0.384  0.287  0.425  0.577
DistMult [68]  Max        136  0.374  0.279  0.412  0.563
DistMult [68]  PNA [9]    114  0.415  0.321  0.454  0.599
RotatE [52]    Sum        129  0.392  0.298  0.429  0.580
RotatE [52]    Mean       138  0.376  0.278  0.416  0.571
RotatE [52]    Max        139  0.385  0.290  0.423  0.572
RotatE [52]    PNA [9]    117  0.414  0.323  0.454  0.593

Table 16: Knowledge graph completion results on the WikiKG90M validation set.
Model  Single Model  6 Model Ensemble
MRR    0.924         0.930

Table 17: Results of different numbers of layers on FB15k-237.
#Layers (T)  MR   MRR    H@1    H@3    H@10
2            191  0.345  0.261  0.377  0.510
4            119  0.409  0.315  0.450  0.592
6            114  0.415  0.321  0.454  0.599
8            115  0.416  0.322  0.457  0.599

H Ablation Study
Table 15 shows the full results of different MESSAGE and AGGREGATE functions. Table 17 shows
the full results of NBFNet with different numbers of layers.
