
Provably expressive temporal graph networks

Amauri H. Souza1, Diego Mesquita2, Samuel Kaski1,3, Vikas Garg1,4

1 Aalto University   2 Getulio Vargas Foundation   3 University of Manchester   4 YaiYai Ltd
{amauri.souza, samuel.kaski}@aalto.fi, [email protected], [email protected]
arXiv:2209.15059v1 [cs.LG] 29 Sep 2022

Abstract
Temporal graph networks (TGNs) have gained prominence as models for embed-
ding dynamic interactions, but little is known about their theoretical underpinnings.
We establish fundamental results about the representational power and limits of the
two main categories of TGNs: those that aggregate temporal walks (WA-TGNs),
and those that augment local message passing with recurrent memory modules
(MP-TGNs). Specifically, novel constructions reveal the inadequacy of MP-TGNs
and WA-TGNs, proving that neither category subsumes the other. We extend the
1-WL (Weisfeiler-Leman) test to temporal graphs, and show that the most powerful
MP-TGNs should use injective updates, as in this case they become as expressive
as the temporal WL. Also, we show that sufficiently deep MP-TGNs cannot benefit
from memory, and MP/WA-TGNs fail to compute graph properties such as girth.
These theoretical insights lead us to PINT — a novel architecture that leverages
injective temporal message passing and relative positional features. Importantly,
PINT is provably more expressive than both MP-TGNs and WA-TGNs. PINT
significantly outperforms existing TGNs on several real-world benchmarks.

1 Introduction
Graph neural networks (GNNs) [11, 30, 36, 39] have recently led to breakthroughs in many applica-
tions [7, 28, 31] by resorting to message passing between neighboring nodes in input graphs. While
message passing imposes an important inductive bias, it does not account for the dynamic nature of
interactions in time-evolving graphs arising from many real-world domains such as social networks
and bioinformatics [16, 40]. In several scenarios, these temporal graphs are only given as a sequence
of timestamped events. Recently, temporal graph nets (TGNs) [16, 27, 32, 38, 42] have emerged as a
prominent learning framework for temporal graphs and have become particularly popular due to their
outstanding predictive performance. Aiming at capturing meaningful structural and temporal patterns,
TGNs combine a variety of building blocks, such as self-attention [33, 34], time encoders [15, 41],
recurrent models [5, 13], and message passing [10].
Unraveling the learning capabilities of (temporal) graph networks is imperative to understanding
their strengths and pitfalls, and designing better, more nuanced models that are both theoretically
well-grounded and practically efficacious. For instance, the enhanced expressivity of higher-order
GNNs has roots in the inadequacy of standard message-passing GNNs to separate graphs that are
indistinguishable by the Weisfeiler-Leman isomorphism test, known as 1-WL test or color refinement
algorithm [21, 22, 29, 37, 43]. Similarly, many other notable advances on GNNs were made possible
by untangling their ability to generalize [9, 17, 35], extrapolate [45], compute graph properties [4, 6, 9],
and express Boolean classifiers [1]; by uncovering their connections to distributed algorithms [19, 29],
graph kernels [8], dynamic programming [44], diffusion processes [3], graphical models [46], and
combinatorial optimization [2]; and by analyzing their discriminative power [20, 23]. In stark contrast,
the theoretical foundations of TGNs remain largely unexplored. For instance, unresolved questions
include: How does the expressive power of existing TGNs compare? When do TGNs fail? Can we
improve the expressiveness of TGNs? What are the limits on the power of TGNs?

36th Conference on Neural Information Processing Systems (NeurIPS 2022).


Figure 1: Schematic diagram and summary of our contributions: the relationship between CTDGs and DTDGs (Prop. 1); MP-TGNs with injective aggregation/update are the most expressive MP-TGNs (Prop. 2); sufficiently deep MP-TGNs do not need memory (Prop. 3); SOTA MP-TGNs (e.g., TGAT, TGN-Att) are strictly less expressive than injective MP-TGNs (Prop. 4); neither MP-TGNs nor WA-TGNs (e.g., CAW) subsume the other (Prop. 5); injective MP-TGNs are as expressive as the temporal-WL test (Prop. 6); MP-TGNs/CAWs cannot recognize graph properties (Prop. 7); construction of injective temporal message passing (Prop. 8); PINT (ours) is strictly more expressive than both MP-TGNs and WA-TGNs (Prop. 9); limitations of PINT (Prop. 10).

We establish a series of results to address these fundamental questions. We begin by showing that
discrete-time dynamic graphs (DTDGs) can always be converted to continuous-time analogues
(CTDGs) without loss of information, so we can focus on analyzing the ability of TGNs to distinguish
nodes/links of CTDGs. We consider a general framework for message-passing TGNs (MP-TGNs)
[27] that subsumes a wide variety of methods [e.g., 16, 32, 42]. We prove that equipping MP-TGNs
with injective aggregation and update functions leads to the class of most expressive anonymous
MP-TGNs (i.e., those that do not leverage node ids). Extending the color-refinement algorithm to
temporal settings, we show that these most powerful MP-TGNs are as expressive as the temporal
WL method. Notably, existing MP-TGNs do not enforce injectivity. We also delineate the role of
memory in MP-TGNs: nodes in a network with only a few layers of message passing fail to aggregate
information from a sufficiently wide receptive field (i.e., from distant nodes), so memory serves
to offset this highly local view with additional global information. In contrast, sufficiently deep
architectures obviate the need for memory modules.
Different from MP-TGNs, walk-aggregating TGNs (WA-TGNs) such as CAW [38] obtain represen-
tations from anonymized temporal walks. We provide constructions that expose shortcomings of
each framework, establishing that WA-TGNs can distinguish links in cases where MP-TGNs fail and
vice-versa. Consequently, neither class is more expressive than the other. Additionally, we show that
MP-TGNs and CAWs cannot decide temporal graph properties such as diameter, girth, or number of
cycles. Strikingly, our analysis unravels the subtle relationship between the walk computations in
CAWs and the MP steps in MP-TGNs.
Equipped with these theoretical insights, we propose PINT (short for position-encoding injective
temporal graph net), founded on a new temporal layer that leverages the strengths of both MP-TGNs
and WA-TGNs. Like the most expressive MP-TGNs, PINT defines injective message passing and
update steps. PINT also augments memory states with novel relative positional features, and these
features can replicate all the discriminative benefits available to WA-TGNs. Interestingly, the time
complexity of computing our positional features is less severe than the sampling overhead in CAW,
thus PINT can often be trained faster than CAW. Importantly, we establish that PINT is provably
more expressive than CAW as well as MP-TGNs.
Our contributions are three-fold:
• a rigorous theoretical foundation for TGNs is laid, elucidating the role of memory, benefits
of injective message passing, limits of existing TGN models, temporal extension of the
1-WL test and its implications, impossibility results about temporal graph properties, and
the relationship between main classes of TGNs — as summarized in Figure 1;
• explicit injective temporal functions are introduced, and a novel method for temporal graphs
is proposed that is provably more expressive than state-of-the-art TGNs;
• extensive empirical investigations underscore practical benefits of this work. The proposed
method is either competitive or significantly better than existing models on several real
benchmarks for dynamic link prediction, in transductive as well as inductive settings.

2 Preliminaries
We denote a static graph G as a tuple (V, E, X , E), where V = {1, 2, . . . , n} denotes the set of
nodes and E ⊆ V × V the set of edges. Each node u ∈ V has a feature vector xu ∈ X and each
edge (u, v) ∈ E has a feature vector euv ∈ E, where X and E are countable sets of features.
Dynamic graphs can be roughly split according to their discrete- or continuous-time nature [14].
A discrete-time dynamic graph (DTDG) is a sequence of graph snapshots (G1, G2, . . . ), usually
sampled at regular intervals, each snapshot being a static graph Gt = (Vt , Et , Xt , Et ).
A continuous-time dynamic graph (CTDG) evolves with node- and edge-level events, such as addition
and deletion. We represent a CTDG as a sequence of time-stamped multi-graphs (G(t0 ), G(t1 ), . . . )
such that tk < tk+1 , and G(tk+1 ) results from updating G(tk ) with all events at time tk+1 . We assume
no event occurs between tk and tk+1 . We denote an interaction (i.e., edge addition event) between
nodes u and v at time t as a tuple (u, v, t) associated with a feature vector euv (t). Unless otherwise
stated, interactions correspond to undirected edges, i.e., (u, v, t) is a shorthand for ({u, v}, t).
Noting that CTDGs allow for finer (irregular) temporal resolution, we now formalize the intuition that
DTDGs can be reduced to and thus analyzed as CTDGs, but the converse may need extra assumptions.
Proposition 1 (Relationship between DTDG and CTDG). For any DTDG we can build a CTDG with
the same sets of node and edge features that contains the same information, i.e., we can reconstruct
the original DTDG from the converted CTDG. The converse holds if the CTDG timestamps form a
subset of a uniformly spaced countable set.
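To make the reduction behind Proposition 1 concrete, the sketch below converts a DTDG, given as a list of snapshot edge lists, into a CTDG event stream by using the snapshot index as a (uniformly spaced) timestamp; grouping events by timestamp recovers the original DTDG. The data layout and function name are our own illustrative choices, not part of the paper.

```python
def dtdg_to_ctdg(snapshots):
    """Convert a DTDG into a CTDG event stream (cf. Proposition 1).

    `snapshots` is a list where snapshots[k] maps an edge (u, v) to its feature
    vector in snapshot k. The snapshot index k doubles as a uniformly spaced
    timestamp, so the DTDG is recoverable by grouping events per timestamp.
    """
    events = []
    for k, snapshot in enumerate(snapshots):
        for (u, v), feat in snapshot.items():
            events.append((u, v, float(k), feat))  # interaction (u, v, t) with features
    events.sort(key=lambda e: e[2])                # CTDG events are ordered by timestamp
    return events

# Example: two snapshots of a 3-node graph.
snapshots = [{(0, 1): [1.0]}, {(0, 1): [1.0], (1, 2): [0.5]}]
print(dtdg_to_ctdg(snapshots))
```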
Following the usual practice [16, 38, 42], we focus on CTDGs with edge addition events (see
Appendix E for a discussion on deletion). Thus, we can represent temporal graphs as sets G(t) =
{(uk , vk , tk ) | tk < t}. We also assume each distinct node v in G(t) has an initial feature vector xv .
Message-passing temporal graph nets (MP-TGNs). Rossi et al. [27] introduced MP-TGN as a
general representation learning framework for temporal graphs. The goal is to encode the graph
dynamics into node embeddings, capturing information that is relevant for the task at hand. To
achieve this, MP-TGNs rely on three main ingredients: memory, aggregation, and update. Memory
comprises a set of vectors that summarizes the history of each node, and is updated using a recurrent
model whenever an event occurs. The aggregation and update components resemble those in message-
passing GNNs, where the embedding of each node is refined using messages from its neighbors.
We define the temporal neighborhood of node v at time t as N(v, t) = {(u, e_uv(t′), t′) | (u, v, t′) ∈ G(t)}, i.e., the set of neighbor/feature/timestamp triplets from all interactions of node v prior to t.
MP-TGNs compute the temporal representation h_v^{(ℓ)}(t) of v at layer ℓ by recursively applying

h̃_v^{(ℓ)}(t) = AGG^{(ℓ)}({{(h_u^{(ℓ−1)}(t), t − t′, e) | (u, e, t′) ∈ N(v, t)}})    (1)
h_v^{(ℓ)}(t) = UPDATE^{(ℓ)}(h_v^{(ℓ−1)}(t), h̃_v^{(ℓ)}(t)),    (2)

where {{·}} denotes multisets, h_v^{(0)}(t) = s_v(t) is the state of v at time t, and AGG^{(ℓ)} and UPDATE^{(ℓ)} are arbitrary parameterized functions. The memory block updates the states as events occur. Let J(v, t) be the set of events involving v at time t. The state of v is updated due to J(v, t) as

m_v(t) = MemAgg({{[s_v(t), s_u(t), t − t_v, e_vu(t)] | (v, u, t) ∈ J(v, t)}})    (3)
s_v(t+) = MemUpdate(s_v(t), m_v(t)),    (4)

where s_v(0) = x_v (initial node features), s_v(t+) denotes the updated state of v due to events at time t, and t_v denotes the time of the last update to v. MemAgg combines information from simultaneous events involving node v and MemUpdate usually implements a gated recurrent unit (GRU) [5].
Notably, some MP-TGNs do not use memory, or equivalently, they employ identity memory, i.e.,
sv (t) = xv for all t. We refer to Appendix A for further details.
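As a concrete reading of Eqs. (1)-(2), the sketch below implements one message-passing layer with a (non-injective) sum aggregator and a linear update; real MP-TGNs instantiate AGG/UPDATE with attention (TGAT, TGN-Att) and add the memory module described above. Module names, dimensions, and the toy time encoder are our own illustrative choices.

```python
import torch
import torch.nn as nn

class SimpleTGNLayer(nn.Module):
    """Illustrative sketch of one MP-TGN layer, Eqs. (1)-(2)."""
    def __init__(self, node_dim, edge_dim, time_dim=8):
        super().__init__()
        self.time_enc = nn.Linear(1, time_dim)                          # stand-in time encoder
        self.agg = nn.Linear(node_dim + edge_dim + time_dim, node_dim)  # message function inside AGG
        self.update = nn.Linear(2 * node_dim, node_dim)                 # UPDATE, Eq. (2)

    def forward(self, h_v, neighbors, t):
        # neighbors: list of (h_u, e, t_prime) triplets from N(v, t)
        msgs = []
        for h_u, e, t_prime in neighbors:
            dt = torch.tensor([[float(t - t_prime)]])
            msgs.append(self.agg(torch.cat([h_u, e, self.time_enc(dt).squeeze(0)], dim=-1)))
        # Summing the multiset of messages plays the role of AGG in Eq. (1).
        h_tilde = torch.stack(msgs).sum(dim=0) if msgs else torch.zeros_like(h_v)
        return self.update(torch.cat([h_v, h_tilde], dim=-1))
```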
Causal Anonymous Walks (CAWs). Wang et al. [38] proposed CAW as an approach for link
prediction on temporal graphs. To predict if an event (u, v, t) occurs, CAW first obtains sets Su and Sv
of temporal walks starting at nodes u and v at time t. An (L − 1)-length temporal walk is represented
as W = ((w1 , t1 ), (w2 , t2 ), . . . , (wL , tL )), with t1 > t2 > · · · > tL and (wi−1 , wi , ti ) ∈ G(t)
∀i > 1. Note that when predicting (u, v, t), we have walks starting at time t1 = t. Then, CAW
anonymizes walks by replacing each node w with a set I_CAW(w; S_u, S_v) = {g(w; S_u), g(w; S_v)} of two feature vectors. The ℓ-th entry of g(w; S_u) stores how many times w appears at the ℓ-th position in a walk of S_u, i.e., g(w; S_u)[ℓ] = |{W ∈ S_u : (w, t_ℓ) = W_ℓ}|, where W_ℓ is the ℓ-th pair of W.
To encode a walk W with respect to the sets S_u and S_v, CAW applies ENC(W; S_u, S_v) = RNN([f_1(I_CAW(w_i; S_u, S_v)) ‖ f_2(t_{i−1} − t_i)]_{i=1}^{L}), where f_1 is a permutation-invariant function, f_2 is a time encoder, and t_0 = t_1 = t. Finally, CAW combines the embeddings of each walk in S_u ∪ S_v using mean-pooling or self-attention to obtain the representation for the event (u, v, t).
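The anonymization step boils down to position-wise counting. Below is a small sketch of the count vectors g(w; S), in plain Python and with our own data layout.

```python
from collections import defaultdict

def caw_counts(walk_set, walk_len):
    """g(w; S): for each node w, count how often w occupies each walk position."""
    g = defaultdict(lambda: [0] * walk_len)
    for walk in walk_set:                       # walk = ((w1, t1), (w2, t2), ...)
        for pos, (node, _) in enumerate(walk):
            g[node][pos] += 1
    return dict(g)

# Two temporal walks of length 3 starting at node 0.
S_u = [((0, 5.0), (1, 3.0), (2, 1.0)),
       ((0, 5.0), (3, 2.0), (1, 0.5))]
print(caw_counts(S_u, walk_len=3))
# {0: [2, 0, 0], 1: [0, 1, 1], 2: [0, 0, 1], 3: [0, 1, 0]}
```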
In practice, TGNs often rely on sampling schemes for computational reasons. However we are
concerned with the expressiveness of TGNs, so our analysis assumes complete structural information,
i.e., Su is the set of all temporal walks from u and MP-TGNs combine information from all neighbors.

3 The representational power and limits of TGNs


We now study the expressiveness of TGNs on node/edge-level prediction. We also establish connec-
tions to a variant of the WL test and show limits of specific TGN models. Proofs are in Appendix B.
3.1 Distinguishing nodes with MP-TGNs
We analyze MP-TGNs w.r.t. their ability to map different nodes to different locations in the embedding
space. In particular, we say that an L-layer MP-TGN distinguishes two nodes u, v of a temporal
graph at time t, if the last layer embeddings of u and v are different, i.e., h_u^{(L)}(t) ≠ h_v^{(L)}(t).
We can describe the MP computations of a node v at time t via its temporal computation tree (TCT)
Tv (t). Tv (t) has v as its root and height equal to the number of MP-TGN layers L. We will keep
the dependence on depth L implicit for notational simplicity. For each element (u, e, t′) ∈ N(v, t) associated with v, we have a node, say i, in the next layer of the TCT linked to the root by an edge annotated with (e, t′). The remaining TCT layers are built recursively using the same mechanism.
We denote by ♯_v^t the (possibly many-to-one) operator that maps nodes in T_v(t) back to nodes in G(t), e.g., ♯_v^t i = u. Each node i in T_v(t) has a state vector s_i = s_{♯_v^t i}(t). To get the embedding of the root v, information is propagated bottom-up, i.e., starting from the leaves all the way up to the root — each node aggregates the message from the layer below and updates its representation along the way. Whenever clear from context, we denote ♯_v^t simply as ♯ for a cleaner notation.
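For intuition, here is a tiny recursive construction of the L-depth TCT from an event list; the dictionary-based tree representation and function names are our own, and practical implementations sample neighbors rather than expanding the full tree.

```python
def build_tct(root, t, events, depth):
    """Build the temporal computation tree (TCT) of `root` at time t, up to `depth` layers.

    `events` is a list of interactions (u, v, t_e, feat). Each tree node records the graph
    node it maps back to (the sharp operator) and its level, and recurses over the temporal
    neighborhood N(v, t), i.e., all interactions of v with timestamp strictly before t.
    """
    def neighborhood(v):
        out = []
        for u, w, t_e, feat in events:
            if t_e < t and v in (u, w):
                out.append((u if w == v else w, feat, t_e))
        return out

    def expand(v, level):
        children = []
        if level < depth:
            # Each child edge is annotated with the edge feature and timestamp (e, t').
            children = [(expand(u, level + 1), feat, t_e) for u, feat, t_e in neighborhood(v)]
        return {"maps_to": v, "level": level, "children": children}

    return expand(root, 0)

# Example: 2-depth TCT of node 0 at time 4.0.
events = [(0, 1, 1.0, "a"), (1, 2, 2.0, "b"), (0, 2, 3.0, "c")]
print(build_tct(0, 4.0, events, depth=2))
```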
We study the expressive power of MP-TGNs through the lens of functions on multisets adapted to
temporal settings, i.e., comprising triplets of node states, edge features, and timestamps. Intuitively,
injective functions ‘preserve’ the information as it is propagated, so should be essential for maximally
expressive MP-TGNs. We formalize this idea in Lemma 1 and Proposition 2 via Definition 1.
Definition 1 (Isomorphic TCTs). Two TCTs T_z(t) and T_{z′}(t) at time t are isomorphic if there is a bijection f : V(T_z(t)) → V(T_{z′}(t)) between the nodes of the trees such that the following holds:
(u, v, t′) ∈ E(T_z(t)) ⟺ (f(u), f(v), t′) ∈ E(T_{z′}(t)),
∀(u, v, t′) ∈ E(T_z(t)) : e_uv(t′) = e_{f(u)f(v)}(t′), and ∀u ∈ V(T_z(t)) : s_u = s_{f(u)} and k_u = k_{f(u)}.
Here, ku denotes the level (depth) of node u in the tree. The root node has level 0, and for a node u
with level ku , the children of u have level ku + 1.
Lemma 1. If an MP-TGN Q with L layers distinguishes two nodes u, v of a dynamic graph G(t),
then the L-depth TCTs Tu (t) and Tv (t) are not isomorphic.
For non-isomorphic TCTs, Proposition 2 shows that improving MP-TGNs with injective message
passing layers suffices to achieve node distinguishability, extending results from static GNNs [43].
Proposition 2 (Most expressive MP-TGNs). If the L-depth TCTs of two nodes u, v of a temporal
graph G(t) at time t are not isomorphic, then an MP-TGN Q with L layers and injective aggregation
and update functions at each layer is able to distinguish nodes u and v.
So far, we have considered TCTs with general memory modules, i.e., nodes are annotated with
memory states. However, an important question remains: How does the expressive power of MP-
TGNs change as a function of the memory? Our next result - Proposition 3 - shows that adding
GRU-based memory does not increase the expressiveness of suitably deep MP-TGNs.
Proposition 3 (The role of memory). Let Q_L^{[M]} denote the class of MP-TGNs with recurrent memory and L layers. Similarly, we denote by Q_L the family of memoryless MP-TGNs with L layers. Let ∆ be the temporal diameter of G(t) (see Definition B2). Then, it holds that:

Figure 2: Limitations of TGNs. [Left] Temporal graph with nodes u, v that TGN-Att/TGAT cannot distinguish. Colors are node features, edge features are identical, and t3 > t2 > t1. [Center] TCTs of u and v are non-isomorphic. However, the attention layers of TGAT/TGN-Att compute weighted averages over a same multiset of values, returning identical messages for u and v. [Right] MP-TGNs fail to distinguish the events (u, v, t3) and (v, z, t3) as the TCTs of z and u are isomorphic. Meanwhile, CAW cannot separate (u, z, t3) and (u′, z, t3): the 3-depth TCTs of u and u′ are not isomorphic, but the temporal walks from u and u′ have length 1, keeping CAW from capturing structural differences.
1. If L < ∆: Q_L^{[M]} is strictly more powerful than Q_L in distinguishing nodes of G(t);
2. For any L: Q_{L+∆} is at least as powerful as Q_L^{[M]} in distinguishing nodes of G(t).
The MP-TGN framework is rather general and subsumes many modern methods for temporal graphs
[e.g., 16, 32, 42]. We now analyze the theoretical limitations of two concrete instances of MP-TGNs:
TGAT [42] and TGN-Att [27]. Remarkably, these models are among the best-performing MP-TGNs.
Nonetheless, we can show that there are nodes of very simple temporal graphs that TGAT and
TGN-Att cannot distinguish (see Figure 2). We formalize this in Proposition 4 by establishing that
there are cases in which TGNs with injective layers can succeed, but TGAT and TGN-Att cannot.
Proposition 4 (Limitations of TGAT/TGN-Att). There exist temporal graphs containing nodes u, v
that have non-isomorphic TCTs, yet no TGAT nor TGN-Att with mean memory aggregator (i.e., using
Mean as MemAgg) can distinguish u and v.
This limitation stems from the fact that the attention mechanism employed by TGAT and TGN-Att is
proportion invariant [26]. The memory module of TGN-Att cannot counteract this limitation due to
its mean-based aggregation scheme. We provide more details in Appendix B.6.

3.2 Predicting temporal links


Models for dynamic graphs are usually trained and evaluated on temporal link prediction [18], which
consists in predicting whether an event would occur at a given time. To predict an event between
nodes u and v at t, MP-TGNs combine the node embeddings h_u^{(L)}(t) and h_v^{(L)}(t), and push the resulting vector through an MLP. On the other hand, CAW is originally designed for link prediction tasks and directly computes edge embeddings, bypassing the computation of node representations.
We can extend the notion of node distinguishability to edges/events. We say that a model distinguishes two synchronous events γ = (u, v, t) and γ′ = (u′, v′, t) of a temporal graph if it assigns different edge embeddings h_γ ≠ h_{γ′} for γ and γ′. Proposition 5 asserts that CAWs are not strictly more
expressive than MP-TGNs, and vice-versa. Intuitively, CAW’s advantage over MP-TGNs lies in its
ability to exploit node identities and capture correlation between walks. However, CAW imposes
temporal constraints on random walks, i.e., walks have timestamps in decreasing order, which can
limit its ability to distinguish events. Figure 2(Right) sketches constructions for Proposition 5.
Proposition 5 (Limitations of MP-TGNs and CAW). There exist distinct synchronous events of a
temporal graph that CAW can distinguish but MP-TGNs with injective layers cannot, and vice-versa.

3.3 Connections with the WL test


The Weisfeiler-Leman test (1-WL) has been used as a key tool to analyze the expressive power of
GNNs. We now study the power of MP-TGNs under a temporally-extended version of 1-WL, and
prove negative results regarding whether TGNs can recognize properties of temporal graphs.
Temporal WL test. We can extend the WL test for temporal settings in a straightforward manner
by exploiting the equivalence between temporal graphs and multi-graphs with timestamped edges
[24]. In particular, the temporal variant of 1-WL assigns colors for all nodes in an input dynamic
graph G(t) by applying the following iterative procedure:
Initialization: The colors of all nodes in G(t) are initialized using the initial node features: ∀v ∈ V(G(t)), c^0(v) = x_v. If node features are not available, all nodes receive identical colors;

Refinement: At step ℓ, the colors of all nodes are refined using a hash (injective) function: for all v ∈ V(G(t)), we apply c^{ℓ+1}(v) = HASH(c^{ℓ}(v), {{(c^{ℓ}(u), e_uv(t′), t′) : (u, v, t′) ∈ G(t)}});
Termination: The test is carried out for two temporal graphs at time t in parallel and stops when
the multisets of corresponding colors diverge, returning non-isomorphic. If the algorithm
runs until the number of different colors stops increasing, the test is deemed inconclusive.
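A compact sketch of the refinement step above: colors are refreshed by hashing the current color together with the multiset of (neighbor color, edge feature, timestamp) triplets. Python's built-in hash over sorted tuples stands in for an injective HASH, the event-list format is our own, and edge features are assumed hashable.

```python
def temporal_wl_colors(events, nodes, num_rounds, init_colors=None):
    """Temporal 1-WL refinement on a CTDG given as events (u, v, t, edge_feat)."""
    colors = {v: (init_colors or {}).get(v, 0) for v in nodes}
    for _ in range(num_rounds):
        new_colors = {}
        for v in nodes:
            # Multiset of (neighbor color, edge feature, timestamp) triplets incident to v.
            neigh = sorted((colors[u if w == v else w], feat, t)
                           for (u, w, t, feat) in events if v in (u, w))
            new_colors[v] = hash((colors[v], tuple(neigh)))  # stand-in for an injective HASH
        colors = new_colors
    return colors

# Two temporal graphs are deemed non-isomorphic as soon as their sorted color
# multisets, sorted(colors.values()), diverge at some refinement round.
```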
We note that the temporal WL test trivially reduces to the standard 1-WL test if all timestamps and
edge features are identical. The resemblance between MP-TGNs and GNNs and their corresponding
WL tests suggests that the power of MP-TGNs is bounded by the temporal WL test. Proposition 6
conveys that MP-TGNs with injective layers are as powerful as the temporal WL test.
Proposition 6. Assume finite spaces of initial node features X , edge features E, and timestamps T .
Let the number of events of any temporal graph be bounded by a fixed constant. Then, there is an
MP-TGN with suitable parameters using injective aggregation/update functions that outputs different
representations for two temporal graphs if and only if the temporal-WL test outputs ‘non-isomorphic’.
A natural consequence of the limited power of MP-TGNs is that even the most powerful MP-TGNs
fail to distinguish relevant graph properties, and the same applies to CAWs (see Proposition 7).
Proposition 7. There exist non-isomorphic temporal graphs differing in properties such as diameter,
girth, and total number of cycles, which cannot be differentiated by MP-TGNs and CAWs.
Figure 3 provides a construction for Proposition 7. The temporal graphs G(t) and G′(t) differ in diameter (∞ vs. 3), girth (3 vs. 6), and number of cycles (2 vs. 1). By inspecting the TCTs, one can observe that, for any node in G(t), there is a corresponding one in G′(t) whose TCT is isomorphic, e.g., T_{u1}(t) ≅ T_{u′1}(t) for t > t3. As a result, the multisets of node embeddings for these temporal graphs are identical.

Figure 3: Examples of temporal graphs for which MP-TGNs cannot distinguish the diameter, girth, and number of cycles.
We provide more details and a construction - where CAW fails to decide properties - in the Appendix.

4 Position-encoding injective temporal graph net


We now leverage insights from our analysis in Section 3 to build more powerful TGNs. First, we
discuss how to build injective aggregation and update functions in the temporal setting. Second, we
propose an efficient scheme to compute positional features based on counts from TCTs. In addition,
we show that the proposed method, called position-encoding injective temporal graph net (PINT), is
more powerful than both WA-TGNs and MP-TGNs in distinguishing events in temporal graphs.
Injective temporal aggregation. An important design principle in TGNs is to prioritize (give higher
importance to) events based on recency [38, 42]. Proposition 8 introduces an injective aggregation
scheme that captures this principle using linearly exponential time decay.
Proposition 8 (Injective function on temporal neighborhood). Let X and E be countable, and T countable and bounded. There exists a function f and scalars α and β such that Σ_i f(x_i, e_i) α^{−β t_i} is unique on any multiset M = {{(x_i, e_i, t_i)}} ⊆ X × E × T with |M| < N, where N is a constant.
Leveraging Proposition 8 and the approximation capabilities of multi-layer perceptrons (MLPs), we
propose position-encoding injective temporal graph net (PINT). In particular, PINT computes the
embedding of node v at time t and layer ℓ using the following message passing steps:

h̃_v^{(ℓ)}(t) = Σ_{(u,e,t′) ∈ N(v,t)} MLP_agg^{(ℓ)}(h_u^{(ℓ−1)}(t) ‖ e) α^{−β(t−t′)}    (5)
h_v^{(ℓ)}(t) = MLP_upd^{(ℓ)}(h_v^{(ℓ−1)}(t) ‖ h̃_v^{(ℓ)}(t))    (6)

where ‖ denotes concatenation, h_v^{(0)} = s_v(t), α and β are scalar (hyper-)parameters, and MLP_agg^{(ℓ)} and MLP_upd^{(ℓ)} denote the nonlinear transformations of the aggregation and update steps, respectively.
We note that to guarantee that the MLPs in PINT implement injective aggregation/update, we must
further assume that the edge and node features (states) take values in a finite support. In addition,
we highlight that there may exist many other ways to achieve injective temporal MP — we have
presented a solution that captures the ‘recency’ inductive bias of real-world temporal networks.
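A minimal sketch of the PINT layer in Eqs. (5)-(6): messages MLP_agg(h_u ‖ e) are summed with exponential weights α^{−β(t−t′)}, and the result is concatenated with h_v and passed through MLP_upd. Hyper-parameter values, hidden sizes, and the batching convention are illustrative and do not mirror the official implementation.

```python
import torch
import torch.nn as nn

class PINTLayer(nn.Module):
    """Sketch of PINT's injective temporal message passing, Eqs. (5)-(6)."""
    def __init__(self, node_dim, edge_dim, hidden_dim, alpha=2.0, beta=0.1):
        super().__init__()
        self.alpha, self.beta = alpha, beta
        self.mlp_agg = nn.Sequential(nn.Linear(node_dim + edge_dim, hidden_dim),
                                     nn.ReLU(), nn.Linear(hidden_dim, node_dim))
        self.mlp_upd = nn.Sequential(nn.Linear(2 * node_dim, hidden_dim),
                                     nn.ReLU(), nn.Linear(hidden_dim, node_dim))

    def forward(self, h_v, h_neigh, e_neigh, t_neigh, t):
        # h_neigh: (k, node_dim), e_neigh: (k, edge_dim), t_neigh: (k,) float timestamps t' < t
        decay = self.alpha ** (-self.beta * (t - t_neigh))          # alpha^{-beta (t - t')}
        msgs = self.mlp_agg(torch.cat([h_neigh, e_neigh], dim=-1))  # MLP_agg(h_u || e)
        h_tilde = (decay.unsqueeze(-1) * msgs).sum(dim=0)           # Eq. (5): weighted sum
        return self.mlp_upd(torch.cat([h_v, h_tilde], dim=-1))      # Eq. (6)
```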

Figure 5: PINT. Following the MP-TGN protocol, PINT updates memory states as events unroll.
Meanwhile, we use Eqs. (7-11) to update positional features. To extract the embedding for node v,
we build its TCT, annotate nodes with memory + positional features, and run (injective) MP.

Relative positional features. To boost the power of PINT, we propose augmenting memory states
with relative positional features. These features count how many temporal walks of a given length
exist between two nodes, or equivalently, how many times nodes appear at different levels of TCTs.
Formally, let P be the d × d matrix obtained by padding a (d − 1)-dimensional identity matrix with zeros on its top row and its rightmost column. Also, let r_{j→u}^{(t)} ∈ N^d denote the positional feature vector of node j relative to u's TCT at time t. For each event (u, v, t), with u and v not participating in other events at t, we recursively update the positional feature vectors as

V_i^{(0)} = {i}  ∀i    (7)
r_{i→j}^{(0)} = [1, 0, . . . , 0]^⊤ if i = j, and [0, 0, . . . , 0]^⊤ if i ≠ j    (8)
V_u^{(t+)} = V_v^{(t+)} = V_v^{(t)} ∪ V_u^{(t)}    (9)
r_{i→v}^{(t+)} = P r_{i→u}^{(t)} + r_{i→v}^{(t)}  ∀i ∈ V_u^{(t)}    (10)
r_{j→u}^{(t+)} = P r_{j→v}^{(t)} + r_{j→u}^{(t)}  ∀j ∈ V_v^{(t)}    (11)

where we use t+ to denote values “right after” t. The set Vi keeps track of the nodes for which
we need to update positional features when i participates in an interaction. For simplicity, we have
assumed that there are no other events involving u or v at time t. Appendix B.10 provides equations
for the general case where nodes can participate in multiple events at the same timestamp.
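A minimal sketch of the online update in Eqs. (7)-(11) for a single event (u, v, t) with no other simultaneous events: P is the d × d shift matrix, each node starts with r_{i→i} = [1, 0, . . . , 0]^⊤, and the two endpoints exchange shifted counts. The dictionary-based bookkeeping is our own illustrative choice.

```python
import numpy as np

def init_state(nodes, d):
    """Eqs. (7)-(8): V_i = {i}; r_{i->i} = e_1; unstored vectors default to zero."""
    V = {i: {i} for i in nodes}
    r = {(i, i): np.eye(d)[0].copy() for i in nodes}   # r[(i, j)] stands for r_{i -> j}
    return V, r

def update_event(u, v, V, r, d):
    """Eqs. (9)-(11) for one event (u, v, t) with no other events at time t."""
    P = np.zeros((d, d))
    P[1:, :-1] = np.eye(d - 1)                         # shift matrix: (P x)[k] = x[k-1]
    zero = np.zeros(d)
    new_r_to_v = {i: P @ r.get((i, u), zero) + r.get((i, v), zero) for i in V[u]}  # Eq. (10)
    new_r_to_u = {j: P @ r.get((j, v), zero) + r.get((j, u), zero) for j in V[v]}  # Eq. (11)
    for i, vec in new_r_to_v.items():
        r[(i, v)] = vec
    for j, vec in new_r_to_u.items():
        r[(j, u)] = vec
    V[u] = V[v] = V[u] | V[v]                          # Eq. (9)
    return V, r

V, r = init_state(nodes=[0, 1, 2], d=4)
V, r = update_event(0, 1, V, r, d=4)   # event (0, 1, t1)
V, r = update_event(1, 2, V, r, d=4)   # event (1, 2, t2), with t2 > t1
print(r[(0, 2)])                       # one 2-step monotone walk from 2 back to 0: [0. 0. 1. 0.]
```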
The value r_{i→v}^{(t)}[k] (the k-th component of r_{i→v}^{(t)}) corresponds to how many different ways we can get from v to i in k steps through temporal walks. Additionally, we provide in Lemma 2 an interpretation of relative positional features in terms of the so-called monotone TCTs (Definition 2). In this regard, Figure 4 shows how the TCT of v evolves due to an event (u, v, t) and provides an intuition about the updates in Eqs. 10-11. The procedure amounts to appending the monotone TCT of u to the first level of the monotone TCT of v.

Definition 2. The monotone TCT of a node u at time t, denoted by T̃_u(t), is the maximal subtree of the TCT of u s.t. for any path p = (u, t_1, u_1, t_2, u_2, . . . ) from the root u to leaf nodes of T̃_u(t), time monotonically decreases, i.e., we have that t_1 > t_2 > . . . .

Figure 4: The effect of (u, v, t) on the monotone TCT of v. Also, note how the positional features of a node i, relative to v, can be incrementally updated.

Lemma 2. For any pair of nodes i, u of a temporal graph G(t), the k-th component of the positional feature vector r_{i→u}^{(t)} stores the number of times i appears at the k-th layer of the monotone TCT of u.

Edge and node embeddings. To obtain the embedding h_γ for an event γ = (u, v, t), an L-layer PINT computes embeddings for nodes u and v using L steps of temporal message passing. However, when computing the embedding h_u^L(t) of u, we concatenate node states s_j(t) with the positional features r_{j→u}^{(t)} and r_{j→v}^{(t)} for all nodes j in the L-hop temporal neighborhood of u. We apply the same procedure to obtain h_v^L(t), and then combine h_v^L(t) and h_u^L(t) using a readout function.

Similarly, to compute representations for node-level prediction, for each node j in the L-hop neighborhood of u, we concatenate node states s_j(t) with the features r_{j→u}^{(t)}. Then, we use our injective MP to combine the information stored in u and its neighboring nodes. Figure 5 illustrates the process.
Notably, Proposition 9 states that PINT is strictly more powerful than existing TGNs. In fact, the
relative positional features mimic the discriminative power of WA-TGNs while eliminating their temporal monotonicity constraints. Additionally, PINT can implement injective temporal message
passing (either over states or states + positional features), akin to maximally-expressive MP-TGNs.
Proposition 9 (Expressiveness of PINT: link prediction). PINT (with relative positional features) is
strictly more powerful than MP-TGNs and CAWs in distinguishing events in temporal graphs.
When does PINT fail? Naturally, whenever the TCTs (annotated with positional features) for the endpoints of two edges (u, v, t) and (u′, v′, t) are pairwise isomorphic, PINT returns the same edge embedding and is not able to differentiate the events. Figure 6 shows an example in which this happens — we assume that all node/edge features are identical. Due to graph symmetries, u and z occur the same number of times in each level of v's monotone TCT. Also, the sets of temporal walks starting at u and z are identical if we swap the labels of these nodes. Importantly, CAWs and MP-TGNs also fail here, as stated in Proposition 9.

Figure 6: PINT cannot distinguish the events (u, v, t3) and (v, z, t3).
Proposition 10 (Limitations of PINT). There are synchronous events of temporal graphs that PINT
cannot distinguish (as seen in Figure 6).
Implementation and computational cost. The online updates for PINT's positional features have complexity O(d |V_u^{(t)}| + d |V_v^{(t)}|). Similarly to CAW's sampling procedure, our online update
is a sequential process better done in CPUs. However, while CAW may require significant CPU-
GPU memory exchange — proportional to both the number of walks and their depth —, we only
communicate the positional features. We can also speed-up the training of PINT by pre-computing
the positional features for each batch, avoiding redundant computations at each epoch. Apart from
positional features, the computational cost of PINT is similar to that of TGN-Att. Following standard
MP-TGN procedure, we control the branching factor of TCTs using neighborhood sampling.
Note that the positional features monotonically increase with time, which is undesirable for practical
generalization purposes. Since our theoretical results hold for any fixed t, this issue can be solved
by dividing the positional features by a time-dependent normalization factor. Nonetheless, we have
found that employing L1 -normalization leads to good empirical results for all evaluated datasets.

5 Experiments
We now assess the performance of PINT on several popular and large-scale benchmarks for TGNs.
We run experiments using PyTorch [25] and code is available at www.github.com/AaltoPML/PINT.
Tasks and datasets. We evaluate PINT on dynamic link prediction, closely following the evaluation
setup employed by Rossi et al. [27] and Xu et al. [42]. We use six popular benchmark datasets:
Reddit, Wikipedia, Twitter, UCI, Enron, and LastFM [16, 27, 38, 42]. Notably, UCI, Enron, and
LastFM are non-attributed networks, i.e., they do not contain feature vectors associated with the
events. Node features are absent in all datasets, thus following previous works we set them to vectors
of zeros [27, 42]. Since Twitter is not publicly available, we follow the guidelines by Rossi et al. [27]
to create our version. We provide more details regarding datasets in the supplementary material.
Baselines. We compare PINT against five prominent TGNs: Jodie [16], DyRep [32], TGAT [42],
TGN-Att [27], and CAW [38]. For completeness, we also report results using two static GNNs: GAT
[34] and GraphSage [12]. Since we adopt the same setup as TGN-Att, we use their table numbers
for all baselines but CAW on Wikipedia and Reddit. The remaining results were obtained using the
implementations and guidelines available from the official repositories. As an ablation study, we also
include a version of PINT without relative positional features in the comparison. We provide detailed
information about hyperparameters and the training of each model in the supplementary material.
Experimental setup. We follow Xu et al. [42] and use a 70%-15%-15% (train-val-test) temporal
split for all datasets. We adopt average precision (AP) as the performance metric. We also analyze
separately predictions involving only nodes seen during training (transductive), and those involving

Table 1: Average Precision (AP) results for link prediction. We denote the best-performing model (highest
mean AP) in blue. In 5 out of 6 datasets, PINT achieves the highest AP in the transductive setting. For the
inductive case, PINT outperforms previous MP-TGNs and competes with CAW. We also show the performance
of PINT with and without relative positional features. For all datasets, adopting positional features leads to
significant performance gains.
Model                  Reddit        Wikipedia     Twitter       UCI           Enron         LastFM

Transductive:
GAT                    97.33 ± 0.2   94.73 ± 0.2   -             -             -             -
GraphSAGE              97.65 ± 0.2   93.56 ± 0.3   -             -             -             -
Jodie                  97.11 ± 0.3   94.62 ± 0.5   98.23 ± 0.1   86.73 ± 1.0   77.31 ± 4.2   69.32 ± 1.0
DyRep                  97.98 ± 0.1   94.59 ± 0.2   98.48 ± 0.1   54.60 ± 3.1   77.68 ± 1.6   69.24 ± 1.4
TGAT                   98.12 ± 0.2   95.34 ± 0.1   98.70 ± 0.1   77.51 ± 0.7   68.02 ± 0.1   54.77 ± 0.4
TGN-Att                98.70 ± 0.1   98.46 ± 0.1   98.00 ± 0.1   80.40 ± 1.4   79.91 ± 1.3   80.69 ± 0.2
CAW                    98.39 ± 0.1   98.63 ± 0.1   98.72 ± 0.1   92.16 ± 0.1   92.09 ± 0.7   81.29 ± 0.1
PINT (w/o pos. feat.)  98.62 ± .04   98.43 ± .04   98.53 ± 0.1   92.68 ± 0.5   83.06 ± 2.1   81.35 ± 1.6
PINT                   99.03 ± .01   98.78 ± 0.1   99.35 ± .01   96.01 ± 0.1   88.71 ± 1.3   88.06 ± 0.7

Inductive:
GAT                    95.37 ± 0.3   91.27 ± 0.4   -             -             -             -
GraphSAGE              96.27 ± 0.2   91.09 ± 0.3   -             -             -             -
Jodie                  94.36 ± 1.1   93.11 ± 0.4   96.06 ± 0.1   75.26 ± 1.7   76.48 ± 3.5   80.32 ± 1.4
DyRep                  95.68 ± 0.2   92.05 ± 0.3   96.33 ± 0.2   50.96 ± 1.9   66.97 ± 3.8   82.03 ± 0.6
TGAT                   96.62 ± 0.3   93.99 ± 0.3   96.33 ± 0.1   70.54 ± 0.5   63.70 ± 0.2   56.76 ± 0.9
TGN-Att                97.55 ± 0.1   97.81 ± 0.1   95.76 ± 0.1   74.70 ± 0.9   78.96 ± 0.5   84.66 ± 0.1
CAW                    97.81 ± 0.1   98.52 ± 0.1   98.54 ± 0.4   92.56 ± 0.1   91.74 ± 1.7   85.67 ± 0.5
PINT (w/o pos. feat.)  97.22 ± 0.2   97.81 ± 0.1   96.10 ± 0.1   90.25 ± 0.3   75.99 ± 2.3   88.44 ± 1.1
PINT                   98.25 ± .04   98.38 ± .04   98.20 ± .03   93.97 ± 0.1   81.05 ± 2.4   91.76 ± 0.7

novel nodes (inductive). We report mean and standard deviation of the AP over ten runs. For further
details, see Appendix D. We provide additional results in the supplementary material.

Results. Table 1 shows that PINT is the best-performing method on five out of six datasets for the
transductive setting. Notably, the performance gap between PINT and TGN-Att amounts to over
15% AP on UCI. The gap is also relatively high compared to CAW on LastFM, Enron, and UCI;
with CAW being the best model only on Enron. We also observe that many models achieve relatively
high AP on the attributed networks (Reddit, Wikipedia, and Twitter). This aligns well with findings
from [38], where TGN-Att was shown to have competitive performance against CAW on Wikipedia
and Reddit. The performance of GAT and GraphSAGE (static GNNs) on Reddit and Wikipedia reinforces
the hypothesis that the edge features add significantly to the discriminative power. On the other
hand, PINT and CAW, which leverage relative identities, show superior performance relative to
other methods when only time and degree information is available, i.e., on unattributed networks
(UCI, Enron, and LastFM). Table 1 also shows the effect of using relative positional features. While
including these features boosts PINT’s performance systematically, our ablation study shows that
PINT w/o positional features still outperforms other MP-TGNs on unattributed networks. In the
inductive case, we observe a similar behavior: PINT is consistently the best MP-TGN, and is better
than CAW on 3/6 datasets. Overall, PINT (w/ positional features) also yields the lowest standard
deviations. This suggests that positional encodings might be a useful inductive bias for TGNs.

Time comparison. Figure 7 compares the training times of PINT against other TGNs. For fairness, we use the same architecture (number of layers & neighbors) for all MP-TGNs: i.e., the best-performing PINT. For CAW, we use the one that yielded the results in Table 1. As expected, TGAT is the fastest model. Note that the average time/epoch of PINT gets amortized since positional features are pre-computed. Without these features, PINT's runtime closely matches TGN-Att. When trained for over 25 epochs, PINT runs considerably faster than CAW.

Figure 7: Time comparison on Wikipedia and UCI: average time per epoch (log-scale) versus number of epochs for PINT, PINT (w/o pos.), TGAT, CAW, and TGN-Att. The cost of pre-computing positional features is quickly diluted as the number of epochs increases.
We provide additional details and results in the supplementary material.

Table 2: Average precision results for TGN-Att + relative positional features.

                 Transductive                                Inductive
                 UCI           Enron         LastFM          UCI           Enron         LastFM
TGN-Att          80.40 ± 1.4   79.91 ± 1.3   80.69 ± 0.2     74.70 ± 0.9   78.96 ± 0.5   84.66 ± 0.1
TGN-Att + RPF    95.64 ± 0.1   85.04 ± 2.5   89.41 ± 0.9     92.82 ± 0.4   76.27 ± 3.4   91.63 ± 0.3
PINT             96.01 ± 0.1   88.71 ± 1.3   88.06 ± 0.7     93.97 ± 0.1   81.05 ± 2.4   91.76 ± 0.7

Incorporating relative positional features into MP-TGNs. We can use our relative positional
features (RPF) to boost MP-TGNs. Table 2 shows the performance of TGN-Att with relative positional
features on UCI, Enron, and LastFM. Notably, TGN-Att receives a significant boost from our RPF.
However, PINT still beats TGN-Att+RPF in 5 out of 6 cases. The values for TGN-Att+RPF reflect outcomes from 5 repetitions. We have used the same model selection procedure as for TGN-Att in Table 1, and incorporated d = 4-dimensional positional features.
Dimensionality of relative positional features. We assess the performance of PINT as a function of the dimension d of the relative positional features. Figure 8 shows the performance of PINT for d ∈ {4, 10, 15, 20} on UCI and Enron. We report mean and standard deviation of AP on the test set obtained from five independent runs. In all experiments, we re-use the optimal hyper-parameters found with d = 4. Increasing the dimensionality of the positional features leads to performance gains on both datasets. Notably, we obtain a significant boost for Enron with d = 10: 92.69 ± 0.09 AP in the transductive setting and 88.34 ± 0.29 in the inductive case. Thus, PINT becomes the best-performing model on Enron (transductive). On UCI, for d = 20, we obtain 96.36 ± 0.07 and 94.77 ± 0.12 (inductive).

Figure 8: PINT: test AP (mean and std) on UCI and Enron, in the transductive and inductive settings, as a function of the dimensionality d of the positional features.

6 Conclusion
We laid a rigorous theoretical foundation for TGNs, including the role of memory modules, relation-
ship between classes of TGNs, and failure cases for MP-TGNs. Together, our theoretical results shed
light on the representational capabilities of TGNs, and connections with their static counterparts. We
also introduced a novel TGN method, provably more expressive than the existing TGNs.
Key practical takeaways from this work: (a) temporal models should be designed to have injective
update rules and to exploit both neighborhood and walk aggregation, and (b) deep architectures can
likely be made more compute-friendly as the role of memory gets diminished with depth, provably.

Acknowledgments and Disclosure of Funding


This work was supported by the Academy of Finland (Flagship programme: Finnish Center for
Artificial Intelligence FCAI and 341763), ELISE Network of Excellence Centres (EU Horizon:2020
grant agreement 951847) and UKRI Turing AI World-Leading Researcher Fellowship, EP/W002973/1.
We also acknowledge the computational resources provided by the Aalto Science-IT Project from
Computer Science IT. AS and DM also would like to thank Jorge Perez, Jou-Hui Ho, and Hojin Kang
for valuable discussions about TGNs, and the latter’s input on a preliminary version of this work.

Societal and broader impact


Temporal graph networks have shown remarkable performance in relevant domains such as social
networks, e-commerce, and drug discovery. In this paper, we establish fundamental results that
delineate the representational power of TGNs. We expect that our findings will help declutter the
literature and serve as a seed for future developments. Moreover, our analysis culminates with PINT,
a method that is provably more powerful than the prior art and shows superior predictive performance
on several benchmarks. We believe that PINT (and its underlying concepts) will help engineers and
researchers build better recommendation engines, improving the quality of systems that permeate our
lives. Also, we do not foresee any negative societal impact stemming directly from this work.

References
[1] P. Barceló, E. V. Kostylev, M. Monet, J. Pérez, J. L. Reutter, and J.-P. Silva. The logical expressiveness of
graph neural networks. In International Conference on Learning Representations (ICLR), 2020.

[2] Q. Cappart, D. Chételat, E. B. Khalil, A. Lodi, C. Morris, and P. Velickovic. Combinatorial optimization
and reasoning with graph neural networks. In International Joint Conference on Artificial Intelligence
(IJCAI), 2021.

[3] B. Chamberlain, J. Rowbottom, M. Gorinova, M. M. Bronstein, S. Webb, and E. Rossi. GRAND: graph
neural diffusion. In International Conference on Machine Learning (ICML), 2021.

[4] M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li. Simple and deep graph convolutional networks. In
International Conference on Machine Learning (ICML), 2020.

[5] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio.
Learning phrase representations using RNN encoder–decoder for statistical machine translation. In
Empirical Methods in Natural Language Processing (EMNLP), 2014.

[6] N. Dehmamy, A.-L. Barabási, and R. Yu. Understanding the representation power of graph neural networks
in learning graph topology. In Advances in neural information processing systems (NeurIPS), 2019.

[7] A. Derrow-Pinion, J. She, D. Wong, O. Lange, T. Hester, L. Perez, M. Nunkesser, S. Lee, X. Guo,
B. Wiltshire, P. W. Battaglia, V. Gupta, A. Li, Z. Xu, A. Sanchez-Gonzalez, Y. Li, and P. Velickovic. Eta
prediction with graph neural networks in google maps. In Conference on Information and Knowledge
Management (CIKM), 2021.

[8] S. S. Du, K. Hou, R. Salakhutdinov, B. Póczos, R. Wang, and K. Xu. Graph neural tangent kernel:
Fusing graph neural networks with graph kernels. In Advances in Neural Information Processing Systems
(NeurIPS), 2019.

[9] V. Garg, S. Jegelka, and T. Jaakkola. Generalization and representational limits of graph neural networks.
In International Conference on Machine Learning (ICML), 2020.

[10] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum
chemistry. In International Conference on Machine Learning (ICML), 2017.

[11] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In IEEE International
Joint Conference on Neural Networks (IJCNN), 2005.

[12] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in
Neural Information Processing Systems (NeurIPS), 2017.

[13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[14] S. Kazemi, R. Goel, K. Jain, I. Kobyzev, A. Sethi, P. Forsyth, and P. Poupart. Representation learning for
dynamic graphs: A survey. Journal of Machine Learning Research, 21(70):1–73, 2020.

[15] S. M. Kazemi, R. Goel, S. Eghbali, J. Ramanan, J. Sahota, S. Thakur, S. Wu, C. Smyth, P. Poupart, and
M. Brubaker. Time2vec: Learning a vector representation of time. ArXiv: 1907.05321, 2019.

[16] S. Kumar, X. Zhang, and J. Leskovec. Predicting dynamic embedding trajectory in temporal interaction
networks. In International Conference on Knowledge Discovery & Data Mining (KDD), 2019.

[17] R. Liao, R. Urtasun, and R. Zemel. A PAC-bayesian approach to generalization bounds for graph neural
networks. In International Conference on Learning Representations (ICLR), 2021.

[18] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. Journal of the
American Society for Information Science and Technology, 58(7):1019–1031, 2007.

[19] A. Loukas. What graph neural networks cannot learn: depth vs width. In International Conference on
Learning Representations (ICLR), 2020.

[20] A. Loukas. How hard is to distinguish graphs with graph neural networks? In Advances in Neural
Information Processing Systems (NeurIPS), 2020.

[21] H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman. Provably powerful graph networks. In Advances
in Neural Information Processing Systems (NeurIPS), 2019.

[22] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe. Weisfeiler and
leman go neural: Higher-order graph neural networks. In AAAI Conference on Artificial Intelligence
(AAAI), 2019.
[23] H. Nguyen and T. Maehara. Graph homomorphism convolution. In International Conference on Machine
Learning (ICML), 2020.
[24] F. Orsini, P. Frasconi, and L. D. Raedt. Graph invariant kernels. In International Joint Conference on
Artificial Intelligence (IJCAI), 2015.
[25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and
A. Lerer. Automatic differentiation in pytorch. In Advances in Neural Information Processing Systems
(NeurIPS - Workshop), 2017.
[26] J. Pérez, J. Marinković, and P. Barceló. On the turing completeness of modern neural network architectures.
In International Conference on Learning Representations (ICLR), 2019.
[27] E. Rossi, B. Chamberlain, F. Frasca, D. Eynard, E. Monti, and M. Bronstein. Temporal graph networks for
deep learning on dynamic graphs. In ICML 2020 Workshop on Graph Representation Learning, 2020.
[28] A. Sanchez-Gonzalez, J. Godwin, T. Pfaff, R. Ying, J. Leskovec, and P. Battaglia. Learning to simulate
complex physics with graph networks. In International Conference on Machine Learning (ICML), 2020.
[29] R. Sato, M. Yamada, and H. Kashima. Approximation ratios of graph neural networks for combinatorial
problems. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[30] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model.
IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
[31] J. M. Stokes, K. Yang, K. Swanson, W. Jin, A. Cubillos-Ruiz, N. M. Donghia, C. R. MacNair, S. French,
L. A. Carfrae, Z. Bloom-Ackermann, V. M. Tran, A. Chiappino-Pepe, A. H. Badran, I. W. Andrews, E. J.
Chory, G. M. Church, E. D. Brown, T. S. Jaakkola, R. Barzilay, and J. J. Collins. A deep learning approach
to antibiotic discovery. Cell, 180(4):688 – 702, 2020.
[32] R. Trivedi, M. Farajtabar, P. Biswal, and H. Zha. DyRep: Learning representations over dynamic graphs.
In International Conference on Learning Representations (ICLR), 2019.
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin.
Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[34] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph Attention Networks. In
International Conference on Learning Representations (ICLR), 2018.
[35] S. Verma and Z.-L. Zhang. Stability and generalization of graph convolutional neural networks. In
International Conference on Knowledge Discovery & Data Mining (KDD), 2019.
[36] Y. Verma, S. Kaski, M. Heinonen, and V. Garg. Modular flows: Differential molecular generation. In
Advances in Neural Information Processing Systems (NeurIPS), 2022.
[37] C. Vignac, A. Loukas, and P. Frossard. Building powerful and equivariant graph neural networks with
structural message-passing. In Neural Information Processing Systems (NeurIPS), 2020.
[38] Y. Wang, Y. Chang, Y. Liu, J. Leskovec, and P. Li. Inductive representation learning in temporal networks
via causal anonymous walks. In International Conference on Learning Representations (ICLR), 2021.
[39] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on graph neural
networks. IEEE Transactions on Neural Networks and Learning Systems, pages 1–21, 2020.
[40] D. Xu, W. Cheng, D. Luo, Y. Gu, X. Liu, J. Ni, B. Zong, H. Chen, and X. Zhang. Adaptive neural network
for node classification in dynamic networks. In IEEE International Conference on Data Mining (ICDM),
2019.
[41] D. Xu, C. Ruan, E. Korpeoglu, S. Kumar, and K. Achan. Self-attention with functional time representation
learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[42] D. Xu, C. Ruan, E. Korpeoglu, S. Kumar, and K. Achan. Inductive representation learning on temporal
graphs. In International Conference on Learning Representations (ICLR), 2020.
[43] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In International
Conference on Learning Representations (ICLR), 2019.

[44] K. Xu, J. Li, M. Zhang, S. S. Du, K.-I. Kawarabayashi, and S. Jegelka. What can neural networks reason
about? In International Conference on Learning Representations (ICLR), 2020.

[45] K. Xu, M. Zhang, J. Li, S. S. Du, K.-I. Kawarabayashi, and S. Jegelka. How neural networks extrapolate:
From feedforward to graph neural networks. In International Conference on Learning Representations
(ICLR), 2021.

[46] Z. Zhang, F. Wu, and W. S. Lee. Factor graph neural networks. In Advances in Neural Information
Processing Systems (NeurIPS), 2020.

Provably expressive temporal graph networks
(Supplementary material)

A Further details on temporal graph networks


In this section we present more details about the models TGAT, TGN-Att, and CAW.

A.1 Temporal graph attention (TGAT)


Temporal graph attention networks [42] combine time encoders [15] and self-attention [33]. In
particular, the time encoder φ is given by

φ(t − t′) = [cos(ω_1(t − t′) + b_1), . . . , cos(ω_d(t − t′) + b_d)],    (S1)

where the ω_i's and b_i's are learned scalar parameters. The time embeddings are concatenated to the edge features before being fed into a typical self-attention layer, where the query q is a function of a reference node v, and both values V and keys K depend on v's temporal neighbors. Formally, TGAT first computes a matrix C_v^{(ℓ)}(t) whose u-th row is c_vu^{(ℓ)}(t) = [h_u^{(ℓ−1)}(t) ‖ φ(t − t_uv) ‖ e_uv] for all (u, e_uv, t_uv) ∈ N(v, t). Then, the output h̃_v^{(ℓ)}(t) of the AGG function is given by

q = [h_v^{(ℓ−1)}(t) ‖ φ(0)] W_q^{(ℓ)}    K = C_v^{(ℓ)}(t) W_K^{(ℓ)}    V = C_v^{(ℓ)}(t) W_V^{(ℓ)}    (S2)
h̃_v^{(ℓ)}(t) = softmax(q K^⊤) V    (S3)

where W_q^{(ℓ)}, W_K^{(ℓ)}, and W_V^{(ℓ)} are model parameters. Regarding the UPDATE function, TGAT applies a multilayer perceptron, i.e., h_v^{(ℓ)}(t) = MLP^{(ℓ)}(h_v^{(ℓ−1)}(t) ‖ h̃_v^{(ℓ)}(t)).
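A sketch of the time encoder in Eq. (S1) and a single-head version of the attention aggregation in Eqs. (S2)-(S3); the multi-head layout, scaling factors, and exact dimensions used by TGAT are omitted, and all module names here are ours.

```python
import torch
import torch.nn as nn

class TimeEncoder(nn.Module):
    """phi(dt) = [cos(w_1 dt + b_1), ..., cos(w_d dt + b_d)], Eq. (S1)."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim))
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, dt):                        # dt: (...,) tensor of time gaps
        return torch.cos(dt.unsqueeze(-1) * self.w + self.b)

class TGATAttention(nn.Module):
    """Single-head sketch of the aggregation in Eqs. (S2)-(S3)."""
    def __init__(self, node_dim, edge_dim, time_dim, out_dim):
        super().__init__()
        self.phi = TimeEncoder(time_dim)
        in_dim = node_dim + time_dim + edge_dim
        self.Wq = nn.Linear(node_dim + time_dim, out_dim, bias=False)
        self.Wk = nn.Linear(in_dim, out_dim, bias=False)
        self.Wv = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h_v, h_neigh, e_neigh, t_neigh, t):
        # Rows of C: [h_u || phi(t - t_uv) || e_uv] for each temporal neighbor u.
        C = torch.cat([h_neigh, self.phi(t - t_neigh), e_neigh], dim=-1)
        q = self.Wq(torch.cat([h_v, self.phi(torch.zeros(1)).squeeze(0)], dim=-1))
        K, V = self.Wk(C), self.Wv(C)
        attn = torch.softmax(q @ K.t(), dim=-1)   # Eq. (S3): softmax(q K^T) V
        return attn @ V
```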

A.2 Temporal graph networks with attention (TGN-Att)


We now discuss details regarding the MP-TGN framework omitted from the main paper for simplicity.
For the sake of generality, Rossi et al. [27] present a formulation for MP-TGNs that can handle
node-level events, e.g., node feature updates. These events lead to i) updating node memory states,
and ii) using the time-evolving node features as additional inputs for the message-passing functions.
Nonetheless, to the best of our knowledge, all relevant CTDG benchmarks comprise only edge
events. Therefore, for ease of presentation, we omit node events and temporal node features from our
treatment. In Appendix E, we discuss how to handle node-level events.
Note that MP-TGNs update memory states only after an event occurs, otherwise it would incur
information leakage. Unless we use the updated states to predict another event later on in the batch,
this means that there might be no signal to propagate through memory modules. To get around
this problem, Rossi et al. [27] propose updating the memory with messages coming from previous
batches, and then predicting the interactions.
To speed up computations, MP-TGNs employ a form of batch learning where events in a same batch
are aggregated. In our analysis, we assume that two events belong to the same batch only if they
occur at the same timestamp. Importantly, memory aggregators allow removing ambiguity in the way
the memory of a node participating in multiple events (at the same timestamp) is updated — without
memory aggregator, two events involving a given node i at the same time could lead to different ways
of updating the state of i.
Suppose the event γ = (i, u, t) occurs. MP-TGNs proceed by computing a memory-message function MemMsg_e for each endpoint of γ, i.e.,

m_{i,u}(t) = MemMsg_e(s_i(t), s_u(t), t − t_i, e_iu(t))
m_{u,i}(t) = MemMsg_e(s_u(t), s_i(t), t − t_u, e_iu(t))

Following the original formulation, we assume an identity memory-message function — simply the concatenation of the inputs, i.e., MemMsg_e(s_i(t), s_u(t), t − t_i, e_iu(t)) = [s_i(t), s_u(t), t − t_i, e_iu(t)].

Now, suppose two events (i, u, t) and (i, v, t) happen. MP-TGNs aggregate the memory-messages from these events using a function MemAgg to obtain a single memory message for i:

m_i(t) = MemAgg(m_{i,u}(t), m_{i,v}(t))

Rossi et al. [27] propose non-learnable memory aggregators, such as the mean aggregator (average all memory messages for a given node), that we denote as MeanAgg and adopt throughout our analysis. As an example, under events (i, u, t) and (i, v, t), the aggregated message for i is m_i(t) = 0.5([s_i(t), s_u(t), t − t_i, e_iu(t)] + [s_i(t), s_v(t), t − t_i, e_iv(t)]).

The memory update of our query node i is given by

s_i(t+) = MemUpdate(s_i(t), m_i(t)).

Finally, we note that TGAT does not have a memory module. TGN-Att consists of the model resulting
from augmenting TGAT with a GRU-based memory.
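To illustrate the memory pipeline concretely, the sketch below combines the identity memory message, mean aggregation over same-timestamp events, and a GRU-based MEMUPDATE, mirroring the TGN-Att setup described above. The class name, dimensions, the assumption that all events in a batch share one timestamp, and the simplified handling of gradients across batches are illustrative assumptions rather than the original implementation.

```python
import torch
import torch.nn as nn

class GRUMemory(nn.Module):
    """Sketch of a TGN-style memory: identity MEMMSG, MEANAGG, GRU-based MEMUPDATE."""
    def __init__(self, num_nodes, mem_dim, edge_dim):
        super().__init__()
        self.state = torch.zeros(num_nodes, mem_dim)       # s_i(t); here initialized to zeros
        self.last_update = torch.zeros(num_nodes)           # t_i: last time node i was updated
        msg_dim = 2 * mem_dim + 1 + edge_dim                 # [s_i, s_u, t - t_i, e_iu]
        self.cell = nn.GRUCell(msg_dim, mem_dim)              # MEMUPDATE

    def process_batch(self, src, dst, t, edge_feat):
        # src, dst: [B] node ids; t: [B] timestamps (assumed all equal); edge_feat: [B, edge_dim]
        msg_src = torch.cat([self.state[src], self.state[dst],
                             (t - self.last_update[src]).unsqueeze(-1), edge_feat], dim=-1)
        msg_dst = torch.cat([self.state[dst], self.state[src],
                             (t - self.last_update[dst]).unsqueeze(-1), edge_feat], dim=-1)
        nodes = torch.cat([src, dst])
        msgs = torch.cat([msg_src, msg_dst])
        # MEANAGG: average all messages destined to the same node within the batch
        uniq, inv = torch.unique(nodes, return_inverse=True)
        agg = torch.zeros(len(uniq), msgs.size(1)).index_add_(0, inv, msgs)
        agg = agg / torch.bincount(inv).unsqueeze(-1).float()
        # GRU-based MEMUPDATE: s_i(t+) = GRU(m_i(t), s_i(t)); gradient flow across batches ignored here
        self.state[uniq] = self.cell(agg, self.state[uniq])
        self.last_update[uniq] = t.max()
```

In practice, TGN-Att also stores raw messages and updates each node using only its most recent message, as noted in Appendix D.2; the sketch keeps only the parts needed to follow the equations above.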
A.3 Causal anonymous walks (CAW)
We now provide details regarding how CAW obtains edge embeddings for a query event γ = (u, v, t).
A temporal walk is represented as W = ((w1 , t1 ), (w2 , t2 ), . . . , (wL , tL )), with t1 > t2 > · · · > tL
and (wi−1 , wi , ti ) ∈ G(t) for all i > 1. We denote by Su (t) the set of maximal temporal walks
starting at u of size at most L obtained from the temporal graph at time t. Following the original
paper, we drop the time dependence henceforth.
A given walk W is anonymized by replacing each element w_i of W with a 2-element set of vectors I_CAW(w_i; S_u, S_v) that records how many times w_i appears at each position of the walks in S_u and in S_v. These vectors are denoted g(w_i, S_u) and g(w_i, S_v). The walk is encoded using an RNN:

ENC(W; S_u, S_v) = RNN([f_1(I_CAW(w_i; S_u, S_v)) ∥ f_2(t_i − t_{i−1})]_{i=1}^{L}),

where t_1 = t_0 = t and f_1 is

f_1(I_CAW(w_i; S_u, S_v)) = MLP(g(w_i, S_u)) + MLP(g(w_i, S_v)).

We note that the two MLPs share parameters. The function f_2 is given by

f_2(t) = [cos(ω_1 t), sin(ω_1 t), . . . , cos(ω_d t), sin(ω_d t)],

where the ω_i's are learned parameters.
To compute the embedding h_γ for (u, v, t), CAW considers two readout functions: mean and self-attention. The link prediction is then obtained from a 2-layer MLP over h_γ.
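For intuition, the sketch below shows how the positional count vectors g(w, S) and the set-based anonymization I_CAW could be computed from sampled walks. The data layout (walks as lists of (node, time) pairs), the function names, and the toy node ids are assumptions made purely for illustration; the actual CAW implementation also handles walk sampling with time decay.

```python
from collections import defaultdict

def position_counts(walk_set, walk_len):
    """g(w, S): for each node w, count how many times it appears at each position of walks in S."""
    counts = defaultdict(lambda: [0] * walk_len)
    for walk in walk_set:                        # walk: [(node, time), ...]
        for pos, (node, _) in enumerate(walk):
            counts[node][pos] += 1
    return counts

def anonymize(walk, S_u, S_v, walk_len):
    """I_CAW(w_i; S_u, S_v) for every step of a walk: the pair {g(w_i, S_u), g(w_i, S_v)}."""
    g_u = position_counts(S_u, walk_len)
    g_v = position_counts(S_v, walk_len)
    # Order inside the pair is irrelevant downstream, since f_1 sums shared-parameter MLP outputs.
    return [(g_u[node], g_v[node]) for node, _ in walk]

# Toy usage: two 2-step walk sets rooted at hypothetical nodes "u" and "v".
S_u = [[("u", 5.0), ("a", 3.0)], [("u", 5.0), ("b", 2.0)]]
S_v = [[("v", 5.0), ("a", 4.0)]]
print(anonymize(S_u[0], S_u, S_v, walk_len=2))
```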

B Proofs
B.1 Further definitions and Lemmata
Definition B1 (Monotone walk). An N-length monotone walk in a temporal graph G(t) is a sequence (w_1, t_1, w_2, t_2, . . . , w_{N+1}) such that t_i > t_{i+1} and (w_i, w_{i+1}, t_i) ∈ G(t) for all i.
Definition B2 (Temporal diameter). We say the temporal diameter of a graph G(t) is ∆ if the longest monotone walk in G(t) has length (i.e., number of edges) exactly ∆.
Lemma B1. If the TCTs of two nodes are isomorphic, then their monotone TCTs (Definition 2) are also isomorphic, i.e., T_u(t) ≅ T_v(t) ⇒ T̃_u(t) ≅ T̃_v(t) for two nodes u and v of a dynamic graph.

Proof. Since T_u(t) ≅ T_v(t), we have that

p = (u_0, t_1, u_1, t_2, u_2, . . . ) is a path in T_u(t) ⇐⇒ p′ = (f(u_0), t_1, f(u_1), t_2, f(u_2), . . . ) is a path in T_v(t),
with s_{u_i} = s_{f(u_i)}, e_{u_i u_{i+1}}(t_{i+1}) = e_{f(u_i) f(u_{i+1})}(t_{i+1}), and k_{u_i} = k_{f(u_i)} for all i,

where f : V(T_u(t)) → V(T_v(t)) is a bijection.
Assume that T̃_u(t) ≇ T̃_v(t). Then there exists a path p_s = (u′_0, t′_1, u′_1, t′_2, . . .) in T̃_u(t), such that t′_{k+1} < t′_k for all k (i.e., a monotone walk), with no corresponding path in T̃_v(t), or vice-versa. Without loss of generality, let us consider the former case.
We can construct the path p′_s in T_v(t) by applying f to all elements of p_s, i.e., p′_s = (f(u′_0), t′_1, f(u′_1), t′_2, . . .). Note that p′_s is a monotone walk in T_v(t). Since T̃_v(t) is the maximal monotone subtree of T_v(t), it must contain p′_s, leading to a contradiction.

Lemma B2. Let G(t) and G′(t) be any two non-isomorphic temporal graphs. If an MP-TGN obtains different multisets of node embeddings for G(t) and G′(t), then the temporal WL test decides that G(t) and G′(t) are not isomorphic.

Proof. Recall that Proposition 3 shows that if an MP-TGN with memory is able to distinguish two nodes, then there is a memoryless MP-TGN with ∆ (temporal diameter) additional layers that does the same. Thus, it suffices to show that if the multisets of colors from the temporal WL test for G(t) and G′(t) after ℓ iterations are identical, then the multisets of embeddings from the memoryless MP-TGN are also identical, i.e., if {{c^ℓ(u)}}_{u∈V(G(t))} = {{c^ℓ(u′)}}_{u′∈V(G′(t))}, then {{h_u^{(ℓ)}(t)}}_{u∈V(G(t))} = {{h_{u′}^{(ℓ)}(t)}}_{u′∈V(G′(t))}. To do so, we repurpose the proof of Lemma 2 in [43].
More broadly, we show that for any two nodes of a temporal graph G(t), if the temporal WL test returns c^ℓ(u) = c^ℓ(v), then the corresponding embeddings from an MP-TGN without memory are identical: h_u^{(ℓ)}(t) = h_v^{(ℓ)}(t). We proceed with a proof by induction.
[Base case] For ℓ = 0, the proposition trivially holds as the temporal WL test uses the initial node features as colors, and memoryless MP-TGNs use these features as initial embeddings.
[Induction step] Assume the proposition holds for iteration ℓ. Thus, for any two nodes u, v, if c^{ℓ+1}(u) = c^{ℓ+1}(v), we have

(c^ℓ(u), {{(c^ℓ(i), e_{iu}(t′), t′) : (u, i, t′) ∈ G(t)}}) = (c^ℓ(v), {{(c^ℓ(j), e_{jv}(t′), t′) : (v, j, t′) ∈ G(t)}})

and, by the induction hypothesis, we know

(h_u^{(ℓ)}(t), {{(h_i^{(ℓ)}(t), e_{iu}(t′), t′) : (u, i, t′) ∈ G(t)}}) = (h_v^{(ℓ)}(t), {{(h_j^{(ℓ)}(t), e_{jv}(t′), t′) : (v, j, t′) ∈ G(t)}}).

This last identity also implies

(h_u^{(ℓ)}(t), {{(h_i^{(ℓ)}(t), t − t′, e) | (i, e, t′) ∈ N(u, t)}}) = (h_v^{(ℓ)}(t), {{(h_j^{(ℓ)}(t), t − t′, e) | (j, e, t′) ∈ N(v, t)}}),

since there exists an event (u, i, t′) ∈ G(t) with feature e_{ui}(t′) = e iff there is an element (i, e, t′) ∈ N(u, t).

As a result, the inputs of the MP-TGN's aggregation and update functions are identical, which leads to identical outputs h_u^{(ℓ+1)}(t) = h_v^{(ℓ+1)}(t). Therefore, if the temporal WL test obtains identical multisets of colors for two temporal graphs after ℓ steps, the multisets of embeddings at layer ℓ for these graphs are also identical.

Lemma B3 (Lemma 5 in [43]). Assume 𝒳 is countable. There exists a function f : 𝒳 → R^n so that h(X) = Σ_{x∈X} f(x) is unique for each multiset X ⊂ 𝒳 of bounded size. Moreover, any multiset function g can be decomposed as g(X) = ϕ(Σ_{x∈X} f(x)) for some function ϕ.

B.2 Proof of Proposition 1: Relationship between DTDGs and CTDGs


Proof. We prove the two statements in Proposition 1 separately. In the following, we treat CTDGs as
sets of events up to a given timestamp.
Statement 1: For any DTDG we can build a CTDG that contains the same information.
A DTDG consists of a sequence of graphs with no temporal information. We can model this using the
CTDG formalism by setting a fixed time difference δ between consecutive elements G(ti ), G(ti+1 )
of the CTDG, i.e., ti+1 − ti = δ for all i ≥ 0.
Consider a DTDG given by the sequence (G_1, G_2, . . . ). To build the equivalent CTDG, we define S(G_i) as the set of edge events corresponding to G_i, i.e., S(G_i) = {(u, v, iδ) : (u, v) ∈ E(G_i)}. We also make the edge features of these events match those in the DTDG, i.e., e_{uv}(iδ) = e_{uv} ∈ E_i. To account for node features, for all u ∈ V(G_i), we create an event (u, ∗, iδ) between u and a dummy node (denoted ∗ here), with feature e_u(iδ) = x_u ∈ X_i. Let C(G_i) denote the set comprising these node-level events. Then, we can construct the CTDG G(t_i) = ∪_{j=1}^{i} (S(G_j) ∪ C(G_j)) for i = 1, 2, . . .
Reconstructing the DTDG (G_1, G_2, . . . ) is trivial. To build G_i, it suffices to select all events at time iδ in the CTDG. Events involving the dummy node ∗ determine node features, and the remaining ones constitute the edges in the DTDG.
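As a minimal illustration of this construction, the sketch below converts a snapshot sequence into a set of timestamped events and back. The data structures (dictionaries per snapshot) and the string label for the dummy node are assumptions made only for illustration.

```python
def dtdg_to_ctdg(snapshots, delta=1.0, dummy="*"):
    """Convert a DTDG (list of snapshots) into a CTDG event set.

    Each snapshot is assumed to be a dict with keys 'edges' (list of (u, v, feat))
    and 'nodes' (dict node -> feature vector).
    """
    events = []
    for i, g in enumerate(snapshots, start=1):
        t = i * delta                                    # uniform spacing between snapshots
        for u, v, feat in g["edges"]:                    # edge events S(G_i)
            events.append((u, v, t, feat))
        for u, x in g["nodes"].items():                  # node-feature events C(G_i): edge to dummy node
            events.append((u, dummy, t, x))
    return events

def ctdg_to_dtdg(events, delta=1.0, dummy="*"):
    """Recover the snapshot sequence by grouping events with the same timestamp."""
    snapshots = {}
    for u, v, t, feat in events:
        i = int(round(t / delta))
        g = snapshots.setdefault(i, {"edges": [], "nodes": {}})
        if v == dummy:
            g["nodes"][u] = feat
        else:
            g["edges"].append((u, v, feat))
    return [snapshots[i] for i in sorted(snapshots)]
```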
Statement 2: The converse holds if the CTDG timestamps form a subset of some uniformly spaced
countable set.
We say that a countable set A ⊂ R is uniformly spaced if there exists some δ ∈ R such that
ai+1 − ai = δ for all i where (a1 , a2 , . . .) is the ordered sequence formed from elements ar of A,
i.e., a1 < a2 < . . . < ai < ai+1 , . . .
Note that DTDGs are naturally represented by a set of uniformly spaced timestamps. This is because DTDGs correspond to sequences that do not contain any time information. Let T denote the set of CTDG timestamps and let 𝒯 ⊇ T be a countable and uniformly spaced set. Our idea is to construct a DTDG sequence whose timestamps coincide with the elements of 𝒯. Then, since T ⊆ 𝒯, we do not lose any information pertaining to events occurring at timestamps given by T. Without loss of generality, in the following we assume that the elements of T and 𝒯 are arranged in increasing order, i.e., t_i < t_{i+1} for all i, and τ_k < τ_{k+1} for all k.
Consider a CTDG (G(t_1), G(t_2), . . . ) such that G(t_i) = {(u, v, t) : t ∈ T and t ≤ t_i} for t_i ∈ T. Also, let H(t_i) = {(u, v, t) ∈ G(t_i) : t = t_i} denote the set of events at time t_i ∈ T. We can build a corresponding DTDG (G_1, G_2, . . . ) such that, for all τ_k ∈ 𝒯, the k-th snapshot G_k is given by

V(G_k) = {u : (u, ·, τ_k) ∈ H(τ_k)} if τ_k ∈ T, and V(G_k) = ∅ otherwise;
E(G_k) = {(u, v) : (u, v, τ_k) ∈ H(τ_k)} if τ_k ∈ T, and E(G_k) = ∅ otherwise.

To recover the original CTDG, we can adapt the reconstruction procedure used in the previous part of the proof. We define

Ĩ = {(i, k) ∈ N × N : τ_k = t_i for t_i ∈ T and τ_k ∈ 𝒯}.   (S4)

Note that we can treat Ĩ as a map by defining Ĩ(i) = k if and only if (i, k) ∈ Ĩ. To recover the original CTDG, we first create the set of events S(G_k) = {(u, v, kδ) : (u, v) ∈ E(G_k)}. Then, we build G(t_i) = ∪_{j : j ≤ Ĩ(i)} S(G_j) for t_i ∈ T.

B.3 Proof of Lemma 1
Proof. Here we show that if two nodes u and v have isomorphic (L-depth) TCTs, then MP-TGNs (with L layers) compute identical embeddings for u and v. Formally, let T_{u,ℓ}(t) denote the TCT of u with ℓ layers. We want to show that T_{u,ℓ}(t) ≅ T_{v,ℓ}(t) ⇒ h_u^{(ℓ)}(t) = h_v^{(ℓ)}(t). We employ a proof by induction on ℓ. Since there is no ambiguity, we drop the dependence on time in the following.

[Base case] Consider ℓ = 1. By the isomorphism assumption T_{u,1} ≅ T_{v,1}, we have h_u^{(0)} = s_u = s_v = h_v^{(0)}: the roots of both trees have the same states. Also, for any child i of u in T_{u,1} there is a corresponding child f(i) in T_{v,1} with s_i = s_{f(i)}. Recall that the ℓ-th layer aggregation function AGG^{(ℓ)}(·) acts on multisets of triplets of previous-layer embeddings, edge features, and timestamps of temporal neighbors (see Equation 1). Since the temporal neighbors of u correspond to its children in T_{u,1}, the outputs of the aggregation function for u and v are identical: h̃_u^{(1)} = h̃_v^{(1)}. In addition, since the initial embeddings of u and v are also equal (i.e., h_u^{(0)} = h_v^{(0)}), we can ensure that the update function returns h_u^{(1)} = h_v^{(1)}.

[Induction step] Assuming that T_{u,ℓ−1} ≅ T_{v,ℓ−1} ⇒ h_u^{(ℓ−1)} = h_v^{(ℓ−1)} for any pair of nodes u and v, we will show that T_{u,ℓ} ≅ T_{v,ℓ} ⇒ h_u^{(ℓ)} = h_v^{(ℓ)}. For any child i of u, let us denote the subtree of T_{u,ℓ} rooted at i by T_i. We know that T_i has depth ℓ − 1, and since T_{u,ℓ} ≅ T_{v,ℓ}, there exists a corresponding subtree of T_v (of depth ℓ − 1) rooted at f(i) such that T_i ≅ T_{f(i)}. Using the induction hypothesis, we obtain that the multisets of embeddings from the children i of u and the children f(i) of v are identical. Note that if two ℓ-depth TCTs are isomorphic, they are also isomorphic up to depth ℓ − 1, i.e., T_{u,ℓ} ≅ T_{v,ℓ} implies T_{u,ℓ−1} ≅ T_{v,ℓ−1} and, consequently, h_u^{(ℓ−1)} = h_v^{(ℓ−1)} (by the induction hypothesis). Thus, the inputs of the aggregation and update functions are identical and they compute the same embeddings for u and v.

B.4 Proof of Proposition 2: Most expressive MP-TGNs


Proof. Consider MP-TGNs with parameter values that make AGG(`) (·) and U PDATE(`) (·) injective
functions on multisets of triples of hidden representations, edge features and timestamps. The
existence of these parameters is guaranteed by the fact that, at any given time t, the space of node
states (and hidden embeddings from temporal neighbors), edge features and timestamps is finite (see
Lemma B3).
Again, let Tu,` (t) denote the TCT of u with ` layers. We want to prove that, under the injectivity
assumption, if Tu,` (t) ∼
(`) (`)
6 Tv,` (t), then hu (t) 6= hu (t) for any two nodes u and v. In the following,
=
we simplify notation by removing the dependence on time. We proceed with proof by induction on
the TCT’s depth `. Also, keep in mind that ϕ` = U PDATE(`) ◦ AGG(`) is injective for any `.
[Base case] For ` = 1, if Tu,1 6=∼ Tv,1 then the root node states are different (i.e., su 6= sv ) or the
multiset of states/edge features/ timestamps triples from u and v’s children are different. In both
cases, the inputs of ϕ` are different and it therefore outputs different embeddings for u and v.

[Induction step] The inductive hypothesis is Tu,`−1 ∼


(`−1) (`−1)
6 Tv,`−1 ⇒ hu
= 6= hv for any pair of
nodes u and v. If Tu,` =∼
6 Tv,` , at least one of the following holds: i) the states of u and v are different,
ii) the multisets of edges (edge features/ timestamps) with endpoints in u and v are different, or iii)
there is no pair-wise isomorphism between the TCTs rooted at u and v’s children. In the first two
cases, ϕ` trivially outputs different embeddings for u and v. We are left with the case in which only
the latter occurs. Using our inductive hypothesis, the lack of a (isomorphism ensuring) bijection
between the TCTs rooted at u and v’s children implies there is also no bijection between their multiset
of embeddings. In turn, this guarantees that ϕ will output different embeddings for u and v.

B.5 Proof of Proposition 3: The role of memory


Proof. We prove the two parts of the proposition separately. In the following proofs, we rely on the
concept of monotone TCTs (see Definition 2).
Statement 1: If L < ∆, Q_L^{[M]} is strictly stronger than Q_L.

We know that the family of L-layer MP-TGNs with memory comprises the family of L-layer MP-TGNs without memory (we can assume identity memory). Therefore, Q_L^{[M]} is at least as powerful as Q_L. To show that Q_L^{[M]} is strictly stronger (more powerful) than Q_L when L < ∆, it suffices to create an example for which memory helps distinguish a pair of nodes. We provide a trivial example in Figure S1 for L = 1. Note that the 1-depth TCTs of u and v are isomorphic when no memory is used. However, when equipped with memory, the interaction (b, c, t_1) affects the states of v and c, making the 1-depth TCTs of u and v (at time t > t_2) no longer isomorphic.

Figure S1: Temporal graph where all initial node features and edge features are identical, and t2 > t1 .

Statement 2: For any L, Q_{L+∆} is at least as powerful as Q_L^{[M]}.
It suffices to show that if Q_{L+∆} cannot distinguish a pair of nodes u and v, then Q_L^{[M]} cannot distinguish them either. Let T^M_{u,L}(t) and T_{u,L}(t) denote the L-depth TCTs of u with and without memory, respectively. Using Lemma 1, this is equivalent to showing that T_{u,L+∆}(t) ≅ T_{v,L+∆}(t) ⇒ T^M_{u,L}(t) ≅ T^M_{v,L}(t), since no MP-TGN can separate nodes associated with isomorphic TCTs. In the following, when we omit the number of layers from TCTs, we assume TCTs of arbitrary depth.
when we omit the number of layers from TCTs, we assume TCTs of arbitrary depth.
Step 1: Characterizing the dependence of memory on initial states and events in the dynamic graph.
We now show that the memory of a node u, after processing all events with timestamp ≤ t_n, depends on the initial states of a set of nodes V_u^n, and on a set of events, annotated with their respective timestamps and features, B_u^n. If at time t_n no event involves a node z, we set B_z^n = B_z^{n−1} and V_z^n = V_z^{n−1}. We also initialize B_u^0 = ∅ and V_u^0 = {u} for all nodes u. We proceed with a proof by induction on the number of observed timestamps n.
[Base case] Let I_1(u) = {v : (u, v, t_1) ∈ G(t_1^+)} be the set of nodes interacting with u at time t_1, where G(t_1^+) is the temporal graph right after t_1. Similarly, let J_1(u) = {(u, ·, t_1) ∈ G(t_1^+)} be the set of events involving u at time t_1. Recall that up until t_1, all memory states equal initial node features (i.e., s_u(t_1) = s_u(0)). Then, the updated memory (see Equation 3 and Equation 4) for u depends on V_u^1 = V_u^0 ∪_{v∈I_1(u)} V_v^0 and B_u^1 = B_u^0 ∪ J_1(u).
[Induction step] Assume that the proposition holds for timestamp t_{n−1}. We now show that it holds for t_n. Since the proposition holds for n−1 timestamps, we know that the memory of any w that interacts with u at t_n, i.e., w ∈ I_n(u), depends on V_w^{n−1} and B_w^{n−1}, and that the memory of u so far depends on V_u^{n−1} and B_u^{n−1}. Then, the updated memory for u depends on V_u^n = V_u^{n−1} ∪_{w∈I_n(u)} ({w, u} ∪ V_w^{n−1}) and B_u^n = B_u^{n−1} ∪ J_n(u) ∪_{w∈I_n(u)} B_w^{n−1}.

Step 2: (z, w, t_{zw}) ∈ B_u^n if and only if there is a path (u_k, t_k = t_{zw}, u_{k+1}) in T̃_u(t_n^+), the monotone TCT of u (see Definition 2) after processing events with timestamp ≤ t_n, with either ]u_k = z, ]u_{k+1} = w or ]u_k = w, ]u_{k+1} = z.
[Forward direction] An event (z, w, t_{zw}) with t_{zw} ≤ t_n will be in B_u^n only if z = u or w = u, or if there is a subset of events {(u, ]u_1, t_1), (]u_1, ]u_2, t_2), . . . , (]u_k, ]u_{k+1}, t_{zw})} with ]u_k = z and ]u_{k+1} = w such that t_n ≥ t_1 > · · · > t_{zw}. In either case, this leads to a root-to-leaf path in T̃_u(t_n^+) passing through (u_k, t_{zw}, u_{k+1}). This subset of events can be easily obtained by backtracking the edges that caused unions/updates in the procedure from Step 1.
[Backward direction] Assume there is a subpath p = (u_k, t_k = t_{zw}, u_{k+1}) in T̃_u(t_n^+) with ]u_k = z and ]u_{k+1} = w such that (z, w, t_{zw}) ∉ B_u^n. Since we can obtain p from T̃_u(t_n^+), we know that the sequence of events r = ((u, ]u_1, t_1), . . . , (]u_{k−2}, ]u_{k−1} = z, t_{k−1}), (z, w, t_k = t_{zw})) happened and that t_i > t_{i+1} for all i. However, since (z, w, t_{zw}) ∉ B_u^n, there must be no monotone walk starting from u that goes through the edge (z, w, t_{zw}) to arrive at w, which is exactly what r characterizes. Thus, we reach a contradiction.

Note that the nodes in V_u^n are simply the nodes that appear as an endpoint in the events of B_u^n, and are therefore also nodes in T̃_u(t_n^+), and vice-versa.

Step 3: For any node u, there is a bijection that maps (V_u^n, B_u^n) to T̃_u(t_n^+).

First, we note that (V_u^n, B_u^n) depends on a subset of all events, which we denote G′ ⊆ G(t_n^+). Since B_u^n contains all events in G′ and (V_u^n, B_u^n) can be uniquely constructed from G′, there is a bijection g that maps G′ to (V_u^n, B_u^n).
Similarly, T̃_u(t_n^+) also depends on a subset of events, which we denote G″ ⊆ G(t_n^+). We note that the unique events in T̃_u(t_n^+) correspond to G″, and we can uniquely build the tree T̃_u(t_n^+) from G″. This implies that there is a bijection h that maps G″ to T̃_u(t_n^+).
We have previously shown that all events in B_u^n are also in T̃_u(t_n^+) and vice-versa. This implies that both objects depend on the same events, and thus on the same subset of all events, i.e., G′ = G″ = G_S. Since there is a bijection g between G_S and (V_u^n, B_u^n), and a bijection h between G_S and T̃_u(t_n^+), there exists a bijection f between (V_u^n, B_u^n) and T̃_u(t_n^+).
Step 4: If T_{u,L+∆}(t^+) ≅ T_{v,L+∆}(t^+), then T^M_{u,L}(t^+) ≅ T^M_{v,L}(t^+).

To simplify notation, we omit the dependence on time here.

Any node w ∈ T^M_{u,L} also appears in T_{u,L+∆} at the same level. The subtree of T_{u,L+∆} rooted at w, denoted here by T′_w, has depth k ≥ ∆. Note that T′_w corresponds to the k-depth TCT of ]w. Since the depth of T′_w is at least ∆, we know that T̃′_w ≅ T̃_{]w}, i.e., imposing the time constraints on T′_w results in the monotone TCT of node ]w. Also, because the memory of ]w depends on T̃_{]w}, T′_w comprises the information used to compute the memory state of ]w. Note that this applies to any w in T^M_{u,L}; thus, T_{u,L+∆} contains all we need to compute the states of any node of the dynamic graph that appears in T^M_{u,L}. The same argument applies to T_{v,L+∆} and T^M_{v,L}. Finally, since T^M_{u,L} can be uniquely computed from T_{u,L+∆}, and T^M_{v,L} from T_{v,L+∆}, if T_{u,L+∆} ≅ T_{v,L+∆}, then T^M_{u,L} ≅ T^M_{v,L}.
M

B.6 Proof of Proposition 4: Limitations of TGAT and TGN-Att


Proof. In this proof, we first provide an example of a dynamic graph where the TCTs of two nodes u and v are not isomorphic. Then, we show that we cannot find a TGAT model such that h_u^{(L)}(t) ≠ h_v^{(L)}(t), i.e., TGAT does not distinguish u and v. Next, we show that even if we consider TGATs with memory (TGN-Att), it is still not possible to distinguish nodes u and v in our example.

Figure S2: (Leftmost) Example of a temporal graph for which TGN-Att and TGAT cannot distinguish
nodes u and v even though their TCTs are non-isomorphic. Colors denote node features and all edge
features are identical, and t2 > t1 (and t > t2 ). (Right) The 2-depth TCTs of nodes u, v, z and w.
The TCTs of u and v are non-isomorphic whereas the TCTs of z and w are isomorphic.

Figure S2(leftmost) provides a temporal graph where all edge events have the same edge features.
Colors denote node features. As we can observe, the TCTs of nodes u and v are not isomorphic. In
the following, we consider node distinguishability at time t > t2 .
Statement 1: TGAT cannot distinguish the nodes u and v in our example.
Step 1: For any TGAT with ℓ layers, we have h_w^{(ℓ)}(t) = h_z^{(ℓ)}(t).
We note that the ℓ-layer TCTs of nodes w and z are isomorphic for any ℓ. To see this, one can consider the symmetry around node u that allows us to define a node permutation (bijection) f given by f(z) = w, f(w) = z, f(a) = v, f(u) = u, f(b) = c, f(c) = b, f(v) = a. Figure S2 (right) provides an illustration of the 2-depth TCTs of z and w at time t > t_2.

By Lemma 1, if the ℓ-layer TCTs of two nodes z and w are isomorphic, then no ℓ-layer MP-TGN can distinguish them. Thus, we conclude that h_w^{(ℓ)}(t) = h_z^{(ℓ)}(t) for any TGAT with an arbitrary number of layers ℓ.
Step 2: There is no TGAT such that h_v^{(ℓ)}(t) ≠ h_u^{(ℓ)}(t).
To compute h_v^{(ℓ)}(t), TGAT aggregates the messages of v's temporal neighbors at layer ℓ − 1, and then combines h_v^{(ℓ−1)}(t) with the aggregated message h̃_v^{(ℓ)}(t) to obtain h_v^{(ℓ)}(t).
Note that N(u, t) = {(z, e, t_1), (w, e, t_1)} and N(v, t) = {(w, e, t_1)}, where e denotes an edge feature vector. Also, we have previously shown that h_w^{(ℓ−1)}(t) = h_z^{(ℓ−1)}(t).
Using the TGAT aggregation layer (Equation S2), the query vectors of u and v are q_u = [h_u^{(ℓ−1)}(t) ∥ φ(0)] W_q^{(ℓ)} and q_v = [h_v^{(ℓ−1)}(t) ∥ φ(0)] W_q^{(ℓ)}, respectively.
Since all events share the common edge feature e, the matrices C_u^{(ℓ)} and C_v^{(ℓ)} contain the same vector in their rows. The single-row matrix C_v^{(ℓ)} is given by C_v^{(ℓ)} = [h_w^{(ℓ−1)}(t) ∥ φ(t − t_1) ∥ e], while the two-row matrix C_u^{(ℓ)} = [ [h_w^{(ℓ−1)}(t) ∥ φ(t − t_1) ∥ e]; [h_z^{(ℓ−1)}(t) ∥ φ(t − t_1) ∥ e] ], with h_w^{(ℓ−1)}(t) = h_z^{(ℓ−1)}(t). We can express C_u^{(ℓ)} = [1, 1]^⊤ r and C_v^{(ℓ)} = r, where r denotes the row vector r = [h_z^{(ℓ−1)}(t) ∥ φ(t − t_1) ∥ e].
Using the key and value matrices of node v, i.e., K_v = C_v^{(ℓ)} W_K^{(ℓ)} and V_v = C_v^{(ℓ)} W_V^{(ℓ)}, we have that

h̃_v^{(ℓ)}(t) = softmax(q_v K_v^⊤) V_v
            = softmax(q_v K_v^⊤) r W_V^{(ℓ)}   [the softmax of a single element equals 1]
            = r W_V^{(ℓ)}
            = softmax(q_u K_u^⊤) [1, 1]^⊤ r W_V^{(ℓ)} = h̃_u^{(ℓ)}(t)   [softmax outputs a convex combination, so softmax(q_u K_u^⊤)[1, 1]^⊤ = 1]

We have shown that the aggregated messages of nodes u and v are the same at any layer ℓ. We also note that the initial embeddings are identical, h_v^{(0)}(t) = h_u^{(0)}(t), as u and v have the same color. Recall that the update step is h_v^{(ℓ)}(t) = MLP(h_v^{(ℓ−1)}(t), h̃_v^{(ℓ)}(t)). Therefore, if the initial embeddings are identical and the aggregated messages at each layer are also identical, we have h_u^{(ℓ)}(t) = h_v^{(ℓ)}(t) for any ℓ.
Statement 2: TGN-Att cannot distinguish the nodes u and v in our example.
We now show that adding a memory module to TGAT produces node states such that s_u(t) = s_v(t) = s_a(t), s_z(t) = s_w(t), and s_b(t) = s_c(t). If that is the case, then these node states can be treated as node features in an equivalent TGAT model of our example in Figure S2, proving that there is no TGN-Att such that h_v^{(ℓ)}(t) ≠ h_u^{(ℓ)}(t). In the following, we consider TGN-Att with average memory aggregators (see Appendix A).
We begin by showing that s_a(t) = s_u(t) = s_v(t) after the memory updates. We note that the memory message node a receives is [e ∥ t_1 ∥ s_z(t_1)]. The memory message node u receives is MEANAGG([e ∥ t_1 ∥ s_w(t_1)], [e ∥ t_1 ∥ s_z(t_1)]), but since s_w(t_1) = s_z(t_1), both messages are the same, and the average aggregator outputs [e ∥ t_1 ∥ s_z(t_1)]. Finally, the message that node v receives is [e ∥ t_1 ∥ s_w(t_1)] = [e ∥ t_1 ∥ s_z(t_1)]. Since all three nodes receive the same memory message and have the same initial features, their updated memory states are identical.
Now we show that s_z(t) = s_w(t) for t_1 < t ≤ t_2. Note that the message that node z receives is MEANAGG([e ∥ t_1 ∥ s_a(t_1)], [e ∥ t_1 ∥ s_u(t_1)]) = [e ∥ t_1 ∥ s_u(t_1)], with s_u(t_1) = s_a(t_1). The message that node w receives is MEANAGG([e ∥ t_1 ∥ s_u(t_1)], [e ∥ t_1 ∥ s_v(t_1)]) = [e ∥ t_1 ∥ s_u(t_1)]. Again, since the initial features and the messages received by each node are equal, s_z(t) = s_w(t) for t_1 < t ≤ t_2. We can then use this to show that s_z(t) = s_w(t) for t > t_2. Note that at time t_2, the messages that nodes z and w receive are [e ∥ t_2 − t_1 ∥ s_b(t_2)] and [e ∥ t_2 − t_1 ∥ s_c(t_2)], respectively. Also, note that s_b(t_2) = s_c(t_2) = s_b(0) = s_c(0), as the states of b and c are only updated right after t_2. Because the received messages and the previous states (up until t_2) of z and w are identical, we have s_z(t) = s_w(t) for t > t_2.
Finally, we show that s_b(t) = s_c(t). Using that s_z(t_2) = s_w(t_2) in conjunction with the fact that node b receives the message [e ∥ t_2 − t_1 ∥ s_z(t_2)] and node c receives [e ∥ t_2 − t_1 ∥ s_w(t_2)], we obtain s_b(t) = s_c(t), since the initial memory states and the messages the nodes receive are the same.

B.7 Proof of Proposition 5: Limitations of MP-TGN and CAWs

Figure S3: (Left) Example of a temporal graph for which CAW can distinguish the events (u, v, t_3) and (z, v, t_3) but MP-TGNs cannot. We assume that all edge and node features are identical, and t_{k+1} > t_k for all k. (Right) Example for which MP-TGNs can distinguish (u, z, t_4) and (u′, z, t_4) but CAW cannot.

Proof. Using the example in Figure S3 (left), we adapt a construction by Wang et al. [38] to show that CAW can separate events that MP-TGNs adopting node-embedding concatenation cannot. We first note that the TCTs of u and z are isomorphic. Thus, since v is a common endpoint in (u, v, t_3) and (z, v, t_3), no MP-TGN can distinguish these two events. Nonetheless, CAW obtains the following anonymized walks for the event (u, v, t_3) (each set is annotated with the corresponding I_CAW term):

{[1, 0, 0], [0, 1, 0]} →(t_1) {[0, 1, 0], [2, 0, 0]}   [I_CAW(u; S_u, S_v) → I_CAW(v; S_u, S_v)]
{[0, 1, 0], [2, 0, 0]} →(t_1) {[1, 0, 0], [0, 1, 0]}   [I_CAW(v; S_u, S_v) → I_CAW(u; S_u, S_v)]
{[0, 1, 0], [2, 0, 0]} →(t_2) {[0, 0, 0], [0, 1, 0]} →(t_1) {[0, 0, 0], [0, 0, 1]}   [I_CAW(v) → I_CAW(w) → I_CAW(z)]

and the walks associated with (z, v, t_3) are (here we omit the I_CAW annotations for readability):

{[1, 0, 0], [0, 0, 1]} →(t_1) {[0, 1, 0], [0, 1, 0]}
{[0, 0, 0], [2, 0, 0]} →(t_1) {[0, 0, 0], [0, 1, 0]}
{[0, 0, 0], [2, 0, 0]} →(t_2) {[0, 1, 0], [0, 1, 0]} →(t_1) {[1, 0, 0], [0, 0, 1]}

In this example, assume that the MLPs used to encode each walk correspond to identity mappings. Then, the sum of the elements in each set is injective since each element of the sets in the anonymized walks is a one-hot vector. We note that, in this example, we can simply choose an RNN that sums the vectors in each sequence (walk), and then apply a mean readout layer (or pooling aggregator) to obtain distinct representations for (u, v, t_3) and (z, v, t_3).
We now use the example in Figure S3 (right) to show that MP-TGNs can separate events that CAW cannot. To see why MP-TGNs can separate the events (u, z, t_4) and (u′, z, t_4), it suffices to observe that the 4-depth TCTs of u and u′ are non-isomorphic. Thus, an MP-TGN with injective layers can distinguish these events. Now, let us take a look at the anonymized walks for (u, z, t_4):

{[1, 0], [0, 0]} →(t_1) {[0, 1], [0, 0]}   [I_CAW(u; S_u, S_z) → I_CAW(v; S_u, S_z)]
{[0, 0], [1, 0]} →(t_1) {[0, 0], [0, 1]}   [I_CAW(z; S_u, S_z) → I_CAW(w; S_u, S_z)]

and for (u′, z, t_4):

{[1, 0], [0, 0]} →(t_1) {[0, 1], [0, 0]}   [I_CAW(u′; S_{u′}, S_z) → I_CAW(v′; S_{u′}, S_z)]
{[0, 0], [1, 0]} →(t_1) {[0, 0], [0, 1]}   [I_CAW(z; S_{u′}, S_z) → I_CAW(w; S_{u′}, S_z)]

Since the sets of walks are identical, they must have the same embedding. Therefore, there is no CAW model that can separate these two events.

B.8 Proof of Proposition 6: Injective MP-TGNs and the temporal WL test


We want to prove that injective MP-TGNs can separate two temporal graphs if and only if the temporal WL test does the same. Our proof comprises two parts. We first show that if an MP-TGN produces different multisets of embeddings for two non-isomorphic temporal graphs G(t) and G′(t), then the temporal WL test decides these graphs are not isomorphic. Then, we prove that, if the temporal WL test decides G(t) and G′(t) are non-isomorphic, there is an injective MP-TGN (i.e., with injective message-passing layers) that outputs distinct multisets of embeddings.
Statement 1: Temporal WL is at least as powerful as MP-TGNs.
See Lemma B2 for proof.
Statement 2: Injective MP-TGN is at least as powerful as temporal WL.

Proof. To prove this, we can repurpose the proof of Theorem 3 in [43]. In particular, we assume MP-TGNs that meet the injectivity requirements of Proposition 2, i.e., MP-TGNs that implement injective aggregation and update functions on multisets of hidden representations from temporal neighbors. Following their footsteps, we prove that there is an injection ϕ from the colors that the temporal WL test assigns to the nodes of a temporal graph to the corresponding node embeddings. We do so via induction on the number of layers ℓ. To achieve our purpose, we can assume identity memory without loss of generality.
The base case (ℓ = 0) is straightforward since the temporal WL test initializes colors with node features. We now focus on the inductive step. Suppose the proposition holds for ℓ − 1. Note that our update function

h_v^{(ℓ)}(t) = UPDATE^{(ℓ)}( h_v^{(ℓ−1)}(t), AGG^{(ℓ)}({{(h_u^{(ℓ−1)}(t), t − t′, e) | (u, e, t′) ∈ N(v, t)}}) )

can be rewritten using ϕ as a function of node colors:

h_v^{(ℓ)}(t) = UPDATE^{(ℓ)}( ϕ(c^{ℓ−1}(v)), AGG^{(ℓ)}({{(ϕ(c^{ℓ−1}(u)), t − t′, e) | (u, e, t′) ∈ N(v, t)}}) ).

Note that the composition of injective functions is also injective. In addition, time-shifting operations are also injective. Thus, we can construct an injection ψ such that

h_v^{(ℓ)}(t) = ψ( c^{ℓ−1}(v), {{(c^{ℓ−1}(u), t′, e) | (u, e, t′) ∈ N(v, t)}} )
            = ψ( c^{ℓ−1}(v), {{(c^{ℓ−1}(u), t′, e_{uv}(t′)) | (v, u, t′) ∈ G(t)}} ),

since there exists an element (u, e, t′) ∈ N(v, t) if and only if there is an event (u, v, t′) ∈ G(t) with feature e_{uv}(t′) = e.
Then, we can write

h_v^{(ℓ)}(t) = ψ ∘ HASH^{−1} ∘ HASH( c^{ℓ−1}(v), {{(c^{ℓ−1}(u), t′, e_{uv}(t′)) | (u, v, t′) ∈ G(t)}} )
            = ψ ∘ HASH^{−1}(c^{(ℓ)}(v)).

Note that ϕ = ψ ∘ HASH^{−1} is injective since it is a composition of two injective functions. We then conclude that if the temporal WL test outputs different multisets of colors, then a suitable MP-TGN outputs different multisets of embeddings.
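For reference, below is a minimal sketch of one color-refinement round of the temporal WL test used above. The event-list layout, the node labels, and the use of Python's built-in hash (as a stand-in for an injective HASH) are illustrative assumptions.

```python
def temporal_wl_round(colors, events, t):
    """One refinement round: c_new(v) = HASH(c(v), {{(c(u), t', e_uv(t')) : (u, v, t') in G(t)}})."""
    new_colors = {}
    for v in colors:
        # Multiset of (neighbor color, timestamp, edge feature) over all events incident to v.
        nbr_multiset = sorted(
            (colors[u], t_e, feat)
            for (a, b, t_e, feat) in events if t_e <= t
            for u in ((b,) if a == v else ((a,) if b == v else ()))
        )
        new_colors[v] = hash((colors[v], tuple(nbr_multiset)))
    return new_colors

# Toy usage: three nodes with identical initial features (color 0) and two events.
events = [("a", "b", 1.0, "e1"), ("b", "c", 2.0, "e1")]
colors = {"a": 0, "b": 0, "c": 0}
for _ in range(2):                      # iterate; two rounds suffice in this toy example
    colors = temporal_wl_round(colors, events, t=3.0)
print(colors["a"] == colors["c"])       # False: a and c receive different colors
```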

Figure S4: Examples of temporal graphs for which MP-TGNs cannot distinguish the diameter, girth, and number of cycles. For any node in G(t) (e.g., u_1), there is a corresponding one in G′(t) (u′_1) whose TCTs are isomorphic.

B.9 Proof of Proposition 7: MP-TGNs and CAWs fail to decide some graph properties
Statement 1: MP-TGNs fail to decide some graph properties.

Proof. Adapting a construction by Garg et al. [9], we provide in Figure S4 an example that demonstrates Proposition 7. Colors denote node features, and all edge features are identical. The temporal graphs G(t) and G′(t) are non-isomorphic and differ in properties such as diameter (∞ for G(t) and 3 for G′(t)), girth (3 for G(t) and 6 for G′(t)), and number of cycles (2 for G(t) and 1 for G′(t)). In spite of that, for t > t_3, the set of embeddings of nodes in G(t) is the same as that of nodes in G′(t) and, therefore, MP-TGNs cannot decide these properties. In particular, by constructing the TCTs of all nodes at time t > t_3, we observe that the TCTs of the pairs (u_1, u′_1), (u_2, u′_2), (v_1, v′_1), (v_2, v′_2), (w_1, w′_1), (w_2, w′_2) are isomorphic and, therefore, they cannot be distinguished (Lemma 1).

Statement 2: CAWs fail to decide some graph properties.


Since CAW does not provide a recipe to obtain graph-level embeddings, we first define such a procedure. Let G(t) be a temporal graph given as a set of events. We sequentially compute event embeddings h_γ for each event γ = (u, v, t′) ∈ G(t), respecting the temporal order (embeddings of two or more events with the same timestamp are computed in parallel). We then apply a readout layer to the set of event embeddings to obtain a graph-level representation. We provide a proof assuming this procedure.

Figure S5: Examples of temporal graphs with different static properties, such as diameter, girth, and
number of cycles. CAWs fail to distinguish G1 (t) and G2 (t).

Proof. We can adapt our construction in Figure 2 (rightmost) to extend Proposition 7 to CAW. The idea consists of creating two temporal graphs with different diameters, girths, and numbers of cycles that comprise events that CAW cannot separate; Figure S5 provides one such construction. In particular, CAW obtains identical embeddings for (u, z, t_3) and (a′, z′, t_3) (as shown in Proposition 5). The remaining events are the same up to node re-labelling and thus also lead to identical embeddings. Therefore, CAW cannot distinguish G_1(t) and G_2(t) although they clearly differ in diameter, girth, and number of cycles.

B.10 Proof of Lemma 2


We now show that the k-th component of the relative positional feature r_{u→v}^{(t)} corresponds to the number of occurrences of u at the k-th layer of the monotone TCT of v, and that this holds for all pairs of nodes u and v of the dynamic graph. We proceed with a proof by induction.
[Base case] Let us consider t = 0, i.e., no events have occurred. By definition, r_{u→v}^{(0)} is the zero vector if u ≠ v, indicating that node u does not belong to the TCT of v. If u = v, then r_{u→u}^{(0)} = [1, 0, . . . , 0] corresponds to a count of 1 for the root of the TCT of u. Thus, for t = 0, the proposition holds.

[Induction step] Assume that the proposition holds for all nodes and any time instant up to t. We will show that after the event γ = (u, v, t) at time t, the proposition remains true.
Note that the event γ only impacts the monotone TCTs of u and v. The reason is that the monotone TCTs of all other nodes have timestamps lower than t, which prevents the event γ from belonging to any path (with decreasing timestamps) from the root.
Without loss of generality, let us now consider the impact of γ on the monotone TCT of v. Figure S6 shows how the TCT of v changes after γ, i.e., how it goes from T̃_v(t) to T̃_v(t^+). In particular, the process attaches the TCT of u to the root node v. Under this change, we need to update the counts of all nodes i in T̃_u(t) regarding how many times they appear in T̃_v(t^+). We do so by adding the counts in T̃_u(t) (i.e., r_{i→u}^{(t)}) to those in T̃_v(t) (i.e., r_{i→v}^{(t)}), accounting for the 1-layer mismatch, since T̃_u(t) is attached at the first layer. This can be easily achieved with the d × d shift matrix

P = [ 0 0 ; I_{d−1} 0 ]

applied to the counts of any node i in T̃_u(t), i.e.,

r_{i→v}^{(t^+)} = P r_{i→u}^{(t)} + r_{i→v}^{(t)}   ∀i ∈ V_u^{(t)},

where V_u^{(t)} comprises the nodes of the original graph that belong to T̃_u(t).

Figure S6: Illustration of how the monotone TCT of v changes after an event between u and v at time t. This allows us to see how to update the positional features of any node i of the dynamic graph that belongs to T̃_u(t) relative to v.

Similarly, the event γ also affects the counts of the nodes in the monotone TCT of v with respect to the monotone TCT of u. To account for that change, we follow the same procedure and update r_{j→u}^{(t^+)} = P r_{j→v}^{(t)} + r_{j→u}^{(t)} for all j ∈ V_v^{(t)}.
Handling multiple events at the same time. We now consider the setting where a given node v interacts with multiple nodes u_1, u_2, . . . , u_J at time t. We can extend the computation of positional features to this setting in a straightforward manner by noting that each event leads to an independent branch in the TCT of v. Therefore, the update of the positional features with respect to v is given by

r_{i→v}^{(t^+)} = Σ_{j=1}^{J} P r_{i→u_j}^{(t)} + r_{i→v}^{(t)}   ∀i ∈ ∪_{j=1}^{J} V_{u_j}^{(t)},
V_v^{(t^+)} = V_v^{(t)} ∪ ∪_{j=1}^{J} V_{u_j}^{(t)}.

We note that the positional features of u_1, . . . , u_J remain untouched if these nodes do not interact with other nodes at time t.
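For concreteness, a minimal NumPy sketch of this incremental update (for single events; the multi-event extension is omitted) is given below. The dictionary-of-vectors representation of r_{i→v} and the toy node labels are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

d = 4                                        # number of layers tracked by the positional features
P = np.zeros((d, d))
P[1:, :-1] = np.eye(d - 1)                   # shift matrix: moves counts one layer down

# r[v][i] holds the positional feature r_{i -> v}; initially r_{u -> u} = [1, 0, ..., 0].
def init_node(r, u):
    r[u][u] = np.eye(d)[0]                   # the node is the root of its own TCT

def process_event(r, u, v):
    """Update positional features after an event (u, v, t), following Lemma 2's update rule."""
    r_u, r_v = dict(r[u]), dict(r[v])        # snapshot pre-event counts for both endpoints
    for i, vec in r_u.items():               # nodes in the monotone TCT of u get attached under v
        r[v][i] = P @ vec + r_v.get(i, np.zeros(d))
    for j, vec in r_v.items():               # and symmetrically for u
        r[u][j] = P @ vec + r_u.get(j, np.zeros(d))

# Toy usage with three nodes and two events (t2 > t1).
r = defaultdict(dict)
for node in ["a", "b", "c"]:
    init_node(r, node)
process_event(r, "a", "b")                   # event (a, b, t1)
process_event(r, "b", "c")                   # event (b, c, t2)
print(r["c"])                                # occurrence counts of a, b, c in the monotone TCT of c
```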
B.11 Proof of Proposition 8: Injective function on temporal neighborhood
Proof. To capture the intuition behind the proof, first consider a multiset M such that |M| < 4. We can assign a unique number ψ(m) ∈ {1, 2, 3, 4} to any distinct element m ∈ M. Also, the function h(m) = 10^{−ψ(m)} denotes the decimal expansion of ψ(m) and corresponds to reserving one decimal place for each unique element m ∈ M. Since there are fewer than 10 elements in the multiset, note that Σ_m h(m) is unique for any multiset M.
To prove the proposition, we also leverage the well-known fact that the Cartesian product of two countable sets is countable; Cantor's (bijective) pairing function z : N × N → N, with z(n_1, n_2) = (n_1 + n_2)(n_1 + n_2 + 1)/2 + n_2, provides a proof of this.
Here, we consider multisets M = {{(x_i, e_i, t_i)}} whose tuples take values in the Cartesian product of the countable sets X, E, and T, where the latter is also assumed to be bounded. In addition, we assume the lengths of all multisets are bounded by N, i.e., |M| < N for all M. Since X and E are countable, there exists an enumeration function ψ : X × E → N. Without loss of generality, we assume T = {1, 2, . . . , t_max}. We want to show that there exists a function of the form Σ_i 10^{−kψ(x_i, e_i)} α^{−βt_i} that is unique on any multiset M.
Our idea is to reserve a range of k decimal slots for each unique element (x_i, e_i, ·) in the multiset. Each such range has to accommodate at least t_max decimal slots (one for each value of t_i). Finally, we need to make sure we can add up to N values at each decimal slot.
Formally, we map each tuple (x_i, e_i, ·) to one of the k decimal slots starting from 10^{−kψ(x_i, e_i)}. In particular, for each element (x_i, e_i, t_i = j) we add one unit at the j-th decimal slot after 10^{−kψ(x_i, e_i)}. Also, to ensure the counts for (x_i, e_i, j) and (x_i, e_i, l ≠ j) do not overlap, we set β = ⌈log_10 N⌉, since no tuple can repeat more than N times. We use α = 10, as we shift decimals. Finally, to guarantee that each range encompasses t_max slots of β decimals, we set k = β(t_max + 1). Therefore, the function

Σ_i 10^{−kψ(x_i, e_i)} α^{−βt_i}

is unique on any multiset M. We note that, without loss of generality, one could choose a different basis (other than 10).
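The sketch below gives a direct numeric illustration of this construction; the enumeration ψ over (feature, edge-feature) pairs and the small toy multisets are assumptions made for illustration.

```python
import math

def multiset_hash(M, psi, t_max, N, alpha=10):
    """Sum of the form sum_i 10^{-k psi(x_i, e_i)} * alpha^{-beta t_i} over a multiset of (x, e, t) tuples.

    psi: enumeration of distinct (x, e) pairs into positive integers.
    t_max: largest timestamp value; N: bound on the multiset size.
    """
    beta = math.ceil(math.log10(N))      # decimal digits reserved per (x, e, t) slot
    k = beta * (t_max + 1)               # decimal digits reserved per (x, e) range
    return sum(10 ** (-k * psi[(x, e)]) * alpha ** (-beta * t) for x, e, t in M)

# Toy usage: two different multisets over the same domain map to different values.
psi = {("x1", "e1"): 1, ("x2", "e1"): 2}
M1 = [("x1", "e1", 1), ("x2", "e1", 2)]
M2 = [("x1", "e1", 2), ("x2", "e1", 1)]
print(multiset_hash(M1, psi, t_max=2, N=10) != multiset_hash(M2, psi, t_max=2, N=10))  # True
```

Note that the construction is only used to argue existence; with finite floating-point precision, injectivity would break for large N or t_max in an actual implementation.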

B.12 Proof of Proposition 9: Expressiveness of PINT: link prediction


Proof. We now show that PINT (with relative positional features) is strictly more powerful than
MP-TGN and CAW in distinguishing edges of temporal graphs. Leveraging Proposition 5, it suffices
to show that PINT is at least as powerful as both CAW and MP-TGN.
Statement 1: PINT is at least as powerful as MP-TGNs.
Since PINT generalizes MP-TGNs with injective aggregation/update layers, it follows that PINT is at least as powerful as MP-TGNs: we can set the model parameters associated with the positional features to zero and obtain an equivalent MP-TGN.
Statement 2: PINT is at least as powerful as CAW.
We wish to show that for any pair of events that PINT cannot distinguish, CAW also cannot distinguish them. Let us consider the events (u, v, t) and (u′, v′, t) of a temporal graph. Formally, we want to prove that if {{T_u(t), T_v(t)}} = {{T_{u′}(t), T_{v′}(t)}} (i.e., the multisets contain TCTs that are pairwise isomorphic), then {{ENC(W; S_u, S_v)}}_{W ∈ S_u ∪ S_v} = {{ENC(W′; S_{u′}, S_{v′})}}_{W′ ∈ S_{u′} ∪ S_{v′}}, where ENC denotes the walk-encoding function of CAW. Importantly, for the sake of this proof, we assume that all TCTs here are augmented with positional features, characterizing edge embeddings obtained from PINT.
Without loss of generality, we can assume that T_u(t) ≅ T_{u′}(t) and T_v(t) ≅ T_{v′}(t). By Lemma B1, we know that the corresponding monotone TCTs are also isomorphic: T̃_u(t) ≅ T̃_{u′}(t) and T̃_v(t) ≅ T̃_{v′}(t), with associated bijections f_1 : V(T̃_u(t)) → V(T̃_{u′}(t)) and f_2 : V(T̃_v(t)) → V(T̃_{v′}(t)).
We can construct a tree T_{uv} by attaching T̃_u(t) and T̃_v(t) to a (virtual) root node uv; without loss of generality, u and v are the left-hand and right-hand children of uv, respectively. We can follow the same procedure and create the tree T_{u′v′} by attaching the TCTs T̃_{u′}(t) and T̃_{v′}(t) to a root node u′v′. Since the left-hand and right-hand subtrees of T_{uv} and T_{u′v′} are isomorphic, T_{uv} and T_{u′v′} are also isomorphic. Let f : V(T_{uv}) → V(T_{u′v′}) denote the bijection associated with the augmented trees. We also assume f is constructed by preserving the bijections f_1 and f_2 defined between the original monotone TCTs: this ensures that f does not map any node in V(T̃_u(t)) to a node in V(T̃_{v′}(t)), for instance. We have that

[r^{(t)}_{]i→u} ∥ r^{(t)}_{]i→v}] = [r^{(t)}_{]f(i)→u′} ∥ r^{(t)}_{]f(i)→v′}]   ∀i ∈ V(T_{uv}) \ {uv}.

Note that we use the function ] (which maps nodes of the TCT to nodes of the dynamic graph) here because the positional feature vectors are defined for nodes of the dynamic graph.
To guarantee that two encoded walks are identical, ENC(W; S_u, S_v) = ENC(W′; S_{u′}, S_{v′}), it suffices to show that the anonymized walks are equal. Thus, we turn our problem into showing that for any walk W = (w_0, t_0, w_1, t_1, . . . ) in S_u ∪ S_v, there exists a corresponding walk W′ = (w′_0, t_0, w′_1, t_1, . . . ) in S_{u′} ∪ S_{v′} such that I_CAW(w_i; S_u, S_v) = I_CAW(w′_i; S_{u′}, S_{v′}) for all i. Recall that I_CAW(w_i; S_u, S_v) = {g(w_i; S_u), g(w_i; S_v)}, where g(w_i; S_u) is a vector whose k-th component stores how many times w_i appears at position k of a walk from S_u.
A key observation is that there is an equivalence between deanonymized root-to-leaf paths in T_{uv} and walks in S_u ∪ S_v (disregarding the virtual root node). By deanonymized, we mean paths where node identities (in the temporal graph) are revealed by applying the function ]. Using this equivalence, it suffices to show that

g(]i; S_u) = g(]f(i); S_{u′}) and g(]i; S_v) = g(]f(i); S_{v′})   ∀i ∈ V(T_{uv}) \ {uv}.

Suppose there is an i ∈ V(T_{uv}) \ {uv} such that g(]i; S_u) ≠ g(]f(i); S_{u′}). Without loss of generality, suppose this holds for the ℓ-th entry of the vectors.
We know there are exactly r_{a→u}^{(t)}[ℓ] nodes at the ℓ-th level of T̃_u(t) that are associated with a = ]i ∈ V(G(t)). We denote by Ψ the set comprising such nodes. It also follows that computing g(]i; S_u)[ℓ] is the same as summing up the number of leaves of each subtree of T̃_u(t) rooted at ψ ∈ Ψ, which we denote by l(ψ; T̃_u(t)), i.e.,

g(]i; S_u)[ℓ] = Σ_{ψ∈Ψ} l(ψ; T̃_u(t)).

Since we assume g(]i; S_u)[ℓ] ≠ g(]f(i); S_{u′})[ℓ], it holds that

g(]i; S_u)[ℓ] ≠ g(]f(i); S_{u′})[ℓ] ⇒ Σ_{ψ∈Ψ} l(ψ; T̃_u(t)) ≠ Σ_{ψ∈Ψ} l(f(ψ); T̃_{u′}(t)).   (S5)

Note that the subtree of T̃_u rooted at ψ should be isomorphic to the subtree of T̃_{u′} rooted at f(ψ), and therefore they have the same number of leaves. However, the RHS of Equation S5 implies there is a ψ ∈ Ψ for which l(ψ; T̃_u) ≠ l(f(ψ); T̃_{u′}), reaching a contradiction. The same argument can be applied to v and v′ to prove that g(]i; S_v) = g(]f(i); S_{v′}).

C Additional related works


Structural features for static GNNs. Using structural features to enhance the power of GNNs is
an active research topic. Bouritsas et al. [48] improved GNN expressivity by incorporating counts
of local structures in the message-passing procedure, e.g., the number of triangles a node appears in.
These counts depend on identifying subgraph isomorphisms and, naturally, can become intractable
depending on the chosen substructure. Li et al. [52] proposed increasing the power of GNNs using
distance encodings, i.e., augmenting original node features with distance-based ones. In particular,
they compute the distance between a node set whose representation is to be learned and each node
in the graph. To alleviate the cost of distance encoding, an alternative is to learn absolute position
encoding schemes [51, 61, 62], that try to summarize the role each node plays in the overall graph
topology. We note that another class of methods uses random features to boost the power of GNNs
[47, 58]. However, these models are reported to be hard to train and to yield noisy predictions [62].
The most obvious difference between these approaches and PINT is that our relative positional features
account for temporal information. On a deeper level, our features summarize the role each node plays
in each other’s monotone TCT instead of measuring, e.g., pair-wise distances in the original graph or
counting substructures. Also, our scheme leverages the temporal aspect to achieve computational
tractability, updating features incrementally as events unroll. Finally, while some works proposing
structural features for static GNNs present marginal gains [62], PINT exhibits significant performance
gains in real-world temporal link prediction tasks.
Other models for temporal graphs. Representation learning for dynamic graphs is a broad and
diverse field. In fact, strategies to cope with the challenge of modeling dynamic graphs can come
in many flavors, including simple aggregation schemes [18], walk-aggregating methods [53], and
combinations of sequence models with GNNs [56, 59]. For instance, Seo et al. [59] used a spectral
graph convolutional network [49] to encode graph snapshots followed by a graph-level LSTM [13].
Manessi et al. [55] followed a similar approach but employed a node-level LSTM, with parameters
shared across the nodes. Sankar et al. [57] proposed a fully attentive model based on graph attention

networks [34]. Pareja et al. [56] applied a recurrent neural net to dynamically update the parameters
of a GCN. Gao and Ribeiro [50] compared the expressive power of two classes of models for discrete
dynamic graphs: time-and-graph and time-then-graph. The former represents the standard approach
of interleaving GNNs and sequence (e.g., RNN) models. In the latter class, the models first capture
node and edge dynamics using RNNs, and are then feed into graph neural networks. The authors
showed that time-then-graph has expressive advantage over time-and-graph approaches under mild
assumptions. For an in-depth review of representation learning for dynamic graphs, we refer to the
survey works by Kazemi et al. [14] and Skarding et al. [60].
While most of the early works focused on discrete-time dynamic graphs, we have recently witnessed
a rise in interest in models for event-based temporal graphs (i.e., CTDGs). The reason is that
models for DTDGs may fail to leverage fine-grained temporal and structural information that can be
crucial in many applications. In addition, it is hard to specify meaningful time intervals for different
tasks. Thus, modern methods for temporal graphs explicitly incorporate timestamp information
into sequence/graph models, achieving significant performance gains over approaches for DTDGs
[38]. Appendix A provides a more detailed presentation of CAW, TGN-Att, and TGAT, which are
among the best performing models for link prediction on temporal graphs. Besides these methods,
JODIE [16] applies two RNNs (for the source and target nodes of an event) with a time-dependent
embedding projection to learn node representations of item-user interaction networks. Trivedi et al.
[32] employed RNNs with a temporally attentive module to update node representations. APAN
[63] consists of a memory-based TGN that uses attention mechanism to update memory states using
multi-hop temporal neighborhood information. Makarov et al. [54] proposed incorporating edge
embeddings obtained from CAW into MP-TGNs’ memory and message-passing computations.

D Datasets and implementation details
D.1 Datasets
In our empirical evaluation, we have considered six datasets for dynamic link prediction: Reddit1 ,
Wikipedia2 , UCI3 , LastFM4 , Enron5 , and Twitter. Reddit is a network of posts made by users on
subreddits, considering the 1,000 most active subreddits and the 10,000 most active users. Wikipedia
comprises edits made on the 1,000 most edited Wikipedia pages by editors with at least 5 edits. Both
Reddit and Wikipedia networks include links collected over one month, and text is used as edge
features, providing informative context. The LastFM dataset is a network of interactions between users and the songs they listened to. UCI comprises students' posts to a forum at the University of California Irvine. Enron contains a collection of email events between employees of the Enron Corporation before its bankruptcy. The Twitter dataset is a non-bipartite network where nodes are users and interactions are retweets. Since the Twitter data is not publicly available, we build our own version by following the guidelines of Rossi et al. [27]. We use the data available from the 2021 Twitter RecSys Challenge and select 10,000 nodes and their associated interactions based on node participation, i.e., the number of interactions a node participates in. We also apply multilingual BERT to obtain text
representations of retweets (edge features).
Table S1 reports statistics of the datasets such as number of temporal nodes and links, and the
dimensionality of the edge features. We note that UCI, Enron, and LastFM represent non-attributed
networks and therefore do not contain feature vectors associated with the events. Also, the node
features for all datasets are vectors of zeros [42].

Table S1: Summary statistics of the datasets.


Dataset #Nodes #Events #Edge feat. Bipartite?
Reddit 10,984 (10,000 / 984) 672,447 172 Yes
Wikipedia 9,227 (8,227 / 1,000) 157,474 172 Yes
Twitter 8,925 406,564 768 No
UCI 1,899 59,835 - No
Enron 184 125,235 - No
LastFM 1,980 (980 / 1,000) 1,293,103 - Yes

D.2 Implementation details


We train all models on link prediction tasks in a self-supervised fashion. During training, we generate negative samples: for each actual event (z, w, t) (class 1), we create a fake one (z, w′, t) (class 0), where w′ is uniformly sampled from the set of nodes and both events have the same edge feature vector.
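A minimal sketch of this negative-sampling scheme is shown below; the array shapes, function name, and uniform sampler are assumptions for illustration.

```python
import numpy as np

def make_training_pairs(src, dst, t, edge_feat, num_nodes, rng=None):
    """For each true event (z, w, t), create a corrupted event (z, w', t) with the same features."""
    rng = rng or np.random.default_rng()
    fake_dst = rng.integers(0, num_nodes, size=len(src))       # w' sampled uniformly over nodes
    pos = (src, dst, t, edge_feat, np.ones(len(src)))           # class 1: observed interactions
    neg = (src, fake_dst, t, edge_feat, np.zeros(len(src)))     # class 0: corrupted interactions
    return pos, neg
```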
To ensure a fair comparison, we mainly rely on the original repositories and guidelines. For instance, regarding the training of MP-TGNs (including PINT), we mostly follow the setup and choices in the implementation available in [27]. In particular, we apply the Adam optimizer with learning rate 10^{-4} for 50 epochs, with early stopping if there is no improvement in validation AP for 5 epochs. In addition, we use batch size 200 for all methods. We report statistics (mean and standard deviation) of the performance metric (AP) over ten runs.
MP-TGNs. For TGN-Att, we follow Rossi et al. [27] and sample either ten or twenty temporal
neighbors with memory dimensionality equal to 32 (Enron), 100 (UCI, Twitter), or 172 (Reddit,
Wikipedia, LastFM), node embedding dimension equal to 100, and two attention heads. We use a
memory unit implemented as a GRU, and update the state of each node based on only its most recent
message. For TGAT, we use twenty temporal neighbors and two layers.
CAW. We conduct model selection using grid search over: i) time decay α ∈ {0.01, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0, 10.0, 100.0} × 10^{-6}; ii) number of walks M ∈ {32, 64, 128};
1 http://snap.stanford.edu/jodie/reddit.csv
2 http://snap.stanford.edu/jodie/wikipedia.csv
3 http://konect.cc/networks/opsahl-ucforum/
4 http://snap.stanford.edu/jodie/lastfm.csv
5 https://www.cs.cmu.edu/~./enron/

and iii) walk length L ∈ {1, 2, 3, 4, 5}. The best combination of hyperparameters is shown in Table S2. The remaining training choices follow the default values from the original implementation. Importantly, we note that TGN-Att's original evaluation setup differs from CAW's. Thus, we adapted CAW's original repository to reflect these differences and ensure a valid comparison.

Table S2: Optimal hyperparameters for CAW.


Dataset Time decay α #Walks Walk length
Reddit 10^{-8} 32 3
Wikipedia 4 × 10^{-6} 64 4
Twitter 10^{-6} 64 3
UCI 10^{-5} 64 2
Enron 10^{-6} 64 5
LastFM 10^{-6} 64 2

PINT. We use α = 2 (in the exponential aggregation function) and experiment with both learned and fixed β. We apply a ReLU to avoid negative values of β, which could lead to unstable training. We perform grid search as follows: when learning β, we consider initial values β ∈ {0.1, 0.5}; for the fixed case (requires_grad=False), we evaluate β ∈ {10^{-3} × |N|, 10^{-4} × |N|, 10^{-5} × |N|}, where |N| denotes the number of temporal neighbors, and always apply memory as in the original implementation of TGN-Att. We consider the number of message-passing layers ℓ ∈ {1, 2}. Also, we apply neighborhood sampling with the number of neighbors in {10, 20}, and update the state of a node based on its most recent message. We then carry out model selection based on AP values obtained during validation. Overall, the models with fixed β led to better results. Table S3 reports the optimal hyperparameters for PINT found via automatic model selection.
In all experiments, we use relative positional features with d = 4 dimensions. For computational efficiency, we update the relative positional features only after processing a batch, factoring in all events from that batch. Note that this prevents information leakage, as the positional features only take effect after prediction. In addition, since temporal events repeat (in the same order) at each epoch, we also speed up PINT's training procedure by precomputing and saving the positional features for each batch. To save space, we store the positional features as sparse matrices.

Table S3: Optimal hyperparameters for PINT.


Dataset β/|N| #Neighbors (|N|) #Layers
Reddit 10^{-5} 10 2
Wikipedia 10^{-4} 10 2
Twitter 10^{-5} 20 2
UCI 10^{-5} 10 2
Enron 10^{-5} 20 2
LastFM 10^{-4} 10 2

Hardware. For all experiments, we use Tesla V100 GPU cards and consider a memory budget of
32GB of RAM.

E Deletion and node-level events


Rossi et al. [27] propose handling edge deletions by simply updating memory states of the edge’s
endpoints, as if we were dealing with a usual edge addition. However, it is not discussed whether
the edge in question should be excluded from the event list or if we should just add a novel event
with edge features that characterize deletion. If we choose the former, we may be unable to recover
the memory state of a node from its monotone TCT and the original node features. Removing an
edge from the event list also affects the computation of node embeddings. Therefore, we advise
practitioners to do the latter when using PINT. It is worth mentioning that the vast majority of models
for temporal interaction prediction do not consider the possibility of deletion events.
Regarding node-level events, PINT can accommodate node additions by simply creating new memory states. To deal with node feature updates, we can create an edge event with both endpoints on that node, inducing a self-loop in the dynamic graph. Also, we can combine (e.g., concatenate) the temporal node features in the message-passing operations, similarly to the general formulation of the MP-TGN framework [27]. Finally, we can deal with the removal of a node v by following our previous (edge-deletion) procedure to delete all edges with endpoints in v.

F Additional experiments
Time comparison. Figure S7 compares the time per epoch for PINT and for the prior art (CAW,
TGN-Att, and TGAT) in the Enron and LastFM datasets. Following the trend in Figure 7, Figure S7
further supports that PINT is generally slower than other MP-TGNs but, after a few training epochs,
is orders of magnitude faster than CAW. In the case of Enron, the time CAW takes to complete an
epoch is much higher than the time we need to preprocess PINT’s positional features.

Figure S7: Time comparison: PINT versus TGNs (in log-scale) on Enron and LastFM. (The omitted plots show the average time per epoch, in log(s), against training epochs for the Enron and LastFM panels; methods: PINT, PINT (w/o pos.), TGAT, CAW, TGN-Att.)

Experiments on node classification. For completeness, we also evaluate PINT on node-level tasks
(Wikipedia and Reddit). We follow closely the experimental setup in Rossi et al. [27] and compare
against the baselines therein. Table S4 shows that PINT ranks first on Reddit and second on Wikipedia.
The values for PINT reflect the outcome of 5 repetitions.

Table S4: Results for node classification (AUC).


Wikipedia Reddit
CTDNE 75.89 ± 0.5 59.43 ± 0.6
JODIE 84.84 ± 1.2 61.83 ± 2.7
TGAT 83.69 ± 0.7 65.56 ± 0.7
DyRep 84.59 ± 2.2 62.91 ± 2.4
TGN-Att 87.81 ± 0.3 67.06 ± 0.9
PINT 87.59 ± 0.6 67.31 ± 0.2

Supplementary References
[47] R. Abboud, I. I. Ceylan, M. Grohe, and T. Lukasiewicz. The surprising power of graph neural networks
with random node initialization. In International Joint Conference on Artificial Intelligence (IJCAI), 2021.

[48] G. Bouritsas, F. Frasca, S. Zafeiriou, and M. M. Bronstein. Improving graph neural network expressivity via subgraph isomorphism counting. ArXiv e-prints, 2020.

[49] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

[50] J. Gao and B. Ribeiro. On the equivalence between temporal and static graph representations for observa-
tional predictions. ArXiv, 2103.07016, 2021.

[51] D. Kreuzer, D. Beaini, W. L. Hamilton, V. Letourneau, and P. Tossou. Rethinking graph transformers with
spectral attention. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

[52] P. Li, Y. Wang, H. Wang, and J. Leskovec. Distance encoding: Design provably more powerful neural
networks for graph representation learning. In Advances in Neural Information Processing Systems
(NeurIPS), 2020.

[53] S. Mahdavi, S. Khoshraftar, and A. An. dynnode2vec: Scalable dynamic network embedding. In
International Conference on Big Data, 2018.

[54] I. Makarov, A. V. Savchenko, A. Korovko, L. Sherstyuk, N. Severin, A. Mikheev, and D. Babaev. Temporal
graph network embedding with causal anonymous walks representations. ArXiv, 2108.08754, 2021.
[55] F. Manessi, A. Rozza, and M. Manzo. Dynamic graph convolutional networks. Pattern Recognition, 97,
2020.

[56] A. Pareja, G. Domeniconi, J. Chen, T. Ma, H. Kanezashi, T. Suzumura, T. Kaler, T. B. Schardl, and C. E. Leiserson. EvolveGCN: Evolving graph convolutional networks for dynamic graphs. In AAAI Conference on Artificial Intelligence (AAAI), 2020.

[57] A. Sankar, Y. Wu, L. Gou, W. Zhang, and H. Yang. DySAT: Deep neural representation learning on
dynamic graphs via self-attention networks. In International Conference on Web Search and Data Mining
(WSDM), 2020.

[58] R. Sato, M. Yamada, and H. Kashima. Random features strengthen graph neural networks. In SIAM
International Conference on Data Mining (SDM), 2021.
[59] Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson. Structured sequence modeling with graph
convolutional recurrent networks. In International Conference on Neural Information Processing (ICONIP),
2018.
[60] J. Skarding, B. Gabrys, and K. Musial. Foundations and modeling of dynamic networks using dynamic
graph neural networks: A survey. IEEE Access, 9:79143–79168, 2021.

[61] B. Srinivasan and B. Ribeiro. On the equivalence between positional node embeddings and structural graph
representations. In International Conference on Learning Representations (ICLR), 2020.

[62] H. Wang, H. Yin, M. Zhang, and P. Li. Equivariant and stable positional encoding for more powerful graph
neural networks. In International Conference on Learning Representations (ICLR), 2022.

[63] X. Wang, D. Lyu, M. Li, Y. Xia, Q. Yang, X. Wang, X. Wang, P. Cui, Y. Yang, B. Sun, and Z. Guo. APAN:
Asynchronous propagation attention network for real-time temporal graph embedding. International
Conference on Management of Data, 2021.

