PINT: Provably Expressive Temporal Graph Networks
Abstract
Temporal graph networks (TGNs) have gained prominence as models for embed-
ding dynamic interactions, but little is known about their theoretical underpinnings.
We establish fundamental results about the representational power and limits of the
two main categories of TGNs: those that aggregate temporal walks (WA-TGNs),
and those that augment local message passing with recurrent memory modules
(MP-TGNs). Specifically, novel constructions reveal the inadequacy of MP-TGNs
and WA-TGNs, proving that neither category subsumes the other. We extend the
1-WL (Weisfeiler-Leman) test to temporal graphs, and show that the most powerful
MP-TGNs should use injective updates, as in this case they become as expressive
as the temporal WL. Also, we show that sufficiently deep MP-TGNs cannot benefit
from memory, and MP/WA-TGNs fail to compute graph properties such as girth.
These theoretical insights lead us to PINT — a novel architecture that leverages
injective temporal message passing and relative positional features. Importantly,
PINT is provably more expressive than both MP-TGNs and WA-TGNs. PINT
significantly outperforms existing TGNs on several real-world benchmarks.
1 Introduction
Graph neural networks (GNNs) [11, 30, 36, 39] have recently led to breakthroughs in many applica-
tions [7, 28, 31] by resorting to message passing between neighboring nodes in input graphs. While
message passing imposes an important inductive bias, it does not account for the dynamic nature of
interactions in time-evolving graphs arising from many real-world domains such as social networks
and bioinformatics [16, 40]. In several scenarios, these temporal graphs are only given as a sequence
of timestamped events. Recently, temporal graph nets (TGNs) [16, 27, 32, 38, 42] have emerged as a
prominent learning framework for temporal graphs and have become particularly popular due to their
outstanding predictive performance. Aiming at capturing meaningful structural and temporal patterns,
TGNs combine a variety of building blocks, such as self-attention [33, 34], time encoders [15, 41],
recurrent models [5, 13], and message passing [10].
Unraveling the learning capabilities of (temporal) graph networks is imperative to understanding
their strengths and pitfalls, and designing better, more nuanced models that are both theoretically
well-grounded and practically efficacious. For instance, the enhanced expressivity of higher-order
GNNs has roots in the inadequacy of standard message-passing GNNs to separate graphs that are
indistinguishable by the Weisfeiler-Leman isomorphism test, known as 1-WL test or color refinement
algorithm [21, 22, 29, 37, 43]. Similarly, many other notable advances on GNNs were made possible
by untangling their ability to generalize [9, 17, 35], extrapolate [45], compute graph properties [4, 6, 9],
and express Boolean classifiers [1]; by uncovering their connections to distributed algorithms [19, 29],
graph kernels [8], dynamic programming [44], diffusion processes [3], graphical models [46], and
combinatorial optimization [2]; and by analyzing their discriminative power [20, 23]. In stark contrast,
the theoretical foundations of TGNs remain largely unexplored. For instance, unresolved questions
include: How does the expressive power of existing TGNs compare? When do TGNs fail? Can we
improve the expressiveness of TGNs? What are the limits on the power of TGNs?
Figure 1: Summary of our theoretical results.
• SOTA MP-TGNs ≺ Injective MP-TGNs (Prop. 4)
• MP-TGNs ⊀ WA-TGNs and WA-TGNs ⊀ MP-TGNs (Prop. 5)
• Injective MP-TGNs ≅ temporal-WL test (Prop. 6)
• MP-TGNs/CAWs cannot recognize graph properties (Prop. 7)
• Constructing injective temporal MP (Prop. 8)
• PINT (ours) ≻ both MP-TGNs and WA-TGNs (Prop. 9)
• Limitations of PINT (Prop. 10)
We establish a series of results to address these fundamental questions. We begin by showing that
discrete-time dynamic graphs (DTDGs) can always be converted to continuous-time analogues
(CTDGs) without loss of information, so we can focus on analyzing the ability of TGNs to distinguish
nodes/links of CTDGs. We consider a general framework for message-passing TGNs (MP-TGNs)
[27] that subsumes a wide variety of methods [e.g., 16, 32, 42]. We prove that equipping MP-TGNs
with injective aggregation and update functions leads to the class of most expressive anonymous
MP-TGNs (i.e., those that do not leverage node ids). Extending the color-refinement algorithm to
temporal settings, we show that these most powerful MP-TGNs are as expressive as the temporal
WL method. Notably, existing MP-TGNs do not enforce injectivity. We also delineate the role of
memory in MP-TGNs: nodes in a network with only a few layers of message passing fail to aggregate
information from a sufficiently wide receptive field (i.e., from distant nodes), so memory serves
to offset this highly local view with additional global information. In contrast, sufficiently deep
architectures obviate the need for memory modules.
Different from MP-TGNs, walk-aggregating TGNs (WA-TGNs) such as CAW [38] obtain represen-
tations from anonymized temporal walks. We provide constructions that expose shortcomings of
each framework, establishing that WA-TGNs can distinguish links in cases where MP-TGNs fail and
vice-versa. Consequently, neither class is more expressive than the other. Additionally, we show that
MP-TGNs and CAWs cannot decide temporal graph properties such as diameter, girth, or number of
cycles. Strikingly, our analysis unravels the subtle relationship between the walk computations in
CAWs and the MP steps in MP-TGNs.
Equipped with these theoretical insights, we propose PINT (short for position-encoding injective
temporal graph net), founded on a new temporal layer that leverages the strengths of both MP-TGNs
and WA-TGNs. Like the most expressive MP-TGNs, PINT defines injective message passing and
update steps. PINT also augments memory states with novel relative positional features, and these
features can replicate all the discriminative benefits available to WA-TGNs. Interestingly, the time
complexity of computing our positional features is less severe than the sampling overhead in CAW,
thus PINT can often be trained faster than CAW. Importantly, we establish that PINT is provably
more expressive than CAW as well as MP-TGNs.
Our contributions are three-fold:
• a rigorous theoretical foundation for TGNs is laid - elucidating the role of memory, benefits
of injective message passing, limits of existing TGN models, temporal extension of the
1-WL test and its implications, impossibility results about temporal graph properties, and
the relationship between main classes of TGNs — as summarized in Figure 1;
• explicit injective temporal functions are introduced, and a novel method for temporal graphs
is proposed that is provably more expressive than state-of-the-art TGNs;
• extensive empirical investigations underscore practical benefits of this work. The proposed
method is either competitive or significantly better than existing models on several real
benchmarks for dynamic link prediction, in transductive as well as inductive settings.
2 Preliminaries
We denote a static graph G as a tuple (V, E, X , E), where V = {1, 2, . . . , n} denotes the set of
nodes and E ⊆ V × V the set of edges. Each node u ∈ V has a feature vector xu ∈ X and each
edge (u, v) ∈ E has a feature vector euv ∈ E, where X and E are countable sets of features.
Dynamic graphs can be roughly split according to their discrete- or continuous-time nature [14].
A discrete-time dynamic graph (DTDG) is a sequence of graph snapshots (G1, G2, . . . ), usually
sampled at regular intervals, each snapshot being a static graph Gt = (Vt , Et , Xt , Et ).
A continuous-time dynamic graph (CTDG) evolves with node- and edge-level events, such as addition
and deletion. We represent a CTDG as a sequence of time-stamped multi-graphs (G(t0 ), G(t1 ), . . . )
such that tk < tk+1 , and G(tk+1 ) results from updating G(tk ) with all events at time tk+1 . We assume
no event occurs between tk and tk+1 . We denote an interaction (i.e., edge addition event) between
nodes u and v at time t as a tuple (u, v, t) associated with a feature vector euv (t). Unless otherwise
stated, interactions correspond to undirected edges, i.e., (u, v, t) is a shorthand for ({u, v}, t).
Noting that CTDGs allow for finer (irregular) temporal resolution, we now formalize the intuition that
DTDGs can be reduced to and thus analyzed as CTDGs, but the converse may need extra assumptions.
Proposition 1 (Relationship between DTDG and CTDG). For any DTDG we can build a CTDG with
the same sets of node and edge features that contains the same information, i.e., we can reconstruct
the original DTDG from the converted CTDG. The converse holds if the CTDG timestamps form a
subset of a uniformly spaced countable set.
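To make the forward direction of Proposition 1 concrete, the sketch below converts a DTDG (a list of snapshots) into a CTDG event stream by emitting one interaction event per edge per snapshot, using the snapshot index as a uniformly spaced timestamp, and then inverts the conversion. This is a minimal illustration under assumed data structures (dict-based snapshots), not the paper's formal construction.

```python
from typing import Dict, List, Tuple

Snapshot = Dict[Tuple[int, int], List[float]]   # edge (u, v) -> edge features
Event = Tuple[int, int, float, List[float]]     # (u, v, timestamp, edge features)

def dtdg_to_ctdg(snapshots: List[Snapshot]) -> List[Event]:
    """Emit one interaction event per edge per snapshot; the snapshot index
    serves as a uniformly spaced timestamp, so no information is lost."""
    events = []
    for k, snapshot in enumerate(snapshots):
        for (u, v), feats in snapshot.items():
            events.append((u, v, float(k), feats))
    return events

def ctdg_to_dtdg(events: List[Event], num_snapshots: int) -> List[Snapshot]:
    """Recover the DTDG: events stamped k belong to snapshot G_k. This inversion
    works because the timestamps form a uniformly spaced (countable) set."""
    snapshots: List[Snapshot] = [dict() for _ in range(num_snapshots)]
    for u, v, t, feats in events:
        snapshots[int(t)][(u, v)] = feats
    return snapshots
```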
Following the usual practice [16, 38, 42], we focus on CTDGs with edge addition events (see
Appendix E for a discussion on deletion). Thus, we can represent temporal graphs as sets G(t) =
{(uk , vk , tk ) | tk < t}. We also assume each distinct node v in G(t) has an initial feature vector xv .
Message-passing temporal graph nets (MP-TGNs). Rossi et al. [27] introduced MP-TGN as a
general representation learning framework for temporal graphs. The goal is to encode the graph
dynamics into node embeddings, capturing information that is relevant for the task at hand. To
achieve this, MP-TGNs rely on three main ingredients: memory, aggregation, and update. Memory
comprises a set of vectors that summarizes the history of each node, and is updated using a recurrent
model whenever an event occurs. The aggregation and update components resemble those in message-
passing GNNs, where the embedding of each node is refined using messages from its neighbors.
We define the temporal neighborhood of node v at time t as N(v, t) = {(u, e_uv(t′), t′) | ∃(u, v, t′) ∈ G(t)}, i.e., the set of neighbor/feature/timestamp triplets from all interactions of node v prior to t.
MP-TGNs compute the temporal representation h_v^(ℓ)(t) of v at layer ℓ by recursively applying
h̃_v^(ℓ)(t) = AGG^(ℓ)({{(h_u^(ℓ−1)(t), t − t′, e) | (u, e, t′) ∈ N(v, t)}})   (1)
h_v^(ℓ)(t) = UPDATE^(ℓ)(h_v^(ℓ−1)(t), h̃_v^(ℓ)(t))   (2)
where {{·}} denotes multisets, h_v^(0)(t) = s_v(t) is the state of v at time t, and AGG^(ℓ) and UPDATE^(ℓ)
are arbitrary parameterized functions. The memory block updates the states as events occur. Let
J (v, t) be the set of events involving v at time t. The state of v is updated due to J (v, t) as
m_v(t) = MEMAGG({{[s_v(t), s_u(t), t − t_v, e_vu(t)] | (v, u, t) ∈ J(v, t)}})   (3)
s_v(t⁺) = MEMUPDATE(s_v(t), m_v(t)),   (4)
where s_v(0) = x_v (initial node features), s_v(t⁺) denotes the updated state of v due to events at time t, and t_v denotes the time of the last update to v. MEMAGG combines information from simultaneous events involving node v and MEMUPDATE usually implements a gated recurrent unit (GRU) [5].
Notably, some MP-TGNs do not use memory, or equivalently, they employ identity memory, i.e.,
sv (t) = xv for all t. We refer to Appendix A for further details.
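As a rough illustration of Eqs. (3)-(4), the sketch below processes a batch of simultaneous events with a mean MEMAGG and a GRU-based MEMUPDATE, following the message layout [s_v, s_u, t − t_v, e_vu]. It is a simplified sketch (undirected events only, gradients detached); the class and method names, tensor shapes, and bookkeeping are illustrative, not the TGN reference implementation.

```python
import torch
import torch.nn as nn

class Memory(nn.Module):
    """Sketch of the MP-TGN memory block: mean MemAgg (Eq. 3) + GRU MemUpdate (Eq. 4)."""
    def __init__(self, num_nodes, state_dim, edge_dim):
        super().__init__()
        msg_dim = 2 * state_dim + 1 + edge_dim          # [s_v, s_u, t - t_v, e_vu]
        self.s = torch.zeros(num_nodes, state_dim)      # memory states s_v
        self.last_update = torch.zeros(num_nodes)       # t_v, time of last update
        self.update_cell = nn.GRUCell(msg_dim, state_dim)

    def process(self, events):
        """events: list of (v, u, t, e) interactions sharing the same timestamp t,
        with e a 1-D edge-feature tensor."""
        raw = {}                                        # node -> list of raw messages
        for v, u, t, e in events:
            for a, b in ((v, u), (u, v)):               # undirected: both endpoints get a message
                dt = torch.tensor([t]) - self.last_update[a:a + 1]
                raw.setdefault(a, []).append(torch.cat([self.s[a], self.s[b], dt, e]))
        t = events[0][2]
        for v, msgs in raw.items():
            m_v = torch.stack(msgs).mean(dim=0)                                     # Eq. (3): MemAgg (mean)
            self.s[v] = self.update_cell(m_v[None], self.s[v][None])[0].detach()    # Eq. (4): MemUpdate (GRU)
            self.last_update[v] = t                     # detach: this sketch skips gradient bookkeeping
```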
Causal Anonymous Walks (CAWs). Wang et al. [38] proposed CAW as an approach for link
prediction on temporal graphs. To predict if an event (u, v, t) occurs, CAW first obtains sets Su and Sv
of temporal walks starting at nodes u and v at time t. An (L − 1)-length temporal walk is represented
as W = ((w1 , t1 ), (w2 , t2 ), . . . , (wL , tL )), with t1 > t2 > · · · > tL and (wi−1 , wi , ti ) ∈ G(t)
∀i > 1. Note that when predicting (u, v, t), we have walks starting at time t1 = t. Then, CAW
anonymizes walks replacing each node w with a set ICAW (w; Su , Sv ) = {g(w; Su ), g(w; Sv )} of two
feature vectors. The ℓ-th entry of g(w; Su) stores how many times w appears at the ℓ-th position in a walk of Su, i.e., g(w, Su)[ℓ] = |{W ∈ Su : (w, tℓ) = Wℓ}| where Wℓ is the ℓ-th pair of W.
To encode a walk W with respect to the sets Su and Sv, CAW applies ENC(W; Su, Sv) = RNN([f1(ICAW(wi; Su, Sv)) ‖ f2(ti−1 − ti)]_{i=1}^{L}), where f1 is a permutation-invariant function, f2 is a time encoder, and t0 = t1 = t. Finally, CAW combines the embeddings of each walk in Su ∪ Sv
using mean-pooling or self-attention to obtain the representation for the event (u, v, t).
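For concreteness, here is a small sketch of the anonymization step: computing the positional count vectors g(w; S) over a walk set and the resulting I_CAW pair for a node. Walks are plain lists of (node, time) pairs; the helper names are our own, not CAW's official API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Walk = List[Tuple[int, float]]   # [(w_1, t_1), ..., (w_L, t_L)] with t_1 > ... > t_L

def position_counts(walks: List[Walk], length: int) -> Dict[int, List[int]]:
    """g(w; S): entry l counts how often node w appears at position l of a walk in S."""
    g: Dict[int, List[int]] = defaultdict(lambda: [0] * length)
    for walk in walks:
        for pos, (node, _time) in enumerate(walk):
            g[node][pos] += 1
    return g

def i_caw(node: int, S_u: List[Walk], S_v: List[Walk], length: int):
    """I_CAW(node; S_u, S_v) = {g(node; S_u), g(node; S_v)} (returned here as a pair)."""
    return (position_counts(S_u, length)[node], position_counts(S_v, length)[node])
```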
In practice, TGNs often rely on sampling schemes for computational reasons. However, we are
concerned with the expressiveness of TGNs, so our analysis assumes complete structural information,
i.e., Su is the set of all temporal walks from u and MP-TGNs combine information from all neighbors.
Figure 2: Limitations of TGNs. [Left] Temporal graph with nodes u, v that TGN-Att/TGAT cannot
distinguish. Colors are node features, edge features are identical, and t3 > t2 > t1 . [Center] TCTs
of u and v are non-isomorphic. However, the attention layers of TGAT/TGN-Att compute weighted
averages over the same multiset of values, returning identical messages for u and v. [Right] MP-TGNs fail to distinguish the events (u, v, t3) and (v, z, t3) as the TCTs of z and u are isomorphic. Meanwhile, CAW cannot separate (u, z, t3) and (u′, z, t3): the 3-depth TCTs of u and u′ are not isomorphic, but the temporal walks from u and u′ have length 1, keeping CAW from capturing structural differences.
Proposition 3 (Memory and depth). Let Q_L denote the class of L-layer MP-TGNs without memory, Q_L^[M] the corresponding class with memory, and ∆ the temporal diameter of G(t). Then:
1. If L < ∆: Q_L^[M] is strictly more powerful than Q_L in distinguishing nodes of G(t);
2. For any L: Q_{L+∆} is at least as powerful as Q_L^[M] in distinguishing nodes of G(t).
The MP-TGN framework is rather general and subsumes many modern methods for temporal graphs
[e.g., 16, 32, 42]. We now analyze the theoretical limitations of two concrete instances of MP-TGNs:
TGAT [42] and TGN-Att [27]. Remarkably, these models are among the best-performing MP-TGNs.
Nonetheless, we can show that there are nodes of very simple temporal graphs that TGAT and
TGN-Att cannot distinguish (see Figure 2). We formalize this in Proposition 4 by establishing that
there are cases in which TGNs with injective layers can succeed, but TGAT and TGN-Att cannot.
Proposition 4 (Limitations of TGAT/TGN-Att). There exist temporal graphs containing nodes u, v
that have non-isomorphic TCTs, yet no TGAT nor TGN-Att with mean memory aggregator (i.e., using
M EAN as M EM AGG) can distinguish u and v.
This limitation stems from the fact that the attention mechanism employed by TGAT and TGN-Att is
proportion invariant [26]. The memory module of TGN-Att cannot counteract this limitation due to
its mean-based aggregation scheme. We provide more details in Appendix B.6.
Refinement: At step ℓ, the colors of all nodes are refined using a hash (injective) function: for all v ∈ V(G(t)), we apply c_{ℓ+1}(v) = HASH(c_ℓ(v), {{(c_ℓ(u), e_uv(t′), t′) : (u, v, t′) ∈ G(t)}});
Termination: The test is carried out for two temporal graphs at time t in parallel and stops when
the multisets of corresponding colors diverge, returning non-isomorphic. If the algorithm
runs until the number of different colors stops increasing, the test is deemed inconclusive.
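The refinement loop above can be sketched as follows for a single temporal graph: colors are refreshed by hashing each node's color together with the (canonically sorted) multiset of (neighbor color, edge feature, timestamp) triples, until the number of distinct colors stops increasing; to compare two graphs, one runs the loop on both in parallel and compares the color multisets after each step. Python's hash stands in for an injective HASH, and the event encoding is an assumption.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, int, float, tuple]   # (u, v, timestamp, edge features)

def temporal_wl_refine(nodes: List[int], events: List[Event],
                       colors: Dict[int, int]) -> Dict[int, int]:
    """Temporal WL color refinement: iterate until the color count stabilizes."""
    while True:
        new_colors = {}
        for v in nodes:
            triples = []
            for (u, w, t, e) in events:
                if v == u:
                    triples.append((colors[w], e, t))
                elif v == w:
                    triples.append((colors[u], e, t))
            # sort for a canonical multiset encoding, then hash (HASH stand-in)
            new_colors[v] = hash((colors[v], tuple(sorted(triples))))
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors
        colors = new_colors
```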
We note that the temporal WL test trivially reduces to the standard 1-WL test if all timestamps and
edge features are identical. The resemblance between MP-TGNs and GNNs and their corresponding
WL tests suggests that the power of MP-TGNs is bounded by the temporal WL test. Proposition 6
conveys that MP-TGNs with injective layers are as powerful as the temporal WL test.
Proposition 6. Assume finite spaces of initial node features X , edge features E, and timestamps T .
Let the number of events of any temporal graph be bounded by a fixed constant. Then, there is an
MP-TGN with suitable parameters using injective aggregation/update functions that outputs different
representations for two temporal graphs if and only if the temporal-WL test outputs ‘non-isomorphic’.
A natural consequence of the limited power of MP-TGNs is that even the most powerful MP-TGNs
fail to distinguish relevant graph properties, and the same applies to CAWs (see Proposition 7).
Proposition 7. There exist non-isomorphic temporal graphs differing in properties such as diameter,
girth, and total number of cycles, which cannot be differentiated by MP-TGNs and CAWs.
Figure 3 provides a construction for Proposition 7. The temporal graphs G(t) and G′(t) differ in diameter (∞ vs. 3), girth (3 vs. 6), and number of cycles (2 vs. 1). By inspecting the TCTs, one can observe that, for any node in G(t), there is a corresponding one in G′(t) whose TCTs are isomorphic, e.g., T_{u1}(t) ≅ T_{u′1}(t) for t > t3. As a result, the multisets of node embeddings for these temporal graphs are identical.
Figure 3: Examples of temporal graphs for which MP-TGNs cannot distinguish the diameter, girth, and number of cycles.
We provide more details and a construction - where CAW fails to decide properties - in the Appendix.
Figure 5: PINT. Following the MP-TGN protocol, PINT updates memory states as events unroll.
Meanwhile, we use Eqs. (7-11) to update positional features. To extract the embedding for node v,
we build its TCT, annotate nodes with memory + positional features, and run (injective) MP.
Relative positional features. To boost the power of PINT, we propose augmenting memory states
with relative positional features. These features count how many temporal walks of a given length
exist between two nodes, or equivalently, how many times nodes appear at different levels of TCTs.
Formally, let P be the d × d matrix obtained by padding a (d − 1)-dimensional identity matrix with zeros on its top row and its rightmost column. Also, let r_{j→u}^(t) ∈ ℕ^d denote the positional feature vector of node j relative to u's TCT at time t. For each event (u, v, t), with u and v not participating in other events at t, we recursively update the positional feature vectors as
V_i^(0) = {i}  ∀i   (7)
r_{i→j}^(0) = [1, 0, . . . , 0]^⊤ if i = j,  [0, 0, . . . , 0]^⊤ if i ≠ j   (8)
V_u^(t⁺) = V_v^(t⁺) = V_v^(t) ∪ V_u^(t)   (9)
r_{i→v}^(t⁺) = P r_{i→u}^(t) + r_{i→v}^(t)  ∀i ∈ V_u^(t)   (10)
r_{j→u}^(t⁺) = P r_{j→v}^(t) + r_{j→u}^(t)  ∀j ∈ V_v^(t)   (11)
where we use t+ to denote values “right after” t. The set Vi keeps track of the nodes for which
we need to update positional features when i participates in an interaction. For simplicity, we have
assumed that there are no other events involving u or v at time t. Appendix B.10 provides equations
for the general case where nodes can participate in multiple events at the same timestamp.
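The sketch below implements the incremental update of Eqs. (7)-(11) for a single event (u, v, t), assuming no other event involves u or v at time t. The dictionary-based bookkeeping (missing entries meaning zero vectors) is our own choice, not PINT's actual data structures.

```python
import numpy as np

def init_positional_state(num_nodes: int, d: int):
    """Eqs. (7)-(8): V_i = {i}; r_{i->i} = e_1 and all other r's are (implicitly) zero."""
    V = {i: {i} for i in range(num_nodes)}
    r = {(i, i): np.eye(d)[0].copy() for i in range(num_nodes)}   # r[(i, j)] stores r_{i->j}
    return V, r

def update_on_event(u: int, v: int, V: dict, r: dict, d: int) -> None:
    """Eqs. (9)-(11) for one interaction (u, v, t)."""
    P = np.eye(d, k=-1)                        # shift matrix: identity padded with a zero top row

    def get(i, j):
        return r.get((i, j), np.zeros(d))

    new_r = {}
    for i in V[u]:                             # Eq. (10)
        new_r[(i, v)] = P @ get(i, u) + get(i, v)
    for j in V[v]:                             # Eq. (11)
        new_r[(j, u)] = P @ get(j, v) + get(j, u)
    r.update(new_r)
    V[u] = V[v] = V[u] | V[v]                  # Eq. (9)
```

By Lemma 2 below, r[(i, u)][k] then counts the occurrences of i at level k of u's monotone TCT.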
The value r_{i→v}^(t)[k] (the k-th component of r_{i→v}^(t)) corresponds
to how many different ways we can get from v to i in k steps
through temporal walks. Additionally, we provide in Lemma 2
an interpretation of relative positional features in terms of the so-
called monotone TCTs (Definition 2). In this regard, Figure 4
shows how the TCT of v evolves due to an event (u, v, t) and
provides an intuition about the updates in Eqs. 10-11. The
procedure amounts to appending the monotone TCT of u to the
first level of the monotone TCT of v.
Definition 2. The monotone TCT of a node u at time t, denoted by T̃_u(t), is the maximal subtree of the TCT of u s.t. for any path p = (u, t1, u1, t2, u2, . . . ) from the root u to leaf nodes of T̃_u(t), time monotonically decreases, i.e., we have that t1 > t2 > . . .
Figure 4: The effect of (u, v, t) on the monotone TCT of v. Also, note how the positional features of a node i, relative to v, can be incrementally updated.
Lemma 2. For any pair of nodes i, u of a temporal graph G(t), the k-th component of the positional feature vector r_{i→u}^(t) stores the number of times i appears at the k-th layer of the monotone TCT of u.
Edge and node embeddings. To obtain the embedding h_γ for an event γ = (u, v, t), an L-layer PINT computes embeddings for nodes u and v using L steps of temporal message passing. However, when computing the embedding h_u^L(t) of u, we concatenate node states s_j(t) with the positional features r_{j→u}^(t) and r_{j→v}^(t) for every node j in the L-hop temporal neighborhood of u. We apply the same procedure to obtain h_v^L(t), and then combine h_v^L(t) and h_u^L(t) using a readout function.
Similarly, to compute representations for node-level prediction, for each node j in the L-hop neighborhood of u, we concatenate node states s_j(t) with features r_{j→u}^(t). Then, we use our injective MP to
combine the information stored in u and its neighboring nodes. Figure 5 illustrates the process.
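As a rough sketch of this assembly (the injective layers themselves, from Prop. 8, are not reproduced here, and the containers are assumptions), each node j in the relevant temporal neighborhood is annotated with its memory state concatenated with its positional features relative to u and v before message passing over the TCTs:

```python
import torch

def annotate_nodes_for_event(neighborhood, s, r, u, v, d):
    """Build [s_j(t) || r_{j->u} || r_{j->v}] for every node j in the L-hop
    temporal neighborhood of the event endpoints; these annotated vectors are
    the inputs to PINT's injective temporal message passing over the TCTs."""
    zeros = torch.zeros(d)
    rows = []
    for j in neighborhood:
        rows.append(torch.cat([s[j], r.get((j, u), zeros), r.get((j, v), zeros)]))
    return torch.stack(rows)

# h_u^L(t) and h_v^L(t) are then computed by L layers of injective temporal MP
# over these inputs, and combined by a readout to produce the edge embedding
# h_gamma; the readout choice (e.g., concatenation + MLP) is an assumption.
```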
Notably, Proposition 9 states that PINT is strictly more powerful than existing TGNs. In fact, the
relative positional features mimic the discriminative power of WA-TGNs, while eliminating their
temporal monotonicity constraints. Additionally, PINT can implement injective temporal message
passing (either over states or states + positional features), akin to maximally-expressive MP-TGNs.
Proposition 9 (Expressiveness of PINT: link prediction). PINT (with relative positional features) is
strictly more powerful than MP-TGNs and CAWs in distinguishing events in temporal graphs.
When does PINT fail? Naturally, whenever the TCTs (annotated with positional features) for the endpoints of two edges (u, v, t) and (u′, v′, t) are pairwise isomorphic, PINT returns the same edge embedding and is not able to differentiate the events. Figure 6 shows an example in which this happens (we assume that all node/edge features are identical). Due to graph symmetries, u and z occur the same number of times in each level of v's monotone TCT. Also, the sets of temporal walks starting at u and z are identical if we swap the labels of these nodes. Importantly, CAWs and MP-TGNs also fail here, as stated in Proposition 9.
Figure 6: PINT cannot distinguish the events (u, v, t3) and (v, z, t3).
Proposition 10 (Limitations of PINT). There are synchronous events of temporal graphs that PINT
cannot distinguish (as seen in Figure 6).
Implementation and computational cost. The online updates for PINT's positional features have complexity O(d|V_u^(t)| + d|V_v^(t)|). Similarly to CAW's sampling procedure, our online update
is a sequential process better done in CPUs. However, while CAW may require significant CPU-
GPU memory exchange — proportional to both the number of walks and their depth —, we only
communicate the positional features. We can also speed-up the training of PINT by pre-computing
the positional features for each batch, avoiding redundant computations at each epoch. Apart from
positional features, the computational cost of PINT is similar to that of TGN-Att. Following standard
MP-TGN procedure, we control the branching factor of TCTs using neighborhood sampling.
Note that the positional features monotonically increase with time, which is undesirable for practical
generalization purposes. Since our theoretical results hold for any fixed t, this issue can be solved
by dividing the positional features by a time-dependent normalization factor. Nonetheless, we have
found that employing L1 -normalization leads to good empirical results for all evaluated datasets.
5 Experiments
We now assess the performance of PINT on several popular and large-scale benchmarks for TGNs.
We run experiments using PyTorch [25] and code is available at www.github.com/AaltoPML/PINT.
Tasks and datasets. We evaluate PINT on dynamic link prediction, closely following the evaluation
setup employed by Rossi et al. [27] and Xu et al. [42]. We use six popular benchmark datasets:
Reddit, Wikipedia, Twitter, UCI, Enron, and LastFM [16, 27, 38, 42]. Notably, UCI, Enron, and
LastFM are non-attributed networks, i.e., they do not contain feature vectors associated with the
events. Node features are absent in all datasets, thus following previous works we set them to vectors
of zeros [27, 42]. Since Twitter is not publicly available, we follow the guidelines by Rossi et al. [27]
to create our version. We provide more details regarding datasets in the supplementary material.
Baselines. We compare PINT against five prominent TGNs: Jodie [16], DyRep [32], TGAT [42],
TGN-Att [27], and CAW [38]. For completeness, we also report results using two static GNNs: GAT
[34] and GraphSage [12]. Since we adopt the same setup as TGN-Att, we use their table numbers
for all baselines but CAW on Wikipedia and Reddit. The remaining results were obtained using the
implementations and guidelines available from the official repositories. As an ablation study, we also
include a version of PINT without relative positional features in the comparison. We provide detailed
information about hyperparameters and the training of each model in the supplementary material.
Experimental setup. We follow Xu et al. [42] and use a 70%-15%-15% (train-val-test) temporal
split for all datasets. We adopt average precision (AP) as the performance metric. We also analyze
separately predictions involving only nodes seen during training (transductive), and those involving
Table 1: Average Precision (AP) results for link prediction. We denote the best-performing model (highest
mean AP) in blue. In 5 out of 6 datasets, PINT achieves the highest AP in the transductive setting. For the
inductive case, PINT outperforms previous MP-TGNs and competes with CAW. We also show the performance
of PINT with and without relative positional features. For all datasets, adopting positional features leads to
significant performance gains.
Model Reddit Wikipedia Twitter UCI Enron LastFM
Transductive
GAT 97.33 ± 0.2 94.73 ± 0.2 - - - -
GraphSAGE 97.65 ± 0.2 93.56 ± 0.3 - - - -
Jodie 97.11 ± 0.3 94.62 ± 0.5 98.23 ± 0.1 86.73 ± 1.0 77.31 ± 4.2 69.32 ± 1.0
DyRep 97.98 ± 0.1 94.59 ± 0.2 98.48 ± 0.1 54.60 ± 3.1 77.68 ± 1.6 69.24 ± 1.4
TGAT 98.12 ± 0.2 95.34 ± 0.1 98.70 ± 0.1 77.51 ± 0.7 68.02 ± 0.1 54.77 ± 0.4
TGN-Att 98.70 ± 0.1 98.46 ± 0.1 98.00 ± 0.1 80.40 ± 1.4 79.91 ± 1.3 80.69 ± 0.2
CAW 98.39 ± 0.1 98.63 ± 0.1 98.72 ± 0.1 92.16 ± 0.1 92.09 ± 0.7 81.29 ± 0.1
PINT (w/o pos. feat.) 98.62 ± .04 98.43 ± .04 98.53 ± 0.1 92.68 ± 0.5 83.06 ± 2.1 81.35 ± 1.6
PINT 99.03 ± .01 98.78 ± 0.1 99.35 ± .01 96.01 ± 0.1 88.71 ± 1.3 88.06 ± 0.7
Inductive
GAT 95.37 ± 0.3 91.27 ± 0.4 - - - -
GraphSAGE 96.27 ± 0.2 91.09 ± 0.3 - - - -
Jodie 94.36 ± 1.1 93.11 ± 0.4 96.06 ± 0.1 75.26 ± 1.7 76.48 ± 3.5 80.32 ± 1.4
DyRep 95.68 ± 0.2 92.05 ± 0.3 96.33 ± 0.2 50.96 ± 1.9 66.97 ± 3.8 82.03 ± 0.6
TGAT 96.62 ± 0.3 93.99 ± 0.3 96.33 ± 0.1 70.54 ± 0.5 63.70 ± 0.2 56.76 ± 0.9
TGN-Att 97.55 ± 0.1 97.81 ± 0.1 95.76 ± 0.1 74.70 ± 0.9 78.96 ± 0.5 84.66 ± 0.1
CAW 97.81 ± 0.1 98.52 ± 0.1 98.54 ± 0.4 92.56 ± 0.1 91.74 ± 1.7 85.67 ± 0.5
PINT (w/o pos. feat.) 97.22 ± 0.2 97.81 ± 0.1 96.10 ± 0.1 90.25 ± 0.3 75.99 ± 2.3 88.44 ± 1.1
PINT 98.25 ± .04 98.38 ± .04 98.20 ± .03 93.97 ± 0.1 81.05 ± 2.4 91.76 ± 0.7
novel nodes (inductive). We report mean and standard deviation of the AP over ten runs. For further
details, see Appendix D. We provide additional results in the supplementary material.
Results. Table 1 shows that PINT is the best-performing method on five out of six datasets for the
transductive setting. Notably, the performance gap between PINT and TGN-Att amounts to over
15% AP on UCI. The gap is also relatively high compared to CAW on LastFM, Enron, and UCI;
with CAW being the best model only on Enron. We also observe that many models achieve relatively
high AP on the attributed networks (Reddit, Wikipedia, and Twitter). This aligns well with findings
from [38], where TGN-Att was shown to have competitive performance against CAW on Wikipedia
and Reddit. The performance of GAT and GraphSAGE (static GNNs) on Reddit and Wikipedia reinforces
the hypothesis that the edge features add significantly to the discriminative power. On the other
hand, PINT and CAW, which leverage relative identities, show superior performance relative to
other methods when only time and degree information is available, i.e., on unattributed networks
(UCI, Enron, and LastFM). Table 1 also shows the effect of using relative positional features. While
including these features boosts PINT’s performance systematically, our ablation study shows that
PINT w/o positional features still outperforms other MP-TGNs on unattributed networks. In the
inductive case, we observe a similar behavior: PINT is consistently the best MP-TGN, and is better
than CAW on 3/6 datasets. Overall, PINT (w/ positional features) also yields the lowest standard
deviations. This suggests that positional encodings might be a useful inductive bias for TGNs.
Table 2: Average precision results for TGN-Att + relative positional features.
Transductive Inductive
UCI Enron LastFM UCI Enron LastFM
TGN-Att 80.40 ± 1.4 79.91 ± 1.3 80.69 ± 0.2 74.70 ± 0.9 78.96 ± 0.5 84.66 ± 0.1
TGN-Att + RPF 95.64 ± 0.1 85.04 ± 2.5 89.41 ± 0.9 92.82 ± 0.4 76.27 ± 3.4 91.63 ± 0.3
PINT 96.01 ± 0.1 88.71 ± 1.3 88.06 ± 0.7 93.97 ± 0.1 81.05 ± 2.4 91.76 ± 0.7
Incorporating relative positional features into MP-TGNs. We can use our relative positional
features (RPF) to boost MP-TGNs. Table 2 shows the performance of TGN-Att with relative positional
features on UCI, Enron, and LastFM. Notably, TGN-Att receives a significant boost from our RPF.
However, PINT still beats TGN-Att+RPF on 5 out of 6 cases. The values for TGN-Att+RPF reflect
outcomes from 5 repetitions. We have used the same model selection procedure as TGN-Att in Table
1, and incorporated d = 4-dimensional positional features.
Dimensionality of relative positional features. We assess the performance of PINT as a function of the dimension d of the relative positional features. Figure 8 shows the performance of PINT for d ∈ {4, 10, 15, 20} on UCI and Enron. We report the mean and standard deviation of the test AP.
Figure 8: Test AP of PINT on UCI and Enron (transductive and inductive) as a function of d.
6 Conclusion
We laid a rigorous theoretical foundation for TGNs, including the role of memory modules, relation-
ship between classes of TGNs, and failure cases for MP-TGNs. Together, our theoretical results shed
light on the representational capabilities of TGNs, and connections with their static counterparts. We
also introduced a novel TGN method, provably more expressive than the existing TGNs.
Key practical takeaways from this work: (a) temporal models should be designed to have injective
update rules and to exploit both neighborhood and walk aggregation, and (b) deep architectures can
likely be made more compute-friendly as the role of memory gets diminished with depth, provably.
References
[1] P. Barceló, E. V. Kostylev, M. Monet, J. Pérez, J. L. Reutter, and J.-P. Silva. The logical expressiveness of
graph neural networks. In International Conference on Learning Representations (ICLR), 2020.
[2] Q. Cappart, D. Chételat, E. B. Khalil, A. Lodi, C. Morris, and P. Velickovic. Combinatorial optimization
and reasoning with graph neural networks. In International Joint Conference on Artificial Intelligence
(IJCAI), 2021.
[3] B. Chamberlain, J. Rowbottom, M. Gorinova, M. M. Bronstein, S. Webb, and E. Rossi. GRAND: graph
neural diffusion. In International Conference on Machine Learning (ICML), 2021.
[4] M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li. Simple and deep graph convolutional networks. In
International Conference on Machine Learning (ICML), 2020.
[5] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio.
Learning phrase representations using RNN encoder–decoder for statistical machine translation. In
Empirical Methods in Natural Language Processing (EMNLP), 2014.
[6] N. Dehmamy, A.-L. Barabási, and R. Yu. Understanding the representation power of graph neural networks
in learning graph topology. In Advances in neural information processing systems (NeurIPS), 2019.
[7] A. Derrow-Pinion, J. She, D. Wong, O. Lange, T. Hester, L. Perez, M. Nunkesser, S. Lee, X. Guo,
B. Wiltshire, P. W. Battaglia, V. Gupta, A. Li, Z. Xu, A. Sanchez-Gonzalez, Y. Li, and P. Velickovic. Eta
prediction with graph neural networks in google maps. In Conference on Information and Knowledge
Management (CIKM), 2021.
[8] S. S. Du, K. Hou, R. Salakhutdinov, B. Póczos, R. Wang, and K. Xu. Graph neural tangent kernel:
Fusing graph neural networks with graph kernels. In Advances in Neural Information Processing Systems
(NeurIPS), 2019.
[9] V. Garg, S. Jegelka, and T. Jaakkola. Generalization and representational limits of graph neural networks.
In International Conference on Machine Learning (ICML), 2020.
[10] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum
chemistry. In International Conference on Machine Learning (ICML), 2017.
[11] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In IEEE International
Joint Conference on Neural Networks (IJCNN), 2005.
[12] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in
Neural Information Processing Systems (NeurIPS), 2017.
[13] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[14] S. Kazemi, R. Goel, K. Jain, I. Kobyzev, A. Sethi, P. Forsyth, and P. Poupart. Representation learning for
dynamic graphs: A survey. Journal of Machine Learning Research, 21(70):1–73, 2020.
[15] S. M. Kazemi, R. Goel, S. Eghbali, J. Ramanan, J. Sahota, S. Thakur, S. Wu, C. Smyth, P. Poupart, and
M. Brubaker. Time2vec: Learning a vector representation of time. ArXiv: 1907.05321, 2019.
[16] S. Kumar, X. Zhang, and J. Leskovec. Predicting dynamic embedding trajectory in temporal interaction
networks. In International Conference on Knowledge Discovery & Data Mining (KDD), 2019.
[17] R. Liao, R. Urtasun, and R. Zemel. A PAC-bayesian approach to generalization bounds for graph neural
networks. In International Conference on Learning Representations (ICLR), 2021.
[18] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. Journal of the
American Society for Information Science and Technology, 58(7):1019–1031, 2007.
[19] A. Loukas. What graph neural networks cannot learn: depth vs width. In International Conference on
Learning Representations (ICLR), 2020.
[20] A. Loukas. How hard is to distinguish graphs with graph neural networks? In Advances in Neural
Information Processing Systems (NeurIPS), 2020.
[21] H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman. Provably powerful graph networks. In Advances
in Neural Information Processing Systems (NeurIPS), 2019.
[22] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe. Weisfeiler and
leman go neural: Higher-order graph neural networks. In AAAI Conference on Artificial Intelligence
(AAAI), 2019.
[23] H. Nguyen and T. Maehara. Graph homomorphism convolution. In International Conference on Machine
Learning (ICML), 2020.
[24] F. Orsini, P. Frasconi, and L. D. Raedt. Graph invariant kernels. In International Joint Conference on
Artificial Intelligence (IJCAI), 2015.
[25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and
A. Lerer. Automatic differentiation in pytorch. In Advances in Neural Information Processing Systems
(NeurIPS - Workshop), 2017.
[26] J. Pérez, J. Marinković, and P. Barceló. On the turing completeness of modern neural network architectures.
In International Conference on Learning Representations (ICLR), 2019.
[27] E. Rossi, B. Chamberlain, F. Frasca, D. Eynard, E. Monti, and M. Bronstein. Temporal graph networks for
deep learning on dynamic graphs. In ICML 2020 Workshop on Graph Representation Learning, 2020.
[28] A. Sanchez-Gonzalez, J. Godwin, T. Pfaff, R. Ying, J. Leskovec, and P. Battaglia. Learning to simulate
complex physics with graph networks. In International Conference on Machine Learning (ICML), 2020.
[29] R. Sato, M. Yamada, and H. Kashima. Approximation ratios of graph neural networks for combinatorial
problems. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[30] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model.
IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
[31] J. M. Stokes, K. Yang, K. Swanson, W. Jin, A. Cubillos-Ruiz, N. M. Donghia, C. R. MacNair, S. French,
L. A. Carfrae, Z. Bloom-Ackermann, V. M. Tran, A. Chiappino-Pepe, A. H. Badran, I. W. Andrews, E. J.
Chory, G. M. Church, E. D. Brown, T. S. Jaakkola, R. Barzilay, and J. J. Collins. A deep learning approach
to antibiotic discovery. Cell, 180(4):688 – 702, 2020.
[32] R. Trivedi, M. Farajtabar, P. Biswal, and H. Zha. DyRep: Learning representations over dynamic graphs.
In International Conference on Learning Representations (ICLR), 2019.
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin.
Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[34] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph Attention Networks. In
International Conference on Learning Representations (ICLR), 2018.
[35] S. Verma and Z.-L. Zhang. Stability and generalization of graph convolutional neural networks. In
International Conference on Knowledge Discovery & Data Mining (KDD), 2019.
[36] Y. Verma, S. Kaski, M. Heinonen, and V. Garg. Modular flows: Differential molecular generation. In
Advances in Neural Information Processing Systems (NeurIPS), 2022.
[37] C. Vignac, A. Loukas, and P. Frossard. Building powerful and equivariant graph neural networks with
structural message-passing. In Neural Information Processing Systems (NeurIPS), 2020.
[38] Y. Wang, Y. Chang, Y. Liu, J. Leskovec, and P. Li. Inductive representation learning in temporal networks
via causal anonymous walks. In International Conference on Learning Representations (ICLR), 2021.
[39] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on graph neural
networks. IEEE Transactions on Neural Networks and Learning Systems, pages 1–21, 2020.
[40] D. Xu, W. Cheng, D. Luo, Y. Gu, X. Liu, J. Ni, B. Zong, H. Chen, and X. Zhang. Adaptive neural network
for node classification in dynamic networks. In IEEE International Conference on Data Mining (ICDM),
2019.
[41] D. Xu, C. Ruan, E. Korpeoglu, S. Kumar, and K. Achan. Self-attention with functional time representation
learning. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
[42] D. Xu, C. Ruan, E. Korpeoglu, S. Kumar, and K. Achan. Inductive representation learning on temporal
graphs. In International Conference on Learning Representations (ICLR), 2020.
[43] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In International
Conference on Learning Representations (ICLR), 2019.
[44] K. Xu, J. Li, M. Zhang, S. S. Du, K.-I. Kawarabayashi, and S. Jegelka. What can neural networks reason
about? In International Conference on Learning Representations (ICLR), 2020.
[45] K. Xu, M. Zhang, J. Li, S. S. Du, K.-I. Kawarabayashi, and S. Jegelka. How neural networks extrapolate:
From feedforward to graph neural networks. In International Conference on Learning Representations
(ICLR), 2021.
[46] Z. Zhang, F. Wu, and W. S. Lee. Factor graph neural networks. In Advances in Neural Information
Processing Systems (NeurIPS), 2020.
Provably expressive temporal graph networks
(Supplementary material)
where ωi ’s and bi ’s are learned scalar parameters. The time embeddings are concatenated to the
edge features before being fed into a typical self-attention layer, where the query q is a function of a
reference node v, and both values V and keys K depend on v’s temporal neighbors. Formally, TGAT
first computes a matrix C_v^(ℓ)(t) whose u-th row is c_vu^(ℓ)(t) = [h_u^(ℓ−1)(t) ‖ φ(t − t_uv) ‖ e_uv] for all (u, e_uv, t_uv) ∈ N(v, t). Then, the output h̃_v^(ℓ)(t) of the AGG function is given by
q = [h_v^(ℓ−1)(t) ‖ φ(0)] W_q^(ℓ)    K = C_v^(ℓ)(t) W_K^(ℓ)    V = C_v^(ℓ)(t) W_V^(ℓ)   (S2)
h̃_v^(ℓ)(t) = softmax(q K^⊤) V   (S3)
where W_q^(ℓ), W_K^(ℓ), and W_V^(ℓ) are model parameters. Regarding the UPDATE function, TGAT applies a multilayer perceptron, i.e., h_v^(ℓ)(t) = MLP^(ℓ)(h_v^(ℓ−1)(t) ‖ h̃_v^(ℓ)(t)).
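A compact single-head PyTorch sketch of Eqs. (S2)-(S3) follows; treating the φ-encodings and edge features as precomputed rows of C_v, and the specific dimensions, are simplifying assumptions.

```python
import torch
import torch.nn as nn

class TGATAggregation(nn.Module):
    """Single-head version of Eqs. (S2)-(S3): attention of the reference node
    over its temporal neighbors' (embedding || time-encoding || edge-feature) rows."""
    def __init__(self, emb_dim, time_dim, edge_dim, att_dim):
        super().__init__()
        in_dim = emb_dim + time_dim + edge_dim
        self.W_q = nn.Linear(emb_dim + time_dim, att_dim, bias=False)
        self.W_k = nn.Linear(in_dim, att_dim, bias=False)
        self.W_v = nn.Linear(in_dim, att_dim, bias=False)

    def forward(self, h_v, phi_zero, C_v):
        # q = [h_v^{(l-1)}(t) || phi(0)] W_q;  K = C_v W_K;  V = C_v W_V   (S2)
        q = self.W_q(torch.cat([h_v, phi_zero], dim=-1))     # (att_dim,)
        K = self.W_k(C_v)                                    # (num_neighbors, att_dim)
        V = self.W_v(C_v)
        alpha = torch.softmax(K @ q, dim=0)                  # softmax(q K^T)        (S3)
        return alpha @ V                                     # aggregated message h~_v
```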
Following the original formulation, we assume an identity memory-message function, i.e., simply the concatenation of the inputs: MEMMSG_e(s_i(t), s_u(t), t − t_i, e_iu(t)) = [s_i(t), s_u(t), t − t_i, e_iu(t)].
Now, suppose two events (i, u, t) and (i, v, t) happen. MP-TGNs aggregate the memory-messages from these events using a function MEMAGG to obtain a single memory message for i:
m_i(t) = MEMAGG(m_{i,u}(t), m_{i,v}(t))
Rossi et al. [27] propose non-learnable memory aggregators, such as the mean aggregator (average of all memory messages for a given node), which we denote as MEANAGG and adopt throughout our analysis. As an example, under events (i, u, t) and (i, v, t), the aggregated message for i is m_i(t) = 0.5([s_i(t), s_u(t), t − t_i, e_iu(t)] + [s_i(t), s_v(t), t − t_i, e_iv(t)]).
The memory update of our query node i is given by
s_i(t⁺) = MEMUPDATE(s_i(t), m_i(t)).
Finally, we note that TGAT does not have a memory module. TGN-Att consists of the model resulting
from augmenting TGAT with a GRU-based memory.
A.3 Causal anonymous walks (CAW)
We now provide details regarding how CAW obtains edge embeddings for a query event γ = (u, v, t).
A temporal walk is represented as W = ((w1 , t1 ), (w2 , t2 ), . . . , (wL , tL )), with t1 > t2 > · · · > tL
and (wi−1 , wi , ti ) ∈ G(t) for all i > 1. We denote by Su (t) the set of maximal temporal walks
starting at u of size at most L obtained from the temporal graph at time t. Following the original
paper, we drop the time dependence henceforth.
A given walk W gets anonymized through replacing each element wi belonging to W by a 2-element
set of vectors ICAW (wi ; Su , Sv ) accounting for how many times wi appears at each position of walks
in Su and Sv. These vectors are denoted by g(wi, Su) and g(wi, Sv). The walk is encoded using an RNN:
ENC(W; Su, Sv) = RNN([f1(ICAW(wi; Su, Sv)) ‖ f2(ti−1 − ti)]_{i=1}^{L}),
where t1 = t0 = t and f1 is
f1 (ICAW (wi ; Su , Sv )) = MLP(g(wi , Su )) + MLP(g(wi , Sv )).
We note that the MLPs share parameters. The function f2 is given by
f2(t) = [cos(ω1 t), sin(ω1 t), . . . , cos(ωd t), sin(ωd t)]
where ωi ’s are learned parameters.
To compute the embedding h_γ for (u, v, t), CAW considers two readout functions: mean and self-attention. The final link prediction is then obtained from a 2-layer MLP over h_γ.
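Putting A.3 together, the sketch below encodes one walk: f1 sums a shared MLP over the two anonymization vectors, f2 is the sinusoidal time encoder above, and a GRU consumes the per-hop concatenations; a mean over walk embeddings followed by a 2-layer MLP (not shown) would then score the event. Module choices and dimensions are illustrative rather than CAW's exact implementation.

```python
import torch
import torch.nn as nn

class WalkEncoder(nn.Module):
    """Sketch of CAW's ENC: RNN over [f1(I_CAW(w_i)) || f2(t_{i-1} - t_i)]."""
    def __init__(self, count_dim, hidden_dim, time_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(count_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim))  # shared by both count vectors
        self.omega = nn.Parameter(torch.randn(time_dim))             # learned frequencies for f2
        self.rnn = nn.GRU(hidden_dim + 2 * time_dim, hidden_dim, batch_first=True)

    def f1(self, g_u, g_v):
        return self.mlp(g_u) + self.mlp(g_v)          # set function over {g(w;S_u), g(w;S_v)}

    def f2(self, dt):
        angles = self.omega * dt
        return torch.cat([torch.cos(angles), torch.sin(angles)])

    def forward(self, walk):
        # walk: list of (g_u, g_v, dt) per hop, with dt = t_{i-1} - t_i (dt = 0 for i = 1)
        steps = [torch.cat([self.f1(g_u, g_v), self.f2(dt)]) for g_u, g_v, dt in walk]
        _, h_last = self.rnn(torch.stack(steps).unsqueeze(0))
        return h_last.squeeze(0).squeeze(0)           # embedding of this walk
```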
B Proofs
B.1 Further definitions and Lemmata
Definition B1 (Monotone walk.). An N -length monotone walk in a temporal graph G(t) is a sequence
(w1 , t1 , w2 , t2 , . . . , wN +1 ) such that ti > ti+1 and (wi , wi+1 , ti ) ∈ G(t) for all i.
Definition B2 (Temporal diameter.). We say the temporal diameter of a graph G(t) is ∆ if the longest
monotone walk in G(t) has length (i.e., number of edges) exactly ∆.
Lemma B1. If the TCTs of two nodes are isomorphic, then their monotone TCTs (Definition 2) are also isomorphic, i.e., T_u(t) ≅ T_v(t) ⇒ T̃_u(t) ≅ T̃_v(t) for two nodes u and v of a dynamic graph.
Lemma B2. Let G(t) and G′(t) be any two non-isomorphic temporal graphs. If an MP-TGN obtains different multisets of node embeddings for G(t) and G′(t), then the temporal WL test decides G(t) and G′(t) are not isomorphic.
Proof. Recall Proposition 3 shows that if an MP-TGN with memory is able to distinguish two nodes, then there is a memoryless MP-TGN with ∆ (temporal diameter) additional layers that does the same. Thus, it suffices to show that if the multisets of colors from temporal WL for G(t) and G′(t) after ℓ iterations are identical, then the multisets of embeddings from the memoryless MP-TGN are also identical, i.e., if {{c_ℓ(u)}}_{u∈V(G(t))} = {{c_ℓ(u′)}}_{u′∈V(G′(t))}, then {{h_u^(ℓ)(t)}}_{u∈V(G(t))} = {{h_{u′}^(ℓ)(t)}}_{u′∈V(G′(t))}. To do so, we repurpose the proof of Lemma 2 in [43].
More broadly, we show that for any two nodes of a temporal graph G(t), if the temporal WL returns c_ℓ(u) = c_ℓ(v), then the corresponding embeddings from an MP-TGN without memory are identical, h_u^(ℓ)(t) = h_v^(ℓ)(t). We proceed with a proof by induction.
[Base case] For ℓ = 0, the proposition trivially holds as the temporal WL has the initial node features as colors, and memoryless MP-TGNs have these features as embeddings.
[Induction step] Assume the proposition holds for iteration ℓ. Thus, for any two nodes u, v, if c_{ℓ+1}(u) = c_{ℓ+1}(v), we have
(c_ℓ(u), {{(c_ℓ(i), e_iu(t′), t′) : (u, i, t′) ∈ G(t)}}) = (c_ℓ(v), {{(c_ℓ(j), e_jv(t′), t′) : (v, j, t′) ∈ G(t)}})
and, by the induction hypothesis, we know
(h_u^(ℓ)(t), {{(h_i^(ℓ)(t), e_iu(t′), t′) : (u, i, t′) ∈ G(t)}}) = (h_v^(ℓ)(t), {{(h_j^(ℓ)(t), e_jv(t′), t′) : (v, j, t′) ∈ G(t)}})
since there exists an event (u, i, t′) ∈ G(t) with feature e_ui(t′) = e iff there is an element (i, e, t′) ∈ N(u, t).
As a result, the inputs of the MP-TGN's aggregation and update functions are identical, which leads to identical outputs h_u^(ℓ+1)(t) = h_v^(ℓ+1)(t). Therefore, if the temporal WL test obtains identical multisets of colors for two temporal graphs after ℓ steps, the multisets of embeddings at layer ℓ for these graphs are also identical.
To recover the original CTDG, we can adapt the reconstruction procedure we used in the previous
part of the proof. We define
Ĩ = {(i, k) ∈ ℕ × ℕ : τ_k = t_i for t_i ∈ T and τ_k ∈ T }.   (S4)
B.3 Proof of Lemma 1
Proof. Here we show that if two nodes u and v have isomorphic (L-depth) TCTs, then MP-TGNs (with L layers) compute identical embeddings for u and v. Formally, let T_{u,ℓ}(t) denote the TCT of u with ℓ layers. We want to show that T_{u,ℓ}(t) ≅ T_{v,ℓ}(t) ⇒ h_u^(ℓ)(t) = h_v^(ℓ)(t). We employ a proof by induction on ℓ. Since there is no ambiguity, we drop the dependence on time in the following.
We know that the family of L-layer MP-TGNs with memory comprises the family of L-layer MP-TGNs without memory (we can assume identity memory). Therefore, Q_L^[M] is at least as powerful as Q_L. To show that Q_L^[M] is strictly stronger (more powerful) than Q_L when L < ∆, it suffices to create an example for which memory can help distinguish a pair of nodes. We provide a trivial example in Figure S1 for L = 1. Note that the 1-depth TCTs of u and v are isomorphic when no memory is used. However, when equipped with memory, the interaction (b, c, t1) affects the states of v and c, making the 1-depth TCTs of u and v (at time t > t2) no longer isomorphic.
Figure S1: Temporal graph where all initial node features and edge features are identical, and t2 > t1 .
Statement 2: For any L: Q_{L+∆} is at least as powerful as Q_L^[M].
It suffices to show that if Q_{L+∆} cannot distinguish a pair of nodes u and v, then Q_L^[M] cannot distinguish them either. Let T^M_{u,L}(t) and T_{u,L}(t) denote the L-depth TCTs of u with and without memory, respectively. Using Lemma 1, this is equivalent to showing that T_{u,L+∆}(t) ≅ T_{v,L+∆}(t) ⇒ T^M_{u,L}(t) ≅ T^M_{v,L}(t), since no MP-TGN can separate nodes associated with isomorphic TCTs. In the following, when we omit the number of layers from TCTs, we assume TCTs of arbitrary depth.
Step 1: Characterizing the dependence of memory on initial states and events in the dynamic graph.
We now show that the memory for a node u, after processing all events with timestamp ≤ t_n, depends on the initial states of a set of nodes V_u^n, and a set of events annotated with their respective timestamps and features B_u^n. If at time t_n no event involves a node z, we set B_z^n = B_z^{n−1} and V_z^n = V_z^{n−1}. We also initialize B_u^0 = ∅ and V_u^0 = {u} for all nodes u. We proceed with a proof by induction on the number of observed timestamps n.
[Base case] Let I_1(u) = {v : (u, v, t_1) ∈ G(t_1⁺)} be the set of nodes interacting with u at time t_1, where G(t_1⁺) is the temporal graph right after t_1. Similarly, let J_1(u) = {(u, ·, t_1) ∈ G(t_1⁺)} be the set of events involving u at time t_1. Recall that up until t_1, all memory states equal initial node features (i.e., s_u(t_1) = s_u(0)). Then, the updated memory (see Equation 3 and Equation 4) for u depends on V_u^1 = V_u^0 ∪_{v∈I_1(u)} V_v^0 and B_u^1 = B_u^0 ∪ J_1(u).
[Induction step] Assume that for timestamp t_{n−1} the proposition holds. We now show that it holds for t_n. Since the proposition holds for n−1 timestamps, we know that the memory of any w that interacts with u at t_n, i.e., w ∈ I_n(u), depends on V_w^{n−1} and B_w^{n−1}, and the memory of u so far depends on V_u^{n−1} and B_u^{n−1}. Then, the updated memory for u depends on V_u^n = V_u^{n−1} ∪_{w∈I_n(u)} ({w, u} ∪ V_w^{n−1}) and B_u^n = B_u^{n−1} ∪ J_n(u) ∪_{w∈I_n(u)} B_w^{n−1}.
Step 2: (z, w, t_zw) ∈ B_u^n if and only if there is a path (u_k, t_k = t_zw, u_{k+1}) in T̃_u(t_n⁺), the monotone TCT of u (see Definition 2) after processing events with timestamp ≤ t_n, with either ♯u_k = z, ♯u_{k+1} = w or ♯u_k = w, ♯u_{k+1} = z.
[Forward direction] An event (z, w, t_zw) with t_zw ≤ t_n will be in B_u^n only if z = u or w = u, or if there is a subset of events {(u, ♯u_1, t_1), (♯u_1, ♯u_2, t_2), . . . , (♯u_k, ♯u_{k+1}, t_zw)} with ♯u_k = z and ♯u_{k+1} = w such that t_n ≥ t_1 > · · · > t_zw. In either case, this will lead to a root-to-leaf path in T̃_u(t_n⁺) passing through (u_k, t_zw, u_{k+1}). This subset of events can be easily obtained by backtracking the edges that caused unions/updates in the procedure from Step 1.
[Backward direction] Assume there is a subpath p = (u_k, t_k = t_zw, u_{k+1}) ∈ T̃_u(t_n⁺) with ♯u_k = z and ♯u_{k+1} = w such that (z, w, t_zw) ∉ B_u^n. Since we can obtain p from T̃_u(t_n⁺), we know that the sequence of events r = ((u, ♯u_1, t_1), . . . , (♯u_{k−2}, ♯u_{k−1} = z, t_{k−1}), (z, w, t_k = t_zw)) happened and that t_i > t_{i+1} ∀i. However, since (z, w, t_zw) ∉ B_u^n, there must be no monotone walk starting from u going through the edge (z, w, t_zw) to arrive at w, which is exactly what r characterizes. Thus, we reach a contradiction.
Note that the nodes in V_u^n are simply the nodes that appear as an endpoint of some event in B_u^n, and therefore are also nodes in T̃_u(t_n⁺), and vice-versa.
Step 3: For any node u, there is a bijection that maps (V_u^n, B_u^n) to T̃_u(t_n⁺).
First, we note that (V_u^n, B_u^n) depends on a subset of all events, which we represent as G′ ⊆ G(t_n⁺). Since B_u^n contains all events in G′ and (V_u^n, B_u^n) can be uniquely constructed from G′, there is a bijection g that maps from G′ to (V_u^n, B_u^n).
Similarly, T̃_u(t_n⁺) also depends on a subset of events, which we denote by G″ ⊆ G(t_n⁺). We note that the unique events in T̃_u(t_n⁺) correspond to G″, and we can uniquely build the tree T̃_u(t_n⁺) from G″. This implies that there is a bijection h that maps from G″ to T̃_u(t_n⁺).
Previously, we have shown that all events in B_u^n are also in T̃_u(t_n⁺) and vice-versa. This implies that both sets depend on the same events, and thus on the same subset of all events, i.e., G′ = G″ = G_S. Since there is a bijection g between G_S and (V_u^n, B_u^n), and a bijection h between G_S and T̃_u(t_n⁺), there exists a bijection f between (V_u^n, B_u^n) and T̃_u(t_n⁺).
Step 4: If T_{u,L+∆}(t⁺) ≅ T_{v,L+∆}(t⁺), then T^M_{u,L}(t⁺) ≅ T^M_{v,L}(t⁺).
comprises the information used to compute the memory state of ♯w. Note that this applies to any w in T^M_{u,L}; thus, T_{u,L+∆} contains all we need to compute the states of any node of the dynamic graph that appears in T^M_{u,L}. The same argument applies to T_{v,L+∆} and T^M_{v,L}. Finally, since T^M_{u,L} can be uniquely computed from T_{u,L+∆}, and T^M_{v,L} from T_{v,L+∆}, if T_{u,L+∆} ≅ T_{v,L+∆}, then T^M_{u,L} ≅ T^M_{v,L}.
Figure S2: (Leftmost) Example of a temporal graph for which TGN-Att and TGAT cannot distinguish
nodes u and v even though their TCTs are non-isomorphic. Colors denote node features and all edge
features are identical, and t2 > t1 (and t > t2 ). (Right) The 2-depth TCTs of nodes u, v, z and w.
The TCTs of u and v are non-isomorphic whereas the TCTs of z and w are isomorphic.
Figure S2(leftmost) provides a temporal graph where all edge events have the same edge features.
Colors denote node features. As we can observe, the TCTs of nodes u and v are not isomorphic. In
the following, we consider node distinguishability at time t > t2 .
Statement 1: TGAT cannot distinguish the nodes u and v in our example.
Step 1: For any TGAT with ℓ layers, we have that h_w^(ℓ)(t) = h_z^(ℓ)(t).
We note that the ℓ-layer TCTs of nodes w and z are isomorphic, for any ℓ. To see this, one can consider the symmetry around node u that allows us to define a node permutation function (bijection) f given by f(z) = w, f(w) = z, f(a) = v, f(u) = u, f(b) = c, f(c) = b, f(v) = a. Figure S2(right) provides an illustration of the 2-depth TCTs of z and w at time t > t2.
By Lemma 1, if the ℓ-layer TCTs of two nodes z and w are isomorphic, then no ℓ-layer MP-TGN can distinguish them. Thus, we conclude that h_w^(ℓ)(t) = h_z^(ℓ)(t) for any TGAT with an arbitrary number of layers ℓ.
Step 2: There is no TGAT such that h_v^(ℓ)(t) ≠ h_u^(ℓ)(t).
To compute h_v^(ℓ)(t), TGAT aggregates the messages of v's temporal neighbors at layer ℓ − 1, and then combines h_v^(ℓ−1)(t) with the aggregated message h̃_v^(ℓ)(t) to obtain h_v^(ℓ)(t).
Note that N(u, t) = {(z, e, t1), (w, e, t1)} and N(v, t) = {(w, e, t1)}, where e denotes an edge feature vector. Also, we have previously shown that h_w^(ℓ−1)(t) = h_z^(ℓ−1)(t).
Using the TGAT aggregation layer (Equation S2), the query vectors of u and v are q_u = [h_u^(ℓ−1)(t) ‖ φ(0)] W_q^(ℓ) and q_v = [h_v^(ℓ−1)(t) ‖ φ(0)] W_q^(ℓ), respectively.
Since all events have the common edge features e, the matrices C_u^(ℓ) and C_v^(ℓ) share the same vector in their rows. The single-row matrix C_v^(ℓ) is given by C_v^(ℓ) = [h_w^(ℓ−1)(t) ‖ φ(t − t1) ‖ e], while the two-row matrix C_u^(ℓ) = [[h_w^(ℓ−1)(t) ‖ φ(t − t1) ‖ e]; [h_z^(ℓ−1)(t) ‖ φ(t − t1) ‖ e]], with h_w^(ℓ−1)(t) = h_z^(ℓ−1)(t). We can express C_u^(ℓ) = [1, 1]^⊤ r and C_v^(ℓ) = r, where r denotes the row vector r = [h_z^(ℓ−1)(t) ‖ φ(t − t1) ‖ e].
Using the key and value matrices of node v, i.e., K_v^(ℓ) = C_v^(ℓ) W_K^(ℓ) and V_v^(ℓ) = C_v^(ℓ) W_V^(ℓ), we have that h̃_v^(ℓ)(t) = softmax(q_v (K_v^(ℓ))^⊤) V_v^(ℓ) = r W_V^(ℓ). Analogously, since the two rows of C_u^(ℓ) are identical, the attention weights for u are uniform and h̃_u^(ℓ)(t) = r W_V^(ℓ) as well.
We have shown that the aggregated messages of nodes u and v are the same at any layer ℓ. We note that the initial embeddings are also identical, h_v^(0)(t) = h_u^(0)(t), as u and v have the same color. Recall that the update step is h_v^(ℓ)(t) = MLP(h_v^(ℓ−1)(t), h̃_v^(ℓ)(t)). Therefore, if the initial embeddings are identical, and the aggregated messages at each layer are also identical, we have that h_u^(ℓ)(t) = h_v^(ℓ)(t) for any ℓ.
Statement 2: TGN-Att cannot distinguish the nodes u and v in our example.
We now show that adding a memory module to TGAT produces node states such that s_u(t) = s_v(t) = s_a(t), s_z(t) = s_w(t), and s_b(t) = s_c(t). If that is the case, then these node states could be treated as node features in an equivalent TGAT model of our example in Figure S2, proving that there is no TGN-Att such that h_v^(ℓ)(t) ≠ h_u^(ℓ)(t). In the following, we consider TGN-Att with average memory aggregators (see Appendix A).
We begin by showing that s_a(t) = s_u(t) = s_v(t) after memory updates. We note that the memory message node a receives is [e ‖ t1 ‖ s_z(t1)]. The memory message node u receives is MEANAGG([e ‖ t1 ‖ s_w(t1)], [e ‖ t1 ‖ s_z(t1)]), but since s_w(t1) = s_z(t1), both messages are the same, and the average aggregator outputs [e ‖ t1 ‖ s_z(t1)]. Finally, the message that node v receives is [e ‖ t1 ‖ s_w(t1)] = [e ‖ t1 ‖ s_z(t1)]. Since all three nodes receive the same memory message and have the same initial features, their updated memory states are identical.
Now we show that s_z(t) = s_w(t), for t1 < t ≤ t2. Note that the message that node z receives is MEANAGG([e ‖ t1 ‖ s_a(t1)], [e ‖ t1 ‖ s_u(t1)]) = [e ‖ t1 ‖ s_u(t1)], with s_u(t1) = s_a(t1). The message that node w receives is MEANAGG([e ‖ t1 ‖ s_u(t1)], [e ‖ t1 ‖ s_v(t1)]) = [e ‖ t1 ‖ s_u(t1)]. Again, since the initial features and the messages received by each node are equal, s_z(t) = s_w(t) for t1 < t ≤ t2.
We can then use this to show that s_z(t) = s_w(t) for t > t2. Note that at time t2, the messages that nodes z and w receive are [e ‖ t2 − t1 ‖ s_b(t2)] and [e ‖ t2 − t1 ‖ s_c(t2)], respectively. Also, note that
s_b(t2) = s_c(t2) = s_b(0) = s_c(0) as the states of b and c are only updated right after t2. Because the received messages and the previous states (up until t2) of z and w are identical, we have that s_z(t) = s_w(t) for t > t2.
Finally, we show that s_b(t) = s_c(t). Using that s_z(t2) = s_w(t2) in conjunction with the fact that node b receives message [e ‖ t2 − t1 ‖ s_z(t2)], and node c receives [e ‖ t2 − t1 ‖ s_w(t2)], we obtain s_b(t) = s_c(t) since the initial memory states and the messages that the nodes received are the same.
Figure S3: (Left) Example of a temporal graph for which CAW can distinguish the events $(u, v, t_3)$ and $(z, v, t_3)$ but MP-TGNs cannot. We assume that all edge and node features are identical, and $t_{k+1} > t_k$ for all $k$. (Right) Example for which MP-TGNs can distinguish $(u, z, t_4)$ and $(u', z, t_4)$ but CAW cannot.
Proof. Using the example in Figure S3 (Left), we adapt a construction by Wang et al. [38] to show that CAW can separate events that MP-TGNs adopting node-embedding concatenation cannot. We first note that the TCTs of $u$ and $z$ are isomorphic. Thus, since $v$ is a common endpoint of $(u, v, t_3)$ and $(z, v, t_3)$, no MP-TGN can distinguish these two events. Nonetheless, CAW obtains the following anonymized walks for the event $(u, v, t_3)$:
\begin{align*}
\underbrace{\{[1,0,0],[0,1,0]\}}_{I_{\mathrm{CAW}}(u;\,S_u,S_v)} &\xrightarrow{t_1} \underbrace{\{[0,1,0],[2,0,0]\}}_{I_{\mathrm{CAW}}(v;\,S_u,S_v)}\\
\underbrace{\{[0,1,0],[2,0,0]\}}_{I_{\mathrm{CAW}}(v;\,S_u,S_v)} &\xrightarrow{t_1} \underbrace{\{[1,0,0],[0,1,0]\}}_{I_{\mathrm{CAW}}(u;\,S_u,S_v)}\\
\underbrace{\{[0,1,0],[2,0,0]\}}_{I_{\mathrm{CAW}}(v;\,S_u,S_v)} &\xrightarrow{t_2} \underbrace{\{[0,0,0],[0,1,0]\}}_{I_{\mathrm{CAW}}(w;\,S_u,S_v)} \xrightarrow{t_1} \underbrace{\{[0,0,0],[0,0,1]\}}_{I_{\mathrm{CAW}}(z;\,S_u,S_v)}
\end{align*}
and the walks associated with $(z, v, t_3)$ are (here we omit the underbraces for readability):
\begin{align*}
\{[1,0,0],[0,0,1]\} &\xrightarrow{t_1} \{[0,1,0],[0,1,0]\}\\
\{[0,0,0],[2,0,0]\} &\xrightarrow{t_1} \{[0,0,0],[0,1,0]\}\\
\{[0,0,0],[2,0,0]\} &\xrightarrow{t_2} \{[0,1,0],[0,1,0]\} \xrightarrow{t_1} \{[1,0,0],[0,0,1]\}
\end{align*}
In this example, assume that the MLPs used to encode each walk correspond to identity mappings. Then, summing the elements of each set is injective on the sets appearing in these anonymized walks (the entries are small non-negative counts, so no two distinct sets above share the same sum). We note that, in this example, we can simply choose an RNN that sums the vectors in each sequence (walk), and then apply a mean readout layer (or pooling aggregator) to obtain distinct representations for $(u, v, t_3)$ and $(z, v, t_3)$.
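The sketch below (for illustration only) instantiates this choice on the walks listed above: identity MLPs, an element-wise sum over each step's set, an "RNN" that sums the step encodings, and a mean readout over the walks of each event; the two events receive different codes.

```python
import numpy as np

# Anonymized walks from Figure S3 (Left): each step is the set
# I_CAW(w; S_u, S_v), written as a pair of count vectors.
walks_uv = [  # event (u, v, t3)
    [([1, 0, 0], [0, 1, 0]), ([0, 1, 0], [2, 0, 0])],
    [([0, 1, 0], [2, 0, 0]), ([1, 0, 0], [0, 1, 0])],
    [([0, 1, 0], [2, 0, 0]), ([0, 0, 0], [0, 1, 0]), ([0, 0, 0], [0, 0, 1])],
]
walks_zv = [  # event (z, v, t3)
    [([1, 0, 0], [0, 0, 1]), ([0, 1, 0], [0, 1, 0])],
    [([0, 0, 0], [2, 0, 0]), ([0, 0, 0], [0, 1, 0])],
    [([0, 0, 0], [2, 0, 0]), ([0, 1, 0], [0, 1, 0]), ([1, 0, 0], [0, 0, 1])],
]

def encode_event(walks):
    # identity "MLP" on each count vector, set-sum per step,
    # an "RNN" that sums the step encodings, and a mean readout over walks
    walk_codes = [sum(np.add(a, b) for a, b in walk) for walk in walks]
    return np.mean(walk_codes, axis=0)

print(encode_event(walks_uv))   # approx. [2.67, 2.00, 0.33]
print(encode_event(walks_zv))   # approx. [2.00, 1.67, 0.67]
assert not np.allclose(encode_event(walks_uv), encode_event(walks_zv))
```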
We now use the example in Figure S3 (Right) to show that MP-TGNs can separate events that CAW cannot. To see why MP-TGNs can separate the events $(u, z, t_4)$ and $(u', z, t_4)$, it suffices to observe that the depth-4 TCTs of $u$ and $u'$ are non-isomorphic. Thus, an MP-TGN with injective layers could distinguish such events. Now, let us take a look at the anonymized walks for $(u, z, t_4)$:
\begin{align*}
\underbrace{\{[1,0],[0,0]\}}_{I_{\mathrm{CAW}}(u;\,S_u,S_z)} &\xrightarrow{t_1} \underbrace{\{[0,1],[0,0]\}}_{I_{\mathrm{CAW}}(v;\,S_u,S_z)}\\
\underbrace{\{[0,0],[1,0]\}}_{I_{\mathrm{CAW}}(z;\,S_u,S_z)} &\xrightarrow{t_1} \underbrace{\{[0,0],[0,1]\}}_{I_{\mathrm{CAW}}(w;\,S_u,S_z)}
\end{align*}
and for $(u', z, t_4)$:
\begin{align*}
\underbrace{\{[1,0],[0,0]\}}_{I_{\mathrm{CAW}}(u';\,S_{u'},S_z)} &\xrightarrow{t_1} \underbrace{\{[0,1],[0,0]\}}_{I_{\mathrm{CAW}}(v';\,S_{u'},S_z)}\\
\underbrace{\{[0,0],[1,0]\}}_{I_{\mathrm{CAW}}(z;\,S_{u'},S_z)} &\xrightarrow{t_1} \underbrace{\{[0,0],[0,1]\}}_{I_{\mathrm{CAW}}(w;\,S_{u'},S_z)}
\end{align*}
Since the sets of walks are identical, they must have the same embedding. Therefore, there is no
CAW model that can separate these two events.
Proof. To prove this, we can repurpose the proof of Theorem 3 in [43]. In particular, we assume MP-TGNs that meet the injectivity requirements of Proposition 2, i.e., MP-TGNs that implement injective aggregate and update functions on multisets of hidden representations from temporal neighbors. Following in their footsteps, we prove that there is an injection $\varphi$ mapping the colors assigned by the temporal WL test to the embeddings of the nodes in a temporal graph. We do so via induction on the number of layers $\ell$. To achieve our purpose, we can assume identity memory without loss of generality.
The base case ($\ell = 0$) is straightforward since the temporal WL test initializes colors with node features. We now focus on the inductive step. Suppose the proposition holds for $\ell - 1$, i.e., $h_u^{(\ell-1)}(t)$ is an injective function of $c^{\ell-1}(u)$ for every node $u$. Note that our update function is
$$h_v^{(\ell)}(t) = \mathrm{UPDATE}^{(\ell)}\Big(h_v^{(\ell-1)}(t),\ \mathrm{AGG}^{(\ell)}\big(\{\{(h_u^{(\ell-1)}(t),\ t - t',\ e) \mid (u, e, t') \in \mathcal{N}(v, t)\}\}\big)\Big).$$
The composition of injective functions is also injective, and time-shifting operations are injective as well. Thus, we can construct an injection $\psi$ such that
$$h_v^{(\ell)}(t) = \psi\big(c^{\ell-1}(v),\ \{\{(c^{\ell-1}(u),\ t',\ e) \mid (u, e, t') \in \mathcal{N}(v, t)\}\}\big),$$
since there exists an element $(u, e, t') \in \mathcal{N}(v, t)$ if and only if there is an event $(u, v, t') \in \mathcal{G}(t)$ with feature $e_{uv}(t') = e$.
Then, we can write
$$h_v^{(\ell)}(t) = \psi \circ \mathrm{HASH}^{-1} \circ \mathrm{HASH}\big(c^{\ell-1}(v),\ \{\{(c^{\ell-1}(u),\ t',\ e_{uv}(t')) \mid (u, v, t') \in \mathcal{G}(t)\}\}\big).$$
Since $\mathrm{HASH}\big(c^{\ell-1}(v), \{\{(c^{\ell-1}(u), t', e_{uv}(t')) \mid (u, v, t') \in \mathcal{G}(t)\}\}\big) = c^{\ell}(v)$, this yields $h_v^{(\ell)}(t) = (\psi \circ \mathrm{HASH}^{-1})(c^{\ell}(v))$, with $\psi \circ \mathrm{HASH}^{-1}$ injective, completing the inductive step.
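To make the role of HASH concrete, here is a toy sketch (ours, not from the paper) of a temporal WL-style refinement round on a made-up event list; Python's built-in hash stands in for an injective HASH (we ignore the negligible possibility of collisions), and sorted tuples serve as an injective encoding of the multisets.

```python
from collections import defaultdict

# events of a toy temporal graph G(t): (u, v, timestamp, edge_feature)
events = [("a", "b", 1, "e"), ("b", "c", 2, "e"), ("a", "c", 3, "e")]

def temporal_wl_round(colors, events, t):
    """One color-refinement round of a temporal-WL-style test at time t."""
    neigh = defaultdict(list)
    for u, v, ts, e in events:
        if ts <= t:                       # only events observed up to time t
            neigh[u].append((colors[v], ts, e))
            neigh[v].append((colors[u], ts, e))
    # HASH(c(v), {{(c(u), t', e)}}): a sorted tuple encodes the multiset injectively
    return {v: hash((colors[v], tuple(sorted(neigh[v])))) for v in colors}

colors = {v: 0 for v in ("a", "b", "c")}  # identical initial node features
for _ in range(2):                        # iterate the refinement
    colors = temporal_wl_round(colors, events, t=3)
print(colors)
```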
Figure S4: Examples of temporal graphs for which MP-TGNs cannot distinguish the diameter, girth, and number of cycles. For any node in $\mathcal{G}(t)$ (e.g., $u_1$), there is a corresponding one in $\mathcal{G}'(t)$ ($u_1'$) whose TCTs are isomorphic.
B.9 Proof of Proposition 7: MP-TGNs and CAWs fail to decide some graph properties

Statement 1: MP-TGNs fail to decide some graph properties.

Proof. Adapting a construction by Garg et al. [9], we provide in Figure S4 an example that demonstrates Proposition 7. Colors denote node features, and all edge features are identical. The temporal graphs $\mathcal{G}(t)$ and $\mathcal{G}'(t)$ are non-isomorphic and differ in properties such as diameter ($\infty$ for $\mathcal{G}(t)$ and 3 for $\mathcal{G}'(t)$), girth (3 for $\mathcal{G}(t)$ and 6 for $\mathcal{G}'(t)$), and number of cycles (2 for $\mathcal{G}(t)$ and 1 for $\mathcal{G}'(t)$). In spite of that, for $t > t_3$, the set of embeddings of nodes in $\mathcal{G}(t)$ is the same as that of nodes in $\mathcal{G}'(t)$ and, therefore, MP-TGNs cannot decide these properties. In particular, by constructing the TCTs of all nodes at time $t > t_3$, we observe that the TCTs of the pairs $(u_1, u_1')$, $(u_2, u_2')$, $(v_1, v_1')$, $(v_2, v_2')$, $(w_1, w_1')$, $(w_2, w_2')$ are isomorphic and, therefore, the corresponding nodes cannot be distinguished (Lemma 1).
Figure S5: Examples of temporal graphs with different static properties, such as diameter, girth, and number of cycles. CAWs fail to distinguish $\mathcal{G}_1(t)$ and $\mathcal{G}_2(t)$.
Proof. We can adapt our construction in Figure 2 [rightmost] to extend Proposition 7 to CAW. The idea consists of creating two temporal graphs with different diameters, girths, and numbers of cycles that comprise events that CAW cannot separate; Figure S5 provides one such construction. In particular, CAW obtains identical embeddings for $(u, z, t_3)$ and $(a', z', t_3)$ (as shown in Proposition 5). The remaining events are the same up to node re-labelling and thus also lead to identical embeddings. Therefore, CAW cannot distinguish $\mathcal{G}_1(t)$ and $\mathcal{G}_2(t)$ although they clearly differ in diameter, girth, and number of cycles.
[Induction step] Assume that the proposition holds
for all nodes and any time instant up to t. We will
show that after the event γ = (u, v, t) at time t, the
proposition remains true.
Note that the event γ only impacts the monotone
TCTs of u and v. The reason is that the monotone
TCTs of all other nodes have timestamps lower than
t, which prevents the event γ from belonging to any
path (with decreasing timestamps) from the root.
Without loss of generality, let us now consider the impact of $\gamma$ on the monotone TCT of $v$. Figure S6 shows how the TCT of $v$ changes after $\gamma$, i.e., how it goes from $\tilde{T}_v(t)$ to $\tilde{T}_v(t^+)$. In particular, the process attaches the TCT of $u$ to the root node $v$. Under this change, we need to update the counts of all nodes $i$ in $\tilde{T}_u(t)$ regarding how many times each appears in $\tilde{T}_v(t^+)$. We do so by adding the counts in $\tilde{T}_u(t)$ (i.e., $r_{i\to u}^{(t)}$) to those in $\tilde{T}_v(t)$ (i.e., $r_{i\to v}^{(t)}$), accounting for the 1-layer mismatch, since $\tilde{T}_u(t)$ is attached at the first layer. This can be easily achieved with the shift matrix
$$P = \begin{bmatrix} 0 & 0 \\ I_{d-1} & 0 \end{bmatrix}$$
applied to the counts of any node $i$ in $\tilde{T}_u(t)$, i.e.,
$$r_{i\to v}^{(t^+)} = P\, r_{i\to u}^{(t)} + r_{i\to v}^{(t)} \qquad \forall i \in V_u^{(t)},$$
where $V_u^{(t)}$ comprises the nodes of the original graph that belong to $\tilde{T}_u(t)$.

Figure S6: Illustration of how the monotone TCT of $v$ changes after an event between $u$ and $v$ at time $t$. This allows us to see how to update the positional features of any node $i$ of the dynamic graph that belongs to $\tilde{T}_u(t)$ relative to $v$.
Similarly, the event $\gamma$ also affects the counts of nodes in the monotone TCT of $v$ w.r.t. the monotone TCT of $u$. To account for that change, we follow the same procedure and update $r_{j\to u}^{(t^+)} = P\, r_{j\to v}^{(t)} + r_{j\to u}^{(t)}$, $\forall j \in V_v^{(t)}$.
Handling multiple events at the same time. We now consider the setting where a given node v
interacts with multiple nodes u1 , u2 , . . . , uJ at time t. We can extend the computation of positional
features to this setting in a straightforward manner by noting that each event leads to an independent
branch in the TCT of v. Therefore, the update of the positional features with respect to v is given by
$$r_{i\to v}^{(t^+)} = P \sum_{j=1}^{J} r_{i\to u_j}^{(t)} + r_{i\to v}^{(t)} \qquad \forall i \in \bigcup_{j=1}^{J} V_{u_j}^{(t)},$$
$$V_v^{(t^+)} = V_v^{(t)} \cup \bigcup_{j=1}^{J} V_{u_j}^{(t)}.$$
We note that the updates of the positional features of u1 , . . . , uJ remain untouched if they do not
interact with other nodes at time t.
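For illustration, the sketch below implements the single-event update above with a hypothetical nested-dictionary layout, where `r[v][i]` plays the role of $r_{i\to v}$; handling simultaneous events with partners $u_1, \dots, u_J$ would simply sum the shifted counts over all partners, as in the displayed equation. This is a minimal sketch, not PINT's released implementation.

```python
import numpy as np
from collections import defaultdict

d = 4  # number of TCT levels tracked (dimension of r_{i->v})

# shift matrix P: zero first row, identity shifted one level down
P = np.zeros((d, d))
P[1:, :-1] = np.eye(d - 1)

# r[v][i] stores r_{i->v}: counts of node i at each level of v's monotone TCT
r = defaultdict(lambda: defaultdict(lambda: np.zeros(d)))
r["u"]["u"][0] = 1.0   # every root appears once at level 0 of its own TCT
r["v"]["v"][0] = 1.0

def process_event(u, v):
    """Update positional features after an interaction event (u, v, t)."""
    # snapshot the pre-event counts: both updates use the values at time t
    r_u = {i: vec.copy() for i, vec in r[u].items()}
    r_v = {j: vec.copy() for j, vec in r[v].items()}
    # r_{i->v}^{(t+)} = P r_{i->u}^{(t)} + r_{i->v}^{(t)}  for i in V_u^{(t)}
    for i, vec in r_u.items():
        r[v][i] = P @ vec + r[v][i]
    # symmetric update for u w.r.t. the nodes in v's monotone TCT
    for j, vec in r_v.items():
        r[u][j] = P @ vec + r[u][j]

process_event("u", "v")
print(dict(r["v"]))  # u now appears once at level 1 of v's TCT
```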
B.11 Proof of Proposition 8: Injective function on temporal neighborhood
Proof. To capture the intuition behind the proof, first consider a multiset $M$ such that $|M| < 4$. We can assign a unique number $\psi(m) \in \{1, 2, 3, 4\}$ to any distinct element $m \in M$. Also, the function $h(m) = 10^{-\psi(m)}$ denotes the decimal expansion of $\psi(m)$ and corresponds to reserving one decimal place for each unique element $m \in M$. Since there are fewer than 10 elements in the multiset, $\sum_{m} h(m)$ is unique for any multiset $M$.
To prove the proposition, we also leverage the well-known fact that the Cartesian product of two countable sets is countable; Cantor's (bijective) pairing function $z : \mathbb{N} \times \mathbb{N} \to \mathbb{N}$, with $z(n_1, n_2) = \frac{(n_1 + n_2)(n_1 + n_2 + 1)}{2} + n_2$, provides a proof of that.
Here, we consider multisets $M = \{\{(x_i, e_i, t_i)\}\}$ whose tuples take values in the Cartesian product of the countable sets $\mathcal{X}$, $\mathcal{E}$, and $\mathcal{T}$, where the latter is also assumed to be bounded. In addition, we assume
the lengths of all multisets are bounded by $N$, i.e., $|M| < N$ for all $M$. Since $\mathcal{X}$ and $\mathcal{E}$ are countable, there exists an enumeration function $\psi : \mathcal{X} \times \mathcal{E} \to \mathbb{N}$. Without loss of generality, we assume $\mathcal{T} = \{1, 2, \dots, t_{\max}\}$. We want to show that there exists a function of the form $\sum_i 10^{-k\psi(x_i, e_i)} \alpha^{-\beta t_i}$ that is unique on any multiset $M$.

Our idea is to reserve a range of $k$ decimal slots for each unique element $(x_i, e_i, \cdot)$ in the multiset. Each such range has to accommodate at least $t_{\max}$ decimal slots (one for each value of $t_i$). Finally, we need to make sure we can add up to $N$ values at each decimal slot.
Formally, we map each tuple $(x_i, e_i, \cdot)$ to a range of $k$ decimal slots starting at $10^{-k\psi(x_i, e_i)}$. In particular, for each element $(x_i, e_i, t_i = j)$ we add one unit at the $j$-th decimal slot after $10^{-k\psi(x_i, e_i)}$. Also, to ensure the counts for $(x_i, e_i, j)$ and $(x_i, e_i, l \neq j)$ do not overlap, we set $\beta = \lceil \log_{10} N \rceil$, since no tuple can repeat more than $N$ times. We use $\alpha = 10$ as we shift decimals. Finally, to guarantee that each range encompasses $t_{\max}$ slots of $\beta$ decimals, we set $k = \beta(t_{\max} + 1)$. Therefore, the function
$$\sum_i 10^{-k\psi(x_i, e_i)}\, \alpha^{-\beta t_i}$$
is unique on any multiset M . We note that, without loss of generality, one could choose a different
basis (other than 10).
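The toy sketch below (not part of the paper) evaluates the function above on small made-up multisets of tuples $(x, e, t)$, with assumed feature spaces, enumeration $\psi$, and bounds $N$ and $t_{\max}$; exact rational arithmetic avoids floating-point artifacts when checking uniqueness.

```python
from fractions import Fraction
from math import ceil, log10

# toy countable feature spaces and enumeration psi: X x E -> N (assumptions)
X = ["x0", "x1"]
E = ["e0", "e1"]
psi = {(x, e): idx + 1
       for idx, (x, e) in enumerate((x, e) for x in X for e in E)}

N = 5                         # bound on multiset size, |M| < N
t_max = 3                     # timestamps take values in {1, ..., t_max}
alpha = 10
beta = ceil(log10(N))         # decimals reserved per timestamp slot
k = beta * (t_max + 1)        # decimals reserved per (x, e) range

def encode(multiset):
    """sum_i 10^{-k psi(x_i, e_i)} * alpha^{-beta t_i}, in exact arithmetic."""
    return sum(Fraction(1, 10 ** (k * psi[(x, e)]) * alpha ** (beta * t))
               for (x, e, t) in multiset)

m1 = [("x0", "e0", 1), ("x0", "e0", 1), ("x1", "e1", 3)]
m2 = [("x0", "e0", 2), ("x1", "e1", 3)]          # differs in one count/timestamp
assert encode(m1) != encode(m2)                  # distinct multisets, distinct codes
assert encode(m1) == encode(list(reversed(m1)))  # order-invariant (multiset function)
```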
Note that we use the function $\sharp$ (which maps nodes in the TCT to nodes in the dynamic graph) here because the positional feature vectors are defined for nodes in the dynamic graph.
To guarantee that two encoded walks are identical, $\mathrm{ENC}(W; S_u, S_v) = \mathrm{ENC}(W'; S_{u'}, S_{v'})$, it suffices to show that the anonymized walks are equal. Thus, we turn our problem into showing that for any walk $W = (w_0, t_0, w_1, t_1, \dots)$ in $S_u \cup S_v$, there exists a corresponding one
$W' = (w_0', t_0, w_1', t_1, \dots)$ in $S_{u'} \cup S_{v'}$ such that $I_{\mathrm{CAW}}(w_i; S_u, S_v) = I_{\mathrm{CAW}}(w_i'; S_{u'}, S_{v'})$ for all $i$. Recall that $I_{\mathrm{CAW}}(w_i; S_u, S_v) = \{g(w_i; S_u), g(w_i; S_v)\}$, where $g(w_i; S_u)$ is a vector whose $k$-th component stores how many times $w_i$ appears at position $k$ of a walk in $S_u$.
A key observation is that there is an equivalence between deanonymized root-leaf paths in $T_{uv}$ and walks in $S_u \cup S_v$ (disregarding the virtual root node). By deanonymized, we mean paths where node identities (in the temporal graph) are revealed by applying the function $\sharp$. Using this equivalence, it suffices to show that
$$g(\sharp i; S_u) = g(\sharp f(i); S_{u'}) \quad \text{and} \quad g(\sharp i; S_v) = g(\sharp f(i); S_{v'}) \qquad \forall i \in V(T_{uv}) \setminus \{uv\}.$$
Suppose there is an $i \in V(T_{uv}) \setminus \{uv\}$ such that $g(\sharp i; S_u) \neq g(\sharp f(i); S_{u'})$. Without loss of generality, suppose this holds for the $\ell$-th entry of the vectors.

We know there are exactly $r_{a\to u}^{(t)}[\ell]$ nodes at the $\ell$-th level of $\tilde{T}_u(t)$ that are associated with $a = \sharp i \in V(\mathcal{G}(t))$. We denote by $\Psi$ the set comprising such nodes. It also follows that computing $g(\sharp i; S_u)[\ell]$ amounts to summing up the number of leaves of each subtree of $\tilde{T}_u(t)$ rooted at $\psi \in \Psi$, which we denote by $l(\psi; \tilde{T}_u(t))$, i.e.,
$$g(\sharp i; S_u)[\ell] = \sum_{\psi \in \Psi} l(\psi; \tilde{T}_u(t)).$$
Since we assume $g(\sharp i; S_u)[\ell] \neq g(\sharp f(i); S_{u'})[\ell]$, it holds that
$$g(\sharp i; S_u)[\ell] \neq g(\sharp f(i); S_{u'})[\ell] \;\Rightarrow\; \sum_{\psi \in \Psi} l(\psi; \tilde{T}_u(t)) \neq \sum_{\psi \in \Psi} l(f(\psi); \tilde{T}_{u'}(t)). \tag{S5}$$
Note that the subtree of $\tilde{T}_u$ rooted at $\psi$ should be isomorphic to the subtree of $\tilde{T}_{u'}$ rooted at $f(\psi)$, and therefore has the same number of leaves. However, the RHS of Equation S5 implies there is a $\psi \in \Psi$ for which $l(\psi; \tilde{T}_u) \neq l(f(\psi); \tilde{T}_{u'})$, reaching a contradiction. The same argument can be applied to $v$ and $v'$ to prove that $g(\sharp i; S_v) = g(\sharp f(i); S_{v'})$.
networks [34]. Pareja et al. [56] applied a recurrent neural network to dynamically update the parameters of a GCN. Gao and Ribeiro [50] compared the expressive power of two classes of models for discrete dynamic graphs: time-and-graph and time-then-graph. The former represents the standard approach of interleaving GNNs and sequence (e.g., RNN) models. In the latter class, the models first capture node and edge dynamics using RNNs, and the resulting representations are then fed into graph neural networks. The authors showed that time-then-graph has an expressivity advantage over time-and-graph approaches under mild assumptions. For an in-depth review of representation learning for dynamic graphs, we refer to the surveys by Kazemi et al. [14] and Skarding et al. [60].
While most of the early works focused on discrete-time dynamic graphs, we have recently witnessed
a rise in interest in models for event-based temporal graphs (i.e., CTDGs). The reason is that
models for DTDGs may fail to leverage fine-grained temporal and structural information that can be
crucial in many applications. In addition, it is hard to specify meaningful time intervals for different
tasks. Thus, modern methods for temporal graphs explicitly incorporate timestamp information
into sequence/graph models, achieving significant performance gains over approaches for DTDGs
[38]. Appendix A provides a more detailed presentation of CAW, TGN-Att, and TGAT, which are
among the best performing models for link prediction on temporal graphs. Besides these methods,
JODIE [16] applies two RNNs (for the source and target nodes of an event) with a time-dependent
embedding projection to learn node representations of item-user interaction networks. Trivedi et al.
[32] employed RNNs with a temporally attentive module to update node representations. APAN
[63] is a memory-based TGN that uses an attention mechanism to update memory states using
multi-hop temporal neighborhood information. Makarov et al. [54] proposed incorporating edge
embeddings obtained from CAW into MP-TGNs’ memory and message-passing computations.
D Datasets and implementation details
D.1 Datasets
In our empirical evaluation, we have considered six datasets for dynamic link prediction: Reddit, Wikipedia, UCI, LastFM, Enron, and Twitter. Reddit is a network of posts made by users on
subreddits, considering the 1,000 most active subreddits and the 10,000 most active users. Wikipedia
comprises edits made on the 1,000 most edited Wikipedia pages by editors with at least 5 edits. Both
Reddit and Wikipedia networks include links collected over one month, and text is used as edge
features, providing informative context. The LastFM dataset is a network of interactions between users and the songs they listened to. UCI comprises students' posts to a forum at the University
of California Irvine. Enron contains a collection of email events between employees of the Enron
Corporation before its bankruptcy. The Twitter dataset is a non-bipartite network where nodes are users and interactions are retweets. Since Twitter is not publicly available, we built our own version by following the guidelines by Rossi et al. [27]. We use the data available from the 2021 Twitter RecSys Challenge and select 10,000 nodes and their associated interactions based on node participation, i.e., the number of interactions each node takes part in. We also apply multilingual BERT to obtain text
representations of retweets (edge features).
Table S1 reports statistics of the datasets, such as the number of temporal nodes and links, and the
dimensionality of the edge features. We note that UCI, Enron, and LastFM represent non-attributed
networks and therefore do not contain feature vectors associated with the events. Also, the node
features for all datasets are vectors of zeros [42].
and iii) walk length $L \in \{32, 64, 128\}$. The best combination of hyperparameters is shown in Table S2. The remaining training choices follow the default values from the original implementation. Importantly, we note that TGN-Att's original evaluation setup is different from CAW's. Thus, we adapted CAW's original repository to reflect these differences and ensure a valid comparison.
PINT. We use $\alpha = 2$ (in the exponential aggregation function) and experiment with both learned and fixed $\beta$. We apply a ReLU function to avoid negative values of $\beta$, which could lead to unstable training. We perform a grid search as follows: when learning $\beta$, we consider initial values $\beta \in \{0.1, 0.5\}$; for the fixed case (requires_grad=False), we evaluate $\beta \in \{10^{-3} \times |N|, 10^{-4} \times |N|, 10^{-5} \times |N|\}$, where $|N|$ denotes the number of temporal neighbors, and always apply memory as in the original implementation of TGN-Att. We consider the number of message-passing layers $\ell \in \{1, 2\}$. Also, we apply neighborhood sampling with the number of neighbors in $\{10, 20\}$, and update the state of a node based on its most recent message. We then carry out model selection based on AP values obtained during validation. Overall, the models with fixed $\beta$ led to better results. Table S3 reports the optimal hyperparameters for PINT found via automatic model selection.
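For concreteness, the snippet below sketches the $\beta$ parameterization described above (ReLU to keep $\beta$ non-negative, learned or fixed) together with one plausible reading of the exponential time weighting with $\alpha = 2$; the module and variable names are ours, and the exact way these weights enter PINT's aggregation follows the released code, which may differ in details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpTimeWeight(nn.Module):
    """Exponential weights alpha^(-beta * dt); beta kept non-negative via ReLU."""
    def __init__(self, alpha=2.0, beta_init=0.1, learn_beta=True):
        super().__init__()
        self.alpha = alpha
        self.beta_raw = nn.Parameter(torch.tensor(beta_init),
                                     requires_grad=learn_beta)

    def forward(self, dt):
        beta = F.relu(self.beta_raw)          # avoid negative beta (unstable training)
        return self.alpha ** (-beta * dt)     # one weight per temporal neighbor

# fixed-beta variant from the grid search, e.g., beta = 1e-4 * |N| with |N| = 10
w_learned = ExpTimeWeight(beta_init=0.1, learn_beta=True)
w_fixed = ExpTimeWeight(beta_init=1e-4 * 10, learn_beta=False)

dt = torch.tensor([0.0, 5.0, 50.0])           # time gaps to sampled neighbors
print(w_learned(dt), w_fixed(dt))
```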
In all experiments, we use relative positional features with $d = 4$ dimensions. For computational efficiency, we update the relative positional features only after processing a batch, factoring in all events from that batch. Note that this prevents information leakage, as the positional features only take effect after prediction. In addition, since temporal events repeat (in the same order) at each epoch, we also speed up PINT's training procedure by precomputing and saving the positional features for each batch. To save space, we store the positional features as sparse matrices.
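As a small illustration of this caching step (hypothetical helper names, not the actual training loop), per-batch positional features can be stored as SciPy sparse matrices and densified only when needed:

```python
import numpy as np
import scipy.sparse as sp

def save_batch_positional(cache, batch_idx, pos_feats):
    """Cache a (num_pairs x d) positional-feature matrix in sparse form."""
    cache[batch_idx] = sp.csr_matrix(pos_feats)   # mostly zeros -> cheap to store

def load_batch_positional(cache, batch_idx):
    return cache[batch_idx].toarray()             # densify only when needed

cache = {}
pos = np.zeros((1024, 4)); pos[3, 1] = 2.0        # toy positional features (d = 4)
save_batch_positional(cache, batch_idx=0, pos_feats=pos)
assert np.allclose(load_batch_positional(cache, 0), pos)
```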
Hardware. For all experiments, we use Tesla V100 GPU cards and consider a memory budget of
32GB of RAM.
node, inducing a self-loop in the dynamic graph. Also, we can combine (e.g., concatenate) the temporal features in message-passing operations, similarly to the general formulation of the MP-TGN framework [27]. Finally, we can deal with the removal of a node $v$ by following our previous (edge-deletion) procedure to delete all edges with an endpoint in $v$.
F Additional experiments
Time comparison. Figure S7 compares the time per epoch for PINT and for the prior art (CAW,
TGN-Att, and TGAT) in the Enron and LastFM datasets. Following the trend in Figure 7, Figure S7
further supports that PINT is generally slower than other MP-TGNs but, after a few training epochs,
is orders of magnitude faster than CAW. In the case of Enron, the time CAW takes to complete an
epoch is much higher than the time we need to preprocess PINT’s positional features.
Figure S7: Time comparison: PINT versus TGNs (in log-scale) on Enron and LastFM. The vertical axes report the average time per epoch in log(s).
Experiments on node classification. For completeness, we also evaluate PINT on node-level tasks
(Wikipedia and Reddit). We follow closely the experimental setup in Rossi et al. [27] and compare
against the baselines therein. Table S4 shows that PINT ranks first on Reddit and second on Wikipedia.
The values for PINT reflect the outcome of 5 repetitions.
Supplementary References
[47] R. Abboud, I. I. Ceylan, M. Grohe, and T. Lukasiewicz. The surprising power of graph neural networks
with random node initialization. In International Joint Conference on Artificial Intelligence (IJCAI), 2021.
[48] G. Bouritsas, F. Frasca, S. Zafeiriou, and M. M. Bronstein. Improving graph neural network expressivity via subgraph isomorphism counting. arXiv e-prints, 2020.
[49] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
[50] J. Gao and B. Ribeiro. On the equivalence between temporal and static graph representations for observational predictions. arXiv:2103.07016, 2021.
[51] D. Kreuzer, D. Beaini, W. L. Hamilton, V. Letourneau, and P. Tossou. Rethinking graph transformers with
spectral attention. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[52] P. Li, Y. Wang, H. Wang, and J. Leskovec. Distance encoding: Design provably more powerful neural
networks for graph representation learning. In Advances in Neural Information Processing Systems
(NeurIPS), 2020.
[53] S. Mahdavi, S. Khoshraftar, and A. An. dynnode2vec: Scalable dynamic network embedding. In
International Conference on Big Data, 2018.
[54] I. Makarov, A. V. Savchenko, A. Korovko, L. Sherstyuk, N. Severin, A. Mikheev, and D. Babaev. Temporal graph network embedding with causal anonymous walks representations. arXiv:2108.08754, 2021.
[55] F. Manessi, A. Rozza, and M. Manzo. Dynamic graph convolutional networks. Pattern Recognition, 97,
2020.
[56] A. Pareja, G. Domeniconi, J. Chen, T. Ma, H. Kanezashi, T. Suzumura, T. Kaler, T. B. Schardl, and C. E. Leiserson. EvolveGCN: Evolving graph convolutional networks for dynamic graphs. In AAAI Conference on Artificial Intelligence (AAAI), 2020.
[57] A. Sankar, Y. Wu, L. Gou, W. Zhang, and H. Yang. DySAT: Deep neural representation learning on
dynamic graphs via self-attention networks. In International Conference on Web Search and Data Mining
(WSDM), 2020.
[58] R. Sato, M. Yamada, and H. Kashima. Random features strengthen graph neural networks. In SIAM
International Conference on Data Mining (SDM), 2021.
[59] Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson. Structured sequence modeling with graph
convolutional recurrent networks. In International Conference on Neural Information Processing (ICONIP),
2018.
[60] J. Skarding, B. Gabrys, and K. Musial. Foundations and modeling of dynamic networks using dynamic
graph neural networks: A survey. IEEE Access, 9:79143–79168, 2021.
[61] B. Srinivasan and B. Ribeiro. On the equivalence between positional node embeddings and structural graph
representations. In International Conference on Learning Representations (ICLR), 2020.
[62] H. Wang, H. Yin, M. Zhang, and P. Li. Equivariant and stable positional encoding for more powerful graph
neural networks. In International Conference on Learning Representations (ICLR), 2022.
[63] X. Wang, D. Lyu, M. Li, Y. Xia, Q. Yang, X. Wang, X. Wang, P. Cui, Y. Yang, B. Sun, and Z. Guo. APAN:
Asynchronous propagation attention network for real-time temporal graph embedding. International
Conference on Management of Data, 2021.