Aligning Transformers With Weisfeiler–Leman

arXiv:2406.03148v1 [cs.LG] 5 Jun 2024

Department of Computer Science, RWTH Aachen University, Germany. Correspondence to: Luis Müller <[email protected] aachen.de>.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Graph neural network architectures aligned with the k-dimensional Weisfeiler–Leman (k-WL) hierarchy offer theoretically well-understood expressive power. However, these architectures often fail to deliver state-of-the-art predictive performance on real-world graphs, limiting their practical utility. While recent works aligning graph transformer architectures with the k-WL hierarchy have shown promising empirical results, employing transformers for higher orders of k remains challenging due to a prohibitive runtime and memory complexity of self-attention as well as impractical architectural assumptions, such as an infeasible number of attention heads. Here, we advance the alignment of transformers with the k-WL hierarchy, showing stronger expressivity results for each k, making them more feasible in practice. In addition, we develop a theoretical framework that allows the study of established positional encodings such as Laplacian PEs and SPE. We evaluate our transformers on the large-scale PCQM4Mv2 dataset, showing competitive predictive performance with the state of the art and demonstrating strong downstream performance when fine-tuning them on small-scale molecular datasets.

1. Introduction

Message-passing graph neural networks (GNNs) are the de-facto standard in graph learning (Gilmer et al., 2017; Kipf & Welling, 2017; Scarselli et al., 2009; Xu et al., 2019a). However, due to their purely local mode of aggregating information, they suffer from limited expressivity in distinguishing non-isomorphic graphs in terms of the 1-dimensional Weisfeiler–Leman algorithm (1-WL) (Morris et al., 2019; Xu et al., 2019a). Hence, recent works (Azizian & Lelarge, 2021; Maron et al., 2019a; Morris et al., 2020; 2023; 2022) introduced higher-order GNNs, aligned with the k-dimensional Weisfeiler–Leman (k-WL) hierarchy for graph isomorphism testing (Cai et al., 1992), resulting in more expressivity with an increase in the order k > 1. The k-WL hierarchy draws from a rich history in graph theory (Babai, 1979; Cai et al., 1992; Grohe, 2017; Weisfeiler & Leman, 1968), offering a deep theoretical understanding of k-WL-aligned GNNs. In contrast, graph transformers (GTs) (Glickman & Yahav, 2023; He et al., 2023; Ma et al., 2023; Müller et al., 2024; Rampášek et al., 2022; Ying et al., 2021) recently demonstrated state-of-the-art empirical performance. However, they draw their expressive power mostly from positional or structural encodings (PEs in the following), making it challenging to understand these models in terms of an expressivity hierarchy such as the k-WL. In addition, their empirical success relies on modifying the attention mechanism (Glickman & Yahav, 2023; Ma et al., 2023; Ying et al., 2021) or using additional message-passing GNNs (He et al., 2023; Rampášek et al., 2022).

To still benefit from the theoretical power offered by the k-WL, previous works (Kim et al., 2021; 2022) aligned transformers with k-IGNs, showing that transformer layers can approximate invariant linear layers (Maron et al., 2019b). Crucially, Kim et al. (2022) designed pure transformers, requiring no architectural modifications of the standard transformer and instead drawing their expressive power from an appropriate tokenization, i.e., the encoding of a graph as a set of input tokens. Pure transformers provide several benefits over graph transformers that use message passing or modified attention, such as being directly applicable to innovations for transformer architectures, most notably Performers (Choromanski et al., 2021) and Flash Attention (Dao et al., 2022) to reduce the runtime or memory demands of transformers, as well as the Perceiver (Jaegle et al., 2021), enabling multi-modal learning. Unfortunately, the framework of Kim et al. (2022) does not allow for a feasible transformer with provable expressivity strictly greater than the 1-WL due to an O(n^6) runtime complexity and the requirement of 203 attention heads resulting from their alignment with IGNs. This poses the question of whether a more feasible hierarchy of pure transformers exists.
Figure 1: Overview of our theoretical results, aligning transformers with the established k-WL hierarchy, relating our transformers (upper row) to 1-WL, (2, 1)-WL, (3, 1)-WL, 3-WL, ..., k-WL (lower row). Forward arrows point to more powerful algorithms or neural architectures. A ⊏ B (A ⊑ B, A ≡ B) denotes that algorithm A is strictly more powerful than (at least as powerful as, equally powerful as) B. The relations between the boxes in the lower row stem from Cai et al. (1992) and Morris et al. (2022).
Present work Here, we offer theoretical improvements for pure transformers. Our guiding question is

Can we design a hierarchy of increasingly expressive pure transformers that are also feasible in practice?

In this work, we make significant progress toward answering this question in the affirmative. First, in Section 3, we improve the expressivity of pure transformers for every order k > 0, while keeping the same runtime complexity as Kim et al. (2022), by aligning transformers closely with the Weisfeiler–Leman hierarchy. Secondly, we demonstrate the benefits of this close alignment by showing that our results directly yield transformers aligned with the (k, s)-WL hierarchy, recently proposed to increase the scalability of higher-order variants of the Weisfeiler–Leman algorithm (Morris et al., 2022). These transformers then form a hierarchy of increasingly expressive pure transformers. We show in Section 4 that our transformers can be naturally implemented with established node-level PEs such as the Laplacian PEs in Kreuzer et al. (2021). Lastly, we show in Section 5 that our transformers are feasible in practice and have more expressivity than the 1-WL. In particular, we obtain very close to state-of-the-art results on the large-scale PCQM4Mv2 dataset (Hu et al., 2021) and further show that our transformers are highly competitive with GNNs when fine-tuned on small-scale molecular benchmarks (Hu et al., 2020a). See Figure 1 for an overview of our theoretical results and Table 1 for a comparison of our pure transformers with the transformers in Kim et al. (2021) and Kim et al. (2022).

While theoretically intriguing, higher-order GNNs often fail to deliver state-of-the-art performance on real-world problems (Azizian & Lelarge, 2021; Morris et al., 2020; 2022), making theoretically grounded architectures less relevant in practice.

Graph transformers with higher-order expressive power are Graphormer-GD (Zhang et al., 2023) as well as the higher-order graph transformers in Kim et al. (2021) and Kim et al. (2022). However, Graphormer-GD is less expressive than the 3-WL (Zhang et al., 2023). Further, Kim et al. (2021) and Kim et al. (2022) align transformers with k-IGNs and, thus, obtain the theoretical expressive power of the corresponding k-WL. However, they do not empirically evaluate their transformers for k > 2. For k = 2, Kim et al. (2022) propose TokenGT, a pure transformer with n + m tokens per graph, where n is the number of nodes and m is the number of edges. This transformer has 2-WL expressivity, which, however, is the same as 1-WL expressivity (Morris et al., 2023). From a theoretical perspective, the transformers in Kim et al. (2022) still require impractical assumptions such as bell(2k) attention heads, where bell(2k) is the 2k-th Bell number (Maron et al., 2019a), resulting in 203 attention heads for a transformer with a provable expressive power strictly stronger than the 1-WL. In addition, Kim et al. (2022) introduce special encodings called node and type identifiers that are theoretically necessary but, as argued in Appendix A.5 of Kim et al. (2022), not ideal in practice. For an overview of the Weisfeiler–Leman hierarchy in graph learning, see Morris et al. (2023).
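To make the head-count requirement above concrete, note how quickly the Bell numbers grow: bell(2) = 2, bell(4) = 15, and bell(6) = 203, the latter corresponding to the 203 heads mentioned above. The following short sketch is not part of the paper; it simply computes Bell numbers via the Bell triangle.

import math  # only used implicitly; the computation below is pure integer arithmetic

# A minimal sketch (not from the paper): Bell numbers via the Bell triangle,
# illustrating the bell(2k) head-count requirement of IGN-aligned transformers.
def bell(n: int) -> int:
    """Return the n-th Bell number, with bell(0) = 1."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

# k = 1 -> bell(2) = 2, k = 2 -> bell(4) = 15, k = 3 -> bell(6) = 203.
print([bell(2 * k) for k in (1, 2, 3)])  # [2, 15, 203]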
to each node. For convenience of notation, we always assume an arbitrary but fixed ordering over the nodes such that each node corresponds to a number in [n]. Further, A(G) ∈ {0, 1}^{n×n} denotes the adjacency matrix, where A(G)_{ij} = 1 if, and only if, nodes i and j share an edge. We also construct a node feature matrix F ∈ R^{n×d} that is consistent with ℓ, i.e., for nodes i and j in V(G), F_i = F_j if, and only if, ℓ(i) = ℓ(j). Note that, for a finite subset of N, we can always construct F, e.g., with a one-hot encoding of the initial colors. We call pairs of nodes (i, j) ∈ V(G)^2 node pairs or 2-tuples. For a matrix X ∈ R^{n×d}, whose i-th row represents the embedding of node v ∈ V(G) in a graph G, we also write X_v ∈ R^d or X(v) ∈ R^d to denote the row corresponding to node v. Further, we define a learnable embedding as a function mapping a subset S ⊂ N to a d-dimensional Euclidean space, parameterized by a neural network, e.g., a neural network applied to a one-hot encoding of the elements in S. Moreover, ||·||_F denotes the Frobenius norm. Finally, we refer to some transformer architecture's concrete set of parameters, including its tokenization, as a parameterization. See Appendix C for a complete description of our notation. Further, see Appendix E.1 for a formal description of the k-WL and its variants.

We initialize n token embeddings X^{(0,1)} ∈ R^{n×d} as

    X^{(0,1)} := F + P,    (1)

where we call P ∈ R^{n×d} structural embeddings, encoding structural information for each token. For node v, we define

    P(v) := FFN(deg(v) + PE(v)),    (2)

where deg : V(G) → R^d is a learnable embedding of the node degree, PE : V(G) → R^d is a node-level PE such as the Laplacian PE (Kreuzer et al., 2021) or SPE (Huang et al., 2024), and FFN : R^d → R^d is a multi-layer perceptron. For the PE, we require that it enables us to distinguish whether two nodes share an edge in G, which we formally define as follows.

Definition 1 (Adjacency-identifying). Let G = (V(G), E(G), ℓ) be a graph with n nodes. Let X^{(0,1)} ∈ R^{n×d} denote the initial token embeddings according to Equation (1) with structural embeddings P ∈ R^{n×d}. Let further

    P̃ := (1/√d_k) (PW_Q)(PW_K)^T,

where W_Q, W_K ∈ R^{d×d}. Then P is adjacency-identifying if there exist W_Q, W_K such that for any row i and column j,

    P̃_{ij} = max_k P̃_{ik},

if, and only if, A(G)_{ij} = 1. Further, an approximation Q ∈ R^{n×d} of P is sufficiently adjacency-identifying if ||P − Q||_F < ϵ, for any ϵ > 0.
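To make the tokenization of Equations (1) and (2) concrete, the following sketch is not from the paper; it assumes NumPy and treats FFN, the degree embedding, and the node-level PE as placeholder callables.

import numpy as np

# A minimal sketch of Equations (1) and (2); ffn, deg_embed and pe are placeholders.
def node_tokens(F, A, pe, ffn, deg_embed):
    """F: (n, d) node features, A: (n, n) adjacency, pe: (n, d) node-level PE,
    ffn: R^d -> R^d applied row-wise, deg_embed: maps an integer degree to R^d."""
    degrees = A.sum(axis=1).astype(int)                   # node degrees
    deg = np.stack([deg_embed(x) for x in degrees])       # (n, d) degree embeddings
    P = np.stack([ffn(row) for row in deg + pe])          # Eq. (2): P(v) = FFN(deg(v) + PE(v))
    return F + P                                          # Eq. (1): X^(0,1) = F + P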
where FFN is again a feed-forward neural network and F_i^{(t)} contains the representation of node i ∈ V(G) at layer t. Contrast the above to the update of a 1-GT layer with a single head, given as

    X^{(t,1)} := FFN(X^{(t-1,1)} + softmax(X̃^{(t-1,1)}) X^{(t-1,1)} W_V),

where

    X̃^{(t-1,1)} := (1/√d_k) (X^{(t-1,1)} W_Q)(X^{(t-1,1)} W_K)^T

denotes the unnormalized attention matrix at layer t. If we now set W_V = 2I, where I is the identity matrix, we obtain

    X^{(t,1)} := FFN(X^{(t-1,1)} + 2 · softmax(X̃^{(t-1,1)}) X^{(t-1,1)}).

At this point, if we could reconstruct the adjacency matrix with softmax(X̃^{(t-1,1)}), we could simulate the 1-WL-expressive GNN of Equation (3) with a single attention head. However, the attention matrix is right-stochastic, meaning its rows sum to 1. Unfortunately, this is not expressive enough to reconstruct the adjacency matrix, which generally is not right-stochastic. Thus, we aim to reconstruct the row-normalized adjacency matrix Ã(G) := D^{-1} A(G), where D ∈ R^{n×n} is the diagonal degree matrix, such that D_{ii} is the degree of node i. Indeed, Ã(G) is right-stochastic. Further, element-wise multiplication of row i of Ã(G) with the degree of i recovers A(G). As we show, adjacency-identifying PEs are, in fact, sufficient for softmax(X̃^{(t-1,1)}) to approximate Ã(G) arbitrarily closely. We show that then, we can use the degree embeddings deg(i) to de-normalize the i-th row of Ã(G) and obtain

    X^{(t,1)} = FFN(X^{(t-1,1)} + 2 A(G) X^{(t-1,1)}).

Showing that a transformer can simulate the GNN in Grohe (2021) implies the connection to the 1-WL. We formally prove the above in the following theorem, showing that the 1-GT can simulate the 1-WL; see Appendix H for proof details.

Theorem 2. Let G = (V(G), E(G), ℓ) be a labeled graph with n nodes and F ∈ R^{n×d} be a node feature matrix consistent with ℓ. Further, let C_t^1 : V(G) → N denote the coloring function of the 1-WL at iteration t. Then, for all iterations t ≥ 0, there exists a parametrization of the 1-GT such that

    C_t^1(v) = C_t^1(w) ⇐⇒ X^{(t,1)}(v) = X^{(t,1)}(w),

for all nodes v, w ∈ V(G).

Having set the stage for theoretically aligned transformers, we now generalize the above to higher orders k > 1.

3.2. Transformers with k-WL expressive power

Here, we propose a tokenization for a standard transformer, operating on n^k tokens, that is strictly more expressive than the k-WL, for an arbitrary but fixed k > 1. Subsequently, to make the architecture more practical, we reduce the number of tokens while remaining strictly more expressive than the 1-WL.

Again, we consider a labeled graph G = (V(G), E(G), ℓ) with n nodes and feature matrix F ∈ R^{n×d}, consistent with ℓ. Intuitively, to surpass the limitations of the 1-WL, the k-WL colors ordered subgraphs instead of single nodes. More precisely, the k-WL colors the tuples from V(G)^k for k ≥ 2 instead of the nodes. Of central importance to our tokenization are consistency with the initial coloring of the k-WL and recovering the adjacency information between k-tuples imposed by the k-WL, both of which we describe hereafter; see Appendix E for a formal definition of the k-WL.

The initial color of a k-tuple v := (v_1, ..., v_k) ∈ V(G)^k under the k-WL depends on its atomic type and the labels ℓ(v_1), ..., ℓ(v_k). Intuitively, the atomic type describes the structural dependencies between the elements of a tuple. We can represent the atomic type of v by a k × k matrix K over {1, 2, 3}. That is, the entry K_{ij} is 1 if (v_i, v_j) ∈ E(G), 2 if v_i = v_j, and 3 otherwise; see Appendix C for a formal definition of the atomic type. Hence, to encode the atomic type as a real vector, we can learn an embedding from the set of k × k matrices to R^e, where e > 0 is the embedding dimension of the atomic type. For a tuple v, we denote this embedding with atp(v). Apart from the initial colors for tuples, the k-WL also imposes a notion of adjacency between tuples. Concretely, we define

    ϕ_j(v, w) := (v_1, ..., v_{j-1}, w, v_{j+1}, ..., v_k),

i.e., ϕ_j(v, w) replaces the j-th component of the tuple v with the vertex w. We say that two tuples are adjacent or j-neighbors if they differ in the j-th component (or are equal, in the case of self-loops).

Now, to construct token embeddings consistent with the initial colors under the k-WL, we initialize n^k token embeddings X^{(0,k)} ∈ R^{n^k × d}, one for each k-tuple. For order k and embedding dimension d of the tuple-level tokens, we first compute node-level tokens X^{(0,1)} as in Equation (1) and, in particular, the structural embeddings P(v) for each node v, with an embedding dimension of d, and then concatenate node-level embeddings along the embedding dimension to construct tuple-level embeddings, which are then projected down to fit the embedding dimension d. Specifically, we define the token embedding of a k-tuple v = (v_1, ..., v_k) as

    X^{(0,k)}(v) := [X^{(0,1)}(v_i)]_{i=1}^k W + atp(v),    (4)

where W ∈ R^{d·k×d} is a projection matrix. Intuitively, the above construction ensures that the token embeddings respect the initial node colors and the atomic type. Analogously to
our notion of adjacency-identifying, we use the structural embeddings P(v_i) in each node embedding X^{(0,1)}(v_i) to identify the j-neighborhood adjacency between tuples v and w in the j-th attention head. To reconstruct the j-neighborhood adjacency, we show that it is sufficient to identify the nodes in each tuple-level token. Hence, we define the following requirements for the structural embeddings P.

Definition 3 (Node-identifying). Let G = (V(G), E(G), ℓ) be a graph with n nodes. Let X^{(0,k)} ∈ R^{n^k × d} denote the initial token embeddings according to Equation (4) with structural embeddings P ∈ R^{n×d}. Let further

    P̃ := (1/√d_k) (PW_Q)(PW_K)^T,

where W_Q, W_K ∈ R^{d×d}. Then P is node-identifying if there exist W_Q, W_K such that for any row i and column j,

    P̃_{ij} = max_k P̃_{ik},

if, and only if, i = j. Further, an approximation Q ∈ R^{n×d} of P is sufficiently node-identifying if

    ||P − Q||_F < ϵ,

for any ϵ > 0.

As we will show, the above requirement allows the attention to distinguish whether two tuples share the same nodes, intuitively, by counting the number of node-to-node matches between two tuples, which is sufficient to determine whether the tuples are j-neighbors. For structural embeddings P to be node-identifying, it suffices, for example, that P has an orthogonal sub-matrix.

In Section 4, we show that structural embeddings are also sufficiently node-identifying when using LPE or SPE (Huang et al., 2024) as node-level PEs. It should also be mentioned that our definition of node-identifying structural embeddings generalizes the node identifiers in Kim et al. (2022), i.e., their node identifiers are node-identifying. Still, structural embeddings exist that are (sufficiently) node-identifying and do not qualify as node identifiers in Kim et al. (2022).

We call a standard transformer using the above tokenization with sufficiently node- and adjacency-identifying structural embeddings the k-GT. We then show the connection of the k-GT to the k-WL, resulting in the following theorem; see Appendix I for proof details.

Theorem 4. Let G = (V(G), E(G), ℓ) be a labeled graph with n nodes and k ≥ 2, and let F ∈ R^{n×d} be a node feature matrix consistent with ℓ. Let C_t^k denote the coloring function of the k-WL at iteration t. Then, for all iterations t ≥ 0, there exists a parametrization of the k-GT such that

    C_t^k(v) = C_t^k(w) ⇐⇒ X^{(t,k)}(v) = X^{(t,k)}(w)

for all k-tuples v and w ∈ V(G)^k.

Note that, similar to Theorem 2, the above theorem gives a lower bound on the expressivity of the k-GT. We now show that the k-GT is strictly more powerful than the k-WL by showing that the k-GT can also simulate the δ-k-WL, a k-WL variant that, for each j-neighbor ϕ_j(v, w), additionally considers whether (v_j, w) ∈ E(G) (Morris et al., 2020). That is, the δ-k-WL distinguishes for each j-neighbor whether the replaced node and the replacing node share an edge in G. Morris et al. (2020) showed that the δ-k-WL is strictly more expressive than the k-WL. We use this to show that the k-GT is strictly more expressive than the k-WL, implying the following result; see again Appendix I for proof details.

Theorem 5. For k > 1, the k-GT is strictly more expressive than the k-WL.

Note that the k-GT uses n^k tokens. However, for larger k and large graphs, the number of tokens might present an obstacle in practice. Luckily, our close alignment of the k-GT with the k-WL hierarchy allows us to directly benefit from a recent result in Morris et al. (2022), reducing the number of tokens while maintaining an expressivity guarantee. Specifically, Morris et al. (2022) define the set of (k, s)-tuples V(G)^k_s as

    V(G)^k_s := {v ∈ V(G)^k | #comp(G[v]) ≤ s},

where #comp(H) denotes the number of connected components in subgraph H, G[v] denotes the ordered subgraph of G induced by the nodes in v, and s ≤ k is a hyper-parameter. Morris et al. (2022) then define the (k, s)-LWL as a k-WL variant that only considers the tuples in V(G)^k_s and additionally only uses a subset of adjacent tuples; see Morris et al. (2022) and Appendix E for a detailed description of the (k, s)-WL. Fortunately, the runtime of the (k, s)-LWL depends only on k, s, and the sparsity of the graph, resulting in a more efficient and scalable k-WL variant. We can now directly translate this modification to our token embeddings. Specifically, we call the k-GT using only the token embeddings X^{(0,k)}(v) where v ∈ V(G)^k_s the (k, s)-GT.

Now, Morris et al. (2022, Theorem 1) show that, for k > 1, the (k + 1, 1)-LWL is strictly stronger than the (k, 1)-LWL. Further, note that the (1, 1)-LWL is equal to the 1-WL. Using this result, we can prove the following; see Appendix I for proof details.

Theorem 6. For all k > 1, the (k, 1)-GT is strictly more expressive than the (k − 1, 1)-GT. Further, the (2, 1)-GT is strictly more expressive than the 1-WL.

Note that the (2, 1)-GT only requires O(n + m) tokens, where m is the number of edges, and consequently has a runtime complexity of O((n + m)^2), the same as TokenGT. Hence, the (2, 1)-GT improves the expressivity result of TokenGT, which is shown to have 1-WL expressive power (Kim et al., 2022), while having the same runtime complexity.
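To illustrate the token reduction, the following sketch is not from the paper; assuming NetworkX, it enumerates V(G)^k_s by brute force. For k = 2 and s = 1 on a simple graph this yields exactly the n diagonal tuples (v, v) plus the 2m ordered pairs of adjacent nodes, i.e., O(n + m) tokens.

import itertools
import networkx as nx

# A minimal sketch (not from the paper) of the (k, s)-tuple set V(G)^k_s:
# k-tuples whose induced subgraph has at most s connected components.
def ks_tuples(G, k, s):
    tuples = []
    for v in itertools.product(G.nodes, repeat=k):
        if nx.number_connected_components(G.subgraph(set(v))) <= s:
            tuples.append(v)
    return tuples

G = nx.path_graph(4)                  # 4 nodes, 3 edges
print(len(ks_tuples(G, k=2, s=1)))    # 4 + 2 * 3 = 10 tokens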
Table 1: Comparison of our theoretical results with Kim et al. (2021) and Kim et al. (2022). Highlighted are aspects in which our results improve over Kim et al. (2022). Here, n denotes the number of nodes, and m denotes the number of edges. We denote with ⊏ A that the transformer is strictly more expressive than algorithm A. Note that for pure transformers, the squared complexity can be relaxed to linear complexity when applying a linear attention approximation such as the Performer (Choromanski et al., 2021).
In summary, our alignment of standard transformers with the k-WL hierarchy has brought forth a variety of improved theoretical results for pure transformers, providing a strict increase in provable expressive power for every order k > 0 compared to previous works. See Table 1 for a direct comparison of our results with Kim et al. (2021) and Kim et al. (2022). Most importantly, Theorem 6 leads to a hierarchy of increasingly expressive pure transformers, taking one step closer to answering our guiding question in Section 1. In what follows, we make our transformers feasible in practice.

4. Implementation details

Here, we discuss the implementation details of pure transformers. In particular, we demonstrate that Laplacian PEs (Kreuzer et al., 2021) are sufficient to be used as node-level PEs for our transformers. Afterward, we introduce order transfer, a way to transfer model weights between different orders of the WL hierarchy; see Appendix A for additional implementation details.

4.1. Node and adjacency-identifying PEs

Here, we develop PEs based on Laplacian eigenvectors and -values that are both sufficiently node- and adjacency-identifying, avoiding computing separate PEs for node and adjacency identification; see Appendix A.1 for details about the graph Laplacian and some background on PEs based on the spectrum of the graph Laplacian.

Let λ := (λ_1, ..., λ_l)^T denote the vector of the l smallest, possibly repeated, eigenvalues of the graph Laplacian for some graph with n nodes, and let V ∈ R^{n×l} be a matrix such that the i-th column of V is the eigenvector corresponding to eigenvalue λ_i. We will now briefly introduce LPE, based on Kreuzer et al. (2021), as well as SPE from Huang et al. (2024), and show that these node-level PEs are node- and adjacency-identifying.

LPE Here, we define a slight generalization of the Laplacian PEs in Kreuzer et al. (2021) that enables sufficiently node- and adjacency-identifying PEs. Concretely, we define LPE as

    LPE(V, λ) = ρ([ϕ(V_1^T, λ + ϵ) ... ϕ(V_n^T, λ + ϵ)]),    (5)

where ϵ ∈ R^l is a learnable, zero-initialized vector that we introduce to show our result. Here, ϕ : R^2 → R^d is an FFN that is applied row-wise and ρ : R^{l×d} → R^d is a permutation-equivariant neural network, applied row-wise. One can recover the Laplacian PEs in Kreuzer et al. (2021) by setting ϵ = 0 and implementing ρ as first performing a sum over the first dimension of its input, resulting in a d-dimensional vector, and then subsequently applying an FFN to obtain a d-dimensional output vector. Note that then, Equation (5) forms a DeepSet (Zaheer et al., 2017) over the set of eigenvector components, paired with their (ϵ-perturbed) eigenvalues. While slightly deviating from the Laplacian PEs defined in Kreuzer et al. (2021), the ϵ vector is necessary to ensure that no spectral information is lost when applying ρ.

We will now show that LPE is sufficiently node- and adjacency-identifying. Further, this result holds irrespective of whether the Laplacian is normalized. As a result, LPE can be used with the 1-GT, k-GT, and (k, s)-GT; see Appendix F.3 for proof details.

Theorem 7. Structural embeddings with LPE as node-level PE are sufficiently node- and adjacency-identifying.
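As an illustration of Equation (5), the following sketch is not the paper's implementation; it assumes NumPy, instantiates ρ as the sum-then-FFN (DeepSet-style) variant described above, and treats ϕ and the final FFN as placeholder callables.

import numpy as np

# A minimal sketch of Equation (5); phi and rho_ffn stand in for learned FFNs.
def lpe(V, lam, eps, phi, rho_ffn):
    """V: (n, l) eigenvectors as columns, lam: (l,) eigenvalues,
    eps: (l,) learnable perturbation, phi: R^2 -> R^d, rho_ffn: R^d -> R^d."""
    out = []
    for i in range(V.shape[0]):
        # Pair each eigenvector component of node i with its (perturbed) eigenvalue.
        pairs = np.stack([V[i, :], lam + eps], axis=1)       # (l, 2)
        phi_out = np.stack([phi(p) for p in pairs])          # (l, d)
        out.append(rho_ffn(phi_out.sum(axis=0)))             # sum over l, then FFN
    return np.stack(out)                                     # (n, d) node-level PEs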
SPE We also show that SPE (Huang et al., 2024) is sufficiently node- and adjacency-identifying. Huang et al. (2024) define the SPE encodings as

    SPE(V, λ) = ρ([V diag(ϕ_1(λ)) V^T ... V diag(ϕ_n(λ)) V^T]),

where ϕ_1, ..., ϕ_n : R^l → R^l are equivariant FFNs and ρ : R^{n×l} → R^d is a permutation-equivariant neural network,
Table 2: Comparison of the (2, 1)-GT to Graphormer and TokenGT. Results on PCQM4Mv2 over a single random seed, as well as CS and PHOTO over 7 random seeds, where we also report the standard deviation. Results for baselines are taken from Kim et al. (2022). We highlight the best and second-best model on each dataset.

Model             PCQM4Mv2 Validation MAE ↓   CS Accuracy ↑    PHOTO Accuracy ↑
Graphormer        0.0864                      0.791 ± 0.015    0.894 ± 0.004
TokenGT           0.0910                      0.903 ± 0.004    0.949 ± 0.007
(2, 1)-GT + LPE   0.0870                      0.924 ± 0.008    0.933 ± 0.013
(2, 1)-GT + SPE   0.0888                      0.920 ± 0.002    0.933 ± 0.011

Table 3: Effect of pre-training on fine-tuning performance on ALCHEMY (12K). Pre-trained weights for both the (2, 1)-GT and the (3, 1)-GT are taken from the (2, 1)-GT model in Table 2. We report mean and standard deviation over 3 random seeds. Baseline results for SignNet, BasisNet, and SPE are taken from Huang et al. (2024). We highlight the best, second-best, and third-best model.

Model             Pre-trained   ALCHEMY (12K) MAE ↓
SignNet           ✗             0.113 ± 0.002
BasisNet          ✗             0.110 ± 0.001
SPE               ✗             0.108 ± 0.001
(2, 1)-GT + LPE   ✗             0.124 ± 0.001
(2, 1)-GT + LPE   ✓             0.101 ± 0.001
(3, 1)-GT + LPE   ✓             0.114 ± 0.001
(2, 1)-GT + SPE   ✗             0.112 ± 0.000
(2, 1)-GT + SPE   ✓             0.103 ± 0.002
(3, 1)-GT + SPE   ✓             0.108 ± 0.001

based on the eigenvectors and -values of the graph Laplacian: SignNet, BasisNet, and SPE (Huang et al., 2024); see Table 3 for results. Here, we find that even without pre-training, the (2, 1)-GT already performs well on ALCHEMY. Most notably, the (2, 1)-GT with SPE without pre-training already performs on par with SignNet. Nonetheless, we observe significant improvements through pre-training the (2, 1)-GT. Most notably, fine-tuning the pre-trained (2, 1)-GT with LPE or SPE beats all GNN baselines. Interestingly, training the (2, 1)-GT with SPE from scratch results in much better performance than training the (2, 1)-GT with LPE from scratch. However, once pre-trained, the (2, 1)-GT performs better with LPE than with SPE. Moreover, we observe no improvements when performing order transfer to the (3, 1)-GT. However, the (3, 1)-GT with LPE and pre-trained weights from the (2, 1)-GT beats SignNet (Lim et al., 2023), a strong GNN baseline. Further, the (3, 1)-GT with SPE performs on par with SPE in Huang et al. (2024), the best of our GNN baselines. Interestingly, order transfer improves over the (2, 1)-GT trained from scratch for both LPE and SPE. As a result, we hypothesize that LPE and SPE provide sufficient expressivity for this task but that pre-training is required to fully leverage their potential. Further, we hypothesize that the added (3, 1)-GT tokens lead to overfitting.

We conclude that pre-training can help our transformers' downstream performance. Further, order transfer is a promising technique for fine-tuning on downstream tasks, particularly those that benefit from higher-order representations. However, its benefits might be nullified in the presence of sufficiently expressive node-level PEs in combination with large-scale pre-training.

Molecular classification Here, we evaluate whether fine-tuning a large pre-trained transformer can be competitive with a strong GNN baseline on five small-scale molecular classification tasks, namely BBBP, BACE, CLINTOX, TOX21, and TOXCAST (Hu et al., 2020a). The number of molecules in any of these datasets is lower than 10K, making them an ideal choice to benchmark whether large-scale pre-trained models such as the (2, 1)-GT can compete with a task-specific GNN. Specifically, following Tönshoff et al. (2023), we aim for a fair comparison by carefully designing and hyper-parameter-tuning a GINE model (Xu et al., 2019b) with residual connections, batch normalization, dropout, GELU non-linearities and, most crucially, the same node-level PEs as the (2, 1)-GT. We followed Hu et al. (2020b) in choosing the GINE layer over other GNN layers due to its guaranteed 1-WL expressivity. It is worth noting that our GINE with LPE encodings significantly outperforms the GIN in Hu et al. (2020b) without pre-training on all five datasets, demonstrating the quality of this baseline. Interestingly, GINE with SPE as node-level PEs underperforms against the GIN in Hu et al. (2020b) on all but one dataset. In addition, we report the best pre-trained model for each task in Hu et al. (2020b). We fine-tune the (2, 1)-GT for 5 epochs and train the GINE model for 45 epochs. In addition, we also train the (2, 1)-GT for 45 epochs from scratch to study the impact of pre-training on the molecular classification tasks. Note that we do not pre-train our GINE model, as we want to evaluate whether a large pre-trained transformer can beat a small GNN model trained from scratch; see Table 4 for results.

First, we find that the pre-trained (2, 1)-GT with SPE is better than the corresponding GINE with SPE on all five datasets. The pre-trained (2, 1)-GT with LPE is better than GINE with LPE on all but one dataset. Moreover, the (2, 1)-GT with LPE as well as the (2, 1)-GT with SPE are each better than or on par with the best pre-trained GIN in Hu et al. (2020b) on two out of five datasets. When studying the impact of pre-training, we observe that pre-training leads mostly to performance improvements.
Table 4: Fine-tuning results on small OGB molecular datasets compared to a fully-equipped and -tuned GIN. Results over 6 random seeds. We highlight the best, second-best, and third-best model for each dataset. Ties are broken by smaller standard deviation. For comparison, we also report pre-training results from Hu et al. (2020b).

Table 5: Results on the BREC benchmark over a single seed. Baseline results are taken from Wang & Zhang (2023). We highlight the best model for each PE and category (excluding the 3-WL, which is not a model and merely serves as a reference point). We additionally report the results of Graphormer from Wang & Zhang (2023).

Model        PE    Basic   Regular   Extension   CFI   All
GINE         LPE   19      20        37          3     79
(2, 1)-GT    LPE   36      28        51          3     118
(3, 1)-GT    LPE   55      46        55          3     159
GINE         SPE   56      48        93          20    217
(2, 1)-GT    SPE   60      50        98          18    226
(3, 1)-GT    SPE   60      50        97          21    228
Graphormer   --    16      12        41          10    79
3-WL         --    60      50        100         60    270

The notable exception is on the TOXCAST dataset, where the pre-trained (2, 1)-GT underperforms even the GIN baselines without pre-training. Hence, we conclude that the pre-training/fine-tuning paradigm is viable for applying pure transformers such as the (2, 1)-GT to small-scale problems.

5.3. Expressivity tests

Since we only provide expressivity lower bounds, we empirically investigate the expressive power of the k-GT. To this end, we evaluate on the BREC benchmark, which offers fine-grained and scalable expressivity tests (Wang & Zhang, 2023). The benchmark comprises 400 graph pairs that range from being 1-WL to 4-WL indistinguishable and pose a challenge even to the most expressive models; see Table 5 for the expressivity results of the (2, 1)-GT and (3, 1)-GT on BREC. We mainly compare to our GINE baseline, Graphormer (Ying et al., 2021) as a graph transformer baseline, and the 3-WL as a potential expressivity upper bound. In addition, we compare our GINE model, the (2, 1)-GT, and the (3, 1)-GT for both LPE and SPE. We find that SPE encodings consistently lead to better performance. Further, we observe that increased expressivity also leads to better performance, as for both LPE and SPE, the (2, 1)-GT beats our GINE baseline, and the (3, 1)-GT beats the (2, 1)-GT across all tasks. Finally, we find that both the (2, 1)-GT and the (3, 1)-GT improve over Graphormer while still being outperformed by the 3-WL, irrespective of the choice of PE.

6. Conclusion

In this work, we propose a hierarchy of expressive pure transformers that are also feasible in practice. We improve existing pure transformers such as TokenGT (Kim et al., 2022) in several aspects, both theoretically and empirically. Theoretically, our hierarchy has stronger provable expressivity for each k and is more scalable. For example, our (2, 1)-GT and (3, 1)-GT have a provable expressivity strictly above the 1-WL but are also feasible in practice. Empirically, we verify our claims about practical feasibility and show that our transformers improve over existing pure transformers, closing the gap between pure transformers and graph transformers with strong graph inductive bias on the large-scale PCQM4Mv2 dataset. Further, we show that fine-tuning our pre-trained transformers is a feasible and resource-efficient approach for applying pure transformers to small-scale datasets. We discover that higher-order transformers can efficiently re-use pre-trained weights from lower-order transformers during fine-tuning, indicating a promising direction for utilizing higher-order expressivity to improve results on downstream tasks. Future work could explore aligning transformers to other variants of Weisfeiler–Leman. Finally, pre-training pure transformers on truly large-scale datasets such as those recently proposed by Beaini et al. (2024) could be a promising direction for graph transformers.
Bodnar, C., Frasca, F., Otter, N., Wang, Y. G., Liò, P., Montúfar, G., and Bronstein, M. M. Weisfeiler and Lehman go cellular: CW networks. In Advances in Neural Information Processing Systems, 2021.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.

Cai, J., Fürer, M., and Immerman, N. An optimal lower bound on the number of variables for graph identification. Combinatorica, 1992.

Glickman, D. and Yahav, E. Diffusing graph attention. ArXiv preprint, 2023.

Grohe, M. Descriptive Complexity, Canonisation, and Definable Graph Structure Theory. Cambridge University Press, 2017.

Grohe, M. The logic of graph neural networks. In Symposium on Logic in Computer Science, 2021.

He, X., Hooi, B., Laurent, T., Perold, A., LeCun, Y., and Bresson, X. A generalization of ViT/MLP-Mixer to graphs. In International Conference on Machine Learning, 2023.

Hendrycks, D. and Gimpel, K. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. ArXiv preprint, 2016.
Horn, R. A. and Johnson, C. R. Matrix Analysis, 2nd Edition. Cambridge University Press, 2012.

Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open graph benchmark: Datasets for machine learning on graphs. In Advances in Neural Information Processing Systems, 2020a.

Hu, W., Liu, B., Gomes, J., Zitnik, M., Liang, P., Pande, V. S., and Leskovec, J. Strategies for pre-training graph neural networks. In International Conference on Learning Representations, 2020b.

Hu, W., Fey, M., Ren, H., Nakata, M., Dong, Y., and Leskovec, J. OGB-LSC: A large-scale challenge for machine learning on graphs. In NeurIPS: Datasets and Benchmarks Track, 2021.

Huang, Y., Lu, W., Robinson, J., Yang, Y., Zhang, M., Jegelka, S., and Li, P. On the stability of expressive positional encodings for graphs. In International Conference on Learning Representations, 2024.

Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., and Carreira, J. Perceiver: General perception with iterative attention. In International Conference on Machine Learning, 2021.

Kim, J., Oh, S., and Hong, S. Transformers generalize deepsets and can be extended to graphs & hypergraphs. In Advances in Neural Information Processing Systems, 2021.

Kim, J., Nguyen, D., Min, S., Cho, S., Lee, M., Lee, H., and Hong, S. Pure transformers are powerful graph learners. In Advances in Neural Information Processing Systems, 2022.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.

Kreuzer, D., Beaini, D., Hamilton, W. L., Létourneau, V., and Tossou, P. Rethinking graph transformers with spectral attention. In Advances in Neural Information Processing Systems, 2021.

Lim, D., Robinson, J. D., Zhao, L., Smidt, T. E., Sra, S., Maron, H., and Jegelka, S. Sign and basis invariant networks for spectral graph representation learning. In International Conference on Learning Representations, 2023.

Lipman, Y., Puny, O., and Ben-Hamu, H. Global attention improves graph networks generalization. ArXiv preprint, 2020.

Ma, L., Lin, C., Lim, D., Romero-Soriano, A., Dokania, P. K., Coates, M., Torr, P. H. S., and Lim, S.-N. Graph inductive biases in transformers without message passing. In International Conference on Machine Learning, 2023.

Malkin, P. N. Sherali–Adams relaxations of graph isomorphism polytopes. Discrete Optimization, 2014.

Maron, H., Ben-Hamu, H., Serviansky, H., and Lipman, Y. Provably powerful graph networks. In Advances in Neural Information Processing Systems, 2019a.

Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. Invariant and equivariant graph networks. In International Conference on Learning Representations, 2019b.

Maron, H., Fetaya, E., Segol, N., and Lipman, Y. On the universality of invariant networks. In International Conference on Machine Learning, 2019c.

Méndez-Lucio, O., Nicolaou, C. A., and Earnshaw, B. MolE: A molecular foundation model for drug discovery. ArXiv preprint, 2022.

Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. Weisfeiler and Leman go neural: Higher-order graph neural networks. In AAAI Conference on Artificial Intelligence, 2019.

Morris, C., Rattan, G., and Mutzel, P. Weisfeiler and Leman go sparse: Towards higher-order graph embeddings. In Advances in Neural Information Processing Systems, 2020.

Morris, C., Rattan, G., Kiefer, S., and Ravanbakhsh, S. SpeqNets: Sparsity-aware permutation-equivariant graph networks. In International Conference on Machine Learning, 2022.

Morris, C., Lipman, Y., Maron, H., Rieck, B., Kriege, N. M., Grohe, M., Fey, M., and Borgwardt, K. Weisfeiler and Leman go machine learning: The story so far. Journal of Machine Learning Research, 2023.

Müller, L., Galkin, M., Morris, C., and Rampášek, L. Attending to graph transformers. Transactions on Machine Learning Research, 2024.

Puny, O., Lim, D., Kiani, B. T., Maron, H., and Lipman, Y. Equivariant polynomials for graph neural networks. In International Conference on Machine Learning, 2023.

Rampášek, L., Galkin, M., Dwivedi, V. P., Luu, A. T., Wolf, G., and Beaini, D. Recipe for a general, powerful, scalable graph transformer. In Advances in Neural Information Processing Systems, 2022.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

Shchur, O., Mumme, M., Bojchevski, A., and Günnemann, S. Pitfalls of graph neural network evaluation. ArXiv preprint, 2018.
Zhang, B., Luo, S., Wang, L., and He, D. Rethinking the expressive power of GNNs via graph biconnectivity. In International Conference on Learning Representations, 2023.
be the input tensor to ρ. Huang et al. (2024) propose to partition Q along the second axis into n matrices of size n × l and then
to apply a GIN to each of those n matrices in parallel where we use the adjacency matrix of the original graph. Concretely,
the GIN maps each n × l matrix to a matrix of shape n × d, where d is the output dimension of ρ. Finally, the output of
ρ is the sum of all n × d matrices. We adopt this implementation for our experiments. For prohibitively large graphs, we
propose the following modification to the above implementation of ρ. Specifically, we let V:m denote the matrix containing
as columns the eigenvectors corresponding to the m smallest eigenvalues. Then, we define
    Q_{:m} := [V diag(ϕ_1(λ)) V_{:m}^T ... V diag(ϕ_n(λ)) V_{:m}^T] ∈ R^{n×m×l}    (6)
Parameter                Value
Learning rate            2e-4
Weight decay             0.1
Attention dropout        0.1
Post-attention dropout   0.1
Batch size               256
# gradient steps         2M
# warmup steps           60K
Precision                bfloat16
where W ∈ R^{d(k²−k)/2 × d} is a learnable weight matrix projecting the concatenated edge embeddings to the target dimension d. To see that the above faithfully encodes the atomic type, recall the matrix K over {1, 2, 3}, determining the atomic type
d. To see that the above faithfully encodes the atomic type, recall the matrix K over {1, 2, 3}, determining the atomic type
that we introduced in Section 3.2. For undirected graphs, for all i ≥ j, Kij = 1 if E(vi , vj ) ̸= 0 and E(vi , vj ) ̸= s, Kij = 2
if E(vi , vj ) = s and Kij = 3 if E(vi , vj ) = 0. As a result, by including additional edge information into the k-GT, we
simultaneously obtain atomic type embeddings atp.
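To make the atomic-type matrix concrete, the following sketch is not from the paper; assuming NetworkX, it computes K for a k-tuple following the definition in Section 3.2, i.e., K_ij = 1 if (v_i, v_j) ∈ E(G), 2 if v_i = v_j, and 3 otherwise.

import numpy as np
import networkx as nx

# A minimal sketch (not from the paper) of the k x k atomic-type matrix K.
def atomic_type_matrix(G, v):
    k = len(v)
    K = np.full((k, k), 3, dtype=int)      # default: distinct, non-adjacent nodes
    for i in range(k):
        for j in range(k):
            if v[i] == v[j]:
                K[i, j] = 2                 # equal nodes
            elif G.has_edge(v[i], v[j]):
                K[i, j] = 1                 # adjacent nodes
    return K

G = nx.path_graph(3)                        # edges: (0, 1), (1, 2)
print(atomic_type_matrix(G, (0, 1, 1)))
# [[2 1 1]
#  [1 2 2]
#  [1 2 2]]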
C. Extended notation
The neighborhood of a vertex v in V (G) is denoted by N (v) := {u ∈ V (G) | (v, u) ∈ E(G)} and the degree of a vertex
v is |N (v)|. Two graphs G and H are isomorphic, and we write G ≃ H if there exists a bijection φ : V (G) → V (H)
preserving the adjacency relation, i.e., (u, v) is in E(G) if and only if (φ(u), φ(v)) is in E(H). Then φ is an isomorphism
between G and H. In the case of labeled graphs, we additionally require that ℓ(v) = ℓ(φ(v)) for v in V(G), and similarly for attributed graphs. We further define the atomic type atp : V(G)^k → N, for k > 0, such that atp(v) = atp(w) for v and w in V(G)^k if and only if the mapping φ : V(G)^k → V(G)^k where v_i ↦ w_i induces a partial isomorphism, i.e., we have v_i = v_j ⇐⇒ w_i = w_j and (v_i, v_j) ∈ E(G) ⇐⇒ (φ(v_i), φ(v_j)) ∈ E(G). Let M ∈ R^{n×p} and N ∈ R^{n×q} be two matrices; then

    [M N] ∈ R^{n×(p+q)}
denotes column-wise matrix concatenation. Further, let M ∈ R^{p×n} and N ∈ R^{q×n} be two matrices; then the vertical stacking of M on top of N,

    [M; N] ∈ R^{(p+q)×n},
denotes row-wise matrix concatenation. For a matrix X ∈ Rn×d , we denote with Xi the i-th row vector. In the case where
the rows of X correspond to nodes in a graph G, we use Xv to denote the row vector corresponding to the node v ∈ V (G).
D. Transformers
Here, we define the (standard) transformer (Vaswani et al., 2017), a stack of alternating blocks of multi-head attention and
fully-connected feed-forward networks. In each layer, t > 0, given token embeddings X(t−1) ∈ RL×d for L tokens, we
compute
    X^{(t)} := FFN(X^{(t-1)} + [h_1(X^{(t-1)}) ... h_M(X^{(t-1)})] W_O),    (7)
where [·] denotes column-wise concatenation of matrices, M is the number of heads, hi denotes the i-th transformer head,
WO ∈ RM dv ×d denotes a final projection matrix applied to the concatenated heads, and FFN denotes a feed-forward neural
network applied row-wise. We define the i-th head as
    h_i(X) := softmax((1/√d_k) (XW_{Q,i})(XW_{K,i})^T) XW_{V,i},    (8)
dk
where the softmax is applied row-wise and WQ,i , WK,i ∈ Rd×dk , WV,i ∈ Rd×dv , and dv and d are the head dimension
and embedding dimension, respectively. We omit layer indices t and optional bias terms for clarity.
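For illustration, the following compact sketch is not the paper's code; it assumes NumPy and omits the optional bias terms, as in the definition above. Per-head weights are passed in as (W_Q, W_K, W_V) triples, together with W_O and a feed-forward network ffn applied row-wise.

import numpy as np

# A minimal NumPy sketch of Equations (7) and (8).
def softmax(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)   # row-wise, numerically stable
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def head(X, W_Q, W_K, W_V):
    d_k = W_Q.shape[1]
    A = softmax((X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k))   # Eq. (8)
    return A @ (X @ W_V)

def transformer_layer(X, heads, W_O, ffn):
    H = np.concatenate([head(X, *w) for w in heads], axis=1)  # column-wise concat
    return ffn(X + H @ W_O)                                   # Eq. (7)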
E. Weisfeiler–Leman
Here, we discuss additional background for the Weisfeiler–Leman hierarchy. We begin by describing the Weisfeiler–Leman
algorithm, starting with the 1-WL. The 1-WL or color refinement is a well-studied heuristic for the graph isomorphism problem,
originally proposed by Weisfeiler & Leman (1968).2 Intuitively, the algorithm determines if two graphs are non-isomorphic
by iteratively coloring or labeling vertices. Formally, let G = (V, E, ℓ) be a labeled graph. In each iteration, t > 0, the 1-WL computes a vertex coloring C_t^1 : V(G) → N, depending on the coloring of the neighbors. That is, in iteration t > 0, we set

    C_t^1(v) := RELABEL(C_{t-1}^1(v), {{C_{t-1}^1(u) | u ∈ N(v)}}),

for all vertices v in V(G), where RELABEL injectively maps the above pair to a unique natural number, which has not been used in previous iterations. In iteration 0, the coloring is C_0^1 := ℓ. To test if two graphs G and H are non-isomorphic, we run
the above algorithm in “parallel” on both graphs. If the two graphs have a different number of vertices colored c in N at some
iteration, the 1-WL distinguishes the graphs as non-isomorphic. It is easy to see that the algorithm cannot distinguish all
non-isomorphic graphs (Cai et al., 1992). Several researchers, e.g., Babai (1979); Cai et al. (1992), devised a more powerful
generalization of the former, today known as the k-dimensional Weisfeiler–Leman algorithm (k-WL), operating on k-tuples
of vertices rather than single vertices.
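To complement the description above, the following compact sketch is not from the paper; assuming NetworkX, it runs 1-WL color refinement until the coloring is stable, realizing RELABEL by interning (old color, sorted neighbor colors) pairs.

import networkx as nx

# A minimal sketch of 1-WL color refinement (not the paper's implementation).
def wl1_colors(G, labels=None):
    colors = {v: (labels[v] if labels else 0) for v in G.nodes}
    while True:
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in G.neighbors(v))))
            for v in G.nodes
        }
        relabel = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {v: relabel[signatures[v]] for v in G.nodes}
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors          # refinement is stable
        colors = new_colors

# Two graphs are distinguished if their multisets of stable colors differ
# (run the refinement on their disjoint union to compare them).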
t = 0, the tuples v and w in V(G)^k get the same color if they have the same atomic type, i.e., C_0^k(v) := atp(v). Then, for each iteration, t > 0, C_t^k is defined by

    C_t^k(v) := RELABEL(C_{t-1}^k(v), M_t(v)),    (9)

with M_t(v) the multiset

    M_t(v) := ({{C_{t-1}^k(ϕ_1(v, w)) | w ∈ V(G)}}, ..., {{C_{t-1}^k(ϕ_k(v, w)) | w ∈ V(G)}}),    (10)

and where

    ϕ_j(v, w) := (v_1, ..., v_{j-1}, w, v_{j+1}, ..., v_k).

That is, ϕ_j(v, w) replaces the j-th component of the tuple v with the vertex w. Hence, two tuples are adjacent or j-neighbors if they are different in the j-th component (or equal, in the case of self-loops). Thus, two tuples v and w with the same color in iteration (t − 1) get different colors in iteration t if there exists a j in [k] such that the number of j-neighbors of v and w, respectively, colored with a certain color is different.

We run the k-WL algorithm until convergence, i.e., until for t in N,

    C_t^k(v) = C_t^k(w) ⇐⇒ C_{t+1}^k(v) = C_{t+1}^k(w),

for all v and w in V(G)^k, holds.
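As a concrete illustration of Equations (9) and (10), the following sketch is not the paper's implementation; it runs a fixed number of k-WL refinement rounds on an unlabeled graph, reusing the atomic_type_matrix sketch from Appendix B for the initial colors.

import itertools
import networkx as nx

# A minimal sketch of k-WL refinement over all k-tuples (not from the paper).
def kwl_colors(G, k, rounds):
    tuples = list(itertools.product(G.nodes, repeat=k))
    colors = {v: atomic_type_matrix(G, v).tobytes() for v in tuples}   # C_0^k = atp
    for _ in range(rounds):
        new = {}
        for v in tuples:
            M = tuple(
                tuple(sorted(colors[v[:j] + (w,) + v[j + 1:]] for w in G.nodes))
                for j in range(k)
            )                                   # Eq. (10): multisets over j-neighbors
            new[v] = (colors[v], M)             # Eq. (9): pair, relabeled below
        relabel = {sig: i for i, sig in enumerate(sorted(set(new.values())))}
        colors = {v: relabel[new[v]] for v in tuples}
    return colors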
Similarly to the 1-WL, to test whether two graphs G and H are non-isomorphic, we run the k-WL in “parallel” on both
graphs. Then, if the two graphs have a different number of vertices colored c, for c in N, the k-WL distinguishes the graphs as
non-isomorphic. By increasing k, the algorithm gets more powerful in distinguishing non-isomorphic graphs, i.e., for each
k ≥ 2, there are non-isomorphic graphs distinguished by (k + 1)-WL but not by k-WL (Cai et al., 1992). In the following, we
define some variants of the k-WL.
    A^{(k,j,γ)}_{il} :=
        { 1  if γ = 1  and ∃w ∈ V(G) : u_l = ϕ_j(u_i, w) ∧ adj(u_{ij}, w),
        { 1  if γ = −1 and ∃w ∈ V(G) : u_l = ϕ_j(u_i, w) ∧ ¬adj(u_{ij}, w),
        { 0  otherwise,    (11)

where u_i denotes the i-th k-tuple in a fixed but arbitrary ordering over V(G)^k, u_{ij} denotes the j-th node of u_i, and where γ controls whether the generalized adjacency matrix considers (γ = 0) all j-neighbors, (γ > 0) j-neighbors where the swapped nodes are adjacent in G, or (γ < 0) j-neighbors where the swapped nodes are not adjacent in G. With the generalized adjacency matrix as defined above, we can represent the local and global j-neighborhood adjacency defined for the δ-k-WL with A^{(k,j,1)} and A^{(k,j,−1)}, respectively. Further, the j-neighborhood adjacency of the k-WL can be described via A^{(k,j,1)} + A^{(k,j,−1)}, and the local j-neighborhood adjacency of the δ-k-LWL can be described with A^{(k,j,1)}.
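As an illustration of Equation (11), the following sketch is not from the paper; assuming NetworkX, it materializes A^{(k,j,γ)} for small graphs over a fixed ordering of V(G)^k (with j indexed from 0).

import itertools
import numpy as np
import networkx as nx

# A minimal sketch (not from the paper) of the generalized adjacency matrix.
def generalized_adjacency(G, k, j, gamma):
    tuples = list(itertools.product(G.nodes, repeat=k))
    index = {v: i for i, v in enumerate(tuples)}
    A = np.zeros((len(tuples), len(tuples)), dtype=int)
    for i, u in enumerate(tuples):
        for w in G.nodes:
            l = index[u[:j] + (w,) + u[j + 1:]]     # u_l = phi_j(u_i, w)
            adjacent = G.has_edge(u[j], w)           # adj(u_{ij}, w)
            if (gamma == 1 and adjacent) or (gamma == -1 and not adjacent):
                A[i, l] = 1
    return A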
F. Structural embeddings
Before we give the missing proofs from Section 4, we develop a theoretical framework to analyze structural embeddings.
for an arbitrary but fixed ϵ > 0. Let P be node or adjacency-identifying with projection matrices WQ , WK ∈ Rd×d and let
    P̃ = (1/√d_k) (PW_Q)(PW_K)^T, and
    Q̃ = (1/√d_k) (QW_Q)(QW_K)^T.
Then, there exists a monotonic strictly increasing function f such that
||P̃ − Q̃||F < f (ϵ).
Proof. Our goal is to show that if the error between P and Q is bounded, so is the error between P̃ and Q̃. We now show
that this error can be described by a monotonic strictly increasing function f , i.e., if ϵ1 < ϵ2 , then f (ϵ1 ) < f (ϵ2 ). We will
first prove the existence of f .
First note that we can write

    P̃ − Q̃ = (1/√d_k) · (PW_Q(PW_K − QW_K)^T + (PW_Q − QW_Q)(QW_K)^T).
Note further that we are guaranteed that

    ||P||_F > 0,    (12)
    ||W_Q||_F > 0,    (13)
    ||W_K||_F > 0,    (14)

since otherwise at least one of the above matrices is zero, in which case P̃ = 0, which is not node- or adjacency-identifying in general, a contradiction. Now we can write
    ||P̃ − Q̃||_F = (1/√d_k) · || PW_Q(PW_K − QW_K)^T + (PW_Q − QW_Q)(QW_K)^T ||_F
        (a) ≤ (1/√d_k) · ( ||PW_Q(PW_K − QW_K)^T||_F + ||(PW_Q − QW_Q)(QW_K)^T||_F )
        (b) ≤ (1/√d_k) · ( ||PW_Q||_F · ||(PW_K − QW_K)^T||_F + ||(QW_K)^T||_F · ||PW_Q − QW_Q||_F )
        (c) ≤ (1/√d_k) · ( ||PW_Q||_F · ||PW_K − QW_K||_F + ||QW_K||_F · ||PW_Q − QW_Q||_F )
        (d) ≤ (1/√d_k) · ( ||P||_F ||W_Q||_F · ||W_K||_F ||P − Q||_F + ||Q||_F ||W_K||_F ||W_Q||_F · ||P − Q||_F )
            = (1/√d_k) · ||W_Q||_F ||W_K||_F · ||P − Q||_F · ( ||P||_F + ||Q||_F )
        (e) < (ϵ/√d_k) · ||W_Q||_F ||W_K||_F · ( ||P||_F + ||Q||_F ).
Here, we used (a) the triangle inequality, (b) the Cauchy-Schwarz inequality, (c) the fact that for any matrix X, ||X||_F = ||X^T||_F, (d) again Cauchy-Schwarz, and finally (e) the lemma statement, namely that ||P − Q||_F < ϵ, combined with the facts in Equations (12), (13), and (14), guaranteeing that ||P̃ − Q̃||_F > 0. We set

    f(ϵ) := (ϵ/√d_k) · ||W_Q||_F ||W_K||_F · (||P||_F + ||Q||_F),

which is clearly strictly monotonically increasing, since the norms as well as 1/√d_k are non-negative and Equations (12), (13), and (14) ensure that f(ϵ) > 0. As a result, we can now write

    ||P̃ − Q̃||_F < f(ϵ).

This concludes the proof.
We use sufficiently node or adjacency-identifying matrices for approximately recovering weighted indicator matrices, which
we define next.
Definition 10 (Weighted indicators). Let x = (x1 , . . . , xn ) ∈ {0, 1}n be an n-dimensional binary vector. We call
    x̃ := x / (Σ_{i=1}^n x_i),
the weighted indicator vector of x. Further, let X ∈ {0, 1}n×n be a binary matrix. Let now X̃ ∈ Rn×n be a matrix such that
the i-th row of X̃ is the weighted indicator vector of the i-th row of X. We call X̃ the weighted indicator matrix of X.
The following lemma ties (sufficiently) node or adjacency-identifying matrices and weighted indicators together.
Lemma 11. Let P̃ ∈ R^{n×n} be a matrix and let X be a binary matrix with weighted indicator matrix X̃, such that for every i, j ∈ [n],

    P̃_{ij} = max_k P̃_{ik}

if, and only if, X_{ij} = 1. Then, for a matrix Q̃ ∈ R^{n×n} and all ϵ > 0, there exist a δ > 0 and a b > 0 such that if

    ||P̃ − Q̃||_F < δ,

then

    ||softmax(b · Q̃) − X̃||_F < ϵ,

where softmax is applied row-wise.
Proof. We begin by reviewing how the softmax acts on a vector z = (z1 , . . . , zn ) ∈ Rn . Let zmax := maxi zi be the
maximum value in z. Further, let x = (x1 , . . . , xn ) ∈ {0, 1}n be a binary vector such that
    x_i := 1 if z_i = z_max, and x_i := 0 otherwise.
Let us now generalize this to matrices. Specifically, let P̃ ∈ Rn×n be a matrix, let X be a binary matrix with weighted
indicator matrix X̃ such that for every i, j ∈ [n],
which follows from the fact that the softmax is applied independently to each row and each row of X̃ is a weighted indicator
vector. We now show the proof statement. First, note that for any b > 0 we can choose a δ < f(b), where f : R → R is some strictly monotonically decreasing function of b that shrinks faster than linearly, e.g., f(b) = 1/b². This function is well-defined since by assumption b > 0. As a result, an increase in b implies a non-linearly growing decrease of

    ||softmax(b · P̃) − softmax(b · Q̃)||_F

and thus

    lim_{b→∞} softmax(b · Q̃) = X̃,
Figure 2: Visual explanation of the "opposing forces" in Lemma 11. In (a), before softmax: we consider three numbers x1, x2, and x3, where x2 and x3 are less than δ apart. In (b), after softmax: an increase in b pushes the maximum value x3 away from x1 and x2. However, the approximation with δ acts stronger. As a result, x1 gets pushed closer to 0, but x2 and x3 get pushed closer together. In (c), after softmax: further increasing b makes x1 converge to 0, but the approximation with δ pushes x2 and x3 closer together, and the softmax maps both values approximately to 1/2. Hence, with a sufficiently close approximation, we can approximate the weighted indicator matrix X̃.
and we have that for all ϵ > 0 there exist a b > 0 and a δ < f(b) such that

    ||softmax(b · Q̃) − X̃||_F < ϵ.

In the above proof, the approximation with δ and the scaling with b act as two "opposing forces". The proof then chooses b such that ϵ acts stronger than b and hence, with b → ∞, the approximation converges to the weighted indicator matrix; see Figure 2 for a visual explanation of this concept.
From the definitions of sufficiently node and adjacency-identifying matrices, we can prove the following statement.
Lemma 12. Let G be a graph with adjacency matrix A(G) and degree matrix D. Let Q be a sufficiently adjacency-identifying
matrix. Then, there exists a b > 0 and projection matrices WQ , WK ∈ Rd×d with
    Q̃ := (1/√d_k) (QW_Q)(QW_K)^T

such that

    ||softmax(b · Q̃) − D^{-1} A(G)||_F < ϵ,
Proof. Note that D−1 A(G) is a weighted indicator matrix, since A(G) has binary entries and left-multiplication with D−1
results in dividing every element in row i of A(G) by the number of 1’s in row i of A(G), or formally,
    [D^{-1} A(G)]_i = [ A(G)_{i1} / Σ_{j=1}^n A(G)_{ij}  ...  A(G)_{in} / Σ_{j=1}^n A(G)_{ij} ].
Further, since Q is sufficiently adjacency-identifying, there exists an adjacency-identifying matrix P and projection matrices
WQ and WK , such that for
    P̃ = (1/√d_k) (PW_Q)(PW_K)^T,
it holds that

    P̃_{ij} = max_k P̃_{ik},

if and only if A(G)_{ij} = 1. Recall the definition of sufficiently adjacency-identifying, namely that we can choose projection
matrices such that
    ||P − Q||_F < ϵ_0,

for all ϵ_0. Then, according to Lemma 9, for the matrix

    Q̃ = (1/√d_k) (QW_Q)(QW_K)^T,

it holds that

    ||P̃ − Q̃||_F < f(ϵ_0),
where f is a strictly monotonically increasing function of ϵ0 . We can then apply Lemma 11 to P̃, Q̃, A(G) as the binary
matrix and D−1 A(G) as its weighted indicator matrix and, for every ϵ, choose ϵ0 small enough such that there exists a b > 0
with
    ||softmax(b · Q̃) − D^{-1} A(G)||_F < ϵ.
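As a numerical illustration of this statement, the following sketch is not part of the paper; assuming NumPy and NetworkX, it uses the negative graph Laplacian as the pre-softmax score matrix, since its rows attain their maxima exactly at adjacent nodes (as used in the Laplacian-based proofs later in this appendix), and checks that softmax(b · (−L)) approaches D^{-1} A(G) as b grows.

import numpy as np
import networkx as nx

# A minimal numerical check (not from the paper).
def softmax(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

G = nx.cycle_graph(5)
A = nx.to_numpy_array(G)
L = np.diag(A.sum(axis=1)) - A                 # graph Laplacian D - A
target = A / A.sum(axis=1, keepdims=True)      # D^{-1} A(G)

for b in (1.0, 10.0, 100.0):
    err = np.linalg.norm(softmax(b * (-L)) - target)
    print(b, err)                              # Frobenius error shrinks as b grows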
Finally, we generalize the above result to arbitrary k via the generalized adjacency matrix.
Lemma 13 (Approximating the generalized adjacency matrix). Let G be a graph with n nodes and let k > 1. Let further Q ∈ R^{n×d/k} be sufficiently node- and adjacency-identifying structural embeddings. Let v := (v_1, ..., v_k) ∈ V(G)^k be the i-th and let u := (u_1, ..., u_k) ∈ V(G)^k be the l-th k-tuple in a fixed but arbitrary ordering over V(G)^k. Let Ã^{(k,j,γ)} denote the weighted indicator matrix of the generalized adjacency matrix in Equation (11). Let Z ∈ R^{n^k × n^k} be the unnormalized attention matrix such that

    Z_{il} := (1/√d_k) [Q(v_j)]_{j=1}^k W_Q ([Q(u_j)]_{j=1}^k W_K)^T,

with projection matrices W_Q, W_K ∈ R^{d×d}. Then, for every j ∈ [k] and every γ ∈ {−1, 1}, there exist projection matrices W_Q, W_K such that for all i ∈ [n^k],

    ||softmax([Z_{i1} ... Z_{in^k}]) − Ã^{(k,j,γ)}_i||_F < ε,
Proof. Our proof strategy is to first construct the unnormalized attention matrix from a node- and adjacency-identifying
matrix and then to invoke Lemma 11 to relax the approximation to the sufficiently node- and adjacency-identifying Q.
First, by definition of sufficiently node and adjacency-identifying matrices, the existence of Q implies the existence of a node
and adjacency-identifying matrix P. We will now give projection matrices WQ and WK such that
    ||softmax([Z*_{i1} ... Z*_{in^k}]) − Ã^{(k,j,γ)}_i||_F < ε.    (15)
and

    W_K = [W^{K,1}; ...; W^{K,k}],

where for all j ∈ [k], Q(v_j) is projected by W^{Q,j} and Q(u_j) is projected by W^{K,j}. We further expand the sub-matrices of each W^{Q,j} and W^{K,j}, writing

    W^{Q,j} = [W_N^{Q,j}; W_A^{Q,j}]

and

    W^{K,j} = [W_N^{K,j}; W_A^{K,j}].
Since P is node-identifying, there exist projection matrices W_N^{Q,*} and W_N^{K,*} such that

    p_{ilj} := (1/√d_k) P_* W_N^{Q,*} (P_* W_N^{K,*})^T

is maximal if and only if u_{i,j} = u_{l,j} is the same node. Further, since P is adjacency-identifying, there exist projection matrices W_A^{Q,*} and W_A^{K,*} such that

    q_{ilj} := (1/√d_k) P_* W_A^{Q,*} (P_* W_A^{K,*})^T

is maximal if and only if nodes u_{i,j} and u_{l,j} share an edge in G. Consequently, we also know that then −q_{ilj} is maximal if and only if nodes u_{i,j} and u_{l,j} do not share an edge in G. Note that because we assume that G has no self-loops, q_{iij} = 0 for all i ∈ [n] and all j ∈ [k]. Now, let us write out the unnormalized attention score Z*_{il} as
all i ∈ [n] and all j ∈ [k]. Now, let us write out the unnormalized attention score Z∗il as
$$\begin{aligned}
Z^{*}_{il} &= \frac{1}{\sqrt{d_k}} \Big(\big[P(v_j)\big]_{j=1}^{k} W_Q\Big) \Big(\big[P(u_j)\big]_{j=1}^{k} W_K\Big)^{\mathsf{T}} \\
&= \frac{1}{\sqrt{d_k}} \sum_{j=1}^{k} P(v_j) W_{Q,j} \big(P(u_j) W_{K,j}\big)^{\mathsf{T}} \\
&= \frac{1}{\sqrt{d_k}} \sum_{j=1}^{k} P(v_j) W_N^{Q,j} \big(P(u_j) W_N^{K,j}\big)^{\mathsf{T}} + \frac{1}{\sqrt{d_k}} \sum_{j=1}^{k} P(v_j) W_A^{Q,j} \big(P(u_j) W_A^{K,j}\big)^{\mathsf{T}},
\end{aligned}$$
where in the second line we simply write out the dot product and in the last line we split the sum to distinguish node- and adjacency-identifying terms.
Now, for all $\gamma$ and a fixed $j$, we set $W_N^{Q,j} = 0$ and $W_N^{K,j} = 0$, and $W_N^{Q,o} = b \cdot W_N^{Q,*}$ and $W_N^{K,o} = W_N^{K,*}$ for $o \neq j$, for some $b > 0$ that we need to keep arbitrary in order to use it later with Lemma 11.
We now distinguish between the different choices of $\gamma$. In the first case, let $\gamma > 0$. Then, we set $W_A^{Q,j} = W_A^{Q,*}$ and $W_A^{K,j} = W_A^{K,*}$. In the second case, let $\gamma < 0$. Then, we set $W_A^{Q,j} = -W_A^{Q,*}$ and $W_A^{K,j} = W_A^{K,*}$. In both cases, we set $W_A^{Q,o} = 0$ for $o \neq j$.
Then, in all cases, for tuples $u_i$ and $u_l$,
$$Z^{*}_{il} = \gamma \cdot q_{ilj} + b \cdot \sum_{o \neq j} p_{ilo}. \tag{16}$$
Note that $b$ is the same for all pairs $u_i, u_l$ and the dot product is positive. We again distinguish between two cases. In the first case, let $\gamma > 0$. Then, the above sum attains its maximum value if and only if
$$\forall o \neq j : u_{i,o} = u_{l,o} \;\wedge\; A_{ilj} = 1.$$
Now, since $u_{i,o}$ and $u_{l,o}$ denote the nodes at the $o$-th component of $u_i$ and $u_l$, respectively, the above statement is equivalent to saying that $u_l$ is a $j$-neighbor of $u_i$ and that $u_{l,j}$ is adjacent to $u_{i,j}$. In the second case, let $\gamma < 0$. Then, the above sum attains its maximum value if and only if
$$\forall o \neq j : u_{i,o} = u_{l,o} \;\wedge\; A_{ilj} = 0.$$
Again, since $u_{i,o}$ and $u_{l,o}$ denote the nodes at the $o$-th component of $u_i$ and $u_l$, respectively, the above statement is equivalent to saying that $u_l$ is a $j$-neighbor of $u_i$ and that $u_{l,j}$ is not adjacent to $u_{i,j}$.
Now, recall the generalized adjacency matrix in Equation (11) as
$$A^{(k,j,\gamma)}_{il} := \begin{cases} 1 & \gamma = 1 \;\wedge\; \exists w \in V(G) : u_l = \phi_j(u_i, w) \wedge \operatorname{adj}(u_{ij}, w) \\ 1 & \gamma = -1 \;\wedge\; \exists w \in V(G) : u_l = \phi_j(u_i, w) \wedge \neg\operatorname{adj}(u_{ij}, w) \\ 0 & \text{else.} \end{cases}$$
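As an illustration of this definition only, the following sketch builds $A^{(k,j,\gamma)}$ for $k = 2$ on a small graph; the toy graph and the helper `generalized_adjacency` are our own choices, not code from the paper, and positions are 0-indexed.

```python
# Toy construction of the generalized adjacency matrix A^(k,j,gamma) from
# Equation (11) for k = 2; phi_j(u, w) replaces the j-th entry of the tuple u by w.
import itertools
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])                               # path graph on 3 nodes
n, k = A.shape[0], 2
tuples = list(itertools.product(range(n), repeat=k))    # fixed ordering of V(G)^k
index = {u: i for i, u in enumerate(tuples)}

def generalized_adjacency(j, gamma):
    M = np.zeros((n ** k, n ** k))
    for i, u in enumerate(tuples):
        for w in range(n):
            v = list(u); v[j] = w                       # v = phi_j(u, w)
            adjacent = A[u[j], w] == 1                  # adj(u_{ij}, w)
            if (gamma == 1 and adjacent) or (gamma == -1 and not adjacent):
                M[i, index[tuple(v)]] = 1
    return M

print(generalized_adjacency(j=0, gamma=1))              # j-neighbors via adjacent w
print(generalized_adjacency(j=0, gamma=-1))             # j-neighbors via non-adjacent w
```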
Then, we can say that for our construction of $Z^{*}$, for all $i, l \in [n^k]$, the score $Z^{*}_{il}$ attains the maximum of its row if and only if $A^{(k,j,\gamma)}_{il} = 1$.
Proof. Note that the graph Laplacian is node-identifying, since all off-diagonal elements are $\leq 0$ and the diagonal is always $> 0$, as we consider graphs $G$ without self-loops and without isolated nodes. Note further that if there exist matrices $W_Q$ and $W_K$ such that
$$\frac{1}{\sqrt{d_k}} (PW_Q)(PW_K)^{\mathsf{T}} = L,$$
then there also exists a matrix $W^{*}_Q = -W_Q$ such that
$$\frac{1}{\sqrt{d_k}} (PW^{*}_Q)(PW_K)^{\mathsf{T}} = -L.$$
Now, note that the negative graph Laplacian is $-L = A(G) - D$. Because we subtract the degree matrix from the adjacency matrix, the maximum element of each row of the negative graph Laplacian is 1. Since $D$ is diagonal, for each row $i$ and each column $j$ with $i \neq j$,
$$-L_{ij} = 1 \iff A(G)_{ij} = 1.$$
Further, in the case where $i = j$,
$$-L_{ij} \leq 0,$$
since we consider graphs $G$ without self-loops. Hence, we obtain that
$$-L_{ij} = \max_k\,(-L_{ik}) \iff A(G)_{ij} = 1.$$
In particular, the above applies to any matrix $P$ with
$$L = PP^{\mathsf{T}}.$$
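A quick numerical check of the argument above, on a toy graph of our choosing: the row-wise maxima of $-L = A(G) - D$ sit exactly at the adjacent off-diagonal entries, so thresholding each row at its maximum recovers the adjacency matrix.

```python
# Illustrative only: row-wise maxima of the negative Laplacian recover A(G).
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A               # graph Laplacian D - A(G)
negL = -L

row_max = negL.max(axis=1, keepdims=True)
recovered = (negL == row_max).astype(float)  # 1 exactly where a row attains its max
print(np.array_equal(recovered, A))          # True: row-wise maxima recover A(G)
```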
23
Aligning Transformers with Weisfeiler–Leman
Proof. Our goal is to show that there exist matrices $W_Q$ and $W_K$ such that
$$\frac{1}{\sqrt{d_k}} (PW_Q)(PW_K)^{\mathsf{T}} = L = PQ^{\mathsf{T}}.$$
Proof. Our goal is to show that there exist matrices $W_Q$ and $W_K$ such that
$$\frac{1}{\sqrt{d_k}} \big([P \; Q]\,W_Q\big)\big([P \; Q]\,W_K\big)^{\mathsf{T}} = L.$$
To this end, we set
$$W_Q = \begin{bmatrix} \sqrt{d_k}\, I \\ 0 \end{bmatrix} \quad\text{and}\quad W_K = \begin{bmatrix} 0 \\ I \end{bmatrix},$$
so that
$$\frac{1}{\sqrt{d_k}} \big([P \; Q]\,W_Q\big)\big([P \; Q]\,W_K\big)^{\mathsf{T}} = PQ^{\mathsf{T}} = L.$$
For the next two lemmas, we first briefly define permutation matrices as binary matrices $M$ that are right-stochastic, i.e., whose rows sum to 1, and such that each column of $M$ is 1 at exactly one position and 0 elsewhere. It is well known that for permutation matrices $M$ it holds that $M^{\mathsf{T}}M = MM^{\mathsf{T}} = I$, where $I$ is the identity matrix. We now state the following lemmas.
Lemma 17. Let $G$ be a graph with graph Laplacian $L$ and let $L = U\Sigma U^{\mathsf{T}}$ be the eigendecomposition of $L$. Then, for any permutation matrix $M$, the matrix $U\Sigma^{1/2}M$ is adjacency-identifying.
Lemma 18. Let $G$ be a graph with graph Laplacian $L$ and normalized graph Laplacian $\tilde{L}$. Let $\tilde{L} = U\Sigma U^{\mathsf{T}}$ be the eigendecomposition of $\tilde{L}$ and let $D$ denote the degree matrix. Then, for any permutation matrix $M$, the matrix $D^{1/2}U\Sigma^{1/2}M$ is adjacency-identifying.
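The factorizations behind Lemmas 17 and 18 can be checked numerically; in the sketch below the graph and the permutation matrix are arbitrary choices made for illustration: $U\Sigma^{1/2}M$ reproduces $L$ as a Gram matrix, and rescaling the eigenvectors of the normalized Laplacian by $D^{1/2}$ again yields $D - A(G)$, i.e., $L$.

```python
# Sanity checks for Lemmas 17 and 18 on a small graph (sketch only).
import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

# Lemma 17: P = U Sigma^{1/2} M satisfies P P^T = L for any permutation M.
lam, U = np.linalg.eigh(L)
M = np.eye(4)[[2, 0, 3, 1]]                              # an arbitrary permutation matrix
P = U @ np.diag(np.sqrt(np.clip(lam, 0, None))) @ M
print(np.allclose(P @ P.T, L))                           # True

# Lemma 18 analogue: eigendecompose the normalized Laplacian, rescale by D^{1/2}.
Dsqrt = np.diag(np.sqrt(A.sum(axis=1)))
Ln = np.eye(4) - np.linalg.inv(Dsqrt) @ A @ np.linalg.inv(Dsqrt)
mu, V = np.linalg.eigh(Ln)
P2 = Dsqrt @ V @ np.diag(np.sqrt(np.clip(mu, 0, None)))
print(np.allclose(P2 @ P2.T, np.diag(A.sum(axis=1)) - A))  # = D - A(G), i.e., L
```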
24
Aligning Transformers with Weisfeiler–Leman
F.3. LPE
Here, we show that the node-level PEs LPE as defined in Equation (5) are sufficiently node- and adjacency-identifying. We
begin with the following useful lemma.
Lemma 19. Let $P \in \mathbb{R}^{n \times d}$ be structural embeddings with sub-matrices $Q_1 \in \mathbb{R}^{n \times d'}$ and $Q_2 \in \mathbb{R}^{n \times d''}$, i.e., $P = [Q_1 \; Q_2]$. Then, if $Q_1$ or $Q_2$ is (sufficiently) node-identifying, $P$ is (sufficiently) node-identifying, and if $Q_1$ or $Q_2$ is (sufficiently) adjacency-identifying, $P$ is (sufficiently) adjacency-identifying.
Proof. Let a sub-matrix $Q$ be (sufficiently) node- or adjacency-identifying, and let $W^{Q,Q}$ and $W^{K,Q}$ be the corresponding projection matrices with which $Q$ is (sufficiently) node- or adjacency-identifying. Then, if $Q = Q_1$, we define
$$W^Q = \begin{bmatrix} W^{Q,Q} & 0 \\ 0 & 0 \end{bmatrix} \quad\text{and}\quad W^K = \begin{bmatrix} W^{K,Q} & 0 \\ 0 & 0 \end{bmatrix},$$
and if $Q = Q_2$, we define
$$W^Q = \begin{bmatrix} 0 & 0 \\ W^{Q,Q} & 0 \end{bmatrix} \quad\text{and}\quad W^K = \begin{bmatrix} 0 & 0 \\ W^{K,Q} & 0 \end{bmatrix}.$$
In both cases,
$$PW^Q = \big[QW^{Q,Q} \;\; 0\big] \quad\text{and}\quad PW^K = \big[QW^{K,Q} \;\; 0\big],$$
and consequently,
$$PW^Q \big(PW^K\big)^{\mathsf{T}} = QW^{Q,Q} \big(QW^{K,Q}\big)^{\mathsf{T}}.$$
Hence if Q1 is (sufficiently) node-identifying, then P is (sufficiently) node-identifying. If Q2 is (sufficiently) node-
identifying, then P is (sufficiently) node-identifying. If Q1 is (sufficiently) adjacency-identifying, then P is (sufficiently)
adjacency-identifying. If Q2 is (sufficiently) adjacency-identifying, then P is (sufficiently) adjacency-identifying. If Q1 is
(sufficiently) node-identifying and Q2 is (sufficiently) adjacency-identifying or vice versa, then P is (sufficiently) node and
adjacency-identifying. This concludes the proof.
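The block construction in this proof can be checked directly. The sketch below is illustrative only, with random matrices standing in for $Q_1$, $Q_2$, and the projections: zero-padding the projection matrices makes $P = [Q_1 \; Q_2]$ reproduce the score matrix of the sub-matrix $Q_1$ alone.

```python
# Sketch of the zero-padded block projections used in the proof of Lemma 19.
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2 = 5, 3, 4
Q1, Q2 = rng.normal(size=(n, d1)), rng.normal(size=(n, d2))
P = np.hstack([Q1, Q2])                       # P = [Q1 Q2]

WQQ = rng.normal(size=(d1, d1))               # projections with which Q1 "identifies"
WKQ = rng.normal(size=(d1, d1))

# Embed the Q1-projections into the larger space, zeroing out the Q2 block.
WQ = np.zeros((d1 + d2, d1 + d2)); WQ[:d1, :d1] = WQQ
WK = np.zeros((d1 + d2, d1 + d2)); WK[:d1, :d1] = WKQ

lhs = (P @ WQ) @ (P @ WK).T
rhs = (Q1 @ WQQ) @ (Q1 @ WKQ).T
print(np.allclose(lhs, rhs))                  # True: P inherits Q1's score matrix
```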
Theorem 20 (Slightly more general than Theorem 7 in main text). Structural embeddings with LPE according to Equation (5)
as node-level PEs with embedding dimension d are sufficiently node- and adjacency-identifying, irrespective of whether the
underlying Laplacian is normalized or not. Further, for graphs with n nodes, the statement holds for d ≥ (2n + 1).
Proof. We begin by noting that the domain of LPE is compact, since eigenvectors are unit-norm and eigenvalues are bounded by twice the maximum node degree for graphs without self-loops (Anderson & Morley, 1985). Further, the domain of the structural embeddings is compact, since the domain of LPE is compact and the node degrees are finite; hence, the domain of deg is compact.
Let $G$ be a graph with graph Laplacian $L$. Let $L = U\Sigma U^{\mathsf{T}}$ be the eigendecomposition of $L$, where the $i$-th column of $U$ contains the $i$-th eigenvector, denoted $v_i$, and the $i$-th diagonal entry of $\Sigma$ contains the $i$-th eigenvalue, denoted $\lambda_i$.
Recall that structural embeddings with LPE as node-level PEs are defined as
$$P(v) = \mathrm{FFN}\big(\deg(v) + \mathrm{LPE}(v)\big),$$
for all $v \in V(G)$. Since the domains of both deg and LPE are compact, there exist parameters for both of these embeddings such that, without loss of generality, we may assume that for $v \in V(G)$,
$$P(v) = \big[\deg(v) \;\; \mathrm{LPE}(v)\big],$$
and that deg and LPE are instead embedded into some smaller $p$-dimensional and $s$-dimensional sub-spaces, respectively, where it holds that $d = p + s$.
To show node- and adjacency-identifiability, we divide the $s$-dimensional embedding space of LPE into two $s/2$-dimensional sub-spaces node and adj, i.e., we write
$$\mathrm{LPE}(v) = \big[\mathrm{node}(v) \;\; \mathrm{adj}(v)\big],$$
where
$$\mathrm{node}(v) = \rho_{\mathrm{node}}\Big(\sum_{j=1}^{k} \phi_{\mathrm{node}}\big(v_{ji}, \lambda_j + \epsilon_j\big)\Big) \quad\text{and}\quad \mathrm{adj}(v) = \rho_{\mathrm{adj}}\Big(\sum_{j=1}^{k} \phi_{\mathrm{adj}}\big(v_{ji}, \lambda_j + \epsilon_j\big)\Big),$$
and where we have chosen $\rho$ to be a sum over the first dimension of its input, followed by the FFNs $\rho_{\mathrm{node}}$ and $\rho_{\mathrm{adj}}$ for the node and adjacency parts, respectively. Note that we have written out Equation (5) for a single node $v$. For convenience, we also fix an arbitrary ordering over the nodes in $V(G)$ and define
$$Q^D = \begin{bmatrix} \deg(v_1) \\ \vdots \\ \deg(v_n) \end{bmatrix}, \qquad Q^N = \begin{bmatrix} \mathrm{node}(v_1) \\ \vdots \\ \mathrm{node}(v_n) \end{bmatrix}, \qquad Q^A = \begin{bmatrix} \mathrm{adj}(v_1) \\ \vdots \\ \mathrm{adj}(v_n) \end{bmatrix},$$
where $v_i$ is the $i$-th node in our ordering. Further, note that node and adj are DeepSets (Zaheer et al., 2017) over the set
$$M_i := \{(v_{ji}, \lambda_j + \epsilon_j)\}_{j=1}^{k},$$
where $v_j$ is again the $j$-th eigenvector with corresponding eigenvalue $\lambda_j$ and $\epsilon_j$ is a learnable scalar. With DeepSets we can universally approximate permutation-invariant functions. We will use this fact in the following, where we show that there exists a parameterization of node such that $Q^N$ approximates a node-identifying matrix arbitrarily close. To this end, note that $U$ is node-identifying, since $UU^{\mathsf{T}} = I$ and hence $(UU^{\mathsf{T}})_{ij} = \max_k (UU^{\mathsf{T}})_{ik}$ if and only if $i = j$. Moreover, let $M$ be any permutation matrix, i.e., each column of $M$ is 1 at exactly one position and 0 elsewhere and the rows of $M$ sum to 1. Then,
$$UM(UM)^{\mathsf{T}} = UMM^{\mathsf{T}}U^{\mathsf{T}} = UU^{\mathsf{T}} = I,$$
so $UM$ is also node-identifying. We will now approximate $UM$ for some $M$ with node arbitrarily close. Specifically, for the $i$-th node $v_i$ in our node ordering, we choose $\rho_{\mathrm{node}}$ and $\phi_{\mathrm{node}}$ such that
$$\big\lVert \mathrm{node}(v_i) - (UM)_i \big\rVert_F < \epsilon,$$
for all ϵ > 0 and all i. Since DeepSets can universally approximate permutation invariant functions, it remains to show that
there exists a permutation invariant function f such that f (Mi ) = Ui M for all i and for some M. To this end, note that for a
graph G, there are only at most n unique eigenvalues of the corresponding (normalized) graph Laplacian. Hence, we can
choose ϵj such that λj + ϵj is unique for each unique j. In particular, let
$$\epsilon_j = j \cdot \delta,$$
where we choose $0 < \delta < \min_{l \neq o,\, \lambda_l \neq \lambda_o} |\lambda_l - \lambda_o|$, i.e., $\delta$ is smaller than the smallest non-zero difference between two eigenvalues. We now define $f$ as
$$f\big(\{(v_{ji}, \lambda_j + \epsilon_j)\}_{j=1}^{k}\big) = \big[v_{1i} \; \dots \; v_{ki}\big] = (UM)_i,$$
where the order of the components is according to the sorted $\lambda_j + \epsilon_j$ in ascending order. This order is reflected in some permutation matrix $M$. Hence, $f$ is permutation-invariant and can be approximated by a DeepSet arbitrarily close. As a result,
we have
$$\big\lVert \mathrm{node}(v_i) - (UM)_i \big\rVert_F < \epsilon,$$
for all $i$ and an arbitrarily small $\epsilon > 0$. In matrix form, we have that
$$\big\lVert UM - Q^N \big\rVert_F < \epsilon,$$
and since UM is node-identifying, we can invoke Lemma 9, to conclude that QN is sufficiently node-identifying. As a result,
P has a sufficiently node-identifying sub-space and is thus, also sufficiently node-identifying according to Lemma 19.
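The perturbation argument can be illustrated numerically. In the sketch below the graph and the constants are our own choices: adding $\epsilon_j = j \cdot \delta$ makes the repeated eigenvalues of a 4-cycle distinct, the induced sorting is the same permutation $M$ for every node, and the resulting matrix satisfies $(UM)(UM)^{\mathsf{T}} = I$, i.e., it is node-identifying.

```python
# Sketch of the eps_j = j*delta perturbation trick in the proof of Theorem 20.
import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)     # C_4 has repeated Laplacian eigenvalues
L = np.diag(A.sum(axis=1)) - A
lam, U = np.linalg.eigh(L)
n = len(lam)

gaps = np.diff(np.unique(lam))
delta = 0.5 * gaps[gaps > 1e-9].min()          # delta below the smallest nonzero gap
keys = lam + delta * np.arange(1, n + 1)       # lambda_j + j*delta, now pairwise distinct
order = np.argsort(keys)                       # the permutation "M", shared by all nodes

node_embedding = U[:, order]                   # row i = sorted first components of M_i
print(np.allclose(node_embedding @ node_embedding.T, np.eye(n)))  # (UM)(UM)^T = I
```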
We continue by showing that $Q^A$ is sufficiently adjacency-identifying. Note that according to Lemma 17, $U\Sigma^{1/2}$ is adjacency-identifying. We will now approximate $U\Sigma^{1/2}$ with adj arbitrarily close. Specifically, for the $i$-th node $v_i$ in our node ordering, we choose $\rho_{\mathrm{adj}}$ and $\phi_{\mathrm{adj}}$ such that
$$\big\lVert \mathrm{adj}(v_i) - (U\Sigma^{1/2})_i \big\rVert_F < \epsilon,$$
for all $\epsilon > 0$ and all $i$. To this end, we first note that right-multiplication of $U$ by $\Sigma^{1/2}$ multiplies the $j$-th column of $U$, i.e., the eigenvector $v_j$, with the $j$-th diagonal element of $\Sigma^{1/2}$, i.e., $\sqrt{\lambda_j}$. Hence, for the $i$-th node $v_i$ it holds that
$$\big(U\Sigma^{1/2}\big)_i = \big[v_{1i} \cdot \sqrt{\lambda_1} \; \dots \; v_{ni} \cdot \sqrt{\lambda_n}\big] \in \mathbb{R}^{n},$$
where $v_{ji}$ denotes the $i$-th component of $v_j$. Since DeepSets can universally approximate permutation-invariant functions, it remains to show that there exists a permutation-invariant function $f$ such that $f(M_i) = (U\Sigma^{1/2}M)_i$ for all $i$ and for some permutation matrix $M$. To this end, note that for a graph $G$, there are only at most $n$ unique eigenvalues of the corresponding (normalized) graph Laplacian. Hence, we can choose $\epsilon_j$ such that $\lambda_j + \epsilon_j$ is unique for each unique $j$. In particular, let
$$\epsilon_j = j \cdot \delta,$$
where we choose $0 < \delta < \min_{l \neq o,\, \lambda_l \neq \lambda_o} |\lambda_l - \lambda_o|$, i.e., $\delta$ is smaller than the smallest non-zero difference between two eigenvalues. We now define $f$ as
$$f\big(\{(v_{ji}, \lambda_j + j \cdot \delta)\}_{j=1}^{k}\big) = \Big[v_{1i} \cdot \sqrt{\lambda_1 + 1 \cdot \delta} \; \dots \; v_{ki} \cdot \sqrt{\lambda_k + k \cdot \delta}\Big],$$
where the order of the components is according to the sorted $\lambda_j + \epsilon_j$ in ascending order. This order is reflected in some permutation matrix $M$, which we will use next. Now, since we can choose $\delta$ arbitrarily small, we can choose it such that
$$\Big\lVert f\big(\{(v_{ji}, \lambda_j + j \cdot \delta)\}_{j=1}^{k}\big) - \big(U\Sigma^{1/2}M\big)_i \Big\rVert_F < \epsilon,$$
for any $\epsilon > 0$. Further, $f$ is permutation-invariant and can be approximated by a DeepSet arbitrarily close. As a result, we have
$$\big\lVert \mathrm{adj}(v_i) - \big(U\Sigma^{1/2}M\big)_i \big\rVert_F < \epsilon,$$
for all $i$ and an arbitrarily small $\epsilon > 0$. In matrix form, we have that
$$\big\lVert U\Sigma^{1/2}M - Q^A \big\rVert_F < \epsilon.$$
Now, we need to distinguish between the non-normalized and the normalized Laplacian. In the first case, we consider the graph Laplacian underlying the eigendecomposition to be non-normalized, i.e., $L = D - A(G)$. First, we know from Lemma 17 that $U\Sigma^{1/2}M$ is adjacency-identifying. Further, we know from Lemma 9 that then $Q^A$ is sufficiently adjacency-identifying, since $Q^A$ can approximate $U\Sigma^{1/2}M$ arbitrarily close. As a result, $P$ has a sufficiently adjacency-identifying sub-space and is thus also sufficiently adjacency-identifying.
In the second case, we consider the graph Laplacian underlying the eigendecomposition to be normalized, i.e., $\tilde{L} = I - D^{-1/2}A(G)D^{-1/2}$. Here, recall our construction for $P$, namely
$$P(v) = \mathrm{FFN}\big(\big[\deg(v) \;\; \mathrm{node}(v) \;\; \mathrm{adj}(v)\big]\big),$$
or, in matrix form,
$$P = \mathrm{FFN}\big(\big[Q^D \;\; Q^N \;\; Q^A\big]\big).$$
We will use the FFN to approximate the function that multiplies the $i$-th row of $Q^A$ with $\sqrt{d_i}$, where $d_i$ is the degree of node $i$; this is equivalent to left-multiplication of $Q^A$ by $D^{1/2}$. Note that a FFN can approximate such a function arbitrarily close since our domain is compact. Further, the $i$-th row of $Q^D$ is an embedding $\deg(v_i)$ of $d_i$. We can choose this embedding to be
$$\deg(v_i) = \big[\sqrt{d_i} \; 0 \; \dots \; 0\big] \in \mathbb{R}^{p},$$
where we write $\sqrt{d_i}$ into the first component and pad the remaining vector with zeros to fit the target dimension $p$ of deg. Hence, with the FFN we can approximate a sub-space containing $D^{1/2}Q^A$ arbitrarily close. Further, since we already showed that $Q^A$ can approximate $U\Sigma^{1/2}M$ arbitrarily close, we can thus approximate $D^{1/2}U\Sigma^{1/2}M$ arbitrarily close. First, we know from Lemma 18 that $D^{1/2}U\Sigma^{1/2}M$ is adjacency-identifying. Further, we know from Lemma 9 that then $D^{1/2}Q^A$ is sufficiently adjacency-identifying, since $D^{1/2}Q^A$ can approximate $D^{1/2}U\Sigma^{1/2}M$ arbitrarily close. As a result, $P$ has a sufficiently adjacency-identifying sub-space and is thus also sufficiently adjacency-identifying according to Lemma 19.
Finally, for $Q^D$ we need $p \geq 1$, and for $Q^N$ and $Q^A$ we need $s/2 \geq n$. As a result, the above statements hold for $d \geq 2n + 1$. This concludes the proof.
F.4. SPE
Here, we show that SPE is sufficiently node- and adjacency-identifying. We begin with a useful lemma.
Lemma 21. Let $G$ be a graph with graph Laplacian $L$ and normalized graph Laplacian $\tilde{L}$. Then, if for a node-level PE $Q$ with compact domain there exist matrices $W_Q$ and $W_K$ such that
$$\frac{1}{\sqrt{d_k}} (QW_Q)(QW_K)^{\mathsf{T}} = \tilde{L},$$
there exists a parameterization of the structural embedding $P$ with $Q$ as node-level PE such that $P$ is sufficiently node- and adjacency-identifying.
Proof. Recall that structural embeddings with $Q$ as node-level PE are defined as
$$P(v) = \mathrm{FFN}\big(\deg(v) + Q(v)\big),$$
for all $v \in V(G)$. Since the domains of both deg and $Q$ are compact, there exist parameters for both of these embeddings such that, without loss of generality, we may assume that for $v \in V(G)$,
$$P(v) = \big[\deg(v) \;\; Q(v)\big],$$
and that deg and $Q$ are instead embedded into some smaller $p$-dimensional and $s$-dimensional sub-spaces, respectively, where it holds that $d = p + s$.
Next, we choose the embedding $\deg(v)$ to be
$$\deg(v) = \big[\sqrt{d_v} \; 0 \; \dots \; 0\big] \in \mathbb{R}^{p},$$
where $d_v$ is the degree of node $v$ and we write $\sqrt{d_v}$ into the first component and pad the remaining vector with zeros to fit the target dimension $p$ of deg. We will now use the FFN to approximate the following function $f$, defined as
$$f\big(\big[\deg(v) \;\; Q(v)\big]\big) = \big[\deg(v) \;\; \sqrt{d_v}\,Q(v)\big].$$
Note that a FFN can approximate such a function arbitrarily close since our domain is compact. Hence, we have that
$$\Big\lVert P(v) - \big[\deg(v) \;\; \sqrt{d_v}\,Q(v)\big] \Big\rVert_F < \epsilon_1,$$
for all $\epsilon_1 > 0$, and, in matrix form,
$$\Big\lVert P - \big[Q^D \;\; D^{1/2}Q\big] \Big\rVert_F < \epsilon_2, \quad\text{where } Q^D = \begin{bmatrix}\deg(v_1)\\ \vdots\\ \deg(v_n)\end{bmatrix},$$
for all $\epsilon_2 > 0$, since left-multiplication of a matrix by $D^{1/2}$ corresponds to an element-wise multiplication of the $i$-th row of $Q$ with $\sqrt{d_{v_i}}$, the square root of the degree of the $i$-th node $v_i$ in an arbitrary but fixed node ordering. Now, since we have that there exist matrices $W_Q$ and $W_K$ such that
$$\frac{1}{\sqrt{d_k}} (QW_Q)(QW_K)^{\mathsf{T}} = \tilde{L},$$
Proof. We begin by noting that the domain of SPE is compact, since eigenvectors are unit-norm and eigenvalues are bounded by twice the maximum node degree for graphs without self-loops (Anderson & Morley, 1985). Further, the domain of the structural embeddings is compact, since the domain of SPE is compact and the node degrees are finite; hence, the domain of deg is compact.
Let $G$ be a graph with (normalized or non-normalized) graph Laplacian $L$. Let $L = V\Sigma V^{\mathsf{T}}$ be the eigendecomposition of $L$, where the $i$-th column of $V$ contains the $i$-th eigenvector, denoted $v_i$, and the $i$-th diagonal entry of $\Sigma$ contains the $i$-th eigenvalue, denoted $\lambda_i$. Recall that structural embeddings with SPE as node-level PEs are defined as $P(v) = \mathrm{FFN}(\deg(v) + \mathrm{SPE}(v))$ for all $v \in V(G)$.
To show node- and adjacency-identifiability, we first replace the neural networks $\rho, \phi_1, \dots, \phi_n$ in SPE with permutation-equivariant functions $g, h_1, \dots, h_n$ over the same domains and co-domains, respectively, and choose $g$ and $h_1, \dots, h_n$ such that the resulting encoding, which we call fSPE, is sufficiently node- and adjacency-identifying. Then, we show that $\rho$ and $\phi_1, \dots, \phi_n$ can approximate $g$ and $h_1, \dots, h_n$ arbitrarily close, which gives the proof statement.
We first define, for all $\ell \in [n]$,
$$h_\ell(\lambda) = \big[0, \; \dots, \; 0, \; \lambda_\ell^{1/2}, \; 0, \; \dots, \; 0\big]^{\mathsf{T}},$$
such that $\lambda_\ell$ is the $\ell$-th sorted eigenvalue, where ties are broken arbitrarily,³ and $\lambda_\ell^{1/2}$ is the $\ell$-th entry of $h_\ell(\lambda)$. Note that due to the sorting, $h_\ell$ is equivariant to permutations of the nodes.
We then define the matrix $M_i$ as the $i$-th input to $g$ in fSPE, for all $i \in [n]$; the last equality in its expansion follows from the fact that $h_\ell(\lambda)_j \neq 0$ if and only if $\ell = j$. Now, we choose a linear set function $f$, satisfying
$$f(\{a \cdot x \mid x \in X\}) = a \cdot f(X),$$
for all $a \in \mathbb{R}$ and all $X \subset \mathbb{R}$, e.g., sum or mean. Since $f$ is a function over sets, $f$ is equivariant to the ordering of $X$. We now define the function $g$ step by step. In $g$, we first apply $f$ to the columns of $M_i$ for all $i \in [n]$. To this end, let $M_i^j$ denote the set of entries in the $j$-th column of matrix $M_i$. Then, we define
$$f_i = \big[f(M_i^1) \; \dots \; f(M_i^n)\big] = \Big[\lambda_1^{1/2} \cdot v_{i1} \cdot f(\{v_{o1} \mid o \in [n]\}) \; \dots \; \lambda_n^{1/2} \cdot v_{in} \cdot f(\{v_{on} \mid o \in [n]\})\Big].$$
³ Note that despite eigenvalues with higher multiplicity, the output of $h_\ell$ is the same, no matter how ties are broken.
$$P_i = M_i \cdot f_i = \begin{bmatrix} v_{11} \cdot (\lambda_1^{1/2})^2 \cdot (v_{i1})^2 & \dots & v_{1n} \cdot (\lambda_n^{1/2})^2 \cdot (v_{in})^2 \\ \vdots & & \vdots \\ v_{n1} \cdot (\lambda_1^{1/2})^2 \cdot (v_{i1})^2 & \dots & v_{nn} \cdot (\lambda_n^{1/2})^2 \cdot (v_{in})^2 \end{bmatrix} = \begin{bmatrix} v_{11} \cdot \lambda_1 \cdot (v_{i1})^2 & \dots & v_{1n} \cdot \lambda_n \cdot (v_{in})^2 \\ \vdots & & \vdots \\ v_{n1} \cdot \lambda_1 \cdot (v_{i1})^2 & \dots & v_{nn} \cdot \lambda_n \cdot (v_{in})^2 \end{bmatrix}$$
and
$$Q_i = M_i / f_i = \begin{bmatrix} \dfrac{v_{11}}{f(\{v_{o1} \mid o \in [n]\})} & \dots & \dfrac{v_{1n}}{f(\{v_{on} \mid o \in [n]\})} \\ \vdots & & \vdots \\ \dfrac{v_{n1}}{f(\{v_{o1} \mid o \in [n]\})} & \dots & \dfrac{v_{nn}}{f(\{v_{on} \mid o \in [n]\})} \end{bmatrix}.$$
Hence, we also need to ensure that
$$f(\{v_{oj} \mid o \in [n]\}) \neq 0,$$
for all $j \in [n]$. Now, we define
$$P = \sum_i P_i \quad\text{and}\quad Q = \sum_i Q_i,$$
where we recall that the eigenvectors are normalized and hence $\sum_i v_{i\ell}^2 = 1$.
In the last step of $g$, we return the matrix
$$\big[P \;\; Q\big] \in \mathbb{R}^{n \times 2n}.$$
We will now show that this matrix is node- and adjacency-identifying. To this end, note that
$$PQ^{\mathsf{T}} = V\Sigma V^{\mathsf{T}} = L.$$
We distinguish between two cases. If $L$ is the non-normalized graph Laplacian, then, according to Lemma 16, the matrix $[P \; Q]$ is node- and adjacency-identifying and hence, according to Lemma 19, fSPE is node- and adjacency-identifying. Further, if $L$ is the normalized graph Laplacian, then, according to Lemma 21, fSPE is sufficiently node- and adjacency-identifying.
To summarize, we have shown that there exist functions g, h1 , . . . , hn such that if, in SPE, we replace ρ with g and ϕℓ with
hℓ , then the resulting encoding fSPE is sufficiently node- and adjacency-identifying. To complete the proof, we will now show
that we can approximate $g, h_1, \dots, h_n$ with $\rho, \phi_1, \dots, \phi_n$ arbitrarily close. Since our domain is compact, we can use $\phi_\ell$ to approximate $h_\ell$ arbitrarily close, i.e., we have that
$$\lVert \phi_\ell - h_\ell \rVert_F < \epsilon,$$
for all $\epsilon > 0$. Further, $g$ consists of a sequence of permutation-equivariant steps and is thus permutation-equivariant. Since the domain of $g$ is compact and, by construction, $g$ and $\rho$ have the same domain and co-domain, and since $\rho$ can approximate any permutation-equivariant function on its domain and co-domain arbitrarily close, we have that $\rho$ can also approximate $g$ arbitrarily close. As a result, SPE can approximate a sufficiently node- and adjacency-identifying encoding arbitrarily close and hence, by definition, SPE is also sufficiently node- and adjacency-identifying. This completes the proof.
G. Learnable embeddings
Here, we pay more attention to learnable embeddings, which play an important role throughout our work. Although our
definition of learnable embeddings is fairly abstract, learnable embeddings are very commonly used in practice, e.g., in
tokenizers of language models (Vaswani et al., 2017). Since many components of graphs, such as the node labels, edge types,
or node degrees are discrete, learnable embeddings are useful to embed these discrete features into an embedding space. In
our work, different embeddings are typically joined via sum, e.g., in Equation (1) or Equation (4). Summing embeddings
is much more convenient in practice since every embedding has the same dimension d. In contrast, joining embeddings
via concatenation can lead to very large d, unless the underlying embeddings are very low-dimensional. However, in our
theorems joining embeddings via concatenation is much more convenient. To bridge theory and practice in this regard, in
what follows we establish under what conditions summed embeddings can express concatenated embeddings and vice versa.
Specifically, we show that summed embeddings can still act as if concatenating lower-dimensional embeddings but with more
flexibility in terms of joining different embeddings.
The core idea is that we show under which operations one may replace a learnable embedding $e$ into $\mathbb{R}^d$ with a lower-dimensional learnable embedding $e'$ into $\mathbb{R}^{d'}$ while preserving the injectivity of $e$. This property is useful since it allows us to, for example, add two learnable embeddings without losing expressivity. As a result, we can design a tokenizer that is practically useful without sacrificing theoretical guarantees.
We begin with a projection of a concatenation of learnable embeddings.
Lemma 23 (Projection of learnable embeddings). Let $e'_1, \dots, e'_k$ be learnable embeddings from $\mathbb{N}$ to $\mathbb{R}^p$ for some $p \geq 1$. If $d \geq k$, then there exist learnable embeddings $e_1, \dots, e_k$ as well as a projection matrix $W \in \mathbb{R}^{d \cdot k \times k \cdot p}$ such that for $v_1, \dots, v_k \in \mathbb{N}$,
$$\big[e_i(v_i)\big]_{i=1}^{k} W = \big[e'_i(v_i)\big]_{i=1}^{k} \in \mathbb{R}^{k \cdot p}.$$
Proof. For a learnable embedding $e$ and some $v \in \mathbb{N}$, we denote by $e(v)_j$ the $j$-th element of the vector $e(v)$. Since learnable embeddings map from $\mathbb{N}$ to $\mathbb{R}^d$, for every $i$ we define $e_i$ such that for every $v \in \mathbb{N}$, $e_i(v)_1 = e'_i(v)$ and $e_i(v)_j = 0$ for $j > 1$.
Now, we define $W$ as follows. For row $l$ and column $j$, we set $W_{lj} = 1$ if $l = (j - 1)\,d + 1$, i.e., the non-zero rows are those with index in $\{1, d+1, 2d+1, \dots\}$, and otherwise we set $W_{lj} = 0$. Intuitively, applying $W$ to $[e_i(v_i)]_{i=1}^{k}$ selects the 1st, $(d+1)$-st, $(2d+1)$-st, $\dots$, $((k-1)d+1)$-st elements of $[e_i(v_i)]_{i=1}^{k}$, corresponding to $v_1, \dots, v_k$, respectively, according to our construction of the $e_i$. Now, we have that
$$\big[e_i(v_i)\big]_{i=1}^{k} W = \big[e'_1(v_1) \; \dots \; e'_k(v_k)\big] = \big[e'_i(v_i)\big]_{i=1}^{k}.$$
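A toy instance of this construction makes the selection explicit; the values, $p = 1$, and the small dimensions below are our own choices for illustration.

```python
# Toy version of Lemma 23 (with p = 1): each e_i stores its value in the first
# coordinate of a d-dimensional vector, and W selects coordinates 1, d+1, 2d+1, ...
import numpy as np

d, k = 4, 3
values = [7.0, 2.0, 5.0]                      # e'_1(v_1), e'_2(v_2), e'_3(v_3)

def e(value, d):                              # e_i(v): value in the first entry, zeros elsewhere
    out = np.zeros(d); out[0] = value
    return out

concat = np.concatenate([e(v, d) for v in values])    # (e_i(v_i))_{i=1..k} in R^{k*d}

W = np.zeros((k * d, k))                      # selection matrix
for j in range(k):
    W[j * d, j] = 1.0                         # pick positions 1, d+1, 2d+1, ... (0-indexed: 0, d, 2d)

print(concat @ W)                             # [7. 2. 5.] = (e'_i(v_i))_{i=1..k}
```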
Next, we turn to a composition of learnable embeddings, namely the structural embeddings P, as defined in Section 3.1. In
particular, we show next that a projection of multiple structural embeddings preserves node- and adjacency identifiability of
the individual structural embeddings in the new embedding space.
Lemma 24 (Projection of structural embeddings). Let $G$ be a graph with $n$ nodes and let $P \in \mathbb{R}^{n \times d}$ be structural embeddings for $G$. Let $k > 1$. If $d \geq k \cdot (2n + 1)$, then there exists a parameterization of $P$ as well as a projection matrix $W \in \mathbb{R}^{k \cdot d \times k \cdot (2n+1)}$ such that for any $v_1, \dots, v_k \in V(G)$,
$$\big[P(v_i)\big]_{i=1}^{k} W = \big[P'(v_i)\big]_{i=1}^{k} \in \mathbb{R}^{k \cdot (2n+1)},$$
where $P' \in \mathbb{R}^{n \times (2n+1)}$ is also a structural embedding for $G$, and it holds that if $P$ is sufficiently node- and adjacency-identifying, then so is $P'$.
Proof. We know from Theorem 20 that there exist sufficiently node- and adjacency-identifying structural embeddings $P'$ in $\mathbb{R}^{n \times (2n+1)}$. Since $d \geq k \cdot (2n + 1)$, we define
$$P = \big[P' \;\; 0\big],$$
where $0 \in \mathbb{R}^{n \times (d - (2n+1))}$ is an all-zero matrix. Since $P$ has a sufficiently node- and adjacency-identifying subspace $P'$, by Lemma 19, $P$ is sufficiently node- and adjacency-identifying. Further, we set $W \in \mathbb{R}^{k \cdot d \times k \cdot (2n+1)}$ to be the block-diagonal matrix with $k$ identical blocks $\begin{bmatrix} I_{2n+1} \\ 0 \end{bmatrix} \in \mathbb{R}^{d \times (2n+1)}$, i.e., each block selects the first $(2n+1)$ entries of the corresponding $P(v_i)$.
Then, we have that for nodes $v_1, \dots, v_k \in V(G)$,
$$\big[P(v_i)\big]_{i=1}^{k} W = \big[P'(v_i)\big]_{i=1}^{k} \in \mathbb{R}^{k \cdot (2n+1)},$$
and $P'$ is by definition sufficiently node- and adjacency-identifying, from which the statement follows.
We continue by showing that our construction of token embeddings in Equation (1) can be equally well represented by another
form that will be easier to use in proving Theorem 2.
Lemma 25. Let $G$ be a graph with node features $F \in \mathbb{R}^{n \times d}$. For every tokenization $X^{(0,1)} \in \mathbb{R}^{d}$ according to Equation (1) with (sufficiently) adjacency-reconstructing structural embeddings $P(v)$, there exists a parameterization of $X^{(0,1)}$ such that for every node $v \in V(G)$,
$$X^{(0,1)}(v) = \big[F'(v) \;\; 0 \;\; \deg'(v) \;\; P'(v)\big],$$
with
$$P'(v) = \mathrm{FFN}'\big(\deg'(v) + \mathrm{PE}(v)\big),$$
such that $F'(v) \in \mathbb{R}^{p}$, $0 \in \mathbb{R}^{p}$, $\deg' : \mathbb{N} \to \mathbb{R}^{r}$, and $\mathrm{FFN}' : \mathbb{R}^{d} \to \mathbb{R}^{s}$ for $d = 2p + r + s$, and where it holds for every $v, w \in V(G)$ that
$$F(v) = F(w) \iff F'(v) = F'(w),$$
$$\deg(v) = \deg(w) \iff \deg'(v) = \deg'(w),$$
$$P(v) = P(w) \iff P'(v) = P'(w),$$
and $P'(v)$ is (sufficiently) adjacency-identifying.
Proof. We set $p = (d - (2n + 1))/2 - 1$, $r = 2$, and $s = 2n + 1$. Indeed, we have that
$$2p + r + s = \big(d - (2n+1) - 2\big) + 2 + (2n+1) = d.$$
We continue by noting that the node features $F$ are mapped from $\mathbb{N}$ to $\mathbb{R}^{d}$ and can be obtained from a learnable embedding. More importantly, we can define
$$F'(v) = \big[\ell(v) \; 0 \; \dots \; 0\big],$$
where $\deg'$ is another learnable embedding for the node degrees, mapping to $\mathbb{R}^{r}$. Moreover, we know from Theorem 20 that there exist (sufficiently) adjacency-identifying structural embeddings of dimension $2n + 1$. We choose the FFN in Equation (2) such that
$$P(v) = \big[0 \;\; \deg'(v) \;\; P'(v)\big],$$
and hence
$$X^{(0,1)}(v) = F'(v) + P(v) = \big[F'(v) \;\; 0 \;\; \deg'(v) \;\; \mathrm{FFN}'\big(\deg'(v) + \mathrm{PE}(v)\big)\big],$$
where through the concatenation it is easy to verify that $X^{(0,1)}(v)$ indeed has the desired properties, namely that
$$F(v) = F(w) \iff F'(v) = F'(w),$$
$$\deg(v) = \deg(w) \iff \deg'(v) = \deg'(w),$$
$$P(v) = P(w) \iff P'(v) = P'(w),$$
and $P'(v)$ is (sufficiently) adjacency-identifying. This shows the statement.
Lemma 26. Let $G$ be a graph with node features $F \in \mathbb{R}^{n \times d}$. For every tokenization $X^{(0,k)} \in \mathbb{R}^{d}$ according to Equation (4) with sufficiently node- and adjacency-identifying structural embeddings $P$, there exists a parameterization of $X^{(0,k)}$ such that for every $k$-tuple $v = (v_1, \dots, v_k) \in V(G)^k$,
$$X^{(0,k)}(v) = \big[F'(v_1) \; \dots \; F'(v_k) \;\; 0 \;\; \deg'(v_1) \; \dots \; \deg'(v_k) \;\; P'(v_1) \; \dots \; P'(v_k) \;\; \mathrm{atp}'(v)\big],$$
with
$$P'(v) = \mathrm{FFN}'\big(\deg'(v) + \mathrm{PE}(v)\big),$$
such that $F'(v) \in \mathbb{R}^{p}$, $0 \in \mathbb{R}^{k^2 \cdot p}$, $\deg' : \mathbb{N} \to \mathbb{R}^{r}$, $\mathrm{atp}' : \mathbb{N} \to \mathbb{R}^{o}$, and $P' \in \mathbb{R}^{s}$ for $d = k^2 \cdot p + k \cdot p + k \cdot r + k \cdot s + o$, and where it holds for every $v, w \in V(G)$ that
$$F(v) = F(w) \iff F'(v) = F'(w),$$
$$\deg(v) = \deg(w) \iff \deg'(v) = \deg'(w),$$
$$P(v) = P(w) \iff P'(v) = P'(w),$$
where $P'(v)$ is node- and adjacency-identifying, and for every $v, w \in V(G)^k$,
$$\mathrm{atp}(v) = \mathrm{atp}(w) \iff \mathrm{atp}'(v) = \mathrm{atp}'(w).$$
Proof. We set $p = 1$, $r = 2$, $o = 1$, and $s = \frac{d - k^2 - 3k - 1}{k}$. Indeed, we have that
$$k^2 \cdot p + k \cdot p + k \cdot r + k \cdot s + o = k^2 + 3k + \big(d - k^2 - 3k - 1\big) + 1 = d.$$
Recall that, according to Equation (4),
$$X^{(0,k)}(v) = \big[F(v_i)\big]_{i=1}^{k} W^F + \big[P(v_i)\big]_{i=1}^{k} W^P + \mathrm{atp}(v),$$
where $W^F \in \mathbb{R}^{d \cdot k \times d}$ and $W^P \in \mathbb{R}^{d \cdot k \times d}$ are projection matrices, $P$ are structural embeddings, and $v = (v_1, \dots, v_k)$ is a $k$-tuple.
We continue by noting that the node features $F$ are mapped from $\mathbb{N}$ to $\mathbb{R}^{d}$ and can be obtained from a learnable embedding. More importantly, we can define $F'(v) = \ell(v)$, for which it clearly holds that $F(v) = F(w) \iff F'(v) = F'(w)$ for all $v, w \in V(G)$. Invoking Lemma 23, there exist learnable embeddings $F$ and a projection matrix $W^{F,*} \in \mathbb{R}^{k \cdot d \times k \cdot p}$ such that
$$\big[F(v_i)\big]_{i=1}^{k} W^{F,*} = \big[F'(v_1) \; \dots \; F'(v_k)\big] \in \mathbb{R}^{k \cdot p}.$$
Note that since we do not have an assumption on $d$, we can make $d$ arbitrarily large to ensure that $d \geq k$. Hence, we define
$$W^F = \big[W^{F,*} \;\; 0\big],$$
where $0 \in \mathbb{R}^{k^2 \cdot p + k \cdot p}$ is an all-zero block. We further define
$$P'(v) = \mathrm{FFN}'\big(\deg'(v) + \mathrm{PE}(v)\big),$$
with $\mathrm{FFN}' : \mathbb{R}^{d} \to \mathbb{R}^{s}$ another FFN inside of FFN. Note that such a FFN exists without approximation. Further, according to Lemma 24, $P'$ is still sufficiently adjacency-identifying, and there exists a projection matrix $W^{P,*} \in \mathbb{R}^{k \cdot d \times (d - o)}$ such that
$$\big[P(v_i)\big]_{i=1}^{k} W^{P,*} = \big[0 \;\; \deg'(v_1) \;\; P'(v_1) \; \dots \; \deg'(v_k) \;\; P'(v_k)\big] \in \mathbb{R}^{d - o}.$$
But then, there also exists a permutation matrix $Q \in \{0,1\}^{(d-o) \times (d-o)}$ such that
$$\big[P(v_i)\big]_{i=1}^{k} W^{P,*} Q = \big[0 \;\; \deg'(v_1) \; \dots \; \deg'(v_k) \;\; P'(v_1) \; \dots \; P'(v_k)\big] \in \mathbb{R}^{d - o},$$
where we simply re-order the entries of $[P(v_i)]_{i=1}^{k} W^{P,*}$. Note that now $0 \in \mathbb{R}^{k^2 \cdot p + k \cdot p}$, as we grouped the zero-vectors of $[P(v_i)]_{i=1}^{k} W^{P,*}$ into a single zero-vector of size $k^2 \cdot p + k \cdot p$.
Note that since we do not have an assumption on $d$, we can make $d$ arbitrarily large to ensure that $d \geq k \cdot (2n + 1)$. Hence, we define
$$W^P = \big[W^{P,*} Q \;\; 0\big],$$
where P′ (vi ) ∈ Rs . Further, since P is by assumption sufficiently node- and adjacency-identifying, Lemma 24 guarantees
that then so is P′ .
Finally, for a tuple $v \in V(G)^k$, we set
$$\mathrm{atp}(v) = \big[0 \;\; \mathrm{atp}'(v)\big],$$
where $\mathrm{atp}'(v) = \mathrm{atp}(v)$, recalling that atp maps to $\mathbb{N}$ and $0 \in \mathbb{R}^{d-1}$ is an all-zero vector. Then, we can write
$$X^{(0,k)}(v) = \big[F(v_i)\big]_{i=1}^{k} W^F + \big[P(v_i)\big]_{i=1}^{k} W^P + \mathrm{atp}(v) = \big[F'(v_1) \; \dots \; F'(v_k) \;\; 0 \;\; \deg'(v_1) \; \dots \; \deg'(v_k) \;\; P'(v_1) \; \dots \; P'(v_k) \;\; \mathrm{atp}'(v)\big],$$
for which it holds that
$$F(v) = F(w) \iff F'(v) = F'(w),$$
$$\deg(v) = \deg(w) \iff \deg'(v) = \deg'(w),$$
$$P(v) = P(w) \iff P'(v) = P'(w),$$
where $P'(v)$ is node- and adjacency-identifying, and for every $v, w \in V(G)^k$,
$$\mathrm{atp}(v) = \mathrm{atp}(w) \iff \mathrm{atp}'(v) = \mathrm{atp}'(w).$$
This shows the statement.
H. Expressivity of 1-GT
Here, we prove Theorem 2 in the main paper. We first restate Theorem VIII.4 of Grohe (2021), showing that a simple GNN can simulate the 1-WL.
Lemma 27 (Grohe (2021), Theorem VIII.4). Let $G = (V(G), E(G), \ell)$ be a graph with $n$ nodes, adjacency matrix $A(G)$, and node feature matrix $X^{(0)} := F \in \mathbb{R}^{n \times d}$ consistent with $\ell$. Further, assume a GNN that, for each layer $t > 0$, updates the vertex feature matrix as
$$X^{(t)} := \mathrm{FFN}\big(X^{(t-1)} + 2\,A(G)\,X^{(t-1)}\big).$$
Then, there exists a parameterization of the GNN such that for all $t \geq 0$ and all $v, w \in V(G)$,
$$C^1_t(v) = C^1_t(w) \iff X^{(t)}(v) = X^{(t)}(w),$$
where $C^1_t$ denotes the coloring function of the 1-WL at iteration $t$.
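The following sketch illustrates the update of Lemma 27; since we do not train an FFN here, we substitute an explicit injective relabeling of the aggregated rows, which is the role the FFN plays in the lemma. The graph and number of iterations are arbitrary choices for illustration.

```python
# Sketch of the Lemma 27 update (not the paper's trained model): with one-hot
# colors X, the row X_v + 2*(A X)_v encodes v's own color together with the
# multiset of neighbor colors, so relabeling distinct rows refines the coloring
# exactly like one 1-WL iteration.
import numpy as np

A = np.array([[0, 1, 0, 0, 0, 1],
              [1, 0, 1, 0, 0, 0],
              [0, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [0, 0, 0, 1, 0, 1],
              [1, 0, 0, 0, 1, 0]], dtype=float)   # 6-cycle
A[0, 2] = A[2, 0] = 1                              # add a chord so the refinement is non-trivial

def relabel(rows):
    """Map each distinct row to a small integer color (the role of the FFN)."""
    seen = {}
    return np.array([seen.setdefault(tuple(r), len(seen)) for r in rows])

n = A.shape[0]
colors = np.zeros(n, dtype=int)                    # uniform initial coloring
for t in range(3):
    X = np.eye(colors.max() + 1)[colors]           # one-hot encoding of current colors
    colors = relabel(X + 2 * A @ X)                # X^(t) = "FFN"(X^(t-1) + 2 A X^(t-1))
    print(f"iteration {t + 1}: colors = {colors.tolist()}")
```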
We now give the proof for a slight generalization of Theorem 2. Specifically, we relax the adjacency-identifying condition to
sufficiently adjacency-identifying.
Theorem 28 (Generalization of Theorem 2 in the main paper). Let $G = (V(G), E(G), \ell)$ be a labeled graph with $n$ nodes and let $F \in \mathbb{R}^{n \times d}$ be a node feature matrix consistent with $\ell$. Let $C^1_t$ denote the coloring function of the 1-WL at iteration $t$. Then, for all iterations $t \geq 0$, there exists a parameterization of the 1-GT with sufficiently adjacency-identifying structural embeddings such that
$$C^1_t(v) = C^1_t(w) \iff X^{(t,1)}(v) = X^{(t,1)}(w),$$
for all nodes $v, w \in V(G)$.
Proof. According to Lemma 25, there exists a parameterization of $X^{(0,1)}$ such that for each $v \in V(G)$,
$$X^{(0,1)}(v) = \big[F'(v) \;\; 0 \;\; \deg'(v) \;\; P'(v)\big], \tag{17}$$
and the base case follows from Equation (17). We let $F^{(t)}(v)$ denote the representation of the color of node $v$ at iteration $t$ and initially set $F^{(0)}(v) = F'(v)$. Further, we define the matrix $D_{\mathrm{emb}} \in \mathbb{R}^{n \times r}$ such that its $i$-th row is $D_{\mathrm{emb},i} = \deg'(v_i)$, where $v_i$ is the $i$-th node in a fixed but arbitrary node ordering. Hence, we can write
$$X^{(0,1)} = \big[F^{(0)} \;\; 0 \;\; D_{\mathrm{emb}} \;\; P'\big].$$
Now, assume that the statement holds up to iteration $t - 1$. For the induction step, we show that
$$C^1_t(v) = C^1_t(w) \iff X^{(t,1)}(v) = X^{(t,1)}(w) \tag{18}$$
holds at iteration $t$ as well. That is, we compute the 1-WL-equivalent aggregation using the node color representations $F^{(t)}$ and use the remaining columns in $X^{(0,1)}$ to keep track of degree and structural embeddings. Clearly, if Equation (18) holds for all $t$, then the statement follows. Thereto, we aim to update the vertex representations such that
$$X^{(t,1)} = \big[F^{(t)} \;\; 0 \;\; D_{\mathrm{emb}} \;\; P'\big]. \tag{19}$$
Hence, it remains to show that our graph transformer layer can update vertex representations according to Equation (19). To this end, we will require only a single transformer head in each layer. Specifically, we want the head to compute a row-normalized neighborhood aggregation based on $D^{-1}A(G)$, where $D^{-1}$ denotes the inverse of the degree matrix and $I$ denotes the identity matrix with $n$ rows. Note that since we only have one head, the head dimension is $d_v = d$. We begin by re-stating Equation (8) with expanded sub-matrices and then derive the instances necessary to obtain the head in Equation (21). We re-state the projection weights $W^Q$ and $W^K$ with expanded sub-matrices as
$$W^Q = \begin{bmatrix} W^Q_1 \\ W^Q_2 \\ W^Q_3 \\ W^Q_4 \end{bmatrix} \quad\text{and}\quad W^K = \begin{bmatrix} W^K_1 \\ W^K_2 \\ W^K_3 \\ W^K_4 \end{bmatrix},$$
where $W^Q_1, W^K_1 \in \mathbb{R}^{p \times d}$, $W^Q_2, W^K_2 \in \mathbb{R}^{p \times d}$, $W^Q_3, W^K_3 \in \mathbb{R}^{r \times d}$, and $W^Q_4, W^K_4 \in \mathbb{R}^{s \times d}$. We then define
$$Z^{(t-1)} := \frac{1}{\sqrt{d_k}} \big(X^{(t-1,1)} W^Q\big)\big(X^{(t-1,1)} W^K\big)^{\mathsf{T}},$$
where
$$X^{(t-1,1)} = \big[F^{(t-1)} \;\; 0 \;\; D_{\mathrm{emb}} \;\; P'\big]$$
by the induction hypothesis. By setting $W^Q_1, W^Q_2, W^Q_3, W^K_1, W^K_2, W^K_3$ to zero, we have
$$Z^{(t-1)} = \frac{1}{\sqrt{d_k}} \big(P' W^Q_4\big)\big(P' W^K_4\big)^{\mathsf{T}}.$$
Now, let
$$\tilde{P} := \frac{1}{\sqrt{d_k}} \big(P' W^Q_P\big)\big(P' W^K_P\big)^{\mathsf{T}},$$
for some weight matrices $W^Q_P$ and $W^K_P$. We know from Lemma 12 that, since $P'$ is sufficiently adjacency-reconstructing, there exist $W^Q_P$ and $W^K_P$ such that, by setting $W^Q_4 = b \cdot W^Q_P$ and $W^K_4 = W^K_P$, we have
$$Z^{(t-1)} = b \cdot \tilde{P},$$
where for all $\varepsilon > 0$ there exists a $b > 0$ such that
$$\big\lVert \operatorname{softmax}\big(Z^{(t-1)}\big) - D^{-1}A(G) \big\rVert_F < \varepsilon.$$
Hence, by choosing a large enough $b$, we can approximate the matrix $D^{-1}A(G)$ arbitrarily close. In the following, for clarity of presentation, we assume $\operatorname{softmax}\big(Z^{(t-1)}\big) = D^{-1}A(G)$, although we only approximate it arbitrarily close. However, by choosing $\varepsilon$ small enough, we can still approximate the matrix $X^{(t,1)}$, see below, arbitrarily close.
We now again expand the sub-matrices of $W^V$ and express Equation (8) as
$$h_1\big(X^{(t-1,1)}\big) = \operatorname{softmax}\big(Z^{(t-1)}\big)\, X^{(t-1,1)} \begin{bmatrix} W^V_1 \\ W^V_2 \\ W^V_3 \\ W^V_4 \end{bmatrix},$$
where $W^V_1 \in \mathbb{R}^{p \times d}$, $W^V_2 \in \mathbb{R}^{p \times d}$, $W^V_3 \in \mathbb{R}^{r \times d}$, and $W^V_4 \in \mathbb{R}^{s \times d}$. By setting
$$W^V_1 = \big[0 \;\; I \;\; 0 \;\; 0\big]$$
and setting the remaining sub-matrices of $W^V$ to zero, the head $h_1$ approximates
$$\big[0 \;\; D^{-1}A(G)F^{(t-1)} \;\; 0\big]$$
arbitrarily close. We now conclude our proof as follows. Recall that the transformer computes the final representation $X^{(t,1)}$ as
$$\begin{aligned}
X^{(t,1)} &= \mathrm{FFN}_{\mathrm{final}}\Big(X^{(t-1,1)} + h_1\big(X^{(t-1,1)}\big) W^O\Big) \\
&= \mathrm{FFN}_{\mathrm{final}}\Big(\big[F^{(t-1)} \;\; 0 \;\; D_{\mathrm{emb}} \;\; P'\big] + \big[0 \;\; D^{-1}A(G)F^{(t-1)} \;\; 0\big] W^O\Big) \\
&\underset{W^O := I}{=} \mathrm{FFN}_{\mathrm{final}}\Big(\big[F^{(t-1)} \;\; D^{-1}A(G)F^{(t-1)} \;\; D_{\mathrm{emb}} \;\; P'\big]\Big),
\end{aligned}$$
for some $\mathrm{FFN}_{\mathrm{final}}$. We now show that there exists an $\mathrm{FFN}_{\mathrm{final}}$ such that
$$X^{(t,1)} = \mathrm{FFN}_{\mathrm{final}}\Big(\big[F^{(t-1)} \;\; D^{-1}A(G)F^{(t-1)} \;\; D_{\mathrm{emb}} \;\; P'\big]\Big) = \big[F^{(t)} \;\; 0 \;\; D_{\mathrm{emb}} \;\; P'\big],$$
which then implies the proof statement. To this end, we show that there exists a composition of functions $f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\mathrm{deg}} \circ f_{\times 2}$ such that
$$f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\mathrm{deg}} \circ f_{\times 2}\Big(\big[F^{(t-1)} \;\; D^{-1}A(G)F^{(t-1)} \;\; D_{\mathrm{emb}} \;\; P'\big]\Big) = \big[F^{(t)} \;\; 0 \;\; D_{\mathrm{emb}} \;\; P'\big].$$
Since our domain is compact, there exist choices of $f_{\mathrm{FFN}}, f_{\mathrm{add}}, f_{\mathrm{deg}}, f_{\times 2}$ such that $f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\mathrm{deg}} \circ f_{\times 2} : \mathbb{R}^{d} \to \mathbb{R}^{d}$ is continuous. As a result, $\mathrm{FFN}_{\mathrm{final}}$ can approximate $f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\mathrm{deg}} \circ f_{\times 2}$ arbitrarily close.
Concretely, we define
$$f_{\times 2}\Big(\Big[F^{(t-1)}(v) \;\; \textstyle\sum_{w \in N(v)} \tfrac{1}{|N(v)|} F^{(t-1)}(w) \;\; |N(v)| \;\; 0 \;\; P'(v)\Big]\Big) = \Big[F^{(t-1)}(v) \;\; 2 \textstyle\sum_{w \in N(v)} \tfrac{1}{|N(v)|} F^{(t-1)}(w) \;\; |N(v)| \;\; 0 \;\; P'(v)\Big],$$
which multiplies the second column by two, and
$$f_{\mathrm{deg}}\Big(\Big[F^{(t-1)}(v) \;\; 2 \textstyle\sum_{w \in N(v)} \tfrac{1}{|N(v)|} F^{(t-1)}(w) \;\; |N(v)| \;\; 0 \;\; P'(v)\Big]\Big) = \Big[F^{(t-1)}(v) \;\; 2 \textstyle\sum_{w \in N(v)} F^{(t-1)}(w) \;\; |N(v)| \;\; 0 \;\; P'(v)\Big],$$
where $f_{\mathrm{deg}}$ multiplies the second column with the degree of node $v$ stored in the third column.
Next, we define
$$f_{\mathrm{add}}\Big(\Big[F^{(t-1)}(v) \;\; 2 \textstyle\sum_{w \in N(v)} F^{(t-1)}(w) \;\; |N(v)| \;\; 0 \;\; P'(v)\Big]\Big) = \Big[F^{(t-1)}(v) + 2 \textstyle\sum_{w \in N(v)} F^{(t-1)}(w) \;\; 0 \;\; |N(v)| \;\; 0 \;\; P'(v)\Big],$$
where we add the second column to the first column and set the elements of the second column to zero.
Finally, we define
$$\begin{aligned}
f_{\mathrm{FFN}}\Big(\Big[F^{(t-1)}(v) + 2 \textstyle\sum_{w \in N(v)} F^{(t-1)}(w) \;\; 0 \;\; |N(v)| \;\; 0 \;\; P'(v)\Big]\Big)
&= \Big[\mathrm{FFN}\Big(F^{(t-1)}(v) + 2 \textstyle\sum_{w \in N(v)} F^{(t-1)}(w)\Big) \;\; 0 \;\; |N(v)| \;\; 0 \;\; P'(v)\Big] \\
&= \Big[F^{(t)}(v) \;\; 0 \;\; \deg'(v) \;\; P'(v)\Big],
\end{aligned}$$
where FFN denotes the FFN in Equation (20), from which the last equality follows. Clearly, applying $f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\mathrm{deg}} \circ f_{\times 2}$ to each row $i$ of $X^{(t-1)}_{\mathrm{upd}}$ results in
$$f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\mathrm{deg}} \circ f_{\times 2}\big(X^{(t-1)}_{\mathrm{upd}}\big) = \big[F^{(t)} \;\; 0 \;\; D_{\mathrm{emb}} \;\; P'\big].$$
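The post-processing chain above can be checked numerically. In the sketch below the toy graph and random features are our own choices: starting from the attention output $D^{-1}A(G)F$ and the stored degrees, the steps corresponding to $f_{\times 2}$, $f_{\mathrm{deg}}$, and $f_{\mathrm{add}}$ recover $F + 2A(G)F$, i.e., exactly the aggregation of Lemma 27 that $\mathrm{FFN}_{\mathrm{final}}$ then relabels.

```python
# Illustrative only: recovering F + 2*A*F from the head output D^{-1} A F.
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
F = rng.normal(size=(3, 2))                       # current color representations F^(t-1)
deg = A.sum(axis=1, keepdims=True)

attn_out = np.linalg.inv(np.diag(deg[:, 0])) @ A @ F   # what the single head computes

x = 2.0 * attn_out                                # f_x2 : multiply the aggregate by two
x = deg * x                                       # f_deg: multiply by |N(v)| (stored degrees)
x = F + x                                         # f_add: add the node's own representation
print(np.allclose(x, F + 2 * A @ F))              # True: matches Lemma 27's update input
```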
$$g(x_i) := m^{-i},$$
and an $M$-sequence of length $l$ given by $x^{(1)}, \dots, x^{(l)}$ with positions $i^{(1)}, \dots, i^{(l)}$ in $S$, the sum
$$\sum_{j \in [l]} g\big(x^{(j)}\big) = \sum_{j \in [l]} m^{-i^{(j)}}$$
is unique for each unique multiset $M$.
Proof. By assumption, let $M \subseteq S$ denote a multiset of order $m - 1$. Further, let $x^{(1)}, \dots, x^{(l)} \in M$ be an $M$-sequence with positions $i^{(1)}, \dots, i^{(l)}$ in $S$. Given our fixed ordering of the numbers in $S$, we can equivalently write $M = ((a_1, x_1), \dots, (a_n, x_n))$, where $a_i$ denotes the multiplicity of the $i$-th number in $M$ with position $i$ from our ordering over $S$. Note that for a number $m^{-i}$ there exists a corresponding $m$-ary number written as
$$0.0 \dots \underset{i}{1} \dots,$$
with a 1 at the $i$-th digit. Hence, the sum $\sum_{j \in [l]} m^{-i^{(j)}}$ corresponds to the $m$-ary number
$$0.a_1 \dots a_n.$$
Note that $a_i = 0$ if and only if there exists no $j$ such that $i^{(j)} = i$. Since the order of $M$ is $m - 1$, it holds that $a_i < m$. Hence, it follows that the above sum is unique for each unique multiset $M$, implying the result.
Recall that $S \subseteq \mathbb{N}$ and that we fixed an arbitrary ordering over $S$. Intuitively, we use the finiteness of $S$ to map each number therein to a fixed digit of the numbers in $(0, 1)$. The finite $m$ ensures that at each digit, we have sufficient "bandwidth" to encode each $a_i$. Now that we have seen how to encode multisets over $S$ as numbers in $(0, 1)$, we review some fundamental operations on the $m$-ary numbers defined above. We will refer to the decimal number $m^{-i}$ as corresponding to the $m$-ary number
$$0.0 \dots \underset{i}{1} \dots,$$
where the $i$-th digit after the decimal point is 1 and all other digits are 0, and vice versa.
To begin with, addition between decimal numbers implements counting in $m$-ary notation: adding $m^{-i}$ increments the $i$-th digit, provided every digit stays below $m$. We used counting in the proof of the previous result to represent the multiplicities of a multiset. Next, multiplication between decimal numbers implements shifting in $m$-ary notation. Shifting further applies to general decimal numbers in $(0, 1)$: let $x \in (0, 1)$ correspond to an $m$-ary number with $l$ digits, $0.a_1 \dots a_l$. Then,
$$m^{-i} \cdot x \quad\text{corresponds to}\quad 0.\underbrace{0 \dots 0}_{i}\, a_1 \dots a_l,$$
i.e., the digits $a_1 \dots a_l$ now occupy positions $i+1, \dots, i+l$.
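A small sketch of this encoding; the ground set, base $m$, and helper functions below are our own choices for illustration: the base-$m$ digits of $\sum_j m^{-i^{(j)}}$ are exactly the multiplicities, so the encoding is injective for bounded multiplicities, and multiplying by $m^{-s}$ shifts the digit block.

```python
# Toy version of the m-ary multiset encoding and of shifting by multiplication.
from collections import Counter
from fractions import Fraction

S = [10, 20, 30, 40]                 # the finite ground set, with a fixed ordering
m = 8                                # multiplicities are assumed to be < m

def encode(multiset):
    pos = {x: i + 1 for i, x in enumerate(S)}            # position i of each element
    return sum(Fraction(1, m ** pos[x]) * c for x, c in Counter(multiset).items())

def digits(q, length):
    out = []
    for _ in range(length):                              # read off base-m digits of q
        q *= m
        out.append(int(q)); q -= int(q)
    return out

ms = [20, 20, 40, 10, 20]
code = encode(ms)
print(digits(code, len(S)))          # [1, 3, 0, 1] = multiplicities of 10, 20, 30, 40
print(digits(code * Fraction(1, m ** 2), len(S) + 2))    # shifted two digits to the right
```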
where $g^{(0)}(u)$ is initialized consistent with $\ell$, i.e., consistent with the atomic type, such that for all $u, v \in V(G)^k$,
$$C^k_0(u) = C^k_0(v) \iff g^{(0)}(u) = g^{(0)}(v).$$
Proof. Before we start, let us recall the relabeling computed by the $k$-WL for a $k$-tuple $u$ as
$$C^k_t(u) := \mathrm{RELABEL}\Big(C^k_{t-1}(u),\, M^{(t)}_1(u), \dots, M^{(t)}_k(u)\Big),$$
with
$$M^{(t)}_j(u) := \{\!\{ C^k_{t-1}\big(\phi_j(u, w)\big) \mid w \in V(G) \}\!\}.$$
To show our result, we show that there exist scalars $\beta_1, \dots, \beta_k$ such that the $m$-ary representations computed for $C^k_t(u)$ and $M^{(t)}_1(u), \dots, M^{(t)}_k(u)$ are pairwise unique. To this end, we show that a weighted sum can represent multiset counting in different exponent ranges of $m$-ary numbers in $(0, 1)$. We then simply invoke Lemma 29 to show that we can map each unique tuple $\big(C^k_t(u), M^{(t)}_1(u), \dots, M^{(t)}_k(u)\big)$ to a unique number in $(0, 1)$. Finally, the FFN is responsible for the relabeling.
Each possible color in $C^k_t$ is a unique number in $[n^k]$, as the maximum possible number of unique colors in $C^k_t$ is $n^k$. We then fix an arbitrary ordering over $[n^k]$.
We show the statement via induction over $t$. Let $m > 0$ be such that $m - 1$ is the order of the multiset $M^{(0)}_j(u)$. Note that this is the same $m$ for each $u$. For a $k$-tuple $u$ with initial color $C^k_0(u)$ at position $i$, we choose $g^{(0)}$ so as to approximate $m^{-i}$ arbitrarily close, i.e.,
$$\big\lVert g^{(0)}(u) - m^{-i} \big\rVert_F < \epsilon,$$
for an arbitrarily small $\epsilon > 0$. By choosing $\epsilon$ small enough, we have that $g^{(0)}(u)$ is unique for every unique position $i$ and hence Equation (22) holds for all pairs of $k$-tuples for $t = 0$. Note that by construction, $i \leq n^k$ and hence, for the tuple $u$ at position $i$, the $m$-ary number corresponding to $m^{-i}$ is non-zero in at most the first $n^k$ digits.
For the inductive case, we assume that
$$C^k_{t-1}(u) = C^k_{t-1}(v) \iff g^{(t-1)}(u) = g^{(t-1)}(v),$$
and that $g^{(t-1)}(u)$ approximates the $m$-ary number encoding the color $C^k_{t-1}(u)$ up to an arbitrarily small $\epsilon > 0$. Let now $u \in V(G)^k$. We say that the $j$-neighbors of $u$ with respect to $w$ have indices $l^{(1)}(w), \dots, l^{(k)}(w)$ in our ordering. Then, $\sum_{w \in V(G)} m^{-l^{(j)}(w)}$ is unique for each unique
$$M^{(t)}_j(u) = \{\!\{ C^k_{t-1}\big(\phi_j(u, w)\big) \mid w \in V(G) \}\!\}. \tag{23}$$
We now separate the contributions of the different $j \in [k]$ by setting
$$\beta_j := m^{-(n^k) \cdot j},$$
for $j \in [k]$. Specifically, let $a_1, \dots, a_{n^k}$ denote the multiplicities of the multiset $M^{(t)}_j(u)$ and let
$$0.a_1 \dots a_{n^k}$$
denote the $m$-ary number corresponding to $\sum_{w \in V(G)} m^{-l^{(j)}(w)}$.
Note that since each $m^{-l^{(j)}(w)}$ corresponds to a color under the $k$-WL at iteration $t - 1$, all digits after the $n^k$-th digit are zero. Then, multiplying the above sum with $\beta_j$ results in a shift in $m$-ary notation and, hence, the $m$-ary number that corresponds to the term
$$\beta_j \cdot \sum_{w \in V(G)} m^{-l^{(j)}(w)} \tag{24}$$
can be written as
$$0.\underbrace{0 \dots 0}_{n^k \cdot j}\, \underbrace{a_1 \dots a_{n^k}}_{n^k \cdot j + 1, \dots, n^k \cdot j + n^k}\, \underbrace{0 \dots 0}_{n^k \cdot (j+1) + 1, \dots}.$$
As a result, for all $l \neq j$, the non-zero digits of the $m$-ary number that corresponds to
$$\beta_l \cdot \sum_{w \in V(G)} m^{-l^{(l)}(w)}$$
do not collide with the non-zero digits of the output of Equation (24) and hence, the sum
$$\sum_{j \in [k]} \beta_j \cdot \sum_{w \in V(G)} m^{-l^{(j)}(w)} \tag{25}$$
is unique for each unique tuple of multisets $\big(M^{(t)}_1(u), \dots, M^{(t)}_k(u)\big)$.
Let $u_i$ be the $i$-th tuple in our fixed but arbitrary ordering. Then, $g^{(t-1)}(u_i)$ approximates the number $m^{-i}$ arbitrarily close. Note that by the induction hypothesis, the $m$-ary number that corresponds to $m^{-i}$ is non-zero in at most the first $n^k$ digits, as $n^k$ is the maximum possible number of colors attainable under the $k$-WL. Since the smallest shift in Equation (25) is by $n^k$ digits (for $j = 1$) and since the $m$-ary number that corresponds to $m^{-i}$ is non-zero in at most the first $n^k$ digits, $m^{-i}$ and the sum in Equation (25) have no intersecting non-zero digits. As a result,
$$m^{-i} + \sum_{j \in [k]} \beta_j \cdot \sum_{w \in V(G)} m^{-l^{(j)}(w)} \tag{26}$$
is unique for each unique tuple $\big(C^k_{t-1}(u_i), M^{(t)}_1(u_i), \dots, M^{(t)}_k(u_i)\big)$.
Now, by the induction hypothesis, we can approximate each $m^{-i}$ with $g^{(t-1)}(u_i)$ arbitrarily close. Further, we can approximate each $m^{-l^{(j)}(w)}$ with $g^{(t-1)}\big(\phi_j(u, w)\big)$ arbitrarily close. As a result, we can approximate Equation (26) with
$$g^{(t-1)}(u) + \sum_{j \in [k]} \beta_j \cdot \sum_{w \in V(G)} g^{(t-1)}\big(\phi_j(u, w)\big)$$
arbitrarily close. Finally, since for finite graphs of order $n$ there exists only a finite number of such tuples, there exists a continuous function mapping the output of Equation (26) to $m^{-i}$, where $i$ now denotes the position of the color $C^k_t(u)$ in $[n^k]$, for all $u \in V(G)^k$. We approximate this function with the FFN arbitrarily close and obtain the result.
Corollary 31. Let $G = (V(G), E(G), \ell)$ be an $n$-order (node-)labeled graph. Then, for each $k$-tuple $u := (u_1, \dots, u_k)$ and each $t > 0$, we can equivalently express the coloring of the $\delta$-$k$-WL as
$$C^{\delta,k}_t(u) := \mathrm{RELABEL}\Big(C^{\delta,k}_{t-1}(u),\, \{\!\{ C^{\delta,k}_{t-1}(\phi_1(u, w)) \mid w \in N(u_1) \}\!\},\, \{\!\{ C^{\delta,k}_{t-1}(\phi_1(u, w)) \mid w \in V(G) \setminus N(u_1) \}\!\},\, \dots,\, \{\!\{ C^{\delta,k}_{t-1}(\phi_k(u, w)) \mid w \in N(u_k) \}\!\},\, \{\!\{ C^{\delta,k}_{t-1}(\phi_k(u, w)) \mid w \in V(G) \setminus N(u_k) \}\!\}\Big).$$
With the above statement, we can directly derive a $\delta$-variant of Proposition 30. To this end, let $\Delta_j(u)$ denote the set of vertices adjacent to the $j$-th node in $u$.
Corollary 32. Let $G = (V(G), E(G), \ell)$ be an $n$-order (vertex-)labeled graph and assume a vertex feature matrix $F \in \mathbb{R}^{n \times d}$ that is consistent with $\ell$. Then, for all $t \geq 1$, there exist a function $g^{(t)}$ and scalars $\alpha_1, \dots, \alpha_k, \beta_1, \dots, \beta_k$ with
$$g^{(t)}(u) := \mathrm{FFN}\Big(g^{(t-1)}(u) + \sum_{j \in [k]} \Big(\alpha_j \sum_{w \in \Delta_j(u)} g^{(t-1)}\big(\phi_j(u, w)\big) + \beta_j \sum_{w \in V(G) \setminus \Delta_j(u)} g^{(t-1)}\big(\phi_j(u, w)\big)\Big)\Big),$$
such that $C^{\delta,k}_t(u) = C^{\delta,k}_t(v) \iff g^{(t)}(u) = g^{(t)}(v)$ for all $u, v \in V(G)^k$.
The proof is the same as for Proposition 30. Further, we can also recover the $\delta$-$k$-LWL variant of Proposition 30 by setting $\beta_j = 0$. Lastly, we again recover Proposition 30 by setting $\alpha_j = \beta_j$.
for all $k$-tuples $v, w \in V(G)^k$. If $C^k_t$ is the coloring function of the $k$-WL, it suffices that the structural embeddings of the $k$-GT are sufficiently node-identifying. Otherwise, we require the structural embeddings of the $k$-GT to be both sufficiently node- and adjacency-identifying.
Proof. Recall the tokenization of Equation (4), where $P$ are structural embeddings and atp is a learnable embedding of the atomic type. We further know from Lemma 26 that we can write
$$X^{(0,k)}(v) = \big[F'(v_1) \; \dots \; F'(v_k) \;\; 0 \;\; \deg'(v_1) \; \dots \; \deg'(v_k) \;\; P'(v_1) \; \dots \; P'(v_k) \;\; \mathrm{atp}'(v)\big]. \tag{28}$$
Further, $F'(v) \in \mathbb{R}^{p}$, $0 \in \mathbb{R}^{k^2 \cdot p}$, $\deg' : \mathbb{N} \to \mathbb{R}^{r}$, $\mathrm{atp}' : \mathbb{N} \to \mathbb{R}^{o}$, and $P' \in \mathbb{R}^{s}$, for some choice of $p, r, s, o$ with $d = k^2 \cdot p + k \cdot p + k \cdot r + k \cdot s + o$, as specified in Lemma 26. We will use this parameterization of $X^{(0,k)}$ throughout the rest of the proof.
We prove our statement by induction over iteration $t$. For the base case, notice that the initial color of a tuple $v$ depends on the atomic type and the node labeling. In Equation (28), we encode the atomic type with $\mathrm{atp}'(v)$ and the node labels by concatenating the features $F'(v_1), \dots, F'(v_k)$ of the $k$ nodes $v_1, \dots, v_k$ in $v$. The concatenation of both node labels and atomic type is clearly injective, and so the base case follows. For the induction step, we aggregate over the (non-)neighborhoods of the tuple via
$$H\big(X^{(t-1,k)}(u_i)\big) := \sum_{j \in [k]} \Big(\alpha_j \sum_{w \in \Delta_j(u_i)} F^{(t-1)}\big(\phi_j(u_i, w)\big) + \beta_j \sum_{w \in V(G) \setminus \Delta_j(u_i)} F^{(t-1)}\big(\phi_j(u_i, w)\big)\Big),$$
for $\alpha_1, \dots, \alpha_k, \beta_1, \dots, \beta_k \in \mathbb{R}$ that we pick such that $H\big(X^{(t-1,k)}(u_i)\big)$ is equivalent to the neural architecture in Corollary 32. Note that we obtain the $k$-WL for $\alpha_j = \beta_j$ for all $j \in [k]$; then,
$$H\big(X^{(t-1,k)}(u_i)\big) = \sum_{j \in [k]} \beta_j \sum_{w \in V(G)} F^{(t-1)}\big(\phi_j(u_i, w)\big).$$
Hence, it remains to show that the standard transformer layer can update tuple representations according to Equation (30). To this end, we will require $2k$ transformer heads $h^1_1, \dots, h^1_k, h^2_1, \dots, h^2_k$ in each layer. Specifically, in the first $k$ heads, we want to compute
$$h^1_j\big(X^{(t-1,k)}(u_i)\big) = \alpha_j \cdot \sum_{w \in \Delta_j(u_i)} F^{(t-1)}\big(\phi_j(u_i, w)\big), \tag{31}$$
and in the remaining $k$ heads, we want to compute
$$h^2_j\big(X^{(t-1,k)}(u_i)\big) = \beta_j \cdot \sum_{w \in V(G) \setminus \Delta_j(u_i)} F^{(t-1)}\big(\phi_j(u_i, w)\big). \tag{32}$$
In both of the above cases, for a head $h^\gamma_j$, $\gamma \in \{-1, 1\}$ denotes the type of head (where $h^1_j$ corresponds to $\gamma = 1$ and $h^2_j$ to $\gamma = -1$), and $j \in [k]$ denotes the $j$-neighbors the head aggregates over. For the head dimension, we set $d_v = d$.
For each $j$, recall the definition of the standard transformer head at tuple $u_i$ at position $i$ as
$$h^\gamma_j\big(X^{(t-1,k)}(u_i)\big) = \operatorname{softmax}\Big(\big[Z^{(t-1,\gamma)}_{i1} \; \dots \; Z^{(t-1,\gamma)}_{in^k}\big]\Big) \begin{bmatrix} X^{(t-1)}(u_1) \\ \vdots \\ X^{(t-1)}(u_{n^k}) \end{bmatrix} W^V,$$
where
$$Z^{(t-1,\gamma)}_{il} := \frac{1}{\sqrt{d_k}} \big(X^{(t-1,k)}(u_i) W^{Q,\gamma}\big)\big(X^{(t-1,k)}(u_l) W^{K,\gamma}\big)^{\mathsf{T}} \in \mathbb{R} \tag{33}$$
is the unnormalized attention score between tuples $u_i$ and $u_l$, with
$$W^{Q,\gamma} = \begin{bmatrix} W^Q_F \\ W^Q_Z \\ W^Q_D \\ W^{Q,\gamma}_P \\ W^Q_A \end{bmatrix}, \qquad W^{K,\gamma} = \begin{bmatrix} W^K_F \\ W^K_Z \\ W^K_D \\ W^{K,\gamma}_P \\ W^K_A \end{bmatrix}, \qquad W^V = \begin{bmatrix} W^V_F \\ W^V_Z \\ W^V_D \\ W^V_P \\ W^V_A \end{bmatrix},$$
where only the sub-matrices $W^{Q,\gamma}_P$ and $W^{K,\gamma}_P$ differ for different $\gamma$. We now specify projection matrices $W^{Q,\gamma}$, $W^{K,\gamma}$, and $W^V$ in a way that allows the attention head $h^\gamma_j$ to dynamically recover the $j$-neighborhood adjacency as well as the adjacency between $j$-neighbors in the attention matrix. To this end, in heads $h^1_j$ and $h^2_j$ we set
$$W^Q_F = W^K_F = 0, \quad W^Q_Z = W^K_Z = W^V_Z = 0, \quad W^V_P = 0, \quad W^Q_D = W^K_D = W^V_D = 0, \quad W^Q_A = W^K_A = W^V_A = 0.$$
The remaining non-zero sub-matrices are $W^V_F$, $W^{Q,\gamma}_P$, and $W^{K,\gamma}_P$, which we define next.
We begin by defining $W^{Q,\gamma}_P$ and $W^{K,\gamma}_P$. Specifically, we want to choose $W^{Q,\gamma}_P$ and $W^{K,\gamma}_P$ such that the attention matrix in head $h^\gamma_j$ can approximate the generalized adjacency matrix $A^{(k,j,\gamma)}$ in Equation (11). To this end, we simply invoke Lemma 13, guaranteeing that there exist $W^{Q,\gamma}_P$ and $W^{K,\gamma}_P$ such that for each $\epsilon > 0$ and each $\gamma \in \{-1, 1\}$,
$$\Big\lVert \operatorname{softmax}\Big(\big[Z^{(t-1,\gamma)}_{i1} \; \dots \; Z^{(t-1,\gamma)}_{in^k}\big]\Big) - \tilde{A}^{(k,j,\gamma)}_i \Big\rVert_F < \varepsilon,$$
i.e., we can approximate the generalized adjacency matrix arbitrarily close for each $k > 1$, $j \in [k]$, and $\gamma \in \{-1, 1\}$. In the following, for clarity of presentation, we assume
$$\operatorname{softmax}\Big(\big[Z^{(t-1,\gamma)}_{i1} \; \dots \; Z^{(t-1,\gamma)}_{in^k}\big]\Big) = \tilde{A}^{(k,j,\gamma)}_i,$$
although we only approximate it arbitrarily close. However, by choosing $\varepsilon$ small enough, we can still approximate the matrix $X^{(t,k)}$, see below, arbitrarily close. We now set
$$W^V = \big[I_{n \times kp} \;\; 0\big],$$
where $I_{n \times kp}$ denotes the first $n$ rows of the $kp$-by-$kp$ identity matrix if $kp > n$, or else the first $kp$ columns of the $n$-by-$n$ identity matrix, and $0 \in \mathbb{R}^{n \times (d - kp)}$ is an all-zero matrix. The above yields
$$h^\gamma_j\big(X^{(t-1,k)}(u_i)\big) = \gamma\, \big[\tilde{A}^{(k,j,\gamma)}_{i1} \; \dots \; \tilde{A}^{(k,j,\gamma)}_{in^k}\big] \begin{bmatrix} F^{(t-1)}(u_1) \\ \vdots \\ F^{(t-1)}(u_{n^k}) \end{bmatrix} = \frac{\gamma}{d^\gamma_{ij}} \cdot \begin{cases} \sum_{w \in \Delta_j(u_i)} F^{(t-1)}\big(\phi_j(u_i, w)\big) & \gamma = 1 \\[4pt] \sum_{w \in V(G) \setminus \Delta_j(u_i)} F^{(t-1)}\big(\phi_j(u_i, w)\big) & \gamma = -1, \end{cases}$$
where
$$d^\gamma_{ij} = \begin{cases} d_{ij} & \gamma = 1 \\ n - d_{ij} & \gamma = -1. \end{cases}$$
We then choose the output projection $W^O$ accordingly, i.e., we leave the first $kp$ diagonal elements zero and then fill the next $2k$ diagonal elements of $W^O$ with the $\alpha_j$ and $\beta_j$, so that
$$\big[h^1_1(X^{(t-1,k)})(u_i) \; \dots \; h^1_k(X^{(t-1,k)})(u_i) \;\; h^2_1(X^{(t-1,k)})(u_i) \; \dots \; h^2_k(X^{(t-1,k)})(u_i)\big]\, W^O = \Big[\,0 \;\; \tfrac{\alpha_1}{d_{i1}}\textstyle\sum_{w \in \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; \tfrac{\alpha_k}{d_{ik}}\textstyle\sum_{w \in \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; \tfrac{\beta_1}{n - d_{i1}}\textstyle\sum_{w \in V(G) \setminus \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; \tfrac{\beta_k}{n - d_{ik}}\textstyle\sum_{w \in V(G) \setminus \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; 0\,\Big],$$
where the first zero vector is in $\mathbb{R}^{kp}$ and the second zero vector is in $\mathbb{R}^{d - k(r+s) + o}$. We now define the vector
$$\tilde{X}(u_i) = \big[F^{(t-1)}(u_i) \;\; 0 \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\big] + \Big[\,0 \;\; \tfrac{\alpha_1}{d_{i1}}\textstyle\sum_{w \in \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; \tfrac{\beta_k}{n - d_{ik}}\textstyle\sum_{w \in V(G) \setminus \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; 0\,\Big]$$
$$= \Big[F^{(t-1)}(u_i) \;\; \tfrac{\alpha_1}{d_{i1}}\textstyle\sum_{w \in \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; \tfrac{\alpha_k}{d_{ik}}\textstyle\sum_{w \in \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; \tfrac{\beta_1}{n - d_{i1}}\textstyle\sum_{w \in V(G) \setminus \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; \tfrac{\beta_k}{n - d_{ik}}\textstyle\sum_{w \in V(G) \setminus \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\Big] \in \mathbb{R}^{d}.$$
For convenience, we define
$$\tilde{F}_\alpha(u_i) := \Big[\tfrac{\alpha_1}{d_{i1}}\textstyle\sum_{w \in \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; \tfrac{\alpha_k}{d_{ik}}\textstyle\sum_{w \in \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w))\Big]$$
and
$$\tilde{F}_\beta(u_i) := \Big[\tfrac{\beta_1}{n - d_{i1}}\textstyle\sum_{w \in V(G) \setminus \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; \tfrac{\beta_k}{n - d_{ik}}\textstyle\sum_{w \in V(G) \setminus \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w))\Big].$$
We additionally define
$$\deg'(u_j) = \big[d_{ij} \;\; n - d_{ij}\big] \in \mathbb{R}^{2},$$
where $d_{ij}$ is again the degree of node $u_j$ in the $k$-tuple $u_i = (u_1, \dots, u_k)$. Note that $\tilde{X}(u_i)$ represents all the information we feed into the final FFN. Specifically, we obtain the updated representation of tuple $u_i$ as
$$X^{(t,k)}(u_i) = \mathrm{FFN}\big(\tilde{X}(u_i)\big),$$
which then implies the proof statement. It is worth pausing at this point to remind ourselves what each element in $\tilde{X}(u_i)$ represents. To this end, we use PyTorch-like array slicing, i.e., for a vector $x$, $x[a : b]$ corresponds to the sub-vector of length $b - a$ for $b > a$ containing the $(a + 1)$-st to $b$-th element of $x$. E.g., for a vector $x = (x_1, x_2, x_3, x_4, x_5)^{\mathsf{T}}$, $x[1 : 4] = (x_2, x_3, x_4)^{\mathsf{T}}$. Now, we interpret $\tilde{X}(u_i)$ by its sub-vectors. Concretely,
1. $\tilde{X}(u_i)[0 : kp] \in \mathbb{R}^{kp}$ corresponds to the previous color representation $F^{(t-1)}(u_i)$ of tuple $u_i$.
2. $\tilde{F}_\alpha(u_i)[(j-1)kp : jkp] \in \mathbb{R}^{kp}$ corresponds to the degree-normalized sum over adjacent $j$-neighbors $\frac{\alpha_j}{d_{ij}} \sum_{w \in \Delta_j(u_i)} F^{(t-1)}(\phi_j(u_i, w))$.
3. $\tilde{F}_\beta(u_i)[(j-1)kp : jkp] \in \mathbb{R}^{kp}$ corresponds to the degree-normalized sum over non-adjacent $j$-neighbors $\frac{\beta_j}{n - d_{ij}} \sum_{w \in V(G) \setminus \Delta_j(u_i)} F^{(t-1)}(\phi_j(u_i, w))$.
4. $\deg'(u_i)[2(j-1)] = d_{ij}$, where $d_{ij}$ is the degree of node $u_j$ in the $k$-tuple $u_i = (u_1, \dots, u_k)$.
5. $\deg'(u_i)[2(j-1) + 1] = n - d_{ij}$, where $d_{ij}$ is the degree of node $u_j$ in the $k$-tuple $u_i = (u_1, \dots, u_k)$.
To this end, we show that there exists a sequence of functions $f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\mathrm{deg}}$ such that
$$X^{(t,k)}(u_i) = f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\mathrm{deg}}\big(\tilde{X}(u_i)\big) = \big[F^{(t)}(u_i) \;\; 0 \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\big],$$
Concretely, we define
$$f_{\mathrm{deg}}\big(\tilde{X}(u_i)\big) = \Big[F^{(t-1)}(u_i) \;\; d_{i1} \cdot \tfrac{\alpha_1}{d_{i1}}\textstyle\sum_{w \in \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; d_{ik} \cdot \tfrac{\alpha_k}{d_{ik}}\textstyle\sum_{w \in \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; (n - d_{i1}) \cdot \tfrac{\beta_1}{n - d_{i1}}\textstyle\sum_{w \in V(G) \setminus \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; (n - d_{ik}) \cdot \tfrac{\beta_k}{n - d_{ik}}\textstyle\sum_{w \in V(G) \setminus \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\Big],$$
where $f_{\mathrm{deg}}$ multiplies
1. $\tilde{F}_\alpha(u_i)[(j-1)kp : jkp]$ with $\deg'(u_i)[2(j-1)]$, and
2. $\tilde{F}_\beta(u_i)[(j-1)kp : jkp]$ with $\deg'(u_i)[2(j-1) + 1]$.
We then define
$$F_\alpha(u_i) := \Big[\alpha_1 \cdot \textstyle\sum_{w \in \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; \alpha_k \cdot \textstyle\sum_{w \in \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w))\Big]$$
and
$$F_\beta(u_i) := \Big[\beta_1 \cdot \textstyle\sum_{w \in V(G) \setminus \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; \beta_k \cdot \textstyle\sum_{w \in V(G) \setminus \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w))\Big].$$
Next, we define
$$f_{\mathrm{add}}\Big(f_{\mathrm{deg}}\big(\tilde{X}(u_i)\big)\Big) = \Big[F^{(t-1)}(u_i) + \textstyle\sum_{j \in [k]} \big(F_\alpha(u_i)[(j-1)kp : jkp] + F_\beta(u_i)[(j-1)kp : jkp]\big) \;\; 0 \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\Big],$$
where $f_{\mathrm{add}}$ sums up $F^{(t-1)}(u_i)$ with $F_\alpha(u_i)[(j-1)kp : jkp]$ and $F_\beta(u_i)[(j-1)kp : jkp]$ for each $j$. Afterwards, $f_{\mathrm{add}}$ sets
$$f_{\mathrm{deg}}\big(\tilde{X}(u_i)\big)[kp : 2k^2 p + kp] = 0 \in \mathbb{R}^{2k^2 p}.$$
Now, finally, note from the definitions of $F_\alpha(u_i)$ and $F_\beta(u_i)$ that
$$\sum_{j \in [k]} \big(F_\alpha(u_i)[(j-1)kp : jkp] + F_\beta(u_i)[(j-1)kp : jkp]\big) = H\big(X^{(t-1,k)}(u_i)\big),$$
where FFN denotes the FFN in Equation (30), from which the last equality follows. Thank you for sticking around until the
end. This completes the proof.
The above proof shows that the k-GT can simulate the δ-k-WL. Since the δ-k-WL is strictly more expressive than the k-WL
(Morris et al., 2020), Theorem 5 follows directly from Theorem 4.
Further, the above proof shows that the $k$-GT can simulate the $\delta$-$k$-LWL. If we simulate the $\delta$-$k$-LWL with the $k$-GT while only using token embeddings for tuples in $V(G)^k_s$, Theorem 6 directly follows.