Aligning Transformers With Weisfeiler–Leman

arXiv:2406.03148v1 [cs.LG] 5 Jun 2024

Department of Computer Science, RWTH Aachen University, Germany. Correspondence to: Luis Müller <[email protected] aachen.de>.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Graph neural network architectures aligned with the k-dimensional Weisfeiler–Leman (k-WL) hierarchy offer theoretically well-understood expressive power. However, these architectures often fail to deliver state-of-the-art predictive performance on real-world graphs, limiting their practical utility. While recent works aligning graph transformer architectures with the k-WL hierarchy have shown promising empirical results, employing transformers for higher orders of k remains challenging due to a prohibitive runtime and memory complexity of self-attention as well as impractical architectural assumptions, such as an infeasible number of attention heads. Here, we advance the alignment of transformers with the k-WL hierarchy, showing stronger expressivity results for each k, making them more feasible in practice. In addition, we develop a theoretical framework that allows the study of established positional encodings such as Laplacian PEs and SPE. We evaluate our transformers on the large-scale PCQM4Mv2 dataset, showing competitive predictive performance with the state of the art and demonstrating strong downstream performance when fine-tuning them on small-scale molecular datasets.

1. Introduction

Message-passing graph neural networks (GNNs) are the de-facto standard in graph learning (Gilmer et al., 2017; Kipf & Welling, 2017; Scarselli et al., 2009; Xu et al., 2019a). However, due to their purely local mode of aggregating information, they suffer from limited expressivity in distinguishing non-isomorphic graphs in terms of the 1-dimensional Weisfeiler–Leman algorithm (1-WL) (Morris et al., 2019; Xu et al., 2019a). Hence, recent works (Azizian & Lelarge, 2021; Maron et al., 2019a; Morris et al., 2020; 2023; 2022) introduced higher-order GNNs, aligned with the k-dimensional Weisfeiler–Leman (k-WL) hierarchy for graph isomorphism testing (Cai et al., 1992), resulting in more expressivity with an increase in the order k > 1. The k-WL hierarchy draws from a rich history in graph theory (Babai, 1979; Cai et al., 1992; Grohe, 2017; Weisfeiler & Leman, 1968), offering a deep theoretical understanding of k-WL-aligned GNNs. In contrast, graph transformers (GTs) (Glickman & Yahav, 2023; He et al., 2023; Ma et al., 2023; Müller et al., 2024; Rampášek et al., 2022; Ying et al., 2021) recently demonstrated state-of-the-art empirical performance. However, they draw their expressive power mostly from positional or structural encodings (PEs in the following), making it challenging to understand these models in terms of an expressivity hierarchy such as the k-WL. In addition, their empirical success relies on modifying the attention mechanism (Glickman & Yahav, 2023; Ma et al., 2023; Ying et al., 2021) or using additional message-passing GNNs (He et al., 2023; Rampášek et al., 2022).

To still benefit from the theoretical power offered by the k-WL, previous works (Kim et al., 2021; 2022) aligned transformers with k-IGNs, showing that transformer layers can approximate invariant linear layers (Maron et al., 2019b). Crucially, Kim et al. (2022) designed pure transformers, requiring no architectural modifications of the standard transformer and instead drawing their expressive power from an appropriate tokenization, i.e., the encoding of a graph as a set of input tokens. Pure transformers provide several benefits over graph transformers that use message passing or modified attention, such as being directly applicable to innovations for transformer architectures, most notably Performers (Choromanski et al., 2021) and Flash Attention (Dao et al., 2022) to reduce the runtime or memory demands of transformers, as well as the Perceiver (Jaegle et al., 2021), enabling multi-modal learning. Unfortunately, the framework of Kim et al. (2022) does not allow for a feasible transformer with provable expressivity strictly greater than the 1-WL due to an O(n^6) runtime complexity and the requirement of 203 attention heads resulting from their alignment with IGNs. This poses the question of whether a more feasible hierarchy of pure transformers exists.
Figure 1: Overview of our theoretical results, aligning transformers with the established k-WL hierarchy, relating our transformers (upper row) to 1-WL, (2, 1)-WL, (3, 1)-WL, 3-WL, ..., k-WL (lower row). Forward arrows point to more powerful algorithms or neural architectures. A ⊏ B (A ⊑ B, A ≡ B) denotes that algorithm A is strictly more powerful than (at least as powerful as, equally powerful as) B. The relations between the boxes in the lower row stem from Cai et al. (1992) and Morris et al. (2022).
Present work Here, we offer theoretical improvements for pure transformers. Our guiding question is

Can we design a hierarchy of increasingly expressive pure transformers that are also feasible in practice?

In this work, we make significant progress toward answering this question in the affirmative. First, in Section 3, we improve the expressivity of pure transformers for every order k > 0, while keeping the same runtime complexity as Kim et al. (2022), by aligning transformers closely with the Weisfeiler–Leman hierarchy. Secondly, we demonstrate the benefits of this close alignment by showing that our results directly yield transformers aligned with the (k, s)-WL hierarchy, recently proposed to increase the scalability of higher-order variants of the Weisfeiler–Leman algorithm (Morris et al., 2022). These transformers then form a hierarchy of increasingly expressive pure transformers. We show in Section 4 that our transformers can be naturally implemented with established node-level PEs such as the Laplacian PEs in Kreuzer et al. (2021). Lastly, we show in Section 5 that our transformers are feasible in practice and have more expressivity than the 1-WL. In particular, we obtain very close to state-of-the-art results on the large-scale PCQM4Mv2 dataset (Hu et al., 2021) and further show that our transformers are highly competitive with GNNs when fine-tuned on small-scale molecular benchmarks (Hu et al., 2020a). See Figure 1 for an overview of our theoretical results and Table 1 for a comparison of our pure transformers with the transformers in Kim et al. (2021) and Kim et al. (2022).

While theoretically intriguing, higher-order GNNs often fail to deliver state-of-the-art performance on real-world problems (Azizian & Lelarge, 2021; Morris et al., 2020; 2022), making theoretically grounded architectures less relevant in practice.

Graph transformers with higher-order expressive power are Graphormer-GD (Zhang et al., 2023) as well as the higher-order graph transformers in Kim et al. (2021) and Kim et al. (2022). However, Graphormer-GD is less expressive than the 3-WL (Zhang et al., 2023). Further, Kim et al. (2021) and Kim et al. (2022) align transformers with k-IGNs and, thus, obtain the theoretical expressive power of the corresponding k-WL. However, they do not empirically evaluate their transformers for k > 2. For k = 2, Kim et al. (2022) propose TokenGT, a pure transformer with n + m tokens per graph, where n is the number of nodes and m is the number of edges. This transformer has 2-WL expressivity, which, however, is the same as 1-WL expressivity (Morris et al., 2023). From a theoretical perspective, the transformers in Kim et al. (2022) still require impractical assumptions such as bell(2k) attention heads, where bell(2k) is the 2k-th Bell number (Maron et al., 2019a), resulting in 203 attention heads for a transformer with a provable expressive power strictly stronger than the 1-WL. In addition, Kim et al. (2022) introduce special encodings called node and type identifiers that are theoretically necessary but, as argued in Appendix A.5 of Kim et al. (2022), not ideal in practice. For an overview of the Weisfeiler–Leman hierarchy in graph learning, see Morris et al. (2023).
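To make the head-count requirement above concrete, note how quickly the Bell numbers grow: bell(2) = 2, bell(4) = 15, and bell(6) = 203, the latter corresponding to the 203 heads mentioned above. The following short sketch is not part of the paper; it simply computes Bell numbers via the Bell triangle.

import math  # only used implicitly; the computation below is pure integer arithmetic

# A minimal sketch (not from the paper): Bell numbers via the Bell triangle,
# illustrating the bell(2k) head-count requirement of IGN-aligned transformers.
def bell(n: int) -> int:
    """Return the n-th Bell number, with bell(0) = 1."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

# k = 1 -> bell(2) = 2, k = 2 -> bell(4) = 15, k = 3 -> bell(6) = 203.
print([bell(2 * k) for k in (1, 2, 3)])  # [2, 15, 203]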
to each node. For convenience of notation, we always assume an arbitrary but fixed ordering over the nodes such that each node corresponds to a number in [n]. Further, A(G) ∈ {0, 1}^{n×n} denotes the adjacency matrix, where A(G)_{ij} = 1 if, and only if, nodes i and j share an edge. We also construct a node feature matrix F ∈ R^{n×d} that is consistent with ℓ, i.e., for nodes i and j in V(G), F_i = F_j if, and only if, ℓ(i) = ℓ(j). Note that, for a finite subset of N, we can always construct F, e.g., with a one-hot encoding of the initial colors. We call pairs of nodes (i, j) ∈ V(G)^2 node pairs or 2-tuples. For a matrix X ∈ R^{n×d}, whose i-th row represents the embedding of node v ∈ V(G) in a graph G, we also write X_v ∈ R^d or X(v) ∈ R^d to denote the row corresponding to node v. Further, we define a learnable embedding as a function mapping a subset S ⊂ N to a d-dimensional Euclidean space, parameterized by a neural network, e.g., a neural network applied to a one-hot encoding of the elements in S. Moreover, ||·||_F denotes the Frobenius norm. Finally, we refer to some transformer architecture's concrete set of parameters, including its tokenization, as a parameterization. See Appendix C for a complete description of our notation. Further, see Appendix E.1 for a formal description of the k-WL and its variants.

We initialize n token embeddings X^{(0,1)} ∈ R^{n×d} as

    X^{(0,1)} := F + P,    (1)

where we call P ∈ R^{n×d} structural embeddings, encoding structural information for each token. For node v, we define

    P(v) := FFN(deg(v) + PE(v)),    (2)

where deg : V(G) → R^d is a learnable embedding of the node degree, PE : V(G) → R^d is a node-level PE such as the Laplacian PE (Kreuzer et al., 2021) or SPE (Huang et al., 2024), and FFN : R^d → R^d is a multi-layer perceptron. For the PE, we require that it enables us to distinguish whether two nodes share an edge in G, which we formally define as follows.

Definition 1 (Adjacency-identifying). Let G = (V(G), E(G), ℓ) be a graph with n nodes. Let X^{(0,1)} ∈ R^{n×d} denote the initial token embeddings according to Equation (1) with structural embeddings P ∈ R^{n×d}. Let further

    P̃ := (1/√d_k) (PW_Q)(PW_K)^T,

where W_Q, W_K ∈ R^{d×d}. Then P is adjacency-identifying if there exist W_Q, W_K such that for any row i and column j,

    P̃_{ij} = max_k P̃_{ik},

if, and only if, A(G)_{ij} = 1. Further, an approximation Q ∈ R^{n×d} of P is sufficiently adjacency-identifying if ||P − Q||_F < ϵ, for any ϵ > 0.
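To make the tokenization of Equations (1) and (2) concrete, the following sketch is not from the paper; it assumes NumPy and treats FFN, the degree embedding, and the node-level PE as placeholder callables.

import numpy as np

# A minimal sketch of Equations (1) and (2); ffn, deg_embed and pe are placeholders.
def node_tokens(F, A, pe, ffn, deg_embed):
    """F: (n, d) node features, A: (n, n) adjacency, pe: (n, d) node-level PE,
    ffn: R^d -> R^d applied row-wise, deg_embed: maps an integer degree to R^d."""
    degrees = A.sum(axis=1).astype(int)                   # node degrees
    deg = np.stack([deg_embed(x) for x in degrees])       # (n, d) degree embeddings
    P = np.stack([ffn(row) for row in deg + pe])          # Eq. (2): P(v) = FFN(deg(v) + PE(v))
    return F + P                                          # Eq. (1): X^(0,1) = F + P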
where FFN is again a feed-forward neural network and F_i^{(t)} contains the representation of node i ∈ V(G) at layer t. Contrast the above to the update of a 1-GT layer with a single head, given as

    X^{(t,1)} := FFN(X^{(t-1,1)} + softmax(X̃^{(t-1,1)}) X^{(t-1,1)} W_V),

where

    X̃^{(t-1,1)} := (1/√d_k) (X^{(t-1,1)} W_Q)(X^{(t-1,1)} W_K)^T

denotes the unnormalized attention matrix at layer t. If we now set W_V = 2I, where I is the identity matrix, we obtain

    X^{(t,1)} := FFN(X^{(t-1,1)} + 2 · softmax(X̃^{(t-1,1)}) X^{(t-1,1)}).

At this point, if we could reconstruct the adjacency matrix with softmax(X̃^{(t-1,1)}), we could simulate the 1-WL-expressive GNN of Equation (3) with a single attention head. However, the attention matrix is right-stochastic, meaning its rows sum to 1. Unfortunately, this is not expressive enough to reconstruct the adjacency matrix, which generally is not right-stochastic. Thus, we aim to reconstruct the row-normalized adjacency matrix Ã(G) := D^{-1} A(G), where D ∈ R^{n×n} is the diagonal degree matrix, such that D_{ii} is the degree of node i. Indeed, Ã(G) is right-stochastic. Further, element-wise multiplication of row i of Ã(G) with the degree of i recovers A(G). As we show, adjacency-identifying PEs are, in fact, sufficient for softmax(X̃^{(t-1,1)}) to approximate Ã(G) arbitrarily closely. We show that then, we can use the degree embeddings deg(i) to de-normalize the i-th row of Ã(G) and obtain

    X^{(t,1)} = FFN(X^{(t-1,1)} + 2 A(G) X^{(t-1,1)}).

Showing that a transformer can simulate the GNN in Grohe (2021) implies the connection to the 1-WL. We formally prove the above in the following theorem, showing that the 1-GT can simulate the 1-WL; see Appendix H for proof details.

Theorem 2. Let G = (V(G), E(G), ℓ) be a labeled graph with n nodes and F ∈ R^{n×d} be a node feature matrix consistent with ℓ. Further, let C_t^1 : V(G) → N denote the coloring function of the 1-WL at iteration t. Then, for all iterations t ≥ 0, there exists a parametrization of the 1-GT such that

    C_t^1(v) = C_t^1(w) ⇐⇒ X^{(t,1)}(v) = X^{(t,1)}(w),

for all nodes v, w ∈ V(G).

Having set the stage for theoretically aligned transformers, we now generalize the above to higher orders k > 1.

3.2. Transformers with k-WL expressive power

Here, we propose a tokenization for a standard transformer, operating on n^k tokens, that is strictly more expressive than the k-WL, for an arbitrary but fixed k > 1. Subsequently, to make the architecture more practical, we reduce the number of tokens while remaining strictly more expressive than the 1-WL.

Again, we consider a labeled graph G = (V(G), E(G), ℓ) with n nodes and feature matrix F ∈ R^{n×d}, consistent with ℓ. Intuitively, to surpass the limitations of the 1-WL, the k-WL colors ordered subgraphs instead of single nodes. More precisely, the k-WL colors the tuples from V(G)^k for k ≥ 2 instead of the nodes. Of central importance to our tokenization are consistency with the initial coloring of the k-WL and recovering the adjacency information between k-tuples imposed by the k-WL, both of which we describe hereafter; see Appendix E for a formal definition of the k-WL.

The initial color of a k-tuple v := (v_1, ..., v_k) ∈ V(G)^k under the k-WL depends on its atomic type and the labels ℓ(v_1), ..., ℓ(v_k). Intuitively, the atomic type describes the structural dependencies between the elements of a tuple. We can represent the atomic type of v by a k × k matrix K over {1, 2, 3}. That is, the entry K_{ij} is 1 if (v_i, v_j) ∈ E(G), 2 if v_i = v_j, and 3 otherwise; see Appendix C for a formal definition of the atomic type. Hence, to encode the atomic type as a real vector, we can learn an embedding from the set of k × k matrices to R^e, where e > 0 is the embedding dimension of the atomic type. For a tuple v, we denote this embedding with atp(v). Apart from the initial colors for tuples, the k-WL also imposes a notion of adjacency between tuples. Concretely, we define

    ϕ_j(v, w) := (v_1, ..., v_{j-1}, w, v_{j+1}, ..., v_k),

i.e., ϕ_j(v, w) replaces the j-th component of the tuple v with the vertex w. We say that two tuples are adjacent or j-neighbors if they differ in the j-th component (or are equal, in the case of self-loops).

Now, to construct token embeddings consistent with the initial colors under the k-WL, we initialize n^k token embeddings X^{(0,k)} ∈ R^{n^k × d}, one for each k-tuple. For order k and embedding dimension d of the tuple-level tokens, we first compute node-level tokens X^{(0,1)} as in Equation (1) and, in particular, the structural embeddings P(v) for each node v, with an embedding dimension of d, and then concatenate node-level embeddings along the embedding dimension to construct tuple-level embeddings, which are then projected down to fit the embedding dimension d. Specifically, we define the token embedding of a k-tuple v = (v_1, ..., v_k) as

    X^{(0,k)}(v) := [X^{(0,1)}(v_i)]_{i=1}^k W + atp(v),    (4)

where W ∈ R^{d·k×d} is a projection matrix. Intuitively, the above construction ensures that the token embeddings respect the initial node colors and the atomic type. Analogously to
our notion of adjacency-identifying, we use the structural embeddings P(v_i) in each node embedding X^{(0,1)}(v_i) to identify the j-neighborhood adjacency between tuples v and w in the j-th attention head. To reconstruct the j-neighborhood adjacency, we show that it is sufficient to identify the nodes in each tuple-level token. Hence, we define the following requirements for the structural embeddings P.

Definition 3 (Node-identifying). Let G = (V(G), E(G), ℓ) be a graph with n nodes. Let X^{(0,k)} ∈ R^{n^k × d} denote the initial token embeddings according to Equation (4) with structural embeddings P ∈ R^{n×d}. Let further

    P̃ := (1/√d_k) (PW_Q)(PW_K)^T,

where W_Q, W_K ∈ R^{d×d}. Then P is node-identifying if there exist W_Q, W_K such that for any row i and column j,

    P̃_{ij} = max_k P̃_{ik},

if, and only if, i = j. Further, an approximation Q ∈ R^{n×d} of P is sufficiently node-identifying if

    ||P − Q||_F < ϵ,

for any ϵ > 0.

As we will show, the above requirement allows the attention to distinguish whether two tuples share the same nodes, intuitively, by counting the number of node-to-node matches between two tuples, which is sufficient to determine whether the tuples are j-neighbors. For structural embeddings P to be node-identifying, it suffices, for example, that P has an orthogonal sub-matrix.

In Section 4, we show that structural embeddings are also sufficiently node-identifying when using LPE or SPE (Huang et al., 2024) as node-level PEs. It should also be mentioned that our definition of node-identifying structural embeddings generalizes the node identifiers in Kim et al. (2022), i.e., their node identifiers are node-identifying. Still, structural embeddings exist that are (sufficiently) node-identifying and do not qualify as node identifiers in Kim et al. (2022).

We call a standard transformer using the above tokenization with sufficiently node- and adjacency-identifying structural embeddings the k-GT. We then show the connection of the k-GT to the k-WL, resulting in the following theorem; see Appendix I for proof details.

Theorem 4. Let G = (V(G), E(G), ℓ) be a labeled graph with n nodes and k ≥ 2, and let F ∈ R^{n×d} be a node feature matrix consistent with ℓ. Let C_t^k denote the coloring function of the k-WL at iteration t. Then, for all iterations t ≥ 0, there exists a parametrization of the k-GT such that

    C_t^k(v) = C_t^k(w) ⇐⇒ X^{(t,k)}(v) = X^{(t,k)}(w)

for all k-tuples v and w ∈ V(G)^k.

Note that, similar to Theorem 2, the above theorem gives a lower bound on the expressivity of the k-GT. We now show that the k-GT is strictly more powerful than the k-WL by showing that the k-GT can also simulate the δ-k-WL, a k-WL variant that, for each j-neighbor ϕ_j(v, w), additionally considers whether (v_j, w) ∈ E(G) (Morris et al., 2020). That is, the δ-k-WL distinguishes for each j-neighbor whether the replaced node and the replacing node share an edge in G. Morris et al. (2020) showed that the δ-k-WL is strictly more expressive than the k-WL. We use this to show that the k-GT is strictly more expressive than the k-WL, implying the following result; see again Appendix I for proof details.

Theorem 5. For k > 1, the k-GT is strictly more expressive than the k-WL.

Note that the k-GT uses n^k tokens. However, for larger k and large graphs, the number of tokens might present an obstacle in practice. Luckily, our close alignment of the k-GT with the k-WL hierarchy allows us to directly benefit from a recent result in Morris et al. (2022), reducing the number of tokens while maintaining an expressivity guarantee. Specifically, Morris et al. (2022) define the set of (k, s)-tuples V(G)^k_s as

    V(G)^k_s := {v ∈ V(G)^k | #comp(G[v]) ≤ s},

where #comp(H) denotes the number of connected components in subgraph H, G[v] denotes the ordered subgraph of G induced by the nodes in v, and s ≤ k is a hyper-parameter. Morris et al. (2022) then define the (k, s)-LWL as a k-WL variant that only considers the tuples in V(G)^k_s and additionally only uses a subset of adjacent tuples; see Morris et al. (2022) and Appendix E for a detailed description of the (k, s)-WL. Fortunately, the runtime of the (k, s)-LWL depends only on k, s, and the sparsity of the graph, resulting in a more efficient and scalable k-WL variant. We can now directly translate this modification to our token embeddings. Specifically, we call the k-GT using only the token embeddings X^{(0,k)}(v) where v ∈ V(G)^k_s the (k, s)-GT.

Now, Morris et al. (2022, Theorem 1) show that, for k > 1, the (k + 1, 1)-LWL is strictly stronger than the (k, 1)-LWL. Further, note that the (1, 1)-LWL is equal to the 1-WL. Using this result, we can prove the following; see Appendix I for proof details.

Theorem 6. For all k > 1, the (k, 1)-GT is strictly more expressive than the (k − 1, 1)-GT. Further, the (2, 1)-GT is strictly more expressive than the 1-WL.

Note that the (2, 1)-GT only requires O(n + m) tokens, where m is the number of edges, and consequently has a runtime complexity of O((n + m)^2), the same as TokenGT. Hence, the (2, 1)-GT improves the expressivity result of TokenGT, which is shown to have 1-WL expressive power (Kim et al., 2022), while having the same runtime complexity.
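To illustrate the token reduction, the following sketch is not from the paper; assuming NetworkX, it enumerates V(G)^k_s by brute force. For k = 2 and s = 1 on a simple graph this yields exactly the n diagonal tuples (v, v) plus the 2m ordered pairs of adjacent nodes, i.e., O(n + m) tokens.

import itertools
import networkx as nx

# A minimal sketch (not from the paper) of the (k, s)-tuple set V(G)^k_s:
# k-tuples whose induced subgraph has at most s connected components.
def ks_tuples(G, k, s):
    tuples = []
    for v in itertools.product(G.nodes, repeat=k):
        if nx.number_connected_components(G.subgraph(set(v))) <= s:
            tuples.append(v)
    return tuples

G = nx.path_graph(4)                  # 4 nodes, 3 edges
print(len(ks_tuples(G, k=2, s=1)))    # 4 + 2 * 3 = 10 tokens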
Table 1: Comparison of our theoretical results with Kim et al. (2021) and Kim et al. (2022). Highlighted are aspects in which our results improve over Kim et al. (2022). Here, n denotes the number of nodes, and m denotes the number of edges. We denote with ⊏ A that the transformer is strictly more expressive than algorithm A. Note that for pure transformers, the squared complexity can be relaxed to linear complexity when applying a linear attention approximation such as the Performer (Choromanski et al., 2021).
In summary, our alignment of standard transformers with the k-WL hierarchy has brought forth a variety of improved theoretical results for pure transformers, providing a strict increase in provable expressive power for every order k > 0 compared to previous works. See Table 1 for a direct comparison of our results with Kim et al. (2021) and Kim et al. (2022). Most importantly, Theorem 6 leads to a hierarchy of increasingly expressive pure transformers, taking one step closer to answering our guiding question in Section 1. In what follows, we make our transformers feasible in practice.

4. Implementation details

Here, we discuss the implementation details of pure transformers. In particular, we demonstrate that Laplacian PEs (Kreuzer et al., 2021) are sufficient to be used as node-level PEs for our transformers. Afterward, we introduce order transfer, a way to transfer model weights between different orders of the WL hierarchy; see Appendix A for additional implementation details.

4.1. Node and adjacency-identifying PEs

Here, we develop PEs based on Laplacian eigenvectors and -values that are both sufficiently node- and adjacency-identifying, avoiding computing separate PEs for node and adjacency identification; see Appendix A.1 for details about the graph Laplacian and some background on PEs based on the spectrum of the graph Laplacian.

Let λ := (λ_1, ..., λ_l)^T denote the vector of the l smallest, possibly repeated, eigenvalues of the graph Laplacian for some graph with n nodes, and let V ∈ R^{n×l} be a matrix such that the i-th column of V is the eigenvector corresponding to eigenvalue λ_i. We will now briefly introduce LPE, based on Kreuzer et al. (2021), as well as SPE from Huang et al. (2024), and show that these node-level PEs are node- and adjacency-identifying.

LPE Here, we define a slight generalization of the Laplacian PEs in Kreuzer et al. (2021) that enables sufficiently node- and adjacency-identifying PEs. Concretely, we define LPE as

    LPE(V, λ) = ρ([ϕ(V_1^T, λ + ϵ) ... ϕ(V_n^T, λ + ϵ)]),    (5)

where ϵ ∈ R^l is a learnable, zero-initialized vector that we introduce to show our result. Here, ϕ : R^2 → R^d is an FFN that is applied row-wise and ρ : R^{l×d} → R^d is a permutation-equivariant neural network, applied row-wise. One can recover the Laplacian PEs in Kreuzer et al. (2021) by setting ϵ = 0 and implementing ρ as first performing a sum over the first dimension of its input, resulting in a d-dimensional vector, and then subsequently applying an FFN to obtain a d-dimensional output vector. Note that then, Equation (5) forms a DeepSet (Zaheer et al., 2017) over the set of eigenvector components, paired with their (ϵ-perturbed) eigenvalues. While slightly deviating from the Laplacian PEs defined in Kreuzer et al. (2021), the ϵ vector is necessary to ensure that no spectral information is lost when applying ρ.

We will now show that LPE is sufficiently node- and adjacency-identifying. Further, this result holds irrespective of whether the Laplacian is normalized. As a result, LPE can be used with the 1-GT, k-GT, and (k, s)-GT; see Appendix F.3 for proof details.

Theorem 7. Structural embeddings with LPE as node-level PE are sufficiently node- and adjacency-identifying.
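As an illustration of Equation (5), the following sketch is not the paper's implementation; it assumes NumPy, instantiates ρ as the sum-then-FFN (DeepSet-style) variant described above, and treats ϕ and the final FFN as placeholder callables.

import numpy as np

# A minimal sketch of Equation (5); phi and rho_ffn stand in for learned FFNs.
def lpe(V, lam, eps, phi, rho_ffn):
    """V: (n, l) eigenvectors as columns, lam: (l,) eigenvalues,
    eps: (l,) learnable perturbation, phi: R^2 -> R^d, rho_ffn: R^d -> R^d."""
    out = []
    for i in range(V.shape[0]):
        # Pair each eigenvector component of node i with its (perturbed) eigenvalue.
        pairs = np.stack([V[i, :], lam + eps], axis=1)       # (l, 2)
        phi_out = np.stack([phi(p) for p in pairs])          # (l, d)
        out.append(rho_ffn(phi_out.sum(axis=0)))             # sum over l, then FFN
    return np.stack(out)                                     # (n, d) node-level PEs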
SPE We also show that SPE (Huang et al., 2024) is sufficiently node- and adjacency-identifying. Huang et al. (2024) define the SPE encodings as

    SPE(V, λ) = ρ([V diag(ϕ_1(λ)) V^T ... V diag(ϕ_n(λ)) V^T]),

where ϕ_1, ..., ϕ_n : R^l → R^l are equivariant FFNs and ρ : R^{n×l} → R^d is a permutation-equivariant neural network,
Table 2: Comparison of the (2, 1)-GT to Graphormer and TokenGT. Results on PCQM4Mv2 over a single random seed, as well as CS and PHOTO over 7 random seeds, where we also report the standard deviation. Results for baselines are taken from Kim et al. (2022). We highlight the best and second-best model on each dataset.

Model             PCQM4Mv2 Validation MAE ↓   CS Accuracy ↑    PHOTO Accuracy ↑
Graphormer        0.0864                      0.791 ± 0.015    0.894 ± 0.004
TokenGT           0.0910                      0.903 ± 0.004    0.949 ± 0.007
(2, 1)-GT + LPE   0.0870                      0.924 ± 0.008    0.933 ± 0.013
(2, 1)-GT + SPE   0.0888                      0.920 ± 0.002    0.933 ± 0.011

Table 3: Effect of pre-training on fine-tuning performance on ALCHEMY (12K). Pre-trained weights for both the (2, 1)-GT and the (3, 1)-GT are taken from the (2, 1)-GT model in Table 2. We report mean and standard deviation over 3 random seeds. Baseline results for SignNet, BasisNet, and SPE are taken from Huang et al. (2024). We highlight the best, second-best, and third-best model.

Model             Pre-trained   ALCHEMY (12K) MAE ↓
SignNet           ✗             0.113 ± 0.002
BasisNet          ✗             0.110 ± 0.001
SPE               ✗             0.108 ± 0.001
(2, 1)-GT + LPE   ✗             0.124 ± 0.001
(2, 1)-GT + LPE   ✓             0.101 ± 0.001
(3, 1)-GT + LPE   ✓             0.114 ± 0.001
(2, 1)-GT + SPE   ✗             0.112 ± 0.000
(2, 1)-GT + SPE   ✓             0.103 ± 0.002
(3, 1)-GT + SPE   ✓             0.108 ± 0.001

based on the eigenvectors and -values of the graph Laplacian: SignNet, BasisNet, and SPE (Huang et al., 2024); see Table 3 for results. Here, we find that even without pre-training, the (2, 1)-GT already performs well on ALCHEMY. Most notably, the (2, 1)-GT with SPE without pre-training already performs on par with SignNet. Nonetheless, we observe significant improvements through pre-training the (2, 1)-GT. Most notably, fine-tuning the pre-trained (2, 1)-GT with LPE or SPE beats all GNN baselines. Interestingly, training the (2, 1)-GT with SPE from scratch results in much better performance than training the (2, 1)-GT with LPE from scratch. However, once pre-trained, the (2, 1)-GT performs better with LPE than with SPE. Moreover, we observe no improvements when performing order transfer to the (3, 1)-GT. However, the (3, 1)-GT with LPE and pre-trained weights from the (2, 1)-GT beats SignNet (Lim et al., 2023), a strong GNN baseline. Further, the (3, 1)-GT with SPE performs on par with SPE in Huang et al. (2024), the best of our GNN baselines. Interestingly, order transfer improves over the (2, 1)-GT trained from scratch for both LPE and SPE. As a result, we hypothesize that LPE and SPE provide sufficient expressivity for this task but that pre-training is required to fully leverage their potential. Further, we hypothesize that the added (3, 1)-GT tokens lead to overfitting.

We conclude that pre-training can help our transformers' downstream performance. Further, order transfer is a promising technique for fine-tuning on downstream tasks, particularly those that benefit from higher-order representations. However, its benefits might be nullified in the presence of sufficiently expressive node-level PEs in combination with large-scale pre-training.

Molecular classification Here, we evaluate whether fine-tuning a large pre-trained transformer can be competitive with a strong GNN baseline on five small-scale molecular classification tasks, namely BBBP, BACE, CLINTOX, TOX21, and TOXCAST (Hu et al., 2020a). The number of molecules in any of these datasets is lower than 10K, making them an ideal choice to benchmark whether large-scale pre-trained models such as the (2, 1)-GT can compete with a task-specific GNN. Specifically, following Tönshoff et al. (2023), we aim for a fair comparison by carefully designing and hyper-parameter-tuning a GINE model (Xu et al., 2019b) with residual connections, batch normalization, dropout, GELU non-linearities and, most crucially, the same node-level PEs as the (2, 1)-GT. We followed Hu et al. (2020b) in choosing the GINE layer over other GNN layers due to its guaranteed 1-WL expressivity. It is worth noting that our GINE with LPE encodings significantly outperforms the GIN in Hu et al. (2020b) without pre-training on all five datasets, demonstrating the quality of this baseline. Interestingly, GINE with SPE as node-level PEs underperforms against the GIN in Hu et al. (2020b) on all but one dataset. In addition, we report the best pre-trained model for each task in Hu et al. (2020b). We fine-tune the (2, 1)-GT for 5 epochs and train the GINE model for 45 epochs. In addition, we also train the (2, 1)-GT for 45 epochs from scratch to study the impact of pre-training on the molecular classification tasks. Note that we do not pre-train our GINE model, as we want to evaluate whether a large pre-trained transformer can beat a small GNN model trained from scratch; see Table 4 for results.

First, we find that the pre-trained (2, 1)-GT with SPE is better than the corresponding GINE with SPE on all five datasets. The pre-trained (2, 1)-GT with LPE is better than GINE with LPE on all but one dataset. Moreover, the (2, 1)-GT with LPE as well as the (2, 1)-GT with SPE are each better than or on par with the best pre-trained GIN in Hu et al. (2020b) on two out of five datasets. When studying the impact of pre-training, we observe that pre-training leads mostly to performance improvements.
Table 4: Fine-tuning results on small OGB molecular datasets compared to a fully-equipped and -tuned GIN. Results over 6 random seeds. We highlight the best, second-best, and third-best model for each dataset. Ties are broken by smaller standard deviation. For comparison, we also report pre-training results from Hu et al. (2020b).

Table 5: Results on the BREC benchmark over a single seed. Baseline results are taken from Wang & Zhang (2023). We highlight the best model for each PE and category (excluding the 3-WL, which is not a model and merely serves as a reference point). We additionally report the results of Graphormer from Wang & Zhang (2023).

Model        PE    Basic   Regular   Extension   CFI   All
GINE         LPE   19      20        37          3     79
(2, 1)-GT    LPE   36      28        51          3     118
(3, 1)-GT    LPE   55      46        55          3     159
GINE         SPE   56      48        93          20    217
(2, 1)-GT    SPE   60      50        98          18    226
(3, 1)-GT    SPE   60      50        97          21    228
Graphormer   --    16      12        41          10    79
3-WL         --    60      50        100         60    270

The notable exception is on the TOXCAST dataset, where the pre-trained (2, 1)-GT underperforms even the GIN baselines without pre-training. Hence, we conclude that the pre-training/fine-tuning paradigm is viable for applying pure transformers such as the (2, 1)-GT to small-scale problems.

5.3. Expressivity tests

Since we only provide expressivity lower bounds, we empirically investigate the expressive power of the k-GT. To this end, we evaluate on the BREC benchmark, which offers fine-grained and scalable expressivity tests (Wang & Zhang, 2023). The benchmark comprises 400 graph pairs that range from being 1-WL to 4-WL indistinguishable and pose a challenge even to the most expressive models; see Table 5 for the expressivity results of the (2, 1)-GT and (3, 1)-GT on BREC. We mainly compare to our GINE baseline, Graphormer (Ying et al., 2021) as a graph transformer baseline, and the 3-WL as a potential expressivity upper bound. In addition, we compare our GINE model, the (2, 1)-GT, and the (3, 1)-GT for both LPE and SPE. We find that SPE encodings consistently lead to better performance. Further, we observe that increased expressivity also leads to better performance, as for both LPE and SPE, the (2, 1)-GT beats our GINE baseline, and the (3, 1)-GT beats the (2, 1)-GT across all tasks. Finally, we find that both the (2, 1)-GT and the (3, 1)-GT improve over Graphormer while still being outperformed by the 3-WL, irrespective of the choice of PE.

6. Conclusion

In this work, we propose a hierarchy of expressive pure transformers that are also feasible in practice. We improve existing pure transformers such as TokenGT (Kim et al., 2022) in several aspects, both theoretically and empirically. Theoretically, our hierarchy has stronger provable expressivity for each k and is more scalable. For example, our (2, 1)-GT and (3, 1)-GT have a provable expressivity strictly above the 1-WL but are also feasible in practice. Empirically, we verify our claims about practical feasibility and show that our transformers improve over existing pure transformers, closing the gap between pure transformers and graph transformers with strong graph inductive bias on the large-scale PCQM4Mv2 dataset. Further, we show that fine-tuning our pre-trained transformers is a feasible and resource-efficient approach for applying pure transformers to small-scale datasets. We discover that higher-order transformers can efficiently re-use pre-trained weights from lower-order transformers during fine-tuning, indicating a promising direction for utilizing higher-order expressivity to improve results on downstream tasks. Future work could explore aligning transformers to other variants of Weisfeiler–Leman. Finally, pre-training pure transformers on truly large-scale datasets such as those recently proposed by Beaini et al. (2024) could be a promising direction for graph transformers.
Bodnar, C., Frasca, F., Otter, N., Wang, Y. G., Liò, P., Montúfar, G., and Bronstein, M. M. Weisfeiler and Lehman go cellular: CW networks. In Advances in Neural Information Processing Systems, 2021.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.

Cai, J., Fürer, M., and Immerman, N. An optimal lower bound on the number of variables for graph identification. Combinatorica, 1992.

Glickman, D. and Yahav, E. Diffusing graph attention. ArXiv preprint, 2023.

Grohe, M. Descriptive Complexity, Canonisation, and Definable Graph Structure Theory. Cambridge University Press, 2017.

Grohe, M. The logic of graph neural networks. In Symposium on Logic in Computer Science, 2021.

He, X., Hooi, B., Laurent, T., Perold, A., LeCun, Y., and Bresson, X. A generalization of ViT/MLP-Mixer to graphs. In International Conference on Machine Learning, 2023.

Hendrycks, D. and Gimpel, K. Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. ArXiv preprint, 2016.
Horn, R. A. and Johnson, C. R. Matrix Analysis, 2nd Edition. Cambridge University Press, 2012.

Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open graph benchmark: Datasets for machine learning on graphs. In Advances in Neural Information Processing Systems, 2020a.

Hu, W., Liu, B., Gomes, J., Zitnik, M., Liang, P., Pande, V. S., and Leskovec, J. Strategies for pre-training graph neural networks. In International Conference on Learning Representations, 2020b.

Hu, W., Fey, M., Ren, H., Nakata, M., Dong, Y., and Leskovec, J. OGB-LSC: A large-scale challenge for machine learning on graphs. In NeurIPS: Datasets and Benchmarks Track, 2021.

Huang, Y., Lu, W., Robinson, J., Yang, Y., Zhang, M., Jegelka, S., and Li, P. On the stability of expressive positional encodings for graphs. In International Conference on Learning Representations, 2024.

Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., and Carreira, J. Perceiver: General perception with iterative attention. In International Conference on Machine Learning, 2021.

Kim, J., Oh, S., and Hong, S. Transformers generalize deepsets and can be extended to graphs & hypergraphs. In Advances in Neural Information Processing Systems, 2021.

Kim, J., Nguyen, D., Min, S., Cho, S., Lee, M., Lee, H., and Hong, S. Pure transformers are powerful graph learners. In Advances in Neural Information Processing Systems, 2022.

Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.

Kreuzer, D., Beaini, D., Hamilton, W. L., Létourneau, V., and Tossou, P. Rethinking graph transformers with spectral attention. In Advances in Neural Information Processing Systems, 2021.

Lim, D., Robinson, J. D., Zhao, L., Smidt, T. E., Sra, S., Maron, H., and Jegelka, S. Sign and basis invariant networks for spectral graph representation learning. In International Conference on Learning Representations, 2023.

Lipman, Y., Puny, O., and Ben-Hamu, H. Global attention improves graph networks generalization. ArXiv preprint, 2020.

Ma, L., Lin, C., Lim, D., Romero-Soriano, A., Dokania, P. K., Coates, M., Torr, P. H. S., and Lim, S.-N. Graph inductive biases in transformers without message passing. In International Conference on Machine Learning, 2023.

Malkin, P. N. Sherali–Adams relaxations of graph isomorphism polytopes. Discrete Optimization, 2014.

Maron, H., Ben-Hamu, H., Serviansky, H., and Lipman, Y. Provably powerful graph networks. In Advances in Neural Information Processing Systems, 2019a.

Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. Invariant and equivariant graph networks. In International Conference on Learning Representations, 2019b.

Maron, H., Fetaya, E., Segol, N., and Lipman, Y. On the universality of invariant networks. In International Conference on Machine Learning, 2019c.

Méndez-Lucio, O., Nicolaou, C. A., and Earnshaw, B. MolE: A molecular foundation model for drug discovery. ArXiv preprint, 2022.

Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. Weisfeiler and Leman go neural: Higher-order graph neural networks. In AAAI Conference on Artificial Intelligence, 2019.

Morris, C., Rattan, G., and Mutzel, P. Weisfeiler and Leman go sparse: Towards higher-order graph embeddings. In Advances in Neural Information Processing Systems, 2020.

Morris, C., Rattan, G., Kiefer, S., and Ravanbakhsh, S. SpeqNets: Sparsity-aware permutation-equivariant graph networks. In International Conference on Machine Learning, 2022.

Morris, C., Lipman, Y., Maron, H., Rieck, B., Kriege, N. M., Grohe, M., Fey, M., and Borgwardt, K. Weisfeiler and Leman go machine learning: The story so far. Journal of Machine Learning Research, 2023.

Müller, L., Galkin, M., Morris, C., and Rampášek, L. Attending to graph transformers. Transactions on Machine Learning Research, 2024.

Puny, O., Lim, D., Kiani, B. T., Maron, H., and Lipman, Y. Equivariant polynomials for graph neural networks. In International Conference on Machine Learning, 2023.

Rampášek, L., Galkin, M., Dwivedi, V. P., Luu, A. T., Wolf, G., and Beaini, D. Recipe for a general, powerful, scalable graph transformer. In Advances in Neural Information Processing Systems, 2022.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.

Shchur, O., Mumme, M., Bojchevski, A., and Günnemann, S. Pitfalls of graph neural network evaluation. ArXiv preprint, 2018.
Zhang, B., Luo, S., Wang, L., and He, D. Rethinking the expressive power of GNNs via graph biconnectivity. In International Conference on Learning Representations, 2023.
be the input tensor to ρ. Huang et al. (2024) propose to partition Q along the second axis into n matrices of size n × l and then
to apply a GIN to each of those n matrices in parallel where we use the adjacency matrix of the original graph. Concretely,
the GIN maps each n × l matrix to a matrix of shape n × d, where d is the output dimension of ρ. Finally, the output of
ρ is the sum of all n × d matrices. We adopt this implementation for our experiments. For prohibitively large graphs, we
propose the following modification to the above implementation of ρ. Specifically, we let V:m denote the matrix containing
as columns the eigenvectors corresponding to the m smallest eigenvalues. Then, we define
    Q_{:m} := [V diag(ϕ_1(λ)) V_{:m}^T ... V diag(ϕ_n(λ)) V_{:m}^T] ∈ R^{n×m×l}    (6)
Parameter                Value
Learning rate            2e-4
Weight decay             0.1
Attention dropout        0.1
Post-attention dropout   0.1
Batch size               256
# gradient steps         2M
# warmup steps           60K
Precision                bfloat16
where W ∈ R^{d(k²−k)/2 × d} is a learnable weight matrix projecting the concatenated edge embeddings to the target dimension d. To see that the above faithfully encodes the atomic type, recall the matrix K over {1, 2, 3}, determining the atomic type
d. To see that the above faithfully encodes the atomic type, recall the matrix K over {1, 2, 3}, determining the atomic type
that we introduced in Section 3.2. For undirected graphs, for all i ≥ j, Kij = 1 if E(vi , vj ) ̸= 0 and E(vi , vj ) ̸= s, Kij = 2
if E(vi , vj ) = s and Kij = 3 if E(vi , vj ) = 0. As a result, by including additional edge information into the k-GT, we
simultaneously obtain atomic type embeddings atp.
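To make the atomic-type matrix concrete, the following sketch is not from the paper; assuming NetworkX, it computes K for a k-tuple following the definition in Section 3.2, i.e., K_ij = 1 if (v_i, v_j) ∈ E(G), 2 if v_i = v_j, and 3 otherwise.

import numpy as np
import networkx as nx

# A minimal sketch (not from the paper) of the k x k atomic-type matrix K.
def atomic_type_matrix(G, v):
    k = len(v)
    K = np.full((k, k), 3, dtype=int)      # default: distinct, non-adjacent nodes
    for i in range(k):
        for j in range(k):
            if v[i] == v[j]:
                K[i, j] = 2                 # equal nodes
            elif G.has_edge(v[i], v[j]):
                K[i, j] = 1                 # adjacent nodes
    return K

G = nx.path_graph(3)                        # edges: (0, 1), (1, 2)
print(atomic_type_matrix(G, (0, 1, 1)))
# [[2 1 1]
#  [1 2 2]
#  [1 2 2]]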
C. Extended notation
The neighborhood of a vertex v in V (G) is denoted by N (v) := {u ∈ V (G) | (v, u) ∈ E(G)} and the degree of a vertex
v is |N (v)|. Two graphs G and H are isomorphic, and we write G ≃ H if there exists a bijection φ : V (G) → V (H)
preserving the adjacency relation, i.e., (u, v) is in E(G) if and only if (φ(u), φ(v)) is in E(H). Then φ is an isomorphism
between G and H. In the case of labeled graphs, we additionally require that ℓ(v) = ℓ(φ(v)) for v in V(G), and similarly for attributed graphs. We further define the atomic type atp : V(G)^k → N, for k > 0, such that atp(v) = atp(w) for v and w in V(G)^k if and only if the mapping φ : V(G)^k → V(G)^k where v_i ↦ w_i induces a partial isomorphism, i.e., we have v_i = v_j ⇐⇒ w_i = w_j and (v_i, v_j) ∈ E(G) ⇐⇒ (φ(v_i), φ(v_j)) ∈ E(G). Let M ∈ R^{n×p} and N ∈ R^{n×q} be two matrices; then

    [M N] ∈ R^{n×(p+q)}
denotes column-wise matrix concatenation. Further, let M ∈ R^{p×n} and N ∈ R^{q×n} be two matrices; then the vertical stacking of M on top of N,

    [M; N] ∈ R^{(p+q)×n},
denotes row-wise matrix concatenation. For a matrix X ∈ Rn×d , we denote with Xi the i-th row vector. In the case where
the rows of X correspond to nodes in a graph G, we use Xv to denote the row vector corresponding to the node v ∈ V (G).
D. Transformers
Here, we define the (standard) transformer (Vaswani et al., 2017), a stack of alternating blocks of multi-head attention and
fully-connected feed-forward networks. In each layer, t > 0, given token embeddings X(t−1) ∈ RL×d for L tokens, we
compute
    X^{(t)} := FFN(X^{(t-1)} + [h_1(X^{(t-1)}) ... h_M(X^{(t-1)})] W_O),    (7)
where [·] denotes column-wise concatenation of matrices, M is the number of heads, hi denotes the i-th transformer head,
WO ∈ RM dv ×d denotes a final projection matrix applied to the concatenated heads, and FFN denotes a feed-forward neural
network applied row-wise. We define the i-th head as
    h_i(X) := softmax((1/√d_k) (XW_{Q,i})(XW_{K,i})^T) XW_{V,i},    (8)
dk
where the softmax is applied row-wise and WQ,i , WK,i ∈ Rd×dk , WV,i ∈ Rd×dv , and dv and d are the head dimension
and embedding dimension, respectively. We omit layer indices t and optional bias terms for clarity.
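For illustration, the following compact sketch is not the paper's code; it assumes NumPy and omits the optional bias terms, as in the definition above. Per-head weights are passed in as (W_Q, W_K, W_V) triples, together with W_O and a feed-forward network ffn applied row-wise.

import numpy as np

# A minimal NumPy sketch of Equations (7) and (8).
def softmax(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)   # row-wise, numerically stable
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def head(X, W_Q, W_K, W_V):
    d_k = W_Q.shape[1]
    A = softmax((X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k))   # Eq. (8)
    return A @ (X @ W_V)

def transformer_layer(X, heads, W_O, ffn):
    H = np.concatenate([head(X, *w) for w in heads], axis=1)  # column-wise concat
    return ffn(X + H @ W_O)                                   # Eq. (7)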
E. Weisfeiler–Leman
Here, we discuss additional background for the Weisfeiler–Leman hierarchy. We begin by describing the Weisfeiler–Leman
algorithm, starting with the 1-WL. The 1-WL or color refinement is a well-studied heuristic for the graph isomorphism problem,
originally proposed by Weisfeiler & Leman (1968).2 Intuitively, the algorithm determines if two graphs are non-isomorphic
by iteratively coloring or labeling vertices. Formally, let G = (V, E, ℓ) be a labeled graph. In each iteration, t > 0, the 1-WL computes a vertex coloring C_t^1 : V(G) → N, depending on the coloring of the neighbors. That is, in iteration t > 0, we set

    C_t^1(v) := RELABEL(C_{t-1}^1(v), {{C_{t-1}^1(u) | u ∈ N(v)}}),

for all vertices v in V(G), where RELABEL injectively maps the above pair to a unique natural number, which has not been used in previous iterations. In iteration 0, the coloring is C_0^1 := ℓ. To test if two graphs G and H are non-isomorphic, we run
the above algorithm in “parallel” on both graphs. If the two graphs have a different number of vertices colored c in N at some
iteration, the 1-WL distinguishes the graphs as non-isomorphic. It is easy to see that the algorithm cannot distinguish all
non-isomorphic graphs (Cai et al., 1992). Several researchers, e.g., Babai (1979); Cai et al. (1992), devised a more powerful
generalization of the former, today known as the k-dimensional Weisfeiler–Leman algorithm (k-WL), operating on k-tuples
of vertices rather than single vertices.
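To complement the description above, the following compact sketch is not from the paper; assuming NetworkX, it runs 1-WL color refinement until the coloring is stable, realizing RELABEL by interning (old color, sorted neighbor colors) pairs.

import networkx as nx

# A minimal sketch of 1-WL color refinement (not the paper's implementation).
def wl1_colors(G, labels=None):
    colors = {v: (labels[v] if labels else 0) for v in G.nodes}
    while True:
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in G.neighbors(v))))
            for v in G.nodes
        }
        relabel = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {v: relabel[signatures[v]] for v in G.nodes}
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors          # refinement is stable
        colors = new_colors

# Two graphs are distinguished if their multisets of stable colors differ
# (run the refinement on their disjoint union to compare them).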
t = 0, the tuples v and w in V(G)^k get the same color if they have the same atomic type, i.e., C_0^k(v) := atp(v). Then, for each iteration, t > 0, C_t^k is defined by

    C_t^k(v) := RELABEL(C_{t-1}^k(v), M_t(v)),    (9)

with M_t(v) the multiset

    M_t(v) := ({{C_{t-1}^k(ϕ_1(v, w)) | w ∈ V(G)}}, ..., {{C_{t-1}^k(ϕ_k(v, w)) | w ∈ V(G)}}),    (10)

and where

    ϕ_j(v, w) := (v_1, ..., v_{j-1}, w, v_{j+1}, ..., v_k).

That is, ϕ_j(v, w) replaces the j-th component of the tuple v with the vertex w. Hence, two tuples are adjacent or j-neighbors if they are different in the j-th component (or equal, in the case of self-loops). Thus, two tuples v and w with the same color in iteration (t − 1) get different colors in iteration t if there exists a j in [k] such that the number of j-neighbors of v and w, respectively, colored with a certain color is different.

We run the k-WL algorithm until convergence, i.e., until for t in N,

    C_t^k(v) = C_t^k(w) ⇐⇒ C_{t+1}^k(v) = C_{t+1}^k(w),

for all v and w in V(G)^k, holds.
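As a concrete illustration of Equations (9) and (10), the following sketch is not the paper's implementation; it runs a fixed number of k-WL refinement rounds on an unlabeled graph, reusing the atomic_type_matrix sketch from Appendix B for the initial colors.

import itertools
import networkx as nx

# A minimal sketch of k-WL refinement over all k-tuples (not from the paper).
def kwl_colors(G, k, rounds):
    tuples = list(itertools.product(G.nodes, repeat=k))
    colors = {v: atomic_type_matrix(G, v).tobytes() for v in tuples}   # C_0^k = atp
    for _ in range(rounds):
        new = {}
        for v in tuples:
            M = tuple(
                tuple(sorted(colors[v[:j] + (w,) + v[j + 1:]] for w in G.nodes))
                for j in range(k)
            )                                   # Eq. (10): multisets over j-neighbors
            new[v] = (colors[v], M)             # Eq. (9): pair, relabeled below
        relabel = {sig: i for i, sig in enumerate(sorted(set(new.values())))}
        colors = {v: relabel[new[v]] for v in tuples}
    return colors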
Similarly to the 1-WL, to test whether two graphs G and H are non-isomorphic, we run the k-WL in “parallel” on both
graphs. Then, if the two graphs have a different number of vertices colored c, for c in N, the k-WL distinguishes the graphs as
non-isomorphic. By increasing k, the algorithm gets more powerful in distinguishing non-isomorphic graphs, i.e., for each
k ≥ 2, there are non-isomorphic graphs distinguished by (k + 1)-WL but not by k-WL (Cai et al., 1992). In the following, we
define some variants of the k-WL.
    A^{(k,j,γ)}_{il} :=
        { 1  if γ = 1  and ∃w ∈ V(G) : u_l = ϕ_j(u_i, w) ∧ adj(u_{ij}, w),
        { 1  if γ = −1 and ∃w ∈ V(G) : u_l = ϕ_j(u_i, w) ∧ ¬adj(u_{ij}, w),
        { 0  otherwise,    (11)

where u_i denotes the i-th k-tuple in a fixed but arbitrary ordering over V(G)^k, u_{ij} denotes the j-th node of u_i, and where γ controls whether the generalized adjacency matrix considers (γ = 0) all j-neighbors, (γ > 0) j-neighbors where the swapped nodes are adjacent in G, or (γ < 0) j-neighbors where the swapped nodes are not adjacent in G. With the generalized adjacency matrix as defined above, we can represent the local and global j-neighborhood adjacency defined for the δ-k-WL with A^{(k,j,1)} and A^{(k,j,−1)}, respectively. Further, the j-neighborhood adjacency of the k-WL can be described via A^{(k,j,1)} + A^{(k,j,−1)}, and the local j-neighborhood adjacency of the δ-k-LWL can be described with A^{(k,j,1)}.
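As an illustration of Equation (11), the following sketch is not from the paper; assuming NetworkX, it materializes A^{(k,j,γ)} for small graphs over a fixed ordering of V(G)^k (with j indexed from 0).

import itertools
import numpy as np
import networkx as nx

# A minimal sketch (not from the paper) of the generalized adjacency matrix.
def generalized_adjacency(G, k, j, gamma):
    tuples = list(itertools.product(G.nodes, repeat=k))
    index = {v: i for i, v in enumerate(tuples)}
    A = np.zeros((len(tuples), len(tuples)), dtype=int)
    for i, u in enumerate(tuples):
        for w in G.nodes:
            l = index[u[:j] + (w,) + u[j + 1:]]     # u_l = phi_j(u_i, w)
            adjacent = G.has_edge(u[j], w)           # adj(u_{ij}, w)
            if (gamma == 1 and adjacent) or (gamma == -1 and not adjacent):
                A[i, l] = 1
    return A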
F. Structural embeddings
Before we give the missing proofs from Section 4, we develop a theoretical framework to analyze structural embeddings.
for an arbitrary but fixed ϵ > 0. Let P be node or adjacency-identifying with projection matrices WQ , WK ∈ Rd×d and let
    P̃ = (1/√d_k) (PW_Q)(PW_K)^T, and
    Q̃ = (1/√d_k) (QW_Q)(QW_K)^T.
Then, there exists a monotonic strictly increasing function f such that
||P̃ − Q̃||F < f (ϵ).
Proof. Our goal is to show that if the error between P and Q is bounded, so is the error between P̃ and Q̃. We now show
that this error can be described by a monotonic strictly increasing function f , i.e., if ϵ1 < ϵ2 , then f (ϵ1 ) < f (ϵ2 ). We will
first prove the existence of f .
First note that we can write

    P̃ − Q̃ = (1/√d_k) · (PW_Q(PW_K − QW_K)^T + (PW_Q − QW_Q)(QW_K)^T).
Note further that we are guaranteed that

    ||P||_F > 0,    (12)
    ||W_Q||_F > 0,    (13)
    ||W_K||_F > 0,    (14)

since otherwise at least one of the above matrices is zero, in which case P̃ = 0, which is not node- or adjacency-identifying in general, a contradiction. Now we can write
    ||P̃ − Q̃||_F = (1/√d_k) · || PW_Q(PW_K − QW_K)^T + (PW_Q − QW_Q)(QW_K)^T ||_F
        (a) ≤ (1/√d_k) · ( ||PW_Q(PW_K − QW_K)^T||_F + ||(PW_Q − QW_Q)(QW_K)^T||_F )
        (b) ≤ (1/√d_k) · ( ||PW_Q||_F · ||(PW_K − QW_K)^T||_F + ||(QW_K)^T||_F · ||PW_Q − QW_Q||_F )
        (c) ≤ (1/√d_k) · ( ||PW_Q||_F · ||PW_K − QW_K||_F + ||QW_K||_F · ||PW_Q − QW_Q||_F )
        (d) ≤ (1/√d_k) · ( ||P||_F ||W_Q||_F · ||W_K||_F ||P − Q||_F + ||Q||_F ||W_K||_F ||W_Q||_F · ||P − Q||_F )
            = (1/√d_k) · ||W_Q||_F ||W_K||_F · ||P − Q||_F · ( ||P||_F + ||Q||_F )
        (e) < (ϵ/√d_k) · ||W_Q||_F ||W_K||_F · ( ||P||_F + ||Q||_F ).
Here, we used (a) the triangle inequality, (b) the Cauchy-Schwarz inequality, (c) the fact that for any matrix X, ||X||_F = ||X^T||_F, (d) again Cauchy-Schwarz, and finally (e) the lemma statement, namely that ||P − Q||_F < ϵ, combined with the facts in Equations (12), (13), and (14), guaranteeing that ||P̃ − Q̃||_F > 0. We set

    f(ϵ) := (ϵ/√d_k) · ||W_Q||_F ||W_K||_F · (||P||_F + ||Q||_F),

which is clearly strictly monotonically increasing, since the norms as well as 1/√d_k are non-negative and Equations (12), (13), and (14) ensure that f(ϵ) > 0. As a result, we can now write

    ||P̃ − Q̃||_F < f(ϵ).

This concludes the proof.
We use sufficiently node or adjacency-identifying matrices for approximately recovering weighted indicator matrices, which
we define next.
Definition 10 (Weighted indicators). Let x = (x1 , . . . , xn ) ∈ {0, 1}n be an n-dimensional binary vector. We call
    x̃ := x / (Σ_{i=1}^n x_i),
the weighted indicator vector of x. Further, let X ∈ {0, 1}n×n be a binary matrix. Let now X̃ ∈ Rn×n be a matrix such that
the i-th row of X̃ is the weighted indicator vector of the i-th row of X. We call X̃ the weighted indicator matrix of X.
The following lemma ties (sufficiently) node or adjacency-identifying matrices and weighted indicators together.
Lemma 11. Let P̃ ∈ R^{n×n} be a matrix and let X be a binary matrix with weighted indicator matrix X̃, such that for every i, j ∈ [n],

    P̃_{ij} = max_k P̃_{ik}

if, and only if, X_{ij} = 1. Then, for a matrix Q̃ ∈ R^{n×n} and all ϵ > 0, there exist a δ > 0 and a b > 0 such that if

    ||P̃ − Q̃||_F < δ,

then

    ||softmax(b · Q̃) − X̃||_F < ϵ,

where softmax is applied row-wise.
Proof. We begin by reviewing how the softmax acts on a vector z = (z1 , . . . , zn ) ∈ Rn . Let zmax := maxi zi be the
maximum value in z. Further, let x = (x1 , . . . , xn ) ∈ {0, 1}n be a binary vector such that
    x_i := 1 if z_i = z_max, and x_i := 0 otherwise.
Let us now generalize this to matrices. Specifically, let P̃ ∈ Rn×n be a matrix, let X be a binary matrix with weighted
indicator matrix X̃ such that for every i, j ∈ [n],
which follows from the fact that the softmax is applied independently to each row and each row of X̃ is a weighted indicator
vector. We now show the proof statement. First, note that for any b > 0 we can choose a δ < f(b), where f : R → R is some strictly monotonically decreasing function of b that shrinks faster than linearly, e.g., f(b) = 1/b². This function is well-defined since by assumption b > 0. As a result, an increase in b implies a non-linearly growing decrease of

    ||softmax(b · P̃) − softmax(b · Q̃)||_F

and thus

    lim_{b→∞} softmax(b · Q̃) = X̃,
Figure 2: Visual explanation of the "opposing forces" in Lemma 11. In (a), before softmax: we consider three numbers x1, x2, and x3, where x2 and x3 are less than δ apart. In (b), after softmax: an increase in b pushes the maximum value x3 away from x1 and x2. However, the approximation with δ acts stronger. As a result, x1 gets pushed closer to 0, but x2 and x3 get pushed closer together. In (c), after softmax: further increasing b makes x1 converge to 0, but the approximation with δ pushes x2 and x3 closer together, and the softmax maps both values approximately to 1/2. Hence, with a sufficiently close approximation, we can approximate the weighted indicator matrix X̃.
and we have that for all ϵ > 0 there exist a b > 0 and a δ < f(b) such that

    ||softmax(b · Q̃) − X̃||_F < ϵ.

In the above proof, the approximation with δ and the scaling with b act as two "opposing forces". The proof then chooses b such that ϵ acts stronger than b and hence, with b → ∞, the approximation converges to the weighted indicator matrix; see Figure 2 for a visual explanation of this concept.
From the definitions of sufficiently node and adjacency-identifying matrices, we can prove the following statement.
Lemma 12. Let G be a graph with adjacency matrix A(G) and degree matrix D. Let Q be a sufficiently adjacency-identifying
matrix. Then, there exists a b > 0 and projection matrices WQ , WK ∈ Rd×d with
    Q̃ := (1/√d_k) (QW_Q)(QW_K)^T

such that

    ||softmax(b · Q̃) − D^{-1} A(G)||_F < ϵ,
Proof. Note that D−1 A(G) is a weighted indicator matrix, since A(G) has binary entries and left-multiplication with D−1
results in dividing every element in row i of A(G) by the number of 1’s in row i of A(G), or formally,
    [D^{-1} A(G)]_i = [ A(G)_{i1} / Σ_{j=1}^n A(G)_{ij}  ...  A(G)_{in} / Σ_{j=1}^n A(G)_{ij} ].
Further, since Q is sufficiently adjacency-identifying, there exists an adjacency-identifying matrix P and projection matrices
WQ and WK , such that for
    P̃ = (1/√d_k) (PW_Q)(PW_K)^T,
it holds that

    P̃_{ij} = max_k P̃_{ik},

if and only if A(G)_{ij} = 1. Recall the definition of sufficiently adjacency-identifying, namely that we can choose projection
matrices such that
    ||P − Q||_F < ϵ_0,

for all ϵ_0. Then, according to Lemma 9, for the matrix

    Q̃ = (1/√d_k) (QW_Q)(QW_K)^T,

it holds that

    ||P̃ − Q̃||_F < f(ϵ_0),
where f is a strictly monotonically increasing function of ϵ0 . We can then apply Lemma 11 to P̃, Q̃, A(G) as the binary
matrix and D−1 A(G) as its weighted indicator matrix and, for every ϵ, choose ϵ0 small enough such that there exists a b > 0
with
    ||softmax(b · Q̃) − D^{-1} A(G)||_F < ϵ.
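As a numerical illustration of this statement, the following sketch is not part of the paper; assuming NumPy and NetworkX, it uses the negative graph Laplacian as the pre-softmax score matrix, since its rows attain their maxima exactly at adjacent nodes (as used in the Laplacian-based proofs later in this appendix), and checks that softmax(b · (−L)) approaches D^{-1} A(G) as b grows.

import numpy as np
import networkx as nx

# A minimal numerical check (not from the paper).
def softmax(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

G = nx.cycle_graph(5)
A = nx.to_numpy_array(G)
L = np.diag(A.sum(axis=1)) - A                 # graph Laplacian D - A
target = A / A.sum(axis=1, keepdims=True)      # D^{-1} A(G)

for b in (1.0, 10.0, 100.0):
    err = np.linalg.norm(softmax(b * (-L)) - target)
    print(b, err)                              # Frobenius error shrinks as b grows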
Finally, we generalize the above result to arbitrary k via the generalized adjacency matrix.
Lemma 13 (Approximating the generalized adjacency matrix). Let G be a graph with n nodes and let k > 1. Let further Q ∈ R^{n×d/k} be sufficiently node- and adjacency-identifying structural embeddings. Let v := (v_1, ..., v_k) ∈ V(G)^k be the i-th and let u := (u_1, ..., u_k) ∈ V(G)^k be the l-th k-tuple in a fixed but arbitrary ordering over V(G)^k. Let Ã^{(k,j,γ)} denote the weighted indicator matrix of the generalized adjacency matrix in Equation (11). Let Z ∈ R^{n^k × n^k} be the unnormalized attention matrix such that

    Z_{il} := (1/√d_k) [Q(v_j)]_{j=1}^k W_Q ([Q(u_j)]_{j=1}^k W_K)^T,

with projection matrices W_Q, W_K ∈ R^{d×d}. Then, for every j ∈ [k] and every γ ∈ {−1, 1}, there exist projection matrices W_Q, W_K such that for all i ∈ [n^k],

    ||softmax([Z_{i1} ... Z_{in^k}]) − Ã^{(k,j,γ)}_i||_F < ε,
Proof. Our proof strategy is to first construct the unnormalized attention matrix from a node- and adjacency-identifying
matrix and then to invoke Lemma 11 to relax the approximation to the sufficiently node- and adjacency-identifying Q.
First, by definition of sufficiently node and adjacency-identifying matrices, the existence of Q implies the existence of a node
and adjacency-identifying matrix P. We will now give projection matrices WQ and WK such that
    ||softmax([Z*_{i1} ... Z*_{in^k}]) − Ã^{(k,j,γ)}_i||_F < ε.    (15)
and

    W_K = [W^{K,1}; ...; W^{K,k}],

where for all j ∈ [k], Q(v_j) is projected by W^{Q,j} and Q(u_j) is projected by W^{K,j}. We further expand the sub-matrices of each W^{Q,j} and W^{K,j}, writing

    W^{Q,j} = [W_N^{Q,j}; W_A^{Q,j}]

and

    W^{K,j} = [W_N^{K,j}; W_A^{K,j}].
Since P is node-identifying, there exist projection matrices W_N^{Q,*} and W_N^{K,*} such that

    p_{ilj} := (1/√d_k) P_* W_N^{Q,*} (P_* W_N^{K,*})^T

is maximal if and only if u_{i,j} = u_{l,j} is the same node. Further, since P is adjacency-identifying, there exist projection matrices W_A^{Q,*} and W_A^{K,*} such that

    q_{ilj} := (1/√d_k) P_* W_A^{Q,*} (P_* W_A^{K,*})^T

is maximal if and only if nodes u_{i,j} and u_{l,j} share an edge in G. Consequently, we also know that then −q_{ilj} is maximal if and only if nodes u_{i,j} and u_{l,j} do not share an edge in G. Note that because we assume that G has no self-loops, q_{iij} = 0 for all i ∈ [n] and all j ∈ [k]. Now, let us write out the unnormalized attention score Z*_{il} as
all i ∈ [n] and all j ∈ [k]. Now, let us write out the unnormalized attention score Z∗il as
$$\begin{aligned}
Z^{*}_{il} &= \frac{1}{\sqrt{d_k}} \Big(\big[P(v_j)\big]_{j=1}^{k} W_Q\Big) \Big(\big[P(u_j)\big]_{j=1}^{k} W_K\Big)^{\mathsf{T}} \\
&= \frac{1}{\sqrt{d_k}} \sum_{j=1}^{k} P(v_j) W_{Q,j} \big(P(u_j) W_{K,j}\big)^{\mathsf{T}} \\
&= \frac{1}{\sqrt{d_k}} \sum_{j=1}^{k} P(v_j) W_N^{Q,j} \big(P(u_j) W_N^{K,j}\big)^{\mathsf{T}} + \frac{1}{\sqrt{d_k}} \sum_{j=1}^{k} P(v_j) W_A^{Q,j} \big(P(u_j) W_A^{K,j}\big)^{\mathsf{T}},
\end{aligned}$$
where in the second line we simply write out the dot product and in the last line we split the sum to distinguish node- and adjacency-identifying terms.
Now, for all $\gamma$ and a fixed $j$, we set $W_N^{Q,j} = 0$ and $W_N^{K,j} = 0$, and $W_N^{Q,o} = b \cdot W_N^{Q,*}$ and $W_N^{K,o} = W_N^{K,*}$ for $o \neq j$, for some $b > 0$ that we need to keep arbitrary in order to use it later with Lemma 11.
We now distinguish between the different choices of $\gamma$. In the first case, let $\gamma > 0$. Then, we set $W_A^{Q,j} = W_A^{Q,*}$ and $W_A^{K,j} = W_A^{K,*}$. In the second case, let $\gamma < 0$. Then, we set $W_A^{Q,j} = -W_A^{Q,*}$ and $W_A^{K,j} = W_A^{K,*}$. In both cases, we set $W_A^{Q,o} = 0$ for $o \neq j$.
Then, in all cases, for tuples $u_i$ and $u_l$,
$$Z^{*}_{il} = \gamma \cdot q_{ilj} + b \cdot \sum_{o \neq j} p_{ilo}. \tag{16}$$
Note that $b$ is the same for all pairs $u_i, u_l$ and the dot product is positive. We again distinguish between two cases. In the first case, let $\gamma > 0$. Then, the above sum attains its maximum value if and only if
$$\forall o \neq j : u_{i,o} = u_{l,o} \;\wedge\; A_{ilj} = 1.$$
Now, since $u_{i,o}$ and $u_{l,o}$ denote the nodes at the $o$-th component of $u_i$ and $u_l$, respectively, the above statement is equivalent to saying that $u_l$ is a $j$-neighbor of $u_i$ and that $u_{l,j}$ is adjacent to $u_{i,j}$. In the second case, let $\gamma < 0$. Then, the above sum attains its maximum value if and only if
$$\forall o \neq j : u_{i,o} = u_{l,o} \;\wedge\; A_{ilj} = 0.$$
Again, since $u_{i,o}$ and $u_{l,o}$ denote the nodes at the $o$-th component of $u_i$ and $u_l$, respectively, the above statement is equivalent to saying that $u_l$ is a $j$-neighbor of $u_i$ and that $u_{l,j}$ is not adjacent to $u_{i,j}$.
Now, recall the generalized adjacency matrix in Equation (11) as
$$A^{(k,j,\gamma)}_{il} := \begin{cases} 1 & \gamma = 1 \;\wedge\; \exists w \in V(G) : u_l = \phi_j(u_i, w) \wedge \operatorname{adj}(u_{ij}, w) \\ 1 & \gamma = -1 \;\wedge\; \exists w \in V(G) : u_l = \phi_j(u_i, w) \wedge \neg\operatorname{adj}(u_{ij}, w) \\ 0 & \text{else.} \end{cases}$$
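As an illustration of this definition only, the following sketch builds $A^{(k,j,\gamma)}$ for $k = 2$ on a small graph; the toy graph and the helper `generalized_adjacency` are our own choices, not code from the paper, and positions are 0-indexed.

```python
# Toy construction of the generalized adjacency matrix A^(k,j,gamma) from
# Equation (11) for k = 2; phi_j(u, w) replaces the j-th entry of the tuple u by w.
import itertools
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])                               # path graph on 3 nodes
n, k = A.shape[0], 2
tuples = list(itertools.product(range(n), repeat=k))    # fixed ordering of V(G)^k
index = {u: i for i, u in enumerate(tuples)}

def generalized_adjacency(j, gamma):
    M = np.zeros((n ** k, n ** k))
    for i, u in enumerate(tuples):
        for w in range(n):
            v = list(u); v[j] = w                       # v = phi_j(u, w)
            adjacent = A[u[j], w] == 1                  # adj(u_{ij}, w)
            if (gamma == 1 and adjacent) or (gamma == -1 and not adjacent):
                M[i, index[tuple(v)]] = 1
    return M

print(generalized_adjacency(j=0, gamma=1))              # j-neighbors via adjacent w
print(generalized_adjacency(j=0, gamma=-1))             # j-neighbors via non-adjacent w
```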
Then, we can say that for our construction of $Z^{*}$, for all $i, l \in [n^k]$, the score $Z^{*}_{il}$ attains the maximum of its row if and only if $A^{(k,j,\gamma)}_{il} = 1$.
Proof. Note that the graph Laplacian is node-identifying, since all off-diagonal elements are $\leq 0$ and the diagonal is always $> 0$, as we consider graphs $G$ without self-loops and without isolated nodes. Note further that if there exist matrices $W_Q$ and $W_K$ such that
$$\frac{1}{\sqrt{d_k}} (PW_Q)(PW_K)^{\mathsf{T}} = L,$$
then there also exists a matrix $W^{*}_Q = -W_Q$ such that
$$\frac{1}{\sqrt{d_k}} (PW^{*}_Q)(PW_K)^{\mathsf{T}} = -L.$$
Now, note that the negative graph Laplacian is $-L = A(G) - D$. Because we subtract the degree matrix from the adjacency matrix, the maximum element of each row of the negative graph Laplacian is 1. Since $D$ is diagonal, for each row $i$ and each column $j$ with $i \neq j$,
$$-L_{ij} = 1 \iff A(G)_{ij} = 1.$$
Further, in the case where $i = j$,
$$-L_{ij} \leq 0,$$
since we consider graphs $G$ without self-loops. Hence, we obtain that
$$-L_{ij} = \max_k\,(-L_{ik}) \iff A(G)_{ij} = 1.$$
In particular, the above applies to any matrix $P$ with
$$L = PP^{\mathsf{T}}.$$
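A quick numerical check of the argument above, on a toy graph of our choosing: the row-wise maxima of $-L = A(G) - D$ sit exactly at the adjacent off-diagonal entries, so thresholding each row at its maximum recovers the adjacency matrix.

```python
# Illustrative only: row-wise maxima of the negative Laplacian recover A(G).
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A               # graph Laplacian D - A(G)
negL = -L

row_max = negL.max(axis=1, keepdims=True)
recovered = (negL == row_max).astype(float)  # 1 exactly where a row attains its max
print(np.array_equal(recovered, A))          # True: row-wise maxima recover A(G)
```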
23
Aligning Transformers with Weisfeiler–Leman
Proof. Our goal is to show that there exist matrices $W_Q$ and $W_K$ such that
$$\frac{1}{\sqrt{d_k}} (PW_Q)(PW_K)^{\mathsf{T}} = L = PQ^{\mathsf{T}}.$$
Proof. Our goal is to show that there exist matrices $W_Q$ and $W_K$ such that
$$\frac{1}{\sqrt{d_k}} \big([P \; Q]\,W_Q\big)\big([P \; Q]\,W_K\big)^{\mathsf{T}} = L.$$
To this end, we set
$$W_Q = \begin{bmatrix} \sqrt{d_k}\, I \\ 0 \end{bmatrix} \quad\text{and}\quad W_K = \begin{bmatrix} 0 \\ I \end{bmatrix},$$
so that
$$\frac{1}{\sqrt{d_k}} \big([P \; Q]\,W_Q\big)\big([P \; Q]\,W_K\big)^{\mathsf{T}} = PQ^{\mathsf{T}} = L.$$
For the next two lemmas, we first briefly define permutation matrices as binary matrices $M$ that are right-stochastic, i.e., whose rows sum to 1, and such that each column of $M$ is 1 at exactly one position and 0 elsewhere. It is well known that for permutation matrices $M$ it holds that $M^{\mathsf{T}}M = MM^{\mathsf{T}} = I$, where $I$ is the identity matrix. We now state the following lemmas.
Lemma 17. Let $G$ be a graph with graph Laplacian $L$ and let $L = U\Sigma U^{\mathsf{T}}$ be the eigendecomposition of $L$. Then, for any permutation matrix $M$, the matrix $U\Sigma^{1/2}M$ is adjacency-identifying.
Lemma 18. Let $G$ be a graph with graph Laplacian $L$ and normalized graph Laplacian $\tilde{L}$. Let $\tilde{L} = U\Sigma U^{\mathsf{T}}$ be the eigendecomposition of $\tilde{L}$ and let $D$ denote the degree matrix. Then, for any permutation matrix $M$, the matrix $D^{1/2}U\Sigma^{1/2}M$ is adjacency-identifying.
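The factorizations behind Lemmas 17 and 18 can be checked numerically; in the sketch below the graph and the permutation matrix are arbitrary choices made for illustration: $U\Sigma^{1/2}M$ reproduces $L$ as a Gram matrix, and rescaling the eigenvectors of the normalized Laplacian by $D^{1/2}$ again yields $D - A(G)$, i.e., $L$.

```python
# Sanity checks for Lemmas 17 and 18 on a small graph (sketch only).
import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

# Lemma 17: P = U Sigma^{1/2} M satisfies P P^T = L for any permutation M.
lam, U = np.linalg.eigh(L)
M = np.eye(4)[[2, 0, 3, 1]]                              # an arbitrary permutation matrix
P = U @ np.diag(np.sqrt(np.clip(lam, 0, None))) @ M
print(np.allclose(P @ P.T, L))                           # True

# Lemma 18 analogue: eigendecompose the normalized Laplacian, rescale by D^{1/2}.
Dsqrt = np.diag(np.sqrt(A.sum(axis=1)))
Ln = np.eye(4) - np.linalg.inv(Dsqrt) @ A @ np.linalg.inv(Dsqrt)
mu, V = np.linalg.eigh(Ln)
P2 = Dsqrt @ V @ np.diag(np.sqrt(np.clip(mu, 0, None)))
print(np.allclose(P2 @ P2.T, np.diag(A.sum(axis=1)) - A))  # = D - A(G), i.e., L
```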
24
Aligning Transformers with Weisfeiler–Leman
F.3. LPE
Here, we show that the node-level PEs LPE as defined in Equation (5) are sufficiently node- and adjacency-identifying. We
begin with the following useful lemma.
Lemma 19. Let $P \in \mathbb{R}^{n \times d}$ be structural embeddings with sub-matrices $Q_1 \in \mathbb{R}^{n \times d'}$ and $Q_2 \in \mathbb{R}^{n \times d''}$, i.e., $P = [Q_1 \; Q_2]$. Then, if $Q_1$ or $Q_2$ is (sufficiently) node-identifying, $P$ is (sufficiently) node-identifying, and if $Q_1$ or $Q_2$ is (sufficiently) adjacency-identifying, $P$ is (sufficiently) adjacency-identifying.
Proof. Let a sub-matrix $Q$ be (sufficiently) node- or adjacency-identifying, and let $W^{Q,Q}$ and $W^{K,Q}$ be the corresponding projection matrices with which $Q$ is (sufficiently) node- or adjacency-identifying. Then, if $Q = Q_1$, we define
$$W^Q = \begin{bmatrix} W^{Q,Q} & 0 \\ 0 & 0 \end{bmatrix} \quad\text{and}\quad W^K = \begin{bmatrix} W^{K,Q} & 0 \\ 0 & 0 \end{bmatrix},$$
and if $Q = Q_2$, we define
$$W^Q = \begin{bmatrix} 0 & 0 \\ W^{Q,Q} & 0 \end{bmatrix} \quad\text{and}\quad W^K = \begin{bmatrix} 0 & 0 \\ W^{K,Q} & 0 \end{bmatrix}.$$
In both cases,
$$PW^Q = \big[QW^{Q,Q} \;\; 0\big] \quad\text{and}\quad PW^K = \big[QW^{K,Q} \;\; 0\big],$$
and consequently,
$$PW^Q \big(PW^K\big)^{\mathsf{T}} = QW^{Q,Q} \big(QW^{K,Q}\big)^{\mathsf{T}}.$$
Hence if Q1 is (sufficiently) node-identifying, then P is (sufficiently) node-identifying. If Q2 is (sufficiently) node-
identifying, then P is (sufficiently) node-identifying. If Q1 is (sufficiently) adjacency-identifying, then P is (sufficiently)
adjacency-identifying. If Q2 is (sufficiently) adjacency-identifying, then P is (sufficiently) adjacency-identifying. If Q1 is
(sufficiently) node-identifying and Q2 is (sufficiently) adjacency-identifying or vice versa, then P is (sufficiently) node and
adjacency-identifying. This concludes the proof.
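The block construction in this proof can be checked directly. The sketch below is illustrative only, with random matrices standing in for $Q_1$, $Q_2$, and the projections: zero-padding the projection matrices makes $P = [Q_1 \; Q_2]$ reproduce the score matrix of the sub-matrix $Q_1$ alone.

```python
# Sketch of the zero-padded block projections used in the proof of Lemma 19.
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2 = 5, 3, 4
Q1, Q2 = rng.normal(size=(n, d1)), rng.normal(size=(n, d2))
P = np.hstack([Q1, Q2])                       # P = [Q1 Q2]

WQQ = rng.normal(size=(d1, d1))               # projections with which Q1 "identifies"
WKQ = rng.normal(size=(d1, d1))

# Embed the Q1-projections into the larger space, zeroing out the Q2 block.
WQ = np.zeros((d1 + d2, d1 + d2)); WQ[:d1, :d1] = WQQ
WK = np.zeros((d1 + d2, d1 + d2)); WK[:d1, :d1] = WKQ

lhs = (P @ WQ) @ (P @ WK).T
rhs = (Q1 @ WQQ) @ (Q1 @ WKQ).T
print(np.allclose(lhs, rhs))                  # True: P inherits Q1's score matrix
```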
Theorem 20 (Slightly more general than Theorem 7 in main text). Structural embeddings with LPE according to Equation (5)
as node-level PEs with embedding dimension d are sufficiently node- and adjacency-identifying, irrespective of whether the
underlying Laplacian is normalized or not. Further, for graphs with n nodes, the statement holds for d ≥ (2n + 1).
Proof. We begin by noting that the domain of LPE is compact, since eigenvectors are unit-norm and eigenvalues are bounded by twice the maximum node degree for graphs without self-loops (Anderson & Morley, 1985). Further, the domain of the structural embeddings is compact, since the domain of LPE is compact and the node degrees are finite; hence, the domain of deg is compact.
Let $G$ be a graph with graph Laplacian $L$. Let $L = U\Sigma U^{\mathsf{T}}$ be the eigendecomposition of $L$, where the $i$-th column of $U$ contains the $i$-th eigenvector, denoted $v_i$, and the $i$-th diagonal entry of $\Sigma$ contains the $i$-th eigenvalue, denoted $\lambda_i$.
Recall that structural embeddings with LPE as node-level PEs are defined as
$$P(v) = \mathrm{FFN}\big(\deg(v) + \mathrm{LPE}(v)\big),$$
for all $v \in V(G)$. Since the domains of both deg and LPE are compact, there exist parameters for both of these embeddings such that, without loss of generality, we may assume that for $v \in V(G)$,
$$P(v) = \big[\deg(v) \;\; \mathrm{LPE}(v)\big],$$
and that deg and LPE are instead embedded into some smaller $p$-dimensional and $s$-dimensional sub-spaces, respectively, where it holds that $d = p + s$.
To show node- and adjacency-identifiability, we divide the $s$-dimensional embedding space of LPE into two $s/2$-dimensional sub-spaces node and adj, i.e., we write
$$\mathrm{LPE}(v) = \big[\mathrm{node}(v) \;\; \mathrm{adj}(v)\big],$$
where
$$\mathrm{node}(v) = \rho_{\mathrm{node}}\Big(\sum_{j=1}^{k} \phi_{\mathrm{node}}\big(v_{ji}, \lambda_j + \epsilon_j\big)\Big) \quad\text{and}\quad \mathrm{adj}(v) = \rho_{\mathrm{adj}}\Big(\sum_{j=1}^{k} \phi_{\mathrm{adj}}\big(v_{ji}, \lambda_j + \epsilon_j\big)\Big),$$
and where we have chosen $\rho$ to be a sum over the first dimension of its input, followed by the FFNs $\rho_{\mathrm{node}}$ and $\rho_{\mathrm{adj}}$ for the node and adjacency parts, respectively. Note that we have written out Equation (5) for a single node $v$. For convenience, we also fix an arbitrary ordering over the nodes in $V(G)$ and define
$$Q^D = \begin{bmatrix} \deg(v_1) \\ \vdots \\ \deg(v_n) \end{bmatrix}, \qquad Q^N = \begin{bmatrix} \mathrm{node}(v_1) \\ \vdots \\ \mathrm{node}(v_n) \end{bmatrix}, \qquad Q^A = \begin{bmatrix} \mathrm{adj}(v_1) \\ \vdots \\ \mathrm{adj}(v_n) \end{bmatrix},$$
where $v_i$ is the $i$-th node in our ordering. Further, note that node and adj are DeepSets (Zaheer et al., 2017) over the set
$$M_i := \{(v_{ji}, \lambda_j + \epsilon_j)\}_{j=1}^{k},$$
where $v_j$ is again the $j$-th eigenvector with corresponding eigenvalue $\lambda_j$ and $\epsilon_j$ is a learnable scalar. With DeepSets we can universally approximate permutation-invariant functions. We will use this fact in the following, where we show that there exists a parameterization of node such that $Q^N$ approximates a node-identifying matrix arbitrarily close. To this end, note that $U$ is node-identifying, since $UU^{\mathsf{T}} = I$ and hence $(UU^{\mathsf{T}})_{ij} = \max_k (UU^{\mathsf{T}})_{ik}$ if and only if $i = j$. Moreover, let $M$ be any permutation matrix, i.e., each column of $M$ is 1 at exactly one position and 0 elsewhere and the rows of $M$ sum to 1. Then,
$$UM(UM)^{\mathsf{T}} = UMM^{\mathsf{T}}U^{\mathsf{T}} = UU^{\mathsf{T}} = I,$$
so $UM$ is also node-identifying. We will now approximate $UM$ for some $M$ with node arbitrarily close. Specifically, for the $i$-th node $v_i$ in our node ordering, we choose $\rho_{\mathrm{node}}$ and $\phi_{\mathrm{node}}$ such that
$$\big\lVert \mathrm{node}(v_i) - (UM)_i \big\rVert_F < \epsilon,$$
for all ϵ > 0 and all i. Since DeepSets can universally approximate permutation invariant functions, it remains to show that
there exists a permutation invariant function f such that f (Mi ) = Ui M for all i and for some M. To this end, note that for a
graph G, there are only at most n unique eigenvalues of the corresponding (normalized) graph Laplacian. Hence, we can
choose ϵj such that λj + ϵj is unique for each unique j. In particular, let
$$\epsilon_j = j \cdot \delta,$$
where we choose $0 < \delta < \min_{l \neq o,\, \lambda_l \neq \lambda_o} |\lambda_l - \lambda_o|$, i.e., $\delta$ is smaller than the smallest non-zero difference between two eigenvalues. We now define $f$ as
$$f\big(\{(v_{ji}, \lambda_j + \epsilon_j)\}_{j=1}^{k}\big) = \big[v_{1i} \; \dots \; v_{ki}\big] = (UM)_i,$$
where the order of the components is according to the sorted $\lambda_j + \epsilon_j$ in ascending order. This order is reflected in some permutation matrix $M$. Hence, $f$ is permutation-invariant and can be approximated by a DeepSet arbitrarily close. As a result,
we have
$$\big\lVert \mathrm{node}(v_i) - (UM)_i \big\rVert_F < \epsilon,$$
for all $i$ and an arbitrarily small $\epsilon > 0$. In matrix form, we have that
$$\big\lVert UM - Q^N \big\rVert_F < \epsilon,$$
and since UM is node-identifying, we can invoke Lemma 9, to conclude that QN is sufficiently node-identifying. As a result,
P has a sufficiently node-identifying sub-space and is thus, also sufficiently node-identifying according to Lemma 19.
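The perturbation argument can be illustrated numerically. In the sketch below the graph and the constants are our own choices: adding $\epsilon_j = j \cdot \delta$ makes the repeated eigenvalues of a 4-cycle distinct, the induced sorting is the same permutation $M$ for every node, and the resulting matrix satisfies $(UM)(UM)^{\mathsf{T}} = I$, i.e., it is node-identifying.

```python
# Sketch of the eps_j = j*delta perturbation trick in the proof of Theorem 20.
import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)     # C_4 has repeated Laplacian eigenvalues
L = np.diag(A.sum(axis=1)) - A
lam, U = np.linalg.eigh(L)
n = len(lam)

gaps = np.diff(np.unique(lam))
delta = 0.5 * gaps[gaps > 1e-9].min()          # delta below the smallest nonzero gap
keys = lam + delta * np.arange(1, n + 1)       # lambda_j + j*delta, now pairwise distinct
order = np.argsort(keys)                       # the permutation "M", shared by all nodes

node_embedding = U[:, order]                   # row i = sorted first components of M_i
print(np.allclose(node_embedding @ node_embedding.T, np.eye(n)))  # (UM)(UM)^T = I
```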
We continue by showing that $Q^A$ is sufficiently adjacency-identifying. Note that according to Lemma 17, $U\Sigma^{1/2}$ is adjacency-identifying. We will now approximate $U\Sigma^{1/2}$ with adj arbitrarily close. Specifically, for the $i$-th node $v_i$ in our node ordering, we choose $\rho_{\mathrm{adj}}$ and $\phi_{\mathrm{adj}}$ such that
$$\big\lVert \mathrm{adj}(v_i) - (U\Sigma^{1/2})_i \big\rVert_F < \epsilon,$$
for all $\epsilon > 0$ and all $i$. To this end, we first note that right-multiplication of $U$ by $\Sigma^{1/2}$ multiplies the $j$-th column of $U$, i.e., the eigenvector $v_j$, with the $j$-th diagonal element of $\Sigma^{1/2}$, i.e., $\sqrt{\lambda_j}$. Hence, for the $i$-th node $v_i$ it holds that
$$\big(U\Sigma^{1/2}\big)_i = \big[v_{1i} \cdot \sqrt{\lambda_1} \; \dots \; v_{ni} \cdot \sqrt{\lambda_n}\big] \in \mathbb{R}^{n},$$
where $v_{ji}$ denotes the $i$-th component of $v_j$. Since DeepSets can universally approximate permutation-invariant functions, it remains to show that there exists a permutation-invariant function $f$ such that $f(M_i) = (U\Sigma^{1/2}M)_i$ for all $i$ and for some permutation matrix $M$. To this end, note that for a graph $G$, there are only at most $n$ unique eigenvalues of the corresponding (normalized) graph Laplacian. Hence, we can choose $\epsilon_j$ such that $\lambda_j + \epsilon_j$ is unique for each unique $j$. In particular, let
$$\epsilon_j = j \cdot \delta,$$
where we choose $0 < \delta < \min_{l \neq o,\, \lambda_l \neq \lambda_o} |\lambda_l - \lambda_o|$, i.e., $\delta$ is smaller than the smallest non-zero difference between two eigenvalues. We now define $f$ as
$$f\big(\{(v_{ji}, \lambda_j + j \cdot \delta)\}_{j=1}^{k}\big) = \Big[v_{1i} \cdot \sqrt{\lambda_1 + 1 \cdot \delta} \; \dots \; v_{ki} \cdot \sqrt{\lambda_k + k \cdot \delta}\Big],$$
where the order of the components is according to the sorted $\lambda_j + \epsilon_j$ in ascending order. This order is reflected in some permutation matrix $M$, which we will use next. Now, since we can choose $\delta$ arbitrarily small, we can choose it such that
$$\Big\lVert f\big(\{(v_{ji}, \lambda_j + j \cdot \delta)\}_{j=1}^{k}\big) - \big(U\Sigma^{1/2}M\big)_i \Big\rVert_F < \epsilon,$$
for any $\epsilon > 0$. Further, $f$ is permutation-invariant and can be approximated by a DeepSet arbitrarily close. As a result, we have
$$\big\lVert \mathrm{adj}(v_i) - \big(U\Sigma^{1/2}M\big)_i \big\rVert_F < \epsilon,$$
for all $i$ and an arbitrarily small $\epsilon > 0$. In matrix form, we have that
$$\big\lVert U\Sigma^{1/2}M - Q^A \big\rVert_F < \epsilon.$$
Now, we need to distinguish between the non-normalized and the normalized Laplacian. In the first case, we consider the graph Laplacian underlying the eigendecomposition to be non-normalized, i.e., $L = D - A(G)$. First, we know from Lemma 17 that $U\Sigma^{1/2}M$ is adjacency-identifying. Further, we know from Lemma 9 that then $Q^A$ is sufficiently adjacency-identifying, since $Q^A$ can approximate $U\Sigma^{1/2}M$ arbitrarily close. As a result, $P$ has a sufficiently adjacency-identifying sub-space and is thus also sufficiently adjacency-identifying.
In the second case, we consider the graph Laplacian underlying the eigendecomposition to be normalized, i.e., $\tilde{L} = I - D^{-1/2}A(G)D^{-1/2}$. Here, recall our construction for $P$, namely
$$P(v) = \mathrm{FFN}\big(\big[\deg(v) \;\; \mathrm{node}(v) \;\; \mathrm{adj}(v)\big]\big),$$
or, in matrix form,
$$P = \mathrm{FFN}\big(\big[Q^D \;\; Q^N \;\; Q^A\big]\big).$$
We will use the FFN to approximate the function that multiplies the $i$-th row of $Q^A$ with $\sqrt{d_i}$, where $d_i$ is the degree of node $i$; this is equivalent to left-multiplication of $Q^A$ by $D^{1/2}$. Note that a FFN can approximate such a function arbitrarily close since our domain is compact. Further, the $i$-th row of $Q^D$ is an embedding $\deg(v_i)$ of $d_i$. We can choose this embedding to be
$$\deg(v_i) = \big[\sqrt{d_i} \; 0 \; \dots \; 0\big] \in \mathbb{R}^{p},$$
where we write $\sqrt{d_i}$ into the first component and pad the remaining vector with zeros to fit the target dimension $p$ of deg. Hence, with the FFN we can approximate a sub-space containing $D^{1/2}Q^A$ arbitrarily close. Further, since we already showed that $Q^A$ can approximate $U\Sigma^{1/2}M$ arbitrarily close, we can thus approximate $D^{1/2}U\Sigma^{1/2}M$ arbitrarily close. First, we know from Lemma 18 that $D^{1/2}U\Sigma^{1/2}M$ is adjacency-identifying. Further, we know from Lemma 9 that then $D^{1/2}Q^A$ is sufficiently adjacency-identifying, since $D^{1/2}Q^A$ can approximate $D^{1/2}U\Sigma^{1/2}M$ arbitrarily close. As a result, $P$ has a sufficiently adjacency-identifying sub-space and is thus also sufficiently adjacency-identifying according to Lemma 19.
Finally, for $Q^D$ we need $p \geq 1$, and for $Q^N$ and $Q^A$ we need $s/2 \geq n$. As a result, the above statements hold for $d \geq 2n + 1$. This concludes the proof.
F.4. SPE
Here, we show that SPE is sufficiently node- and adjacency-identifying. We begin with a useful lemma.
Lemma 21. Let $G$ be a graph with graph Laplacian $L$ and normalized graph Laplacian $\tilde{L}$. Then, if for a node-level PE $Q$ with compact domain there exist matrices $W_Q$ and $W_K$ such that
$$\frac{1}{\sqrt{d_k}} (QW_Q)(QW_K)^{\mathsf{T}} = \tilde{L},$$
there exists a parameterization of the structural embedding $P$ with $Q$ as node-level PE such that $P$ is sufficiently node- and adjacency-identifying.
Proof. Recall that structural embeddings with $Q$ as node-level PE are defined as
$$P(v) = \mathrm{FFN}\big(\deg(v) + Q(v)\big),$$
for all $v \in V(G)$. Since the domains of both deg and $Q$ are compact, there exist parameters for both of these embeddings such that, without loss of generality, we may assume that for $v \in V(G)$,
$$P(v) = \big[\deg(v) \;\; Q(v)\big],$$
and that deg and $Q$ are instead embedded into some smaller $p$-dimensional and $s$-dimensional sub-spaces, respectively, where it holds that $d = p + s$.
Next, we choose the embedding $\deg(v)$ to be
$$\deg(v) = \big[\sqrt{d_v} \; 0 \; \dots \; 0\big] \in \mathbb{R}^{p},$$
where $d_v$ is the degree of node $v$ and we write $\sqrt{d_v}$ into the first component and pad the remaining vector with zeros to fit the target dimension $p$ of deg. We will now use the FFN to approximate the following function $f$, defined as
$$f\big(\big[\deg(v) \;\; Q(v)\big]\big) = \big[\deg(v) \;\; \sqrt{d_v}\,Q(v)\big].$$
Note that a FFN can approximate such a function arbitrarily close since our domain is compact. Hence, we have that
$$\Big\lVert P(v) - \big[\deg(v) \;\; \sqrt{d_v}\,Q(v)\big] \Big\rVert_F < \epsilon_1,$$
for all $\epsilon_1 > 0$, and, in matrix form,
$$\Big\lVert P - \big[Q^D \;\; D^{1/2}Q\big] \Big\rVert_F < \epsilon_2, \quad\text{where } Q^D = \begin{bmatrix}\deg(v_1)\\ \vdots\\ \deg(v_n)\end{bmatrix},$$
for all $\epsilon_2 > 0$, since left-multiplication of a matrix by $D^{1/2}$ corresponds to an element-wise multiplication of the $i$-th row of $Q$ with $\sqrt{d_{v_i}}$, the square root of the degree of the $i$-th node $v_i$ in an arbitrary but fixed node ordering. Now, since we have that there exist matrices $W_Q$ and $W_K$ such that
$$\frac{1}{\sqrt{d_k}} (QW_Q)(QW_K)^{\mathsf{T}} = \tilde{L},$$
Proof. We begin by noting that the domain of SPE is compact, since eigenvectors are unit-norm and eigenvalues are bounded by twice the maximum node degree for graphs without self-loops (Anderson & Morley, 1985). Further, the domain of the structural embeddings is compact, since the domain of SPE is compact and the node degrees are finite; hence, the domain of deg is compact.
Let $G$ be a graph with (normalized or non-normalized) graph Laplacian $L$. Let $L = V\Sigma V^{\mathsf{T}}$ be the eigendecomposition of $L$, where the $i$-th column of $V$ contains the $i$-th eigenvector, denoted $v_i$, and the $i$-th diagonal entry of $\Sigma$ contains the $i$-th eigenvalue, denoted $\lambda_i$. Recall that structural embeddings with SPE as node-level PEs are defined as $P(v) = \mathrm{FFN}(\deg(v) + \mathrm{SPE}(v))$ for all $v \in V(G)$.
To show node- and adjacency-identifiability, we first replace the neural networks $\rho, \phi_1, \dots, \phi_n$ in SPE with permutation-equivariant functions $g, h_1, \dots, h_n$ over the same domains and co-domains, respectively, and choose $g$ and $h_1, \dots, h_n$ such that the resulting encoding, which we call fSPE, is sufficiently node- and adjacency-identifying. Then, we show that $\rho$ and $\phi_1, \dots, \phi_n$ can approximate $g$ and $h_1, \dots, h_n$ arbitrarily close, which gives the proof statement.
We first define, for all $\ell \in [n]$,
$$h_\ell(\lambda) = \big[0, \; \dots, \; 0, \; \lambda_\ell^{1/2}, \; 0, \; \dots, \; 0\big]^{\mathsf{T}},$$
such that $\lambda_\ell$ is the $\ell$-th sorted eigenvalue, where ties are broken arbitrarily,³ and $\lambda_\ell^{1/2}$ is the $\ell$-th entry of $h_\ell(\lambda)$. Note that due to the sorting, $h_\ell$ is equivariant to permutations of the nodes.
We then define the matrix $M_i$ as the $i$-th input to $g$ in fSPE, for all $i \in [n]$; the last equality in its expansion follows from the fact that $h_\ell(\lambda)_j \neq 0$ if and only if $\ell = j$. Now, we choose a linear set function $f$, satisfying
$$f(\{a \cdot x \mid x \in X\}) = a \cdot f(X),$$
for all $a \in \mathbb{R}$ and all $X \subset \mathbb{R}$, e.g., sum or mean. Since $f$ is a function over sets, $f$ is equivariant to the ordering of $X$. We now define the function $g$ step by step. In $g$, we first apply $f$ to the columns of $M_i$ for all $i \in [n]$. To this end, let $M_i^j$ denote the set of entries in the $j$-th column of matrix $M_i$. Then, we define
$$f_i = \big[f(M_i^1) \; \dots \; f(M_i^n)\big] = \Big[\lambda_1^{1/2} \cdot v_{i1} \cdot f(\{v_{o1} \mid o \in [n]\}) \; \dots \; \lambda_n^{1/2} \cdot v_{in} \cdot f(\{v_{on} \mid o \in [n]\})\Big].$$
³ Note that despite eigenvalues with higher multiplicity, the output of $h_\ell$ is the same, no matter how ties are broken.
$$P_i = M_i \cdot f_i = \begin{bmatrix} v_{11} \cdot (\lambda_1^{1/2})^2 \cdot (v_{i1})^2 & \dots & v_{1n} \cdot (\lambda_n^{1/2})^2 \cdot (v_{in})^2 \\ \vdots & & \vdots \\ v_{n1} \cdot (\lambda_1^{1/2})^2 \cdot (v_{i1})^2 & \dots & v_{nn} \cdot (\lambda_n^{1/2})^2 \cdot (v_{in})^2 \end{bmatrix} = \begin{bmatrix} v_{11} \cdot \lambda_1 \cdot (v_{i1})^2 & \dots & v_{1n} \cdot \lambda_n \cdot (v_{in})^2 \\ \vdots & & \vdots \\ v_{n1} \cdot \lambda_1 \cdot (v_{i1})^2 & \dots & v_{nn} \cdot \lambda_n \cdot (v_{in})^2 \end{bmatrix}$$
and
$$Q_i = M_i / f_i = \begin{bmatrix} \dfrac{v_{11}}{f(\{v_{o1} \mid o \in [n]\})} & \dots & \dfrac{v_{1n}}{f(\{v_{on} \mid o \in [n]\})} \\ \vdots & & \vdots \\ \dfrac{v_{n1}}{f(\{v_{o1} \mid o \in [n]\})} & \dots & \dfrac{v_{nn}}{f(\{v_{on} \mid o \in [n]\})} \end{bmatrix}.$$
Hence, we also need to ensure that
$$f(\{v_{oj} \mid o \in [n]\}) \neq 0,$$
for all $j \in [n]$. Now, we define
$$P = \sum_i P_i \quad\text{and}\quad Q = \sum_i Q_i,$$
where we recall that the eigenvectors are normalized and hence $\sum_i v_{i\ell}^2 = 1$.
In the last step of $g$, we return the matrix
$$\big[P \;\; Q\big] \in \mathbb{R}^{n \times 2n}.$$
We will now show that this matrix is node- and adjacency-identifying. To this end, note that
$$PQ^{\mathsf{T}} = V\Sigma V^{\mathsf{T}} = L.$$
We distinguish between two cases. If $L$ is the non-normalized graph Laplacian, then, according to Lemma 16, the matrix $[P \; Q]$ is node- and adjacency-identifying and hence, according to Lemma 19, fSPE is node- and adjacency-identifying. Further, if $L$ is the normalized graph Laplacian, then, according to Lemma 21, fSPE is sufficiently node- and adjacency-identifying.
To summarize, we have shown that there exist functions g, h1 , . . . , hn such that if, in SPE, we replace ρ with g and ϕℓ with
hℓ , then the resulting encoding fSPE is sufficiently node- and adjacency-identifying. To complete the proof, we will now show
that we can approximate $g, h_1, \dots, h_n$ with $\rho, \phi_1, \dots, \phi_n$ arbitrarily close. Since our domain is compact, we can use $\phi_\ell$ to approximate $h_\ell$ arbitrarily close, i.e., we have that
$$\lVert \phi_\ell - h_\ell \rVert_F < \epsilon,$$
for all $\epsilon > 0$. Further, $g$ consists of a sequence of permutation-equivariant steps and is thus permutation-equivariant. Since the domain of $g$ is compact and, by construction, $g$ and $\rho$ have the same domain and co-domain, and since $\rho$ can approximate any permutation-equivariant function on its domain and co-domain arbitrarily close, we have that $\rho$ can also approximate $g$ arbitrarily close. As a result, SPE can approximate a sufficiently node- and adjacency-identifying encoding arbitrarily close and hence, by definition, SPE is also sufficiently node- and adjacency-identifying. This completes the proof.
G. Learnable embeddings
Here, we pay more attention to learnable embeddings, which play an important role throughout our work. Although our
definition of learnable embeddings is fairly abstract, learnable embeddings are very commonly used in practice, e.g., in
tokenizers of language models (Vaswani et al., 2017). Since many components of graphs, such as the node labels, edge types,
or node degrees are discrete, learnable embeddings are useful to embed these discrete features into an embedding space. In
our work, different embeddings are typically joined via sum, e.g., in Equation (1) or Equation (4). Summing embeddings
is much more convenient in practice since every embedding has the same dimension d. In contrast, joining embeddings
via concatenation can lead to very large d, unless the underlying embeddings are very low-dimensional. However, in our
theorems joining embeddings via concatenation is much more convenient. To bridge theory and practice in this regard, in
what follows we establish under what conditions summed embeddings can express concatenated embeddings and vice versa.
Specifically, we show that summed embeddings can still act as if concatenating lower-dimensional embeddings but with more
flexibility in terms of joining different embeddings.
The core idea is that we show under which operations one may replace a learnable embedding $e$ into $\mathbb{R}^d$ with a lower-dimensional learnable embedding $e'$ into $\mathbb{R}^{d'}$ while preserving the injectivity of $e$. This property is useful since it allows us to, for example, add two learnable embeddings without losing expressivity. As a result, we can design a tokenizer that is practically useful without sacrificing theoretical guarantees.
We begin with a projection of a concatenation of learnable embeddings.
Lemma 23 (Projection of learnable embeddings). Let $e'_1, \dots, e'_k$ be learnable embeddings from $\mathbb{N}$ to $\mathbb{R}^p$ for some $p \geq 1$. If $d \geq k$, then there exist learnable embeddings $e_1, \dots, e_k$ as well as a projection matrix $W \in \mathbb{R}^{d \cdot k \times k \cdot p}$ such that for $v_1, \dots, v_k \in \mathbb{N}$,
$$\big[e_i(v_i)\big]_{i=1}^{k} W = \big[e'_i(v_i)\big]_{i=1}^{k} \in \mathbb{R}^{k \cdot p}.$$
Proof. For a learnable embedding $e$ and some $v \in \mathbb{N}$, we denote by $e(v)_j$ the $j$-th element of the vector $e(v)$. Since learnable embeddings map from $\mathbb{N}$ to $\mathbb{R}^d$, for every $i$ we define $e_i$ such that for every $v \in \mathbb{N}$, $e_i(v)_1 = e'_i(v)$ and $e_i(v)_j = 0$ for $j > 1$.
Now, we define $W$ as follows. For row $l$ and column $j$, we set $W_{lj} = 1$ if $l = (j - 1)\,d + 1$, i.e., the non-zero rows are those with index in $\{1, d+1, 2d+1, \dots\}$, and otherwise we set $W_{lj} = 0$. Intuitively, applying $W$ to $[e_i(v_i)]_{i=1}^{k}$ selects the 1st, $(d+1)$-st, $(2d+1)$-st, $\dots$, $((k-1)d+1)$-st elements of $[e_i(v_i)]_{i=1}^{k}$, corresponding to $v_1, \dots, v_k$, respectively, according to our construction of the $e_i$. Now, we have that
$$\big[e_i(v_i)\big]_{i=1}^{k} W = \big[e'_1(v_1) \; \dots \; e'_k(v_k)\big] = \big[e'_i(v_i)\big]_{i=1}^{k}.$$
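A toy instance of this construction makes the selection explicit; the values, $p = 1$, and the small dimensions below are our own choices for illustration.

```python
# Toy version of Lemma 23 (with p = 1): each e_i stores its value in the first
# coordinate of a d-dimensional vector, and W selects coordinates 1, d+1, 2d+1, ...
import numpy as np

d, k = 4, 3
values = [7.0, 2.0, 5.0]                      # e'_1(v_1), e'_2(v_2), e'_3(v_3)

def e(value, d):                              # e_i(v): value in the first entry, zeros elsewhere
    out = np.zeros(d); out[0] = value
    return out

concat = np.concatenate([e(v, d) for v in values])    # (e_i(v_i))_{i=1..k} in R^{k*d}

W = np.zeros((k * d, k))                      # selection matrix
for j in range(k):
    W[j * d, j] = 1.0                         # pick positions 1, d+1, 2d+1, ... (0-indexed: 0, d, 2d)

print(concat @ W)                             # [7. 2. 5.] = (e'_i(v_i))_{i=1..k}
```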
Next, we turn to a composition of learnable embeddings, namely the structural embeddings P, as defined in Section 3.1. In
particular, we show next that a projection of multiple structural embeddings preserves node- and adjacency identifiability of
the individual structural embeddings in the new embedding space.
Lemma 24 (Projection of structural embeddings). Let $G$ be a graph with $n$ nodes and let $P \in \mathbb{R}^{n \times d}$ be structural embeddings for $G$. Let $k > 1$. If $d \geq k \cdot (2n + 1)$, then there exists a parameterization of $P$ as well as a projection matrix $W \in \mathbb{R}^{k \cdot d \times k \cdot (2n+1)}$ such that for any $v_1, \dots, v_k \in V(G)$,
$$\big[P(v_i)\big]_{i=1}^{k} W = \big[P'(v_i)\big]_{i=1}^{k} \in \mathbb{R}^{k \cdot (2n+1)},$$
where $P' \in \mathbb{R}^{n \times (2n+1)}$ is also a structural embedding for $G$, and it holds that if $P$ is sufficiently node- and adjacency-identifying, then so is $P'$.
Proof. We know from Theorem 20 that there exist sufficiently node- and adjacency-identifying structural embeddings $P'$ in $\mathbb{R}^{n \times (2n+1)}$. Since $d \geq k \cdot (2n + 1)$, we define
$$P = \big[P' \;\; 0\big],$$
where $0 \in \mathbb{R}^{n \times (d - (2n+1))}$ is an all-zero matrix. Since $P$ has a sufficiently node- and adjacency-identifying subspace $P'$, by Lemma 19, $P$ is sufficiently node- and adjacency-identifying. Further, we set $W \in \mathbb{R}^{k \cdot d \times k \cdot (2n+1)}$ to be the block-diagonal matrix with $k$ identical blocks $\begin{bmatrix} I_{2n+1} \\ 0 \end{bmatrix} \in \mathbb{R}^{d \times (2n+1)}$, i.e., each block selects the first $(2n+1)$ entries of the corresponding $P(v_i)$.
Then, we have that for nodes $v_1, \dots, v_k \in V(G)$,
$$\big[P(v_i)\big]_{i=1}^{k} W = \big[P'(v_i)\big]_{i=1}^{k} \in \mathbb{R}^{k \cdot (2n+1)},$$
and $P'$ is by definition sufficiently node- and adjacency-identifying, from which the statement follows.
We continue by showing that our construction of token embeddings in Equation (1) can be equally well represented by another
form that will be easier to use in proving Theorem 2.
Lemma 25. Let $G$ be a graph with node features $F \in \mathbb{R}^{n \times d}$. For every tokenization $X^{(0,1)} \in \mathbb{R}^{d}$ according to Equation (1) with (sufficiently) adjacency-reconstructing structural embeddings $P(v)$, there exists a parameterization of $X^{(0,1)}$ such that for every node $v \in V(G)$,
$$X^{(0,1)}(v) = \big[F'(v) \;\; 0 \;\; \deg'(v) \;\; P'(v)\big],$$
with
$$P'(v) = \mathrm{FFN}'\big(\deg'(v) + \mathrm{PE}(v)\big),$$
such that $F'(v) \in \mathbb{R}^{p}$, $0 \in \mathbb{R}^{p}$, $\deg' : \mathbb{N} \to \mathbb{R}^{r}$, and $\mathrm{FFN}' : \mathbb{R}^{d} \to \mathbb{R}^{s}$ for $d = 2p + r + s$, and where it holds for every $v, w \in V(G)$ that
$$F(v) = F(w) \iff F'(v) = F'(w),$$
$$\deg(v) = \deg(w) \iff \deg'(v) = \deg'(w),$$
$$P(v) = P(w) \iff P'(v) = P'(w),$$
and $P'(v)$ is (sufficiently) adjacency-identifying.
Proof. We set $p = (d - (2n + 1))/2 - 1$, $r = 2$, and $s = 2n + 1$. Indeed, we have that
$$2p + r + s = \big(d - (2n+1) - 2\big) + 2 + (2n+1) = d.$$
We continue by noting that the node features $F$ are mapped from $\mathbb{N}$ to $\mathbb{R}^{d}$ and can be obtained from a learnable embedding. More importantly, we can define
$$F'(v) = \big[\ell(v) \; 0 \; \dots \; 0\big],$$
where $\deg'$ is another learnable embedding for the node degrees, mapping to $\mathbb{R}^{r}$. Moreover, we know from Theorem 20 that there exist (sufficiently) adjacency-identifying structural embeddings of dimension $2n + 1$. We choose the FFN in Equation (2) such that
$$P(v) = \big[0 \;\; \deg'(v) \;\; P'(v)\big],$$
and hence
$$X^{(0,1)}(v) = F'(v) + P(v) = \big[F'(v) \;\; 0 \;\; \deg'(v) \;\; \mathrm{FFN}'\big(\deg'(v) + \mathrm{PE}(v)\big)\big],$$
where through the concatenation it is easy to verify that $X^{(0,1)}(v)$ indeed has the desired properties, namely that
$$F(v) = F(w) \iff F'(v) = F'(w),$$
$$\deg(v) = \deg(w) \iff \deg'(v) = \deg'(w),$$
$$P(v) = P(w) \iff P'(v) = P'(w),$$
and $P'(v)$ is (sufficiently) adjacency-identifying. This shows the statement.
Lemma 26. Let $G$ be a graph with node features $F \in \mathbb{R}^{n \times d}$. For every tokenization $X^{(0,k)} \in \mathbb{R}^{d}$ according to Equation (4) with sufficiently node- and adjacency-identifying structural embeddings $P$, there exists a parameterization of $X^{(0,k)}$ such that for every $k$-tuple $v = (v_1, \dots, v_k) \in V(G)^k$,
$$X^{(0,k)}(v) = \big[F'(v_1) \; \dots \; F'(v_k) \;\; 0 \;\; \deg'(v_1) \; \dots \; \deg'(v_k) \;\; P'(v_1) \; \dots \; P'(v_k) \;\; \mathrm{atp}'(v)\big],$$
with
$$P'(v) = \mathrm{FFN}'\big(\deg'(v) + \mathrm{PE}(v)\big),$$
such that $F'(v) \in \mathbb{R}^{p}$, $0 \in \mathbb{R}^{k^2 \cdot p}$, $\deg' : \mathbb{N} \to \mathbb{R}^{r}$, $\mathrm{atp}' : \mathbb{N} \to \mathbb{R}^{o}$, and $P' \in \mathbb{R}^{s}$ for $d = k^2 \cdot p + k \cdot p + k \cdot r + k \cdot s + o$, and where it holds for every $v, w \in V(G)$ that
$$F(v) = F(w) \iff F'(v) = F'(w),$$
$$\deg(v) = \deg(w) \iff \deg'(v) = \deg'(w),$$
$$P(v) = P(w) \iff P'(v) = P'(w),$$
where $P'(v)$ is node- and adjacency-identifying, and for every $v, w \in V(G)^k$,
$$\mathrm{atp}(v) = \mathrm{atp}(w) \iff \mathrm{atp}'(v) = \mathrm{atp}'(w).$$
Proof. We set $p = 1$, $r = 2$, $o = 1$, and $s = \frac{d - k^2 - 3k - 1}{k}$. Indeed, we have that
$$k^2 \cdot p + k \cdot p + k \cdot r + k \cdot s + o = k^2 + 3k + \big(d - k^2 - 3k - 1\big) + 1 = d.$$
Recall that, according to Equation (4),
$$X^{(0,k)}(v) = \big[F(v_i)\big]_{i=1}^{k} W^F + \big[P(v_i)\big]_{i=1}^{k} W^P + \mathrm{atp}(v),$$
where $W^F \in \mathbb{R}^{d \cdot k \times d}$ and $W^P \in \mathbb{R}^{d \cdot k \times d}$ are projection matrices, $P$ are structural embeddings, and $v = (v_1, \dots, v_k)$ is a $k$-tuple.
We continue by noting that the node features $F$ are mapped from $\mathbb{N}$ to $\mathbb{R}^{d}$ and can be obtained from a learnable embedding. More importantly, we can define $F'(v) = \ell(v)$, for which it clearly holds that $F(v) = F(w) \iff F'(v) = F'(w)$ for all $v, w \in V(G)$. Invoking Lemma 23, there exist learnable embeddings $F$ and a projection matrix $W^{F,*} \in \mathbb{R}^{k \cdot d \times k \cdot p}$ such that
$$\big[F(v_i)\big]_{i=1}^{k} W^{F,*} = \big[F'(v_1) \; \dots \; F'(v_k)\big] \in \mathbb{R}^{k \cdot p}.$$
Note that since we do not have an assumption on $d$, we can make $d$ arbitrarily large to ensure that $d \geq k$. Hence, we define
$$W^F = \big[W^{F,*} \;\; 0\big],$$
where $0 \in \mathbb{R}^{k^2 \cdot p + k \cdot p}$ is an all-zero block. We further define
$$P'(v) = \mathrm{FFN}'\big(\deg'(v) + \mathrm{PE}(v)\big),$$
with $\mathrm{FFN}' : \mathbb{R}^{d} \to \mathbb{R}^{s}$ another FFN inside of FFN. Note that such a FFN exists without approximation. Further, according to Lemma 24, $P'$ is still sufficiently adjacency-identifying, and there exists a projection matrix $W^{P,*} \in \mathbb{R}^{k \cdot d \times (d - o)}$ such that
$$\big[P(v_i)\big]_{i=1}^{k} W^{P,*} = \big[0 \;\; \deg'(v_1) \;\; P'(v_1) \; \dots \; \deg'(v_k) \;\; P'(v_k)\big] \in \mathbb{R}^{d - o}.$$
But then, there also exists a permutation matrix $Q \in \{0,1\}^{(d-o) \times (d-o)}$ such that
$$\big[P(v_i)\big]_{i=1}^{k} W^{P,*} Q = \big[0 \;\; \deg'(v_1) \; \dots \; \deg'(v_k) \;\; P'(v_1) \; \dots \; P'(v_k)\big] \in \mathbb{R}^{d - o},$$
where we simply re-order the entries of $[P(v_i)]_{i=1}^{k} W^{P,*}$. Note that now $0 \in \mathbb{R}^{k^2 \cdot p + k \cdot p}$, as we grouped the zero-vectors of $[P(v_i)]_{i=1}^{k} W^{P,*}$ into a single zero-vector of size $k^2 \cdot p + k \cdot p$.
Note that since we do not have an assumption on $d$, we can make $d$ arbitrarily large to ensure that $d \geq k \cdot (2n + 1)$. Hence, we define
$$W^P = \big[W^{P,*} Q \;\; 0\big],$$
where P′ (vi ) ∈ Rs . Further, since P is by assumption sufficiently node- and adjacency-identifying, Lemma 24 guarantees
that then so is P′ .
Finally, for a tuple $v \in V(G)^k$, we set
$$\mathrm{atp}(v) = \big[0 \;\; \mathrm{atp}'(v)\big],$$
where $\mathrm{atp}'(v) = \mathrm{atp}(v)$, recalling that atp maps to $\mathbb{N}$ and $0 \in \mathbb{R}^{d-1}$ is an all-zero vector. Then, we can write
$$X^{(0,k)}(v) = \big[F(v_i)\big]_{i=1}^{k} W^F + \big[P(v_i)\big]_{i=1}^{k} W^P + \mathrm{atp}(v) = \big[F'(v_1) \; \dots \; F'(v_k) \;\; 0 \;\; \deg'(v_1) \; \dots \; \deg'(v_k) \;\; P'(v_1) \; \dots \; P'(v_k) \;\; \mathrm{atp}'(v)\big],$$
for which it holds that
$$F(v) = F(w) \iff F'(v) = F'(w),$$
$$\deg(v) = \deg(w) \iff \deg'(v) = \deg'(w),$$
$$P(v) = P(w) \iff P'(v) = P'(w),$$
where $P'(v)$ is node- and adjacency-identifying, and for every $v, w \in V(G)^k$,
$$\mathrm{atp}(v) = \mathrm{atp}(w) \iff \mathrm{atp}'(v) = \mathrm{atp}'(w).$$
This shows the statement.
H. Expressivity of 1-GT
Here, we prove Theorem 2 in the main paper. We first restate Theorem VIII.4 of Grohe (2021), showing that a simple GNN can simulate the 1-WL.
Lemma 27 (Grohe (2021), Theorem VIII.4). Let $G = (V(G), E(G), \ell)$ be a graph with $n$ nodes, adjacency matrix $A(G)$, and node feature matrix $X^{(0)} := F \in \mathbb{R}^{n \times d}$ consistent with $\ell$. Further, assume a GNN that, for each layer $t > 0$, updates the vertex feature matrix as
$$X^{(t)} := \mathrm{FFN}\big(X^{(t-1)} + 2\,A(G)\,X^{(t-1)}\big).$$
Then, there exists a parameterization of the GNN such that for all $t \geq 0$ and all $v, w \in V(G)$,
$$C^1_t(v) = C^1_t(w) \iff X^{(t)}(v) = X^{(t)}(w),$$
where $C^1_t$ denotes the coloring function of the 1-WL at iteration $t$.
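The following sketch illustrates the update of Lemma 27; since we do not train an FFN here, we substitute an explicit injective relabeling of the aggregated rows, which is the role the FFN plays in the lemma. The graph and number of iterations are arbitrary choices for illustration.

```python
# Sketch of the Lemma 27 update (not the paper's trained model): with one-hot
# colors X, the row X_v + 2*(A X)_v encodes v's own color together with the
# multiset of neighbor colors, so relabeling distinct rows refines the coloring
# exactly like one 1-WL iteration.
import numpy as np

A = np.array([[0, 1, 0, 0, 0, 1],
              [1, 0, 1, 0, 0, 0],
              [0, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [0, 0, 0, 1, 0, 1],
              [1, 0, 0, 0, 1, 0]], dtype=float)   # 6-cycle
A[0, 2] = A[2, 0] = 1                              # add a chord so the refinement is non-trivial

def relabel(rows):
    """Map each distinct row to a small integer color (the role of the FFN)."""
    seen = {}
    return np.array([seen.setdefault(tuple(r), len(seen)) for r in rows])

n = A.shape[0]
colors = np.zeros(n, dtype=int)                    # uniform initial coloring
for t in range(3):
    X = np.eye(colors.max() + 1)[colors]           # one-hot encoding of current colors
    colors = relabel(X + 2 * A @ X)                # X^(t) = "FFN"(X^(t-1) + 2 A X^(t-1))
    print(f"iteration {t + 1}: colors = {colors.tolist()}")
```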
We now give the proof for a slight generalization of Theorem 2. Specifically, we relax the adjacency-identifying condition to
sufficiently adjacency-identifying.
Theorem 28 (Generalization of Theorem 2 in the main paper). Let $G = (V(G), E(G), \ell)$ be a labeled graph with $n$ nodes and let $F \in \mathbb{R}^{n \times d}$ be a node feature matrix consistent with $\ell$. Let $C^1_t$ denote the coloring function of the 1-WL at iteration $t$. Then, for all iterations $t \geq 0$, there exists a parameterization of the 1-GT with sufficiently adjacency-identifying structural embeddings such that
$$C^1_t(v) = C^1_t(w) \iff X^{(t,1)}(v) = X^{(t,1)}(w),$$
for all nodes $v, w \in V(G)$.
Proof. According to Lemma 25, there exists a parameterization of $X^{(0,1)}$ such that for each $v \in V(G)$,
$$X^{(0,1)}(v) = \big[F'(v) \;\; 0 \;\; \deg'(v) \;\; P'(v)\big], \tag{17}$$
and the base case follows from Equation (17). We let $F^{(t)}(v)$ denote the representation of the color of node $v$ at iteration $t$ and initially set $F^{(0)}(v) = F'(v)$. Further, we define the matrix $D_{\mathrm{emb}} \in \mathbb{R}^{n \times r}$ such that its $i$-th row is $D_{\mathrm{emb},i} = \deg'(v_i)$, where $v_i$ is the $i$-th node in a fixed but arbitrary node ordering. Hence, we can write
$$X^{(0,1)} = \big[F^{(0)} \;\; 0 \;\; D_{\mathrm{emb}} \;\; P'\big].$$
Now, assume that the statement holds up to iteration $t - 1$. For the induction step, we show that
$$C^1_t(v) = C^1_t(w) \iff X^{(t,1)}(v) = X^{(t,1)}(w) \tag{18}$$
holds at iteration $t$ as well. That is, we compute the 1-WL-equivalent aggregation using the node color representations $F^{(t)}$ and use the remaining columns in $X^{(0,1)}$ to keep track of degree and structural embeddings. Clearly, if Equation (18) holds for all $t$, then the statement follows. Thereto, we aim to update the vertex representations such that
$$X^{(t,1)} = \big[F^{(t)} \;\; 0 \;\; D_{\mathrm{emb}} \;\; P'\big]. \tag{19}$$
Hence, it remains to show that our graph transformer layer can update vertex representations according to Equation (19). To this end, we will require only a single transformer head in each layer. Specifically, we want the head to compute a row-normalized neighborhood aggregation based on $D^{-1}A(G)$, where $D^{-1}$ denotes the inverse of the degree matrix and $I$ denotes the identity matrix with $n$ rows. Note that since we only have one head, the head dimension is $d_v = d$. We begin by re-stating Equation (8) with expanded sub-matrices and then derive the instances necessary to obtain the head in Equation (21). We re-state the projection weights $W^Q$ and $W^K$ with expanded sub-matrices as
$$W^Q = \begin{bmatrix} W^Q_1 \\ W^Q_2 \\ W^Q_3 \\ W^Q_4 \end{bmatrix} \quad\text{and}\quad W^K = \begin{bmatrix} W^K_1 \\ W^K_2 \\ W^K_3 \\ W^K_4 \end{bmatrix},$$
where $W^Q_1, W^K_1 \in \mathbb{R}^{p \times d}$, $W^Q_2, W^K_2 \in \mathbb{R}^{p \times d}$, $W^Q_3, W^K_3 \in \mathbb{R}^{r \times d}$, and $W^Q_4, W^K_4 \in \mathbb{R}^{s \times d}$. We then define
$$Z^{(t-1)} := \frac{1}{\sqrt{d_k}} \big(X^{(t-1,1)} W^Q\big)\big(X^{(t-1,1)} W^K\big)^{\mathsf{T}},$$
where
$$X^{(t-1,1)} = \big[F^{(t-1)} \;\; 0 \;\; D_{\mathrm{emb}} \;\; P'\big]$$
by the induction hypothesis. By setting $W^Q_1, W^Q_2, W^Q_3, W^K_1, W^K_2, W^K_3$ to zero, we have
$$Z^{(t-1)} = \frac{1}{\sqrt{d_k}} \big(P' W^Q_4\big)\big(P' W^K_4\big)^{\mathsf{T}}.$$
Now, let
$$\tilde{P} := \frac{1}{\sqrt{d_k}} \big(P' W^Q_P\big)\big(P' W^K_P\big)^{\mathsf{T}},$$
for some weight matrices $W^Q_P$ and $W^K_P$. We know from Lemma 12 that, since $P'$ is sufficiently adjacency-reconstructing, there exist $W^Q_P$ and $W^K_P$ such that, by setting $W^Q_4 = b \cdot W^Q_P$ and $W^K_4 = W^K_P$, we have
$$Z^{(t-1)} = b \cdot \tilde{P},$$
where for all $\varepsilon > 0$ there exists a $b > 0$ such that
$$\big\lVert \operatorname{softmax}\big(Z^{(t-1)}\big) - D^{-1}A(G) \big\rVert_F < \varepsilon.$$
Hence, by choosing a large enough $b$, we can approximate the matrix $D^{-1}A(G)$ arbitrarily close. In the following, for clarity of presentation, we assume $\operatorname{softmax}\big(Z^{(t-1)}\big) = D^{-1}A(G)$, although we only approximate it arbitrarily close. However, by choosing $\varepsilon$ small enough, we can still approximate the matrix $X^{(t,1)}$, see below, arbitrarily close.
We now again expand the sub-matrices of $W^V$ and express Equation (8) as
$$h_1\big(X^{(t-1,1)}\big) = \operatorname{softmax}\big(Z^{(t-1)}\big)\, X^{(t-1,1)} \begin{bmatrix} W^V_1 \\ W^V_2 \\ W^V_3 \\ W^V_4 \end{bmatrix},$$
where $W^V_1 \in \mathbb{R}^{p \times d}$, $W^V_2 \in \mathbb{R}^{p \times d}$, $W^V_3 \in \mathbb{R}^{r \times d}$, and $W^V_4 \in \mathbb{R}^{s \times d}$. By setting
$$W^V_1 = \big[0 \;\; I \;\; 0 \;\; 0\big]$$
and setting the remaining sub-matrices of $W^V$ to zero, the head $h_1$ approximates
$$\big[0 \;\; D^{-1}A(G)F^{(t-1)} \;\; 0\big]$$
arbitrarily close. We now conclude our proof as follows. Recall that the transformer computes the final representation $X^{(t,1)}$ as
$$\begin{aligned}
X^{(t,1)} &= \mathrm{FFN}_{\mathrm{final}}\Big(X^{(t-1,1)} + h_1\big(X^{(t-1,1)}\big) W^O\Big) \\
&= \mathrm{FFN}_{\mathrm{final}}\Big(\big[F^{(t-1)} \;\; 0 \;\; D_{\mathrm{emb}} \;\; P'\big] + \big[0 \;\; D^{-1}A(G)F^{(t-1)} \;\; 0\big] W^O\Big) \\
&\underset{W^O := I}{=} \mathrm{FFN}_{\mathrm{final}}\Big(\big[F^{(t-1)} \;\; D^{-1}A(G)F^{(t-1)} \;\; D_{\mathrm{emb}} \;\; P'\big]\Big),
\end{aligned}$$
for some $\mathrm{FFN}_{\mathrm{final}}$. We now show that there exists an $\mathrm{FFN}_{\mathrm{final}}$ such that
$$X^{(t,1)} = \mathrm{FFN}_{\mathrm{final}}\Big(\big[F^{(t-1)} \;\; D^{-1}A(G)F^{(t-1)} \;\; D_{\mathrm{emb}} \;\; P'\big]\Big) = \big[F^{(t)} \;\; 0 \;\; D_{\mathrm{emb}} \;\; P'\big],$$
which then implies the proof statement. To this end, we show that there exists a composition of functions $f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\mathrm{deg}} \circ f_{\times 2}$ such that
$$f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\mathrm{deg}} \circ f_{\times 2}\Big(\big[F^{(t-1)} \;\; D^{-1}A(G)F^{(t-1)} \;\; D_{\mathrm{emb}} \;\; P'\big]\Big) = \big[F^{(t)} \;\; 0 \;\; D_{\mathrm{emb}} \;\; P'\big].$$
Since our domain is compact, there exist choices of $f_{\mathrm{FFN}}, f_{\mathrm{add}}, f_{\mathrm{deg}}, f_{\times 2}$ such that $f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\mathrm{deg}} \circ f_{\times 2} : \mathbb{R}^{d} \to \mathbb{R}^{d}$ is continuous. As a result, $\mathrm{FFN}_{\mathrm{final}}$ can approximate $f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\mathrm{deg}} \circ f_{\times 2}$ arbitrarily close.
Concretely, we define
$$f_{\times 2}\Big(\Big[F^{(t-1)}(v) \;\; \textstyle\sum_{w \in N(v)} \tfrac{1}{|N(v)|} F^{(t-1)}(w) \;\; |N(v)| \;\; 0 \;\; P'(v)\Big]\Big) = \Big[F^{(t-1)}(v) \;\; 2 \textstyle\sum_{w \in N(v)} \tfrac{1}{|N(v)|} F^{(t-1)}(w) \;\; |N(v)| \;\; 0 \;\; P'(v)\Big],$$
which multiplies the second column by two, and
$$f_{\mathrm{deg}}\Big(\Big[F^{(t-1)}(v) \;\; 2 \textstyle\sum_{w \in N(v)} \tfrac{1}{|N(v)|} F^{(t-1)}(w) \;\; |N(v)| \;\; 0 \;\; P'(v)\Big]\Big) = \Big[F^{(t-1)}(v) \;\; 2 \textstyle\sum_{w \in N(v)} F^{(t-1)}(w) \;\; |N(v)| \;\; 0 \;\; P'(v)\Big],$$
where $f_{\mathrm{deg}}$ multiplies the second column with the degree of node $v$ stored in the third column.
Next, we define
$$f_{\mathrm{add}}\Big(\Big[F^{(t-1)}(v) \;\; 2 \textstyle\sum_{w \in N(v)} F^{(t-1)}(w) \;\; |N(v)| \;\; 0 \;\; P'(v)\Big]\Big) = \Big[F^{(t-1)}(v) + 2 \textstyle\sum_{w \in N(v)} F^{(t-1)}(w) \;\; 0 \;\; |N(v)| \;\; 0 \;\; P'(v)\Big],$$
where we add the second column to the first column and set the elements of the second column to zero.
Finally, we define
$$\begin{aligned}
f_{\mathrm{FFN}}\Big(\Big[F^{(t-1)}(v) + 2 \textstyle\sum_{w \in N(v)} F^{(t-1)}(w) \;\; 0 \;\; |N(v)| \;\; 0 \;\; P'(v)\Big]\Big)
&= \Big[\mathrm{FFN}\Big(F^{(t-1)}(v) + 2 \textstyle\sum_{w \in N(v)} F^{(t-1)}(w)\Big) \;\; 0 \;\; |N(v)| \;\; 0 \;\; P'(v)\Big] \\
&= \Big[F^{(t)}(v) \;\; 0 \;\; \deg'(v) \;\; P'(v)\Big],
\end{aligned}$$
where FFN denotes the FFN in Equation (20), from which the last equality follows. Clearly, applying $f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\mathrm{deg}} \circ f_{\times 2}$ to each row $i$ of $X^{(t-1)}_{\mathrm{upd}}$ results in
$$f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\mathrm{deg}} \circ f_{\times 2}\big(X^{(t-1)}_{\mathrm{upd}}\big) = \big[F^{(t)} \;\; 0 \;\; D_{\mathrm{emb}} \;\; P'\big].$$
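The post-processing chain above can be checked numerically. In the sketch below the toy graph and random features are our own choices: starting from the attention output $D^{-1}A(G)F$ and the stored degrees, the steps corresponding to $f_{\times 2}$, $f_{\mathrm{deg}}$, and $f_{\mathrm{add}}$ recover $F + 2A(G)F$, i.e., exactly the aggregation of Lemma 27 that $\mathrm{FFN}_{\mathrm{final}}$ then relabels.

```python
# Illustrative only: recovering F + 2*A*F from the head output D^{-1} A F.
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
F = rng.normal(size=(3, 2))                       # current color representations F^(t-1)
deg = A.sum(axis=1, keepdims=True)

attn_out = np.linalg.inv(np.diag(deg[:, 0])) @ A @ F   # what the single head computes

x = 2.0 * attn_out                                # f_x2 : multiply the aggregate by two
x = deg * x                                       # f_deg: multiply by |N(v)| (stored degrees)
x = F + x                                         # f_add: add the node's own representation
print(np.allclose(x, F + 2 * A @ F))              # True: matches Lemma 27's update input
```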
$$g(x_i) := m^{-i},$$
and an $M$-sequence of length $l$ given by $x^{(1)}, \dots, x^{(l)}$ with positions $i^{(1)}, \dots, i^{(l)}$ in $S$, the sum
$$\sum_{j \in [l]} g\big(x^{(j)}\big) = \sum_{j \in [l]} m^{-i^{(j)}}$$
is unique for each unique multiset $M$.
Proof. By assumption, let $M \subseteq S$ denote a multiset of order $m - 1$. Further, let $x^{(1)}, \dots, x^{(l)} \in M$ be an $M$-sequence with positions $i^{(1)}, \dots, i^{(l)}$ in $S$. Given our fixed ordering of the numbers in $S$, we can equivalently write $M = ((a_1, x_1), \dots, (a_n, x_n))$, where $a_i$ denotes the multiplicity of the $i$-th number in $M$ with position $i$ from our ordering over $S$. Note that for a number $m^{-i}$ there exists a corresponding $m$-ary number written as
$$0.0 \dots \underset{i}{1} \dots,$$
with a 1 at the $i$-th digit. Hence, the sum $\sum_{j \in [l]} m^{-i^{(j)}}$ corresponds to the $m$-ary number
$$0.a_1 \dots a_n.$$
Note that $a_i = 0$ if and only if there exists no $j$ such that $i^{(j)} = i$. Since the order of $M$ is $m - 1$, it holds that $a_i < m$. Hence, it follows that the above sum is unique for each unique multiset $M$, implying the result.
Recall that $S \subseteq \mathbb{N}$ and that we fixed an arbitrary ordering over $S$. Intuitively, we use the finiteness of $S$ to map each number therein to a fixed digit of the numbers in $(0, 1)$. The finite $m$ ensures that at each digit, we have sufficient "bandwidth" to encode each $a_i$. Now that we have seen how to encode multisets over $S$ as numbers in $(0, 1)$, we review some fundamental operations on the $m$-ary numbers defined above. We will refer to the decimal number $m^{-i}$ as corresponding to the $m$-ary number
$$0.0 \dots \underset{i}{1} \dots,$$
where the $i$-th digit after the decimal point is 1 and all other digits are 0, and vice versa.
To begin with, addition between decimal numbers implements counting in $m$-ary notation: adding $m^{-i}$ increments the $i$-th digit, provided every digit stays below $m$. We used counting in the proof of the previous result to represent the multiplicities of a multiset. Next, multiplication between decimal numbers implements shifting in $m$-ary notation. Shifting further applies to general decimal numbers in $(0, 1)$: let $x \in (0, 1)$ correspond to an $m$-ary number with $l$ digits, $0.a_1 \dots a_l$. Then,
$$m^{-i} \cdot x \quad\text{corresponds to}\quad 0.\underbrace{0 \dots 0}_{i}\, a_1 \dots a_l,$$
i.e., the digits $a_1 \dots a_l$ now occupy positions $i+1, \dots, i+l$.
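A small sketch of this encoding; the ground set, base $m$, and helper functions below are our own choices for illustration: the base-$m$ digits of $\sum_j m^{-i^{(j)}}$ are exactly the multiplicities, so the encoding is injective for bounded multiplicities, and multiplying by $m^{-s}$ shifts the digit block.

```python
# Toy version of the m-ary multiset encoding and of shifting by multiplication.
from collections import Counter
from fractions import Fraction

S = [10, 20, 30, 40]                 # the finite ground set, with a fixed ordering
m = 8                                # multiplicities are assumed to be < m

def encode(multiset):
    pos = {x: i + 1 for i, x in enumerate(S)}            # position i of each element
    return sum(Fraction(1, m ** pos[x]) * c for x, c in Counter(multiset).items())

def digits(q, length):
    out = []
    for _ in range(length):                              # read off base-m digits of q
        q *= m
        out.append(int(q)); q -= int(q)
    return out

ms = [20, 20, 40, 10, 20]
code = encode(ms)
print(digits(code, len(S)))          # [1, 3, 0, 1] = multiplicities of 10, 20, 30, 40
print(digits(code * Fraction(1, m ** 2), len(S) + 2))    # shifted two digits to the right
```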
where $g^{(0)}(u)$ is initialized consistent with $\ell$, i.e., consistent with the atomic type, such that for all $u, v \in V(G)^k$,
$$C^k_0(u) = C^k_0(v) \iff g^{(0)}(u) = g^{(0)}(v).$$
Proof. Before we start, let us recall the relabeling computed by the $k$-WL for a $k$-tuple $u$ as
$$C^k_t(u) := \mathrm{RELABEL}\Big(C^k_{t-1}(u),\, M^{(t)}_1(u), \dots, M^{(t)}_k(u)\Big),$$
with
$$M^{(t)}_j(u) := \{\!\{ C^k_{t-1}\big(\phi_j(u, w)\big) \mid w \in V(G) \}\!\}.$$
To show our result, we show that there exist scalars $\beta_1, \dots, \beta_k$ such that the $m$-ary representations computed for $C^k_t(u)$ and $M^{(t)}_1(u), \dots, M^{(t)}_k(u)$ are pairwise unique. To this end, we show that a weighted sum can represent multiset counting in different exponent ranges of $m$-ary numbers in $(0, 1)$. We then simply invoke Lemma 29 to show that we can map each unique tuple $\big(C^k_t(u), M^{(t)}_1(u), \dots, M^{(t)}_k(u)\big)$ to a unique number in $(0, 1)$. Finally, the FFN is responsible for the relabeling.
Each possible color in $C^k_t$ is a unique number in $[n^k]$, as the maximum possible number of unique colors in $C^k_t$ is $n^k$. We then fix an arbitrary ordering over $[n^k]$.
We show the statement via induction over $t$. Let $m > 0$ be such that $m - 1$ is the order of the multiset $M^{(0)}_j(u)$. Note that this is the same $m$ for each $u$. For a $k$-tuple $u$ with initial color $C^k_0(u)$ at position $i$, we choose $g^{(0)}$ so as to approximate $m^{-i}$ arbitrarily close, i.e.,
$$\big\lVert g^{(0)}(u) - m^{-i} \big\rVert_F < \epsilon,$$
for an arbitrarily small $\epsilon > 0$. By choosing $\epsilon$ small enough, we have that $g^{(0)}(u)$ is unique for every unique position $i$ and hence Equation (22) holds for all pairs of $k$-tuples for $t = 0$. Note that by construction, $i \leq n^k$ and hence, for the tuple $u$ at position $i$, the $m$-ary number corresponding to $m^{-i}$ is non-zero in at most the first $n^k$ digits.
For the inductive case, we assume that
$$C^k_{t-1}(u) = C^k_{t-1}(v) \iff g^{(t-1)}(u) = g^{(t-1)}(v),$$
and that $g^{(t-1)}(u)$ approximates the $m$-ary number encoding the color $C^k_{t-1}(u)$ up to an arbitrarily small $\epsilon > 0$. Let now $u \in V(G)^k$. We say that the $j$-neighbors of $u$ with respect to $w$ have indices $l^{(1)}(w), \dots, l^{(k)}(w)$ in our ordering. Then, $\sum_{w \in V(G)} m^{-l^{(j)}(w)}$ is unique for each unique
$$M^{(t)}_j(u) = \{\!\{ C^k_{t-1}\big(\phi_j(u, w)\big) \mid w \in V(G) \}\!\}. \tag{23}$$
We now separate the contributions of the different $j \in [k]$ by setting
$$\beta_j := m^{-(n^k) \cdot j},$$
for $j \in [k]$. Specifically, let $a_1, \dots, a_{n^k}$ denote the multiplicities of the multiset $M^{(t)}_j(u)$ and let
$$0.a_1 \dots a_{n^k}$$
denote the $m$-ary number corresponding to $\sum_{w \in V(G)} m^{-l^{(j)}(w)}$.
Note that since each $m^{-l^{(j)}(w)}$ corresponds to a color under the $k$-WL at iteration $t - 1$, all digits after the $n^k$-th digit are zero. Then, multiplying the above sum with $\beta_j$ results in a shift in $m$-ary notation and, hence, the $m$-ary number that corresponds to the term
$$\beta_j \cdot \sum_{w \in V(G)} m^{-l^{(j)}(w)} \tag{24}$$
can be written as
$$0.\underbrace{0 \dots 0}_{n^k \cdot j}\, \underbrace{a_1 \dots a_{n^k}}_{n^k \cdot j + 1, \dots, n^k \cdot j + n^k}\, \underbrace{0 \dots 0}_{n^k \cdot (j+1) + 1, \dots}.$$
As a result, for all $l \neq j$, the non-zero digits of the $m$-ary number that corresponds to
$$\beta_l \cdot \sum_{w \in V(G)} m^{-l^{(l)}(w)}$$
do not collide with the non-zero digits of the output of Equation (24) and hence, the sum
$$\sum_{j \in [k]} \beta_j \cdot \sum_{w \in V(G)} m^{-l^{(j)}(w)} \tag{25}$$
is unique for each unique tuple of multisets $\big(M^{(t)}_1(u), \dots, M^{(t)}_k(u)\big)$.
Let $u_i$ be the $i$-th tuple in our fixed but arbitrary ordering. Then, $g^{(t-1)}(u_i)$ approximates the number $m^{-i}$ arbitrarily close. Note that by the induction hypothesis, the $m$-ary number that corresponds to $m^{-i}$ is non-zero in at most the first $n^k$ digits, as $n^k$ is the maximum possible number of colors attainable under the $k$-WL. Since the smallest shift in Equation (25) is by $n^k$ digits (for $j = 1$) and since the $m$-ary number that corresponds to $m^{-i}$ is non-zero in at most the first $n^k$ digits, $m^{-i}$ and the sum in Equation (25) have no intersecting non-zero digits. As a result,
$$m^{-i} + \sum_{j \in [k]} \beta_j \cdot \sum_{w \in V(G)} m^{-l^{(j)}(w)} \tag{26}$$
is unique for each unique tuple $\big(C^k_{t-1}(u_i), M^{(t)}_1(u_i), \dots, M^{(t)}_k(u_i)\big)$.
Now, by the induction hypothesis, we can approximate each $m^{-i}$ with $g^{(t-1)}(u_i)$ arbitrarily close. Further, we can approximate each $m^{-l^{(j)}(w)}$ with $g^{(t-1)}\big(\phi_j(u, w)\big)$ arbitrarily close. As a result, we can approximate Equation (26) with
$$g^{(t-1)}(u) + \sum_{j \in [k]} \beta_j \cdot \sum_{w \in V(G)} g^{(t-1)}\big(\phi_j(u, w)\big)$$
arbitrarily close. Finally, since for finite graphs of order $n$ there exists only a finite number of such tuples, there exists a continuous function mapping the output of Equation (26) to $m^{-i}$, where $i$ now denotes the position of the color $C^k_t(u)$ in $[n^k]$, for all $u \in V(G)^k$. We approximate this function with the FFN arbitrarily close and obtain the result.
Corollary 31. Let $G = (V(G), E(G), \ell)$ be an $n$-order (node-)labeled graph. Then, for each $k$-tuple $u := (u_1, \dots, u_k)$ and each $t > 0$, we can equivalently express the coloring of the $\delta$-$k$-WL as
$$C^{\delta,k}_t(u) := \mathrm{RELABEL}\Big(C^{\delta,k}_{t-1}(u),\, \{\!\{ C^{\delta,k}_{t-1}(\phi_1(u, w)) \mid w \in N(u_1) \}\!\},\, \{\!\{ C^{\delta,k}_{t-1}(\phi_1(u, w)) \mid w \in V(G) \setminus N(u_1) \}\!\},\, \dots,\, \{\!\{ C^{\delta,k}_{t-1}(\phi_k(u, w)) \mid w \in N(u_k) \}\!\},\, \{\!\{ C^{\delta,k}_{t-1}(\phi_k(u, w)) \mid w \in V(G) \setminus N(u_k) \}\!\}\Big).$$
With the above statement, we can directly derive a $\delta$-variant of Proposition 30. To this end, let $\Delta_j(u)$ denote the set of vertices adjacent to the $j$-th node in $u$.
Corollary 32. Let $G = (V(G), E(G), \ell)$ be an $n$-order (vertex-)labeled graph and assume a vertex feature matrix $F \in \mathbb{R}^{n \times d}$ that is consistent with $\ell$. Then, for all $t \geq 1$, there exist a function $g^{(t)}$ and scalars $\alpha_1, \dots, \alpha_k, \beta_1, \dots, \beta_k$ with
$$g^{(t)}(u) := \mathrm{FFN}\Big(g^{(t-1)}(u) + \sum_{j \in [k]} \Big(\alpha_j \sum_{w \in \Delta_j(u)} g^{(t-1)}\big(\phi_j(u, w)\big) + \beta_j \sum_{w \in V(G) \setminus \Delta_j(u)} g^{(t-1)}\big(\phi_j(u, w)\big)\Big)\Big),$$
such that $C^{\delta,k}_t(u) = C^{\delta,k}_t(v) \iff g^{(t)}(u) = g^{(t)}(v)$ for all $u, v \in V(G)^k$.
The proof is the same as for Proposition 30. Further, we can also recover the $\delta$-$k$-LWL variant of Proposition 30 by setting $\beta_j = 0$. Lastly, we again recover Proposition 30 by setting $\alpha_j = \beta_j$.
for all $k$-tuples $v, w \in V(G)^k$. If $C^k_t$ is the coloring function of the $k$-WL, it suffices that the structural embeddings of the $k$-GT are sufficiently node-identifying. Otherwise, we require the structural embeddings of the $k$-GT to be both sufficiently node- and adjacency-identifying.
Proof. Recall the tokenization of Equation (4), where $P$ are structural embeddings and atp is a learnable embedding of the atomic type. We further know from Lemma 26 that we can write
$$X^{(0,k)}(v) = \big[F'(v_1) \; \dots \; F'(v_k) \;\; 0 \;\; \deg'(v_1) \; \dots \; \deg'(v_k) \;\; P'(v_1) \; \dots \; P'(v_k) \;\; \mathrm{atp}'(v)\big]. \tag{28}$$
Further, $F'(v) \in \mathbb{R}^{p}$, $0 \in \mathbb{R}^{k^2 \cdot p}$, $\deg' : \mathbb{N} \to \mathbb{R}^{r}$, $\mathrm{atp}' : \mathbb{N} \to \mathbb{R}^{o}$, and $P' \in \mathbb{R}^{s}$, for some choice of $p, r, s, o$ with $d = k^2 \cdot p + k \cdot p + k \cdot r + k \cdot s + o$, as specified in Lemma 26. We will use this parameterization of $X^{(0,k)}$ throughout the rest of the proof.
We prove our statement by induction over iteration $t$. For the base case, notice that the initial color of a tuple $v$ depends on the atomic type and the node labeling. In Equation (28), we encode the atomic type with $\mathrm{atp}'(v)$ and the node labels by concatenating the features $F'(v_1), \dots, F'(v_k)$ of the $k$ nodes $v_1, \dots, v_k$ in $v$. The concatenation of both node labels and atomic type is clearly injective, and so the base case follows. For the induction step, we aggregate over the (non-)neighborhoods of the tuple via
$$H\big(X^{(t-1,k)}(u_i)\big) := \sum_{j \in [k]} \Big(\alpha_j \sum_{w \in \Delta_j(u_i)} F^{(t-1)}\big(\phi_j(u_i, w)\big) + \beta_j \sum_{w \in V(G) \setminus \Delta_j(u_i)} F^{(t-1)}\big(\phi_j(u_i, w)\big)\Big),$$
for $\alpha_1, \dots, \alpha_k, \beta_1, \dots, \beta_k \in \mathbb{R}$ that we pick such that $H\big(X^{(t-1,k)}(u_i)\big)$ is equivalent to the neural architecture in Corollary 32. Note that we obtain the $k$-WL for $\alpha_j = \beta_j$ for all $j \in [k]$; then,
$$H\big(X^{(t-1,k)}(u_i)\big) = \sum_{j \in [k]} \beta_j \sum_{w \in V(G)} F^{(t-1)}\big(\phi_j(u_i, w)\big).$$
Hence, it remains to show that the standard transformer layer can update tuple representations according to Equation (30). To this end, we will require $2k$ transformer heads $h^1_1, \dots, h^1_k, h^2_1, \dots, h^2_k$ in each layer. Specifically, in the first $k$ heads, we want to compute
$$h^1_j\big(X^{(t-1,k)}(u_i)\big) = \alpha_j \cdot \sum_{w \in \Delta_j(u_i)} F^{(t-1)}\big(\phi_j(u_i, w)\big), \tag{31}$$
and in the remaining $k$ heads, we want to compute
$$h^2_j\big(X^{(t-1,k)}(u_i)\big) = \beta_j \cdot \sum_{w \in V(G) \setminus \Delta_j(u_i)} F^{(t-1)}\big(\phi_j(u_i, w)\big). \tag{32}$$
In both of the above cases, for a head $h^\gamma_j$, $\gamma \in \{-1, 1\}$ denotes the type of head (where $h^1_j$ corresponds to $\gamma = 1$ and $h^2_j$ to $\gamma = -1$), and $j \in [k]$ denotes the $j$-neighbors the head aggregates over. For the head dimension, we set $d_v = d$.
For each $j$, recall the definition of the standard transformer head at tuple $u_i$ at position $i$ as
$$h^\gamma_j\big(X^{(t-1,k)}(u_i)\big) = \operatorname{softmax}\Big(\big[Z^{(t-1,\gamma)}_{i1} \; \dots \; Z^{(t-1,\gamma)}_{in^k}\big]\Big) \begin{bmatrix} X^{(t-1)}(u_1) \\ \vdots \\ X^{(t-1)}(u_{n^k}) \end{bmatrix} W^V,$$
where
$$Z^{(t-1,\gamma)}_{il} := \frac{1}{\sqrt{d_k}} \big(X^{(t-1,k)}(u_i) W^{Q,\gamma}\big)\big(X^{(t-1,k)}(u_l) W^{K,\gamma}\big)^{\mathsf{T}} \in \mathbb{R} \tag{33}$$
is the unnormalized attention score between tuples $u_i$ and $u_l$, with
$$W^{Q,\gamma} = \begin{bmatrix} W^Q_F \\ W^Q_Z \\ W^Q_D \\ W^{Q,\gamma}_P \\ W^Q_A \end{bmatrix}, \qquad W^{K,\gamma} = \begin{bmatrix} W^K_F \\ W^K_Z \\ W^K_D \\ W^{K,\gamma}_P \\ W^K_A \end{bmatrix}, \qquad W^V = \begin{bmatrix} W^V_F \\ W^V_Z \\ W^V_D \\ W^V_P \\ W^V_A \end{bmatrix},$$
where only the sub-matrices $W^{Q,\gamma}_P$ and $W^{K,\gamma}_P$ differ for different $\gamma$. We now specify projection matrices $W^{Q,\gamma}$, $W^{K,\gamma}$, and $W^V$ in a way that allows the attention head $h^\gamma_j$ to dynamically recover the $j$-neighborhood adjacency as well as the adjacency between $j$-neighbors in the attention matrix. To this end, in heads $h^1_j$ and $h^2_j$ we set
$$W^Q_F = W^K_F = 0, \quad W^Q_Z = W^K_Z = W^V_Z = 0, \quad W^V_P = 0, \quad W^Q_D = W^K_D = W^V_D = 0, \quad W^Q_A = W^K_A = W^V_A = 0.$$
The remaining non-zero sub-matrices are $W^V_F$, $W^{Q,\gamma}_P$, and $W^{K,\gamma}_P$, which we define next.
We begin by defining $W^{Q,\gamma}_P$ and $W^{K,\gamma}_P$. Specifically, we want to choose $W^{Q,\gamma}_P$ and $W^{K,\gamma}_P$ such that the attention matrix in head $h^\gamma_j$ can approximate the generalized adjacency matrix $A^{(k,j,\gamma)}$ in Equation (11). To this end, we simply invoke Lemma 13, guaranteeing that there exist $W^{Q,\gamma}_P$ and $W^{K,\gamma}_P$ such that for each $\epsilon > 0$ and each $\gamma \in \{-1, 1\}$,
$$\Big\lVert \operatorname{softmax}\Big(\big[Z^{(t-1,\gamma)}_{i1} \; \dots \; Z^{(t-1,\gamma)}_{in^k}\big]\Big) - \tilde{A}^{(k,j,\gamma)}_i \Big\rVert_F < \varepsilon,$$
i.e., we can approximate the generalized adjacency matrix arbitrarily close for each $k > 1$, $j \in [k]$, and $\gamma \in \{-1, 1\}$. In the following, for clarity of presentation, we assume
$$\operatorname{softmax}\Big(\big[Z^{(t-1,\gamma)}_{i1} \; \dots \; Z^{(t-1,\gamma)}_{in^k}\big]\Big) = \tilde{A}^{(k,j,\gamma)}_i,$$
although we only approximate it arbitrarily close. However, by choosing $\varepsilon$ small enough, we can still approximate the matrix $X^{(t,k)}$, see below, arbitrarily close. We now set
$$W^V = \big[I_{n \times kp} \;\; 0\big],$$
where $I_{n \times kp}$ denotes the first $n$ rows of the $kp$-by-$kp$ identity matrix if $kp > n$, or else the first $kp$ columns of the $n$-by-$n$ identity matrix, and $0 \in \mathbb{R}^{n \times (d - kp)}$ is an all-zero matrix. The above yields
$$h^\gamma_j\big(X^{(t-1,k)}(u_i)\big) = \gamma\, \big[\tilde{A}^{(k,j,\gamma)}_{i1} \; \dots \; \tilde{A}^{(k,j,\gamma)}_{in^k}\big] \begin{bmatrix} F^{(t-1)}(u_1) \\ \vdots \\ F^{(t-1)}(u_{n^k}) \end{bmatrix} = \frac{\gamma}{d^\gamma_{ij}} \cdot \begin{cases} \sum_{w \in \Delta_j(u_i)} F^{(t-1)}\big(\phi_j(u_i, w)\big) & \gamma = 1 \\[4pt] \sum_{w \in V(G) \setminus \Delta_j(u_i)} F^{(t-1)}\big(\phi_j(u_i, w)\big) & \gamma = -1, \end{cases}$$
where
$$d^\gamma_{ij} = \begin{cases} d_{ij} & \gamma = 1 \\ n - d_{ij} & \gamma = -1. \end{cases}$$
We then choose the output projection $W^O$ accordingly, i.e., we leave the first $kp$ diagonal elements zero and then fill the next $2k$ diagonal elements of $W^O$ with the $\alpha_j$ and $\beta_j$, so that
$$\big[h^1_1(X^{(t-1,k)})(u_i) \; \dots \; h^1_k(X^{(t-1,k)})(u_i) \;\; h^2_1(X^{(t-1,k)})(u_i) \; \dots \; h^2_k(X^{(t-1,k)})(u_i)\big]\, W^O = \Big[\,0 \;\; \tfrac{\alpha_1}{d_{i1}}\textstyle\sum_{w \in \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; \tfrac{\alpha_k}{d_{ik}}\textstyle\sum_{w \in \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; \tfrac{\beta_1}{n - d_{i1}}\textstyle\sum_{w \in V(G) \setminus \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; \tfrac{\beta_k}{n - d_{ik}}\textstyle\sum_{w \in V(G) \setminus \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; 0\,\Big],$$
where the first zero vector is in $\mathbb{R}^{kp}$ and the second zero vector is in $\mathbb{R}^{d - k(r+s) + o}$. We now define the vector
$$\tilde{X}(u_i) = \big[F^{(t-1)}(u_i) \;\; 0 \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\big] + \Big[\,0 \;\; \tfrac{\alpha_1}{d_{i1}}\textstyle\sum_{w \in \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; \tfrac{\beta_k}{n - d_{ik}}\textstyle\sum_{w \in V(G) \setminus \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; 0\,\Big]$$
$$= \Big[F^{(t-1)}(u_i) \;\; \tfrac{\alpha_1}{d_{i1}}\textstyle\sum_{w \in \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; \tfrac{\alpha_k}{d_{ik}}\textstyle\sum_{w \in \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; \tfrac{\beta_1}{n - d_{i1}}\textstyle\sum_{w \in V(G) \setminus \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; \tfrac{\beta_k}{n - d_{ik}}\textstyle\sum_{w \in V(G) \setminus \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\Big] \in \mathbb{R}^{d}.$$
For convenience, we define
$$\tilde{F}_\alpha(u_i) := \Big[\tfrac{\alpha_1}{d_{i1}}\textstyle\sum_{w \in \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; \tfrac{\alpha_k}{d_{ik}}\textstyle\sum_{w \in \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w))\Big]$$
and
$$\tilde{F}_\beta(u_i) := \Big[\tfrac{\beta_1}{n - d_{i1}}\textstyle\sum_{w \in V(G) \setminus \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; \tfrac{\beta_k}{n - d_{ik}}\textstyle\sum_{w \in V(G) \setminus \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w))\Big].$$
We additionally define
$$\deg'(u_j) = \big[d_{ij} \;\; n - d_{ij}\big] \in \mathbb{R}^{2},$$
where $d_{ij}$ is again the degree of node $u_j$ in the $k$-tuple $u_i = (u_1, \dots, u_k)$. Note that $\tilde{X}(u_i)$ represents all the information we feed into the final FFN. Specifically, we obtain the updated representation of tuple $u_i$ as
$$X^{(t,k)}(u_i) = \mathrm{FFN}\big(\tilde{X}(u_i)\big),$$
which then implies the proof statement. It is worth pausing at this point to remind ourselves what each element in $\tilde{X}(u_i)$ represents. To this end, we use PyTorch-like array slicing, i.e., for a vector $x$, $x[a : b]$ corresponds to the sub-vector of length $b - a$ for $b > a$ containing the $(a + 1)$-st to $b$-th element of $x$. E.g., for a vector $x = (x_1, x_2, x_3, x_4, x_5)^{\mathsf{T}}$, $x[1 : 4] = (x_2, x_3, x_4)^{\mathsf{T}}$. Now, we interpret $\tilde{X}(u_i)$ by its sub-vectors. Concretely,
1. $\tilde{X}(u_i)[0 : kp] \in \mathbb{R}^{kp}$ corresponds to the previous color representation $F^{(t-1)}(u_i)$ of tuple $u_i$.
2. $\tilde{F}_\alpha(u_i)[(j-1)kp : jkp] \in \mathbb{R}^{kp}$ corresponds to the degree-normalized sum over adjacent $j$-neighbors $\frac{\alpha_j}{d_{ij}} \sum_{w \in \Delta_j(u_i)} F^{(t-1)}(\phi_j(u_i, w))$.
3. $\tilde{F}_\beta(u_i)[(j-1)kp : jkp] \in \mathbb{R}^{kp}$ corresponds to the degree-normalized sum over non-adjacent $j$-neighbors $\frac{\beta_j}{n - d_{ij}} \sum_{w \in V(G) \setminus \Delta_j(u_i)} F^{(t-1)}(\phi_j(u_i, w))$.
4. $\deg'(u_i)[2(j-1)] = d_{ij}$, where $d_{ij}$ is the degree of node $u_j$ in the $k$-tuple $u_i = (u_1, \dots, u_k)$.
5. $\deg'(u_i)[2(j-1) + 1] = n - d_{ij}$, where $d_{ij}$ is the degree of node $u_j$ in the $k$-tuple $u_i = (u_1, \dots, u_k)$.
To this end, we show that there exists a sequence of functions $f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\mathrm{deg}}$ such that
$$X^{(t,k)}(u_i) = f_{\mathrm{FFN}} \circ f_{\mathrm{add}} \circ f_{\mathrm{deg}}\big(\tilde{X}(u_i)\big) = \big[F^{(t)}(u_i) \;\; 0 \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\big],$$
Concretely, we define
$$f_{\mathrm{deg}}\big(\tilde{X}(u_i)\big) = \Big[F^{(t-1)}(u_i) \;\; d_{i1} \cdot \tfrac{\alpha_1}{d_{i1}}\textstyle\sum_{w \in \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; d_{ik} \cdot \tfrac{\alpha_k}{d_{ik}}\textstyle\sum_{w \in \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; (n - d_{i1}) \cdot \tfrac{\beta_1}{n - d_{i1}}\textstyle\sum_{w \in V(G) \setminus \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; (n - d_{ik}) \cdot \tfrac{\beta_k}{n - d_{ik}}\textstyle\sum_{w \in V(G) \setminus \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w)) \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\Big],$$
where $f_{\mathrm{deg}}$ multiplies
1. $\tilde{F}_\alpha(u_i)[(j-1)kp : jkp]$ with $\deg'(u_i)[2(j-1)]$, and
2. $\tilde{F}_\beta(u_i)[(j-1)kp : jkp]$ with $\deg'(u_i)[2(j-1) + 1]$.
We then define
$$F_\alpha(u_i) := \Big[\alpha_1 \cdot \textstyle\sum_{w \in \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; \alpha_k \cdot \textstyle\sum_{w \in \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w))\Big]$$
and
$$F_\beta(u_i) := \Big[\beta_1 \cdot \textstyle\sum_{w \in V(G) \setminus \Delta_1(u_i)} F^{(t-1)}(\phi_1(u_i, w)) \;\; \dots \;\; \beta_k \cdot \textstyle\sum_{w \in V(G) \setminus \Delta_k(u_i)} F^{(t-1)}(\phi_k(u_i, w))\Big].$$
Next, we define
$$f_{\mathrm{add}}\Big(f_{\mathrm{deg}}\big(\tilde{X}(u_i)\big)\Big) = \Big[F^{(t-1)}(u_i) + \textstyle\sum_{j \in [k]} \big(F_\alpha(u_i)[(j-1)kp : jkp] + F_\beta(u_i)[(j-1)kp : jkp]\big) \;\; 0 \;\; \deg'(u_i) \;\; P'(u_i) \;\; \mathrm{atp}'(u_i)\Big],$$
where $f_{\mathrm{add}}$ sums up $F^{(t-1)}(u_i)$ with $F_\alpha(u_i)[(j-1)kp : jkp]$ and $F_\beta(u_i)[(j-1)kp : jkp]$ for each $j$. Afterwards, $f_{\mathrm{add}}$ sets
$$f_{\mathrm{deg}}\big(\tilde{X}(u_i)\big)[kp : 2k^2 p + kp] = 0 \in \mathbb{R}^{2k^2 p}.$$
Now, finally, note from the definitions of $F_\alpha(u_i)$ and $F_\beta(u_i)$ that
$$\sum_{j \in [k]} \big(F_\alpha(u_i)[(j-1)kp : jkp] + F_\beta(u_i)[(j-1)kp : jkp]\big) = H\big(X^{(t-1,k)}(u_i)\big),$$
where FFN denotes the FFN in Equation (30), from which the last equality follows. Thank you for sticking around until the
end. This completes the proof.
The above proof shows that the k-GT can simulate the δ-k-WL. Since the δ-k-WL is strictly more expressive than the k-WL
(Morris et al., 2020), Theorem 5 follows directly from Theorem 4.
Further, the above proof shows that the $k$-GT can simulate the $\delta$-$k$-LWL. If we simulate the $\delta$-$k$-LWL with the $k$-GT while only using token embeddings for tuples in $V(G)^k_s$, Theorem 6 directly follows.