Graphormer 2021 NeurIPS
Abstract
The Transformer architecture has become a dominant choice in many domains, such
as natural language processing and computer vision. Yet, it has not achieved com-
petitive performance on popular leaderboards of graph-level prediction compared
to mainstream GNN variants. Therefore, it remains a mystery how Transformers
could perform well for graph representation learning. In this paper, we solve this
mystery by presenting Graphormer, which is built upon the standard Transformer
architecture and attains excellent results on a broad range of graph representation learning tasks, especially on the recent OGB Large-Scale Challenge. Our key insight for utilizing the Transformer on graphs is the necessity of effectively encoding the structural information of a graph into the model. To this end, we propose several simple yet effective structural encoding methods that help Graphormer better model graph-structured data. Besides, we mathematically characterize the expressive power of Graphormer and show that, with our ways of encoding the structural information of graphs, many popular GNN variants can be covered as special cases of Graphormer. The code and models of Graphormer will be made publicly available at https://github.com/Microsoft/Graphormer.
1 Introduction
The Transformer [46] is well acknowledged as the most powerful neural network for modelling sequential data, such as natural language [11, 35, 6] and speech [17]. Model variants built upon the Transformer have also shown great performance in computer vision [12, 36] and programming languages [19, 57, 41]. However, to the best of our knowledge, the Transformer has still not become the de-facto standard on public graph representation leaderboards [22, 14, 21]. There have been many attempts to bring the Transformer into the graph domain, but the only effective way so far has been to replace some key modules (e.g., feature aggregation) in classic GNN variants with the softmax attention [47, 7, 23, 48, 56, 43, 13]. Therefore, it is still an open question whether the Transformer architecture is suitable for modelling graphs and how to make it work in graph representation learning.
In this paper, we give an affirmative answer by developing Graphormer, which is directly built upon the standard Transformer and achieves state-of-the-art performance on a wide range of graph-level prediction tasks, including the very recent Open Graph Benchmark Large-Scale Challenge (OGB-LSC) [21] and several popular leaderboards (e.g., OGB [22], Benchmarking-GNN [14]). The Transformer is originally designed for sequence modeling. To utilize its power on graphs, we believe the key is to effectively encode the structural information of a graph into the model.
2 Preliminary
In this section, we recap the preliminaries in Graph Neural Networks and Transformer.
Graph Neural Network (GNN). Let $G = (V, E)$ denote a graph, where $V = \{v_1, v_2, \cdots, v_n\}$ and $n = |V|$ is the number of nodes. Let the feature vector of node $v_i$ be $x_i$. GNNs aim to learn representations of nodes and graphs. Typically, modern GNNs follow a learning schema that iteratively updates the representation of a node by aggregating the representations of its first- or higher-order neighbors. We denote $h_i^{(l)}$ as the representation of $v_i$ at the $l$-th layer and define $h_i^{(0)} = x_i$. The $l$-th iteration of aggregation can be characterized by an AGGREGATE-COMBINE step as
$$a_i^{(l)} = \text{AGGREGATE}^{(l)}\left(\left\{ h_j^{(l-1)} : j \in \mathcal{N}(v_i) \right\}\right), \qquad h_i^{(l)} = \text{COMBINE}^{(l)}\left(h_i^{(l-1)}, a_i^{(l)}\right), \qquad (1)$$
where $\mathcal{N}(v_i)$ is the set of first- or higher-order neighbors of $v_i$. The AGGREGATE function is used to gather the information from neighbors; common aggregation functions include MEAN, MAX, and SUM, which are used in different GNN architectures [26, 18, 47, 50]. The goal of the COMBINE function is to fuse the information from neighbors into the node representation.
Figure 1: An illustration of our proposed centrality encoding, spatial encoding, and edge encoding in Graphormer.
In addition, for graph representation tasks, a READOUT function is designed to aggregate the node features $h_i^{(L)}$ of the final iteration into the representation $h_G$ of the entire graph $G$:
$$h_G = \text{READOUT}\left(\left\{ h_i^{(L)} \mid v_i \in G \right\}\right). \qquad (2)$$
READOUT can be implemented by a simple permutation-invariant function such as summation [50] or a more sophisticated graph-level pooling function [1].
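For illustration, here is a minimal PyTorch-style sketch of the AGGREGATE-COMBINE step in Eq. (1) (with MEAN aggregation) and a sum READOUT as in Eq. (2). This is a toy sketch for exposition only; the function names, weight matrices, and the example graph are illustrative and not taken from any particular GNN implementation.

```python
import torch

def gnn_layer(h, adj, W_self, W_neigh):
    """One AGGREGATE-COMBINE step (Eq. 1) with MEAN aggregation.

    h:   [n, d] node representations h^{(l-1)}
    adj: [n, n] binary adjacency matrix (adj[i, j] = 1 if v_j is a neighbor of v_i)
    """
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)   # node degrees (avoid division by zero)
    a = adj @ h / deg                                  # AGGREGATE: mean over neighbors
    return torch.relu(h @ W_self + a @ W_neigh)        # COMBINE: fuse self and neighbor info

def readout(h):
    """Permutation-invariant READOUT (Eq. 2): sum-pool node features into h_G."""
    return h.sum(dim=0)

# toy usage: a 4-node path graph with 8-dimensional features
n, d = 4, 8
h = torch.randn(n, d)
adj = torch.tensor([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.float)
W_self, W_neigh = torch.randn(d, d), torch.randn(d, d)
h = gnn_layer(h, adj, W_self, W_neigh)
h_G = readout(h)   # graph-level representation
```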
Transformer. Each Transformer layer consists of two parts: a self-attention module and a position-wise feed-forward network (FFN). Let $H = [h_1^\top, \cdots, h_n^\top]^\top \in \mathbb{R}^{n \times d}$ denote the input of the self-attention module, where $d$ is the hidden dimension and $h_i \in \mathbb{R}^{1 \times d}$ is the hidden representation at position $i$. The input $H$ is projected by three matrices $W_Q \in \mathbb{R}^{d \times d_K}$, $W_K \in \mathbb{R}^{d \times d_K}$, and $W_V \in \mathbb{R}^{d \times d_V}$ to the corresponding representations $Q$, $K$, $V$, and the self-attention is then calculated as
$$Q = HW_Q, \quad K = HW_K, \quad V = HW_V, \quad A = \frac{QK^\top}{\sqrt{d_K}}, \qquad (3)$$
$$\text{Attn}(H) = \mathrm{softmax}(A)\,V, \qquad (4)$$
where $A$ is a matrix capturing the similarity between queries and keys. For simplicity of illustration, we consider single-head self-attention and assume $d_K = d_V = d$. The extension to multi-head attention is standard and straightforward, and we omit bias terms for simplicity.
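For concreteness, a minimal sketch of the single-head self-attention of Eqs. (3)-(4) follows; the tensor shapes and names are illustrative, and bias terms are omitted as in the text.

```python
import torch

def self_attention(H, W_Q, W_K, W_V):
    """Single-head self-attention: A = QK^T / sqrt(d), output = softmax(A) V."""
    d = W_K.shape[1]
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V
    A = Q @ K.T / d ** 0.5              # [n, n] query-key similarity matrix
    return torch.softmax(A, dim=-1) @ V

n, d = 5, 16
H = torch.randn(n, d)
W_Q, W_K, W_V = (torch.randn(d, d) for _ in range(3))
out = self_attention(H, W_Q, W_K, W_V)  # [n, d] updated node representations
```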
3 Graphormer
In this section, we present our Graphormer for graph tasks. First, we elaborate on several key
designs in the Graphormer, which serve as an inductive bias in the neural network to learn the graph
representation. We then provide the detailed implementation of Graphormer. Finally, we show that our proposed Graphormer is more powerful in the sense that popular GNN models [26, 50, 18] are its special cases.
3.1 Structural Encodings in Graphormer
As discussed in the introduction, it is important to develop ways of incorporating the structural information of graphs into the Transformer model. To this end, we present three simple but effective designs of encoding in Graphormer. See Figure 1 for an illustration.
3.1.1 Centrality Encoding
In Graphormer, we use the degree centrality as an additional signal to the network: we add learnable in-degree and out-degree embeddings to the node features to form the input of the first layer,
$$h_i^{(0)} = x_i + z^-_{\deg^-(v_i)} + z^+_{\deg^+(v_i)}, \qquad (5)$$
where $z^-, z^+ \in \mathbb{R}^d$ are learnable embedding vectors specified by the in-degree $\deg^-(v_i)$ and out-degree $\deg^+(v_i)$, respectively. For undirected graphs, $\deg^-(v_i)$ and $\deg^+(v_i)$ can be unified to $\deg(v_i)$. By using the centrality encoding in the input, the softmax attention can catch the node importance signal in the queries and the keys. Therefore, the model can capture both the semantic correlation and the node importance in the attention mechanism.
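A minimal sketch of how such a centrality encoding could be implemented, assuming degree-indexed embedding tables as implied by Eq. (5); the module name, the maximum degree, and the toy inputs are illustrative assumptions, not the official Graphormer code.

```python
import torch
import torch.nn as nn

class CentralityEncoding(nn.Module):
    """Add learnable in-/out-degree embeddings z^-, z^+ to the input node features (Eq. 5)."""
    def __init__(self, max_degree: int, d: int):
        super().__init__()
        self.z_in = nn.Embedding(max_degree + 1, d)    # z^- indexed by deg^-(v_i)
        self.z_out = nn.Embedding(max_degree + 1, d)   # z^+ indexed by deg^+(v_i)

    def forward(self, x, in_degree, out_degree):
        # x: [n, d] node features; in_degree, out_degree: [n] integer tensors
        return x + self.z_in(in_degree) + self.z_out(out_degree)

enc = CentralityEncoding(max_degree=64, d=32)
x = torch.randn(10, 32)
deg_in = torch.randint(0, 5, (10,))
deg_out = torch.randint(0, 5, (10,))
h0 = enc(x, deg_in, deg_out)   # h^{(0)} used as the Transformer input
```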
3.1.2 Spatial Encoding
To encode the structural relation between node pairs, we choose $\phi(v_i, v_j)$ to be the distance of the shortest path (SPD) between $v_i$ and $v_j$ if the two nodes are connected. Each (feasible) output value of $\phi$ is assigned a learnable scalar which serves as a bias term in the self-attention module. Concretely, we modify the $(i, j)$-element of $A$ in Eq. (3) as
$$A_{ij} = \frac{(h_i W_Q)(h_j W_K)^\top}{\sqrt{d}} + b_{\phi(v_i, v_j)}, \qquad (6)$$
where $b_{\phi(v_i, v_j)}$ is a learnable scalar indexed by $\phi(v_i, v_j)$ and shared across all layers.
Here we discuss several benefits of our proposed method. First, compared to conventional GNNs described in Section 2, where the receptive field is restricted to the neighbors, Eq. (6) shows that the Transformer layer provides global information: each node can attend to all other nodes in the graph. Second, by using $b_{\phi(v_i, v_j)}$, each node in a single Transformer layer can adaptively attend to all other nodes according to the graph structural information. For example, if $b_{\phi(v_i, v_j)}$ is learned to be a decreasing function with respect to $\phi(v_i, v_j)$, then for each node the model will likely pay more attention to nearby nodes and less attention to nodes far away from it.
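A minimal sketch of the spatial-encoding bias of Eq. (6): a learnable scalar indexed by the shortest-path distance $\phi(v_i, v_j)$ is added to every attention logit. The clipping of large distances and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatialEncoding(nn.Module):
    """Learnable scalar bias b_{phi(v_i, v_j)}, shared across layers, indexed by the SPD."""
    def __init__(self, max_dist: int):
        super().__init__()
        self.bias = nn.Embedding(max_dist + 1, 1)   # one learnable scalar per distance value

    def forward(self, spd):
        # spd: [n, n] integer shortest-path distances; clip distances beyond max_dist
        spd = spd.clamp(max=self.bias.num_embeddings - 1)
        return self.bias(spd).squeeze(-1)            # [n, n] bias added to the attention logits

spatial = SpatialEncoding(max_dist=20)
spd = torch.randint(0, 8, (5, 5))
A_bias = spatial(spd)   # used as: A = (Q @ K.T) / d**0.5 + A_bias
```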
3.1.3 Edge Encoding in the Attention
In many graph tasks, edges also have structural features, e.g., in a molecular graph, atom pairs may
have features describing the type of bond between them. Such features are important to the graph
representation, and encoding them together with node features into the network is essential. There are
mainly two edge encoding methods used in previous works. In the first method, the edge features are
added to the associated nodes’ features [22, 30]. In the second method, for each node, its associated
edges’ features will be used together with the node features in the aggregation [15, 50, 26]. However, such ways of using edge features only propagate the edge information to the associated nodes, which may not be an effective way to leverage edge information in the representation of the whole graph.
To better encode edge features into attention layers, we propose a new edge encoding method in
Graphormer. The attention mechanism needs to estimate correlations for each node pair (vi , vj ), and
we believe the edges connecting them should be considered in the correlation as in [34, 48]. For each
ordered node pair (vi , vj ), we find (one of) the shortest path SPij = (e1 , e2 , ..., eN ) from vi to vj ,
and compute an average of the dot-products of the edge feature and a learnable embedding along the
path. The proposed edge encoding incorporates edge features via a bias term to the attention module.
Concretely, we modify the $(i, j)$-element of $A$ in Eq. (3) further with the edge encoding $c_{ij}$ as
$$A_{ij} = \frac{(h_i W_Q)(h_j W_K)^\top}{\sqrt{d}} + b_{\phi(v_i, v_j)} + c_{ij}, \quad \text{where } c_{ij} = \frac{1}{N} \sum_{n=1}^{N} x_{e_n} (w_n^E)^\top, \qquad (7)$$
where $x_{e_n}$ is the feature of the $n$-th edge $e_n$ in $SP_{ij}$, $w_n^E \in \mathbb{R}^{d_E}$ is the $n$-th weight embedding, and $d_E$ is the dimensionality of the edge features.
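A minimal sketch of the edge-encoding term $c_{ij}$ in Eq. (7), assuming the edge features along the shortest path $SP_{ij}$ have already been gathered (e.g., by a BFS performed elsewhere); the module name and sizes are illustrative.

```python
import torch
import torch.nn as nn

class EdgeEncoding(nn.Module):
    """c_ij = (1/N) * sum_n x_{e_n} . w_n^E, averaged along the shortest path SP_ij (Eq. 7)."""
    def __init__(self, max_path_len: int, d_e: int):
        super().__init__()
        # one weight embedding w_n^E per position n along the path
        self.w = nn.Parameter(torch.randn(max_path_len, d_e))

    def forward(self, path_edge_feats):
        # path_edge_feats: [N, d_e] features of the N edges on SP_ij
        N = path_edge_feats.shape[0]
        return (path_edge_feats * self.w[:N]).sum(-1).mean()   # scalar bias c_ij

edge_enc = EdgeEncoding(max_path_len=10, d_e=16)
x_path = torch.randn(3, 16)      # a shortest path with 3 edges
c_ij = edge_enc(x_path)          # added to A_ij as an attention bias
```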
Graphormer Layer. Graphormer is built upon the original implementation of the classic Transformer encoder described in [46]. In addition, we apply layer normalization (LN) before the multi-head self-attention (MHA) and the feed-forward blocks (FFN) instead of after [49]. This pre-LN modification has been widely adopted by current Transformer implementations because it leads to more effective optimization [40]. In particular, for the FFN sub-layer, we set the dimensionality of the input, output, and inner layer to the same dimension $d$. We formally characterize the Graphormer layer as follows:
$$h'^{(l)} = \mathrm{MHA}(\mathrm{LN}(h^{(l-1)})) + h^{(l-1)}, \qquad (8)$$
$$h^{(l)} = \mathrm{FFN}(\mathrm{LN}(h'^{(l)})) + h'^{(l)}. \qquad (9)$$
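A compact sketch of the pre-LN layer in Eqs. (8)-(9), built from standard PyTorch modules; the structural encodings would enter through the additive attention bias, and the hyper-parameters shown are illustrative rather than those of the released model.

```python
import torch
import torch.nn as nn

class GraphormerLayer(nn.Module):
    """Pre-LN Transformer block: h' = MHA(LN(h)) + h; h = FFN(LN(h')) + h'."""
    def __init__(self, d: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        # FFN input, inner, and output dimensionalities all equal d, as stated in the text
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, h, attn_bias=None):
        # h: [batch, n, d]; attn_bias: optional [batch * n_heads, n, n] additive attention bias
        # carrying the spatial and edge encodings
        x = self.ln1(h)
        h = self.mha(x, x, x, attn_mask=attn_bias, need_weights=False)[0] + h  # Eq. (8)
        h = self.ffn(self.ln2(h)) + h                                          # Eq. (9)
        return h

layer = GraphormerLayer(d=64, n_heads=4)
h = torch.randn(2, 10, 64)
out = layer(h)
```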
Special Node. As stated in the previous section, various graph pooling functions have been proposed to produce the graph embedding. Inspired by [15], in Graphormer we add a special node called [VNode] to the graph and connect it to each node individually. In the AGGREGATE-COMBINE step, the representation of [VNode] is updated in the same way as ordinary nodes in the graph, and the representation of the entire graph $h_G$ is the node feature of [VNode] at the final layer. This is analogous to the [CLS] token in BERT-style models [11, 35], a special token attached at the beginning of each sequence to represent the sequence-level feature on downstream tasks. Although [VNode] is connected to all other nodes in the graph, so that the shortest-path distance is 1 for any $\phi([VNode], v_j)$ and $\phi(v_i, [VNode])$, the connection is not a physical one. To distinguish physical from virtual connections, inspired by [25], we reset all spatial encodings $b_{\phi([VNode], v_j)}$ and $b_{\phi(v_i, [VNode])}$ to a distinct learnable scalar.
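A minimal sketch of the [VNode] augmentation described above: one extra node with a learnable feature is prepended, and its node pairs receive a reserved spatial index so that $b_{\phi([VNode], v_j)}$ becomes a distinct learnable scalar. The reserved index value and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def add_virtual_node(h, spd, vnode_feat, vnode_dist_index):
    """Append [VNode] to the node features and assign it a distinct spatial-bias index.

    h:     [n, d] node features
    spd:   [n, n] integer shortest-path distances used to index b_phi
    vnode_feat: [d] learnable feature vector of [VNode]
    vnode_dist_index: reserved index so that b for [VNode] pairs is a distinct learnable scalar
    """
    n = h.shape[0]
    h = torch.cat([vnode_feat.unsqueeze(0), h], dim=0)           # [n + 1, d]
    spd_aug = torch.full((n + 1, n + 1), vnode_dist_index, dtype=spd.dtype)
    spd_aug[1:, 1:] = spd                                         # keep physical distances
    spd_aug[0, 0] = 0                                             # [VNode] to itself
    return h, spd_aug

h = torch.randn(5, 32)
spd = torch.randint(0, 6, (5, 5))
vnode_feat = nn.Parameter(torch.zeros(32))
h_aug, spd_aug = add_virtual_node(h, spd, vnode_feat, vnode_dist_index=21)
```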
3.3 How Powerful is Graphormer?
In the previous subsections, we introduced three structural encodings and the architecture of Graphormer. A natural question is: do these modifications make Graphormer more powerful than other GNN variants? In this subsection, we first give an affirmative answer by showing that Graphormer can represent the AGGREGATE and COMBINE steps of popular GNN models:
Fact 1. By choosing proper weights and distance function ϕ, the Graphormer layer can represent
AGGREGATE and COMBINE steps of popular GNN models such as GIN, GCN, GraphSAGE.
The proof sketch for this result is as follows: 1) the spatial encoding enables the self-attention module to distinguish the neighbor set $\mathcal{N}(v_i)$ of node $v_i$, so that the softmax function can calculate mean statistics over $\mathcal{N}(v_i)$; 2) knowing the degree of a node, the mean over neighbors can be translated into the sum over neighbors; 3) with multiple heads and the FFN, the representations of $v_i$ and $\mathcal{N}(v_i)$ can be processed separately and combined later. We defer the proof of this fact to Appendix A.
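As an illustration of step 1) (a sketch under simplifying assumptions, not the proof in Appendix A): if the queries and keys are zeroed out and the spatial bias suppresses non-neighbors, each softmax row reduces to a uniform average over $\mathcal{N}(v_i)$. Setting $W_Q = W_K = 0$ and
$$b_{\phi(v_i, v_j)} = \begin{cases} 0, & v_j \in \mathcal{N}(v_i), \\ -\infty, & \text{otherwise}, \end{cases} \quad\Longrightarrow\quad \mathrm{softmax}(A)_{ij} = \frac{\mathbb{1}[v_j \in \mathcal{N}(v_i)]}{|\mathcal{N}(v_i)|},$$
so the attention output at $v_i$, $\sum_j \mathrm{softmax}(A)_{ij}\, h_j W_V$, equals the MEAN aggregation over $\mathcal{N}(v_i)$ up to the linear map $W_V$.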
Moreover, we further show that by using our spatial encoding, Graphormer can go beyond classic message-passing GNNs, whose expressive power is no more than the 1-Weisfeiler-Lehman (WL) test. We give a concrete example in Appendix A to show how Graphormer helps distinguish graphs that the 1-WL test fails to distinguish.
Connection between Self-attention and Virtual Node. Besides its superior expressiveness over popular GNNs, we also find an interesting connection between using self-attention and the virtual node heuristic [15, 31, 24, 22]. As shown on the OGB leaderboard [22], the virtual node trick,
which augments graphs with additional supernodes that are connected to all nodes in the original
graphs, can significantly improve the performance of existing GNNs. Conceptually, the benefit of
the virtual node is that it can aggregate the information of the whole graph (like the READOUT
function) and then propagate it to each node. However, a naive addition of a supernode to a graph
can potentially lead to inadvertent over-smoothing of information propagation [24]. We instead find
that such a graph-level aggregation and propagation operation can be naturally fulfilled by vanilla
self-attention without additional encodings. Concretely, we can prove the following fact:
Fact 2. By choosing proper weights, every node representation of the output of a Graphormer layer
without additional encodings can represent MEAN READOUT functions.
This fact takes advantage of the property of self-attention that each node can attend to all other nodes; thus, self-attention can simulate a graph-level READOUT operation that aggregates information from the whole graph. Besides this theoretical justification, we empirically find that Graphormer does not encounter the problem of over-smoothing, which makes the improvement scalable. This fact also inspires us to introduce a special node for graph readout (see the previous subsection).
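For intuition, a short calculation behind Fact 2 (a sketch under the single-head setting of Section 2, not the full proof): choosing $W_Q = 0$ makes the attention logits constant, so every row of the softmax is uniform and each node's output is the mean of the value vectors,
$$W_Q = 0 \;\Rightarrow\; A_{ij} = 0 \text{ for all } i, j \;\Rightarrow\; \mathrm{softmax}(A)_{ij} = \frac{1}{n} \;\Rightarrow\; \big(\mathrm{softmax}(A)\,V\big)_i = \frac{1}{n} \sum_{j=1}^{n} h_j W_V,$$
which is a MEAN READOUT of the node features (up to the linear map $W_V$), available at every node.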
4 Experiments
We first conduct experiments on the recent OGB-LSC [21] quantum chemistry regression challenge (i.e., PCQM4M-LSC), which is currently the largest graph-level prediction dataset and contains more than 3.8M graphs in total. Then, we report results on three other popular tasks: ogbg-molhiv, ogbg-molpcba, and ZINC, which come from the OGB [22] and benchmarking-GNN [14] leaderboards. Finally, we ablate the important design elements of Graphormer. A detailed description of the datasets and training strategies can be found in Appendix B.
Baselines. We benchmark the proposed Graphormer against GCN [26] and GIN [50], and their variants with a virtual node (-VN) [15], which achieve the state-of-the-art validation and test mean absolute error (MAE) on the official leaderboard [21] (https://github.com/snap-stanford/ogb/tree/master/examples/lsc/pcqm4m#performance). In addition, we compare to GIN's multi-hop variant [5] and the 12-layer deep graph network DeeperGCN [30], which also show promising performance on other leaderboards. We further compare our Graphormer with the recent Transformer-based graph model GT [13].
Settings. We primarily report results for two model sizes: Graphormer (L = 12, d = 768) and a smaller GraphormerSMALL (L = 6, d = 512). Both the number of attention heads in the attention module and the dimensionality of edge features d_E are set to 32. We use AdamW as the optimizer, with the hyper-parameter ε set to 1e-8 and (β1, β2) set to (0.99, 0.999). The peak learning rate is set to 2e-4 (3e-4 for GraphormerSMALL) with a 60k-step warm-up stage, followed by a linear-decay learning rate scheduler. The total number of training steps is 1M and the batch size is 1024. All models are trained on 8 NVIDIA V100 GPUs for about 2 days.
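For reproducibility, a hedged sketch of the optimizer and learning-rate schedule described above (AdamW with ε = 1e-8, (β1, β2) = (0.99, 0.999), a 60k-step linear warm-up to the 2e-4 peak, and linear decay over 1M steps), assuming PyTorch; the model is a placeholder, not the 12-layer, d = 768 Graphormer.

```python
import torch

model = torch.nn.Linear(64, 1)   # placeholder for the actual Graphormer model

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4,
                              betas=(0.99, 0.999), eps=1e-8)

warmup_steps, total_steps = 60_000, 1_000_000

def lr_lambda(step):
    # linear warm-up to the peak learning rate, then linear decay to zero
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# training loop skeleton: call optimizer.step() and scheduler.step() once per batch of size 1024
```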
Table 1: Results on PCQM4M-LSC. * indicates the results are cited from the official leaderboard [21].
Results. Table 1 summarizes the performance comparison on the PCQM4M-LSC dataset. From the table, GIN-VN achieves the previous state-of-the-art validation MAE of 0.1395. The original implementation of GT [13] employs a hidden dimension of 64 to reduce the total number of parameters. For a fair comparison, we also report the result obtained by enlarging the hidden dimension to 768, denoted by GT-Wide, which leads to a total of 83.2M parameters. However, both GT and GT-Wide do not outperform GIN-VN and DeeperGCN-VN. In particular, we do not observe a performance gain of GT along with the growth of parameters.
Compared to the previous state-of-the-art GNN architecture, Graphormer surpasses GIN-VN by a large margin, e.g., an 11.5% relative decline in validation MAE. By using an ensemble with ExpC [51], we obtained an MAE of 0.1200 on the complete test set and won first place in the graph-level track of the OGB Large-Scale Challenge [21, 53]. As stated in Section 3.3, we further find that the proposed Graphormer does not encounter the problem of over-smoothing, i.e., the training and validation errors keep going down as the depth and width of the models grow.
We further investigate the performance of Graphormer on commonly used graph-level prediction tasks from popular leaderboards, i.e., OGB [22] (OGBG-MolPCBA, OGBG-MolHIV) and benchmarking-GNN [14] (ZINC). Since pre-training is encouraged by OGB, we mainly explore the transfer capability of a Graphormer model pre-trained on OGB-LSC (i.e., PCQM4M-LSC). Please note that the model configurations, hyper-parameters, and pre-training performance of the pre-trained Graphormers used for MolPCBA and MolHIV are different from those used in the previous subsection; please refer to Appendix B for detailed descriptions. For benchmarking-GNN, which does not encourage large pre-trained models, we train an additional GraphormerSLIM (L = 12, d = 80, 489K parameters in total) from scratch on ZINC.
Baselines. We report the performance of GNNs that achieve top performance on the official leaderboards (https://ogb.stanford.edu/docs/leader_graphprop/ and https://github.com/graphdeeplearning/benchmarking-gnns/blob/master/docs/07_leaderboards.md) without additional domain-specific features. Considering that the pre-trained Graphormer leverages external data, for a fair comparison on the OGB datasets we additionally report the performance of fine-tuning a GIN-VN pre-trained on the PCQM4M-LSC dataset, which achieves the previous state-of-the-art validation and test MAE on that dataset.
Results. Tables 2, 3, and 4 summarize the performance of Graphormer compared with other GNNs on the MolHIV, MolPCBA, and ZINC datasets. In particular, GT [13] and SAN [28] in Table 4 are recently proposed Transformer-based GNN models. Graphormer consistently and significantly outperforms previous state-of-the-art GNNs on all three datasets by a large margin.
Table 2: Results on MolPCBA.
method                       #param.   AP (%)
DeeperGCN-VN+FLAG [30]       5.6M      28.42±0.43
DGN [2]                      6.7M      28.85±0.30
GINE-VN [5]                  6.1M      29.17±0.15
PHC-GNN [29]                 1.7M      29.47±0.26
GINE-APPNP [5]               6.1M      29.79±0.30
GIN-VN [50] (fine-tune)      3.4M      29.02±0.17
Graphormer-FLAG              119.5M    31.39±0.32

Table 3: Results on MolHIV.
method                       #param.   AUC (%)
GCN-GraphNorm [5, 8]         526K      78.83±1.00
PNA [10]                     326K      79.05±1.32
PHC-GNN [29]                 111K      79.34±1.16
DeeperGCN-FLAG [30]          532K      79.42±1.20
DGN [2]                      114K      79.70±0.97
GIN-VN [50] (fine-tune)      3.3M      77.80±1.82
Graphormer-FLAG              47.0M     80.51±0.53
Notably, except for Graphormer, the other pre-trained GNNs do not achieve competitive performance, which is in line with previous literature [20]. In addition, we conduct more comparisons with fine-tuning the pre-trained GNNs; please refer to Appendix C.
We perform a series of ablation studies on the importance of the proposed designs in Graphormer on the PCQM4M-LSC dataset. The ablation results are included in Table 5. To save computation resources, the Transformer models in Table 5 have 12 layers and are trained for 100K iterations.
Node Relation Encoding. We compare the previously used positional encodings (PE) to our proposed spatial encoding, both of which aim to encode the information of distinct node relations into the Transformer. Various PEs have been employed by previous Transformer-based GNNs, e.g., Weisfeiler-Lehman PE (WL-PE) [56] and Laplacian PE [3, 14]. We report the performance of Laplacian PE since it performs well compared to a series of PEs for the Graph Transformer in previous literature [13]. The Transformer architecture with spatial encoding outperforms the counterpart built on positional encoding, which demonstrates the effectiveness of spatial encoding in capturing the spatial information of nodes.
Edge Encoding. We compare our proposed edge encoding (denoted as via attn bias) to the two commonly used edge encodings described in Section 3.1.3, denoted as via node and via Aggr in Table 5. From the table, the performance gap between the two conventional methods is minor, but our proposed edge encoding performs significantly better, which indicates that edge encoding as an attention bias is more effective for the Transformer to capture spatial information on edges.
Table 5: Ablation study results on PCQM4M-LSC dataset with different designs.
5 Related Work
In this section, we highlight the most recent works that attempt to develop GNNs based on the standard Transformer architecture or graph structural encodings, and spend less effort on elaborating the works that adapt attention mechanisms to GNNs [33, 55, 7, 23, 1, 47, 48, 56, 45].
There are several works that study the performance of pure Transformer architectures (stacks of Transformer layers) with modifications on graph representation tasks, which are more closely related to our Graphormer. For example, several parts of the Transformer layer are modified in [43], including an additional GNN employed in the attention sub-layer to produce the vectors of Q, K, and V, long-range residual connections, and two branches of FFN to produce node and edge representations separately. They pre-train their model on 10 million unlabelled molecules and achieve excellent results by fine-tuning on downstream tasks. The attention module is modified into a soft adjacency matrix in [39] by directly adding the adjacency matrix and an RDKit-computed (https://www.rdkit.org/) inter-atomic distance matrix to the attention probabilities. Very recently, Dwivedi et al. [13] revisit a series of works on Transformer-based GNNs and suggest that the attention mechanism in Transformers on graph data should only aggregate information from the neighborhood (i.e., using the adjacency matrix as the attention mask) to ensure graph sparsity, and propose to use Laplacian eigenvectors as positional encodings. Their model GT surpasses baseline GNNs on graph representation tasks. A concurrent work [28] proposes to use the full Laplacian spectrum to learn the position of each node in a graph, and empirically shows better results than GT.
Path and Distance in GNNs. Information about paths and distances is commonly used in GNNs. For example, an attention-based aggregation is proposed in [9], where the node features, edge features, one-hot distance features, and ring flag features are concatenated to calculate the attention probabilities; similar to [9], path-based attention is leveraged in [52] to model the influence between the center node and its higher-order neighbors; a distance-weighted aggregation scheme on graphs is proposed in [54]; and it has been proved in [32] that adopting distance encoding (i.e., a one-hot feature of the distance as an extra node attribute) leads to strictly more expressive power than the 1-WL test.
Positional Encoding in Transformers on Graphs. Several works introduce positional encodings (PE) to Transformer-based GNNs to help the model capture node position information. For example, Graph-BERT [56] introduces three types of PE to embed node position information into the model: an absolute WL-PE which represents different nodes labeled by the Weisfeiler-Lehman algorithm, and an intimacy-based PE and a hop-based PE which both vary with the sampled subgraphs. An absolute Laplacian PE is employed in [13], and the empirical study shows that its performance surpasses the absolute WL-PE used in [56].
Edge Feature. Apart from the conventional methods of encoding edge features described in the previous section, there are several attempts to explore how to better encode edge features: an attention-based GNN layer is developed in [16] to encode edge features, where the edge feature is weighted by the similarity of the features of its two end nodes; edge features are encoded into the popular GIN [50] in [5]; and in [13], the authors propose to project edge features to an embedding vector, multiply it by the attention coefficients, and send the result to an additional FFN sub-layer to produce edge representations.
6 Conclusion
We have explored the direct application of Transformers to graph representation. With three novel
graph structural encodings, the proposed Graphormer works surprisingly well on a wide range of
popular benchmark datasets. While these initial results are encouraging, many challenges remain. For example, the quadratic complexity of the self-attention module restricts Graphormer's application to large graphs, so the future development of an efficient Graphormer is necessary. Performance improvements could also be expected by leveraging domain-knowledge-powered encodings on particular graph datasets. Finally, an applicable graph sampling strategy is desired for node representation extraction with Graphormer. We leave these for future work.
References
[1] Jinheon Baek, Minki Kang, and Sung Ju Hwang. Accurate learning of graph representations with graph
multiset pooling. ICLR, 2021.
[2] Dominique Beaini, Saro Passaro, Vincent Létourneau, William L Hamilton, Gabriele Corso, and Pietro
Liò. Directional graph networks. In International Conference on Machine Learning, 2021.
[3] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representa-
tion. Neural computation, 15(6):1373–1396, 2003.
[4] Xavier Bresson and Thomas Laurent. Residual gated graph convnets. arXiv preprint arXiv:1711.07553,
2017.
[5] Rémy Brossard, Oriel Frigo, and David Dehaene. Graph convolutions that can finally model local structure.
arXiv preprint arXiv:2011.15069, 2020.
[6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens
Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark,
Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models
are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors,
Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates,
Inc., 2020.
[7] Deng Cai and Wai Lam. Graph transformer for graph-to-sequence learning. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 34, pages 7464–7471, 2020.
[8] Tianle Cai, Shengjie Luo, Keyulu Xu, Di He, Tie-yan Liu, and Liwei Wang. Graphnorm: A principled
approach to accelerating graph neural network training. In International Conference on Machine Learning,
2021.
[9] Benson Chen, Regina Barzilay, and Tommi Jaakkola. Path-augmented graph transformer network. arXiv
preprint arXiv:1905.12712, 2019.
[10] Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, and Petar Veličković. Principal neighbour-
hood aggregation for graph nets. Advances in Neural Information Processing Systems, 33, 2020.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidi-
rectional transformers for language understanding. In Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth
16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[13] Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs. AAAI
Workshop on Deep Learning on Graphs: Methods and Applications, 2021.
[14] Vijay Prakash Dwivedi, Chaitanya K Joshi, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Bench-
marking graph neural networks. arXiv preprint arXiv:2003.00982, 2020.
[15] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message
passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272.
PMLR, 2017.
[16] Liyu Gong and Qiang Cheng. Exploiting edge features for graph neural networks. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9211–9219, 2019.
[17] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang,
Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer for speech
recognition. arXiv preprint arXiv:2005.08100, 2020.
[18] William L Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs.
In NIPS, 2017.
[19] Vincent J Hellendoorn, Charles Sutton, Rishabh Singh, Petros Maniatis, and David Bieber. Global relational
models of source code. In International conference on learning representations, 2019.
[20] W Hu, B Liu, J Gomes, M Zitnik, P Liang, V Pande, and J Leskovec. Strategies for pre-training graph
neural networks. In International Conference on Learning Representations (ICLR), 2020.
[21] Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. Ogb-lsc: A
large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430, 2021.
[22] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta,
and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint
arXiv:2005.00687, 2020.
[23] Ziniu Hu, Yuxiao Dong, Kuansan Wang, and Yizhou Sun. Heterogeneous graph transformer. In Proceedings
of The Web Conference 2020, pages 2704–2710, 2020.
[24] Katsuhiko Ishiguro, Shin-ichi Maeda, and Masanori Koyama. Graph warp module: an auxiliary module for
boosting the power of graph neural networks in molecular graph analysis. arXiv preprint arXiv:1902.01020,
2019.
[25] Guolin Ke, Di He, and Tie-Yan Liu. Rethinking the positional encoding in language pre-training. ICLR,
2020.
[26] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks.
arXiv preprint arXiv:1609.02907, 2016.
[27] Kezhi Kong, Guohao Li, Mucong Ding, Zuxuan Wu, Chen Zhu, Bernard Ghanem, Gavin Taylor,
and Tom Goldstein. Flag: Adversarial data augmentation for graph neural networks. arXiv preprint
arXiv:2010.09891, 2020.
[28] Devin Kreuzer, Dominique Beaini, William Hamilton, Vincent Létourneau, and Prudencio Tossou. Re-
thinking graph transformers with spectral attention. arXiv preprint arXiv:2106.03893, 2021.
[29] Tuan Le, Marco Bertolini, Frank Noé, and Djork-Arné Clevert. Parameterized hypercomplex graph neural
networks for graph classification. arXiv preprint arXiv:2103.16584, 2021.
[30] Guohao Li, Chenxin Xiong, Ali Thabet, and Bernard Ghanem. Deepergcn: All you need to train deeper
gcns. arXiv preprint arXiv:2006.07739, 2020.
[31] Junying Li, Deng Cai, and Xiaofei He. Learning graph-level representation for drug discovery. arXiv
preprint arXiv:1709.03741, 2017.
[32] Pan Li, Yanbang Wang, Hongwei Wang, and Jure Leskovec. Distance encoding: Design provably more
powerful neural networks for graph representation learning. Advances in Neural Information Processing
Systems, 33, 2020.
[33] Yuan Li, Xiaodan Liang, Zhiting Hu, Yinbo Chen, and Eric P. Xing. Graph transformer, 2019.
[34] Xi Victoria Lin, Richard Socher, and Caiming Xiong. Multi-hop knowledge graph reasoning with reward
shaping. arXiv preprint arXiv:1808.10568, 2018.
[35] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv
preprint arXiv:1907.11692, 2019.
[36] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin
transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030,
2021.
[39] Łukasz Maziarka, Tomasz Danel, Sławomir Mucha, Krzysztof Rataj, Jacek Tabor, and Stanisław Jastrzębski.
Molecule attention transformer. arXiv preprint arXiv:2002.08264, 2020.
[40] Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma
Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, et al. Do transformer modifications transfer across
implementations and applications? arXiv preprint arXiv:2102.11972, 2021.
[41] Dinglan Peng, Shuxin Zheng, Yatao Li, Guolin Ke, Di He, and Tie-Yan Liu. How could neural networks
understand programs? In International Conference on Machine Learning. PMLR, 2021.
[42] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.
Journal of Machine Learning Research, 21(140):1–67, 2020.
[43] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-
supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing
Systems, 33, 2020.
[44] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, 2018.
[45] Yunsheng Shi, Zhengjie Huang, Wenjin Wang, Hui Zhong, Shikun Feng, and Yu Sun. Masked label predic-
tion: Unified message passing model for semi-supervised classification. arXiv preprint arXiv:2009.03509,
2020.
[46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[47] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio.
Graph attention networks. ICLR, 2018.
[48] Guangtao Wang, Rex Ying, Jing Huang, and Jure Leskovec. Direct multi-hop attention based graph neural
network. arXiv preprint arXiv:2009.14332, 2020.
[49] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan
Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In International
Conference on Machine Learning, pages 10524–10533. PMLR, 2020.
[50] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks?
In International Conference on Learning Representations, 2019.
[51] Mingqi Yang, Yanming Shen, Heng Qi, and Baocai Yin. Breaking the expressive bottlenecks of graph
neural networks. arXiv preprint arXiv:2012.07219, 2020.
[52] Yiding Yang, Xinchao Wang, Mingli Song, Junsong Yuan, and Dacheng Tao. Spagan: Shortest path graph
attention network. Advances in IJCAI, 2019.
[53] Chengxuan Ying, Mingqi Yang, Shuxin Zheng, Guolin Ke, Shengjie Luo, Tianle Cai, Chenglin Wu, Yuxin
Wang, Yanming Shen, and Di He. First place solution of kdd cup 2021 & ogb large-scale challenge
graph-level track. arXiv preprint arXiv:2106.08279, 2021.
[54] Jiaxuan You, Rex Ying, and Jure Leskovec. Position-aware graph neural networks. In International
Conference on Machine Learning, pages 7134–7143. PMLR, 2019.
[55] Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J Kim. Graph transformer
networks. Advances in Neural Information Processing Systems, 32, 2019.
[56] Jiawei Zhang, Haopeng Zhang, Congying Xia, and Li Sun. Graph-bert: Only attention is needed for
learning graph representations. arXiv preprint arXiv:2001.05140, 2020.
[57] Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, and Stephan Günnemann. Language-
agnostic representation learning of source code from structure and context. In International Conference on
Learning Representations, 2020.