Graphormer 2021 NeurIPS
Abstract
The Transformer architecture has become a dominant choice in many domains, such
as natural language processing and computer vision. Yet, it has not achieved com-
petitive performance on popular leaderboards of graph-level prediction compared
to mainstream GNN variants. Therefore, it remains a mystery how Transformers
could perform well for graph representation learning. In this paper, we solve this
mystery by presenting Graphormer, which is built upon the standard Transformer
architecture and attains excellent results on a broad range of graph representation learning tasks, especially on the recent OGB Large-Scale Challenge. Our key insight for utilizing the Transformer on graphs is the necessity of effectively encoding the structural information of a graph into the model. To this end, we propose several simple yet effective structural encoding methods that help Graphormer better model graph-structured data. Besides, we mathematically characterize the expressive power of Graphormer and show that, with our ways of encoding the structural information of graphs, many popular GNN variants can be covered as special cases of Graphormer. The code and models of Graphormer will be made publicly available at https://github.com/Microsoft/Graphormer.
1 Introduction
The Transformer [46] is well acknowledged as the most powerful neural network for modelling sequential data, such as natural language [11, 35, 6] and speech [17]. Model variants built upon the Transformer have also shown great performance in computer vision [12, 36] and programming languages [19, 57, 41]. However, to the best of our knowledge, the Transformer has still not become the de-facto standard on public graph representation leaderboards [22, 14, 21]. There have been many attempts to bring the Transformer into the graph domain, but the only effective way so far has been to replace some key modules (e.g., feature aggregation) in classic GNN variants with the softmax attention [47, 7, 23, 48, 56, 43, 13]. Therefore, it is still an open question whether the Transformer architecture is suitable for modelling graphs and how to make it work in graph representation learning.
In this paper, we give an affirmative answer by developing Graphormer, which is directly built upon the standard Transformer and achieves state-of-the-art performance on a wide range of graph-level prediction tasks, including the very recent Open Graph Benchmark Large-Scale Challenge (OGB-LSC) [21] and several popular leaderboards (e.g., OGB [22], Benchmarking-GNN [14]). The Transformer is originally designed for sequence modeling. To utilize its power on graphs, we believe the key is to effectively encode the structural information of a graph into the model.
2 Preliminary
In this section, we recap the preliminaries in Graph Neural Networks and Transformer.
Graph Neural Network (GNN). Let $G = (V, E)$ denote a graph, where $V = \{v_1, v_2, \cdots, v_n\}$ and $n = |V|$ is the number of nodes. Let the feature vector of node $v_i$ be $x_i$. GNNs aim to learn representations of nodes and graphs. Typically, modern GNNs follow a learning schema that iteratively updates the representation of a node by aggregating the representations of its first- or higher-order neighbors. We denote $h_i^{(l)}$ as the representation of $v_i$ at the $l$-th layer and define $h_i^{(0)} = x_i$. The $l$-th iteration of aggregation can be characterized by an AGGREGATE-COMBINE step as
$$a_i^{(l)} = \text{AGGREGATE}^{(l)}\left(\left\{ h_j^{(l-1)} : j \in \mathcal{N}(v_i) \right\}\right), \qquad h_i^{(l)} = \text{COMBINE}^{(l)}\left(h_i^{(l-1)}, a_i^{(l)}\right), \qquad (1)$$
where $\mathcal{N}(v_i)$ is the set of first- or higher-order neighbors of $v_i$. The AGGREGATE function is used to gather the information from neighbors; common aggregation functions include MEAN, MAX, and SUM, which are used in different GNN architectures [26, 18, 47, 50]. The goal of the COMBINE function is to fuse the information from neighbors into the node representation.
Figure 1: An illustration of our proposed centrality encoding, spatial encoding, and edge encoding in Graphormer.
In addition, for graph representation tasks, a READOUT function is designed to aggregate the node features $h_i^{(L)}$ of the final iteration into the representation $h_G$ of the entire graph $G$:
$$h_G = \text{READOUT}\left(\left\{ h_i^{(L)} \mid v_i \in G \right\}\right). \qquad (2)$$
READOUT can be implemented by a simple permutation-invariant function such as summation [50] or a more sophisticated graph-level pooling function [1].
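For illustration, here is a minimal PyTorch-style sketch of the AGGREGATE-COMBINE step in Eq. (1) (with MEAN aggregation) and a sum READOUT as in Eq. (2). This is a toy sketch for exposition only; the function names, weight matrices, and the example graph are illustrative and not taken from any particular GNN implementation.

```python
import torch

def gnn_layer(h, adj, W_self, W_neigh):
    """One AGGREGATE-COMBINE step (Eq. 1) with MEAN aggregation.

    h:   [n, d] node representations h^{(l-1)}
    adj: [n, n] binary adjacency matrix (adj[i, j] = 1 if v_j is a neighbor of v_i)
    """
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)   # node degrees (avoid division by zero)
    a = adj @ h / deg                                  # AGGREGATE: mean over neighbors
    return torch.relu(h @ W_self + a @ W_neigh)        # COMBINE: fuse self and neighbor info

def readout(h):
    """Permutation-invariant READOUT (Eq. 2): sum-pool node features into h_G."""
    return h.sum(dim=0)

# toy usage: a 4-node path graph with 8-dimensional features
n, d = 4, 8
h = torch.randn(n, d)
adj = torch.tensor([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.float)
W_self, W_neigh = torch.randn(d, d), torch.randn(d, d)
h = gnn_layer(h, adj, W_self, W_neigh)
h_G = readout(h)   # graph-level representation
```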
Transformer. Each Transformer layer consists of two parts: a self-attention module and a position-wise feed-forward network (FFN). Let $H = [h_1^\top, \cdots, h_n^\top]^\top \in \mathbb{R}^{n \times d}$ denote the input of the self-attention module, where $d$ is the hidden dimension and $h_i \in \mathbb{R}^{1 \times d}$ is the hidden representation at position $i$. The input $H$ is projected by three matrices $W_Q \in \mathbb{R}^{d \times d_K}$, $W_K \in \mathbb{R}^{d \times d_K}$, and $W_V \in \mathbb{R}^{d \times d_V}$ to the corresponding representations $Q$, $K$, $V$, and the self-attention is then calculated as
$$Q = HW_Q, \quad K = HW_K, \quad V = HW_V, \quad A = \frac{QK^\top}{\sqrt{d_K}}, \qquad (3)$$
$$\text{Attn}(H) = \mathrm{softmax}(A)\,V, \qquad (4)$$
where $A$ is a matrix capturing the similarity between queries and keys. For simplicity of illustration, we consider single-head self-attention and assume $d_K = d_V = d$. The extension to multi-head attention is standard and straightforward, and we omit bias terms for simplicity.
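For concreteness, a minimal sketch of the single-head self-attention of Eqs. (3)-(4) follows; the tensor shapes and names are illustrative, and bias terms are omitted as in the text.

```python
import torch

def self_attention(H, W_Q, W_K, W_V):
    """Single-head self-attention: A = QK^T / sqrt(d), output = softmax(A) V."""
    d = W_K.shape[1]
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V
    A = Q @ K.T / d ** 0.5              # [n, n] query-key similarity matrix
    return torch.softmax(A, dim=-1) @ V

n, d = 5, 16
H = torch.randn(n, d)
W_Q, W_K, W_V = (torch.randn(d, d) for _ in range(3))
out = self_attention(H, W_Q, W_K, W_V)  # [n, d] updated node representations
```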
3 Graphormer
In this section, we present our Graphormer for graph tasks. First, we elaborate on several key
designs in the Graphormer, which serve as an inductive bias in the neural network to learn the graph
representation. We then provide the detailed implementation of Graphormer. Finally, we show that our proposed Graphormer is more powerful in the sense that popular GNN models [26, 50, 18] are its special cases.
3.1 Structural Encodings in Graphormer
As discussed in the introduction, it is important to develop ways of incorporating the structural information of graphs into the Transformer model. To this end, we present three simple but effective designs of encoding in Graphormer. See Figure 1 for an illustration.
3.1.1 Centrality Encoding
In Graphormer, we use the degree centrality as an additional signal to the network: we add learnable in-degree and out-degree embeddings to the node features to form the input of the first layer,
$$h_i^{(0)} = x_i + z^-_{\deg^-(v_i)} + z^+_{\deg^+(v_i)}, \qquad (5)$$
where $z^-, z^+ \in \mathbb{R}^d$ are learnable embedding vectors specified by the in-degree $\deg^-(v_i)$ and out-degree $\deg^+(v_i)$, respectively. For undirected graphs, $\deg^-(v_i)$ and $\deg^+(v_i)$ can be unified to $\deg(v_i)$. By using the centrality encoding in the input, the softmax attention can catch the node importance signal in the queries and the keys. Therefore, the model can capture both the semantic correlation and the node importance in the attention mechanism.
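A minimal sketch of how such a centrality encoding could be implemented, assuming degree-indexed embedding tables as implied by Eq. (5); the module name, the maximum degree, and the toy inputs are illustrative assumptions, not the official Graphormer code.

```python
import torch
import torch.nn as nn

class CentralityEncoding(nn.Module):
    """Add learnable in-/out-degree embeddings z^-, z^+ to the input node features (Eq. 5)."""
    def __init__(self, max_degree: int, d: int):
        super().__init__()
        self.z_in = nn.Embedding(max_degree + 1, d)    # z^- indexed by deg^-(v_i)
        self.z_out = nn.Embedding(max_degree + 1, d)   # z^+ indexed by deg^+(v_i)

    def forward(self, x, in_degree, out_degree):
        # x: [n, d] node features; in_degree, out_degree: [n] integer tensors
        return x + self.z_in(in_degree) + self.z_out(out_degree)

enc = CentralityEncoding(max_degree=64, d=32)
x = torch.randn(10, 32)
deg_in = torch.randint(0, 5, (10,))
deg_out = torch.randint(0, 5, (10,))
h0 = enc(x, deg_in, deg_out)   # h^{(0)} used as the Transformer input
```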
3.1.2 Spatial Encoding
To encode the structural relation between node pairs, we choose $\phi(v_i, v_j)$ to be the distance of the shortest path (SPD) between $v_i$ and $v_j$ if the two nodes are connected. Each (feasible) output value of $\phi$ is assigned a learnable scalar which serves as a bias term in the self-attention module. Concretely, we modify the $(i, j)$-element of $A$ in Eq. (3) as
$$A_{ij} = \frac{(h_i W_Q)(h_j W_K)^\top}{\sqrt{d}} + b_{\phi(v_i, v_j)}, \qquad (6)$$
where $b_{\phi(v_i, v_j)}$ is a learnable scalar indexed by $\phi(v_i, v_j)$ and shared across all layers.
Here we discuss several benefits of our proposed method. First, compared to conventional GNNs described in Section 2, where the receptive field is restricted to the neighbors, Eq. (6) shows that the Transformer layer provides global information: each node can attend to all other nodes in the graph. Second, by using $b_{\phi(v_i, v_j)}$, each node in a single Transformer layer can adaptively attend to all other nodes according to the graph structural information. For example, if $b_{\phi(v_i, v_j)}$ is learned to be a decreasing function with respect to $\phi(v_i, v_j)$, then for each node the model will likely pay more attention to nearby nodes and less attention to nodes far away from it.
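A minimal sketch of the spatial-encoding bias of Eq. (6): a learnable scalar indexed by the shortest-path distance $\phi(v_i, v_j)$ is added to every attention logit. The clipping of large distances and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatialEncoding(nn.Module):
    """Learnable scalar bias b_{phi(v_i, v_j)}, shared across layers, indexed by the SPD."""
    def __init__(self, max_dist: int):
        super().__init__()
        self.bias = nn.Embedding(max_dist + 1, 1)   # one learnable scalar per distance value

    def forward(self, spd):
        # spd: [n, n] integer shortest-path distances; clip distances beyond max_dist
        spd = spd.clamp(max=self.bias.num_embeddings - 1)
        return self.bias(spd).squeeze(-1)            # [n, n] bias added to the attention logits

spatial = SpatialEncoding(max_dist=20)
spd = torch.randint(0, 8, (5, 5))
A_bias = spatial(spd)   # used as: A = (Q @ K.T) / d**0.5 + A_bias
```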
3.1.3 Edge Encoding in the Attention
In many graph tasks, edges also have structural features, e.g., in a molecular graph, atom pairs may
have features describing the type of bond between them. Such features are important to the graph
representation, and encoding them together with node features into the network is essential. There are
mainly two edge encoding methods used in previous works. In the first method, the edge features are
added to the associated nodes’ features [22, 30]. In the second method, for each node, its associated
edges’ features will be used together with the node features in the aggregation [15, 50, 26]. However, such ways of using edge features only propagate the edge information to the associated nodes, which may not be an effective way to leverage edge information in the representation of the whole graph.
To better encode edge features into attention layers, we propose a new edge encoding method in
Graphormer. The attention mechanism needs to estimate correlations for each node pair (vi , vj ), and
we believe the edges connecting them should be considered in the correlation as in [34, 48]. For each
ordered node pair (vi , vj ), we find (one of) the shortest path SPij = (e1 , e2 , ..., eN ) from vi to vj ,
and compute an average of the dot-products of the edge feature and a learnable embedding along the
path. The proposed edge encoding incorporates edge features via a bias term to the attention module.
Concretely, we modify the $(i, j)$-element of $A$ in Eq. (3) further with the edge encoding $c_{ij}$ as
$$A_{ij} = \frac{(h_i W_Q)(h_j W_K)^\top}{\sqrt{d}} + b_{\phi(v_i, v_j)} + c_{ij}, \quad \text{where } c_{ij} = \frac{1}{N} \sum_{n=1}^{N} x_{e_n} (w_n^E)^\top, \qquad (7)$$
where $x_{e_n}$ is the feature of the $n$-th edge $e_n$ in $SP_{ij}$, $w_n^E \in \mathbb{R}^{d_E}$ is the $n$-th weight embedding, and $d_E$ is the dimensionality of the edge features.
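A minimal sketch of the edge-encoding term $c_{ij}$ in Eq. (7), assuming the edge features along the shortest path $SP_{ij}$ have already been gathered (e.g., by a BFS performed elsewhere); the module name and sizes are illustrative.

```python
import torch
import torch.nn as nn

class EdgeEncoding(nn.Module):
    """c_ij = (1/N) * sum_n x_{e_n} . w_n^E, averaged along the shortest path SP_ij (Eq. 7)."""
    def __init__(self, max_path_len: int, d_e: int):
        super().__init__()
        # one weight embedding w_n^E per position n along the path
        self.w = nn.Parameter(torch.randn(max_path_len, d_e))

    def forward(self, path_edge_feats):
        # path_edge_feats: [N, d_e] features of the N edges on SP_ij
        N = path_edge_feats.shape[0]
        return (path_edge_feats * self.w[:N]).sum(-1).mean()   # scalar bias c_ij

edge_enc = EdgeEncoding(max_path_len=10, d_e=16)
x_path = torch.randn(3, 16)      # a shortest path with 3 edges
c_ij = edge_enc(x_path)          # added to A_ij as an attention bias
```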
Graphormer Layer. Graphormer is built upon the original implementation of the classic Transformer encoder described in [46]. In addition, we apply layer normalization (LN) before the multi-head self-attention (MHA) and the feed-forward blocks (FFN) instead of after [49]. This pre-LN modification has been widely adopted by current Transformer implementations because it leads to more effective optimization [40]. In particular, for the FFN sub-layer, we set the dimensionality of the input, output, and inner layer to the same dimension $d$. We formally characterize the Graphormer layer as follows:
$$h'^{(l)} = \mathrm{MHA}(\mathrm{LN}(h^{(l-1)})) + h^{(l-1)}, \qquad (8)$$
$$h^{(l)} = \mathrm{FFN}(\mathrm{LN}(h'^{(l)})) + h'^{(l)}. \qquad (9)$$
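A compact sketch of the pre-LN layer in Eqs. (8)-(9), built from standard PyTorch modules; the structural encodings would enter through the additive attention bias, and the hyper-parameters shown are illustrative rather than those of the released model.

```python
import torch
import torch.nn as nn

class GraphormerLayer(nn.Module):
    """Pre-LN Transformer block: h' = MHA(LN(h)) + h; h = FFN(LN(h')) + h'."""
    def __init__(self, d: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        # FFN input, inner, and output dimensionalities all equal d, as stated in the text
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, h, attn_bias=None):
        # h: [batch, n, d]; attn_bias: optional [batch * n_heads, n, n] additive attention bias
        # carrying the spatial and edge encodings
        x = self.ln1(h)
        h = self.mha(x, x, x, attn_mask=attn_bias, need_weights=False)[0] + h  # Eq. (8)
        h = self.ffn(self.ln2(h)) + h                                          # Eq. (9)
        return h

layer = GraphormerLayer(d=64, n_heads=4)
h = torch.randn(2, 10, 64)
out = layer(h)
```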
Special Node. As stated in the previous section, various graph pooling functions have been proposed to produce the graph embedding. Inspired by [15], in Graphormer we add a special node called [VNode] to the graph and connect it to each node individually. In the AGGREGATE-COMBINE step, the representation of [VNode] is updated in the same way as ordinary nodes in the graph, and the representation of the entire graph $h_G$ is the node feature of [VNode] at the final layer. This is analogous to the [CLS] token in BERT-style models [11, 35], a special token attached at the beginning of each sequence to represent the sequence-level feature on downstream tasks. Although [VNode] is connected to all other nodes in the graph, so that the shortest-path distance is 1 for any $\phi([VNode], v_j)$ and $\phi(v_i, [VNode])$, the connection is not a physical one. To distinguish physical from virtual connections, inspired by [25], we reset all spatial encodings $b_{\phi([VNode], v_j)}$ and $b_{\phi(v_i, [VNode])}$ to a distinct learnable scalar.
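A minimal sketch of the [VNode] augmentation described above: one extra node with a learnable feature is prepended, and its node pairs receive a reserved spatial index so that $b_{\phi([VNode], v_j)}$ becomes a distinct learnable scalar. The reserved index value and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def add_virtual_node(h, spd, vnode_feat, vnode_dist_index):
    """Append [VNode] to the node features and assign it a distinct spatial-bias index.

    h:     [n, d] node features
    spd:   [n, n] integer shortest-path distances used to index b_phi
    vnode_feat: [d] learnable feature vector of [VNode]
    vnode_dist_index: reserved index so that b for [VNode] pairs is a distinct learnable scalar
    """
    n = h.shape[0]
    h = torch.cat([vnode_feat.unsqueeze(0), h], dim=0)           # [n + 1, d]
    spd_aug = torch.full((n + 1, n + 1), vnode_dist_index, dtype=spd.dtype)
    spd_aug[1:, 1:] = spd                                         # keep physical distances
    spd_aug[0, 0] = 0                                             # [VNode] to itself
    return h, spd_aug

h = torch.randn(5, 32)
spd = torch.randint(0, 6, (5, 5))
vnode_feat = nn.Parameter(torch.zeros(32))
h_aug, spd_aug = add_virtual_node(h, spd, vnode_feat, vnode_dist_index=21)
```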
3.3 How Powerful is Graphormer?
In the previous subsections, we introduced three structural encodings and the architecture of Graphormer. A natural question is: do these modifications make Graphormer more powerful than other GNN variants? In this subsection, we first give an affirmative answer by showing that Graphormer can represent the AGGREGATE and COMBINE steps of popular GNN models:
Fact 1. By choosing proper weights and distance function ϕ, the Graphormer layer can represent
AGGREGATE and COMBINE steps of popular GNN models such as GIN, GCN, GraphSAGE.
The proof sketch for this result is as follows: 1) the spatial encoding enables the self-attention module to distinguish the neighbor set $\mathcal{N}(v_i)$ of node $v_i$, so that the softmax function can calculate mean statistics over $\mathcal{N}(v_i)$; 2) knowing the degree of a node, the mean over neighbors can be translated into the sum over neighbors; 3) with multiple heads and the FFN, the representations of $v_i$ and $\mathcal{N}(v_i)$ can be processed separately and combined later. We defer the proof of this fact to Appendix A.
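As an illustration of step 1) (a sketch under simplifying assumptions, not the proof in Appendix A): if the queries and keys are zeroed out and the spatial bias suppresses non-neighbors, each softmax row reduces to a uniform average over $\mathcal{N}(v_i)$. Setting $W_Q = W_K = 0$ and
$$b_{\phi(v_i, v_j)} = \begin{cases} 0, & v_j \in \mathcal{N}(v_i), \\ -\infty, & \text{otherwise}, \end{cases} \quad\Longrightarrow\quad \mathrm{softmax}(A)_{ij} = \frac{\mathbb{1}[v_j \in \mathcal{N}(v_i)]}{|\mathcal{N}(v_i)|},$$
so the attention output at $v_i$, $\sum_j \mathrm{softmax}(A)_{ij}\, h_j W_V$, equals the MEAN aggregation over $\mathcal{N}(v_i)$ up to the linear map $W_V$.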
Moreover, we further show that by using our spatial encoding, Graphormer can go beyond classic message-passing GNNs, whose expressive power is no more than the 1-Weisfeiler-Lehman (WL) test. We give a concrete example in Appendix A to show how Graphormer helps distinguish graphs that the 1-WL test fails to distinguish.
Connection between Self-attention and Virtual Node. Besides its superior expressiveness over popular GNNs, we also find an interesting connection between using self-attention and the virtual node heuristic [15, 31, 24, 22]. As shown on the OGB leaderboard [22], the virtual node trick,
which augments graphs with additional supernodes that are connected to all nodes in the original
graphs, can significantly improve the performance of existing GNNs. Conceptually, the benefit of
the virtual node is that it can aggregate the information of the whole graph (like the READOUT
function) and then propagate it to each node. However, a naive addition of a supernode to a graph
can potentially lead to inadvertent over-smoothing of information propagation [24]. We instead find
that such a graph-level aggregation and propagation operation can be naturally fulfilled by vanilla
self-attention without additional encodings. Concretely, we can prove the following fact:
Fact 2. By choosing proper weights, every node representation of the output of a Graphormer layer
without additional encodings can represent MEAN READOUT functions.
This fact takes advantage of the property of self-attention that each node can attend to all other nodes; thus, self-attention can simulate a graph-level READOUT operation that aggregates information from the whole graph. Besides this theoretical justification, we empirically find that Graphormer does not encounter the problem of over-smoothing, which makes the improvement scalable. This fact also inspires us to introduce a special node for graph readout (see the previous subsection).
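For intuition, a short calculation behind Fact 2 (a sketch under the single-head setting of Section 2, not the full proof): choosing $W_Q = 0$ makes the attention logits constant, so every row of the softmax is uniform and each node's output is the mean of the value vectors,
$$W_Q = 0 \;\Rightarrow\; A_{ij} = 0 \text{ for all } i, j \;\Rightarrow\; \mathrm{softmax}(A)_{ij} = \frac{1}{n} \;\Rightarrow\; \big(\mathrm{softmax}(A)\,V\big)_i = \frac{1}{n} \sum_{j=1}^{n} h_j W_V,$$
which is a MEAN READOUT of the node features (up to the linear map $W_V$), available at every node.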
4 Experiments
We first conduct experiments on the recent OGB-LSC [21] quantum chemistry regression challenge (i.e., PCQM4M-LSC), which is currently the largest graph-level prediction dataset and contains more than 3.8M graphs in total. Then, we report results on three other popular tasks: ogbg-molhiv, ogbg-molpcba, and ZINC, which come from the OGB [22] and benchmarking-GNN [14] leaderboards. Finally, we ablate the important design elements of Graphormer. A detailed description of the datasets and training strategies can be found in Appendix B.
Baselines. We benchmark the proposed Graphormer against GCN [26] and GIN [50], and their variants with a virtual node (-VN) [15], which achieve the state-of-the-art validation and test mean absolute error (MAE) on the official leaderboard [21] (https://github.com/snap-stanford/ogb/tree/master/examples/lsc/pcqm4m#performance). In addition, we compare to GIN's multi-hop variant [5] and the 12-layer deep graph network DeeperGCN [30], which also show promising performance on other leaderboards. We further compare our Graphormer with the recent Transformer-based graph model GT [13].
Settings. We primarily report results for two model sizes: Graphormer (L = 12, d = 768) and a smaller GraphormerSMALL (L = 6, d = 512). Both the number of attention heads in the attention module and the dimensionality of edge features d_E are set to 32. We use AdamW as the optimizer, with the hyper-parameter ε set to 1e-8 and (β1, β2) set to (0.99, 0.999). The peak learning rate is set to 2e-4 (3e-4 for GraphormerSMALL) with a 60k-step warm-up stage, followed by a linear-decay learning rate scheduler. The total number of training steps is 1M and the batch size is 1024. All models are trained on 8 NVIDIA V100 GPUs for about 2 days.
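For reproducibility, a hedged sketch of the optimizer and learning-rate schedule described above (AdamW with ε = 1e-8, (β1, β2) = (0.99, 0.999), a 60k-step linear warm-up to the 2e-4 peak, and linear decay over 1M steps), assuming PyTorch; the model is a placeholder, not the 12-layer, d = 768 Graphormer.

```python
import torch

model = torch.nn.Linear(64, 1)   # placeholder for the actual Graphormer model

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4,
                              betas=(0.99, 0.999), eps=1e-8)

warmup_steps, total_steps = 60_000, 1_000_000

def lr_lambda(step):
    # linear warm-up to the peak learning rate, then linear decay to zero
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# training loop skeleton: call optimizer.step() and scheduler.step() once per batch of size 1024
```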
Table 1: Results on PCQM4M-LSC. * indicates the results are cited from the official leaderboard [21].
Results. Table 1 summarizes the performance comparison on the PCQM4M-LSC dataset. From the table, GIN-VN achieves the previous state-of-the-art validation MAE of 0.1395. The original implementation of GT [13] employs a hidden dimension of 64 to reduce the total number of parameters. For a fair comparison, we also report the result obtained by enlarging the hidden dimension to 768, denoted by GT-Wide, which leads to a total of 83.2M parameters. However, both GT and GT-Wide do not outperform GIN-VN and DeeperGCN-VN. In particular, we do not observe a performance gain of GT along with the growth of parameters.
Compared to the previous state-of-the-art GNN architecture, Graphormer surpasses GIN-VN by a large margin, e.g., an 11.5% relative decline in validation MAE. By using an ensemble with ExpC [51], we obtained an MAE of 0.1200 on the complete test set and won first place in the graph-level track of the OGB Large-Scale Challenge [21, 53]. As stated in Section 3.3, we further find that the proposed Graphormer does not encounter the problem of over-smoothing, i.e., the training and validation errors keep going down as the depth and width of the models grow.
We further investigate the performance of Graphormer on commonly used graph-level prediction tasks from popular leaderboards, i.e., OGB [22] (OGBG-MolPCBA, OGBG-MolHIV) and benchmarking-GNN [14] (ZINC). Since pre-training is encouraged by OGB, we mainly explore the transfer capability of a Graphormer model pre-trained on OGB-LSC (i.e., PCQM4M-LSC). Please note that the model configurations, hyper-parameters, and pre-training performance of the pre-trained Graphormers used for MolPCBA and MolHIV are different from those used in the previous subsection; please refer to Appendix B for detailed descriptions. For benchmarking-GNN, which does not encourage large pre-trained models, we train an additional GraphormerSLIM (L = 12, d = 80, 489K parameters in total) from scratch on ZINC.
Baselines. We report the performance of GNNs that achieve top performance on the official leaderboards (https://ogb.stanford.edu/docs/leader_graphprop/ and https://github.com/graphdeeplearning/benchmarking-gnns/blob/master/docs/07_leaderboards.md) without additional domain-specific features. Considering that the pre-trained Graphormer leverages external data, for a fair comparison on the OGB datasets we additionally report the performance of fine-tuning a GIN-VN pre-trained on the PCQM4M-LSC dataset, which achieves the previous state-of-the-art validation and test MAE on that dataset.
Results. Tables 2, 3, and 4 summarize the performance of Graphormer compared with other GNNs on the MolHIV, MolPCBA, and ZINC datasets. In particular, GT [13] and SAN [28] in Table 4 are recently proposed Transformer-based GNN models. Graphormer consistently and significantly outperforms previous state-of-the-art GNNs on all three datasets by a large margin.
Table 2: Results on MolPCBA.
method                       #param.   AP (%)
DeeperGCN-VN+FLAG [30]       5.6M      28.42±0.43
DGN [2]                      6.7M      28.85±0.30
GINE-VN [5]                  6.1M      29.17±0.15
PHC-GNN [29]                 1.7M      29.47±0.26
GINE-APPNP [5]               6.1M      29.79±0.30
GIN-VN [50] (fine-tune)      3.4M      29.02±0.17
Graphormer-FLAG              119.5M    31.39±0.32

Table 3: Results on MolHIV.
method                       #param.   AUC (%)
GCN-GraphNorm [5, 8]         526K      78.83±1.00
PNA [10]                     326K      79.05±1.32
PHC-GNN [29]                 111K      79.34±1.16
DeeperGCN-FLAG [30]          532K      79.42±1.20
DGN [2]                      114K      79.70±0.97
GIN-VN [50] (fine-tune)      3.3M      77.80±1.82
Graphormer-FLAG              47.0M     80.51±0.53
Notably, except for Graphormer, the other pre-trained GNNs do not achieve competitive performance, which is in line with previous literature [20]. In addition, we conduct more comparisons with fine-tuning the pre-trained GNNs; please refer to Appendix C.
We perform a series of ablation studies on the importance of the proposed designs in Graphormer on the PCQM4M-LSC dataset. The ablation results are included in Table 5. To save computation resources, the Transformer models in Table 5 have 12 layers and are trained for 100K iterations.
Node Relation Encoding. We compare the previously used positional encodings (PE) to our proposed spatial encoding, both of which aim to encode the information of distinct node relations into the Transformer. Various PEs have been employed by previous Transformer-based GNNs, e.g., Weisfeiler-Lehman PE (WL-PE) [56] and Laplacian PE [3, 14]. We report the performance of Laplacian PE since it performs well compared to a series of PEs for the Graph Transformer in previous literature [13]. The Transformer architecture with spatial encoding outperforms the counterpart built on positional encoding, which demonstrates the effectiveness of spatial encoding in capturing the spatial information of nodes.
Edge Encoding. We compare our proposed edge encoding (denoted as via attn bias) to the two commonly used edge encodings described in Section 3.1.3, denoted as via node and via Aggr in Table 5. From the table, the performance gap between the two conventional methods is minor, but our proposed edge encoding performs significantly better, which indicates that edge encoding as an attention bias is more effective for the Transformer to capture spatial information on edges.
Table 5: Ablation study results on PCQM4M-LSC dataset with different designs.
5 Related Work
In this section, we highlight the most recent works that attempt to develop GNNs based on the standard Transformer architecture or graph structural encodings, and spend less effort on elaborating the works that adapt attention mechanisms to GNNs [33, 55, 7, 23, 1, 47, 48, 56, 45].
There are several works that study the performance of pure Transformer architectures (stacks of Transformer layers) with modifications on graph representation tasks, which are more closely related to our Graphormer. For example, several parts of the Transformer layer are modified in [43], including an additional GNN employed in the attention sub-layer to produce the vectors of Q, K, and V, long-range residual connections, and two branches of FFN to produce node and edge representations separately. They pre-train their model on 10 million unlabelled molecules and achieve excellent results by fine-tuning on downstream tasks. The attention module is modified into a soft adjacency matrix in [39] by directly adding the adjacency matrix and an RDKit-computed (https://www.rdkit.org/) inter-atomic distance matrix to the attention probabilities. Very recently, Dwivedi et al. [13] revisit a series of works on Transformer-based GNNs and suggest that the attention mechanism in Transformers on graph data should only aggregate information from the neighborhood (i.e., using the adjacency matrix as the attention mask) to ensure graph sparsity, and propose to use Laplacian eigenvectors as positional encodings. Their model GT surpasses baseline GNNs on graph representation tasks. A concurrent work [28] proposes to use the full Laplacian spectrum to learn the position of each node in a graph, and empirically shows better results than GT.
Path and Distance in GNNs. Information about paths and distances is commonly used in GNNs. For example, an attention-based aggregation is proposed in [9], where the node features, edge features, one-hot distance features, and ring flag features are concatenated to calculate the attention probabilities; similar to [9], path-based attention is leveraged in [52] to model the influence between the center node and its higher-order neighbors; a distance-weighted aggregation scheme on graphs is proposed in [54]; and it has been proved in [32] that adopting distance encoding (i.e., a one-hot feature of the distance as an extra node attribute) leads to strictly more expressive power than the 1-WL test.
Positional Encoding in Transformers on Graphs. Several works introduce positional encodings (PE) to Transformer-based GNNs to help the model capture node position information. For example, Graph-BERT [56] introduces three types of PE to embed node position information into the model: an absolute WL-PE which represents different nodes labeled by the Weisfeiler-Lehman algorithm, and an intimacy-based PE and a hop-based PE which both vary with the sampled subgraphs. An absolute Laplacian PE is employed in [13], and the empirical study shows that its performance surpasses the absolute WL-PE used in [56].
Edge Feature. Apart from the conventional methods of encoding edge features described in the previous section, there are several attempts to explore how to better encode edge features: an attention-based GNN layer is developed in [16] to encode edge features, where the edge feature is weighted by the similarity of the features of its two end nodes; edge features are encoded into the popular GIN [50] in [5]; and in [13], the authors propose to project edge features to an embedding vector, multiply it by the attention coefficients, and send the result to an additional FFN sub-layer to produce edge representations.
6 Conclusion
We have explored the direct application of Transformers to graph representation. With three novel
graph structural encodings, the proposed Graphormer works surprisingly well on a wide range of
popular benchmark datasets. While these initial results are encouraging, many challenges remain. For example, the quadratic complexity of the self-attention module restricts Graphormer's application to large graphs, so the future development of an efficient Graphormer is necessary. Performance improvements could also be expected by leveraging domain-knowledge-powered encodings on particular graph datasets. Finally, an applicable graph sampling strategy is desired for node representation extraction with Graphormer. We leave these for future work.
References
[1] Jinheon Baek, Minki Kang, and Sung Ju Hwang. Accurate learning of graph representations with graph
multiset pooling. ICLR, 2021.
[2] Dominique Beaini, Saro Passaro, Vincent Létourneau, William L Hamilton, Gabriele Corso, and Pietro
Liò. Directional graph networks. In International Conference on Machine Learning, 2021.
[3] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representa-
tion. Neural computation, 15(6):1373–1396, 2003.
[4] Xavier Bresson and Thomas Laurent. Residual gated graph convnets. arXiv preprint arXiv:1711.07553,
2017.
[5] Rémy Brossard, Oriel Frigo, and David Dehaene. Graph convolutions that can finally model local structure.
arXiv preprint arXiv:2011.15069, 2020.
[6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss,
Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens
Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark,
Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models
are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors,
Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates,
Inc., 2020.
[7] Deng Cai and Wai Lam. Graph transformer for graph-to-sequence learning. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 34, pages 7464–7471, 2020.
[8] Tianle Cai, Shengjie Luo, Keyulu Xu, Di He, Tie-yan Liu, and Liwei Wang. Graphnorm: A principled
approach to accelerating graph neural network training. In International Conference on Machine Learning,
2021.
[9] Benson Chen, Regina Barzilay, and Tommi Jaakkola. Path-augmented graph transformer network. arXiv
preprint arXiv:1905.12712, 2019.
[10] Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, and Petar Veličković. Principal neighbour-
hood aggregation for graph nets. Advances in Neural Information Processing Systems, 33, 2020.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidi-
rectional transformers for language understanding. In Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth
16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[13] Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs. AAAI
Workshop on Deep Learning on Graphs: Methods and Applications, 2021.
[14] Vijay Prakash Dwivedi, Chaitanya K Joshi, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Bench-
marking graph neural networks. arXiv preprint arXiv:2003.00982, 2020.
[15] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message
passing for quantum chemistry. In International Conference on Machine Learning, pages 1263–1272.
PMLR, 2017.
[16] Liyu Gong and Qiang Cheng. Exploiting edge features for graph neural networks. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9211–9219, 2019.
[17] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang,
Zhengdong Zhang, Yonghui Wu, et al. Conformer: Convolution-augmented transformer for speech
recognition. arXiv preprint arXiv:2005.08100, 2020.
[18] William L Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs.
In NIPS, 2017.
[19] Vincent J Hellendoorn, Charles Sutton, Rishabh Singh, Petros Maniatis, and David Bieber. Global relational
models of source code. In International conference on learning representations, 2019.
[20] W Hu, B Liu, J Gomes, M Zitnik, P Liang, V Pande, and J Leskovec. Strategies for pre-training graph
neural networks. In International Conference on Learning Representations (ICLR), 2020.
[21] Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. Ogb-lsc: A
large-scale challenge for machine learning on graphs. arXiv preprint arXiv:2103.09430, 2021.
[22] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta,
and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint
arXiv:2005.00687, 2020.
[23] Ziniu Hu, Yuxiao Dong, Kuansan Wang, and Yizhou Sun. Heterogeneous graph transformer. In Proceedings
of The Web Conference 2020, pages 2704–2710, 2020.
[24] Katsuhiko Ishiguro, Shin-ichi Maeda, and Masanori Koyama. Graph warp module: an auxiliary module for
boosting the power of graph neural networks in molecular graph analysis. arXiv preprint arXiv:1902.01020,
2019.
[25] Guolin Ke, Di He, and Tie-Yan Liu. Rethinking the positional encoding in language pre-training. ICLR,
2020.
[26] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks.
arXiv preprint arXiv:1609.02907, 2016.
[27] Kezhi Kong, Guohao Li, Mucong Ding, Zuxuan Wu, Chen Zhu, Bernard Ghanem, Gavin Taylor,
and Tom Goldstein. Flag: Adversarial data augmentation for graph neural networks. arXiv preprint
arXiv:2010.09891, 2020.
[28] Devin Kreuzer, Dominique Beaini, William Hamilton, Vincent Létourneau, and Prudencio Tossou. Re-
thinking graph transformers with spectral attention. arXiv preprint arXiv:2106.03893, 2021.
[29] Tuan Le, Marco Bertolini, Frank Noé, and Djork-Arné Clevert. Parameterized hypercomplex graph neural
networks for graph classification. arXiv preprint arXiv:2103.16584, 2021.
[30] Guohao Li, Chenxin Xiong, Ali Thabet, and Bernard Ghanem. Deepergcn: All you need to train deeper
gcns. arXiv preprint arXiv:2006.07739, 2020.
[31] Junying Li, Deng Cai, and Xiaofei He. Learning graph-level representation for drug discovery. arXiv
preprint arXiv:1709.03741, 2017.
[32] Pan Li, Yanbang Wang, Hongwei Wang, and Jure Leskovec. Distance encoding: Design provably more
powerful neural networks for graph representation learning. Advances in Neural Information Processing
Systems, 33, 2020.
[33] Yuan Li, Xiaodan Liang, Zhiting Hu, Yinbo Chen, and Eric P. Xing. Graph transformer, 2019.
[34] Xi Victoria Lin, Richard Socher, and Caiming Xiong. Multi-hop knowledge graph reasoning with reward
shaping. arXiv preprint arXiv:1808.10568, 2018.
[35] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv
preprint arXiv:1907.11692, 2019.
[36] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin
transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030,
2021.
[39] Łukasz Maziarka, Tomasz Danel, Sławomir Mucha, Krzysztof Rataj, Jacek Tabor, and Stanisław Jastrzębski.
Molecule attention transformer. arXiv preprint arXiv:2002.08264, 2020.
[40] Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma
Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, et al. Do transformer modifications transfer across
implementations and applications? arXiv preprint arXiv:2102.11972, 2021.
[41] Dinglan Peng, Shuxin Zheng, Yatao Li, Guolin Ke, Di He, and Tie-Yan Liu. How could neural networks
understand programs? In International Conference on Machine Learning. PMLR, 2021.
[42] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.
Journal of Machine Learning Research, 21(140):1–67, 2020.
[43] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-
supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing
Systems, 33, 2020.
[44] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468, 2018.
[45] Yunsheng Shi, Zhengjie Huang, Wenjin Wang, Hui Zhong, Shikun Feng, and Yu Sun. Masked label predic-
tion: Unified message passing model for semi-supervised classification. arXiv preprint arXiv:2009.03509,
2020.
[46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz
Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[47] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio.
Graph attention networks. ICLR, 2018.
[48] Guangtao Wang, Rex Ying, Jing Huang, and Jure Leskovec. Direct multi-hop attention based graph neural
network. arXiv preprint arXiv:2009.14332, 2020.
[49] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan
Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In International
Conference on Machine Learning, pages 10524–10533. PMLR, 2020.
[50] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks?
In International Conference on Learning Representations, 2019.
[51] Mingqi Yang, Yanming Shen, Heng Qi, and Baocai Yin. Breaking the expressive bottlenecks of graph
neural networks. arXiv preprint arXiv:2012.07219, 2020.
[52] Yiding Yang, Xinchao Wang, Mingli Song, Junsong Yuan, and Dacheng Tao. Spagan: Shortest path graph
attention network. Advances in IJCAI, 2019.
[53] Chengxuan Ying, Mingqi Yang, Shuxin Zheng, Guolin Ke, Shengjie Luo, Tianle Cai, Chenglin Wu, Yuxin
Wang, Yanming Shen, and Di He. First place solution of kdd cup 2021 & ogb large-scale challenge
graph-level track. arXiv preprint arXiv:2106.08279, 2021.
[54] Jiaxuan You, Rex Ying, and Jure Leskovec. Position-aware graph neural networks. In International
Conference on Machine Learning, pages 7134–7143. PMLR, 2019.
[55] Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J Kim. Graph transformer
networks. Advances in Neural Information Processing Systems, 32, 2019.
[56] Jiawei Zhang, Haopeng Zhang, Congying Xia, and Li Sun. Graph-bert: Only attention is needed for
learning graph representations. arXiv preprint arXiv:2001.05140, 2020.
[57] Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, and Stephan Günnemann. Language-
agnostic representation learning of source code from structure and context. In International Conference on
Learning Representations, 2020.