GRPE: GRAPH RELATIVE POSITIONAL ENCODING
1 INTRODUCTION
The average cost of drug discovery has increased drastically in the last decade, while the success rate of developing new therapeutics has declined. Large-scale screening with deep neural networks has drawn considerable attention as a way to lower the cost of drug discovery, especially in the lead finding and lead optimization phases. By representing a molecule as a graph, a graph neural network can be used for screening by predicting important properties, e.g., drug-likeness, solubility, or synthesizability. Therefore, graph representation learning has become a key technique for drug discovery.
The Transformer, introduced by Vaswani et al. (2017), has been effective in graph representation learning, overcoming the inductive biases of graph convolutional networks by using self-attention. On the other hand, the explicit representation of position provided by graph convolutional networks is lost, so incorporating graph structure into the hidden representations of self-attention is a key challenge. We categorize existing works into two groups: (a) linearizing the graph with the graph Laplacian to encode the absolute position of each node (Dwivedi & Bresson, 2020; Kreuzer et al., 2021), and (b) encoding the position relative to another node with bias terms (Ying et al., 2021). The former loses preciseness of position due to linearization, while the latter loses the tight integration of node-edge and node-spatial information. Our method can be interpreted as overcoming the weaknesses of both, since it considers both the relative position and its interaction with node features.
To avoid losing relative position due to linearization, we propose to adopt the relative positional encoding of Shaw et al. (2018). Its efficacy has been verified in natural language processing; however, it has not yet been studied on graphs. Since the original work is designed for 1D word sequences, we reformulate it to incorporate graph-specific properties. To this end, we introduce two sets of learnable positional encoding vectors that represent the spatial relation or the edge between two nodes. Our method considers the interaction between node features and the two encoding vectors to integrate both node-spatial and node-edge information. We name our method Graph Relative Positional Encoding (GRPE).
Figure 1: Our proposed Graph Relative Positional Encoding (GRPE). GRPE is a relative positional encoding method dedicated to learning graph representations.
We conducted extensive experiments to validate the efficacy of our proposed method on various tasks, from graph classification to graph regression. Models built with our relative positional encoding achieve state-of-the-art performance on several molecule property prediction datasets.
2 RELATED WORK
Existing works leverage Transformer architecture to learn graph representation. We categorize those
methods as follows.
Earlier models adopt the Transformer without explicitly encoding positional information of a graph. Veličković et al. (2017) replace the graph convolution operation with a self-attention module in which attention is performed only within neighboring nodes. Rong et al. (2020) iteratively stack self-attention modules on top of graph convolutional networks to consider long-range interactions between nodes. In their method, node affinity is considered only in the graph convolutional networks, and positional information is not given to self-attention.
Later works employ absolute positional encoding to explicitly encode the positional information of a graph in the Transformer. Their main idea is to linearize a graph into a sequence of nodes and add an absolute positional encoding to the input features. Dwivedi & Bresson (2020) adopt the graph Laplacian as a positional encoding, where each cell of the encoding represents a partition after a graph min-cut; nodes sharing many partitions after the min-cut have similar graph Laplacian vectors. Kreuzer et al. (2021) employ a learnable positional encoding produced by a Transformer whose input is the Laplacian spectrum of the graph. Due to the linearization of the graph, these approaches lose the preciseness of position on the graph.
Meanwhile, encoding relative positional information has been studied to avoid losing the preciseness of position. Graphormer, introduced by Ying et al. (2021), encodes relative position on the scaled dot-product attention map by adding bias terms. However, the bias terms are parameterized only by relative position, such as shortest path distance or edge type, so the interaction with node features is lost. On the other hand, Shaw et al. (2018) introduce a relative positional encoding that tightly integrates tokens and relative positions; however, their method is designed to process 1D word sequences. We extend the systematic design of Shaw et al. (2018) to graphs. Our method considers the interaction between nodes and graph-specific properties such as edges, and it encodes the precise positional information of a graph.
3 BACKGROUND
Notation. We denote the set of nodes of a graph by $\{n_i\}_{i=1:N}$ and the set of edges by $\{e_{ij} \mid j \in \mathcal{N}_i\}_{i=1:N}$, where $N$ is the number of nodes and $\mathcal{N}_i$ is the set of neighbors of node $n_i$. Both $n_i$ and $e_{ij}$ are positive integers that index the type of the node or edge, e.g., atom numbers or bond types of a molecule. $\psi(i,j)$ denotes a function that encodes the structural relationship between nodes $n_i$ and $n_j$.
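In this paper, $\psi$ is instantiated as the shortest path distance between the two nodes (see Section 3.1). The following is a minimal sketch of computing it by breadth-first search, with pairs farther apart than $L$ mapped to a separate "far" bucket as in Figure 2; the function name and the adjacency-list input format are our own illustrative choices.

```python
from collections import deque

def spatial_relation(adj, L):
    """Compute psi(i, j): the shortest path distance between nodes i and j.

    adj: adjacency list, adj[i] is the list of neighbors of node i.
    L:   maximum shortest path distance considered by the model; any pair
         farther apart than L is mapped to a single "far" bucket (index L + 1).
    Returns an N x N list of bucket indices in {0, ..., L, L + 1}.
    """
    N = len(adj)
    FAR = L + 1
    psi = [[FAR] * N for _ in range(N)]
    for src in range(N):
        psi[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if psi[src][u] >= L:            # neighbors of u would be farther than L
                continue
            for v in adj[u]:
                if psi[src][v] > psi[src][u] + 1:
                    psi[src][v] = psi[src][u] + 1
                    queue.append(v)
    return psi

# The 5-node example graph of Figure 2 (edges 0-1, 1-2, 2-3, 2-4), with L = 2:
# distances fall into buckets {0, 1, 2, far}, matching the matrix shown in Figure 2.
adj = [[1], [0, 2], [1, 3, 4], [2], [2]]
print(spatial_relation(adj, L=2))
```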
3.1 TRANSFORMER
The Transformer is built by stacking multiple self-attention layers. Self-attention compares a query against a set of keys to compute an attention map. The values are then summed, weighted by the attention map, to produce the hidden feature for the following layer.
Specifically, $x_i \in \mathbb{R}^{d_x}$ denotes the input feature of node $n_i$, and $z_i \in \mathbb{R}^{d_z}$ denotes the output feature of the self-attention module. The self-attention module computes the query $q$, key $k$, and value $v$ with independent linear transformations $W^{\text{query}} \in \mathbb{R}^{d_x \times d_z}$, $W^{\text{key}} \in \mathbb{R}^{d_x \times d_z}$, and $W^{\text{value}} \in \mathbb{R}^{d_x \times d_z}$:
$$q = W^{\text{query}} x, \quad k = W^{\text{key}} x, \quad v = W^{\text{value}} x. \tag{1}$$
The attention map is computed by applying a scaled dot product between the queries and the keys:
$$a_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_z}} \quad \text{and} \quad \hat{a}_{ij} = \frac{\exp(a_{ij})}{\sum_{k=1}^{N} \exp(a_{ik})}. \tag{2}$$
The self-attention module outputs the next hidden feature by applying a weighted summation over the values:
$$z_i = \sum_{j=1}^{N} \hat{a}_{ij} v_j. \tag{3}$$
$z$ is later fed into a feed-forward neural network with a residual connection (He et al., 2016); we omit a detailed explanation since it is out of the scope of this paper. In practice, a multi-head self-attention module is adopted.
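As a reference point for the graph-specific variants discussed below, a minimal single-head PyTorch sketch of Eqs. (1)–(3) is given here; the module and variable names are our own choices, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention implementing Eqs. (1)-(3)."""

    def __init__(self, d_x, d_z):
        super().__init__()
        self.d_z = d_z
        self.W_query = nn.Linear(d_x, d_z, bias=False)
        self.W_key = nn.Linear(d_x, d_z, bias=False)
        self.W_value = nn.Linear(d_x, d_z, bias=False)

    def forward(self, x):                                            # x: (N, d_x) node features
        q, k, v = self.W_query(x), self.W_key(x), self.W_value(x)    # Eq. (1)
        a = q @ k.transpose(-1, -2) / self.d_z ** 0.5                # Eq. (2), scaled dot product
        a_hat = torch.softmax(a, dim=-1)                             # Eq. (2), row-wise softmax
        return a_hat @ v                                             # Eq. (3), weighted sum of values
```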
To encode graph structure in the Transformer, previous methods focus on encoding graph information into either the attention map or the input features fed to the Transformer. Graphormer (Ying et al., 2021) adds two additional terms to the self-attention module to encode graph information on the attention map:
$$a^{\text{Graphormer}}_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_z}} + b_{\psi(i,j)} + E_{e_{ij}} \cdot w. \tag{4}$$
$\psi$ represents the spatial relation between $n_i$ and $n_j$, defined as the shortest path distance between the two nodes. The learnable scalar bias $b_{\psi(i,j)}$ encodes the spatial relation between the two nodes, e.g., $b_l$ is the bias for two nodes that are $l$ hops apart. The embedding vector $E_{e_{ij}}$ represents the edge between node $n_i$ and node $n_j$, and $w$ is a learnable vector, so $E_{e_{ij}} \cdot w$ encodes the edge between the two nodes. Moreover, Graphormer adds a centrality encoding to the input $x$, which represents the number of edges of a node. Graphormer thus encodes the graph on the attention map with bias terms parameterized by either the spatial relation or the edge, without considering node features. In contrast, we additionally consider the node-spatial and node-edge relations. We explain the details in Section 4.2.
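For intuition, a minimal sketch of the score in Eq. (4) is shown below; it is not Graphormer's full implementation, and the embedding tables and names are our own illustrative choices.

```python
import torch
import torch.nn as nn

class GraphormerAttentionScore(nn.Module):
    """Attention scores of Eq. (4): scaled dot product + spatial bias + edge bias."""

    def __init__(self, d_z, num_distances, num_edge_types):
        super().__init__()
        self.d_z = d_z
        self.b = nn.Embedding(num_distances, 1)       # b_{psi(i,j)}: one scalar bias per distance
        self.E = nn.Embedding(num_edge_types, d_z)    # E_{e_ij}: edge embedding
        self.w = nn.Parameter(torch.randn(d_z))       # learnable vector w

    def forward(self, q, k, psi, e):
        # q, k: (N, d_z); psi, e: (N, N) integer indices of distance / edge type.
        a = q @ k.transpose(-1, -2) / self.d_z ** 0.5
        a = a + self.b(psi).squeeze(-1)               # + b_{psi(i,j)}
        a = a + self.E(e) @ self.w                    # + E_{e_ij} . w
        return a
```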
Dwivedi & Bresson (2020) and Kreuzer et al. (2021) utilize the graph Laplacian (Belkin & Niyogi, 2003) $\lambda \in \mathbb{R}^{N \times d_x}$ as positional encodings on the input feature $x$; $\lambda$ consists of the $d_x$ eigenvectors of $I - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ with the smallest eigenvalues, where $I$ is the identity matrix, $A$ is the adjacency matrix, and $D$ is the degree matrix. Each cell of a graph Laplacian vector represents a partition after a graph min-cut, and neighboring nodes sharing many partitions have similar graph Laplacian vectors. The graph Laplacian $\lambda_i$ represents the structure of the graph with respect to node $n_i$:
$$\hat{x}_i = x_i + \lambda_i. \tag{5}$$
Kreuzer et al. (2021) adopt an additional Transformer model $f$ to produce a learnable positional encoding: $\hat{x}_i = x_i + \hat{\lambda}_i$ where $\hat{\lambda} = f(\lambda)$. By adding the graph Laplacian to the input $x$, graph information can be encoded in both the attention map and the hidden representations. These methods lose relative positional information when linearizing the graph into absolute positional encodings. In contrast, our method encodes the graph directly on the attention map without linearization, so relative positional information is encoded without loss.
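For completeness, a small NumPy sketch of this Laplacian positional encoding (Eq. 5) under the definitions above; the function name is ours, and we keep the $d_x$ eigenvectors with the smallest eigenvalues, as stated in the text.

```python
import numpy as np

def laplacian_positional_encoding(A, d_x):
    """Return the d_x eigenvectors of I - D^{-1/2} A D^{-1/2} with the smallest eigenvalues.

    A: (N, N) adjacency matrix of the graph.
    """
    deg = A.sum(axis=1).astype(float)
    d_inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    lap = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigval, eigvec = np.linalg.eigh(lap)        # eigenvalues returned in ascending order
    return eigvec[:, :d_x]                      # lambda in R^{N x d_x}, one row per node

# Each node's encoding is then added to its input feature: x_hat[i] = x[i] + lam[i]  (Eq. 5).
```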
Figure 2: Illustration of the proposed Graph Relative Positional Encoding. The left figure shows an example of how GRPE processes the relative relations between nodes; in this example we set L to 2. The right figure describes our self-attention mechanism. Our two relative positional encodings, the spatial encoding and the edge encoding, are used to encode the graph on both the attention map and the value.
4 OUR APPROACH
We reformulate the relative positional encoding for 1D sequence data (Shaw et al., 2018) to incorporate graph-specific properties, such as the spatial relation or the edge between nodes. Our distinction is twofold. First, we integrate node-spatial and node-edge information on the attention map, which was not leveraged in existing graph Transformers. To this end, we propose node-aware attention, which considers the interactions in two pairs: node-spatial relation and node-edge relation. Second, we also encode the graph into the hidden representation of self-attention. To this end, we propose graph-encoded values, which directly encode relative positional information into the value features by addition. Our node-aware attention applies the attention mechanism in a node-wise manner, while our graph-encoded value applies it in a channel-wise manner.
We define two encodings to represent the relative positional relation between two nodes in a graph. The first is the spatial encoding $\mathcal{P}$, for which we define encodings for query, key, and value respectively: $P^{\text{query}}, P^{\text{key}}, P^{\text{value}} \in \mathbb{R}^{L \times d_z}$. Each vector of $P$ represents the spatial relation between two nodes, e.g., $P_l$ represents the spatial relation of two nodes whose shortest path distance is $l$. $L$ is the maximum shortest path distance that our method considers.

The second is the edge encoding $\mathcal{E}$, for which we define encodings for query, key, and value respectively: $E^{\text{query}}, E^{\text{key}}, E^{\text{value}} \in \mathbb{R}^{E \times d_z}$. $E_{e_{ij}}$ is a vector representing the edge between nodes $n_i$ and $n_j$, and $E$ is the number of edge types. The spatial encodings and the edge encodings are shared across all layers.
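As an implementation note, the shared encodings can be stored as small embedding tables. The sketch below only fixes their shapes; the class name, parameter names, and the exact number of distance buckets are our own assumptions.

```python
import torch.nn as nn

class GRPEEncodings(nn.Module):
    """Spatial encodings P and edge encodings E, shared across all layers."""

    def __init__(self, num_distances, num_edge_types, d_z):
        super().__init__()
        # One row per shortest-path-distance bucket (distances up to L, plus any special buckets).
        self.P_query = nn.Embedding(num_distances, d_z)
        self.P_key = nn.Embedding(num_distances, d_z)
        self.P_value = nn.Embedding(num_distances, d_z)
        # One row per edge type (plus special types such as "no edge" or "self"; see Sec. 5.1.3).
        self.E_query = nn.Embedding(num_edge_types, d_z)
        self.E_key = nn.Embedding(num_edge_types, d_z)
        self.E_value = nn.Embedding(num_edge_types, d_z)
```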
We propose two terms that encode the graph on the attention map using the two newly proposed encodings. The first term is $b^{\text{spatial}}$. It encodes the graph by considering the interaction between node features and the spatial relation in the graph:
$$b^{\text{spatial}}_{ij} = q_i \cdot P^{\text{query}}_{\psi(i,j)} + k_j \cdot P^{\text{key}}_{\psi(i,j)}. \tag{6}$$
The second term is $b^{\text{edge}}$. It encodes the graph by considering the interaction between node features and the edge in the graph:
$$b^{\text{edge}}_{ij} = q_i \cdot E^{\text{query}}_{e_{ij}} + k_j \cdot E^{\text{key}}_{e_{ij}}. \tag{7}$$
Finally, the two terms are added to the scaled dot-product attention map to encode the graph information:
$$a_{ij} = \frac{q_i \cdot k_j + b^{\text{spatial}}_{ij} + b^{\text{edge}}_{ij}}{\sqrt{d_z}}. \tag{8}$$
Our two terms consider the node-spatial and node-edge relations, whereas Graphormer does not consider the interaction with node features. For instance, any two nodes that are the same distance apart share the same bias $b_{\psi(i,j)}$ in Eq. (4), whereas our $b^{\text{spatial}}$ yields different values according to the node features of the query and the key.
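The following is a minimal, naive sketch of Eqs. (6)–(8) in PyTorch, assuming the GRPEEncodings tables from the sketch above and precomputed index matrices psi and e; all function and variable names are ours.

```python
import torch

def node_aware_attention(q, k, psi, e, enc, d_z):
    """Compute attention weights from Eqs. (6)-(8), normalized as in Eq. (2).

    q, k:   (N, d_z) query / key features.
    psi, e: (N, N) LongTensors of spatial-relation and edge-type indices.
    enc:    a GRPEEncodings module holding the P_* and E_* tables.
    """
    # Eq. (6): b_spatial_ij = q_i . P_query[psi(i,j)] + k_j . P_key[psi(i,j)]
    b_spatial = (q.unsqueeze(1) * enc.P_query(psi)).sum(-1) \
              + (k.unsqueeze(0) * enc.P_key(psi)).sum(-1)
    # Eq. (7): b_edge_ij = q_i . E_query[e_ij] + k_j . E_key[e_ij]
    b_edge = (q.unsqueeze(1) * enc.E_query(e)).sum(-1) \
           + (k.unsqueeze(0) * enc.E_key(e)).sum(-1)
    # Eq. (8): add both terms to the scaled dot product, then row-wise softmax as in Eq. (2).
    a = (q @ k.transpose(-1, -2) + b_spatial + b_edge) / d_z ** 0.5
    return torch.softmax(a, dim=-1)
```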
We further propose to encode the graph into the hidden features of self-attention when the values are weighted-summed with the attention map. We encode both the spatial encoding and the edge encoding into the value via summation:
$$z_i = \sum_{j=1}^{N} \hat{a}_{ij} \left( v_j + P^{\text{value}}_{\psi(i,j)} + E^{\text{value}}_{e_{ij}} \right). \tag{9}$$
Our method directly encodes graph information into the hidden features of the value. The attention weight $\hat{a}$ is applied equally to all channels, while our graph-encoded value enriches the feature of each channel. Therefore, node-aware attention is a node-wise attention mechanism, while the graph-encoded value is a channel-wise mechanism. Our method encodes the graph into the hidden features without losing the preciseness of positional information, thanks to our relative positional encoding. In contrast, previous approaches encode the graph on the initial input $x$ by linearizing the graph, e.g., centrality encoding (Ying et al., 2021) or graph Laplacian (Kreuzer et al., 2021; Dwivedi & Bresson, 2020), which loses the preciseness of position.
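Continuing the same illustrative sketch, the graph-encoded value of Eq. (9) can be written as follows; it reuses the attention weights and encoding tables introduced above.

```python
def graph_encoded_value(a_hat, v, psi, e, enc):
    """Eq. (9): z_i = sum_j a_hat_ij * (v_j + P_value[psi(i,j)] + E_value[e_ij]).

    a_hat: (N, N) attention weights; v: (N, d_z) values; psi, e: (N, N) LongTensors.
    """
    # Broadcast v_j to all (i, j) pairs and add the pairwise positional vectors.
    enriched = v.unsqueeze(0) + enc.P_value(psi) + enc.E_value(e)    # (N, N, d_z)
    return (a_hat.unsqueeze(-1) * enriched).sum(dim=1)               # (N, d_z)
```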
A naive implementation of computing all pairs of $b^{\text{spatial}}$ requires $O(N^2 d_z)$ time, since it performs an inner product for every node pair. Instead, we pre-compute the inner products between all node features and all spatial encoding vectors $P$, which requires $O(N L d_z)$ time, and then assign the pre-computed values according to the indices of the node pairs. Likewise, for $b^{\text{edge}}$, we pre-compute the inner products between all node features and all edge encoding vectors $E^{\text{query}}$, which requires $O(N E d_z)$ time. The time complexity is reduced significantly with our implementation, since $L$ and $E$ are much smaller than the number of nodes $N$.
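This pre-computation can be sketched as follows for $b^{\text{spatial}}$; the same gather pattern applies to $b^{\text{edge}}$. We again assume the GRPEEncodings tables from the earlier sketch, and the function name is ours.

```python
import torch

def b_spatial_efficient(q, k, psi, enc):
    """b_spatial of Eq. (6) via pre-computation and gathering.

    q, k: (N, d_z); psi: (N, N) LongTensor of spatial-relation indices;
    enc:  holds the P_query / P_key embedding tables.
    """
    # Pre-compute inner products between every node and every spatial encoding row: O(N * L * d_z).
    q_dot_P = q @ enc.P_query.weight.transpose(0, 1)    # (N, num_distances)
    k_dot_P = k @ enc.P_key.weight.transpose(0, 1)      # (N, num_distances)
    # Gather the pre-computed values by the spatial relation of each node pair:
    #   term_q[i, j] = q_dot_P[i, psi[i, j]]  and  term_k[i, j] = k_dot_P[j, psi[i, j]].
    term_q = torch.gather(q_dot_P, 1, psi)
    term_k = torch.gather(k_dot_P, 1, psi.transpose(0, 1)).transpose(0, 1)
    return term_q + term_k
```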
5 EXPERIMENT
5.1 IMPLEMENTATION DETAILS
Table 1: Model configurations.

Model Configuration | # Params | # Layers | Hidden dim [dz] | FFN layer dim | # Heads
GRPE-Small | 489k | 12 | 80 | 80 | 8
GRPE-Standard | 46.2M | 12 | 768 | 768 | 32
GRPE-Large | 118.3M | 18 | 1024 | 1024 | 32

Table 2: Results on ZINC. ∗ indicates a fine-tuned model. The lower the better.
5.1.3 EDGE
Some pairs of nodes are not connected by an edge. Therefore, we utilize a special encoding vector $E_{\text{no}}$ for the pairs of nodes that are not connected by any edge, i.e., $\{(n_i, n_j) \mid i \neq j \text{ and } j \notin \mathcal{N}_i\}_{i=1:N}$. For the pair of two identical nodes, where $i = j$, we use a special encoding vector $E_{\text{self}}$. Finally, for pairs connected through the virtual node, we utilize another special encoding vector $E_{\text{virtual}}$.
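As an illustration, the edge-type index matrix consumed by the edge encodings could be built as follows; the dictionary-based edge input and the choice of reserved indices are our own assumptions, not prescribed by the paper, and handling of the virtual node is omitted.

```python
import torch

def build_edge_type_matrix(num_nodes, edges, num_bond_types):
    """Return an (N, N) LongTensor of edge-type indices with special entries.

    edges: dict mapping (i, j) node pairs to bond-type indices in [0, num_bond_types).
    Reserved indices (our convention): E_no = num_bond_types for unconnected pairs,
    E_self = num_bond_types + 1 for identical-node pairs; a further index would be
    reserved for E_virtual when a virtual node is used.
    """
    E_no, E_self = num_bond_types, num_bond_types + 1
    e = torch.full((num_nodes, num_nodes), E_no, dtype=torch.long)   # default: not connected
    e.fill_diagonal_(E_self)                                         # pairs of identical nodes
    for (i, j), bond in edges.items():
        e[i, j] = bond
        e[j, i] = bond                                               # undirected molecular graph
    return e
```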
In the result tables, † indicates models that adopt the Transformer for learning graph representations, and bold-faced text indicates the best result. We summarize the model configurations of our experiments in Table 1.
We validate our method on molecule property prediction tasks, namely OGBG-MolPCBA (MolPCBA) (Hu et al., 2020), OGBG-MolHIV (MolHIV) (Hu et al., 2020), and ZINC (Dwivedi et al., 2020). MolPCBA consists of 437,929 graphs, and the task is to predict multiple binary labels indicating various molecule properties; the evaluation metric is average precision (AP). MolHIV is a small dataset of 41,127 graphs; the task is to predict a binary label indicating whether a molecule inhibits HIV virus replication, and the evaluation metric is the area under the curve (AUC). ZINC is also a small dataset of 12,000 graphs, and the task is to regress a molecule property; the evaluation metric is the mean absolute error (MAE). All experiments are conducted five times, and we report the mean and the standard deviation.
We adopt a linear learning rate decay, where the learning rate starts at $2 \times 10^{-4}$ and ends at $1 \times 10^{-9}$. We set $L$ to 5. For the ZINC dataset, we adopt the GRPE-Small configuration with fewer than 500k parameters for a fair comparison. For the MolHIV and MolPCBA datasets, we initialize the model parameters with the weights of a model pretrained on the PCQM4M dataset (Hu et al., 2020).
Table 2 shows the results on the ZINC dataset, where our model achieves the state-of-the-art MAE. Table 4 shows the results on the MolPCBA dataset, where our model achieves the state-of-the-art AP. Table 3 shows the results on the MolHIV dataset, where our model achieves the state-of-the-art AUC with fewer parameters than Graphormer (Ying et al., 2021).
We also validate our method on two datasets of the OGB large-scale challenge (Hu et al., 2020). Both datasets aim to predict the DFT-calculated HOMO-LUMO energy gap of molecules given their molecular graphs. We conduct experiments on both the PCQM4M and PCQM4Mv2 datasets, which are currently the largest molecule property prediction datasets, containing about 4 million graphs in total.
Table 3: Results on MolHIV. ∗ indicates a fine-tuned model. The higher the better.
Table 4: Results on MolPCBA. ∗ indicates a fine-tuned model. The higher the better.
PCQM4Mv2 additionally provides the DFT-calculated 3D structures of molecules; in our experiments, we only utilize the 2D molecular graphs, not the 3D structures. Throughout these experiments, we set $L$ to 5. We adopt the GRPE-Standard configuration for a fair comparison with Graphormer (Ying et al., 2021). We linearly increase the learning rate up to $2 \times 10^{-4}$ over 3 epochs and then linearly decay it to $1 \times 10^{-9}$ over 400 epochs. We are unable to measure the test MAE on PCQM4M, because its test set was deprecated when PCQM4Mv2 was released.
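The schedule described above (linear warmup for 3 epochs, then linear decay over 400 epochs) can be sketched with a standard LambdaLR scheduler; the optimizer choice, per-epoch stepping, and placeholder model are our assumptions, not details given in the paper.

```python
import torch

def linear_warmup_decay(epoch, warmup=3, total=400, peak=2e-4, floor=1e-9):
    """Multiplicative LR factor: linear warmup to the peak, then linear decay to the floor."""
    if epoch < warmup:
        return (epoch + 1) / warmup
    frac = (epoch - warmup) / max(1, total - warmup)       # 0 right after warmup, 1 at `total`
    return max(floor / peak, 1.0 - frac * (1.0 - floor / peak))

model = torch.nn.Linear(16, 1)                             # placeholder model for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_warmup_decay)

for epoch in range(400):
    # ... one training epoch ...
    scheduler.step()
```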
Table 5 shows the results on the PCQM4M dataset, where our model achieves the state-of-the-art validation MAE. Table 6 shows the results on the PCQM4Mv2 dataset, where our model achieves the second-best result on both the validation and the test sets while using only half the parameters of the best model.
Table 5: Results on PCQM4M. * indicates the results are from the official leaderboard. VN indicates
that the model used virtual node. The lower the better.
Table 6: Results on PCQM4Mv2. The results of other methods are from the official leaderboard.
VN indicates that the model used the virtual node. The lower the better.
Table 7: Effects of the components of GRPE on the ZINC dataset. The lower the better.
REFERENCES
Dominique Beaini, Saro Passaro, Vincent Létourneau, Will Hamilton, Gabriele Corso, and Pietro Liò. Directional graph networks. In International Conference on Machine Learning, pp. 748–758. PMLR, 2021.
Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data
representation. Neural computation, 15(6):1373–1396, 2003.
Xavier Bresson and Thomas Laurent. Residual gated graph convnets. arXiv preprint
arXiv:1711.07553, 2017.
Rémy Brossard, Oriel Frigo, and David Dehaene. Graph convolutions that can finally model local
structure. arXiv preprint arXiv:2011.15069, 2020.
Tianle Cai, Shengjie Luo, Keyulu Xu, Di He, Tie-yan Liu, and Liwei Wang. Graphnorm: A prin-
cipled approach to accelerating graph neural network training. In International Conference on
Machine Learning, pp. 1204–1215. PMLR, 2021.
Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, and Petar Veličković. Principal
neighbourhood aggregation for graph nets. arXiv preprint arXiv:2004.05718, 2020.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs.
arXiv preprint arXiv:2012.09699, 2020.
Vijay Prakash Dwivedi, Chaitanya K Joshi, Thomas Laurent, Yoshua Bengio, and Xavier Bresson.
Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982, 2020.
Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural
message passing for quantum chemistry. In International conference on machine learning, pp.
1263–1272. PMLR, 2017.
William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large
graphs. In Proceedings of the 31st International Conference on Neural Information Processing
Systems, pp. 1025–1035, 2017.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog-
nition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.
770–778, 2016.
Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta,
and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv
preprint arXiv:2005.00687, 2020.
Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional net-
works. arXiv preprint arXiv:1609.02907, 2016.
Devin Kreuzer, Dominique Beaini, William L Hamilton, Vincent Létourneau, and Prudencio Tossou.
Rethinking graph transformers with spectral attention. arXiv preprint arXiv:2106.03893, 2021.
Tuan Le, Marco Bertolini, Frank Noé, and Djork-Arné Clevert. Parameterized hypercomplex graph
neural networks for graph classification. arXiv preprint arXiv:2103.16584, 2021.
Guohao Li, Chenxin Xiong, Ali Thabet, and Bernard Ghanem. Deepergcn: All you need to train
deeper gcns. arXiv preprint arXiv:2006.07739, 2020.
Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou
Huang. Self-supervised graph transformer on large-scale molecular data. arXiv preprint
arXiv:2007.02835, 2020.
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representa-
tions. arXiv preprint arXiv:1803.02155, 2018.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information
processing systems, pp. 5998–6008, 2017.
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua
Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural
networks? arXiv preprint arXiv:1810.00826, 2018.
Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen,
and Tie-Yan Liu. Do transformers really perform bad for graph representation? arXiv preprint
arXiv:2106.05234, 2021.