Adaptive Graph Diffusion Networks
September 5, 2022
ABSTRACT
Graph Neural Networks (GNNs) have received much attention in the graph deep learning domain.
However, recent research shows, both empirically and theoretically, that deep GNNs suffer from over-fitting and over-smoothing problems. The usual solutions either fail to reduce the extensive runtime of deep GNNs or restrict graph convolution to a single feature space. We propose Adaptive Graph Diffusion Networks (AGDNs), which perform multi-layer generalized graph diffusion in
different feature spaces with moderate complexity and runtime. Standard graph diffusion methods
combine large and dense powers of the transition matrix with predefined weighting coefficients.
Instead, AGDNs combine smaller multi-hop node representations with learnable and generalized
weighting coefficients. We propose two scalable mechanisms for computing these weighting coefficients to capture multi-hop information: Hop-wise Attention (HA) and Hop-wise Convolution (HC). We evaluate
AGDNs on diverse, challenging Open Graph Benchmark (OGB) datasets with semi-supervised node
classification and link prediction tasks. As of the submission date (Aug 26, 2022), AGDNs achieve top-1 performance on the ogbn-arxiv, ogbn-proteins, and ogbl-ddi datasets and top-3 performance on the ogbl-citation2 dataset. On similar Tesla V100 GPU cards, AGDNs outperform Reversible GNNs (RevGNNs) with 13% of the complexity and 1% of the training runtime of RevGNNs on the ogbn-proteins dataset. AGDNs also achieve performance comparable to SEAL with 36% of the training runtime and 0.2% of the inference runtime of SEAL on the ogbl-citation2 dataset.
1 Introduction
Graph Neural Networks (GNNs), or Message Passing Neural Networks (MPNNs) [1], have recently proved effective and become the mainstream of graph deep learning in many domains, such as citation networks [2, 3, 4], social networks [3, 5],
biological graphs [6], and traffic networks [7, 8, 9]. Based on simple feed-forward networks, they perform message
passing in each layer to capture neighborhood information. However, unlike deep models in the Computer Vision (CV) domain, deep GNNs [10, 11] encounter the over-smoothing problem, resulting in significant performance degradation: the model cannot distinguish node representations, which become nearly identical after long-range message passing. Deep GNNs may also contain many feature transformations, since a GNN layer usually couples a graph
convolution operator and a feature transformation. Recent research [12] reveals that the well-known overfitting problem
caused by redundant transformations can contribute significantly to the performance degradation of deep GNNs. In
addition, many transformations also bring high memory footprints and extended runtime.
By emphasizing the shallow information during long-range message passing, several types of residual connections
[10, 13, 14, 15, 16] can tackle the over-smoothing and over-fitting problems. However, they do not reduce the high memory footprints or extensive runtime of deep GNNs. Some techniques [16] from the CV domain can effectively reduce the memory footprints of deep GNNs, but at the cost of longer runtime.
Residual connections work among pairs of graph convolution and feature transformation. In contrast, the graph
diffusion-based methods directly combine sequential graph convolution operators. The shallow information can be
preserved with suitable weighting coefficients to alleviate the over-smoothing problem. A graph convolution operator
can be described as a matrix multiplication between the weighted adjacency (transition matrix) and node feature matrix.
Then a graph diffusion operator replaces the transition matrix with a linear combination of its different powers. On the
other hand, since many feature transformations bring both efficiency and performance problems, limiting the number of
transformations is also helpful. Thus, employing the graph diffusion with limited feature transformations is reasonable.
This strategy exists in some decoupled methods [12, 17, 18] that decouple graph convolution operators and feature
transformations. In detail, all graph convolution operators are placed before or after all feature transformations. Thus,
the decoupled GNNs can enlarge the receptive field with shallow feature transformations. The graph diffusion or
its variants are incorporated to leverage the multi-hop information. However, there are no feature transformations
(non-linear hierarchies) among all graph convolution operators. Graph convolution is restricted in the same feature
space. This characteristic may limit their model capacity. For example, decoupled GNNs can achieve considerable
performance on small citation datasets, but residual GNNs outperform them on the larger ogbn-arxiv dataset.
Directly replacing the graph convolution operator in each GNN layer with the graph diffusion operator can also enlarge
the receptive field with shallow feature transformations. Graph Diffusion Convolution (GDC) directly replaces the
transition matrix with an explicit diffusion matrix. Efficient approximation and sparsification techniques can reduce the
memory cost of the dense and large diffusion matrix. However, GDC lacks flexibility since its weighting coefficients
are predefined and fixed for different nodes and feature channels. Moreover, GDC cannot improve link prediction
performance. A more efficient but equivalent way of calculating the graph diffusion is to calculate multi-hop node
representations iteratively and combine them. Some methods perform this memory-efficient graph diffusion in each
layer without calculating the explicit diffusion matrix. In other words, instead of high-power transition matrices,
multi-hop representation matrices are calculated and stored. A unified framework for these methods has not been well studied or clearly separated from other graph diffusion-based methods [17, 12, 19]. We summarize them as Graph Diffusion Networks (GDNs). However, many existing GDNs still use fixed, predefined hop weighting coefficients that are identical across nodes, channels, and layers.
Other techniques [20, 21, 22] that do not modify the model architecture can also tackle the over-smoothing problem. Many GNNs, including our proposed models, are compatible with them. We do not introduce or compare them in detail.
In this paper, we refine and propose Graph Diffusion Networks (GDNs). Then we propose Adaptive Graph Diffusion Networks (AGDNs) with generalized graph diffusion associated with two learnable and scalable mechanisms of
weighting coefficients: Hop-wise Attention (HA) and Hop-wise Convolution (HC). We show a natural evolution path
of GNNs. From MPNNs to GDNs, the receptive field is enlarged without adding extra transformations or decoupling
the model architecture. The multi-layer graph diffusion in different feature spaces can contribute to model capacity.
From GDNs to AGDNs, a more generalized and flexible paradigm of graph diffusion can bring better performance. HA
induces hop-wise and node-wise weighting coefficients. HC can directly learn the hop-wise and channel-wise weighting
coefficients. Following the historical development of GNNs, from elegant but inefficient spectral methods to intuitive and effective spatial methods, we generalize graph diffusion to be more spatial, at the cost of spectral analyzability.
We conduct experiments on diverse Open Graph Benchmark (OGB) [23] datasets with semi-supervised node classification and link prediction tasks. The results show that our proposed methods significantly outperform popular
GNNs on both tasks. AGDNs achieve top-1 performance on the ogbn-arxiv (Accuracy of 76.37±0.11%), ogbn-proteins
(ROC-AUC of 88.65±0.13%) and ogbl-ddi (Hits@20 of 95.38±0.94%) datasets and top-3 performance on the ogbl-
citation2 (MRR of 85.49±0.29%) dataset. AGDNs also achieve the SOTA performance among models without using
labels as input on the ogbn-products dataset. AGDNs outperform the state-of-the-art (SOTA) RevGNNs with much
less complexity and runtime and achieve comparable results to SOTA SEAL with much less runtime on large graphs.
In our ablation study on all datasets, AGDNs can significantly outperform the associated GATs. Furthermore, the
experiments of MPNNs and AGDNs with different model depths demonstrate that AGDNs can effectively mitigate the
over-smoothing effect.
Our main contributions are: 1) we incorporate multi-layer generalized graph diffusion into GNNs with moderate complexity and runtime; 2) we propose two learnable and scalable mechanisms for adaptively capturing multi-hop information; and 3) we achieve new SOTA performance on diverse, challenging OGB datasets, outperforming complicated RevGNNs with much less complexity and runtime on large datasets.
2 Related works
By employing residual connections, residual GNNs simultaneously alleviate over-fitting and over-smoothing problems.
The main concerns of these methods are connection design and memory optimizations. Jumping-Knowledge Network
(JKNet) [24] introduces jumping knowledge connection and adaptively sums intermediate representations. GCN with
initial residual and identity mapping (GCNII) [15] combines two classes of residual connections. DeepGCN [13]
introduces dense and dilated connections from CNNs. DeeperGCN [14] unifies message aggregation operations with
differentiable generalized aggregation functions. It further proposes a novel normalization layer and pre-activation
residual GNNs. Reversible GNNs (RevGNNs) [16] reduce memory footprints by incorporating reversible connections
and grouped convolutions. With deep and over-parameterized GNN architecture, RevGNNs can achieve SOTA results
on several datasets. However, as the cost of reducing memory footprints, the runtime of deep RevGNNs is even longer. This paper presents deep GNNs with shallow feature transformations that outperform SOTA RevGNNs with much less complexity and runtime.
We describe the graph convolution as a matrix multiplication between the transition matrix and node fea-
ture/representation matrix. Then the graph diffusion [17, 19, 25, 26] replaces the transition matrix with a diffusion
matrix, which is a linear combination of powers of the transition matrix with weighting coefficients normalized along
hops. The weighting coefficients are essential to balance the importance of shallow and deep information. The Personal-
ized PageRank (PPR) [25, 17] and the Heat Kernel (HK) [26, 27] are two popular predefined weighting coefficients.
They both follow the prior that more distant neighboring nodes have less influence than nearby ones. The weighting coefficients can also be treated as trainable parameters, for example learned via label propagation [28, 29]. Attention walk
[30] jointly optimizes the node embeddings and weighting coefficients.
The Bag of Tricks on GNNs (BoT) [38] includes some critical tricks applied in many SOTA models on the OGB leaderboard. First, masked node labels, which are zero except for sampled training nodes, are used as model input ("label as input"). Second, BoT conducts additional iterative feed-forward passes in each epoch, with the predicted soft labels filling the masked zero labels ("label reuse"). Third, BoT proposes a more robust loss function ("Loge loss"). Finally, BoT proposes to adjust the GAT adjacency closer to the GCN adjacency ("norm. adj."). Self-Knowledge Distillation (self-KD) [39] is another common trick on the OGB leaderboard. Graph Information Aided Node feature exTraction with XR-Transformers (GIANT-XRT) [40] trains more informative node embeddings using raw text data.
There are two main classes of GNN methods for link prediction. The first follows an encoder-decoder framework [4, 23], in which a GNN acts as the encoder and outputs node representations, and a simple predictor receiving pairs of node representations makes the final predictions. Pairwise Learning on Neural Link Prediction (PLNLP) [51] proposes a pairwise AUC loss to improve ranking metrics. The second, SEAL [41, 42], applies a GNN to an enclosing subgraph sampled around each pair of nodes and directly outputs predictions. SEAL also includes an additional labeling trick to enhance structural information. SEAL achieves the SOTA performance of GNNs on many link prediction datasets. However, SEAL requires extensive runtime, which limits its application, especially on large graphs. This paper will show that, with the encoder-decoder framework, AGDNs can outperform other encoder-decoder GNNs and even approach or outperform SEAL.
3 Preliminaries
3.1 Notations
Let $G = (\mathcal{V}, \mathcal{E})$ be a given graph with node set $\mathcal{V}$ and edge set $\mathcal{E}$. We denote the number of nodes by $N = |\mathcal{V}|$, the number of edges by $E = |\mathcal{E}|$, and the adjacency matrix by $A \in \mathbb{R}^{N \times N}$. The normalized adjacency (transition) matrix is denoted by $\hat{A} \in \mathbb{R}^{N \times N}$. Considering the common stacking architecture of GNNs, we denote the initial node feature matrix by $X \in \mathbb{R}^{N \times d^{(0)}}$ and the initial edge feature matrix by $X^E$. For the $l$-th GNN layer, we denote its input node representation by $H^{(l-1)} \in \mathbb{R}^{N \times d^{(l-1)}}$ and its output node representation by $H^{(l)} \in \mathbb{R}^{N \times d^{(l)}}$. We also denote node $i$'s input representation vector by $h_i^{(l-1)}$ and its output representation vector by $h_i^{(l)}$. We denote the attention query vector of GAT by $a^{(l)} \in \mathbb{R}^{2d^{(l)}}$ and the hop-wise attention query vector of AGDN-HA by $a_{hw}^{(l)} \in \mathbb{R}^{2d^{(l)}}$.
3.2 Tasks
Let us review the main architecture of MPNNs. This paper focuses on the essential step in MPNNs: neighbor aggregation (message passing). We also consider the subsequent combination with the node's own features (usually in the form of a residual linear connection). We omit the optional readout operation, which is only necessary for graph-level prediction. The
neighbor aggregation with a residual linear connection can be described as a matrix multiplication between the weighted adjacency $\hat{A}$ (transition matrix) and the node feature/representation matrix $H$:
$$H^{(l)} = \hat{A}^{(l)} H^{(l-1)} W^{(l)} + H^{(l-1)} W^{(l),r}, \qquad (1)$$
where $W^{(l)} \in \mathbb{R}^{d^{(l-1)} \times d^{(l)}}$ refers to the linear transformation in the $l$-th layer and $W^{(l),r} \in \mathbb{R}^{d^{(l-1)} \times d^{(l)}}$ refers to its residual linear transformation. A complete MPNN model stacks several MPNN layers with intermediate activations.
The GAT layer replaces the fixed normalized adjacency with a row-stochastic adjacency derived from a learnable attention mechanism, which depends on the attributes of the source and destination nodes. We can view this adjacency as a learnable row-stochastic adjacency by reviewing the computation of the GAT adjacency:
$$e_{ij} = \mathrm{LeakyReLU}\left([h_i W \,\|\, h_j W] \cdot a\right) = \mathrm{LeakyReLU}\left([h_i W] \cdot a_{dst} + [h_j W] \cdot a_{src}\right), \qquad (2)$$
$$\hat{A}_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}, \qquad (3)$$
where $h_i$ is the input representation vector of node $i$ and $W$ is the transformation matrix. The query vector $a$ can be split into a source query vector $a_{src}$ and a destination query vector $a_{dst}$.
For graphs with edge attributes $x^E_{ij}$, BoT [38] proposes to incorporate the edge attributes into the attention coefficients:
$$e_{ij} = \mathrm{LeakyReLU}\left([h_i W \,\|\, h_j W \,\|\, x^E_{ij} W^E] \cdot a\right), \qquad (4)$$
where $W^E$ is the edge feature transformation matrix.
Then we can define the unnormalized GAT adjacency $\tilde{A}$ with $\tilde{A}_{ij} = \exp(e_{ij})$, together with its diagonal in-degree and out-degree matrices $D_{row}$ and $D_{col}$, whose diagonal entries are the row sums and column sums of $\tilde{A}$, respectively. The GAT adjacency can then be expressed as $\hat{A} = D_{row}^{-1} \tilde{A}$.
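A dense PyTorch sketch of Eqs. (2)-(3), with the query vector split into its source and destination halves; it assumes every node has at least one in-neighbor (e.g., via self-loops), and all identifiers are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def gat_row_stochastic_adjacency(H, W, a_src, a_dst, adj_mask):
    """Dense illustration of Eqs. (2)-(3); i indexes destinations, j sources.

    H: [N, d_in] node representations, W: [d_in, d] transformation matrix,
    a_src, a_dst: [d] halves of the query vector a,
    adj_mask: [N, N] bool with adj_mask[i, j] = True iff j is an in-neighbor of i.
    """
    HW = H @ W                                                               # [N, d]
    e = F.leaky_relu(HW @ a_dst.unsqueeze(1) + (HW @ a_src.unsqueeze(1)).T)  # e_ij, Eq. (2)
    e = e.masked_fill(~adj_mask, float("-inf"))                              # keep only existing edges
    return torch.softmax(e, dim=1)                                           # row-stochastic A_hat, Eq. (3)
```

Real implementations compute the logits edge-wise with a scatter softmax; the dense form above is only meant to make the normalization explicit.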
4 Proposed methods
In this section, we first introduce the frameworks of GDNs and AGDNs. Then, we propose a pseudo-symmetric variant of the GAT transition matrix. Finally, we propose two adaptive mechanisms for calculating the hop weighting matrices.
Several existing GNNs using multi-hop information in each layer can be considered GDNs. However, there is a lack of
a unified framework for GDNs. For a GDN model, we also stack multiple GDN layers to perform multi-layer graph
diffusion. We formulate a GDN layer with the diffusion depth K as below:
$$\tilde{H}^{(l,0)} = H^{(l-1)} W^{(l)}, \qquad (5)$$
$$\tilde{H}^{(l,k)} = \hat{A}^{(l)} \tilde{H}^{(l,k-1)}, \qquad (6)$$
$$H^{(l)} = \sum_{k=0}^{K} \theta^{(l,k)} \tilde{H}^{(l,k)} + H^{(l-1)} W^{(l),r}, \qquad (7)$$
where $\{\theta^{(l,k)}\}_{k \in \{0,1,\ldots,K\}}$ is the set of normalized weighting coefficients and $K$ is the diffusion depth. Note that we calculate the multi-hop representations iteratively, with right-to-left matrix multiplications, instead of computing explicit powers of the transition matrix ($\hat{A}^{(l)}, (\hat{A}^{(l)})^2, \ldots, (\hat{A}^{(l)})^K$).
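The right-to-left evaluation of Eqs. (5)-(7) can be sketched as follows, assuming a sparse transition matrix and fixed coefficients `theta`; the function and variable names are ours, not the authors'.

```python
import torch


def gdn_diffusion(A_hat, H_prev, W, W_r, theta):
    """Eqs. (5)-(7): combine 0..K hop representations without forming powers of A_hat.

    A_hat: sparse COO [N, N] transition matrix, H_prev: [N, d_in],
    W, W_r: [d_in, d_out], theta: [K + 1] normalized weighting coefficients.
    """
    H_tilde = H_prev @ W                           # Eq. (5): 0-hop representation
    out = theta[0] * H_tilde
    for k in range(1, theta.shape[0]):
        H_tilde = torch.sparse.mm(A_hat, H_tilde)  # Eq. (6): propagate one more hop
        out = out + theta[k] * H_tilde             # Eq. (7): weighted sum over hops
    return out + H_prev @ W_r                      # residual linear connection
```

Only the current hop representation and the running sum are kept in memory, which is exactly why no dense diffusion matrix ever has to be materialized.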
The weighting coefficients in GDC and GDNs are hop-wise. In this paper, we generalize the graph diffusion and make
weighting coefficients further node-wise or channel-wise. We suppose that different nodes or channels may require
different hop weighting coefficients. We define an AGDN layer as below:
$$\tilde{H}^{(l,0)} = H^{(l-1)} W^{(l)}, \qquad (8)$$
$$\tilde{H}^{(l,k)} = \hat{A}^{(l)} \tilde{H}^{(l,k-1)}, \qquad (9)$$
$$H^{(l)} = \sum_{k=0}^{K} \Theta^{(k)} \otimes \tilde{H}^{(l,k)} + H^{(l-1)} W^{(l),r}, \qquad (10)$$
Figure 1: AGDN layer Architecture: The operator ⊗ represents matrix multiplication, the bold operator ⊗ represents
element-wise multiplication, and the operator ⊕ represents summation. The left and right multiplication correspond to
the relative position of the multiplicand to the operator ⊗.
where $\otimes$ refers to element-wise multiplication and $\Theta^{(k)} \in \mathbb{R}^{N \times d^{(l)}}$ is a generalized weighting matrix. GDNs are special cases of AGDNs in which all elements of $\Theta^{(k)}$ are identical. This matrix-form description is given for global comparison with MPNNs and GDNs.
In detail, we also describe the AGDN layer from a node viewpoint, which matches the actual implementation. We perform sequential graph convolutions and sum the multi-hop representations with hop-wise (and node-wise or channel-wise) weighting coefficients. The generalized graph diffusion at the $l$-th layer for node $i$ is described as below:
$$\tilde{h}_i^{(l,0)} = h_i^{(l-1)} W^{(l)}, \qquad (11)$$
$$\tilde{h}_i^{(l,k)} = \sum_{j \in \mathcal{N}_i} \hat{A}_{ij} \tilde{h}_j^{(l,k-1)}, \qquad (12)$$
$$h_{ic}^{(l)} = \sum_{k=0}^{K} \theta_{ikc} \tilde{h}_{ic}^{(l,k)} + \sum_{c'=1}^{d^{(l-1)}} h_{ic'}^{(l-1)} W_{c'c}^{(l),r}, \qquad (13)$$
where $\tilde{h}_i^{(l,k)}$ is the $k$-hop intermediate representation vector of node $i$, and $\theta_{ikc}$ can be viewed as an entry extracted from a 3-dimensional (node-wise, hop-wise, and channel-wise) tensor $\Theta$. The previous weighting matrix $\Theta^{(k)}$ can be extracted from this tensor by selecting the $k$-th hop. In some cases, to explicitly enhance the position (hop) information, we can add learnable Positional Embedding (PE) row vectors $\{p^{(0)}, p^{(1)}, \ldots, p^{(K)}\}$ in $\mathbb{R}^{d^{(l)}}$ to the intermediate multi-hop representation vectors. We omit this trick since, empirically, PE only marginally improves model performance on certain datasets.
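A matrix-form sketch of Eqs. (8)-(13): the 0..K-hop representations are stacked into an [N, K+1, d] tensor and combined element-wise with a weighting tensor Θ supplied by one of the mechanisms below (or by fixed mean weights). This is our minimal reading of the layer, with illustrative class and argument names.

```python
import torch
import torch.nn as nn


class AGDNLayer(nn.Module):
    """Generalized graph diffusion of Eqs. (8)-(10) with a pluggable weighting tensor."""

    def __init__(self, d_in: int, d_out: int, K: int):
        super().__init__()
        self.K = K
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.W_r = nn.Linear(d_in, d_out, bias=False)

    def multi_hop(self, A_hat, H_prev):
        H_tilde = self.W(H_prev)                       # Eq. (8)
        hops = [H_tilde]
        for _ in range(self.K):
            H_tilde = torch.sparse.mm(A_hat, H_tilde)  # Eq. (9), A_hat sparse COO
            hops.append(H_tilde)
        return torch.stack(hops, dim=1)                # [N, K + 1, d_out]

    def forward(self, A_hat, H_prev, theta):
        # theta: broadcastable to [N, K + 1, d_out] (node-, hop-, and/or channel-wise weights)
        hops = self.multi_hop(A_hat, H_prev)
        H = (theta * hops).sum(dim=1)                  # Eq. (10): element-wise weighting, sum over hops
        return H + self.W_r(H_prev)                    # residual linear connection
```

For AGDN-mean, `theta` would simply be a constant tensor of value 1/(K+1); HA and HC produce node-wise and channel-wise weights respectively, as sketched below.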
There exists a trade-off between generalization and spectral analyzability. The eigenvectors of the generalized graph
diffusion matrix are generally different from the original transition matrix. From another perspective, AGDNs do
not follow the characteristics or limitations of previous diffusion-based methods. Instead, our generalized weighting
coefficients can be adaptive across hops and nodes or channels, reflected in the following proposed Hop-wise Attention
and Hop-wise Convolution. In addition, we can naturally assign layer-wise weighting coefficients.
We can see from Eq. (3) that GAT only accounts for the destination nodes' (attention-weighted) in-degrees. The symmetric normalized adjacency has proven more effective on specific datasets [2], so it is reasonable to define a variant of the GAT adjacency that leverages both the source nodes' out-degrees and the destination nodes' in-degrees. Thus, motivated by the form of the popular symmetric normalized adjacency, we propose a pseudo-symmetric normalized GAT adjacency:
$$\hat{A}_{sym} = D_{row}^{-\frac{1}{2}} \tilde{A} D_{col}^{-\frac{1}{2}}. \qquad (14)$$
Figure 2: Left: The hop-wise attention is parameterized by $a_{hw}$, with a LeakyReLU activation function $\sigma$, and normalized along hops with the softmax function; the associated weighting tensor $\Theta$ can be derived from the $N \times (K+1)$ matrix $\Theta^{HA}$ by repeating it $d$ times along the third dimension. Right: The weighting tensor of hop-wise convolution $\Theta$ is directly derived from the $(K+1) \times d$ kernel matrix $\Theta^{HC}$ by repeating it $N$ times along the first dimension.
In BoT, another version of the pseudo-symmetric normalized GAT adjacency is proposed (denoted "norm. adj.") [38]:
$$\hat{A}_{adj} = D^{\frac{1}{2}} D_{row}^{-1} \tilde{A} D^{-\frac{1}{2}}, \qquad (15)$$
where $D$ is the diagonal degree matrix of the graph and "adj" refers to "adjustment", since we can view this adjacency as the GAT adjacency adjusted toward the GCN adjacency.
Note that $\tilde{A}$, $\hat{A}_{sym}$, and $\hat{A}_{adj}$ are only pseudo-symmetric: they are guaranteed to be symmetric if and only if $e_{ij} = e_{ji}$ for all $i, j$, which holds when $a_{src} = a_{dst}$. As a special case, when we set the query vectors to zeros, both $\hat{A}_{sym}$ and $\hat{A}_{adj}$ reduce to the standard symmetric normalized adjacency. This characteristic connects GAT and GCN.
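For small graphs, the three normalizations of the unnormalized attention matrix Ã (Eqs. (3), (14), and (15)) can be sketched densely as below; treating D as the plain graph degree matrix in Eq. (15) follows our reading of the BoT adjustment, and all names are illustrative.

```python
import torch


def gat_adjacency_variants(A_unnorm, degrees, eps=1e-12):
    """A_unnorm: dense [N, N] with exp(e_ij) on edges and 0 elsewhere.
    degrees: [N] plain graph degrees used by the BoT adjustment."""
    d_row = A_unnorm.sum(dim=1).clamp_min(eps)      # attention-weighted in-degrees
    d_col = A_unnorm.sum(dim=0).clamp_min(eps)      # attention-weighted out-degrees
    A_hat = A_unnorm / d_row[:, None]               # Eq. (3): row-stochastic GAT adjacency
    A_sym = A_unnorm / (d_row.sqrt()[:, None] * d_col.sqrt()[None, :])  # Eq. (14)
    deg = degrees.clamp_min(eps).float()
    A_adj = (deg.sqrt()[:, None] * A_hat) / deg.sqrt()[None, :]         # Eq. (15), "norm. adj."
    return A_hat, A_sym, A_adj
```

Setting all attention logits to zero makes `A_unnorm` binary, in which case `A_sym` and `A_adj` both collapse to the standard symmetric normalized adjacency, as stated above.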
In this subsection, to simplify the discussion, we omit the superscript $(l)$ and collect all weighting coefficients $\{\theta_{ikc}\}$ into a unified weighting tensor $\Theta \in \mathbb{R}^{N \times (K+1) \times d}$. We can extract $\Theta^{(k)} = \Theta_{:,k,:}$, where $:$ in the subscript refers to taking all entries along that dimension. We denote the subscripts for nodes, hops, and channels by $i$, $k$, and $c$. We aim to design adaptive and efficient ways of calculating the weighting tensor, which should vary across nodes or feature channels. Directly defining a unified weighting tensor that is adaptive for both nodes and feature channels is hard and results in enormous additional complexity. We propose two efficient mechanisms in the following paragraphs: Hop-wise Attention (HA) and Hop-wise Convolution (HC). We denote the AGDN variants by AGDN-mean, AGDN-HA, and AGDN-HC, using naive fixed weights, hop-wise attention weights, and hop-wise convolution weights, respectively. As a particular case of AGDNs, AGDN-mean can be considered a representative example of GDNs.
Hop-wise Attention We suppose that, in many cases, the computation of graph diffusion should be adaptive for different nodes but identical for different feature channels, which manifests as a unified weighting tensor that is normalized along hops, $\sum_{k=0}^{K} \theta_{i,k,c} = 1, \forall i, c$, and identical along channels, $\theta_{i,k,c} = \theta_{i,k,1}, \forall c \in \{1, 2, \ldots, d\}$. We can then simplify this tensor into a 2-dimensional weighting matrix $\Theta^{HA} = [\theta_{ik}^{HA}] \in \mathbb{R}^{N \times (K+1)}$ by ignoring the last subscript $c$. $\Theta$ can be recovered by adding the third dimension and repeating $\Theta^{HA}$ $d$ times along it. It is still not efficient to define a naive learnable weighting matrix in $\mathbb{R}^{N \times (K+1)}$, which results in redundant complexity and violates the possible inductive setting. Inspired by the attention mechanism in GAT, we propose Hop-wise Attention (HA), using a learnable query vector $a_{hw} \in \mathbb{R}^{2d}$ to induce the expected weighting matrix; we only need to learn $2d$ parameters.
First, we calculate $\omega$:
$$\omega_{ik} = \left[\tilde{h}_i^{(l,0)} \,\|\, \tilde{h}_i^{(l,k)}\right] \cdot a_{hw}, \qquad (16)$$
where $\cdot$ represents the inner product, $\|$ represents concatenation, and $k$ indexes the $k$-hop representation.
Then, as shown in the left part of Figure 2, the hop-wise attention scores are calculated as below:
$$\theta_{ik}^{HA} = \frac{\exp\left(\sigma(\omega_{ik})\right)}{\sum_{k'=0}^{K} \exp\left(\sigma(\omega_{ik'})\right)}, \qquad (17)$$
where $\sigma$ is the LeakyReLU activation.
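On top of the stacked multi-hop tensor from the layer sketch above, hop-wise attention reduces to a per-node softmax over hops; a hedged sketch with our own identifiers:

```python
import torch
import torch.nn.functional as F


def hop_wise_attention(hops, a_hw):
    """Eqs. (16)-(17): hop- and node-wise weights from a single query vector.

    hops: [N, K + 1, d] multi-hop representations, a_hw: [2 * d] query vector.
    Returns theta of shape [N, K + 1, 1], identical along channels.
    """
    N, K1, d = hops.shape
    hop0 = hops[:, :1, :].expand(-1, K1, -1)           # 0-hop representation, repeated per hop
    omega = torch.cat([hop0, hops], dim=-1) @ a_hw     # Eq. (16): [N, K + 1]
    theta = torch.softmax(F.leaky_relu(omega), dim=1)  # Eq. (17): normalize along hops
    return theta.unsqueeze(-1)                         # broadcastable over the channel dimension
```

With the earlier layer sketch, one would call `theta = hop_wise_attention(layer.multi_hop(A_hat, H), a_hw)` before the weighted sum.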
Hop-wise Convolution We consider another simple strategy for integrating multi-hop representations, which performs Hop-wise Convolution (HC). This time, we suppose that, in certain cases, the graph diffusion should be adaptive for different channels. Thus, we directly define a learnable weighting tensor that is identical for all nodes, $\theta_{i,k,c} = \theta_{1,k,c}, \forall i \in \{1, 2, \ldots, N\}$. We then simplify this tensor into a 2-dimensional convolution kernel matrix $\Theta^{HC} = [\theta_{kc}^{HC}] \in \mathbb{R}^{(K+1) \times d}$ by ignoring the first subscript $i$. The complete weighting tensor $\Theta$ can be recovered by adding the first dimension and repeating $\Theta^{HC}$ $N$ times along it, as shown in Figure 2. We need to learn $(K+1) \times d$ parameters. Note that we do not require the tensor to be normalized along any dimension. For each feature channel $c \in \{1, 2, \ldots, d\}$, we conduct an individual hop-wise convolution with the associated kernel vector in $\mathbb{R}^{K+1}$. HC has the same form as DCNN [33]. However, HC is based on the more memory-efficient graph diffusion and uses different convolution kernels in different layers.
4.5 Complexity
In the complexity analysis, we omit the dimension change between a layer's input and output. The extra time complexity of an AGDN layer over its base MPNN layer comes from the $K$-hop aggregations, $O(KEd)$ (by default, we perform the feature transformation before aggregation), the element-wise multiplication with the weighting matrices, $O(KNd)$, and the hop-wise attention computation, $O(KNd)$, if used. The extra time complexity of an AGDN layer is thus $O(KEd + KNd)$. Under the realistic assumption that $E \gg N$, this extra time complexity becomes $O(KEd)$. The extra space complexity of an AGDN layer is $O(KNd)$.
5 Model analysis
We generally lose the elegant spectral analyzability since generalized weighting coefficients can easily change the
eigenvectors. However, this also implies that AGDNs are more flexible in the spectral domain.
For AGDN-HC, the weighting coefficients are identical for all nodes and do not change the eigenvectors. Thus, we can
give a preliminary spectral analysis. We will demonstrate that, even without changing eigenvectors, AGDN-HC is still
flexible in the spectral domain with a considerable diffusion depth. First, given a feature channel c, we simplify the
form of AGDN-HC with an eigendecomposition of the transition matrix $\hat{A} = U^{-1} \Lambda U$:
$$S_c = \sum_{k=0}^{K} \theta_{kc} \hat{A}^k = \sum_{k=0}^{K} \theta_{kc} \left(U^{-1} \Lambda U\right)^k = U^{-1} \left(\sum_{k=0}^{K} \theta_{kc} \Lambda^k\right) U, \qquad (18)$$
where the row vectors of $U$ are the eigenvectors of $\hat{A}$ and $\Lambda$ is a diagonal matrix whose entries $\{\lambda_1, \lambda_2, \ldots, \lambda_N\}$ are the eigenvalues of $\hat{A}$. The eigenvalues of the transition matrix are bounded by 1 ($\lambda_i \in [-1, 1], \forall i$) [43]. For the $i$-th eigenvalue $\lambda_i$ of the transition matrix, the associated eigenvalue $\lambda_i'$ of the diffusion matrix is:
$$\lambda_i' = \sum_{k=0}^{K} \theta_{kc} \lambda_i^k. \qquad (19)$$
This relation is a $K$-th order polynomial in $\lambda_i$. As the order increases, this polynomial becomes more flexible and can approximate a larger class of spectral filters.
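Eqs. (18)-(19) are easy to verify numerically for a symmetric transition matrix, where the eigendecomposition is orthogonal; below is a small self-contained check on a random symmetric normalized adjacency (our construction, one channel).

```python
import torch

torch.manual_seed(0)
N, K = 6, 4

# Build a small random symmetric normalized adjacency as the transition matrix.
A = (torch.rand(N, N) < 0.5).float()
A = ((A + A.T) > 0).float()
A.fill_diagonal_(0)
deg = A.sum(dim=1).clamp_min(1.0)
A_hat = A / (deg.sqrt()[:, None] * deg.sqrt()[None, :])

theta = torch.rand(K + 1)  # hop weights for one channel c

# Left-hand side of Eq. (18): explicit polynomial in A_hat.
S_c = sum(theta[k] * torch.linalg.matrix_power(A_hat, k) for k in range(K + 1))

# Right-hand side: the same polynomial applied to the eigenvalues, Eq. (19).
lam, U = torch.linalg.eigh(A_hat)          # A_hat = U diag(lam) U^T
lam_prime = sum(theta[k] * lam ** k for k in range(K + 1))
S_spec = U @ torch.diag(lam_prime) @ U.T

print(torch.allclose(S_c, S_spec, atol=1e-4))  # True
```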
6 Experiments
In this section, we conduct experiments on three OGB node classification datasets and three OGB link prediction
datasets. Our proposed AGDNs outperform common MPNNs and SOTA RevGNNs with less complexity and runtime
for the semi-supervised node classification datasets. AGDNs outperform other GNN models in link prediction tasks
using the same encoder-decoder framework. AGDNs approach SOTA SEAL with much less runtime. AGDNs achieve
new SOTA performance on the ogbn-arxiv, ogbn-proteins, and ogbl-ddi datasets. We train all AGDN models on a single V100 card with 16 GB of memory.
Datasets We utilize three OGB semi-supervised node classification datasets (ogbn-arxiv, ogbn-proteins and ogbn-
products) and three OGB link prediction datasets (ogbl-ppa, ogbl-ddi and ogbl-citation2). We summarize the detailed
statistics of these datasets in Table 2. ogbn-arxiv is a citation network between all Computer Science (CS) arXiv
papers, whose data split is based on the publication dates of the papers. ogbn-proteins is a graph between proteins
with multi-dimensional edge weights indicating different types of biologically meaningful associations. Its data split is
based on the associated species of the proteins. ogbn-products is a co-purchasing network between Amazon products,
whose data split is based on the sales ranking. ogbl-ppa is a graph between proteins from 58 species with similar edges
to ogbn-proteins, whose edges measured by high-throughput technology are used as training edges, and other edges
measured by low-throughput technology are used as validation and testing edges. ogbl-ddi is a drug-drug interaction
network with each edge indicating the joint effect of taking the two drugs together. Its data split is based on which proteins those drugs target in the body. ogbl-citation2 is a citation graph between a subset of papers from MAG.
Its data is split by selecting the most recent papers as source nodes and randomly selecting destination nodes for
training/validation/testing sets.
Global settings We conduct all experiments on a single Nvidia Tesla V100 with 16 GB of GPU memory. We evaluate our proposed models over 10 runs with fixed random seeds 0-9 and report means and standard deviations. Except on the ogbn-products and ogbl-citation2 datasets (evaluated on CPU), we conduct both training and inference of all AGDN models on the same GPU card. All final test scores are from the best model selected based on validation scores. In
the tables of this paper, we highlight the results of AGDN with underlined fonts and the best results with bold fonts.
We utilize AGDN-HC on the ogbn-proteins dataset and AGDN-HA for all other datasets. The unavailable results are
indicated by –.
Baselines Several representative GNNs and SOTA GNNs are selected as baselines. For semi-supervised node
classification, we utilize GCN [2], GraphSAGE [3], GAT [44], MixHop [36], JKNet [24], DeeperGCN [14], GCNII
[15], DAGNN [12], MAGNA [35], UniMP [45], GAT+BoT [38] and RevGNN [16].
Experimental setup For ogbn-arxiv, we utilize 3 AGDN layers with the transition matrix of GAT, hidden dimension 256, 3 attention heads, and residual linear connections. For AGDN with BoT, we utilize the pseudo-symmetric normalized transition matrix of GAT from BoT. For AGDN with BoT and the GIANT-XRT embedding, we utilize our proposed pseudo-symmetric transition matrix of GAT and 2 AGDN layers. For ogbn-proteins, we utilize 6 AGDN layers with the transition matrix of GAT $\hat{A}$, hidden dimension 150, 6 attention heads, and residual linear connections. For ogbn-products, we utilize 4 AGDN layers with the transition matrix of GAT, hidden dimension 120, 4 attention heads, and residual linear connections.
Table 3: The first part of experimental results on the ogbn-arxiv dataset. Except for AGDN, other results are from their
papers or the OGB leaderboard.
Models | Test Accuracy (%) | Valid Accuracy (%) | Params
GCN | 71.74±0.29 | 73.00±0.17 | 0.11M
GraphSAGE | 71.49±0.27 | 72.77±0.16 | 0.22M
DeeperGCN | 71.92±0.16 | 72.62±0.14 | 0.49M
JKNet | 72.19±0.21 | 73.35±0.07 | 0.09M
DAGNN | 72.09±0.25 | 72.90±0.11 | 0.04M
GCNII | 72.74±0.16 | – | 2.15M
MAGNA | 72.76±0.14 | – | –
Ours (AGDN) | 73.41±0.25 | 74.23±0.13 | 1.45M
Table 4: The second part of experimental results on the ogbn-arxiv dataset. Except for AGDN, other results are from their papers or the OGB leaderboard. ①=BoT, ②=self-KD, ③=GIANT-XRT embedding.
Models | Test Accuracy (%) | Valid Accuracy (%) | Params
UniMP | 73.11±0.20 | 74.50±0.15 | 0.18M
GAT+① | 73.91±0.12 | 75.16±0.08 | 1.44M
RevGAT+① | 74.02±0.18 | 75.01±0.10 | 2.10M
Ours (AGDN+①) | 74.11±0.12 | 75.25±0.05 | 1.51M
GAT+①+② | 74.16±0.08 | 75.14±0.04 | 1.44M
RevGAT+①+② | 74.26±0.17 | 74.97±0.08 | 2.10M
Ours (AGDN+①+②) | 74.31±0.12 | 75.22±0.09 | 1.51M
RevGAT+①+③ | 75.90±0.19 | 77.01±0.09 | 1.30M
Ours (AGDN+①+③) | 76.18±0.16 | 77.24±0.06 | 1.31M
RevGAT+①+②+③ | 76.15±0.10 | 77.16±0.09 | 1.30M
Ours (AGDN+①+②+③) | 76.37±0.11 | 77.19±0.08 | 1.31M
With the GIANT-XRT embedding, AGDN and RevGAT are both implemented with 2 layers and hidden dimension 256. With similar complexity, the margin between RevGAT and AGDN becomes larger.
Moreover, we evaluate AGDN on larger ogbn-proteins and ogbn-products datasets with the random graph partition
technique. For ogbn-proteins, we utilize HC instead of HA. AGDN can achieve a new SOTA result of 88.65%, which
even outperforms the much more complex and deeper RevGNN. We only utilize 6 AGDN layers with hidden dimension 150 and 8.61M parameters, whereas RevGNN-wide includes 448 layers with hidden dimension 224 and 68.47M parameters. Furthermore, the inference of AGDN is conducted on the same 16 GB GPU card used for training, whereas the inference of RevGNN is conducted on another GPU card with 48 GB of memory. For ogbn-products, we evaluate AGDN with random partition. AGDN significantly outperforms other baselines, including RevGNN, and achieves the SOTA performance among models without using labels as input.
6.1.3 Runtime
We report training and inference runtime on the ogbn-proteins dataset in Table 7 with the runtime of RevGNNs
reported in its paper. This comparison demonstrates that AGDN simultaneously outperforms RevGNN and costs much less runtime. Despite their extended runtime, RevGNN-Deep and RevGNN-Wide cost only 2.86 GB and 7.91 GB of memory for training, while AGDN costs 13.67 GB. However, the inference of AGDN is conducted on the same GPU, while the inference of RevGNNs is conducted on another Nvidia RTX A6000 (48 GB), with no reported inference runtime or memory cost.
Table 5: Experimental results on the ogbn-proteins dataset. DeeperGCN, UniMP, RevGNN, and AGDN are implemented
with random partition. GAT is implemented with neighbor sampling. AGDN+BoT is based on the implementation of GAT+BoT; however, labels are not used as model input since they empirically bring no improvement. Except for AGDN, other results are from their papers or the OGB leaderboard.
Models | Test ROC-AUC (%) | Valid ROC-AUC (%) | Params
GCN | 72.51±0.35 | 79.21±0.18 | 0.10M
GraphSAGE | 77.68±0.20 | 83.34±0.13 | 0.19M
DeeperGCN | 85.80±0.17 | 91.06±0.16 | 2.37M
UniMP | 86.42±0.08 | 91.75±0.06 | 1.91M
GAT+BoT | 87.65±0.08 | 92.80±0.08 | 2.48M
RevGNN-deep | 87.74±0.13 | 93.26±0.06 | 20.03M
RevGNN-wide | 88.24±0.15 | 94.50±0.08 | 68.47M
Ours (AGDN) | 88.65±0.13 | 94.18±0.05 | 8.61M
Table 6: Experimental results on the ogbn-products dataset. GAT, DeeperGCN, and AGDN are implemented with
random partition. GraphSAGE and UniMP are implemented with neighbor sampling. Except for AGDN, all results
are from their papers or the OGB leaderboard.
Models | Test Accuracy (%) | Valid Accuracy (%) | Params
GCN | 75.64±0.21 | 92.00±0.03 | 0.10M
GraphSAGE | 78.50±0.14 | 92.24±0.07 | 0.21M
GraphSAINT | 80.27±0.26 | – | 0.33M
DeeperGCN | 80.98±0.20 | 92.38±0.09 | 0.25M
SIGN | 80.52±0.16 | 92.99±0.04 | 3.48M
UniMP | 82.56±0.31 | 93.08±0.17 | 1.48M
RevGNN-112 | 83.07±0.30 | 92.90±0.07 | 2.95M
Ours (AGDN) | 83.34±0.27 | 92.29±0.10 | 1.54M
Table 7: Runtime comparison on the ogbn-proteins dataset with similar Tesla V100 cards.
Based on the naive implementation in the official OGB repository, we compare variants of AGDNs with different base models (GCN, GraphSAGE, GAT) under different diffusion depths $K$ ($K = 1, 2, \ldots, 8$) on the ogbn-arxiv dataset. We also evaluate MPNN baselines with equivalent receptive fields; for example, for $K = 4$ in each subgraph of Figure 3 and Figure 4, the associated model on the MPNN baseline curve has $3 \times 4 = 12$ layers. As shown in Figure 3, the three
MPNN baseline curves, especially for GAT, show a distinct over-smoothing problem. AGDN-mean shows curves that rise quickly but then drop significantly, indicating that it is also affected by over-smoothing. AGDN-HA and AGDN-HC show much more stable curves. AGDN-HA has similar optimal results to AGDN-mean; however, AGDN-HC cannot even outperform the shallow baseline models. Moreover, we repeat these experiments with residual linear connections added. As shown in Figure 4, all models, including MPNNs (with few layers) and AGDNs, are improved by residual linear connections. MPNNs still exhibit a distinct over-smoothing problem, whereas the over-smoothing of AGDN-mean is significantly alleviated and AGDN-HC is effectively improved. The three variants of AGDN then show similar performance; however, AGDN-HA outperforms AGDN-mean, especially at low $K$. This characteristic is vital when applying complex models to large graphs, because we tend to use a lower $K$ due to memory limits. This paper selects a low $K$ (2 or 3) in the other experiments.
Figure 3: Test accuracy of AGDN-Mean, AGDN-HA, AGDN-HC, and the MPNN baseline with different base models and diffusion depths K (1-8) on the ogbn-arxiv dataset.
Figure 4: Test accuracy of AGDN-Mean, AGDN-HA, AGDN-HC, and the MPNN baseline with residual linear connections, different base models, and different diffusion depths K (1-8) on the ogbn-arxiv dataset.
Baselines For link prediction tasks, we utilize DeepWalk [46], Matrix Factorization [47], Common Neighbor [48], Adamic Adar [49], Resource Allocation [50], GCN [2], GraphSAGE [3], SEAL [41], and PLNLP [51] as baselines. Due to memory limitations, we adopt graph sampling techniques, including random partition [14, 45] for ogbn-proteins and ogbn-products and GraphSAINT [52] for ogbl-citation2. Some baselines are not implemented for some datasets; thus we do not report the associated results.
Experimental Setup For ogbl-ppa, we utilize 2 AGDN layers with the transition matrix of GAT, hidden dimension 128, 1 attention head, and residual linear connections. For ogbl-ddi, we utilize 2 AGDN layers with the transition matrix of GAT, hidden dimension 512, 1 attention head, and residual linear connections. For ogbl-citation2, we utilize 3 AGDN layers with the transition matrix of GAT, hidden dimension 256, 1 attention head, and residual linear connections. In the official OGB baselines, a naive cross-entropy loss is used, treating link prediction as binary classification. PLNLP proposes a pairwise AUC loss instead; we adopt the AUC loss on the ogbl-ddi dataset only, since it does not improve AGDN on the other datasets. We adopt the GraphSAINT technique for AGDN
Table 8: Ablation study on the ogbn-arxiv, ogbn-proteins and ogbn-products datasets. Due to the space limit, we omit
the variances of these scores.
on the ogbl-citation2 dataset. We utilize learnable node embeddings as model input on the ogbl-ddi (dimension 512) and ogbl-ppa (dimension 128) datasets. Note that we only manually tune a few hyperparameters based on the default settings.
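The pairwise AUC loss borrowed from PLNLP can be sketched as a squared ranking surrogate over sampled positive/negative pairs; this is our paraphrase of the idea, not PLNLP's exact implementation.

```python
import torch


def pairwise_auc_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Squared surrogate of the AUC ranking objective.

    pos_scores, neg_scores: [B] predictor outputs for sampled positive and
    negative edges, paired one-to-one here for simplicity.
    """
    return torch.mean((1.0 - (pos_scores - neg_scores)) ** 2)
```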
Training and evaluation We follow the standard training procedure in official OGB baselines, which use an encoder-
decoder framework. To emphasize the effect of AGDN, we do not introduce other modifications except pairwise AUC
loss. We use the standard data splits and metrics from the official OGB paper for evaluation.
Table 9: Experimental results on the ogbl-ppa dataset (Hits@100).
Models | Test Hits@100 (%) | Valid Hits@100 (%) | Params
DeepWalk | 28.88±1.53 | – | 150.14M
Matrix Factorization | 32.29±0.94 | 32.28±4.28 | 147.66M
Common Neighbor | 27.65±0.00 | 28.23±0.00 | 0
Adamic Adar | 32.45±0.00 | 32.68±0.00 | 0
Resource Allocation | 49.33±0.00 | 47.22±0.00 | 0
GCN | 18.67±1.32 | 18.45±1.40 | 0.28M
GraphSAGE | 16.55±2.40 | 17.24±2.64 | 0.42M
SEAL | 48.80±3.16 | 51.25±2.52 | 0.71M
PLNLP | 32.38±2.58 | – | –
Ours (AGDN) | 41.23±1.59 | 43.32±0.92 | 36.90M
Table 10: Experimental results on the ogbl-ddi dataset (Hits@20).
Models | Test Hits@20 (%) | Valid Hits@20 (%) | Params
DeepWalk | 22.46±2.90 | – | 11.54M
Matrix Factorization | 13.68±4.75 | 33.70±2.64 | 1.22M
Common Neighbor | 17.73±0.00 | 9.47±0.00 | 0
Adamic Adar | 18.61±0.00 | 9.66±0.00 | 0
Resource Allocation | 6.23±0.00 | 7.25±0.00 | 0
GCN | 37.07±5.07 | 55.50±2.08 | 1.29M
GraphSAGE | 53.90±4.74 | 62.62±0.37 | 1.42M
SEAL | 30.56±3.86 | 28.49±2.69 | 0.53M
PLNLP | 90.88±3.13 | 82.42±2.53 | 3.50M
Ours (AGDN) | 95.38±0.94 | 89.43±2.81 | 3.51M
Table 11: Experimental results on the ogbl-citation2 dataset (MRR).
Models | Test MRR (%) | Valid MRR (%) | Params
Matrix Factorization | 51.86±4.43 | 51.81±4.36 | 281.11M
Common Neighbor | 51.47±0.00 | 51.19±0.00 | 0
Adamic Adar | 51.89±0.00 | 51.67±0.00 | 0
Resource Allocation | 51.98±0.00 | 51.77±0.00 | 0
GCN | 84.74±0.31 | 84.79±0.23 | 0.30M
GraphSAGE | 82.60±0.36 | 82.63±0.33 | 0.46M
SEAL | 87.67±0.32 | 87.57±0.31 | 0.26M
PLNLP | 84.92±0.29 | 84.90±0.31 | 146.51M
Ours (AGDN) | 85.49±0.29 | 85.56±0.33 | 0.31M
Table 12: Ablation study on the ogbl-ppa, ogbl-ddi, ogbl-citation2 datasets. Due to the space limit, we omit the
variances of these scores.
6.2.1 Results
As shown in Table 9, heuristic methods show significant advantages over GNN methods on the ogbl-ppa dataset. As a GNN architecture tailored for link prediction, based on complicated subgraph extraction and a labeling trick, SEAL achieves performance similar to the best heuristic method. AGDN, based on the naive official OGB baseline scripts, outperforms GCN, GraphSAGE, and several heuristic methods. AGDN utilizes learnable node embeddings as model input, which brings additional parameters.
As shown in Table 10, on the ogbl-ddi dataset, GNN methods perform better than heuristic methods. With the AUC loss, AGDN achieves 95.38% Hits@20, a new SOTA result on the ogbl-ddi leaderboard. This dataset is very dense, which makes structural patterns less meaningful; thus encoder-decoder GNNs with learnable node embeddings can perform much better than SEAL.
As shown in Table 11, on the ogbl-citation2 dataset, GNN methods also perform better than heuristic methods. GCN, GraphSAGE, and PLNLP are trained full-batch. Due to our GPU memory limitation (16 GB), we train AGDN with the GraphSAINT graph sampling technique. We can observe significant performance degradation by comparing full-batch GCN (84.74% test MRR) and GCN with GraphSAINT (79.85% test MRR) in the official OGB repository. However, even with GraphSAINT, AGDN still achieves top-3 performance on the whole ogbl-citation2 leaderboard and outperforms full-batch GCN, GraphSAGE, and PLNLP.
On the ogbl-ddi dataset, AGDN outperforms SEAL and other encoder-decoder GNNs by a significant margin. On the ogbl-ppa and ogbl-citation2 datasets, the gap between AGDN and SEAL is not enormous (< 8%) and is smaller than that of other encoder-decoder GNNs. We believe that, with more suitable techniques designed for link prediction, AGDN can contribute more to this task.
6.2.2 Runtime
We compare the training and inference runtime of AGDN and SEAL on the ogbl-ppa, ogbl-ddi, and ogbl-citation2 datasets in Table 13. With similar Tesla V100 GPU cards, AGDN takes significantly less training and inference runtime than SEAL on the ogbl-ppa and ogbl-citation2 datasets. The model architecture of AGDN is more complicated than that of SEAL; however, the simple encoder-decoder framework, though less expressive, is much more efficient than SEAL's time-consuming subgraph sampling and labeling trick. On the small ogbl-ddi dataset, where SEAL's additional techniques work much more efficiently, AGDN takes more training runtime but still much less inference runtime than SEAL.
Table 13: Runtime comparison on the ogbl-ppa, ogbl-ddi, and ogbl-citation2 datasets with similar Tesla V100 cards.
7 Conclusion
This paper proposes a feasible and effective evolution path for GNNs. First, we refine and propose Graph Diffusion Networks (GDNs) by replacing the graph convolution operator in each GNN layer with an efficient graph diffusion. Then, we generalize graph diffusion to propose Adaptive Graph Diffusion Networks (AGDNs), with two adaptive and scalable mechanisms for computing hop weighting coefficients/matrices. In the spectral domain, AGDNs are more adaptive and flexible than previous graph diffusion-based methods. We evaluate AGDNs and other popular GNNs on node classification and link prediction tasks. The experimental results show that AGDNs can significantly outperform many popular GNNs and even SOTA GNNs (RevGNNs and SEAL), while retaining considerable overall advantages in complexity and efficiency over these SOTA GNNs. Instead of copying huge models from other domains or using an over-simplified architecture, we enlarge the receptive field with moderate complexity and an essential architecture, which is valuable for limited computation hardware and time-critical tasks.
Limitations As a common issue, it is hard to apply node-wise or layer-wise neighbor sampling techniques to very deep GNNs, including AGDNs. We must employ additional memory-saving techniques from other models if we want to train a very deep or wide AGDN model with a considerable diffusion depth. Fortunately, AGDNs are compatible with most techniques applied to MPNNs. The effect of positional embeddings in AGDNs has also not been studied precisely. We leave potential memory-saving techniques for AGDNs to future research.
References
[1] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing
for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
[2] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv
preprint arXiv:1609.02907, 2016.
[3] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in
neural information processing systems, pages 1024–1034, 2017.
[4] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
[5] Jie Chen, Tengfei Ma, and Cao Xiao. Fastgcn: fast learning with graph convolutional networks via importance
sampling. arXiv preprint arXiv:1801.10247, 2018.
[6] Alex Fout, Jonathon Byrd, Basir Shariat, and Asa Ben-Hur. Protein interface prediction using graph convolutional
networks. In Advances in neural information processing systems, pages 6530–6539, 2017.
[7] Le Yu, Bowen Du, Xiao Hu, Leilei Sun, Liangzhe Han, and Weifeng Lv. Deep spatio-temporal graph convolutional
network for traffic accident prediction. Neurocomputing, 423:135–147, 2021.
[8] Wei Li, Xin Wang, Yiwen Zhang, and Qilin Wu. Traffic flow prediction over muti-sensor data correlation with
graph convolution network. Neurocomputing, 427:50–63, 2021.
[9] Xueyan Yin, Genze Wu, Jinze Wei, Yanming Shen, Heng Qi, and Baocai Yin. Multi-stage attention spatial-
temporal graph networks for traffic prediction. Neurocomputing, 428:42–53, 2021.
[10] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-
supervised learning. arXiv preprint arXiv:1801.07606, 2018.
[11] Guangtao Wang, Rex Ying, Jing Huang, and Jure Leskovec. Improving graph attention networks with large
margin-based constraints. arXiv preprint arXiv:1910.11945, 2019.
[12] Meng Liu, Hongyang Gao, and Shuiwang Ji. Towards deeper graph neural networks. In Proceedings of the 26th
ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 338–348, 2020.
[13] Guohao Li, Matthias Muller, Ali Thabet, and Bernard Ghanem. Deepgcns: Can gcns go as deep as cnns? In
Proceedings of the IEEE International Conference on Computer Vision, pages 9267–9276, 2019.
[14] Guohao Li, Chenxin Xiong, Ali Thabet, and Bernard Ghanem. Deepergcn: All you need to train deeper gcns.
arXiv preprint arXiv:2006.07739, 2020.
[15] Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional
networks. In International Conference on Machine Learning, pages 1725–1735. PMLR, 2020.
[16] Guohao Li, Matthias Müller, Bernard Ghanem, and Vladlen Koltun. Training graph neural networks with 1000
layers. In International conference on machine learning, pages 6437–6449. PMLR, 2021.
[17] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural
networks meet personalized pagerank. arXiv preprint arXiv:1810.05997, 2018.
[18] Emanuele Rossi, Fabrizio Frasca, Ben Chamberlain, Davide Eynard, Michael Bronstein, and Federico Monti.
Sign: Scalable inception graph neural networks. arXiv preprint arXiv:2004.11198, 2020.
[19] Johannes Klicpera, Stefan Weißenberger, and Stephan Günnemann. Diffusion improves graph learning. In
Advances in Neural Information Processing Systems, pages 13354–13366, 2019.
[20] Lingxiao Zhao and Leman Akoglu. Pairnorm: Tackling oversmoothing in gnns. arXiv preprint arXiv:1909.12223,
2019.
[21] Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. Dropedge: Towards deep graph convolutional
networks on node classification. arXiv preprint arXiv:1907.10903, 2019.
[22] Wenzheng Feng, Jie Zhang, Yuxiao Dong, Yu Han, Huanbo Luan, Qian Xu, Qiang Yang, Evgeny Kharlamov, and
Jie Tang. Graph random neural networks for semi-supervised learning on graphs. Advances in neural information
processing systems, 33:22092–22103, 2020.
[23] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure
Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687,
2020.
[24] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka.
Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536, 2018.
[25] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing
order to the web. Technical report, Stanford InfoLab, 1999.
[26] Risi Imre Kondor and John Lafferty. Diffusion kernels on graphs and other discrete structures. In Proceedings of
the 19th international conference on machine learning, volume 2002, pages 315–22, 2002.
[27] Bingbing Xu, Huawei Shen, Qi Cao, Keting Cen, and Xueqi Cheng. Graph convolutional networks using heat
kernel for semi-supervised learning. arXiv preprint arXiv:2007.16002, 2020.
[28] Dimitris Berberidis, Athanasios N Nikolakopoulos, and Georgios B Giannakis. Adaptive diffusions for scalable
learning over graphs. IEEE Transactions on Signal Processing, 67(5):1307–1321, 2018.
[29] Siheng Chen, Aliaksei Sandryhaila, José MF Moura, and Jelena Kovačević. Adaptive graph filtering: Multireso-
lution classification on graphs. In 2013 IEEE Global Conference on Signal and Information Processing, pages
427–430. IEEE, 2013.
[30] Sami Abu-El-Haija, Bryan Perozzi, Rami Al-Rfou, and Alexander A Alemi. Watch your step: Learning node
embeddings via graph attention. In Advances in Neural Information Processing Systems, pages 9180–9190, 2018.
[31] Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr, Christopher Fifty, Tao Yu, and Kilian Q Weinberger.
Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153, 2019.
[32] Hao Zhu and Piotr Koniusz. Simple spectral graph convolution. In International Conference on Learning
Representations, 2021.
[33] James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in neural information
processing systems, pages 1993–2001, 2016.
[34] Jian Du, Shanghang Zhang, Guanhang Wu, José MF Moura, and Soummya Kar. Topology adaptive graph
convolutional networks. arXiv preprint arXiv:1710.10370, 2017.
[35] Guangtao Wang, Rex Ying, Jing Huang, and Jure Leskovec. Direct multi-hop attention based graph neural network.
arXiv preprint arXiv:2009.14332, 2020.
[36] Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan,
Greg Ver Steeg, and Aram Galstyan. Mixhop: Higher-order graph convolutional architectures via sparsified
neighborhood mixing. arXiv preprint arXiv:1905.00067, 2019.
[37] Sami Abu-El-Haija, Amol Kapoor, Bryan Perozzi, and Joonseok Lee. N-gcn: Multi-scale graph convolution for
semi-supervised node classification. In uncertainty in artificial intelligence, pages 841–851. PMLR, 2020.
[38] Yangkun Wang. Bag of tricks of semi-supervised classification with graph neural networks. arXiv preprint
arXiv:2103.13355, 2021.
[39] Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. Be your own teacher:
Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 3713–3722, 2019.
[40] Eli Chien, Wei-Cheng Chang, Cho-Jui Hsieh, Hsiang-Fu Yu, Jiong Zhang, Olgica Milenkovic, and Inder-
jit S Dhillon. Node feature extraction by self-supervised multi-scale neighborhood prediction. arXiv preprint
arXiv:2111.00064, 2021.
[41] Muhan Zhang, Pan Li, Yinglong Xia, Kai Wang, and Long Jin. Labeling trick: A theory of using graph
neural networks for multi-node representation learning. Advances in Neural Information Processing Systems,
34:9061–9073, 2021.
[42] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. Advances in neural information
processing systems, 31, 2018.
[43] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in
neural information processing systems, 14, 2001.
[44] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph
attention networks. arXiv preprint arXiv:1710.10903, 2017.
[45] Yunsheng Shi, Zhengjie Huang, Shikun Feng, and Yu Sun. Masked label prediction: Unified massage passing
model for semi-supervised classification. arXiv preprint arXiv:2009.03509, 2020.
[46] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In
Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages
701–710, 2014.
[47] Aditya Krishna Menon and Charles Elkan. Link prediction via matrix factorization. In Joint european conference
on machine learning and knowledge discovery in databases, pages 437–452. Springer, 2011.
[48] David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. Journal of the American
society for information science and technology, 58(7):1019–1031, 2007.
[49] Lada A Adamic and Eytan Adar. Friends and neighbors on the web. Social networks, 25(3):211–230, 2003.
[50] Tao Zhou, Linyuan Lü, and Yi-Cheng Zhang. Predicting missing links via local information. The European
Physical Journal B, 71(4):623–630, 2009.
[51] Zhitao Wang, Yong Zhou, Litao Hong, Yuanhang Zou, and Hanjing Su. Pairwise learning for neural link prediction.
arXiv preprint arXiv:2112.02936, 2021.
[52] Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. Graphsaint: Graph
sampling based inductive learning method. arXiv preprint arXiv:1907.04931, 2019.