Adaptive Graph Diffusion Networks
September 5, 2022
ABSTRACT
Graph Neural Networks (GNNs) have received much attention in the graph deep learning domain.
However, recent research shows, both empirically and theoretically, that deep GNNs suffer from over-fitting and over-smoothing problems. The usual solutions either fail to reduce the extensive runtime of deep GNNs or restrict graph convolution to a single feature space. We propose Adaptive Graph Diffusion Networks (AGDNs), which perform multi-layer generalized graph diffusion in
different feature spaces with moderate complexity and runtime. Standard graph diffusion methods
combine large and dense powers of the transition matrix with predefined weighting coefficients.
Instead, AGDNs combine smaller multi-hop node representations with learnable and generalized
weighting coefficients. We propose two scalable mechanisms for computing these weighting coefficients to capture multi-hop information: Hop-wise Attention (HA) and Hop-wise Convolution (HC). We evaluate
AGDNs on diverse, challenging Open Graph Benchmark (OGB) datasets with semi-supervised node
classification and link prediction tasks. As of the submission date (Aug 26, 2022), AGDNs achieve top-1 performance on the ogbn-arxiv, ogbn-proteins, and ogbl-ddi datasets and top-3 performance on the ogbl-citation2 dataset. On similar Tesla V100 GPU cards, AGDNs outperform Reversible GNNs (RevGNNs) with 13% of the complexity and 1% of the training runtime of RevGNNs on the ogbn-proteins dataset. AGDNs also achieve performance comparable to SEAL with 36% of the training runtime and 0.2% of the inference runtime of SEAL on the ogbl-citation2 dataset.
1 Introduction
Graph Neural Networks (GNNs), or Message Passing Neural Networks (MPNNs) [1], have recently proved effective and become the mainstream of graph deep learning in many domains, such as citation networks [2, 3, 4], social networks [3, 5],
biological graphs [6], and traffic networks [7, 8, 9]. Based on simple feed-forward networks, they perform message
passing in each layer to capture neighborhood information. However, unlike deep models in the Computer Vision (CV) domain, deep GNNs [10, 11] encounter the over-smoothing problem, resulting in significant performance degradation: the model cannot distinguish node representations, which become nearly identical after long-range message passing. Deep GNNs may also contain many feature transformations, since a GNN layer usually couples a graph
convolution operator and a feature transformation. Recent research [12] reveals that the well-known overfitting problem
caused by redundant transformations can contribute significantly to the performance degradation of deep GNNs. In
addition, many transformations also bring high memory footprints and extended runtime.
By emphasizing the shallow information during long-range message passing, several types of residual connections
[10, 13, 14, 15, 16] can tackle the over-smoothing and over-fitting problems. However, they do not reduce the high memory footprints or extensive runtime of deep GNNs. Some techniques [16] from the CV domain can effectively reduce the memory footprints of deep GNNs, but at the cost of longer runtime.
Residual connections work among pairs of graph convolution and feature transformation. In contrast, the graph
diffusion-based methods directly combine sequential graph convolution operators. The shallow information can be
preserved with suitable weighting coefficients to alleviate the over-smoothing problem. A graph convolution operator
can be described as a matrix multiplication between the weighted adjacency (transition matrix) and node feature matrix.
Then a graph diffusion operator replaces the transition matrix with a linear combination of its different powers. On the
other hand, since many feature transformations bring both efficiency and performance problems, limiting the number of
transformations is also helpful. Thus, employing the graph diffusion with limited feature transformations is reasonable.
This strategy exists in some decoupled methods [12, 17, 18] that decouple graph convolution operators and feature
transformations. In detail, all graph convolution operators are placed before or after all feature transformations. Thus,
the decoupled GNNs can enlarge the receptive field with shallow feature transformations. The graph diffusion or
its variants are incorporated to leverage the multi-hop information. However, there are no feature transformations
(non-linear hierarchies) among all graph convolution operators. Graph convolution is restricted in the same feature
space. This characteristic may limit their model capacity. For example, decoupled GNNs can achieve considerable
performance on small citation datasets, but residual GNNs outperform them on the larger ogbn-arxiv dataset.
Directly replacing the graph convolution operator in each GNN layer with the graph diffusion operator can also enlarge
the receptive field with shallow feature transformations. Graph Diffusion Convolution (GDC) directly replaces the
transition matrix with an explicit diffusion matrix. Efficient approximation and sparsification techniques can reduce the
memory cost of the dense and large diffusion matrix. However, GDC lacks flexibility since its weighting coefficients
are predefined and fixed for different nodes and feature channels. Moreover, GDC cannot improve link prediction
performance. A more efficient but equivalent way of calculating the graph diffusion is to calculate multi-hop node
representations iteratively and combine them. Some methods perform this memory-efficient graph diffusion in each
layer without calculating the explicit diffusion matrix. In other words, instead of high-power transition matrices,
multi-hop representation matrices are calculated and stored. A unified framework for these methods has not been well studied or clearly separated from other graph diffusion-based methods [17, 12, 19]. We summarize them as Graph Diffusion Networks (GDNs). However, many existing GDNs still use fixed, predefined hop weighting coefficients that are identical across nodes, channels, and layers.
Other techniques [20, 21, 22] that do not modify the model architecture can also tackle the over-smoothing problem. Many GNNs, including our proposed models, are compatible with them. We do not introduce or compare them in detail.
In this paper, we refine and propose Graph Diffusion Networks (GDNs). Then we propose Adaptive Graph Diffusion Networks (AGDNs) with generalized graph diffusion associated with two learnable and scalable mechanisms of
weighting coefficients: Hop-wise Attention (HA) and Hop-wise Convolution (HC). We show a natural evolution path
of GNNs. From MPNNs to GDNs, the receptive field is enlarged without adding extra transformations or decoupling
the model architecture. The multi-layer graph diffusion in different feature spaces can contribute to model capacity.
From GDNs to AGDNs, a more generalized and flexible paradigm of graph diffusion can bring better performance. HA
induces hop-wise and node-wise weighting coefficients. HC can directly learn the hop-wise and channel-wise weighting
coefficients. Following the historical development of GNNs, from elegant but inefficient spectral methods to intuitive and effective spatial methods, we generalize graph diffusion to be more spatial, at the cost of spectral analyzability.
We conduct experiments on diverse Open Graph Benchmark (OGB) [23] datasets with semi-supervised node classification and link prediction tasks. The results show that our proposed methods significantly outperform popular
GNNs on both tasks. AGDNs achieve top-1 performance on the ogbn-arxiv (Accuracy of 76.37±0.11%), ogbn-proteins
(ROC-AUC of 88.65±0.13%) and ogbl-ddi (Hits@20 of 95.38±0.94%) datasets and top-3 performance on the ogbl-
citation2 (MRR of 85.49±0.29%) dataset. AGDNs also achieve the SOTA performance among models without using
labels as input on the ogbn-products dataset. AGDNs outperform the state-of-the-art (SOTA) RevGNNs with much
less complexity and runtime and achieve comparable results to SOTA SEAL with much less runtime on large graphs.
In our ablation study on all datasets, AGDNs can significantly outperform the associated GATs. Furthermore, the
experiments of MPNNs and AGDNs with different model depths demonstrate that AGDNs can effectively mitigate the
over-smoothing effect.
Our main contributions are: 1) we incorporate multi-layer generalized graph diffusion into GNNs with moderate complexity and runtime; 2) we propose two learnable and scalable mechanisms for adaptively capturing multi-hop information; and 3) we achieve new SOTA performance on diverse, challenging OGB datasets, outperforming complicated RevGNNs with much less complexity and runtime on large datasets.
2 Related works
By employing residual connections, residual GNNs simultaneously alleviate over-fitting and over-smoothing problems.
The main concerns of these methods are connection design and memory optimizations. Jumping-Knowledge Network
(JKNet) [24] introduces jumping knowledge connection and adaptively sums intermediate representations. GCN with
initial residual and identity mapping (GCNII) [15] combines two classes of residual connections. DeepGCN [13]
introduces dense and dilated connections from CNNs. DeeperGCN [14] unifies message aggregation operations with
differentiable generalized aggregation functions. It further proposes a novel normalization layer and pre-activation
residual GNNs. Reversible GNNs (RevGNNs) [16] reduce memory footprints by incorporating reversible connections
and grouped convolutions. With deep and over-parameterized GNN architecture, RevGNNs can achieve SOTA results
on several datasets. However, as the cost of reducing memory footprints, the runtime of deep RevGNNs is even longer. This paper presents deep GNNs with shallow feature transformations that outperform SOTA RevGNNs with much less complexity and runtime.
We describe the graph convolution as a matrix multiplication between the transition matrix and node fea-
ture/representation matrix. Then the graph diffusion [17, 19, 25, 26] replaces the transition matrix with a diffusion
matrix, which is a linear combination of powers of the transition matrix with weighting coefficients normalized along
hops. The weighting coefficients are essential to balance the importance of shallow and deep information. The Personal-
ized PageRank (PPR) [25, 17] and the Heat Kernel (HK) [26, 27] are two popular predefined weighting coefficients.
They both follow the prior that more distant neighboring nodes have less influence than nearby ones. The weighting coefficients can also be treated as trainable parameters, for example learned via label propagation [28, 29]. Attention walk
[30] jointly optimizes the node embeddings and weighting coefficients.
The Bag of Tricks on GNNs (BoT) [38] includes some critical tricks applied in many SOTA models on the OGB leaderboard. First, masked node labels, which are zero except for sampled training nodes, are used as model input ("label as input"). Second, BoT conducts additional iterative feed-forward passes in each epoch, with the predicted soft labels filling the masked zero labels ("label reuse"). Third, BoT proposes a more robust loss function ("Loge loss"). Finally, BoT proposes to adjust the GAT adjacency closer to the GCN adjacency ("norm. adj."). Self-Knowledge Distillation (self-KD) [39] is another common trick on the OGB leaderboard. Graph Information Aided Node feature exTraction with XR-Transformers (GIANT-XRT) [40] trains more informative node embeddings using raw text data.
There are two main classes of GNN methods for link prediction. The first follows an encoder-decoder framework [4, 23], in which a GNN acts as the encoder and outputs node representations, and a simple predictor receiving pairs of node representations makes the final predictions. Pairwise Learning on Neural Link Prediction (PLNLP) [51] proposes a pairwise AUC loss to improve ranking metrics. The second, SEAL [41, 42], applies a GNN to an enclosing subgraph sampled around each pair of nodes and directly outputs predictions. SEAL also includes an additional labeling trick to enhance structural information. SEAL achieves the SOTA performance of GNNs on many link prediction datasets. However, SEAL requires extensive runtime, which limits its application, especially on large graphs. This paper will show that, with the encoder-decoder framework, AGDNs can outperform other encoder-decoder GNNs and even approach or outperform SEAL.
3 Preliminaries
3.1 Notations
Let $G = (\mathcal{V}, \mathcal{E})$ be a given graph with node set $\mathcal{V}$ and edge set $\mathcal{E}$. We denote the number of nodes by $N = |\mathcal{V}|$, the number of edges by $E = |\mathcal{E}|$, and the adjacency matrix by $A \in \mathbb{R}^{N \times N}$. The normalized adjacency (transition) matrix is denoted by $\hat{A} \in \mathbb{R}^{N \times N}$. Considering the common stacking architecture of GNNs, we denote the initial node feature matrix by $X \in \mathbb{R}^{N \times d^{(0)}}$ and the initial edge feature matrix by $X^E$. For the $l$-th GNN layer, we denote its input node representation by $H^{(l-1)} \in \mathbb{R}^{N \times d^{(l-1)}}$ and its output node representation by $H^{(l)} \in \mathbb{R}^{N \times d^{(l)}}$. We also denote node $i$'s input representation vector by $h_i^{(l-1)}$ and its output representation vector by $h_i^{(l)}$. We denote the attention query vector of GAT by $a^{(l)} \in \mathbb{R}^{2d^{(l)}}$ and the hop-wise attention query vector of AGDN-HA by $a_{hw}^{(l)} \in \mathbb{R}^{2d^{(l)}}$.
3.2 Tasks
Let us review the main architecture of MPNNs. This paper focuses on the essential step in MPNNs: neighbor aggregation (message passing). We also consider the subsequent combination with the node's own features (usually in the form of a residual linear connection). We omit the optional readout operation, which is only necessary for graph-level prediction. The
neighbor aggregation with a residual linear connection can be described as a matrix multiplication between the weighted adjacency $\hat{A}$ (transition matrix) and the node feature/representation matrix $H$:
$$H^{(l)} = \hat{A}^{(l)} H^{(l-1)} W^{(l)} + H^{(l-1)} W^{(l),r}, \qquad (1)$$
where $W^{(l)} \in \mathbb{R}^{d^{(l-1)} \times d^{(l)}}$ refers to the linear transformation in the $l$-th layer and $W^{(l),r} \in \mathbb{R}^{d^{(l-1)} \times d^{(l)}}$ refers to its residual linear transformation. A complete MPNN model stacks several MPNN layers with intermediate activations.
The GAT layer replaces the fixed normalized adjacency with a row-stochastic adjacency derived from a learnable attention mechanism, which depends on the attributes of the source and destination nodes. We can view this adjacency as a learnable row-stochastic adjacency by reviewing the computation of the GAT adjacency:
$$e_{ij} = \mathrm{LeakyReLU}\left([h_i W \,\|\, h_j W] \cdot a\right) = \mathrm{LeakyReLU}\left([h_i W] \cdot a_{dst} + [h_j W] \cdot a_{src}\right), \qquad (2)$$
$$\hat{A}_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}, \qquad (3)$$
where $h_i$ is the input representation vector of node $i$ and $W$ is the transformation matrix. The query vector $a$ can be split into a source query vector $a_{src}$ and a destination query vector $a_{dst}$.
For graphs with edge attributes $x^E_{ij}$, BoT [38] proposes to incorporate the edge attributes into the attention coefficients:
$$e_{ij} = \mathrm{LeakyReLU}\left([h_i W \,\|\, h_j W \,\|\, x^E_{ij} W^E] \cdot a\right), \qquad (4)$$
where $W^E$ is the edge feature transformation matrix.
Then we can define the unnormalized GAT adjacency $\tilde{A}$ with $\tilde{A}_{ij} = \exp(e_{ij})$, together with its diagonal in-degree and out-degree matrices $D_{row}$ and $D_{col}$, whose diagonal entries are the row sums and column sums of $\tilde{A}$, respectively. The GAT adjacency can then be expressed as $\hat{A} = D_{row}^{-1} \tilde{A}$.
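A dense PyTorch sketch of Eqs. (2)-(3), with the query vector split into its source and destination halves; it assumes every node has at least one in-neighbor (e.g., via self-loops), and all identifiers are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def gat_row_stochastic_adjacency(H, W, a_src, a_dst, adj_mask):
    """Dense illustration of Eqs. (2)-(3); i indexes destinations, j sources.

    H: [N, d_in] node representations, W: [d_in, d] transformation matrix,
    a_src, a_dst: [d] halves of the query vector a,
    adj_mask: [N, N] bool with adj_mask[i, j] = True iff j is an in-neighbor of i.
    """
    HW = H @ W                                                               # [N, d]
    e = F.leaky_relu(HW @ a_dst.unsqueeze(1) + (HW @ a_src.unsqueeze(1)).T)  # e_ij, Eq. (2)
    e = e.masked_fill(~adj_mask, float("-inf"))                              # keep only existing edges
    return torch.softmax(e, dim=1)                                           # row-stochastic A_hat, Eq. (3)
```

Real implementations compute the logits edge-wise with a scatter softmax; the dense form above is only meant to make the normalization explicit.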
4 Proposed methods
In this section, we first introduce the frameworks of GDNs and AGDNs. Then, we propose a pseudo-symmetric variant of the GAT transition matrix. Finally, we propose two adaptive mechanisms for calculating the hop weighting matrices.
Several existing GNNs using multi-hop information in each layer can be considered GDNs. However, there is a lack of
a unified framework for GDNs. For a GDN model, we also stack multiple GDN layers to perform multi-layer graph
diffusion. We formulate a GDN layer with the diffusion depth K as below:
$$\tilde{H}^{(l,0)} = H^{(l-1)} W^{(l)}, \qquad (5)$$
$$\tilde{H}^{(l,k)} = \hat{A}^{(l)} \tilde{H}^{(l,k-1)}, \qquad (6)$$
$$H^{(l)} = \sum_{k=0}^{K} \theta^{(l,k)} \tilde{H}^{(l,k)} + H^{(l-1)} W^{(l),r}, \qquad (7)$$
where $\{\theta^{(l,k)}\}_{k \in \{0,1,\ldots,K\}}$ is the set of normalized weighting coefficients and $K$ is the diffusion depth. Note that we calculate the multi-hop representations iteratively, with right-to-left matrix multiplications, instead of computing explicit powers of the transition matrix ($\hat{A}^{(l)}, (\hat{A}^{(l)})^2, \ldots, (\hat{A}^{(l)})^K$).
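The right-to-left evaluation of Eqs. (5)-(7) can be sketched as follows, assuming a sparse transition matrix and fixed coefficients `theta`; the function and variable names are ours, not the authors'.

```python
import torch


def gdn_diffusion(A_hat, H_prev, W, W_r, theta):
    """Eqs. (5)-(7): combine 0..K hop representations without forming powers of A_hat.

    A_hat: sparse COO [N, N] transition matrix, H_prev: [N, d_in],
    W, W_r: [d_in, d_out], theta: [K + 1] normalized weighting coefficients.
    """
    H_tilde = H_prev @ W                           # Eq. (5): 0-hop representation
    out = theta[0] * H_tilde
    for k in range(1, theta.shape[0]):
        H_tilde = torch.sparse.mm(A_hat, H_tilde)  # Eq. (6): propagate one more hop
        out = out + theta[k] * H_tilde             # Eq. (7): weighted sum over hops
    return out + H_prev @ W_r                      # residual linear connection
```

Only the current hop representation and the running sum are kept in memory, which is exactly why no dense diffusion matrix ever has to be materialized.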
The weighting coefficients in GDC and GDNs are hop-wise. In this paper, we generalize the graph diffusion and make
weighting coefficients further node-wise or channel-wise. We suppose that different nodes or channels may require
different hop weighting coefficients. We define an AGDN layer as below:
$$\tilde{H}^{(l,0)} = H^{(l-1)} W^{(l)}, \qquad (8)$$
$$\tilde{H}^{(l,k)} = \hat{A}^{(l)} \tilde{H}^{(l,k-1)}, \qquad (9)$$
$$H^{(l)} = \sum_{k=0}^{K} \Theta^{(k)} \otimes \tilde{H}^{(l,k)} + H^{(l-1)} W^{(l),r}, \qquad (10)$$
Figure 1: AGDN layer Architecture: The operator ⊗ represents matrix multiplication, the bold operator ⊗ represents
element-wise multiplication, and the operator ⊕ represents summation. The left and right multiplication correspond to
the relative position of the multiplicand to the operator ⊗.
where $\otimes$ refers to element-wise multiplication and $\Theta^{(k)} \in \mathbb{R}^{N \times d^{(l)}}$ is a generalized weighting matrix. GDNs are special cases of AGDNs in which all elements of $\Theta^{(k)}$ are identical. This matrix-form description is given for global comparison with MPNNs and GDNs.
In detail, we also describe the AGDN layer from a node viewpoint, which matches the actual implementation. We perform sequential graph convolutions and sum the multi-hop representations with hop-wise (and node-wise or channel-wise) weighting coefficients. The generalized graph diffusion at the $l$-th layer for node $i$ is described as below:
$$\tilde{h}_i^{(l,0)} = h_i^{(l-1)} W^{(l)}, \qquad (11)$$
$$\tilde{h}_i^{(l,k)} = \sum_{j \in \mathcal{N}_i} \hat{A}_{ij} \tilde{h}_j^{(l,k-1)}, \qquad (12)$$
$$h_{ic}^{(l)} = \sum_{k=0}^{K} \theta_{ikc} \tilde{h}_{ic}^{(l,k)} + \sum_{c'=1}^{d^{(l-1)}} h_{ic'}^{(l-1)} W_{c'c}^{(l),r}, \qquad (13)$$
where $\tilde{h}_i^{(l,k)}$ is the $k$-hop intermediate representation vector of node $i$, and $\theta_{ikc}$ can be viewed as an entry extracted from a 3-dimensional (node-wise, hop-wise, and channel-wise) tensor $\Theta$. The previous weighting matrix $\Theta^{(k)}$ can be extracted from this tensor by selecting the $k$-th hop. In some cases, to explicitly enhance the position (hop) information, we can add learnable Positional Embedding (PE) row vectors $\{p^{(0)}, p^{(1)}, \ldots, p^{(K)}\}$ in $\mathbb{R}^{d^{(l)}}$ to the intermediate multi-hop representation vectors. We omit this trick since, empirically, PE only marginally improves model performance on certain datasets.
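A matrix-form sketch of Eqs. (8)-(13): the 0..K-hop representations are stacked into an [N, K+1, d] tensor and combined element-wise with a weighting tensor Θ supplied by one of the mechanisms below (or by fixed mean weights). This is our minimal reading of the layer, with illustrative class and argument names.

```python
import torch
import torch.nn as nn


class AGDNLayer(nn.Module):
    """Generalized graph diffusion of Eqs. (8)-(10) with a pluggable weighting tensor."""

    def __init__(self, d_in: int, d_out: int, K: int):
        super().__init__()
        self.K = K
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.W_r = nn.Linear(d_in, d_out, bias=False)

    def multi_hop(self, A_hat, H_prev):
        H_tilde = self.W(H_prev)                       # Eq. (8)
        hops = [H_tilde]
        for _ in range(self.K):
            H_tilde = torch.sparse.mm(A_hat, H_tilde)  # Eq. (9), A_hat sparse COO
            hops.append(H_tilde)
        return torch.stack(hops, dim=1)                # [N, K + 1, d_out]

    def forward(self, A_hat, H_prev, theta):
        # theta: broadcastable to [N, K + 1, d_out] (node-, hop-, and/or channel-wise weights)
        hops = self.multi_hop(A_hat, H_prev)
        H = (theta * hops).sum(dim=1)                  # Eq. (10): element-wise weighting, sum over hops
        return H + self.W_r(H_prev)                    # residual linear connection
```

For AGDN-mean, `theta` would simply be a constant tensor of value 1/(K+1); HA and HC produce node-wise and channel-wise weights respectively, as sketched below.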
There exists a trade-off between generalization and spectral analyzability. The eigenvectors of the generalized graph
diffusion matrix are generally different from the original transition matrix. From another perspective, AGDNs do
not follow the characteristics or limitations of previous diffusion-based methods. Instead, our generalized weighting
coefficients can be adaptive across hops and nodes or channels, reflected in the following proposed Hop-wise Attention
and Hop-wise Convolution. In addition, we can naturally assign layer-wise weighting coefficients.
We can see from Eq. (3) that GAT only accounts for the destination nodes' (attention-weighted) in-degrees. The symmetric normalized adjacency has proven more effective on specific datasets [2], so it is reasonable to define a variant of the GAT adjacency that leverages both the source nodes' out-degrees and the destination nodes' in-degrees. Thus, motivated by the form of the popular symmetric normalized adjacency, we propose a pseudo-symmetric normalized GAT adjacency:
$$\hat{A}_{sym} = D_{row}^{-\frac{1}{2}} \tilde{A} D_{col}^{-\frac{1}{2}}. \qquad (14)$$
Figure 2: Left: The hop-wise attention is parameterized by $a_{hw}$, with a LeakyReLU activation function $\sigma$, and normalized along hops with the softmax function; the associated weighting tensor $\Theta$ can be derived from the $N \times (K+1)$ matrix $\Theta^{HA}$ by repeating it $d$ times along the third dimension. Right: The weighting tensor of hop-wise convolution $\Theta$ is directly derived from the $(K+1) \times d$ kernel matrix $\Theta^{HC}$ by repeating it $N$ times along the first dimension.
In BoT, another version of the pseudo-symmetric normalized GAT adjacency is proposed (denoted "norm. adj.") [38]:
$$\hat{A}_{adj} = D^{\frac{1}{2}} D_{row}^{-1} \tilde{A} D^{-\frac{1}{2}}, \qquad (15)$$
where $D$ is the diagonal degree matrix of the graph and "adj" refers to "adjustment", since we can view this adjacency as the GAT adjacency adjusted toward the GCN adjacency.
Note that $\tilde{A}$, $\hat{A}_{sym}$, and $\hat{A}_{adj}$ are only pseudo-symmetric: they are guaranteed to be symmetric if and only if $e_{ij} = e_{ji}$ for all $i, j$, which holds when $a_{src} = a_{dst}$. As a special case, when we set the query vectors to zeros, both $\hat{A}_{sym}$ and $\hat{A}_{adj}$ reduce to the standard symmetric normalized adjacency. This characteristic connects GAT and GCN.
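For small graphs, the three normalizations of the unnormalized attention matrix Ã (Eqs. (3), (14), and (15)) can be sketched densely as below; treating D as the plain graph degree matrix in Eq. (15) follows our reading of the BoT adjustment, and all names are illustrative.

```python
import torch


def gat_adjacency_variants(A_unnorm, degrees, eps=1e-12):
    """A_unnorm: dense [N, N] with exp(e_ij) on edges and 0 elsewhere.
    degrees: [N] plain graph degrees used by the BoT adjustment."""
    d_row = A_unnorm.sum(dim=1).clamp_min(eps)      # attention-weighted in-degrees
    d_col = A_unnorm.sum(dim=0).clamp_min(eps)      # attention-weighted out-degrees
    A_hat = A_unnorm / d_row[:, None]               # Eq. (3): row-stochastic GAT adjacency
    A_sym = A_unnorm / (d_row.sqrt()[:, None] * d_col.sqrt()[None, :])  # Eq. (14)
    deg = degrees.clamp_min(eps).float()
    A_adj = (deg.sqrt()[:, None] * A_hat) / deg.sqrt()[None, :]         # Eq. (15), "norm. adj."
    return A_hat, A_sym, A_adj
```

Setting all attention logits to zero makes `A_unnorm` binary, in which case `A_sym` and `A_adj` both collapse to the standard symmetric normalized adjacency, as stated above.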
In this subsection, to simplify the discussion, we omit the superscript $(l)$ and collect all weighting coefficients $\{\theta_{ikc}\}$ into a unified weighting tensor $\Theta \in \mathbb{R}^{N \times (K+1) \times d}$. We can extract $\Theta^{(k)} = \Theta_{:,k,:}$, where $:$ in the subscript refers to taking all entries along that dimension. We denote the subscripts for nodes, hops, and channels by $i$, $k$, and $c$. We aim to design adaptive and efficient ways of calculating the weighting tensor, which should vary across nodes or feature channels. Directly defining a unified weighting tensor that is adaptive for both nodes and feature channels is hard and results in enormous additional complexity. We propose two efficient mechanisms in the following paragraphs: Hop-wise Attention (HA) and Hop-wise Convolution (HC). We denote the AGDN variants by AGDN-mean, AGDN-HA, and AGDN-HC, using naive fixed weights, hop-wise attention weights, and hop-wise convolution weights, respectively. As a particular case of AGDNs, AGDN-mean can be considered a representative example of GDNs.
Hop-wise Attention We suppose that, in many cases, the computation of graph diffusion should be adaptive for different nodes but identical for different feature channels, which manifests as a unified weighting tensor that is normalized along hops, $\sum_{k=0}^{K} \theta_{i,k,c} = 1, \forall i, c$, and identical along channels, $\theta_{i,k,c} = \theta_{i,k,1}, \forall c \in \{1, 2, \ldots, d\}$. We can then simplify this tensor into a 2-dimensional weighting matrix $\Theta^{HA} = [\theta_{ik}^{HA}] \in \mathbb{R}^{N \times (K+1)}$ by ignoring the last subscript $c$. $\Theta$ can be recovered by adding the third dimension and repeating $\Theta^{HA}$ $d$ times along it. It is still not efficient to define a naive learnable weighting matrix in $\mathbb{R}^{N \times (K+1)}$, which results in redundant complexity and violates the possible inductive setting. Inspired by the attention mechanism in GAT, we propose Hop-wise Attention (HA), using a learnable query vector $a_{hw} \in \mathbb{R}^{2d}$ to induce the expected weighting matrix; we only need to learn $2d$ parameters.
First, we calculate $\omega$:
$$\omega_{ik} = \left[\tilde{h}_i^{(l,0)} \,\|\, \tilde{h}_i^{(l,k)}\right] \cdot a_{hw}, \qquad (16)$$
where $\cdot$ represents the inner product, $\|$ represents concatenation, and $k$ indexes the $k$-hop representation.
Then, as shown in the left part of Figure 2, the hop-wise attention scores are calculated as below:
$$\theta_{ik}^{HA} = \frac{\exp\left(\sigma(\omega_{ik})\right)}{\sum_{k'=0}^{K} \exp\left(\sigma(\omega_{ik'})\right)}, \qquad (17)$$
where $\sigma$ is the LeakyReLU activation.
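On top of the stacked multi-hop tensor from the layer sketch above, hop-wise attention reduces to a per-node softmax over hops; a hedged sketch with our own identifiers:

```python
import torch
import torch.nn.functional as F


def hop_wise_attention(hops, a_hw):
    """Eqs. (16)-(17): hop- and node-wise weights from a single query vector.

    hops: [N, K + 1, d] multi-hop representations, a_hw: [2 * d] query vector.
    Returns theta of shape [N, K + 1, 1], identical along channels.
    """
    N, K1, d = hops.shape
    hop0 = hops[:, :1, :].expand(-1, K1, -1)           # 0-hop representation, repeated per hop
    omega = torch.cat([hop0, hops], dim=-1) @ a_hw     # Eq. (16): [N, K + 1]
    theta = torch.softmax(F.leaky_relu(omega), dim=1)  # Eq. (17): normalize along hops
    return theta.unsqueeze(-1)                         # broadcastable over the channel dimension
```

With the earlier layer sketch, one would call `theta = hop_wise_attention(layer.multi_hop(A_hat, H), a_hw)` before the weighted sum.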
Hop-wise Convolution We consider another simple strategy for integrating multi-hop representations, which performs Hop-wise Convolution (HC). This time, we suppose that, in certain cases, the graph diffusion should be adaptive for different channels. Thus, we directly define a learnable weighting tensor that is identical for all nodes, $\theta_{i,k,c} = \theta_{1,k,c}, \forall i \in \{1, 2, \ldots, N\}$. We then simplify this tensor into a 2-dimensional convolution kernel matrix $\Theta^{HC} = [\theta_{kc}^{HC}] \in \mathbb{R}^{(K+1) \times d}$ by ignoring the first subscript $i$. The complete weighting tensor $\Theta$ can be recovered by adding the first dimension and repeating $\Theta^{HC}$ $N$ times along it, as shown in Figure 2. We need to learn $(K+1) \times d$ parameters. Note that we do not require the tensor to be normalized along any dimension. For each feature channel $c \in \{1, 2, \ldots, d\}$, we conduct an individual hop-wise convolution with the associated kernel vector in $\mathbb{R}^{K+1}$. HC has the same form as DCNN [33]. However, HC is based on the more memory-efficient graph diffusion and uses different convolution kernels in different layers.
4.5 Complexity
In the complexity analysis, we omit the dimension change between a layer's input and output. The extra time complexity of an AGDN layer over its base MPNN layer comes from the $K$-hop aggregations, $O(KEd)$ (by default, we perform the feature transformation before aggregation), the element-wise multiplication with the weighting matrices, $O(KNd)$, and the hop-wise attention computation, $O(KNd)$, if used. The extra time complexity of an AGDN layer is thus $O(KEd + KNd)$. Under the realistic assumption that $E \gg N$, this extra time complexity becomes $O(KEd)$. The extra space complexity of an AGDN layer is $O(KNd)$.
5 Model analysis
We generally lose the elegant spectral analyzability since generalized weighting coefficients can easily change the
eigenvectors. However, this also implies that AGDNs are more flexible in the spectral domain.
For AGDN-HC, the weighting coefficients are identical for all nodes and do not change the eigenvectors. Thus, we can
give a preliminary spectral analysis. We will demonstrate that, even without changing eigenvectors, AGDN-HC is still
flexible in the spectral domain with a considerable diffusion depth. First, given a feature channel c, we simplify the
form of AGDN-HC with an eigendecomposition of the transition matrix $\hat{A} = U^{-1} \Lambda U$:
$$S_c = \sum_{k=0}^{K} \theta_{kc} \hat{A}^k = \sum_{k=0}^{K} \theta_{kc} \left(U^{-1} \Lambda U\right)^k = U^{-1} \left(\sum_{k=0}^{K} \theta_{kc} \Lambda^k\right) U, \qquad (18)$$
where the row vectors of $U$ are the eigenvectors of $\hat{A}$ and $\Lambda$ is a diagonal matrix whose entries $\{\lambda_1, \lambda_2, \ldots, \lambda_N\}$ are the eigenvalues of $\hat{A}$. The eigenvalues of the transition matrix are bounded by 1 ($\lambda_i \in [-1, 1], \forall i$) [43]. For the $i$-th eigenvalue $\lambda_i$ of the transition matrix, the associated eigenvalue $\lambda_i'$ of the diffusion matrix is:
$$\lambda_i' = \sum_{k=0}^{K} \theta_{kc} \lambda_i^k. \qquad (19)$$
This relation is a $K$-th order polynomial in $\lambda_i$. As the order increases, this polynomial becomes more flexible and can approximate a larger class of spectral filters.
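Eqs. (18)-(19) are easy to verify numerically for a symmetric transition matrix, where the eigendecomposition is orthogonal; below is a small self-contained check on a random symmetric normalized adjacency (our construction, one channel).

```python
import torch

torch.manual_seed(0)
N, K = 6, 4

# Build a small random symmetric normalized adjacency as the transition matrix.
A = (torch.rand(N, N) < 0.5).float()
A = ((A + A.T) > 0).float()
A.fill_diagonal_(0)
deg = A.sum(dim=1).clamp_min(1.0)
A_hat = A / (deg.sqrt()[:, None] * deg.sqrt()[None, :])

theta = torch.rand(K + 1)  # hop weights for one channel c

# Left-hand side of Eq. (18): explicit polynomial in A_hat.
S_c = sum(theta[k] * torch.linalg.matrix_power(A_hat, k) for k in range(K + 1))

# Right-hand side: the same polynomial applied to the eigenvalues, Eq. (19).
lam, U = torch.linalg.eigh(A_hat)          # A_hat = U diag(lam) U^T
lam_prime = sum(theta[k] * lam ** k for k in range(K + 1))
S_spec = U @ torch.diag(lam_prime) @ U.T

print(torch.allclose(S_c, S_spec, atol=1e-4))  # True
```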
6 Experiments
In this section, we conduct experiments on three OGB node classification datasets and three OGB link prediction
datasets. Our proposed AGDNs outperform common MPNNs and SOTA RevGNNs with less complexity and runtime
for the semi-supervised node classification datasets. AGDNs outperform other GNN models in link prediction tasks
using the same encoder-decoder framework. AGDNs approach SOTA SEAL with much less runtime. AGDNs achieve
new SOTA performance on the ogbn-arxiv, ogbn-proteins, and ogbl-ddi datasets. We train all AGDN models on a single V100 card with 16 GB of memory.
Datasets We utilize three OGB semi-supervised node classification datasets (ogbn-arxiv, ogbn-proteins and ogbn-
products) and three OGB link prediction datasets (ogbl-ppa, ogbl-ddi and ogbl-citation2). We summarize the detailed
statistics of these datasets in Table 2. ogbn-arxiv is a citation network between all Computer Science (CS) arXiv
papers, whose data split is based on the publication dates of the papers. ogbn-proteins is a graph between proteins
with multi-dimensional edge weights indicating different types of biologically meaningful associations. Its data split is
based on the associated species of the proteins. ogbn-products is a co-purchasing network between Amazon products,
whose data split is based on the sales ranking. ogbl-ppa is a graph between proteins from 58 species with similar edges
to ogbn-proteins, whose edges measured by high-throughput technology are used as training edges, and other edges
measured by low-throughput technology are used as validation and testing edges. ogbl-ddi is a drug-drug interaction
network with each edge indicating the joint effect of taking the two drugs together. Its data split is based on which proteins those drugs target in the body. ogbl-citation2 is a citation graph between a subset of papers from MAG.
Its data is split by selecting the most recent papers as source nodes and randomly selecting destination nodes for
training/validation/testing sets.
Global settings We conduct all experiments on a single Nvidia Tesla V100 with 16 GB of GPU memory. We evaluate our proposed models over 10 runs with fixed random seeds 0-9 and report means and standard deviations. Except on the ogbn-products and ogbl-citation2 datasets (evaluated on CPU), we conduct both training and inference of all AGDN models on the same GPU card. All final test scores are from the best model selected based on validation scores. In
the tables of this paper, we highlight the results of AGDN with underlined fonts and the best results with bold fonts.
We utilize AGDN-HC on the ogbn-proteins dataset and AGDN-HA for all other datasets. The unavailable results are
indicated by –.
Baselines Several representative GNNs and SOTA GNNs are selected as baselines. For semi-supervised node
classification, we utilize GCN [2], GraphSAGE [3], GAT [44], MixHop [36], JKNet [24], DeeperGCN [14], GCNII
[15], DAGNN [12], MAGNA [35], UniMP [45], GAT+BoT [38] and RevGNN [16].
Experimental setup For ogbn-arxiv, we utilize 3 AGDN layers with the transition matrix of GAT, hidden dimension 256, 3 attention heads, and residual linear connections. For AGDN with BoT, we utilize the pseudo-symmetric normalized transition matrix of GAT from BoT. For AGDN with BoT and the GIANT-XRT embedding, we utilize our proposed pseudo-symmetric transition matrix of GAT and 2 AGDN layers. For ogbn-proteins, we utilize 6 AGDN layers with the transition matrix of GAT $\hat{A}$, hidden dimension 150, 6 attention heads, and residual linear connections. For ogbn-products, we utilize 4 AGDN layers with the transition matrix of GAT, hidden dimension 120, 4 attention heads, and residual linear connections.
Table 3: The first part of experimental results on the ogbn-arxiv dataset. Except for AGDN, other results are from their
papers or the OGB leaderboard.
Models | Test Accuracy (%) | Valid Accuracy (%) | Params
GCN | 71.74±0.29 | 73.00±0.17 | 0.11M
GraphSAGE | 71.49±0.27 | 72.77±0.16 | 0.22M
DeeperGCN | 71.92±0.16 | 72.62±0.14 | 0.49M
JKNet | 72.19±0.21 | 73.35±0.07 | 0.09M
DAGNN | 72.09±0.25 | 72.90±0.11 | 0.04M
GCNII | 72.74±0.16 | – | 2.15M
MAGNA | 72.76±0.14 | – | –
Ours (AGDN) | 73.41±0.25 | 74.23±0.13 | 1.45M
Table 4: The second part of experimental results on the ogbn-arxiv dataset. Except for AGDN, other results are from their papers or the OGB leaderboard. ①=BoT, ②=self-KD, ③=GIANT-XRT embedding.
Models | Test Accuracy (%) | Valid Accuracy (%) | Params
UniMP | 73.11±0.20 | 74.50±0.15 | 0.18M
GAT+① | 73.91±0.12 | 75.16±0.08 | 1.44M
RevGAT+① | 74.02±0.18 | 75.01±0.10 | 2.10M
Ours (AGDN+①) | 74.11±0.12 | 75.25±0.05 | 1.51M
GAT+①+② | 74.16±0.08 | 75.14±0.04 | 1.44M
RevGAT+①+② | 74.26±0.17 | 74.97±0.08 | 2.10M
Ours (AGDN+①+②) | 74.31±0.12 | 75.22±0.09 | 1.51M
RevGAT+①+③ | 75.90±0.19 | 77.01±0.09 | 1.30M
Ours (AGDN+①+③) | 76.18±0.16 | 77.24±0.06 | 1.31M
RevGAT+①+②+③ | 76.15±0.10 | 77.16±0.09 | 1.30M
Ours (AGDN+①+②+③) | 76.37±0.11 | 77.19±0.08 | 1.31M
With the GIANT-XRT embedding, AGDN and RevGAT are both implemented with 2 layers and hidden dimension 256. With similar complexity, the margin between RevGAT and AGDN becomes larger.
Moreover, we evaluate AGDN on larger ogbn-proteins and ogbn-products datasets with the random graph partition
technique. For ogbn-proteins, we utilize HC instead of HA. AGDN can achieve a new SOTA result of 88.65%, which
even outperforms the much more complex and deeper RevGNN. We only utilize 6 AGDN layers with hidden dimension 150 and 8.61M parameters, whereas RevGNN-wide includes 448 layers with hidden dimension 224 and 68.47M parameters. Furthermore, the inference of AGDN is conducted on the same 16 GB GPU card used for training, whereas the inference of RevGNN is conducted on another GPU card with 48 GB of memory. For ogbn-products, we evaluate AGDN with random partition. AGDN significantly outperforms other baselines, including RevGNN, and achieves the SOTA performance among models without using labels as input.
6.1.3 Runtime
We report training and inference runtime on the ogbn-proteins dataset in Table 7 with the runtime of RevGNNs
reported in its paper. This comparison demonstrates that AGDN simultaneously outperforms RevGNN and costs much less runtime. Despite their extended runtime, RevGNN-Deep and RevGNN-Wide cost only 2.86 GB and 7.91 GB of memory for training, while AGDN costs 13.67 GB. However, the inference of AGDN is conducted on the same GPU, while the inference of RevGNNs is conducted on another Nvidia RTX A6000 (48 GB), with no reported inference runtime or memory cost.
Table 5: Experimental results on the ogbn-proteins dataset. DeeperGCN, UniMP, RevGNN, and AGDN are implemented
with random partition. GAT is implemented with neighbor sampling. AGDN+BoT is based on the implementation of GAT+BoT; however, labels are not used as model input since they empirically bring no improvement. Except for AGDN, other results are from their papers or the OGB leaderboard.
Models | Test ROC-AUC (%) | Valid ROC-AUC (%) | Params
GCN | 72.51±0.35 | 79.21±0.18 | 0.10M
GraphSAGE | 77.68±0.20 | 83.34±0.13 | 0.19M
DeeperGCN | 85.80±0.17 | 91.06±0.16 | 2.37M
UniMP | 86.42±0.08 | 91.75±0.06 | 1.91M
GAT+BoT | 87.65±0.08 | 92.80±0.08 | 2.48M
RevGNN-deep | 87.74±0.13 | 93.26±0.06 | 20.03M
RevGNN-wide | 88.24±0.15 | 94.50±0.08 | 68.47M
Ours (AGDN) | 88.65±0.13 | 94.18±0.05 | 8.61M
Table 6: Experimental results on the ogbn-products dataset. GAT, DeeperGCN, and AGDN are implemented with
random partition. GraphSAGE and UniMP are implemented with neighbor sampling. Except for AGDN, all results
are from their papers or the OGB leaderboard.
Models | Test Accuracy (%) | Valid Accuracy (%) | Params
GCN | 75.64±0.21 | 92.00±0.03 | 0.10M
GraphSAGE | 78.50±0.14 | 92.24±0.07 | 0.21M
GraphSAINT | 80.27±0.26 | – | 0.33M
DeeperGCN | 80.98±0.20 | 92.38±0.09 | 0.25M
SIGN | 80.52±0.16 | 92.99±0.04 | 3.48M
UniMP | 82.56±0.31 | 93.08±0.17 | 1.48M
RevGNN-112 | 83.07±0.30 | 92.90±0.07 | 2.95M
Ours (AGDN) | 83.34±0.27 | 92.29±0.10 | 1.54M
Table 7: Runtime comparison on the ogbn-proteins dataset with similar Tesla V100 cards.
Based on the naive implementation in the official OGB repository, we compare variants of AGDNs with different base models (GCN, GraphSAGE, GAT) under different diffusion depths $K$ ($K = 1, 2, \ldots, 8$) on the ogbn-arxiv dataset. We also evaluate MPNN baselines with equivalent receptive fields; for example, for $K = 4$ in each subgraph of Figure 3 and Figure 4, the associated model on the MPNN baseline curve has $3 \times 4 = 12$ layers. As shown in Figure 3, the three
MPNN baseline curves, especially for GAT, show a distinct over-smoothing problem. AGDN-mean shows curves that rise quickly but then drop significantly, indicating that it is also affected by over-smoothing. AGDN-HA and AGDN-HC show much more stable curves. AGDN-HA has similar optimal results to AGDN-mean; however, AGDN-HC cannot even outperform the shallow baseline models. Moreover, we repeat these experiments with residual linear connections added. As shown in Figure 4, all models, including MPNNs (with few layers) and AGDNs, are improved by residual linear connections. MPNNs still exhibit a distinct over-smoothing problem, whereas the over-smoothing of AGDN-mean is significantly alleviated and AGDN-HC is effectively improved. The three variants of AGDN then show similar performance; however, AGDN-HA outperforms AGDN-mean, especially at low $K$. This characteristic is vital when applying complex models to large graphs, because we tend to use a lower $K$ due to memory limits. This paper selects a low $K$ (2 or 3) in the other experiments.
Figure 3: Test accuracy of AGDN-Mean, AGDN-HA, AGDN-HC, and the MPNN baseline with different base models and diffusion depths K (1-8) on the ogbn-arxiv dataset.
Figure 4: Test accuracy of AGDN-Mean, AGDN-HA, AGDN-HC, and the MPNN baseline with residual linear connections, different base models, and different diffusion depths K (1-8) on the ogbn-arxiv dataset.
Baselines For link prediction tasks, we utilize DeepWalk [46], Matrix Factorization [47], Common Neighbor [48], Adamic Adar [49], Resource Allocation [50], GCN [2], GraphSAGE [3], SEAL [41], and PLNLP [51] as baselines. Due to memory limitations, we adopt graph sampling techniques, including random partition [14, 45] for ogbn-proteins and ogbn-products and GraphSAINT [52] for ogbl-citation2. Some baselines are not implemented for some datasets; thus we do not report the associated results.
Experimental Setup For ogbl-ppa, we utilize 2 AGDN layers with the transition matrix of GAT, hidden dimension 128, 1 attention head, and residual linear connections. For ogbl-ddi, we utilize 2 AGDN layers with the transition matrix of GAT, hidden dimension 512, 1 attention head, and residual linear connections. For ogbl-citation2, we utilize 3 AGDN layers with the transition matrix of GAT, hidden dimension 256, 1 attention head, and residual linear connections. In the official OGB baselines, a naive cross-entropy loss is used, treating link prediction as binary classification. PLNLP proposes a pairwise AUC loss instead; we adopt the AUC loss on the ogbl-ddi dataset only, since it does not improve AGDN on the other datasets. We adopt the GraphSAINT technique for AGDN
Table 8: Ablation study on the ogbn-arxiv, ogbn-proteins and ogbn-products datasets. Due to the space limit, we omit
the variances of these scores.
on the ogbl-citation2 dataset. We utilize learnable node embeddings as model input on the ogbl-ddi (dimension 512) and ogbl-ppa (dimension 128) datasets. Note that we only manually tune a few hyperparameters based on the default settings.
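The pairwise AUC loss borrowed from PLNLP can be sketched as a squared ranking surrogate over sampled positive/negative pairs; this is our paraphrase of the idea, not PLNLP's exact implementation.

```python
import torch


def pairwise_auc_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Squared surrogate of the AUC ranking objective.

    pos_scores, neg_scores: [B] predictor outputs for sampled positive and
    negative edges, paired one-to-one here for simplicity.
    """
    return torch.mean((1.0 - (pos_scores - neg_scores)) ** 2)
```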
Training and evaluation We follow the standard training procedure in official OGB baselines, which use an encoder-
decoder framework. To emphasize the effect of AGDN, we do not introduce other modifications except pairwise AUC
loss. We use the standard data splits and metrics from the official OGB paper for evaluation.
Table 9: Experimental results on the ogbl-ppa dataset (Hits@100).
Models | Test Hits@100 (%) | Valid Hits@100 (%) | Params
DeepWalk | 28.88±1.53 | – | 150.14M
Matrix Factorization | 32.29±0.94 | 32.28±4.28 | 147.66M
Common Neighbor | 27.65±0.00 | 28.23±0.00 | 0
Adamic Adar | 32.45±0.00 | 32.68±0.00 | 0
Resource Allocation | 49.33±0.00 | 47.22±0.00 | 0
GCN | 18.67±1.32 | 18.45±1.40 | 0.28M
GraphSAGE | 16.55±2.40 | 17.24±2.64 | 0.42M
SEAL | 48.80±3.16 | 51.25±2.52 | 0.71M
PLNLP | 32.38±2.58 | – | –
Ours (AGDN) | 41.23±1.59 | 43.32±0.92 | 36.90M
Table 10: Experimental results on the ogbl-ddi dataset (Hits@20).
Models | Test Hits@20 (%) | Valid Hits@20 (%) | Params
DeepWalk | 22.46±2.90 | – | 11.54M
Matrix Factorization | 13.68±4.75 | 33.70±2.64 | 1.22M
Common Neighbor | 17.73±0.00 | 9.47±0.00 | 0
Adamic Adar | 18.61±0.00 | 9.66±0.00 | 0
Resource Allocation | 6.23±0.00 | 7.25±0.00 | 0
GCN | 37.07±5.07 | 55.50±2.08 | 1.29M
GraphSAGE | 53.90±4.74 | 62.62±0.37 | 1.42M
SEAL | 30.56±3.86 | 28.49±2.69 | 0.53M
PLNLP | 90.88±3.13 | 82.42±2.53 | 3.50M
Ours (AGDN) | 95.38±0.94 | 89.43±2.81 | 3.51M
Table 11: Experimental results on the ogbl-citation2 dataset (MRR).
Models | Test MRR (%) | Valid MRR (%) | Params
Matrix Factorization | 51.86±4.43 | 51.81±4.36 | 281.11M
Common Neighbor | 51.47±0.00 | 51.19±0.00 | 0
Adamic Adar | 51.89±0.00 | 51.67±0.00 | 0
Resource Allocation | 51.98±0.00 | 51.77±0.00 | 0
GCN | 84.74±0.31 | 84.79±0.23 | 0.30M
GraphSAGE | 82.60±0.36 | 82.63±0.33 | 0.46M
SEAL | 87.67±0.32 | 87.57±0.31 | 0.26M
PLNLP | 84.92±0.29 | 84.90±0.31 | 146.51M
Ours (AGDN) | 85.49±0.29 | 85.56±0.33 | 0.31M
Table 12: Ablation study on the ogbl-ppa, ogbl-ddi, ogbl-citation2 datasets. Due to the space limit, we omit the
variances of these scores.
6.2.1 Results
As shown in Table 9, heuristic methods show significant advantages over GNN methods on the ogbl-ppa dataset. As a GNN architecture tailored for link prediction, based on complicated subgraph extraction and a labeling trick, SEAL achieves performance similar to the best heuristic method. AGDN, based on the naive official OGB baseline scripts, outperforms GCN, GraphSAGE, and several heuristic methods. AGDN utilizes learnable node embeddings as model input, which brings additional parameters.
As shown in Table 10, on the ogbl-ddi dataset, GNN methods perform better than heuristic methods. With the AUC loss, AGDN achieves 95.38% Hits@20, a new SOTA result on the ogbl-ddi leaderboard. This dataset is very dense, which makes structural patterns less meaningful; thus encoder-decoder GNNs with learnable node embeddings can perform much better than SEAL.
As shown in Table 11, on the ogbl-citation2 dataset, GNN methods also perform better than heuristic methods. GCN, GraphSAGE, and PLNLP are trained full-batch. Due to our GPU memory limitation (16 GB), we train AGDN with the GraphSAINT graph sampling technique. We can observe significant performance degradation by comparing full-batch GCN (84.74% test MRR) and GCN with GraphSAINT (79.85% test MRR) in the official OGB repository. However, even with GraphSAINT, AGDN still achieves top-3 performance on the whole ogbl-citation2 leaderboard and outperforms full-batch GCN, GraphSAGE, and PLNLP.
On the ogbl-ddi dataset, AGDN outperforms SEAL and other encoder-decoder GNNs by a significant margin. On the ogbl-ppa and ogbl-citation2 datasets, the gap between AGDN and SEAL is not enormous (< 8%) and is smaller than that of other encoder-decoder GNNs. We believe that, with more suitable techniques designed for link prediction, AGDN can contribute more to this task.
6.2.2 Runtime
We compare the training and inference runtime of AGDN and SEAL on the ogbl-ppa, ogbl-ddi, and ogbl-citation2 datasets in Table 13. With similar Tesla V100 GPU cards, AGDN takes significantly less training and inference runtime than SEAL on the ogbl-ppa and ogbl-citation2 datasets. The model architecture of AGDN is more complicated than that of SEAL; however, the simple encoder-decoder framework, though less expressive, is much more efficient than SEAL's time-consuming subgraph sampling and labeling trick. On the small ogbl-ddi dataset, where SEAL's additional techniques work much more efficiently, AGDN takes more training runtime but still much less inference runtime than SEAL.
Table 13: Runtime comparison on the ogbl-ppa, ogbl-ddi, and ogbl-citation2 datasets with similar Tesla V100 cards.
7 Conclusion
This paper proposes a feasible and effective evolution path for GNNs. First, we refine and propose Graph Diffusion Networks (GDNs) by replacing the graph convolution operator in each GNN layer with an efficient graph diffusion. Then, we generalize graph diffusion to propose Adaptive Graph Diffusion Networks (AGDNs), with two adaptive and scalable mechanisms for computing hop weighting coefficients/matrices. In the spectral domain, AGDNs are more adaptive and flexible than previous graph diffusion-based methods. We evaluate AGDNs and other popular GNNs on node classification and link prediction tasks. The experimental results show that AGDNs can significantly outperform many popular GNNs and even SOTA GNNs (RevGNNs and SEAL), while retaining considerable overall advantages in complexity and efficiency over these SOTA GNNs. Instead of copying huge models from other domains or using an over-simplified architecture, we enlarge the receptive field with moderate complexity and an essential architecture, which is valuable for limited computation hardware and time-critical tasks.
Limitations As a common issue, it is hard to apply node-wise or layer-wise neighbor sampling techniques to very deep GNNs, including AGDNs. We must employ additional memory-saving techniques from other models if we want to train a very deep or wide AGDN model with a considerable diffusion depth. Fortunately, AGDNs are compatible with most techniques applied to MPNNs. The effect of positional embeddings in AGDNs has also not been studied precisely. We leave potential memory-saving techniques for AGDNs to future research.
References
[1] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing
for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
[2] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv
preprint arXiv:1609.02907, 2016.
[3] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in
neural information processing systems, pages 1024–1034, 2017.
[4] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
[5] Jie Chen, Tengfei Ma, and Cao Xiao. Fastgcn: fast learning with graph convolutional networks via importance
sampling. arXiv preprint arXiv:1801.10247, 2018.
[6] Alex Fout, Jonathon Byrd, Basir Shariat, and Asa Ben-Hur. Protein interface prediction using graph convolutional
networks. In Advances in neural information processing systems, pages 6530–6539, 2017.
[7] Le Yu, Bowen Du, Xiao Hu, Leilei Sun, Liangzhe Han, and Weifeng Lv. Deep spatio-temporal graph convolutional
network for traffic accident prediction. Neurocomputing, 423:135–147, 2021.
[8] Wei Li, Xin Wang, Yiwen Zhang, and Qilin Wu. Traffic flow prediction over muti-sensor data correlation with
graph convolution network. Neurocomputing, 427:50–63, 2021.
[9] Xueyan Yin, Genze Wu, Jinze Wei, Yanming Shen, Heng Qi, and Baocai Yin. Multi-stage attention spatial-
temporal graph networks for traffic prediction. Neurocomputing, 428:42–53, 2021.
[10] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-
supervised learning. arXiv preprint arXiv:1801.07606, 2018.
[11] Guangtao Wang, Rex Ying, Jing Huang, and Jure Leskovec. Improving graph attention networks with large
margin-based constraints. arXiv preprint arXiv:1910.11945, 2019.
[12] Meng Liu, Hongyang Gao, and Shuiwang Ji. Towards deeper graph neural networks. In Proceedings of the 26th
ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 338–348, 2020.
[13] Guohao Li, Matthias Muller, Ali Thabet, and Bernard Ghanem. Deepgcns: Can gcns go as deep as cnns? In
Proceedings of the IEEE International Conference on Computer Vision, pages 9267–9276, 2019.
[14] Guohao Li, Chenxin Xiong, Ali Thabet, and Bernard Ghanem. Deepergcn: All you need to train deeper gcns.
arXiv preprint arXiv:2006.07739, 2020.
[15] Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional
networks. In International Conference on Machine Learning, pages 1725–1735. PMLR, 2020.
[16] Guohao Li, Matthias Müller, Bernard Ghanem, and Vladlen Koltun. Training graph neural networks with 1000
layers. In International conference on machine learning, pages 6437–6449. PMLR, 2021.
[17] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural
networks meet personalized pagerank. arXiv preprint arXiv:1810.05997, 2018.
[18] Emanuele Rossi, Fabrizio Frasca, Ben Chamberlain, Davide Eynard, Michael Bronstein, and Federico Monti.
Sign: Scalable inception graph neural networks. arXiv preprint arXiv:2004.11198, 2020.
[19] Johannes Klicpera, Stefan Weißenberger, and Stephan Günnemann. Diffusion improves graph learning. In
Advances in Neural Information Processing Systems, pages 13354–13366, 2019.
[20] Lingxiao Zhao and Leman Akoglu. Pairnorm: Tackling oversmoothing in gnns. arXiv preprint arXiv:1909.12223,
2019.
[21] Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. Dropedge: Towards deep graph convolutional
networks on node classification. arXiv preprint arXiv:1907.10903, 2019.
[22] Wenzheng Feng, Jie Zhang, Yuxiao Dong, Yu Han, Huanbo Luan, Qian Xu, Qiang Yang, Evgeny Kharlamov, and
Jie Tang. Graph random neural networks for semi-supervised learning on graphs. Advances in neural information
processing systems, 33:22092–22103, 2020.
[23] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure
Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687,
2020.
[24] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka.
Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536, 2018.
[25] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing
order to the web. Technical report, Stanford InfoLab, 1999.
[26] Risi Imre Kondor and John Lafferty. Diffusion kernels on graphs and other discrete structures. In Proceedings of
the 19th international conference on machine learning, volume 2002, pages 315–22, 2002.
[27] Bingbing Xu, Huawei Shen, Qi Cao, Keting Cen, and Xueqi Cheng. Graph convolutional networks using heat
kernel for semi-supervised learning. arXiv preprint arXiv:2007.16002, 2020.
[28] Dimitris Berberidis, Athanasios N Nikolakopoulos, and Georgios B Giannakis. Adaptive diffusions for scalable
learning over graphs. IEEE Transactions on Signal Processing, 67(5):1307–1321, 2018.
[29] Siheng Chen, Aliaksei Sandryhaila, José MF Moura, and Jelena Kovačević. Adaptive graph filtering: Multireso-
lution classification on graphs. In 2013 IEEE Global Conference on Signal and Information Processing, pages
427–430. IEEE, 2013.
[30] Sami Abu-El-Haija, Bryan Perozzi, Rami Al-Rfou, and Alexander A Alemi. Watch your step: Learning node
embeddings via graph attention. In Advances in Neural Information Processing Systems, pages 9180–9190, 2018.
[31] Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr, Christopher Fifty, Tao Yu, and Kilian Q Weinberger.
Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153, 2019.
[32] Hao Zhu and Piotr Koniusz. Simple spectral graph convolution. In International Conference on Learning
Representations, 2021.
[33] James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in neural information
processing systems, pages 1993–2001, 2016.
[34] Jian Du, Shanghang Zhang, Guanhang Wu, José MF Moura, and Soummya Kar. Topology adaptive graph
convolutional networks. arXiv preprint arXiv:1710.10370, 2017.
[35] Guangtao Wang, Rex Ying, Jing Huang, and Jure Leskovec. Direct multi-hop attention based graph neural network.
arXiv preprint arXiv:2009.14332, 2020.
[36] Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan,
Greg Ver Steeg, and Aram Galstyan. Mixhop: Higher-order graph convolutional architectures via sparsified
neighborhood mixing. arXiv preprint arXiv:1905.00067, 2019.
[37] Sami Abu-El-Haija, Amol Kapoor, Bryan Perozzi, and Joonseok Lee. N-gcn: Multi-scale graph convolution for
semi-supervised node classification. In uncertainty in artificial intelligence, pages 841–851. PMLR, 2020.
[38] Yangkun Wang. Bag of tricks of semi-supervised classification with graph neural networks. arXiv preprint
arXiv:2103.13355, 2021.
[39] Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. Be your own teacher:
Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages 3713–3722, 2019.
[40] Eli Chien, Wei-Cheng Chang, Cho-Jui Hsieh, Hsiang-Fu Yu, Jiong Zhang, Olgica Milenkovic, and Inder-
jit S Dhillon. Node feature extraction by self-supervised multi-scale neighborhood prediction. arXiv preprint
arXiv:2111.00064, 2021.
[41] Muhan Zhang, Pan Li, Yinglong Xia, Kai Wang, and Long Jin. Labeling trick: A theory of using graph
neural networks for multi-node representation learning. Advances in Neural Information Processing Systems,
34:9061–9073, 2021.
[42] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. Advances in neural information
processing systems, 31, 2018.
[43] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in
neural information processing systems, 14, 2001.
[44] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph
attention networks. arXiv preprint arXiv:1710.10903, 2017.
[45] Yunsheng Shi, Zhengjie Huang, Shikun Feng, and Yu Sun. Masked label prediction: Unified massage passing
model for semi-supervised classification. arXiv preprint arXiv:2009.03509, 2020.
[46] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In
Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages
701–710, 2014.
[47] Aditya Krishna Menon and Charles Elkan. Link prediction via matrix factorization. In Joint european conference
on machine learning and knowledge discovery in databases, pages 437–452. Springer, 2011.
[48] David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. Journal of the American
society for information science and technology, 58(7):1019–1031, 2007.
[49] Lada A Adamic and Eytan Adar. Friends and neighbors on the web. Social networks, 25(3):211–230, 2003.
[50] Tao Zhou, Linyuan Lü, and Yi-Cheng Zhang. Predicting missing links via local information. The European
Physical Journal B, 71(4):623–630, 2009.
[51] Zhitao Wang, Yong Zhou, Litao Hong, Yuanhang Zou, and Hanjing Su. Pairwise learning for neural link prediction.
arXiv preprint arXiv:2112.02936, 2021.
[52] Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. Graphsaint: Graph
sampling based inductive learning method. arXiv preprint arXiv:1907.04931, 2019.