
A Comprehensive Survey on Deep Graph Representation Learning

WEI JU, ZHENG FANG, YIYANG GU, ZEQUN LIU, and QINGQING LONG, Peking University, China
ZIYUE QIAO, The Hong Kong University of Science and Technology, China
YIFANG QIN, JIANHAO SHEN, and FANG SUN, Peking University, China
ZHIPING XIAO, University of California, Los Angeles, USA
JUNWEI YANG, JINGYANG YUAN, and YUSHENG ZHAO, Peking University, China
XIAO LUO, University of California, Los Angeles, USA
MING ZHANG∗ , Peking University, China
Graph representation learning aims to effectively encode high-dimensional sparse graph-structured data into
low-dimensional dense vectors, which is a fundamental task that has been widely studied in a range of fields,
including machine learning and data mining. Classic graph embedding methods follow the basic idea that the
embedding vectors of interconnected nodes in the graph can still maintain a relatively close distance, thereby
preserving the structural information between the nodes in the graph. However, this is sub-optimal due to: (i)
traditional methods have limited model capacity, which restricts the learning performance; (ii) existing techniques
typically rely on unsupervised learning strategies and fail to couple with the latest learning paradigms; (iii)
representation learning and downstream tasks are dependent on each other and should be jointly enhanced.
With the remarkable success of deep learning, deep graph representation learning has shown great potential
and advantages over shallow (traditional) methods, and a large number of deep graph representation
learning techniques, especially graph neural networks, have been proposed in the past decade. In this survey,
we conduct a comprehensive review of current deep graph representation learning algorithms by proposing a
new taxonomy of existing state-of-the-art literature. Specifically, we systematically summarize the essential
components of graph representation learning and categorize existing approaches by graph neural
network architectures and the most recent advanced learning paradigms. Moreover, this survey also provides
the practical and promising applications of deep graph representation learning. Last but not least, we state
new perspectives and suggest challenging directions that deserve further investigation in the future.

CCS Concepts: • Computing methodologies → Neural networks; Learning latent representations.

∗ Corresponding author, and the other authors contributed equally to this research.

Authors’ addresses: Wei Ju, [email protected]; Zheng Fang, [email protected]; Yiyang Gu, [email protected];
Zequn Liu, [email protected]; Qingqing Long, [email protected], Peking University, Beijing, China, 100871;
Ziyue Qiao, [email protected], The Hong Kong University of Science and Technology, Guangzhou, China, 511453;
Yifang Qin, [email protected]; Jianhao Shen, [email protected]; Fang Sun, [email protected], Peking University,
Beijing, China, 100871; Zhiping Xiao, [email protected], University of California, Los Angeles, USA, 90095; Junwei
Yang, [email protected]; Jingyang Yuan, [email protected]; Yusheng Zhao, [email protected], Peking
University, Beijing, China, 100871; Xiao Luo, [email protected], University of California, Los Angeles, USA, 90095; Ming
Zhang, [email protected], Peking University, Beijing, China, 100871.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2023 Association for Computing Machinery.
0004-5411/2023/4-ART $15.00
https://doi.org/XXXXXXX.XXXXXXX


Additional Key Words and Phrases: Deep Learning on Graphs, Graph Representation Learning, Graph Neural
Network, Survey
ACM Reference Format:
Wei Ju, Zheng Fang, Yiyang Gu, Zequn Liu, Qingqing Long, Ziyue Qiao, Yifang Qin, Jianhao Shen, Fang Sun,
Zhiping Xiao, Junwei Yang, Jingyang Yuan, Yusheng Zhao, Xiao Luo, and Ming Zhang. 2023. A Comprehensive
Survey on Deep Graph Representation Learning. J. ACM 1, 1 (April 2023), 85 pages. https://doi.org/XXXXXXX.XXXXXXX

1 Introduction By Wei Ju
Graphs have recently emerged as a powerful tool for representing a variety of structured and
complex data, including social networks, traffic networks, information systems, knowledge graphs,
protein-protein interaction networks, and physical interaction networks. As a kind of general form
of data organization, graph structures are capable of naturally expressing the intrinsic relationship
of these data, and thus can characterize plenty of non-Euclidean structures that are crucial in
a variety of disciplines and domains due to their flexible adaptability. For example, to encode a
social network as a graph, nodes on the graph are used to represent individual users, and edges are
used to represent the relationship between two individuals, such as friendship. In the field of biology,
nodes can be used to represent proteins, and edges can be used to represent the biological interactions
between them, such as dynamic protein-protein interactions. Thus, by analyzing
and mining the graph-structured data, we can understand the deep meaning hidden behind the
data, and further discover valuable knowledge, so as to benefit society and human beings.
In the last decade, a wide range of machine learning algorithms have been developed for
graph-structured data learning. Among them, traditional graph kernel methods [107, 177, 314, 316]
usually break down graphs into different atomic substructures and then use kernel functions
to measure the similarity between all pairs of them. Although graph kernels could provide a
perspective on modeling graph topology, these approaches often generate substructures or feature
representations based on given hand-crafted criteria. These rules are rather heuristic, often incur
high computational complexity, and therefore have weak scalability and subpar performance.
In the past few years, an ever-increasing number of graph embedding algorithms [2, 123, 276, 343, 344, 359]
have emerged, which attempt to encode the structural information of the graph (usually a
high-dimensional sparse matrix) and map it into a low-dimensional dense vector embedding to
preserve the topology information and attribute information in the embedding space as much
as possible, so that the learned graph embeddings can be naturally integrated into traditional
machine learning algorithms. Compared to previous works which use feature engineering in the
pre-processing phase to extract graph structural features, current graph embedding algorithms are
conducted in a data-driven way leveraging machine learning algorithms (such as neural networks)
to encode the structural information of the graph. Specifically, existing graph embedding methods
can be categorized into the following main groups: (i) matrix factorization based methods [2, 36, 268]
that factorize the matrix to learn node embedding which preserves the graph property; (ii) deep
learning based methods [123, 276, 344, 359] that apply deep learning techniques specifically de-
signed for graph-structured data; (iii) edge reconstruction based methods [229, 253, 343] that either
maximize the edge reconstruction probability or minimize the edge reconstruction loss. Generally, these
methods typically depend on shallow architectures, and fail to exploit the potential and capacity of
deep neural networks, resulting in sub-optimal representation quality and learning performance.
Inspired by the recent remarkable success of deep neural networks, a range of deep learning
algorithms has been developed for graph-structured data learning. The core of these methods is to
generate effective node and graph representations using graph neural networks (GNNs), followed
by a goal-oriented learning paradigm. In this way, the derived representations can be adaptively


coupled with a variety of downstream tasks and applications. Following this line of thought, in
this paper, we propose a new taxonomy to classify the existing graph representation learning
algorithms, i.e., graph neural network architectures, learning paradigms, and various promising
applications, as shown in Fig. 1. Specifically, for the architectures of GNNs, we investigate the
studies on graph convolutions, graph kernel neural networks, graph pooling, and graph transformer.
For the learning paradigms, we explore three advanced types namely supervised/semi-supervised
learning on graphs, graph self-supervised learning, and graph structure learning. To demonstrate
the effectiveness of the learned graph representations, we provide several promising applications
to build tight connections between representation learning and downstream tasks, such as social
analysis, molecular property prediction and generation, recommender systems, and traffic analysis.
Last but not least, we present some perspectives for thought and suggest challenging directions
that deserve further study in the future.
Differences between this survey and existing ones. Up to now, there exist some other overview
papers focusing on different perspectives of graph representation learning [12, 40, 43, 47, 179, 387,
390, 446, 463, 464] that are closely related to ours. However, very few comprehensive reviews
have summarized deep graph representation learning simultaneously from the perspective
of diverse GNN architectures and the corresponding up-to-date learning paradigms. Therefore, we
here clearly state their distinctions from our survey as follows. There have been several surveys
on classic graph embedding [32, 119]; these works categorize graph embedding methods based on
different training objectives. Wang et al. [366] go further and provide a comprehensive review of
existing heterogeneous graph embedding approaches. With the rapid development of deep learning,
there are a handful of surveys along this line. For example, Wu et al. [387] and Zhang et al. [446]
mainly focus on several classical and representative GNN architectures without exploring deep
graph representation learning from a view of the most recent advanced learning paradigms such as
graph self-supervised learning and graph structure learning. Xia et al. [390] and Chami et al. [40]
jointly summarize the studies of graph embeddings and GNNs. Zhou et al. [463] explore different
types of computational modules for GNNs. One recent survey under review [179] categorizes the
existing works in graph representation learning from both static and dynamic graphs. However,
these taxonomies emphasize the basic GNN methods but pay insufficient attention to the learning
paradigms, and provide few discussions of the most promising applications, such as recommender
systems and molecular property prediction and generation. To the best of our knowledge, the most
relevant survey published formally is [464], which presents a review of GNN architectures and
roughly discusses the corresponding applications. Nevertheless, this survey merely covers methods
up to the year of 2020, missing the latest developments in the past two years.
Therefore, it is highly desired to summarize the representative GNN methods, the most recent
advanced learning paradigms, and promising applications into one unified and comprehensive
framework. Moreover, we strongly believe this survey with a new taxonomy of literature and more
than 400 studies will strengthen future research on deep graph representation learning.
Contribution of this survey. The goal of this survey is to systematically review the literature
on the advances of deep graph representation learning and discuss further directions. It aims
to help the researchers and practitioners who are interested in this area, and support them in
understanding the panorama and the latest developments of deep graph representation learning.
The key contributions of this survey are summarized as follows:
• Systematic Taxonomy. We propose a systematic taxonomy to organize the existing deep
graph representation learning approaches based on GNN architectures and the most recent
advanced learning paradigms, providing representative branches of methods. Moreover,
several promising applications are presented to illustrate the superiority and potential of
graph representation learning.


Fig. 1. The architecture of this paper: graph data is processed by graph neural network architectures (graph convolutions, graph kernel neural networks, graph pooling, graph transformer) to produce graph representations, which are optimized by learning paradigms (semi-supervised learning on graphs, graph self-supervised learning, graph structure learning) and applied to graph-related applications (social analysis, molecular property prediction, molecular generation, recommender systems, traffic analysis), leading to future directions.

• Comprehensive Review. For each branch of this survey, we review the essential compo-
nents, provide detailed descriptions of representative algorithms, and systematically
summarize their characteristics to provide an overall comparison.
• Future Directions. Based on the properties of existing deep graph representation learning
algorithms, we discuss the limitations and challenges of current methods and propose the
potential as well as promising research directions deserving of future investigations.

2 Background By Wei Ju
In this section, we first briefly introduce some definitions in deep graph representation learning
that need to be clarified, then we explain the reasons why we need graph representation learning.

2.1 Problem Definition


Definition: Graph. Given a graph 𝐺 = (𝑉 , 𝐸, X), where 𝑉 = {𝑣 1, · · · , 𝑣 |𝑉 | } is the set of nodes,
𝐸 = {𝑒 1, · · · , 𝑒 |𝐸 | } is the set of edges, and an edge 𝑒 = (𝑣𝑖 , 𝑣 𝑗 ) ∈ 𝐸 represents the connection
relationship between nodes 𝑣𝑖 and 𝑣 𝑗 in the graph. X ∈ R |𝑉 |×𝑀 is the node feature matrix with
𝑀 being the dimension of each node feature. The adjacency matrix of a graph can be defined as
A ∈ R |𝑉 |× |𝑉 | , where A𝑖 𝑗 = 1 if (𝑣𝑖 , 𝑣 𝑗 ) ∈ 𝐸, otherwise A𝑖 𝑗 = 0.
The adjacency matrix can be regarded as the structural representation of the graph-structured
data, in which each row of the adjacency matrix A represents the connection relationship between
the corresponding node of the row and all other nodes, which can be regarded as a discrete repre-
sentation of the node. However, in real-life circumstances, the adjacency matrix A corresponding
to 𝐺 is a highly sparse matrix, and using A directly as the node representations severely harms
computational efficiency. The storage space of the adjacency matrix A is |𝑉 | × |𝑉 |,
which is usually unacceptable when the total number of nodes grows to the order of millions. At
the same time, the value of most dimensions in the node representation is 0. The sparsity will make
subsequent machine learning tasks very difficult.
Graph representation learning is a bridge between the original input data and the task objectives
in the graph. The fundamental idea of the graph representation learning algorithm is first to learn
the embedded representations of nodes or the entire graph from the input graph structure data and
then apply these embedded representations to downstream related tasks, such as node classification,
graph classification, link prediction, community detection, and visualization, etc. Specifically, it
aims to learn low-dimensional, dense distributed embedding representations for nodes in the graph.
Formally, the goal of graph representation learning is to learn its embedding vector representation


𝑅𝑣 ∈ R𝑑 for each node 𝑣 ∈ 𝑉 , where the dimension 𝑑 of the vector is much smaller than the total
number of nodes |𝑉 | in the graph.
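To make these definitions concrete, the following minimal NumPy sketch contrasts the sparse |𝑉 | × |𝑉 | adjacency representation with a dense |𝑉 | × 𝑑 embedding matrix; the toy 4-node graph and all numeric values are assumed purely for illustration and are not taken from any particular dataset.

```python
import numpy as np

# A toy graph with |V| = 4 nodes and an assumed edge set E.
num_nodes = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

# Adjacency matrix A in {0,1}^{|V| x |V|}: row i is a sparse, discrete
# representation of node i (mostly zeros for large graphs).
A = np.zeros((num_nodes, num_nodes))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0  # undirected graph

# A learned embedding matrix R in R^{|V| x d} with d << |V|: each row is a
# low-dimensional, dense representation of a node. Here it is randomly
# initialized only to illustrate the shapes involved.
d = 2
R = np.random.randn(num_nodes, d)

print(A.shape, R.shape)  # (4, 4) vs. (4, 2)
```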

2.2 Why study deep graph representation learning


With the rapid development of deep learning techniques, deep neural networks such as convolu-
tional neural networks and recurrent neural networks have made breakthroughs in the fields of
computer vision, natural language processing, and speech recognition. They can well abstract the
semantic information of images, natural language, and speech. However, current deep learning
techniques fail to handle more complex and irregular graph-structured data. To effectively analyze
and model this kind of non-Euclidean structure data, many graph representation learning algo-
rithms have emerged in recent years, including graph embedding and graph neural networks. At
present, compared with Euclidean-style data such as images, natural language, and speech, graph-
structured data is high-dimensional, complex, and irregular. Therefore, the graph representation
learning algorithm is a rather powerful tool for studying graph-structured data. To encode complex
graph-structured data, deep graph representation learning needs to meet several characteristics: (1)
topological properties: Graph representations need to capture the complex topological information
of the graph, such as the relationships between nodes, and other substructure information,
such as subgraphs, motifs, etc. (2) feature attributes: It is necessary for graph representations to de-
scribe the high-dimensional attribute features in the graph, including the attributes of nodes and edges
themselves. (3) scalability: Because different real graph data have different characteristics, graph
representation learning algorithms should be able to efficiently learn embedding representations
on different graph-structured data, making them universal and transferable.

3 Graph Convolutions By Yusheng Zhao


Graph convolutions have become the basic building blocks in many deep graph representation
learning algorithms and graph neural networks developed recently. In this section, we provide a
comprehensive review of graph convolutions, which generally fall into two categories: spectral
graph convolutions and spatial graph convolutions. Based on the solid mathematical foundations
of Graph Signal Processing (GSP)[129, 303, 320], spectral graph convolutions seek to capture the
patterns of the graph in the frequency domain. On the other hand, spatial graph convolutions
inherit the idea of message passing from Recurrent Graph Neural Networks (RecGNNs), and they
compute node features by aggregating the features of their neighbors. Thus, the computation graph
of a node is derived from the local graph structure around it, and the graph topology is naturally
incorporated into the way node features are computed. In this section, we first introduce spectral
graph convolutions and then spatial graph convolutions, followed by a brief summary. In Table 1,
we summarize a number of graph convolutions proposed in recent years.

3.1 Spectral Graph Convolutions


With the success of Convolutional Neural Networks (CNNs) in computer vision [195], efforts have
been made to transfer the idea of convolution to the graph domain. However, this is not an easy task
because of the non-Euclidean nature of graphical data. Graph signal processing (GSP) [129, 303, 320]
defines the Fourier Transform on graphs and thus provides a solid theoretical foundation for spectral
graph convolutions.
In graph signal processing, a graph signal refers to a set of scalars associated with every node
in the graph, i.e. 𝑓 (𝑣), ∀𝑣 ∈ 𝑉 , and it can be written in the 𝑛-dimensional vector form x ∈ R𝑛 ,
where 𝑛 is the number of nodes in the graph. Another core concept of graph signal processing
is the symmetric normalized graph Laplacian matrix (or simply, the graph Laplacian), defined as
L = I − D−1/2 AD−1/2 , where I is the identity matrix, D is the degree matrix (i.e. a diagonal matrix


Table 1. Summary of graph convolution methods.

Method Category Aggregation Time Complexity


Spectral CNN [29] Spectral Graph Convolution - 𝑂 (𝑛 3 )
Henaff et al. [137] Spectral Graph Convolution - 𝑂 (𝑛 3 )
ChebNet [66] Spectral Graph Convolution - 𝑂 (𝑚)
GCN [182] Spectral / Spatial Weighted Average 𝑂 (𝑚)
CayleyNet [203] Spectral Graph Convolution - 𝑂 (𝑚)
GraphSAGE [128] Spatial Graph Convolution General 𝑂 (𝑚)
GAT [351] Spatial Graph Convolution Attentive 𝑂 (𝑚)
DGCNN [372] Spatial Graph Convolution General 𝑂 (𝑚)
LanzcosNet [224] Spectral Graph Convolution - 𝑂 (𝑛 2 )
SGC [383] Spatial Graph Convolution Weighted Average 𝑂 (𝑚)
GWNN [399] Spectral Graph Convolution - 𝑂 (𝑚)
GIN [403] Spatial Graph Convolution Sum 𝑂 (𝑚)
GraphAIR [143] Spatial Graph Convolution Sum 𝑂 (𝑚)
PNA [64] Spatial Graph Convolution Multiple 𝑂 (𝑚)
S²GC [466] Spectral Graph Convolution - 𝑂 (𝑚)
GNNML3 [15] Spatial / Spectral - 𝑂 (𝑚)
MSGNN [135] Spectral Graph Convolution - 𝑂 (𝑚)
EGC [340] Spatial Graph Convolution General 𝑂 (𝑚)

D𝑖𝑖 = ∑_{𝑗} A𝑖𝑗 ), and A is the adjacency matrix. In the typical setting of graph signal processing, the
graph 𝐺 is undirected. Therefore, L is real symmetric and positive semi-definite. This guarantees
the eigen decomposition of the graph Laplacian: L = UΛU𝑇 , where U = [u0, u1, ..., u𝑛−1 ] is the
matrix of eigenvectors of the graph Laplacian and the diagonal elements of Λ = diag(𝜆0, 𝜆1, ..., 𝜆𝑛−1 ) are the
eigenvalues. With this, the Graph Fourier Transform (GFT) of a graph signal x is defined as x̃ = U𝑇 x,
where x̃ is the graph frequencies of x. Correspondingly, the Inverse Graph Fourier Transform can
be written as x = Ux̃.
With GFT and the Convolution Theorem, the graph convolution of a graph signal x and a filter
g can be defined as g ∗𝐺 x = U(U𝑇 g ⊙ U𝑇 x). To simplify this, let g𝜃 = diag(U𝑇 𝑔), the graph
convolution can be written as:
g ∗𝐺 x = Ug𝜃 U𝑇 x, (1)
which is the general form of most spectral graph convolutions. The key of spectral graph convolu-
tions is to parameterize and learn the filter g𝜃 .
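As a rough illustration of these definitions, the sketch below (plain NumPy on a toy graph) forms the symmetric normalized Laplacian, applies the Graph Fourier Transform, and filters a signal in the form of Eq. 1; the small graph and the low-pass filter choice are assumed for illustration only.

```python
import numpy as np

# Toy undirected graph (assumed example).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

# Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

# Eigendecomposition L = U diag(lambda) U^T (L is real symmetric PSD).
lam, U = np.linalg.eigh(L)

# A graph signal x (one scalar per node) and its Graph Fourier Transform.
x = np.array([1.0, 2.0, 3.0, 4.0])
x_hat = U.T @ x          # GFT: graph frequencies of x
x_back = U @ x_hat       # inverse GFT recovers x

# Spectral filtering per Eq. 1: y = U g_theta U^T x, where g_theta is a
# diagonal filter acting on the graph frequencies (values assumed).
g_theta = np.diag(np.exp(-lam))   # an assumed low-pass choice
y = U @ g_theta @ U.T @ x

print(np.allclose(x, x_back), y)
```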
Bruna et al. [29] propose Spectral Convolutional Neural Network (Spectral CNN) that sets graph
filter as a learnable diagonal matrix W. The convolution operation can be written as y = UWU𝑇 x.
In practice, multi-channel signals and activation functions are common, and the graph convolution
can be written as

Y:,𝑗 = 𝜎(U ∑_{𝑖=1}^{𝑐𝑖𝑛} W𝑖,𝑗 U^𝑇 X:,𝑖 ),  𝑗 = 1, 2, ..., 𝑐𝑜𝑢𝑡 ,    (2)

where 𝑐𝑖𝑛 is the number of input channels, 𝑐𝑜𝑢𝑡 is the number of output channels, X is an 𝑛 × 𝑐𝑖𝑛
matrix representing the input signal, Y is an 𝑛 × 𝑐𝑜𝑢𝑡 matrix denoting the output signal, W𝑖,𝑗 is a


parameterized diagonal matrix, and 𝜎 (·) is the activation function. For mathematical convenience,
we sometimes use the single-channel version of graph convolutions and omit the activation function;
the multi-channel versions are similar to Eq. 2.
Spectral CNN has several limitations. Firstly, the filters are basis-dependent, which means that
they cannot generalize across graphs. Secondly, the algorithm requires eigen decomposition, which
is computationally expensive. Thirdly, it has no guarantee of spatial localization of filters. To make
filters spatially localized, Henaff et al. [137] propose to use a smooth spectral transfer function
𝐹 (Λ) to parameterize the filter, and the convolution operation can be written as:
y = U𝐹 (Λ)U𝑇 x. (3)
Chebyshev Spectral Convolutional Neural Network (ChebNet) [66] extends this idea by using
truncated Chebyshev polynomials to approximate the spectral transfer function. The Chebyshev
polynomial is defined as 𝑇0 (𝑥) = 1, 𝑇1 (𝑥) = 𝑥, 𝑇𝑘 (𝑥) = 2𝑥𝑇𝑘−1 (𝑥) − 𝑇𝑘−2 (𝑥), and the spectral
transfer function 𝐹 (Λ) is approximated to the order of 𝐾 − 1 as
𝐹 (Λ) = ∑_{𝑘=0}^{𝐾−1} 𝜃𝑘 𝑇𝑘 ( Λ̃),    (4)

where the model parameters 𝜃 𝑘 , 𝑘 ∈ {0, 1, ..., 𝐾 − 1} are the Chebyshev coefficients, and Λ̃ =
2Λ/𝜆𝑚𝑎𝑥 − I is a diagonal matrix of scaled eigenvalues. Thus, the graph convolution can be written
as:
g ∗𝐺 x = U𝐹 (Λ)U^𝑇 x = U ∑_{𝑘=0}^{𝐾−1} 𝜃𝑘 𝑇𝑘 ( Λ̃) U^𝑇 x = ∑_{𝑘=0}^{𝐾−1} 𝜃𝑘 𝑇𝑘 ( L̃) x,    (5)
where L̃ = 2L/𝜆𝑚𝑎𝑥 − I.
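The following sketch illustrates the Chebyshev filtering of Eq. 5 with plain NumPy, using the recurrence 𝑇𝑘 = 2L̃𝑇𝑘−1 − 𝑇𝑘−2 applied directly to the signal; the toy Laplacian, the coefficients 𝜃𝑘 , and 𝜆𝑚𝑎𝑥 are assumed values for illustration only.

```python
import numpy as np

def cheb_conv(L, x, theta, lam_max):
    """Chebyshev spectral filtering as in Eq. 5: sum_k theta_k T_k(L_tilde) x.

    L: graph Laplacian (n x n), x: graph signal (n,), theta: list of K
    Chebyshev coefficients (model parameters, assumed given here).
    """
    n = L.shape[0]
    L_tilde = 2.0 * L / lam_max - np.eye(n)        # rescaled Laplacian
    Tx = [x, L_tilde @ x]                          # T_0(L~)x = x, T_1(L~)x = L~ x
    for _ in range(2, len(theta)):
        Tx.append(2.0 * L_tilde @ Tx[-1] - Tx[-2]) # T_k = 2 L~ T_{k-1} - T_{k-2}
    return sum(t * tk for t, tk in zip(theta, Tx[:len(theta)]))

# Toy usage with K = 3 coefficients (all values assumed).
L = np.array([[1.0, -0.5, -0.5], [-0.5, 1.0, -0.5], [-0.5, -0.5, 1.0]])
x = np.array([0.2, -1.0, 0.5])
y = cheb_conv(L, x, theta=[0.5, 0.3, 0.1], lam_max=1.5)
```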
Graph Convolutional Network (GCN) [182] is proposed as the localized first-order approximation
of ChebNet. Assuming 𝐾 = 2 and 𝜆𝑚𝑎𝑥 = 2, Eq. 5 can be simplified as:
g ∗𝐺 x = 𝜃 0 x + 𝜃 1 (L − I)x = 𝜃 0 x − 𝜃 1 D−1/2 AD−1/2 x. (6)
To further constrain the number of parameters, we assume 𝜃 = 𝜃 0 = −𝜃 1 , which gives a simpler
form of graph convolution:
g ∗𝐺 x = 𝜃 (I + D−1/2 AD−1/2 )x. (7)
As I + D−1/2 AD−1/2 now has eigenvalues in the range of [0, 2] and repeatedly multiplying
this matrix can lead to numerical instabilities, GCN empirically proposes a renormalization trick to
solve this problem by using D̃−1/2 ÃD̃−1/2 instead, where Ã = A + I and D̃𝑖𝑖 = ∑_{𝑗} Ã𝑖𝑗 .
Allowing multi-channel signals and adding activation functions, the following formula is more
commonly seen in the literature:
Y = 𝜎 (( D̃−1/2 ÃD̃−1/2 )XΘ), (8)
where X, Y have the same shapes as in Eq. 2 and Θ is a 𝑐𝑖𝑛 × 𝑐𝑜𝑢𝑡 matrix of model parameters.
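A minimal NumPy sketch of one GCN layer in the form of Eq. 8, including the renormalization trick, is given below; the toy adjacency, features, and weight matrix are assumed, and ReLU is an assumed choice for 𝜎.

```python
import numpy as np

def gcn_layer(A, X, Theta):
    """One GCN layer (Eq. 8): Y = sigma(D~^{-1/2} A~ D~^{-1/2} X Theta).

    A: adjacency (n x n), X: node features (n x c_in),
    Theta: weight matrix (c_in x c_out); ReLU is used as sigma here.
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n)                       # renormalization trick: add self-loops
    d_tilde = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # D~^{-1/2} A~ D~^{-1/2}
    return np.maximum(A_hat @ X @ Theta, 0.0)     # ReLU activation

# Toy usage (shapes assumed): 4 nodes, 3 input channels, 2 output channels.
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
X = np.random.randn(4, 3)
Theta = np.random.randn(3, 2)
Y = gcn_layer(A, X, Theta)   # shape (4, 2)
```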
Apart from the aforementioned methods, other spectral graph convolutions have been proposed.
Levie et al. [203] propose CayleyNets that utilize Cayley Polynomials to equip the filters with
the ability to detect narrow frequency bands. Liao et al. [224] propose LanczosNets, which employ the
Lanczos algorithm to construct a low-rank approximation of the graph Laplacian in order to improve
the computation efficiency of graph convolutions. The proposed model is able to efficiently utilize
the multi-scale information in the graph data. Instead of using the Graph Fourier Transform, Xu et
al. [399] propose Graph Wavelet Neural Network (GWNN), which uses the graph wavelet transform to
avoid matrix eigendecomposition. Moreover, graph wavelets are sparse and localized, which provide
good interpretations for the convolution operation. Zhu et al. [466] derive a Simple Spectral Graph


Convolution (S²GC) from a modified Markov Diffusion Kernel, which achieves a trade-off between
low-pass and high-pass filter bands.

3.2 Spatial Graph Convolutions


Inspired by the convolution on Euclidean data (e.g. images and texts), which applies data trans-
formation on a small region, spatial graph convolutions compute the central node’s feature via
transforming and aggregating its neighbors’ features. In this way, the graph structure is naturally
embedded in the computation graph of node features. Moreover, the idea of sending one node’s
feature to another node is similar to the message passing used in recurrent graph neural networks.
In the following, we will introduce several seminal spatial graph convolutions as well as some
recently proposed promising methods.
Spatial graph convolutions generally follow a three-step paradigm: message generation, feature
aggregation, and feature update. This can be mathematically written as:

y𝑖 = UPDATE(x𝑖 , AGGREGATE({MESSAGE(x𝑖 , x𝑗 , e𝑖𝑗 ), 𝑗 ∈ N (𝑖)})),    (9)

where x𝑖 and y𝑖 are the input and output feature vectors of node 𝑖, e𝑖𝑗 is the feature vector of the edge
(or more generally, the relationship) between node 𝑖 and its neighbor node 𝑗, and N (𝑖) denotes the
neighbors of node 𝑖, which could be more generally defined.
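The sketch below spells out this three-step template in plain Python; the particular message, aggregate, and update callables shown are assumed toy choices rather than any specific published model.

```python
import numpy as np

def message_passing(x, edge_index, edge_attr, message_fn, aggregate_fn, update_fn):
    """Generic spatial graph convolution following Eq. 9.

    x: node features (n x d); edge_index: list of (i, j) pairs meaning j is a
    neighbor of i; edge_attr: dict mapping (i, j) -> edge feature (may be None).
    The three callables stand in for MESSAGE, AGGREGATE and UPDATE.
    """
    n = x.shape[0]
    inbox = [[] for _ in range(n)]
    for (i, j) in edge_index:
        # MESSAGE: computed from the receiver x_i, the sender x_j and e_ij.
        inbox[i].append(message_fn(x[i], x[j], edge_attr.get((i, j))))
    y = []
    for i in range(n):
        agg = aggregate_fn(inbox[i]) if inbox[i] else np.zeros_like(x[i])
        y.append(update_fn(x[i], agg))      # UPDATE combines x_i with the aggregate
    return np.stack(y)

# A sum-aggregation instance (assumed toy choice of the three functions).
x = np.random.randn(3, 4)
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
y = message_passing(
    x, edges, {},
    message_fn=lambda xi, xj, e: xj,               # pass the neighbor feature
    aggregate_fn=lambda msgs: np.sum(msgs, axis=0),
    update_fn=lambda xi, agg: xi + agg,
)
```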
In the previous subsection, we show the spectral interpretation of GCN [182]. The model also
has its spatial interpretation, which can be mathematically written as:
y𝑖 = Θ^𝑇 ∑_{𝑗 ∈N (𝑖)∪{𝑖}} x𝑗 / √(𝑑̂𝑖 𝑑̂𝑗 ),    (10)

where 𝑑̂𝑖 and 𝑑̂𝑗 are the 𝑖-th and 𝑗-th row sums of Ã in Eq. 8. For each node, the model takes a weighted
sum of its neighbors’ features as well as its own feature and applies a linear transformation to obtain
the result. In practice, multiple GCN layers are often stacked together with non-linear functions after
convolution to encode complex and hierarchical features. Nonetheless, Wu et al. [383] show that
the model still achieves competitive results without non-linearity.
Although GCN as well as other spectral graph convolutions achieve competitive results on a
number of benchmarks, these methods assume the presence of all nodes in the graph and fall in the
category of transductive learning. Hamilton et al. [128] propose GraphSAGE that performs graph
convolutions in inductive settings, when there are new nodes during inference (e.g. newcomers
in the social network). For each node, the model samples its 𝐾-hop neighbors and uses 𝐾 graph
convolutions to aggregate their features hierarchically. Furthermore, the use of sampling also
reduces the computation when a node has too many neighbors.
The attention mechanism has been successfully used in natural language processing [350], com-
puter vision [235] and multi-modal tasks [49, 133, 427, 455]. Graph Attention Networks (GAT) [351]
introduce the idea of attention to graphs. The attention mechanism uses an adaptive, feature-
dependent weight (i.e. attention coefficient) to aggregate a set of features, which can be mathemati-
cally written as:
𝛼𝑖,𝑗 = exp(LeakyReLU(a^𝑇 [Θx𝑖 ∥ Θx𝑗 ])) / ∑_{𝑘 ∈N (𝑖)∪{𝑖}} exp(LeakyReLU(a^𝑇 [Θx𝑖 ∥ Θx𝑘 ])),    (11)
where 𝛼𝑖,𝑗 is the attention coefficient, a and Θ are model parameters, and [·||·] means concatenation.
After the 𝛼s are obtained, the new features are computed as a weighted sum of input node features,
which is:

y𝑖 = 𝛼𝑖,𝑖 Θx𝑖 + ∑_{𝑗 ∈N (𝑖)} 𝛼𝑖,𝑗 Θx𝑗 .    (12)
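A single-head NumPy sketch of Eqs. 11-12 follows; the projection Θ, attention vector a, the LeakyReLU slope, and the toy graph are all assumed values used only to make the attention computation concrete.

```python
import numpy as np

def gat_attention(x, neighbors, Theta, a):
    """GAT-style attention (Eqs. 11-12) for a single head, NumPy sketch.

    x: node features (n x d_in); neighbors: dict node -> list of neighbor ids;
    Theta: (d_in x d_out) projection; a: (2*d_out,) attention vector.
    """
    def leaky_relu(z, slope=0.2):
        return np.where(z > 0, z, slope * z)

    h = x @ Theta                                   # Θx_i for every node
    y = np.zeros_like(h)
    for i, nbrs in neighbors.items():
        cand = nbrs + [i]                           # attend over N(i) ∪ {i}
        # e_ik = LeakyReLU(a^T [Θx_i || Θx_k])
        e = np.array([leaky_relu(a @ np.concatenate([h[i], h[k]])) for k in cand])
        alpha = np.exp(e - e.max())
        alpha = alpha / alpha.sum()                 # softmax over the neighborhood
        # y_i = sum_k alpha_ik Θx_k (Eq. 12, self term included)
        y[i] = sum(alpha[t] * h[k] for t, k in enumerate(cand))
    return y

# Toy usage with assumed shapes.
x = np.random.randn(3, 4)
nbrs = {0: [1], 1: [0, 2], 2: [1]}
y = gat_attention(x, nbrs, Theta=np.random.randn(4, 2), a=np.random.randn(4))
```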


Xu et al. [403] explore the representational limitations of graph neural networks. What they
discover is that message passing networks like GCN [182] and GraphSAGE [128] are incapable of
distinguishing certain graph structures. To improve the representational power of graph neural
networks, they propose Graph Isomorphism Network (GIN) that gives an adjustable weight to the
central node feature, which can be mathematically written as:
y𝑖 = MLP((1 + 𝜖)x𝑖 + ∑_{𝑗 ∈N (𝑖)} x𝑗 ),    (13)
where 𝜖 is a learnable parameter.
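A minimal sketch of the GIN update in Eq. 13 is shown below; the assumed one-layer ReLU MLP and the toy inputs are placeholders standing in for the full multi-layer perceptron used in practice.

```python
import numpy as np

def gin_layer(A, X, eps, mlp):
    """GIN update (Eq. 13): y_i = MLP((1 + eps) * x_i + sum_{j in N(i)} x_j).

    A: adjacency (n x n), X: node features (n x d), eps: learnable scalar,
    mlp: any callable mapping (n x d) -> (n x d'); a one-layer MLP is assumed.
    """
    neighbor_sum = A @ X                 # sum of neighbor features for each node
    return mlp((1.0 + eps) * X + neighbor_sum)

# Toy usage with an assumed one-layer ReLU MLP.
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X = np.random.randn(3, 4)
W = np.random.randn(4, 8)
Y = gin_layer(A, X, eps=0.1, mlp=lambda H: np.maximum(H @ W, 0.0))
```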
More recently, efforts have been made to improve the representational power of graph neural
networks. For example, Hu et al. [143] propose GraphAIR that explicitly models the neighborhood
interaction to better capture complex non-linear features. Specifically, they use Hadamard product
between pairs of nodes in the neighborhood to model the quadratic terms of neighborhood interac-
tion. Balcilar et al. [15] propose GNNML3 that break the limits of first-order Weisfeiler-Lehman
test (1-WL) and reach third-order WL test (3-WL) experimentally. They also show that Hadamard
product is required for the model to have more representational power than first-order Weisfeiler-
Lehman test. Other elements in spatial graph convolutions are widely studied. For example, Corso
et al. [64] explore the aggregation operation in GNN and propose Principal Neighbourhood Ag-
gregation (PNA) that uses multiple aggregators with degree-scalers. Tailor et al. [340] explores
the anisotropism and isotropism in the message passing process of graph neural networks, and
propose Efficient Graph Convolution (EGC) that achieves promising results with reduced memory
consumption due to isotropism.

3.3 Summary
This section introduces graph convolutions and we provide the summary as follows:
• Techniques. Graph convolutions mainly fall into two types, i.e. spectral graph convolu-
tions and spatial graph convolutions. Spectral graph convolutions have solid mathematical
foundations of Graph Signal Processing and therefore their operations have theoretical in-
terpretations. Spatial graph convolutions are inspired by Recurrent Graph Neural Networks
and their computation is simple and straightforward, as their computation graph is derived
from the local graph structure. Generally, spatial graph convolutions are more common in
applications.
• Challenges and Limitations. Despite the great success of graph convolutions, their perfor-
mance is unsatisfactory in more complicated applications. On the one hand, the performance
of graph convolutions relies heavily on the construction of the graph. Different constructions
of the graph might result in different performance of graph convolutions. On the other
hand, graph convolutions are prone to over-smoothing when constructing very deep neural
networks.
• Future Works. In the future, we expect that more powerful graph convolutions will be de-
veloped to mitigate the problem of over-smoothing, and we also hope that techniques and
methodologies in Graph Structure Learning (GSL) can help learn more meaningful graph
structure to benefit the performance of graph convolutions.

4 Graph Kernel Neural Networks By Qingqing Long


Graph kernels (GKs) are historically the most widely used technique for graph analysis and
representation tasks [107, 313, 463]. However, traditional graph kernels rely on hand-crafted
patterns or domain knowledge for specific tasks [194, 316]. Over the years, a substantial amount of research


has been conducted on graph kernel neural networks (GKNNs), which has yielded promising results.
Researchers have explored various aspects of GKNNs, including their theoretical foundations,
algorithmic design, and practical applications. These efforts have led to the development of a wide
range of GKNN-based models and methods that can be used for graph analysis and representation
tasks, such as node classification, link prediction, and graph clustering [44, 237, 238].
The success of GKNNs can be attributed to their ability to leverage the strengths of both graph
kernels and neural networks. By using kernel functions to measure similarity between graphs,
GKNNs can capture the structural properties of graphs, while the use of neural networks enables
them to learn more complex and abstract representations of graphs. This combination of techniques
allows GKNNs to achieve state-of-the-art performance on a wide range of graph-related tasks.
In this section, we begin by introducing the most representative traditional graph kernels.
Then we summarize the basic framework for combining GNNs and graph kernels. Finally, we
group the popular graph kernel neural networks into several categories and compare their
differences.

4.1 Graph Kernels


Graph kernels generally evaluate pairwise similarity between nodes or graphs by decomposing
them into basic structural units. Random walks [175], subtrees [315], shortest paths [24] and
graphlets [316] are representative categories.
Given two graphs 𝐺 1 = (𝑉1, 𝐸 1, 𝑋 1 ) and 𝐺 2 = (𝑉2, 𝐸 2, 𝑋 2 ), a graph kernel function 𝐾 (𝐺 1, 𝐺 2 )
measures the similarity between 𝐺 1 and 𝐺 2 through the following formula:
𝐾 (𝐺 1, 𝐺 2 ) = ∑_{𝑢 1 ∈𝑉1} ∑_{𝑢 2 ∈𝑉2} 𝜅𝑏𝑎𝑠𝑒 (𝑙𝐺 1 (𝑢 1 ), 𝑙𝐺 2 (𝑢 2 )),    (14)

where 𝑙𝐺 (𝑢) denotes a set of local substructures centered at node 𝑢 in graph 𝐺, and 𝜅𝑏𝑎𝑠𝑒 is a base
kernel measuring the similarity between the two sets of substructures. For simplicity, we may
rewrite Eqn. 14 as:

𝐾 (𝐺 1, 𝐺 2 ) = ∑_{𝑢 1 ∈𝑉1} ∑_{𝑢 2 ∈𝑉2} 𝜅𝑏𝑎𝑠𝑒 (𝑢 1, 𝑢 2 ),    (15)

where the uppercase 𝐾 (𝐺 1, 𝐺 2 ) denotes graph kernels, 𝜅 (𝑢 1, 𝑢 2 ) denotes node kernels, and the
lowercase 𝑘 (𝑥, 𝑦) denotes general kernel functions.
The kernel mapping 𝜓 of a kernel maps a data point into its corresponding Reproducing Kernel
Hilbert Space (RKHS) H . Specifically, given a kernel 𝑘 ∗ (·, ·), its kernel mapping 𝜓 ∗ can be formulated
as,
∀𝑥 1, 𝑥 2, 𝑘 ∗ (𝑥 1, 𝑥 2 ) = ⟨𝜓 ∗ (𝑥 1 ),𝜓 ∗ (𝑥 2 )⟩ H∗ , (16)
where H∗ is the RKHS of 𝑘 ∗ (·, ·).
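The following sketch shows the generic pairwise form of Eq. 15 in plain Python; the per-node descriptors and the linear base kernel are assumed toy choices, not a specific kernel from the literature.

```python
import numpy as np

def graph_kernel(G1_nodes, G2_nodes, kappa_base):
    """Pairwise graph kernel in the form of Eq. 15:
    K(G1, G2) = sum_{u1 in V1} sum_{u2 in V2} kappa_base(u1, u2).

    G1_nodes / G2_nodes: iterables of per-node substructure descriptors
    (e.g. feature vectors or label multisets); kappa_base: base node kernel.
    """
    return sum(kappa_base(u1, u2) for u1 in G1_nodes for u2 in G2_nodes)

# Assumed toy instance: nodes described by feature vectors, with a linear
# base kernel kappa_base(u1, u2) = <u1, u2>.
V1 = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
V2 = [np.array([1.0, 1.0])]
K12 = graph_kernel(V1, V2, kappa_base=lambda u, v: float(u @ v))
```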
We introduce several representative and popularly used graph kernels below.
Walk and Path Kernels. An 𝑙-walk kernel 𝐾_{𝑤𝑎𝑙𝑘}^{(𝑙)} compares all length-𝑙 walks starting from each
node in two graphs 𝐺 1, 𝐺 2 ,

𝜅_{𝑤𝑎𝑙𝑘}^{(𝑙)} (𝑢 1, 𝑢 2 ) = ∑_{𝑤1 ∈W𝑙 (𝐺 1 ,𝑢 1 )} ∑_{𝑤2 ∈W𝑙 (𝐺 2 ,𝑢 2 )} 𝛿 (𝑋 1 (𝑤 1 ), 𝑋 2 (𝑤 2 )),
𝐾_{𝑤𝑎𝑙𝑘}^{(𝑙)} (𝐺 1, 𝐺 2 ) = ∑_{𝑢 1 ∈𝑉1} ∑_{𝑢 2 ∈𝑉2} 𝜅_{𝑤𝑎𝑙𝑘}^{(𝑙)} (𝑢 1, 𝑢 2 ).    (17)

Substituting W with P yields the 𝑙-path kernel.


Subtree Kernels. The WL subtree kernel is the most popular subtree kernel. It is a finite-
depth kernel variant of the 1-WL test. The WL subtree kernel with depth 𝑙, 𝐾_{𝑊𝐿}^{(𝑙)}, compares all


subtrees with depth ≤ 𝑙 rooted at each node.


𝜅_{𝑠𝑢𝑏𝑡𝑟𝑒𝑒}^{(𝑖)} (𝑢 1, 𝑢 2 ) = ∑_{𝑡 1 ∈T^{(𝑖)} (𝐺 1 ,𝑢 1 )} ∑_{𝑡 2 ∈T^{(𝑖)} (𝐺 2 ,𝑢 2 )} 𝛿 (𝑡 1, 𝑡 2 ),
𝐾_{𝑠𝑢𝑏𝑡𝑟𝑒𝑒}^{(𝑖)} (𝐺 1, 𝐺 2 ) = ∑_{𝑢 1 ∈𝑉1} ∑_{𝑢 2 ∈𝑉2} 𝜅_{𝑠𝑢𝑏𝑡𝑟𝑒𝑒}^{(𝑖)} (𝑢 1, 𝑢 2 ),    (18)
𝐾_{𝑊𝐿}^{(𝑙)} (𝐺 1, 𝐺 2 ) = ∑_{𝑖=0}^{𝑙} 𝐾_{𝑠𝑢𝑏𝑡𝑟𝑒𝑒}^{(𝑖)} (𝐺 1, 𝐺 2 ),

where 𝑡 ∈ T (𝑖) (𝐺, 𝑢) denotes a subtree of depth 𝑖 rooted at 𝑢 in 𝐺.
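As a concrete illustration, the sketch below computes a WL subtree kernel via the usual iterative label-refinement view, in which matching node labels at each depth correspond to matching rooted subtrees and the per-depth counts are summed as in Eq. 18; the toy graphs and labels are assumed.

```python
from collections import Counter

def wl_subtree_kernel(adj1, labels1, adj2, labels2, depth):
    """WL subtree kernel K_WL^(l), sketched via iterative label refinement.

    adj*: dict node -> list of neighbors; labels*: dict node -> initial label.
    At each depth the kernel counts matching node labels across the two graphs,
    and the counts for depths 0..l are summed, mirroring Eq. 18.
    """
    def refine(adj, labels):
        # New label of u = (old label of u, sorted multiset of neighbor labels).
        return {u: (labels[u], tuple(sorted(labels[v] for v in adj[u]))) for u in adj}

    total, l1, l2 = 0.0, dict(labels1), dict(labels2)
    for _ in range(depth + 1):
        c1, c2 = Counter(l1.values()), Counter(l2.values())
        total += sum(c1[t] * c2[t] for t in c1)   # matching subtree pairs at this depth
        l1, l2 = refine(adj1, l1), refine(adj2, l2)
    return total

# Toy usage: two small labeled graphs (assumed example).
adjA = {0: [1], 1: [0, 2], 2: [1]}
adjB = {0: [1], 1: [0]}
k = wl_subtree_kernel(adjA, {0: 'C', 1: 'O', 2: 'C'}, adjB, {0: 'C', 1: 'O'}, depth=2)
```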

4.2 General Framework of GKNNs


In this section, we summarize the general framework of GKNNs. As the first step, a kernel that
measures the similarities of heterogeneous features from heterogeneous nodes and edges (𝑢 1, 𝑒 ·,𝑢1 )
and (𝑢 2, 𝑒 ·,𝑢2 ) should be defined. Taking the inner product of neighbor tensors as an example, the
neighbor kernel is defined as follows,
𝜅 ((𝑢 1, 𝑒 ·,𝑢1 ), (𝑢 2, 𝑒 ·,𝑢2 )) = ⟨𝑓 (𝑢 1 ), 𝑓 (𝑢 2 )⟩ · ⟨𝑓 (𝑒 ·,𝑢1 ), 𝑓 (𝑒 ·,𝑢2 )⟩.
Based on the neighbor kernel, a kernel comparing the two 𝑙-hop neighborhoods of central nodes 𝑢 1 and 𝑢 2
can be defined as

𝐾^{(𝑙)} (𝑢 1, 𝑢 2 ) = ⟨𝑓 (𝑢 1 ), 𝑓 (𝑢 2 )⟩,  𝑙 = 0,
𝐾^{(𝑙)} (𝑢 1, 𝑢 2 ) = ⟨𝑓 (𝑢 1 ), 𝑓 (𝑢 2 )⟩ · ∑_{𝑣1 ∈𝑁 (𝑢 1 )} ∑_{𝑣2 ∈𝑁 (𝑢 2 )} 𝐾^{(𝑙−1)} (𝑣 1, 𝑣 2 ) · ⟨𝑓 (𝑒 ·,𝑣1 ), 𝑓 (𝑒 ·,𝑣2 )⟩,  𝑙 > 0,    (19)

The lower-hop kernel 𝐾^{(𝑙−1)} (𝑢 1, 𝑢 2 ) can be regarded as the inner product of the (𝑙 − 1)-th hidden
representations of 𝑢 1 and 𝑢 2 . Furthermore, by recursively applying the neighborhood kernel, the
𝑙-hop graph kernel can be derived as

𝐾^𝑙 (𝐺 1, 𝐺 2 ) = ∑_{𝒘 1 ∈W𝑙 (𝐺 1 )} ∑_{𝒘 2 ∈W𝑙 (𝐺 2 )} ( ∏_{𝑖=0}^{𝑙−1} ⟨𝑓 (𝒘 1^{(𝑖)} ), 𝑓 (𝒘 2^{(𝑖)} )⟩ × ∏_{𝑖=0}^{𝑙−2} ⟨𝑓 (𝑒_{𝒘 1^{(𝑖)} ,𝒘 1^{(𝑖+1)}} ), 𝑓 (𝑒_{𝒘 2^{(𝑖)} ,𝒘 2^{(𝑖+1)}} )⟩ ),    (20)

where W𝑙 (𝐺) denotes the set of all walk sequences of length 𝑙 in graph 𝐺, and 𝒘 1^{(𝑖)} denotes the
𝑖-th node in sequence 𝒘 1 .
As shown in Eqn. (16), kernel methods implicitly perform projections from original data spaces
to their RKHS H . Hence, as GNNs also project nodes or graphs into vector spaces, connections
have been established between GKs and GNNs through the kernel mappings. Several works have
conducted research on these connections [202, 381] and established foundational conclusions. Taking
the basic rule introduced in [202] as an example, the proposed graph kernel in Eqn. (14) can be
derived as the general formulas,

ℎ^{(0)} (𝑣) = 𝑾^{(0)}_{𝑡𝑉 (𝑣)} 𝑓 (𝑣),
ℎ^{(𝑙)} (𝑣) = 𝑾^{(𝑙)}_{𝑡𝑉 (𝑣)} 𝑓 (𝑣) ⊙ ∑_{𝑢 ∈𝑁 (𝑣)} (𝑼^{(𝑙)}_{𝑡𝑉 (𝑣)} ℎ^{(𝑙−1)} (𝑢) ⊙ 𝑼^{(𝑙)}_{𝑡𝐸 (𝑒𝑢,𝑣 )} 𝑓 (𝑒𝑢,𝑣 )),  1 < 𝑙 ≤ 𝐿,    (21)

where ⊙ is the element-wise product and ℎ^{(𝑙)} (𝑣) is the cell state vector of node 𝑣. The parameter
matrices 𝑾^{(𝑙)}_{𝑡𝑉 (𝑣)} , 𝑼^{(𝑙)}_{𝑡𝑉 (𝑣)} and 𝑼^{(𝑙)}_{𝑡𝐸 (𝑒𝑢,𝑣 )} are learnable parameters related to the types of nodes and edges.


Table 2. Summary of popular GKNNs.

Method Type Related GK Adaptive Datasets


𝑘-GNN [264] Isomorphic WL Subtree ✓ Biochemical Network
RetGK [449] Isomorphic Random Walk ✓ Biochemical Network, Social Network
GNTK [77] Isomorphic Neural Tangent Kernel Biochemical Network, Social Network
DDGK [4] Isomorphic Random Walk Biochemical Network
GCKN [44] Isomorphic Random Walk ✓ Biochemical Network, Social Network
GSKN [237] Isomorphic Random Walk, Anonymous Walk ✓ Biochemical Network, Social Network
GCN-LASE [222] Heterogeneous Random Walk - Social Network, Academic Network
HGK-GNN [238] Heterogeneous Random Walk ✓ Social Network, Academic Network

Then the mean embedding of all nodes is usually used to represent the graph-level embedding. Let
|𝐺𝑖 | denote the number of nodes in the 𝑖-th graph; then the graph-level embeddings are generated
as,

Φ(𝐺𝑖 ) = (1/|𝐺𝑖 |) ∑_{𝑣 ∈𝐺𝑖} ℎ^{(𝐿)} (𝑣).    (22)

For semi-supervised multiclass classification, the cross entropy is used as the objective function
over all training examples [36, 37],
L = − ∑_{𝑙 ∈Y𝐿} ∑_{𝑓 =1}^{𝐹} 𝑌𝑙𝑓 ln 𝑍𝑙𝑓 ,    (23)

where Y𝐿 is the set of node indices which have labels in node classification tasks, or the set of graph
indices in graph classification tasks. 𝑍𝑙𝑓 denotes the predicted labels, which are the outputs of a
linear layer with an activation function, taking ℎ^{(𝑙)} (𝑣) as input in node classification tasks and Φ(𝐺𝑖 ) in
graph classification tasks.

4.3 Popular Variants of GKNNs


We summarize the popular variants of GKNNs and compare their differences in Table 2. Specifically,
we summarize their basic graph kernels, whether they are designed for heterogeneous graphs, their experi-
mental datasets, etc. As the originally designed graph-kernel based GNNs have high complexity, they are
usually accompanied by acceleration strategies, such as sampling, simplification, and approximation.
In this section, we select four typical and popular GKNNs to introduce their well-designed
graph kernels and corresponding GNN frameworks.
𝑘-dimensional Graph Neural Networks [264] (𝑘-GNN) is the pioneer of GKNNs; it incorporates
the WL-subtree graph kernel into graph neural networks. For better scalability, the paper
considers a set-based version of the 𝑘-WL. Let ℎ_𝑘^{(𝑙)} (𝑠) and ℎ_{𝑘,𝐿}^{(𝑙)} (𝑠) denote the global and local hidden
representations for node 𝑠 in the 𝑙-th layer, respectively. 𝑘-GNN defines the end-to-end hierarchical


trainable framework as,


" # !
∑︁
ℎ𝑘( 0) (𝑠) =𝜎 ℎ 𝑖𝑠𝑜
(𝑠), ℎ (𝑇𝑘 −1)
(𝑢) ( 0)
· 𝑾 𝑘− 1
,
𝑢 ∈𝑠

∑︁
ℎ𝑘(𝑙) (𝑠) =𝜎 ­ℎ𝑘(𝑙−1) · 𝑾 1(𝑙) + ℎ𝑘(𝑙−1) (𝑢) · 𝑾 2(𝑙) ® , 1 < 𝑙 ≤ 𝐿, (24)
© ª

« 𝑢 ∈𝑁𝐿 (𝑠)∪𝑁𝐺 (𝑠) ¬


∑︁
(𝑙) © (𝑙−1)
(𝑠) =𝜎 ­ℎ𝑘,𝐿 (𝑠) · 𝑾 1(𝑙) + (𝑙−1)
(𝑢) · 𝑾 2(𝑙) ® , 1 < 𝑙 ≤ 𝐿.
ª
ℎ𝑘,𝐿 ℎ𝑘,𝐿
« 𝑢 ∈𝑁 𝐿 (𝑠) ¬
where ℎ𝑖𝑠𝑜 (𝑠) is the one-hot encoding of the isomorphism type of 𝐺 [𝑠], 𝑁𝐿 (𝑠) is the local neigh-
(𝑙)
borhood, 𝑁𝐺 (𝑠) is the global neighborhood, ℎ𝑘,𝐿 (𝑠) is designed for better scalability and running
speed, and ℎ𝑘(𝑙) (𝑠) has better performance due to its larger neighbor sets.
Graph Convolutional Kernel Networks [44] (GCKN). GCKN is a representative random-walk
and path-based GKNN. Its Gaussian kernel 𝑘 can be written as,

𝑘 (𝑧 1, 𝑧 2 ) = 𝑒^{−(𝛼/2)∥𝑧 1 −𝑧 2 ∥²} = 𝑒^{𝛼(𝑧 1^𝑇 𝑧 2 −1)} = 𝜎 (𝑧 1^𝑇 𝑧 2 ),    (25)

then the GNN architecture can be written as


ℎ^{(𝑙)} (𝑢) = ∑_{𝑝 ∈ P𝑘 (𝐺,𝑢)} 𝐾 (𝑧 1, 𝑧 2 ) · 𝜎(𝑍^𝑇 ℎ^{(𝑙−1)} (𝑝)) = 𝜎(𝑍𝑍^𝑇 )^{−1/2} ∑_{𝑝 ∈ P𝑘 (𝐺,𝑢)} 𝜎(𝑍^𝑇 ℎ^{(𝑙−1)} (𝑝)),    (26)

where 𝑍 is the matrix of prototype path attributes.


Furthermore, the paper analyzes the relationship between GCKN and the WL-subtree based
𝑘-GNN. Theorem 1 in [44] shows that WL-subtree based GKNNs can be seen as a special
case of GCKN.
Graph Neural Tangent Kernel [77] (GNTK). Different from the above two works, GNTK
proposes a new class of graph kernels. GNTK is a general recipe that translates a GNN architecture
to its corresponding GNTK.
The neighborhood aggregation operation in GNTK is defined as,

[Σ^{(𝑙)}_{(0)} (𝐺, 𝐺′)]_{𝑢𝑢′} = 𝑐𝑢 𝑐𝑢′ ∑_{𝑣 ∈𝑁 (𝑢)∪{𝑢}} ∑_{𝑣′ ∈𝑁 (𝑢′)∪{𝑢′}} [Σ^{(𝑙−1)}_{(𝑅)} (𝐺, 𝐺′)]_{𝑣𝑣′} ,
[Θ^{(𝑙)}_{(0)} (𝐺, 𝐺′)]_{𝑢𝑢′} = 𝑐𝑢 𝑐𝑢′ ∑_{𝑣 ∈𝑁 (𝑢)∪{𝑢}} ∑_{𝑣′ ∈𝑁 (𝑢′)∪{𝑢′}} [Θ^{(𝑙−1)}_{(𝑅)} (𝐺, 𝐺′)]_{𝑣𝑣′} ,    (27)

where [Σ^{(0)}_{(𝑅)} (𝐺, 𝐺′)] and [Θ^{(0)}_{(𝑅)} (𝐺, 𝐺′)] are defined as [Σ^{(0)} (𝐺, 𝐺′)]. Then GNTK performs 𝑅 transformations as follows,
[𝐴^{(𝑙)}_{(𝑟)} (𝐺, 𝐺′)]_{𝑢𝑢′} = ( [Σ^{(𝑙)}_{(𝑟−1)} (𝐺, 𝐺)]_{𝑢𝑢} , [Σ^{(𝑙)}_{(𝑟−1)} (𝐺, 𝐺′)]_{𝑢𝑢′} ; [Σ^{(𝑙)}_{(𝑟−1)} (𝐺′, 𝐺)]_{𝑢′𝑢} , [Σ^{(𝑙)}_{(𝑟−1)} (𝐺′, 𝐺′)]_{𝑢′𝑢′} ).    (28)


[Σ^{(𝑙)}_{(𝑟)} (𝐺, 𝐺′)]_{𝑢𝑢′} = E_{(𝑎,𝑏)∼N(0, [𝐴^{(𝑙)}_{(𝑟)} (𝐺,𝐺′)]_{𝑢𝑢′})} [𝜎 (𝑎) · 𝜎 (𝑏)],
[Σ̇^{(𝑙)}_{(𝑟)} (𝐺, 𝐺′)]_{𝑢𝑢′} = E_{(𝑎,𝑏)∼N(0, [𝐴^{(𝑙)}_{(𝑟)} (𝐺,𝐺′)]_{𝑢𝑢′})} [𝜎̇ (𝑎) · 𝜎̇ (𝑏)],    (29)
then the 𝑟-th order quantities can be calculated as,

[Θ^{(𝑙)}_{(𝑟)} (𝐺, 𝐺′)]_{𝑢𝑢′} = [Θ^{(𝑙)}_{(𝑟−1)} (𝐺, 𝐺′)]_{𝑢𝑢′} [Σ̇^{(𝑙)}_{(𝑟)} (𝐺, 𝐺′)]_{𝑢𝑢′} + [Σ^{(𝑙)}_{(𝑟)} (𝐺, 𝐺′)]_{𝑢𝑢′} .    (30)
Finally, GNTK calculates the final output as
" 𝐿
#

∑︁ ∑︁ ′
Θ(𝐺, 𝐺 ) = Θ (𝑙)
(𝑅)
(𝐺, 𝐺 ) . (31)
′ ′ 𝑙=0 ′
𝑢 ∈𝑉 ,𝑢 ∈𝑉 𝑢,𝑢
Heterogeneous Graph Kernel based Graph Neural Network [238] (HGK-GNN). HGK-GNN
first proposes a GKNN for heterogeneous graphs. It adopts ⟨𝑓 (𝑢 1 ), 𝑓 (𝑢 2 )⟩𝑀 as the graph kernel based
on the Mahalanobis distance to build connections among heterogeneous nodes and edges,
⟨𝑓 (𝑢 1 ), 𝑓 (𝑢 2 )⟩𝑀1 = 𝑓 (𝑢 1 )𝑇 𝑴 1 𝑓 (𝑢 2 ),
⟨𝑓 (𝑒 ·,𝑢1 ), 𝑓 (𝑒 ·,𝑢2 )⟩𝑀2 = 𝑓 (𝑒 ·,𝑢1 )𝑇 𝑴 2 𝑓 (𝑒 ·,𝑢2 ).
Following the route introduced in [202] , the corresponding neural network architecture of
heterogeneous graph kernel can be derived as
ℎ^{(0)} (𝑣) = 𝑾^{(0)}_{𝑡𝑉 (𝑣)} 𝑓 (𝑣),
ℎ^{(𝑙)} (𝑣) = 𝑾^{(𝑙)}_{𝑡𝑉 (𝑣)} 𝑓 (𝑣) ⊙ ∑_{𝑢 ∈𝑁 (𝑣)} (𝑼^{(𝑙)}_{𝑡𝑉 (𝑣)} ℎ^{(𝑙−1)} (𝑢) ⊙ 𝑼^{(𝑙)}_{𝑡𝐸 (𝑒𝑢,𝑣 )} 𝑓 (𝑒𝑢,𝑣 )),  1 < 𝑙 ≤ 𝐿,    (32)

where ℎ^{(𝑙)} (𝑣) is the cell state vector of node 𝑣, and 𝑾^{(𝑙)}_{𝑡𝑉 (𝑣)} , 𝑼^{(𝑙)}_{𝑡𝑉 (𝑣)} , 𝑼^{(𝑙)}_{𝑡𝐸 (𝑒𝑢,𝑣 )} are learnable parameters.

4.4 Summary
This section introduces graph kernel neural networks. We provide the summary as follows:
• Techniques. Graph kernel neural networks (GKNNs) are a recent popular research area
that combines the advantages of graph kernels and GNNs to learn more effective graph
representations. Researchers have studied GKNNs in various aspects, such as theoretical
foundations, algorithmic design, and practical applications. As a result, a wide range of GKNN-
based models and methods have been developed for graph analysis and representation tasks,
including node classification, link prediction, and graph clustering.
• Challenges and Limitations. Although GKNNs have shown great potential in graph-related
tasks, they also have several limitations that need to be addressed. Scalability is a significant
challenge, particularly when dealing with large-scale graphs and networks. As the size of
the graph increases, the computational cost of GKNNs grows exponentially, which can limit
their ability to handle large and complex real-world applications.
• Future Works. For future works, we expect the GKNNs can integrate more domain-specific
knowledge into the designed kernels. Domain-specific knowledge has been shown to signifi-
cantly improve the performance of many applications, such as drug discovery, knowledge
graph-based information retrieval systems, and molecular analysis [90, 360]. Incorporating
domain-specific knowledge into GKNNs can enhance their ability to handle complex and
diverse data structures, leading to more accurate and interpretable models.


Table 3. Summary of graph pooling methods.

Method Type TopK-based Cluster-based Attention Mechanism


Mean/Sum/Max
Global
Pooling
GGS-NN [221] Global ✓
SortPool [437] Global
SSRead [199] Global
gPool [104] Hierarchical ✓
SAGPool [201] Hierarchical ✓ ✓
HGP-SL [445] Hierarchical ✓ ✓
TAPool [105] Hierarchical ✓
DiffPool [417] Hierarchical ✓
MinCutPool [22] Hierarchical ✓
SEP [384] Hierarchical ✓
ASAP [292] Hierarchical ✓ ✓ ✓
MuchPool [76] Hierarchical ✓ ✓

5 Graph Pooling By Yiyang Gu


When it comes to graph-level tasks, such as graph classification and graph regression, graph
pooling is an essential component for generating the whole graph representation from the learned
node embeddings. To ensure isomorphic graphs have the same representation, the graph pooling
operations should be invariant to the permutations of nodes. In this section, we give a systematic
review of existing graph pooling algorithms and generally classify them into two categories: global
pooling algorithms and hierarchical pooling algorithms. The global pooling algorithms aggregate
the node embeddings to the final graph representation directly, while the hierarchical pooling
algorithms reduce the graph size and generate intermediate representations gradually to capture
the hierarchical structure and characteristics of the input graph. A summary is provided in Table 3.

5.1 Global Pooling


Global pooling operations generate a holistic graph representation from the learned node em-
beddings in one step, which are referred to as readout functions in some literature [64, 403] as
well. Several simple permutation-invariant operations, such as mean, sum, and max, are widely
employed on the node embeddings to output the graph-level representation. To enhance the adap-
tiveness of global pooling operators, GGS-NN [221] introduces a soft attention mechanism to
evaluate the importance of nodes for a particular task and then takes a weighted sum of the node
embeddings. SortPool [437] exploits Weisfeiler-Lehman methods [378] to sort nodes based on their
structural positions in the graph topology, and produces the graph representation from the sorted
node embeddings by traditional convolutional neural networks. Recently, Lee et al. propose a
structural-semantic pooling method, SSRead [199], which first aligns nodes and learnable structural
prototypes semantically, and then aggregates node representations in groups based on matching
structural prototypes.
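The sketch below illustrates these global readouts in NumPy, including a GGS-NN-style soft-attention variant; the scoring parameters and the toy embedding matrix are assumed values used only to show the permutation-invariant aggregation.

```python
import numpy as np

def global_readout(H, how="mean", attn_params=None):
    """Global (readout) pooling: collapse node embeddings H (n x d) into one
    graph-level vector. Mean/sum/max are the standard permutation-invariant
    choices; "attention" sketches a soft-attention readout with assumed
    parameters attn_params = (w, b) that score each node.
    """
    if how == "mean":
        return H.mean(axis=0)
    if how == "sum":
        return H.sum(axis=0)
    if how == "max":
        return H.max(axis=0)
    if how == "attention":
        w, b = attn_params
        scores = H @ w + b                       # one importance score per node
        alpha = np.exp(scores - scores.max())
        alpha = alpha / alpha.sum()              # softmax over nodes
        return (alpha[:, None] * H).sum(axis=0)  # weighted sum of node embeddings
    raise ValueError(how)

# Toy usage with assumed shapes.
H = np.random.randn(5, 8)
g_mean = global_readout(H, "mean")
g_attn = global_readout(H, "attention", attn_params=(np.random.randn(8), 0.0))
```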


5.2 Hierarchical Pooling


Different from global pooling methods, hierarchical pooling methods coarsen the graph gradually,
to preserve the structural information of the graph better. To adaptively coarsen the graph and learn
optimal hierarchical representations according to a particular task, many learnable hierarchical
pooling operators have been proposed in the past few years, which can be integrated with multifarious
graph convolution layers. There are two common approaches to coarsening the graph: one is
selecting important nodes and dropping the others by TopK selection, and the other is merging
nodes and generating the coarsened graph by clustering methods. We call the former TopK-based
pooling, and the latter cluster-based pooling in this survey. In addition, some works combine these
two types of pooling methods, which will be reviewed in the hybrid pooling section.
5.2.1 TopK-based Pooling. Typically, TopK-based pooling methods first learn a scoring function
to evaluate the importance of the nodes of the original graph. Based on the generated importance scores Z ∈ R |𝑉 |×1 ,
they select the top 𝐾 nodes out of all nodes,
𝑖𝑑𝑥 = TOP𝑘 (Z) , (33)
where 𝑖𝑑𝑥 denotes the index of the top 𝐾 nodes. Based on these selected nodes, most methods
directly employ the induced subgraph as the pooled graph,
Apool = A𝑖𝑑𝑥,𝑖𝑑𝑥 , (34)
where 𝐴𝑖𝑑𝑥,𝑖𝑑𝑥 denotes the adjacency matrix indexed by the selected rows and columns. Moreover,
to make the scoring function learnable, they further use score 𝑍 as a gate for the selected node
features,
Xpool = X𝑖𝑑𝑥,: ⊙ Z𝑖𝑑𝑥 , (35)
where 𝑋𝑖𝑑𝑥,: denotes the feature matrix indexed by the selected nodes, and ⊙ denotes the broadcasted
element-wise product. With the help of the gate mechanism, the scoring function can be trained
by back-propagation, to adaptively evaluate the importance of nodes according to a certain task.
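Before turning to specific methods, the following NumPy sketch instantiates the generic TopK template of Eqs. 33-35, using a gPool-style projection score (Eq. 36) as the scoring function; the projection vector and toy graph are assumed values.

```python
import numpy as np

def topk_pool(A, X, score, k):
    """Generic TopK pooling template (Eqs. 33-35).

    A: adjacency (n x n), X: node features (n x d), score: importance scores
    Z (n,), k: number of nodes to keep. Returns the induced subgraph and the
    gated features of the selected nodes.
    """
    idx = np.argsort(score)[::-1][:k]            # idx = TOP_k(Z)
    A_pool = A[np.ix_(idx, idx)]                 # A_pool = A[idx, idx]
    X_pool = X[idx] * score[idx, None]           # X_pool = X[idx, :] ⊙ Z[idx] (gate)
    return A_pool, X_pool, idx

# Toy usage; the scores here come from a gPool-style projection Z = Xp / ||p||
# (Eq. 36), with p standing in for a learned projection vector.
A = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], dtype=float)
X = np.random.randn(4, 3)
p = np.random.randn(3)
Z = X @ p / np.linalg.norm(p)
A2, X2, kept = topk_pool(A, X, Z, k=2)
```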
Several representative TopK-based pooling methods are reviewed in detail in the following.
gPool [104]. gPool is one of the first works to select the most important node subset from the
original graph to construct the coarsened graph by Top K operation. The key idea of gPool is to
evaluate the importance of all nodes by learning a projection vector p, which projects node features
to a scalar score,
Z 𝑗 = X 𝑗,: p/∥p∥, (36)
and then select nodes with K-highest scores to form the pooled graph.
Self-Attention Graph Pooling (SAGPool) [201]. Unlike gPool [104], which only uses node features
to generate projection scores, SAGPool captures both graph topology and node features to obtain
self-attention scores by graph convolution. The various formulas of graph convolution can be
employed to compute the self-attention score Z,
Z = 𝜎 (GNN(X, A)), (37)
where 𝜎 denotes the activation function, and GNN denotes various graph convolutional layers or
stacks of them, whose output dimension is one.
Hierarchical Graph Pooling with Structure Learning (HGP-SL) [445] . HGP-SL evaluates the
importance score of a node according to the information it contains given its neighbors. It supposes
that a node which can be easily represented by its neighborhood carries relatively little information.
Specifically, the importance score can be defined by the Manhattan distance between the original
node representation and the reconstructed one aggregated from its neighbors’ representation,
Z = ∥ (I − D−1 A) X ∥ 1 ,    (38)



where I denotes the identity matrix, D denotes the diagonal degree matrix of A, and ∥ · ∥ 1 means
ℓ1 norm. Furthermore, to reduce the loss of topology information, HGP-SL leverages structure
learning to learn a refined graph topology for the reserved nodes. Specifically, it utilizes the attention
mechanism to compute the similarity of two nodes as the weight of the reconstructed edge,
Ã^{pool}_{𝑖𝑗} = sparsemax(𝜎 (w^⊤ [𝑿^{pool}_{𝑖,:} ∥ 𝑿^{pool}_{𝑗,:} ]) + 𝜆 · 𝑨^{pool}_{𝑖𝑗} ),    (39)

where Ã^{pool} denotes the refined adjacency matrix of the pooled graph, sparsemax(·) truncates the
values below a threshold to zeros, w denotes a learnable weight vector, and 𝜆 is a weight parameter
between the original edges and the reconstructed edges. These reconstructed edges may capture
the underlying relationship between nodes disconnected due to node dropping.
Topology-Aware Graph Pooling (TAPool) [105]. TAPool takes both the local and global significance
of a node into account. On the one hand, it utilizes the average similarity between a node and its
neighbors to evaluate its local importance,
 
R̂ = (XX𝑇 ) ⊙ (D−1 A),  Z𝑙 = softmax((1/𝑛) R̂ 1𝑛 ),    (40)

where R̂ denotes the localized similarity matrix, and Z𝑙 denotes the local importance score. On
the other hand, it measures the global importance of a node according to the significance of its
one-hop neighborhood in the whole graph,
 
X̂ = D−1 AX,  Z𝑔 = softmax(X̂ p),    (41)
where p is a learnable and globally shared projector vector, similar to the aforementioned gPool [104].
However, X̂ here further aggregates the features from the neighborhood, which enables the global
importance score Z𝑔 to capture more topology information such as salient subgraphs. Moreover,
TAPool encourages connectivity in the coarsened graph with the help of a degree-based connectivity
term, then obtaining the final importance score Z = Z𝑙 + Z𝑔 + 𝜆D/|𝑉 |, where 𝜆 is a trade-off
hyperparameter.
5.2.2 Cluster-based Pooling. Pooling the graph by clustering and merging nodes is the main
concept behind cluster-based pooling methods. Typically, they allocate nodes to a collection of
clusters by learning a cluster assignment matrix S ∈ R |𝑉 |×𝐾 , where 𝐾 is the number of the clusters.
After that, they merge the nodes within each cluster to generate a new node in the pooled graph.
The feature (embedding) matrix of the new nodes can be obtained by aggregating the features
(embeddings) of nodes within the clusters, according to the cluster assignment matrix,

Xpool = S𝑇 X. (42)
While the adjacency matrix of the pooled graph can be generated by calculating the connectivity
strength between each pair of clusters,

Apool = S𝑇 AS. (43)
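A minimal NumPy sketch of this cluster-and-merge template (Eqs. 42-43) is given below; the random softmax-normalized assignment matrix stands in for the output of a pooling GNN (as in DiffPool) and is assumed purely for illustration.

```python
import numpy as np

def cluster_pool(A, X, S):
    """Generic cluster-based pooling (Eqs. 42-43).

    A: adjacency (n x n), X: node features/embeddings (n x d), S: soft cluster
    assignment matrix (n x K). Returns the pooled adjacency and features.
    """
    X_pool = S.T @ X        # Eq. 42: aggregate node features per cluster
    A_pool = S.T @ A @ S    # Eq. 43: connectivity strength between clusters
    return A_pool, X_pool

# Toy usage with an assumed random graph and assignment matrix.
n, K = 6, 2
A = (np.random.rand(n, n) > 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T                        # symmetric, no self-loops
X = np.random.randn(n, 4)
logits = np.random.randn(n, K)
S = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
A_pool, X_pool = cluster_pool(A, X, S)                # (K x K), (K x 4)
```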


Then, we review several representative cluster-based pooling methods in detail.
DiffPool [417]. DiffPool is one of the first and classic works to hierarchically pool the graph by
graph clustering. Specifically, it uses an embedding GNN to generate embeddings of nodes, and a
pooling GNN to generate the cluster assignment matrix,

^ = GNNembed (X, A) , S = softmax GNNpool (X, A) , Xpool = S𝑇 X.


^ (44)

X

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


18 W. Ju, et al.

Besides, DiffPool leverages an auxiliary link prediction objective 𝐿LP = A, SS𝑇 𝐹 to encourage
the adjacent nodes to be in the same cluster and avoid fake local minima, where ∥ · ∥ 𝐹 is the
Í |𝑉 |
Frobenius norm. And it utilizes an entropy regularization term 𝐿E = |𝑉1 | 𝑖= 1 𝐻 (S𝑖 ) to impel clear
cluster assignments, where 𝐻 (·) represents the entropy.
Graph Pooling with Spectral Clustering (MinCutPool) [22]. MinCutPool takes advantage of the
properties of spectral clustering (SC) to provide a better inductive bias and avoid degenerate cluster
assignments. It learns to cluster like SC by optimizing the MinCut loss,

Tr S𝑇 AS
𝐿𝑐 = − . (45)
Tr S𝑇 DS

In addition, it utilizes an orthogonality loss 𝐿𝑜 = − √I𝐾 to encourage orthogonal and


S S 𝑇

∥ S𝑇 S ∥ 𝐹 𝐾
𝐹
uniform cluster assignments, and prevent the bad minima of 𝐿𝑐 , where 𝐾 is the number of the
clusters. When performing a specific task, it can optimize the weighted sum of the unsupervised
loss 𝐿𝑢 = 𝐿𝑐 + 𝐿𝑜 and a task-specific loss to find the optimal balance between the theoretical prior
and the task objective.
Structural Entropy Guided Graph Pooling (SEP) [384]. In order to lessen the local structural harm
and suboptimal performance caused by separate pooling layers and predesigned pooling ratios,
SEP leverages the concept of structural entropy to generate the global and hierarchical cluster
assignments at once. Specifically, SEP treats the nodes of a given graph as the leaf nodes of a coding
tree and exploits the hierarchical layers of the coding tree to capture the hierarchical structure of
the graph. The optimal code tree 𝑇 can be obtained by minimizing the structural entropy [204],

∑︁ 𝑔(𝑃 𝑣 ) vol 𝑃 𝑣𝑖
𝑇
H (𝐺) = − 𝑖
log  , (46)
𝑣 ∈𝑇
vol(𝑉 ) vol 𝑃 +
𝑖 𝑣𝑖

where represents the father node of node 𝑣𝑖 , 𝑃 𝑣𝑖 denotes the partition of leaf nodes which are
𝑣𝑖+
descendants of 𝑣𝑖 in the coding tree 𝑇 , 𝑔(𝑃 𝑣𝑖 ) denotes the number of edges that have a terminal in
the 𝑃 𝑣𝑖 , and vol(·) denotes the total degrees of leaf nodes in the given partition. Then, the cluster
assignment matrix for each pooling layer can be derived from the edges of each layer in the coding
tree. With the help of the one-step joint assignments generation based on structural entropy, it can
not only make the best use of the hierarchical relationships of pooling layers, but also reduce the
structural noise in the original graph.
5.2.3 Hybrid pooling. Hybrid pooling methods combine TopK-based pooling methods and cluster-
based pooling methods, to exert the advantages of the two methods and overcome their respective
limitations. Here, we review two representative hybrid pooling methods, Adaptive Structure Aware
Pooling [292] and Multi-channel Pooling [76].
Adaptive Structure Aware Pooling (ASAP) [292]. Considering that TopK-based pooling methods are
not good at capturing the connectivity of the coarsened graph, while cluster-based pooling methods
fail to be employed for large graphs because of the dense assignment matrix, ASAP organically
combines the two types of pooling methods to overcome the above limitations. Specifically, it
regards the ℎ-hop ego-network 𝑐ℎ (𝑣𝑖 ) of each node 𝑣𝑖 as a cluster. Such local clustering enables the
cluster assignment matrix to be sparse. Then, a new self-attention mechanism Master2Token is
used to learn the cluster assignment matrix S and the cluster representations,
     |𝑐ℎ∑︁
(𝑣𝑖 ) |
m𝑖 = max X′𝑗 , S 𝑗,𝑖 = softmax w𝑇 𝜎 Wm𝑖 ∥X′𝑗 , X𝑐𝑖 = S 𝑗,𝑖 X 𝑗 , (47)
𝑣 𝑗 ∈𝑐ℎ (𝑣𝑖 )
𝑗=1

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


A Comprehensive Survey on Deep Graph Representation Learning 19

where X′ is the node embedding matrix after passing GCN, w and W denote the trainable vector
and matrix respectively, and X𝑐𝑖 denotes the representation of the cluster 𝑐ℎ (𝑣𝑖 ). Next, it utilizes
the graph convolution and TopK selection to choose the top 𝐾 clusters, whose centers are treated
as the nodes of the pooled graph. The adjacency matrix of the pooled graph can be calculated like
common cluster-based pooling methods (43), preserving the connectivity of the original graph well.
Multi-channel Pooling (MuchPool) [76]. The key idea of MuchPool is to capture both the local
and global structure of a given graph by combining the TopK-based pooling methods and the
cluster-based pooling methods. MuchPool has two pooling channels based on TopK selection to
yield two fine-grained pooled graphs, whose selection criteria are node degrees and projected
scores of node features respectively, so that both the local topology and the node features are
considered. Besides, it leverages a channel based on graph clustering to obtain a coarse-grained
pooled graph, which captures the global and hierarchical structure of the input graph. To better
integrate the information of different channels, a cross-channel convolution is proposed, which
fuses the node embeddings of the fine-grained pooled graph Xfine and the coarse-grained pooled
graph Xcoarse with the help of the cluster assignments S of the cluster-based pooling channel,
 
e fine = 𝜎 Xfine + SXcoarse · W , (48)

X

where W denotes the learnable weights. Finally, it merges the node embeddings and the adjacency
matrices of the two fine-grained pooled graphs to obtain the eventually pooled graph.

5.3 Summary
This section introduces graph pooling methods for graph-level representation learning. We provide
the summary as follows:
• Techniques. Graph pooling methods play a vital role in generating an entire graph represen-
tation by aggregating node embeddings. There are mainly two categories of graph pooling
methods: global pooling methods and hierarchical pooling methods. While global pooling
methods directly aggregate node embeddings in one step, hierarchical pooling methods
gradually coarsen a graph to capture hierarchical structure characteristics of the graph based
on TopK selection, clustering methods, or hybrid methods.
• Challenges and Limitations. Despite the great success of graph pooling methods for learn-
ing the whole graph representation, there remain several challenges and limitations unsolved:
1) For hierarchical pooling, most cluster-based methods involve the dense assignment matrix,
which limits their application to large graphs, while TopK-based methods are not good at
capturing structure information of the graph and may lose information due to node dropping.
2) Most graph pooling methods are designed for simple attributed graphs, while pooling
algorithms tailored to other types of graphs, like dynamic graphs and heterogeneous graphs,
are largely under-explored.
• Future Works. In the future, we expect that more hybrid or other pooling methods can
be studied to capture the graph structure information sufficiently as well as be efficient for
large graphs. In realistic scenarios, there are various types of graphs involving dynamic,
heterogeneous, or spatial-temporal information. It is promising to design graph pooling
methods specifically for these graphs, which can be beneficial to more real-world applications,
such as traffic analysis and recommendation systems.

6 Graph Transformer By Junwei Yang


Though GNNs based on the message-passing paradigm have achieved impressive performance on
multiple well-known tasks [115, 205, 363, 403], they still face some intrinsic problems due to the

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


20 W. Ju, et al.

iterative neighbor-aggregation operation. Many previous works have demonstrated the two major
defects of message-passing GNNs, which are known as the over-smoothing and long-distance
modeling problem. And there are lots of explanatory works trying to mine insights from these
two issues. The over-smoothing problem can be explained in terms of various GNNs focusing
only on low-frequency information [23], mixing information between different kinds of nodes
destroying model performance [45], GCN is equivalent to Laplacian smoothing [213], isotropic
aggregation among neighbors leading to the same influence distribution as random walk [404], etc.
The inability to model long-distance dependencies of GNNs is partially due to the over-smoothing
problem, because in the context of conventional neighbor-aggregation GNNs, node information
can be passed over long distances only through multiple GNN layers. Recently, Alon et al. [6] find
that this problem may also be caused by over-squashing, which means the exponential growth of
computation paths with increasing distance. Though the two basic performance bottlenecks can
be tackled with elaborate message passing and aggregation strategies, representational power of
GNNs is inherently bounded by the Weisfeiler-Lehman isomorphism hierarchy [264]. Worse still,
most GNNs [115, 182, 351] are bounded by the simplest first-order Weisfeiler-Lehman test (1-WL).
Some efforts have been dedicated to break this limitation, such as hypergraph-based [93, 149],
path-based [31, 415], and k-WL-based [15, 264] approaches.
Among many attempts to solve these fundamental problems, an essential one is the adaptation
of Transformer [350] for graph representation learning. Transformers, both the vanilla version and
several variants, have been adopted with impressive results in various deep learning fields including
NLP [70, 350], CV [38, 469], etc. Recently, Transformer also shows powerful graph modeling abilities
in many researches [46, 81, 193, 386, 415]. And extensive empirical results show that some chronic
shortcomings in conventional GNNs can be easily overcome in Transformer-based approaches.
This section gives an overview of the current progress on this kind of methods.

6.1 Transformer
Transformer [350] was firstly applied to model machine translation, but two of the key mechanisms
adopted in this work, attention operation and positional encoding, are highly compatible with the
graph modeling problem.
To be specific, we denote the input of attention layer in Transformer as X = [x0, x1, . . . , x𝑛−1 ],
x𝑖 ∈ R𝑑 , where 𝑛 is the length of input sequence and 𝑑 is the dimension of each input embedding
x𝑖 . Then the core operation of calculating new embedding x ^𝑖 for each x𝑖 in attention layer can be
streamlined as:
sℎ (x𝑖 , x 𝑗 ) = NORM 𝑗 ( ∥ Qℎ (x𝑖 ) T K ℎ (x𝑘 )),
x𝑘 ∈X
∑︁
(49)

x𝑖 = s (x𝑖 , x 𝑗 )V ℎ (x 𝑗 ),

x 𝑗 ∈X

^𝑖 = MERGE(x𝑖1, x𝑖2, . . . , x𝑖𝐻 ),


x
where ℎ ∈ {0, 1, . . . , 𝐻 − 1} represents the attention head number. Qℎ , K ℎ and V ℎ are projection
functions mapping a vector to the query space, key space and value space respectively. sℎ (x𝑖 , x 𝑗 ) is
score function measuring the similarity between x𝑖 and x 𝑗 . NORM is the normalization operation
ensuring x 𝑗 ∈X sℎ (x𝑖 , x 𝑗 ) ≡ 1 to propel the stability of the output generated by a stack of attention
Í

layers, it is usually performed as scaled softmax: NORM(·) = SoftMax(·/ 𝑑). And MERGE function
is designed to combine the information extracted from multiple attention heads. Here, we omit
further implementation details that do not affect our understanding of attention operation.
The attention process cannot encode the position information of each x𝑖 , which is essential in
machine translation problem. So the positional encoding is introduced to remedy this deficiency,

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


A Comprehensive Survey on Deep Graph Representation Learning 21

Table 4. A Summary of Graph Transformer.

Technique Capacity
Method
Attention Modification Encoding Enhancement Heterogeneous Long Distance > 1-WL
GGT [81] ✓ ✓ structure only ✓
GTSA [193] ✓ ✓ ✓ ✓
HGT [147] ✓ ✓
G2SHGT [413] ✓ ✓ ✓
GRUGT [31] ✓ ✓ ✓
Graphormer [415] ✓ ✓ ✓ ✓
GSGT [153] ✓ ✓ ✓
Graph-BERT [436] ✓ ✓ ✓
LRGT [386] ✓ ✓
SAT [46] ✓ ✓ ✓

and it’s calculated as:


X𝑖,2 𝑗 = sin(𝑖/100002 𝑗/𝑑 ), X𝑖,2 𝑗+1 = cos(𝑖/100002 𝑗/𝑑 ), (50)
𝑝𝑜𝑠 𝑝𝑜𝑠

where 𝑖 is the position and 𝑗 is the dimension. The positional encoding is added to the input before
it is fed to the Transformer.

6.2 Overview
From the simplified process shown in Equation. 49, we can see that the core of the attention
operation is to accomplish information transfer based on the similarity between the source and the
target to be updated. It’s quite similar to the message-passing process on a fully-connected graph.
However, direct application of this architecture to arbitrary graphs does not make use of structural
information, so it may lead to poor performance when graph topology is important. On the other
hand, the definition of positional encoding in graphs is not a trivial problem because the order or
coordinates of graph nodes are underdefined.
According to these two challenges, Transformer-based methods for graph representation learning
can be classified into two major categories, one considering graph structure during attention process,
and the other encoding the topological information of the graph into initial node features. We
name the first one as Attention Modification and the second one as Encoding Enhancement. A
summarization is provided in Table 4. In the following discussion, if both methods are used in
one paper, we will list them in different subsections, and we will ignore the multi-head trick in
attention operation.

6.3 Attention Modification


This group of works attempt to modify the full attention operation to capture structure information.
The most prevalent approach is changing the score function, which is denoted as s(·, ·) in
Equation 49. GGT [81] constrains each node feature can only attend to neighbours and enables
model to represent edge feature information by rewrite s(·, ·) as:
(
(W𝑄 x𝑖 ) T (W𝐾 x 𝑗 ⊙ W𝐸 e 𝑗𝑖 ), ⟨𝑗, 𝑖⟩ ∈ 𝐸
s̃1 (x𝑖 , x 𝑗 ) = ,
− ∞, otherwise (51)
s1 (x𝑖 , x 𝑗 ) = SoftMax 𝑗 ( ∥ s̃1 (x𝑖 , x𝑘 )),
x𝑘 ∈X

where ⊙ is Hadamard product and represents trainable parameter matrix. This approach
W𝑄,𝐾,𝐸
is not efficient yet to model long-distance dependencies since only 1st-neighbors are considered.
Though it adopts Laplacian eigenvectors to gather global information (cf. Section 6.4), but only

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


22 W. Ju, et al.

long-distance structure information is remedied while the node and edge features are not. GTSA
[193] improves this approach by combining the original graph and the full graph. Specifically, it
extends s1 (·, ·) to:
( 𝑄 T 𝐾
(W1 x𝑖 ) (W1 x 𝑗 ⊙ W1𝐸 e 𝑗𝑖 ), ⟨𝑗, 𝑖⟩ ∈ 𝐸
s̃2 (x𝑖 , x 𝑗 ) = ,
(W0 x𝑖 ) T (W𝐾0 x 𝑗 ⊙ W0𝐸 e 𝑗𝑖 ), otherwise
𝑄

 1 SoftMax 𝑗 ( ∥ s̃2 (x𝑖 , x𝑘 )), ⟨𝑗, 𝑖⟩ ∈ 𝐸 (52)




1 +𝜆


 ⟨𝑘,𝑖 ⟩ ∈𝐸
s2 (x𝑖 , x 𝑗 ) = ,
𝜆
SoftMax otherwise


 𝑗 ( ∥ s̃ (x ,
2 𝑖 𝑘 x )),
1 +𝜆

⟨𝑘,𝑖 ⟩∉𝐸

where 𝜆 is a hyperparameter representing the strength of full connection.
Some works try to reduce information-mixing problems [45] in heterogeneous graph. HGT [147]
disentangles the attention of different node type and edge type by adopting additional attention
𝜏 (𝑣) 𝜙 (𝑒)
heads. It defines W𝑄,𝐾,𝑉 for each node type 𝜏 (𝑣) and W𝐸 for each edge type 𝜙 (𝑒), 𝜏 (·) and
𝜙 (·) are type indicating function. G2SHGT [413] defines four types of subgraphs, fully-connected,
connected, default and reverse, to capture global, undirected, forward and backward information
respectively. And each subgraph is homogeneous, so it can reduce interactions between different
classes.
Path features between nodes are always treated as inductive bias added to the original score
function. Let SP𝑖 𝑗 = (𝑒 1, 𝑒 2, . . . , 𝑒 𝑁 ) denote the shortest path between node pair (𝑣𝑖 , 𝑣 𝑗 ). GRUGT
[31] uses GRU [61] to encode forward and backward features as: r𝑖 𝑗 = GRU(SP𝑖 𝑗 ), r 𝑗𝑖 = GRU(SP 𝑗𝑖 ).
Then, the final attention score is calculated by adding up four components:
s̃3 (x𝑖 , x 𝑗 ) = (W𝑄 x𝑖 ) T W𝐾 x 𝑗 + (W𝑄 x𝑖 ) T W𝐾 r 𝑗𝑖 + (W𝑄 r𝑖 𝑗 ) T W𝐾 x 𝑗 + (W𝑄 r𝑖 𝑗 ) T W𝐾 r 𝑗𝑖 , (53)
from front to back, which represent content-based score, source-dependent bias, target-dependent
bias and universal bias respectively. Graphormer [415] uses both path length and path embedding
to introduce structural bias as:

s̃4 (x𝑖 , x 𝑗 ) = (W𝑄 x𝑖 ) T W𝐾 x 𝑗 / 𝑑 + 𝑏 𝑁 + 𝑐𝑖 𝑗 ,
𝑁
1 ∑︁
𝑐𝑖 𝑗 = (e𝑘 ) T w𝑘𝐸 , (54)
𝑁
𝑘=1
s4 (x𝑖 , x 𝑗 ) = SoftMax 𝑗 ( ∥ s̃4 (x𝑖 , x𝑘 )),
x𝑘 ∈X

where 𝑏 𝑁 is a trainable scalar indexed by 𝑁 , the length of SP𝑖 𝑗 . e𝑘 is the embedding of the the
edge 𝑒𝑘 , and w𝑘𝐸 ∈ R𝑑 is the 𝑘-th edge parameter. If SP𝑖 𝑗 does not exist, then 𝑏 𝑁 and 𝑐𝑖 𝑗 are set to
be special value.

6.4 Encoding Enhancement


This kind of methods intend to enhance initial node representations to enable Transformer to
encode structure information. They can be further divided into two categories, position-analogy
methods and structure-aware methods.
6.4.1 Position-analogy methods In Euclidean space, the Laplacian operator corresponds to the
divergence of the gradient, whose eigenfunctions are sine/cosine functions. For graph, the Laplacian
operator is Laplacian matrix, whose eigenvectors can be considered as eigenfunctions. Hence,

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


A Comprehensive Survey on Deep Graph Representation Learning 23

inspired by Equation 50, position-analogy methods utilize Laplacian eigenvectors to simulate


positional encoding X𝑝𝑜𝑠 as they are the equivalents of sine/cosine functions.
Laplacian eigenvectors can be calculated through the eigendecomposition of normalized graph
Laplacian matrix L̃:
L̃ ≜ I − D−1/2 AD−1/2 = UΛUT, (55)
where A is the adjacency matrix, D is the degree matrix, U = [u1, u2, . . . , u𝑛−1 ] are eigenvectors
and Λ = 𝑑𝑖𝑎𝑔(𝜆0, 𝜆1, . . . , 𝜆𝑛−1 ) are eigenvalues. With U and Λ, GGT [81] uses eigenvectors of the
k smallest non-trivial eigenvalues to denote the intermediate embedding X𝑚𝑖𝑑 ∈ R𝑛×𝑘 , and maps it
to d-dimensional space and gets the position encoding X𝑝𝑜𝑠 ∈ R𝑛×𝑑 . This process can be formulized
as:
𝑖𝑛𝑑𝑒𝑥 = argmin𝑘 ({𝜆𝑖 |0 ≤ 𝑖 < 𝑛 ∧ 𝜆𝑖 > 0}),
X𝑚𝑖𝑑 = [u𝑖𝑛𝑑𝑒𝑥 0 , u𝑖𝑛𝑑𝑒𝑥 1 , . . . , u𝑖𝑛𝑑𝑒𝑥𝑘−1 ] T, (56)
𝑝𝑜𝑠 𝑚𝑖𝑑 𝑘×𝑑
X =X W ,
where 𝑖𝑛𝑑𝑒𝑥 is the subscript of the selected eigenvectors. GTSA [193] puts eigenvector u𝑖 on the
frequency axis at 𝜆𝑖 and uses sequence modeling methods to generate positional encoding. Specificly,
it extends X𝑚𝑖𝑑 in Equation 56 to X̃𝑚𝑖𝑑 ∈ R𝑛×𝑘×2 by concatenating each value in eigenvectors with
corresponding eigenvalue, and then positional encoding X𝑝𝑜𝑠 ∈ R𝑛×𝑑 are generated as:
X𝑖𝑛𝑝𝑢𝑡 = X̃𝑚𝑖𝑑 W2×𝑑 ,
(57)
X𝑝𝑜𝑠 = SumPooling(Transformer(X𝑖𝑛𝑝𝑢𝑡 ), dim = 1).
Here, X𝑖𝑛𝑝𝑢𝑡 ∈ R𝑛×𝑘×𝑑 is equivalent to the input matrix in sequence modeling problem with
shape (𝑏𝑎𝑡𝑐ℎ_𝑠𝑖𝑧𝑒, 𝑙𝑒𝑛𝑔𝑡ℎ, 𝑑𝑖𝑚), and can be naturally processed by Transformer. Since the Laplacian
eigenvectors can be complex-valued for directed graph, GSGT [153] proposes to utilize SVD of
adjacency matrix A, which is denoted as A = UΣVT , and uses the largest 𝑘 singular values Σ𝑘 and
associated left and right singular vectors U𝑘 and V𝑘T to output X𝑝𝑜𝑠 as X𝑝𝑜𝑠 = [U𝑘 Σ𝑘1/2 ∥V𝑘 Σ𝑘1/2 ],
where ∥ is the concatenation operation. All these methods above randomly flip the signs of eigen-
vectors or singular vectors during the training phase to promote the invariance of the models to
the sign ambiguity.
6.4.2 Structure-aware methods In contrast to position-analogy methods, structure-aware methods
do not attempt to mathematically rigorously simulate sequence positional encoding. They use some
additional mechanisms to directly calculate structure related encoding.
Some approaches compute extra encoding X𝑎𝑑𝑑 and add it to the initial node representation.
Graphormer [415] proposes to leverage node centrality as additional signal to address the impor-
tance of each node. Concretely, x𝑖𝑎𝑑𝑑 is determined by the indegree deg𝑖− and outdegree deg𝑖+ :
x𝑖𝑎𝑑𝑑 = P − (deg𝑖− ) + P + (deg𝑖+ ), (58)
where P−and are learnable embedding function. Graph-BERT [436] employs Weisfeiler-
P+
Lehman algorithm to label node 𝑣𝑖 to a number WL(𝑣𝑖 ) ∈ N and defines x𝑖𝑎𝑑𝑑 as:
x𝑖,𝑎𝑑𝑑
2 𝑗 = sin(WL(𝑣 𝑖 )/10000
2 𝑗/𝑑
), x𝑖,𝑎𝑑𝑑
2 𝑗+1 = cos(WL(𝑣 𝑖 )/10000
2 𝑗/𝑑
). (59)
The other approaches try to leverage GNNs to initialize inputs to Transformer. LRGT [386]
applies GNN to get intermediate vectors as X′ = GNN(X), and passes the concatenation of X′
and a special vector xCLS to Transformer layer as: X
^ = Transformer( [X′ ∥xCLS ]). Then x
^CLS can be
used as the representation of the entire graph for downstream tasks. This method cannot break
1-WL bottleneck because it uses GCN [182] and GIN [403] as graph encoder in the first step, which

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


24 W. Ju, et al.

are intrinsically limited by 1-WL test. SAT [46] improves this deficiency by using subgraph-GNN
NGNN [438] for initialization, and achieves outstanding performance.

6.5 Summary
This section introduces Transformer-based approaches for graph representation learning and we
provide the summary as follows:
• Techniques. Graph Transformer methods modify two fundamental techniques in Trans-
former, attention operation and positional encoding, to enhance its ability to encode graph
data. Typically, they introduce fully-connected attention to model long-distance relationship,
utilize shortest path and Laplacian eigenvectors to break 1-WL bottleneck, and separate
points and edges belonging to different classes to avoid over-mixing problem.
• Challenges and Limitations. Though Graph Transformers achieve encouraging perfor-
mance, they still face two major challenges. The first challenge is the computational cost of
the quadratic attention mechanism and shortest path calculation. These operations require
significant computing resources and can be a bottleneck, particularly for large graphs. The
second is the reliance of Transformer-based models on large amounts of data for stable perfor-
mance. It poses a challenge when dealing with problems that lack sufficient data, especially
for few-shot and zero-shot settings.
• Future Works. We expect efficiency improvement for Graph Transformer should be further
explored. Additionally, there are some works using pre-training and fine-tuning framework
to balance performance and complexity in downstream tasks [415], this may be a promising
solution to address the aforementioned two challenges.

7 Semi-supervised Learning on Graphs By Xiao Luo


We have investigated various architectures of graph neural networks in which the parameters should
be tuned by a learning objective. The most prevalent optimization approach is supervised learning on
graph data. Due to the label deficiency, semi-supervised learning has attracted increasing attention
in the data mining community. In detail, these methods attempt to combine graph representation
learning with current semi-supervised techniques including pseudo-labeling, consistency learning,
knowledge distillation and active learning. These works can be further subdivided into node-level
representation learning and graph-level representation learning. We would introduce both parts in
detail as in Sec. 7.1 and Sec. 7.2, respectively. A summarization is provided in Table 5.

7.1 Node Representation Learning


Typically, node representation learning follows the concept of transductive learning, which has the
access to test unlabeled data. We first review the simplest loss objective, i.e., node-level supervised
loss. This loss exploits the ground truth of labeled nodes on graphs. The standard cross-entropy is
usually adopted for optimization. In formulation,
1 ∑︁ 𝑇
L𝑁 𝑆𝐿 = − 𝐿 y𝑖 log p𝑖 , (60)
|Y | 𝐿 𝑖 ∈Y

where Y 𝐿 denotes the set of labeled nodes. Additionally, there are a variety of unlabeled nodes that
can be used to offer semantic information. To fully utilize these nodes, a range of methods attempt
to combine semi-supervised approaches with graph neural networks. Pseudo-labeling [200] is a
fundamental semi-supervised technique that uses the classifier to produce the label distribution of
unlabeled examples and then adds appropriately labeled examples to the training set [212, 465].
Another line of semi-supervised learning is consistency regularization [197] that requires two

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


A Comprehensive Survey on Deep Graph Representation Learning 25

Table 5. A Summary of Methods for Semi-supervised Learning on Graphs. Contrastive learning can be
considered as a specific kind of consistency learning.

Approach Pseudo-labeling Consistency Learning Knowledge Distillation Active Learning


CoGNet [212] ✓
DSGCN [465] ✓
Node-level
GRAND [92] ✓
AugGCR [271] ✓
SEAL [211] ✓ ✓
InfoGraph [335] ✓ ✓
DSGC [408] ✓
ASGN [131] ✓ ✓
TGNN [172] ✓
Graph-level KGNN [174] ✓
HGMI [209] ✓ ✓
ASGNN [395] ✓ ✓
DualGraph [242] ✓ ✓
GLA [428] ✓
SS [394] ✓

examples to have identical predictions under perturbation. This regularization is based on the
assumption that each instance has a distinct label that is resistant to random perturbations [92, 271].
Then, we show several representative works in detail.
Cooperative Graph Neural Networks [212] (CoGNet). CoGNet is a representative pseudo-label-
based GNN approach for semi-supervised node classification. It employs two GNN classifiers to
jointly annotate unlabeled nodes. In particular, it calculates the confidence of each node as follows:
𝐶𝑉 (p𝑖 ) = p𝑇𝑖 log p𝑖 , (61)
where p𝑖 denotes the output label distribution. Then it selects the pseudo-labels with high confidence
generated from one model to supervise the optimization of the other model. In particular, the
objective for unlabeled nodes is written as follows:
∑︁
L𝐶𝑜𝐺𝑁 𝑒𝑡 = ^𝑇𝑖 𝑙𝑜𝑔q𝑖 ,
1𝐶𝑉 ( p𝑖 ) >𝜏 y (62)
𝑖 ∈V𝑈
where y^𝑖 denotes the one-hot formulation of the pseudo-label 𝑦^𝑖 = 𝑎𝑟𝑔𝑚𝑎𝑥p𝑖 and q𝑖 denotes the
label distribution predicted by the other classifier. 𝜏 is a pre-defined temperature coefficient. This
cross supervision has been demonstrated effective in [51, 244] to prevent the provision of biased
pseudo-labels. Moreover, it employs GNNExplainer [416] to provide additional information from a
dual perspective. Here it measures the minimal subgraphs where GNN classifiers can still generate
the same prediction. In this way, CoGNet can illustrate the entire optimization process to enhance
our understanding.
Dynamic Self-training Graph Neural Network [465] (DSGCN). DSGCN develops an adaptive
manner to utilize reliable pseudo-labels for unlabeled nodes. In particular, it allocates smaller
weights to samples with lower confidence with the additional consideration of class balance. The
weight is formulated as:
1
𝜔𝑖 = max (RELU (p𝑖 − 𝛽 · 1)) , (63)
𝑛𝑐 𝑖
where 𝑛𝑐 𝑖 denotes the number of unlabeled samples assigned to the class 𝑐 𝑖 . This technique will
decrease the impact of wrong pseudo-labels during iterative training.
Graph Random Neural Networks [92] (GRAND). GRAND is a representative consistency learning-
based method. It first adds a variety of perturbations to the input graph to generate a list of

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


26 W. Ju, et al.

graph views. Each graph view 𝐺 𝑟 is sent to a GNN classifier to produce a prediction matrix
P𝑟 = [p𝑟1, · · · , p𝑟𝑁 ]. Then it summarizes these matrices as follows:
1 𝑟
P . P= (64)
𝑅
To provide more discriminative information and ensure that the matrix is row-normalized,
GRAND sharpens the summarized label matrix into P𝑆𝐴 as:
P𝑖1𝑗/𝑇
P𝑖𝑆𝐴
𝑗 = Í 1/𝑇
, (65)
𝑗 ′ =0 P𝑖 𝑗 ′

where 𝑇 is a given temperature parameter. Finally, consistency learning is performed by comparing


the sharpened summarized matrix with the matrix of each graph view. Formally, the objective is:
𝑅
1 ∑︁ ∑︁ 𝑆𝐴
L𝐺𝑅𝐴𝑁 𝐷 = ||P𝑖 − P𝑖 ||, (66)
𝑅 𝑟 =1 𝑖 ∈𝑉
here L𝐺𝑅𝐴𝑁 𝐷 serves as a regularization which is combined with the standard supervised loss.
Augmentation for GNNs with the Consistency Regularization [271] (AugGCR). AugGCR begins with
the generation of augmented graphs by random dropout and mixup of different order features. To
enhance the model generalization, it borrows the idea of meta-learning to partition the training data,
which improves the quality of graph augmentation. In addition, it utilizes consistency regularization
to enhance the semi-supervised node classification.

7.2 Graph Representation Learning


The objective of graph classification is to predict the property of the whole graph example.
Assuming that the training set comprises 𝑁 𝑙 and 𝑁 𝑢 graph samples G𝑙 = {𝐺 1, · · · , 𝐺 𝑁 } and
𝑙

G𝑢 = {𝐺 𝑁 +1, · · · , 𝐺 𝑁 +𝑁 }, the graph-level supervised loss for labeled data can be expressed as
𝑙 𝑙 𝑢

follows:
1 ∑︁ 𝑗 𝑇
L𝐺𝑆𝐿 = − 𝑢 y 𝑙𝑜𝑔p 𝑗 , (67)
|G | 𝐿 𝐺𝑗 ∈G

where y𝑗denotes the one-hot label vector for the 𝑗-th sample while p 𝑗 denotes the predicted
distribution of 𝐺 𝑗 . When 𝑁 𝑢 = 0, this objective can be utilized to optimize supervised methods.
However, due to the shortage of labels in graph data, supervised methods cannot reach exceptional
performance in real-world applications [131]. To tackle this, semi-supervised graph classification
has been developed extensively. These approaches can be categorized into pseudo-labeling-based
methods, knowledge distillation-based methods and contrastive learning-based methods. Pseudo-
labeling methods annotate graph instances and utilize well-classified graph examples to update
the training set [209, 211]. Knowledge distillation-based methods usually utilize a teacher-student
architecture, where the teacher model conducts graph representation learning without label infor-
mation to extract generalized knowledge while the student model focuses on the downstream task.
Due to the restricted number of labeled instances, the student model transfers knowledge from
the teacher model to prevent overfitting [131, 335]. Another line of this topic is to utilize graph
contrastive learning, which is frequently used in unsupervised learning. Typically, these methods
extract topological information from two perspectives (i.e., different perturbation strategies and
graph encoders), and maximize the similarity of their representations compared with those from
other examples [171, 172, 242]. Active learning, as a prevalent technique to improve the efficiency
of data annotation, has also been utilized for semi-supervised methods [131, 395]. Then, we review
these methods in detail.

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


A Comprehensive Survey on Deep Graph Representation Learning 27

SEmi-supervised grAph cLassification [211] (SEAL). SEAL treats each graph example as a node
in a hierarchical graph. It builds two graph classifiers which generate graph representations and
conduct semi-supervised graph classification respectively. SEAL employs a self-attention module
to encode each graph into a graph-level representation, and then conducts message passing from a
graph level for final classification. SEAL can also be combined with cautious iteration and active
iteration. The former merely utilizes partial graph samples to optimize the parameters in the first
classifier due to the potential erroneous pseudo-labels. The second combines active learning with
the model, which increases the annotation efficiency in semi-supervised scenarios.
InfoGraph [335]. Infograph is the first contrastive learning-based method. It maximizes the
similarity between summarized graph representations and their node representations. In particular,
it generates node representations using the message passing mechanism and summarizes these
node representations into a graph representation. Let Φ(·, ·) denote a discriminator to distinguish
whether a node belongs to the graph, and we have:

| G𝑙∑︁
|+ | G𝑢 | ∑︁ h   i 1 ∑︁ h   𝑗 ′ 𝑗   i
L𝐼𝑛𝑓 𝑜𝐺𝑟𝑎𝑝ℎ = − sp −Φ h𝑖𝑗 , z 𝑗 − 𝑗 sp Φ h𝑖 ′ , z , (68)
𝑗=1 𝑖 ∈ G𝑗 |𝑁𝑖 | 𝑖 ′𝑗 ′ ∈𝑁 𝑗
𝑖

where sp(·) denotes the softplus function. denotes the negative node set where nodes are not
𝑁𝑖𝑗
in 𝐺 𝑗 . This mutual information maximization formulation is originally developed for unsupervised
learning and it can be simply extended for semi-supervised graph classification. In particular,
InfoGraph utilizes a teacher-student architecture that compares the representation across the
teacher and student networks. The contrastive learning objective serves as a regularization by
combining with supervised loss.
Dual Space Graph Contrastive Learning [408] (DSGC). DSGC is a representative contrastive
learning-based method. It utilizes two graph encoders. The first is a standard GNN encoder in the
Euclidean space and the second is the hyperbolic GNN encoder. The hyperbolic GNN encoder first
converts graph embeddings into hyperbolic space and then measures the distance based on the
length of geodesics. DSGC compares graph embeddings in the Euclidean space and hyperbolic
space. Assuming the two GNNs are named as 𝑓1 (·) and 𝑓2 (·), the positive pair is denoted as:
𝑗
z𝐸→𝐻 = exp𝑐o (𝑓1 (𝐺 𝑗 )),
𝑗  (69)
z𝐻 = exp𝑐o 𝑓2 (𝐺 𝑗 ) .

Then it selects one labeled sample and 𝑁𝐵 unlabeled sample 𝐺 𝑗 for graph contrastive learning in
the hyperbolic space. In formulation,

e𝑑 ( h𝐻 ,z𝐸→𝐻 ) /𝜏
𝐻 𝑖 𝑖

L𝐷𝑆𝐺𝐶 = − log  
Í𝑁 𝑑D z𝑖𝐸→𝐻 ,z𝐻𝑗 /𝜏
e𝑑 ( z𝐻 ,z𝐸→𝐻 ) /𝜏 + 𝑖=
𝐻 𝑖 𝑖
1 e

𝑢 z 𝑗 ,z 𝑗
 (70)
𝑁 𝑑D 𝐻 𝐸→𝐻 /𝜏
𝜆𝑢 ∑︁ e
− log     ,
𝑁 𝑖=1 𝑢 z 𝑗 ,z 𝑗 𝑗
𝑑D 𝐻 𝐸→𝐻 /𝜏 𝑑 D z𝑖𝐻 ,z𝐸→𝐻 /𝜏
e +e
where z𝑖𝐸→𝐻 and z𝑖𝐻 denote the embeddings for labeled graph sample 𝐺 𝑖 and 𝑑 𝐻 (·) denotes a distance
metric in the hyperbolic space. This contrastive learning objective maximizes the similarity between
embeddings learned from two encoders compared with other samples. Finally, the contrastive
learning objective can be combined with the supervised loss to achieve effective semi-supervised
contrastive learning.

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


28 W. Ju, et al.

Active Semi-supervised Graph Neural Network [131] (ASGN). ASGN utilizes a teacher-student
architecture with the teacher model focusing on representation learning and the student model
targeting at molecular property prediction. In the teacher model, ASGN first employs a message
passing neural network to learn node representations under the reconstruction task and then
borrows the idea of balanced clustering to learn graph-level representations in a self-supervised
fashion. In the student model, ASGN utilizes label information to monitor the model training based
on the weights of the teacher model. In addition, active learning is also used to minimize the
annotation cost while maintaining sufficient performance. Typically, the teacher model seeks to
provide discriminative graph-level representations without labels, which transfer knowledge to the
student model to overcome the potential overfitting in the presence of label scarcity.
Twin Graph Neural Networks [172] (TGNN). TGNN also uses two graph neural networks to
give different perspectives to learn graph representations. Differently, it adopts a graph kernel
neural network to learn graph-level representations in virtue of random walk kernels. Rather than
directly enforcing representation from two modules to be similar, TGNN exchanges information
by contrasting the similarity structure of the two modules. In particular, it constructs a list of
anchor graphs, 𝐺 𝑎1 , 𝐺 𝑎2 , · · · , 𝐺 𝑎𝑀 , and utilizes two graph encoders to produce their embeddings,
i.e., {𝑧𝑎𝑚 }𝑚=
𝑀
1 , {𝑤
𝑎𝑚 }𝑀 . Then it calculates the similarity distribution between each unlabeled
𝑚=1
graph and anchor graphs for two modules. Formally,
 
exp cos 𝑧 𝑗 , 𝑧𝑎𝑚 /𝜏
𝑗
𝑝𝑚 = Í𝑀 , (71)
𝑚′ =1 exp (cos (𝑧 , 𝑧
𝑗 𝑎𝑚′ ) /𝜏)

 
exp cos w 𝑗 , w𝑎𝑚 /𝜏
𝑗
𝑞𝑚 = Í𝑀 . (72)
𝑚′ =1 exp (cos (w , w
𝑗 𝑎𝑚′ ) /𝜏)

Then, TGNN minimizes the distance between distributions from different modules as follows:
1 ∑︁ 1
(73)
 
L𝑇𝐺𝑁 𝑁 = 𝐷 KL p 𝑗 ∥q 𝑗 + 𝐷 KL q 𝑗 ∥p 𝑗 ,
G𝑈 𝑗 𝑢 2 𝐺 ∈G

which serves as a regularization term to combine with the supervised loss.

7.3 Summary
This section introduces semi-supervised learning for graph representation learning and we provide
the summary as follows:
• Techniques. Classic node classification aims to conduct transductive learning on graphs
with access to unlabeled data, which is a natural semi-supervised problem. Semi-supervised
graph classification aims to relieve the requirement of abundant labeled graphs. Here, a
variety of semi-supervised methods have been put forward to achieve better performance
under the label scarcity. Typically, they try to integrate semi-supervised techniques such as
active learning, pseudo-labeling, consistency learning, and consistency learning with graph
representation learning.
• Challenges and Limitations. Despite their great success, the performance of these methods
is still unsatisfactory, especially in graph-level representation learning. For example, DSGC
can only achieve an accuracy of 57% in a binary classification dataset REDDIT-BINARY. Even
worse, label scarcity often accompanies by unbalanced datasets and potential domain shifts,
which provides more challenges from real-world applications.
• Future Works. In the future, we expect that these methods can be applied to different
problems such as molecular property predictions. There are also works to extend graph

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


A Comprehensive Survey on Deep Graph Representation Learning 29

Table 6. A Summary of Methods for Self-supervised Learning on Graphs. "PT", "CT" and "UFE" mean "Pre-
training", "Collaborative Train" and "Unsupervised Feature Extracting" respectively.

Approach Augmentation Scheme Training Scheme Generation Target Objective Function


Graph Completion [419] Feature Mask PT/CT Node Feature -
AttributeMask [166] Feature Mask PT/CT PCA Node Feature -
Generation-based AttrMasking [145] Feature Mask PT Node/Edge Feature -
MGAE [358] No Augmentation CT Node Feature -
GAE [183] Feature Noise UFE Adjacency Matrix -
DeepWalk [276] Random Walk UFE - SkipGram
LINE [344] Random Walk UFE - Jensen-Shannon
GCC [286] Random Walk PT/URL - InfoNCE
SimGCL [422] Embedding Noise UFE - InfoNCE
Contrast-based SimGRACE [391] Model Noise UFE - InfoNCE
Feature Masking &
GCA [471] URL - InfoNCE
Strcture Adjustment
Feature Masking &
BGRL [120] URL - BYOL
Strcture Adjustment

representation learning in more realistic scenarios like few-shot learning [41, 249]. A higher
accuracy is always anticipated for more advanced and effective semi-supervised techniques.

8 Graph Self-supervised Learning By Jingyang Yuan


Besides supervised or semi-supervised methods, self-supervised learning (SSL) also has shown its
powerful capability in data mining and representation embedding recent years. In this section we
investigated Graph Neural Networks based on SSL, and provided a detailed introduction to a few
typical models. Graph SSL methods usually has an unified pipeline, which includes pretext tasks
and downstream tasks. Pretext tasks help model encoder to learn better representation, as a premise
of better performance in downstream tasks. So a delicate design of pretext task is crucial for Graph
SSL. We would firstly introduce overall framework of Graph SSL in Section 8.1, then introduce the
two kind of pretext task design, generation-based methods and contrast-Based methods respectively
in Section 8.2 and 8.3. A summarization is provided in Table 6.

8.1 Overall framework


Consider a featured graph G, we denote a graph encoder 𝑓 to learn representation of graph,
and a pretext decoder 𝑔 with specific architecture in different pretext tasks. Then the pretext
self-supervised learning loss can be formulated as:

L𝑡𝑜𝑡𝑎𝑙 = 𝐸 G∼D [L𝑠𝑠𝑙 (𝑔, 𝑓 , G)], (74)

where D denotes the distribution of featured graph G. By minimizing L𝑜𝑣𝑒𝑟𝑎𝑙𝑙 , we can learn encoder
𝑓 with capacity to produce high-quality embedding. As for downstream tasks, we denote a graph
decoder 𝑑 which transforms the output of graph encoder 𝑓 into model prediction. The loss of
downstream tasks can be formulated as:

L𝑠𝑢𝑝 = L𝑠𝑢𝑝 (𝑑, 𝑓 , G; 𝑦), (75)


where 𝑦 is the ground truth in downstream tasks. We can obvious that L𝑠𝑢𝑝 is a typical supervised
loss. To ensure the model achieve wise graph representation extraction and optimistic prediction
performance, L𝑠𝑠𝑙 and L𝑠𝑢𝑝 have to be minimized simultaneously. We introduce 3 different ways
to minimize the two loss function:

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


30 W. Ju, et al.

Pre-training. This strategy has two steps. In pre-training step, the L𝑠𝑠𝑙 is minimized to get 𝑔∗
and 𝑓 ∗ :
𝑔∗, 𝑓 ∗ = arg minL𝑠𝑠𝑙 (𝑔, 𝑓 , D). (76)
𝑔,𝑓
Then the parameter of is kept to continue training in pretext supervised learning progress.
𝑓∗
The supervised loss is minimized to get final parameters of 𝑓 and 𝑑.
minL𝑠𝑠𝑙 (𝑑, 𝑓 | 𝑓0 =𝑓 ∗ , G; 𝑦). (77)
𝑑,𝑓

Collaborative Train. In this strategy, L𝑠𝑠𝑙 and L𝑠𝑢𝑝 are optimized simultaneously. A hyperpa-
rameter 𝛼 is used to balance the contribution of pretext task loss and downstream task loss. The
overall minimization strategy is like traditional supervised strategy with a pretext task regulariza-
tion:

min [L𝑠𝑠𝑙 (𝑔, 𝑓 , G) + 𝛼 L𝑠𝑢𝑝 (𝑑, 𝑓 , G; 𝑦)]. (78)


𝑔,𝑓 ,𝑑
Unsupervised Feature Extracting. This strategy is similar to Pre-training and Fine-tuning
strategy in first step to minimize pretext task loss L𝑠𝑠𝑙 and get 𝑓 ∗ . However, when minimizing
downstream loss L𝑠𝑢𝑝 , the encoder 𝑓 ∗ is fixed. Also, the training graph data are on same dataset,
which differs from Pre-training and Fine-tuning strategy. The formulation is defined as:
𝑔∗, 𝑓 ∗ = arg minL𝑠𝑠𝑙 (𝑔, 𝑓 , D), (79)
𝑔,𝑓

minL𝑠𝑢𝑝 (𝑑, 𝑓 ∗, G; 𝑦). (80)


𝑑

8.2 Generation-based pretext task design


If a model with encoder-decoder structure can reproduce certain graph feature from incomplete
or perturbed graph, it indicated the encoder has the ability to extract useful graph representation.
This motivation derived from Autoencoder [139] which originally learns on image dataset. In such
a case, Equation 76 can be rewritten as:
^ G),
minL𝑠𝑠𝑙 (𝑔(𝑓 ( G)), (81)
𝑔,𝑓

where 𝑓 (·) and 𝑔(·) stand for the representation encoder and rebuilding decoder. However, for graph
dataset, feature information and structure information are both important composition suitable to
be rebuilt. So generation-based pretext can be divided into two categories: feature rebuilding and
structure rebuilding. We introduce several outstanding models in followed part.
Graph Completion [419] is one of a representative method about feature rebuilding. They mask
some node features to generate an incomplete graph. Then the pretext task is set as predicting the
removed node features. As shown in Equation 82, this method can be formulated as a special case
of Equation 82, letting G^ = (𝐴, 𝑋^ ) and replacing G →
− 𝑋 . The loss function is often Mean Squared
Error or Cross Entropy, depending on the feature is continuous or binary.

^ X).
min MSE(𝑔(𝑓 ( G)), (82)
𝑔,𝑓
Other works make some changes about feature setting. For example, AttrMasking [145] aims
to rebuild both node representation and edge representation, AttributeMask [166] preprocess 𝑋
firstly by PCA to reduce the complexity of rebuilding features.
In the other hand, MGAE [358] modify the original graph by adding noise in node representation,
motivated by denoising autoencoder [353]. As shown in Equation 82, we can also consider MGAE

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


A Comprehensive Survey on Deep Graph Representation Learning 31

as an implement of Equation 76 where G^ = (𝐴, 𝑋^ ) and G → − 𝑋 . 𝑋^ stands for perturbed node


representation. Since the noise are independent and random, the encoder are more robust to feature
input.

^ A).
min BCE(𝑔(𝑓 ( G)), (83)
𝑔,𝑓
As for structure rebuilding methods, GAE [183] is the simplest instance, which can be regard
as an implement of Equation 76 where G^ = G and G → − 𝐴. A is the adjacency matrix of graph.
Similar with feature rebuilding method, GAE compresses raw node representation vectors into
low-dimensional embedding with its encoder. Then the adjacency matrix is rebuilt by computing
node embedding similarity. Loss function is set to error between ground-truth adjacency matrix
and the recovered one, to help model rebuild correct graph structure.

8.3 Contrast-Based pretext task design


The mutual information maximization principle, which implement self-supervising by predicting
the similarity between the two augmented views, forms the foundation of contrast-based approaches.
Since mutual information represent the degree of correlation between two samples, we can maximize
it in augmented pairs and minimize it in random-selected pairs.
The contrast-based graph SSL taxonomy can be formulated as Equation 84. The discriminator
that calculates similarity of sample pairs is indicated by pretext decoder 𝑔. G ( 1) and G ( 2) are two
variants of 𝐺 that have been augmented. Since graph contrastive learning methods differ from each
other in 1) view generation, 2) MI estimation method we introduce this methodology in these 2
perspectives.

minL𝑠𝑠𝑙 (𝑔[𝑓 ( G^ ( 1) ), 𝑓 ( G^ ( 2) )]). (84)


𝑔,𝑓

8.3.1 View generation. Traditional pipeline of contrastive learning-based models is first augmenting
the graph by well-crafted empirical methods, and then maximizing the consistency between different
augmentations. Following methods in computer vision domain and considering non-Euclidean
structure of graph data, typical graph augmentation methods aim to modify graph topologically or
representatively.
Given graph G = (𝐴, 𝑋 ), the topologically augmentation methods usually modify the adjacency
matrix 𝐴, which can be formulated as:
𝐴^ = 𝒯𝐴 (𝐴), (85)
where 𝒯𝐴 (·) is the transform function of adjacency matrix. Topology augmentation methods has
many variants, in which the most popular one is edge modification, given by 𝒯𝐴 (𝐴) = 𝑃 ◦ 𝐴 +
𝑄 ◦ (1 − 𝐴). 𝑃 and 𝑄 are two matrix representing edge dropping and adding respectively. Another
method, graph diffusion, connect nodes with their k-hop neighhors with specific weight, defined as:
0 𝛼𝑘 𝑇 . where 𝛼 and 𝑇 are coefficient and transition matrix. Graph diffusion method
Í∞
𝒯𝐴 (𝐴) = 𝑘= 𝑘

can integrate broad topological information with local structure.


In the other hand, the representative augmentation modify the node representation directly,
which can be formulated as:
𝑋^ = 𝒯𝑋 (𝑋 ), (86)
usually 𝒯𝑋 (·) can be a simple masking operater, a.k.a. 𝒯𝑋 (𝑋 ) = 𝑀 ◦ 𝑋 and 𝑀 ∈ {0, 1} 𝑁 ×𝐷 . Based
on such mask strategy, some method propose ways to improve performance. GCA [471] preserves
critical nodes while giving less significant nodes a larger masking probability, where significance is
determined by node centrality.

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


32 W. Ju, et al.

As introduced before, the paradigm of augmentation has been prove to be effective in contrastive
learning view generation. However, given the variety of graph data, it is challenging to main-
tain semantics properly during augmentations. In order to preserve valuable nature in specific
graph dataset, There are currently three mainly-used methods: picking by trial-and-errors, trying
laborious search or seeking domain-specific information as guidance [169, 243]. It is clear that
such complicated augmentation methods constrain the effectiveness and widespread application of
graph contrastive learning. So many newest works question the necessity of augmentation and
seek other contrastive views generation methods.
SimGCL [422] is one of outstanding works challenging the effectiveness of graph augmentation.
The author find that noise can be a substitution to augmentation to produce graph views in specific
task such as recommendation. After doing ablation study about augmentation and InfoNCE [397],
they find that the InfoNCE loss, not the augmentation of the graph, is what makes the difference. It
can be further explained by the importance of distribution uniformity. The contrastive learning
enhance model representation ability by intensifying two characteristics: The alignment of fea-
tures from positive samples and the uniformity of the normalized feature distribution. SimGCL
directly add random noises to node embeddings as augmentation, to control the uniformity of the
representation distribution in a more effective way:

e𝑖( 1) = e𝑖 + 𝜖 ( 1) ∗ 𝜏𝑖( 1) , e𝑖( 2) = e𝑖 + 𝜖 ( 2) ∗ 𝜏𝑖( 2) ,


(87)
𝜖 ∼ N (0, 𝜎 2 ),

where e𝑖 is a node representation in embedding space, 𝜏𝑖( 1) and 𝜏𝑖( 2) are two random sampled unit
vector. The experiment results indicate that SimGCL performs better than its graph augmentation-
based competitors and means, while training time is significantly decreased.
SimGRACE [391] is another graph contrastive learning framework without data augmentation.
Motivated by the observation that despite encoder disrupted, graph data can effectively maintain
their semantics, SimGRACE take GNN with its modified version as encoder to produce two con-
trastive embedding views by the same graph input. For GNN encoder 𝑓 (·; 𝜃 ), the two contrastive
embedding views e, e′ can be computed by:
e ( 1) = 𝑓 (G; 𝜃 ), e ( 2) = 𝑓 (G; 𝜃 + 𝜖 · Δ𝜃 ),
(88)
Δ𝜃𝑙 ∼ N (0, 𝜎𝑙2 ),
where Δ𝜃𝑙 represents GNN parameter perturbation Δ𝜃 in the 𝑙th layer. SimGRACE can improve
alignment and uniformity simutanously, proving its capacity of producing high-quality embedding.
8.3.2 MI estimation method. The mutual information 𝐼 (𝑥, 𝑦) measures the information that x
and y share, given a pair of random variables (𝑥, 𝑦). As discussed before, mutual information is a
significant component of contrast-based method by formulating the loss function. Mathematically
rigorous MI is defined on the probability space, we can formulate mutual information between a
pair of instances (𝑥𝑖 , 𝑥 𝑗 ) as:
𝐼 (𝑥, 𝑦) = 𝐷 𝐾𝐿 (𝑝 (𝑥, 𝑦)||𝑝 (𝑥)𝑝 (𝑦))
𝑝 (𝑥, 𝑦) (89)
= 𝐸𝑝 (𝑥,𝑦) [log ].
𝑝 (𝑥)𝑝 (𝑦)
However, directly compute Equation 89 is quiet difficult, so we introduce several different types
of estimation for MI:
InfoNCE. Noise-contrastive estimator is a widely used lower bound MI estimatior. Given a
positive sample 𝑦 and several negative sample 𝑦𝑖′, a noise-contrastive estimator can be fomulated

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


A Comprehensive Survey on Deep Graph Representation Learning 33

as [470][286]:
𝑒 𝑔 (𝑥,𝑦)
L = −𝐼 (𝑥, 𝑦) = −𝐸𝑝 (𝑥,𝑦) [log ′ ], (90)
𝑒 𝑔 (𝑥,𝑦) + 𝑖 𝑒 𝑔 (𝑥,𝑦𝑖 )
Í

usually the kernal function 𝑔(·) can be cosine similarity or dot product.
Triplet Loss. Intuitively, we can push the similarity between positive samples and negative
samples differ by a certain distance. So we can define the loss function in the following manner [162]:

L = 𝐸𝑝 (𝑥,𝑦) [max(𝑔(𝑥, 𝑦) − 𝑔(𝑥, 𝑦 ′) + 𝜖, 0)], (91)


where 𝜖 is a hyperparameter. This function is straightforward and easy to compute.
BYOL Loss. Estimation without negative samples is investigated by BYOL [120]. The estimator
is Asymmetrical structured:
𝑔(𝑥) · 𝑦
L = 𝐸𝑝 (𝑥,𝑦) [2 − 2 ], (92)
∥𝑔(𝑥) ∥ ∥𝑦 ∥
note that encoder 𝑔 should keep the dimension of input and output the same.

8.4 Summary
This section introduces graph self-supervised learning and we provide the summary as follows:
• Techniques. Differ from classic supervised learning and semi-supervised learning, self-
supervised learning increase model generalization ability and robustness, whilst decrease
label reliance. Graph SSL utilize pretext tasks to extract inherent information in representation
distribution. Typical Graph SSL methods can be devide into generation-based and contrast-
based. Generation-based methods learns an encoder with ability to reconstruct graph as
precise as possible, motivated by Autoencoder. Contrast-based methods attract significant
interests recently, they learns an encoder to minimizing mutual information between relevant
instance and maximizing mutaul information between unrelated instance.
• Challenges and Limitations. Although graph SSL has achieved superior performance in
many tasks, its theoretical basis is not so solid. Many well-known methods are just validated
through experiments without explaining theoretical or coming up with mathematical proof.
It is imperative to establish a strong theoretical foundation for graph SSL.
• Future Works. In the future we expect more graph ssl methods designed essentially by
theoretical proof, without dedicate designed augment process or pretext tasks by intuition.
This will bring us more definite mathematical properties and less ambiguous empirical sense.
Also, graphs are a prevalent form of data representation across diverse domains, yet obtaining
manual labels can be prohibitively expensive. Expanding the applications of graph SSL to
broader fields is a promising avenue for future research.

9 Graph Structure Learning By Jianhao Shen


Graph structure determines how node features propagate and affect each other, playing a crucial
role in graph representation learning. In some scenarios the provided graph is incomplete, noisy, or
even has no structure information at all. Recent research also finds that graph adversarial attacks
(i.e., modifying a small number of node features or edges), can degrade learned representations
significantly. These issues motivate graph structure learning (GSL), which aims to learn a new
graph structure to produce optimal graph representations. According to how edge connectivity is
modeled, there are three different approaches in GSL, namely metric-based approaches, model-based
approaches, and direct approaches. Besides edge modeling, regularization is also a common trick to
make the learned graph satisfy some desired properties. We first present the basic framework and

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


34 W. Ju, et al.

Table 7. Summary of graph structure learning methods.

Regularization
Method Structure Learning
Sparsity Low-rank Smoothness
AGCN [214] Mahalanobis distance
GRCN [421] Inner product ✓
Metric-based

CAGCN [472] Inner product ✓


GNNGUARD [440] Cosine similarity
IDGL [52] Cosine similarity ✓ ✓ ✓
HGSL [451] Cosine similarity ✓
GDC [108] Graph diffusion ✓
GLN [277] Recurrent blocks
GLCN [159] One-layer neural network ✓ ✓
Model-based

NeuralSparse [457] Multi-layer neural network ✓


GAT [351] Self-attention
GaAN [435] Gated attention
hGAO [104] Hard attention ✓
VIB-GSL [337] Dot-product attention ✓
MAGNA [365] Graph attention diffusion
GLNN [106] MAP estimation ✓ ✓
Direct

GSML [357] Bilevel optimization ✓


BGCNN [442] Bayesion optimization
VGCN [83] Stochastic variational inference

regularization methods for GSL in Sec. 9.1 and Sec. 9.2, respectively, and then introduce different
categories of GSL in Sec. 9.3, 9.4 and 9.5. We summarize GSL approaches in Table 7.

9.1 Overall Framework


We denote a graph by G = (A, X), where A ∈ R𝑁 ×𝑁 is the adjacency matrix and X ∈ R𝑁 ×𝑀 is
the node feature matrix with 𝑀 being the dimension of each node feature. A graph encoder 𝑓𝜃
learns to represent the graph based on node features and graph structure for task-specific objective
L𝑡 (𝑓𝜃 (A, X)). In the GSL setting, there is also a graph structure learner which aims to build a
new graph adjacency matrix A∗ to optimize the learned representation. Besides the task-specific
objective, a regularization term can be added to constrain the learned structure. So the overall
objective function of GSL can be formulated as

min∗ L = L𝑡 (𝑓𝜃 (A∗, X)) + 𝜆L𝑟 (A∗, A, X), (93)


𝜃,A

where L𝑡 is the task-specific objective, L𝑟 is the regularization term and 𝜆 is a hyperparameter for
the weight of regularization.

9.2 Regularization
The goal of regularization is to constrain the learned graph to satisfy some properties by adding
some penalties to the learned structure. The most common properties used in GSL are sparsity, low
lank, and smoothness.
9.2.1 Sparsity Noise or adversarial attacks will introduce redundant edges into graphs and degrade
the quality of graph representation. An effective technique to remove unnecessary edges is sparsity
regularization, i.e., adding a penalty on the number of nonzero entries of the adjacency matrix

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


A Comprehensive Survey on Deep Graph Representation Learning 35

(ℓ0 -norm):
L𝑠𝑝 = ∥A∥ 0, (94)
however, ℓ0 -norm is not differentiable so optimizing it is difficult, and in many cases ℓ1 -norm is used
instead as a convex relaxation. Other methods to impose sparsity include pruning and discretization.
These processes are also called postprocessing since they usually happen after the adjacency matrix
is learned. Pruning removes part of the edges according to some criteria. For example, edges with
weights lower than a threshold, or those not in the top-K edges of nodes or graph. Discretization
is applied to generate graph structure by sampling from some distribution. Compared to directly
learning edge weights, sampling enjoys the advantage to control the generated graph, but has
issues during optimizing since sampling itself is discrete and hard to optimize. Reparameterization
and Gumbel-softmax are two useful techniques to overcome such issue, and are widely adopted in
GSL.
9.2.2 Low Rank In real-world graphs, similar nodes are likely to group together and form commu-
nities, which should lead to a low-rank adjacency matrix. Recent work also finds that adversarial
attacks tend to increase the rank of the adjacency matrix quickly. Therefore, low rank regularization
is also a useful tool to make graph representation learning more robust:
L𝑙𝑟 = 𝑅𝑎𝑛𝑘 (A). (95)
It is hard to minimize matrix rank directly. A common technique is to optimize the nuclear norm,
which is a convex envelope of the matrix rank:
L𝑛𝑐 = ∥A∥∗ = ∑_{𝑖=1}^{𝑁} 𝜎𝑖,    (96)
where 𝜎𝑖 are the singular values of A. Entezari et al. replace the learned adjacency matrix with its rank-r
approximation obtained by singular value decomposition (SVD) to achieve robust graph learning against
adversarial attacks.
9.2.3 Smoothness. A common assumption is that connected nodes share similar features, or in
other words, the graph is “smooth” as the difference between local neighbors is small. The following
metric is a natural way to measure graph smoothness:
L𝑠𝑚 = (1/2) ∑_{𝑖,𝑗=1}^{𝑁} 𝐴𝑖𝑗 (𝑥𝑖 − 𝑥𝑗)² = 𝑡𝑟(X⊤(D − A)X) = 𝑡𝑟(X⊤LX),    (97)

where D is the degree matrix of A and L = D − A is called the graph Laplacian. A variant is to use the
normalized graph Laplacian L̂ = D^{−1/2} L D^{−1/2}.
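For reference, the three regularizers of Sec. 9.2 can be written compactly as below; this is a PyTorch sketch that follows Eqs. (94), (96) and (97), with the nuclear norm computed directly from singular values.

```python
import torch

def sparsity_reg(A):
    """l1 relaxation of the l0 sparsity penalty in Eq. (94)."""
    return A.abs().sum()

def lowrank_reg(A):
    """Nuclear norm, i.e., the sum of singular values, as in Eq. (96)."""
    return torch.linalg.svdvals(A).sum()

def smoothness_reg(A, X):
    """Feature smoothness tr(X^T L X) with L = D - A, as in Eq. (97)."""
    L = torch.diag(A.sum(dim=1)) - A
    return torch.trace(X.t() @ L @ X)
```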

9.3 Metric-based Methods


Metric-based methods measure the similarity between nodes as the edge weights. They follow
the basic assumption that similar nodes tend to be connected with each other. We introduce some
representative works below.
Adaptive Graph Convolutional Neural Networks [214] (AGCN). AGCN learns a task-driven adaptive
graph during training to enable a more generalized and flexible graph representation model. After
parameterizing the distance metric between nodes, AGCN is able to adapt graph topology to the
given task. It proposes generalized Mahalanobis distance between two nodes with the following
formula:
D(𝑥𝑖, 𝑥𝑗) = √( (𝑥𝑖 − 𝑥𝑗)⊤ 𝑀 (𝑥𝑖 − 𝑥𝑗) ),    (98)


where 𝑀 = 𝑊𝑑𝑊𝑑⊤ and 𝑊𝑑 are the trainable weights that minimize the task-specific objective. Then the
Gaussian kernel is used to obtain the adjacency matrix:
G𝑖 𝑗 = exp(−D(𝑥𝑖 , 𝑥 𝑗 )/(2𝜎 2 )), (99)
𝐴^ = 𝑛𝑜𝑟𝑚𝑎𝑙𝑖𝑧𝑒 (G). (100)
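A minimal sketch of the construction in Eqs. (98)-(100) is given below; the variable names are illustrative and the code is not AGCN's released implementation.

```python
import torch

def mahalanobis_adjacency(X, Wd, sigma=1.0):
    """Adaptive adjacency in the spirit of Eqs. (98)-(100): generalized Mahalanobis
    distances with a learnable metric, a Gaussian kernel, and row normalization."""
    M = Wd @ Wd.t()                                   # M = Wd Wd^T
    diff = X.unsqueeze(1) - X.unsqueeze(0)            # pairwise differences, shape (N, N, d)
    dist = torch.sqrt(torch.einsum('ijk,kl,ijl->ij', diff, M, diff).clamp_min(1e-12))
    G = torch.exp(-dist / (2 * sigma ** 2))           # Gaussian kernel, Eq. (99)
    return G / G.sum(dim=1, keepdim=True)             # row-normalized adjacency, Eq. (100)

X = torch.randn(6, 8)
Wd = torch.randn(8, 8, requires_grad=True)            # trainable metric parameters
print(mahalanobis_adjacency(X, Wd).shape)
```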
Graph-Revised Convolutional Network [421] (GRCN). GRCN uses a graph revision module to
predict missing edges and revise edge weights through joint optimization on downstream tasks. It
first learns the node embedding with GCN and then calculates pair-wise node similarity with the
dot product as the kernel function.
𝑍 = 𝐺𝐶𝑁𝑔 (𝐴, 𝑋),    (101)
𝑆𝑖𝑗 = ⟨𝑧𝑖, 𝑧𝑗⟩.    (102)
The revised adjacency matrix is the residual summation of the original adjacency matrix 𝐴^ = 𝐴+𝑆.
GRCN also applies a sparsification technique on the similarity matrix 𝑆 to reduce computation cost:
𝑆^{(𝐾)}_{𝑖𝑗} = { 𝑆𝑖𝑗,  if 𝑆𝑖𝑗 ∈ 𝑡𝑜𝑝𝐾(𝑆𝑖);   0,  if 𝑆𝑖𝑗 ∉ 𝑡𝑜𝑝𝐾(𝑆𝑖) }.    (103)
Threshold pruning is also a common strategy for sparsification. For example, CAGCN [472] also
uses dot product to measure node similarity, and refines the graph structure by removing edges
between nodes whose similarity is less than a threshold 𝜏𝑟 and adding edges between nodes whose
similarity is greater than another threshold 𝜏𝑎 .
Defending Graph Neural Networks against Adversarial Attacks [440] (GNNGuard). GNNGuard
measures similarity between a node 𝑢 and its neighbor 𝑣 in the 𝑘-th layer by cosine similarity and
normalizes node similarity at the node level within the neighborhood as follows:
𝑠^𝑘_{𝑢𝑣} = (ℎ^𝑘_𝑢 ⊙ ℎ^𝑘_𝑣) / (∥ℎ^𝑘_𝑢∥₂ ∥ℎ^𝑘_𝑣∥₂),    (104)

𝛼^𝑘_{𝑢𝑣} = { (𝑠^𝑘_{𝑢𝑣} / ∑_{𝑣∈N𝑢} 𝑠^𝑘_{𝑢𝑣}) × 𝑁̂^𝑘_𝑢 / (𝑁̂^𝑘_𝑢 + 1),  if 𝑢 ≠ 𝑣;   1 / (𝑁̂^𝑘_𝑢 + 1),  if 𝑢 = 𝑣,    (105)

where N𝑢 denotes the neighborhood of node 𝑢 and 𝑁̂^𝑘_𝑢 = ∑_{𝑣∈N𝑢} ∥𝑠^𝑘_{𝑢𝑣}∥₀. To stabilize GNN training,
it also proposes a layer-wise graph memory by keeping part of the information from the previous
layer in the current layer. Similar to GNNGuard, IDGL [52] uses multi-head cosine similarity and
mask edges with node similarity smaller than a non-negative threshold, and HGSL [451] generalizes
this idea to heterogeneous graphs.
Graph Diffusion Convolution [108] (GDC). GDC replaces the original adjacency matrix with
generalized graph diffusion matrix S:

S = ∑_{𝑘=0}^{∞} 𝜃𝑘 T^𝑘,    (106)
where 𝜃𝑘 is the weighting coefficient and T is the generalized transition matrix. To ensure convergence,
GDC further requires that ∑_{𝑘=0}^{∞} 𝜃𝑘 = 1 and the eigenvalues of T lie in [0, 1]. The random
walk transition matrix T𝑟𝑤 = AD^{−1} and the symmetric transition matrix T𝑠𝑦𝑚 = D^{−1/2}AD^{−1/2} are
two examples. This new graph structure allows graph convolution to aggregate information from a
larger neighborhood. The graph diffusion acts as a smoothing operator to filter out underlying noise.
However, in most cases graph diffusion will result in a dense adjacency matrix 𝑆, so sparsification


techniques like top-k filtering and threshold filtering are applied to the diffusion matrix. Following
GDC, several other graph diffusion variants have been proposed. For example, AdaCAD [225] proposes Class-
Attentive Diffusion, which further considers node features and aggregates nodes that probably belong to the
same class among K-hop neighbors. Adaptive diffusion convolution [450] (ADC) learns the optimal
neighborhood size via optimizing a bi-level problem.
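As an illustration of Eq. (106), the sketch below computes a truncated diffusion matrix with PPR-style weights 𝜃𝑘 = 𝛼(1 − 𝛼)^𝑘 and the random-walk transition matrix, followed by top-k filtering; it is a simplified example rather than the official GDC implementation.

```python
import torch

def graph_diffusion(A, K=10, alpha=0.15):
    """Truncated generalized graph diffusion (Eq. 106) with PPR-style weights
    theta_k = alpha * (1 - alpha)^k and T_rw = A D^{-1}."""
    deg = A.sum(dim=0)
    T = A / deg.clamp_min(1e-12)              # column-normalized random-walk transition matrix
    S = torch.zeros_like(A)
    T_power = torch.eye(A.size(0))
    for k in range(K):
        S = S + alpha * (1 - alpha) ** k * T_power
        T_power = T_power @ T                 # accumulate T^k
    return S

def sparsify_topk(S, k=4):
    """Post-hoc top-k filtering to keep the diffused graph sparse."""
    vals, idx = S.topk(k, dim=-1)
    return torch.zeros_like(S).scatter_(-1, idx, vals)
```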

9.4 Model-based Methods


Model-based methods parameterize edge weights with more complex models like deep neural
networks. Compared to metric-based methods, model-based methods offer greater flexibility and
expressive power.
Graph Learning Network [277] (GLN). GLN proposes a recurrent block to first produce interme-
diate node embeddings and then merge them with adjacency information as the output of this
layer to predict the adjacency matrix for the next layer. Specifically, it uses convolutional graph
operations to extract node features, and creates a local-context embedding based on node features
and the current adjacency matrix:
𝐻^{(𝑙)}_{𝑖𝑛𝑡} = 𝜎𝑙 ( ∑_{𝑖=1}^{𝑘} 𝜏(𝐴^{(𝑙)}) 𝐻^{(𝑙)} 𝑊_𝑖^{(𝑙)} ),    (107)
𝐻^{(𝑙)}_{𝑙𝑜𝑐𝑎𝑙} = 𝜎𝑙 ( 𝜏(𝐴^{(𝑙)}) 𝐻^{(𝑙)}_{𝑖𝑛𝑡} 𝑈^{(𝑙)} ),    (108)
where 𝑊_𝑖^{(𝑙)} and 𝑈^{(𝑙)} are the learnable weights. GLN then predicts the next adjacency matrix as
follows:
𝐴^{(𝑙+1)} = 𝜎𝑙 ( 𝑀^{(𝑙)} 𝛼𝑙(𝐻^{(𝑙)}_{𝑙𝑜𝑐𝑎𝑙}) 𝑀^{(𝑙)⊤} ).    (109)
Similarly, GLCN [159] models graph structure with a softmax layer over the inner product
between the difference of node features and a learnable vector. NeuralSparse [457] uses a multi-
layer neural network to generate a learnable distribution from which a sparse graph structure is
sampled. PTDNet [240] prunes graph edges with a multi-layer neural network and penalizes the
number of non-zero elements to encourage sparsity.
Graph Attention Networks [351] (GAT). Besides constructing a new graph to guide the message
passing and aggregation process of GNNs, many recent researchers also leverage the attention
mechanism to adaptively model the relationship between nodes. GAT is the first work to introduce
the self-attention strategy into graph learning. In each attention layer, the attention weight between
two nodes is calculated as the Softmax output on the combination of linear and non-linear transform
of node features:

𝑒𝑖 𝑗 = 𝑎(Wℎ®𝑖 , Wℎ®𝑗 ), (110)


𝛼𝑖𝑗 = exp(𝑒𝑖𝑗) / ∑_{𝑘∈N𝑖} exp(𝑒𝑖𝑘),    (111)

where N𝑖 denotes the neighborhood of node 𝑖, W is a learnable linear transformation and 𝑎 is a pre-defined
attention function. In the original implementation of GAT, 𝑎 is a single-layer neural network with
LeakyReLU:
𝑎(Wℎ®𝑖 , Wℎ®𝑗 ) = LeakyReLU(®a⊤ [Wℎ®𝑖 ||Wℎ®𝑗 ]). (112)
The attention weights are then used to guide the message-passing phase of GNNs:
ℎ®𝑖′ = 𝜎 ( ∑_{𝑗∈N𝑖} 𝛼𝑖𝑗 Wℎ®𝑗 ),    (113)


where 𝜎 is a nonlinear function. It is beneficial to concatenate multiple heads of attention together
to get a more stable and generalizable model, so-called multi-head attention. The attention
mechanism serves as a soft graph structure learner which captures important connections within
node neighborhoods. Following GAT, many recent works propose more effective and efficient
graph attention operators to improve performance. GaAN [435] adds a soft gate at each attention
head to adjust its importance. MAGNA [365] proposes a novel graph attention diffusion layer to
incorporate multi-hop information. One drawback of graph attention is that the time and space
complexities are both 𝑂 (𝑁 3 ). hGAO [104] performs hard graph attention by limiting node attention
to its neighborhood. VIB-GSL [337] adopts the information bottleneck principle to guide feature
masking in order to drop task-irrelevant information and preserve actionable information for the
downstream task.
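For completeness, a didactic single-head attention layer corresponding to Eqs. (110)-(113) is sketched below; it operates on a dense adjacency matrix that is assumed to contain self-loops and is not the reference GAT implementation.

```python
import torch
import torch.nn.functional as F

class SingleHeadGATLayer(torch.nn.Module):
    """Didactic single-head graph attention (Eqs. 110-113) over a dense adjacency
    matrix A that is assumed to contain self-loops."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = torch.nn.Linear(in_dim, out_dim, bias=False)
        self.a = torch.nn.Parameter(torch.randn(2 * out_dim))

    def forward(self, H, A):
        Wh = self.W(H)                                           # W h_i for all nodes
        n = Wh.size(0)
        pairs = torch.cat([Wh.unsqueeze(1).expand(n, n, -1),
                           Wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(pairs @ self.a, negative_slope=0.2)     # Eq. (112): e_ij
        e = e.masked_fill(A == 0, float('-inf'))                 # attend only to neighbors
        alpha = torch.softmax(e, dim=-1)                         # Eq. (111)
        return F.elu(alpha @ Wh)                                 # Eq. (113)
```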

9.5 Direct Methods


Direct methods treat edge weights as free learnable parameters. These methods enjoy more flexibility
but are also more difficult to train. The optimization is usually carried out in an alternating way,
i.e., iteratively updating the adjacency matrix A and the GNN encoder parameters 𝜃 .
GLNN [106]. GLNN uses MAP estimation to learn an optimal adjacency matrix for a joint
objective function including sparsity and smoothness. Specifically, it targets at finding the most
probable adjacency matrix 𝐴^ given graph node features 𝑥:
Ã_{𝑀𝐴𝑃}(𝑥) = argmax_{Â} 𝑓(𝑥 | Â) 𝑔(Â),    (114)

where 𝑓(𝑥 | Â) measures the likelihood of observing 𝑥 given Â, and 𝑔(Â) is the prior distribution of
Â. GLNN uses sparsity and property constraints as the prior, and defines the likelihood function 𝑓 as:
𝑓(𝑥 | Â) = exp(−𝜆0 𝑥⊤ L̂ 𝑥)    (115)
         = exp(−𝜆0 𝑥⊤ (𝐼 − Â) 𝑥),    (116)
where 𝜆0 is a parameter. This likelihood imposes a smoothness assumption on the learned graph
structure. Some other works also model the adjacency matrix in a probabilistic manner. Bayesian
GCNN [442] adopts a Bayesian framework and treats the observed graph as a realization from a
family of random graphs. It then estimates the posterior probability of labels given the observed
graph adjacency matrix and features with Monte Carlo approximation. VGCN [83] follows a similar
formulation and estimates the graph posterior through stochastic variational inference.
Graph Sparsification via Meta-Learning [357] (GSML). GSML formulates GSL as a meta-learning
problem and uses bi-level optimization to find the optimal graph structure. The goal is to find a
sparse graph structure which also leads to high node classification accuracy given labeled and
unlabeled nodes. To achieve this, GSML makes the inner optimization the training on the node
classification task, and targets the outer optimization at the sparsity of the graph structure,
which formulates the following bi-level optimization problem:
Ĝ∗ = min_{Ĝ∈Φ(𝐺)} 𝐿_{𝑠𝑝𝑠}(𝑓_{𝜃∗}(Ĝ), 𝑌𝑈),    (117)
s.t.  𝜃∗ = argmin_{𝜃} 𝐿_{𝑡𝑟𝑎𝑖𝑛}(𝑓𝜃(Ĝ), 𝑌𝐿).    (118)

In this bi-level optimization problem, Ĝ ∈ Φ(𝐺) are the meta-parameters, which are optimized directly
without parameterization. Similarly, LSD-GNN [97] also uses bi-level optimization. It models the graph
structure with a probability distribution over graphs and reformulates the bi-level program in terms
of the continuous distribution parameters.


9.6 Summary
This section introduces graph structure learning and we provide the summary as follows:
• Techniques. GSL aims to learn an optimized graph structure for better graph representations.
It is also used for more robust graph representation against adversarial attacks. According
to the way of edge modeling, we categorize GSL into three groups: metric-based methods,
model-based methods, and direct methods. Regularization is also a commonly used principle
to make the learned graph structure satisfy specific properties including sparsity, low-rank
and smoothness.
• Challenges and Limitations. Since there is no way to access the groundtruth or optimal
graph structure as training data, the learning objective of GSL is either indirect (e.g., perfor-
mance on downstream tasks) or manually designed (e.g., sparsity and smoothness). Therefore,
the optimization of GSL is difficult and the performance is often not satisfactory. In addition, many
GSL methods are based on the homophily assumption, i.e., similar nodes are more likely to
connect with each other. However, many other types of connections exist in the real world,
which imposes great challenges for GSL.
• Future Works. In the future we expect more efficient and generalizable GSL methods to
be applied to large-scale and heterogeneous graphs. Most existing GSL methods focus on
pair-wise node similarities and thus struggle to scale to large graphs. Besides, they often
learn homogeneous graph structure, but in many scenarios graphs are heterogeneous.

10 Social Analysis By Ziyue Qiao


In the real world, there usually exist complex relations and interactions between people and multiple
entities. Taking people, concrete things, and abstract concepts in society as nodes and taking the
diverse, changeable, and large-scale connections between data as links, we can form massive and
complex social information as social networks [33, 339]. Compared with traditional data structures
such as texts and forms, modeling social data as graphs has many benefits. Especially with
the arrival of the "big data" era, more and more heterogeneous information is interconnected
and integrated, and it is difficult and uneconomical to model this information with a traditional
data structure. The graph is an effective implementation for information integration, as it can
naturally incorporate different types of objects and their interactions from heterogeneous data
sources [265, 317]. A summarization of social analysis applications is provided in Table 8.

10.1 Concepts of Social Networks


A social network is usually composed of multiple types of nodes, link relationships, and node
attributes, which inherently include rich structural and semantic information. Specifically, a social
network can be homogeneous or heterogeneous and directed or undirected in different scenarios.
Without loss of generality, we define the social network as a directed heterogeneous graph 𝐺 =
{𝑉, 𝐸, T, R}, where 𝑉 = {𝑛𝑖}_{𝑖=1}^{|𝑉|} is the node set, 𝐸 = {𝑒𝑖}_{𝑖=1}^{|𝐸|} is the edge set, T = {𝑡𝑖}_{𝑖=1}^{|T|} is the node
type set, and R = {𝑟𝑖}_{𝑖=1}^{|R|} is the edge type set. Each node 𝑛𝑖 ∈ 𝑉 is associated with a node type
mapping 𝜙𝑛(𝑛𝑖) = 𝑡𝑗 : 𝑉 → T and each edge 𝑒𝑖 ∈ 𝐸 is associated with an edge type mapping
𝜙𝑒(𝑒𝑖) = 𝑟𝑗 : 𝐸 → R. A node 𝑛𝑖 may have a feature set, where the feature space is specific to the
node type. An edge 𝑒𝑖 is also represented by the node pair (𝑛𝑗, 𝑛𝑘) at its two ends and can be directed or
undirected with relation-type-specific attributes. If |T| = 1 and |R| = 1, the social network is a
homogeneous graph; otherwise, it is a heterogeneous graph.
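As a concrete illustration of this definition, a tiny heterogeneous social network with explicit node- and edge-type mappings can be stored as follows; the entities are made-up examples rather than data from any real network.

```python
# A minimal sketch of the heterogeneous social network G = {V, E, T, R} defined above,
# stored with plain Python dictionaries; entities and types are illustrative examples.
node_types = {"u1": "Author", "u2": "Author", "p1": "Publication", "v1": "Venue"}
edges = [
    ("u1", "p1", "Authorship"),   # directed edge with relation type phi_e(e) = Authorship
    ("u2", "p1", "Authorship"),
    ("u1", "u2", "Co-Author"),    # undirected relations can be stored in both directions
    ("u2", "u1", "Co-Author"),
    ("p1", "v1", "Publishing"),
]
node_features = {"u1": [0.3, 0.7], "u2": [0.1, 0.9]}   # type-specific feature spaces

T = set(node_types.values())                  # node type set
R = {r for _, _, r in edges}                  # edge type set
is_homogeneous = (len(T) == 1 and len(R) == 1)
print(T, R, is_homogeneous)
```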
Almost any data produced by social activities can be modeled as social networks, for example,
the academic social network produced by academic activities such as collaboration and citation,
the online social network produced by user following and followed on social media, and the


Table 8. A summarization of social analysis applications

Academic Social Network
  Node types: Author, Publication, Venue, Organization, Keyword
  Edge types: Authorship, Co-Author, Advisor-advisee, Citing, Cited, Co-Citing, Publishing
  Applications:
    Classification/Clustering: paper/author classification [74, 283, 368, 432], name disambiguation [42, 281, 444]
    Relationship prediction: co-authorship [56, 59], citation relationship [161, 362, 425], advisor-advisee relationship [228, 456]
    Recommendation: collaborator recommendation [187, 188, 236], paper recommendation [14, 334], venue recommendation [257, 424]

Social Media Network
  Node types: User, Blog, Article, Image, Video
  Edge types: Following, Like, Unlike, Clicked, Viewed, Commented, Reposted
  Applications:
    Anomaly detection: malicious attacks [234, 338], emergency detection [21], robot discovery [91]
    Sentiment analysis: customer feedback [298, 441], public events [254, 349]
    Influence analysis: important node finding [73, 295], information diffusion modeling [178, 196, 269, 431]

Location-based Social Network
  Node types: Restaurant, Cinema, Mall, Parking
  Edge types: Friendship, Check-in
  Applications:
    POI recommendation: spatial/temporal influence [322, 376, 453], social relationship [400], textual information [402]
    Urban computing: traffic congestion prediction [160, 398], urban mobility analysis [35, 414], event detection [326, 423]

location-based social network produced by human activities on different locations. Based on


constructing social networks, researchers have new paths to data mining, knowledge discovery,
and multiple application tasks on social data. Exploring social networks also brings new challenges.
One of the critical challenges is how to succinctly represent the network from the massive and
heterogeneous raw graph data, that is, how to learn continuous and low-dimensional social network
representations, so that researchers can efficiently perform advanced machine learning techniques
on the social network data for multiple application tasks, such as analysis, clustering, prediction,
and knowledge discovery. Thus, graph representation learning on the social network becomes the
foundational technique for social analysis.

10.2 Academic Social Network


Academic collaboration is a common and important behavior in academic society, and also a
major way for scientists and researchers to innovate and achieve scientific breakthroughs, which
leads to complex social relationships between scholars. In addition, the academic data generated by
academic collaboration contains a large number of interconnected entities with complex
relationships [189]. Normally, in an academic social network, the node type set consists of Author,
Publication, Venue, Organization, Keyword, etc., and the relation set consists of Authorship, Co-
Author, Advisor-advisee, Citing, Cited, Co-Citing, Publishing, Co-Word, etc. Note that in most social
networks, each relation type always connects two fixed node types with a fixed direction. For
example, the relation Authorship points from the node type Author to Publication, and the Co-Author
is an undirected relation between two nodes with type Author. Based on the node and relation
types in an academic social network, one can divide it into multiple categories. For example, the
co-author network with nodes of Author and relations of Co-Author, the citation network with
nodes of Publication and relation of Citing, and the academic heterogeneous information graph with


multiple academic node and relation types. Many research institutes and academic search engines,
such as Aminer1 , DBLP2 , Microsoft Academic Graph (MAG)3 , have provided open academic social
network datasets for research purposes.
There are multiple applications of graph representation learning on the academic social net-
work. Roughly, they can be divided into three categories–academic entity classification/clustering,
academic relationship prediction, and academic resource recommendation.
• Academic entities usually belong to different classes of research areas. Research of academic
entity classification and clustering aims to categorize these entities, such as papers and
authors, into different classes [74, 283, 368, 432]. In literature, academic networks such as
Cora, CiteSeer, and Pubmed [312] have become the most widely used benchmark datasets for
examining the performance of graph representation learning models on paper classification.
Also, the author name disambiguation problem [42, 281, 444] is also essentially a node
clustering task on co-author networks and is usually solved by the graph representation
learning technique.
• Academic relationship prediction represents the link prediction task on various academic
relations. Typical applications are co-authorship prediction [56, 59] and citation relationship
prediction [161, 362, 425]. Existing methods learn representations of authors and papers and
use the similarity between two nodes to predict the link probability. Besides, some work [228,
456] studies the problem of advisor-advisee relationship prediction in the collaboration
network.
• Various academic recommendation systems have been introduced to retrieve academic
resources for users from large amounts of academic data in recent years. For example, collab-
orator recommendation [187, 188, 236] benefit researchers by finding suitable collaborators
under particular topics; paper recommendation [14, 334] help researchers find relevant pa-
pers on given topics; venue recommendation [257, 424] help researchers choose appropriate
venues when they submit papers.

10.3 Social Media Network


With the development of the Internet in decades, various online social media have emerged in large
numbers and greatly changed people’s traditional social models. People can establish friendships
with others beyond the distance limit and share interests, hobbies, status, activities, and other
information among friends. These abundant interactions on the Internet form large-scale complex
social media networks, also named online social networks. Usually, in a social media network,
the node type set consists of User, Blog, Article, Image, Video, etc., and the relation type set consists
of Following, Like, Unlike, Clicked, Viewed, Commented, Reposted, etc. The main property of a social
media network is that it usually contains multi-modal information on the nodes, such as video,
image, and text. Also, the relations are more complex and multiplex, including the explicit relations
such as Like and Unlike and the implicit relations such as Clicked. The social media network can
be categorized into multiple types based on their media categories. For example, the friendship
network, the movie review network, and the music interacting network are extracted from different
social media platforms. In a broad sense, the user-item networks in online shopping systems can also
be viewed as social media networks, as they also exist on the Internet and contain rich interactions
by people. There are many widely used data sources for social media network analysis, such as
Twitter, Facebook, Weibo, YouTube, and Instagram.
1 https://www.aminer.cn/
2 https://dblp.uni-trier.de/
3 https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/


The mainstream application research on social media networks via graph representation learning
techniques mainly includes anomaly detection, sentiment analysis, and influence analysis.
• Anomaly detection aims to find strange or unusual patterns in social networks, which
has a wide range of application scenarios, such as malicious attacks [234, 338], emergency
detection [21], and robot discovery [91] in social networks. Unsupervised anomaly detection
usually learns a reconstructed graph to detect those nodes with higher reconstructed error as
the anomaly nodes [3, 452]; Supervised methods model the problem as a binary classification
task on the learned graph representations [259, 458].
• Sentiment analysis, also named opinion mining, aims to mine the sentiment, opinions, and
attitudes, which can help enterprises understand customer feedback on products [298, 441]
and help the government analyze the public emotion and make rapid response to public
events [254, 349]. The graph representation learning model is usually combined with RNN-
based [48, 430] or Transformer-based [5, 342] text encoders to incorporate both the user
relationship and textual semantic information.
• Influence analysis usually aims to find several nodes in a social network to initially spread
information such as advertisements, so as to maximize the final spread of information [73, 295].
The core challenge is to model the information diffusion process in the social network. Deep
learning methods [178, 196, 269, 431] usually leverage graph neural networks to learn node
embeddings and diffusion probabilities between nodes.

10.4 Location-based Social Network


Locations are the fundamental information of human social activities. With the wide availability of
mobile Internet and GPS positioning technology, people can easily acquire their precise locations
and socialize with their friends by sharing their historical check-ins on the Internet. This opens up
a new avenue of research on location-based social network analysis, which gathered significant
attention from the user, business, and government perspectives. Usually, in a location-based social
network, the node type set consists of User, and Location, also named Point of Interest(POI) in the
recommendation scenario containing multiple categories such as Restaurant, Cinema, Mall, Parking,
etc. The relation type set consists of Friendship, Check-in. Also, those node and relation types that
exist in traditional social media networks can be included in a location-based social network. The
main difference from other social networks is that location-based social networks are spatial and
temporal, making graph representation learning more challenging. For example, in a typical
social network constructed for the POI recommendation, the user nodes are connected with each
other by their friendship. The location nodes are connected to user nodes by relations featuring
timestamps. The location nodes also have a spatial relationship with each other and own
complex features, including categories, tags, check-in counts, the number of check-in users, etc. There
are many location-based social network datasets, such as Foursquare4 , Gowalla5 , and Waze6 . Also,
many social media such as Twitter, Instagram, and Facebook can provide location information.
The research of graph representation learning on location-based social networks can be divided
into two categories: POI recommendation for business benefits and urban computing for public
management.
• POI recommendation is one of the research hotspots in the field of location-based social
networks and recommendation systems in recent years [157, 173, 379], which aim to uti-
lize historical check-ins of users and auxiliary information to recommend potential favorite
4 https://foursquare.com/
5 https://www.gowalla.com/
6 https://www.waze.com/live-map/


places for users from a large number of location points. Existing research mainly integrates four
essential characteristics, including spatial influence, temporal influence [322, 376, 453], social
relationship [400], and textual information [402].
• Urban computing is defined as a process of analysis of the large-scale connected urban data
created from city activities of vehicles, human beings, and sensors [272, 273, 323]. Besides the
local-based social network, the urban data also includes physical sensors, city infrastructure,
traffic roads, and so on. Urban computing aims to improve the quality of public management
and the life quality of people living in city environments. Typical applications include traffic
congestion prediction [160, 398], urban mobility analysis [35, 414], event detection [326, 423].

10.5 Summary
This section introduces social analysis by graph representation learning and we provide the
summary as follows:
• Techniques. Social networks, generated by human social activities, such as communication,
collaboration, and social interactions, typically involve massive and heterogeneous data, with
different types of attributes and properties that can change over time. Thus, social network
analysis is a field of study that explores the techniques to understand and analyze the complex
attributes, heterogeneous structures, and dynamic information of social networks. Social
network analysis typically learns low-dimensional graph representations that capture the
essential properties and patterns of the social network data, which can be used for various
downstream tasks, such as classification, clustering, link prediction, and recommendation.
• Challenges and Limitations. Despite the structural heterogeneity in social networks
(nodes and relations have different types), with the technological advances in social media,
the node attributes have become more heterogeneous now, containing text, video, and images.
Also, the large-scale problem is a pending issue in social network analysis. The data in
the social network has increased exponentially in past decades, containing a high density
of topological links and a large amount of node attribute information, which brings new
challenges to the efficiency and effectiveness of traditional network representation learning
on the social network. Lastly, social networks are often dynamic, which means the network
information usually changes over time, and this temporal information plays a significant
role in many downstream tasks, such as recommendations. This brings new challenges to
representation learning on social networks in incorporating temporal information.
• Future Works. Recently, multi-modal big pre-training models that can fuse information
from different modalities have gained increasing attention [282, 289]. These models can
obtain valuable information from a large amount of unlabeled data and transfer it to various
downstream analysis tasks. Moreover, Transformer-based models have demonstrated better
effectiveness than RNNs in capturing temporal information. In the future, there is potential
for introducing multi-modal big pre-training models in social network analysis. Also, it is
important to make the models more efficient for network information extraction and use
lightweight techniques like knowledge distillation to further enhance the applicability of the
models. These advancements can lead to more effective social network analysis and enable
the development of more sophisticated applications in various domains.

11 Molecular Property Prediction By Zequn Liu


Molecular Property Prediction is an essential task in computational drug discovery and cheminfor-
matics. Traditional quantitative structure property/activity relationship (QSPR/QSAR) approaches
are based on either SMILES or fingerprints [262, 406, 439], largely overlooking the topological


features of the molecules. To address this problem, graph representation learning has been widely
applied to molecular property prediction. A molecule can be represented as a graph where nodes
stand for atoms and edges stand for atom-bonds (ABs). Graph-level molecular representations are
learned via message passing mechanism to incorporate the topological information. The represen-
tations are then utilized for the molecular property prediction tasks.
Specifically, a molecule is denoted as a topological graph G = (V, E), where V = {𝑣𝑖 |𝑖 =
1, . . . , |G|} is the set of nodes representing atoms. A feature vector x𝑖 is associated with each
node 𝑣𝑖 indicating its type such as Carbon, Nitrogen. E = {𝑒𝑖 𝑗 |𝑖, 𝑗 = 1, . . . , |G|} is the set of edges
connecting two nodes (atoms) 𝑣𝑖 and 𝑣 𝑗 representing atom bonds. Graph representation learning
methods are used to obtain the molecular representation h G . Then downstream classification or
regression layers 𝑓 (·) are applied to predict the probability of target property of each molecule
𝑦 = 𝑓 (h G ).
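The formulation above can be illustrated with a toy end-to-end sketch that builds a molecular graph from a SMILES string (assuming RDKit is available) and predicts a scalar property with a simple message-passing network and mean readout; the featurization and architecture are deliberately simplified and not taken from any specific method discussed below.

```python
import torch
from rdkit import Chem

def smiles_to_graph(smiles):
    """Convert a SMILES string into (node features, edge index); atoms are encoded as
    a one-hot over atomic number, a simplified featurization for illustration."""
    mol = Chem.MolFromSmiles(smiles)
    x = torch.zeros(mol.GetNumAtoms(), 100)
    for atom in mol.GetAtoms():
        x[atom.GetIdx(), atom.GetAtomicNum()] = 1.0
    src, dst = [], []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        src += [i, j]; dst += [j, i]                 # undirected bonds as two directed edges
    return x, torch.tensor([src, dst], dtype=torch.long)

class TinyMolGNN(torch.nn.Module):
    """One round of sum-aggregation message passing, mean readout to h_G,
    and a property head y = f(h_G), mirroring the formulation above."""
    def __init__(self, in_dim=100, hid=64, out_dim=1):
        super().__init__()
        self.lin1 = torch.nn.Linear(in_dim, hid)
        self.lin2 = torch.nn.Linear(hid, hid)
        self.head = torch.nn.Linear(hid, out_dim)

    def forward(self, x, edge_index):
        h = torch.relu(self.lin1(x))
        agg = torch.zeros_like(h).index_add_(0, edge_index[1], h[edge_index[0]])
        h = torch.relu(self.lin2(h + agg))           # combine self and neighbor messages
        h_G = h.mean(dim=0)                          # graph-level readout h_G
        return self.head(h_G)                        # y = f(h_G)

x, edge_index = smiles_to_graph("CCO")               # ethanol as a toy example
print(TinyMolGNN()(x, edge_index))
```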
In Section 11.1, we introduce 4 types of molecular properties that can be predicted by graph representation
learning and their corresponding datasets. Section 11.2 reviews the graph representation learning
backbones applied to molecular property prediction. Strategies for training the molecular property
prediction methods are listed in Section 11.3.

11.1 Molecular Property Categorization


Plenty of molecular properties can be predicted by graph-based methods. We follow [380] to
categorize them into 4 types: quantum chemistry, physicochemical properties, biophysics, and
biological effect.
Quantum chemistry is a branch of physical chemistry focused on the application of quantum
mechanics to chemical systems, including conformation, partial charges and energies. QM7, QM8,
QM9 [389], COD [299] and CSD [122] are datasets for quantum chemistry prediction.
Physicochemical properties are the intrinsic physical and chemical characteristics of a substance,
such as bioavailability, octanol solubility, aqueous solubility and hydrophobicity. ESOL, Lipophilicity
and Freesolv [389] are datasets for physicochemical properties prediction.
Biophysics properties are about the physical underpinnings of biomolecular phenomena, such
as affinity, efficacy and activity. PDBbind [364], MUV, and HIV [389] are biophysics property
prediction datasets.
Biological effect properties are generally defined as the response of an organism, a population,
or a community to changes in its environment, such as side effects, toxicity and ADMET. Tox21,
toxcast [389] and PTC [348] are biological effect prediction datasets.
Moleculenet [389] is a widely-used benchmark dataset for molecule property prediction. It
contains over 700,000 compounds tested on different properties. For each dataset, they provide
a metric and a splitting pattern. Among the datasets, QM7, OM7b, QM8, QM9, ESOL, FreeSolv,
Lipophilicity and PDBbind are regression tasks, using MAE or RMSE as evaluation metrics. Other
tasks such as tox21 and toxcast are classification tasks, using AUC as evaluation metric.

11.2 Molecular Graph Representation Learning Backbones


Since node attributes and edge attributes are crucial to molecular representation, most works use
GNN instead of traditional graph representation learning methods as backbones, since many GNN
methods consider edge information. Existing GNNs designed for the general domain can be applied
to molecular graph. Table 9 summarizes the GNNs used for molecular property prediction and the
types of properties they can be applied to predict.
Furthermore, many works customize their GNN structure by considering the chemical domain
knowledge.


Table 9. Summary of GNNs in molecular property prediction.

Type              | Spatial/Spectral | Method      | Application
------------------|------------------|-------------|------------------------------------------------------------
Recurrent GNN     | -                | R-GNN       | Biological effect [306]
Recurrent GNN     | -                | GGNN        | Quantum chemistry [255], Biological effect [7, 90, 382]
Recurrent GNN     | -                | IterRefLSTM | Biophysics [7], Biological effect [7]
Convolutional GNN | Spatial/Spectral | GCN         | Quantum chemistry [224, 382, 409], Physicochemical properties [62, 80, 301], Biophysics [25, 80, 409], Biological effect [208, 389]
Convolutional GNN | Spectral         | LanczosNet  | Quantum chemistry [224]
Convolutional GNN | Spectral         | ChebNet     | Physicochemical properties, Biophysics, Biological effect [214]
Convolutional GNN | Spatial          | GraphSAGE   | Physicochemical properties [145], Biophysics [55, 86, 223], Biological effect [145, 250]
Convolutional GNN | Spatial          | GAT         | Physicochemical properties [145], Biophysics [25, 55], Biological effect [145]
Convolutional GNN | Spatial          | DGCNN       | Biophysics [50], Biological effect [437]
Convolutional GNN | Spatial          | GIN         | Physicochemical properties [25, 145], Biophysics [144, 145], Biological effect [145]
Convolutional GNN | Spatial          | MPNN        | Physicochemical properties [247]
Transformer       | -                | MAT         | Physicochemical properties, Biophysics [474]

• First, the chemical bonds are taken into consideration carefully. For example, Ma et al. [247]
use an additional edge GNN to model the chemical bonds separately. Specifically, given an
edge (𝑣, 𝑤), they formulate an Edge-based GNN as:
m^{(𝑘)}_{𝑣𝑤} = AGG_{edge}({h^{(𝑘−1)}_{𝑣𝑤}, h^{(𝑘−1)}_{𝑢𝑣}, x𝑢 | 𝑢 ∈ N𝑣 \ 𝑤}),   h^{(𝑘)}_{𝑣𝑤} = MLP_{edge}({m^{(𝑘)}_{𝑣𝑤}, h^{(0)}_{𝑣𝑤}}),    (119)
where h^{(0)}_{𝑣𝑤} = 𝜎(W^{in}_e e𝑣𝑤) is the input state of the Edge-based GNN, W^{in}_e ∈ R^{𝑑hid×𝑑𝑒} is the
input weight matrix. PotentialNet [90] further uses different message passing operations for
different edge types.
• Second, motifs in molecular graphs play an important role in molecular property prediction.
GSN [25] leverages substructure encoding to construct a topologically-aware message passing
method. Each node 𝑣 updates its state h𝑡𝑣 by combining its previous state with the aggregated
messages:

h^{𝑡+1}_𝑣 = UP^{𝑡+1} (h^𝑡_𝑣, m^{𝑡+1}_𝑣),    (120)

m^{𝑡+1}_𝑣 = 𝑀^{𝑡+1} ({{(h^𝑡_𝑣, h^𝑡_𝑢, x^𝑉_𝑣, x^𝑉_𝑢, e_{𝑢,𝑣})}}_{𝑢∈N(𝑣)})   (GSN-v),  or
m^{𝑡+1}_𝑣 = 𝑀^{𝑡+1} ({{(h^𝑡_𝑣, h^𝑡_𝑢, x^𝐸_{𝑢,𝑣}, e_{𝑢,𝑣})}}_{𝑢∈N(𝑣)})   (GSN-e),    (121)



where x^𝑉_𝑣, x^𝑉_𝑢, x^𝐸_{𝑢,𝑣}, e_{𝑢,𝑣} contain the substructure information associated with nodes and
edges, and {{·}} denotes a multiset. Yu et al. [426] construct a heterogeneous graph using motifs
and molecules. Motifs and molecules are both treated as nodes and the edges model the
relationship between motifs and graphs, for example, if a graph contains a motif, there will
be an edge between them. MGSSL [447] leverages a retrosynthesis-based algorithm BRICS
and additional rules to find the motifs and combines motif layers with atom layers. It is a
hierarchical framework jointly modeling atom-level information and motif-level information.
• Third, different feature modalities have been used to improve the molecular graph em-
bedding. Lin et al. [227] combine SMILES modality and graph modality with contrastive
learning. Zhu et al. [468] encode 2D molecular graph and 3D molecular conformation with a
unified Transformer. It uses a unified model to learn 3D conformation generation given 2D
graph and 2D graph generation given 3D conformation.
• Finally, knowledge graph and literature can provide additional knowledge for molecular
property prediction. Fang et al. [88] introduce a chemical element knowledge graph to
summarize microscopic associations between elements and augment the molecular graph
based on the knowledge graph, and a knowledge-aware message passing network is used
to encode the augmented graph. MuMo [333] introduces biomedical literature to guide the
molecular property prediction. It pretrains a GNN and a language model on paired data of
molecules and literature mentions via contrastive learning:

ℓ_𝑖^{(z^𝐺_𝑖, z^𝑇_𝑖)} = − log [ exp(sim(z^𝐺_𝑖, z^𝑇_𝑖)/𝜏) / ∑_{𝑗=1}^{𝑁} exp(sim(z^𝐺_𝑖, z^𝑇_𝑗)/𝜏) ],    (122)

where z^𝐺_𝑖, z^𝑇_𝑖 are the representations of the molecule and its corresponding literature.
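The loss in Eq. (122) is an instance of the InfoNCE objective; a generic batch-wise sketch is given below, which is illustrative rather than the exact implementation of MuMo or the other contrastive methods mentioned above.

```python
import torch
import torch.nn.functional as F

def info_nce(z_graph, z_text, tau=0.1):
    """Batch-wise InfoNCE in the spirit of Eq. (122): the i-th molecule representation
    is pulled toward its paired text representation and pushed away from the other
    pairs in the batch."""
    z_graph = F.normalize(z_graph, dim=-1)
    z_text = F.normalize(z_text, dim=-1)
    logits = z_graph @ z_text.t() / tau               # cosine similarities / temperature
    labels = torch.arange(z_graph.size(0))            # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 32), torch.randn(8, 32))
print(loss)
```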

11.3 Training strategies


Despite the encouraging performance achieved by GNNs, traditional supervised training scheme
of GNNs faces a severe limitation: The scarcity of available molecules with desired properties.
Although there are a large number of molecular graphs in public databases such as PubChem,
labeled molecules are hard to acquire due to the high cost of wet-lab experiments and quantum
chemistry calculations. Directly training GNNs on such limited molecules in a supervised way
is prone to over-fitting and lack of generalization. To address this issue, few-shot learning and
self-supervised learning are widely used in molecular property prediction.
Few-shot learning. Few-shot learning aims at generalizing to a task with a small labeled data
set. The prediction of each property is treated as a single task. Metric-based and optimization-based
few-shot learning have been adopted for molecular property prediction. Metric-based few-shot
learning is similar to nearest neighbors and kernel density estimation, which learns a metric or
distance function over objects. IterRefLSTM [7] leverages matching network [354] as the few-
shot learning framework, calculating the similarity between support samples and query samples.
Optimization-based few-shot learning optimizes a meta-learner for parameter initialization which
can be fast adapted to new tasks. Meta-MGNN [126] adopts MAML [94] to train a parameter
initialization to adapt to different tasks and use self-attentive task weights for each task. PAR [370]
also uses MAML framework and learns an adaptive relation graph among molecules for each task.
Self-supervised learning. Self-supervised learning can pre-train a GNN model with plenty of
unlabeled molecular graphs and transfer it to specific molecular property prediction tasks. Self-
supervised learning contains generative methods and predictive methods. Predictive methods design
prediction tasks to capture the intrinsic data features. Pre-GNN [145] exploits both node-level and
graph-level prediction tasks including context prediction, attribute masking, graph-level property


prediction and structural similarity prediction. MGSSL [447] provides a motif-based generative
pre-training framework making topology prediction and motif generation iteratively. Contrastive
methods learn graph representations by pulling views from the same graph close and pushing
views from different graphs apart. Different views of the same graph are constructed by graph
augmentation or leveraging the 1D SMILES and 3D structure. MolCLR [373] augments molecular
graphs by atom masking, bond deletion and subgraph removal and maximizes the agreement
between the original molecular graph and augmented graphs. Fang et al. [88] uses a chemical
knowledge graph to guide the graph augmentation. SMICLR [279] uses contrastive learning across
SMILES and 2D molecular graphs. GeomGCL [215] leverages graph contrastive learning to capture
the geometry of the molecule across 2D and 3D views. Self-supervised learning can also be combined
with few-shot learning to fully leverage the hierarchical information in the training set [170].

11.4 Summary
This section introduces graph representation learning in molecular property prediction and we
provide the summary as follows:
• Techniques. For molecular property prediction, a molecule is represented as a graph whose
nodes are atoms and edges are atom-bonds (ABs). GNNs such as GCN, GAT, and GraphSAGE
are adopted to learn the graph-level representation. The representations are then fed into a
classification or regression head for the molecular property prediction tasks. Many works
guide the model structure design with medical domain knowledge including chemical bond
features, motif features, different modalities of molecular representation, chemical knowledge
graph and literature. Due to the scarcity of available molecules with desired properties,
few-shot learning and contrastive learning are used to train molecular property prediction
model, so that the model can leverage the information in large unlabeled dataset and can be
adapted to new tasks with a few examples.
• Challenges and Limitations. Despite the great success of graph representation learning
in molecular property prediction, the methods still have limitations: 1) Few-shot molecular
property prediction is not fully explored. 2) Most methods depend on training with labeled
data, but neglect the chemical domain knowledge.
• Future Works. In the future, we expect that: 1) More few-shot learning and zero-shot
learning methods are studied for molecular property prediction to solve the data scarcity
problem. 2) Heterogeneous data can be fused for molecular property prediction. There are a
large amount of heterogeneous data about molecules such as knowledge graphs, molecule
descriptions and property descriptions. They can be considered to assist molecular prop-
erty prediction. 3) Chemical domain knowledge can be leveraged for the prediction model.
For example, when we perform affinity prediction, we can consider molecular dynamics
knowledge.

12 Molecular Generation By Fang Sun


Molecular generation is pivotal to drug discovery, where it serves a fundamental role in downstream
tasks like molecular docking [260] and virtual screening [356]. The goal of molecular generation
is to produce chemical structures that satisfy a specific molecular profile, e.g., novelty, binding
affinity, and SA scores. Traditional methods have relied on 1D string formats like SMILES [116] and
SELFIES [192]. With the recent advances in graph representation learning, numerous graph-based
methods have also emerged, where molecular graph G can naturally embody both 2D topology and
3D geometry. While recent literature reviews [78, 261] have covered the general topics of molecular
design, this chapter is dedicated to the applications of graph representation learning in the molecular


generation task. Molecular generation is intrinsically a de novo task, where molecular structures
are generated from scratch to navigate and sample from the vast chemical space. Therefore, this
chapter does not discuss tasks that restrict chemical structures a priori, such as docking [103, 332]
and conformation generation [318, 467].

12.1 Taxonomy for molecular featurization methods


This section categorizes the different methods to featurize molecules. The taxonomy presented
here is unique to the task of molecular generation, owing to the various modalities of molecular
entities, complex interactions with other bio-molecular systems and formal knowledge from the
laws of chemistry and physics.
2D topology vs. 3D geometry. Molecular data are multi-modal by nature. For one thing, a
molecule can be unambiguously represented by its 2D topological graph G2D , where atoms are
nodes and bonds are edges. G2D can be encoded by canonical MPNN models like GCN [182],
GAT [351], and R-GCN [307], in ways similar to tasks like social networks and knowledge graphs. A
typical example of this line of work is GCPN [418], a graph convolutional policy network generating
molecules with desired properties such as synthetic accessibility and drug-likeness.
For another, the 3D conformation of a molecule can be accurately depicted by its 3D geometric
graph G3D , which incorporates 3D atom coordinates. In 3D-GNNs like SchNet [311] and Orb-
Net [284], G3D is organized into a 𝑘-NN graph or a radius graph according to the Euclidean distance
between atoms. It is justifiable to approximate G3D as a 3D extension to G2D , since covalent atoms
are closest to each other in most cases. However, G3D can also find a more long-standing origin
in the realm of computational chemistry [99], where both covalent and non-covalent atomistic
interactions are considered to optimize the potential surface and simulate molecular dynamics.
Therefore, G3D more realistically represents the molecular geometry, which makes a good fit for
protein pocket binding and 3D-QSAR optimization [352].
Molecules can rotate and translate, affecting their position in the 3D space. Therefore, it is ideal to
encode these molecules with GNNs equivariant/invariant to roto-translations, which can be ∼10^3
times more efficient than data augmentation [113]. Equivariant GNNs can be based on irreducible
representation [8, 18, 28, 102, 347], regular representation [95, 154], or scalarization [152, 168, 184–
186, 233, 304, 310, 311, 346], which are explained in more detail in Han et al. [130].
Unbounded vs. binding-based. Earlier works have aimed to generate unbounded molecules
in either 2D or 3D space, striving to learn good molecular representations through this task.
In the 2D scenario, GraphNVP [251] first introduces a flow-based model to learn an invertible
transformation between the 2D chemical space and the latent space. GraphAF [319] further adopts
an autoregressive generation scheme to check the valence of the generated atoms and bonds.
In the 3D scenario, G-SchNet [111] first proposes to utilize G3D (instead of 3D density grids) as
the generation backbone. It encodes G3D via SchNet, and uses an auxiliary token to generate
atoms on the discretized 3D space autoregressively. G-SphereNet [245] uses symmetry-invariant
representations in a spherical coordinate system (SCS) to generate atoms in the continuous 3D
space and preserve equivariance.
Unbounded models adopt certain techniques to optimize specific properties of the generated
molecules. GCPN and GraphAF use scores like logP, QED, and chemical validity to tune the model
via reinforcement learning. EDM [142] can generate 3D molecules with property 𝑐 by re-training
the diffusion model with 𝑐’s feature vector concatenated to the E(n) equivariant dynamics function
𝝐^𝑡 = 𝜙 (𝒛𝑡 , [𝑡, 𝑐]). cG-SchNet [112] adopts a conditioning network architecture to jointly target
multiple electronic properties during conditional generation without the need to re-train the model.
RetMol [374] uses a retrieval-based model for controllable generation.


On the other hand, binding-based methods generate drug-like molecules (aka. ligands) accord-
ing to the binding site (aka. binding pocket) of a protein receptor. Drawing inspirations from the
lock-and-key model for enzyme action [96], works like LiGAN [290] and DESERT [239] use 3D
density grids to fit the density surface between the ligand and the receptor, encoded by 3D-CNNs.
Meanwhile, a growing amount of literature has adopted G3D for representing ligand and receptor
molecules, because G3D more accurately depicts molecular structures and atomistic interactions
both within and between the ligand and the receptor. Representative works include 3D-SBDD [241],
GraphBP [230], Pocket2Mol [275], and DiffSBDD [309]. GraphBP shares a similar workflow with
G-SphereNet, except that the receptor atoms are also incorporated into G3D to depict the 3D
geometry at the binding pocket.
Atom-based vs. fragment-based. Molecules are inherently hierarchical structures. At the
atomistic level, molecules are represented by encoding atoms and bonds. At a coarser level,
molecules can also be represented as molecular fragments like functional groups or chemical
sub-structures. Both the composition and the geometry are fixed within a given fragment, e.g.,
the planar peptide-bond (–CO–NH–) structure. Fragment-based generation effectively reduces the
degree of freedom (DOF) of chemical structures, and injects well-established knowledge about
molecular patterns and reactivity. JT-VAE [165] decomposes 2D molecular graph G2D into a junction-
tree structure T , which is further encoded via tree message-passing. DeepScaffold [216] expands
the provided molecular scaffold into 3D molecules. L-Net [217] adopts a graph U-Net architecture
and devises a custom three-level node clustering scheme for pooling and unpooling operations in
molecular graphs. A number of works have also emerged lately for fragment-based generation in
the binding-based setting, including FLAG [448] and FragDiff [274]. FLAG uses a regression-based
approach to sequentially decide the type and torsion angle of the next fragment to be placed at the
binding site, and finally optimizes the molecule conformation via a pseudo-force field. FragDiff
also adopts a sequential generation process but uses a diffusion model to determine the type and
pose of each fragment in one go.

12.2 Generative methods for molecular graphs


For a molecular graph generation process, the model first learns a latent distribution 𝑃 (𝑍 |G)
characterizing the input molecular graphs. A new molecular graph G^ is then generated by sam-
pling and decoding from this learned distribution. Various models have been adopted to generate
molecular graphs, including generative adversarial network (GAN), variational auto-encoder (VAE),
normalizing flow (NF), diffusion model (DM), and autoregressive model (AR).
Generative adversarial network (GAN). GAN [117] is trained to discriminate real data 𝒙
from generated generated data 𝒛, with the training object formalized as
min max L (𝐷, 𝐺) = E𝒙∼𝑝 data [log 𝐷 (𝒙)] + E𝒛∼𝑝 (𝒛) [log(1 − 𝐷 (𝐺 (𝒛)))], (123)
𝐺 𝐷

where 𝐺 (·) is the generator function and 𝐷 (·) is the discriminator function. For example, Mol-
GAN [65] encodes G2D with R-GCN, trains 𝐷 and 𝐺 with improved W-GAN [9], and uses rein-
forcement learning to generate attributed molecules, where the score function is assigned from
RDKit [198] and chemical validity.
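A common way to train the objective in Eq. (123) in practice is the non-saturating variant sketched below; D and G are placeholder discriminator/generator modules (e.g., an R-GCN graph encoder and a graph decoder in a MolGAN-style setup), not the models from the cited papers.

```python
import torch

def gan_losses(D, G, real_graphs, z):
    """Non-saturating discriminator/generator losses for the minimax game in Eq. (123).
    D maps a graph to a probability of being real; G maps latent noise z to a graph."""
    fake_graphs = G(z)
    d_loss = -(torch.log(D(real_graphs) + 1e-8).mean()
               + torch.log(1.0 - D(fake_graphs.detach()) + 1e-8).mean())
    g_loss = -torch.log(D(fake_graphs) + 1e-8).mean()   # generator maximizes log D(G(z))
    return d_loss, g_loss
```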
Variational auto-encoder (VAE). In VAE [180], the decoder parameterizes the conditional
likelihood distribution 𝑝𝜃 (𝒙 |𝒛), and the encoder parameterizes an approximate posterior distribution
𝑞𝜙 (𝒛|𝒙) ≈ 𝑝𝜃 (𝒛|𝒙). The model is optimized by the evidence lower bound (ELBO), consisting of the
reconstruction loss term and the distance loss term:
 
max_{𝜃,𝜙} L_{𝜃,𝜙}(𝒙) := E_{𝒛∼𝑞𝜙(·|𝒙)} [ ln ( 𝑝𝜃(𝒙, 𝒛) / 𝑞𝜙(𝒛|𝒙) ) ] = ln 𝑝𝜃(𝒙) − 𝐷_{KL}( 𝑞𝜙(·|𝒙) ∥ 𝑝𝜃(·|𝒙) ).    (124)


Maximizing ELBO is equivalent to simultaneously maximizing the log-likelihood of the observed


data, and minimizing the divergence of the approximate posterior 𝑞𝜙 (·|𝑥) from the exact poste-
rior 𝑝𝜃 (·|𝑥). Representative works along this thread include JT-VAE [165], GraphVAE [325], and
CGVAE [231] for the 2D generation task, and 3DMolNet [267] for the 3D generation task.
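For a graph decoder that outputs logits over adjacency entries, the negative ELBO of Eq. (124) reduces to a reconstruction term plus a closed-form KL term; the sketch below assumes a Gaussian posterior and a standard-normal prior and is illustrative rather than any specific model's code.

```python
import torch
import torch.nn.functional as F

def graph_vae_loss(adj_logits, adj_target, mu, logvar):
    """Negative ELBO from Eq. (124) for a GraphVAE-style decoder with adjacency logits
    and a Gaussian posterior q_phi(z|x) = N(mu, diag(exp(logvar)))."""
    recon = F.binary_cross_entropy_with_logits(adj_logits, adj_target, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # KL to N(0, I)
    return recon + kl
```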
Autoregressive model (AR). Autoregressive model is an umbrella definition for any model
that sequentially generates the components (atoms or fragments) of a molecule. ARs better capture
the interdependency within the molecular structure and allows for explicit valency check. For each
step in AR, the new component can be produced using different techniques:
• Regression/classification, such is the case with 3D-SBDD [241], Pocket2Mol [275], etc.
• Reinforcement learning, such is the case with L-Net [217], DeepLigBuilder [218], etc.
• Probabilistic models like normalizing flow and diffusion.
Normalizing flow (NF). Based on the change-of-variable theorem, NF [294] constructs an in-
vertible mapping 𝑓 between a complex data distribution 𝒙 ∼ 𝑋 and a normalized latent distribution
𝒛 ∼ 𝑍. Unlike VAE, which has juxtaposed parameters for the encoder and decoder, the flow model uses
the same set of parameters for encoding and decoding: reverse flow 𝑓 −1 for generation, and forward
flow 𝑓 for training:

max_𝑓 log 𝑝(𝒙) = log 𝑝_𝐾(𝒛_𝐾)    (125)
             = log 𝑝_{𝐾−1}(𝒛_{𝐾−1}) − log | det( 𝑑𝑓_𝐾(𝒛_{𝐾−1}) / 𝑑𝒛_{𝐾−1} ) |    (126)
             = ...    (127)
             = log 𝑝_0(𝒛_0) − ∑_{𝑖=1}^{𝐾} log | det( 𝑑𝑓_𝑖(𝒛_{𝑖−1}) / 𝑑𝒛_{𝑖−1} ) |,    (128)

where 𝑓 = 𝑓𝐾 ◦ 𝑓𝐾−1 ◦ ... ◦ 𝑓1 is a composite of 𝐾 blocks of transformation. While GraphNVP [251]


generates the molecular graph with NF in one go, following works tend to use the autoregressive
flow model, including GraphAF [319], GraphDF [246], and GraphBP [230].
Diffusion model (DM). Diffusion models [140, 327, 330] define a Markov chain of diffusion
steps to slowly add random noise to data 𝒙 0 ∼ 𝑞(𝒙):
𝑞(𝒙_𝑡 | 𝒙_{𝑡−1}) = N(𝒙_𝑡; √(1 − 𝛽_𝑡) 𝒙_{𝑡−1}, 𝛽_𝑡 𝑰),    (129)
𝑞(𝒙_{1:𝑇} | 𝒙_0) = ∏_{𝑡=1}^{𝑇} 𝑞(𝒙_𝑡 | 𝒙_{𝑡−1}).    (130)

They then learn to reverse the diffusion process to construct desired data samples from the noise:
𝑝𝜃(𝒙_{0:𝑇}) = 𝑝(𝒙_𝑇) ∏_{𝑡=1}^{𝑇} 𝑝𝜃(𝒙_{𝑡−1} | 𝒙_𝑡),    (131)
𝑝𝜃(𝒙_{𝑡−1} | 𝒙_𝑡) = N(𝒙_{𝑡−1}; 𝝁𝜃(𝒙_𝑡, 𝑡), 𝚺𝜃(𝒙_𝑡, 𝑡)),    (132)

while the models are trained using a variational lower bound. Diffusion models have been applied
to generate unbounded 3D molecules in EDM [142], and binding-specific ligands in DiffSBDD [309]
and DiffBP [226]. Diffusion can also be applied to generate molecular fragments in autoregressive
models, as is the case with FragDiff [274].
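A typical DDPM-style training step corresponding to Eqs. (129)-(132) samples a timestep, corrupts the data with the closed-form forward process, and regresses the injected noise; the sketch below uses a placeholder noise predictor eps_model (e.g., an equivariant GNN over atom coordinates in EDM-like models) and is not the implementation of any cited paper.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(eps_model, x0, T=1000):
    """One denoising-diffusion training step: sample t, apply the closed-form forward
    noising of Eqs. (129)-(130), and regress the injected noise."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, T, (1,))
    noise = torch.randn_like(x0)
    x_t = alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * noise
    pred = eps_model(x_t, t)                       # predicted noise, used to parameterize Eq. (132)
    return F.mse_loss(pred, noise)
```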


Table 10. Summary of molecular generation models.

Model              | 2D/3D   | Binding-based | Fragment-based | GNN Backbone    | Generative Model
-------------------|---------|---------------|----------------|-----------------|-----------------
GCPN [418]         | 2D      |               |                | GCN [182]       | GAN
MolGAN [65]        | 2D      |               |                | R-GCN [307]     | GAN
DEFactor [11]      | 2D      |               |                | GCN             | GAN
GraphVAE [325]     | 2D      |               |                | ECC [324]       | VAE
MDVAE [79]         | 2D      |               |                | GGNN [219]      | VAE
JT-VAE [165]       | 2D      |               | ✓              | MPNN [115]      | VAE
CGVAE [231]        | 2D      |               |                | GGNN            | VAE
DeepScaffold [216] | 2D      |               | ✓              | GCN             | VAE
GraphNVP [251]     | 2D      |               |                | R-GCN           | NF
MoFlow [429]       | 2D      |               |                | R-GCN           | NF
GraphAF [319]      | 2D      |               |                | R-GCN           | NF + AR
GraphDF [246]      | 2D      |               |                | R-GCN           | NF + AR
L-Net [217]        | 3D      |               | ✓              | g-U-Net [104]   | AR
G-SchNet [111]     | 3D      |               |                | SchNet [311]    | AR
GEN3D [297]        | 3D      |               |                | EGNN [304]      | AR
G-SphereNet [245]  | 3D      |               |                | SphereNet [233] | NF + AR
EDM [142]          | 3D      |               |                | EGNN            | DM
3D-SBDD [241]      | 3D      | ✓             |                | SchNet          | AR
Pocket2Mol [275]   | 3D      | ✓             |                | GVP [167]       | AR
FLAG [448]         | 3D      | ✓             | ✓              | SchNet          | AR
GraphBP [230]      | 3D      | ✓             |                | SchNet          | NF + AR
DiffBP [226]       | 3D      | ✓             |                | EGNN            | DM
DiffSBDD [309]     | 3D      | ✓             |                | EGNN            | DM
FragDiff [274]     | 2D + 3D | ✓             | ✓              | MPNN            | DM + AR

12.3 Summary and prospects


We wrap up this chapter with Table 10, which profiles existing molecular generation models
according to their taxonomy for molecular featurization, the GNN backbone, and the generative
method. This chapter covers the critical topics of molecular generation, which also elicit valuable
insights into the promising directions for future research. We summarize these important aspects
as follows.
Techniques. Graph neural networks can be flexibly leveraged to encode molecular features
on different representation levels and across different problem settings. Canonical GNNs like
GCN [182], GAT [351], and R-GCN [307] have been widely adopted to model 2D molecular graphs,
while 3D equivariant GNNs have also been effective in modeling 3D molecular graphs. In particular,
this 3D approach can be readily extended to binding-based scenarios, where the 3D geometry of the
binding protein receptor is considered alongside the ligand geometry per se. Fragment-based models
like JT-VAE [165] and L-Net [217] can also effectively capture the hierarchical molecular structure.
Various generative methods have also been effectively incorporated into the molecular setting,
including generative adversarial network (GAN), variational auto-encoder (VAE), autoregressive
model (AR), normalizing flow (NF), and diffusion model (DM). These models have been able to
generate valid 2D molecular topologies and realistic 3D molecular geometries, greatly accelerating
the search for drug candidates.
Challenges and Limitations. While there has been an abundant supply of unlabelled molecular
structural and geometric data [98, 155, 331], the amount of labeled molecular data for certain
critical biochemical properties like toxicity [110] and solubility [67] remains very limited. On the


other hand, existing models have heavily relied on expert-crafted metrics to evaluate the quality of
the generated molecules, such as QED and Vina [82], rather than actual wet lab experiments.
Future Works. Besides the structural and geometric attributes described in this chapter, an
even more extensive array of data can be applied to aid molecular generation, including chemical
reactions and medical ontology. These data can be organized into a heterogeneous knowledge
graph to aid the extraction of high-quality molecular representations. Furthermore, high through-
put experimentation (HTE) should be adopted to realistically evaluate the synthesizability and
druggability of the generated molecules in the wet lab. Concrete case studies, such as the design of
potential inhibitors to SARS-CoV-2 [218], would be even more encouraging, bringing new insights
into leveraging these molecular generative models to facilitate the design and fabrication of potent
and applicable drug molecules in the pharmaceutical industry.

13 Recommender Systems By Yifang Qin


The use of graph representation learning in recommender systems has been drawing increasing
attention as one of the key strategies for addressing the issue of information overload. With its
strong ability to capture high-order connectivity between graph nodes, deep graph representation
learning has been shown to be beneficial in enhancing recommendation performance across a
variety of recommendation scenarios.
Typical recommender systems take the observed interactions between users and items and
their fixed features as input, and are intended for making proper predictions on which items a
specific user is probably interested in. To formulate, given a user set U, an item set I and the
interaction matrix between users and items 𝑋 ∈ {0, 1} | U |× | I | , where 𝑋𝑢,𝑖 indicates there is an
observed interaction between user 𝑢 and item 𝑖, the target of GNNs on recommender systems is to
learn representations ℎ𝑢 , ℎ𝑖 ∈ R𝑑 for given 𝑢 and 𝑖. The preference score can further be calculated
by a similarity function:
𝑥^𝑢,𝑖 = 𝑓 (ℎ𝑢 , ℎ𝑖 ), (133)

where 𝑓 (·, ·) is the similarity function, e.g., inner product, cosine similarity, or a multi-layer perceptron,
which takes the representations of 𝑢 and 𝑖 and calculates the preference score 𝑥^𝑢,𝑖 .
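As a small illustration of Eq. (133), the sketch below scores one user-item pair with two common choices of the similarity function; the embeddings are made-up toy vectors.

```python
import numpy as np

# Scoring a user-item pair from learned embeddings h_u, h_i (Eq. (133)).
h_u = np.array([0.2, -0.5, 0.8])   # hypothetical user embedding
h_i = np.array([0.1, -0.4, 0.9])   # hypothetical item embedding

def inner_product(u, i):
    return float(u @ i)

def cosine(u, i):
    return float(u @ i / (np.linalg.norm(u) * np.linalg.norm(i)))

print(inner_product(h_u, h_i), cosine(h_u, h_i))
```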
When it comes to adapting graph representation learning in recommender systems, a key step is
to construct graph-structured data from the interaction set 𝑋 . Generally, a graph is represented
as G = {V, E} where V, E denote the sets of vertices and edges respectively. According to the
construction of G, we categorize the existing works into three parts, which are
introduced in the following subsections. A summary is provided in Table 11.

13.1 User-Item Bipartite Graph


13.1.1 Graph Construction An undirected bipartite graph is constructed, where the vertex set V = U ∪ I and the
undirected edge set E = {(𝑢, 𝑖)|𝑢 ∈ U ∧ 𝑖 ∈ I}. In this case the graph adjacency can be directly
obtained from the interaction matrix, thus the optimization target on the user-item bipartite graph
is equivalent to collaborative filtering tasks such as MF [191] and SVD++ [190].
There have been plenty of previous works that applied GNNs on the constructed user-item bipar-
tite graphs. GC-MC [20] firstly applies graph convolution networks to user-item recommendation
and optimizes a graph autoencoder (GAE) to reconstruct interactions between users and items.
NGCF [367] introduces the concept of Collaborative Filtering (CF) into graph-based recommenda-
tions by modeling the affinity between neighboring nodes on the interaction graph. MMGCN [377]
extends graph-based recommendation to multi-modal scenarios by constructing a different subgraph
for each modality. LightGCN [134] improves NGCF by removing the non-linear activation functions


Table 11. Summary of graph models for recommender systems.

Model | Recommendation Task | Graph Structure | Graph Encoder | Representation
GC-MC [20] | Matrix Completion | User-Item Graph | GCN | Last-Layer
NGCF [367] | Collaborative Filtering | User-Item Graph | GCN+Affinity | Concatenation
MMGCN [377] | Micro-Video | Multi-Modal Graph | GCN | Last-Layer
LightGCN [134] | Collaborative Filtering | User-Item Graph | LGC | Mean-Pooling
DGCF [369] | Collaborative Filtering | User-Item Graph | Dynamic Routing | Mean-Pooling
SR-GNN [385] | Session-based | Transition Graph | GGNN | Soft-Attention
GC-SAN [385, 399] | Session-based | Session Graph | GGNN | Self-Attention
FGNN [287] | Session-based | Session Graph | GAT | Last-Layer
GAG [288] | Session-based | Session Graph | GCN | Self-Attention
GCE-GNN [375] | Session-based | Transition+Global | GAT | Sum-Pooling
HyperRec [361] | Sequence-based | Sequential HyperGraph | HGCN | Self-Attention
DHCF [158] | Collaborative Filtering | Dual HyperGraph | JHConv | Last-Layer
MBHT [411] | Sequence-based | Learnable HyperGraph | Transformer | Cross-View Attention
HCCF [392] | Collaborative Filtering | Learnable HyperGraph | HGCN | Last-Layer

and simplifying the message function. With the development of disentangled representation learning,
works like DGCF [369] introduce disentangled graph representation learning to
represent users and items from multiple disentangled perspectives.
13.1.2 Graph Propagation Scheme A common practice is to follow the traditional message-passing
networks (MPNNs) and design the graph propagation method accordingly. GC-MC adopts vanilla
GCNs to encode the user-item bipartite graph. NGCF enhances GCNs by considering the affinity
between users and items. The message function of NGCF from node 𝑗 to 𝑖 is formulated as:

m_{i \leftarrow j} = \frac{1}{\sqrt{|\mathcal{N}_i||\mathcal{N}_j|}} \left( W_1 e_j + W_2 (e_i \odot e_j) \right), \qquad m_{i \leftarrow i} = W_1 e_i,   (134)
where 𝑊1,𝑊2 are trainable parameters, 𝑒𝑖 represents 𝑖’s representation from previous layer. The
matrix form can be further provided by:

𝐸 (𝑙) = LeakyReLU((L + 𝐼 )𝐸 (𝑙−1)𝑊1(𝑙) + L𝐸 (𝑙−1) ⊙ 𝐸 (𝑙−1)𝑊2(𝑙) ), (135)


where L represents the Laplacian matrix of the user-item graph. The element-wise product in Eq.
135 represents the affinity between connected nodes, containing the collaborative signals from
interactions.
However, the notable heaviness and burdensome computation of NGCF’s architecture hinder
the model from making fast recommendations on large graphs. LightGCN solves this issue by
proposing Light Graph Convolution (LGC), which simplifies the convolution operation with:
e_i^{(l+1)} = \sum_{j \in \mathcal{N}_i} \frac{1}{\sqrt{|\mathcal{N}_i||\mathcal{N}_j|}} e_j^{(l)}.   (136)
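The simplification can be seen in a few lines of numpy: the sketch below performs LightGCN-style propagation (Eq. (136)) in matrix form on a toy bipartite graph and mean-pools the layer outputs, matching the representation readout listed for LightGCN in Table 11; all sizes and values are illustrative assumptions.

```python
import numpy as np

# Light Graph Convolution on a toy user-item bipartite graph, followed by the
# mean-pooling readout over layers and an inner-product score (Eq. (133)).
R = np.array([[1, 0, 1],
              [0, 1, 1]], dtype=float)        # 2 users x 3 items (toy)
n_users, n_items = R.shape
A = np.zeros((n_users + n_items, n_users + n_items))
A[:n_users, n_users:] = R                      # bipartite adjacency
A[n_users:, :n_users] = R.T

deg = A.sum(axis=1)
d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
A_norm = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # D^-1/2 A D^-1/2

rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(n_users + n_items, 8))   # layer-0 embeddings

layers = [E]
for _ in range(3):                              # 3 LGC layers, no nonlinearity
    layers.append(A_norm @ layers[-1])
E_final = np.mean(layers, axis=0)               # mean-pooling across layers

h_u, h_i = E_final[0], E_final[n_users + 2]     # user 0, item 2
print(h_u @ h_i)                                # preference score
```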

When an interaction takes place, e.g., a user clicks a particular item, there could be multiple
intentions behind the observed interaction. Thus it is necessary to consider the various disentangled
intentions among users and items. DGCF proposes a cross-intent embedding propagation scheme
on graphs, inspired by the dynamic routing algorithm of capsule networks [302]. To formulate, the
propagation process maintains a set of routing logits 𝑆˜𝑘 (𝑢, 𝑖) for each user 𝑢. The weighted sum


aggregator to get the representation of 𝑢 can be defined as:


u_k^t = \sum_{i \in \mathcal{N}_u} \mathcal{L}_k^t(u, i) \cdot i_k^0,   (137)

for the 𝑡-th iteration, where L𝑘𝑡 (𝑢, 𝑖) denotes the Laplacian matrix of 𝑆𝑘𝑡 (𝑢, 𝑖), formulated as:
\mathcal{L}_k^t(u, i) = \frac{S_k^t(u, i)}{\sqrt{\left[\sum_{i' \in \mathcal{N}_u} S_k^t(u, i')\right] \cdot \left[\sum_{u' \in \mathcal{N}_i} S_k^t(u', i)\right]}}.   (138)

13.1.3 Node Representations After the graph propagation module outputs node-level representa-
tions, there are multiple methods to leverage node representations for recommendation tasks. A
plain solution is to apply a readout function on layer outputs like the concatenation operation used
by NGCF:
𝑒 ∗ = 𝐶𝑜𝑛𝑐𝑎𝑡 (𝑒 ( 0) , ..., 𝑒 (𝐿) ) = 𝑒 ( 0) ∥...∥𝑒 (𝐿) . (139)
However, the readout function among layers would neglect the relationship between the target
item and the current user. A general solution is to use the attention mechanism [350] to reweight and
aggregate the node representations. SR-GNN adopts a soft-attention mechanism to model the item-
item relationship:
\alpha_i = \mathbf{q}^T \sigma(W_1 e_t + W_2 e_i + c), \qquad s_g = \sum_{i=1}^{n-1} \alpha_i e_i,   (140)

where q, 𝑊1, 𝑊2 are trainable matrices.


Some methods focus on exploiting information from multiple graph structures. A common
practice is contrastive learning, which maximizes the mutual information between hidden representations
from several views. HCCF leverages the InfoNCE loss as the estimator of mutual information,
given a pair of representations 𝑧𝑖 , Γ𝑖 for node 𝑖, controlled by a temperature parameter 𝜏:
\mathcal{L}_{InfoNCE}(i) = -\log \frac{\exp(cosine(z_i, \Gamma_i)/\tau)}{\sum_{i' \neq i} \exp(cosine(z_i, \Gamma_{i'})/\tau)}.   (141)

Besides InfoNCE, there exist several other ways to combine node representations from different
views. For instance, MBHT applies an attention mechanism to fuse multiple semantics, DisenPOI
adopts the Bayesian personalized ranking (BPR) loss [293] as a soft estimator for contrastive learning,
and KBGNN applies pair-wise similarities to ensure the consistency between two views.
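A toy sketch of the InfoNCE estimator in Eq. (141) is given below, assuming two correlated views per node (for instance a hypergraph view and an interaction-graph view); the embeddings and the temperature are illustrative stand-ins.

```python
import numpy as np

# InfoNCE over two views: each node i has z_i and gamma_i; matching views are
# pulled together and mismatched pairs act as negatives, under temperature tau.
def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Z = l2_normalize(rng.normal(size=(5, 16)))                 # view 1 for 5 nodes
Gamma = l2_normalize(Z + 0.1 * rng.normal(size=(5, 16)))   # correlated view 2
tau = 0.2

def info_nce(i):
    logits = (Z[i] @ Gamma.T) / tau          # cosine similarities (unit norms)
    negatives = np.delete(logits, i)         # all i' != i
    return -(logits[i] - np.log(np.exp(negatives).sum()))

print(np.mean([info_nce(i) for i in range(5)]))
```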

13.2 Transition Graph


13.2.1 Transition Graph Construction Since sequence-based recommendation (SR) is one of the
fundamental problems in recommender systems, some studies focus on modeling the sequential
information with GNNs. A commonly applied way is to construct a transition graph for each
given sequence according to the clicking order of a user. To formulate, given a user 𝑢’s clicking
sequence 𝑠𝑢 = [𝑖𝑢,1, 𝑖𝑢,2, ..., 𝑖𝑢,𝑛 ] containing 𝑛 items, noting that there could be duplicated items, the
sequential graph is constructed via G𝑠 = {SET(𝑠𝑢 ), E}, where (𝑖 𝑗 , 𝑖𝑘 ) ∈ E indicates that there exists
a successive transition from 𝑖 𝑗 to 𝑖𝑘 . Since G𝑠 is a directed graph, a widely used way to depict
graph connectivity is by building the connection matrix 𝐴𝑠 ∈ R𝑛×2𝑛 . 𝐴𝑠 is the combination of two
adjacency matrices 𝐴𝑠 = [𝐴𝑠(𝑖𝑛) ; 𝐴𝑠(𝑜𝑢𝑡 ) ], which denotes the normalized node degrees of incoming
and outgoing edges in the session graph respectively.
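The construction of 𝐴𝑠 can be illustrated with a short numpy sketch, assuming a toy click sequence and degree-based row normalization for the incoming and outgoing parts:

```python
import numpy as np

# Build A_s = [A_in; A_out] for a session graph from one click sequence.
clicks = ["v1", "v2", "v3", "v2", "v4"]          # toy session
nodes = sorted(set(clicks))                      # SET(s_u)
idx = {v: i for i, v in enumerate(nodes)}
n = len(nodes)

A_out = np.zeros((n, n))
for src, dst in zip(clicks[:-1], clicks[1:]):    # successive transitions
    A_out[idx[src], idx[dst]] += 1.0

# Normalize outgoing edges by out-degree; incoming edges are the transpose,
# normalized by in-degree.
out_deg = A_out.sum(axis=1, keepdims=True)
A_out_norm = np.divide(A_out, out_deg, out=np.zeros_like(A_out), where=out_deg > 0)
in_deg = A_out.T.sum(axis=1, keepdims=True)
A_in_norm = np.divide(A_out.T, in_deg, out=np.zeros_like(A_out), where=in_deg > 0)

A_s = np.concatenate([A_in_norm, A_out_norm], axis=1)   # shape (n, 2n)
print(A_s.shape)
```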


Transition graphs that capture user behavior patterns have been demonstrated to be
important for session-based recommendation [210, 232]. SR-GNN and GC-SAN [385, 399] propose to
leverage transition graphs and apply attention-based GNNs to capture the sequential information
for session-based recommendation. FGNN [287] formulates the recommendation within a session
as a graph classification problem to predict the next item for an anonymous user. GAG [288] and
GCE-GNN [375] further extend the model to capture global embeddings among multiple session
graphs.
13.2.2 Session Graph Propagation Since the session graphs are directed item graphs, there have
been multiple session graph propagation methods to obtain node representations on session graphs.
SR-GNN leverages Gated Graph Neural Networks (GGNNs) to obtain sequential information
from a given session graph adjacency 𝐴𝑠 = [𝐴𝑠(𝑖𝑛) ; 𝐴𝑠(𝑜𝑢𝑡 ) ] and item embedding set {𝑒𝑖 }:
𝑎𝑡 = 𝐴𝑠 [𝑒 1, ..., 𝑒𝑡 −1 ]𝑇 𝐻 + 𝑏, (142)
𝑧𝑡 = 𝜎 (𝑊𝑧 𝑎𝑡 + 𝑈𝑧 𝑒𝑡 −1 ), (143)
𝑟𝑡 = 𝜎 (𝑊𝑟 𝑎𝑡 + 𝑈𝑟 𝑒𝑡 −1 ), (144)
𝑒˜𝑡 = tanh(𝑊𝑜 𝑎𝑡 + 𝑈𝑜 (𝑟𝑡 ⊙ 𝑒𝑡 −1 )), (145)
𝑒𝑡 = (1 − 𝑧𝑡 ) ⊙ 𝑒𝑡 −1 + 𝑧𝑡 𝑒˜𝑡 , (146)
where the 𝑊 s and 𝑈 s are trainable parameters. GC-SAN extends GGNN by calculating the initial state 𝑎𝑡
separately to better exploit transition information:
a_t = \mathrm{Concat}\left( A_s^{(in)}([e_1, ..., e_{t-1}] W_a^{(in)} + b^{(in)}), \; A_s^{(out)}([e_1, ..., e_{t-1}] W_a^{(out)} + b^{(out)}) \right).   (147)

13.3 HyperGraph
13.3.1 Hypergraph Topology Construction Motivated by the idea of modeling hyper-structures
and high-order correlation among nodes, hypergraphs [93] are proposed as extensions of the
commonly used graph structures. For graph-based recommender systems, a common practice is
to construct hyper structures among the original user-item bipartite graphs. To be specific, an
incidence matrix of a graph with vertex set V is presented as a binary matrix 𝐻 ∈ {0, 1} |V |×| E | ,
where E represents the set of hyperedges. Each entry ℎ(𝑣, 𝑒) of 𝐻 depicts the connectivity between
vertex 𝑣 and hyperedge 𝑒:
h(v, e) = \begin{cases} 1 & \text{if } v \in e \\ 0 & \text{if } v \notin e \end{cases}.   (148)
Given the formulation of hypergraphs, the degrees of vertices and hyperedges of 𝐻 can then be
defined with two diagonal matrices 𝐷 𝑣 ∈ N | V |× | V | and 𝐷𝑒 ∈ N | E |×| E | , where
D_v(i; i) = \sum_{e \in \mathcal{E}} h(v_i, e), \qquad D_e(j; j) = \sum_{v \in \mathcal{V}} h(v, e_j).   (149)

With their development, Hypergraph Neural Networks (HGNNs) [93, 151, 460] have been shown
to be capable of capturing the high-order connectivity between nodes. HyperRec [361] firstly
attempts to leverage hypergraph structures for sequential recommendation by connecting items
with hyperedges according to the interactions with users during different time periods. DHCF
[158] proposes to construct hypergraphs for users and items respectively based on certain rules, to
explicitly capture the collaborative similarities via HGNNs. MBHT [411] combines hypergraphs with
low-rank self-attention mechanism to capture the dynamic heterogeneous relationships between
users and items. HCCF [392] uses the contrastive information between hypergraph and interaction
graph to enhance the recommendation performance.


13.3.2 Hyper Graph Message Passing With the development of HGNNs, previous works have
proposed different variants of HGNN to better exploit hypergraph structures. A classic high-order
hyper convolution process on a fixed hypergraph G = {V, E} with hyper adjacency 𝐻 is given by:
g \star X = D_v^{-1/2} H D_e^{-1} H^T D_v^{-1/2} X \Theta,   (150)
where 𝐷 𝑣 , 𝐷𝑒 are degree matrices of nodes and hyperedges, Θ denotes the convolution kernel. For
hyper adjacency matrix 𝐻 , DHCF adopts a rule-based hyperstructure via the k-order reachable rule,
where nodes in the same hyperedge group are k-order reachable to each other:
A_u^k = \min(1, \mathrm{power}(A \cdot A^T, k)),   (151)
where 𝐴 denotes the graph adjacency matrix. By considering the situations where 𝑘 = 1, 2, the
matrix formulation of the hyper connectivity of users and items are calculated with:
H_u = A \,\|\, (A(A^T A)), \qquad H_i = A^T \,\|\, (A^T(A A^T)),   (152)
which depicts the dual hypergraphs for users and items.
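The sketch below illustrates, under made-up toy values, the k-order reachable rule of Eq. (151), the dual hyper-incidence matrices of Eq. (152), and one step of the hypergraph convolution of Eq. (150); the interaction matrix and feature sizes are assumptions for illustration only.

```python
import numpy as np

# Toy DHCF-style hypergraph construction and one hypergraph convolution step.
A = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 0, 0, 1]], dtype=float)        # 3 users x 4 items

# k-order reachability among users for k = 2 (Eq. (151)); illustration only.
A_u2 = np.minimum(1.0, np.linalg.matrix_power(A @ A.T, 2))

# Dual hyper-incidence matrices (Eq. (152)): concatenate 1-hop and 2-hop views.
H_u = np.concatenate([A, A @ (A.T @ A)], axis=1)      # users x (2 * items)
H_i = np.concatenate([A.T, A.T @ (A @ A.T)], axis=1)  # items x (2 * users)

def hypergraph_conv(H, X, Theta):
    """One hypergraph convolution step (Eq. (150))."""
    Dv = np.diag(H.sum(axis=1) ** -0.5)
    De = np.diag(1.0 / H.sum(axis=0))
    return Dv @ H @ De @ H.T @ Dv @ X @ Theta

rng = np.random.default_rng(0)
X_u = rng.normal(size=(3, 8))                    # toy user features
Theta = rng.normal(size=(8, 8))                  # toy convolution kernel
print(hypergraph_conv(H_u, X_u, Theta).shape)
```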
HCCF proposes to construct a learnable hypergraph to depict the global dependencies between
nodes on the interaction graph. To be specific, the hyperstructure is factorized with two low-rank
embedding matrices to achieve model efficiency:
𝐻𝑢 = 𝐸𝑢 · 𝑊𝑢 , 𝐻 𝑣 = 𝐸 𝑣 · 𝑊𝑣 . (153)

13.4 Other Graphs


Since there are a variety of recommendation scenarios, several tailor-designed graph structures
have been proposed accordingly, to better exploit the domain information from different
scenarios. For instance, CKE [433] and MKR [360] introduce knowledge graphs to enhance graph
recommendation. GSTN [376], KBGNN [173] and DisenPOI [285] propose to build geographical
graphs based on the distance between Points-of-Interest (POIs) to better model the locality of users’
visiting patterns. TGSRec [87] and DisenCTR [371] empower the user-item interaction graphs with
temporal sampling between layers to obtain sequential information from static bipartite graphs.

13.5 Summary
This section introduces the application of different kinds of graph neural networks in recommender
systems and can be summarized as follows:
• Graph Constructions. There are multiple options for constructing graph-structured data
for a variety of recommendation tasks. For instance, the user-item bipartite graphs reveal
the high-order collaborative similarity between users and items, and the transition graph
is suitable for encoding sequential information in clicking history. These diversified graph
structures provide different views for node representation learning on users and items, and
can be further used for downstream ranking tasks.
• Challenges and Limitations. Though the superiority of graph-structured data and GNNs
against traditional methods has been widely illustrated, there are still challenges unsolved.
For example, the computational cost of graph methods is normally high and thus often
unacceptable in real-world applications. The data sparsity and cold-start issues in graph
recommendation remain to be explored as well.
• Future Works. In the future, efficient solutions for applying GNNs to recommendation
tasks are expected. There are also some attempts [87, 371] on incorporating temporal informa-
tion in graph representation learning for sequential recommendation tasks.


14 Traffic Analysis By Zheng Fang


Intelligent Transportation Systems (ITS) are essential for safe, reliable, and efficient transportation
in smart cities, serving the daily commuting and traveling needs of millions of people. To support
ITS, advanced modeling and analysis techniques are necessary, and Graph Neural Networks (GNNs)
are a promising tool for traffic analysis. GNNs can effectively model spatial correlations, making
them well-suited for analyzing complex transportation networks. As such, GNNs have garnered
significant interest in the traffic domain for their ability to provide insights into traffic patterns and
behaviors.
In this section, we first summarize the main GNN research directions in the traffic domain, and then
we summarize the typical graph construction processes in different traffic scenes and datasets.
Finally, we list the classical GNN workflows for dealing with tasks in traffic networks. A summary
is provided in Table 12.

14.1 Research Directions in Traffic Domain


We summarize main GNN research directions in traffic domain as follows,

• Traffic Flow Forecasting. Traffic flow forecasting plays an indispensable role in ITS [72,
291], which involves leveraging spatial-temporal data collected by various sensors to gain
insights into future traffic patterns and behaviors. Classic methods, like autoregressive
integrated moving average (ARIMA) [27], support vector machine (SVM) [136] and recurrent
neural networks (RNN) [63] can only model time series separately without considering their
spatial connections. To address this issue, graph neural networks (GNNs) have emerged as
a powerful approach for traffic forecasting due to their strong ability of modeling complex
graph-structured correlations [30, 160, 393].
• Trajectory Prediction. Trajectory prediction is a crucial task in various applications, such
as autonomous driving and traffic surveillance, which aims to forecast future positions of
agents in the traffic scene. However, accurately predicting trajectories can be challenging, as
the behavior of an agent is influenced not only by its own motion but also by interactions
with surrounding objects. To address this challenge, Graph Neural Networks (GNNs) have
emerged as a promising tool for modeling complex interactions in trajectory prediction
[34, 263, 336, 462]. By representing the scene as a graph, where each node corresponds to an
agent and the edges capture interactions between them, GNNs can effectively capture spatial
dependencies and interactions between agents. This makes GNNs well-suited for predicting
trajectories that accurately capture the behavior of agents in complex traffic scenes.
• Traffic Anomaly Detection. Anomaly detection is an essential support for ITS. There are
lots of traffic anomalies in daily transportation systems, for example, traffic accidents, extreme
weather and unexpected situations. Handling these traffic anomalies timely can improve the
service quality of public transportation. The main trouble of traffic anomaly detection is the
highly twisted spatial-temporal characteristics of traffic data. The criteria and influence of
traffic anomaly varies among locations and times. GNNs have been introduced and achieved
success in this domain [53, 68, 69, 434].
• Others. Traffic demand prediction targets estimating the future number of trips at a given
location. It is of vital and practical significance in resource scheduling for ITS. By
using GNNs, the spatial dependencies of demands can be revealed [410, 412]. What is more,
urban vehicle emission analysis is also considered in recent work, which is closely related to
environmental protection and is gaining increasing attention from researchers [405].


Table 12. Summary of graph models for traffic analysis.

Models | Tasks | Adjacency matrices | GNN types | Temporal modules
STGCN [420] | Traffic Flow Forecasting | Fixed Matrix | GCN | TCN
DCRNN [220] | Traffic Flow Forecasting | Fixed Matrix | ChebNet | RNN
AGCRN [13] | Traffic Flow Forecasting | Dynamic Matrix | GCN | GRU
ASTGCN [125] | Traffic Flow Forecasting | Fixed Matrix | GAT | Attention&TCN
STSGCN [328] | Traffic Flow Forecasting | Dynamic Matrix | GCN | Cropping
GraphWaveNet [388] | Traffic Flow Forecasting | Dynamic Matrix | GCN | Gated-TCN
LSGCN [150] | Traffic Flow Forecasting | Fixed Matrix | GAT | GLU
GAC-Net [329] | Traffic Flow Forecasting | Fixed Matrix | GAT | Gated-TCN
STGODE [89] | Traffic Flow Forecasting | Fixed Matrix | Graph ODE | TCN
STG-NCDE [57] | Traffic Flow Forecasting | Dynamic Matrix | GCN | NCDE
MS-ASTN [365] | OD Flow Forecasting | OD Matrix | GCN | LSTM
Social-STGCNN [263] | Trajectory Prediction | Fixed Matrix | GCN | TXP-CNN
RSBG [336] | Trajectory Prediction | Dynamic Matrix | GCN | LSTM
STGAN [69] | Anomaly Detection | Fixed Matrix | GCN | GRU
DMVST-VGNN [164] | Traffic Demand Prediction | Fixed Matrix | GAT | GLU
ST-GRAT [270] | Traffic Speed Prediction | Fixed Matrix | GAT | Attention

14.2 Traffic Graph Construction


14.2.1 Traffic Graph . The traffic network is represented as a graph G = (𝑉 , 𝐸, 𝐴), where 𝑉 is the
set of 𝑁 traffic nodes, 𝐸 is the set of edges, and 𝐴 ∈ R𝑁 ×𝑁 is an adjacency matrix representing the
connectivity of 𝑁 nodes. In the traffic domain, 𝑉 usually represents a set of physical nodes, like
traffic stations or traffic sensors. The features of nodes typically depend on the specific task. Take
the traffic flow forecasting as an example. The features are the traffic flows, i.e., the historical time
series of nodes. The traffic flow can be represented as a flow matrix 𝑋 ∈ R𝑁 ×𝑇 , where 𝑁 is the
number of traffic nodes and 𝑇 is the length of historical series, and 𝑋𝑛𝑡 denotes the traffic flow of
node 𝑛 at time 𝑡. The goal of traffic flow forecasting is to learn a mapping function 𝑓 to predict the
traffic flow during future 𝑇 ′ steps given the historical 𝑇 step information, which can be formulated
as follows:
\left[ X_{:,t-T+1}, X_{:,t-T+2}, \cdots, X_{:,t}; \mathcal{G} \right] \xrightarrow{f} \left[ X_{:,t+1}, X_{:,t+2}, \cdots, X_{:,t+T'} \right].   (154)
14.2.2 Graph Construction. Constructing a graph to describe the interactions among traffic nodes,
i.e., the design of the adjacency matrix 𝐴, is the key part of traffic analysis. The mainstream designs
can be divided into two categories, fixed matrix and dynamic matrix.
Fixed matrix. Lots of works assume that the correlations among traffic nodes are fixed and
constant over time, and they design a fixed and pre-defined adjacency matrix to capture the spatial
correlation. Here we list several common choices of fixed adjacency matrix.
The connectivity matrix is the most natural construction way. It relies on the support of
road map data. The element of the connectivity matrix is defined as 1 if two nodes are physically
connected and 0 otherwise. This binary format is convenient to construct and easy to interpret.
The distance-based matrix is also a common choice, which shows the connection between two
nodes more precisely. The elements of the matrix are defined as the function of distance between
two nodes (driving distance or geographical distance). A typical way is to use a thresholded Gaussian
function as follows,
A_{ij} = \begin{cases} \exp\left(-\frac{d_{ij}^2}{\sigma^2}\right), & d_{ij} < \epsilon \\ 0, & d_{ij} > \epsilon \end{cases},   (155)


where 𝑑𝑖 𝑗 is the distance between node 𝑖 and 𝑗, and 𝜎 and 𝜖 are two hyperparameters to control the
distribution and the sparsity of the matrix.
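A small numpy sketch of this thresholded Gaussian construction (Eq. (155)) is given below; the sensor coordinates, 𝜎 and 𝜖 are toy assumptions.

```python
import numpy as np

# Turn pairwise distances between traffic sensors into a sparse weighted
# adjacency matrix with a thresholded Gaussian kernel.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
sigma, epsilon = 2.0, 3.0

diff = coords[:, None, :] - coords[None, :, :]
dist = np.linalg.norm(diff, axis=-1)            # pairwise geographical distance

A = np.where(dist < epsilon, np.exp(-dist ** 2 / sigma ** 2), 0.0)
np.fill_diagonal(A, 0.0)                        # no self-loops
print(A.round(3))
```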
Another kind of fixed adjacency matrix is the similarity-based matrix. Strictly speaking, the similarity matrix
is not an adjacency matrix. It is constructed according to the similarity of two
nodes, which means the neighbors in the similarity graph may be far away in the real world. There
are various similarity metrics. For example, many works measure the similarity of two nodes by
their functionality, e.g., the distribution of surrounding points of interest (POIs). The assumption
behind this is that nodes sharing similar functionality may share similar traffic patterns. We can also
define the similarity through the historical flow patterns. To compute the similarity of two time
series, a common practice is to use the Dynamic Time Warping (DTW) algorithm [266], which is
superior to other metrics due to its sensitivity to shape similarity rather than point-wise similarity.
Specifically, given two time series 𝑋 = (𝑥 1, 𝑥 2, · · · , 𝑥𝑛 ) and 𝑌 = (𝑦1, 𝑦2, · · · , 𝑦𝑛 ), DTW is a dynamic
programming algorithm defined as
𝐷 (𝑖, 𝑗) = 𝑑𝑖𝑠𝑡 (𝑥𝑖 , 𝑦 𝑗 ) + min (𝐷 (𝑖 − 1, 𝑗), 𝐷 (𝑖, 𝑗 − 1), 𝐷 (𝑖 − 1, 𝑗 − 1)) , (156)
where 𝐷 (𝑖, 𝑗) represents the shortest distance between subseries 𝑋 = (𝑥 1, 𝑥 2, · · · , 𝑥𝑖 ) and 𝑌 =
(𝑦1, 𝑦2, · · · , 𝑦 𝑗 ), and 𝑑𝑖𝑠𝑡 (𝑥𝑖 , 𝑦 𝑗 ) is some distance metric like absolute distance. As a result, 𝐷𝑇𝑊 (𝑋, 𝑌 ) =
𝐷 (𝑛, 𝑛) is set as the final distance between 𝑋 and 𝑌 , which better reflects the similarity of two time
series compared to the Euclidean distance.
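The DTW recursion of Eq. (156) can be written directly as a dynamic program; the sketch below uses absolute distance as the point-wise metric and two made-up toy series.

```python
import numpy as np

# Dynamic-programming implementation of the DTW recursion in Eq. (156),
# with dist(x_i, y_j) = |x_i - y_j|.
def dtw(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # D(i, j) = dist(x_i, y_j) + min(D(i-1, j), D(i, j-1), D(i-1, j-1))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

x = [0.0, 1.0, 2.0, 1.0, 0.0]
y = [0.0, 0.0, 1.0, 2.0, 1.0]
print(dtw(x, y), np.abs(np.array(x) - np.array(y)).sum())  # DTW vs. point-wise
```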
Dynamic matrix. The pre-defined matrix is sometimes unavailable and cannot reflect complete
information of spatial correlations. The dynamic adaptive matrix is proposed to solve the issue.
The dynamic matrix is learned from input data automatically. To achieve the best prediction
performance, the dynamic matrix will manage to infer the hidden correlations among nodes, beyond
the physical connections.
A typical practice is learning the adjacency matrix from node embeddings [13]. Let 𝐸𝐴 ∈ R𝑁 ×𝑑 be a
learnable node embedding dictionary, where each row of 𝐸𝐴 represents the embedding of a node,
𝑁 and 𝑑 denote the number of node and the dimension of embeddings respectively. The graph
adjacency matrix is defined as the similarities among node embeddings,
D^{-\frac{1}{2}} A D^{-\frac{1}{2}} = \mathrm{softmax}\left( \mathrm{ReLU}(E_A \cdot E_A^T) \right),   (157)
where the softmax function is used to perform row-normalization, and D^{-1/2} A D^{-1/2} is the Laplacian matrix.
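The following numpy sketch illustrates Eq. (157), assuming a randomly initialized embedding dictionary 𝐸𝐴; in practice 𝐸𝐴 would be learned end-to-end by backpropagation.

```python
import numpy as np

# Data-adaptive adjacency from a learnable node embedding dictionary E_A.
def softmax_rows(M):
    M = M - M.max(axis=1, keepdims=True)        # numerical stability
    expM = np.exp(M)
    return expM / expM.sum(axis=1, keepdims=True)

N, d = 6, 4                                     # toy numbers of nodes / dims
rng = np.random.default_rng(0)
E_A = rng.normal(size=(N, d))                   # stand-in for learned embeddings

A_adaptive = softmax_rows(np.maximum(E_A @ E_A.T, 0.0))   # softmax(ReLU(E_A E_A^T))
print(A_adaptive.sum(axis=1))                   # each row sums to 1
```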

14.3 Typical GNN Frameworks in Traffic Domain


Here we list several representative GNN models for traffic analysis.
14.3.1 Spatial Temporal Graph Convolution Network (STGCN) [420]. STGCN is a pioneering work
in the spatial-temporal GNN domain. It utilizes graph convolution to capture spatial features, and
deploys a gated causal convolution to extract temporal patterns. Specifically, the graph convolution
and temporal convolution are defined as follows,
\Theta \ast_\mathcal{G} x = \theta\left(I_n + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right) x = \theta\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}\right) x,   (158)
\Gamma \ast_\mathcal{T} y = P \odot \sigma(Q),   (159)
where Θ is the parameter of the graph convolution, and 𝑃 and 𝑄 are the outputs of a 1-d convolution
along the temporal dimension. The sigmoid gate 𝜎 (𝑄) controls how the states of 𝑃 are relevant
for discovering hidden temporal patterns. In order to fuse features from both the spatial and temporal
dimensions, the spatial convolution layer and the temporal convolution layer are combined to
construct a spatial temporal block to jointly deal with graph-structured time series, and more blocks
can be stacked to achieve a more scalable and complex model.
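As a minimal illustration of the gated temporal convolution in Eq. (159), the sketch below applies a 1-d causal convolution with two sets of random stand-in weights to a single node's series and gates one output with the sigmoid of the other.

```python
import numpy as np

# Gated temporal convolution: P * sigmoid(Q), where P and Q come from two
# 1-d convolutions over the time axis (toy kernel weights, one node's series).
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
T, kernel = 12, 3
series = rng.normal(size=T)                      # one node's traffic series
w_p, w_q = rng.normal(size=kernel), rng.normal(size=kernel)

out = []
for t in range(kernel - 1, T):
    window = series[t - kernel + 1: t + 1]       # causal window
    P = float(window @ w_p)                      # candidate features
    Q = float(window @ w_q)                      # gate logits
    out.append(P * sigmoid(Q))                   # Eq. (159)

print(len(out), out[:3])
```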


14.3.2 Diffusion Convolutional Recurrent Neural Network (DCRNN) [220]. DCRNN is a representa-
tive solution combining graph convolution networks with recurrent neural networks. It captures
spatial dependencies by bidirectional random walks on the graph. The diffusion convolution
operation on a graph is defined as:
X \ast_\mathcal{G} f_\theta = \sum_{k=0}^{K} \left( \theta_{k,1} (D_O^{-1} A)^k + \theta_{k,2} (D_I^{-1} A)^k \right) X,   (160)

where 𝜃 are parameters for the convolution filter, and 𝐷𝑂−1𝐴, 𝐷 𝐼−1𝐴 represent the bidirectional diffu-
sion processes respectively. In term of the temporal dependency, DCRNN utilizes Gated Recurrent
Units (GRU), and replace the linear transformation in the GRU with the diffusion convolution as
follows,
r^{(t)} = \sigma(\Theta_r \ast_\mathcal{G} [X^{(t)}, H^{(t-1)}] + b_r),   (161)
u^{(t)} = \sigma(\Theta_u \ast_\mathcal{G} [X^{(t)}, H^{(t-1)}] + b_u),   (162)
C^{(t)} = \tanh(\Theta_C \ast_\mathcal{G} [X^{(t)}, r^{(t)} \odot H^{(t-1)}] + b_c),   (163)
H^{(t)} = u^{(t)} \odot H^{(t-1)} + (1 - u^{(t)}) \odot C^{(t)},   (164)
where 𝑋 (𝑡 ) , 𝐻 (𝑡 ) denote the input and output at time 𝑡, 𝑟 (𝑡 ) , 𝑢 (𝑡 ) are the reset and update gates
respectively, and Θ𝑟 , Θ𝑢 , Θ𝐶 are parameters of convolution filters. Moreover, DCRNN employs a
sequence to sequence architecture to predict future series. Both the encoder and the decoder are
constructed with diffusion convolutional recurrent layers. The historical time series are fed into
the encoder and the predictions are generated by the decoder. The scheduled sampling technique is
utilized to solve the discrepancy problem between training and test distribution.
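The bidirectional diffusion convolution of Eq. (160) can be sketched in a few lines of numpy; here the backward direction is taken over the reversed edges (A^T), which is one common reading of the bidirectional random walk, and the filter parameters are random stand-ins.

```python
import numpy as np

# Diffusion convolution on a toy directed graph with K diffusion steps in each
# direction; theta1 / theta2 are stand-ins for learned filter parameters.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)        # directed adjacency

D_O_inv = np.diag(1.0 / A.sum(axis=1))           # out-degree normalization
D_I_inv = np.diag(1.0 / A.sum(axis=0))           # in-degree normalization
P_fwd = D_O_inv @ A                              # forward transition D_O^{-1} A
P_bwd = D_I_inv @ A.T                            # assumed backward transition over A^T

K = 2
theta1 = rng.normal(size=K + 1)
theta2 = rng.normal(size=K + 1)
X = rng.normal(size=(4, 1))                      # one feature per node

out = np.zeros_like(X)
for k in range(K + 1):
    out += theta1[k] * np.linalg.matrix_power(P_fwd, k) @ X
    out += theta2[k] * np.linalg.matrix_power(P_bwd, k) @ X
print(out.ravel())
```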
14.3.3 Adaptive Graph Convolutional Recurrent Network (AGCRN) [13]. The focuses of AGCRN are
two-fold. On the one hand, it argues that the temporal patterns are diversified and thus parameter-
sharing for each node is inferior; on the other hand, it proposes that the pre-defined graph may be
intuitive and incomplete for the specific prediction task. To mitigate the two issues, it designs a
Node Adaptive Parameter Learning (NAPL) module to learn node-specific patterns for each traffic
series, and a Data Adaptive Graph Generation (DAGG) module to infer the hidden correlations
among nodes from data and to generate the graph during training. Specifically, the NAPL module
is defined as follows,
Z = \left(I_n + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right) X E_\mathcal{G} W_\mathcal{G} + E_\mathcal{G} b_\mathcal{G},   (165)
where 𝑋 ∈ R𝑁 ×𝐶 is the input feature, 𝐸 G ∈ R𝑁 ×𝑑 is a node embedding dictionary, 𝑑 is the
embedding dimension (𝑑 << 𝑁 ), 𝑊 G ∈ R𝑑×𝐶×𝐹 is a weight pool. The original parameter Θ in
the graph convolution is replaced by the matrix product 𝐸 G𝑊 G , and the same operation
is applied for the bias. This can help the model capture node-specific patterns from a pattern
pool according to the node embeddings. The DAGG module has been introduced in Eq. (157). The whole
workflow of AGCRN is formulated as follows,
\tilde{A} = \mathrm{softmax}(\mathrm{ReLU}(E E^T)),   (166)
z^{(t)} = \sigma(\tilde{A} [X^{(t)}, H^{(t-1)}] E W_z + E b_z),   (167)
r^{(t)} = \sigma(\tilde{A} [X^{(t)}, H^{(t-1)}] E W_r + E b_r),   (168)
\hat{H}^{(t)} = \tanh(\tilde{A} [X^{(t)}, r^{(t)} \odot H^{(t-1)}] E W_h + E b_h),   (169)
H^{(t)} = z^{(t)} \odot H^{(t-1)} + (1 - z^{(t)}) \odot \hat{H}^{(t)}.   (170)


14.3.4 Attention Based Spatial-Temporal Graph Convolutional Networks (ASTGCN) [125]. ASTGCN
introduces two kinds of attention mechanisms into spatial temporal forecasting, i.e., spatial attention
and temporal attention. The spatial attention is defined as follows,
S = V_S \cdot \sigma\left( (X W_1) W_2 (W_3 X)^T + b_S \right),   (171)
S'_{i,j} = \frac{\exp(S_{i,j})}{\sum_{j=1}^{N} \exp(S_{i,j})},   (172)
where 𝑆 ′ is the normalized attention score, and 𝑊1 , 𝑊2 , 𝑊3 are learnable parameters. A similar construction
is applied for the temporal attention in Eq. (173), where the input is transposed so that the spatial and
temporal dimensions are exchanged. Besides the attention
mechanism, ASTGCN also introduces multi-component fusion to enhance the prediction ability.
The input of ASTGCN consists of three parts, the recent segments, the daily-periodic segments and
the weekly-periodic segment. The three segments are processed by the main model independently
and fused with learnable weights at last:
𝑌 = 𝑊ℎ ⊙ 𝑌ℎ + 𝑊𝑑 ⊙ 𝑌𝑑 + 𝑊𝑤 ⊙ 𝑌𝑤 , (174)
where 𝑌ℎ , 𝑌𝑑 , 𝑌𝑤 denote the predictions of the different segments respectively.

15 Summary
This section introduces graph models for traffic analysis and we provide a summary as follows:
• Techniques. Traffic analysis is a classical spatial temporal data mining task, and graph
models play a vital role for extracting spatial correlations. Typical procedures include the
graph construction, spatial dimension operations, temporal dimension operations and the
information fusion. There are multiple implementations for each procedure, each implemen-
tation has its strengths and weaknesses. By combining different implementations, various
kinds of traffic analysis models can be created. Choosing the right combination of procedures
and implementations is critical for achieving accurate and reliable traffic analysis results.
• Challenges and Limitations. Despite the remarkable success of graph representation
learning in traffic analysis, there are still several challenges that need to be addressed in
current studies. Firstly, external data, such as weather and calendar information, are not
well-utilized in current models, despite their close relation to traffic status. The challenge lies
in how to effectively fuse heterogeneous data to improve traffic analysis accuracy. Secondly,
the interpretability of models has been underexplored, which could hinder their deployment
in real-world transportation systems. Interpretable models are crucial for building trust and
understanding among stakeholders, and more research is needed to develop models that are
both accurate and interpretable. Addressing these challenges will be critical for advancing
the state-of-the-art in traffic analysis and ensuring the deployment of effective transportation
systems.
• Future Works. In the future, we anticipate that more data sources will be available for traffic
analysis, enabling a more comprehensive understanding of real-world traffic scenes. From
data collection to model design, there is still a lot of work to be done to fully leverage the
potential of GNNs in traffic analysis. In addition, we expect to see more creative applications
of GNN-based traffic analysis, such as designing traffic light control strategies, which can
help to improve the efficiency and safety of transportation systems. To achieve these goals, it
is necessary to continue advancing the development of GNN-based models and exploring new


ways to fuse diverse data sources. Additionally, there is a need to enhance the interpretability
of models and ensure their applicability to real-world transportation systems. We believe that
these efforts will contribute to the continued success of traffic analysis and the development
of intelligent transportation systems.

16 Future Directions By Zhiping Xiao


In this section, we outline some prospective future directions of deep graph representation learning
based on the above cornerstone, taxonomy and real-world applications. We also outline a few more
directions closer to the theoretical side.

16.1 Application-Inspired Directions


Since deep graph representation learning has been widely used in recent years, many problems have been solved while many
others have arisen. From the many real-world applications we observe, we identify plenty of challenging
problems that are not yet solved. In this subsection, we outline a few.
16.1.1 Fairness in Graph Representation Learning One common aspect to care about is the fairness
concern. Fairness, by definition, refers to the requirement that protected features do not affect the outcome.
In general, a data set’s fairness means that the protected features do not influence the data
distribution. A model’s fairness, on the other hand, refers to the concern that the output of our
algorithms should not be affected by certain protected features. The protected features can
be race, gender, etc.
To assess data fairness, we might measure how frequently in a text corpus a female character
is associated with leadership, versus how frequently a male character is. To design a fair model, we require
the results of the model to stay the same once we make any change to the protected features. For
instance, if we swap the gender feature of some data samples, the predicted outcomes remain the
same.
Similar to the fairness challenge in many other fields of machine learning [58, 258], graph
representation learning can easily suffer from bias from the data sets which inherit stereotypes
from the real world, from the architecture of models, or from design decisions aiming at a higher
performance on a particular task. As the graph representation becomes increasingly popular in
recent years, the researchers are getting fairness into their sights [75, 248].
16.1.2 Robustness in Graph Representation Learning Real-world data is always noisy, containing
many different kinds of disruptions, and rarely follows a perfect normal distribution. In
the worst case, some noise can prevent a model from learning the correct knowledge.
Better robustness means the model has a better chance of reaching a relatively good and stable
outcome when the input is perturbed. If a model is not robust enough, its performance cannot be
relied on. Therefore, robustness is another important yet challenging consideration in deep graph
representation learning.
Again, similar to many other machine learning approaches that aim at solving real-world
problems [39], improving the robustness of deep graph representation models is a nontrivial
direction. Both enhancing the models’ robustness and conducting adversarial attacks to challenge
the robustness of graph representations are promising directions to go for [114, 124, 345].
Conducting adversarial attacks on models centers around manipulating the input data.
Enhancing a model’s robustness usually works by introducing new tricks or new frameworks to
the model, or even changing the model’s architecture or other design details.
16.1.3 Adversarial Reprogramming With the emergence of pre-trained graph neural network mod-
els [145, 146, 286], introducing adversarial reprogramming [84, 459] into deep graph representation


learning becomes another possibility as well. The major difference between adversarial repro-
gramming and adversarial attack lies in whether or not there is a particular target after putting
some adversarial samples against the model. Adversarial attack requires some small modification
on the input data samples. An adversarial attack is considered successful once the result is influ-
enced. However, under the adversarial reprogramming setting, the task succeeds if and only if the
influenced results can be used for another desired task.
This is to say, without changing the model’s inner structure or fine-tuning its parameters, we
might be able to use some pre-trained graph models for some other tasks that were not planned to
be solved by these models in the first place. In other deep learning fields, adversarial reprogramming
problems are normally solved by having the input carefully encoded, and output cleverly mapped.
On some graph data sets, such as chemical data sets and biology data sets, pretrained models are
already available. Therefore, there is a possibility that adversarial reprogramming could be applied
in the future.
16.1.4 Generalizing to Out of Distribution Data In order to perform better on unobserved data,
in the ideal case, the representation we learn should be able to generalize to
out-of-distribution (OOD) data. Being out-of-distribution is not identical to being mis-classified.
Mis-classified samples come from the same distribution as the training data but the model
fails to classify them correctly, while out-of-distribution refers to the case where the sample comes from
a distribution other than the training data [138]. Being able to generalize to out-of-distribution data
will greatly enhance a model’s reliability in real life. Studying out-of-distribution generalized
graph representation [206] is an open field [207]. This is partly because, currently, even the
problem of detecting out-of-distribution data samples is not fully conquered yet [138].
In order to handle out-of-distribution data samples, we need to detect which samples
belong to this type first. Detecting OOD samples is itself somewhat similar to novelty detection or
outlier detection problems [278]. Their major difference is whether or not a well-performing model
conducting the original task remains part of our goal. Novelty detection cares only about figuring
out which samples are outliers; OOD detection requires our model to detect the outliers while keeping the
performance unharmed at the same time.
16.1.5 Interpretability in Graph Representation Learning The interpretability concern is another limita-
tion that arises when researchers try to apply deep graph representation learning to some of the
emerging application fields. For instance, in the field of computational social science, researchers
are urging more efforts in integrating explanation and prediction together [141]. The same holds for drug
discovery, where being able to explain why a particular structure is chosen instead of another option is very
important [163]. Generally speaking, neural networks are complete black boxes to human
knowledge unless effort is taken to make them interpretable and explainable. Although more and
more tasks are being handled by deep learning methods in many fields, the tool remains mysterious
to most human beings. Even an expert in deep learning cannot easily explain how the
tasks are performed and what the model has learned from the data. This situation reduces the
trustworthiness of neural network models, prevents humans from learning more from the models’
results, and even limits the potential improvements of the models themselves, without sufficient
feedback to human beings.
Seeking better interpretability is not only a personal interest of companies and researchers;
in fact, as more and more ethics concerns arise from black-box decisions
made by AI algorithms, interpretability has become a legal requirement [118].
Various approaches have been applied in service of better interpretability [443]. Among
existing works, some provide ad-hoc explanations after the results come out, while others
actively change the model structure to provide better explanations; there are explanations given by providing similar


examples, by highlighting some attributes to the input features, by making sense of some hidden
layers and extract semantics from them, or by extracting logical rules; we also see local explanations
that explain some particular samples, global explanations that explain the network as a whole, or
hybrid. Most of those existing directions make sense in a graph representation learning setting.
Not a consensus has been reached on what are the best methods of making a model interpretable.
Researchers are still actively exploring every possibility, and thus there are plenty of challenges
and interesting topics in this direction.
16.1.6 Causality in Graph Representation Learning In recent years, there has been increasing research
focusing on combining causality and machine learning models [148, 252, 296]. It is widely
believed that making good use of causality will help models gain higher performance. However,
finding the right way to model causality in many real-world scenarios remains highly challenging.
Something to note is that the most common kind of graph that comes along with causal study,
the “causal graph”, is not identical to the kind of graphs we are studying in deep graph represen-
tation learning. Causal graphs are graphs whose nodes are factors and whose links represent
causal relations. Up till now, they are among the most reliable tools for causal inference study.
Traditionally, causal graphs are defined by human experts. Recent work has shown that neural
networks can help with scalable causal graph generation [401]. From this perspective, the story
can also go the other way around: besides using causal relations to enhance graph representation learning, it
is also possible to use graph representation learning strategies to help with causal study.
16.1.7 Emerging Application Fields Besides the above-mentioned directions that address existing chal-
lenges in the deep learning world, there are many emerging fields of application that naturally
come with graph-structured data, for instance, social network analysis and drug discovery.
Due to the nature of the data, social
network interactions and drug molecule structures can be easily depicted as graph-structured data.
Therefore, deep graph representation learning has much to do in these fields [1, 109, 473].
Some basic problems on social networks are easily solved using graph representation learning
strategies. Those basic problems include node classification, link prediction, graph classification,
and so on. In practice, those problem settings correspond to real-world problems such as ideology
prediction, interaction prediction, analyzing a social group, etc. However, social network data
typically has many unique features that could prevent general-purpose models from performing
well. For instance, social media data can be sparse, incomplete, and extremely
imbalanced [454]. On the other hand, people have clear goals when studying social media data, such
as controversy detection [19], rumor detection [127, 341], mis-information and dis-information
detection [71], or studying the dynamics of the system [181]. There are still a lot of open quests to
be conquered, where deep graph representation learning can help.
As for drug discovery, researchers have interests in perspectives beyond simply
proposing a set of potentially functional structures, which is what is widely seen today. The other per-
spectives include obtaining more interpretable results from the model’s proposals [163, 280] and
considering synthetic accessibility [396]. These directions are important, in answer to some doubts about
AI from society [118], as well as from the tradition of chemistry studies [308]. Similar to the
challenges we face when combining social science and neural networks, chemical scientists would
also prefer the black-box AI models to be interpretable. Some chemical scientists would
also prefer AI tools to provide them with synthetic routes instead of the target structure itself.
In practice, proposing new molecule structures is usually not the bottleneck, but synthesizing them is.
There are already some existing works focusing on conquering this problem [85, 156]. But so far
there is a gap between chemical experiments and AI tools, indicating that there is still plenty of
improvement to be made.


Some chemistry researchers also find it useful to have material data better organized, given
that molecule structures are becoming increasingly complex and massive amounts of research
papers describe materials’ features from different aspects [355]. This direction might be
more closely related to knowledge bases or even database systems. But in a way, given that a
polymer structure is typically a node-link graph, graph representation learning might be able to
help in dealing with such issues.

16.2 Theory-Driven Directions


Some other future directions dig into the roots of graph theory. More specifically, they focus on
fundamental improvements in neural network structure design, or better ways of expressing
the graph representations. These directions require solid mathematical background
knowledge. All in all, breakthroughs in these directions might not have immediate impact,
but every study in these directions has the potential to change the entire field, sooner or later.

16.2.1 Mathematical Proof of Feasibility It has been a long-lasting problem that most of the existing
deep learning approaches lack mathematical proofs of their learnability, bounds, etc. [16, 26]. This
problem relates to the difficulty of providing theoretical proofs for a complicated structure like
a neural network [121].
Currently, most of the theoretical proofs aim at figuring out theoretical bounds [17, 132, 176].
There are multiple types of bounds with different problem settings, for example: given a known
model architecture, with input data satisfying a particular normal distribution, prove that
training will converge, and provide the estimated number of iterations. Most of the architectures
being studied are simple, such as those made of multi-layer perceptrons (MLPs), or the studies simply
consider the updates of parameters in a single fully-connected layer.
In the field of deep graph representation learning, the neural network architectures are typically
much more complex than MLPs. Graph neural networks (GNNs), since the very beginning [66, 182],
involve a lot of approximation and simplification of mathematical theorems. Nowadays, most
researchers rely heavily on the experimental results. No matter how wild an idea is, as long as it
finally works out in an experiment, say, being able to converge and the results are acceptable, the
design is acceptable. All this practice makes the entire field somewhat experiment-oriented or
experience-oriented, while there remains a huge gap between the theoretical proofs and the frontier
of deep graph representation.
It will be more than beneficial to the whole field if some researchers can push forward these
theoretical foundations. However, these problems are incredibly challenging.

16.2.2 Combining Spectral Graph Theory Down to the theoretical foundations, the idea of graph neural
networks [66, 182, 321] initially comes from spectral graph theory [60]. In recent years, many
researchers have been investigating possible improvements of graph representation learning strategies by
utilizing spectral graph theory [54, 135, 256, 407]. For example, the graph Laplacian is closely related
to many properties, such as the connectivity of a graph. By studying the properties of the Laplacian,
it is possible to provide proofs of graph neural network models’ properties, and to propose better
models with desired advantages, such as robustness [100, 300].
Spectral graph theory provides a lot of useful insights into graph representation learning from a
new perspective. There are a lot to be done in this direction.

16.2.3 From Graph to Manifolds Many researchers are devoted to the direction of learning graph
representations in non-Euclidean spaces [10, 305]. That is to say, to embed and to compute in
spaces that are not Euclidean, such as hyperbolic and spherical spaces.


Theoretical reasoning and experimental results have shown certain advantages of working on
manifolds instead of standard Euclidean space. It is believed that these advantages are brought by
their abilities to capture complex correlations on the surface manifold [461]. Besides, researchers
have shown that, by combining standard graph representation learning strategies and manifold
assumptions, models work better at preserving and acquiring locality and similarity rela-
tionships [101]. Intuitively, two nodes’ embeddings are sometimes regarded as too similar in
Euclidean space, but in non-Euclidean space they are easily distinguishable.

17 Conclusion By Wei Ju
In this survey, we present a comprehensive and up-to-date overview of deep graph representation
learning. We present a novel taxonomy of existing algorithms categorized into GNN architectures,
learning paradigms, and applications. Technically, we first summarize the categories of GNN archi-
tectures, namely graph convolutions, graph kernel neural networks, graph pooling, and graph
transformers. Based on the different training objectives, we present three types of the most recent
advanced learning paradigms, namely supervised/semi-supervised learning on graphs, graph self-
supervised learning, and graph structure learning. Then, we provide several promising applications
to demonstrate the effectiveness of deep graph representation learning. Last but not least, we discuss
the future directions in deep graph representation learning that have potential opportunities.

Acknowledgments
The authors are grateful to the anonymous reviewers for critically reading the manuscript and for
giving important suggestions to improve their paper.
This paper is partially supported by the National Key Research and Development Program of
China with Grant No. 2018AAA0101902 and the National Natural Science Foundation of China
(NSFC Grant No. 62276002).

References
[1] Ash Mohammad Abbas. 2021. Social network analysis using deep learning: applications and schemes. Social Network
Analysis and Mining 11, 1 (2021), 1–21.
[2] Amr Ahmed, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and Alexander J Smola. 2013. Distributed
large-scale natural graph factorization. In Proceedings of the 22nd international conference on World Wide Web. 37–48.
[3] Imtiaz Ahmed, Travis Galoppo, Xia Hu, and Yu Ding. 2021. Graph regularized autoencoder and its application
in unsupervised anomaly detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 8 (2021),
4110–4124.
[4] Rami Al-Rfou, Bryan Perozzi, and Dustin Zelle. 2019. Ddgk: Learning graph representations for deep divergence
graph kernels. In The World Wide Web Conference. 37–48.
[5] Barakat AlBadani, Ronghua Shi, Jian Dong, Raeed Al-Sabri, and Oloulade Babatounde Moctard. 2022. Transformer-
Based Graph Convolutional Network for Sentiment Analysis. Applied Sciences 12, 3 (2022), 1316.
[6] Uri Alon and Eran Yahav. 2020. On the bottleneck of graph neural networks and its practical implications. arXiv
preprint arXiv:2006.05205 (2020).
[7] Han Altae-Tran, Bharath Ramsundar, Aneesh S Pappu, and Vijay Pande. 2017. Low data drug discovery with one-shot
learning. ACS central science 3, 4 (2017), 283–293.
[8] Brandon Anderson, Truong Son Hy, and Risi Kondor. 2019. Cormorant: Covariant Molecular Neural Networks. In
NeurIPS, Vol. 32. https://proceedings.neurips.cc/paper/2019/file/03573b32b2746e6e8ca98b9123f2249b-Paper.pdf
[9] Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein generative adversarial networks. In Interna-
tional conference on machine learning. PMLR, 214–223.
[10] Nurul A Asif, Yeahia Sarker, Ripon K Chakrabortty, Michael J Ryan, Md Hafiz Ahamed, Dip K Saha, Faisal R Badal,
Sajal K Das, Md Firoz Ali, Sumaya I Moyeen, et al. 2021. Graph neural network: A comprehensive review on
non-euclidean space. IEEE Access 9 (2021), 60588–60606.
[11] Rim Assouel, Mohamed Ahmed, Marwin H Segler, Amir Saffari, and Yoshua Bengio. 2018. Defactor: Differentiable
edge factorization-based probabilistic graph generation. arXiv preprint arXiv:1811.09766 (2018).


[12] Davide Bacciu, Federico Errica, Alessio Micheli, and Marco Podda. 2020. A gentle introduction to deep learning for
graphs. Neural Networks 129 (2020), 203–221.
[13] Lei Bai, Lina Yao, Can Li, Xianzhi Wang, and Can Wang. 2020. Adaptive graph convolutional recurrent network for
traffic forecasting. Advances in neural information processing systems 33 (2020), 17804–17815.
[14] Xiaomei Bai, Mengyang Wang, Ivan Lee, Zhuo Yang, Xiangjie Kong, and Feng Xia. 2019. Scientific paper recommen-
dation: A survey. Ieee Access 7 (2019), 9324–9339.
[15] Muhammet Balcilar, Pierre Héroux, Benoit Gauzere, Pascal Vasseur, Sébastien Adam, and Paul Honeine. 2021. Breaking
the limits of message passing graph neural networks. In International Conference on Machine Learning. PMLR, 599–608.
[16] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. 2017. Spectrally-normalized margin bounds for neural
networks. Advances in neural information processing systems 30 (2017).
[17] Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. 2019. Nearly-tight VC-dimension and
pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research 20, 1 (2019),
2285–2301.
[18] Simon Batzner, Albert Musaelian, Lixin Sun, Mario Geiger, Jonathan P. Mailoa, Mordechai Kornbluth, Nicola Molinari,
Tess E. Smidt, and Boris Kozinsky. 2021. E(3)-Equivariant Graph Neural Networks for Data-Efficient and Accurate
Interatomic Potentials. arXiv:2101.03164 [physics.comp-ph]
[19] Samy Benslimane, Jérôme Azé, Sandra Bringay, Maximilien Servajean, and Caroline Mollevi. 2022. A text and GNN
based controversy detection method on social media. World Wide Web (2022), 1–27.
[20] Rianne van den Berg, Thomas N Kipf, and Max Welling. 2017. Graph convolutional matrix completion. arXiv preprint
arXiv:1706.02263 (2017).
[21] Tian Bian, Xi Xiao, Tingyang Xu, Peilin Zhao, Wenbing Huang, Yu Rong, and Junzhou Huang. 2020. Rumor detection
on social media with bi-directional graph convolutional networks. In Proceedings of the AAAI conference on artificial
intelligence, Vol. 34. 549–556.
[22] Filippo Maria Bianchi, Daniele Grattarola, and Cesare Alippi. 2020. Spectral clustering with graph neural networks
for graph pooling. In International Conference on Machine Learning. PMLR, 874–883.
[23] Deyu Bo, Xiao Wang, Chuan Shi, and Huawei Shen. 2021. Beyond low-frequency information in graph convolutional
networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 3950–3957.
[24] Karsten M Borgwardt and Hans-Peter Kriegel. 2005. Shortest-path kernels on graphs. In Fifth IEEE international
conference on data mining (ICDM). IEEE, 8–pp.
[25] Giorgos Bouritsas, Fabrizio Frasca, Stefanos P Zafeiriou, and Michael Bronstein. 2022. Improving graph neural network
expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis and Machine Intelligence
(2022).
[26] Abdesselam Bouzerdoum and Tim R Pattison. 1993. Neural network for quadratic optimization with bound constraints.
IEEE transactions on neural networks 4, 2 (1993), 293–304.
[27] George EP Box and David A Pierce. 1970. Distribution of residual autocorrelations in autoregressive-integrated
moving average time series models. Journal of the American statistical Association 65, 332 (1970), 1509–1526.
[28] Johannes Brandstetter, Rob Hesselink, Elise van der Pol, Erik J Bekkers, and Max Welling. 2022. Geometric and Physical
Quantities improve E(3) Equivariant Message Passing. In ICLR. https://openreview.net/forum?id=_xwr8gOBeV1
[29] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2013. Spectral networks and locally connected
networks on graphs. arXiv preprint arXiv:1312.6203 (2013).
[30] Khac-Hoai Nam Bui, Jiho Cho, and Hongsuk Yi. 2021. Spatial-temporal graph neural network for traffic forecasting:
An overview and open research issues. Applied Intelligence (2021), 1–12.
[31] Deng Cai and Wai Lam. 2020. Graph transformer for graph-to-sequence learning. In Proceedings of the AAAI Conference
on Artificial Intelligence, Vol. 34. 7464–7471.
[32] Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. 2018. A comprehensive survey of graph embedding:
Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering 30, 9 (2018), 1616–1637.
[33] David Camacho, Ángel Panizo-LLedot, Gema Bello-Orgaz, Antonio Gonzalez-Pardo, and Erik Cambria. 2020. The
four dimensions of social network analysis: An overview of research methods, applications, and software tools.
Information Fusion 63 (2020), 88–120.
[34] Defu Cao, Jiachen Li, Hengbo Ma, and Masayoshi Tomizuka. 2021. Spectral temporal graph neural network for
trajectory prediction. In 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 1839–1845.
[35] Jinzhou Cao, Qingquan Li, Wei Tu, Qili Gao, Rui Cao, and Chen Zhong. 2021. Resolving urban mobility networks
from individual travel graphs using massive-scale mobile phone tracking data. Cities 110 (2021), 103077.
[36] Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2015. Grarep: Learning graph representations with global structural
information. In Proceedings of the 24th ACM international on conference on information and knowledge management.
891–900.

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


68 W. Ju, et al.

[37] Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2016. Deep neural networks for learning graph representations. In
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30.
[38] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020.
End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow,
UK, August 23–28, 2020, Proceedings, Part I 16. Springer, 213–229.
[39] Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. In 2017 ieee
symposium on security and privacy (sp). Ieee, 39–57.
[40] Ines Chami, Sami Abu-El-Haija, Bryan Perozzi, Christopher Ré, and Kevin Murphy. 2022. Machine learning on graphs:
A model and comprehensive taxonomy. Journal of Machine Learning Research 23, 89 (2022), 1–64.
[41] Jatin Chauhan, Deepak Nathani, and Manohar Kaul. 2020. Few-shot learning on graphs via super-classes based on
graph spectral measures. arXiv preprint arXiv:2002.12815 (2020).
[42] Bo Chen, Jing Zhang, Jie Tang, Lingfan Cai, Zhaoyu Wang, Shu Zhao, Hong Chen, and Cuiping Li. 2020. Conna:
Addressing name disambiguation on the fly. IEEE Transactions on Knowledge and Data Engineering (2020).
[43] Chaoqi Chen, Yushuang Wu, Qiyuan Dai, Hong-Yu Zhou, Mutian Xu, Sibei Yang, Xiaoguang Han, and Yizhou Yu. 2022.
A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective.
arXiv preprint arXiv:2209.13232 (2022).
[44] Dexiong Chen, Laurent Jacob, and Julien Mairal. 2020. Convolutional Kernel Networks for Graph-Structured Data.
arXiv preprint arXiv:2003.05189 (2020).
[45] Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. 2020. Measuring and relieving the over-smoothing
problem for graph neural networks from the topological view. In Proceedings of the AAAI conference on artificial
intelligence, Vol. 34. 3438–3445.
[46] Dexiong Chen, Leslie O’Bray, and Karsten Borgwardt. 2022. Structure-aware transformer for graph representation
learning. In International Conference on Machine Learning. PMLR, 3469–3489.
[47] Fenxiao Chen, Yun-Cheng Wang, Bin Wang, and C-C Jay Kuo. 2020. Graph representation learning: a survey. APSIPA
Transactions on Signal and Information Processing 9 (2020), e15.
[48] Guimin Chen, Yuanhe Tian, and Yan Song. 2020. Joint aspect extraction and sentiment analysis with directional graph
convolutional networks. In Proceedings of the 28th international conference on computational linguistics. 272–279.
[49] Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, and Ivan Laptev. 2021. History aware multimodal transformer for
vision-and-language navigation. Advances in Neural Information Processing Systems 34 (2021), 5834–5847.
[50] Ting Chen, Song Bian, and Yizhou Sun. 2019. Are powerful graph neural nets necessary? a dissection on graph
classification. arXiv preprint arXiv:1905.04579 (2019).
[51] Xiaokang Chen, Yuhui Yuan, Gang Zeng, and Jingdong Wang. 2021. Semi-supervised semantic segmentation with
cross pseudo supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
2613–2622.
[52] Yu Chen, Lingfei Wu, and Mohammed Zaki. 2020. Iterative deep graph learning for graph neural networks: Better
and robust node embeddings. Advances in neural information processing systems 33 (2020), 19314–19326.
[53] Zekai Chen, Dingshuo Chen, Xiao Zhang, Zixuan Yuan, and Xiuzhen Cheng. 2021. Learning graph structures with
transformer for multivariate time series anomaly detection in iot. IEEE Internet of Things Journal (2021).
[54] Zhiqian Chen, Fanglan Chen, Lei Zhang, Taoran Ji, Kaiqun Fu, Liang Zhao, Feng Chen, Lingfei Wu, Charu Aggarwal,
and Chang-Tien Lu. 2020. Bridging the gap between spatial and spectral domains: A survey on graph neural networks.
arXiv preprint arXiv:2002.11867 (2020).
[55] Zhengdao Chen, Lei Chen, Soledad Villar, and Joan Bruna. 2020. Can graph neural networks count substructures?
Advances in neural information processing systems 33 (2020), 10383–10395.
[56] Haeran Cho and Yi Yu. 2018. Link prediction for interdisciplinary collaboration via co-authorship network. Social
Network Analysis and Mining 8, 1 (2018), 1–12.
[57] Jeongwhan Choi, Hwangyong Choi, Jeehyun Hwang, and Noseong Park. 2022. Graph neural controlled differential
equations for traffic forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 6367–6374.
[58] Alexandra Chouldechova and Aaron Roth. 2018. The frontiers of fairness in machine learning. arXiv preprint
arXiv:1810.08810 (2018).
[59] Pham Minh Chuan, Le Hoang Son, Mumtaz Ali, Tran Dinh Khang, Le Thanh Huong, and Nilanjan Dey. 2018. Link
prediction in co-authorship networks based on hybrid content similarity metric. Applied Intelligence 48, 8 (2018),
2470–2486.
[60] Fan RK Chung. 1997. Spectral graph theory. Vol. 92. American Mathematical Soc.
[61] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated
recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
[62] Connor W Coley, Regina Barzilay, William H Green, Tommi S Jaakkola, and Klavs F Jensen. 2017. Convolutional
embedding of attributed molecular graphs for physical property prediction. Journal of chemical information and

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


A Comprehensive Survey on Deep Graph Representation Learning 69

modeling 57, 8 (2017), 1757–1772.


[63] Jerome T Connor, R Douglas Martin, and Les E Atlas. 1994. Recurrent neural networks and robust time series
prediction. IEEE transactions on neural networks 5, 2 (1994), 240–254.
[64] Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, and Petar Veličković. 2020. Principal neighbourhood
aggregation for graph nets. Advances in Neural Information Processing Systems 33 (2020), 13260–13271.
[65] Nicola De Cao and Thomas Kipf. 2018. MolGAN: An implicit generative model for small molecular graphs. arXiv
preprint arXiv:1805.11973 (2018).
[66] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with
fast localized spectral filtering. Advances in neural information processing systems 29 (2016).
[67] John S Delaney. 2004. ESOL: estimating aqueous solubility directly from molecular structure. Journal of chemical
information and computer sciences 44, 3 (2004), 1000–1005.
[68] Ailin Deng and Bryan Hooi. 2021. Graph neural network-based anomaly detection in multivariate time series. In
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4027–4035.
[69] Leyan Deng, Defu Lian, Zhenya Huang, and Enhong Chen. 2022. Graph convolutional adversarial networks for
spatiotemporal anomaly detection. IEEE Transactions on Neural Networks and Learning Systems 33, 6 (2022), 2416–2428.
[70] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[71] Giandomenico Di Domenico, Jason Sit, Alessio Ishizaka, and Daniel Nunan. 2021. Fake news, social media and
marketing: A systematic review. Journal of Business Research 124 (2021), 329–341.
[72] George Dimitrakopoulos and Panagiotis Demestichas. 2010. Intelligent transportation systems. IEEE Vehicular
Technology Magazine 5, 1 (2010), 77–84.
[73] Pedro Domingos and Matt Richardson. 2001. Mining the network value of customers. In Proceedings of the seventh
ACM SIGKDD international conference on Knowledge discovery and data mining. 57–66.
[74] Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable representation learning for
heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and
data mining. 135–144.
[75] Yushun Dong, Jian Kang, Hanghang Tong, and Jundong Li. 2021. Individual fairness for graph neural networks: A
ranking based approach. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining.
300–310.
[76] Jinlong Du, Senzhang Wang, Hao Miao, and Jiaqiang Zhang. 2021. Multi-Channel Pooling Graph Neural Networks.
In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, Zhi-Hua Zhou (Ed.).
International Joint Conferences on Artificial Intelligence Organization, 1442–1448. https://doi.org/10.24963/ijcai.
2021/199 Main Track.
[77] Simon S Du, Kangcheng Hou, Russ R Salakhutdinov, Barnabas Poczos, Ruosong Wang, and Keyulu Xu. 2019. Graph
neural tangent kernel: Fusing graph neural networks with graph kernels. In Advances in Neural Information Processing
Systems. 5723–5733.
[78] Yuanqi Du, Tianfan Fu, Jimeng Sun, and Shengchao Liu. 2022. MolGenSurvey: A Systematic Survey in Machine
Learning Models for Molecule Design. arXiv preprint arXiv:2203.14500 (2022).
[79] Yuanqi Du, Xiaojie Guo, Amarda Shehu, and Liang Zhao. 2022. Interpretable molecular graph generation via
monotonic constraints. In Proceedings of the 2022 SIAM International Conference on Data Mining (SDM). SIAM, 73–81.
[80] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik,
and Ryan P Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. Advances in neural
information processing systems 28 (2015).
[81] Vijay Prakash Dwivedi and Xavier Bresson. 2020. A generalization of transformer networks to graphs. arXiv preprint
arXiv:2012.09699 (2020).
[82] Jerome Eberhardt, Diogo Santos-Martins, Andreas F Tillack, and Stefano Forli. 2021. AutoDock Vina 1.2. 0: New
docking methods, expanded force field, and python bindings. Journal of Chemical Information and Modeling 61, 8
(2021), 3891–3898.
[83] Pantelis Elinas, Edwin V Bonilla, and Louis Tiao. 2020. Variational inference for graph convolutional networks in
the absence of graph data and adversarial settings. Advances in Neural Information Processing Systems 33 (2020),
18648–18660.
[84] Gamaleldin F Elsayed, Ian Goodfellow, and Jascha Sohl-Dickstein. 2018. Adversarial reprogramming of neural
networks. arXiv preprint arXiv:1806.11146 (2018).
[85] Claire Empel and Rene M Koenigs. 2019. Artificial-Intelligence-Driven Organic Synthesis—En Route towards
Autonomous Synthesis? Angewandte Chemie International Edition 58, 48 (2019), 17114–17116.
[86] Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. 2019. A fair comparison of graph neural networks
for graph classification. arXiv preprint arXiv:1912.09893 (2019).

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


70 W. Ju, et al.

[87] Ziwei Fan, Zhiwei Liu, Jiawei Zhang, Yun Xiong, Lei Zheng, and Philip S Yu. 2021. Continuous-time sequential
recommendation with temporal graph collaborative transformer. In Proceedings of the 30th ACM International
Conference on Information & Knowledge Management. 433–442.
[88] Yin Fang, Qiang Zhang, Haihong Yang, Xiang Zhuang, Shumin Deng, Wen Zhang, Ming Qin, Zhuo Chen, Xiaohui
Fan, and Huajun Chen. 2022. Molecular contrastive learning with chemical element knowledge graph. In Proceedings
of the AAAI Conference on Artificial Intelligence, Vol. 36. 3968–3976.
[89] Zheng Fang, Qingqing Long, Guojie Song, and Kunqing Xie. 2021. Spatial-temporal graph ode networks for traffic
flow forecasting. In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. 364–373.
[90] Evan N Feinberg, Debnil Sur, Zhenqin Wu, Brooke E Husic, Huanghao Mai, Yang Li, Saisai Sun, Jianyi Yang, Bharath
Ramsundar, and Vijay S Pande. 2018. PotentialNet for molecular property prediction. ACS central science 4, 11 (2018),
1520–1530.
[91] Shangbin Feng, Herun Wan, Ningnan Wang, and Minnan Luo. 2021. BotRGCN: Twitter bot detection with relational
graph convolutional networks. In Proceedings of the 2021 IEEE/ACM International Conference on Advances in Social
Networks Analysis and Mining. 236–239.
[92] Wenzheng Feng, Jie Zhang, Yuxiao Dong, Yu Han, Huanbo Luan, Qian Xu, Qiang Yang, Evgeny Kharlamov, and Jie
Tang. 2020. Graph random neural networks for semi-supervised learning on graphs. (2020), 22092–22103.
[93] Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, and Yue Gao. 2019. Hypergraph neural networks. In Proceedings
of the AAAI conference on artificial intelligence, Vol. 33. 3558–3565.
[94] Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep
networks. In International conference on machine learning. PMLR, 1126–1135.
[95] Marc Finzi, Samuel Stanton, Pavel Izmailov, and Andrew Gordon Wilson. 2020. Generalizing convolutional neural
networks for equivariance to lie groups on arbitrary continuous data. In ICML.
[96] Emil Fischer. 1894. Einfluss der Configuration auf die Wirkung der Enzyme. Berichte der deutschen chemischen
Gesellschaft 27, 3 (1894), 2985–2993.
[97] Luca Franceschi, Mathias Niepert, Massimiliano Pontil, and Xiao He. 2019. Learning discrete structures for graph
neural networks. In International conference on machine learning. PMLR, 1972–1982.
[98] Paul G Francoeur, Tomohide Masuda, Jocelyn Sunseri, Andrew Jia, Richard B Iovanisci, Ian Snyder, and David R Koes.
2020. Three-dimensional convolutional neural networks and a cross-docked data set for structure-based drug design.
Journal of Chemical Information and Modeling 60, 9 (2020), 4200–4215.
[99] MJ ea Frisch, GW Trucks, HB Schlegel, GE Scuseria, MA Robb, JR Cheeseman, G Scalmani, VPGA Barone, GA
Petersson, HJRA Nakatsuji, et al. 2016. Gaussian 16.
[100] Guoji Fu, Peilin Zhao, and Yatao Bian. 2022. 𝑝-Laplacian Based Graph Neural Networks. In International Conference
on Machine Learning. PMLR, 6878–6917.
[101] Sichao Fu and Weifeng Liu. 2021. Recent Advances of Manifold-based Graph Convolutional Networks for Remote
Sensing Images Recognition. Generalization With Deep Learning: For Improvement On Sensing Capability (2021),
209–232.
[102] Fabian Fuchs, Daniel Worrall, Volker Fischer, and Max Welling. 2020. SE(3)-Transformers: 3D Roto-
Translation Equivariant Attention Networks. In NeurIPS, Vol. 33. https://proceedings.neurips.cc/paper/2020/file/
15231a7ce4ba789d13b722cc5c955834-Paper.pdf
[103] Octavian-Eugen Ganea, Xinyuan Huang, Charlotte Bunne, Yatao Bian, Regina Barzilay, Tommi Jaakkola, and An-
dreas Krause. 2021. Independent se (3)-equivariant models for end-to-end rigid protein docking. arXiv preprint
arXiv:2111.07786 (2021).
[104] Hongyang Gao and Shuiwang Ji. 2019. Graph u-nets. In international conference on machine learning. PMLR, 2083–2092.
[105] Hongyang Gao, Yi Liu, and Shuiwang Ji. 2021. Topology-aware graph pooling networks. IEEE Transactions on Pattern
Analysis and Machine Intelligence 43, 12 (2021), 4512–4518.
[106] Xiang Gao, Wei Hu, and Zongming Guo. 2020. Exploring structure-adaptive graph learning for robust semi-supervised
classification. In 2020 ieee international conference on multimedia and expo (icme). IEEE, 1–6.
[107] Thomas Gärtner, Peter Flach, and Stefan Wrobel. 2003. On graph kernels: Hardness results and efficient alternatives.
In Proceedings of Computational Learning theory and kernel machines. 129–143.
[108] Johannes Gasteiger, Stefan Weißenberger, and Stephan Günnemann. 2019. Diffusion improves graph learning.
Advances in neural information processing systems 32 (2019).
[109] Thomas Gaudelet, Ben Day, Arian R Jamasb, Jyothish Soman, Cristian Regep, Gertrude Liu, Jeremy BR Hayter,
Richard Vickers, Charles Roberts, Jian Tang, et al. 2021. Utilizing graph machine learning within drug discovery and
development. Briefings in bioinformatics 22, 6 (2021), bbab159.
[110] Kaitlyn M Gayvert, Neel S Madhukar, and Olivier Elemento. 2016. A data-driven approach to predicting successes
and failures of clinical trials. Cell chemical biology 23, 10 (2016), 1294–1301.

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


A Comprehensive Survey on Deep Graph Representation Learning 71

[111] Niklas Gebauer, Michael Gastegger, and Kristof Schütt. 2019. Symmetry-adapted generation of 3d point sets for the
targeted discovery of molecules. Advances in neural information processing systems 32 (2019).
[112] Niklas WA Gebauer, Michael Gastegger, Stefaan SP Hessmann, Klaus-Robert Müller, and Kristof T Schütt. 2022.
Inverse design of 3d molecular structures with conditional generative neural networks. Nature communications 13, 1
(2022), 1–11.
[113] Mario Geiger and Tess Smidt. 2022. e3nn: Euclidean neural networks. arXiv preprint arXiv:2207.09453 (2022).
[114] Simon Geisler, Tobias Schmidt, Hakan Şirin, Daniel Zügner, Aleksandar Bojchevski, and Stephan Günnemann. 2021.
Robustness of graph neural networks at scale. Advances in Neural Information Processing Systems 34 (2021), 7637–7649.
[115] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural message passing
for quantum chemistry. In International conference on machine learning. PMLR, 1263–1272.
[116] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-
Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik.
2018. Automatic chemical design using a data-driven continuous representation of molecules. ACS central science 4, 2
(2018), 268–276.
[117] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. 2020. Generative adversarial networks. Commun. ACM 63, 11 (2020), 139–144.
[118] Bryce Goodman and Seth Flaxman. 2017. European Union regulations on algorithmic decision-making and a “right
to explanation”. AI magazine 38, 3 (2017), 50–57.
[119] Palash Goyal and Emilio Ferrara. 2018. Graph embedding techniques, applications, and performance: A survey.
Knowledge-Based Systems 151 (2018), 78–94.
[120] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch,
Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent-a new
approach to self-supervised learning. Advances in neural information processing systems 33 (2020), 21271–21284.
[121] Philipp Grohs and Felix Voigtlaender. 2021. Proof of the theory-to-practice gap in deep learning via sampling
complexity bounds for neural network approximation spaces. arXiv preprint arXiv:2104.02746 (2021).
[122] Colin R Groom, Ian J Bruno, Matthew P Lightfoot, and Suzanna C Ward. 2016. The Cambridge structural database.
Acta Crystallographica Section B: Structural Science, Crystal Engineering and Materials 72, 2 (2016), 171–179.
[123] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd
ACM SIGKDD international conference on Knowledge discovery and data mining. 855–864.
[124] Stephan Günnemann. 2022. Graph neural networks: Adversarial robustness. In Graph Neural Networks: Foundations,
Frontiers, and Applications. Springer, 149–176.
[125] Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. 2019. Attention based spatial-temporal graph
convolutional networks for traffic flow forecasting. In Proceedings of the AAAI conference on artificial intelligence,
Vol. 33. 922–929.
[126] Zhichun Guo, Chuxu Zhang, Wenhao Yu, John Herr, Olaf Wiest, Meng Jiang, and Nitesh V Chawla. 2021. Few-shot
graph learning for molecular property prediction. In Proceedings of the Web Conference 2021. 2559–2567.
[127] Sardar Hamidian and Mona T Diab. 2019. Rumor detection and classification for twitter data. arXiv preprint
arXiv:1912.08926 (2019).
[128] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. Advances in
neural information processing systems 30 (2017).
[129] David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. 2011. Wavelets on graphs via spectral graph theory.
Applied and Computational Harmonic Analysis 30, 2 (2011), 129–150.
[130] Jiaqi Han, Yu Rong, Tingyang Xu, and Wenbing Huang. 2022. Geometrically equivariant graph neural networks: A
survey. arXiv preprint arXiv:2202.07230 (2022).
[131] Zhongkai Hao, Chengqiang Lu, Zhenya Huang, Hao Wang, Zheyuan Hu, Qi Liu, Enhong Chen, and Cheekong Lee.
2020. ASGN: An active semi-supervised graph neural network for molecular property prediction. In Proceedings of
the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 731–752.
[132] Nick Harvey, Christopher Liaw, and Abbas Mehrabian. 2017. Nearly-tight VC-dimension bounds for piecewise linear
neural networks. In Conference on learning theory. PMLR, 1064–1068.
[133] Dailan He, Yusheng Zhao, Junyu Luo, Tianrui Hui, Shaofei Huang, Aixi Zhang, and Si Liu. 2021. TransRefer3D:
Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding. In Proceedings of the 29th ACM
International Conference on Multimedia. 2344–2352.
[134] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. Lightgcn: Simplifying
and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR
conference on research and development in Information Retrieval. 639–648.
[135] Yixuan He, Michael Permultter, Gesine Reinert, and Mihai Cucuringu. 2022. MSGNN: A Spectral Graph Neural
Network Based on a Novel Magnetic Signed Laplacian. arXiv preprint arXiv:2209.00546 (2022).

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


72 W. Ju, et al.

[136] Marti A. Hearst, Susan T Dumais, Edgar Osuna, John Platt, and Bernhard Scholkopf. 1998. Support vector machines.
IEEE Intelligent Systems and their applications 13, 4 (1998), 18–28.
[137] Mikael Henaff, Joan Bruna, and Yann LeCun. 2015. Deep convolutional networks on graph-structured data. arXiv
preprint arXiv:1506.05163 (2015).
[138] Dan Hendrycks and Kevin Gimpel. 2016. A baseline for detecting misclassified and out-of-distribution examples in
neural networks. arXiv preprint arXiv:1610.02136 (2016).
[139] Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks.
science 313, 5786 (2006), 504–507.
[140] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural
Information Processing Systems 33 (2020), 6840–6851.
[141] Jake M Hofman, Duncan J Watts, Susan Athey, Filiz Garip, Thomas L Griffiths, Jon Kleinberg, Helen Margetts, Sendhil
Mullainathan, Matthew J Salganik, Simine Vazire, et al. 2021. Integrating explanation and prediction in computational
social science. Nature 595, 7866 (2021), 181–188.
[142] Emiel Hoogeboom, Vıctor Garcia Satorras, Clément Vignac, and Max Welling. 2022. Equivariant diffusion for molecule
generation in 3d. In International Conference on Machine Learning. PMLR, 8867–8887.
[143] Fenyu Hu, Yanqiao Zhu, Shu Wu, Weiran Huang, Liang Wang, and Tieniu Tan. 2020. Graphair: Graph representation
learning with neighborhood aggregation and interaction. Pattern Recognition 112 (2020), 107745.
[144] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec.
2020. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing
systems 33 (2020), 22118–22133.
[145] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. 2019. Strategies
for pre-training graph neural networks. arXiv preprint arXiv:1905.12265 (2019).
[146] Ziniu Hu, Yuxiao Dong, Kuansan Wang, Kai-Wei Chang, and Yizhou Sun. 2020. Gpt-gnn: Generative pre-training of
graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining. 1857–1867.
[147] Ziniu Hu, Yuxiao Dong, Kuansan Wang, and Yizhou Sun. 2020. Heterogeneous graph transformer. In Proceedings of
the web conference 2020. 2704–2710.
[148] Zhiting Hu and Li Erran Li. 2021. A causal lens for controllable text generation. Advances in Neural Information
Processing Systems 34 (2021), 24941–24955.
[149] Jing Huang and Jie Yang. 2021. Unignn: a unified framework for graph and hypergraph neural networks. arXiv
preprint arXiv:2105.00956 (2021).
[150] Rongzhou Huang, Chuyin Huang, Yubao Liu, Genan Dai, and Weiyang Kong. 2020. LSGCN: Long Short-Term Traffic
Prediction with Graph Convolutional Networks.. In IJCAI, Vol. 7. 2355–2361.
[151] Sheng Huang, Mohamed Elhoseiny, Ahmed Elgammal, and Dan Yang. 2015. Learning hypergraph-regularized
attribute predictors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 409–417.
[152] Wenbing Huang, Jiaqi Han, Yu Rong, Tingyang Xu, Fuchun Sun, and Junzhou Huang. 2022. Constrained Graph
Mechanics Networks. In ICLR. https://openreview.net/forum?id=SHbhHHfePhP
[153] Md Shamim Hussain, Mohammed J Zaki, and Dharmashankar Subramanian. 2021. Edge-augmented graph transform-
ers: Global self-attention is enough for graphs. arXiv preprint arXiv:2108.03348 (2021).
[154] Michael J Hutchinson, Charline Le Lan, Sheheryar Zaidi, Emilien Dupont, Yee Whye Teh, and Hyunjik Kim. 2021.
Lietransformer: equivariant self-attention for lie groups. In ICML.
[155] John J Irwin, Teague Sterling, Michael M Mysinger, Erin S Bolstad, and Ryan G Coleman. 2012. ZINC: a free tool to
discover chemistry for biology. Journal of chemical information and modeling 52, 7 (2012), 1757–1768.
[156] Shoichi Ishida, Kei Terayama, Ryosuke Kojima, Kiyosei Takasu, and Yasushi Okuno. 2022. Ai-driven synthetic route
design incorporated with retrosynthesis knowledge. Journal of chemical information and modeling 62, 6 (2022),
1357–1367.
[157] Md Ashraful Islam, Mir Mahathir Mohammad, Sarkar Snigdha Sarathi Das, and Mohammed Eunus Ali. 2022. A
survey on deep learning based Point-of-Interest (POI) recommendations. Neurocomputing 472 (2022), 306–325.
[158] Shuyi Ji, Yifan Feng, Rongrong Ji, Xibin Zhao, Wanwan Tang, and Yue Gao. 2020. Dual channel hypergraph
collaborative filtering. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining. 2020–2029.
[159] Bo Jiang, Ziyan Zhang, Doudou Lin, Jin Tang, and Bin Luo. 2019. Semi-supervised learning with graph learning-
convolutional networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 11313–
11320.
[160] Weiwei Jiang and Jiayun Luo. 2022. Graph neural network for traffic forecasting: A survey. Expert Systems with
Applications (2022), 117921.

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


A Comprehensive Survey on Deep Graph Representation Learning 73

[161] Zhuoren Jiang, Yue Yin, Liangcai Gao, Yao Lu, and Xiaozhong Liu. 2018. Cross-language citation recommendation
via hierarchical representation learning on heterogeneous graph. In The 41st International ACM SIGIR Conference on
Research & Development in Information Retrieval. 635–644.
[162] Yizhu Jiao, Yun Xiong, Jiawei Zhang, Yao Zhang, Tianqi Zhang, and Yangyong Zhu. 2020. Sub-graph contrast for
scalable self-supervised graph representation learning. In 2020 IEEE international conference on data mining (ICDM).
IEEE, 222–231.
[163] José Jiménez-Luna, Francesca Grisoni, and Gisbert Schneider. 2020. Drug discovery with explainable artificial
intelligence. Nature Machine Intelligence 2, 10 (2020), 573–584.
[164] Guangyin Jin, Zhexu Xi, Hengyu Sha, Yanghe Feng, and Jincai Huang. 2020. Deep multi-view spatiotemporal virtual
graph neural network for significant citywide ride-hailing demand prediction. arXiv preprint arXiv:2007.15189 (2020).
[165] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. 2018. Junction tree variational autoencoder for molecular graph
generation. In International conference on machine learning. PMLR, 2323–2332.
[166] Wei Jin, Tyler Derr, Haochen Liu, Yiqi Wang, Suhang Wang, Zitao Liu, and Jiliang Tang. 2020. Self-supervised learning
on graphs: Deep insights and new direction. arXiv preprint arXiv:2006.10141 (2020).
[167] Bowen Jing, Stephan Eismann, Patricia Suriana, Raphael JL Townshend, and Ron Dror. 2020. Learning from protein
structure with geometric vector perceptrons. arXiv preprint arXiv:2009.01411 (2020).
[168] Bowen Jing, Stephan Eismann, Patricia Suriana, Raphael John Lamarre Townshend, and Ron Dror. 2021. Learning
from Protein Structure with Geometric Vector Perceptrons. In ICLR. https://openreview.net/forum?id=1YLJDvSx6J4
[169] Wei Ju, Yiyang Gu, Xiao Luo, Yifan Wang, Haochen Yuan, Huasong Zhong, and Ming Zhang. 2023. Unsupervised
graph-level representation learning with hierarchical contrasts. Neural Networks 158 (2023), 359–368.
[170] Wei Ju, Zequn Liu, Yifang Qin, Bin Feng, Chen Wang, Zhihui Guo, Xiao Luo, and Ming Zhang. 2023. Few-shot
molecular property prediction via Hierarchically Structured Learning on Relation Graphs. Neural Networks (2023).
[171] Wei Ju, Xiao Luo, Zeyu Ma, Junwei Yang, Minghua Deng, and Ming Zhang. 2022. GHNN: Graph Harmonic Neural
Networks for semi-supervised graph-level classification. Neural Networks 151 (2022), 70–79.
[172] Wei Ju, Xiao Luo, Meng Qu, Yifan Wang, Chong Chen, Minghua Deng, Xian-Sheng Hua, and Ming Zhang. 2022.
TGNN: A Joint Semi-supervised Framework for Graph-level Classification. In Proceedings of the International Joint
Conference on Artificial Intelligence. 2122–2128.
[173] Wei Ju, Yifang Qin, Ziyue Qiao, Xiao Luo, Yifan Wang, Yanjie Fu, and Ming Zhang. 2022. Kernel-based Substructure
Exploration for Next POI Recommendation. arXiv preprint arXiv:2210.03969 (2022).
[174] Wei Ju, Junwei Yang, Meng Qu, Weiping Song, Jianhao Shen, and Ming Zhang. 2022. KGNN: Harnessing Kernel-based
Networks for Semi-supervised Graph Classification. In Proceedings of the Fifteenth ACM International Conference on
Web Search and Data Mining. 421–429.
[175] U Kang, Hanghang Tong, and Jimeng Sun. 2012. Fast random walk graph kernel. In Proceedings of the SIAM international
conference on data mining. SIAM, 828–838.
[176] Marek Karpinski and Angus Macintyre. 1997. Polynomial bounds for VC dimension of sigmoidal and general Pfaffian
neural networks. J. Comput. System Sci. 54, 1 (1997), 169–176.
[177] Hisashi Kashima, Koji Tsuda, and Akihiro Inokuchi. 2003. Marginalized kernels between labeled graphs. In Proceedings
of international conference on machine learning. 321–328.
[178] Mohammad Mehdi Keikha, Maseud Rahgozar, Masoud Asadpour, and Mohammad Faghih Abdollahi. 2020. Influence
maximization across heterogeneous interconnected networks based on deep learning. Expert Systems with Applications
140 (2020), 112905.
[179] Shima Khoshraftar and Aijun An. 2022. A Survey on Graph Representation Learning Methods. arXiv preprint
arXiv:2204.01855 (2022).
[180] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
[181] Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. 2018. Neural relational inference
for interacting systems. In International Conference on Machine Learning. PMLR, 2688–2697.
[182] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv
preprint arXiv:1609.02907 (2016).
[183] Thomas N Kipf and Max Welling. 2016. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016).
[184] Johannes Klicpera, Florian Becker, and Stephan Günnemann. 2021. GemNet: Universal Directional Graph Neural
Networks for Molecules. In NeurIPS. https://openreview.net/forum?id=HS_sOaxS9K-
[185] Johannes Klicpera, Janek Groß, and Stephan Günnemann. 2020. Directional Message Passing for Molecular Graphs.
In ICLR. https://openreview.net/forum?id=B1eWbxStPH
[186] Jonas Köhler, Leon Klein, and Frank Noe. 2020. Equivariant Flows: Exact Likelihood Generative Learning for
Symmetric Densities. In ICML. https://proceedings.mlr.press/v119/kohler20a.html
[187] Xiangjie Kong, Huizhen Jiang, Wei Wang, Teshome Megersa Bekele, Zhenzhen Xu, and Meng Wang. 2017. Exploring
dynamic research interest and academic influence for scientific collaborator recommendation. Scientometrics 113, 1

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


74 W. Ju, et al.

(2017), 369–385.
[188] Xiangjie Kong, Huizhen Jiang, Zhuo Yang, Zhenzhen Xu, Feng Xia, and Amr Tolba. 2016. Exploiting publication
contents and collaboration networks for collaborator recommendation. PloS one 11, 2 (2016), e0148492.
[189] Xiangjie Kong, Yajie Shi, Shuo Yu, Jiaying Liu, and Feng Xia. 2019. Academic social networks: Modeling, analysis,
mining and applications. Journal of Network and Computer Applications 132 (2019), 86–103.
[190] Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings
of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. 426–434.
[191] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems.
Computer 42, 8 (2009), 30–37.
[192] Mario Krenn, Florian Häse, AkshatKumar Nigam, Pascal Friederich, and Alan Aspuru-Guzik. 2020. Self-referencing
embedded strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology
1, 4 (2020), 045024.
[193] Devin Kreuzer, Dominique Beaini, Will Hamilton, Vincent Létourneau, and Prudencio Tossou. 2021. Rethinking graph
transformers with spectral attention. Advances in Neural Information Processing Systems 34 (2021), 21618–21629.
[194] Nils M Kriege, Fredrik D Johansson, and Christopher Morris. 2020. A survey on graph kernels. Applied Network
Science 5, 1 (2020), 1–42.
[195] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2017. Imagenet classification with deep convolutional neural
networks. Commun. ACM 60, 6 (2017), 84–90.
[196] Sanjay Kumar, Abhishek Mallik, Anavi Khetarpal, and BS Panda. 2022. Influence maximization in social networks
using graph embedding and graph neural network. Information Sciences 607 (2022), 1617–1636.
[197] Samuli Laine and Timo Aila. 2016. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242
(2016).
[198] Greg Landrum et al. 2013. RDKit: A software suite for cheminformatics, computational chemistry, and predictive
modeling. Greg Landrum (2013).
[199] Dongha Lee, Su Kim, Seonghyeon Lee, Chanyoung Park, and Hwanjo Yu. 2021. Learnable structural semantic readout
for graph classification. In 2021 IEEE International Conference on Data Mining (ICDM). IEEE, 1180–1185.
[200] Dong-Hyun Lee et al. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural
networks. In Workshop on challenges in representation learning, ICML, Vol. 3. 896.
[201] Junhyun Lee, Inyeop Lee, and Jaewoo Kang. 2019. Self-attention graph pooling. In International conference on machine
learning. PMLR, 3734–3743.
[202] Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. 2017. Deriving Neural Architectures from Sequence and
Graph Kernels. In International Conference on Machine Learning. 2024–2033.
[203] Ron Levie, Federico Monti, Xavier Bresson, and Michael M Bronstein. 2018. Cayleynets: Graph convolutional neural
networks with complex rational spectral filters. IEEE Transactions on Signal Processing 67, 1 (2018), 97–109.
[204] Angsheng Li and Yicheng Pan. 2016. Structural information and dynamical complexity of networks. IEEE Transactions
on Information Theory 62, 6 (2016), 3290–3339.
[205] Guohao Li, Chenxin Xiong, Ali Thabet, and Bernard Ghanem. 2020. Deepergcn: All you need to train deeper gcns.
arXiv preprint arXiv:2006.07739 (2020).
[206] Haoyang Li, Xin Wang, Ziwei Zhang, and Wenwu Zhu. 2022. Ood-gnn: Out-of-distribution generalized graph neural
network. IEEE Transactions on Knowledge and Data Engineering (2022).
[207] Haoyang Li, Xin Wang, Ziwei Zhang, and Wenwu Zhu. 2022. Out-of-distribution generalization on graphs: A survey.
arXiv preprint arXiv:2202.07987 (2022).
[208] Junying Li, Deng Cai, and Xiaofei He. 2017. Learning graph-level representation for drug discovery. arXiv preprint
arXiv:1709.03741 (2017).
[209] Jia Li, Yongfeng Huang, Heng Chang, and Yu Rong. 2022. Semi-Supervised Hierarchical Graph Classification. IEEE
Transactions on Pattern Analysis and Machine Intelligence (2022).
[210] Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural attentive session-based
recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 1419–1428.
[211] Jia Li, Yu Rong, Hong Cheng, Helen Meng, Wenbing Huang, and Junzhou Huang. 2019. Semi-supervised graph
classification: A hierarchical graph perspective. In Proceedings of the Web Conference. 972–982.
[212] Peibo Li, Yixing Yang, Maurice Pagnucco, and Yang Song. 2022. CoGNet: Cooperative Graph Neural Networks. In
Proceedings of the International Joint Conference on Neural Networks.
[213] Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-
supervised learning. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32.
[214] Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. 2018. Adaptive graph convolutional neural networks. In
Proceedings of the AAAI conference on artificial intelligence, Vol. 32.

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


A Comprehensive Survey on Deep Graph Representation Learning 75

[215] Shuangli Li, Jingbo Zhou, Tong Xu, Dejing Dou, and Hui Xiong. 2022. Geomgcl: geometric graph contrastive learning
for molecular property prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 4541–4549.
[216] Yibo Li, Jianxing Hu, Yanxing Wang, Jielong Zhou, Liangren Zhang, and Zhenming Liu. 2019. Deepscaffold: a
comprehensive tool for scaffold-based de novo drug discovery using deep learning. Journal of chemical information
and modeling 60, 1 (2019), 77–91.
[217] Yibo Li, Jianfeng Pei, and Luhua Lai. 2021. Learning to design drug-like molecules in three-dimensional space using
deep generative models. arXiv preprint arXiv:2104.08474 (2021).
[218] Yibo Li, Jianfeng Pei, and Luhua Lai. 2021. Structure-based de novo drug design using 3D deep generative models.
Chemical science 12, 41 (2021), 13664–13675.
[219] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2015. Gated graph sequence neural networks. arXiv
preprint arXiv:1511.05493 (2015).
[220] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2017. Diffusion convolutional recurrent neural network: Data-driven
traffic forecasting. arXiv preprint arXiv:1707.01926 (2017).
[221] Yujia Li, Richard Zemel, Marc Brockschmidt, and Daniel Tarlow. 2016. Gated Graph Sequence Neural Networks. In
Proceedings of ICLR’16.
[222] Ziyao Li, Liang Zhang, and Guojie Song. 2019. GCN-LASE: Towards Adequately Incorporating Link Attributes
in Graph Convolutional Networks. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial
Intelligence, IJCAI-19. 2959–2965.
[223] Yanyan Liang, Yanfeng Zhang, Dechao Gao, and Qian Xu. 2020. MxPool: Multiplex Pooling for Hierarchical Graph
Representation Learning. arXiv preprint arXiv:2004.06846 (2020).
[224] Renjie Liao, Zhizhen Zhao, Raquel Urtasun, and Richard Zemel. 2018. LanczosNet: Multi-Scale Deep Graph Convolu-
tional Networks. In International Conference on Learning Representations.
[225] Jongin Lim, Daeho Um, Hyung Jin Chang, Dae Ung Jo, and Jin Young Choi. 2021. Class-attentive diffusion network
for semi-supervised classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 8601–8609.
[226] Haitao Lin, Yufei Huang, Meng Liu, Xuanjing Li, Shuiwang Ji, and Stan Z Li. 2022. DiffBP: Generative Diffusion of 3D
Molecules for Target Protein Binding. arXiv preprint arXiv:2211.11214 (2022).
[227] Jiacheng Lin, Hanwen Xu, Addie Woicik, Jianzhu Ma, and Sheng Wang. 2022. Pisces: A cross-modal contrastive learning
approach to synergistic drug combination prediction. bioRxiv (2022). https://doi.org/10.1101/2022.11.21.517439
[228] Jiaying Liu, Feng Xia, Lei Wang, Bo Xu, Xiangjie Kong, Hanghang Tong, and Irwin King. 2019. Shifu2: A network
representation learning based model for advisor-advisee relationship mining. IEEE Transactions on Knowledge and
Data Engineering 33, 4 (2019), 1763–1777.
[229] Li Liu, William K Cheung, Xin Li, and Lejian Liao. 2016. Aligning Users across Social Networks Using Network
Embedding.. In Ijcai, Vol. 16. 1774–80.
[230] Meng Liu, Youzhi Luo, Kanji Uchino, Koji Maruhashi, and Shuiwang Ji. 2022. Generating 3D Molecules for Target
Protein Binding. arXiv preprint arXiv:2204.09410 (2022).
[231] Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander Gaunt. 2018. Constrained graph variational autoen-
coders for molecule design. Advances in neural information processing systems 31 (2018).
[232] Qiao Liu, Yifu Zeng, Refuoe Mokhosi, and Haibin Zhang. 2018. STAMP: short-term attention/memory priority model
for session-based recommendation. In Proceedings of the 24th ACM SIGKDD international conference on knowledge
discovery & data mining. 1831–1839.
[233] Yi Liu, Limei Wang, Meng Liu, Yuchao Lin, Xuan Zhang, Bora Oztekin, and Shuiwang Ji. 2022. Spherical Message
Passing for 3D Molecular Graphs. In ICLR. https://openreview.net/forum?id=givsRXsOt9r
[234] Ziqi Liu, Chaochao Chen, Xinxing Yang, Jun Zhou, Xiaolong Li, and Le Song. 2018. Heterogeneous graph neural
networks for malicious account detection. In Proceedings of the 27th ACM International Conference on Information and
Knowledge Management. 2077–2085.
[235] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin
transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International
Conference on Computer Vision. 10012–10022.
[236] Zheng Liu, Xing Xie, and Lei Chen. 2018. Context-aware academic collaborator recommendation. In Proceedings of
the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1870–1879.
[237] Qingqing Long, Yilun Jin, Yi Wu, and Guojie Song. 2021. Theoretically improving graph neural networks via
anonymous walk graph kernels. In Proceedings of the Web Conference 2021. 1204–1214.
[238] Qingqing Long, Lingjun Xu, Zheng Fang, and Guojie Song. 2021. HGK-GNN: Heterogeneous Graph Kernel based
Graph Neural Networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining.
1129–1138.
[239] Siyu Long, Yi Zhou, Xinyu Dai, and Hao Zhou. 2022. Zero-Shot 3D Drug Design by Sketching and Generating. arXiv
preprint arXiv:2209.13865 (2022).

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


76 W. Ju, et al.

[240] Dongsheng Luo, Wei Cheng, Wenchao Yu, Bo Zong, Jingchao Ni, Haifeng Chen, and Xiang Zhang. 2021. Learning to
drop: Robust graph neural network via topological denoising. In Proceedings of the 14th ACM international conference
on web search and data mining. 779–787.
[241] Shitong Luo, Jiaqi Guan, Jianzhu Ma, and Jian Peng. 2021. A 3D generative model for structure-based drug design.
Advances in Neural Information Processing Systems 34 (2021), 6229–6239.
[242] Xiao Luo, Wei Ju, Meng Qu, Chong Chen, Minghua Deng, Xian-Sheng Hua, and Ming Zhang. 2022. DualGraph:
Improving Semi-supervised Graph Classification via Dual Contrastive Learning. In Proceedings of the IEEE International
Conference on Data Engineering. 699–712.
[243] Xiao Luo, Wei Ju, Meng Qu, Yiyang Gu, Chong Chen, Minghua Deng, Xian-Sheng Hua, and Ming Zhang. 2022. Clear:
Cluster-enhanced contrast for self-supervised graph representation learning. IEEE Transactions on Neural Networks
and Learning Systems (2022).
[244] Xiao Luo, Daqing Wu, Zeyu Ma, Chong Chen, Minghua Deng, Jinwen Ma, Zhongming Jin, Jianqiang Huang, and
Xian-Sheng Hua. 2021. CIMON: Towards High-quality Hash Codes. In Proceedings of the International Joint Conference
on Artificial Intelligence.
[245] Youzhi Luo and Shuiwang Ji. 2022. An autoregressive flow model for 3d molecular geometry generation from scratch.
In International Conference on Learning Representations (ICLR).
[246] Youzhi Luo, Keqiang Yan, and Shuiwang Ji. 2021. Graphdf: A discrete flow model for molecular graph generation. In
International Conference on Machine Learning. PMLR, 7192–7203.
[247] Hehuan Ma, Yatao Bian, Yu Rong, Wenbing Huang, Tingyang Xu, Weiyang Xie, Geyan Ye, and Junzhou Huang. 2020.
Multi-view graph neural networks for molecular property prediction. arXiv preprint arXiv:2005.13607 (2020).
[248] Jiaqi Ma, Junwei Deng, and Qiaozhu Mei. 2021. Subgroup generalization and fairness of graph neural networks.
Advances in Neural Information Processing Systems 34 (2021), 1048–1061.
[249] Ning Ma, Jiajun Bu, Jieyu Yang, Zhen Zhang, Chengwei Yao, Zhi Yu, Sheng Zhou, and Xifeng Yan. 2020. Adaptive-step
graph meta-learner for few-shot graph classification. In Proceedings of the 29th ACM International Conference on
Information & Knowledge Management. 1055–1064.
[250] Yao Ma, Suhang Wang, Charu C Aggarwal, and Jiliang Tang. 2019. Graph convolutional networks with eigenpooling.
In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 723–731.
[251] Kaushalya Madhawa, Katushiko Ishiguro, Kosuke Nakago, and Motoki Abe. 2019. Graphnvp: An invertible flow
model for generating molecular graphs. arXiv preprint arXiv:1905.11600 (2019).
[252] Prashan Madumal, Tim Miller, Liz Sonenberg, and Frank Vetere. 2020. Explainable reinforcement learning through a
causal lens. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 2493–2500.
[253] Tong Man, Huawei Shen, Shenghua Liu, Xiaolong Jin, and Xueqi Cheng. 2016. Predict anchor links across social
networks via an embedding approach.. In Ijcai, Vol. 16. 1823–1829.
[254] Kamaran H Manguri, Rebaz N Ramadhan, and Pshko R Mohammed Amin. 2020. Twitter sentiment analysis on
worldwide COVID-19 outbreaks. Kurdistan Journal of Applied Research (2020), 54–65.
[255] Elman Mansimov, Omar Mahmood, Seokho Kang, and Kyunghyun Cho. 2019. Molecular geometry prediction using a
deep generative graph neural network. Scientific reports 9, 1 (2019), 1–13.
[256] Mohammad MansourLakouraj, Mukesh Gautam, Hanif Livani, and Mohammed Benidris. 2022. A multi-rate sampling
PMU-based event classification in active distribution grids with spectral graph neural network. Electric Power Systems
Research 211 (2022), 108145.
[257] Dionisis Margaris, Costas Vassilakis, and Dimitris Spiliotopoulos. 2019. Handling uncertainty in social media textual
information for improving venue recommendation formulation quality in social networks. Social Network Analysis
and Mining 9, 1 (2019), 1–19.
[258] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2021. A survey on bias and
fairness in machine learning. ACM Computing Surveys (CSUR) 54, 6 (2021), 1–35.
[259] Xuying Meng, Suhang Wang, Zhimin Liang, Di Yao, Jihua Zhou, and Yujun Zhang. 2021. Semi-supervised anomaly
detection in dynamic communication networks. Information Sciences 571 (2021), 527–542.
[260] Xuan-Yu Meng, Hong-Xing Zhang, Mihaly Mezei, and Meng Cui. 2011. Molecular docking: a powerful approach for
structure-based drug discovery. Current computer-aided drug design 7, 2 (2011), 146–157.
[261] Joshua Meyers, Benedek Fabian, and Nathan Brown. 2021. De novo molecular design and generative models. Drug
Discovery Today 26, 11 (2021), 2707–2715.
[262] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words
and phrases and their compositionality. Advances in neural information processing systems 26 (2013).
[263] Abduallah Mohamed, Kun Qian, Mohamed Elhoseiny, and Christian Claudel. 2020. Social-stgcnn: A social spatio-
temporal graph convolutional neural network for human trajectory prediction. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition. 14424–14432.

J. ACM, Vol. 1, No. 1, Article . Publication date: April 2023.


A Comprehensive Survey on Deep Graph Representation Learning 77
