A Comprehensive Survey On Deep Graph Representation Learning
WEI JU, ZHENG FANG, YIYANG GU, ZEQUN LIU, and QINGQING LONG, Peking University, China
ZIYUE QIAO, The Hong Kong University of Science and Technology, China
YIFANG QIN, JIANHAO SHEN, and FANG SUN, Peking University, China
ZHIPING XIAO, University of California, Los Angeles, USA
JUNWEI YANG, JINGYANG YUAN, and YUSHENG ZHAO, Peking University, China
XIAO LUO, University of California, Los Angeles, USA
MING ZHANG∗ , Peking University, China
Graph representation learning aims to effectively encode high-dimensional sparse graph-structured data into
low-dimensional dense vectors, which is a fundamental task that has been widely studied in a range of fields,
including machine learning and data mining. Classic graph embedding methods follow the basic idea that the
embedding vectors of interconnected nodes in the graph can still maintain a relatively close distance, thereby
preserving the structural information between the nodes in the graph. However, this is sub-optimal because: (i)
traditional methods have limited model capacity, which limits the learning performance; (ii) existing techniques
typically rely on unsupervised learning strategies and fail to couple with the latest learning paradigms; (iii)
representation learning and downstream tasks depend on each other and should be jointly enhanced.
With the remarkable success of deep learning, deep graph representation learning has shown great potential
and advantages over shallow (traditional) methods, and a large number of deep graph representation
learning techniques have been proposed in the past decade, especially graph neural networks. In this survey,
we conduct a comprehensive survey on current deep graph representation learning algorithms by proposing a
new taxonomy of existing state-of-the-art literature. Specifically, we systematically summarize the essential
components of graph representation learning and categorize existing approaches by the ways of graph neural
network architectures and the most recent advanced learning paradigms. Moreover, this survey also provides
the practical and promising applications of deep graph representation learning. Last but not least, we state
new perspectives and suggest challenging directions which deserve further investigations in the future.
∗ Corresponding author, and the other authors contributed equally to this research.
Authors’ addresses: Wei Ju, [email protected]; Zheng Fang, [email protected]; Yiyang Gu, [email protected];
Zequn Liu, [email protected]; Qingqing Long, [email protected], Peking University, Beijing, China, 100871;
Ziyue Qiao, [email protected], The Hong Kong University of Science and Technology, Guangzhou, China, 511453;
Yifang Qin, [email protected]; Jianhao Shen, [email protected]; Fang Sun, [email protected], Peking University,
Beijing, China, 100871; Zhiping Xiao, [email protected], University of California, Los Angeles, USA, 90095; Junwei
Yang, [email protected]; Jingyang Yuan, [email protected]; Yusheng Zhao, [email protected], Peking
University, Beijing, China, 100871; Xiao Luo, [email protected], University of California, Los Angeles, USA, 90095; Ming
Zhang, [email protected], Peking University, Beijing, China, 100871.
Additional Key Words and Phrases: Deep Learning on Graphs, Graph Representation Learning, Graph Neural
Network, Survey
ACM Reference Format:
Wei Ju, Zheng Fang, Yiyang Gu, Zequn Liu, Qingqing Long, Ziyue Qiao, Yifang Qin, Jianhao Shen, Fang Sun,
Zhiping Xiao, Junwei Yang, Jingyang Yuan, Yusheng Zhao, Xiao Luo, and Ming Zhang. 2023. A Comprehensive
Survey on Deep Graph Representation Learning. J. ACM 1, 1 (April 2023), 85 pages.
1 Introduction
Graphs have recently emerged as a powerful tool for representing a variety of structured and
complex data, including social networks, traffic networks, information systems, knowledge graphs,
protein-protein interaction networks, and physical interaction networks. As a kind of general form
of data organization, graph structures are capable of naturally expressing the intrinsic relationship
of these data, and thus can characterize plenty of non-Euclidean structures that are crucial in
a variety of disciplines and domains due to their flexible adaptability. For example, to encode a
social network as a graph, nodes on the graph are used to represent individual users, and edges are
used to represent the relationship between two individuals, such as friends. In the field of biology,
nodes can be used to represent proteins, and edges can be used to represent biological interactions
between various proteins, such as the dynamic interactions between proteins. Thus, by analyzing
and mining the graph-structured data, we can understand the deep meaning hidden behind the
data, and further discover valuable knowledge, so as to benefit society and human beings.
Over the last decade, a wide range of machine learning algorithms have been developed for
graph-structured data learning. Among them, traditional graph kernel methods [107, 177, 314, 316]
usually break down graphs into different atomic substructures and then use kernel functions
to measure the similarity between all pairs of them. Although graph kernels could provide a
perspective on modeling graph topology, these approaches often generate substructures or feature
representations based on given hand-crafted criteria. These rules are rather heuristic and prone to high computational complexity, and therefore suffer from weak scalability and subpar performance.
In the past few years, a growing number of graph embedding algorithms [2, 123, 276, 343, 344, 359] have emerged, which attempt to encode the structural information of the graph (usually a
high-dimensional sparse matrix) and map it into a low-dimensional dense vector embedding to
preserve the topology information and attribute information in the embedding space as much
as possible, so that the learned graph embeddings can be naturally integrated into traditional
machine learning algorithms. Compared to previous works which use feature engineering in the
pre-processing phase to extract graph structural features, current graph embedding algorithms are
conducted in a data-driven way leveraging machine learning algorithms (such as neural networks)
to encode the structural information of the graph. Specifically, existing graph embedding methods
can be categorized into the following main groups: (i) matrix factorization based methods [2, 36, 268]
that factorize the matrix to learn node embedding which preserves the graph property; (ii) deep
learning based methods [123, 276, 344, 359] that apply deep learning techniques specifically de-
signed for graph-structured data; (iii) edge reconstruction based methods [229, 253, 343] that either
maximize edge reconstruction probability or minimize edge reconstruction loss. Generally, these
methods typically depend on shallow architectures, and fail to exploit the potential and capacity of
deep neural networks, resulting in sub-optimal representation quality and learning performance.
Inspired by the recent remarkable success of deep neural networks, a range of deep learning
algorithms has been developed for graph-structured data learning. The core of these methods is to
generate effective node and graph representations using graph neural networks (GNNs), followed
by a goal-oriented learning paradigm. In this way, the derived representations can be adaptively
coupled with a variety of downstream tasks and applications. Following this line of thought, in
this paper, we propose a new taxonomy to classify the existing graph representation learning
algorithms, i.e., graph neural network architectures, learning paradigms, and various promising
applications, as shown in Fig. 1. Specifically, for the architectures of GNNs, we investigate the
studies on graph convolutions, graph kernel neural networks, graph pooling, and graph transformer.
For the learning paradigms, we explore three advanced types namely supervised/semi-supervised
learning on graphs, graph self-supervised learning, and graph structure learning. To demonstrate
the effectiveness of the learned graph representations, we provide several promising applications
to build tight connections between representation learning and downstream tasks, such as social
analysis, molecular property prediction and generation, recommender systems, and traffic analysis.
Last but not least, we present some perspectives for thought and suggest challenging directions
that deserve further study in the future.
Differences between this survey and existing ones. Up to now, there exist some other overview
papers focusing on different perspectives of graph representation learning [12, 40, 43, 47, 179, 387,
390, 446, 463, 464] that are closely related to ours. However, very few comprehensive reviews
have summarized deep graph representation learning simultaneously from the perspective of diverse
GNN architectures and the corresponding up-to-date learning paradigms. Therefore, we here clearly
state their distinctions from our survey as follows. There have been several surveys on classic graph
embedding [32, 119]; these works categorize graph embedding methods based on different training
objectives. Wang et al. [366] go further and provide a comprehensive review of
existing heterogeneous graph embedding approaches. With the rapid development of deep learning,
there are a handful of surveys along this line. For example, Wu et al. [387] and Zhang et al. [446]
mainly focus on several classical and representative GNN architectures without exploring deep
graph representation learning from a view of the most recent advanced learning paradigms such as
graph self-supervised learning and graph structure learning. Xia et al. [390] and Chami et al. [40]
jointly summarize the studies of graph embeddings and GNNs. Zhou et al. [463] explore different
types of computational modules for GNNs. One recent survey under review [179] categorizes the
existing works in graph representation learning from both static and dynamic graphs. However,
these taxonomies emphasize the basic GNN methods but pay insufficient attention to the learning
paradigms, and provide few discussions of the most promising applications, such as recommender
systems and molecular property prediction and generation. To the best of our knowledge, the most
relevant survey published formally is [464], which presents a review of GNN architectures and
roughly discusses the corresponding applications. Nevertheless, this survey merely covers methods
up to the year of 2020, missing the latest developments in the past two years.
Therefore, it is highly desired to summarize the representative GNN methods, the most recent
advanced learning paradigms, and promising applications into one unified and comprehensive
framework. Moreover, we strongly believe this survey with a new taxonomy of literature and more
than 400 studies will strengthen future research on deep graph representation learning.
Contribution of this survey. The goal of this survey is to systematically review the literature
on the advances of deep graph representation learning and discuss further directions. It aims
to help the researchers and practitioners who are interested in this area, and support them in
understanding the panorama and the latest developments of deep graph representation learning.
The key contributions of this survey are summarized as follows:
• Systematic Taxonomy. We propose a systematic taxonomy to organize the existing deep
graph representation learning approaches based on the ways of GNN architectures and the
most recent advanced learning paradigms via providing some representative branches of
methods (as shown in Fig. 1). Moreover, several promising applications are presented to illustrate the superiority and potential of graph representation learning.
[Fig. 1. Overview of the proposed taxonomy: graph data is encoded by graph neural network architectures (graph convolutions, graph kernel neural networks, graph pooling, graph transformer) coupled with learning paradigms (semi-supervised learning on graphs, graph self-supervised learning, graph structure learning), supporting graph-related applications (social analysis, molecular property prediction, molecular generation, recommender systems, traffic analysis) and future directions.]
• Comprehensive Review. For each branch of this survey, we review the essential compo-
nents and provide detailed descriptions of representative algorithms, and systematically
summarize the characteristics to make the overview comparison.
• Future Directions. Based on the properties of existing deep graph representation learning
algorithms, we discuss the limitations and challenges of current methods and propose the
potential as well as promising research directions deserving of future investigations.
2 Background
In this section, we first briefly introduce some definitions in deep graph representation learning
that need to be clarified, then we explain the reasons why we need graph representation learning.
Broadly, graph representation learning aims to learn a low-dimensional vector 𝑅𝑣 ∈ R^𝑑 for each node 𝑣 ∈ 𝑉, where the dimension 𝑑 of the vector is much smaller than the total number of nodes |𝑉| in the graph.
Spectral graph convolutions build upon the normalized graph Laplacian L = I − D^{−1/2} A D^{−1/2}, where D is the diagonal degree matrix (D_ii = Σ_j A_ij) and A is the adjacency matrix. In the typical setting of graph signal processing, the
graph 𝐺 is undirected. Therefore, L is real symmetric and positive semi-definite. This guarantees
the eigendecomposition of the graph Laplacian: L = UΛU^T, where the columns of U = [u_0, u_1, ..., u_{n−1}] are the
eigenvectors of the graph Laplacian and the diagonal elements of Λ = diag(λ_0, λ_1, ..., λ_{n−1}) are the
eigenvalues. With this, the Graph Fourier Transform (GFT) of a graph signal x is defined as x̃ = U𝑇 x,
where x̃ is the graph frequencies of x. Correspondingly, the Inverse Graph Fourier Transform can
be written as x = Ux̃.
With GFT and the Convolution Theorem, the graph convolution of a graph signal x and a filter
g can be defined as g ∗𝐺 x = U(U𝑇 g ⊙ U𝑇 x). To simplify this, let g𝜃 = diag(U𝑇 𝑔), the graph
convolution can be written as:
g ∗𝐺 x = Ug𝜃 U𝑇 x, (1)
which is the general form of most spectral graph convolutions. The key of spectral graph convolu-
tions is to parameterize and learn the filter g𝜃 .
Bruna et al. [29] propose Spectral Convolutional Neural Network (Spectral CNN) that sets graph
filter as a learnable diagonal matrix W. The convolution operation can be written as y = UWU𝑇 x.
In practice, multi-channel signals and activation functions are common, and the graph convolution
can be written as
Y_{:,j} = σ( U Σ_{i=1}^{c_in} W_{i,j} U^T X_{:,i} ),   j = 1, 2, ..., c_out,   (2)
where c_in is the number of input channels, c_out is the number of output channels, X is an n × c_in matrix representing the input signal, Y is an n × c_out matrix denoting the output signal, W_{i,j} is a parameterized diagonal matrix, and σ(·) is the activation function. For mathematical convenience, we sometimes use the single-channel version of graph convolutions and omit activation functions; the multi-channel versions are similar to Eq. 2.
Spectral CNN has several limitations. Firstly, the filters are basis-dependent, which means that
they cannot generalize across graphs. Secondly, the algorithm requires eigen decomposition, which
is computationally expensive. Thirdly, it has no guarantee of spatial localization of filters. To make
filters spatially localized, Henaff et al. [137] propose to use a smooth spectral transfer function F(Λ) to parameterize the filter, and the convolution operation can be written as:
y = U𝐹 (Λ)U𝑇 x. (3)
Chebyshev Spectral Convolutional Neural Network (ChebNet) [66] extends this idea by using
truncated Chebyshev polynomials to approximate the spectral transfer function. The Chebyshev
polynomial is defined as 𝑇0 (𝑥) = 1, 𝑇1 (𝑥) = 𝑥, 𝑇𝑘 (𝑥) = 2𝑥𝑇𝑘−1 (𝑥) − 𝑇𝑘−2 (𝑥), and the spectral
transfer function 𝐹 (Λ) is approximated to the order of 𝐾 − 1 as
F(Λ) = Σ_{k=0}^{K−1} θ_k T_k(Λ̃),   (4)
where the model parameters 𝜃 𝑘 , 𝑘 ∈ {0, 1, ..., 𝐾 − 1} are the Chebyshev coefficients, and Λ̃ =
2Λ/𝜆𝑚𝑎𝑥 − I is a diagonal matrix of scaled eigenvalues. Thus, the graph convolution can be written
as:
g ∗_G x = U F(Λ) U^T x = U Σ_{k=0}^{K−1} θ_k T_k(Λ̃) U^T x = Σ_{k=0}^{K−1} θ_k T_k(L̃) x,   (5)
where L̃ = 2L/𝜆𝑚𝑎𝑥 − I.
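To make the recursion concrete, the following NumPy sketch (illustrative only; function and variable names are our own) evaluates a K-term Chebyshev filter Σ_k θ_k T_k(L̃)x directly from the adjacency matrix, using the bound λ_max ≤ 2 of the normalized Laplacian so that no eigendecomposition is needed:

import numpy as np

def chebyshev_filter(A, x, theta):
    # Apply sum_k theta[k] * T_k(L_tilde) x, with L_tilde = 2 L / lambda_max - I (Eqs. 4-5).
    n = A.shape[0]
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    L = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # normalized Laplacian
    L_tilde = 2.0 * L / 2.0 - np.eye(n)                             # lambda_max upper-bounded by 2
    T_prev, T_curr = x, L_tilde @ x                                 # T_0 x and T_1 x
    out = theta[0] * T_prev + (theta[1] * T_curr if len(theta) > 1 else 0.0)
    for k in range(2, len(theta)):
        T_prev, T_curr = T_curr, 2.0 * L_tilde @ T_curr - T_prev    # Chebyshev recursion
        out = out + theta[k] * T_curr
    return out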
Graph Convolutional Network (GCN) [182] is proposed as the localized first-order approximation
of ChebNet. Assuming 𝐾 = 2 and 𝜆𝑚𝑎𝑥 = 2, Eq. 5 can be simplified as:
g ∗𝐺 x = 𝜃 0 x + 𝜃 1 (L − I)x = 𝜃 0 x − 𝜃 1 D−1/2 AD−1/2 x. (6)
To further constrain the number of parameters, we assume θ = θ_0 = −θ_1, which gives a simpler
form of graph convolution:
g ∗_G x = θ(I + D^{−1/2} A D^{−1/2})x.   (7)
As I + D−1/2 AD−1/2 now has the eigenvalues in the range of [0, 2] and repeatedly multiplying
this matrix can lead to numerical instabilities, GCN empirically proposes a renormalization trick to
solve this problem by using D̃^{−1/2} Ã D̃^{−1/2} instead, where Ã = A + I and D̃_ii = Σ_j Ã_ij.
Allowing multi-channel signals and adding activation functions, the following formula is more
commonly seen in literature:
Y = 𝜎 (( D̃−1/2 ÃD̃−1/2 )XΘ), (8)
where X, Y have the same shape as in Eq. 2 and Θ is a 𝑐𝑖𝑛 × 𝑐𝑜𝑢𝑡 matrix as model’s parameters.
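As a minimal NumPy sketch of Eq. 8 (not a reference implementation; the ReLU activation and dense matrices are simplifying assumptions), one GCN layer with the renormalization trick reads:

import numpy as np

def gcn_layer(A, X, Theta):
    # Y = ReLU(D_tilde^{-1/2} A_tilde D_tilde^{-1/2} X Theta), Eq. 8 with sigma = ReLU.
    n = A.shape[0]
    A_tilde = A + np.eye(n)                                  # renormalization trick: add self-loops
    d_inv_sqrt = A_tilde.sum(axis=1) ** -0.5
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ X @ Theta, 0.0)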
Apart from the aforementioned methods, other spectral graph convolutions have been proposed.
Levie et al. [203] propose CayleyNets that utilize Cayley Polynomials to equip the filters with
the ability to detect narrow frequency bands. Liao et al. [224] propose LanczosNets, which employ the Lanczos algorithm to construct a low-rank approximation of the graph Laplacian in order to improve the computational efficiency of graph convolutions. The proposed model is able to efficiently utilize the multi-scale information in the graph data. Instead of using the Graph Fourier Transform, Xu et al. [399] propose the Graph Wavelet Neural Network (GWNN), which uses the graph wavelet transform to avoid matrix eigendecomposition. Moreover, graph wavelets are sparse and localized, which provides
good interpretations for the convolution operation. Zhu et al. [466] derive a Simple Spectral Graph
Convolution (S2 GC) from a modified Markov Diffusion Kernel, which achieves a trade-off between
low-pass and high-pass filter bands.
In spatial graph convolutions, the output feature vector y_i of node i is computed from its input feature vector x_i, the feature vectors x_j of its neighbors j ∈ N(i), and the feature vector e_ij of the edge (or, more generally, the relationship) between node i and its neighbor j, where the neighborhood N(i) itself could be defined more generally.
In the previous subsection, we show the spectral interpretation of GCN [182]. The model also
has its spatial interpretation, which can be mathematically written as:
y_i = Θ^T Σ_{j ∈ N(i) ∪ {i}} ( 1 / √(d̂_i d̂_j) ) x_j,   (10)
where d̂_i and d̂_j are the i-th and j-th row sums of Ã in Eq. 8. For each node, the model takes a weighted sum of its neighbors' features as well as its own feature and applies a linear transformation to obtain the result. In practice, multiple GCN layers are often stacked together with non-linear functions after convolution to encode complex and hierarchical features. Nonetheless, Wu et al. [383] show that
the model still achieves competitive results without non-linearity.
Although GCN as well as other spectral graph convolutions achieve competitive results on a
number of benchmarks, these methods assume the presence of all nodes in the graph and fall in the
category of transductive learning. Hamilton et al. [128] propose GraphSAGE that performs graph
convolutions in inductive settings, when there are new nodes during inference (e.g. newcomers
in the social network). For each node, the model samples its 𝐾-hop neighbors and uses 𝐾 graph
convolutions to aggregate their features hierarchically. Furthermore, the use of sampling also
reduces the computation when a node has too many neighbors.
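A rough sketch of one inductive GraphSAGE-style step (uniform neighbor sampling followed by mean aggregation and a concatenate-then-transform update; the sample size and parameter shapes are illustrative assumptions, with W of shape (2d, d_out)):

import numpy as np

def sage_layer(adj_list, X, W, num_samples=10, seed=0):
    # For each node: sample neighbors, mean-aggregate their features,
    # concatenate with the node's own feature, then apply a linear map + ReLU.
    rng = np.random.default_rng(seed)
    out = np.zeros((X.shape[0], W.shape[1]))
    for v, neigh in enumerate(adj_list):
        if len(neigh) > 0:
            sampled = rng.choice(neigh, size=min(num_samples, len(neigh)), replace=False)
            h_neigh = X[sampled].mean(axis=0)
        else:
            h_neigh = np.zeros(X.shape[1])
        out[v] = np.maximum(np.concatenate([X[v], h_neigh]) @ W, 0.0)
    return out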
The attention mechanism has been successfully used in natural language processing [350], com-
puter vision [235] and multi-modal tasks [49, 133, 427, 455]. Graph Attention Networks (GAT) [351]
introduce the idea of attention to graphs. The attention mechanism uses an adaptive, feature-
dependent weight (i.e. attention coefficient) to aggregate a set of features, which can be mathemati-
cally written as:
α_{i,j} = exp( LeakyReLU( a^T [Θx_i ‖ Θx_j] ) ) / Σ_{k ∈ N(i)∪{i}} exp( LeakyReLU( a^T [Θx_i ‖ Θx_k] ) ),   (11)
where 𝛼𝑖,𝑗 is the attention coefficient, a and Θ are model parameters, and [·||·] means concatenation.
After the 𝛼s are obtained, the new features are computed as a weighted sum of input node features,
which is:
y_i = α_{i,i} Θx_i + Σ_{j ∈ N(i)} α_{i,j} Θx_j.   (12)
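The following single-head NumPy sketch (parameter names are illustrative) computes the attention coefficients of Eq. 11 and the weighted aggregation of Eq. 12 for one node i:

import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def gat_update(i, neighbors, X, Theta, a):
    # Attend over {i} and N(i) (Eq. 11), then take the attention-weighted sum (Eq. 12).
    nodes = [i] + list(neighbors)
    H = X @ Theta.T                                          # transformed features Theta x
    scores = np.array([leaky_relu(a @ np.concatenate([H[i], H[j]])) for j in nodes])
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                              # softmax over the neighborhood
    return sum(alpha[k] * H[j] for k, j in enumerate(nodes))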
Xu et al. [403] explore the representational limitations of graph neural networks. They discover that message passing networks like GCN [182] and GraphSAGE [128] are incapable of distinguishing certain graph structures. To improve the representational power of graph neural networks, they propose the Graph Isomorphism Network (GIN), which gives an adjustable weight to the central node feature and can be mathematically written as:
y_i = MLP( (1 + ε) x_i + Σ_{j ∈ N(i)} x_j ),   (13)
where 𝜖 is a learnable parameter.
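A compact sketch of the GIN update in Eq. 13, where the MLP is instantiated as a two-layer perceptron with illustrative weight matrices W1 and W2:

import numpy as np

def gin_layer(A, X, W1, W2, eps=0.0):
    # y_i = MLP((1 + eps) * x_i + sum_{j in N(i)} x_j), Eq. 13.
    agg = (1.0 + eps) * X + A @ X          # weighted self term plus neighbor sum
    return np.maximum(agg @ W1, 0.0) @ W2  # two-layer MLP (ReLU, then linear)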
More recently, efforts have been made to improve the representational power of graph neural
networks. For example, Hu et al. [143] propose GraphAIR that explicitly models the neighborhood
interaction to better capture complex non-linear features. Specifically, they use Hadamard product
between pairs of nodes in the neighborhood to model the quadratic terms of neighborhood interac-
tion. Balcilar et al. [15] propose GNNML3, which breaks the limits of the first-order Weisfeiler-Lehman test (1-WL) and reaches the third-order WL test (3-WL) experimentally. They also show that the Hadamard product is required for the model to have more representational power than the first-order Weisfeiler-Lehman test. Other elements in spatial graph convolutions are also widely studied. For example, Corso et al. [64] explore the aggregation operation in GNNs and propose Principal Neighbourhood Aggregation (PNA), which uses multiple aggregators with degree-scalers. Tailor et al. [340] explore the anisotropism and isotropism in the message passing process of graph neural networks, and propose Efficient Graph Convolution (EGC), which achieves promising results with reduced memory consumption due to isotropism.
3.3 Summary
This section introduces graph convolutions and we provide the summary as follows:
• Techniques. Graph convolutions mainly fall into two types, i.e. spectral graph convolu-
tions and spatial graph convolutions. Spectral graph convolutions have solid mathematical
foundations of Graph Signal Processing and therefore their operations have theoretical in-
terpretations. Spatial graph convolutions are inspired by Recurrent Graph Neural Networks
and their computation is simple and straightforward, as their computation graph is derived
from the local graph structure. Generally, spatial graph convolutions are more common in
applications.
• Challenges and Limitations. Despite the great success of graph convolutions, their perfor-
mance is unsatisfactory in more complicated applications. On the one hand, the performance of graph convolutions relies heavily on the construction of the graph: different constructions of the graph might result in different performance of graph convolutions. On the other
hand, graph convolutions are prone to over-smoothing when constructing very deep neural
networks.
• Future Works. In the future, we expect that more powerful graph convolutions are de-
veloped to mitigate the problem of over-smoothing and we also hope that techniques and
methodologies in Graph Structure Learning (GSL) can help learn more meaningful graph
structure to benefit the performance of graph convolutions.
In recent years, a large body of research has been conducted on graph kernel neural networks (GKNNs), which has yielded promising results.
Researchers have explored various aspects of GKNNs, including their theoretical foundations,
algorithmic design, and practical applications. These efforts have led to the development of a wide
range of GKNN-based models and methods that can be used for graph analysis and representation
tasks, such as node classification, link prediction, and graph clustering [44, 237, 238].
The success of GKNNs can be attributed to their ability to leverage the strengths of both graph
kernels and neural networks. By using kernel functions to measure similarity between graphs,
GKNNs can capture the structural properties of graphs, while the use of neural networks enables
them to learn more complex and abstract representations of graphs. This combination of techniques
allows GKNNs to achieve state-of-the-art performance on a wide range of graph-related tasks.
In this section, we begin by introducing the most representative traditional graph kernels. Then we summarize the basic framework for combining GNNs and graph kernels. Finally, we categorize the popular graph kernel neural networks into several groups and compare their differences.
Most graph kernels can be written in the following general form:
K(G_1, G_2) = Σ_{u_1 ∈ V_1} Σ_{u_2 ∈ V_2} κ_base( l_{G_1}(u_1), l_{G_2}(u_2) ),   (14)
where l_G(u) denotes a set of local substructures centered at node u in graph G, and κ_base is a base kernel measuring the similarity between the two sets of substructures. For simplicity, we may rewrite Eqn. 14 as:
K(G_1, G_2) = Σ_{u_1 ∈ V_1} Σ_{u_2 ∈ V_2} κ_base(u_1, u_2),   (15)
where the uppercase K(G_1, G_2) denotes graph kernels, κ(u_1, u_2) denotes node kernels, and the lowercase k(x, y) denotes general kernel functions.
The kernel mapping ψ of a kernel maps a data point into its corresponding Reproducing Kernel Hilbert Space (RKHS) H. Specifically, given a kernel k^∗(·,·), its kernel mapping ψ^∗ can be formulated as
∀x_1, x_2,   k^∗(x_1, x_2) = ⟨ψ^∗(x_1), ψ^∗(x_2)⟩_{H^∗},   (16)
where H^∗ is the RKHS of k^∗(·,·).
We introduce several representative and widely used graph kernels below.
Walk and Path Kernels. An l-walk kernel K_walk^{(l)} compares all length-l walks starting from each node in two graphs G_1, G_2:
κ_walk^{(l)}(u_1, u_2) = Σ_{w_1 ∈ W_l(G_1, u_1)} Σ_{w_2 ∈ W_l(G_2, u_2)} δ( X_1(w_1), X_2(w_2) ),
K_walk^{(l)}(G_1, G_2) = Σ_{u_1 ∈ V_1} Σ_{u_2 ∈ V_2} κ_walk^{(l)}(u_1, u_2).   (17)
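As a rough illustration of Eq. 17 (assuming discrete node labels and unlabeled edges; names are our own), pairs of label-matching walks of length l can be counted on the direct product graph instead of enumerating walks explicitly:

import numpy as np

def l_walk_kernel(A1, labels1, A2, labels2, l):
    # Build the direct product graph over label-matching node pairs and count
    # its length-l walks; each such walk corresponds to a pair of matching walks.
    pairs = [(u, v) for u in range(len(labels1)) for v in range(len(labels2))
             if labels1[u] == labels2[v]]
    Ax = np.zeros((len(pairs), len(pairs)))
    for k, (u1, u2) in enumerate(pairs):
        for m, (v1, v2) in enumerate(pairs):
            Ax[k, m] = A1[u1, v1] * A2[u2, v2]   # an edge present in both graphs simultaneously
    return np.linalg.matrix_power(Ax, l).sum()   # K_walk^(l)(G1, G2)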
By regarding the lower-hop kernel κ^{(l−1)}(u_1, u_2) as the inner product of the (l−1)-th hidden representations of u_1 and u_2, and by recursively applying the neighborhood kernel, the l-hop graph kernel can be derived as
K^l(G_1, G_2) = Σ_{w_1 ∈ W_l(G_1)} Σ_{w_2 ∈ W_l(G_2)} ( Π_{i=0}^{l−1} ⟨f(w_1^{(i)}), f(w_2^{(i)})⟩ × Π_{i=0}^{l−2} ⟨f(e_{w_1^{(i)}, w_1^{(i+1)}}), f(e_{w_2^{(i)}, w_2^{(i+1)}})⟩ ),   (20)
where W_l(G) denotes the set of all walk sequences with length l in graph G, and w_1^{(i)} denotes the i-th node in sequence w_1.
As shown in Eqn. (16), kernel methods implicitly perform projections from original data spaces
to their RKHS H . Hence, as GNNs also project nodes or graphs into vector spaces, connections
have been established between GKs and GNNs through the kernel mappings. And several works
conducted research on these connections [202, 381] and reached some foundational conclusions. Taking the basic rule introduced in [202] as an example, the proposed graph kernel in Eqn. (14) can be derived as a general recurrent formula, in which ⊙ is the element-wise product and h^{(l)}(v) is the cell state vector of node v, and the parameter matrices W^{(l)}_{t_V(v)}, U^{(l)}_{t_V(v)} and U^{(l)}_{t_E(e_{u,v})} are learnable parameters related to the types of nodes and edges.
Then the mean embedding of all nodes is usually used as the graph-level embedding; let |G_i| denote the number of nodes in the i-th graph, then the graph-level embeddings are generated as
Φ(G_i) = (1/|G_i|) Σ_{v ∈ G_i} h^{(L)}(v).   (22)
For semi-supervised multiclass classification, the cross entropy is used as the objective function
over all training examples [36, 37],
L = − Σ_{l ∈ Y_L} Σ_{f=1}^{F} Y_{lf} ln Z_{lf},   (23)
where Y𝐿 is the set of node indices which have labels in node classification tasks, or set of graph
indices in graph classification tasks. Z_{lf} denotes the predicted label distribution, which is the output of a linear layer with an activation function, taking h^{(l)}(v) as input in node classification tasks and Φ(G_i) in graph classification tasks.
h_k^{(l)}(s) = σ( h_k^{(l−1)}(s) · W_1^{(l)} + Σ_{u ∈ N(s)} h_k^{(l−1)}(u) · W_2^{(l)} ),   1 < l ≤ L,   (24)

[Σ_{(0)}^{(l)}(G, G′)]_{uu′} = c_u c_{u′} Σ_{v ∈ N(u)∪{u}} Σ_{v′ ∈ N(u′)∪{u′}} [Σ_{(R)}^{(l−1)}(G, G′)]_{vv′},
[Θ_{(0)}^{(l)}(G, G′)]_{uu′} = c_u c_{u′} Σ_{v ∈ N(u)∪{u}} Σ_{v′ ∈ N(u′)∪{u′}} [Θ_{(R)}^{(l−1)}(G, G′)]_{vv′},   (27)

where [Σ_{(R)}^{(0)}(G, G′)] and [Θ_{(R)}^{(0)}(G, G′)] are defined as [Σ^{(0)}(G, G′)]. Then GNTK performs R transformations,

[A_{(r)}^{(l)}(G, G′)]_{uu′} = ( [Σ_{(r−1)}^{(l)}(G, G)]_{uu}, [Σ_{(r−1)}^{(l)}(G, G′)]_{uu′} ; [Σ_{(r−1)}^{(l)}(G′, G)]_{u′u}, [Σ_{(r−1)}^{(l)}(G′, G′)]_{u′u′} ),   (28)

[Σ_{(r)}^{(l)}(G, G′)]_{uu′} = E_{(a,b) ∼ N(0, [A_{(r)}^{(l)}(G,G′)]_{uu′})} [σ(a) · σ(b)],
[Σ̇_{(r)}^{(l)}(G, G′)]_{uu′} = E_{(a,b) ∼ N(0, [A_{(r)}^{(l)}(G,G′)]_{uu′})} [σ̇(a) · σ̇(b)],   (29)

then the r-th order can be calculated as
[Θ_{(r)}^{(l)}(G, G′)]_{uu′} = [Θ_{(r−1)}^{(l)}(G, G′)]_{uu′} [Σ̇_{(r)}^{(l)}(G, G′)]_{uu′} + [Σ_{(r)}^{(l)}(G, G′)]_{uu′}.   (30)

Finally, GNTK calculates the final output as
Θ(G, G′) = Σ_{u ∈ V, u′ ∈ V′} [ Σ_{l=0}^{L} Θ_{(R)}^{(l)}(G, G′) ]_{uu′}.   (31)
Heterogeneous Graph Kernel based Graph Neural Network [238] (HGK-GNN). HGK-GNN
first proposed GKNN for heterogeneous graphs. It adopted ⟨𝑓 (𝑢 1 ), 𝑓 (𝑢 2 )⟩𝑀 as graph kernel based
on the Mahalanobis Distance to build connections among heterogeneous nodes and edges,
⟨𝑓 (𝑢 1 ), 𝑓 (𝑢 2 )⟩𝑀1 = 𝑓 (𝑢 1 )𝑇 𝑴 1 𝑓 (𝑢 2 ),
⟨𝑓 (𝑒 ·,𝑢1 ), 𝑓 (𝑒 ·,𝑢2 )⟩𝑀2 = 𝑓 (𝑒 ·,𝑢1 )𝑇 𝑴 2 𝑓 (𝑒 ·,𝑢2 ).
Following the route introduced in [202] , the corresponding neural network architecture of
heterogeneous graph kernel can be derived as
h^{(0)}(v) = W^{(0)}_{t_V(v)} f(v),
h^{(l)}(v) = W^{(l)}_{t_V(v)} f(v) ⊙ Σ_{u ∈ N(v)} ( U^{(l)}_{t_V(v)} h^{(l−1)}(u) ⊙ U^{(l)}_{t_E(e_{u,v})} f(e_{u,v}) ),   1 < l ≤ L,   (32)
where h^{(l)}(v) is the cell state vector of node v, and W^{(l)}_{t_V(v)}, U^{(l)}_{t_V(v)}, U^{(l)}_{t_E(e_{u,v})} are learnable parameters.
4.4 Summary
This section introduces graph kernel neural networks. We provide the summary as follows:
• Techniques. Graph kernel neural networks (GKNNs) are a recent popular research area
that combines the advantages of graph kernels and GNNs to learn more effective graph
representations. Researchers have studied GKNNs in various aspects, such as theoretical
foundations, algorithmic design, and practical applications. As a result, a wide range of GKNN-
based models and methods have been developed for graph analysis and representation tasks,
including node classification, link prediction, and graph clustering.
• Challenges and Limitations. Although GKNNs have shown great potential in graph-related
tasks, they also have several limitations that need to be addressed. Scalability is a significant
challenge, particularly when dealing with large-scale graphs and networks. As the size of
the graph increases, the computational cost of GKNNs grows exponentially, which can limit
their ability to handle large and complex real-world applications.
• Future Works. For future works, we expect the GKNNs can integrate more domain-specific
knowledge into the designed kernels. Domain-specific knowledge has been shown to signifi-
cantly improve the performance of many applications, such as drug discovery, knowledge
graph-based information retrieval systems, and molecular analysis [90, 360]. Incorporating
domain-specific knowledge into GKNNs can enhance their ability to handle complex and
diverse data structures, leading to more accurate and interpretable models.
where I denotes the identity matrix, D denotes the diagonal degree matrix of A, and ∥ · ∥ 1 means
ℓ1 norm. Furthermore, to reduce the loss of topology information, HGP-SL leverages structure
learning to learn a refined graph topology for the reserved nodes. Specifically, it utilizes the attention
mechanism to compute the similarity of two nodes as the weight of the reconstructed edge,
Ã^pool_{ij} = sparsemax( σ( w^⊤ [ X^pool_{i,:} ‖ X^pool_{j,:} ] ) + λ · A^pool_{ij} ),   (39)
where Ã^pool denotes the refined adjacency matrix of the pooled graph, sparsemax(·) truncates the
values below a threshold to zeros, w denotes a learnable weight vector, and λ is a weight parameter
between the original edges and the reconstructed edges. These reconstructed edges may capture
the underlying relationship between nodes disconnected due to node dropping.
Topology-Aware Graph Pooling (TAPool) [105]. TAPool takes both the local and global significance
of a node into account. On the one hand, it utilizes the average similarity between a node and its
neighbors to evaluate its local importance,
R̂ = XX^T ⊙ D^{−1}A,   Z^l = softmax( (1/n) R̂ 1_n ),   (40)
where R̂ denotes the localized similarity matrix, and Z^l denotes the local importance score. On
the other hand, it measures the global importance of a node according to the significance of its
one-hop neighborhood in the whole graph,
X̂ = D^{−1}AX,   Z^g = softmax( X̂ p ),   (41)
where p is a learnable and globally shared projector vector, similar to the aforementioned gPool [104].
However, X̂ here further aggregates the features from the neighborhood, which enables the global
importance score Z𝑔 to capture more topology information such as salient subgraphs. Moreover,
TAPool encourages connectivity in the coarsened graph with the help of a degree-based connectivity
term, then obtaining the final importance score Z = Z𝑙 + Z𝑔 + 𝜆D/|𝑉 |, where 𝜆 is a trade-off
hyperparameter.
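As a generic sketch of TopK-based selection in the spirit of gPool and the scoring schemes above (the retention ratio and the tanh gating are assumptions of this simplified version):

import numpy as np

def topk_pool(A, X, p, ratio=0.5):
    # Score nodes by projecting features onto p, keep the top-ratio fraction,
    # gate the kept features by their scores, and induce the pooled subgraph.
    scores = X @ p / np.linalg.norm(p)
    k = max(1, int(ratio * X.shape[0]))
    keep = np.argsort(-scores)[:k]
    X_pool = X[keep] * np.tanh(scores[keep])[:, None]
    A_pool = A[np.ix_(keep, keep)]
    return A_pool, X_pool, keep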
5.2.2 Cluster-based Pooling. Pooling the graph by clustering and merging nodes is the main
concept behind cluster-based pooling methods. Typically, they allocate nodes to a collection of
clusters by learning a cluster assignment matrix S ∈ R |𝑉 |×𝐾 , where 𝐾 is the number of the clusters.
After that, they merge the nodes within each cluster to generate a new node in the pooled graph.
The feature (embedding) matrix of the new nodes can be obtained by aggregating the features
(embeddings) of nodes within the clusters, according to the cluster assignment matrix,
X^pool = S^T X.   (42)
While the adjacency matrix of the pooled graph can be generated by calculating the connectivity strength between each pair of clusters,
A^pool = S^T A S.   (43)
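A minimal sketch of cluster-based coarsening following Eqs. 42-43; here the soft assignment matrix is produced by a single linear layer with a row-wise softmax, which is a simplification of learned assignments such as DiffPool's:

import numpy as np

def cluster_pool(A, X, W_assign):
    # S = row-softmax(X W_assign) in R^{|V| x K}; X_pool = S^T X; A_pool = S^T A S.
    Z = X @ W_assign
    Z = Z - Z.max(axis=1, keepdims=True)
    S = np.exp(Z)
    S = S / S.sum(axis=1, keepdims=True)       # soft cluster assignment per node
    return S.T @ A @ S, S.T @ X, S             # pooled adjacency, pooled features, assignments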
Besides, DiffPool leverages an auxiliary link prediction objective L_LP = ‖A − SS^T‖_F to encourage the adjacent nodes to be in the same cluster and avoid fake local minima, where ‖·‖_F is the Frobenius norm. And it utilizes an entropy regularization term L_E = (1/|V|) Σ_{i=1}^{|V|} H(S_i) to impel clear cluster assignments, where H(·) represents the entropy.
Graph Pooling with Spectral Clustering (MinCutPool) [22]. MinCutPool takes advantage of the
properties of spectral clustering (SC) to provide a better inductive bias and avoid degenerate cluster
assignments. It learns to cluster like SC by optimizing the MinCut loss,
L_c = − Tr(S^T A S) / Tr(S^T D S).   (45)
Besides, it minimizes an orthogonality regularization term L_o = ‖ S^T S / ‖S^T S‖_F − I_K / √K ‖_F to encourage orthogonal and uniform cluster assignments, and prevent the bad minima of L_c, where K is the number of the
clusters. When performing a specific task, it can optimize the weighted sum of the unsupervised
loss 𝐿𝑢 = 𝐿𝑐 + 𝐿𝑜 and a task-specific loss to find the optimal balance between the theoretical prior
and the task objective.
Structural Entropy Guided Graph Pooling (SEP) [384]. In order to lessen the local structural harm
and suboptimal performance caused by separate pooling layers and predesigned pooling ratios,
SEP leverages the concept of structural entropy to generate the global and hierarchical cluster
assignments at once. Specifically, SEP treats the nodes of a given graph as the leaf nodes of a coding
tree and exploits the hierarchical layers of the coding tree to capture the hierarchical structure of
the graph. The optimal code tree 𝑇 can be obtained by minimizing the structural entropy [204],
H^T(G) = − Σ_{v_i ∈ T} ( g(P_{v_i}) / vol(V) ) log ( vol(P_{v_i}) / vol(P_{v_i^+}) ),   (46)
where v_i^+ represents the father node of node v_i, P_{v_i} denotes the partition of leaf nodes which are
descendants of 𝑣𝑖 in the coding tree 𝑇 , 𝑔(𝑃 𝑣𝑖 ) denotes the number of edges that have a terminal in
the 𝑃 𝑣𝑖 , and vol(·) denotes the total degrees of leaf nodes in the given partition. Then, the cluster
assignment matrix for each pooling layer can be derived from the edges of each layer in the coding
tree. With the help of the one-step joint assignments generation based on structural entropy, it can
not only make the best use of the hierarchical relationships of pooling layers, but also reduce the
structural noise in the original graph.
5.2.3 Hybrid pooling. Hybrid pooling methods combine TopK-based pooling methods and cluster-
based pooling methods, to exert the advantages of the two methods and overcome their respective
limitations. Here, we review two representative hybrid pooling methods, Adaptive Structure Aware
Pooling [292] and Multi-channel Pooling [76].
Adaptive Structure Aware Pooling (ASAP) [292]. Considering that TopK-based pooling methods are
not good at capturing the connectivity of the coarsened graph, while cluster-based pooling methods
fail to be employed for large graphs because of the dense assignment matrix, ASAP organically
combines the two types of pooling methods to overcome the above limitations. Specifically, it
regards the ℎ-hop ego-network 𝑐ℎ (𝑣𝑖 ) of each node 𝑣𝑖 as a cluster. Such local clustering enables the
cluster assignment matrix to be sparse. Then, a new self-attention mechanism Master2Token is
used to learn the cluster assignment matrix S and the cluster representations,
m_i = max_{v_j ∈ c_h(v_i)} X'_j,   S_{j,i} = softmax( w^T σ( W [ m_i ‖ X'_j ] ) ),   X^c_i = Σ_{j=1}^{|c_h(v_i)|} S_{j,i} X_j,   (47)
where X′ is the node embedding matrix after passing GCN, w and W denote the trainable vector
and matrix respectively, and X𝑐𝑖 denotes the representation of the cluster 𝑐ℎ (𝑣𝑖 ). Next, it utilizes
the graph convolution and TopK selection to choose the top 𝐾 clusters, whose centers are treated
as the nodes of the pooled graph. The adjacency matrix of the pooled graph can be calculated like
common cluster-based pooling methods (43), preserving the connectivity of the original graph well.
Multi-channel Pooling (MuchPool) [76]. The key idea of MuchPool is to capture both the local
and global structure of a given graph by combining the TopK-based pooling methods and the
cluster-based pooling methods. MuchPool has two pooling channels based on TopK selection to
yield two fine-grained pooled graphs, whose selection criteria are node degrees and projected
scores of node features respectively, so that both the local topology and the node features are
considered. Besides, it leverages a channel based on graph clustering to obtain a coarse-grained
pooled graph, which captures the global and hierarchical structure of the input graph. To better
integrate the information of different channels, a cross-channel convolution is proposed, which
fuses the node embeddings of the fine-grained pooled graph Xfine and the coarse-grained pooled
graph Xcoarse with the help of the cluster assignments S of the cluster-based pooling channel,
X̃^fine = σ( ( X^fine + S X^coarse ) · W ),   (48)
where W denotes the learnable weights. Finally, it merges the node embeddings and the adjacency
matrices of the two fine-grained pooled graphs to obtain the eventually pooled graph.
5.3 Summary
This section introduces graph pooling methods for graph-level representation learning. We provide
the summary as follows:
• Techniques. Graph pooling methods play a vital role in generating an entire graph represen-
tation by aggregating node embeddings. There are mainly two categories of graph pooling
methods: global pooling methods and hierarchical pooling methods. While global pooling
methods directly aggregate node embeddings in one step, hierarchical pooling methods
gradually coarsen a graph to capture hierarchical structure characteristics of the graph based
on TopK selection, clustering methods, or hybrid methods.
• Challenges and Limitations. Despite the great success of graph pooling methods for learn-
ing the whole graph representation, there remain several challenges and limitations unsolved:
1) For hierarchical pooling, most cluster-based methods involve the dense assignment matrix,
which limits their application to large graphs, while TopK-based methods are not good at
capturing structure information of the graph and may lose information due to node dropping.
2) Most graph pooling methods are designed for simple attributed graphs, while pooling
algorithms tailored to other types of graphs, like dynamic graphs and heterogeneous graphs,
are largely under-explored.
• Future Works. In the future, we expect that more hybrid or other pooling methods can
be studied to capture the graph structure information sufficiently as well as be efficient for
large graphs. In realistic scenarios, there are various types of graphs involving dynamic,
heterogeneous, or spatial-temporal information. It is promising to design graph pooling
methods specifically for these graphs, which can be beneficial to more real-world applications,
such as traffic analysis and recommendation systems.
Conventional GNNs compute node representations through an iterative neighbor-aggregation operation. Many previous works have demonstrated the two major defects of message-passing GNNs, known as the over-smoothing and long-distance modeling problems, and a number of explanatory works try to mine insights into these two issues. The over-smoothing problem has been explained from several angles: various GNNs focus only on low-frequency information [23], mixing information between different kinds of nodes destroys model performance [45], GCN is equivalent to Laplacian smoothing [213], and isotropic aggregation among neighbors leads to the same influence distribution as a random walk [404].
The inability to model long-distance dependencies of GNNs is partially due to the over-smoothing
problem, because in the context of conventional neighbor-aggregation GNNs, node information
can be passed over long distances only through multiple GNN layers. Recently, Alon et al. [6] find
that this problem may also be caused by over-squashing, which means the exponential growth of
computation paths with increasing distance. Though the two basic performance bottlenecks can
be tackled with elaborate message passing and aggregation strategies, representational power of
GNNs is inherently bounded by the Weisfeiler-Lehman isomorphism hierarchy [264]. Worse still,
most GNNs [115, 182, 351] are bounded by the simplest first-order Weisfeiler-Lehman test (1-WL).
Some efforts have been dedicated to break this limitation, such as hypergraph-based [93, 149],
path-based [31, 415], and k-WL-based [15, 264] approaches.
Among many attempts to solve these fundamental problems, an essential one is the adaptation
of Transformer [350] for graph representation learning. Transformers, both the vanilla version and
several variants, have been adopted with impressive results in various deep learning fields including
NLP [70, 350], CV [38, 469], etc. Recently, Transformer also shows powerful graph modeling abilities
in many researches [46, 81, 193, 386, 415]. And extensive empirical results show that some chronic
shortcomings in conventional GNNs can be easily overcome in Transformer-based approaches.
This section gives an overview of the current progress on this kind of methods.
6.1 Transformer
Transformer [350] was first applied to machine translation, but two of the key mechanisms
adopted in this work, attention operation and positional encoding, are highly compatible with the
graph modeling problem.
To be specific, we denote the input of attention layer in Transformer as X = [x0, x1, . . . , x𝑛−1 ],
x𝑖 ∈ R𝑑 , where 𝑛 is the length of input sequence and 𝑑 is the dimension of each input embedding
x𝑖 . Then the core operation of calculating new embedding x ^𝑖 for each x𝑖 in attention layer can be
streamlined as:
s^h(x_i, x_j) = NORM_j( ‖_{x_k ∈ X} Q^h(x_i)^T K^h(x_k) ),
x̂_i^h = Σ_{x_j ∈ X} s^h(x_i, x_j) V^h(x_j),   (49)
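A simplified single-head NumPy version of Eq. 49, where NORM is taken to be a row-wise softmax and the 1/√d scaling is the usual Transformer convention (projection matrices W_q, W_k, W_v are illustrative):

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # Fully-connected attention over all node embeddings, cf. Eq. 49.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])         # similarity between every pair of nodes
    scores = scores - scores.max(axis=1, keepdims=True)
    S = np.exp(scores)
    S = S / S.sum(axis=1, keepdims=True)           # NORM: softmax over the target dimension
    return S @ V                                   # weighted sum of value vectors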
[Table 4. A summary of Transformer-based methods for graph representation learning: GGT [81], GTSA [193], HGT [147], G2SHGT [413], GRUGT [31], Graphormer [415], GSGT [153], Graph-BERT [436], LRGT [386], and SAT [46], compared by the techniques they adopt (attention modification, encoding enhancement) and the capacities they obtain (heterogeneous graphs, long-distance modeling, expressiveness beyond 1-WL); for GGT, long-distance modeling covers structure only.]
The original Transformer uses sinusoidal positional encodings, PE(i, 2j) = sin(i / 10000^{2j/d}) and PE(i, 2j+1) = cos(i / 10000^{2j/d}), where i is the position and j is the dimension. The positional encoding is added to the input before it is fed to the Transformer.
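The sinusoidal encoding described above can be generated as follows (a sketch; n is the sequence length and d the embedding dimension, assumed even):

import numpy as np

def sinusoidal_positional_encoding(n, d):
    # PE[i, 2j] = sin(i / 10000^(2j/d)), PE[i, 2j+1] = cos(i / 10000^(2j/d)).
    PE = np.zeros((n, d))
    positions = np.arange(n)[:, None]
    freqs = 10000.0 ** (np.arange(0, d, 2) / d)    # one frequency per sin/cos pair
    PE[:, 0::2] = np.sin(positions / freqs)
    PE[:, 1::2] = np.cos(positions / freqs)
    return PE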
6.2 Overview
From the simplified process shown in Equation 49, we can see that the core of the attention
operation is to accomplish information transfer based on the similarity between the source and the
target to be updated. It’s quite similar to the message-passing process on a fully-connected graph.
However, direct application of this architecture to arbitrary graphs does not make use of structural
information, so it may lead to poor performance when graph topology is important. On the other
hand, the definition of positional encoding in graphs is not a trivial problem because the order or
coordinates of graph nodes are underdefined.
According to these two challenges, Transformer-based methods for graph representation learning
can be classified into two major categories, one considering graph structure during attention process,
and the other encoding the topological information of the graph into initial node features. We
name the first one as Attention Modification and the second one as Encoding Enhancement. A
summarization is provided in Table 4. In the following discussion, if both methods are used in
one paper, we will list them in different subsections, and we will ignore the multi-head trick in
attention operation.
where ⊙ is the Hadamard product and W^{Q,K,E} represent trainable parameter matrices. This approach is not yet efficient for modeling long-distance dependencies, since only first-order neighbors are considered. Though it adopts Laplacian eigenvectors to gather global information (cf. Section 6.4), only long-distance structure information is remedied while the node and edge features are not. GTSA [193] improves this approach by combining the original graph and the full graph. Specifically, it
extends s_1(·,·) to:
s̃_2(x_i, x_j) = (W_1^Q x_i)^T (W_1^K x_j ⊙ W_1^E e_{ji})   if ⟨j, i⟩ ∈ E,
s̃_2(x_i, x_j) = (W_0^Q x_i)^T (W_0^K x_j ⊙ W_0^E e_{ji})   otherwise,
where b_N is a trainable scalar indexed by N, the length of SP_{ij}, e_k is the embedding of the edge e_k, and w_k^E ∈ R^d is the k-th edge parameter. If SP_{ij} does not exist, then b_N and c_{ij} are set to special values.
Such initial encodings are intrinsically limited by the 1-WL test. SAT [46] improves this deficiency by using the subgraph-GNN NGNN [438] for initialization, and achieves outstanding performance.
6.5 Summary
This section introduces Transformer-based approaches for graph representation learning and we
provide the summary as follows:
• Techniques. Graph Transformer methods modify two fundamental techniques in Trans-
former, attention operation and positional encoding, to enhance its ability to encode graph
data. Typically, they introduce fully-connected attention to model long-distance relationship,
utilize shortest path and Laplacian eigenvectors to break 1-WL bottleneck, and separate
points and edges belonging to different classes to avoid over-mixing problem.
• Challenges and Limitations. Though Graph Transformers achieve encouraging perfor-
mance, they still face two major challenges. The first challenge is the computational cost of
the quadratic attention mechanism and shortest path calculation. These operations require
significant computing resources and can be a bottleneck, particularly for large graphs. The
second is the reliance of Transformer-based models on large amounts of data for stable perfor-
mance. It poses a challenge when dealing with problems that lack sufficient data, especially
for few-shot and zero-shot settings.
• Future Works. We expect efficiency improvement for Graph Transformer should be further
explored. Additionally, there are some works using pre-training and fine-tuning framework
to balance performance and complexity in downstream tasks [415], this may be a promising
solution to address the aforementioned two challenges.
In semi-supervised node classification, a GNN classifier is typically trained with a cross-entropy loss over the labeled node set Y_L. Additionally, there are a variety of unlabeled nodes that
can be used to offer semantic information. To fully utilize these nodes, a range of methods attempt
to combine semi-supervised approaches with graph neural networks. Pseudo-labeling [200] is a
fundamental semi-supervised technique that uses the classifier to produce the label distribution of
unlabeled examples and then adds appropriately labeled examples to the training set [212, 465].
Another line of semi-supervised learning is consistency regularization [197], which requires two examples to have identical predictions under perturbation. This regularization is based on the assumption that each instance has a distinct label that is resistant to random perturbations [92, 271]. Then, we show several representative works in detail.
Table 5. A Summary of Methods for Semi-supervised Learning on Graphs. Contrastive learning can be considered as a specific kind of consistency learning.
Cooperative Graph Neural Networks [212] (CoGNet). CoGNet is a representative pseudo-label-
based GNN approach for semi-supervised node classification. It employs two GNN classifiers to
jointly annotate unlabeled nodes. In particular, it calculates the confidence of each node as follows:
𝐶𝑉 (p𝑖 ) = p𝑇𝑖 log p𝑖 , (61)
where p𝑖 denotes the output label distribution. Then it selects the pseudo-labels with high confidence
generated from one model to supervise the optimization of the other model. In particular, the
objective for unlabeled nodes is written as follows:
L_CoGNet = Σ_{i ∈ V_U} 1_{CV(p_i) > τ} ŷ_i^T log q_i,   (62)
where y^𝑖 denotes the one-hot formulation of the pseudo-label 𝑦^𝑖 = 𝑎𝑟𝑔𝑚𝑎𝑥p𝑖 and q𝑖 denotes the
label distribution predicted by the other classifier. 𝜏 is a pre-defined temperature coefficient. This
cross supervision has been demonstrated effective in [51, 244] to prevent the provision of biased
pseudo-labels. Moreover, it employs GNNExplainer [416] to provide additional information from a
dual perspective. Here it measures the minimal subgraphs where GNN classifiers can still generate
the same prediction. In this way, CoGNet can illustrate the entire optimization process to enhance
our understanding.
Dynamic Self-training Graph Neural Network [465] (DSGCN). DSGCN develops an adaptive
manner to utilize reliable pseudo-labels for unlabeled nodes. In particular, it allocates smaller
weights to samples with lower confidence with the additional consideration of class balance. The
weight is formulated as:
ω_i = (1 / n_{c_i}) max( ReLU( p_i − β · 1 ) ),   (63)
where 𝑛𝑐 𝑖 denotes the number of unlabeled samples assigned to the class 𝑐 𝑖 . This technique will
decrease the impact of wrong pseudo-labels during iterative training.
Graph Random Neural Networks [92] (GRAND). GRAND is a representative consistency learning-
based method. It first adds a variety of perturbations to the input graph to generate a list of
graph views. Each graph view 𝐺 𝑟 is sent to a GNN classifier to produce a prediction matrix
P𝑟 = [p𝑟1, · · · , p𝑟𝑁 ]. Then it summarizes these matrices as follows:
P = (1/R) Σ_{r=1}^{R} P^r.   (64)
To provide more discriminative information and ensure that the matrix is row-normalized,
GRAND sharpens the summarized label matrix into P𝑆𝐴 as:
P^{SA}_{ij} = ( P_{ij} )^{1/T} / Σ_{j'} ( P_{ij'} )^{1/T},   (65)
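A small sketch of the averaging (Eq. 64) and sharpening (Eq. 65) steps, assuming the R prediction matrices are stacked in an array P_views of shape (R, N, C):

import numpy as np

def average_and_sharpen(P_views, T=0.5):
    # Eq. 64: average predictions over the R augmented views;
    # Eq. 65: sharpen with temperature T and re-normalize each row.
    P = P_views.mean(axis=0)
    P_sharp = P ** (1.0 / T)
    return P_sharp / P_sharp.sum(axis=1, keepdims=True)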
Given a set of labeled graphs G^l = {G_1, ..., G_{N_l}} and unlabeled graphs G^u = {G_{N_l+1}, ..., G_{N_l+N_u}}, the graph-level supervised loss for labeled data can be expressed as follows:
L_GSL = − (1/|G^l|) Σ_{G_j ∈ G^l} y_j^T log p_j,   (67)
where y_j denotes the one-hot label vector for the j-th sample while p_j denotes the predicted distribution of G_j. When N_u = 0, this objective can be utilized to optimize supervised methods.
However, due to the shortage of labels in graph data, supervised methods cannot reach exceptional
performance in real-world applications [131]. To tackle this, semi-supervised graph classification
has been developed extensively. These approaches can be categorized into pseudo-labeling-based
methods, knowledge distillation-based methods and contrastive learning-based methods. Pseudo-
labeling methods annotate graph instances and utilize well-classified graph examples to update
the training set [209, 211]. Knowledge distillation-based methods usually utilize a teacher-student
architecture, where the teacher model conducts graph representation learning without label infor-
mation to extract generalized knowledge while the student model focuses on the downstream task.
Due to the restricted number of labeled instances, the student model transfers knowledge from
the teacher model to prevent overfitting [131, 335]. Another line of this topic is to utilize graph
contrastive learning, which is frequently used in unsupervised learning. Typically, these methods
extract topological information from two perspectives (i.e., different perturbation strategies and
graph encoders), and maximize the similarity of their representations compared with those from
other examples [171, 172, 242]. Active learning, as a prevalent technique to improve the efficiency
of data annotation, has also been utilized for semi-supervised methods [131, 395]. Then, we review
these methods in detail.
SEmi-supervised grAph cLassification [211] (SEAL). SEAL treats each graph example as a node
in a hierarchical graph. It builds two graph classifiers which generate graph representations and
conduct semi-supervised graph classification respectively. SEAL employs a self-attention module
to encode each graph into a graph-level representation, and then conducts message passing from a
graph level for final classification. SEAL can also be combined with cautious iteration and active
iteration. The former merely utilizes partial graph samples to optimize the parameters in the first
classifier due to the potential erroneous pseudo-labels. The second combines active learning with
the model, which increases the annotation efficiency in semi-supervised scenarios.
InfoGraph [335]. Infograph is the first contrastive learning-based method. It maximizes the
similarity between summarized graph representations and their node representations. In particular,
it generates node representations using the message passing mechanism and summarizes these
node representations into a graph representation. Let Φ(·, ·) denote a discriminator to distinguish
whether a node belongs to the graph, and we have:
L_InfoGraph = − Σ_{j=1}^{|G^l|+|G^u|} Σ_{i ∈ G_j} [ sp( −Φ( h_i^j, z^j ) ) − (1/|N_i^j|) Σ_{i'j' ∈ N_i^j} sp( Φ( h_{i'}^{j'}, z^j ) ) ],   (68)
where sp(·) denotes the softplus function and N_i^j denotes the negative node set where nodes are not
in 𝐺 𝑗 . This mutual information maximization formulation is originally developed for unsupervised
learning and it can be simply extended for semi-supervised graph classification. In particular,
InfoGraph utilizes a teacher-student architecture that compares the representation across the
teacher and student networks. The contrastive learning objective serves as a regularization by
combining with supervised loss.
Dual Space Graph Contrastive Learning [408] (DSGC). DSGC is a representative contrastive
learning-based method. It utilizes two graph encoders. The first is a standard GNN encoder in the
Euclidean space and the second is the hyperbolic GNN encoder. The hyperbolic GNN encoder first
converts graph embeddings into hyperbolic space and then measures the distance based on the
length of geodesics. DSGC compares graph embeddings in the Euclidean space and hyperbolic
space. Assuming the two GNNs are named as 𝑓1 (·) and 𝑓2 (·), the positive pair is denoted as:
z_{E→H}^j = exp_o^c( f_1(G_j) ),   z_H^j = exp_o^c( f_2(G_j) ).   (69)
Then it selects one labeled sample G_i and N_B unlabeled samples G_j for graph contrastive learning in the hyperbolic space. In formulation,
L_DSGC = − log [ e^{d_D(z_H^i, z_{E→H}^i)/τ} / ( e^{d_D(z_H^i, z_{E→H}^i)/τ} + Σ_{j=1}^{N_B} e^{d_D(z_{E→H}^i, z_H^j)/τ} ) ]
 − (λ_u / N_B) Σ_{j=1}^{N_B} log [ e^{d_D(z_H^j, z_{E→H}^j)/τ} / ( e^{d_D(z_H^j, z_{E→H}^j)/τ} + e^{d_D(z_H^i, z_{E→H}^j)/τ} ) ],   (70)
where z^i_{E→H} and z^i_H denote the embeddings of the labeled graph sample G_i, and d_D(·,·) denotes a distance metric in the hyperbolic space.
embeddings learned from two encoders compared with other samples. Finally, the contrastive
learning objective can be combined with the supervised loss to achieve effective semi-supervised
contrastive learning.
Active Semi-supervised Graph Neural Network [131] (ASGN). ASGN utilizes a teacher-student
architecture with the teacher model focusing on representation learning and the student model
targeting molecular property prediction. In the teacher model, ASGN first employs a message
passing neural network to learn node representations under the reconstruction task and then
borrows the idea of balanced clustering to learn graph-level representations in a self-supervised
fashion. In the student model, ASGN utilizes label information to monitor the model training based
on the weights of the teacher model. In addition, active learning is also used to minimize the
annotation cost while maintaining sufficient performance. Typically, the teacher model seeks to
provide discriminative graph-level representations without labels, which transfer knowledge to the
student model to overcome the potential overfitting in the presence of label scarcity.
Twin Graph Neural Networks [172] (TGNN). TGNN also uses two graph neural networks to
provide different views for learning graph representations. In particular, it adopts a graph kernel
neural network to learn graph-level representations by virtue of random walk kernels. Rather than
directly enforcing the representations from the two modules to be similar, TGNN exchanges information
by contrasting the similarity structures of the two modules. Specifically, it constructs a list of
anchor graphs $G_{a_1}, G_{a_2}, \cdots, G_{a_M}$, and utilizes the two graph encoders to produce their embeddings,
i.e., $\{z_{a_m}\}_{m=1}^{M}$ and $\{w_{a_m}\}_{m=1}^{M}$. Then it calculates the similarity distribution between each unlabeled
graph and the anchor graphs for the two modules. Formally,
$p_m^j = \frac{\exp(\cos(z^j, z_{a_m})/\tau)}{\sum_{m'=1}^{M}\exp(\cos(z^j, z_{a_{m'}})/\tau)}, \quad (71)$
$q_m^j = \frac{\exp(\cos(w^j, w_{a_m})/\tau)}{\sum_{m'=1}^{M}\exp(\cos(w^j, w_{a_{m'}})/\tau)}. \quad (72)$
Then, TGNN minimizes the distance between distributions from different modules as follows:
$\mathcal{L}_{TGNN} = \frac{1}{|\mathcal{G}_U|}\sum_{G_j \in \mathcal{G}_U} \frac{1}{2}\big( D_{\mathrm{KL}}(p^j \,\|\, q^j) + D_{\mathrm{KL}}(q^j \,\|\, p^j) \big). \quad (73)$
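The anchor-based distribution matching of Equations 71-73 can be sketched as follows; this is a minimal illustration assuming pre-computed graph embeddings from the two modules, not TGNN's actual implementation.
```python
import torch
import torch.nn.functional as F

def anchor_distribution(graph_emb, anchor_emb, tau=0.5):
    """Softmax over cosine similarities to M anchor graphs (Equations 71-72)."""
    sim = F.cosine_similarity(graph_emb.unsqueeze(1), anchor_emb.unsqueeze(0), dim=-1)
    return F.softmax(sim / tau, dim=-1)                    # (B, M)

def tgnn_consistency_loss(z, w, z_anchor, w_anchor, tau=0.5):
    """Symmetric KL between the two modules' anchor distributions (Equation 73)."""
    p = anchor_distribution(z, z_anchor, tau)
    q = anchor_distribution(w, w_anchor, tau)
    kl_pq = F.kl_div(q.log(), p, reduction="batchmean")    # KL(p || q)
    kl_qp = F.kl_div(p.log(), q, reduction="batchmean")    # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

# Toy usage: 4 unlabeled graphs, 3 anchor graphs, 16-dimensional embeddings.
z, w = torch.randn(4, 16), torch.randn(4, 16)
z_anchor, w_anchor = torch.randn(3, 16), torch.randn(3, 16)
print(tgnn_consistency_loss(z, w, z_anchor, w_anchor).item())
```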
7.3 Summary
This section introduces semi-supervised learning for graph representation learning and we provide
the summary as follows:
• Techniques. Classic node classification aims to conduct transductive learning on graphs
with access to unlabeled data, which is a natural semi-supervised problem. Semi-supervised
graph classification aims to relieve the requirement of abundant labeled graphs. Here, a
variety of semi-supervised methods have been put forward to achieve better performance
under label scarcity. Typically, they integrate semi-supervised techniques such as
active learning, pseudo-labeling, contrastive learning, and consistency regularization with graph
representation learning.
• Challenges and Limitations. Despite their great success, the performance of these methods
is still unsatisfactory, especially in graph-level representation learning. For example, DSGC
can only achieve an accuracy of 57% on the binary classification dataset REDDIT-BINARY. Even
worse, label scarcity is often accompanied by imbalanced datasets and potential domain shifts,
which poses further challenges in real-world applications.
• Future Works. In the future, we expect that these methods can be applied to different
problems such as molecular property predictions. There are also works to extend graph
representation learning in more realistic scenarios like few-shot learning [41, 249]. A higher
accuracy is always anticipated for more advanced and effective semi-supervised techniques.
where D denotes the distribution of the featured graph G. By minimizing L𝑜𝑣𝑒𝑟𝑎𝑙𝑙 , we can learn an encoder
𝑓 with the capacity to produce high-quality embeddings. As for downstream tasks, we denote a graph
decoder 𝑑 which transforms the output of the graph encoder 𝑓 into model predictions; its loss on the
downstream task is denoted as $\mathcal{L}_{sup}(d, f, \mathcal{G}; y)$.
Pre-training. This strategy has two steps. In the pre-training step, $\mathcal{L}_{ssl}$ is minimized to obtain $g^*$ and $f^*$:
$g^*, f^* = \arg\min_{g,f} \mathcal{L}_{ssl}(g, f, \mathcal{D}). \quad (76)$
Then the parameters of $f^*$ are kept to initialize the encoder for the subsequent supervised training.
The supervised loss is minimized to obtain the final parameters of $f$ and $d$:
$\min_{d,f} \mathcal{L}_{sup}(d, f \mid f_0 = f^*, \mathcal{G}; y). \quad (77)$
Collaborative Train. In this strategy, $\mathcal{L}_{ssl}$ and $\mathcal{L}_{sup}$ are optimized simultaneously. A hyperparameter
$\alpha$ is used to balance the contributions of the pretext task loss and the downstream task loss. The
overall objective resembles the traditional supervised objective with a pretext-task regularization term:
$\min_{d,g,f} \mathcal{L}_{sup}(d, f, \mathcal{G}; y) + \alpha \mathcal{L}_{ssl}(g, f, \mathcal{D}),$
where $f(\cdot)$ and $g(\cdot)$ stand for the representation encoder and the rebuilding (pretext) decoder. For graph
data, both feature information and structure information are important components suitable for
reconstruction, so generation-based pretext tasks can be divided into two categories: feature rebuilding
and structure rebuilding. We introduce several representative models below.
Graph Completion [419] is a representative feature-rebuilding method. It masks some node features
to generate an incomplete graph, and the pretext task is set as predicting the removed node features.
As shown in Equation 82, this method can be formulated as a special case of the general generative
objective, letting $\hat{\mathcal{G}} = (A, \hat{X})$ and replacing $\mathcal{G}$ with $X$. The loss function is often the Mean Squared
Error or the Cross Entropy, depending on whether the features are continuous or binary:
$\min_{g,f} \mathrm{MSE}(g(f(\hat{\mathcal{G}})), X). \quad (82)$
Other works modify the feature setting. For example, AttrMasking [145] aims to rebuild both node
and edge representations, while AttributeMask [166] first preprocesses $X$ with PCA to reduce the
complexity of feature rebuilding.
On the other hand, MGAE [358] modifies the original graph by adding noise to the node representations,
motivated by the denoising autoencoder [353]. Similar to Equation 82, MGAE can be formulated as:
$\min_{g,f} \mathrm{BCE}(g(f(\hat{\mathcal{G}})), A). \quad (83)$
As for structure rebuilding methods, GAE [183] is the simplest instance, which can be regarded
as an implementation of Equation 76 where $\hat{\mathcal{G}} = \mathcal{G}$ and $\mathcal{G}$ is replaced by $A$, the adjacency matrix of the graph.
Similar to feature rebuilding methods, GAE compresses raw node representation vectors into
low-dimensional embeddings with its encoder. Then the adjacency matrix is rebuilt by computing
node embedding similarity. The loss function is set to the error between the ground-truth adjacency matrix
and the recovered one, which helps the model rebuild the correct graph structure.
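As a minimal sketch of this structure-rebuilding idea, the following uses an inner-product decoder over node embeddings and a binary cross-entropy reconstruction loss against the adjacency matrix; the simple two-step encoder is a stand-in for any GNN encoder, and the toy graph is illustrative.
```python
import torch
import torch.nn as nn

class InnerProductGAE(nn.Module):
    """Minimal GAE-style autoencoder: encode nodes, rebuild A from embedding similarity."""

    def __init__(self, in_dim, hid_dim, emb_dim):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hid_dim)
        self.lin2 = nn.Linear(hid_dim, emb_dim)

    def encode(self, A_norm, X):
        # Two rounds of "propagate then transform" as a stand-in for a GCN encoder.
        h = torch.relu(self.lin1(A_norm @ X))
        return self.lin2(A_norm @ h)

    def forward(self, A_norm, X):
        Z = self.encode(A_norm, X)
        return torch.sigmoid(Z @ Z.t())            # reconstructed adjacency probabilities

# Toy usage: 5 nodes, 4 features; reconstruction loss against the ground-truth A.
A = torch.tensor([[0,1,1,0,0],[1,0,1,0,0],[1,1,0,1,0],[0,0,1,0,1],[0,0,0,1,0]], dtype=torch.float)
A_norm = A + torch.eye(5)                           # crude normalization with self-loops
A_norm = A_norm / A_norm.sum(dim=1, keepdim=True)
X = torch.randn(5, 4)
model = InnerProductGAE(4, 16, 8)
loss = nn.functional.binary_cross_entropy(model(A_norm, X), A)
print(loss.item())
```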
8.3.1 View generation. The traditional pipeline of contrastive learning-based models first augments
the graph by well-crafted empirical methods, and then maximizes the consistency between different
augmentations. Following methods from the computer vision domain while accounting for the
non-Euclidean structure of graph data, typical graph augmentation methods modify the graph either
topologically or representationally.
Given graph G = (𝐴, 𝑋 ), the topologically augmentation methods usually modify the adjacency
matrix 𝐴, which can be formulated as:
𝐴^ = 𝒯𝐴 (𝐴), (85)
where $\mathcal{T}_A(\cdot)$ is the transform function of the adjacency matrix. Topology augmentation has
many variants, among which the most popular is edge modification, given by $\mathcal{T}_A(A) = P \circ A + Q \circ (1 - A)$,
where $P$ and $Q$ are two matrices representing edge dropping and edge adding respectively. Another
method, graph diffusion, connects nodes with their k-hop neighbors with specific weights, defined as
$\mathcal{T}_A(A) = \sum_{k=0}^{\infty} \alpha_k T^k$, where $\alpha_k$ and $T$ are the weighting coefficients and the transition matrix.
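A minimal sketch of edge-modification augmentation, assuming an undirected dense adjacency matrix and independent Bernoulli dropping/adding probabilities; practical implementations usually operate on sparse edge lists instead.
```python
import numpy as np

def edge_modification(A, p_drop=0.1, p_add=0.01, seed=0):
    """T_A(A) = P o A + Q o (1 - A): drop existing edges with prob p_drop and
    add non-edges with prob p_add, keeping the graph symmetric and loop-free."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    keep = rng.random((n, n)) > p_drop          # P: mask of edges to keep
    add = rng.random((n, n)) < p_add            # Q: mask of non-edges to add
    A_aug = A * keep + (1 - A) * add
    A_aug = np.triu(A_aug, 1)                   # symmetrize, remove self-loops
    return A_aug + A_aug.T

# Toy usage on a 4-node ring graph.
A = np.array([[0,1,0,1],[1,0,1,0],[0,1,0,1],[1,0,1,0]], dtype=float)
print(edge_modification(A, p_drop=0.3, p_add=0.1))
```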
As introduced before, the augmentation paradigm has been proven effective for generating contrastive
views. However, given the variety of graph data, it is challenging to maintain semantics properly
during augmentation. To preserve the valuable nature of a specific graph dataset, there are currently
three mainly-used strategies: picking by trial-and-error, laborious search, or seeking domain-specific
information as guidance [169, 243]. Such complicated augmentation procedures constrain the
effectiveness and widespread application of graph contrastive learning, so many recent works question
the necessity of augmentation and seek other ways to generate contrastive views.
SimGCL [422] is one of the notable works challenging the effectiveness of graph augmentation.
The authors find that noise can substitute for augmentation to produce graph views in specific
tasks such as recommendation. After an ablation study on augmentation and InfoNCE [397],
they find that the InfoNCE loss, not the augmentation of the graph, is what makes the difference. This
can be further explained by the importance of distribution uniformity: contrastive learning enhances
the representation ability of a model by intensifying two characteristics, the alignment of features
from positive samples and the uniformity of the normalized feature distribution. SimGCL therefore
directly adds random noise to node embeddings as augmentation, which controls the uniformity of the
representation distribution in a more effective way: two randomly sampled unit vectors are added (after
scaling) to each node representation $e_i$ in the embedding space to form the two contrastive views. The
experimental results indicate that SimGCL outperforms its graph augmentation-based competitors,
while the training time is significantly decreased.
SimGRACE [391] is another graph contrastive learning framework without data augmentation.
Motivated by the observation that graph data can effectively maintain their semantics even when the
encoder is perturbed, SimGRACE takes a GNN together with its perturbed version as encoders to produce
two contrastive embedding views from the same graph input. For a GNN encoder $f(\cdot; \theta)$, the two
contrastive embedding views $\mathbf{e}^{(1)}, \mathbf{e}^{(2)}$ are computed by:
$\mathbf{e}^{(1)} = f(\mathcal{G}; \theta), \quad \mathbf{e}^{(2)} = f(\mathcal{G}; \theta + \epsilon \cdot \Delta\theta), \quad \Delta\theta_l \sim \mathcal{N}(0, \sigma_l^2), \quad (88)$
where $\Delta\theta_l$ represents the perturbation of the GNN parameters $\Delta\theta$ in the $l$-th layer. SimGRACE can improve
alignment and uniformity simultaneously, proving its capacity to produce high-quality embeddings.
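The encoder-perturbation view generation of Equation 88 can be sketched as follows; a generic PyTorch module stands in for the GNN encoder, and scaling the noise by the per-layer standard deviation of the weights is one reasonable choice rather than SimGRACE's exact recipe.
```python
import copy
import torch

def perturbed_encoder(encoder, eps=0.1):
    """Return a copy of the encoder with Gaussian noise added to every parameter:
    theta' = theta + eps * delta, with per-layer noise scale (cf. Equation 88)."""
    noisy = copy.deepcopy(encoder)
    with torch.no_grad():
        for p in noisy.parameters():
            sigma = p.std() if p.numel() > 1 else torch.tensor(1.0)
            p.add_(eps * sigma * torch.randn_like(p))
    return noisy

# Toy usage with a linear layer standing in for a GNN encoder f(.; theta).
encoder = torch.nn.Linear(8, 4)
view1_encoder = encoder                        # e^(1) = f(G; theta)
view2_encoder = perturbed_encoder(encoder)     # e^(2) = f(G; theta + eps * delta_theta)
x = torch.randn(5, 8)
e1, e2 = view1_encoder(x), view2_encoder(x)
print((e1 - e2).norm().item())
```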
8.3.2 MI estimation method. The mutual information $I(x, y)$ measures the information that $x$
and $y$ share, given a pair of random variables $(x, y)$. As discussed before, mutual information is a
significant component of contrast-based methods, since it is used to formulate the loss function.
MI is rigorously defined on the probability space; the mutual information between a pair of
variables $(x, y)$ can be formulated as:
$I(x, y) = D_{KL}\big(p(x, y) \,\|\, p(x)p(y)\big) = \mathbb{E}_{p(x,y)}\left[\log\frac{p(x, y)}{p(x)p(y)}\right]. \quad (89)$
However, directly computing Equation 89 is quite difficult, so we introduce several types of MI estimators:
InfoNCE. The noise-contrastive estimator is a widely used lower-bound MI estimator. Given a
positive sample $y$ and several negative samples $y_i'$, the noise-contrastive estimator can be formulated
as [470, 286]:
$\mathcal{L} = -I(x, y) = -\mathbb{E}_{p(x,y)}\left[\log\frac{e^{g(x,y)}}{e^{g(x,y)}+\sum_i e^{g(x,y_i')}}\right], \quad (90)$
where the kernel function $g(\cdot)$ is usually the cosine similarity or the dot product.
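A minimal PyTorch sketch of the InfoNCE estimator in Equation 90, assuming paired embeddings in which the i-th rows of the two views are positives and the remaining rows in the batch act as negatives; the critic g is a temperature-scaled cosine similarity.
```python
import torch
import torch.nn.functional as F

def info_nce(x, y, tau=0.2):
    """InfoNCE lower-bound estimator (cf. Equation 90).

    x, y: (B, d) embeddings of two views; (x_i, y_i) are positive pairs and
    (x_i, y_j), j != i, act as in-batch negatives.
    """
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / tau                    # g(x_i, y_j) for all pairs
    labels = torch.arange(x.size(0))            # positives sit on the diagonal
    return F.cross_entropy(logits, labels)      # -E[log softmax of the positive pair]

# Toy usage: a batch of 4 pairs of 16-dimensional embeddings.
x, y = torch.randn(4, 16), torch.randn(4, 16)
print(info_nce(x, y).item())
```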
Triplet Loss. Intuitively, we can push the similarities of positive pairs and negative pairs apart by a
certain margin, and define the loss function accordingly [162].
8.4 Summary
This section introduces graph self-supervised learning and we provide the summary as follows:
• Techniques. Different from classic supervised learning and semi-supervised learning, self-
supervised learning increases model generalization ability and robustness while decreasing
label reliance. Graph SSL utilizes pretext tasks to extract the inherent information in the data
distribution. Typical graph SSL methods can be divided into generation-based and contrast-
based ones. Generation-based methods, motivated by autoencoders, learn an encoder able to
reconstruct the graph as precisely as possible. Contrast-based methods have attracted significant
interest recently; they learn an encoder by maximizing the mutual information between relevant
instances and minimizing the mutual information between unrelated instances.
• Challenges and Limitations. Although graph SSL has achieved superior performance in
many tasks, its theoretical basis is not yet solid. Many well-known methods are validated only
through experiments, without theoretical explanation or mathematical proof.
It is imperative to establish a strong theoretical foundation for graph SSL.
• Future Works. In the future, we expect more graph SSL methods to be designed from theoretical
principles rather than relying on intuitively crafted augmentation processes or pretext tasks.
This will bring more definite mathematical properties and a less ambiguous empirical understanding.
Also, graphs are a prevalent form of data representation across diverse domains, yet obtaining
manual labels can be prohibitively expensive. Expanding the applications of graph SSL to
broader fields is a promising avenue for future research.
Table 7. A summary of graph structure learning (GSL) methods, organized by their structure learning scheme (e.g., metric-based approaches such as AGCN [214] with Mahalanobis distance and GRCN [421] with inner product) and the regularization they impose (sparsity, low-rank, smoothness).
regularization methods for GSL in Sec. 9.1 and Sec. 9.2, respectively, and then introduce different
categories of GSL in Sec. 9.3, 9.4 and 9.5. We summarize GSL approaches in Table 7.
The overall objective of GSL typically combines a task-specific loss with a regularization term,
$\mathcal{L} = \mathcal{L}_t + \lambda \mathcal{L}_r$, where $\mathcal{L}_t$ is the task-specific objective, $\mathcal{L}_r$ is the regularization term and $\lambda$ is a
hyperparameter controlling the weight of the regularization.
9.2 Regularization
The goal of regularization is to constrain the learned graph to satisfy some properties by adding
some penalties to the learned structure. The most common properties used in GSL are sparsity, low
rank, and smoothness.
9.2.1 Sparsity Noise or adversarial attacks will introduce redundant edges into graphs and degrade
the quality of graph representation. An effective technique to remove unnecessary edges is sparsity
regularization, i.e., adding a penalty on the number of nonzero entries of the adjacency matrix
(ℓ0 -norm):
L𝑠𝑝 = ∥A∥ 0, (94)
however, ℓ0 -norm is not differentiable so optimizing it is difficult, and in many cases ℓ1 -norm is used
instead as a convex relaxation. Other methods to impose sparsity include pruning and discretization.
These processes are also called postprocessing since they usually happen after the adjacency matrix
is learned. Pruning removes part of the edges according to some criteria, for example, edges with
weights lower than a threshold or edges outside the top-K of a node or of the whole graph. Discretization
generates the graph structure by sampling from some distribution. Compared to directly
learning edge weights, sampling offers better control over the generated graph, but is hard to optimize
since the sampling operation is discrete. Reparameterization and Gumbel-softmax are two useful
techniques to overcome this issue, and are widely adopted in GSL.
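As a sketch of the discretization idea, the following samples a sparse graph from learned edge logits with the Gumbel-softmax relaxation so that the sampling step stays differentiable; the parameterization is illustrative and not tied to any specific GSL model.
```python
import torch
import torch.nn.functional as F

def sample_sparse_graph(edge_logits, tau=0.5, hard=True):
    """Sample a binary adjacency matrix from per-edge Bernoulli logits using the
    Gumbel-softmax relaxation (gradients flow back into edge_logits)."""
    # Stack logits for the "edge" and "no edge" classes: shape (n, n, 2).
    two_class = torch.stack([edge_logits, torch.zeros_like(edge_logits)], dim=-1)
    samples = F.gumbel_softmax(two_class, tau=tau, hard=hard)[..., 0]
    A = torch.triu(samples, diagonal=1)          # keep it symmetric, no self-loops
    return A + A.t()

# Toy usage: learnable logits over a 4-node graph.
edge_logits = torch.randn(4, 4, requires_grad=True)
A_sampled = sample_sparse_graph(edge_logits)
print(A_sampled)
```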
9.2.2 Low Rank In real-world graphs, similar nodes are likely to group together and form commu-
nities, which should lead to a low-rank adjacency matrix. Recent work also finds that adversarial
attacks tend to increase the rank of the adjacency matrix quickly. Therefore, low rank regularization
is also a useful tool to make graph representation learning more robust:
L𝑙𝑟 = 𝑅𝑎𝑛𝑘 (A). (95)
It is hard to minimize matrix rank directly. A common technique is to optimize the nuclear norm,
which is a convex envelope of the matrix rank:
$\mathcal{L}_{nc} = \|A\|_* = \sum_{i}^{N} \sigma_i, \quad (96)$
where $\sigma_i$ are the singular values of A. Entezari et al. replace the learned adjacency matrix with its rank-r
approximation obtained by singular value decomposition (SVD) to achieve robust graph learning against
adversarial attacks.
9.2.3 Smoothness A common assumption is that connected nodes share similar features, or in
other words, the graph is “smooth” as the difference between local neighbors is small. The following
metric is a natural way to measure graph smoothness:
$\mathcal{L}_{sm} = \frac{1}{2}\sum_{i,j=1}^{N} A_{ij}(x_i - x_j)^2 = \mathrm{tr}\big(\mathbf{X}^\top(\mathbf{D} - \mathbf{A})\mathbf{X}\big) = \mathrm{tr}(\mathbf{X}^\top \mathbf{L}\mathbf{X}), \quad (97)$
where D is the degree matrix of A and L = D − A is called the graph Laplacian. A variant is to use the
normalized graph Laplacian $\hat{\mathbf{L}} = \mathbf{D}^{-\frac{1}{2}}\mathbf{L}\mathbf{D}^{-\frac{1}{2}}$.
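The smoothness penalty in Equation 97 amounts to a few matrix operations; a minimal numpy sketch under the assumption of a dense adjacency matrix follows, checking that the pairwise and trace forms agree.
```python
import numpy as np

def smoothness_penalty(A, X, normalized=False):
    """L_sm = 1/2 * sum_ij A_ij ||x_i - x_j||^2 = tr(X^T L X)  (Equation 97)."""
    L = np.diag(A.sum(axis=1)) - A               # (unnormalized) graph Laplacian
    if normalized:
        d_inv_sqrt = np.diag(1.0 / np.sqrt(np.clip(A.sum(axis=1), 1e-12, None)))
        L = d_inv_sqrt @ L @ d_inv_sqrt          # normalized Laplacian variant
    return np.trace(X.T @ L @ X)

# Toy check: both forms of Equation 97 agree on a small graph.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.random.randn(3, 2)
pairwise = 0.5 * sum(A[i, j] * np.sum((X[i] - X[j]) ** 2)
                     for i in range(3) for j in range(3))
print(np.isclose(pairwise, smoothness_penalty(A, X)))
```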
where 𝑀 = 𝑊𝑑𝑊𝑑⊤ and 𝑊𝑑 is the trainable weights to minimize task-specific objective. Then the
Gaussian kernel is used to obtain the adjacency matrix:
G𝑖 𝑗 = exp(−D(𝑥𝑖 , 𝑥 𝑗 )/(2𝜎 2 )), (99)
𝐴^ = 𝑛𝑜𝑟𝑚𝑎𝑙𝑖𝑧𝑒 (G). (100)
Graph-Revised Convolutional Network [421] (GRCN). GRCN uses a graph revision module to
predict missing edges and revise edge weights through joint optimization on downstream tasks. It
first learns the node embedding with GCN and then calculates pair-wise node similarity with the
dot product as the kernel function.
$Z = GCN_g(A, X), \quad (101)$
$S_{ij} = \langle z_i, z_j \rangle. \quad (102)$
The revised adjacency matrix is the residual summation of the original adjacency matrix 𝐴^ = 𝐴+𝑆.
GRCN also applies a sparsification technique on the similarity matrix 𝑆 to reduce computation cost:
$S_{ij}^{(K)} = \begin{cases} S_{ij}, & S_{ij} \in \mathrm{topK}(S_i) \\ 0, & S_{ij} \notin \mathrm{topK}(S_i) \end{cases}. \quad (103)$
Threshold pruning is also a common strategy for sparsification. For example, CAGCN [472] also
uses dot product to measure node similarity, and refines the graph structure by removing edges
between nodes whose similarity is less than a threshold 𝜏𝑟 and adding edges between nodes whose
similarity is greater than another threshold 𝜏𝑎 .
Defending Graph Neural Networks against Adversarial Attacks [440] (GNNGuard). GNNGuard
measures similarity between a node 𝑢 and its neighbor 𝑣 in the 𝑘-th layer by cosine similarity and
normalizes node similarity at the node level within the neighborhood as follows:
$s_{uv}^{k} = \frac{h_u^k \odot h_v^k}{\|h_u^k\|_2 \,\|h_v^k\|_2}, \quad (104)$
$\alpha_{uv}^{k} = \begin{cases} \frac{s_{uv}^k}{\sum_{v \in \mathcal{N}_u} s_{uv}^k} \times \hat{N}_u^k / (\hat{N}_u^k + 1), & \text{if } u \neq v \\ 1 / (\hat{N}_u^k + 1), & \text{if } u = v \end{cases}, \quad (105)$
where $\mathcal{N}_u$ denotes the neighborhood of node $u$ and $\hat{N}_u^k = \sum_{v \in \mathcal{N}_u} \|s_{uv}^k\|_0$. To stabilize GNN training,
it also proposes a layer-wise graph memory by keeping part of the information from the previous
layer in the current layer. Similar to GNNGuard, IDGL [52] uses multi-head cosine similarity and
mask edges with node similarity smaller than a non-negative threshold, and HGSL [451] generalizes
this idea to heterogeneous graphs.
Graph Diffusion Convolution [108] (GDC). GDC replaces the original adjacency matrix with
generalized graph diffusion matrix S:
$\mathbf{S} = \sum_{k=0}^{\infty} \theta_k \mathbf{T}^k, \quad (106)$
where $\theta_k$ is the weighting coefficient and $\mathbf{T}$ is the generalized transition matrix. To ensure convergence,
GDC further requires that $\sum_{k=0}^{\infty}\theta_k = 1$ and that the eigenvalues of $\mathbf{T}$ lie in $[0, 1]$. The random
walk transition matrix $\mathbf{T}_{rw} = \mathbf{A}\mathbf{D}^{-1}$ and the symmetric transition matrix $\mathbf{T}_{sym} = \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$ are
two examples. This new graph structure allows graph convolution to aggregate information from a
larger neighborhood. The graph diffusion acts as a smoothing operator to filter out underlying noise.
However, in most cases graph diffusion will result in a dense adjacency matrix 𝑆, so sparsification
technology like top-k filtering and threshold filtering will be applied to graph diffusion. Following
GDC, there are some other graph diffusion proposed. For example, AdaCAD [225] proposes Class-
Attentive Diffusion, which further considers node features and aggregates nodes probably of the
same class among K-hop neighbors. Adaptive diffusion convolution [450] (ADC) learns the optimal
neighborhood size via optimizing a bi-level problem.
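As a concrete instance of Equation 106, the following computes the Personalized PageRank diffusion, where θ_k = α(1−α)^k and T is a random-walk transition matrix, followed by a simple threshold sparsification; this is a sketch of the GDC idea rather than its reference implementation, and the row-stochastic normalization is an assumption.
```python
import numpy as np

def ppr_diffusion(A, alpha=0.15, eps=1e-3):
    """Generalized graph diffusion S = sum_k theta_k T^k with PPR weights
    theta_k = alpha * (1 - alpha)^k, computed in closed form, then sparsified."""
    n = A.shape[0]
    deg = np.clip(A.sum(axis=1), 1e-12, None)
    T = A / deg[:, None]                         # row-normalized random-walk transition matrix
    S = alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * T)   # sum of the geometric series
    S[S < eps] = 0.0                             # threshold sparsification
    return S

# Toy usage on a 4-node path graph.
A = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], dtype=float)
print(ppr_diffusion(A).round(3))
```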
where 𝑊𝑖 (𝑙) and 𝑈 (𝑙) are the learnable weights. GLN then predicts the next adjacency matrix as
follows:
$A^{(l+1)} = \sigma_l\big(M^{(l)} \alpha_l(H_{local}^{(l)}) M^{(l)\top}\big). \quad (109)$
Similarly, GLCN [159] models graph structure with a softmax layer over the inner product
between the difference of node features and a learnable vector. NeuralSparse [457] uses a multi-
layer neural network to generate a learnable distribution from which a sparse graph structure is
sampled. PTDNet [240] prunes graph edges with a multi-layer neural network and penalizes the
number of non-zero elements to encourage sparsity.
Graph Attention Networks [351] (GAT). Besides constructing a new graph to guide the message
passing and aggregation process of GNNs, many recent researchers also leverage the attention
mechanism to adaptively model the relationships between nodes. GAT is the first work to introduce
the self-attention strategy into graph learning. In each attention layer, the attention weight between
two nodes is calculated as the softmax over a combination of linear and non-linear transforms of node
features, i.e., $\alpha_{ij} = \frac{\exp(a(\mathbf{W}\vec{h}_i, \mathbf{W}\vec{h}_j))}{\sum_{k \in \mathcal{N}_i}\exp(a(\mathbf{W}\vec{h}_i, \mathbf{W}\vec{h}_k))}$,
where $\mathcal{N}_i$ denotes the neighborhood of node $i$, $\mathbf{W}$ is a learnable linear transform and $a$ is a pre-defined
attention function. In the original implementation of GAT, 𝑎 is a single-layer neural network with
LeakyReLU:
𝑎(Wℎ®𝑖 , Wℎ®𝑗 ) = LeakyReLU(®a⊤ [Wℎ®𝑖 ||Wℎ®𝑗 ]). (112)
The attention weights are then used to guide the message-passing phase of GNNs:
$\vec{h}_i' = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij} \mathbf{W}\vec{h}_j\Big), \quad (113)$
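A compact sketch of a single GAT attention head following Equations 112-113, assuming a dense adjacency mask; masked positions receive −∞ before the softmax so that attention is restricted to each node's neighborhood, and the toy setup is illustrative.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention layer, dense-adjacency sketch of GAT."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention function a(.)

    def forward(self, X, A):
        H = self.W(X)                                      # (N, out_dim)
        N = H.size(0)
        # Pairwise [Wh_i || Wh_j] followed by LeakyReLU, as in Equation 112.
        pairs = torch.cat([H.unsqueeze(1).expand(N, N, -1),
                           H.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs)).squeeze(-1)        # (N, N) raw attention scores
        e = e.masked_fill(A == 0, float("-inf"))           # attend only over neighbors
        alpha = torch.softmax(e, dim=-1)
        return F.elu(alpha @ H)                            # Equation 113 with sigma = ELU

# Toy usage: 4 nodes with self-loops so every node attends to itself.
A = torch.eye(4) + torch.tensor([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], dtype=torch.float)
X = torch.randn(4, 8)
print(GATLayer(8, 6)(X, A).shape)
```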
where $f(x|\hat{A})$ measures the likelihood of observing $x$ given $\hat{A}$, and $g(\hat{A})$ is the prior distribution of
$\hat{A}$. GLNN uses sparsity and property constraints as the prior, and defines the likelihood function $f$ as:
$f(x|\hat{A}) = \exp(-\lambda_0 x^\top \hat{L} x) \quad (115)$
$\qquad\;\;\, = \exp(-\lambda_0 x^\top (I - \hat{A}) x), \quad (116)$
where $\lambda_0$ is a parameter. This likelihood imposes a smoothness assumption on the learned graph
structure. Some other works also model the adjacency matrix in a probabilistic manner. Bayesian
GCNN [442] adopts a Bayesian framework and treats the observed graph as a realization from a
family of random graphs. Then it estimates the posterior probability of labels given the observed
graph adjacency matrix and features with Monte Carlo approximation. VGCN [83] follows a similar
formulation and estimates the graph posterior through stochastic variational inference.
Graph Sparsification via Meta-Learning [357] (GSML). GSML formulates GSL as a meta-learning
problem and uses bi-level optimization to find the optimal graph structure. The goal is to find a
sparse graph structure that leads to high node classification accuracy at the same time, given
labeled and unlabeled nodes. To achieve this, GSML takes training on the node classification task as
the inner optimization, and targets the outer optimization at the sparsity of the graph structure,
which formulates the following bi-level optimization problem:
$\hat{G}^* = \min_{\hat{G} \in \Phi(G)} L_{sps}\big(f_{\theta^*}(\hat{G}), Y_U\big), \quad (117)$
In this bi-level optimization problem, 𝐺^ ∈ Φ(𝐺) are the meta-parameters and optimized directly
without parameterization. Similarly, LSD-GNN [97] also uses bi-level optimization. It models graph
structure with a probability distribution over graph and reformulates the bi-level program in terms
of the continuous distribution parameters.
9.6 Summary
This section introduces graph structure learning (GSL) and we provide the summary as follows:
• Techniques. GSL aims to learn an optimized graph structure for better graph representations.
It is also used for more robust graph representation against adversarial attacks. According
to the way of edge modeling, we categorize GSL into three groups: metric-based methods,
model-based methods, and direct methods. Regularization is also a commonly used principle
to make the learned graph structure satisfy specific properties including sparsity, low-rank
and smoothness.
• Challenges and Limitations. Since there is no way to access the ground-truth or optimal
graph structure as training data, the learning objective of GSL is either indirect (e.g., performance
on downstream tasks) or manually designed (e.g., sparsity and smoothness). Therefore,
the optimization of GSL is difficult and the performance is often not satisfactory. In addition, many
GSL methods are based on homophily assumption, i.e., similar nodes are more likely to
connect with each other. However, many other types of connection exist in the real-world
which impose great challenges for GSL.
• Future Works. In the future we expect more efficient and generalizable GSL methods to
be applied to large-scale and heterogeneous graphs. Most existing GSL methods focus on
pair-wise node similarities and thus struggle to scale to large graphs. Besides, they often
learn homogeneous graph structure, but in many scenarios graphs are heterogeneous.
multiple academic node and relation types. Many research institutes and academic search engines,
such as Aminer, DBLP, and Microsoft Academic Graph (MAG), have provided open academic social
network datasets for research purposes.
There are multiple applications of graph representation learning on the academic social net-
work. Roughly, they can be divided into three categories–academic entity classification/clustering,
academic relationship prediction, and academic resource recommendation.
• Academic entities usually belong to different classes of research areas. Research of academic
entity classification and clustering aims to categorize these entities, such as papers and
authors, into different classes [74, 283, 368, 432]. In literature, academic networks such as
Cora, CiteSeer, and Pubmed [312] have become the most widely used benchmark datasets for
examining the performance of graph representation learning models on paper classification.
In addition, the author name disambiguation problem [42, 281, 444] is essentially a node
clustering task on co-author networks and is usually solved by graph representation
learning techniques.
• Academic relationship prediction represents the link prediction task on various academic
relations. Typical applications are co-authorship prediction [56, 59] and citation relationship
prediction [161, 362, 425]. Existing methods learn representations of authors and papers and
use the similarity between two nodes to predict the link probability. Besides, some work [228,
456] studies the problem of advisor-advisee relationship prediction in the collaboration
network.
• Various academic recommendation systems have been introduced to retrieve academic
resources for users from large amounts of academic data in recent years. For example, collaborator
recommendation [187, 188, 236] benefits researchers by finding suitable collaborators
for particular topics; paper recommendation [14, 334] helps researchers find relevant papers
on given topics; venue recommendation [257, 424] helps researchers choose appropriate
venues when they submit papers.
The mainstream application research on social media networks via graph representation learning
techniques mainly includes anomaly detection, sentiment analysis, and influence analysis.
• Anomaly detection aims to find strange or unusual patterns in social networks, which
has a wide range of application scenarios, such as malicious attacks [234, 338], emergency
detection [21], and robot discovery [91] in social networks. Unsupervised anomaly detection
usually learns a reconstructed graph to detect those nodes with higher reconstructed error as
the anomaly nodes [3, 452]; Supervised methods model the problem as a binary classification
task on the learned graph representations [259, 458].
• Sentiment analysis, also named opinion mining, aims to mine the sentiments, opinions, and
attitudes of users, which can help enterprises understand customer feedback on products [298, 441]
and help the government analyze the public emotion and make rapid response to public
events [254, 349]. The graph representation learning model is usually combined with RNN-
based [48, 430] or Transformer-based [5, 342] text encoders to incorporate both the user
relationship and textual semantic information.
• Influence analysis usually aims to find several nodes in a social network to initially spread
information such as advertisements, so as to maximize the final spread of information [73, 295].
The core challenge is to model the information diffusion process in the social network. Deep
learning methods [178, 196, 269, 431] usually leverage graph neural networks to learn node
embeddings and diffusion probabilities between nodes.
places for users from a large number of location points. Existing research mainly integrates four
essential characteristics, including spatial influence, temporal influence [322, 376, 453], social
relationships [400], and textual information [402].
• Urban computing is defined as a process of analysis of the large-scale connected urban data
created from city activities of vehicles, human beings, and sensors [272, 273, 323]. Besides the
local-based social network, the urban data also includes physical sensors, city infrastructure,
traffic roads, and so on. Urban computing aims to improve the quality of public management
and the life quality of people living in city environments. Typical applications include traffic
congestion prediction [160, 398], urban mobility analysis [35, 414], and event detection [326, 423].
10.5 Summary
This section introduces social analysis by graph representation learning and we provide the
summary as follows:
• Techniques. Social networks, generated by human social activities, such as communication,
collaboration, and social interactions, typically involve massive and heterogeneous data, with
different types of attributes and properties that can change over time. Thus, social network
analysis is a field of study that explores the techniques to understand and analyze the complex
attributes, heterogeneous structures, and dynamic information of social networks. Social
network analysis typically learns low-dimensional graph representations that capture the
essential properties and patterns of the social network data, which can be used for various
downstream tasks, such as classification, clustering, link prediction, and recommendation.
• Challenges and Limitations. Despite the structural heterogeneity in social networks
(nodes and relations have different types), with the technological advances in social media,
the node attributes have become more heterogeneous now, containing text, video, and images.
Also, the large-scale problem is a pending issue in social network analysis. The data in
the social network has increased exponentially in past decades, containing a high density
of topological links and a large amount of node attribute information, which brings new
challenges to the efficiency and effectiveness of traditional network representation learning
on the social network. Lastly, social networks are often dynamic, which means the network
information usually changes over time, and this temporal information plays a significant
role in many downstream tasks, such as recommendations. This brings new challenges to
representation learning on social networks in incorporating temporal information.
• Future Works. Recently, multi-modal big pre-training models that can fuse information
from different modalities have gained increasing attention [282, 289]. These models can
obtain valuable information from a large amount of unlabeled data and transfer it to various
downstream analysis tasks. Moreover, Transformer-based models have demonstrated better
effectiveness than RNNs in capturing temporal information. In the future, there is potential
for introducing multi-modal big pre-training models in social network analysis. Also, it is
important to make the models more efficient for network information extraction and use
lightweight techniques like knowledge distillation to further enhance the applicability of the
models. These advancements can lead to more effective social network analysis and enable
the development of more sophisticated applications in various domains.
features of the molecules. To address this problem, graph representation learning has been widely
applied to molecular property prediction. A molecule can be represented as a graph where nodes
stand for atoms and edges stand for atom-bonds (ABs). Graph-level molecular representations are
learned via message passing mechanism to incorporate the topological information. The represen-
tations are then utilized for the molecular property prediction tasks.
Specifically, a molecule is denoted as a topological graph G = (V, E), where V = {𝑣𝑖 |𝑖 =
1, . . . , |G|} is the set of nodes representing atoms. A feature vector x𝑖 is associated with each
node 𝑣𝑖 indicating its type such as Carbon, Nitrogen. E = {𝑒𝑖 𝑗 |𝑖, 𝑗 = 1, . . . , |G|} is the set of edges
connecting two nodes (atoms) 𝑣𝑖 and 𝑣 𝑗 representing atom bonds. Graph representation learning
methods are used to obtain the molecular representation h G . Then downstream classification or
regression layers 𝑓 (·) are applied to predict the probability of target property of each molecule
𝑦 = 𝑓 (h G ).
In Section 11.1, we introduce four types of molecular properties that graph representation learning can
address and their corresponding datasets. Section 11.2 reviews the graph representation learning
backbones applied to molecular property prediction. Strategies for training molecular property
prediction methods are listed in Section 11.3.
• First, the chemical bonds are taken into consideration carefully. For example, Ma et al. [247]
use an additional edge GNN to model the chemical bonds separately. Specifically, given an
edge (𝑣, 𝑤), they formulate an Edge-based GNN as:
$\mathbf{m}_{vw}^{(k)} = \mathrm{AGG}_{\mathrm{edge}}\big(\{\mathbf{h}_{vw}^{(k-1)}, \mathbf{h}_{uv}^{(k-1)}, \mathbf{x}_u \mid u \in \mathcal{N}_v \setminus w\}\big), \quad \mathbf{h}_{vw}^{(k)} = \mathrm{MLP}_{\mathrm{edge}}\big(\{\mathbf{m}_{vw}^{(k)}, \mathbf{h}_{vw}^{(0)}\}\big), \quad (119)$
where $\mathbf{h}_{vw}^{(0)} = \sigma(\mathbf{W}_{e}^{in} \mathbf{e}_{vw})$ is the input state of the Edge-based GNN, $\mathbf{W}_{e}^{in} \in \mathbb{R}^{d_{hid} \times d_e}$ is the
input weight matrix. PotentialNet [90] further uses different message passing operations for
different edge types.
• Second, motifs in molecular graphs play an important role in molecular property prediction.
GSN [25] leverages substructure encoding to construct a topologically-aware message passing
method. Each node $v$ updates its state $\mathbf{h}_v^t$ by combining its previous state with the aggregated
messages. Besides structural knowledge, some works further align a molecular graph with its
corresponding literature via a contrastive objective:
$\ell_i = -\log\frac{\exp(\mathrm{sim}(\mathbf{z}_i^G, \mathbf{z}_i^T)/\tau)}{\sum_{j=1}^{N}\exp(\mathrm{sim}(\mathbf{z}_i^G, \mathbf{z}_j^T)/\tau)}, \quad (122)$
where $\mathbf{z}_i^G, \mathbf{z}_i^T$ are the representations of a molecule and its corresponding literature.
prediction and structural similarity prediction. MGSSL [447] provides a motif-based generative
pre-training framework making topology prediction and motif generation iteratively. Contrastive
methods learn graph representations by pulling views from the same graph close and pushing
views from different graphs apart. Different views of the same graph are constructed by graph
augmentation or leveraging the 1D SMILES and 3D structure. MolCLR [373] augments molecular
graphs by atom masking, bond deletion and subgraph removal and maximizes the agreement
between the original molecular graph and augmented graphs. Fang et al. [88] uses a chemical
knowledge graph to guide the graph augmentation. SMICLR [279] uses contrastive learning across
SMILES and 2D molecular graphs. GeomGCL [215] leverages graph contrastive learning to capture
the geometry of the molecule across 2D and 3D views. Self-supervised learning can also be combined
with few-shot learning to fully leverage the hierarchical information in the training set [170].
11.4 Summary
This section introduces graph representation learning in molecular property prediction and we
provide the summary as follows:
• Techniques. For molecular property prediction, a molecule is represented as a graph whose
nodes are atoms and edges are atom-bonds (ABs). GNNs such as GCN, GAT, and GraphSAGE
are adopted to learn the graph-level representation. The representations are then fed into a
classification or regression head for the molecular property prediction tasks. Many works
guide the model structure design with medical domain knowledge including chemical bond
features, motif features, different modalities of molecular representation, chemical knowledge
graph and literature. Due to the scarcity of available molecules with desired properties,
few-shot learning and contrastive learning are used to train molecular property prediction
model, so that the model can leverage the information in large unlabeled dataset and can be
adapted to new tasks with a few examples.
• Challenges and Limitations. Despite the great success of graph representation learning
in molecular property prediction, the methods still have limitations: 1) Few-shot molecular
property prediction is not fully explored. 2) Most methods depend on training with labeled
data, but neglect chemical domain knowledge.
• Future Works. In the future, we expect that: 1) More few-shot learning and zero-shot
learning methods are studied for molecular property prediction to solve the data scarcity
problem. 2) Heterogeneous data can be fused for molecular property prediction. There are a
large amount of heterogeneous data about molecules such as knowledge graphs, molecule
descriptions and property descriptions. They can be considered to assist molecular prop-
erty prediction. 3) Chemical domain knowledge can be leveraged for the prediction model.
For example, when we perform affinity prediction, we can consider molecular dynamics
knowledge.
generation task. Molecular generation is intrinsically a de novo task, where molecular structures
are generated from scratch to navigate and sample from the vast chemical space. Therefore, this
chapter does not discuss tasks that restrict chemical structures a priori, such as docking [103, 332]
and conformation generation [318, 467].
On the other hand, binding-based methods generate drug-like molecules (aka. ligands) accord-
ing to the binding site (aka. binding pocket) of a protein receptor. Drawing inspirations from the
lock-and-key model for enzyme action [96], works like LiGAN [290] and DESERT [239] use 3D
density grids to fit the density surface between the ligand and the receptor, encoded by 3D-CNNs.
Meanwhile, a growing amount of literature has adopted G3D for representing ligand and receptor
molecules, because G3D more accurately depicts molecular structures and atomistic interactions
both within and between the ligand and the receptor. Representative works include 3D-SBDD [241],
GraphBP [230], Pocket2Mol [275], and DiffSBDD [309]. GraphBP shares a similar workflow with
G-SphereNet, except that the receptor atoms are also incorporated into G3D to depict the 3D
geometry at the binding pocket.
Atom-based vs. fragment-based. Molecules are inherently hierarchical structures. At the
atomistic level, molecules are represented by encoding atoms and bonds. At a coarser level,
molecules can also be represented as molecular fragments like functional groups or chemical
sub-structures. Both the composition and the geometry are fixed within a given fragment, e.g.,
the planar peptide-bond (–CO–NH–) structure. Fragment-based generation effectively reduces the
degree of freedom (DOF) of chemical structures, and injects well-established knowledge about
molecular patterns and reactivity. JT-VAE [165] decomposes 2D molecular graph G2D into a junction-
tree structure T , which is further encoded via tree message-passing. DeepScaffold [216] expands
the provided molecular scaffold into 3D molecules. L-Net [217] adopts a graph U-Net architecture
and devises a custom three-level node clustering scheme for pooling and unpooling operations in
molecular graphs. A number of works have also emerged lately for fragment-based generation in
the binding-based setting, including FLAG [448] and FragDiff [274]. FLAG uses a regression-based
approach to sequentially decide the type and torsion angle of the next fragment to be placed at the
binding site, and finally optimizes the molecule conformation via a pseudo-force field. FragDiff
also adopts a sequential generation process but uses a diffusion model to determine the type and
pose of each fragment in one go.
where 𝐺 (·) is the generator function and 𝐷 (·) is the discriminator function. For example, Mol-
GAN [65] encodes G2D with R-GCN, trains 𝐷 and 𝐺 with improved W-GAN [9], and uses rein-
forcement learning to generate attributed molecules, where the score function is assigned from
RDKit [198] and chemical validity.
Variational auto-encoder (VAE). In VAE [180], the decoder parameterizes the conditional
likelihood distribution 𝑝𝜃 (𝒙 |𝒛), and the encoder parameterizes an approximate posterior distribution
𝑞𝜙 (𝒛|𝒙) ≈ 𝑝𝜃 (𝒛|𝒙). The model is optimized by the evidence lower bound (ELBO), consisting of the
reconstruction loss term and the distance loss term:
𝑝𝜃 (𝒙, 𝒛)
(124)
max L𝜃,𝜙 (𝒙) := E𝒛∼𝑞𝜙 ( · |𝒙) ln = ln 𝑝𝜃 (𝒙) − 𝐷 KL 𝑞𝜙 (·|𝒙) ∥𝑝𝜃 (·|𝒙) .
𝜃,𝜙 𝑞𝜙 (𝒛|𝒙)
They then learn to reverse the diffusion process to construct desired data samples from the noise:
$p_\theta(\boldsymbol{x}_{0:T}) = p(\boldsymbol{x}_T)\prod_{t=1}^{T} p_\theta(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t), \quad (131)$
𝑝𝜃 (𝒙 𝑡 −1 |𝒙 𝑡 ) = N (𝒙 𝑡 −1 ; 𝝁 𝜃 (𝒙 𝑡 , 𝑡), 𝚺𝜃 (𝒙 𝑡 , 𝑡)), (132)
while the models are trained using a variational lower bound. Diffusion models have been applied
to generate unbounded 3D molecules in EDM [142], and binding-specific ligands in DiffSBDD [309]
and DiffBP [226]. Diffusion can also be applied to generate molecular fragments in autoregressive
models, as is the case with FragDiff [274].
other hand, existing models have heavily relied on expert-crafted metrics to evaluate the quality of
the generated molecules, such as QED and Vina [82], rather than actual wet lab experiments.
Future Works. Besides the structural and geometric attributes described in this chapter, an
even more extensive array of data can be applied to aid molecular generation, including chemical
reactions and medical ontology. These data can be organized into a heterogeneous knowledge
graph to aid the extraction of high-quality molecular representations. Furthermore, high-throughput
experimentation (HTE) should be adopted to realistically evaluate the synthesizability and
druggability of the generated molecules in the wet lab. Concrete case studies, such as the design of
potential inhibitors to SARS-CoV-2 [218], would be even more encouraging, bringing new insights
into leveraging these molecular generative models to facilitate the design and fabrication of potent
and applicable drug molecules in the pharmaceutical industry.
where $f(\cdot, \cdot)$ is the similarity function, e.g., the inner product, cosine similarity, or a multi-layer perceptron,
which takes the representations of $u$ and $i$ and calculates the preference score $\hat{x}_{u,i}$.
When it comes to adapting graph representation learning in recommender systems, a key step is
to construct graph-structured data from the interaction set 𝑋 . Generally, a graph is represented
as $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{V}, \mathcal{E}$ denote the sets of vertices and edges respectively. According to the
construction of $\mathcal{G}$, we can categorize the existing works into three parts, which are
introduced in the following subsections. A summary is provided in Table 11.
and simplifying the message function. With the development of disentangled representation learning,
works like DGCF [369] introduce disentangled graph representation learning to
represent users and items from multiple disentangled perspectives.
13.1.2 Graph Propagation Scheme A common practice is to follow the traditional message-passing
neural networks (MPNNs) and design the graph propagation method accordingly. GC-MC adopts vanilla
GCNs to encode the user-item bipartite graph. NGCF enhances GCNs by considering the affinity
between users and items. The message function of NGCF from node $j$ to $i$ is formulated as:
$m_{i \leftarrow j} = \frac{1}{\sqrt{|\mathcal{N}_i||\mathcal{N}_j|}}\big(W_1 e_j + W_2(e_i \odot e_j)\big), \quad m_{i \leftarrow i} = W_1 e_i, \quad (134)$
where $W_1, W_2$ are trainable parameters and $e_i$ represents $i$'s representation from the previous layer;
the matrix form can be derived accordingly.
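To illustrate Equation 134, the following computes NGCF-style messages for a single target node from dense embeddings; it is a per-node sketch with illustrative tensor names, not the batched sparse implementation used in practice.
```python
import torch

def ngcf_messages(e_i, e_neighbors, n_i, n_neighbors, W1, W2):
    """Messages of Equation 134 for target node i.

    e_i:          (d,)   embedding of node i from the previous layer
    e_neighbors:  (k, d) embeddings of i's neighbors j
    n_i:          int    |N_i|, size of i's neighborhood
    n_neighbors:  (k,)   |N_j| for every neighbor j
    """
    norm = 1.0 / torch.sqrt(torch.tensor(float(n_i)) * n_neighbors.float())      # (k,)
    m_from_j = norm.unsqueeze(-1) * (e_neighbors @ W1.t() + (e_i * e_neighbors) @ W2.t())
    m_from_i = e_i @ W1.t()                      # self message m_{i<-i}
    return m_from_i + m_from_j.sum(dim=0)        # aggregated next-layer signal

# Toy usage: node i with 3 neighbors and 8-dimensional embeddings.
d = 8
W1, W2 = torch.randn(d, d), torch.randn(d, d)
e_i, e_neighbors = torch.randn(d), torch.randn(3, d)
print(ngcf_messages(e_i, e_neighbors, 3, torch.tensor([2, 4, 3]), W1, W2).shape)
```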
When an interaction takes place, e.g., a user clicks a particular item, there could be multiple
intentions behind the observed interaction. Thus it is necessary to consider the various disentangled
intentions of users and items. DGCF proposes a cross-intent embedding propagation scheme
on the graph, inspired by the dynamic routing algorithm of capsule networks [302]. To formulate, the
propagation process maintains a set of routing logits $\tilde{S}_k(u, i)$ for each user $u$. The weighted sum
in the $t$-th iteration uses $\mathcal{L}_k^t(u, i)$, the Laplacian-normalized version of $S_k^t(u, i)$, formulated as:
$\mathcal{L}_k^t(u, i) = \frac{S_k^t(u, i)}{\sqrt{\big[\sum_{i' \in \mathcal{N}_u} S_k^t(u, i')\big] \cdot \big[\sum_{u' \in \mathcal{N}_i} S_k^t(u', i)\big]}}. \quad (138)$
13.1.3 Node Representations After the graph propagation module outputs node-level representa-
tions, there are multiple methods to leverage node representations for recommendation tasks. A
plain solution is to apply a readout function on layer outputs like the concatenation operation used
by NGCF:
𝑒 ∗ = 𝐶𝑜𝑛𝑐𝑎𝑡 (𝑒 ( 0) , ..., 𝑒 (𝐿) ) = 𝑒 ( 0) ∥...∥𝑒 (𝐿) . (139)
However, such a readout over layers neglects the relationship between the target
item and the current user. A general solution is to use the attention mechanism [350] to reweight and
aggregate the node representations. SR-GNN adopts a soft-attention mechanism to model the item-
item relationship:
$\alpha_i = \mathbf{q}^\top \sigma(W_1 e_t + W_2 e_i + c), \quad s_g = \sum_{i=1}^{n-1} \alpha_i e_i, \quad (140)$
Besides InfoNCE, there exist several other ways to combine node representations from different
views. For instance, MBHT applies the attention mechanism to fuse multiple semantics, DisenPOI
adapts the Bayesian personalized ranking (BPR) loss [293] as a soft estimator for contrastive learning,
and KBGNN applies pair-wise similarities to ensure the consistency of the two views.
Transition graphs that capture user behavior patterns have been demonstrated to be important
for session-based recommendation [210, 232]. SR-GNN and GC-SAN [385, 399] propose to
leverage transition graphs and apply attention-based GNNs to capture the sequential information
for session-based recommendation. FGNN [287] formulates the recommendation within a session
as a graph classification problem to predict the next item for an anonymous user. GAG [288] and
GCE-GNN [375] further extend the model to capture global embeddings among multiple session
graphs.
13.2.2 Session Graph Propagation Since the session graphs are directed item graphs, there have
been multiple session graph propagation methods to obtain node representations on session graphs.
SR-GNN leverages Gated Graph Neural Networks (GGNNs) to obtain sequential information
from a given session graph adjacency 𝐴𝑠 = [𝐴𝑠(𝑖𝑛) ; 𝐴𝑠(𝑜𝑢𝑡 ) ] and item embedding set {𝑒𝑖 }:
𝑎𝑡 = 𝐴𝑠 [𝑒 1, ..., 𝑒𝑡 −1 ]𝑇 𝐻 + 𝑏, (142)
𝑧𝑡 = 𝜎 (𝑊𝑧 𝑎𝑡 + 𝑈𝑧 𝑒𝑡 −1 ), (143)
𝑟𝑡 = 𝜎 (𝑊𝑟 𝑎𝑡 + 𝑈𝑟 𝑒𝑡 −1 ), (144)
𝑒˜𝑡 = tanh(𝑊𝑜 𝑎𝑡 + 𝑈𝑜 (𝑟𝑡 ⊙ 𝑒𝑡 −1 )), (145)
𝑒𝑡 = (1 − 𝑧𝑡 ) ⊙ 𝑒𝑡 −1 + 𝑧𝑡 𝑒˜𝑡 , (146)
where the $W$s and $U$s are trainable parameters. GC-SAN extends GGNN by calculating the initial state $a_t$
separately to better exploit transition information:
𝑎𝑡 = 𝐶𝑜𝑛𝑐𝑎𝑡 (𝐴𝑠(𝑖𝑛) ( [𝑒 1, ..., 𝑒𝑡 −1𝑊𝑎(𝑖𝑛) ] + 𝑏 (𝑖𝑛) ), 𝐴𝑠(𝑜𝑢𝑡 ) ( [𝑒 1, ..., 𝑒𝑡 −1𝑊𝑎(𝑜𝑢𝑡 ) ] + 𝑏 (𝑜𝑢𝑡 ) )). (147)
13.3 HyperGraph
13.3.1 Hypergraph Topology Construction Motivated by the idea of modeling hyper-structures
and high-order correlations among nodes, hypergraphs [93] are proposed as extensions of the
commonly used graph structures. For graph-based recommender systems, a common practice is
to construct hyper structures among the original user-item bipartite graphs. To be specific, an
incidence matrix of a graph with vertex set V is presented as a binary matrix 𝐻 ∈ {0, 1} |V |×| E | ,
where E represents the set of hyperedges. Each entry ℎ(𝑣, 𝑒) of 𝐻 depicts the connectivity between
vertex 𝑣 and hyperedge 𝑒:
$h(v, e) = \begin{cases} 1 & \text{if } v \in e \\ 0 & \text{if } v \notin e \end{cases}. \quad (148)$
Given the formulation of hypergraphs, the degrees of vertices and hyperedges of $H$ can then be
defined with two diagonal matrices $D_v \in \mathbb{N}^{|\mathcal{V}| \times |\mathcal{V}|}$ and $D_e \in \mathbb{N}^{|\mathcal{E}| \times |\mathcal{E}|}$, where
$D_v(i; i) = \sum_{e \in \mathcal{E}} h(v_i, e), \quad D_e(j; j) = \sum_{v \in \mathcal{V}} h(v, e_j). \quad (149)$
Hypergraph Neural Networks (HGNNs) [93, 151, 460] have been shown to be capable of capturing
the high-order connectivity between nodes. HyperRec [361] first attempts to leverage hypergraph
structures for sequential recommendation by connecting items with hyperedges according to the
interactions with users during different time periods. DHCF [158] proposes to construct hypergraphs
for users and items respectively based on certain rules, to explicitly capture the collaborative
similarities via HGNNs. MBHT [411] combines hypergraphs with
low-rank self-attention mechanism to capture the dynamic heterogeneous relationships between
users and items. HCCF [392] uses the contrastive information between hypergraph and interaction
graph to enhance the recommendation performance.
13.3.2 Hyper Graph Message Passing With the development of HGNNs, previous works have
proposed different variants of HGNN to better exploit hypergraph structures. A classic high-order
hypergraph convolution on a fixed hypergraph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ with hyper adjacency $H$ is given by:
$g \star X = D_v^{-1/2} H D_e^{-1} H^\top D_v^{-1/2} X \Theta, \quad (150)$
where 𝐷 𝑣 , 𝐷𝑒 are degree matrices of nodes and hyperedges, Θ denotes the convolution kernel. For
hyper adjacency matrix 𝐻 , DHCF refers to a rule-based hyperstructure via k-order reachable rule,
where nodes in the same hyperedge group are k-order reachable to each other:
𝐴𝑢𝑘 = min(1, power(𝐴 · 𝐴𝑇 , 𝑘)), (151)
where 𝐴 denotes the graph adjacency matrix. By considering the situations where 𝑘 = 1, 2, the
matrix formulation of the hyper connectivity of users and items are calculated with:
$H_u = A \,\|\, \big(A(A^\top A)\big), \quad H_i = A^\top \,\|\, \big(A^\top(AA^\top)\big), \quad (152)$
which depicts the dual hypergraphs for users and items.
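The hypergraph convolution in Equation 150 can be sketched directly from the incidence matrix; the following numpy fragment assumes a small toy hypergraph and a random convolution kernel Θ, and is only meant to illustrate the operator.
```python
import numpy as np

def hypergraph_conv(H, X, Theta):
    """g * X = D_v^{-1/2} H D_e^{-1} H^T D_v^{-1/2} X Theta   (Equation 150)."""
    dv = H.sum(axis=1)                            # vertex degrees (Equation 149)
    de = H.sum(axis=0)                            # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(dv))
    De_inv = np.diag(1.0 / de)
    return Dv_inv_sqrt @ H @ De_inv @ H.T @ Dv_inv_sqrt @ X @ Theta

# Toy hypergraph: 4 vertices, 2 hyperedges ({0,1,2} and {2,3}).
H = np.array([[1, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
X = np.random.randn(4, 5)
Theta = np.random.randn(5, 3)
print(hypergraph_conv(H, X, Theta).shape)          # (4, 3)
```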
HCCF proposes to construct a learnable hypergraph to depict the global dependencies between
nodes on the interaction graph. To be specific, the hyperstructure is factorized with two low-rank
embedding matrices to achieve model efficiency:
𝐻𝑢 = 𝐸𝑢 · 𝑊𝑢 , 𝐻 𝑣 = 𝐸 𝑣 · 𝑊𝑣 . (153)
13.5 Summary
This section introduces the application of different kinds of graph neural networks in recommender
systems and can be summarized as follows:
• Graph Constructions. There are multiple options for constructing graph-structured data
for a variety of recommendation tasks. For instance, the user-item bipartite graphs reveal
the high-order collaborative similarity between users and items, and the transition graph
is suitable for encoding sequential information in clicking history. These diversified graph
structures provide different views for node representation learning on users and items, and
can be further used for downstream ranking tasks.
• Challenges and Limitations. Though the superiority of graph-structured data and GNNs
over traditional methods has been widely illustrated, there are still unsolved challenges.
For example, the computational cost of graph methods is normally high and thus
unacceptable in real-world applications. The data sparsity and cold-start issues in graph-based
recommendation remain to be explored as well.
• Future Works. In the future, the efficient solution of applying GNNs in recommendation
tasks is expected. There are also some attempts [87, 371] on incorporating temporal informa-
tion in graph representation learning for sequential recommendation tasks.
• Traffic Flow Forecasting. Traffic flow forecasting plays an indispensable role in ITS [72,
291], which involves leveraging spatial-temporal data collected by various sensors to gain
insights into future traffic patterns and behaviors. Classic methods, like autoregressive
integrated moving average (ARIMA) [27], support vector machine (SVM) [136] and recurrent
neural networks (RNN) [63] can only model time series separately without considering their
spatial connections. To address this issue, graph neural networks (GNNs) have emerged as
a powerful approach for traffic forecasting due to their strong ability of modeling complex
graph-structured correlations [30, 160, 393].
• Trajectory Prediction. Trajectory prediction is a crucial task in various applications, such
as autonomous driving and traffic surveillance, which aims to forecast future positions of
agents in the traffic scene. However, accurately predicting trajectories can be challenging, as
the behavior of an agent is influenced not only by its own motion but also by interactions
with surrounding objects. To address this challenge, Graph Neural Networks (GNNs) have
emerged as a promising tool for modeling complex interactions in trajectory prediction
[34, 263, 336, 462]. By representing the scene as a graph, where each node corresponds to an
agent and the edges capture interactions between them, GNNs can effectively capture spatial
dependencies and interactions between agents. This makes GNNs well-suited for predicting
trajectories that accurately capture the behavior of agents in complex traffic scenes.
• Traffic Anomaly Detection. Anomaly detection is an essential support for ITS. There are
lots of traffic anomalies in daily transportation systems, for example, traffic accidents, extreme
weather and unexpected situations. Handling these traffic anomalies timely can improve the
service quality of public transportation. The main difficulty of traffic anomaly detection lies in the
highly entangled spatial-temporal characteristics of traffic data. The criteria and influence of
traffic anomalies vary across locations and times. GNNs have been introduced and achieved
success in this domain [53, 68, 69, 434].
• Others. Traffic demand prediction aims to estimate the future amount of travel at a given
location. It is of vital practical significance for resource scheduling in ITS. By
using GNNs, the spatial dependencies of demands can be revealed [410, 412]. Moreover,
urban vehicle emission analysis is also considered in recent work, which is closely related to
environmental protection and gains increasing researcher attention [405].
where 𝑑𝑖 𝑗 is the distance between node 𝑖 and 𝑗, and 𝜎 and 𝜖 are two hyperparameters to control the
distribution and the sparsity of the matrix.
Another kind of fixed adjacency matrix is the similarity-based matrix. Strictly speaking, the similarity
matrix is not an adjacency matrix: it is constructed according to the similarity of two nodes, which
means that neighbors in the similarity graph may be far away from each other in the real world. There
are various similarity metrics. For example, many works measure the similarity of two nodes by
their functionality, e.g., the distribution of surrounding points of interest (POIs); the underlying
assumption is that nodes sharing similar functionality may share similar traffic patterns. We can also
define the similarity through historical flow patterns. To compute the similarity of two time
series, a common practice is to use the Dynamic Time Warping (DTW) algorithm [266], which is
superior to other metrics due to its sensitivity to shape similarity rather than point-wise similarity.
Specifically, given two time series 𝑋 = (𝑥 1, 𝑥 2, · · · , 𝑥𝑛 ) and 𝑌 = (𝑦1, 𝑦2, · · · , 𝑦𝑛 ), DTW is a dynamic
programming algorithm defined as
𝐷 (𝑖, 𝑗) = 𝑑𝑖𝑠𝑡 (𝑥𝑖 , 𝑦 𝑗 ) + min (𝐷 (𝑖 − 1, 𝑗), 𝐷 (𝑖, 𝑗 − 1), 𝐷 (𝑖 − 1, 𝑗 − 1)) , (156)
where 𝐷 (𝑖, 𝑗) represents the shortest distance between subseries 𝑋 = (𝑥 1, 𝑥 2, · · · , 𝑥𝑖 ) and 𝑌 =
(𝑦1, 𝑦2, · · · , 𝑦 𝑗 ), and 𝑑𝑖𝑠𝑡 (𝑥𝑖 , 𝑦 𝑗 ) is some distance metric like absolute distance. As a result, 𝐷𝑇𝑊 (𝑋, 𝑌 ) =
𝐷 (𝑛, 𝑛) is set as the final distance between 𝑋 and 𝑌 , which better reflects the similarity of two time
series compared to the Euclidean distance.
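Equation 156 translates directly into a small dynamic program; the following sketch uses the absolute distance as dist and returns D(n, n) as the DTW distance, with a toy pair of shifted sine waves as an example.
```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic Time Warping distance between two series (Equation 156)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])      # dist(x_i, y_j)
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# DTW tolerates temporal shifts: a shifted copy stays close in DTW distance.
x = np.sin(np.linspace(0, 2 * np.pi, 50))
y = np.sin(np.linspace(0, 2 * np.pi, 50) + 0.5)
print(dtw_distance(x, y), np.linalg.norm(x - y))
```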
Dynamic matrix. The pre-defined matrix is sometimes unavailable and cannot reflect complete
information of spatial correlations. The dynamic adaptive matrix is proposed to solve the issue.
The dynamic matrix is learned from input data automatically. To achieve the best prediction
performance, the dynamic matrix will manage to infer the hidden correlations among nodes, more
than those physical connections.
A typical practice is learning adjacency matrix from node embeddings [13]. Let 𝐸𝐴 ∈ R𝑁 ×𝑑 be a
learnable node embedding dictionary, where each row of 𝐸𝐴 represents the embedding of a node,
$N$ and $d$ denote the number of nodes and the dimension of embeddings respectively. The graph
adjacency matrix is defined through the similarities among node embeddings,
$D^{-\frac{1}{2}} A D^{-\frac{1}{2}} = \mathrm{softmax}\big(\mathrm{ReLU}(E_A \cdot E_A^\top)\big), \quad (157)$
where the softmax function performs row-normalization, and $D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is the Laplacian matrix.
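Equation 157 corresponds to only a couple of tensor operations; a minimal PyTorch sketch of the learnable-embedding adjacency follows, with the module name and toy sizes chosen for illustration.
```python
import torch
import torch.nn.functional as F

class AdaptiveAdjacency(torch.nn.Module):
    """Data-adaptive graph: softmax(ReLU(E_A E_A^T)) over learnable node embeddings."""

    def __init__(self, num_nodes, emb_dim):
        super().__init__()
        self.E = torch.nn.Parameter(torch.randn(num_nodes, emb_dim))

    def forward(self):
        scores = F.relu(self.E @ self.E.t())     # pairwise node-embedding similarities
        return F.softmax(scores, dim=1)          # row-normalized adjacency (Equation 157)

# Toy usage: 5 nodes with 10-dimensional learnable embeddings.
adj = AdaptiveAdjacency(num_nodes=5, emb_dim=10)()
print(adj.sum(dim=1))                             # each row sums to 1
```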
14.3.2 Diffusion Convolutional Recurrent Neural Network (DCRNN) [220]. DCRNN is a representa-
tive solution combining graph convolution networks with recurrent neural networks. It captures
spatial dependencies by bidirectional random walks on the graph. The diffusion convolution
operation on a graph is defined as:
$X \ast_{\mathcal{G}} f_\theta = \sum_{k=0}^{K}\big(\theta_{k,1}(D_O^{-1}A)^k + \theta_{k,2}(D_I^{-1}A)^k\big)X, \quad (160)$
where 𝜃 are parameters for the convolution filter, and 𝐷𝑂−1𝐴, 𝐷 𝐼−1𝐴 represent the bidirectional diffu-
sion processes respectively. In term of the temporal dependency, DCRNN utilizes Gated Recurrent
Units (GRU), and replace the linear transformation in the GRU with the diffusion convolution as
follows,
$r^{(t)} = \sigma(\Theta_r \ast_{\mathcal{G}} [X^{(t)}, H^{(t-1)}] + b_r), \quad (161)$
$u^{(t)} = \sigma(\Theta_u \ast_{\mathcal{G}} [X^{(t)}, H^{(t-1)}] + b_u), \quad (162)$
$C^{(t)} = \tanh(\Theta_C \ast_{\mathcal{G}} [X^{(t)}, r^{(t)} \odot H^{(t-1)}] + b_c), \quad (163)$
$H^{(t)} = u^{(t)} \odot H^{(t-1)} + (1 - u^{(t)}) \odot C^{(t)}, \quad (164)$
where $X^{(t)}$ and $H^{(t)}$ denote the input and output at time $t$, $r^{(t)}$ and $u^{(t)}$ are the reset and update gates respectively, and $\Theta_r$, $\Theta_u$, $\Theta_C$ are the parameters of the convolution filters. Moreover, DCRNN employs a sequence-to-sequence architecture to predict future series. Both the encoder and the decoder are constructed with diffusion convolutional recurrent layers. The historical time series are fed into the encoder, and the predictions are generated by the decoder. The scheduled sampling technique is utilized to mitigate the discrepancy between the training and testing distributions.
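To illustrate the spatial operation of DCRNN, the following is a minimal NumPy sketch of the diffusion convolution in Eq. (160) as written above; the dense-matrix formulation, the degree clamping, and the function name are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def diffusion_conv(X, A, theta, K):
    # Diffusion convolution of Eq. (160): sum over k of
    #   (theta[k,0] * (D_O^{-1} A)^k + theta[k,1] * (D_I^{-1} A)^k) @ X,
    # where D_O and D_I are the out-degree and in-degree diagonal matrices.
    # X: (N, C) node signals, A: (N, N) adjacency, theta: (K+1, 2) filter weights.
    N = A.shape[0]
    D_out_inv = np.diag(1.0 / np.maximum(A.sum(axis=1), 1e-8))
    D_in_inv = np.diag(1.0 / np.maximum(A.sum(axis=0), 1e-8))
    P_fwd, P_bwd = D_out_inv @ A, D_in_inv @ A
    T_fwd, T_bwd = np.eye(N), np.eye(N)          # k = 0 powers
    out = np.zeros_like(X, dtype=float)
    for k in range(K + 1):
        out += (theta[k, 0] * T_fwd + theta[k, 1] * T_bwd) @ X
        T_fwd, T_bwd = P_fwd @ T_fwd, P_bwd @ T_bwd
    return out

# Example usage with a random 5-node graph (purely illustrative)
rng = np.random.default_rng(0)
A = (rng.random((5, 5)) < 0.4).astype(float)
X = rng.standard_normal((5, 2))
out = diffusion_conv(X, A, theta=rng.standard_normal((3, 2)), K=2)
```

In DCRNN this operation replaces every linear transformation inside the GRU cell of Eqs. (161)-(164).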
14.3.3 Adaptive Graph Convolutional Recurrent Network (AGCRN) [13]. The focus of AGCRN is two-fold. On the one hand, it argues that temporal patterns are diversified across nodes and thus sharing parameters for every node is inferior; on the other hand, it points out that the pre-defined graph is often based on intuition and may be incomplete for the specific prediction task. To mitigate the two issues, it designs a
Node Adaptive Parameter Learning (NAPL) module to learn node-specific patterns for each traffic
series, and a Data Adaptive Graph Generation (DAGG) module to infer the hidden correlations
among nodes from data and to generate the graph during training. Specifically, the NAPL module
is defined as follows,
$$Z = \left(I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right) X E_{\mathcal{G}} W_{\mathcal{G}} + E_{\mathcal{G}} b_{\mathcal{G}}, \qquad (165)$$
where $X \in \mathbb{R}^{N \times C}$ is the input feature, $E_{\mathcal{G}} \in \mathbb{R}^{N \times d}$ is a node embedding dictionary, $d$ is the embedding dimension ($d \ll N$), and $W_{\mathcal{G}} \in \mathbb{R}^{d \times C \times F}$ is a weight pool. The original parameter $\Theta$ in the graph convolution is replaced by the matrix product $E_{\mathcal{G}} W_{\mathcal{G}}$, and the same operation is applied to the bias. This helps the model capture node-specific patterns from a pattern pool according to the node embeddings. The DAGG module has been introduced in Eq. (157). The whole workflow of AGCRN is formulated as follows,
$$\tilde{A} = \mathrm{softmax}\left(\mathrm{ReLU}\left(E E^T\right)\right), \qquad (166)$$
$$z^{(t)} = \sigma\left(\tilde{A} [X^{(t)}, H^{(t-1)}] E W_z + E b_z\right), \qquad (167)$$
$$r^{(t)} = \sigma\left(\tilde{A} [X^{(t)}, H^{(t-1)}] E W_r + E b_r\right), \qquad (168)$$
$$\hat{H}^{(t)} = \tanh\left(\tilde{A} [X^{(t)}, r^{(t)} \odot H^{(t-1)}] E W_h + E b_h\right), \qquad (169)$$
$$H^{(t)} = z^{(t)} \odot H^{(t-1)} + \left(1 - z^{(t)}\right) \odot \hat{H}^{(t)}. \qquad (170)$$
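A minimal PyTorch sketch of the NAPL convolution in Eq. (165), with the DAGG matrix of Eq. (166) substituted for the pre-defined normalized adjacency, is given below; all sizes and variable names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

N, C, F_out, d = 207, 2, 64, 10                      # illustrative sizes (d << N)
X = torch.randn(N, C)                                # input features at one time step
E_G = torch.nn.Parameter(torch.randn(N, d))          # node embedding dictionary
W_G = torch.nn.Parameter(torch.randn(d, C, F_out))   # weight pool
b_G = torch.nn.Parameter(torch.randn(d, F_out))      # bias pool

A_tilde = F.softmax(F.relu(E_G @ E_G.T), dim=1)      # DAGG, Eq. (166)
W = torch.einsum('nd,dcf->ncf', E_G, W_G)            # node-specific weights E_G W_G
b = E_G @ b_G                                        # node-specific biases  E_G b_G
support = (torch.eye(N) + A_tilde) @ X               # (I_N + normalized adjacency) X
Z = torch.einsum('nc,ncf->nf', support, W) + b       # NAPL graph convolution, Eq. (165)
```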
14.3.4 Attention Based Spatial-Temporal Graph Convolutional Networks (ASTGCN) [125]. ASTGCN introduces two kinds of attention mechanisms into spatial-temporal forecasting, i.e., spatial attention and temporal attention. The spatial attention is defined as follows,
$$S = V_S \cdot \sigma\left((X W_1) W_2 (W_3 X)^T + b_S\right), \qquad (171)$$
$$S'_{i,j} = \frac{\exp(S_{i,j})}{\sum_{j=1}^{N} \exp(S_{i,j})}, \qquad (172)$$
where $S'$ is the attention score, and $W_1$, $W_2$, $W_3$ are learnable parameters. A similar construction (Eq. (173)) is applied for the temporal attention, in which the spatial and temporal dimensions of $X$ are transposed. Besides the attention mechanism, ASTGCN also introduces multi-component fusion to enhance the prediction ability. The input of ASTGCN consists of three parts: the recent segments, the daily-periodic segments, and the weekly-periodic segments. The three segments are processed by the main model independently and finally fused with learnable weights:
$$Y = W_h \odot Y_h + W_d \odot Y_d + W_w \odot Y_w, \qquad (174)$$
where $Y_h$, $Y_d$, $Y_w$ denote the predictions of the different segments respectively.
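The spatial attention of Eqs. (171)-(172) can be sketched as follows, assuming the tensor shapes used in the ASTGCN paper ($X \in \mathbb{R}^{N \times C \times T}$) and taking $\sigma$ to be the sigmoid; the sizes and variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

N, C, T = 307, 1, 12                        # illustrative: nodes, channels, time steps
X = torch.randn(N, C, T)                    # one input segment

V_S = torch.nn.Parameter(torch.randn(N, N))
W1 = torch.nn.Parameter(torch.randn(T))
W2 = torch.nn.Parameter(torch.randn(C, T))
W3 = torch.nn.Parameter(torch.randn(C))
b_S = torch.nn.Parameter(torch.randn(N, N))

lhs = torch.einsum('nct,t->nc', X, W1) @ W2          # (X W1) W2 -> (N, T)
rhs = torch.einsum('c,nct->nt', W3, X).T             # (W3 X)^T  -> (T, N)
S = V_S @ torch.sigmoid(lhs @ rhs + b_S)             # Eq. (171), sigma taken as sigmoid
S_prime = F.softmax(S, dim=1)                        # Eq. (172), row-wise normalization
```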
15 Summary
This section introduces graph models for traffic analysis and we provide a summary as follows:
• Techniques. Traffic analysis is a classical spatial temporal data mining task, and graph
models play a vital role for extracting spatial correlations. Typical procedures include the
graph construction, spatial dimension operations, temporal dimension operations and the
information fusion. There are multiple implementations for each procedure, each of which has its strengths and weaknesses. By combining different implementations, various
kinds of traffic analysis models can be created. Choosing the right combination of procedures
and implementations is critical for achieving accurate and reliable traffic analysis results.
• Challenges and Limitations. Despite the remarkable success of graph representation
learning in traffic analysis, there are still several challenges that need to be addressed in
current studies. Firstly, external data, such as weather and calendar information, are not
well-utilized in current models, despite their close relation to traffic status. The challenge lies
in how to effectively fuse heterogeneous data to improve traffic analysis accuracy. Secondly,
the interpretability of models has been underexplored, which could hinder their deployment
in real-world transportation systems. Interpretable models are crucial for building trust and
understanding among stakeholders, and more research is needed to develop models that are
both accurate and interpretable. Addressing these challenges will be critical for advancing
the state-of-the-art in traffic analysis and ensuring the deployment of effective transportation
systems.
• Future Works. In the future, we anticipate that more data sources will be available for traffic
analysis, enabling a more comprehensive understanding of real-world traffic scenes. From
data collection to model design, there is still a lot of work to be done to fully leverage the
potential of GNNs in traffic analysis. In addition, we expect to see more creative applications
of GNN-based traffic analysis, such as designing traffic light control strategies, which can
help to improve the efficiency and safety of transportation systems. To achieve these goals, it
is necessary to continue advancing the development of GNN-based models and exploring new
ways to fuse diverse data sources. Additionally, there is a need to enhance the interpretability
of models and ensure their applicability to real-world transportation systems. We believe that
these efforts will contribute to the continued success of traffic analysis and the development
of intelligent transportation systems.
learning becomes another possibility as well. The major difference between adversarial reprogramming and adversarial attacks lies in whether there is a particular target task after feeding adversarial samples to the model. An adversarial attack requires only small modifications to the input data samples and is considered successful once the result is influenced. Under the adversarial reprogramming setting, however, the task succeeds if and only if the influenced results can be used for another desired task.
That is to say, without changing the model's inner structure or fine-tuning its parameters, we might be able to use pre-trained graph models for tasks that these models were not originally designed to solve. In other deep learning fields, adversarial reprogramming is normally achieved by carefully encoding the input and cleverly mapping the output. On some graph datasets, such as chemical and biological datasets, pre-trained models are already available. Therefore, there is a possibility that adversarial reprogramming could be applied to graphs in the future.
16.1.4 Generalizing to Out-of-Distribution Data. In order to perform better on unobserved data, the representations we learn should ideally be able to generalize to out-of-distribution (OOD) data. Being out-of-distribution is not identical to being mis-classified: mis-classified samples come from the same distribution as the training data but the model fails to classify them correctly, whereas out-of-distribution refers to the case where a sample comes from a distribution other than that of the training data [138]. Being able to generalize to out-of-distribution data will greatly enhance a model's reliability in real life, and studying out-of-distribution generalized graph representation [206] is still an open field [207]. This is partly because, currently, even the problem of detecting out-of-distribution samples has not been fully solved [138].
In order to handle out-of-distribution samples, we first need to detect which samples belong to this type. Detecting OOD samples is itself somewhat similar to novelty detection or outlier detection [278]. The major difference is whether maintaining a well-performing model on the original task remains part of our goal: novelty detection only cares about identifying the outliers, while OOD detection requires the model to detect outliers while keeping its performance on the original task unharmed.
16.1.5 Interpretability in Graph Representation Learning. The interpretability concern is another limitation that arises when researchers try to apply deep graph representation learning to some of the emerging application fields. For instance, in the field of computational social science, researchers are calling for more effort in integrating explanation and prediction together [141]. The same holds for drug discovery, where being able to explain why one structure is chosen instead of another option is very important [163]. Generally speaking, neural networks operate as complete black boxes to humans unless effort is made to make them interpretable and explainable. Although more and more tasks in many fields are being handled by deep learning methods, the tool remains mysterious to most people; even an expert in deep learning cannot easily explain how the tasks are performed and what the model has learned from the data. This situation reduces the trustworthiness of neural network models, prevents humans from learning more from the models' results, and even limits the potential improvements of the models themselves due to the lack of sufficient feedback to humans.
Seeking better interpretability is not merely a private interest of companies and researchers; in fact, as ethical concerns have grown with more and more black-box decisions being made by AI algorithms, interpretability has become a legal requirement [118].
Various approaches have been applied in pursuit of better interpretability [443]. Existing works either provide post-hoc explanations after the results come out or actively change the model structure to provide better explanations; they explain by providing similar examples, by highlighting attributions of the input features, by making sense of hidden layers and extracting semantics from them, or by extracting logical rules; we also see local explanations that explain particular samples, global explanations that explain the network as a whole, and hybrids of the two. Most of these existing directions make sense in a graph representation learning setting. No consensus has been reached on the best methods for making a model interpretable. Researchers are still actively exploring every possibility, and thus there are plenty of challenges and interesting topics in this direction.
16.1.6 Causality in Graph Representation Learning. In recent years, there has been increasing research on combining causality with machine learning models [148, 252, 296]. It is widely believed that making good use of causality will help models achieve higher performance. However, finding the right way to model causality in many real-world scenarios remains highly challenging.
It is worth noting that the most common kind of graph in causal studies, the "causal graph", is not identical to the kind of graphs we study in deep graph representation learning. Causal graphs are graphs whose nodes are factors and whose links represent causal relations; to date, they are among the most reliable tools for causal inference. Traditionally, causal graphs are defined by human experts, but recent work has shown that neural networks can help with scalable causal graph generation [401]. From this perspective, the story can also go the other way around: besides using causal relations to enhance graph representation learning, it is also possible to use graph representation learning strategies to assist causal studies.
16.1.7 Emerging Application Fields. Besides the above-mentioned directions that address existing challenges in the deep learning world, there are many emerging application fields that naturally come with graph-structured data, for instance, social network analysis and drug discovery. Due to the nature of the data, social network interactions and drug molecule structures can be easily depicted as graphs. Therefore, deep graph representation learning has much to offer in these fields [1, 109, 473].
Some basic problems on social networks are easily solved using graph representation learning strategies, including node classification, link prediction, graph classification, and so on. In practice, these problem settings correspond to real-world problems such as ideology prediction, interaction prediction, and analyzing a social group. However, social network data typically has many unique features that could prevent general-purpose models from performing well; for instance, social media data can be sparse, incomplete, and extremely imbalanced [454]. On the other hand, people have clear goals when studying social media data, such as controversy detection [19], rumor detection [127, 341], misinformation and disinformation detection [71], or studying the dynamics of the system [181]. There are still a lot of open quests to be conquered where deep graph representation learning can help.
As for drug discovery, researchers are interested in perspectives beyond simply proposing a set of potentially functional structures, which is the common practice today. These other perspectives include obtaining more interpretable results from the model's proposals [163, 280] and considering synthetic accessibility [396]. These directions are important in response to doubts about AI from society [118] as well as from the tradition of chemistry studies [308]. Similar to the challenges faced when combining social science and neural networks, chemical science would also prefer black-box AI models to be interpretable. Some chemical scientists would also prefer AI tools that provide a synthetic route rather than only the target structure itself; in practice, proposing new molecule structures is usually not the bottleneck, but synthesizing them is. There are already some existing works focusing on this problem [85, 156], but so far there is a gap between chemical experiments and AI tools, indicating that there is still plenty of room for improvement.
Some chemistry researchers also find it useful to have material data better organized, given that molecule structures are becoming increasingly complex and massive amounts of research papers describe materials' features from different aspects [355]. This direction might be more closely related to knowledge bases or even database systems. But in a way, given that a polymer structure is typically a node-link graph, graph representation learning might be able to help with dealing with such issues.
16.2.1 Mathematical Proof of Feasibility. It has been a long-standing problem that most existing deep learning approaches lack mathematical proofs of their learnability, bounds, etc. [16, 26]. This problem relates to the difficulty of providing theoretical proofs for complicated structures like neural networks [121].
Currently, most theoretical work aims at establishing theoretical bounds [17, 132, 176]. There are multiple types of bounds under different problem settings, such as: given a known model architecture, with input data satisfying a particular normal distribution, prove that training will converge and provide the estimated number of iterations. Most of the architectures being studied are simple, such as those composed of multi-layer perceptrons (MLPs), or simply the parameter updates of a single fully-connected layer.
In the field of deep graph representation learning, the neural network architectures are typically much more complex than MLPs. Graph neural networks (GNNs), from the very beginning [66, 182], have involved many approximations and simplifications of mathematical theorems. Nowadays, most researchers rely heavily on experimental results: no matter how wild an idea is, as long as it works out in an experiment, say, the training converges and the results are acceptable, the design is accepted. This practice makes the entire field somewhat experiment-oriented or experience-oriented, while a huge gap remains between theoretical proofs and the frontier of deep graph representation learning.
It would be highly beneficial to the whole field if researchers could push forward these theoretical foundations; however, these problems are incredibly challenging.
16.2.2 Combining Spectral Graph Theory. At the theoretical foundations, the idea of graph neural networks [66, 182, 321] originally comes from spectral graph theory [60]. In recent years, many researchers have been investigating possible improvements to graph representation learning strategies by utilizing spectral graph theory [54, 135, 256, 407]. For example, the graph Laplacian is closely related to many properties, such as the connectivity of a graph (see the small numerical illustration below). By studying the properties of the Laplacian, it is possible to prove properties of graph neural network models and to propose better models with desired advantages, such as robustness [100, 300].
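As a concrete instance of this Laplacian–connectivity connection, the sketch below checks that the multiplicity of the zero eigenvalue of $L = D - A$ equals the number of connected components; the example graph and the numerical tolerance are illustrative choices.

```python
import numpy as np

# A basic spectral fact: the multiplicity of the zero eigenvalue of the graph
# Laplacian L = D - A equals the number of connected components of the graph.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)       # two disjoint edges -> 2 components
L = np.diag(A.sum(axis=1)) - A
eigvals = np.linalg.eigvalsh(L)
print(int(np.sum(np.isclose(eigvals, 0.0))))    # prints 2
```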
Spectral graph theory provides many useful insights into graph representation learning from a new perspective, and there is a lot to be done in this direction.
16.2.3 From Graph to Manifolds. Many researchers are devoted to learning graph representations in non-Euclidean spaces [10, 305], that is, embedding and computing in spaces that are not Euclidean, such as hyperbolic and spherical spaces.
Theoretical reasoning and experimental results have shown certain advantages of working on manifolds instead of the standard Euclidean space. It is believed that these advantages come from their ability to capture complex correlations on the surface manifold [461]. Besides, researchers have shown that, by combining standard graph representation learning strategies with manifold assumptions, models better preserve and acquire locality and similarity relationships [101]. Intuitively, two nodes' embeddings may be regarded as far too similar in Euclidean space, while in a non-Euclidean space they are easily distinguishable.
17 Conclusion
In this survey, we present a comprehensive and up-to-date overview of deep graph representation learning. We propose a novel taxonomy of existing algorithms categorized into GNN architectures, learning paradigms, and applications. Technically, we first summarize the main GNN architectures, namely graph convolutions, graph kernel neural networks, graph pooling, and graph transformers. Based on the different training objectives, we then present three types of the most recent advanced learning paradigms, namely supervised/semi-supervised learning on graphs, graph self-supervised learning, and graph structure learning. Then, we provide several promising applications to demonstrate the effectiveness of deep graph representation learning. Last but not least, we discuss future directions in deep graph representation learning that offer potential opportunities.
Acknowledgments
The authors are grateful to the anonymous reviewers for critically reading the manuscript and for giving important suggestions to improve the paper.
This paper is partially supported by the National Key Research and Development Program of
China with Grant No. 2018AAA0101902 and the National Natural Science Foundation of China
(NSFC Grant No. 62276002).
References
[1] Ash Mohammad Abbas. 2021. Social network analysis using deep learning: applications and schemes. Social Network
Analysis and Mining 11, 1 (2021), 1–21.
[2] Amr Ahmed, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and Alexander J Smola. 2013. Distributed
large-scale natural graph factorization. In Proceedings of the 22nd international conference on World Wide Web. 37–48.
[3] Imtiaz Ahmed, Travis Galoppo, Xia Hu, and Yu Ding. 2021. Graph regularized autoencoder and its application
in unsupervised anomaly detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 8 (2021),
4110–4124.
[4] Rami Al-Rfou, Bryan Perozzi, and Dustin Zelle. 2019. Ddgk: Learning graph representations for deep divergence
graph kernels. In The World Wide Web Conference. 37–48.
[5] Barakat AlBadani, Ronghua Shi, Jian Dong, Raeed Al-Sabri, and Oloulade Babatounde Moctard. 2022. Transformer-
Based Graph Convolutional Network for Sentiment Analysis. Applied Sciences 12, 3 (2022), 1316.
[6] Uri Alon and Eran Yahav. 2020. On the bottleneck of graph neural networks and its practical implications. arXiv
preprint arXiv:2006.05205 (2020).
[7] Han Altae-Tran, Bharath Ramsundar, Aneesh S Pappu, and Vijay Pande. 2017. Low data drug discovery with one-shot
learning. ACS central science 3, 4 (2017), 283–293.
[8] Brandon Anderson, Truong Son Hy, and Risi Kondor. 2019. Cormorant: Covariant Molecular Neural Networks. In
NeurIPS, Vol. 32. https://proceedings.neurips.cc/paper/2019/file/03573b32b2746e6e8ca98b9123f2249b-Paper.pdf
[9] Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein generative adversarial networks. In Interna-
tional conference on machine learning. PMLR, 214–223.
[10] Nurul A Asif, Yeahia Sarker, Ripon K Chakrabortty, Michael J Ryan, Md Hafiz Ahamed, Dip K Saha, Faisal R Badal,
Sajal K Das, Md Firoz Ali, Sumaya I Moyeen, et al. 2021. Graph neural network: A comprehensive review on
non-euclidean space. IEEE Access 9 (2021), 60588–60606.
[11] Rim Assouel, Mohamed Ahmed, Marwin H Segler, Amir Saffari, and Yoshua Bengio. 2018. Defactor: Differentiable
edge factorization-based probabilistic graph generation. arXiv preprint arXiv:1811.09766 (2018).
[12] Davide Bacciu, Federico Errica, Alessio Micheli, and Marco Podda. 2020. A gentle introduction to deep learning for
graphs. Neural Networks 129 (2020), 203–221.
[13] Lei Bai, Lina Yao, Can Li, Xianzhi Wang, and Can Wang. 2020. Adaptive graph convolutional recurrent network for
traffic forecasting. Advances in neural information processing systems 33 (2020), 17804–17815.
[14] Xiaomei Bai, Mengyang Wang, Ivan Lee, Zhuo Yang, Xiangjie Kong, and Feng Xia. 2019. Scientific paper recommen-
dation: A survey. Ieee Access 7 (2019), 9324–9339.
[15] Muhammet Balcilar, Pierre Héroux, Benoit Gauzere, Pascal Vasseur, Sébastien Adam, and Paul Honeine. 2021. Breaking
the limits of message passing graph neural networks. In International Conference on Machine Learning. PMLR, 599–608.
[16] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. 2017. Spectrally-normalized margin bounds for neural
networks. Advances in neural information processing systems 30 (2017).
[17] Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. 2019. Nearly-tight VC-dimension and
pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research 20, 1 (2019),
2285–2301.
[18] Simon Batzner, Albert Musaelian, Lixin Sun, Mario Geiger, Jonathan P. Mailoa, Mordechai Kornbluth, Nicola Molinari,
Tess E. Smidt, and Boris Kozinsky. 2021. E(3)-Equivariant Graph Neural Networks for Data-Efficient and Accurate
Interatomic Potentials. arXiv:2101.03164 [physics.comp-ph]
[19] Samy Benslimane, Jérôme Azé, Sandra Bringay, Maximilien Servajean, and Caroline Mollevi. 2022. A text and GNN
based controversy detection method on social media. World Wide Web (2022), 1–27.
[20] Rianne van den Berg, Thomas N Kipf, and Max Welling. 2017. Graph convolutional matrix completion. arXiv preprint
arXiv:1706.02263 (2017).
[21] Tian Bian, Xi Xiao, Tingyang Xu, Peilin Zhao, Wenbing Huang, Yu Rong, and Junzhou Huang. 2020. Rumor detection
on social media with bi-directional graph convolutional networks. In Proceedings of the AAAI conference on artificial
intelligence, Vol. 34. 549–556.
[22] Filippo Maria Bianchi, Daniele Grattarola, and Cesare Alippi. 2020. Spectral clustering with graph neural networks
for graph pooling. In International Conference on Machine Learning. PMLR, 874–883.
[23] Deyu Bo, Xiao Wang, Chuan Shi, and Huawei Shen. 2021. Beyond low-frequency information in graph convolutional
networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 3950–3957.
[24] Karsten M Borgwardt and Hans-Peter Kriegel. 2005. Shortest-path kernels on graphs. In Fifth IEEE international
conference on data mining (ICDM). IEEE, 8–pp.
[25] Giorgos Bouritsas, Fabrizio Frasca, Stefanos P Zafeiriou, and Michael Bronstein. 2022. Improving graph neural network
expressivity via subgraph isomorphism counting. IEEE Transactions on Pattern Analysis and Machine Intelligence
(2022).
[26] Abdesselam Bouzerdoum and Tim R Pattison. 1993. Neural network for quadratic optimization with bound constraints.
IEEE transactions on neural networks 4, 2 (1993), 293–304.
[27] George EP Box and David A Pierce. 1970. Distribution of residual autocorrelations in autoregressive-integrated
moving average time series models. Journal of the American statistical Association 65, 332 (1970), 1509–1526.
[28] Johannes Brandstetter, Rob Hesselink, Elise van der Pol, Erik J Bekkers, and Max Welling. 2022. Geometric and Physical
Quantities improve E(3) Equivariant Message Passing. In ICLR. https://openreview.net/forum?id=_xwr8gOBeV1
[29] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2013. Spectral networks and locally connected
networks on graphs. arXiv preprint arXiv:1312.6203 (2013).
[30] Khac-Hoai Nam Bui, Jiho Cho, and Hongsuk Yi. 2021. Spatial-temporal graph neural network for traffic forecasting:
An overview and open research issues. Applied Intelligence (2021), 1–12.
[31] Deng Cai and Wai Lam. 2020. Graph transformer for graph-to-sequence learning. In Proceedings of the AAAI Conference
on Artificial Intelligence, Vol. 34. 7464–7471.
[32] Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. 2018. A comprehensive survey of graph embedding:
Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering 30, 9 (2018), 1616–1637.
[33] David Camacho, Ángel Panizo-LLedot, Gema Bello-Orgaz, Antonio Gonzalez-Pardo, and Erik Cambria. 2020. The
four dimensions of social network analysis: An overview of research methods, applications, and software tools.
Information Fusion 63 (2020), 88–120.
[34] Defu Cao, Jiachen Li, Hengbo Ma, and Masayoshi Tomizuka. 2021. Spectral temporal graph neural network for
trajectory prediction. In 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 1839–1845.
[35] Jinzhou Cao, Qingquan Li, Wei Tu, Qili Gao, Rui Cao, and Chen Zhong. 2021. Resolving urban mobility networks
from individual travel graphs using massive-scale mobile phone tracking data. Cities 110 (2021), 103077.
[36] Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2015. Grarep: Learning graph representations with global structural
information. In Proceedings of the 24th ACM international on conference on information and knowledge management.
891–900.
[37] Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2016. Deep neural networks for learning graph representations. In
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30.
[38] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020.
End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow,
UK, August 23–28, 2020, Proceedings, Part I 16. Springer, 213–229.
[39] Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. In 2017 ieee
symposium on security and privacy (sp). Ieee, 39–57.
[40] Ines Chami, Sami Abu-El-Haija, Bryan Perozzi, Christopher Ré, and Kevin Murphy. 2022. Machine learning on graphs:
A model and comprehensive taxonomy. Journal of Machine Learning Research 23, 89 (2022), 1–64.
[41] Jatin Chauhan, Deepak Nathani, and Manohar Kaul. 2020. Few-shot learning on graphs via super-classes based on
graph spectral measures. arXiv preprint arXiv:2002.12815 (2020).
[42] Bo Chen, Jing Zhang, Jie Tang, Lingfan Cai, Zhaoyu Wang, Shu Zhao, Hong Chen, and Cuiping Li. 2020. Conna:
Addressing name disambiguation on the fly. IEEE Transactions on Knowledge and Data Engineering (2020).
[43] Chaoqi Chen, Yushuang Wu, Qiyuan Dai, Hong-Yu Zhou, Mutian Xu, Sibei Yang, Xiaoguang Han, and Yizhou Yu. 2022.
A Survey on Graph Neural Networks and Graph Transformers in Computer Vision: A Task-Oriented Perspective.
arXiv preprint arXiv:2209.13232 (2022).
[44] Dexiong Chen, Laurent Jacob, and Julien Mairal. 2020. Convolutional Kernel Networks for Graph-Structured Data.
arXiv preprint arXiv:2003.05189 (2020).
[45] Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. 2020. Measuring and relieving the over-smoothing
problem for graph neural networks from the topological view. In Proceedings of the AAAI conference on artificial
intelligence, Vol. 34. 3438–3445.
[46] Dexiong Chen, Leslie O’Bray, and Karsten Borgwardt. 2022. Structure-aware transformer for graph representation
learning. In International Conference on Machine Learning. PMLR, 3469–3489.
[47] Fenxiao Chen, Yun-Cheng Wang, Bin Wang, and C-C Jay Kuo. 2020. Graph representation learning: a survey. APSIPA
Transactions on Signal and Information Processing 9 (2020), e15.
[48] Guimin Chen, Yuanhe Tian, and Yan Song. 2020. Joint aspect extraction and sentiment analysis with directional graph
convolutional networks. In Proceedings of the 28th international conference on computational linguistics. 272–279.
[49] Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, and Ivan Laptev. 2021. History aware multimodal transformer for
vision-and-language navigation. Advances in Neural Information Processing Systems 34 (2021), 5834–5847.
[50] Ting Chen, Song Bian, and Yizhou Sun. 2019. Are powerful graph neural nets necessary? a dissection on graph
classification. arXiv preprint arXiv:1905.04579 (2019).
[51] Xiaokang Chen, Yuhui Yuan, Gang Zeng, and Jingdong Wang. 2021. Semi-supervised semantic segmentation with
cross pseudo supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
2613–2622.
[52] Yu Chen, Lingfei Wu, and Mohammed Zaki. 2020. Iterative deep graph learning for graph neural networks: Better
and robust node embeddings. Advances in neural information processing systems 33 (2020), 19314–19326.
[53] Zekai Chen, Dingshuo Chen, Xiao Zhang, Zixuan Yuan, and Xiuzhen Cheng. 2021. Learning graph structures with
transformer for multivariate time series anomaly detection in iot. IEEE Internet of Things Journal (2021).
[54] Zhiqian Chen, Fanglan Chen, Lei Zhang, Taoran Ji, Kaiqun Fu, Liang Zhao, Feng Chen, Lingfei Wu, Charu Aggarwal,
and Chang-Tien Lu. 2020. Bridging the gap between spatial and spectral domains: A survey on graph neural networks.
arXiv preprint arXiv:2002.11867 (2020).
[55] Zhengdao Chen, Lei Chen, Soledad Villar, and Joan Bruna. 2020. Can graph neural networks count substructures?
Advances in neural information processing systems 33 (2020), 10383–10395.
[56] Haeran Cho and Yi Yu. 2018. Link prediction for interdisciplinary collaboration via co-authorship network. Social
Network Analysis and Mining 8, 1 (2018), 1–12.
[57] Jeongwhan Choi, Hwangyong Choi, Jeehyun Hwang, and Noseong Park. 2022. Graph neural controlled differential
equations for traffic forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 6367–6374.
[58] Alexandra Chouldechova and Aaron Roth. 2018. The frontiers of fairness in machine learning. arXiv preprint
arXiv:1810.08810 (2018).
[59] Pham Minh Chuan, Le Hoang Son, Mumtaz Ali, Tran Dinh Khang, Le Thanh Huong, and Nilanjan Dey. 2018. Link
prediction in co-authorship networks based on hybrid content similarity metric. Applied Intelligence 48, 8 (2018),
2470–2486.
[60] Fan RK Chung. 1997. Spectral graph theory. Vol. 92. American Mathematical Soc.
[61] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated
recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
[62] Connor W Coley, Regina Barzilay, William H Green, Tommi S Jaakkola, and Klavs F Jensen. 2017. Convolutional
embedding of attributed molecular graphs for physical property prediction. Journal of chemical information and
[87] Ziwei Fan, Zhiwei Liu, Jiawei Zhang, Yun Xiong, Lei Zheng, and Philip S Yu. 2021. Continuous-time sequential
recommendation with temporal graph collaborative transformer. In Proceedings of the 30th ACM International
Conference on Information & Knowledge Management. 433–442.
[88] Yin Fang, Qiang Zhang, Haihong Yang, Xiang Zhuang, Shumin Deng, Wen Zhang, Ming Qin, Zhuo Chen, Xiaohui
Fan, and Huajun Chen. 2022. Molecular contrastive learning with chemical element knowledge graph. In Proceedings
of the AAAI Conference on Artificial Intelligence, Vol. 36. 3968–3976.
[89] Zheng Fang, Qingqing Long, Guojie Song, and Kunqing Xie. 2021. Spatial-temporal graph ode networks for traffic
flow forecasting. In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. 364–373.
[90] Evan N Feinberg, Debnil Sur, Zhenqin Wu, Brooke E Husic, Huanghao Mai, Yang Li, Saisai Sun, Jianyi Yang, Bharath
Ramsundar, and Vijay S Pande. 2018. PotentialNet for molecular property prediction. ACS central science 4, 11 (2018),
1520–1530.
[91] Shangbin Feng, Herun Wan, Ningnan Wang, and Minnan Luo. 2021. BotRGCN: Twitter bot detection with relational
graph convolutional networks. In Proceedings of the 2021 IEEE/ACM International Conference on Advances in Social
Networks Analysis and Mining. 236–239.
[92] Wenzheng Feng, Jie Zhang, Yuxiao Dong, Yu Han, Huanbo Luan, Qian Xu, Qiang Yang, Evgeny Kharlamov, and Jie
Tang. 2020. Graph random neural networks for semi-supervised learning on graphs. (2020), 22092–22103.
[93] Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, and Yue Gao. 2019. Hypergraph neural networks. In Proceedings
of the AAAI conference on artificial intelligence, Vol. 33. 3558–3565.
[94] Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep
networks. In International conference on machine learning. PMLR, 1126–1135.
[95] Marc Finzi, Samuel Stanton, Pavel Izmailov, and Andrew Gordon Wilson. 2020. Generalizing convolutional neural
networks for equivariance to lie groups on arbitrary continuous data. In ICML.
[96] Emil Fischer. 1894. Einfluss der Configuration auf die Wirkung der Enzyme. Berichte der deutschen chemischen
Gesellschaft 27, 3 (1894), 2985–2993.
[97] Luca Franceschi, Mathias Niepert, Massimiliano Pontil, and Xiao He. 2019. Learning discrete structures for graph
neural networks. In International conference on machine learning. PMLR, 1972–1982.
[98] Paul G Francoeur, Tomohide Masuda, Jocelyn Sunseri, Andrew Jia, Richard B Iovanisci, Ian Snyder, and David R Koes.
2020. Three-dimensional convolutional neural networks and a cross-docked data set for structure-based drug design.
Journal of Chemical Information and Modeling 60, 9 (2020), 4200–4215.
[99] MJ ea Frisch, GW Trucks, HB Schlegel, GE Scuseria, MA Robb, JR Cheeseman, G Scalmani, VPGA Barone, GA
Petersson, HJRA Nakatsuji, et al. 2016. Gaussian 16.
[100] Guoji Fu, Peilin Zhao, and Yatao Bian. 2022. 𝑝-Laplacian Based Graph Neural Networks. In International Conference
on Machine Learning. PMLR, 6878–6917.
[101] Sichao Fu and Weifeng Liu. 2021. Recent Advances of Manifold-based Graph Convolutional Networks for Remote
Sensing Images Recognition. Generalization With Deep Learning: For Improvement On Sensing Capability (2021),
209–232.
[102] Fabian Fuchs, Daniel Worrall, Volker Fischer, and Max Welling. 2020. SE(3)-Transformers: 3D Roto-
Translation Equivariant Attention Networks. In NeurIPS, Vol. 33. https://proceedings.neurips.cc/paper/2020/file/
15231a7ce4ba789d13b722cc5c955834-Paper.pdf
[103] Octavian-Eugen Ganea, Xinyuan Huang, Charlotte Bunne, Yatao Bian, Regina Barzilay, Tommi Jaakkola, and An-
dreas Krause. 2021. Independent se (3)-equivariant models for end-to-end rigid protein docking. arXiv preprint
arXiv:2111.07786 (2021).
[104] Hongyang Gao and Shuiwang Ji. 2019. Graph u-nets. In international conference on machine learning. PMLR, 2083–2092.
[105] Hongyang Gao, Yi Liu, and Shuiwang Ji. 2021. Topology-aware graph pooling networks. IEEE Transactions on Pattern
Analysis and Machine Intelligence 43, 12 (2021), 4512–4518.
[106] Xiang Gao, Wei Hu, and Zongming Guo. 2020. Exploring structure-adaptive graph learning for robust semi-supervised
classification. In 2020 ieee international conference on multimedia and expo (icme). IEEE, 1–6.
[107] Thomas Gärtner, Peter Flach, and Stefan Wrobel. 2003. On graph kernels: Hardness results and efficient alternatives.
In Proceedings of Computational Learning theory and kernel machines. 129–143.
[108] Johannes Gasteiger, Stefan Weißenberger, and Stephan Günnemann. 2019. Diffusion improves graph learning.
Advances in neural information processing systems 32 (2019).
[109] Thomas Gaudelet, Ben Day, Arian R Jamasb, Jyothish Soman, Cristian Regep, Gertrude Liu, Jeremy BR Hayter,
Richard Vickers, Charles Roberts, Jian Tang, et al. 2021. Utilizing graph machine learning within drug discovery and
development. Briefings in bioinformatics 22, 6 (2021), bbab159.
[110] Kaitlyn M Gayvert, Neel S Madhukar, and Olivier Elemento. 2016. A data-driven approach to predicting successes
and failures of clinical trials. Cell chemical biology 23, 10 (2016), 1294–1301.
[111] Niklas Gebauer, Michael Gastegger, and Kristof Schütt. 2019. Symmetry-adapted generation of 3d point sets for the
targeted discovery of molecules. Advances in neural information processing systems 32 (2019).
[112] Niklas WA Gebauer, Michael Gastegger, Stefaan SP Hessmann, Klaus-Robert Müller, and Kristof T Schütt. 2022.
Inverse design of 3d molecular structures with conditional generative neural networks. Nature communications 13, 1
(2022), 1–11.
[113] Mario Geiger and Tess Smidt. 2022. e3nn: Euclidean neural networks. arXiv preprint arXiv:2207.09453 (2022).
[114] Simon Geisler, Tobias Schmidt, Hakan Şirin, Daniel Zügner, Aleksandar Bojchevski, and Stephan Günnemann. 2021.
Robustness of graph neural networks at scale. Advances in Neural Information Processing Systems 34 (2021), 7637–7649.
[115] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural message passing
for quantum chemistry. In International conference on machine learning. PMLR, 1263–1272.
[116] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-
Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik.
2018. Automatic chemical design using a data-driven continuous representation of molecules. ACS central science 4, 2
(2018), 268–276.
[117] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. 2020. Generative adversarial networks. Commun. ACM 63, 11 (2020), 139–144.
[118] Bryce Goodman and Seth Flaxman. 2017. European Union regulations on algorithmic decision-making and a “right
to explanation”. AI magazine 38, 3 (2017), 50–57.
[119] Palash Goyal and Emilio Ferrara. 2018. Graph embedding techniques, applications, and performance: A survey.
Knowledge-Based Systems 151 (2018), 78–94.
[120] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch,
Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent-a new
approach to self-supervised learning. Advances in neural information processing systems 33 (2020), 21271–21284.
[121] Philipp Grohs and Felix Voigtlaender. 2021. Proof of the theory-to-practice gap in deep learning via sampling
complexity bounds for neural network approximation spaces. arXiv preprint arXiv:2104.02746 (2021).
[122] Colin R Groom, Ian J Bruno, Matthew P Lightfoot, and Suzanna C Ward. 2016. The Cambridge structural database.
Acta Crystallographica Section B: Structural Science, Crystal Engineering and Materials 72, 2 (2016), 171–179.
[123] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd
ACM SIGKDD international conference on Knowledge discovery and data mining. 855–864.
[124] Stephan Günnemann. 2022. Graph neural networks: Adversarial robustness. In Graph Neural Networks: Foundations,
Frontiers, and Applications. Springer, 149–176.
[125] Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. 2019. Attention based spatial-temporal graph
convolutional networks for traffic flow forecasting. In Proceedings of the AAAI conference on artificial intelligence,
Vol. 33. 922–929.
[126] Zhichun Guo, Chuxu Zhang, Wenhao Yu, John Herr, Olaf Wiest, Meng Jiang, and Nitesh V Chawla. 2021. Few-shot
graph learning for molecular property prediction. In Proceedings of the Web Conference 2021. 2559–2567.
[127] Sardar Hamidian and Mona T Diab. 2019. Rumor detection and classification for twitter data. arXiv preprint
arXiv:1912.08926 (2019).
[128] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. Advances in
neural information processing systems 30 (2017).
[129] David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. 2011. Wavelets on graphs via spectral graph theory.
Applied and Computational Harmonic Analysis 30, 2 (2011), 129–150.
[130] Jiaqi Han, Yu Rong, Tingyang Xu, and Wenbing Huang. 2022. Geometrically equivariant graph neural networks: A
survey. arXiv preprint arXiv:2202.07230 (2022).
[131] Zhongkai Hao, Chengqiang Lu, Zhenya Huang, Hao Wang, Zheyuan Hu, Qi Liu, Enhong Chen, and Cheekong Lee.
2020. ASGN: An active semi-supervised graph neural network for molecular property prediction. In Proceedings of
the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 731–752.
[132] Nick Harvey, Christopher Liaw, and Abbas Mehrabian. 2017. Nearly-tight VC-dimension bounds for piecewise linear
neural networks. In Conference on learning theory. PMLR, 1064–1068.
[133] Dailan He, Yusheng Zhao, Junyu Luo, Tianrui Hui, Shaofei Huang, Aixi Zhang, and Si Liu. 2021. TransRefer3D:
Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding. In Proceedings of the 29th ACM
International Conference on Multimedia. 2344–2352.
[134] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. Lightgcn: Simplifying
and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR
conference on research and development in Information Retrieval. 639–648.
[135] Yixuan He, Michael Permultter, Gesine Reinert, and Mihai Cucuringu. 2022. MSGNN: A Spectral Graph Neural
Network Based on a Novel Magnetic Signed Laplacian. arXiv preprint arXiv:2209.00546 (2022).
[136] Marti A. Hearst, Susan T Dumais, Edgar Osuna, John Platt, and Bernhard Scholkopf. 1998. Support vector machines.
IEEE Intelligent Systems and their applications 13, 4 (1998), 18–28.
[137] Mikael Henaff, Joan Bruna, and Yann LeCun. 2015. Deep convolutional networks on graph-structured data. arXiv
preprint arXiv:1506.05163 (2015).
[138] Dan Hendrycks and Kevin Gimpel. 2016. A baseline for detecting misclassified and out-of-distribution examples in
neural networks. arXiv preprint arXiv:1610.02136 (2016).
[139] Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks.
science 313, 5786 (2006), 504–507.
[140] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural
Information Processing Systems 33 (2020), 6840–6851.
[141] Jake M Hofman, Duncan J Watts, Susan Athey, Filiz Garip, Thomas L Griffiths, Jon Kleinberg, Helen Margetts, Sendhil
Mullainathan, Matthew J Salganik, Simine Vazire, et al. 2021. Integrating explanation and prediction in computational
social science. Nature 595, 7866 (2021), 181–188.
[142] Emiel Hoogeboom, Vıctor Garcia Satorras, Clément Vignac, and Max Welling. 2022. Equivariant diffusion for molecule
generation in 3d. In International Conference on Machine Learning. PMLR, 8867–8887.
[143] Fenyu Hu, Yanqiao Zhu, Shu Wu, Weiran Huang, Liang Wang, and Tieniu Tan. 2020. Graphair: Graph representation
learning with neighborhood aggregation and interaction. Pattern Recognition 112 (2020), 107745.
[144] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec.
2020. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing
systems 33 (2020), 22118–22133.
[145] Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay Pande, and Jure Leskovec. 2019. Strategies
for pre-training graph neural networks. arXiv preprint arXiv:1905.12265 (2019).
[146] Ziniu Hu, Yuxiao Dong, Kuansan Wang, Kai-Wei Chang, and Yizhou Sun. 2020. Gpt-gnn: Generative pre-training of
graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining. 1857–1867.
[147] Ziniu Hu, Yuxiao Dong, Kuansan Wang, and Yizhou Sun. 2020. Heterogeneous graph transformer. In Proceedings of
the web conference 2020. 2704–2710.
[148] Zhiting Hu and Li Erran Li. 2021. A causal lens for controllable text generation. Advances in Neural Information
Processing Systems 34 (2021), 24941–24955.
[149] Jing Huang and Jie Yang. 2021. Unignn: a unified framework for graph and hypergraph neural networks. arXiv
preprint arXiv:2105.00956 (2021).
[150] Rongzhou Huang, Chuyin Huang, Yubao Liu, Genan Dai, and Weiyang Kong. 2020. LSGCN: Long Short-Term Traffic
Prediction with Graph Convolutional Networks.. In IJCAI, Vol. 7. 2355–2361.
[151] Sheng Huang, Mohamed Elhoseiny, Ahmed Elgammal, and Dan Yang. 2015. Learning hypergraph-regularized
attribute predictors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 409–417.
[152] Wenbing Huang, Jiaqi Han, Yu Rong, Tingyang Xu, Fuchun Sun, and Junzhou Huang. 2022. Constrained Graph
Mechanics Networks. In ICLR. https://openreview.net/forum?id=SHbhHHfePhP
[153] Md Shamim Hussain, Mohammed J Zaki, and Dharmashankar Subramanian. 2021. Edge-augmented graph transform-
ers: Global self-attention is enough for graphs. arXiv preprint arXiv:2108.03348 (2021).
[154] Michael J Hutchinson, Charline Le Lan, Sheheryar Zaidi, Emilien Dupont, Yee Whye Teh, and Hyunjik Kim. 2021.
Lietransformer: equivariant self-attention for lie groups. In ICML.
[155] John J Irwin, Teague Sterling, Michael M Mysinger, Erin S Bolstad, and Ryan G Coleman. 2012. ZINC: a free tool to
discover chemistry for biology. Journal of chemical information and modeling 52, 7 (2012), 1757–1768.
[156] Shoichi Ishida, Kei Terayama, Ryosuke Kojima, Kiyosei Takasu, and Yasushi Okuno. 2022. Ai-driven synthetic route
design incorporated with retrosynthesis knowledge. Journal of chemical information and modeling 62, 6 (2022),
1357–1367.
[157] Md Ashraful Islam, Mir Mahathir Mohammad, Sarkar Snigdha Sarathi Das, and Mohammed Eunus Ali. 2022. A
survey on deep learning based Point-of-Interest (POI) recommendations. Neurocomputing 472 (2022), 306–325.
[158] Shuyi Ji, Yifan Feng, Rongrong Ji, Xibin Zhao, Wanwan Tang, and Yue Gao. 2020. Dual channel hypergraph
collaborative filtering. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining. 2020–2029.
[159] Bo Jiang, Ziyan Zhang, Doudou Lin, Jin Tang, and Bin Luo. 2019. Semi-supervised learning with graph learning-
convolutional networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 11313–
11320.
[160] Weiwei Jiang and Jiayun Luo. 2022. Graph neural network for traffic forecasting: A survey. Expert Systems with
Applications (2022), 117921.
[161] Zhuoren Jiang, Yue Yin, Liangcai Gao, Yao Lu, and Xiaozhong Liu. 2018. Cross-language citation recommendation
via hierarchical representation learning on heterogeneous graph. In The 41st International ACM SIGIR Conference on
Research & Development in Information Retrieval. 635–644.
[162] Yizhu Jiao, Yun Xiong, Jiawei Zhang, Yao Zhang, Tianqi Zhang, and Yangyong Zhu. 2020. Sub-graph contrast for
scalable self-supervised graph representation learning. In 2020 IEEE international conference on data mining (ICDM).
IEEE, 222–231.
[163] José Jiménez-Luna, Francesca Grisoni, and Gisbert Schneider. 2020. Drug discovery with explainable artificial
intelligence. Nature Machine Intelligence 2, 10 (2020), 573–584.
[164] Guangyin Jin, Zhexu Xi, Hengyu Sha, Yanghe Feng, and Jincai Huang. 2020. Deep multi-view spatiotemporal virtual
graph neural network for significant citywide ride-hailing demand prediction. arXiv preprint arXiv:2007.15189 (2020).
[165] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. 2018. Junction tree variational autoencoder for molecular graph
generation. In International conference on machine learning. PMLR, 2323–2332.
[166] Wei Jin, Tyler Derr, Haochen Liu, Yiqi Wang, Suhang Wang, Zitao Liu, and Jiliang Tang. 2020. Self-supervised learning
on graphs: Deep insights and new direction. arXiv preprint arXiv:2006.10141 (2020).
[167] Bowen Jing, Stephan Eismann, Patricia Suriana, Raphael JL Townshend, and Ron Dror. 2020. Learning from protein
structure with geometric vector perceptrons. arXiv preprint arXiv:2009.01411 (2020).
[168] Bowen Jing, Stephan Eismann, Patricia Suriana, Raphael John Lamarre Townshend, and Ron Dror. 2021. Learning
from Protein Structure with Geometric Vector Perceptrons. In ICLR. https://openreview.net/forum?id=1YLJDvSx6J4
[169] Wei Ju, Yiyang Gu, Xiao Luo, Yifan Wang, Haochen Yuan, Huasong Zhong, and Ming Zhang. 2023. Unsupervised
graph-level representation learning with hierarchical contrasts. Neural Networks 158 (2023), 359–368.
[170] Wei Ju, Zequn Liu, Yifang Qin, Bin Feng, Chen Wang, Zhihui Guo, Xiao Luo, and Ming Zhang. 2023. Few-shot
molecular property prediction via Hierarchically Structured Learning on Relation Graphs. Neural Networks (2023).
[171] Wei Ju, Xiao Luo, Zeyu Ma, Junwei Yang, Minghua Deng, and Ming Zhang. 2022. GHNN: Graph Harmonic Neural
Networks for semi-supervised graph-level classification. Neural Networks 151 (2022), 70–79.
[172] Wei Ju, Xiao Luo, Meng Qu, Yifan Wang, Chong Chen, Minghua Deng, Xian-Sheng Hua, and Ming Zhang. 2022.
TGNN: A Joint Semi-supervised Framework for Graph-level Classification. In Proceedings of the International Joint
Conference on Artificial Intelligence. 2122–2128.
[173] Wei Ju, Yifang Qin, Ziyue Qiao, Xiao Luo, Yifan Wang, Yanjie Fu, and Ming Zhang. 2022. Kernel-based Substructure
Exploration for Next POI Recommendation. arXiv preprint arXiv:2210.03969 (2022).
[174] Wei Ju, Junwei Yang, Meng Qu, Weiping Song, Jianhao Shen, and Ming Zhang. 2022. KGNN: Harnessing Kernel-based
Networks for Semi-supervised Graph Classification. In Proceedings of the Fifteenth ACM International Conference on
Web Search and Data Mining. 421–429.
[175] U Kang, Hanghang Tong, and Jimeng Sun. 2012. Fast random walk graph kernel. In Proceedings of the SIAM international
conference on data mining. SIAM, 828–838.
[176] Marek Karpinski and Angus Macintyre. 1997. Polynomial bounds for VC dimension of sigmoidal and general Pfaffian
neural networks. J. Comput. System Sci. 54, 1 (1997), 169–176.
[177] Hisashi Kashima, Koji Tsuda, and Akihiro Inokuchi. 2003. Marginalized kernels between labeled graphs. In Proceedings
of international conference on machine learning. 321–328.
[178] Mohammad Mehdi Keikha, Maseud Rahgozar, Masoud Asadpour, and Mohammad Faghih Abdollahi. 2020. Influence
maximization across heterogeneous interconnected networks based on deep learning. Expert Systems with Applications
140 (2020), 112905.
[179] Shima Khoshraftar and Aijun An. 2022. A Survey on Graph Representation Learning Methods. arXiv preprint
arXiv:2204.01855 (2022).
[180] Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
[181] Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. 2018. Neural relational inference
for interacting systems. In International Conference on Machine Learning. PMLR, 2688–2697.
[182] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv
preprint arXiv:1609.02907 (2016).
[183] Thomas N Kipf and Max Welling. 2016. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016).
[184] Johannes Klicpera, Florian Becker, and Stephan Günnemann. 2021. GemNet: Universal Directional Graph Neural
Networks for Molecules. In NeurIPS. https://openreview.net/forum?id=HS_sOaxS9K-
[185] Johannes Klicpera, Janek Groß, and Stephan Günnemann. 2020. Directional Message Passing for Molecular Graphs.
In ICLR. https://openreview.net/forum?id=B1eWbxStPH
[186] Jonas Köhler, Leon Klein, and Frank Noe. 2020. Equivariant Flows: Exact Likelihood Generative Learning for
Symmetric Densities. In ICML. https://proceedings.mlr.press/v119/kohler20a.html
[187] Xiangjie Kong, Huizhen Jiang, Wei Wang, Teshome Megersa Bekele, Zhenzhen Xu, and Meng Wang. 2017. Exploring
dynamic research interest and academic influence for scientific collaborator recommendation. Scientometrics 113, 1
(2017), 369–385.
[188] Xiangjie Kong, Huizhen Jiang, Zhuo Yang, Zhenzhen Xu, Feng Xia, and Amr Tolba. 2016. Exploiting publication
contents and collaboration networks for collaborator recommendation. PloS one 11, 2 (2016), e0148492.
[189] Xiangjie Kong, Yajie Shi, Shuo Yu, Jiaying Liu, and Feng Xia. 2019. Academic social networks: Modeling, analysis,
mining and applications. Journal of Network and Computer Applications 132 (2019), 86–103.
[190] Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings
of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. 426–434.
[191] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems.
Computer 42, 8 (2009), 30–37.
[192] Mario Krenn, Florian Häse, AkshatKumar Nigam, Pascal Friederich, and Alan Aspuru-Guzik. 2020. Self-referencing
embedded strings (SELFIES): A 100% robust molecular string representation. Machine Learning: Science and Technology
1, 4 (2020), 045024.
[193] Devin Kreuzer, Dominique Beaini, Will Hamilton, Vincent Létourneau, and Prudencio Tossou. 2021. Rethinking graph
transformers with spectral attention. Advances in Neural Information Processing Systems 34 (2021), 21618–21629.
[194] Nils M Kriege, Fredrik D Johansson, and Christopher Morris. 2020. A survey on graph kernels. Applied Network
Science 5, 1 (2020), 1–42.
[195] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2017. Imagenet classification with deep convolutional neural
networks. Commun. ACM 60, 6 (2017), 84–90.
[196] Sanjay Kumar, Abhishek Mallik, Anavi Khetarpal, and BS Panda. 2022. Influence maximization in social networks
using graph embedding and graph neural network. Information Sciences 607 (2022), 1617–1636.
[197] Samuli Laine and Timo Aila. 2016. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242
(2016).
[198] Greg Landrum et al. 2013. RDKit: A software suite for cheminformatics, computational chemistry, and predictive
modeling. Greg Landrum (2013).
[199] Dongha Lee, Su Kim, Seonghyeon Lee, Chanyoung Park, and Hwanjo Yu. 2021. Learnable structural semantic readout
for graph classification. In 2021 IEEE International Conference on Data Mining (ICDM). IEEE, 1180–1185.
[200] Dong-Hyun Lee et al. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural
networks. In Workshop on challenges in representation learning, ICML, Vol. 3. 896.
[201] Junhyun Lee, Inyeop Lee, and Jaewoo Kang. 2019. Self-attention graph pooling. In International conference on machine
learning. PMLR, 3734–3743.
[202] Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. 2017. Deriving Neural Architectures from Sequence and
Graph Kernels. In International Conference on Machine Learning. 2024–2033.
[203] Ron Levie, Federico Monti, Xavier Bresson, and Michael M Bronstein. 2018. Cayleynets: Graph convolutional neural
networks with complex rational spectral filters. IEEE Transactions on Signal Processing 67, 1 (2018), 97–109.
[204] Angsheng Li and Yicheng Pan. 2016. Structural information and dynamical complexity of networks. IEEE Transactions
on Information Theory 62, 6 (2016), 3290–3339.
[205] Guohao Li, Chenxin Xiong, Ali Thabet, and Bernard Ghanem. 2020. Deepergcn: All you need to train deeper gcns.
arXiv preprint arXiv:2006.07739 (2020).
[206] Haoyang Li, Xin Wang, Ziwei Zhang, and Wenwu Zhu. 2022. Ood-gnn: Out-of-distribution generalized graph neural
network. IEEE Transactions on Knowledge and Data Engineering (2022).
[207] Haoyang Li, Xin Wang, Ziwei Zhang, and Wenwu Zhu. 2022. Out-of-distribution generalization on graphs: A survey.
arXiv preprint arXiv:2202.07987 (2022).
[208] Junying Li, Deng Cai, and Xiaofei He. 2017. Learning graph-level representation for drug discovery. arXiv preprint
arXiv:1709.03741 (2017).
[209] Jia Li, Yongfeng Huang, Heng Chang, and Yu Rong. 2022. Semi-Supervised Hierarchical Graph Classification. IEEE
Transactions on Pattern Analysis and Machine Intelligence (2022).
[210] Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural attentive session-based
recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 1419–1428.
[211] Jia Li, Yu Rong, Hong Cheng, Helen Meng, Wenbing Huang, and Junzhou Huang. 2019. Semi-supervised graph
classification: A hierarchical graph perspective. In Proceedings of the Web Conference. 972–982.
[212] Peibo Li, Yixing Yang, Maurice Pagnucco, and Yang Song. 2022. CoGNet: Cooperative Graph Neural Networks. In
Proceedings of the International Joint Conference on Neural Networks.
[213] Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-
supervised learning. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32.
[214] Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. 2018. Adaptive graph convolutional neural networks. In
Proceedings of the AAAI conference on artificial intelligence, Vol. 32.
[215] Shuangli Li, Jingbo Zhou, Tong Xu, Dejing Dou, and Hui Xiong. 2022. Geomgcl: geometric graph contrastive learning
for molecular property prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 4541–4549.
[216] Yibo Li, Jianxing Hu, Yanxing Wang, Jielong Zhou, Liangren Zhang, and Zhenming Liu. 2019. Deepscaffold: a
comprehensive tool for scaffold-based de novo drug discovery using deep learning. Journal of chemical information
and modeling 60, 1 (2019), 77–91.
[217] Yibo Li, Jianfeng Pei, and Luhua Lai. 2021. Learning to design drug-like molecules in three-dimensional space using
deep generative models. arXiv preprint arXiv:2104.08474 (2021).
[218] Yibo Li, Jianfeng Pei, and Luhua Lai. 2021. Structure-based de novo drug design using 3D deep generative models.
Chemical science 12, 41 (2021), 13664–13675.
[219] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. 2015. Gated graph sequence neural networks. arXiv
preprint arXiv:1511.05493 (2015).
[220] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2017. Diffusion convolutional recurrent neural network: Data-driven
traffic forecasting. arXiv preprint arXiv:1707.01926 (2017).
[221] Yujia Li, Richard Zemel, Marc Brockschmidt, and Daniel Tarlow. 2016. Gated Graph Sequence Neural Networks. In
Proceedings of ICLR’16.
[222] Ziyao Li, Liang Zhang, and Guojie Song. 2019. GCN-LASE: Towards Adequately Incorporating Link Attributes
in Graph Convolutional Networks. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial
Intelligence, IJCAI-19. 2959–2965.
[223] Yanyan Liang, Yanfeng Zhang, Dechao Gao, and Qian Xu. 2020. MxPool: Multiplex Pooling for Hierarchical Graph
Representation Learning. arXiv preprint arXiv:2004.06846 (2020).
[224] Renjie Liao, Zhizhen Zhao, Raquel Urtasun, and Richard Zemel. 2018. LanczosNet: Multi-Scale Deep Graph Convolu-
tional Networks. In International Conference on Learning Representations.
[225] Jongin Lim, Daeho Um, Hyung Jin Chang, Dae Ung Jo, and Jin Young Choi. 2021. Class-attentive diffusion network
for semi-supervised classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 8601–8609.
[226] Haitao Lin, Yufei Huang, Meng Liu, Xuanjing Li, Shuiwang Ji, and Stan Z Li. 2022. DiffBP: Generative Diffusion of 3D
Molecules for Target Protein Binding. arXiv preprint arXiv:2211.11214 (2022).
[227] Jiacheng Lin, Hanwen Xu, Addie Woicik, Jianzhu Ma, and Sheng Wang. 2022. Pisces: A cross-modal contrastive learning
approach to synergistic drug combination prediction. bioRxiv (2022). https://doi.org/10.1101/2022.11.21.517439
[228] Jiaying Liu, Feng Xia, Lei Wang, Bo Xu, Xiangjie Kong, Hanghang Tong, and Irwin King. 2019. Shifu2: A network
representation learning based model for advisor-advisee relationship mining. IEEE Transactions on Knowledge and
Data Engineering 33, 4 (2019), 1763–1777.
[229] Li Liu, William K Cheung, Xin Li, and Lejian Liao. 2016. Aligning Users across Social Networks Using Network
Embedding. In IJCAI, Vol. 16. 1774–1780.
[230] Meng Liu, Youzhi Luo, Kanji Uchino, Koji Maruhashi, and Shuiwang Ji. 2022. Generating 3D Molecules for Target
Protein Binding. arXiv preprint arXiv:2204.09410 (2022).
[231] Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander Gaunt. 2018. Constrained graph variational autoen-
coders for molecule design. Advances in neural information processing systems 31 (2018).
[232] Qiao Liu, Yifu Zeng, Refuoe Mokhosi, and Haibin Zhang. 2018. STAMP: short-term attention/memory priority model
for session-based recommendation. In Proceedings of the 24th ACM SIGKDD international conference on knowledge
discovery & data mining. 1831–1839.
[233] Yi Liu, Limei Wang, Meng Liu, Yuchao Lin, Xuan Zhang, Bora Oztekin, and Shuiwang Ji. 2022. Spherical Message
Passing for 3D Molecular Graphs. In ICLR. https://openreview.net/forum?id=givsRXsOt9r
[234] Ziqi Liu, Chaochao Chen, Xinxing Yang, Jun Zhou, Xiaolong Li, and Le Song. 2018. Heterogeneous graph neural
networks for malicious account detection. In Proceedings of the 27th ACM International Conference on Information and
Knowledge Management. 2077–2085.
[235] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin
transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International
Conference on Computer Vision. 10012–10022.
[236] Zheng Liu, Xing Xie, and Lei Chen. 2018. Context-aware academic collaborator recommendation. In Proceedings of
the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1870–1879.
[237] Qingqing Long, Yilun Jin, Yi Wu, and Guojie Song. 2021. Theoretically improving graph neural networks via
anonymous walk graph kernels. In Proceedings of the Web Conference 2021. 1204–1214.
[238] Qingqing Long, Lingjun Xu, Zheng Fang, and Guojie Song. 2021. HGK-GNN: Heterogeneous Graph Kernel based
Graph Neural Networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining.
1129–1138.
[239] Siyu Long, Yi Zhou, Xinyu Dai, and Hao Zhou. 2022. Zero-Shot 3D Drug Design by Sketching and Generating. arXiv
preprint arXiv:2209.13865 (2022).
[240] Dongsheng Luo, Wei Cheng, Wenchao Yu, Bo Zong, Jingchao Ni, Haifeng Chen, and Xiang Zhang. 2021. Learning to
drop: Robust graph neural network via topological denoising. In Proceedings of the 14th ACM international conference
on web search and data mining. 779–787.
[241] Shitong Luo, Jiaqi Guan, Jianzhu Ma, and Jian Peng. 2021. A 3D generative model for structure-based drug design.
Advances in Neural Information Processing Systems 34 (2021), 6229–6239.
[242] Xiao Luo, Wei Ju, Meng Qu, Chong Chen, Minghua Deng, Xian-Sheng Hua, and Ming Zhang. 2022. DualGraph:
Improving Semi-supervised Graph Classification via Dual Contrastive Learning. In Proceedings of the IEEE International
Conference on Data Engineering. 699–712.
[243] Xiao Luo, Wei Ju, Meng Qu, Yiyang Gu, Chong Chen, Minghua Deng, Xian-Sheng Hua, and Ming Zhang. 2022. Clear:
Cluster-enhanced contrast for self-supervised graph representation learning. IEEE Transactions on Neural Networks
and Learning Systems (2022).
[244] Xiao Luo, Daqing Wu, Zeyu Ma, Chong Chen, Minghua Deng, Jinwen Ma, Zhongming Jin, Jianqiang Huang, and
Xian-Sheng Hua. 2021. CIMON: Towards High-quality Hash Codes. In Proceedings of the International Joint Conference
on Artificial Intelligence.
[245] Youzhi Luo and Shuiwang Ji. 2022. An autoregressive flow model for 3d molecular geometry generation from scratch.
In International Conference on Learning Representations (ICLR).
[246] Youzhi Luo, Keqiang Yan, and Shuiwang Ji. 2021. Graphdf: A discrete flow model for molecular graph generation. In
International Conference on Machine Learning. PMLR, 7192–7203.
[247] Hehuan Ma, Yatao Bian, Yu Rong, Wenbing Huang, Tingyang Xu, Weiyang Xie, Geyan Ye, and Junzhou Huang. 2020.
Multi-view graph neural networks for molecular property prediction. arXiv preprint arXiv:2005.13607 (2020).
[248] Jiaqi Ma, Junwei Deng, and Qiaozhu Mei. 2021. Subgroup generalization and fairness of graph neural networks.
Advances in Neural Information Processing Systems 34 (2021), 1048–1061.
[249] Ning Ma, Jiajun Bu, Jieyu Yang, Zhen Zhang, Chengwei Yao, Zhi Yu, Sheng Zhou, and Xifeng Yan. 2020. Adaptive-step
graph meta-learner for few-shot graph classification. In Proceedings of the 29th ACM International Conference on
Information & Knowledge Management. 1055–1064.
[250] Yao Ma, Suhang Wang, Charu C Aggarwal, and Jiliang Tang. 2019. Graph convolutional networks with eigenpooling.
In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 723–731.
[251] Kaushalya Madhawa, Katushiko Ishiguro, Kosuke Nakago, and Motoki Abe. 2019. Graphnvp: An invertible flow
model for generating molecular graphs. arXiv preprint arXiv:1905.11600 (2019).
[252] Prashan Madumal, Tim Miller, Liz Sonenberg, and Frank Vetere. 2020. Explainable reinforcement learning through a
causal lens. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 2493–2500.
[253] Tong Man, Huawei Shen, Shenghua Liu, Xiaolong Jin, and Xueqi Cheng. 2016. Predict anchor links across social
networks via an embedding approach. In IJCAI, Vol. 16. 1823–1829.
[254] Kamaran H Manguri, Rebaz N Ramadhan, and Pshko R Mohammed Amin. 2020. Twitter sentiment analysis on
worldwide COVID-19 outbreaks. Kurdistan Journal of Applied Research (2020), 54–65.
[255] Elman Mansimov, Omar Mahmood, Seokho Kang, and Kyunghyun Cho. 2019. Molecular geometry prediction using a
deep generative graph neural network. Scientific reports 9, 1 (2019), 1–13.
[256] Mohammad MansourLakouraj, Mukesh Gautam, Hanif Livani, and Mohammed Benidris. 2022. A multi-rate sampling
PMU-based event classification in active distribution grids with spectral graph neural network. Electric Power Systems
Research 211 (2022), 108145.
[257] Dionisis Margaris, Costas Vassilakis, and Dimitris Spiliotopoulos. 2019. Handling uncertainty in social media textual
information for improving venue recommendation formulation quality in social networks. Social Network Analysis
and Mining 9, 1 (2019), 1–19.
[258] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2021. A survey on bias and
fairness in machine learning. ACM Computing Surveys (CSUR) 54, 6 (2021), 1–35.
[259] Xuying Meng, Suhang Wang, Zhimin Liang, Di Yao, Jihua Zhou, and Yujun Zhang. 2021. Semi-supervised anomaly
detection in dynamic communication networks. Information Sciences 571 (2021), 527–542.
[260] Xuan-Yu Meng, Hong-Xing Zhang, Mihaly Mezei, and Meng Cui. 2011. Molecular docking: a powerful approach for
structure-based drug discovery. Current computer-aided drug design 7, 2 (2011), 146–157.
[261] Joshua Meyers, Benedek Fabian, and Nathan Brown. 2021. De novo molecular design and generative models. Drug
Discovery Today 26, 11 (2021), 2707–2715.
[262] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words
and phrases and their compositionality. Advances in neural information processing systems 26 (2013).
[263] Abduallah Mohamed, Kun Qian, Mohamed Elhoseiny, and Christian Claudel. 2020. Social-stgcnn: A social spatio-
temporal graph convolutional neural network for human trajectory prediction. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition. 14424–14432.
[264] Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin
Grohe. 2019. Weisfeiler and leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI
conference on artificial intelligence, Vol. 33. 4602–4609.
[265] Vincenzo Moscato and Giancarlo Sperlì. 2021. A survey about community detection over On-line Social and
Heterogeneous Information Networks. Knowledge-Based Systems 224 (2021), 107112.
[266] Meinard Müller. 2007. Dynamic time warping. Information retrieval for music and motion (2007), 69–84.
[267] Vitali Nesterov, Mario Wieser, and Volker Roth. 2020. 3DMolNet: a generative network for molecular structures.
arXiv preprint arXiv:2010.06477 (2020).
[268] Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. 2016. Asymmetric transitivity preserving graph
embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining.
1105–1114.
[269] George Panagopoulos, Fragkiskos Malliaros, and Michalis Vazirgiannis. 2020. Multi-task learning for influence
estimation and maximization. IEEE Transactions on Knowledge and Data Engineering (2020).
[270] Cheonbok Park, Chunggi Lee, Hyojin Bahng, Yunwon Tae, Seungmin Jin, Kihwan Kim, Sungahn Ko, and Jaegul
Choo. 2020. ST-GRAT: A novel spatio-temporal graph attention networks for accurately forecasting dynamically
changing road speed. In Proceedings of the 29th ACM international conference on information & knowledge management.
1215–1224.
[271] Hyeonjin Park, Seunghun Lee, Dasol Hwang, Jisu Jeong, Kyung-Min Kim, Jung-Woo Ha, and Hyunwoo J Kim. 2021.
Learning Augmentation for GNNs With Consistency Regularization. IEEE Access 9 (2021), 127961–127972.
[272] Eric Paulos, Ken Anderson, and Anthony Townsend. 2004. UbiComp in the urban frontier. (2004).
[273] Eric Paulos and Elizabeth Goodman. 2004. The familiar stranger: anxiety, comfort, and play in public places. In
Proceedings of the SIGCHI conference on Human factors in computing systems. 223–230.
[274] Xingang Peng, Jiaqi Guan, Jian Peng, and Jianzhu Ma. 2023. Pocket-specific 3D Molecule Generation by Fragment-
based Autoregressive Diffusion Models. https://openreview.net/forum?id=HGsoe1wmRW5
[275] Xingang Peng, Shitong Luo, Jiaqi Guan, Qi Xie, Jian Peng, and Jianzhu Ma. 2022. Pocket2Mol: Efficient Molecular
Sampling Based on 3D Protein Pockets. arXiv preprint arXiv:2205.07249 (2022).
[276] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In
Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 701–710.
[277] Darwin Saire Pilco and Adín Ramírez Rivera. 2019. Graph learning network: A structure learning algorithm. arXiv
preprint arXiv:1905.12665 (2019).
[278] Marco AF Pimentel, David A Clifton, Lei Clifton, and Lionel Tarassenko. 2014. A review of novelty detection. Signal
processing 99 (2014), 215–249.
[279] Gabriel A Pinheiro, Juarez LF Da Silva, and Marcos G Quiles. 2022. SMICLR: Contrastive Learning on Multiple
Molecular Representations for Semisupervised and Unsupervised Representation Learning. Journal of Chemical
Information and Modeling 62, 17 (2022), 3948–3960.
[280] Kristina Preuer, Günter Klambauer, Friedrich Rippmann, Sepp Hochreiter, and Thomas Unterthiner. 2019. Interpretable
deep learning in drug discovery. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Springer,
331–345.
[281] Ziyue Qiao, Yi Du, Yanjie Fu, Pengfei Wang, and Yuanchun Zhou. 2019. Unsupervised author disambiguation using
heterogeneous graph convolutional network embedding. In 2019 IEEE international conference on big data (Big Data).
IEEE, 910–919.
[282] Ziyue Qiao, Yanjie Fu, Pengyang Wang, Meng Xiao, Zhiyuan Ning, Yi Du, and Yuanchun Zhou. 2022. RPT: Toward
Transferable Model on Heterogeneous Researcher Data via Pre-Training. IEEE Transactions on Big Data (2022).
[283] Ziyue Qiao, Pengyang Wang, Yanjie Fu, Yi Du, Pengfei Wang, and Yuanchun Zhou. 2020. Tree structure-aware
graph representation learning via integrated hierarchical aggregation and relational metric learning. In 2020 IEEE
International Conference on Data Mining (ICDM). IEEE, 432–441.
[284] Zhuoran Qiao, Matthew Welborn, Animashree Anandkumar, Frederick R Manby, and Thomas F Miller III. 2020.
OrbNet: Deep learning for quantum chemistry using symmetry-adapted atomic-orbital features. The Journal of
chemical physics 153, 12 (2020), 124111.
[285] Yifang Qin, Yifan Wang, Fang Sun, Wei Ju, Xuyang Hou, Zhe Wang, Jia Cheng, Jun Lei, and Ming Zhang. 2022.
DisenPOI: Disentangling Sequential and Geographical Influence for Point-of-Interest Recommendation. arXiv preprint
arXiv:2210.16591 (2022).
[286] Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. 2020.
Gcc: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining. 1150–1160.
[287] Ruihong Qiu, Jingjing Li, Zi Huang, and Hongzhi Yin. 2019. Rethinking the item order in session-based recom-
mendation with graph neural networks. In Proceedings of the 28th ACM international conference on information and knowledge management.
[312] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. 2008. Collective
classification in network data. AI magazine 29, 3 (2008), 93–106.
[313] John Shawe-Taylor, Nello Cristianini, et al. 2004. Kernel methods for pattern analysis. Cambridge university press.
[314] Nino Shervashidze, Pascal Schweitzer, Erik Jan Van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. 2011.
Weisfeiler-lehman graph kernels. Journal of Machine Learning Research 12, 9 (2011), 2539–2561.
[315] Nino Shervashidze, Pascal Schweitzer, Erik Jan Van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. 2011.
Weisfeiler-lehman graph kernels. Journal of Machine Learning Research 12, 9 (2011).
[316] Nino Shervashidze, SVN Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten Borgwardt. 2009. Efficient graphlet
kernels for large graph comparison. In Proceedings of International Conference on Artificial Intelligence and Statistics.
488–495.
[317] Chuan Shi, Yitong Li, Jiawei Zhang, Yizhou Sun, and S Yu Philip. 2016. A survey of heterogeneous information
network analysis. IEEE Transactions on Knowledge and Data Engineering 29, 1 (2016), 17–37.
[318] Chence Shi, Shitong Luo, Minkai Xu, and Jian Tang. 2021. Learning gradient fields for molecular conformation
generation. Proceedings of the 38th International Conference on Machine Learning, ICML 139 (2021), 9558–9568.
[319] Chence Shi, Minkai Xu, Zhaocheng Zhu, Weinan Zhang, Ming Zhang, and Jian Tang. 2020. Graphaf: a flow-based
autoregressive model for molecular graph generation. arXiv preprint arXiv:2001.09382 (2020).
[320] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. 2013. The emerging
field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular
domains. IEEE signal processing magazine 30, 3 (2013), 83–98.
[321] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. 2013. The emerging
field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular
domains. IEEE signal processing magazine 30, 3 (2013), 83–98.
[322] Yali Si, Fuzhi Zhang, and Wenyuan Liu. 2019. An adaptive point-of-interest recommendation method for location-based
social networks based on user activity and spatial features. Knowledge-Based Systems 163 (2019), 267–282.
[323] Thiago H Silva, Aline Carneiro Viana, Fabrício Benevenuto, Leandro Villas, Juliana Salles, Antonio Loureiro, and
Daniele Quercia. 2019. Urban computing leveraging location-based social network data: a survey. ACM Computing
Surveys (CSUR) 52, 1 (2019), 1–39.
[324] Martin Simonovsky and Nikos Komodakis. 2017. Dynamic edge-conditioned filters in convolutional neural networks
on graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3693–3702.
[325] Martin Simonovsky and Nikos Komodakis. 2018. Graphvae: Towards generation of small graphs using variational
autoencoders. In International conference on artificial neural networks. Springer, 412–422.
[326] Seyyid Emre Sofuoglu and Selin Aviyente. 2022. Gloss: Tensor-based anomaly detection in spatiotemporal urban
traffic data. Signal Processing 192 (2022), 108370.
[327] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning
using nonequilibrium thermodynamics. In International Conference on Machine Learning. PMLR, 2256–2265.
[328] Chao Song, Youfang Lin, Shengnan Guo, and Huaiyu Wan. 2020. Spatial-temporal synchronous graph convolutional
networks: A new framework for spatial-temporal network data forecasting. In Proceedings of the AAAI conference on
artificial intelligence, Vol. 34. 914–921.
[329] Qingyu Song, RuiBo Ming, Jianming Hu, Haoyi Niu, and Mingyang Gao. 2020. Graph attention convolutional
network: Spatiotemporal modeling for urban traffic prediction. In 2020 IEEE 23rd International Conference on Intelligent
Transportation Systems (ITSC). IEEE, 1–6.
[330] Yang Song and Stefano Ermon. 2019. Generative modeling by estimating gradients of the data distribution. Advances
in Neural Information Processing Systems 32 (2019).
[331] Peter R. Spackman, Dylan Jayatilaka, and Amir Karton. 2016. Basis set convergence of CCSD(T) equilibrium
geometries using a large and diverse set of molecular structures. The Journal of Chemical Physics 145, 10 (2016),
104101. https://doi.org/10.1063/1.4962168
[332] Hannes Stärk, Octavian Ganea, Lagnajit Pattanaik, Regina Barzilay, and Tommi Jaakkola. 2022. Equibind: Geometric
deep learning for drug binding structure prediction. In International Conference on Machine Learning. PMLR, 20503–
20521.
[333] Bing Su, Dazhao Du, Zhao Yang, Yujie Zhou, Jiangmeng Li, Anyi Rao, Hao Sun, Zhiwu Lu, and Ji-Rong Wen. 2022.
A molecular multimodal foundation model associating molecule graphs with natural language. arXiv preprint
arXiv:2209.05481 (2022).
[334] Kazunari Sugiyama and Min-Yen Kan. 2010. Scholarly paper recommendation via user’s recent research interests. In
Proceedings of the 10th annual joint conference on Digital libraries. 29–38.
[335] Fan-Yun Sun, Jordan Hoffmann, Vikas Verma, and Jian Tang. 2020. Infograph: Unsupervised and semi-supervised
graph-level representation learning via mutual information maximization. In Proceedings of the International Conference
on Learning Representations.
[336] Jianhua Sun, Qinhong Jiang, and Cewu Lu. 2020. Recursive social behavior graph for trajectory prediction. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 660–669.
[337] Qingyun Sun, Jianxin Li, Hao Peng, Jia Wu, Xingcheng Fu, Cheng Ji, and S Yu Philip. 2022. Graph structure learning
with variational information bottleneck. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36.
4165–4174.
[338] Xiaoqing Sun, Zhiliang Wang, Jiahai Yang, and Xinran Liu. 2020. Deepdom: Malicious domain detection with scalable
and heterogeneous graph convolutional networks. Computers & Security 99 (2020), 102057.
[339] Shazia Tabassum, Fabiola SF Pereira, Sofia Fernandes, and João Gama. 2018. Social network analysis: An overview.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8, 5 (2018), e1256.
[340] Shyam A Tailor, Felix Opolka, Pietro Lio, and Nicholas Donald Lane. 2022. Do We Need Anisotropic Graph Neural
Networks?. In International Conference on Learning Representations.
[341] Tetsuro Takahashi and Nobuyuki Igata. 2012. Rumor detection on twitter. In The 6th International Conference on Soft
Computing and Intelligent Systems, and The 13th International Symposium on Advanced Intelligence Systems. IEEE,
452–457.
[342] Hao Tang, Donghong Ji, Chenliang Li, and Qiji Zhou. 2020. Dependency graph enhanced dual-transformer structure
for aspect-based sentiment classification. In Proceedings of the 58th annual meeting of the association for computational
linguistics. 6578–6588.
[343] Jian Tang, Meng Qu, and Qiaozhu Mei. 2015. Pte: Predictive text embedding through large-scale heterogeneous text
networks. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining.
1165–1174.
[344] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line: Large-scale information
network embedding. In Proceedings of the 24th international conference on world wide web. 1067–1077.
[345] Xianfeng Tang, Yandong Li, Yiwei Sun, Huaxiu Yao, Prasenjit Mitra, and Suhang Wang. 2020. Transferring robustness
for graph neural network against poisoning attacks. In Proceedings of the 13th international conference on web search
and data mining. 600–608.
[346] Philipp Thölke and Gianni De Fabritiis. 2022. Equivariant Transformers for Neural Network based Molecular
Potentials. In ICLR. https://openreview.net/forum?id=zNHzqZ9wrRB
[347] Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. 2018. Tensor field
networks: Rotation-and translation-equivariant neural networks for 3d point clouds. arXiv preprint arXiv:1802.08219
(2018).
[348] Hannu Toivonen, Ashwin Srinivasan, Ross D King, Stefan Kramer, and Christoph Helma. 2003. Statistical evaluation
of the predictive toxicology challenge 2000–2001. Bioinformatics 19, 10 (2003), 1183–1193.
[349] Sayan Unankard, Xue Li, Mohamed Sharaf, Jiang Zhong, and Xueming Li. 2014. Predicting elections from social
networks based on sub-event detection and sentiment analysis. In International Conference on Web Information Systems
Engineering. Springer, 1–16.
[350] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[351] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph
attention networks. arXiv preprint arXiv:1710.10903 (2017).
[352] Jitender Verma, Vijay M Khedkar, and Evans C Coutinho. 2010. 3D-QSAR in drug design-a review. Current topics in
medicinal chemistry 10, 1 (2010), 95–115.
[353] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. 2010.
Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion.
Journal of machine learning research 11, 12 (2010).
[354] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot
learning. Advances in neural information processing systems 29 (2016).
[355] Dylan J Walsh, Weizhong Zou, Ludwig Schneider, Reid Mello, Michael E Deagen, Joshua Mysona, Tzyy-Shyang
Lin, Juan J de Pablo, Klavs F Jensen, Debra J Audus, et al. 2023. Community Resource for Innovation in Polymer
Technology (CRIPT): A Scalable Polymer Material Data Structure.
[356] W Patrick Walters, Matthew T Stahl, and Mark A Murcko. 1998. Virtual screening—an overview. Drug discovery
today 3, 4 (1998), 160–178.
[357] Guihong Wan and Harsha Kokel. 2021. Graph sparsification via meta-learning. DLG@ AAAI (2021).
[358] Chun Wang, Shirui Pan, Guodong Long, Xingquan Zhu, and Jing Jiang. 2017. Mgae: Marginalized graph autoencoder
for graph clustering. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management.
889–898.
[359] Daixin Wang, Peng Cui, and Wenwu Zhu. 2016. Structural deep network embedding. In Proceedings of the 22nd ACM
SIGKDD international conference on Knowledge discovery and data mining. 1225–1234.
[360] Hongwei Wang, Fuzheng Zhang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2019. Multi-task feature learning
for knowledge graph enhanced recommendation. In The world wide web conference. 2000–2010.
[361] Jianling Wang, Kaize Ding, Liangjie Hong, Huan Liu, and James Caverlee. 2020. Next-item recommendation with
sequential hypergraphs. In Proceedings of the 43rd international ACM SIGIR conference on research and development in
information retrieval. 1101–1110.
[362] Jie Wang, Li Zhu, Tao Dai, and Yabin Wang. 2020. Deep memory network with Bi-LSTM for personalized context-aware
citation recommendation. Neurocomputing 410 (2020), 103–113.
[363] Kai Wang, Yu-An Lin, Ben Weissmann, Manolis Savva, Angel X Chang, and Daniel Ritchie. 2019. Planit: Planning
and instantiating indoor scenes with relation graph and spatial prior networks. ACM Transactions on Graphics (TOG)
38, 4 (2019), 1–15.
[364] Renxiao Wang, Xueliang Fang, Yipin Lu, Chao-Yie Yang, and Shaomeng Wang. 2005. The PDBbind database:
methodologies and updates. Journal of medicinal chemistry 48, 12 (2005), 4111–4119.
[365] Senzhang Wang, Hao Miao, Hao Chen, and Zhiqiu Huang. 2020. Multi-task adversarial spatial-temporal networks
for crowd flow prediction. In Proceedings of the 29th ACM international conference on information & knowledge
management. 1555–1564.
[366] Xiao Wang, Deyu Bo, Chuan Shi, Shaohua Fan, Yanfang Ye, and S Yu Philip. 2022. A survey on heterogeneous graph
embedding: methods, techniques, applications and sources. IEEE Transactions on Big Data (2022).
[367] Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural graph collaborative filtering.
In Proceedings of the 42nd international ACM SIGIR conference on Research and development in Information Retrieval.
165–174.
[368] Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. 2019. Heterogeneous graph
attention network. In The world wide web conference. 2022–2032.
[369] Xiang Wang, Hongye Jin, An Zhang, Xiangnan He, Tong Xu, and Tat-Seng Chua. 2020. Disentangled graph collabora-
tive filtering. In Proceedings of the 43rd international ACM SIGIR conference on research and development in information
retrieval. 1001–1010.
[370] Yaqing Wang, Abulikemu Abuduweili, Quanming Yao, and Dejing Dou. 2021. Property-aware relation networks for
few-shot molecular property prediction. Advances in Neural Information Processing Systems 34 (2021), 17441–17454.
[371] Yifan Wang, Yifang Qin, Fang Sun, Bo Zhang, Xuyang Hou, Ke Hu, Jia Cheng, Jun Lei, and Ming Zhang. 2022.
DisenCTR: Dynamic Graph-based Disentangled Representation for Click-Through Rate Prediction. In Proceedings of
the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2314–2318.
[372] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. 2019. Dynamic
graph cnn for learning on point clouds. Acm Transactions On Graphics (tog) 38, 5 (2019), 1–12.
[373] Yuyang Wang, Jianren Wang, Zhonglin Cao, and Amir Barati Farimani. 2022. Molecular contrastive learning of
representations via graph neural networks. Nature Machine Intelligence 4, 3 (2022), 279–287.
[374] Zichao Wang, Weili Nie, Zhuoran Qiao, Chaowei Xiao, Richard Baraniuk, and Anima Anandkumar. 2022. Retrieval-
based Controllable Molecule Generation. arXiv preprint arXiv:2208.11126 (2022).
[375] Ziyang Wang, Wei Wei, Gao Cong, Xiao-Li Li, Xian-Ling Mao, and Minghui Qiu. 2020. Global context enhanced graph
neural networks for session-based recommendation. In Proceedings of the 43rd international ACM SIGIR conference on
research and development in information retrieval. 169–178.
[376] Zhaobo Wang, Yanmin Zhu, Qiaomei Zhang, Haobing Liu, Chunyang Wang, and Tong Liu. 2022. Graph-Enhanced
Spatial-Temporal Network for Next POI Recommendation. ACM Transactions on Knowledge Discovery from Data
(TKDD) 16, 6 (2022), 1–21.
[377] Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, Richang Hong, and Tat-Seng Chua. 2019. MMGCN: Multi-
modal graph convolution network for personalized recommendation of micro-video. In Proceedings of the 27th ACM
International Conference on Multimedia. 1437–1445.
[378] Boris Weisfeiler and Andrei Leman. 1968. The reduction of a graph to canonical form and the algebra which appears
therein. NTI, Series 2, 9 (1968), 12–16.
[379] Heitor Werneck, Nícollas Silva, Matheus Carvalho Viana, Fernando Mourão, Adriano CM Pereira, and Leonardo
Rocha. 2020. A survey on point-of-interest recommendation in location-based social networks. In Proceedings of the
Brazilian Symposium on Multimedia and the Web. 185–192.
[380] Oliver Wieder, Stefan Kohlbacher, Mélaine Kuenemann, Arthur Garon, Pierre Ducrot, Thomas Seidel, and Thierry
Langer. 2020. A compact review of molecular property prediction with graph neural networks. Drug Discovery Today:
Technologies 37 (2020), 1–12.
[381] Christopher KI Williams and Matthias Seeger. 2001. Using the Nyström method to speed up kernel machines. In
Advances in neural information processing systems. 682–688.
[382] Michael Withnall, Edvard Lindelöf, Ola Engkvist, and Hongming Chen. 2020. Building attention and edge message
passing neural networks for bioactivity and physical–chemical property prediction. Journal of cheminformatics 12, 1
(2020), 1–18.
[383] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019. Simplifying graph
convolutional networks. In International conference on machine learning. PMLR, 6861–6871.
[384] Junran Wu, Xueyuan Chen, Ke Xu, and Shangzhe Li. 2022. Structural entropy guided graph hierarchical pooling. In
International Conference on Machine Learning. PMLR, 24017–24030.
[385] Shu Wu, Yuyuan Tang, Yanqiao Zhu, Liang Wang, Xing Xie, and Tieniu Tan. 2019. Session-based recommendation
with graph neural networks. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 346–353.
[386] Zhanghao Wu, Paras Jain, Matthew Wright, Azalia Mirhoseini, Joseph E Gonzalez, and Ion Stoica. 2021. Representing
long-range context for graph neural networks with global attention. Advances in Neural Information Processing
Systems 34 (2021), 13266–13279.
[387] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. 2020. A comprehensive
survey on graph neural networks. IEEE transactions on neural networks and learning systems 32, 1 (2020), 4–24.
[388] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. 2019. Graph wavenet for deep spatial-
temporal graph modeling. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. 1907–1913.
[389] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing,
and Vijay Pande. 2018. MoleculeNet: a benchmark for molecular machine learning. Chemical science 9, 2 (2018),
513–530.
[390] Feng Xia, Ke Sun, Shuo Yu, Abdul Aziz, Liangtian Wan, Shirui Pan, and Huan Liu. 2021. Graph learning: A survey.
IEEE Transactions on Artificial Intelligence 2, 2 (2021), 109–127.
[391] Jun Xia, Lirong Wu, Jintao Chen, Bozhen Hu, and Stan Z Li. 2022. Simgrace: A simple framework for graph contrastive
learning without data augmentation. In Proceedings of the ACM Web Conference 2022. 1070–1079.
[392] Lianghao Xia, Chao Huang, Yong Xu, Jiashu Zhao, Dawei Yin, and Jimmy Huang. 2022. Hypergraph contrastive
collaborative filtering. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in
Information Retrieval. 70–79.
[393] Peng Xie, Tianrui Li, Jia Liu, Shengdong Du, Xin Yang, and Junbo Zhang. 2020. Urban flow prediction from
spatiotemporal data using machine learning: A survey. Information Fusion 59 (2020), 1–12.
[394] Yu Xie, Yanfeng Liang, Maoguo Gong, AK Qin, Yew-Soon Ong, and Tiantian He. 2022. Semisupervised Graph Neural
Networks for Graph Classification. IEEE Transactions on Cybernetics (2022).
[395] Yu Xie, Shengze Lv, Yuhua Qian, Chao Wen, and Jiye Liang. 2022. Active and Semi-supervised Graph Neural Networks
for Graph Classification. IEEE Transactions on Big Data (2022).
[396] Yutong Xie, Chence Shi, Hao Zhou, Yuwei Yang, Weinan Zhang, Yong Yu, and Lei Li. 2021. Mars: Markov molecular
sampling for multi-objective drug discovery. arXiv preprint arXiv:2103.10432 (2021).
[397] Yaochen Xie, Zhao Xu, Jingtun Zhang, Zhengyang Wang, and Shuiwang Ji. 2022. Self-supervised learning of graph
neural networks: A unified review. IEEE transactions on pattern analysis and machine intelligence (2022).
[398] Haoyi Xiong, Amin Vahedian, Xun Zhou, Yanhua Li, and Jun Luo. 2018. Predicting traffic congestion propagation
patterns: A propagation graph approach. In Proceedings of the 11th ACM SIGSPATIAL International Workshop on
Computational Transportation Science. 60–69.
[399] Bingbing Xu, Huawei Shen, Qi Cao, Yunqi Qiu, and Xueqi Cheng. 2019. Graph Wavelet Neural Network. In International
Conference on Learning Representations.
[400] Chonghuan Xu, Austin Shijun Ding, and Kaidi Zhao. 2021. A novel POI recommendation method based on trust
relationship and spatial–temporal factors. Electronic Commerce Research and Applications 48 (2021), 101060.
[401] Chenxiao Xu, Hao Huang, and Shinjae Yoo. 2019. Scalable causal graph learning through a deep neural network. In
Proceedings of the 28th ACM international conference on information and knowledge management. 1853–1862.
[402] Chonghuan Xu, Dongsheng Liu, and Xinyao Mei. 2021. Exploring an efficient POI recommendation model based on
user characteristics and spatial-temporal factors. Mathematics 9, 21 (2021), 2673.
[403] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How Powerful are Graph Neural Networks?. In
International Conference on Learning Representations.
[404] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018.
Representation learning on graphs with jumping knowledge networks. In International conference on machine learning.
PMLR, 5453–5462.
[405] Zhenyi Xu, Yu Kang, Yang Cao, and Zhijun Li. 2020. Spatiotemporal graph convolution multifusion network for urban
vehicle emission prediction. IEEE Transactions on Neural Networks and Learning Systems 32, 8 (2020), 3342–3354.
[406] Zheng Xu, Sheng Wang, Feiyun Zhu, and Junzhou Huang. 2017. Seq2seq fingerprint: An unsupervised deep molecular
embedding for drug discovery. In Proceedings of the 8th ACM international conference on bioinformatics, computational
biology, and health informatics. 285–294.
[407] Chaoying Yang, Kaibo Zhou, and Jie Liu. 2021. SuperGraph: Spatial-temporal graph-based feature extraction for
rotating machinery diagnosis. IEEE Transactions on Industrial Electronics 69, 4 (2021), 4167–4176.
[408] Haoran Yang, Hongxu Chen, Shirui Pan, Lin Li, Philip S Yu, and Guandong Xu. 2022. Dual Space Graph Contrastive
Learning. In Proceedings of the Web Conference. 1238–1247.
[409] Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy
Hopper, Brian Kelley, Miriam Mathea, et al. 2019. Analyzing learned molecular representations for property prediction.
Journal of chemical information and modeling 59, 8 (2019), 3370–3388.
[410] Yuanxuan Yang, Alison Heppenstall, Andy Turner, and Alexis Comber. 2020. Using graph structural information
about flows to enhance short-term demand prediction in bike-sharing systems. Computers, Environment and Urban
Systems 83 (2020), 101521.
[411] Yuhao Yang, Chao Huang, Lianghao Xia, Yuxuan Liang, Yanwei Yu, and Chenliang Li. 2022. Multi-Behavior
Hypergraph-Enhanced Transformer for Sequential Recommendation. In Proceedings of the 28th ACM SIGKDD Confer-
ence on Knowledge Discovery and Data Mining. 2263–2274.
[412] Huaxiu Yao, Fei Wu, Jintao Ke, Xianfeng Tang, Yitian Jia, Siyu Lu, Pinghua Gong, Jieping Ye, and Zhenhui Li. 2018.
Deep multi-view spatial-temporal network for taxi demand prediction. In Proceedings of the AAAI conference on
artificial intelligence, Vol. 32.
[413] Shaowei Yao, Tianming Wang, and Xiaojun Wan. 2020. Heterogeneous graph transformer for graph-to-sequence
learning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 7145–7154.
[414] Mehmet Yildirimoglu and Jiwon Kim. 2018. Identification of communities in urban mobility networks using multi-layer
graphs of network traffic. Transportation Research Part C: Emerging Technologies 89 (2018), 254–267.
[415] Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. 2021.
Do transformers really perform badly for graph representation? Advances in Neural Information Processing Systems
34 (2021), 28877–28888.
[416] Zhitao Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. 2019. Gnnexplainer: Generating
explanations for graph neural networks. In Proceedings of the Conference on Neural Information Processing Systems.
[417] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. 2018. Hierarchical graph
representation learning with differentiable pooling. Advances in neural information processing systems 31 (2018).
[418] Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay Pande, and Jure Leskovec. 2018. Graph convolutional policy network for
goal-directed molecular graph generation. Advances in neural information processing systems 31 (2018).
[419] Yuning You, Tianlong Chen, Zhangyang Wang, and Yang Shen. 2020. When does self-supervision help graph
convolutional networks?. In international conference on machine learning. PMLR, 10871–10880.
[420] Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2017. Spatio-temporal graph convolutional networks: A deep learning
framework for traffic forecasting. arXiv preprint arXiv:1709.04875 (2017).
[421] Donghan Yu, Ruohong Zhang, Zhengbao Jiang, Yuexin Wu, and Yiming Yang. 2021. Graph-revised convolutional
network. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2020, Ghent,
Belgium, September 14–18, 2020, Proceedings, Part III. Springer, 378–393.
[422] Junliang Yu, Hongzhi Yin, Xin Xia, Tong Chen, Lizhen Cui, and Quoc Viet Hung Nguyen. 2022. Are graph augmenta-
tions necessary? simple graph contrastive learning for recommendation. In Proceedings of the 45th International ACM
SIGIR Conference on Research and Development in Information Retrieval. 1294–1303.
[423] Le Yu, Bowen Du, Xiao Hu, Leilei Sun, Liangzhe Han, and Weifeng Lv. 2021. Deep spatio-temporal graph convolutional
network for traffic accident prediction. Neurocomputing 423 (2021), 135–147.
[424] Shuo Yu, Jiaying Liu, Zhuo Yang, Zhen Chen, Huizhen Jiang, Amr Tolba, and Feng Xia. 2018. PAVE: Personalized
Academic Venue recommendation Exploiting co-publication networks. Journal of Network and Computer Applications
104 (2018), 38–47.
[425] Xiao Yu, Quanquan Gu, Mianwei Zhou, and Jiawei Han. 2012. Citation prediction in heterogeneous bibliographic
networks. In Proceedings of the 2012 SIAM international conference on data mining. SIAM, 1119–1130.
[426] Zhaoning Yu and Hongyang Gao. 2022. Molecular representation learning via heterogeneous motif graph neural
networks. In International Conference on Machine Learning. PMLR, 25581–25594.
[427] Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. 2019. Deep modular co-attention networks for visual question
answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 6281–6290.
[428] Han Yue, Chunhui Zhang, Chuxu Zhang, and Hongfu Liu. 2022. Label-invariant Augmentation for Semi-Supervised
Graph Classification. In Proceedings of the Conference on Neural Information Processing Systems.
[429] Chengxi Zang and Fei Wang. 2020. MoFlow: an invertible flow model for generating molecular graphs. In Proceedings
of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 617–626.
[430] Chen Zhang, Qiuchi Li, and Dawei Song. 2019. Aspect-based Sentiment Classification with Aspect-specific Graph
Convolutional Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 4568–4578.
[431] Cai Zhang, Weimin Li, Dingmei Wei, Yanxia Liu, and Zheng Li. 2022. Network dynamic GCN influence maximization
algorithm with leader fake labeling mechanism. IEEE Transactions on Computational Social Systems (2022).
[432] Chuxu Zhang, Dongjin Song, Chao Huang, Ananthram Swami, and Nitesh V Chawla. 2019. Heterogeneous graph
neural network. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining.
793–803.
[433] Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016. Collaborative knowledge base
embedding for recommender systems. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge
discovery and data mining. 353–362.
[434] Hengyuan Zhang, Suyao Zhao, Ruiheng Liu, Wenlong Wang, Yixin Hong, and Runjiu Hu. 2022. Automatic traffic
anomaly detection on the road network with spatial-temporal graph neural network representation learning. Wireless
Communications and Mobile Computing 2022 (2022).
[435] Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. 2018. Gaan: Gated attention networks
for learning on large and spatiotemporal graphs. arXiv preprint arXiv:1803.07294 (2018).
[436] Jiawei Zhang, Haopeng Zhang, Congying Xia, and Li Sun. 2020. Graph-bert: Only attention is needed for learning
graph representations. arXiv preprint arXiv:2001.05140 (2020).
[437] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. 2018. An end-to-end deep learning architecture for
graph classification. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32.
[438] Muhan Zhang and Pan Li. 2021. Nested graph neural networks. Advances in Neural Information Processing Systems 34
(2021), 15734–15747.
[439] Xiaoyu Zhang, Sheng Wang, Feiyun Zhu, Zheng Xu, Yuhong Wang, and Junzhou Huang. 2018. Seq3seq fingerprint:
towards end-to-end semi-supervised deep drug discovery. In Proceedings of the 2018 ACM International Conference on
Bioinformatics, Computational Biology, and Health Informatics. 404–413.
[440] Xiang Zhang and Marinka Zitnik. 2020. Gnnguard: Defending graph neural networks against adversarial attacks.
Advances in neural information processing systems 33 (2020), 9263–9275.
[441] Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and Shaoping Ma. 2014. Explicit factor models for
explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th international ACM
SIGIR conference on Research & development in information retrieval. 83–92.
[442] Yingxue Zhang, Soumyasundar Pal, Mark Coates, and Deniz Ustebay. 2019. Bayesian graph convolutional neural
networks for semi-supervised classification. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33.
5829–5836.
[443] Yu Zhang, Peter Tiňo, Aleš Leonardis, and Ke Tang. 2021. A survey on neural network interpretability. IEEE
Transactions on Emerging Topics in Computational Intelligence (2021).
[444] Yutao Zhang, Fanjin Zhang, Peiran Yao, and Jie Tang. 2018. Name Disambiguation in AMiner: Clustering, Maintenance,
and Human in the Loop.. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery &
data mining. 1002–1011.
[445] Zhen Zhang, Jiajun Bu, Martin Ester, Jianfeng Zhang, Chengwei Yao, Zhi Yu, and Can Wang. 2019. Hierarchical
graph pooling with structure learning. arXiv preprint arXiv:1911.05954 (2019).
[446] Ziwei Zhang, Peng Cui, and Wenwu Zhu. 2020. Deep learning on graphs: A survey. IEEE Transactions on Knowledge
and Data Engineering (2020).
[447] Zaixi Zhang, Qi Liu, Hao Wang, Chengqiang Lu, and Chee-Kong Lee. 2021. Motif-based graph self-supervised learning
for molecular property prediction. Advances in Neural Information Processing Systems 34 (2021), 15870–15882.
[448] Zaixi Zhang, Yaosen Min, Shuxin Zheng, and Qi Liu. 2023. Molecule Generation For Target Protein Binding with
Structural Motifs. In The Eleventh International Conference on Learning Representations.
[449] Zhen Zhang, Mianzhi Wang, Yijian Xiang, Yan Huang, and Arye Nehorai. 2018. Retgk: Graph kernels based on return
probabilities of random walks. In Advances in Neural Information Processing Systems. 3964–3974.
[450] Jialin Zhao, Yuxiao Dong, Ming Ding, Evgeny Kharlamov, and Jie Tang. 2021. Adaptive diffusion in graph neural
networks. Advances in Neural Information Processing Systems 34 (2021), 23321–23333.
[451] Jianan Zhao, Xiao Wang, Chuan Shi, Binbin Hu, Guojie Song, and Yanfang Ye. 2021. Heterogeneous graph structure
learning for graph neural networks. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 4697–4705.
[452] Lingxiao Zhao, Saurabh Sawlani, Arvind Srinivasan, and Leman Akoglu. 2022. Graph Anomaly Detection with
Unsupervised GNNs. arXiv preprint arXiv:2210.09535 (2022).
[453] Pengpeng Zhao, Anjing Luo, Yanchi Liu, Fuzhen Zhuang, Jiajie Xu, Zhixu Li, Victor S Sheng, and Xiaofang Zhou. 2020.
Where to go next: A spatio-temporal gated network for next poi recommendation. IEEE Transactions on Knowledge
and Data Engineering (2020).
[454] Tianxiang Zhao, Xiang Zhang, and Suhang Wang. 2021. Graphsmote: Imbalanced node classification on graphs
with graph neural networks. In Proceedings of the 14th ACM international conference on web search and data mining.
833–841.
[455] Yusheng Zhao, Jinyu Chen, Chen Gao, Wenguan Wang, Lirong Yang, Haibing Ren, Huaxia Xia, and Si Liu. 2022.
Target-Driven Structured Transformer Planner for Vision-Language Navigation. In Proceedings of the 30th ACM