Graph Neural Networks: Foundations, Frontiers, and Applications
Lingfei Wu • Peng Cui • Jian Pei • Liang Zhao
Editors

Lingfei Wu
JD Silicon Valley Research Center, Mountain View, CA, USA

Peng Cui
Tsinghua University, Beijing, China
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore
Pte Ltd. 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Foreword
“The first comprehensive book covering the full spectrum of a young, fast-growing
research field, graph neural networks (GNNs), written by authoritative authors!”
Jiawei Han (Michael Aiken Chair Professor at University of Illinois at Urbana-
Champaign, ACM Fellow and IEEE Fellow)
“This book presents a comprehensive and timely survey on graph representation
learning. Edited and contributed by the best group of experts in this area, this book
is a must-read for students, researchers and practitioners who want to learn anything
about Graph Neural Networks.”
Heung-Yeung “Harry” Shum (Former Executive Vice President for Technology
and Research at Microsoft Research, ACM Fellow, IEEE Fellow, FREng)
“As the new frontier of deep learning, Graph Neural Networks offer great potential
to combine probabilistic learning and symbolic reasoning, and bridge knowledge-
driven and data-driven paradigms, nurturing the development of third-generation
AI. This book provides a comprehensive and insightful introduction to GNN, rang-
ing from foundations to frontiers, from algorithms to applications. It is a valuable
resource for any scientist, engineer and student who wants to get into this exciting
field.”
Bo Zhang (Member of Chinese Academy of Science, Professor at Tsinghua Uni-
versity)
“Graph Neural Networks are one of the hottest areas of machine learning and this
book is a wonderful in-depth resource covering a broad range of topics and applica-
tions of graph representation learning.”
Jure Leskovec (Associate Professor at Stanford University, and investigator at
Chan Zuckerberg Biohub).
“Graph Neural Networks are an emerging machine learning model that is already
taking the scientific and industrial world by storm. The time is perfect to get in on the
action – and this book is a great resource for newcomers and seasoned practitioners
alike! Its chapters are very carefully written by many of the thought leaders at the
forefront of the area.”
Petar Veličković (Senior Research Scientist, DeepMind)
Preface
The field of graph neural networks (GNNs) has seen rapid and incredible strides over
the recent years. Graph neural networks, also known as deep learning on graphs,
graph representation learning, or geometric deep learning, have become one of the
fastest-growing research topics in machine learning, especially deep learning. This
wave of research at the intersection of graph theory and deep learning has also influ-
enced other fields of science, including recommendation systems, computer vision,
natural language processing, inductive logic programming, program synthesis, soft-
ware mining, automated planning, cybersecurity, and intelligent transportation.
Although graph neural networks have attracted remarkable attention, they still face
many challenges when applied to various domains, from the theoretical under-
standing of methods to their scalability and interpretability in real systems, and
from the soundness of the methodology to the empirical performance in an applica-
tion. Moreover, as the field rapidly grows, it has been extremely challenging to gain
a global perspective of the developments of GNNs. Therefore, we feel the urgency
to bridge the above gap and have a comprehensive book on this fast-growing yet
challenging topic, which can benefit a broad audience including advanced under-
graduate and graduate students, postdoctoral researchers, lecturers, and industrial
practitioners.
This book is intended to cover a broad range of topics in graph neural networks,
from the foundations to the frontiers, and from the methodologies to the applica-
tions. Our book is dedicated to introducing the fundamental concepts and algorithms
of GNNs, new research frontiers of GNNs, and broad and emerging applications
with GNNs.
The website and further resources of this book can be found at: https://
graph-neural-networks.github.io/. The website provides online preprints
and lecture slides of all the chapters. It also provides pointers to useful material and
resources that are publicly available and relevant to graph neural networks.
To the Instructors
The book can be used for a one-semester course for graduate students.
Though it is mainly written for students with a background in computer science,
students with a basic understanding of probability, statistics, graph theory, linear
algebra, and machine learning techniques such as deep learning will find it easily
accessible. Some chapters can be skipped, or assigned as homework for review, if
students are already familiar with their material. For example, if students
have taken a deep learning course, they can skip Chapter 1. The instructors can also
choose to combine Chapters 1, 2, and 3 together as a background introduction course
at the very beginning.
When the course focuses more on the foundation and theories of graph neural net-
works, the instructor can choose to focus more on Chapters 4-8 while using Chapters
19-27 to showcase the applications, motivations, and limitations. Please refer to the
Editors’ Notes at the end of each chapter on how Chapters 4-8 and Chapters 19-27
are correlated. When the course focuses more on the research frontiers, Chapters
9-18 can be the pivot to organize the course. For example, an instructor can make
it an advanced graduate course where the students are asked to search and present
the most recent research papers in each different research frontier. They can also
be asked to establish their course projects based on the applications described in
Chapters 19-27 as well as the materials provided on our website.
To the Readers
This book was designed to cover a wide range of topics in the field of graph neural
networks, including background, theoretical foundations, methodologies, research
frontiers, and applications. Therefore, it can be treated as a comprehensive
handbook for a wide variety of readers such as students, researchers, and profession-
als. You should have some knowledge of the concepts and terminology associated
with statistics, machine learning, and graph theory. Some backgrounds of the basics
have been provided and referenced in the first eight chapters. Knowledge of deep
learning and some programming experience will also help you access most chapters
of this book more easily. In particular, you should be able to read
pseudocode and understand graph structures.
The book is well modularized and each chapter can be read in a standalone
manner based on individual interests and needs. For those readers who want
to have a solid understanding of various techniques and theories of graph neural
networks, you can start from Chapters 4-9. For those who further want to perform
in-depth research and advance related fields, please read those chapters of interest
among Chapters 9-18, which provide comprehensive knowledge in the most recent
research issues, open problems, and research frontiers. For those who want to ap-
ply graph neural networks to benefit specific domains, or aim at finding interesting
applications to validate specific graph neural networks techniques, please refer to
Chapters 19-27.
Acknowledgements
Graph machine learning has attracted many gifted researchers to make their seminal
contributions over the last few years. We are very fortunate to discuss the chal-
lenges and opportunities, and often work with many of them on a rich variety of
research topics in this exciting field. We are deeply indebted to these collaborators
and colleagues from JD.COM, IBM Research, Tsinghua University, Simon Fraser
University, Emory University, and elsewhere, who encouraged us to create such a
book comprehensively covering various topics of Graph Neural Networks in order
to educate the interested beginners and foster the advancement of the field for both
academic researchers and industrial practitioners.
This book would not have been possible without the contributions of many peo-
ple. We would like to give many thanks to the people who offered feedback on
checking the consistency of the math notations of the entire book as well as ref-
erence editing of this book. They are people from Emory University: Ling Chen,
Xiaojie Guo, and Shiyu Wang, as well as people from Tsinghua University: Yue He,
Ziwei Zhang, and Haoxin Liu. We would like to give our special thanks to Dr. Xiao-
jie Guo, who generously offered her help in providing numerous valuable feedback
on many chapters.
We also want to thank those who allowed us to reproduce images, figures, or data
from their publications.
Finally, we would like to thank our families for their love, patience and support
during this very unusual time when we are writing and editing this book.
Editor Biography
List of Contributors
Miltiadis Allamanis
Microsoft Research, Cambridge, UK
Yu Chen
Facebook AI, Menlo Park, CA, USA
Yunfei Chu
Alibaba Group, Hangzhou, China
Peng Cui
Tsinghua University, Beijing, China
Tyler Derr
Vanderbilt University, Nashville, TN, USA
Keyu Duan
Texas A&M University, College Station, TX, USA
Qizhang Feng
Texas A&M University, College Station, TX, USA
Stephan Günnemann
Technical University of Munich, München, Germany
Xiaojie Guo
JD.COM Silicon Valley Research Center, Mountain View, CA, USA
Yu Hou
Weill Cornell Medicine, New York City, New York, USA
Xia Hu
Texas A&M University, College Station, TX, USA
Junzhou Huang
University of Texas at Arlington, Arlington, TX, USA
Shouling Ji
Contents

Terminologies
  1 Basic Concepts of Graphs
  2 Machine Learning on Graphs
  3 Graph Neural Networks

Notations

Part I Introduction

1 Representation Learning
  Liang Zhao, Lingfei Wu, Peng Cui and Jian Pei
  1.1 Representation Learning: An Introduction
  1.2 Representation Learning in Different Areas
    1.2.1 Representation Learning for Image Processing
    1.2.2 Representation Learning for Speech Recognition
    1.2.3 Representation Learning for Natural Language Processing
    1.2.4 Representation Learning for Networks
  1.3 Summary

2 Graph Representation Learning
  Peng Cui, Lingfei Wu, Jian Pei, Liang Zhao and Xiao Wang
  2.1 Graph Representation Learning: An Introduction
  2.2 Traditional Graph Embedding
  2.3 Modern Graph Embedding
    2.3.1 Structure-Property Preserving Graph Representation Learning
    2.3.2 Graph Representation Learning with Side Information
    2.3.3 Advanced Information Preserving Graph Representation Learning
  2.4 Graph Neural Networks
  2.5 Summary

References
Terminologies
• Graph: A graph is composed of a node set and an edge set, where nodes rep-
resent entities and edges represent the relationship between entities. The nodes
and edges form the topological structure of the graph. Besides the graph structure,
nodes, edges, and/or the whole graph can be associated with rich information
represented as node/edge/graph features (also known as attributes or contents).
• Subgraph: A subgraph is a graph whose set of nodes and set of edges are all
subsets of the original graph.
• Centrality: A centrality is a measurement of the importance of nodes in the
graph. The basic assumption of centrality is that a node is thought to be im-
portant if many other important nodes also connect to it. Common centrality
measurements include the degree centrality, the eigenvector centrality, the be-
tweenness centrality, and the closeness centrality.
• Neighborhood: The neighborhood of a node generally refers to other nodes that
are close to it. For example, the k-order neighborhood of a node, also called the
k-step neighborhood, denotes a set of other nodes in which the shortest path
distance between these nodes and the central node is no larger than k.
• Community Structure: A community refers to a group of nodes that are
densely connected internally and less densely connected externally.
• Graph Sampling: Graph sampling is a technique to pick a subset of nodes and/
or edges from the original graph. Graph sampling can be applied to train ma-
chine learning models on large-scale graphs while preventing severe scalability
issues.
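The k-order neighborhood defined above can be computed with a breadth-first search. The sketch below is illustrative only; the adjacency-list representation and function name are our own, not from this book:

```python
from collections import deque

def k_order_neighborhood(adjacency, center, k):
    """Return the set of nodes whose shortest-path distance from `center`
    is at most k (excluding the center node itself)."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # no need to expand beyond distance k
        for nbr in adjacency[node]:
            if nbr not in dist:            # first visit = shortest distance
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return {n for n, d in dist.items() if 0 < d <= k}

# Path graph 0 - 1 - 2 - 3 - 4
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(k_order_neighborhood(adj, 0, 2))  # {1, 2}
```

Because BFS visits nodes in order of increasing distance, the first time a node is reached its distance from the center is already the shortest-path distance.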
• Network Embedding: Network embedding aims to map each node into a low-
dimensional vector such that graph structures and some properties of the graph
are preserved in the embedding vectors. Network embedding is also referred to
as graph embedding and node representation learning.
• Graph Neural Network: Graph neural network refers to any neural network
working on graph data.
• Graph Convolutional Network: Graph convolutional network usually refers to
the specific graph neural network proposed by Kipf and Welling (2017a). In
some literature it is occasionally used as a synonym for graph neural network,
i.e., referring to any neural network working on graph data.
• Message-Passing: Message-passing is a framework of graph neural networks in
which the key step is to pass messages between different nodes based on graph
structures in each neural network layer. The most widely adopted formulation,
usually denoted as message-passing neural networks, is to only pass messages
between nodes that are directly connected Gilmer et al (2017). The message
passing functions are also called graph filters and graph convolutions in some
literature.
• Readout: Readout refers to functions that summarize the information of indi-
vidual nodes to form higher-level information, such as a subgraph/supergraph
representation or a representation of the entire graph. Readout is also called
pooling and graph coarsening in some literature.
• Graph Adversarial Attack: Graph adversarial attacks aim to generate worst-
case perturbations by manipulating the graph structure and/or node features so
that the performance of some models is degraded. Graph adversarial attacks
can be categorized based on the attacker’s goals, capabilities, and accessible
knowledge.
• Robustness Certificates: Methods providing formal guarantees that the predic-
tion of a GNN is not affected even when perturbations are performed under a
certain perturbation model.
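As a minimal illustration of the message-passing framework described above, the toy layer below passes messages between directly connected nodes. The sum aggregation and simple averaging update are placeholder choices for illustration (real GNN layers use learned weight functions); this is our own sketch, not a formulation from this book:

```python
# One round of message passing on a small graph with scalar node features.

def message_passing_step(adjacency, features):
    """adjacency: dict node -> list of directly connected neighbor nodes.
    features: dict node -> float feature value.
    Returns the updated feature dict after one message-passing layer."""
    updated = {}
    for node, neighbors in adjacency.items():
        # 1. Collect messages from directly connected nodes.
        messages = [features[nbr] for nbr in neighbors]
        # 2. Aggregate the messages (here: sum; mean/max are also common).
        aggregated = sum(messages)
        # 3. Update: combine the node's own feature with the aggregate
        #    (here: a plain average; real GNNs use learned transformations).
        updated[node] = 0.5 * (features[node] + aggregated)
    return updated

# A 4-node path graph: 0 - 1 - 2 - 3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
feats = {0: 1.0, 1: 0.0, 2: 0.0, 3: 1.0}
feats = message_passing_step(adj, feats)
print(feats)  # {0: 0.5, 1: 0.5, 2: 0.5, 3: 0.5}
```

Note that the update is synchronous: all nodes read the features from the previous layer, which is why the results are collected in a separate dictionary.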
Notations
This chapter provides a concise reference that describes the notations used through-
out this book.
A scalar x
A vector x
A matrix X
An identity matrix I
The set of real numbers R
The set of complex numbers C
The set of integers Z
The set of real n-length vectors Rn
The set of real m × n matrices Rm×n
The real interval including a and b [a, b]
The real interval including a but excluding b [a, b)
The element of the vector x with index i xi
The element of matrix X at row i and column j Xi, j
Graph Basics
A graph G
Edge set E
Vertex set V
Adjacency matrix of a graph A
Laplacian matrix L
Diagonal degree matrix D
Isomorphism between graphs G and H G ≅ H
H is a subgraph of graph G H ⊆ G
H is a proper subgraph of graph G H ⊂ G
Union of graphs G and H G ∪ H
Basic Operations
Transpose of matrix X X⊤
Dot product of matrices X and Y X ·Y or XY
Element-wise (Hadamard) product of matrices X and Y X ⊙Y
Determinant of X det(X)
p-norm (also called ℓ p norm) of x ∥x∥ p
Union ∪
Intersection ∩
Subset ⊆
Proper subset ⊂
Inner product of vectors x and y ⟨x, y⟩
Functions
Probabilistic Theory
1 Representation Learning

Abstract In this chapter, we first describe what representation learning is and why
we need representation learning. Among the various ways of learning representa-
tions, this chapter focuses on deep learning methods: those that are formed by the
composition of multiple non-linear transformations, with the goal of resulting in
more abstract and ultimately more useful representations. We summarize the repre-
sentation learning techniques in different domains, focusing on the unique chal-
lenges and models for different data types including images, natural languages,
speech signals and networks. Last, we summarize this chapter.
The effectiveness of machine learning techniques relies heavily not only on the de-
sign of the algorithms themselves, but also on a good representation (feature set) of
the data. Data representations that lack important information, or that contain in-
correct or highly redundant information, can lead to poor performance of
the algorithm across different tasks. The goal of representation learning is
to extract sufficient but minimal information from data. Traditionally, this can be
achieved via human effort based on prior knowledge and domain expertise on
the data and tasks, which is also known as feature engineering. In deploying ma-
Liang Zhao
Department of Computer Science, Emory University, e-mail: [email protected]
Lingfei Wu
JD.COM Silicon Valley Research Center, e-mail: [email protected]
Peng Cui
Department of Computer Science, Tsinghua University, e-mail: [email protected]
Jian Pei
Department of Computer Science, Simon Fraser University, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_1
chine learning and many other artificial intelligence algorithms, historically a large
portion of the human effort goes into the design of preprocessing pipelines and data
transformations. More specifically, feature engineering is a way to take advantage
of human ingenuity and prior knowledge in the hope to extract and organize the dis-
criminative information from the data for machine learning tasks. For example, po-
litical scientists may be asked to define a keyword list as the features of social-media
text classifiers for detecting those texts on societal events. For speech transcription
recognition, one may choose to extract features from raw sound waves by the op-
erations including Fourier transformations. Although feature engineering is widely
adopted over the years, its drawbacks are also salient: 1) Intensive labor
from domain experts is usually needed, because feature engineering may
require tight and extensive collaboration between model developers and domain ex-
perts. 2) Feature extraction can be incomplete and biased. Specifically, the capacity and
discriminative power of the extracted features are limited by the knowledge of the
domain experts. Moreover, in many domains where human beings have limited
knowledge, what features to extract is itself an open question to domain experts,
as in early cancer prediction. To avoid these drawbacks, making learn-
ing algorithms less dependent on feature engineering has been a highly desired goal
in machine learning and artificial intelligence, so that novel applications
could be constructed faster and hopefully addressed more effectively.
Representation learning techniques have evolved from traditional
approaches to more advanced ones. The traditional
methods belong to “shallow” models and aim to learn transformations of data that
make it easier to extract useful information when building classifiers or other pre-
dictors, such as Principal Component Analysis (PCA) (Wold et al, 1987), Gaussian
Markov random field (GMRF) (Rue and Held, 2005), and Locality Preserving Pro-
jections (LPP) (He and Niyogi, 2004). Deep learning-based representation learning
is formed by the composition of multiple non-linear transformations, with the goal
of yielding more abstract and ultimately more useful representations. In the light of
introducing more recent advancements and sticking to the major topic of this book,
here we majorly focus on deep learning-based representation learning, which can
be categorized into several types: (1) Supervised learning, where a large number of
labeled data are needed for the training of the deep learning models. Given the well-
trained networks, the output before the last fully-connected layers is always utilized
as the final representation of the input data; (2) Unsupervised learning (including
self-supervised learning), which facilitates the analysis of input data without corre-
sponding labels and aims to learn the underlying inherent structure or distribution
of data. The pre-tasks are utilized to explore the supervision information from large
amounts of unlabelled data. Based on this constructed supervision information, the
deep neural networks are trained to extract the meaningful representations for the
future downstream tasks; (3) Transfer learning, which involves methods that utilize
any knowledge resource (i.e., data, model, labels, etc.) to increase model learning
and generalization for the target task. Transfer learning encompasses different sce-
narios, including multi-task learning (MTL), model adaptation, knowledge transfer,
covariate shift, etc. There are also other important representation learning methods.
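As a concrete instance of the "shallow" methods mentioned above, PCA can be sketched in a few lines via the singular value decomposition. This is an illustrative sketch assuming NumPy is available, not code from this book:

```python
import numpy as np

def pca(data, n_components):
    """Project data (n_samples x n_features) onto its top principal
    components, i.e., the directions of maximal variance."""
    # Center the data: PCA operates on deviations from the mean.
    centered = data - data.mean(axis=0)
    # SVD of the centered data; the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # The learned "representation" is the projection onto those directions.
    return centered @ vt[:n_components].T

# Ten 3-D points that vary mostly along one direction:
rng = np.random.default_rng(0)
points = np.outer(rng.normal(size=10), [1.0, 2.0, 3.0])
points += 0.01 * rng.normal(size=(10, 3))   # small noise
embedding = pca(points, n_components=1)     # 10 x 1 representation
print(embedding.shape)  # (10, 1)
```

The learned transformation here is linear, which is exactly what distinguishes such shallow methods from the multi-layer non-linear compositions of deep representation learning.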
1.2.1 Representation Learning for Image Processing

Traditionally, image recognition was performed with the help of hand-crafted
features designed by human beings based on prior knowledge. For exam-
ple, Huang et al (2000) extracted the character’s structure features from the strokes,
then used them to recognize the handwritten characters. Rui (2005) adopted a mor-
phology method to improve local features of the characters, and then used PCA to
extract character features. However, all of these methods need to extract features
from images manually and thus the prediction performances strongly rely on the
prior knowledge. In the field of computer vision, manual feature extraction is very
cumbersome and impractical because of the high dimensionality of feature vec-
tors. Thus, representation learning for images, which can automatically extract mean-
ingful, hidden, and complex patterns from high-dimensional visual data, is necessary.
Deep learning-based representation learning for images is learned in an end-to-end
fashion, which can perform much better than hand-crafted features in the target ap-
plications, as long as the training data is of sufficient quality and quantity.
Supervised Representation Learning for Image Processing. In the domain of im-
age processing, supervised learning algorithms, such as Convolutional Neural Net-
works (CNNs) and Deep Belief Networks (DBNs), are commonly applied to solve
various tasks. One of the earliest deep-supervised-learning-based works was proposed in
2006 (Hinton et al, 2006); it focused on the MNIST digit image classifica-
tion problem and outperformed the state-of-the-art SVMs. Following this, deep convo-
lutional neural networks (ConvNets) showed amazing performance, which greatly
depends on their properties of shift invariance, weight sharing, and local pattern
capturing. Different types of network architectures were developed to increase the
capacity of network models, and larger and larger datasets were collected.
Various networks including AlexNet (Krizhevsky et al, 2012), VGG (Simonyan and
Zisserman, 2014b), GoogLeNet (Szegedy et al, 2015), ResNet (He et al, 2016a),
and DenseNet (Huang et al, 2017a) and large scale datasets, such as ImageNet and
OpenImage, have been proposed to train very deep convolutional neural networks.
With the sophisticated architectures and large-scale datasets, convolutional neural
networks keep achieving state-of-the-art performance in various computer vision
tasks.
Unsupervised Representation Learning for Image Processing. The collection and
annotation of large-scale datasets are time-consuming and expensive for both image
and video datasets. For example, ImageNet contains about 1.3 million la-
beled images covering 1,000 classes, where each image is labeled by human workers
with one class label. To alleviate the extensive human annotation labor, many unsu-
pervised methods were proposed to learn visual features from large-scale unlabeled
images or videos without using any human annotations. A popular solution is to
propose various pretext tasks for models to solve, while the models can be trained
by learning objective functions of the pretext tasks and the features are learned
through this process. Various pretext tasks have been proposed for unsupervised
learning, including colorizing gray-scale images (Zhang et al, 2016d) and image in-
painting (Pathak et al, 2016). During the unsupervised training phase, a predefined
pretext task is designed for the models to solve, and the pseudo labels for the pretext
task are automatically generated based on some attributes of data. Then the models
are trained according to the objective functions of the pretext tasks. When trained
with pretext tasks, the shallower blocks of the deep neural network models focus on
the low-level general features such as corners, edges, and textures, while the deeper
blocks focus on the high-level task-specific features such as objects, scenes, and
object parts. Therefore, the models trained with pretext tasks can learn kernels to
capture low-level features and high-level features that are helpful for other down-
stream tasks. After the unsupervised training is finished, the learned visual features
in these pre-trained models can be further transferred to downstream tasks (especially
when only relatively small data is available) to improve performance and overcome
over-fitting.
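The pseudo-label generation step described above can be illustrated with the colorization pretext task: the pseudo-label is simply the original color image, and the model's input is its grayscale version, so no human annotation is needed. A toy sketch of our own (the luminance weights are the standard Rec. 601 coefficients, an assumption, not values from this book):

```python
import numpy as np

def make_colorization_pair(rgb_image):
    """Build an (input, pseudo-label) training pair for the colorization
    pretext task: the input is the grayscale version of the image and the
    pseudo-label is the original color image."""
    # Standard luminance weights for RGB -> grayscale conversion.
    weights = np.array([0.299, 0.587, 0.114])
    grayscale = rgb_image @ weights   # (H, W): model input
    return grayscale, rgb_image       # pseudo-label comes "for free"

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))         # a toy 8x8 RGB image
x, y = make_colorization_pair(image)
print(x.shape, y.shape)  # (8, 8) (8, 8, 3)
```

A model trained to map x back to y must learn features describing objects and textures, which is what makes these "free" labels useful for downstream tasks.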
Transfer Learning for Image Processing. In real-world applications, due to the
high cost of manual labeling, sufficient training data that belongs to the same fea-
ture space or distribution as the testing data may not always be accessible. Transfer
learning mimics the human vision system by making use of sufficient amounts of
prior knowledge in other related domains (i.e., source domains) when executing
new tasks in the given domain (i.e., target domain). In transfer learning, both the
training set and the test set can contribute to the target and source domains. In most
cases, there is only one target domain for a transfer learning task, while either single
or multiple source domains can exist. The techniques of transfer learning in im-
age processing can be categorized into feature representation knowledge transfer
and classifier-based knowledge transfer. Specifically, feature representation trans-
fer methods map the target domain to the source domains by exploiting a set of
extracted features, where the data divergence between the target domain and the
source domains can be significantly reduced so that the performance of the task
in the target domain is improved. In contrast, classifier-based knowledge-transfer
methods usually share the common trait that the learned source domain models are
utilized as prior knowledge, which are used to learn the target model together with
the training samples. Instead of minimizing the cross-domain dissimilarity by up-
dating instances’ representations, classifier-based knowledge-transfer methods aim
to learn a new model that minimizes the generalization error in the target domain
via the provided training set from both domains and the learned model.
Other Representation Learning for Image Processing. Other types of representa-
tion learning are also commonly observed for dealing with image processing, such
as reinforcement learning and semi-supervised learning. For example, reinforce-
ment learning is commonly explored in the tasks of image captioning (Liu et al,
2018a; Ren et al, 2017) and image editing (Kosugi and Yamasaki, 2020), where
the learning process is formalized as a sequence of actions based on a policy net-
work.
1.2.2 Representation Learning for Speech Recognition

Nowadays, speech interfaces or systems have become widely developed and inte-
grated into various real-life applications and devices. Services like Siri1, Cortana2,
and Google Voice Search3 have become a part of our daily life and are used by mil-
lions of users. The exploration in speech recognition and analysis has always been
motivated by a desire to enable machines to participate in verbal human-machine
interactions. The research goals of enabling machines to understand human speech,
identify speakers, and detect human emotion have attracted researchers’ attention
for more than sixty years across several distinct research areas, including but not
limited to Automatic Speech Recognition (ASR), Speaker Recognition (SR), and
Speaker Emotion Recognition (SER).
Analyzing and processing speech has been a key application of machine learning (ML) algorithms. Research on speech recognition has traditionally treated the design of hand-crafted acoustic features as a problem distinct from the design of efficient models that make prediction and classification decisions. This approach has two main drawbacks: first, the feature engineering is cumbersome and requires human knowledge, as introduced above; and second, the designed features might not be optimal for the specific speech recognition task at hand. This has motivated a recent trend in the speech community towards representation learning techniques, which automatically learn an intermediate representation of the input signal that better fits the task at hand and hence leads to improved performance. Among these successes, deep learning-based speech representations play an important role. One major reason for using representation learning techniques in speech technology is that speech data is fundamentally different from two-dimensional image data: images can be analyzed as a whole or in patches, but speech has to be processed sequentially to capture temporal dependencies and patterns.
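To make this sequential formatting concrete, the sketch below cuts a raw waveform into short overlapping frames, the usual first step before any speech representation is learned. The frame sizes are illustrative conventions (25 ms windows with a 10 ms shift at 16 kHz), not values taken from the text.

```python
def frame_signal(samples, frame_len=400, hop=160):
    """Split a 1-D list of audio samples into overlapping frames.

    With a 16 kHz sampling rate, frame_len=400 and hop=160 correspond
    to the common 25 ms analysis windows with a 10 ms shift.
    """
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

# A 1-second dummy signal at 16 kHz yields 98 frames of 400 samples each.
signal = [0.0] * 16000
frames = frame_signal(signal)
```

Sequence models for speech then consume these frames one by one, which is exactly what makes temporal dependencies learnable.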
Supervised representation learning for speech recognition. In the domain of speech recognition and analysis, supervised representation learning methods are widely employed, where feature representations are learned from datasets by leveraging label information. For example, restricted Boltzmann machines (RBMs) (Jaitly and Hinton, 2011; Dahl et al, 2010) and deep belief networks (DBNs) (Cairong et al, 2016; Ali et al, 2018) are commonly utilized to learn features from speech for different tasks, including ASR, speaker recognition, and SER. Notably, in 2012, Microsoft released a new version of its MAVIS (Microsoft Audio Video Indexing Service) speech system based on context-dependent deep neural networks (Seide et al, 2011). The authors managed to reduce the word error rate on four major benchmarks by about 30% (e.g., from 27.4% to 18.5% on RT03S) compared with the previous state of the art.
1 Siri is an artificial intelligence assistant built into Apple's iOS system.
2 Microsoft Cortana is an intelligent personal assistant developed by Microsoft, known as "the world's first cross-platform intelligent personal assistant".
3 Google Voice Search is a Google product that allows users to search by speaking to a mobile phone or computer: the spoken audio is recognized by a server, and information is then retrieved based on the recognition results.
1 Representation Learning 9
Using MTL with different auxiliary tasks, including gender, speaker adaptation, and speech enhancement, it has been shown that the shared representations learned for different tasks can act as complementary information about the acoustic environment and give a lower word error rate (WER) (Parthasarathy and Busso, 2017; Xia and Liu, 2015).
Other Representation Learning for speech recognition. Beyond the above-mentioned categories of representation learning for speech signals, some other representation learning techniques are also commonly explored, such as semi-supervised learning and reinforcement learning. For example, in ASR, semi-supervised learning is mainly used to circumvent the lack of sufficient training data. This can be achieved by creating feature front-ends (Thomas et al, 2013), by using multilingual acoustic representations (Cui et al, 2015), or by extracting an intermediate representation from large unpaired datasets (Karita et al, 2018). RL is also gaining interest in the area of speech recognition, and multiple approaches have modeled different speech problems, including dialog modeling and optimization (Levin et al, 2000), speech recognition (Shen et al, 2019), and emotion recognition (Sangeetha and Jayasankar, 2019).
Besides speech recognition, representation learning has many other Natural Language Processing (NLP) applications, such as text representation learning. For example, Google's image search exploits huge quantities of data to map images and queries into the same space (Weston et al, 2010) based on NLP techniques. In general, there are two types of applications of representation learning in NLP. In one type, the semantic representation, such as a word embedding, is trained in a pre-training task (or directly designed by human experts) and transferred to the model for the target task; it is trained with a language modeling objective and taken as input by other downstream NLP models. In the other type, the semantic representation lies within the hidden states of the deep learning model and directly aims for better performance on the target task in an end-to-end fashion. For example, many NLP tasks, such as sentiment classification, natural language inference, and relation extraction, need to semantically compose a sentence or document representation.
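The shared-space idea behind such joint image-query embeddings can be sketched as follows. The projection matrices below are random stand-ins for what would be learned from data, and all dimensions and names are illustrative assumptions, not details from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions: image features and bag-of-words query
# features are projected into one shared semantic space.
d_img, d_txt, d_shared = 64, 300, 32
W_img = rng.normal(size=(d_shared, d_img))   # learned in practice, random here
W_txt = rng.normal(size=(d_shared, d_txt))

def embed(W, x):
    z = W @ x
    return z / np.linalg.norm(z)             # unit-normalize for cosine scores

images = [rng.normal(size=d_img) for _ in range(5)]
query = rng.normal(size=d_txt)

# Rank images by similarity to the query in the shared space.
scores = [float(embed(W_img, x) @ embed(W_txt, query)) for x in images]
best = int(np.argmax(scores))
```

Training would adjust W_img and W_txt so that matching image-query pairs score higher than mismatched ones; the retrieval step itself is just this dot product in the shared space.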
Conventional NLP tasks heavily rely on feature engineering, which requires careful design and considerable expertise. Recently, representation learning, especially deep learning-based representation learning, has been emerging as the most important technique for NLP. First, NLP is typically concerned with multiple levels of language entries, including but not limited to characters, words, phrases, sentences, paragraphs, and documents. Representation learning is able to represent the semantics of these multi-level language entries in a unified semantic space and model complex semantic dependencies among them. Second, various NLP tasks can be conducted on the same input. For example, given a sentence, we
can perform multiple tasks such as word segmentation, named entity recognition, relation extraction, co-reference resolution, and machine translation. In this case, it is more efficient and robust to build a unified representation space of the inputs for multiple tasks. Last, natural language texts may be collected from multiple domains, including but not limited to news articles, scientific articles, literary works, advertisements, and online user-generated content such as product reviews and social media. Moreover, texts can be collected in different languages, such as English, Chinese, Spanish, and Japanese. Compared to conventional NLP systems, which have to design specific feature extraction algorithms for each domain according to its characteristics, representation learning enables us to build representations automatically from large-scale domain data and even to bridge these languages and domains. Given these advantages of representation learning for NLP in reducing feature engineering and improving performance, many researchers have developed efficient representation learning algorithms, especially deep learning-based approaches, for NLP.
Supervised Representation Learning for NLP. Supervised deep neural models for NLP have evolved from distributed representation learning to CNN models and, in recent years, to RNN models. At an early stage, distributed representations were first developed in the context of statistical language modeling by Bengio (2008) in so-called neural net language models, where a distributed representation (i.e., a word embedding) is learned for each word. Following this, the need arose for an effective feature function that extracts higher-level features from constituent words or n-grams. CNNs turned out to be a natural choice, given their excellent performance in computer vision and speech processing tasks. CNNs have the ability to extract salient n-gram features from an input sentence to create an informative latent semantic representation of the sentence for downstream tasks. This domain was pioneered by Collobert et al (2011) and Kalchbrenner et al (2014), which led to a huge proliferation of CNN-based networks in the succeeding literature. The neural net language model was also improved by adding recurrence to the hidden layers (Mikolov et al, 2011a) (i.e., RNNs), allowing it to beat the state of the art (smoothed n-gram models) not only in terms of perplexity (the exponential of the average negative log-likelihood of predicting the right next word) but also in terms of WER in speech recognition. RNNs embody the idea of processing sequential information. The term "recurrent" applies because they perform the same computation over each token of the sequence, and each step depends on the previous computations and results. Generally, a fixed-size vector is produced to represent a sequence by feeding tokens one by one to a recurrent unit. In a way, RNNs have "memory" over previous computations and use this information in current processing. This template is naturally suited for many NLP tasks such as language modeling (Mikolov et al, 2010, 2011b), machine translation (Liu et al, 2014; Sutskever et al, 2014), and image captioning (Karpathy and Fei-Fei, 2015).
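A minimal vanilla RNN encoder, sketched below with toy dimensions and random weights (an illustration of the general template described above, not any specific published model), makes the "same computation per token, fixed-size output" idea concrete:

```python
import numpy as np

rng = np.random.default_rng(1)

vocab_size, emb_dim, hidden_dim = 10, 8, 16
E = rng.normal(scale=0.1, size=(vocab_size, emb_dim))      # word embeddings
W_xh = rng.normal(scale=0.1, size=(hidden_dim, emb_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def encode(token_ids):
    """Feed tokens one by one; the final hidden state is the
    fixed-size representation of the whole sequence."""
    h = np.zeros(hidden_dim)
    for t in token_ids:
        # The same computation is applied at every step; h carries "memory".
        h = np.tanh(W_xh @ E[t] + W_hh @ h + b_h)
    return h

vec = encode([3, 1, 4, 1, 5])   # any-length input -> hidden_dim vector
```

Regardless of the input length, the encoder always returns a vector of size hidden_dim, which downstream classifiers or decoders can consume.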
Unsupervised Representation Learning for NLP. Unsupervised learning (including self-supervised learning) has achieved great success in NLP, since plain text itself contains abundant knowledge and patterns of language. For example, in most deep learning based NLP models, words in sentences are first mapped to their corresponding embedding vectors before further processing.
Low-resource settings have also attracted attention recently. For example, researchers have explored few-shot relation extraction (Han et al, 2018), where each relation has only a few labeled instances, and low-resource machine translation (Zoph et al, 2016), where the size of the parallel corpus is limited.
Beyond popular data like images, texts, and sounds, network data is another important data type that is becoming ubiquitous across a wide range of real-world applications, from cyber-networks (e.g., social networks, citation networks, telecommunication networks) to physical networks (e.g., transportation networks, biological networks). Network data can be formulated mathematically as graphs, where vertices and their relationships jointly characterize the network information. Networks and graphs are a very powerful and flexible data formulation; sometimes we can even consider other data types, like images and texts, as special cases of it. For example, images can be considered grids of nodes with RGB attributes, which are special types of graphs, while texts can be organized into sequential-, tree-, or graph-structured information. So, in general, representation learning for networks is widely considered a promising yet more challenging task that requires advancing and generalizing many of the techniques developed for images, texts, and so forth. In addition to the intrinsic high complexity of network data, the efficiency of representation learning on networks is also an important issue, considering the large scale of many real-world networks, which range from hundreds to millions or even billions of vertices. Analyzing information networks plays a crucial role in a variety of emerging applications across many disciplines. For example, in social networks, classifying users into meaningful social groups is useful for many important tasks, such as user search, targeted advertising, and recommendations; in communication networks, detecting community structures can help better understand the rumor-spreading process; in biological networks, inferring interactions between proteins can facilitate new treatments for diseases. Nevertheless, efficient and effective analysis of these networks heavily relies on good representations of the networks.
Traditional feature engineering on network data usually focuses on obtaining a number of predefined, straightforward features at the graph level (e.g., the diameter, average path length, and clustering coefficient), node level (e.g., node degree and centrality), or subgraph level (e.g., frequent subgraphs and graph motifs). This limited set of hand-crafted, well-defined features, though describing several fundamental aspects of graphs, discards the patterns that it cannot cover. Moreover, real-world network phenomena are usually highly complicated and require sophisticated, unknown combinations of those predefined features, or cannot be characterized by any of the existing features at all. In addition, traditional graph feature engineering usually involves expensive computations with super-linear or exponential complexity, which often makes many network analytic tasks computationally expensive and intractable over large-scale networks. For example, in dealing with
the task of community detection, classical methods involve calculating the spectral
decomposition of a matrix with at least quadratic time complexity with respect to
the number of vertices. This computational overhead makes algorithms hard to scale
to large-scale networks with millions of vertices.
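For illustration, two of the predefined node-level features mentioned above, node degree and the local clustering coefficient, can be computed directly; the toy graph here is an assumed example, not from the text:

```python
# Toy undirected graph represented as an adjacency set per node.
adj = {
    0: {1, 2},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {1, 2, 4},
    4: {3},
}

def degree(v):
    return len(adj[v])

def clustering_coefficient(v):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))
```

Such features are cheap per node, but, as discussed above, they capture only a few predetermined aspects of the graph and miss patterns outside their design.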
More recently, network representation learning (NRL) has attracted considerable research interest. NRL aims to learn latent, low-dimensional representations of network vertices while preserving network topology, vertex content, and other side information. After new vertex representations are learned, network analytic tasks can be easily and efficiently carried out by applying conventional vector-based machine learning algorithms in the new representation space. Earlier work related to network representation learning dates back to the early 2000s, when researchers proposed graph embedding algorithms as part of dimensionality reduction techniques. Given a set of independent and identically distributed (i.i.d.) data points as input, graph embedding algorithms first calculate the similarity between pairwise data points to construct an affinity graph, e.g., the k-nearest neighbor graph, and then embed the affinity graph into a new space of much lower dimensionality. However, these graph embedding algorithms are designed on i.i.d. data mainly for dimensionality reduction and usually have at least quadratic time complexity with respect to the number of vertices.
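This classical pipeline can be sketched as follows, using the unnormalized graph Laplacian for simplicity (Laplacian eigenmaps is often formulated as a generalized eigenproblem instead); the data, the number of neighbors, and the target dimensionality are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 5))          # 20 i.i.d. points in 5 dimensions
k, d = 3, 2                           # neighbors per node, target dimensionality

# Step 1: affinity graph from pairwise distances (symmetrized kNN graph).
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
W = np.zeros((20, 20))
for i in range(20):
    for j in np.argsort(dist[i])[1:k + 1]:   # skip self at position 0
        W[i, j] = W[j, i] = 1.0

# Step 2: embed the affinity graph via the bottom eigenvectors of L = D - W.
L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)
Y = eigvecs[:, 1:d + 1]               # drop the constant eigenvector
```

The quadratic cost mentioned above is visible in both steps: all pairwise distances are computed, and the eigendecomposition operates on an n-by-n matrix.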
Since 2008, significant research efforts have shifted to the development of effective and scalable representation learning techniques directly designed for complex information networks. Many network representation learning algorithms (Perozzi et al, 2014; Yang et al, 2015b; Zhang et al, 2016b; Manessi et al, 2020) have been proposed to embed existing networks, showing promising performance for various applications. These methods embed a network into a latent, low-dimensional space that preserves structure proximity and attribute affinity. The resulting compact, low-dimensional vector representations can then be taken as features by any vector-based machine learning algorithm. This paves the way for a wide range of network analytic tasks to be easily and efficiently tackled in the new vector space, such as node classification (Zhu et al, 2007), link prediction (Lü and Zhou, 2011), clustering (Malliaros and Vazirgiannis, 2013), and network synthesis (You et al, 2018b). The following chapters of this book provide a systematic and comprehensive introduction to network representation learning.
1.3 Summary
Representation learning is currently a very active and important field that heavily influences the effectiveness of machine learning techniques. Representation learning is about learning representations of data that make it easier to extract useful and discriminative information when building classifiers or other predictors. Among the various ways of learning representations, deep learning algorithms are now increasingly employed in many areas, where good representations can be learned efficiently and automatically from large amounts of complex
and high-dimensional data. The evaluation of a representation is closely related to its performance on downstream tasks. Generally, there are also some general properties that good representations may hold, such as smoothness, linearity, and disentanglement, as well as capturing multiple explanatory and causal factors.
We have summarized representation learning techniques in different domains, focusing on the unique challenges and models of different areas, including the processing of images, natural language, and speech signals. In each area, many deep learning-based representation techniques have emerged from different categories, including supervised learning, unsupervised learning, transfer learning, disentangled representation learning, reinforcement learning, etc. We have also briefly discussed representation learning on networks and its relation to representation learning on images, texts, and speech, in preparation for its elaboration in the following chapters.
Chapter 2
Graph Representation Learning
Peng Cui, Lingfei Wu, Jian Pei, Liang Zhao and Xiao Wang
Many complex systems take the form of graphs, such as social networks, biological
networks, and information networks. It is well recognized that graph data is often
sophisticated and thus is challenging to deal with. To process graph data effectively,
the first critical challenge is to find effective graph data representation, that is, how
to represent graphs concisely so that advanced analytic tasks, such as pattern discov-
ery, analysis, and prediction, can be conducted efficiently in both time and space.
Liang Zhao
Department of Computer Science, Emory University, e-mail: [email protected]
Lingfei Wu
JD.COM Silicon Valley Research Center, e-mail: [email protected]
Peng Cui
Department of Computer Science, Tsinghua University, e-mail: [email protected]
Jian Pei
Department of Computer Science, Simon Fraser University, e-mail: [email protected]
Xiao Wang
Department of Computer Science, Beijing University of Posts and Telecommunications, e-mail:
[email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_2
for graph inference. After the representation is obtained, downstream tasks such as node classification, node clustering, graph visualization, and link prediction can be dealt with based on these representations. Overall, there are three main categories of graph representation learning methods: traditional graph embedding, modern graph embedding, and graph neural networks, which are introduced separately in the following three sections.
rectly defined. For example, an edge between two nodes usually just implies that there is a relationship between them, but it cannot indicate the specific proximity. Also, even if there is no edge between two nodes, we cannot say that the proximity between them is zero. The definition of node proximities depends on the specific analytic tasks and application scenarios. Therefore, modern graph embedding usually incorporates rich information, such as network structures, properties, side information, and advanced information, to facilitate different problems and applications. Modern graph embedding needs to target both of the goals mentioned before. In view of this, traditional graph embedding can be regarded as a special case of modern graph embedding, and recent research progress on modern graph embedding pays more attention to network inference.
To well support network inference, modern graph embedding considers much richer information in a graph. According to the types of information preserved in graph representation learning, existing methods can be divided into three categories: (1) graph structure and property preserving graph embedding, (2) graph representation learning with side information, and (3) advanced information preserving graph representation learning. From a technical view, different models are adopted to incorporate different types of information or to address different goals. The commonly used models include matrix factorization, random walks, deep neural networks, and their variations.
Among all the information encoded in a graph, graph structures and properties are two crucial factors that largely affect graph inference. Thus, one basic requirement of graph representation learning is to appropriately preserve graph structures and capture graph properties. Often, graph structures include first-order structures and higher-order structures, such as second-order structures and community structures. Different types of graphs have different properties. For example, directed graphs have the asymmetric transitivity property, and structural balance theory is widely applicable to signed graphs.
Graph structures can be categorized into different groups that are present at different granularities. The graph structures commonly exploited in graph representation learning include the neighborhood structure, high-order node proximity, and graph communities.
How to define the neighborhood structure in a graph is the first challenge. Based on the discovery that the distribution of nodes appearing in short random walks is similar to the distribution of words in natural language, DeepWalk (Perozzi et al, 2014) employs random walks to capture the neighborhood structure. Then, for each walk sequence generated by the random walks, following Skip-Gram, DeepWalk aims to maximize the probability of the neighbors of a node in the walk sequence. Node2vec defines a flexible notion of a node's graph neighborhood and designs a second-order random walk strategy to sample the neighborhood nodes, which can smoothly interpolate between breadth-first sampling (BFS) and depth-first sampling (DFS). Besides the neighborhood structure, LINE (Tang et al, 2015b) is proposed for large-scale network embedding and can preserve the first- and second-order proximities. The first-order proximity is the observed pairwise proximity between two nodes. The second-order proximity is determined by the similarity of the "contexts" (neighbors) of two nodes. Both are important in measuring the relationships between two nodes. Essentially, LINE is a shallow model; consequently, its representation ability is limited. SDNE (Wang et al, 2016) proposes a deep model for network embedding, which also aims at capturing the first- and second-order proximities. SDNE uses a deep auto-encoder architecture with multiple non-linear layers to preserve the second-order proximity. To preserve the first-order proximity, the idea of Laplacian eigenmaps (Belkin and Niyogi, 2002) is adopted. Wang et al (2017g) propose a modularized nonnegative matrix factorization (M-NMF) model for graph representation learning, which aims to preserve both the microscopic structure, i.e., the first-order and second-order proximities of nodes, and the mesoscopic community structure (Girvan and Newman, 2002). They adopt the NMF model (Févotte and Idier, 2011) to preserve the microscopic structure, while the community structure is detected by modularity maximization (Newman, 2006a). Then, they introduce an auxiliary community representation matrix to bridge the representations of nodes with the community structure. In this way, the learned representations of nodes are constrained by both the microscopic structure and the community structure.
In summary, many network embedding methods aim to preserve the local structure of a node, including the neighborhood structure, high-order proximity, and community structure, in the latent low-dimensional space. Both linear and non-linear models have been attempted, demonstrating the large potential of deep models in network embedding.
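As a concrete illustration of the random-walk strategy underlying DeepWalk, the sketch below generates truncated random walks over a toy graph; in the full method, these node sequences would then be fed to a Skip-Gram model to produce the embeddings. The graph, walk length, and number of walks per node are illustrative choices.

```python
import random

random.seed(0)

# Toy undirected graph as adjacency lists (an assumed example).
adj = {
    0: [1, 2],
    1: [0, 2, 3],
    2: [0, 1, 3],
    3: [1, 2, 4],
    4: [3],
}

def random_walk(start, length):
    """One truncated random walk: repeatedly hop to a uniformly
    chosen neighbor, as in DeepWalk's walk generation."""
    walk = [start]
    while len(walk) < length:
        walk.append(random.choice(adj[walk[-1]]))
    return walk

# DeepWalk generates several walks per node; the resulting "sentences"
# of node IDs play the role of text sentences for Skip-Gram training.
corpus = [random_walk(v, 6) for v in adj for _ in range(10)]
```

Nodes that co-occur frequently within these short walks end up with similar embeddings, which is how the neighborhood structure is preserved.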
Transitivity usually exists in graphs, and preserving such a property is not challenging, because in a metric space the distance between different data points naturally satisfies the triangle inequality. However, transitivity does not always hold in the real world. Ou et al (2015) aim to preserve the non-transitivity property via latent similarity components. The non-transitivity property states that, for nodes v1, v2, and v3 in a graph where (v1, v2) and (v2, v3) are similar pairs, (v1, v3) may be a dissimilar pair. For example, in a social network, a student may connect with his classmates and his family, while his classmates and family are probably very different. The main idea is to learn multiple node embeddings and then compare different nodes based on multiple similarities, rather than one similarity. They observe that if two nodes have a large semantic similarity, at least one of the structural similarities is large; otherwise, all of the similarities are small. A directed graph usually has the asymmetric transitivity property. Asymmetric transitivity indicates that, if there is a directed edge from node i to node j and a directed edge from node j to node k, there is likely a directed edge from i to k, but not from k to i. In order to measure this high-order proximity, HOPE (Ou et al, 2016) summarizes four measurements in a general formulation and then factorizes the high-order proximity with a generalized SVD (Paige and Saunders, 1981), such that the time complexity of HOPE is largely reduced, which means HOPE is scalable to large-scale networks. A signed graph has both positive and negative edges, and social theories for signed graphs, such as structural balance theory (Cartwright and Harary, 1956; Cygan et al, 2012), are very different from those for unsigned graphs. Structural balance theory demonstrates that users in a signed social network should be closer to their "friends" than to their "foes". To model the structural balance phenomenon, SiNE (Wang et al, 2017f) utilizes a deep learning model consisting of two deep networks with non-linear functions.
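The HOPE idea can be sketched with the Katz proximity, one of the four measurements it summarizes. For simplicity, the sketch below factorizes the proximity matrix with a plain truncated SVD rather than the generalized SVD used in the paper, and the toy graph and decay parameter are illustrative assumptions.

```python
import numpy as np

# Adjacency matrix of a small directed toy graph (illustrative only).
A = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
], dtype=float)

beta = 0.1                                  # decay; beta must keep I - beta*A invertible
n = A.shape[0]

# Katz high-order proximity: S = (I - beta*A)^(-1) @ (beta*A).
S = np.linalg.inv(np.eye(n) - beta * A) @ (beta * A)

# Factorize S; separate source and target embeddings capture the
# asymmetry of directed proximity.
d = 2
U, sigma, Vt = np.linalg.svd(S)
U_s = U[:, :d] * np.sqrt(sigma[:d])         # source embeddings (one row per node)
U_t = Vt[:d, :].T * np.sqrt(sigma[:d])      # target embeddings (one row per node)

S_hat = U_s @ U_t.T                         # reconstructed proximity
```

Because S_hat[i, j] and S_hat[j, i] come from different embedding pairs, the factorization can represent a large proximity from i to j alongside a small one from j to i, which a single symmetric embedding could not.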
The importance of maintaining network properties in the network embedding space, especially the properties that largely affect the evolution and formation of networks, has been well recognized. The key challenge is how to address the disparity and heterogeneity of the original network space and the embedding vector space at the property level. Generally, most structure and property preserving methods take the high-order proximities of nodes into account, which demonstrates the importance of preserving high-order structures in network embedding. The difference lies in the strategy of obtaining the high-order structures. Some methods implicitly preserve high-order structure by assuming a generative mechanism from a node to its neighbors, while other methods do so by explicitly approximating high-order proximities in the embedding space. As topology structures are the most notable characteristic of networks, structure-preserving network embedding methods make up a large part of the literature. Comparatively, property-preserving network embedding is a relatively new research topic that has only been studied lightly. As network properties usually drive the formation and evolution of networks, it shows great potential for future research and applications.
Different from side information, the advanced information refers to the supervised
or pseudo supervised information in a specific task. The advanced information pre-
serving network embedding usually consists of two parts. One is to preserve the
network structure so as to learn the representations of nodes. The other is to estab-
lish the connection between the representations of nodes and the target task. The
combination of advanced information and network embedding techniques enables
representation learning for networks.
Information Diffusion. Information diffusion (Guille et al, 2013) is a ubiquitous phenomenon on the web, especially in social networks. Bourigault et al (2014) propose a graph representation learning algorithm for predicting information diffusion in social networks. The goal of the proposed algorithm is to learn node representations in the latent space such that the diffusion kernel can best explain the cascades in the training set. The basic idea is to map the observed information diffusion process into a heat diffusion process modeled by a diffusion kernel in the continuous space. The kernel expresses that the closer a node in the latent space is to the source node, the sooner it is infected by information from the source node. The cascade prediction problem here is defined as predicting the increment of cascade size after a given time interval (Li et al, 2017a). Li et al (2017a) argue that previous work on cascade prediction all depends on bags of hand-crafted features to represent cascades and graph structures. Instead, they present an end-to-end deep learning model that solves this problem using the idea of graph embedding. The whole procedure learns the representation of the cascade graph in an end-to-end manner.
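A toy version of the diffusion-kernel intuition can be sketched as follows; the latent positions and the heat-kernel form below are illustrative assumptions in the spirit of the approach, not the exact model of Bourigault et al (2014).

```python
import math

# Assumed 2-D latent positions learned by the embedding (illustrative values).
positions = {
    "source": (0.0, 0.0),
    "u": (1.0, 0.0),
    "v": (0.0, 3.0),
    "w": (2.0, 2.0),
}

def heat_kernel(a, b, t=1.0):
    """Heat-diffusion kernel in the latent space: larger values mean the
    node is reached sooner by content spreading from point a."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (4.0 * t)) / (4.0 * math.pi * t)

src = positions["source"]
# Nodes closer to the source in latent space come first in the predicted order.
order = sorted(
    (n for n in positions if n != "source"),
    key=lambda n: -heat_kernel(src, positions[n]),
)
```

Training adjusts the latent positions so that this predicted infection order matches the observed cascades; prediction then reduces to distance comparisons in the embedding space.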
Anomaly Detection. Anomaly detection has been widely investigated in previous work (Akoglu et al, 2015). Anomaly detection in graphs aims to infer structural inconsistencies, i.e., anomalous nodes that connect to various diverse influential communities (Hu et al, 2016; Burt, 2004). Hu et al (2016) propose a graph embedding based method for anomaly detection. They assume that the community memberships of two linked nodes should be similar; an anomalous node is one connecting to a set of different communities. Since the learned node embeddings capture the correlations between nodes and communities, they propose an embedding-based measure to indicate the anomalousness level of a node. The larger the value of the measure, the higher the propensity of a node being anomalous.
Graph Alignment. The goal of graph alignment is to establish the correspondence between the nodes of two graphs, i.e., to predict the anchor links across the two graphs. The same users shared by different social networks naturally form anchor links, and these links bridge the different graphs. The anchor link prediction problem is, given a source graph, a target graph, and a set of observed anchor links, to identify the hidden anchor links across the two graphs. Man et al (2016) propose a graph representation learning algorithm to solve this problem. The
learned representations can preserve the graph structures and respect the observed
anchor links.
Advanced information preserving graph embedding usually consists of two parts. One is to preserve the graph structures so as to learn the representations of nodes. The other is to establish the connection between the representations of nodes and the target task. The first part is similar to structure and property preserving network embedding, while the second usually needs to consider the domain knowledge of a specific task. The domain knowledge encoded by the advanced information makes it possible to develop end-to-end solutions for network applications. Compared with hand-crafted network features, such as the numerous network centrality measures, the combination of advanced information and network embedding techniques enables representation learning for networks, and many network applications may benefit from this new paradigm.
Over the past decade, deep learning has become the “crown jewel” of artificial intel-
ligence and machine learning, showing superior performance in acoustics, images
and natural language processing, etc. Although it is well known that graphs are ubiq-
uitous in the real world, it is very challenging to utilize deep learning methods to
analyze graph data. This problem is non-trivial because of the following challenges:
(1) Irregular structures of graphs. Unlike images, audio, and text, which have a clear
grid structure, graphs have irregular structures, making it hard to generalize some
of the basic mathematical operations to graphs. For example, defining convolution
and pooling operations, which are the fundamental operations in convolutional neu-
ral networks (CNNs), for graph data is not straightforward. (2) Heterogeneity and
diversity of graphs. A graph itself can be complicated, containing diverse types and
properties. These diverse types, properties, and tasks require different model archi-
tectures to tackle specific problems. (3) Large-scale graphs. In the big-data era, real
graphs can easily have millions or billions of nodes and edges. How to design scal-
able models, preferably models that have a linear time complexity with respect to the
graph size, is a key problem. (4) Incorporating interdisciplinary knowledge. Graphs
are often connected to other disciplines, such as biology, chemistry, and social sci-
ences. This interdisciplinary nature provides both opportunities and challenges: do-
main knowledge can be leveraged to solve specific problems but integrating domain
knowledge can complicate model designs.
Graph neural networks have attracted considerable research attention over the
past several years. The adopted architectures and training strategies vary
greatly, ranging from supervised to unsupervised and from convolutional to re-
cursive, including graph recurrent neural networks (Graph RNNs), graph convo-
lutional networks (GCNs), graph autoencoders (GAEs), graph reinforcement learn-
ing (Graph RL), and graph adversarial methods. Specifically, Graph RNNs
capture recursive and sequential patterns of graphs by modeling states at either the
26 Peng Cui, Lingfei Wu, Jian Pei, Liang Zhao and Xiao Wang
2.5 Summary
Lingfei Wu, Peng Cui, Jian Pei, Liang Zhao and Le Song
Abstract Deep Learning has become one of the most dominant approaches in Ar-
tificial Intelligence research today. Although conventional deep learning techniques
have achieved huge successes on Euclidean data such as images, or sequence data
such as text, there are many applications that are naturally or best represented with
a graph structure. This gap has driven a tide of research on deep learning on graphs,
among which Graph Neural Networks (GNNs) are the most successful in coping
with various learning tasks across a large number of application domains. In this
chapter, we will systematically organize the existing research of GNNs along three
axes: foundations, frontiers, and applications. We will introduce the fundamental
aspects of GNNs ranging from the popular models and their expressive powers, to
the scalability, interpretability and robustness of GNNs. Then, we will discuss vari-
ous frontier research, ranging from graph classification and link prediction, to graph
generation and transformation, graph matching and graph structure learning. Based
on them, we further summarize the basic procedures which make full use of vari-
ous GNNs for a large number of applications. Finally, we provide the organization
of our book and summarize the roadmap of the various research topics of GNNs.
Lingfei Wu
JD.COM Silicon Valley Research Center, e-mail: [email protected]
Peng Cui
Department of Computer Science, Tsinghua University, e-mail: [email protected]
Jian Pei
Department of Computer Science, Simon Fraser University, e-mail: [email protected]
Liang Zhao
Department of Computer Science, Emory University, e-mail: [email protected]
Le Song
Mohamed bin Zayed University of Artificial Intelligence, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 27
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_3
Deep Learning has become one of the most dominant approaches in Artificial In-
telligence research today. Conventional deep learning techniques, such as recurrent
neural networks (Schuster and Paliwal, 1997) and convolutional neural networks
(Krizhevsky et al, 2012) have achieved huge successes on Euclidean data such as
images, or sequence data such as text and signals. However, in a rich variety of scien-
tific fields, many important real-world objects and problems can be naturally or best
expressed with a complex structure, e.g., a graph or manifold structure, such
as social networks, recommendation systems, drug discovery and program analy-
sis. On the one hand, these graph-structured data can encode complicated pairwise
relationships for learning more informative representations; on the other hand, the
structural and semantic information in original data (images or sequential texts)
can be exploited to incorporate domain-specific knowledge for capturing more fine-
grained relationships among the data.
In recent years, deep learning on graphs has experienced a burgeoning inter-
est from the research community (Cui et al, 2018; Wu et al, 2019e; Zhang et al,
2020e). Among them, Graph Neural Networks (GNNs) are the most successful learn-
ing framework in coping with various tasks across a large number of application do-
mains. Newly proposed neural network architectures on graph-structured data (Kipf
and Welling, 2017a; Petar et al, 2018; Hamilton et al, 2017b) have achieved remark-
able performance in some well-known domains such as social networks and bioin-
formatics. They have also infiltrated other fields of scientific research, including
recommendation systems (Wang et al, 2019j), computer vision (Yang et al, 2019g),
natural language processing (Chen et al, 2020o), program analysis (Allamanis et al,
2018b), software mining (LeClair et al, 2020), drug discovery (Ma et al, 2018),
anomaly detection (Markovitz et al, 2020), and urban intelligence (Yu et al, 2018a).
Despite these successes that existing research has achieved, GNNs still face many
challenges when they are used to model highly-structured data that is time-evolving,
multi-relational, and multi-modal. It is also very difficult to model mappings between
graphs and other highly structured data, such as sequences, trees, and graphs. One
challenge with graph-structured data is that it does not show as much spatial locality
and structure as image or text data does. Thus, graph-structured data is not naturally
suitable for highly regularized neural structures such as convolutional and recurrent
neural networks.
More importantly, new application domains for GNNs that emerge from real-
world problems introduce significant challenges for GNNs. Graphs provide a pow-
erful abstraction that can be used to encode arbitrary data types such as multidi-
mensional data. For example, similarity graphs, kernel matrices, and collaborative
filtering matrices can also be viewed as special cases of graph structures. Therefore,
a successful modeling process of graphs is likely to subsume many applications that
are often used in conjunction with specialized and hand-crafted methods.
In this chapter, we will systematically organize the existing research of GNNs
along three axes: foundations of GNNs, frontiers of GNNs, and GNN based applica-
tions. First of all, we will introduce the fundamental aspects of GNNs ranging from
3 Graph Neural Networks 29
popular GNN methods and their expressive powers, to the scalability, interpretabil-
ity, and robustness of GNNs. Next, we will discuss various frontier research which
are built on GNNs, including graph classification, link prediction, graph generation
and transformation, graph matching, graph structure learning, dynamic GNNs, het-
erogeneous GNNs, AutoML of GNNs and self-supervised GNNs. Based on them,
we further summarize the basic procedures which make full use of various GNNs
for a large number of applications. Finally, we provide the organization of our GNN
book and summarize the roadmap of the various research topics of GNNs.
In this section, we summarize the development of graph neural networks along three
important dimensions: (1) Foundations of GNNs; (2) Frontiers of GNNs; (3) GNN-
based applications. We will first discuss the important research areas under the first
two dimensions for GNNs and briefly illustrate the current progress and challenges
for each research sub-domain. Then we will provide a general summarization on
how to exploit the power of GNNs for a rich variety of applications.
Conceptually, we can categorize the fundamental learning tasks of GNNs into five
different directions: i) Graph Neural Networks Methods; ii) Theoretical understand-
ing of Graph Neural Networks; iii) Scalability of Graph Neural Networks; iv) In-
terpretability of Graph Neural Networks; and v) Adversarial robustness of Graph
Neural Networks. We will discuss these fundamental aspects of GNNs one by one
in this subsection.
Graph Neural Network Methods. Graph Neural Networks are specifically de-
signed neural architectures that operate on graph-structured data. The goal of GNNs is
to iteratively update the node representations by aggregating the representations of
node neighbors and their own representation in the previous iteration. There are
a variety of graph neural networks proposed in the literature (Kipf and Welling,
2017a; Petar et al, 2018; Hamilton et al, 2017b; Gilmer et al, 2017; Xu et al, 2019d;
Velickovic et al, 2019; Kipf and Welling, 2016), which can be further categorized
into supervised GNNs and unsupervised GNNs. Once the node representations are
learnt, a fundamental task on graphs is node classification that tries to classify the
nodes into a few predefined classes. Despite the huge successes that various GNNs
have achieved, a severe issue has been observed when training deep graph neural
networks: they tend to yield inferior results due to the over-smoothing problem (Li
et al, 2018b), where all the nodes converge to similar representations. Many recent
works have been pro-
posed with different remedies to overcome this over-smoothing issue.
The robustness of models has been extensively studied in domains like computer
vision and natural language processing, which has also influenced similar research
on the robustness of GNNs. Technically, the standard approach (via adversarial examples) for study-
ing the robustness of GNNs is to construct a small change of the input graph data
and then to observe if it leads to a large change of the prediction results (i.e. node
classification accuracy). There are a growing number of research works toward ei-
ther adversarial attacks (Dai et al, 2018a; Wang and Gong, 2019; Wu et al, 2019b;
Zügner et al, 2018; Zügner et al, 2020) or adversarial training (Xu et al, 2019c; Feng
et al, 2019b; Chen et al, 2020i; Jin and Zhang, 2019). Many recent efforts have been
made to provide both theoretical guarantees and new algorithmic developments in
adversarial training and certified robustness.
A popular graph type in real applications is the heterogeneous graph, consisting of different
types of graph nodes and edges. To fully exploit this information in heterogeneous
graphs, GNNs designed for homogeneous graphs are not directly applicable. As a result, a
new line of research has been devoted to developing various heterogeneous graph
neural networks including message passing based methods (Wang et al, 2019l; Fu
et al, 2020; Hong et al, 2020b), encoder-decoder based methods (Tu et al, 2018;
Zhang et al, 2019b), and adversarial based methods (Wang et al, 2018a; Hu et al,
2018a).
Graph Neural Networks: AutoML and Self-supervised Learning. Automated ma-
chine learning (AutoML) has recently drawn a significant amount of attention in
both research and industrial communities; its goal is to cope with the huge
challenge of the time-consuming manual tuning process, especially for compli-
cated deep learning models. This wave of the research in AutoML also influences
the research efforts in automatically identifying an optimized GNN model architec-
ture and training hyperparameters. Most of the existing research focuses on either
architecture search space (Gao et al, 2020b; Zhou et al, 2019a) or training hyperpa-
rameter search space (You et al, 2020a; Shi et al, 2020). Another important research
direction of GNNs is to address the limitation of most deep learning models,
which require large amounts of annotated data. As a result, self-supervised learning
has been proposed, which aims to design and leverage domain-specific pretext tasks
on unlabeled data to pretrain a GNN model. To study the power of self-
supervised learning in GNNs, quite a few works systematically de-
sign and compare different self-supervised pretext tasks in GNNs (Hu et al, 2020c;
Jin et al, 2020d; You et al, 2020c).
Due to the power of GNNs to model various data with complex structures, GNNs
have been widely applied to many applications and domains, such as modern rec-
ommender systems, computer vision (CV), natural language processing (NLP), pro-
gram analysis, software mining, bioinformatics, anomaly detection, and urban intel-
ligence. Though GNNs are utilized to solve different tasks for different applications,
they all consist of two important steps, namely graph construction and graph repre-
sentation learning. Graph construction aims to first transform or represent the input
data as graph-structured data. Based on the graphs, graph representation learning
utilizes GNNs to learn the node or graph embeddings for the downstream tasks.
In the following, we briefly introduce the techniques of these two steps regarding
different applications.
After obtaining the graph form of the input data, the next step is applying
GNNs for learning the graph representations. Some works directly utilize typical
GNNs, such as GCN (Kipf and Welling, 2017a), GAT (Petar et al, 2018), GGNN
(Li et al, 2016a) and GraphSage (Hamilton et al, 2017b), which can be generalized
to different application tasks. However, some special tasks need an additional de-
sign of the GNN architecture to better handle the specific problem. For example, in the
task of recommender systems, PinSage (Ying et al, 2018a) is proposed, which takes
the top-k counted nodes of a node (by random-walk visit counts) as its receptive field
and utilizes weighted aggregation. PinSage can scale to web-scale recommender
systems with millions of users and items. KGCN (Wang et al, 2019d) aims to en-
hance the item representation by performing aggregations among its corresponding
entity neighborhood in a knowledge graph. KGAT (Wang et al, 2019j) shares a gen-
erally similar idea with KGCN except for incorporating an auxiliary loss for knowl-
edge graph reconstruction. For instance, in the NLP task of KB-alignment, Xu et al
(2019e) formulated it as a graph matching problem, and proposed a graph attention-
based approach. It first matches all entities in two KGs, and then jointly models the
local matching information to derive a graph-level matching vector. The detailed
GNN techniques for each application can be found in the following chapters of this
book.
The high-level organization of the book is demonstrated in Figure 1.3. The book is
organized into four parts to best accommodate a variety of readers. Part I introduces
basic concepts; Part II discusses the most established methods; Part III presents the
most typical frontiers, and Part IV describes advances of methods and applications
that tend to be important and promising for future research. Next, we briefly elabo-
rate on each chapter.
• Part I: Introduction. These chapters provide a general introduction, from rep-
resentation learning for different data types to graph representation learning.
In addition, they introduce the basic ideas and typical variants of graph
neural networks for graph representation learning.
• Part II: Foundations. These chapters describe the foundations of the graph neu-
ral networks by introducing the properties of graph neural networks as well as
several fundamental problems in this line. Specifically, this part introduces the
fundamental problems in graphs: node classification, the expressive power of
graph neural networks, the interpretability and scalability issues of graph neural
networks, and the adversarial robustness of graph neural networks.
• Part III: Frontiers. These chapters present frontier and advanced problems in
the domain of graph neural networks. Specifically, they introduce the tech-
niques of graph classification, link prediction, graph generation, graph trans-
formation, graph matching, and graph structure learning. In addition, they
introduce several variants of GNNs for different types of graphs, such as GNNs
for dynamic graphs and heterogeneous graphs. We also introduce AutoML and
self-supervised learning for GNNs.
• Part IV: Broad and Emerging Applications. These chapters introduce the broad
and emerging applications with GNNs. Specifically, these GNN-based applica-
tions cover modern recommender systems, tasks in computer vision and NLP,
program analysis, software mining, biomedical knowledge graph mining for
drug design, protein function prediction and interaction, anomaly detection, and
urban intelligence.
3.3 Summary
Graph Neural Networks (GNNs) have emerged rapidly to deal with graph-
structured data, which cannot be directly modeled by conventional deep learning
techniques that are designed for Euclidean data such as images and text. A wide
range of applications can be naturally or best represented with graph structure and
have been successfully handled by various graph neural networks.
In this chapter, we have systematically introduced the development and overview
of GNNs, including the introduction of its foundations, frontiers, and applications.
Specifically, we provide the fundamental aspects of GNNs ranging from the existing
typical GNN methods and their expressive powers, to the scalability, interpretability
and robustness of GNNs. These aspects motivate the research on better understand-
ing and utilization of GNNs. Built on GNNs, recent research developments have
seen a surge of interest in coping with graph-related research problems, which
we call the frontiers of GNNs. We have discussed various frontier research built on
GNNs, ranging from graph classification and link prediction, to graph generation,
transformation, matching and graph structure learning. Due to the power of GNNs
to model various data with complex structures, GNNs have been widely applied to
many applications and domains, such as modern recommender systems, computer
vision, natural language processing, program analysis, software mining, bioinfor-
matics, anomaly detection, and urban intelligence. Most of these tasks consist of
two important steps, namely graph construction and graph representation learning.
Thus, we have introduced the techniques of these two steps for
different applications. The introduction part ends here; a summary of the
organization of this book has been provided at the end of this chapter.
Part II
Foundations of Graph Neural Networks
Chapter 4
Graph Neural Networks for Node Classification
Abstract Graph Neural Networks are neural architectures specifically designed for
graph-structured data, which have been receiving increasing attention recently and
have been applied to different domains and applications. In this chapter, we focus on a funda-
mental task on graphs: node classification. We will give a detailed definition of node
classification and also introduce some classical approaches such as label propaga-
tion. Afterwards, we will introduce a few representative architectures of graph neu-
ral networks for node classification. We will further point out the main difficulty—
the over-smoothing problem—of training deep graph neural networks and present
some of the latest advancements in this direction, such as continuous graph neural net-
works.
Graph-structured data (e.g., social networks, the World Wide Web, and protein-
protein interaction networks) are ubiquitous in the real world, covering a variety of
applications. A fundamental task on graphs is node classification, which tries to
classify the nodes into a few predefined categories. For example, in social networks,
we want to predict the political bias of each user; in protein-protein interaction net-
works, we are interested in predicting the functional role of each protein; in the World
Wide Web, we may have to classify web pages into different semantic categories.
To make effective predictions, a critical problem is to learn effective node rep-
resentations, which largely determine the performance of node classification.
Graph neural networks are neural network architectures specifically designed for
learning representations of graph-structured data including learning node represen-
Jian Tang
Mila-Quebec AI Institute, HEC Montreal, e-mail: [email protected]
Renjie Liao
University of Toronto, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 41
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_4
tations of big graphs (e.g., social networks and the World Wide Web) and learning
representations of entire graphs (e.g., molecular graphs). In this chapter, we will
focus on learning node representations for large-scale graphs and will introduce
learning the whole-graph representations in other chapters. A variety of graph neu-
ral networks have been proposed (Kipf and Welling, 2017b; Veličković et al, 2018;
Gilmer et al, 2017; Xhonneux et al, 2020; Liao et al, 2019b; Kipf and Welling,
2016; Veličković et al, 2019). In this chapter, we will comprehensively revisit exist-
ing graph neural networks for node classification including supervised approaches
(Sec. 4.2), unsupervised approaches (Sec. 4.3), and a common problem of graph
neural networks for node classification—over-smoothing (Sec. 4.4).
Problem Definition. Let us first formally define the problem of learning node rep-
resentations for node classification with graph neural networks. Let G = (V, E)
denote a graph, where V is the set of nodes and E is the set of edges. A ∈ R^{N×N}
represents the adjacency matrix, where N is the total number of nodes, and X ∈ R^{N×C}
represents the node attribute matrix, where C is the number of features for each
node. The goal of graph neural networks is to learn effective node representations
(denoted as H ∈ R^{N×F}, where F is the dimension of node representations) by com-
bining the graph structure information and the node attributes, which are further
used for node classification.
Concept                                   Notation
Graph                                     G = (V, E)
Adjacency matrix                          A ∈ R^{N×N}
Node attributes                           X ∈ R^{N×C}
Total number of GNN layers                K
Node representations at the k-th layer    H^k ∈ R^{N×F}, k ∈ {1, 2, ..., K}
The essential idea of graph neural networks is to iteratively update the node repre-
sentations by combining the representations of their neighbors and their own repre-
sentations. In this section, we introduce a general framework of graph neural net-
works in (Xu et al, 2019d). Starting from the initial node representation H^0 = X, in
each layer we have two important functions:
• AGGREGATE, which tries to aggregate the information from the neighbors of
each node;
• COMBINE, which tries to update the node representations by combining the
aggregated information from neighbors with the current node representations.
Mathematically, we can define the general framework of graph neural networks
as follows:
Initialization: H^0 = X.
For k = 1, 2, ..., K:

$$a_v^k = \mathrm{AGGREGATE}^k\{H_u^{k-1} : u \in N(v)\},$$
$$H_v^k = \mathrm{COMBINE}^k(H_v^{k-1}, a_v^k),$$
where N(v) is the set of neighbors for the v-th node. The node representations H K
in the last layer can be treated as the final node representations.
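As a concrete illustration, the following sketch instantiates the loop above with AGGREGATE as a neighbor mean and COMBINE as a transformed sum followed by ReLU; both are illustrative assumptions, since the general framework leaves the two functions abstract.

```python
import numpy as np

def gnn_forward(A, X, weights, K):
    """Sketch of the AGGREGATE/COMBINE loop. AGGREGATE is a neighbor mean
    and COMBINE adds the node's own representation before a linear map and
    ReLU -- illustrative choices, not the framework's prescription."""
    H = X  # initialization: H^0 = X
    for k in range(K):
        H_new = np.zeros((A.shape[0], weights[k].shape[1]))
        for v in range(A.shape[0]):
            neighbors = np.nonzero(A[v])[0]
            # AGGREGATE: mean of the neighbors' representations from layer k-1
            a_v = H[neighbors].mean(axis=0) if len(neighbors) else np.zeros(H.shape[1])
            # COMBINE: merge own representation with the aggregate, then transform
            H_new[v] = np.maximum(0.0, (H[v] + a_v) @ weights[k])
        H = H_new
    return H  # H^K, the final node representations
```

Real architectures replace the per-node loop with sparse matrix operations; the loop here only mirrors the per-node definition of the framework.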
Once we have the node representations, they can be used for downstream tasks.
Take node classification as an example: the label of node v (denoted as ŷ_v) can
be predicted through a Softmax function, i.e.,

$$\hat{y}_v = \mathrm{Softmax}(W H_v^K),$$

where W is a trainable classification weight matrix.
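A minimal sketch of this readout step (the plain linear map and the weight matrix `W_out` are assumptions for illustration):

```python
import numpy as np

def predict_labels(H_K, W_out):
    """Class probabilities from the final representations H^K via a linear
    map followed by a row-wise, numerically stable Softmax."""
    logits = H_K @ W_out
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = z / z.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs
```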
We will start from the graph convolutional networks (GCN) (Kipf and Welling,
2017b), which is now the most popular graph neural network architecture due to its
simplicity and effectiveness in a variety of tasks and applications. Specifically, the
node representations in each layer are updated according to the following propagation
rule:
$$H^{k+1} = \sigma(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^k W^k). \qquad (4.5)$$
à = A + I is the adjacency matrix of the given undirected graph G with self-
connections, which allows to incorporate the node features itself when updating the
node representations. I ∈ RN×N is the identity matrix. D̃ is a diagonal matrix with
D̃_ii = ∑_j Ã_ij. σ(·) is an activation function such as ReLU or Tanh. The ReLU
activation function is widely used and is defined as ReLU(x) = max(0, x). W^k ∈ R^{F×F'}
(F and F' are the dimensions of node representations in the k-th and (k+1)-th layers,
respectively) is a layer-wise linear transformation matrix, which will be trained during
the optimization.
We can further dissect Eqn. 4.5 and understand the AGGREGATE
and COMBINE functions defined in GCN. For a node i, the node updating equation
can be reformulated as below:
$$H_i^k = \sigma\Big(\sum_{j \in N(i)\cup\{i\}} \frac{\tilde{A}_{ij}}{\sqrt{\tilde{D}_{ii}\tilde{D}_{jj}}}\, H_j^{k-1} W^k\Big) \qquad (4.6)$$

$$H_i^k = \sigma\Big(\sum_{j \in N(i)} \frac{A_{ij}}{\sqrt{\tilde{D}_{ii}\tilde{D}_{jj}}}\, H_j^{k-1} W^k + \frac{1}{\tilde{D}_{ii}}\, H_i^{k-1} W^k\Big) \qquad (4.7)$$
In Eqn. 4.7, we can see that the AGGREGATE function is de-
fined as the weighted average of the neighbor node representations. The weight of
the neighbor j is determined by the weight of the edge between i and j (i.e. Ai j nor-
malized by the degrees of the two nodes). The COMBINE function is defined as the
summation of the aggregated messages and the node representation itself, in which
the node representation is normalized by its own degree.
$$g_\theta \star x = U g_\theta U^T x. \qquad (4.8)$$
U represents the matrix of the eigenvectors of the normalized graph Laplacian matrix
$L = I_N - D^{-1/2} A D^{-1/2}$. $L = U\Lambda U^T$, where $\Lambda$ is a diagonal matrix of eigenvalues, and
$U^T x$ is the graph Fourier transform of the input signal x. In practice, $g_\theta$ can be
understood as a function of the eigenvalues of the normalized graph Laplacian matrix
L (i.e., $g_\theta(\Lambda)$). However, directly calculating Eqn. 4.8 is very computationally
expensive: its cost is quadratic in the number of nodes N. According to
(Hammond et al, 2011), this problem can be circumvented by approximating the
function $g_\theta(\Lambda)$ with a truncated expansion of Chebyshev polynomials $T_k(x)$ up to
the K-th order:
$$g_{\theta'}(\Lambda) = \sum_{k=0}^{K} \theta'_k T_k(\tilde{\Lambda}), \qquad (4.9)$$

where $\tilde{\Lambda} = \frac{2}{\lambda_{max}} \Lambda - I$, and $\lambda_{max}$ is the largest eigenvalue of L. $\theta' \in \mathbb{R}^K$ is the vector
of Chebyshev coefficients. $T_k(x)$ are Chebyshev polynomials, recursively
defined as $T_k(x) = 2xT_{k-1}(x) - T_{k-2}(x)$, with $T_0(x) = 1$ and $T_1(x) = x$. By combining
Eqn. 4.9 and Eqn. 4.8, the convolution of a signal x with a filter $g_{\theta'}$
can be reformulated as below:
$$g_{\theta'} \star x = \sum_{k=0}^{K} \theta'_k T_k(\tilde{L})\, x, \qquad (4.10)$$

where $\tilde{L} = \frac{2}{\lambda_{max}} L - I$. From this equation, we can see that each node only depends
on the information within its K-th-order neighborhood. The overall complexity of
evaluating Eqn. 4.10 is O(|E|) (i.e., linear in the number of edges of the
original graph G), which is very efficient.
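The recursion behind Eqn. 4.10 can be sketched directly: each term requires only one multiplication by L̃, which is why a sparse L gives O(K|E|) overall cost. The dense NumPy version below is for clarity; in practice a sparse matrix would be used.

```python
import numpy as np

def chebyshev_filter(L, x, theta, lmax):
    """Apply g_{theta'} * x = sum_k theta'_k T_k(L_tilde) x (Eqn. 4.10) using
    the recursion T_k(y) = 2 y T_{k-1}(y) - T_{k-2}(y), T_0 = 1, T_1 = y."""
    L_tilde = (2.0 / lmax) * L - np.eye(L.shape[0])
    T_prev, T_curr = x, L_tilde @ x          # T_0(L~) x and T_1(L~) x
    out = theta[0] * T_prev
    if len(theta) > 1:
        out = out + theta[1] * T_curr
    for k in range(2, len(theta)):
        T_next = 2.0 * (L_tilde @ T_curr) - T_prev
        out = out + theta[k] * T_next
        T_prev, T_curr = T_curr, T_next
    return out
```

With `theta = [1.0]` the filter reduces to the identity, and with `theta = [0.0, 1.0]` it applies L̃ once, matching the first two Chebyshev terms.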
To define a neural network based on graph convolutions, one can stack multiple
convolution layers defined according to Eqn. 4.10, each layer followed
by a nonlinear transformation. At each layer, instead of being limited to the explicit
parametrization by the Chebyshev polynomials defined in Eqn. 4.10, the
authors of GCNs proposed to limit the number of convolutions to K = 1 at each
layer. By doing this, at each layer, it only defines a linear function over the graph
Laplacian matrix L. However, by stacking multiple such layers, we are still capable
of covering a rich class of convolution filter functions on graphs. Intuitively, such a
model is capable of alleviating the problem of overfitting local neighborhood struc-
tures for graphs whose node degree distribution has a high variance such as social
networks, the World Wide Web, and citation networks.
At each layer, we can further approximate λ_max ≈ 2, which can be accommo-
dated by the neural network parameters during training. Based on all these simplifi-
cations, we have

$$g_{\theta'} \star x \approx \theta'_0 x + \theta'_1 (L - I_N)\, x = \theta'_0 x - \theta'_1 D^{-1/2} A D^{-1/2} x, \qquad (4.11)$$

where $\theta'_0$ and $\theta'_1$ are two free parameters, which can be shared over the entire
graph. In practice, we can further reduce the number of parameters, which helps
reduce overfitting and meanwhile minimize the number of operations per layer. As
a result, the following expression can be further obtained:
$$g_\theta \star x \approx \theta (I_N + D^{-1/2} A D^{-1/2})\, x, \qquad (4.12)$$

where $\theta = \theta'_0 = -\theta'_1$. One potential issue is that the matrix $I_N + D^{-1/2} A D^{-1/2}$ has
eigenvalues in the interval [0, 2]. In a deep graph convolutional neural network,
repeated application of the above function will likely lead to exploding or vanish-
ing gradients, yielding numerical instabilities. As a result, we can further renormal-
ize this matrix by converting $I_N + D^{-1/2} A D^{-1/2}$ to $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, where $\tilde{A} = A + I$ and
$\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.
In the above, we only considered the case of a single feature channel
and a single filter. This can be easily generalized to an input signal with C channels
$X \in \mathbb{R}^{N \times C}$ and F filters (or number of hidden units) as follows:

$$H = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X W, \qquad (4.13)$$

where $W \in \mathbb{R}^{C \times F}$ is a matrix of filter parameters and H is the convolved signal matrix.
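Putting Eqn. 4.13 together with the renormalization trick gives a compact layer; a dense NumPy sketch follows, with ReLU assumed as the activation σ:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: sigma(D~^{-1/2} A~ D~^{-1/2} H W), where A~ = A + I
    is the renormalized adjacency matrix and sigma is ReLU."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    # D~^{-1/2} A~ D~^{-1/2} via symmetric row/column scaling
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0.0, A_hat @ H @ W)
```

A production implementation would precompute the sparse normalized adjacency once and reuse it across layers; the scaling trick above avoids building the diagonal degree matrices explicitly.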
Graph Attention Layer. The graph attention layer defines how to transfer the hid-
den node representations at layer k − 1 (denoted as $H^{k-1} \in \mathbb{R}^{N \times F}$) to the new node
representations $H^k \in \mathbb{R}^{N \times F'}$. In order to guarantee sufficient expressive power to
transform the lower-level node representations into higher-level node representations,
a shared linear transformation is applied to every node, denoted as $W \in \mathbb{R}^{F' \times F}$. Af-
terwards, self-attention is defined on the nodes, which measures the attention coeffi-
cients for any pair of nodes through a shared attentional mechanism $a : \mathbb{R}^{F'} \times \mathbb{R}^{F'} \to \mathbb{R}$:

$$e_{ij} = a(W H_i^{k-1}, W H_j^{k-1}). \qquad (4.14)$$

$e_{ij}$ indicates the strength of the relationship between nodes i and j. Note that in this subsec-
tion we use $H_i^{k-1}$ to represent a column-wise vector instead of a row-wise vector.
For each node, we can theoretically allow it to attend to every other node on the
graph, which however will ignore the graph structural information. A more reason-
able solution would be to attend only to the neighbors of each node. In practice,
only the first-order neighbors are used (including the node itself). To make the
coefficients comparable across different nodes, the attention coefficients are usually
normalized with the softmax function:
$$\alpha_{ij} = \mathrm{Softmax}_j(\{e_{ij}\}) = \frac{\exp(e_{ij})}{\sum_{l \in N(i)} \exp(e_{il})}. \qquad (4.15)$$
We can see that for a node i, $\alpha_{ij}$ essentially defines a multinomial distribution over
the neighbors, which can also be interpreted as the transition probability from node
i to each of its neighbors.
In the work by Veličković et al (2018), the attention mechanism a is defined as
a single-layer feedforward neural network, consisting of a linear transformation with
a weight vector $W_2 \in \mathbb{R}^{1 \times 2F'}$ and a LeakyReLU nonlinear activation function
(with negative-input slope α = 0.2). More specifically, we can calculate the attention
coefficients with the following architecture:

$$\alpha_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}\big(W_2\,[W H_i^{k-1} \,\|\, W H_j^{k-1}]\big)\big)}{\sum_{l \in N(i)} \exp\big(\mathrm{LeakyReLU}\big(W_2\,[W H_i^{k-1} \,\|\, W H_l^{k-1}]\big)\big)}, \qquad (4.16)$$

where $\|$ denotes concatenation. The new node representation is then a weighted
combination of the transformed neighbor representations:

$$H_i^k = \sigma\Big(\sum_{j \in N(i)} \alpha_{ij} W H_j^{k-1}\Big). \qquad (4.17)$$
Multi-head Attention.
In practice, instead of only using one single attention mechanism, multi-head at-
tention can be used, each of which determines a different similarity function over
the nodes. For each attention head, we can independently obtain a new node rep-
resentation according to Eqn. 4.17. The final node representation will be
a concatenation of the node representations learned by different attention heads.
Mathematically, we have
T !
k
t t k−1
Hi =
σ ∑ αi jW H j , (4.18)
t=1 j∈N(i)
48 Jian Tang and Renjie Liao
where $T$ is the total number of attention heads, $\alpha_{ij}^{t}$ is the attention coefficient calculated by the $t$-th attention head, and $W^{t}$ is the linear transformation matrix of the $t$-th attention head.
One thing mentioned in the paper by Veličković et al (2018) is that in the final layer, when combining the node representations from different attention heads, other pooling techniques can be used instead of concatenation, e.g., simply averaging the node representations from the different attention heads:
$$H_i^k = \sigma\Big(\frac{1}{T}\sum_{t=1}^{T} \sum_{j \in N(i)} \alpha_{ij}^{t}\, W^{t} H_j^{k-1}\Big). \qquad (4.19)$$
Another very popular graph neural network architecture is the Message Passing Neural Network (MPNN) (Gilmer et al, 2017), which was originally proposed for learning molecular graph representations. However, MPNN is actually very general: it provides a general framework of graph neural networks and can be used for the task of node classification as well. The essential idea of MPNN is to formulate existing graph neural networks as instances of a general framework of neural message passing among nodes. In MPNNs, there are two important functions, a message function $M_k$ and an update function $U_k$:
$$m_i^{k+1} = \sum_{j \in N(i)} M_k\big(H_i^k, H_j^k, e_{ij}\big), \qquad (4.20)$$
$$H_i^{k+1} = U_k\big(H_i^k, m_i^{k+1}\big), \qquad (4.21)$$
where $e_{ij}$ denotes the (optional) features of the edge between nodes $i$ and $j$.
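As a sketch of one round of message passing under this framework, the snippet below instantiates the message function as a linear map and the update function as a sum followed by $\tanh$. Both choices, and the toy graph, are arbitrary assumptions, since the framework leaves $M_k$ and $U_k$ free.

```python
import numpy as np

rng = np.random.default_rng(1)
N, F = 5, 4
A = (rng.random((N, N)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T                                        # undirected toy graph, no self-loops
H = rng.normal(size=(N, F))                        # node states H^k
W_msg = rng.normal(size=(F, F))
W_upd = rng.normal(size=(F, F))

def message(h_j):                                  # M_k: here simply a linear map
    return h_j @ W_msg

def update(h_i, m_i):                              # U_k: here sum followed by tanh
    return np.tanh(h_i @ W_upd + m_i)

# one round: aggregate messages from neighbors, then update every node
M = np.stack([sum((message(H[j]) for j in np.nonzero(A[i])[0]), np.zeros(F))
              for i in range(N)])
H_next = np.stack([update(H[i], M[i]) for i in range(N)])
```

Stacking several such rounds, with learned `W_msg`/`W_upd` per layer, recovers the iterative-update view used throughout this chapter.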
The above graph neural networks iteratively update the node representations with different kinds of graph convolutional layers. Essentially, these approaches model the discrete dynamics of node representations. The Continuous Graph Neural Network (CGNN) (Xhonneux et al, 2020) instead starts from a simple discrete propagation scheme,
4 Graph Neural Networks for Node Classification 49
$$H^{k+1} = A H^{k} + H^{0}, \qquad (4.23)$$
where $H^{0} = X$ or the output of an encoder on the input features $X$. Intuitively, at each step, the new node representation is a linear combination of its neighboring node representations as well as the initial node features. Such a mechanism allows modeling the information propagation on the graph without forgetting the initial node features. We can unroll Eq. (4.23) and explicitly derive the node representations at the $k$-th step:
$$H^{k} = \Big(\sum_{i=0}^{k} A^{i}\Big) H^{0} = (A - I)^{-1}\big(A^{k+1} - I\big) H^{0}. \qquad (4.24)$$
As the above equation effectively models the discrete dynamics of node representations, the CGNN model further extends it to the continuous setting, characterized by the following ordinary differential equation (ODE):
$$\frac{dH^{t}}{dt} = \log A \; H^{t} + X, \qquad (4.25)$$
with the initial value $H^{0} = (\log A)^{-1}(A - I)X$, where $X$ is the initial node features or the output of an encoder applied to them. We do not provide the proof here; more details can be found in the original paper (Xhonneux et al, 2020). In Eq. (4.25), since $\log A$ is intractable to compute in practice, it is approximated with the first-order Taylor expansion, i.e., $\log A \approx A - I$. Putting all of this together, we have the following ODE:
$$\frac{dH^{t}}{dt} = (A - I) H^{t} + X, \qquad (4.26)$$
with the initial value $H^{0} = X$, which is the first variant of the CGNN model.
The CGNN model is actually very intuitive and has a nice connection with traditional epidemic models, which study the dynamics of infection in a population. An epidemic model usually assumes that the infection of people is affected by three factors: infection from neighbors, natural recovery, and the natural characteristics of people. If we treat $H^{t}$ as the number of people infected at time $t$, then these three factors are naturally modeled by the three terms in Eq. (4.26): $AH^{t}$ for the infection from neighbors, $-H^{t}$ for the natural recovery, and $X$ for the natural characteristics of people.
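To see these continuous dynamics in action, Eq. (4.26) can be simulated with a simple forward-Euler integrator. This is only a numerical sketch under assumed toy inputs; CGNN itself relies on proper ODE solvers.

```python
import numpy as np

N = 4
A = np.full((N, N), 1.0 / N)          # a toy stand-in for the regularized adjacency
X = np.eye(N)[:, :2]                  # toy initial node features (N x F)

def cgnn_euler(A, X, t_end=1.0, dt=0.01):
    """Forward-Euler integration of dH/dt = (A - I) H + X with H(0) = X."""
    H = X.copy()
    I = np.eye(A.shape[0])
    for _ in range(int(round(t_end / dt))):
        H = H + dt * ((A - I) @ H + X)
    return H

H_T = cgnn_euler(A, X)
```

As a sanity check, with $A = I$ the ODE reduces to $dH/dt = X$, whose exact solution $H(t) = (1 + t)X$ is recovered by the Euler steps, since the derivative is constant.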
Model 2: Modeling the Interaction of Feature Channels. The above model assumes that different node feature channels are independent of each other, which is a very strong assumption that limits the capacity of the model. Inspired by the success of a linear variant of graph neural networks (i.e., Simple GCN (Wu et al, 2019a)), a more powerful discrete node dynamics model is proposed, which allows different feature channels to interact with each other:
$$H^{k+1} = A H^{k} W + H^{0}, \qquad (4.27)$$
where $W \in \mathbb{R}^{F \times F}$ is a weight matrix used to model the interactions between different feature channels. Similarly, we can extend the above discrete dynamics to the continuous case, yielding the following equation:
$$\frac{dH^{t}}{dt} = (A - I) H^{t} + H^{t} (W - I) + X, \qquad (4.28)$$
with the initial value $H^{0} = X$. This is the second variant of CGNN, with trainable weights. ODEs of the form of Eq. (4.28) have been studied in the control theory literature and are known as Sylvester differential equations (Locatelli and Sieniutycz, 2002). The two matrices $A - I$ and $W - I$ characterize
the natural solution of the system while X is the information provided to the system
to drive the system into the desired state.
Discussion. The proposed continuous graph neural network (CGNN) has multiple nice properties: (1) Recent work has shown that if we increase the number of layers $K$ in discrete graph neural networks, the learned node representations tend to suffer from over-smoothing (introduced in detail later) and hence lose expressive power. In contrast, continuous graph neural networks are able to train very deep graph neural networks and are experimentally robust to an arbitrarily chosen integration time. (2) For some tasks on graphs, it is critical to model the long-range dependency between nodes, which requires training deep GNNs. Existing discrete GNNs fail to train very deep GNNs due to the over-smoothing problem. CGNNs are able to effectively model the long-range dependency between nodes thanks to their stability w.r.t. time. (3) The hyperparameter $\alpha$ is very important, as it controls the rate of diffusion; specifically, it controls the rate at which high-order powers of the regularized matrix $A$ vanish. In the work of Xhonneux et al (2020), the authors proposed to learn a different value of $\alpha$ for each node, which allows choosing the best diffusion rate for each node.
Recall the one-layer graph convolution operator used in GCNs (Kipf and Welling, 2017b): $H = LHW$, where $L = D^{-\frac{1}{2}} \tilde{A} D^{-\frac{1}{2}}$. Here we drop the superscript of the layer index to avoid a clash with the notation for matrix powers. There are two main
issues with this simple graph convolution formulation. First, one such graph convolutional layer only propagates information from any node to its nearest neighbors, i.e., neighboring nodes that are one hop away. If one would like to propagate information to neighbors $M$ hops away, one has to either stack $M$ graph convolutional layers or compute the graph convolution with the $M$-th power of the graph Laplacian, i.e., $H = \sigma(L^{M} HW)$. When $M$ is large, the stacking solution makes the whole GCN model very deep, thus causing learning problems like vanishing gradients, similar to what people experience in training very deep feedforward neural networks. For the matrix-power solution, naively computing the $M$-th power of the graph Laplacian is also very costly (e.g., the time complexity is $O(N^{3}(M-1))$ for graphs with $N$ nodes). Second, there are no learnable parameters in GCNs associated with the graph Laplacian $L$ (corresponding to the connectivities/structures). The only learnable parameter $W$ is a linear transform applied to every node simultaneously, which is not aware of the structure. Note that we typically associate learnable weights with edges when applying convolution to regular graphs like grids (e.g., applying 2D convolution to images), which greatly improves the expressiveness of the model. However, it is not clear how one can add learnable parameters to the graph Laplacian $L$, since its size varies from graph to graph.
Fig. 4.1: The inference procedure of Lanczos Networks. The approximated top
eigenvalues {rk } and eigenvectors {vk } are computed by the Lanczos algorithm.
Note that this step is only needed once per graph. The long range/scale (top blocks)
graph convolutions are efficiently computed by the low-rank approximation of the
graph Laplacian. One can control the ranges (i.e., the exponent of eigenvalues)
as hyperparameters. Learnable spectral filters are applied to the approximated top
eigenvalues {rk }. The short range/scale (bottom blocks) graph convolution is the
same as GCNs. Adapted from Figure 1 of (Liao et al, 2019b).
To address these issues, the Lanczos Network (Liao et al, 2019b) uses the $M$-step Lanczos algorithm (Lanczos, 1950) (listed in Alg. 1) to compute an orthogonal matrix $Q$ and a symmetric tridiagonal matrix $T$ such that $Q^{\top} L Q = T$. We denote $Q = [q_1, \cdots, q_M]$, where the column vector $q_i$ is the $i$-th Lanczos vector. Note that $M$ could be much smaller than the number of nodes $N$. $T$ is illustrated below:
$$T = \begin{bmatrix} \gamma_1 & \beta_1 & & \\ \beta_1 & \ddots & \ddots & \\ & \ddots & \ddots & \beta_{M-1} \\ & & \beta_{M-1} & \gamma_M \end{bmatrix}. \qquad (4.29)$$
After obtaining the tridiagonal matrix T , we can compute the Ritz values and Ritz
vectors which approximate the top eigenvalues and eigenvectors of L by diagonal-
izing the matrix T as T = BRB⊤ , where the K × K diagonal matrix R contains the
Ritz values and $B \in \mathbb{R}^{K \times K}$ is an orthogonal matrix. Here, "top" means ranking the eigenvalues by their magnitudes in descending order. This can be implemented
via the general eigendecomposition or some fast decomposition methods special-
ized for tridiagonal matrices. Now we have a low rank approximation of the graph
Laplacian matrix L ≈ V RV ⊤ , where V = QB. Denoting the column vectors of V as
$\{v_1, \cdots, v_M\}$, we can compute the multi-scale graph convolution as
$$H = \hat{L} H W, \qquad \hat{L} = \sum_{m=1}^{M} f_{\theta}\big(r_m^{I_1}, r_m^{I_2}, \cdots, r_m^{I_u}\big)\, v_m v_m^{\top}, \qquad (4.30)$$
where {I1 , · · · , Iu } is the set of scale/range parameters which determine how many
hops (or how far) one would like to propagate the information over the graph. For
example, one could easily set {I1 = 50, I2 = 100} (u = 2 in this case) to consider the
situations of propagating 50 and 100 steps respectively. Note that one only needs to
compute the scalar power rather than the original matrix power. The overall com-
plexity of the Lanczos algorithm in our context is O(MN 2 ) which makes the whole
algorithm much more efficient than naively computing the matrix power. Moreover,
fθ is a learnable spectral filter parameterized by θ and can be applied to graphs with
varying sizes since we decouple the graph size and the input size of fθ . fθ directly
acts on the graph Laplacian and greatly improves the expressiveness of the model.
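The tridiagonalization step can be sketched as below. This is a textbook Lanczos iteration with full reorthogonalization for numerical stability, not necessarily identical to Alg. 1 in the paper, and it is run on a small random symmetric matrix standing in for the graph Laplacian.

```python
import numpy as np

def lanczos(L, M, seed=0):
    """M-step Lanczos on a symmetric matrix L.

    Returns Q (N x M, orthonormal columns) and tridiagonal T (M x M)
    such that Q^T L Q = T."""
    N = L.shape[0]
    rng = np.random.default_rng(seed)
    Q = np.zeros((N, M))
    gamma = np.zeros(M)
    beta = np.zeros(M - 1)
    q = rng.normal(size=N)
    q /= np.linalg.norm(q)
    for m in range(M):
        Q[:, m] = q
        z = L @ q
        gamma[m] = q @ z
        # full reorthogonalization against all previous Lanczos vectors
        z -= Q[:, :m + 1] @ (Q[:, :m + 1].T @ z)
        if m < M - 1:
            beta[m] = np.linalg.norm(z)
            q = z / beta[m]
    T = np.diag(gamma) + np.diag(beta, 1) + np.diag(beta, -1)
    return Q, T

rng = np.random.default_rng(1)
S = rng.normal(size=(6, 6))
L = (S + S.T) / 2                     # small symmetric stand-in for the Laplacian
Q, T = lanczos(L, M=4)
```

Diagonalizing the small matrix `T` then yields the Ritz values and, via $V = QB$, the Ritz vectors used in Eq. (4.30).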
Although the Lanczos algorithm provides an efficient way to approximately compute arbitrary powers of the graph Laplacian, it is still a low-rank approximation, which may lose certain information (e.g., the high-frequency components). To alleviate this problem, one can additionally perform vanilla graph convolution with small scale parameters, i.e., $H = L^{S} HW$, where $S$ is a small integer like 2 or 3. The resultant representation can be concatenated with the one obtained from the longer scale/range graph convolution in Eq. (4.30). Relying on the above design, one can add nonlinearities and stack multiple such layers to build a deep graph convolutional network (namely, the Lanczos Network), just like GCNs. The overall inference procedure of Lanczos Networks is shown in Fig. 4.1. This method demonstrates strong empirical
4.3.1.2 Model
Similar to VAEs, the VGAE model consists of an encoder qφ (Z|A, X), a decoder
pθ (A|Z), and a prior p(Z).
Encoder The goal of the encoder is to learn a distribution over the latent variables associated with each node, conditioned on the node features $X$ and the adjacency matrix $A$. We can instantiate $q_{\phi}(Z|A, X)$ as a graph neural network with learnable parameters $\phi$. In particular, VGAE assumes a node-independent encoder as below,
$$q_{\phi}(Z|X, A) = \prod_{i=1}^{N} q_{\phi}(z_i|X, A), \qquad (4.31)$$
$$q_{\phi}(z_i|X, A) = \mathcal{N}\big(z_i \,|\, \mu_i, \mathrm{diag}(\sigma_i^2)\big), \qquad (4.32)$$
$$\mu, \sigma = \mathrm{GCN}_{\phi}(X, A), \qquad (4.33)$$
where $z_i$, $\mu_i$, and $\sigma_i$ are the $i$-th rows of the matrices $Z$, $\mu$, and $\sigma$, respectively. Basically, we assume a multivariate Normal distribution with diagonal covariance as the variational approximate distribution of the latent vector per node (i.e., $z_i$). The mean and diagonal covariance are predicted by the encoder network, i.e., a GCN as described in Section 4.2.2. For example, the original paper uses a two-layer GCN as follows,
$$\mu = \tilde{A} H W_{\mu}, \qquad (4.34)$$
$$\sigma = \tilde{A} H W_{\sigma}, \qquad (4.35)$$
$$H = \mathrm{ReLU}(\tilde{A} X W_{0}), \qquad (4.36)$$
where $\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is the symmetrically normalized adjacency matrix and $D$ is the degree matrix. The learnable parameters are thus $\phi = [W_{\mu}, W_{\sigma}, W_{0}]$.
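The encoder together with the standard reparameterization trick can be sketched as follows. The toy graph, the weight scales, the added self-loops in the normalization, and the choice of predicting log-standard-deviations through the $\sigma$ head are illustrative assumptions, not necessarily the exact parameterization of the original paper.

```python
import numpy as np

rng = np.random.default_rng(2)
N, F, F_h, F_z = 4, 3, 8, 2
X = rng.normal(size=(N, F))
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# symmetric normalization, with self-loops added (a common GCN convention)
A_hat = A + np.eye(N)
d = A_hat.sum(axis=1)
A_norm = A_hat / np.sqrt(np.outer(d, d))

W0 = rng.normal(size=(F, F_h)) * 0.1
W_mu = rng.normal(size=(F_h, F_z)) * 0.1
W_sig = rng.normal(size=(F_h, F_z)) * 0.1

H = np.maximum(A_norm @ X @ W0, 0.0)       # shared first GCN layer with ReLU
mu = A_norm @ H @ W_mu                     # mean head
log_sigma = A_norm @ H @ W_sig             # log-std head (assumption)
# reparameterization: z_i = mu_i + sigma_i * eps_i, eps ~ N(0, I)
eps = rng.normal(size=(N, F_z))
Z = mu + np.exp(log_sigma) * eps
```

Sampling `Z` this way keeps the stochastic node-wise encoder differentiable w.r.t. the parameters, which is what makes ELBO training by gradient ascent possible.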
Decoder Given sampled latent variables, the decoder aims at predicting the con-
nectivities among nodes. The original paper adopts a simple dot-product based pre-
dictor as below,
$$p(A|Z) = \prod_{i=1}^{N} \prod_{j=1}^{N} p(A_{ij}|z_i, z_j), \qquad (4.37)$$
$$p(A_{ij} = 1 | z_i, z_j) = \sigma(z_i^{\top} z_j), \qquad (4.38)$$
where $A_{ij}$ denotes the $(i, j)$-th element of $A$ and $\sigma(\cdot)$ is the logistic sigmoid function.
This decoder again assumes conditional independence among all possible edges for
tractability. Note that there are no learnable parameters associated with this decoder.
The only way to improve the performance of the decoder is to learn good latent
representations.
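The dot-product decoder is essentially a one-liner; in the sketch below, `Z` is a random stand-in for encoder output rather than a trained representation.

```python
import numpy as np

rng = np.random.default_rng(3)
N, F_z = 5, 2
Z = rng.normal(size=(N, F_z))             # stand-in latent variables

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# p(A_ij = 1 | z_i, z_j) = sigmoid(z_i^T z_j), computed for every pair at once
P = sigmoid(Z @ Z.T)
```

Note that `P` is symmetric and every entry is a valid probability, and there are indeed no learnable parameters anywhere in this decoder.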
Prior The prior distributions over the latent variables are simply set to indepen-
dent zero-mean Gaussians with unit variances,
$$p(Z) = \prod_{i=1}^{N} \mathcal{N}(z_i \,|\, 0, I). \qquad (4.39)$$
This prior is fixed throughout learning, as in typical VAEs.
Objective & Learning To learn the encoder and the decoder, one typically maximizes the evidence lower bound (ELBO) as in VAEs,
$$\mathcal{L} = \mathbb{E}_{q_{\phi}(Z|X,A)}\big[\log p(A|Z)\big] - \mathrm{KL}\big(q_{\phi}(Z|X,A) \,\Vert\, p(Z)\big). \qquad (4.40)$$
4.3.1.3 Discussion
The VGAE model is popular in the literature mainly due to its simplicity and good empirical performance. For example, since there are no learnable parameters in the prior and the decoder, the model is quite lightweight and the learning process is fast. Moreover, the VGAE model is versatile in the sense that once we have learned a good encoder, i.e., good latent representations, we can use them for predicting edges (i.e., link prediction), node attributes, and so on. On the other hand, the VGAE model is still limited in the following ways. First, it cannot serve as a good generative model for graphs, as VAEs do for images, since the decoder is not learnable. One could simply design a learnable decoder; however, it is not clear that the goals of learning good latent representations and generating graphs of good quality are always well-aligned. More exploration along this direction would be fruitful. Second, the independence assumption is exploited for both the encoder and the decoder, which might be very limiting. More structural dependence (e.g., auto-regressive) would be desirable to improve the model capacity. Third, as discussed in the original paper, the prior may potentially be a poor choice. Finally, for link prediction in practice, one may need to add a weighting of edges vs. non-edges in the decoder term and carefully tune it, since graphs may be very sparse.
Following Mutual Information Neural Estimation (MINE) (Belghazi et al, 2018) and
Deep Infomax (Hjelm et al, 2018), Deep Graph Infomax (Veličković et al, 2019) is
an unsupervised learning framework that learns graph representations via the prin-
ciple of mutual information maximization.
Following the original paper, we will explain the model under the single-graph
setup, i.e., the node feature matrix X and the graph adjacency matrix A of a single
graph are provided as input. Extensions to other problem setups like transductive
and inductive learning settings will be discussed in Section 4.3.2.3. The goal is to
learn the node representations in an unsupervised way. After the node representations are learned, one can apply a simple linear classifier (e.g., logistic regression) on top of the representations to perform supervised tasks like node classification.
4.3.2.2 Model
Fig. 4.2: The overall process of Deep Graph Infomax. The top path shows how the positive sample is processed, whereas the bottom path shows the process corresponding to the negative sample. Note that the graph representation is shared for both positive and negative samples. Subgraphs of positive and negative samples do not necessarily need to be different. Adapted from Figure 1 of (Veličković et al, 2019).
The main idea of the model is to maximize the local mutual information between
a node representation (capturing local graph information) and the graph represen-
tation (capturing global graph information). By doing so, the learned node repre-
sentation should capture the global graph information as much as possible. Let us
denote the graph encoder as ε which could be any GNN discussed before, e.g., a
two-layer GCN. We can obtain all node representations as H = ε(X, A) where the
representation $h_i$ of any node $i$ should contain some local information near node $i$. Specifically, a $k$-layer GCN should be able to leverage node information up to $k$ hops away. To get the global graph information, one can use a readout layer/function to process all node representations, i.e., $s = R(H)$, where the readout function $R$ can be some learnable pooling function or simply an average operator.
Objective Given the local node representation $h_i$ and the global graph representation $s$, the natural next step is to compute their mutual information. Recall that the definition of mutual information is as follows,
$$\mathrm{MI}(h, s) = \int\!\!\int p(h, s) \log \frac{p(h, s)}{p(h)\,p(s)} \, dh \, ds. \qquad (4.41)$$
However, maximizing the local mutual information alone is not enough to learn
useful representations as shown in (Hjelm et al, 2018). To develop a more practical
objective, authors in (Veličković et al, 2019) instead use a noise-contrastive type
objective following Deep Infomax (Hjelm et al, 2018),
$$\mathcal{L} = \frac{1}{N + M}\left(\sum_{i=1}^{N} \mathbb{E}_{(X,A)}\big[\log D(h_i, s)\big] + \sum_{j=1}^{M} \mathbb{E}_{(\tilde{X},\tilde{A})}\Big[\log\big(1 - D(\tilde{h}_j, s)\big)\Big]\right), \qquad (4.42)$$
where D is a binary classifier which takes both the node representation hi and the
graph representation s as input and predicts whether the pair (hi , s) comes from the
joint distribution p(h, s) (positive class) or the product of marginals p(hi )p(s) (neg-
ative class). We denote h̃ j as the j-th node representation from the negative sample.
The numbers of positive and negative samples are N and M respectively. We will
explain how to draw positive and negative samples shortly. The overall objective is
thus the negative binary cross-entropy for training a probabilistic classifier. Note that
this objective is the same type of distance as used in generative adversarial networks
(GANs) (Goodfellow et al, 2014b) which is shown to be proportional to the Jensen-
Shannon divergence (Goodfellow et al, 2014b; Nowozin et al, 2016). As verified by
(Hjelm et al, 2018), maximizing the Jensen-Shannon divergence based mutual in-
formation estimator behaves similarly (i.e., they have an approximately monotonic
relationship) to directly maximizing the mutual information. Therefore, maximizing
the objective in Eq. (4.42) is expected to maximize the mutual information. More-
over, the freedom of choosing negative samples makes the method more likely to
learn useful representations than maximizing the vanilla mutual information.
Negative Sampling To generate positive samples, one can directly sample a few nodes from the graph to construct the pairs $(h_i, s)$. For negative samples, one can generate them by corrupting the original graph data, denoted as $(\tilde{X}, \tilde{A}) = \mathcal{C}(X, A)$. In practice, one can choose various forms of this corruption function $\mathcal{C}$. For example, the authors of (Veličković et al, 2019) suggest keeping the adjacency matrix the same and corrupting the node features $X$ by row-wise shuffling. Other possibilities for the corruption function include randomly sampling subgraphs and applying Dropout (Srivastava et al, 2014) to node features.
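The row-wise shuffling corruption can be sketched in a few lines; the toy data below is an assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
N, F = 6, 3
X = rng.normal(size=(N, F))
A = (rng.random((N, N)) < 0.3).astype(float)

def corrupt(X, A, rng):
    """DGI-style corruption: keep the adjacency, shuffle the rows of X."""
    perm = rng.permutation(X.shape[0])
    return X[perm], A

X_neg, A_neg = corrupt(X, A, rng)
```

The corrupted sample keeps the topology intact while reassigning feature vectors to different nodes, so the pairs $(\tilde{h}_j, s)$ break the feature-structure correspondence that positive pairs exhibit.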
Once positive and negative samples are collected, one can learn the representations by maximizing the objective in Eq. (4.42). We summarize the training process of Deep Graph Infomax as follows:
1. Sample negative examples via the corruption function (X̃, Ã) ∼ C (X, A).
2. Compute node representations of positive samples H = {h1 , · · · , hN } = ε(X, A).
3. Compute node representations of negative samples H̃ = {h̃1 , · · · , h̃M } = ε(X̃, Ã).
4. Compute graph representation via the readout function s = R(H).
5. Update parameters of ε, D, and R via gradient ascent to maximize Eq. (4.42).
4.3.2.3 Discussion
Training deep graph neural networks by stacking multiple layers of graph neural networks usually yields inferior results, a common problem observed across many different graph neural network architectures. This is mainly due to the problem of over-smoothing, which was first explicitly studied by Li et al (2018b), who showed that the graph convolutional network (Kipf and Welling, 2017b) is a special case of Laplacian smoothing:
downstream tasks suffer as well. This phenomenon has also been pointed out by several later works, such as (Zhao and Akoglu, 2019; Li et al, 2018b; Xu et al, 2018a; Li et al, 2019c; Rong et al, 2020b).
PairNorm (Zhao and Akoglu, 2019). Next, we present a method called PairNorm for alleviating the problem of over-smoothing when GNNs go deep. The essential idea of PairNorm is to keep the total pairwise squared distance (TPSD) of the node representations unchanged, equal to that of the original node features $X$. Let $\tilde{H}$ be the node representations output by the graph convolution, which will be the input of PairNorm, and let $\hat{H}$ be the output of PairNorm. The goal of PairNorm is to normalize $\tilde{H}$ such that after normalization $\mathrm{TPSD}(\hat{H}) = \mathrm{TPSD}(X)$. In other words,
$$\mathrm{TPSD}(\tilde{H}) = \sum_{(i,j)\in[N]^2} \|\tilde{H}_i - \tilde{H}_j\|_2^2 = 2N^2\Big(\frac{1}{N}\sum_{i=1}^{N}\|\tilde{H}_i\|_2^2 - \Big\|\frac{1}{N}\sum_{i=1}^{N}\tilde{H}_i\Big\|_2^2\Big). \qquad (4.45)$$
We can further simplify the above equation by subtracting the row-wise mean from each $\tilde{H}_i$, i.e., $\tilde{H}_i^c = \tilde{H}_i - \frac{1}{N}\sum_{i=1}^{N}\tilde{H}_i$, which denotes the centered representation. A nice property of centering the node representations is that it does not change the TPSD and meanwhile pushes the second term $\|\frac{1}{N}\sum_{i=1}^{N}\tilde{H}_i\|_2^2$ to zero. As a result, we have
$$\tilde{H}_i^c = \tilde{H}_i - \frac{1}{N}\sum_{i=1}^{N}\tilde{H}_i \quad \text{(Center)}, \qquad (4.47)$$
$$\hat{H}_i = s \cdot \frac{\tilde{H}_i^c}{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\|\tilde{H}_i^c\|_2^2}} = s\sqrt{N} \cdot \frac{\tilde{H}_i^c}{\|\tilde{H}^c\|_F} \quad \text{(Scale)}, \qquad (4.48)$$
$$\mathrm{TPSD}(\hat{H}) = 2N\|\hat{H}\|_F^2 = 2N\sum_{i}\Bigg\|s \cdot \frac{\tilde{H}_i^c}{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\|\tilde{H}_i^c\|_2^2}}\Bigg\|_2^2 = 2N^2 s^2. \qquad (4.49)$$
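The center-and-scale steps translate directly into code; the scale hyperparameter $s$ and the toy input below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
N, F = 8, 4
H_tilde = rng.normal(size=(N, F)) + 3.0            # toy graph-convolution output

def pair_norm(H, s=1.0):
    """PairNorm: center rows, then rescale to a fixed per-node scale."""
    Hc = H - H.mean(axis=0, keepdims=True)         # Center step
    scale = np.sqrt((Hc ** 2).sum(axis=1).mean())  # sqrt((1/N) sum_i ||Hc_i||^2)
    return s * Hc / scale                          # Scale step

H_hat = pair_norm(H_tilde, s=1.0)

def tpsd(H):
    """Total pairwise squared distance over all ordered pairs (i, j)."""
    diff = H[:, None, :] - H[None, :, :]
    return (diff ** 2).sum()
```

After normalization the output is mean-centered, its squared Frobenius norm equals $N s^2$, and the TPSD is restored to the constant $2N^2 s^2$, independent of how much the graph convolution shrank the pairwise distances.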
4.5 Summary
Editor's Notes: The node classification task is one of the most important tasks
in graph neural networks. The node representation learning techniques in-
troduced in this chapter are the cornerstone for all other tasks in the rest
of the book, including graph classification (Chapter 9), link predic-
tion (Chapter 10), graph generation (Chapter 11), and so on. Familiarity
with the learning methodologies and design principles of node representa-
tion learning is the key to deeply understanding other fundamental research
directions such as theoretical analysis (Chapter 5), scalability (Chapter 6), ex-
plainability (Chapter 7), and adversarial robustness (Chapter 8).
Chapter 5
The Expressive Power of Graph Neural
Networks
Abstract The success of neural networks is based on their strong expressive power
that allows them to approximate complex non-linear mappings from features to
predictions. Since the universal approximation theorem by (Cybenko, 1989), many
studies have proved that feed-forward neural networks can approximate any func-
tion of interest. However, these results have not been applied to graph neural net-
works (GNNs) due to the inductive bias imposed by additional constraints on the
GNN parameter space. New theoretical studies are needed to better understand these
constraints and characterize the expressive power of GNNs.
In this chapter, we will review the recent progress on the expressive power of GNNs
in graph representation learning. We will start by introducing the most widely-used
GNN framework— message passing— and analyze its power and limitations. We
will next introduce some recently proposed techniques to overcome these limita-
tions, such as injecting random attributes, injecting deterministic distance attributes,
and building higher-order GNNs. We will present the key insights of these tech-
niques and highlight their advantages and disadvantages.
5.1 Introduction
Pan Li
Department of Computer Science, Purdue University, e-mail: [email protected]
Jure Leskovec
Department of Computer Science, Stanford University, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 63
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_5
64 Pan Li and Jure Leskovec
how broad such a range could be, called the model's expressive power, provides an important measure of the model's potential. It is desirable to have models with greater expressive power, which may learn more complex mapping functions.
Neural networks (NNs) are well known for their great expressive power. Specifically, Cybenko (1989) first proved that any continuous function defined over a compact space could be uniformly approximated by neural networks with sigmoid activation functions and only one hidden layer. Later, this result was generalized to arbitrary squashing activation functions by Hornik et al (1989).
However, these seminal findings are insufficient to explain the current unprecedented success of NNs in practice, because their strong expressive power only demonstrates that the model $f_{\theta}$ is able to approximate $f^*$ but does not guarantee that the model obtained via training, $\hat{f}$, indeed approximates $f^*$. Fig. 5.1 illustrates a well-known curve of Amount of Data vs. Performance of machine learning models (Ng, 2011): NN-based methods may only outperform traditional methods (e.g., SVM, GBDT) given sufficient data. One important reason is that NNs as machine learning models are still governed by the fundamental tradeoff between the amount of data and model complexity (Fig. 5.2). Although NNs can be rather expressive, they are likely to overfit the training examples when equipped with more parameters. Therefore, it is necessary in practice to build NNs that maintain strong expressive power while constraints are imposed on their parameters. At the same time, a good theoretical understanding of the expressive power of NNs with constraints on their parameters is needed.

Fig. 5.1: Amount of Data vs. Performance of different models (curves for neural networks vs. traditional machine learning methods such as SVM and GBDT).
Fig. 5.2: Training and testing errors with and without inductive bias can dramatically
affect the expressive power of models.
Fig. 5.3: Translation invariance of effective patterns across features and targets; the parameter sharing of RNNs/CNNs is designed to match this invariance.
results about the expressive power of NNs with inductive bias have been shown recently. Yarotsky (2017) and Liang and Srikant (2017) proved that deep neural networks (DNNs), by stacking multiple hidden layers, can achieve sufficiently good approximation with significantly fewer parameters than shallow NNs. The architecture of DNNs leverages the fact that data typically has a hierarchical structure.
DNNs are agnostic to the type of data, while dedicated neural network architec-
tures have been developed to support specific types of data. Recurrent neural net-
works (RNNs) (Hochreiter and Schmidhuber, 1997) or convolution neural networks
(CNNs) (LeCun et al, 1989) were proposed to process time series and images, re-
spectively. In these two types of data, effective patterns typically hold translation
invariance in time and in space, respectively. To match this invariance, RNNs and
CNNs adopt the inductive bias that their parameters are shared across time and
space (Fig. 5.3). The parameter-sharing mechanism works as a constraint on the
parameters and limits the expressive power of RNNs and CNNs. However, RNNs
and CNNs have been shown to have sufficient expressive power to learn transla-
tion invariant functions (Siegelmann and Sontag, 1995; Cohen and Shashua, 2016;
Khrulkov et al, 2018), which leads to the great practical success of RNNs and CNNs
in processing time series and images.
Recently, many studies have focused on a new type of NNs, termed graph neu-
ral networks (GNNs) (Scarselli et al, 2008; Bruna et al, 2014; Kipf and Welling,
2017a; Bronstein et al, 2017; Gilmer et al, 2017; Hamilton et al, 2017b; Battaglia
et al, 2018). These aim to capture the inductive bias of graphs/networks, another
important type of data. Graphs are commonly used to model complex relations and
interactions between multiple elements and have been widely used in machine learn-
ing applications, such as community detection, recommendation systems, molecule
property prediction, and medicine design (Fortunato, 2010; Fouss et al, 2007; Pires
et al, 2015). Compared to time series and images, which are well-structured and rep-
resented by tables or grids, graphs are irregular and thus introduce new challenges.
A fundamental assumption behind machine learning on graphs is that the targets
for prediction should be invariant to the order of nodes of the graph. To match this
assumption, GNNs hold a general inductive bias termed permutation invariance. In
particular, the output given by GNNs should be independent of how the node indices
of a graph are assigned, and thus of the order in which they are processed. GNNs require
Fig. 5.4: This illustrates how GNNs are designed to maintain permutation invari-
ance.
their parameters to be independent from the node ordering and are shared across the
entire graph (Fig. 5.4). Because of this new parameter sharing mechanism in GNNs,
new theoretical tools are needed to characterize their expressive power.
Analyzing the expressive power of GNNs is challenging, as this problem is
closely related to some long-standing problems in graph theory. To understand this
connection, consider the following example of how a GNN would predict whether a
graph structure corresponds to a valid molecule. The GNN should be able to identify
whether this graph structure is the same, similar, or very different from the graph
structures that are known to correspond to valid molecules. Measuring whether two
graphs have the same structure involves addressing the graph isomorphism problem, for which no polynomial-time solution has yet been found (Helfgott et al, 2017). In addition, measuring whether two graphs have a similar structure requires contending with the graph edit distance problem, which is even harder to address than the graph isomorphism problem (Lewis et al, 1983).
Great progress has been made recently on characterizing the expressive power of
GNNs, especially on how to match their power with traditional graph algorithms and
how to build more powerful GNNs that overcome the limitation of those algorithms.
We will delve more into these recent efforts further along in this chapter. In par-
ticular, compared to previous introductions (Hamilton, 2020; Sato, 2020), we will
focus on recent key insights and techniques that yield more powerful GNNs. Specifi-
cally, we will introduce standard message-passing GNNs that are able to achieve the
limit of the 1-dimensional Weisfeiler-Lehman test (Weisfeiler and Leman, 1968), a
widely-used algorithm to test for graph isomorphism. We will also discuss a number
of strategies to overcome the limitations of the Weisfeiler-Lehman test — including
attaching random attributes, attaching deterministic distance attributes, and leverag-
ing higher-order structures.
In Section 5.2, we will formulate the graph representation learning problems that
GNNs target. In Section 5.3, we will review the most widely used GNN frame-
work, the message passing neural network, describing the limitations of its expres-
sive power and discussing its efficient implementations. In Section 5.4, we will in-
troduce a number of methods that make GNNs more powerful than the message
passing neural network. In Section 5.5, we will conclude this chapter by discussing
further research directions.
5 The Expressive Power of Graph Neural Networks 67
Fig. 5.5: An illustration of the expressive power of NNs and GNNs and its effect on the performance of learned models. a) Machine learning problems aim to learn the mapping from the feature space to the target space based on several observed examples. b) The expressive power of NNs refers to the gap between the two spaces F′ and F̂′. Although NNs are expressive (F̂′ is dense in F′), the learned model f̂′ based on NNs may differ significantly from f∗ because it overfits the limited observed data. c) Suppose f∗ is known to be permutation invariant, as it works on graph-structured data. Then, the space of potential mapping functions is reduced from F′ to a much smaller space F that only includes permutation-invariant functions. If we adopt GNNs, the space of mapping functions that can be approximated simultaneously reduces to F̂. The gap between F and F̂ characterizes the expressive power of GNNs. d) Although GNNs are less expressive than general NNs (F̂ ⊂ F̂′), the learned model f̂ based on GNNs is a much better approximator of f∗ than the one based on NNs, f̂′. Therefore, for graph-structured data, our understanding of the expressive power of GNNs, i.e., the gap between F and F̂, is much more relevant than that of NNs.
In this section, we will set up the formal definition of graph representation learning
problems, their fundamental assumption, and their inductive bias. We will also dis-
cuss relationships between different notions of graph representation learning prob-
lems frequently studied in recent literature.
First, we will start by defining graph-structured data.
Definition 5.1. (Graph-structured data) Let G = (V , E , X) denote an attributed
graph, where V is the node set, E is the edge set, and X ∈ R|V |×F are the node
attributes. Each row of X, Xv ∈ RF refers to the attributes on the node v ∈ V . In
practice, graphs are usually sparse, i.e., |E | ≪ |V |2 . We introduce A ∈ {0, 1}|V |×|V |
to denote the adjacency matrix of G such that Auv = 1 iff (u, v) ∈ E. Combining the
68 Pan Li and Jure Leskovec
adjacency matrix and node attributes, we may also denote G = (A, X). Moreover, if
G is unattributed with no node attributes, we can assume that all elements in X are
constant. Later, we also use V [G ] to denote the entire node set of a particular graph
G.
The goal of graph representation learning is to learn a model by taking graph-
structured data as input and then mapping it so that certain prediction targets are
matched. Different graph representation learning problems may involve different numbers of nodes in a graph. For example, in node classification a prediction is made for each node, in link/relation prediction for each pair of nodes, and in graph classification or graph property prediction for the entire node set V. We can unify all these problems as graph representation learning.
Definition 5.2. (Graph representation learning) The feature space is defined as
X := Γ × S , where Γ is the space of graph-structured data and S includes all
the node subsets of interest, given a graph G ∈ Γ. Then, a point in X can be denoted as (G, S), where S is a subset of nodes in G. Later, we call (G, S) a graph representation learning (GRL) example. Each GRL example (G, S) ∈ X is associated with a target y in the target space Y. Suppose the ground-truth association function between features and targets is denoted by f∗ : X → Y, f∗(G, S) = y. Given a set of training examples Ξ = {(G^(i), S^(i), y^(i))}_{i=1}^k and a set of testing examples Ψ = {(G̃^(i), S̃^(i), ỹ^(i))}_{i=1}^k, a graph representation learning problem is to learn a function f based on Ξ such that f is close to f∗ on Ψ.
The above definition is general in the sense that in a GRL example (G , S) ∈ X , G
provides both raw and structural features on which some prediction for a node subset
S of interest is to be made. Below, we will further list a few frequently-investigated
learning problems that may be formulated as graph representation learning prob-
lems.
Remark 5.1. (Graph classification problem / Graph-level prediction) The node set S
of interest is the entire node set V [G ] by default. The space of graph-structured data
Γ typically contains multiple graphs. The target space Y contains labels of different
graphs. Later, for graph-level prediction, we will use G to denote a GRL example
instead of (G , S) for notational simplicity.
Next, we will introduce the fundamental assumption used in most graph repre-
sentation learning problems.
Definition 5.3. (Isomorphism) Consider two GRL examples (G(1), S(1)), (G(2), S(2)) ∈ X. Suppose G(1) = (A(1), X(1)) and G(2) = (A(2), X(2)). If there exists a bijective mapping π : V[G(1)] → V[G(2)] such that A_uv^(1) = A_{π(u)π(v)}^(2) and X_u^(1) = X_{π(u)}^(2), and π also gives a bijective mapping between S(1) and S(2), we call (G(1), S(1)) and (G(2), S(2)) isomorphic, denoted as (G(1), S(1)) ≅ (G(2), S(2)). When the particular bijective mapping π should be highlighted, we use the notation (G(1), S(1)) ≅_π (G(2), S(2)). If there is no such π, we call them non-isomorphic, denoted as (G(1), S(1)) ≇ (G(2), S(2)).
Assumption 1 (Fundamental assumption in graph representation learning) Consider a graph representation learning problem with a feature space X and its corresponding target space Y. Pick any two GRL examples (G(1), S(1)), (G(2), S(2)) ∈ X. The fundamental assumption says that if (G(1), S(1)) ≅ (G(2), S(2)), their corresponding targets in Y are the same.
Due to this fundamental assumption, it is natural to introduce the corresponding permutation invariance as an inductive bias that all models for graph representation learning should satisfy.
Remark 5.4. Note that the expressive power in Def. 5.5, characterized by how a
model can distinguish non-isomorphic GRL examples, does not exactly match the
traditional expressive power used for NNs in the sense of functional approxima-
tion. Actually, Def. 5.5 is strictly weaker because distinguishing any non-isomorphic
GRL examples does not necessarily indicate that we can approximate any function
f ∗ defined over X . However, if a model f cannot distinguish two non-isomorphic
features, f is definitely unable to approximate a function f∗ that maps these two examples to two different targets. Some recent studies have proved an equivalence between distinguishing non-isomorphic features and permutation-invariant function approximation under weak assumptions, using involved techniques (Chen et al, 2019f; Azizian and Lelarge, 2020). Interested readers may
check these references for more details.
We will start by reviewing NNs with sets (multisets) as their input, since a set can be viewed as a simplified version of a graph where all edges are removed. By definition, the order of the elements of a set does not impact the output; models that encode sets naturally provide an important building block for encoding graphs.
We term this approach invariant pooling.
Definition 5.6. (Multiset) A multiset is a generalized set whose elements may be present multiple times. In this chapter, we assume by default
that all the sets are multisets and thus allow repetitive elements. In situations where
this is not the case, we will indicate otherwise.
Message passing is the most widely-used framework to build GNNs (Gilmer et al,
2017). Given a graph G = (V , E , X), the message passing framework encodes each
node v ∈ V with a vector representation hv and keeps updating this node represen-
tation by iteratively collecting representations of its neighbors and applying neural
network layers to perform a non-linear transformation of those collections:
1. Initialize node vector representations as node attributes: h_v^(0) ← X_v, ∀v ∈ V.
2. Update each node representation based on message passing over the graph structure. In the l-th layer, l = 1, 2, ..., L, perform the following steps:

Message: m_vu^(l) ← MSG(h_v^(l−1), h_u^(l−1)), ∀(u, v) ∈ E, (5.1)
Aggregation: a_v^(l) ← AGG({m_vu^(l) | u ∈ N_v}), ∀v ∈ V, (5.2)
Update: h_v^(l) ← UPT(h_v^(l−1), a_v^(l)), ∀v ∈ V. (5.3)
MP-GNN produces representations of all the nodes, {h_v^(L) | v ∈ V}. Each node representation is essentially determined by a subtree rooted at this node (Fig. 5.7).
Given a specific graph representation learning problem, for example, classifying a
set of nodes S ⊆ V , we may use the representations of relevant nodes in S to make
the prediction:
ŷ_S = READOUT({h_v^(L) | v ∈ S}), (5.4)

where the READOUT operation is often implemented via another invariant pooling when |S| > 1, plus a feed-forward NN to predict the target. Combining Eqs. 5.1–5.4, MP-GNN builds a GNN model fMP−GNN(G, S) := ŷ_S for graph representation learning.
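The three steps of Eqs. 5.1–5.3 can be sketched directly in code. The MSG, AGG, and UPT instantiations below are simple placeholder choices for illustration (sum aggregation, additive update), not the operators of any particular published architecture:

```python
def mp_gnn(adj, X, num_layers, msg, agg, upt):
    """One run of the message-passing framework (Eqs. 5.1-5.3).

    adj[v] is the neighbor list N_v, X[v] the attribute vector of node v.
    msg, agg, upt are the MSG, AGG, UPT operations; agg must be an
    invariant pooling over the multiset of incoming messages.
    """
    h = {v: list(X[v]) for v in adj}                 # h_v^(0) <- X_v
    for _ in range(num_layers):
        m = {(v, u): msg(h[v], h[u]) for v in adj for u in adj[v]}   # Eq. 5.1
        a = {v: agg([m[(v, u)] for u in adj[v]]) for v in adj}       # Eq. 5.2
        h = {v: upt(h[v], a[v]) for v in adj}                        # Eq. 5.3
    return h                                         # {h_v^(L) | v in V}

# Toy instantiation on a 3-node path: MSG forwards the neighbor state,
# AGG sums the messages, UPT adds the aggregate to the own state.
adj = {0: [1], 1: [0, 2], 2: [1]}
X = {0: [1.0], 1: [1.0], 2: [1.0]}
h = mp_gnn(adj, X, num_layers=2,
           msg=lambda hv, hu: hu,
           agg=lambda msgs: [sum(x[0] for x in msgs)],
           upt=lambda hv, av: [hv[0] + av[0]])
```

Note how the two endpoints of the path, which root identical subtrees, receive identical representations.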
We can show the permutation invariance of MP-GNN by induction over the iter-
ation index l.
Theorem 5.1. (Invariance of MP-GNN) fMP−GNN (·, ·) satisfies permutation invari-
ance (Def. 5.4) as long as the AGG and READOUT operations are invariant pooling
operations (Def. 5.7).
Proof. This can be proved trivially by induction.
MP-GNN by default leverages the inductive bias that the nodes in the graph di-
rectly affect each other only via their connected edges. The mutual effect between
nodes that are not connected by an edge can be captured via paths that connect
these nodes via message passing. Indeed, such inductive bias may not match the
assumptions in a specific application, and MP-GNN may find it hard to capture mu-
tual effect between two far-away nodes. However, the message-passing framework
has several benefits for model implementation and practical deployment. First, it
directly works on the original graph structure and no pre-processing is needed. Sec-
ond, graphs in practice are typically sparse (|E | ≪ |V |2 ) and thus MP-GNN is able
to scale to very large but sparse graphs. Third, each of the three operations MSG,
AGG, and UPT can be computed in parallel across all nodes and edges, which is
beneficial for parallel computing platforms such as GPUs and map-reduce systems.
Because it is natural and easy to implement in practice, most GNN architectures essentially follow the MP-GNN framework by adopting specific MSG, AGG,
and UPT operations. Representative approaches include InteractionNet (Battaglia
et al, 2016), structure2vec (Dai et al, 2016), GCN (Kipf and Welling, 2017a), Graph-
SAGE (Hamilton et al, 2017b), GAT (Veličković et al, 2018), GIN (Xu et al, 2019d),
and many others (Kearnes et al, 2016; Zhang et al, 2018g).
In this section, we will introduce the expressive power of MP-GNN , following the
results proposed in Xu et al (2019d); Morris et al (2019).
The 1-dimensional Weisfeiler-Lehman test to distinguish (G(1), S(1)) and (G(2), S(2)):
1. Each node v in V[G(i)] is initialized with a color C_v^(i,0) ← X_v^(i) for i = 1, 2. If X_v^(i) is a vector, an injective function is used to map it to a color.
2. For l = 1, 2, ..., do

Update node colors: C_v^(i,l) ← HASH(C_v^(i,l−1), {C_u^(i,l−1) | u ∈ N_v^(i)}), i ∈ {1, 2}, (5.6)

where the HASH operation can be viewed as an injective mapping: different tuples (C_v^(i,l−1), {C_u^(i,l−1) | u ∈ N_v^(i)}) are mapped to different colors.

Test: If the two multisets {C_v^(1,l) | v ∈ S(1)} and {C_v^(2,l) | v ∈ S(2)} are not equal, then return (G(1), S(1)) ≇ (G(2), S(2)); else, go back to Eq. 5.6.

If the 1-WL test returns (G(1), S(1)) ≇ (G(2), S(2)), we know that (G(1), S(1)) and (G(2), S(2)) are not isomorphic. However, for some non-isomorphic (G(1), S(1)) and (G(2), S(2)), the 1-WL test may never return (G(1), S(1)) ≇ (G(2), S(2)) (even with infinitely many iterations). In this case, the 1-WL test fails to distinguish them. Note that the 1-WL test was originally proposed to test the isomorphism of two entire graphs, i.e., S(i) = V[G(i)] for i ∈ {1, 2} (Weisfeiler and Leman, 1968). Here the 1-WL test is further generalized to the case S(i) ⊂ V[G(i)], to match general graph representation learning problems.
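The refinement loop of Eq. 5.6 can be sketched with ordinary dictionaries, using canonical tuples and a shared relabeling table in place of the HASH operation. This is a minimal illustration for the case S(i) = V[G(i)]; all function names are ours:

```python
def wl_test(adj1, attrs1, adj2, attrs2, num_iters=3):
    """1-WL test (Eq. 5.6) on two graphs, comparing full node-color multisets.

    Returns True if the test certifies the two graphs non-isomorphic, i.e.
    their multisets of node colors differ after some refinement round.
    Returning False only means the 1-WL test cannot distinguish them.
    """
    graphs = [adj1, adj2]
    colors = [dict(attrs1), dict(attrs2)]            # C_v^(i,0) <- X_v^(i)
    for _ in range(num_iters):
        # Signature = (own color, sorted multiset of neighbor colors); a shared
        # relabeling table plays the role of the injective HASH in Eq. 5.6.
        sigs = [{v: (c[v], tuple(sorted(c[u] for u in adj[v]))) for v in adj}
                for adj, c in zip(graphs, colors)]
        table = {s: i for i, s in enumerate(
            sorted(set(sigs[0].values()) | set(sigs[1].values())))}
        colors = [{v: table[s[v]] for v in s} for s in sigs]
        if sorted(colors[0].values()) != sorted(colors[1].values()):
            return True                              # certified non-isomorphic
    return False

# A 3-node path vs. a triangle, both with constant attributes:
path = {0: [1], 1: [0, 2], 2: [1]}
tri = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
const = {v: 0 for v in range(3)}
```

On this pair, one refinement round already separates the color multisets, since the path endpoints see only one neighbor.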
The expressive power we defined (Def. 5.5) is closely related to the graph iso-
morphism problem. This problem is challenging, as no polynomial-time algorithms
have been found for it (Garey, 1979; Garey and Johnson, 2002; Babai, 2016). Despite some corner cases (Cai et al, 1992), the Weisfeiler-Lehman (WL) tests of graph isomorphism (Weisfeiler and Leman, 1968) are a family of effective and computationally efficient tests that distinguish a broad class of graphs (Babai and Kucera, 1979). The 1-dimensional form (the 1-WL test), also known as "naive vertex refinement", is analogous to the neighborhood aggregation in MP-GNN.
When comparing MP-GNN with the 1-WL test, the node-representation updating procedure (Eqs. 5.1–5.3) can be viewed as an implementation of Eq. 5.6, and the READOUT operation in Eq. 5.4 can be viewed as a summary of all node representations. Although MP-GNN was not proposed to perform graph isomorphism testing, fMP−GNN can be used for this test: if fMP−GNN(G(1), S(1)) ≠ fMP−GNN(G(2), S(2)), then we know that (G(1), S(1)) ≇ (G(2), S(2)). Because of this analogy, the expressive power of MP-GNN can be measured by the 1-WL test. Formally, we conclude the argument in the
following theorem.
Theorem 5.2. (Lemma 2 in (Xu et al, 2019d), Theorem 1 in (Morris et al, 2019))
Consider two non-isomorphic GRL examples (G (1) , S(1) ) and (G (2) , S(2) ). If
fMP−GNN (G (1) , S(1) ) ̸= fMP−GNN (G (2) , S(2) ), then the 1-WL test also decides
(G (1) , S(1) ) and (G (2) , S(2) ) are not isomorphic.
Theorem 5.2 indicates that MP-GNN is at most as powerful as the 1-WL test
in distinguishing different graph-structured features. Here, the 1-WL test is consid-
ered an upper bound (instead of being equal to the expressive power of MP-GNN)
Annotations in Fig. 5.8: the mapping "attributes → colors" is injective, and the mapping "(self-color, set of colors from neighbors) → a new color" is injective. After each iteration, check the multiset of node colors; after the first iteration, both graphs still have the same multiset of colors, so step 2 is run again. After two iterations, the two graphs are distinguished: the left node B obtains a color that does not appear in the right graph, because the left B has purple + blue in its neighborhood while no node in the right graph has such a neighborhood.
Fig. 5.8: An illustration of steps that distinguish two graphs via the 1-dimensional Weisfeiler-Lehman test. MP-GNN follows a similar procedure and may also distinguish them.
because the updating procedure of the 1-WL test, which aggregates node colors from the neighbors (Eq. 5.6), is injective and can distinguish between different aggregations of node colors. This intuition is useful later to design MP-GNNs that match this upper bound.
Now that the upper bound of the representation power of MP-GNN has been established, a natural follow-up question is whether there are existing GNNs that are, in principle, as powerful as the 1-WL test. The answer is yes. As shown by Theorem 5.3, if the message-passing operation (composing Eqs. 5.1–5.3) and the final READOUT (Eq. 5.4) are both injective, then the resulting MP-GNN is as powerful as the 1-WL test.
Theorem 5.3. (Theorem 3 in (Xu et al, 2019d)) After sufficient iterations, MP-GNN may map any GRL examples (G(1), S(1)) and (G(2), S(2)) that the 1-WL test decides as non-isomorphic to different representations, if the following two conditions hold:
a) The composition of MSG, AGG and UPT (Eqs. 5.1–5.3) constructs an injective mapping from (h_v^(k−1), {h_u^(k−1) | u ∈ N_v}) to h_v^(k).
b) The READOUT (Eq. 5.4) is injective.
Although MP-GNN does not surpass the representation power of the 1-WL test,
MP-GNN has important benefits over the 1-WL test from the perspective of ma-
chine learning: node colors and the final decision given by the 1-WL test are dis-
crete (represented as node colors or a “yes/no” decision) and thus cannot capture the
similarity between graph structures. In contrast, an MP-GNN satisfying the criteria in
Theorem 5.3 generalizes the 1-WL test by learning to represent the graph structures
with vectors in a continuous space. This enables MP-GNN to not only discrimi-
nate between different structures but also to learn to map similar graph structures
to similar representations, thus capturing dependencies between graph structures.
Such learned representations are particularly helpful for generalizations where data
contains noisy edges and the exact matching graph structures are sparse (Yanardag
and Vishwanathan, 2015).
In the next subsection, we will focus on introducing the key design ideas behind
MP-GNN that satisfies the conditions in Theorem 5.3.
Remark 5.5. Note that the sum pooling operator is crucial, as some popular invari-
ant pooling operators, such as the mean pooling operator, are not injective multiset
functions. The significance of the sum pooling operation is to record the number
of repetitive elements in a multiset. The mean pooling operation adopted by the graph convolutional network (Kipf and Welling, 2017a) or the softmax-normalized (attention) pooling adopted by the graph attention network (Veličković et al, 2018) may learn the distribution of the elements in a multiset but not the precise counts of the elements.
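The point of the remark can be seen on a two-element example: the multisets {1, 3} and {1, 1, 3, 3} have the same element distribution but different counts, so mean pooling collapses them while sum pooling does not (the values are illustrative):

```python
def sum_pool(multiset):
    """Injective on multisets of distinct element profiles: keeps counts."""
    return sum(multiset)

def mean_pool(multiset):
    """Captures only the distribution of elements, not their counts."""
    return sum(multiset) / len(multiset)

# Two multisets with the same distribution but different multiplicities:
m1 = [1.0, 3.0]            # {1, 3}
m2 = [1.0, 1.0, 3.0, 3.0]  # {1, 1, 3, 3}
```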
Thanks to the universal approximation theorem (Hornik et al, 1989), we can use multi-layer perceptrons (MLPs) to model and learn ψ and φ in Lemma 5.1 for a universally injective AGG operation. In MP-GNN, we do not even need to explicitly model ψ and φ, as the MSG and UPT operations (Eqs. 5.1 and 5.3, respectively) are already implemented via MLPs. Therefore, using the sum pooling as the AGG operation is sufficient to achieve the most expressive MP-GNN:
where ε^(k) is a learnable weight. This updating method, expressed in NN-based language, is termed the graph isomorphism network (GIN) layer (Xu et al, 2019d). Lemma 5.2 formally states that an MP-GNN adopting Eq. 5.7 matches condition a) in Theorem 5.3.
Proof. Combine the proof of the injectiveness of the sum aggregation with the universal approximation property of MLPs (Hornik et al, 1989).
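The update of Eq. 5.7 can be sketched as a single GIN-style layer. The MLP shape, weights, and tanh activation below are arbitrary illustrative choices, not the configuration of Xu et al (2019d):

```python
import math
import random

def gin_layer(adj, h, W1, W2, eps=0.0):
    """One GIN update: h_v <- MLP((1 + eps) * h_v + sum_{u in N_v} h_u).

    The two-layer MLP (weights W1, W2, tanh hidden activation) is an
    illustrative stand-in for a learned network.
    """
    def mlp(x):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
        return [sum(w * hi for w, hi in zip(row, hidden)) for row in W2]
    out = {}
    for v in adj:
        z = [(1.0 + eps) * hi for hi in h[v]]            # (1 + eps) * own state
        for u in adj[v]:                                 # injective sum aggregation
            z = [zi + hui for zi, hui in zip(z, h[u])]
        out[v] = mlp(z)
    return out

# Toy usage on a triangle with 1-dimensional node states.
random.seed(0)
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
h0 = {v: [1.0] for v in adj}
W1 = [[random.uniform(-1, 1)] for _ in range(2)]        # 1 -> 2 hidden units
W2 = [[random.uniform(-1, 1), random.uniform(-1, 1)]]   # 2 -> 1 output
h1 = gin_layer(adj, h0, W1, W2, eps=0.1)
```

Since the triangle is node-transitive and all inputs are equal, the layer maps every node to the same representation, as permutation invariance demands.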
A similar idea may be adapted to the READOUT operation (Eq. 5.4), which also requires an injective mapping of multisets:

Expressive Inference: ŷ_S = MLP(∑_{v∈S} h_v^(L)). (5.8)
Xu et al (2019d) observed that node representations from earlier iterations may sometimes generalize better and thus also suggested using the READOUT (a counterpart of Eq. 5.4) from the Jumping Knowledge Network (JK-Net) (Xu et al, 2018a), though this is not necessary from the perspective of the representation power of MP-GNN.
Overall, combining the update rule (Eq. 5.7) and the READOUT (Eq. 5.8), we may achieve an MP-GNN that is as powerful as the 1-WL test. In the next section, we introduce several techniques that allow MP-GNN to break the limitation of the 1-WL test and achieve even stronger expressive power.
First, we will review several critical limitations of MP-GNN and the 1-WL test to
gain the intuition for understanding the techniques that build more powerful GNNs.
MP-GNN iteratively updates the representation of each node by aggregating the representations of its neighbors. The obtained representation of a node v essentially encodes the subtree rooted at v (Fig. 5.7). However, using this rooted subtree to represent a node may lose useful information, such as:
1. The information about the distance between multiple nodes is lost. For example,
You et al (2019) noticed that MP-GNN has limited power in capturing the po-
sition/location of a given node with respect to another node in the graph. Many
nodes may share similar subtrees, and thus, MP-GNN produces the same rep-
resentation for them although the nodes may be located at different locations in
the graph. This location information of nodes is crucial for the tasks that depend
on multiple nodes, such as link prediction (Liben-Nowell and Kleinberg, 2007),
as two nodes that tend to be connected with a link are typically located close to
each other. An illustrative example is shown in Fig. 5.9.
2. The information about cycles is lost. In particular, when expanding the subtree of a node, MP-GNN essentially loses track of the node identities in the subtrees. An illustrative example is shown in Fig. 5.10. The information about cycles
is crucial in applications such as subgraph matching (Ying et al, 2020b) and
counting (Liu et al, 2020e) because loops frequently appear in the queried sub-
graph patterns of a subgraph matching/counting problem. Chen et al (2020q)
formally proved that MP-GNN is able to count star structures (a particular form
of trees) but cannot count connected subgraphs with three or more nodes that
form cycles.
Theoretically, there is a general class of graph representation learning problems that MP-GNN fails to solve due to its limited representation power. To show this, we define a class of graphs, termed attributed regular graphs.
Fig. 5.9: An example where node positional information matters. Query: which one is more likely the predator of Pelagic Fish, Lynx or Orca? The rooted subtrees corresponding to the two candidate nodes are identical, so MP-GNN assigns them the same representation.
Fig. 5.10: The node representations h_v^(L) and h_u^(L) given by MP-GNN are the same, although v and u belong to different cycles – a 3-cycle and a 6-cycle, respectively.
Note that the definition of attributed regular graphs is similar to that of k-partite regular graphs, except that attributed regular graphs allow edges connecting nodes from the same partition. It can be shown that the 1-WL test colors all the nodes of one partition in the same way. Based on the bound on the representation power of MP-GNN (Theorem 5.2), we can obtain the following corollary about the inability of MP-GNN to distinguish GRL examples defined on attributed regular graphs. Fig. 5.11 gives some examples that illustrate this impossibility. Actually, with sufficiently many layers (iterations), MP-GNN (the 1-WL test) will always transform any attributed graph into
Fig. 5.11: Examples of regular graphs and attributed regular graphs, with node sets of interest S(1) and S(2).
an attributed regular graph (Arvind et al, 2019) if we view the node representations
obtained by MP-GNN as the node attributes on this transformed graph 1 .
Corollary 5.1. Consider two graph-structured features (G(1), S(1)), (G(2), S(2)). If the two attributed regular graphs G(1), G(2) share the same configuration, i.e., Config(G(1)) = Config(G(2)), and the two multisets of attributes {X_v^(1) | v ∈ S(1)} and {X_v^(2) | v ∈ S(2)} are also equal, then fMP−GNN(G(1), S(1)) = fMP−GNN(G(2), S(2)). Therefore, if a graph representation learning problem associates (G(1), S(1)) and (G(2), S(2)) with different targets, MP-GNN does not have the expressive power to distinguish them and predict their correct targets.
Proof. The proof is obtained by tracking each iteration of the 1-WL test and per-
forming an induction.
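Corollary 5.1 can be checked concretely on the simplest attributed regular graphs: a 6-cycle and two disjoint 3-cycles are both 2-regular, unattributed, six-node graphs with the same configuration, so color refinement (and hence MP-GNN) never separates them. A small self-contained check (names ours):

```python
def refine(adj, rounds=5):
    """1-WL color refinement with constant initial colors; returns the final color multiset."""
    color = {v: 0 for v in adj}
    for _ in range(rounds):
        sig = {v: (color[v], tuple(sorted(color[u] for u in adj[v]))) for v in adj}
        table = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        color = {v: table[sig[v]] for v in adj}
    return sorted(color.values())

c6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}   # one 6-cycle
two_c3 = {0: [1, 2], 1: [0, 2], 2: [0, 1],               # two disjoint 3-cycles
          3: [4, 5], 4: [3, 5], 5: [3, 4]}

# Every node always sees the signature (c, {c, c}): both graphs remain
# monochromatic after every round, so their color multisets never differ.
```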
Next, we will introduce several approaches that address the above limitations and
that further improve the expressive power of MP-GNN .
The main reason for the limited expressive power of MP-GNN is that MP-GNN does not track node identities: different nodes with the same attributes are initialized with the same vector representations, and this remains the case unless their neighbors propagate different node representations. One way to improve the expressive power of MP-GNN is to inject each node with a unique attribute. Given a GRL example (G, S), where G = (A, X), define gI(G, S) = ((A, X ⊕ I), S), where ⊕ is concatenation and I is an identity matrix; this gives each node a unique one-hot encoding and yields a new attributed graph GI. The composite model
1 Most transformed graphs have one single node per partition. In this case, two graphs that share
the same configuration are isomorphic.
fMP−GNN ◦ gI increases expressive power, as node identities are attached to the messages in the message-passing framework, and the distance and loop information can be learned with sufficient iterations of message propagation.
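The map gI itself is a one-liner over the (A, X) representation; a sketch (the function name is ours):

```python
def g_identity(A, X):
    """g_I: append a one-hot node identity to each node's attribute row (X ⊕ I).

    This makes every node unique, but breaks permutation invariance:
    relabeling the nodes of a graph changes the augmented attributes.
    """
    n = len(A)
    X_aug = [list(X[v]) + [1.0 if i == v else 0.0 for i in range(n)]
             for v in range(n)]
    return A, X_aug

# A single undirected edge between two nodes with identical attributes:
A = [[0, 1], [1, 0]]
X = [[1.0], [1.0]]
_, XI = g_identity(A, X)
```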
However, the limitation of the above framework is that it is not permutation invariant (Def. 5.4): given two isomorphic GRL examples (G(1), S(1)) ≅ (G(2), S(2)), gI(G(1), S(1)) and gI(G(2), S(2)) may no longer be isomorphic. Then, the composite model fMP−GNN ◦ gI(G(1), S(1)) may not equal fMP−GNN ◦ gI(G(2), S(2)). As the obtained model loses the fundamental inductive bias of graph representation learning, it is hard for it to generalize2.
Remark 5.6. Some other approaches share the same limitation as gI, e.g., using the adjacency matrix A (with each row of A serving as node attributes). However, Srinivasan and Ribeiro (2020a) argued that node embeddings obtained via matrix factorization, such as DeepWalk (Perozzi et al, 2014) and node2vec (Grover and Leskovec, 2016), can keep the required invariance and thus remain generalizable. We will return to this concept in Sec. 5.4.2.4.
To overcome the above limitation, different methods have been proposed recently. These methods share the following strategy: they first design some additional random node attributes Z, use them to augment the original dataset, and then learn a GNN model over the augmented dataset (Fig. 5.13).
The obtained models will be more expressive, as the random node attributes can
be viewed as unique node identities that distinguish nodes. However, if the model
is only trained based on a single GRL example augmented by these random at-
tributes, it cannot keep invariance as discussed above. Instead, the model needs
to be trained over multiple GRL examples augmented by independently injected
random attributes. The new augmented GRL examples have the same target as the
original GRL examples from which they are generated. Training models over such augmented examples essentially regularizes them toward permutation invariance and makes them behave almost "permutation invariant."
Different methods to inject these random attributes may be adopted, but a direct way is to attach Z to X, i.e., given graph-structured data (G, S), where G = (A, X), define gZ(G, S) = ((A, X ⊕ Z), S). Note that for each realization of Z, the composite model fMP−GNN ◦ gZ is not permutation invariant. Instead, all these approaches make E[fMP−GNN ◦ gZ] permutation invariant and expect the models to keep invariance in expectation. To match such invariance in expectation, an approach must satisfy the following proposition.
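The augmentation gZ differs from gI only in where the extra columns come from; a sketch, with Gaussian noise as one arbitrary illustrative choice of random attribute (function name ours):

```python
import random

def g_random(A, X, dim=4, rng=random):
    """g_Z: attach freshly sampled random attributes Z to X (X ⊕ Z).

    Z is drawn without reference to node identities, and must be
    re-sampled at every forward pass so that the learned model is
    permutation invariant in expectation.
    """
    return A, [list(X[v]) + [rng.gauss(0.0, 1.0) for _ in range(dim)]
               for v in range(len(A))]

A = [[0, 1], [1, 0]]
X = [[1.0], [1.0]]
random.seed(0)
_, XZ1 = g_random(A, X)
_, XZ2 = g_random(A, X)   # a fresh, independent draw on each call
```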
Proposition 5.1. The following two properties are needed to build a model by in-
jecting random features Z.
2 Recent literature often states that the composite model is not inductive. Inductiveness and gen-
eralization to unobserved examples are related. In the transductive setting, fMP−GNN ◦ gI is less
generalizable than fMP−GNN , although the prediction performance of fMP−GNN ◦ gI may be some-
times better than fMP−GNN due to the much stronger expressive power of fMP−GNN ◦ gI .
Types of random attributes         | Positional information | Model & reference
Random permutations                | No  | RP-GNN (Murphy et al, 2019)
(Almost uniform) discrete r.v.     | No  | rGIN (Sato et al, 2020)
Distances to random anchor sets    | Yes | PGNN (You et al, 2019)
Graph-convoluted Gaussian r.v.     | Yes | CGNN (Srinivasan & Ribeiro, 2020)
Random signed Laplacian eigenmap   | Yes | LE-GNN (Dwivedi et al, 2020)
Fig. 5.12: Injecting random node attributes can improve the expressive power of GNNs. Different types of random node attributes are adopted in different works. Some random node attributes contain node positional information (the position of a node with respect to other nodes in the graph).
1. A sufficient number of Z’s should be sampled during the training stage so that
the model indeed captures permutation invariance in expectation.
2. The randomness in Z should be agnostic to the original node identities.
To satisfy property 1, for each forward pass computing fMP−GNN ◦ gZ during the training stage, Z should be re-sampled one or multiple times to obtain enough data augmentation. To satisfy property 2, four different types of random Z have been proposed, as described next.
Theorem 5.4. (Theorem 2.2 in (Murphy et al, 2019a)) The RP-GNN fRP−GNN is strictly more powerful than the original fMP−GNN.
lem. They suggest to use all π’s that permute all the nodes of each connected local
subgraph.
where E indicates expectation and D is a discrete space with at least 1/p elements for some p > 0. Similar to RP-GNN, frGIN can be implemented by sampling only a few Zr's for each evaluation of fMP−GNN ◦ gZr (indeed, one Zr is sampled per forward evaluation (Sato et al, 2021)).
Theorem 5.5. (Theorem 4.1 in (Sato et al, 2021)) Consider a GRL example (G, v), where only a single node is contained in the node set of interest. For any graph-structured features (G′, v′), where the nodes of G′ have a bounded maximal degree and the attributes X come from a finite space, there exists an MP-GNN such that:
1. If (G′, v′) ≅ (G, v), then fMP−GNN ◦ gZr(G′, v′) > 0.5 with high probability.
2. If (G′, v′) ≇ (G, v), then fMP−GNN ◦ gZr(G′, v′) < 0.5 with high probability.
This result can be viewed as a characterization of the expressive power of rGIN. However, the result is weakened by the fact that almost all nodes of almost all graphs are already assigned different representations within two iterations of the 1-WL test (and hence of MP-GNN) (Babai and Kucera, 1979). Moreover, the isomorphism problem for graphs with bounded degree is known to be in P (Fortin, 1996). Instead, a very recent work was able to demonstrate a universal approximation property of rGIN, which gives a stronger characterization of its expressive power.
Theorem 5.6. (Theorem 4.1 in (Abboud et al, 2020)) Consider any invariant mapping f∗ : Gn → R, where Gn contains all graphs with n nodes. Then, for any given ε > 0 and δ ∈ (0, 1), there exists an rGIN fMP−GNN ◦ gZr such that

P(| fMP−GNN ◦ gZr − f∗ | < ε) > 1 − δ.
The above RP-GNN and rGIN adopt random attributes that are totally agnostic
to the input data (G , S). Instead, the next two methods inject random attributes that
leverage the input data. Particularly, these random attributes are related to the po-
sition/location of a node in the graph, which tends to counter the loss of positional
information of nodes in MP-GNN.
You et al (2019) demonstrated that MP-GNN may not capture the position/loca-
tion of a node in the graph, which is critical information for applications such as
link prediction. Therefore, they proposed to use node positional embeddings as ex-
tra attributes. To capture permutation invariance in the sense of expectation, node
positional embeddings are generated based on randomly selected anchor node sets.
We denote the random attributes adopted in PGNN as ZP , which is constructed as
follows. Considering a graph G = (V , E , X),
1. Randomly select a few anchor sets (S1 , S2 , ..., SK ), where Sk ⊂ V . Note that the
choice of Sk is agnostic to the node identities: given a k, Sk will include each
node with the same probability.
2. For each u ∈ V, set [ZP]u = (d(u, S1), ..., d(u, SK)), where d(u, Sk), k ∈ [K], is a distance measure between u and the anchor set Sk.
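The two steps above can be sketched as follows (a simplified illustration: the anchor-set sizes and the plain shortest-path distance to the nearest anchor are our own choices; the original PGNN draws anchor sets of exponentially decaying sizes following Bourgain's theorem and uses a truncated distance):

```python
import random
from collections import deque

def bfs_dist(adj, src_set):
    """Shortest-path distance from every node to the nearest node of src_set."""
    dist = {v: float("inf") for v in adj}
    queue = deque()
    for s in src_set:
        dist[s] = 0
        queue.append(s)
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if dist[w] == float("inf"):
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def pgnn_attributes(adj, num_anchor_sets=4, seed=0):
    """[Z_P]_u = (d(u, S_1), ..., d(u, S_K)) for K randomly drawn anchor sets."""
    rng = random.Random(seed)
    nodes = list(adj)
    anchor_sets = [set(rng.sample(nodes, max(1, len(nodes) // 2)))
                   for _ in range(num_anchor_sets)]
    dists = [bfs_dist(adj, S) for S in anchor_sets]
    return {u: tuple(d[u] for d in dists) for u in nodes}

# Path graph 0 - 1 - 2 - 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
ZP = pgnn_attributes(adj)
```

Because each anchor set includes every node with the same probability, the distribution of ZP is invariant to node relabeling, matching the permutation invariance in expectation discussed above.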
As the selection of the anchor sets is agnostic to node identities, the obtained ZP still
satisfies the property 2 in Proposition 5.1. Next, we specify the strategy to sample
these anchor sets and the choice of the distance metric. The primary requirement when selecting the anchor sets is to keep the distortion between two distances low: the distance given by the original graph and the distance induced by the anchor sets. Specifically, distortion measures the faithfulness of an embedding in preserving distances when mapping from one metric space to another: an embedding has distortion α if every pairwise distance is preserved up to a multiplicative factor of α.
Compared with RP-GNN and rGIN, the random attributes adopted by PGNN deal specifically with the positional information of a node in the graph. Therefore, PGNN is better suited for tasks that are directly related to the positions of nodes, e.g., link prediction. You et al (2019) did not provide a mathematical characterization of the representation power of PGNN. However, by construction of ZP, for any two nodes u, v the attributes [ZP]u and [ZP]v are statistically correlated. For the example in Fig. 5.9, such correlation gives PGNN the information that the distance between Lynx and Pelagic Fish is different from the distance between Orca and Pelagic Fish; PGNN may thus successfully distinguish (G, {Lynx, Pelagic Fish}) and (G, {Orca, Pelagic Fish}) and make the right link prediction.
Note that the original PGNN (You et al, 2019) does not use MP-GNN as the
backbone to perform message passing. Instead, PGNN allows message passing from
nodes to anchor sets. As such, this approach is not directly relevant to the expressive
power and is thus out of the scope of this chapter, so we will not discuss it in detail.
Interested readers may refer to the original paper (You et al, 2019).
Srinivasan and Ribeiro (2020a) recently made an important observation that node
positional embeddings obtained via the factorization of some variants of the adja-
cency matrix A can be used as node attributes as long as certain random perturbation
is allowed. The obtained models still keep permutation invariance in expectation.
Srinivasan and Ribeiro (2020a) argue that a model that is built upon these random
perturbed node positional embeddings is still inductive and holds good general-
ization properties. This significant observation challenges the traditional claim that
models built upon these node positional embeddings are not inductive. A high-level
idea of why this is true is as follows: suppose the adjacency matrix is factorized as A = UΣU⊤ (e.g., via an eigendecomposition, since A is symmetric). When we permute the order of the nodes, that is, the row and column orders of A, the row order of U changes simultaneously. Therefore, models that use U as the node attributes keep permutation invariance. The random perturbation of the factorization is needed because such a decomposition is not unique.
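This argument can be checked numerically. The sketch below assumes a graph whose adjacency matrix has distinct eigenvalues, so that the eigenvectors are unique up to sign; the residual sign ambiguity is exactly why the random perturbation discussed above is needed:

```python
import numpy as np

# Adjacency matrix of the path 0 - 1 - 2 - 3 (distinct eigenvalues) and a
# random relabelling of its nodes.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
perm = np.array([2, 0, 3, 1])
P = np.eye(4)[perm]                 # permutation matrix
A_perm = P @ A @ P.T                # the same graph with nodes relabelled

# Symmetric eigendecomposition A = U diag(s) U^T.
s, U = np.linalg.eigh(A)
s_perm, U_perm = np.linalg.eigh(A_perm)

# The spectrum is invariant, and the rows of U are permuted exactly as the
# nodes are (up to one sign per eigenvector, hence the absolute values).
assert np.allclose(s, s_perm)
assert np.allclose(np.abs(U_perm), np.abs(P @ U))
```

With repeated eigenvalues the invariant subspaces admit arbitrary rotations, which is the corner case mentioned around Theorem 5.8 below.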
Although Srinivasan and Ribeiro (2020a) proposed this idea, they did not explicitly compute the node positional embeddings via matrix factorization. Instead, their method samples a series of Gaussian random matrices ZG,1, ZG,2, ... and propagates them over the graph for a few hops, combining them via MLPs ψ and some variant Â of the adjacency matrix. The rows of the resulting ZG essentially give rough node positional embeddings, which are then used as the node attributes in MP-GNN.
Here, L = I − D^{−1/2}AD^{−1/2} denotes the normalized graph Laplacian; its eigenvector matrix U provides the positional attributes ZLE.
Proof. The proof can be easily seen from the above arguments.
As shown in Lemma 5.3, the composite model keeps permutation invariance
in expectation for most graphs, although it may break invariance in some corner
cases. Regarding the expressive power, ZLE associates different nodes with distinct
attributes because U is an orthogonal matrix by definition. Hence, there must exist
fMP−GNN ◦ gZLE that may distinguish any node subsets from the graph:
Theorem 5.8. For any two GRL examples (G, S(1)), (G, S(2)) over the same graph G, even if they are isomorphic, as long as S(1) ≠ S(2), there exists an fMP−GNN such that fMP−GNN ◦ gZLE(G, S(1)) ≠ fMP−GNN ◦ gZLE(G, S(2)). However, if those two GRL examples are indeed isomorphic, (G, S(1)) ≅ (G, S(2)), over the same graph G and the normalized Laplacian matrix of G has no repeated eigenvalues, then E(fMP−GNN ◦ gZLE(G, S(1))) = E(fMP−GNN ◦ gZLE(G, S(2))).
Proof. The proof can be easily seen from the above arguments.
Theorem 5.8 implies the potential of fMP−GNN ◦gZLE to distinguish different node
sets from the same graph. Note that although fMP−GNN ◦ gZLE achieves great rep-
resentation power, it does not always work very well for link prediction in prac-
tice (Dwivedi et al, 2020) when compared with another model SEAL (Zhang and
86 Pan Li and Jure Leskovec
Chen, 2018b) (compare their performance on the COLLAB dataset in (Hu et al,
2020b)). SEAL is based on the deterministic distance attributes that are introduced
in the next subsection. Whether a model is permutation invariant is only a weak statement about its generalization. In fact, when the model is paired with node positional embeddings, the dimension of the parameter space increases, which also negatively impacts generalization. A comprehensive investigation of this observation is left for future study.
In the next subsection, we will introduce deterministic node distance attributes,
which provide a different angle to solve the above problem. Distance encoding has
a solid mathematical foundation and provides the theoretical support for many em-
pirically well-behaved models such as SEAL (Zhang and Chen, 2018b) and ID-
GNN (You et al, 2021).
In this subsection, we will introduce an approach that boosts the expressive power
of MP-GNN by injecting deterministic distance attributes.
The key motivation behind the deterministic distance attributes is as follows. In
Section 5.4.1, we have shown that MP-GNN is limited in its ability to measure the
distances between different nodes, to count cycles3 , and to distinguish attributed
regular graphs. All of these limitations are essentially inherited from the 1-WL
test which does not capture distance information between the nodes. If MP-GNN
is paired with some distance information, then the composite model must achieve
more expressive power. Then, the question is how to inject the distance information
properly.
There are two important pieces of intuition to design such distance attributes.
First, the effective distance information is typically correlated with the tasks. For
example, consider a GRL example (G , S). If this task is node classification (|S| = 1),
the information of distance from this node to itself (thus the cycles containing this
node) is relevant because it measures the information of the contextual structure. If
the task is link prediction (|S| = 2), the information of distance between the two end
nodes of the link is relevant as two nodes near to each other in the network tend
to be connected by a link. For graph-level prediction (S = V (G )), the information
of distances between any pairs of nodes could be relevant as it can be viewed as a
group of link predictions. Second, besides the distance between the nodes in S, the
distance from S to other nodes in G may also provide useful side information. Both aspects inspire the design of distance attributes.
There have been a few empirically successful GNN models that leverage deter-
ministic distance attributes, although their impact on the expressive power of GNNs
3 Cycles actually carry a special type of distance information, as they describe the length of walks
from one node to itself. If the distance from one node to itself is not measured by the shortest path
distance but by the returning probability of random walk, this distance already contains the cycle
information.
has not been characterized until very recently (Li et al, 2020e). For link prediction,
Li et al (2016a) first consider annotating the two end-nodes of the link of interest.
These two end-nodes are annotated with one-hot encodings and all other nodes are
annotated by zeros. Such annotations can be transformed into distance information
via GNN message passing. Again for link prediction, Zhang and Chen (2018b) first
sample an enclosing subgraph around the queried link and then annotate each node
in this subgraph with one-hot encodings of the shortest path distances (SPDs) from
this node to the two end-nodes of the link. Note that deciding whether a node is in
the enclosing subgraph around the queried link already gives a distance attribute. Zhang and Chen (2019) use a similar idea to perform matrix completion, which is
a similar task to link prediction. For graph classification and graph-level property
prediction, Chen et al (2019a) and Maziarka et al (2020a) adopt the SPDs between
two nodes as edge attributes. These edge attributes can be also used as the input of
MSG (Eq. 11.45) in MP-GNN. You et al (2021) annotate one node as 1 and all other nodes as 0 to improve MP-GNN in node classification. As our focus is on the
theoretical characterization of the expressive power, we will not go into detail about
these empirically successful works. Interested readers are referred to the relevant
papers.
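To make the idea concrete, the following sketch computes SEAL-style labels: every node receives the pair of shortest-path distances to the two end-nodes of the queried link (helper names are ours; SEAL further combines the two distances into a single node label and restricts to an enclosing subgraph):

```python
from collections import deque

def spd(adj, src):
    """Shortest-path distances from src to every node (BFS)."""
    dist = {v: None for v in adj}
    dist[src] = 0
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if dist[w] is None:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def distance_labels(adj, u, v):
    """Label every node with its pair of SPDs to the two end-nodes (u, v)
    of the queried link."""
    du, dv = spd(adj, u), spd(adj, v)
    return {w: (du[w], dv[w]) for w in adj}

# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
labels = distance_labels(adj, 0, 3)   # query the candidate link (0, 3)
```

The resulting pairs can be one-hot encoded and appended to the node attributes before running MP-GNN.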
Remark 5.7. (Comparison between deterministic distance attributes and random at-
tributes) Deterministic distance attributes have some advantages. First, as there is
no randomness in the input attributes, the optimization procedure of the model con-
tains less noise. Hence, the training procedure tends to converge much faster than
the model with random attributes. The model evaluation performance contains much
less noise too. Some empirical evaluation of the convergence of the model training
with random attributes can be found in Abboud et al (2020). Second, a model based
on deterministic distance attributes typically shows better generalization in practice
than the one based on random attributes. Although theoretically a model is permuta-
tion invariant when being trained based on sufficiently many examples with random
attributes (as discussed in Sec. 5.4.2), in practice this can be hard to achieve due to the high complexity.
Deterministic distance attributes have some disadvantages. First, models that are
paired with deterministic attributes may never achieve the universal approxima-
tion, unless the graph isomorphism problem is in P. However, random attributes
may be universal in the probabilistic sense (e.g., Theorem 5.6). Second, determin-
istic distance attributes typically depend on the information S in a GRL example
(G , S). This introduces an issue in computation: that is, if there are two GRL ex-
amples (G (1) , S(1) ) and (G (2) , S(2) ) sharing the same graph G but with different
node sets of interest S(1) ̸= S(2) , they will be attached with different deterministic
distance attributes and hence GNNs have to make inference over them separately.
However, GNNs with random attributes can share the intermediate node representations {h_v^(L) | v ∈ V[G]} in Eq. 5.4 between the two GRL examples, which saves intermediate computation.
ζ (u|v) = g(ℓuv ), ℓuv = (1, (W )uv , (W 2 )uv , ..., (W k )uv , ...), (5.15)
where W = AD−1 is the random walk matrix and g(·) is a general function that maps
ℓuv to different types of distance measures.
Note that ζ (u|S) depends on the graph structure G , which we omit in our notation
for simplicity. First, setting g(ℓuv) to the position of the first non-zero entry of ℓuv gives the shortest-path distance (SPD) from v to u. Second, setting g(ℓuv) = ∑_{k≥0} γk (W^k)uv gives generalized PageRank scores (Li et al, 2019f), where different choices of {γk | k ∈ Z≥0} yield various distance measures between u and v.
It is important to see that the above definition of distance encoding satisfies permutation invariance.
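The construction of Eq. 5.15 can be sketched directly: compute powers of the random-walk matrix W = AD^{-1} and read off the (u, v) entries; choosing g as the position of the first non-zero entry recovers the SPD (a toy NumPy sketch for small, connected graphs; in practice only a few powers of W are needed):

```python
import numpy as np

def distance_encoding(A, u, v, k_max=8):
    """l_uv = (1, (W)_uv, (W^2)_uv, ...) with W = A D^{-1} (Eq. 5.15)."""
    W = A / A.sum(axis=0, keepdims=True)  # random-walk matrix, columns sum to 1
    ell, Wk = [1.0], np.eye(A.shape[0])
    for _ in range(k_max):
        Wk = W @ Wk                       # Wk equals W^k after k iterations
        ell.append(Wk[u, v])
    return ell

def spd_from_encoding(ell):
    """g(l_uv): the first non-zero position among ((W^k)_uv)_{k >= 1} is the SPD."""
    for k, x in enumerate(ell[1:], start=1):
        if x > 0:
            return k
    return float("inf")

# Path graph 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
```

Since (W^k)_uv > 0 exactly when a walk of length k connects the two nodes, the first non-zero index equals the shortest-path distance.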
Lemma 5.4. For two isomorphic GRL examples (G(1), S(1)) ≅ (G(2), S(2)) under the isomorphism π, if π(u) = v for u ∈ V[G(1)] and v ∈ V[G(2)], then their distance encodings are equal: ζ(u|S(1)) = ζ(v|S(2)).
Proof. The proof can be easily seen by the definition of distance encoding.
(Figure: distance encoding, using the shortest path distance as an example. Top: nodes are colored by their DE values (DE = 0, 1, 2) with respect to the target node sets S1 (left) and S2 (right). Bottom: for the link-prediction example, ζ(Seal | {Orca, Pelagic Fish}) = {1, 1} while ζ(Seal | {Lynx, Pelagic Fish}) = {1, ∞}, so the distance encodings distinguish the two candidate links.)
Li et al (2020e) consider the scenario where the graphs are regular and do not have attributes, and prove that DE-GNN can distinguish two such GRL examples with high probability, as formally stated in the following theorem.
Theorem 5.9. (Theorem 3.3 (Li et al, 2020e)) Consider two GRL examples (G(1), S(1)) and (G(2), S(2)), where G(1) and G(2) are two n-sized unattributed regular graphs and |S(1)| = |S(2)| is a constant (independent of n). Suppose G(1) and G(2) are uniformly and independently sampled from all n-sized r-regular graphs, where 3 ≤ r < (2 log n)^{1/2}. Then, for any small constant ε > 0, there exists a DE-GNN with certain weights and L ≤ ⌈(1/2 + ε) log n / log(r − 1)⌉ layers that can distinguish these two examples with high probability. Specifically, the outputs satisfy fDE((G(1), S(1))) ≠ fDE((G(2), S(2))) with probability 1 − o(n^{−1}). The specific form of DE, i.e., g in Eq. 5.15, can simply be chosen as the shortest path distance. The little-o notation here and later is w.r.t. n.
Theorem 5.9 focuses on the node sets of unattributed regular graphs. We con-
jecture that the statement can be generalized to attributed regular graphs as distinct
attributes can only further improve the distinguishing power of a model. Moreover,
the assumption on regularity of graphs is also not crucial because the 1-WL test or
MP-GNN may transform all graphs, attributed or not, into attributed regular graphs
with enough iterations (Arvind et al, 2019).
Of course, DE-GNN cannot distinguish all non-isomorphic GRL examples. Li et al (2020e) discuss the limitations of DE-GNN. In particular, DE-GNN cannot distinguish nodes of distance-regular graphs with the same intersection arrays, although it may distinguish their edges (see Fig. 5.14 later). Li et al (2020e)
also generalize the above results to the case that leverages distance attributes as
edge attributes (to control message aggregation in MP-GNN). Interested readers
can check the details in their original paper.
Theorem 5.10. For two graph-structured examples (G (1) , S(1) ) and (G (2) , S(2) ),
where |S(i) | = 1 for i ∈ {1, 2} and G (i) is unattributed, if DE-GNN can distinguish
them with L layers, then ID-GNN requires at most 2L layers to distinguish them.
Proof. ID-GNN needs the first L layers to propagate the identity attribute to capture
distance information and the second L layers to let such information propagate back
to finally be merged into the node representations.
Fig. 5.14: ID-GNN vs. DE-GNN making predictions for a pair of nodes. The two graphs are the Shrikhande graph (left) and the 4 × 4 Rook's graph (right). ID-GNN (with identities attached to the black nodes) cannot distinguish the node pairs {a, b} and {c, d}, whereas DE-GNN may learn distinct representations for them. In these two graphs, each node is colored with its DE, i.e., the set of SPDs to the nodes in the target node set {a, b} or {c, d} (Eq. 5.14). Note that the neighbors of the nodes with DE = {1, 1} (dashed boxes) are enclosed by red ellipses, which shows that the neighbors of these two nodes have different DEs. Hence, after one layer of DE-GNN, the intermediate representations of the nodes with DE = {1, 1} differ between the two graphs. With one more layer, DE-GNN can distinguish the representations of {a, b} and {c, d}.
The expressive power of the above procedure for the entire graph classification problem can be characterized similarly, as summarized in the following corollary.
Corollary 5.2. (Reproduced from Corollary 3.4 (Li et al, 2020e)) Consider two GRL examples G(1) and G(2). Suppose G(1) and G(2) are uniformly and independently sampled from all n-sized unattributed r-regular graphs, where 3 ≤ r < (2 log n)^{1/2}. Then, ID-GNN with a sufficient number of layers can distinguish these two graphs with probability 1 − o(1), where the little-o notation is w.r.t. n.
ID-GNN can be viewed as the simplest version of DE-GNN that achieves the
same expressive power for node-level prediction. However, when the prediction
tasks contain two nodes, i.e., node-pair-level prediction, ID-GNN will be less pow-
erful than DE-GNN.
To make a prediction for a GRL example (G , S) where |S| = 2, ID-GNN can
adopt two different approaches. First, ID-GNN can attach the extra identity at-
tributes to the two nodes in S separately, learn their representations separately and
combine these two representations to make the final prediction. However, this ap-
proach cannot capture the distance information between the two nodes in S. Instead,
ID-GNN uses an alternative approach: it attaches the extra identity attribute to only one of the nodes in S and performs message passing. Then, after a sufficient
number of layers where the extra node identity is propagated from one node to
another in S, the distance information between these two nodes can be captured.
Finally, ID-GNN makes its prediction based on the two node representations in S.
Note that although the second approach captures the distance information between
the two nodes in S, it is still less powerful than DE-GNN. One example is shown in
Fig. 5.14.
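The second approach can be mimicked in a toy form: propagating a one-hot identity attribute attached to one node by repeated neighborhood aggregation reveals the shortest-path distance to the other node of S as the first layer at which the signal arrives (a deliberately minimal stand-in for ID-GNN's message passing, with plain sums in place of learned aggregators):

```python
import numpy as np

def layers_to_reach(A, src, dst, max_layers=10):
    """Attach a one-hot identity attribute to `src` and propagate it with a
    self-loop-augmented neighborhood sum; the first round at which `dst` sees
    a non-zero signal equals the shortest-path distance between the nodes."""
    h = np.zeros(A.shape[0])
    h[src] = 1.0
    for layer in range(1, max_layers + 1):
        h = A @ h + h                     # aggregate neighbors, keep own state
        if h[dst] > 0:
            return layer
    return None

# Path graph 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
```

This makes explicit why the number of layers must be at least the distance between the two nodes of S for the identity information to be captured.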
The final collection of techniques for building GNNs that overcome the limitation of the 1-WL test is related to higher-dimensional WL tests. In this subsection, for notational simplicity, we focus only on graph-level prediction problems, where higher-order GNNs are mostly used.
The family of WL tests forms a hierarchy for the graph isomorphism prob-
lem (Cai et al, 1992). There are different definitions of the higher-dim WL tests.
We follow the terminology adopted in Maron et al (2019a) and introduce two types of WL tests: the k-folklore WL (k-FWL) test and the k-WL test.
Recall G^(i) = {A^(i), X^(i)}, i ∈ {1, 2}. For both G^(i)'s, do the following steps.
1. For each k-tuple of nodes V_j = (v_{j1}, v_{j2}, ..., v_{jk}) ∈ V^k, j ∈ [n]^k, initialize V_j with a color denoted by C_j^(0). These colors satisfy the condition that for two k-tuples V_j and V_{j'}, C_j^(0) and C_{j'}^(0) are the same if and only if: (1) X_{v_{ja}} = X_{v_{j'a}}; (2) v_{ja} = v_{jb} ⇔ v_{j'a} = v_{j'b}; and (3) (v_{ja}, v_{jb}) ∈ E ⇔ (v_{j'a}, v_{j'b}) ∈ E, for all a, b ∈ [k].
2. k-FWL: For each k-tuple V_j and u ∈ V, define N_{k-FWL}(V_j; u) as a k-tuple of k-tuples, N_{k-FWL}(V_j; u) = ((u, v_{j2}, ..., v_{jk}), (v_{j1}, u, ..., v_{jk}), ..., (v_{j1}, v_{j2}, ..., u)). Then the color of V_j is updated via the following mapping.

Update colors: C_j^(l+1) ← HASH(C_j^(l), {(C_{j'}^(l) | V_{j'} ∈ N_{k-FWL}(V_j; u))}_{u∈V}). (5.18)

k-WL: For each k-tuple V_j and u ∈ V, define N_{k-WL}(V_j; u) as a set of k-tuples, N_{k-WL}(V_j; u) = {(u, v_{j2}, ..., v_{jk}), (v_{j1}, u, ..., v_{jk}), ..., (v_{j1}, v_{j2}, ..., u)}. Then the color of V_j is updated via the following mapping.

Update colors: C_j^(l+1) ← HASH(C_j^(l), ∪_{u∈V} {C_{j'}^(l) | V_{j'} ∈ N_{k-WL}(V_j; u)}), (5.19)

where the HASH operations in both cases are injective: different inputs yield different outputs.
3. For each step l, {C_j^(l)}_{j ∈ [V(G^(i))]^k} is a multiset. If the multisets of the two graphs are not equal, return G^(1) ≇ G^(2). Otherwise, repeat the color update in step 2.

Similar to the 1-WL test, if the k-(F)WL test returns G^(1) ≇ G^(2), then G^(1) and G^(2) are not isomorphic. However, the reverse is not true.
Fig. 5.15: Use the k-FWL test and the k-WL test to distinguish G^(1) and G^(2).
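The k-WL procedure of Fig. 5.15 can be sketched as a color-refinement routine over k-tuples (a simplified toy implementation: we aggregate, over u, a multiset of the substituted tuples' colors, a slight variant of the set union in Eq. 5.19, and use strings in place of an injective HASH; practical use is limited to very small graphs given the O(|V|^k) tuples):

```python
from itertools import product

def k_wl_signature(adj, attrs, k=2, rounds=3):
    """Color refinement over k-tuples in the spirit of the k-WL test
    (Eq. 5.19); returns the sorted multiset of final colors as a graph
    signature."""
    nodes = list(adj)

    def init_color(t):
        # Isomorphism type of the tuple: attributes, equality pattern, and
        # adjacency pattern among its entries (step 1 of Fig. 5.15).
        return str((tuple(attrs[v] for v in t),
                    tuple(t[a] == t[b] for a in range(k) for b in range(k)),
                    tuple(t[b] in adj[t[a]] for a in range(k) for b in range(k))))

    color = {t: init_color(t) for t in product(nodes, repeat=k)}
    for _ in range(rounds):
        color = {
            t: str((color[t],
                    sorted(sorted(color[t[:i] + (u,) + t[i + 1:]]
                                  for i in range(k))
                           for u in nodes)))
            for t in color
        }
    return sorted(color.values())

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
relabeled = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}
path3 = {0: {1}, 1: {0, 2}, 2: {1}}
sig = lambda g: k_wl_signature(g, {v: 0 for v in g})
```

Isomorphic graphs always receive equal signatures, while unequal signatures certify non-isomorphism, mirroring the one-sided guarantee of the test.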
The key idea of these higher-dim WL tests is to color every k-tuple of nodes in
the graphs and update these colors by aggregating the colors from other k-tuples that
share k − 1 nodes. The procedures of the k-FWL test and the k-WL test are shown in
Fig. 5.15. Note that they perform aggregation differently, and as such, have different
power to distinguish non-isomorphic graphs. These two types of tests form a nested
hierarchy, as summarized in the following theorem.
Theorem 5.11. (Cai et al, 1992; Grohe and Otto, 2015; Grohe, 2017)
1. The k-FWL test and the k + 1-WL test have the same discriminatory power, for
k ≥ 1.
2. The 1-FWL test, the 2-WL test and the 1-WL test have the same discriminatory
power.
3. There are some graphs that the k + 1-WL test can distinguish while the k-WL
test cannot, for k ≥ 2.
Because of Theorem 5.11, GNNs that are able to capture the power of these
higher-dim WL tests can be strictly more powerful than the 1-WL test. Therefore,
higher-order GNNs have the potential to learn even more complex functions than
MP-GNN.
However, the drawback of these GNNs is their computational complexity. By
the definition of higher-order WL tests, the colors of all k-tuples of nodes need to
be tracked. Correspondingly, higher-order GNNs that mimic higher-order WL tests
need to associate each k-tuple with a vector representation. Therefore, their memory
complexity is at least Ω (|V |k ), where |V | is the number of nodes in the graph. The
computational complexity is at least Ω(|V|^{k+1}), which makes these higher-order GNNs prohibitively expensive for large-scale graphs.
Morris et al (2019) first proposed k-GNN by following the k-WL test. Specifically, k-GNN associates each k-tuple of nodes, denoted by V_j, j ∈ V^k, with a vector representation initialized as h_j^(0). In order to save memory, k-GNN only considers k-tuples that contain k different nodes and ignores the order of these nodes; each k-tuple thus reduces to a set of k nodes. With some modification of notation in this subsection, let V_j denote this set of k different nodes. The initial representation h_j^(0) of V_j is chosen as a one-hot encoding such that h_j^(0) = h_{j'}^(0) if and only if the subgraphs induced by V_j and V_{j'} are isomorphic.
Then, k-GNN updates these representations as follows:

h_j^(l+1) = MLP(h_j^(l) ⊕ ∑_{V_{j'} ∈ N_{k-GNN}(V_j)} h_{j'}^(l)), for all k-sized node sets V_j, (5.20)

where N_{k-GNN}(V_j) contains the k-sized node sets that share k − 1 nodes with V_j.
Eq. 5.20 has time complexity at least O(|V|^k), as there are O(|V|^k) node sets V_j to track. Recently, Morris et al (2019) also considered using a local neighborhood of V_j instead of N_{k-GNN}(V_j). This local neighborhood only includes those V_{j'} ∈ N_{k-GNN}(V_j) such that the node in V_{j'} \ V_j is connected to at least one node in V_j.
Morris et al (2020b) demonstrated that a variant of this local version of k-GNN may
be as powerful as the k-WL test, although a deeper architecture with more layers is
needed to match the expressive power.
k-GNN is at most as powerful as the k-WL test. To be more expressive than
MP-GNN, k = 3 is needed. Therefore, the memory complexity is at least Ω (|V |3 ).
Consequently, the computational complexity of k-GNN, even for its local version, is at least Ω(|V|^3) per layer.
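One k-GNN layer in the spirit of Eq. 5.20 can be sketched for k = 2 as follows (a single linear map with ReLU stands in for the MLP; each 2-node set aggregates the sets sharing exactly one node with it, i.e., k − 1 = 1):

```python
import numpy as np
from itertools import combinations

def kgnn_layer(h, node_sets, W):
    """One k-GNN layer for k = 2: each 2-node set aggregates the sets sharing
    exactly one node with it, then a linear map + ReLU replaces the MLP."""
    new_h = {}
    for s in node_sets:
        neigh = sum(h[t] for t in node_sets if t != s and len(s & t) == 1)
        if isinstance(neigh, int):            # no neighboring sets at all
            neigh = np.zeros_like(h[s])
        new_h[s] = np.maximum(W @ np.concatenate([h[s], neigh]), 0.0)
    return new_h

rng = np.random.default_rng(0)
node_sets = [frozenset(c) for c in combinations(range(4), 2)]
h0 = {s: rng.normal(size=3) for s in node_sets}   # initial representations
W = rng.normal(size=(3, 6))                        # shared layer weights
h1 = kgnn_layer(h0, node_sets, W)
```

Even for this tiny example, the number of tracked representations grows as the number of k-sized node sets, which is the Ω(|V|^k) memory term discussed above.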
To build higher-order GNNs, every k-tuple needs to be associated with a vector representation. Therefore, regardless of whether a local or a global neighborhood aggregation is adopted (Eq. 5.20), the benefit of reducing the computation by leveraging the sparse graph structure is limited, as it cannot reduce the dominant term
Ω (|V |k ). Moreover, to handle a sparse graph structure, these higher-order GNNs
also need random memory access, which introduces additional computational over-
head. Therefore, a line of research into building higher-order GNNs totally ignores
graph sparsity. Graphs are viewed as tensors and NNs take these tensors as input.
The NNs are designed to be invariant to the order of tensor indices.
Many approaches (Maron et al, 2018, 2019a,b; Chen et al, 2019f; Keriven and
Peyré, 2019; Vignac et al, 2020a; Azizian and Lelarge, 2020) adopt this formulation
to build GNNs and analyze their expressive power.
Each k-tuple V_j ∈ V^k is associated with a vector representation h_j^(l). We assume h_j^(l) ∈ R for simplicity. By concatenating the k-tuples' representations together, we obtain a k-order tensor H ∈ R^{⊗k |V|}, where R^{⊗k |V|} = R^{|V| × ··· × |V|} (k times).
Maron et al (2018) showed that the number of bases needed to represent all possible linear invariant mappings R^{⊗k |V|} → R is b(k), the k-th Bell number. Additionally, the number of bases needed to represent all possible linear equivariant mappings R^{⊗k |V|} → R^{⊗k′ |V|} is b(k + k′). To better understand this observation, consider the invariant case with k = 1. In this case, the linear invariant mapping g : R^{|V|} → R is essentially an invariant pooling (Def. 5.7). As b(1) = 1, it has a single basis element, the sum pooling; i.e., g follows the form g(a) = c⟨1, a⟩, where c is a parameter to be learned. Consider the equivariant case with k = k′ = 1. As b(2) = 2, the linear equivariant mapping g : R^{|V|} → R^{|V|} has two basis elements; i.e., g has the form g(a) = (c1 I + c2 11⊤)a, where c1, c2 are parameters to be learned.
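Both facts are easy to check numerically for k = 1 (the constants c, c1, c2 stand in for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5)
P = np.eye(5)[rng.permutation(5)]          # random permutation matrix
c, c1, c2 = 0.7, 1.3, -0.4                 # stand-ins for learned parameters

# Invariant case (k = 1, b(1) = 1): the only linear invariant map is the
# scaled sum pooling g(a) = c <1, a>, unchanged by any permutation of a.
g_inv = lambda x: c * x.sum()
assert np.isclose(g_inv(P @ a), g_inv(a))

# Equivariant case (k = k' = 1, b(2) = 2): g(a) = (c1 I + c2 1 1^T) a mixes
# the identity basis with the sum-broadcast basis, and commutes with P.
g_equ = lambda x: c1 * x + c2 * x.sum() * np.ones_like(x)
assert np.allclose(g_equ(P @ a), P @ g_equ(a))
```

The same pattern generalizes to higher k, with b(k + k′) basis tensors replacing the two bases used here.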
Based on the above observations, GNNs can be built by composing these linear invariant/equivariant mappings, where learning amounts to learning the weights attached to the above bases. Toward this end, Maron et al (2018, 2019a) proposed using these linear invariant/equivariant mappings to build GNNs:

f_{k-inv} = g_inv ◦ g_equ^(L) ◦ σ ◦ g_equ^(L−1) ◦ σ ◦ ··· ◦ σ ◦ g_equ^(1), (5.21)

where g_inv is a linear invariant layer R^{⊗k |V|} → R, the g_equ^(l), l ∈ [L], are linear equivariant layers R^{⊗k |V|} → R^{⊗k |V|}, and σ is an element-wise non-linear activation function. It can be shown that f_{k-inv} is an invariant mapping. Maron et al (2018) and Azizian and Lelarge (2020) proved that the connection of f_{k-inv} to the k-WL test can be summarized in the following theorem.
Theorem 5.12. (Reproduced from (Maron et al, 2018; Azizian and Lelarge, 2020)) For two non-isomorphic graphs G(1) ≇ G(2), if the k-WL test can distinguish them, then there exists an f_{k-inv} that can distinguish them.
Maron et al (2019b); Keriven and Peyré (2019) also studied whether the models
fk−inv may universally approximate any permutation invariant function. However,
they were pessimistic in their conclusion since this would require high-order tensors,
k = Ω (n), which can hardly be implemented in practice (Maron et al, 2019b).
Similar to k-GNN, f_{k-inv} is also at most as powerful as the k-WL test. To be more expressive than MP-GNN, f_{k-inv} should use at least k = 3. Therefore, the memory complexity is at least Ω(|V|^3), and the number of bases of the linear equivariant layer is b(6) = 203. The computation at each layer then proceeds as follows: (1) a tensor in R^{⊗3 |V|} is multiplied by b(6) many basis tensors in R^{⊗6 |V|}, yielding b(6) many tensors in R^{⊗3 |V|}; (2) these tensors are summed, with learned weights, into a single tensor in R^{⊗3 |V|}.
The higher-order GNNs in the previous two subsections match the expressive power of the k-WL test. According to Theorem 5.11, the k-FWL test has the same power as the (k + 1)-WL test, which is strictly more powerful than the k-WL test for k ≥ 2, while the k-FWL test only needs to track the representations of k-tuples. Therefore, if GNNs can mimic the k-FWL test, they may have a memory cost similar to the GNNs introduced in the previous two subsections while being more expressive. Maron et al (2019a) and Chen et al (2019f) proposed PPGN and Ring-GNN, respectively, to match the k-FWL test.
The key difference between the k-FWL test and the k-WL test is how the neighbors of a k-tuple V_j are leveraged. Note that N_{k-FWL}(V_j; u) in Eq. 5.18 groups the neighboring tuples of V_j into a higher-level tuple, while N_{k-WL}(V_j; u) skips grouping them due to the set-union operation in Eq. 5.19. This yields the key mechanism in the GNN design to match the k-FWL test: implement the aggregation procedure of the k-FWL test in Eq. 5.18 via a product-sum procedure. Suppose the representation of V_j is h_j ∈ R. We may design the aggregation of {(C_{j'}^(l) | V_{j'} ∈ N_{k-FWL}(V_j; u))}_{u∈V} as

∑_{u∈V} ∏_{V_{j'} ∈ N_{k-FWL}(V_j; u)} h_{j'}.
If we combine all these representations into a tensor H ∈ R^{⊗k |V| × F}, the above operation can essentially be represented as a tensor operation (Eq. 5.22), in which MLPs are imposed on the last dimension of these tensors; MLPs with different superscripts have different parameters. Finally, a READOUT ∑_{V_j ∈ V^k} h_j^(L) is performed to obtain the graph representation.
Maron et al (2019a) proved that PPGN, when k = 2, can match the power of the
2-FWL test. Azizian and Lelarge (2020) generalized this result to an arbitrary k.
Theorem 5.13. (Reproduced from (Azizian and Lelarge, 2020)) For two non-isomorphic graphs G(1) ≇ G(2), if the k-FWL test can distinguish them, then there exists a PPGN that can distinguish them.
To be more powerful than the 1-WL test, PPGN only needs to set k = 2, and hence the memory complexity is just Ω(|V|^2). Regarding the computation, the product-sum-type aggregation of PPGN is indeed more complex than that of f_{k-inv} in Sec. 5.4.4.2. However, when k = 2, Eq. 5.22 reduces to the product of two matrices, which can be computed efficiently on parallel computing units.
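For k = 2, a PPGN-style block can be sketched as follows: the product-sum aggregation becomes a channel-wise matrix product along the two node dimensions, whose result is concatenated with the input (single linear maps with ReLU stand in for the MLPs of the original architecture; this is an illustrative sketch, not the exact published layer):

```python
import numpy as np

def feature_mlp(W, H):
    """Linear map + ReLU on the feature (last) axis of an n x n x F tensor,
    standing in for one of PPGN's MLPs."""
    return np.maximum(np.einsum("ijf,fg->ijg", H, W), 0.0)

def ppgn_block(H, W1, W2):
    """One PPGN-style block for k = 2: the product-sum aggregation of the
    2-FWL test becomes a matrix product along the two node dimensions,
    computed per feature channel and concatenated with the input."""
    M1, M2 = feature_mlp(W1, H), feature_mlp(W2, H)
    prod = np.einsum("ikf,kjf->ijf", M1, M2)   # channel-wise matrix product
    return np.concatenate([H, prod], axis=-1)

rng = np.random.default_rng(0)
n, F = 5, 3
H = rng.normal(size=(n, n, F))                 # representations of 2-tuples
out = ppgn_block(H, rng.normal(size=(F, F)), rng.normal(size=(F, F)))
```

The matrix product is exactly the operation that parallel hardware accelerates, which is why PPGN remains practical at k = 2 despite the Ω(|V|^2) memory.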
5.5 Summary
Graph neural networks have recently achieved unprecedented success across many
domains due to their great expressive power to learn complex functions defined
over graphs and relational data. In this chapter, we provided a systematic study of
the expressive power of GNNs by giving an overview of recent research results in
this field.
We first established that the message passing GNN is at most as powerful as the 1-WL test in distinguishing non-isomorphic graphs. The key condition that guarantees matching this limit is an injective update function for the node representations.
Next, we discussed techniques that have been proposed to build more powerful
GNNs. One approach to make message passing GNNs more expressive is to pair
the input graphs with extra attributes. In particular, we discussed two types of extra
attributes — random attributes and deterministic distance attributes. Injecting ran-
dom attributes allows GNNs to distinguish any non-isomorphic graphs, though a
large amount of data augmentation is required to make GNNs approximately invari-
ant. Meanwhile, injecting deterministic distance attributes does not require the same
data augmentation, but the expressive power of the resulting GNNs still has certain limitations. Mimicking higher-dimensional WL tests is another way to build more
powerful GNNs. These approaches do not track node representations. Instead, they
update the representation of every k-tuple of nodes (k ≥ 2). Overall, the message passing GNN is powerful but has limits in its expressive power. Different techniques help GNNs overcome these limits to different extents while incurring different types of computational costs.
We would like to list some additional research results on the expressive power
of GNNs that we were not able to cover earlier due to space limitations. Barceló
et al (2019) study the expressive power of GNNs to represent Boolean classifiers,
which is useful to understand how GNNs represent knowledge and logic. Vignac
et al (2020a) propose a structural message passing framework for GNNs, where a
matrix instead of a vector is adopted as the node representation to make GNN more
expressive. Balcilar et al (2021) study the expressivity of GNNs via the spectral
analysis of GNN-based graph signal transformations. Chen et al (2020k) study the
effect of non-linearity of GNNs in the message passing procedure on their expres-
sive power, which complements our understanding of many works that suggest a
linear message passing procedure (Wu et al, 2019a; Klicpera et al, 2019a; Chien
et al, 2021).
98 Pan Li and Jure Leskovec
Acknowledgements The authors would like to greatly thank Jiaxuan You and Weihua Hu for
sharing many materials reproduced here. The authors would like to greatly thank Rok Sosič
and Natasha Sharp for commenting on and polishing the manuscript. The authors also grate-
fully acknowledge the support of DARPA under Nos. HR00112190039 (TAMI), N660011924033
(MCS); ARO under Nos. W911NF-16-1-0342 (MURI), W911NF-16-1-0171 (DURIP); NSF under
Nos. OAC-1835598 (CINES), OAC-1934578 (HDR), CCF-1918940 (Expeditions), IIS-2030477
(RAPID), NIH under No. R56LM013365; Stanford Data Science Initiative, Wu Tsai Neuro-
sciences Institute, Chan Zuckerberg Biohub, Amazon, JPMorgan Chase, Docomo, Hitachi, Intel,
JD.com, KDDI, NVIDIA, Dell, Toshiba, Visa, and UnitedHealth Group. J. L. is a Chan Zuckerberg
Biohub investigator.
4 Loukas (2020) measures the required depth and width of GNNs by viewing them as distributed
algorithms, which does not assume permutation invariance. Instead, here we are talking about the
expressive power that refers to the capability of learning permutation invariant functions.
Chapter 6
Graph Neural Networks: Scalability
Abstract Over the past decade, Graph Neural Networks have achieved remarkable
success in modeling complex graph data. Nowadays, graph data is growing exponentially
in both magnitude and volume; for example, a social network can consist of billions
of users and relationships. This circumstance raises a crucial question: how can the
scalability of Graph Neural Networks be properly extended? Two major challenges
remain when scaling the original implementation of GNN to large graphs. First, most
GNN models compute the entire adjacency matrix and node embeddings of the graph,
which demands a huge memory space. Second, training a GNN requires recursively
updating each node in the graph, which becomes infeasible and inefficient for large
graphs. Current studies propose to tackle these obstacles mainly from three sampling
paradigms: node-wise sampling, which is executed based on the target nodes in the
graph; layer-wise sampling, which is implemented on the convolutional layers; and
graph-wise sampling, which constructs sub-graphs for the model inference. In this
chapter, we introduce several representative studies accordingly.
Hehuan Ma
Department of CSE, University of Texas at Arlington, e-mail: [email protected]
Yu Rong
Tencent AI Lab, e-mail: [email protected]
Junzhou Huang
Department of CSE, University of Texas at Arlington, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 99
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_6
6.1 Introduction
Graph Neural Networks (GNNs) have gained increasing popularity and obtained
remarkable achievements in many fields, including social networks (Freeman, 2000;
Perozzi et al, 2014; Hamilton et al, 2017b; Kipf and Welling, 2017b),
bioinformatics (Gilmer et al, 2017; Yang et al, 2019b; Ma et al, 2020a), knowledge
graphs (Liben-Nowell and Kleinberg, 2007; Hamaguchi et al, 2017; Schlichtkrull
et al, 2018), etc. GNN models are powerful in capturing accurate graph structure
information as well as the underlying connections and interactions between nodes (Li
et al, 2016b; Veličković et al, 2018; Xu et al, 2018a, 2019d). Generally, GNN models
are constructed based on the features of the nodes and edges, as well as the
adjacency matrix of the whole graph. However, since graph data is growing rapidly
nowadays, the graph size is increasing exponentially too. The recently published
Open Graph Benchmark (OGB) collects several commonly used datasets for machine
learning on graphs (Weihua Hu, 2020). Table 6.1 shows the statistics of its node
classification datasets. As observed, the large-scale dataset ogbn-papers100M
contains over one hundred million nodes and one billion edges, and even the
relatively small dataset ogbn-arxiv still contains a fairly large number of nodes
and edges.
Table 6.1: The statistics of node classification datasets from OGB (Weihua Hu,
2020).
For such large graphs, the original implementation of GNN is not suitable. There
are two main obstacles: 1) a large memory requirement, and 2) inefficient gradient
updates. First, most GNN models need to store the entire adjacency matrices and
feature matrices in memory, which demands huge memory consumption; the memory may
simply not be adequate for handling very large graphs. Therefore, GNN cannot be
applied on large graphs directly. Second, during the training phase of most GNN
models, the gradient of each node is updated in every iteration, which is
inefficient and infeasible for large graphs. This scenario is analogous to gradient
descent versus stochastic gradient descent: gradient descent may take too long to
converge on a large dataset, so stochastic gradient descent is introduced to speed
up progress towards the optimum.
In order to tackle these obstacles, recent studies propose to design proper sampling
algorithms on large graphs to reduce the computational cost as well as improve the
scalability.
6.2 Preliminary
We first briefly introduce some concepts and notations used in this chapter. Given
a graph G(V, E), V denotes the set of n = |V| nodes and E denotes the set of
m = |E| edges. Node u ∈ N(v) is a neighbor of node v, where v ∈ V and (u, v) ∈ E.
The vanilla GNN architecture can be summarized as

H^{(l+1)} = σ(A H^{(l)} W^{(l)}),

where A is the normalized adjacency matrix, H^{(l)} represents the embeddings of
the nodes in the graph at layer/depth l, W^{(l)} is the weight matrix of the neural
network, and σ denotes the activation function.
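To make the layer concrete, here is a minimal NumPy sketch of one such feed-forward step. The function names, the toy path graph, the symmetric normalization with self-loops, and the choice of ReLU for σ are assumptions for illustration only, not a reference implementation:

```python
import numpy as np

def normalized_adjacency(A):
    # One common normalization: D^{-1/2} (A + I) D^{-1/2} with self-loops.
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(A_norm, H, W):
    # H^{(l+1)} = sigma(A H^{(l)} W^{(l)}), with ReLU as sigma here.
    return np.maximum(0.0, A_norm @ H @ W)

# Toy graph: 4 nodes on a path 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H0 = rng.normal(size=(4, 5))   # input node features
W0 = rng.normal(size=(5, 8))   # layer weights
H1 = gcn_layer(normalized_adjacency(A), H0, W0)
print(H1.shape)  # (4, 8)
```

Note that this dense formulation materializes the full adjacency and embedding matrices, which is exactly the memory bottleneck the sampling methods below try to avoid.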
For large-scale graph learning, the problem is often referred to as node
classification, where each node v is associated with a label y, and the goal is to
learn from the graph and predict the labels of unseen nodes.
The concept of sampling aims at selecting a subset of all the samples to represent
the entire sample distribution. Therefore, a sampling algorithm on large graphs
refers to an approach that uses a partial graph instead of the full graph to address
the target problem. In this chapter, we categorize the different sampling algorithms
into three major groups: node-wise sampling, layer-wise sampling, and graph-wise
sampling.
Node-wise sampling plays a dominant role during the early stage of imple-
menting GCN on large graphs, such as Graph SAmple and aggreGatE (Graph-
SAGE) (Hamilton et al, 2017b) and Variance Reduction Graph Convolutional
Networks (VR-GCN) (Chen et al, 2018d). Later, layer-wise sampling algorithms
are proposed to address the neighborhood expansion problem occurred during
node-wise sampling, e.g., Fast Learning Graph Convolutional Networks (Fast-
GCN) (Chen et al, 2018c) and Adaptive Sampling Graph Convolutional Networks
(ASGCN) (Huang et al, 2018). Moreover, graph-wise sampling paradigms are de-
signed to further improve the efficiency and scalability, e.g., Cluster Graph Convo-
lutional Networks (Cluster-GCN) (Chiang et al, 2019) and Graph SAmpling based
INductive learning meThod (GraphSAINT) (Zeng et al, 2020a). Fig. 6.1 illustrates
a comparison among the three sampling paradigms. In node-wise sampling, the nodes
are sampled based on the target node in the graph; in layer-wise sampling, the
nodes are sampled based on the convolutional layers in the GNN models; and in
graph-wise sampling, sub-graphs are sampled from the original graph and used for
the model inference.
According to these paradigms, two main issues should be addressed when constructing
large-scale GNNs: 1) how to design efficient sampling algorithms, and 2) how to
guarantee the sampling quality. In recent years, many works have studied how to
construct large-scale GNNs and how to address the above issues properly. Fig. 6.2
displays a timeline of representative works in this area from 2017 to the present.
Each work will be introduced accordingly in this chapter.
Other than these major sampling paradigms, more recent works have attempted to
improve the scalability of GNNs on large graphs from various perspectives as well.
For example, heterogeneous graphs have attracted more and more attention with the
rapid growth of data: large graphs not only include millions of nodes but also
various data types, and how to train GNNs on such large graphs has become a new
domain of interest. Li et al (2019a) proposes a GCN-based Anti-Spam (GAS) model
6.3.1 Node-wise Sampling

Rather than using all the nodes in the graph, the first approach selects certain nodes
through various sampling algorithms to construct large-scale GNNs. GraphSAGE (Hamil-
ton et al, 2017b) and VR-GCN (Chen et al, 2018d) are two pivotal studies that utilize
such a method.
6.3.1.1 GraphSAGE
At the early stage of GNN development, most works targeted transductive learning
on a fixed-size graph (Kipf and Welling, 2017b, 2016), while the inductive setting
is more practical in many cases. Yang et al (2016b) develops inductive learning on
graph embeddings, and GraphSAGE (Hamilton et al, 2017b) extends the study to large
graphs. The overall architecture is illustrated in Fig. 6.3.
Fig. 6.3: Overview of the GraphSAGE architecture. Step 1: sample the neighbor-
hoods of the target node; step 2: aggregate feature information from the neighbors;
step 3: utilize the aggregated information to predict the graph context or label. Fig-
ure excerpted from (Hamilton et al, 2017b).
Different from the original mean aggregator in GCN, GraphSAGE proposes an LSTM
aggregator and a Pooling aggregator to aggregate the information from the neighbors.
The second extension is that GraphSAGE applies a concatenation function, instead of
a summation function, to combine the information of the target node and its
neighborhood:

h_v^{(l+1)} = σ(W^{(l+1)} · CONCAT(h_v^{(l)}, h_{N(v)}^{(l+1)})),

where W^{(l+1)} is the weight matrix and σ is the activation function.
In order to make GNNs suitable for large-scale graphs, GraphSAGE introduces a
mini-batch training strategy to reduce the computation cost during the training
phase. Specifically, in each training iteration, only the nodes that are used in
computing the representations in the batch are considered, which significantly
reduces the number of sampled nodes. Take layer 2 in Fig. 6.4(a) as an example:
unlike full-batch training, which takes all 11 nodes into consideration, only 6
nodes are involved in mini-batch training. However, this simple implementation of
the mini-batch training strategy suffers from the neighborhood expansion problem.
As shown in layer 1 of Fig. 6.4(a), most of the nodes are sampled, since the number
of sampled nodes grows exponentially if all the neighbors are sampled at each layer;
thus, all the nodes are selected eventually if the model contains many layers.
Fig. 6.4: Visual comparison between mini-batch training and fixed-size neighbor
sampling.
To further improve the training efficiency and eliminate the neighborhood expansion
problem, GraphSAGE adopts a fixed-size neighbor sampling strategy. Specifically, a
fixed-size set of neighbor nodes is sampled at each layer for computation, instead
of using the entire neighborhood set. For example, one can fix the sample size at
two nodes, as illustrated in Fig. 6.4(b), where the yellow nodes represent the
sampled nodes and the blue nodes are the candidate nodes. It can be observed that
the number of sampled nodes is significantly reduced, especially at layer 1.
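The fixed-size strategy can be sketched as follows. The function name, the adjacency-list layout, and the toy graph are illustrative assumptions rather than GraphSAGE's actual code; sampling with replacement when a neighborhood is too small is one simple way to keep the size fixed:

```python
import numpy as np

def sample_fixed_neighbors(adj_list, nodes, num_samples, rng):
    # For each target node, draw a fixed-size neighbor set; sample with
    # replacement when the neighborhood is smaller than num_samples.
    sampled = {}
    for v in nodes:
        nbrs = adj_list[v]
        replace = len(nbrs) < num_samples
        sampled[v] = list(rng.choice(nbrs, size=num_samples, replace=replace))
    return sampled

adj_list = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: [0, 2]}
rng = np.random.default_rng(0)
out = sample_fixed_neighbors(adj_list, [0, 1], num_samples=2, rng=rng)
# each target node keeps exactly 2 sampled neighbors
```

Applying this recursively per layer bounds the receptive field at s^L nodes for sample size s and depth L, instead of the full exponential neighborhood.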
6.3.1.2 VR-GCN
In order to further reduce the size of the sampled node set, as well as conduct a
comprehensive theoretical analysis, VR-GCN (Chen et al, 2018d) proposes a Control
Variate Based Estimator. It samples only an arbitrarily small set of the neighbor
nodes by employing historical activations of the nodes. Fig. 6.5 compares the
receptive field of one target node under different sampling strategies. In the
original implementation of GCN (Kipf and Welling, 2017b), the number of sampled
nodes increases exponentially with the number of layers. With neighbor sampling,
the size of the receptive field is reduced randomly according to the preset sampling
number. Compared with them, VR-GCN utilizes the historical node activations as a
control variate to keep the receptive field small.
Fig. 6.5: Illustration of the receptive field of a single node utilizing different sam-
pling strategies with a two-layer graph convolutional neural network. The red circle
represents the latest activation, and the blue circle indicates the historical activation.
Figure excerpted from (Chen et al, 2018d).
In the plain neighbor-sampling (NS) scheme, the neighborhood aggregation of node v
is approximated by the Monte Carlo estimator

Σ_{u∈N(v)} A_{v,u} h_u^{(l)} ≈ (|N(v)| / d^{(l)}) Σ_{u∈N̂^{(l)}(v)} A_{v,u} h_u^{(l)},

where N(v) represents the neighbor set of node v, d^{(l)} is the sampled size of the
neighbor nodes at layer l, N̂^{(l)}(v) ⊂ N(v) is the sampled neighbor set of node v at
layer l, and A represents the normalized adjacency matrix. Such a method has been
proven to be biased and to cause larger variance; the detailed proof can be found
in (Chen et al, 2018d). These properties in turn demand a larger sample size
N̂^{(l)}(v) ⊂ N(v).
To address these issues, VR-GCN proposes the Control Variate Based Estimator
(CV Sampler) to maintain the historical hidden embedding h̄_v^{(l)} of every
participating node. The intuition is that the difference between h̄_v^{(l)} and
h_v^{(l)} should be small if the model weights do not change too fast, which enables
a better estimation. The CV Sampler is thus capable of reducing the variance and
eventually obtaining a smaller sample size N̂^{(l)}(v). Thus,
the feed-forward layer of VR-GCN can be defined as,
H^{(l+1)} = σ((A^{(l)}(H^{(l)} − H̄^{(l)}) + AH̄^{(l)})W^{(l)}),

where A^{(l)} is the sampled normalized adjacency matrix at layer l,
H̄^{(l)} = {h̄_1^{(l)}, · · · , h̄_n^{(l)}} is the stack of the historical hidden
embeddings, H^{(l+1)} = {h_1^{(l+1)}, · · · , h_n^{(l+1)}} is the embedding of the
graph nodes at the (l+1)-th layer, and W^{(l)} is the learnable weight matrix. In
such a manner, the sampled size of A^{(l)} is greatly reduced compared with
GraphSAGE by utilizing the historical hidden embeddings h̄^{(l)}, which yields a
more efficient computing method.
Moreover, VR-GCN also studies how to apply the Control Variate Estimator to the
dropout model; more details can be found in the paper.
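As a sketch, the control-variate feed-forward step above can be written in NumPy as follows. The function name `vr_gcn_layer`, the ReLU choice of σ, and the toy matrices are assumptions for illustration; in practice A^{(l)} would be a sparse matrix over the small sampled receptive field:

```python
import numpy as np

def vr_gcn_layer(A_sampled, A_full, H, H_hist, W):
    # H^{(l+1)} = sigma( (A^{(l)} (H^{(l)} - Hbar^{(l)}) + A Hbar^{(l)}) W^{(l)} )
    # A_sampled: sampled normalized adjacency (small receptive field)
    # H_hist:    stored historical embeddings Hbar^{(l)}
    Z = (A_sampled @ (H - H_hist) + A_full @ H_hist) @ W
    return np.maximum(0.0, Z)

n, d_in, d_out = 4, 3, 2
rng = np.random.default_rng(1)
A_full = np.full((n, n), 0.25)   # toy normalized adjacency
A_samp = np.eye(n)               # stand-in for a sparse sampled version
H = rng.normal(size=(n, d_in))
W = rng.normal(size=(d_in, d_out))
out = vr_gcn_layer(A_samp, A_full, H, H.copy(), W)
# when H equals the history, the estimator reduces to sigma(A_full H W)
```

The key property is visible in the last line: the expensive full aggregation is only ever applied to the cheap, pre-stored history, while the sampled adjacency handles the (small) correction term.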
In summary, VR-GCN first analyzes variance reduction in node-wise sampling and
successfully reduces the sample size. However, the trade-off is that the additional
memory consumption for storing the historical hidden embeddings can be very large.
Recall that one limitation of applying GNNs to large-scale graphs is that storing
the full adjacency or feature matrices is unrealistic; storing the historical hidden
embeddings actually increases the memory cost, which does not help from this
perspective.
6.3.2 Layer-wise Sampling

Since node-wise sampling can only alleviate, but not completely solve, the
neighborhood expansion problem, layer-wise sampling has been studied to address
this obstacle.
6.3.2.1 FastGCN
In order to solve the neighborhood expansion problem, FastGCN (Chen et al, 2018c)
first proposes to understand the GNN from a functional generalization perspective.
The authors point out that training algorithms such as stochastic gradient descent
are implemented according to the additivity of the loss function for independent
data samples. However, GNN models generally lack sample-loss independence. To solve
this problem, FastGCN converts the common graph convolution view to an integral
transform view by introducing a probability measure for each node. Fig. 6.6 shows
the conversion between the traditional graph convolution view and the integral
transform view. In the graph convolution view, a fixed number of nodes are sampled
in a bootstrapping manner in each layer, and they are connected if a connection
exists in the original graph; each convolutional layer is responsible for
integrating the node embeddings. The integral transform view is visualized according
to the probability measure, and the integral transform (demonstrated in the yellow
triangle form) is used to calculate the embedding function in the next layer. More
details can be found in (Chen et al, 2018c).
Fig. 6.6: Two views of GCN. The circles represent the nodes in the graph, while
the yellow circles indicate the sampled nodes. The lines represent the connection
between nodes.
Moreover, considering t_l i.i.d. samples u_1^{(l)}, . . . , u_{t_l}^{(l)} ∼ p for
each layer l, l = 0, . . . , K − 1, a layer-wise estimation of the loss function is
admitted as

L_{t_0,t_1,...,t_K} := (1/t_K) Σ_{i=1}^{t_K} g(h_{t_K}^{(K)}(u_i^{(K)})),

which shows that FastGCN samples a fixed number of nodes at each layer.
Furthermore, in order to reduce the sampling variance, FastGCN adopts importance
sampling with respect to the weights in the normalized adjacency matrix:

q(u) = ∥A(:, u)∥^2 / Σ_{u′∈V} ∥A(:, u′)∥^2 ,  u ∈ V ,  (6.1)

where A is the normalized adjacency matrix of the graph. Detailed proofs can be
found in (Chen et al, 2018c). According to Equation 6.1, the sampling process is
independent for each layer, and the sampling probability remains the same across
layers.
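Equation 6.1 can be sketched directly in NumPy. The function names and the toy matrix are illustrative assumptions; the point is that q depends only on the column norms of the normalized adjacency matrix, so the same distribution is reused for every layer:

```python
import numpy as np

def fastgcn_sampling_probs(A_norm):
    # q(u) proportional to the squared L2 norm of column u of the
    # normalized adjacency matrix (Equation 6.1).
    col_norms_sq = np.linalg.norm(A_norm, axis=0) ** 2
    return col_norms_sq / col_norms_sq.sum()

def sample_layer(A_norm, t, rng):
    # Draw t i.i.d. nodes for one layer; q is identical across layers,
    # so each layer can be sampled independently (and in parallel).
    q = fastgcn_sampling_probs(A_norm)
    return rng.choice(A_norm.shape[0], size=t, p=q)

A = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
q = fastgcn_sampling_probs(A)
rng = np.random.default_rng(0)
layer_nodes = sample_layer(A, t=2, rng=rng)
```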
6.3.2.2 ASGCN
To better capture the between-layer correlations, ASGCN (Huang et al, 2018) proposes
an adaptive layer-wise sampling strategy. Specifically, the sampling probability of
the lower layers depends on the upper ones. As shown in Fig. 6.9, ASGCN only samples
nodes from the neighbors of the sampled nodes (yellow nodes) to better capture the
between-layer correlations, while FastGCN uses importance sampling among all the
nodes.
Fig. 6.9: Network construction example: (a) node-wise sampling; (b) layer-wise
sampling; (c) skip connection implementation. Figure excerpted from (Huang et al,
2018).
To further reduce the sampling variance, ASGCN introduces explicit variance
reduction to optimize the sampling variance as part of the final objective.
Considering x(u_j) as the node feature of node u_j, the optimal sampling probability
q*(u_j) can be formulated as

q*(u_j) = ( Σ_{i=1}^{n} p(u_j | v_i) g(x(u_j)) ) / ( Σ_{j=1}^{n′} Σ_{i=1}^{n} p(u_j | v_i) g(x(u_j)) ) ,  g(x(u_j)) = W_g x(u_j). (6.2)
However, simply utilizing the sampler given by Equation 6.2 is not sufficient to
secure a minimal variance. Thus, ASGCN designs a hybrid loss that adds the variance
to the classification loss L_c, as shown in Equation 6.3; in this manner, the
variance can be trained to be minimized:

L = (1/n′) Σ_{i=1}^{n′} L_c(y_i, ȳ(μ̂_q(v_i))) + λ Var_q(μ̂_q(v_i)), (6.3)

where y_i is the ground-truth label, μ̂_q(v_i) represents the output hidden embedding
of node v_i, and ȳ(μ̂_q(v_i)) is the prediction; λ is a trade-off parameter. The
variance reduction term λ Var_q(μ̂_q(v_i)) can also be viewed as a regularization
over the sampled instances.
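The structure of Equation 6.3 can be sketched as follows. This is a simplified schematic, not ASGCN's implementation: `hybrid_loss` is a hypothetical name, L_c is instantiated as cross-entropy, and the sample variance Var_q is passed in as a precomputed scalar rather than derived from the sampler:

```python
import numpy as np

def hybrid_loss(y_true, probs, variance, lam):
    # L = (1/n') sum_i Lc(y_i, prediction_i) + lambda * Var_q
    # Lc here: cross-entropy on predicted class probabilities.
    n = len(y_true)
    ce = -np.mean(np.log(probs[np.arange(n), y_true] + 1e-12))
    return ce + lam * variance

y = np.array([0, 1])                       # ground-truth labels
p = np.array([[0.9, 0.1], [0.2, 0.8]])     # predicted probabilities
base = hybrid_loss(y, p, variance=0.05, lam=0.0)  # classification loss only
reg = hybrid_loss(y, p, variance=0.05, lam=1.0)   # with variance penalty
```

Setting λ = 0 recovers the plain classification loss; a positive λ trades some classification accuracy for a sampler whose estimates have lower variance.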
ASGCN also proposes a skip-connection method to capture information across distant
nodes. As shown in Fig. 6.9(c), the nodes in the (l-1)-th layer theoretically
preserve the second-order proximity (Tang et al, 2015b), i.e., they are the 2-hop
neighbors of the nodes in the (l+1)-th layer. By adding a skip connection between
the (l-1)-th layer and the (l+1)-th layer, the sampled nodes include both 1-hop and
2-hop neighbors, which captures the information between distant nodes and
facilitates the model training.
In summary, by introducing the adaptive sampling strategy, ASGCN gains better
performance as well as better variance control. However, it also introduces an
additional dependency during sampling. FastGCN, for example, can perform parallel
sampling to accelerate the sampling process, since each layer is sampled
independently; in ASGCN, the sampling of one layer depends on the upper layer, so
parallel processing is not applicable.
6.3.3 Graph-wise Sampling

6.3.3.1 Cluster-GCN
Cluster-GCN (Chiang et al, 2019) first proposes to extract small graph clusters
based on efficient graph clustering algorithms. The intuition is that the mini-batch
algorithm is correlated with the number of links between nodes in one batch. Hence,
Cluster-GCN constructs mini-batches at the sub-graph level, while previous studies
usually construct mini-batches based on nodes.
Cluster-GCN extracts small clusters as follows. A graph G(V, E) can be divided into
c portions by grouping its nodes, V = [V_1, · · · , V_c]. The extracted sub-graphs
can be defined as Ḡ = [G_1, · · · , G_c], where G_t = (V_t, E_t) represents the nodes
and the links within the t-th portion, t ∈ {1, . . . , c}. The re-ordered adjacency
matrix can then be written as

A = Ā + Δ,

where Ā = diag(A_{11}, . . . , A_{cc}) is the block-diagonal part of A containing
the within-cluster links, and Δ collects the off-diagonal blocks A_{st} (s ≠ t),
i.e., the between-cluster links.
Different graph clustering algorithms can be used to partition the graph by enabling
more links between nodes within the cluster. The motivation of considering sub-
graph as a batch also follows the nature of graphs, which is that neighbors usually
stay closely with each other.
Obviously, this strategy avoids the neighborhood expansion problem since it only
samples the nodes within the clusters, as shown in Fig. 6.11. Because there is no
connection between the sub-graphs, the nodes in other sub-graphs will not be sampled
as the layer number increases. In this manner, the sampling process controls
neighborhood expansion by sampling over sub-graphs, whereas in layer-wise sampling
the control is implemented by fixing the neighbor sampling size.
However, there still remain two concerns with the vanilla Cluster-GCN. The first is
that the links between sub-graphs are discarded, which may fail to capture important
correlations. The second is that the clustering algorithm may change the original
distribution of the dataset and introduce some bias. To address these concerns, the
authors propose a stochastic multiple-partitions scheme that randomly combines
clusters into a batch. Specifically, the graph is first clustered into p sub-graphs;
then, in each training epoch, a new batch is formed by randomly combining q clusters
(q < p), and the interactions between these clusters are included as well. Fig. 6.12
visualizes an example where q equals 2: the new batch is formed by 2 random
clusters, along with the retained connections between the clusters.
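The batch construction step can be sketched as follows. The function name, the toy complete graph, and the pre-computed cluster assignment are illustrative assumptions (Cluster-GCN itself obtains the clusters with a graph clustering algorithm such as METIS):

```python
import numpy as np

def stochastic_partition_batch(A, clusters, q, rng):
    # Randomly combine q of the p pre-computed clusters into one batch and
    # keep the between-cluster edges among the chosen clusters.
    chosen = rng.choice(len(clusters), size=q, replace=False)
    nodes = np.concatenate([clusters[i] for i in chosen])
    return nodes, A[np.ix_(nodes, nodes)]  # batch sub-adjacency

# Toy graph: 6 nodes, complete graph without self-loops.
A = (np.arange(6)[:, None] != np.arange(6)[None, :]).astype(float)
clusters = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]  # p = 3
rng = np.random.default_rng(0)
nodes, A_batch = stochastic_partition_batch(A, clusters, q=2, rng=rng)
# A_batch is 4x4 and includes the edges *between* the two chosen clusters
```

Slicing the original A (rather than the block-diagonal Ā) is what restores the between-cluster links that the vanilla scheme would discard.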
6.3.3.2 GraphSAINT
Instead of using clustering algorithms to generate the sub-graphs, which may bring
in certain bias or noise, GraphSAINT (Zeng et al, 2020a) proposes to directly sample
a sub-graph for mini-batch training according to a sub-graph sampler, and then
employ a full GCN on the sub-graph to generate the node embeddings and
back-propagate the loss for each node. As shown in Fig. 6.13, the sub-graph G_s is
constructed from the original graph G with nodes 0, 1, 2, 3, 4, and 7 included.
Next, a full GCN is applied on these 6 nodes along with the corresponding
connections.
Fig. 6.13: An illustration of GraphSAINT training algorithm. The yellow circle in-
dicates the sampled node.
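One concrete sub-graph sampler analyzed in the GraphSAINT paper is an edge sampler that picks edge (u, v) with probability proportional to 1/deg(u) + 1/deg(v). The sketch below assumes that variant; the function name, the toy graph, and sampling a fixed budget of edges without replacement are illustrative simplifications:

```python
import numpy as np

def edge_sample_subgraph(edges, degrees, budget, rng):
    # Sample edges with probability proportional to 1/deg(u) + 1/deg(v),
    # then take the union of their endpoints as the sub-graph node set.
    p = np.array([1.0 / degrees[u] + 1.0 / degrees[v] for u, v in edges])
    p = p / p.sum()
    idx = rng.choice(len(edges), size=budget, replace=False, p=p)
    nodes = sorted({n for i in idx for n in edges[i]})
    return nodes, [edges[i] for i in idx]

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
degrees = {0: 2, 1: 2, 2: 3, 3: 1}
rng = np.random.default_rng(0)
nodes, sub_edges = edge_sample_subgraph(edges, degrees, budget=2, rng=rng)
```

A full GCN is then trained on the induced sub-graph; GraphSAINT additionally applies normalization coefficients derived from these sampling probabilities to keep the mini-batch estimator unbiased.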
| Category | Method | Sampler | Neighborhood expansion solved | Target problem | Key features |
|---|---|---|---|---|---|
| Node-wise sampling | GraphSAGE (Hamilton et al, 2017b) | Random | × | Inductive learning | Mini-batch training; reduce neighborhood expansion. |
| Node-wise sampling | VR-GCN (Chen et al, 2018d) | Random | ✓ | Neighborhood expansion | Historical activations. |
| Layer-wise sampling | FastGCN (Chen et al, 2018c) | Importance | ✓ | Neighborhood expansion | Integral transform view. |
| Layer-wise sampling | ASGCN (Huang et al, 2018) | Importance | ✓ | Between-layer correlation | Explicit variance reduction; skip connection. |
| Graph-wise sampling | Cluster-GCN (Chiang et al, 2019) | Random | ✓ | Graph batching | Mini-batch on sub-graph. |
| Graph-wise sampling | GraphSAINT (Zeng et al, 2020a) | Edge probability | ✓ | Neighborhood expansion | Variance and bias control. |
PinSage (Ying et al, 2018b) is one of the early successful applications of
large-scale GNNs to item-item recommendation systems; it is deployed at Pinterest¹.
Pinterest is a social media application for sharing and discovering various content:
users mark content of interest with pins and organize the pins on boards, and when
users browse the website, Pinterest recommends potentially interesting content to
them. As of 2018, the Pinterest graph contained 2 billion pins, 1 billion boards,
and over 18 billion edges between pins and boards.
In order to scale the training to such a large graph, Ying et al (2018b) proposes
PinSage, a random-walk-based GCN, to implement node-wise sampling on the Pinterest
graph. Specifically, short random walks are used to select a fixed-size neighborhood
of the target node. Fig. 6.15 demonstrates the overall architecture of PinSage. Take
node A as an example: a depth-2 convolution is constructed to generate the node
embedding h_A^{(2)}. The embedding vector h_{N(A)}^{(1)} of node A's neighbors is
aggregated from nodes B, C, and D, and a similar process is applied to obtain the
1-hop neighbors' embeddings h_B^{(1)}, h_C^{(1)}, and h_D^{(1)}. An illustration of
all participating nodes for each node from the input graph is shown at the bottom of
Fig. 6.15. In addition, an L1 normalization is computed to sort the neighbors by
their importance (Eksombatchai et al, 2018), and a curriculum training strategy is
used to further improve the prediction performance by feeding harder and harder
examples.
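The random-walk neighborhood selection can be sketched as follows. This is a simplified illustration, not PinSage's production code: the function name is hypothetical, and visit counts from short walks stand in for the importance scores used to rank neighbors:

```python
import numpy as np

def random_walk_neighborhood(adj_list, target, num_walks, walk_len, top_t, rng):
    # Run short random walks from the target node and keep the top-T
    # most-visited nodes as its importance-weighted neighborhood.
    visits = {}
    for _ in range(num_walks):
        v = target
        for _ in range(walk_len):
            v = rng.choice(adj_list[v])
            if v != target:
                visits[v] = visits.get(v, 0) + 1
    ranked = sorted(visits, key=visits.get, reverse=True)
    return ranked[:top_t]

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = np.random.default_rng(0)
nbrs = random_walk_neighborhood(adj, target=0, num_walks=20,
                                walk_len=3, top_t=2, rng=rng)
```

Unlike plain k-hop sampling, this neighborhood is not restricted to direct neighbors, and the visit counts provide the importance weights used during aggregation.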
A series of comprehensive experiments conducted on Pinterest data, including offline
experiments, production A/B tests, and user studies, demonstrates the effectiveness
of the proposed method. Moreover, with the adoption of a highly efficient MapReduce
inference pipeline, the entire inference process on the whole graph can be finished
within one day.
1 https://www.pinterest.com/
Fig. 6.15: Overview of PinSage architecture. Colored nodes are applied to illustrate
the construction of graph convolutions.
embeddings h_{N(v)}^{(l)} and the target node itself h_v^{(l)}. Such an operation is
able to capture two types of information: the interactions between the target node
and its neighborhood, and the interactions between different dimensions of the
embedding space. However, in user-item networks, learning the information between
different feature dimensions may be less informative and unnecessary. Therefore,
IntentNet designs a vector-wise convolution operation as follows:
g_v^{(l)}(i) = σ(W_v^{(l)}(i, 1) · h_v^{(l)} + W_v^{(l)}(i, 2) · h_{N(v)}^{(l)}),
h_v^{(l+1)} = σ(Σ_{i=1}^{L} θ_i · g_v^{(l)}(i)),

where W_v^{(l)}(i, 1) and W_v^{(l)}(i, 2) are the associated weights for the i-th
local filter, and g_v^{(l)}(i) represents the operation that learns the interactions
between the target node and its neighbor nodes in a vector-wise manner. Another
vector-wise layer is
applied to gather the final embedding vector of the target node for the next convolu-
tional layer. Moreover, the output vector of the last convolutional layer is fed into a
three-layer fully-connected network to further learn the node-level combinatory fea-
tures. Such an operation significantly promotes the training efficiency and reduces
the time complexity.
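The vector-wise convolution can be sketched as follows. The function name and toy tensors are illustrative assumptions, as is the reading of W_v^{(l)}(i, 1) and W_v^{(l)}(i, 2) as scalar coefficients per local filter (which is what makes the operation vector-wise: each filter mixes the two vectors without mixing feature dimensions); tanh stands in for σ:

```python
import numpy as np

def vector_wise_conv(h_v, h_nbr, w1, w2, theta):
    # g_v(i) = sigma(w1[i] * h_v + w2[i] * h_nbr)   -- L local filters, each a
    # scalar-weighted mix of the two d-dim vectors (no cross-dimension terms).
    # h_v'  = sigma(sum_i theta_i * g_v(i))
    sigma = np.tanh
    g = sigma(w1[:, None] * h_v[None, :] + w2[:, None] * h_nbr[None, :])  # (L, d)
    return sigma((theta[:, None] * g).sum(axis=0))                        # (d,)

rng = np.random.default_rng(0)
d, L = 4, 3
h_v, h_nbr = rng.normal(size=d), rng.normal(size=d)
w1, w2, theta = rng.normal(size=L), rng.normal(size=L), rng.normal(size=L)
h_next = vector_wise_conv(h_v, h_nbr, w1, w2, theta)
```

Because each filter uses two scalars instead of a d×d matrix, the per-node cost drops from O(d^2) to O(Ld), which is the source of the efficiency gain described above.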
Extensive experiments are conducted on Taobao and Amazon datasets, which contain
millions to billions of users and items. IntentGC outperforms the baseline methods
and reduces the training time by about two days compared with GraphSAGE.
Overall, in recent years, the scalability of GNNs has been extensively studied, and
fruitful results have been achieved. Fig. 6.18 summarizes the development towards
large-scale GNNs.
Editor's Notes: For graphs of large scale or with rapid expansibility, such
as dynamic graphs (chapter 15) and heterogeneous graphs (chapter 16), the
scalability of GNNs is of vital importance in determining whether an algorithm
is superior in practice. For example, graph sampling strategies are especially
necessary to ensure computational efficiency in industrial scenarios, such as
recommender systems (chapter 19) and urban intelligence (chapter 27). With the
increasing complexity and scale of real problems, the limitation in scalability
has been considered almost everywhere in the study of GNNs. Researchers devoted
to graph embedding (chapter 2), graph structure learning (chapter 14), and
self-supervised learning (chapter 18) have put forward remarkable works to
overcome it.
Chapter 7
Interpretability in Graph Neural Networks
Deep learning has become an indispensable tool for a wide range of applications
such as image processing, natural language processing, and speech recognition. De-
spite the success, deep models have been criticized as “black boxes” due to their
complexity in processing information and making decisions. In this section, we in-
troduce the research background of interpretability in deep models, including the
Ninghao Liu
Department of CSE, Texas A&M University, e-mail: [email protected]
Qizhang Feng
Department of CSE, Texas A&M University, e-mail: [email protected]
Xia Hu
Department of CSE, Texas A&M University, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 121
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_7
Fig. 7.1: Left: Interpretation could benefit user experiences in interaction with
models. Right: Through interpretation, we could identify model behaviors that are
not desirable according to humans, and work on improving the model accord-
ingly (Ribeiro et al, 2016).
There are several pragmatic reasons that motivate people to study and improve
model interpretability. Depending on who finally benefits from interpretation, we
divide the reasons into model-oriented and user-oriented, as shown in Fig. 7.1.
are based on sensitive features that are required to be avoided in real applica-
tions.
3. Adversarial-Attack Robustness: Adversarial attack refers to adding carefully-
crafted perturbations to input, where the perturbations are almost imperceptible
to humans, but can cause the model to make wrong predictions (Goodfellow
et al, 2015). Robustness against adversarial attacks is an increasingly impor-
tant topic in machine learning security. Recent studies have shown how inter-
pretation could help in discovering new attack schemes and designing defense
strategies (Liu et al, 2020d).
4. Backdoor-Attack Robustness: Backdoor attack refers to injecting malicious
functionality into a model, by either implanting additional modules or poison-
ing training data. The model will behave normally unless it is fed with input
containing patterns that trigger the malicious functionality. Studying model ro-
bustness against backdoor attacks is attracting more interest recently. Recent
research discovers that interpretation could be applied in identifying if a model
has been infected by backdoors (Huang et al, 2019c; Tang et al, 2020a).
Post-hoc interpretation has received a lot of interest in both research and real
applications. Flexibility is one of the advantages of post-hoc interpretation, as it puts
fewer requirements on the model types or structures. In the following paragraphs, we
briefly introduce several commonly used methods. The illustration of the basic idea
behind each of these methods is shown in Fig. 7.2.
The first type of methods to be introduced is approximation-based methods.
Given a function f that is complex to understand and an input instance x∗ ∈ Rm , we
could approximate f with a simple and understandable surrogate function h (usually
chosen as a linear function) locally around x∗ . Here m is the number of features in
each instance. There are several ways to build h. A straightforward way is based on
the first-order Taylor expansion:

h(x) = f(x∗) + w^⊤ (x − x∗),   (7.1)

where w ∈ R^m tells how sensitive the output is to the input features. Typically, w
can be estimated with the gradient (Simonyan et al, 2013), so that w = ∇x f (x∗ ).
When gradient information is not available, such as in tree-based models, we could
build h through training (Ribeiro et al, 2016). The general idea is that a number of
training instances (xi , f (xi )), 1 ≤ i ≤ n are sampled around x∗ , i.e., ∥xi − x∗ ∥ ≤ ε.
The instances are then used to train h, so that h approximates f around x∗ .
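The training-based surrogate idea above can be sketched in a few lines of NumPy. The black-box function f and all numeric values below are hypothetical stand-ins, not from this chapter; a minimal sketch, not a full LIME implementation.

```python
import numpy as np

# Hypothetical black-box model f: R^2 -> R, standing in for a complex model.
def f(x):
    return np.tanh(3.0 * x[..., 0]) + 0.1 * x[..., 1] ** 2

rng = np.random.default_rng(0)
x_star = np.array([0.2, 1.0])   # instance x* to be explained
eps = 0.05

# Sample instances x_i around x*, i.e. ||x_i - x*|| <= eps.
X = x_star + eps * rng.uniform(-1.0, 1.0, size=(200, 2))
y = f(X)

# Fit the surrogate h(x) = w^T (x - x*) + b by least squares on (x_i, f(x_i)).
A = np.hstack([X - x_star, np.ones((200, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:2], coef[2]
# w approximates the local gradient of f at x*, and b approximates f(x*).
```

Because the samples are confined to a small neighborhood, the recovered w is close to the true local gradient even though f itself is nonlinear.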
Besides directly studying the sensitivity between input and output, there is an-
other type of method called layer-wise relevance propagation (LRP) (Bach et al,
2015). Specifically, LRP redistributes the activation score of output neuron to its
predecessor neurons, which iterates until reaching the input neurons. The redistri-
bution of scores is based on the connection weights between neurons in adjacent
layers. The share received by each input neuron is used as its contribution to the
output.
Another way to understand the importance of a feature xi is to answer questions
like “What would have happened to f , had xi not existed in input?”. If xi is important
for predicting f (x), then removing/weakening it will cause a significant drop in
prediction confidence. This type of method is called the perturbation method (Fong
and Vedaldi, 2017). One of the key challenges in designing perturbation methods is
how to guarantee the input after perturbation is still valid. For example, it is argued
that perturbation on word embedding vectors cannot explain deep language models,
because texts are discrete symbols, and it is hard to identify the meaning of perturbed
embeddings.
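The "what would have happened had x_i not existed" question can be made concrete with a simple occlusion loop. The toy linear model and the zero baseline below are illustrative assumptions; choosing a valid baseline is exactly the challenge noted above.

```python
import numpy as np

# Toy model: depends strongly on feature 0, not at all on feature 1,
# weakly on feature 2 (a hypothetical stand-in for a trained model).
def f(x):
    return 2.0 * x[0] + 0.1 * x[2]

x = np.array([1.0, 1.0, 1.0])
baseline = 0.0  # value that "removes" a feature -- a modeling assumption

# Importance of feature i = drop in prediction when x_i is set to the baseline.
importance = []
for i in range(len(x)):
    x_pert = x.copy()
    x_pert[i] = baseline
    importance.append(f(x) - f(x_pert))
# -> [2.0, 0.0, 0.1]: the drop mirrors each feature's actual contribution
```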
Different from the previous methods that focus on explaining prediction results,
there is another type of method that tries to understand how data is represented in-
side a model. We call it representation interpretation. There is no unified definition
for representation interpretation. The design of methods under this category is usu-
ally motivated by the nature of the problem or the properties of data. For example,
in natural language processing, it has been shown that a word embedding could be
understood as the composition of a number of basis word embeddings, where the
basis words constitute a dictionary (Mathew et al, 2020).
Besides understanding predictions and data representations, another interpreta-
tion scheme is to understand the role of model components. A well-known example
is to visualize the visual patterns that maximally activate the target neuron/layer in
a CNN model (Olah et al, 2018). In this way, we understand what kind of visual
signal is detected by the target component. The interpretation is usually obtained
through a generative process, so that the result is understandable to humans.
of the complex model. The pool of interpretable models includes linear models,
decision trees, rule-based models, etc. This strategy is also called mimic learning.
An interpretable model trained in this way tends to perform better than the same
model trained directly on the data, and it is also much easier to understand than the complex model.
Attention models, originally introduced for machine translation tasks, have now
become enormously popular, partially due to their interpretation properties. The in-
tuition behind attention models can be explained using human biological systems,
where we tend to selectively focus on some parts of the input, while ignoring other
irrelevant parts (Xu et al, 2015). By examining attention scores, we could know
which features in the input have been used for making the prediction. This is also
similar to using post-hoc interpretation algorithms that find which input features are
important. The major difference is that attention scores are generated during model
prediction, while post-hoc interpretation is performed after prediction.
Deep models heavily rely on learning effective representations to compress in-
formation for downstream tasks. However, it is hard for humans to understand the
representations as the meanings of different dimensions are unknown. To tackle this
challenge, disentangled representation learning has been proposed. Disentangled
representation learning breaks down features of different meanings and encodes
them as separate dimensions in representations. As a result, we could check each
dimension to understand which factors of input data are encoded. For example, af-
ter learning disentangled representations on 3D-chair images, factors such as chair
leg style, width and azimuth, are separately encoded into different dimensions (Hig-
gins et al, 2017).
Despite the major progress made in domains such as vision, language and control,
many defining characteristics of human intelligence remain out of reach for tradi-
tional deep models such as convolutional neural networks (CNNs), recurrent neural
networks (RNNs) and multi-layer perceptrons (MLPs). To look for new model ar-
chitectures, people believe that GNN architectures could lay the foundation for more
interpretable patterns of reasoning (Battaglia et al, 2018). In this part, we discuss the
advantages of GNNs and challenges to be tackled in terms of interpretability.
The GNN architecture is regarded as more interpretable because it facilitates
learning about entities, relations, and rules for composing them. First, entities are
discrete and usually represent high-level concepts or knowledge items, so they are
regarded as easier for humans to understand than image pixels (tiny granularity) or
word embeddings (latent space vectors). Second, GNN inference propagates infor-
mation through links, so it is easier to find the explicit reasoning path or subgraph
that contributes to the prediction result. Therefore, there is a recent trend of trans-
forming images or text data into graphs, and then applying GNN models for predic-
tions. For example, to build a graph from an image, we can treat objects inside the
image (or different portions within an object) as nodes, and generate links based on
the spatial relations between nodes. Similarly, a document can be transformed into a
graph by discovering concepts (e.g., nouns, named entities) as nodes and extracting
their relations as links through lexical parsing.
Although the graph data format lays a foundation for interpretable modeling,
there are still several challenges that undermine GNN interpretability. First, GNN
still maps nodes and links into embeddings. Therefore, similar to traditional deep
models, GNN also suffers from the opacity of information processing in intermedi-
ate layers. Second, different information propagation paths or subgraphs contribute
differently to the final prediction. GNN does not directly provide the most impor-
tant reasoning paths for its prediction, so post-hoc interpretation methods are still
needed. In the following sections, we will introduce the recent advances in tackling
the above challenges to improve the explainability and interpretability of GNNs.
7.2.1 Background
Before introducing the techniques, we first provide the definition of graphs and re-
view the fundamental formulations of a GNN model.
Graphs: In the rest of the chapter, if not specified, the graphs we discuss are
limited to homogeneous graphs.
Definition 7.3. A homogeneous graph is defined as G = (V , E ), where V is the set
of nodes and E is the set of edges between nodes.
Furthermore, let A ∈ Rn×n be the adjacency matrix of G , where n = |V |. For un-
weighted graphs, Ai, j is binary, where Ai, j = 1 means there exists an edge (i, j) ∈ E ,
otherwise Ai, j = 0. For weighted graphs, each edge (i, j) is assigned a weight wi, j ,
so Ai, j = wi, j . In some cases, nodes are associated with features, which could be
denoted as X ∈ Rn×m , and Xi,: is the feature vector of node i. The number of fea-
tures for each node is m. In this chapter, unless otherwise stated, we focus on GNN
models on homogeneous graphs.
GNN Fundamentals: Traditional GNNs propagate information via the input
graph’s structure according to the propagation scheme:
H^{l+1} = σ( D̃^{−1/2} Ã D̃^{−1/2} H^l W^l ),   (7.2)
7 Interpretability in Graph Neural Networks 129
Fig. 7.3: Illustration of explanation result formats. Explanation results for graph
neural networks could be the important nodes, the important edges, the important
features, etc. An explanation method may return multiple types of results.
where H l denotes the embedding matrix at layer l, and W l denotes the trainable
parameters at layer l. Also, Ã = A + I denotes the adjacency matrix of the graph
after adding the self-loop. The matrix D̃ is the diagonal degree matrix of Ã, i.e.,
D̃_{i,i} = ∑_j Ã_{i,j}. Therefore, D̃^{−1/2} Ã D̃^{−1/2} normalizes the adjacency matrix. If we only
focus on the embedding update of node i, the GCN propagation scheme could be
rewritten as:
H^{l+1}_{i,:} = σ( ∑_{j∈V_i∪{i}} (1/c_{i,j}) H^l_{j,:} W^l ),   (7.3)
where H^l_{j,:} denotes the j-th row of matrix H^l, and V_i denotes the neighbors of node
i. Here c_{i,j} is a normalization constant, with 1/c_{i,j} = (D̃^{−1/2} Ã D̃^{−1/2})_{i,j}. Therefore, the
embedding of node i at layer l+1 can be seen as aggregating the embeddings of
node i's neighbors, followed by some transformations. The embeddings
in the first layer, H^0, are usually set as the node features. As the layer goes
deeper, the computation of each node’s embedding will include further nodes. For
example, in a 2-layer GNN, computing the embedding of node i will use the infor-
mation of nodes within the 2-hop neighborhood of node i. The subgraph composed
by these nodes is called the computation graph of node i, as shown in Fig. 7.3.
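The propagation scheme of Equations (7.2) and (7.3) can be sketched directly in NumPy. The graph, features, and weights below are toy values, and tanh stands in for a generic nonlinearity σ.

```python
import numpy as np

def gcn_layer(A, H, W, act=np.tanh):
    """One propagation step of Eq. (7.2): H' = act(D^-1/2 (A+I) D^-1/2 H W)."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                    # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return act(A_hat @ H @ W)

# Toy 4-node path graph with 2-dimensional node features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H0 = np.array([[1., 0.], [0., 1.], [1., 1.], [0., 0.]])
W = 0.5 * np.ones((2, 2))
H1 = gcn_layer(A, H0, W)   # embeddings after one layer, shape (4, 2)
```

Stacking `gcn_layer` twice reproduces the 2-hop computation graph discussed above: each node's output then depends on all nodes within two hops.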
Target Models: There are two common tasks in graph analysis, i.e., graph-level
predictions and node-level predictions. We use classification tasks as the example. In
graph-level tasks, the model f (G ) ∈ RC produces a single prediction for the whole
graph, where C is the number of classes. The prediction score for class c could
be written as f c (G ). In node-level tasks, the model f (G ) ∈ Rn×C returns a matrix,
where each row is the prediction for a node. Some explanation methods are designed
solely for graph-level tasks, some are for node-level tasks, while some could handle
both scenarios. The computation graphs introduced above are commonly used in
explaining node-level predictions.
Fig. 7.4: Illustration of gradient-based explanation methods: Raw Gradient (SA), Grad ⊙ Input, SmoothGrad, and Integrated Gradients (IG).
The approximation-based explanation has been widely used to analyze the predic-
tion of models with complex structures. Approximation-based approaches could be
further divided into white-box approximation and black-box approximation. The
white-box approximation uses information inside the model, which includes but is
not limited to gradients, intermediate features, model parameters, etc. The black-box
approximation does not utilize information propagation inside the model. It usually
uses a simple and interpretable model to fit the target model’s decision on an input
instance. Then, the explanation can be easily extracted from the simple model. The
details of commonly used methods for both categories are introduced as below.
Sensitivity Analysis (SA) (Baldassarre and Azizpour, 2019) studies the impact of a
particular change in an independent variable on a dependent variable. In the context
of explanation, the dependent variable refers to the prediction, while the independent
variables refer to the features. The local gradient of the model is commonly used as
sensitivity scores to represent the correlation between the feature and the prediction
result. Building on the raw gradient, the Grad ⊙ Input method further multiplies the
gradient elementwise with the input:

S(x) = ∇_x^⊤ f(G) ⊙ x.   (7.5)
Therefore, Grad ⊙ Input considers not only the feature sensitivity, but also the scale
of feature values. However, the methods mentioned above all suffered from the sat-
uration problem, where the scope of the local gradients is too limited to reflect the
overall contribution of each feature.
Integrated Gradients (IG) (Sanchez-Lengeling et al, 2020) solves the saturation
problem by aggregating feature contributions along a designed path in input space.
This path starts from a chosen baseline point G ′ and ends at the target input G .
Specifically, the feature contribution is computed as:
S(x) = (x − x′) ⊙ ∫_{α=0}^{1} ∇_x f( G′ + α(G − G′) ) dα,   (7.6)
where x′ denotes a feature vector in the baseline point G ′ , while x is a feature vector
in the original input G . The choice of baseline G ′ is relatively flexible. A typical
strategy is to use a null graph as the baseline, which has the same topology but its
nodes use “unspecified” categorical features. This is motivated by the application of
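In practice, the path integral in Eq. (7.6) is approximated with a finite Riemann sum. The quadratic toy model and its hand-coded gradient below are illustrative assumptions standing in for a trained model and automatic differentiation.

```python
import numpy as np

def integrated_gradients(grad_f, x, x_base, m=100):
    """Approximate Eq. (7.6): (x - x') times the average gradient along the path."""
    alphas = (np.arange(m) + 0.5) / m          # midpoints of m steps in (0, 1)
    avg_grad = np.zeros_like(x)
    for a in alphas:
        avg_grad += grad_f(x_base + a * (x - x_base))
    return (x - x_base) * avg_grad / m

# Toy model f(x) = sum(x^2) with gradient 2x (hypothetical stand-ins).
grad_f = lambda z: 2.0 * z
x = np.array([1.0, 2.0])
x_base = np.zeros(2)   # the baseline point G'
attr = integrated_gradients(grad_f, x, x_base)
# Completeness: attributions sum to f(x) - f(x_base) = 5.0
```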
CAM treats each dimension of the final node embeddings (i.e., H^L_{:,k}) as a feature
map. A graph-level representation h is obtained by applying global average pooling
(GAP) over nodes, and the prediction for class c is:

f^c(G) = ∑_k w^c_k h_k,   (7.7)
where h_k denotes the k-th entry of h, and w^c_k is the GAP-layer weight of the k-th feature map
with respect to class c. Therefore, the contribution of node i to the prediction is:
S(i) = (1/n) ∑_k w^c_k H^L_{i,k}.   (7.8)
Although CAM is simple and efficient, it only works on models with certain struc-
tures, which greatly limits its application scenarios.
Grad-CAM (Pope et al, 2019) combines gradient information with feature maps
to relax the limitation of CAM. While CAM uses the GAP layer to estimate the
weight of each feature map, Grad-CAM employs the gradient of output with respect
to the feature maps to compute the weights, so that:
w^c_k = (1/n) ∑_{i=1}^{n} ∂ f^c(G) / ∂ H^L_{i,k},   (7.9)
S(i) = ReLU( ∑_k w^c_k H^L_{i,k} ).   (7.10)
The ReLU function forces the explanation to focus on the positive influence on the
class of interest. Grad-CAM is equivalent to CAM for GNNs with only one fully-
connected layer before output. Compared to CAM, Grad-CAM can be applied to
more GNN architectures, thus avoiding the trade-off between model explainability
and capacity.
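Equations (7.9) and (7.10) amount to a gradient-weighted sum of feature maps. The embeddings and gradients below are toy numbers standing in for quantities read out of a trained GNN.

```python
import numpy as np

# Final-layer node embeddings H^L (n = 3 nodes, K = 2 feature maps) and the
# gradients of the class score f^c with respect to them -- toy values.
H_L = np.array([[0.5, 1.0],
                [0.0, 2.0],
                [1.5, 0.5]])
grads = np.array([[1.0, -0.5],
                  [1.0, -0.5],
                  [1.0, -0.5]])

w_c = grads.mean(axis=0)               # Eq. (7.9): per-feature-map weights
S = np.maximum(0.0, H_L @ w_c)         # Eq. (7.10): ReLU keeps positive evidence
# -> only node 2 receives positive importance for class c
```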
mask entry is in [0, 1], so it is a soft mask. There are two loss terms for training the
mask: (1) f ′ (Gt ⊙ M) is close to f ′ (Gt ), (2) the mask M is sparse. The resultant mask
entry values indicate the importance score of edges in Gt , where a higher mask value
means the corresponding edge is more important.
PGM-Explainer applies probabilistic graphical models to explain GNNs. To
find the neighbor instances of the target, PGM-Explainer first randomly selects
nodes to be perturbed from computation graphs. Then, the selected nodes’ features
are set to the mean value among all nodes. After that, PGM-Explainer employs a
pair-wise dependence test to filter out unimportant samples, aiming at reducing the
computational complexity. Finally, a Bayesian network is introduced to fit the pre-
dictions of chosen samples. Therefore, the advantage of PGM-Explainer is that it
illustrates the dependency between features.
Relevance propagation redistributes the activation score of the output neuron to its predecessor
neurons, iterating until reaching the input neurons. The core of relevance
propagation methods is about defining a rule for the activation redistribution be-
tween neurons. Relevance propagation has been widely used to explain models in
domains such as computer vision and natural language processing. Recently, some
work has been proposed to explore the possibility of revising relevance propagation
method for GNNs. Some representative approaches include LRP (Layer-wise Rel-
evance Propagation) (Baldassarre and Azizpour, 2019; Schwarzenberg et al, 2019),
GNN-LRP (Schnake et al, 2020), ExcitationBP (Pope et al, 2019).
LRP is first proposed in (Bach et al, 2015) to calculate the contribution of indi-
vidual pixels to the prediction result for an image classifier. The core idea of LRP is
to use back propagation to recursively propagate the relevance scores of high-level
neurons to low-level neurons, up to the input-level feature neurons. The relevance
score of the output neuron is set as the prediction score. The relevance score that
a neuron receives is proportional to its activation value, which follows the intu-
ition that neurons with higher activation tend to contribute more to the prediction.
In (Baldassarre and Azizpour, 2019; Schwarzenberg et al, 2019), the propagation
rule is defined as below:
R^l_i = ∑_j ( z^+_{i,j} / ( ∑_k z^+_{k,j} + b^+_j + ε ) ) R^{l+1}_j,   z_{i,j} = x^l_i w_{i,j},   (7.12)
However, LRP explanations are limited to nodes and node features, where graph edges are excluded. The reason
is that the adjacency matrix is treated as part of the GNN model. Therefore, LRP
is unable to analyze topological information which nevertheless plays an important
role in graph data.
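One redistribution step of Eq. (7.12) can be written as follows. The two-layer toy setup is an illustrative assumption; the code implements the z+ rule shown above.

```python
import numpy as np

def lrp_step(x, W, b, R_next, eps=1e-6):
    """Redistribute relevance R_next of layer l+1 back to layer l (Eq. 7.12)."""
    z_plus = np.maximum(0.0, x[:, None] * W)          # z+_{i,j}
    denom = z_plus.sum(axis=0) + np.maximum(0.0, b) + eps
    return (z_plus / denom) @ R_next                  # R^l_i

x = np.array([1.0, 2.0])          # activations of layer l (toy)
W = np.array([[1.0, -1.0],
              [0.5,  1.0]])       # weights w_{i,j} (toy)
b = np.zeros(2)
R_next = np.array([1.0, 1.0])     # relevance of layer l+1 neurons
R = lrp_step(x, W, b, R_next)
# Relevance is (approximately) conserved: R.sum() ~= R_next.sum()
```

Applying `lrp_step` layer by layer, from the output back to the input, yields the per-feature contribution scores described above.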
ExcitationBP is a top-down attention model originally developed for CNNs
(Zhang et al, 2018d). It shares a similar idea as LRP. However, ExcitationBP defines
the relevance score as a probability distribution and uses a conditional probability
model to describe the relevance propagation rule.
P(a_j) = ∑_i P(a_j | a_i) P(a_i),   (7.13)
where a j is the j-th neuron in the lower layer and ai is the i-th parent neuron of
a j in the higher layer. When the propagation process passes through the activation
function, only non-negative weights are considered and negative weights are set to
zero. To extend ExcitationBP for graph data, new backward propagation schemes
are designed for the softmax classifier, the GAP (global average pooling) layer and
the graph convolutional operator.
GNN-LRP mitigates the weakness of traditional LRP by defining a new prop-
agation rule. Instead of using the adjacency matrix to obtain propagation paths,
GNN-LRP assigns the relevance score to a walk, which refers to a message flow
path in the graph. The relevance score is defined by the T -order Taylor expansion of
the model with respect to the incorporation operator (graph convolutional operator,
linear message function, etc.). The intuition is that the incorporation operator with
greater gradients has a greater influence on the final decision.
where G_S and X_S are the subgraph and its node features. Y is the predicted label
distribution, and its entropy H(Y) is a constant. To solve the optimization problem
above, the authors apply a soft mask M on the adjacency matrix:

min_M − ∑_{c=1}^{C} 1[y = c] log P_Φ( Y = y | G = A_c ⊙ σ(M), X = X_c ),   (7.15)
where Ψ denotes the trainable parameters of the MLP. zi and z j are the embedding
vector for node i and j, respectively. [·; ·] denotes concatenation. Similar to the GN-
NExplainer, the mask generator is trained by maximizing the mutual information
between the original prediction and the new prediction.
GraphMask (Schlichtkrull et al, 2021) also produces the explanation by estimat-
ing the influences of edges. Similar to PGExplainer, GraphMask learns an erasure
function that quantifies the importance of each edge. The erasure function is defined
as:
z^{(k)}_{u,v} = g_π( h^{(k)}_u, h^{(k)}_v, m^{(k)}_{u,v} ),   (7.17)
where h^{(k)}_u, h^{(k)}_v and m^{(k)}_{u,v} refer to the hidden embedding vectors of node u and node v, and
the message sent through the edge in graph convolution. π denotes the parameters
of function g. One difference between GraphMask and PGExplainer is that the for-
mer also takes the edge embedding as input. Another difference is that GraphMask
provides the importance estimation for every graph convolution layer, and k indi-
cates the layer that the embedding vectors belong to. Instead of directly erasing the
influences of unimportant edges, the authors then propose to replace the message
sent through unimportant edges as:
m̃^{(k)}_{u,v} = z^{(k)}_{u,v} · m^{(k)}_{u,v} + (1 − z^{(k)}_{u,v}) · b^{(k)},   (7.18)
where b(k) is trainable. The work shows that a large proportion of edges can be
dropped without deteriorating the model performance.
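The message replacement of Eq. (7.18) is a simple gate interpolating between the real message and a learned baseline. The vectors below are toy values; in GraphMask both z and b are learned.

```python
import numpy as np

def gated_message(z, m, b):
    """Eq. (7.18): keep a fraction z of message m, replace the rest by baseline b."""
    return z * m + (1.0 - z) * b

m = np.array([2.0, -1.0])    # message sent through an edge (toy)
b = np.array([0.1, 0.1])     # trainable baseline vector (toy)

kept = gated_message(1.0, m, b)      # z = 1: edge fully kept -> m
dropped = gated_message(0.0, m, b)   # z = 0: edge fully dropped -> b
half = gated_message(0.5, m, b)      # soft gate in between
```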
Causal Screening (Wang et al, 2021) is a model-agnostic post-hoc method that
identifies a subgraph of input as an explanation from the cause-effect standpoint.
Causal Screening uses the causal effect of a candidate subgraph as the search metric:

S(G_k) = MI( ŷ, do(G = G_k) ) − MI( ŷ, do(G = ∅) ),   (7.19)

where G_k is the candidate subgraph, k is the number of edges, and MI is the mutual
information. The interventions do(G = G_k) and do(G = ∅) mean the model
input receives treatment (feeding G_k into the model) and control (feeding ∅ into the
model), respectively. ŷ denotes the prediction when feeding the original graph into
the model. Causal Screening uses a greedy algorithm to search for the explanation.
Starting from an empty set, at each step, it adds one edge with the highest causal
effect into the candidate subgraph.
CF-GNNExplainer (Lucic et al, 2021) also proposes to generate counterfactual
explanations for GNNs. Different from previous methods that try to find a sparse
subgraph to preserve the correct prediction, CF-GNNExplainer proposes to find the
minimal number of edges to be removed such that the prediction changes. Similar to
GNNExplainer, CF-GNNExplainer employs the soft mask as well. Therefore, it also
suffers from the “introduced evidence” problem (Dabkowski and Gal, 2017), which
means that non-zero or non-one values may introduce unnecessary information or
noises, and thus influence the explanation result.
then take the learned features as input to predict the probability of a start point and
an endpoint. The endpoint and the edge between the two points are added to update
the intermediate graph as an action. Finally, it calculates the reward of the action, so
that we can train the generator via policy gradient algorithms. The reward consists
of two terms. The first term is the score of the intermediate graph after feeding it to
the target GNN model. The second one is a regularization term that guarantees the
validity of the intermediate graph. The above steps are executed repeatedly until the
number of action steps reaches the predefined upper limit. As a generative explana-
tion method, XGNN provides a holistic explanation for graph classification. There
could be more generative explanation methods for other graph analysis tasks to be
explored in the future.
where αi, j is the attention score, and Vi denotes the set of neighbors of node i. Also,
GAT uses a shared parameter matrix W independent of the layer depth. The attention
score is computed as:
α_{i,j} = softmax(e_{i,j}) = exp(e_{i,j}) / ∑_{k∈V_i∪{i}} exp(e_{i,k}),   (7.21)
Fig. 7.5: Left: An illustration of graph convolution with single head attentions by
node 1 on its neighborhood. Middle: The linear transformation with a shared param-
eter matrix. Right: The attention mechanism employed in (Veličković et al, 2018).
e_{i,j} = LeakyReLU( a^⊤ [ W h^l_i ∥ W h^l_j ] ),   (7.22)

where ∥ denotes vector concatenation. In general, the attention mechanism can also
be denoted as ei, j = attn(hil , hlj ). Therefore, the attention mechanism is a single-
layer neural network parameterized by a weight vector a. The attention score αi, j
shows the importance of node j to node i.
The above mechanism could also be extended with multi-head attention. Specif-
ically, K independent attention mechanisms are executed in parallel, and the results
are concatenated:
h^{l+1}_i = ∥_{k=1}^{K} σ( ∑_{j∈V_i∪{i}} α^k_{i,j} W^k h^l_j ),   (7.23)
where αi,k j is the normalized attention score in the k-th attention mechanism, and W k
is the corresponding parameter matrix.
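A single attention head over one node's neighborhood can be sketched as follows. The LeakyReLU scoring follows the common GAT formulation of Eqs. (7.21)-(7.22); all numeric values are toy assumptions.

```python
import numpy as np

def gat_head(h, W, a, i, neighbors):
    """Attention scores and updated embedding for node i (single head)."""
    cand = neighbors + [i]                    # attend over V_i and i itself
    Wh = h @ W
    e = np.array([np.concatenate([Wh[i], Wh[j]]) @ a for j in cand])
    e = np.where(e > 0, e, 0.2 * e)           # LeakyReLU
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()               # softmax over the neighborhood
    h_i = np.tanh(alpha @ Wh[cand])           # weighted aggregation
    return alpha, h_i

rng = np.random.default_rng(1)
h = rng.normal(size=(4, 3))    # 4 nodes, 3-dim embeddings (toy)
W = rng.normal(size=(3, 2))    # shared parameter matrix
a = rng.normal(size=4)         # attention vector for the concatenated pair
alpha, h0 = gat_head(h, W, a, i=0, neighbors=[1, 2])
```

For interpretation, `alpha` can be read off directly: the entry for neighbor j tells how much node j contributed to node 0's new embedding.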
Besides learning node embeddings, we could also apply attention mechanisms to
learn a low-dimensional embedding for the whole graph (Ling et al, 2021). Suppose
we are working on an information retrieval problem. Given a set of graphs {Gm },
1 ≤ m ≤ M, and a query q, we want to return the graphs that are most relevant to the
query. The embedding of each graph Gm with respect to q could be computed using
the attention mechanism. In the first step, we could apply normal GNN propagation
rules as introduced in Equation 7.2, to obtain the embeddings of nodes inside each
graph. Let q denote the embedding of the query, and hi,m denote the embedding of
node i in a graph Gm . The embedding of graph Gm with respect to the query can be
computed as:
h^q_{G_m} = (1/|G_m|) ∑_{i=1}^{|G_m|} α_{i,q} h_{i,m},   (7.24)
where αi,q = attn(hi,m , q) is the attention score, and attn() is a certain attention func-
tion. Finally, hqGm can be used to compute the similarity of Gm to the query in the
graph retrieval task.
A heterogeneous network is a network with multiple types of nodes, links, and even
attributes. The structural heterogeneity and rich semantic information bring chal-
lenges for designing graph neural networks to fuse information.
Definition 7.4. A heterogeneous graph is defined as G = (V , E , φ , ψ), where V is
the set of node objects and E is the set of edges. Each node v ∈ V is associated with
a node type φ (v), and each edge (i, j) ∈ E is associated with an edge type ψ((i, j)).
We introduce how the challenge in embedding could be tackled using Heteroge-
neous graph Attention Network (HAN) (Wang et al, 2019m). Different from tradi-
tional GNNs, information propagation on HAN is conducted based on meta-paths.
Definition 7.5. A meta-path Φ is defined as a path of the form v_{i_1} →^{r_1} v_{i_2} →^{r_2} · · · →^{r_{l−1}} v_{i_l}, abbreviated as v_{i_1} v_{i_2} · · · v_{i_l}, with a composite relation r_1 ◦ r_2 ◦ · · · ◦ r_{l−1}.
To learn the embedding of node i, we propagate the embeddings from its neighbors
within the meta-path. The set of neighbor nodes is denoted as Vi Φ . Considering
that different types of nodes have different feature spaces, a node embedding is first
projected into the same space: h′_j = M_{φ_j} h_j. Here M_{φ_j} is the transformation matrix for
node type φ_j. The attention mechanism in HAN is similar to GAT, except that we
need to consider the type of meta-path that is currently sampled. Specifically,
z_{i,Φ} = σ( ∑_{j∈V_i^Φ} α^Φ_{i,j} h′_j ),   (7.25)
Given a set of meta-paths {Φ1 , ..., ΦP }, we can obtain a group of node embeddings
denoted as {zi,Φ1 , ..., zi,ΦP }. To fuse embeddings across different meta-paths, an-
other attention algorithm is applied. The fused embedding is computed as:
z_i = ∑_{p=1}^{P} β_{Φ_p} z_{i,Φ_p},   (7.27)
Fig. 7.6: Using multiple embeddings to represent the interests of a user. Each em-
bedding segment corresponds to one aspect in data (Liu et al, 2019a).
Fig. 7.7: The high-level idea of learning the disentangled node embedding for a
target node by using clustering or dynamic routing.
where τ is a hyper-parameter that scales the cosine similarity. Then, the probability
of observing an edge (u,t) is
p(t | u, c_t) ∝ ∑_{k=1}^{K} c_{t,k} · similarity(h_t, h_{u,k}).   (7.30)
Besides the fundamental learning process introduced above, the variational autoen-
coder framework could also be applied to regularize the learning process (Ma et al,
2019c). The item embeddings and prototype embeddings are jointly updated until
convergence. The embedding of each user hu is determined by aggregating the em-
beddings of interacted items, where hu,k collects embeddings from items that also
belong to facet k. In the learning process, the cluster discovery, node-cluster assign-
ments, and embedding learning are jointly conducted.
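The soft assignment of an item to facet prototypes via scaled cosine similarity can be sketched as below. The prototypes, item embedding, and temperature value are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def facet_assignment(h_t, prototypes, tau=0.1):
    """Soft assignment c_t of item t over K facet prototypes, using cosine
    similarity scaled by a temperature tau (a sketch of the idea above)."""
    cos = prototypes @ h_t / (np.linalg.norm(prototypes, axis=1)
                              * np.linalg.norm(h_t))
    logits = cos / tau
    e = np.exp(logits - logits.max())
    return e / e.sum()

prototypes = np.array([[1.0, 0.0],    # facet 0 (toy)
                       [0.0, 1.0]])   # facet 1 (toy)
h_t = np.array([1.0, 0.1])            # item embedding close to facet 0
c_t = facet_assignment(h_t, prototypes)
# c_t concentrates almost all probability mass on facet 0
```

A small tau makes the assignment nearly hard; a large tau spreads the item over several facets.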
The idea of using dynamic routing for disentangled node representation learning is
motivated by the Capsule Network (Sabour et al, 2017). There are two layers of
capsules, i.e., low-level capsules and high-level capsules. Given a user u, the set of
items that he has interacted with is denoted as Vu . The set of low-level capsules
is {cli }, i ∈ Vu , so each capsule is the embedding of an interacted item. The set of
high-level capsules is {chk }, 1 ≤ k ≤ K, where chk represents the user’s k-th interest.
The routing logit value bi,k between low-level capsule i and high-level capsule k
is computed as:
b_{i,k} = (c^h_k)^⊤ S c^l_i,   (7.31)
where S is the bilinear mapping matrix. Then, the intermediate embedding for high-
level capsule k is computed as a weighted sum of low-level capsules:

z^h_k = ∑_{i∈V_u} w_{i,k} S c^l_i,   w_{i,k} = softmax(b_{i,k}),   (7.32)

so w_{i,k} can be seen as the attention weights connecting the two capsules. Finally, a
“squash” function is applied to obtain the embedding of high-level capsules:
c^h_k = squash(z^h_k) = ( ∥z^h_k∥² / (1 + ∥z^h_k∥²) ) · ( z^h_k / ∥z^h_k∥ ).   (7.33)
The above steps constitute one iteration of dynamic routing. The routing process is
usually repeated for several iterations to converge. When the routing finishes, the
high-level capsules can be used to represent the user u with multiple interests, to be
fed into subsequent network modules for inference (Li et al, 2019b), as shown in
Fig. 7.7.
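One routing iteration over Eqs. (7.31)-(7.33) can be sketched as follows. The softmax normalization axis and all numeric values are illustrative assumptions, not the exact formulation of the cited papers.

```python
import numpy as np

def squash(z):
    """Eq. (7.33): shrink the length of z into [0, 1) while keeping direction."""
    norm2 = float(z @ z)
    return (norm2 / (1.0 + norm2)) * (z / np.sqrt(norm2))

def routing_iteration(low_caps, S, high_caps):
    """One dynamic-routing step: logits (7.31), weighted sum, squash (7.33)."""
    b = high_caps @ S @ low_caps.T                 # b[k, i] = (c^h_k)^T S c^l_i
    w = np.exp(b - b.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)           # attention weights w_{i,k}
    z = w @ (low_caps @ S.T)                       # weighted sum of S c^l_i
    return np.array([squash(z_k) for z_k in z])

rng = np.random.default_rng(0)
items = rng.normal(size=(5, 4))        # 5 interacted-item capsules (toy)
S = np.eye(4)                          # bilinear mapping matrix (toy)
interests = rng.normal(size=(2, 4))    # K = 2 interest capsules, initialized
interests = routing_iteration(items, S, interests)
```

Repeating `routing_iteration` a few times lets each interest capsule specialize on a cluster of items, which is what makes the resulting user representation interpretable.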
In this section, we introduce the setting for evaluating GNN explanations. This in-
cludes the datasets that are commonly used for constructing and explaining GNNs,
as well as the metrics that evaluate different aspects of explanations.
As more approaches have been proposed for explaining GNNs, a variety of datasets
have been used to assess their effectiveness. As such a research direction is still
fidelity = (1/N) ∑_{i=1}^{N} ( f^{y_i}(G_i) − f^{y_i}(G_i \ G′_i) ),   (7.34)
where f is the output function of the target model, G_i is the i-th graph, G′_i is the
explanation for it, and G_i \ G′_i represents the perturbed i-th graph in which the
identified explanation is removed.
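Given model outputs on the original graphs and on the graphs with explanations removed, Eq. (7.34) is a short computation. The score matrices below are toy numbers standing in for real model outputs.

```python
import numpy as np

def fidelity(scores_full, scores_removed, labels):
    """Eq. (7.34): mean drop in the predicted-class score after removing
    each identified explanation. scores_* have shape (N, C)."""
    idx = np.arange(len(labels))
    drop = scores_full[idx, labels] - scores_removed[idx, labels]
    return float(drop.mean())

# Toy numbers: N = 2 graphs, C = 2 classes.
scores_full = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
scores_removed = np.array([[0.4, 0.6],
                           [0.3, 0.7]])
labels = np.array([0, 1])
fid = fidelity(scores_full, scores_removed, labels)   # (0.5 + 0.1) / 2 = 0.3
```

A larger fidelity means removing the explanation hurts the prediction more, i.e., the explanation captured what the model actually relied on.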
• Contrastivity (Pope et al, 2019) uses Hamming distance to measure the dif-
ferences between two explanations. These two explanations correspond to the
model’s prediction of one instance for different classes. It is assumed that mod-
els would highlight different features when making predictions for different
classes. The higher the contrastivity, the better the performance of the inter-
preter.
• Sparsity (Pope et al, 2019) is calculated as the ratio of explanation graph size
to input graph size. In some cases, explanations are encouraged to be sparse,
because a good explanation should include only the essential features as far as
possible and discard the irrelevant ones.
• Stability (Sanchez-Lengeling et al, 2020) measures the performance gap of the
interpreter before and after adding noise to the explanation. It suggests that a
good explanation should be robust to slight changes in the input that do not
affect the model’s prediction.
Interpretation on graph neural networks is an emerging domain. There are still many
challenges to be tackled. In this section, we list several future directions towards
improving the interpretability of graph neural networks.
First, some online applications require real-time responses from models and
algorithms, which places high demands on the efficiency of explanation methods.
However, many GNN explanation methods rely on sampling or highly iterative
algorithms to obtain their results, which is time-consuming. Therefore, one
future research direction is how to develop more efficient explanation algorithms
without significantly sacrificing explanation precision.
Second, although more and more methods have been developed for interpreting
GNN models, how to utilize interpretation towards identifying GNN model defects
and improving model properties is still rarely discussed in existing work. Will GNN
models be largely affected by adversarial attacks or backdoor attacks? Can interpre-
tation help us to tackle these issues? How to improve GNN models if they have been
found to be biased or untrustworthy?
Third, besides attention methods and disentangled representation learning, are
there other modeling or training paradigms that could also improve GNN inter-
pretability? In the interpretable machine learning domain, some researchers are in-
terested in providing causal relations between variables, while some others prefer
using logic rules for reasoning. Therefore, how to bring causality into GNN
learning, or how to incorporate logic reasoning into GNN inference, may be an
interesting direction to explore.
Fourth, most existing efforts on interpretable machine learning have been
devoted to obtaining more accurate interpretations, while the human experience
aspect is usually overlooked. For end-users, friendly interpretation can improve
the user experience and gain their trust in the system. For domain experts without a machine learning
background, an intuitive interface helps integrate them into the system improvement
loop. Therefore, another possible direction is how to incorporate human-computer
interaction (HCI) to show explanation in a more user-friendly format, or how to de-
sign better human-computer interfaces to facilitate user interactions with the model.
7 Interpretability in Graph Neural Networks 147
Acknowledgements The work is, in part, supported by NSF (#IIS-1900990, #IIS-1718840, #IIS-
1750074). The views and conclusions contained in this paper are those of the authors and should
not be interpreted as representing any funding agencies.
Editor’s Notes: Similar to the general trend in the machine learning do-
main, explainability has been ever more widely recognized as an important
metric for graph neural networks in addition to those well recognized be-
fore such as effectiveness (Chapter 4), complexity (Chapter 5), efficiency
(Chapter 6), and robustness (Chapter 8). Explainability can not only broadly
influence technique development (e.g., Chapters 9-18) by informing model
developers of useful model details, but also could benefit domain experts in
various application domains (e.g., Chapters 19-27) by providing them with
explanations of predictions.
Chapter 8
Graph Neural Networks: Adversarial
Robustness
Stephan Günnemann
Abstract Graph neural networks have achieved impressive results in various graph
learning tasks and they have found their way into many applications such as molec-
ular property prediction, cancer classification, fraud detection, or knowledge graph
reasoning. With the increasing number of GNN models deployed in scientific ap-
plications, safety-critical environments, or decision-making contexts involving hu-
mans, it is crucial to ensure their reliability. In this chapter, we provide an overview
of the current research on adversarial robustness of GNNs. We introduce the unique
challenges and opportunities that come along with the graph setting and give an
overview of works showing the limitations of classic GNNs via adversarial example
generation. Building upon these insights we introduce and categorize methods that
provide provable robustness guarantees for graph neural networks as well as prin-
ciples for improving robustness of GNNs. We conclude with a discussion of proper
evaluation practices taking robustness into account.
8.1 Motivation
The success story of graph neural networks is astonishing. Within a few years, they
have become a core component of many deep learning applications. Nowadays they
are used in scientific applications such as drug design or medical diagnoses, are
integrated in human-centered applications like fake news detection in social media,
get applied in decision-making tasks, and even are studied in safety-critical environ-
ments like autonomous driving. What unites these domains is their crucial need for
reliable results; misleading predictions are not only unfortunate but indeed might
lead to dramatic consequences – from false conclusions drawn in science to harm
for people. However, can we really trust the predictions resulting from graph neural
Stephan Günnemann
Department of Informatics, Technical University of Munich, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 149
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_8
150 Stephan Günnemann
networks? What happens when the underlying data is corrupted or even becomes
deliberately manipulated?
Indeed, the vulnerability of classic machine learning models to (deliberate) per-
turbations of the data is well known (Goodfellow et al, 2015): even only slight
changes of the input can lead to wrong predictions. Such instances, for humans
nearly indistinguishable from the original input yet wrongly classified, are also
known as adversarial examples. One of the most well-known and alarming exam-
ples is an image of a stop sign, which is classified as a speed limit sign by a neural
network with only very subtle changes to the input; though, for us as humans it still
clearly looks like a stop sign (Eykholt et al, 2018). Examples like these illustrate
how machine learning models can dramatically fail in the presence of adversarial
perturbations. Consequently, adopting machine learning for safety-critical or sci-
entific application domains is still problematic. To address this shortcoming, many
researchers have started to analyze the robustness of models in domains like images,
natural language, or speech. Only recently, however, GNNs have come into focus.
Here, the first work studying GNNs’ robustness (Zügner et al, 2018) investigated
one of the most prominent tasks, node-level classification, and demonstrated the
susceptibility of GNNs to adversarial perturbations as well (see Figure 8.1). Since
then, the field of adversarial robustness on graphs has been rapidly expanding, with
many works studying diverse tasks and models, and exploring ways to make GNNs
more robust.
To some degree it is surprising that graphs were not in the focus even earlier.
Corrupted data and adversaries are common in many domains where graphs are
analyzed, e.g., social media and e-commerce systems. Take for example a GNN-
based model for detecting fake news in a social network (Monti et al, 2019; Shu et al,
2020). Adversaries have a strong incentive to fool the system in order to avoid being
detected. Similarly, in credit scoring systems, fraudsters try to disguise themselves
by creating fake connections. Thus, robustness is an important concern for graph-
based learning.
It is important to highlight, though, that adversarial robustness is not only a topic
in light of security concerns, where intentional changes, potentially crafted by hu-
mans, are used to try to fool the predictions. Instead, adversarial robustness con-
siders worst-case scenarios in general. Especially in safety-critical or scientific ap-
8 Graph Neural Networks: Adversarial Robustness 151
In its general form, an attack can be phrased as the bilevel optimization problem

    max_{Ĝ ∈ Φ(G)} O_atk( f_θ∗(Ĝ) )   subject to   θ∗ = arg min_θ L_train( f_θ(Ĝ) )     (8.1)

Here Φ(G) denotes the set of all graphs we are treating as indistinguishable from the
given graph G at hand, and Ĝ denotes a specific perturbed graph from this set. For
example, Φ(G) could capture all graphs which differ from G in at most ten edges
or in a few node attributes. The attacker’s goal is to find a graph Ĝ that, when
passed through the GNN f_θ∗, maximizes a specific objective O_atk, e.g., increasing
the predicted probability of a certain class for a specific node. Importantly, in a
poisoning setting, the weights θ∗ of the GNN are not fixed but learned based on
the perturbed data, leading to the inner optimization problem that corresponds to
the usual training procedure on the (now perturbed) graph. That is, θ∗ is obtained
1 Again it is worth highlighting that such ‘attacks’ are not always due to human adversaries. Thus,
the terms ‘change’ or ‘perturbation’ might be better suited and have a more neutral connotation.
by minimizing some training loss Ltrain on the graph Gˆ. This nested optimization
makes the problem specifically hard.
To define an evasion attack, the above equation can simply be changed by assum-
ing the parameter θ ∗ to be fixed. Often it is assumed to be given by minimizing the
training loss w.r.t. the given graph G (i.e. θ ∗ = arg minθ Ltrain ( fθ (G ))). This makes
the above scenario a single-level optimization problem.
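As an illustration of such a single-level (evasion) attack, the following sketch greedily flips edges incident to a target node to shrink its classification margin on a linearized one-layer GCN. This is a toy stand-in, not Nettack or any method from the literature; `gcn_logits`, the greedy loop, and all parameter names are assumptions for illustration.

```python
import numpy as np

def gcn_logits(A, X, W):
    """Linearized one-layer GCN: logits = D^{-1/2} (A + I) D^{-1/2} X W."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return A_norm @ X @ W

def greedy_evasion_attack(A, X, W, target, true_class, budget):
    """Evasion attack: the weights W stay fixed; greedily flip the single
    incident edge that most reduces the target node's classification
    margin, repeating up to `budget` times."""
    A = A.copy()
    n = A.shape[0]
    for _ in range(budget):
        best = None
        for u in range(n):
            if u == target:                      # no extra self-loops
                continue
            A[target, u] = A[u, target] = 1 - A[target, u]   # tentative flip
            logits = gcn_logits(A, X, W)[target]
            margin = logits[true_class] - np.delete(logits, true_class).max()
            if best is None or margin < best[0]:
                best = (margin, u)
            A[target, u] = A[u, target] = 1 - A[target, u]   # undo
        u = best[1]
        A[target, u] = A[u, target] = 1 - A[target, u]       # commit best flip
    return A
```

In the taxonomy below this is a direct attack; restricting the candidate flips to edges not incident to the target would turn it into an indirect (influencer) attack.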
This general form of an attack enables us to provide a categorization along dif-
ferent aspects and illustrates the space to explore for robustness characteristics of
GNNs in general. While this taxonomy is general, for ease of understanding, it helps
to think about an intentional attacker.
however, leads to a name clash with categorizations used in other communities
(Carlini and Wagner, 2017), we decided to use local/global here.
What changes are allowed to the original graph? What do we expect the perturba-
tions to look like? For example, do we want to understand how deleting a few edges
influences the prediction? The space of perturbations under consideration is mod-
eled via Φ(G ). It intuitively represents the attacker’s capabilities; what and how
much they are able to manipulate. The complexity of the perturbation space for
graphs represents one of the biggest differences to classical robustness studies and
stretches along two dimensions.
(1) What can be changed? Unique to the graph domain are perturbations of the
graph structure. In this regard, most publications have studied the scenarios of re-
moving or adding edges to the graph (Dai et al, 2018a; Wang and Gong, 2019;
Zügner et al, 2018; Zügner and Günnemann, 2019; Bojchevski and Günnemann,
2019; Zhang et al, 2019e; Zügner et al, 2018; Tang et al, 2020b; Chen et al, 2020f;
Chang et al, 2020b; Ma et al, 2020b; Geisler et al, 2021). Focusing on the node level,
some works (Wang et al, 2020c; Sun et al, 2020d; Geisler et al, 2021) have consid-
ered adding or removing nodes from the graph. Beyond the graph structure, GNN
robustness has also been explored for changes to the node attributes (Zügner et al,
2018; Wu et al, 2019b; Takahashi, 2019) and the labels used in semi-supervised
node classification (Zhang et al, 2020b).
An intriguing aspect of graphs is to investigate how the interdependence of
instances plays a role in robustness. Due to the message passing scheme, changes to
one node might affect (potentially many) other nodes. Often, for example, a node’s
prediction depends on its k-hop neighborhood, intuitively representing the node’s
receptive field. Thus, it is not only important what type of change can be performed
but also where in the graph this can happen. Consider for example Figure 8.1: to
analyze whether the prediction for the highlighted node can change, we are not lim-
ited to perturbing the node’s own attributes and its incident edges but we can also
achieve our aim by perturbing other nodes. Indeed, this reflects real world scenarios
much better since it is likely that an attacker has access to a few nodes only, and
not to the entire data or the target node itself. Put simply, we also have to consider
which nodes can be perturbed. Multiple works (Zügner et al, 2018; Zhang et al,
2019e; Takahashi, 2019) investigate what they call indirect attacks (or sometimes
influencer attacks), specifically analyzing how an individual node’s prediction can
change when only perturbing other parts of the graph while leaving the target node
untouched.
(2) How much can be changed? Typically, adversarial examples are designed to
be nearly indistinguishable to the original input, e.g., changing the pixel values of an
image so that it stays visually the same. Unlike image data, where this can easily be
verified by manual inspection, this is much more challenging in the graph setting.
Technically, the set of perturbations can be defined based on any graph distance
function D measuring the (dis)similarity between graphs. All graphs similar to the
given graph G then define the set Φ(G ) = {Gˆ ∈ G | D(G , Gˆ) ≤ ∆ }, where G denotes
the space of all potential graphs and ∆ the largest acceptable distance.
Defining what are suitable graph distance functions is in itself a challenging
task. Beyond that, computing these distances and using them within the optimiza-
tion problem of Eq. 8.1 might be computationally intractable (think, e.g.,
about the graph edit distance which itself is NP-hard to compute). Therefore, exist-
ing works have mainly focused on so called budget constraints, limiting the number
of changes allowed to be performed. Technically, such budgets correspond to the
L0 pseudo-norm between the clean and perturbed data, e.g., relating to the graphs’
adjacency matrix A or its node attributes X.3 To enable more fine-grained control,
often such budget constraints are used locally per node (e.g., limiting the maximal
number of edge deletions per node; ∆iloc ) as well as globally (e.g., limiting the over-
all number of edge deletions; ∆ glob ). For example
Φ(G ) = {Gˆ = (Â, X̂) ∈ G | ||A − Â||0 ≤ ∆ glob ∧ ∀i : ||Ai − Âi ||0 ≤ ∆iloc ∧ X = X̂},
(8.2)
where the graphs G = (A, X) and Gˆ = (Â, X̂) are assumed to have the same size and
the node attributes, X resp. X̂, to stay unchanged; Ai denotes the ith row of A.
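A budget-constrained perturbation set like Eq. 8.2 is straightforward to check programmatically. The sketch below, with hypothetical function and argument names, tests membership of a perturbed graph in Φ(G):

```python
import numpy as np

def in_perturbation_set(A, A_pert, X, X_pert, delta_glob, delta_loc):
    """Membership test for Phi(G) as in Eq. 8.2: a global and a per-node
    L0 budget on adjacency changes, with node attributes unchanged.
    Note that the matrix L0 norm counts each undirected edge change
    twice (entries (i,j) and (j,i)), matching ||A - Â||_0."""
    if not np.array_equal(X, X_pert):           # attributes must stay fixed
        return False
    diff = (A != A_pert)
    if diff.sum() > delta_glob:                 # global budget ||A - Â||_0
        return False
    return bool((diff.sum(axis=1) <= delta_loc).all())  # per-row budgets
```

For simplicity a single local budget `delta_loc` is shared by all nodes, whereas Eq. 8.2 allows a node-specific ∆_i^loc.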
Beyond these budget constraints, it might be useful to preserve further character-
istics of the data. In particular for real-world networks many patterns such as spe-
cific degree distributions, large clustering coefficients, low diameter, and more are
known to hold (Chakrabarti and Faloutsos, 2006). If two graphs show very different
patterns, it is easy to tell them apart – and a different prediction could be expected.
Therefore, in (Zügner et al, 2018; Zügner and Günnemann, 2019; Lin et al, 2020d)
only perturbed graphs are considered which follow similar power-law behavior in
the degree distribution. Similarly, one can impose constraints on the attributes con-
sidering, e.g., the co-occurrence of specific values.
3 This is a similar approach to image data, where often we take a certain radius as measured by,
e.g., an L p norm around the original input as the allowed perturbation set, assuming that for small
radii the semantic meaning of the input does not change.
the ground-truth labels of the target node(s) could additionally be hidden from the
attacker. The knowledge about the model includes many aspects such as knowledge
about the used GNN architecture, the model’s weights, or whether only the output
predictions or the gradients are known. Given all these variations, the most common
ones are white-box settings, where full information is available, and black-box set-
tings, which usually mean that only the graph and potentially the predicted outputs
are available.
Among the three aspects above, the attacker’s knowledge seems to be the one
which most strongly links to human-like adversaries. It should be highlighted,
though, that worst-case perturbations in general are best reflected by the fully white-
box setting, making it the preferred choice for strong robustness results. If a model
performs robustly in a white-box setting, it will also be robust under the limited
scenarios. Moreover, as we will see in Section 8.2.2.1, the transferability of attacks
implies that knowledge about the model is not really required.
Besides the above categorization that focuses on the properties of the attack, an-
other, more technical, view can be taken by considering the algorithmic approach
how the (bi-level) optimization problem is solved. In the discussion of the pertur-
bation space we have seen that graph perturbations often relate to the addition/re-
moval of edges or nodes — these are discrete decisions, making Eq. 8.1 a
discrete optimization problem. This is in stark contrast to other data domains where
infinitesimal changes are possible. Thus besides adapting gradient-based approxi-
mations, various other techniques can be used to tackle Eq. 8.1 for GNNs
such as reinforcement learning (Sun et al, 2020d; Dai et al, 2018a) or spectral ap-
proximations (Bojchevski and Günnemann, 2019; Chang et al, 2020b). Moreover,
the attacker’s knowledge has also implications on the algorithmic choice. In a black-
box setting where, e.g., only the input and output are observed, we cannot use the
true GNN fθ to compute gradients but have to use other principles like first learning
some surrogate model.
The above categorization shows that various kinds of adversarial perturbations under
different scenarios can be investigated. Summarizing the different results obtained
in the literature so far, the trend is clear: standard GNNs trained in the standard way
are not robust. In the following, we give an overview of some key insights.
Figure 8.2 illustrates one of the results of the method Nettack as introduced in
(Zügner et al, 2018). Here, local attacks in an evasion setting focusing on graph
structure perturbations are analyzed for a GCN (Kipf and Welling, 2017b). The
figure shows the classification margin, i.e., the difference between the predicted
Fig. 8.2: Classification margin of nodes when attacking a GCN model on the Cora
ML data with Nettack (Zügner et al, 2018). If a node lies below the dashed line, it
is misclassified w.r.t. the ground truth label. As shown, almost any node’s prediction
can be changed. Columns: Original Graph; Nettack and Nettack-In. with budget
∆ = ⌊d/2⌋; Nettack and Nettack-In. with budget ∆ = d.
probability of the node’s true class minus the one of the second highest class. The
left column shows the results for the unperturbed graph where most nodes are cor-
rectly classified as illustrated by the predominantly positive classification margin.
The second column shows the result after perturbing the graph based on the pertur-
bation found by Nettack using a global budget of ∆ = ⌊dv /2⌋ and making sure that
no singletons occur where dv is the degree of the node v under attack. Clearly, the
GCN model is not robust: almost every node’s prediction can be changed. Moreover,
the third column shows the impact of indirect attacks. Recall that in these scenarios
the performed perturbations cannot happen at the node we aim to misclassify. Even
in this setting, a large fraction of nodes is vulnerable. The last two columns show
results for an increased budget of ∆ = dv . Not surprisingly, the impact of the attack
becomes even more pronounced.
Considering global attacks in the poisoning setting similar behavior can be ob-
served. For example, when studying the effect of node additions, the work (Sun et al,
2020d) reports a relative drop in accuracy by up to 7 percentage points with a bud-
get of 1% of additional nodes, without changing the connectivity between existing
nodes. For changes to the edge structure, the work (Zügner and Günnemann, 2019)
reports performance drops on the test sets by around 6 to 16 percentage points when
perturbing 5% of the edges. Noteworthy, on one dataset, these perturbations lead to
a GNN obtaining worse performance than a logistic regression baseline operating
only on the node attributes, i.e., ignoring the graph altogether becomes the better
choice.
The following observation from (Zügner and Günnemann, 2019) is important
to highlight: One core factor for the obtained lower performance on the perturbed
graphs are indeed the learned GNN weights. When using the weights θGˆ trained on
the perturbed graph Gˆ obtained by the poisoning attack, not only the performance
on Gˆ is low but even the performance on the unperturbed graph G suffers dramat-
ically. Likewise, when applying weights θG trained on the unperturbed graph G to
the graph Gˆ, the classification accuracy barely changes. Thus, the poisoning attack
performed in (Zügner and Günnemann, 2019) indeed derails the training procedure,
i.e., leads to ‘bad’ weights. This result emphasizes the importance of the training
procedure for the performance of graph models. If we are able to find appropriate
weights, even perturbed data might be handled more robustly. We will encounter
this aspect again in Section 8.4.2.
Figure 8.3 compares the distribution of such a property (e.g. node degree) when
considering all nodes of the unperturbed graph with the distribution of the prop-
erty when considering only the nodes incident to the inserted/removed adversarial
edges. The comparison indicates a statistically significant difference between the
distributions. For example, in Figure 8.3 (left) we can see that the Nettack method
tends to connect a target node to low-degree nodes. This could be due to the degree-
normalization performed in GCN, where low-degree nodes have a higher weight
(i.e., influence) on the aggregation of a node. Likewise, considering nodes incident
to edges removed by the adversary we can observe that the Nettack method tends
to disconnect high-degree nodes from the target node. In Figure 8.3 (second and
third plot) we can see that the attack tends to connect the target node with peripheral
nodes, as evidenced by small two-hop neighborhood size and low closeness cen-
trality of the adversarially connected nodes. In Figure 8.3 (right) we can see that
the adversary tends to connect a target node to other nodes which have dissimilar
attributes. As also shown in other works, the adversary appears to try to counter the
homophily property in the graph – which is not surprising, since the GNN has likely
learned to partly infer a node’s class based on its neighbors.
To understand whether such detected patterns are universal, they can be used
to design attack principles themselves — indeed, this even leads to black-box attacks
since the analyzed properties usually relate to the graph only and not to the GNN. In
(Zügner et al, 2020) a prediction model was learned estimating the potential impact
of a perturbation on unseen graphs using the above mentioned properties as input
features. While this often resulted in finding effective adversarial perturbations, thus,
highlighting the generality of the regularities uncovered, the attack performance
was not on par with the original Nettack attack. Similarly, in (Ma et al, 2020b)
PageRank-like scores have been used to identify potential harmful perturbations.
The aspects along which adversarial attacks on graphs can be studied allow for a
huge variety of scenarios. Only a few of them have been thoroughly investigated
in the literature. One important aspect to consider, for example, is that in real ap-
plications the cost of perturbations differ: while changing node attributes might be
relatively easy, injecting edges might be harder. Thus, designing improved pertur-
bation spaces can make the attack scenarios more realistic and better capture the
robustness properties one might want to ensure. Moreover, many different data do-
mains such as knowledge graphs or temporal graphs need to be investigated.
Importantly, while first steps have been made to understand the patterns that
make these perturbations harmful, a clear understanding with a sound theoretical
backing is still missing. In this regard, it is also worth repeating that all these studies
have focused on analyzing perturbations obtained by Nettack; other attacks might
potentially lead to very different patterns. This also implies that exploiting the re-
sulting patterns to design more robust GNNs (see Section 8.4.1) is not necessarily
a good solution. Moreover, finding reliable patterns also requires more research on
how to compute adversarial perturbations in a scalable way (Wang and Gong, 2019;
Geisler et al, 2021), since such patterns might be more pronounced on larger graphs.
Model-specific certificates are designed for a specific class of GNN models (e.g., 2-
layer GCNs) and a specific task such as node-level classification. A common theme
is to phrase certification as a constrained optimization problem: Recall that in a
classification task, the final prediction is usually obtained by taking the class with
the largest predicted probability or logit. Let c∗ = arg maxc∈C fθ (G )c denote the
predicted class4 obtained on the unperturbed graph G , where C is the set of classes
and fθ(G)c denotes the logit obtained for class c. This specifically implies that the
margin fθ(G)c∗ − fθ(G)c between class c∗ and any other class c is positive.
A particularly useful quantity for robustness certification is the worst-case mar-
gin, i.e., the smallest margin possible under any perturbed data Ĝ:

    m̂(c∗, c) = min_{Ĝ ∈ Φ(G)} fθ(Ĝ)c∗ − fθ(Ĝ)c     (8.3)
4 This could either be the predicted class for a specific target node v in case of node-level classi-
fication; or for the entire graph in case of graph-level classification. We drop the dependency on v
since it is not relevant for the discussion. For simplicity, we assume the maximizer c∗ to be unique.
Fig. 8.4: Obtaining robustness certificates via the worst-case margin: The predic-
tion obtained from the unperturbed graph Gi is illustrated with a cross, while the
predictions for the perturbed graphs Φ(Gi ) are illustrated around it. The worst-case
margin measures the shortest distance to the decision boundary. If it is positive (see
G1 ), all predictions are on the same side of the boundary; robustness holds. If it is
negative (see G2 ), some predictions cross the decision boundary; the class prediction
will change under perturbations, meaning the model is not robust. When using lower
bounds — the shaded regions in the figure — robustness is ensured for positive val-
ues (see G1 ) since the exact worst-case margin can only be larger. If the lower bound
becomes negative, no statement can be made (see G2 and G3 ; robustness unknown).
Both G2 and G3 have a negative lower bound, while the (not tractable to compute)
exact worst-case margin differs in sign.
If this term is positive, c can never be the predicted class for node v. And if the
worst-case margin m̂(c∗ , c) stays positive for all c ̸= c∗ , the prediction is certifiably
robust since the logit for class c∗ is always the largest – for all perturbed graphs in
Φ(G ). This idea is illustrated in Figure 8.4.
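For a tiny graph, the worst-case margin of Eq. 8.3 can even be computed exactly by brute force, which makes the role of the certificate concrete. The linearized model and all names below are illustrative assumptions, not the certification methods discussed in this section; the exponential cost of the enumeration is exactly why tractable lower bounds are needed.

```python
import itertools
import numpy as np

def linear_gnn_logits(A, X, W):
    """A linearized one-layer GNN used only for illustration:
    logits = (A + I) X W."""
    return (A + np.eye(A.shape[0])) @ X @ W

def worst_case_margin(A, X, W, target, c_star, budget):
    """Exact worst-case margin m̂ by enumerating every graph within a
    global edge-flip budget. The margin against the strongest competing
    class is logits[c_star] - max over the other classes, so a positive
    result means m̂(c_star, c) > 0 for all c != c_star."""
    n = A.shape[0]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    best = np.inf
    for k in range(budget + 1):
        for flips in itertools.combinations(pairs, k):
            A_p = A.copy()
            for i, j in flips:
                A_p[i, j] = A_p[j, i] = 1 - A_p[i, j]
            logits = linear_gnn_logits(A_p, X, W)[target]
            margin = logits[c_star] - np.delete(logits, c_star).max()
            best = min(best, margin)
    return best

def is_certifiably_robust(A, X, W, target, budget):
    """Certificate: the prediction cannot change within the budget."""
    c_star = int(np.argmax(linear_gnn_logits(A, X, W)[target]))
    return worst_case_margin(A, X, W, target, c_star, budget) > 0
```

A tractable certificate replaces the inner enumeration with an efficiently computable lower bound, at the cost of sometimes returning "unknown" for robust inputs.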
As shown, obtaining a certificate means solving the (constrained) optimization
problem in Eq. 8.3 for every class c. Not surprisingly, however, solving
this optimization problem is usually intractable – for similar reasons as computing
adversarial attacks is hard. So how can we obtain certificates? Just heuristically
solving Eq. 8.3 is not helpful since we aim for guarantees.
The core idea is to obtain tractable lower bounds on the worst-case margin. That is,
we aim to find functions m̂LB that ensure m̂LB (c∗ , c) ≤ m̂(c∗ , c) and are more effi-
cient to compute. One solution is to consider relaxations of the original constrained
minimization problem, replacing, for example, the model’s nonlinearities and hard
discreteness constraints via their convex relaxation. For example, instead of requir-
ing that an edge is perturbed or not, indicated by the variables e ∈ {0, 1}, we can
use e ∈ [0, 1]. Intuitively, using such relaxations leads to supersets of the actually
reachable predictions, as visualized in Figure 8.4 with the shaded regions.
Overall, if the lower bound m̂LB stays positive, the robustness certificate holds —
since m̂ is positive by transitivity as well. This is shown in Figure 8.4 for graph G1 .
If m̂LB is negative, no statement can be made since it is only a lower bound of the
original worst-case margin m̂, which thus can be positive or negative. Compare the
two graphs G2 and G3 in Figure 8.4: While both have a negative lower bound (i.e.,
both shaded regions cross the decision boundary), their actual worst-case margins m̂
differ. Only for graph G2 the actually reachable predictions (which are not efficiently
computable) also cross the decision boundary. Thus, if the lower bound is negative,
the actual robustness remains unknown – similar to an unsuccessful attack, where
it remains unclear whether the model is actually non-robust or the attack simply
not strong enough. Therefore, besides being efficient to compute, the function m̂LB
should be as close as possible to m̂ to avoid cases where no answer can be given
despite the model being robust.
The above idea, using convex relaxations of the model’s nonlinearities and the
admissible perturbations, is used in the works (Zügner and Günnemann, 2019;
Zügner and Günnemann, 2020) for the class of GCNs and node-level classification.
In (Zügner and Günnemann, 2019), the authors consider perturbations to the node
attributes and obtain lower bounds via a relaxation to a linear program. The work
(Zügner and Günnemann, 2020) considers perturbations in the form of edge dele-
tions and reduces the problem to a jointly constrained bilinear program. Similarly,
also using convex relaxations, Jin et al (2020a) has proposed certificates for graph-
level classification under edge perturbations using GCNs. Beyond GCNs, model-
specific certificates for edge perturbations have also been devised for the class of
GNNs using PageRank diffusion (Bojchevski and Günnemann, 2019), which in-
cludes label/feature propagation and (A)PPNP (Klicpera et al, 2019a). The core idea
of (Bojchevski and Günnemann, 2019) is to treat the problem as a PageRank opti-
mization task which subsequently can be expressed as a Markov decision process.
Using this connection one can indeed show that in scenarios where only local bud-
gets are used (see Section 8.2; Eq. 8.2) the derived certificates are exact,
i.e., no lower bound, while we can still compute them in polynomial time w.r.t. the
graph size. In general, all models above consider local and global budget constraints
on the number of changes.
Besides providing certificates, being able to efficiently compute (a differentiable
lower bound on) the worst-case margin as in Eq. 8.3 also makes it possible to
improve GNN robustness by incorporating the margin during training, i.e., aiming to
make it positive for all nodes. We will discuss this in detail in Section 8.4.2.
Overall, a strong advantage of model-specific certificates is their explicit consid-
eration of the GNN model structure within the margin computation. However, the
white-box nature of these certificates is simultaneously their limitation: The pro-
posed certificates capture only a subset of the existing GNN models and any GNN
yet to be developed likely requires a new certification technique as well. This limi-
tation is tackled by model-agnostic certificates.
In other words, g(G ) returns the most likely class obtained by first randomly per-
turbing the graph G using τ and then classifying the resulting graphs τ(G ) with the
base classifier f .
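A minimal Monte Carlo sketch of such a smoothed classifier, assuming a generic edge-flip randomization scheme τ that perturbs each undirected edge entry independently; this is an illustrative sketch, not the exact scheme of any cited work.

```python
import numpy as np

def smoothed_predict(f, A, flip_prob, n_samples, rng):
    """Monte Carlo estimate of the smoothed classifier g: the
    randomization scheme tau independently flips each upper-triangular
    adjacency entry with probability flip_prob (mirrored to keep the
    graph undirected); each sampled graph is classified by the base
    classifier f, and the majority class is returned together with its
    empirical probability."""
    n = A.shape[0]
    iu, ju = np.triu_indices(n, k=1)
    votes = {}
    for _ in range(n_samples):
        flip = rng.random(len(iu)) < flip_prob
        A_s = A.copy()
        A_s[iu[flip], ju[flip]] = 1 - A_s[iu[flip], ju[flip]]
        A_s[ju[flip], iu[flip]] = A_s[iu[flip], ju[flip]]   # keep symmetric
        c = f(A_s)
        votes[c] = votes.get(c, 0) + 1
    c_star = max(votes, key=votes.get)
    return c_star, votes[c_star] / n_samples
```

In a real certificate, the empirical probability would additionally be replaced by a confidence lower bound before being compared against the 0.5 threshold discussed below.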
As in Section 8.3.1, the goal is to assess whether the prediction does not change
under perturbations: denoting with c∗ = g(G ) the class predicted by the smoothed
classifier on G , we want g(Gˆ) = c∗ for all Gˆ ∈ Φ(G ). Considering for simplicity the
case of binary classification, this is equivalent to ensuring that Pr( f(τ(Ĝ)) = c∗ ) ≥ 0.5
for all Ĝ ∈ Φ(G); or short: min_{Ĝ∈Φ(G)} Pr( f(τ(Ĝ)) = c∗ ) ≥ 0.5.
Since, unsurprisingly, this term is intractable to compute, we resort again to a lower
bound to obtain the certificate:

    min_{h ∈ H_f} min_{Ĝ ∈ Φ(G)} Pr( h(τ(Ĝ)) = c∗ )  ≤  min_{Ĝ ∈ Φ(G)} Pr( f(τ(Ĝ)) = c∗ )     (8.5)
Here, H_f is the set of all classifiers sharing some properties with f; often, e.g., that the smoothed classifiers based on h and on f return the same probability for G, i.e., Pr(h(τ(G)) = c∗) = Pr(f(τ(G)) = c∗). Since f ∈ H_f, the inequality holds
164 Stephan Günnemann
trivially. Accordingly, if the left-hand side of Eq. 8.5 is larger than 0.5, the right-hand side is guaranteed to be so as well, implying that G is certifiably robust.
What does Eq. 8.5 intuitively mean? It aims to find a base classifier h which minimizes the probability that the perturbed sample Ĝ is assigned to class c∗. Thus, h represents a kind of worst-case base classifier which, when used within the smoothed classifier, tries to obtain a different prediction for Ĝ. If even this worst-case base classifier leads to certifiable robustness (left-hand side of Eq. 8.5 larger than 0.5), then surely the actual base classifier at hand does as well.
The most important part in making this all useful, however, is the following: given a set of classifiers H_f, finding the worst-case classifier h and minimizing over the
perturbation model Φ(G) is often tractable. In some cases, the optimum can even be calculated in closed form. This reveals an interesting relation to the previous section: there, the intractable minimization over Φ(G) in Eq. 8.3 was replaced by a tractable lower bound, e.g., via relaxations. Now, by finding a worst-case classifier h, we not only obtain a lower bound but the minimization over Φ(G) often also becomes immediately tractable. Note, however, that in Section 8.3.1 we
obtain a certificate for the base classifier f , while here we obtain a certificate for the
smoothed classifier g.
As said, given a set of classifiers H_f, finding the worst-case classifier h and minimizing over the perturbation model Φ(G) is often tractable. The main computational challenge in practice lies in determining H_f. Consider our previous example, where we enforced all classifiers h to satisfy Pr(h(τ(G)) = c∗) = Pr(f(τ(G)) = c∗). To determine H_f, one needs to compute Pr(f(τ(G)) = c∗).
Clearly, doing this exactly is again usually intractable. Instead, the probability can
be estimated using sampling. To ensure a tight approximation, the base classifier has
to be fed a large number of samples from the smoothing distribution. This becomes
increasingly expensive as the size and complexity of the GNN model increases.
Furthermore, the resulting estimates only hold with a certain probability. Accordingly, the derived guarantees hold only with the same probability, i.e., one obtains merely probabilistic robustness certificates. Despite these practical limitations, randomized
smoothing has become widely popular, as it is often still more efficient than model-
specific certificates.
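To make the sampling procedure concrete, the following minimal numpy sketch estimates the class distribution of a smoothed classifier by Monte Carlo. The edge-flip smoothing scheme, the toy base classifier, and all names are illustrative stand-ins, not the sparsity-aware schemes of the cited works:

```python
import numpy as np

def smoothed_class_distribution(f, adj, flip_prob=0.1, n_samples=1000, seed=0):
    """Monte Carlo estimate of Pr(f(tau(G)) = c) for every class c, where the
    smoothing distribution tau flips each potential edge independently with
    probability flip_prob (a simple, sparsity-unaware scheme for illustration)."""
    rng = np.random.default_rng(seed)
    counts = {}
    for _ in range(n_samples):
        flips = np.triu(rng.random(adj.shape) < flip_prob, k=1)
        flips = flips | flips.T                    # keep the graph undirected
        perturbed = np.where(flips, 1 - adj, adj)  # flip the selected entries
        c = f(perturbed)
        counts[c] = counts.get(c, 0) + 1
    return {c: n / n_samples for c, n in counts.items()}

# toy base classifier: class 1 iff the graph still has more than two edges
f = lambda a: int(a.sum() / 2 > 2)
adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])  # a triangle
probs = smoothed_class_distribution(f, adj)
g_of_G = max(probs, key=probs.get)  # the smoothed prediction g(G): majority class
```

In practice, such probability estimates must additionally be turned into confidence bounds (the "certain probability" mentioned above) before they can back a certificate.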
This general idea of model-agnostic certificates has been investigated for discrete
data in (Lee et al, 2019a; Dvijotham et al, 2020; Bojchevski et al, 2020a; Jia et al,
2020), with the latter two focusing also on graph-related tasks. In (Jia et al, 2020),
the authors investigate the robustness of community detection. In (Bojchevski et al,
2020a), the main focus is on node-level and graph-level classification w.r.t. graph
structure and/or attribute perturbations under global budget constraints. Specifically, the approach of Bojchevski et al (2020a) overcomes critical limitations of the other approaches in two regards: it explicitly accounts for sparsity in the data as present in many graphs,
As we have established, standard GNNs trained in the usual way are not robust to
even small changes to the graph, thus, using them in sensitive and critical applica-
tions might be risky. Certificates can provide us guarantees about their performance.
resilience to attacks. Similarly, some works use the term provable defense when referring to certificates, since they provably prevent attacks within a certified set Φ(G) from being harmful.
Interestingly, even before the rise of graph neural networks, such joint approaches
have been investigated, e.g., in (Bojchevski et al, 2017) to improve the robustness of
spectral embeddings. For GNNs, such graph structure learning has been proposed in
(Jin et al, 2020e; Luo et al, 2021), where certain properties like a low-rank graph structure and attribute similarity are used to define what the clean graph should preferably look like.
As discussed in Section 8.2.2, one further reason for the non-robustness of GNNs is the set of parameters/weights learned during training. Weights resulting from standard
training often lead to models that do not generalize well to slightly perturbed data.
This is illustrated in Figure 8.5 with the orange/solid decision boundary. Note that
the figure shows the input space, i.e., the space of all graphs G; this is in contrast to
Figure 8.4 which shows the predicted probabilities. If we were able to improve our
training procedure to find ‘better’ parameters – taking into account that the data is or
might become potentially perturbed – the robustness of our model would improve
as well. This is illustrated in Figure 8.5 with the blue/dashed decision boundary.
There, all perturbed graphs from Φ1(G) obtain the same prediction. Seen this way, robustness relates to the generalization performance of prediction models in general.
Robust training refers to training procedures that aim at producing models that are
robust to adversarial (and/or other) perturbations. The common theme is to optimize
a worst-case loss (also called robust loss), i.e., the loss achieved under the worst-case perturbation. Technically, the training objective becomes

min_θ max_{Ĝ ∈ Φ(G)} L_train(f_θ(Ĝ)),   (8.6)
where f_θ is the GNN with its trainable weights θ. As shown, we do not evaluate the loss at the unperturbed graph but instead use the loss achieved in the worst case (compare this to standard training, where we simply minimize L_train(f_θ(G))). The weights are thus steered to obtain a low loss under these worst-case scenarios as well, leading to better generalization.
Not surprisingly, solving Eq. 8.6 is usually not tractable, for the same reasons that finding attacks and certificates is hard: we have to solve a discrete, highly complex (min-max) optimization problem. In particular, for training, e.g., via gradient-based approaches, we also need to compute the gradient w.r.t. the inner maximization. Thus, for feasibility, one usually has to resort to various surrogate objectives, substituting the worst-case loss and the resulting gradient by simpler ones.
In this regard, the most naïve approach is to randomly draw samples from the perturbation set Φ(G) during each training iteration. That is, during training the loss and
the gradient are computed w.r.t. these randomly perturbed samples; with different
samples drawn in each training iteration. If the perturbation set, for example, con-
tains graphs where up to x edge deletions are admissible, we would randomly create
graphs with up to x edges dropped out. Such edge dropout has been analyzed in
various works but does not improve adversarial robustness substantially (Dai et al,
2018a; Zügner and Günnemann, 2020); a possible explanation is that the random
samples simply do not represent the worst-case perturbations well.
Thus, more common is the approach of adversarial training (Xu et al, 2019c;
Feng et al, 2019a; Chen et al, 2020i). Here, we do not randomly sample from the
perturbation set, but in each training iteration we create adversarial examples Ĝ and subsequently compute the gradient w.r.t. these. As these samples are expected to lead to a higher loss, the result of the inner max-operation in Eq. 8.6 is much better approximated. Instead of perturbing the input graph, the work of (Jin and Zhang, 2019) investigates a robust training scheme which perturbs the latent embeddings.
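A minimal sketch of such an adversarial-training loop: a toy one-parameter "model" on pooled degree features stands in for f_θ, and a greedy single-edge-flip search stands in for the inner maximization. All components here are deliberate simplifications for illustration, not the cited methods:

```python
import numpy as np

def degree_features(adj):
    # toy node "representations": each node's degree (a stand-in for a GNN)
    return adj.sum(axis=1, keepdims=True)

def loss(w, adj, y):
    # logistic loss of a one-parameter model on the pooled degree features
    logit = float(degree_features(adj).mean() * w)
    p = 1.0 / (1.0 + np.exp(-logit))
    return -(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

def worst_case_edge_flip(w, adj, y, budget=1):
    # inner max of the robust objective, approximated greedily: try every
    # single-edge flip, keep the one with the highest loss, repeat up to budget
    best = adj.copy()
    n = adj.shape[0]
    for _ in range(budget):
        candidates = [best]
        for i in range(n):
            for j in range(i + 1, n):
                a = best.copy()
                a[i, j] = a[j, i] = 1 - a[i, j]
                candidates.append(a)
        best = max(candidates, key=lambda a: loss(w, a, y))
    return best

def robust_train(adj, y, steps=50, lr=0.5):
    w = 0.0
    for _ in range(steps):
        adv = worst_case_edge_flip(w, adj, y)  # adversarial example this step
        eps = 1e-4                             # finite-difference outer gradient
        g = (loss(w + eps, adv, y) - loss(w - eps, adv, y)) / (2 * eps)
        w -= lr * g
    return w

adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
w = robust_train(adj, y=1)
```

The exhaustive single-flip search already illustrates why this is costly: the attack runs inside every training iteration, exactly the slowdown discussed below.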
It is interesting to note that adversarial training in its standard form requires la-
beled data since the attack aims to steer towards an incorrect prediction. In the typi-
cal transductive graph-learning tasks, however, large amounts of unlabeled data are
available. As a solution, virtual adversarial training has also been investigated (Deng
et al, 2019; Sun et al, 2020d), operating on the unlabeled data as well. Intuitively,
it treats the currently obtained predictions on the unperturbed graph as the ground
truth, making it a kind of self-supervised learning. The predictions on the perturbed
data should not deviate from the clean predictions, thus enforcing smoothness.
Using (virtual) adversarial training has empirically shown some improvements in robustness, though not consistently. In particular, to approximate the max term in the robust loss of Eq. 8.6 well, we need powerful adversarial attacks, which
are typically costly to compute for graphs (see Section 8.2). Since here attacks need
to be computed in every training iteration, the training process is slowed down sub-
stantially.
At the end of the day, the techniques above perform a costly data augmentation dur-
ing training, i.e., they use altered versions of the graph. Besides being computation-
ally expensive, there is no guarantee that the adversarial examples are indeed good
proxies for the max term in Eq. 8.6. An alternative approach, followed, e.g., by (Zügner and Günnemann, 2019; Bojchevski and Günnemann, 2019), relies on the
idea of certification as discussed previously. Recall that these techniques compute a
lower bound m̂LB on the worst-case margin. If it is positive, the prediction is robust
for this node/graph. Thus, the lower bound itself acts like a robustness loss Lrob , for
example instantiated as a hinge loss: max(0, δ − m̂LB ). If the lower-bound is above
δ , then the loss is zero; if it is smaller, a penalty occurs. Combining this loss func-
tion with, e.g., the usual cross-entropy loss, forces the model not only to obtain good
classification performance but also robustness.
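A sketch of such a combined objective, assuming per-node lower-bound margins m_lb have already been computed; the trade-off weight lam and all names are illustrative knobs, not taken from the cited works:

```python
import numpy as np

def robust_hinge(m_lb, delta=0.1):
    # L_rob = max(0, delta - m_lb): zero once the certified margin exceeds delta
    return np.maximum(0.0, delta - m_lb)

def combined_loss(probs, labels, m_lb, delta=0.1, lam=1.0):
    """Cross-entropy on the predicted class probabilities plus the hinge
    robustness loss on per-node lower bounds m_lb of the worst-case margin."""
    ce = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return float((ce + lam * robust_hinge(m_lb, delta)).mean())
```

Nodes with a comfortably positive certified margin contribute only their cross-entropy term; nodes below the margin threshold additionally pull the weights toward robustness.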
Crucially, Lrob and, thus, the lower bound need to be differentiable since we need
to compute gradients for training. This, indeed, might be challenging since usually
the lower bound itself is still an optimization problem. While in some special cases
the optimization problem is directly differentiable (Bojchevski and Günnemann,
2019), another general idea is to exploit the principle of duality. Recall that the worst-case margin m̂ (or a corresponding lower bound m̂_LB) is the result of a (primal) minimization problem (see Eq. 8.3). By the principle of duality, the result of the dual maximization problem provides, as required, a lower
bound to this value. Even more, any feasible solution of the dual problem provides
a lower bound on the optimal solution. Thus, we actually do not need to solve the
dual program. Instead, it is sufficient to compute the objective function of the dual at
any single feasible point to obtain an (even lower, thus looser) lower bound; no op-
timization is required and computing gradients often becomes straightforward. This
principle of duality has been used in (Zügner and Günnemann, 2019) to perform
robust training in an efficient way.
Robust training is not the only way to obtain ‘better’ GNN weights. In (Tang
et al, 2020b), for example, the idea of transfer learning (besides further architecture
changes; see next section) is exploited. Instead of purely training on a perturbed
target graph, the method adopts clean graphs with artificially injected perturbations
to first learn suitable GNN weights. These weights are later transferred and fine-
tuned to the actual graph at hand. The work (Chen et al, 2020i) exploits smoothing
distillation where one trains on predicted soft labels instead of ground-truth labels
to enhance robustness. The work (Jin et al, 2019b) argues that graph powering en-
hances robustness and proposes to minimize the loss not only on the original graph
but on a set of graphs consisting of the different graph powers. Lastly, the authors
of (You et al, 2021) use a contrastive learning framework using different (graph)
data augmentations. Although adversarial robustness is not their focus, they report increased adversarial robustness against the attacks of (Dai et al, 2018a). In general,
changing the loss function or regularization terms leads to different training, though
the effects on robustness for GNNs are not fully understood yet.
Inspired by the idea of graph cleaning as discussed before, a natural idea is to enhance the GNN with mechanisms that reduce the impact of perturbed edges. An obvious choice for this is edge attention. However, it would be a false conclusion to assume that standard attention-based GNNs like GAT are immediately suitable for this task. Indeed, as shown in (Tang et al, 2020b; Zhu et al, 2019a), such models are
non-robust. The problem is that these models still assume clean data to be given;
they are not aware that the graph might be perturbed.
Thus, other attention approaches try to incorporate more information in the pro-
cess. In (Tang et al, 2020b) the attention mechanism is enhanced by taking clean
graphs into account into which perturbations have been artificially injected. Since ground-truth information is now available (i.e., which edges are harmful), the attention can learn to down-weight these while retaining the non-perturbed ones.
An alternative idea is used in (Zhu et al, 2019a). Here, the representation of each node in each layer is no longer a vector but a Gaussian distribution. The authors hypothesize that attacked nodes tend to have large variances and use this information within the attention scores. Further attention mechanisms considering, e.g., model and data uncertainty or the similarity of neighboring nodes have been proposed in (Feng et al, 2021; Zhang and Zitnik, 2020).
An alternative to edge attention is to enhance the aggregation used in message
passing. In a GNN message passing step, a node’s embedding is updated by aggre-
gating over its neighbors’ embeddings. In this regard, adversarially inserted edges
add additional data points to the aggregation and therefore perturb the output of the
message passing step. Aggregation functions such as sum, weighted mean, or the
max operation used in standard GNNs can be arbitrarily distorted by only a single
outlier. Thus, inspired by the principle of robust statistics, the work (Geisler et al,
2020) proposes to replace the usual GNN’s aggregation function with a differen-
tiable version of the Medoid, a provably robust aggregation operation. The idea of
enhancing the robustness of the aggregation function used during message passing
has further been investigated in (Wang et al, 2020o; Zhang and Lu, 2020).
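The robustness gain of medoid-style aggregation over sum or mean can be illustrated with a hard (non-differentiable) medoid; (Geisler et al, 2020) use a differentiable soft variant, so the following is only a conceptual numpy sketch:

```python
import numpy as np

def medoid_aggregate(X):
    """Aggregate the rows of X by their medoid: the member minimizing the sum
    of distances to all others. Unlike sum or mean, a single adversarial
    neighbor can only move the result to another existing point, never
    arbitrarily far."""
    dist_sums = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1).sum(axis=1)
    return X[np.argmin(dist_sums)]

# three benign neighbor embeddings plus one adversarially inserted outlier
neighbors = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [50.0, 50.0]])
mean_agg = neighbors.mean(axis=0)         # dragged far away by the outlier
medoid_agg = medoid_aggregate(neighbors)  # stays at a point inside the cluster
```

The mean lands near the outlier's direction while the medoid remains a member of the benign cluster, which is exactly the breakdown-point argument from robust statistics.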
Overall, all these methods down-weight the relevance of edges, with one cru-
cial difference to the methods discussed in Section 8.4.1: they are adaptive in the
sense that the relevance of each edge might vary between, e.g., the different lay-
ers of the GNN. Thus, an edge might be excluded/down-weighted in the first layer
but included in the second one, depending on the learned intermediate represen-
tation. This allows a more fine-grained handling of perturbations. In contrast, the
approaches in Section 8.4.1 derive a single cleaned graph that is used in the entire
GNN.
Many further ideas to improve robustness have been proposed, not all of which fit entirely into the aforementioned categories. For example, in (Shanthamallu et al, 2021) a surrogate classifier is trained which does not access the graph structure but is encouraged to align with the predictions of the GNN, both being jointly trained.
Since the final predictor is not using the graph but only the node’s attributes, higher
robustness to structure perturbations is hypothesized. The work (Miller et al, 2019)
proposes to select the training data in specific ways to increase robustness, and Wu
et al (2020d) use the principle of the information bottleneck, an information-theoretic approach to learning representations that balance expressiveness and robustness. Finally,
also randomized smoothing (Section 8.3.2) can be interpreted as a technique to im-
prove adversarial robustness by using an ensemble of predictors on randomized in-
puts.
erty that diminishes the effect of robust training, or whether the generated adversarial perturbations simply do not capture the worst case; this showcases again the hardness of the problem. It might also explain why the majority of works have focused on principles of weighting or filtering out edges.
In this regard, it is again important to remember that all approaches are typi-
cally designed with a specific perturbation model Φ(G ) in mind. Indeed, down-
weighting/filtering edges implicitly assumes that adversarial edges had been added
to the graph. Adversarial edge deletions, in contrast, would require to identify po-
tential edges to (re)add. This quickly becomes intractable due to the large number of
possible edges and has not been investigated so far. Moreover, only a few methods
so far have provided theoretical guarantees on the methods’ robustness behavior.
Progress in the field of GNN robustness requires sound evaluation of the proposed
techniques. Importantly, we have to be aware of the potential trade-off between
prediction performance (e.g., accuracy) and robustness. For example, we can easily
obtain a highly robust classification model by simply always predicting the same
class. Clearly, such a model has no use at all. Thus, the evaluation always involves
two aspects: (1) Evaluation of the prediction performance. For this, one can simply
refer to the established evaluation metrics such as accuracy, precision, recall, or
similar, as known for the various supervised and unsupervised learning tasks. (2)
Evaluation of the robustness performance.
Perturbation set and radius. Regarding the latter, the first noteworthy point is that
robustness always links to a specific perturbation set Φ(.) that defines the perturba-
tions the model should be robust to. To enable a proper evaluation, existing works
therefore usually define some parametric form of the perturbation set, e.g., denoted
Φr (G ) where r is the maximal number of changes – the budget – we are allowed to
perform (e.g., maximal number of edges to add). The variable r is often referred to
as the radius. This is because the budget usually coincides with a certain maximal
norm/distance we are willing to accept between graph G and perturbed ones. A gen-
eralization of the above form to consider multiple budgets/radii is straightforward.
Varying the radius enables us to analyze the robustness behavior of the models in de-
tail. Depending on the radius, different robustness results are expected. Specifically,
for a large radius low robustness is expected – or even desired – and accordingly,
the evaluation should also include these cases showing the limits of the models.
Recall that using the methods discussed in Section 8.2 and Section 8.3 together,
we are able to obtain one of the following answers about a prediction’s robustness:
(R) It is robust; the certificate holds since, e.g., the lower bound on the margin
is positive. (NR) It is non-robust; we are able to find an adversarial example. (U)
Unknown; no statement possible since, e.g., the lower bound is negative but the
attack was not successful either.
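The three verdicts can be expressed as a tiny helper combining a certificate's lower bound with an attack's outcome (an illustrative sketch; the names are not from the text):

```python
def robustness_status(margin_lower_bound, attack_succeeded):
    """Combine a certificate and an attack into the three verdicts:
    a positive lower bound on the worst-case margin certifies robustness (R),
    a successful attack proves non-robustness (NR), and otherwise nothing is
    known (U): the prediction falls into the 'gap'."""
    if margin_lower_bound > 0:
        return "R"
    if attack_succeeded:
        return "NR"
    return "U"
```

Note that verdict (U) reflects the looseness of the tools, not a property of the GNN itself, as discussed around Figure 8.6 below.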
Figure 8.6 shows such an example analysis providing insights about the robust-
ness properties of a GCN in detail. Here, local attacks and certificates are computed
on standard (left) and robustly (right) trained GCNs for the task of node classifica-
tion. As the result shows, robust training indeed increases the robustness of a GCN
with fewer attacks being successful and more nodes being certifiable.
Fig. 8.6: Share of nodes which are provably robust (blue; R), non-robust via ad-
versarial example construction (orange; NR), or whose robustness is unknown
(“gap”; U), for increasing perturbation radii. For a given radius, the shares satisfy (R)+(NR)+(U) = 100%. Left: standard training; right: robust training as pro-
posed in (Zügner and Günnemann, 2019). Citeseer data and perturbations of node
attributes.
It is worth highlighting that case (U) – the white gap in Figure 8.6 – occurs
only due to the algorithmic inability to solve the attack/certificate problems exactly.
Thus, case (U) does not give a clear indication about the GNN’s robustness but rather
about the performance of the attack/certificate.6 Given this set-up, in the following
we distinguish between two evaluation directions, which are reflected in frequently
used measures.
6 A large gap indicates that the attacks/certificates are rather loose. The gap might become smaller
when improved attacks/certificates become available. Thus, attacks/certificates themselves can be evaluated by analyzing the size of the gap, since it shows what the maximal possible improvement
in either direction is (e.g., the true share of robust predictions can never exceed 100%-NR for a
specific radius).
• The attack success rate, measuring how many predictions were successfully
changed by the attack(s). This simply corresponds to case (NR), the orange region shown in Fig. 8.6. This metric is typically used in combination with local
attacks where for each prediction a different perturbation can be used. Naturally,
the local attacks’ success rate is higher than the overall performance drop due
to the flexibility in picking different perturbations.
• In the case of classification, the classification margin, i.e., the difference between the predicted probability of the ‘true’ class and that of the second-highest class, and its drop after the attack. See again Figure 8.2 for an example.
The crucial limitation of this evaluation is its dependence on a specific attack
approach. The power of the attack strongly affects the result. Indeed, it can be re-
garded as an optimistic evaluation of robustness since a non-successful attack is
treated as seemingly robust. This conclusion is dangerous, however, since a GNN might perform well against one type of attack but not another. Thus, the above metrics rather evaluate the power of the attack and only weakly the robustness of the model. Interpreting the results has to be done with care. Consequently, when referring to empirical robustness evaluation, it is imperative to use multiple different and
powerful attack approaches. Indeed, as also discussed in (Tramer et al, 2020), each
robustification principle should come with its own specifically suited attack method
(also called adaptive attack) to showcase its limitations.
are treated as wrong. The certified performance gives a provable lower bound
on the performance of the GNN under any admissible perturbation w.r.t. the
current perturbation set Φr (G ) and the given data.
• Certified radius: While the above metrics assume a fixed Φr (G ), i.e., a fixed
radius r, we can also take another view. For a specific prediction, the largest
radius r∗ for which the prediction can still be certified as robust is called its
certified radius. Given the certified radius of a single prediction, one can easily
calculate the average certifiable radius over multiple predictions.
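Assuming certification is monotone in the radius (a larger budget only makes certification harder), the certified radius of a prediction can be found by a simple scan; the helper and toy predictions below are an illustrative sketch:

```python
def certified_radius(certify, max_radius):
    """Largest radius r* up to max_radius for which certify(r) still holds,
    assuming monotonicity of certification in the radius."""
    r_star = 0
    for r in range(1, max_radius + 1):
        if not certify(r):
            break
        r_star = r
    return r_star

# toy predictions certifiable up to 4, 0, and 7 perturbations, respectively
radii = [certified_radius(lambda r, t=t: r <= t, 10) for t in (4, 0, 7)]
avg_certified_radius = sum(radii) / len(radii)
```

Averaging the per-prediction certified radii then yields the aggregate measure described above.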
Fig. 8.7: Certified ratio of different models (APPNP, Soft Medoid, GDC) using the certificate of (Bojchevski et al, 2020a), where Φr(G) consists of edge deletion perturbations; the x-axis shows the delete radius rd, the y-axis the certified ratio. The model-agnostic nature of the certificate allows comparing the robustness across models.
Figure 8.7 shows the certified ratio for different GNN architectures for the task of node classification when perturbing the graph structure. The smoothed classifier
uses 10,000 randomly drawn graphs and the probabilistic certification is based on a
confidence level of α = 0.05 analogously to the set-up in (Geisler et al, 2020). Since
local attacks are considered, the certified ratio is naturally rather low. Still, as shown,
there is a significant difference between the models’ robustness performance.
Provable robustness evaluation provides strong guarantees in the sense that the evaluation is more pessimistic: e.g., if the certified ratio is high, we know that the actual GNN can only be better. Note again, however, that we still also implicitly evaluate the certificate; with new certificates the results might become even better.
Also recall that certificates based on randomized smoothing (Section 8.3.2) evaluate the robustness of the smoothed classifier, thus not providing guarantees for
the base classifier itself. Still, a robust prediction of the smoothed classifier entails
that the base classifier predicts the respective class with a high probability w.r.t. the
randomization scheme.
As it becomes apparent, evaluating robustness is more complex than evaluating
usual prediction performance. To achieve a detailed understanding of the robustness
properties of GNNs it is thus helpful to analyze all aspects introduced above.
8.6 Summary
Along with the increasing relevance of graph neural networks in various application
domains, comes also an increasing demand to ensure their reliability. In this regard,
Acknowledgements
Christopher Morris
Abstract Recently, graph neural networks (GNNs) have emerged as the leading machine learning architecture for supervised learning with graph and relational input. This chapter
gives an overview of GNNs for graph classification, i.e., GNNs that learn a graph-
level output. Since GNNs compute node-level representations, pooling layers, i.e.,
layers that learn graph-level representations from node-level representations, are
crucial components for successful graph classification. Hence, we give a thorough
overview of pooling layers. Further, we review recent research on understanding GNNs' limitations for graph classification and progress in overcoming them.
Finally, we survey some graph classification applications of GNNs and overview
benchmark datasets for empirical evaluation.
9.1 Introduction
Christopher Morris
CERC in Data Science for Real-Time Decision-Making, Polytechnique Montréal, e-mail: chris@
christophermorris.info
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 179
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_9
180 Christopher Morris
In the following, we survey classic and modern works on GNNs for graph classification. GNN layers for graph classification date back to at least the mid-nineties in chemoinformatics. For example, Kireev (1995) derived GNN-like neural architectures to predict chemical molecule properties. The work of (Merkwirth and
9 Graph Neural Networks: Graph Classification 181
Lengauer, 2005) had a similar aim. Gori et al (2005) and Scarselli et al (2008) proposed the original GNN architecture, introducing the general formulation that was later reintroduced and refined by Gilmer et al (2017), who derived the general message-passing formulation in which most modern GNN architectures can be expressed; see Section 9.2.1.
We divide our overview of modern GNN layers for graph classification into spa-
tial approaches, i.e., ones that are purely based on the graph structure by aggre-
gating local information around each node, and spectral approaches, i.e., ones that
rely on extracting information from the graph’s spectrum. Although this division is somewhat arbitrary, we stick to it for historical reasons. Due to the large body of
different GNN layers, we cannot offer a complete survey but focus on representative
and influential works.
One of the earliest modern, spatial GNN architectures for graph classification was
presented in (Duvenaud et al, 2015b), focusing on the prediction of chemical
molecules’ properties. Specifically, the authors propose to design a differentiable
variant of the well-known Extended Connectivity Fingerprint (ECFP) (Rogers and Hahn, 2010) from chemoinformatics, which works similarly to the computation of the WL feature vector. For the computation of their GNN layer, denoted Neural Graph Fingerprints, Duvenaud et al (2015b) first initialize the feature vector f^0(v) of each node v with features of the corresponding atom, e.g., a one-hot encoding representing the atom type. In each iteration or layer t, they compute a feature representation f^t(v) for node v as
f^t(v) = f^{t−1}(v) + ∑_{w ∈ N(v)} f^{t−1}(w),
followed by the application of a one-layer perceptron. Here, N(v) denotes the neigh-
borhood of node v, i.e., N(v) = {w ∈ V | (v, w) ∈ E }. Since the ECFP usually com-
putes sparse feature vectors for small molecules, they apply a linear layer followed
by a softmax function, i.e.,
where W1 and W2 are parameter matrices in Rd×d , which are shared across layers,
and σ (·) is a component-wise non-linearity. The above layer is evaluated on stan-
dard, small-scale benchmark datasets (Kersting et al, 2016) showing good perfor-
mance, similar to classical kernel approaches. Lei et al (2017a) proposed a similar
layer and showed a connection to kernel approaches by deriving the corresponding
kernel space of the learned graph embeddings.
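A minimal numpy sketch of this sum-then-perceptron update; the weight matrix, toy path graph, and one-hot features are illustrative stand-ins, not the original setup:

```python
import numpy as np

def ngf_layer(F, adj, W):
    """One Neural Graph Fingerprint-style update as in the equation above:
    each node adds its neighbors' features to its own, then a one-layer
    perceptron (here: shared linear map plus ReLU) is applied row-wise."""
    H = F + adj @ F              # f^t(v) = f^{t-1}(v) + sum_{w in N(v)} f^{t-1}(w)
    return np.maximum(H @ W, 0)  # shared weights W, component-wise non-linearity

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path 0-1-2
F0 = np.eye(3)  # one-hot initial features (a stand-in for atom types)
F1 = ngf_layer(F0, adj, np.eye(3))
```

After one layer, each row of F1 indicates which atoms are reachable within one hop, mirroring how ECFP-style fingerprints grow neighborhoods per iteration.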
To explicitly support edge labels, e.g., chemical bonds, Simonovsky and Ko-
modakis (2017) introduced Edge-Conditioned Convolution, where a feature for
node v is represented as
f^t(v) = (1 / |N(v)|) ∑_{w ∈ N(v)} F^l(l(w, v), W^l) · f^{t−1}(w) + b^l.
Here, l(w, v) is the feature (or label) of the edge shared by the nodes v and w. Moreover, F^l : R^s → R^{d_t × d_{t−1}} is a function mapping the edge feature to a matrix in R^{d_t × d_{t−1}}, where s denotes the number of components of the edge features, and d_t and d_{t−1} denote the numbers of feature components in layers t and (t − 1), respectively. Further, the function F^l is parameterized by the matrix W, conditioned on
the edge feature l. Finally, b^l is a bias term, again conditioned on the edge feature l. The above layer has been applied to graph classification tasks on small-scale, standard benchmark datasets (Kersting et al, 2016), and to point cloud data from computer vision.
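The edge-conditioned update can be sketched as follows; the filter network, toy graph, and scalar edge features are illustrative stand-ins for the learned, edge-conditioned F^l and b^l:

```python
import numpy as np

def ec_conv(F, neighbors, edge_feat, filter_net, bias):
    """Sketch of an Edge-Conditioned Convolution step: filter_net maps each
    edge feature l(w, v) to a matrix that is applied to the neighbor's
    feature; the results are averaged over N(v), and a bias is added."""
    out = []
    for v, nbrs in enumerate(neighbors):
        msgs = [filter_net(edge_feat[(min(v, w), max(v, w))]) @ F[w]
                for w in nbrs]
        out.append(np.mean(msgs, axis=0) + bias)
    return np.array(out)

# toy instance: scalar edge features scale an identity filter
filter_net = lambda l: l * np.eye(2)
F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = [[1, 2], [0], [0]]            # star graph centered at node 0
edge_feat = {(0, 1): 2.0, (0, 2): 1.0}    # e.g., bond types as scalars
out = ec_conv(F, neighbors, edge_feat, filter_net, np.zeros(2))
```

Because the weight matrix is generated per edge, two neighbors with different edge labels contribute differently even when their node features coincide.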
Building on (Scarselli et al, 2008), Gilmer et al (2017) introduced a general message-passing framework, unifying most of the GNN architectures proposed so far. Specifically, Gilmer et al (2017) replaced the inner sum defined over the neighborhood in the above equations by a general permutation-invariant, differentiable function, e.g., a neural network, and substituted the outer sum over the previous and the neighborhood feature representations, e.g., by a column-wise vector concatenation or an LSTM-style update step. Thus, in full generality, a new feature f^t(v) is
computed as
f^t(v) = f_merge^{W1}( f^{t−1}(v), f_aggr^{W2}({{ f^{t−1}(w) | w ∈ N(v) }}) ),   (9.2)

where f_aggr^{W2} aggregates over the multiset of neighborhood features and f_merge^{W1} merges the node’s representation from step (t − 1) with the computed neighborhood features. Moreover, it is straightforward to include edge features as well, e.g., by learning a combined feature representation of the node itself, the neighboring node,
and the corresponding edge feature. Gilmer et al (2017) employed the above ar-
chitecture for regression tasks from quantum chemistry, showing promising perfor-
mance for regression targets computed by expensive numerical simulations (namely,
DFT) (Wu et al, 2018; Ramakrishnan et al, 2014).
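A minimal sketch of the general scheme of Eq. (9.2), with f_aggr and f_merge passed in as interchangeable functions; sum aggregation and concatenation are just one possible instantiation:

```python
import numpy as np

def message_pass(F, neighbors, f_aggr, f_merge):
    """Generic message-passing step: f_aggr is any permutation-invariant
    function over the multiset of neighbor features, and f_merge combines
    its result with the node's previous representation."""
    return np.array([f_merge(F[v], f_aggr([F[w] for w in neighbors[v]]))
                     for v in range(len(F))])

# instantiating f_aggr = element-wise sum and f_merge = concatenation
F = np.array([[1.0], [2.0], [3.0]])
neighbors = [[1], [0, 2], [1]]  # path graph 0-1-2
out = message_pass(F, neighbors,
                   f_aggr=lambda ms: np.sum(ms, axis=0),
                   f_merge=lambda h, m: np.concatenate([h, m]))
```

Swapping the two lambdas for mean aggregation, max aggregation, or a learned update recovers many of the layers surveyed in this section.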
where

δ = (1/|train|) ∑_{i ∈ train} log(d_i + 1),
and α is a variable parameter. Here, the set train contains all nodes i in the training
set and di denotes its degree, resulting in the aggregation function
184 Christopher Morris
⊕ = [I, S(D, α = 1), S(D, α = −1)] ⊗ [µ, σ, max, min],

where ⊗ denotes the tensor product, the first vector contains the scalers, and the second contains the aggregators. The authors report promising performance over
standard aggregation functions on a wide range of standard benchmark datasets,
improving over some standard GNN layers.
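A small sketch of such a degree-scaled multi-aggregator scheme for a single node; the scaler definition S(d, α) = (log(d + 1)/δ)^α and the concrete inputs below are illustrative assumptions in the spirit of the layer above, not the published implementation.

```python
import numpy as np

def degree_scaled_aggregate(neigh_feats, degree, delta):
    """Combine mean/std/max/min aggregators with degree-based scalers.

    neigh_feats: (k, d) features of a node's k neighbors.
    Scalers: identity, S(d, 1) = log(d + 1)/delta (amplification) and
    S(d, -1) = delta/log(d + 1) (attenuation), where delta is the
    average log-degree over the training set.
    """
    aggs = np.stack([neigh_feats.mean(0), neigh_feats.std(0),
                     neigh_feats.max(0), neigh_feats.min(0)])   # (4, d)
    s = np.log(degree + 1) / delta
    scalers = np.array([1.0, s, 1.0 / s])                        # (3,)
    # tensor product: every scaler applied to every aggregator
    return (scalers[:, None, None] * aggs[None, :, :]).reshape(-1)

feats = np.array([[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]])
out = degree_scaled_aggregate(feats, degree=3, delta=np.log(3.0))
print(out.shape)  # (24,): 3 scalers x 4 aggregators x 2 feature dims
```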
Vignac et al (2020b) extended the expressivity of GNNs, see also Section 9.4, by
using unique node identifiers, generalizing the message-passing scheme proposed
by (Gilmer et al, 2017), see Equation (9.2), by computing and passing matrix fea-
tures instead of vector features. Formally, each node i maintains a matrix Ui in Rn×c ,
denoted its local context, where the j-th row contains the vectorial representation of node j as seen from node i. At initialization, each local context Ui is set to the one-hot indicator 1_i in R^{n×1}, where n
denotes the number of nodes in the given graph. Now at each layer l, similar to the
above message-passing framework, the local context is updated as
U_i^{(l+1)} = u^{(l)}( U_i^{(l)}, Ũ_i^{(l)} ) ∈ R^{n × c_{l+1}},  with  Ũ_i^{(l)} = φ_{j ∈ N(i)}( m^{(l)}(U_i^{(l)}, U_j^{(l)}, y_{ij}) ),
where u(l) , m(l) , and φ are update, message, and aggregation functions, respectively,
to compute the updated local context, and yi j denotes the edge features shared by
node i and j. Moreover, the authors study the expressive power, showing that, in
principle, the above layer can distinguish any non-isomorphic pair of graphs and
propose more scalable alternative variants of the above architecture. Finally, promis-
ing results on standard benchmark datasets are reported.
L = UΛU ⊤ ,
x ∗ g = U(U ⊤ x ⊙U ⊤ g) = U · diag(U ⊤ g) ·U ⊤ x,
where ⊙ denotes the element-wise (Hadamard) product. If we set gθ = diag(U^⊤g),
the above can be expressed as
x ∗ gθ = Ugθ U ⊤ x.
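The spectral filtering identity above can be checked numerically in a few lines; the unnormalized Laplacian L = D − A is used here for brevity, and the small path graph is an illustrative example.

```python
import numpy as np

def spectral_conv(x, A, g_hat):
    """Filter a graph signal x in the spectral domain:
    returns U diag(g_hat) U^T x, where L = D - A = U Lambda U^T.
    g_hat plays the role of g_theta = diag(U^T g)."""
    L = np.diag(A.sum(axis=1)) - A
    lam, U = np.linalg.eigh(L)          # eigendecomposition, L symmetric
    return U @ (g_hat * (U.T @ x))

# path graph on three nodes
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
x = np.array([1.0, -2.0, 3.0])
y = spectral_conv(x, A, np.ones(3))     # all-pass filter: returns x unchanged
print(np.round(y, 6))
```

With all filter coefficients set to one, the operation is an all-pass filter, so the input signal is returned unchanged.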
H^t_{:, j} = σ( ∑_{i=1}^{d_{t−1}} U Θ^t_{i, j} U^⊤ H^{t−1}_{:, i} )  for j in {1, 2, . . . , d_t}. Here, t is the layer index, H^{t−1} in R^{n × d_{t−1}} is the graph signal, where H^0 = X, i.e., the given graph features, and Θ^t_{i, j} is a diagonal parameter matrix.
However, the above layer suffers from a number of drawbacks: the eigenvector basis is not permutation-invariant, the layer cannot be applied to a graph with a different structure, and the computation of the eigendecomposition is cubic in the number of nodes. Hence, Henaff et al (2015) proposed more scalable variants of the above layer by building on a smoothness notion in the spectral domain, which reduces the number of parameters and acts as a regularizer.
To further make the above layer more scalable, Defferrard et al (2016) intro-
duced Chebyshev Spectral CNNs, which approximates gθ by a Chebyshev expan-
sion (Hammond et al, 2011). Namely, they express
g_θ = ∑_{i=0}^{K} θ_i T_i(Λ̂),
where Λ̂ = 2Λ/λmax − I, and λmax denotes the largest eigenvalue of the normalized Laplacian L. The normalization ensures that the eigenvalues of the Laplacian are in the real interval [−1, 1], which is required by Chebyshev polynomials. Here, Ti denotes the ith Chebyshev polynomial, with T1(x) = x. Alternatively, Levie et al (2019) used Cayley polynomials and showed that Chebyshev Spectral CNNs are a special case.
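The practical benefit of the Chebyshev parameterization is that the filter can be applied via the recurrence T_0(x) = 1, T_1(x) = x, T_i(x) = 2xT_{i−1}(x) − T_{i−2}(x), using only matrix-vector products. A minimal sketch; the graph and the coefficients are illustrative.

```python
import numpy as np

def cheb_filter(x, L_norm, theta, lam_max):
    """Apply g_theta = sum_i theta_i T_i(L_hat) to a signal x, where
    L_hat = 2 L / lam_max - I rescales the spectrum into [-1, 1].
    Uses the recurrence T_i = 2 L_hat T_{i-1} - T_{i-2}; only
    matrix-vector products are needed, no eigendecomposition."""
    n = L_norm.shape[0]
    L_hat = 2.0 * L_norm / lam_max - np.eye(n)
    t_prev, t_curr = x, L_hat @ x                   # T_0 x and T_1 x
    out = theta[0] * t_prev + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_prev, t_curr = t_curr, 2.0 * L_hat @ t_curr - t_prev
        out = out + theta[k] * t_curr
    return out

# normalized Laplacian of the path graph on three nodes (lam_max = 2)
s = 1.0 / np.sqrt(2.0)
L = np.array([[1.0, -s, 0.0], [-s, 1.0, -s], [0.0, -s, 1.0]])
x = np.array([1.0, 0.0, -1.0])
y = cheb_filter(x, L, theta=[0.5, 0.25, 0.1], lam_max=2.0)
print(y.shape)  # (3,)
```

The result agrees with evaluating the same polynomial filter directly in the eigenbasis, but the recurrence avoids the cubic-cost eigendecomposition.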
Kipf and Welling (2017b) proposed to make Chebyshev Spectral CNNs more
scalable by setting
x ∗ g_θ = θ_0 x − θ_1 D^{−1/2} A D^{−1/2} x.
Further, they improved the generalization ability of the resulting layer by setting
θ = θ0 = −θ1 , resulting in
x ∗ g_θ = θ( I + D^{−1/2} A D^{−1/2} )x.
In fact, the above layer can be understood as a spatial GNN, i.e., it is equivalent to
computing a feature
f^t(v) = σ( ∑_{w ∈ N(v) ∪ {v}} (1/√(d_v d_w)) f^{t−1}(w) · W ),
for node v in the given graph G , where dv and dw denote the degrees of node v and w,
respectively. Although the above layer was originally proposed for semi-supervised
node classification, it is now one of the most widely used ones and has been ap-
plied for tasks such as matrix completion (van den Berg et al, 2018), link predic-
tion (Schlichtkrull et al, 2018), and also as a baseline for graph classification (Ying
et al, 2018c).
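The spatial form above is straightforward to implement; a sketch, assuming the degrees d_v include the added self-loop (the renormalization used by Kipf and Welling) and using tanh as an illustrative nonlinearity:

```python
import numpy as np

def gcn_layer(X, A, W):
    """One GCN propagation step:
    sigma(D~^{-1/2} (A + I) D~^{-1/2} X W), where D~ is the degree
    matrix of A + I. Matches the spatial form
    f^t(v) = sigma(sum over N(v) ∪ {v} of f^{t-1}(w) W / sqrt(d_v d_w))."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    P = A_tilde * np.outer(d_inv_sqrt, d_inv_sqrt)  # normalized propagation
    return np.tanh(P @ X @ W)

A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.array([[0.5, -0.5], [0.25, 0.75]])
H = gcn_layer(X, A, W)
print(H.shape)  # (3, 2)
```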
Since GNNs learn vectorial node representations, using them for graph classification requires a pooling layer that goes from node-level to graph-level output. Formally, a
pooling layer is a parameterized function that maps a multiset of vectors, i.e., learned
node-level representations, to a single vector, i.e., the graph-level representation.
Arguably, the simplest of such layers are sum, mean, and min or max pooling. That is, given a graph G and a multiset of learned node features

M = {{ f(v) ∈ R^d | v ∈ V }},

sum pooling computes

f_pool(G) = ∑_{f(v) ∈ M} f(v),
while mean, min, max pooling take the (component-wise) average, minimum, max-
imum over the elements in M, respectively. These four simple pooling layers are
still used in many published GNN architectures, e.g., see (Duvenaud et al, 2015b).
In fact, recent work (Mesquita et al, 2020) showed that more sophisticated layers,
e.g., relying on clustering, see below, do not offer any empirical benefits on many
real-world datasets, especially those from the molecular domain.
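Each of these four layers is a single reduction over the node axis, which also makes their permutation invariance obvious; a minimal sketch:

```python
import numpy as np

def pool(node_feats, mode="sum"):
    """Map a multiset of node representations (n, d) to a single
    graph-level vector (d,): the four simplest pooling layers."""
    ops = {"sum": np.sum, "mean": np.mean, "max": np.max, "min": np.min}
    return ops[mode](node_feats, axis=0)

M = np.array([[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]])
print(pool(M, "sum"))   # [6. 6.]
print(pool(M, "mean"))  # [2. 2.]
print(pool(M, "max"))   # [3. 4.]
```

Because each reduction ignores the order of its inputs, any permutation of the rows of M yields the same graph-level vector.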
Simple attention-based pooling became popular in recent years due to its easy im-
plementation and scalability compared to more sophisticated alternatives; see be-
low. For example, Gilmer et al (2017), see above, used a seq2seq architecture for
sets (Vinyals et al, 2016) for pooling purposes in their empirical study. Focusing
on pooling for GNNs, Lee et al (2019b) introduced the SAGPool layer, short for
Self-Attention Graph Pooling method for GNNs, using self-attention. Specifically,
they computed a self-attention score by multiplying the aggregated features of an
arbitrary GNN layer by a matrix Θatt in Rd×1 , where d denotes the number of com-
ponents of the node features. For example, computing the self-attention score Z(v)
for the simple layer of Equation (9.1) equates to
Z(v) = σ( ( f^{t−1}(v) · W_1 + ∑_{w ∈ N(v)} f^{t−1}(w) · W_2 ) · Θ_att ).
The self-attention score Z(v) is subsequently used to select the top-k nodes in the graph, similarly to Cangea et al (2018) and Gao et al (2018a), see below, omitting
the other nodes, effectively pruning nodes from the graph. Similar attention-based
techniques are proposed in (Huang et al, 2019).
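A sketch of this score-and-prune pattern; the plain neighborhood sum standing in for an arbitrary GNN layer and the gating of kept features by their scores are illustrative simplifications:

```python
import numpy as np

def sag_pool(F, A, theta_att, k):
    """Self-attention top-k pooling sketch.

    Scores each node by projecting aggregated features onto a
    learnable vector theta_att of shape (d, 1); keeps the k
    highest-scoring nodes, gates their features by the score, and
    returns the induced subgraph."""
    z = np.tanh((F + A @ F) @ theta_att).ravel()   # (n,) attention scores
    keep = np.sort(np.argsort(-z)[:k])             # indices of top-k nodes
    return F[keep] * z[keep, None], A[np.ix_(keep, keep)], keep

rng = np.random.default_rng(1)
A = np.array([[0, 1, 1, 0], [1, 0, 0, 1],
              [1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
F = rng.normal(size=(4, 3))
theta = rng.normal(size=(3, 1))
F2, A2, kept = sag_pool(F, A, theta, k=2)
print(F2.shape, A2.shape)  # (2, 3) (2, 2)
```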
The idea of cluster-based pooling layers is to coarsen the graph, i.e., merging similar
nodes iteratively. One of the earliest uses has been proposed in (Simonovsky and
Komodakis, 2017), see above, where the Graclus clustering algorithm (Dhillon et al,
2007) is used. However, one has to note that the algorithm is parameter-free, i.e., it does not adapt to the learning task at hand.
The arguably most well-known cluster-based pooling layer is DiffPool (Ying
et al, 2018c). The idea of DiffPool is to iteratively coarsen the graph by learn-
ing a soft clustering of nodes, making the otherwise discrete clustering assignment
differentiable. Concretely, at layer t, DiffPool learns a soft cluster assignment S in [0, 1]^{n_t × n_{t+1}}, where n_t and n_{t+1} are the number of nodes at layer t and (t + 1), respectively. Each entry S_{i, j} represents the probability of node i at layer t being assigned to cluster j at layer (t + 1), computed as
S = softmax(GNN(At , Ft )),
where At and Ft are the adjacency matrix and the feature matrix of the clustered
graph at layer t, and the function GNN is an arbitrary GNN layer. Finally, in each layer, the adjacency matrix and the feature matrix are updated as

A_{t+1} = S^⊤ A_t S   and   F_{t+1} = S^⊤ F_t,

respectively.
Empirically, the authors show that the DiffPool layer boosts standard GNN lay-
ers’ performance, e.g., GraphSage (Hamilton et al, 2017b), on standard, small-scale
benchmark datasets (Morris et al, 2020a). The downside of the above layer is the
added computational cost. The adjacency matrix becomes dense and real-valued af-
ter the first pooling layer, leading to a quadratic cost in the number of nodes for
each GNN layer’s computation. Moreover, the number of clusters has to be chosen
in advance, leading to an increase in hyperparameters.
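One coarsening step can be sketched compactly. A single linear map after one propagation stands in for the GNNs that produce the assignment and embedding matrices, which is an illustrative simplification; the essential operations are the coarsening products.

```python
import numpy as np

def softmax_rows(X):
    e = np.exp(X - X.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def diffpool_step(A, F, W_assign, W_embed):
    """One DiffPool-style coarsening step (sketch).

    S: soft assignment (n, n_next), rows sum to 1; Z: embeddings.
    Coarsening: A' = S^T A S and F' = S^T Z."""
    S = softmax_rows(A @ F @ W_assign)
    Z = np.tanh(A @ F @ W_embed)
    return S.T @ A @ S, S.T @ Z, S

rng = np.random.default_rng(2)
n, d, n_next = 6, 4, 2
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1)
A = (A + A.T).astype(float)
F = rng.normal(size=(n, d))
A2, F2, S = diffpool_step(A, F, rng.normal(size=(d, n_next)),
                          rng.normal(size=(d, d)))
print(A2.shape, F2.shape)  # (2, 2) (2, 4)
```

Note that A2 is dense and real-valued, which is exactly where the quadratic cost discussed above comes from.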
where the function sort sorts the feature matrix Ft row-wise in descending order, and the function truncate returns the first k rows of the input matrix. Ties are broken using the features from the previous layers, 1 to (t − 1). The resulting tensor Ftrunc of
shape k × ∑hi=1 di , where di denotes the number of features of the ith layer and h
the total number of layers, is reshaped into a tensor of size k(∑hi=1 di ) × 1, row-
wise, followed by a standard 1-D convolution with a filter and step size of ∑hi=1 di .
Finally, a sequence of max-pooling and 1-D convolutions is applied to identify
local patterns in the sequence.
Similarly, to combat the high computational cost of some pooling layers, e.g., DiffPool, Cangea et al (2018) introduced a pooling layer dropping n − ⌈nk⌉ nodes of a graph with n nodes in each layer, for k in [0, 1). The nodes to be dropped are chosen according to a projection score against a learnable vector p. Concretely, they compute the score vector
y = (F_t · p) / ∥p∥   and   I = top-k(y, k),
where top-k returns the indices of the k highest-scoring entries of y. Finally, the adjacency matrix A_{t+1} is updated by removing rows and columns that are not in I, while the updated feature matrix is computed as F_{t+1} = (F_t ⊙ tanh(y))_I, i.e., the retained rows are gated by their scores.
The authors report slightly lower classification accuracies than the DiffPool layer
on most employed datasets while being much faster in computation time. A similar
approach was presented in (Gao and Ji, 2019).
To derive more expressive graph representations, Murphy et al (2019c,b) propose
relational pooling. To increase the expressive power of GNN layers, they average
over all permutations of a given graph. Formally, let G be a graph, then a represen-
tation
f(G) = (1/|Π|) ∑_{π ∈ Π} g(A_{π,π}, [F_π, I_{|V|}])   (9.4)
is learned, where Π denotes all possible permutations of the rows and columns of
the adjacency matrix of G , g is a permutation-invariant function, and [·, ·] denotes
column-wise matrix concatenation. Moreover, A_{π,π} permutes the rows and columns of the adjacency matrix A according to the permutation π in Π; similarly, F_π permutes the rows of the feature matrix F. The authors showed that the above architecture
is more expressive in terms of distinguishing non-isomorphic graphs than the WL
algorithm, and proposed sampling-based techniques to speed up the computation.
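For very small graphs, Equation (9.4) can be evaluated exactly. The sketch below averages a deliberately non-invariant function g over all permutations and thereby obtains a permutation-invariant graph representation; the concrete g is an arbitrary illustration.

```python
import itertools
import math
import numpy as np

def relational_pool(A, F, g):
    """Relational pooling (Eq. 9.4): average an arbitrary function g
    over every permutation of the nodes, appending one-hot node IDs
    to the permuted features. Exact evaluation costs n! terms, hence
    the sampling-based approximations used in practice."""
    n = A.shape[0]
    total = 0.0
    for perm in itertools.permutations(range(n)):
        p = list(perm)
        A_p = A[np.ix_(p, p)]
        F_p = np.concatenate([F[p], np.eye(n)], axis=1)
        total += g(A_p, F_p)
    return total / math.factorial(n)

# g is deliberately NOT permutation-invariant: it only looks at node 0
g = lambda A_p, F_p: float(A_p[0].sum() + F_p[0, 0])

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
F = np.array([[1.0], [2.0], [3.0]])
print(round(relational_pool(A, F, g), 6))  # 3.333333
```

Averaging over the full permutation group makes the result invariant to node relabeling even though g itself is not.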
Bianchi et al (2020) introduced a pooling layer based on spectral clustering (von Luxburg, 2007). To this end, they train a GNN together with an MLP, followed by
a softmax function, against an approximation of a relaxed version of the k-way
normalized Min-cut problem (Shi and Malik, 2000). The resulting cluster assign-
ment matrix S is used in the same way as in Section 9.3.2. The authors evaluated
their approach on standard, small-scale benchmark datasets showing promising per-
formance, especially over the DiffPool layer. For another pooling layer based on
spectral clustering, see (Ma et al, 2019d).
In the following, we briefly survey the limitations of GNNs and how their expressive
power is upper-bounded by the Weisfeiler-Leman method (Weisfeiler and Leman,
1968; Weisfeiler, 1976; Grohe, 2017). Concretely, a recent line of works by Morris
et al (2020b); Xu et al (2019d); Maron et al (2019a) connects the power or expressiv-
ity of GNNs to that of the WL algorithm. The results show that GNN architectures
generally do not have more power to distinguish between non-isomorphic graphs
than the WL algorithm. That is, for any graph structure that the WL algorithm cannot distinguish, any possible GNN with any possible choice of parameters will also not
be able to distinguish it. On the positive side, the second result states that there is a
sequence of parameter initializations such that GNNs have the same power in distin-
guishing non-isomorphic (sub-)graphs as the WL algorithm, see also Equation (9.3).
However, the WL algorithm has many shortcomings, see (Arvind et al, 2015; Kiefer et al, 2015): e.g., it cannot distinguish between cycles of different lengths, an important property for chemical molecules, and it is not able to distinguish between graphs
with different triangle counts, an important property of social networks.
To address this, many recent works have tried to build provably more expressive GNNs
for graph classification. For example, in (Morris et al, 2020b; Maron et al, 2019b,
2018) the authors proposed higher-order GNN architectures that have the same ex-
pressive power as the k-dimensional Weisfeiler-Leman algorithm (k-WL), which is,
as k grows, a more expressive generalization of the WL algorithm. In the following,
we give an overview of such works.
The first GNN architecture that overcame the limitations of the WL algorithm was
proposed in (Morris et al, 2020b). Specifically, they introduced so-called k-GNNs,
which learn features over the set of subgraphs on k nodes instead of individual vertices, by defining a notion of neighborhood between these subgraphs. Formally,
for a given k, they consider all k-element subsets [V ]k over V . Let s = {s1 , . . . , sk }
be a k-set in [V ]k , then they define the neighborhood of s as
N(s) = {t ∈ [V ]k | |s ∩ t| = k − 1} .
The local neighborhood NL (s) consists of all t in N(s) such that (v, w) in E for the
unique v ∈ s \ t and the unique w ∈ t \ s. The global neighborhood NG (s) then is
defined as N(s) \ NL (s).
Based on this neighborhood definition, one can generalize most GNN layers for
vertex embeddings to more expressive subgraph embeddings. Given a graph G , a
feature for a subgraph s can be computed as
f^t_k(s) = σ( f^{t−1}_k(s) · W^t_1 + ∑_{u ∈ N_L(s) ∪ N_G(s)} f^{t−1}_k(u) · W^t_2 ).   (9.5)
In their experiments, the authors resort to summing only over the local neighborhood for better
scalability and generalization, showing a significant boost over standard GNNs on a
quantum chemistry benchmark dataset (Wu et al, 2018; Ramakrishnan et al, 2014).
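The k-set neighborhoods underlying this layer can be enumerated directly for small graphs; a brute-force sketch (not the scalable implementation used in practice):

```python
from itertools import combinations

def kset_neighborhoods(edges, n, k):
    """Enumerate all k-element node subsets and split each one's
    neighborhood N(s) = {t : |s ∩ t| = k - 1} into the local part
    (the two differing nodes are adjacent) and the global part."""
    E = {frozenset(e) for e in edges}
    ksets = [frozenset(c) for c in combinations(range(n), k)]
    nbh = {}
    for s in ksets:
        local, glob = [], []
        for t in ksets:
            if len(s & t) == k - 1:
                v, w = next(iter(s - t)), next(iter(t - s))
                (local if frozenset((v, w)) in E else glob).append(t)
        nbh[s] = (local, glob)
    return nbh

# a triangle {0, 1, 2} with a pendant node 3 attached to node 2
nbh = kset_neighborhoods([(0, 1), (0, 2), (1, 2), (2, 3)], n=4, k=2)
local, glob = nbh[frozenset({0, 1})]
print(len(local), len(glob))  # 2 2
```

For the 2-set {0, 1}, the local neighbors are {0, 2} and {1, 2} (the differing nodes are connected by an edge), while {0, 3} and {1, 3} are global neighbors.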
The latter approach was refined in (Maron et al, 2019a) and (Morris et al, 2019).
Specifically, based on (Maron et al, 2018), Maron et al (2019a) derived an architec-
ture based on standard matrix multiplication that has at least the same power as the
3-WL. Morris et al (2019) proposed a variant of the k-WL that, unlike the original
algorithm, takes the sparsity of the underlying graph into account. Moreover, they
showed that the derived sparse variant is slightly more powerful than the k-WL in
distinguishing non-isomorphic graphs and proposed a neural architecture with the
same power as the sparse k-WL variant.
An important direction in studying graph representations’ expressive power was
taken by (Chen et al, 2019f). The authors prove that a graph representation can
approximate a function f if and only if it can distinguish all pairs of non-isomorphic
graphs G and H where f (G ) ̸= f (H ). With that in mind, they established an
equivalence between the set of pairs of graphs a representation can distinguish and
the space of functions it can approximate, further introducing a variation of the 2-
WL.
Bouritsas et al (2020) enhanced the expressivity of GNNs by annotating node
features with subgraph information. Specifically, by fixing a set of predefined, small
subgraphs, they annotated each node with their role, formally their automorphism
type, in these subgraphs, showing promising performance gains on standard bench-
mark datasets for graph classification.
Beaini et al (2020) studied how to incorporate directional information into GNNs.
Finally, You et al (2021) enhanced GNNs by uniquely coloring central vertices and
used two types of message functions to surpass the expressive power of the 1-WL,
while Sato et al (2021) and Abboud et al (2020) use random features to achieve
the same goal and additionally studied the universality properties of their derived
architectures.
In the following, we highlight some application areas of GNNs for graph classifi-
cation, focusing on the molecular domain. One of the most promising applications
of GNNs for graph classification is pharmaceutical drug research, see (Gaudelet
et al, 2020) for an overview. In this direction, a promising approach was proposed
by (Stokes et al, 2020). They used a form of directed message passing neural net-
works operating on molecular graphs to identify repurposing candidates for antibi-
otic development. Moreover, they validated their predictions in vivo, proposing suitable repurposing candidates different from known ones.
Schweidtmann et al (2020) used 2-GNNs, see Equation (9.5), to derive GNN
models for predicting three fuel ignition quality indicators, namely the derived cetane number, the research octane number, and the motor octane number, of oxygenated
and non-oxygenated hydrocarbons, indicating that the higher-order layers of Equa-
tion (9.5) provide significant gains over standard GNNs in the molecular learning
domain.
A general principled GNN for the molecular domain, denoted DimeNet, was in-
troduced by (Klicpera et al, 2020). By using an edge-based architecture, they com-
puted a message coefficient between atoms based on their relative positioning in 3D space.
Since most developments for GNNs are driven empirically, i.e., based on evalua-
tions on standard benchmark datasets, meaningful benchmark datasets are crucial
for the development of GNNs in the context of graph classification. Hence, the re-
search community has established several widely used repositories for benchmark
datasets for graph classification. Two such repositories are worth highlighting here. First, the TUDataset collection (Morris et al, 2020a), provided at www.graphlearning.io, contains over 130 datasets of various sizes from areas such as chemistry, biology, and social networks. It also provides Python-based data loaders and baseline implementations of standard graph kernels and GNNs. Moreover, the datasets can be easily accessed from well-known GNN implementation frameworks such as Deep Graph Library (Wang et al, 2019f), PyTorch Geometric (Fey and Lenssen, 2019), or Spektral (Grattarola and Alippi, 2020). Secondly,
the OGB (Weihua Hu, 2020) collection contains many large-scale graph classification benchmark datasets, e.g., from chemistry and code analysis, with data loaders, prespecified splits, and evaluation protocols. Finally, Wu et al (2018) also provide many large-scale datasets from chemo- and bioinformatics.
9.7 Summary
Muhan Zhang
10.1 Introduction
Link prediction is the problem of predicting the existence of a link between two
nodes in a network (Liben-Nowell and Kleinberg, 2007). Given the ubiquitous ex-
istence of networks, it has many applications such as friend recommendation in
social networks (Adamic and Adar, 2003), co-authorship prediction in citation net-
works (Shibata et al, 2012), movie recommendation in Netflix (Bennett et al, 2007),
protein interaction prediction in biological networks (Qi et al, 2006), drug response
prediction (Stanfield et al, 2017), metabolic network reconstruction (Oyetunde et al,
2017), hidden terrorist group identification (Al Hasan and Zaki, 2011), knowledge
graph completion (Nickel et al, 2016a), etc.
Muhan Zhang
Institute for Artificial Intelligence, Peking University, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 195
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_10
196 Muhan Zhang
Link prediction has many names in different application domains. The term “link
prediction” often refers to predicting links in homogeneous graphs, where nodes and links each have only a single type. This is the simplest setting, and most link prediction works focus on this setting. Link prediction in bipartite user-item networks is
referred to as matrix completion or recommender systems, where nodes have two
types (user and item) and links can have multiple types corresponding to different
ratings users can give to items. Link prediction in knowledge graphs is often re-
ferred to as knowledge graph completion, where each node is a distinct entity and
links have multiple types corresponding to different relations between entities. In
most cases, a link prediction algorithm designed for the homogeneous graph setting
can be easily generalized to heterogeneous graphs (e.g., bipartite graphs and knowl-
edge graphs) by considering heterogeneous node type and relation type information.
There are mainly three types of traditional link prediction methods: heuris-
tic methods, latent-feature methods, and content-based methods. Heuristic meth-
ods compute heuristic node similarity scores as the likelihood of links (Liben-
Nowell and Kleinberg, 2007). Popular ones include common neighbors (Liben-
Nowell and Kleinberg, 2007), Adamic-Adar (Adamic and Adar, 2003), preferen-
tial attachment (Barabási and Albert, 1999), and Katz index (Katz, 1953). Latent-
feature methods factorize the matrix representations of a network to learn low-
dimensional latent representations/embeddings of nodes. Popular network embed-
ding techniques such as DeepWalk (Perozzi et al, 2014), LINE (Tang et al, 2015b)
and node2vec (Grover and Leskovec, 2016), are also latent-feature methods because
they implicitly factorize some matrix representations of networks too (Qiu et al,
2018). Both heuristic methods and latent-feature methods infer future/missing links
leveraging the existing network topology. Content-based methods, on the contrary,
leverage explicit node attributes/features rather than the graph structure (Lops et al,
2011). It is shown that combining the graph topology with explicit node features
can improve the link prediction performance (Zhao et al, 2017).
By learning from graph topology and node/edge features in a unified way, graph
neural networks (GNNs) have recently shown superior link prediction performance compared to traditional methods (Kipf and Welling, 2016; Zhang and Chen, 2018b; You et al,
2019; Chami et al, 2019; Li et al, 2020e). There are two popular GNN-based link
prediction paradigms: node-based and subgraph-based. Node-based methods first
learn a node representation through a GNN, and then aggregate the pairwise node
representations as link representations for link prediction. An example is (Varia-
tional) Graph AutoEncoder (Kipf and Welling, 2016). Subgraph-based methods first
extract a local subgraph around each target link, and then apply a graph-level GNN
(with pooling) to each subgraph to learn a subgraph representation, which is used as
the target link representation for link prediction. An example is SEAL (Zhang and
Chen, 2018b). We introduce these two types of methods separately in Section 10.3.1
and 10.3.2, and discuss their expressive power differences in Section 10.3.3.
To understand GNNs’ power for link prediction, several theoretical efforts have
been made. The γ-decaying theory (Zhang and Chen, 2018b) unifies existing link
prediction heuristics into a single framework and proves their local approximability,
which justifies using GNNs to “learn” heuristics from the graph structure instead of
10 Graph Neural Networks: Link Prediction 197
using predefined ones. The theoretical analysis of labeling trick (Zhang et al, 2020c)
proves that subgraph-based approaches have a higher link representation power than
node-based approaches by being able to learn most expressive structural representa-
tions of links (Srinivasan and Ribeiro, 2020b) where node-based approaches always
fail. We introduce these theories in Section 10.4.
Finally, by analyzing the limitations of existing methods, we provide several future directions for GNN-based link prediction in Section 10.5.
In this section, we review traditional link prediction methods. They can be cate-
gorized into three classes: heuristic methods, latent-feature methods, and content-
based methods.
Heuristic methods use simple yet effective node similarity scores as the likelihood
of links (Liben-Nowell and Kleinberg, 2007; Lü and Zhou, 2011). We use x and y
to denote the source and target node between which to predict a link. We use Γ (x)
to denote the set of x’s neighbors.
The simplest heuristic is common neighbors (CN), which counts the number of neighbors two nodes share as a measurement of their likelihood of having a link:

f_CN(x, y) = |Γ(x) ∩ Γ(y)|. (10.1)

The Jaccard score normalizes CN by the total number of distinct neighbors of the two nodes:

f_Jaccard(x, y) = |Γ(x) ∩ Γ(y)| / |Γ(x) ∪ Γ(y)|. (10.2)
There is also the famous preferential attachment (PA) heuristic (Barabási and Albert, 1999), which uses the product of node degrees to measure the link likelihood:

f_PA(x, y) = |Γ(x)| · |Γ(y)|.
Fig. 10.1: Illustration of three link prediction heuristics: CN, PA and AA.
There are also high-order heuristics which require knowing the entire network.
Examples include Katz index (Katz, 1953), rooted PageRank (RPR) (Brin and Page,
2012), and SimRank (SR) (Jeh and Widom, 2002).
Katz index uses a weighted sum of all the walks between x and y where a longer
walk is discounted more:
f_Katz(x, y) = ∑_{l=1}^{∞} β^l |walks⟨l⟩(x, y)|. (10.6)
Here β is a decaying factor between 0 and 1, and |walks⟨l⟩ (x, y)| counts the length-l
walks between x and y. When we only consider length-2 walks, Katz index reduces
to CN.
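The first-order heuristics, and a truncated version of the Katz index, take only a few lines each; the truncation depth and the value of β below are illustrative choices.

```python
import numpy as np

def link_heuristics(A, x, y, beta=0.05, max_len=5):
    """Common neighbors, Jaccard, preferential attachment, and a
    Katz index truncated at walks of length max_len (the infinite
    sum converges for small beta; truncation approximates it).
    A is a binary symmetric adjacency matrix."""
    nx, ny = set(np.flatnonzero(A[x])), set(np.flatnonzero(A[y]))
    cn = len(nx & ny)                              # common neighbors
    jaccard = cn / len(nx | ny) if nx | ny else 0.0
    pa = len(nx) * len(ny)                         # preferential attachment
    katz, walks = 0.0, np.eye(A.shape[0])
    for l in range(1, max_len + 1):
        walks = walks @ A                          # walks[x, y] = #length-l walks
        katz += beta ** l * walks[x, y]
    return cn, jaccard, pa, katz

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
print(link_heuristics(A, 0, 3))
```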
Rooted PageRank (RPR) is a generalization of PageRank. It first computes the
stationary distribution πx of a random walker starting from x who randomly moves
to one of its current neighbors with probability α, or returns to x with probability
1 − α. Then it uses πx at node y (denoted by [πx]y) to predict the link (x, y). When the network is undirected, a symmetric version of rooted PageRank uses [πx]y + [πy]x as the score.
10.2.1.3 Summarization
We summarize the eight introduced heuristics in Table 10.1. For more variants of the
above heuristics, please refer to (Liben-Nowell and Kleinberg, 2007; Lü and Zhou,
2011). Heuristic methods can be regarded as computing predefined graph structure
features located in the observed node and edge structures of the network. Although
effective in many domains, these handcrafted graph structure features have limited
expressivity—they only capture a small subset of all possible structure patterns, and
cannot express general graph structure features underlying different networks. Be-
sides, heuristic methods only work well when the network formation mechanism
aligns with the heuristic. There may exist networks with complex formation mech-
anisms which no existing heuristics can capture well. Most heuristics only work for
homogeneous graphs.
Notes: Γ (x) denotes the neighbor set of vertex x. β < 1 is a damping factor. |walks⟨l⟩ (x, y)| counts
the number of length-l walks between x and y. [πx ]y is the stationary distribution probability of y
under the random walk from x with restart, see (Brin and Page, 2012). SimRank score uses a
recursive definition.
The second class of traditional link prediction methods is called latent-feature meth-
ods. In some literature, they are also called latent-factor models or embedding meth-
ods. Latent-feature methods compute latent properties or representations of nodes,
often obtained by factorizing a specific matrix derived from the network, such as the
adjacency matrix and the Laplacian matrix. These latent features of nodes are not
explicitly observable—they must be computed from the network through optimiza-
tion. Latent features are also not interpretable. That is, unlike explicit node features
where each feature dimension represents a specific property of nodes, we do not
know what each latent feature dimension describes.
One of the most popular latent-feature methods is matrix factorization (Koren et al, 2009; Ahmed et al, 2013), which originated in the recommender systems literature.
Matrix factorization factorizes the observed adjacency matrix A of the network into
the product of a low-rank latent-embedding matrix Z and its transpose. That is, it
approximately reconstructs the edge between i and j using their k-dimensional latent
embeddings zi and z j :
Â_{i, j} = z_i^⊤ z_j, (10.9)
It then minimizes the mean-squared error between the reconstructed adjacency ma-
trix and the true adjacency matrix over the observed edges to learn the latent em-
beddings:
L = (1/|E|) ∑_{(i, j) ∈ E} (A_{i, j} − Â_{i, j})². (10.10)
Finally, we can predict new links by the inner product between nodes’ latent em-
beddings. Variants of matrix factorization include using powers of A (Cangea et al,
2018) and using general node similarity matrices (Ou et al, 2016) to replace the
original adjacency matrix A. If we replace A with the Laplacian matrix L and define
the loss as

L = tr(Z^⊤ L Z)  subject to  Z^⊤ Z = I,

then the nontrivial solutions are constructed by the eigenvectors corresponding to the k smallest nonzero eigenvalues of L, which recovers the Laplacian
eigenmap technique (Belkin and Niyogi, 2002) and the solution to spectral clustering (von Luxburg, 2007).
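A minimal sketch of matrix-factorization-based link prediction; for simplicity the gradient-descent loss below runs over all entries of A rather than only the observed edges of Equation (10.10), and the toy graph and hyperparameters are illustrative.

```python
import numpy as np

def train_mf(A, k=4, lr=1.0, epochs=3000, seed=0):
    """Fit A_ij ~ z_i^T z_j (Eq. 10.9) by gradient descent on the
    squared reconstruction error over all entries of A."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Z = 0.1 * rng.normal(size=(n, k))
    for _ in range(epochs):
        Z -= lr * 4.0 * (Z @ Z.T - A) @ Z / (n * n)  # gradient step
    return Z

# two triangles joined by a single bridge edge (2, 3)
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
Z = train_mf(A)
score = Z @ Z.T  # inner products predict link likelihoods
print(score.shape)  # (6, 6)
```

After training, within-triangle node pairs receive higher scores than non-adjacent pairs across the two triangles, illustrating how the latent embeddings capture community structure.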
Network embedding methods have gained great popularity in recent years since
the pioneering work DeepWalk (Perozzi et al, 2014). These methods learn low-
dimensional representations (embeddings) for nodes, often based on training a skip-
gram model (Mikolov et al, 2013a) over random-walk-generated node sequences,
so that nodes which often appear nearby each other in a random walk (i.e., nodes
close in the network) will have similar representations. Then, the pairwise node
embeddings are aggregated as link representations for link prediction. Although
not explicitly factorizing a matrix, it is shown in (Qiu et al, 2018) that many net-
work embedding methods, including LINE (Tang et al, 2015b), DeepWalk, and
node2vec (Grover and Leskovec, 2016), implicitly factorize some matrix representa-
tions of the network. Thus, they can also be categorized into latent-feature methods.
For example, DeepWalk approximately factorizes:
log( vol(G) · (1/w) ∑_{r=1}^{w} (D^{−1}A)^r D^{−1} ) − log(b), (10.12)
where vol(G ) is the sum of node degrees, D is the diagonal degree matrix, w is
skip-gram’s window size, and b is a constant. As we can see, DeepWalk essentially
factorizes the log of some high-order normalized adjacency matrices’ sum (up to
w). To intuitively understand this, we can think of the random walk as extending a
node’s neighborhood to w hops away, so that we not only require direct neighbors to
have similar embeddings, but also require nodes reachable from each other through
w steps of random walk to have similar embeddings.
Similarly, the LINE algorithm (Tang et al, 2015b) in its second-order form implicitly factorizes

log( vol(G)(D^{−1}AD^{−1}) ) − log(b). (10.13)
10.2.2.3 Summarization
Both heuristic methods and latent-feature methods face the cold-start problem. That
is, when a new node joins the network, heuristic methods and latent-feature meth-
ods may not be able to predict its links accurately because it has no or only a few
existing links with other nodes. In this case, content-based methods might help.
Content-based methods leverage explicit content features associated with nodes for
link prediction, which have wide applications in recommender systems (Lops et al,
2011). For example, in citation networks, word distributions can be used as content
features for papers. In social networks, a user’s profile, such as their demographic in-
formation and interests, can be used as their content features (however, their friend-
ship information belongs to graph structure features because it is calculated from the
graph structure). However, content-based methods usually have worse performance
than heuristic and latent-feature methods due to not leveraging the graph structure.
Thus, they are usually used together with the other two types of methods (Koren,
2008; Rendle, 2010; Zhao et al, 2017) to enhance the link prediction performance.
In the last section, we have covered three types of traditional link prediction meth-
ods. In this section, we will talk about GNN methods for link prediction. GNN
methods combine graph structure features and content features by learning them to-
gether in a unified way, leveraging the excellent graph representation learning ability
of GNNs.
There are mainly two GNN-based link prediction paradigms, node-based and
subgraph-based. Node-based methods aggregate the pairwise node representations
learned by a GNN as the link representation. Subgraph-based methods extract a
local subgraph around each link and use the subgraph representation learned by a
GNN as the link representation.
The most straightforward way of using GNNs for link prediction is to treat GNNs
as inductive network embedding methods which learn node embeddings from lo-
cal neighborhood, and then aggregate the pairwise node embeddings of GNNs to
construct link representations. We call these methods node-based methods.
204 Muhan Zhang
A representative example is the graph autoencoder (GAE), which scores a link (i, j) as

Â_{i,j} = σ(z_i^⊤ z_j), where z_i = Z_{i,:}, Z = GCN(X, A), (10.14)

where Z is the node representation (embedding) matrix output by the GCN, the ith row of Z is node i's representation z_i, Â_{i,j} is the predicted probability for link (i, j), and σ is the sigmoid function. If X is not given, GAE can use the one-hot encoding matrix I instead. The model is trained to minimize the cross entropy between the reconstructed adjacency matrix and the true adjacency matrix:

L = − ∑_{i,j∈V} [ A_{i,j} log Â_{i,j} + (1 − A_{i,j}) log(1 − Â_{i,j}) ]. (10.15)
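To make the encoder-decoder pipeline concrete, here is a minimal NumPy sketch of a GAE-style forward pass and reconstruction loss; the toy graph, random weights, and two-layer architecture are illustrative assumptions, not the book's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: a 4-node path 0-1-2-3
n = 4
A = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

# Symmetrically normalized adjacency with self-loops (standard GCN propagation)
A_tilde = A + np.eye(n)
d = A_tilde.sum(1)
A_hat = A_tilde / np.sqrt(np.outer(d, d))

X = np.eye(n)                              # one-hot features when X is absent
W0 = rng.normal(size=(n, 8)) * 0.1
W1 = rng.normal(size=(8, 4)) * 0.1

# Two-layer GCN encoder: Z = A_hat ReLU(A_hat X W0) W1
Z = A_hat @ np.maximum(A_hat @ X @ W0, 0.0) @ W1

# Inner-product decoder: A_pred[i, j] = sigma(z_i^T z_j)
A_pred = 1.0 / (1.0 + np.exp(-(Z @ Z.T)))

# Cross entropy between reconstructed and true adjacency
eps = 1e-9
loss = -np.mean(A * np.log(A_pred + eps) + (1 - A) * np.log(1 - A_pred + eps))
```

In practice the weights would be trained by gradient descent on this loss; the sketch only shows one forward evaluation.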
Given p(A|Z) and p(Z), we may compute the posterior distribution of Z using
Bayes’ rule. However, this distribution is often intractable. Thus, given the adja-
cency matrix A and node feature matrix X, VGAE uses graph neural networks to
approximate the posterior distribution of the node embedding matrix Z:
10 Graph Neural Networks: Link Prediction 205
q(Z|X, A) = ∏_{i∈V} q(z_i|X, A), where q(z_i|X, A) = N(z_i | µ_i, diag(σ_i²)). (10.18)
Here, the mean µ_i and variance σ_i² of z_i are given by two GCNs. Then, VGAE maximizes the evidence lower bound to learn the GCN parameters:

L = E_{q(Z|X,A)}[log p(A|Z)] − KL(q(Z|X, A) ∥ p(Z)),

where KL denotes the Kullback-Leibler divergence and the prior p(Z) is a standard Normal distribution.
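The following NumPy sketch shows the two pieces of this objective, a reparameterized sample of Z and the closed-form KL term against a standard Normal prior; the GCN outputs µ and log σ² are stubbed with random matrices purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 8

# Stand-ins for the two GCN outputs that parameterize q(z_i | X, A)
mu = rng.normal(size=(n, d)) * 0.1
logvar = rng.normal(size=(n, d)) * 0.1

# Reparameterization: z_i = mu_i + sigma_i * eps, with eps ~ N(0, I)
eps = rng.normal(size=(n, d))
Z = mu + np.exp(0.5 * logvar) * eps

# KL(q(Z|X,A) || p(Z)) for a standard Normal prior, in closed form
kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))

# Reconstruction term E_q[log p(A|Z)] with the inner-product decoder,
# evaluated on a toy 4-node path graph
A = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
A_pred = 1.0 / (1.0 + np.exp(-(Z @ Z.T)))
recon = np.sum(A * np.log(A_pred + 1e-9) + (1 - A) * np.log(1 - A_pred + 1e-9))

elbo = recon - kl   # VGAE maximizes this lower bound
```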
There are many variants of GAE and VGAE. For example, ARGE (Pan et al, 2018)
enhances GAE with an adversarial regularization to regularize the node embeddings
to follow a prior distribution. S-VAE (Davidson et al, 2018) replaces the Normal
distribution in VGAE with a von Mises-Fisher distribution to model data with a hy-
perspherical latent structure. MGAE (Wang et al, 2017a) uses a marginalized graph
autoencoder to reconstruct node features from corrupted ones through a GCN and
applies it to graph clustering.
GAE represents a general class of node-based methods, where a GNN is first used
to learn node embeddings and pairwise node embeddings are aggregated to learn
link representations. In principle, we can replace the GCN used in GAE/VGAE with
any GNN, and replace the inner product z_i^⊤ z_j with any aggregation function over {z_i, z_j} and feed the aggregated link representation to an MLP to predict the link
(i, j). Following this methodology, we can generalize any GNN designed for learn-
ing node representations to link prediction. For example, HGCN (Chami et al, 2019)
combines hyperbolic graph convolutional neural networks with a Fermi-Dirac decoder for aggregating pairwise node embeddings and outputting link probabilities:

p((i, j) ∈ E | z_i, z_j) = [e^{(d(z_i, z_j)² − r)/t} + 1]^{−1},

where d(·, ·) computes the hyperbolic distance and r, t are hyperparameters.
Position-aware GNN (PGNN) (You et al, 2019) aggregates messages only from some selected anchor nodes during message passing to capture the position information of nodes. The inner product between node embeddings is then used to predict
links. The PGNN paper also generalizes other GNNs, including GAT (Petar et al,
2018), GIN (Xu et al, 2019d) and GraphSAGE (Hamilton et al, 2017b), to the link
prediction setting based on the inner-product decoder.
Many graph neural networks use link prediction as an objective for training node embeddings in an unsupervised manner, even though their final task is still node classification. For example, after computing the node embeddings, GraphSAGE (Hamilton et al, 2017b) minimizes the following negative-sampling objective for each z_i to encourage co-occurring nodes to have similar representations:

J(z_i) = −log(σ(z_i^⊤ z_j)) − k_n · E_{j′∼p_n} [log(σ(−z_i^⊤ z_{j′}))],
where j is a node that co-occurs near i on some fixed-length random walk, p_n is the negative sampling distribution, and k_n is the number of negative samples. If we focus on length-2 random walks, the above loss reduces to a link prediction objective. Compared to the GAE loss in Equation (10.15), the above objective does not consider all O(n²) negative links but instead uses negative sampling to consider only k_n negative pairs (i, j′) for each positive pair (i, j), and is thus more suitable for large graphs.
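A small sketch of this negative-sampling objective for a single positive pair; the embeddings and the number of negatives k_n are made up, and the expectation is replaced by a sum over drawn samples:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unsup_loss(z_i, z_j, z_negs):
    """Negative-sampling loss for one positive pair (i, j):
    -log sigma(z_i . z_j) - sum over negatives j' of log sigma(-z_i . z_j')."""
    pos = -np.log(sigmoid(z_i @ z_j) + 1e-9)
    neg = -np.sum(np.log(sigmoid(-(z_negs @ z_i)) + 1e-9))
    return pos + neg

dim = 16
z_i = rng.normal(size=dim)
z_j = z_i + 0.01 * rng.normal(size=dim)   # co-occurring node: similar embedding
z_negs = rng.normal(size=(5, dim))        # k_n = 5 sampled negatives

loss = unsup_loss(z_i, z_j, z_negs)
```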
In the context of recommender systems, there are also many node-based meth-
ods that can be seen as variants of GAE/VGAE. Monti et al (2017) use GNNs to
learn user and item embeddings from their respective nearest-neighbor networks,
and use the inner product between user and item embeddings to predict links. Berg
et al (2017) propose the graph convolutional matrix completion (GC-MC) model
which applies a GNN to the user-item bipartite graph to learn user and item embed-
dings. They use one-hot encoding of node indices as the input node features, and
use the bilinear product between user and item embeddings to predict links. Spec-
tralCF (Zheng et al, 2018a) uses a spectral-GNN on the bipartite graph to learn node
embeddings. The PinSage model (Ying et al, 2018b) uses node content features as
the input node features, and uses the GraphSAGE (Hamilton et al, 2017b) model to
map related items to similar embeddings.
In the context of knowledge graph completion, R-GCN (Relational Graph Con-
volutional Neural Network) (Schlichtkrull et al, 2018) is one representative node-
based method, which considers the relation types by giving different weights to
different relation types during the message passing. SACN (Structure-Aware Con-
volutional Network) (Shang et al, 2019) performs message passing for each relation
type’s induced subgraphs individually and then uses a weighted sum of node em-
beddings from different relation types.
Subgraph-based methods extract a local subgraph around each target link and learn
a subgraph representation through a GNN for link prediction.
Fig. 10.2: Illustration of the SEAL framework. SEAL first extracts enclosing sub-
graphs around target links to predict. It then applies a node labeling to the enclosing
subgraphs to differentiate nodes of different roles within a subgraph. Finally, the
labeled subgraphs are fed into a GNN to learn graph structure features (supervised
heuristics) for link prediction.
DRNL works as follows. First, assign label 1 to the two target nodes x and y. Then, for any node i with double radius (d(i, x), d(i, y)) = (1, 1), assign label 2. Nodes with double radius (1, 2) or (2, 1) get label 3; nodes with (1, 3) or (3, 1) get label 4; nodes with (2, 2) get label 5; nodes with (1, 4) or (4, 1) get label 6; nodes with (2, 3) or (3, 2) get label 7; and so forth. In other words, DRNL iteratively assigns larger labels to nodes with a larger double radius with respect to the two center nodes.
DRNL satisfies the following criteria: 1) The two target nodes x and y always
have the distinct label “1” so that they can be distinguished from the context nodes.
2) Nodes i and j have the same label if and only if their "double radius" is the same, i.e., i and j have the same distances to (x, y). This way, nodes in the same relative positions within the subgraph (described by the double radius (d(i, x), d(i, y))) always have the same label.
DRNL has a closed-form solution for directly mapping (d(i, x), d(i, y)) to labels:

f(i) = 1 + min(d_x, d_y) + (d/2)[(d/2) + (d%2) − 1],

where d_x := d(i, x), d_y := d(i, y), d := d_x + d_y, and (d/2) and (d%2) are the integer quotient and remainder of d divided by 2, respectively. For nodes with d(i, x) = ∞ or d(i, y) = ∞, DRNL gives them a null label 0.
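The closed form can be implemented in a few lines; the helper name is mine, and the examples reproduce the label sequence described above:

```python
def drnl_label(dx, dy):
    """Closed-form Double-Radius Node Labeling.

    dx, dy: distances d(i, x) and d(i, y) to the two target nodes
    (pass float('inf') for unreachable nodes, which get the null label 0).
    """
    if dx == float('inf') or dy == float('inf'):
        return 0
    d = dx + dy
    return 1 + min(dx, dy) + (d // 2) * ((d // 2) + (d % 2) - 1)

# Reproduces the sequence from the text:
# (1,1)->2, (1,2)->3, (1,3)->4, (2,2)->5, (1,4)->6, (2,3)->7
labels = [drnl_label(*p) for p in [(1, 1), (1, 2), (1, 3), (2, 2), (1, 4), (2, 3)]]
```

Note that a target node itself, with double radius (0, d(x, y)), receives label 1 under this formula, consistent with the first step of DRNL.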
After getting the DRNL labels, SEAL transforms them into one-hot encoding
vectors, or feeds them to an embedding layer to get their embeddings. These new
feature vectors are concatenated with the original node content features (if any) to
form the new node features. SEAL additionally allows concatenating some pre-
trained node embeddings such as node2vec embeddings to node features. How-
ever, as its experimental results show, adding pretrained node embeddings does not
show clear benefits to the final performance (Zhang and Chen, 2018b). Furthermore,
adding pretrained node embeddings makes SEAL lose the inductive learning ability.
Finally, SEAL feeds these enclosing subgraphs as well as their new node feature
vectors into a graph-level GNN, DGCNN (Zhang et al, 2018g), to learn a graph
classification function. The ground-truth label of each subgraph is whether the two center nodes really have a link. To train this GNN, SEAL randomly samples N existing links from the network as positive training links, and samples an equal number
of unobserved links (random node pairs) as negative training links. After training,
SEAL applies the trained GNN to new unobserved node pairs’ enclosing subgraphs
to predict their links. The entire SEAL framework is illustrated in Figure 10.2.
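For illustration, here is a minimal sketch of the enclosing subgraph extraction step on an adjacency-list graph; the function names and data layout are my own, and SEAL's actual implementation additionally builds the induced edges, labels, and features:

```python
from collections import deque

def k_hop_nodes(adj, source, h):
    """Return all nodes within h hops of `source` via breadth-first search.
    `adj` maps each node to the set of its neighbors."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == h:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

def enclosing_subgraph_nodes(adj, x, y, h=1):
    """Node set of the h-hop enclosing subgraph around target link (x, y):
    the union of the two endpoints' h-hop neighborhoods."""
    return k_hop_nodes(adj, x, h) | k_hop_nodes(adj, y, h)

# Toy path graph 0-1-2-3-4; 1-hop enclosing subgraph around (1, 2)
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
nodes = enclosing_subgraph_nodes(adj, 1, 2, h=1)   # {0, 1, 2, 3}
```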
SEAL achieves strong link prediction performance, consistently outperforming predefined heuristics (Zhang and Chen, 2018b).
SEAL inspired many follow-up works. For example, Cai and Ji (2020) propose to
use enclosing subgraphs of different scales to learn scale-invariant models. Li et al
(2020e) propose Distance Encoding (DE) which generalizes DRNL to node classi-
fication and general node set classification problems and theoretically analyzes the
10 Graph Neural Networks: Link Prediction 209
power it brings to GNNs. The line graph link prediction (LGLP) model (Cai et al,
2020c) transforms each enclosing subgraph into its line graph and uses the center
node embedding in the line graph to predict the original link.
SEAL is also generalized to the bipartite graph link prediction problem of rec-
ommender systems (Zhang and Chen, 2019). The model is called Inductive Graph-
based Matrix Completion (IGMC). IGMC also samples an enclosing subgraph
around each target (user, item) pair, but uses a different node labeling scheme. For
each enclosing subgraph, it first gives label 0 and label 1 to the target user and the
target item, respectively. The remaining nodes’ labels are determined based on both
their node types and their distances to the target user and item: if a user-type node’s
shortest path to reach either the target user or the target item has a length k, it will get
a label 2k; if an item-type node’s shortest path to reach the target user or the target
item has a length k, it will get a label 2k + 1. This way, the target nodes can always
be distinguished from the context nodes, and users can be distinguished from items
(users always have even labels). Furthermore, nodes of different distances to the
center nodes can be differentiated, too. Finally, the enclosing subgraphs are fed into
a GNN with R-GCN convolution layers to incorporate the edge type information
(each edge type corresponds to a different rating), and the output representations of the target user and the target item are concatenated as the link representation to
predict the target rating. IGMC is an inductive matrix completion model without
relying on any content features, i.e., the model predicts ratings based only on local
graph structures, and the learned model can transfer to unseen users/items or new
tasks without retraining.
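The IGMC labeling rule described above is simple to express in code; this function is my own sketch of the rule, not the authors' implementation:

```python
def igmc_label(node_type, k, is_target=False):
    """IGMC node label for an enclosing-subgraph node.

    node_type: 'user' or 'item'; k: shortest-path distance to the
    target user/item pair; the target user gets 0, the target item 1.
    Users always receive even labels, items odd labels.
    """
    if is_target:
        return 0 if node_type == 'user' else 1
    return 2 * k if node_type == 'user' else 2 * k + 1

# For example: a user at distance 1 gets label 2, an item at distance 1 gets 3
```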
In the context of knowledge graph completion, SEAL is generalized to GraIL
(Graph Inductive Learning) (Teru et al, 2020). It also follows the enclosing subgraph
extraction, node labeling, and GNN prediction framework. For enclosing subgraph
extraction, it extracts the subgraph induced by all the nodes that occur on at least
one path of length at most h + 1 between the two target nodes. Unlike SEAL, the
enclosing subgraph of GraIL does not include those nodes that are only neighbors
of one target node but are not neighbors of the other. This is because, for knowledge graph reasoning, paths connecting the two target nodes are more important than dangling nodes. After extracting the enclosing subgraphs, GraIL applies
DRNL to label the enclosing subgraphs and uses a variant of R-GCN by enhancing
R-GCN with edge attention to output the score for each link to predict.
At first glance, both node-based methods and subgraph-based methods learn graph
structure features around target links based on a GNN. However, as we will show,
subgraph-based methods actually have a higher link representation ability than
node-based methods due to modeling the associations between two target nodes.
Fig. 10.3: The different link representation ability between node-based methods and
subgraph-based methods. In the left graph, nodes v2 and v3 are isomorphic; links
(v1 , v2 ) and (v4 , v3 ) are isomorphic; link (v1 , v2 ) and link (v1 , v3 ) are not isomor-
phic. However, a node-based method cannot differentiate (v1 , v2 ) and (v1 , v3 ). In
the middle graph, when we predict (v1 , v2 ), we label these two nodes differently
from the rest, so that a GNN is aware of the target link when learning v1 and v2 ’s
representations. Similarly, when predicting (v1 , v3 ), nodes v1 and v3 will be labeled
differently (shown in the right graph). This way, the representation of v2 in the left
graph will be different from the representation of v3 in the right graph, enabling
GNNs to distinguish (v1 , v2 ) and (v1 , v3 ).
When using GNNs for link prediction, we want to learn graph structure features
useful for predicting links based on message passing. However, it is usually not
possible to use very deep message passing layers to aggregate information from the
entire network due to the computation complexity introduced by neighbor explosion
and the issue of oversmoothing (Li et al, 2018b). This is why node-based methods
(such as GAE) only use 1 to 3 message passing layers in practice, and why subgraph-
based methods only extract a small 1-hop or 2-hop local enclosing subgraph around
each link.
The γ-decaying heuristic theory (Zhang and Chen, 2018b) mainly answers how
much structural information useful for link prediction is preserved in local neigh-
borhood of the link, in order to justify applying a GNN only to a local enclos-
ing subgraph in subgraph-based methods. To answer this question, the γ-decaying
heuristic theory studies how well existing link prediction heuristics can be approximated from local enclosing subgraphs. If all these existing successful heuristics can
be accurately computed or approximated from local enclosing subgraphs, then we
are more confident to use a GNN to learn general graph structure features from these
local subgraphs.
Firstly, a direct conclusion from the definition of h-hop enclosing subgraphs (Definition 10.1) is:

Proposition 10.1. Any h-order heuristic score for (x, y) can be accurately calculated from the h-hop enclosing subgraph G^h_{x,y} around (x, y).
For example, a 1-hop enclosing subgraph contains all the information needed to
calculate any first-order heuristics, while a 2-hop enclosing subgraph contains all the
information needed to calculate any first and second-order heuristics. This indicates
that first and second-order heuristics can be learned from local enclosing subgraphs
based on an expressive GNN. However, how about high-order heuristics? High-
order heuristics usually have better link prediction performance than local ones. To
study high-order heuristics’ local approximability, the γ-decaying heuristic theory
first defines a general formulation of high-order heuristics, namely the γ-decaying
heuristic.
Definition 10.2. (γ-decaying heuristic) A γ-decaying heuristic for link (x, y) has the following form:

H(x, y) = η ∑_{l=1}^{∞} γ^l f(x, y, l), (10.23)

where γ is a decaying factor between 0 and 1, η is a positive constant or a positive function of γ that is upper bounded by a constant, and f is a nonnegative function of x, y, l under the given network.
Next, it proves that under certain conditions, any γ-decaying heuristic can be
approximated from an h-hop enclosing subgraph, and the approximation error de-
creases at least exponentially with h.
Theorem 10.1. Given a γ-decaying heuristic H(x, y) = η ∑_{l=1}^{∞} γ^l f(x, y, l), if f(x, y, l) satisfies:

• (property 1) f(x, y, l) ≤ λ^l where λ < 1/γ; and
• (property 2) f(x, y, l) is calculable from G^h_{x,y} for l = 1, 2, · · · , g(h), where g(h) = ah + b with a, b ∈ ℕ and a > 0,

then H(x, y) can be approximated from G^h_{x,y} and the approximation error decreases at least exponentially with h.
The proof of this theorem (Zhang and Chen, 2018b) shows that a smaller γλ leads to a faster decaying speed and a smaller approximation error. To approximate a γ-decaying heuristic, one just needs to sum its first few terms, which are calculable from an h-hop enclosing subgraph.
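The following NumPy check illustrates this on a toy graph (the graph and β are chosen arbitrarily): truncating the Katz series, a γ-decaying heuristic discussed below, to the first few terms already matches the exact value closely:

```python
import numpy as np

# Toy graph: a 5-node cycle
n = 5
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

beta = 0.05   # damping factor (small, as is common in practice)

# Exact Katz matrix: sum over l >= 1 of beta^l A^l = (I - beta A)^{-1} - I
katz_exact = np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)

# Truncation to walks of length <= L, i.e., the terms calculable from
# an h-hop enclosing subgraph with L = 2h + 1 (here h = 1)
L = 3
katz_trunc = sum(beta**l * np.linalg.matrix_power(A, l) for l in range(1, L + 1))

# Entrywise error; the tail is bounded by sum over l > L of (beta d)^l
err = np.abs(katz_exact - katz_trunc).max()
```

With maximum degree d = 2 and β = 0.05, the truncated tail is bounded by roughly (βd)^{L+1}/(1 − βd) ≈ 1.1 × 10⁻⁴, matching the exponential decay promised by Theorem 10.1.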
Then, a natural question to ask is which existing high-order heuristics belong to
γ-decaying heuristics that allow local approximation. Surprisingly, the γ-decaying heuristic theory shows that three of the most popular high-order heuristics, the Katz index, rooted PageRank and SimRank (listed in Table 10.1), are all γ-decaying heuristics satisfying the properties in Theorem 10.1.
To prove these, we need the following lemma first.

Lemma 10.1. Any walk between x and y with length l ≤ 2h + 1 is included in G^h_{x,y}.

Proof. Given any walk w = ⟨x, v_1, · · · , v_{l−1}, y⟩ with length l, we will show that every node v_i is included in G^h_{x,y}. Consider any v_i. Assume d(v_i, x) ≥ h + 1 and d(v_i, y) ≥ h + 1. Then, 2h + 1 ≥ l = |⟨x, v_1, · · · , v_i⟩| + |⟨v_i, · · · , v_{l−1}, y⟩| ≥ d(v_i, x) + d(v_i, y) ≥ 2h + 2, a contradiction. Thus, d(v_i, x) ≤ h or d(v_i, y) ≤ h. By the definition of G^h_{x,y}, v_i must be included in G^h_{x,y}.
10.4.1.2 Katz index

The Katz index for (x, y) is defined as

Katz(x, y) = ∑_{l=1}^{∞} β^l |walks^⟨l⟩(x, y)| = ∑_{l=1}^{∞} β^l [A^l]_{x,y},

where walks^⟨l⟩(x, y) is the set of length-l walks between x and y, and A^l is the l-th power of the adjacency matrix of the network. The Katz index sums over the collection of all walks between x and y, where a walk of length l is damped by β^l (0 < β < 1), giving more weight to shorter walks.

The Katz index is directly defined in the form of a γ-decaying heuristic with η = 1, γ = β, and f(x, y, l) = |walks^⟨l⟩(x, y)|. According to Lemma 10.1, |walks^⟨l⟩(x, y)| is calculable from G^h_{x,y} for l ≤ 2h + 1, thus property 2 in Theorem 10.1 is satisfied.
Proposition 10.2. For any nodes i, j, [A^l]_{i,j} is bounded by d^l, where d is the maximum node degree of the network.

Proof. We prove it by induction. When l = 1, A_{i,j} ≤ d for any (i, j). Thus the base case is correct. Now, assume by induction that [A^l]_{i,j} ≤ d^l for any (i, j); then

[A^{l+1}]_{i,j} = ∑_{k=1}^{|V|} [A^l]_{i,k} A_{k,j} ≤ d^l ∑_{k=1}^{|V|} A_{k,j} ≤ d^l · d = d^{l+1}.
Taking λ = d, we can see that whenever d < 1/β, the Katz index satisfies property 1 in Theorem 10.1. In practice, the damping factor β is often set to very small values such as 5 × 10⁻⁴ (Liben-Nowell and Kleinberg, 2007), which implies that the Katz index can be very well approximated from the h-hop enclosing subgraph.
10.4.1.3 PageRank
The rooted PageRank for node x calculates the stationary distribution of a random
walker starting at x, who iteratively moves to a random neighbor of its current po-
sition with probability α or returns to x with probability 1 − α. Let πx denote the
stationary distribution vector. Let [πx ]i denote the probability that the random walker
is at node i under the stationary distribution.
Let P be the transition matrix with P_{i,j} = 1/|Γ(v_j)| if (i, j) ∈ E and P_{i,j} = 0 otherwise. Let e_x be a vector with the xth element being 1 and the others being 0. The stationary distribution satisfies

π_x = αPπ_x + (1 − α)e_x.
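A small NumPy sketch of rooted PageRank by fixed-point iteration; the toy graph and α = 0.85 are illustrative choices:

```python
import numpy as np

# Toy graph: a 4-node path 0-1-2-3
n = 4
A = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

deg = A.sum(axis=0)
P = A / deg              # column-stochastic: P[i, j] = 1/|Gamma(v_j)| for (i, j) in E
alpha = 0.85             # probability of moving to a random neighbor
x = 0                    # root node
e_x = np.zeros(n)
e_x[x] = 1.0

# Iterate pi <- alpha P pi + (1 - alpha) e_x to the stationary distribution
pi = e_x.copy()
for _ in range(500):
    pi = alpha * (P @ pi) + (1 - alpha) * e_x

score_xy = pi[3]         # rooted-PageRank link score [pi_x]_y for y = 3
```

Since the iteration is a contraction with factor α, a few hundred steps are far more than enough for convergence on a graph this small.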
When used for link prediction, the score for (x, y) is given by [πx ]y (or [πx ]y +
[πy ]x for symmetry). To show that rooted PageRank is a γ-decaying heuristic, we
introduce the inverse P-distance theory (Jeh and Widom, 2003), which states that
[π_x]_y can be equivalently written as follows:

[π_x]_y = ∑_{w: x⇝y} P[w] α^{len(w)} (1 − α),

where the summation is taken over all walks w starting at x and ending at y (possibly touching x and y multiple times). For a walk w = ⟨v_0, v_1, · · · , v_k⟩, len(w) := |⟨v_0, v_1, · · · , v_k⟩| is the length of the walk. The term P[w] is defined as ∏_{i=0}^{k−1} 1/|Γ(v_i)|, which can be interpreted as the probability of traveling along w. Now we have the following theorem.
Theorem 10.2. The rooted PageRank heuristic is a γ-decaying heuristic which sat-
isfies the properties in Theorem 10.1.
10.4.1.4 SimRank
The SimRank score (Jeh and Widom, 2002) is motivated by the intuition that two nodes are similar if their neighbors are also similar. It is defined in the following recursive way: if x = y, then s(x, y) := 1; otherwise,

s(x, y) := γ · (∑_{a∈Γ(x)} ∑_{b∈Γ(y)} s(a, b)) / (|Γ(x)| · |Γ(y)|),

where γ is a constant between 0 and 1. According to (Jeh and Widom, 2002), SimRank has an equivalent definition:
s(x, y) = ∑_{w: (x,y)⊸(z,z)} P[w] γ^{len(w)}, (10.30)

where w : (x, y) ⊸ (z, z) denotes all simultaneous walks such that one walk starts at x, the other walk starts at y, and they first meet at any vertex z. For a simultaneous walk w = ⟨(v_0, u_0), · · · , (v_k, u_k)⟩, len(w) = k is the length of the walk. The term P[w] is similarly defined as ∏_{i=0}^{k−1} 1/(|Γ(v_i)| · |Γ(u_i)|), describing the probability of this walk. Now we have the following theorem.

Theorem 10.3. The SimRank heuristic is a γ-decaying heuristic which satisfies the properties in Theorem 10.1.
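For reference, here is a direct (naive, quadratic-per-pair) implementation of the recursive SimRank definition; the function and its defaults are my own illustrative choices:

```python
import numpy as np

def simrank(adj, gamma=0.8, iters=20):
    """Naive iterative SimRank. `adj` maps each node to the set of its neighbors."""
    nodes = sorted(adj)
    idx = {v: k for k, v in enumerate(nodes)}
    n = len(nodes)
    S = np.eye(n)                      # s(x, x) = 1 by definition
    for _ in range(iters):
        S_new = np.eye(n)
        for a in nodes:
            for b in nodes:
                if a == b or not adj[a] or not adj[b]:
                    continue
                # Average similarity over all neighbor pairs, damped by gamma
                total = sum(S[idx[u], idx[v]] for u in adj[a] for v in adj[b])
                S_new[idx[a], idx[b]] = gamma * total / (len(adj[a]) * len(adj[b]))
        S = S_new
    return S

# Path graph 0-1-2: nodes 0 and 2 share the single neighbor 1,
# so s(0, 2) = gamma * s(1, 1) = gamma
adj = {0: {1}, 1: {0, 2}, 2: {1}}
S = simrank(adj)
```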
10.4.1.5 Discussion
There exist several other high-order heuristics based on path counting or random walks (Lü and Zhou, 2011), which can be incorporated into the γ-decaying heuristic framework as well. Another interesting finding is that first and second-order heuristics can be unified into this framework too. For example, common neighbors can be seen as a γ-decaying heuristic with η = γ = 1, f(x, y, 1) = |Γ(x) ∩ Γ(y)|, and f(x, y, l) = 0 for l > 1.
The above results reveal that most existing link prediction heuristics inherently share the same γ-decaying heuristic form, and thus can be effectively approximated from an h-hop enclosing subgraph with an exponentially small approximation error. The ubiquity of γ-decaying heuristics is not by accident; it implies that a successful link prediction heuristic should place exponentially smaller weight on structures far away from the target, as remote parts of the network intuitively contribute little to link existence. The γ-decaying heuristic theory builds the foundation for learning supervised heuristics from local enclosing subgraphs: it implies that local enclosing subgraphs already contain enough information to learn good graph structure features for link prediction, which is much desired considering that learning from the entire network is often infeasible. This motivates the proposition of subgraph-based methods.
To summarize, from small enclosing subgraphs extracted around links, we are
able to accurately calculate first and second-order heuristics, and approximate a
wide range of high-order heuristics with small errors. Therefore, given a sufficiently
expressive GNN, learning from such enclosing subgraphs is expected to achieve
performance at least as good as a wide range of heuristics.
For simplicity, in the rest of this section we will use structural representation to denote most expressive structural representation, and we will omit A when it is clear from context. We call Γ(i, A) a structural node representation for i, and call Γ({i, j}, A) a structural link representation for (i, j).
Definition 10.4 requires the structural representations of two node sets to be the
same if and only if they are isomorphic. That is, isomorphic node sets always have
the same structural representation, while non-isomorphic node sets always have
different structural representations. This is in contrast to positional node embed-
dings such as DeepWalk (Perozzi et al, 2014) and matrix factorization (Mnih and
Salakhutdinov, 2008), where two isomorphic nodes can have different node embed-
dings (Ribeiro et al, 2017).
So why do we need structural representations? Formally speaking, Srinivasan
and Ribeiro (2020b) prove that any joint prediction task over node sets only requires
most-expressive structural representations of node sets, which are the same for two
node sets if and only if these two node sets are isomorphic. This means, for link pre-
diction tasks, we need to learn the same representation for isomorphic links while
discriminating non-isomorphic links by giving them different representations. Intu-
itively speaking, two links being isomorphic means they should be indistinguishable
from any perspective—if one link exists, the other should exist too, and vice versa.
Therefore, link prediction ultimately requires finding such a structural link repre-
sentation for node pairs which can uniquely identify link isomorphism classes.
According to Figure 10.3 left, node-based methods that directly aggregate two
node representations cannot learn such a valid structural link representation because
they cannot differentiate non-isomorphic links such as (v1 , v2 ) and (v1 , v3 ). One may
wonder whether using one-hot encodings of node indices as the input node features helps node-based methods learn such a structural link representation. Indeed, using
node-discriminating features enables node-based methods to learn different repre-
sentations for (v1 , v2 ) and (v1 , v3 ) in Figure 10.3 left. However, it also loses GNN’s
ability to map isomorphic nodes (such as v2 and v3 ) and isomorphic links (such
as (v1 , v2 ) and (v4 , v3 )) to the same representations, since any two nodes already
have different representations from the beginning. This might result in poor generalization ability: two nodes/links may have different final representations even if they share identical neighborhoods.
To ease our analysis, we also define a node-most-expressive GNN, which gives
different representations to all non-isomorphic nodes and gives the same represen-
tation to all isomorphic nodes. In other words, a node-most-expressive GNN learns
structural node representations.
Now, we are ready to introduce the labeling trick and see how it enables learning
structural representations of node sets. As we have seen in Section 10.4.2, a simple
zero-one labeling trick can help a GNN distinguish non-isomorphic links such as
(v1 , v2 ) and (v1 , v3 ) in Figure 10.3 left. At the same time, isomorphic links, such
as (v1 , v2 ) and (v4 , v3 ), will still have the same representation, since the zero-one
labeled graph for (v1 , v2 ) is still symmetric to the zero-one labeled graph for (v4 , v3 ).
This brings an exclusive advantage over using one-hot encoding of node indices.
Below we give the formal definition of labeling trick, which incorporates the
zero-one labeling trick as one specific form.
Definition 10.6. (Labeling trick) Given (S, A), we stack a labeling tensor L^(S) ∈ ℝ^{n×n×d} in the third dimension of A to get a new A^(S) ∈ ℝ^{n×n×(k+d)}, where L satisfies: ∀S, A, S′, A′, π ∈ Π_n, (1) L^(S) = π(L^(S′)) ⇒ S = π(S′), and (2) S = π(S′), A = π(A′) ⇒ L^(S) = π(L^(S′)).
To explain a bit, the labeling trick assigns a label vector to each node/edge in graph A, which constitutes the labeling tensor L^(S). By concatenating A and L^(S), we get the adjacency tensor A^(S) of the new labeled graph. By definition we can assign labels to both nodes and edges. For simplicity, here we only consider node labels, i.e., we let the off-diagonal components L^(S)_{i,j,:} be all zero.
The labeling tensor L^(S) should satisfy the two conditions in Definition 10.6. The first condition requires the target nodes in S to have labels distinct from those of the remaining nodes, so that S is distinguishable from the rest of the graph. This is because if a permutation π preserving node labels exists between the nodes of A and A′, then S and S′ must have distinct labels to guarantee that S′ is mapped to S by π. The second condition requires the labeling function to be permutation equivariant, i.e., when (S, A) and (S′, A′) are isomorphic under π, the corresponding nodes i ∈ S, j ∈ S′ with i = π(j) must always have the same label. In other words, the labeling should be consistent across different S.
For example, the zero-one labeling is a valid labeling trick by always giving label 1
to nodes in S and 0 otherwise, which is both consistent and S-discriminating. How-
ever, an all-one labeling is not a valid labeling trick, because it cannot distinguish
the target set S.
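A minimal sketch of applying the zero-one labeling trick at the feature level (the helper name is mine): an indicator column for the target set S is appended to the node features, so the same graph yields different GNN inputs for different target links:

```python
import numpy as np

def zero_one_label(X, S):
    """Append a zero-one indicator column marking the target node set S.
    X: (n, f) node feature matrix; S: iterable of target node indices."""
    n = X.shape[0]
    label = np.zeros((n, 1))
    label[list(S)] = 1.0
    return np.concatenate([X, label], axis=1)

X = np.ones((5, 3))                 # identical features: all nodes look alike to a GNN
X_12 = zero_one_label(X, {1, 2})    # labeled input for target link (1, 2)
X_13 = zero_one_label(X, {1, 3})    # relabeled input for target link (1, 3)
```

Because the indicator depends only on membership in S, isomorphic target sets still receive identically distributed inputs, unlike one-hot node indices.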
Now we introduce the main theorem of labeling trick showing that with a valid
labeling trick, a node-most-expressive GNN can learn structural link representations
by aggregating its node representations learned from the labeled graph.
Theorem 10.4. Given a node-most-expressive GNN and an injective set aggregation function AGG, for any S, A, S′, A′, we have AGG({GNN(i, A^(S)) | i ∈ S}) = AGG({GNN(j, A^(S′)) | j ∈ S′}) if and only if (S, A) and (S′, A′) are isomorphic.

The proof of this theorem can be found in Appendix A of (Zhang et al, 2020c).
Theorem 10.4 implies that AGG({GNN(i, A(S) )|i ∈ S}) is a structural represen-
tation for (S, A). Remember that directly aggregating structural node representa-
tions learned from the original graph A does not lead to structural link representa-
tions. Theorem 10.4 shows that aggregating over the structural node representations
learned from the adjacency tensor A(S) of the labeled graph, somewhat surprisingly,
results in a structural representation for S.
The significance of Theorem 10.4 is that it closes the gap between GNN’s node
representation nature and link prediction’s link representation requirement, which
resolves the open question raised in (Srinivasan and Ribeiro, 2020b) concerning node-based GNN methods' ability to perform link prediction. Although directly
aggregating pairwise node representations learned by GNNs does not lead to struc-
tural link representations, combining GNNs with a labeling trick enables learning
structural link representations.
It can be easily proved that the zero-one labeling, DRNL and Distance Encoding (DE) (Li et al, 2020e) are all valid labeling tricks. This explains why subgraph-based methods empirically outperform node-based methods (Zhang and Chen, 2018b; Zhang et al, 2020c).
In this section, we introduce several important future directions for link prediction:
accelerating subgraph-based methods, designing more powerful labeling tricks, and
understanding when to use one-hot features.
Subgraph-based methods have a much higher computation cost than node-based methods, which prevents them from being deployed in modern recommender systems. How to accelerate subgraph-based methods is thus an important problem to study.
The extra computation complexity of subgraph-based methods comes from their
node labeling step. The reason is that for every link (i, j) to predict, we need to relabel the graph according to (i, j). The same node v will be labeled differently depending on which link is the target, and will be given a different node representation by the GNN when it appears in different links' labeled graphs. This is
different from node-based methods, where we do not relabel the graph and each
node only has a single representation.
In other words, for node-based methods, we only need to apply the GNN to
the whole graph once to compute a representation for each node, while subgraph-
based methods need to repeatedly apply the GNN to differently labeled subgraphs
each corresponding to a different link. Thus, when computing link representations,
subgraph-based methods require re-applying the GNN for each target link. For a
graph with n nodes and m links to predict, node-based methods only need to apply
a GNN O(n) times to get a representation for each node (and then use some sim-
ple aggregation function to get link representations), while subgraph-based methods
need to apply a GNN O(m) times for all links. When m ≫ n, subgraph-based meth-
ods have much worse time complexity than node-based methods, which is the price
for learning more expressive link representations.
Is it possible to accelerate subgraph-based methods? One possible way is to sim-
plify the enclosing subgraph extraction process and simplify the GNN architecture.
For example, we may adopt sampling or random walks when extracting the enclosing subgraphs, which might largely reduce the subgraph sizes and avoid hub nodes. It is
interesting to study such simplifications’ influence on performance. Another possi-
ble way is to use distributed and parallel computing techniques. The enclosing sub-
graph extraction process and the GNN computation on a subgraph are completely
independent of each other and are naturally parallelizable. Finally, using multi-stage
ranking techniques could also help. Multi-stage ranking first uses some simple
methods (such as traditional heuristics) to filter out the most unlikely links, and then
uses more powerful methods (such as SEAL) in the later stages to rank only the most
promising links and output the final recommendations/predictions.
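To make the sampling idea above concrete, here is a minimal sketch of enclosing subgraph extraction with a per-node neighbor cap; the function name `sampled_enclosing_subgraph` and the cap parameter are our own illustration, not part of any published method:

```python
import random

def sampled_enclosing_subgraph(adj, x, y, num_hops=2, max_neighbors=20, seed=0):
    """BFS outward from both target nodes (x, y) for `num_hops` hops,
    keeping at most `max_neighbors` neighbors per visited node so that
    hub nodes cannot blow up the subgraph size."""
    rng = random.Random(seed)
    visited = {x, y}
    frontier = [x, y]
    for _ in range(num_hops):
        next_frontier = []
        for u in frontier:
            neighbors = list(adj[u])
            if len(neighbors) > max_neighbors:
                # sample a fixed-size subset of neighbors around hubs
                neighbors = rng.sample(neighbors, max_neighbors)
            for v in neighbors:
                if v not in visited:
                    visited.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return visited
```

The returned node set would then be relabeled (e.g., by DRNL) and fed to the GNN; the extraction of different links' subgraphs is independent and thus trivially parallelizable.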
In any case, solving the scalability issue of subgraph-based methods would be a
great contribution to the field: it would let us enjoy the superior link prediction
performance of subgraph-based GNN methods without using much more computation resources, which is expected to extend GNNs to more application domains.
Another direction is to design more powerful labeling tricks. Definition 10.6 gives
a general definition of labeling trick. Although any labeling trick satisfying Defi-
nition 10.6 can enable a node-most-expressive GNN to learn structural link repre-
sentations, the real-world performance of different labeling tricks can vary a lot due
to the limited expressive power and depth of practical GNNs. Moreover, subtle
differences in implementing a labeling trick can result in large performance
differences. For example, given the two target nodes x and y, when computing the
distance d(i, x) from a node i to x, DRNL will temporarily mask node y and all its
edges, and when computing the distance d(i, y), DRNL will temporarily mask node
x and all its edges (Zhang and Chen, 2018b). The reason for this “masking trick” is
that DRNL aims to use the pure distance between i and x without the influence of
y. If we do not mask y, d(i, x) will be upper bounded by d(i, y) + d(x, y), which ob-
scures the “true distance” between i and x and might hurt the node labels’ ability to
discriminate structurally-different nodes. As shown in Appendix H of (Zhang et al,
2020c), this masking trick can greatly improve the performance. It is thus interest-
ing to study how to design a more powerful labeling trick (not necessarily based on
shortest path distance like DRNL and DE). It should not only distinguish the target
nodes, but also assign diverse but generalizable labels to nodes with different roles
in the subgraph. A further theoretical analysis on the power of different labeling
tricks is also needed.
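As an illustration of the masking trick described above, the sketch below computes shortest-path distances by BFS with the other target node removed, then applies the closed-form DRNL hashing formula of (Zhang and Chen, 2018b); the helper names are our own, and the formula should be treated as a sketch of the published one:

```python
from collections import deque

def bfs_dist(adj, src, masked):
    """Shortest-path distances from src, ignoring `masked` nodes entirely
    (this implements the temporary masking of the other target node)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist and v not in masked:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def drnl_labels(adj, x, y):
    """Double-Radius Node Labeling: d(i, x) is computed with y masked,
    and d(i, y) with x masked, then the two distances are hashed."""
    dx = bfs_dist(adj, x, masked={y})
    dy = bfs_dist(adj, y, masked={x})
    labels = {}
    for i in adj:
        if i == x or i == y:
            labels[i] = 1                 # the two target nodes share label 1
        elif i not in dx or i not in dy:
            labels[i] = 0                 # unreachable nodes get the null label
        else:
            d = dx[i] + dy[i]
            labels[i] = 1 + min(dx[i], dy[i]) + (d // 2) * ((d // 2) + (d % 2) - 1)
    return labels
```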
Chapter 11 Graph Neural Networks: Graph Generation

Renjie Liao
Abstract In this chapter, we first review a few classic probabilistic models for graph
generation including the Erdős–Rényi model and the stochastic block model. Then
we introduce several representative modern graph generative models that lever-
age deep learning techniques like graph neural networks, variational auto-encoders,
deep auto-regressive models, and generative adversarial networks. At last, we con-
clude the chapter with a discussion on potential future directions.
11.1 Introduction
The study of graph generation revolves around building probabilistic models over
graphs which are also called networks in many scientific disciplines. This problem
has its roots in a branch of mathematics, called random graph theory (Bollobás,
2013), which largely lies at the intersection between the probability theory and the
graph theory. It is also at the core of a new academic field, called network sci-
ence (Barabási, 2013). Historically, researchers in these fields have often been interested in
building random graph models (i.e., constructing distributions over graphs using certain parametric families of distributions) and proving the mathematical properties
of such models. Albeit an extremely fruitful and successful research direction
that has spawned numerous outcomes, these classic models suffer from being too simplistic to capture the complex phenomena (e.g., high clustering, strong connectivity,
scale-free degree distributions) that appear in real-world graphs.
With the advent of powerful deep learning techniques like graph neural net-
works, we can build more expressive probabilistic models of graphs, i.e., the so-
called deep graph generative models. Such deep models can better capture the com-
plex dependencies within the graph data to generate more realistic graphs and fur-
ther build accurate predictive models. However, the downside is that these models
Renjie Liao
University of Toronto, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 225
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_11
are often so complicated that we can rarely analyze their properties in a precise
manner. Nevertheless, recent practice with these models has demonstrated impressive
performance in modeling real-world graphs/networks, e.g., social networks, citation
networks, and molecule graphs.
In the following, we first introduce the classic graph generative models in Section
11.2 and then the modern ones that leverage the deep learning techniques in Section
11.3. At last, we conclude the chapter and discuss some promising future directions.
11.2 Classic Graph Generative Models

In this section, we review two popular variants of the classic graph generative mod-
els: the Erdős–Rényi model (Erdős and Rényi, 1960) and the stochastic block model
(Holland et al, 1983). They are often used as handy baselines in many applications
since we have already gained deep understandings of their properties. There are
many other graph generative models like the Watts–Strogatz small-world model
(Watts and Strogatz, 1998) and the Barabási–Albert (BA) preferential attachment
model (Barabási and Albert, 1999). Barabási (2013) provides a thorough survey
on these models and other aspects of network science. In the context of machine
learning, there are also quite a few non-deep-learning graph generative models like
Kronecker graphs (Leskovec et al, 2010). We do not cover these models due to the
space limit.
11.2.1 Erdős–Rényi Model

We first explain one of the most well-known random graph models, i.e., the Erdős–Rényi
model (Erdős and Rényi, 1960), named after two Hungarian mathematicians Paul
Erdős and Alfréd Rényi. Note that this model was independently proposed at
around the same time by Edgar Gilbert (Gilbert, 1959). In the following, we first
describe the model along with its properties and then discuss its limitations.
11.2.1.1 Model
The Erdős–Rényi model has two closely related variants, namely, G(n, p) and G(n, m).

G(n, p) Model In the G(n, p) model, we are given n labeled nodes and generate
a graph by including each possible edge independently with probability p. In other
words, all \binom{n}{2} possible edges have the same probability p of being included.
Therefore, the probability of
generating a graph with m edges under this model is as below,
p(a graph with n nodes and m edges) = p^m (1 − p)^{\binom{n}{2} − m}.    (11.1)
11 Graph Neural Networks: Graph Generation 227
The parameter p controls the “density” of the graph, i.e., a larger value of p makes
the graph more likely to contain more edges. When p = 1/2, the above probability
becomes (1/2)^{\binom{n}{2}}, i.e., all 2^{\binom{n}{2}} possible graphs are chosen with equal probability.
Due to the independence of the edges in G(n, p), we can easily derive a few
properties from this model.
• The expected number of edges is \binom{n}{2} p.
• The degree distribution of any node v is binomial:

p(degree(v) = k) = \binom{n − 1}{k} p^k (1 − p)^{n−1−k}    (11.2)

When n is large and np is held constant, this binomial distribution converges to the Poisson distribution:

p(degree(v) = k) = (np)^k e^{−np} / k!    (11.3)
There is an enormous number of more involved properties of this model that have
been proven (e.g., by Erdős and Rényi in the original paper). We list a few below.

• If p > (1 + ε) ln n / n, then a graph will almost surely be connected.
• If p < (1 − ε) ln n / n, then a graph will almost surely contain isolated vertices, and
thus be disconnected.
• If np < 1, then a graph will almost surely have no connected component of
size larger than O(log(n)).
Here “almost surely” means that the event happens with probability 1
(i.e., the set of possible exceptions has zero measure).
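The G(n, p) model and its expected edge count \binom{n}{2}p can be checked with a few lines of code; this is a toy sketch (the function name `gnp` is ours), not an optimized sampler:

```python
import random

def gnp(n, p, seed=0):
    """Sample a graph from the Erdos-Renyi G(n, p) model: each of the
    C(n, 2) possible edges is included independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

# The expected number of edges is C(n, 2) * p; a quick Monte Carlo check:
n, p, trials = 30, 0.2, 200
avg = sum(len(gnp(n, p, seed=s)) for s in range(trials)) / trials
expected = n * (n - 1) / 2 * p   # C(30, 2) * 0.2 = 87.0
```

The empirical average edge count concentrates around \binom{n}{2}p, as the independence of edges predicts.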
G(n, m) Model In the G(n, m) model, we are given n labeled nodes and generate
a graph by uniformly randomly choosing a graph from the set of all graphs with n
nodes and m edges, i.e., the probability of choosing each graph is \binom{\binom{n}{2}}{m}^{-1}. There are
also many important properties associated with the G(n, m) model. In particular,
it is interchangeable with the G(n, p) model in most investigations, provided that m
is close to \binom{n}{2} p. Chapter 2 of (Bollobás and Béla, 2001) provides a comprehensive
discussion on the relationship between these two models. The G(n, p) model is more
commonly used in practice than the G(n, m) model, partly due to the ease of analysis
brought by the independence of the edges.
11.2.1.2 Discussion
As a seminal work in the random graph theory, the Erdős–Rényi model inspires
much subsequent work to study and generalize this model. However, the assump-
tions of this model, e.g., edges are independent and each edge is equally likely to
be generated, are too strong to capture the properties of real-world graphs. For
example, the degree distribution of the Erdős–Rényi model has an exponential tail,
which means we rarely see node degrees span a broad range, e.g., several orders
of magnitude. Meanwhile, real-world graphs/networks like the World Wide Web
(WWW) are believed to possess a degree distribution that follows a power law, i.e.,
p(d) ∝ d^{−γ}, where d is the degree and the exponent γ is typically between 2 and
3. Essentially, this means that there are many nodes with small degrees, whereas
a few nodes (i.e., hubs) have extremely large degrees in real-world graphs like the
WWW. Therefore, many improved models like the scale-free networks (Barabási
and Albert, 1999) were later proposed, which better fit the degree distributions of
real-world graphs.
11.2.2 Stochastic Block Model

Stochastic block models (SBMs) are a family of random graphs with clusters of nodes
and are often employed as a canonical model for tasks like community detection
and clustering. It was proposed independently in a few scientific communities, e.g.,
machine learning and statistics (Holland et al, 1983), theoretical computer science
(Bui et al, 1987), and mathematics (Bollobás et al, 2007). It is arguably the simplest
model of a graph with communities/clusters. As a generative model, SBM could
provide ground-truth cluster memberships, which in turn could help benchmark and
understand different clustering/community detection algorithms. In the following,
we first introduce the basics of the model and then discuss its advantages as well as
limitations.
11.2.2.1 Model
We start the introduction by denoting the total number of nodes as n and the number
of communities/clusters as k. A prior probability vector p over the k clusters and
a k × k matrix W with entries in [0, 1] are also given. We generate a random graph
following the procedure below:
1. For each node, we generate its community label (an integer from {1, · · · , k}) by
independently sampling from p.
2. For each pair of nodes, denoting their community labels as i and j, we generate
an edge between them independently with probability W_{i,j}.
Basically, the community assignments of a pair of nodes determine the specific en-
try of W to be used, which in turn indicates how likely we connect this pair of nodes.
We denote such a model as SBM(n, p, W). Note that if we set W_{i,j} = q for all pairs of
communities (i, j), then the corresponding SBM degenerates to the Erdős–Rényi model
G(n, q).
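The two-step SBM(n, p, W) sampling procedure above can be sketched as follows; `sbm` and its toy parameters are our own illustration:

```python
import random

def sbm(n, p, W, seed=0):
    """Sample from SBM(n, p, W): draw a community label per node from the
    prior p, then connect each node pair (u, v) independently with
    probability W[label[u]][label[v]]."""
    rng = random.Random(seed)
    k = len(p)
    labels = rng.choices(range(k), weights=p, k=n)
    edges = [(u, v) for u in range(n) for v in range(u + 1, n)
             if rng.random() < W[labels[u]][labels[v]]]
    return labels, edges

# Two communities, dense within (0.9) and sparse across (0.05):
labels, edges = sbm(40, [0.5, 0.5], [[0.9, 0.05], [0.05, 0.9]], seed=1)
```

With such a W, most sampled edges fall within communities, which is exactly the community structure the Erdős–Rényi model cannot express.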
In the context of community detection, people are often interested in recovering
the community labels given a random graph drawn from the SBM. Denoting
the recovered and the ground-truth community labels as X ∈ R^{n×1} and Y ∈ R^{n×1},
we can measure their agreement as

R(X, Y) = max_{P∈Π} (1/n) ∑_{i=1}^{n} 1[X_i = (PY)_i],    (11.4)
where P is a permutation matrix and Π is the set of all permutation matrices. Xi and
(PY )i are the i-th element of X and PY respectively. In short, the agreement consid-
ers the best possible reshuffle between two sequences of labels. Depending on the
requirement, we could examine the community detection algorithms in the sense
of exact recovery (i.e., cluster assignments are exactly recovered almost surely,
p(R(X,Y ) = 1) = 1) or partial recovery (i.e., at most 1 − ε fraction of nodes are
mislabeled almost surely, p(R(X,Y ) ≥ ε) = 1). Researchers have established vari-
ous conditions under which a particular type of recovery is possible for SBM graphs.
For example, for SBMs with W = log(n)Q/n, where Q is a matrix with positive entries
and the same size as W, Abbe and Sandon (2015) show that the exact recovery
is possible if and only if the minimum Chernoff-Hellinger divergence between any
two columns of diag(p)Q is no less than 1, where diag(p) is a diagonal matrix with
diagonal entries as p.
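The agreement ratio of Eq. (11.4) can be computed exactly for small k by maximizing over relabelings of the k communities (the usual reading of the permutation in the equation, which is exponentially cheaper than permuting all n nodes); a brute-force sketch, with names of our own choosing:

```python
from itertools import permutations

def agreement(X, Y):
    """Agreement ratio: the best fraction of matching labels over all
    relabelings (permutations) of the k community identities."""
    n = len(X)
    ks = sorted(set(X) | set(Y))
    best = 0.0
    for perm in permutations(ks):
        relabel = dict(zip(ks, perm))       # one candidate relabeling of Y
        score = sum(x == relabel[y] for x, y in zip(X, Y)) / n
        best = max(best, score)
    return best
```

For instance, two clusterings that differ only by swapping the names of the two communities still achieve agreement 1.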
11.2.2.2 Discussion
Abbe (2017) provides an up-to-date and comprehensive survey on the SBM and
the fundamental limits (from both information-theoretic and computational per-
spectives) for community detection in the SBM. SBM is a more realistic random
graph model for describing graphs with community structures compared to the
Erdős–Rényi model. It also spawns many subsequent variants of block models like
the mixed membership SBM (Airoldi et al, 2008). However, the estimation of SBMs
on real-world graphs is hard since the number of communities is often unknown in
advance and some graphs may not exhibit clear community structures.
11.3 Deep Graph Generative Models

In this section, we review several representative deep graph generative models which
aim at building probabilistic models of graphs using deep neural networks. Based
on the type of deep learning techniques being used, we can roughly divide the cur-
rent literature into three categories: variational auto-encoder (VAE) (Kingma and
Welling, 2014) based methods, deep auto-regressive (Van Oord et al, 2016) methods,
and generative adversarial network (GAN) (Goodfellow et al, 2014b) based
methods. We introduce all three model classes in the subsequent sections.
11.3.1 Representing Graphs in Deep Graph Generative Models

We first introduce how a graph is represented in the context of deep graph generative
models. Suppose we are given a graph G = (V , E ) where V is the set of nodes/ver-
tices and E is the set of edges. Conditioning on a specific node ordering π, we can
represent the graph G as an adjacency matrix A_π ∈ R^{|V|×|V|}, where |V| is
the size of the set V (i.e., the number of nodes). The adjacency matrix not only provides
a convenient representation of graphs on computers but also offers a natural way
to mathematically define a probability distribution over graphs. Here we explicitly
write the node ordering π in the subscript to emphasize that the rows and columns
of A are arranged according to π. If we change the node ordering from π to π′,
the adjacency matrix will be permuted (shuffling rows and columns) accordingly,
i.e., Aπ ′ = PAπ P⊤ , where the permutation matrix P is constructed based on the pair
of node orderings (π, π ′ ). In other words, Aπ and Aπ ′ represent the same graph G .
Therefore, a graph G with an adjacency matrix Aπ can be equivalently represented
as a set of adjacency matrices {PAπ P⊤ |P ∈ Π } where Π is the set of all permutation
matrices with size |V | × |V |. Note that, depending on the symmetric structures of
Aπ , there may exist two permutation matrices P1 , P2 ∈ Π so that P1 Aπ P1⊤ = P2 Aπ P2⊤ .
Therefore, we remove such redundancies and keep only the uniquely permuted adjacency matrices, denoted as A = {PAπ P⊤ |P ∈ ΠG }. More precisely, ΠG is the
maximal subset of Π so that P1 Aπ P1⊤ ̸= P2 Aπ P2⊤ holds for any P1 , P2 ∈ ΠG . We
add the subscript G to emphasize that ΠG depends on the given graph G . Note that
there exists a surjective mapping between Π and ΠG . For the ease of notations, we
will drop the subscript of the node ordering and use G ≡ A = {PAP⊤ |P ∈ ΠG } to
represent a graph from now on.
When considering the node features/attributes X, we can denote the graph struc-
tured data as G ≡ {(PAP⊤ , PX)|P ∈ ΠG }1 . Note that the rows of X are shuffled
according to P since each row of X corresponds to a node. In our context, we can
assume the maximum number of nodes over all graphs is n. If a graph has fewer nodes
than n, we can add dummy nodes (e.g., with all-zero features) that are isolated from
the other nodes to make its size equal to n. Therefore, X ∈ R^{n×dX} and A ∈ R^{n×n}, where
dX is the feature dimension. To simplify the explanation, we do not include edge
features, but it is straightforward to modify the following models accordingly
to incorporate them.
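The equivalence of permuted representations, A_{π′} = PA_πP⊤ and X′ = PX, amounts to shuffling indices consistently; a small pure-Python sketch (function name ours):

```python
def permute_graph(A, X, perm):
    """Apply node ordering `perm` to adjacency matrix A and node-feature
    matrix X. With P the permutation matrix of `perm`, this computes
    A' = P A P^T (entry [i][j] becomes A[perm[i]][perm[j]]) and X' = P X."""
    n = len(A)
    A2 = [[A[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
    X2 = [X[perm[i]] for i in range(n)]
    return A2, X2
```

Any graph-level statistic (e.g., the multiset of node degrees) is identical across all such permuted representations, which is why they all represent the same graph G.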
11.3.2 Variational Auto-Encoder Methods

Due to the great success of VAEs in image generation (Kingma and Welling, 2014;
Rezende et al, 2014), it is natural to extend this framework to graph generation. This
1 Technically, there may exist two permutation matrices P1 , P2 ∈ Π so that P1 AP1⊤ = P2 AP2⊤ and
P1 X ̸= P2 X. It thus seems to be necessary to define G ≡ {(PAP⊤ , PX)|P ∈ Π }. However, as seen
later, we are always interested in distributions of node features that are exchangeable over nodes,
i.e., p(P1 X) = p(P2 X). Therefore, restricting ourselves to ΠG is sufficient for our exposition.
idea has been explored from different aspects (Kipf and Welling, 2016; Jin et al,
2018a; Simonovsky and Komodakis, 2018; Liu et al, 2018d; Ma et al, 2018; Grover
et al, 2019; Liu et al, 2019b) and is often collectively named GraphVAE. In the
following, we first highlight the common framework shared by all these methods
and then discuss some important variants.
Similar to vanilla VAEs, every model instance within the GraphVAE family con-
sists of an encoder (i.e., a variational distribution qφ (Z|A, X) parameterized by φ ),
a decoder (i.e., a conditional distribution pθ (G |Z) parameterized by θ ), and a prior
distribution (i.e., a distribution p(Z) typically with fixed parameters). Before intro-
ducing individual components, we first describe what the latent variables Z are. In
the context of graph generation, we typically assume that each node is associated
with a latent vector. Denoting the latent vector of the i-th node as zi , then Z ∈ Rn×dZ
is obtained by stacking {zi } as row vectors. Such latent vectors should summarize
the information of the local subgraphs associated with individual nodes so that we
can decode/generate edges based on them. In other words, any pair of latent vec-
tors (zi , z j ) is supposed to be informative to determine whether nodes (i, j) should
be connected. We could further introduce edge latent variables {zi j } to enrich the
model. Again, we do not consider such an option for simplicity since the underlying
modeling technique is roughly the same.
Encoder We first explain how to construct the encoder using a deep neural net-
work. Recall that the input to the encoder is the graph data (A, X). The natural can-
didate to deal with such data is a graph neural network, e.g., a graph convolutional
network (GCN) (Kipf and Welling, 2017b). For example, let us consider a two-layer
GCN as below,

H = Ã σ(Ã X W₁) W₂,    (11.5)
where H ∈ R^{n×dH} are the node representations (each node is associated with a size-dH row vector), Ã = D^{−1/2}(A + I)D^{−1/2}, and D is the degree matrix (i.e., a diagonal
matrix whose entries are the row sums of A + I). I is the identity matrix. σ is
matrix of which the entries are the row sum of A + I). I is the identity matrix. σ is
the nonlinearity which is often chosen to be the rectified linear unit (ReLU) (Nair
and Hinton, 2010). {W1 ,W2 } are the learnable parameters. We can pad a constant to
the input feature dimension so that the bias term is absorbed into the weight matrix.
We adopt this convention for ease of notation.
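The normalized adjacency Ã = D^{−1/2}(A + I)D^{−1/2} and a two-layer GCN forward pass of the form H = Ã σ(ÃXW₁)W₂ can be sketched in a few lines of pure Python; this is an untrained toy (helper names ours) that only illustrates the computation:

```python
import math

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def normalized_adj(A):
    """Compute A~ = D^{-1/2} (A + I) D^{-1/2}, where D is the diagonal
    degree matrix of A + I (row sums)."""
    n = len(A)
    AI = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    d = [sum(row) for row in AI]
    return [[AI[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
            for i in range(n)]

def relu(M):
    return [[max(0.0, v) for v in row] for row in M]

def gcn_encoder(A, X, W1, W2):
    """Two-layer GCN forward pass: H = A~ ReLU(A~ X W1) W2 (a sketch;
    the exact placement of nonlinearities varies across implementations)."""
    At = normalized_adj(A)
    return matmul(At, matmul(relu(matmul(At, matmul(X, W1))), W2))
```

In practice the weights W₁, W₂ would be learned end-to-end with the rest of the VAE.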
Relying on the learned node representations H, we can construct the variational
distribution as below,
232 Renjie Liao
q_φ(Z|A, X) = ∏_{i=1}^{n} q(z_i |A, X)    (11.6)
q(z_i |A, X) = N(µ_i , σ_i I)    (11.7)
µ = MLP_µ(H)    (11.8)
log σ = MLP_σ(H).    (11.9)
This encoder has several advantages. First, it respects the permutation symmetry of the graph, which can be easily verified from the exchangeability of the product of probabilities
and the equivariance property of graph neural networks. Second, the neural net-
works underlying each Gaussian (i.e., “GNN + MLP”) are very powerful so that the
conditional distributions are expressive in capturing the uncertainty of latent vari-
ables. Third, this encoder is computationally cheaper than those which consider the
dependencies among different {zi } (e.g., an autoregressive encoder). It thus pro-
vides a solid baseline for investigating whether a more powerful encoder is needed
in a given problem.
Prior Similar to most VAEs, GraphVAEs often adopt a prior that is fixed during
the learning. For example, a common choice is a node-independent Gaussian as
below,
p(Z) = ∏_{i=1}^{n} p(z_i)    (11.11)
p(z_i) = N(0, I).    (11.12)
Again, we could replace this fixed prior with more powerful ones like an autoregres-
sive model at the cost of more computation and/or a time-consuming pre-training
stage. But this prior serves as a good starting point to benchmark more complicated
alternatives, e.g., the normalizing flow based one in (Liu et al, 2019b).
Decoder The aim of a decoder in graph generative models is to construct a prob-
ability distribution over the graph and its feature/attributes conditioned on the latent
variables, i.e., p(G|Z). However, as we discussed previously, we need to consider all
possible node orderings (each corresponding to a permuted adjacency matrix) that
leave the graph unchanged, i.e.,

p(G|Z) = ∑_{P∈Π_G} p(PAP⊤, PX|Z).    (11.13)
Recall that ΠG is the maximal subset of the set of all possible permutation matrices
Π so that P1 Aπ P1⊤ ̸= P2 Aπ P2⊤ holds for any P1 , P2 ∈ ΠG . To build such a decoder,
we first construct a probability distribution over adjacency matrix and node feature
matrix. For example, we show a popular and simple construction (Kipf and Welling,
2016) as below,
p(A, X|Z) = ∏_{i,j} p(A_{ij}|Z) ∏_{i=1}^{n} p(x_i|Z)    (11.14)
p(A_{ij}|Z) = Bernoulli(Θ_{ij})    (11.15)
p(x_i|Z) = N(µ̃_i , σ̃_i)    (11.16)
Θ_{ij} = MLP_Θ([z_i ∥ z_j])    (11.17)
µ̃_i = MLP_µ̃(z_i)    (11.18)
σ̃_i = MLP_σ̃(z_i),    (11.19)

where [z_i ∥ z_j] denotes the concatenation of the two latent vectors.
Note that there are corner cases so that p(P1 AP1⊤ , P1 X|Z) = p(P2 AP2⊤ , P2 X|Z) holds.
For example, if an adjacency matrix A has certain symmetries, there could exist
a pair of (P1 , P2 ) so that P1 AP1⊤ = P2 AP2⊤ . But this does not hold for all pairs of
(P1 , P2 ). As a second example, if all Θi j are the same for all (i, j), all µ̃i are the
same for all i, and all σ̃i are the same for all i, then for any two permutation ma-
trices (P1 , P2 ), we have p(P1 AP1⊤ , P1 X|Z) = p(P2 AP2⊤ , P2 X|Z). Nevertheless, these
two cases rarely happen in practice.
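Sampling from the factorized decoder of Eq. (11.14) can be sketched as below; as a hedged simplification we replace MLP_Θ([z_i∥z_j]) with a dot product and return the Gaussian means as node features (all names ours):

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def decode(Z, seed=0):
    """Sample (A, X) from a factorized GraphVAE-style decoder.
    Edge probabilities Theta[i][j] come from a dot product of latent
    vectors (standing in for a learned MLP over [z_i || z_j]); edges are
    then sampled independently as Bernoulli(Theta[i][j])."""
    rng = random.Random(seed)
    n = len(Z)
    theta = [[sigmoid(sum(a * b for a, b in zip(Z[i], Z[j])))
              for j in range(n)] for i in range(n)]
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # symmetric sampling for an undirected graph, no self-loops
            A[i][j] = A[j][i] = 1 if rng.random() < theta[i][j] else 0
    # Node features: return the Gaussian means (here simply z_i itself).
    X = [list(z) for z in Z]
    return A, X
```

Nodes whose latent vectors are well aligned get connected with high probability, which is the sense in which each pair (z_i, z_j) is "informative" about the edge (i, j).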
Equipped with the distribution in Eq. (11.14), we can evaluate the terms on the
right hand side of Eq. (11.13). However, the number of permutation matrices in ΠG
can be as large as n! which makes the exact evaluation computationally prohibitive.
There are a few ways in the literature to approximate it. For example, we can just
use the maximum term as below,

p(G|Z) ≈ max_{P∈Π_G} p(PAP⊤, PX|Z).
To learn the encoder and the decoder, we need to sample from the encoder to ap-
proximate the expectation in Eq. (11.23) and leverage the reparameterization trick
(Kingma and Welling, 2014) to back-propagate the gradient.
There are many variants derived from the GraphVAE family mentioned above. We
now briefly introduce two important types of variants, i.e., hierarchical GraphVAE
(Jin et al, 2018a) and Constrained GraphVAE (Liu et al, 2018d; Ma et al, 2018).
Hierarchical GraphVAEs One representative work of hierarchical GraphVAEs
is Junction Tree VAEs (Jin et al, 2018a) which aim at modeling the molecule graphs.
The key idea is to build a GraphVAE relying on the hierarchical graph represen-
tations of molecules. In particular, we first apply the tree decomposition to obtain
a junction tree T from the original molecule graph G . A junction tree is a cluster
tree (each node is a set of one or more variables of the original graph) with the run-
ning intersection property (Barber, 2004). It provides a coarsened representation of
the original graph since one node in a junction tree may correspond to a subgraph
with several nodes in the original graph. As shown in Figure 11.1, there are two
graphs corresponding to two levels, i.e., the original molecule graph G (1st level)
and the decomposed junction tree T (2nd level). Since we can efficiently perform
tree decomposition to obtain the junction tree, the tree itself is not a latent variable.
Jin et al (2018a) propose to use Gated Graph Neural Networks (GGNNs) (Li et al,
2016b) as encoders (one for each level) and construct variational posteriors q(ZG |G )
and q(ZT |T ) as Gaussians. To decode the molecule graph, we need to perform a
two-level generation process conditioned on the sampled latent variables ZT and
ZG. A junction tree is first generated by an autoregressive decoder which is again
based on GGNNs. Conditioned on the generated tree, Jin et al (2018a) resort to a
maximum-a-posteriori (MAP) formulation to generate the final molecule graph, i.e.,
finding the compatible subgraphs at each node of the tree so that the overall score
(log-likelihood) of the resultant graph (i.e., replacing each node in the tree with the
chosen subgraph) is maximized. The whole model can be learned similarly to other
GraphVAEs.
Fig. 11.1: Junction Tree VAEs. The junction tree corresponding to the molecule
graph is obtained via the tree decomposition as shown in the top-right. A node/clus-
ter in the junction tree (color-shaded) may correspond to a subgraph in the original
molecule graph. Two GNN-based encoders are applied to the molecular graph and
junction tree respectively to construct the variational posterior distributions over
latent variables ZG and ZT. During generation, we first generate the junction
tree using an autoregressive decoder and then obtain the final molecule graph via
approximately solving a maximum-a-posteriori problem. Adapted from Figure 3 of
(Jin et al, 2018a).
11.3.3 Deep Autoregressive Methods

Deep autoregressive models like PixelRNNs (Van Oord et al, 2016) and PixelCNNs
(Oord et al, 2016) have achieved tremendous successes in image modeling. There-
fore, it is natural to generalize this type of method to graphs. The shared underlying
idea of these autoregressive models is to characterize the graph generation process
as a sequential decision-making process and make a new decision at each step con-
ditioning on all previously made decisions. For example, as shown in Figure 11.2,
we can first decide whether to add a new node, then decide whether to add a new
edge, so on and so forth. If node/edge labels are considered, we can further sample
from a categorical distribution at each step to specify such labels. The key question
of this class of methods is how to build a probabilistic model so that our current
decision depends on all previous historical choices.
The first GNN-based autoregressive model was proposed in (Li et al, 2018d), whose
high-level idea is exactly as shown in Figure 11.2. Sup-
pose at time step t − 1, we already generated a partial graph denoted as G t−1 =
(V t−1 , E t−1 ). The corresponding adjacency matrix and node feature matrix are de-
11 Graph Neural Networks: Graph Generation 237
Fig. 11.2: The overview of the deep graph generative model in (Li et al, 2018d).
The graph generation is formulated as a sequential decision-making process. At
each step of the generation, the model needs to decide: 1) whether to add a new node
or stop the whole generation; 2) whether to add a new edge (with one end connected to the
new node) or not; 3) which existing node to connect to the new edge. Adapted from
Figure 1 of (Li et al, 2018d).
noted as (At−1 , X t−1 ). At time step t, the model needs to decide: 1) whether we
add a new node or we stop the generation (denoting the probability as pAddNode );
2) whether we add an edge that links any existing node to the newly added node
(denoting the probability as pAddEdge ); 3) choose an existing node to link to the newly
added node (denoting the probability as pNodes ). For simplicity, we define pAddNode
to be a Bernoulli distribution. We can extend it to a categorical one if node labels/types
are considered. pAddEdge is yet another Bernoulli distribution whereas pNodes is
a categorical distribution with size |V t−1 | (i.e., its size will change as the generation
goes on).
Message Passing Graph Neural Networks To construct the above probabilities
of decisions, we first build a message passing graph neural network (Scarselli et al,
2008; Li et al, 2016b; Gilmer et al, 2017) to learn node representations. The input
to the GNN at time step t − 1 is (At−1 , H t−1 ) where H t−1 is the node representation
(one row corresponds to a node). Note that at time 0, since the graph is empty, we
need to generate a new node to start. The generation probability pAddNode will be
output by the model based on some randomly initialized hidden state. If we model
the node labels/types or node features, we can also use them as additional node
representations, e.g., concatenating them with rows of H t−1 .
The one-step message passing is shown as below,

m_{ij}^{t−1} = f_Msg(h_i^{t−1}, h_j^{t−1})    (11.24)
m̄_i^{t−1} = f_Agg({m_{ij}^{t−1} | j ∈ Ω_i})    (11.25)
h̃_i^{t−1} = f_Update(h_i^{t−1}, m̄_i^{t−1})    (11.26)
where fMsg , fAgg , and fUpdate are the message function, the aggregation function, and
the node update function respectively. For the message function, we often instantiate
fMsg as an MLP. Note that if edge features are considered, one can incorporate
them as input to fMsg . fAgg could simply be an average or summation operator.
Typical examples of fUpdate include gated recurrent units (GRUs) (Cho et al, 2014a)
and long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997). h_i^{t−1} is
the input node representation at time step t − 1. Ω_i denotes the set of neighboring
nodes of node i. h̃_i^{t−1} is the updated node representation which serves as the
input node representation for the next message passing step. The above message
passing process is typically executed for a fixed number of steps, which is tuned
as a hyperparameter. Note that the generation step t is different from the message
passing step (we deliberately omit its notation to avoid confusion).
Output Probabilities After the message passing process is done, we obtain the
new node representations H^t. Now we can construct the aforementioned output
probabilities as follows,

h_{G^{t−1}} = f_ReadOut(H^t)    (11.27)
p_AddNode = σ(MLP_AddNode(h_{G^{t−1}}))    (11.28)
p_AddEdge = σ(MLP_AddEdge(h_{G^{t−1}}, h_v))    (11.29)
s_{uv} = MLP_Nodes(h_u^t, h_v)    (11.30)
p_Nodes = softmax(s).    (11.31)
Here we first summarize the graph representation h_{G^{t−1}} (a vector) by reading out
from the node representations H^t via f_ReadOut, which could be an average operator
or an attention-based one. Based on h_{G^{t−1}}, we predict the probability of adding a
new node pAddNode where σ is the sigmoid function. If we decide to add a new
node by sampling 1 from the Bernoulli distribution pAddNode , we denote the new
node as v. We can initialize its representation hv as random features by sampling
either from N (0, I) or learned distribution over node type/label if provided. Then
we compute similarity scores between every existing node u in G t−1 and v as suv . s
is the concatenated vector of all similarity scores. Finally, we normalize the scores
using softmax to form the categorical distribution from which we sample an existing
node to obtain the new edge. By sampling from all these probabilities, we could
either stop the generation or obtain a new graph with a new node and/or a new edge.
We repeat this procedure by carrying on the node representations along with the
generated graphs until the model generates a stop signal from pAddNode .
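The generation loop described above can be sketched as a skeleton in which the three probability modules are passed in as callables; in the real model they would be GNN-based, while here any stand-ins may be plugged in (all names are our own):

```python
import random

def generate_graph(p_add_node, p_add_edge, pick_node, max_nodes=50, seed=0):
    """Skeleton of the sequential decision process: repeatedly decide
    whether to add a node, then whether/where to attach edges to it."""
    rng = random.Random(seed)
    adj = []                      # adjacency list: adj[v] = set of neighbors
    while len(adj) < max_nodes and rng.random() < p_add_node(adj):
        v = len(adj)
        adj.append(set())         # 1) add a new node v
        # 2) keep adding edges while there are unconnected existing nodes
        while adj[v] != set(range(v)) and rng.random() < p_add_edge(adj, v):
            u = pick_node(adj, v, rng)   # 3) choose an existing node
            adj[u].add(v)
            adj[v].add(u)
    return adj
```

A trivial stand-in for `pick_node` is a uniform choice among the existing nodes not yet connected to v; the learned model would instead sample from the softmax-normalized similarity scores.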
Training To train the model, we need to maximize the likelihood of the observed
graphs. Recall that we need to consider the permutations that leave the graph un-
changed as discussed in Section 11.3.2.1. For simplicity, we focus on the adjacency
matrix alone following (Li et al, 2018d), i.e., G ≡ {PAP⊤ |P ∈ ΠG }, where ΠG is
the maximal subset of Π so that P1 AP1⊤ ̸= P2 AP2⊤ holds for any P1 , P2 ∈ ΠG . The
ideal objective is to maximize the following,
max log p(G) ⇔ max log( ∑_{P ∈ Π_G} p(PAP^⊤) ).    (11.32)
Here we omit the variables being optimized, i.e., parameters of models defined in
Eq. (11.24) and Eq. (11.27). Note that given a node ordering (corresponding to one
specific permutation matrix P), we have a bijection between a sequence of correct decisions and an adjacency matrix. In other words, we can equivalently write
p(PAP⊤ ) as a product of probabilities that are explained in Eq. (11.27). However, the
marginalization inside the logarithmic function on the right hand side is intractable
due to the nearly factorial size of Π_G in practice. Li et al (2018d) propose to randomly sample a few different node orderings as Π̃_G and train the model with the following approximated objective,

max log( ∑_{P ∈ Π̃_G} p(PAP^⊤) ).    (11.33)
Note that this objective is a strict lower bound of the one in Eq. (11.32). If canonical
node orderings like the SMILES ordering for molecule graphs are available, we can
also use that to compute the above objective.
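As a numerical illustration (with hypothetical per-ordering log-likelihood values), summing probabilities over only a sampled subset of orderings can never exceed the sum over all orderings, which is why the approximated objective is a strict lower bound:

```python
import math

def log_marginal(log_probs_per_ordering):
    # log of the summed probabilities, computed with log-sum-exp
    # for numerical stability.
    m = max(log_probs_per_ordering)
    return m + math.log(sum(math.exp(lp - m) for lp in log_probs_per_ordering))

# Hypothetical log p(PAP^T) values for four node orderings of one graph.
full = log_marginal([-10.0, -11.0, -9.5, -12.0])  # all orderings
subset = log_marginal([-10.0, -9.5])              # sampled subset
```

Dropping orderings drops non-negative terms from the inner sum, so `subset` is always below `full`.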
Discussion This model formulates the graph generation as a sequential decision-
making process and provides a GNN-based autoregressive model to construct prob-
abilities of possible decisions at each step. The overall model design is well-
motivated. It also achieves good empirical performance in generating small graphs like molecules (e.g., with fewer than 40 nodes). However, since the model generates at most one new node and one new edge per step, the total number of generation steps scales quadratically with the number of nodes for dense graphs. It is thus inefficient for generating moderately large graphs (e.g., with a few hundred nodes).
Graph Recurrent Neural Networks (GraphRNN) (You et al, 2018b) is another deep
autoregressive model which has a similar sequential decision-making formulation
and leverages RNNs to construct the conditional probabilities. We again rely on
the adjacency matrix representation of a graph, i.e., G ≡ {PAP⊤ |P ∈ ΠG }. Before
dealing with the permutations, let us assume the node ordering is given so that P = I.
A Simple Variant of GraphRNN GraphRNN starts with an autoregressive de-
composition of the probability of an adjacency matrix as follows,
p(A) = ∏_{t=1}^{n} p(A_t | A_{<t}),    (11.34)
where A_t is the t-th column of the adjacency matrix A and A_{<t} is the matrix formed by columns A_1, A_2, ..., A_{t−1}; n is the maximum number of nodes. If a graph has fewer than n nodes, we pad dummy nodes as discussed in Section 11.3.1.
Then we can construct the conditional probability as an edge-independent Bernoulli
distribution,
p(A_t | A_{<t}) = Bernoulli(Θ_t) = ∏_{i=1}^{n} Θ_{t,i}^{1[A_{i,t}=1]} (1 − Θ_{t,i})^{1[A_{i,t}=0]}    (11.35)
Θ_t = f_out(h_t)    (11.36)
h_t = f_trans(h_{t−1}, A_{t−1}),    (11.37)
where Θ_t is a size-n vector of Bernoulli parameters and Θ_{t,i} denotes its i-th element; A_{i,t} denotes the i-th element of the column vector A_t. f_out could be an MLP which takes the hidden state h_t as input and outputs Θ_t. f_trans is the RNN cell function, which takes the (t − 1)-th column of the adjacency matrix A_{t−1} and the hidden state h_{t−1} as input and outputs the current hidden state h_t. We can use an LSTM or GRU as
the RNN cell function. Note that the conditioning on A<t is implemented via the
recurrent use of the hidden state in an RNN. The hidden state can be initialized as
zeros or randomly sampled from a standard normal distribution. This model variant
is very simple and can be easily implemented since it only consists of a few common
neural network modules, i.e., an RNN and an MLP.
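The simple variant can be sketched as below. The functions f_trans and f_out are untrained toy stand-ins for the RNN cell and the output MLP, and for illustration we only allow entries in the strictly lower triangle (each new node connects only to earlier nodes):

```python
import math
import random

N = 5  # maximum number of nodes; smaller graphs would be padded

def f_trans(h, prev_col):          # toy stand-in for an RNN cell
    return [0.5 * hi + 0.1 * sum(prev_col) for hi in h]

def f_out(h):                      # toy stand-in MLP -> Bernoulli params
    return [1.0 / (1.0 + math.exp(-hi)) for hi in h]

def sample_adjacency(seed=0):
    rng = random.Random(seed)
    h = [0.0] * N                  # hidden state initialized to zeros
    cols, prev_col = [], [0] * N
    for t in range(N):
        h = f_trans(h, prev_col)   # condition on A_{<t} via recurrence
        theta = f_out(h)           # Theta_t, a size-n Bernoulli vector
        col = [1 if rng.random() < theta[i] and i < t else 0
               for i in range(N)]
        cols.append(col)
        prev_col = col
    return cols                    # columns A_1, ..., A_n

A_cols = sample_adjacency()
```

A real implementation would replace the two stand-in functions with a GRU/LSTM cell and a learned MLP and train them by maximum likelihood.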
Full Version of GraphRNN To further improve the model, You et al (2018b)
propose a full version of GraphRNN. The idea is to build a hierarchical RNN so that
the conditional distribution in Eq. (11.34) becomes more expressive. Specifically,
instead of using an edge-independent Bernoulli distribution, we leverage another
autoregressive construction to model the dependencies among entries within one
column of the adjacency matrix as below,
p(A_t | A_{<t}) = ∏_{i=1}^{n} p(A_{i,t} | A_{<i,<t})    (11.38)
p(A_{i,t} | A_{<i,<t}) = sigmoid(g_out(h̃_{i,t}))    (11.39)
h̃_{i,t} = g_trans(h̃_{i−1,t}, A_{<i,t})    (11.40)
h̃_{0,t} = h_t    (11.41)
h_t = f_trans(h_{t−1}, A_{t−1}).    (11.42)
Here the bottom RNN cell function ftrans still recurrently updates the hidden state
to get ht , thus implementing the conditioning on all previous t − 1 columns of the
adjacency matrix A. To generate individual entries of the t-th column, the top RNN cell function g_trans takes its own hidden state h̃_{i−1,t} and the already generated entries A_{<i,t} of the t-th column as input and updates the hidden state to h̃_{i,t}. The output distribution is a Bernoulli parameterized by the output of an MLP g_out which takes h̃_{i,t} as input. Note
that the initial hidden state h̃0,t of the top RNN is set to the hidden state ht returned
by the bottom RNN.
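The hierarchical recurrence can be sketched compactly. The cells f_trans, g_trans and the output g_out below are untrained scalar stand-ins for the two RNNs and the MLP, and entries are again restricted to the strictly lower triangle:

```python
import math
import random

N = 4  # maximum number of nodes in this toy example

def f_trans(h, prev_col):          # graph-level (bottom) RNN stand-in
    return 0.5 * h + 0.1 * sum(prev_col)

def g_trans(h_tilde, prev_entry):  # edge-level (top) RNN stand-in
    return 0.7 * h_tilde + 0.2 * prev_entry

def g_out(h_tilde):                # Bernoulli parameter for one entry
    return 1.0 / (1.0 + math.exp(-h_tilde))

def sample(seed=0):
    rng = random.Random(seed)
    h, cols, prev_col = 0.0, [], [0] * N
    for t in range(N):
        h = f_trans(h, prev_col)
        # Top RNN is initialized from the bottom RNN's state (Eq. 11.41).
        h_tilde, col, prev_entry = h, [], 0
        for i in range(N):
            h_tilde = g_trans(h_tilde, prev_entry)
            a = 1 if i < t and rng.random() < g_out(h_tilde) else 0
            col.append(a)
            prev_entry = a
        cols.append(col)
        prev_col = col
    return cols

A_cols = sample()
```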
Objective To train GraphRNN, we can again resort to maximum likelihood, similarly to Section 11.3.3.1. We also need to deal with permutations of
nodes that leave the graph unchanged. Instead of randomly sampling a few orderings
like (Li et al, 2018d), You et al (2018b) propose to use a random-breadth-first-search
ordering. The idea is to first randomly sample a node ordering and then pick the first
node in this ordering as the root. A breadth-first-search (BFS) algorithm is applied
starting from this root node to generate the final node ordering. Let us denote the
corresponding permutation matrix as PBFS . The final objective is,
max log p(P_BFS A P_BFS^⊤),    (11.43)
which is again a strict lower bound of the true log likelihood. Empirical results in
(You et al, 2018b) suggest that this random-BFS ordering provides good perfor-
mances on a few benchmarks.
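The random-BFS ordering can be sketched as follows; `adj` is an adjacency-list representation of a small toy graph:

```python
import random
from collections import deque

def random_bfs_ordering(adj, seed=0):
    rng = random.Random(seed)
    nodes = list(adj)
    rng.shuffle(nodes)             # random permutation
    root = nodes[0]                # first node becomes the BFS root
    order, seen, queue = [], {root}, deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order

adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
order = random_bfs_ordering(adj)
```

Because BFS visits a node only after one of its neighbors, every non-root node appears in the ordering after at least one of its neighbors, which is exactly the property that mitigates the sequential ordering bias.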
Discussion The design of the GraphRNN is simple yet effective. The implemen-
tation is straightforward since most of the modules are standard. The simple variant
is more efficient than the previous GNN-based model (Li et al, 2018d) since it gener-
ates multiple edges (corresponding to one column of the adjacency matrix) per step.
Moreover, the simple variant performs comparably with the full version in the experiments. Nevertheless, GraphRNN still has certain limitations. For example, the RNN depends heavily on the node ordering, since different node orderings result in very different hidden states. The sequential ordering could make two nearby (even
neighboring) nodes far away in the generation sequence (i.e., far away in the gen-
eration time step). Typically, hidden states of an RNN that are far away regarding
the generation time step tend to be quite different, thus making it hard for the model
to learn that these nearby nodes should be connected. We call this phenomenon the
sequential ordering bias.
Following this line of work (Li et al, 2018d; You et al, 2018b), Liao et al (2019a) propose graph recurrent attention networks (GRAN), a GNN-based autoregressive model that greatly improves the previous GNN-based model (Li et al, 2018d) in terms of capacity and efficiency. Furthermore, it alleviates the sequential ordering bias of GraphRNN (You et al, 2018b). In the following, we introduce the details of the model.
Model We start with the adjacency matrix representation of graphs, i.e., G ≡
{PAP⊤ |P ∈ ΠG }. GRAN aims at directly building a probabilistic model over the
adjacency matrix, similarly to GraphRNN. Again, node/edge features are not of primary interest but can be incorporated without much modification to the model. In
particular, from the perspective of modeling the adjacency matrix, the GNN-based
autoregressive model in (Li et al, 2018d) generates one entry of the adjacency matrix
at a step, whereas GraphRNN (You et al, 2018b) generates one column of entries at
a step. GRAN takes a step further along this line by generating a block of columns/rows² of the adjacency matrix at a step, which greatly improves the generation speed. Denoting the submatrix formed by the first k rows of the adjacency matrix A as A_{1:k,:},
we have the following autoregressive decomposition of the probability,
2 Since we are mainly interested in simple graphs, i.e., unweighted, undirected graphs containing
no self-loops or multiple edges, modeling columns or rows makes no difference. We adopt the
row-wise notations to follow the original paper.
242 Renjie Liao
Fig. 11.3: The overview of the graph recurrent attention networks (GRAN). At each step, given an already generated graph, we add a new block of nodes (the block size is 2, and color indicates block membership in the visualization) and augmented edges (dashed lines). We then apply GRAN to this graph to obtain the output distribution over augmented edges (shown as an edge-independent Bernoulli, where the line width indicates the probability of generating each augmented edge). Finally, we sample from the output distribution to obtain a new graph. Adapted from Figure 1 of (Liao et al, 2019a).
p(A) = ∏_{t=1}^{⌈n/k⌉} p(A_{(t−1)k:tk,:} | A_{:(t−1)k,:}),    (11.44)
where A_{:(t−1)k,:} denotes the part of the adjacency matrix generated before the t-th step (i.e., t − 1 blocks with block size k), and A_{(t−1)k:tk,:} denotes the block to be generated at the t-th step. Note that this is a straightforward generalization of the autoregressive model of GraphRNN in Eq. (11.34).
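The block indexing in Eq. (11.44) can be illustrated directly: with block size k, generation proceeds over ⌈n/k⌉ steps, each producing one slice of rows.

```python
import math

def block_slices(n, k):
    # Row ranges [(t-1)k, tk) for each of the ceil(n/k) generation steps;
    # the last block is truncated when k does not divide n.
    steps = math.ceil(n / k)
    return [(t * k, min((t + 1) * k, n)) for t in range(steps)]

slices = block_slices(n=7, k=3)  # -> [(0, 3), (3, 6), (6, 7)]
```

Setting k = 1 recovers GraphRNN's one-row-per-step schedule, while larger k trades expressiveness of the conditional for fewer generation steps.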
To build the conditional probability p(A_{(t−1)k:tk,:} | A_{:(t−1)k,:}), GRAN leverages a message passing graph neural network. Specifically, denoting the already generated graph before step t (corresponding to A_{:(t−1)k,:}) as G^{t−1} = (V^{t−1}, E^{t−1}), we first initialize every node representation vector with its corresponding row of the adjacency matrix, i.e., h_v = A_{v,:} for all v ≤ (t − 1)k. Since we assume the maximum number of nodes is n and pad dummy nodes for graphs of smaller size, h_v is of
size n. At time step t, we are interested in generating a new block of nodes (corre-
sponding to A(t−1)k:tk,: ) and their associated edges. For the k new nodes in the t-th
block, since their corresponding rows in the adjacency matrix are initially all zeros, we give them an arbitrary ordering from 1 to k and use the one-hot encoding of the order index as an additional representation, denoted x_u, to distinguish them. We then form a new graph G̃^t = (V^t, Ẽ^t) by connecting the k new nodes to each other (excluding self-loops) and to every node in G^{t−1}. We call such edges augmented edges; they are shown as dashed edges in Figure 11.3. In other words, V^t is the union of V^{t−1} and the k new nodes, whereas Ẽ^t is the union of E^{t−1} and the augmented edges. The core part of GRAN is to construct a probability distribution over
such augmented edges, from which we can sample a new graph G^t. Note that G^t has the same node set as G̃^t but potentially fewer edges. To construct the
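Forming the augmented graph can be sketched as follows; node labels are integers and the helper name `augment` is illustrative:

```python
def augment(existing_nodes, existing_edges, k):
    # New nodes take the next k integer labels.
    new_nodes = list(range(len(existing_nodes), len(existing_nodes) + k))
    aug = []
    for i, u in enumerate(new_nodes):
        # Connect new nodes to each other (no self-loops) ...
        aug += [(u, v) for v in new_nodes[i + 1:]]
        # ... and to every already generated node.
        aug += [(u, v) for v in existing_nodes]
    return new_nodes, existing_edges + aug, aug

# G^{t-1} has nodes {0,1,2} and edges {(0,1),(1,2)}; add a block of k=2.
new_nodes, all_edges, augmented = augment([0, 1, 2], [(0, 1), (1, 2)], k=2)
```

The model then scores exactly the `augmented` list; edges that fail the Bernoulli draw are dropped when forming G^t.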
probability, we use a GNN with a one-step message passing process, where m_{ij} is again the message over edge (i, j) and Ω_i is the set of neighboring nodes of node i. The message function f_msg and the attention head g_att could be MLPs. Note that we set x_u to zeros for any node u already in the generated graph G^{t−1}, since the one-hot encoding is only used to distinguish the newly added nodes. [a∥b] denotes the concatenation of two vectors a and b. The updated node
representation h′_i serves as the input to the next message passing step. We typically unroll this message passing for a fixed number of steps, which is set as a hyperparameter. Note that the message passing step is independent of the generation step. The attention weight a_{ij} depends on the one-hot encoding x_i, so that messages on augmented edges can be weighted differently from those on edges belonging to E^{t−1}. Based on the final node representations returned by the message passing, we can construct the output distribution as follows,
p(A_{(t−1)k:tk,:} | A_{:(t−1)k,:}) = ∑_{c=1}^{C} α_c ∏_{i=(t−1)k+1}^{tk} ∏_{j=1}^{n} Θ_{c,i,j}    (11.49)
α = softmax( ∑_{i=(t−1)k+1}^{tk} ∑_{j=1}^{n} MLP_α(h_i^R − h_j^R) )    (11.50)
Θ_{c,i,j} = sigmoid( MLP_Θ(h_i^R − h_j^R) ).    (11.51)
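The mixture-of-Bernoullis output distribution can be illustrated numerically. The logits below are arbitrary stand-ins for the MLP_α and MLP_Θ outputs, and for illustration each component scores both present and absent augmented edges:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mixture_prob(alpha_logits, edge_logits, edges_on):
    # alpha_logits: one logit per mixture component (stand-in for MLP_alpha)
    # edge_logits[c][e]: per-component, per-edge logit (stand-in for MLP_Theta)
    # edges_on[e]: 1 if augmented edge e is sampled as present, else 0
    alpha = softmax(alpha_logits)
    total = 0.0
    for c, a in enumerate(alpha):
        p = 1.0
        for e, on in enumerate(edges_on):
            theta = sigmoid(edge_logits[c][e])
            p *= theta if on else (1.0 - theta)
        total += a * p
    return total

p = mixture_prob([0.2, -0.1], [[1.0, -1.0, 0.5], [0.0, 0.0, 0.0]], [1, 0, 1])
```

With C = 1 this reduces to the edge-independent Bernoulli of GraphRNN's simple variant; C > 1 lets the model capture correlations among edges within one block.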
The core number of every node can be computed in linear time (w.r.t. the number of edges) (Batagelj and Zaversnik, 2003). Based on the largest core number per node, we can uniquely determine a partition of all nodes, i.e., disjoint sets of nodes that share the same largest core number. We then assign to each disjoint set the largest core number of its nodes. Starting from the set with the largest core number, we rank all nodes within the set in descending order of node degree. We then move to the set with the second largest core number, and so on, to obtain the final ordering of all nodes. We call this core-descending ordering the k-core node ordering.
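The k-core node ordering can be sketched in plain Python. The peeling routine below is a simple illustration, not the linear-time algorithm of Batagelj and Zaversnik:

```python
def core_numbers(adj):
    # Iteratively peel minimum-degree nodes; a node's core number is the
    # peeling level k at which it is removed.
    core, remaining = {}, {u: len(vs) for u, vs in adj.items()}
    k = 0
    while remaining:
        k = max(k, min(remaining.values()))
        to_remove = [u for u, d in remaining.items() if d <= k]
        while to_remove:
            u = to_remove.pop()
            if u not in remaining:
                continue
            core[u] = k
            del remaining[u]
            for v in adj[u]:
                if v in remaining:
                    remaining[v] -= 1
                    if remaining[v] <= k:
                        to_remove.append(v)
    return core

def k_core_ordering(adj):
    core = core_numbers(adj)
    deg = {u: len(vs) for u, vs in adj.items()}
    # Descending core number, ties broken by descending degree.
    return sorted(adj, key=lambda u: (-core[u], -deg[u]))

# Toy graph: a triangle (core number 2) with one pendant node (core 1).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
order = k_core_ordering(adj)
```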
Our final training objective is

max log( ∑_{P ∈ Π̃_G} p(PAP^⊤) ),    (11.52)

where Π̃_G is the set of permutation matrices corresponding to the above node orderings. This is again a strict lower bound of the true log likelihood.
Discussion GRAN improves the previous GNN-based autoregressive model (Li
et al, 2018d) and GraphRNN (You et al, 2018b) in the following ways. First, it generates a block of rows of the adjacency matrix per step, which is more efficient than generating a single entry or a single row per step. Second, GRAN
uses a GNN to construct the conditional probability. This helps alleviate the se-
quential ordering bias in GraphRNN since GNN is permutation equivariant, i.e.,
the node ordering would not affect the conditional probability per step. Third, the
output distribution in GRAN is more expressive and more efficient for sampling.
GRAN outperforms previous deep graph generative models in terms of empirical
performances and the sizes of graphs that can be generated (e.g., GRAN can gener-
ate graphs up to 5K nodes). Nevertheless, GRAN still suffers from the fact that the
overall model depends on the particular choices of node orderings. It may be hard
to find good orderings in certain applications. How to build an order-invariant deep
graph generative model would be an interesting open question.
In this part, we review a few methods (De Cao and Kipf, 2018; Bojchevski et al,
2018; You et al, 2018a) that apply the idea of generative adversarial networks (GAN)
(Goodfellow et al, 2014b) in the context of graph generation. Based on how a graph
is represented during training, we roughly divide them into two categories: adja-
cency matrix based and random walks based methods. In the following, we explain
these two types of methods in detail.
MolGAN (De Cao and Kipf, 2018) and graph convolutional policy network (GCPN)
(You et al, 2018a) propose a similar GAN-based framework to generate molecule
graphs that satisfy certain chemical properties. Here the graph data is represented slightly differently from previous sections, since one needs to specify both node types (i.e., atom types) and edge types (i.e., chemical bond types). We denote the adjacency matrix³ as A ∈ R^{N×N×Y}, where Y is the number of chemical bond types.
Basically, one slice along the 3rd dimension of A gives an adjacency matrix that
characterizes the connectivities among atoms under a specific chemical bond type.
We denote the node type matrix as X ∈ R^{N×T}, where T is the number of atom types. The goal is to generate (A, X) such that it is similar to observed molecule graphs and possesses certain desirable properties.
Fig. 11.4: The overview of MolGAN. We first draw a latent variable Z ∼ p(Z) and feed it to a generator, which produces a probabilistic (continuous) adjacency matrix A and a probabilistic (continuous) node type matrix X. We then draw a discrete adjacency matrix Ã ∼ A and a discrete node type matrix X̃ ∼ X, which together specify a molecule graph. During training, we simultaneously feed the generated graph to a discriminator and a reward network to obtain the adversarial loss (measuring how similar the generated and observed graphs are) and the negative reward (measuring how likely the generated graphs are to satisfy certain chemical constraints). Adapted from Figure 2 of (De Cao and Kipf, 2018).
Model We now explain the details of MolGAN and then highlight the differences between GCPN and MolGAN. Similar to regular GANs, MolGAN consists of a generator Ḡ_θ(Z) and a discriminator D_φ(A, X). To ensure the generated samples satisfy desirable chemical properties, MolGAN adopts an additional reward network R_ψ(A, X). The overall pipeline of MolGAN is illustrated in Figure 11.4.
To generate a molecule graph, we first sample a latent variable Z ∈ R^d from some prior, e.g., Z ∼ N(0, I). Then we use an MLP to directly map the sampled Z to a
continuous adjacency matrix A and a continuous node type matrix X. The continuous version of the graph data has a natural probabilistic interpretation, i.e., A_{i,j,c}
3 Note that A is actually a tensor. We slightly abuse the terminology here to ease the exposition.
means the probability of connecting atom i and atom j using chemical bond type c, whereas X_{i,t} means the probability of assigning the t-th atom type to the i-th atom. One can sample discrete graph data (Ã, X̃) from the continuous version, i.e., Ã ∼ A and X̃ ∼ X. This sampling procedure can be implemented using the
Gumbel softmax (Jang et al, 2017; Maddison et al, 2017). The discrete adjacency
matrix à along with the discrete node type X̃ specify a molecule graph and complete
the generation process.
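The discretization step can be sketched with the underlying Gumbel trick. The hard-argmax version below (the Gumbel-max trick) draws an exact categorical sample; the Gumbel-softmax used in practice additionally keeps a differentiable relaxation for training:

```python
import math
import random

def gumbel_argmax(log_probs, rng):
    # Add Gumbel(0,1) noise to each log-probability and take the argmax;
    # this yields an exact sample from the categorical distribution.
    noisy = [lp - math.log(-math.log(rng.random())) for lp in log_probs]
    return max(range(len(noisy)), key=noisy.__getitem__)

rng = random.Random(0)
# Continuous distribution over 3 bond-type categories for one entry of A
# (hypothetical values; one category could represent "no bond").
probs = [0.7, 0.2, 0.1]
sample = gumbel_argmax([math.log(p) for p in probs], rng)
```

Repeating this per entry of A (and per row of X) yields the discrete Ã and X̃.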
To evaluate how similar the generated graphs and the observed graphs are, we
need to build a discriminator. Since we are dealing with graphs, the natural can-
didate for a discriminator is a graph neural network, e.g., a graph convolutional
network (GCN) (Kipf and Welling, 2017b). In particular, we use a variant of GCN
(Schlichtkrull et al, 2018) to incorporate multiple edge types. One such graph con-
volutional layer is shown as below,
h′_i = tanh( f_s(h_i, x_i) + ∑_{j=1}^{N} ∑_{y=1}^{Y} (Ã_{i,j,y} / |Ω_i|) f_y(h_j, x_i) ),    (11.53)
where h_i and h′_i are the input and output node representations of the graph convolutional layer, Ω_i is the set of neighboring nodes of node i, and x_i is the i-th row of X, i.e., the node type vector of node i. f_s and f_y are linear transformation functions to be learned. After stacking this type of graph convolution for multiple layers, we can read out the graph representation using the following attention-weighted aggregation,
h_G = tanh( ∑_{v∈V} sigmoid(MLP_att(h_v, x_v)) ⊙ tanh(MLP(h_v, x_v)) ),    (11.54)
where h_v is the node representation returned by the top graph convolutional layer. Note that MLP_att and MLP are two different MLP instances, and ⊙ denotes the element-wise product. We can use the graph representation vector h_G to compute the discriminator score D_φ(A, X), i.e., the probability of classifying a graph as positive (i.e., coming from the data distribution).
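The readout in Eq. (11.54) can be illustrated with scalar stand-ins for the two MLPs (each node contributes an attention logit and a value logit; the numbers below are arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def readout(node_logits):
    # node_logits: list of (att_logit, value_logit) pairs per node,
    # standing in for MLP_att(h_v, x_v) and MLP(h_v, x_v).
    s = sum(sigmoid(a) * math.tanh(v) for a, v in node_logits)
    return math.tanh(s)

h_G = readout([(2.0, 0.5), (-1.0, 1.5), (0.0, -0.3)])
```

The sigmoid gate lets the network softly select which nodes contribute to the graph-level representation, while the outer tanh keeps h_G bounded.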
Objective Originally, GANs learn the model by performing the minimax optimization below,

min_θ max_φ  E_{A,X ∼ p_data(A,X)}[log D_φ(A, X)] + E_{Z ∼ p(Z)}[log(1 − D_φ(Ḡ_θ(Z)))],    (11.55)
where the generator aims at fooling the discriminator and the discriminator aims
at correctly classifying the generated samples and the observed samples. To address certain issues in training GANs, such as mode collapse and instability, the Wasserstein GAN (WGAN) (Arjovsky et al, 2017) and its improved version (Gulrajani et al, 2017) have been proposed. MolGAN follows the improved WGAN and uses the following objective to train the discriminator D_φ(A, X),
min_φ  ∑_{i=1}^{B} [ −D_φ(A^(i), X^(i)) + D_φ(Ḡ_θ(Z^(i))) + α( ∥∇_{Â^(i), X̂^(i)} D_φ(Â^(i), X̂^(i))∥ − 1 )² ],    (11.56)
where B is the mini-batch size, Z^(i) is the i-th sample drawn from the prior, (A^(i), X^(i)) is the i-th graph drawn from the data distribution, and (Â^(i), X̂^(i)) is their linear combination, i.e., (Â^(i), X̂^(i)) = ε(A^(i), X^(i)) + (1 − ε)Ḡ_θ(Z^(i)), ε ∼ U(0, 1). The
squared term on the right-hand side penalizes the gradient of the discriminator so
that the training becomes more stable. α is a weighting term to balance the regular-
ization and the objective. Moreover, fixing the discriminator, we train the generator Ḡ_θ by adding an additional constraint-dependent reward,
min_θ  ∑_{i=1}^{B} λ D_φ(Ḡ_θ(Z^(i))) + (1 − λ) L_RL(Ḡ_θ(Z^(i))),    (11.57)
where L_RL is the negative reward returned by the reward network R_ψ and λ is a weighting hyperparameter that regulates the trade-off between the two losses. The reward could be a non-differentiable quantity that characterizes the chemical properties of the generated molecules, e.g., how likely the generated molecule is to be soluble in water. To learn the model with this non-differentiable reward, deep deterministic policy gradient (DDPG) (Lillicrap et al, 2015) is used. The architecture
of the reward network is the same as that of the discriminator, i.e., a GCN. It is pre-trained by minimizing the squared error between the reward predicted by R_ψ and a property score per molecule produced by external software. The pre-training is necessary since the external software is typically slow and could significantly delay training if it were included in the whole training framework.
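The convex combination of the two generator losses in Eq. (11.57) can be sketched with hypothetical values (a mini-batch of two samples):

```python
def generator_loss(d_scores, rl_losses, lam):
    # d_scores[i]: discriminator score D_phi on the i-th generated sample
    # rl_losses[i]: negative reward L_RL on the same sample
    # lam: trade-off hyperparameter between the two terms
    assert len(d_scores) == len(rl_losses)
    return sum(lam * d + (1.0 - lam) * r
               for d, r in zip(d_scores, rl_losses))

loss = generator_loss(d_scores=[0.3, -0.2], rl_losses=[1.0, 0.8], lam=0.5)
```

Setting lam = 1 recovers a purely adversarial generator, while lam = 0 optimizes only the chemical-property reward.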
Discussion MolGAN demonstrates strong empirical performances on a large
chemical database called QM9 (Ramakrishnan et al, 2014). Similar to other GANs,
the model is likelihood-free and can thus enjoy more flexible and powerful gener-
ators. More importantly, although the generator still depends on the node ordering,
the discriminator and the reward networks are order (permutation) invariant since
they are built from GNNs. Interestingly enough, graph convolutional policy net-
work (GCPN) (You et al, 2018a) solves the same problem using a similar approach.
GCPN has a similar GAN-type of objective and some additional domain-specific
rewards that capture the chemical properties of the molecules. It also learns both a
generator and a discriminator. However, they do not use a reward network to speed
up the reward computation. To deal with the learning of non-differentiable reward,
GCPN leverages the proximal policy optimization (PPO) (Schulman et al, 2017)
method, which empirically performs better than the vanilla policy gradient method.
Another important difference is that GCPN generates the adjacency matrix in an
entry-by-entry autoregressive fashion so that the dependencies among multiple gen-
erated edges are captured whereas MolGAN generates all entries of the adjacency
matrix in parallel conditioned on the latent variable. GCPN also achieves impres-
sive empirical results on another large chemical database called ZINC (Irwin et al,
2012). Nevertheless, there are still limitations with the above models. The discrete
gradient estimators (e.g., the policy gradient type of methods) could have large vari-
ances, which may slow down the training. Since the domain-specific rewards are
non-differentiable and may be time-consuming to obtain, learning a neural network
based approximated reward function like what MolGAN does is appealing. How-
ever, as reported in MolGAN, pre-training seems to be crucial to make the whole
training successful. More exploration along the line of learning a reward function
would be beneficial to simplify the whole training pipeline. On the other hand, both
methods use some variant of GCNs as the discriminator, which is shown to be in-
sufficient in distinguishing certain graphs4 (Xu et al, 2019d). Therefore, exploring
more powerful discriminators like the Lanczos network (Liao et al, 2019b) that ex-
ploits the spectrum of the graph Laplacian as the input feature would be promising
to further improve the performance of the above methods.
4 For example, GCNs cannot distinguish certain regular graphs (e.g., one long cycle versus a disjoint union of shorter cycles: both can have the same number of nodes and every node has exactly two neighbors), assuming all individual node features are identical.
11 Graph Neural Networks: Graph Generation 249
Fig. 11.5: The overview of the NetGAN. We first draw a random vector from a fixed prior N(0, I) and use it to initialize the memory c_0 and the hidden state h_0 of an LSTM. The LSTM generator then decides which node to visit at each step and is unrolled for a fixed number of steps T. The one-hot encoding of the node index is fed to the LSTM as the input for the next step. The discriminator is another LSTM, which performs a binary classification to predict whether a given random walk is sampled from the data distribution. Adapted from Figure 2 of (Bojchevski et al, 2018).
After training the LSTM generator, we are capable of generating random walks.
However, we need an additional step to construct a graph from a set of generated
random walks. The strategy used by NetGAN is as follows. First, we count the edges
that appeared in the set of random walks to obtain a scoring matrix S, which has the
same size as the adjacency matrix. The (i, j)-th entry S_{i,j} of the score matrix indicates how many times edge (i, j) appears in the set of generated random walks. Second, for each node i, we sample a neighbor according to the probability S_{i,j} / ∑_v S_{i,v}. We repeat this sampling until node i has at least one connected neighbor, skipping an edge if it has already been generated. At last, for any edge (i, j), we perform sampling without replacement according to the probability S_{i,j} / ∑_{u,v} S_{u,v} until the maximum number of edges is reached.
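The counting and neighbor-sampling steps can be sketched as follows. This is a simplified illustration (undirected edge counting, a single neighbor draw) rather than NetGAN's full assembly procedure:

```python
import random
from collections import Counter

def score_counts(walks):
    # Count undirected edge occurrences across all generated walks;
    # this plays the role of the score matrix S.
    counts = Counter()
    for walk in walks:
        for u, v in zip(walk, walk[1:]):
            counts[frozenset((u, v))] += 1
    return counts

def sample_neighbor(counts, i, rng):
    # Sample a neighbor of node i with probability S_{i,j} / sum_v S_{i,v}.
    cands = [(next(iter(e - {i})), c) for e, c in counts.items() if i in e]
    total = sum(c for _, c in cands)
    r = rng.random() * total
    for v, c in cands:
        r -= c
        if r <= 0:
            return v
    return cands[-1][0]

walks = [[0, 1, 2, 1], [1, 2, 3], [0, 1, 3]]   # toy generated walks
counts = score_counts(walks)
nbr = sample_neighbor(counts, 1, random.Random(0))
```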
Discussion The random walk based representations for graphs are novel in the
context of deep graph generative models. Moreover, they could be more scalable
than the adjacency matrix representation, since we are not bound by the quadratic (w.r.t. the number of nodes) complexity. The core modules of NetGAN are LSTMs, which are efficient in handling sequences and easy to implement. Nevertheless, the graph construction from a set of generated random walks seems ad hoc. There is no theoretical guarantee on how accurate the proposed construction method is, and it may require a large number of sampled random walks to generate a graph of good quality.
11.4 Summary
In this chapter, we review a few classic graph generative models and some modern
ones which are constructed based on deep neural networks. From the perspectives
of model capacity and empirical performance, e.g., how well a model fits observed data, deep graph generative models significantly outperform their clas-
sic counterparts. For example, they could generate molecule graphs which are both
chemically valid and similar to observed ones in terms of certain graph statistics.
Although we have already made impressive progress in recent years, deep gen-
erative models are still in the early stage. Moving forward, there are at least two
main challenges. First, how can we scale these models so that they can handle real-world graphs like large-scale social networks and the WWW? This requires not only more
computational resources but also more algorithmic improvements. For example,
building a hierarchical graph generative model would be one promising direction
to boost efficiency and scale. Second, how can we effectively add domain-specific
constraints and/or conditioning on some input information? This question is impor-
tant since many real-world applications require the graph generation to be condi-
tioned on some inputs (e.g., scene graph generations conditioned on input images).
Many graphs in practice come with certain constraints (e.g., chemical validity in the
molecule generation).
Xiaojie Guo
Department of Information Science and Technology, George Mason University, e-mail: xguo7@
gmu.edu
Shiyu Wang
Department of Computer Science, Emory University, e-mail: [email protected]
Liang Zhao
Department of Computer Science, Emory University, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 251
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_12
Battaglia et al (2016) proposed the interaction network in the task of reasoning about
objects, relations, and physics, which is central to human intelligence, and a key
goal of artificial intelligence. Many physical problems, such as predicting what will
happen next in physical environments or inferring underlying properties of complex
scenes, are challenging because their elements are composed and can influence each
other as a whole system. It is impossible to solve such problems by considering each
object and relation separately. Thus, the node transformation problem can help deal
with this task via modeling the interactions and dynamics of elements in a complex
system. To deal with the node transformation problem that is formalized in this sce-
nario, an interaction network (IN) is proposed, which combines two main powerful
approaches: structured models, simulation, and deep learning. Structured models
are operated as the main component based on the GNNs to exploit the knowledge
of relations among objects. The simulation part is an effective method for approx-
imating dynamical systems, predicting how the elements in a complex system are
influenced by interactions with one another, and by the dynamics of the system.
The overall complex system can be represented as an attributed, directed multi-
graph G, where each node represents an object and each edge represents the relationship between two objects, e.g., a fixed object attached by a spring to a freely moving mass. To predict the dynamics of a single node (i.e., object), there is an object-centric function, h_i^{t+1} = f_O(h_i^t), which takes the state h_i^t of the object v_i at time t as input and outputs its future state h_i^{t+1} at the next time step. Assuming
two objects have a directed relationship, the first object v_i influences the second object v_j via their interaction. The effect of this interaction, e_{i,j}^{t+1}, is predicted by a relation-centric function f_R, with the object states as well as the attributes of their relationship as inputs. The object updating process is then written as:
e_{i,j}^{t+1} = f_R(h_i^t, h_j^t, r_i);    h_i^{t+1} = f_O(h_i^t, e_{i,j}^{t+1}),    (12.1)
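The update in Eq. (12.1) can be sketched with scalar stand-ins for the learned functions; f_R and f_O below are hypothetical toy choices, not the MLPs used in the original work:

```python
def f_R(h_i, h_j, r):
    # Toy relation-centric function: effect scaled by the relation attribute.
    return r * (h_i - h_j)

def f_O(h, effect):
    # Toy object-centric function: small additive state update.
    return h + 0.1 * effect

h_i, h_j, r = 2.0, 1.0, 0.5       # toy object states and relation attribute
effect = f_R(h_i, h_j, r)         # e_{i,j}^{t+1}
h_i_next = f_O(h_i, effect)       # h_i^{t+1}
```

In the full interaction network, effects from all incoming relations are aggregated before the object-centric update is applied.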
Here D_O and D_I refer to the out-degree and in-degree diagonal matrices, respectively, and P and Q refer to the feature dimensions of the input and output node features at each diffusion convolution layer. The diffusion convolution is defined on both directed and undirected graphs. When applied to undirected graphs, existing graph convolutional neural networks (GCNs) can be considered a special case of the diffusion convolution network.
To deal with the temporal dependency during the node transformation process, recurrent neural networks (RNNs) or the Gated Recurrent Unit (GRU) can be leveraged. For example, by replacing the matrix multiplications in a GRU with the diffusion convolution, the Diffusion Convolutional Gated Recurrent Unit (DCGRU) is defined as
256 Xiaojie Guo, Shiyu Wang, Liang Zhao
where X^t and H^t denote the input and output of all the nodes at time t, and r^t and u^t are the
reset gate and update gate at time t, respectively. ⋆_G denotes the diffusion convolution defined in Equation 12.3, and Θ_r, Θ_u, Θ_c are the parameters for the corresponding filters
in the diffusion network.
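A minimal numerical sketch of one DCGRU step is given below. The diffusion convolution ⋆_G is stood in for by a simple one-direction random-walk filter (the full DCGRU also uses the reverse-direction walk), so all shapes and weights here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_conv(A, X, W):
    """Stand-in for the diffusion convolution X ⋆_G W: a K-step random-walk
    filter sum_k (D_O^{-1} A)^k X W_k, with W of shape (K, F_in, F_out)."""
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-8)  # D_O^{-1} A
    out, Xk = np.zeros((X.shape[0], W.shape[2])), X
    for k in range(W.shape[0]):
        out = out + Xk @ W[k]
        Xk = P @ Xk
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dcgru_cell(A, x_t, h_prev, Wr, Wu, Wc):
    """One DCGRU step: the GRU's matrix multiplications are replaced by ⋆_G."""
    xh = np.concatenate([x_t, h_prev], axis=1)
    r = sigmoid(diffusion_conv(A, xh, Wr))                    # reset gate r^t
    u = sigmoid(diffusion_conv(A, xh, Wu))                    # update gate u^t
    c = np.tanh(diffusion_conv(A, np.concatenate([x_t, r * h_prev], axis=1), Wc))
    return u * h_prev + (1 - u) * c                           # new state H^t

# Toy rollout: 4 nodes, 2 input features, hidden size 3, K = 2 diffusion steps.
A = (rng.random((4, 4)) < 0.5).astype(float)
Wr, Wu, Wc = (rng.normal(size=(2, 5, 3)) for _ in range(3))
H = np.zeros((4, 3))
for t in range(5):                                            # unroll over time
    H = dcgru_cell(A, rng.normal(size=(4, 2)), H, Wr, Wu, Wc)
```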
Another typical spatio-temporal graph convolutional network for spatial-temporal
node transformation is proposed by Yu et al (2018a). This model comprises several spatio-temporal convolutional blocks, which are a combination of graph convolutional layers and convolutional sequence learning layers, to model spatial and
temporal dependencies. Specifically, the framework consists of two spatio-temporal
convolutional blocks (ST-Conv blocks) and a fully-connected output layer in the
end. Each ST-Conv block contains two temporal gated convolution layers and one
spatial graph convolution layer in the middle. The residual connection and bottle-
neck strategy are applied inside each block. The input sequence of node information
is uniformly processed by ST-Conv blocks to explore spatial and temporal depen-
dencies coherently. Comprehensive features are integrated by an output layer to generate the final prediction. In contrast to the above-mentioned DCGRU, this model is
built entirely from convolutional structures to capture both spatial and temporal
patterns without any recurrent neural network; each block is specially designed to
uniformly process structured data, with a residual connection and bottleneck strategy
inside.
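The sandwich structure of an ST-Conv block can be sketched as follows; the GLU-style temporal gate and the single first-order graph convolution are simplifications of the layers described above, with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_gated_conv(X, Wp, Wq):
    """Temporal gated convolution: a 1-D conv along time with a GLU-style gate,
    (X * Wp) ⊙ sigmoid(X * Wq). X: (T, N, F); Wp, Wq: (Kt, F, F_out)."""
    Kt, F, Fo = Wp.shape
    T, N, _ = X.shape
    out = np.empty((T - Kt + 1, N, Fo))
    for t in range(T - Kt + 1):
        win = X[t:t + Kt].transpose(1, 0, 2).reshape(N, Kt * F)   # time window
        p = win @ Wp.reshape(Kt * F, Fo)
        q = win @ Wq.reshape(Kt * F, Fo)
        out[t] = p * (1.0 / (1.0 + np.exp(-q)))                   # GLU gate
    return out

def st_conv_block(X, A_hat, W1p, W1q, Ws, W2p, W2q):
    """ST-Conv block: temporal gated conv -> spatial graph conv -> temporal
    gated conv (the 'sandwich' structure described above)."""
    h = temporal_gated_conv(X, W1p, W1q)
    h = np.maximum(np.einsum('ij,tjf,fo->tio', A_hat, h, Ws), 0)  # graph conv + ReLU
    return temporal_gated_conv(h, W2p, W2q)

T, N = 8, 5
X = rng.normal(size=(T, N, 2))
A_hat = np.eye(N) + 0.1 * rng.random((N, N))      # normalized-adjacency stand-in
out = st_conv_block(X, A_hat,
                    rng.normal(size=(3, 2, 4)), rng.normal(size=(3, 2, 4)),
                    rng.normal(size=(4, 4)),
                    rng.normal(size=(3, 4, 6)), rng.normal(size=(3, 4, 6)))
```

Note how each temporal convolution with kernel size 3 shrinks the time axis by 2, so an 8-step input yields a 4-step output after one block.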
Edge-level transformation aims to generate the graph topology and edge attributes of
the target graph conditioning on the input graph. It requires the edge set E and the edge
attributes to change while the node set and node attributes remain fixed during
the transformation: T : G_S(V, E_S, F, E_S) → G_T(V, E_T, F, E_T). Edge transformation
has a wide range of real-world applications, such as modeling chemical reactions
(You et al, 2018a), protein folding (Anand and Huang, 2018) and malware cyber-
network synthesis (Guo et al, 2018b). For example, in social networks where people
are the nodes and their contacts are the edges, the contact graph among them varies
dramatically across different situations. When the people are organizing a riot, for instance, the contact graph is expected to become denser and several special
“hubs” (e.g., key players) may appear. Hence, accurately predicting the contact network in advance is of great value.
12 Graph Neural Networks: Graph Transformation 257
where S refers to the dataset. T tries to minimize this objective while an adversarial D tries to maximize it, i.e., T* = arg min_T max_D L(T, D). The graph translator
includes two parts: graph encoder and graph decoder. A graph convolution neural net
(Kawahara et al, 2017) is extended to serve as the graph encoder in order to embed
the input graph into node-level representations, while a new graph deconvolution
net is designed as the decoder to generate the target graph. Specifically, the encoder
consists of edge-to-edge and edge-to-node convolution layers, which first extract latent edge-level representations and then node-level representations {H_i}_{i=1}^N, where
H_i ∈ R^L refers to the latent representation of node v_i. The decoder consists of node-to-edge and edge-to-edge deconvolution layers, which first obtain each edge representation
Ê_{i,j} based on H_i and H_j, and then finally produce the edge attribute tensor E based on Ê.
Based on the graph deconvolution above, it is possible to utilize skips to link the
extracted edge latent representations of each layer in the graph encoder with those
in the graph decoder.
Specifically, in the graph translator, the output of the l-th “edge deconvolution”
layer in the decoder is concatenated with the output of the l-th “edge convolution”
layer in the encoder to form joint two channels of feature maps, which are then
input into the (l + 1)-th deconvolution layer. It is worth noting that one key factor
for effective translation is the design of a symmetrical encoder-decoder pair, in which
the graph deconvolution mirrors and reverses the graph convolution. This
allows the skip-connections to directly translate edge information at different levels in
each layer.
The graph discriminator is utilized to distinguish the “translated” target
graphs from the “real” ones conditioned on the input graphs, as this helps to train the generator in an adversarial way. Technically, this requires the discriminator to accept
two graphs simultaneously as inputs (a real target graph and an input graph or a
generated graph and an input graph) and classify the two graphs as either related or
not. Thus, a conditional graph discriminator (CGD) that leverages the same graph
convolution layers in the encoder is utilized for the graph classification. Specifically,
the input and target graphs are both ingested by the CGD and stacked into a tensor,
which can be considered a 2-channel input. After obtaining the node representa-
tions, the graph-level embedding is computed by summing these node embeddings.
Finally, a softmax layer is implemented to distinguish the input graph-pair from the
real graph or generated graph.
To further handle the situation in which pairing information between the input and
the output is not available, Gao et al (2018b) propose Unpaired Graph Translation Generative Adversarial Nets (UGT-GAN), which builds on Cycle-GAN (Zhu et al,
2017) and incorporates the same encoder and decoder as GT-GAN to handle
unpaired graph transformation problems. The cycle consistency loss is utilized and
generalized into a graph cycle consistency loss for unpaired graph translation. Specifically, graph cycle consistency adds a translator in the opposite direction, from the target to the
source domain, T_r : G_T → G_S, training the mappings for both directions simultaneously and adding a cycle consistency loss that encourages T_r(T(G_S)) ≈ G_S and
T(T_r(G_T)) ≈ G_T. Combining this loss with adversarial losses on domains G_T and
G_S yields the full objective for unpaired graph translation.
$$A_T^{(k)} = R^{(1)\top} \cdots R^{(k-1)\top} A_S^{(k)} R^{(k-1)} \cdots R^{(1)}, \tag{12.7}$$

where R^{(k)} ∈ R^{N^{(k)} × N^{(k)}} is the reconstruction operator for the k-th level. Thus, all the
reconstructed fine graphs at each layer are on the same scale. Finally, these graphs
are aggregated into a unique one by a linear function to obtain the final adjacency matrix:
A_T = Σ_{k=1}^{K} (w_k A_T^{(k)} + b_k I), where w_k ∈ R and b_k ∈ R are the weights and biases.
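A small numerical sketch of the reconstruction and aggregation steps (with random matrices standing in for the learned reconstruction operators, and rectangular operator shapes chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def reconstruct_to_finest(A_k, Rs):
    """Lift a level-k adjacency back to the finest scale following Eq. 12.7:
    A_T^(k) = R^(1)ᵀ ... R^(k-1)ᵀ A_S^(k) R^(k-1) ... R^(1).
    Rs = [R^(1), ..., R^(k-1)], each mapping between adjacent level sizes."""
    A = A_k
    for R in reversed(Rs):
        A = R.T @ A @ R
    return A

# Three scales with 6 -> 4 -> 2 nodes; random operators stand in for learned ones.
R1, R2 = rng.random((4, 6)), rng.random((2, 4))
A1, A2, A3 = rng.random((6, 6)), rng.random((4, 4)), rng.random((2, 2))
lifted = [A1, reconstruct_to_finest(A2, [R1]), reconstruct_to_finest(A3, [R1, R2])]

# Linear aggregation of the same-scale graphs into the final adjacency matrix.
w, b = rng.random(3), rng.random(3)
A_T = sum(w[k] * lifted[k] + b[k] * np.eye(6) for k in range(3))
```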
The goal of molecule optimization, one of the important molecule generation problems, is to optimize the properties of a given molecule by transforming it
into a novel output molecule with optimized properties. The molecule optimization
problem is typically formalized as an NECT (node-edge co-transformation) problem, where the input graph refers to
the initial molecule and the output graph refers to the optimized molecule. Both the
node and edge attributes can change during the transformation process.
The Junction-tree Variational Auto-encoder (JT-VAE) is motivated by the key
challenge of molecule optimization in the domain of drug design, which is to find
target molecules with the desired chemical properties (Jin et al, 2018a). In terms of
the model architecture, JT-VAE extends the VAE (Kingma and Welling, 2014) to
molecular graphs by introducing a suitable encoder and a matching decoder. Under
JT-VAE, each molecule is interpreted as being formed from subgraphs chosen
from a dictionary of valid components. These components serve as building blocks
when encoding a molecule into a vector representation and decoding latent vectors
back into optimized molecular graphs. The dictionary of components, such as rings,
bonds and individual atoms, is large enough to ensure that a given molecule can
be covered by overlapping clusters without forming cluster cycles. In general, JT-
VAE generates molecular graphs in two phases, by first generating a tree-structured
scaffold over chemical substructures and then combining them into a molecule with
a graph message-passing network.
The latent representation of the input graph G is encoded by a graph message-
passing network (Dai et al, 2016; Gilmer et al, 2017). Here, let x_v denote the feature
vector of vertex v, encoding properties such as the atom type and
valence. Similarly, each edge (u, v) ∈ E has a feature vector x_{uv} indicating its bond
type. Two hidden vectors, ν_{uv} and ν_{vu}, denote the messages from u to v and from v to u, respectively.
In the encoder, messages are exchanged via loopy belief propagation:
$$\nu_{uv}^{(t)} = \tau\Big(W_1^g x_u + W_2^g x_{uv} + W_3^g \sum_{w \in N(u) \setminus v} \nu_{wu}^{(t-1)}\Big), \tag{12.8}$$
where ν_{uv}^{(t)} is the message computed in the t-th iteration, initialized with ν_{uv}^{(0)} = 0; τ(·)
is the ReLU function; W_1^g, W_2^g, and W_3^g are weights; and N(u) denotes the neighbors
of u. Then, after T iterations, the latent vector of each vertex is generated, capturing
its local graphical structure:
$$h_u = \tau\Big(U_1^g x_u + U_2^g \sum_{v \in N(u)} \nu_{vu}^{(T)}\Big), \tag{12.9}$$
where U_1^g and U_2^g are weights. The final graph representation is h_G = Σ_i h_i / |V|,
where |V| is the number of nodes in the graph. The corresponding latent variable
z_G can be sampled from N(z_G; μ_G, σ_G²), where μ_G and σ_G² are calculated from h_G
via two separate affine layers.
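The encoder of Equations 12.8-12.9 can be sketched with plain dictionaries; the triangle "molecule", feature sizes, weight shapes, and iteration count below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_graph(x_node, x_edge, nbrs, W1, W2, W3, U1, U2, T=3):
    """Loopy-BP message passing of Eqs. 12.8-12.9. x_node[v]: atom features,
    x_edge[(u, v)]: bond features, nbrs[v]: neighbor list."""
    relu = lambda z: np.maximum(z, 0)
    d = W3.shape[0]                                             # message dim
    nu = {(u, v): np.zeros(d) for u in nbrs for v in nbrs[u]}   # nu^(0) = 0
    for _ in range(T):                                          # synchronous updates
        nu = {(u, v): relu(x_node[u] @ W1 + x_edge[(u, v)] @ W2 +
                           sum((nu[(w, u)] for w in nbrs[u] if w != v),
                               np.zeros(d)) @ W3)
              for (u, v) in nu}
    h = {u: relu(x_node[u] @ U1 +
                 sum((nu[(v, u)] for v in nbrs[u]), np.zeros(d)) @ U2)
         for u in nbrs}
    h_G = np.mean([h[u] for u in nbrs], axis=0)                 # readout h_G
    return h, h_G

# Toy triangle molecule: 3 atoms, feature dims 3 (atoms) and 2 (bonds), d = 4.
nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
x_node = {v: rng.normal(size=3) for v in nbrs}
x_edge = {(u, v): rng.normal(size=2) for u in nbrs for v in nbrs[u]}
W1, W2, W3 = rng.normal(size=(3, 4)), rng.normal(size=(2, 4)), rng.normal(size=(4, 4))
U1, U2 = rng.normal(size=(3, 5)), rng.normal(size=(4, 5))
h, h_G = encode_graph(x_node, x_edge, nbrs, W1, W2, W3, U1, U2)
```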
A junction tree can be represented as (V, E, X), whose node set is V = (C_1, ..., C_n)
and edge set is E = (E_1, ..., E_n). This junction tree is labeled by the label dictionary
X. Similar to the graph representation, each cluster C_i is represented by a one-hot vector x_i,
and each edge (C_i, C_j) corresponds to two message vectors v_{ij} and v_{ji}. An arbitrary
leaf node is picked as the root, and messages are propagated in two phases:
$$s_{ij} = \sum_{k \in N(i) \setminus j} v_{ki}, \tag{12.10}$$
$$z_{ij} = \sigma(W^z x_i + U^z s_{ij} + b^z),$$
$$r_{ki} = \sigma(W^r x_i + U^r v_{ki} + b^r),$$
$$\tilde{v}_{ij} = \tanh\Big(W x_i + U \sum_{k \in N(i) \setminus j} r_{ki} \odot v_{ki}\Big),$$
$$v_{ij} = (1 - z_{ij}) \odot s_{ij} + z_{ij} \odot \tilde{v}_{ij}.$$
The final tree representation is h_{T_G} = h_{root}. z_{T_G} is sampled in a similar way as in
the graph encoding process.
Under the JT-VAE framework, the junction tree is decoded from zTG using a
tree-structured decoder that traverses the tree from the root and generates nodes in
their depth-first order. During this process, a node receives information from other
nodes, and this information is propagated through message vectors h_{ij}. Formally,
let Ẽ = {(i_1, j_1), ..., (i_m, j_m)} be the set of edges traversed over the junction tree
(V, E), where m = 2|E| because each edge is traversed in both directions. The
model visits node i_t at time t. Let Ẽ_t be the first t edges in Ẽ. The message is updated
as h_{i_t, j_t} = GRU(x_{i_t}, {h_{k, i_t}}_{(k, i_t) ∈ Ẽ_t, k ≠ j_t}), where x_{i_t} corresponds to the node features.
The decoder first makes a prediction regarding whether the node it still has children
to be generated, in which the probability is calculated as:
$$p_t = \sigma\Big(u^d \cdot \tau\Big(W_1^d x_{i_t} + W_2^d z_{T_G} + W_3^d \sum_{(k, i_t) \in \tilde{E}_t} h_{k, i_t}\Big)\Big), \tag{12.12}$$
where u^d, W_1^d, W_2^d, and W_3^d are weights. Then, when a child node j is generated from
its parent i, its node label is predicted with:

$$q_j = \mathrm{softmax}(U^l \cdot \tau(W_1^l z_{T_G} + W_2^l h_{ij})), \tag{12.13}$$

where U^l, W_1^l, and W_2^l are weights and q_j is a distribution over the label dictionary X.
The final step of the model is to reproduce a molecular graph G that represents
the predicted junction tree (V̂, Ê) by assembling the subgraphs into the
final molecular graph. Let G(T_G) be the set of graphs corresponding to the junction
tree T_G. Decoding a graph Ĝ from the junction tree T̂_G = (V̂, Ê) is a structured
prediction:

$$\hat{G} = \arg\max_{G' \in \mathcal{G}(\hat{T}_G)} f^a(G'), \tag{12.14}$$

where f^a is a scoring function over candidate graphs. The decoder starts by sampling
the assembly of the root and its neighbors according to their scores, then proceeds to
assemble the neighbors and associated clusters. In terms of scoring the realization
of each neighborhood, let G_i be the subgraph resulting from a particular merging of
cluster C_i in the tree with its neighbors C_j, j ∈ N_{T̂_G}(i). G_i is scored as a candidate
subgraph by first deriving a vector representation h_{G_i}; f_i^a(G_i) = h_{G_i} · z_G is then the corresponding subgraph score.
and G*, F* = arg min_{G,F} max_{D_X, D_Y} L(G, F, D_X, D_Y). The adversarial loss is utilized:
$$\mathcal{L}_{GAN}(G, D_Y, G_X, G_Y) = \frac{1}{2}\,\mathbb{E}_{y \sim p_{data}^{G_Y}}[(D_Y(y) - 1)^2] + \frac{1}{2}\,\mathbb{E}_{x \sim p_{data}^{G_X}}[D_Y(G(x))^2], \tag{12.17}$$

which ensures that the generator G (and F) generates samples from a distribution
close to the distribution of G_Y (or G_X), denoted by p_{data}^{G_Y} (or p_{data}^{G_X}). The cycle consistency loss
$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{y \sim p_{data}^{G_Y}}[\|G(F(y)) - y\|_1] + \mathbb{E}_{x \sim p_{data}^{G_X}}[\|F(G(x)) - x\|_1], \tag{12.18}$$
reduces the space of possible mapping functions such that, for a
molecule x from the set G_X, the GAN cycle constrains the output to a molecule similar
to x. The inclusion of the cyclic component acts as a regularization factor, making
the model more robust. Finally, to ensure that the generated molecule is close to the
original, identity mapping loss is employed:
$$\mathcal{L}_{identity}(G, F) = \mathbb{E}_{y \sim p_{data}^{G_Y}}[\|F(y) - y\|_1] + \mathbb{E}_{x \sim p_{data}^{G_X}}[\|G(x) - x\|_1], \tag{12.19}$$
which further reduces the space of possible mapping functions and
prevents the model from generating molecules that lie far away from the starting
molecule in the latent space of JT-VAE.
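The three losses of Equations 12.17-12.19 are easy to state directly over batches of vector representations; the sketch below treats G, F, and D_Y as plain callables and is not tied to any particular molecular representation:

```python
import numpy as np

def lsgan_loss(D_Y, G, x, y):
    """Least-squares adversarial loss of Eq. 12.17 over batches x ~ G_X, y ~ G_Y."""
    return 0.5 * np.mean((D_Y(y) - 1.0) ** 2) + 0.5 * np.mean(D_Y(G(x)) ** 2)

def cycle_loss(G, F, x, y):
    """Cycle-consistency loss of Eq. 12.18: G(F(y)) ≈ y and F(G(x)) ≈ x."""
    return (np.mean(np.abs(G(F(y)) - y).sum(axis=1)) +
            np.mean(np.abs(F(G(x)) - x).sum(axis=1)))

def identity_loss(G, F, x, y):
    """Identity-mapping loss of Eq. 12.19, keeping outputs near their inputs."""
    return (np.mean(np.abs(F(y) - y).sum(axis=1)) +
            np.mean(np.abs(G(x) - x).sum(axis=1)))

# Sanity check with identity mappings: both consistency losses vanish.
x, y = np.ones((4, 3)), np.zeros((4, 3))
ident = lambda z: z
```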
Since the ordering of the nodes is defined by the topological sort of G_in, all the
hidden states h_v can be computed with a single forward pass along a layer of DG-DAGRNN. The encoder contains multiple layers, each of which passes hidden states
to the recurrent units in the subsequent layer corresponding to the same node.
The encoder outputs an embedding H_in = E_α(G_in), which serves as the input of
the DAG decoder. The decoder follows a local-based node-sequential generation
style. Specifically, the number of nodes of the target graph is first predicted by a
multilayer perceptron (MLP) with a Poisson regressor output layer, which takes the
input graph embedding H_in and outputs the mean of a Poisson distribution describing the size of the output graph. Whether an edge e_{u,v_n} needs to be added for each node
u ∈ {v_1, ..., v_{n-1}} already in the graph is then determined by an MLP module. Since the
output nodes are generated in their topological order, the edges are directed from
the nodes added earlier to the nodes added later. For each node v, the hidden state
hv is calculated using a similar mechanism to that used in the encoder, after which
they are aggregated and fed to a GRU. The other input for the GRU consists of the
aggregated states of all the sink nodes generated so far. For the first node, the hidden
state is initialized based on the encoder’s output. Then, the output node features are
generated based on its hidden state using another module of MLP. Finally, once the
last node has been generated, the edges are introduced with probability 1 for sinks
in the graph to ensure a connected graph with only one sink node as an output.
Zhou et al, 2019c). The modification at each step is selected from a defined action
set that includes “add node”, “add edge”, “remove bond”, and so on. Another approach is
to update the nodes and edges of the source graph synchronously in a one-shot
manner through an MPNN over several iterations (Guo et al, 2019c).
Motivated by the large size of chemical space, which can be an issue when design-
ing molecular structures, graph convolutional policy networks (GCPNs) serve as
useful general graph convolutional network-based models for goal-directed graph
generation through reinforcement learning (RL) (You et al, 2018a). In this model,
the generation process can be guided towards specific desired objectives, while
restricting the output space based on underlying chemical rules. To achieve goal-directed generation, three strategies, namely graph representation, reinforcement
learning, and adversarial training, are adopted. In GCPN, molecules are represented
as molecular graphs, and partially generated molecular graphs can be interpreted as
substructures. GCPN is designed as an RL agent that operates within a chemistry-aware graph generation environment. A molecule is successively constructed by
connecting a new substructure or atom to the existing molecular graph via a new
bond. GCPN is trained to optimize the domain-specific properties of the source
molecule by applying a policy gradient to optimize it via a reward composed of
molecular property objectives and adversarial loss; it acts in an environment which
incorporates domain-specific rules. The adversarial loss is provided by a GCN-based
discriminator trained jointly on a dataset of example molecules.
An iterative graph generation process is designed and formulated as a general
decision process M = (S, A, P, R, γ), where S = {s_i} is the set of states comprising all possible intermediate and final graphs, A = {a_i} is the set of actions
describing the modifications made to the current graph during each iteration,
P represents the transition dynamics specifying the possible outcomes of carrying out an action, p(s_{t+1} | s_t, ..., s_0, a_t), R(s_t) = r_t is a reward function specifying
the reward after reaching state s_t, and γ is the discount factor. The graph generation process can then be formulated as (s_0, a_0, r_0, ..., s_n, a_n, r_n), and the modification of the graph at each step can be described as a state transition distribution:
p(s_{t+1} | s_t, ..., s_0) = Σ_{a_t} p(a_t | s_t, ..., s_0) p(s_{t+1} | s_t, ..., s_0, a_t), where p(a_t | s_t, ..., s_0) is represented as a policy network π_θ. Note that in this process, the state transition dynamics are designed to satisfy the Markov property, p(s_{t+1} | s_t, ..., s_0) = p(s_{t+1} | s_t).
In this model, a distinct, fixed-dimensional, homogeneous action space amenable
to reinforcement learning is defined, in which an action is analogous to link prediction. Specifically, a set of scaffold subgraphs {C_1, ..., C_s} is first defined based on
the source graph, serving as a subgraph vocabulary containing the subgraphs
to be added to the target graph during graph generation. Define C = ∪_{i=1}^s C_i. Given
the modified graph G_t at step t, the corresponding extended graph is defined as
G_t ∪ C. Under this definition, an action can either connect a new
subgraph C_i to a node in G_t or connect existing nodes within graph G_t. A GAN is
also employed to define the adversarial rewards to ensure that generated molecules
do indeed resemble the originals.
Node embedding is achieved by message passing over each edge type for L layers
through a GCN. At the l-th layer, messages from different edge types are aggregated to calculate the node embedding H^{(l+1)} ∈ R^{(n+c)×k} of the next layer, where
n and c are the sizes of G_t and C, respectively, and k is the embedding dimension:

$$H^{(l+1)} = \mathrm{AGG}\Big(\mathrm{ReLU}\Big(\Big\{\hat{D}_i^{-\frac{1}{2}} \hat{E}_i \hat{D}_i^{-\frac{1}{2}} H^{(l)} W_i^{(l)}\Big\}, \forall i \in (1, \ldots, b)\Big)\Big). \tag{12.21}$$

Here, E_i is the i-th slice of the edge-conditioned adjacency tensor E, Ê_i = E_i + I, D̂_i is the diagonal degree matrix with entries Σ_k Ê_{ijk}, and W_i^{(l)} is the weight matrix for the i-th edge type. AGG denotes one of the
aggregation functions from {MEAN, MAX, SUM, CONCAT}.
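One layer of Equation 12.21 can be sketched as follows: a per-edge-type normalized convolution followed by an aggregation. The shapes and the random toy tensors are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def gcpn_layer(E, H, W, agg="MEAN"):
    """One GCPN node-embedding layer (Eq. 12.21). E: (b, n, n) edge-conditioned
    adjacency tensor (one slice per edge type), H: (n, k) node embeddings,
    W: (b, k, k') per-edge-type weights."""
    b, n, _ = E.shape
    outs = []
    for i in range(b):
        E_hat = E[i] + np.eye(n)                       # Ê_i = E_i + I
        d_inv_sqrt = 1.0 / np.sqrt(E_hat.sum(axis=1))  # diagonal of D̂_i^{-1/2}
        conv = (d_inv_sqrt[:, None] * E_hat * d_inv_sqrt[None, :]) @ H @ W[i]
        outs.append(np.maximum(conv, 0))               # ReLU
    if agg == "MEAN":
        return np.mean(outs, axis=0)
    if agg == "SUM":
        return np.sum(outs, axis=0)
    if agg == "MAX":
        return np.max(outs, axis=0)
    return np.concatenate(outs, axis=1)                # CONCAT

E = (rng.random((2, 4, 4)) < 0.4).astype(float)        # 2 edge types, 4 nodes
H = rng.normal(size=(4, 3))
W = rng.normal(size=(2, 3, 5))
H_next = gcpn_layer(E, H, W)
```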
The link prediction-based action a_t ensures that each component is sampled from a
predicted distribution governed by the equations below:

$$a_t = \mathrm{CONCAT}(a_{first}, a_{second}, a_{edge}, a_{stop}), \tag{12.22}$$
$$f_{first}(s_t) = \mathrm{softmax}(m_f(X)), \quad a_{first} \sim f_{first}(s_t) \in \{0, 1\}^n, \tag{12.23}$$
$$f_{second}(s_t) = \mathrm{softmax}(m_s(X_{a_{first}}, X)), \quad a_{second} \sim f_{second}(s_t) \in \{0, 1\}^{n+c},$$
$$f_{edge}(s_t) = \mathrm{softmax}(m_e(X_{a_{first}}, X_{a_{second}})), \quad a_{edge} \sim f_{edge}(s_t) \in \{0, 1\}^b,$$
$$f_{stop}(s_t) = \mathrm{softmax}(m_t(\mathrm{AGG}(X))), \quad a_{stop} \sim f_{stop}(s_t) \in \{0, 1\}.$$
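Sampling the four action components of Equations 12.22-12.23 can be sketched as below, with toy callables standing in for the m_* MLPs; the exact inputs of each scoring function are as interpreted here, and mean pooling stands in for AGG:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sample_action(X, n, m_f, m_s, m_e, m_t):
    """Sample a_t = (a_first, a_second, a_edge, a_stop) following
    Eqs. 12.22-12.23. X: (n+c, k) embeddings of the extended graph G_t ∪ C."""
    a_first = rng.choice(n, p=softmax(m_f(X)[:n]))            # node inside G_t
    a_second = rng.choice(len(X), p=softmax(m_s(X[a_first], X)))
    edge_scores = m_e(X[a_first], X[a_second])
    a_edge = rng.choice(len(edge_scores), p=softmax(edge_scores))
    a_stop = rng.choice(2, p=softmax(m_t(X.mean(axis=0))))    # AGG(X) = mean
    return a_first, a_second, a_edge, a_stop

# Toy setup: n = 2 graph nodes plus c = 1 scaffold node, embedding size 3.
X = rng.normal(size=(3, 3))
m_f = lambda X: X.sum(axis=1)                          # first-node scores
m_s = lambda xf, X: X @ xf                             # second-node scores
m_e = lambda a, b: np.array([a @ b, -(a @ b), 0.0])    # b = 3 edge types
m_t = lambda g: np.array([g.sum(), -g.sum()])          # stop / continue
a = sample_action(X, 2, m_f, m_s, m_e, m_t)
```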
In addition to GCPN, molecule deep Q-networks (MolDQN) has also been devel-
oped for molecule optimization under the node-edge co-transformation problem uti-
lizing an editing-based style. This combines domain knowledge of chemistry with
state-of-the-art reinforcement learning techniques (double Q-learning and random-
ized value functions) (Zhou et al, 2019c). In this field, traditional methods usually
employ policy gradients to generate graph representations of molecules, but these
suffer from high variance when estimating the gradient (Gu et al, 2016). In com-
parison, MolDQN is based on value function learning, which is usually more stable
and sample efficient. MolDQN also avoids the need for expert pretraining on some
datasets, which may lead to lower variance but limits the search space considerably.
In the framework proposed here, modifications of molecules are directly defined
to ensure 100% chemical validity. Modification or optimization is performed in a
step-wise fashion, where each step belongs to one of the following three categories:
(1) atom addition, (2) bond addition, and (3) bond removal. Because the molecule
generated depends solely on the molecule being changed and the modification made,
the optimization process can be formulated as a Markov decision process (MDP).
Specifically, when performing the action atom addition, an empty set of atoms VT
for the target molecule graph is first defined. Then, a valid action is defined as adding
an atom in VT and also a bond between the added atom and the original molecule
wherever possible. When performing the action bond addition, a bond is added be-
tween two atoms in VT . If there is no existing bond between the two atoms, the
actions between them can consist of adding a single, double or triple bond. If there
is already a bond, this action changes the bond type by increasing the index of the
bond type by one or two. When performing the action bond removal, the valid bond
removal action set is defined as the actions that decrease the bond type index of an
existing bond. Possible transitions include: (1) Triple bond → {Double, Single, No}
bond, (2) Double bond → {Single, No} bond, and (3) Single bond → {No} bond.
Based on the molecule modification MDP defined above, RL aims to find a policy
π that chooses an action for each state that maximizes future rewards. Then, the
decision is made by finding the action a for a state s to maximize the Q function:
$$Q^\pi(s, a) = Q^\pi(m, t, a) = \mathbb{E}_\pi\Big[\sum_{n=t}^{T} r_n\Big], \tag{12.24}$$

where r_n is the reward at step n. The optimal policy can therefore be defined as
π*(s) = arg max_a Q^{π*}(s, a). A neural network is adopted to approximate Q(s, a; θ)
and can be trained by minimizing the loss function:
$$l(\theta) = \mathbb{E}[f_l(y_t - Q(s_t, a_t; \theta))], \tag{12.25}$$

where y_t = r_t + max_a Q(s_{t+1}, a; θ) is the target value and f_l is the Huber loss:

$$f_l(x) = \begin{cases} \frac{1}{2}x^2 & \text{if } |x| < 1 \\ |x| - \frac{1}{2} & \text{otherwise.} \end{cases} \tag{12.26}$$

A network with hidden state sizes [1024, 512, 128, 32] and ReLU activations is used as
the architecture.
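The target and loss of Equations 12.25-12.26 for a single transition can be written directly; the discount factor is omitted here, matching the equations as stated:

```python
import numpy as np

def huber(x):
    """The Huber loss f_l of Eq. 12.26."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1.0, 0.5 * x ** 2, np.abs(x) - 0.5)

def td_loss(r_t, q_next, q_taken):
    """Single-transition version of Eq. 12.25: f_l(y_t - Q(s_t, a_t; θ)),
    with the target y_t = r_t + max_a Q(s_{t+1}, a; θ)."""
    y_t = r_t + np.max(q_next)
    return float(huber(y_t - q_taken))
```

For example, with reward 1.0, next-state Q-values [0.2, 0.7], and a current estimate of 1.7, the target is exactly met and the loss is zero.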
To overcome a number of challenges, including but not limited to the mutually
dependent translation of node and edge attributes, the asynchronous and iterative
changes in node and edge attributes during graph translation, and the difficulty of
discovering and enforcing the correct consistency between node attributes and graph
spectra, the Node-Edge Co-evolving Deep Graph Translator (NEC-DGT) has been
developed to achieve so-called multi-attributed graph translation; it has been proven to be a
generalization of the existing topology translation models (Guo et al, 2019c). NEC-DGT
edits the source graph iteratively
through a generation process similar to the MPNN-based, adjacency-based one-shot
method for unconditional deep graph generation, with the main difference being
that it takes the graph in the source domain as input rather than an initialized graph
(Guo et al, 2019c).
NEC-DGT employs a multi-block translation architecture to learn the distribu-
tion of the graphs in the target domain, conditioning on the input graphs and con-
textual information. Specifically, the inputs are the node and graph attributes, and
the model outputs are the generated graphs’ node and edge attributes after several
blocks. A skip-connection architecture is implemented across the different blocks to
handle the asynchronous properties of different blocks, ensuring the final translated
results fully utilize various combinations of blocks’ information. The following loss
function is minimized in the work:
$$\mathcal{L}_T = \mathcal{L}(T(\mathcal{G}(E_0, F_0), C), \mathcal{G}(E', F')), \tag{12.28}$$

where S corresponds to the number of blocks and θ refers to the overall parameters
in the spectral graph regularization. G(E_S, F_S) is the generated target graph, where
E_S is the generated edge attribute tensor and F_S is the node attribute matrix. The total loss function is then

$$\tilde{\mathcal{L}} = \mathcal{L}(T(\mathcal{G}(E_0, F_0), C), \mathcal{G}(E', F')) + \beta R(\mathcal{G}(E, F)). \tag{12.30}$$
Because edge direction in an NLP graph often encodes critical information re-
garding semantic meanings, capturing bidirectional information in the text is helpful
and has been widely explored in works such as BiLSTM and BERT (Devlin et al,
2019). Some attention has also been devoted to extending the existing GNN models
to handle directed graphs. For example, separate model parameters can be intro-
duced for different edge directions (e.g., incoming/outgoing/self-loop edges) when
conducting neighborhood aggregation (Guo et al, 2019e; Marcheggiani et al, 2018;
Song et al, 2018). A BiLSTM-like strategy has also been proposed that learns the node
embeddings for each direction independently using two separate GNN encoders and
then concatenates the two embeddings of each node to obtain the final node embeddings (Xu et al, 2018b,c,d).
In the field of NLP, graphs are usually multi-relational, where the edge type in-
formation is vital for the prediction. Similar to the bidirectional graph encoder in-
troduced above, separate model parameters for different edge types are considered
when encoding edge type information with GNNs (Chen et al, 2018e; Ghosal et al,
2020; Schlichtkrull et al, 2018). However, usually the total number of edge types
is large, leading to non-negligible scalability issues for the above strategies. This
problem can be tackled by converting a multi-relational graph to a Levi graph (Levi,
1942), which is bipartite. To create a Levi graph, all the edges in the original graph
are treated as new nodes and new edges are added to connect the original nodes and
new nodes.
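The Levi-graph construction can be sketched in a few lines; the `rel_{idx}` node-naming scheme is a hypothetical convention chosen only for illustration:

```python
def to_levi_graph(edges):
    """Convert a multi-relational edge list [(u, rel, v), ...] into a Levi
    graph: every labeled edge becomes a new relation node linked to its two
    endpoints, so edge labels become node labels and the result is bipartite
    between original nodes and relation nodes."""
    nodes, levi_edges, labels = set(), [], {}
    for idx, (u, rel, v) in enumerate(edges):
        r = f"rel_{idx}"          # illustrative name for the new relation node
        labels[r] = rel
        nodes.update([u, v, r])
        levi_edges += [(u, r), (r, v)]
    return nodes, levi_edges, labels

nodes, levi_edges, labels = to_levi_graph(
    [("A", "likes", "B"), ("B", "knows", "C")])
```

A GNN over the Levi graph then needs only one set of parameters, since relation information now lives in node features rather than in edge types.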
Apart from NLP, graph-to-sequence transformation has been employed in other
fields, for example, to model the complex transitions of an individual user's activities among different healthcare subforums over time and to learn how these transitions relate to the user's various health conditions (Gao et al, 2019c). By formulating the transition of user activities as a dynamic graph with multi-attributed nodes, the health
stage inference is formalized as a dynamic graph-to-sequence learning problem and,
hence, a dynamic graph-to-sequence neural network architecture (DynGraph2Seq)
has been proposed (Gao et al, 2019c). This model contains a dynamic graph en-
coder and an interpretable sequence decoder. In the same work, a dynamic graph
hierarchical attention mechanism capable of capturing both time-level and
node-level attention is also proposed, providing model transparency throughout the
whole inference process.
Deep graph generation conditioning on semantic context aims to generate the target
graph GT conditioning on an input semantic context that is usually represented in
the form of additional meta-features. The semantic context can refer to the category,
label, modality, or any additional information that can be intuitively represented as
a vector C. The main issue here is to decide where to concatenate or embed the con-
dition representation into the generation process. As a summary, the conditioning
information can be added in terms of one or more of the following modules: (1)
the node state initialization module, (2) the message passing process for MPNN-
based decoding, and (3) the conditional distribution parameterization for sequential
generating.
A novel unified model of graph variational generative adversarial nets has been
proposed, where the conditioning semantic context is input into the node state ini-
tialization module (Yang et al, 2019a). Specifically, the generation process begins
by modeling the embedding Zi of each node with the separate latent distributions,
after which a conditional graph VAE (CGVAE) can be directly constructed by con-
catenating the condition vector C to each node’s latent representation Zi to obtain
the updated node latent representation Ẑ_i. The distribution of each individual
edge E_{i,j} is then assumed to be a Bernoulli distribution, parameterized by the
value Ê_{i,j} and calculated as Ê_{i,j} = Sigmoid(f(Ẑ_i)^⊤ f(Ẑ_j)), where f(·) is constructed
using a few fully connected layers. A conditional deep graph generative model that
adds the semantic context information into the initialized latent representations Zi
at the beginning of the decoding process has also been proposed (Li et al, 2018d).
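A sketch of the CGVAE-style conditional edge decoding described above, with a single tanh layer standing in for the few fully connected layers f(·); all shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def edge_probabilities(Z, C, f):
    """Append the condition vector C to every node latent Z_i to get Ẑ_i, then
    parameterize each Bernoulli edge variable as Ê_ij = Sigmoid(f(Ẑ_i)ᵀ f(Ẑ_j))."""
    Z_hat = np.concatenate([Z, np.tile(C, (Z.shape[0], 1))], axis=1)  # Ẑ_i = [Z_i; C]
    F = np.array([f(z) for z in Z_hat])
    return 1.0 / (1.0 + np.exp(-(F @ F.T)))       # pairwise edge probabilities

Z, C = rng.normal(size=(3, 4)), rng.normal(size=2)
W = rng.normal(size=(6, 5))                        # 6 = latent dim 4 + cond dim 2
P = edge_probabilities(Z, C, lambda z: np.tanh(z @ W))
```

Because the score is an inner product of the same mapping applied to both endpoints, the resulting edge-probability matrix is symmetric by construction.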
Other researchers have added the context information C into the message passing
module as part of its MPNN-based decoding process (Li et al, 2018f). Specifically,
the decoding process is parameterized as a Markov process and the graph is gen-
erated by iteratively refining and updating the initialized graph. At each step t, an
action is conducted based on the current nodes' hidden states H^t = {h_1^t, ..., h_N^t}. To
calculate h_i^t ∈ R^l (where l denotes the length of the representation) for node v_i in the intermediate graph G^t after each update of the graph, a message passing network
with node message propagation is utilized. Thus, the context information C ∈ R^k is
added to the operation of the MPNN layer as follows:

$$h_i^t = W h_i^{t-1} + \Phi \sum_{v_j \in N(v_i)} h_j^{t-1} + \Theta C, \tag{12.31}$$

where W ∈ R^{l×l}, Φ ∈ R^{l×l}, and Θ ∈ R^{l×k} are all learnable weight matrices and k
denotes the length of the semantic context vector.
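One conditioned message-passing step of Equation 12.31 can be sketched as follows; the weight shapes are as interpreted above, and the neighbor sum is read off an adjacency matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

def conditioned_mpnn_layer(A, H_prev, C, W, Phi, Theta):
    """Eq. 12.31: self term W h_i, neighbor sum Φ Σ_j h_j (neighbors taken
    from adjacency A), plus the injected context term Θ C."""
    return H_prev @ W.T + (A @ H_prev) @ Phi.T + C @ Theta.T

A = (rng.random((4, 4)) < 0.5).astype(float)   # 4-node toy graph
H = rng.normal(size=(4, 3))                    # l = 3
C = rng.normal(size=2)                         # k = 2
W, Phi = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
Theta = rng.normal(size=(3, 2))
H_next = conditioned_mpnn_layer(A, H, C, W, Phi, Theta)
```

With zero hidden states, only the context term Θ C remains, which makes clear that the condition is injected identically into every node at every step.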
Semantic context has also been considered as one of the inputs for calculating the
conditional distribution parameter at each step during the sequential generating pro-
cess (Jonas, 2019). The aim here is to solve the molecule inverse problem by infer-
ring the chemical structure conditioning on the formula and spectra of a molecule,
which provides a distinguishable fingerprint of its bond structure. The problem is
framed as an MDP, and molecules are constructed incrementally, one bond at a time,
by a deep neural network that learns to imitate a “subisomorphic oracle” which knows whether the generated bonds are correct. The context information
(e.g., spectra) is applied in two places. The process begins with an empty edge set
E_0 that is sequentially updated to E_k at each step k by adding an edge sampled
from p(e_{i,j} | E_{k-1}, V, C), where V denotes the node set defined by the given molecular formula. The edge set keeps being updated until the existing edges satisfy all the
valence constraints of the molecule. The resulting edge set E_K then serves as the candidate graph. For a given spectrum C, the process is repeated T times, generating
T (potentially different) candidate structures, {E_K^{(i)}}_{i=1}^T. Then, based on a spectral
prediction function f(·), the quality of these candidate structures is evaluated by
measuring how close their predicted spectra are to the condition spectrum C. Finally,
the optimal generated graph is selected according to arg min_i ∥f(E_K^{(i)}) − C∥_2.
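The final selection step can be sketched directly; the identity function stands in for the spectral prediction function f(·) here:

```python
import numpy as np

def best_candidate(candidates, f, C):
    """Select the candidate edge set whose predicted spectrum f(E) is closest
    to the observed spectrum C, i.e., argmin_i ||f(E_K^(i)) - C||_2."""
    return int(np.argmin([np.linalg.norm(f(E) - C) for E in candidates]))

cands = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
idx = best_candidate(cands, lambda E: E, np.array([0.9, 1.1]))
```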
12.6 Summary
In this chapter, we introduce the definitions and techniques for the transforma-
tion problem that involves graphs in the domain of deep graph neural networks.
We provide a formal definition of the general deep graph transformation problem as well as its four sub-problems, namely node-level transformation, edge-level
transformation, node-edge co-transformation, and other graph-involved transformations (e.g., sequence-to-graph transformation and context-to-graph transformation). For each sub-problem, its unique challenges and several representative
methods are introduced. As an emerging research domain, there are still many
open problems to be solved for future exploration, including but not limited to:
(1) Improved scalability. Existing deep graph transformation models typically have
super-linear time complexity in the number of nodes and cannot scale well to large
networks. Consequently, most existing works merely focus on small graphs, typi-
cally with dozens to thousands of nodes. It is difficult for them to handle many real-
world networks with millions to billions of nodes, such as the internet of things,
biological neuronal networks, and social networks. (2) Applications in NLP. As
more and more GNN-based works have advanced the development of NLP, graph
transformation is naturally a good fit for addressing some NLP tasks, such as in-
formation extraction and semantic parsing. For example, information extraction can
be formalized into a graph-to-graph problem where the input graph is the depen-
dency graph and the output graph is the information graph. (3) Explainable graph
transformation. When we learn the underlying distribution of the generated target
graphs, learning interpretable representations of graphs that expose semantic mean-
ing is very important. For example, it is highly beneficial if we could identify which
latent variable(s) control(s) which specific properties (e.g., molecule mass) of the
target graphs (e.g., molecules). Thus, investigations on the explainable graph trans-
formation process are critical yet unexplored.
Abstract The problem of graph matching that tries to establish some kind of struc-
tural correspondence between a pair of graph-structured objects is one of the key
challenges in a variety of real-world applications. In general, the graph matching
problem can be classified into two categories: i) the classic graph matching problem
which finds an optimal node-to-node correspondence between nodes of a pair of in-
put graphs and ii) the graph similarity problem which computes a similarity metric
between two graphs. While recent years have witnessed the great success of GNNs
in learning node representations of graphs, there is an increasing interest in explor-
ing GNNs for the graph matching problem in an end-to-end manner. This chapter
focuses on the state of the art of GNN-based graph matching models. We start
by introducing some background on the graph matching problem. Then, for each
category, we provide a formal definition and discuss state-of-the-art GNN-based
models, for the classic graph matching problem and the graph similarity problem,
respectively. Finally, the chapter concludes by pointing out some possible future
research directions.
Xiang Ling
College of Computer Science and Technology, Zhejiang University, e-mail:
[email protected]
Lingfei Wu
JD.COM Silicon Valley Research Center, e-mail: [email protected]
Chunming Wu
College of Computer Science and Technology, Zhejiang University, e-mail:
[email protected]
Shouling Ji
College of Computer Science and Technology, Zhejiang University, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 277
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_13
13.1 Introduction
bility as well as the issue of heavy reliance on expert knowledge, and thus remains
a challenging and significant research problem for many practitioners.
More recently, GNNs, which attempt to adapt deep learning from images to non-
Euclidean data (i.e., graphs), have received unprecedented attention for learning infor-
mative representations (e.g., of nodes or (sub)graphs) of graph-structured data in
an end-to-end manner (Kipf and Welling, 2017b; Wu et al, 2021d; Rong et al,
2020c). Since then, a surge of GNN models has been proposed for learning effective
node embeddings for downstream tasks, such as node classification (Hamil-
ton et al, 2017a; Veličković et al, 2018; Chen et al, 2020m), graph classification
(Ying et al, 2018c; Ma et al, 2019d; Gao and Ji, 2019), graph generation
(Simonovsky and Komodakis, 2018; Samanta et al, 2019; You et al, 2018b), and
so on. The great success of GNN-based models on these application tasks demon-
strates that GNNs are a powerful class of deep learning models for learning graph
representations for downstream tasks.
Encouraged by the great success of GNN-based models obtained from many
other graph-related tasks, many researchers have started to adopt GNNs for the
graph matching problem and a large number of GNN-based models have been pro-
posed to improve the matching accuracy and efficiency (Zanfir and Sminchisescu,
2018; Rolı́nek et al, 2020; Wang et al, 2019g; Jiang et al, 2019a; Fey et al, 2020; Yu
et al, 2020; Wang et al, 2020j; Bai et al, 2018, 2020b, 2019b; Xiu et al, 2020; Ling
et al, 2020; Zhang, 2020; Wang et al, 2020f; Li et al, 2019h; Wang et al, 2019i).
During the training stage, these models try to learn a mapping between the pair
of input graphs and the ground-truth correspondence in a supervised manner and
thus are more time-efficient during the inference stage than traditional approxima-
tion methods. In this chapter, we walk through the recent advances and develop-
ments of graph matching models based on GNNs. Particularly, we focus on how
to incorporate GNNs into the framework of graph matching/similarity learning and
try to provide a systematic introduction and review of state-of-the-art GNN-based
methods for both categories of the graph matching problem (i.e., the classic graph
matching problem in Section 13.2 and the graph similarity problem in Section 13.3,
respectively).
In this section, we start by introducing the first category of the graph matching
problem, i.e., the classic graph matching problem1 , and provide a formal definition
of the graph matching problem. Subsequently, we will focus our discussion on state-of-
the-art graph matching models based on deep learning as well as more advanced
GNNs in the literature.
1 For simplicity, we refer to the classic graph matching problem as the graph matching problem.
Definition 13.1 (Graph Matching Problem). Given a pair of input graphs G (1) =
(V (1) , E (1) , A(1) , X (1) , E (1) ) and G (2) = (V (2) , E (2) , A(2) , X (2) , E (2) ) of equal size n,
the graph matching problem is to find a node-to-node correspondence matrix S ∈
{0, 1}n×n (i.e., also called assignment matrix and permutation matrix) between the
two graphs G (1) and G (2) . Each element Si,a = 1 if and only if the node vi ∈ V (1) in
G (1) corresponds to the node va ∈ V (2) in G (2) .
2 For simplicity, we assume that the pair of input graphs in the graph matching problem have the
same number of nodes; the problem can be extended to a pair of graphs with different numbers of
nodes by adding dummy nodes, which is commonly adopted in the graph matching literature (Krish-
napuram et al, 2004).
13 Graph Neural Networks: Graph Matching 281
s* = argmax_s s⊤ K s
s.t. S 1_n = 1_n and S⊤ 1_n = 1_n        (13.1)
where s = vec(S) ∈ {0, 1}^{n²} is the column-wise vectorized version of the assignment
matrix S and 1_n is a column vector of length n whose elements are all equal to 1.
Particularly, K ∈ R^{n²×n²} is the corresponding second-order affinity matrix in which
each element K_{ij,ab} measures how well every pair of nodes (v_i, v_j) ∈ V(1) × V(1)
matches (v_a, v_b) ∈ V(2) × V(2), and can be defined as follows (Zhou and De la Torre,
2012).

K_{ind(i,j),ind(a,b)} = { c_{ia}    if i = j and a = b,
                          d_{ijab}  else if A(1)_{i,j} A(2)_{a,b} > 0,        (13.2)
                          0         otherwise. }

where ind(·, ·) is a bijection that maps a pair of nodes to an integer index, the
diagonal element c_{ia} encodes the node-to-node (i.e., first-order) affinity between
node v_i ∈ V(1) and node v_a ∈ V(2), and the off-diagonal element d_{ijab} encodes
the edge-to-edge (i.e., second-order) affinity between edge (v_i, v_j) ∈ E(1) and edge
(v_a, v_b) ∈ E(2).
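The piecewise definition of K can be materialized directly for small graphs. The sketch below is illustrative only: the bijection ind(i, a) = i·n + a and the toy affinity values are assumptions, not a prescription from the chapter.

```python
import numpy as np

def build_affinity_matrix(A1, A2, node_aff, edge_aff):
    """Second-order affinity matrix K (Eq. 13.2) for two graphs of size n.

    node_aff[i, a]      : first-order affinity c_ia between v_i and v_a
    edge_aff[i, j, a, b]: second-order affinity d_ijab between (v_i,v_j), (v_a,v_b)
    ind(i, a) = i * n + a is one possible bijection onto {0, ..., n^2 - 1}.
    """
    n = A1.shape[0]
    K = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            for a in range(n):
                for b in range(n):
                    if i == j and a == b:
                        K[i * n + a, j * n + b] = node_aff[i, a]   # c_ia
                    elif A1[i, j] * A2[a, b] > 0:
                        K[i * n + a, j * n + b] = edge_aff[i, j, a, b]  # d_ijab
    return K

n = 3
A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])  # 3-node path graph
A2 = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])  # another 3-node path graph
node_aff = np.eye(n)                               # toy c_ia values
edge_aff = np.ones((n, n, n, n))                   # toy d_ijab values
K = build_affinity_matrix(A1, A2, node_aff, edge_aff)
```

With symmetric adjacency and affinity inputs, the resulting K is symmetric, as the formulation expects.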
Another important aspect for the formulation in Equation (13.1) is the constraint,
i.e., S1n = 1n and S⊤ 1n = 1n . It demands that the matching output of the graph
matching problem, i.e., the correspondence matrix S ∈ {0, 1}n×n , should be strictly
constrained to be a doubly-stochastic matrix. Formally, the correspondence matrix S
is doubly stochastic if the sum of each column and each row of it equals 1, i.e.,
∀i, ∑_j S_{i,j} = 1 and ∀j, ∑_i S_{i,j} = 1.
In general, the main challenge in optimizing and solving Equation (13.1) lies in
how to model the affinity model as well as how to optimize with the constraint for
solutions. Traditional methods mostly utilize pre-defined affinity models with lim-
ited capacity (e.g., a Gaussian kernel with Euclidean distance (Cho et al, 2010)) and resort
to different heuristic optimizations (e.g., graduated assignment (Gold and Rangara-
jan, 1996), spectral method (Leordeanu and Hebert, 2005), random walk (Cho et al,
2010), etc.). However, such traditional methods suffer from poor scalability and
inferior performance in large-scale settings as well as a broad range of application sce-
narios (Yan et al, 2020a). Recently, studies on graph matching have started to
explore the high capacity of deep learning models, which achieve state-of-the-art
performance. In the following subsections, we will first give a brief introduction of
deep learning based graph matching models and then discuss state-of-the-art graph
matching models based on GNNs.
where ⌈X⌋ denotes a diagonal matrix whose diagonal elements are all X; ⊗ denotes
the Kronecker product; Gi and Hi (i = {1, 2}) are the node-edge incidence matrices
that are recovered from the adjacency matrices A(i) , i.e., A(i) = Gi Hi⊤ (i = {1, 2});
K p ∈ Rn×n encodes the node-to-node similarity and is directly obtained from the
product of two node feature matrices, i.e., K p = U (1)U (2)⊤ ; Ke ∈ R p×q encodes the
edge-to-edge similarity and is calculated by Ke = F(1) Λ F(2)⊤. It is worth noting
that Λ ∈ R2d×2d is a learnable parameter matrix and thus the built graph matching
affinity matrix K in Equation (13.4) is a learnable affinity model.
Then, with the spectral matching technique (Leordeanu and Hebert, 2005), the
graph matching problem is translated into computing the leading eigenvector s∗
which can be approximated by the power iteration algorithm as follows.
3 https://www.thecvf.com/?page_id=413
s_{k+1} = K s_k / ∥K s_k∥_2        (13.5)
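The power iteration of Eq. (13.5) is straightforward to implement; a minimal NumPy sketch (with an assumed fixed iteration budget rather than a convergence test) is:

```python
import numpy as np

def power_iteration(K, num_iters=100):
    """Approximate the leading eigenvector of the affinity matrix K (Eq. 13.5)."""
    s = np.ones(K.shape[0]) / np.sqrt(K.shape[0])  # uniform initialization
    for _ in range(num_iters):
        s = K @ s
        s = s / np.linalg.norm(s)                  # L2 normalization at each step
    return s

# Toy symmetric non-negative affinity matrix (e.g., n = 3, so K is 9 x 9).
rng = np.random.default_rng(0)
M = rng.random((9, 9))
K = (M + M.T) / 2
s_star = power_iteration(K)

# Cross-check against a direct eigendecomposition (ascending eigenvalues).
eigvals, eigvecs = np.linalg.eigh(K)
lead = eigvecs[:, -1]
```

For a non-negative symmetric K, the Perron–Frobenius theorem guarantees a non-negative leading eigenvector, so the iteration from a positive start converges to it (up to sign).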
where P(1) and P(2) are the coordinates of nodes in both images; the vector d_i mea-
sures the pixel offset; d_i^{gt} is the corresponding ground truth; and ε is a small value
for robust penalty.
Deep Graph Matching via Black-box Combinatorial Solver. Motivated by ad-
vances in incorporating a combinatorial optimization solver into a neural net-
work (Pogancic et al, 2020), Rolı́nek et al (2020) propose an end-to-end neural
network which seamlessly embeds a black-box combinatorial solver, namely BB-
GM, for the graph matching problem. To be specific, given two cost vectors (i.e.,
c_v ∈ R^{n²} and c_e ∈ R^{|E(1)||E(2)|}) for the node-to-node and edge-to-edge correspon-
dences, the graph matching problem is formulated as follows.

GM(c_v, c_e) = argmin_{(s_v, s_e) ∈ Adm(G(1), G(2))} ( c_v⊤ s_v + c_e⊤ s_e )

where GM denotes the black-box combinatorial solver; s_v ∈ {0, 1}^{n²} is the indicator
vector of matched nodes; s_e ∈ {0, 1}^{|E(1)||E(2)|} is the indicator vector of matched
edges; and Adm(G(1), G(2)) represents the set of all possible matching results between
G(1) and G(2).
More recently, GNNs have been explored for the graph matching
problem. This is because GNNs bring new opportunities for tasks over
graph-structured data and further improve model capability by taking structural informa-
tion of graphs into account. Besides, GNNs can be easily combined with other
deep learning architectures (e.g., CNNs, RNNs, MLPs, etc.) and thus provide an end-
to-end learning framework for the graph matching problem.
Cross-graph Affinity based Graph Matching. Wang et al (2019g) claim that it is
the first work that employs GNNs for deep graph matching learning (at least in com-
puter vision). By exploiting the highly efficient learning capabilities of GNNs that
can update the node embeddings with the structural affinity information between
two graphs, the graph matching problem, i.e., the quadratic assignment problem, is
translated into a linear assignment problem that can be easily solved.
In particular, the authors present the cross-graph affinity based graph match-
ing model with the permutation loss, namely PCA-GM. PCA-GM consists of three
steps. First, node embeddings of each individual graph are learned with a stan-
dard message-passing network (i.e., an intra-graph convolution network); PCA-GM
then further updates the node embeddings with an extra cross-graph convolution net-
work, i.e., CrossGConv, which not only aggregates the information from local neigh-
bors but also incorporates the information from similar nodes in the other graph.
Fig. 13.2 illustrates an intuitive comparison between the intra-graph convolution
network and the cross-graph convolution network formulated as follows.
H(1)(k) = CrossGConv(Ŝ, H(1)(k−1), H(2)(k−1))
H(2)(k) = CrossGConv(Ŝ⊤, H(2)(k−1), H(1)(k−1))        (13.8)
where H (1)(k) and H (2)(k) are the k-layer node embeddings for the graph G (1) and
G (2) ; k denotes the k-th iteration; Ŝ denotes the predicted assignment matrix which
is computed from shallower node embedding layers; and the initial embeddings,
i.e., H (1)(0) and H (2)(0) , are extracted via a pre-trained VGG-16 network in line
with Zanfir and Sminchisescu (2018).
Second, based on the resulting node embeddings H̃(1) and H̃(2) for both graphs,
PCA-GM computes the node-to-node assignment matrix S by a bi-linear mapping
followed by an exponential function as follows.
S̃ = exp( H̃(1) Θ H̃(2)⊤ / τ )        (13.9)
where Θ denotes the learnable parameter matrix for the assignment matrix learn-
ing and τ > 0 is a hyper-parameter. As the obtained S̃ ∈ R^{n×n} does not satisfy the
constraint of the doubly-stochastic matrix, PCA-GM uses the Sinkhorn (Adams and
Zemel, 2011) operation for the relaxed linear assignment problem because it is fully
differentiable and has been proven effective for the final graph matching prediction.
S = Sinkhorn(S̃)        (13.10)
Finally, PCA-GM adopts the combinatorial permutation loss that computes the
cross entropy loss between the final predicted permutation S and ground truth per-
mutation Sgt for supervised graph matching learning.
L_perm = − ∑_{v_i ∈ V(1), v_a ∈ V(2)} [ S^{gt}_{i,a} log(S_{i,a}) + (1 − S^{gt}_{i,a}) log(1 − S_{i,a}) ]        (13.11)
Experimental results in (Wang et al, 2019g) demonstrate that graph matching mod-
els with the permutation loss outperform those with the displacement loss in Equa-
tion (13.6).
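The pipeline of Eqs. (13.9)–(13.11) can be sketched together: exponentiated bilinear scores, Sinkhorn row/column normalization, and the binary cross-entropy permutation loss. The score matrix, temperature τ, and iteration count below are illustrative assumptions, not PCA-GM's actual settings.

```python
import numpy as np

def sinkhorn(S, num_iters=20):
    """Alternating row/column normalization to approximate a doubly-stochastic
    matrix (Eq. 13.10); the input S is assumed element-wise positive."""
    for _ in range(num_iters):
        S = S / S.sum(axis=1, keepdims=True)  # rows sum to 1
        S = S / S.sum(axis=0, keepdims=True)  # columns sum to 1
    return S

def permutation_loss(S, S_gt, eps=1e-9):
    """Cross-entropy between predicted and ground-truth permutations (Eq. 13.11)."""
    return -np.sum(S_gt * np.log(S + eps) + (1 - S_gt) * np.log(1 - S + eps))

tau = 0.1                                     # assumed temperature
scores = np.array([[2.0, 0.1, 0.2],           # stand-in for H1 @ Theta @ H2.T
                   [0.3, 1.8, 0.1],
                   [0.2, 0.4, 2.2]])
S_tilde = np.exp(scores / tau)                # Eq. (13.9)
S = sinkhorn(S_tilde)
S_gt = np.eye(3)
loss = permutation_loss(S, S_gt)
```

Because the toy scores already favor the diagonal, the normalized matrix assigns its largest mass there, matching the identity ground truth.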
Graph Learning–Matching Network. Most prior studies on the graph matching
problem rely on established graphs with fixed structure information, i.e., the edge set
with or without attributes. Differently, Jiang et al (2019a) present a graph learning-
matching network, namely GLMNet, which incorporates the graph structure learn-
ing (i.e., learning the graph structure information) into the general graph matching
learning to build a unified end-to-end model architecture. To be specific, based on
the pair of node feature matrices X^{(l)} = {x^{(l)}_1, · · · , x^{(l)}_n} (l = {1, 2}), GLMNet at-
tempts to learn a pair of optimal graph adjacency matrices A^{(l)} (l = {1, 2}) to better
serve the subsequent graph matching learning, where each element is computed as
follows.
A^{(l)}_{i,j} = φ(x^{(l)}_i, x^{(l)}_j; θ) = exp(σ(θ⊤ [x^{(l)}_i, x^{(l)}_j])) / ∑_{j=1}^{n} exp(σ(θ⊤ [x^{(l)}_i, x^{(l)}_j])),   l = {1, 2}        (13.12)
where σ is the activation function, e.g., ReLU; [·, ·] denotes the concatenation oper-
ation; and θ denotes the trainable parameter for the graph structure learning which
is shared for both input graphs.
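Eq. (13.12) amounts to a row-wise softmax over activated scores of concatenated feature pairs. A NumPy sketch follows; the randomly initialized θ and the choice of ReLU for σ are illustrative assumptions.

```python
import numpy as np

def learn_adjacency(X, theta):
    """Softmax-normalized learned adjacency (Eq. 13.12).

    X: (n, d) node feature matrix; theta: (2d,) parameter vector shared
    across both input graphs."""
    n = X.shape[0]
    logits = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            pair = np.concatenate([X[i], X[j]])         # [x_i, x_j]
            logits[i, j] = np.maximum(theta @ pair, 0)  # sigma = ReLU
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)             # row-wise softmax

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3))   # 4 nodes with 3-dim features
theta = rng.standard_normal(6)    # shared parameter, 2d = 6
A = learn_adjacency(X, theta)
```

Each row of the learned adjacency is a probability distribution over possible neighbors, which is what the softmax normalization in the denominator enforces.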
Deep Graph Matching with Consensus. In (Fey et al, 2020), Fey et al also
employ GNNs to learn the graph correspondence as in previous work, but addition-
ally introduce a neighborhood consensus (Rocco et al, 2018) to further refine the
learned correspondence matrix. First, they use common GNN models along with
the Sinkhorn operation to compute an initial correspondence matrix S0 as follows,
where Ψ_{θ1} denotes the shared GNN model for both graphs.
where I_n is the identity matrix and o^{(1)}_i − o^{(2)}_a is computed as the neighborhood
consensus between the node pair (v_i, v_a) ∈ V(1) × V(2) across the two graphs (e.g.,
o^{(1)}_i − o^{(2)}_a ≠ 0 indicates a false matching over the neighborhoods of v_i and v_a). Finally,
SK is obtained after K iterations and the final loss function incorporates both feature
matching loss and neighborhood consensus loss, i.e., L = L_init + L_refine.
L_init = − ∑_{v_i ∈ V(1)} log S^0_{i, π_gt(i)}
L_refine = − ∑_{v_i ∈ V(1)} log S^K_{i, π_gt(i)}        (13.16)
Z = Attention(Hungarian(S), S^{gt})
L_hung = − ∑_{v_i ∈ V(1), v_a ∈ V(2)} Z_{i,a} [ S^{gt}_{i,a} log(S_{i,a}) + (1 − S^{gt}_{i,a}) log(1 − S_{i,a}) ]        (13.17)
where Hungarian denotes a black-box Hungarian algorithm, and Z acts as a mask
that focuses more on mismatched node pairs and less on node pairs that are matched
exactly.
Graph Matching with Assignment Graph. Differently, Wang et al (2020j) refor-
mulate the graph matching problem as the problem of selecting reliable nodes in
the constructed assignment graph (Cho et al, 2010) in which each node represents a
potential node-to-node correspondence. The formal definition of assignment graph
is given in Definition 13.2 and one example is illustrated in Fig. 13.3.
Definition 13.2 (Assignment Graph). Given two graphs G (1) = (V (1) , E (1) , X (1) , E (1) )
and G (2) = (V (2) , E (2) , X (2) , E (2) ), an assignment graph G (A) = (V (A) , E (A) , X (A) , E (A) )
is constructed as follows. G(A) takes each candidate correspondence (v^{(1)}_i, v^{(2)}_a) ∈
V(1) × V(2) between the two graphs as a node v^{(A)}_{ia} ∈ V(A), and links an edge between
a pair of nodes v^{(A)}_{ia}, v^{(A)}_{jb} ∈ V(A) (i.e., (v^{(A)}_{ia}, v^{(A)}_{jb}) ∈ E(A)) if and only
if both edges, i.e., (v^{(1)}_i, v^{(1)}_j) ∈ E(1) and (v^{(2)}_a, v^{(2)}_b) ∈ E(2), exist in the
original graphs. Optionally, for
Fig. 13.3: Example illustration of building an assignment graph G (A) from the pair
of graphs G (1) and G (2) .
node attributes X (A) and edge attributes E (A) , each of them could be obtained by
concatenating attributes of the pair of nodes or edges in the original graph, respec-
tively.
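Definition 13.2 translates directly into code. The sketch below builds the assignment graph's node and edge sets for two small undirected graphs; attribute concatenation is omitted, and the graph representation (plain vertex/edge lists) is an assumption for illustration.

```python
import itertools

def build_assignment_graph(V1, E1, V2, E2):
    """Construct the assignment graph of Definition 13.2.

    Each node of G^(A) is a candidate correspondence (i, a); two assignment
    nodes (i, a) and (j, b) are linked iff (i, j) is an edge of G^(1) and
    (a, b) is an edge of G^(2)."""
    VA = [(i, a) for i in V1 for a in V2]
    E1s = {frozenset(e) for e in E1}
    E2s = {frozenset(e) for e in E2}
    EA = []
    for (i, a), (j, b) in itertools.combinations(VA, 2):
        if frozenset((i, j)) in E1s and frozenset((a, b)) in E2s:
            EA.append(((i, a), (j, b)))
    return VA, EA

# Two triangles: every distinct node pair is connected in both graphs.
V1, E1 = [0, 1, 2], [(0, 1), (1, 2), (0, 2)]
V2, E2 = ['a', 'b', 'c'], [('a', 'b'), ('b', 'c'), ('a', 'c')]
VA, EA = build_assignment_graph(V1, E1, V2, E2)
```

For two triangles there are 3 × 3 = 9 assignment nodes, and an edge for every pair of correspondences with distinct endpoints on both sides.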
With the constructed assignment graph G (A) , the reformulated problem of select-
ing reliable nodes in G(A) is quite similar to a binary node classification task (Kipf
and Welling, 2017b) that classifies nodes as positive or negative (i.e., matched
or unmatched). To solve the problem, the authors propose a fully learnable model
based on GNNs which takes the G (A) as input, iteratively learns node embeddings
over graph structural information and predicts a label for each node in G (A) as out-
put. Besides, the model is trained with a similar loss function to (Jiang et al, 2019a).
In this section, we will first introduce the second category of the general graph
matching problem – the graph similarity problem. Then, we will provide an ex-
tensive discussion and analysis of state-of-the-art graph similarity learning models
based on GNNs.
similar or not) (Ling et al, 2021). As GED is equivalent to the problem of MCS
under a fitness function (Bunke, 1997), in this section, we mainly consider the GED
computation and focus more on state-of-the-art graph similarity learning models
based on GNNs.
Basically, the graph similarity problem intends to compute a similarity score be-
tween a pair of graphs, which indicates how similar the pair of graphs is. In the
following Definition 13.3, the general graph similarity problem is defined.
Definition 13.3 (Graph Similarity Problem). Given two input graphs G (1) and
G (2) , the purpose of graph similarity problem is to produce a similarity score s
between G(1) and G(2). In line with the notations in Section 13.2.1, G(1) =
(V(1), E(1), A(1), X(1)) is represented as a set of n nodes v_i ∈ V(1) with a feature ma-
trix X(1) ∈ R^{n×d} and edges (v_i, v_j) ∈ E(1) forming an adjacency matrix A(1). Simi-
larly, G(2) = (V(2), E(2), A(2), X(2)) is represented as a set of m nodes v_a ∈ V(2) with a
feature matrix X(2) ∈ R^{m×d} and edges (v_a, v_b) ∈ E(2) forming an adjacency matrix
A(2).
For the similarity score s, if s ∈ R, the graph similarity problem can be considered
a graph-graph regression task. On the other hand, if s ∈ {−1, 1}, the problem can be
considered a graph-graph classification task.
Particularly, the computation of GED (Riesen, 2015; Bai et al, 2019b) (some-
times normalized to [0, 1]) is a typical case of the graph-graph regression task. To be
specific, GED is formulated as the cost of the shortest sequence of edit operations
over nodes or edges that must be undertaken to transform one graph into another,
where an edit operation can be an insertion or a deletion of a node or an
edge. In Fig. 13.4, we give an illustration of the GED computation.
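For intuition, GED on tiny unlabeled graphs can be computed by brute force: try every node permutation and count the edge insertions and deletions needed. This restricted edge-edit variant is only a sketch of the general GED (which also allows node edits and arbitrary edit costs) and is exponential in the number of nodes.

```python
import itertools

def simple_ged(n, E1, E2):
    """Exact edge edit distance between two graphs on n nodes: the minimum,
    over all node permutations, of the number of edge insertions plus
    deletions turning G1 into the permuted G2. Only usable for tiny graphs."""
    E1s = {frozenset(e) for e in E1}
    best = float('inf')
    for perm in itertools.permutations(range(n)):
        E2s = {frozenset((perm[u], perm[v])) for u, v in E2}
        best = min(best, len(E1s ^ E2s))  # size of the symmetric difference
    return best

triangle = [(0, 1), (1, 2), (0, 2)]
path = [(0, 1), (1, 2)]
ged = simple_ged(3, triangle, path)  # one edge deletion suffices
```

A triangle and a 3-node path differ by exactly one edge under every node mapping, so the distance is 1.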
Similar to the classic graph matching problem, the computation of GED is also
a well-studied NP-hard problem. Although there is a rich body of work (Hart et al,
1968; Zeng et al, 2009; Riesen et al, 2007) that attempts to find sub-optimal so-
lutions in polynomial time via a variety of heuristics (Riesen et al, 2007; Riesen,
2015), these heuristic methods still suffer from poor scalability (e.g., large search
space or excessive memory) and heavy reliance on expert knowledge (e.g., various
heuristics based on different application cases). Currently, learning-based models
which incorporate GNNs into an end-to-end learning framework for graph similar-
ity learning are gradually becoming more and more available, demonstrating the
GraphSim in (Bai et al, 2020b), which evaluates the model with additional datasets and similarity
metrics (i.e., both GED and MCS).
where σ is the activation function and [·] denotes the concatenation operation.
In addition, W^{[1:K]}, V and b are parameters in NTN to be learned and K is a
hyper-parameter which determines the length of the graph-level similarity vector
calculated by NTN. Finally, to compute the similarity score between the two graphs,
SimGNN concatenates the two similarity vectors from the node level and the graph
level and feeds the result into a small MLP network for prediction.
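The NTN parameters named above (W^{[1:K]}, V, b) suggest the standard neural tensor network form g(h1, h2) = σ(h1⊤ W^{[1:K]} h2 + V[h1; h2] + b). Since the chapter's equation is not reproduced here, the sketch below assumes that standard form, with σ = ReLU and random parameters purely for illustration.

```python
import numpy as np

def ntn_similarity(h1, h2, W, V, b):
    """Neural tensor network similarity vector between two graph-level
    embeddings h1 and h2 (standard NTN form, assumed).

    W: (K, d, d) tensor slices, V: (K, 2d), b: (K,) -> output of length K."""
    bilinear = np.einsum('d,kde,e->k', h1, W, h2)  # h1^T W^[k] h2 for each slice k
    linear = V @ np.concatenate([h1, h2])          # V [h1; h2]
    return np.maximum(bilinear + linear + b, 0)    # sigma = ReLU

rng = np.random.default_rng(2)
d, K = 4, 8
h1, h2 = rng.standard_normal(d), rng.standard_normal(d)
W = rng.standard_normal((K, d, d))
V = rng.standard_normal((K, 2 * d))
b = rng.standard_normal(K)
sim = ntn_similarity(h1, h2, W, V, b)
```

The hyper-parameter K sets the length of the graph-level similarity vector, matching the role described in the text.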
Graph Similarity Learning based on Hierarchical Clustering. In (Xiu et al,
2020), Xiu et al argue that if two graphs are similar, their corresponding compact
graphs should be similar to each other; conversely, if two graphs are dissim-
ilar, their corresponding compact graphs should also be dissimilar. They believe
that, for the input pair of graphs, different views in regard to different pairs of com-
pact graphs can provide different scales of similarity information between two input
graphs and thus benefit the graph similarity computation. To this end, a hierarchical
graph matching network (HGMN) (Xiu et al, 2020) is presented to learn the graph
similarity from a multi-scale view. Concretely, HGMN first employs multiple stages
of hierarchical graph clustering to successively generate more compact graphs with
initial node embeddings to provide a multi-scale view of differences between two
graphs for subsequent model learning. Then, with the pairs of compact graphs in
different stages, HGMN computes the final graph similarity score by adopting a
GraphSim-like model (Bai et al, 2020b), including node embeddings update via
GCNs, similarity matrices generation and prediction via CNNs. However, in order
to ensure permutation invariance of generated similarity matrices, HGMN devises a
different node-ordering scheme based on earth mover distance(EMD) (Rubner et al,
1998) rather than BFS node-order method in (Bai et al, 2020b). According to the
EMD distance, HGMN first aligns nodes for both input graphs in each stage and
then produces the corresponding similarity matrix in the aligned order.
Graph Similarity Learning with Node-Graph Interaction. To learn richer in-
teraction features between a pair of input graphs for computing the graph similar-
ity in an end-to-end fashion, Ling et al propose a multi-level graph matching net-
work (MGMN) (Ling et al, 2020) which consists of a siamese graph neural network
(SGNN) and a novel node-graph matching network (NGMN). To learn graph-level
interactions between two graphs, SGNN first utilizes a multi-layer of GCNs with the
siamese network to generate node embeddings H^{(l)} = {h^{(l)}_i}_{i=1}^{{n,m}} ∈ R^{{n,m}×d} for all
nodes in graph G^{(l)}, l = {1, 2}, and then aggregates a corresponding graph-level em-
bedding vector for each graph. On the other hand, to learn cross-level interaction
features between two graphs, NGMN further employs a node-graph matching layer
to update node embeddings with learned cross-level interactions between node em-
beddings of a graph and a corresponding graph-level embedding of the other whole
292 Xiang Ling, Lingfei Wu, Chunming Wu and Shouling Ji
graph. Taking a node v_i ∈ V(1) in G(1) as an example, NGMN first computes an at-
tentive graph-level embedding vector h^{i,att}_{G(2)} for G(2) by weighted-averaging all node
embeddings in G(2) based on the corresponding cross-graph attention coefficients
towards v_i as follows.

h^{i,att}_{G(2)} = ∑_{v_j ∈ V(2)} α_{i,j} h^{(2)}_j ,   where α_{i,j} = cosine(h^{(1)}_i, h^{(2)}_j)  ∀ v_j ∈ V(2)        (13.19)
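Eq. (13.19) can be sketched in a few lines of NumPy: cosine-similarity coefficients between one node of G(1) and every node of G(2), followed by a weighted sum of G(2)'s node embeddings. The random embeddings below are placeholders for the GCN outputs.

```python
import numpy as np

def attentive_graph_embedding(h_i, H2):
    """Attentive graph-level embedding of G^(2) towards node v_i (Eq. 13.19):
    a weighted sum of all node embeddings of G^(2), using cosine-similarity
    attention coefficients."""
    norms = np.linalg.norm(H2, axis=1) * np.linalg.norm(h_i)
    alpha = (H2 @ h_i) / np.maximum(norms, 1e-12)  # alpha_ij = cosine(h_i, h_j)
    return alpha @ H2                              # sum_j alpha_ij * h_j

rng = np.random.default_rng(3)
H1 = rng.standard_normal((5, 6))  # node embeddings of G^(1): n = 5, d = 6
H2 = rng.standard_normal((4, 6))  # node embeddings of G^(2): m = 4, d = 6
h_att = attentive_graph_embedding(H1[0], H2)
```

The result lives in the same d-dimensional space as the node embeddings, so it can be compared against h_i in the subsequent node-graph matching layer.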
where FC denotes the employed fully connected layers and (·)**2 denotes the
element-wise square of the input vector. In (Zhang, 2020), GB-DISTANCE con-
siders a scenario that takes as input a set of m graphs (i.e., {G^(i)}_{i=1}^{m}) and outputs the
similarity between any pair of graphs, i.e., a similarity matrix D = {D_{i,j}}_{i,j=1}^{m} =
{d(G^(i), G^(j))}_{i,j=1}^{m} ∈ R^{m×m}, and formulates the graph similarity problem in a su-
pervised or semi-supervised setting as follows.
min ∥M ⊙ (D − D̂)∥_p   with   M_{i,j} = { 1 if D_{i,j} is labeled;  α if D_{i,j} is unlabeled ∧ i ≠ j;  β if i = j }        (13.21)
s.t. D_{i,j} ≤ D_{i,k} + D_{k,j}, ∀ i, j, k ∈ {1, · · · , m}
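The weighting scheme of Eq. (13.21) can be sketched as follows; the entry-wise p-norm, the α and β values, and the labeled set are illustrative assumptions rather than GB-DISTANCE's actual choices.

```python
import numpy as np

def build_mask(labeled, m, alpha=0.1, beta=0.0):
    """Weight mask M of Eq. (13.21): 1 for labeled pairs, alpha for unlabeled
    off-diagonal pairs, beta on the diagonal (alpha, beta are hyper-parameters
    chosen here purely for illustration)."""
    M = np.full((m, m), alpha)
    for i, j in labeled:
        M[i, j] = 1.0
    np.fill_diagonal(M, beta)
    return M

def masked_objective(D, D_hat, M, p=2):
    """Entry-wise p-norm of the masked residual M * (D - D_hat)."""
    return np.sum(np.abs(M * (D - D_hat)) ** p) ** (1.0 / p)

m = 3
D = np.array([[0.0, 0.5, 0.9],      # ground-truth pairwise similarities
              [0.5, 0.0, 0.4],
              [0.9, 0.4, 0.0]])
D_hat = D + 0.1                      # a deliberately biased prediction
M = build_mask(labeled=[(0, 1), (1, 0)], m=m)
loss = masked_objective(D, D_hat, M)
```

Labeled pairs contribute with full weight while unlabeled pairs are down-weighted by α, which is how the semi-supervised setting enters the objective.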
5 The graph-graph classification task discussed here is different from the general graph classifi-
cation task (Ying et al, 2018c; Ma et al, 2019d), which predicts a label for a single input graph
rather than a pair of input graphs.
13.4 Summary
In this chapter, we have introduced the general graph matching learning, whereby
objective functions are formulated for establishing an optimal node-to-node corre-
spondence matrix between two graphs for the classic graph matching problem and
computing a similarity metric between two graphs for the graph similarity problem,
respectively. In particular, we have thoroughly analyzed and discussed state-of-the-
art GNN-based graph matching models and graph similarity models. In the future,
for better graph matching learning, there are some directions that we believe require
more effort:
• Fine-grained cross-graph features. For the graph matching problem, which
takes a pair of graphs as input, interaction features between the two graphs are
fundamental in both graph matching learning and graph similarity learning. Al-
though several existing methods (Li et al, 2019h; Ling et al, 2020) have been
devoted to learning interaction features between two graphs for better represen-
tation learning, these models incur non-negligible extra computational overhead.
Fine-grained cross-graph feature learning with efficient algorithms could set a
new state of the art.
• Semi-supervised and unsupervised learning. Because of the complexity of
graphs in real-world application scenarios, it is common to train the model in a
semi-supervised or even unsupervised setting. Mak-
ing full use of relationships between existing graphs and, if possible, the other
data that is not directly relevant to the graph matching problem could further
promote the development of graph matching/similarity learning in more practi-
cal applications.
• Vulnerability and robustness. Although adversarial attacks have been exten-
sively studied for image classification tasks (Goodfellow et al, 2015; Ling et al,
2019) and node/graph classification tasks (Zügner et al, 2018; Dai et al, 2018a),
there is currently only one preliminary work (Zhang et al, 2020f) that studies
adversarial attacks on the graph matching problem. Therefore, studying the vul-
nerability of the state-of-the-art graph matching/similarity models and further
building more robust models is a highly challenging problem.
Abstract Due to the excellent expressive power of Graph Neural Networks (GNNs)
on modeling graph-structured data, GNNs have achieved great success in various
applications such as Natural Language Processing, Computer Vision, recommender
systems, drug discovery and so on. However, the great success of GNNs relies on
the quality and availability of graph-structured data, which can be noisy or
unavailable. The problem of graph structure learning aims to discover useful graph
structures from data, which can help solve the above issue. This chapter attempts
to provide a comprehensive introduction to graph structure learning through the
lens of both traditional machine learning and GNNs. After reading this chapter,
readers will learn how this problem has been tackled from different perspectives,
for different purposes, via different techniques, as well as its great potential when
combined with GNNs. Readers will also learn promising future directions in this
research area.
14.1 Introduction
Recent years have seen a significantly increasing amount of interest in Graph Neu-
ral Networks (GNNs) (Kipf and Welling, 2017b; Bronstein et al, 2017; Gilmer
et al, 2017; Hamilton et al, 2017b; Li et al, 2016b) with a wide range of appli-
cations in Natural Language Processing (Bastings et al, 2017; Chen et al, 2020p),
Computer Vision (Norcliffe-Brown et al, 2018), recommender systems (Ying et al,
2018b), drug discovery (You et al, 2018a) and so on. GNNs' powerful ability to
learn expressive graph representations relies on the quality and availability of
graph-structured data. However, this poses some challenges for graph representation
Yu Chen
Facebook AI, e-mail: [email protected]
Lingfei Wu
JD.COM Silicon Valley Research Center, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 297
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_14
298 Yu Chen and Lingfei Wu
learning with GNNs. On the one hand, in some scenarios where the graph structure
is already available, most of the GNN-based approaches assume that the given graph
topology is perfect, which does not necessarily hold true because i) the real-world
graph topology is often noisy or incomplete due to the inevitably error-prone data
measurement or collection; and ii) the intrinsic graph topology might merely repre-
sent physical connections (e.g., the chemical bonds in a molecule), and fail to capture
abstract or implicit relationships among vertices which can be beneficial for certain
downstream prediction task. On the other hand, in many real-world applications
such as those in Natural Language Processing or Computer Vision, the graph rep-
resentation of the data (e.g., text graph for textual data or scene graph for images)
might be unavailable. Early practice of GNNs (Bastings et al, 2017; Xu et al, 2018d)
heavily relied on manual graph construction which requires extensive human effort
and domain expertise for obtaining a reasonably performant graph topology during
the data preprocessing stage.
In order to tackle the above challenges, graph structure learning aims to dis-
cover useful graph structures from data for better graph representation learning with
GNNs. Recent attempts (Chen et al, 2020m,o; Liu et al, 2021; Franceschi et al,
2019; Ma et al, 2019b; Elinas et al, 2020; Velickovic et al, 2020; Johnson et al,
2020) focus on joint learning of graph structures and representations without re-
sorting to human effort or domain expertise. Different sets of techniques have been
developed for learning discrete graph structures and weighted graph structures for
GNNs. More broadly speaking, graph structure learning has been widely studied in
the literature of traditional machine learning in both unsupervised learning and su-
pervised learning settings (Kalofolias, 2016; Kumar et al, 2019a; Berger et al, 2020;
Bojchevski et al, 2017; Zheng et al, 2018b; Yu et al, 2019a; Li et al, 2020a). Besides,
graph structure learning is also closely related to important problems such as graph
generation (You et al, 2018a; Shi et al, 2019a), graph adversarial defenses (Zhang
and Zitnik, 2020; Entezari et al, 2020; Jin et al, 2020a,e) and transformer mod-
els (Vaswani et al, 2017).
This chapter is organized as follows. We will first introduce how graph structure
learning has been studied in the literature of traditional machine learning, prior to
the recent surge of GNNs (section 14.2). We will introduce existing works on both
unsupervised graph structure learning (section 14.2.1) and supervised graph struc-
ture learning (section 14.2.2). Readers will later see how some of the introduced
techniques originally developed for traditional graph structure learning have been
revisited to improve graph structure learning for GNNs. Then we will move to
our main focus of this chapter which is graph structure learning for GNNs in sec-
tion 14.3. This part will cover various topics including joint graph structure and
representation learning for both unweighted and weighted graphs (section 14.3.1),
and the connections to other problems such as graph generation, graph adversarial
defenses and transformers (section 14.3.2). We will highlight some future directions
in section 14.4, including robust graph structure learning, scalable graph structure
learning, graph structure learning for heterogeneous graphs, and transferable graph
structure learning. We will summarize this chapter in section 14.5.
14 Graph Neural Networks: Graph Structure Learning 299
Graph structure learning has been widely studied from different perspectives in the
literature of traditional machine learning, prior to the recent surge of Graph Neural
Networks. Before we move to the recent achievements of graph structure learning
in the field of Graph Neural Networks, which is the main focus of this chapter,
in this section, we will first examine this challenging problem through the lens of
traditional machine learning.
The task of unsupervised graph structure learning aims to directly learn a graph
structure from a set of data points in an unsupervised manner. The learned graph
structure may be later consumed by subsequent machine learning methods for various
prediction tasks. The most important benefit of these approaches is that
they do not require labeled data such as ground-truth graph structures for super-
vision, which could be expensive to obtain. However, because the graph structure
learning process does not consider any particular downstream prediction task on the
data, the learned graph structure might be sub-optimal for the downstream task.
Graph structure learning has been extensively studied in the literature of Graph Sig-
nal Processing (GSP). It is often referred to as the graph learning problem in the lit-
erature whose goal is to learn the topological structure from smooth signals defined
on the graph in an unsupervised manner. These graph learning techniques (Jebara
et al, 2009; Lake and Tenenbaum, 2010; Kalofolias, 2016; Kumar et al, 2019a; Kang
et al, 2019; Kumar et al, 2020; Bai et al, 2020a) typically operate by solving an opti-
mization problem with certain prior constraints on the properties (e.g., smoothness,
sparsity) of graphs. Here, we introduce some representative prior constraints defined
on graphs which have been widely used for solving the graph learning problem.
Before introducing the specific graph learning techniques, we first provide the
formal definition of a graph and graph signals. Consider a graph G = {V , E } with
the vertex set V of cardinality n and edge set E . Its adjacency matrix A ∈ Rn×n
governs its topological structure, where Ai, j > 0 indicates that there is an edge connecting
vertices i and j with edge weight Ai, j . Given an adjacency matrix A, we can
further obtain the graph Laplacian matrix L = D − A, where D is the degree matrix,
a diagonal matrix (off-diagonal entries all zero) with Di,i = ∑ j Ai, j .
A graph signal is defined as a function that assigns a scalar value to each vertex
of a graph. We can further define multi-channel signals X ∈ Rn×d on a graph that
assigns a d dimensional vector to each vertex, and each column of the feature matrix
300 Yu Chen and Lingfei Wu
X can be considered as a graph signal. Let Xi ∈ Rd denote the graph signal defined
on the i-th vertex.
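To make these definitions concrete, here is a small numerical sketch (illustrative code, not from the chapter) that builds the Laplacian L = D − A of a toy weighted graph and evaluates tr(X⊤LX), the quantity used in this section as a smoothness measure of graph signals:

```python
import numpy as np

# Symmetric weighted adjacency matrix of a 4-vertex graph (A[i, j] > 0
# means vertices i and j are connected with edge weight A[i, j]).
A = np.array([[0., 1., 0., 2.],
              [1., 0., 3., 0.],
              [0., 3., 0., 1.],
              [2., 0., 1., 0.]])

# Diagonal degree matrix: D[i, i] = sum_j A[i, j]; off-diagonal entries zero.
D = np.diag(A.sum(axis=1))

# Graph Laplacian L = D - A; each row of L sums to zero by construction.
L = D - A

# A multi-channel graph signal X in R^{n x d}: each row is the d-dimensional
# vector attached to one vertex; each column is a (scalar) graph signal.
X = np.array([[1., 0.],
              [1., 1.],
              [0., 1.],
              [0., 0.]])

row_sums = L.sum(axis=1)            # all-zero vector
smoothness = np.trace(X.T @ L @ X)  # small when neighboring vertices have similar features
```

Since tr(X⊤LX) = ½ ∑_{i,j} A_{i,j}‖X_i − X_j‖², the value is small exactly when heavily weighted edges connect vertices with similar features.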
Fitness. Early works (Wang and Zhang, 2007; Daitch et al, 2009) on graph learning
utilized the neighborhood information of each data point for graph construction by
assuming that each data point can be optimally reconstructed using a linear com-
bination of its neighbors. Wang and Zhang (2007) proposed to learn a graph with
normalized degrees by minimizing the following objective,
where L is the Laplacian matrix and tr(·) denotes the trace of a matrix. Lake and
Tenenbaum (2010); Kalofolias (2016) proposed to learn a graph by minimizing
Ω (A, X) which forces neighboring vertices to have similar features, thus enforcing
graph signals to change smoothly on the learned graph. Notably, solely minimizing
the above smoothness loss can lead to the trivial solution A = 0.
Connectivity and Sparsity. In order to avoid the trivial solution caused by solely
minimizing the smoothness loss, Kalofolias (2016) imposed additional constraints
on the learned graph,
−α⃗1⊤ log(A⃗1) + β ||A||2F (14.4)
where the first term penalizes the formation of disconnected graphs via the logarith-
mic barrier, and the second term controls sparsity by penalizing large degrees due
to the first term. Note that ⃗1 denotes the all-ones vector. As a result, this improves
the overall connectivity of the graph, without compromising sparsity.
Similarly, Dong et al (2016) proposed to solve a related optimization problem that
learns the graph topology from smooth signals with additional regularization on the
learned adjacency matrix.
Graph structure learning has also been studied in the field of clustering analysis.
For example, in order to improve the robustness of spectral clustering methods for
noisy input data, Bojchevski et al (2017) assumed that the observed graph A can be
decomposed into the corrupted graph Ac and the good (i.e., clean) graph Ag , and it
is beneficial to only perform the spectral clustering on the clean graph. They hence
proposed to jointly perform the spectral clustering and the decomposition of the ob-
served graph, and adopted a highly efficient block coordinate-descent (alternating)
optimization scheme to approximate the objective function. Huang et al (2019b)
proposed a multi-view learning model which simultaneously conducts multi-view
clustering and learns similarity relationships between data points in kernel spaces.
The task of supervised graph structure learning aims to learn a graph structure from
data in a supervised manner. Such methods may or may not consider a particular downstream
prediction task during the model training phase.
Relational inference for interacting systems aims to study how objects in com-
plex systems interact. Early works considered a fixed or fully-connected interaction
graph (Battaglia et al, 2016; van Steenkiste et al, 2018) while modeling the interac-
tion dynamics among objects. Sukhbaatar et al (2016) proposed a neural model to
learn continuous communication among a dynamically changing set of agents where
the communication graph changes over time as agents move, enter and exit the envi-
ronment. Recent efforts (Kipf et al, 2018; Li et al, 2020a) have been made to simul-
taneously infer the latent interaction graph and model the interaction dynamics. Kipf
et al (2018) proposed a variational autoencoder (VAE) (Kingma and Welling, 2014)
based approach which learns to infer the interaction graph structure and model the
interaction dynamics among physical objects simultaneously from their observed
trajectories in an unsupervised manner. The discrete latent code of VAE represents
edge connections of the latent interaction graph, and both the encoder and decoder
take the form of a GNN to model the interaction dynamics among objects. Because
the latent distribution of VAE is discrete, the authors adopted a continuous relax-
ation in order to use the reparameterization trick (Kingma et al, 2014). While Kipf
et al (2018) focused on inferring a static interaction graph, Li et al (2020a) designed
a dynamic mechanism to evolve the latent interaction graph adaptively over time. A
Gated Recurrent Unit (GRU) (Cho et al, 2014a) was applied to capture the history
information and adjust the prior interaction graph.
In the related problem of learning directed acyclic graphs (DAGs), Yu et al (2019a)
proposed a VAE-based approach employing a variant of the structural acyclicity
constraint to learn the DAG. The VAE was parameterized by a
GNN that can naturally handle both discrete and vector-valued random variables.
Graph structure learning has recently been revisited in the field of GNNs so as to
handle the scenarios where the graph-structured data is noisy or unavailable. Recent
attempts in this line of research mainly focus on joint learning of graph structures
and representations without resorting to human effort or domain expertise. Fig. 14.1
shows an overview of graph structure learning for GNNs. Besides, we see several
important problems being actively studied (including graph generation, graph ad-
versarial defenses and transformer models) in recent years which are closely related
to graph structure learning for GNNs. We will discuss their connections and differ-
ences in this section.
Fig. 14.1: Overview of graph structure learning techniques for GNNs: learning discrete graph structures (e.g., via variational inference and Reinforcement Learning) and learning weighted graph structures (e.g., via graph regularization techniques enforcing smoothness, connectivity, and sparsity).
In recent practice of GNNs, joint graph structure and representation learning has
drawn a growing attention. This line of research aims to jointly optimize the graph
structure and GNN parameters toward the downstream prediction task in an end-
to-end manner, and can be roughly categorized into two groups: learning discrete
graph structures and learning weighted adjacency matrices. The first kind of ap-
proaches (Chen et al, 2018e; Ma et al, 2019b; Zhang et al, 2019d; Elinas et al,
2020; Pal et al, 2020; Stanic et al, 2021; Franceschi et al, 2019; Kazi et al, 2020)
operate by sampling a discrete graph structure (i.e., corresponding to a binary ad-
jacency matrix) from the learned probabilistic adjacency matrix, and then feeding
the graph to a subsequent GNN in order to obtain the task prediction. Because the
sampling operation breaks the differentiability of the whole learning system, tech-
niques such as variational inference (Hoffman et al, 2013) or Reinforcement Learn-
ing (Williams, 1992) are applied to optimize the learning system. Considering that
discrete graph structure learning often has the optimization difficulty introduced by
the non-differentiable sampling operation and it is hence difficult to learn weights on
edges, the other kind of approaches (Chen et al, 2020m; Li et al, 2018c; Chen et al,
2020o; Huang et al, 2020a; Liu et al, 2019b, 2021; Norcliffe-Brown et al, 2018)
focuses on learning the weighted (and usually sparse) adjacency matrix associated
to a weighted graph which will be later consumed by a subsequent GNN for the
prediction task. We will discuss these two types of approaches in great detail next.
Before discussing different techniques for joint graph structure and representation
learning, let’s first formulate the joint graph structure and representation learning
problem.
In order to deal with the issue of uncertainty on graphs, many of the existing works
on learning discrete graph structures regard the graph structure as a random variable
where a discrete graph structure can be sampled from certain probabilistic adja-
cency matrix. They usually leverage various techniques such as variational infer-
14 Graph Neural Networks: Graph Structure Learning 305
ence (Chen et al, 2018e; Ma et al, 2019b; Zhang et al, 2019d; Elinas et al, 2020; Pal
et al, 2020; Stanic et al, 2021), bilevel optimization (Franceschi et al, 2019), and Re-
inforcement Learning (Kazi et al, 2020) to jointly optimize the graph structure and
GNN parameters. Notably, they are often limited to the transductive learning setting
where the node features and graph structure are fully observed during both the train-
ing and inference stages. In this section, we introduce some representative works on
this topic and show how they approach the problem from different perspectives.
Franceschi et al (2019) proposed to jointly learn a discrete probability distribu-
tion on the edges of the graph and the parameters of GNNs by treating the task as a
bilevel optimization problem (Colson et al, 2007), formulated as,
where H N denotes the convex hull of the set of all adjacency matrices for N nodes,
and L(w, A) and F(wθ , A) are both task-specific loss functions measuring the differ-
ence between GNN predictions and ground-truth labels which are computed on a
training set and validation set, respectively. Each edge (i.e., node pair) of the graph
is independently modeled as a Bernoulli random variable, and an adjacency matrix
A ∼ Ber(⃗θ ) can thus be sampled from the graph structure distribution parameterized
by ⃗θ . The outer objective (i.e., the first objective) aims to find an optimal discrete
graph structure given a GCN and the inner objective (i.e., the second objective) aims
to find the optimal parameters wθ of a GCN given a graph. The authors approxi-
mately solved the above challenging bilevel problem with hypergradient descent.
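As a minimal sketch of the sampling step in such discrete approaches (our own illustrative code; the hypergradient machinery of Franceschi et al (2019) is omitted), one can draw a symmetric binary adjacency matrix from independent per-edge Bernoulli variables:

```python
import numpy as np

def sample_adjacency(theta, rng):
    """Sample a binary adjacency matrix A ~ Ber(theta).

    theta: (n, n) matrix of edge probabilities; only the strict upper
    triangle is used, so the sampled graph is undirected (A symmetric)
    and has no self-loops.
    """
    n = theta.shape[0]
    upper = rng.random((n, n)) < theta      # independent Bernoulli draws
    A = np.triu(upper, k=1).astype(float)   # keep strict upper triangle
    return A + A.T                          # symmetrize

rng = np.random.default_rng(0)
theta = np.full((4, 4), 0.5)                # each node pair an edge w.p. 0.5
A = sample_adjacency(theta, rng)
```

In the bilevel setting, many such draws are taken in expectation; the sampled A would then be fed to the GNN to evaluate the inner and outer objectives.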
Considering that real-world graphs are often noisy, Ma et al (2019b) viewed the
node features, graph structure and node labels as random variables, and modeled the
joint distribution of them with a flexible generative model for the graph-based semi-
supervised learning problem. Inspired by random graph models from the network
science field (Newman, 2010), they assumed that the graph is generated based on
node features and labels, and thus factored the joint distribution as follows:
p⃗θ (X,Y, G) = p(X)p⃗θ (Y |X)p⃗θ (G|X,Y ) (14.7)
where X, Y and G are random variables corresponding to the node features, labels
and graph structure, and ⃗θ are learnable model parameters. Note that the condi-
tional probabilities p⃗θ (G|X,Y ) and p⃗θ (Y |X) can be any flexible parametric families of
distributions as long as they are differentiable almost everywhere w.r.t. ⃗θ . In the
paper, p⃗θ (G|X,Y ) is instantiated with either latent space model (LSM) (Hoff et al,
2002) or stochastic block models (SBM) (Holland et al, 1983). During the inference
stage, in order to infer the missing node labels denoted as Ymiss , the authors leveraged
the recent advances in scalable variational inference (Kingma and Welling,
2014; Kingma et al, 2014) to approximate the posterior distribution p⃗θ (Ymiss |X,Yobs , G)
via a recognition model q⃗φ (Ymiss |X,Yobs , G) parameterized by ⃗φ , where Yobs denotes the
observed node labels. In the paper, q⃗φ (Ymiss |X,Yobs , G) is instantiated with a GNN. The
model parameters ⃗θ and ⃗φ are jointly optimized by maximizing the Evidence Lower
Bound (Bishop, 2006) of the observed data (Yobs , G) conditioned on X.
Elinas et al (2020) aimed to maximize the posterior over the binary adjacency
matrix given the observed data (i.e., node features X and observed node labels Y o ),
formulated as,
p(A|X,Y o ) ∝ p⃗θ (Y o |X,A)p(A) (14.8)
where p⃗θ (Y o |X,A) is a conditional likelihood which can be further factorized follow-
ing the conditional independence assumption,
where Cat(yi |⃗πi ) denotes a categorical distribution, and ⃗πi is the i-th row of a probability
matrix Π ∈ RN×C modeled by a GCN, namely, Π = GCN(X, A, ⃗θ ). As for the
prior distribution over the graph p(A), the authors considered the following form,
p(A) = ∏i, j p(Ai, j ),    p(Ai, j ) = Bern(Ai, j |ρi,o j )    (14.10)
where Bern(Ai, j |ρi,o j ) is a Bernoulli distribution over the adjacency matrix Ai, j with
parameter ρi,o j . In the paper, ρi,o j = ρ1 Ai, j + ρ2 (1 − Ai, j ) was constructed to encode
the degree of belief on the presence and absence of observed links with hyperparameters
0 < ρ1 , ρ2 < 1. Note that Ai, j here is the observed graph structure, which can
potentially be perturbed. If there is no input graph available, a KNN graph can be
employed. Given the above formulations, the authors developed a stochastic varia-
tional inference algorithm by leveraging the reparameterization trick (Kingma et al,
2014) and Concrete distributions techniques (Maddison et al, 2017; Jang et al, 2017)
to optimize the graph posterior p(A|X,Y o ) and the GCN parameters ⃗θ jointly.
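The Concrete (Gumbel-sigmoid) relaxation mentioned above can be sketched as follows (illustrative code with our own function name): each hard Bernoulli draw over an edge is replaced by a relaxed value in (0, 1) so that gradients can flow through the sampling step.

```python
import numpy as np

def binary_concrete_sample(logits, tau, rng):
    """Differentiable relaxation of Bernoulli edge sampling.

    Draws a relaxed (continuous, in (0, 1)) adjacency entry for each logit
    using the binary Concrete / Gumbel-sigmoid trick: Logistic noise is
    added to the logits, then squashed by a temperature-scaled sigmoid.
    As tau -> 0, samples approach hard {0, 1} values.
    """
    u = rng.uniform(1e-6, 1 - 1e-6, size=logits.shape)
    logistic_noise = np.log(u) - np.log1p(-u)
    return 1.0 / (1.0 + np.exp(-(logits + logistic_noise) / tau))

rng = np.random.default_rng(0)
logits = np.zeros((4, 4))          # edge probability 0.5 everywhere
A_relaxed = binary_concrete_sample(logits, tau=0.5, rng=rng)
```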
Kazi et al (2020) designed a probabilistic graph generator whose underlying
probability distribution is computed based on pair-wise node similarity, formulated
as,
pi, j = e−t||Xi −X j || (14.11)
where t is a temperature parameter, and Xi is the node embedding of node vi . Given
the above edge probability distribution, they adopted the Gumbel-Top-k trick (Kool
et al, 2019) to sample an unweighted KNN graph which would be fed into a GNN-based
prediction network. Since the sampling operation breaks the differentiability
of the model, the authors exploited Reinforcement Learning to reward
edges involved in a correct classification and penalize edges which led to misclassification.
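A sketch of this probabilistic graph generator (our own illustrative code, assuming the unsquared-distance form of Eq. (14.11)): Gumbel noise is added to the log-probabilities and the top-k entries per row are kept, which draws k neighbors per node without replacement in proportion to p.

```python
import numpy as np

def gumbel_topk_knn(X, k, t, rng):
    """Sample an unweighted kNN-style graph via the Gumbel-Top-k trick.

    Edge scores follow p_ij = exp(-t * ||X_i - X_j||); perturbing log p
    with Gumbel noise and keeping the top-k entries per row samples k
    neighbors for each node without replacement, proportional to p.
    """
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    log_p = -t * dist                      # log of p_ij = exp(-t * dist)
    np.fill_diagonal(log_p, -np.inf)       # exclude self-loops
    gumbel = -np.log(-np.log(rng.uniform(1e-12, 1.0, size=(n, n))))
    idx = np.argsort(log_p + gumbel, axis=1)[:, -k:]   # top-k per node
    A = np.zeros((n, n))
    np.put_along_axis(A, idx, 1.0, axis=1)
    return A

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))    # 6 node embeddings of dimension 3
A = gumbel_topk_knn(X, k=2, t=1.0, rng=rng)
```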
Unlike the kind of graph structure learning approaches focusing on learning a dis-
crete graph structure (i.e., binary adjacency matrix) for the GNN, there is a class of
approaches instead focusing on learning a weighted graph structure (i.e., weighted
adjacency matrix). In comparison with learning a discrete graph structure, learning
a weighted graph structure has several advantages. Firstly, optimizing a weighted
adjacency matrix is much more tractable than optimizing a binary adjacency matrix
because the former can be easily achieved by SGD techniques (Bottou, 1998) or
even convex optimization techniques (Boyd et al, 2004) while the latter often has to
resort to more challenging techniques such as variational inference (Hoffman et al,
2013), Reinforcement Learning (Williams, 1992) and combinatorial optimization
techniques (Korte et al, 2011) due to its non-differentiability. Secondly, a weighted
adjacency matrix is able to encode richer information on edges compared to a binary
adjacency matrix, which could benefit the subsequent graph representation learning.
For example, the widely used Graph Attention Network (GAT) (Veličković et al,
2018) essentially aims to learn edge weights for the input binary adjacency matrix
which benefit the subsequent message passing operations. In this subsection, we
will first introduce some common graph similarity metric learning techniques as
well as graph sparsification techniques widely used in existing works for learning
a sparse weighted graph by considering pair-wise node similarity in the embedding
space. Some representative graph regularization techniques will be later introduced
for controlling the quality of the learned graph structure. We will then discuss the
importance of combining both of the intrinsic graph structures and learned implicit
graph structures for better learning performance. Finally, we will cover some im-
portant learning paradigms for the joint learning of graph structures and graph rep-
resentations that have been successfully adopted by existing works.
The learned similarity metric function can later be applied to an unseen set of node
embeddings to infer a graph structure, thus enabling inductive graph structure learning.
For data deployed in non-Euclidean domains such as graph data, the Euclidean
distance is not necessarily the optimal metric for measuring node similarity. Com-
mon options for metric learning include cosine similarity (Nguyen and Bai, 2010),
radial basis function (RBF) kernel (Yeung and Chang, 2007) and attention mech-
anisms (Bahdanau et al, 2015; Vaswani et al, 2017). In general, according to the
types of raw information sources needed, we group the similarity metric learning
functions into two categories: Node Embedding Based Similarity Metric Learning
and Structure-aware Similarity Metric Learning. Next, we will introduce some rep-
resentative metric learning functions from both categories which have been success-
fully adopted in prior works on graph structure learning for GNNs.
where µi(t) is a learnable binary gating mask, ∨ denotes the logical disjunction of
the two operands (used to enforce symmetry), and W1 and W2 are d × d weight matrices.
Because the argmax operation makes the whole learning system non-differentiable,
the authors provided the ground-truth graph structures for supervision at each time
step.
Cosine-based Similarity Metric Functions Chen et al (2020m) proposed a multi-
head weighted cosine similarity function which aims at capturing pair-wise node
similarity from multiple perspectives, formulated as follows:

Si,p j = cos(⃗w p ⊙⃗vi , ⃗w p ⊙⃗v j ),    Si, j = (1/m) ∑ p Si,p j

where ⊙ denotes element-wise multiplication, and ⃗w p is a learnable weight vector associated to the p-th perspective, and has the
same dimension as the node embeddings. Intuitively, Si,p j computes the pair-wise
cosine similarity for the p-th perspective where each perspective considers one part
of the semantics captured in the embeddings. Moreover, as observed in (Vaswani
et al, 2017; Veličković et al, 2018), employing multi-head learners is able to stabilize
the learning process and increase the learning capacity.
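A plain-NumPy sketch of this multi-head weighted cosine metric (illustrative code; the function name is ours): each head reweights the embeddings element-wise with its own vector before computing pair-wise cosine similarity, and the heads are averaged.

```python
import numpy as np

def multihead_weighted_cosine(V, W):
    """Multi-head weighted cosine similarity.

    V: (n, d) node embeddings; W: (m, d) one learnable weight vector per
    head (perspective). Head p computes cosine similarity between
    w_p-reweighted embeddings; the final similarity averages the m heads.
    """
    m = W.shape[0]
    S = np.zeros((V.shape[0], V.shape[0]))
    for p in range(m):
        Vp = V * W[p]                                       # element-wise reweighting
        Vp = Vp / (np.linalg.norm(Vp, axis=1, keepdims=True) + 1e-12)
        S += Vp @ Vp.T                                      # pair-wise cosine
    return S / m

rng = np.random.default_rng(0)
V = rng.standard_normal((5, 8))
W = rng.standard_normal((4, 8))   # 4 perspectives / heads
S = multihead_weighted_cosine(V, W)
```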
Kernel-based Similarity Metric Functions Besides attention-based and cosine-
based similarity metric functions, researchers also explored to apply kernel-based
metric functions for graph structure learning. Li et al (2018c) applied a Gaussian
kernel to the distance between any pair of node embeddings, formulated as follows:
d(⃗vi ,⃗v j ) = √( (⃗vi −⃗v j )⊤ M(⃗vi −⃗v j ) )
S(⃗vi ,⃗v j ) = exp( −d(⃗vi ,⃗v j ) / (2σ 2 ) )    (14.20)
where σ is a scalar hyperparameter which determines the width of the Gaussian
kernel, and d(⃗vi ,⃗v j ) computes the Mahalanobis distance between the two node em-
beddings ⃗vi and ⃗v j . Notably, M is the covariance matrix of the node embeddings
distribution if we assume all the node embeddings of the graph are drawn from
the same distribution. If we set M = I, the Mahalanobis distance reduces to the
Euclidean distance. To make M a symmetric and positive semi-definite matrix, the
authors let M = WW ⊤ where W is a d × d learnable weight matrix. We can also re-
gard W as the transform basis to the space where we measure the Euclidean distance
between two vectors.
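This Mahalanobis-distance Gaussian kernel can be sketched as follows (illustrative code; M = WW⊤ guarantees that M is symmetric positive semi-definite, and setting W = I recovers the Euclidean distance):

```python
import numpy as np

def mahalanobis_gaussian_kernel(V, W, sigma):
    """Gaussian kernel over a learned Mahalanobis distance.

    V: (n, d) node embeddings; W: (d, d) learnable basis with M = W W^T,
    so d(v_i, v_j) = sqrt((v_i - v_j)^T M (v_i - v_j)) is a valid distance.
    """
    diff = V[:, None, :] - V[None, :, :]             # (n, n, d) pairwise diffs
    M = W @ W.T                                      # symmetric PSD by construction
    d2 = np.einsum('ijk,kl,ijl->ij', diff, M, diff)  # squared Mahalanobis distance
    d = np.sqrt(np.maximum(d2, 0.0))                 # clip tiny negatives
    return np.exp(-d / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
V = rng.standard_normal((5, 3))
W = rng.standard_normal((3, 3))   # learnable transform basis
S = mahalanobis_gaussian_kernel(V, W, sigma=1.0)
```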
Similarly, Henaff et al (2015) first computed the Euclidean distance between
any pair of node embeddings, and then applied a Gaussian Kernel or a self-tuning
diffusion kernel (Zelnik-Manor and Perona, 2004), formulated as follows:
d(⃗vi ,⃗v j ) = √( (⃗vi −⃗v j )⊤ (⃗vi −⃗v j ) )
S(⃗vi ,⃗v j ) = exp( −d(⃗vi ,⃗v j ) / σ 2 )
Slocal (⃗vi ,⃗v j ) = exp( −d(⃗vi ,⃗v j ) / (σi σ j ) )    (14.21)
where Slocal (⃗vi ,⃗v j ) defines a self-tuning diffusion kernel whose variance is locally
adapted around each node. Specifically, σi is computed as the distance d(⃗vi ,⃗vik )
corresponding to the k-th nearest neighbor ik of node i.
Si,l j = softmax(⃗u⊤ tanh(W [⃗hli ,⃗hlj ,⃗vi ,⃗v j ,⃗ei, j ])) (14.22)
where ⃗vi denotes the node attributes for node i, ⃗ei, j represents the edge attributes
between node i and j, ⃗hli is the vector representation of node i in the l-th GNN layer,
and ⃗u and W are a trainable weight vector and weight matrix, respectively.
Similarly, Liu et al (2021) proposed a structure-aware global attention mecha-
nism for learning pair-wise node similarity, formulated as follows,
where ⃗ei, j ∈ Rde is the embedding of the edge connecting node i and j, W Q ,W K ∈
Rd×dv , W R ∈ Rd×de are learnable weight matrices, and d, dv and de are the dimen-
sions of hidden vectors, node embeddings and edge embeddings, respectively.
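Since the chapter's exact formulation is not reproduced here, the following is only one plausible instantiation of such a structure-aware attention score (the scaled-dot-product form and all names are our own assumptions): the key for pair (i, j) combines a projection of node j's embedding with a projection of the edge embedding ⃗ei, j, followed by a row-wise softmax.

```python
import numpy as np

def structure_aware_attention(V, E, WQ, WK, WR):
    """Sketch of a structure-aware attention score (our own instantiation).

    V: (n, dv) node embeddings; E: (n, n, de) edge embeddings;
    WQ, WK: (d, dv) and WR: (d, de) learnable projection matrices.
    """
    d = WQ.shape[0]
    Q = V @ WQ.T                                    # (n, d) queries
    K = V @ WK.T                                    # (n, d) node keys
    R = np.einsum('ijk,dk->ijd', E, WR)             # (n, n, d) edge keys
    scores = np.einsum('id,ijd->ij', Q, K[None, :, :] + R) / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    S = np.exp(scores)
    return S / S.sum(axis=1, keepdims=True)         # row-wise softmax

rng = np.random.default_rng(0)
n, dv, de, d = 4, 5, 3, 6
V = rng.standard_normal((n, dv))
E = rng.standard_normal((n, n, de))
WQ, WK = rng.standard_normal((d, dv)), rng.standard_normal((d, dv))
WR = rng.standard_normal((d, de))
S = structure_aware_attention(V, E, WQ, WK, WR)
```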
Utilizing Intrinsic Edge Connectivity Information for Similarity Metric Learn-
ing In the case where only the edge connectivity information is available in the in-
trinsic graph, Jiang et al (2019b) proposed a masked attention mechanism for graph
structure learning, formulated as follows,
where Ai, j is the adjacency matrix of the intrinsic graph and ⃗u is a weight vec-
tor with the same dimension as node embeddings ⃗vi . This idea of using masked
attention to incorporate the initial graph topology shares the same spirit with the
GAT (Veličković et al, 2018) model.
The aforementioned similarity metric learning functions all return a weighted ad-
jacency matrix associated to a fully-connected graph. A fully-connected graph is
not only computationally expensive but also might introduce noise such as unimportant
edges. In real-world applications, most graph structures are much more
sparse. It is therefore beneficial to extract a sparse graph from the learned
fully-connected graph, for instance by applying a topk (i.e., KNN-style) operation
to the learned similarity matrix. Specifically, for each node, only the K nearest
neighbors (including itself) and the associated similarity scores are kept, and the
remaining similarity scores are masked off.
Klicpera et al (2019b); Chen et al (2020m) enforced a sparse adjacency matrix
by considering only the ε-neighborhood for each node, formulated as follows:
Ai, j = Si, j   if Si, j > ε,   and Ai, j = 0 otherwise    (14.26)
where those elements in S which are smaller than a non-negative threshold ε are all
masked off (i.e., set to zero).
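Both sparsification schemes can be sketched in a few lines (illustrative code; function names are ours):

```python
import numpy as np

def knn_sparsify(S, k):
    """For each node (row), keep only the k largest similarity scores
    (including the node itself); the remaining entries are masked to zero."""
    idx = np.argsort(S, axis=1)[:, -k:]
    A = np.zeros_like(S)
    np.put_along_axis(A, idx, np.take_along_axis(S, idx, axis=1), axis=1)
    return A

def epsilon_sparsify(S, eps):
    """Keep only the epsilon-neighborhood: entries of S at or below the
    non-negative threshold eps are set to zero."""
    return np.where(S > eps, S, 0.0)

S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
A_knn = knn_sparsify(S, k=2)
A_eps = epsilon_sparsify(S, eps=0.2)
```

Note that the kNN mask yields a fixed number of neighbors per node, whereas the ε-threshold yields a variable number; the resulting matrices are not necessarily symmetric and are often post-symmetrized in practice.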
As discussed earlier, many works in the field of Graph Signal Processing typically
learn the graph structure from data by directly optimizing the adjacency matrix to
minimize the constraints defined based on certain graph properties, without con-
sidering any downstream tasks. On the contrary, many works on graph structure
learning for GNNs aim to optimize a similarity metric learning function (for learn-
ing graph structures) toward the downstream prediction task. However, they do not
explicitly enforce the learned graph structure to have some common properties (e.g.,
smoothness) presented in real-world graphs.
Chen et al (2020m) proposed to optimize the graph structures by minimizing a
hybrid loss function combining both the task prediction loss and the graph regular-
ization loss. They explored three types of graph regularization losses which pose
constraints on the smoothness, connectivity and sparsity of the learned graph.
Smoothness The smoothness property assumes neighboring nodes to have similar
features.
Ω (A, X) = (1/(2n2 )) ∑i, j Ai, j ||Xi − X j ||2 = (1/n2 ) tr(X ⊤ LX)    (14.27)
where tr(·) denotes the trace of a matrix, L = D−A is the graph Laplacian, and Di,i =
∑ j Ai, j is the degree matrix. As can be seen, minimizing Ω (A, X) forces adjacent
nodes to have similar features, thus enforcing smoothness of the graph signals on
the graph associated with A. However, solely minimizing the smoothness loss will
result in the trivial solution A = 0. We might also want to pose other constraints to
the graph.
Connectivity The following equation penalizes the formation of disconnected
graphs via the logarithmic barrier.
−(1/n)⃗1⊤ log(A⃗1)    (14.28)
where n is the number of nodes.
Sparsity The following equation controls sparsity by penalizing large degrees.
(1/n2 ) ||A||2F    (14.29)
where || · ||F denotes the Frobenius norm of a matrix.
In practice, solely minimizing one type of graph regularization losses might not
be desirable. For instance, solely minimizing the smoothness loss will result in the
trivial solution A = 0. Therefore, it could be beneficial to balance the trade-off
among different types of desired graph properties by computing a linear combi-
nation of the various graph regularization losses, formulated as follows:
(α/n2 ) tr(X ⊤ LX) − (β /n)⃗1⊤ log(A⃗1) + (γ/n2 ) ||A||2F    (14.30)
where α, β and γ are all non-negative hyperparameters for controlling the smooth-
ness, connectivity and sparsity of the learned graph.
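A direct transcription of this combined regularizer (illustrative code; the Laplacian is built from A exactly as defined earlier in the chapter):

```python
import numpy as np

def graph_regularization_loss(A, X, alpha, beta, gamma):
    """Hybrid graph regularization: smoothness + connectivity (log-barrier)
    + sparsity (squared Frobenius norm) terms, weighted by alpha, beta, gamma."""
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A                    # graph Laplacian L = D - A
    smooth = alpha / n**2 * np.trace(X.T @ L @ X)     # smoothness of signals on A
    degrees = A @ np.ones(n)
    connect = -beta / n * np.sum(np.log(degrees))     # penalizes disconnected nodes
    sparse = gamma / n**2 * np.sum(A ** 2)            # penalizes large degrees
    return smooth + connect + sparse

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(5, 5))
A = (A + A.T) / 2                                     # symmetric weighted graph
np.fill_diagonal(A, 0.0)
X = rng.standard_normal((5, 3))
loss = graph_regularization_loss(A, X, alpha=1.0, beta=0.1, gamma=0.1)
```

In joint learning, this value would be added to the task prediction loss and minimized with respect to the entries of A (or the metric function producing them).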
Besides the above graph regularization techniques, other prior assumptions, such as
that neighboring nodes tend to share the same label (Yang et al, 2019c) and that the
learned implicit adjacency matrix should be close to the intrinsic adjacency matrix
(Jiang et al, 2019b), have been adopted in the literature.
Recall that one of the most important motivations for graph structure learning is
that the intrinsic graph structure (if it is available) might be error-prone (e.g., noisy
or incomplete) and sub-optimal for the downstream prediction task. However, the
intrinsic graph typically still carries rich and useful information regarding the opti-
mal graph structure for the downstream task. Hence, it could be harmful to totally
discard the intrinsic graph structure.
A few recent works (Li et al, 2018c; Chen et al, 2020m; Liu et al, 2021) proposed
to combine the learned implicit graph structure with the intrinsic graph structure for
better downstream prediction performance. The rationales are as follows. First of
all, they assume that the optimized graph structure is potentially a “shift” (e.g., substructures)
from the intrinsic graph structure, and the similarity metric function is intended
to learn such a “shift” which is supplementary to the intrinsic graph structure.
Secondly, incorporating the intrinsic graph structure can help accelerate the training
process and increase training stability: since there is no prior knowledge of the
similarity metric and the trainable parameters are randomly initialized, training may
otherwise take a long time to converge.
Different ways for combining intrinsic and implicit graph structures have been
proposed. For instance, Li et al (2018c); Chen et al (2020m) proposed to compute a
linear combination of the normalized graph Laplacian of the intrinsic graph structure
and the normalized adjacency matrix of the implicit graph structure, formulated as
follows:
Ã = λ L(0) + (1 − λ ) f (A)    (14.31)
where L(0) is the normalized graph Laplacian matrix, f (A) is the normalized adja-
cency matrix associated to the learned implicit graph structure, and λ is a hyperpa-
rameter controlling the trade-off between the intrinsic and implicit graph structures.
Note that f : Rn×n → Rn×n can be arbitrary normalization operations such as graph
Laplacian operation and row-normalization operation. Liu et al (2021) proposed a
hybrid message passing mechanism for GNNs which fuses the two aggregated node
vectors from the intrinsic graph and the learned implicit graph, respectively, and
then feed the fused vector to a GRU (Cho et al, 2014a) to update node embeddings.
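The linear combination above can be sketched with row-normalization as one concrete choice of f (illustrative code; names are ours):

```python
import numpy as np

def combine_graphs(L0, A_learned, lam, f):
    """Linear combination of the intrinsic graph's normalized Laplacian L0
    and a normalization f(.) of the learned implicit adjacency matrix."""
    return lam * L0 + (1.0 - lam) * f(A_learned)

def row_normalize(A):
    """One choice for f: row-normalize the adjacency matrix."""
    return A / (A.sum(axis=1, keepdims=True) + 1e-12)

# Intrinsic 2-node graph and its symmetrically normalized Laplacian.
A0 = np.array([[0., 1.], [1., 0.]])
D_inv_sqrt = np.diag(1.0 / np.sqrt(A0.sum(axis=1)))
L0 = np.eye(2) - D_inv_sqrt @ A0 @ D_inv_sqrt

A_learned = np.array([[0.2, 0.8], [0.4, 0.6]])   # learned implicit structure
A_tilde = combine_graphs(L0, A_learned, lam=0.7, f=row_normalize)
```

The hyperparameter lam (λ in the text) trades off trust in the intrinsic graph against the learned implicit one.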
Learning Paradigms
Most existing methods for graph structure learning for GNNs consist of two key
learning components: graph structure learning (i.e., similarity metric learning) and
graph representation learning (i.e., GNN module), and the ultimate goal is to learn
the optimized graph structures and representations with respect to a certain down-
stream prediction task. How to optimize the two separate learning components to-
ward the same ultimate goal thus becomes an important question.
the previous GNN layer. And the whole learning system is usually jointly optimized
in an end-to-end manner toward the downstream prediction task.
learning procedure dynamically stops when the learned graph structure comes close
enough to the optimized graph (with respect to the downstream task) according
to a certain stopping criterion (e.g., the difference between the learned adjacency
matrices at consecutive iterations is smaller than a certain threshold). At each
iteration, a hybrid loss combining both the task prediction loss and the graph
regularization loss is added to the overall loss. After all iterations, the overall loss
is back-propagated through all previous iterations to update model parameters.
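The iterative paradigm described above can be summarized as a loop. In this minimal sketch, the two callables stand in for the metric-learning and GNN components, and the Frobenius-norm test is one concrete instance of the "difference between consecutive adjacency matrices" stopping criterion; all names are illustrative.

```python
import numpy as np

def iterative_refinement(X, A0, learn_structure, learn_embeddings,
                         eps=1e-3, max_iters=20):
    A, Z = A0, X
    for _ in range(max_iters):
        Z = learn_embeddings(A, Z)           # GNN: embeddings from current graph
        A_new = learn_structure(Z)           # metric learning: graph from embeddings
        if np.linalg.norm(A_new - A) < eps:  # stop when the structure converges
            return A_new, Z
        A = A_new
    return A, Z
```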
This iterative learning paradigm for repeatedly refining the graph structure and
graph representations has a few advantages. On the one hand, even when the raw
node features do not contain adequate information for learning implicit relation-
ships among nodes, the node embeddings learned by the graph representation learn-
ing component could ideally provide useful information for learning a better graph
structure because these node embeddings are optimized toward the downstream
task. On the other hand, the newly learned graph structure could be a better graph
input for the graph representation learning component to learn better node embed-
dings.
Graph structure learning for GNNs has interesting connections to a few important
problems. Thinking about these connections might spur further research in those
areas.
Graph Generation

The task of graph generation focuses on generating realistic and meaningful graphs.
The early works of graph generation formalized the problem as a stochastic gen-
eration process, and proposed various random graph models for generating a pre-
selected family of graphs such as ER graphs (Erdős and Rényi, 1959), small-world
networks (Watts and Strogatz, 1998), and scale-free graphs (Albert and Barabási,
2002). However, these approaches typically make certain simplified and carefully-
designed a priori assumptions on graph properties, and thus in general have limited
modeling capacity on complex graph structures. Recent attempts focus on building
deep generative models for graphs by leveraging RNNs (You et al, 2018b), VAEs (Jin
et al, 2018a), GANs (Wang et al, 2018a), flow-based techniques (Shi et al, 2019a) and
other specially designed models (You et al, 2018a). GNNs are usually adopted
by these models as powerful graph encoders.
Even though the graph generation task and the graph structure learning task
both focus on learning graphs from data, they have essentially different goals and
methodologies. Firstly, the graph generation task aims to generate new graphs where
both nodes and edges are added together to construct a meaningful graph. However,
the graph structure learning task aims to learn a graph structure given a set of node
attributes. Secondly, generative models for graphs typically operate by learning the
distribution from the observed set of graphs, and generating more realistic graphs
by sampling from the learned graph distribution. But graph structure learning meth-
ods typically operate by learning the pair-wise relationships among the given set
of nodes, and based on that, building the graph topology. It will be an interesting
research direction to study how the two tasks can help each other.
Graph Adversarial Defenses

Recent studies (Dai et al, 2018a; Zügner et al, 2018) have shown that GNNs are
vulnerable to carefully-crafted perturbations (a.k.a. adversarial attacks), e.g., small
deliberate perturbations in graph structures and node/edge attributes. Researchers
working on building robust GNNs found graph structure learning a powerful tool
against topology attacks. Given an initial graph whose topology might become un-
reliable because of adversarial attacks, they leveraged graph structure learning tech-
niques to recover the intrinsic graph topology from the poisoned graph.
For instance, assuming that adversarial attacks are likely to violate some intrinsic
graph properties (e.g., low-rank and sparsity), Jin et al (2020e) proposed to jointly
learn the GNN model and the “clean” graph structure from the perturbed graph
by optimizing some hybrid loss combining both the task prediction loss and the
graph regularization loss. In order to restore the structure of the perturbed graph,
Zhang and Zitnik (2020) designed a message-passing scheme that can detect fake
edges, block them and then attend to true, unperturbed edges. In order to address
the noise brought by the task-irrelevant information on real-life large graphs, Zheng
et al (2020b) introduced a supervised graph sparsification technique to remove po-
tentially task-irrelevant edges from input graphs. Chen et al (2020d) proposed a
Label-Aware GCN (LAGCN) framework which can refine the graph structure (i.e.,
filtering distracting neighbors and adding valuable neighbors for each node) before
the training of GCN.
There are many connections between graph adversarial defenses and graph struc-
ture learning. On the one hand, graph structure learning is partially motivated by im-
proving potentially error-prone (e.g., noisy or incomplete) input graphs for GNNs,
which shares a similar spirit with graph adversarial defenses. On the other hand,
the task of graph adversarial defenses can benefit from graph structure learning tech-
niques as evidenced by some recent works.
However, there is a key difference between their problem settings. The graph
adversarial defenses task deals with the setting where the initial graph structure is
available, but potentially poisoned by adversarial attacks. The graph structure
learning task, in contrast, aims to handle both scenarios, where the input graph
structure is available or unavailable. Even when the input graph structure is available, one can
still improve it by “denoising” the graph structure or augmenting the graph structure
with an implicit graph structure which captures implicit relationships among nodes.
14 Graph Neural Networks: Graph Structure Learning 319
Transformer Models

Transformer models (Vaswani et al, 2017) have been widely used as a powerful
alternative to Recurrent Neural Networks, especially in the Natural Language Pro-
cessing field. Recent studies (Choi et al, 2020) have shown the close connection be-
tween transformer models and GNNs. By nature, transformer models aim to learn
a self-attention matrix between every pair of objects, which can be thought of as an
adjacency matrix associated with a fully-connected graph containing each object as
a node. Therefore, one can claim that transformer models also perform some sort
of joint graph structure and representation learning, even though these models typi-
cally do not consider any initial graph topology and do not control the quality of the
learned fully-connected graph. Recently, many variants of the so-called graph trans-
formers (Zhu et al, 2019b; Yao et al, 2020; Koncel-Kedziorski et al, 2019; Wang
et al, 2020k; Cai and Lam, 2020) have been developed to combine the benefits of
both GNNs and transformers.
In this section, we will introduce some advanced topics of graph structure learning
for GNNs and highlight some promising future directions.
Robust Graph Structure Learning

Although one of the major motivations for developing graph structure learning tech-
niques for GNNs is to handle noisy or incomplete input graphs, robustness is not
at the heart of most existing graph structure learning techniques. Most exist-
ing works did not evaluate the robustness of their approaches to noisy initial graphs.
Recent works showed that random edge addition or deletion attacks significantly
degrade the downstream task performance (Franceschi et al, 2019; Chen et al,
2020m). Moreover, most existing works admit that the initial graph structure (if
provided) might be noisy and thus unreliable for graph representation learning, but
they still assume that node features are reliable for graph structure learning, which
is often not true in real-world scenarios. Therefore, it is challenging yet rewarding to
explore robust graph structure learning techniques for data with noisy initial graph
structures and noisy node attributes.
Scalable Graph Structure Learning

Most existing graph structure learning techniques need to model the pair-wise re-
lationships among all the nodes in order to discover the hidden graph structure.
Therefore, their time complexity is at least O(n^2) where n is the number of graph
nodes. This can be very expensive and even intractable for large-scale graphs (e.g.,
social networks) in the real world. Recently, Chen et al (2020m) proposed a scalable
graph structure learning approach by leveraging the anchor-based approximation
technique to avoid explicitly computing the pair-wise node similarity, and achieved
linear complexity in both computational time and memory consumption with respect
to the number of graph nodes. In order to improve the scalability of transformer
models, different kinds of approximation techniques have also been developed in
recent works (Tsai et al, 2019; Katharopoulos et al, 2020; Choromanski et al, 2021;
Peng et al, 2021; Shen et al, 2021; Wang et al, 2020g). Considering the close connec-
tions between graph structure learning for GNNs and transformers, we believe there
are many opportunities in building scalable graph structure learning techniques for
GNNs.
Graph Structure Learning for Heterogeneous Graphs

Most existing graph structure learning works focus on learning homogeneous graph
structures from data. In comparison with homogeneous graphs, heterogeneous
graphs carry richer information on node types and edge types, and
occur frequently in real-world graph-related applications. Graph structure learning
for heterogeneous graphs is expected to be more challenging because more types
of information (e.g., node types, edge types) need to be learned from data.
Some recent attempts (Yun et al, 2019; Zhao et al, 2021) have been made to learn
graph structures from heterogeneous graphs.
14.5 Summary
In this chapter, we explored and discussed graph structure learning from multiple
perspectives. We first reviewed the existing works on graph structure learning in the
literature of traditional machine learning, including both unsupervised graph struc-
ture learning and supervised graph structure learning. As for unsupervised graph
structure learning, we mainly looked into some representative techniques devel-
oped from the Graph Signal Processing community. We also introduced some recent
works on clustering analysis that leveraged graph structure learning techniques. As
for supervised graph structure learning, we introduced how this problem was studied
in the research on modeling interacting systems and Bayesian Networks. The main
focus of this chapter is on introducing recent advances in graph structure learning
for GNNs. We motivated graph structure learning in the GNN field by discussing the
scenarios where the graph-structured data is noisy or unavailable. We then moved
on to introduce recent research progress in joint graph structure and representa-
tion learning, including learning discrete graph structures and learning weighted
graph structures. The connections and differences between graph structure learning
and other important problems such as graph generation, graph adversarial defenses
and transformer models were also discussed. We then highlighted several remain-
ing challenges and future directions in the research of graph structure learning for
GNNs.
15 Dynamic Graph Neural Networks

Seyed Mehran Kazemi

Abstract The world around us is composed of entities that interact and form re-
lations with each other. This makes graphs an essential data representation and a
crucial building-block for machine learning applications; the nodes of the graph
correspond to entities and the edges correspond to interactions and relations. The
entities and relations may evolve; e.g., new entities may appear, entity properties
may change, and new relations may be formed between two entities. This gives rise
to dynamic graphs. In applications where dynamic graphs arise, there often exists
important information within the evolution of the graph, and modeling and exploit-
ing such information is crucial in achieving high predictive performance. In this
chapter, we characterize various categories of dynamic graph modeling problems.
Then we describe some of the prominent extensions of graph neural networks to dy-
namic graphs that have been proposed in the literature. We conclude by reviewing
three notable applications of dynamic graph neural networks, namely skeleton-based
human activity recognition, traffic forecasting, and temporal knowledge graph com-
pletion.
15.1 Introduction
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 323
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_15
324 Seyed Mehran Kazemi
In many applications, there exist relationships between the entities that can be
exploited to make better predictions about them. As a few examples, social network
users that are close friends or family members are more likely to support the same
political party, two publications by the same author are more likely to have the same
topic, two images taken from the same website (or uploaded to social media by
the same user) are more likely to have similar objects in them, and two roads that
are connected are more likely to have similar traffic volumes. The data for these
applications can be represented in the form of a graph where nodes correspond to
entities and edges correspond to the relationships between these entities.
Graphs arise naturally in many real-world applications including recommender
systems, biology, social networks, ontologies, knowledge graphs, and computational
finance. In some domains the graph is static, i.e. the graph structure and the node fea-
tures are fixed over time. In other domains, the graph changes over time. In a social
network, for example, new edges are added when people make new friends, exist-
ing edges are removed when people stop being friends, and node features change
as people change their attributes, e.g., when they change their career assuming that
career is one of the node features. In this chapter, we focus on the domains where
the graph is dynamic and changes over time.
In applications where dynamic graphs arise, modeling the evolution of the graph
is often crucial in making accurate predictions. Over the years, several classes of
machine learning models have been developed that capture the structure and the
evolution of dynamic graphs. Among these classes, extensions of graph neural net-
works (GNNs) (Scarselli et al, 2008; Kipf and Welling, 2017b) to dynamic graphs
have recently found success in several domains and they have become one of the
essential tools in the machine learning toolbox. In this chapter, we review the GNN
approaches for dynamic graphs and provide several application domains where dy-
namic GNNs have provided striking results. The chapter is not meant to be a full
survey of the literature but rather a description of the common techniques for apply-
ing GNNs to dynamic graphs. For a comprehensive survey of representation learn-
ing approaches for dynamic graphs we refer the reader to (Kazemi et al, 2020), and
for a more specialized survey of GNN-based approaches to dynamic graphs we refer
the reader to (Skarding et al, 2020).
The rest of the chapter is organized as follows. In Section 15.2, we define the no-
tation that will be used throughout the chapter and provide the necessary background
to follow the rest of the chapter. In Section 15.3, we describe different types of dy-
namic graphs and different prediction problems on these graphs. In Section 15.4, we
review several approaches for applying GNNs on dynamic graphs. In Section 15.5,
we review some of the applications of dynamic GNNs. Finally, Section 15.6 sum-
marizes and concludes the chapter.
15 Dynamic Graph Neural Networks 325
In this section, we define our notation and provide the background required to follow
the rest of the chapter.
We use lowercase letters z to denote scalars, bold lowercase letters z to denote
vectors and uppercase letters Z to denote matrices. zi denotes the i-th element of z,
Zi denotes a column vector corresponding to the i-th row of Z, and Zi, j denotes the
element at the i-th row and j-th column of Z. z⊤ denotes the transpose of z and Z⊤ denotes
the transpose of Z. (z ∥ z′) ∈ Rd+d′ corresponds to the concatenation of z ∈ Rd and
z′ ∈ Rd′. We use I to represent an identity matrix. We use ⊙ to denote element-
wise (Hadamard) product. We represent a sequence as [e1 , e2 , . . . , ek ] and a set as
{e1 , e2 , . . . , ek } where the ei 's represent the elements in the sequence or set.
In this chapter, we mainly consider attributed graphs. We represent an attributed
graph as G = (V, A, X) where V = {v1 , v2 , . . . , vn } is the set of vertices (aka nodes),
n = |V | denotes the number of nodes, A ∈ Rn×n is an adjacency matrix, and X ∈
Rn×d is a feature matrix where Xi represents the features associated with the i-th node
vi and d denotes the number of features. If there exists no edge between vi and v j ,
then Ai, j = 0; otherwise, Ai, j ∈ R+ represents the weight of the edge where R+
represents positive real numbers.
If G is unweighted, then the range of A is {0, 1} (i.e. A ∈ {0, 1}n×n ). G is undi-
rected if the edges have no directions; it is directed if the edges have directions.
For an undirected graph, A is symmetric (i.e. A = A⊤). For each edge Ai, j > 0 of
a directed graph, we call vi the source and v j the target of the edge. If G is multi-
relational with a set R = {r1 , . . . , rm } of relations, then the graph has m adjacency
matrices where the i-th adjacency matrix represents the existence of the i-th relation ri
between the nodes.
In this chapter, we use the term Graph Neural Network (GNN) to refer to the general
class of neural networks that operate on graphs through message-passing between
the nodes. Here, we provide a brief description of GNNs.
Let G = (V, A, X) be a static attributed graph. A GNN is a function f : Rn×n ×
Rn×d → Rn×d′ that takes G (or more specifically A and X) as input and provides as
output a matrix Z ∈ Rn×d′ where Zi ∈ Rd′ corresponds to a hidden representation
for the i-th node vi . This hidden representation is called the node embedding. Provid-
ing a node embedding for each node vi can be viewed as dimensionality reduction
where the information from vi ’s initial features as well as the information from its
connectivity to other nodes and the features of these nodes are captured in a vector
Zi . This vector can be used to make informed predictions about vi . In what follows,
we describe two example GNNs, namely graph convolutional networks and graph
attention networks for undirected graphs.
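To make the abstract map f : (A, X) → Z concrete, here is a toy single layer of message passing: mean aggregation over neighbors (with self-loops) followed by a linear transform and ReLU. This is a generic sketch, not the specific GCN or GAT formulations the chapter describes.

```python
import numpy as np

def gnn_layer(A, X, W):
    """One round of mean-aggregation message passing: Z = ReLU(A_hat @ X @ W)."""
    A_hat = A + np.eye(A.shape[0])                    # add self-loops
    A_hat = A_hat / A_hat.sum(axis=1, keepdims=True)  # mean over the neighborhood
    return np.maximum(A_hat @ X @ W, 0.0)             # aggregate, transform, ReLU
```

Stacking several such layers lets each embedding Zi absorb information from nodes several hops away.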
ent weights θ (l,1) , . . . , θ (l,β ) and W (l,1) , . . . , W (l,β ) and then replaces equation 15.2
with:
Z (l) = σ (Â(l,1) Z (l−1) W (l,1) || . . . || Â(l,β ) Z (l−1) W (l,β ) ) (15.5)
where β is the number of heads. Each head may learn to aggregate the neighbors
differently and extract different information.
Over the years, several models have been proposed that operate on sequences. In
this chapter, we are mainly interested in neural sequence models that take as input a
sequence [x(1) , x(2) , . . . , x(τ) ] of observations where x(t) ∈ Rd for all t ∈ {1, . . . , τ},
and produce as output hidden representations [h(1) , h(2) , . . . , h(τ) ] where h(t) ∈ Rd′
for all t ∈ {1, . . . , τ}. Here, τ represents the length of the sequence or the timestamp
for the last element in the sequence. Each hidden representation h(t) is a sequence
embedding capturing information from the first t observations. Providing a sequence
embedding for a given sequence can be viewed as dimensionality reduction where
the information from the first t observations in the sequence is captured in a single
vector h(t) which can be used to make informed predictions about the sequence. In
what follows, we describe recurrent neural networks, Transformers, and convolu-
tional neural networks for sequence modeling.
Recurrent Neural Networks: Recurrent neural networks (RNNs) (Elman, 1990)
and their variants have achieved impressive results on a range of sequence modeling
problems. The core principle of the RNN is that its output is a function of the current
data point as well as a representation of the previous inputs. Vanilla RNNs consume
the input sequence one by one and provide embeddings using the following equa-
tion (applied sequentially for t in [1, . . . , τ]):

h(t) = σ(W(hi) x(t) + W(hh) h(t−1) + b)    (15.6)
where W (.) s and b are the model parameters, h(t) is the hidden state corresponding
to the embedding of the first t observations, and x(t) is the t-th observation. One may
initialize h(0) = 0, where 0 is a vector of 0s, or let h(0) be learned during training.
Training vanilla RNNs is typically difficult due to gradient vanishing and exploding.
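The vanilla recurrence above can be sketched in a few lines, assuming a Tanh nonlinearity; the parameter names (W_in, W_hid) are illustrative.

```python
import numpy as np

def rnn(xs, W_in, W_hid, b, h0=None):
    """Consume the sequence one step at a time; return one embedding per prefix."""
    h = np.zeros(W_hid.shape[0]) if h0 is None else h0
    hs = []
    for x in xs:
        h = np.tanh(W_in @ x + W_hid @ h + b)  # current input + previous state
        hs.append(h)
    return hs
```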
Long short-term memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997) (and
gated recurrent units (GRUs) (Cho et al, 2014a)) alleviate the training problems of
vanilla RNNs through gating mechanisms and additive operations. An LSTM model
consumes the input sequence one by one and provides embeddings using the fol-
lowing equations:
Fig. 15.1: An LSTM model taking as input a sequence x(1) , x(2) , . . . , x(τ) and pro-
ducing hidden representations h(1) , h(2) , . . . , h(τ) as output. Equations 15.7-15.11
describe the operations in LSTM Cells.
i(t) = σ(W(ii) x(t) + W(ih) h(t−1) + b(i))    (15.7)
f(t) = σ(W(fi) x(t) + W(fh) h(t−1) + b(f))    (15.8)
c(t) = f(t) ⊙ c(t−1) + i(t) ⊙ Tanh(W(ci) x(t) + W(ch) h(t−1) + b(c))    (15.9)
o(t) = σ(W(oi) x(t) + W(oh) h(t−1) + b(o))    (15.10)
h(t) = o(t) ⊙ Tanh(c(t))    (15.11)
Here i(t) , f (t) , and o(t) represent the input, forget and output gates respectively,
c(t) is the memory cell, h(t) is the hidden state corresponding to the embedding of
the sequence until the t-th observation, σ is an activation function (typically Sigmoid),
Tanh represents the hyperbolic tangent function, and W (..) s and b(.) s are weight
matrices and vectors. Similar to vanilla RNNs, one may initialize h(0) = c(0) = 0 or
let them be vectors with learnable parameters. Figure 15.1 shows an overview of an
LSTM model.
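Equations 15.7–15.11 translate almost line by line into code. In this sketch the weights live in a dictionary whose keys mirror the superscripts in the text; Sigmoid is assumed for the gate activation σ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, p):
    """One LSTM step; p maps weight names (e.g. 'W_ii', 'b_i') to arrays."""
    i = sigmoid(p["W_ii"] @ x + p["W_ih"] @ h_prev + p["b_i"])  # input gate (15.7)
    f = sigmoid(p["W_fi"] @ x + p["W_fh"] @ h_prev + p["b_f"])  # forget gate (15.8)
    c = f * c_prev + i * np.tanh(p["W_ci"] @ x + p["W_ch"] @ h_prev + p["b_c"])  # (15.9)
    o = sigmoid(p["W_oi"] @ x + p["W_oh"] @ h_prev + p["b_o"])  # output gate (15.10)
    h = o * np.tanh(c)                                          # hidden state (15.11)
    return h, c
```

The additive update of c(t) in Eq. 15.9 is what lets gradients flow over long sequences.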
A bidirectional RNN (BiRNN) (Schuster and Paliwal, 1997) is a combination of
two RNNs, one consuming the input sequence [x(1) , x(2) , . . . , x(τ) ] in the forward
direction and producing hidden representations [h→(1) , h→(2) , . . . , h→(τ) ] as output,
and the other consuming the input sequence backwards (i.e. [x(τ) , x(τ−1) , . . . , x(1) ])
and producing hidden representations [h←(τ) , h←(τ−1) , . . . , h←(1) ] as output. These two
hidden representations are then concatenated producing a single hidden representa-
tion h(t) = (h→(t) ∥ h←(t) ). Note that in RNNs, h(t) is computed only based on obser-
vations at or before t, whereas in BiRNNs, h(t) is computed based on observations
at, before, or after t. BiLSTMs (Graves et al, 2005) are a specific version of BiRNNs
where the RNN is an LSTM.
Transformers: Consuming the input sequence one by one makes RNNs not
amenable to parallelization. It also makes capturing long-range dependencies dif-
ficult. To solve these issues, the Transformer model (Vaswani et al, 2017) allows
Note that p(t) is constant and does not change during training.
Convolutional Neural Networks: Convolutional neural networks (CNNs) (Le Cun
et al, 1989) have revolutionized many computer vision applications. Originally,
CNNs were proposed for 2D signals such as images. They were later used for 1D
signals such as sequences and time-series. Here, we describe 1D CNNs. We start
with describing 1D convolutions. Let H ∈ Rn×d be a matrix and F ∈ Ru×d be a
convolution filter. Applying the filter F on H produces a vector h′ ∈ Rn−u+1 as
follows:
h′_i = ∑_{j=1}^{u} ∑_{k=1}^{d} H_{i+j−1,k} F_{j,k}    (15.14)
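Eq. 15.14 can be written out directly; the function name conv1d below is illustrative.

```python
import numpy as np

def conv1d(H, F):
    """Slide filter F (u x d) over H (n x d); returns a vector of length n - u + 1."""
    n, d = H.shape
    u = F.shape[0]
    # Each output entry is the sum of element-wise products of F with a window of H.
    return np.array([np.sum(H[i:i + u] * F) for i in range(n - u + 1)])
```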
1 For readers familiar with Transformers, in our description the values matrix corresponds to the
multiplication of the embedding matrix with the weight matrix W (l) in equation 15.2.
(Figure: a worked example of applying two 1D convolution filters, Filter 1 and Filter 2, to an input matrix, where each entry of the result is the sum of element-wise products of a filter with a window of the input.)
H (l−1) as described above and produce a matrix to which activation and (some-
times) pooling operations are applied to produce H (l) . The convolution filters are
the learnable parameters of the model. Hereafter, we use the term CNN to refer to
the general family of 1D convolutional neural networks.
A deep neural network model can typically be decomposed into an encoder and a de-
coder module. The encoder module takes the input and provides vector-representations
(or embeddings), and the decoder module takes the embeddings and provides pre-
dictions. The GNNs and sequence models described in Sections 15.2.1 and 15.2.2
correspond to the encoder modules of a full model; they provide node embeddings
Z and sequence embeddings H, respectively. The decoder is typically task-specific.
As an example, for a node classification task, the decoder can be a feed-forward neu-
ral network applied on a node embedding Zi provided by the encoder, followed by a
softmax function. Such a decoder provides as output a vector ŷ ∈ R|C| where C rep-
resents the classes, |C| represents the number of classes, and ŷ j shows the probabil-
ity of the node belonging to the j-th class. A similar decoder can be used for sequence
classification. As another example, for a link prediction problem, the decoder can
take as input the embeddings for two nodes, take the sigmoid of a dot-product of the
two node embeddings, and use the produced number as the probability of an edge
existing between the two nodes.
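The link-prediction decoder described above is a one-liner: the probability of an edge is the sigmoid of the dot product of the two node embeddings.

```python
import numpy as np

def edge_probability(z_i, z_j):
    """Sigmoid of the dot product of two node embeddings."""
    return 1.0 / (1.0 + np.exp(-np.dot(z_i, z_j)))
```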
The parameters of a model are learned through optimization by minimizing a
task-specific loss function. For a classification task, for instance, we typically as-
sume having access to a set of ground-truth labels Y where Yi, j = 1 if the i-th example
belongs to the j-th class and Yi, j = 0 otherwise. We learn the parameters of the model
by minimizing (e.g., using stochastic gradient descent) the cross entropy loss de-
fined as follows:

L = − (1/|Y|) ∑_i ∑_j Yi, j log(Ŷi, j)    (15.15)

where |Y| denotes the number of rows in Y, corresponding to the number of
labeled examples, and Ŷi, j is the probability of the i-th example belonging to the j-th
class according to the model. For other tasks, one may use other appropriate loss
functions.
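Eq. 15.15 in code, assuming Y is a one-hot label matrix and Y_hat a matrix of predicted class probabilities; a small eps guards against log(0).

```python
import numpy as np

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Cross entropy averaged over the |Y| labeled examples (Eq. 15.15)."""
    n = Y.shape[0]  # |Y|: number of labeled examples
    return -np.sum(Y * np.log(Y_hat + eps)) / n
```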
Different applications give rise to different types of dynamic graphs and different
prediction problems. Before commencing the model development, it is crucial to
identify the type of dynamic graph and its static and evolving parts, and have a clear
understanding of the prediction problem. In what follows, we describe some general
categories of dynamic graphs, their evolution types, and some common prediction
problems for them.
As pointed out in (Kazemi et al, 2020), dynamic graphs can be divided into discrete-
time and continuous-time categories. Here, we describe the two categories and point
out how discrete-time can be considered a specific case of continuous-time dynamic
graphs.
A discrete-time dynamic graph (DTDG) is a sequence [G(1) , G(2) , . . . , G(τ) ] of
graph snapshots where each G(t) = (V (t) , A(t) , X (t) ) has vertices V (t) , adjacency
matrix A(t) and feature matrix X (t) . DTDGs mainly appear in applications where
(sensory) data is captured at regularly-spaced intervals.
Example 15.1. Figure 15.3 shows three snapshots of an example DTDG. In the first
snapshot, there are three nodes. In the next snapshot, a new node v4 is added and a
connection is formed between this node and v2 . Furthermore, the features of v1 are
updated. In the third snapshot, a new edge has been added between v3 and v4 .
A special type of DTDG is the spatio-temporal graph, where a set of entities are
spatially (i.e. in terms of closeness in space) and temporally correlated and data is
captured at regularly-spaced intervals. An example of such a spatio-temporal graph
is traffic data in a city or a region where traffic statistics at each road are computed at
regularly-spaced intervals; the traffic at a particular road at time t is correlated with
Fig. 15.3: Three snapshots of an example DTDG. In the first snapshot, there are 3
nodes. In the second snapshot, a new node v4 is added and a connection is formed
between this node and v2 . Moreover, the features of v1 are updated. In the third
snapshot, a new edge has been added between v3 and v4 .
the traffic at the roads connected to it at time t (spatial correlation) as well as the
traffic at these roads and the ones connected to it at previous timestamps (temporal
correlation). In this example, the nodes in each G(t) may represent roads (or road
segments), the adjacency matrix A(t) may represent how the roads are connected,
and the feature matrix X (t) may represent the traffic statistics in each road at time t.
A continuous-time dynamic graph (CTDG) is a pair (G(t0 ) , O) where G(t0 ) =
(V (t0 ) , A(t0 ) , X (t0 ) ) is a static graph representing an initial state at time t0 and O is
Example 15.3. For the CTDG in Example 15.2, assume t0 = 01-05-2020 and we
only observe the state of the graph on the first day of each month (01-05-2020, 01-
06-2020 and 01-07-2020 for this example). In this case, the CTDG will reduce to
the DTDG snapshots in Figure 15.3.
For both DTDGs and CTDGs, various parts of the graph may change and evolve.
Here, we describe some of the main types of evolution. As a running example, we
use a dynamic graph corresponding to a social network where the nodes represent
users and the edges represent connections such as friendship.
Node addition/deletion: In our running example, new users may join the plat-
form resulting in new nodes being added to the graph, and some users may leave the
platform resulting in some nodes being removed from the graph.
Feature update: Users may have multiple features such as age, country of resi-
dence, occupation, etc. These features may change over time as users become older,
move to a new country, or change their occupation.
Edge addition/deletion: As time goes by, some users become friends resulting
in new edges and some people stop being friends resulting in some edges being
removed from the graph. As pointed out in (Trivedi et al, 2019), the observations
corresponding to events between two nodes may be categorized into association
and communication events. The former corresponds to events that lead to structural
changes in the graph and result in a long-lasting flow of information between the
nodes (e.g., the formation of new friendships in social networks). The latter cor-
responds to events that result in a temporary flow of information between nodes
(e.g., the exchange of messages in a social network). These two event categories
typically evolve at different rates and one may model them differently, especially in
applications where they are both present.
Edge weight updates: The adjacency matrix corresponding to the friendships
may be weighted where the weights represent the strength of the friendships (e.g.,
computed based on the duration of friendship or other features). In this case, the
strength of the friendships may change over time resulting in edge weight updates.
Relation updates: The edges between the users may be labeled where the label
indicates the type of the connection, e.g., friendship, engagement, and siblings. In
this case, the relation between two users may change over time (e.g., it may change
from friendship to engagement). One may see relation update as a special case of
edge evolution where one edge is deleted and another edge is added (e.g., the friend-
ship edge is removed and an engagement edge is added).
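The evolution types above can be made concrete by representing a CTDG as an initial snapshot plus a time-ordered list of observations. The sketch below is illustrative only; the class and event names are hypothetical, not taken from the chapter:

```python
from dataclasses import dataclass, field

@dataclass
class CTDG:
    """A continuous-time dynamic graph: an initial snapshot plus a
    time-ordered sequence of observations (event names are hypothetical)."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)      # set of (v_i, v_j) tuples
    features: dict = field(default_factory=dict)

    def apply(self, observation):
        kind, payload, t = observation           # (event type, event, timestamp)
        if kind == "AddNode":
            self.nodes.add(payload)
        elif kind == "DeleteNode":
            self.nodes.discard(payload)
            self.edges = {e for e in self.edges if payload not in e}
        elif kind == "AddEdge":
            self.edges.add(payload)
        elif kind == "DeleteEdge":
            self.edges.discard(payload)
        elif kind == "UpdateFeature":
            node, value = payload
            self.features[node] = value

g = CTDG(nodes={"v1", "v2"})
observations = [
    ("AddEdge", ("v1", "v2"), 1.0),              # association event
    ("AddNode", "v3", 2.0),                      # node addition
    ("AddEdge", ("v2", "v3"), 2.5),
    ("UpdateFeature", ("v1", {"age": 31}), 3.0), # feature update
]
for obs in observations:
    g.apply(obs)
```

Relation updates could be handled in the same style as two observations, one deleting the old edge and one adding the new one.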
334 Seyed Mehran Kazemi
We review four types of prediction problems for dynamic graphs: node classifica-
tion/regression, graph classification, link prediction, and time prediction. Some of
these problems can be studied under two settings: interpolation and extrapolation.
They can also be studied under a transductive or inductive prediction setting. In
what follows, we will describe each prediction problem. We let G be a (discrete-time
or continuous-time) dynamic graph containing information in a time interval [t0 , τ].
Node classification/regression: Let V (t) = {v1 , . . . , vn } represent the nodes in G at
time t. Node classification at time t is the problem of classifying a node vi ∈ V (t) into
a predefined set of classes C. Node regression at time t is the problem of predicting
a continuous feature for a node vi ∈ V (t). In the extrapolation setting, we make
predictions about a future state (i.e., t ≥ τ) and the predictions are made based on
the observations before or at t (e.g., forecasting the weather for the upcoming days).
In the interpolation setting, t0 ≤ t ≤ τ and the predictions are made based on all the
observations (e.g., filling in missing values).
Graph classification: Let {G1 , G2 , . . . , Gk } be a set of dynamic graphs. Graph clas-
sification is the problem of classifying each dynamic graph Gi into a predefined set of
classes C.
Link prediction: Link prediction is the problem of predicting new links between
the nodes of a dynamic graph. In the case of interpolation, the goal is to predict if
there was an edge between two nodes vi and v j at a timestamp t0 ≤ t ≤ τ (or a time
interval between t0 and τ), assuming that vi and v j are in G at time t. The interpolation
problem is also known as the completion problem and can be used to predict missing
links. In the case of extrapolation, the goal is to predict if there is going to be an
edge between two nodes vi and v j at a timestamp t > τ (or a time interval after τ),
assuming that vi and v j are in G at time τ.
Time prediction: Time prediction is the problem of predicting when an event
happened or when it will happen. In the case of interpolation (sometimes called
temporal scoping), the goal is to predict the time t0 ≤ t ≤ τ when an event occurred
(e.g., when two nodes vi and v j started or ended their connection). In the extrapola-
tion case (sometimes called time to event prediction), the goal is to predict the time
t > τ when an event will happen (e.g., when a connection will be formed between
vi and v j ).
Transductive vs. Inductive: The above problem definitions for node classifi-
cation/regression, link prediction, and time prediction correspond to a transductive
setting in which at the test time, predictions are to be made for entities already ob-
served during training. In the inductive setting, information about previously unseen
entities (or entirely new graphs) is provided at the test time and predictions are to
be made for these entities (see (Hamilton et al, 2017b; Xu et al, 2020a; Albooyeh
et al, 2020) for examples). The graph classification task is inductive by nature as it
requires making predictions for previously unseen graphs at the test time.
15 Dynamic Graph Neural Networks 335
A simple but sometimes effective approach for applying GNNs on dynamic graphs
is to first convert the dynamic graph into a static graph and then apply a GNN on the
resulting static graph. The main benefits of this approach include simplicity as well
as enabling the use of a wealth of GNN models and techniques for static graphs.
One disadvantage with this approach, however, is the potential loss of information.
In what follows, we describe two conversion approaches.
Temporal aggregation: We start with describing temporal aggregation for a par-
ticular type of dynamic graphs and then explain how it extends to more general
cases. Consider a DTDG [G(1) , G(2) , . . . , G(τ) ] where each G(t) = (V (t) , A(t) , X (t) )
such that V (1) = · · · = V (τ) = V and X (1) = · · · = X (τ) = X (i.e., the nodes and their
features are fixed over time and only the adjacency matrix evolves). Note that in this
case, the adjacency matrices all have the same shape. One way to convert this DTDG
into a static graph is through a weighted aggregation of the adjacency matrices as
follows:

A(agg) = ∑_{t=1}^{τ} φ(t, τ) A(t)        (15.16)

where φ(t, τ) determines the weight of the adjacency matrix at time t in the aggregation (e.g., giving higher weights to more recent snapshots).
Fig. 15.4: An example of converting a DTDG into a static graph through temporal
unrolling. Solid lines represent the edges in the graph at different timestamps and
dashed lines represent the added edges. In this example, each node is connected
to the node corresponding to the same entity only in the previous timestamp (i.e.
ω = 1).
In the case where node features also evolve, one may use a similar aggregation as
in equation 15.16 and compute X (agg) based on [X (1) , X (2) , . . . , X (τ) ]. In the case
where nodes are added and removed, one possible way of aggregation is as follows.
Let V (s) = {v | v ∈ V (1) ∪ · · · ∪ V (τ) } represent the set of all the nodes that existed
throughout time. We can expand every A(t) to a matrix in R^{|V(s)|×|V(s)|} where the
values for the rows and columns corresponding to any node v ∉ V (t) are all 0s. The
feature vectors can be expanded similarly. Then, equation 15.16 can be applied on
the expanded adjacency and feature matrices. A similar aggregation can be done for
a CTDG by first converting it into a DTDG (see Section 15.3.1) and then applying
equation 15.16.
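A minimal NumPy sketch of equation 15.16. The chapter leaves φ unspecified; an exponential decay that down-weights older snapshots is assumed here purely for illustration:

```python
import numpy as np

def temporal_aggregate(adjs, phi):
    """Aggregate a list [A(1), ..., A(tau)] of same-shape adjacency
    matrices into A(agg) = sum_t phi(t, tau) * A(t)  (equation 15.16)."""
    tau = len(adjs)
    return sum(phi(t, tau) * A for t, A in enumerate(adjs, start=1))

# Example phi: exponentially down-weight older snapshots (one common
# choice; any weighting function could be used instead).
decay = lambda t, tau: np.exp(-0.5 * (tau - t))

A1 = np.array([[0, 1], [1, 0]], dtype=float)   # edge present at t=1
A2 = np.array([[0, 0], [0, 0]], dtype=float)   # edge absent at t=2
A3 = np.array([[0, 1], [1, 0]], dtype=float)   # edge present at t=3
A_agg = temporal_aggregate([A1, A2, A3], decay)
```

The resulting weighted graph can then be fed to any static GNN.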
Example 15.4. Consider a DTDG with the three snapshots in Figure 15.3. We let
V (s) = {v1 , v2 , v3 , v4 }, add a row and a column of zeros to A(1) , and add a row of
zeros to X (1) . Then, we use equation 15.16 with some choice of φ to compute A(agg)
and X (agg) . Then we apply a GNN on the aggregated graph.
Example 15.5. Figure 15.4 provides an example of temporal unrolling for the DTDG
in Figure 15.3 with ω = 1. The graph has 11 nodes overall and so A(s) ∈ R^{11×11}. The
node features are set according to the ones in Figure 15.3, e.g., the feature values
for v1(2) are 0.1 and 2.
One natural way of developing models for DTDGs is by combining GNNs with
sequence models; the GNN captures the information within the node connections
and the sequence model captures the information within their evolution. A large
number of the works on dynamic graphs in the literature follow this approach – see,
e.g., (Seo et al, 2018; Manessi et al, 2020; Xu et al, 2019a). Here, we describe some
generic ways of combining GNNs with sequence models.
GNN-RNN: Let G be a DTDG with a sequence [G(1) , . . . , G(τ) ] of snapshots where
G(t) = (V (t) , A(t) , X (t) ) for each t ∈ {1, . . . , τ}. Suppose we want to obtain node
embeddings for the nodes at every timestamp.
where, similar to equations 15.7-15.11, I (t) , F (t) , and O (t) represent the input, for-
get, and output gates for the nodes respectively, C (t) is the memory cell, H (t) is the
hidden state corresponding to the node embeddings for the first t observations, and
the W (..) s and b(.) s are weight matrices and bias vectors. In the above formulae, when
we add a matrix Z (t) W (.i) + H (t−1) W (.h) to a bias vector b(.) , we assume the bias
vector b(.) is added to every row of the matrix. H (0) and C (0) can be initialized with
zeros or learned from the data. H (t) corresponds to the temporal node embeddings
at time t and can be used to make predictions about the nodes. We can summarize the
equations above into:
In a similar way, one can construct other variations of the GNN-RNN model such as
GCN-GRU, GAT-LSTM, GAT-RNN, etc. Figure 15.5 provides an overview of the
GCN-LSTM model.
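The GNN-RNN idea can be sketched in a few lines of NumPy. For brevity, the sketch below uses a one-layer GCN per snapshot followed by a vanilla RNN rather than the LSTM of equations 15.18-15.22; all parameter names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def gcn_layer(A, X, W):
    """One GCN layer: symmetrically-normalized propagation with ReLU."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

def gcn_rnn(As, Xs, W_gcn, W_ih, W_hh):
    """GNN-RNN: a shared-parameter GCN per snapshot, then a vanilla
    RNN over each node's sequence of structural embeddings."""
    n, h = As[0].shape[0], W_hh.shape[0]
    H = np.zeros((n, h))                         # H(0) initialized to zeros
    for A, X in zip(As, Xs):
        Z = gcn_layer(A, X, W_gcn)               # structural embedding Z(t)
        H = np.tanh(Z @ W_ih + H @ W_hh)         # temporal embedding H(t)
    return H

n, f, h = 4, 3, 5
As = [rng.integers(0, 2, (n, n)) * 1.0 for _ in range(3)]
As = [np.triu(A, 1) + np.triu(A, 1).T for A in As]   # make symmetric
Xs = [rng.normal(size=(n, f)) for _ in range(3)]
W_gcn = rng.normal(size=(f, h))
W_ih, W_hh = rng.normal(size=(h, h)), rng.normal(size=(h, h))
H_tau = gcn_rnn(As, Xs, W_gcn, W_ih, W_hh)       # node embeddings at t=tau
```

Swapping the RNN cell for an LSTM or GRU, or the GCN for GAT, yields the other variations named in the text.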
RNN-GNN: In cases where the graph structure is fixed through time (i.e. A(1) =
· · · = A(τ) = A) and only node features change, instead of first applying a GNN
model and then applying a sequence model to obtain temporal node embeddings,
one may apply the sequence model first to capture the temporal evolution of the
node features and then apply a GNN model to capture the correlations between the
nodes. We can create different variations of this generic model by using different
GNN and sequence models (e.g., LSTM-GCN, LSTM-GAT, GRU-GCN, etc.). The
formulation for an LSTM-GCN model is as follows:
with Z (t) containing the temporal node embeddings at time t. Note that RNN-GNN
is only appropriate if the adjacency matrix is fixed over time; otherwise, RNN-
GNN fails to capture the information within the evolution of the graph structure.
GNN-BiRNN and BiRNN-GNN: In the case of GNN-RNN and RNN-GNN,
the obtained node embeddings H (t) contain information about the observations at
Fig. 15.5: The GCN-LSTM model taking a sequence G(1) , G(2) , . . . , G(τ) as input
and producing hidden representations H (1) , H (2) , . . . , H (τ) as output. The opera-
tions in LSTM Cells are described in equations 15.18-15.22. The GCN modules
have shared parameters.
where H (0,t) = X (t) for t ∈ {1, . . . , τ}. The above two equations define what is
called a GCN-LSTM block. Other blocks can be constructed using similar combina-
tions.
Assuming the only observation types are edge additions, for this CTDG the nodes and their fea-
tures are fixed over time. Let Z (t−) represent the node embeddings right before time
t (initially, Z (t0 ) = X (t0 ) or Z (t0 ) = X (t0 ) W where W is a weight matrix with learn-
able parameters). Upon making an observation (AddEdge, (vi , v j ),t) corresponding
to a new directed edge between two nodes vi , v j ∈ V , the model developed in (Kumar
et al, 2019b) updates the embeddings for vi and v j as follows:
Zi(t) = RNNsource((Zj(t−) || ∆ti || f), Zi(t−))        (15.32)
Zj(t) = RNNtarget((Zi(t−) || ∆tj || f), Zj(t−))        (15.33)
where RNNsource and RNNtarget are two RNNs with different weights³, ∆ti and ∆tj
represent the time elapsed since vi's and vj's previous interactions respectively⁴, f
represents a vector of edge features (if any), || indicates concatenation, and Zi(t)
and Zj(t) represent the updated embeddings at time t. The first RNN takes as input
a new observation (Zj(t−) || ∆ti || f) and the previous hidden state of a node Zi(t−)
and provides an updated representation (similarly for the second RNN).
the second RNN). Besides learning a temporal embedding Z (t) as described above,
in (Kumar et al, 2019b) another embedding vector is also learned for each entity
that is fixed over time and captures the static features of the nodes. The two embed-
dings are then concatenated to produce the final embedding that is used for making
predictions.
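A sketch of the update in equations 15.32-15.33, with a vanilla RNN cell standing in for RNNsource and RNNtarget (the original work uses more elaborate recurrent units); all parameter names and sizes are illustrative:

```python
import numpy as np

def rnn_cell(x, h, W_x, W_h, b):
    """A vanilla RNN cell standing in for RNN_source / RNN_target."""
    return np.tanh(x @ W_x + h @ W_h + b)

def add_edge_update(Z, i, j, dt_i, dt_j, f, src, tgt):
    """Mutual update of source/target embeddings on an AddEdge event,
    following equations 15.32-15.33 (parameter names hypothetical)."""
    zi_prev, zj_prev = Z[i].copy(), Z[j].copy()   # Z(t-), before the event
    x_i = np.concatenate([zj_prev, [dt_i], f])    # (Z_j || dt_i || f)
    x_j = np.concatenate([zi_prev, [dt_j], f])    # (Z_i || dt_j || f)
    Z[i] = rnn_cell(x_i, zi_prev, *src)           # RNN_source
    Z[j] = rnn_cell(x_j, zj_prev, *tgt)           # RNN_target
    return Z

rng = np.random.default_rng(1)
d, fdim = 4, 2
Z = rng.normal(size=(3, d))                       # embeddings for v0, v1, v2
make = lambda: (rng.normal(size=(d + 1 + fdim, d)),
                rng.normal(size=(d, d)), np.zeros(d))
Z = add_edge_update(Z, 0, 1, dt_i=0.5, dt_j=1.2,
                    f=np.zeros(fdim), src=make(), tgt=make())
```

Only the two endpoints of the observed edge are touched; all other node embeddings are left unchanged until their own next interaction.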
In Trivedi et al (2017), a similar strategy is followed to develop a model for
CTDGs with multi-relational graphs in which two custom RNNs update the node
embeddings for the source and target nodes once a new labeled edge is observed
between them. In Trivedi et al (2019), a model is developed that is similar to
the above models but closer in nature to GNNs. Upon making an observation
(AddEdge, (vi , v j ),t), the node embedding for vi is updated as follows (and simi-
larly for v j ):
Zi(t) = RNN((zN(vj) || ∆ti), Zi(t−))        (15.34)
3 The reason for using two RNNs is to allow the source and target nodes of a directed graph to be
updated differently upon making the observation (AddEdge, (vi , v j ),t). If the graph is undirected,
one may use a single RNN.
⁴ If this is the first interaction of vi (or vj), then ∆ti (or ∆tj) can be the time elapsed since t0.
away from them) and do not take into account the nodes that are multi-hops away.
We now describe a GNN-based model for CTDGs named temporal graph attention
networks (TGAT) and developed in (Xu et al, 2020a) that computes node embed-
dings based on the k-hop neighborhood of the nodes (i.e. based on the nodes that
are at most k hops away). Being a GNN-based model, TGAT can learn embeddings
for new nodes that are added to a graph and can be used for inductive settings where
at the test time, predictions are to be made for previously unseen nodes.
Similar to the Transformer model, TGAT removes the recurrence and instead
relies on self-attention and an extension of positional encoding to continuous time
encoding named Time2Vec. In Time2Vec (Kazemi et al, 2019), time t (or a delta of
time as in equation 15.32 and equation 15.34) is represented as a vector z (t) defined
as follows:

zi(t) = ωi t + ϕi,          if i = 0,
zi(t) = sin(ωi t + ϕi),     if 1 ≤ i ≤ k.        (15.35)
where ω and ϕ are vectors with learnable parameters. TGAT uses a specific case of
Time2Vec where the linear term is removed and the parameters ϕ are fixed to 0s and
π/2s similar to equation 15.13. We refer the reader to Kazemi et al (2019); Xu et al
(2020a) for further details.
⁵ For simplicity, here we describe a single-head attention-based GNN version of TGAT; in the
original work, a multi-head version is used (see equation 15.5 for details).
6. Finally, h(t,l,i) = FF (l) (h(t,l−1,i) || h̃(t,l,i) ) computes the representation for node vi
at time t in layer l, where FF (l) is a feed-forward neural network in layer l.
An L-layer TGAT model computes node embeddings based on the L-hop neigh-
borhood of a node.
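The Time2Vec encoding of equation 15.35 is straightforward to sketch directly; in practice ω and ϕ are learned, and the values below are fixed purely for illustration:

```python
import numpy as np

def time2vec(t, omega, phi):
    """Time2Vec (equation 15.35): a linear term at index 0 and
    periodic sine terms for indices 1..k."""
    z = omega * t + phi       # broadcasts over the whole vector
    z[1:] = np.sin(z[1:])     # index 0 keeps the linear term
    return z

omega = np.array([0.1, 1.0, 2.0])       # learnable in practice
phi = np.array([0.0, 0.0, np.pi / 2])   # TGAT fixes phi to 0s and pi/2s
z = time2vec(3.0, omega, phi)
# z[0] = 0.1 * 3.0 (linear term); z[2] = sin(2*3 + pi/2) = cos(6)
```

With ϕ fixed to 0s and π/2s and the linear term dropped, each frequency contributes a sine/cosine pair, recovering a continuous-time analogue of positional encoding.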
Suppose we run a 2-layer TGAT model on a temporal graph where vi interacted
with v j at time t1 < t and v j interacted with vk at time t2 < t1 . The embedding h(t,2,i)
is computed based on the embedding h(t1 ,1, j) which is itself computed based on the
embedding h(t2 ,0,k) . Since we are now at layer 0, h(t2 ,0,k) in TGAT is approximated
with Xk , thus ignoring the interactions vk has had before time t2 . This may be
suboptimal if vk has had important interactions before t2 , as these interactions are not
reflected in h(t1 ,1, j) and hence not reflected in h(t,2,i) . In (Rossi et al, 2020), this
problem is remedied by using a recurrent model (similar to those introduced at the
beginning of this subsection) that provides node embeddings at any time based on
their previous local interactions, and initializing h(t,0,i) s with these embeddings.
15.5 Applications
In this section, we provide some examples of real-world problems that have been
formulated as predictions over dynamic graphs and modeled using GNNs. In partic-
ular, we review applications in computer vision, traffic forecasting, and knowledge
graphs. This is by no means a comprehensive list; other application domains include
recommendation systems (Song et al, 2019a), physical simulation of object trajecto-
ries (Kipf et al, 2018), social network analysis (Min et al, 2021), automated software
bug triaging (Wu et al, 2021a), and many more.
Fig. 15.6: The human skeleton represented as a graph for each snapshot of a video.
The nodes represent the key points and the edges represent connections between
these key points. The t-th graph corresponds to the human skeleton obtained from the
t-th frame of the video.
Based on this description, we can formulate the problem as reasoning over a DTDG consisting
of a sequence [G(1) , G(2) , . . . , G(τ) ] of graphs where each G(t) = (V (t) , A(t) , X (t) )
corresponds to the t-th frame of a video, with V (t) representing the set of key points in
the t-th frame, A(t) representing their connections, and X (t) representing their features.
An example is provided in Figure 15.6. One may notice that V (1) = · · · = V (τ) = V
and A(1) = · · · = A(τ) = A, i.e. the nodes and the adjacency matrices remain fixed
throughout the sequence because they correspond to the key points and how they
are connected in the human body. For instance, in the graphs of Figure 15.6, the
node numbered as 3 is always connected to the nodes numbered as 2 and 4. The
feature matrices X (t) , however, keep changing as the coordinates of the key points
change in different frames. Activity recognition can now be cast as classifying
the dynamic graph into a set of predefined classes C.
The approach employed in (Yan et al, 2018a) is to convert the above DTDG into
a static graph through temporal unrolling (see Section 15.4.1). In the static graph,
the node corresponding to a key point at time t is connected to other key points at
time t according to the human body (or, in other words, according to A(t) ) as well
as the nodes representing the same key point and its neighbors in the previous ω
timestamps. Once a static graph is constructed, a GNN can be applied to obtain em-
beddings for every joint at every timestamp. Since activity recognition corresponds
to graph classification in this formulation, the decoder may consist of a (max, mean,
or another type of) pooling layer on the node embeddings to obtain a graph em-
bedding followed by a feed-forward network and a softmax layer to make class
predictions.
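Such a decoder can be sketched as follows, assuming mean pooling and a single linear layer before the softmax (layer sizes are illustrative):

```python
import numpy as np

def graph_classify(H, W, b):
    """Decoder for activity recognition as graph classification:
    mean-pool node embeddings, then a linear layer and softmax."""
    g = H.mean(axis=0)                 # graph embedding from node embeddings
    logits = g @ W + b                 # feed-forward layer (single layer here)
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(2)
H = rng.normal(size=(13, 8))           # embeddings for 13 skeleton joints
W, b = rng.normal(size=(8, 5)), np.zeros(5)
probs = graph_classify(H, W, b)        # distribution over 5 activity classes
```

Max pooling or an attention-weighted pooling can be substituted for the mean without changing the rest of the pipeline.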
In the l-th layer of the GNN in (Yan et al, 2018a), the adjacency matrix is multiplied
element-wise by a mask matrix M (l) with learnable parameters (i.e., A ⊙ M (l) is
used as the adjacency matrix). M (l) can be considered a data-independent attention
map that learns weights for the edges in A. The goal of M (l) is to learn which
connections are more important for activity recognition. Multiplying by M (l) only
allows for changing the weight of the edges in A; it cannot add new edges.
Connecting the key points according to the human body may arguably not be the
best choice as, e.g., the connection between the hands is important in recognizing
the clapping activity. In (Li et al, 2019e), the adjacency matrix is summed with two other
matrices B (l) and C (l) (i.e., A + B (l) + C (l) is used as the adjacency matrix) where B (l)
is a data-independent attention matrix similar to M (l) and C (l) is a data-dependent
attention matrix. Adding the two matrices B (l) and C (l) to A allows for not only chang-
ing the edge weights in A but also adding new edges.
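The difference between the two schemes can be illustrated on a toy adjacency matrix. In practice M (l), B (l), and C (l) are learned; the values below are placeholders:

```python
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

# (Yan et al, 2018a): reweight existing edges with a learnable mask M.
M = np.full_like(A, 0.5)               # learnable in practice
A_masked = A * M                       # A ⊙ M: zeros in A stay zero

# (Li et al, 2019e): additive attention matrices can create new edges.
B = np.zeros_like(A); B[0, 2] = B[2, 0] = 0.3   # data-independent
C = np.zeros_like(A)                   # data-dependent (e.g., from embeddings)
A_aug = A + B + C                      # nonzero where A was zero
```

Here the multiplicative mask can only rescale the two existing edges, while the additive scheme introduces a new connection between nodes 0 and 2 (e.g., the two hands).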
Instead of converting the dynamic graph to a static graph through temporal un-
rolling and applying a GNN on the static graph as in the previous two works, in Shi
et al (2019b), (among other changes) a GNN-CNN model is used. One can use
other combinations of a GNN and a sequence model (e.g., GNN-RNN) to obtain
embeddings for joints at different timestamps. Note that activity recognition is not
an extrapolation problem (i.e. the goal is not to predict the future based on the past).
Therefore, to obtain the joint embeddings at time t, one may use information not
only from G(t′) where t′ ≤ t but also from timestamps t′ > t. This can be done by
using, e.g., a GNN-BiRNN model (see Section 15.4.2).
For urban traffic control, traffic forecasting plays a paramount role. To predict the
future traffic on a road, one needs to consider two important factors: spatial depen-
dence and temporal dependence. The traffic on different roads is spatially depen-
dent: the future traffic on one road depends on the traffic on the roads that are
connected to it. The spatial dependence is a function of the topology of the
road network. There is also a temporal dependence for each road because the traffic
volume on a road at any time depends on the traffic volume at previous times.
There are also periodic patterns; e.g., the traffic on a road may be similar at the
same times of the day or at the same times of the week.
Early approaches for traffic forecasting mainly focused on temporal dependen-
cies and ignored the spatial dependencies (Fu et al, 2016). Later approaches aimed
at capturing spatial dependencies using convolutional neural networks (CNNs) (Yu
et al, 2017b), but CNNs are typically restricted to grid structures. To enable captur-
ing both spatial and temporal dependencies, several recent works have formulated
traffic forecasting as reasoning over a dynamic graph (DTDGs in particular).
We first start by formulating traffic forecasting as a reasoning problem over a
dynamic graph. One possible formulation is to consider a node for each road seg-
ment and connect two nodes if their corresponding road segments intersect with
each other. The node features are the traffic flow variables (e.g., speed, volume, and
density). The edges can be directed, e.g., to show the flow of the traffic in one-way
roads, or undirected, showing that traffic flows in both directions. The structure of
the graph can also change over time as, e.g., some road segments or some intersec-
tions may get (temporarily) closed. One may record the traffic flow variables and
the state of the roads and intersections at regularly-spaced time intervals, resulting
in a DTDG. Alternatively, one may record the variables at different (asynchronous)
timestamps, resulting in a CTDG.
Knowledge graphs (KGs) are databases of facts. A KG contains a set of facts in the
form of triples (vi , r j , vk ) where vi and vk are called the subject and object entities
and r j is a relation. A KG can be viewed as a directed multi-relational graph with
nodes V = {v1 , . . . , vn }, relations R = {r1 , . . . , rm }, and m adjacency matrices where
the j-th adjacency matrix corresponds to the relations of type r j between the nodes
according to the triples.
A temporal knowledge graph (TKG) contains a set of temporal facts. Each fact
may be associated with a single timestamp indicating the time when the event spec-
ified by the fact occurred, or a time interval indicating the start and end timestamps.
The facts with a single timestamp typically represent communication events and the
facts with a time interval typically represent associative events (see Section 15.3.2)6 .
Here, we focus on facts with a single timestamp for which a TKG can be defined as a
set of quadruples of the form (vi , r j , vk ,t) where t indicates the time when (vi , r j , vk )
occurred. Depending on the granularity of the timestamps, one may think of a TKG
as a DTDG or a CTDG.
TKG completion is the problem of learning models based on the existing tempo-
ral facts in a TKG to answer queries of the type (vi , r j , ?,t) (or (?, r j , vk ,t)) where the
correct answer is an entity v ∈ V such that (vi , r j , v,t) (or (v, r j , vk ,t)) has not been
observed during training. It is mainly an interpolation problem as queries are to be
answered at a timestamp t based on the past, present, and future facts. Currently, the
majority of the models for TKG completion are not based on GNNs (e.g., see (Goel
et al, 2020; Garcı́a-Durán et al, 2018; Dasgupta et al, 2018; Lacroix et al, 2020)).
Here, we describe a GNN-based approach that is mainly based on the work in (Wu
et al, 2020b).
Since TKGs correspond to multi-relational graphs, to develop a GNN-based
model that operates on a TKG we first need a relational GNN. Here, we describe
a model named relational graph convolution network (RGCN) (Schlichtkrull et al,
2018) but other relational GNN models can also be used (see, e.g., (Vashishth et al,
2020)). Whereas GCN projects all neighbors of a node using the same weight ma-
trix (see Section 15.2.1), RGCN applies relation-specific projections. Let R̂ be a
set of relations that includes every relation in R = {r1 , . . . , rm } as well as a self-loop
relation r0 where each node has the relation r0 only with itself. As is common in
directed graphs (see, e.g., (Marcheggiani and Titov, 2017)) and especially for multi-
relational graphs (see, e.g., (Kazemi and Poole, 2018)), for each relation r j ∈ R we
also add an auxiliary relation r j⁻¹ to R̂ where vi has relation r j⁻¹ with vk if and only
if vk has relation r j with vi . The l-th layer of an RGCN model can then be described as
follows:
Z (l) = σ( ∑_{r∈R̂} D(r)⁻¹ A(r) Z (l−1) W (l,r) )        (15.36)
where A(r) ∈ R^{n×n} represents the adjacency matrix corresponding to relation r, D(r)
is the degree matrix of A(r) with D(r)i,i representing the number of incoming relations
of type r for the i-th node, D(r)⁻¹ is a normalization term⁷, W (l,r) is a relation-specific
weight matrix for layer l, Z (l−1) represents the node embeddings in the (l−1)-th layer,
and Z (l) represents the updated node embeddings in the l-th layer. If initial features
X are provided as input, Z (0) can be set to X. Otherwise, Z (0) can either be set to
1-hot encodings where Zi(0) is a vector whose elements are all zeros except in the
⁶ This, however, is not always true as one may break a fact such as (vi , LivedIn, v j ) with a time
interval [2010, 2015] (meaning from 2010 until 2015) into a fact (vi , StartedLivingIn, v j ) with a
timestamp of 2010 and another fact (vi , EndedLivingIn, v j ) with a timestamp of 2015.
⁷ One needs to handle the cases where D(r)i,i = 0 to avoid numerical issues.
i-th position where it is 1, or it can be randomly initialized and then learned from the
data.
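A minimal NumPy sketch of one RGCN layer (equation 15.36), using mean normalization and a ReLU for σ; the relation names and sizes are illustrative:

```python
import numpy as np

def rgcn_layer(adjs, Z, weights):
    """One RGCN layer (equation 15.36): per-relation normalized
    aggregation with relation-specific projections, summed and passed
    through a ReLU. `adjs` maps each relation (including the self-loop
    and inverses) to its adjacency matrix; `weights` maps it to W(l,r)."""
    out = np.zeros((Z.shape[0], next(iter(weights.values())).shape[1]))
    for r, A in adjs.items():
        deg = A.sum(axis=1, keepdims=True)
        deg[deg == 0] = 1.0            # guard D(r)_ii = 0 (footnote 7)
        out += (A / deg) @ Z @ weights[r]
    return np.maximum(out, 0.0)

rng = np.random.default_rng(3)
n, d_in, d_out = 4, 6, 5
A_r1 = rng.integers(0, 2, (n, n)) * 1.0
adjs = {"r0": np.eye(n),               # self-loop relation
        "r1": A_r1, "r1_inv": A_r1.T}  # a relation and its inverse
weights = {r: rng.normal(size=(d_in, d_out)) for r in adjs}
Z0 = np.eye(n, d_in)                   # 1-hot initial embeddings Z(0)
Z1 = rgcn_layer(adjs, Z0, weights)
```

The self-loop and inverse relations let information flow to a node from itself and from its outgoing edges, as described in the text.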
In (Wu et al, 2020b), a TKG is formulated as a DTDG consisting of a sequence
of snapshots [G(1) , G(2) , . . . , G(τ) ] of multi-relational graphs. Each G(t) contains the
same set of entities V and relations R (corresponding to all the entities and relations
in the TKG) and contains the triples (vi , r j , vk ,t) from the TKG that occurred at time
t. Then, RGCN-BiGRU and RGCN-Transformer models are developed (see Sec-
tion 15.4.2) that operate on the DTDG formulation of the TKG where the RGCN
model provides the node embeddings at every timestamp and the BiGRU and Trans-
former models aggregate the temporal information. Note that in each G(t) there may
be several nodes with no incoming and outgoing edges (and also no features since
TKGs typically do not have node features). RGCN does not learn a representation
for these nodes as there exists no information about them in G(t) . To handle this,
special BiGRU and Transformer models are developed in (Wu et al, 2020b) that
handle missing values.
The RGCN-BiGRU and RGCN-Transformer models provide node embeddings
H (t) at any timestamp t. To answer a query such as (vi , r j , ?,t), one can compute the
plausibility score of (vi , r j , vk ,t) for every vk ∈ V and select the entity that achieves
the highest score. A common approach to find the score for an entity vk for the above
query is to use the TransE decoder (Bordes et al, 2013), according to which the score
is −||Hi(t) + R j − Hk(t)||, where Hi(t) and Hk(t) correspond to the node embeddings
for vi and vk at time t (provided by the RGCN) and R is a matrix with learnable
parameters which has m = |R| rows, each corresponding to an embedding for a re-
lation. TransE and its extensions are known to make unrealistic assumptions about
the types and properties of the relations (Kazemi and Poole, 2018), so, alternatively,
one may use other decoders that have been developed within the knowledge graph
embedding community (e.g., the models in (Kazemi and Poole, 2018; Trouillon
et al, 2016)).
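Answering a query (vi , r j , ?, t) with the TransE decoder can be sketched as follows; the embeddings are toy values constructed so that v2 is the correct answer:

```python
import numpy as np

def answer_query(H_t, R, i, j):
    """Answer (v_i, r_j, ?, t) with a TransE decoder: score every
    candidate v_k by -||H_i + R_j - H_k|| and return the argmax.
    (In practice one would exclude v_i itself from the candidates.)"""
    scores = -np.linalg.norm(H_t[i] + R[j] - H_t, axis=1)
    return int(np.argmax(scores)), scores

# Toy setup where the correct answer is constructed to be v2:
H_t = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 1.0]])  # node embeddings at t
R = np.array([[1.0, 1.0]])                             # relation embeddings
best, scores = answer_query(H_t, R, i=0, j=0)
# H_0 + R_0 = (1, 1) exactly matches H_2, so v2 scores highest (score 0)
```

Swapping in a different scoring function (e.g., ComplEx or SimplE) changes only the body of `answer_query`, not the surrounding pipeline.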
When the timestamps in the TKG are discrete and there are not many of them,
one can use a similar approach as above to answer queries of the form (vi , r j , vk , ?)
through finding the score for every t in the set of discrete timestamps and selecting
the one that achieves the highest score (see, e.g., (Leblay and Chekol, 2018)). Time
prediction for TKGs has been also studied in an extrapolation setting where the goal
is to predict when an event is going to happen in the future. This has been mainly
done using temporal point processes as decoders (see, e.g., (Trivedi et al, 2017,
2019)).
15.6 Summary
Graph-based techniques are emerging as leading approaches in the industry for ap-
plication domains with relational information. Among these techniques, graph neu-
ral networks (GNNs) are currently among the top-performing approaches. While
GNNs and other graph-based techniques were initially developed mainly for static
graphs, extending these approaches to dynamic graphs has been the subject of sev-
eral recent studies and has found success in several important areas. In this chapter,
we reviewed the techniques for applying GNNs to dynamic graphs. We also re-
viewed some of the applications of dynamic GNNs in different domains including
computer vision, traffic forecasting, and knowledge graphs.
Editor’s Notes: In the universe, the only thing that never changes is “change” it-
self, and so it is with networks. Hence, extending techniques for simple, static net-
works to dynamic ones is an inevitable trend as this domain progresses. While
there has been a fast-growing body of research on dynamic networks in recent years,
much more effort is needed to make substantial progress on key issues such as the
scalability and validity discussed in Chapter 5 and other chapters. Extensions of the
techniques in Chapters 9-18 are also needed. Many real-world applications, such as
recommender systems (Chapter 19) and urban intelligence (Chapter 27), fundamentally
require considering dynamic networks, so they could also benefit from advances in
techniques for dynamic networks.
Chapter 16
Heterogeneous Graph Neural Networks
Chuan Shi
Heterogeneous graphs (HGs) (Sun and Han, 2013), which are composed of different types
of entities and relations, also known as heterogeneous information networks (HINs),
are ubiquitous in real-world scenarios, ranging from bibliographic networks and social
networks to recommender systems. For example, as shown in Fig. 16.1 (a), a biblio-
graphic network can be represented by an HG, which consists of four types of entities
(author, paper, venue, and term) and three types of relations (author-write-paper,
paper-contain-term and conference-publish-paper); these basic relations can be
further composed into more complex semantics (e.g., author-write-paper-contain-term).
It has been well recognized that the HG is a powerful model that embraces rich seman-
tic and structural information. Therefore, research on HGs has experienced
tremendous growth in data mining and machine learning, with many suc-
cessful applications such as recommendation (Shi et al, 2018a; Hu et al, 2018a), text
Chuan Shi
School of Computer Science, Beijing University of Posts and Telecommunications, e-mail:
[email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 351
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_16
352 Chuan Shi
analysis (Linmei et al, 2019; Hu et al, 2020a), and cybersecurity (Hu et al, 2019b;
Hou et al, 2017).
Due to the ubiquity of HGs, learning embeddings of HGs is a key re-
search problem in various graph analysis applications, e.g., node/graph classifica-
tion (Dong et al, 2017; Fu et al, 2017) and node clustering (Li et al, 2019g). Tradi-
tionally, matrix factorization methods (Newman, 2006b) generate latent features in
HGs. However, the computational cost of decomposing a large-scale matrix is usu-
ally very expensive, and such methods also suffer from statistical performance drawbacks (Shi
et al, 2016; Cui et al, 2018). To address this challenge, heterogeneous graph embed-
ding, which aims to learn a function that maps the input space into a lower-dimensional space
while preserving the heterogeneous structure and semantics, has drawn considerable at-
tention in recent years.
Although there have been ample studies of embedding technology on homoge-
neous graphs (Cui et al, 2018), which consist of only one type of nodes and edges,
these techniques are not directly applicable to HGs due to the heterogeneity. Specif-
ically, (1) the structure in HGs is usually semantic dependent, e.g., the meta-path struc-
ture (Dong et al, 2017) can be very different when considering different types of
relations; (2) different types of nodes and edges have different attributes located in
different feature spaces; (3) HGs are usually application dependent, which may require
sufficient domain knowledge for meta-path/meta-graph selection.
To tackle the above issues, various HG embedding methods have been proposed
(Chen et al, 2018b; Hu et al, 2019a; Dong et al, 2017; Fu et al, 2017; Wang et al,
2019m; Shi et al, 2018a; Wang et al, 2020n). From the technical perspective, we
divide the widely used models in HG embedding into two categories: shallow models
and deep models. In summary, shallow models initialize the node embeddings
randomly and then learn them by optimizing well-designed objective functions that
preserve heterogeneous structures and semantics. Deep models aim to use deep
neural networks (DNNs) to learn embeddings from node attributes or interactions,
among which heterogeneous graph neural networks (HGNNs) stand out and will be the
focus of this chapter. HG embedding techniques have been successfully deployed in
real-world applications, including recommender systems (Shi et al, 2018a; Hu et al,
2018a; Wang et al, 2020n), malware detection systems (Hou et al, 2017; Fan et al,
2018; Ye et al, 2019a), and healthcare systems (Cao et al, 2020; Hosseini et al, 2018).
The remainder of this chapter is organized as follows. In Sect. 16.1, we first
introduce basic concepts in HGs, then discuss the unique challenges of HG embedding
due to the heterogeneity, and give a brief review of recent developments in HG
embedding. In Sect. 16.2 and Sect. 16.3, we categorize and introduce HG embedding in
detail according to shallow and deep models. In Sect. 16.4, we further review the
pros and cons of the models introduced above. Finally, Sect. 16.5 forecasts future
research directions for HGNNs.
Fig. 16.1: (a) An example of a HIN with four node types (author, paper, venue, term) and three edge types (write, publish, contain); (b) the network schema; (c) meta-paths, e.g., APA and APCPA.
In this section, we will first formally introduce basic concepts in HGs and illustrate
the symbols used throughout this chapter. HG is a graph consisting of different types
of entities (i.e., nodes) and/or different types of relations (i.e., edges), which can be
defined as follows.
Definition 16.1. Heterogeneous Graph (or Heterogeneous Information Network)
(Sun and Han, 2013). A HG is defined as a graph G = {V , E }, in which V and
E represent the node set and the edge set, respectively. Each node v ∈ V and
each edge e ∈ E are associated with their mapping function φ (v) : V → A and
ϕ(e) : E → R. A and R denote the node type set and edge type set, respectively,
where |A |+|R| > 2. The network schema for G is defined as S = (A , R), which
can be seen as a meta template of a heterogeneous graph G = {V , E } with the
node type mapping function φ (v) : V → A and the edge type mapping function
ϕ(e) : E → R. The network schema is a graph defined over node types A , with
edges as relation types from R.
HG not only provides graph structure of data association, but also portrays
higher-level semantics. An example of HG is illustrated in Fig. 16.1 (a), which
consists of four node types (author, paper, venue, and term) and three edge types
(author-write-paper, paper-contain-term, and conference-publish-paper), and Fig.
16.1 (b) illustrates the network schema. To formulate semantics of higher-order re-
lationships among entities, meta-path (Sun et al, 2011) is further proposed whose
definition is given below.
Definition 16.2. Meta-path (Sun et al, 2011). A meta-path p is based on the network schema S and is denoted as $p = N_1 \xrightarrow{R_1} N_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} N_{l+1}$ (simplified to $N_1 N_2 \cdots N_{l+1}$), which describes a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between node types $N_1$ and $N_{l+1}$.
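To make the definition concrete, a meta-path-guided random walk over a toy bibliographic graph can be sketched as follows; the adjacency lists, node names, and the `metapath_walk` helper are all illustrative assumptions, not from the chapter:

```python
import random

# Toy bibliographic HG: adjacency lists keyed by (node, neighbor_type).
# Node types: 'A' (author), 'P' (paper), 'V' (venue). Names are made up.
adj = {
    ("a1", "P"): ["p1", "p2"], ("a2", "P"): ["p2"],
    ("p1", "A"): ["a1"], ("p2", "A"): ["a1", "a2"],
    ("p1", "V"): ["v1"], ("p2", "V"): ["v1"],
    ("v1", "P"): ["p1", "p2"],
}

def metapath_walk(start, metapath, walk_len, rng):
    """Walk guided by a meta-path such as ['A', 'P', 'A'] (APA).

    At each step, the next node is sampled only among neighbors whose
    type matches the next type prescribed by the (cycled) meta-path."""
    walk = [start]
    types = metapath[:-1]  # drop the repeated end type before cycling
    i = 0
    while len(walk) < walk_len:
        next_type = types[(i + 1) % len(types)]
        candidates = adj.get((walk[-1], next_type), [])
        if not candidates:
            break  # no neighbor of the required type: stop the walk
        walk.append(rng.choice(candidates))
        i += 1
    return walk

rng = random.Random(0)
walk = metapath_walk("a1", ["A", "P", "A"], walk_len=5, rng=rng)
print(walk)  # node types alternate author, paper, author, paper, author
```

The type constraint is what distinguishes this from an ordinary random walk: the sampled sequence carries the semantics of the chosen meta-path (here, co-authorship).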
Different from homogeneous graph embedding (Cui et al, 2018), where the basic
problem is preserving structure and properties in node embeddings, HG embedding
imposes more challenges due to the heterogeneity, which are illustrated below.
Complex Structure (the complex HG structure caused by multiple types of
nodes and edges). In a homogeneous graph, the fundamental structure can be con-
sidered as first-order, second-order, and even higher-order structures (Tang et al,
2015b). All these structures are well defined and have good intuition. However, the
structure in HGs will dramatically change depending on the selected relations. Taking
the academic graph in Fig. 16.1 (a) as an example, the neighbors of a paper are
authors under the “write” relation, while under the “contain” relation the neighbors
become terms. Complicating things further, combinations of these relations, which
can be considered higher-order structures in HGs, will result in
different and more complicated structures. Therefore, how to efficiently and effectively
preserve these complex structures is a great challenge in HG embedding; current
efforts have focused on the meta-path structure (Dong et al, 2017) and the meta-graph
structure (Zhang et al, 2018b).
Heterogeneous Attributes (the fusion problem caused by the heterogeneity of
attributes). Since the nodes and edges in a homogeneous graph have the same type, each
dimension of the node or edge attributes has the same meaning. In this situation,
a node can directly fuse the attributes of its neighbors. However, in HGs, the attributes
of different types of nodes and edges may have different meanings (Zhang et al,
2019b; Wang et al, 2019m). For example, the attributes of an author can be research
fields, while a paper may use keywords as attributes. Therefore, how to overcome
the heterogeneity of attributes and effectively fuse the attributes of neighbors poses
another challenge in HG embedding.
Application Dependent. HGs are closely related to real-world applications, where
many practical problems remain unsolved. For example, constructing an appropriate
HG may require sufficient domain knowledge in a real-world application.
Also, meta-paths and/or meta-graphs are widely used to capture the structure of HGs.
However, unlike homogeneous graphs, where the structure (e.g., the first-order and
second-order structure) is well defined, meta-path selection may also need prior
knowledge. Furthermore, to better facilitate real-world applications, we usually
need to elaborately encode side information (e.g., node attributes) (Wang et al,
2019m; Zhang et al, 2019b) or more advanced domain knowledge (Shi et al, 2018a;
Chen and Sun, 2017) into the HG embedding process.
Most early works on graph data were based on high-dimensional sparse vectors
for matrix analysis. However, the sparsity of real-world graphs and their growing
scale have created serious challenges for such methods. A more effective way is
to map nodes into a latent space and represent them with low-dimensional vectors,
i.e., graph embedding, so that they can be more flexibly applied to different data
mining tasks.
There has been a lot of work dedicated to homogeneous graph embedding (Cui
et al, 2018), mainly based on deep models combined with graph properties to learn
embeddings of nodes or edges. For instance, DeepWalk (Perozzi
et al, 2014) combines random walk and skip-gram model; LINE (Tang et al, 2015b)
utilizes first-order and second-order similarity to learn distinguished node embed-
ding for large-scale graphs; SDNE (Wang et al, 2016) uses deep auto-encoders to
extract non-linear characteristics of graph structure. In addition to structural infor-
mation, many methods further use the content of nodes or other auxiliary informa-
tion (such as text, images, and tags) to learn more accurate and meaningful node
embeddings. Some survey papers comprehensively summarize the work in this area
(Cui et al, 2018; Hamilton et al, 2017c).
Due to the heterogeneity, embedding techniques for homogeneous graphs can-
not be directly applicable to HGs. Therefore, researchers have begun to explore
HG embedding methods, which have emerged in recent years but developed rapidly. From
the technical perspective, we summarize the widely used techniques (or models) in
HG embedding, which can be generally divided into two categories: shallow models
and deep models, as shown in Fig. 16.2. Specifically, shallow models mainly
rely on meta-paths to simplify the complex structure of HGs, and can be classified
into decomposition-based and random walk-based methods. Decomposition-based tech-
niques (Chen et al, 2018b; Xu et al, 2017b; Shi et al, 2018b,c; Matsuno and Murata,
2018; Tang et al, 2015a; Gui et al, 2016) decompose the complex heterogeneous
structure into several simpler homogeneous structures, while random walk-based
methods (Dong et al, 2017; Hussein et al, 2018) utilize meta-path-guided random
walks to preserve specific first-order and high-order structures. To take
full advantage of heterogeneous structures and attributes, deep models fall into three
categories: message passing-based (HGNNs), encoder-decoder-based, and adversarial-
based methods. The message passing mechanism, i.e., the core idea of graph neural
networks (GNNs), seamlessly integrates structure and attribute information. HGNNs
inherit the message passing mechanism and design suitable aggregation functions
to capture rich semantics in HGs (Wang et al, 2019m; Fu et al, 2020; Hong et al,
2020b; Zhang et al, 2019b; Cen et al, 2019; Zhao et al, 2020b; Zhu et al, 2019d;
Schlichtkrull et al, 2018). The remaining encoder-decoder-based (Tu et al, 2018;
Chang et al, 2015; Zhang et al, 2019c; Chen and Sun, 2017) and adversarial-based
(Hu et al, 2018a; Zhao et al, 2020c) techniques employ encoder-decoder framework
or adversarial learning to preserve complex attribute and structural information of
HGs. In the following sections, we will introduce representative works of their sub-
categories in detail and compare their pros and cons.
Early HG embedding methods focus on employing shallow models. They first ini-
tialize node embeddings randomly, then learn node embeddings through optimizing
some well-designed objective functions. We divide the shallow model into two cat-
egories: decomposition-based and random walk-based.
Decomposition-based techniques decompose a HG into several simpler subgraphs
according to meta-paths or relations, learn node embeddings in each subgraph, and
then fuse them. For example, HERec (Shi et al, 2018a) learns user and item embeddings
under different meta-paths and fuses them with a fusion function g:

$$g(h_u) = \frac{1}{|P|} \sum_{p=1}^{|P|} \left(W^p h_u^p + b^p\right), \qquad (16.1)$$

where $h_u^p$ is the embedding of user node u under meta-path p and P denotes the set of meta-paths. The fusion of item embeddings is similar. Finally, a prediction layer is used to predict the items that users prefer. HERec optimizes the graph embedding and recommendation objectives jointly.
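As a rough sketch of the fusion in Eq. (16.1), the following toy code averages per-meta-path linear transforms; the meta-path names, weight matrices, and embeddings are invented for illustration:

```python
def fuse_metapath_embeddings(h_u, weights, biases):
    """Eq. (16.1): g(h_u) = (1/|P|) * sum_p (W^p h_u^p + b^p).
    h_u maps each meta-path p to the node's embedding under p."""
    P = list(h_u)                     # the set of meta-paths
    dim = len(biases[P[0]])
    fused = [0.0] * dim
    for p in P:
        W, h, b = weights[p], h_u[p], biases[p]
        for i in range(dim):
            fused[i] += sum(W[i][j] * h[j] for j in range(len(h))) + b[i]
    return [x / len(P) for x in fused]  # average over meta-paths

# Two hypothetical meta-paths for a user node u; all numbers are toy values.
h_u = {"UIU": [1.0, 0.0], "UIAIU": [0.0, 1.0]}
weights = {"UIU": [[1.0, 0.0], [0.0, 1.0]], "UIAIU": [[2.0, 0.0], [0.0, 2.0]]}
biases = {"UIU": [0.0, 0.0], "UIAIU": [0.1, 0.1]}
print(fuse_metapath_embeddings(h_u, weights, biases))  # averages the two transforms
```

Each meta-path contributes its own projection of the node, and the average yields a single embedding used by the downstream prediction layer.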
As another example, EOE is proposed to learn embeddings for coupled HGs,
which consist of two different but related subgraphs. It divides the edges in HG
into intra-graph edges and inter-graph edges. An intra-graph edge connects two nodes
of the same type, and an inter-graph edge connects two nodes of different types.
To capture the heterogeneity of inter-graph edges, EOE (Xu et al, 2017b) uses a
relation-specific matrix $M_r$ to calculate the similarity between two nodes, which can
be formulated as:
$$S_r(v_i, v_j) = \frac{1}{1 + \exp\left(-h_i^{\top} M_r h_j\right)}. \qquad (16.2)$$
Similarly, PME (Chen et al, 2018b) decomposes HG into some bipartite graphs
according to the types of edges and projects each bipartite graph into a relation-
specific semantic space. PTE (Tang et al, 2015a) divides the documents into word-
word graph, word-document graph and word-label graph. Then it uses LINE (Tang
et al, 2015b) to learn the shared node embeddings for each sub-graph. HEBE (Gui
et al, 2016) samples a series of subgraphs from a HG and preserves the proximity
between the center node and its subgraph.
The above two-step framework of decomposition and fusion, as a transitional
product from homogeneous networks to HGs, was often used in early attempts at HG
embedding. Later, researchers gradually realized that extracting homogeneous graphs
from HGs irreversibly loses the information carried by heterogeneous neighbors, and
began to explore HG embedding methods truly adapted to heterogeneous structures.
Random walk, which generates some node sequences in a graph, is often used to
describe the reachability between nodes. Therefore, it is widely used in graph rep-
resentation learning to sample neighbor relationships of nodes and capture local
structure in the graph (Grover and Leskovec, 2016). In homogeneous graphs, there is
a single node type, so a random walk can follow any path. In HGs, however, due to
the type constraints of nodes and edges, meta-path-guided random walk is usually
adopted, so that the generated node sequences contain not only structural information
but also semantic information. Through preserving the node sequence
structure, node embedding can preserve both first-order and high-order proximity
(Dong et al, 2017). A representative work is metapath2vec (Dong et al, 2017), which
uses meta-path-guided random walk to capture semantic information of two nodes,
e.g., the co-author relationship in academic graph as shown in Fig. 16.4.
Metapath2vec (Dong et al, 2017) mainly uses meta-path-guided random walks to
generate heterogeneous node sequences with rich semantics. It then designs a
heterogeneous skip-gram technique to preserve the proximity between node v and its
context nodes, i.e., its neighbors in the random walk sequences:

$$\max_{\theta} \sum_{v \in V} \sum_{t \in A} \sum_{c_t \in C_t(v)} \log p(c_t \mid v; \theta), \qquad (16.3)$$

where $C_t(v)$ represents the context nodes of node v with type t, and $p(c_t \mid v; \theta)$ denotes the heterogeneous similarity function on node v and its context neighbors $c_t$:

$$p(c_t \mid v; \theta) = \frac{e^{h_{c_t} \cdot h_v}}{\sum_{\tilde{v} \in V} e^{h_{\tilde{v}} \cdot h_v}}. \qquad (16.4)$$
As shown in Fig. 16.4, computing Eq. (16.4) requires the similarity between the
center node and all other nodes, which is expensive. Metapath2vec therefore adopts
the negative sampling strategy of Mikolov et al (2013b) to reduce the computation,
so that Eq. (16.4) can be approximated as:
$$\log \sigma(h_{c_t} \cdot h_v) + \sum_{q=1}^{Q} \mathbb{E}_{\tilde{v}_q \sim P(\tilde{v})}\left[\log \sigma(-h_{\tilde{v}_q} \cdot h_v)\right], \qquad (16.5)$$
where $\sigma(\cdot)$ is the sigmoid function and $P(\tilde{v})$ is the distribution from which the negative
nodes $\tilde{v}_q$ are sampled Q times. Through this negative sampling strategy, the
time complexity is greatly reduced. However, when choosing negative samples,
metapath2vec does not consider the types of nodes, i.e., different types of nodes
are drawn from the same distribution $P(\tilde{v})$. Metapath2vec++ is therefore designed,
which samples negative nodes of the same type as the central node, i.e., $\tilde{v}_q^t \sim P(\tilde{v}^t)$. The
formulation can be rewritten as:
$$\log \sigma(h_{c_t} \cdot h_v) + \sum_{q=1}^{Q} \mathbb{E}_{\tilde{v}_q^t \sim P(\tilde{v}^t)}\left[\log \sigma\left(-h_{\tilde{v}_q^t} \cdot h_v\right)\right]. \qquad (16.6)$$
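A minimal sketch of the typed negative-sampling objective in Eq. (16.6), with plain Python lists standing in for embeddings; the sampling table, vectors, and function name are illustrative assumptions:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def typed_negative_objective(h_v, h_ct, neg_table, node_type, Q, rng):
    """Eq. (16.6): positive term plus Q negatives drawn only from nodes
    of the same type as the context node (the metapath2vec++ variant)."""
    obj = math.log(sigmoid(dot(h_ct, h_v)))          # positive pair term
    same_type = [h for t, h in neg_table if t == node_type]
    for _ in range(Q):
        h_neg = rng.choice(same_type)                # typed negative sample
        obj += math.log(sigmoid(-dot(h_neg, h_v)))
    return obj

# Toy 2-d embeddings; the types 'author'/'paper' are made up for illustration.
neg_table = [("author", [0.1, -0.2]), ("author", [-0.3, 0.4]), ("paper", [0.5, 0.5])]
rng = random.Random(42)
val = typed_negative_objective([1.0, 0.5], [0.8, 0.2], neg_table, "author", Q=2, rng=rng)
print(round(val, 4))  # a (negative) log-likelihood value
```

Restricting the negative table to the context node's type is exactly what distinguishes Eq. (16.6) from Eq. (16.5).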
In addition, some intrinsic properties of HGs, e.g., hierarchical and power-law
structures, can be naturally reflected in the learned node embeddings.
In recent years, deep neural networks (DNNs) have achieved great success in
computer vision and natural language processing. Some works have also begun to use
deep models to learn embeddings from node attributes or interactions among nodes in
HGs. Compared with shallow models, deep models can better capture non-linear
relationships. They can be roughly divided into three categories: message
passing-based, encoder-decoder-based, and adversarial-based.
Graph neural networks (GNNs) have emerged recently. Their core idea is the message
passing mechanism, which aggregates neighborhood information and transmits it as
messages to neighbor nodes. Different from GNNs, which can directly fuse the attributes
of neighbors to update node embeddings, HGNNs need to overcome the heterogeneity of
attributes caused by the different types of nodes and edges, and design effective
fusion methods to utilize neighborhood information. Therefore, the key component is to
design a suitable aggregation function that can capture the semantic and structural
information of HGs (Wang et al, 2019m; Fu et al, 2020; Hong et al, 2020b; Zhang
et al, 2019b; Cen et al, 2019; Zhao et al, 2020b; Zhu et al, 2019d; Schlichtkrull et al,
2018).
Unsupervised HGNNs. Unsupervised HGNNs aim to learn node embeddings
with good generalization. To this end, they always utilize interactions among dif-
ferent types of attributes to capture the potential commonalities. HetGNN (Zhang
et al, 2019b) is the representative work of unsupervised HGNNs. It consists of three
parts: content aggregation, neighbor aggregation, and type aggregation. Content ag-
gregation is designed to learn fused embeddings from different node contents, such
as images, text, or attributes:
$$f_1(v) = \frac{\sum_{i \in C_v} \left[\overrightarrow{\mathrm{LSTM}}\{\mathcal{FC}(h_i)\} \oplus \overleftarrow{\mathrm{LSTM}}\{\mathcal{FC}(h_i)\}\right]}{|C_v|}, \qquad (16.7)$$
where $C_v$ is the set of node v's content (attribute) types and $h_i$ is the i-th attribute
of node v. A bi-directional Long Short-Term Memory (Bi-LSTM) (Huang et al, 2015) is
used to fuse the embeddings learned by the multiple attribute encoders $\mathcal{FC}$.
Neighbor aggregation then aggregates the neighbors of the same type, again using a
Bi-LSTM to capture positional information:
$$f_2^t(v) = \frac{\sum_{v' \in N_t(v)} \left[\overrightarrow{\mathrm{LSTM}}\{f_1(v')\} \oplus \overleftarrow{\mathrm{LSTM}}\{f_1(v')\}\right]}{|N_t(v)|}, \qquad (16.8)$$
where $N_t(v)$ is the set of first-order neighbors of node v with type t. Type
aggregation uses an attention mechanism with weights $\alpha$ to mix the embeddings of
different types and produce the final node embeddings:

$$h_v = \alpha^{v,v} f_1(v) + \sum_{t \in O_v} \alpha^{v,t} f_2^t(v), \qquad (16.9)$$

where $h_v$ is the final embedding of node v and $O_v$ denotes the set of node types. Finally, a heterogeneous skip-gram loss is used as the unsupervised graph context loss
to update node embeddings. Through these three aggregation methods, HetGNN
can preserve the heterogeneity of both graph structures and node attributes.
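The three aggregation stages of HetGNN can be sketched as follows, with mean pooling standing in for the Bi-LSTM encoders of Eqs. (16.7)-(16.8) (a deliberate simplification; all names and values are toy data):

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def hetgnn_embed(node, contents, typed_neighbors, type_weights):
    """Simplified HetGNN: content aggregation (f1), per-type neighbor
    aggregation (f2), and weighted type aggregation. Mean pooling
    replaces the Bi-LSTMs; fixed weights replace learned attention."""
    f1 = {v: mean(attrs) for v, attrs in contents.items()}      # content agg.
    f2 = {t: mean([f1[v] for v in nbrs])                        # neighbor agg.
          for t, nbrs in typed_neighbors.items()}
    # Type aggregation: weighted sum of the node's own content embedding
    # and each per-type neighbor embedding.
    parts = [(type_weights["self"], f1[node])] + \
            [(type_weights[t], f2[t]) for t in f2]
    dim = len(f1[node])
    return [sum(w * vec[i] for w, vec in parts) for i in range(dim)]

# Toy data: one paper with author and venue neighbors (all values invented).
contents = {"p1": [[1.0, 0.0], [0.0, 1.0]], "a1": [[2.0, 0.0]], "v1": [[0.0, 2.0]]}
typed_neighbors = {"author": ["a1"], "venue": ["v1"]}
weights = {"self": 0.5, "author": 0.25, "venue": 0.25}
h_p1 = hetgnn_embed("p1", contents, typed_neighbors, weights)
print(h_p1)  # [0.75, 0.75]
```

The per-type grouping in `f2` is the step that copes with heterogeneity: neighbors of different types are never pooled together directly.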
Other unsupervised methods capture either the heterogeneity of node attributes or
the heterogeneity of graph structures. HNE (Chang et al, 2015) is proposed to learn
embeddings for cross-modal data in HGs, but it ignores the various types of edges.
SHNE (Zhang et al, 2019c) focuses on capturing semantic information of nodes by
designing a deep semantic encoder with gated recurrent units (GRU) (Chung et al,
2014). Although it uses the heterogeneous skip-gram to preserve the heterogeneity of
the graph, SHNE is designed specifically for text data. Cen et al (2019) propose
GATNE, which aims to learn node embeddings in multiplex graphs, i.e., heterogeneous
graphs with different types of edges. Compared with HetGNN, GATNE pays
more attention to distinguishing different edge relationships between node pairs.
Semi-supervised HGNNs. Different from unsupervised HGNNs, semi-supervised
HGNNs aim to learn task-specific node embeddings in an end-to-end manner. For
this reason, they prefer to use attention mechanisms to capture the structural and
attribute information most relevant to the task. Wang et al (2019m) propose the
heterogeneous graph attention network (HAN), which uses a hierarchical attention
mechanism to capture both node and semantic importance. The architecture of HAN
is shown in Fig. 16.5.
It consists of three parts: node-level attention, semantic-level attention, and
prediction. Node-level attention utilizes the self-attention mechanism (Vaswani
et al, 2017) to learn the importance of neighbors within a given meta-path m:

$$\alpha_{ij}^{m} = \frac{\exp\left(\sigma\left(a_m^{\top} \left[h_i' \,\|\, h_j'\right]\right)\right)}{\sum_{k \in N_i^m} \exp\left(\sigma\left(a_m^{\top} \left[h_i' \,\|\, h_k'\right]\right)\right)}, \qquad (16.10)$$

$$h_i^{m} = \sigma\left(\sum_{j \in N_i^m} \alpha_{ij}^{m} \cdot h_j'\right), \qquad (16.11)$$
Fig. 16.5: The architecture of HAN (Wang et al, 2019m). The whole model can
be divided into three parts: Node-Level Attention aims to learn the importance of
neighbors’ features. Semantic-Level Attention aims to learn the importance of dif-
ferent meta-paths. Prediction layer utilizes the labeled nodes to update node embed-
dings.
Semantic-level attention then learns the importance $w_{m_i}$ of each meta-path $m_i$:

$$w_{m_i} = \frac{1}{|V|} \sum_{v \in V} q^{\top} \tanh\left(W h_v^{m_i} + b\right), \qquad (16.12)$$

where $W \in \mathbb{R}^{d' \times d}$ and $b \in \mathbb{R}^{d' \times 1}$ denote the weight matrix and bias of the MLP, respectively, and $q \in \mathbb{R}^{d' \times 1}$ is the semantic-level attention vector. In order to prevent the node embeddings from being too large, HAN uses the softmax function to normalize $w_{m_i}$. Hence, the semantic-level aggregation is defined as:
$$H = \sum_{i=1}^{P} \beta_{m_i} \cdot H_{m_i}, \qquad (16.13)$$

where $\beta_{m_i}$ denotes the normalized $w_{m_i}$, which represents the semantic importance, and $H \in \mathbb{R}^{N \times d}$ denotes the final node embeddings. Finally, a task-specific layer is used
to fine-tune node embeddings with a small number of labels and the embeddings H
can be used in downstream tasks, such as node clustering and link prediction. HAN
is the first to extend GNNs to the heterogeneous graph and design a hierarchical
attention mechanism, which can capture both structural and semantic information.
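The semantic-level fusion of Eq. (16.13) can be sketched as a softmax-weighted sum over per-meta-path embedding matrices; the importance scores and embeddings below are toy values:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def semantic_fusion(metapath_embeddings, importance_scores):
    """Eq. (16.13): H = sum_i beta_{m_i} * H_{m_i}, where beta is the
    softmax-normalized importance w_{m_i} of each meta-path."""
    betas = softmax(importance_scores)
    n_nodes = len(metapath_embeddings[0])
    dim = len(metapath_embeddings[0][0])
    return [[sum(b * H[r][c] for b, H in zip(betas, metapath_embeddings))
             for c in range(dim)] for r in range(n_nodes)]

# Two meta-paths (e.g., APA and APCPA), two nodes, 2-d embeddings (toy values).
H_apa = [[1.0, 0.0], [0.0, 1.0]]
H_apcpa = [[0.0, 2.0], [2.0, 0.0]]
H = semantic_fusion([H_apa, H_apcpa], importance_scores=[0.0, 0.0])
print(H)  # equal scores give beta = 0.5 each: [[0.5, 1.0], [1.0, 0.5]]
```

In HAN the scores come from Eq. (16.12) and are learned end-to-end, so task-relevant meta-paths receive larger $\beta$ and dominate the fused embedding.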
A representative encoder-decoder-based method is DHNE (Tu et al, 2018), whose
framework is shown in Fig. 16.6. Its second layer maps the autoencoder outputs of the
nodes in a hyperedge into a hyperedge embedding:

$$L = \sigma\left(\left[W_a h_a \,\|\, W_b h_b \,\|\, W_c h_c\right]\right), \qquad (16.14)$$

where $L$ denotes the embedding of the hyperedge; $h_a, h_b, h_c \in \mathbb{R}^{d \times 1}$ are the embeddings of nodes a, b, and c learned by the autoencoder; and $W_a, W_b, W_c \in \mathbb{R}^{d' \times d}$ are the transformation matrices for the different node types. Finally, the third layer is used to compute the indecomposability of the hyperedge:

$$S = \sigma(W \cdot L + b), \qquad (16.15)$$

where $S$ denotes the indecomposability of the hyperedge, and $W \in \mathbb{R}^{1 \times 3d'}$ and $b \in \mathbb{R}^{1 \times 1}$ are the weight matrix and bias, respectively. A high value of S means the nodes come from an existing hyperedge; otherwise, S should be small.

Fig. 16.6: The framework of DHNE (Tu et al, 2018). DHNE learns embeddings for nodes in heterogeneous hypernetworks, which can simultaneously address indecomposable hyperedges while preserving rich structural information.
Similarly, HNE (Chang et al, 2015) focuses on multi-modal heterogeneous graphs.
It uses a CNN and an autoencoder to learn embeddings from images and texts,
respectively. Then it uses the embeddings to predict whether there is an edge between
the images and texts. Camel (Zhang et al, 2018a) uses a GRU as an encoder to learn
paper embeddings from abstracts. A skip-gram objective function is used to preserve
the local structures of the graphs.
Fig. 16.7: Overview of HeGAN (Hu et al, 2018a). (a) A toy HG for bibliographic
data. (b) Comparison between HeGAN and previous works. (c) The framework of
HeGAN for adversarial learning on HGs.
A representative adversarial-based method is HeGAN (Hu et al, 2018a), shown in
Fig. 16.7. Its generator tries to produce fake samples associated with the given node
to feed into the discriminator, whereas
the discriminator tries to improve its parameterization to separate the fake samples
from the real ones actually connected to the given node. The better trained discrimi-
nator would then force the generator to produce better fake samples, and the process
is repeated. During such iterations, both the generator and discriminator receive mu-
tual, positive reinforcement. While this setup may appear similar to previous efforts
(Cai et al, 2018c; Dai et al, 2018c; Pan et al, 2018) on GAN-based network embed-
ding, HeGAN employs two major novelties to address the challenges of adversarial
learning on HINs.
First, existing studies only leverage GAN to distinguish whether a node is real
or fake w.r.t. structural connections to a given node, without accounting for the het-
erogeneity in HINs. For example, given a paper p2 , they treat nodes a2 , a4 as real,
whereas a1 , a3 are fake simply based on the topology of the HIN shown in Fig. 16.7
(a). However, a2 and a4 are connected to p2 for different reasons: a2 writes p2 and
a4 only views p2 . Thus, they miss out on valuable semantics carried by HGs, un-
able to differentiate a2 and a4 even though they play distinct semantic roles. Given
a paper p2 as well as a relation, say, write/written, HeGAN introduces a relation-
aware discriminator to tell apart a2 and a4 . Formally, relation-aware discriminator
C(ev | u, r; θ C ) evaluates the connectivity between the pair of nodes u and v w.r.t. a
relation r:
$$C(e_v \mid u, r; \theta^C) = \frac{1}{1 + \exp\left(-{e_u^C}^{\top} M_r^C e_v\right)}, \qquad (16.16)$$

where $e_v \in \mathbb{R}^{d \times 1}$ is the input embedding of the sample v, $e_u^C \in \mathbb{R}^{d \times 1}$ is the learnable embedding of node u, and $M_r^C \in \mathbb{R}^{d \times d}$ is a learnable relation matrix for relation r.
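The relation-aware discriminator of Eq. (16.16) reduces to a sigmoid of a bilinear form; a toy sketch (embeddings and relation matrices invented) shows how different relation matrices can score the same node pair differently:

```python
import math

def discriminator_score(e_u, M_r, e_v):
    """Eq. (16.16): sigmoid of the bilinear form e_u^T M_r e_v,
    measuring u-v connectivity under relation r."""
    # Compute e_u^T M_r (a row vector), then its dot product with e_v.
    u_M = [sum(e_u[i] * M_r[i][j] for i in range(len(e_u)))
           for j in range(len(M_r[0]))]
    score = sum(a * b for a, b in zip(u_M, e_v))
    return 1.0 / (1.0 + math.exp(-score))

e_u = [1.0, 0.0]                      # embedding of node u (toy)
e_v = [0.0, 1.0]                      # embedding of candidate sample v (toy)
M_write = [[0.0, 2.0], [0.0, 0.0]]    # relation matrix for "write" (toy)
M_view = [[0.0, -2.0], [0.0, 0.0]]    # relation matrix for "view" (toy)
print(discriminator_score(e_u, M_write, e_v))  # sigmoid(2), about 0.88
print(discriminator_score(e_u, M_view, e_v))   # sigmoid(-2), about 0.12
```

Because $M_r^C$ differs per relation, the same pair (u, v) can be judged real under "write" but fake under "view", which is exactly the semantic distinction between $a_2$ and $a_4$ discussed above.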
Second, existing studies are limited in sample generation in both effectiveness
and efficiency. They typically model the distribution of nodes using some form of
softmax over all nodes in the original graph. In terms of effectiveness, their fake
samples are constrained to the nodes in the graph, whereas the most representative
fake samples may fall “in between” the existing nodes in the embedding space. For
example, given a paper p2 , they can only choose fake samples from V , such as
a1 and a3 . However, both may not be adequately similar to real samples such as
a2. Towards better sample generation, HeGAN introduces a generalized generator that
can produce latent nodes such as a′ shown in Fig. 16.7 (c), where it is possible
that $a' \notin V$. In particular, the generalized generator leverages the following Gaussian
distribution:

$$\mathcal{N}\left({e_u^G}^{\top} M_r^G, \sigma^2 I\right), \qquad (16.17)$$

where $e_u^G \in \mathbb{R}^{d \times 1}$ and $M_r^G \in \mathbb{R}^{d \times d}$ denote the node embedding of $u \in V$ and the relation matrix of $r \in R$ for the generator.
Apart from HeGAN, MV-ACM (Zhao et al, 2020c) uses a GAN to generate
complementary views by computing the similarity of nodes in different views. Overall,
adversarial-based methods prefer to utilize negative samples to enhance the
robustness of embeddings, but the choice of negative samples has a huge influence
on the performance, leading to higher variance.
16.4 Review
Based on the above representative works of shallow and deep models, it can be
seen that shallow models mainly focus on the structure of HGs and rarely use
additional information such as attributes. One possible reason is that it is hard for
shallow models to depict the relationship between additional and structural
information. The learning ability of DNNs supports modeling this complex
relationship. For example, message passing-based techniques are good at encoding
structures and attributes simultaneously and at integrating different kinds of
semantic information. Compared with message passing-based techniques,
encoder-decoder-based techniques are weaker at fusing information due to the lack of
a messaging mechanism, but they are more flexible in introducing different objective
functions through different decoders. Adversarial-based methods prefer to utilize
negative samples to enhance the robustness of embeddings, but the choice of negative
samples has a huge influence on the performance, leading to higher variance (Hu et
al, 2019a).
However, shallow and deep models each have their own pros and cons. Shallow
models lack non-linear representation capability, but are efficient and easy to
parallelize. Specifically, the complexity of the random walk technique consists of two
parts, random walk and skip-gram, both of which are linear in the number of nodes.
The decomposition technique needs to divide HGs into sub-graphs according to the
types of edges, so its complexity is linear in the number of edges, which is higher
than that of random walk. Deep models have stronger representation capability, but
they are more prone to fitting noise and have higher time and space complexity.
Additionally, the cumbersome hyperparameter tuning of deep models is often
criticized. Still, with the popularity of deep learning, deep models, especially HGNNs,
have become the main research direction in HG embedding.
HGNNs have made great progress in recent years, which clearly shows that they are a
powerful and promising graph analysis paradigm. In this section, we discuss
additional issues/challenges and explore a series of possible future research directions.
The basic success of HGNNs builds on HG structure preservation. This also
motivates many HGNNs to exploit different HG structures, of which the most typical
is the meta-path (Dong et al, 2017; Shi et al, 2016). Following this line, the meta-graph
structure is naturally considered (Zhang et al, 2018b). However, HG is far more than
these structures. Selecting the most appropriate meta-path is still very challenging in
the real world. An improper meta-path will fundamentally hinder the performance
of HGNNs. Whether we can explore other techniques, e.g., motif (Zhao et al, 2019a;
Huang et al, 2016b) or network schema (Zhao et al, 2020b) to capture HG structure
is worth pursuing. Moreover, if we rethink the goal of traditional graph embedding,
i.e., replacing structure information with the distance/similarity in a metric space, a
research direction to explore is whether we can design HGNNs which can naturally
learn such distance/similarity rather than using pre-defined meta-path/meta-graph.
As mentioned before, many current HGNNs mainly take the structures into
account. However, some properties, which usually provide additional useful
information for modeling HGs, have not been fully considered. One typical property is
the dynamics of HGs, i.e., a real-world HG always evolves over time. Although
incremental learning on dynamic HGs has been proposed (Wang et al, 2020m), dynamic
heterogeneous graph embedding still faces big challenges. For example, the method of
Bian et al (2019) is only a shallow model, which greatly limits its embedding ability.
How to learn dynamic heterogeneous graph embedding in the HGNN framework is worth
pursuing. The other property is the uncertainty of HGs, i.e., the generation of a HG
is usually multi-faceted and a node in a HG contains different semantics.
Traditionally, learning a vector embedding cannot well capture such uncertainty.
Gaussian distributions may innately represent the uncertainty property (Kipf and
Welling, 2016; Zhu et al, 2018), which is largely ignored by current HGNNs. This
suggests a huge potential direction for improving HGNNs.
We have witnessed the great success and large impact of GNNs, where most of the
existing GNNs are proposed for homogeneous graph (Kipf and Welling, 2017b;
Veličković et al, 2018). Recently, HGNNs have attracted considerable attention
(Wang et al, 2019m; Zhang et al, 2019b; Fu et al, 2020; Cen et al, 2019).
One natural question arises: what is the essential difference between GNNs and
HGNNs? More theoretical analysis of HGNNs is seriously lacking. For example, it is
well accepted that GNNs suffer from the over-smoothing problem (Li et al, 2018b);
will HGNNs also have such a problem? If the answer is yes, what factor causes the
over-smoothing problem in HGNNs, since they usually contain multiple aggregation
strategies (Wang et al, 2019m; Zhang et al, 2019b)?
In addition to theoretical analysis, new technique design is also important. One
of the most important directions is self-supervised learning, which uses pretext
tasks to train neural networks, thus reducing the dependence on manual labels (Liu
et al, 2020f). Considering that labels are often insufficient in practice,
self-supervised learning can greatly benefit unsupervised and semi-supervised
learning, and has shown remarkable performance on homogeneous graph embedding
(Veličković et al, 2018; Sun et al, 2020c). Therefore, exploring self-supervised
learning on HGNNs is expected to further facilitate the development of this area.
Another important direction is the pre-training of HGNNs (Hu et al, 2020d; Qiu
et al, 2020a). Nowadays, HGNNs are designed independently, i.e., the proposed
method usually works well for certain tasks, but its transfer ability across different
tasks is not considered. When dealing with a new HG or task, we have to train
HGNNs from scratch, which is time-consuming and requires a large number of labels.
In this situation, if there were a well pre-trained HGNN with strong generalization
ability that could be fine-tuned with few labels, the time and label consumption
could be reduced.
16.5.3 Reliability
Beyond the properties and techniques of HGs, we are also concerned with ethical
issues in HGNNs, such as fairness, robustness, and interpretability. Considering that
most methods are black boxes, making HGNNs reliable is an important future direction.
Fairness. The embeddings learned by methods are sometimes highly related to
certain attributes, e.g., age or gender, which may amplify societal stereotypes in the
prediction results (Du et al, 2020). Therefore, learning fair or de-biased embeddings
is an important research direction. There is some research on the fairness of
homogeneous graph embedding (Bose and Hamilton, 2019; Rahman et al, 2019).
However, the fairness of HGNNs remains an open problem and an important
direction for future research.
Robustness. The robustness of HGNNs, especially against adversarial attacks,
is also an important problem (Madry et al, 2017). Since many real-world
applications are built based on HGs, the robustness of HGNNs becomes an urgent
yet unsolved problem. What is the weakness of HGNNs and how to enhance it to
improve the robustness need to be further studied.
Interpretability. Moreover, in some risk-aware scenarios, e.g., fraud detection
(Hu et al, 2019b) and bio-medicine (Cao et al, 2020), the explanation of models
or embeddings is important.
16 Heterogeneous Graph Neural Networks 369
A significant advantage of HGs is that they contain
rich semantics, which may provide valuable insight to promote the explanation of
HGNNs. Besides, the emerging disentangled learning (Siddharth et al, 2017; Ma
et al, 2019c), which divides the embedding into different latent spaces to improve
the interpretability, can also be considered.
16.5.4 Applications
Many HG-based applications have stepped into the era of graph embedding. HGNNs
have demonstrated strong performance in e-commerce and cybersecurity. Exploring
the capacity of HGNNs in other areas holds great potential for the future.
For example, in the software engineering area, there are complex relations
among test sample, requisition form, and problem form, which can be naturally
modeled as HGs. Therefore, HGNNs are expected to open up broad prospects for
these new areas and become a promising analytical tool. Another area is the
biological system, which can also be naturally modeled as an HG. A typical biological
system contains many types of objects, e.g., Gene Expression, Chemical, Pheno-
type, and Microbe. There are also multiple relations between Gene Expression and
Phenotype (Tsuyuzaki and Nikaido, 2017). The HG structure has been applied to
biological systems as an analytical tool, implying that HGNNs are expected to provide
more promising results.
In addition, since the complexity of HGNNs is relatively high and the techniques
are difficult to parallelize, it is difficult to apply the existing HGNNs to
large-scale industrial scenarios. For example, the number of nodes in E-commerce
recommendation may reach one billion (Zhao et al, 2019b). Therefore, successful
technique deployment in various applications while resolving the scalability and
efficiency challenges will be very promising.
Abstract Graph neural networks (GNNs) are efficient deep learning tools to analyze
networked data. As GNNs are widely applied in graph analysis tasks, their rapid evolution
has led to a growing number of novel architectures. In practice, both neural
architecture construction and training hyperparameter tuning are crucial to the node
representation learning and the final model performance. However, as the graph data
characteristics vary significantly across real-world systems, given a specific scenario,
rich human expertise and tremendous laborious trials are required to identify a suit-
able GNN architecture and training hyperparameters. Recently, automated machine
learning (AutoML) has shown its potential in finding the optimal solutions automatically
for machine learning applications. While relieving the burden of the manual
tuning process, AutoML can guarantee access to the optimal solution without extensive
expert experience. Motivated by the previous successes of AutoML, there
have been some preliminary automated GNN (AutoGNN) frameworks developed
to tackle the problems of GNN neural architecture search (GNN-NAS) and train-
ing hyperparameter tuning. This chapter presents a comprehensive and up-to-date
review of AutoGNN in terms of two perspectives, namely search space and search
algorithm. Specifically, we mainly focus on the GNN-NAS problem and present the
Kaixiong Zhou
Department of Computer Science and Engineering, Texas A&M University, e-mail: zkxiong@tamu.edu
Zirui Liu
Department of Computer Science and Engineering, Texas A&M University, e-mail: [email protected]
Keyu Duan
Department of Computer Science and Engineering, Texas A&M University, e-mail: k.duan@tamu.edu
Xia Hu
Department of Computer Science and Engineering, Texas A&M University, e-mail: hu@cse.tamu.edu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 371
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_17
17.1 Background
Graph neural networks (GNNs) have made substantial progress in integrating deep
learning approaches to analyze graph-structured data collected from various do-
mains, such as social networks (Ying et al, 2018b; Huang et al, 2019d; Monti et al,
2017; He et al, 2020), academic networks (Yang et al, 2016b; Kipf and Welling,
2017b; Gao et al, 2018a), and biochemical modular graphs (Zitnik and Leskovec,
2017; Aynaz Taheri, 2018; Gilmer et al, 2017; Jiang and Balaprakash, 2020). Following
the common message-passing strategy, GNNs apply spatial graph convolutional
layers to learn a node's embedding representation by aggregating the representations
of its neighbors and combining them with the node itself. A GNN architecture
is then constructed by stacking multiple such layers and adding inter-layer skip
connections, where the elementary operations of a layer (e.g., the aggregation
and combination functions) and the concrete inter-layer connections are specified
in each design. To adapt to different real-world applications, a variety of
GNN architectures have been explored, including GCN (Kipf and Welling, 2017b),
GraphSAGE (Hamilton et al, 2017b), GAT (Veličković et al, 2018), SGC (Wu et al,
2019a), JKNet (Xu et al, 2018a), and GCNII (Chen et al, 2020l). They vary in how
to aggregate the neighborhood information (e.g., mean aggregation in GCN versus
neighbor attention learning in GAT) and the choices of skip connections (e.g., none
connection in GCN versus initial connection in GCNII).
Despite the significant success of GNNs, their empirical implementations are
usually accompanied by careful architecture engineering and training hyperparameter
tuning, aiming to adapt to the different types of graph-structured data.
Based on the researcher’s prior knowledge and trial-and-error tuning processes, a
GNN architecture is instantiated from its model space specifically and evaluated in
each graph analysis task. For example, considering the underlying model
GraphSAGE (Hamilton et al, 2017b), architectures of different sizes, determined by
the number of hidden units, are applied respectively to citation networks and
protein-protein interaction graphs. Furthermore, the optimal skip connection mechanisms
in JKNet architectures (Xu et al, 2018a) vary with the real-world tasks. Besides
architecture engineering, the training hyperparameters play important roles in the
final model performance, including the learning rate, weight decay, and number of
epochs. In the open repositories, these hyperparameters are manually tuned to
obtain the desired model performance. The tedious selection of GNN architectures
and training hyperparameters not only burdens data scientists, but also makes it
difficult for beginners to quickly access high-performance solutions for their tasks
at hand.
Automated machine learning (AutoML) has emerged as a prevailing research direction
to liberate the community from the time-consuming manual tuning processes
(Chen et al, 2021).
17 Graph Neural Networks: AutoML 373
Given any task and based on the predefined search space, AutoML
aims at automatically optimizing the machine learning solutions (or denoted with
the term designs), including neural architecture search (NAS) and automated
hyperparameter tuning (AutoHPT). While NAS targets the optimization of
architecture-related parameters (e.g., the layer number and hidden units), AutoHPT
addresses the selection of training-related parameters (e.g., the learning rate and
weight decay). Both are sub-fields of AutoML. It has been widely reported that the
novel neural architectures discovered by NAS outperform the human-designed ones in many
machine learning applications, including image classification (Zoph and Le, 2016;
Zoph et al, 2018; Liu et al, 2017b; Pham et al, 2018; Jin et al, 2019a; Luo et al, 2018;
Liu et al, 2018b,c; Xie et al, 2019a; Kandasamy et al, 2018), semantic image seg-
mentation (Chenxi Liu, 2019), and image generation (Wang and Huan, 2019; Gong
et al, 2019). Dating back to the 1990s (Kohavi and John, 1995), it has been commonly
acknowledged that AutoHPT could improve over the default training setting (Feurer
and Hutter, 2019; Chen et al, 2021). Motivated by the previous successful applica-
tions of AutoML, there have been some recent efforts on conjoining the researches
of AutoML and GNNs (Gao et al, 2020b; Zhou et al, 2019a; You et al, 2020a;
Ding et al, 2020a; Zhao et al, 2020a,g; Nunes and Pappa, 2020; Li and King, 2020;
Shi et al, 2020; Jiang and Balaprakash, 2020). They generally define the automated
GNN (AutoGNN) as an optimization problem and formulate their own working
pipelines from three perspectives, as shown in Figure 17.1: the search space, the search
algorithm, and the performance estimation strategy. The search space consists of a large
volume of candidate designs, including GNN architectures and the training hyper-
parameters. On top of the search space, several heuristic search algorithms are pro-
posed to solve the NP-complete optimization problem by iteratively approximating
the well-performing designs, including random search (You et al, 2020a). The ob-
jective of performance estimation is to accurately estimate the task performance of
every candidate design explored at each step. Once the search progress terminates,
the best neural architecture accompanied with the suitable training hyperparameters
is returned to be evaluated on the downstream machine learning task.
In this chapter, we will organize the existing efforts and illustrate AutoGNN
framework with the following sections: notations, problem definition, and chal-
lenges of AutoGNN (in Sections 17.1.1, 17.1.2, and 17.1.3), search space (in Sec-
tion 17.2), and search algorithm (in Section 17.3). We then present the open problems
for future research in Section 17.4. In particular, since the community's interest
mainly focuses on discovering powerful GNN architectures, we pay more attention
to GNN-NAS in this chapter.
Following the previous expressions (You et al, 2020a), we use the term “design”
to refer to an available solution of the optimization problem in AutoGNN. A de-
sign consists of a concrete GNN architecture and a specific set of training hyperparameters.
Fig. 17.1: Illustration of a general framework for AutoGNN. The search space
consists of a large number of designs, including GNN architectures and training
hyperparameters. At each step, the search algorithm samples a candidate design from the search
space and estimates its model performance on the downstream task. Once the search
progress terminates, the best design accompanied with the highest performance on
the validation set is returned and exploited for the real-world system.
Formally, AutoGNN can be defined as the following bi-level optimization problem:

$$f^* = \operatorname*{argmax}_{f \in \mathcal{F}} M\big(f(\theta^*); D_{valid}\big) \quad \text{s.t.} \quad \theta^* = \operatorname*{argmin}_{\theta} L\big(f(\theta); D_{train}\big), \tag{17.1}$$
where $\theta^*$ denotes the optimized trainable weights of design $f$ and $L$ denotes the loss
function. For each design, AutoGNN will first optimize its associated weights $\theta$ by
minimizing the loss on the training set through gradient descent, and then evaluate
it on the validation set to decide whether this design is the optimal one. By solving
the above optimization problem, AutoGNN automates the architecture engineering
and training hyperparameter tuning procedure, and pushes GNN designs to examine
a broad scope of candidate solutions. However, it is well known that such a
bi-level optimization problem is NP-complete (Chen et al, 2021), so it would be
extremely time-consuming to search for and evaluate well-performing designs
on large graphs with massive numbers of nodes and edges. Fortunately, there have been
some heuristic search techniques proposed to locate the local optimal design (e.g.,
CNN or RNN architecture) as close as possible to the global one in the applications
of image classification and natural language processing, including reinforcement
learning (RL) (Zoph and Le, 2016; Zoph et al, 2018; Pham et al, 2018; Cai et al,
2018a; Baker et al, 2016), evolutionary methods (Liu et al, 2017b; Real et al, 2017;
Miikkulainen et al, 2019; Xie and Yuille, 2017; Real et al, 2019), and Bayesian op-
timization (Jin et al, 2019a). They iteratively explore the next design and update the
search algorithm based on the performance feedback of the new design, in order to
move toward the global optimal solution. Compared with the previous efforts, the
characteristics of AutoGNN problem could be viewed from two aspects: the search
space and search algorithms tailored to identify the optimal design of GNN. In the
following sections, we list the challenge details and the existing AutoGNN work.
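To make the two-stage procedure of Eq. 17.1 concrete, the loop below sketches it in plain Python. The toy design space, the stub `train`/`validate` functions, and their synthetic scores are invented for illustration and stand in for real GNN training and validation; they are not any framework's API.

```python
import random

# Hypothetical toy setup: a "design" is a dict of per-dimension choices;
# train() and validate() are stubs standing in for real GNN training.
SEARCH_SPACE = {
    "aggregation": ["sum", "mean", "max"],
    "hidden_units": [16, 64, 256],
    "learning_rate": [1e-2, 1e-3],
}

def sample_design(rng):
    return {dim: rng.choice(opts) for dim, opts in SEARCH_SPACE.items()}

def train(design):
    # Inner problem: theta* = argmin_theta L(f(theta); D_train), stubbed out.
    return {"design": design}

def validate(model):
    # Outer objective M(f(theta*); D_valid), here a fixed synthetic score.
    d = model["design"]
    return (d["aggregation"] == "mean") + 0.1 * (d["hidden_units"] == 64)

def search(n_steps=200, seed=0):
    rng = random.Random(seed)
    best_design, best_score = None, float("-inf")
    for _ in range(n_steps):
        design = sample_design(rng)      # explore a candidate design f
        score = validate(train(design))  # inner optimization + estimation
        if score > best_score:           # keep the incumbent best design
            best_design, best_score = design, score
    return best_design, best_score

best_design, best_score = search()
```

The outer loop only ever compares validation scores, so any search algorithm (random, evolutionary, RL-based) can be dropped in by replacing `sample_design`.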
Considering the existing AutoGNN frameworks (Gao et al, 2020b; Zhou et al,
2019a), the GNN model is commonly implemented based on the spatial graph convolution
mechanism. Specifically, spatial graph convolution takes the input graph
as a computation graph and learns node embeddings by passing messages along
edges. A node embedding is updated recursively by aggregating the embedding
representations of its neighbors and combining them with the node itself. Formally, the
k-th spatial graph convolutional layer of a GNN can be expressed as:
$$h_i^{(k)} = \mathrm{AGGREGATE}\big(\{a_{ij}^{(k)}\, W^{(k)} x_j^{(k-1)} : j \in \mathcal{N}(i)\}\big), \qquad x_i^{(k)} = \mathrm{ACT}\big(\mathrm{COMBINE}\big(W^{(k)} x_i^{(k-1)},\, h_i^{(k)}\big)\big). \tag{17.2}$$

Here $x_i^{(k)}$ denotes the embedding vector of node $v_i$ at the $k$-th layer, $\mathcal{N}(i)$ denotes the set of neighbors adjacent to node $v_i$, and $W^{(k)}$ denotes the trainable weight matrix used to project node embeddings. $a_{ij}^{(k)}$ denotes the message-passing weight along the edge connecting nodes $v_i$ and $v_j$, which is determined by the normalized graph adjacency matrix or learned from an attention mechanism. The function AGGREGATE, such as mean, max, or sum pooling, is used to aggregate neighbor representations. The function COMBINE is used to combine the neighbor embedding $h_i^{(k)}$ with the node embedding $x_i^{(k-1)}$ from the last layer. Finally, the function ACT (e.g., ReLU) is used to add non-linearity to the embedding learning.
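As a concrete illustration of this layer, the sketch below implements Eq. 17.2 in NumPy with one particular instantiation: mean aggregation via a row-normalized adjacency matrix, sum combination, and ReLU activation. The toy graph, features, and weights are made up for the example.

```python
import numpy as np

def gnn_layer(X, A, W):
    """One spatial graph convolutional layer in the form of Eq. 17.2.

    X: (n, d_in) node embeddings from layer k-1.
    A: (n, n) adjacency matrix without self-loops.
    W: (d_in, d_out) trainable projection matrix.
    """
    XW = X @ W                              # project node embeddings
    deg = A.sum(axis=1, keepdims=True)      # neighbor counts per node
    a = A / np.maximum(deg, 1)              # a_ij from row-normalized adjacency
    h = a @ XW                              # AGGREGATE: mean over neighbors
    combined = XW + h                       # COMBINE: sum with the node itself
    return np.maximum(combined, 0.0)        # ACT: ReLU non-linearity

# Toy graph: a triangle of 3 nodes with 2-dimensional features.
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
X = np.eye(3, 2)
W = np.ones((2, 2))
out = gnn_layer(X, A, W)
```

Swapping the mean for max/sum pooling, or the sum combination for an MLP, yields the other micro-architecture choices discussed below.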
As shown in Figure 17.2, a GNN architecture consists of several graph convolutional
layers defined in Eq. 17.2, and may incorporate skip connections between
any two layers, similar to residual CNNs (He et al, 2016a). Following
the previous definitions in NAS, we use the term “micro-architecture” to represent
a graph convolutional layer, including the specifications of hidden units and graph
convolutional functions; we use the term “macro-architecture” to represent network
topology, including the choices of layer depth, inter-layer skip connections, and
pre/post-processing layers. The architecture search space contains a large volume
of diverse GNN architectures, which could be categorized into the search spaces of
micro-architectures as well as macro-architectures.
According to Eq. 17.2 and as shown in Figure 17.2, the micro-architecture
of a graph convolutional layer is characterized by the following five architecture
dimensions:
• Hidden units: The trainable matrix $W^{(k)} \in \mathbb{R}^{d^{(k-1)} \times d^{(k)}}$ maps node embeddings to
a new space and learns to extract informative features. $d^{(k)}$ is the number
of hidden units and plays a key role in the task performance. In the GNN-NAS
frameworks of GraphNAS (Gao et al, 2020b) and AGNN (Zhou et al, 2019a),
$d^{(k)}$ is usually selected from the set {4, 8, 16, 32, 64, 128, 256}.
• Propagation function: It determines the message-passing weight $a_{ij}^{(k)}$ to specify
how node embeddings are propagated over the input graph structure. In
a wide variety of GNN models (Kipf and Welling, 2017b; Wu et al, 2019a;
Hamilton et al, 2017b; Ding et al, 2020a), $a_{ij}^{(k)}$ is defined by the corresponding
element of the normalized adjacency matrix $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ or $\tilde{D}^{-1}\tilde{A}$, where
$\tilde{A}$ is the graph adjacency matrix with self-loops and $\tilde{D}$ is its degree matrix.
Note that real-world graph-structured data can be both complex
and noisy (Lee et al, 2019c), which may make plain neighbor aggregation inefficient.
GAT (Veličković et al, 2018) applies an attention mechanism to compute $a_{ij}^{(k)}$ so as
to attend to relevant neighbors. Based on the existing GNN-NAS frameworks (Gao
et al, 2020b; Zhou et al, 2019a; Ding et al, 2020a), we list the common choices
of propagation functions in Table 17.1.
• Aggregation function: Depending on the input graph structure, a proper
choice of aggregation function is important for learning an informative neighbor
distribution (Xu et al, 2019d). For example, a mean pooling function takes the
average of the neighbors, while a max pooling only preserves the most significant one.
The aggregation function is usually selected from the set {SUM, MEAN, MAX}.
• Combination function: It is used to combine the neighbor embedding $h_i^{(k)}$ and
the projected embedding $W^{(k)} x_i^{(k-1)}$ of the node itself. Examples of combination
functions include sum and the multi-layer perceptron (MLP). While the sum
operation simply adds the two embeddings, an MLP further applies a linear mapping
on top of the summation or concatenation of these two embeddings.
• Activation function: The candidate activation function is usually selected from
{Sigmoid, Tanh, ReLU, Linear, Softplus, LeakyReLU, ReLU6, ELU}.
Given the above five architecture dimensions and their associated candidate op-
tions, the micro-architecture search space is constructed by their Cartesian product.
Each discrete point in the micro-architecture search space corresponds to a concrete
micro-architecture, e.g., a graph convolutional layer with {Hidden units: 64, Propagation
function: GAT, Aggregation function: SUM, Combination function: MLP, Activation
function: ReLU}. By providing extensive candidate options along each
dimension, the micro-architecture search space covers most of the layer implementations
in state-of-the-art models, such as Chebyshev (Defferrard et al, 2016),
GCN (Kipf and Welling, 2017b), GAT (Veličković et al, 2018), and LGCN (Gao
et al, 2018a).
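The Cartesian-product construction of the micro-architecture space can be sketched directly. The option lists below follow the five dimensions named above; the propagation-function names are illustrative labels for this sketch, not identifiers from any framework.

```python
from itertools import product

# Candidate options per architecture dimension (following the text above).
DIMENSIONS = {
    "hidden_units": [4, 8, 16, 32, 64, 128, 256],
    "propagation": ["gcn_norm", "random_walk", "gat_attention"],
    "aggregation": ["SUM", "MEAN", "MAX"],
    "combination": ["sum", "mlp"],
    "activation": ["Sigmoid", "Tanh", "ReLU", "Linear",
                   "Softplus", "LeakyReLU", "ReLU6", "ELU"],
}

def micro_architectures():
    """Enumerate every point in the Cartesian product of all dimensions."""
    names = list(DIMENSIONS)
    for combo in product(*DIMENSIONS.values()):
        yield dict(zip(names, combo))

space = list(micro_architectures())
# 7 * 3 * 3 * 2 * 8 = 1008 distinct single-layer micro-architectures
```

Even for this single layer with a handful of options per dimension, the product already contains over a thousand designs, which is why exhaustive enumeration quickly becomes infeasible for full architectures.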
Table 17.1: Propagation function candidates to compute the weight $a_{ij}^{(k)}$ when nodes $v_i$ and
$v_j$ are connected; otherwise $a_{ij}^{(k)} = 0$. The symbol $\|$ denotes the concatenation operation;
$a$, $a_l$ and $a_r$ denote trainable vectors, and $W_G^{(k)}$ is a trainable matrix.
Many different search strategies can be used to explore the search space in Au-
toGNN, including random search, evolutionary methods, RL, and differentiable
search methods. In this section, we will introduce the basic concepts of these search
algorithms and how to utilize them to explore candidate designs.
Given a search space, random search samples the various designs with
equal probability. Random search is the most basic approach, yet it is quite effective
in practice. In addition to serving as a baseline in AutoGNN works (Zhou et al,
2019a; Gao et al, 2020b), random search is the standard benchmark for comparing
the effectiveness of different candidate options along a dimension in the search
space (You et al, 2020a). Specifically, suppose the dimension to be evaluated is batch
normalization, whose candidate examples are given by {False, BatchNorm}. To
comprehensively compare the effectiveness of these two options, a series of diverse
designs are randomly sampled from the search space, where the batch normalization
is reset to False and BatchNorm in each design, respectively. Each pair of designs
(with Normalization=False and Normalization=BatchNorm) is compared in
terms of model performance on a downstream graph analysis task. It is found
that the designs with Normalization=BatchNorm generally rank higher than the oth-
ers, which indicates the benefit of including BatchNorm in the model design.
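The controlled pairwise comparison described above can be sketched as follows. The design space, the stub evaluation function, and its synthetic BatchNorm gain are assumptions made purely to illustrate the protocol; in practice each variant would be trained and scored on a validation set.

```python
import random

def sample_design(rng):
    # Hypothetical design space: only the dimensions other than normalization.
    return {
        "aggregation": rng.choice(["SUM", "MEAN", "MAX"]),
        "hidden_units": rng.choice([16, 64, 256]),
    }

def evaluate(design, normalization):
    # Stub for validation accuracy; BatchNorm adds a small synthetic gain here.
    base = 0.70 + 0.05 * (design["aggregation"] == "MEAN")
    return base + (0.03 if normalization == "BatchNorm" else 0.0)

def compare_batchnorm(n_pairs=20, seed=0):
    """Sample diverse designs; for each one, toggle only the normalization
    option and count how often the BatchNorm variant wins the pair."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_pairs):
        design = sample_design(rng)
        if evaluate(design, "BatchNorm") > evaluate(design, "False"):
            wins += 1
    return wins / n_pairs

win_rate = compare_batchnorm()
```

Because every other dimension is held fixed within a pair, a consistently high win rate isolates the contribution of the single option under study.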
Evolutionary methods evolve a population of designs, i.e., the set of different GNN
architectures and training hyperparameters. In every evolution step, at least one de-
sign from the population is sampled and serves as a parent to generate a new child
design by applying mutations to it. In the context of AutoGNN, the design muta-
tions are local operations, such as changing the aggregation function from MAX to
SUM, altering the hidden units, and altering a specific training hyperparameter. Af-
ter training the child design, its performance is evaluated on the validation set. The
superior design will be added to the population. Specifically, Shi et al (2020)
propose to select two parent designs and cross them over along some dimensions.
To generate diverse child designs, they further mutate the resulting
crossover designs.
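A minimal sketch of such an evolutionary loop is given below, with an invented design space and a stub fitness function standing in for real training and validation; the tournament-selection and weakest-replacement details are illustrative choices, not those of any particular paper.

```python
import random

# Invented design space; real use would train each child design and
# measure its validation performance instead of calling a stub.
OPTIONS = {
    "aggregation": ["SUM", "MEAN", "MAX"],
    "hidden_units": [16, 64, 256],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

def fitness(design):
    # Stub: counts how many dimensions hit a designated "good" option.
    return ((design["aggregation"] == "SUM")
            + (design["hidden_units"] == 256)
            + (design["learning_rate"] == 1e-3))

def mutate(design, rng):
    """Local mutation: re-sample a single randomly chosen dimension."""
    child = dict(design)
    dim = rng.choice(list(OPTIONS))
    child[dim] = rng.choice(OPTIONS[dim])
    return child

def evolve(steps=150, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [{d: rng.choice(o) for d, o in OPTIONS.items()}
                  for _ in range(population_size)]
    for _ in range(steps):
        parent = max(rng.sample(population, 3), key=fitness)  # tournament
        child = mutate(parent, rng)
        # Survivor selection: the child replaces the weakest member
        # whenever it performs at least as well.
        weakest = min(range(population_size),
                      key=lambda i: fitness(population[i]))
        if fitness(child) >= fitness(population[weakest]):
            population[weakest] = child
    return max(population, key=fitness)

best = evolve()
```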
RL (Silver et al, 2014; Sutton and Barto, 2018) is a learning paradigm concerned
with how agents ought to take actions in an environment to maximize the reward.
In the context of AutoGNN, the agent is the so-called “controller”, which tries to
generate promising designs. The generation of a design can be regarded as the
controller's action. The controller's reward is often defined as the model performance
of generated design on the validation set, such as validation accuracy for the node
classification task. The controller is trained in a loop as shown in Figure 17.3: the
controller first samples a candidate design and trains it to convergence to measure
its performance on the task of interest. Note that the controller is usually realized by
an RNN, which generates the design of the GNN architecture and training hyperparameters
as a string of variable length. The controller then uses the performance as
a guiding signal to update itself toward finding more promising designs as the
search progresses.
Fig. 17.3: An illustration of the reinforcement learning based search algorithm. The
controller (upper block) generates a GNN architecture (lower block) and tests it on the
validation dataset. By treating the architecture as a string with variable length, the
controller usually applies RNN to sequentially sample options in the different di-
mensions (e.g., combination, aggregation, and propagation functions) to formulate
the final GNN architecture. The validation performance is then used as feedback to
train the controller. Note that the architecture dimensions here are used for
illustration purposes only. Please refer to Section 17.2 for a complete introduction of the
search space.
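A minimal sketch of the idea follows. It makes two simplifying assumptions: the RNN controller is replaced by independent per-dimension logits, and the validation-accuracy reward is replaced by a stub. The update rule itself is the standard REINFORCE policy gradient with a moving-average baseline.

```python
import math
import random

# Toy stand-in for the RNN controller: independent per-dimension logits.
OPTIONS = {
    "aggregation": ["SUM", "MEAN", "MAX"],
    "combination": ["sum", "mlp"],
}

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def reward(design):
    # Stub for the validation accuracy of the generated design.
    return (0.6
            + 0.2 * (design["aggregation"] == "MEAN")
            + 0.1 * (design["combination"] == "mlp"))

def train_controller(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = {dim: [0.0] * len(opts) for dim, opts in OPTIONS.items()}
    baseline = None
    for _ in range(steps):
        # Controller action: sample one option index per dimension.
        idx = {}
        for dim, opts in OPTIONS.items():
            p = softmax(logits[dim])
            idx[dim] = rng.choices(range(len(opts)), weights=p)[0]
        design = {dim: OPTIONS[dim][i] for dim, i in idx.items()}
        r = reward(design)
        # Moving-average baseline reduces the variance of the update.
        baseline = r if baseline is None else 0.9 * baseline + 0.1 * r
        advantage = r - baseline
        # REINFORCE: the gradient of log pi w.r.t. a logit is 1[chosen] - p_i.
        for dim in OPTIONS:
            p = softmax(logits[dim])
            for i in range(len(p)):
                indicator = 1.0 if i == idx[dim] else 0.0
                logits[dim][i] += lr * advantage * (indicator - p[i])
    # Return the greedy design under the trained controller.
    return {dim: OPTIONS[dim][max(range(len(l)), key=l.__getitem__)]
            for dim, l in logits.items()}

best = train_controller()
```

Options whose designs earn above-baseline rewards have their logits pushed up, so the controller's sampling distribution gradually concentrates on the better-performing choices.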
There are several candidate options along each architecture dimension. For exam-
ple, for the aggregation function at a particular layer, we have the option of apply-
ing either a SUM, a MEAN, or a MAX pooling. The common search approaches in
GNN-NAS, such as random search, evolutionary algorithms, and RL-based search
methods, treat selecting the best option as a black-box optimization problem over
a discrete domain. At each search step, they sample and evaluate a single architecture
from the discrete architecture search space. However, such a search process
towards well-performing GNNs will be very time-consuming since the number of
possible models is extremely large. Differentiable search algorithms relax the dis-
crete search space to be continuous, which can be optimized efficiently by gradient
descent. Specifically, for each architecture dimension, the differentiable search al-
gorithms usually relax the hard choice from the candidate set into a continuous dis-
tribution, where each option is assigned with a probability. One example for illus-
trating the differentiable search along the aggregation function dimension is shown
in Figure 17.4. At the k-th layer, the node embedding output of aggregation function
can be decomposed and expressed as:
$$h_i^{(k)} = \begin{cases} \sum_m \alpha_m\, o_m\big(\{x_j^{(k-1)} : j \in \mathcal{N}(i) \cup \{i\}\}\big), & \text{(weighted combination)} \\[4pt] \alpha_m\, o_m\big(\{x_j^{(k-1)} : j \in \mathcal{N}(i) \cup \{i\}\}\big),\ m \sim p(\alpha_m), & \text{(option sampling)} \end{cases} \quad \text{s.t.}\ \sum_m \alpha_m = 1. \tag{17.4}$$

Here $o_m$ represents the $m$-th aggregation function option, and $\alpha_m$ is the sampling probability
associated with the corresponding option. The probability distribution along
a dimension is regularized to sum to one. The architecture distribution is
then formulated as the joint probability distribution over all the dimensions. At each
search step, as shown in Eq. 17.4 (with the example of the aggregation
function dimension), the real operation of a dimension in a new architecture can
be generated in two different ways: weighted option combination and option
sampling. For the case of weighted option combination, the real operation is represented
by the weighted average of all candidate options. For the other case, the real opera-
tion is instead sampled from the probability distribution p(αm ) of the corresponding
architecture dimension. In both cases, the adopted options are scaled by their sam-
pling probabilities to support the architecture distribution optimization by gradient
descent. The architecture distribution is then updated directly by backpropagating
the training loss at each training step. During the testing, the discrete architecture
can be obtained by retaining the strongest candidate with the highest probability αm
along each dimension. In contrast to black-box optimization, gradient-based opti-
mization is significantly more data efficient, and hence greatly speeds up the search
process.
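The weighted-combination mode of Eq. 17.4 can be sketched as follows; the candidate set and the toy neighbor embeddings are assumptions made for the example, and in a real framework the logits would be updated by backpropagating the training loss.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Candidate aggregation options o_m, applied over {x_j : j in N(i) ∪ {i}}.
OPS = {
    "SUM": lambda nbrs: nbrs.sum(axis=0),
    "MEAN": lambda nbrs: nbrs.mean(axis=0),
    "MAX": lambda nbrs: nbrs.max(axis=0),
}

def mixed_aggregation(neighbor_embeddings, alpha_logits):
    """Weighted option combination: h = sum_m alpha_m * o_m(...),
    with alpha = softmax(logits) so that the weights sum to one."""
    alpha = softmax(alpha_logits)
    outputs = [op(neighbor_embeddings) for op in OPS.values()]
    return sum(a * o for a, o in zip(alpha, outputs)), alpha

def discretize(alpha_logits):
    """After the search: retain the option with the highest probability."""
    return list(OPS)[int(np.argmax(alpha_logits))]

# Toy input: embeddings of the nodes in N(i) ∪ {i}.
nbrs = np.array([[1.0, 2.0], [3.0, 4.0]])
h, alpha = mixed_aggregation(nbrs, np.array([0.0, 0.0, 0.0]))
```

Because `h` is a differentiable function of the logits, gradients of the training loss flow into the architecture distribution, which is exactly what makes this relaxation amenable to gradient descent.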
Fig. 17.4: An example illustrating the differentiable search for the aggregation
function. At a search step, the aggregation function is given by the weighted combi-
nation of the three candidates, or instead realized by one sampled option (e.g., MAX
scaled with probability α2 ). Once the search progress terminates, the option with the
highest probability (e.g., MAX with solid arrow) is used in the final architecture to
be evaluated on the testing set.
To solve the bi-level optimization problem of AutoGNN, all the above search al-
gorithms share a common two-stage working pipeline: sampling a new design and
adjusting the search algorithm based on the performance estimation of the new de-
sign at each step. Once the search progress terminates, the optimal design with the
highest model performance will be treated as the desired solution to the concerned
optimization problem. Therefore, an accurate performance estimation strategy is
crucial to an AutoGNN framework. The simplest way of performance estimation is to
perform a standard training for each generated design, and then obtain the model
performance on the split validation set. However, such an intuitive strategy is com-
putationally expensive given the long search progress and massive graph datasets.
Parameter sharing is one of the efficient strategies to reduce the cost of perfor-
mance estimation, which avoids training from scratch for each design. Parameter
sharing is first proposed in ENAS (Pham et al, 2018) to force all designs to share
weights to improve efficiency. A new design could be immediately estimated by
reusing the weights well trained before. However, such a strategy cannot be di-
rectly adopted in GNN-NAS since the GNN architectures in search space may have
weights with different dimensions or shapes. To tackle the challenge, recent work
modified the parameter sharing strategy to customize it for GNNs. GraphNAS (Gao
et al, 2020b) categorizes and stores the optimized weights based on their shapes,
and applies the one with the same shape to the new design. After parameter shar-
ing, AGNN (Zhou et al, 2019a) further uses a few training epochs to fully adapt the
transferred weights to the new design. In the differentiable GNN-NAS frameworks,
the parameter sharing is conducted naturally between GNN architectures sharing
the common computation options (Zhao et al, 2020g; Ding et al, 2020a).
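A minimal sketch of shape-keyed weight sharing in the spirit of the GraphNAS strategy described above is given below; the class and its methods are invented for illustration and do not reproduce any framework's API.

```python
import numpy as np

class SharedWeights:
    """Shape-keyed weight store: a new design reuses previously trained
    weights whenever it requests a matrix of an already-seen shape."""

    def __init__(self, seed=0):
        self.store = {}
        self.rng = np.random.default_rng(seed)

    def get(self, shape):
        if shape not in self.store:                       # first request: fresh init
            self.store[shape] = self.rng.standard_normal(shape) * 0.01
        return self.store[shape]                          # later requests: reuse

    def update(self, shape, new_weights):
        self.store[shape] = new_weights                   # write back trained weights

pool = SharedWeights()
w1 = pool.get((16, 32))        # design A trains a 16x32 projection ...
pool.update((16, 32), w1 + 1.0)  # ... and stores its (mock) trained weights
w2 = pool.get((16, 32))        # design B with the same shape reuses them
w3 = pool.get((16, 64))        # a different shape gets its own weights
```

In this scheme a new design starts from the stored weights rather than from scratch, which is also the point at which a few adaptation epochs (as in AGNN) can be added before estimating its performance.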
We have reviewed various search spaces and search algorithms. Although some initial
AutoGNN efforts have been made, compared with the rapid development of AutoML
in computer vision, AutoGNN is still at a preliminary research stage. In this
section, we discuss several future directions, especially for research on GNN-NAS.
• Search space. The design of the architecture search space is the most important
part of a GNN-NAS framework. An appropriate search space should be comprehensive,
covering the key architecture dimensions and their state-of-the-art
primitive options to guarantee the performance of the searched architecture for
any given task. Besides, the search space should be compact, incorporating
a moderate number of powerful options to make the search progress efficient.
However, most of the existing architecture search spaces are constructed based
on vanilla GCN and GAT, failing to consider the recent GNN developments. For
example, graph pooling (Ying et al, 2018c; Gao and Ji, 2019; Lee et al, 2019b;
Zhou et al, 2020e) has attracted increasing research interest to enable encoding
the graph structures hierarchically. Based on the wide variety of pooling algo-
rithms, the corresponding hierarchical GNN architectures gradually shrink the
graph size and enhance the neighborhood receptive field, empirically improving
the downstream graph analysis tasks. Furthermore, a series of novel graph con-
volution mechanisms have been proposed from different perspectives, such as
neighbor-sampling methods to accelerate computation (Hamilton et al, 2017b;
Chen et al, 2018c; Zeng et al, 2020a), and PageRank based graph convolutions
to extend neighborhood size (Klicpera et al, 2019a; Bojchevski et al, 2020b).
With the development in GNN community, it is crucial to update the search
space to subsume the state-of-the-art models.
• Deep graph neural networks. All the existing search spaces are implemented
with shallow GNN architectures, i.e., the number of graph convolutional lay-
ers lgc ≤ 10. Unlike the widely adopted deep neural networks (e.g., CNNs and
transformers) in computer vision and natural language processing, GNN architectures
are usually limited to fewer than 3 layers (Kipf and Welling, 2017b;
Veličković et al, 2018). As the layer number increases, the node representations
will converge to indistinguishable vectors due to the recursive neighborhood
aggregation and non-linear activation (Li et al, 2018b; Oono and Suzuki, 2020).
Such a phenomenon is recognized as the over-smoothing issue (NT and Maehara,
2019), which prevents the construction of deep GNNs from modeling the de-
pendencies on high-order neighbors. Recently, many techniques have been proposed
to relieve the over-smoothing issue and construct deep GNNs, including em-
bedding normalization (Zhao and Akoglu, 2019; Zhou et al, 2020d; Ioffe and
Szegedy, 2015), residual connection (Li et al, 2019c, 2018b; Chen et al, 2020l;
Klicpera et al, 2019a), and random data augmentation (Rong et al, 2020b; Feng
et al, 2020). However, most of them only achieve comparable or even worse
performance compared to their corresponding shallow models. By incorporat-
ing these new techniques into the search space, GNN-NAS could effectively
combine them and identify novel deep GNN models, unleashing the power of
deep learning for graph analytics.
• Applications to emerging graph analysis tasks. One limitation of GNN-NAS
frameworks in the literature is that they are usually evaluated on a few benchmark
datasets, such as Cora, Citeseer, and Pubmed for node classification (Yang
et al, 2016b). However, the graph-structured data is ubiquitous, and the novel
graph analysis tasks are always emerging in real-world applications, such as
property prediction of biochemical molecules (i.e., graph classification) (Zitnik
and Leskovec, 2017; Aynaz Taheri, 2018; Gilmer et al, 2017; Jiang and Bal-
aprakash, 2020), item/friend recommendation in social networks (i.e., link pre-
diction) (Ying et al, 2018b; Monti et al, 2017; He et al, 2020), and circuit design
(i.e., graph generation) (Wang et al, 2020b; Li et al, 2020h; Zhang et al, 2019d).
The surge of novel tasks poses significant challenges for the future search of
well-performing architectures in GNN-NAS, due to the diverse data character-
istics and objectives of the tasks and the expensive search cost. On the one hand,
since the new tasks may not resemble any of the existing benchmarks, the
search space has to be re-constructed by considering their specific data charac-
teristics. For example, in the knowledge graph with informative edge attributes,
the micro-architecture search space needs to incorporate edge-aware graph con-
volutional layers to guarantee a desired model performance (Schlichtkrull et al,
2018; Shang et al, 2019). On the other hand, if the new tasks are similar to the
existing ones, the search algorithms could re-exploit the best architectures dis-
covered before to accelerate the search progress in the new tasks. For example,
one can simply initialize the search with these sophisticated architectures and
use several epochs to explore potentially good ones within a
small region. Especially for the massive graphs with a large volume of nodes
and edges, the reuse of well-performing architectures from similar tasks could
significantly save the computation cost. The research challenge is how to quan-
tify the similarities between the different graph-structured data.
17 Graph Neural Networks: AutoML 389
Acknowledgements
This work is, in part, supported by NSF (#IIS-1750074 and #IIS-1718840). The
views, opinions, and/or findings contained in this paper are those of the authors and
should not be interpreted as representing any funding agencies.
Yu Wang
Department of Electrical Engineering and Computer Science, Vanderbilt University, e-mail:
[email protected]
Wei Jin
Department of Computer Science and Engineering, Michigan State University, e-mail: [email protected]
Tyler Derr
Department of Electrical Engineering and Computer Science, Vanderbilt University, e-mail:
[email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 391
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_18
392 Yu Wang, Wei Jin, and Tyler Derr
18.1 Introduction
Recent years have witnessed the great success of applying deep learning in numer-
ous fields. However, the superior performance of deep learning heavily depends
on the quality of the supervision provided by the labeled data and collecting a
large amount of high-quality labeled data tends to be time-intensive and resource-
expensive (Hu et al, 2020c; Zitnik and Leskovec, 2017). Therefore, to alleviate the
demand for massive labeled data and provide sufficient supervision, self-supervised
learning (SSL) has been introduced. Specifically, SSL designs domain-specific pre-
text tasks that leverage extra supervision from unlabeled data to train deep learning
models and learn better representations for downstream tasks. In computer vision,
various pretext tasks have been studied, e.g., predicting relative locations of image
patches (Noroozi and Favaro, 2016) and identifying augmented images generated
from image processing techniques such as cropping, rotating and resizing (Shorten
and Khoshgoftaar, 2019). In natural language processing, self-supervised learning
has also been heavily utilized, e.g., predicting the masked word in BERT (Devlin
et al, 2019).
Simultaneously, graph representation learning has emerged as a powerful strat-
egy for analyzing graph-structured data over the past few years (Hamilton, 2020).
As the generalization of deep learning to the graph domain, Graph Neural Networks
(GNNs) have become a promising paradigm due to their efficiency and strong per-
formance in real-world applications (You et al, 2021; Zitnik and Leskovec, 2017).
However, the vanilla GNN model (i.e., Graph Convolutional Network (Kipf and
Welling, 2017b)) and even more advanced existing GNNs (Hamilton et al, 2017b;
Xu et al, 2019d, 2018a) are mostly established in a semi-supervised or supervised
manner, which still requires high-cost label annotation. Additionally, these GNN
models may not take full advantage of the abundant information in unlabeled data,
such as the graph topology and node attributes. Hence, SSL can be naturally har-
nessed for GNNs to gain additional supervision and thoroughly exploit the informa-
tion in the unlabeled data.
Compared with grid-based data such as images or text (Zhang et al, 2020e),
graph-structured data is far more complex due to its highly irregular topology,
involved intrinsic interactions and abundant domain-specific semantics (Wu et al,
2021d). Different from images and text where the entire structure represents a single
entity or expresses a single semantic meaning, each node in the graph is an individ-
ual instance with its own features and positioned in its own local context. Further-
more, these individual instances are inherently related with each other, which forms
diverse local structures that encode even more complex information to be discovered
and analyzed. While such complexity engenders tremendous challenges in analyz-
ing graph-structured data, the substantial and diverse information contained in the
node features, node labels, local/global graph structures, and their interactions and
combinations provide golden opportunities to design self-supervised pretext tasks.
Embracing the challenges and opportunities to study self-supervised learning
in GNNs, the works (Hu et al, 2020c, 2019c; Jin et al, 2020d; You et al, 2020c)
were among the first to systematically design and compare different self-
supervised pretext tasks in GNNs. For example, the works (Hu et al, 2019c; You
et al, 2020c) design pretext tasks to encode the topological properties of a node
such as centrality, clustering coefficient, and its graph partitioning assignment, or
to encode the attributes of a node such as individual features and clustering assign-
ments in embeddings output by GNNs. The work (Jin et al, 2020d) designs pretext
tasks to align the pairwise feature similarity or the topological distance between
two nodes in the graph with the closeness of two nodes in the embedding space.
Apart from the supervision information employed in creating pretext tasks, design-
ing effective training strategies and selecting reasonable loss functions are other
crucial components in incorporating SSL into GNNs. Two frequently used training
strategies that equip GNNs with SSL are 1) pre-training GNNs through complet-
ing pretext task(s) and then fine-tuning the GNNs on downstream task(s), and 2)
jointly training GNNs on both pretext and downstream tasks (Jin et al, 2020d; You
et al, 2020c). There are also a few works (Chen et al, 2020c; Sun et al, 2020c) ap-
plying the idea of self-training to incorporate SSL into GNNs. In addition, loss
functions are tailored to the purposes of specific pretext tasks, including
classification-based tasks (cross-entropy loss), regression-based tasks (mean
squared error loss), and contrastive-based tasks (contrastive loss).
In view of the substantial progress made in the field of graph neural networks
and the significant potential of self-supervised learning, this chapter aims to present
a systematic and comprehensive review on applying self-supervised learning into
graph neural networks. The rest of the chapter is organized as follows. Section 18.2
first introduces self-supervised learning and pretext tasks, and then summarizes fre-
quently used self-supervised methods from the image and text domains. In Sec-
tion 18.3, we introduce the training strategies that are used to incorporate SSL
into GNNs and categorize the pretext tasks that have been developed for GNNs.
Section 18.4 and 18.5 present detailed summaries of numerous representative SSL
methods that have been developed for node-level and graph-level pretext tasks.
Thereafter, in Section 18.6 we discuss representative SSL methods that are devel-
oped using both node-level and graph-level supervision, which we refer to as node-
graph-level pretext tasks. Section 18.7 collects and reinforces the major results and
the insightful discoveries in prior sections. Concluding remarks and future forecasts
on the development of SSL in GNNs are provided in Section 18.8.
Supervised learning is the machine learning task of training a model that maps an
input to an output based on the ground-truth input-output pairs provided by a la-
beled dataset. Good performance of supervised learning requires a decent amount
of labeled data (especially when using deep learning models), which are expen-
sive to manually collect. Conversely, self-supervised learning generates supervisory
signals from unlabeled data and then trains the model based on the generated super-
visory signals. The task used for training the model based on the generated signals is
referred to as the pretext task. In comparison, the task whose ultimate performance
we care about the most and expect our model to solve is referred to as the down-
stream task. To guarantee the performance benefits from self-supervised learning,
pretext tasks should be carefully designed such that completing them encourages
the model to develop an understanding similar or complementary to that needed for
downstream tasks. Self-supervised learning originally emerged to solve tasks in the image and
text domains. The following part focuses on introducing self-supervised learning in
these two fields with the specific emphasis on different pretext tasks.
In computer vision (CV), many ideas have been proposed for self-supervised rep-
resentation learning on image data. A common example is that we expect that small
distortion on an image does not affect its original semantic meaning or geometric
forms. The idea to create surrogate training datasets with unlabeled image patches
by first sampling patches from different images at varying positions and then distort-
ing patches by applying a variety of random transformations is proposed in (Doso-
vitskiy et al, 2014). The pretext task is to discriminate between patches distorted
from the same image or from different images. Rotation of an entire image is an-
other effective and inexpensive way to modify an input image without changing
semantic content (Gidaris et al, 2018). Each input image is first rotated by a mul-
tiple of 90 degrees at random. The model is then trained to predict which rotation
has been applied. However, instead of performing pretext tasks on an entire image,
the local patches could also be extracted to construct the pretext tasks. Examples of
methods using this technique include predicting the relative position between two
random patches from one image (Doersch et al, 2015) and designing a jigsaw puz-
zle game to place nine shuffled patches back to the original locations (Noroozi and
Favaro, 2016). More pretext tasks such as colorization, autoencoder, and contrastive
predictive coding have also been introduced and effectively utilized (Oord et al,
2018; Vincent et al, 2008; Zhang et al, 2016d).
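The rotation pretext task mentioned above can be sketched at the data level as follows, with a tiny 2D grid standing in for an image; the pseudo label is simply the number of quarter turns applied, and the grid values are arbitrary.

```python
import random

def rotate90(img):
    """Rotate a 2D list clockwise by 90 degrees."""
    return [list(row) for row in zip(*img[::-1])]

def make_rotation_example(img, rng):
    """Return a randomly rotated copy of img and its free pseudo label."""
    label = rng.randrange(4)          # 0, 1, 2, or 3 quarter turns
    rotated = img
    for _ in range(label):
        rotated = rotate90(rotated)
    return rotated, label             # a model would be trained to predict label

rng = random.Random(0)
img = [[1, 2], [3, 4]]
rotated, label = make_rotation_example(img, rng)
```

A classifier trained on many such (rotated, label) pairs must learn orientation-sensitive features without any human annotation.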
While computer vision has achieved amazing progress on self-supervised learn-
ing in recent years, self-supervised learning has been heavily utilized in natural lan-
guage processing (NLP) research for quite a while. Word2vec (Mikolov et al, 2013b)
is the first work that popularized the SSL ideas in the NLP field. Center word pre-
diction and neighbor word prediction are two pretext tasks in Word2vec where the
model is given a small chunk of the text and asked to predict the center word in that
text or vice versa. BERT (Devlin et al, 2019) is another famous pre-trained model
in NLP where two pretext tasks are to recover randomly masked words in a text or
to classify whether two sentences can come one after another or not. Similar works
have also been introduced, such as having the pretext task classify whether a pair
of sentences are in the correct order (Lan et al, 2020), or a pretext task that first
randomly shuffles the ordering of sentences and then seeks to recover the original
ordering (Lewis et al, 2020).
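The masked-word pretext of BERT can be sketched at the data level as follows; the 15% masking rate is the commonly reported choice, while the sentence and mask token below are arbitrary illustrative stand-ins.

```python
import random

def mask_tokens(tokens, rng, rate=0.15):
    """Replace ~rate of the tokens with [MASK], keeping originals as targets."""
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < rate:
            inputs.append("[MASK]")
            targets.append(tok)       # the model must recover this token
        else:
            inputs.append(tok)
            targets.append(None)      # no prediction needed at this position
    return inputs, targets

rng = random.Random(42)
tokens = "self supervised learning creates labels from unlabeled data".split()
inputs, targets = mask_tokens(tokens, rng)
```

Training then amounts to predicting each non-None target from the corrupted input sequence, which is supervision obtained for free from raw text.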
Compared with the difficulty of data acquisition encountered in image and text
domains, machine learning in the graph domain faces even more challenges in ac-
quiring high-quality labeled data. For example, for molecular graphs it can be ex-
tremely expensive to perform the necessary laboratory experiments to label some
molecules (Rong et al, 2020a), and in a social network obtaining ground-truth labels
for individual users may require large-scale surveys or be unable to be released due
to privacy agreements/concerns (Chen et al, 2020a). Therefore, the success achieved
by applying SSL in CV and NLP naturally leads to the question of whether SSL
can be effectively applied in the graph domain. Given that graph neural networks are
among the most powerful paradigms for graph representation learning, in the follow-
ing sections we will mainly focus on introducing self-supervised learning within
the framework of graph neural networks and highlighting/summarizing these recent
advancements.
by optimizing the supervised loss calculated between the output/predicted label and
the true label for labeled nodes and the labeled graph, which can be formulated as:
$$\theta^*, \theta_{sup}^* = \arg\min_{\theta, \theta_{sup}} \mathcal{L}_{sup}(\theta, \theta_{sup}) =
\begin{cases}
\underbrace{\arg\min_{\theta, \theta_{sup}} \frac{1}{|\mathcal{V}_l|} \sum_{v_i \in \mathcal{V}_l} \ell_{sup}(\mathbf{z}_{sup,i}, \mathbf{y}_{sup,i})}_{\text{Node supervised task}} \\[2ex]
\underbrace{\arg\min_{\theta, \theta_{sup}} \ell_{sup}(\mathbf{z}_{sup,G}, \mathbf{y}_{sup,G})}_{\text{Graph supervised task}}
\end{cases} \quad (18.3)$$
where Lsup is the total supervised loss function and ℓsup is the supervised loss
function for each example, ysup,i = Ysup [i, :]⊤ indicates the true label for node vi
in node supervised task and ysup,G indicates the true label for graph G in graph
supervised task. Their corresponding predicted label distributions are denoted as
zsup,i = Zsup [i, :]⊤ and zsup,G . θ , θsup are parameters to be optimized for any GNN
model and the extra adaptation layer for the supervised downstream task, respec-
tively. Note that for ease of notation, we assume the above graph supervised task
is operated only on one graph but the above framework can be easily adapted to
supervised tasks on multiple graphs.
In this chapter, we view SSL as the process of designing a specific pretext task and
learning the model on the pretext task. In this sense, SSL can either be used as
unsupervised pre-training or be integrated with semi-supervised learning.
The model capability of extracting features for completing pretext and down-
stream tasks is improved through optimizing the model parameters θ , θssl , and θsup ,
where θssl denotes the parameters of the adaptation layer for the pretext task. In-
spired by relevant discussions (Hu et al, 2019c; Jin et al, 2020d; Sun et al, 2020c;
You et al, 2020b,c), we summarize three possible training strategies that are pop-
ular in the literature to train GNNs in the self-supervised setting as self-training,
pre-training with fine-tuning, and joint training.
18 Graph Neural Networks: Self-supervised Learning 397
18.3.1.1 Self-training
A common strategy to utilize the features learned from completing pretext tasks is
to apply the optimized parameters from self-supervision as initialization for
fine-tuning in downstream tasks. This strategy consists of two stages: pre-training
on the self-supervised pretext tasks and fine-tuning on the downstream tasks. The
overview of this two-stage optimization strategy is given in Fig. 18.2.
The whole model consists of one shared GNN-based feature extractor and two
adaptation modules, one for the pretext task and one for the downstream task. In the
pre-training process, the model is trained with the self-supervised pretext task(s) as:
$$\theta^*, \theta_{ssl}^* = \arg\min_{\theta, \theta_{ssl}} \mathcal{L}_{ssl}(\theta, \theta_{ssl}),$$
where θssl denotes the parameters of the adaptation layer hθssl for the pretext
tasks, ℓssl is the self-supervised loss function for each example, and Lssl is the
total loss function of completing the self-supervised task. In node pretext tasks,
zssl,i = Zssl [i, :]⊤ and yssl,i = Yssl [i, :]⊤ , which are the self-supervised predicted and
true label(s) for the node vi , respectively. In graph pretext tasks, zssl,G and yssl,G are
the self-supervised predicted and true label(s) for the graph G , respectively. Then, in
the fine-tuning process, the feature extractor fθ is trained by completing downstream
tasks in Eqs. 18.1-18.3 with the pre-trained θ∗ as the initialization.
Note that, to utilize the pre-trained node/graph representations, the fine-tuning pro-
cess can also be replaced by training a linear classifier (e.g., Logistic Regression
(Peng et al, 2020; Veličković et al, 2019; You et al, 2020b; Zhu et al, 2020c)).
Fig. 18.2: An overview of GNNs with SSL using pre-training and fine-tuning.
Another natural idea to harness self-supervised learning for graph neural networks is
to combine losses of completing pretext task(s) and downstream task(s) and jointly
train the model. The overview of the joint training is shown in Fig. 18.3.
The joint training consists of two components: feature extraction by a GNN and
adaptation processes for both the pretext and downstream tasks. In the feature
extraction process, a GNN takes the graph adjacency matrix A and the feature ma-
trix X as input and outputs the node embeddings ZGNN and/or graph embeddings
zGNN,G . In the adaptation procedure, the extracted node and graph embeddings are
further transformed to complete pretext and downstream tasks via hθssl and hθsup ,
respectively. We then jointly optimize the pretext and downstream task losses as:
$$\mathbf{z}_{sup,G} = h_{\theta_{sup}}(\text{READOUT}(f_\theta(X, A))), \quad \mathbf{z}_{ssl,G} = h_{\theta_{ssl}}(\text{READOUT}(f_\theta(X, A))), \quad (18.8)$$
$$\theta^*, \theta_{sup}^*, \theta_{ssl}^* =
\begin{cases}
\underbrace{\arg\min_{\theta, \theta_{sup}, \theta_{ssl}} \frac{1}{|\mathcal{V}|} \sum_{v_i \in \mathcal{V}} \big(\alpha_1 \ell_{sup}(\mathbf{z}_{sup,i}, \mathbf{y}_{sup,i}) + \alpha_2 \ell_{ssl}(\mathbf{z}_{ssl,i}, \mathbf{y}_{ssl,i})\big)}_{\text{Node pretext tasks}} \\[2ex]
\underbrace{\arg\min_{\theta, \theta_{sup}, \theta_{ssl}} \alpha_1 \ell_{sup}(\mathbf{z}_{sup,G}, \mathbf{y}_{sup,G}) + \alpha_2 \ell_{ssl}(\mathbf{z}_{ssl,G}, \mathbf{y}_{ssl,G})}_{\text{Graph pretext tasks}}
\end{cases} \quad (18.9)$$
where α1 , α2 > 0 are real-valued weights for combining the supervised loss ℓsup and the
self-supervised loss ℓssl .
A loss function is used to evaluate how well the algorithm mod-
els the data. Generally, in GNNs with self-supervised learning, the loss function for
the pretext task has three forms, which are classification loss, regression loss and
contrastive learning loss. Note that the loss functions we discuss here are only for
the pretext tasks rather than downstream tasks.
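The three loss forms can be sketched as follows. The toy predictions, targets, and embeddings are arbitrary, and the contrastive loss shown is an InfoNCE-style instance with cosine similarity, one of several common variants rather than the one definitive formulation.

```python
import math

def cross_entropy(probs, label):
    """Classification loss for a one-hot target given predicted probabilities."""
    return -math.log(probs[label])

def mse(pred, target):
    """Regression loss: mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def contrastive(anchor, positive, negatives, tau=0.5):
    """InfoNCE-style contrastive loss with cosine similarity and temperature."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    logits = [cos(anchor, positive) / tau] + [cos(anchor, n) / tau for n in negatives]
    return -math.log(math.exp(logits[0]) / sum(math.exp(l) for l in logits))

ce = cross_entropy([0.7, 0.2, 0.1], label=0)
reg = mse([0.5, 1.5], [1.0, 1.0])
con = contrastive([1.0, 0.0], [0.9, 0.1], [[-1.0, 0.0], [0.0, 1.0]])
```

Each pretext task in the following sections instantiates one of these three forms, with the pseudo labels supplying the targets.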
where the objective is minimizing the distance from our learned embedding to yssl,i
which represents any ground-truth value of node vi , such as the original attribute in
the feature completion or other values of node vi .
$$\mathcal{L}_{ssl} = -\frac{1}{|\mathcal{P}^+|} \sum_{(i,j) \in \mathcal{P}^+} \log \frac{\exp(\mathcal{D}(\mathbf{z}^1_{ssl,i}, \mathbf{z}^2_{ssl,j}))}{\sum_{k \in \{j\} \cup \mathcal{P}^-_i} \exp(\mathcal{D}(\mathbf{z}^1_{ssl,i}, \mathbf{z}^2_{ssl,k}))} \quad (18.13)$$
where D(z1ssl,i , z2ssl,j ) = sim(z1ssl,i , z2ssl,j )/τ is a learnable discriminator parametrized with
the similarity function (i.e., cosine similarity) and the temperature factor τ. P + represents
the set of all pairs of positive samples, while P − , the union of the per-sample negative
sets Pi− over all positive pairs (i, j) ∈ P + , represents all negative samples. In particular,
Pi− contains all negative samples of the sample i. Note that we can contrast node
representations, graph representations, and node-graph representations under different
views. Therefore, z1ssl is not limited to the node embeddings, but could refer to the
embeddings of both nodes and graphs under the first graph view G 1 . Thus, i, j, k could
refer to both node and graph samples.
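The structure of Eq. 18.13 (positives across views in the numerator, the positive plus per-anchor negatives in the denominator) can be sketched as below, assuming a simple dot-product discriminator with τ = 1 in place of the cosine-similarity form; the two-view embeddings are toy values.

```python
import math

def discriminator(z1, z2, tau=1.0):
    """Dot-product stand-in for the similarity-based discriminator D."""
    return sum(a * b for a, b in zip(z1, z2)) / tau

def contrastive_loss(view1, view2, tau=1.0):
    """Average -log( pos / (pos + negs) ) over anchors; positives are the
    same node index across the two views, negatives all other nodes."""
    n = len(view1)
    loss = 0.0
    for i in range(n):
        pos = math.exp(discriminator(view1[i], view2[i], tau))
        negs = sum(math.exp(discriminator(view1[i], view2[k], tau))
                   for k in range(n) if k != i)
        loss += -math.log(pos / (pos + negs))
    return loss / n

view1 = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
view2 = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.8]]
loss = contrastive_loss(view1, view2)
```

Correctly aligned views yield a lower loss than mismatched ones, which is precisely the signal that pulls the two views of the same node together.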
1 Additional summary details and the corresponding code links for these methods can be found at
https://github.com/NDS-VU/GNN-SSL-Chapter.
information from node features, graph structure, and even information from the known
training labels (as presented in (Jin et al, 2020d)). We summarize the categorization
of pretext tasks as a tree where each leaf node represents a specific type of pretext
tasks in Fig. 18.5 while also including the corresponding references. In the next
three sections, we give detailed explanations about each of these pretext tasks and
summarize the majority of existing methods.
For node-level pretext tasks, methods have been developed to use easily-accessible
data to generate pseudo labels for each node or relationships for each pair of nodes.
In this way, the GNNs are then trained to be predictive of the pseudo labels or to keep
the equivalence between the node embeddings and the original node relationships.
Different nodes have different structural properties in the graph topology, which can be
measured by the node degree, centrality, node partition, etc. Thus, for structure-
based pretext tasks at the node-level, we expect to align node embeddings extracted
from the GNNs with their structure properties, in an attempt to ensure this informa-
tion is preserved while GNNs learn the node embeddings.
Since degree is the most fundamental topological property, Jin et al (2020d) de-
signs the pretext task to recover the node degree from the node embeddings as fol-
lows:
$$\mathcal{L}_{ssl} = \frac{1}{|\mathcal{V}|} \sum_{v_i \in \mathcal{V}} \ell_{MSE}(\mathbf{z}_{ssl,i}, d_i) \quad (18.14)$$
where di represents the degree of node i and zssl,i = Zssl [i, :]⊤ denotes the self-
supervised GNN embeddings of node i. It should be noted that this pretext task
can be generalized to harness any structural property at the node level.
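A minimal sketch of the degree-recovery loss in Eq. 18.14, with one scalar per node standing in for the GNN embedding zssl,i; a real model would output vectors and pass them through a regression head, so the values below are hypothetical placeholders.

```python
def degrees(adj):
    """Node degrees of an adjacency matrix given as a list of 0/1 rows."""
    return [sum(row) for row in adj]

def mse_loss(preds, targets):
    """Mean squared error over all nodes, as in Eq. 18.14."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

adj = [[0, 1, 1, 0],
       [1, 0, 1, 0],
       [1, 1, 0, 1],
       [0, 0, 1, 0]]
# Stand-in "embeddings": one scalar per node playing the role of z_ssl,i.
z_ssl = [1.8, 2.1, 3.2, 0.9]
loss = mse_loss(z_ssl, degrees(adj))
```

Minimizing this loss pushes the embeddings to encode each node's degree, the pseudo label obtained for free from the topology.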
Node centrality measures the importance of nodes based on their structural roles
in the whole graph (Newman, 2018). Hu et al (2019c) designs a pretext task to have
GNNs estimate the rank scores of node centrality. The specific centrality measures
considered are eigencentrality, betweenness, closeness, and subgraph centrality. For
a node pair (u, v) and a centrality score s, the relative order is defined as
Rsu,v = 1(su > sv ), i.e., Rsu,v = 1 if su > sv and Rsu,v = 0 if su ≤ sv . A decoder
Dsrank for the centrality score s estimates the rank score of node v by
Sv = Dsrank (zGNN,v ). The probability of the estimated rank order is defined by the
sigmoid function R̃su,v = exp(Su − Sv )/(1 + exp(Su − Sv )). Then predicting the
relative order between pairs of nodes could be formalized as a binary classification
problem with the loss:
$$\mathcal{L}_{ssl} = -\sum_{(u,v)} \Big( R^s_{u,v} \log \tilde{R}^s_{u,v} + (1 - R^s_{u,v}) \log(1 - \tilde{R}^s_{u,v}) \Big) \quad (18.15)$$
Different from peer works, Hu et al (2019c) does not consider any node feature but
instead extracts the node features directly from the graph topology, including:
(1) degree that defines the local importance of a node; (2) core-number that defines
the connectivity of the subgraph around a node; (3) collective influence that defines
the neighborhood importance of a node; and (4) local clustering coefficient, which
defines the connectivity of the 1-hop neighborhood of a node. Then, the four features
(after min-max normalization) are concatenated with a nonlinear transformation and
fed into the GNN where (Hu et al, 2019c) uses the pretext tasks: centrality ranking,
clustering recovery and edge prediction. Another innovative idea in (Hu et al, 2019c)
is to choose a fix-tune boundary in the middle layer of GNNs. The GNN blocks
below this boundary are fixed, while the ones above the boundary are fine-tuned. For
downstream tasks that are closely related to the pre-trained tasks, a higher boundary
is used.
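The centrality-ranking objective above can be sketched as follows, assuming a pairwise binary cross-entropy over the sigmoid rank probabilities R̃su,v; the centrality values and decoder scores below are arbitrary stand-ins for real measures and decoder outputs.

```python
import math

def rank_probability(s_u, s_v):
    """Probability that node u outranks node v: sigmoid of score difference."""
    return math.exp(s_u - s_v) / (1.0 + math.exp(s_u - s_v))

def pairwise_rank_loss(true_scores, pred_scores):
    """Binary cross-entropy over all ordered node pairs (u, v), u != v."""
    n = len(true_scores)
    loss, count = 0.0, 0
    for u in range(n):
        for v in range(n):
            if u == v:
                continue
            r = 1.0 if true_scores[u] > true_scores[v] else 0.0
            p = rank_probability(pred_scores[u], pred_scores[v])
            loss += -(r * math.log(p) + (1 - r) * math.log(1 - p))
            count += 1
    return loss / count

true_centrality = [0.9, 0.1, 0.5]   # e.g., betweenness values (toy)
decoder_scores = [2.0, -1.0, 0.5]   # higher score should mean higher centrality
loss = pairwise_rank_loss(true_centrality, decoder_scores)
```

Scores that preserve the true centrality ordering give a markedly lower loss than scores that invert it, which is what drives the decoder during pre-training.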
Another important node-level structural property is the partition each node be-
longs to after performing a graph partitioning method. In (You et al, 2020c), the pretext
task is to train the GNNs to encode the node partition information. Graph partition-
ing divides the nodes of a graph into different groups such that the number
of edges between groups is minimized. Given the node set V , the edge set E ,
and a preset number of partitions p ∈ [1, |V |], a graph partitioning algorithm (e.g.,
(Karypis and Kumar, 1995) as used in (You et al, 2020c)) will output a set of nodes
{Vpar1 , ..., Vpar p |Vpari ⊂ V , i = 1, ..., p}. Then the classification loss is set exactly the
same as:
$$\mathcal{L}_{ssl} = -\frac{1}{|\mathcal{V}|} \sum_{v_i \in \mathcal{V}} \ell_{CE}(\mathbf{z}_{ssl,i}, \mathbf{y}_{ssl,i}) \quad (18.16)$$
where zssl,i denotes the embedding of node vi , and the partitioning label is a one-hot
encoding yssl,i ∈ R p whose k-th entry is 1 and all others are 0 if vi ∈ Vpark .
Node features are another important information that can be leveraged to provide ex-
tra supervision. Since the state-of-the-art GNNs suffer from over-smoothing (Chen
et al, 2020c), the original feature information is partially lost after being fed into the
GNNs. In order to reduce the information loss in node embeddings, the pretext task
in (Hu et al, 2020c; Jin et al, 2020d; Manessi and Rozza, 2020; Wang et al, 2017a;
You et al, 2020c) is to first mask node features and let the GNN predict those fea-
tures. More specifically, they randomly mask input node features by replacing them
with special mask indicators and then apply GNNs to obtain the corresponding node
embeddings. Finally, a linear model is applied on top of the embeddings to predict the
corresponding masked node features. Assuming the set of nodes that are masked is
Vm , then the self-supervised regression loss to reconstruct these masked features is:
$$\mathcal{L}_{ssl} = \frac{1}{|\mathcal{V}_m|} \sum_{v_i \in \mathcal{V}_m} \ell_{MSE}(\mathbf{z}_{ssl,i}, \mathbf{x}_i) \quad (18.17)$$
To handle the high sparsity of the node features, it is beneficial to first perform
feature dimensionality reduction on X (such as principal component analysis (PCA)
used in (Jin et al, 2020d)). Additionally, instead of reconstructing node features,
node embeddings could also be reconstructed from their corrupted version, such as
in (Manessi and Rozza, 2020).
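A minimal sketch of the attribute-masking pretext task (Eq. 18.17), assuming a zero vector as the mask indicator and a 50% masking rate; both are illustrative choices rather than values prescribed by the cited works.

```python
import random

def mask_features(x, rng, rate=0.5, mask_value=0.0):
    """Replace the features of a random subset of nodes with a mask indicator."""
    masked_nodes = [i for i in range(len(x)) if rng.random() < rate]
    if not masked_nodes:                  # ensure at least one node is masked
        masked_nodes = [rng.randrange(len(x))]
    x_masked = [row[:] for row in x]
    for i in masked_nodes:
        x_masked[i] = [mask_value] * len(x[i])
    return x_masked, masked_nodes

def reconstruction_loss(recon, x, masked_nodes):
    """MSE between reconstructions and originals, on masked nodes only."""
    total = 0.0
    for i in masked_nodes:
        total += sum((r - t) ** 2 for r, t in zip(recon[i], x[i])) / len(x[i])
    return total / len(masked_nodes)

rng = random.Random(1)
x = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.2, 0.8]]
x_masked, masked_nodes = mask_features(x, rng)
# A real model would predict the reconstruction from x_masked and the graph;
# a perfect reconstruction would drive the loss to zero.
```

The GNN sees only x_masked, so minimizing the loss forces it to infer masked attributes from neighborhood context.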
Contrary to the graph partitioning where nodes are grouped by the graph topol-
ogy, in graph clustering the clusters of nodes are discovered based on their fea-
tures (You et al, 2020c). In this way the pretext task can be designed to recover the
node clustering assignment. Given the node set V , the feature matrix X, and a preset
number of clusters p ∈ [1, |V |] (or without one if the clustering algorithm automatically
learns the number of clusters) as input, the clustering algorithm will output a set of
node clusters {Vclu1 , . . . , Vclu p |Vclui ⊂ V , i = 1, ..., p}. For node vi , the clustering
label is a one-hot encoding yssl,i ∈ R p whose k-th entry is 1 and all others are 0 if
vi ∈ Vcluk . Then the loss is the same as Eq. 18.16.
Instead of focusing on individual nodes, pretext tasks have also been developed
based on the relationship between pairs of nodes (Jin et al, 2021, 2020d). The basic
idea is to retain the node pairwise feature similarity in the node embeddings from
GNNs. Suppose Ts , Td denote the sets of node pairs having the highest and the
lowest similarity:
$$\mathcal{T}_s = \{(v_i, v_j) \mid \mathrm{sim}(\mathbf{x}_i, \mathbf{x}_j) \text{ in top-}B \text{ of } \{\mathrm{sim}(\mathbf{x}_i, \mathbf{x}_b)\}_{b=1}^{|\mathcal{V}|} \setminus \mathrm{sim}(\mathbf{x}_i, \mathbf{x}_i), \ \forall v_i \in \mathcal{V}\}, \quad (18.18)$$
with Td defined analogously over the bottom-B similarities,
where sim(xi , x j ) measures the cosine similarity of features between two nodes vi , v j
and B is the number of top/bottom pairs selected for each node. Then the pretext task
is to optimize the following regression loss:
$$\mathcal{L}_{ssl} = \frac{1}{|\mathcal{T}_s \cup \mathcal{T}_d|} \sum_{(v_i, v_j) \in \mathcal{T}_s \cup \mathcal{T}_d} \ell_{MSE}\big(f_w(|\mathbf{z}_{GNN,i} - \mathbf{z}_{GNN,j}|), \mathrm{sim}(\mathbf{x}_i, \mathbf{x}_j)\big), \quad (18.20)$$
where fw is a function mapping the difference between two node embeddings from
GNNs to a scalar representing the similarity between them.
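A sketch of the pairwise-similarity pretext task (Eqs. 18.18 and 18.20) with B = 1; the features, embeddings, and the fixed stand-in for the learnable mapping fw are all hypothetical toy choices.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_bottom_pairs(x, B=1):
    """Per node, keep the B most similar (T_s) and B least similar (T_d) pairs."""
    ts, td = set(), set()
    for i in range(len(x)):
        sims = sorted(((cosine(x[i], x[j]), j) for j in range(len(x)) if j != i),
                      reverse=True)
        ts.update((i, j) for _, j in sims[:B])        # most similar pairs
        td.update((i, j) for _, j in sims[-B:])       # least similar pairs
    return ts, td

def pairwise_loss(z, x, pairs, f_w):
    """Regress f_w(|z_i - z_j|) onto the feature cosine similarity, Eq. 18.20."""
    total = 0.0
    for i, j in pairs:
        diff = [abs(a - b) for a, b in zip(z[i], z[j])]
        total += (f_w(diff) - cosine(x[i], x[j])) ** 2
    return total / len(pairs)

x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]      # toy node features
z = [[0.5, 0.5], [0.4, 0.6], [0.9, 0.1]]      # toy GNN embeddings
f_w = lambda d: 1.0 - sum(d) / len(d)         # hypothetical fixed mapping
ts, td = top_bottom_pairs(x)
loss = pairwise_loss(z, x, ts | td, f_w)
```

Restricting the loss to the extreme pairs keeps the regression informative while avoiding the quadratic cost of all node pairs.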
Instead of employing only the topology or only the feature information as the extra
supervision, some pretext tasks combine them together as the hybrid supervision, or
even utilize information from the known training labels.
A contrastive framework for unsupervised graph representation learning, GRACE,
where two correlated graph views are generated by randomly performing corrup-
tion on attributes (masking node features) and topology (removing or adding graph
edges) is proposed in (Zhu et al, 2020c). Then the GNNs are trained using a con-
trastive loss to maximize the agreement between node embeddings in these two
views. In each iteration two graph views G 1 = {A1 , X 1 } and G 2 = {A2 , X 2 } are
generated randomly according to the possible augmentation functions from an input
graph G = {A, X}.
The objective is to maximize the similarity of the same nodes in different views of
the graph while minimizing the similarity of different nodes in the same or different
views of the graph. Thus, if we denote the node embeddings in the two views as
Z 1GNN = fθ (X 1 , A1 ) and Z 2GNN = fθ (X 2 , A2 ), then the contrastive NT-Xent loss is:
$$\mathcal{L}_{ssl} = \frac{1}{|\mathcal{P}^+|} \sum_{(v^1_i, v^2_i) \in \mathcal{P}^+} \ell_{\text{NT-Xent}}(Z^1_{GNN}, Z^2_{GNN}, \mathcal{P}^-), \quad (18.21)$$
where P + includes positive pairs (v1i , v2i ) whose members correspond to the same
node in different views, while P − is the union of the per-node negative sets Pv−1i over
all positive pairs, with Pv−1i containing the nodes different from vi in the same view
(intra-view negative pairs) or the other view (inter-view negative pairs).
More specifically, in the above, the two graph corruptions are removing edges
and masking node features. In removing edges, a random masking matrix M ∈
{0, 1}|V |×|V | is sampled whose entries are drawn from a Bernoulli distribution
Mi j ∼ B(1 − pr ) if Ai j = 1 in the original graph, where pr is the probability of
each edge being removed. The resulting matrix A′ = A ⊙ M is the adjacency
matrix of graph view G ′ derived from G .
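The two GRACE-style corruptions can be sketched as follows; the edge-removal and feature-masking probabilities and the toy graph are arbitrary illustrative values.

```python
import random

def remove_edges(adj, p_r, rng):
    """Keep each existing undirected edge with probability 1 - p_r."""
    n = len(adj)
    new_adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if adj[i][j] == 1 and rng.random() < 1 - p_r:
                new_adj[i][j] = new_adj[j][i] = 1
    return new_adj

def mask_feature_dims(x, p_m, rng):
    """Zero out feature dimensions using one Bernoulli mask shared by all nodes."""
    d = len(x[0])
    m = [1 if rng.random() < 1 - p_m else 0 for _ in range(d)]
    return [[xi * mi for xi, mi in zip(row, m)] for row in x], m

rng = random.Random(0)
adj = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
adj_view = remove_edges(adj, p_r=0.3, rng=rng)
x_view, m = mask_feature_dims(x, p_m=0.3, rng=rng)
```

Sampling the two corruptions twice yields the two correlated views whose node embeddings are then contrasted via Eq. 18.21.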
In masking node features, a random vector m ∈ {0, 1}d is utilized, where each
dimension of m is independently drawn from a Bernoulli distribution with probabil-
ity 1 − pm , and d is the dimension of the node features X. Then, the generated node
features X ′ for graph view G ′ are computed from G by:
where µm denotes the centroid of class m in labeled data, κk denotes the centroid
of cluster k in unlabeled data and ck represents the aligned class that has the clos-
est distance to the centroid κk of the cluster k among all centroids of classes in the
original labeled data. Note that the self-checking can be directly performed by com-
paring the distance of each unlabeled node to centroids of classes in labeled data.
However, directly checking in this naïve way is very time-consuming.
After having just presented the node-level SSL pretext tasks, in this section we focus
on the graph-level SSL pretext tasks where we desire the node embeddings coming
from the GNNs to encode information of graph-level properties.
As the counterpart of the nodes in the graph, the edges encode abundant information
of the graph, which can also be leveraged as an extra supervision to design pretext
tasks. The pretext task in (Zhu et al, 2020a) is to recover the graph topology, i.e.,
predict edges, after randomly removing edges in the graph. After node embeddings
zGNN,i are obtained for each node vi , the probability of an edge between any pair of
nodes vi , v j is calculated by their feature similarity as follows:
and the weighted cross-entropy loss is used during training, which is defined as:
where W is the weight hyperparameter used for balancing the two classes:
node pairs having an edge and node pairs without an edge between them.
It is known that an unclean graph structure usually impedes the applicability of
GNNs (Cosmo et al, 2020; Jang et al, 2019). A method that trains the GNNs on
downstream supervised tasks based on the cleaned graph structure reconstructed
from completing a self-supervised pretext task is introduced in (Fatemi et al, 2021).
The self-supervised pretext task aims to train a separate GNN to denoise the corrupted
node features X̂, generated either by randomly zeroing some dimensions of
the original node features X when the features are binary, or by adding independent
Gaussian noise when X is continuous. Two methods are used to generate the initial
graph adjacency matrix Ã. The first method, Full Parametrization (FP), treats every
entry in à as a parameter and directly optimizes its |V|² parameters by denoising the
corrupted features X̂. The second method, MLP-kNN, considers a mapping function
18 Graph Neural Networks: Self-supervised Learning 409
A = D^{−1/2} · ( (P̃(Ã) + P̃(Ã)^⊤) / 2 ) · D^{−1/2},    (18.27)
where P̃ is a function with a non-negative range to ensure the positivity of every
entry in A. In the MLP-kNN method, P̃ is the element-wise ReLU function. However,
the ReLU function could cause gradient-flow problems in the FP method, so
the element-wise ELU function followed by an addition of 1 is used instead to avoid
them. Next, a separate GNN-based encoder takes the noisy
node features X̂ and the new normalized adjacency matrix A as input and outputs the
updated node features Ẑ = GNN(X̂, A). The parameters in FP and MLP-kNN used
for generating the initial adjacency matrix à are optimized by:
L_ssl = (1/|V_m|) ∑_{v_i ∈ V_m} ℓ_MSE(x_i, ẑ_i),    (18.28)
where ẑ_i = Ẑ[i, :]^⊤ is the embedding vector of node v_i obtained by the
separate GNN-based encoder from the noisy input. The optimized parameters in FP
and MLP-kNN lead to the generation of a cleaner graph adjacency matrix, which in
turn results in better performance on the downstream tasks.
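The symmetrization and degree normalization of Eq. (18.27) can be sketched as follows; `activation` stands in for the non-negative function P̃ (ReLU for MLP-kNN, a shifted ELU for FP), with `np.abs` used only as a placeholder:

```python
import numpy as np

def normalize_adjacency(A_tilde, activation=np.abs):
    """Turn a learned (possibly signed, asymmetric) adjacency into a
    symmetric, degree-normalized one, following the form of Eq. (18.27)."""
    P = activation(A_tilde)          # non-negative P~(A~)
    S = 0.5 * (P + P.T)              # symmetrize
    deg = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    # D^{-1/2} S D^{-1/2} via broadcasting
    return (S * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
```

The result is symmetric and non-negative by construction, which makes it safe to feed into standard GNN message passing.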
In addition to the graph edges and the adjacency matrix, the topological distance
between nodes is another important global structural property of a graph. The pretext
task in (Peng et al, 2020) is to recover the topological distance between nodes. More
specifically, they leverage the length of the shortest path p_ij between nodes v_i and v_j,
but this could be replaced with any other distance measure.
Then, they define the set C_i^k as all the nodes having a shortest path of
length k from node v_i. More formally, this is defined as:

C_i^k = {v_j | d_ij = k},   C_i = C_i^1 ∪ C_i^2 ∪ ... ∪ C_i^{δ_i},
where δi is the upper bound of the hop count from other nodes to vi , di j is the length
of the path pi j , and Ci is the union of all the k-hop shortest path neighbor sets Cik .
Based on these sets, one-hot encodings di j ∈ Rδi are created for pairs of nodes vi , v j ,
where v j ∈ Ci , according to their distance di j . Then, the GNN model is guided to
extract node embeddings that encode node topological distance as follows:
Since δ_i is the upper bound of the hop count (topological distance) from other nodes
to v_i, but precisely determining this upper bound is time-consuming for a big graph,
it is assumed that the number of hops (distance) is under control based on the
small-world phenomenon (Newman, 2018), and the distance is further divided into
several major categories that clearly discriminate dissimilarity and partly tolerate
similarity. Experiments demonstrate that dividing the topological distance into four
categories C_i^1, C_i^2, C_i^3, C_i^k (k ≥ 4) achieves the best performance
(i.e., δ_i = 4). Another problem is that the number of nodes that are close to the focal
node v_i is much smaller than the number of nodes that are further away (i.e., the
magnitude of C_i^{δ_i} will be significantly larger than that of the other sets). To
circumvent this imbalance problem,
node pairs are sampled with an adaptive ratio.
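The four-category bucketing of shortest-path distances can be sketched with a plain BFS; the function name and the adjacency-list layout are our choices:

```python
import numpy as np
from collections import deque

def hop_buckets(adj_list, i, max_bucket=4):
    """BFS from node i; bucket every reachable node by its shortest-path
    length, merging all distances >= max_bucket into one class, as in the
    four-category scheme C_i^1, C_i^2, C_i^3, C_i^{k>=4}."""
    dist = {i: 0}
    q = deque([i])
    while q:
        u = q.popleft()
        for v in adj_list[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    buckets = {k: set() for k in range(1, max_bucket + 1)}
    for v, d in dist.items():
        if d > 0:                     # exclude the focal node itself
            buckets[min(d, max_bucket)].add(v)
    return buckets
```

Sampling node pairs per bucket (rather than uniformly over all pairs) is then what realizes the adaptive ratio mentioned above.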
Network motifs are recurrent and statistically significant subgraphs of a larger
graph, and (Zhang et al, 2020f) designs a pretext task to train a GNN encoder that can
automatically extract graph motifs. The learned motifs are further leveraged to generate
informative subgraphs used in graph-subgraph contrastive learning. First, a
GNN-based encoder f_θ and an m-slot embedding table {m_1, ..., m_m}, denoting the m
cluster centers of the m motifs, are initialized. Then, a node affinity matrix U ∈ R^{|V|×|V|} is
calculated by softmax normalization of the embedding similarity D(z_GNN,i, z_GNN,j)
between nodes i, j as in Eq. 18.13. Afterwards, spectral clustering (von
Luxburg, 2007) is performed on U to generate different groups, within which
n_G connected components that have more than three nodes are collected as the sampled
subgraphs from the graph G, and their embeddings are calculated by applying
a READOUT function. For each subgraph, its cosine similarity to each of the m
motifs is calculated to obtain a similarity matrix S ∈ R^{m×n_G}. To produce semantically
meaningful subgraphs that are close to motifs, the top 10% most similar subgraphs
to each motif are selected based on the similarity matrix S and are collected into a
set G^top. The affinity values in U between pairs of nodes in each of these subgraphs
are increased by optimizing the loss:
L_1 = −(1/|G^top|) ∑_{i=1}^{|G^top|} ∑_{(v_j, v_k) ∈ G_i^top} U[j, k].    (18.31)
Optimizing the above loss forces nodes in motif-like subgraphs to be more
likely to be grouped together in spectral clustering, which yields more subgraph
samples aligned with the motifs. Next, the embedding table of motifs is optimized
based on the sampled subgraphs. The assignment matrix Q ∈ R^{m×n_G} is found by
maximizing the similarity between each subgraph embedding and its assigned motif:
max_Q  Tr(Q^⊤ S) − (1/λ) ∑_{i,j} Q[i, j] log Q[i, j],    (18.32)
L_ssl = (1/|E ∪ E^−|) ∑_{(j,i) ∈ E ∪ E^−} [ 1((j,i) ∈ E) · log χ_ij + 1((j,i) ∈ E^−) · log(1 − χ_ij) ],
(18.35)
where E is the set of edges, E^− is a sampled set of node pairs without edges,
and χ_ij is the edge probability between nodes i, j calculated from their embeddings.
Based on two primary edge attentions, the GAT attention (shortly GO) (Veličković
et al, 2018) and the dot-product attention (shortly DP) (Luong et al, 2015), two
advanced attention mechanisms, SuperGAT_SD (scaled dot-product, shortly SD)
and SuperGAT_MX (mixed GO and DP, shortly MX), are proposed:
e_ij,SD = e_ij,DP / √F,   χ_ij,SD = σ(e_ij,SD),    (18.36)
e_ij = exp(−(∆z_i − ∆z_j) ⊙ (∆z_i − ∆z_j)) / ‖exp(−(∆z_i − ∆z_j) ⊙ (∆z_i − ∆z_j))‖,    (18.39)
where ⊙ denotes the Hadamard product. This edge representation e_ij is then fed
into an MLP for the prediction of the topological transformation, which includes
four classes: edge addition, edge deletion, keeping disconnection, and keeping
connection between each pair of nodes. Thus, the GNN-based encoder is trained by:
L_ssl = (1/|V|²) ∑_{v_i, v_j ∈ V} ℓ_CE(MLP(e_ij), t_ij)    (18.40)
Typically, graphs do not come with any graph-level feature information; here, graph-level
features refer to the graph embeddings obtained after applying a pooling layer on
all node embeddings from GNNs.
GraphCL (You et al, 2020b) designs a pretext task that first augments graphs
with four different augmentations, namely node dropping, edge perturbation, attribute
masking, and subgraph extraction, and then maximizes the mutual information
of the graph embeddings between different augmented views generated from the
same original graph, while also minimizing the mutual information of the graph
embeddings between augmented views generated from different graphs. The
graph embeddings Z_ssl are obtained through any permutation-invariant READOUT
function on node embeddings, followed by an adaptation layer. Then
the mutual information is maximized by optimizing the following NT-Xent contrastive
loss:
L_ssl = (1/|P^+|) ∑_{(G_i, G_j) ∈ P^+} ℓ_NT-Xent(Z_ssl^1, Z_ssl^2, P^−),    (18.41)
where Z_ssl^1, Z_ssl^2 represent graph embeddings under two different views. A view
could be the original view without any augmentation or one generated by applying
the four different augmentations. P^+ contains positive pairs of graphs (G_i, G_j)
augmented from the same original graph, while P^− = ∪_{(G_i, G_j) ∈ P^+} P^−_{G_i}
represents all sets of negative samples. Specifically, P^−_{G_i} contains graphs
augmented from graphs different from G_i. Numerical results demonstrate that the
augmentation of edge perturbation benefits social networks but hurts biochemical
molecules. Applying attribute masking achieves better performance on denser graphs.
Node dropping and subgraph extraction are generally beneficial across all datasets.
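A minimal NumPy sketch of the NT-Xent loss over two batches of graph-view embeddings; the temperature value and the use of in-batch negatives are illustrative (the original operates on learned GNN readouts):

```python
import numpy as np

def nt_xent(Z1, Z2, tau=0.5):
    """NT-Xent: row i of Z1 and row i of Z2 come from two augmented views
    of the same graph (the positive pair); every other graph in the batch
    serves as a negative sample."""
    def normalize(Z):
        return Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Z1, Z2 = normalize(Z1), normalize(Z2)
    sim = Z1 @ Z2.T / tau                        # cosine similarities / tau
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; minimize their negative log-probability.
    return -np.mean(np.diag(log_prob))
```

When the two views agree (diagonal similarities dominate), the loss is small; shuffling one view's rows breaks the positive pairing and increases it.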
414 Yu Wang, Wei Jin, and Tyler Derr
One way to use the information of the training nodes in designing pretext tasks is
developed in (Hu et al, 2020c), where the concept of context is introduced. The goal of
this work is to pre-train a GNN so that it maps nodes appearing in similar structural
contexts to nearby embeddings. For every node v_i, the r-hop neighborhood of v_i
contains all nodes and edges that are at most r hops away from v_i in the graph. The
context graph of v_i is a subgraph between r_1 hops and r_2 hops away from node v_i.
It is required that r_1 < r so that some nodes are shared between the neighborhood
and the context graph; these are referred to as context anchor nodes. Examples of
neighborhood and context graphs are shown in Fig. 18.6. Two GNN encoders are set
up: the main GNN encoder obtains the node embedding z^r_GNN,i based on the r-hop
neighborhood node features, and the context GNN obtains the node embeddings
of every node in the context anchor node set, which are then averaged to
get the node context embedding c_i. Then Hu et al (2020c) used negative sampling
to jointly learn the main GNN and the context GNN. In the optimization process,
positive samples refer to the case where the center node of the context and the
neighborhood graphs is the same, while negative samples refer to the case
where the center nodes of the context and the neighborhood graphs are different. The
learning objective is a binary classification of whether a particular neighborhood and
a particular context graph have the same center node, and the negative log-likelihood
loss is used as follows:
L_ssl = −(1/|K|) ∑_{(v_i, v_j) ∈ K} [ y_i log(σ((z^r_GNN,i)^⊤ c_j)) + (1 − y_i) log(1 − σ((z^r_GNN,i)^⊤ c_j)) ],
(18.42)
where yi = 1 for the positive sample where i = j while yi = 0 for the negative sample
where i ̸= j, with K denoting the set of positive and negative pairs, and σ is the
sigmoid function computing the probability.
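The binary objective of Eq. (18.42) can be sketched as follows; the row-wise layout of paired neighborhood and context embeddings is our choice:

```python
import numpy as np

def context_loss(z_pairs, c_pairs, labels):
    """Negative log-likelihood of Eq. (18.42): sigma(z^T c) should be high
    when a neighborhood embedding z and a context embedding c share the
    same center node (label 1) and low otherwise (label 0)."""
    scores = np.einsum('ij,ij->i', z_pairs, c_pairs)  # row-wise dot products
    p = 1.0 / (1.0 + np.exp(-scores))                 # sigmoid
    eps = 1e-9
    return -np.mean(labels * np.log(p + eps)
                    + (1 - labels) * np.log(1 - p + eps))
```

Flipping the labels on well-separated pairs should sharply increase the loss, which gives a quick sanity check on a pre-training setup.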
A similar idea employing the context concept in completing pretext tasks is also
proposed in (Jin et al, 2020d). Specifically, the context here is defined as:
where V_u and V_l denote the unlabeled and labeled node sets, Γ_{V_u}(v_i) denotes the
unlabeled nodes that are adjacent to node v_i, Γ_{V_u}(v_i, c) denotes the unlabeled nodes
that have been assigned class c and are adjacent to node v_i, N_{V_l}(v_i) denotes the
labeled nodes that are adjacent to node v_i, and Γ_{V_l}(v_i, c) denotes the labeled nodes
that are adjacent to node v_i and of class c. To generate labels for the unlabeled nodes
so as to calculate the context vector y_i for each node v_i, label propagation (LP) (Zhu,
2002) or the iterative classification algorithm (ICA) (Neville and Jensen, 2000) is
used to construct pseudo-labels for the unlabeled nodes in V_u. Then the pretext task is
approached by optimizing the following loss function:
L_ssl = (1/|V|) ∑_{v_i ∈ V} ℓ_CE(z_ssl,i, y_i),    (18.44)
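A minimal sketch of label propagation for constructing the pseudo-labels; the clamped-iteration variant shown here is one common formulation, not necessarily the exact one used in the paper:

```python
import numpy as np

def label_propagation(A, Y, labeled_mask, iters=50):
    """Repeatedly average neighbor label distributions over a row-stochastic
    transition matrix, clamping the labeled nodes each step, and return the
    argmax class per node as its pseudo-label."""
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
    P = A / deg                      # row-stochastic transition matrix
    F = Y.copy().astype(float)       # soft label matrix, one row per node
    for _ in range(iters):
        F = P @ F
        F[labeled_mask] = Y[labeled_mask]   # clamp the known labels
    return F.argmax(axis=1)
```

On a graph with two disconnected components, each seeded with one labeled node, every node should inherit its component's class.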
The main issue with the above pretext task is the error introduced by generating labels
via LP or ICA. Jin et al (2020d) further propose two methods
to improve the above pretext task. The first method replaces the procedure of
assigning labels to unlabeled nodes based on only one method, such as LP or ICA,
with assigning labels by ensembling results from multiple different methods. The
second method treats the initial labeling from LP or ICA as noisy labels and then
leverages an iterative approach (Han et al, 2019) to improve the context vectors,
which leads to significant improvements based on this correction phase.
One previous pretext task is to recover the topological distance between nodes.
However, calculating the shortest-path distance for all pairs of nodes, even
after sampling, is time-consuming. Therefore, Jin et al (2020d) replace the
pairwise distance between nodes with the distance between nodes and their
corresponding clusters. For each cluster, a fixed set of anchor/center nodes is
established. For each node, its distance to this set of anchor nodes is calculated.
The pretext task is to extract node features that encode the information of this
node2cluster distance.
Suppose k clusters are obtained by applying the METIS graph partitioning algorithm
(Karypis and Kumar, 1998), and the node with the highest degree is assumed
to be the center of the corresponding cluster; then each node v_i will have a cluster
distance vector d_i ∈ R^k, and the distance-to-cluster pretext task is completed by
optimizing:
L_ssl = (1/|V|) ∑_{v_i ∈ V} ℓ_MSE(z_ssl,i, d_i),    (18.45)
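The node2cluster distance vectors d_i can be computed with one BFS per cluster center; the METIS partitioning and the choice of centers are assumed given:

```python
import numpy as np
from collections import deque

def bfs_dist(adj_list, src, n):
    """Unweighted shortest-path distances from src to all n nodes."""
    dist = np.full(n, np.inf)
    dist[src] = 0
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj_list[u]:
            if dist[v] == np.inf:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def cluster_distance_vectors(adj_list, centers):
    """d_i[k] = shortest-path length from node i to the center of cluster k,
    i.e., one BFS-derived column per cluster center."""
    n = len(adj_list)
    return np.stack([bfs_dist(adj_list, c, n) for c in centers], axis=1)
```

This replaces the O(|V|²) all-pairs computation with k BFS traversals, which is the efficiency gain motivating the distance-to-cluster task.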
Aside from the graph topology and the node features, the distribution of the training
nodes and their training labels is another valuable source of information for
designing pretext tasks. One of the pretext tasks in (Jin et al, 2020d) requires
the node embeddings output by GNNs to encode the information of the topological
distance from any node to the training nodes. Assuming that the total number of
classes is p, for each class c ∈ {1, ..., p} and node v_i ∈ V, the average, minimum,
and maximum shortest path lengths from v_i to all labeled nodes in class c are
calculated and concatenated into d_i ∈ R^{3p}; then the objective is to optimize the
same regression loss as defined in Eq. 18.45.
The generating process of networks encodes abundant information for designing
pretext tasks. Hu et al (2020d) propose the GPT-GNN framework for generative
pre-training of GNNs. This framework performs attribute and edge generation to
enable the pre-trained model to capture the inherent dependency between node
attributes and graph structure. Assuming that the likelihood over the graph modeled
by the GNN is p(G; θ), which represents how the nodes in G are attributed and
connected, GPT-GNN aims to pre-train the GNN model by maximizing the graph
likelihood, i.e., θ* = max_θ p(G; θ). Given a permuted order, the log-likelihood is
factorized autoregressively, generating one node per iteration, as:
log p_θ(X, E) = ∑_{i=1}^{|V|} log p_θ(x_i, E_i | X_{<i}, E_{<i})    (18.46)
For all nodes generated before node i, their attributes X_{<i} and the edges
between these nodes E_{<i} are used to generate a new node v_i, including both its
attribute x_i and its connections with existing nodes E_i. Instead of directly assuming
that x_i and E_i are independent, they devise a dependency-aware factorization
mechanism to maintain the dependency between node attributes and edge existence. The
generation process can be decomposed into two coupled parts: (1) generating node
attributes given the observed edges, and (2) generating the remaining edges given
the observed edges and the generated node attributes. For computing the loss of
attribute generation, the generated node feature matrix X is corrupted by masking
some dimensions to obtain the corrupted version X̂^Attr, which is further fed together
with the generated edges into the GNN to get the embeddings Ẑ^Attr_GNN. Then, the
decoder Dec^Attr reconstructs the attributes from Ẑ^Attr_GNN. The attribute generation
loss is:
L^Attr_ssl = (1/|V|) ∑_{v_i ∈ V} ℓ_MSE(Dec^Attr(ẑ^Attr_GNN,i), x_i),    (18.47)
All the above pretext tasks are designed based on either node- or graph-level
supervision. However, a final line of research combines these two
sources of supervision to design pretext tasks, which we summarize in this section.
Veličković et al (2019) proposed to maximize the mutual information between
representations of high-level graphs and low-level patches. In each iteration, a
negative sample X̂, Â is generated by corrupting the graph through shuffling node
features and removing edges. Then a GNN-based encoder is applied to extract node
representations Z_GNN and Ẑ_GNN, also named the local patch representations.
The local patch representations are further fed into an injective readout function to
get the global graph representation z_GNN,G = READOUT(Z_GNN). Then the mutual
information between Z_GNN and z_GNN,G is maximized by optimizing the following
objective:

L_ssl = (1/(|P^+| + |P^−|)) ( ∑_{i=1}^{|P^+|} E_{(X,A)}[log σ(z_GNN,i^⊤ W z_GNN,G)]
        + ∑_{j=1}^{|P^−|} E_{(X̂,Â)}[log(1 − σ(z̃_GNN,j^⊤ W z_GNN,G))] ),    (18.49)
where |P^+| and |P^−| are the numbers of positive and negative pairs, σ stands
for a nonlinear activation function (PReLU is used in (Veličković et al, 2019)),
and z_GNN,i^⊤ W z_GNN,G calculates the weighted similarity between the patch
representation centered at node v_i and the graph representation. A linear classifier
is then used to classify nodes after the above contrastive pretext task.
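A minimal sketch of this patch-summary contrastive objective, simplifying the injective readout to a mean and scoring patches against the summary with a bilinear discriminator W (both simplifications are ours):

```python
import numpy as np

def dgi_loss(Z, Z_corrupt, W):
    """DGI-style objective: real patch representations Z should score high
    against the graph summary s, and corrupted patches Z_corrupt should
    score low, under a bilinear discriminator sigma(h^T W s)."""
    s = Z.mean(axis=0)               # readout simplified to a mean summary
    def score(H):
        return 1.0 / (1.0 + np.exp(-(H @ W @ s)))
    eps = 1e-9
    pos = np.log(score(Z) + eps).sum()            # positive patches
    neg = np.log(1 - score(Z_corrupt) + eps).sum()  # corrupted patches
    return -(pos + neg) / (len(Z) + len(Z_corrupt))
```

Corruption in practice is the feature-shuffling scheme described above; here `Z_corrupt` is simply any batch of mismatched patch embeddings.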
Similar to (Veličković et al, 2019), where the mutual information between the
patch representations and the graph representations is maximized, Hassani and
Khasahmadi (2020) propose another framework that contrasts the node representations
of one view with the graph representations of another view. The first view is
the original graph, and the second view is generated by a graph diffusion matrix. The
heat and the personalized PageRank (PPR) diffusion matrices are considered. A
GNN-based encoder followed by a shared projection head is applied to the nodes in
the original graph adjacency matrix and in the generated diffusion matrix to get two
different node embeddings Z^1_GNN and Z^2_GNN. Two different graph embeddings
z^1_GNN,G and z^2_GNN,G are further obtained by
applying a graph pooling function to the node representations (before the projection
head), followed by another shared projection head. The mutual information
between nodes and graphs in different views is maximized through:
L_ssl = −(1/|V|) ∑_{v_i ∈ V} (MI(z^1_GNN,i, z^2_GNN,G) + MI(z^2_GNN,i, z^1_GNN,G)),    (18.52)
where MI represents a mutual information estimator; four estimators are
explored: the noise-contrastive estimator, the Jensen-Shannon estimator, normalized
temperature-scaled cross-entropy, and the Donsker-Varadhan representation of the
KL-divergence. Note that the mutual information in Eq. 18.52 is averaged
over all graphs in the original work (Hassani and Khasahmadi, 2020). Addition-
ally, their results demonstrate that Jensen-Shannon estimator achieves better results
across all graph classification tasks, whereas in the node classification task, noise
contrastive estimation achieves better results. They also discover that increasing the
number of views does not increase the performance on downstream tasks.
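The PPR diffusion matrix used to build the second view has the closed form S = α(I − (1 − α)D^{−1/2} A D^{−1/2})^{−1}; a dense NumPy sketch (the function name and default α are ours):

```python
import numpy as np

def ppr_diffusion(A, alpha=0.2):
    """Personalized PageRank diffusion of a symmetric adjacency matrix A:
    S = alpha * (I - (1 - alpha) * D^{-1/2} A D^{-1/2})^{-1}."""
    n = A.shape[0]
    deg = np.maximum(A.sum(axis=1), 1e-12)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    A_norm = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * A_norm)
```

The dense inverse is O(n³) and only illustrative; large graphs use truncated power iterations or sparse approximations of the same series.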
18.7 Discussion
18.8 Summary
Abstract Graphs are an expressive and powerful data structure that is widely
applicable, due to their flexibility and effectiveness in modeling and representing
graph-structured data. They have become more and more popular in various fields,
including biology, finance, transportation, and social networks, among many others.
Recommender systems, one of the most successful commercial applications of
artificial intelligence, whose user-item interactions naturally fit into graph-structured
data, have also received much attention in the application of graph neural networks
(GNNs). We first summarize the most recent advancements of GNNs, especially in
recommender systems. Then we share our two case studies, dynamic GNN learning
and device-cloud collaborative learning for GNNs. We conclude with discussions
regarding the future directions of GNNs in practice.
19.1.1 Introduction
The Introduction of GNNs Graphs have a long history, originating from the Seven
Bridges of Königsberg problem in 1736 (Biggs et al, 1986). It is flexible to model
Yunfei Chu,
DAMO Academy, Alibaba Group, e-mail: [email protected]
Jiangchao Yao
DAMO Academy, Alibaba Group, e-mail: [email protected]
Chang Zhou
DAMO Academy, Alibaba Group, e-mail: [email protected]
Hongxia Yang
DAMO Academy, Alibaba Group, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 423
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_19
424 Yunfei Chu, Jiangchao Yao, Chang Zhou and Hongxia Yang
Random Walk-style
Early graph representation learning approaches (Perozzi et al, 2014; Tang et al,
2015b; Cao et al, 2015; Zhou et al, 2017; Ou et al, 2016; Grover and Leskovec,
2016) in the deep learning era are inspired by word2vec (Mikolov et al, 2013b), an
efficient word embedding method from the natural language processing community. These
19 Graph Neural Networks in Modern Recommender Systems 425
methods do not need any neighborhood for encoding, where EGO acts as an identity
mapping. The encoder ENC takes the node id in the graph as input and assigns a
trainable vector to each node.
Where these methods differ most is the learning objective. Approaches
like DeepWalk, LINE, and Node2vec use different random walk strategies to create
positive node pairs (u, v) as training examples, and estimate the probability of
visiting v given u, p(v|u), as a multinomial distribution,

p(v|u) = exp(sim(u, v)) / ∑_{v′} exp(sim(u, v′)),
where sim is a similarity function. They exploit an approximated Noise Contrastive
Estimation (NCE) loss (Gutmann and Hyvärinen, 2010), known as skip-gram with
negative sampling and originating in word2vec, to reduce the high computation cost.
Here q_neg is a proposed negative distribution, which impacts the variation of the
optimization target (Yang et al, 2020d). Note that this formula can also be
approximated with sampled softmax (Bengio and Senécal, 2008; Jean et al, 2014),
which in our experience performs better in top-k recommendation tasks as the node
number becomes extremely large (Zhou et al, 2020a).
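Skip-gram with negative sampling can be sketched for a single positive pair (u, v) and a set of sampled negatives; treating the negatives uniformly (ignoring the exact form of q_neg) is a simplification:

```python
import numpy as np

def sgns_loss(z_u, z_v, Z_neg):
    """Skip-gram with negative sampling: pull the embeddings of a
    random-walk co-occurring pair (u, v) together while pushing u away
    from sampled negatives, approximating the softmax p(v|u)."""
    def log_sigmoid(x):
        return -np.log1p(np.exp(-x))
    pos = log_sigmoid(z_u @ z_v)             # attract the positive pair
    neg = log_sigmoid(-(Z_neg @ z_u)).sum()  # repel each sampled negative
    return -(pos + neg)
```

In a full trainer this loss is averaged over pairs drawn from the random walks, with negatives resampled from q_neg at every step.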
These learning objectives have connections with traditional node proximity measures
in the graph mining community. GraRep (Cao et al, 2015) and APP (Zhou et al,
2017) borrow the idea from (Levy and Goldberg, 2014) and point out that these
random-walk-based methods are equivalent to preserving corresponding
transformations of the adjacency matrix A, such as personalized PageRank.
Matrix Factorization-style
HOPE (Ou et al, 2016) provides a generalized matrix form for other types of node
proximity measures, e.g., Katz and Adamic-Adar, and adopts matrix factorization to
learn embeddings that preserve these proximities. NetMF (Qiu et al, 2018) unifies
several classic graph embedding methods in the framework of matrix factorization
and provides connections between the DeepWalk-like approaches and the theory of
graph Laplacians.
GNN-style
Graph neural networks (Kipf and Welling, 2017b; Scarselli et al, 2008) provide
an end-to-end semi-supervised learning paradigm for what was previously modeled
via label propagation. They can also be used to learn node representations in an
unsupervised manner like the above graph embedding methods. GNN-style approaches
for unsupervised learning, compared to DeepWalk-like methods, are more powerful
in capturing local structure, e.g., they have at most the power of the WL test (Xu
et al, 2019d). Downstream link prediction tasks that require local-structure-aware
representations or cooperation with node features may benefit more from GNN-style
approaches.
The EGO operator collects and constructs the receptive field of each node. For
GCN (Kipf and Welling, 2017b), a full k-layer neighborhood is required for each
node, making it hard to work on large graphs, which usually follow power-law degree
distributions. GraphSage (Hamilton et al, 2017b) instead samples a fixed-size
neighborhood in each layer, mitigating this problem, and can scale to large graphs.
LCGNN (Qiu et al, 2021) samples a local cluster around each node by short random
walks with theoretical guarantees.
Then different kinds of Aggregation functions are proposed within this receptive
field. GraphSage investigates several neighborhood aggregation alternatives,
including mean/max pooling and LSTMs. GAT (Veličković et al, 2018) utilizes
self-attention to perform the aggregation, which shows stable and superior
performance in many graph benchmarks. GIN (Xu et al, 2019d) has a slightly different
aggregation function, whose discriminative/representational power is proven to be
equal to that of the WL test. As the link prediction task may also consider structural
similarity between two nodes besides their distance, such local-structure-preserving
methods may achieve good performance on networks that have obvious local structural
patterns. The learning objectives of GNN-style approaches are similar to those of
random-walk-style ones.
in which the Utility function could be considered as maximizing click-through rate,
GMV, or a mixture of multiple objectives (Ribeiro et al, 2014; McNee et al, 2006).
A modern commercial recommender system, especially one with millions
of end-users and items, adopts a multi-stage modeling pipeline as a tradeoff
between the business goals and efficiency, given the constraints of limited
computing resources. Different stages make different simplifications of the data
organization and objectives, which many research papers do not state clearly.
In the following, we first review several simplifications of the industrial
recommendation problem setting that are clean enough for the research community.
Then we describe the multi-stage pipeline and the problem at each stage, review
classic methods to handle the problem, and revisit how GNNs are applied in existing
methods, trying to give an objective view of these methods.
1 We indicate the short-term objective as the objective in the sense of each request response. Here
efficient retrieval. The most widely used measurement for this phase is the top-k
hit ratio.
• Rank Phase. The problem space is quite different from that of the retrieval
phase, since the rank phase needs to give precise comparisons within a much smaller
subspace, instead of recalling as many good items as possible from the entire item
candidate set. Restricted to a small number of candidates, it is capable of exploiting
more complex methods over the user-item interactions in acceptable response
time.
• Re-rank Phase. Considering the effects studied in the discrete choice model (Train,
1986), the relationships among the displayed items may have significant impacts
on the user behavior. This creates opportunities from the combinatorial
optimization perspective, i.e., how to choose a combination of items
that maximizes the overall utility of the recommendation list.
The above stages can be adjusted according to the characteristics of the
recommendation scenario. For example, if the candidate set contains only hundreds
or thousands of items, the recall phase is not necessarily required, as the computing
power is usually enough to rank all candidates at once. The re-rank phase is also
unnecessary if the number of items per request is small.
We summarize in Table 19.1 the different data simplifications made in different
problem settings with their corresponding pipeline stages.
Neighborhood-based Approaches
Item-based collaborative filtering first identifies a set of similar items for each of the
items that the user has clicked/purchased/rated, and then recommends top-N items
by aggregating the similarities. User-based CF, on the other hand, identifies similar
users and then performs aggregation on their clicked items.
The key part of neighborhood-based approaches is the definition of the similarity
metric. Taking item-based CF as an example, top-k heuristic approaches calculate
item-item similarity from the user-item interaction matrix M, e.g., Pearson
correlation or cosine similarity. Storing |I|×|I| similarity score pairs is intractable.
Instead, to help produce a top-k recommendation list efficiently, neighborhood-based
k-nearest-neighbor CF usually memorizes the top few similar items for each item,
resulting in a sparse similarity matrix C. Beyond the heuristics, SLIM (Ning and
Karypis, 2011) learns such a sparse similarity matrix by reconstructing M via MC
with zero-diagonal and sparsity constraints on C.
One drawback of storing only the sparse similarity matrix is that it cannot identify
less-similar relationships, which restricts its downstream applications.
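Item-based top-k CF as described above can be sketched as follows; dense matrices are used for clarity (a real system would use sparse structures and approximate nearest-neighbor search):

```python
import numpy as np

def topk_item_similarity(M, k=2):
    """Item-item cosine similarity from the user-item matrix M, keeping
    only the top-k entries per item to obtain the sparse matrix C."""
    norms = np.linalg.norm(M, axis=0, keepdims=True)
    S = (M.T @ M) / np.maximum(norms.T * norms, 1e-12)
    np.fill_diagonal(S, 0.0)         # an item is not its own neighbor
    C = np.zeros_like(S)
    for i in range(S.shape[0]):
        top = np.argsort(S[i])[-k:]
        C[i, top] = S[i, top]
    return C

def recommend(M, C, user, n=3):
    """Score items by aggregating similarities of the items the user has
    interacted with, mask already-seen items, and return the top-n."""
    scores = M[user] @ C
    scores[M[user] > 0] = -np.inf    # never re-recommend seen items
    return np.argsort(scores)[::-1][:n]
```

The stored matrix C is what a production system would keep in a key-value store, one short list of (neighbor, score) pairs per item.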
Model-based Approaches
Model-based methods learn similarity functions between users and items by
optimizing an objective function. Matrix Factorization assumes as its prior that the
user-behavior matrix is low-rank, i.e., all users' tastes can be described by linear
combinations of a few latent style factors. The prediction for a user's preference on
an item can be calculated as the dot product of the corresponding user and item factors.
The matrix completion setting also has an equivalent form as a bipartite graph,
G = (V , E ), (19.3)
where V = U ∪ I , i.e., the union of the user set U and the item set I , and
E = {(u, i)|i ∈ Iu+ , u ∈ U }, i.e., the collection of the edges between u and his/her
clicked i. Then the point-wise user-item preference estimation can be viewed as a
link prediction task in this user-item interaction bipartite graph.
Heuristic graph mining approaches, which fall into the category of neighborhood-based
CF, are widely used in the retrieval phase. We can calculate user-item similarity
by performing graph mining tasks like Common Neighbors, Adamic-Adar (Adamic and
Adar, 2003), Katz (Katz, 1953), or Personalized PageRank (Haveliwala, 2002) over
the original bipartite graph, or calculate item-item similarity on its induced item-item
correlation graph (Zhou et al, 2017; Wang et al, 2018b), which is then used in
the final user preference aggregation.
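As an example of these heuristics, Adamic-Adar scores for all node pairs can be computed with one weighted matrix product; a dense sketch (real systems operate on sparse graphs):

```python
import numpy as np

def adamic_adar(A):
    """Adamic-Adar scores for all node pairs: common neighbors weighted by
    the inverse log-degree of each shared neighbor z, i.e.
    AA(u, v) = sum over common neighbors z of 1 / log|N(z)|."""
    deg = A.sum(axis=0)
    # Neighbors of degree 1 contribute nothing (1/log(1) is undefined).
    w = np.where(deg > 1, 1.0 / np.log(np.maximum(deg, 2)), 0.0)
    return A @ np.diag(w) @ A.T
```

Down-weighting high-degree shared neighbors is the whole point of the heuristic: a rare common neighbor is stronger evidence of similarity than a hub everyone connects to.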
Graph embedding techniques for industrial recommender system are first ex-
plored in (Zhou et al, 2017) and its successor with side information support (Wang
et al, 2018b). They construct an item correlation graph of billions of edges from
user-item click sequences organized by sessions. Then a deepwalk-style graph em-
bedding method is applied to calculate the item representations, which then provides
item-item similarities in the retrieval phase. Though it is shown in (Zhou et al, 2017)
that embedding-based methods have an advantage in scenarios where the top-k
heuristics cannot provide any item-pair similarity, it is still debatable whether the
similarities given by graph embedding methods can outperform carefully designed
heuristic ones when all the top-k similar items can be retrieved.
We also note that graph embedding techniques can be regarded as matrix
factorization of a transformation of the graph adjacency matrix A, as discussed in
earlier sections. That means, theoretically, the difference between graph embedding
techniques and basic matrix factorization lies in their priors, i.e., which matrix is
assumed to be the best to factorize. Factorizing transformations of A amounts to
fitting how the system will evolve in the future, while traditional MF methods
factorize the current static system.
Graph neural networks for industrial recommender system are first studied
in (Ying et al, 2018b), whose backend model is a variant of GraphSage. PinSage
computes the L1 normalized visit counts of nodes during random walks started
from a given node v, and the top-k counted nodes are regarded as v’s receptive field.
Weighted aggregation is performed among the nodes according to their normalized
counts. As GraphSage-like approaches do not suffer from overly large neighborhoods,
PinSage is scalable to web-scale recommender systems with millions of users and
items. It adopts a triplet loss instead of the NCE variants that are usually used in
other papers.
We want to discuss further the choice of negative examples in representation-learning-based recommender models, including GNNs, in the retrieval phase. As the retrieval phase aims to retrieve the k most relevant items from the entire item space, it is crucial to keep an item's global position far from all irrelevant items. In an industrial system with an extremely large candidate set, we find the performance of any representation-based model to be very sensitive to the choice of negative samples and the loss function. Though there seems to be a trend of mixing all kinds of hand-crafted hard examples (Ying et al, 2018b; Huang et al, 2020b; Grbovic and Cheng, 2018) into a binary cross-entropy or triplet loss, unfortunately there is no theoretical support that can lead us in the right direction. In practice, we find it a good choice to apply sampled softmax (Jean et al, 2014; Bengio and Senécal, 2008) or InfoNCE (Zhou et al, 2020a) in the retrieval phase with an extremely large candidate set, where the latter also has a debiasing effect.
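A plain InfoNCE-style loss over one positive and sampled negatives can be sketched as below (a simplified stand-in for the sampled-softmax and InfoNCE objectives cited above; the temperature and the toy vectors are illustrative):

```python
import math

def info_nce(user_vec, pos_item, neg_items, temperature=0.1):
    """InfoNCE with sampled negatives: -log p(pos | {pos} ∪ negs)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    logits = [dot(user_vec, pos_item) / temperature]
    logits += [dot(user_vec, n) / temperature for n in neg_items]
    # log-sum-exp with max subtraction for numerical stability
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]  # negative log-softmax of the positive

u = [1.0, 0.0]
pos = [0.9, 0.1]
negs = [[-0.8, 0.2], [0.0, -1.0]]
loss = info_nce(u, pos, negs)
```

A well-placed positive (high dot product with the user vector) yields a small loss, while swapping the positive with a far-away negative makes the loss large.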
GNNs are a useful tool for incorporating relational features of users and items.
KGCN (Wang et al, 2019e) enhances the item representation by performing ag-
gregations among its corresponding entity neighborhood in a knowledge graph.
KGNN-LS (Wang et al, 2019c) further poses a label smoothness assumption, which
posits that similar items in the knowledge graph are likely to have similar user preferences. It adds a regularization term to help learn such a personalized weighted knowledge graph.
19 Graph Neural Networks in Modern Recommender Systems 431
KGAT (Wang et al, 2019j) shares a generally similar idea with KGCN; the only main difference is an auxiliary loss for knowledge graph reconstruction.
Although many more papers discuss how to fuse external knowledge and relationships of other entities, all arguing that it benefits downstream recommendation tasks, one should seriously consider whether one's system needs such external knowledge, or whether it would introduce more noise than benefit.
Fig. 19.1: (a) Dynamic sequential graphs in recommendation. (b) An example of a user's 3-depth DSG.
19.2.2.1 Overview
Fig. 19.2: Framework of the proposed DSGL method. DSGL constructs DSGs for
the target user u (left) and the candidate item i (right) respectively. Their representa-
tions are refined with multiple aggregation layers, each of which consists of a time-
aware sequence encoding layer and a second-order graph attention layer. DSGL gets
the final representations via layer combination followed by an MLP-based predic-
tion layer. Modules of the same color share the same set of parameters.
Based on the constructed user-item interaction DSG, we propose the edge learn-
ing model named Dynamic Sequential Graph Learning (DSGL), as illustrated in
Figure 19.2. The basic idea of DSGL is to perform graph convolution iteratively on
the DSGs for the target user and the candidate item on their corresponding devices,
by aggregating the embeddings of neighbors as the new representation of a target
node. The aggregator consists of two parts: (1) the time-aware sequence encoding
that encodes the behavior sequence with time information and temporal dependency
captured; and (2) the second-order graph attention that activates the related behavior
in the sequence to eliminate noisy information. Besides the above two components,
we also propose an embedding layer that initializes user, item, and time embed-
dings, a layer combination module that combines the embeddings of multiple layers
to achieve final representations, and a prediction layer that outputs the prediction
score.
There are four groups of inputs in the proposed DSGL: the target user $u$, the candidate item $i$, the $k$-depth DSG of the target user $G^{k}_{u,t}$, and the $(k-1)$-depth DSG of the candidate item $G^{k-1}_{i,t}$. For each field of discrete features, such as age, gender, category, brand, and ID, we represent it as an embedding matrix. By concatenating all fields of features, we obtain the node feature of items, denoted by $f_{\mathrm{item}} \in \mathbb{R}^{d_i}$. Similarly, $f_{\mathrm{user}} \in \mathbb{R}^{d_u}$ represents the concatenated embedding vectors of the user fields. As for the interaction timestamps in a DSG, we compute the time interval between each interaction time and its parent interaction time as the time decay. Given a historical behavior sequence $S_{u,t}$ of user $u$ at timestamp $t$, each interaction $(u, i, \tau) \in S_{u,t}$ corresponds to a time decay $\Delta_{(u,i,\tau)} = t - \tau$. Following (Li et al, 2020g), we transform the continuous time decay values into discrete features by mapping them to a series of buckets with the ranges $[b^0, b^1), [b^1, b^2), \dots, [b^l, b^{l+1})$, where the base $b$ is a hyper-parameter. Then, by performing an embedding lookup, we obtain the time decay embedding, denoted by $f_{\mathrm{time}} \in \mathbb{R}^{d_t}$.
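The bucketization step can be sketched as follows (the function name, the cap on the number of buckets, and the handling of decays below $b^0$ are our assumptions):

```python
import math

def time_bucket(delta, base=2.0, num_buckets=10):
    """Map a continuous time decay Δ to a bucket index using the
    ranges [b^0, b^1), [b^1, b^2), ...; Δ < b^0 falls into bucket 0."""
    if delta < base:
        return 0
    return min(int(math.log(delta, base)), num_buckets - 1)

# Toy embedding table: one learnable vector per bucket (d_t = 4).
d_t = 4
table = [[0.01 * (i + j) for j in range(d_t)] for i in range(10)]

# Δ = 9 with base 2 falls into [2^3, 2^4), i.e., bucket 3.
f_time = table[time_bucket(delta=9.0, base=2.0)]
```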
The nodes at each layer of DSGs are in time order, which reflects the time-varying
preference of users as well as the popularity evolution of items. Thus we perform
sequence modeling as a part of GNN to capture the dynamics of the interaction se-
quences. We design a time-aware sequential encoder to utilize the time information
explicitly. For each interaction (u, i,t), we have the historical behavior sequence Su,t
of user $u$ and $S_{i,t}$ of item $i$. For sequence $S_{u,t}$, by feeding each interacted item along with its time decay into the embedding layer, the behavior embedding sequence is formed as $\{e_{i,\tau} \mid (i, \tau) \in S_{u,t}\}$, where $e_{i,\tau} = [f_{\mathrm{item}_i}; f_{\mathrm{time}_\tau}] \in \mathbb{R}^{d_i + d_t}$ is the embedding of item $i$ in the sequence. Similarly, for sequence $S_{i,t}$, we have the embedding sequence $\{e_{u,\tau} \mid (u, \tau) \in S_{i,t}\}$, where $e_{u,\tau} = [f_{\mathrm{user}_u}; f_{\mathrm{time}_\tau}] \in \mathbb{R}^{d_u + d_t}$. We take the obtained embeddings as the zeroth-layer inputs of the time-aware sequence encoder, i.e., $x^{(0)}_{u,t} = e_{u,t}$ and $x^{(0)}_{i,t} = e_{i,t}$.
For ease of notation, we will drop the superscript in the rest of the following two
subsections.
In the time-aware sequence encoding, we infer the hidden state of each node in
the behavior sequence step by step in a RNN-based manner. Given the behavior
sequences $S_{u,t}$ and $S_{i,t}$, we represent the $j$-th item's hidden state and input in the sequence $S_{u,t}$ as $h_{\mathrm{item}_j}$ and $x_{\mathrm{item}_j}$, and the $j$-th user's hidden state and input in the sequence $S_{i,t}$ as $h_{\mathrm{user}_j}$ and $x_{\mathrm{user}_j}$. The forward formulas are
$$h_{\mathrm{item}_j} = H_{\mathrm{item}}(h_{\mathrm{item}_{j-1}}, x_{\mathrm{item}_j}); \qquad h_{\mathrm{user}_j} = H_{\mathrm{user}}(h_{\mathrm{user}_{j-1}}, x_{\mathrm{user}_j}), \qquad (19.4)$$
where Huser (·, ·) and Hitem (·, ·) represent the encoding functions specific to user and
item, respectively. We adopt the long short-term memory (LSTM) (Hochreiter and
Schmidhuber, 1997) as the encoder instead of the Transformer (Vaswani et al, 2017),
434 Yunfei Chu, Jiangchao Yao, Chang Zhou and Hongxia Yang
since LSTM can utilize time feature to control the information to be propagated
with the time decay feature as inputs. After the time-aware sequence encoding, we
obtain the corresponding hidden states sequence of historical behavior sequence
$S_{u,t}$ of user $u$ and $S_{i,t}$ of item $i$; these time-aware sequence encoders are denoted $\mathrm{LSTM}_{\mathrm{item}}$ and $\mathrm{LSTM}_{\mathrm{user}}$ in Eq. 19.10.
In practice, there may exist noisy neighbors, whose interest or audience is irrele-
vant to the target node. To eliminate the noise brought by the unreliable nodes, we
propose an attention mechanism to activate related nodes in the behavior sequence.
Traditional graph attention mechanisms, like GAT (Veličković et al, 2018), compute attention weights between the central node and its neighbor nodes, which indicate the importance of each neighbor node to the central node. Although they perform
well on the node classification task, they may increase noise diffusion for recom-
mendation when there exists an unreliable connection.
To address the above problem, we propose a graph attention mechanism that uses
both the parent node of the central node and the central node itself to build the query
and takes the neighbor nodes as the key and value. Since we use the parent node of
the central node to enhance the expressive power of the query, which is connected
to the key node with two hops, we name it second-order graph attention. The parent
node of the central node can be seen as a complement when the central node is
unreliable, thus improving the robustness.
Following the scaled dot-product attention (Vaswani et al, 2017), the attention function is defined as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \qquad (19.6)$$
where $Q$, $K$ and $V$ represent the query, key and value, respectively, and $d$ is the dimension of $K$ and $Q$. Multi-head attention is defined accordingly, and the resulting attention-based aggregations are
$$x_{u,t} = \mathrm{ATT}_{\mathrm{item}}(\{h_{i,\tau} \mid (i, \tau) \in S_{u,t}\}); \qquad x_{i,t} = \mathrm{ATT}_{\mathrm{user}}(\{h_{u,\tau} \mid (u, \tau) \in S_{i,t}\}). \qquad (19.9)$$
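Eq. 19.6 can be sketched in a few lines (a single-head version; building the query from the central node and its parent, as in the second-order attention, is assumed to happen upstream):

```python
import numpy as np

def scaled_dot_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V  (Eq. 19.6)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# One query vector attending over three key/value pairs (the hidden
# states of a toy behavior sequence).
Q = np.array([[1.0, 0.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
V = np.array([[1.0], [2.0], [3.0]])
out, w = scaled_dot_attention(Q, K, V)
```

The key aligned with the query receives the largest weight, and the weights in each row sum to one.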
The core idea of GCN is to learn representations for nodes by performing convolution over their neighborhoods. In DSGL, we stack the time-aware sequence encoding and the second-order graph attention, and the aggregator can be represented as:
$$x^{(k+1)}_{u,t} = \mathrm{ATT}_{\mathrm{item}}(\mathrm{LSTM}_{\mathrm{item}}(\{x^{(k)}_{i,\tau} \mid (i, \tau) \in S_{u,t}\})); \qquad x^{(k+1)}_{i,t} = \mathrm{ATT}_{\mathrm{user}}(\mathrm{LSTM}_{\mathrm{user}}(\{x^{(k)}_{u,\tau} \mid (u, \tau) \in S_{i,t}\})). \qquad (19.10)$$
Different from traditional GCN models that use the last layer as the final node rep-
resentation, inspired by (He et al, 2020), we combine the embeddings obtained at
each layer to form the final representation of a user (an item):
$$\hat{x}_{u,t} = \frac{1}{K_u}\sum_{k=1}^{K_u} x^{(k)}_{u,t}; \qquad \hat{x}_{i,t} = \frac{1}{K_i}\sum_{k=1}^{K_i} x^{(k)}_{i,t}, \qquad (19.11)$$
where $K_u$ and $K_i$ denote the numbers of DSGL layers for user $u$ and item $i$, respectively.
Given an interaction triplet $(u, i, t)$, we can predict the probability of the user interacting with the item as:
$$\hat{y} = F(u, i, G^{(k)}_{u,t}, G^{(k-1)}_{i,t}; \Theta) = \mathrm{MLP}([e_{u,t}; e_{i,t}; \hat{x}_{u,t}; \hat{x}_{i,t}]), \qquad (19.12)$$
where MLP(·) represents the MLP layer and Θ denotes the network parameters. We
adopt the cross-entropy loss function:
$$\mathcal{L} = -\sum_{(u,i,t) \in \mathcal{D}} \big( y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \big), \qquad (19.13)$$
where $\mathcal{D}$ is the set of training samples and $y \in \{0, 1\}$ denotes the real label. The
algorithm procedure is presented in Algorithm 1.
We evaluate our methods on the real-world Amazon product datasets2 and use five subsets. The widely used metrics for the CTR prediction task, i.e., AUC (the
area under the ROC curve) and Logloss, are adopted. The compared recommen-
dation methods can be grouped into five categories, including conventional meth-
ods (SVD++ (Koren, 2008) and PNN (Qu et al, 2016)), sequential methods with
user behaviors (GRU4Rec (Hidasi et al, 2015), CASER (Tang and Wang, 2018),
ATRANK (Zhou et al, 2018a) and DIN (Zhou et al, 2018b)), sequential methods
with user and item behaviors (Topo-LSTM (Wang et al, 2017b), TIEN (Li et al,
2020g) and DIB (Guo et al, 2019a)), static-graph-based methods (NGCF (Wang
et al, 2019k) and LightGCN (He et al, 2020)), and dynamic-graph-based method
(SR-GNN (Wu et al, 2019c)).
2 http://snap.stanford.edu/data/amazon/productGraph/
DSGL consistently outperforms all other baselines, demonstrating its effectiveness. The sequential models
outperform the conventional methods by a large margin, proving the effectiveness of
capturing temporal dependency in recommendation. The sequential methods which
model both user behaviors and item behaviors outperform the methods that only use
the user behavior sequences, which verifies the importance of both user- and item-
side behavior information. The performance of the static-graph-based methods, in-
cluding LightGCN and NGCF, is not competitive. The reasons are twofold. First,
these methods ignore the new interactions in the testing set in the inference phase.
Second, since they do not model the temporal dependency of interactions, they
cannot capture the evolving interests, degrading the performances compared with
sequential models. The session-graph-based method SR-GNN outperforms static-
graph-based methods, because SR-GNN incorporates all the interacted items before
the current moment into graphs dynamically. However, it underperforms the se-
quential methods. One possible reason could be that the ratio of repeated items in
the sequences is low in the Amazon datasets, and the transitions of items are not
complex enough to be modeled as graphs.
To show the effectiveness of the graph structure and layer combination, we compare
the performance of DSGL and its variant DSGL w/o LC that uses the last layer
instead of the combined layer as the final representation w.r.t different numbers
of layers. Focusing on DSGL with layer combination, the performance gradually
improves with the increase of layers. We attribute the improvement to the collab-
orative information carried by the second-order and third-order connectivity in the
graph structure. Comparing DSGL and DSGL w/o LC, we find that removing the layer combination degrades the performance considerably, which demonstrates the effectiveness of layer combination.
Comparing the variant DSGL w/o time in IBH with the default DSGL, we observe that removing the time
information on either user or item behavior side will cause performance degradation.
DSGL outperforms DSGL w/o Seq ENC, confirming the importance of temporal
dependency carried by the historical behavior sequence.
Recently, several works (Sun et al, 2020e; Cai et al, 2020a; Gong et al, 2020; Yang
et al, 2019e; Lin et al, 2020e; Niu et al, 2020) have explored the on-device comput-
ing advantages in recommender systems. This drives the development of on-device
GNNs, e.g., DSGL in the previous section. However, these early works either only
consider cloud modeling, or on-device inference, or the aggregation of temporal on-device training pieces to handle privacy constraints. Little work has explored device modeling and cloud modeling jointly to benefit both sides for GNNs.
To bridge this gap, we introduce a Device-Cloud Collaborative Learning framework
as shown in Figure 19.3.

Fig. 19.3: The general DCCL framework for recommendation. The cloud side is responsible for learning the centralized cloud GNN model via model-over-models distillation from the personalized on-device GNN models. The device receives the cloud GNN model to conduct on-device personalization. We propose MoMoDistill and MetaPatch to instantiate each side respectively.

Given a recommendation dataset $\{(x_n, y_n)\}_{n=1,\dots,N}$, we target to learn a GNN-based mapping function $f: x_n \to y_n$ on the cloud side. Here, $x_n$ is the graph feature that contains all available candidate features and user context, $y_n$ is the user's implicit feedback (click or not) on the corresponding candidate, and $N$ is the sample number. On the device side, each device (indexed by $m$) has its own local dataset, $\{(x^{(m)}_n, y^{(m)}_n)\}_{n=1,\dots,N^{(m)}}$. We add a few parameter-efficient patches (Yuan et al, 2020a) to the cloud GNN model $f$ (freezing its parameters on the device side) for each device to build a new GNN $f^{(m)}: x^{(m)}_n \to y^{(m)}_n$. In the following, we will
present the practical challenges in the deployment and our solutions.
Although device hardware has improved greatly in recent years, devices are still too resource-constrained to learn a complete big model. Meanwhile, finetuning only the last few layers yields limited performance, since the feature basis of the pretrained layers is fixed. Fortunately, previous works have demonstrated that it is possible to achieve performance comparable to whole-network finetuning via patch learning (Cai et al, 2020b; Yuan et al, 2020a; Houlsby et al, 2019). Inspired by these works, we insert model patches on the basis of the cloud model $f$ for on-device personalization. Formally, the output of the $l$-th layer attached with one patch on the $m$-th device is expressed as
$$f^{(m)}_l(\cdot) = f_l(\cdot) + h^{(m)}_l(\cdot) \circ f_l(\cdot), \qquad (19.14)$$
where the LHS of Eq. 19.14 is the sum of the original $f_l(\cdot)$ and the patch response $h^{(m)}_l(\cdot) \circ f_l(\cdot)$. Here, $h^{(m)}_l(\cdot)$ is the trainable patch function and $\circ$ denotes function composition that treats the output of the previous function as the input. Note that the
model patch could have different neural architectures. Here, we do not explore its
variants but specify the same bottleneck architecture like (Houlsby et al, 2019).
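Eq. 19.14 can be sketched with toy stand-ins for the frozen layer $f_l$ and the patch $h^{(m)}_l$ (both functions here are illustrative, not the bottleneck architecture itself):

```python
def make_patched_layer(f_l, h_l):
    """Eq. 19.14: f_l^{(m)}(x) = f_l(x) + h_l^{(m)}(f_l(x)).
    f_l stays frozen; only the small patch h_l is trainable."""
    def patched(x):
        base = f_l(x)                                  # frozen cloud layer
        return [b + p for b, p in zip(base, h_l(base))]  # add patch response
    return patched

# Toy frozen layer and a toy patch (stand-ins for real neural modules).
f_l = lambda x: [2 * v for v in x]
h_l = lambda z: [0.1 * v for v in z]

f_l_m = make_patched_layer(f_l, h_l)
y = f_l_m([1.0, -1.0])
```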
Nevertheless, we empirically find that the parameter space of multiple patches is
still too large and easily overfits the sparse local samples. To overcome
this issue, we propose MetaPatch to reduce the parameter space. It is a kind of meta-learning method to generate parameters (Ha et al, 2017; Jia et al, 2016). Concretely, assume the parameters of each patch are denoted by $\theta^{(m)}_l$ (flattening all parameters in the patch into a vector). Then we can deduce the following decomposition:
$$\theta^{(m)}_l = \Theta_l * \hat{\theta}^{(m)}, \qquad (19.15)$$
where $\Theta_l$ is the globally shared parameter basis (frozen on the device and learned in the cloud) and $\hat{\theta}^{(m)}$ is the surrogate tunable parameter vector used to generate each patch parameter $\theta^{(m)}_l$ in the device GNN model $f^{(m)}$. To facilitate understanding, we term $\hat{\theta}^{(m)}$ the metapatch parameter. We keep the number of metapatch parameters to be learned for personalization much smaller than the number of patch parameters. Note that, regarding the pretraining of $\Theta_l$, we leave the discussion to the following section to avoid clutter, since it is learned on the cloud side. According to Eq. 19.15, we implement the patch parameter generation via the metapatch parameter $\hat{\theta}^{(m)}$ instead of directly learning $\theta^{(m)}$. To learn the metapatch parameter, we can leverage the local dataset to minimize the following loss function:
$$\min_{\hat{\theta}^{(m)}} \; \ell(y, \tilde{y})\big|_{\tilde{y} = f^{(m)}(x)}, \qquad (19.16)$$
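The decomposition of Eq. 19.15 can be sketched as follows (the sizes are illustrative): only the low-dimensional metapatch vector is tuned on the device, while the shared basis $\Theta_l$ stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)

num_patch_params = 256   # size of the full patch parameter vector θ_l^{(m)}
meta_dim = 8             # size of the metapatch parameter θ̂^{(m)}

# Globally shared basis Θ_l: learned in the cloud, frozen on the device.
Theta_l = rng.standard_normal((num_patch_params, meta_dim))

# Per-device tunable metapatch parameters (the only on-device variables).
theta_hat = rng.standard_normal(meta_dim)

# Eq. 19.15: generate the full patch parameters from the small vector.
theta_l_m = Theta_l @ theta_hat
```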
The conventional incremental training of the centralized cloud model follows the "model-over-data" paradigm. That is, when new training samples are collected from devices, we directly perform incremental learning based on the model trained on the earlier sample collection. The objective is formulated as follows:
$$\min_{W_f} \; \ell(y, \hat{y})\big|_{\hat{y} = f(x)}, \qquad (19.17)$$
where $W_f$ is the network parameter of the cloud GNN model $f$ to be trained. This is
an independent perspective without considering the device modeling. However, the
on-device personalization can actually be more powerful than the centralized cloud model in handling the corresponding local samples. Thus, the guidance from the on-
device models could be a meaningful prior to help the cloud modeling. Inspired
by this, we propose a “model-over-models” paradigm to simultaneously learn from
data and aggregate the knowledge from on-device models, to enhance the training of
the centralized cloud model. Formally, the objective with the distillation procedure is given in Eq. 19.18, and the auxiliary encoder $U(\hat{\theta}, u)$ of Eq. 19.19 is parameterized by tunable projection matrices $W^{(1)}, W^{(2)}, W^{(3)}$; we use $W_e$ to denote the collection $\{W^{(1)}, W^{(2)}, W^{(3)}\}$ for simplicity. To learn the global parameter basis,
we replace $\hat{\theta}$ by $U(\hat{\theta}, u)$ to simulate Eq. 19.15 in generating the model patch, i.e., $\Theta * U(\hat{\theta}, u)$, since $\hat{\theta}$ itself is too heterogeneous to be used directly. Then, combining $\Theta * U(\hat{\theta}, u)$ with the $f$ learned in the first distillation step, we can form a new proxy device model $\hat{f}^{(m)}$ (different from the $f^{(m)}$ in the patch generation). We leverage such a proxy $\hat{f}^{(m)}$ to directly distill knowledge from the true $f^{(m)}$ collected from devices, which optimizes $\Theta$ and the parameters of the auxiliary encoder:
$$\min_{\Theta, W_e} \; \ell(y, \hat{y}) + \beta\, \mathrm{KL}(\tilde{y}, \hat{y})\big|_{\hat{y} = \hat{f}^{(m)}(x),\; \tilde{y} = f^{(m)}(x)}, \qquad (19.20)$$
Eq. 19.18 and Eq. 19.20 progressively help learn the centralized cloud model and the
global parameter basis. We specially term this progressive distillation mechanism as
MoMoDistill to emphasize our “model-over-models” paradigm different from the
conventional “model-over-data” incremental training on the cloud side. Finally, in
Algorithm 3, we summarize the complete procedure of DCCL.
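The per-sample shape of the distillation objective in Eq. 19.20 can be sketched for binary click prediction, with the KL term taken between Bernoulli predictions (β and the numerical epsilon are illustrative choices):

```python
import math

def distill_loss(y, y_hat, y_tilde, beta=0.5, eps=1e-9):
    """ℓ(y, ŷ) + β·KL(ỹ ‖ ŷ): supervised cross-entropy plus a
    distillation term, where ỹ is the on-device model's prediction
    acting as the teacher for the cloud proxy's prediction ŷ."""
    ce = -(y * math.log(y_hat + eps) + (1 - y) * math.log(1 - y_hat + eps))
    kl = (y_tilde * math.log((y_tilde + eps) / (y_hat + eps))
          + (1 - y_tilde) * math.log((1 - y_tilde + eps) / (1 - y_hat + eps)))
    return ce + beta * kl

loss = distill_loss(y=1, y_hat=0.8, y_tilde=0.9)
```

Moving ŷ toward both the label and the teacher prediction lowers the loss, which is the "model-over-models" signal in miniature.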
One part of the data is used for the training of the backbone (DIN) and the other part for the training of DCCL. In the experiments, we conduct one-round DCCL-e and DCCL-m. Finally, DCCL-m is
used to compare with the six representative models. We find that the deep learning
based methods NeuMF and DeepFM usually outperform the conventional methods
MF and FM, and the sequence-based methods SASRec and DIN consistently out-
perform previous non-sequence-based methods. Our DCCL builds upon the best
baseline DIN and further improves its results. Specifically, DCCL shows about 2%
or more improvements in terms of NDCG@10, and at least 1% improvements in
terms of HitRate@10 on all three datasets. The performances on both small and
large datasets confirm the superiority of our DCCL.
We iteratively trace the performance of each round, evaluated on the last click of each user.
According to the results, we observe that frequent interactions achieve much better
performance than the infrequent counterparts. We speculate that, as MetaPatch and
MoMoDistill could promote each other at every round, the advantages in performance have been continuously strengthened with more frequent interactions. However, the side effect is that we have to frequently update the on-device models, which may introduce other uncertain crash risks. Thus, in real-world scenarios, we need to make a trade-off between performance and the interaction interval.
For the first study, we give the results of the one-round DCCL on the Taobao dataset and compare with DIN. From the results, we can observe progressive improvement after DCCL-e and DCCL-m, and DCCL-m acquires more benefit than DCCL-e in terms of the improvement. The gain behind DCCL-e is that MetaPatch customizes a personalized model for each user to improve their recommendation experience once new behavior logs are collected on the device, without the delayed update from the centralized cloud server. The further improvements from DCCL-m confirm the necessity of MoMoDistill to re-calibrate the backbone and the parameter basis in the long term. Without these two modules, the model degenerates to DIN, which performs worse than DCCL.
For the second ablation study, we explore the effect of the model patches in dif-
ferent layer junctions. In previous sections, we insert two patches (1st Junction, 2nd
Junction) in the two fully-connected layers respectively after the feature embedding
layer, and one patch (3rd Junction) to the layer before the last softmax transforma-
tion layer. In this experiment, we validate their effectiveness by keeping only one of them in one-round DCCL. Compared with the full model, we find that removing
the model patch would decrease the performance. The results suggest the patches in
the 1st and 2nd junctions are more effective than the one in the 3rd junction.
Certainly, we have witnessed a rising trend of GNNs being applied in various areas. We believe the following directions deserve more attention for GNNs to have wider impact in big data areas, especially in search, recommendation, and advertising.
• There is still a lot to understand about GNNs, but there have already been quite a few important results about how they work (Loukas, 2020; Xu et al, 2019d; Oono and Suzuki, 2020). Future research on GNNs should balance technical simplicity, high practical impact, and far-reaching theoretical insights.
• It is also great to see how GNNs can be applied for other real-world tasks (Wei
et al, 2019; Wang et al, 2019a; Paliwal et al, 2020; Shi et al, 2019a; Jiang and
Balaprakash, 2020; Chen et al, 2020o). For example, we see applications in fix-
ing bugs in Javascript, game playing, answering IQ-like tests, optimization of
TensorFlow computational graphs, molecule generation, and question genera-
tion in dialogue systems, among many others.
• It will become popular to see GNNs applied for knowledge graph reasoning
(Ren et al, 2020; Ye et al, 2019b). A knowledge graph is a structured way to
represent facts, where nodes and edges actually bear semantic meaning, such as an actor's name or the act of playing in a movie.
• Recently there are new perspectives on how we should approach learning graph
representations, especially considering the balance between local and global
information. For example, Deng et al (2020) present a way to improve running time and accuracy in the node classification problem for any unsupervised
embedding method. Chen et al (2019c) shows that if one replaces a non-linear
neighborhood aggregation function with its linear counterpart, which includes
degrees of the neighbors and the propagated graph attributes, then the perfor-
mance of the model does not decrease. This is aligned with previous statements
that many graph data sets are trivial for classification and raises a question of
the proper validation framework for this task.
• Algorithmic work on GNNs should be integrated more closely with system design, to empower end-to-end solutions that let users address their scenarios by bringing graphs into deep learning frameworks. Such a system should allow pluggable operators to adapt to the fast development of the GNN community and excel in graph building and sampling. As an independent and portable system, the interfaces of AliGraph (Zhu et al, 2019c) can be integrated with any tensor engine used for expressing neural network models. By co-designing flexible Gremlin-like interfaces for both graph query and sampling, users can customize data access patterns freely. Moreover, AliGraph also shows excellent performance and scalability.
Siliang Tang, Wenqiao Zhang, Zongshen Mu, Kai Shen, Juncheng Li, Jiacheng Li
and Lingfei Wu
Abstract Recently Graph Neural Networks (GNNs) have been incorporated into
many Computer Vision (CV) models. They not only bring performance improve-
ment to many CV-related tasks but also provide more explainable decomposition to
these CV models. This chapter provides a comprehensive overview of how GNNs
are applied to various CV tasks, ranging from single image classification to cross-
media understanding. It also provides a discussion of this rapidly growing field from
a frontier perspective.
Siliang Tang, College of Computer Science and Technology, Zhejiang University, e-mail: [email protected]
Wenqiao Zhang, College of Computer Science and Technology, Zhejiang University, e-mail: [email protected]
Zongshen Mu, College of Computer Science and Technology, Zhejiang University, e-mail: [email protected]
Kai Shen, College of Computer Science and Technology, Zhejiang University, e-mail: [email protected]
Juncheng Li, College of Computer Science and Technology, Zhejiang University, e-mail: [email protected]
Jiacheng Li, College of Computer Science and Technology, Zhejiang University, e-mail: [email protected]
Lingfei Wu, JD.COM Silicon Valley Research Center, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 447
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_20
20.1 Introduction
Recent years have seen the great success of Convolutional Neural Networks (CNNs) in Computer Vision (CV). However, most of these methods lack a fine-grained analysis of the relationships among visual data (e.g., related visual regions, adjacent video frames). For example, an image can be represented as a spatial map, while the regions in an image are often spatially and semantically dependent. Similarly, a video can be represented as a spatio-temporal graph, where each node in the graph represents a region of interest in the video and the edges capture relationships between such regions. These edges can describe the relations and capture the interdependence between nodes in the visual data. Such fine-grained dependencies are critical to perceiving, understanding, and reasoning about visual data. Therefore, graph neural networks can be naturally utilized to extract patterns from these graphs to facilitate the corresponding computer vision tasks.
This chapter introduces the graph neural network model in various computer
vision tasks, including specific tasks for image, video and cross-media (cross-
modal) (Zhuang et al, 2017). For each task, this chapter demonstrates how graph
neural networks can be adapted to and improve the aforementioned computer vision
tasks with representative algorithms.
Ultimately, to provide a frontier perspective, we also introduce some other distinctive GNN modeling methods and application scenarios in this subfield.
Nodes are essential entities in a graph. There are three kinds of methods to represent the nodes of an image $X \in \mathbb{R}^{h \times w \times c}$ or a video $X \in \mathbb{R}^{f \times h \times w \times c}$, where $(h, w)$ is the resolution of the original image, $c$ is the number of channels, and $f$ is the number of frames.
Firstly, it is possible to split the image or each frame of the video into regular grids, referring to Fig. 20.1, each of which is an image patch of resolution $(p, p)$ (Dosovitskiy et al, 2021; Han et al, 2020). Each grid then serves as a vertex of the visual graph, and neural networks are applied to obtain its embedding.
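The grid construction can be sketched with a reshape (a toy version; the patch size p is assumed to divide the spatial dimensions exactly):

```python
import numpy as np

def image_to_patches(x, p):
    """Split an (h, w, c) image into (h*w/p^2) flattened (p, p, c)
    patches, each of which becomes a vertex of the visual graph."""
    h, w, c = x.shape
    assert h % p == 0 and w % p == 0
    patches = (x.reshape(h // p, p, w // p, p, c)
                .transpose(0, 2, 1, 3, 4)     # group by patch location
                .reshape(-1, p * p * c))      # one row per patch
    return patches

# A toy 4x4 RGB "image" split into four 2x2 patches.
img = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
vertices = image_to_patches(img, p=2)   # 4 vertices, 12 features each
```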
20 Graph Neural Networks in Computer Vision 449

Fig. 20.1: Split an image into fixed-size patches and view the patches as vertexes

Secondly, some pre-processed structures, like those in Fig. 20.2, can be directly borrowed for vertex representation. For example, with an object detection framework such as Faster R-CNN (Ren et al, 2015) or YOLO (Heimer et al, 2019), the visual regions in the first column of the figure have already been processed and can be treated as vertexes in the
graph. We map different regions to the same dimensional features and feed them to
the next training step. As in the middle column of the figure, scene graph generation models (Xu et al, 2017a; Li et al, 2019i) not only achieve visual detection but also parse an image into a semantic graph consisting of objects and their semantic relationships, from which it is tractable to obtain vertexes and edges for downstream image or video tasks. In the last column, human joints linked by skeletons naturally form a graph, from which human action patterns can be learned (Jain et al, 2016b; Yan et al, 2018a).
Edges depict the relations between nodes and play an important role in graph neural networks. For a 2D image, the nodes can be linked by different spatial relations. For a clip of video stacked from continuous frames, temporal relations between frames are added besides the spatial ones within each frame. On the one hand, these relations can be fixed by predefined rules to train GNNs, referred to as static relations. On the other hand, learning the relations themselves (thought of as dynamic relations) is attracting more and more attention.
Capturing spatial relations is the key step for images and video. For static methods,
generated scene graphs (Xu et al, 2017a) and human skeletons (Jain et al, 2016b)
provide natural choices of edges between nodes in the visual graph described in Fig.
20.2. Recently, some works (Bajaj et al, 2019; Liu et al, 2020g) use a fully-connected
graph (every vertex is linked to all other ones) to model the relations among vi-
sual nodes and compute the union region of node pairs to represent edge features. Further-
more, self-attention mechanisms (Yun et al, 2019; Yang et al, 2019f) have been introduced
to learn the relations among visual nodes, with the main idea inspired by the Trans-
former (Vaswani et al, 2017) in NLP. Once edges are represented, either
spectral-based or spatial-based GNNs can be chosen for applications (Zhou et al, 2018c; Wu
et al, 2021d).
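As a minimal illustration of learning such dynamic relations, the following NumPy sketch builds an attention-style adjacency matrix from node features via scaled dot-product scores; the random features and the projection matrices are illustrative assumptions, not the parameters of any cited model.

```python
import numpy as np

def attention_adjacency(x, wq, wk):
    """Scaled dot-product attention over n visual nodes with features x (n x d).

    Returns a row-stochastic n x n matrix that can serve as a learned
    (dynamic) adjacency for a spatial GNN. wq and wk stand in for learned
    query/key projections."""
    q, k = x @ wq, x @ wk                        # (n, l) queries and keys
    scores = q @ k.T / np.sqrt(q.shape[1])       # (n, n) pairwise relation scores
    scores -= scores.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)      # row-wise softmax

rng = np.random.default_rng(0)
n, d, l = 5, 16, 8
A = attention_adjacency(rng.normal(size=(n, d)),
                        rng.normal(size=(d, l)),
                        rng.normal(size=(d, l)))
print(A.shape)  # (5, 5); each row sums to 1
```

In practice, such dense learned adjacencies are usually sparsified before message passing to keep the computation tractable.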
Fig. 20.4: A spatial-temporal graph by extracting nodes from each frame and allow-
ing directed edges between nodes in neighbouring frames
20 Graph Neural Networks in Computer Vision 451
To understand a video, the model must not only build spatial relations within a frame but
also capture temporal connections across frames. A series of methods (Yuan et al,
2017; Shen et al, 2020; Zhang et al, 2020h) link each node in the current frame
to nearby frames by semantic-similarity methods such as k-Nearest Neighbors to con-
struct temporal relations among frames. In particular, as shown in Fig. 20.4,
Jabri et al (2020) represent video as a graph using a Markov chain and learn a ran-
dom walk among nodes with dynamic adjustment, where nodes are image patches, and
edges are affinities (in some feature space) between nodes of neighboring frames.
Zhang et al (2020g) use regions as visual vertexes and evaluate the IoU (Intersection
over Union) between nodes in different frames to obtain the edge weights.
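The IoU-weighted temporal linking used by such methods can be sketched as follows; the box coordinates and the 0.1 threshold below are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def temporal_edges(boxes_t, boxes_t1, thresh=0.1):
    """Weighted edges (i, j, IoU) between boxes of frame t and frame t+1."""
    return [(i, j, iou(a, b))
            for i, a in enumerate(boxes_t)
            for j, b in enumerate(boxes_t1)
            if iou(a, b) > thresh]

edges = temporal_edges([(0, 0, 10, 10), (20, 20, 30, 30)],
                       [(1, 1, 11, 11), (50, 50, 60, 60)])
print(edges)  # only the first box overlaps its shifted copy in the next frame
```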
as G(V, E), where V is the vertex set and E is the edge set. The image is I, and
we formulate the regions as R = {f_i}_{i=1}^{n_I}, f_i ∈ R^d for a specific image I, where d
is the region feature's dimension. We will discuss these two parts and omit other
details.
Unlike previous attempts in closely related fields which build category-to-category graphs
(Dai et al, 2017; Niepert et al, 2016), the SGRN treats the candidate regions R as
graph nodes V and constructs a dynamic graph G on top of them. Technically, they
project the region features into the latent space z by:
z_i = φ(f_i)    (20.1)
where φ consists of two fully-connected layers with ReLU activation, z_i ∈ R^l, and l is the
latent dimension.
The region graph is constructed from the latent representation z as follows:
where S ∈ R^{n_r × n_r}. It is not appropriate to preserve all relations between region pairs, since
there are many negative (i.e., background) samples among the region proposals,
which may hurt the downstream task's performance. If the dense matrix S were used as the
graph adjacency matrix, the graph would be fully-connected, which leads to a computa-
tional burden or a performance drop, since most existing GNN methods work worse on
fully-connected graphs (Sun et al, 2019). To address this issue, the SGRN adopts kNN
to sparsify the graph (Chen et al, 2020n,o). In other words, for each row S_i ∈ R^{n_r} of the learned sim-
ilarity matrix, they keep only the K nearest neighbors (including the node itself)
and the associated similarity scores (i.e., they mask off the remaining similarity
scores). The learned graph adjacency is denoted as:
A = KNN(S) (20.3)
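A minimal sketch of this kNN sparsification, assuming a dense row-wise similarity matrix S is already given, could look like:

```python
import numpy as np

def knn_sparsify(S, k):
    """Keep the k largest entries in each row of a dense similarity matrix S
    (the self-similarity on the diagonal included) and zero out the rest,
    mirroring A = KNN(S) in Eq. (20.3)."""
    A = np.zeros_like(S)
    idx = np.argsort(-S, axis=1)[:, :k]          # top-k column indices per row
    rows = np.arange(S.shape[0])[:, None]
    A[rows, idx] = S[rows, idx]                  # copy only the kept scores
    return A

S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
A = knn_sparsify(S, k=2)
print(A)  # each row keeps exactly 2 nonzero entries
```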
where N(i) denotes the neighborhood of node i, µ(i, j) is the distance between nodes i and j,
calculated from their centers in a polar coordinate system, and ω_k(·) is the k-th
Gaussian kernel. The K kernels' outputs are then concatenated together and projected
to the latent space as follows:
where g(·) denotes the projection with non-linearity. Finally, hi is combined with
the original visual region feature fi to enhance classification and regression perfor-
mance.
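The Gaussian-kernel edge encoding described above can be sketched as follows; the kernel centers, the bandwidth, and the random stand-in for the learned projection g(·) are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel_edge(dist, centers, sigma=0.5):
    """Evaluate K Gaussian kernels omega_k on a scalar node distance.
    centers stand in for the kernel means, which the real model learns."""
    return np.exp(-((dist - centers) ** 2) / (2 * sigma ** 2))

# K = 4 kernels covering distances in [0, 3]; the projection g(.) is stood
# in for by a random matrix + ReLU, since the real model learns it end to end.
centers = np.linspace(0.0, 3.0, 4)
feats = gaussian_kernel_edge(1.0, centers)            # (4,) kernel responses
rng = np.random.default_rng(0)
h = np.maximum(rng.normal(size=(8, 4)) @ feats, 0.0)  # g(.): projection + ReLU
print(feats.shape, h.shape)  # (4,) (8,)
```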
where φ(·) is a convolutional neural network and h(·) is the one-hot label encoding.
Note that for unlabeled data, they replace h(·) with the uniform distribution over
the K-simplex.
Second, the graph topology is learned from the current layer's node embeddings, denoted
as x^k. The matrix S, modeling the distances between nodes in the embedding space, is given by:
where MLP(·) is a multilayer perceptron and abs(·) is the element-wise absolute value.
The adjacency matrix A is then calculated by normalizing each row of S with a softmax
operation.
Then a GNN layer is applied to encode the graph nodes with the learned topology A.
The GNN layer receives the node embedding matrix x^k and outputs the aggregated
node representations x^{k+1} as:
x_l^{k+1} = ρ( ∑_{B∈A} B x^k θ_{B,l}^k ),  l = d_1 . . . d_{k+1}    (20.9)
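A toy NumPy version of this layer, with ρ taken as ReLU and the operator family A consisting of the identity and one learned adjacency (both illustrative assumptions), is:

```python
import numpy as np

def gnn_layer(x, ops, thetas):
    """One layer in the spirit of Eq. (20.9): for each operator B in the
    family A, aggregate B @ x @ theta_B, sum over operators, apply rho."""
    out = sum(B @ x @ th for B, th in zip(ops, thetas))
    return np.maximum(out, 0.0)  # rho = ReLU

rng = np.random.default_rng(0)
n, dk, dk1 = 4, 6, 5
A = rng.random((n, n))
A /= A.sum(axis=1, keepdims=True)                   # a learned row-normalized adjacency
ops = [np.eye(n), A]                                # operator family A
thetas = [rng.normal(size=(dk, dk1)) for _ in ops]  # theta^k_B per operator
x_next = gnn_layer(rng.normal(size=(n, dk)), ops, thetas)
print(x_next.shape)  # (4, 5)
```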
Action recognition in video is a highly active area of research, which plays a crucial
role in video understanding. Given a video as input, the task of action recognition
is to recognize the action appearing in the video and predict the action category.
Over the past few years, modeling the spatio-temporal nature of video has been the
core of research in the field of video understanding and action recognition. Early
approaches to activity recognition such as hand-crafted Improved Dense Trajectories
(iDT) (Wang and Schmid, 2013), Two-Stream ConvNets (Simonyan and Zisserman,
2014a), C3D (Tran et al, 2015), and I3D (Carreira and Zisserman, 2017) have
focused on using spatio-temporal appearance features. To better model longer-term
temporal information, researchers also attempted to model the video as an ordered
frame sequence using Recurrent Neural Networks (RNNs) (Yue-Hei Ng et al, 2015;
Donahue et al, 2015; Li et al, 2017b). However, these conventional deep learning
approaches only focus on extracting features from whole scenes and are unable
to model the relationships between different object instances in space and time. For
example, to recognize the action "opening a book" in a video, the
temporal dynamics of objects and the human-object and object-object interactions are
crucial. We need to temporally link book regions across time to capture the shape of
the book and how it changes over time.
To capture relations between objects across time, several deep models (Chen
et al, 2019d; Herzig et al, 2019; Wang and Gupta, 2018; Wang et al, 2018e) have
been recently introduced that represent the video as a spatial-temporal graph and
leverage recently proposed graph neural networks. These methods take dense object
proposals as graph nodes and learn the relations between them. In this section,
we take the framework proposed in (Wang and Gupta, 2018) as one example to
demonstrate how graph neural networks can be applied to the action recognition task.
As illustrated in Fig 20.5, the model takes a long clip of video frames as in-
put and forwards them to a 3D Convolutional Neural Network to get a feature map
I ∈ Rt×h×w×d , where t represents the temporal dimension, h × w represents the spa-
tial dimensions and d represents the channel number. Then the model adopts the
Region Proposal Network (RPN) (Ren et al, 2015) to extract the object bounding
boxes, followed by RoIAlign (He et al, 2017a) extracting a d-dimensional feature for
each object proposal. The n object proposals aggregated over the t frames
correspond to the n nodes of the constructed graphs. There are mainly two types of
graphs: the similarity graph and the spatial-temporal graph.
Fig. 20.5: Overview of the GNN-based model for Video Action Recognition.
The similarity graph measures the pairwise similarity between object nodes; its normalized edge values are computed as:
A_{ij}^{sim} = exp F(x_i, x_j) / ∑_{j=1}^{n} exp F(x_i, x_j)    (20.11)
where F(x_i, x_j) denotes the learned similarity between the features of objects i and j.
The spatial-temporal graph is proposed to encode the relative spatial and tempo-
ral relations between objects, where objects in nearby locations in space and time
are connected together. The normalized edge values of the spatial-temporal graph
can be formulated as:
A_{ij}^{front} = σ_{ij} / ∑_{j=1}^{n} σ_{ij}    (20.12)
where G^{front} represents the forward graph, which connects objects from frame t to
frame t + 1, and σ_{ij} is the value of the Intersection over Union (IoU) between
object i in frame t and object j in frame t + 1. The backward graph A^{back} can be
computed in a similar way. Then, Graph Convolutional Networks (GCNs) (Kipf
and Welling, 2017b) are applied to update the features of each object node. One layer of
graph convolution can be represented as:
Z = AXW (20.13)
where A represents one of the adjacency matrices (A^{sim}, A^{front}, or A^{back}), X represents
the node features, and W is the weight matrix of the GCN.
The updated node features after graph convolutions are forwarded to an average
pooling layer to obtain the global graph representation. Then, the graph representa-
tion and pooled video representation are concatenated together for video classifica-
tion.
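The pipeline of Eqs. (20.11)–(20.13) plus average pooling can be sketched in NumPy as follows; taking F(x_i, x_j) to be a plain dot product and using ReLU as the nonlinearity are simplifying assumptions for illustration.

```python
import numpy as np

def similarity_graph(x):
    """Eq. (20.11) with F(x_i, x_j) taken as a plain dot product; the actual
    model learns projections before the product."""
    scores = x @ x.T
    scores -= scores.max(axis=1, keepdims=True)  # stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)      # row-normalized adjacency

def gcn_layer(A, X, W):
    """Eq. (20.13): Z = AXW, followed here by a ReLU nonlinearity."""
    return np.maximum(A @ X @ W, 0.0)

rng = np.random.default_rng(0)
n, d = 6, 16
X = rng.normal(size=(n, d))                      # n object-node features
A = similarity_graph(X)
Z = gcn_layer(A, X, rng.normal(size=(d, d)))
graph_repr = Z.mean(axis=0)                      # average pooling over nodes
print(A.shape, Z.shape, graph_repr.shape)  # (6, 6) (6, 16) (16,)
```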
Temporal action localization is the task of training a model to predict the bound-
aries and categories of action instances in untrimmed videos. Most existing meth-
ods (Chao et al, 2018; Gao et al, 2017; Lin et al, 2017; Shou et al, 2017, 2016; Zeng
et al, 2019) tackle temporal action localization in a two-stage pipeline: they first gen-
erate a set of 1D temporal proposals and then perform classification and temporal
boundary regression on each proposal individually. However, these methods process
each proposal separately, failing to leverage the semantic relations between propos-
als. To model the proposal-proposal relations in the video, graph neural networks are
then adopted to facilitate the recognition of each proposal instance. P-GCN (Zeng
et al, 2019) is a recently proposed method that exploits proposal-proposal relations
using Graph Convolutional Networks. P-GCN first constructs an action proposal
graph, where each proposal is represented as a node and the relation between two
proposals as an edge. Then P-GCN performs reasoning over the proposal graph using
GCN to model the relations among different proposals and update their representations.
Finally, the updated node representations are used to refine proposal boundaries
and classification scores based on the established proposal-proposal dependencies.
Graph-structured data exists widely across different modalities (images, videos, texts)
and is used in existing cross-media tasks (e.g., visual captioning, visual question an-
swering, cross-media retrieval). In other words, using graph-structured data and GNNs
appropriately can effectively improve the performance of cross-media tasks.
Visual captioning aims at building a system that automatically generates a natural lan-
guage description of a given image or video. The problem of image captioning is
interesting not only because it has important practical applications, such as helping
visually impaired people see, but also because it is regarded as a grand challenge
for vision understanding. The typical solutions to visual captioning are inspired
by machine translation and are equivalent to translating an image into text. In these
methods (Li et al, 2017d; Lu et al, 2017a; Ding et al, 2019b), a Convolutional Neu-
ral Network (CNN) or Region-based CNN (R-CNN) is usually exploited to encode
an image, and a Recurrent Neural Network (RNN) decoder, with or without an attention
mechanism, is utilized to generate the sentence. However, a common issue not fully
studied is how visual relationships should be leveraged, given that the mutual corre-
lations or interactions between objects are the natural basis for describing an image.
they build two kinds of visual relationships, i.e., semantic and spatial correlations,
on the detected regions, and devise graph convolutions on the region-level rep-
resentations with visual relationships to learn more powerful representations. Such
relation-aware region-level representations are then fed into an attention LSTM for
sentence generation.
Later, Yang et al (2019g) presented a novel Scene Graph Auto-Encoder (SGAE)
for image captioning. This captioning pipeline contains two steps: 1) extracting the
scene graph of an image and using a GCN to encode the corresponding scene graph,
then decoding the sentence from the encoded representation; 2) incorporating the im-
age scene graph into the captioning model. They also use GCNs to encode the visual
scene graph. Given the representation of the visual scene graph, they introduce a joint vi-
sual and language memory to select appropriate representations to generate the image
description.
Visual Question Answering (VQA) aims at building a system that automatically an-
swers natural language questions about visual information. It is a challenging task
that involves mutual understanding and reasoning across different modalities. In the
past few years, benefiting from the rapid development of deep learning, the pre-
vailing image and video question answering methods (Shah et al, 2019; Zhang et al, 2019g; Yu
et al, 2017a) represent the visual and linguistic modalities in a common la-
tent subspace using the encoder-decoder framework and attention mechanisms, which
has led to remarkable progress.
Fig. 20.7: Illustration of scene-graph-based VQA: the edges and nodes of the image scene graph (e.g., "boy", "dog", "shirt") are updated, then combined with the question ("What is the boy doing?") and a global image feature to predict the answer.
However, the aforementioned methods have not considered the graph infor-
mation in the VQA task. Recently, Zhang et al (2019a) investigated an alternative
approach inspired by conventional QA systems that operate on knowledge graphs.
Specifically, as shown in Fig. 20.7, they investigate the use of scene graphs derived
from images, naturally encoding information on the graphs and performing structured
reasoning for visual QA. The experimental results demonstrate that scene graphs,
Image-text retrieval task has become a popular cross-media research topic in re-
cent years. It aims to retrieve the most similar samples from the database in an-
other modality. The key challenge here is how to match the cross-modal data by
understanding their contents and measuring their semantic similarity. Many ap-
proaches (Faghri et al, 2017; Gu et al, 2018; Huang et al, 2017b) have been pro-
posed. They often use global or local representations to express the whole image
and sentence. Then, a metric is devised to measure the similarity of a pair of
features from different modalities. However, the above methods overlook the re-
lationships between objects in multi-modal data, which is also a key point for
image-text retrieval.
Fig. 20.8: Illustration of the dual-path cross-modal retrieval model: a text path encodes the graph-structured text input (e.g., "The students are listening to the class.") with graph convolutions and fully-connected layers, while an image path encodes hand-crafted, neural-network, and jointly-trained features; the resulting feature vectors are compared for similarity estimation.
To better utilize the graph data in images and text, as shown in Fig. 20.8, Yu
et al (2018b) propose a novel cross-modal retrieval model named dual-path neu-
ral network with graph convolutional network. This network takes both irregular
graph-structured textual representations and regular vector-structured visual repre-
sentations into consideration to jointly learn coupled features and a common latent
semantic space.
In addition, Wang et al (2020i) extract objects and relationships from the image
and text to form a visual scene graph and a textual scene graph, and design a so-called
Scene Graph Matching (SGM) model, where two tailored graph encoders encode
the visual and textual scene graphs into feature graphs. After that, both
object-level and relationship-level features are learned in each graph, so that the two
feature graphs corresponding to the two modalities can finally be matched at both levels.
In this section, we introduce the frontiers of GNNs in Computer Vision. We focus
on advanced GNN modeling methods for Computer Vision and their applica-
tions in a broader range of subfields.
The main idea of GNN modeling methods in CV is to represent visual informa-
tion as a graph. It is common to represent pixels, object bounding boxes, or image
frames as nodes and further build a homogeneous graph to model their relations.
Beyond this kind of method, there are also some new ideas for GNN modeling.
Considering the nature of specific tasks, some works try to represent different forms
of visual information in the graph.
• Person Feature Patches Yan et al (2019); Yang et al (2020b); Yan et al (2020b)
build spatial and temporal graphs for person re-identification (Re-ID). They
horizontally partition each person feature map into patches and use the patches
as the nodes of the graph. A GCN is further used to model the relations of body
parts across frames.
• Irregular Clustering Regions Liu et al (2020h) introduce the bipartite GNN
for mammogram mass detection. It first leverages kNN forward mapping to
partition an image feature map into irregular regions. Then the features in an
irregular region are further integrated as a node. The bipartite node sets are con-
structed from the two cross-view images respectively, while the bipartite edges learn to
model both inherent cross-view geometric constraints and appearance similari-
ties.
20.7 Summary
This chapter shows that GNN is a promising and fast-developing research field
that offers exciting opportunities in computer vision techniques. Nevertheless, it
also presents some challenges. For example, graphs are often related to real scenar-
ios, while the aforementioned GNNs lack interpretability, especially for decision-
making problems (e.g., medical diagnostic models) in the computer vision field.
However, compared to other black-box models (e.g., CNN), interpretability for
graph-based deep learning is even more challenging since graph nodes and edges
are often heavily interconnected. Thus, a further direction worth exploring is how to
improve the interpretability and robustness of GNN for computer vision tasks.
Abstract Natural language processing (NLP) and understanding aim to read from
unformatted text to accomplish different tasks. While word embeddings learned by
deep neural networks are widely used, the underlying linguistic and semantic struc-
tures of text pieces cannot be fully exploited in these representations. Graphs are a
natural way to capture the connections between different text pieces, such as enti-
ties, sentences, and documents. To overcome the limits of vector-space models, re-
searchers combine deep learning models with graph-structured representations for
various tasks in NLP and text mining. Such combinations help to make full use of
both the structural information in text and the representation learning ability of deep
neural networks. In this chapter, we introduce the various graph representations that
are extensively used in NLP, and show how different NLP tasks can be tackled from
a graph perspective. We summarize recent research works on graph-based NLP, and
discuss two case studies related to graph-based text clustering, matching, and multi-
hop machine reading comprehension in detail. Finally, we provide a synthesis of
the important open problems of this subfield.
21.1 Introduction
Bang Liu
Department of Computer Science and Operations Research, University of Montreal, e-mail:
[email protected]
Lingfei Wu
JD.COM Silicon Valley Research Center, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 463
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_21
464 Bang Liu, Lingfei Wu
until today, NLP has been playing an essential role in the research of machine
learning and artificial intelligence.
NLP has a wide range of applications in the life and business of modern society.
Critical NLP applications include but are not limited to: machine translation applica-
tions that aim to translate text or speech from a source language to another tar-
get language (e.g., Google Translate, Yandex Translate); chatbots or virtual assis-
tants that conduct an online chat conversation with a human agent (e.g., Apple Siri,
Microsoft Cortana, Amazon Alexa); search engines for information retrieval (e.g.,
Google, Baidu, Bing); question answering (QA) and machine reading comprehen-
sion in different fields and applications (e.g., open-domain question answering in
search engines, medical question answering); knowledge graphs and ontologies that
extract and represent knowledge from multi-sources to improve various applications
(e.g., DBpedia (Bizer et al, 2009), Google Knowledge Graph); and recommender
systems in E-commerce based on text analysis (e.g., E-commerce recommendation
in Alibaba and Amazon). Therefore, AI breakthroughs in NLP carry major business value.
Two crucial research problems lie at the core of NLP: i) how to represent natural
language texts in a format that computers can read; and ii) how to compute over
this representation to understand the input text pieces. We observe that researchers'
ideas on representing and modeling text keep evolving during the long history of
NLP development.
Up to the 1980s, most NLP systems were symbolic-based. Different text pieces
were considered as symbols, and the models for various NLP tasks were imple-
mented based on complex sets of hand-written rules. For example, classic rule-based
machine translation (RBMT) involves a host of rules defined by linguists in gram-
mar books. Such systems include Systran, Reverso, Prompt, and LOGOS (Hutchins,
1995). Rule-based approaches with symbolic representations are fast, accurate, and
explainable. However, acquiring the rules for different tasks is difficult and requires
extensive expert effort.
Starting in the late 1980s, statistical machine learning algorithms brought revolu-
tion to NLP research. In statistical NLP systems, usually a piece of text is considered
as a bag of its words, disregarding grammar and even word order but keeping multi-
plicity (Manning and Schutze, 1999). Many of the notable early successes occurred
in machine translation, where statistical models were developed that were
able to take advantage of multilingual textual corpora. However, it is hard to
model the semantic structure and information of human language by simply consid-
ering the text as a bag of words.
Since the early 2010s, the field of NLP has shifted to neural networks and deep
learning, where word embedding techniques such as Word2Vec (Mikolov T, 2013)
or GloVe (Pennington et al, 2014) were developed to represent words as fixed vec-
tors. We have also witnessed an increase in end-to-end learning for tasks such as
question answering. Besides, by representing text as a sequence of word embedding
vectors, different neural network architectures, such as vanilla recurrent neural net-
works (Pascanu et al, 2013), Long Short-Term Memory (LSTM) networks (Greff
et al, 2016), or convolutional neural networks (Dos Santos and Gatti, 2014), were
21 Graph Neural Networks in Natural Language Processing 465
applied to model text. Deep learning has brought a new revolution in NLP, greatly
improving the performance of various tasks.
In 2018, Google introduced a neural network-based technique for NLP pre-
training called Bidirectional Encoder Representations from Transformers (BERT)
(Devlin et al, 2019). This model has enabled many NLP tasks to achieve superhu-
man performance in different benchmarks and has spawned a series of follow-up
studies on pre-training large-scale language models (Qiu et al, 2020b). In such ap-
proaches, the representations of words are context-sensitive vectors. By taking
contextual information into account, these models can capture the polysemy of words. How-
ever, large-scale pre-trained language models require massive amounts of data
and computing resources. Besides, existing neural network-based models lack ex-
plainability and transparency, which can be a major drawback in the health, education,
and finance domains.
Along with the evolving history of text representations and computational mod-
els, from symbolic representations to context-sensitive embeddings, we can see
an increase of semantic and structural information in text modeling. A key ques-
tion is: how can we further improve the representation of various text pieces and the
computational models for different NLP tasks? We argue that representing text as
graphs and applying graph neural networks to NLP applications is a highly promis-
ing research direction. Graphs are of great significance to NLP research, for several
reasons illustrated in the following.
First, our world consists of things and the relations between them. The ability to
draw logical conclusions about how different things are related to one another, or
so-called relational reasoning, is central to both human and machine intelligence. In
NLP, understanding human language also requires modeling different text pieces
and reasoning over their relations. Graphs provide a unified format to represent
things and the relations between them. By modeling text as graphs, we can char-
acterize the syntactic and semantic structures of different texts and perform explain-
able reasoning and inference over such representations.
Second, the structure of languages is intrinsically compositional, hierarchical,
and flexible. From corpora to documents, paragraphs, sentences, phrases, and words,
different text pieces form a hierarchical semantic structure, in which a higher-level
semantic unit (e.g., a sentence) can be further decomposed into more fine-grained
units (e.g., phrases and words). Such structural nature of human languages can be
characterized by tree structures. Furthermore, due to the flexibility of languages, the
same meaning can be expressed in different sentences, such as active and passive
voices. However, we can unify the representation of varying sentences by seman-
tic graphs like Abstract Meaning Representation (AMR) (Schneider et al, 2015) to
make NLP models more robust.
Last but not least, graphs have always been extensively utilized and formed an
essential part of NLP applications ranging from syntax-based machine translation,
knowledge graph-based question answering, abstract meaning representation for
common sense reasoning tasks, and so on. On the other hand, with the vigorous
research on graph neural networks, the trend of combining graph
neural networks and NLP has become more and more prosperous.
Various graph representations have been proposed for text modeling. Based on the
different types of graph nodes and edges, a majority of existing works can be gen-
eralized into five categories: text graphs, syntactic graphs, semantic graphs, knowl-
edge graphs, and hybrid graphs.
Text graphs use words, sentences, paragraphs, or documents as nodes and estab-
lish edges by word co-occurrence, location, or text similarities. Rousseau and Vazir-
giannis (2013); Rousseau et al (2015) represented a document as graph-of-word,
where nodes represent unique terms and directed edges represent co-occurrences
between the terms within a fixed-size sliding window. Wang et al (2011) connected
terms with syntactic dependencies. Schenker et al (2003) connected two words by
1 The Graph4NLP library can be accessed via https://github.com/graph4ai/graph4nlp.
a directed edge if one word immediately precedes another word in the document
title, body, or link. The edges are categorized by the three different types of linking.
Balinsky et al (2011); Mihalcea and Tarau (2004); Erkan and Radev (2004) con-
nected sentences if they are near to each other, share at least one common keyword, or
their similarity is above a threshold. Page et al (1999) connected web docu-
ments by hyperlinks. Putra and Tokunaga (2017) constructed directed graphs of sen-
tences for text coherence evaluation, using sentence similarities as edge weights and
connecting sentences under various constraints on sentence similarity or location.
Text graphs can be established quickly, but they cannot characterize the syntactic
or semantic structure of sentences or documents.
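A graph-of-word construction in the spirit of Rousseau and Vazirgiannis (2013) can be sketched in a few lines; the window size of 3 and the toy sentence are illustrative choices.

```python
from collections import defaultdict

def graph_of_words(tokens, window=3):
    """Directed graph-of-word: an edge u -> v whenever v co-occurs within a
    fixed-size sliding window after u; edge weights count co-occurrences."""
    edges = defaultdict(int)
    for i, u in enumerate(tokens):
        for v in tokens[i + 1:i + window]:  # tokens within the window after u
            if u != v:
                edges[(u, v)] += 1
    return dict(edges)

g = graph_of_words("graph neural networks model graph structure".split())
print(sorted(g))  # directed term-to-term edges
```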
Syntactic graphs (or trees) emphasize the syntactical dependencies between
words in a sentence. Such structural representations of sentences are achieved by
parsing, which constructs the syntactic structure of a sentence according to a formal
grammar. Constituency parsing tree and dependency parsing graph are two types
of syntactic representations of sentences that use different grammars (Jurafsky,
2000). Based on syntactic analysis, documents can also be structured. For exam-
ple, Leskovec et al (2004) extracted subject-predicate-object triples from text based
on syntactic analysis and merged them to form a directed graph. The graph was fur-
ther normalized by utilizing WordNet (Miller, 1995) to merge triples belonging to
the same semantic pattern.
While syntactic graphs show the grammatical structure of text pieces, seman-
tic graphs aim to represent the meaning being conveyed. A model of semantics
could help disambiguate the meaning of a sentence when multiple interpretations
are valid. Abstract Meaning Representation (AMR) graphs (Banarescu et al, 2013)
are rooted, labeled, directed, acyclic graphs (DAGs), comprising whole sentences.
Sentences that are similar in meaning will be assigned the same AMR, even if they
are not identically worded. In this way, AMR graphs abstract away from syntactic
representations. The nodes in an AMR graph are AMR concepts, which are either
English words, PropBank framesets (Kingsbury and Palmer, 2002), or special key-
words. The edges are drawn from approximately 100 relations, including frame arguments fol-
lowing PropBank conventions, semantic relations, quantities, date-entities, lists, and
so on.
Knowledge graphs (KGs) are graphs of data intended to accumulate and convey
knowledge of the real world. The nodes of a KG represent entities of interest, and
the edges represent relations between these entities (Hogan et al, 2020). Prominent
examples of KGs include DBpedia (Bizer et al, 2009), Freebase (Bollacker et al,
2007), Wikidata (Vrandečić and Krötzsch, 2014) and YAGO (Hoffart et al, 2011),
covering various domains. KGs are broadly applied for commercial use-cases, such
as web search in Bing (Shrivastava, 2017) and Google (Singhal, 2012), commerce
recommendation in Airbnb (Chang, 2018) and Amazon (Krishnan, 2018), and social
networks like Facebook (Noy et al, 2019) and LinkedIn (He et al, 2016b). There are
also graph representations that connect terms in a document to real-world entities or
concepts based on KGs such as DBpedia (Bizer et al, 2009) and WordNet (Miller,
1995). For example, Hensman (2004) identified the semantic roles in a sentence with
WordNet and VerbNet, and combined these semantic roles with a set of syntactic
rules to construct a concept graph.
Hybrid graphs contain multiple types of nodes and edges to integrate hetero-
geneous information. In this way, the various text attributes and relations can be
jointly utilized for NLP tasks. Rink et al (2010) utilized sentences as nodes and en-
coded lexical, syntactic, and semantic relations on the edges. Jiang et al (2010) extracted
tokens, syntactic structure nodes, semantic nodes, and so on from each sentence and
linked them by different types of edges. Baker and Ellsworth (2017) built a sentence
graph based on Frame Semantics and Construction Grammar.
of-words and then utilizes graph convolution operations to convolve the word graph.
Huang et al (2019a); Zhang et al (2020d) proposed graph-based methods for text
classification, where each text owns its own structural graph and text-level word interac-
tions can be learned.
For NLP tasks involving a pair of texts, graph matching techniques can be applied
to incorporate their structural information. Liu et al (2019a) proposed the
Concept Interaction Graph to represent an article as a graph of concepts. It then
matches a pair of articles by comparing the sentences attached to the same concept
node through a series of encoding techniques, and aggregates the matching signals
through a graph convolutional network. Haghighi et al (2005) represented sentences
as directed graphs extracted from a dependency parser and developed a learned graph
matching approach to approximate textual entailment. Xu et al (2019e) formu-
lated the KB-alignment task as a graph matching problem and proposed a graph
attention-based approach. It first matches all entities in the two KGs, and then jointly
models the local matching information to derive a graph-level matching vector.
Community detection provides a means of coarse-graining the complex interac-
tions or relations between nodes, which is suitable for text clustering problems. For
example, Liu et al (2017a, 2020a) described a news content organization system
at Tencent which discovers events from vast streams of breaking news and evolves
news story structures in an online fashion. They constructed a keyword graph and
applied community detection over it to perform coarse-grained keyword-based text
clustering. After that, they further constructed a document graph for each coarse-grained cluster and applied community detection again to obtain fine-grained event-level document clusters.
The task of graph-to-text generation aims at producing sentences that preserve
the meaning of input graphs (Song et al, 2020b). Koncel-Kedziorski et al (2019)
introduced a graph transforming encoder which can leverage the relational struc-
ture of knowledge graphs and generate text from them. Wang et al (2020k); Song
et al (2018) proposed graph-to-sequence models (Graph Transformer) to generate
natural language texts from AMR graphs. Alon et al (2019a) leveraged the syntactic
structure of programming languages to encode source code and generate text.
Last but not least, reasoning over graphs plays a key role in multi-hop ques-
tion answering (QA), knowledge-based QA, and conversational QA tasks. Ding
et al (2019a) presented a framework, CogQA, to tackle the multi-hop machine reading problem at scale. The reasoning process is organized as a cognitive graph, reaching entity-level explainability. Tu et al (2019) represented documents as a heterogeneous graph and employed GNN-based message passing algorithms to accumulate evidence
on the proposed graph to solve the multi-hop reading comprehension problem across
multiple documents. Fang et al (2020) created a hierarchical graph by constructing
nodes on different levels of granularity (questions, paragraphs, sentences, entities),
and proposed Hierarchical Graph Network (HGN) for multi-hop QA. Chen et al
(2020n) dynamically constructed a question and conversation history aware context
graph at each conversation turn and utilized a Recurrent Graph Neural Network and
a flow mechanism to capture the conversational flow in a dialog.
Fig. 21.1: The story tree of “2016 U.S. presidential election”. Figure credit: Liu et al
(2020a).
In the following, we will present two case studies to illustrate in more detail how graphs and graph neural networks can be applied to different NLP tasks.
In this case study, we will describe the Story Forest intelligent news organization
system designed for fine-grained hot event discovery and organization from web-
scale breaking news (Liu et al, 2017a, 2020a). Story Forest has been deployed in the
Tencent QQ Browser, a mobile application that serves more than 110 million daily
active users. Specifically, we will see how a number of graph representations are
utilized for fine-grained document clustering and document pair matching and how
GNN contributes to the system.
In the fast-paced modern society, tremendous volumes of news articles are con-
stantly being generated by different media providers, leading to information explo-
sion. In the meantime, the large quantities of daily news stories, which cover different subjects and contain redundant or overlapping content, are becoming increasingly difficult for readers to digest. Many news app users feel overwhelmed by extremely repetitive information about a variety of current hot events while still struggling to get information about the events in which they are genuinely interested.
Besides, search engines conduct document retrieval on the basis of user-entered re-
quests. They do not, however, provide users with a natural way to view trending
topics or breaking news.
21 Graph Neural Networks in Natural Language Processing 471
Fig. 21.2: An overview of the system architecture of Story Forest. Figure credit: Liu
et al (2020a).
In (Liu et al, 2017a, 2020a), a novel news organization system named Story Forest was proposed to address the aforementioned challenges. The key idea of Story Forest is that, instead of providing users with a list of web articles retrieved for input queries, it introduces the concepts of “event” and “story” and organizes massive numbers of news articles into story trees, which track evolving hot events, reveal the relationships between them, and reduce redundancy. An event is a set of news articles reporting the same piece of real-world breaking news, and a story is a tree of related events that report a series of evolving real-world breaking news.
Figure 21.1 presents an example of a story tree, which showcases the story of
“2016 U.S. presidential election”. There are 20 nodes in the story tree. Each node
indicates an event in the U.S. election in 2016, and each edge represents a temporal
development relationship or a logical connection between two breaking news events.
For example, event 1 describes Trump becoming a presidential candidate, and event 20 reports that Donald Trump is elected president. The index number on each node represents the event sequence over the timeline. The story tree contains 6 paths, where the main path 1 → 20 captures the process of the presidential election, the branch 3 → 6 describes Hillary's health conditions, the branch 7 → 13 focuses on the television debates, the branch 14 → 18 covers the “email gate” investigation, etc. As we can see, by modeling the evolutionary and logical structure of a story as a story tree, users can easily understand the logic of news reports and quickly learn the key facts.
The story trees are constructed from web-scale news articles by the Story Forest
system. The system’s architecture is shown in Fig. 21.2. It consists primarily of four
components: preprocessing, keyword graph construction, clustering documents to
events, and growing story trees with events. The overall process is split into eight
stages. First, a range of NLP and machine learning tools will be used to process the
input news document stream, including document filtering and word segmentation.
Then the system extracts keywords, construct/update the co-occurrence graph of
keywords, and divide the graph into sub-graphs. After that, it utilizes EventX, a
graph-based fine-grained clustering algorithm to cluster documents into fine-grained
events. Finally, the story trees (formed previously) are updated by either inserting
each discovered event into an existing story tree at the right place or creating a new
story tree if the event does not belong to any current story.
We can observe from Fig. 21.2 that a variety of text graphs are utilized in
the Story Forest system. Specifically, the EventX clustering algorithm is based on
two types of text graphs: keyword co-occurrence graph and document relation-
ship graph. The keyword co-occurrence graph connects two keywords if they co-occur more than n times in a news corpus, where n is a hyperparameter. On
the other hand, the document relationship graph connects document pairs based on
whether two documents are talking about the same event. Based on these two types
of text graphs, EventX can accurately extract fine-grained document clusters, where
each cluster contains a set of documents that focus on the same event.
In particular, EventX performs two-layer graph-based clustering to extract events.
The first layer performs community detection over the constructed keyword co-occurrence graph to split it into sub-graphs, where each sub-graph contains the keywords for a specific topic. The intuition for this step is that keywords related to a common topic will usually appear frequently in documents belonging to that topic. For example,
documents belonging to the topic “2016 U.S. presidential election” will often men-
tion keywords such as “Donald Trump”, “Hillary Clinton”, “election”, and so on.
Therefore, highly correlated keywords will be linked to each other and form dense
subgraphs, whereas keywords that are not highly related will have sparse or no links.
The goal here is to extract dense keyword subgraphs linked to various topics. After
obtaining the keyword subgraphs (or communities), we can assign each document
to its most correlated keyword subgraph by calculating their TF-IDF similarity. At
this point, we have grouped documents by topics in the first layer clustering.
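The first-layer clustering described above can be sketched in a few lines of Python. This is a minimal illustration, not the production EventX algorithm: connected components over the thresholded keyword graph stand in for the community detection step, and raw keyword overlap stands in for the TF-IDF similarity used to attach documents to subgraphs.

```python
from collections import defaultdict
from itertools import combinations

def keyword_subgraphs(docs_keywords, n=1):
    # Count keyword co-occurrences within each document's keyword set.
    cooc = defaultdict(int)
    for kws in docs_keywords:
        for a, b in combinations(sorted(set(kws)), 2):
            cooc[(a, b)] += 1
    # Keep an edge only if the pair co-occurs more than n times.
    adj = defaultdict(set)
    for (a, b), c in cooc.items():
        if c > n:
            adj[a].add(b)
            adj[b].add(a)
    # Split the keyword graph into connected components
    # (a crude stand-in for the community detection step).
    seen, components = set(), []
    for node in list(adj):
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        components.append(comp)
    return components

def assign_documents(docs_keywords, components):
    # Attach each document to its most similar keyword subgraph
    # (keyword overlap here; the paper uses TF-IDF similarity).
    topics = defaultdict(list)
    for i, kws in enumerate(docs_keywords):
        best = max(range(len(components)),
                   key=lambda j: len(components[j] & set(kws)))
        topics[best].append(i)
    return dict(topics)
```

With toy inputs, the two election documents and the two basketball documents end up in separate topics, mirroring the coarse-grained first layer.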
In the second layer, EventX constructs a document relationship graph for each
topic obtained in the first layer. Specifically, a binary classifier will be applied to
each pair of documents in a topic to detect whether two documents are talking about
the same event. If yes, we connect the pair of documents. In this way, the set of documents in a topic turns into a document relationship graph. After that, the same community detection algorithm as in the first layer is applied to the document relationship graph, splitting it into sub-graphs where each sub-graph now represents a fine-grained event instead of a coarse-grained topic. Since the number of news articles belonging to each topic is significantly smaller after the first-layer document clustering, the graph-based clustering on the second layer is highly efficient, making
it applicable for real-world applications. After extracting fine-grained events, we can
21 Graph Neural Networks in Natural Language Processing 473
update the story trees by inserting each event into its related story or creating a new story tree if it does not belong to any existing story. We refer to (Liu et al, 2020a) for more
details about the Story Forest system.
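The second-layer step can likewise be sketched. Here `same_event` is a hypothetical stand-in for the pairwise same-event classifier, and connected components of the resulting document relationship graph again stand in for the community detection step.

```python
from collections import defaultdict
from itertools import combinations

def cluster_events(doc_ids, same_event):
    # Connect two documents when the pairwise classifier says
    # they report the same event.
    adj = defaultdict(set)
    for a, b in combinations(doc_ids, 2):
        if same_event(a, b):
            adj[a].add(b)
            adj[b].add(a)
    # Split the document relationship graph into groups; connected
    # components stand in for the community detection step here.
    seen, events = set(), []
    for d in doc_ids:
        if d in seen:
            continue
        stack, comp = [d], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        events.append(sorted(comp))
    return events
```

Because each topic holds only a small number of documents, the quadratic pairwise classification stays cheap, which is the efficiency argument made above.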
During the construction of the document relationship graph in the Story Forest sys-
tem, a fundamental problem is determining whether two news articles are talking
about the same event. This is a semantic matching problem, which lies at the core of many NLP applications, including search engines, recommender systems, and news systems. However, previous research on semantic matching is mainly designed for matching sentence pairs (Wan et al, 2016; Pang et al, 2016), e.g., for paraphrase identification, answer selection in question answering, and so on. Due to the long length of news articles, such methods are not
suitable and do not perform well on document matching (Liu et al, 2019a).
To solve this challenge, Liu et al (2019a) presented a divide-and-conquer strategy
to align a pair of documents and shift deep text comprehension away from the cur-
rently dominant sequential modeling of language elements and toward a new level
of graphical document representation that is better suited to longer articles. Specif-
ically, Liu et al (2019a) proposed the Concept Interaction Graph (CIG) as a way to
view a document as a weighted graph of concepts, with each concept node being
either a keyword or a group of closely related keywords. Furthermore, two con-
cept nodes will be connected by a weighted edge which indicates their interaction
strength.
As a toy example, Fig. 21.3 shows how to convert a document into a Concept In-
teraction Graph (CIG). First, we extract keywords such as Rick, Morty, and Summer
from the document using standard keyword extraction algorithms, e.g., TextRank
(Mihalcea and Tarau, 2004). Second, similar to what we have done in the Story For-
est system, we can group keywords into sub-graphs by community detection. Each
keyword community turns into a “concept” in the document. After extracting con-
cepts, we attach each sentence in the document to its most related concept node by
calculating the similarities between a sentence and each concept.

Fig. 21.4: An overview of our approach for constructing the Concept Interaction Graph (CIG) from a pair of documents and classifying it by Graph Convolutional Networks. Figure credit: Liu et al (2019a).

In Fig. 21.3, sentences 5 and 6 are mainly talking about the relationship between Rick and Summer,
and are thus attached to the concept (Rick, Summer). Similarly, we can attach other
sentences to nodes, decomposing the content of a document into a number of con-
cepts. To construct edges that reflect the correlation between different concepts, we represent each node's sentence set as the concatenation of the sentences attached to it and measure the edge weight between any two nodes as the TF-IDF similarity between their sentence sets. An edge is removed if its weight is below a threshold. For a pair of documents, the process of converting them into a CIG
is similar. The only differences are that the keywords are from both documents, and
each concept node will have two sets of sentences from the two documents. As a re-
sult, we have represented the original document (or document pair) with a graph of
key concepts, each with a (or a pair of) sentence subset(s), as well as the interaction
topology among them.
The CIG representation of a document pair decomposes its content into multi-
ple parts. Next, we need to match the two documents based on such representation.
Fig. 21.4 illustrates the process of matching a pair of long documents. The matching
process consists of four steps: a) preprocessing the input document pair and transforming it into a CIG; b) matching the sentences from the two documents over each node
to get local matching features; c) structurally transforming local matching features
by graph convolutional layers; and d) aggregating all the local matching features to
get the final result.
Specifically, for the local matching on each concept node, the inputs are the two
sets of sentences from two documents. As each node only contains a small portion
of the document sentences, the long text matching problems transform into short
text matching on a number of concept nodes. In (Liu et al, 2019a), two different
matching models are utilized: i) similarity-based matching, which calculates a variety of text similarities between the two sets of sentences; ii) Siamese matching, which
utilizes a Siamese neural network (Mueller and Thyagarajan, 2016) to encode the
two sentence sets and get a local matching vector. After getting local matching re-
sults, the next question is: how to get an overall matching score? Liu et al (2019a)
aggregates the local matching vectors into a final matching score for the pair of ar-
ticles by utilizing the ability of the graph convolutional network filters (Kipf and
Welling, 2017b) to capture the patterns exhibited in the CIG at multiple scales. In
particular, the local matching vectors of the concept nodes are transformed by multi-
layer GCN layers to take the interaction structure between nodes (or concepts in two
documents) into consideration. After getting the transformed feature vectors, they
are aggregated by mean pooling to get a global matching vector. Finally, the global
matching vector will be fed into a classifier (e.g., a feed-forward neural network) to
get the final matching label or score. The local matching module, global aggregation
module, and the final classification module are trained end-to-end.
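The local-to-global aggregation pipeline (graph convolution over per-node matching vectors, mean pooling, then a classifier) can be sketched in plain Python. This is a hedged illustration: `A_hat` is assumed to be a pre-normalized adjacency matrix with self-loops, the two propagation steps follow the standard GCN form rather than the paper's exact architecture, and the final classifier is reduced to a single sigmoid unit.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def relu(M):
    return [[max(0.0, x) for x in row] for row in M]

def gcn_match_score(A_hat, X, W1, W2, w_out):
    # Two graph-convolution layers over the per-node local matching
    # vectors X, followed by mean pooling and a linear scorer.
    H = relu(matmul(matmul(A_hat, X), W1))
    H = relu(matmul(matmul(A_hat, H), W2))
    pooled = [sum(col) / len(H) for col in zip(*H)]    # mean pooling
    score = sum(p * w for p, w in zip(pooled, w_out))  # linear classifier
    return 1.0 / (1.0 + math.exp(-score))              # sigmoid
```

The structural transformation mixes each node's matching vector with its neighbors' before the global pooling, which is exactly where the GCN contributes in this pipeline.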
In (Liu et al, 2019a), extensive evaluations were performed to test the perfor-
mance of the proposed approach for document matching. A key discovery made
by (Liu et al, 2019a) is that the graph convolution operation significantly improves
the performance of matching, demonstrating the effect of applying graph neural
networks to the proposed text graph representation. The structural transformation
on the matching vectors via GCN can efficiently capture the semantic interactions
between sentences, and the transformed matching vectors better capture the seman-
tic distance over each concept node by integrating the information of its neighbor
nodes.
In this case study, we further introduce how graph neural networks can be applied to
machine reading comprehension in NLP. Machine reading comprehension (MRC)
aims to teach machines to read and understand unstructured text like a human. It is a
challenging task in artificial intelligence and has great potential in various enterprise
applications. We will see that by representing text as a graph and applying graph
neural networks to it, we can mimic the reasoning process of human beings and
achieve significant improvements for MRC tasks.
Suppose we have access to a Wikipedia search engine, which can be utilized
to retrieve the introductory paragraph para[x] of an entity x. How can we answer
the question “Who is the director of the 2003 film which has scenes in it filmed
at the Quality Cafe in Los Angeles?” with the search engine? Naturally, we will
start by paying attention to related entities such as “Quality Cafe”, look up relevant introductions through Wikipedia, and quickly locate the Hollywood movies “Old School” and “Gone in 60 Seconds”. By continuing to inquire about the introductions of the two movies, we can further find their directors. The last step
is to determine which director it is. This requires us to analyze the semantics and
qualifiers of the sentence. After knowing that the movie is in 2003, we can make the
final judgment: “Todd Phillips” is the answer we want. Figure 21.5 illustrates such
[email protected]
for
scale
the-
grad-
ative
rac-
rea-
g ac-
pro-
ecifi-
ERT
ently
-hop Fig.Figure
21.5: An1:example
An example ofgraph
of cognitive cognitive graphQA.
for multi-hop forEach
multi-hop
hop node cor-
wiki responds to an entity (e.g., “Los Angeles”) followed
QA. Each hop node corresponds to an entity (e.g., “Losby its introductory paragraph.
The circles mean ans nodes, answer candidates to the question. Cognitive graph
re of Angeles”)
mimics followed
human reasoning by Edges
process. its introductory paragraph.
are built when calling an entity The
to “mind”.
6 of Thecircles mean
solid black edgesansarenodes, answer
the correct candidates
reasoning to credit:
path. Figure the ques-
Ding et al
(2019a).
tion. Cognitive graph mimics human reasoning pro-
cess. Edges are built when calling an entity to “mind”.
TheAnswering
process. solid black edges are the
the aforementioned correct
question reasoning
requires multi-hoppath.
reasoning over
different information, that is so-called multi-hop question answering.
gnificant means
In fact, “pay unordered and entities
attention to related sentence-level explainabil-
quickly” and “analyze the meaning of
sentences for inference” are two different thinking processes. In cognition, the well-
ion and ity,“dual
known yet process
humans can(Kahneman,
theory” interpret answers
2011) with
believes that stepcognition
human by is
aragraph step solutions, indicating an ordered and entity-think-
divided into two systems. System 1 is an implicit, unconscious and intuitive
ing system. Its operation relies on experience and association. System 2 performs
ncluding levelconscious
explicit, and controllable3)
explainability. Scalability.
reasoning For
process. This anyusesprac-
system knowledge
., 2018; in working
tically useful QA system, scalability is indis-2 is the
memory to perform slow but reliable logical reasoning. System
embodiment of human advanced intelligence.
ross the pensable.
Guided by the dualExisting QA
process theory, systems
the Cognitive Graphbased on ma-
QA (CogQA) framework
between was proposed in (Ding et al, 2019a). It adopts a directed graph structure, named
chine comprehension generally follow retrieval-
cognitive graph, to perform step-by-step deduction and exploration in the cognitive
nges lie extraction
process of multi-hopframework in DrQA
question answering. Figure (Chen et al.,
21.5 presents 2017),graph
the cognitive
d by ad- for answering the previously mentioned question. Denote the graph as G , each node
reducing the scope of sources to a few paragraphs
in G represents an entity or possible answer x, also interchangeably denoted as node
dels for bysolid
x. The pre-retrieval.
black edges are This framework
the correct reasoning pathistoaanswer
simple com- The
the question.
s in sen- cognitive graph is constructed by an extraction module that acts like System 1. It
promise between single paragraph QA and scal-
does not able information retrieval, compared to human’s
ulti-hop ability to breeze through reasoning with knowl-
r (Yang edge in massive-capacity memory (Wang et al.,
takes the introductory paragraph para[x] of entity x as input, and outputs answer
candidates (i.e., ans nodes) and useful next-hop entities (i.e., hop nodes) from the
paragraph. These new nodes gradually expand G, forming an explicit graph structure for the System 2 reasoning module. During the expansion of G, new nodes or existing nodes with new incoming edges bring new clues about the answer. Such nodes are referred to as frontier nodes. A clue is a form-flexible concept, referring to information from predecessor nodes that guides System 1 to better extract spans.
To perform neural network-based reasoning over G instead of rule-based, System 1
also summarizes para[x] into an initial hidden representation vector when extract-
ing spans, and System 2 updates all paragraphs’ hidden vectors X based on graph
structure as reasoning results for downstream prediction.
The procedure of the framework CogQA is as follows. First, the cognitive graph
G is initialized with the entities mentioned in the input question Q, and the entities
are marked as initial frontier nodes. After initialization, a node x is popped from
frontier nodes, and then a two-stage iterative process is conducted with two models
S1 and S2 mimicking System 1 and System 2, respectively. In the first stage, the
System 1 module in CogQA extracts question-relevant entities and answer candidates from paragraphs, and encodes their semantic information. Extracted entities are organized as a cognitive graph, which resembles the working memory. Specifically,
given x, CogQA collects clues[x, G ] from predecessor nodes of x, where the clues
can be sentences where x is mentioned. It further fetches the introductory paragraph para[x] from the Wikipedia database W, if any. After that, S1 generates sem[x, Q, clues],
which is the initial Xx (i.e., the embedding of x). If x is a hop node, then S1 finds hop spans (e.g., entities) and answer spans in para[x]. For each hop span y, if y ∉ G and y ∈ W, then create a new hop node for y and add it to G. If y ∈ G but edge (x, y) ∉ G, then
add a new edge (x, y) to G and mark node y as a frontier node, as it needs to be
revisited with new information. For each answer span y, a new answer node y and
edge (x, y) will be added to G . In the second stage, System 2 conducts the reason-
ing procedure over the graph and collects clues to guide System 1 to better extract
next-hop entities. In particular, the hidden representation X of all paragraphs will be
updated by S2 . The above process is iterated until there is no frontier node in the
cognitive graph (i.e., all possible answers are found) or the graph is large enough.
Then the final answer is chosen with a predictor F based on the reasoning results
X from System 2.
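The iterative procedure can be summarized in skeleton form, with `s1_extract` and `s2_update` as hypothetical stand-ins for the System 1 (BERT) and System 2 (GNN) modules; the clue-passing and span details are elided.

```python
def cogqa(question, entities, s1_extract, s2_update, max_nodes=50):
    # Skeleton of the CogQA loop: expand a cognitive graph with
    # System 1 and update hidden vectors with System 2.
    graph = {e: set() for e in entities}   # node -> successor nodes
    frontier = list(entities)              # initial frontier nodes
    hidden = {}
    while frontier and len(graph) < max_nodes:
        x = frontier.pop(0)
        # System 1: read para[x] with clues, emit hop spans, answer
        # spans, and an initial semantic vector for x.
        hops, answers, hidden[x] = s1_extract(x, question, graph)
        for y in hops:                     # hop nodes may be expanded later
            if y not in graph:
                graph[y] = set()
                frontier.append(y)
            elif y not in graph[x]:
                frontier.append(y)         # revisit: a new clue arrived
            graph[x].add(y)
        for y in answers:                  # answer nodes are terminal
            graph.setdefault(y, set())
            graph[x].add(y)
        hidden = s2_update(graph, hidden)  # System 2 reasoning pass
    return graph, hidden
```

Running it with stub modules on the movie question traces the path Quality Cafe → Old School → Todd Phillips from the running example.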
The CogQA framework can be implemented as the system in Fig. 21.6. It utilizes
BERT (Devlin et al, 2019) as System 1 and a GNN as System 2. The clues clues[x, G] are the sentences in the paragraphs of x's predecessor nodes from which x is extracted. We can observe from Fig. 21.6 that the input to BERT is the concatenation
of the question, the clues passed from predecessor nodes, and the introductory para-
graph of x. Based on these inputs, BERT outputs hop spans and answer spans, as
well as uses the output at position 0 as sem[x, Q, clues].
For System 2, CogQA utilizes a variant of GNN to update the hidden representa-
tions of all nodes. For each node x, its initial representation Xx ∈ Rh is the semantic
vector sem[x, Q, clues] from System 1 (i.e., BERT). The updating formula of the GNN layers is as follows:
Δ = σ((AD^{-1})^T σ(XW1)),
X′ = σ(XW2 + Δ),

where W1, W2 ∈ R^{h×h} are weight matrices, σ is the activation function, A is the adjacency matrix of the cognitive graph G, and D is its degree matrix. The propagation matrix (AD^{-1})^T acts as a localized spectral filter: the transformed vectors σ(XW1) are aggregated along edges, so Δ collects the messages passed to each node from its predecessor nodes, and X′ is the new hidden representation of all paragraphs.

Fig. 21.6: Overview of the CogQA implementation, which uses BERT as System 1 and a GNN as System 2. Figure credit: Ding et al (2019a).
<latexit
∆ =and
filtering.
semantic relation with the question,Xleading
sha1_base64="TklOcwpA1D9+YieOLHI872SLYNc=">AAAB9HicbVDLSgMxFL3js9ZX1aWbYBFclRkVdFl047KCfcB0KJk004ZmkjHJFMvQ73DjQhG3fow7/8ZMOwttPRA4nHMv9+SECWfauO63s7K6tr6xWdoqb+/s7u1XDg5bWqaK0CaRXKpOiDXlTNCmYYbTTqIojkNO2+HoNvfbY6o0k+LBTBIaxHggWMQINlYKujE2wzDKOlP/KehVqm7NnQEtE68gVSjQ6FW+un1J0pgKQzjW2vfcxAQZVoYRTqflbqppgskID6hvqcAx1UE2Cz1Fp1bpo0gq+4RBM/X3RoZjrSdxaCfzkHrRy8X/PD810XWQMZGkhgoyPxSlHBmJ8gZQnylKDJ9YgoliNisiQ6wwMbansi3BW/zyMmmd17yLmnt/Wa3fFHWU4BhO4Aw8uII63EEDmkDgEZ7hFd6csfPivDsf89EVp9g5gj9wPn8AFqCSTA==</latexit>
<latexit
sha1_base64="giCdyg3mshq32J7zJ8qlIOKuQJE=">AAAB6nicbVBNS8NAEJ3Ur1q/qh69LBbBg5REFD0WvHisaD+gDWWznbRLN5uwuxFL6E/w4kERr/4ib/4bt20O2vpg4PHeDDPzgkRwbVz32ymsrK6tbxQ3S1vbO7t75f2Dpo5TxbDBYhGrdkA1Ci6xYbgR2E4U0igQ2ApGN1O/9YhK81g+mHGCfkQHkoecUWOl+3bvqVeuuFV3BrJMvJxUIEe9V/7q9mOWRigNE1Trjucmxs+oMpwJnJS6qcaEshEdYMdSSSPUfjY7dUJOrNInYaxsSUNm6u+JjEZaj6PAdkbUDPWiNxX/8zqpCa/9jMskNSjZfFGYCmJiMv2b9LlCZsTYEsoUt7cSNqSKMmPTKdkQvMWXl0nzvOpdVt27i0rtLI+jCEdwDKfgwRXU4Bbq0AAGA3iGV3hzhPPivDsf89aCk88cwh84nz9BQI21</latexit>
<latexit sha1_base64="L3R95SSqtEt9LlLRewmVMotJdYg=">AAAB8XicbVBNS8NAEJ3Ur1q/qh69LBbBg5REFD0WvHisYD+wDWWznbRLN5uwuymU0H/hxYMiXv033vw3btsctPXBwOO9GWbmBYng2rjut1NYW9/Y3Cpul3Z29/YPyodHTR2nimGDxSJW7YBqFFxiw3AjsJ0opFEgsBWM7mZ+a4xK81g+mkmCfkQHkoecUWOlp3Yvqysc97xpr1xxq+4cZJV4OalAjnqv/NXtxyyNUBomqNYdz02Mn1FlOBM4LXVTjQllIzrAjqWSRqj9bH7xlJxZpU/CWNmShszV3xMZjbSeRIHtjKgZ6mVvJv7ndVIT3voZl0lqULLFojAVxMRk9j7pc4XMiIkllClubyVsSBVlxoZUsiF4yy+vkuZl1buuug9XldpFHkcRTuAUzsGDG6jBPdShAQwkPMMrvDnaeXHenY9Fa8HJZ47hD5zPH2KekKg=</latexit>
<latexit sha1_base64="p5I/W1J7Nnlt+NMgCb9fZgrosk8=">AAAB8XicbVBNS8NAEJ34WetX1aOXxSJ4kJIURY8FLx4r2A9sQ9hsp+3SzSbsbgol9F948aCIV/+NN/+N2zYHbX0w8Hhvhpl5YSK4Nq777aytb2xubRd2irt7+weHpaPjpo5TxbDBYhGrdkg1Ci6xYbgR2E4U0igU2ApHdzO/NUaleSwfzSRBP6IDyfucUWOlp3aQ1RWOg+o0KJXdijsHWSVeTsqQox6Uvrq9mKURSsME1brjuYnxM6oMZwKnxW6qMaFsRAfYsVTSCLWfzS+eknOr9Eg/VrakIXP190RGI60nUWg7I2qGetmbif95ndT0b/2MyyQ1KNliUT8VxMRk9j7pcYXMiIkllClubyVsSBVlxoZUtCF4yy+vkma14l1X3Iercu0yj6MAp3AGF+DBDdTgHurQAAYSnuEV3hztvDjvzseidc3JZ07gD5zPH2QjkKk=</latexit>
<latexit
X[P
X[P
solvedofby
guidance
node
to
sha1_base64="(null)">(null)</latexit>
sha1_base64="(null)">(null)</latexit><latexit
sha1_base64="(null)">(null)</latexit><latexit
sha1_base64="(null)">(null)</latexit><latexit
<latexit
connected
Xx
updated
sha1_base64="(null)">(null)</latexit>
sha1_base64="(null)">(null)</latexit><latexit
sha1_base64="(null)">(null)</latexit><latexit
sha1_base64="(null)">(null)</latexit><latexit
<latexit
sha1_base64="(null)">(null)</latexit>
sha1_base64="(null)">(null)</latexit><latexit
sha1_base64="(null)">(null)</latexit><latexit
sha1_base64="(null)">(null)</latexit><latexit
<latexit
X[x]
sha1_base64="Q2q815ab42RF67VFAub46kmu5lk=">AAACBHicbVC7TsMwFHXKq5RXgLFLRIXEVCWlEoyVWBiLRB9SE1WOc9NadZzIdpCqKAMLv8LCAEKsfAQbf4PTZoCWI1k+OudeX9/jJ4xKZdvfRmVjc2t7p7pb29s/ODwyj0/6Mk4FgR6JWSyGPpbAKIeeoorBMBGAI5/BwJ/dFP7gAYSkMb9X8wS8CE84DSnBSktjs+6mPADhC0wgc6cyKe6WbScqz8dmw27aC1jrxClJA5Xojs0vN4hJGgFXhGEpR45+x8uwUJQwyGtuKkEPmOEJjDTlOALpZYslcutcK4EVxkIfrqyF+rsjw5GU88jXlRFWU7nqFeJ/3ihV4bWXUZ6kCjhZDgpTZqnYKhKxAiqAKDbXBBNB9V8tMsU6EKVzq+kQnNWV10m/1XQum/Zdu9Fpl3FUUR2doQvkoCvUQbeoi3qIoEf0jF7Rm/FkvBjvxseytGKUPafoD4zPHyQXmFo=</latexit>
<latexit
−1
T1
Xrev
Xrev
from neighbors
T ok1
sha1_base64="WkmkOQqV4y/G2CwEGjey+GFekFc=">AAACAnicbVDLSgMxFM3UV62vUVfiJlgEV2VGi7osuHFZwT6gM5RMeqcNzWSGJCOUobjxV9y4UMStX+HOvzHTzkJbD4Qczrn3JvcECWdKO863VVpZXVvfKG9WtrZ3dvfs/YO2ilNJoUVjHstuQBRwJqClmebQTSSQKODQCcY3ud95AKlYLO71JAE/IkPBQkaJNlLfPvJiYweSUMi8kUry+9JJ9HTat6tOzZkBLxO3IFVUoNm3v7xBTNMIhKacKNVzzRw/I1IzymFa8VIFZv6YDKFnqCARKD+brTDFp0YZ4DCW5giNZ+rvjoxESk2iwFRGRI/UopeL/3m9VIfXfsZEkmoQdP5QmHKsY5zngQdMAtV8Ygihkpm/YjoiJg9tUquYENzFlZdJ+7zmXtScu3q1US/iKKNjdILOkIuuUAPdoiZqIYoe0TN6RW/Wk/VivVsf89KSVfQcoj+wPn8A712XuA==</latexit>
<latexit
representation
P rev
P rev
σ ((AD
1]1
2 ]2
2696
sha1_base64="/04fUx5CbtNJGNPyUDBDQPloL60=">AAAB63icbVBNS8NAEJ3Ur1q/qh69BIvgqSQq6LHoxZNUsB/QhrLZTtulu5uwuxFL6F/w4kERr/4hb/4bN20O2vpg4PHeDDPzwpgzbTzv2ymsrK6tbxQ3S1vbO7t75f2Dpo4SRbFBIx6pdkg0ciaxYZjh2I4VEhFybIXjm8xvPaLSLJIPZhJjIMhQsgGjxGTSHT6ZXrniVb0Z3GXi56QCOeq98le3H9FEoDSUE607vhebICXKMMpxWuomGmNCx2SIHUslEaiDdHbr1D2xSt8dRMqWNO5M/T2REqH1RIS2UxAz0oteJv7ndRIzuApSJuPEoKTzRYOEuyZys8fdPlNIDZ9YQqhi9laXjogi1Nh4SjYEf/HlZdI8q/rnVe/+olK7zuMowhEcwyn4cAk1uIU6NIDCCJ7hFd4c4bw4787HvLXg5DOH8AfO5w8XCo5D</latexit>
<latexit
sha1_base64="puK58MFBvgD1nt+jWdz8pL4eOOM=">AAAB7XicbVBNS8NAEJ3Ur1q/qh69LBbBU0lU0GPRi8cK9gPaUDbbSbt2kw27m0IJ/Q9ePCji1f/jzX/jts1BWx8MPN6bYWZekAiujet+O4W19Y3NreJ2aWd3b/+gfHjU1DJVDBtMCqnaAdUoeIwNw43AdqKQRoHAVjC6m/mtMSrNZfxoJgn6ER3EPOSMGis16wrHPa9XrrhVdw6ySrycVCBHvVf+6vYlSyOMDRNU647nJsbPqDKcCZyWuqnGhLIRHWDH0phGqP1sfu2UnFmlT0KpbMWGzNXfExmNtJ5Ege2MqBnqZW8m/ud1UhPe+BmPk9RgzBaLwlQQI8nsddLnCpkRE0soU9zeStiQKsqMDahkQ/CWX14lzYuqd1l1H64qtds8jiKcwCmcgwfXUIN7qEMDGDzBM7zCmyOdF+fd+Vi0Fpx85hj+wPn8ATvNjuU=</latexit>
<latexit
is
|
…
another
−1 ⊤
withrepresentations
= y.arg max
Mean-network
each
P rev1
paragraphs sem[x,
implementation.
…
Hop span
based
…
following
of our in theOur
answer2016).
then
sha1_base64="ycagfIhPAcq9SuB1/HItSluEld4=">AAAB63icbVBNS8NAEJ3Ur1q/qh69LBbRU0lU0GPRi8cK/YI2lM120y7d3YTdjVBC/4IXD4p49Q9589+4SXPQ1gcDj/dmmJkXxJxp47rfTmltfWNzq7xd2dnd2z+oHh51dJQoQtsk4pHqBVhTziRtG2Y47cWKYhFw2g2m95nffaJKs0i2zCymvsBjyUJGsMmk1pCdD6s1t+7mQKvEK0gNCjSH1a/BKCKJoNIQjrXue25s/BQrwwin88og0TTGZIrHtG+pxIJqP81vnaMzq4xQGClb0qBc/T2RYqH1TAS2U2Az0cteJv7n9RMT3vopk3FiqCSLRWHCkYlQ9jgaMUWJ4TNLMFHM3orIBCtMjI2nYkPwll9eJZ3LundVdx+va427Io4ynMApXIAHN9CAB2hCGwhM4Ble4c0Rzovz7nwsWktOMXMMf+B8/gCLgo3n</latexit>
<latexit
Such questions
Ti
iterative
σ (XW
sha1_base64="1EWxMOFjDRT37bRb7ZlgcYDiw9o=">AAAB63icbVBNS8NAEJ3Ur1q/qh69LBbBU0lU0GPRi8cW7Ae0oWy2k3bpZhN2N0IJ/QtePCji1T/kzX/jps1Bqw8GHu/NMDMvSATXxnW/nNLa+sbmVnm7srO7t39QPTzq6DhVDNssFrHqBVSj4BLbhhuBvUQhjQKB3WB6l/vdR1Sax/LBzBL0IzqWPOSMmlxqpaiH1Zpbdxcgf4lXkBoUaA6rn4NRzNIIpWGCat333MT4GVWGM4HzyiDVmFA2pWPsWypphNrPFrfOyZlVRiSMlS1pyEL9OZHRSOtZFNjOiJqJXvVy8T+vn5rwxs+4TFKDki0XhakgJib542TEFTIjZpZQpri9lbAJVZQZG0/FhuCtvvyXdC7q3mXdbV3VGrdFHGU4gVM4Bw+uoQH30IQ2MJjAE7zAqxM5z86b875sLTnFzDH8gvPxDRWcjkI=</latexit>
<latexit
) σ (XW
the
NLP tasks,
on largeare
nodes,Q,from
|Name of entity “Next”|
sha1_base64="T81e0FN4eiLN0l7csieDRUgh6Jc=">AAAB6HicbVBNS8NAEJ34WetX1aOXxSJ4KokKeix68diC/YA2lM120q7dbMLuRiyhv8CLB0W8+pO8+W/ctjlo64OBx3szzMwLEsG1cd1vZ2V1bX1js7BV3N7Z3dsvHRw2dZwqhg0Wi1i1A6pRcIkNw43AdqKQRoHAVjC6nfqtR1Sax/LejBP0IzqQPOSMGivVn3qlsltxZyDLxMtJGXLUeqWvbj9maYTSMEG17nhuYvyMKsOZwEmxm2pMKBvRAXYslTRC7WezQyfk1Cp9EsbKljRkpv6eyGik9TgKbGdEzVAvelPxP6+TmvDaz7hMUoOSzReFqSAmJtOvSZ8rZEaMLaFMcXsrYUOqKDM2m6INwVt8eZk0zyveRcWtX5arN3kcBTiGEzgDD66gCndQgwYwQHiGV3hzHpwX5935mLeuOPnMEfyB8/kD5uOM/g==</latexit>
latexit
<
the hidden representations
vlin et(FCN)
by
3.1 DSystem
x
Ques
)
damental step
extraction
a
steps,
…
clues],
1 ) with
al., 2018)
Pass clues
corpora.
…
former (Vaswani
above
propagation.
decoupled,
When X[x].
to “Next”“Ans”
which
architecture,
den representations
sha1_base64="8WPqCaIDG188Dswr9/97u5Grotk=">AAAB63icbVBNSwMxEJ2tX7V+VT16CRbRU9m1gh6LXjxW6Be0S8mm2TY2yS5JVihL/4IXD4p49Q9589+YbfegrQ8GHu/NMDMviDnTxnW/ncLa+sbmVnG7tLO7t39QPjxq6yhRhLZIxCPVDbCmnEnaMsxw2o0VxSLgtBNM7jK/80SVZpFsmmlMfYFHkoWMYJNJzcHj+aBccavuHGiVeDmpQI7GoPzVH0YkEVQawrHWPc+NjZ9iZRjhdFbqJ5rGmEzwiPYslVhQ7afzW2fozCpDFEbKljRorv6eSLHQeioC2ymwGetlLxP/83qJCW/8lMk4MVSSxaIw4chEKHscDZmixPCpJZgoZm9FZIwVJsbGU7IheMsvr5L2ZdWrVd2Hq0r9No+jCCdwChfgwTXU4R4a0AICY3iGV3hzhPPivDsfi9aCk88cwx84nz+NB43o</latexit>
<latexit
Tj
sha1_base64="WkmkOQqV4y/G2CwEGjey+GFekFc=">AAACAnicbVDLSgMxFM3UV62vUVfiJlgEV2VGi7osuHFZwT6gM5RMeqcNzWSGJCOUobjxV9y4UMStX+HOvzHTzkJbD4Qczrn3JvcECWdKO863VVpZXVvfKG9WtrZ3dvfs/YO2ilNJoUVjHstuQBRwJqClmebQTSSQKODQCcY3ud95AKlYLO71JAE/IkPBQkaJNlLfPvJiYweSUMi8kUry+9JJ9HTat6tOzZkBLxO3IFVUoNm3v7xBTNMIhKacKNVzzRw/I1IzymFa8VIFZv6YDKFnqCARKD+brTDFp0YZ4DCW5giNZ+rvjoxESk2iwFRGRI/UopeL/3m9VIfXfsZEkmoQdP5QmHKsY5zngQdMAtV8Ygihkpm/YjoiJg9tUquYENzFlZdJ+7zmXtScu3q1US/iKKNjdILOkIuuUAPdoiZqIYoe0TN6RW/Wk/VivVsf89KSVfQcoj+wPn8A712XuA==</latexit>
<latexit
sha1_base64="NHajn1S7d4tKGHbUVsWnUGCMXZ0=">AAAB7XicbVBNS8NAEJ3Ur1q/qh69BIvgqSRV0GPRi8cKthbaUDbbSbt2sxt2N4US+h+8eFDEq//Hm//GbZuDtj4YeLw3w8y8MOFMG8/7dgpr6xubW8Xt0s7u3v5B+fCopWWqKDap5FK1Q6KRM4FNwwzHdqKQxCHHx3B0O/Mfx6g0k+LBTBIMYjIQLGKUGCu1GgrHvVqvXPGq3hzuKvFzUoEcjV75q9uXNI1RGMqJ1h3fS0yQEWUY5TgtdVONCaEjMsCOpYLEqINsfu3UPbNK342ksiWMO1d/T2Qk1noSh7YzJmaol72Z+J/XSU10HWRMJKlBQReLopS7Rrqz190+U0gNn1hCqGL2VpcOiSLU2IBKNgR/+eVV0qpV/Yuqd39Zqd/kcRThBE7hHHy4gjrcQQOaQOEJnuEV3hzpvDjvzseiteDkM8fwB87nDz1RjuY=</latexit>
<latexit
{z
overall model
identical FCNs.
to construct
F (Xxincluding
regarded
as the final
Paragraph[x]
A is
based
and
sha1_base64="EzJauHCFVmw9rVYLAt7MIeB3Ps8=">AAAB6nicbVBNS8NAEJ3Ur1q/qh69LBbBU0lU0GPVi8eK9gPaUDbbSbt0swm7G6GE/gQvHhTx6i/y5r9x2+agrQ8GHu/NMDMvSATXxnW/ncLK6tr6RnGztLW9s7tX3j9o6jhVDBssFrFqB1Sj4BIbhhuB7UQhjQKBrWB0O/VbT6g0j+WjGSfoR3QgecgZNVZ6uJa6V664VXcGsky8nFQgR71X/ur2Y5ZGKA0TVOuO5ybGz6gynAmclLqpxoSyER1gx1JJI9R+Njt1Qk6s0idhrGxJQ2bq74mMRlqPo8B2RtQM9aI3Ff/zOqkJr/yMyyQ1KNl8UZgKYmIy/Zv0uUJmxNgSyhS3txI2pIoyY9Mp2RC8xZeXSfOs6p1X3fuLSu0mj6MIR3AMp+DBJdTgDurQAAYDeIZXeHOE8+K8Ox/z1oKTzxzCHzifPzNjjbw=</latexit>
<latexit
P rev2
is utilized
Sentence A
et al.,
Input as
is
leading
Ans
of visiting
propagation
after a propagation
Ans span
to
has become
…
equations.
sha1_base64="cs1Q9fet/6GNtc+Tzw/y6WCTX8Y=">AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0lU0GPRi8cW7Ae0oWy2k3btZhN2N0Io/QVePCji1Z/kzX/jts1BWx8MPN6bYWZekAiujet+O4W19Y3NreJ2aWd3b/+gfHjU0nGqGDZZLGLVCahGwSU2DTcCO4lCGgUC28H4bua3n1BpHssHkyXoR3QoecgZNVZqZP1yxa26c5BV4uWkAjnq/fJXbxCzNEJpmKBadz03Mf6EKsOZwGmpl2pMKBvTIXYtlTRC7U/mh07JmVUGJIyVLWnIXP09MaGR1lkU2M6ImpFe9mbif143NeGNP+EySQ1KtlgUpoKYmMy+JgOukBmRWUKZ4vZWwkZUUWZsNiUbgrf88ippXVS9y6rbuKrUbvM4inACp3AOHlxDDe6hDk1ggPAMr/DmPDovzrvzsWgtOPnMMfyB8/kD6GeM/w==</latexit>
<latexit
y
sha1_base64="6Ps7j3DCjP4TdyO7DF0/yE/WYZQ=">AAAB63icbVBNS8NAEJ3Ur1q/qh69LBbRU0lU0GPRi8cK/YI2lM120y7d3YTdjVBC/4IXD4p49Q9589+4SXPQ1gcDj/dmmJkXxJxp47rfTmltfWNzq7xd2dnd2z+oHh51dJQoQtsk4pHqBVhTziRtG2Y47cWKYhFw2g2m95nffaJKs0i2zCymvsBjyUJGsMmk1nB6PqzW3LqbA60SryA1KNAcVr8Go4gkgkpDONa677mx8VOsDCOcziuDRNMYkyke076lEguq/TS/dY7OrDJCYaRsSYNy9fdEioXWMxHYToHNRC97mfif109MeOunTMaJoZIsFoUJRyZC2eNoxBQlhs8swUQxeysiE6wwMTaeig3BW355lXQu695V3X28rjXuijjKcAKncAEe3EADHqAJbSAwgWd4hTdHOC/Ou/OxaC05xcwx/IHz+QOOjI3p</latexit>
<latexit
X for graph
step
answer.
x is extracted.
sentences
is illustrated
step
to serve
of
|Possible answer “Ans”|
the cognitive
one of
binary are
n×h
on whichWe
sha1_base64="(null)">(null)</latexit>
sha1_base64="(null)">(null)</latexit><latexit
sha1_base64="(null)">(null)</latexit><latexit
sha1_base64="(null)">(null)</latexit><latexit
<latexit
the adjacent
sha1_base64="nta34gE+XG+4LV5XUqH2RD7n1o0=">AAAB7HicbVBNS8NAEJ34WetX1aOXxSJ6KokKeiyK4EWoYNpCG8pmu2mX7m7C7kYoob/BiwdFvPqDvPlv3LQ5aOuDgcd7M8zMCxPOtHHdb2dpeWV1bb20Ud7c2t7ZreztN3WcKkJ9EvNYtUOsKWeS+oYZTtuJoliEnLbC0U3ut56o0iyWj2ac0EDggWQRI9hYyb/t3Z+Ue5WqW3OnQIvEK0gVCjR6la9uPyapoNIQjrXueG5iggwrwwink3I31TTBZIQHtGOpxILqIJseO0HHVumjKFa2pEFT9fdEhoXWYxHaToHNUM97ufif10lNdBVkTCapoZLMFkUpRyZG+eeozxQlho8twUQxeysiQ6wwMTafPARv/uVF0jyreec19+GiWr8u4ijBIRzBKXhwCXW4gwb4QIDBM7zCmyOdF+fd+Zi1LjnFzAH8gfP5A383jdA=</latexit>
<latexit
sha1_base64="(null)">(null)</latexit>
sha1_base64="(null)">(null)</latexit><latexit
sha1_base64="(null)">(null)</latexit><latexit
sha1_base64="(null)">(null)</latexit><latexit
<latexit
nodes
elaborately
sha1_base64="UUpcMFddJxBgFFJj3Qoeg+A+48A=">AAAB8nicbVBNS8NAFHypX7V+VT16WSyCeCiJKHoseNBjBWsLaSmb7aZdutmE3RehhP4MLx4U8eqv8ea/cdPmoK0DC8PMe+y8CRIpDLrut1NaWV1b3yhvVra2d3b3qvsHjyZONeMtFstYdwJquBSKt1Cg5J1EcxoFkreD8U3ut5+4NiJWDzhJeC+iQyVCwShaye9GFEeMyux22q/W3Lo7A1kmXkFqUKDZr351BzFLI66QSWqM77kJ9jKqUTDJp5VuanhC2ZgOuW+pohE3vWwWeUpOrDIgYaztU0hm6u+NjEbGTKLATuYRzaKXi/95forhdS8TKkmRKzb/KEwlwZjk95OB0JyhnFhCmRY2K2EjqilD21LFluAtnrxMHs/r3mXdvb+oNc6KOspwBMdwCh5cQQPuoAktYBDDM7zCm4POi/PufMxHS06xcwh/4Hz+AHOGkUk=</latexit>
<latexit
EM
2017),Ina the
asthe
Step of
T okM
Before Visiting x
Cognitive Graph GG
GNN.
Visiting x
of are
Results of The
frontier graph,
SQuAD (Rajpurkar
in Figure
potQA dataset (Yang et al, 2018b), there are also questions that aim to compare
1 )) training at different iterative
thedirectly
(21.3)
node x,
Bang Liu, Lingfei Wu
classifica-
a
Hot-
predic-
(AD−1 )⊤ , the GNN per-
is the 1degree matrix of G . By left
matrix
are ag-
(21.2)
in System 2 updates the hidden representations Xx . Figure credit: Ding et al (2019a).
generates new hop and answer nodes based on the clues[x, G ] discovered by System
visiting the node x, System 1
GNN.
2.
} | {z }
[CLS] Question [SEP ] clues[x, G] [SEP ] P ara[x]
Sentence B
Figure 2: Overview of CogQA implementation. When visiting the node x, System 1 generates new hop and answer
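To make the propagation in System 2 concrete, the update of Eqs. (21.2) and (21.3) can be sketched in a few lines of NumPy. This is a minimal illustration, not the CogQA implementation: the graph and dimensions are arbitrary, and tanh stands in for the unspecified activation σ.

```python
import numpy as np

def gnn_propagate(X, A, W1, W2, sigma=np.tanh):
    """One propagation step: Delta = sigma((A D^-1)^T sigma(X W1)),
    X' = sigma(X W2 + Delta)."""
    deg = A.sum(axis=0)                       # degree of each node (column sums)
    AD_inv = A / np.maximum(deg, 1.0)         # A D^-1: divide column j by deg(j)
    Delta = sigma(AD_inv.T @ sigma(X @ W1))   # aggregate vectors from neighbors
    return sigma(X @ W2 + Delta)              # updated hidden representations X'

rng = np.random.default_rng(0)
n, h = 5, 8                                   # 5 nodes, hidden size 8
A = (rng.random((n, n)) < 0.4).astype(float)  # toy cognitive-graph adjacency
X = rng.standard_normal((n, h))
W1 = rng.standard_normal((h, h))
W2 = rng.standard_normal((h, h))
X_new = gnn_propagate(X, A, W1, W2)
print(X_new.shape)  # (5, 8)
```

Dividing each column of A by the corresponding degree implements AD^{-1}, so the transpose aggregates normalized messages from each node's neighbors before the residual-style update of Eq. (21.3).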
The cognitive graph structure in the CogQA framework offers ordered, entity-
level explainability and is well suited to relational reasoning, owing to the explicit reasoning
paths in it. Aside from simple paths, it can also clearly display joint or loopy reason-
ing processes, where new predecessors might bring new clues about the answer. As
we can see, by modeling the context information as a cognitive graph and applying
GNN to such representation, we can mimic the dual process of human perception
and reasoning and achieve excellent performance on multi-hop machine reading
comprehension tasks, as demonstrated in (Ding et al, 2019a).
Applying graph neural networks to NLP tasks with suitable graph representations
for text can bring significant benefits, as we have discussed and shown through
the case studies. Although GNNs have achieved outstanding performance in many
tasks, including text clustering, classification, generation, machine reading compre-
hension and so on, there are still numerous open problems to solve at the moment
to better understand human language with graph-based representations and models.
In particular, here we categorize and discuss the open problems or future directions
for graph-based NLP in terms of five aspects: model design of GNNs, data rep-
resentation learning, multi-task relationship modeling, world model, and learning
paradigm.
Although several GNN models are applicable to NLP tasks, only a small subset of
them has been explored for model design. More advanced GNN models can be utilized or
improved to handle the scale, depth, dynamics, heterogeneity, and explainability of
natural language texts. First, scaling GNNs to large graphs helps to utilize resources
such as large-scale knowledge graphs better. Second, most GNN architectures are
shallow, and the performance drops after two to three layers. Designing deeper GNNs
enables node representation learning with information from larger and more adap-
tive receptive fields (Liu et al, 2020c). Third, we can utilize dynamic graphs to model
the evolving or temporal phenomena in texts, e.g., the development of stories or
events. Correspondingly, dynamic or temporal GNNs (Skarding et al, 2020) can help
capture the dynamic nature in specific NLP tasks. Fourth, the syntactic, semantic, as
well as knowledge graphs in NLP are essentially heterogeneous graphs. Developing
heterogeneous GNNs (Wang et al, 2019i; Zhang et al, 2019b) can help better utilize
the various node and edge information in text and understand its semantics.
Last but not least, the need for improved explainability, interpretability, and trust of
AI systems in general demands principled methodologies. One way is using GNNs
as a model of neural-symbolic computing and reasoning (Lamb et al, 2020), as the
data structure and reasoning process can be naturally captured by graphs.
For data representations, most existing GNNs can only learn from input when
a graph-structure of input data is available. However, real-world graphs are often
noisy and incomplete or might not be available at all. Designing effective models
and algorithms to automatically learn the relational structure in input data with
limited structured inductive biases can efficiently solve this problem. Instead of man-
ually designing specific graph representations of data for different applications, we
can enable models to automatically identify the implicit, high-order, or even causal
relationships between input data points, and learn the graph structure and repre-
sentations of inputs. To achieve these, recent research on graph pooling (Lee et al,
2019b), graph transformers (Yun et al, 2019), and hypergraph neural networks (Feng
et al, 2019c) can be applied and further explored.
Multi-task learning (MTL) in deep neural networks for NLP has recently re-
ceived increasing interest as it has the potential to efficiently regularize models and
to reduce the need for labeled data (Bingel and Søgaard, 2017). We can marry the
representation power of graph structures with multi-task learning to integrate diverse
input data, such as images, text pieces, and knowledge bases, and jointly learn a uni-
fied and structured representation for various tasks. Furthermore, we can learn the
relationships or correlations between different tasks and exploit the learned relation-
ship for curriculum learning to accelerate the convergence rate for model training.
Finally, with the unified graph representation and integration of different data, as
well as the joint and curriculum learning of different tasks, NLP or AI systems will
gain the ability to continually acquire, fine-tune, and transfer knowledge and skills
throughout their lifespan.
Grounded language learning or acquisition (Matuszek, 2018; Hermann et al,
2017) is another trending research topic that aims at learning the meaning of lan-
guage as it applies to the physical world. Intuitively, language can be better learned
when presented and interpreted in the context of the world it pertains to. It has
been demonstrated that GNNs can efficiently capture joint dependencies between
different elements in the world (Li et al, 2017e). Besides, they can also efficiently
utilize the rich information in multiple modalities of the world to help understand
the meaning of scene texts (Gao et al, 2020a). Therefore, representing the world
or environment with graphs and GNNs to improve the understanding of languages
deserves more research endeavors.
Lastly, research about self-supervised pre-training for GNNs is also attracting
more attention. Self-supervised representation learning leverages input data itself
as supervision and benefits almost all types of downstream tasks (Liu et al, 2020f).
Numerous successful self-supervised pre-training strategies, such as BERT (Devlin
et al, 2019) and GPT (Radford et al, 2018) have been developed to tackle a variety
of language tasks. For graph learning, when task-specific labeled data is extremely
scarce, or the graphs in the training set are structurally very different from graphs
in the test set, pre-training GNNs can serve as an efficient approach for transfer
learning on graph-structured data (Hu et al, 2020c).
21.6 Conclusions
Over the past few years, graph neural networks have become powerful and practical
tools for a variety of problems that can be modeled by graphs. In this chapter, we
21 Graph Neural Networks in Natural Language Processing 481
22 Graph Neural Networks in Program Analysis
Miltiadis Allamanis
Microsoft Research, e-mail: [email protected]
22.1 Introduction
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 483
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_22
“counter”, without any additional context, she/he will conclude with a high proba-
bility that this variable is a non-negative integer that enumerates some elements or
events. In contrast, a formal program analysis method — having no additional con-
text — will conservatively conclude that “counter” may contain any value.
Machine learning-based program analysis (Section 22.2) aims to provide this
human-like ability to learn to reason over ambiguous and partial information at the
cost of foregoing the ability to provide (absolute) guarantees. Instead, through learn-
ing common coding patterns, such as naming conventions and syntactic idioms,
these methods can offer (probabilistic) evidence about aspects of the behavior of a
program. This is not to say that machine learning makes traditional program analy-
ses redundant. Instead, machine learning provides a useful weapon in the arsenal of
program analysis methodologies.
Graph representations of programs play a central role in program analysis and al-
low reasoning over the complex structure of programs. Section 22.3 illustrates one
such graph representation, which we use throughout this chapter, and discusses alternatives.
We then discuss GNNs which have found a natural fit for machine learning-based
program analyses and relate them to other machine learning models (Section 22.4).
GNNs allow us to represent, learn, and reason over programs elegantly by integrat-
ing the rich, deterministic relationships among program entities with the ability to
learn over ambiguous coding patterns. In this chapter, we discuss how to approach two
practical static program analyses using GNNs: bug detection (Section 22.5) and
probabilistic type inference (Section 22.6). We conclude this chapter (Section 22.7) by discussing
open challenges and promising new areas of research in the area.
Before discussing program analysis with GNNs, it is important to take a step back
and ask where machine learning can help program analysis and why. At a first look
these two fields seem incompatible: static program analyses commonly seek guar-
antees (e.g., a program never reaches some state) and dynamic program analyses
certify some aspect of a program’s execution (e.g., specific inputs yield expected
outputs), whereas machine learning models probabilities of events.
At the same time, the burgeoning area of machine learning for code (Allamanis
et al, 2018a) has shown that machine learning can be applied to source code across
a series of software engineering tasks. The premise is that although code has a de-
terministic, unambiguous structure, humans write code that contains patterns and
ambiguous information (e.g. comments, variable names) that is valuable for under-
standing its functionality. It is this phenomenon that program analysis can also take
advantage of.
There are two broad areas where machine learning can be used in program anal-
ysis: learning proof heuristics, and learning static or dynamic program analyses.
Commonly, static program analyses resort to converting the analysis task into a
combinatorial search problem, such as a Boolean satisfiability problem (SAT), or
22 Graph Neural Networks in Program Analysis 485
another form of theorem proving. Such problems are known to often be computa-
tionally intractable. Machine learning-based methods, such as the work of (Irving
et al, 2016) and (Selsam and Bjørner, 2019) have shown the promise that heuris-
tics can be learned to guide combinatorial search. Discussing this exciting area of
research is out of scope for this chapter. Instead, we focus on the static program analysis
learning problem.
Conceptually, a specification defines a desired aspect of a program’s functionality
and can take many forms, from natural language descriptions to formal mathemati-
cal constructs. Traditional static program analyses commonly resort to formulating
program analyses through rigorous formal methods and dynamic analyses through
observations of program executions. However, defining such program analyses is a
tedious, manual task that can rarely scale to a wide range of properties and programs.
Although it is imperative that formal methods are used for safety-critical applications,
there is a wide range of applications that miss out on the opportunity to benefit
from program analysis. Machine learning-based program analysis aims to address
this, but sacrifices the ability to provide guarantees. Specifically, machine learning
can help program analyses deal with the two common sources of ambiguities: latent
specifications, and ambiguous execution contexts (e.g., due to dynamically loaded
code). Program analysis learning commonly takes one of three forms, discussed
next.
Specification Tuning where an expert writes a sound program analysis which may
yield many false positives (false alarms). Raising a large number of false alarms
leads to the analogue of Aesop's "The Boy Who Cried Wolf": too many false alarms
lead to true positives being ignored, diminishing the utility of the analysis. To ad-
dress this, work such as those of (Raghothaman et al, 2018) and (Mangal et al,
2015) use machine learning methods to “tune” (or post-process) a program analy-
sis by learning which aspects of the formal analysis can be discounted, increasing
precision at the cost of recall (soundness).
Specification Inference where a machine learning model is asked to learn to pre-
dict a plausible specification from existing code. By making the (reasonable) as-
sumption that most of the code in a codebase complies with some latent specifica-
tion, machine learning models are asked to infer closed forms of those specifica-
tions. The predicted specifications can then be input to traditional program analyses
that check if a program satisfies them. Examples of such models are the factor graphs
of (Kremenek et al, 2007) for detecting resource leaks, the work of (Livshits et al,
2009) and (Chibotaru et al, 2019) for information flow analysis, the work of (Si
et al, 2018) for generating loop invariants, and the work of (Bielik et al, 2017) for
synthesizing rule-based static analyzers from examples. The type inference problem
discussed in Section 22.6 is also an instance of specification inference.
Weaker specifications — commonly used in dynamic analyses — can also be in-
ferred. For example, Ernst et al (2007) and Hellendoorn et al (2019a) aim to predict
invariants (assert statements) by observing the values during execution. Tufano et al
(2020) learn to generate unit tests that describe aspects of the code’s behavior.
Black Box Analysis Learning where the machine learning model acts as a black
box that performs the program analysis and raises warnings but never explicitly for-
mulates a concrete specification. Such forms of program analysis have great flexi-
bility and go beyond what many traditional program analyses can do. However, they
often sacrifice explainability and provide no guarantees. Examples of such methods
include DeepBugs (Pradel and Sen, 2018), Hoppity (Dinella et al, 2020), and the
variable misuse problem (Allamanis et al, 2018b) discussed in Section 22.5.
In Section 22.5 and 22.6, we showcase two learned program analyses using
GNNs. However, we first need to discuss how to represent programs as graphs (Sec-
tion 22.3) and how to process these graphs with GNNs (Section 22.4).
Many traditional program analysis methods are formulated over graph represen-
tations of programs. Examples of such representations include syntax trees, con-
trol flow, data flow, program dependence, and call graphs each providing different
views of a program. At a high level, programs can be thought of as a set of heteroge-
neous entities that are related through various kinds of relations. This view directly
maps a program to a heterogeneous directed graph G = (V , E ), with each entity
being represented as a node and each relationship of type r represented as an edge
(vi , r, v j ) ∈ E . These graphs resemble knowledge bases, with two important differences:
(1) nodes and edges can be deterministically extracted from source code and
other program artifacts; (2) there is one graph per program/code snippet.
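This view can be captured by a minimal data structure. The sketch below is illustrative only; the node identifiers and relation names are assumptions for the example, not a fixed schema from the chapter.

```python
from dataclasses import dataclass, field

@dataclass
class ProgramGraph:
    """A heterogeneous directed program graph G = (V, E):
    typed nodes and typed edges (v_i, r, v_j)."""
    nodes: dict[str, str] = field(default_factory=dict)            # node id -> node type
    edges: list[tuple[str, str, str]] = field(default_factory=list)  # (src, relation, dst)

    def add_node(self, node_id: str, node_type: str) -> None:
        self.nodes[node_id] = node_type

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        # Nodes are extracted deterministically, so both endpoints must exist.
        assert src in self.nodes and dst in self.nodes, "endpoints must exist"
        self.edges.append((src, relation, dst))

g = ProgramGraph()
g.add_node("tok:counter", "token")
g.add_node("tok:=", "token")
g.add_node("sym:counter", "symbol")
g.add_edge("tok:counter", "NextToken", "tok:=")
g.add_edge("tok:counter", "OccurrenceOf", "sym:counter")
```

One such graph is built per file or snippet, matching the "one graph per program" property noted above.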
However, deciding which entities and relations to include in a graph represen-
tation of a program is a form of feature engineering and task-dependent. Note that
there is no unique or widely accepted method to convert a program into a graph
representation; different representations offer trade-offs between expressing various
program properties, the size of the graph representation, and the (human and com-
putational) effort required to generate them.
In this section we illustrate one possible program graph representation inspired
by (Allamanis et al, 2018b), who model each source code file as a single graph.
We discuss other graph representations at the end of this section. Figure 22.1 shows
the graph for a hand-crafted synthetic Python code snippet curated to illustrate a
few aspects of the graph representation. A high-level explanation of the entities
and relations follows; for a detailed overview of the relevant concepts, we refer the
reader to programming language literature, such as the compiler textbook of (Aho
et al, 2006).
Tokens A program's source code is, in its most basic form, a string of characters. By
construction, programming languages can be deterministically tokenized (lexed) into
a sequence of tokens (also known as lexemes). Each token can then be represented
as a node (white boxes with gray border in Figure 22.1) of “token” type. These
Fig. 22.1: Graph representation of a synthetic Python code snippet (legend: Token
Node, Syntax Node, Symbol Node; edge types: Child, Occurrence Of, May Next Use).
nodes are connected with a NextToken edge (not shown in Figure 22.1) to form a
linear chain.
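For Python, this lexing step is available in the standard library's tokenize module. The sketch below is illustrative: the source string and the choice of which purely structural tokens to drop are assumptions for the example, not the exact representation of Allamanis et al (2018b).

```python
import io
import tokenize

SRC = "counter = 0\nfor item in items:\n    counter += 1\n"

# Lex the source into tokens, dropping purely structural tokens.
SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
        tokenize.DEDENT, tokenize.ENDMARKER}
tokens = [tok.string
          for tok in tokenize.generate_tokens(io.StringIO(SRC).readline)
          if tok.type not in SKIP]

# Each consecutive pair becomes a NextToken edge in the program graph.
next_token_edges = list(zip(tokens, tokens[1:]))
print(tokens[:4])  # ['counter', '=', '0', 'for']
```

Because tokenization is deterministic, the NextToken chain can be rebuilt identically for any syntactically valid file.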
Syntax The sequence of tokens is parsed into a syntax tree. The leaves of the tree are the tokens, and all other nodes of the tree are “syntax nodes” (Figure 22.1; grey-blue rounded boxes). Using edges of Child type, all syntax nodes and tokens are connected to form a tree structure. This structure provides contextual information about the syntactical role of the tokens and groups them into expressions and statements, the core units in program analysis.
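A minimal sketch of the Child edges using Python’s standard ast module (an illustration under the simplifying assumption that AST node types stand in for syntax nodes):

```python
import ast

def child_edges(source):
    """Child edges of the syntax tree as (parent_type, child_type) pairs."""
    edges = []
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges
```

For "y = x + 1" the edges include the Module-to-Assign and BinOp-to-Name pairs that connect the statement to its expression and tokens.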
Symbols Next, we introduce “symbol” nodes (Figure 22.1; black boxes with
dashed outline). Symbols in Python are the variables, functions, and packages that are available at a given scope of a program. Like most compilers and interpreters, after
parsing the code, Python creates a symbol table containing all the symbols within
488 Miltiadis Allamanis
each file of code. For each symbol, a node is created. Then, every identifier token
(e.g., the content tokens in Figure 22.1) or expression node is connected to the sym-
bol node it refers to. Symbol nodes act as a central point of reference among the
uses of variables and are useful for modeling the long-range relationships (e.g., how
an object is used).
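Python exposes exactly this symbol table through the standard symtable module; the sketch below (hypothetical helper name) lists, per scope, the names that would each become a “symbol” node:

```python
import symtable

def symbol_nodes(source):
    """Collect, per scope, the names that would become 'symbol' nodes."""
    scopes = {}
    def visit(scope):
        scopes[scope.get_name()] = sorted(s.get_name()
                                          for s in scope.get_symbols())
        for child in scope.get_children():
            visit(child)
    visit(symtable.symtable(source, "<snippet>", "exec"))
    return scopes
```

The module-level scope is named "top"; each function scope lists its parameters and local variables.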
Data Flow To convey information about the program execution we add data flow
edges to the graph (dotted curved lines in Figure 22.1) using an intraprocedural
dataflow analysis. Although the actual data flow within the program during execution is unknown due to branching in loops and if statements, we can add
edges indicating all the valid paths that data may flow through the program. Take as
an example the parameter min_len in Figure 22.1. If the condition in line 3 is true, then min_len will be accessed in line 4, but not in line 5. Conversely, if the condition in line 3 is false, then the program will proceed to line 5, where min_len will be accessed. We denote this information with a MayNextUse edge. This construction
resembles a program dependence graph (PDG) used in compilers and conventional
program analyses. In contrast to the edges previously discussed, MayNextUse has a
different flavor. It does not indicate a deterministic relationship but sketches all pos-
sible data flows during execution. Such relationships are central in program analyses
where existential or universal properties of programs need to be computed. For ex-
ample, a program analysis may need to compute that for all (∀) possible execution
paths some property is true, or that there exists (∃) at least one possible execution
with some property.
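A toy approximation of these MayNextUse edges (author’s sketch; a real implementation would build a proper control-flow graph and treat assignments as kills): track, per variable, the set of uses that may have executed last, fork that state at an if/else, and merge it afterwards:

```python
import ast

def _uses(node):
    """Variable reads inside `node`, in source order."""
    reads = [n for n in ast.walk(node)
             if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)]
    return sorted(reads, key=lambda n: (n.lineno, n.col_offset))

def _record(name_node, last_use, edges):
    for src_line in last_use.get(name_node.id, ()):
        edges.add((name_node.id, src_line, name_node.lineno))
    last_use[name_node.id] = {name_node.lineno}

def _process(stmts, last_use, edges):
    for stmt in stmts:
        if isinstance(stmt, ast.If):
            for n in _uses(stmt.test):          # the condition runs first
                _record(n, last_use, edges)
            body_state = {k: set(v) for k, v in last_use.items()}
            else_state = {k: set(v) for k, v in last_use.items()}
            _process(stmt.body, body_state, edges)
            _process(stmt.orelse, else_state, edges)
            for k in set(body_state) | set(else_state):  # merge branches
                last_use[k] = body_state.get(k, set()) | else_state.get(k, set())
        else:
            for n in _uses(stmt):
                _record(n, last_use, edges)

def may_next_use_edges(source):
    """Approximate MayNextUse edges (name, from_line, to_line) for the
    first function in `source` (straight-line code and if/else only)."""
    fn = ast.parse(source).body[0]
    edges, last_use = set(), {}
    _process(fn.body, last_use, edges)
    return edges
```

On a truncation function like the one in Figure 22.1, a use in the if condition gets one edge into each branch, mirroring the two valid execution paths.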
It is interesting to observe that just using the token nodes and NextToken edges
we can (deterministically) compute all other nodes and edges. Compilers do ex-
actly that. Then why introduce those additional nodes and edges and not let a neural
network figure them out? Extracting such graph representations is cheap computa-
tionally and can be performed using the compiler/interpreter of the programming
language without substantial effort. By directly providing this information to ma-
chine learning models — such as GNNs — we avoid “spending” model capacity for
learning deterministic facts and introduce inductive biases that can help on program
analysis tasks.
Alternative Graph Representations So far we have presented a simplified graph representation inspired by (Allamanis et al, 2020). However, this is just one possible representation among many; it emphasizes the local aspects of code, such as syntax and intraprocedural data flow. These aspects will be useful for the tasks discussed in Sections 22.5 and 22.6. Other entities and relationships can be added to the graph representation of Figure 22.1. For example, Allamanis et al (2018b) use a
GuardedBy edge type to indicate that a statement is guarded by a condition (i.e., it
is executed only when the condition is true), and Cvitkovic et al (2018) use a SubtokenOf edge to connect tokens to special subtoken nodes, indicating that the nodes share a common subtoken (e.g., the tokens max_len and min_len in Figure 22.1 share the len subtoken).
Representations such as the one presented here are local, i.e., they emphasize the local structure of the code and allow detecting and using fine-grained patterns. Other local
representations, such as the one of (Cummins et al, 2020), emphasize the data and control flow, removing the rich natural language information in identifiers and comments, which is unnecessary for some compiler program analysis tasks. However,
such local representations yield extremely large graphs when representing multiple
files and the graphs become too large for current GNN architectures to meaningfully
process (e.g., due to very long distances among nodes). Although a single, general graph representation that includes every imaginable entity and relationship would seem useful, existing GNNs would struggle to process the deluge of data. Nevertheless, alternative graph constructions that emphasize different program aspects are
found in the literature and provide different trade-offs.
One such representation is the global hypergraph representation of (Wei et al,
2019) that emphasizes the inter- and intraprocedural type constraints among expres-
sions in a program, ignoring information about syntactic patterns, control flow, and
intraprocedural data flow. This allows processing whole programs (instead of single
files; as in the representation of Figure 22.1) in a way that is suitable for predicting
type annotations, but misses the opportunity to learn from syntactic and control-flow
patterns. For example, it would be hard argue for using this representation for the
variable misuse bug detection discussed in Section 22.5.
Another kind of graph representation is the extrinsic one defined by (Abdelaziz et al, 2020), who combine syntactic and semantic information of programs
with metadata such as documentation and content from question and answer (Q&A)
websites. Such representations often de-emphasize aspects of the code structure, focusing on other natural language and social elements of software development. Such
a representation would be unsuitable for the program analyses of Sections 22.5 and
22.6.
Given the predominance of graph representations for code, a variety of machine learning techniques have been employed for program analyses over program graphs, well before GNNs became established in the machine learning community. In
these methods, we find some of the origins and motivations for GNNs.
One popular approach has been to project the graph into another simpler repre-
sentation that other machine learning methods can accept as input. Such projections
include sequences, trees, and paths. For example, Mir et al (2021) encode the sequences of tokens around each variable usage to predict its type (as in the use case
of Section 22.6). Sequence-based models offer great simplicity and have good com-
putational performance but may miss the opportunity to capture complex structural
patterns such as data and control flow.
Another successful representation is the extraction of paths from trees or graphs.
For example, Alon et al (2019a) extract a sample of the paths between every two
terminal nodes in an abstract syntax tree, which resembles random walk meth-
ods (Vishwanathan et al, 2010). Such methods can capture the syntactic information and learn to derive some of the code’s semantic information. These paths are easy to extract and provide useful features to learn about code. Nevertheless, they are lossy projections of the entities and relations within a program, which a GNN can, in principle, use in full.
Finally, factor graphs, such as conditional random fields (CRFs), work directly on graphs. Such models commonly include carefully constructed graphs that capture
only the relevant relationships. The most prominent example in program analysis is the work of Raychev et al (2015), which captures the type constraints among
expressions and the names of identifiers. While such models accurately represent
entities and relationships, they commonly require manual feature engineering and
cannot easily learn “soft” patterns beyond those explicitly modeled.
Graph Neural Networks GNNs rapidly became a valuable tool for learned program analyses given their flexibility to learn from rich patterns and the ease of combining them with other neural network components. Given a program graph representation, GNNs compute the network embeddings for each node, to be used for downstream tasks, such as those discussed in Sections 22.5 and 22.6. First, each
entity/node $v_i$ is embedded into a vector representation $\mathbf{n}_{v_i}$. Program graphs have rich and diverse information in their nodes, such as meaningful identifier names (e.g., max_len). To take advantage of the information within each token and symbol node, its string representation is subtokenized (e.g., “max”, “len”) and each initial node representation $\mathbf{n}_{v_i}$ is computed by pooling the embeddings of the subtokens, i.e., for a node $v_i$ and for sum pooling, the input node representation is computed as
$$\mathbf{n}_{v_i} = \sum_{s \in \text{SUBTOKENIZE}(v_i)} \mathbf{t}_s,$$
where $\mathbf{t}_s$ is a learned embedding for a subtoken $s$. For syntax nodes, their initial state is the embedding of the type of the node. Then, any GNN architecture that can process directed heterogeneous graphs1 can be used to compute the network embeddings, i.e.,
$$\{\mathbf{h}_{v_i}\} = \text{GNN}\left(\mathcal{G}', \{\mathbf{n}_{v_i}\}\right), \qquad (22.1)$$
where the GNN commonly has a fixed number of “layers” (e.g., 8), $\mathcal{G}' = (\mathcal{V}, \mathcal{E} \cup \mathcal{E}_{inv})$, and $\mathcal{E}_{inv}$ is the set of inverse edges of $\mathcal{E}$, i.e., $\mathcal{E}_{inv} = \{(v_j, r^{-1}, v_i) \mid (v_i, r, v_j) \in \mathcal{E}\}$. The network embeddings $\{\mathbf{h}_{v_i}\}$ are then the input to a task-specific neural network. We discuss two tasks in the next sections.
1 GGNNs (Li et al, 2016b) have historically been a common option, but other architectures have
shown improvements (Brockschmidt, 2020) over plain GGNNs for some tasks.
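The subtokenization and sum pooling above can be sketched as follows (a toy stand-in: random vectors replace the learned embeddings $\mathbf{t}_s$, and the subtokenizer splits snake_case and camelCase):

```python
import re
import numpy as np

DIM = 8
_rng = np.random.default_rng(0)
_embeddings = {}    # subtoken -> vector; learned in a real model

def subtokenize(name):
    """Split an identifier into lowercase subtokens, e.g. max_len -> [max, len]."""
    parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", name)
    return [p.lower() for p in parts] or [name]

def embed(subtoken):
    if subtoken not in _embeddings:
        _embeddings[subtoken] = _rng.normal(size=DIM)
    return _embeddings[subtoken]

def initial_node_state(token):
    """Sum-pool the subtoken embeddings: n_v = sum over s of t_s."""
    return sum(embed(s) for s in subtokenize(token))
```

Sum pooling is one choice among several; mean or max pooling follow the same pattern.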
We now focus on a black box analysis learning problem that utilizes the graph representation discussed in the previous section. Specifically, we discuss the variable misuse task, first introduced by (Allamanis et al, 2018b), but employ the formulation of (Vasic et al, 2018). A variable misuse is the incorrect use of one variable in place of another that is already in scope. Figure 22.1 contains such a bug in line 4, where instead of min_len, the max_len variable needs to be used to correctly truncate the content. To tackle this task, a model needs to first localize (locate) the bug (if one exists) and then suggest a repair.
Such bugs happen frequently, often due to careless copy-paste operations, and can often be thought of as “typos”. Karampatsis and Sutton (2020) find that more than 12% of the bugs in a large set of Java codebases are variable misuses, whereas Tarlow et al (2020) find that 6% of Java build errors in Google’s engineering systems are variable misuses. This is a lower bound, since the Java compiler can only detect variable misuse bugs through its type checker. The author conjectures, from his personal experience, that many more variable misuse bugs arise during code editing and are resolved before being committed to a repository.
Note that this is a black box analysis learning task. No explicit specification of what the user tries to achieve exists. Instead, the GNN needs to infer this from common coding patterns, natural language information within comments (like the one in line 2; Figure 22.1), and identifier names (like min, max, and len) to reason about the presence of a likely bug. In Figure 22.1 it is reasonable to assume that the developer’s intent is to truncate content to max_len when it exceeds that size (line 4). Thus, the goal of the variable misuse analysis is to (1) localize the bug (if one exists) by pointing to the buggy node (the min_len token in line 4), and (2) suggest a repair (the max_len symbol).
To achieve this, assume that a GNN has computed the network embeddings $\{\mathbf{h}_{v_i}\}$ for all nodes $v_i \in \mathcal{V}$ in the program graph $\mathcal{G}$ (Equation 22.1). Then, let $\mathcal{V}_{vu} \subset \mathcal{V}$ be the set of token nodes that refer to variable usages, such as the min_len token in line 4 (Figure 22.1). First, a localization module aims to pinpoint which variable usage (if any) is a variable misuse. This is implemented as a pointer network (Vinyals et al, 2015) over $\mathcal{V}_{vu} \cup \{\varnothing\}$, where $\varnothing$ denotes the “no bug” event with a learned $\mathbf{h}_\varnothing$ embedding. Then, using a (learnable) projection $\mathbf{u}$ and a softmax, we can compute the probability distribution over $\mathcal{V}_{vu}$ and the special “no bug” event,
$$p_{loc}(v_i) = \operatorname{softmax}_{v_j \in \mathcal{V}_{vu} \cup \{\varnothing\}} \left(\mathbf{u}^\top \mathbf{h}_{v_j}\right). \qquad (22.2)$$
In the case of Figure 22.1, a GNN detecting the variable misuse bug in line 4 would assign a high $p_{loc}$ to the node corresponding to the min_len token, which is the location of the variable misuse bug. During (supervised) training, the loss is simply the cross-entropy classification loss of the probability of the ground-truth location (Equation 22.2).
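Numerically, the localization head is a dot product per candidate followed by a softmax over the candidates and the “no bug” event. A minimal numpy sketch (names hypothetical; h_nobug plays the role of the learned no-bug embedding):

```python
import numpy as np

def localize(h_usages, h_nobug, u):
    """p_loc over variable-usage nodes plus the 'no bug' event (last entry)."""
    H = np.vstack(h_usages + [h_nobug])    # shape (|V_vu| + 1, d)
    scores = H @ u
    exp = np.exp(scores - scores.max())    # numerically stable softmax
    return exp / exp.sum()
```

Training minimizes the negative log of the entry for the ground-truth location.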
Fig. 22.2: A diff snippet of code with a real-life variable misuse error caught by a
GNN-based model in the https://github.com/spulec/moto open-source
project.
Repair, given the location of a variable misuse bug, can also be represented as a pointer network over the nodes of the symbols that are in scope at the variable misuse location $v_{bug}$. We define $\mathcal{V}_{s@v_{bug}}$ as the set of the symbol nodes of the alternative candidate symbols that are in scope at $v_{bug}$, except the symbol node of $v_{bug}$. In the case of Figure 22.1 and the bug in line 4, $\mathcal{V}_{s@v_{bug}}$ would contain the content and max_len symbol nodes. We can then compute the probability of repairing the localized variable misuse bug with the symbol $s_i$ as
$$p_{rep}(s_i) = \operatorname{softmax}_{s_j \in \mathcal{V}_{s@v_{bug}}} \left(\mathbf{w}^\top [\mathbf{h}_{v_{bug}}, \mathbf{h}_{s_i}]\right),$$
i.e., the softmax of the concatenation of the node embeddings of $v_{bug}$ and $s_i$, projected onto a learnable vector $\mathbf{w}$ (i.e., a linear layer). For the example of Figure 22.1, $p_{rep}(s_i)$ should be high for the symbol node of max_len, which is the intended repair for the variable misuse bug. Again, in supervised training, we minimize the cross-entropy loss of the probability of the ground-truth repair.
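The repair head differs from localization only in scoring the concatenation of the two embeddings; a minimal numpy sketch with hypothetical names:

```python
import numpy as np

def repair_probs(h_bug, h_symbols, w):
    """p_rep over candidate symbols: softmax of w^T [h_bug, h_s]."""
    scores = np.array([w @ np.concatenate([h_bug, h_s]) for h_s in h_symbols])
    exp = np.exp(scores - scores.max())    # numerically stable softmax
    return exp / exp.sum()
```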
Training When a large dataset of variable misuse bugs and the relevant fixes can
be mined, the GNN-based model discussed in this section can be trained in a super-
vised manner. However, such datasets are hard to collect at the scale that existing
deep learning methods require to achieve reasonable performance. Instead, work in this area has opted to automatically insert random variable misuse bugs in code
scraped from open-source repositories — such as GitHub — and create a corpus of
randomly inserted bugs (Vasic et al, 2018; Hellendoorn et al, 2019b). However, the
random generation of buggy code needs to be carefully performed. If the randomly
introduced bugs are “too obvious”, the learned models will not be useful. For exam-
ple, random bug generators should avoid introducing a variable misuse that causes
a variable to be used before it is defined (use-before-def). Although such randomly
generated corpora are not entirely representative of real-life bugs, they have been
used to train models that can catch real-life bugs.
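A minimal bug injector along these lines (author’s sketch): pick a random variable read and swap in another name that is first bound on an earlier line, which crudely filters out use-before-def swaps:

```python
import ast
import random

def inject_variable_misuse(source, seed=0):
    """Swap one variable read for another in-scope name; returns the buggy
    source and (original, replacement, line), or (source, None) if no swap
    is possible. Only names first bound on an earlier line are candidates."""
    tree = ast.parse(source)
    first_def = {}   # name -> first line where it is bound
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for a in node.args.args:             # parameters bind at the def
                first_def.setdefault(a.arg, node.lineno)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            first_def.setdefault(node.id, node.lineno)
    reads = [n for n in ast.walk(tree)
             if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
             and n.id in first_def]
    rng = random.Random(seed)
    rng.shuffle(reads)
    for victim in reads:
        options = [name for name, line in first_def.items()
                   if name != victim.id and line < victim.lineno]
        if options:
            original, victim.id = victim.id, rng.choice(options)
            return ast.unparse(tree), (original, victim.id, victim.lineno)
    return source, None
```

Real bug generators apply further filters (e.g., respecting scoping rules and avoiding trivially detectable swaps), as the surrounding text notes.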
When evaluating variable misuse models, like those presented in this section, they achieve relatively high accuracy over randomly generated corpora, with accuracies of up to 75% (Hellendoorn et al, 2019b). However, in the author’s experience with real-life bugs, while some variable misuse bugs are recalled, precision tends to be low, making them impractical for deployment. Improving upon this is
an important open research problem. Nevertheless, actual bugs have been caught in
practice. Figure 22.2 shows such an example caught by a GNN-based variable misuse detector. Here, the developer incorrectly passed identity_pool instead of identity_pool_id as the exception argument when identity_pool was None (no pool with the requested id could be found). The GNN-based black-box analysis seems to have learned to “understand” that it is unlikely that the developer’s intention is to pass None to the ResourceNotFoundError constructor, and instead suggests that it should be replaced by identity_pool_id. This is without ever formulating a formal specification or creating a symbolic program analysis rule.
Types are one of the most successful innovations in programming languages. Specif-
ically, type annotations are explicit specifications over the valid values a variable can
take. When a program type checks, we get a formal guarantee that variables will only take values of the annotated type. For example, if a variable has
an int annotation, it must contain integers but not strings, floats, etc. Furthermore,
types can help coders understand code more easily, and software tools such as autocompletion and code navigation to be more precise. However, many programming languages either forgo the guarantees provided by types or require their users to explicitly provide type annotations.
To overcome these limitations, specification inference methods can be used to
predict plausible type annotations and bring back some of the advantages of typed
code. This is especially useful in code with partial contexts (e.g., a standalone snip-
pet of code in a webpage) or optionally typed languages. This section looks into
Python, which provides an optional mechanism for defining type annotations. For
example, content in Figure 22.1 can be annotated as content: str in line 1 to indi-
cate that the developer expects that it will only contain string values. These annota-
tions can then be used by type checkers, such as mypy (mypy Contributors, 2021)
and other developer tools and code editors. This is the probabilistic type inference
problem, first proposed by (Raychev et al, 2015). Here we use the GRAPH2CLASS GNN-based formulation of (Allamanis et al, 2020) treating this as a classification task over the symbols of the program similar to (Hellendoorn et al, 2018). Pandi et al (2020) offer an alternative formulation of the problem.
For type checking methods to operate, explicit type annotations need to be provided by a user. When those are not present, type checking may not be able to function and provide any guarantees about the program. However, this misses the opportunity to probabilistically reason over the types of the program from other sources of information, such as variable names and comments. Concretely, in the example of Figure 22.1, it would be reasonable to assume that min_len and max_len have an integer type given their names and usage. We can then use this “educated guess” to type check the program and recover some guarantees about the program execution.
Such models can find multiple applications. For example, they can be used in
recommendation systems that help developers annotate a code base. They may help
developers find incorrect type annotations or allow editors to provide assistive fea-
tures — such as autocomplete — based on the predicted types. Or they may offer
“fuzzy” type checking of a program (Pandi et al, 2020).
In its simplest form, predicting types is a node classification task over the subset of symbol nodes. Let $\mathcal{V}_s$ be the set of nodes of “symbol” type in the heterogeneous graph of a program. Let also $\mathcal{Z}$ be a fixed vocabulary of type annotations, along with a special Any type2. We can then use the node embeddings of every node $v \in \mathcal{V}_s$ to predict the possible type of each symbol,
$$p(s_j : \tau) = \operatorname{softmax}_{\tau' \in \mathcal{Z}} \left(\mathbf{E}_{\tau}^\top \mathbf{h}_{v_{s_j}} + b_{\tau}\right),$$
i.e., the inner product of each symbol node embedding with a learnable type embedding $\mathbf{E}_\tau$ for each type $\tau \in \mathcal{Z}$, plus a learnable bias $b_\tau$. Training can then be performed by minimizing some classification loss, such as the cross-entropy loss, over a corpus of (partially) annotated code.
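As a minimal numpy sketch of this classification head (hypothetical shapes: E stacks one row per type in the vocabulary, b holds the per-type biases):

```python
import numpy as np

def type_probs(h_symbol, E, b):
    """p(s : tau): softmax over the type vocabulary of E_tau^T h + b_tau."""
    scores = E @ h_symbol + b              # shape (|Z|,)
    exp = np.exp(scores - scores.max())    # numerically stable softmax
    return exp / exp.sum()
```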
Type Checking The type prediction problem is a specification inference problem
(Section 22.2) and the predicted type annotations can be passed to a standard type
checking tool which can verify that the predictions are consistent with the source
code’s structure (Allamanis et al, 2020) or search for the most likely prediction
that is consistent with the program’s structure (Pradel et al, 2020). This approach helps reduce false positives, but does not eliminate them. A trivial example is
an identity function def foo(x): return x. A machine learning model may incorrectly
deduce that x is a str and that foo returns a str. Although the type checker will consider this prediction type-correct, it is hard to justify as correct in practice.
Training The type prediction model discussed in this section can be trained in a
supervised fashion. By scraping large corpora of code, such as open-source code
found on GitHub3, we can collect thousands of type-annotated symbols. By stripping those type annotations from the original code and using them as a ground truth, a training and validation set can be generated.
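That stripping step can be sketched with Python’s standard ast module (a simplified illustration; a production pipeline would also handle async functions, lambdas, and attribute targets):

```python
import ast

class _Stripper(ast.NodeTransformer):
    """Remove type annotations, recording them as ground-truth labels."""
    def __init__(self):
        self.labels = {}   # name -> annotation source text

    def visit_arg(self, node):
        if node.annotation is not None:
            self.labels[node.arg] = ast.unparse(node.annotation)
            node.annotation = None
        return node

    def visit_AnnAssign(self, node):
        if isinstance(node.target, ast.Name):
            self.labels[node.target.id] = ast.unparse(node.annotation)
        if node.value is not None:    # `x: int = 3` becomes `x = 3`
            return ast.copy_location(
                ast.Assign(targets=[node.target], value=node.value), node)
        return ast.copy_location(ast.Pass(), node)   # bare `x: int`

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        if node.returns is not None:
            self.labels[node.name] = ast.unparse(node.returns)
            node.returns = None
        return node

def strip_type_annotations(source):
    """Return (un-annotated source, {name: annotation}) training pairs."""
    stripper = _Stripper()
    tree = stripper.visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree), stripper.labels
```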
Such systems have been shown to achieve a reasonably high accuracy (Allamanis et al, 2020) but with some limitations: type annotations are highly structured and sparse. For example, Dict[Tuple[int, str], List[bool]] is a valid type annotation that may appear infrequently in code. New user-defined types (classes) will also appear at test time. Thus, treating type annotations as distinct classes of a classification problem
2 The type Any represents the top of the type lattice and is somewhat analogous to the special UNKNOWN token used in NLP.
3 Automatically scraped code corpora are known to suffer from a large number of duplicates (Al-
lamanis, 2019). When collecting such corpora special care is needed to remove those duplicates to
ensure that the test set is not contaminated with training examples.
1 def __init__(
2 self,
3 - embedding_dim: float = 768,
4 - ffn_embedding_dim: float = 3072,
5 - num_attention_heads: float = 8,
6 + embedding_dim: int = 768,
7 + ffn_embedding_dim: int = 3072,
8 + num_attention_heads: int = 8,
9 dropout: float = 0.1,
10 attention_dropout: float = 0.1,
Fig. 22.3: A diff snippet from the incorrect type annotation caught by Typilus (Al-
lamanis et al, 2020) in the open-source fairseq library.
is prone to severe class imbalance issues and fails to capture information about the
structure within types. Adding new types to the model can be solved by employing
meta-learning techniques such as those used in Typilus (Allamanis et al, 2020; Mir
et al, 2021), but exploiting the internal structure of types and the rich type hierarchy
is still an open research problem.
Applications of type prediction models include suggesting new type annotations for previously un-annotated code, but such models can also be used for other downstream tasks that can exploit a probabilistic estimate of the type of some symbol.
Additionally, such models can help find incorrect type annotations provided by the
users. Figure 22.3 shows such an example from Typilus (Allamanis et al, 2020).
Here the neural model “understands” from the parameter names and the usage of the
parameters (not shown) that the variables cannot contain floats but instead should
contain integers.
to primarily using the formal structure of the program, ignoring ambiguous information in identifiers and code comments. Researching analyses that can better leverage this information may open new and fruitful directions to help coders across many application domains.
Crucially, how to integrate formal aspects of program analyses into the learning process is still an open question. Most specification inference work
(e.g. Section 22.6) commonly treats the formal analyses as a separate pre- or post-
processing step. Integrating the two viewpoints more tightly will create better, more
robust tools. For example, researching better ways to incorporate (symbolic) con-
straints, search, and optimization concepts within neural networks and GNNs will
allow for better learned program analyses that can learn to better capture program
properties.
From a software engineering perspective, additional research is needed on the user experience (UX) of the program analysis results presented to users. Most of the
existing machine learning models do not have performance characteristics that allow them to work autonomously. Instead, they make probabilistic suggestions and present them to users. Creating or finding the affordances of the developer environment that allow surfacing probabilistic observations and communicating the probabilistic nature of machine learning model predictions will significantly help accelerate the use of learned program analyses.
Within the research area of GNNs there are many open research questions. GNNs
have shown the ability to learn to replicate some of the algorithms used in common
program analysis techniques (Veličković et al, 2019) but with strong supervision.
How can complex algorithms be learned with GNNs using just weak supervision?
Additionally, existing techniques often lack the representational capabilities of for-
mal methods. Combinatorial concepts found in formal methods, such as sets and lattices, lack direct analogues in deep learning. Researching richer combinatorial
— and possibly non-parametric — representations will provide valuable tools for
learning program analyses.
Finally, common themes in deep learning also arise within this domain:
• The explainability of the decisions and warnings raised by learned program
analyses is important to coders who need to understand them and either mark
them as false positives or address them appropriately. This is especially impor-
tant for black-box analyses.
• Traditional program analyses offer explicit guarantees about a program’s behav-
ior even within adversarial settings. Machine learning-based program analyses
relax many of those guarantees towards reducing false positives or aiming to
provide some value beyond the one offered by formal methods (e.g. use am-
biguous information). However, this makes these analyses vulnerable to adver-
sarial attacks (Yefet et al, 2020). Retrieving some form of adversarial robustness is desirable for learned program analyses and remains an open research problem.
• Data efficiency is also an important problem. Most existing GNN-based pro-
gram analysis methods either make use of relatively large datasets of annotated
code (Section 22.6) or use unsupervised/self-supervised proxy objectives (Section 22.5). However, many of the desired program analyses do not fit these
frameworks and would require at least some form of weak supervision.
Pre-training on graphs is one promising direction that could address this problem, but has so far focused on homogeneous graphs, such as social/citation networks and molecules. However, techniques developed for homogeneous graphs, such as the pre-training objectives used, do not transfer well to heterogeneous graphs like those used in program analysis.
• All machine learning models are bound to generate false positive suggestions. However, when models provide well-calibrated confidence estimates, suggestions can be accurately filtered to reduce false positives and their confidence better communicated to the users. Researching neural methods that can make
accurate and calibrated confidence estimates will allow for greater impact of
learned program analyses.
Acknowledgements The author would like to thank Earl T. Barr for useful discussions and feed-
back on drafts of this chapter.
Collin McMillan
23.1 Introduction
Software Mining is broadly defined as any task that seeks to solve a software en-
gineering problem by analyzing the myriad artifacts in projects and their connec-
tions (Hassan and Xie, 2010; Kagdi et al, 2007; Zimmermann et al, 2005). Consider
the task of writing documentation. A human performing this task may gain compre-
hension of the software by reading the source code and understanding how different
parts of the code interact. Then he or she may write documentation explaining the
behavior of the system based on that comprehension. Likewise, if a machine is to
automate writing that documentation, the machine must also analyze the software
in order to comprehend it. This analysis is often called “Software Mining.”
Collin McMillan
Department of Computer Science, University of Notre Dame, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 499
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_23
Software is a high-value target for GNNs partly because software tends to be very
highly structured as a graph or set of graphs. Different software mining tasks may
take advantage of different graph structures from software. Graph representations
of software go far beyond any specific software mining task. Graph representations
are baked into the way compilers convert source code into machine code (e.g., parse
trees). They are used during linking and dependency resolution (e.g., program de-
pendence graphs). And they have long been the basis for many visualization and support
tools to help programmers understand large software projects (Gema et al, 2020;
Ottenstein and Ottenstein, 1984; Silva, 2012).
When considering how to make use of these different graph structures in software, the questions one must ask are basically: “what are the nodes?” and “what
are the edges?” These questions take two forms in software engineering research:
a macro- and a micro-level representation. The macro-level representation tends to
concern connections among large software artifacts, such as a graph in which ev-
ery source code file is a node and every dependency among the files is an edge.
The micro-level representation, in contrast, tends to include small details, such as a
graph in which every token in a function is a node, and every edge is a syntactic link
between the nodes, such as are often extracted from an Abstract Syntax Tree.
This section compares and contrasts these representations as they relate to using
GNNs for Software Mining tasks.
few dozen packages. But a very popular alternative is a function/method call graph,
in which each function in a program is a node and each call relationship from one
function to another is a directed edge between two nodes. Call graphs are popu-
lar within Software Engineering literature because they are relatively easy to extract
while giving enough detail for a strong macro-level view of a program without over-
whelming data sizes (recall a typical program has around 1800 functions (LeClair
and McMillan, 2019)).
Graph neural networks are becoming a staple of research in software mining tasks.
The history of deep learning for software mining tasks is chronicled in several sur-
veys (Allamanis et al, 2018a; Lin et al, 2020b; Semasaba et al, 2020; Song et al,
2019b). Allamanis et al (2018a) cast a particularly wide net and broadly classify
software mining tasks that rely on neural networks as either “code generational” or “code representational”. This classification is based on a big picture view of the
models used for these tasks. In a code generational task, the output of the model is
source code. Tasks in this category include automatic program repair (Chen et al,
2019e; Dinella et al, 2020; Wang et al, 2018d; Vasic et al, 2018; Yasunaga and
Liang, 2020), code completion (Li et al, 2018a; Raychev et al, 2014), and compiler
optimization (Brauckmann et al, 2020). These models tend to be trained with large
volumes of code vetted somehow to ensure quality, with the aim of learning norms
in code that lead to that quality. Then, during inference, the goal is to bring arbi-
trary code into closer conformance with those norms. For example, a model may be
presented with code containing a bug, and that bug may be repaired by changing
the code to be more like the model’s predictions (which, it is hoped, represent the
norms learned in training).
In contrast to code generational tasks are code representational tasks. These tasks
use source code primarily as the input to a neural model during training but have a
wide variety of outputs. Tasks in this category include code clone detection (Ain
et al, 2019; Li et al, 2017c; White et al, 2016), code search (Chen and Zhou, 2018;
Sachdev et al, 2018; Zhang et al, 2019f), type prediction (Pradel et al, 2020), and
code summarization (Song et al, 2019b). In models designed to solve these tasks, the
goal is usually to create a vectorized representation of code, which is then used for
a specific task that may only be tangentially related to the code itself. For instance,
for source code search, a neural model may be used to project the source code in
a large repository into a vector space. Then a different model is used to project a
natural language query into the same vector space. The code nearest to the query
in the vector space is considered as the search result for that query. Code clone
504 Collin McMillan
detection is similar: code is projected into a vector space, and very nearby code may
be considered a clone in that space.
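The shared-vector-space retrieval described above can be sketched in a few lines (a minimal illustration; the toy vectors stand in for learned encoder outputs, and nearest_code is a hypothetical helper, not an API from any of the cited systems):

```python
import numpy as np

def nearest_code(query_vec, code_vecs):
    """Return the index of the code vector closest to the query
    by cosine similarity, as in embedding-based code search."""
    q = query_vec / np.linalg.norm(query_vec)
    c = code_vecs / np.linalg.norm(code_vecs, axis=1, keepdims=True)
    return int(np.argmax(c @ q))

# Toy example: three "code snippets" already projected into a shared
# 3-d space by a code encoder, and a natural language query projected
# into the same space by a separate query encoder.
code_vecs = np.array([[0.9, 0.1, 0.0],    # e.g., "send a guess"
                      [0.0, 1.0, 0.2],    # e.g., "parse a file"
                      [0.1, 0.0, 0.95]])  # e.g., "open a socket"
query_vec = np.array([1.0, 0.2, 0.1])

best = nearest_code(query_vec, code_vecs)  # index of the search result
```

Clone detection follows the same pattern, except both sides of the comparison are code vectors rather than a query and a code vector.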
The use of graph neural networks is ballooning in both categories of software
mining tasks. In code generational tasks, the focus tends to be on modifications to a
program graph such as an AST that bring that graph into closer conformity with the
model’s expectations. While some approaches focus on code as a sequence (Chen
et al, 2019e), the recent trend has been to recommend graph transformations or
highlight non-conforming areas of the graph (Dinella et al, 2020; Yasunaga and
Liang, 2020). This is useful in code because a recommendation may relate to code
elements that are quite far away from each other, such as the declaration of a vari-
able and a use of that variable. In contrast, in code representational tasks, the focus
tends to be on creating ever more complex graph representations of code and then
using GNN architectures to exploit that complexity. For example, the first GNN-
based approaches tended to use only the AST (LeClair et al, 2020), while newer
approaches use attention-based GNNs to emphasize the most important edges out
of a multitude that can be extracted from code (Zügner et al, 2021). Despite differ-
ences in code generational and representational tasks, the trend in both categories
has strongly favored GNNs.
Consider the task of code summarization, which exemplifies the trend towards
GNNs. Code summarization is the task of writing natural language descriptions of
source code. Typically these descriptions are used in documentation for that source
code, e.g., JavaDocs. The evolution of this research area is shown in Table 23.1.
The term “code summarization” was coined around 2010, and several years of active
research followed using templated and IR-based solutions. Then around 2017, solu-
tions based on neural networks proliferated. At first, these were essentially seq2seq
models in which the encoder sequence is the code and decoder sequence is the de-
scription. Starting around 2018, the state-of-the-art moved to linearized AST repre-
sentations. Graph neural networks were proposed around this time as a better solu-
tion (Allamanis et al, 2018b), but it would be another year or more before GNN-based
approaches appeared in the literature. GNNs are poised to underpin the state-of-the-
art. In the next section, we dive into the details of a GNN-based solution, showing
why it works and areas of future growth.
[Table 23.1 spans this page in the original; the per-column marks (IR, M, T, A, S, G) for
each paper could not be recovered from the extraction. The rows list, chronologically:
Haiduc et al (2010); Sridhara et al (2011); Rastkar et al (2011); De Lucia et al (2012);
Panichella et al (2012); Moreno et al (2013); Rastkar and Murphy (2013); McBurney and
McMillan (2014); Rodeghero et al (2014); Rastkar et al (2014); Cortés-Coy et al (2014);
Moreno et al (2014); Oda et al (2015); Abid et al (2015); Iyer et al (2016); McBurney et al
(2016); Zhang et al (2016a); Rodeghero et al (2017); Fowkes et al (2017); Badihi and
Heydarnoori (2017); Loyola et al (2017); Lu et al (2017b); Jiang et al (2017); Hu et al
(2018c); Hu et al (2018b); Wan et al (2018); Liang and Zhu (2018); Alon et al (2019a,b);
Gao et al (2019b); LeClair et al (2019); Nie et al (2019); Haque et al (2020); Haldar et al
(2020); LeClair et al (2020); Ahmad et al (2020); Zügner et al (2021); Liu et al (2021).]
Table 23.1: Overview of papers on the topic of source code summarization, from the paper that
coined the term "code summarization" in 2010 through the following ten years. Note the evolution from
IR/template-based solutions to neural models and now to GNN models. Column IR indicates if the
approach is based on Information Retrieval. M indicates manual features/heuristics. T indicates
templated natural language. A indicates Artificial Intelligence (usually Neural Network) solutions.
S means structural data such as the AST is used (for AI-based models). G means a GNN is the
primary means of representing that structural data.
The input to this technique is a micro-level representation of code: just the AST
of a single subroutine. All nodes in the AST are nodes in the GNN, whether
they are visible to the programmer or not. The only edge type is the parent-child
relationship in the AST. Consider the code and example summaries in Example 23.1
and the AST of this code in Figure 23.1. In Figure 23.1, bold indicates text
from source code that is visible to a human reader in the source code file –
a depth-first traversal of the leaf nodes reveals the code sequence, e.g., "public void
send guess ...". Non-bold indicates AST nodes that the compiler uses to represent
structure. Visible text is preprocessed as it would appear to the model. For example,
the name sendGuess is split into send and guess, and both nodes are children
of a name node, which is a child of function. Neither name nor function is
visible to a human reader. The circled areas 1-4 are reference points for discussion
in Sections 23.4.1.4 and 23.4.2.
The AST in Figure 23.1 is the only input to the model, from which the model
must generate an English description. Technically, the AST is extracted with srcml (Collard et al,
2011) and preprocessed (e.g., splitting identifiers such as sendGuess into send and
guess) using community-standard procedures (LeClair and McMillan, 2019). The
reference output description in Example 23.1 is the actual JavaDoc summary written
by a human programmer. The summary labeled “gnn ast” is the prediction from this
approach. The summary labeled “flat ast” is the output from an immediate prede-
cessor that used an RNN on a linearization of the AST. The only difference between
the GNN and flat AST approach is the structure of the encoder; all other model de-
tails are identical. Yet, we note that the GNN-based approach matched the reference
exactly, while the flat AST approach matched only a few words. Shortly we will
analyze this example to provide intuition about why the model performed so well.
Example 23.1 summaries:
reference: sends a guess to the server
ast-attendgru-gnn (LeClair et al, 2020): sends a guess to the socket
ast-attendgru-flat (LeClair et al, 2019): attempts to initiate a <UNK> guess

Example 23.1 source code:

public void sendGuess(String guess) {
    if( isConnected() ) {
        gui.statusBarInfo("Querying...", false);
        try {
            os.write( (guess + "\r\n").getBytes() );
            os.flush();
        } catch (IOException e) {
            gui.statusBarInfo("Failed to send guess.", true);
            System.err.println("IOException during send guess");
        }
    }
}
23.4.1.3 Experiment
aggregate BLEU score but this score obscures some details of the performance,
which we will see in the next section.
The second key finding is that a hop distance of two results in the best over-
all performance. While models with GNN iterations ranging between one and ten
all achieve higher scores than the baselines, the model performs best with two it-
erations. One explanation is that nodes in the AST are only relevant to each other
within a distance of about two. The AST is a tree, so information is propagated up
and down levels of the tree. For two hops, this means information from a node will
propagate to its parent in the first hop and then to its grandparent and siblings in
the second hop. It is possible that nodes beyond this scope are not that relevant to
the model for code summarization. However, another explanation is that the method
of aggregating information in each hop is less efficient after two hops – this inter-
pretation would be consistent with the finding by Xu et al (2018c) that the aggregation
procedure is critical to GNN deployment. Either way, the practical advice for model
designers is that the optimal number of GNN iterations for this task is not that high.
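The two-hop propagation just described can be made concrete with a toy AST (a minimal sketch: the node names follow the sendGuess example, and the binary "has received information from" mask stands in for a real learned aggregation):

```python
import numpy as np

# Tiny AST fragment: function -> name -> {send, guess}; the single edge
# type is the parent-child relationship, treated as undirected here.
nodes = ["function", "name", "send", "guess"]
edges = [(0, 1), (1, 2), (1, 3)]

n = len(nodes)
A = np.eye(n)                       # self-loops keep a node's own signal
for i, j in edges:
    A[i, j] = A[j, i] = 1

# One-hot features: track where each node's information travels.
H = np.eye(n)
for _ in range(2):                  # two GNN hops
    H = (A @ H > 0).astype(float)   # binary "has received info" mask

send = nodes.index("send")
# After two hops, "send" has heard from its parent, grandparent, and sibling.
reached_send = {nodes[i] for i in range(n) if H[send, i] > 0}
```

With one hop, "send" would only have heard from "name"; the second hop is what brings in "function" and "guess", matching the parent/grandparent/sibling scope described above.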
The third key finding is that the use of the GRU after the GNN layer (Figure 23.2
after area C) improves overall performance. The models labeled with the suffix
+GRU use this GRU layer, as described in Section 23.4.1.2. The model labeled with
the suffix +dense calculates attention between the decoder and the output matrix
from the GNN. This model did not perform as well. A likely explanation is that
source code has not only a tree structure via the AST – it also has an order from
start to end. The GRU after the GNN captures this order and seems to result in a
better representation of the code for summarization.
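This design choice can be illustrated with a minimal, randomly initialized GRU cell in NumPy (all shapes and names here are illustrative, not the paper's trained model): consuming the GNN's node matrix in source order makes the encoding order-sensitive, unlike a permutation-invariant pooling of node vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # embedding size (illustrative)

# Pretend GNN output: one row per AST node, in source/traversal order.
gnn_out = rng.normal(size=(6, d))

# Minimal GRU cell (randomly initialized; a real model learns these).
Wz, Uz = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Wr, Ur = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Wh, Uh = rng.normal(size=(d, d)), rng.normal(size=(d, d))
sig = lambda x: 1 / (1 + np.exp(-x))

def gru_encode(seq):
    h = np.zeros(d)
    for x in seq:
        z = sig(x @ Wz + h @ Uz)        # update gate
        r = sig(x @ Wr + h @ Ur)        # reset gate
        h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)
        h = (1 - z) * h + z * h_tilde
    return h

enc_ordered = gru_encode(gnn_out)
enc_reversed = gru_encode(gnn_out[::-1])
# Unlike summing an unordered node set, the recurrent encoding changes
# when node order changes -- it captures start-to-end order.
order_sensitive = not np.allclose(enc_ordered, enc_reversed)
```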
A question remains regarding what benefit can be attributed to the use of a GNN.
While we and others may observe an improvement in overall BLEU scores when
using a GNN (LeClair et al, 2020; Zügner et al, 2021; Liu et al, 2021), a key point
is that the GNN contributes orthogonal information to the model. This section ex-
plores how.
Concentration of Improvement:
The improvement is concentrated among a set of subroutines where the GNN
adds significant improvement. It is not the case that the BLEU scores increase
marginally for all subroutines – there is a set of subroutines that benefits the most.
Consider Figure 23.3. The pie chart divides the test set subroutines from the
experiment described above into five groups: one group where ast-attendgru-gnn per-
formed the best, one group where ast-attendgru-flat performed the best, one group
where they tied, one group for attendgru, and one group for other ties including
when all models made the same prediction. For simplicity, we use BLEU-1 scores
(BLEU-1 is unigram precision, single words predicted correctly).
What we observe is that each model achieves the highest BLEU-1 score for 20-
25% of the subroutines. For about 12% of the subroutines, the AST-based models
23 Graph Neural Networks in Software Mining 509
were tied, meaning that in total over 50% of the subroutines benefited from AST
information (GNN plus flat AST models). But there still exists a large set of sub-
routines where attendgru outperformed all others. However, consider the bar chart
in Figure 23.3. The “all” columns show the BLEU-1 score for that approach – note
that ast-attendgru-gnn is only marginally higher than others. The “best” columns
show the score for the set where that model achieved the highest BLEU-1 score (the
set with that model’s name indicated in the pie chart). We observe that the BLEU-1
scores for ast-attendgru-gnn are much higher for this set than others.
Demonstrating Improvement in Example 23.1:
A deeper dive into the subroutine sendGuess() from Example 23.1 demon-
strates the improvement that a GNN provides. Recall that the ast-attendgru-gnn
model calculates attention between each position in the decoder and each node in
the output from the GNN (Section 23.4.1.2, Figure 23.2 area E). The result is an
m × n matrix where m is the length of the decoder sequence and n is the number of
nodes (in the implementation, m = 13 and n = 100). Thus each position in the attention
matrix represents the relevance of an AST node to a word in the output summary.
In fact, the attention matrix for ast-attendgru-flat has the same meaning: the mod-
els are identical except that ast-attendgru-gnn encodes the AST with a GNN then a
GRU, while the flat model uses only the GRU. Comparing the values in these atten-
tion matrices provides a useful contrast of the two models because they show the
contribution of the AST encoding to the prediction.
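The attention computation described here can be sketched as follows (a minimal dot-product-attention stand-in with toy random inputs; the real models learn both sets of encodings):

```python
import numpy as np

def attention_matrix(decoder_states, node_encodings):
    """Dot-product attention between every decoder position and every
    AST node encoding, row-normalized with softmax. This is the m x n
    matrix described in the text (m=13 and n=100 in the implementation)."""
    scores = decoder_states @ node_encodings.T           # shape (m, n)
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
m, n, d = 13, 100, 8                                     # toy dimensions
att = attention_matrix(rng.normal(size=(m, d)), rng.normal(size=(n, d)))
# att[i, j] is the relevance of AST node j to output word i.
```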
The benefit of a GNN becomes apparent in the attention networks in Figure 23.3.
Both models have a very similar attention activation to the tokens in the source code
sequence (Figures 23.3a and 23.3c). Both models show close attention to position
2 of the code sequence, which is the word “send”. This is not surprising consider-
ing that “send” appears in the method’s name. Yet, ast-attendgru-flat still incorrectly
predicts the first word of the summary as “attempts”, while ast-attendgru-gnn cor-
rectly predicts “sends.” The explanation lies in the attention to AST nodes. The
flat model focuses on node 37 (Figure 23.3d), which is an expr stmt node immedi-
ately after the try block, just before the call to os.write(), indicated as area
1 in Figure 23.1. The reason for this focus suggested by the original paper on that
model (LeClair et al, 2019) is that the flat AST model tends to learn broadly similar
code structure such as “if-block, try-block, call to os.write().” Under this expla-
nation, methods in the training set with this if-try-call-catch pattern are associated
with the word “attempts.”
In contrast, the GNN-based model focuses on position 8, which is the word
“send” in the method name, just like in the attention to the code sequence (Fig-
ure 23.3b). The result is that the GNN-based AST encoding reinforces the attention
paid to this word when predicting the first word of the output. Consider the method’s
AST in Figure 23.1. Position 8 is the node for “send” indicated at area 2. In a 2-hop
GNN, this node will share information with its parent (name), grandparent (func-
tion), and sibling (guess). During training, the model learned that words associated
with the AST nodes “function” and “name” are likely candidates for the first word
of the summary, so the model knows to highlight this word.
in the same project. The approach is to obtain a dynamic call graph of the Android
program, which represents the actual runtime control flow from one subroutine to
the next. Then a subset of the subroutines in this call graph is selected using PageR-
ank – the idea is to emphasize subroutines that are called many times or hold
other importance measurable from the structure of the call graph (McMillan et al,
2011). The summaries from these subroutines are then appended to the initial sum-
mary.
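The PageRank selection step can be sketched with a small power-iteration implementation over a toy call graph (the subroutine names and graph below are illustrative, not from the cited study):

```python
import numpy as np

def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank over a call graph given as an adjacency
    matrix (adj[i, j] = 1 if subroutine i calls subroutine j)."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    # Dangling subroutines (no outgoing calls) link uniformly everywhere.
    P = np.where(out_deg > 0, adj / np.maximum(out_deg, 1), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * (P.T @ r)
    return r

# Toy call graph: main calls a, b, c; a, b, and c all call util.
#                 0=main, 1=a, 2=b, 3=c, 4=util
adj = np.zeros((5, 5))
adj[0, [1, 2, 3]] = 1
adj[[1, 2, 3], 4] = 1

ranks = pagerank(adj)
top = int(np.argmax(ranks))   # util is called most, so it ranks highest
```

Here util receives calls from three subroutines and so dominates the ranking; its summary would be among those appended to the initial summary.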
Aghamohammadi et al (2020)’s approach demonstrates an advantage to macro-
level information. The macro-level information is the dynamic call graph of the
entire program, and it is used to augment summaries created from the source code
itself. The summaries tend to be longer and to provide more contextual informa-
tion to readers. Recall sendGuess() in Example 23.1, for which ast-attendgru-gnn
wrote “sends a guess to the socket.” The approach by Aghamohammadi et al (2020)
may (hypothetically) find that the subroutine that calls sendGuess() is a mouse
click handler subroutine, and so would append, e.g., “called when the mouse is used
to click the button.” Human readers of documentation benefit from knowing how
subroutines are used, so summaries that include this macro-level information tend
to be considered more valuable by those readers (Holmes and Murphy, 2005; Ko
et al, 2006; McBurney and McMillan, 2016).
Macro-level representations of code for software mining tasks are likely fertile
ground for GNN-based technologies. The dynamic call graphs which Aghamoham-
madi et al (2020) extract contain information from actual runtime use, and a GNN
may serve as a useful tool in generating a representation of this information. Yet,
applications of GNNs to macro-level data for software mining tasks are still in their
infancy.
23.5 Summary
cases. An improvement based on an attentional GNN shows how much more com-
plex graphs can also be exploited for this purpose. Yet, these improve-
ments for code summarization likely herald improvements for many software min-
ing tasks. Both code representational and code generational tasks depend heavily on
understanding the nuances of the structure of that code, and GNNs are a likely avenue
for capturing this structure. This chapter has covered the history of this research, a
specific target problem, and recommendations for future researchers.
Editor's Notes: AI for Code has grown very rapidly in recent years. Computer
programs are much like a second language alongside human language, so it
is not surprising that the two share many attributes. We have therefore seen
both the NLP and software engineering communities pay a great deal of
attention to applying GNNs to their domain applications, with great success
in both. As with GNNs for NLP, the graph structure learning techniques in
Chapter 14, the GNN methods in Chapter 4, GNN scalability in Chapter 6,
heterogeneous GNNs in Chapter 16, and GNN robustness in Chapter 8 are
all highly important building blocks for developing effective and efficient
GNN approaches for code.
Figure 23.1: Abstract Syntax Tree for the function sendGuess() in Example 23.1.
Figure 23.2: High-level diagram of the model architecture for the 2-hop model. [Diagram
omitted in extraction; it depicts the source token sequence passing through a shared
src/AST embedding and a GRU, the AST nodes and edges passing through two ConvGNN
layers and a GRU, attention from the summary decoder GRU over both encodings, and a
context/output layer producing the summary. Labeled areas A–H are the reference points
used in the text.]
Figure 23.3: (left) Comparison of the BLEU-1 score for the subroutines where each
method performed best, to BLEU-1 score for the whole test set. (right) Percent of
test set for which each approach received the highest BLEU-1 score.
Abstract Drug discovery and development (D3) is an extremely expensive and time-
consuming process. It takes tens of years and billions of dollars to bring a drug suc-
cessfully to market from scratch, which makes the process highly inefficient when
facing emergencies such as COVID-19. At the same time, a huge amount of knowl-
edge and experience has been accumulated during the D3 process over the past
decades. This knowledge is usually encoded in guidelines or biomedical literature,
which provides an important resource containing insights that can inform the future
D3 process. A knowledge graph (KG) is an effective way of organizing the useful
information in that literature so that it can be retrieved efficiently. It also bridges the
heterogeneous biomedical concepts that are involved in the D3 process. In this
chapter we review existing biomedical KGs and introduce how GNN techniques can
facilitate the D3 process on such KGs. We also present two case studies on Parkinson's
disease and COVID-19, and point out future directions.
24.1 Introduction
Chang Su,
Department of Population Health Sciences, Weill Cornell Medicine, e-mail: chs4001@med.
cornell.edu
Yu Hou,
Department of Population Health Sciences, Weill Cornell Medicine, e-mail: yuh4001@med.
cornell.edu
Fei Wang,
Department of Population Health Sciences, Weill Cornell Medicine, e-mail: few2001@med.
cornell.edu
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 517
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_24
in massive biomedical literature and textbooks. This makes effective knowledge
organization and efficient knowledge retrieval challenging tasks. The knowledge
graph is a recently emerged concept aimed at achieving this goal. A knowledge graph
(KG) stores and represents knowledge by constructing a semantic network describ-
ing entities and the relationships between them. The basic elements comprising a
knowledge graph are a set of ⟨head, relation, tail⟩ triples, where the heads and tails
are concept entities and relations link these entities with semantic relationships. In
biomedicine, typical entities could be diseases, drugs, genes, etc., and the rela-
tionships could be treats, binds, interacts, etc. Large-scale biomedical KGs make
efficient knowledge retrieval and inference possible.
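As a minimal illustration of this triple structure and the kind of retrieval it supports (the entity and relation names below are generic placeholders, not drawn from any real BKG):

```python
# A tiny in-memory KG as a set of <head, relation, tail> triples.
# Entities and relations are illustrative placeholders only.
triples = {
    ("drug_A", "treats", "disease_X"),
    ("drug_A", "binds", "gene_G"),
    ("gene_G", "associates", "disease_Y"),
}

def tails(head, relation):
    """Retrieve all tail entities for a given head and relation,
    e.g., 'which diseases does drug_A treat?'"""
    return {t for (h, r, t) in triples if h == head and r == relation}

treated = tails("drug_A", "treats")
```

Real BKGs hold millions of such triples in graph databases, but the retrieval pattern (match on head and relation, return tails) is the same.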
Biomedical KG can effectively complement the biomedical data analytics pro-
cesses. In particular, many different types of biomedical data are heterogeneous and
noisy (Wang et al, 2019f; Wang and Preininger, 2019; Zhu et al, 2019e), which
makes data-driven models developed on these data unreliable in real prac-
tice. Biomedical KGs (BKGs) effectively encode the biomedical entities and their
semantic relationships, which can serve as “prior knowledge” to guide the down-
stream data-driven analytics procedure and improve the quality of the model. On
the other hand, we can also use BKGs to generate hypotheses (such as which drug
can be used to treat which disease), and get them validated in real world health data
(such as electronic health records).
In this chapter, we will review existing BKGs and present examples of how BKGs
can be used for generating drug repurposing hypotheses, and point out future direc-
tions.
This section surveys the existing BKGs that are publicly available and the ways of
BKG construction and curation (Table 24.3).
A common way of constructing a BKG is to extract and integrate data from
data resources, which are usually manually curated to summarize and organize the
biomedical knowledge derived from biological experiments, clinical trials, genome-
wide association analyses, clinical practice, etc. (Santos et al, 2020; Ioannidis et al,
2020; Himmelstein et al, 2017; Rizvi et al, 2019; Yu et al, 2019b; Zhu et al, 2020b;
Zeng et al, 2020b; Domingo-Fernández et al, 2020; Wang et al, 2020e; Percha and
Altman, 2018; Li et al, 2020d,b; Goodwin and Harabagiu, 2013; Rotmensch et al,
2017; Sun et al, 2020a). In Table 24.2, we summarize some public data resources
that have been commonly used in the construction of BKGs. For instance, the Compar-
ative Toxicogenomics Database (CTD) (Davis et al, 2019) is an open resource pro-
viding rich, manually curated chemical–gene, chemical–disease, and gene–disease
relational data, with the aim of advancing understanding of the impacts of environmental
exposures on human health. DrugBank (Wishart et al, 2018) is a database containing
information on approved drugs and drugs under trial, as well as pharmacoge-
nomic data (e.g., drug-target interactions). Ontology resources like the Gene Ontology
24 GNN-based Biomedical Knowledge Graph Mining in Drug Development 519
(Ashburner et al, 2000) and the Disease Ontology (Schriml et al, 2019) store func-
tional and semantic context for genes and diseases, respectively. By integrating data
from these rich resources, a number of BKGs have been constructed (Santos et al,
2020; Ioannidis et al, 2020; Himmelstein et al, 2017; Rizvi et al, 2019; Yu et al,
2019b; Zhu et al, 2020b; Zeng et al, 2020b; Domingo-Fernández et al, 2020; Wang
et al, 2020e). For example, Hetionet (Himmelstein et al, 2017), released in 2017, is
a well-curated BKG that integrates 29 publicly available biomedical databases. It
contains 47,031 biomedical entities of 11 types and over 2 million re-
lations of 24 types among those entities. Similar to Hetionet, the Drug Repurposing Knowledge
Graph (DRKG) (Ioannidis et al, 2020) was built by integrating data from six differ-
ent existing biomedical databases, containing about 100K entities of 13 types and
over 5 million relationships of 107 types. Zhu et al (2020b) constructed a drug-
centric BKG by systematically integrating multiple drug databases such as Drug-
Bank (Wishart et al, 2018) and PharmGKB (Whirl-Carrillo et al, 2012). Hetionet,
DRKG, and similar BKGs have been used to accelerate computational drug repurpos-
ing. PreMedKB (Yu et al, 2019b) includes information on diseases, genes, vari-
ants, and drugs by integrating relational data among them from existing resources.
By integrating multiple dietary-related databases, Rizvi et al (2019) built a BKG,
named the Dietary Supplements Knowledge Base (iDISK), which covers knowledge of
dietary supplements, including vitamins, herbs, minerals, etc. The Clinical Knowl-
edge Graph (CKG) (Santos et al, 2020) was constructed by integrating relevant exist-
ing biomedical databases such as DrugBank (Wishart et al, 2018), Disease Ontology
(Schriml et al, 2019), SIDER (Kuhn et al, 2016), etc. and knowledge extracted from
scientific literature. It contains over 16 million nodes and over 220 million relation-
ships. Compared to other BKGs, CKG has a finer granularity of knowledge as it
involves more entity types such as metabolite, modified protein, molecule function,
transcript, genetic variant, food, clinical variable, etc.
With the rapid development of biomedical research, a continuously increasing vol-
ume of biomedical articles is published every day. Manually extracting
knowledge from the literature for BKG curation is no longer sufficient to meet cur-
rent needs. To this end, efforts have been made to use text mining methods to ex-
tract biomedical knowledge from scientific literature to construct BKGs (Domingo-
Fernández et al, 2020; Wang et al, 2020e; Percha and Altman, 2018; Li et al,
2020d). For example, Sun et al (2020a) constructed a knowledge graph by extracting
biomedical entities and relationships from drug descriptions, medical dictionaries,
and literature to identify suspected cases of Fraud, Waste, and Abuse from claim
files. COVID-KG (Wang et al, 2020e) and COVID-19 Knowledge Graph (Domingo-
Fernández et al, 2020) were built by extracting COVID-19 specific knowledge from
biomedical literature. The resulting COVID-19 specific BKGs contain entities such
as diseases, chemicals, genes, and pathways, along with their relationships. KGHC
(Li et al, 2020d) is a BKG with a specific focus on hepatocellular carcinoma.
It was built by extracting knowledge from literature and contents on the internet,
as well as structured triples from SemMedDB (Kilicoglu et al, 2012). In addition,
some studies (Goodwin and Harabagiu, 2013; Li et al, 2020b; Rotmensch et al,
2017; Sun et al, 2020a) tried to build BKGs from clinical data such as electronic
health records (EHRs) and electronic medical records (EMRs). For instance, Rot-
mensch et al (2017) constructed a BKG by extracting disease-symptom associations
from EHR data using a data-driven approach. Li et al (2020b) proposed a sys-
tematic pipeline for extracting a BKG from large-scale EMR data. Compared to other
BKGs based on a triplet structure, the resulting KG is based on a quadruplet structure,
i.e., ⟨head, relation, tail, property⟩. Here the property includes information such as
the co-occurrence count, co-occurrence probability, specificity, and reliability of the
corresponding ⟨head, relation, tail⟩ triplet.
This subsection discusses KG inference techniques based on the novel GNN archi-
tectures.
Here $h_i^{(l+1)}$ is the embedding vector of entity $e_i$ at the $(l+1)$-th graph convolutional
layer, $R$ is the set of all relations, and $N_i^k$ is the set of neighbors of entity $e_i$ under rela-
tion $r_k$. The problem-specific normalization coefficient $c_{i,k}$ can be either learned or
pre-defined. Using softmax for each entity, R-GCN can be trained for entity clas-
sification. In link prediction, R-GCN is used as an encoder for learning embedding
vectors of the entities while the factorization model, DistMult, is used as the de-
coder to predict missing links in the KG based on the learned entity embeddings. It
resulted in a significantly improved performance compared to the baseline models
like DistMult and TransE.
Cai et al (2019) proposed the TransGCN, which combines the GCN architecture
with the translational distance models (e.g., TransE and RotatE) for link prediction
in KGs. Compared to R-GCN, TransGCN aims to address the link prediction task
without a task-specific decoder like R-GCN and learn both entity embeddings and
relation embeddings simultaneously. For each triplet $(e_i, r_k, e_j)$, TransGCN assumes
that $r_k$ is the transformation from the head $e_i$ to the tail $e_j$ in the embedding space.
Then it extends the GCN layer to update $e_i$'s embedding as
$$m_i^{(l+1)} = \frac{1}{c_i} W_0^{(l)} \Big( \sum_{(e_j, r_k, e_i) \in N_i^{(in)}} h_j^{(l)} \circ g_k^{(l)} + \sum_{(e_i, r_k, e_j) \in N_i^{(out)}} h_j^{(l)} \star g_k^{(l)} \Big) \qquad (24.2)$$
$$h_i^{(l+1)} = \sigma\big( m_i^{(l+1)} + h_i^{(l)} \big) \qquad (24.3)$$
where $\circ$ and $\star$ are transformation operators that can be defined based on the specific
translational mechanism used, $N_i^{(in)}$ and $N_i^{(out)}$ are the incoming and outgoing triplets
of $e_i$, respectively, and the normalization constant $c_i$ is defined by the total degree
of entity $e_i$. Meanwhile, the embedding of each relation $r_k$ is updated simply as
$g_k^{(l+1)} = \sigma(W_1^{(l)} g_k^{(l)})$. The authors engaged two translational mechanisms, TransE
and RotatE, and defined $\circ$, $\star$, and the scoring functions accordingly. Both result-
where $\alpha_k^{(l)}$ is the weight of relation $r_k$ at the $l$-th layer. The learned embeddings from
WGCN were then fed to a decoder, Conv-TransE, a CNN with TransE's translational
mechanism, for link prediction.
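Under the TransE instantiation of TransGCN described above, the composition operators in Eqs. 24.2-24.3 become addition for incoming triples (h_j + g_k) and subtraction for outgoing ones (h_j - g_k). A minimal NumPy sketch (random, untrained embeddings; tanh stands in for the unspecified nonlinearity, and the matrix application order is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n_ent, n_rel, d = 3, 2, 4
H = rng.normal(size=(n_ent, d))        # entity embeddings h_i
G = rng.normal(size=(n_rel, d))        # relation embeddings g_k
W0 = rng.normal(size=(d, d))           # layer weight W_0
triples = [(0, 0, 1), (1, 1, 2)]       # (head, relation, tail)

def transgcn_update(i):
    """One TransE-style TransGCN update for entity e_i (Eqs. 24.2-24.3)."""
    incoming = [H[h] + G[r] for (h, r, t) in triples if t == i]  # h_j o g_k
    outgoing = [H[t] - G[r] for (h, r, t) in triples if h == i]  # h_j * g_k
    c_i = len(incoming) + len(outgoing)        # total degree of e_i
    if c_i == 0:
        return H[i]
    m_i = (sum(incoming) + sum(outgoing)) @ W0 / c_i   # message m_i
    return np.tanh(m_i + H[i])                 # residual + nonlinearity

H_new = np.stack([transgcn_update(i) for i in range(n_ent)])
```

Relation embeddings would be updated in parallel by the simple linear-plus-nonlinearity rule quoted above; that step is omitted here for brevity.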
Graph attention network (GAT)-based architectures. A potential drawback
of the GCN architectures is that, for each entity, they treat the neighbors equally
to gather information. However, different neighboring entities, relations or triplets
may have different importance in indicating a specific entity, and the weights of
neighboring entities under the same relation may also be distinct. To address this,
GATs have been involved in KG inference problems. One of the early
efforts is the GATE-KG (i.e., graph attention-based embedding in KG) (Nathani
et al, 2019). It introduces an extended and generalized attention mechanism as the
encoder to produce the entity and relation embeddings while capturing the diverse
relation types in KG. For each triplet $(e_i, r_k, e_j)$, GATE-KG first produces a represen-
tation vector $c_{ijk}^{(l)}$ of this triplet; the attention weight of the triplet is then computed as
$$\alpha_{ijk}^{(l)} = \frac{\exp(\beta_{ijk}^{(l)})}{\sum_{j' \in N_i} \sum_{k' \in R_{ij'}} \exp(\beta_{ij'k'}^{(l)})} \qquad (24.7)$$
where $R_{ij}$ is the set of all relations between $e_i$ and $e_j$. By aggregating information
from neighbors according to different relations, entity $e_i$'s embedding vector $h_i^{(l+1)}$
at the $(l+1)$-th layer can be calculated as
$$h_i^{(l+1)} = \sigma\Big( \sum_{j \in N_i} \sum_{k \in R_{ij}} \alpha_{ijk}^{(l)} c_{ijk}^{(l)} \Big) \qquad (24.8)$$
24 GNN-based Biomedical Knowledge Graph Mining in Drug Development 527
In addition, by using the auxiliary relation between n-hop neighbors and itera-
tively accumulating information of n-hop neighbors at the n-th graph attention layer,
GATE-KG gives higher weights to the 1-hop neighbors and lower weights to the n-
hop neighbors. Hence it captures the multi-hop structural information of the KG.
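The softmax-then-aggregate pattern of Eqs. 24.7-24.8 can be sketched as follows (toy sizes and random scores; tanh stands in for the nonlinearity, and the unnormalized scores beta would come from a learned scoring network):

```python
import numpy as np

def attention_aggregate(beta, C):
    """Normalize per-triplet scores with softmax (Eq. 24.7) and use the
    weights to aggregate triplet representations into a new entity
    embedding (Eq. 24.8)."""
    w = np.exp(beta - beta.max())          # stable softmax numerator
    alpha = w / w.sum()                    # attention over e_i's triplets
    return np.tanh(alpha @ C)              # weighted sum + nonlinearity

rng = np.random.default_rng(4)
n_triplets, d = 5, 3                       # triplets incident to entity e_i
beta = rng.normal(size=n_triplets)         # unnormalized scores beta_ijk
C = rng.normal(size=(n_triplets, d))       # triplet representations c_ijk
h_i_next = attention_aggregate(beta, C)    # e_i's next-layer embedding
```

The same skeleton covers the GAT-based variants discussed here; they differ mainly in how the scores and triplet representations are produced.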
Relational Graph neural network with Hierarchical ATtention (RGHAT) (Zhang
et al, 2020i) is another GAT-based model to address link prediction in KGs. Specif-
ically, it engages a two-level attention mechanism. First, a relational-level attention
defines the weight of each relation $r_k$ indicating a specific entity $e_i$ as
$$\alpha_{ijk} = \frac{\exp(b_{ijk})}{\sum_{j' \in N_i} \sum_{k' \in R_{ij'}} \exp(b_{ij'k'})} \qquad (24.14)$$
KGAT then stacks multiple attentive embedding propagation layers to capture infor-
mation from the multi-hop neighbors of each entity; specifically, entity $e_i$'s embedding
at the $(l+1)$-th layer is $h_i^{(l+1)} = \sigma\big( h_i^{(l)}, h_{N_i}^{(l)} \big)$, where $h_{N_i}^{(l)} = \sum_{(e_i, r_k, e_j) \in N_i} \alpha_{ijk} h_j^{(l)}$.
Finally, a prediction layer concatenates the embeddings at each graph attention layer for
each entity to make predictions.
Generally, the drug repurposing procedure includes three major steps: hypothesis
generation, assessment, and validation (Pushpakom et al, 2019). Among them, the
first and foremost step is hypothesis generation. Typically, hypothesis generation
for drug repurposing aims at identifying candidate drugs that have a high confidence
of being associated with the therapeutic indication of interest. Today's widely available
BKGs, encoding a huge volume of biomedical knowledge, have become a valuable
resource for drug repurposing. In a KG, the hypothesis generation procedure can be
formulated as a link prediction problem, i.e., computational identification of poten-
tial drug-target or drug-disease associations with a high confidence level based on
existing knowledge (the KG's structural properties). This section introduces some pre-
liminary efforts in hypothesis generation for drug repurposing using computational
approaches on BKGs.
One of the previous efforts using computational inference in BKGs for drug repurposing
is Zhu et al.'s study (Zhu et al, 2020b). The main contributions of this study
are two-fold: 1) KG construction via data integration, and 2) building a KG-based
machine learning pipeline for drug repurposing.
First, by integrating six drug knowledge bases, including PharmGKB (Whirl-
Carrillo et al, 2012), TTD (Yang et al, 2016a), KEGG DRUG (Kanehisa et al,
2007), DrugBank (Wishart et al, 2018), SIDER (Kuhn et al, 2016), and DID (Sharp,
2017), they curated a drug-centric KG consisting of five entity types, including drugs,
diseases, genes, pathways, and side-effects, and nine relation types: drug-disease
TREATS, drug-drug INTERACTS, drug-gene REGULATES, BINDS, and
ASSOCIATES, drug-side effect CAUSES, gene-gene ASSOCIATES,
gene-disease ASSOCIATES, and gene-pathway PARTICIPATES.
Second, based on the drug-centric KG, a machine learning pipeline was built
for drug repurposing. Specifically, the target of the proposed model was to predict
the existence of a relation between a pair of drug and disease entities. In this
way, the task fell into the supervised classification setting where the input sam-
ples were the drug-disease pairs. To this end, representation for each sample (drug-
disease pair) was calculated in two ways: 1) meta-path-based representation and 2)
KG embedding-based representation. For meta-path-based representation, 99 possible
meta-paths between drugs and diseases with length 2-4 were enumerated,
such as $\text{Drug} \xrightarrow{\text{TREATS}} \text{Gene} \xrightarrow{\text{ASSOCIATES}} \text{Disease}$ and
$\text{Drug} \xrightarrow{\text{TREATS}} \text{Gene} \xrightarrow{\text{ASSOCIATES}} \text{Gene} \xrightarrow{\text{ASSOCIATES}} \text{Disease}$.
Then a 99-dimensional representation vector was calculated for a drug-disease pair,
of which each element indicates the connectivity measure between these two entities
based on a specific meta-path. In this study, four different connectivity measures were
used under a specific meta-path $\Phi$, including
• Path count, $PC_\Phi(e_{dr}, e_{di})$, the number of paths between drug $e_{dr}$ and disease $e_{di}$;
• Head normalized path count, $HNPC_\Phi = \frac{PC_\Phi(e_{dr}, e_{di})}{PC_\Phi(e_{dr}, *)}$;
• Tail normalized path count, $TNPC_\Phi = \frac{PC_\Phi(e_{dr}, e_{di})}{PC_\Phi(*, e_{di})}$;
• Normalized path count, $NPC_\Phi = \frac{PC_\Phi(e_{dr}, e_{di})}{PC_\Phi(e_{dr}, *) + PC_\Phi(*, e_{di})}$.
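These path-count measures can be computed directly from typed adjacency matrices: the path count of a meta-path is the product of the biadjacency matrices along it, and the normalized variants follow from row and column sums. A sketch on a tiny, made-up KG (2 drugs, 2 genes, 2 diseases):

```python
import numpy as np

def path_count(adjs):
    """Path-count matrix for a meta-path given as an ordered list of typed
    biadjacency matrices, e.g. [drug_gene, gene_disease] (sketch)."""
    M = adjs[0]
    for A in adjs[1:]:
        M = M @ A
    return M

# Hypothetical tiny KG: 2 drugs, 2 genes, 2 diseases
drug_gene = np.array([[1, 1],
                      [0, 1]])
gene_disease = np.array([[1, 0],
                         [1, 1]])
PC = path_count([drug_gene, gene_disease])  # PC[d, i] = #paths drug d -> disease i

row = PC.sum(axis=1, keepdims=True)         # PC(e_dr, *)
col = PC.sum(axis=0, keepdims=True)         # PC(*, e_di)
HNPC = PC / row                             # head normalized path count
TNPC = PC / col                             # tail normalized path count
NPC = PC / (row + col)                      # normalized path count
```

The matrix-product formulation counts all paths that follow the meta-path's relation types, which is exactly the path-count connectivity measure defined above.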
To address this, a positive and unlabeled (PU) learning framework (Elkan and Noto,
2008) was used. Decision tree, random forest, and support vector machine (SVM)
classifiers were each used as the base classifier of this PU learning framework. In this
study, drug-disease relations related to eight diseases were used as the testing set,
while the remaining drug-disease relations (positive) and 143,830 pairs associating
the eight diseases with other drugs (unlabeled) were used as the training set. Experimental
results showed that the KG-driven pipeline can achieve high prediction
performance on known diabetes mellitus treatments using only treatment information
of other diseases.
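The core step of the Elkan-Noto PU framework can be sketched as follows; it assumes a probabilistic classifier g has already been trained to separate labeled positives from unlabeled examples, and corrects its scores with the constant c = E[g(x) | x positive] estimated on held-out positives. This is a simplified sketch of the correction step, not the study's exact pipeline.

```python
import numpy as np

def elkan_noto_correct(scores_unlabeled, scores_heldout_positives):
    """Elkan & Noto (2008) correction (sketch).

    A classifier trained on positive-vs-unlabeled data estimates
    g(x) = p(labeled | x); with c = E[g(x) | x positive], the corrected
    positive-class probability is p(y = 1 | x) = g(x) / c.
    """
    c = float(np.mean(scores_heldout_positives))
    return np.clip(np.asarray(scores_unlabeled) / c, 0.0, 1.0)
```

Dividing by c rescales the scores so that unlabeled examples resembling the labeled positives receive probabilities close to 1.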
The sudden outbreak of the human coronavirus disease 2019 (COVID-19) has led
to a pandemic that heavily strikes the healthcare system and tremendously impacts
people's lives around the world. To date, many drugs have been under investigation to
treat COVID-19 at a tremendous cost of investment; however, very few COVID-19
antiviral medications have been approved. In this context, there is an urgent need for a
more efficient and effective way of developing drugs against the pandemic, and
computational drug repurposing can be a promising approach to address this.
Zeng et al.’s work (Zeng et al, 2020b) is a pioneer effort that computationally
repurposes antiviral medications in COVID-19 based on KG inference. First of all,
a comprehensive biomedical KG was constructed by integrating the two biomedi-
cal relational data resources, Global Network of Biomedical Relationships (GNBR)
(Percha and Altman, 2018) and DrugBank (Wishart et al, 2018), and experimen-
tally discovered COVID-gene relationships (Zhou et al, 2020f), resulting in a KG
consisting of 145,179 entities of four types (drugs, disease, genes, and drug side
information) and 15,018,067 relationships of 39 types. Secondly, a deep KG em-
bedding model, RotatE, was performed to learn low-dimensional representations
for the entities and relations. Using such learned embedding vectors, the top 100
drugs closest to the COVID-19 entity in the embedding space were prioritized
as candidate drugs. Using drugs in ongoing COVID-19 clinical trials
(https://covid19-trials.com/) as a validation set, the results achieved a
desirable performance with an area under the receiver operating characteristic curve
(AUROC) of 0.85. Moreover, gene set enrichment analysis (GSEA), which involved
transcriptome data from peripheral blood and Calu-3 cells, and proteome data from
Caco-2 cells, was performed to validate the candidate drugs. Finally, 41 drugs were
identified as potential repurposable candidates for COVID-19 therapy, 9 of which
are in ongoing COVID-19 trials. Among the 41 candidates, three types of
drugs were highlighted by the authors: 1) the Anti-Inflammatory Agents such as
dexamethasone, indomethacin, and melatonin; 2) the Selective Estrogen Receptor
Modulators (SERMs) such as clomifene, bazedoxifene, and toremifene; and 3) the
Antiparasitics including hydroxychloroquine and chloroquine phosphate.
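Once entity embeddings are learned, prioritizing candidates reduces to a nearest-neighbor query around the disease entity. The following sketch uses fabricated names and vectors and plain cosine similarity; the actual study learned RotatE embeddings and used its own proximity measure.

```python
import numpy as np

def top_candidates(entity_emb, names, target="COVID-19", k=3):
    """Rank all non-target entities by cosine similarity to the target
    entity in the learned embedding space (sketch)."""
    t = entity_emb[names.index(target)]
    sims = entity_emb @ t / (
        np.linalg.norm(entity_emb, axis=1) * np.linalg.norm(t) + 1e-12)
    order = [i for i in np.argsort(-sims) if names[i] != target]
    return [names[i] for i in order[:k]]
```

The returned list plays the role of the prioritized candidate-drug set that is then validated against ongoing clinical trials.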
Another work (Hsieh et al, 2020) focused on using GNNs on a KG to
address the drug repurposing problem. By extracting and integrating drug-target
interactions, pathways, gene/drug-phenotype interactions from CTD (Davis et al,
2019), a SARS-CoV-2 KG was built, which consists of 27 SARS-CoV-2 baits, 5,677
host genes, 3,635 drugs, and 1,285 phenotypes, as well as 330 virus-host protein-
protein interactions, 13,423 gene-gene sharing pathway interactions, 16,972 drug-
target interactions, 1,401 gene-phenotype associations, and 935 drug-phenotype
associations. Next, a variational graph autoencoder (Kipf and Welling, 2016), which
engages R-GCN (Schlichtkrull et al, 2018) as encoder, was used to learn entity em-
beddings in the SARS-CoV-2 KG. Since the SARS-CoV-2 KG has a specific focus
on COVID-19 related knowledge, some general yet meaningful biomedical knowl-
edge may be missing. To address this, a transfer learning framework was introduced.
Specifically, it first used entity embeddings of Zeng et al.’s work (Zeng et al, 2020b)
that encode general biomedical knowledge to initialize entity embeddings in SARS-
CoV-2 KG. Then the embeddings were fine-tuned in SARS-CoV-2 KG through the
proposed GNN. Using a customized neural network ranking model, the 300 drugs
most relevant to COVID-19 were selected as candidate drugs. Similar to
Zeng et al.'s work (Zeng et al, 2020b), the authors engaged GSEA, retrospective in-
vitro drug screening, and population-based treatment effect analysis in electronic
health records (EHRs), to further validate the repurposable candidates. Through
such a pipeline, 22 drugs were highlighted for potential COVID-19 treatment, in-
cluding Azithromycin, Atorvastatin, Aspirin, Acetaminophen, and Albuterol.
In summary, these studies shed light on the importance of KG-based computational
approaches in drug repurposing to fight complex diseases like COVID-19.
The reported good performance, in terms of the high overlap between the repurposed
candidate drug sets and the drugs in ongoing COVID-19 trials, not only demonstrated
the effectiveness of the KG-based techniques but also provided biological evidence
for the ongoing clinical trials. Moreover, they proposed feasible ways of using other
publicly available data to validate or refine the hypotheses derived from KGs,
thereby enhancing the usability of KG-based approaches.
KGs have been playing an increasingly important role in biomedicine. An increasing
number of KG-based machine learning and deep learning approaches have been
used in biomedical studies, such as hypothesis generation in computational drug development.
As one of the latest advances in artificial intelligence (AI), GNNs, which
have led to tremendous progress in image and text data mining (Kipf and Welling,
2017b; Hamilton et al, 2017b; Veličković et al, 2018), have been introduced to address
KG inference problems. In this context, the use of GNNs in biomedical KGs
has great potential to improve hypothesis generation in computational drug development.
However, there remain significant gaps between the novel technique and
the success of computational drug development. This section discusses the potential
complement for the biomedical KGs. Moreover, the computational methods such as
the KG embedding models (e.g., TransE and TransH) and the GNNs (e.g., R-GCN)
have been used for KG completion (Arora, 2020), which predicts missing relations
within a KG according to its structural properties.
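For instance, TransE scores a candidate triplet (h, r, t) by how well the relation vector translates the head embedding onto the tail embedding; a minimal sketch, with embeddings fabricated purely for illustration:

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility score (sketch): score = -||h + r - t||, so a
    score closer to 0 means the triplet is more plausible."""
    return -np.linalg.norm(h + r - t)
```

Ranking all candidate tails by this score for a fixed (head, relation) pair is the basic recipe for predicting missing relations in KG completion.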
Apart from KGs, there is an enormous volume of other biomedical data available,
such as clinical data and omics data, which are also promising resources for computational
drug repurposing.

Figure 24.1: Coupling biomedical KGs with other biomedical data resources for
improving computational drug development.

Clinical data are an important resource for healthcare and medical research, mainly
including EHR data, claims data, and clinical trial data. EHR data are routinely
collected during daily patient care, containing heterogeneous information about
patients, such as demographics, diagnoses, laboratory
test results, medications, and clinical notes. Such rich information makes it possible
to track patients' health condition changes, medication prescriptions, and clinical
outcomes. In addition, a tremendous volume of EHR data has been collected and
the volume is rapidly increasing, which largely strengthens the statistical power for
EHR-based analysis. For this reason, beyond its common usage such as diagnostic
and prognostic prediction (Xiao et al, 2018; Si et al, 2020; Su et al, 2020e,a), and
phenotyping (Chiu and Hripcsak, 2017; Weng et al, 2020; Su et al, 2020d, 2021),
EHR data has been used for computational drug repurposing (Hurle et al, 2013;
Pushpakom et al, 2019). For example, Wu et al (2019d) identified some non-cancer
drugs as repurposable candidates to treat cancer using EHRs; Gurwitz (Gurwitz,
2020) analyzed EHR data to repurpose drugs for treating COVID-19.
Driven by high-throughput sequencing techniques, an enormous volume
of omics data, including genomics, proteomics, transcriptomics, epigenomics, and
metabolomics, has been collected and made publicly available for analysis. Integrating
and analyzing the omics data enable us to derive new biomedical insights and better
understand human health and diseases at the molecular level (Subramanian et al,
2020; Nicora et al, 2020; Su et al, 2020b). Due to this wealth, omics data have also
been involved in computational drug development (Pantziarka and Meheus,
2018; Nicora et al, 2020; Issa et al, 2020). For example, via mining multiple omics
data, Zhang et al (2016c) identified 18 proteins as potential anti-Alzheimer's disease
(AD) targets and prioritized 7 repurposable drugs inhibiting the targets. Mokou
et al (2020) proposed a drug repurposing pipeline in bladder cancer based on pa-
tients’ omics (proteomics and transcriptomics) signature data.
In this context, combining KGs, clinical data, and multi-omics data and jointly
learning from them is a promising route to advance computational drug development (Fig.
24.1). The benefits of combining these data for inference can flow both ways. First,
computational models on clinical data and multi-omics data usually suffer from
data quality issues, such as noise and limited cohort sizes (especially for populations
of rare diseases), and from limited model interpretability. The incorporation of KGs
has been demonstrated to effectively address these issues and accelerate the clinical data
and omics data analysis. For example, Nelson et al (2019) linked EHR data with a
biomedical KG and learned a barcode vector for each specific cohort (e.g., the obese
cohort), which encodes both KG structure and EHR information and illustrates the
importance of each biomedical entity (e.g., genes, symptoms, and medications) in
indicating the cohort. Such cohort-specific barcode vectors further showed the ef-
fectiveness in link prediction (e.g., disease-gene associations prediction). Wang et al
(2017c) bridged patient EHR data with the BKG and extended the KG embedding
model for safe medicine recommendation, which comprehensively considered rel-
evant knowledge such as drug-drug interactions. In addition, Santos et al (2020)
developed an open platform that couples the CKG (i.e., Clinical Knowledge Graph)
with typical proteomics workflows. In this way, CKG facilitates the analysis and
interpretation of proteomics data. Second, the incorporation of clinical data and
omics data can potentially improve KG inference. Current KG-based drug repurpos-
ing studies have involved the clinical data and omics data (Zeng et al, 2020b; Hsieh
et al, 2020), which were typically used in an independent validation procedure to
validate/refine the generated new hypotheses (i.e., novel disease-drug associations).
Moreover, previous studies have showcased that leveraging clinical data (Rotmensch
et al, 2017; Chen et al, 2020e; Pan et al, 2020c) and omics data (Ramos et al,
2019) can derive new knowledge. Therefore, we believe that incorporating clinical
data and omics data in KG inference may largely reduce the impact of KG quality
issues, especially incompleteness. In summary, when designing next-generation
GNN models for drug repurposing, a promising direction is a feasible and flexible
architecture that can subtly harness KGs, clinical data, and multi-omics data to
recursively improve each other.
Editor's Notes: Drug hypothesis generation aims to use biological and clinical
knowledge to generate hypotheses about biomedical molecules. This knowledge
is effectively stored in the form of a knowledge graph (KG). The construction
of KGs is relevant to graph generation (Chapter 11) and some applications, such
as text mining (Chapter 21). Based on a KG, the hypothesis generation process
mainly involves graph representation learning (Chapter 2) and graph structure
learning (Chapter 14). It can also be formulated as a link prediction
(Chapter 10) problem to calculate the confidence level of candidate drugs.
The future direction of drug development focuses on scalability (Chapter
6) and interpretability (Chapter 7).
Table 24.3: Summary of existing BKGs.

Database | # Entities | Entity Types | # Relations | Relation Types | Focus | Formats | Source Type | URL
Clinical Knowledge Graph (Santos et al, 2020) | 16 million | 33 entity types, such as Drug, Gene, Disease | 220 million | 51 relation types, such as associate, has quantified protein | - | Neo4j | KG (Integration) | https://github.com/MannLabs/CKG
Drug Repurposing Knowledge Graph (Ioannidis et al, 2020) | 97,238 | 13 entity types, such as Compound, Disease | 5,874,261 | 107 relation types, such as interaction | - | TSV | KG (Integration) | https://github.com/gnn4dr/DRKG
Hetionet (Himmelstein et al, 2017) | 47,031 | 11 entity types, such as Disease, Gene, Compound | 2,250,197 | 24 relation types, such as treats, associates | - | Neo4j, TSV | KG (Integration) | https://het.io/
iDISK (Rizvi et al, 2019) | 144,059 | 6 entity types, such as Semantic Dietary Supplement Ingredient, Dietary Supplement Product, Disease | 708,164 | 6 relation types, such as has adverse reaction, is effective for | Dietary Supplements | Neo4j, RRF | KG (Integration) | https://conservancy.umn.edu/handle/11299/204783
PreMedKB (Yu et al, 2019b) | 404,904 | Drug, Variant, Gene, Disease | 496,689 | 52 relation types, such as cause, associate | Variant | - | KG (Integration) | http://www.fudan-pgx.org/premedkb/index.html#/home
Drug–Gene Interaction Database (Cotto et al, 2018) | 160,054 | Drug, Gene | 96,924 | - | Drug-Gene Interaction | TSV | KB | https://www.dgidb.org/
DISEASES (Pletscher-Frankild et al, 2015) | 22,216 | Disease, Gene | 543,405 | - | Disease-Gene Association | TSV | KB | https://diseases.jensenlab.org/
DisGeNET (Piñero et al, 2020) | 159,052 | Disease, Gene, Variant | 839,138 | Gene-Disease, Variant-Disease | Gene-Disease, Variant-Disease associations | TSV | KB | https://www.disgenet.org/home/
Global Network of Biomedical Relationships (Percha and Altman, 2018) | - | Chemical, Disease, Gene | 2,236,307 | 36 relation types, such as causal mutations, treatment | - | TXT | KB | https://zenodo.org/record/1035500
IntAct (Orchard et al, 2014) | 119,281 | Chemical, Gene | 1,130,596 | - | Molecular Interaction | TXT | KB | https://www.ebi.ac.uk/intact/
STRING (Szklarczyk et al, 2019) | 24,584,628 | Protein | 3,123,056,667 | Protein-Protein Interaction | Protein-Protein Interaction | TXT | KB | https://string-db.org/
SIDER (Kuhn et al, 2016) | 7,298 | Drug, Side-effect | 139,756 | Drug-Side effect | Medicines and their recorded adverse drug reactions | TSV | KB | http://sideeffects.embl.de/
SIGNOR (Licata et al, 2020) | 7,095 | 10 entity types, such as protein, chemical | 26,523 | - | Signaling information | TSV | KB | https://signor.uniroma2.it/
TISSUE (Palasca et al, 2018) | 26,260 | Tissue, Gene | 6,788,697 | Expression | Tissue-Gene Expression | TSV | KB | https://tissues.jensenlab.org/
Catalogue of Somatic Mutations in Cancer (Tate et al, 2019) | 12,339,359 | Mutation | - | - | Somatic Mutations in Cancer | TSV | Database | https://cancer.sanger.ac.uk/cosmic
ChEMBL (Mendez et al, 2019) | 1,940,733 | Molecule | - | - | Molecule | TXT | Database | https://www.ebi.ac.uk/chembl/
ChEBI (Hastings et al, 2016) | 155,342 | Molecule | - | - | Molecule | TXT | Database | https://www.ebi.ac.uk/chebi/init.do
DrugBank (Wishart et al, 2018) | 15,128 | Drug | 28,014 | Drug-Target, Drug-Enzyme, Drug-Carrier, Drug-Transporter | Drug | CSV | Database | https://go.drugbank.com/
Entrez Gene (Maglott et al, 2010) | 30,896,060 | Gene | - | - | Gene | TXT | Database | https://www.ncbi.nlm.nih.gov/gene/
HUGO Gene Nomenclature Committee (Braschi et al, 2017) | 41,439 | Gene | - | - | Gene | TXT | Database | https://www.genenames.org/
KEGG (Kanehisa and Goto, 2000) | 33,756,186 | Drug, Pathway, Gene, etc. | - | - | - | TXT | Database | https://www.kegg.jp/kegg/
PharmGKB (Whirl-Carrillo et al, 2012) | 43,112 | Genes, Variant, Drug/Chemical, Phenotype | 61,616 | - | - | TSV | Database | https://www.pharmgkb.org/
Reactome (Jassal et al, 2020) | 21,087 | Pathway | - | - | Pathway | TXT | Database | https://reactome.org/
Semantic MEDLINE Database (Kilicoglu et al, 2012) | - | - | 109,966,978 | Subject-Predicate-Object Triples | Semantic predications from the literature | CSV | Database | https://skr3.nlm.nih.gov/index.html
UniProt (Bateman et al, 2020) | 243,658 | Protein | - | - | Protein | XML, TXT | Database | https://www.uniprot.org/
Brenda Tissue Ontology (Gremse et al, 2010) | 6,478 | Tissue | - | - | Tissue | OWL | Ontology | http://www.brenda-enzymes.org
Disease Ontology (Schriml et al, 2019) | 10,648 | Disease | - | - | Disease | OWL | Ontology | https://disease-ontology.org/
Gene Ontology (Ashburner et al, 2000) | 44,085 | Gene | - | - | Gene | OWL | Ontology | http://geneontology.org/
Uberon (Mungall et al, 2012) | 14,944 | Anatomy | - | - | Anatomy | OWL | Ontology | http://uberon.github.io/publications.html
Chapter 25
Graph Neural Networks in Predicting Protein
Function and Interactions
Abstract Graph Neural Networks (GNNs) are becoming increasingly popular and
powerful tools in molecular modeling research due to their ability to operate over
non-Euclidean data, such as graphs. Because of their ability to both embed the inherent
structure and preserve the semantic information of a graph, GNNs are advancing
diverse molecular structure-function studies. In this chapter, we focus on GNN-
aided studies that bring together one or more protein-centric sources of data with
the goal of elucidating protein function. We provide a short survey on GNNs and
their most successful, recent variants designed to tackle the related problems of pre-
dicting the biological function and molecular interactions of protein molecules. We
review the latest methodological advances, discoveries, as well as open challenges
promising to spur further research.
Molecular biology is now reaping the benefits of big data, as rapidly advancing
high-throughput, automated wet-laboratory protocols have resulted in a vast amount
of biological sequence, expression, interactions, and structure data (Stark, 2006;
Zoete et al, 2011; Finn et al, 2013; Sterling and Irwin, 2015; Dana et al, 2018;
Doncheva et al, 2018). Since functional characterization has lagged behind, we now
have millions of protein products in databases for which no functional information
is readily available; that is, we do not know what many of the proteins in our cells
do (Gligorijevic et al, 2020).
Anowarul Kabir
Department of Computer Science, George Mason University, e-mail: [email protected]
Amarda Shehu
Department of Computer Science, George Mason University, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 541
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_25
542 Anowarul Kabir and Amarda Shehu
Answering the question of what function a protein molecule performs is key not
only to understanding our biology and protein-centric disorders, but also to advanc-
ing protein-targeted therapies. Hence, this question remains the driver of much wet-
and dry-laboratory research in molecular biology (Radivojac et al, 2013; Jiang et al,
2016). Answering it can take many forms, depending on the level of detail sought or
possible. The highest level of detail answers the question by directly exposing the
other molecules with which a target protein interacts in the cell, thus revealing what
a protein does by elucidating the molecular partners to which it binds.
In this brief survey, we focus on how graph neural networks (GNNs) are ad-
vancing our ability to answer this question in silico. This chapter is organized as
follows: First, a brief historical overview is provided, so that the reader understands
the evolution of ideas and data that have made possible the application of machine
learning to the problem of protein function prediction. Then, a brief overview of the
(shallow) models prior to GNNs is provided. The rest of the survey is devoted to
the GNN-based formulation of this question, a summary of state-of-the-art (SOTA)
GNN-based methods, with a few selected methods highlighted where relevant, and
an exposition of remaining challenges and potential ways forward via GNNs.
Historically, the earliest methods devised for protein function prediction related pro-
tein sequence similarity to protein function similarity. This led to important discov-
eries until remote homologs were identified, which are proteins with low sequence
similarity but highly similar three-dimensional/tertiary structure and function. So
methods evolved to utilize tertiary structure, but their applicability was limited, as
determination of tertiary structure was and remains a laborious process. Other meth-
ods utilized patterns in gene expression data to infer interacting proteins, based on
the insight that proteins interacting with one another need foremost to be expressed
in the cell at the same time.
With the development of high-throughput technologies, such as two-hybrid anal-
ysis for the yeast protein interactome (Ito et al, 2001), tandem-affinity purifi-
cation and mass spectrometry (TAP-MS) (Gavin et al, 2002) for characterizing
multi-protein complexes and protein-protein associations (Huang et al, 2016a),
high-throughput mass spectrometric protein complex identification (HMS-PCI) (Ho
et al, 2002), co-immunoprecipitation coupled to mass spectrometry (Foltman and
Sanchez-Diaz, 2016), protein-protein interaction (PPI) data suddenly became avail-
able, and in large amounts. PPI networks, with edges denoting interacting protein
nodes, of many species, such as human, yeast, mouse, and others, suddenly became
available to researchers. PPI networks, as small as a few nodes or as large as tens
of thousands of nodes, gave a boost to machine learning methods and improved the
performance of shallow models. Surveys such as Ref. (Shehu et al, 2016) provide a
detailed history of the evolution of protein function prediction methods as different
sources of wet-laboratory data became available to computational biologists.
25 Graph Neural Networks in Predicting Protein Function and Interactions 543
A natural question arises. If we have access to PPI data, then what else remains
to predict with regards to protein function? Despite significant progress, the reality
remains that there are many unmapped PPIs. This is formally known as the link pre-
diction problem. For various reasons, PPI networks are incomplete. They may entirely
miss information on a protein, or they may contain only incomplete information on a
protein. In particular, we now know that PPIs suffer from high type-I error, type-II
error, and low inclusion (Luo et al, 2015; Byron and Vestergaard, 2015). The to-
tal number of PPI links that are experimentally determined is still moderate (Han
et al, 2005). PPI data are inherently noisy as experimental methods often produce
false-positive results (Hashemifar et al, 2018). Therefore, predicting protein func-
tion computationally remains an essential task.
The problem of protein function prediction is often formulated as that of link
prediction, that is, predicting whether or not there exists a connection between two
nodes in a given PPI network. While link prediction methods connect proteins on
the basis of biological or network-based similarity, researchers report that inter-
acting proteins are not necessarily similar and similar proteins do not necessarily
interact (Kovács et al, 2019).
As indicated above, information on protein function can be provided at different
levels of detail. There are several widely-used protein function annotation schemes,
including the Gene Ontology (Lovell et al, 2003) (GO) Consortium, the Kyoto En-
cyclopedia of Genes and Genomes (Wang and Dunbrack, 2003) (KEGG), the En-
zyme Commission (Rhodes, 2010) (EC) numbers, the Human Phenotype Ontol-
ogy (Robinson et al, 2008), and others. It is beyond the scope of this paper to provide
an explanation of these ontologies. However, we emphasize that the most popular
one remains the GO annotation, which classifies proteins into hierarchically-related
functional classes organized into 3 different ontologies: Molecular Function (MF),
Biological Process (BP), and Cellular Component (CC), to describe different aspects
of protein functions. Systematic benchmarking efforts via the Critical Assessment
of Functional Annotation (CAFA) community-wide experiments (Radivojac et al,
2013; Jiang et al, 2016; Zhou et al, 2019b) and MouseFunc (Peña-Castillo et al,
2008) have been central to the automation of protein function annotation and rigor-
ous assessment of devised methodologies.
Many shallow machine learning approaches have been developed over the years.
Xue-Wen and Mei propose a domain-based random forest of decision trees to infer
protein interactions on the Saccharomyces cerevisiae dataset (Chen and Liu, 2005).
Shinsuke et al. apply multiple support vector machines (SVMs) for predicting interactions
between pairs of yeast proteins and pairs of human proteins by increasing
the number of negative pairs relative to positives (Dohkan et al, 2006). Fiona et al. assess
naïve Bayes (NB), multi-layer perceptron (MLP), and k-nearest neighbour (KNN)
methods on diverse, large-scale functional data to infer pairwise (PW) and module-
based (MB) interaction networks (Browne et al, 2007). PRED PPI provides a server
developed on SVM for predicting PPIs in five organisms, such as humans, yeast,
Drosophila, Escherichia coli, and Caenorhabditis elegans (Guo et al, 2010). Xiao-
tong and Xue-wen integrate features extracted from microarray expression measure-
ments, GO labels and orthologous scores, and apply a tree-augmented NB classifier
for human PPI predictions from model organisms (Lin and Chen, 2012). Zhu-Hong
et al. propose a multi-scale local descriptor feature representation scheme to ex-
tract features from a protein sequence and use random forest (You et al, 2015a).
Zhu-Hong et al. propose to apply SVM on a matrix-based representation of protein
sequence, which fully considers the sequence order and dipeptide information of the
protein primary sequence to detect PPIs (You et al, 2015b).
Although many advances were made by shallow models, as summarized in Ta-
ble 25.1, the problem of protein function prediction is still a long way from being
solved. Shallow machine learning methods depend greatly on feature extraction and
feature computation, which hinder performance. The task of feature engineering,
particularly when integrating different sources of data (sequence, expression, in-
teractions) is complex, laborious, and ultimately limited by human creativity and
domain-specific understanding of what may be determinants of protein function. In
particular, feature-based shallow models cannot fully incorporate the rich, local and
distal topological information present in one or more PPI networks. These reasons
have prompted researchers to investigate GNNs for protein function prediction.
This section first relates a general formulation of a GNN and forsakes detail in the
interest of space, assuming readers are already somewhat familiar with GNNs. The
rest of the section focuses on three task-specific formulations that allow leveraging
GNNs for protein function prediction.
25.1.4.1 Preliminaries
Following (Scarselli et al, 2008), let f and g be parametric functions that compute
the embedding and the output, respectively, for a single protein:

$$h_i = f(p_i, p_{e[i]}, p_{ne[i]}, h_{ne[i]}) \qquad (25.1)$$

$$o_i = g(h_i, p_i) \qquad (25.2)$$

where $p_i$, $p_{e[i]}$, $p_{ne[i]}$, and $h_{ne[i]}$ denote the feature representation of the i-th protein,
the features of all edges connected to the i-th protein, the features of the neighboring
proteins, and the embeddings of the neighboring proteins of the i-th protein, respectively.
Let us now consider |V| = n proteins. All proteins are represented as a matrix,
P ∈ R^{n×m}. The adjacency matrix A ∈ R^{n×n} encodes the connectivity of the proteins;
namely, A_{i,j} indicates whether there exists a link between proteins i and j.
Adding a self-loop to each protein yields the updated adjacency matrix Ã = A + I.
The diagonal degree matrix, D, can then be defined such that D_{i,i} = Σ_{j=1}^{n} Ã_{i,j}. From
there, one can compute the symmetric Laplacian matrix L = D − Ã. Finally, one can
then formulate the following iterative process:
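As a minimal illustration, the matrix constructions above (Ã, D, and L) can be computed in a few lines; the 4-protein toy network below is hypothetical:

```python
import numpy as np

# Toy PPI network: 4 proteins, undirected links (hypothetical adjacency).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)

A_tilde = A + np.eye(A.shape[0])      # self-loops: A~ = A + I
D = np.diag(A_tilde.sum(axis=1))      # degree diagonal matrix, D_ii = sum_j A~_ij
L = D - A_tilde                       # Laplacian L = D - A~
```

As a sanity check, every row of L sums to zero and L is symmetric, as expected for a graph Laplacian.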
Given two proteins, we want to predict whether there is a link between them, where a
probability p(A_{i,j}) ≈ 1 indicates an interaction with high confidence; conversely,
p(A_{i,j}) ≈ 0 indicates a low interaction confidence. The prediction of a link
between two given proteins can be set up as a binary classification problem. The
relations among nodes can be of several types; an edge of type r from node u to v can
be denoted u →_r v ∈ E, which yields a multi-relational link prediction problem.
Using GNNs, one can map graph nodes into a low-dimensional vector space
that preserves both the local graph structure and the dissimilarities among node
features. To address link prediction, one can employ a two-layer encoder-decoder
approach in which the model learns Z from equation 25.5:
In the following, we highlight three selected methods that exemplify SOTA tech-
niques and performance.
Liu et al (2019) apply a graph convolutional neural network (GCN) for PPI pre-
diction as a supervised binary classification task. Learned representations of two
proteins are fed to the model, and the model predicts the probability of interaction
between the proteins. The model first captures position-specific information inside
the PPI network and combines amino-acid sequence information to output final em-
beddings for each protein. The model encodes each amino acid as a one-hot vector
and employs a graph convolutional layer to learn a hidden representation from the
graph. To do that, Liu et al (2019) use the message passing protocol to update each
protein embedding by aggregating the original features and first-hop neighbors’ in-
formation, which is formulated as follows:
based SOTA methods. Additionally, the authors report achieving 95% accuracy on
yeast data under 93% sensitivity. Therefore, the extracted information from the PPI
graph suggests that a single graph convolutional layer is capable of extracting useful
information for the PPI prediction task.
Brockschmidt (2020) proposes a novel GNN variant using feature-wise linear
modulation (GNN-FiLM), originally introduced by Perez et al (2018) in the visual
question-answering domain, and evaluates it on three different tasks, including
node-level classification of PPI networks. The targeted application in this work is
the classification of proteins into known protein families or super-families, which
is of great importance in numerous application domains, such as precision drug
design. Typically, in GNN variants, information is passed from the source to the
target node considering the learned weights and the representation of the source
node. GNN-FiLM instead applies a hypernetwork, i.e., a neural network that computes
parameters for another network (Ha et al, 2017), in the graph setting, so that the
feature weights are learned dynamically based on the information that the target
node holds. Therefore, considering a learnable function g that computes the
parameters of the affine transformation, the update rule for the l-th layer is
defined as follows:
β_{r,v}^{(l)}, γ_{r,v}^{(l)} = g(h_v^{(l)}; θ_{g,r})      (25.8)

h_v^{(l+1)} = σ( Σ_{u →_r v ∈ E} γ_{r,v}^{(l)} ⊙ W_r h_u^{(l)} + β_{r,v}^{(l)} )      (25.9)
where g is implemented in practice as a single linear layer, β_{r,v}^{(l)} and
γ_{r,v}^{(l)} are the parameters of the message-passing operation computed by the
hypernetwork, and u →_r v indicates that a message is passed from u to v through a
type-r edge. In experiments, GNN-FiLM achieves a micro-averaged F1 score of 99%,
outperforming other variants when evaluated on protein classification tasks.
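A small sketch of the FiLM update in equations 25.8-25.9, assuming a single relation type, a toy graph, and tanh as the activation σ (all hypothetical choices; random weights stand in for learned ones):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # embedding size
n = 3                                   # number of nodes
H = rng.normal(size=(n, d))             # h_v^(l) for every node
W_r = rng.normal(size=(d, d))           # relation-specific weight (one relation r here)
Theta = rng.normal(size=(d, 2 * d))     # hypernetwork g: a single linear layer -> [beta, gamma]
edges = [(0, 2), (1, 2)]                # u -> v edges of type r

def film_update(v, in_edges):
    """Eqs. 25.8-25.9: target node v computes beta and gamma,
    then modulates the incoming messages feature-wise."""
    beta_gamma = H[v] @ Theta           # g(h_v; theta_g,r)
    beta, gamma = beta_gamma[:d], beta_gamma[d:]
    msg = sum(gamma * (W_r @ H[u]) + beta for u, _ in in_edges)
    return np.tanh(msg)                 # sigma = tanh (an assumption)

h2_new = film_update(2, [(u, v) for u, v in edges if v == 2])
```

Note that, unlike a plain relational GCN, the modulation parameters depend on the *target* node's current state, which is the defining feature of GNN-FiLM.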
Zitnik et al (2018) employ GCNs to predict polypharmacy side effects, which
emerge from drug-drug interactions when drug combinations are used in patients'
treatments. The problem can be formulated as a multi-relational link prediction
problem over multimodal graph-structured data. Specifically, Zitnik et al (2018)
consider two types of nodes, proteins and drugs, and construct the network from
protein-protein, protein-drug, and drug-drug interactions, where each polypharmacy
side effect corresponds to a different edge type.
More precisely, a relation of type r between two nodes (proteins or drugs), u and v, is
defined as (u, r, v) ∈ E. Here, a relation can be a side effect between two drugs, the
binding affinity of two proteins, or a relation between a protein and a drug. More
formally, given a drug pair (u, v), the task is to predict the likelihood of an edge
A_{u,v} = (u, r, v). For this purpose, they develop a non-linear, multi-layer graph
convolutional encoder, called Decagon, to compute the embeddings of each node from the
original node features. To update a node's representation, the authors transform and
aggregate its neighbors' representations under each relation type as follows:
25 Graph Neural Networks in Predicting Protein Function and Interactions 549
h_i^{(l+1)} = φ( Σ_r Σ_{j ∈ N_r^i} c_r^{i,j} W_r^{(l)} h_j^{(l)} + c_r^i h_i^{(l)} )      (25.10)

where φ denotes a non-linear activation function, h_i^{(l)} indicates the hidden state
of the i-th node at the l-th layer, W_r^{(l)} is the relation-type-specific learnable
parameter matrix, j ∈ N_r^i are the neighboring nodes of i under relation r, and
c_r^{i,j} = 1/√(|N_r^i| |N_r^j|) and c_r^i = 1/√|N_r^i| are the
normalization constants. Finally, a tensor factorization model is used to predict the
polypharmacy side effects using these embeddings. The probability of a link of type
r between nodes u and v is defined as:
x_r^{u,v} = σ(g(u, r, v))      (25.11)
where σ is the sigmoid function and g is defined as follows:
g(u, r, v) = { z_u^T D_r R D_r z_v,   if u and v are both drug nodes
             { z_u^T M_r z_v,         if u or v is not a drug node      (25.12)
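A sketch of the decoder in equations 25.11-25.12, with randomly initialized embeddings and factor matrices standing in for the learned ones:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
z_u, z_v = rng.normal(size=d), rng.normal(size=d)   # node embeddings from the encoder

R = rng.normal(size=(d, d))          # global drug-drug interaction matrix
D_r = np.diag(rng.normal(size=d))    # diagonal, side-effect-specific factor
M_r = rng.normal(size=(d, d))        # relation-specific matrix for non drug-drug edges

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decagon_score(z_u, z_v, drug_drug):
    # Eq. 25.12: the bilinear form depends on whether both endpoints are drugs
    g = z_u @ D_r @ R @ D_r @ z_v if drug_drug else z_u @ M_r @ z_v
    return sigmoid(g)                # Eq. 25.11: probability of a type-r link

p = decagon_score(z_u, z_v, drug_drug=True)
```

Sharing R across all side-effect types while keeping D_r per-type is what keeps the number of decoder parameters manageable when there are hundreds of side effects.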
vojac et al, 2013; Jiang et al, 2016; Zhou et al, 2019b) and MouseFunc (Peña-
Castillo et al, 2008).
Many computational methods have been developed to analyze protein-function
relationships. Traditional machine learning approaches, such as SVMs (Guan
et al, 2008; Wass et al, 2012; Cozzetto et al, 2016), heuristic-based methods (Schug,
2002), high-dimensional statistical methods (Koo and Bonneau, 2018), and hierarchical
supervised clustering methods (Das et al, 2015), have been extensively studied for
AFP tasks; these studies found that integrating several features, such as gene and
protein networks or structure, outperforms sequence-based features alone. However,
these traditional approaches rely strongly on hand-engineered features.
Deep learning methods have become prevalent. For example, DeepSite (Jiménez
et al, 2017), Torng and Altman (2018), and EnzyNet (Amidi et al, 2018) apply
3D convolutional neural networks (CNNs) for feature extraction and prediction
from protein structure data. However, storing the high-resolution 3D representation
of protein structure and applying 3D convolutions over the representation
is inefficient (Gligorijevic et al, 2020). Very recently, GCNs (Kipf and Welling,
2017b) (Henaff et al, 2015; Bronstein et al, 2017) have been shown to general-
ize convolutional operations on graph-like molecular representations and overcome
these limitations.
In particular, Ioannidis et al (2019) adapt the graph residual neural network
(GRNN) approach for a semi-supervised learning task over multi-relational PPI
graphs to address AFP. The authors formulate a multi-relational connectivity graph
as an n × n × I tensor S, where S_{n,n′,i} captures the edge between proteins v_n and
v_{n′} for the i-th relation. The n proteins are encoded in a feature matrix X ∈ R^{n×f},
where each protein is represented as an f × 1 feature vector. Furthermore, a label
matrix Y ∈ Rn×k encodes the k labels. Subsets of proteins are associated with true
labels, and the task is to predict the labels of proteins with unavailable labels. The
neighborhood aggregation for the n-th protein and the i-th relation at the l-th layer
is defined by the following formula:
H_{n,i}^{(l)} = Σ_{n′ ∈ N_n^{(i)}} S_{n,n′,i} Ž_{n′,i}^{(l−1)}      (25.13)

where n′ ranges over the neighboring nodes of the n-th protein, and Ž_{n′,i}^{(l−1)}
denotes the feature vector of the n′-th protein in the i-th relation at the (l−1)-th layer.
Neighboring nodes are defined as one-hop only, which essentially incorporates one-
hop diffusion. However, successive operations eventually spread the information
across the network. To handle multi-relational graphs, the authors combine H_{n,i}^{(l)}
across the relations i as follows:
G_{n,i}^{(l)} = Σ_{i′=1}^{I} R_{i,i′}^{(l)} H_{n,i′}^{(l)}      (25.14)

where R_{i,i′}^{(l)} is a learnable parameter. Then, a linear operation mixes the extracted
features as follows:
Z^{(l)} = G_{n,i}^{(l)T} W_{n,i}^{(l)}      (25.15)
where W_{n,i}^{(l)} is the learnable parameter. In summary, the neighborhood convolution
and propagation step can be written as:
Z^{(l)} = f(Z^{(l−1)}; θ_z^{(l)})      (25.16)
where θ_z^{(l)} comprises the two weight matrices, W and R, which linearly combine
the information of neighboring nodes and the multi-relational information, respec-
tively. Moreover, the authors incorporate a residual connection to diffuse the input, X,
across L-hop neighborhoods and capture multi-type diffusion; that is:
Z^{(l)} = f(Z^{(l−1)}; θ_z^{(l)}) + f(X; θ_x^{(l)})      (25.17)
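The per-layer computation of equations 25.13-25.17 can be sketched with random tensors standing in for the learned parameters (the toy sizes, and the use of the identity as f's non-linearity, are hypothetical simplifications):

```python
import numpy as np

rng = np.random.default_rng(2)
n, f, I = 5, 3, 2                     # proteins, feature size, relations
S = rng.integers(0, 2, size=(n, n, I)).astype(float)   # multi-relational tensor
X = rng.normal(size=(n, f))           # protein features
R = rng.normal(size=(I, I))           # relation-mixing weights (Eq. 25.14)
W = rng.normal(size=(f, f))           # feature-mixing weights (Eq. 25.15)
Wx = rng.normal(size=(f, f))          # residual-branch weights (Eq. 25.17)

def grnn_layer(Z_prev):
    """One propagation step: per-relation neighborhood aggregation,
    relation mixing, feature mixing, plus a residual branch from X."""
    H = np.einsum('nmi,mif->nif', S, Z_prev)   # Eq. 25.13 for all n, i at once
    G = np.einsum('ij,njf->nif', R, H)         # Eq. 25.14
    Z = G @ W + X[:, None, :] @ Wx             # Eqs. 25.15-25.17 (residual)
    return Z

Z0 = np.repeat(X[:, None, :], I, axis=1)       # initial per-relation states
Z1 = grnn_layer(Z0)
```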
A softmax classification layer is used for the final prediction. The authors apply
this model on three multi-relational networks, comprising generic, brain, and circu-
lation cells. The model is shown to perform better than general graph convolutional
neural networks.
Recently, Gligorijevic et al (2020) propose DeepFRI, a GCN-based method for func-
tionally annotating protein sequences and structures; DeepFRI outputs a probabil-
ity for each function. A Long Short-Term Memory language model (LSTM-
LM) (Graves, 2013) is pretrained on around 10 million protein sequences from the
protein family database (Pfam) (Finn et al, 2013) to extract residue-level, position-
context features. The following equation is used:
DeepFRI is compared with existing baseline models, including CAFA-like BLAST
(Wass et al, 2012) and the CNN-based, sequence-only DeepGOPlus (Kulmanov and
Hoehndorf, 2019), on each sub-ontology of GO terms and EC numbers, and it
outperforms them in every category.
Zhou et al (2020b) apply a GCN model, DeepGOA, to predict maize protein
functions. The authors exploit both GO structure information and protein sequence
information for a multi-label classification task. Since GO organizes the functional
annotation terms into a directed acyclic graph (DAG), the authors utilize the knowl-
edge encoded in the GO hierarchy. First, amino acids of a protein are encoded into
one-hot encodings, a 21-dimensional feature vector for each amino acid, as there are
20 amino acids plus occasional undetermined amino acids in a protein. Proteins
differ in length; therefore, the authors extract only the first 2000 amino acids of
proteins longer than that, and zero-pad the encodings of shorter ones. The i-th
protein is thus represented as
where h is the sliding window length, w ∈ R21×h is a convolutional kernel, and f (·)
is a non-linear activation function. Then, the authors incorporate the GO structure
into the model. To do that, graph convolutional layers are deployed to generate the
embeddings of the GO terms by propagating information among GO terms using
neighboring terms in the GO hierarchy. For τ GO terms, an initial one-hot
feature description, H^0 ∈ R^{τ×τ}, and a correlation matrix, A ∈ R^{τ×τ}, are computed as
input. The l-th layer's representation, H^l, is updated using the following neigh-
borhood-information propagation equation:
Ŷ = H Z^T      (25.23)
A multi-label cross-entropy loss is used to train the model
end-to-end. The authors experiment on the Maize PH207 inbred line (Hirsch et al,
2016) and the human protein sequence dataset and show that DeepGOA outperforms
SOTA methods.
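The fixed-length sequence encoding described above can be sketched as follows; the alphabet ordering and the helper name `encode_protein` are our own illustrative choices, not from the paper:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residues
ALPHABET = AMINO_ACIDS + "X"           # "X" for undetermined residues -> 21 dims
MAX_LEN = 2000

def encode_protein(seq, max_len=MAX_LEN):
    """One-hot encode a protein sequence as a (max_len, 21) matrix:
    truncate sequences longer than max_len, zero-pad shorter ones."""
    seq = seq[:max_len]
    onehot = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    for pos, aa in enumerate(seq):
        onehot[pos, ALPHABET.index(aa if aa in ALPHABET else "X")] = 1.0
    return onehot

P = encode_protein("MKTAYIAKQR")       # a short hypothetical sequence
```

Each row of the resulting matrix has at most one non-zero entry, and rows past the sequence length stay all-zero, matching the zero-padding scheme.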
and neighborhood information aggregation operations. The GCN update rule (Gilmer
et al, 2017) is followed for the i-th protein's representation, h_i^l, at the l-th layer as
follows:
h_i^l = Σ_{j ∈ N(i) ∪ {i}} ( 1 / √(deg(i) · deg(j)) ) W h_j^{l−1}      (25.36)
The GraphSAGE (Hamilton et al, 2017b) update rule is then deployed:
h_i^l = W_1 h_i^{l−1} + W_2 · Mean_{j ∈ N(i) ∪ {i}} h_j^{l−1}      (25.37)
Additionally, the authors employ the GraphConv (Morris et al, 2020b) operator:
h_i^l = W_1 h_i^{l−1} + Σ_{j ∈ N(i)} W_2 h_j^{l−1}      (25.38)
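The three update rules (equations 25.36-25.38) can be written compactly in matrix form; the toy graph and random weights below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 4, 3
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(n, d))           # h^{l-1} for all nodes
W = rng.normal(size=(d, d))
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))

A_hat = A + np.eye(n)                 # N(i) ∪ {i}
deg = A_hat.sum(axis=1)

# Eq. 25.36 (GCN): symmetric degree normalization
norm = A_hat / np.sqrt(np.outer(deg, deg))
H_gcn = norm @ H @ W

# Eq. 25.37 (GraphSAGE): self transform + transformed neighborhood mean
H_sage = H @ W1 + (A_hat / deg[:, None]) @ H @ W2

# Eq. 25.38 (GraphConv): self transform + transformed neighbor sum (no self-loop)
H_gconv = H @ W1 + A @ H @ W2
```

The key difference is the normalization: GCN rescales by both endpoint degrees, GraphSAGE averages the neighborhood, and GraphConv sums it unnormalized.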
 = Sigmoid(ZZ T ) (25.40)
The cross-entropy loss between A and Â is minimized with gradient descent to update
the weights. Finally, the embeddings Z are used to predict the class Y, i.e., the
missing links in the adjacency matrix and thus in the graph.
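A sketch of the inner-product decoder (equation 25.40) and the cross-entropy reconstruction loss, with random embeddings standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 4, 2
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Z = rng.normal(size=(n, d))            # node embeddings from the encoder

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

A_hat = sigmoid(Z @ Z.T)               # Eq. 25.40: reconstructed adjacency
eps = 1e-9                             # avoid log(0)
bce = -np.mean(A * np.log(A_hat + eps) + (1 - A) * np.log(1 - A_hat + eps))
```

Because Â = sigmoid(ZZ^T) is symmetric by construction, the decoder predicts undirected links; entries of Â with no counterpart in A are the candidate missing links.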
As this survey indicates, many variants of GNNs have been applied to obtain infor-
mation on protein function. Much work remains to be done. Future directions can
be broadly divided into two categories, methodology-oriented and task-oriented.
Many existing GNN-based approaches are limited to proteins of the same size
(number of amino acids). This essentially weakens model capacity for the particular
task at hand. Therefore, future research needs to focus on size-agnostic, as well as
task-agnostic models. Choosing the right model is always a difficult task. However,
benchmark datasets and available packages are making it easier to develop models
expediently.
Enhancing model explainability is also an important direction. Some community
bias has been observed towards focusing model development on GCNs for learn-
ing semantic and topological information for the function prediction task. However,
there are many other variants of GNNs. For instance, graph attention networks may
556 Anowarul Kabir and Amarda Shehu
prove useful. Existing literature also often ignores ablation studies, which are impor-
tant to provide a strong rationale for choosing a particular component of the model
over others.
Most of the PPI prediction tasks assume training a single model for an organism.
Leveraging multi-organism PPI networks provides more data and may result in
better performance. In the same spirit, leveraging multi-omics data combined with
sequence and structural data may advance the state of the art.
Finally, we draw attention to the site-specific function prediction task, which
provides more information and highlights specific residues that are important for
a particular function. This fine-grained function prediction task can be even more
critical to support other tasks, such as drug design. Transfer learning across related
tasks may additionally provide insights for learning important attributes.
This work is supported in part by National Science Foundation Grant No.
1907805 and Grant No. 1763233. This material is additionally based upon work
by AS supported by (while serving at) the National Science Foundation. Any opin-
ions, findings, and conclusions or recommendations expressed in this material are
those of the author and do not necessarily reflect the views of the National Science
Foundation.
Abstract Anomaly detection is an important task, which tackles the problem of dis-
covering “different from normal” signals or patterns by analyzing a massive amount
of data, thereby identifying and preventing major faults. Anomaly detection is ap-
plied to numerous high-impact applications in areas such as cyber-security, finance,
e-commerce, social network, industrial monitoring, and many more mission-critical
tasks. While multiple techniques have been developed in past decades in address-
ing unstructured collections of multi-dimensional data, graph-structure-aware tech-
niques have recently attracted considerable attention. A number of novel techniques
have been developed for anomaly detection by leveraging the graph structure. Re-
cently, graph neural networks (GNNs), as a powerful deep-learning-based graph rep-
resentation technique, has demonstrated superiority in leveraging the graph structure
and been used in anomaly detection. In this chapter, we provide a general, compre-
hensive, and structured overview of the existing works that apply GNNs in anomaly
detection.
26.1 Introduction
In the era of machine learning, what stands out in the data is sometimes more
important and interesting than what is normal. This branch of tasks is called anomaly
detection, which concentrates on discovering “different from normal” signals or
patterns by analyzing a massive amount of data, thereby identifying and prevent-
ing major faults. This task plays a key role in several high-impact domains, such as
cyber-security (network intrusion or network failure detection, malicious program
Shen Wang
Department of Computer Science, University of Illinois at Chicago, e-mail: swang224@uic.
edu
Philip S. Yu
Department of Computer Science, University of Illinois at Chicago, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 557
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_26
558 Shen Wang, Philip S. Yu
detection), finance (credit card fraud detection, malicious account detection, cashout
user detection, loan fraud detection), e-commerce (reviews spam detection), social
network (key player detection, anomaly user detection, real money trading detec-
tion), and industrial monitoring (fault detection).
In the past decades, many techniques have been developed for anomaly detec-
tion by leveraging the graph structure, a.k.a. graph-based anomaly detection. Unlike
non-graph anomaly detection, these techniques further take the inter-dependencies
among data instances into consideration, since data instances in a wide range of
disciplines, such as physics, biology, social sciences, and information systems, are
inherently related to one another. Compared to non-graph-based methods, the
performance of graph-based methods is greatly improved. Here, we provide an
illustrative example of malicious program detection in the cyber-security domain in
Figure 26.1.
In a phishing email attack as shown in Figure 26.1, to steal sensitive data from the
database of a computer/server, the attacker exploits a known vulnerability of Mi-
crosoft Office by sending a phishing email with a malicious .doc file attached to
one of the IT staff of the enterprise. When the IT staff member opens the attached
.doc file through the browser, a piece of malicious macro code is triggered. This ma-
licious macro creates and executes a malware executable, which pretends to be an
open-source Java runtime (Java.exe). This malware then opens a backdoor to the ad-
versary, subsequently allowing the adversary to read and dump data from the target
database via the affected computer. In this case, signature-based or behavior-based
malware detection approaches generally do not work well in detecting the mali-
cious program in our example. As the adversary can make the malicious program
from scratch with binary obfuscation, signature-based approaches would fail due
to the lack of known malicious signatures. Behavior-based approaches may not be
effective unless the malware sample has previously been used to train the detection
model. It might be possible to detect the malicious program using existing host-
level anomaly detection techniques. These host-based anomaly detection methods
can locally extract patterns from process events as the discriminators of abnormal
behavior. However, such detection is based on observations of single operations,
and it sacrifices the false positive rate to detect the malicious program. For exam-
ple, the host-level anomaly detection can detect the fake “Java.exe” by capturing the
database read. However, a Java-based SQL client may also exhibit the same opera-
tion. If we simply detect the database read, we may also classify normal Java-based
SQL clients as abnormal program instances and generate false positives. In the en-
terprise environment, too many false positives can lead to the alert fatigue problem,
causing cyber-analysts to fail to catch up with attacks. To accurately separate the
database read of the malicious Java from the real Java instances, we need to con-
sider the higher semantic-level context of the two Java instances. As shown in Figure
26.1, the malicious Java is a very simple program that directly accesses the database. On
the contrary, a real Java instance has to load a set of .DLL files in addition to the
database read. By comparing the behavior graph of the fake Java instance with the
normal ones, we can find that it is abnormal and precisely report it as a malicious
program instance. Thus, leveraging the graph helps to identify the anomaly data
instances.
26 Graph Neural Networks in Anomaly Detection 559
26.2 Issues
In this section, we provide a brief discussion and summary of the issues in GNN-
based anomaly detection. In particular, we group them into three: (i) data-specific
issues, (ii) task-specific issues, and (iii) model-specific issues.
As anomaly detection systems usually work with real-world data, the data demon-
strates high volume, high dimensionality, high heterogeneity, high complexity, and
dynamic properties.
High Volume – With the advance of information storage, it is much easier to
collect large amounts of data. For example, in an e-commerce platform like Xianyu,
there are over 1 billion second-hand goods published by over ten million users;
in an enterprise network monitoring system, the system event data collected from
a single computer system in one day can easily reach 20 GB, and the number of
events related to one specific program can easily reach thousands. It is prohibitively
expensive to perform the analytic task on such massive data in terms of both time
and space.
High Dimensionality – Benefiting likewise from advances in information stor-
age, a rich amount of information is collected, which results in high dimensionality of
the attributes for each data instance. For example, in an e-commerce platform like
Xianyu, different types of attributes are collected for each data instance, such as
user demographics, interests, roles, as well as different types of relations; in an en-
terprise network monitoring system, each collected system event is associated with
hundreds of attributes, including information of involved system entities and their
relationships, which causes the curse of dimensionality.
High Heterogeneity – As rich types of information are collected, the attributes
of each data instance become highly heterogeneous: the feature of each data
instance can be multi-view or multi-sourced. For example, in an e-commerce plat-
form like Xianyu, multiple types of data are collected from the user, such as personal
profile, purchase history, explore history, and so on. Nevertheless, multi-view data
like social relations and user attributes have different statistical properties. Such het-
erogeneity poses a great challenge to integrate multi-view data.
High Complexity – As we can collect more and more information, the collected
data is complex in content: it can be categorical or numerical, which increases the
difficulty of leveraging all the contents jointly.
Dynamic Property – The data collection is usually conducted every day or con-
tinuously. For example, billions of credit card transactions are performed every day;
billions of click-through traces of web users are generated each day. This kind of
data can be thought of as streaming data, and it demonstrates dynamic property.
The above data-specific issues are general and apply to all kinds of data. So
we also need to discuss the graph-data-specific issues, including relational prop-
Due to the unique characteristics of the anomaly detection task, issues also arise
from the problem itself, including label quantity and quality, class imbalance and
asymmetric error, and novel anomalies.
Labels Quantity and Quality – The major issue of anomaly detection is that the
data often has no or very few class labels. It is unknown which data is abnormal
or normal. Usually, it is costly and time-consuming to obtain ground-truth labels
from the domain expert. Moreover, due to the complexity of the data, the produced
label may be noisy and biased. Therefore, this issue limits the performance of
supervised machine learning algorithms. What is more, the lack of true, clean labels,
i.e., ground-truth data, also makes the evaluation of anomaly detection techniques
challenging.
Class Imbalance and Asymmetric Error – Since anomalies are rare and only
a small fraction of the data is expected to be abnormal, the data is extremely imbal-
anced. Moreover, the cost of mislabeling a good data instance versus a bad instance
may change depending on the application and can be hard to estimate
beforehand. For example, mis-predicting a cash-out fraudster as a normal user is es-
sentially harmful to the whole financial system or even national security, while
mis-predicting a normal user as a fraudster could cause a loss of customer fidelity.
Therefore, class imbalance and asymmetric error seriously affect machine-learning-
based methods.
Novel Anomalies – In some domains, such as fraud detection or malware detec-
tion, the anomalies are created by humans. They are crafted by analyzing the
detection system and designed to be disguised as normal instances to bypass
detection. As a result, not only should the algorithms be adaptive to changing and
growing data over time, they should also be able to detect novel anomalies in the
face of adversaries.
Apart from data-specific and task-specific issues, it is also challenging to apply
graph neural networks directly to anomaly detection tasks due to their unique model
properties, such as a homogeneous focus and vulnerability.
Homogeneous Focus – Most graph neural network models are designed for ho-
mogeneous graphs, which consider a single type of nodes and edges. In many real-
world applications, however, data is naturally represented as heterogeneous graphs.
Traditional GNNs treat different features equally: all the features are mapped
and propagated together to get the representations of nodes. Considering that the
role of each node is just a one-dimensional feature in the high-dimensional feature
space, there exist many more features that are unrelated to the role, e.g., age, gender,
and education. Thus the representations of applicants with neighbors of different
roles have no distinction in representation space after neighbor aggregation, which
causes traditional GNNs to fail.
Vulnerability – Recent theoretical studies prove the limitations and vulnerabil-
ities of GNNs when graphs contain noisy nodes and edges. A small change
to the node features may therefore cause a dramatic performance drop and a failure
to tackle camouflage, where fraudsters sabotage GNN-based fraud detectors.
26.3 Pipeline
In this section, we introduce the standard pipeline of the GNN-based anomaly detec-
tion. Typically, GNN-based anomaly detection methods consist of three important
components, including graph construction and transformation, graph representation
learning, and prediction.
one defining a unique relationship between two entities. The multi-channel graph
is a graph with each channel constructed via a certain type of meta-path. Formally,
given a heterogeneous graph G with a set of meta-paths M = {M_1, ..., M_{|M|}}, the
transformed multi-channel network Ĝ is defined as:

Ĝ = {G_i | G_i = (V_i, E_i, A_i), i = 1, 2, ..., |M|}

where E_i denotes the homogeneous links between the entities in V_i, which are con-
nected through the meta-path M_i. Each channel graph G_i is associated with an adja-
cency matrix A_i, and |M| indicates the number of meta-paths. Notice that the potential
meta-paths induced from a heterogeneous network can be infinite, but not every
one of them is relevant and useful for the specific task of interest. Fortunately, some
algorithms (Chen and Sun, 2017) have recently been proposed for automatically
selecting the meta-paths for particular tasks.
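As an illustration of this transformation, a channel adjacency for one hypothetical meta-path ("two accounts share a device") can be built from a biadjacency matrix:

```python
import numpy as np

# Heterogeneous toy graph: 3 accounts (U), 2 devices (D).
# Biadjacency B[u, d] = 1 if account u used device d (hypothetical data).
B = np.array([[1, 0],
              [1, 1],
              [0, 1]], dtype=float)

# Channel graph for meta-path U -> D -> U ("two accounts share a device"):
A_udu = (B @ B.T > 0).astype(float)
np.fill_diagonal(A_udu, 0)        # drop trivial self meta-paths

# Each meta-path M_i yields one channel adjacency A_i of the multi-channel graph G^.
channels = {"U-D-U": A_udu}
```

Each meta-path thus collapses a heterogeneous neighborhood into a homogeneous channel graph on which standard GNN layers can operate.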
Linear Mapping Function MAP_linear(): MAP_linear(x) = W x, where x is the input
feature vector and W is the trainable weight matrix.
Nonlinear Mapping Function MAP_nonlinear(): MAP_nonlinear(x) = σ(W x), where x is
the input feature vector, W is the trainable weight matrix, and σ() represents a
non-linear activation function.
Multilayer Perceptron Function MLP(): MLP(x) = σ(W_k · σ(W_{k−1} · · · σ(W_1 x))), where
x is the input feature vector, W_i with i = 1, ..., k are the trainable weight matrices,
k indicates the number of layers, and σ() represents a non-linear activation
function.
Feature Concatenation CONCAT ():
softmax(x_i) = exp(MAP(x_i)) / Σ_{j=1}^{n} exp(MAP(x_j))      (26.7)
where MAP() can be linear or nonlinear.
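The building blocks above can be sketched as follows, with tanh standing in for σ and scalar scores per mapped vector in the softmax (both simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 3
W = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))

def map_linear(x, W):
    return W @ x                          # MAP_linear

def map_nonlinear(x, W):
    return np.tanh(W @ x)                 # MAP_nonlinear, sigma = tanh (assumption)

def mlp(x):
    return np.tanh(W2 @ np.tanh(W @ x))   # a two-layer MLP

def softmax(X):
    # Eq. 26.7 over n mapped inputs, reduced to scalar scores for simplicity
    scores = np.array([map_linear(x, W).sum() for x in X])
    e = np.exp(scores - scores.max())     # subtract max for numerical stability
    return e / e.sum()

probs = softmax(rng.normal(size=(4, d)))
```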
Different from traditional deep learning algorithms, GNNs have their own unique
operation, the neural aggregation function AGG(). Based on the level of the object to ag-
gregate, it can be categorized into three specific types: the node-wise neural aggregator
AGG_node(), the layer-wise neural aggregator AGG_layer(), and the path-wise neural
aggregator AGG_path().
Node-wise Neural Aggregator AGGnode () is the GNN module that aims to aggre-
gate the node neighborhoods, which can be described as follows,
h_v^{(i)(k)} = AGG_node(h_v^{(i)(k−1)}, {h_u^{(i)(k−1)}}_{u ∈ N_v^i})      (26.8)
where i is the meta-path (relation) indicator, k ∈ {1, 2, ..., K} is the layer indicator,
h_v^{(i)(k)} is the feature vector of node v for relation M_i at the k-th layer, and N_v^i
indicates the neighborhood of node v under relation M_i. Based on the way the node
neighborhoods are aggregated, typically, the node-level neural aggregator can be
GCN AGGGCN () (Kipf and Welling, 2017b), GAT AGGGAT () (Veličković et al,
2018), or Message-Passing AGG_MPNN() (Gilmer et al, 2017). For GCN and GAT,
the formulation is described by Equation 26.8, while for Message-Passing,
the edges are also used during the node-level aggregation. Formally, it can be de-
scribed as follows,
h_v^{(i)(k)} = AGG_node(h_v^{(i)(k−1)}, {h_v^{(i)(k−1)}, h_u^{(i)(k−1)}, h_{vu}^{(i)(k−1)}}_{u ∈ N_v^i})      (26.9)
where h_{vu}^{(i)(k−1)} denotes the edge embedding between the target node v and its neigh-
bor node u, and {·} indicates a fusion function that combines the target node, its neigh-
bor node, and the corresponding edge between them.
Layer-wise Neural Aggregator AGGlayer () is the GNN module that aims to ag-
gregate the context information from different hops. For example, if layer num-
ber k = 2, the GNN gets 1-hop neighborhood information, and if layer number
k = K + 1, the GNN gets K-hop neighborhood information. The larger the k is, the
more global information the GNN obtains. Formally, this function can be described
as follows,
l_v^{(i)(k)} = AGG_layer(l_v^{(i)(k−1)}, h_v^{(i)(k)})      (26.10)
where l_v^{(i)(k)} is the aggregated representation of the (k − 1)-hop neighborhood of
node v for relation M_i at the k-th layer.
Path-wise Neural Aggregator AGG_path() is the GNN module that aims to ag-
gregate the context information from different relations. Generally, the relation can
be described by meta-path (Sun et al, 2011) based contextual search. Formally, this
function can be described as follows,
p_v^{(i)} = l_v^{(i)(K)}      (26.11)

p_v = AGG_path(p_v^{(1)}, ..., p_v^{(|M|)})      (26.12)
where p_v^{(i)} is the aggregated final-layer representation of node v for relation M_i.
Then the final node representation is described by the fused representation from the
different meta-paths (relations) as follows,

h_v^{(final)} = p_v      (26.13)
Based on the task, we can also compute the graph representation by performing
readout function Readout() to aggregate all the nodes’ final representations, which
can be described as follows,
g = Readout(h_{v_1}^{(final)}, ..., h_{v_V}^{(final)})      (26.14)
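A toy sketch of the three-level aggregation hierarchy (equations 26.8-26.14), where each aggregator is instantiated as a simple mean (a hypothetical choice; real models would use learned aggregators):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, K, M = 4, 3, 2, 2               # nodes, dims, layers, meta-paths
A = [np.eye(n), np.ones((n, n)) / n]  # one adjacency per meta-path channel
H0 = rng.normal(size=(n, d))

def agg_node(A_i, H):                 # node-wise: neighborhood averaging (Eq. 26.8)
    return A_i @ H

def agg_layer(l_prev, h_new):         # layer-wise: running mean over hops (Eq. 26.10)
    return 0.5 * (l_prev + h_new)

def agg_path(p_list):                 # path-wise: mean over meta-paths (Eq. 26.12)
    return np.mean(p_list, axis=0)

p = []
for i in range(M):
    H, l = H0, H0
    for _ in range(K):
        H = agg_node(A[i], H)
        l = agg_layer(l, H)
    p.append(l)                       # Eq. 26.11: p_v^(i) = l_v^(i)(K)
H_final = agg_path(p)                 # Eq. 26.13: final node representations
g = H_final.mean(axis=0)              # Eq. 26.14: mean readout -> graph representation
```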
26.3.3 Prediction
After the graph representations are learned, they are fed to the prediction stage. De-
pending on the task and the target label, there are two types of prediction: classifica-
tion and matching. Classification-based prediction assumes that enough labeled
anomalous data instances are provided, so that a good classifier can be trained to identify
568 Shen Wang, Philip S. Yu
if the given graph target is abnormal or not. As mentioned in the issues section,
there might be no or few anomaly data instances. In this case, the matching-based
prediction is usually used. If there are very few anomaly samples, we learn the rep-
resentation of them, and when the candidate sample is similar to one of the anomaly
samples, an alarm is triggered. If there is no anomaly sample, we learn the represen-
tation of the normal data instance. When the candidate sample is not similar to any
of the normal samples, an alarm is triggered.
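The matching-based alarm logic in the no-anomaly-sample setting can be sketched as follows. The cosine similarity measure and the 0.8 threshold are illustrative assumptions, not values prescribed by any method in this chapter.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_anomalous(candidate, normal_reps, threshold=0.8):
    """Raise an alarm when the candidate embedding is not similar to ANY
    learned normal representation. `threshold` is an assumed cutoff."""
    return all(cosine(candidate, h) < threshold for h in normal_reps)
```

In the few-anomaly-sample setting, the same routine is applied against the learned anomaly representations instead, with the alarm firing when some similarity exceeds the threshold.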
26.4 Taxonomy
In this section, we provide case studies that detail some representative GNN-based anomaly detection approaches.
Graph embeddings for malicious accounts detection (GEM) (Liu et al, 2018f) is the first attempt to apply GNNs to anomaly detection. The aim of GEM is to detect malicious accounts at Alipay, a mobile cashless payment platform.
The graph constructed from the raw data is static and heterogeneous. The constructed graph G = (V, E) consists of 7 types of nodes, including account-typed nodes (U) and 6 types of device-typed nodes (phone number (PN), User Machine ID (UMID), MAC address (MACA), International Mobile Subscriber Identity (IMSI), Alipay Device ID (APDID), and a random number generated via IMSI and IMEI (TID)), such that V = U ∪ PN ∪ UMID ∪ MACA ∪ IMSI ∪ APDID ∪ TID. To overcome the heterogeneous-graph challenge and make GNNs applicable to the graph, GEM constructs, through graph transformation, a 6-channel graph Ĝ = {G_i | G_i = (V_i, E_i, A_i), i = 1, 2, ..., |M|} with |M| = 6. In particular, 6 types of edges are specifically modeled to capture the edge heterogeneity, i.e., account connects phone number (U → PN), account connects UMID (U → UMID), account connects MAC address (U → MACA), account connects IMSI (U → IMSI), account connects Alipay Device ID (U → APDID), and account connects TID (U → TID). As activity attributes are constructed, the constructed graph is an attributed graph. After the graphs are constructed and transformed, GEM performs a graph convolutional network to aggregate the neighborhood on each channel graph. As each channel graph is treated as a homogeneous graph corresponding to a specific relation, the GNN can be directly applied to each channel graph.
During the graph representation learning stage, the node aggregated representation h_v^{(i)(k)} is computed by a GCN aggregator AGG_GCN(). To get the path aggregated representation, GEM adopts attentional feature fusion to fuse the node aggregated representations obtained on each channel graph G_i. Besides, an activity feature is constructed for each node, and a linear mapping of this activity feature is added to the attentional fusion of the path aggregated representations. Formally, the GNN operations can be described as follows.
Node-wise aggregation:
h_v^{(i)(k)} = AGG_node(h_v^{(i)(k−1)}, {h_u^{(i)(k−1)}}_{u∈N_v^i}) = AGG_GCN(h_v^{(i)(k−1)}, {h_u^{(i)(k−1)}}_{u∈N_v^i})  (26.15)
Path-wise aggregation:
p_v^{(k)} = MAP_linear(x_v) + COMB_att(h_v^{(1)(k)}, ..., h_v^{(|M|)(k)})  (26.16)
Layer-wise aggregation:
l_v^{(K)} = p_v^{(K)}  (26.17)
Final node representation:
h_v^{(final)} = l_v^{(K)}  (26.18)
where K indicates the number of layers.
The objective of GEM is classification. It feeds the learned account node embeddings to a standard logistic loss function.
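GEM's path-wise fusion (Eq. 26.16), a linear map of the activity features plus an attention-weighted combination of the per-channel node representations, can be sketched as follows. The softmax attention scores and the weight matrix W are stand-ins for learned parameters.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector of attention logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fuse_channels(x_v, channel_reps, W, attn_scores):
    """Eq. (26.16): MAP_linear(x_v) + COMB_att over |M| channel representations.
    W (linear map) and attn_scores (per-channel logits) are assumed learned."""
    alpha = softmax(attn_scores)
    return W @ x_v + sum(a * h for a, h in zip(alpha, channel_reps))
```

With 6 channel graphs, `channel_reps` holds the six per-relation node embeddings h_v^{(i)(k)}, and the fused p_v^{(k)} is what GEM feeds forward.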
The two selected meta-paths capture different semantics. For example, the UU path connects users having fund transfers from one to another, while the UMU path connects users having transactions with the same merchants. Each channel graph is then homogeneous and can work with a GNN directly. As user attributes are available, the constructed graph is attributed.
In the graph representation stage, node-wise aggregation is performed on each channel graph via a graph convolutional network. Different from GEM (Liu et al, 2018f), this approach joins the user feature x_v with the aggregated node representation in an attentional way. The node-wise aggregation is thus extended to a 3-step procedure: (a) initial node-wise aggregation, (b) feature fusion, and (c) feature attention. After the initial aggregated node representation h̃_v^{(i)} is computed via the GCN aggregator AGG_GCN(), it is fused with the user feature x_v through feature fusion. Next, feature attention is performed. Since only 1-hop neighborhoods are considered, there is no layer-wise aggregation, and the final node-wise aggregated representations h_v^{(i)} are fed to the path-wise aggregation directly. Formally, it can be described as follows,
Node-wise aggregation:
(b) Feature fusion:
f_v^{(i)} = MAP_nonlinear(CONCAT(MAP_linear(h̃_v^{(i)}), MAP_linear(x_v)))  (26.20)
(c) Feature attention:
α_v^{(i)} = MAP_nonlinear(MAP_nonlinear(CONCAT(MAP_linear(x_v), f_v^{(i)})))  (26.21)
l_v^{(i)(k)} = h_v^{(i)(k)}  (26.27)
Path-wise aggregation:
p_v = COMB_att(l_v^{(1)(K)}, ..., l_v^{(|M|)(K)})  (26.28)
Final node representation:
h_v^{(final)} = p_v  (26.29)
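The feature fusion and feature attention steps (Eqs. 26.20–26.21) can be sketched as follows. The ReLU/sigmoid choices for the nonlinear maps and all weight matrices are illustrative assumptions standing in for learned parameters.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_fusion(h_tilde, x, W_h, W_x, W_f):
    """Eq. (26.20): fuse the aggregated node representation with the user
    feature; ReLU plays the role of MAP_nonlinear (an assumption)."""
    return relu(W_f @ np.concatenate([W_h @ h_tilde, W_x @ x]))

def feature_attention(x, f, W_x2, W_a, w_out):
    """Eq. (26.21): a two-layer gate over the user feature and the fused
    representation, producing an attention weight in (0, 1)."""
    return sigmoid(w_out @ relu(W_a @ np.concatenate([W_x2 @ x, f])))
```

The resulting per-channel weights α_v^{(i)} are then used to reweight the fused representations before the path-wise combination of Eq. (26.28).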
The objective of DeepHGNN is classification. However, it differs from GEM and HACUD, which simply build a single classifier for all samples. DeepHGNN formulates the problem of program reidentification in malicious program detection. The graph representation learning aims to learn the representation of the normal target program, and each target program learns a unique classifier. Given a target program with corresponding event data during a time window U = {e_1, e_2, ...} and a claimed name/ID, the system checks whether it belongs to the claimed name/ID. If it matches the behavior pattern of the claimed name/ID, the predicted label should be +1; otherwise, it should be −1.
Graph matching framework to learn the program representation and similarity metric via graph neural network (MatchGNet) (Wang et al, 2019i) is another GNN-based anomaly detection approach for malicious program detection in the computer systems of an enterprise network. MatchGNet differs from DeepHGNN in five aspects: (1) after the graph transformation, the resulting channel graphs keep only the target-type node, i.e., the process node, which is similar to HACUD; (2) the raw program attributes are used to initialize the program node representation; (3) the GNN aggregation is conducted hierarchically in a node-wise, layer-wise, and path-wise manner; (4) the anomaly target is the subgraph of the target program; and (5) the final graph representation is fed to a similarity learning framework with a contrastive loss to deal with unknown anomalies.
It follows a similar style to construct the static heterogeneous graph from system behavioral data. In the graph transformation, it adopts three meta-paths (relations): a process forking another process (P → P), two processes accessing the same file (P ← F → P), and two processes opening the same internet socket (P ← I → P), each defining a unique relationship between two processes. Based on them, a 3-channel graph is constructed from the heterogeneous graph, such that Ĝ = {G_i | G_i = (V_i, E_i, A_i), i = 1, ..., |M|} with |M| = 3 and V_i ∈ P. Then the GNN can be
directly applied to each channel graph. As only process-typed nodes are available, the raw attributes of these processes x_v are used as the node representation initialization.
During the graph representation stage, a hierarchical attentional graph neural network is designed, including a node-wise attentional neural aggregator, a layer-wise dense-connected neural aggregator, and a path-wise attentional neural aggregator. In particular, the node-wise attentional neural aggregator generates node embeddings by selectively aggregating the entities in each channel graph based on random walk scores α_i(u). The layer-wise dense-connected neural aggregator aggregates the node embeddings produced at different layers.
Layer-wise aggregation:
l_v^{(i)(k)} = AGG_layer(h_v^{(i)(0)}, l_v^{(i)(1)}, ..., l_v^{(i)(k)}) = MLP(CONCAT(h_v^{(i)(0)}; l_v^{(i)(1)}; ...; l_v^{(i)(k)}))  (26.32)
Path-wise aggregation:
p_v = COMB_att(l_v^{(1)(K)}, ..., l_v^{(|M|)(K)})  (26.33)
Final node representation:
h_v^{(final)} = p_v  (26.34)
Final graph representation:
h_{G_v} = h_v^{(final)}  (26.35)
where k indicates the number of layers, and ε is a small number. Different from GEM, HACUD, and DeepHGNN, the objective of MatchGNet is matching. The final graph representation is fed to a similarity learning framework with a contrastive loss to deal with unknown anomalies. During training, P pairs of program graph snapshots (G_i(1), G_i(2)), i ∈ {1, 2, ..., P} are collected with corresponding ground-truth pairing information y_i ∈ {+1, −1}. If a pair of graph snapshots belongs to the same program, the ground-truth label is y_i = +1; otherwise, the label is y_i = −1. For each pair of program snapshots, a cosine score function is used to
measure the similarity of the two program embeddings, and the output is defined as follows:
Sim(G_i(1), G_i(2)) = cos(h_{G_i(1)}, h_{G_i(2)}) = (h_{G_i(1)} · h_{G_i(2)}) / (||h_{G_i(1)}|| · ||h_{G_i(2)}||)  (26.36)
Correspondingly, the objective function can be formulated as:
ℓ = Σ_{i=1}^{P} (Sim(G_i(1), G_i(2)) − y_i)²  (26.37)
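Eqs. (26.36)–(26.37) translate directly into code; only the toy embeddings in the usage below are invented for illustration.

```python
import numpy as np

def sim(h1, h2):
    """Cosine similarity between two graph embeddings (Eq. 26.36)."""
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2)))

def contrastive_loss(pairs, labels):
    """Squared loss over P labeled snapshot pairs (Eq. 26.37):
    pairs is a list of (h1, h2) embeddings, labels a list of +1/-1."""
    return sum((sim(h1, h2) - y) ** 2 for (h1, h2), y in zip(pairs, labels))
```

A matched pair (y = +1) with identical embeddings contributes zero loss, while an orthogonal pair labeled −1 contributes (0 − (−1))² = 1.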
As the state of a node c_v^t can be computed by aggregating the neighboring hidden states at the previous timestamp t − 1, the node hidden states within a short window w can be obtained and combined to get the short-term embedding s_v^t. In particular, an attentional feature fusion is used to combine these node hidden states in the short window, as follows,
s_v^t = COMB_att(h_v^{t−w}, ..., h_v^{t−1})  (26.39)
Then the short-term embedding s_v^t and the current state c_v^t are fed to a GRU, a classic recurrent neural network, to compute the current hidden state, which encodes the dynamics within the graph. This stage can be described as follows:
The objective of AddGraph is matching. The hidden states of the nodes at each timestamp are used to calculate the anomaly probabilities of an existing edge and a negatively sampled edge, which are then fed to a margin loss.
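The short-term embedding of Eq. (26.39) followed by a standard GRU step can be sketched as follows. The attention logits and all weight matrices are illustrative stand-ins for learned parameters; the GRU equations are the textbook form, not necessarily AddGraph's exact parameterization.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def short_term_embedding(window_states, attn_scores):
    """Eq. (26.39): attention-weighted combination of a node's hidden
    states over the last w timestamps."""
    alpha = softmax(attn_scores)
    return sum(a * h for a, h in zip(alpha, window_states))

def gru_cell(s, c, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: the short-term embedding s acts as the input and the
    current state c as the previous hidden state, yielding the new hidden
    state that encodes the graph dynamics."""
    z = 1.0 / (1.0 + np.exp(-(Wz @ s + Uz @ c)))          # update gate
    r = 1.0 / (1.0 + np.exp(-(Wr @ s + Ur @ c)))          # reset gate
    h_tilde = np.tanh(Wh @ s + Uh @ (r * c))              # candidate state
    return (1.0 - z) * c + z * h_tilde
```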
GCN-based anti-spam (GAS) (Li et al, 2019a) applies the GNN to spam review detection at the e-commerce platform Xianyu. Similar to previous works, the constructed graph is static, heterogeneous, and attributed, such that G = (U, I, E). There are two types of nodes: user nodes U and item nodes I. The edges E are a set of comments. Different from previous works, the edges E are the anomaly targets. Moreover, as each edge represents a sentence, edge modeling is complicated, and the number of edge types increases dramatically. To better capture the edge representation, a message-passing-like GNN is used. The edge-wise aggregation is performed by concatenating the previous representation of the edge itself h_{iu}^{k−1} with the corresponding user node representation h_u^{k−1} and item node representation h_i^{k−1}. To get the initial edge attributes, a word2vec word embedding for each word in the comments is extracted via an embedding function pre-trained on a million-scale comment dataset. Then the word embeddings of the words w_0, w_1, ..., w_n in an edge's comment are fed to a TextCNN() function to get the comment embedding h_{iu}^0, which is used as the initial edge attribute. Then the edge-wise aggregation is defined as:
Edge-wise aggregation:
(26.44)
where k is the layer indicator. The final edge representation is computed by concatenating the raw edge embedding h_{iu}^0, the new edge embedding h_{iu}^K, the corresponding new user node embedding h_u^K, and the corresponding new item node embedding h_i^K as follows:
Final edge representation:
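A minimal sketch of this final concatenation follows; the embedding sizes are chosen arbitrarily for illustration.

```python
import numpy as np

def final_edge_representation(h0_iu, hK_iu, hK_u, hK_i):
    """Concatenate the raw comment (edge) embedding, the refined edge
    embedding after K layers, and the final user- and item-node embeddings,
    as described in the text above."""
    return np.concatenate([h0_iu, hK_iu, hK_u, hK_i])
```

The concatenated vector is then fed to the downstream spam classifier, so both the raw text signal and the K-layer graph context survive into the prediction.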
Editor’s Notes: Graph neural networks for anomaly detection can be considered a downstream task of graph representation learning, where the long-term challenges in anomaly detection are coupled with the vulnerabilities of graph neural networks, such as scalability, discussed in Chapter 6, and robustness, discussed in Chapter 8. Graph neural networks for anomaly detection also benefit a wide range of downstream tasks in various interesting, important, yet usually challenging areas such as anomaly detection in dynamic networks, spam review detection for recommender systems, and malicious program detection, which are highly relevant to the topics introduced in Chapters 15, 19, and 22.
Chapter 27
Graph Neural Networks in Urban Intelligence
Abstract In recent years, smart and connected urban infrastructures have undergone a fast expansion, which increasingly generates huge amounts of urban big data, such as human mobility data, location-based transaction data, regional weather and air quality data, and social connection data. These heterogeneous data sources convey rich information about the city and can be naturally linked with or modeled by graphs, e.g., the urban social graph and the transportation graph. Such urban graph data can enable intelligent solutions to various urban challenges, such as urban facility planning and air pollution. However, it is also very challenging to manage, analyze, and make sense of such big urban graph data. Recently, there have been many studies on advancing and expanding Graph Neural Network (GNN) approaches for various urban intelligence applications. In this chapter, we provide a comprehensive overview of the GNN techniques that have been used to empower urban intelligence, in four application categories, namely, (i) urban anomaly and event detection, (ii) urban configuration and transportation planning, (iii) urban traffic prediction, and (iv) urban human behavior inference. The chapter also discusses future directions of this line of research and is organized as follows.
Yanhua Li
Computer Science Department, Worcester Polytechnic Institute, e-mail: [email protected]
Xun Zhou
Tippie College of Business, University of Iowa e-mail: [email protected]
Menghai Pan
Computer Science Department, Worcester Polytechnic Institute, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 579
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2_27
27.1.1 Introduction
According to a report (Desa, 2018) published by the United Nations in 2018, the urban share of the world's population reached 55 percent in 2018 and is growing rapidly over time. By 2050, the world will be one-third rural (34 percent) and two-thirds urban (66 percent). Moreover, thanks to the fast development of sensing technologies in recent years, various sensors are widely deployed in urban areas, e.g., GPS sets on vehicles, personal devices, air quality monitoring stations, gas pressure regulators, etc. Stimulated by the large urban population and the wide use of sensors, massive data are generated in the urban environment, for example, the trajectory data of vehicles in ride-sharing services and air quality monitoring data. Given a large amount of heterogeneous urban data, the question to answer is what we can gain from these data and how. For instance, can we use the GPS data of vehicles to help urban planners better design the road network? Can we infer the air quality index across the city based on a limited number of existing monitoring stations? To answer these practical questions, the interdisciplinary research area of Urban Intelligence has been extensively studied in recent years. In general, Urban Intelligence, also referred to as urban computing, is a process of acquisition, integration, and analysis of big and heterogeneous data generated by a diversity of sources in urban spaces, such as sensors, devices, vehicles, buildings, and humans, to tackle the major issues in cities (Zheng et al, 2014).
Data analytics techniques (e.g., data mining, machine learning, optimization) are usually employed to analyze the numerous types of data generated in urban scenarios for prediction, pattern discovery, and decision-making purposes. How to represent urban data is an essential question for the design and implementation of these techniques. Given the heterogeneity of urban big data, various data structures can be used to represent them. For example, spatial data in an urban area can be represented as raster data (like images), where the area is partitioned into grid cells (pixels) with attribute functions imposed on them (Pan et al, 2020b; Zhang et al, 2019, 2020b,a; Pan et al, 2019, 2020a). Spatial data can also be represented as a collection of objects (e.g., vehicles, points-of-interest, and trajectory GPS points) with their locations and topological relationships defined (Ding et al, 2020b).
Moreover, the intrinsic structures of many urban big data enable people to represent them with graphs. For instance, the structure of the urban road network helps people model traffic data with graphs (Xie et al, 2019b; Dai et al, 2020; Cui et al, 2019; Chen et al, 2019b; Song et al, 2020a; Zhang et al, 2020e; Zheng et al, 2020a; Diao et al, 2019; Guo et al, 2019b; Li et al, 2018e; Yu et al, 2018a; Zhang et al, 2018e); the pipelines of a gas supply network enable people to model gas pressure monitoring data with a graph (Yi and Park, 2020); people can also represent the data on a map with a graph by dividing the city into functional regions (Wang et al, 2019o; Yi and Park, 2020; Geng et al, 2019; Bai et al, 2019a; Xie et al, 2016). Representing urban data with graphs can capture the intrinsic topological information and knowledge in the data, and plenty of techniques have been developed to analyze urban graph data.
Graph Neural Networks (GNNs) are naturally employed to solve various real-world problems with urban graph data. For example, Convolutional Graph Neural Networks (ConvGNNs) (Kipf and Welling, 2017b) are used to capture the spatial dependencies of urban graph data, and Recurrent Graph Neural Networks (RecGNNs) (Li et al, 2016b) capture the temporal dependencies. Spatial-temporal Graph Neural Networks (STGNNs) (Yu et al, 2018a) can capture both spatial and temporal dependencies in the data, and are widely used for many urban intelligence problems, e.g., predicting traffic status based on urban traffic data (Zhang et al, 2018e; Li et al, 2018e; Yu et al, 2018a). The traffic data are modeled as spatial-temporal graphs where the nodes are sensors on road segments, and each node has the average traffic speed within a window as its dynamic input features.
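The spatial-temporal traffic graph just described can be sketched as follows: N sensor nodes, an adjacency matrix from road connectivity, and a sliding window of T past average-speed readings per node. All sizes, the chain topology, and the random speed data are illustrative assumptions, and the single degree-normalized propagation step stands in for the spatial half of an STGNN layer.

```python
import numpy as np

N, T = 5, 12                       # sensors on road segments, window length
adj = np.zeros((N, N))
for i in range(N - 1):             # assume a simple chain of road segments
    adj[i, i + 1] = adj[i + 1, i] = 1.0

rng = np.random.default_rng(0)
speeds = rng.uniform(20, 70, size=(T, N))   # dynamic node features (km/h)

# One propagation step of a degree-normalized graph convolution on the
# latest snapshot: each sensor averages its road-connected neighbors.
deg = adj.sum(axis=1, keepdims=True)
smoothed = (adj @ speeds[-1][:, None]) / np.maximum(deg, 1.0)
```

A full STGNN would interleave such spatial steps with a temporal module (e.g., a gated convolution or RNN) over the T-step window.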
In the following sections, we first summarize the general application scenarios in urban intelligence, followed by graph representations in urban scenarios. Then, we provide more details on GNNs for urban configuration and transportation planning, urban anomaly and event detection, and urban human behavior inference, respectively.
The diverse application domains in urban intelligence include urban planning, transportation, environment, energy, human behavior analysis, economy, event detection, etc. In the following paragraphs, we introduce the practical problems and the common datasets in these domains. The problems and examples highlighted below are not exhaustive; we introduce some critical problems and typical examples from the literature, which are summarized in Table 27.1.
1) Urban configuration. Urban configuration is essential for enabling smart cities. It deals with the design of the entire urban area, such as land use, the layout of human settlements, and the design of road networks. The problems in this domain include estimating the impact of a construction (Zhang et al, 2019c), discovering the functional regions of a city (Yuan et al, 2012), detecting city boundaries (Ratti et al, 2010), etc. In (Zhang et al, 2019c), the authors employ and analyze historical taxi GPS data and road network data, define the off-deployment traffic estimation problem as a traffic generation problem, and develop a novel deep generative model, TrafficGAN, that captures the shared patterns across spatial regions of how traffic conditions evolve according to travel demand changes and underlying road network structures. This problem is important for city planners to evaluate and develop urban deployment plans. In (Yuan et al, 2012), the authors propose a framework, DRoF, that Discovers Regions of different Functions in a city using human mobility between regions, with data collected from GPS sets in taxis in Beijing and points of interest (POIs) located in the city. The understanding of functional regions in a city can calibrate urban planning and facilitate
other applications, such as choosing a location for a business. In (Ratti et al, 2010), the authors propose a model to detect a city's boundary by analyzing the human network inferred from a large telecommunications database in Great Britain. Answering this question can help city planners get a sense of the exact extent of the urban area, as it changes fast over time.
2) Transportation. Transportation plays an important role in the urban area. Urban intelligence deals with several problems regarding transportation in the city, e.g., routing for drivers, estimating travel time, and improving the efficiency of the taxi system and the public transit system. In (Yuan et al, 2010), the authors propose T-Drive, a system that provides personalized driving directions adapting to weather, traffic conditions, and a person's own driving habits. The system is built on historical trajectory data of taxicabs. In (Pan et al, 2019), the authors propose a solution framework to analyze the learning curve of taxi drivers. The proposed method first learns the driver's preference for different profile and habit features in each time period, then analyzes the preference dynamics of different groups of drivers. The results illustrate that taxi drivers tend to change their preference for some habit features to improve their operation efficiency. This finding can help new drivers improve their operation efficiency faster. The authors in (Watkins et al, 2011) conducted a study on the impact of providing real-time bus arrival information directly on riders' mobile phones and found that it reduces not only the perceived wait time of those already at a bus stop, but also the actual wait time experienced by customers who plan their journey using such information.
3) Urban Environment. Urban intelligence can address the potential threats to the environment caused by the fast pace of urbanization. The environment is essential for people's health, for example, air quality and noise. In (Zheng et al, 2013), the authors infer real-time and fine-grained air quality information throughout a city based on the (historical and real-time) air quality data reported by existing monitoring stations and a variety of data sources observed in the city, such as meteorology, traffic flow, human mobility, structure of road networks, and POIs. The results can be used to suggest when and where people should conduct outdoor activities, e.g., jogging. Also, the results can suggest suitable locations for deploying new air quality monitoring stations. Noise pollution is usually serious in urban areas and has impacts on both the mental and physical health of human beings. Santini et al (2008) assess environmental noise pollution in urban areas by using monitoring data from wireless sensor networks.
4) Energy supply and consumption. Another application domain of urban intelligence is energy consumption in the urban area, which usually deals with the problems of sensing city-scale energy cost, improving energy infrastructures, and ultimately reducing energy consumption. Common energy types include gas and electricity. Shang et al (2014) infer the gas consumption and pollution emission of vehicles traveling on a city's road network in the current time slot using GPS trajectories from a sample of vehicles (e.g., taxicabs). This knowledge can be used not only to suggest cost-efficient driving routes but also to identify road segments where gas has been wasted significantly. Momtazpour et al (2012) propose a framework to predict electric vehicle (EV) charging needs based on owners' activities, EV charging demands at different locations in the city, and the available charge of EV batteries, and design distributed mechanisms that manage the movements of EVs to different charging stations.
5) Urban human behavior analysis. With the popularization of smart devices, people generate massive amounts of location-embedded information every day, such as location-tagged text, images, videos, check-ins, and GPS trajectories. The first question in this domain is estimating user similarity, as similar users can be recommended as friends. Li et al (2008) connect users with similar interests, even when they may not have known each other previously, and perform community discovery, employing GPS trajectories collected from GPS-equipped devices such as phones.
6) Economy. Urban intelligence can benefit the urban economy. Human mobility and the statistics of POIs can reflect the economy of a city. For example, the average price of a dinner in restaurants can indicate the income level and consumption power. In (Karamshuk et al, 2013), the authors study the problem of optimal retail store placement in the context of location-based social networks. They collected human mobility data from Foursquare and analyzed it to understand how the popularity of three retail store chains in New York is shaped in terms of the number of check-ins. The results indicate that some POIs, such as train stations and airports, can imply the popularity of a location; also, the number of competing stores is an indicator of popularity.
7) Public safety. Public safety and security in the urban area always attract people's concern. The availability of different data enables us to learn from history how to deal with public safety problems, e.g., traffic accidents (Yuan et al, 2018), large events (Vahedian et al, 2019; Khezerlou et al, 2021, 2017; Vahedian et al, 2017), pandemics (Bao et al, 2020), etc., and we can use the data to detect and predict abnormal events. Pang et al (2011) detect anomalous traffic patterns from the spatial-temporal data of vehicles. The authors partition a city into uniform grids and count the number of vehicles arriving in a grid over a time period. The objective is to identify contiguous sets of cells and time intervals that have the largest statistically significant departure from expected behavior (i.e., the number of vehicles).
Various data structures and models can be employed to define the spatial settings of urban systems. For example, a simple model is a grid structure, where the urban area is partitioned into grid cells, with a set of attribute values of interest (e.g., average traffic speed, number of taxis, population, rainfall) associated with each cell. While such a model is simple to implement, it ignores many intrinsic and important relationships existing in urban data. For example, a grid structure may lose the information of road connectivity in the underlying traffic system of the city. In many scenarios, instead, a graph is an elegant choice to capture the intrinsic topological information and knowledge in the data. Many urban system components can be represented as graphs, and additional attributes may be associated with nodes and/or edges. In this section, we introduce graph representations of various urban system scenarios, which are summarized in Table 27.2. The application domains covered include 1) urban transportation and configuration planning, 2) urban environment monitoring, 3) urban energy supply and consumption, 4) urban event and anomaly detection, and 5) urban human behavior analysis.
1) Urban transportation and configuration planning. Modeling the urban transportation system as a graph is widely used in solving real-world urban intelligence problems, e.g., traffic flow prediction (Xie et al, 2019b; Dai et al, 2020; Cui et al, 2019; Chen et al, 2019b; Song et al, 2020a; Zhang et al, 2020e; Zheng et al, 2020a; Diao et al, 2019; Guo et al, 2019b; Li et al, 2018e; Yu et al, 2018a; Zhang et al, 2018e) and the parking availability problem (Zhang et al, 2020h). The graphs are usually built based on the real-world road network. To solve the problem of traffic flow prediction, in (Cui et al, 2019), the authors employ an undirected graph to predict the traffic state, where the nodes are traffic sensing locations (e.g., sensor stations, road segments) and the edges are the intersections or road segments connecting those traffic sensing locations. Xie et al (2019b); Dai et al (2020) model the urban traffic network as a directed graph with attributes to predict traffic speed, where the nodes are road segments and the edges are intersections. Road segment width, length, and direction are the attributes of the nodes, and the type of intersection and whether there are traffic lights or toll gates are the attributes of the edges. For urban configuration, Wu et al (2020c) incorporate a hierarchical GNN framework to learn road network representations at different levels. The nodes in the hierarchical graph include road segments, structural regions, and functional zones, and the edges are intersections and hyperedges. There are also works on predicting parking availability. Zhang et al (2020h) model the parking lots and the surrounding POI and population features as a graph to predict parking availability for the parking lots. The nodes are the parking lots, and there is an edge between two parking lots if their on-road distance is smaller than a threshold. Context features, e.g., POI distribution, population, etc., are the attributes of the nodes.
2) Urban environment monitoring system. People model the air quality monitoring system as a graph to forecast the air quality in the urban area (Wang et al, 2020h; Li et al, 2017f). For example, Wang et al (2020h) propose PM2.5-GNN to forecast the PM2.5 index at different locations. The nodes are locations determined by latitude, longitude, and altitude, and there is an edge between two nodes if the distance and the difference of altitudes between them are both less than respective thresholds (e.g., distance < 300 km and difference of altitudes < 1200 m). The node attributes include Planetary Boundary Layer (PBL) height, K index, wind speed, 2m temperature, relative humidity, precipitation, and surface pressure. Edge attributes include the wind speed of the source node, the distance between source and sink, the wind direction of the source node, and the direction from source to sink.
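The PM2.5-GNN edge rule just described can be sketched as follows; the haversine distance formula and the station tuples are illustrative choices, while the 300 km / 1200 m thresholds come from the text above.

```python
import math

def build_edges(stations, max_dist_km=300.0, max_alt_diff_m=1200.0):
    """Connect two stations only if both their great-circle distance and
    their altitude difference fall below the stated thresholds.
    Each station is an assumed (lat_deg, lon_deg, alt_m) tuple."""
    def haversine(a, b):
        lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
        h = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371.0 * math.asin(math.sqrt(h))  # Earth radius in km

    edges = []
    for i in range(len(stations)):
        for j in range(i + 1, len(stations)):
            if (haversine(stations[i], stations[j]) < max_dist_km
                    and abs(stations[i][2] - stations[j][2]) < max_alt_diff_m):
                edges.append((i, j))
    return edges
```

For example, two nearby low-altitude stations get an edge, while a station in another city (distance above 300 km) does not.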
3) Urban energy supply and consumption. GNNs are also employed in analyzing urban energy supply and consumption systems. For example, Yi and Park (2020) propose a framework to predict the gas pressure in the gas supply network. The gas regulators are the nodes, and the pipelines connecting every two regulators are the edges.
4) Urban event and anomaly detection. Urban event and anomaly detection is a
hot topic in urban intelligence. People employ machine learning models to detect or
predict events occurring in the urban area, e.g., traffic accident prediction (Zhou
et al, 2020g,h; Yu et al, 2021b). In (Zhou et al, 2020g), the authors proposed a
framework to predict traffic accidents in different regions of the city. The urban area
is divided into subregions, i.e., grids, and an edge connects two subregions whose
traffic elements are strongly correlated.
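A minimal sketch of this correlation-based edge construction, assuming Pearson correlation over per-grid traffic time series (the grid IDs, readings, and the 0.8 threshold are invented for illustration):

```python
import math

def pearson(x, y):
    # Pearson correlation; assumes non-constant series of equal length.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_edges(series, threshold=0.8):
    """series: dict grid_id -> list of traffic readings over time.
    Returns edges between grids whose traffic is strongly correlated."""
    grids = sorted(series)
    edges = set()
    for i, u in enumerate(grids):
        for v in grids[i + 1:]:
            if abs(pearson(series[u], series[v])) >= threshold:
                edges.add((u, v))
    return edges

series = {
    "g1": [10, 20, 30, 40],
    "g2": [11, 19, 33, 41],   # moves with g1 -> edge
    "g3": [5, 5, 6, 5],       # roughly flat -> no edge
}
edges = correlation_edges(series)
```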
5) Urban human behavior analysis. Studying human behavior in urban regions
can benefit people in many aspects, e.g., demographic attribute prediction,
personalized recommendation, and passenger demand prediction. Human behavior
modeling is essential for many real-world applications such as demographic
attribute prediction, content recommendation, and targeted advertising, and
several works apply GNNs to it. In (Wang et al, 2020a), the authors
model human behavior via a tripartite graph. The nodes include users’ sessions, lo-
cations, and items. There exists an edge between a session node and a location node if
the user started the session at this location. Similarly, there exists an edge between
a session node and an item node if the user interacted with this item within the
session. Each edge possesses a time attribute indicating the temporal signal of the
interaction between two nodes. Another application of analysing human behavior is
passenger demand prediction. Understanding human behavior in daily transits can
help improve the efficiency of urban transportation system. For example, predicting
the passenger demand in the ride-sharing system can help the ride-sharing company
and the drivers improve their operational efficiency. In recent publications, many
researchers employ graph neural networks to predict human mobility (Wang et al,
2019o; Yi and Park, 2020; Geng et al, 2019; Bai et al, 2019a; Xie et al, 2016);
the nodes of the graph are usually subregions of the city, and the edges are
typically defined based on spatial proximity.
Urban intelligence can help urban planners design urban configurations, and bene-
fit the urban transportation system from different perspectives, e.g., operational
efficiency.
where X(t) is the node feature matrix at time step t, H is the hidden state, and W, U,
and b are the network parameters. Then, the STGNN based on RNN can be formulated
as Eq. (27.2):
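The equations themselves did not survive extraction here. Under the assumption that the chapter follows the standard graph-convolutional RNN formulation, a plausible reconstruction consistent with the surrounding "where" clause is:

```latex
% Vanilla RNN update described by the preceding "where" clause
% (plausibly Eq. (27.1)):
H^{(t)} = \sigma\!\left(W X^{(t)} + U H^{(t-1)} + b\right)

% RNN-based STGNN: the linear maps are replaced by graph convolutions over
% the adjacency matrix A (plausibly Eq. (27.2)):
H^{(t)} = \sigma\!\left(\mathrm{GConv}\big(X^{(t)}, A; W\big)
        + \mathrm{GConv}\big(H^{(t-1)}, A; U\big) + b\right)
```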
Public safety and security in urban areas are of constant concern. The
availability of different data enables us to learn from history how to deal with public
safety problems, e.g., traffic accidents, crime, large events, pandemic, etc., and we
can use the data to detect and predict abnormal events.
Traffic accident prediction. Traffic accident prediction is of great significance
to improve the safety of the road network. Although “accident” is a word related to
“randomness”, there exists a significant correlation between the occurrence of traffic
accidents and the surrounding environmental features, e.g., traffic flow, road net-
work structure, weather, etc. Thus, machine learning approaches, like GNN, can be
employed to predict or forecast traffic accidents over the city, which can help enable
urban intelligence.
The problem of traffic accident prediction is as follows:
Definition 27.3. Traffic accident prediction problem. Given the road network data
and the historical environmental features, the target is to predict the traffic accident
risk over the city in the future.
The environmental features include the traffic conditions, surrounding POIs, etc. In
recent publications (Zhou et al, 2020g,h; Yu et al, 2021b), GNNs are employed to
solve this problem.
The graphs in the traffic accident prediction problem are usually constructed by
dividing the urban area into grids, and each grid is considered as a node. If the traffic
conditions of two nodes have a strong correlation, there is an edge between
them. The contextual environmental features are the attributes of each grid. After the
graphs are constructed in different historical time slots, graph convolutional neural
networks (GCNs) are usually used to extract the hidden embedding in each time
slot. Then, methods dealing with time-series inputs can be employed to capture
the temporal dependencies, e.g., RNN-based neural networks. Finally, the spatial-
temporal information is used to predict traffic accident risk over the city. Overall,
predict the passenger demand in the ride-hailing service. The overall framework
can be illustrated as in Fig. 27.4. First, multiple graphs are constructed based on dif-
ferent aspects of the relationships between each two grids, i.e., proximity, functional
similarity, and transportation connectivity. Then, an RNN is used to aggregate obser-
vations at different times, considering the global contextual information. After that,
GCN is used to model the non-Euclidean correlations among regions. Finally, the
aggregated embeddings are used to predict the passenger demand over the city.
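A toy sketch of the graph-convolution step in such a pipeline, with a hand-rolled one-layer GCN in pure Python. The region graph, normalization, features, and weights below are invented; the cited work's actual architecture is more elaborate:

```python
def matmul(A, B):
    # Plain list-of-lists matrix multiply.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def relu(M):
    return [[max(0.0, x) for x in row] for row in M]

def gcn_layer(A_hat, H, W):
    """One graph-convolution step: H' = ReLU(A_hat @ H @ W)."""
    return relu(matmul(matmul(A_hat, H), W))

# Three regions; A_hat is a row-normalized adjacency with self-loops.
A_hat = [[0.5, 0.5, 0.0],
         [1/3, 1/3, 1/3],
         [0.0, 0.5, 0.5]]
# One feature per region, e.g., an RNN-aggregated demand embedding (made up).
H = [[4.0], [2.0], [0.0]]
W = [[1.0]]  # trivial weight for illustration
H1 = gcn_layer(A_hat, H, W)  # each region mixes in its neighbors' embeddings
```

The single layer already shows the smoothing effect used to model non-Euclidean correlations among regions: each region's embedding becomes a weighted average over its graph neighborhood.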
User behavior modeling. Modeling human behavior is important for many real-
world applications, e.g., demographic attribute prediction, content recommendation,
and targeted advertising. Studying human behavior in urban scenarios can ben-
efit urban intelligence in many aspects, e.g., the economy and transportation. Here,
we introduce an example of modeling spatial-temporal user behavior with tripartite
graphs (Wang et al, 2020a).
Taking urban users’ online browsing behavior as an example, the spatial-temporal
user behavior can be defined on a set of users U, a set of sessions S, a set of items
V , and a set of locations L. Each user’s behavior log can be represented by a set of
session-location tuples, and each session contains multiple item-timestamp tuples.
Then a user’s spatial-temporal behavior can be captured via a tripartite graph as
illustrated in Fig. 27.5. The nodes of this tripartite graph include the user’s sessions S,
locations L, and items V . The edges include session-item edges and session-location
edges.
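The tripartite construction can be sketched directly from a behavior log. The log format and IDs below are assumptions for illustration, not the cited work's actual schema:

```python
def build_tripartite(logs):
    """logs: list of (session_id, location_id, [(item_id, timestamp), ...]).

    Returns the session-location edge set and the session-item edges, the
    latter carrying the interaction timestamp as an edge attribute."""
    sess_loc = set()
    sess_item = {}
    for sid, loc, interactions in logs:
        # Session started at this location -> session-location edge.
        sess_loc.add((sid, loc))
        for item, ts in interactions:
            # Item interacted with during the session -> timestamped edge.
            sess_item[(sid, item)] = ts
    return sess_loc, sess_item

logs = [
    ("s1", "home", [("news", 100), ("video", 160)]),
    ("s2", "office", [("news", 900)]),
]
sess_loc, sess_item = build_tripartite(logs)
```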
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 595
L. Wu et al. (eds.), Graph Neural Networks: Foundations, Frontiers, and Applications,
https://doi.org/10.1007/978-981-16-6054-2
References
Barabasi AL, Oltvai ZN (2004) Network biology: Understanding the cell’s func-
tional organization. Nature Reviews Genetics 5(2):101–113
Barber D (2004) Probabilistic modelling and reasoning: The junction tree algorithm.
Course Notes
Barceló P, Kostylev EV, Monet M, Pérez J, Reutter J, Silva JP (2019) The logical ex-
pressiveness of graph neural networks. In: International Conference on Learning
Representations
Bastian FB, Roux J, Niknejad A, Comte A, Fonseca Costa SS, De Farias TM,
Moretti S, Parmentier G, De Laval VR, Rosikiewicz M, et al (2021) The bgee
suite: integrated curated expression atlas and comparative transcriptomics in ani-
mals. Nucleic Acids Research 49(D1):D831–D847
Bastings J, Titov I, Aziz W, Marcheggiani D, Sima’an K (2017) Graph convo-
lutional encoders for syntax-aware neural machine translation. arXiv preprint
arXiv:170404675
Batagelj V, Zaversnik M (2003) An O(m) algorithm for cores decomposition of net-
works. arXiv preprint cs/0310049
Bateman A, Martin MJ, Orchard S, Magrane M, Agivetova R, Ahmad S, Alpi E,
Bowler-Barnett EH, Britto R, Bursteinas B, et al (2020) Uniprot: the universal
protein knowledgebase in 2021. Nucleic Acids Research
Battaglia P, Pascanu R, Lai M, Rezende DJ, kavukcuoglu K (2016) Interaction net-
works for learning about objects, relations and physics. In: Proceedings of the
30th International Conference on Neural Information Processing Systems, pp
4509–4517
Battaglia PW, Hamrick JB, Bapst V, Sanchez-Gonzalez A, Zambaldi V, Malinowski
M, Tacchetti A, Raposo D, Santoro A, Faulkner R, et al (2018) Relational induc-
tive biases, deep learning, and graph networks. arXiv preprint arXiv:180601261
Beaini D, Passaro S, Létourneau V, Hamilton WL, Corso G, Liò P (2020) Direc-
tional graph networks. CoRR abs/2010.02863
Beck D, Haffari G, Cohn T (2018) Graph-to-sequence learning using gated graph
neural networks. arXiv preprint arXiv:180609835
Belghazi MI, Baratin A, Rajeswar S, Ozair S, Bengio Y, Hjelm RD, Courville AC
(2018) Mutual information neural estimation. In: Dy JG, Krause A (eds) Pro-
ceedings of the 35th International Conference on Machine Learning, ICML 2018,
Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, PMLR, Proceedings
of Machine Learning Research, vol 80, pp 530–539
Belkin M, Niyogi P (2002) Laplacian eigenmaps and spectral techniques for em-
bedding and clustering. In: Advances in neural information processing systems,
pp 585–591
Bengio Y (2008) Neural net language models. Scholarpedia 3(1):3881
Bengio Y, Senécal JS (2008) Adaptive importance sampling to accelerate training
of a neural probabilistic language model. IEEE Transactions on Neural Networks
19(4):713–722
Bennett J, Lanning S, et al (2007) The netflix prize. In: Proceedings of KDD cup
and workshop, New York, vol 2007, p 35
van den Berg R, Kipf TN, Welling M (2018) Graph convolutional matrix comple-
tion. KDD18 Deep Learning Day
Berg Rvd, Kipf TN, Welling M (2017) Graph convolutional matrix completion.
arXiv preprint arXiv:170602263
Berger P, Hannak G, Matz G (2020) Efficient graph learning from noisy and incom-
plete data. IEEE Trans Signal Inf Process over Networks 6:105–119
Berggård T, Linse S, James P (2007) Methods for the detection and analysis of
protein–protein interactions. PROTEOMICS 7(16):2833–2842
Berline N, Getzler E, Vergne M (2003) Heat kernels and Dirac operators. Springer
Science & Business Media
Bian R, Koh YS, Dobbie G, Divoli A (2019) Network embedding and change mod-
eling in dynamic heterogeneous networks. In: Proceedings of the 42nd Interna-
tional ACM SIGIR Conference on Research and Development in Information
Retrieval, pp 861–864
Bianchi FM, Grattarola D, Alippi C (2020) Spectral clustering with graph neural
networks for graph pooling. In: International Conference on Machine Learning,
ACM, pp 2729–2738
Bielik P, Raychev V, Vechev M (2017) Learning a static analyzer from data. In:
International Conference on Computer Aided Verification, Springer, pp 233–253
Biggs N, Lloyd EK, Wilson RJ (1986) Graph Theory, 1736-1936. Oxford University
Press
Bingel J, Søgaard A (2017) Identifying beneficial task relations for multi-task learn-
ing in deep neural networks. In: Proceedings of the 15th Conference of the Euro-
pean Chapter of the Association for Computational Linguistics: Volume 2, Short
Papers, pp 164–169
Bishop CM (2006) Pattern recognition and machine learning. Springer
Bizer C, Lehmann J, Kobilarov G, Auer S, Becker C, Cyganiak R, Hellmann S
(2009) Dbpedia-a crystallization point for the web of data. Journal of web se-
mantics 7(3):154–165
Blitzer J, McDonald R, Pereira F (2006) Domain adaptation with structural corre-
spondence learning. In: Proceedings of the 2006 conference on empirical methods
in natural language processing, pp 120–128
Bodenreider O (2004) The unified medical language system (umls): integrating
biomedical terminology. Nucleic acids research 32(suppl 1):D267–D270
Bojchevski A, Günnemann S (2019) Adversarial attacks on node embeddings via
graph poisoning. In: International Conference on Machine Learning, PMLR, pp
695–704
Bojchevski A, Günnemann S (2019) Certifiable robustness to graph perturbations.
In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R
(eds) Advances in Neural Information Processing Systems, Curran Associates,
Inc., vol 32
Bojchevski A, Matkovic Y, Günnemann S (2017) Robust spectral clustering for
noisy data: Modeling sparse corruptions improves latent embeddings. In: Pro-
ceedings of the 23rd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp 737–746
Cai H, Gan C, Zhu L, Han S (2020b) Tinytl: Reduce memory, not parameters for
efficient on-device learning. Advances in Neural Information Processing Systems
33
Cai JY, Fürer M, Immerman N (1992) An optimal lower bound on the number of
variables for graph identification. Combinatorica 12(4):389–410
Cai L, Ji S (2020) A multi-scale approach for graph link prediction. In: Proceedings
of the AAAI Conference on Artificial Intelligence, vol 34, pp 3308–3315
Cai L, Yan B, Mai G, Janowicz K, Zhu R (2019) Transgcn: Coupling transformation
assumptions with graph convolutional networks for link prediction. In: Proceed-
ings of the 10th International Conference on Knowledge Capture, pp 131–138
Cai L, Li J, Wang J, Ji S (2020c) Line graph neural networks for link prediction.
arXiv preprint arXiv:201010046
Cai T, Luo S, Xu K, He D, Liu Ty, Wang L (2020d) Graphnorm: A princi-
pled approach to accelerating graph neural network training. arXiv preprint
arXiv:200903294
Cai X, Han J, Yang L (2018c) Generative adversarial network based heterogeneous
bibliographic network representation for personalized citation recommendation.
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 32
Cai Z, Wen L, Lei Z, Vasconcelos N, Li SZ (2014) Robust deformable and oc-
cluded object tracking with dynamic graph. IEEE Transactions on Image Pro-
cessing 23(12):5497–5509
Cairong Z, Xinran Z, Cheng Z, Li Z (2016) A novel dbn feature fusion model for
cross-corpus speech emotion recognition. Journal of Electrical and Computer En-
gineering 2016
Cangea C, Velickovic P, Jovanovic N, Kipf T, Liò P (2018) Towards sparse hierar-
chical graph classifiers. CoRR abs/1811.01287
Cao S, Lu W, Xu Q (2015) Grarep: Learning graph representations with global struc-
tural information. In: Proceedings of the 24th ACM international on conference
on information and knowledge management, pp 891–900
Cao Y, Peng H, Philip SY (2020) Multi-information source hin for medical con-
cept embedding. In: Pacific-Asia Conference on Knowledge Discovery and Data
Mining, Springer, pp 396–408
Cao Z, Simon T, Wei SE, Sheikh Y (2017) Realtime multi-person 2d pose estimation
using part affinity fields. In: Proceedings of the IEEE conference on computer
vision and pattern recognition, pp 7291–7299
Cao Z, Hidalgo G, Simon T, Wei SE, Sheikh Y (2019) Openpose: realtime multi-
person 2d pose estimation using part affinity fields. IEEE transactions on pattern
analysis and machine intelligence 43(1):172–186
Cappart Q, Chételat D, Khalil E, Lodi A, Morris C, Veličković P (2021)
Combinatorial optimization and reasoning with graph neural networks. CoRR
abs/2102.09544
Carlini N, Wagner D (2017) Towards Evaluating the Robustness of Neural Net-
works. IEEE Symposium on Security and Privacy pp 39–57, DOI 10.1109/SP.2017.49
Erdős P, Rényi A (1960) On the evolution of random graphs. Publ Math Inst Hung
Acad Sci 5(1):17–60
Erkan G, Radev DR (2004) Lexrank: Graph-based lexical centrality as salience in
text summarization. Journal of artificial intelligence research 22:457–479
Ernst MD, Perkins JH, Guo PJ, McCamant S, Pacheco C, Tschantz MS, Xiao C
(2007) The Daikon system for dynamic detection of likely invariants. Science of
computer programming 69(1-3):35–45
Eykholt K, Evtimov I, Fernandes E, Li B, Rahmati A, Xiao C, Prakash A, Kohno T,
Song D (2018) Robust physical-world attacks on deep learning visual classifica-
tion. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR,
pp 1625–1634
Faghri F, Fleet DJ, Kiros JR, Fidler S (2017) Vse++: Improving visual-semantic
embeddings with hard negatives. arXiv preprint arXiv:170705612
Fan Y, Hou S, Zhang Y, Ye Y, Abdulhayoglu M (2018) Gotcha-sly malware! scor-
pion a metagraph2vec based malware detection system. In: Proceedings of the
24th ACM SIGKDD International Conference on Knowledge Discovery & Data
Mining, pp 253–262
Fang Y, Sun S, Gan Z, Pillai R, Wang S, Liu J (2020) Hierarchical graph network
for multi-hop question answering. In: Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing (EMNLP), pp 8823–8838
Fatemi B, Asri LE, Kazemi SM (2021) Slaps: Self-supervision improves structure
learning for graph neural networks. arXiv preprint arXiv:210205034
Feng B, Wang Y, Wang Z, Ding Y (2021) Uncertainty-aware Attention Graph Neu-
ral Network for Defending Adversarial Attacks. In: AAAI Conference on Artifi-
cial Intelligence
Feng F, He X, Tang J, Chua T (2019a) Graph adversarial training: Dynamically
regularizing based on graph structure. TKDE pp 1–1
Feng J, Huang M, Wang M, Zhou M, Hao Y, Zhu X (2016) Knowledge graph
embedding by flexible translation. In: Proceedings of the Fifteenth International
Conference on Principles of Knowledge Representation and Reasoning, pp 557–
560
Feng W, Zhang J, Dong Y, Han Y, Luan H, Xu Q, Yang Q, Kharlamov E, Tang J
(2020) Graph random neural networks for semi-supervised learning on graphs. In:
Advances in Neural Information Processing Systems, vol 33, pp 22,092–22,103
Feng X, Zhang Y, Glass J (2014) Speech feature denoising and dereverberation via
deep autoencoders for noisy reverberant speech recognition. In: 2014 IEEE inter-
national conference on acoustics, speech and signal processing (ICASSP), IEEE,
pp 1759–1763
Feng Y, Lv F, Shen W, Wang M, Sun F, Zhu Y, Yang K (2019b) Deep session interest
network for click-through rate prediction. arXiv preprint arXiv:190506482
Feng Y, You H, Zhang Z, Ji R, Gao Y (2019c) Hypergraph neural networks. In:
Proceedings of the AAAI Conference on Artificial Intelligence, vol 33, pp 3558–
3565
Feurer M, Hutter F (2019) Hyperparameter optimization. In: Automated Machine
Learning, Springer, Cham, pp 3–33
Févotte C, Idier J (2011) Algorithms for nonnegative matrix factorization with the
β-divergence. Neural computation 23(9):2421–2456
Fey M, Lenssen JE (2019) Fast graph representation learning with PyTorch Geo-
metric. CoRR abs/1903.02428
Fey M, Lenssen JE, Weichert F, Müller H (2018) Splinecnn: Fast geometric deep
learning with continuous b-spline kernels. In: Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, pp 869–877
Fey M, Lenssen JE, Morris C, Masci J, Kriege NM (2020) Deep graph matching
consensus. In: International Conference on Learning Representations
Finn RD, Bateman A, Clements J, et al (2013) Pfam: the protein families database.
Nucleic Acids Research 42(D1):D222–D230
Foggia P, Percannella G, Vento M (2014) Graph matching and learning in pattern
recognition in the last 10 years. International Journal of Pattern Recognition and
Artificial Intelligence 28(01):1450,001
Foltman M, Sanchez-Diaz A (2016) Studying protein–protein interactions in bud-
ding yeast using co-immunoprecipitation. In: Yeast Cytokinesis, Springer, pp
239–256, DOI 10.1007/978-1-4939-3145-3_17
Fong RC, Vedaldi A (2017) Interpretable explanations of black boxes by meaningful
perturbation. In: Proceedings of the IEEE International Conference on Computer
Vision, pp 3429–3437
Fortin S (1996) The graph isomorphism problem
Fortunato S (2010) Community detection in graphs. Physics reports 486(3-5):75–
174
Fouss F, Pirotte A, Renders JM, Saerens M (2007) Random-walk computation of
similarities between nodes of a graph with application to collaborative recom-
mendation. IEEE Transactions on knowledge and data engineering 19(3):355–
369
Fowkes J, Chanthirasegaran P, Ranca R, Allamanis M, Lapata M, Sutton C (2017)
Autofolding for source code summarization. IEEE Transactions on Software En-
gineering 43(12):1095–1109
Franceschi L, Niepert M, Pontil M, He X (2019) Learning discrete structures for
graph neural networks. In: Proceedings of the 36th International Conference on
Machine Learning, vol 97, pp 1972–1982
Freeman LA (2003) A refresher in data flow diagramming: an effective aid for ana-
lysts. Commun ACM 46(9):147–151, DOI 10.1145/903893.903930
Freeman LC (2000) Visualizing social networks. Journal of social structure 1(1):4
Fröhlich H, Wegner JK, Sieker F, Zell A (2005) Optimal assignment kernels for
attributed molecular graphs. In: International Conference on Machine Learning,
pp 225–232
Fu R, Zhang Z, Li L (2016) Using lstm and gru neural network methods for traffic
flow prediction. In: 2016 31st Youth Academic Annual Conference of Chinese
Association of Automation (YAC), IEEE, pp 324–328
Fu Ty, Lee WC, Lei Z (2017) Hin2vec: Explore meta-paths in heterogeneous infor-
mation networks for representation learning. In: Proceedings of the 2017 ACM
on Conference on Information and Knowledge Management, pp 1797–1806
Grohe M, Otto M (2015) Pebble games and linear equations. The Journal of Sym-
bolic Logic pp 797–844
Grover A, Leskovec J (2016) node2vec: Scalable feature learning for networks. In:
Proceedings of the 22nd ACM SIGKDD international conference on Knowledge
discovery and data mining, pp 855–864
Grover A, Zweig A, Ermon S (2019) Graphite: Iterative generative modeling of
graphs. In: International Conference on Machine Learning, pp 2434–2444
Gu J, Cai J, Joty SR, Niu L, Wang G (2018) Look, imagine and match: Improving
textual-visual cross-modal retrieval with generative models. In: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pp 7181–
7189
Gu S, Lillicrap T, Ghahramani Z, Turner RE, Levine S (2016) Q-prop: Sample-
efficient policy gradient with an off-policy critic. arXiv preprint arXiv:161102247
Guan Y, Myers CL, Hess DC, et al (2008) Predicting gene function in a hierarchical
context with an ensemble of classifiers. Genome Biology 9(Suppl 1):S3
Gui H, Liu J, Tao F, Jiang M, Norick B, Han J (2016) Large-scale embedding learn-
ing in heterogeneous event data. In: 2016 IEEE 16th International Conference on
Data Mining (ICDM), IEEE, pp 907–912
Gui T, Zou Y, Zhang Q, Peng M, Fu J, Wei Z, Huang XJ (2019) A lexicon-based
graph neural network for chinese ner. In: Proceedings of the 2019 Conference
on Empirical Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp 1039–
1049
Guille A, Hacid H, Favre C, Zighed DA (2013) Information diffusion in online
social networks: A survey. ACM Sigmod Record 42(2):17–28
Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville A (2017) Improved train-
ing of wasserstein gans. arXiv preprint arXiv:170400028
Guo G, Ouyang S, He X, Yuan F, Liu X (2019a) Dynamic item block and prediction
enhancing block for sequential recommendation. In: Proceedings of the Interna-
tional Joint Conference on Artificial Intelligence, pp 1373–1379
Guo H, Tang R, Ye Y, Li Z, He X (2017) Deepfm: a factorization-machine based
neural network for ctr prediction. In: Proceedings of the International Joint Con-
ference on Artificial Intelligence, pp 1725–1731
Guo M, Chou E, Huang DA, Song S, Yeung S, Fei-Fei L (2018a) Neural graph
matching networks for fewshot 3d action recognition. In: Proceedings of the Eu-
ropean Conference on Computer Vision (ECCV), pp 653–669
Guo S, Lin Y, Feng N, Song C, Wan H (2019b) Attention based spatial-temporal
graph convolutional networks for traffic flow forecasting. In: Proceedings of the
AAAI Conference on Artificial Intelligence, vol 33, pp 922–929
Guo X, Wu L, Zhao L (2018b) Deep graph translation. arXiv preprint
arXiv:180509980
Guo X, Zhao L, Nowzari C, Rafatirad S, Homayoun H, Dinakarrao SMP (2019c)
Deep multi-attributed graph translation with node-edge co-evolution. In: 2019
IEEE International Conference on Data Mining (ICDM), IEEE, pp 250–259
Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief
nets. Neural computation 18(7):1527–1554
Hirsch CN, Hirsch CD, Brohammer AB, et al (2016) Draft assembly of elite inbred
line PH207 provides insights into genomic and transcriptome diversity in maize.
The Plant Cell 28(11):2700–2714
Hjelm RD, Fedorov A, Lavoie-Marchildon S, Grewal K, Bachman P, Trischler A,
Bengio Y (2018) Learning deep representations by mutual information estimation
and maximization. arXiv preprint arXiv:180806670
Ho Y, Gruhler A, Heilbut A, et al (2002) Systematic identification of pro-
tein complexes in saccharomyces cerevisiae by mass spectrometry. Nature
415(6868):180–183
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural computation
9(8):1735–1780
Hoff PD, Raftery AE, Handcock MS (2002) Latent space approaches to social net-
work analysis. Journal of the american Statistical association 97(460):1090–1098
Hoffart J, Suchanek FM, Berberich K, Lewis-Kelham E, De Melo G, Weikum G
(2011) Yago2: exploring and querying world knowledge in time, space, context,
and many languages. In: Proceedings of the 20th international conference com-
panion on World wide web, pp 229–232
Hoffman MD, Blei DM, Wang C, Paisley J (2013) Stochastic variational inference.
The Journal of Machine Learning Research 14(1):1303–1347
Hogan A, Blomqvist E, Cochez M, d’Amato C, de Melo G, Gutierrez C, Gayo JEL,
Kirrane S, Neumaier S, Polleres A, et al (2020) Knowledge graphs. arXiv preprint
arXiv:200302320
Holland PW, Laskey KB, Leinhardt S (1983) Stochastic blockmodels: First steps.
Social networks 5(2):109–137
Holmes R, Murphy GC (2005) Using structural context to recommend source code
examples. In: Proceedings. 27th International Conference on Software Engineer-
ing, 2005. ICSE 2005., IEEE, pp 117–125
Hong D, Gao L, Yao J, Zhang B, Plaza A, Chanussot J (2020a) Graph convo-
lutional networks for hyperspectral image classification. IEEE Transactions on
Geoscience and Remote Sensing pp 1–13, DOI 10.1109/TGRS.2020.3015157
Hong H, Guo H, Lin Y, Yang X, Li Z, Ye J (2020b) An attention-based graph neu-
ral network for heterogeneous structural learning. In: Proceedings of the AAAI
Conference on Artificial Intelligence, vol 34, pp 4132–4139
Hornik K, Stinchcombe M, White H, et al (1989) Multilayer feedforward networks
are universal approximators. Neural Networks 2(5):359–366
Horton T (1992) Object-oriented analysis & design. Englewood Cliffs (New Jersey):
Prentice-Hall
Hosseini A, Chen T, Wu W, Sun Y, Sarrafzadeh M (2018) Heteromed: Heteroge-
neous information network for medical diagnosis. In: Proceedings of the 27th
ACM International Conference on Information and Knowledge Management, pp
763–772
Hou S, Ye Y, Song Y, Abdulhayoglu M (2017) Hindroid: An intelligent android
malware detection system based on structured heterogeneous information net-
Jin M, Chang H, Zhu W, Sojoudi S (2019b) Power up! robust graph convolutional
network against evasion attacks based on graph powering. CoRR abs/1905.10029
Jin W, Barzilay R, Jaakkola T (2018a) Junction tree variational autoencoder for
molecular graph generation. In: Proceedings of the 35th International Conference
on Machine Learning, pp 2323–2332
Jin W, Barzilay R, Jaakkola TS (2018b) Junction tree variational autoencoder for
molecular graph generation. In: International Conference on Machine Learning,
pp 2328–2337
Jin W, Yang K, Barzilay R, Jaakkola T (2018c) Learning multimodal graph-to-graph
translation for molecular optimization. arXiv preprint arXiv:181201070
Jin W, Barzilay R, Jaakkola T (2020c) Composing molecules with multiple property
constraints. arXiv preprint arXiv:200203244
Jin W, Derr T, Liu H, Wang Y, Wang S, Liu Z, Tang J (2020d) Self-supervised learn-
ing on graphs: Deep insights and new direction. arXiv preprint arXiv:200610141
Jin W, Ma Y, Liu X, Tang X, Wang S, Tang J (2020e) Graph structure learning
for robust graph neural networks. In: The 26th ACM SIGKDD Conference on
Knowledge Discovery and Data Mining, pp 66–74
Jin W, Derr T, Wang Y, Ma Y, Liu Z, Tang J (2021) Node similarity preserving
graph convolutional networks. In: Proceedings of the 14th ACM International
Conference on Web Search and Data Mining, pp 148–156
Johansson FD, Dubhashi D (2015) Learning with similarity functions on graphs us-
ing matchings of geometric embeddings. In: ACM SIGKDD International Con-
ference on Knowledge Discovery and Data Mining, pp 467–476
Johnson D, Larochelle H, Tarlow D (2020) Learning graph structure with a finite-
state automaton layer. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin
H (eds) Advances in Neural Information Processing Systems, Curran Associates,
Inc., vol 33, pp 3082–3093
Jonas E (2019) Deep imitation learning for molecular inverse problems. Advances
in Neural Information Processing Systems 32:4990–5000
Jurafsky D (2000) Speech & language processing. Pearson Education India
Kagdi H, Collard ML, Maletic JI (2007) A survey and taxonomy of approaches
for mining software repositories in the context of software evolution. Journal of
software maintenance and evolution: Research and practice 19(2):77–131
Kahneman D (2011) Thinking, fast and slow. Macmillan
Kalchbrenner N, Grefenstette E, Blunsom P (2014) A convolutional neural network
for modelling sentences. In: Proceedings of the 52nd Annual Meeting of the As-
sociation for Computational Linguistics, Association for Computational Linguis-
tics, pp 655–665, DOI 10.3115/v1/P14-1062
Kalliamvakou E, Gousios G, Blincoe K, Singer L, German DM, Damian D (2014)
The promises and perils of mining github. In: Proceedings of the 11th working
conference on mining software repositories, pp 92–101
Kalofolias V (2016) How to learn a graph from smooth signals. In: Artificial Intel-
ligence and Statistics, PMLR, pp 920–929
Kalofolias V, Perraudin N (2019) Large scale graph learning from smooth signals.
In: 7th International Conference on Learning Representations
Kaluza MCDP, Amizadeh S, Yu R (2018) A neural framework for learning dag to
dag translation. In: NeurIPS’2018 Workshop
Kampffmeyer M, Chen Y, Liang X, Wang H, Zhang Y, Xing EP (2019) Rethink-
ing knowledge graph propagation for zero-shot learning. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 11,487–
11,496
Kandasamy K, Neiswanger W, Schneider J, Poczos B, Xing E (2018) Neural archi-
tecture search with bayesian optimisation and optimal transport. In: Advances in
Neural Information Processing Systems
Kanehisa M, Goto S (2000) Kegg: kyoto encyclopedia of genes and genomes. Nu-
cleic acids research 28(1):27–30
Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T,
Kawashima S, Okuda S, Tokimatsu T, et al (2007) Kegg for linking genomes
to life and the environment. Nucleic acids research 36(suppl 1):D480–D484
Kang U, Tong H, Sun J (2012) Fast random walk graph kernel. In: SIAM Interna-
tional Conference on Data Mining, pp 828–838
Kang WC, McAuley J (2018) Self-attentive sequential recommendation. In: 2018
IEEE International Conference on Data Mining (ICDM), IEEE, pp 197–206
Kang Z, Pan H, Hoi SC, Xu Z (2019) Robust graph learning from noisy data. IEEE
transactions on cybernetics 50(5):1833–1843
Karampatsis RM, Sutton C (2020) How often do single-statement bugs occur? the
ManySStuBs4J dataset. In: Proceedings of the 17th International Conference on
Mining Software Repositories, pp 573–577
Karamshuk D, Noulas A, Scellato S, Nicosia V, Mascolo C (2013) Geo-spotting:
mining online location-based services for optimal retail store placement. In: Pro-
ceedings of the 19th ACM SIGKDD international conference on Knowledge dis-
covery and data mining, pp 793–801
Karita S, Watanabe S, Iwata T, Ogawa A, Delcroix M (2018) Semi-supervised end-
to-end speech recognition. In: Interspeech, pp 2–6
Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating im-
age descriptions. In: Proceedings of the IEEE conference on computer vision and
pattern recognition, pp 3128–3137
Karypis G, Kumar V (1995) Multilevel graph partitioning schemes. In: ICPP (3), pp
113–122
Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partition-
ing irregular graphs. SIAM Journal on scientific Computing 20(1):359–392
Katharopoulos A, Vyas A, Pappas N, Fleuret F (2020) Transformers are rnns: Fast
autoregressive transformers with linear attention. In: International Conference on
Machine Learning, PMLR, pp 5156–5165
Katz L (1953) A new status index derived from sociometric analysis. Psychometrika
18(1):39–43
Kawahara J, Brown CJ, Miller SP, Booth BG, Chau V, Grunau RE, Zwicker JG,
Hamarneh G (2017) Brainnetcnn: Convolutional neural networks for brain net-
works; towards predicting neurodevelopment. NeuroImage 146:1038–1049
Kazemi E, Hassani SH, Grossglauser M (2015) Growing a graph matching from a
handful of seeds. Proc VLDB Endow 8(10):1010–1021
Kazemi SM, Poole D (2018) Simple embedding for link prediction in knowledge
graphs. In: Neural Information Processing Systems, p 4289–4300
Kazemi SM, Goel R, Eghbali S, Ramanan J, Sahota J, Thakur S, Wu S, Smyth
C, Poupart P, Brubaker M (2019) Time2vec: Learning a vector representation of
time. arXiv preprint arXiv:190705321
Kazemi SM, Goel R, Jain K, Kobyzev I, Sethi A, Forsyth P, Poupart P (2020) Rep-
resentation learning for dynamic graphs: A survey. Journal of Machine Learning
Research 21(70):1–73
Kazi A, Cosmo L, Navab N, Bronstein M (2020) Differentiable graph module (dgm)
for graph convolutional networks. arXiv preprint arXiv:200204999
Kearnes S, McCloskey K, Berndl M, Pande V, Riley P (2016) Molecular graph
convolutions: moving beyond fingerprints. Journal of computer-aided molecular
design 30(8):595–608
Keriven N, Peyré G (2019) Universal invariant and equivariant graph neural net-
works. In: Advances in Neural Information Processing Systems, pp 7090–7099
Kersting K, Kriege NM, Morris C, Mutzel P, Neumann M (2016) Benchmark data
sets for graph kernels
Khezerlou AV, Zhou X, Li L, Shafiq Z, Liu AX, Zhang F (2017) A traffic flow
approach to early detection of gathering events: Comprehensive results. ACM
Transactions on Intelligent Systems and Technology (TIST) 8(6):1–24
Khezerlou AV, Zhou X, Tong L, Li Y, Luo J (2021) Forecasting gathering events
through trajectory destination prediction: A dynamic hybrid model. IEEE Trans-
actions on Knowledge and Data Engineering 33(3):991–1004, DOI 10.1109/
TKDE.2019.2937082
Khrulkov V, Novikov A, Oseledets I (2018) Expressive power of recurrent neural
networks. In: International Conference on Learning Representations
Kiefer S, Schweitzer P, Selman E (2015) Graphs identified by logics with counting.
In: International Symposium on Mathematical Foundations of Computer Science,
pp 319–330
Kilicoglu H, Shin D, Fiszman M, Rosemblat G, Rindflesch TC (2012) Semmeddb:
a pubmed-scale repository of biomedical semantic predications. Bioinformatics
28(23):3158–3160
Kim B, Koyejo O, Khanna R, et al (2016) Examples are not enough, learn to criti-
cize! criticism for interpretability. In: NIPS, pp 2280–2288
Kim D, Oh A (2021) How to find your friendly neighborhood: Graph attention de-
sign with self-supervision. In: International Conference on Learning Representa-
tions
Kim J, Kim T, Kim S, Yoo CD (2019) Edge-labeling graph neural network for few-
shot learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp 11–20
Li L, Tang S, Deng L, Zhang Y, Tian Q (2017d) Image caption with global-local at-
tention. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 31
Li L, Gan Z, Cheng Y, Liu J (2019d) Relation-aware graph attention network for
visual question answering. In: Proceedings of the IEEE/CVF International Con-
ference on Computer Vision, pp 10,313–10,322
Li L, Wang P, Yan J, Wang Y, Li S, Jiang J, Sun Z, Tang B, Chang TH, Wang S,
et al (2020b) Real-world data medical knowledge graph: construction and appli-
cations. Artificial intelligence in medicine 103:101,817
Li L, Zhang Y, Chen L (2020c) Generate neural template explanations for recom-
mendation. In: Proceedings of the 29th ACM International Conference on Infor-
mation & Knowledge Management, pp 755–764
Li M, Chen S, Chen X, Zhang Y, Wang Y, Tian Q (2019e) Actional-structural graph
convolutional networks for skeleton-based action recognition. In: IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp 3595–3603
Li N, Yang Z, Luo L, Wang L, Zhang Y, Lin H, Wang J (2020d) Kghc: a knowl-
edge graph for hepatocellular carcinoma. BMC Medical Informatics and Decision
Making 20(3):1–11
Li P, Chien I, Milenkovic O (2019f) Optimizing generalized pagerank methods for
seed-expansion community detection. In: Advances in Neural Information Pro-
cessing Systems, pp 11,705–11,716
Li P, Wang Y, Wang H, Leskovec J (2020e) Distance encoding: Design provably
more powerful neural networks for graph representation learning. Advances in
Neural Information Processing Systems 33
Li Q, Zheng Y, Xie X, Chen Y, Liu W, Ma WY (2008) Mining user similarity based
on location history. In: Proceedings of the 16th ACM SIGSPATIAL international
conference on Advances in geographic information systems, pp 1–10
Li Q, Han Z, Wu XM (2018b) Deeper insights into graph convolutional networks for
semi-supervised learning. In: Proceedings of the AAAI Conference on Artificial
Intelligence, vol 32
Li R, Tapaswi M, Liao R, Jia J, Urtasun R, Fidler S (2017e) Situation recognition
with graph neural networks. In: Proceedings of the IEEE International Confer-
ence on Computer Vision, pp 4173–4182
Li R, Wang S, Zhu F, Huang J (2018c) Adaptive graph convolutional neural net-
works. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 32
Li S, Wu L, Feng S, Xu F, Xu F, Zhong S (2020f) Graph-to-tree neural networks
for learning structured input-output translation with applications to semantic
parsing and math word problem. In: Findings of the Association for Computa-
tional Linguistics: EMNLP 2020, Association for Computational Linguistics, On-
line, pp 2841–2852, DOI 10.18653/v1/2020.findings-emnlp.255, URL https:
//www.aclweb.org/anthology/2020.findings-emnlp.255
Li X, Cheng Y, Cong G, Chen L (2017f) Discovering pollution sources and propa-
gation patterns in urban area. In: Proceedings of the 23rd ACM SIGKDD Inter-
national Conference on Knowledge Discovery and Data Mining, pp 1863–1872
McBurney PW, Liu C, McMillan C (2016) Automated feature discovery via sen-
tence selection and source code summarization. Journal of Software: Evolution
and Process 28(2):120–145
McMillan C, Grechanik M, Poshyvanyk D, Xie Q, Fu C (2011) Portfolio: finding
relevant functions and their usage. In: Proceedings of the 33rd International Con-
ference on Software Engineering, pp 111–120
Mcmillan C, Poshyvanyk D, Grechanik M, Xie Q, Fu C (2013) Portfolio: Searching
for relevant functions and their usages in millions of lines of code. ACM Trans-
actions on Software Engineering and Methodology (TOSEM) 22(4):1–30
McNee SM, Riedl J, Konstan JA (2006) Being accurate is not enough: how accu-
racy metrics have hurt recommender systems. In: CHI’06 extended abstracts on
Human factors in computing systems, pp 1097–1101
Mendez D, Gaulton A, Bento AP, Chambers J, De Veij M, Félix E, Magariños MP,
Mosquera JF, Mutowo P, Nowotka M, et al (2019) Chembl: towards direct depo-
sition of bioassay data. Nucleic acids research 47(D1):D930–D940
Merkwirth C, Lengauer T (2005) Automatic generation of complementary descrip-
tors with molecular graph networks. Journal of Chemical Information and Mod-
eling 45(5):1159–1168
Mesquita DPP, Souza Jr AH, Kaski S (2020) Rethinking pooling in graph neural net-
works. In: Advances in Neural Information Processing Systems
Mihalcea R, Tarau P (2004) Textrank: Bringing order into text. In: Proceedings of
the 2004 conference on empirical methods in natural language processing, pp
404–411
Miikkulainen R, Liang J, Meyerson E, Rawal A, Fink D, Francon O, Raju B,
Shahrzad H, Navruzyan A, Duffy N, et al (2019) Evolving deep neural networks.
In: Artificial Intelligence in the Age of Neural Networks and Brain Computing,
Elsevier, pp 293–312
Mikolov T, Karafiát M, Burget L, Cernocký J, Khudanpur S (2010) Recurrent neu-
ral network based language model. In: Kobayashi T, Hirose K, Nakamura S (eds)
INTERSPEECH 2010, 11th Annual Conference of the International Speech Com-
munication Association, Makuhari, Chiba, Japan, September 26-30, 2010, ISCA,
pp 1045–1048
Mikolov T, Deoras A, Kombrink S, Burget L, Cernocký J (2011a) Empirical eval-
uation and combination of advanced language modeling techniques. In: INTER-
SPEECH 2011, 12th Annual Conference of the International Speech Communi-
cation Association, Florence, Italy, August 27-31, 2011, ISCA, pp 605–608
Mikolov T, Kombrink S, Burget L, Černocký J, Khudanpur S (2011b) Extensions of
recurrent neural network language model. In: 2011 IEEE international conference
on acoustics, speech and signal processing (ICASSP), IEEE, pp 5528–5531
Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word repre-
sentations in vector space. arXiv preprint arXiv:13013781
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013b) Distributed represen-
tations of words and phrases and their compositionality. In: Advances in neural
information processing systems, pp 3111–3119
Pandi IV, Barr ET, Gordon AD, Sutton C (2020) OptTyper: Probabilistic
type inference by optimising logical and natural constraints. arXiv preprint
arXiv:200400348
Pang L, Lan Y, Guo J, Xu J, Wan S, Cheng X (2016) Text matching as image recog-
nition. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 30
Pang LX, Chawla S, Liu W, Zheng Y (2011) On mining anomalous patterns in
road traffic streams. In: International conference on advanced data mining and
applications, Springer, pp 237–251
Panichella S, Aponte J, Di Penta M, Marcus A, Canfora G (2012) Mining source
code descriptions from developer communications. In: 2012 20th IEEE Interna-
tional Conference on Program Comprehension (ICPC), IEEE, pp 63–72
Paninski L (2003) Estimation of entropy and mutual information. Neural computa-
tion 15(6):1191–1253
Pantziarka P, Meheus L (2018) Omics-driven drug repurposing as a source of inno-
vative therapies in rare cancers. Expert Opinion on Orphan Drugs 6(9):513–517
Park C, Kim D, Zhu Q, Han J, Yu H (2019) Task-guided pair embedding in hetero-
geneous network. In: Proceedings of the 28th ACM International Conference on
Information and Knowledge Management, pp 489–498
Parthasarathy S, Busso C (2017) Jointly predicting arousal, valence and dominance
with multi-task learning. In: Interspeech, vol 2017, pp 1103–1107
Pascanu R, Mikolov T, Bengio Y (2013) On the difficulty of training recurrent neural
networks. In: International conference on machine learning, PMLR, pp 1310–
1318
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z,
Gimelshein N, Antiga L, Desmaison A, Kopf A, Yang E, DeVito Z, Raison M, Te-
jani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S (2019) Pytorch: An
imperative style, high-performance deep learning library. In: Advances in Neural
Information Processing Systems, vol 32
Pathak D, Krahenbuhl P, Donahue J, Darrell T, Efros AA (2016) Context encoders:
Feature learning by inpainting. In: Proceedings of the IEEE conference on com-
puter vision and pattern recognition, pp 2536–2544
Peña-Castillo L, Tasan M, Myers CL, et al (2008) A critical assessment of Mus
musculus gene function prediction using integrated genomic evidence. Genome
Biology 9(Suppl 1):S2, DOI 10.1186/gb-2008-9-s1-s2
Peng H, Li J, He Y, Liu Y, Bao M, Wang L, Song Y, Yang Q (2018) Large-scale
hierarchical text classification with recursively regularized deep graph-cnn. In:
Proceedings of the 2018 world wide web conference, pp 1063–1072
Peng H, Pappas N, Yogatama D, Schwartz R, Smith N, Kong L (2021) Random
feature attention. In: International Conference on Learning Representations
Peng Z, Dong Y, Luo M, Wu XM, Zheng Q (2020) Self-supervised graph represen-
tation learning via global context prediction. arXiv preprint arXiv:200301604
Pennington J, Socher R, Manning CD (2014) Glove: Global vectors for word repre-
sentation. In: Proceedings of the 2014 conference on empirical methods in natural
language processing (EMNLP), pp 1532–1543
Rink B, Bejan CA, Harabagiu SM (2010) Learning textual graph patterns to detect
causal event relations. In: FLAIRS Conference
Rizvi RF, Vasilakes JA, Adam TJ, Melton GB, Bishop JR, Bian J, Tao C, Zhang R
(2019) Integrated dietary supplement knowledge base (iDISK)
Robinson PN, Köhler S, Bauer S, et al (2008) The human phenotype ontology:
A tool for annotating and analyzing human hereditary disease. The American
Journal of Human Genetics 83(5):610–615
Rocco I, Cimpoi M, Arandjelović R, Torii A, Pajdla T, Sivic J (2018) Neighbour-
hood consensus networks. In: Advances in Neural Information Processing Sys-
tems, vol 31
Rodeghero P, McMillan C, McBurney PW, Bosch N, D’Mello S (2014) Improving
automated source code summarization via an eye-tracking study of programmers.
In: Proceedings of the 36th international conference on Software engineering,
ACM, pp 390–401
Rodeghero P, Jiang S, Armaly A, McMillan C (2017) Detecting user story infor-
mation in developer-client conversations to generate extractive summaries. In:
2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE),
IEEE, pp 49–59
Roehm T, Tiarks R, Koschke R, Maalej W (2012) How do professional develop-
ers comprehend software? In: 2012 34th International Conference on Software
Engineering (ICSE), IEEE, pp 255–265
Rogers D, Hahn M (2010) Extended-connectivity fingerprints. Journal of Chemical
Information and Modeling 50(5):742–754
Rolínek M, Swoboda P, Zietlow D, Paulus A, Musil V, Martius G (2020) Deep
graph matching via blackbox differentiation of combinatorial solvers. In: Euro-
pean Conference on Computer Vision, Springer, pp 407–424
Rong Y, Bian Y, Xu T, Xie W, Wei Y, Huang W, Huang J (2020a) Self-supervised
graph transformer on large-scale molecular data. Advances in Neural Information
Processing Systems 33
Rong Y, Huang W, Xu T, Huang J (2020b) Dropedge: Towards deep graph convolu-
tional networks on node classification. In: International Conference on Learning
Representations
Rong Y, Xu T, Huang J, Huang W, Cheng H, Ma Y, Wang Y, Derr T, Wu L, Ma T
(2020c) Deep graph learning: Foundations, advances and applications. In: Pro-
ceedings of the 26th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining, ACM, Virtual Event, pp 3555–3556
Rossi A, Barbosa D, Firmani D, Matinata A, Merialdo P (2021) Knowledge graph
embedding for link prediction: A comparative analysis. ACM Transactions on
Knowledge Discovery from Data (TKDD) 15(2):1–49
Rossi E, Chamberlain B, Frasca F, Eynard D, Monti F, Bronstein M (2020) Tem-
poral graph networks for deep learning on dynamic graphs. arXiv preprint
arXiv:200610637
Rotmensch M, Halpern Y, Tlimat A, Horng S, Sontag D (2017) Learning a health
knowledge graph from electronic medical records. Scientific reports 7(1):5994
Sutton RS, McAllester DA, Singh SP, Mansour Y (2000) Policy gradient methods
for reinforcement learning with function approximation. In: Advances in Neural
Information Processing Systems, pp 1057–1063
Swietojanski P, Li J, Renals S (2016) Learning hidden unit contributions for unsu-
pervised acoustic model adaptation. IEEE/ACM Transactions on Audio, Speech,
and Language Processing 24(8):1450–1463
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke
V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the
IEEE conference on computer vision and pattern recognition, pp 1–9
Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, Simonovic
M, Doncheva NT, Morris JH, Bork P, et al (2019) String v11: protein–protein
association networks with increased coverage, supporting functional discovery in
genome-wide experimental datasets. Nucleic acids research 47(D1):D607–D613
Takahashi T (2019) Indirect adversarial attacks via poisoning neighbors for graph
convolutional networks. In: 2019 IEEE International Conference on Big Data
(Big Data), IEEE, pp 1395–1400
Tang J, Wang K (2018) Personalized top-n sequential recommendation via convolu-
tional sequence embedding. In: Proceedings of the Eleventh ACM International
Conference on Web Search and Data Mining, pp 565–573
Tang J, Qu M, Mei Q (2015a) Pte: Predictive text embedding through large-scale
heterogeneous text networks. In: Proceedings of the 21th ACM SIGKDD inter-
national conference on knowledge discovery and data mining, pp 1165–1174
Tang J, Qu M, Wang M, Zhang M, Yan J, Mei Q (2015b) Line: Large-scale infor-
mation network embedding. In: Proceedings of the 24th international conference
on world wide web, pp 1067–1077
Tang R, Du M, Liu N, Yang F, Hu X (2020a) An embarrassingly simple approach for
trojan attack in deep neural networks. In: Proceedings of the 26th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining, pp 218–228
Tang X, Li Y, Sun Y, Yao H, Mitra P, Wang S (2020b) Transferring robustness
for graph neural network against poisoning attacks. In: Proceedings of the 13th
International Conference on Web Search and Data Mining, pp 600–608
Tao J, Lin J, Zhang S, Zhao S, Wu R, Fan C, Cui P (2019) Mvan: Multi-view atten-
tion networks for real money trading detection in online games. In: Proceedings
of the 25th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, pp 2536–2546
Tarlow D, Moitra S, Rice A, Chen Z, Manzagol PA, Sutton C, Aftandilian E (2020)
Learning to fix build errors with Graph2Diff neural networks. In: Proceedings of
the IEEE/ACM 42nd International Conference on Software Engineering Work-
shops, pp 19–20
Tate JG, Bamford S, Jubb HC, Sondka Z, Beare DM, Bindal N, Boutselakis H,
Cole CG, Creatore C, Dawson E, et al (2019) Cosmic: the catalogue of somatic
mutations in cancer. Nucleic acids research 47(D1):D941–D947
Te G, Hu W, Zheng A, Guo Z (2018) Rgcnn: Regularized graph cnn for point cloud
segmentation. In: Proceedings of the 26th ACM international conference on Mul-
timedia, pp 746–754
Wang J, Zheng VW, Liu Z, Chang KCC (2017b) Topological recurrent neural net-
work for diffusion prediction. In: 2017 IEEE International Conference on Data
Mining (ICDM), IEEE, pp 475–484
Wang J, Huang P, Zhao H, Zhang Z, Zhao B, Lee DL (2018b) Billion-scale com-
modity embedding for e-commerce recommendation in alibaba. In: Proceedings
of the 24th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, pp 839–848
Wang J, Oh J, Wang H, Wiens J (2018c) Learning credible models. In: Proceedings
of the 24th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, pp 2417–2426
Wang J, Luo M, Suya F, Li J, Yang Z, Zheng Q (2020c) Scalable attack on graph data
by injecting vicious nodes. Data Mining and Knowledge Discovery 34(5):1363–
1389
Wang K, Singh R, Su Z (2018d) Dynamic neural program embeddings for program
repair. In: International Conference on Learning Representations
Wang M, Liu M, Liu J, Wang S, Long G, Qian B (2017c) Safe medicine recommen-
dation via medical knowledge graph embedding. arXiv preprint arXiv:171005980
Wang M, Yu L, Zheng D, Gan Q, Gai Y, Ye Z, Li M, Zhou J, Huang Q, Ma C,
Huang Z, Guo Q, Zhang H, Lin H, Zhao J, Li J, Smola AJ, Zhang Z (2019f)
Deep graph library: Towards efficient and scalable deep learning on graphs. In-
ternational Conference on Learning Representations Workshop on Representa-
tion Learning on Graphs and Manifolds
Wang M, Lin Y, Lin G, Yang K, Wu XM (2020d) M2grl: A multi-task multi-view
graph representation learning framework for web-scale recommender systems. In:
Proceedings of the 26th ACM SIGKDD International Conference on Knowledge
Discovery & Data Mining, pp 2349–2358
Wang Q, Mao Z, Wang B, Guo L (2017d) Knowledge graph embedding: A sur-
vey of approaches and applications. IEEE Transactions on Knowledge and Data
Engineering 29(12):2724–2743
Wang Q, Li M, Wang X, Parulian N, Han G, Ma J, Tu J, Lin Y, Zhang H, Liu W, et al
(2020e) Covid-19 literature knowledge graph construction and drug repurposing
report generation. arXiv preprint arXiv:200700576
Wang R, Yan J, Yang X (2019g) Learning combinatorial embedding networks for
deep graph matching. In: Proceedings of the IEEE/CVF International Conference
on Computer Vision, pp 3056–3065
Wang R, Zhang T, Yu T, Yan J, Yang X (2020f) Combinatorial learning of graph
edit distance via dynamic embedding. arXiv preprint arXiv:201115039
Wang S, He L, Cao B, Lu CT, Yu PS, Ragin AB (2017e) Structural deep brain
network mining. In: Proceedings of the 23rd ACM SIGKDD International Con-
ference on Knowledge Discovery and Data Mining, pp 475–484
Wang S, Tang J, Aggarwal C, Chang Y, Liu H (2017f) Signed network embedding
in social media. In: Proceedings of the 2017 SIAM international conference on
data mining, SIAM, pp 327–335
Wang S, Chen Z, Li D, Li Z, Tang LA, Ni J, Rhee J, Chen H, Yu PS (2019h) At-
tentional heterogeneous graph neural network: Application to program reiden-
Xie L, Yuille A (2017) Genetic cnn. In: Proceedings of the IEEE International Con-
ference on Computer Vision, pp 1379–1388
Xie M, Yin H, Wang H, Xu F, Chen W, Wang S (2016) Learning graph-based poi
embedding for location-based recommendation. In: Proceedings of the 25th ACM
International on Conference on Information and Knowledge Management, Asso-
ciation for Computing Machinery, CIKM ’16, p 15–24, DOI 10.1145/2983323.
2983711
Xie S, Kirillov A, Girshick R, He K (2019a) Exploring randomly wired neural
networks for image recognition. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision, pp 1284–1293
Xie T, Grossman JC (2018) Crystal graph convolutional neural networks for an ac-
curate and interpretable prediction of material properties. Physical Review Letters
120:145,301
Xie Y, Xu Z, Wang Z, Ji S (2021) Self-supervised learning of graph neural networks:
A unified review. arXiv preprint arXiv:210210757
Xie Z, Lv W, Huang S, Lu Z, Du B, Huang R (2019b) Sequential graph neural
network for urban road traffic speed prediction. IEEE Access 8:63,349–63,358
Xiu H, Yan X, Wang X, Cheng J, Cao L (2020) Hierarchical graph matching net-
work for graph similarity computation. arXiv preprint arXiv:200616551
Xu D, Zhu Y, Choy CB, Fei-Fei L (2017a) Scene graph generation by iterative
message passing. In: Proceedings of the IEEE conference on computer vision
and pattern recognition, pp 5410–5419
Xu D, Cheng W, Luo D, Liu X, Zhang X (2019a) Spatio-temporal attentive rnn for
node classification in temporal attributed graphs. In: International Joint Confer-
ence on Artificial Intelligence, pp 3947–3953
Xu D, Ruan C, Korpeoglu E, Kumar S, Achan K (2020a) Inductive representation
learning on temporal graphs. In: International Conference on Learning Represen-
tations
Xu H, Jiang C, Liang X, Li Z (2019b) Spatial-aware graph relation network for
large-scale object detection. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp 9298–9307
Xu J, Gan Z, Cheng Y, Liu J (2020b) Discourse-aware neural extractive text sum-
marization. In: Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pp 5021–5031
Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y
(2015) Show, attend and tell: Neural image caption generation with visual atten-
tion. In: International conference on machine learning, PMLR, pp 2048–2057
Xu K, Li C, Tian Y, Sonobe T, Kawarabayashi K, Jegelka S (2018a) Representation
learning on graphs with jumping knowledge networks. In: International Confer-
ence on Machine Learning, pp 5453–5462
Xu K, Wu L, Wang Z, Feng Y, Sheinin V (2018b) Sql-to-text generation with graph-
to-sequence model. arXiv preprint arXiv:180905255
Xu K, Wu L, Wang Z, Feng Y, Witbrock M, Sheinin V (2018c) Graph2seq:
Graph to sequence learning with attention-based neural networks. arXiv preprint
arXiv:180400823
You J, Ying R, Leskovec J (2019) Position-aware graph neural networks. In: Inter-
national Conference on Machine Learning, PMLR, pp 7134–7143
You J, Ying Z, Leskovec J (2020a) Design space for graph neural networks. Ad-
vances in Neural Information Processing Systems 33
You J, Gomes-Selman J, Ying R, Leskovec J (2021) Identity-aware graph neural
networks. CoRR abs/2101.10320
You Y, Chen T, Sui Y, Chen T, Wang Z, Shen Y (2020b) Graph contrastive learning
with augmentations. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin
H (eds) Advances in Neural Information Processing Systems, Curran Associates,
Inc., vol 33, pp 5812–5823
You Y, Chen T, Wang Z, Shen Y (2020c) When does self-supervision help graph
convolutional networks? In: International Conference on Machine Learning,
PMLR, pp 10,871–10,880
You ZH, Chan KCC, Hu P (2015a) Predicting protein-protein interactions from
primary protein sequences using a novel multi-scale local feature representation
scheme and the random forest. PLOS ONE 10:1–19
You ZH, Li J, Gao X, et al (2015b) Detecting protein-protein interactions with a
novel matrix-based protein sequence representation and support vector machines.
BioMed Research International 2015:1–9
Yu B, Yin H, Zhu Z (2018a) Spatio-temporal graph convolutional networks: a deep
learning framework for traffic forecasting. In: Proceedings of the 27th Interna-
tional Joint Conference on Artificial Intelligence, pp 3634–3640
Yu D, Fu J, Mei T, Rui Y (2017a) Multi-level attention networks for visual ques-
tion answering. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp 4709–4717
Yu D, Zhang R, Jiang Z, Wu Y, Yang Y (2021a) Graph-revised convolutional net-
work. In: Hutter F, Kersting K, Lijffijt J, Valera I (eds) Machine Learning and
Knowledge Discovery in Databases, Springer International Publishing, Cham, pp
378–393
Yu H, Wu Z, Wang S, Wang Y, Ma X (2017b) Spatiotemporal recurrent con-
volutional networks for traffic prediction in transportation networks. Sensors
17(7):1501
Yu J, Lu Y, Qin Z, Zhang W, Liu Y, Tan J, Guo L (2018b) Modeling text with
graph convolutional network for cross-modal information retrieval. In: Pacific
Rim Conference on Multimedia, Springer, pp 223–234
Yu L, Du B, Hu X, Sun L, Han L, Lv W (2021b) Deep spatio-temporal graph con-
volutional network for traffic accident prediction. Neurocomputing 423:135–147
Yu T, Wang R, Yan J, Li B (2020) Learning deep graph matching with channel-
independent embedding and hungarian attention. In: International conference on
learning representations
Yu Y, Chen J, Gao T, Yu M (2019a) Dag-gnn: Dag structure learning with graph
neural networks. In: International Conference on Machine Learning, pp 7154–
7163
Yu Y, Wang Y, Xia Z, Zhang X, Jin K, Yang J, Ren L, Zhou Z, Yu D, Qing T, et al
(2019b) Premedkb: an integrated precision medicine knowledgebase for inter-
preting relationships between diseases, genes, variants and drugs. Nucleic acids
research 47(D1):D1090–D1101
Yuan F, He X, Karatzoglou A, Zhang L (2020a) Parameter-efficient transfer from
sequential behaviors for user modeling and recommendation. In: Proceedings of
the 43rd International ACM SIGIR Conference on Research and Development in
Information Retrieval, pp 1469–1478
Yuan H, Tang J, Hu X, Ji S (2020b) Xgnn: Towards model-level explanations of
graph neural networks. In: Proceedings of the 26th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, pp 430–438
Yuan J, Zheng Y, Zhang C, Xie W, Xie X, Sun G, Huang Y (2010) T-drive: driving
directions based on taxi trajectories. In: Proceedings of the 18th SIGSPATIAL
International conference on advances in geographic information systems, pp 99–
108
Yuan J, Zheng Y, Xie X (2012) Discovering regions of different functions in a city
using human mobility and pois. In: Proceedings of the 18th ACM SIGKDD in-
ternational conference on Knowledge discovery and data mining, pp 186–194
Yuan Y, Liang X, Wang X, Yeung DY, Gupta A (2017) Temporal dynamic graph
lstm for action-driven video object detection. In: Proceedings of the IEEE inter-
national conference on computer vision, pp 1801–1810
Yuan Z, Zhou X, Yang T (2018) Hetero-convlstm: A deep learning approach to
traffic accident prediction on heterogeneous spatio-temporal data. In: Proceedings
of the 24th ACM SIGKDD International Conference on Knowledge Discovery &
Data Mining, pp 984–992
Yue-Hei Ng J, Hausknecht M, Vijayanarasimhan S, Vinyals O, Monga R, Toderici
G (2015) Beyond short snippets: Deep networks for video classification. In: Pro-
ceedings of the IEEE conference on computer vision and pattern recognition, pp
4694–4702
Yun S, Jeong M, Kim R, Kang J, Kim HJ (2019) Graph transformer networks. Ad-
vances in Neural Information Processing Systems 32:11,983–11,993
Zaheer M, Kottur S, Ravanbakhsh S, Poczos B, Salakhutdinov RR, Smola AJ (2017)
Deep sets. In: Advances in Neural Information Processing Systems, pp 3391–
3401
Zanfir A, Sminchisescu C (2018) Deep learning of graph matching. In: Proceedings
of the IEEE conference on computer vision and pattern recognition, pp 2684–
2693
Zelnik-Manor L, Perona P (2004) Self-tuning spectral clustering. Advances in neu-
ral information processing systems 17:1601–1608
Zeng H, Zhou H, Srivastava A, Kannan R, Prasanna V (2020a) Graphsaint: Graph
sampling based inductive learning method. In: International Conference on
Learning Representations
Zeng R, Huang W, Tan M, Rong Y, Zhao P, Huang J, Gan C (2019) Graph
convolutional networks for temporal action localization. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision, pp 7094–7103
Zhang J, Wang X, Zhang H, Sun H, Wang K, Liu X (2019f) A novel neural source
code representation based on abstract syntax tree. In: 2019 IEEE/ACM 41st In-
ternational Conference on Software Engineering (ICSE), IEEE, pp 783–794
Zhang J, Zhang H, Xia C, Sun L (2020a) Graph-bert: Only attention is needed for
learning graph representations. arXiv preprint arXiv:200105140
Zhang L, Lu H (2020) A Feature-Importance-Aware and Robust Aggregator for
GCN. In: ACM International Conference on Information & Knowledge Manage-
ment, DOI 10.1145/3340531.3411983
Zhang M, Chen Y (2018a) Link prediction based on graph neural networks. In:
Advances in Neural Information Processing Systems, pp 5165–5175
Zhang M, Chen Y (2018b) Link prediction based on graph neural networks. In: Pro-
ceedings of the 32nd International Conference on Neural Information Processing
Systems, pp 5171–5181
Zhang M, Chen Y (2019) Inductive matrix completion based on graph neural net-
works. In: International Conference on Learning Representations
Zhang M, Chen Y (2020) Inductive matrix completion based on graph neural net-
works. In: International Conference on Learning Representations
Zhang M, Schmitt-Ulms G, Sato C, Xi Z, Zhang Y, Zhou Y, St George-Hyslop P,
Rogaeva E (2016c) Drug repositioning for alzheimer’s disease based on system-
atic ‘omics’ data mining. PLoS ONE 11(12):e0168812
Zhang M, Cui Z, Neumann M, Chen Y (2018f) An end-to-end deep learning archi-
tecture for graph classification. In: Association for the Advancement of Artificial
Intelligence
Zhang M, Cui Z, Neumann M, Chen Y (2018g) An end-to-end deep learning ar-
chitecture for graph classification. In: the AAAI Conference on Artificial Intelli-
gence, pp 4438–4445
Zhang M, Hu L, Shi C, Wang X (2020b) Adversarial label-flipping attack and de-
fense for graph neural networks. In: 2020 IEEE International Conference on Data
Mining (ICDM), IEEE, pp 791–800
Zhang M, Li P, Xia Y, Wang K, Jin L (2020c) Revisiting graph neural networks for
link prediction. arXiv preprint arXiv:201016103
Zhang N, Deng S, Li J, Chen X, Zhang W, Chen H (2020d) Summarizing chinese
medical answer with graph convolution networks and question-focused dual at-
tention. In: Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing: Findings, pp 15–24
Zhang Q, Chang J, Meng G, Xiang S, Pan C (2020e) Spatio-temporal graph struc-
ture learning for traffic forecasting. In: Proceedings of the AAAI Conference on
Artificial Intelligence, vol 34, pp 1177–1185
Zhang R, Isola P, Efros AA (2016d) Colorful image colorization. In: European con-
ference on computer vision, Springer, pp 649–666
Zhang S, Hu Z, Subramonian A, Sun Y (2020f) Motif-driven contrastive learning of graph representations. arXiv preprint arXiv:2012.12533
Zhang W, Tang S, Cao Y, Pu S, Wu F, Zhuang Y (2019g) Frame augmented al-
ternating attention network for video question answering. IEEE Transactions on
Multimedia 22(4):1032–1041
References 683
Zhang Y, Guo Z, Teng Z, Lu W, Cohen SB, Liu Z, Bing L (2020c) Lightweight, dynamic graph convolutional networks for AMR-to-text generation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 2162–2172
Zhang Y, Yu X, Cui Z, Wu S, Wen Z, Wang L (2020d) Every document owns its
structure: Inductive text classification via graph neural networks. In: Proceedings
of the 58th Annual Meeting of the Association for Computational Linguistics, pp
334–339
Zhang Z, Wang M, Xiang Y, Huang Y, Nehorai A (2018i) RetGK: Graph kernels based on return probabilities of random walks. In: Advances in Neural Information Processing Systems, pp 3964–3974
Zhang Z, Cui P, Zhu W (2020e) Deep learning on graphs: A survey. IEEE Transactions on Knowledge and Data Engineering pp 1–1, DOI 10.1109/TKDE.2020.2981333
Zhang Z, Zhang Z, Zhou Y, Shen Y, Jin R, Dou D (2020f) Adversarial attacks on
deep graph matching. Advances in Neural Information Processing Systems 33
Zhang Z, Zhao Z, Lin Z, Huai B, Yuan NJ (2020g) Object-aware multi-branch relation networks for spatio-temporal video grounding. arXiv preprint arXiv:2008.06941
Zhang Z, Zhao Z, Zhao Y, Wang Q, Liu H, Gao L (2020h) Where does it exist: Spatio-temporal video grounding for multi-form sentences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 10668–10677
Zhang Z, Zhuang F, Zhu H, Shi Z, Xiong H, He Q (2020i) Relational graph neural
network with hierarchical attention for knowledge graph completion. In: Proceed-
ings of the AAAI Conference on Artificial Intelligence, vol 34, pp 9612–9619
Zhao H, Du L, Buntine W (2017) Leveraging node attributes for incomplete rela-
tional data. In: International Conference on Machine Learning, pp 4072–4081
Zhao H, Zhou Y, Song Y, Lee DL (2019a) Motif enhanced recommendation over
heterogeneous information network. In: Proceedings of the 28th ACM interna-
tional conference on information and knowledge management, pp 2189–2192
Zhao H, Wei L, Yao Q (2020a) Simplifying architecture search for graph neural
network. In: Conrad S, Tiddi I (eds) Proceedings of the CIKM 2020 Workshops
co-located with 29th ACM International Conference on Information and Knowl-
edge Management (CIKM 2020), Galway, Ireland, October 19-23, 2020, CEUR-
WS.org, CEUR Workshop Proceedings, vol 2699
Zhao J, Zhou Z, Guan Z, Zhao W, Ning W, Qiu G, He X (2019b) IntentGC: a scalable graph convolution framework fusing heterogeneous information for recommendation. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp 2347–2357
Zhao J, Wang X, Shi C, Liu Z, Ye Y (2020b) Network schema preserving hetero-
geneous information network embedding. In: Bessiere C (ed) Proceedings of the
Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20,
International Joint Conferences on Artificial Intelligence Organization, pp 1366–
1372
Zheng X, Aragam B, Ravikumar PK, Xing EP (2018b) DAGs with NO TEARS: Continuous optimization for structure learning. Advances in Neural Information Processing Systems 31:9472–9483
Zheng Y, Liu F, Hsieh HP (2013) U-Air: When urban air quality inference meets big data. In: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 1436–1444
Zheng Y, Capra L, Wolfson O, Yang H (2014) Urban computing: Concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology 5(3), DOI 10.1145/2629592
Zhou C, Liu Y, Liu X, Liu Z, Gao J (2017) Scalable graph embedding for asymmet-
ric proximity. In: Proceedings of the AAAI Conference on Artificial Intelligence,
vol 31
Zhou C, Bai J, Song J, Liu X, Zhao Z, Chen X, Gao J (2018a) ATRank: An attention-based user behavior modeling framework for recommendation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 32
Zhou C, Ma J, Zhang J, Zhou J, Yang H (2020a) Contrastive learning for debiased candidate generation in large-scale recommender systems. arXiv preprint arXiv:2005.12964
Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B (2004) Learning with lo-
cal and global consistency. Advances in neural information processing systems
16(16):321–328
Zhou F, De la Torre F (2012) Factorized graph matching. In: 2012 IEEE Conference
on Computer Vision and Pattern Recognition, IEEE, pp 127–134
Zhou G, Zhu X, Song C, Fan Y, Zhu H, Ma X, Yan Y, Jin J, Li H, Gai K (2018b) Deep interest network for click-through rate prediction. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp 1059–1068
Zhou G, Wang J, Zhang X, Guo M, Yu G (2020b) Predicting functions of maize
proteins using graph convolutional network. BMC Bioinformatics 21(16):420
Zhou J, Cui G, Zhang Z, Yang C, Liu Z, Sun M (2018c) Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434
Zhou K, Song Q, Huang X, Hu X (2019a) Auto-GNN: Neural architecture search of graph neural networks. arXiv preprint arXiv:1909.03184
Zhou K, Dong Y, Wang K, Lee WS, Hooi B, Xu H, Feng J (2020c) Understanding and resolving performance degradation in graph convolutional networks. arXiv preprint arXiv:2006.07107
Zhou K, Huang X, Li Y, Zha D, Chen R, Hu X (2020d) Towards deeper graph
neural networks with differentiable group normalization. In: Advances in Neural
Information Processing Systems, vol 33
Zhou K, Song Q, Huang X, Zha D, Zou N, Hu X (2020e) Multi-channel graph
neural networks. In: International Joint Conference on Artificial Intelligence, pp
1352–1358
Zhou N, Jiang Y, Bergquist TR, et al (2019b) The CAFA challenge reports im-
proved protein function prediction and new functional annotations for hun-
dreds of genes through experimental screens. Genome Biology 20(1), DOI
10.1186/s13059-019-1835-8
Zhou T, Lü L, Zhang YC (2009) Predicting missing links via local information. The
European Physical Journal B 71(4):623–630
Zhou Y, Tuzel O (2018) VoxelNet: End-to-end learning for point cloud based 3D object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4490–4499
Zhou Y, Hou Y, Shen J, Huang Y, Martin W, Cheng F (2020f) Network-based drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2. Cell Discovery 6(1):1–18
Zhou Z, Kearnes S, Li L, Zare RN, Riley P (2019c) Optimization of molecules via deep reinforcement learning. Scientific Reports 9(1):1–10
Zhou Z, Wang Y, Xie X, Chen L, Liu H (2020g) RiskOracle: A minute-level citywide traffic accident forecasting framework. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 34, pp 1258–1265
Zhou Z, Wang Y, Xie X, Chen L, Zhu C (2020h) Foresee urban sparse traffic ac-
cidents: A spatiotemporal multi-granularity perspective. IEEE Transactions on
Knowledge and Data Engineering pp 1–1, DOI 10.1109/TKDE.2020.3034312
Zhu D, Cui P, Wang D, Zhu W (2018) Deep variational network embedding in
wasserstein space. In: Proceedings of the 24th ACM SIGKDD International Con-
ference on Knowledge Discovery & Data Mining, pp 2827–2836
Zhu D, Zhang Z, Cui P, Zhu W (2019a) Robust graph convolutional networks against adversarial attacks. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery, KDD '19, pp 1399–1407, DOI 10.1145/3292500.3330851
Zhu J, Li J, Zhu M, Qian L, Zhang M, Zhou G (2019b) Modeling graph structure in
transformer for better AMR-to-text generation. In: Proceedings of the 2019 Con-
ference on Empirical Methods in Natural Language Processing and the 9th Inter-
national Joint Conference on Natural Language Processing (EMNLP-IJCNLP),
Association for Computational Linguistics, Hong Kong, China, pp 5459–5468
Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using
cycle-consistent adversarial networks. In: Proceedings of the IEEE international
conference on computer vision, pp 2223–2232
Zhu Q, Du B, Yan P (2020a) Self-supervised training of graph convolutional networks. arXiv preprint arXiv:2006.02380
Zhu R, Zhao K, Yang H, Lin W, Zhou C, Ai B, Li Y, Zhou J (2019c) AliGraph: a comprehensive graph neural network platform. Proceedings of the VLDB Endowment 12(12):2094–2105
Zhu S, Yu K, Chi Y, Gong Y (2007) Combining content and link for classification
using matrix factorization. In: Proceedings of the 30th annual international ACM
SIGIR conference on Research and development in information retrieval, pp 487–
494
Zhu S, Zhou C, Pan S, Zhu X, Wang B (2019d) Relation structure-aware hetero-
geneous graph neural network. In: 2019 IEEE International Conference on Data
Mining (ICDM), IEEE, pp 1534–1539
Zhu X (2002) Learning from labeled and unlabeled data with label propagation. Technical report
Zhu Y, Elemento O, Pathak J, Wang F (2019e) Drug knowledge bases and their
applications in biomedical informatics research. Briefings in bioinformatics
20(4):1308–1321
Zhu Y, Che C, Jin B, Zhang N, Su C, Wang F (2020b) Knowledge-driven drug
repurposing using a comprehensive drug knowledge graph. Health Informatics
Journal 26(4):2737–2750
Zhu Y, Xu Y, Yu F, Liu Q, Wu S, Wang L (2020c) Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131
Zhu Y, Xu Y, Yu F, Liu Q, Wu S, Wang L (2021) Graph Contrastive Learning with
Adaptive Augmentation. In: Proceedings of The Web Conference 2021, ACM,
WWW ’21
Zhuang Y, Jain R, Gao W, Ren L, Aizawa K (2017) Panel: cross-media intelligence. In: Proceedings of the 25th ACM international conference on Multimedia, p 1173
Zimmermann T, Zeller A, Weissgerber P, Diehl S (2005) Mining version histories to
guide software changes. IEEE Transactions on Software Engineering 31(6):429–
445
Zitnik M, Leskovec J (2017) Predicting multicellular function through multi-layer
tissue networks. Bioinformatics 33(14):i190–i198
Zitnik M, Agrawal M, Leskovec J (2018) Modeling polypharmacy side effects with
graph convolutional networks. Bioinformatics 34(13):i457–i466
Zoete V, Cuendet MA, Grosdidier A, Michielin O (2011) SwissParam: A fast
force field generation tool for small organic molecules. Journal of Computational
Chemistry 32(11):2359–2368
Zoph B, Le QV (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578
Zoph B, Yuret D, May J, Knight K (2016) Transfer learning for low-resource neu-
ral machine translation. In: Proceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing, pp 1568–1575
Zoph B, Vasudevan V, Shlens J, Le QV (2018) Learning transferable architectures
for scalable image recognition. In: Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pp 8697–8710
Zügner D, Günnemann S (2019) Adversarial attacks on graph neural networks via
meta learning. In: International Conference on Learning Representations, ICLR
Zügner D, Günnemann S (2019) Certifiable robustness and robust training for graph
convolutional networks. In: Proceedings of the 25th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, pp 246–256
Zügner D, Günnemann S (2020) Certifiable robustness of graph convolutional networks under structure perturbations. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery, KDD '20, pp 1656–1665, DOI 10.1145/3394486.3403217
Zügner D, Akbarnejad A, Günnemann S (2018) Adversarial attacks on neural net-
works for graph data. In: Proceedings of the 24th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, pp 2847–2856